# quantitative_convergences_of_lie_group_momentum_optimizers__031f205a.pdf

Quantitative Convergences of Lie Group Momentum Optimizers

Lingkai Kong School of Mathematics Georgia Institute of Technology lkong75@gatech.edu

Molei Tao School of Mathematics Georgia Institute of Technology mtao@gatech.edu

Explicit, momentum-based dynamics that optimize functions defined on Lie groups can be constructed via variational optimization and momentum trivialization. Structure preserving time discretizations can then turn this dynamics into optimization algorithms. This article investigates two types of discretization, Lie Heavy-Ball, which is a known splitting scheme, and Lie NAG-SC, which is newly proposed. Their convergence rates are explicitly quantified under L-smoothness and local strong convexity assumptions. Lie NAG-SC provides acceleration over the momentumless case, i.e. Riemannian gradient descent, but Lie Heavy-Ball does not. When compared to existing accelerated optimizers for general manifolds, both Lie Heavy-Ball and Lie NAG-SC are computationally cheaper and easier to implement, thanks to their utilization of group structure. Only gradient oracle and exponential map are required, but not logarithm map or parallel transport which are computational costly.1

1 Introduction

First-order optimization, i.e., with the gradient of the potential (i.e. the objective function) given as the oracle, is ubiquitously employed in machine learning. Within this class of algorithms, momentum is often introduced to accelerate convergence; for example, in Euclidean setups, it has been proved to yield the optimal dimension-independent convergence order, for a large class of first-order optimizers, under strongly convex and L-smooth assumptions of the objective function[21, Sec. 2].

Gradient Descent (GD) without momentum can be generalized to Riemannian manifold by moving to the negative gradient direction using the exponential map with a small step size. This generalization is algorithmically straight forward, but quantifying the convergence rate in curved spaces needs nontrivial efforts [6, 32, 27]. In comparison, generalizing momentum GD to manifold itself is nontrivial due to curved geometry; for example, the iteration must be kept on the manifold and the momentum must stay in the tangent space, which changes with the iteration, at the same time. It is even more challenging to quantify the convergence rate theoretically due to the loss of linearity in the space, leading to the lack of triangle inequality and cosine rule. Finally, there is not necessarily acceleration unless the generalization is done delicately.

Regardless, optimization on manifolds is an important task, for which the manifold structure can either naturally come from the problem setup or be artificially introduced. A simple but extremely important example is to compute the leading eigenvalues of a large matrix, which can be approached efficiently via optimization on the Stiefel manifold [19]; a smaller scale version can also be solved via optimization on SO(n) [7, 29, 8, 26]. More on the modern machine learning side, one can algorithmically add orthonormal constraints to deep learning models to improve their accuracy and robustness [e.g., 5, 9, 17, 19, 24]. Both examples involve SO(n), which is an instance of an important type of curved spaces called Lie groups.

1Code can be found at https://github.com/konglk1203/Accelerated_Optimizer_On_Lie_Group

38th Conference on Neural Information Processing Systems (Neur IPS 2024).

Lie groups are manifolds with additional group structure, and the nonlinearity of the space manifests through the non-commutative group multiplication. The group structure can help not only design momentum optimizer but also analyze its convergence. More precisely, this work will make the following contributions:

Provide the first quantitative analysis of Lie group momentum optimizers. This is significant, partly because there is no nontrivial convex functions on many Lie groups (see Rmk. 1), so we have to analyze nonconvex optimization.

Theoretically show an intuitively constructed momentum optimizer, namely Lie Heavy-Ball, may not yield accelerated convergence. Numerical evidence is also provided (Sec. 6).

Generalize techniques from Euclidean optimization to propose a Lie group optimizer, Lie NAG-SC, that provably has acceleration.

Comparing to other optimizers that are designed for general manifolds, we bypass the requirements for costly operations such as parallel transport [e.g., 4], which is a way to move the momentum between tangent spaces, and the computation of geodesic [e.g., 1], which may be an issue due to not only high computational cost but also its possible non-uniqueness.

1.1 Related work In Euclidean space, Gradient Descent (GD) for µ-strongly convex and L-smooth objective function can converge as xk x (1 C µ

L) k xk x0 2 upon appropriately chosen learning rate. Momentum can accelerate the convergence by softening the dependence on the condition number κ = L/µ. However, how momentum is introduced matters to achieving such an acceleration. For example, NAG-SC [21] has convergence rate 3 1 C µ

L, but Heavy-Ball [23] still has linear dependence on the conditional number, similar to gradient descent without momentum (but it may work better for nonconvex cases).

Remarkable quantitative results also existed for manifold optimization. The momentum-less case is relatively simpler, and [32], for example, developed convergence theory for GD on Riemannian manifold under various assumptions on convexity and smoothness, which matched the classical Euclidean result for instance, Thm. 15 of their paper gave a convergence rate of 1 C min{ µ

when the learning rate is 1/L, under geodesic-µ-strong-convexity and geodesic-L-smoothness. For the momentum case, [3] analyzed the convergence of a related dynamics in continuous time, namely an optimization ODE corresponding to momentum gradient flow, on Riemannian manifolds under both geodesically strongly and weakly convex potentials, based on a tool of modified cosine rule. However, numerical methods in discrete time that are easy and cheap to implement and provably convergent in an accelerated fashion under mild conditions are still under-developed. One existing idea is to transform a function on the manifold to a function on a Euclidean space by the logarithm function. More precisely, in the case where the logarithm is a one-to-one map from the manifold to the tangent space, it can be used to project the objective function on the manifold to a function on the tangent space ( pullback objective function ), enabling the usage of accelerated algorithms in Euclidean spaces [e.g., 11]. Although such analysis may relax the requirement of global convexity, it requires assumptions on the pullback objective function , which is hard to check in reality. Another series of seminal works include [33, 1], which analyzed the convergence of a class of optimizers by extending Nesterov s technique of estimating sequence [21] to Riemannian manifolds. They managed to show a convergence rate between 1 C µ

L and 1 C µ

L, i.e., with conditional number dependence inbetween that of GD with and without momentum in the Euclidean cases. They further proved that, as the iterate gets closer to the minimum, the rate bound gets better because it converges to 1 C µ

L. However, their algorithm requires the logarithm map (inverse of the exponential map), which may not be uniquely defined on many manifolds (e.g., sphere) and can be computationally expensive. In contrast, Lie NAG-SC, which we will construct, works only for Lie group manifolds, but they are

2All the constants C in this paper may be different case-by-case, but they are all independent of dimension and the condition number. 3For continuous dynamics, convergence rate means c if U(gt) U(g ) = O(e ct). For discrete algorithms, it refers to c if U(gk) U(g ) = O(ck). 4We will use K to denote constants that depend on the manifold structure, e.g., diameter, curvature. It may be different case-by-case.

more efficient when applicable, due to being based on only exponential map and gradient oracle. Acceleration of the same type will also be theoretically proved.

Momentum optimizers specializing in Lie groups have also been constructed before [26], where variational optimization [30] was generalized to the manifold setup and then left trivialization were employed to obtain ODEs with Euclidean momentum that perform optimization in continuous time. Our work is also based on those ODEs, whose time however has to be discretized so that an optimization algorithm can be constructed. Delicate discretizations have been proposed in [26] so that the optimization iterates stay exactly on the manifold, saving computational cost and reducing approximation errors. But we will further improve those discretizations. More precisely, note first that [26] slightly abused notation and called both the continuous dynamics and one discretization Lie NAG-SC. However, we find that their splitting discretization may not give the best optimizer at least, a 1st-order version of their splitting scheme yields a linear condition number dependence in our convergence rate bound. Since this splitting-based optimizer almost degenerates to heavy-ball in the special case of Euclidean spaces (as Rmk. 28 will show), we refine the terminology and call it Lie Heavy-Ball. To remedy the lack of acceleration, we propose a new discretization that actually has square root condition number dependence and thus provable acceleration, and call it as (the true) Lie NAG-SC.

Finally, note there can be obstructions to accelerated optimization, for example roughly when curvature is negative [16, 10, 12]. That is, however, not a contradiction because our setup involves positive curvature. 1.2 Main results We consider the local minimization of a differentiable function U G R, i.e., ming G U(g), where G is a finite-dimensional compact Lie group, and the oracle allowed is the differential of U.

Two optimizers we focus on are given in Alg. (1). Under assumptions of L-smoothness and locally

geodesic-µ-strong convexity, we proved that Lie Heavy-Ball has convergence rate (1 + C L

µ ) 1 , which

is approximately the convergence rate of 1 C min{ L

µ ,K} for Lie GD (Eq.1, which is identical to Riemannian GD applied to Lie groups); this is no acceleration. To accelerate, we propose a new Lie

NAG-SC algorithm, with provable convergence rate (1 + C min{

L µ ,K}) 1 . Note the condition number dependence becomes κ instead of κ = L/µ, hence acceleration.

For a summary of our main results, please see Table 1.

Algorithm 1: Momentum optimizer on Lie groups Parameter :step size h > 0, friction γ > 0, number of iterations N Initialization :g0 G, ξ0 = 0 Output :Local minimum of U for k = 0, ,N 1 do

if Heavy-Ball then

ξk+1 = (1 γh)ξk h Tgk Lgk 1 U(gk) if NAG-SC then

ξk+1 = (1 γh)ξk (1 γh)h(Tgk Lgk 1 U(gk) Tgk 1Lgk 1 1 U(gk 1)) h Tgk Lgk 1 U(gk) gk+1 = gk exp(hξk+1) end return g N

Remark 1 (Triviality of convex functions on Lie groups). We do not assume any global convexity of U. In fact, U has to be nonconvex for any meaningful optimization to happen. This is because we are considering compact Lie groups5, and a convex function on a connected compact manifold could only be a constant function [31]. An intuition for this is, a convex function on a closed geodesic must be constant. See [e.g., 18, Sec. B.3] for more discussions. Our analysis is, importantly, for nonconvex U, and convexity is only required locally to ensure a quantitative rate estimate can be obtained.

5Even if we consider noncompact Lie groups, many of them have compact Lie subgroups [20], and our argument would still hold.

Continuous dynamics Heavy-Ball NAG-SC Scheme Eq. (2) Eq. (9) Eq. (13) Step size h -

2L, 1 2p(a)}

Convergence rate c 2 3 µ (1 + µ 16L) 1 (1 + 1

2L, 1 2p(a)}) 1

(Modified) energy Eq. (5) Eq. (10) Eq. (41) 6

Lyapunov function Eq. (6) Eq. (12) Eq. (14) Main theorem Thm. 9 Thm. 13 Thm. 14 Table 1: A summary of main results 2 Preliminaries and setup

2.1 Lie group and Lie algebra

A Lie group, denoted by G, is a differentiable manifold with a group structure. A Lie algebra is a vector space with a bilinear, alternating binary operation that satisfies the Jacobi identity, known as Lie bracket. The tangent space at e (the identity element of the group) is a Lie algebra, denoted as g = Te G. The dimension of the Lie group G will be denoted by m. Assumption 2 (general geometry). We assume the Lie group G is finite-dimensional and compact.

One technique we will use to handle momentum is called left-trivialization: Left group multiplication Lg ˆg gˆg is a smooth map from the Lie group to itself and its tangent map Tˆg Lg Tˆg G Tgˆg G is a one-to-one map. As a result, for any g G, we can represent the vectors in Tg G by Te Lξ for ξ Te G. This operation is the left-trivialization. It comes from the group structure and may not exist for a general manifold. If the group is represented via an embedding to matrix group, i.e., g,ξ Rn n, then the left trivialization is simply given by Te Lgξ = gξ with the right-hand side given by matrix multiplication.

A Riemannian metric is required to take Riemannian gradient and we are considering a left-invariant metric: we first define an inner product , on g, which is a linear space, and then move it around by the differential of left multiplication, i.e., the inner product at Tg G is for η1,η2 Tg G, η1,η2 = Tg Lg 1η1,Tg Lg 1η2 .

2.2 Optimization dynamics Riemannian GD [e.g., 32] with iteration gk+1 = expgk( h U(gk)) can be employed to optimize U defined on G, where is Riemannian gradient, and exp G Tg G G is the exponential map. To see a connection to the common Euclidean GD, it means we start from gk and go to the direction of negative gradient with step size h to get gk+1 by geodesic instead of straight line. Riemannian GD can be understood as a time discretization of the Riemannian gradient flow dynamics g = U(g).

In the Lie group case, it is identical to the following Lie GD obtained from left-trivialization [26]:

gk+1 = gk expe(h Tgk Lgk 1 U(gk)), (1)

where Tgk Lgk 1 U(gk) g is the left-trivialized gradient. expe 7 is the exponential map staring at the group identity e following the Riemannian structure given by the left-invariant metric, and the operation between gk and exp(h Tgk Lgk 1 U(gk)) is the group multiplication. To accelerate its convergence, momentum was introduced to the Riemannian gradient flow via variational optimization and left-trivialization [26], leading to the following dynamics:

{ g = Te Lgξ ξ = γ(t)ξ + ad ξ ξ Tg Lg 1 U(g) (2)

Here g(t) G is the position variable. g is the standard momentum variable even though it should really be called velocity. It lives Tg(t)G, which varies as g(t) changes in time, and we will utilize group structure to avoid this complication. More precisely, the dynamics lets the momentum g be Te Lgξ, and ξ is therefore Tg Lg 1 g and it is our new, left-trivialized momentum. Intuitively, one can

6The monotonicity of this energy function requires smaller step size than it listed in this table. See discussion in Rmk. 36 and the details are provided in Sec. D.1 7The group exponential map and the exponential map from Riemannian structure can be different [18, Sec. D.1]. However, under our choice of the left-invariant metric later in Lemma 3, they are identical and exp will be the group exponential unless further specified.

think ξ as angular momentum, and Te Lgξ being gξ is position times angular momentum, which is momentum. Similar to the Lie GD Eq. (1), we will not use U(g) directly, but its left-trivialization Tg Lg 1( U(g)), to update the left-trivialized momentum.

This dynamics essentially models a damped mechanical system, and Tao and Ohsawa [26] proved this ODE converges to a local minimum of U using the fact that the total energy (kinetic energy 1 2 ξ,ξ plus potential energy U) is drained by the friction term γξ. In general, γ can be a positive time-dependent function (e.g., for optimizing convex but not strongly-convex functions), but for simplicity, we will only consider locally strong-convex potentials, and constant γ is enough.

For curved space, an additional term ad ξ ξ that vanishes in Euclidean space shows up in Eq. (2). It could be understood as a generalization of Coriolis force that accounts for curved geometry and is needed for free motion. The adjoint operator ad g g g is defined by ad X Y = [X,Y ]. Its dual, known as the coadjoint operator ad g g g, is given by ad X Y,Z = Y,ad X Z , Z g.

2.3 Property of Lie groups with ad skew-adjoint The term ad ξ ξ in the optimization ODE (2) is a quadratic term and it will make the numerical discretization that will be considered later difficult. Another complication from this term is, it depends on the Riemannian metric, and indicates an inconsistency between the Riemannian structure and the group structure, i.e., the exponential map from the Riemannian structure is different from the exponential map from the group structure. Fortunately, on a compact Lie group, the following lemma shows a special metric on g can be chosen to make the term ad ξ ξ vanish. Lemma 3 (ad skew-adjoint [20]). Under Assumption 2, there exists an inner product on g such that the operator ad is skew-adjoint, i.e., ad ξ = adξ for any ξ g.

This special inner product will also give other properties useful in our technical proofs; see Sec. A.1.

2.4 Assumption on potential function To show convergence and quantify its rate for the discrete algorithm, some smoothness assumption is needed. We define the L-smoothness on a Lie group as the following. Definition 4 (L-smoothness). A function U G R is L-smooth if and only if g, ˆg G,

Tˆg Lˆg 1 U(ˆg) Tg Lg 1 U(g) Ld(ˆg,g) (3) where d is the geodesic distance.

Under the choice of metric in Lemma 3 that ad is skew-adjoint, Lemma 21 shows this is same as the commonly used geodesic-L-smoothness (Def. 20).

To provide an explicit convergence rate, some convex assumption on the objective function is usually needed. Under the assumption of unique geodesic on a geodesically convex set S G, the definition of strongly convex functions in Euclidean spaces can be generalized to Lie groups: Definition 5 (Locally geodesically strong convexity). A function U G R is locally geodesic-µstrongly convex at g if and only if there exists a geodesically convex neighbourhood of g , denoted by S, such that g, ˆg S,

U(g) U(ˆg) Tˆg Lˆg 1 U(ˆg),log ˆg 1g + µ

2 log ˆg 1g 2 (4)

where log is well-defined due to the geodesic convexity of S.

3 Convergence of the optimization ODE in continuous time

To start, we provide a convergence analysis of the ODE (2), since our numerical scheme comes from its time discretization. We do not claim such convergence analysis for the ODE is new, and in fact, convergence for continuous dynamics has been provided on general manifolds [e.g., 3]. However, we will prove it using our technique to be self-contained and provide some insights for the convergence analysis of the discrete algorithm later.

Define the total energy EODE G g R as

EODE(g,ξ) = U(g) + 1

i.e., the total energy is the sum of the potential energy and the kinetic energy. Thanks to the friction γ, the total energy is monotonely decreasing, which provides global convergence to a stationary point.

Theorem 6 (Monotonely decreasing of total energy [26]). Suppose the potential function U C1(G) and the trajectory (g(t),ξ(t)) follows ODE (2). Then

d dt EODE(g(t),ξ(t)) = γ ξ 2

Thm. 6 provides the global convergence of ODE (2) to a stationary point under only C1 smoothness: when the system converges, we have ξ = 0, which gives U(g ) = 0.

Moreover, using the non-increasing property of total energy, the following corollary states that if the particle starts with small initial energy, it will be trapped in a sub-level set of U. The local potential well can be defined using U s sub-level set. Definition 7 (u sub-level set). Given u R, we define the u sub-level set of U as

{g G U(g) u} = i 0 Si

i.e. a disjoint union of connected components. Corollary 8. Suppose U C1(G). Let u = EODE(g(0),ξ(0)). If the u sub-level set of U is i 0 Si and g(0) S0, then we have g(t) S0, t 0.

Under further assumption of local strong convexity on this sub-level set, convergence rate can be quantified via a Lyapunov analysis inspired by [25]. More specifically, given a fixed local minimum g , there is provably a local unique geodesic convex neighbourhood of g . Denote it by S, and we define LODE on S by

LODE(g,ξ) = U(g) U(g ) + 1

4 γ log g 1 g + ξ 2 (6)

By assuming the local geodesic-µ-strong convexity of U on S, we have the following quantification of Eq. (2). Theorem 9 (Convergence rate of the optimization ODE). If the initial condition (g0,ξ0) satisfies that g0 S for some geodesically convex set S G, U C1(G) is locally geodesic-µ-convex on S, and the u sub-level set of U with u = EODE(g0,ξ0) satisfies S0 S, then we have

U(g(t)) U(g ) e c ODEt LODE(g0,ξ0) (7)

with c ODE = 2

3 µ by choosing γ = 2 µ. Remark 10. This theorem alone is a local convergence result and a {Lie group + momentum} extension of an intuitive result for Euclidean gradient flow, which is, if the initial condition is close enough to a minimizer and the objective function has a positive definite Hessian at that minimizer, then gradient flow converges exponentially fast to that minimizer. However, Thm.6 already ensures global convergence, and if not stuck at a saddle point, the dynamics will eventually enter some local potential well. If that potential well is locally strongly convex at its minimizer, then the local convergence result (Thm.9) supersedes the global convergence result (which has no rate), and gives the asymptotic convergence rate. Note however that different initial conditions may lead to convergence to different potential wells (and hence minimizers), as usual.

4 Convergence of Lie Heavy-Ball/splitting discretization in discrete time One way to obtain a manifold optimization algorithm by time discretization of the ODE (2) is to split its vector field as the sum of two, and use them respectively to generate two ODEs:

{ g = Te Lgξ ξ = 0 { g = 0 ξ = γξ Tg Lg 1 U(g) (8)

Each ODE enjoys the feature that its solution stays exactly on G g [26], and therefore if one alternatively evolves them for time h, the result is a step-h time discretization that exactly respects the geometry (no projection needed). If one approximates exp( γh) by 1 hγ, then the same property holds, and the resulting optimizer is

{gk+1 = gk exp(hξk+1) ξk+1 = (1 γh)ξk h Tgk Lgk 1 U(gk) (9)

In Euclidean cases, such numerical scheme can be viewed as Polyak s Heavy-Ball algorithm after a change of variable (Rmk. 27), and will thus be referred to as Lie Heavy-Ball. It is also a 1st-order (in h) version of the 2nd-order Lie-NAG optimizer in [26] (Rmk. 28).

To analyze Lie Heavy-Ball s convergence, we again seek some energy function such that the iteration of the numerical scheme Eq. (9) will never escape a sub-level set of the potential, similar to the continuous case. Given fixed friction parameter γ and step size h, we define the modified energy EHB G g R as

EHB(g,ξ) = U(g) + (1 γh)2

Theorem 11 (Monotonely decreasing of modified energy of Heavy Ball). Assume the potential U is globally L-smooth. When the step size satisfies h γ γ2+L, we have the modified energy EHB is monotonely decreasing, i.e.,

EHB(gk,ξk) EHB(gk 1,ξk 1) γh ξk 2

Thm. 11 provides the global convergence of Heavy-Ball scheme Eq. (9) to a stationary point under only L-smoothness: Due to the monotonicity of the energy function EHB, the system will eventually converge. When it converges, since g is not moving, we have ξ = 0, leading to the fact that U(g ) = 0. More importantly, the following corollary shows that the non-increasing property of the modified traps g in sub-level set of U: Corollary 12. Let u = EHB(g0,ξ0). If the u sub-level set of U satisfies g0 S0 and

d(S0, i 1 Si) > h

2EHB(g0,ξ0) + h2 max S0 U (11)

Then we have gk S0 for any k for the Heavy-Ball scheme Eq. (9) when h γ γ2+L.

Under the further assumption of local strong convexity on this sub-level set, the convergence rate can be quantified via a Lyapunov analysis inspired by [25]. More specifically, given a fixed local minimum g , there is a local unique geodesic neighbourhood of g , denoted by S, and we define LHB on S by

LHB(g,ξ) = 1 1 γh (U(g exp( hξ)) U(g )) + 1

4 γ 1 γh log g 1 g + ξ

The exponential decay for the Lyapunov function (Lemma 32) helps us quantify of the convergence rate for Eq. (9) in the following theorem: Theorem 13 (Convergence rate of Heavy-Ball scheme). If the initial condition (g0,ξ0) satisfies that g0 S for some geodesically convex set S G, U is L-smooth and locally geodesic-µ-convex on S, and the u sub-level set of U with u = EODE(g0,ξ0) satisfies S0 S and Eq. (11), then we have

U(gk) U(g ) ck HBLHB(g0,ξ0)

with c HB = (1 + µ 16L) 1 by choosing γ = 2 µ, h = µ

Note the rate is (1 + 1/(16κ)) 1. The condition number dependence is linear (κ) but not κ. Similarly, the procedure of global convergence local potential well local minimum discussed in Rmk. 10 also applies the Heavy-Ball algorithm.

5 Convergence of Lie NAG-SC in discrete time

The motivation for NAG-SC is to improve the condition number dependence. The convergence rate of Heavy-Ball shown in Thm. 13 is the same as the momentumless case [e.g., 32, Thm. 15] under the assumption of local strong convexity and L-smoothness. To improve the condition number dependence, inspired by [25], we define Lie NAG-SC as the following:

{gk+1 = gk exp(hξk+1) ξk+1 = (1 γh)ξk (1 γh)h(Tgk Lgk 1 U(gk) Tgk 1Lgk 1 1 U(gk 1)) h Tgk Lgk 1 U(gk) (13)

Comparing to Lie Heavy-Ball, an extra O(h2) term h(Tgk Lgk 1 U(gk) Tgk 1Lgk 1 1 U(gk 1)) is introduced (see [25, Sec. 2] for more details in the Euclidean space). Our technique of lefttrivialized (and hence Euclidean) momentum allows this trick to transfer directly from Euclidean to the Lie group case.

For NAG-SC, we will only provide a local convergence with quantified convergence under Lsmoothness and local geodesically convexity on a geodesically convex subset S G. The difficulty in designing a modified energy and proving the global convergence will be given later in Rmk. 36. We define the following Lyapunov function:

LNAG-SC(g,ξ) = 1 1 γh (U(g exp( hξ)) U(g )) + 1

4 ξ + γ 1 γh log g 1 g + h U(g exp( hξ))

4(1 γh) U(g exp( hξ)) 2

where g is the minimum of U in S. This Lyapunov function helps us to trap g in a local potential well and quantify the convergence rate: Theorem 14 (Convergence rate of NAG-SC). If the initial condition (g0,ξ0) satisfies that g0 S for some geodesically convex set S G satisfying maxg S d(g ,g) a A for some a < 2π and A = max X =1 ad X op, U is L-smooth and locally geodesic-µ-convex on S, and the u sub-level set of U with u = (1 γh) 1LNAG-SC(g0,ξ0) satisfies S0 S and

d(S0,S S0) > h

LNAG-SC(g0,ξ0) (15) then we have U(gk) U(g ) ck NAG-SCLNAG-SC(g0,ξ0)

by choosing h = min{ 1

2L, 1 2p(a)} and γ = 2 µ, with c NAG-SC = (1 + 1

2L, 1 2p(a)}) 1 , where p(x) = x 1 exp( x) (16)

Unlike sampling ODE and Lie Heavy-Ball, monotonely decreasing modified energy is not provided for Lie NAG-SC. It is unclear whether such modified energy for NAG-SC exists, and an intuition is provided in the Rmk. 36.

Another fact in Thm. 14 that is worth noticing is, we have a term 1/p(a) that depends on the curvature of the Lie group 8, while the Lie Heavy-Ball has the same convergence rate as the Euclidean case [25]. It is unclear if the lost of convergence rate in Lie NAG-SC comparing to the Euclidean case is because of our proof technique or the curved space itself. However, we try to provide some insights in Rmk. 35.

6 Systematic numerical verification via the eigen decomposition problem

6.1 Analytical estimation of property of eigenvalue decomposition potential Given a symmetric matrix, its eigen decomposition problem can be approached via an optimization problem on SO(n): min X Rn n,X X=I tr X BXN

where N = diag([1,...,n]). This problem is a hard non-convex problem on manifold, but some analytical estimation [e.g., 7, Thm. 4] can be helpful for us to choose optimizer hyperparameters (we don t have to have those to apply the optimizers, but in this section we d like to verify our theoretical bounds and hence µ and L are needed).

This problem is non-convex with 2nn! stationary points corresponding to the elements in n-order symmetric group, including 2n local minima and 2n local maxima. We suppose B = RΛR with Λ = diag (0,1,...,n 2, κ n 1), where λi s (the diagonal values of Λ) are in ascend order. Given π in the n-symmetric group, the corresponding local minimum is Xπ = (Xπ(i)), i.e., we switch the columns of X by π. The eigenvalues of its Hessian at the local minimum π can be written as σij = (j i)(λπ(j) λπ(i)), 1 i < j n The global minimum is given by π = id with minimum value n i=1 iλi.

8In comparison, Euclidean NAG-SC has convergence rate (1 + C µ

2000 4000 6000 8000

condition number

1-convergence rate

(a) Numerical estimation for 1 c under different condition numbers κ = L

µ for Heavy-Ball and NAGSC, initialized close to the global minimum. The dashed curves are fitted value using our theoretical result, i.e., for Heavy-Ball, it is fitted by 1 c HB Cκ 1 for some C fit by linear regression, and for NAG-SC, it is 1 c NAG-SC C κ 1.

0 1000 2000 3000 4000 5000 6000 7000 8000

10 8 10 6 10 4 10 2 100 102

heavy_ball NAG_SC

(b) Global convergence on non-convex potential. The initial condition is chosen closed to the global maximum, and we plot the value of the potential function along the trajectory. The horizontal tail means the algorithm converges to the machine precision.

Figure 1: Fig. 1(a) shows that 1) Lie NAG-SC converges much faster than Lie Heavy-Ball on ill-conditioned problems; 2) The fitted dashed curve and the experimental results align well, showing our theoretical analysis of the convergence rate c HB and c NAG-SC is correct. Fig. 1(b) shows the performance of our algorithms on non-convex problems experimentally. In this specific experiment, Lie NAG-SC outperforms Lie Heavy-Ball and finds the global minimum successfully without being trapped in local minimums. However, we are not sure which is better in general optimization. One possible reason for the good performance on NAG-SC is it uses a larger learning rate and is better for jumping out of the local minimums. The values of Lyapunov function along the trajectory are not provided since it is not globally defined.

6.2 Numerical Experiment

We use the eigenvalues at the global minimum to estimate the L and µ in its neighborhood. As a result, around the global minimum, L (n 1)(λn λ1), and µ mini{λi+1 λi}, where we assume λ s are sorted in the ascend order. Such estimation is used to choose our parameters (γ and h) in all experiments as stated in Table 1.

Given a conditional number κ = L µ, we design A in the following way: we choose Λ = diag (0,1,...,n 2, κ n 1) and R is uniformly sampled from SO(n) using [22, Sec. 2.1.1]. When the given κ satisfies κ (n 1)(n 2), the condition number at global minimum is the given κ.

The results are presented in Fig. 1 and 2. In all experiments, we set n = 10, and the computations are done on a Mac Book Pro (M1 chip, 8GB memory).

7 Application to Vision Transformer

This section will demonstrate a practical modern machine learning application of our Lie NAG-SC optimizer. The setting is a highly non-convex optimization problem with stochastic gradients, due to being a real deep learning task, but empirical success is still observed. More specifically, it was discovered [19] that adding artificial orthogonal constraints to attention layers in transformer models can improve their performances, because orthogonality disallows linearly dependent correlations between tokens, so that the learned attentions can be more efficient and robust. We will apply our optimizer to solve this constrained optimization problem.

The setup is the following (using the notation of [28]): consider a Scaled Dot-Product Multihead Attention given by Multi Head(Q,K,V ) = Concat(head1,...,headnhead)W O, where headi = Attention(QW Q i ,KW K i ,V W V i ), Attention( Q, K, V ) = softmax( Q K dk ) V . The trainable parame-

ters are matrices W Q i Rdmodel dk, W K i Rdmodel dk, W V i Rdmodel dv and W O Rnheaddv dmodel. The

0 250 500 750 10001250150017502000

kappa=100 kappa=1000 kappa=10000

(a) (momentumless) GD

0 250 500 750 10001250150017502000

10 510 410 310 210 1100101

kappa=100 kappa=1000 kappa=10000

(b) Heavy-Ball

0 250 500 750 10001250150017502000

10 10 10 8 10 6 10 4 10 2 100

kappa=100 kappa=1000 kappa=10000

Figure 2: Local convergence of Lie Heavy-Ball and Lie NAG-SC on eigenvalue decomposition problem with different condition numbers. The initialization is close to the global minimum. The dashed curves are the value of potential function along the trajectory and the solid curves are the values of the corresponding Lyapunov functions. Lie GD (Eq. 1) has h been chosen as 1/L [32, Thm. 15]. We observe: 1. Lie NAG-SC converges much faster than Lie Heavy-Ball, especially on ill-conditioned problems. 2. Although the potential function is not monotonely decreasing, the Lyapunov is.

165 170 175 180 185 190 195 200 epoch

Cifar 10 validation error

Euclidean SGD Lie Heavy-Ball Lie NAG-SC

165 170 175 180 185 190 195 200 epoch

Cifar 100 validation error

Euclidean SGD Lie Heavy-Ball Lie NAG-SC

Figure 3: Training curve when applying our Lie HB and Lie NAG-SC to vision transformers. Forcing the query matrices and the key matrices on SO(n) [19, Orthogonality across heads] led to reduced training error. Moreover, validation error is also improved (Tab. 2).

three input matrices Q, K and V all have dimension sequence_length dmodel. dk and dv are usually smaller than dmodel.

In the case dmodel = nheaddk, which is satisfied in many popular models, we apply the constraint orthogonality across heads [19] and require Concat(W Q i ,i = 1...,nhead) and Concat(W K i ,i = 1...,nhead) to be in SO(dmodel). We compare the performance of our newly proposed optimizer with the existing ones (the optimizer in [19] is identical to Lie Heavy-Ball on SO(n)). Fig. 3 and Tab. 2 are the validation error when we train a vision transformer [2] with 6.3M parameters from scratch on CIFAR, showing an improvement of Lie NAG-SC comparing the state-of-the-art algorithm Lie Heavy-Ball. The computations are done on a single Nvidia V100 GPU. The model structures and hyperparameters are identical as Sec. 3.2 in [19]. Each presented result is the average of 3 independent runs.

Euclidean SGD Lie HB Lie NAG-SC CIFAR 10 9.84% 9.12% 8.77% CIFAR 100 32.68% 31.93% 31.38% Table 2: Validation error rate of vision transformer trained by different algorithms on CIFAR, showing the performance is ordered by Euclidean GD < Lie Heavy-Ball < Lie NAG-SC for both CIFAR 10 and 100. The blue font means the lowest error rate.

Acknowledgments and Disclosure of Funding

The authors are grateful for the partially support by NSF DMS-1847802, Cullen-Peck Scholarship, and GT-Emory Humanity.AI Award. We thank the anonymous reviewers for their helpful comments.

[1] Kwangjun Ahn and Suvrit Sra. From nesterov s estimate sequence to riemannian acceleration. In Conference on Learning Theory, pages 84 118. PMLR, 2020.

[2] Dosovitskiy Alexey. An image is worth 16x16 words: Transformers for image recognition at scale. ar Xiv preprint ar Xiv: 2010.11929, 2020.

[3] Foivos Alimisis, Antonio Orvieto, Gary Bécigneul, and Aurelien Lucchi. A continuous-time perspective for modeling acceleration in riemannian optimization. In International Conference on Artificial Intelligence and Statistics, pages 1297 1307. PMLR, 2020.

[4] Foivos Alimisis, Antonio Orvieto, Gary Becigneul, and Aurelien Lucchi. Momentum improves optimization on riemannian manifolds. In International conference on artificial intelligence and statistics, pages 1351 1359. PMLR, 2021.

[5] Martin Arjovsky, Amar Shah, and Yoshua Bengio. Unitary evolution recurrent neural networks. In International conference on machine learning, pages 1120 1128. PMLR, 2016.

[6] Silvere Bonnabel. Stochastic gradient descent on riemannian manifolds. IEEE Transactions on Automatic Control, 58(9):2217 2229, 2013.

[7] Roger W Brockett. Least squares matching problems. Linear Algebra and its applications, 122: 761 777, 1989.

[8] Zhehui Chen, Xingguo Li, Lin Yang, Jarvis Haupt, and Tuo Zhao. On constrained nonconvex stochastic optimization: A case study for generalized eigenvalue decomposition. In The 22nd International Conference on Artificial Intelligence and Statistics, pages 916 925. PMLR, 2019.

[9] Moustapha Cisse, Piotr Bojanowski, Edouard Grave, Yann Dauphin, and Nicolas Usunier. Parseval networks: Improving robustness to adversarial examples. In International conference on machine learning, pages 854 863. PMLR, 2017.

[10] Christopher Criscitiello and Nicolas Boumal. Negative curvature obstructs acceleration for strongly geodesically convex optimization, even with exact first-order oracles. In Conference on Learning Theory, pages 496 542. PMLR, 2022.

[11] Christopher Criscitiello and Nicolas Boumal. An accelerated first-order method for non-convex optimization on manifolds. Foundations of Computational Mathematics, 23(4):1433 1509, 2023.

[12] Christopher Criscitiello and Nicolas Boumal. Curvature and complexity: Better lower bounds for geodesically convex optimization. In The Thirty Sixth Annual Conference on Learning Theory, pages 2969 3013. PMLR, 2023.

[13] EB Dynkin. Calculation of the coefficients in the campbell hausdorff formula. DYNKIN, EB Selected Papers of EB Dynkin with Commentary. Ed. by YUSHKEVICH, AA, pages 31 35, 2000.

[14] Nicolas Guigui and Xavier Pennec. A reduced parallel transport equation on lie groups with a left-invariant metric. In Geometric Science of Information: 5th International Conference, GSI 2021, Paris, France, July 21 23, 2021, Proceedings 5, pages 119 126. Springer, 2021.

[15] Ernst Hairer, Marlis Hochbruck, Arieh Iserles, and Christian Lubich. Geometric numerical integration. Oberwolfach Reports, 3(1):805 882, 2006.

[16] Linus Hamilton and Ankur Moitra. A no-go theorem for robust acceleration in the hyperbolic plane. Neur IPS, 2021.

[17] Kyle Helfrich, Devin Willmott, and Qiang Ye. Orthogonal recurrent neural networks with scaled cayley transform. In International Conference on Machine Learning, pages 1969 1978. PMLR, 2018.

[18] Lingkai Kong and Molei Tao. Convergence of kinetic langevin monte carlo on lie groups. COLT, 2024.

[19] Lingkai Kong, Yuqing Wang, and Molei Tao. Momentum stiefel optimizer, with applications to suitably-orthogonal attention, and optimal transport. ICLR, 2023.

[20] John Milnor. Curvatures of left invariant metrics on lie groups, 1976.

[21] Yurii Nesterov. Introductory lectures on convex optimization: A basic course, volume 87. Springer Science & Business Media, 2013.

[22] Sean O Hagan. Uniform sampling methods for various compact spaces. 2007.

[23] Boris T Polyak. Introduction to optimization. 1987.

[24] Zeju Qiu, Weiyang Liu, Haiwen Feng, Yuxuan Xue, Yao Feng, Zhen Liu, Dan Zhang, Adrian Weller, and Bernhard Schölkopf. Controlling text-to-image diffusion by orthogonal finetuning. Advances in Neural Information Processing Systems, 36:79320 79362, 2023.

[25] Bin Shi, Simon S Du, Michael I Jordan, and Weijie J Su. Understanding the acceleration phenomenon via high-resolution differential equations. Mathematical Programming, pages 1 70, 2021.

[26] Molei Tao and Tomoki Ohsawa. Variational optimization on lie groups, with examples of leading (generalized) eigenvalue problems. In International Conference on Artificial Intelligence and Statistics, pages 4269 4280. PMLR, 2020.

[27] Nilesh Tripuraneni, Nicolas Flammarion, Francis Bach, and Michael I Jordan. Averaging stochastic gradient descent on riemannian manifolds. In Conference On Learning Theory, pages 650 687. PMLR, 2018.

[28] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in neural information processing systems, 30, 2017.

[29] Zaiwen Wen and Wotao Yin. A feasible method for optimization with orthogonality constraints. Mathematical Programming, 142(1):397 434, 2013.

[30] Andre Wibisono, Ashia C Wilson, and Michael I Jordan. A variational perspective on accelerated methods in optimization. proceedings of the National Academy of Sciences, 113(47):E7351 E7358, 2016.

[31] Shing-Tung Yau. Non-existence of continuous convex functions on certain riemannian manifolds. Mathematische Annalen, 207:269 270, 1974.

[32] Hongyi Zhang and Suvrit Sra. First-order methods for geodesically convex optimization. In Conference on learning theory, pages 1617 1638. PMLR, 2016.

[33] Hongyi Zhang and Suvrit Sra. Towards riemannian accelerated gradient methods. ar Xiv preprint ar Xiv:1806.02812, 2018.

A Properties of Lie groups and functions on Lie groups

A.1 More details about compact Lie groups with left-invariant metric

Comparing with the Euclidean space, Lie groups lack of commutativity, i.e., for g, ˆg G, gˆg and ˆgg are not necessarily equal. This can also be characterized by the non-trivial Lie bracket [ , ]. This non-commutativity leads to the fact that exp(X)exp(Y ) exp(X + Y ). An explicit expression for log(exp(X)exp(Y )) is given by Dynkin s formula [13]. Utilizing Dynkin s formula, we quantify dlog in the following.

Corollary 15 (Differential of logarithm). If dlog g is well defined, then the differential of logarithm on G is given by

dξ log g = (dlog)g(Te Lgξ) = Te Lg [p(adlog g)ξ], ξ g (17)

where the power series p is defined in Eq. (16)

The vanishment of ad ξ ξ can also be understood as the group structure and the Riemannian structure are compatible. See [18] for more discussion. Under such assumption, we have the following properties:

Corollary 16. Suppose we have ad X is skew-adjoint X g. Then for any g G and any ξ g such that dξ log g is well-defined, we have

dξ log g,log g = log g,ξ

Corollary 17. When dξ log g is well-defined and ad X is skew-adjoint X g, we have

dξ log g,ξ ξ 2

Corollary 18. Define A = max X =1 ad X op (18)

When d(g,e) a

A for some a (0,2π), we have

dlog g Id q(a)

where q is defined by

q(x) = p(xi) 1

= xi 1 cosx + isinx 1 (19)

2 cosx xsinx

with p defined in Eq. (16). Remark 19 (About existence and uniqueness of log). As the inverse of exp, the operator log may not be uniquely defined globally. However, we are always considering in a unique geodesic subset of the Lie group, where log is defined uniquely in such a subset of Lie group. Similarly, even if we do not have globally geodesically strongly convex functions, we only require locally strong convexity.

A.2 More details about functions on Lie groups

The commonly used geodesic L-smooth on a manifold M is given by the following definition [e.g., 32, Def. 5]:

Definition 20 (Geodesically L-smooth). U G R is geodesically L-smooth if for any g,ˆ M,

U(g) Γg ˆg U(ˆg) Ld(g, ˆg)

where Γg ˆg is the parallel transport from ˆg to g.

Lemma 21. Under the assumption of ad X is skew-adjoint X g, Def. 4 is identical to Def. 20.

Proof of Lemma 21. For any g, ˆg G, consider the shortest geodesic ϕ [0,1] G connecting g and ˆg and denote ξ = Tg Lg 1 U(g). Using the condition ad is skew-adjoint, we have ϕ(t) = 0 and Te Lϕ(t)ξ is parallel along ϕ by checking the condition for parallel transport [14, Thm. 1]:

d dtξ = 0 = 1

2 [Tϕ(t)Lϕ(t) 1 ϕ(t),ξ]

This tells that

Tg Lg 1Γˆg g f(g) = Tˆg Lˆg 1 f(ˆg)

Together with the metric is left-invariant, we have

Tg Lg 1 U(g) Tˆg Lˆg 1 U(ˆg) Ld(g, ˆg)

which is identical to Def. 4.

Corollary 22 (Properties of L-smooth functions). If U G R is L-smooth, then for any g, ˆg G, we have

U(ˆg) U(g) + Tg Lg 1 U(g),log g 1ˆg + L

2 d2(g, ˆg) (20)

Proof of Cor. 22. We denote the one of the shortest geodesic connecting g and ˆg as g(t), i.e., π [0,1] G with g(0) = g and g(1) = ˆg, g(t) = g exp(tξ) for some ξ g with ξ = d(g, ˆg). Then

= Te Lgξ, U(g) +

0 Te Lg(t)ξ, U(g(t)) dt

= ξ,Tg Lg 1 U(g) +

0 ξ,Tg(t)Lg(t) 1 U(g(t)) Tg Lg 1 U(g) dt

ξ,Tg Lg 1 U(g) +

0 ξ Tg(t)Lg(t) 1 U(g(t)) Tg Lg 1 U(g) dt

ξ,Tg Lg 1 U(g) +

0 t L ξ 2 dt

= ξ,Tg Lg 1 U(g) + L

2 d2(g, ˆg)

Lemma 23 (Co-coercivity). If the function U G R is both L-smooth and convex on a geodesically convex set S G, then we have for any g, ˆg S,

U(ˆg) U(g) + Tg Lg 1 U(g),log g 1ˆg + 1

2L Tg Lg 1 U(g) Tˆg Lˆg 1 U(ˆg) 2 (21)

Proof of Lemma 23. By convexity, we have for any g G, ξ g and t [0,1],

U(g exp(tξ)) U(g) t Tg Lg 1 U(g),ξ

U(g) U(g exp(tξ)) t Tg exp(tξ)Lg exp(tξ) 1 U(g exp(tξ)),ξ

We sum these two inequalities and have

Tg exp(tξ)Lg exp(tξ) 1 U(g exp(tξ)) Tg Lg 1 U(g),ξ 0

which tells that

t [Tg exp(tξ)Lg exp(tξ) 1 U(g exp(tξ))] = Htξ

for some linear map H g g with all eigenvalues between 0 and L, i.e., 0 Ht L for any t [0,1].

Now, we select the shortest geodesic connection g and ˆg, defined by g(t) = g exp(tξ), with ˆg = g exp(ξ) and ξ = d(g, ˆg). By

Tg(t)Lg(t) 1 U(g(t)) Tg Lg 1 U(g) =

0 s Tg exp(sξ)Lg exp(sξ) 1 U(g exp(sξ))ds

0 Tg(t) U(g(t)),ξ

0 Tg Lg 1 U(g) +

0 s Tg(s)Lg(s) 1 U(g(s))ds,ξ dt

= Tg Lg 1 U(g),ξ +

0 Hsξds,ξ dt

= Tg Lg 1 U(g),ξ +

0 Hsξ,ξ dsdt

= Tg Lg 1 U(g),ξ + 1

0 Hsξ,ξ dsdt

Tg Lg 1 U(g),ξ + 1

0 Hsξ,Hsξ dsdt

Tg Lg 1 U(g),ξ + 1

= Tg Lg 1 U(g),ξ + 1

0 Tˆg Lˆg 1 U(ˆg) Tg Lg 1 U(g) 2dt

= Tg Lg 1 U(g),ξ + 1

2L Tˆg Lˆg 1 U(ˆg) Tg Lg 1 U(g) 2

Corollary 24. If the function U G R is both L-smooth and convex on a geodesically convex set S G, then we have for any g, ˆg S,

Tg Lg 1 U(g) Tˆg Lˆg 1 U(ˆg),log ˆg 1g 1

L Tg Lg 1 U(g) Tˆg Lˆg 1 U(ˆg) 2 (22)

Proof. By exchanging g and ˆg in Eq. (21), we have

U(g) U(ˆg) + Tˆg Lˆg 1 U(ˆg),log ˆg 1g + 1

2L Tg Lg 1 U(g) Tˆg Lˆg 1 U(ˆg) 2 (23)

Summing up 21 and 23 gives us the desired result.

Corollary 25 (Properties of µ-strongly convex functions). Suppose U G R is geodesic-µ-strongly convex on a geodesically convex set S G, then for any g, ˆg S,

Tg Lg 1 U(g),log ˆg 1g µd2(g, ˆg) (24)

Proof of Cor. 25. By the definition of geodesic-µ-strongly convex functions in Eq. (4), we have

U(g) U(ˆg) Tˆg Lˆg 1 U(ˆg),log ˆg 1g + µ

2 log ˆg 1g 2

U(ˆg) U(g) Tg Lg 1 U(g),log g 1ˆg + µ

2 log ˆg 1g 2

Summing them up and using log g 1ˆg = log ˆg 1g gives us the conclusion.

B More details about optimization ODE

Proof of Thm. 6. By direct calculation,

d dt E(g(t),ξ(t)) = Te Lgξ, U(g) + ξ, ξ

Lemma 26 (Monotonicity of the Lyapunov function). Assume ad X is skew-adjoint for any X g. Suppose there is a geodesically convex set S G satisfying:

U is geodesic-µ-strongly convex on a geodesically convex set S G.

g is the minimum of U on S.

g(t) S for all t 0.

log g 1 g and its differential is well-defined for all g S.

Then the solution of ODE (2) satisfies

d dt LODE(g(t),ξ(t)) c ODELODE(g(t),ξ(t)) (25)

with the convergence rate given by

c ODE = γ min{ µ µ + γ2 , 2

Proof of Lemma 26. The time derivative of L is

d dt LODE = ξ,Tg Lg 1 U(g) + 1

2 ξ, γξ Tg Lg 1 U(g)

2 γ log g 1 g + ξ,γ dξ log g 1 g γξ Tg Lg 1 U(g)

= ξ,Tg Lg 1 U(g) γ

2 ξ,Tg Lg 1 U(g)

2 log g 1 g,dξ log g 1 g γ2

2 log g 1 g,ξ γ

2 log g 1 g,Tg Lg 1 U(g)

2 ξ,dξ log g 1 g γ

2 log g 1 g,Tg Lg 1 U(g) + γ

2 ξ,dξ log g 1 g

2 log g 1 g,Tg Lg 1 U(g) γ

where the second last equation is by Cor. 16 and the last inequality is by Cor. 17. By the property of strong convexity in Eq. (4) and Cor. 25, for any λ [0,1],

d dt LODE γ

2 (λ(U(g) U(g )) + (λ

2 + (1 λ))µd2(g ,g) + ξ 2) (26)

where λ is a constant to be determined later.

Next, we try to give LODE an upper bound. By Cauchy-Schwarz inequality,

γ log g 1 g + ξ 2 2(γ2d2(g ,g) + ξ 2)

and further

LODE U(g) U(g ) + 3

2 d2(g,g ) (27)

Splitting scheme Heavy ball

hγ 1 e γh ξ

friction parameter γ

h step size h

hγ h Table 3: Change of variable between Heavy-ball and splitting scheme

Compare the coefficients in Eq. (26) and (27), we have

d dt LODE c ODELODE

where the convergence rate c ODE is given by

2 + (1 λ))µ}

By selecting λ = 2µ µ+γ2 to make λ = µ

γ2 (2 λ) satisfied, we have the desired result.

Proof of Thm. 9. This is the direct corollary from Cor. 8 and Lemma 26.

C More details about Heavy-Ball discretization

Remark 27 (Polyak s Heavy ball [23]). Heavy-ball scheme in the Euclidean space is

xk+1 = xk + α(xk xk 1) β U(xk)

where α and β are positive parameters. We now write it into a position-velocity form. By setting vk+1 = xk+1 xk

β , we have

{xk+1 = xk + βvk+1 vk+1 = αvk β U(xk)

We perform the change of variables given by β h, α 1 γh gives

{xk+1 = xk + hvk+1 vk+1 = (1 γh)vk h U(xk)

which is the Euclidean version corresponding to Eq. (9).

Remark 28 (Splitting discretization). The two systems of ODEs in Eq. (8) are both linear and has exact solutions. We will refer to the numerical scheme evolving their exact solution alternatively by splitting discretization. More precisely, this gives us the following numerical scheme:

gk+1 = gk exp(hξk+1) ξk+1 = e γhξk 1 e γh

γ Tgk Lgk 1 U(gk) (28)

After change of variable in the Table 3, it becomes Eq. (9).

Eq. (28) is similar to the NAG-SC in [26]. The authors provide a second-order approximation to the optimization ODE Eq. (2) by evolving the two ODEs in Eq. (8) in the following way in each step: 1) evolve ξ-ODE for h/2 time; 2) evolve g-ODE for h/2 time; 3) evolve ξ-ODE for h/2 time again. Although this zig-zag scheme is higher-order of approximation of the optimization ODE comparing to the the splitting approximation mentioned in Rmk. 28, it still has the same condition number dependence. The reason is, we can take out the first evolution of time h/2 for the ξ-system and it becomes identical to the splitting scheme (with a different initial condition.)

Before we start the theoretical calculation, we mention that update for ξ in Heavy-Ball scheme Eq. (9) can also be written as

ξk = 1 1 γhξk+1 + h 1 γh Tgk Lgk 1 U(gk) (29)

which will be helpful later.

Proof of Thm. 11. Using Eq. (29), we have the following calculation of EHB(gk,ξk) EHB(gk 1,ξk 1):

EHB(gk,ξk) EHB(gk 1,ξk 1)

= U(gk) U(gk 1) + (1 γh)2

2 ( ξk 2 ξk 1 2)

= U(gk 1 exp(hξk)) U(gk 1) + (1 γh)2

2 ( ξk 2 1 1 γhξk + h 1 γh Tgk 1Lgk 1 1 U(gk 1)

= U(gk 1 exp(hξk)) U(gk 1) (γh γ2h2

2 ) ξk 2 h ξk,Tgk 1Lgk 1 1 U(gk 1) h2

2 U(gk 1) 2

h ξk,Tgk 1Lgk 1 1 U(gk 1) + Lh2

2 ξk 2 (γh γ2h2

2 ) ξk 2 h ξk,Tgk 1Lgk 1 1 U(gk 1) h2

2 U(gk 1) 2

2(h2L 2γh + γ2h2) ξk 2 h2

2 U(gk 1) 2

where the second last inequality is the property of L-smooth functions given in Eq. 20.

When h γ γ2+L, we have h2L 2γh+γ2h2 γh, and EHB(gk,ξk) EHB(gk 1,ξk 1) γh ξk 2.

Remark 29. The design of modified energy for Heavy-Ball in Eq. (10) and Thm. 11 is new, and is different from modified potential function in existing works (e.g., [15]). Our modified energy is not designed to let the Hamiltonian system to have higher order of preserving the total energy, but is defined to ensure monotonicity of the modified energy to ensure global convergence of the numerical scheme.

Remark 30. We specially choose our update of numerical scheme in Eq. (9) (and also later Eq. (13) for NAG-SC) to ensure such natural form of the modified energy. If we choose to update g first and then ξ (e.g., [25]), the modified energy will have to be evaluated at different time step, EHB(gk+1,ξk), for example.

Proof of Cor. 12. We prove this by induction. Suppose we have gk S0. By the dissipation of the modified energy Thm. 11, EHB(gk,ξk) EHB(g0,ξ0). As a result,

ξk+1 (1 γh) ξk + h U(gk)

2 (1 γh)2 EHB(gk,ξk) + h U(gk)

2EHB(g0,ξ0) + hmax S0 U

Since we have gk+1 = gk exp(hξk+1), we have d(gk+1,gk) h ξk+1 . Together with the condition that U(gk+1) EHB(gk+1,ξk+1) u, we have gk+1 S0. Mathematical induction gives the desired result.

Lemma 31. Assume ad X is skew-adjoint for any X g. Suppose there is a geodesically convex set S G satisfying:

U is geodesically µ-strongly convex on a convex set S G.

g is the minima of U on S.

gk S for all k N.

log g 1 g and its differential is well-defined for all g S.

Then we have

LHB k+1 LHB k b HBLHB k+1 h 2(1 γh) ( 3γ

4L h 1 γh) U(gk+1) 2

where b HB is given by

b HB = min{γh

8 , µh(1 γh)

2γ , 2γh 3(1 γh)}

Proof of Lemma 31. For (gk,ξk) following the Heavy-Ball scheme Eq. (9), we use the shorthand notation of LHB k = LHB(gk,ξk), which gives

LHB k = 1 1 γh (U(gk 1) U(g )) + 1

4 γ 1 γh log g 1 gk + ξk

Evaluate of the three terms in LHB k+1 LHB k separately :

The first term: By co-coercivity in Lemma 23, we have

1 1 γh (U(gk) U(gk 1))

h 1 γh Tgk Lgk 1 U(gk),ξk I1

2L 1 1 γh Tgk Lgk 1 U(gk) Tgk 1Lgk 1 1 U(gk 1) 2 I2

The second term: Using Eq. (29), we have

1 4 ( ξk+1 2 ξk 2)

2 ξk+1 ξk,ξk+1 1

4 ξk+1 ξk 2

= γh 2(1 γh) ξk+1 2 II1

h 2(1 γh) ξk+1,Tgk Lgk 1L U(gk) II2

4 ξk+1 ξk 2 II3

The third term: Define a parametric curve [0,h] G as gt = gk exp(tξk+1)

1 4 ( γ 1 γh log g 1 gk+1 + ξk+1

2 γ 1 γh log g 1 gk + ξk

4 ( γ 1 γh log g 1 gk+1 + ξk+1

2 γ 1 γh log g 1 gk + ξk+1

4 ( γ 1 γh log g 1 gk + ξk+1

2 γ 1 γh log g 1 gk + ξk

dt ( γ 1 γh log g 1 gt + ξk+1), γ 1 γh log g 1 gt + ξk+1 dt

4 ( γ 1 γh log g 1 gk + ξk+1

2 γ 1 γh log g 1 gk + ξk

We evaluate the two terms separately.

dt ( γ 1 γh log g 1 gt + ξk+1), γ 1 γh log g 1 gt + ξk+1 dt

0 γ 1 γh dξk+1 log g 1 gt, γ 1 γh log g 1 gt + ξk+1 dt

0 γ 1 γh dξk+1 log g 1 gt, γ 1 γh log g 1 gt dt

0 γ 1 γh dξk+1 log g 1 gt,ξk+1 dt

0 ξk+1,log g 1 gt dt + γh 1 γh ξk+1 2

0 ξk+1,log g 1 gt log g 1 gk dt + γ2h (1 γh)2 ξk+1,log g 1 gk + γh 1 γh ξk+1 2

0 dξk+1 log g 1 gsds dt + γ2h (1 γh)2 ξk+1,log g 1 gk + γh 1 γh ξk+1 2

0 ξk+1,dξk+1 log g 1 gs dsdt + γ2h (1 γh)2 ξk+1,log g 1 gk + γh 1 γh ξk+1 2

0 ξk+1 2dsdt + γ2h (1 γh)2 ξk+1,log g 1 gk + γh 1 γh ξk+1 2

2(1 γh)2 ξk+1 2 + γ2h (1 γh)2 ξk+1,log g 1 gk + γh 1 γh ξk+1 2

Using the ξ update in Eq. (29), we have

γ 1 γh log g 1 gk+1 + ξk+1

2 γ 1 γh log g 1 gk + ξk

= ξk+1 ξk, 2γ 1 γh log g 1 gk + ξk+1 + ξk

= γh 1 γhξk+1 + h 1 γh Tgk Lgk 1 U(gk), 2γ 1 γh log g 1 gk + 2 γh

1 γhξk+1 + h 1 γh Tgk Lgk 1 U(gk)

= 2γ2h (1 γh)2 ξk+1,log g 1 gk γh(2 γh)

(1 γh)2 ξk+1 2 2γh (1 γh)2 Tgk Lgk 1 U(gk),log g 1 gk

2h (1 γh)2 Tgk Lgk 1 U(gk),ξk+1 h2

(1 γh)2 U(gk) 2

Sum them up, we have

1 4 ( γ 1 γh log g 1 k+1g + ξk+1

2 γ 1 γh log g 1 k g + ξk

h 2(1 γh)2 Tgk Lgk 1 U(gk),ξk+1 III2

γh 2(1 γh)2 Tgk Lgk 1 U(gk),log g 1 gk III1

4(1 γh)2 U(gk) 2 III3

Take a closer look: 1 2I1 + III2 = h 2(1 γh) Tgk Lgk 1 U(gk),ξk h 2(1 γh)2 Tgk Lgk 1 U(gk),ξk+1

= h 2(1 γh) Tgk Lgk 1 U(gk),ξk 1 1 γhξk+1

2(1 γh)2 Tgk Lgk 1 U(gk) 2

1 2I1 + II2 + II3 + III3 = 1

4 ξk+1 ξk + h 1 γh Tgk Lgk 1 U(gk)

Now we sum everything up

LHB k+1 LHB k γh 2(1 γh) ( Tgk Lgk 1 U(gk),log g 1 gk + ξk+1 2) + h2

2(1 γh) U(gk) 2

By Eq. (4) (strong convexity) and Eq. (22) (corollary of co-coercivity) of U, we have U(gk) U(g ) + µ

2 log g 1 gk 2 Tgk Lgk 1 U(gk),log g 1 gk

LHB k+1 LHB k γh 2(1 γh) (1

4(U(gk) U(g )) + µ

2 log g 1 gk 2 + ξk+1 2)

h 2(1 γh) (3

4γ(U(gk) U(g )) h 1 γh U(gk) 2)

γh 2(1 γh) (1

4(U(gk) U(g )) + µ

2 log g 1 gk 2 + ξk+1 2) (31)

h 2(1 γh) ( 3γ

4L h 1 γh) U(gk) 2

Cauchy-Schwarz inequality gives

LHB k+1 1 1 γh (U(gk) U(g )) + γ2

2(1 γh)2 log g 1 gk 2 + 3

4 ξk+1 2 (32)

Comparing the coefficients in Eq. (31) and (32) gives us the desired result.

Lemma 32 (Monotonicity of the Lyapunov function for Heavy-Ball scheme). Assume the conditions in Lemma 31 is satisfied. By choosing γ = 2 µ and step size h = µ

4L , we have b HB = µ 16L and

LHB k+1 c HBLHB k (33)

by defining c HB = (1 + b HB) 1.

Proof of Lemma 32. By our choice of h and γ makes 3γ

4L h 1 γh 0, and the convergence rate by Lemma 31.

Proof of Thm. 13. This is a direct corollary of Lemma 32 and Cor. 12.

D More details about NAG-SC discretization

Lemma 33. Assume ad X is skew-adjoint for any X g. Suppose there is a geodesically convex set S G satisfying:

U is geodesically µ-strongly convex on a convex set S G.

g is the minima of U on S.

gk S for all t 0.

log g 1 g and its differential is well-defined for all g S.

maxg S d(g ,g) a

A for some a < 2π.

Setting h = min{ 1

2L, 1 2p(a)} and γ = 2 µ, we have

LNAG-SC k+1 LNAG-SC k b NAG-SCLNAG-SC k+1

for the contraction rate b NAG-SC given by

b NAG-SC = 1

2L , 1 2p(a)} (34)

Proof of Lemma 33. For (gk,ξk) following the NAG-SC scheme Eq. (13), we define the shorthand notation LNAG-SC k = LNAG-SC(gk,ξk), which gives

LNAG-SC k = 1 1 γh (U(gk 1) U(g ))+1

4 ξk + γ 1 γh log g 1 gk + h U(gk 1)

4(1 γh) U(gk 1) 2

Evaluate of the basic terms of LNAG-SC k+1 LNAG-SC k first:

The first term: By the property of convex L-smooth functions in Eq. (21),

1 1 γh (U(gk) U(gk 1))

h 1 γh Tgk Lgk 1 U(gk),ξk I1

2L 1 1 γh Tgk Lgk 1 U(gk) Tgk 1Lgk 1 1 U(gk 1) 2 I2

The second term: NAG-SC scheme in Eq. 13 gives the following:

1 4 ( ξk+1 2 ξk 2)

2 ξk+1 ξk,ξk+1 1

4 ξk+1 ξk 2

= γh 2(1 γh) ξk+1 2 h

2 Tgk Lgk 1 U(gk) Tgk 1Lgk 1 1 U(gk 1),ξk+1

h 2(1 γh) Tgk 1Lgk 1 1 U(gk 1),ξk+1 1

4 ξk+1 ξk 2

= γh 2(1 γh) ξk+1 2 II1

2 Tgk Lgk 1 U(gk) Tgk 1Lgk 1 1 U(gk 1),ξk II2

2 Tgk Lgk 1 U(gk) Tgk 1Lgk 1 1 U(gk 1) 2 II3

2 Tgk Lgk 1 U(gk) Tgk 1Lgk 1 1 U(gk 1),Tgk Lgk 1 U(gk) II4

h 2(1 γh) ξk+1,Tgk Lgk 1 U(gk) II5

4 ξk+1 ξk 2 II6

The third term: We consider the parametric curve on G connecting gk and gk+1 defined by gt = gk exp(tξk+1),t [0,h].

1 4 ξk+1 + γ 1 γh log g 1 gk+2 + h Tgk Lgk 1 U(gk)

4 ξk + γ 1 γh log g 1 gk + h Tgk 1Lgk 1 1 U(gk 1)

4 ξk+1 + γ 1 γh log g 1 gk+2 + h Tgk Lgk 1 U(gk)

4 ξk+1 + γ 1 γh log g 1 gk + h Tgk Lgk 1 U(gk)

4 ξk+1 + γ 1 γh log g 1 gk + h Tgk Lgk 1 U(gk)

4 ξk + γ 1 γh log g 1 gk + h Tgk 1Lgk 1 1 U(gk 1)

0 ξk+1 + γ 1 γh log g 1 gt + h Tgk Lgk 1 U(gk), γ 1 γh dξk+1 log g 1 gt dt

4 ξk+1 + ξk + 2γ 1 γh log g 1 gk + h Tgk Lgk 1 U(gk) + h Tgk 1Lgk 1 1 U(gk 1),ξk+1 ξk + h Tgk Lgk 1 U(gk) h Tgk 1Lg

0 ξk+1 + γ 1 γh log g 1 gt + h Tgk Lgk 1 U(gk), γ 1 γh dξk+1 log g 1 gt dt

+ 1 4(1 γh)2 (2 γh)ξk+1 + 2γ log g 1 gk + (3 2γh)h Tgk Lgk 1 U(gk), γhξk+1 h Tgk Lgk 1 U(gk)

First, we estimate the term h 0 ξk+1 + γ 1 γh log g 1 gt + h Tgk Lgk 1 U(gk), γ 1 γh dξk+1 log g 1 gt dt using the property of dlog in Cor. 16, 17 and 18.

γ 1 γh log g 1 gt + ξk+1 + h Tgk Lgk 1 U(gk),dξk+1 log g 1 gt

= γ 1 γh log g 1 gt + ξk+1,dξk+1 log g 1 gt

+ h Tgk Lgk 1 U(gk),ξk+1 + h Tgk Lgk 1 U(gk),dξk+1 log g 1 gt ξk+1

γ 1 γh log g 1 gt,ξk+1 + ξk+1 2 + h Tgk Lgk 1 U(gk),ξk+1 + p(a)h U(gk) ξk+1 (36)

Taking integral gives

0 γ 1 γh log g 1 gt + ξk+1 + h Tgk Lgk 1 U(gk),dξk+1 log g 1 gt dt

0 γ 1 γh log g 1 gt,ξk+1 + ξk+1 2 + h Tgk Lgk 1 U(gk),ξk+1 + p(a)h U(gk) ξk+1 dt

0 log g 1 gt,ξk+1 dt + h ξk+1 2 + h2 Tgk Lgk 1 U(gk),ξk+1 + h2p(a) U(gk) ξk+1

We calculate the integral

0 log g 1 gt,ξk+1 dt

0 [ log g 1 gk,ξk+1 +

0 dξk+1 log g 1 gτ,ξk+1 dτ]dt

h log g 1 gk,ξk+1 +

0 ξk+1 2 dτ dt

= h log g 1 gk,ξk+1 + h2

which gives us

0 ξk+1 + γ 1 γh log g 1 gt + h Tgk Lgk 1 U(gk),dξk+1 log g 1 gt dt

γh 1 γh log g 1 gk,ξk+1 + h(2 γh)

2(1 γh) ξk+1 2

+ h2 Tgk Lgk 1 U(gk),ξk+1 + h2p(a) U(gk) ξk+1

The other term

(2 γh)ξk+1 + 2γ log g 1 gk + (3 2γh)h Tgk Lgk 1 U(gk), γhξk+1 h Tgk Lgk 1 U(gk)

= γh(2 γh) ξk+1 2 2γ2h log g 1 gk,ξk+1 γh2(3 2γh) Tgk Lgk 1 U(gk),ξk+1

h(2 γh) Tgk Lgk 1 U(gk),ξk+1 2γh log g 1 gk,Tgk Lgk 1 U(gk) h2(3 2γh) Tgk Lgk 1 U(gk) 2

= γh(2 γh) ξk+1 2 2γ2h log g 1 gk,ξk+1 2h(1 + γh γ2h2) Tgk Lgk 1 U(gk),ξk+1

2γh log g 1 gk,Tgk Lgk 1 U(gk) h2(3 2γh) U(gk) 2

Summing them up, we have

ξk+1 + γ 1 γh log g 1 gk+1 + h Tgk Lgk 1 U(gk)

2 ξk + γ 1 γh log g 1 gk + h Tgk 1Lgk 1 1 U(gk 1)

2γ2h (1 γh)2 log g 1 gk,ξk+1 + γh(2 γh)

(1 γh)2 ξk+1 2

1 γh Tgk Lgk 1 U(gk),ξk+1 + 2γh2

1 γhp(a) U(gk) ξk+1

(1 γh)2 ξk+1 2 2γ2h (1 γh)2 log g 1 gk,ξk+1 2h(1 + γh γ2h2)

(1 γh)2 Tgk Lgk 1 U(gk),ξk+1

2γh (1 γh)2 log g 1 gk,Tgk Lgk 1 U(gk) h2(3 2γh)

(1 γh)2 Tgk Lgk 1 U(gk) 2

1 γhp(a) U(gk) ξk+1 2h (1 γh)2 Tgk Lgk 1 U(gk),ξk+1

2γh (1 γh)2 log g 1 gk,Tgk Lgk 1 U(gk) h2(3 2γh)

(1 γh)2 U(gk) 2

Eventually, the third term becomes

1 4 ξk+1 + γ 1 γh log g 1 gk+1 + h Tgk Lgk 1 U(gk)

4 ξk + γ 1 γh log g 1 gk + h Tgk 1Lgk 1 1 U(gk 1)

= hγ 2(1 γh)2 Tgk Lgk 1 U(gk),log g 1 gk III1

h 2(1 γh)2 Tgk Lgk 1 U(gk),ξk+1 III2

2(1 γh) Tgk Lgk 1 U(gk) 2 III3

4(1 γh)2 Tgk Lgk 1 U(gk) 2 III4

2(1 γh)p(a) U(gk) ξk+1 extra term from curvature

We sum the all the terms up to get

LNAG-SC k+1 LNAG-SC k

γh 2(1 γh) ( 1 1 γh Tgk Lgk 1 U(gk),log g 1 gk + ξk+1 2) II1 + III1

h 2(1 γh) Tgk Lgk 1 U(gk), 1 1 γhξk+1 ξk + h Tgk Lgk 1 U(gk) 1 2I1 + III2 + III3

2 Tgk Lgk 1 U(gk) Tgk 1Lgk 1 1 U(gk 1),ξk II2

2 Tgk Lgk 1 U(gk) Tgk 1Lgk 1 1 U(gk 1),Tgk Lgk 1 U(gk) II4

4 ( 2h 1 γh ξk+1 ξk,Tgk Lgk 1 U(gk) + ξk+1 ξk 2 + h2

(1 γh)2 U(gk) 2)

1 2I1 + II5 + II6 + III4

L 1 1 γh h2(1 γh)) Tgk Lgk 1L U(gk) Tgk 1Lgk 1 1 U(gk 1) 2 I2 + II3

4(1 γh) ( U(gk) 2 U(gk 1) 2) Additional term

2(1 γh)p(a) U(gk) ξk+1 extra term from curvature

1 2I1 + III2 + III3

2(1 γh) Tgk Lgk 1 U(gk) Tgk 1Lgk 1 1 U(gk 1),Tgk Lgk 1 U(gk) IV1

2(1 γh)2 Tgk Lgk 1 U(gk) 2 IV2

LNAG-SC k+1 LNAG-SC k

γh 2(1 γh) ( 1 1 γh ( Tgk Lgk 1 U(gk),log g 1 gk h2 U(gk) 2) + ξk+1 2)

II1 + III1 + IV2

2 Tgk Lgk 1 U(gk) Tgk 1Lgk 1 1 U(gk 1),log g 1 k 1gk II2

2 2 γh 1 γh Tgk Lgk 1 U(gk) Tgk 1Lgk 1 1 U(gk 1),Tgk 1Lgk 1 1 U(gk 1) II4 + IV1

L 1 1 γh h2(1 γh)) Tgk Lgk 1 U(gk) Tgk 1Lgk 1 1 U(gk 1) 2 I2 + II3

4 2 γh 1 γh ( U(gk) 2 U(gk 1) 2) Additional term

2(1 γh)p(a) U(gk) ξk+1 extra term from curvature

II4 + IV1 + Additional term = h2

4 2 γh 1 γh Tgk Lgk 1 U(gk) Tgk 1Lgk 1 1 U(gk 1) 2

(II4 + IV1 + Additional term) + (I2 + II3)

2 (1 γh + 1 1 γh 1 1 γh 1 Lh2 ) Tgk Lgk 1 U(gk) Tgk 1Lgk 1 1 U(gk 1) 2

LNAG-SC k+1 LNAG-SC k

γh 2(1 γh) ( 1 1 γh ( Tgk Lgk 1 U(gk),log g 1 gk h2 U(gk) 2) + ξk+1 )

2 Tgk Lgk 1 U(gk) Tgk 1Lgk 1 1 U(gk 1),log g 1 k 1gk

2 (1 γh + 1 1 γh 1 1 γh 1 Lh2 ) Tgk Lgk 1 U(gk) Tgk 1Lgk 1 1 U(gk 1) 2

2(1 γh)p(a) Tgk Lgk 1 U(gk) ξk+1 extra term from curvature

By the property of L-smoothness in Eq. (22), we have

Tgk Lgk 1 U(gk) Tgk 1Lgk 1 1 U(gk 1),log g 1 k 1gk 1

L Tgk Lgk 1 U(gk) Tgk 1Lgk 1 1 U(gk 1) 2

LNAG-SC k+1 LNAG-SC k γh 2(1 γh) ( 1 1 γh ( Tgk Lgk 1 U(gk),log g 1 gk h2 U(gk) 2) + ξk+1 2)

2 (1 γh + 1 1 γh)( 1

L h2) Tgk Lgk 1 U(gk) Tgk 1Lgk 1 1 U(gk 1) 2

2(1 γh)p(a) U(gk) ξk+1

2L, together with Lemma 33 and Cauchy-Schwarz inequality, we have

LNAG-SC k+1 LNAG-SC k

γh 2(1 γh) ( 1 1 γh ( Tgk Lgk 1 U(gk),log g 1 gk h2 U(gk) 2) + ξk+1 2)

2(1 γh)p(a) U(gk) ξk+1

= γh 2(1 γh) ( 1 1 γh ( Tgk Lgk 1 U(gk),log g 1 gk h2 U(gk) 2) + ξk+1 2)

4(1 γh)p(a)(λ U(gk) 2 + λ 1 ξk+1 2)

where λ > 0 is a parameter to be chosen later. Using property of L-smoothness in Eq. (21), (22) and property of µ-strong convexity in Eq. (24), we have for any λ1 0 and λ2 1 satisfying λ1 + λ2 1,

γh (LNAG-SC k+1 LNAG-SC k )

Tgk Lgk 1 U(gk),log g 1 gk

+ h2 U(gk) 2 (1 γh) ξk+1 2 + h(1 γh)λp(a)

2 U(gk) 2 + h(1 γh)λ 1p(a)

(2 λ1 λ2)(U(gk) U(g )) λ1

2L U(gk) 2 µλ2

+ h2 U(gk) 2 (1 γh) ξk+1 2 + h(1 γh)λp(a)

2 U(gk) 2 + h(1 γh)λ 1p(a)

(2 λ1 λ2)(U(gk) U(g )) + (h(1 γh)λp(a)

2L) U(gk) 2 µλ2

(1 γh)(1 hλ 1p(a)

2 ) ξk+1 2 (37)

Now we try to upper bound LNAG-SC k . Using Cauchy-Schwarz inequality,

ξk + 2γ 1 γh log g 1 gk + h Tgk 1Lgk 1 1 U(gk 1)

3[ ξk 2 + ( 2γ 1 γh)

2 d2(g ,gk) + h2 U(gk 1) 2]

3[ ξk 2 + ( 2γ 1 γh)

2 (d(g ,gk) + h ξk )2 + h2 U(gk 1) 2]

3[(1 + 2( 2γh

2 ) ξk 2 + 2( 2γ 1 γh)

2 d2(g ,gk) + h2 U(gk 1) 2]

As a result,

1 1 γh (U(gk 1) U(g )) + (1

(1 γh)2 ) ξk 2

(1 γh)2 d2(g ,gk) + (3

4(1 γh))h2 U(gk 1) 2

= 1 1 γh (U(gk 1) U(g )) + 1 2γh + 7γ2h2

(1 γh)2 ξk 2 + 6γ2

(1 γh)2 d2(g ,gk) + 1 2γh

4(1 γh)h2 U(gk 1) 2

( 1 1 γh + 1 2γh

2L)(U(gk 1) U(g )) + 1 2γh + 7γ2h2

4(1 γh)2 ξk 2 + 3γ2

2(1 γh)2 d2(g ,gk)

5 2γh 4(1 γh) (U(gk 1) U(g )) + 1 2γh + 7γ2h2

(1 γh)2 ξk 2 + 3γ2

2(1 γh)2 d2(g ,gk)

which is the same as

2(1 γh)LNAG-SC k

2 (U(gk 1) U(g )) + 2(1 2γh + 7γ2h2)

1 γh ξk 2 + 3γ2

1 γhd2(g ,gk) (38)

h(1 γh)λp(a)

As a result, by comparing the parameters in Eq. (38) and (37), we have the contraction rate is

b NAG-SC γh 1 γh min{2(2 λ1 λ2)

5 2γh ,(1 hλ 1p(a)

2 ) (1 γh)2

2(1 2γh + 7γ2h2), µλ2(1 γh)

= γhmin{ 2(2 λ1 λ2) (5 2γh)(1 γh),(1 hλ 1p(a)

2 ) 1 γh 2(1 2γh + 7γ2h2), µλ2

Now we try to give a lower bound for the contraction rate c upon a set of parameters (λ,λ1,λ2). By assuming γh 1

b NAG-SC γhmin{2(2 λ1 λ2), 1

2 (1 hλ 1p(a)

λ1 = 2L(hλp(a)

to make Eq. (39) satisfied. By assuming h 1 2L, we choose

λ2 = 1 hλp(a)

to make 2(2 λ1 λ2) µλ2

6γ2 . Then we have

b NAG-SC γhmin{1

2 (1 hλ 1p(a)

2 ), µ 6(γ2 + µ) (1 hλp(a))}

2 (1 hλ 1p(a)

30 (1 hλp(a))}

30 (1 hp(a))}

by choosing γ = 2 µ, same as continuous case and simply choose λ = 1.

b NAG-SC 2 µmin{1

4h(2 hp(a)), 1

30h(1 hp(a))}

Setting h = min{ 1

2L, 1 2p(a)}, we have

b NAG-SC 2 µ h

2L , 1 2p(a)}

which gives us the desired result.

Corollary 34. If the iteration for NAG-SC is initialized by g0 S0, with (g0,ξ0) satisfying

ENAG-SC(g0,ξ0) (1 γh)u for some u, with the u sub-level set of U satisfying Eq. (15). Then we have gk S0 for any k for the NAG-SC scheme Eq. (13) when h

Proof of Cor. 34. First we prove LNAG-SC 1/4 ξk 2. By L-smoothness and geodesic convexity, Lemma 23 gives

U(g) U(g ) 1

L U(g) 2 g S

As a result, when h

1 2L, we have 1 L(1 γh) h2(2 γh)

2(1 γh) , leading to LNAG-SC(g,ξ) 1

Now we are ready to prove this corollary by induction. Suppose we have gk 1 S0. By the monotonicity of the Lyapunov function, LNAG-SC(gk,ξk) LNAG-SC(g0,ξ0). As a result,

4LNAG-SC k 2

LNAG-SC 0 Since we have gk = gk 1 exp(hξk), we have d(gk,gk 1) h ξk . Together with the condition that U(gk 1) (1 γh)LNAG-SC k (1 γh)u, we have gk S0. Mathematical induction gives the desired result.

Proof of Thm. 14. This is the direct corollary from Lemma 33 and Cor. 34 when defining c NAG-SC = (1 + b NAG-SC) 1, with b NAG-SC is given in Eq. (34).

Remark 35 (Why Lie NAG-SC losses convergence rate comparing to the Euclidean case). In order to utilize the extra term h(Tgk Lgk 1 U(gk) Tgk 1Lgk 1 1 U(gk 1)) in Eq. (13), Lyapunov function Eq. (14) has the term ξk + γ 1 γh log g 1 gk+1 + h U(gk), and the term log g 1 gk+1 log g 1 gk,Tgk Lgk 1 U(gk) needs to be quantified, consequently. However, due to the non-linearity of the Lie group, we have to make the assumption that gk+1 and gk are close to g , leading to the space is nearly linear , so that we can bound the error from the non-linearity of the space using Cor. 18. Please see the proof of Lemma 33, Eq. (36) for more details.

However, such loss of convergence rate due to curved spaces is also observed in some of the best results so far [33, 1].

For heavy-ball scheme, the design of the Lyapunov function only has the term γ 1 γh log g 1 gk + ξk, and we can use the properties Eq. (16) and (17) to quantify LHB k+1 LHB k .

D.1 Modified energy for NAG-SC

Remark 36 (Why we cannot have an modified energy for NAG-SC). An modified energy for Lie NAG-SC is provided in Eq. (41), whose monotonicity is shown in Thm. 37. The modified energy is global defined and required only L-smoothness. However, the failure for this energy function is because its monotonicity requires step size O ( µ

L ), which is smaller than the step size O ( 1

L) that provides acceleration.

Comparing with the Heavy-Ball scheme, the larger step size and the acceleration of NAG-SC come from extra term (1 γh)h(Tgk Lgk 1 U(gk) Tgk 1Lgk 1 1 U(gk 1)), which is closely related to co-coercivity in Lemma 23. However, co-coercivity requires (local) geodesic convexity and Lsmoothness at the same time, which is not available when we are considering the convergence globally.

The update for ξ in NAG-SC Eq. (13) can be also written as 1 1 γh (ξk + h Tgk Lgk 1 U(gk)) = (ξk 1 + h Tgk 1Lgk 1 1 U(gk 1)) h Tgk Lgk 1 U(gk) (40)

This inspires us to define the following modified energy:

ENAG-SC(g,ξ) = U(g) + (1 γh)2

2(1 + γh γ2h2) ξ + h Tg Lg 1 U(g) 2 (41)

Theorem 37 (Monotonely decreasing of modified energy of NAG-SC). Assume the potential U is globally L-smooth. When h min{ 1

ENAG-SC(gk,ξk) ENAG-SC(gk 1,ξk 1) 0

where the modified energy ENAG-SC is defined in Eq. (41).

Proof of Thm. 37. Given L-smoothness of U, we have

ENAG-SC(gk,ξk) ENAG-SC(gk 1,ξk 1)

= U(gk) U(gk 1) + (1 γh)2

2(1 + γh γ2h2) ( ξk + h Tgk 1Lgk 1 1 U(gk 1) 2 ξk 1 + h Tgk 2Lgk 2 1 U(gk 2) 2)

= U(gk) U(gk 1) + (1 γh)2

2(1 + γh γ2h2) ξk + h Tgk 1Lgk 1 1 U(gk 1) 2

2(1 + γh γ2h2) 1 1 γhξk + h(2 γh)

1 γh Tgk 1Lgk 1 1 U(gk 1))

= U(gk) U(gk 1) + (1 γh)2

2(1 + γh γ2h2) ξk + h Tgk 1Lgk 1 1 U(gk 1) 2

1 2(1 + γh γ2h2) ξk + h(2 γh)Tgk 1Lgk 1 1 U(gk 1)) 2

= U(gk) U(gk 1) γh(2 γh) 2(1 + γh γ2h2) ξk 2 h ξk,Tgk 1Lgk 1 1 U(gk 1)

h2(3 2γh) 2(1 + γh γ2h2) U(gk 1) 2

By L-smoothness, U(gk) U(gk 1) h ξk,Tgk 1Lgk 1 1 U(gk 1) Lh2

ENAG-SC(gk,ξk) ENAG-SC(gk 1,ξk 1)

2 ξk 2 γh(2 γh) 2(1 + γh γ2h2) ξk 2 h2(3 2γh) 2(1 + γh γ2h2) U(gk 1) 2

Consequently, a sufficient condition for ENAG-SC(gk,ξk) ENAG-SC(gk 1,ξk 1) 0 can be given by

2 γh(2 γh) 2(1 + γh γ2h2)

By assuming γh 1, a sufficient condition for this can be given by h γ 2L, i.e., when

Neur IPS Paper Checklist

Question: Do the main claims made in the abstract and introduction accurately reflect the paper s contributions and scope?

Answer: [Yes]

Justification: The abstract and introduction stated the contributions, assumptions and limitations.

Guidelines:

The answer NA means that the abstract and introduction do not include the claims made in the paper. The abstract and/or introduction should clearly state the claims made, including the contributions made in the paper and important assumptions and limitations. A No or NA answer to this question will not be perceived well by the reviewers. The claims made should match theoretical and experimental results, and reflect how much the results can be expected to generalize to other settings. It is fine to include aspirational goals as motivation as long as it is clear that these goals are not attained by the paper.

2. Limitations

Question: Does the paper discuss the limitations of the work performed by the authors?

Answer: [Yes]

Justification: The section Related work contains our limitations, comparing with existing results.

Guidelines:

The answer NA means that the paper has no limitation while the answer No means that the paper has limitations, but those are not discussed in the paper. The authors are encouraged to create a separate "Limitations" section in their paper. The paper should point out any strong assumptions and how robust the results are to violations of these assumptions (e.g., independence assumptions, noiseless settings, model well-specification, asymptotic approximations only holding locally). The authors should reflect on how these assumptions might be violated in practice and what the implications would be. The authors should reflect on the scope of the claims made, e.g., if the approach was only tested on a few datasets or with a few runs. In general, empirical results often depend on implicit assumptions, which should be articulated. The authors should reflect on the factors that influence the performance of the approach. For example, a facial recognition algorithm may perform poorly when image resolution is low or images are taken in low lighting. Or a speech-to-text system might not be used reliably to provide closed captions for online lectures because it fails to handle technical jargon. The authors should discuss the computational efficiency of the proposed algorithms and how they scale with dataset size. If applicable, the authors should discuss possible limitations of their approach to address problems of privacy and fairness. While the authors might fear that complete honesty about limitations might be used by reviewers as grounds for rejection, a worse outcome might be that reviewers discover limitations that aren t acknowledged in the paper. The authors should use their best judgment and recognize that individual actions in favor of transparency play an important role in developing norms that preserve the integrity of the community. Reviewers will be specifically instructed to not penalize honesty concerning limitations.

3. Theory Assumptions and Proofs

Question: For each theoretical result, does the paper provide the full set of assumptions and a complete (and correct) proof?

Answer: [Yes] Justification: We make assumtption of compactness (Assumption 2), L-smoothness (Def. 4) and local geodesic-strong convexity (Def. 5) Guidelines:

The answer NA means that the paper does not include theoretical results. All the theorems, formulas, and proofs in the paper should be numbered and crossreferenced. All assumptions should be clearly stated or referenced in the statement of any theorems. The proofs can either appear in the main paper or the supplemental material, but if they appear in the supplemental material, the authors are encouraged to provide a short proof sketch to provide intuition. Inversely, any informal proof provided in the core of the paper should be complemented by formal proofs provided in appendix or supplemental material. Theorems and Lemmas that the proof relies upon should be properly referenced. 4. Experimental Result Reproducibility

Question: Does the paper fully disclose all the information needed to reproduce the main experimental results of the paper to the extent that it affects the main claims and/or conclusions of the paper (regardless of whether the code and data are provided or not)? Answer: [Yes] Justification: Details for numerical experiment are included in Sec. 6. Code is provided in supplementary materials. Guidelines:

The answer NA means that the paper does not include experiments. If the paper includes experiments, a No answer to this question will not be perceived well by the reviewers: Making the paper reproducible is important, regardless of whether the code and data are provided or not. If the contribution is a dataset and/or model, the authors should describe the steps taken to make their results reproducible or verifiable. Depending on the contribution, reproducibility can be accomplished in various ways. For example, if the contribution is a novel architecture, describing the architecture fully might suffice, or if the contribution is a specific model and empirical evaluation, it may be necessary to either make it possible for others to replicate the model with the same dataset, or provide access to the model. In general. releasing code and data is often one good way to accomplish this, but reproducibility can also be provided via detailed instructions for how to replicate the results, access to a hosted model (e.g., in the case of a large language model), releasing of a model checkpoint, or other means that are appropriate to the research performed. While Neur IPS does not require releasing code, the conference does require all submissions to provide some reasonable avenue for reproducibility, which may depend on the nature of the contribution. For example (a) If the contribution is primarily a new algorithm, the paper should make it clear how to reproduce that algorithm. (b) If the contribution is primarily a new model architecture, the paper should describe the architecture clearly and fully. (c) If the contribution is a new model (e.g., a large language model), then there should either be a way to access this model for reproducing the results or a way to reproduce the model (e.g., with an open-source dataset or instructions for how to construct the dataset). (d) We recognize that reproducibility may be tricky in some cases, in which case authors are welcome to describe the particular way they provide for reproducibility. In the case of closed-source models, it may be that access to the model is limited in some way (e.g., to registered users), but it should be possible for other researchers to have some path to reproducing or verifying the results. 5. Open access to data and code

Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material? Answer: [Yes] Justification: Code will be released. Guidelines:

The answer NA means that paper does not include experiments requiring code. Please see the Neur IPS code and data submission guidelines (https://nips.cc/ public/guides/Code Submission Policy) for more details. While we encourage the release of code and data, we understand that this might not be possible, so No is an acceptable answer. Papers cannot be rejected simply for not including code, unless this is central to the contribution (e.g., for a new open-source benchmark). The instructions should contain the exact command and environment needed to run to reproduce the results. See the Neur IPS code and data submission guidelines (https: //nips.cc/public/guides/Code Submission Policy) for more details. The authors should provide instructions on data access and preparation, including how to access the raw data, preprocessed data, intermediate data, and generated data, etc. The authors should provide scripts to reproduce all experimental results for the new proposed method and baselines. If only a subset of experiments are reproducible, they should state which ones are omitted from the script and why. At submission time, to preserve anonymity, the authors should release anonymized versions (if applicable). Providing as much information as possible in supplemental material (appended to the paper) is recommended, but including URLs to data and code is permitted. 6. Experimental Setting/Details

Question: Does the paper specify all the training and test details (e.g., data splits, hyperparameters, how they were chosen, type of optimizer, etc.) necessary to understand the results? Answer: [Yes] Justification: Details for numerical experiment are included in Sec. 6. Guidelines:

The answer NA means that the paper does not include experiments. The experimental setting should be presented in the core of the paper to a level of detail that is necessary to appreciate the results and make sense of them. The full details can be provided either with the code, in appendix, or as supplemental material. 7. Experiment Statistical Significance

Question: Does the paper report error bars suitably and correctly defined or other appropriate information about the statistical significance of the experiments? Answer: [NA] Justification: No statistical error bar is included. Guidelines:

The answer NA means that the paper does not include experiments. The authors should answer "Yes" if the results are accompanied by error bars, confidence intervals, or statistical significance tests, at least for the experiments that support the main claims of the paper. The factors of variability that the error bars are capturing should be clearly stated (for example, train/test split, initialization, random drawing of some parameter, or overall run with given experimental conditions). The method for calculating the error bars should be explained (closed form formula, call to a library function, bootstrap, etc.)

The assumptions made should be given (e.g., Normally distributed errors). It should be clear whether the error bar is the standard deviation or the standard error of the mean. It is OK to report 1-sigma error bars, but one should state it. The authors should preferably report a 2-sigma error bar than state that they have a 96% CI, if the hypothesis of Normality of errors is not verified. For asymmetric distributions, the authors should be careful not to show in tables or figures symmetric error bars that would yield results that are out of range (e.g. negative error rates). If error bars are reported in tables or plots, The authors should explain in the text how they were calculated and reference the corresponding figures or tables in the text.

8. Experiments Compute Resources

Question: For each experiment, does the paper provide sufficient information on the computer resources (type of compute workers, memory, time of execution) needed to reproduce the experiments?

Answer: [Yes]

Justification: Please see Sec. 6.

Guidelines:

The answer NA means that the paper does not include experiments. The paper should indicate the type of compute workers CPU or GPU, internal cluster, or cloud provider, including relevant memory and storage. The paper should provide the amount of compute required for each of the individual experimental runs as well as estimate the total compute. The paper should disclose whether the full research project required more compute than the experiments reported in the paper (e.g., preliminary or failed experiments that didn t make it into the paper).

9. Code Of Ethics

Question: Does the research conducted in the paper conform, in every respect, with the Neur IPS Code of Ethics https://neurips.cc/public/Ethics Guidelines?

Answer: [Yes]

Justification: We confirm we follow the Neur IPS Code of Ethics.

Guidelines:

The answer NA means that the authors have not reviewed the Neur IPS Code of Ethics. If the authors answer No, they should explain the special circumstances that require a deviation from the Code of Ethics. The authors should make sure to preserve anonymity (e.g., if there is a special consideration due to laws or regulations in their jurisdiction).

10. Broader Impacts

Question: Does the paper discuss both potential positive societal impacts and negative societal impacts of the work performed?

Answer: [NA]

Justification: This paper focuses on theoretical aspect.

Guidelines:

The answer NA means that there is no societal impact of the work performed. If the authors answer NA or No, they should explain why their work has no societal impact or why the paper does not address societal impact. Examples of negative societal impacts include potential malicious or unintended uses (e.g., disinformation, generating fake profiles, surveillance), fairness considerations (e.g., deployment of technologies that could make decisions that unfairly impact specific groups), privacy considerations, and security considerations.

The conference expects that many papers will be foundational research and not tied to particular applications, let alone deployments. However, if there is a direct path to any negative applications, the authors should point it out. For example, it is legitimate to point out that an improvement in the quality of generative models could be used to generate deepfakes for disinformation. On the other hand, it is not needed to point out that a generic algorithm for optimizing neural networks could enable people to train models that generate Deepfakes faster. The authors should consider possible harms that could arise when the technology is being used as intended and functioning correctly, harms that could arise when the technology is being used as intended but gives incorrect results, and harms following from (intentional or unintentional) misuse of the technology. If there are negative societal impacts, the authors could also discuss possible mitigation strategies (e.g., gated release of models, providing defenses in addition to attacks, mechanisms for monitoring misuse, mechanisms to monitor how a system learns from feedback over time, improving the efficiency and accessibility of ML).

11. Safeguards

Question: Does the paper describe safeguards that have been put in place for responsible release of data or models that have a high risk for misuse (e.g., pretrained language models, image generators, or scraped datasets)?

Answer: [NA]

Justification: This paper focuses on theoretical aspect.

Guidelines:

The answer NA means that the paper poses no such risks. Released models that have a high risk for misuse or dual-use should be released with necessary safeguards to allow for controlled use of the model, for example by requiring that users adhere to usage guidelines or restrictions to access the model or implementing safety filters. Datasets that have been scraped from the Internet could pose safety risks. The authors should describe how they avoided releasing unsafe images. We recognize that providing effective safeguards is challenging, and many papers do not require this, but we encourage authors to take this into account and make a best faith effort.

12. Licenses for existing assets

Question: Are the creators or original owners of assets (e.g., code, data, models), used in the paper, properly credited and are the license and terms of use explicitly mentioned and properly respected?

Answer: [NA]

Justification: No existing assets are used.

Guidelines:

The answer NA means that the paper does not use existing assets. The authors should cite the original paper that produced the code package or dataset. The authors should state which version of the asset is used and, if possible, include a URL. The name of the license (e.g., CC-BY 4.0) should be included for each asset. For scraped data from a particular source (e.g., website), the copyright and terms of service of that source should be provided. If assets are released, the license, copyright information, and terms of use in the package should be provided. For popular datasets, paperswithcode.com/datasets has curated licenses for some datasets. Their licensing guide can help determine the license of a dataset. For existing datasets that are re-packaged, both the original license and the license of the derived asset (if it has changed) should be provided.

If this information is not available online, the authors are encouraged to reach out to the asset s creators. 13. New Assets

Question: Are new assets introduced in the paper well documented and is the documentation provided alongside the assets? Answer: [NA] Justification: No new assets are introduced. Guidelines:

The answer NA means that the paper does not release new assets. Researchers should communicate the details of the dataset/code/model as part of their submissions via structured templates. This includes details about training, license, limitations, etc. The paper should discuss whether and how consent was obtained from people whose asset is used. At submission time, remember to anonymize your assets (if applicable). You can either create an anonymized URL or include an anonymized zip file. 14. Crowdsourcing and Research with Human Subjects

Question: For crowdsourcing experiments and research with human subjects, does the paper include the full text of instructions given to participants and screenshots, if applicable, as well as details about compensation (if any)? Answer: [NA] Justification: No human subject is involved. Guidelines:

The answer NA means that the paper does not involve crowdsourcing nor research with human subjects. Including this information in the supplemental material is fine, but if the main contribution of the paper involves human subjects, then as much detail as possible should be included in the main paper. According to the Neur IPS Code of Ethics, workers involved in data collection, curation, or other labor should be paid at least the minimum wage in the country of the data collector. 15. Institutional Review Board (IRB) Approvals or Equivalent for Research with Human Subjects Question: Does the paper describe potential risks incurred by study participants, whether such risks were disclosed to the subjects, and whether Institutional Review Board (IRB) approvals (or an equivalent approval/review based on the requirements of your country or institution) were obtained? Answer: [NA] Justification: No human subject is involved. Guidelines:

The answer NA means that the paper does not involve crowdsourcing nor research with human subjects. Depending on the country in which research is conducted, IRB approval (or equivalent) may be required for any human subjects research. If you obtained IRB approval, you should clearly state this in the paper. We recognize that the procedures for this may vary significantly between institutions and locations, and we expect authors to adhere to the Neur IPS Code of Ethics and the guidelines for their institution. For initial submissions, do not include any information that would break anonymity (if applicable), such as the institution conducting the review.