# Learning-Rate-Free Stochastic Optimization over Riemannian Manifolds

Daniel Dodd 1, Louis Sharrock 1, Christopher Nemeth 1

1 Department of Mathematics and Statistics, Lancaster University, UK. Correspondence to: Daniel Dodd.

Proceedings of the 41st International Conference on Machine Learning, Vienna, Austria. PMLR 235, 2024. Copyright 2024 by the author(s).

Abstract

In recent years, interest in gradient-based optimization over Riemannian manifolds has surged. However, a significant challenge lies in the reliance on hyperparameters, especially the learning rate, which requires meticulous tuning by practitioners to ensure convergence at a suitable rate. In this work, we introduce innovative learning-rate-free algorithms for stochastic optimization over Riemannian manifolds, eliminating the need for hand-tuning and providing a more robust and user-friendly approach. We establish high probability convergence guarantees that are optimal, up to logarithmic factors, compared to the best-known optimally tuned rate in the deterministic setting. Our approach is validated through numerical experiments, demonstrating competitive performance against learning-rate-dependent algorithms.

1. Introduction

We study Riemannian optimization problems of the form

$$\min_{x \in \mathcal{M}} f(x), \qquad (1)$$

where $f$ is a geodesically convex function and $\mathcal{M}$ is a Riemannian manifold. In recent years, there has been a growing interest within the machine learning community in addressing optimization challenges on such geometric spaces. These problems arise in diverse applications, including principal component analysis (Edelman et al., 1998), dictionary learning (Sun et al., 2017), low-rank matrix completion (Boumal & Absil, 2011), tensor factorization (Ishteva et al., 2011), Gaussian mixture models (Hosseini & Sra, 2015), and metric learning (Zadeh et al., 2016).

One of the prominent hurdles in applying Riemannian gradient-based optimization is the requirement for careful tuning of the learning rate or step size parameter. Selecting an appropriate learning rate is imperative to the algorithm's performance, impacting the convergence rate, final solution quality, and overall algorithm stability. To illustrate, Figure 1 showcases the impact of inadequate learning rates on the convergence of Riemannian stochastic gradient descent (RSGD, Bonnabel, 2013).

Figure 1. Rayleigh quotient maximization on the unit sphere. Our algorithm, RDoG, converges without tuning, while RSGD shows sensitivity to the learning rate, leading to (a) overshooting when the learning rate is too large or (b) slow convergence when it is too small.

Recently, the expense and lack of robustness associated with learning rate tuning have spurred substantial research into learning-rate-free methods for Euclidean optimization. These aim to automate tuning by crafting algorithms that achieve near-optimal convergence rates with minimal knowledge of the function's properties and without any tuning parameters. Notable examples include online learning schemes such as coin betting (Orabona & Pál, 2016) and exponentiated gradients (McMahan & Orabona, 2014), as well as bisection subroutines (Carmon & Hinder, 2022). Our paper addresses the absence of comparable tools for Riemannian optimization with the first comprehensive study of learning-rate-free algorithms in this setting.
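To make the sensitivity illustrated in Figure 1 concrete, the sketch below runs RSGD with a fixed learning rate on a small Rayleigh quotient problem on the unit sphere. It is only an illustrative sketch: the matrix `A`, the initial point, and the learning rate `eta` are placeholder choices, not the configuration used in our experiments.

```python
# Illustrative sketch (not the authors' experiment code): RSGD for Rayleigh
# quotient maximization on the unit sphere S^{d-1}, i.e. minimizing
# f(x) = -x^T A x / 2. The matrix A, initial point, and learning rate eta are
# placeholder choices used only to mimic the setting of Figure 1.
import jax
import jax.numpy as jnp

def sphere_exp(x, v):
    """Exponential map on the unit sphere: move along the great circle from x in direction v."""
    nv = jnp.linalg.norm(v)
    safe = jnp.maximum(nv, 1e-12)
    return jnp.cos(nv) * x + jnp.sin(nv) * v / safe

def sphere_rgrad(euclid_grad, x):
    """Riemannian gradient: project the Euclidean gradient onto the tangent space at x."""
    return euclid_grad - jnp.dot(euclid_grad, x) * x

def rsgd_rayleigh(A, x0, eta, num_iters=500):
    """RSGD with a constant step size eta: x_{t+1} = exp_{x_t}(-eta * grad f(x_t))."""
    f = lambda x: -0.5 * x @ A @ x
    x = x0
    for _ in range(num_iters):
        g = sphere_rgrad(jax.grad(f)(x), x)
        x = sphere_exp(x, -eta * g)
    return x

key = jax.random.PRNGKey(0)
d, q = 3, 5
B = jax.random.normal(key, (d, q))
A = B @ B.T / d
x0 = jax.random.normal(jax.random.PRNGKey(1), (d,))
x0 = x0 / jnp.linalg.norm(x0)
x_hat = rsgd_rayleigh(A, x0, eta=0.1)
```

Varying `eta` over several orders of magnitude reproduces the overshooting and slow-convergence regimes of Figure 1; the algorithms introduced below remove this choice entirely.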
Contributions: Building upon the recently proposed Distance over Gradients (Do G, Ivgi et al., 2023) and Distance over Weighted Gradients (Do WG, Khaled et al., 2023) Euclidean optimization approaches, we introduce dynamic learning-rate-scheduler algorithms for stochastic Riemannian optimization. Our results establish high probability convergence guarantees, achieving optimal convergence rates with logarithmic factors in smooth and Lipschitz settings, rendering them a robust solution for geodesically convex stochastic optimization on Riemannian manifolds. Learning-Rate-Free Stochastic Optimization over Riemannian Manifolds 2. Preliminaries 2.1. Riemannian Geometry In this section, we recall some fundamental definitions from Riemannian geometry (e.g., Petersen, 2006; Lee, 2012; Boumal, 2023). Riemannian manifold, tangent space, metric. A Riemannian manifold M is a smooth, locally Euclidean space. At each point x on M, there is a corresponding tangent space Tx M representing all possible tangential directions, endowed with a smoothly varying inner product , x : Tx Tx R termed the Riemannian metric, that induces a norm x = p , x. The metric measures angles, curve lengths, surface areas, and volumes locally, with global quantities obtained by integrating these contributions. Geodesics and distances. The length of a curve c : [0, 1] 7 c(t) M is L(c) = R 1 0 c (t) c(t)dt. Generalizing straight lines leads to geodesics, constant speed curves representing the shortest path between points x and y on the manifold: γ = arg minc L(c) with γ(0) = x, γ(1) = y, and γ (t) γ(t) = 1, establishing a metric space structure with geodesic distance d(x, y) = infc L(c). Exponential maps. The concept of moving along a straight curve with constant velocity is given by the exponential map. As such, for any point x on M, and any tangent vector v Tx M, there is a unique unit speed geodesic γ satisfying γ(0) = x and γ (0) = v. The corresponding exponential map expx : Tx M M is defined as expx(v) = γ(1). When expx is well-defined on Tx M for all x M, the geodesic distance d(x, y) is given by exp 1 x (y) x. Parallel transport. Parallel transport Γy x : Tx M Ty M provides a means to move tangent vectors from one tangent space to another while preserving their norm, and roughly speaking, direction, analogous to translation in Euclidean space. Curvature. The curvature of a Riemannian manifold is determined by its metric at each point. The sectional curvature at a point x on the manifold is the Gauss curvature of a two-dimensional submanifold formed as the image of a two-dimensional subspace of the tangent space Tx M under the exponential map. Trigonometric bound. The law of cosines in Euclidean space is fundamental for analyzing optimization algorithms, a2 = b2 + c2 2bc cos(A), where a, b, c are the sides of a Euclidean triangle with A the angle between sides b and c. Trigonometric geometry behaves differently in manifolds compared to Euclidean spaces. While the equality does not hold for nonlinear spaces, a trigonometric distance bound can be established for manifolds with sectional curvature bounded below. Lemma 2.1. (Zhang & Sra, 2016, Lemma 5) Suppose a, b, c are the side lengths of a geodesic triangle in a Riemannian manifold with sectional curvature lower bounded by κ > and A is the angle between sides b and c (defined through the inverse exponential map and inner product in tangent space). Then a2 ζκ(c)b2 + c2 2bc cos(A), where ζκ : R+ R is the geometric curvature function |κ| d , if κ < 0, Proof. 
Given by Lemma 3.12 of (Cordero-Erausquin et al., 2001) and by Lemma 5 of (Zhang & Sra, 2016). 2.2. Function Classes Geodesic convexity. M is geodesically convex if every two points are connected by a geodesic. A function f : M R is geodesically convex if, for any geodesic γ M, f(γ(t)) (1 t)f(γ(0)) + tf(γ(1)), t [0, 1]. Equivalently, M is geodesically convex if, for any x, y M, there exists a tangent vector f(x) Tx M such that f(y) f(x) + f(x), exp 1 x (y) x, where f(x) is a Riemannian subgradient of f at x. When f is differentiable, { f(x)} = grad f(x), the Riemannian gradient of f at x, defined as the tangent vector in Tx M satisfying grad f(x), v x = df(x)[v], where df(x): Tx M R denotes the differential of f at x. Geodesic Lipschitz. A function f : M R is said to be geodesically L-Lipschitz if, for all x, y M, there exists a constant L > 0 such that, |f(y) f(x)| L exp 1 x (y) x. When f is differentiable, the geodesically L-Lipschitzness is equivalent to grad f(x) x L for all x M. Geodesic smoothness. A differentiable function f is geodesically S-smooth if its gradient is geodesically SLipschitz. That is, if for all x, y M, grad f(x) Γx y grad f(y) x S exp 1 x (y) x, where Γx y is the parallel transport from y to x. One can show in this case that f(y) f(x) + grad f(x), exp 1 x (y) x + S 2 exp 1 x (y) 2 x. Learning-Rate-Free Stochastic Optimization over Riemannian Manifolds 3. Algorithms and Theory We are interested in solving optimization problems of the form (1). We proceed under the following standard regularity conditions (Zhang & Sra, 2016; Alimisis et al., 2020a). Assumption 3.1 (Geodesic convexity). The geodesically convex function f : M R attains its minimum at x within its closed and geodesically convex domain M, which includes a well-defined exponential map. Assumption 3.2 (Lower bounded sectional curvature). M exhibits sectional curvature bounded from below: κ > . To minimize f, we will assume access to a stochastic gradient oracle G. When queried at x M, the oracle returns a stochastic (sub)gradient estimator G(x) which satisfies E [G(x)|x] f(x). In a slight abuse of notation, we will henceforth write grad f(x) := E [G(x)|x]. We consider the following additional assumptions. Assumption 3.3 (Locally bounded stochastic gradients). There exists some continuous function ℓ: M R+ such that G(x) x ℓ(x) almost surely. Assumption 3.4 (Locally smooth stochastic gradients). There exists some continuous function s: M R+ such that G(x) Γx y G(y) x s(x) exp 1 x (y) x, almost surely. Assumption 3.3 corresponds to the Riemannian analog of Ivgi et al. (2023) s locally bounded gradient assumption, whereas Assumption 3.4 introduces a novel condition. 3.1. Riemannian Stochastic Gradient Descent Our work centers on the Riemannian stochastic gradient descent algorithm (RSGD) introduced by Bonnabel (2013), which from an initial point x0 M iterates the following update rule: xt+1 = expxt( ηtgt). Here t 0 denotes the iteration index, gt := G(xt) represents the stochastic gradient oracle, and ηt > 0 is a userchosen learning rate or step size parameter. Our analysis commences by characterizing the ideal step size in the deterministic gradient setting, an extension of Theorem 9 from Zhang & Sra (2016). Theorem 3.5. Under noiseless conditions and Assumption 3.1, and 3.2, RSGD with a constant step size ηt = η > 0, for a geodesically L-Lipschitz function, satisfies min t T [f(xt) f(x )] " d2 T 2ηT + ηζκ( d T ) PT t=0 gt 2 xt 2T where dt := maxs t ds, ds = d(xs, x ). 
Minimizing this bound with respect to η, gives a convergence rate of with corresponding ideal step size ζκ( d T ) PT t=0 gt 2xt Proof. See Appendix B.1. 3.2. Riemannian Distance Over Gradients In practice, determining the ideal step size η , even in hindsight, is challenging due to its dependence on the unknown maximum distance d T . In this section, we introduce an adaptive algorithm that estimates this whilst attaining the optimal convergence rate up to a logarithmic factor. Learning-rate-free schedule for RSGD. Our key proposal, inspired by Ivgi et al. (2023), is to estimate d T via a proxy, rt := max s t rs, rs := max(d(x0, xs), ϵ), where ϵ > 0 is an initial estimate. Intuitively, the maximum deviation from the starting point should reflect the maximum deviation from the optimum, assuming the RSGD iterations converge to the optimum. Integrating this estimation into the ideal step size, we establish an adaptive sequence of step sizes ζκ( rt) Pt s=0 gs 2xs We term this step size schedule as Riemannian Distance over Gradients (RDo G, Algorithm 1). Observe that the initial step gives a step size of ϵ/ g0 x0, a normalized gradient step of size ϵ. We demonstrate that, provided ϵ is chosen sufficiently small, the specific value is insensitive. Algorithm 1 RDo G Input: initial point x0, initial estimate ϵ > 0, G 1 = 0. for t = 0 to T 1 do gt = G(xt) rt = max (ϵ, maxs t d(xs, x0)) Gt = Gt 1 + ||gt||2 xt ηt = rt ζκ( rt)Gt xt+1 = expxt ( ηtgt) end for Optimality gap bounds assuming bounded iterates. We bound the error of the weighted average sequence xt+1 = exp xt ζκ( rt) Pt s=0 rs/ p ζκ( rs) exp 1 xt (xt) Learning-Rate-Free Stochastic Optimization over Riemannian Manifolds To simplify our analysis, we write log+( ) := 1 + log( ) where the logarithm has a base of e, and introduce the following quantities s=0 gs 2 xs, θt,δ := log 60 log(6t) Our first result establishes a bound on the optimality gap under bounded iterates. Theorem 3.6. Suppose that Assumption 3.1, 3.2, and 3.3 hold. Then, for all δ (0, 1) and L > 0, and for all t T, RDo G (Algorithm 1) satisfies the optimality gap f( xt) f(x ) of (d0 + rt) p ζκ(d0 + rt) q Gt 1 + θt,δGt 1 + θ2 t,δL2 Pt 1 s=0 rs/ with probability at least 1 δ P( ℓT > L), where ℓT := maxs T ℓ(xs). Proof. See Appendix D.2. This theorem yields a corollary for bounded manifolds. Corollary 3.7. Under Assumption 3.1, 3.2, and 3.3, for any D d0, let LD := maxx M:d(x,x0) D ℓ(x). Then, for all δ (0, 1) and for τ arg maxt T Pt 1 s=0 rs/ RDo G (Algorithm 1) satisfies the optimality gap f( xτ) f(x ) of Gτ 1θτ,δ + L2 Dθ2 τ,δ T log+ with probability at least 1 δ P( ℓT > L). Proof. See Appendix D.2. Unlike prior work on bounded domains (e.g., Zhang & Sra, 2016; Wang et al., 2021), our approach adapts without knowledge of the domain width to set the learning rate, achieving optimality up to a logarithmic factor. Remark 3.8. We enhance this result to a high probability convergence guarantee of O(1/T) under uniformly averaged iterates, following Assumption 3.4 in Appendix D.3. Remark 3.9. In Appendix D.4, we ensure bounded iterates with high probability by slightly reducing step sizes. Remark 3.10. Omitting the geometric curvature term ζκ( ) from RDo G s step sizes and weighted averaging results in an additional cost of O( p ζκ(D)) in the optimality gap. Further details are available in Appendix D.5. 3.3. 
Normalized Riemannian Stochastic Gradient Descent We consider extending standard Euclidean normalized gradient descent (Shor, 2012; Levy, 2016; Konnov, 2003; Hazan et al., 2015) to Riemannian manifolds, providing scale-free adaptability, with updates of the form xt+1 = expxt ηt grad f(xt) grad f(xt) xt We term this algorithm Normalized Riemannian Stochastic Gradient Descent (NRSGD). In the deterministic Euclidean setting, normalized gradient descent automatically adjusts to the Lipschitz constant in non-smooth optimization (Nesterov, 2018, Theorem 3.2.2) and the smoothness constant(s) in smooth optimization (Grimmer, 2019, Corollary 2.2). This adaptability extends to NRSGD, as we will demonstrate. Theorem 3.11. Under noiseless conditions and Assumption 3.1, and 3.2, NRSGD with a constant step size ηt = η > 0, for a geodesically L-Lipschitz function, satisfies min t T [f(xt) f(x )] L d2 T 2ηT + η 2ζκ( d T ) . While for a geodesically S-smooth function, we have min t T [f(xt) f(x )] 2S d2 T 2ηT + η 2ζκ( d T ) 2 . Minimizing these give respective convergence rates and O 2S d2 T ζκ( d T ) T with correspond- ing ideal step size Tζκ( d T ) . Proof. See Appendix C.1. Learning-rate-free schedule for NRSGD. Normalization brings adaptivity to Lipschitz and smoothness settings, using a common universal ideal step size . However, like RSGD, this ideal step size relies on the intractable maximum distance quantity d T . Our solution is to substitute this with our proxy rt, resulting in our second algorithm, Normalized Riemannian Distance over Gradients (NRDo G) algorithm, summarized in Algorithm 4 in Appendix F. 3.4. Riemannian Distance Over Weighted Gradients Weighted learning-rate-free schedule for RSGD. We introduce a third algorithm Riemannian Distance over Weighted Gradients (RDo WG, Algorithm 2) that extends the recently proposed Distance over Weighted Gradients Learning-Rate-Free Stochastic Optimization over Riemannian Manifolds (Do WG) (Khaled et al., 2023) to the Riemannian setting. Like RDo G and NRDo G, RDo WG estimates the intractable maximum distance quantity d T by utilizing the maximum distance deviation from the initial point, rt. However, in RDo WG, the normalization is based on the square root of the weighted gradient sum, vt = Pt s=0 r2 s||gs||2 xs, rather than simply the square root of the gradient sum Gt = Pt s=0 ||gs||2 xs. The motivation for this normalization choice, as discussed in Khaled et al. (2023), lies in its improved adaptation to the problem geometry, especially in regions far from the initialization at x0. Specifically, as the distances { rt}t 0 monotonically increase, later gradients receive greater weights than earlier gradients. This choice aligns with the practice in previous Riemannian optimization schemes such as RADAM (Becigneul & Ganea, 2019), which also utilizes weighted gradient sums. However, unlike RADAM, where weights are determined by fixed user-selected hyperparameters, RDo WG adaptively estimates these weights. Algorithm 2 RDo WG Input: initial point x0, initial estimate ϵ > 0, v 1 = 0. for t = 0 to T 1 do gt = G(xt) rt = max (ϵ, maxs t d(xs, x0)) vt = vt 1 + r2 t gt 2 xt ηt = rt ζκ( rt)vt xt+1 = expxt ( ηtgt) end for Optimality gap bounds assuming bounded iterates. We bound the error of the weighted average sequence xt+1 = exp xt r2 t /ζκ( rt) Pt s=0 r2s/ζκ( rs) exp 1 xt (xt) We initiate our analysis in the non-smooth setting before transitioning to the smooth setting. 
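Before turning to the analysis, the following manifold-agnostic sketch (a simplified illustration, not the authors' implementation) shows how the step sizes of Algorithms 1 and 2 are computed. The callables `exp_map`, `dist`, and `grad_oracle` are assumed user-supplied manifold primitives, and the curvature function `zeta` follows the closed form associated with Lemma 2.1 (Zhang & Sra, 2016).

```python
# Sketch of the RDoG / RDoWG step-size schedules (Algorithms 1 and 2), written
# against user-supplied manifold primitives: exp_map(x, v) is the exponential
# map, dist(x, y) the geodesic distance, and grad_oracle(x) returns a stochastic
# (sub)gradient as a tangent vector together with its norm ||g||_x. `kappa` is a
# lower bound on the sectional curvature.
import jax.numpy as jnp

def zeta(kappa, d):
    """Geometric curvature function used in Lemma 2.1 (following Zhang & Sra, 2016):
    sqrt(|kappa|) d / tanh(sqrt(|kappa|) d) for kappa < 0, and 1 for kappa >= 0."""
    if kappa >= 0 or d <= 0.0:
        return 1.0
    s = jnp.sqrt(-kappa) * d
    return float(s / jnp.tanh(s))

def rdog(x0, grad_oracle, exp_map, dist, kappa, eps=1e-6, T=1000, weighted=False):
    """RDoG step sizes eta_t = r_t / sqrt(zeta_kappa(r_t) * G_t); with weighted=True
    the accumulator becomes v_t = sum_s r_s^2 ||g_s||_{x_s}^2, i.e. RDoWG."""
    x, r_bar, accum = x0, eps, 0.0
    for _ in range(T):
        g, g_norm = grad_oracle(x)              # g_t = G(x_t) and ||g_t||_{x_t}
        r_bar = max(r_bar, float(dist(x, x0)))  # r_t = max(eps, max_{s<=t} d(x_s, x_0))
        accum += (r_bar ** 2 if weighted else 1.0) * g_norm ** 2
        eta = r_bar / jnp.sqrt(zeta(kappa, r_bar) * accum)
        x = exp_map(x, -eta * g)                # x_{t+1} = exp_{x_t}(-eta_t g_t)
    return x
```

For brevity, the sketch returns only the final iterate; our guarantees are stated for the weighted average sequence, which a full implementation would also maintain alongside the iterates.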
Our initial result, assuming bounded iterates, provides the optimality gap achieved by RDo WG. Theorem 3.12. Suppose that Assumption 3.1, 3.2, and 3.3 hold. Then, for all δ (0, 1) and L > 0, and for all t T, RDo WG (Algorithm 2) satisfies the optimality gap f( xt) f(x ) of (d0 + rt) p ζκ(d0 + rt) q Gt 1 + θt,δGt 1 + θ2 t,δL2 Pt 1 s=0 r2s/ζκ( rs) r2 t /ζκ( rt) with probability at least 1 δ P( ℓT > L). Proof. See Appendix E.3 We obtain a result on bounded domains which is optimal up to a logarithmic factor. Corollary 3.13. Suppose that Assumption 3.1, 3.2, and 3.3 hold. In addition, for any D d0, let LD := maxx M:d(x,x0) D ℓ(x). Then, for all δ (0, 1) and for τ arg maxt T Pt 1 s=0 r2 s/ζκ( rs) r2 t /ζκ( rt), RDo WG (Algorithm 2) satisfies the optimality gap f( xτ) f(x ) of Gτ 1θτ,δ + L2 Dθ2 τ,δ T log+ with probability at least 1 δ P( ℓT > L). Proof. See Appendix E.3 We proceed with analyzing the smooth setting. Our initial result yields an optimality gap for bounded iterates. It is worth noting that RDo G achieves similar results via uniform averaging, albeit with an additional cost (see Appendix D.3 for further details). Theorem 3.14. Suppose that Assumption 3.1, 3.2, and 3.4 hold and write s T := maxt T s(xt). Then, for all δ (0, 1) and S > 0, and for all t T, RDo WG (Algorithm 2) satisfies the optimality gap f( xt) f(x ) of (d0 + rt)2ζκ(d0 + rt)(Sθ2 t,δ) Pt 1 s=0 r2s/ζκ( rs) r2 t /ζκ( rt) with probability at least 1 δ P( s T > S). Proof. See Appendix E.4 This result achieves the optimal rate, aligning with the smooth analysis of Zhang & Sra (2016), with an additional logarithmic factor on bounded domains. Corollary 3.15. Suppose that Assumption 3.1, 3.2, and 3.4 hold. In addition, for any D d0, let SD := maxx M:d(x,x0) D s(x). Then, for all δ (0, 1) and for τ arg maxt T Pt 1 s=0 r2 s/ζκ( rs) r2 t /ζκ( rt), RDo WG (Algorithm 2) satisfies the optimality gap f( xτ) f(x ) of D2ζκ(D)SDθ2 τ,δ T log+ with probability at least 1 δ P( s T > S). Proof. See Appendix E.4 Stability analysis. While RDo WG is generally stable in practice, in theory, the algorithm trajectories can diverge. Drawing inspiration from Ivgi et al. (2023), we now introduce a variant of RDo WG that guarantees iterates remain bounded with high probability. The concept involves using step sizes that are smaller by a polylogarithmic factor. Following the taxonomy introduced in Ivgi et al. (2023), we Learning-Rate-Free Stochastic Optimization over Riemannian Manifolds 10 6 10 3 100 103 106 Learning rate η Geodesic distance to optima RADAM RSGD NRDo G (Ours) RDo G (Ours) RDo WG (Ours) 10 8 10 6 10 4 10 2 Initial distance ϵ (a) Initial distance sensitivity. 0 100 200 300 400 500 Iteration Best RSGD RSGD Learning rate η (b) Regret trace plot RSGD. 0 100 200 300 400 500 Iteration Best RSGD RDo G (Ours) Initial distance ϵ (c) Regret trace plot RDo G. Figure 2. Results for Rayleigh quotient maximization on the sphere. (a) Geodesic distance between the final iterate and the numerical solution after T = 5000 iterations as a function of the learning rate for RADAM and RSGD and as a function of the initial distance estimate for RDo G, RDo WG, and NRDo G. (b) Shows the regret (the function value of each iterate minus the function value of the numerical solution) for RSGD for a selection of learning rates. (c) Shows the regret for RDo G for a selection of different initial distance estimates. Results are averaged over ten replications. Algorithm 3 T-RDo WG Input: initial point x0, initial estimate ϵ > 0, v 1 = 0. 
for t = 0 to T 1 do gt = G(xt) rt = max (ϵ, maxs t d(xs, x0)) vt = vt 1 + r2 t gt 2 xt v t = 84θ2 T,δ log2 + (1+t) r2 t ℓ2 t /ζκ( rt) r2 0 ℓ2 0/ζκ( r0) (vt 1 + 16 r2 t ζκ( rt) ℓ2 t) ζκ( rt)v t xt+1 = expxt ( ηtgt) end for refer to this scheme as Tamed Riemannian Distance over Weighted Gradients (T-RDo WG, Algorithm 3). Our first result characterizes the key property of T-RDo WG: bounded iterates with high probability. Theorem 3.16. Suppose that Assumption 3.1, 3.2, and 3.3 hold, and ϵ 3d0. Then, for any δ (0, 1), and for any t N, the iterations of T-RDo WG (Algorithm 3) satisfy P( rt > 3d0) δ. Proof. See Appendix E.5 Using this result, we can now obtain the convergence rate of T-RDo WG. Corollary 3.17. Suppose that Assumption 3.1, 3.2, and 3.3 hold, and ϵ 3d0. For any δ (0, 1/2), and for any t N, let τ arg maxt T Pt 1 s=0 r2 s/ζκ( rs) r2 t /ζκ( rt). Then T-RDo WG (Algorithm 3) satisfies the optimality gap f( xτ) f(x ) of ζκ(d0)(Gτ + L2 ) with probability at least 1 2δ, where L := maxx M:d(x,x0) 3d(x ,x0) ℓ(x) and c = log+(T d0L f(x0) f(x )) log+( d0 ϵ ) log( log+(T ) Proof. See Appendix E.5 Remark 3.18. We can also extend the analysis to obtain a similar optimality gap in the smooth setting. For brevity, we omit the details here. 4. Related Work Riemannian optimization. Numerous authors have studied optimization on Riemannian manifolds. Earlier works on this topic established the asymptotic convergence of firstorder methods in both the deterministic (Udris te, 1994; Absil et al., 2008) and the stochastic (Liu et al., 2004; Bonnabel, 2013) settings. More recently, Zhang & Sra (2016) obtained the first non-asymptotic analysis for Riemannian stochastic gradient descent, assuming geodesic convexity. Subsequently, other authors have obtained iteration complexity results for Riemannian proximal-point methods (Bento et al., 2017), Frank-Wolfe schemes (Weber & Sra, 2021), variance reduced methods (Zhang et al., 2016; Kasai et al., 2017; Sato et al., 2019; Zhou et al., 2021), trust-region methods (Boumal et al., 2018; Agarwal et al., 2021), amongst others. In parallel, there has also been growing interest in obtaining Riemannian counterparts of accelerated (Liu et al., 2017; Alimisis et al., 2020b; Zhang & Sra, 2018; Ahn & Sra, 2020) and adaptive (Becigneul & Ganea, 2019; Kasai et al., 2019; Cho & Lee, 2017; Roy et al., 2018) methods used in Euclidean optimization. No existing works, however, consider learning-rate-free Riemannian optimization algorithms. Learning-rate-free Euclidean optimization. On the other hand, learning-rate-free methods for (stochastic) optimization on Euclidean spaces are substantial; see, e.g., Orabona & Cutkosky (2020); Carmon & Hinder (2022) and references therein. Most relevant to our work, Carmon & Hinder (2022) recently introduced a learning-rate-free algorithm for stochastic convex optimization based on interval bisection. Building on this work, Ivgi et al. (2023), Defazio & Learning-Rate-Free Stochastic Optimization over Riemannian Manifolds Mishchenko (2023) and Khaled et al. (2023) have since obtained learning-rate-free (stochastic) convex optimization algorithms which, under varying assumptions, achieve the optimal convergence rate of (stochastic) gradient descent up to a logarithmic factor. Many other learning-rate-free optimization algorithms originate in the online learning literature. 
These include methods based on coin betting (Orabona & P al, 2016; Orabona & Tommasi, 2017), exponentiated gradients (Streeter & Mc Mahan, 2012; Orabona, 2013), amongst others (e.g., Mc Mahan & Orabona, 2014; Orabona & Cutkosky, 2020). Recently, coin betting ideas have demonstrated effectiveness on Wasserstein spaces (Sharrock & Nemeth, 2023; Sharrock et al., 2023; 2024), that heuristically follow a Riemannian interpretation (Villani, 2003). 5. Experiments In this section, we assess the numerical performance of RDo G (Algorithm 1), RDo WG (Algorithm 2), and NRDo G (Algorithm 4) against manually tuned RSGD (Bonnabel, 2013) and RADAM (Becigneul & Ganea, 2019). Implementing all algorithms in Python 3 with JAX (Bradbury et al., 2018), our experiments run on a Mac Book Pro 16 (2021) with an Apple M1 Pro chip and 16GB of RAM. Detailed manifold descriptions and required operations for the experiments are provided in Appendix G. Code to reproduce the experiments is available at https://github.com/ daniel-dodd/riemannian_dog. 5.1. Rayleigh Quotient Maximization on the Sphere We seek to find the dominant eigenvector of a symmetric matrix A in Rd d by minimizing 1 2x T Ax on the unit sphere Sd 1. This is challenging for high-dimensional and ill-conditioned A in the Euclidean case. We consider A = 1 d BBT , with B Rd q having standard Gaussian entries. For illustration purposes, we first consider d = 3 and q = 5 in Figure 1, underscoring the pivotal role of selecting an optimal learning rate for RSGD, as deviations, whether too small or too large, adversely affect performance. In a higher-dimensional scenario with d = 1000 and q = 1100 = d, resulting in a high condition number, we employ RADAM and RSGD with a grid of twenty logarithmically spaced learning rates η [10 8, 106]. On the other hand, we investigate RDo G and RDo WG with ten logarithmically spaced initial distance values ϵ [10 8, 100]. Here, we initialize ten starting points x0 Rd by drawing their entries independently from a standard Gaussian distribution, then projecting them onto the sphere through normalizing, a shared procedure for each optimizer. Our results show that RDo G, RDo WG, and NRDo G are insensitive to initial distance, consistently achieving robust performance in recovering negligible geodesic distance to the numerical solution via the eigendecomposition. In contrast, the effectiveness of RADAM and RSGD depends on selecting an appropriate learning rate. Notably, as seen in Figure 2, RSGD is highly sensitive to the learning rate, while RDo G rapidly adapts to optimal regret within a few hundred iterations, irrespective of the initial distance estimate s magnitude. Additional regret trace plots for other optimizers are available in Appendix H.1, along with similar plots for geodesic distance to the optima. These underscore that the algorithms quickly adapt within a few hundred iterations without prior knowledge of the function. 5.2. PCA on the Grassmann Manifold We investigate principal component analysis (PCA) on the Grassmann manifold G(d, r), where points are represented as equivalence classes with an orthogonal matrix x Rd r having orthonormal columns (x T x = I). The PCA problem minimizes the sum of squared residual errors between projected data points and the original data, minx G(d,r) 1 n Pn i=1 zi xx T zi 2 2, with each zi represented as a d-dimensional data point. We consider datasets Wine, Waveform-5000, and Tiny Image Net. 
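For concreteness, here is a minimal sketch of this PCA objective and a single Riemannian gradient step on the Grassmann manifold. The closed-form exponential map follows Edelman et al. (1998); the data, dimensions, and learning rate are placeholder choices rather than our experimental configuration.

```python
# Illustrative sketch (assumed, standard Grassmann operations; not the authors'
# experiment code) of the PCA objective of Section 5.2 and one RSGD step.
import jax
import jax.numpy as jnp

def pca_cost(x, Z):
    """Mean squared residual of projecting the rows of Z (n x d) onto span(x), x in R^{d x r}."""
    proj = Z @ x @ x.T
    return jnp.mean(jnp.sum((Z - proj) ** 2, axis=1))

def grassmann_grad(x, Z):
    """Riemannian gradient: project the Euclidean gradient onto the horizontal space at x."""
    egrad = jax.grad(pca_cost)(x, Z)
    return egrad - x @ (x.T @ egrad)

def grassmann_exp(x, v):
    """Closed-form exponential map on the Grassmann manifold (Edelman et al., 1998)."""
    u, s, vt = jnp.linalg.svd(v, full_matrices=False)
    return x @ vt.T @ jnp.diag(jnp.cos(s)) @ vt + u @ jnp.diag(jnp.sin(s)) @ vt

# One RSGD step with a placeholder learning rate eta on random data.
key = jax.random.PRNGKey(0)
Z = jax.random.normal(key, (128, 40))                                      # n = 128 points, d = 40
x0, _ = jnp.linalg.qr(jax.random.normal(jax.random.PRNGKey(1), (40, 2)))   # r = 2
eta = 1e-2
x1 = grassmann_exp(x0, -eta * grassmann_grad(x0, Z))
```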
The numerical solution is computed using the scikit-learn implementation (Pedregosa et al., 2011). The geodesic distances of the final iterates (using weighted averages for RDoG and RDoWG) are compared against learning-rate-dependent algorithms, as shown in Figure 3. In training, Wine uses the full batch for T = 5000 iterations, while Waveform-5000 and Tiny ImageNet use batch sizes of 64 for T = 2000 iterations. Each dataset has an 80:20 train-test split per replication. Following Pymanopt (Townsend et al., 2016), initial points $x_0 \in \mathbb{R}^{d \times r}$ are drawn from a standard Gaussian distribution and projected onto the manifold using a vectorized QR decomposition. Results in Figure 3, averaged over five random train-test splits, show that RDoG, RDoWG, and NRDoG are insensitive to initial distance estimates across magnitudes, using ten logarithmically spaced values $\epsilon \in [10^{-8}, 10^{0}]$. In contrast, RADAM and RSGD only perform well within a narrow range of learning rates, explored over twenty logarithmically spaced values $\eta \in [10^{-8}, 10^{6}]$. Additional results in Appendix H.2 further highlight the robust adaptation of RDoG, RDoWG, and NRDoG.

Figure 3. Results for PCA on the Grassmann manifold: (a) Wine (d = 13, r = 1), (b) Waveform (d = 40, r = 2), (c) Tiny ImageNet (d = 12,288, r = 5). (a)-(c) Geodesic distance between the final iterate and the numerical solution after T = 2000 iterations as a function of the learning rate for RADAM and RSGD, and as a function of the initial distance estimate for RDoG, RDoWG, and NRDoG. (b)-(c) use the final iterate of the weighted average sequence for RDoG, RDoWG, and NRDoG. Results are averaged over five replications.

5.3. Embedding Graphs in the Poincaré Ball

The WordNet noun hierarchy (Miller et al., 1990) is a lexical database of English words organized into a hierarchical structure, where each word is categorized based on its semantic relationships with other words. The hypernymy relation, often termed the Is-A relation, signifies that one concept (the hypernym) encompasses another (the hyponym). For instance, mammal is a hypernym of dog and cat. Following Nickel & Kiela (2017), we consider representing the transitive closure of the mammals subtree, which involves 1,180 nouns, denoted $N$ (of which mammal is a hypernym), and 6,450 hypernymy relations, represented as $R = \{(u, v)\} \subseteq N \times N$. The embedding is performed in the Poincaré ball model of hyperbolic geometry, which is well known to be better suited than Euclidean space for embedding tree-like graphs (Gromov, 1987; Sala et al., 2018). The Poincaré ball is defined as $\mathbb{B}^d = \{x \in \mathbb{R}^d : \|x\| < 1\}$, equipped with the Riemannian metric $\langle \cdot, \cdot \rangle_x = 4/(1 - \|x\|^2)^2 \, \langle \cdot, \cdot \rangle$.
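For reference, the following sketch collects the standard Poincaré-ball primitives assumed in this experiment: the geodesic distance, the Euclidean-to-Riemannian gradient rescaling induced by the metric above, and the exponential map expressed via Möbius addition (see, e.g., Ungar, 2008). It is an illustrative sketch rather than the implementation used for our results.

```python
# Minimal sketch of standard Poincare-ball operations (assumed textbook formulas,
# not the authors' code): geodesic distance, gradient rescaling, exponential map.
import jax.numpy as jnp

def poincare_dist(x, y, eps=1e-9):
    """d(x, y) = arcosh(1 + 2||x - y||^2 / ((1 - ||x||^2)(1 - ||y||^2)))."""
    sq = jnp.sum((x - y) ** 2)
    denom = (1.0 - jnp.sum(x ** 2)) * (1.0 - jnp.sum(y ** 2))
    return jnp.arccosh(1.0 + 2.0 * sq / (denom + eps))

def poincare_rgrad(euclid_grad, x):
    """The metric <.,.>_x = 4/(1 - ||x||^2)^2 <.,.> rescales gradients by its inverse."""
    return (1.0 - jnp.sum(x ** 2)) ** 2 / 4.0 * euclid_grad

def mobius_add(x, y):
    """Mobius addition, used to express the exponential map on the Poincare ball."""
    xy = jnp.dot(x, y)
    nx, ny = jnp.sum(x ** 2), jnp.sum(y ** 2)
    return ((1.0 + 2.0 * xy + ny) * x + (1.0 - nx) * y) / (1.0 + 2.0 * xy + nx * ny)

def poincare_exp(x, v, eps=1e-9):
    """exp_x(v) = x (+) tanh(lambda_x ||v|| / 2) v / ||v||, with lambda_x = 2 / (1 - ||x||^2)."""
    lam = 2.0 / (1.0 - jnp.sum(x ** 2))
    nv = jnp.sqrt(jnp.sum(v ** 2)) + eps
    return mobius_add(x, jnp.tanh(lam * nv / 2.0) * v / nv)
```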
We adopt the loss function from the official code of Nickel & Kiela (2017), deviating from the one described in the paper:

$$\min_{\theta : N \to \mathbb{B}^d} \sum_{(u,v) \in R} -\log \frac{e^{-d(\theta(u), \theta(v))}}{\sum_{v' \in \mathrm{Neg}(u,v)} e^{-d(\theta(u), \theta(v'))}},$$

where each noun pair $(u, v) \in R$ has associated embeddings $\theta(u), \theta(v) \in \mathbb{B}^d$, $\mathrm{Neg}(u, v) = \{v' : (u, v') \notin R\} \cup \{v\}$ is the set of negative examples for $u$ (including $v$), and

$$d(x, y) = \operatorname{arcosh}\left(1 + 2\,\frac{\|x - y\|^2}{(1 - \|x\|^2)(1 - \|y\|^2)}\right)$$

is the geodesic distance measuring the dissimilarity between the embeddings of two nouns in the Poincaré ball. Intuitively, minimizing this loss encourages closely related mammals to be positioned closer together in the embedding space and less similar pairs to be farther apart.

For initialization, following Nickel & Kiela (2017), we uniformly initialize the embeddings in $[-10^{-3}, 10^{-3}]^d$ and consider ten logarithmically spaced learning rates $\eta \in [10^{-2}, 10^{2}]$ and five logarithmically spaced initial distance estimates $\epsilon \in [10^{-10}, 10^{-6}]$. In the first ten epochs, we apply a burn-in phase for RSGD and RADAM, running RSGD with a reduced learning rate of $\eta/10$. During this burn-in phase, negative word sampling is based on the graph degree raised to the power of 3/4, leading to numerical improvements. No burn-in heuristic is applied for RDoG, RDoWG, or NRDoG. Thereafter, we run the optimizers on the initialized embeddings for one thousand epochs, with each iteration having a batch size of ten and fifty uniformly sampled negative samples. We repeat this experiment over five replications.

To measure the quality of the embeddings obtained from each optimizer, we follow Nickel & Kiela (2017) and compute, for each observed edge $(u, v) \in R$, the corresponding distance $d(u, v)$ in the embedding space and rank it among the distances of all unobserved edges for $u$, i.e., $\{d(u, v') : (u, v') \notin R\}$. We then calculate the mean average precision of this ranking.

In Figure 4, embeddings of dimension five are presented. RDoG and RDoWG demonstrate competitive performance, while RADAM and RSGD require careful tuning. The performance of RADAM and RSGD degrades significantly without the burn-in heuristic, as shown in Appendix H.3. Visualizing two-dimensional embeddings for RDoG and RSGD trained for two thousand epochs, with burn-in applied only for RSGD and using the optimal learning rate selected from ten logarithmically spaced values $\eta \in [10^{-2}, 10^{2}]$, we observe meaningful groupings across various categories without employing burn-in heuristics for RDoG. Additional embedding plots for the other optimizers are presented in Appendix H.3.

Figure 4. Results for Poincaré word embeddings. (a) The mean average precision of the embeddings is assessed against the ground truth after 1000 training epochs. Results are averaged over five replications, with the embedding dimension set to five. (b)-(c) Two-dimensional embeddings after 2000 training epochs are visualized and annotated for the first 50 nouns of the mammals subtree for RDoG and RSGD.

6. Discussion

We have introduced new learning-rate-free optimizers for Riemannian manifolds and have highlighted significant numerical improvements over learning-rate-dependent algorithms. Our theoretical results provide high probability convergence guarantees that are optimal, up to a logarithmic factor, compared to the theoretically optimal, yet practically unavailable, tuned deterministic algorithms.

Many existing Riemannian optimization methods rely on a retraction map, which serves as a cost-effective approximation of the exponential map on manifolds and is a reasonable choice in numerous real-world scenarios. Incorporating retractions into our framework is paramount for enhancing the effectiveness of our algorithms, particularly in large-scale optimization problems. Moreover, certain Riemannian manifolds, such as the Stiefel or multivariate Gaussian Fisher-Rao manifolds, pose challenges due to the intractability of the geodesic distance. Since the argument underlying our convergence guarantees (though potentially less robust) holds for upper bounds on the geodesic distance, exploring tractable or more economical approximations in these situations is essential. Additionally, it is crucial to explore integrating these methods with recent proven advances in momentum acceleration (Liu et al., 2017; Alimisis et al., 2020b; Zhang & Sra, 2018; Ahn & Sra, 2020), a challenge both in theory and practice. Furthermore, developing practical algorithms that offer guarantees on iterate boundedness is a consideration for future research.

Acknowledgements

We thank the anonymous reviewers for their constructive feedback. DD was supported by the EPSRC-funded STOR-i Centre for Doctoral Training, grant number EP/S022252/1. LS and CN were supported by the Engineering and Physical Sciences Research Council (EPSRC), grant number EP/V022636/1. CN acknowledges further support from the EPSRC, grant numbers EP/S00159X/1 and EP/Y028783/1.

Impact Statement

This paper presents work whose goal is to advance the field of Machine Learning. There are many potential societal consequences of our work, none of which we feel must be specifically highlighted here.

References

Absil, P.-A., Mahony, R., and Sepulchre, R. Optimization Algorithms on Matrix Manifolds. Princeton University Press, Princeton, NJ, 2008.

Agarwal, N., Boumal, N., Bullins, B., and Cartis, C. Adaptive regularization with cubics on manifolds. Mathematical Programming, 188(1):85–134, 2021.

Ahn, K. and Sra, S. From Nesterov's estimate sequence to Riemannian acceleration. In Proceedings of the 33rd Conference on Learning Theory (COLT 2020), 2020.
Alimisis, F., Orvieto, A., Becigneul, G., and Lucchi, A. A continuous-time perspective for modeling acceleration in Riemannian optimization. In Chiappa, S. and Calandra, R. (eds.), Proceedings of the 23rd International Conference on Artificial Intelligence and Statistics, volume 108 of Proceedings of Machine Learning Research, pp. 1297 1307. PMLR, 2020a. Alimisis, F., Orvieto, A., Becigneul, G., and Lucchi, A. A continuous-time perspective for modeling acceleration in Riemannian optimization. In Proceedings of the 23rd International Conference on Artificial Intelligence and Statistics (AISTATS 2020), 2020b. Becigneul, G. and Ganea, O.-E. Riemannian adaptive optimization methods. In International Conference on Learning Representations, 2019. Learning-Rate-Free Stochastic Optimization over Riemannian Manifolds Bento, G. C., Ferreira, O. P., and Melo, J. G. Iterationcomplexity of gradient, subgradient and proximal point methods on Riemannian manifolds. Journal of Optimization Theory and Applications, 173(2):548 562, 2017. Bonnabel, S. Stochastic gradient descent on Riemannian manifolds. IEEE Transactions on Automatic Control, 58 (9):2217 2229, 2013. Boumal, N. An Introduction to Optimization on Smooth Manifolds. Cambridge University Press, 2023. Boumal, N. and Absil, P.-A. RTRMC: A Riemannian trustregion method for low-rank matrix completion. In Proceedings of the 24th Conference on Neural Information Processing Systems (NIPS 2011), volume 24, 2011. Boumal, N., Absil, P.-A., and Cartis, C. Global rates of convergence for nonconvex optimization on manifolds. IMA Journal of Numerical Analysis, 39(1):1 33, 2018. Bradbury, J., Frostig, R., Hawkins, P., Johnson, M. J., Leary, C., Maclaurin, D., Necula, G., Paszke, A., Vander Plas, J., Wanderman-Milne, S., and Zhang, Q. JAX: composable transformations of Python+Num Py programs, 2018. Carmon, Y. and Hinder, O. Making SGD parameter-free. In Conference on Learning Theory, pp. 2360 2389. PMLR, 2022. Cho, M. and Lee, J. Riemannian approach to batch normalization. In Proceedings of the 31st Conference on Neural Information Processing Systems (NIPS 2017), 2017. Cordero-Erausquin, D., Mc Cann, R. J., and Schmuckenschl ager, M. A Riemannian interpolation inequality a la Borell, Brascamp and Lieb. Inventiones mathematicae, 146(2):219 257, 2001. Defazio, A. and Mishchenko, K. Learning-rate-free learning by D-adaptation. In Proceedings of the 40th International Conference on Machine Learning (ICML 2023), Honolulu, HI, 2023. Edelman, A., Arias, T. A., and Smith, S. T. The geometry of algorithms with orthogonality constraints. SIAM Journal on Matrix Analysis and Applications, 20(2):303 353, 1998. Grimmer, B. Convergence rates for deterministic and stochastic subgradient methods without Lipschitz continuity. SIAM Journal on Optimization, 29(2):1350 1365, 2019. Gromov, M. Hyperbolic groups. In Essays in group theory, pp. 75 263. Springer, 1987. Hazan, E., Levy, K., and Shalev-Shwartz, S. Beyond convexity: Stochastic quasi-convex optimization. In Proceedings of the 28th Conference on Neural Information Processing Systems (NIPS 2015), volume 28, 2015. Hosseini, R. and Sra, S. Matrix manifold optimization for Gaussian mixtures. In Proceedings of the 28th Conference on Neural Information Processing Systems (Neur IPS 2015), volume 28, 2015. Ishteva, M., Absil, P.-A., Van Huffel, S., and De Lathauwer, L. Best low multilinear rank approximation of higher-order tensors, based on the Riemannian trustregion scheme. 
SIAM Journal on Matrix Analysis and Applications, 32(1):115 135, 2011. Ivgi, M., Hinder, O., and Carmon, Y. Do G is SGD s best friend: A parameter-free dynamic step size schedule. In Proceedings of the 40th International Conference on Machine Learning, volume 202 of Proceedings of Machine Learning Research, pp. 14465 14499. PMLR, 2023. Kasai, H., Sato, H., and Mishra, B. Riemannian stochastic variance reduced gradient on Grassmann manifold. ar Xiv preprint ar Xiv:1605.07367, 2017. Kasai, H., Jawanpuria, P., and Mishra, B. Riemannian adaptive stochastic gradient algorithms on matrix manifolds. In Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pp. 3262 3271. PMLR, 2019. Khaled, A., Mishchenko, K., and Jin, C. Do WG unleashed: An efficient universal parameter-free gradient descent method. In Proceedings of the 37th Conference on Neural Information Processing Systems (Neur IPS 2023), 2023. Konnov, I. V. On convergence properties of a subgradient method. Optimization Methods and Software, 18(1):53 62, 2003. Lee, J. M. Smooth manifolds. Springer, 2012. Levy, K. Y. The power of normalization: Faster evasion of saddle points. ar Xiv preprint ar Xiv:1611.04831, 2016. Liu, X., Srivastava, A., and Gallivan, K. Optimal linear representations of images for object recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 26(5):662 666, 2004. Liu, Y., Shang, F., Cheng, J., Cheng, H., and Jiao, L. Accelerated first-order methods for geodesically convex optimization on Riemannian manifolds. In Proceedings of the 31st Conference on Neural Information Processing Systems (NIPS 2017), 2017. Learning-Rate-Free Stochastic Optimization over Riemannian Manifolds Mc Mahan, H. B. and Orabona, F. Unconstrained online linear learning in Hilbert spaces: Minimax algorithms and normal approximations. In Proceedings of the 27th Conference on Learning Theory (COLT 2014), Barcelona, Spain, 2014. Miller, G. A., Beckwith, R., Fellbaum, C., Gross, D., and Miller, K. J. Introduction to Word Net: An On-line Lexical Database. International Journal of Lexicography, 3(4): 235 244, 12 1990. Nesterov, Y. Lectures on Convex Optimization. Number 978-3-319-91578-4 in Springer Optimization and Its Applications. Springer, 2018. Nickel, M. and Kiela, D. Poincar e embeddings for learning hierarchical representations. In Proceedings of the 30th Conference on Neural Information Processing Systems (Neur IPS 2017), volume 30, 2017. Orabona, F. Dimension-free exponentiated gradient. In Proceedings of the 26th Conference on Neural Information Processing Systems, 2013. Orabona, F. and Cutkosky, A. Tutorial on Parameter-Free Online Learning. In Proceedings of the 37th International Conference on Machine Learning (ICML 2020), 2020. Orabona, F. and P al, D. Coin betting and parameter-free online learning. In Proceedings of the 29th Conference on Neural Information Processing Systems (Neur IPS 2016), volume 29, 2016. Orabona, F. and Tommasi, T. Training Deep Networks without Learning Rates Through Coin Betting. In Proceedings of the 31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, 2017. Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., and Duchesnay, E. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825 2830, 2011. Petersen, P. 
Riemannian geometry, volume 171. Springer, 2006. Roy, S. K., Mhammedi, Z., and Harandi, M. Geometry aware constrained optimization techniques for deep learning. In 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 4460 4469, 2018. Sala, F., De Sa, C., Gu, A., and R e, C. Representation tradeoffs for hyperbolic embeddings. In International Conference on Machine Learning (ICML 2018), pp. 4460 4469. PMLR, 2018. Sato, H., Kasai, H., and Mishra, B. Riemannian stochastic variance reduced gradient algorithm with retraction and vector transport. SIAM Journal on Optimization, 29(2): 1444 1472, 2019. Sharrock, L. and Nemeth, C. Coin Sampling: Gradient Based Bayesian Inference without Learning Rates. In Proceedings of the 40th International Conference on Machine Learning (ICML 2023), Honolulu, HI, 2023. Sharrock, L., Mackey, L., and Nemeth, C. Learning Rate Free Sampling in Constrained Domains. In Proceedings of the 37th International Conference on Neural Information Processing Systems (Neur IPS 2023), New Orleans, LA, 2023. Sharrock, L., Dodd, D., and Nemeth, C. Tuning-free maximum likelihood training of latent variable models via coin betting. In International Conference on Artificial Intelligence and Statistics, pp. 1810 1818. PMLR, 2024. Shor, N. Z. Minimization methods for non-differentiable functions, volume 3. Springer Science & Business Media, 2012. Streeter, M. and Mc Mahan, H. B. No-regret algorithms for unconstrained online convex optimization. In Proceedings of the 25th Conference on Neural Information Processing Systems (NIPS 2012), 2012. Sun, J., Qu, Q., and Wright, J. Complete dictionary recovery over the sphere ii: Recovery by Riemannian trust-region method. IEEE Transactions on Information Theory, 63 (2):885 914, February 2017. Townsend, J., Koep, N., and Weichwald, S. Pymanopt: A Python toolbox for optimization on manifolds using automatic differentiation. Journal of Machine Learning Research, 17(137):1 5, 2016. Udris te, C. Convex Functions and Optimization Methods on Riemannian Manifolds. Springer Dordrecht, 1994. Ungar, A. A. A gyrovector space approach to hyperbolic geometry. Synthesis Lectures on Mathematics and Statistics, 1(1):1 194, 2008. Villani, C. Topics in Optimal Transportation. American Mathematical Society, Providence, Rhode Island, 2003. Wang, X., Tu, Z., Hong, Y., Wu, Y., and Shi, G. No-regret online learning over Riemannian manifolds. In Proceedings of the 35th Conference on Neural Information Processing Systems (Neur IPS 2021), 2021. Weber, M. and Sra, S. Projection-free nonconvex stochastic optimization on Riemannian manifolds. IMA Journal of Numerical Analysis, 42(4):3241 3271, 09 2021. Learning-Rate-Free Stochastic Optimization over Riemannian Manifolds Zadeh, P., Hosseini, R., and Sra, S. Geometric mean metric learning. In Balcan, M. F. and Weinberger, K. Q. (eds.), Proceedings of the 33rd International Conference on Machine Learning (ICML 2016), volume 48 of Proceedings of Machine Learning Research, pp. 2464 2471, New York, New York, USA, 2016. PMLR. Zhang, H. and Sra, S. First-order methods for geodesically convex optimization. In Feldman, V., Rakhlin, A., and Shamir, O. (eds.), 29th Annual Conference on Learning Theory, volume 49 of Proceedings of Machine Learning Research, pp. 1617 1638, Columbia University, New York, New York, USA, 23 26 Jun 2016. PMLR. Zhang, H. and Sra, S. Towards Riemannian accelerated gradient methods. ar Xiv preprint ar Xiv:1806.02812, 2018. Zhang, H., Reddi, S. J., and Sra, S. 
Riemannian SVRG: Fast stochastic optimization on Riemannian manifolds. In Proceedings of the 30th Conference on Neural Information Processing Systems (NIPS 2016), 2016. Zhou, P., Yuan, X.-T., Yan, S., and Feng, J. Faster firstorder methods for stochastic non-convex optimization on Riemannian manifolds. IEEE Transactions on Pattern Analysis and Machine Intelligence, 43(2):459 472, 2021. Learning-Rate-Free Stochastic Optimization over Riemannian Manifolds A. Useful Results We begin by introducing essential lemmas for the establishment of our theory. A.1. Trigonometric Distance Bounds for Manifolds The law of cosines in Euclidean space is fundamental for analyzing optimization algorithms, a2 = b2 + c2 2bc cos(A), (2) where a, b, c are the sides of a Euclidean triangle with A the angle between sides b and c. Trigonometric geometry behaves differently in manifolds compared to Euclidean spaces. While the equality does not hold for nonlinear spaces, a trigonometric distance bound can be established for manifolds with curvature bounded below. Lemma A.1. (Zhang & Sra, 2016, Lemma 5) If a, b, c are the side lengths of a geodesic triangle in a Riemannian manifold with sectional curvature lower bounded by κ > and A is the angle between sides b and c (defined through the inverse exponential map and inner product in tangent space), then a2 ζκ(c)b2 + c2 2bc cos(A). (3) Proof. Given by Lemma 3.12 of (Cordero-Erausquin et al., 2001) and by Lemma 5 of (Zhang & Sra, 2016). This lemma holds profound implications for our analysis of geodesically convex functions f. Specifically, the property of geodesic convexity allows us to bound f(xt) f(x ) by the inner product grad f(xt), exp 1 xt (x ) xt. The above trigonometric inequality empowers us to bound this inner product to devise tractable optimization algorithms. To streamline future analysis, we expand our perspective to encompass bounding the inner product gt, exp 1 xt (x ) xt for any tangent vector gt Txt M. Lemma A.2. (Zhang & Sra, 2016, Corollary 8) For any Riemannian manifold M where the sectional curvature is lower bounded by κ > and any point x , xt M and any tangent vector gt Txt M, scalar ηt > 0 consider the RSGD update xt+1 = expxt( ηtgt). Then by Lemma A.1, we have gt, exp 1 xt (x ) xt 1 2ηt d2 t d2 t+1 + ηt 2 ζκ(dt) gt 2 xt. (4) Proof. Consider the geodesic triangle with vertices xt+1, xt, and x . Then we have the side lengths of are given by a = d(xt+1, x ) = dt+1, b = d(xt+1, xt) = ηt gt xt, c = d(xt, x ) = dt. (5) Recalling that the angle between two tangent vectors u and v at x M is given by arccos u,v x u x v x . Now, considering the angle, A, between side lengths b and c, we have, 2bc cos(A) = 2bc cos arccos exp 1 xt (xt+1), exp 1 xt (x ) xt exp 1 xt (xt+1) xt exp 1 xt (x ) xt = ηtgt, exp 1 xt (x ) xt. (6) Substituting these terms in Lemma A.1 and rearranging yields the result as required. A.2. Jensen s Inequality for Geodesically Convex Functionals We present an analog for Jensen s inequality for geodesically convex functions on Riemannian manifolds. This will allow us to leverage innovative weighted averaging strategies in the regret analysis of our algorithms. Lemma A.3. Let f be geodesically convex. For any sequence of iterates x0, . . . , xt M and positive weights w0, . . . , wt R+, define the online weighted average sequence by xt+1 = exp xt wt Pt s=0 ws exp 1 xt (xt) , x1 = x0. (7) Learning-Rate-Free Stochastic Optimization over Riemannian Manifolds Then we have f( xt) 1 Pt 1 s=0 ws s=0 wsf(xs). (8) Proof. We prove this by induction. 
The base case for t = 1 holds by definition. Now for t 2, for the inductive step, assume the statement is true for t 1 and consider t. We have, 1 Pt 1 s=0 ws s=0 wsf(xs) = wt 1 Pt 1 s=0 ws f(xt 1) + 1 Pt 1 s=0 ws s=0 wsf(xs) (9) = wt 1 Pt 1 s=0 ws f(xt 1) + Pt 2 s=0 ws Pt 1 s=0 ws 1 Pt 2 s=0 ws s=0 wsf(xs) (10) wt 1 Pt 1 s=0 ws f(xt 1) + Pt 2 s=0 ws Pt 1 s=0 ws f( xt 1). (11) In the final line, we have exploited the inductive assumption. Finally, we note that γ(s) = expx (1 s) exp 1 x (x) + s exp 1 x (y) for s [0, 1] defines a geodesic between any two points x and y in M. Moreover, by geodesic convexity we have f(γ(s)) (1 s)f(γ(0)) + sf(γ(1)) = (1 s)f(x) + sf(y). (12) Thus applying this to Equation (11) with x = xt 1, y = xt 1 and s = wt 1 Pt 1 s=0 ws and noting that for this choice, wt 1 Pt 1 s=0 ws 1 wt 1 Pt 1 s=0 ws exp 1 xt 1( xt 1) + wt 1 Pt 1 s=0 ws exp 1 xt 1(xt 1) wt 1 Pt 1 s=0 ws exp 1 xt 1(xt 1) yields the result as required. A.3. Smoothness Bounds We present smoothness results that establish bounds on individual gradient norms, that we will use in our later analysis to yield tighter regret bounds under the geodesic smoothness assumption. Lemma A.4. Suppose f is S-smooth and lower bounded by f(x ). Then, for all x M we have grad f(x) x p 2S(f(x) f(x )). (16) Proof. This is a trivial consequence of e.g., Proposition 4.7 and 4.8 of (Boumal, 2023). We include the proof for completeness. Let x M and define y = expx 1 S grad f(x) . Then geodesic smoothness provides, f(y) f(x) + grad f(x), exp 1 x (y) x + S 2 exp 1 x (y) 2 x (17) S grad f(x) 2 x + 1 2S grad f(x) 2 x (18) 2S grad f(x) 2 x. (19) Now since f is lower bounded by f(x ) we thus have f(x ) f(y) f(x) 1 2S grad f(x) 2 x. (20) Rearranging gives the result. Learning-Rate-Free Stochastic Optimization over Riemannian Manifolds Using the above argument, we provide a bound on the norm of the stochastic error. Lemma A.5. Under locally smooth stochastic gradients (Assumption 3.4), for the stochastic error (x) := G(x) grad f(x) we almost surely have that 2(f(x) f(x )). (21) Proof. Noting that Assumption 3.4 implies that for any x, y M we almost surely have that f(s) f(x) + G(x), exp 1 x (y) x + s(x) 2 exp 1 x (y) 2 x. (22) We follow the same argument as in Lemma A.4 to deduce that almost surely, 2s(x)(f(x) f(x )). (23) While the triangle inequality and applying Lemma A.4 to grad f(x) x gives, (x) x G(x) x + grad f(x) x p 2s(x)(f(x) f(x )) + p 2S(f(x) f(x )). (24) A.4. Bounds for Real-Valued Series Lemma A.6. (Ivgi et al., 2023, Lemma 3) Let a0, a1, . . . , a T be a positive increasing sequence. Then as at e 1 T 1 + log(a T /a0) 1 . (25) Proof. Lemma 3 of (Ivgi et al., 2023). Define K := log(a T /a0) , and n := T/K . Then, given the sequence is increasing we have k=0 log an(k+1) K min k 0, and ˆXt Ft 1 such that | ˆXt| Ct with probability 1, t T, {ys} s=1 S : v u u tθt,δ Xs ˆXs 2 + c2θ2 t,δ δ + P ( t T : Ct > c) . (36) Proof. See Lemma 7 of (Ivgi et al., 2023). Learning-Rate-Free Stochastic Optimization over Riemannian Manifolds B. RGD Ideal Step Size Analysis B.1. Proof of Theorem 3.5 Proof. Using geodesic convexity and applying Lemma A.2, we have min t T [f(xt) f(x )] 1 t=0 [f(xt) f(x )] (37) t=0 gt, exp 1 xt (x ) xt (38) 2η d2 t d2 t+1 + η 2ζκ(dt) gt 2 xt = d2 0 2ηT + η PT t=0 ζκ(dt) gt 2 xt 2T (40) d2 T 2ηT + ηζκ( d T ) PT t=0 gt 2 xt 2T . 
(41) Now, setting η = d T q ζκ( d T ) PT t=0 gt 2xt , we have min t T [f(xt) f(x )] d T q ζκ( d T ) PT t=0 gt 2xt 2T + ζκ( d T ) PT t=0 gt 2 xt 2Tζκ( d T ) q PT t=0 gt 2xt ζκ( d T ) PT t=0 gt 2xt T (43) Where we have bounded gt xt L due to the Lipschitz assumption, and d dt for all t 0. C. NRGD Ideal Step Size Analysis C.1. Proof of Theorem 3.11 Proof of Theorem 3.11. Using Lemma A.2 we have grad f(xt) grad f(xt) xt , exp 1 xt (x ) 2η d2 t d2 t+1 + η 2ζκ(dt). (45) Averaging the above, we have grad f(xt) xt , exp 1 xt (x ) xt d2 0 2ηT + η t=0 ζκ(dt). (46) Now for the Lipshitz setting, we have grad f(xt) xt L thus, L , exp 1 xt (x ) grad f(xt) xt , exp 1 xt (x ) xt d2 0 2ηT + η t=0 ζκ(dt). (47) Multiplying through by L and using definition of geodesic convexity yields, min t T [f(xt) f(x )] 1 t=0 [f(xt) f(x )] L " d2 0 2ηT + η d2 T 2ηT + η 2ζκ( d T ) . (48) Learning-Rate-Free Stochastic Optimization over Riemannian Manifolds Now substituting η = d T T ζκ( d T ), gives min t T [f(xt) f(x )] L d T p Tζκ( d T ) T L d T p which completes the proof for the Lipshitz case. Now we proceed to consider the smooth setting. By convexity we have grad f(xt), exp 1 xt (x ) xt f(xt) f(x ) 0. (50) And by smoothness (Lemma A.4), we have grad f(xt) xt p 2S(f(xt) f(x )). (51) Now if f(xt) = f(x ) the theorem holds trivially, suppose not. Then combining the above expressions, we have grad f(xt) grad f(xt) xt , exp 1 xt (x ) xt f(xt) f(x ) p 2S(f(xt) f(x )) = (f(xt) f(x )) Thus we have f(xt) f(x ) i 1 f(xt) f(x ) " d2 0 2ηT + η 2S d2 T 2ηT + η 2ζκ( d T ) . Squaring gives us the first result. Now, plugging in η = d T T ζκ( d T ), gives min t T [f(xt) f(x )] 2S d2 T Tζκ( d T ) T 2 2S d2 T ζκ( d T ) D. RDo G Theoretical Analysis D.1. Overview In this section, we analyze RDo G (Algorithm 1). Thus we consider RSGD with step sizes given by, ζκ( rt) Pt s=0 gs 2xs We consider bounding the error of the weighted average sequence, xt+1 = exp xt ζκ( rt) Pt s=0 rs/ p ζκ( rs) exp 1 xt (xt) For a geodesically convex function f : M R, we have by Lemma A.3 that xt satisfies, f( xt) f(x ) 1 Pt 1 s=0( rs/ p s=0 ( rs/ p ζκ( rs)) grad f(xs), exp 1 xs (x ) xs. (56) Recalling that gs represents the stochastic oracle evaluation at xs, denoted as G(xs), we can decompose the numerator into two components: s=0 ( rs/ p ζκ( rs)) gs, exp 1 xs (x ) xs | {z } weighted regret s=0 ( rs/ p ζκ( rs)) s, exp 1 xs (x ) xs | {z } noise with s := gs grad f(xs). Learning-Rate-Free Stochastic Optimization over Riemannian Manifolds D.2. Non-Smooth Analysis We give deterministic bounds for the weighted regret (Lemma D.1) and high probability bounds for the noise term (Lemma D.2). Lemma D.1. Under Assumption 3.1 and 3.2, we have that the iterates of RDo G (Algorithm 1) satisfy s=0 ( rs/ p ζκ( rs)) gs, exp 1 xs (x ) xs rt 2 dt + rtζκ( dt) Proof. Applying Lemma A.2, we can bound the weighted average as ζκ( rs) gs, exp 1 xs (x ) xs 1 d2 s d2 s+1 ζκ( rs) ηsζκ(ds) gs 2 xs | {z } (B) We bound the terms (A) and (B) in turn, beginning with the former: Gs d2 s d2 s+1 = d2 0 p Gt 1 + d2 t Gt 1( d2 t d2 t) (ii) 4 rt dt p Inequality (i) uses ds dt and that Gt is nondecreasing. Inequality (ii) use that for k arg maxs t ds, we have d2 t d2 t = d2 k d2 t = (dk dt)(dk + dt) d(xk, xt)(dk + dt) ( rk + rt)(dk + dt) 4 rt dt. 
Bounding the second term (B), we have for κ < 0: r2 sζκ(ds) gs 2 xs ζκ( rs) Gs = r2 s tanh( p |κ| rs)ζκ(ds) gs 2 xs p |κ| rs Gs = |κ| rs)ζκ(ds) gs 2 xs p |κ| Gs (62) |κ| rt tanh( p |κ| rt)ζκ( dt) gs 2 xs Gs 2 p |κ| rt tanh( p |κ| rt)ζκ( dt) p = 2 r2 t tanh ( p |κ| rt ζκ( dt) p Gt 1 = 2 r2 t ζκ( rt)ζκ( dt) p While for κ = 0 the geometric curvature function d 7 ζκ(d) takes constant value one, thus the same bound above can be established trivially. Combining (A) and (B), gives the result. Lemma D.2. For all δ (0, 1), T N and L > 0, if Assumption 3.1, 3.2, and 3.3 hold, the iterates of RDo G (Algorithm 1) satisfy ζκ( rs) s, exp 1 xs (x ) xs δ + P( ℓT > L), (65) where bt = 8 rt 1 ζκ( rt 1) dt 1 q θt,δGt 1 + θ2 t,δL2 and ℓT := maxs T ℓ(xs). Proof. For 1 s T define the random variables Ys := rs 1 p ζκ( rs 1) ds 1, Xs := s 1, exp 1 xs 1(x ) xs 1 , ˆXs := grad f(xs 1), exp 1 xs 1(x ) xs 1 . (66) Learning-Rate-Free Stochastic Optimization over Riemannian Manifolds By the Cauchy-Schwartz inequality and Assumption 3.3, we have each |Xs| ℓ(x), and each | ˆXs| ℓ(x) with probability 1. Moreover, and consider the filtration Fs = σ(G(x0) . . . , G(xs)). Then we have that Xs is a martingale difference sequence adapted to Fs and ˆXs Fs 1. By construction or any t T, we have s=1 Ys Xs = ζκ( rs) s, exp 1 xs (x ) xs. (67) Therefore, applying Lemma A.9 yields the result as required. Combining the above results, we obtain the following. Theorem D.3. For all δ (0, 1) and L > 0, if Assumption 3.1, 3.2, and 3.3 hold, then with probability at least 1 δ P( ℓT > L), for all t T, the optimality gap on the weighted iterates f( xt) f(x ) of RDo G (Algorithm 1) satisfy (d0 + rt) p ζκ(d0 + rt) q Gt 1 + θt,δGt 1 + θ2 t,δL2 Pt 1 s=0 rs/ Proof. Combining Lemma D.1 and Lemma D.2, we have for the given probability that f( xt) f(x ) ζκ( rt) + rt ζκ( rt)ζκ( dt) p Gt 1 + 8 dt 1 q θt,δGt 1 + θ2 t,δL2 Pt 1 s=0 rs/ Now using the fact dt d0 + rt and that d 7 ζκ(d) and d 7 d ζκ(d) are increasing functions gives the result. We then have a useful result when the manifold is bounded but its exact diameter is unknown. Corollary D.4. Under Assumption 3.1, 3.2, and 3.3, for any D d0 let LD := maxx M:d(x,x0) D ℓ(x). Then, for all δ (0, 1) and for τ arg maxt T Pt 1 s=0 rs/ ζκ( rt) , with probability at least 1 δ P( ℓT > L), iterates of RDo G (Algorithm 1) satisfy the optimality gap bound f( xτ) f(x ) = O Gτ 1θτ,δ + L2 Dθ2 τ,δ T log+ Proof. Apply Lemma A.6 to the denominator term of Theorem D.3. D.3. Smooth Guarantees via Uniform Averaging Under the assumption of locally smooth stochastic gradients (Assumption 3.4), we can deduce an O(1/T) convergence guarantee under uniformly averaged iterates, ˆxt+1 = expˆxt t exp 1 ˆxt (xt) , ˆx1 = x0. We begin by presenting a theorem that shows a bound under uniform iterate averaging in the non-smooth setting. Theorem D.5. For all δ (0, 1) and L > 0, if Assumption 3.1, 3.2, and 3.3 hold, then with probability at least 1 δ P( ℓT > L), for all t T, the optimality gap on the uniformly averaged iterates f(ˆx T ) f(x ) of RDo G (Algorithm 1) satisfy: (d0 log+ r T ϵ + r T ) p ζκ(d0 + r T ) q GT 1 + θT,δGT 1 + θ2 T,δL2 Learning-Rate-Free Stochastic Optimization over Riemannian Manifolds Proof. Define the times τs = min min{k| rk 2 rτk 1}, T , with τ0 := 0. Moreover, let K be the first index such that τK = T and note that K 1 + log2 r T ϵ by construction. 
Now using the argument of Lemma D.13, we have that for k K ζκ( rt) gt, exp 1 xt (x ) rτk 2 dτk + rτk ζκ( rτk)ζκ( dτk) p = O rτk (d0 + rτk) ζκ(d0 + rτk)ζκ(d0 + rτk) p = O rτk(d0 + rτk) p GT 1 . (74) Where the first equality holds due by the virtue of d 7 d ζκ(d) is an increasing function and that dτk rτk + d0. Furthermore, by Lemma D.14 we have for all k K with probability at least 1 δ P( ℓT > L), ζκ( rt) t, exp 1 xt (x ) xt ζκ( rt) t, exp 1 xt (x ) xt ζκ( rt) t, exp 1 xt (x ) xt ζκ( rτk 1) dτk 1 q θT,δGT 1 + θ2 T,δL2. (76) Now combining these two bounds, we have t=τk 1 f(xt) f(x ) 1 rτk 1/ p ζκ( rt) [f(xt) f(x )] (77) ζκ( rt) grad f(xt), exp 1 xt (x ) xt (78) = 1 rτk 1/ p gt, exp 1 xt (x ) xt + t, exp 1 xt (x ) xt (79) ζκ( rτk) rτk 1/ p ζκ( rτk 1)(d0 + rτk) p ζκ(d0 + rτk) q GT 1 + θT,δGT 1 + θ2 T,δL2 ! = O (d0 + rτk) p ζκ(d0 + rτk) q GT 1 + θT,δGT 1 + θ2 T,δL2 , (81) where final reduction holds since d 7 d ζκ(d) is an increasing function, and for any t, rt+1 rt + d(xt+1, xt) = rt 1 + gt xt Gt Now summing over k from 1 to K we have t=0 [f(xt) f(x )] = t=τk 1 [f(xt) f(x )] (83) k=1 (d0 + rτk) p ζκ(d0 + rτk) q GT 1 + θT,δGT 1 + θ2 T,δL2 ! k=1 (d0 + rτk) GT 1 + θT,δGT 1 + θ2 T,δL2 Learning-Rate-Free Stochastic Optimization over Riemannian Manifolds Where the final reduction holds since d 7 ζκ(d) is an increasing function. Now, recall that K = O log+ r T ϵ and note that PK k=1 rτk = O( r T ) since rτs rτK 1 2s (K 1) for all s K 1. The proof is complete noting that via Lemma A.3 we deduce f(ˆx T ) 1 T PT 1 t=0 f(xt). Now under smooth assumption, we present a result for bounding the stochastic term that depends on S. Lemma D.6. For all δ (0, 1), T N and S > 0, if Assumption 3.1, 3.2, and 3.4 hold, then the iterates of RDo G (Algorithm 1) satisfy ζκ( rs) s, exp 1 xs (x ) xs δ + P( s T > S), (86) where bt = 8 rt 1 ζκ( rt 1) dt 1 q 2Sθt,δ + 8Sθ2 t,δ q Pt 1 s=0 [f(xs) f(x )] and s T := maxk T s(xk). Proof. For 1 s T define the random variables Ys := rs 1 ds 1 p ζκ( rs 1) , Xs := s 1, exp 1 xs 1(x ) xs 1 , ˆXs := grad f(xs 1), exp 1 xs 1(x ) xs 1 , (87) and consider the filtration Fs = σ(G(x0) . . . , G(xs)). Then we have that Xs is a martingale difference sequence adapted to Fs and ˆXs Fs 1. By construction or any t T, we have s=1 Ys Xs = s=0 r2 s s, exp 1 xs (x ) xs. (88) Now we consider bounding, max{|Xt|, | ˆXt|} by a constant c. Moreover, by the Cauchy-Schwartz inequality and Lemma A.5 we have with probability P( s T > S) we have |Xs|2 s 1 2 xs 1 1 8S(f(xs 1) f(x )) (89) | ˆXs|2 grad f(xs 1) 2 xs 1 1 8S(f(xs 1) f(x )). (90) Thus we have that, s=0 [f(xs) f(x )] =: c (91) s=1 | ˆXs|2 s=0 [f(xs) f(x )] =: c. (92) Therefore, applying Lemma A.9 yields, rs ζκ( rs) s, exp 1 xs (x ) xs ζκ( rt 1) dt 1 v u u tθt,δ Xs ˆXs 2 + c2θ2 t,δ Now, finally noting s=1 (Xs ˆXs)2 s=0 grad f(xs) 2 xs 2S s=0 (f(xs) f(x )), (95) yields the result. Learning-Rate-Free Stochastic Optimization over Riemannian Manifolds Theorem D.7. For all δ (0, 1) and S > 0, if Assumption 3.1, 3.2, and 3.4 hold, then with probability at least 1 δ P( s T > S), for all t T, the optimality gap of the uniformly averaged iterates f(ˆx T ) f(x ) of RDo G (Algorithm 1) satisfy: (d0 log+ r T ϵ + r T )2ζκ(d0 + r T )θ2 T,δS T Proof. Similar to the non-smooth setting, define the times τs = min min{k| rk 2 rτk 1}, T , with τ0 := 0. Moreover, let K be the first index such that τK = T and note that K 1 + log2 r T ϵ by construction. 
Now using the argument of Lemma D.13, we have that for k K ζκ( rt) gt, exp 1 xt (x ) rτk 2 dτk + rτk ζκ( rτk)ζκ( dτk) p = O rτk(d0 + rτk) p rτk(d0 + rτk) t=0 [f(xt) f(x )] Where in the final inequality we have applied Lemma A.4. Now, by Lemma D.6 we have for all k K with probability at least 1 δ P( s T > L), ζκ( rt) t, exp 1 xt (x ) xt ζκ( rt) t, exp 1 xt (x ) xt ζκ( rt) t, exp 1 xt (x ) xt ζκ( rτk 1) dτk 1 q 2SθT,δ + 8Sθ2 T,δ t=0 [f(xt) f(x )]. (101) Now combining these two bounds, we have t=τk 1 f(xt) f(x ) 1 rτk 1/ p ζκ( rt) [f(xt) f(x )] (102) ζκ( rt) grad f(xt), exp 1 xt (x ) xt (103) = 1 rτk 1/ p gt, exp 1 xt (x ) xt + t, exp 1 xt (x ) xt (104) ζκ( rτk) rτk 1/ p ζκ( rτk 1)(d0 + rτk) p ζκ(d0 + rτk) q t=0 [f(xt) f(x )] (d0 + rτk) p ζκ(d0 + rτk) q t=0 [f(xt) f(x )] Where final reduction holds since d 7 d ζκ(d),and for any t, rt+1 rt + d(xt+1, xt) = rt 1 + gt xt Gt 2 rt. (107) Learning-Rate-Free Stochastic Optimization over Riemannian Manifolds Now summing over k from 1 to K we have t=0 [f(xt) f(x )] = t=τk 1 [f(xt) f(x )] (108) k=1 (d0 + rτk) p ζκ(d0 + rτk) q t=0 [f(xt) f(x )] k=1 (d0 + rτk) t=0 [f(xt) f(x )] Where the final reduction holds since d 7 ζκ(d) is an increasing function. Now, recall that K = O log+ r T ϵ and note that PK k=1 rτk = O( r T ) since rτs rτK 1 2s (K 1) for all s K 1. We then divide both sides through by q PT 1 t=0 [f(xt) f(x )], to yield t=0 [f(xt) f(x )] O (d0 r T /ϵ + r T ) p ζκ (d0 + r T ) q Sθ2 T,δ , (111) squaring both sides, the proof is complete noting that via Lemma A.3 we deduce f(ˆx T ) 1 T PT 1 t=0 f(xt). D.4. Iterate Stability Bound We introduce Tamed Riemannian Distance over Gradients (T-RDo G), a dampened version of RDo G (Algorithm 1) whose iterates are guaranteed to remain bounded with high probability. T-RDo G has the following step size scheme ζκ( rt)G t , G t = 84θ2 T,δ log2 + (1 + t) ℓ2 t ℓ2 0 (Gt 1 + 16 ℓ2 t), (112) using G 1 := 0 and recalling ℓt := maxs t ℓ(xs) for a function ℓsatisfying Assumption 3.3. To show iterate boundedness in the stochastic setting, we consider the stopping time Tout = min{t 0 : rt > 3d0}, (113) so that the event { r T 3d0} is the same as {Tout > T}. We also define the following truncated step size sequence, ηk := ηk I{k d2 0 Proof. Consider the filtration Ft = σ(G(x0) . . . , G(xt)) and define Xt = ηt t, exp 1 xt (x ) xt and ˆXt = ηt grad f(xt), exp 1 xt (x ) xt. Then we have that Xt is a martingale difference sequence adapted to Ft and ˆXt Ft 1. Learning-Rate-Free Stochastic Optimization over Riemannian Manifolds Moreover, we have max{|Xt|, | ˆXt|} c almost surely for c = 24d2 0 84θT,δ . Substituting into Lemma A.9, we have v u u tθt,δ k=0 (Xk ˆXk)2 + c2θ2 t,δ Noting that Xt ˆXt = ηt gt, exp 1 xt (x ) xt and substituting the definition of c and the bound gives for every t < T, v u u tθt,δ k=0 (Xk ˆXk)2 + c2θ2 t,δ 4 v u u tθt,δ 3 43d4 0 84θT,δ + 6θt,δd2 0 82p ζκ(3d0)θT,δ d2 0. (129) Finally, we show that the event defined in the previous lemma implies the desired distance bound. Lemma D.10. Suppose Assumption 3.1, 3.2, and 3.3 hold. If Pt 1 k=0 ηk k, exp 1 xk (x ) xk d2 0 for all t T then Tout > T i.e., rt 3d0. Proof. To condense notation, let Bt := maxt t Pt 1 k=0 ηk k, exp 1 xk (x ) xk, so the claim becomes Bt d2 0 implies Tout > t for all t T. We prove the claim by induction on t. The basis of the induction is that Tout > 0 always hold since r0 = ϵ 3d0 by assumption. For the induction step, we assume that Bt 1 implies Tout t and show that Bt d2 0 implies Tout > t. 
To that end, we use grad f(xt), exp 1 xt (x ) xt f(xt) f(x ) 0 to rearrange Lemma A.2 as d2 k+1 d2 k η2 kζκ(dk) gk 2 xk + 2ηk k, exp 1 xk (x ) xk (130) for all k. Summing from 0 k t 1, we have k=0 η2 kζκ(dk) gk 2 xk + 2 k=0 ηk k, exp 1 xk (x ) xk (131) k=0 η2 kζκ(dk) gk 2 xk + 2 k=0 ηk k, exp 1 xk (x ) xk. (132) where the equality holds since Tout > t 1 and therefore ηk = ηk for all 0 k t 1. Now, by previous lemma we have Pt 1 k=0 η2 kζκ(dk) gk 2 xk 12d2 0 84θT,δ d2 0. Moreover, by assumption we have Pt 1 k=0 ηk k, exp 1 xk (x ) xk Bt d2 0, from which we conclude, d2 t 4d2 0 and hence rt d0 + dt 3d0. Finally, since rt = max{ rt 1, rt} and rt 1 3d0 by the induction assumption, we have that rt 3d0. Theorem D.11. Suppose ϵ 3d0 and Assumption 3.1, 3.2, and 3.3 hold. Then for any δ (0, 1) and t N, under the T-RDo G step size sequence {ηt}, the iterates satisfy P( rt > 3d0) δ. Proof. A consequence of combining the previous two lemmas. Corollary D.12. Suppose that Assumption 3.1, 3.2, and 3.3 hold. For any δ (0, 1/2), t N, consider T iterations of T-RDo G, with an initial step size of ϵ 3d0. Then for τ arg maxt T Pt 1 s=0 rs/ ζκ( rt) we have, with probability at least f( xτ) f(x ) = O cδ,ϵ,T d0 p ζκ(d0)(Gτ 1 + L2 ) cδ,ϵ,T d0 p where L := maxx M:d(x,x0) 3d(x ,x0) ℓ(x) and cδ,ϵ,T = log+(T d0L f(x0) f(x )) log+( d0 ϵ ) log( log+(T ) Learning-Rate-Free Stochastic Optimization over Riemannian Manifolds Proof. Here we adapt Theorem D.3. Using that rt 3d0, we have ζκ(d0 + rt) ζκ(4d0) κ4d0 tanh( κ4d0) κ4d0 tanh( κd0) = O(ζκ(d0)) for κ > 0, otherwise ζκ(d0 + rt) = 1 = ζκ(d0) = O(ζκ(d0)) in the case κ = 0. Now, by Assumption 3.3 we have ℓ0 grad f(x0) x0 (f(x0) f(x ))/d0, while r T 3d0 gives ℓT L . Therefore, log+ 1 + T ℓ2 T ℓ2 0 O log+ T d0L f(x0) f(x ) . D.5. Omitting Geometric Curvature Term Analysis We analyze omitting the geometric curvature term from the denominator RDo G (Algorithm 1). Thus we consider step sizes of the form ηt = rt q Pt s=0 gs 2xs We term this algorithm Curvature Omitted Riemannian Distance over Gradients (CO-RDo G). We consider bounding the error of the weighted average sequence, xt+1 = exp xt rt Pt s=0 rs exp 1 xt (xt) For a geodesically convex function f : M R, we have by Jensens inequality (Lemma A.3) that xt satisfies, f( xt) f(x ) 1 Pt 1 s=0 rs s=0 rs grad f(xs), exp 1 xs (x ) xs. (135) Recalling gs is the stochastic oracle evaluation, G(xs), the numerator decomposes into two components: s=0 rs gs, exp 1 xs (x ) xs | {z } weighted regret s=0 rs s, exp 1 xs (x ) xs | {z } noise with s := gs grad f(xs). We give deterministic bounds for the weighted regret (Lemma D.13) and high probability bounds for the noise term (Lemma D.14). Lemma D.13. Under Assumption 3.1 and 3.2, the iterates of CO-RDo G, satisfy s=0 rs gs, exp 1 xs (x ) xs rt 2 dt + rtζκ( dt) p Gt 1. (137) Proof. Applying Lemma A.2, we can bound the weighted average as s=0 rs gs, exp 1 xs (x ) xs 1 d2 s d2 s+1 s=0 rsηsζκ(ds) gs 2 xs | {z } (B) We bound the terms (A) and (B) in turn, beginning with the former: Gs d2 s d2 s+1 = d2 0 p Gt 1 + d2 t Gt 1( d2 t d2 t) (ii) 4 rt dt p Gt 1. (140) Learning-Rate-Free Stochastic Optimization over Riemannian Manifolds Inequality (i) uses ds dt and that Gt is nondecreasing. Inequality (ii) use that for k arg maxs t ds, we have d2 t d2 t = d2 k d2 t = (dk dt)(dk + dt) d(xk, xt)(dk + dt) ( rk + rt)(dk + dt) 4 rt dt. Bounding the second term (B), using d 7 ζκ(d) is an increasing function, we have: r2 sζκ(ds) gs 2 xs Gs r2 sζκ( ds) gs 2 xs Gs r2 t ζκ( dt) gs 2 xs Gs 2 r2 t ζκ( dt) p Gt 1. 
(141) Thus, combining (A) and (B) together, gives the result. Lemma D.14. For all δ (0, 1), T N and L > 0, if Assumption 3.1, 3.2, and 3.3 hold, the iterates of CO-RDo G satisfy s=0 rs s, exp 1 xs (x ) xs δ + P( ℓT > L), (142) where bt = 8 rt 1 dt 1 q θt,δGt 1 + θ2 t,δL2 and ℓT := maxs T ℓ(xs). Proof. For 1 s T define the random variables Ys := rs 1 ds 1, Xs := s 1, exp 1 xs 1(x ) xs 1 , ˆXs := grad f(xs 1), exp 1 xs 1(x ) xs 1 . (143) By the Cauchy-Schwartz inequality and Assumption 3.3 we have each |Xs| ℓ(x), and each | ˆXs| ℓ(x) with probability 1. Moreover, for any t T, we have s=1 Ys Xs = s=0 rs s, exp 1 xs (x ) xs. (144) Therefore, applying Lemma A.9 yields the result as required. Combining the above results, we obtain the following. Theorem D.15. For all δ (0, 1) and L > 0, if Assumption 3.1, 3.2, and 3.3 hold, then with probability at least 1 δ P( ℓT > L), for all t T, the optimality gap on the weighted iterates f( xt) f(x ) of CO-RDo G satisfy (d0 + rt)ζκ(d0 + rt) q Gt 1 + θt,δGt 1 + θ2 t,δL2 Pt 1 s=0 rs/ rt with probability at least 1 δ P( ℓT > L). Proof. Combining Lemma D.13 and Lemma D.14 and utilizing dt d0 + rt and that d 7 ζκ(d) is an increasing function yields the result as required. Thus in comparison to standard RDo G, we pay an additional cost of O p ζκ(d0 + rt) for omitting the geometric curvature term with CO-RDo G. We then have a useful result when the manifold is bounded but its exact diameter is unknown. Corollary D.16. Under Assumption 3.1, 3.2, and 3.3, for any D d0 let LD := maxx M:d(x,x0) D ℓ(x). Then, for all δ (0, 1) and for τ arg maxt T Pt 1 s=0 rs/ ζκ( rt) , with probability at least 1 δ P( ℓT > L), iterates of CO-RDo G satisfy the optimality gap bound f( xτ) f(x ) = O Gτ 1θτ,δ + L2 Dθ2 τ,δ T log+(D/ϵ) Learning-Rate-Free Stochastic Optimization over Riemannian Manifolds Proof. Apply Lemma A.6 to the denominator term of Theorem D.15. Thus in comparison to standard RDo G, we pay an additional cost of O p ζκ(D) for omitting the curvature term with CO-RDo G. E. RDo WG Theoretical Analysis E.1. Overview In this section, we analyze RDo WG (Algorithm 2). Thus we consider RSGD with step sizes given by, ηt = r2 t ζκ( rt) vt , vt = vt 1 + r2 t ζκ( rt) gt 2 xt, v 1 = 0. (147) We consider the bounding the error of the weighted average sequence, xt+1 = exp xt r2 t /ζκ( rt) Pt s=0 r2s/ζκ( rs) exp 1 xt (xt) For a geodesically convex function f : M R, we have by Lemma A.3 that xt satisfies, f( xt) f(x ) 1 Pt 1 s=0( r2s/ζκ( rs)) s=0 ( r2 s/ζκ( rs)) grad f(xs), exp 1 xs (x ) xs. (148) Recalling that gs represents the stochastic oracle evaluation at xs, denoted as G(xs), we can decompose the numerator into two components: s=0 ( r2 s/ζκ( rs)) gs, exp 1 xs (x ) xs | {z } weighted regret s=0 ( r2 s/ζκ( rs)) s, exp 1 xs (x ) xs | {z } noise with s := gs grad f(xs). E.2. Supporting Analysis Our first result gives deterministic bounds for the weighted regret (Lemma E.2). Lemma E.1. Under Assumption 3.1 and 3.2, we have that the iterates of RDo WG (Algorithm 2) satisfy s=0 ( r2 s/ζκ( rs)) gs, exp 1 xs (x ) xs rt 2 dt + rt ζκ( rt)ζκ( dt) vt 1. (150) Proof. Follow same argument as Lemma D.1 but with weights r2 s ζκ( rs) replacing rs ζκ( rs) and weighted gradient sum vs replacing the standard gradient sum Gs. E.3. Non-Smooth Analysis We give deterministic bounds for the weighted regret (Lemma E.2) and high probability bounds for the noise term (Lemma E.3) for the non-smooth setting. Lemma E.2. 
Under Assumption 3.1 and 3.2, we have that the iterates of RDo WG (Algorithm 2) satisfy s=0 ( r2 s/ζκ( rs)) gs, exp 1 xs (x ) xs r2 t p 2 dt + rt ζκ( rt)ζκ( dt) p Gt 1. (151) Learning-Rate-Free Stochastic Optimization over Riemannian Manifolds Proof. Using the bound of E.2 and that vt 1 rt 1 ζκ( rt 1) p Gt 1 gives the result. Lemma E.3. For all δ (0, 1), T N and L > 0, if Assumption 3.1, 3.2, and 3.3 hold, the iterates of RDo WG (Algorithm 2) satisfy r2 s ζκ( rs) s, exp 1 xs (x ) xs δ + P( ℓT > L), (152) where bt = 8 r2 t 1 ζκ( rt 1) dt 1 q θt,δGt 1 + θ2 t,δL2 and ℓT := maxs T ℓ(xs). Proof. Following the argument of Lemma D.2. Combining the above results, we obtain the following. Theorem E.4. For all δ (0, 1) and L > 0, if Assumption 3.1, 3.2, and 3.3 hold, then with probability at least 1 δ P( ℓT > L), for all t T, the optimality gap on the weighted iterates f( xt) f(x ) of RDo WG (Algorithm 2) satisfy (d0 + rt) p ζκ(d0 + rt) q Gt 1 + θt,δGt 1 + θ2 t,δL2 Pt 1 s=0 r2s/ζκ( rs) r2 t /ζκ( rt) Proof. Using Lemma E.2 and Lemma E.3 we have f( xt) f(x ) ζκ( rt) + rt ζκ( rt)ζκ( dt) p Gt 1 + 8 dt q θt,δGt 1 + θ2 t,δL2 Pt 1 s=0 r2s/ζκ( rs) r2 t /ζκ( rt) . (154) Now using the fact dt d0 + rt and that d 7 ζκ(d) and d 7 d ζκ(d) are increasing functions gives the result. Corollary E.5. Suppose Assumption 3.1, 3.2, and 3.3 hold, and for any D d0 let LD := maxx M:d(x,x0) D ℓ(x). Then, for all δ (0, 1) and for τ arg maxt T Pt 1 s=0 rs/ ζκ( rt) , with probability at least 1 δ P( ℓT > L), iterates of RDo WG (Algorithm 2) satisfy the optimality gap bound f( xτ) f(x ) = O Gτ 1θτ,δ + L2 Dθ2 τ,δ T log+ Proof. Apply Lemma A.6 to the denominator term of Theorem E.4. E.4. Smooth Analysis Lemma E.6. Suppose f is S-smooth and assume Assumption 3.1 and 3.2 hold. Then we have that the iterates of RDo WG (Algorithm 2) satisfy s=0 ( r2 s/ζκ( rs)) gs, exp 1 xs (x ) xs rt 2 dt + rt ζκ( rs)ζκ( dt) v u u t2S r2s ζκ( rs)(f(xs) f(x )). (156) Proof. By smoothness we can use Lemma A.4 to deduce grad f(x) 2 x 2S(f(x) f(x )) for all x M. Therefore r2 s ζκ( rs) grad f(xs) 2 xs 2S r2 s ζκ( rs)(f(xs) f(x )). (157) Taking square roots and substituting this into Lemma E.2 gives the result. Learning-Rate-Free Stochastic Optimization over Riemannian Manifolds Lemma E.7. Suppose Assumption 3.1, 3.2, and 3.4 hold. Then for all δ (0, 1), T N and S > 0, Then we have that the iterates of RDo WG (Algorithm 2) satisfy r2 s ζκ( rs) s, exp 1 xs (x ) xs δ + P( s T > S), (158) where bt = 8 rt 1 ζκ( rt 1) dt 1 q 2Sθt,δ + 8Sθ2 t,δ q Pt 1 s=0 r2s ζκ( rs) [f(xs) f(x )] and s T := maxt T s(xt). Proof. Define for 1 s T the following random variables as Ys := rs 1 p ζκ( rs 1) ds 1, Xs := rs 1 p s 1, exp 1 xs 1(x ) xs 1 , (159) ˆXs := rs 1 p grad f(xs 1), exp 1 xs 1(x ) xs 1 , (160) and follow similar argument to Lemma D.6. Combining the above results, we obtain the following. Theorem E.8. Suppose Assumption 3.1, 3.2, and 3.4 hold. Then for all δ (0, 1) and S > 0, with probability at least 1 δ P( s T > S), for all t T, the optimality gap on the weighted iterates f( xt) f(x ) of RDo WG (Algorithm 2) satisfy (d0 + rt)2ζκ(d0 + rt)(Sθ2 t,δ) Pt 1 s=0 r2s/ζκ( rs) r2 t /ζκ( rt) Proof. Using Lemma E.20 and Lemma E.21 above, we have with the relevant probabilistic conditions, r2 s ζκ( rs)[f(xs) f(x )] 2 dt + rt ζκ( rt)ζκ( dt) + 8 rt p ζκ( rt) dt q 2Sθt,δ + 8Sθ2 t,δ r2s ζκ( rs) [f(xs) f(x )]. Now if f(xs) f(x ) = 0 for some iterate, then the statement is trivial. 
Otherwise diving by sides by the square root term, we have v u u t r2s ζκ( rs) [f(xs) f(x )] 2 dt + rt ζκ( rt)ζκ( dt) + 8 rt p ζκ( rt) dt q 2Sθt,δ + 8Sθ2 t,δ We square both sides and divide through by r2 s ζκ( rs). Finally using the fact, dt d0 + rt, in the above bound gives the result since d 7 ζκ(d) and d 7 d ζκ(d) are increasing functions. We then have a useful result when the manifold is bounded but its exact diameter is unknown. Corollary E.9. Under Assumption 3.1, 3.2, and 3.4 hold, for any D d0 let SD := maxx M:d(x,x0) D s(x). Then, for all δ (0, 1) and for τ arg maxt T Pt 1 s=0 r2 s/ζκ( rs) r2 t /ζκ( rt), with probability at least 1 δ P( s T > S), iterates of Algorithm 1 satisfy the optimally gap on the weighted iterates f( xτ) f(x ) of RDo WG (Algorithm 2) satisfy D2ζκ(D)SDθ2 τ,δ T log+ Proof. Apply Lemma A.6 to the denominator term of Corollary E.23. Learning-Rate-Free Stochastic Optimization over Riemannian Manifolds E.5. Iterate Stability Bound We introduce Tamed Riemannian Distance over Weighted Gradients (T-RDo WG), a dampened version of RDo WG (Algorithm 2) whose iterates are guaranteed to remain bounded with high probability. T-RDo WG has the following step size scheme vt = vt 1 + r2 t ζκ( rt) gt 2 xt, v 1 = 0, (164) ηt = r2 t ζκ( rt) p v t , v t = 84θ2 T,δ log2 + (1 + t) r2 t ℓ2 t/ζκ( rt) r2 0 ℓ2 0/ζκ( r0) (vt 1 + 16 r2 t ζκ( rt) ℓ2 t). (165) To show iterate boundedness in the stochastic setting, we consider the stopping time Tout = min{t 0 : rt > 3d0}, (166) so that the event { r T 3d0} is the same as {Tout > T}. We also define the following truncated step size sequence, ηk := ηk I{k d2 0 Proof. Consider the filtration Ft = σ(G(x0) . . . , G(xt)) and define Xt = ηt t, exp 1 xt (x ) xt and ˆXt = ηt grad f(xt), exp 1 xt (x ) xt. Then we have that Xt is a martingale difference sequence adapted to Ft and ˆXt Ft 1. Moreover, we have max{|Xt|, | ˆXt|} c almost surely for c = 24d2 0 84θT,δ . Substituting into Lemma A.9, we have v u u tθt,δ k=0 (Xk ˆXk)2 + c2θ2 t,δ Noting that Xt ˆXt = ηt gt, exp 1 xt (x ) xt and substituting the definition of c and the bound gives for every t < T, v u u tθt,δ k=0 (Xk ˆXk)2 + c2θ2 t,δ 4 v u u tθt,δ 3 43d4 0 84θT,δ + 6θt,δd2 0 82p ζκ(3d0)θT,δ d2 0. (184) Learning-Rate-Free Stochastic Optimization over Riemannian Manifolds Lemma E.12. Suppose Assumption 3.1, 3.2, and 3.3 hold. If Pt 1 k=0 ηk k, exp 1 xk (x ) xk d2 0 for all t T then Tout > T i.e., rt 3d0. Proof. To condense notation, let Bt := maxt t Pt 1 k=0 ηk k, exp 1 xk (x ) xk, so the claim becomes Bt d2 0 implies Tout > t for all t T. We prove the claim by induction on t. The basis of the induction is that Tout > 0 always hold since r0 = ϵ 3d0 by assumption. For the induction step, we assume that Bt implies Tout t and show that Bt d2 0 implies Tout > t. To that end, we use grad f(xt), exp 1 xt (x ) xt f(xt) f(x ) 0 to rearrange Lemma A.2 as d2 k+1 d2 k η2 kζκ(dk) gk 2 xk + 2ηk k, exp 1 xk (x ) xk (185) for all k. Summing from 0 k t 1, we have k=0 η2 kζκ(dk) gk 2 xk + 2 k=0 ηk k, exp 1 xk (x ) xk (186) k=0 η2 kζκ(dk) gk 2 xk + 2 k=0 ηk k, exp 1 xk (x ) xk. (187) where the equality holds since Tout t and therefore ηk = ηk for all 0 k t 1. Now, by previous lemma we have Pt 1 k=0 η2 kζκ(dk) gk 2 xk 12d2 0 84θT,δ d2 0. Moreover, by assumption we have Pt 1 k=0 ηk k, exp 1 xk (x ) xk Bt d2 0, from which we conclude, d2 t 4d2 0 and hence rt d0 + dt 3d0. Finally, since rt = max{ rt 1, rt} and rt 1 3d0 by the induction assumption, we have that rt 3d0. Theorem E.13. 
Suppose Assumption 3.1, 3.2, and 3.3 hold, and ϵ 3d0. Then for any δ (0, 1) and t N, under the T-RDo WG step size sequence {ηk}, the iterates satisfy P( rt > 3d0) δ. Proof. A consequence of combining the previous two lemmas. Corollary E.14. Suppose that Assumption 3.1, 3.2, and 3.3 hold. For any δ (0, 1/2), t N, consider T iterations of {ηk}, with initial step size of ϵ 3d0. Then for τ arg maxt T Pt 1 s=0 r2 s/ζκ( rs) r2 t /ζκ( rt) we have, with probability at least 1 2δ, that f( xτ) f(x ) = O cδ,ϵ,T d0 p ζκ(d0)(Gτ + L2 ) cδ,ϵ,T d0 p where L := maxx M:d(x,x0) 3d(x ,x0) ℓ(x) and cδ,ϵ,T = log+(T d0L f(x0) f(x )) log+( d0 ϵ ) log( log+(T ) Proof. Here we adapt theorem Theorem E.4. Using that rt 3d0, we have ζκ(d0 + rt) ζκ(4d0) κ4d0 tanh( κ4d0) κ4d0 tanh( κd0) = O(ζκ(d0)) for κ > 0, otherwise ζκ(d0 + rt) = 1 = ζκ(d0) = O(ζκ(d0)) in the case κ = 0. Now, by Assumption 3.3 we have ℓ0 grad f(x0) x0 (f(x0) f(x ))/d0, while r T 3d0 gives ℓT L . Therefore, log+ 1 + T ℓ2 T ℓ2 0 = O log+ T d0L f(x0) f(x ) . E.6. Omitting Geometric Curvature Term Analysis We analyze omitting the geometric curvature term from the denominator RDo WG (Algorithm 2). Thus we consider step sizes of the form ηt = r2 t vt , vt = vt 1 + r2 t gs 2 xs, v 1 = 0. (189) We term this algorithm Curvature Omitted Riemannian Distance over Weighted Gradients (CO-RDo WG). We consider the bound the error of the weighted average sequence, xt+1 = exp xt r2 t Pt s=0 r2s exp 1 xt (xt) Learning-Rate-Free Stochastic Optimization over Riemannian Manifolds For a geodesically convex function f : M R, we have by Lemma A.3 that xt satisfies, f( xt) f(x ) 1 Pt 1 s=0 r2s s=0 r2 s grad f(xs), exp 1 xs (x ) xs. (190) Recalling gs is the stochastic oracle evaluation, G(xs), the numerator decomposes into two components: s=0 r2 s gs, exp 1 xs (x ) xs | {z } weighted regret s=0 r2 s s, exp 1 xs (x ) xs | {z } noise with s := gs grad f(xs). SUPPORTING ANALYSIS We give a deterministic bound for the weighted regret (Lemma E.15). Lemma E.15. Under Assumption 3.1 and 3.2, the iterates of CO-RDo WG, satisfy s=0 r2 s gs, exp 1 xs (x ) xs rt 2 dt + rtζκ( dt) vt 1. (192) Proof. Follow the same argument as Lemma D.13 but with weights r2 s replacing rs and weighted gradient sum vs replacing the standard gradient sum Gs. NON-SMOOTH ANALYSIS We give deterministic bounds for the weighted regret (Lemma E.16) and high probability bounds for the noise term (Lemma E.17) in the non-smooth setting. Lemma E.16. Under Assumption 3.1 and 3.2, the iterates of CO-RDo WG, satisfy s=0 r2 s gs, exp 1 xs (x ) xs r2 t 2 dt + rtζκ( dt) p Gt 1. (193) Proof. Using the bound of E.15 and that vt 1 rt 1 p Gt 1 gives the result. Lemma E.17. For all δ (0, 1), T N and L > 0, if Assumption 3.1, 3.2, and 3.3 hold, the iterates of CO-RDo WG satisfy s=0 r2 s s, exp 1 xs (x ) xs δ + P( ℓT > L), (194) where bt = 8 r2 t 1 dt 1 q θt,δGt 1 + θ2 t,δL2 and ℓT := maxs T ℓ(xs). Proof. Follow argument of Lemma D.14. Combining the above results, we obtain the following. Theorem E.18. For all δ (0, 1) and L > 0, if Assumption 3.1, 3.2, and 3.3 hold, then with probability at least 1 δ P( ℓT > L), for all t T, the optimality gap on the weighted iterates f( xt) f(x ) of CO-RDo WG satisfy (d0 + rt)ζκ(d0 + rt) q Gt 1 + θt,δGt 1 + θ2 t,δL2 Pt 1 s=0 r2s/ r2 t Learning-Rate-Free Stochastic Optimization over Riemannian Manifolds Proof. Combine Lemma E.16 and Lemma E.17 and use the fact dt d0 + rt. 
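To make the distinction between the two schedules concrete, the following is a minimal sketch (our own illustration, not the implementation used in the experiments) of the RDoWG and CO-RDoWG step-size updates from Equations (147) and (189). The geometric curvature function is implemented directly from its definition, and the small helper classes and their names are our own choices.

```python
import math

def zeta_kappa(kappa, d):
    # Geometric curvature function: sqrt(|kappa|) d / tanh(sqrt(|kappa|) d) for kappa < 0, else 1.
    if kappa >= 0.0 or d == 0.0:
        return 1.0
    s = math.sqrt(abs(kappa)) * d
    return s / math.tanh(s)

class RDoWGStepSize:
    """RDoWG step size (Equation (147)): the geometric curvature term is kept in the weights."""
    def __init__(self, eps, kappa):
        self.r_bar = eps    # running maximum distance from x_0, initialised at eps
        self.v = 0.0        # weighted gradient-norm accumulator, v_{-1} = 0
        self.kappa = kappa  # lower bound on the sectional curvature

    def step(self, dist_to_x0, grad_norm_sq):
        self.r_bar = max(self.r_bar, dist_to_x0)
        w = self.r_bar**2 / zeta_kappa(self.kappa, self.r_bar)
        self.v += w * grad_norm_sq
        return w / math.sqrt(self.v)

class CORDoWGStepSize:
    """CO-RDoWG step size (Equation (189)): the curvature term is omitted from the weights."""
    def __init__(self, eps):
        self.r_bar = eps
        self.v = 0.0

    def step(self, dist_to_x0, grad_norm_sq):
        self.r_bar = max(self.r_bar, dist_to_x0)
        w = self.r_bar**2
        self.v += w * grad_norm_sq
        return w / math.sqrt(self.v)
```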
Thus in comparison to standard RDo WG, we pay an additional cost of O p ζκ(d0 + rt) for omitting the geometric curvature term with CO-RDo WG. We then have a useful result when the manifold is bounded but its exact diameter is unknown. Corollary E.19. Under Assumption 3.1, 3.2, and 3.3, for any D d0 let LD := maxx M:d(x,x0) D ℓ(x). Then, for all δ (0, 1) and for τ arg maxt T Pt 1 s=0 r2 s r2 t , with probability at least 1 δ P( ℓT > L), iterates of CO-Do WG satisfy the optimality gap bound f( xτ) f(x ) = O Gτ 1θτ,δ + L2 Dθ2 τ,δ T log+(D/ϵ) Proof. Apply Lemma A.6 to the denominator term of Theorem E.18. Thus in comparison to standard RDo WG, we pay an additional cost of O p ζκ(D) for omitting the curvature term with CO-RDo WG. SMOOTH ANALYSIS Lemma E.20. Suppose f is S-smooth and assume Assumption 3.1 and 3.2 hold. Then we have that the iterates of CO-RDo WG satisfy s=0 r2 s gs, exp 1 xs (x ) xs rt 2 dt + rtζκ( dt) v u u t2S s=0 r2s(f(xs) f(x )). (197) Proof. By smoothness we can use Lemma A.4 to deduce grad f(x) 2 x 2S(f(x) f(x )) for all x M. Therefore s=0 r2 s grad f(xs) 2 xs 2S s=0 r2 s(f(xs) f(x )). (198) Taking square roots and substituting this into Lemma A.4 gives the result. Lemma E.21. Suppose Assumption 3.1, 3.2, and 3.4 hold. Then for all δ (0, 1), T N and S > 0, Then we have that the iterates of CO-RDo WG satisfy s=0 r2 s s, exp 1 xs (x ) xs δ + P( s T > S), (199) where bt = 8 rt 1 dt 1 q 2Sθt,δ + 8Sθ2 t,δ q Pt 1 s=0 r2s [f(xs) f(x )] and s T := maxt T s(xt). Proof. Follow argument of Lemma E.7. Combining the above results, we obtain the following. Theorem E.22. For all δ (0, 1) and S > 0, if Assumption 3.1, 3.2, and 3.4 hold, then with probability at least 1 δ P( s T > S), for all t T, CO-RDo WG satisfies the optimality gap f( xt) f(x ) of ((d0 + rt)ζκ(d0 + rt))2 (Sθ2 t,δ) Pt 1 s=0 r2s r2 t Learning-Rate-Free Stochastic Optimization over Riemannian Manifolds Proof. Using Lemma E.20 and Lemma E.21 above, we have with the relevant probabilistic conditions, s=0 r2 s[f(xs) f(x )] 2S rt 2 dt + rtζκ( dt) + 8 rt dt q 2Sθt,δ + 8Sθ2 t,δ v u u t s=0 r2s [f(xs) f(x )]. (201) Now if f(xs) f(x ) = 0 for some iterate, then the statement is trivial. Otherwise diving by sides by the square root term, we have v u u t s=0 r2s [f(xs) f(x )] 2S rt 2 dt + rtζκ( dt) + 8 rt dt q 2Sθt,δ + 8Sθ2 t,δ . (202) We square both sides and divide through by Pt 1 s=0 r2 s, 1 Pt 1 s=0 r2s s=0 r2 s [f(xs) f(x )] O 2 dt + rtζκ( dt) 2 (Sθ2 t,δ) Pt 1 s=0 r2s r2 t Now using the fact, dt d0 + rt, in the above bound gives the result. Thus in comparison to standard RDo WG, we pay an additional cost of O p ζκ(d0 + rt) for omitting the geometric curvature term with CO-RDo WG. We then have a useful result when the manifold is bounded but its exact diameter is unknown. Corollary E.23. Under Assumption 3.1, 3.2, and 3.4, for any D d0 let SD := maxx M:d(x,x0) D s(x). Then, for all δ (0, 1) and for τ arg maxt T Pt 1 s=0 r2 s r2 t , with probability at least 1 δ P( s T > S), the iterates of CO-RDo WG satisfies the optimally gap f( xτ) f(x ) of D2ζκ(D)2SDθ2 τ,δ T log+(D/ϵ) Proof. Apply Lemma A.6 to the denominator term of Theorem E.22. Thus in comparison to standard RDo WG, we pay an additional cost of O p ζκ(D) for omitting the curvature term with CO-RDo WG. F. NRDo G Overview Here we present a learning-rate-free schedule for NRSGD: Normalized Riemannian Distance over Gradients (NRDo G). Algorithm 4 NRDo G Input: initial point x0, initial estimate ϵ > 0, G 1 = 0. 
for t = 0 to T − 1 do
  g_t = G(x_t)
  r̄_t = max(ϵ, max_{s ≤ t} d(x_s, x_0))
  η_t = r̄_t / √((t + 1) ζ_κ(r̄_t))
  x_{t+1} = exp_{x_t}(−η_t g_t / ‖g_t‖_{x_t})
end for

G. Geometry of Specific Riemannian Manifolds

G.1. Sphere Manifold

The sphere manifold S^{d−1} := {x ∈ R^d : ‖x‖ = 1} is an embedded submanifold of R^d with tangent space T_x S^{d−1} = {v ∈ R^d : x^T v = 0}. The Riemannian metric is the Euclidean inner product, ⟨·,·⟩_x = ⟨·,·⟩. The exponential map is exp_x(v) = cos(‖v‖) x + sin(‖v‖) v/‖v‖, with inverse exponential map exp^{−1}_x(y) = arccos(x^T y) Proj_x(y − x)/‖Proj_x(y − x)‖, where Proj_x(v) = v − (x^T v)x is the orthogonal projection of any v ∈ R^d onto the tangent space T_x S^{d−1}. Following the Pymanopt implementation (Townsend et al., 2016), parallel transport is approximated by the projection operation, i.e., Γ^y_x v ≈ Proj_y(v).

G.2. Grassmann Manifold

The Grassmann manifold G(d, r) is the set of all r-dimensional subspaces of R^d (d ≥ r). Each point on the Grassmann manifold can be represented by a matrix x ∈ R^{d×r} with orthonormal columns, x^T x = I, and two points x, y are equivalent if x = yo for some r × r orthogonal matrix o. For our implementation of the exponential map, inverse exponential map, and parallel transport, we directly translate the Pymanopt code (Townsend et al., 2016) from NumPy to JAX.

G.3. Poincaré Manifold

The Poincaré manifold of dimension d is the open d-dimensional unit ball B^d := {x ∈ R^d : ‖x‖ < 1}, equipped with the Riemannian metric ⟨·,·⟩_x = 4/(1 − ‖x‖^2)^2 ⟨·,·⟩. The Möbius addition of x and y in B^d is defined as (Ungar, 2008)

x ⊕ y := [(1 + 2⟨x, y⟩ + ‖y‖^2) x + (1 − ‖x‖^2) y] / [1 + 2⟨x, y⟩ + ‖x‖^2 ‖y‖^2].

Defining the conformal factor λ_x := 2/(1 − ‖x‖^2), the exponential map is given by exp_x(v) = x ⊕ (tanh(λ_x ‖v‖ / 2) v/‖v‖) and the inverse exponential map by exp^{−1}_x(y) = (2/λ_x) tanh^{−1}(‖−x ⊕ y‖) (−x ⊕ y)/‖−x ⊕ y‖. Parallel transport can also be given in closed form (see Ungar, 2008, for further details).
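To make the use of these geometric primitives concrete, the following is a minimal sketch (our own illustration, not the implementation used in the experiments) of RDoG-style updates for Rayleigh quotient maximization on the sphere, i.e., the step-size rule of Algorithm 1 without the weighted iterate averaging used in the analysis. It combines the exponential map and tangent projection from Appendix G.1 with the distance-over-gradients step size; since the sphere has non-negative curvature, the correction ζ_κ is identically one and drops out. The matrix A, the noise level, and the iteration count are placeholder choices.

```python
import jax.numpy as jnp
from jax import random

def proj(x, v):
    # Orthogonal projection of v onto the tangent space T_x S^{d-1}.
    return v - (x @ v) * x

def exp_map(x, v, tiny=1e-12):
    # Sphere exponential map: cos(||v||) x + sin(||v||) v / ||v||.
    nv = jnp.linalg.norm(v)
    safe = jnp.maximum(nv, tiny)
    return jnp.where(nv < tiny, x, jnp.cos(nv) * x + jnp.sin(nv) * v / safe)

def sphere_dist(x, y):
    # Geodesic distance on the unit sphere.
    return jnp.arccos(jnp.clip(x @ y, -1.0, 1.0))

def rdog_rayleigh(A, x0, eps=1e-4, n_steps=500, noise=0.0, seed=0):
    """RDoG-style Rayleigh quotient maximization: maximize x^T A x over the unit sphere."""
    key = random.PRNGKey(seed)
    x, r_bar, G = x0, eps, 0.0
    for _ in range(n_steps):
        key, sub = random.split(key)
        g = proj(x, -2.0 * (A @ x))                           # Riemannian gradient of f(x) = -x^T A x
        g = g + noise * proj(x, random.normal(sub, x.shape))  # optional stochastic perturbation
        r_bar = jnp.maximum(r_bar, sphere_dist(x, x0))        # running maximum distance from x_0
        G = G + g @ g                                         # running sum of squared gradient norms
        eta = r_bar / jnp.sqrt(G)                             # distance-over-gradients step size (zeta_kappa = 1)
        x = exp_map(x, -eta * g)
        x = x / jnp.linalg.norm(x)                            # re-normalize to guard against round-off
    return x
```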
H. Additional Numerical Results

H.1. Rayleigh Quotient Maximization on the Sphere

In this section, we provide additional results for the Rayleigh quotient maximization discussed in Section 5.1, with a consistent setup across d = 1000 dimensions. The plots in Figure 5 emphasize the learning-rate-free adaptability and insensitivity to the choice of the initial distance estimate, ϵ ∈ [10^-8, 10^0], for RDoG, RDoWG, and NRDoG, particularly after a few hundred iterations. In contrast, we observe a notable impact on the performance of RSGD due to the choice of the learning rate, η ∈ [10^-8, 10^0]. This sensitivity in regret also influences solution quality, as illustrated in Figure 6. We proceed to evaluate the algorithms for various numbers of iterations T ∈ {100, 500, 1000, 2000}, showcasing regret in Figure 7 and geodesic distance to a numerically computed optimum in Figure 8 for different learning rates, η ∈ [10^-8, 10^6], for RSGD and RADAM. Additionally, we explore different initial distance estimates, ϵ ∈ [10^-8, 10^-1], for RDoG, RDoWG, and NRDoG. Notably, for T = 100 iterations the initial distance estimate does impact the algorithms, but after T = 500 iterations the effect becomes insensitive over several orders of magnitude, as shown in Figure 7 and Figure 8. Conversely, RADAM and RSGD exhibit a requirement for careful tuning.

H.2. PCA on the Grassmann Manifold

In this section, we present additional results for the PCA on the Grassmann manifold discussed in Section 5.2, maintaining a consistent experimental setup. In Table 1, we observe that while RSGD exhibits sensitivity to the learning rate η ∈ [10^-8, 10^0], RDoG, RDoWG, and NRDoG quickly adapt and achieve performance comparable to the best learning rate for RSGD within 500 iterations, irrespective of the chosen initial distance ϵ ∈ [10^-8, 10^0]. This adaptability is further evident in Table 2, where we consider halting the algorithms after T ∈ {100, 500, 1000, 2000} iterations and comparing the geodesic distance of the optimizer's output from the numerical solution. Discrepancies are noticeable for T = 100, but these diminish for T = 500 iterations and beyond.

Figure 5. Supplementary results for Rayleigh quotient maximization on the sphere (Section 5.1). The plots depict regret as a function of the iteration, considering various learning rates, with panels (c) RDoWG and (d) NRDoG. Results are averaged over ten random replications. The optimal RSGD is chosen based on minimizing the regret after 5000 iterations. Note that (a) and (b) are equivalent to Figure 2 (b) and (c) respectively.

Figure 6. Supplementary results for Rayleigh quotient maximization on the sphere (Section 5.1). The plots display the geodesic distance from an optimum as a function of the iteration, considering various learning rates. Results are averaged over ten replicates with different initial points. The optimal RSGD is selected based on minimizing the geodesic distance from the optimum after 5000 iterations.

H.3. Embedding Graphs in the Poincaré Embeddings

In this section, we provide supplementary results concerning Poincaré embeddings, as detailed in Section 5.3. Table 3 maintains a consistent experimental framework with the main paper, focusing on five-dimensional embeddings. The top-left section of the table corresponds to Figure 4(a) in the main paper, serving as a reference for comparison. In the bottom-left segment, we explore the performance of RADAM and RSGD without the burn-in heuristic, which results in inferior performance. Notably, our optimizers demonstrate robustness, eliminating the need for such heuristics. In the left-hand column, we investigate the impact of omitting the curvature term from the learning rates. For RDoG and RDoWG, the curvature omission corresponds to CO-RDoG (Appendix D.5) and CO-RDoWG (Appendix E.6). This omission leads to a performance decrease for NRDoG and RDoWG, while the performance of RDoG is unaffected. Table 4 adheres to a consistent experimental framework for the two-dimensional embeddings outlined in the main paper. In the right-hand column, we discern meaningful groupings across various categories without resorting to burn-in heuristics for RDoG, RDoWG, and NRDoG. Conversely, in the left-hand column, we emphasize the pivotal role of geometric curvature in governing step sizes; its absence results in inferior groupings. This discrepancy is reflected in the mean average precision.
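As a companion to the Poincaré embedding discussion above, the following is a minimal sketch (our own illustration, not the experiment code) of the Poincaré-ball operations from Appendix G.3 that these embeddings rely on; the numerical guard `tiny` is our own implementation choice.

```python
import jax.numpy as jnp

def mobius_add(x, y):
    # Möbius addition on the Poincaré ball (Ungar, 2008).
    xy = jnp.dot(x, y)
    nx2, ny2 = jnp.dot(x, x), jnp.dot(y, y)
    num = (1.0 + 2.0 * xy + ny2) * x + (1.0 - nx2) * y
    den = 1.0 + 2.0 * xy + nx2 * ny2
    return num / den

def conformal_factor(x):
    # lambda_x = 2 / (1 - ||x||^2).
    return 2.0 / (1.0 - jnp.dot(x, x))

def poincare_exp(x, v, tiny=1e-12):
    # exp_x(v) = x (+) tanh(lambda_x ||v|| / 2) v / ||v||.
    nv = jnp.maximum(jnp.linalg.norm(v), tiny)
    scaled = jnp.tanh(conformal_factor(x) * nv / 2.0) * v / nv
    return mobius_add(x, scaled)

def poincare_log(x, y, tiny=1e-12):
    # exp^{-1}_x(y) = (2 / lambda_x) arctanh(||(-x) (+) y||) ((-x) (+) y) / ||(-x) (+) y||.
    u = mobius_add(-x, y)
    nu = jnp.maximum(jnp.linalg.norm(u), tiny)
    return (2.0 / conformal_factor(x)) * jnp.arctanh(nu) * u / nu

# Example: a step along a tangent vector v with step size eta is poincare_exp(x, -eta * v).
```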
Figure 7. Supplementary results for Rayleigh quotient maximization (Section 5.1). Each plot illustrates the regret after the algorithm is halted for the specified number of iterations, with panels (a) T = 100, (b) T = 500, (c) T = 1000, and (d) T = 2000. Results are averaged over ten replicates with different initial points.

Figure 8. Supplementary results for Rayleigh quotient maximization on the sphere (Section 5.1). Each plot illustrates the geodesic distance to a numerically computed optimum after the algorithm is halted for the specified number of iterations, with panels (a) T = 100, (b) T = 500, (c) T = 1000, and (d) T = 2000. Results are averaged over ten replicates with different initial points.

Table 1. Supplementary results for PCA on the Grassmann manifold (Section 5.2).
The plots display the geodesic distance from a numerically computed optimum as a function of the iteration, considering various learning rates. The grid columns correspond to RSGD, RDoG, RDoWG, and NRDoG, and the rows to different datasets, including Tiny ImageNet. Results are averaged over five replicates with different initial points. The optimal RSGD is selected based on minimizing the geodesic distance from the optimum after 2000 iterations.

Table 2. Supplementary results for PCA on the Grassmann manifold (Section 5.2). Results for different datasets and methods, with grid columns corresponding to T = 100, 500, 1000, and 2000 iterations. Each plot illustrates the geodesic distance to a numerically computed optimum after the algorithm is halted for the specified number of iterations. Results are averaged over ten replicates with different initial points.

Table 3.
Supplementary results for the five-dimensional Poincaré word embeddings (Section 5.3). We compute the mean average precision of the embeddings against the ground truth after 1000 training epochs. The reported results represent the average over five replications, with the dimension of the embeddings set to five. In the columns, "with geometric curvature term" corresponds to learning schedulers for RDoG, RDoWG, and NRDoG that retain the geometric curvature term in the denominator, while "without geometric curvature term" denotes the omission of this term. In the rows, "with burn-in period" indicates running RADAM and RSGD with a burn-in heuristic. In this case, the algorithms are executed with learning rates divided by ten for the initial ten epochs before regular training. "Without burn-in period" signifies the absence of this heuristic.
Table 4. Supplementary results for the two-dimensional Poincaré word embeddings (Section 5.3). We compute the mean average precision of the embeddings against the ground truth after 2000 training epochs. The reported results represent the average over five replications, with the dimension of the embeddings set to two. Plots of embeddings obtained under each optimizer are visualized and annotated for the first 50 nouns of the mammals subtree. In the columns, "with geometric curvature term" corresponds to learning schedulers for RDoG, RDoWG, and NRDoG that retain the geometric curvature term in the denominator, while "without geometric curvature term" denotes the omission of this term.