# online_optimization_over_riemannian_manifolds__49403d1f.pdf Journal of Machine Learning Research 24 (2023) 1-67 Submitted 11/21; Revised 12/22; Published 2/23 Online Optimization over Riemannian Manifolds Xi Wang wangxi14.ucas@gmail.com Zhipeng Tu tuzhipeng@amss.ac.cn Academy of Mathematics and Systems Science Chinese Academy of Sciences Beijing 100190, P. R. China, and Australian Center for Robotics School of Aerospace, Mechanical and Mechatronic Engineering The University of Sydney NSW 2006, Australia Yiguang Hong yghong@iss.ac.cn Shanghai Research Institute for Intelligent Autonomous Systems Tongji University Shanghai 201210, P. R. China Yingyi Wu wuyy@ucas.ac.cn Department of Mathematics University of Chinese Academy of Sciences Beijing 100040, P. R. China Guodong Shi guodong.shi@sydney.edu.au Australian Center for Robotics School of Aerospace, Mechanical and Mechatronic Engineering The University of Sydney NSW 2006, Australia Editor: Silvia Villa Online optimization has witnessed a massive surge of research attention in recent years. In this paper, we propose online gradient descent and online bandit algorithms over Riemannian manifolds in full information and bandit feedback settings respectively, for both geodesically convex and strongly geodesically convex functions. We establish a series of upper bounds on the regrets for the proposed algorithms over Hadamard manifolds. We also find a universal lower bound for achievable regret on Hadamard manifolds. Our analysis shows how time horizon, dimension, and sectional curvature bounds have impact on the regret bounds. When the manifold permits positive sectional curvature, we prove similar regret bound can be established by handling non-constrictive project maps. In addition, numerical studies on problems defined on symmetric positive definite matrix manifold, hyperbolic spaces, and Grassmann manifolds are provided to validate our theoretical findings, using synthetic and real-world data. . A preliminary version of the paper is scheduled for presentation at Neur IPS-2021 (Wang et al., 2021). . Correspondence author (G. Shi, Ross Street Building, Darlington NSW 2006, Sydney, Australia; +6102-8627 8037). c 2023 Xi Wang, Zhipeng Tu, Yiguang Hong, Yingyi Wu and Guodong Shi. License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at http://jmlr.org/papers/v24/21-1308.html. Wang, Tu, Hong, Wu, and Shi Keywords: Online optimization, Riemannian manifolds, Riemannian optimization, Gradient estimation, Fr echet mean 1. Introduction The online optimization has been widely studied in the past decades in online routing, spam filtering, and machine learning (Agmon, 1954; Hazan, 2016; Arnold et al., 2019). Without a prior knowledge of loss functions, an online convex optimization algorithm predicts solutions before loss functions are revealed. In this paper, we consider the following Riemannian online convex optimization (ROCO) problem, min xt K M ft(xt), t = 1, 2, . . . , T, (1) where M is a complete Riemannian manifold equipped with a Riemannian metric g and K is a geodesically convex (g-convex) subset of M. Here, {ft}t=1,2,...,T is a sequence of unknown loss functions and every ft is a geodesically convex (g-convex) function with sufficient smoothness. 
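For readability, problem (1) and the regret by which performance is measured (introduced formally in the next paragraph) can be written out in display form as follows.

```latex
% R-OCO problem (1): at each round t the learner commits to x_t before f_t is revealed
\min_{x_t \in K \subseteq M} f_t(x_t), \qquad t = 1, 2, \ldots, T,
% Regret against the best fixed decision in hindsight
\operatorname{Reg}(T) \;=\; \sum_{t=1}^{T} f_t(x_t) \;-\; \min_{x \in K} \sum_{t=1}^{T} f_t(x).
```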
The R-OCO problem (1) extends the online convex optimization in Euclidean spaces with potential applications in machine learning, such as online principal component analysis (PCA), dictionary learning, and neural networks (Lee and Kriegman, 2005; Feng et al., 2013; Hu et al., 2020). The R-OCO problem (1) can be understood as a learning process of T rounds. At each round t = 1, 2, 3, . . . , T, an online learner chooses a strategy xt from the g-convex subset K. Later or simultaneously, the adversary (or nature) produces a g-convex loss function ft : K R of which the learner has no prior knowledge. Finally, the learner receives the feedback and suffers the loss ft(xt). Generally, there are two types of information feedback. One is the full information feedback, where the entire function ft is revealed to the learner; the other is the bandit feedback, where only the value ft(xt) is revealed. The goal of the R-OCO is to minimize the regret, defined as t=1 ft(xt) min x K which measures the difference between the cost by {xt}t=1,...,T and the best-fixed point chosen in hindsight. An algorithm is termed no-regret (Srinivas et al., 2010), if the regret of the algorithm goes sublinearly with the time horizon T. For carrying out optimization over a manifold, some classical methods treat the manifold as a subset of an ambient Euclidean space and employ Euclidean constrained optimization techniques. For instance, Nie et al. (2016) presented an online PCA algorithm, where the variables were updated in an embedding Euclidean space and then projected onto a manifold. However, in practical applications, the dimension of an embedding Euclidean space can be too high (e.g., the Grassmann manifold, Boumal and Absil, 2015), and the projection can be expensive to compute (e.g., the manifold of symmetric positive definite (SPD) matrices, Zhang et al., 2018). An alternative approach termed Riemannian optimization makes use of intrinsic geometry of manifolds so that it can optimize directly on manifolds Online Optimization over Riemannian Manifolds as an unconstrained problem, and thus avoiding high dimension embedding and high computing cost for the projection. Furthermore, this viewpoint has shown benefits from the g-convexity, by which a nonconvex optimization problem can be converted into a g-convex one (Allen-Zhu et al., 2018). Consequently, it is important to take a Riemannian approach in our problem (1). Although there were many existing algorithms for offline manifold optimization problems (Absil et al., 2009; Ring and Wirth, 2012; Ahn and Sra, 2020), very few results were obtained about the Riemannian online optimization problem. Tupker et al. (2021) proposed an online algorithm for estimating hidden Markov chains on Hadamard homogeneous spaces; Becigneul et al. (2019) analyzed Riemannian adaptive methods on product manifolds in the regret sense. More recently, Maass et al. (2022) studied a zeroth-order online optimization problem on Hadamard manifolds and achieved no-regret bound with a sublinear assumption. Contribution This paper aims to design no-regret algorithms for the R-OCO problem in both full information feedback and bandit feedback. The contribution of this paper is summarized as follows: We propose a Riemannian online gradient descent algorithm (R-OGD) for the ROCO problem in the full information feedback, and then establish the regret bounds on Hadamard manifolds for g-convex and strongly g-convex functions. 
In addition, we present a universal lower regret bound which matches the regret bound achieved by R-OGD in g-convex setting. We introduce a Riemannian bandit algorithm (R-BAN) and construct regret bounds for g-convex and strongly g-convex functions with the one-point bandit feedback. We also proposed a Riemannian two-point bandit algorithm (R-2-BAN) with the two-point bandit feedback, of which regret bounds can be improved to resemble the bounds in full information cases. Moreover, we develop a key technique to analyze the derivative of a local integration on homogeneous manifolds, which can be applied to estimate gradients in Riemannian optimization and beyond. We generalize the R-OGD, R-BAN and R-2-BAN algorithms to non-Hadamard manifolds. We overcome the challenge of non-constrictive projection maps and derive regret bounds of the same order in time horizon compared to those in Hadamard cases. The established lower and upper bounds on the achievable bounds of the R-OCO match their counterparts for Euclidean online convex optimization (Zinkevich, 2003; Hazan et al., 2006; Flaxman et al., 2005; Abernethy et al., 2008; Agarwal et al., 2010). Please see Table 1 for the detail. Some preliminary results of the paper are scheduled for presentation at Neur IPS-2021 (Wang et al., 2021). Compared to the conference version, we have expanded the theoretical study considerably into two-point bandit algorithm and regret analysis for non-Hadamard manifolds, and presented a comprehensive set of numerical tests on both synthetic and real-world data. Wang, Tu, Hong, Wu, and Shi Related Work The Euclidean online convex optimization was introduced by Zinkevich (2003). Inspired by the gradient descent method, Zinkevich (2003) proposed the online gradient descent algorithm (OGD) of which the regret bound was proven to be O T . Then Hazan et al. (2006) proceeded with the study of the OGD algorithm and established a regret bound O log T for strongly convex functions. In addition, Abernethy et al. (2008) gave a universal lower bound of O T for online algorithms, which indicated that the bounds in Zinkevich (2003) and Hazan et al. (2006) are essentially optimal. In the bandit setting, Flaxman et al. (2005) provided a detailed exposition of a one-point bandit algorithm. By modifying the gradient in the OGD algorithm to a randomized estimator, the regret bounds attained O n T 3 4 and O n 2 3 (1 + log T) 1 3 T 2 3 for convex loss functions and strongly convex loss functions, respectively. By extending the one-point bandit algorithm, Agarwal et al. (2010) developed a multi-point bandit algorithm and presented regret bounds O n and O n2(1+log T) for convex and strongly convex loss functions. The Riemannian online algorithms proposed in this paper in the full information feedback and the bandit feedback settings are extensions of the Euclidean online algorithms to Riemannian manifolds. Riemannian optimization has drawn much research attention in the past decades. Many basic algorithms in Euclidean spaces such as the gradient descent method, Newton s method, and trust-region methods have been adapted into a Riemannian setting (see Fiori, 2005; Absil et al., 2009; Ahn and Sra, 2020; Ring and Wirth, 2012; Koudounas and Fiori, 2020). Some research of Riemannian stochastic optimization (R-SO) was intended to deal with time-varying optimization problems (Bonnabel, 2013; Zhang and Sra, 2016; Zhang et al., 2018; Tupker et al., 2021). 
Among them, Zhang and Sra (2016) provided the first global complexity analysis for the R-SGD algorithm on geodesically convex problems over Hadamard manifolds, and Tupker et al. (2021) proposed an online algorithm to deal with hidden Markov chains on Hadamard homogeneous spaces. When loss functions are arrived in batch, R-SO methods are actually to minimize the average regret in the case of knowing the prior distribution of loss functions. In this case, the R-SO can be viewed as a kind of R-OCO problems and R-OCO algorithms can handle broader settings without prior knowledge. The results about the R-OCO problem are fairly limited. Antonakopoulos et al. (2020) proposed regularized online optimization methods via a Riemann Lipschitz continuity condition, which focused on convex functions and vector addition from an ambient Euclidean space. In the full information setting, Becigneul et al. (2019) proposed the Riemannian versions of ADAGRAD and ADAM algorithms, which depended on a product manifold structure. In addition, Becigneul et al. (2019) constructed O T regret bounds of both ADAGARD and ADAM algorithms for g-convex functions. When the form of losses is not available, Maass et al. (2022) proposed a Riemannian online zeroth-order (R-OZO) algorithm for strongly g-convex functions. The R-OZO generated a random Gaussian vector ut in an ambient embedding Euclidean space, and then used a two-point difference to present a descent along the projection of ut on the tangent space of the manifold. For g-strongly convex functions on Hadamard manifolds, Maass et al. (2022) derived asymptotic tracking error and a O T +VT dynamic regret bound of the R-OZO, where VT is the accumulated distance between two consecutive minimizers. In contrast, the regret bounds established for our online gradient-based/bandit Riemannian optimization algorithms are sublinear for any time, matching those for Euclidean online optimization. A detailed comparison of our results with the existing works is summarized in Table 1. Online Optimization over Riemannian Manifolds Feedback setting G-convex Strongly g-convex Full information Our work O ζ 1 2 Previous Work O T (Product space) (Becigneul et al., 2019) Euclidean O (Zinkevich, 2003) (Hazan et al., 2006) One-point bandit Our work O nζ 1 2 T 3 4 O n 2 3 ζ(1 + log T) 1 3 T 2 3 Previous Work Euclidean O n T 3 4 O n 2 3 (1 + log T) 1 3 T 2 3 (Flaxman et al., 2005) (Flaxman et al., 2005) Two-point bandit Our work O nζ 1 2 O n2ζ(1 + log T) Previous Work (Dynamic regret) (Maass et al., 2022) Euclidean O n O n2(1 + log T) (Agarwal et al., 2010) (Agarwal et al., 2010) Our work Ω( Previous Work lower bound Euclidean Ω( T) Ω(log T) (Agarwal et al., 2010) (Agarwal et al., 2010) Table 1: Comparison of regret among our work, previous Riemannian online optimization, and corresponding results in Euclidean spaces. T: the time horizon; D: the diameter of the feasible set; n: the dimension of the manifold; ζ: a constant related to the sectional curvature bound κ; VT : the accumulated distance between two consecutive minimizers. 2. Preliminaries In this section, we present a brief review on the Riemannian manifold and introduce basic functions classes for Riemannian optimization. We refer readers to the following textbooks and tutorial papers (do Carmo, 1992; Chern et al., 1999; Berestovskii and Nikonorov, 2020; Ghomi and Spruck, 2019; Fiori, 2021) for more details. Riemannian manifolds An n-dimensional manifold M is a topological space locally diffeomorphic to the vector space Rn. 
The tangent space Tx M is a linearization of the manifold M at a point x. A Riemannian manifold is a smooth manifold M equipped with a metric tensor g (called Riemannian metric), which defines an inner product gx : Tx M Tx M R gx(X, Y ) = X, Y x Wang, Tu, Hong, Wu, and Shi in every tangent space Tx M of x M. The Riemannian metric g gives us a way to measure the length of curves, bringing a metric space structure to M with distance function d(x, y) = inf γ {Length(γ)|γ is a curve connecting x and y}. A curve is a geodesic if it locally minimizes the length, which is an analog of a straight line in Euclidean spaces. On Riemannian manifolds, a geodesic is uniquely determined by its starting point and initial tangent vector. In this way, the exponential map expx : Tx M M is defined by mapping a vector X Tp M to γ(1) M for the geodesic γ such that γ(0) = x and γ(0) = X. A set K is termed geodesically convex (g-convex) if, for any points x, y K, there admits a geodesic γ K connecting x and y. Moreover, if the γ is unique, the set K is termed uniquely geodesically convex (uniquely g-convex). It is shown that the exponential map expx is locally a diffeomorphism and consequently has an inverse exp 1 x ( ) on a uniquely g-convex set. Curvature reflects the geometry of manifolds. We focus on sectional curvature, which is the Gauss curvature of a two-dimensional submanifold. Following Zhang and Sra (2016), we mainly consider the Hadamard manifold, which is a simply connected and complete manifold with non-positive sectional curvature. The Cartan-Hadamard theorem (Berger, 2009) shows that the Hadamard manifold is uniquely g-convex so that the exponential map expx has an global inverse exp 1 x ( ) on Hadamard manifolds. In this way, the distance d(x, y) can be expressed as exp 1 x (y) x. Isometries of Riemannian manifolds have been widely studied in differential geometry (Berger, 2009; Berestovskii and Nikonorov, 2020). An isometry φ : M M is a diffeomorphism preserving distance, i.e., d(x, y) = d(φ(x), φ(y)) for all x, y M. It is remarked that all isometries of a Riemannian manifold form a Lie group G. A Riemannian manifold is a homogeneous manifold if the group of isometries G acts on M transitively, i.e., for any points x, y M there exists an isometry such that φ(x) = y. A Riemannian manifold is a symmetric manifold if for any x M, there exists a symmetry sx G such that x is an isolated fixed point of sx. Vector fields and Their flows A vector field X is a map that assigns every point x M to a tangent vector X(x) Tx M. Let X(M) denote the set of all vector fields. A vector field X can be also viewed as a differential operator over smooth functions on M, i.e., the operation X(f) gives a function on M, defined as X(f)(x) = lim t 0 1 t (f(ξ(t)) f(x)), where ξ is a curve that starts at x with the tangent vector X(x). The Levi-Civita connection is an analogue of the differential operator over vector fields in Euclidean spaces and uniquely determined by properties ( X Y, Z = XY, Z + Y, XZ XY Y X = XY Y X for all X, Y, Z X(M). Online Optimization over Riemannian Manifolds The infinitesimal variation of a geodesic is described by the Jacobi field. A vector field η along a geodesic γ is a Jacobi field if it satisfies the Jacobi equation γ γη + R( γ, η) γ = 0. A vector η is a Killing field if it satisfies for all X, Y X(M) Xη, Y + X, Y η = 0. We follow the same idea in Euclidean spaces to define the flow of a vector field. Suppose that M is a smooth manifold and X X(M). Let there be a smooth map φ : R M M. 
Denote φt(p) = φ(t, p), for any (t, p) R M, such that the following conditions are satisfied: 1) φ0(p) = p; 2) φs φt = φs+t for any real numbers s, t; 3) X(p) = φt(p) Then we call φt the flow (or the one-parameter group of diffeomorphism) of X, and term X the infinitesimal transformation of φt. Function Classes A function f : K R is called geodesically convex (or g-convex) if for any geodesic γ : [0, 1] M, f(γ(t)) (1 t)f(γ(0)) + tf(γ(1)). The g-convexity has some equivalent conditions. When f is differentiable, which means that there exists a gradient vector field f such that f(x), X = X(f)(x) for every vector field X X(M), the g-convexity is equivalent to the following condition f(y) f(x) + f(x), exp 1 x (y) , x, y M. Furthermore, if f is twice differentiable, the g-convexity is equivalent to 2f(X, X) := X f, X = X(X(f)) XX(f) 0 for any X X(M). A differentiable function f : M R is geodesically µ-strongly convex (or µ-strongly g-convex) if there exists a constant µ > 0 such that for any x, y M, there holds f(y) f(x) + f(x), exp 1 x (y) + µ 2 d2(x, y). We term a function to be geodesically L-Lipschitz (or g-L-Lipschitz) if there exists a constant L > 0 such that, for any x, y M, |f(y) f(x)| L d(x, y), which is equivalent to f(x) L, x M, if f is differentiable. Wang, Tu, Hong, Wu, and Shi Algorithm 1: Riemannian Online Gradient Descent Algorithm (R-OGD) Input: Manifold M, time T, step sizes (or schedule) {αt} Output: {xt}t=1,...,T for t = 1 to T do Play xt and observe the function ft; Update xt+1 with ( xt+1 = expxt( αt ft(xt)) xt+1 = PK( xt+1), where PK is the Riemannian projection mapping of x onto K, i.e., PK(x) := arg min y K d(x, y); Return xt+1, and suffer the loss ft(xt); end 3. R-OCO with Full Information Feedback This section is devoted to the study of the R-OCO problem in the full information feedback. We first propose our R-OGD algorithm and then analyze the upper regret bounds of the R-OGD for both g-convex and strongly g-convex functions. In addition, a universal lower regret bound in the g-convex case is presented to illustrate that the regret bound of the R-OGD algorithm is tight up to a constant. 3.1 Riemannian Online Gradient Algorithm In the full information setting, we consider the following assumptions, which were standard in the literature of Euclidean online convex optimization and Riemannian optimization (Zinkevich, 2003; Ahn and Sra, 2020; Huang et al., 2015). Assumption 1 There exists a x M such that x = arg min PT t=1 ft(x). Assumption 2 (M, g) is a Hadamard manifold with the sectional curvature lower bounded by a constant κ. Note that the Hadamard manifold plays an important role in Riemannian geometry (see Ghomi and Spruck, 2019). Some well-known spaces, such as the Euclidean space Rn, the hyperbolic space Hn, and the manifold of SPD matrices, are all Hadamard manifolds (Ahn and Sra, 2020; Huang et al., 2015). Assumption 3 The set K is a bounded and g-convex set with diameter D, i.e., d(x, y) D, x, y K. It is worth emphasizing that the Cartan-Hadamard theorem indicates that any g-convex set on Hadamard manifolds is uniquely g-convex. Consequently, we can define the inverse exponential map exp 1 x ( ) for any point x K (do Carmo, 1992). Online Optimization over Riemannian Manifolds Assumption 4 For any t = 1, . . . , T, ft is differentiable and g-L-Lipschitz. We now propose our Riemannian online gradient descent algorithm (R-OGD) in Algorithm 1, where the exponential map replaces the vector addition in the Euclidean online gradient descent (Zinkevich, 2003). 
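To make the update in Algorithm 1 concrete, the following is a minimal Python sketch of one R-OGD round written against a generic manifold interface; the helper names `exp_map`, `grad_f_t`, and `project_to_K` are illustrative assumptions (for instance supplied through a package such as Pymanopt, which we use in Section 7), not code released with the paper.

```python
import numpy as np

def r_ogd_round(x_t, grad_f_t, alpha_t, exp_map, project_to_K):
    """One round of Algorithm 1 (R-OGD), written against a generic manifold
    interface: exp_map(x, v) is the exponential map at x applied to the
    tangent vector v, and project_to_K is the metric projection onto K."""
    g = grad_f_t(x_t)                       # Riemannian gradient of the revealed loss
    x_tilde = exp_map(x_t, -alpha_t * g)    # geodesic gradient step
    return project_to_K(x_tilde)            # project back onto the feasible set

def step_size_convex(t, D, L, zeta):
    """Step-size schedule alpha_t = D / (L * sqrt(zeta(kappa, D) * t)) used in
    Theorem 5 for g-convex losses; for mu-strongly g-convex losses, Theorem 6
    uses alpha_t = 1 / (mu * t) instead."""
    return D / (L * np.sqrt(zeta * t))
```

On Hadamard manifolds the projection in the last line of `r_ogd_round` is single-valued and nonexpansive (Lemma 36 in Appendix A), which is the property the regret analysis relies on.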
3.2 Regret Upper Bounds In Theorems 5 and 6 we present upper regret bounds of the R-OGD algorithm for g-convex loss functions and strongly g-convex functions, respectively. Take |κ| d , κ < 0 By direct observation, ζ decreases with respect to κ, and increases with respect to d. Theorem 5 (Convex Case) Suppose that Assumptions 1-4 hold, and ft is g-convex for any t = 1, . . . , T. Then the R-OGD algorithm with step sizes {αt = D L ζ(κ,D)t} guarantees the following regret bound for all T 1: Theorem 6 (Strongly-convex Case) Suppose that Assumptions 1-4 hold, and that ft is µ-strongly g-convex for any t = 1, . . . , T. Then the R-OGD algorithm with step sizes {αt = 1 µt} guarantees the following regret bound for all T 1: Reg(T) L2ζ(κ, D) 2µ (1 + log T). The proofs of Theorems 5 and 6 are in Appendix B. A major challenge in proving Theorems 5 and 6 is that there is no vector space structure on Riemannian manifolds. Thanks to the trigonometric distance bound proposed in Zhang and Sra (2016), we manage to obtain the regrets O T and O log T for g-convex and strongly g-convex loss functions, respectively. By gradually moving κ to zero, the results recover the regret bounds of Euclidean gradient descent of Zinkevich (2003) and Hazan et al. (2006). Theorems 5 and 6 also reveal the influence of curvature on the regret bounds. Since ζ(κ, d) is an increasing function of κ, the upper regret bounds in the R-OGD algorithm are greater than those in Euclidean spaces and the increase of κ raises the upper regret bounds. Therefore, a proper Riemannian metric should be chosen in the optimization to avert high sectional curvature bounds. 3.3 Regret Lower Bound This section is intended to answer the question of whether there exists an algorithm that attains a tighter regret bound than O T for g-convex functions. Theorem 7 provides a negative answer. Wang, Tu, Hong, Wu, and Shi Theorem 7 Suppose that Assumptions 1-4 hold. Then for any Hadamard manifold M, the Riemannian online convex optimization incurs the regret Ω(DL T) for any possible strategy in the worst case. The proof of Theorem 7 is in Appendix C. The result illustrates that, as in Euclidean spaces, the regret of a Riemannian online convex algorithm can not be less than Ω( T) in the worst case. Moreover, Theorem 7 shows that the regret of the R-OGD algorithm in Theorem 5 is tight up to a constant. 4. R-OCO with One-Point Bandit Feedback In this section, we consider the Riemannian online convex optimization with the one-point bandit feedback. We first present the Riemannian bandit algorithm (R-BAN) on Hadamard homogeneous manifolds and then analyze (expected) regret bounds for the R-BAN. Here and subsequently, we denote by Bδ(x) the ball centered at x with radius δ and by Sδ(x) the sphere centered at x with radius δ. 4.1 Riemannian Bandit Algorithm In the bandit setting, Assumptions 2-4 are slightly modified as follows. Assumption 8 M is an n-dimensional homogeneous Hadamard manifold with the sectional curvature lower bounded by a constant κ. The homogeneous Hadamard manifold has been widely studied in differential geometry (Berestovskii and Nikonorov, 2020; Berger, 2009). The homogeneity has received much attention in machine learning (Tang et al., 2020; Tupker et al., 2021; Bronstein et al., 2021). It has been seen that many manifolds often considered in Riemannian optimization, such as the Euclidean space Rn, the Hyperbolic space Hn, and the manifold of SPD matrices, are Hadamard homogeneous manifolds. 
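Before stating the algorithm formally, the following minimal sketch shows how one round of the resulting one-point bandit update (Algorithm 2 below) can be implemented; `exp_map`, `random_unit_tangent` (uniform on the unit sphere of the tangent space), `project_to_K`, and `project_to_shrunk_K` are assumed helper routines introduced only for illustration.

```python
def r_ban_round(y_t, f_t, alpha_t, delta, exp_map, random_unit_tangent,
                project_to_K, project_to_shrunk_K):
    """One round of the one-point bandit update (sketch of Algorithm 2).

    The query point x_t is drawn uniformly from the geodesic sphere
    S_delta(y_t); only the scalar value f_t(x_t) is observed, and
    g_t = f_t(x_t) * exp_{y_t}^{-1}(x_t) / ||exp_{y_t}^{-1}(x_t)|| replaces
    the unavailable gradient."""
    u = random_unit_tangent(y_t)        # uniform unit direction in T_{y_t}M
    x_t = exp_map(y_t, delta * u)       # play x_t on S_delta(y_t)
    value = f_t(x_t)                    # the only feedback the learner receives
    g_t = value * u                     # here exp_{y_t}^{-1}(x_t) = delta * u
    y_next = exp_map(y_t, -alpha_t * g_t)
    # Project onto K and then onto the shrunk set (1 - tau)K so that the next
    # sampling sphere remains inside the feasible set.
    return x_t, value, project_to_shrunk_K(project_to_K(y_next))
```

Lemma 11 below shows that, after rescaling by Sδ/Vδ, this estimator is unbiased for the gradient of a smoothed version of ft.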
Note that on homogeneous manifolds, the volume and surface area of a ball are only related to the radius but not to the center of the ball. Thus, we denote by Vδ the volume of Bδ(x) and Sδ as the surface area of Sδ(x) for all points x over the homogeneous manifold M. Assumption 9 There exists an interior point p K such that the set K contains a ball with radius r centered at p, and K is also contained in a ball with radius D, i.e., Br(p) K BD(p). Assumption 10 For any t = 1, . . . , T, ft is differentiable, g-L-Lipschitz and the function value of ft is bounded by C. Inspired by the Euclidean bandit algorithm, we replace the gradient ft(xt) with a randomized estimator gt and propose our R-BAN in Algorithm 2 over Hadamard homogeneous manifolds. Online Optimization over Riemannian Manifolds Algorithm 2: Riemannian Bandit Algorithm (R-BAN) Input: Manifold M, time T, step sizes (or schedule) αt, parameters δ, τ. Output: Sequence {xt}t=1,...,T for t = 1 to T do Choose xt uniformly from Sδ(yt) and play xt; Observe the loss ft(xt) and compute gt = ft(xt) exp 1 yt (xt) exp 1 yt (xt) ; Update yt+1 with ( yt+1 = expyt( αtgt) yt+1 = P(1 τ)K PK( yt+1) , where the symbols PK and P(1 τ)K represent the projection mappings onto the feasible set K and the shrinking set (1 τ)K = {expp((1 τ)u)|u = exp 1 p (x), x K}, respectively. Return xt and suffer the loss ft(xt); end 4.2 Challenge from Geometry Since Algorithm 2 is an extension of the Euclidean bandit algorithm by Flaxman et al. (2005), it is worth reviewing the analysis in the work by Flaxman et al. (2005). In the Euclidean setting, we uniformly choose xt on the Sδ(yt) and update yt by the rule ( g E t = f(xt) xt yt yt+1 = P(1 τ)K(yt αg E t ). (2) The basic idea for the analysis is to introduce the smoothed loss function (Flaxman et al., 2005) ˆf E t (x) = Eu Bδ(x)[ft(u)] = 1 Bδ(x) ft(u)du, where ˆf E t is a convex approximation of ft when δ is small. It is shown that n δ g E t is an unbiased estimator of the gradient ˆf E t (yt), hence the bandit algorithm is actually an expected gradient descent method (Flaxman et al., 2005) with the loss function ˆf E t . In this way, an Euclidean regret bound of the bandit algorithm is established by Flaxman et al. (2005). Back to the Riemannian case, we attempt to generalize the analysis of Flaxman et al. (2005) in parallel by defining the Riemannian version of the smoothed loss function, ˆft(x) = Eu Bδ(x)[ft(u)] = 1 Bδ(x) ft(u)ω, Wang, Tu, Hong, Wu, and Shi where ω is the volume element with the respect to the metric g. Analyzing the smoothed loss function in the Riemannian space, however, is fundamentally challenging due to the following difficulties. The gradient is hard to estimate. Estimating the gradient of ˆft is quite different from that in Euclidean spaces, due to absence of the commutativity of the derivative operator and the integration operator R Bδ(yt). In Euclidean spaces, the derivative operator commutes with the integration operator R Bδ(yt). Accordingly, for the Euclidean smoothed loss function ˆf E t , there holds ˆf E t (yt) = 1 Bδ(yt) ft(u)du = 1 Bδ(yt) ft(u)du, (3) which implies n δ E[g E t ] = ˆf E t (yt). However, on Riemannian manifolds the derivative operator does not commute with the integration operator R Bδ(yt). Consequently, the equation (3) fails to hold for functions on Riemannian manifolds. The convexity may be lost. Another essential challenge for regret analysis is the convexity of ˆft. In Euclidean spaces, one can easily conclude the convexity of ˆf E t . 
However, the convexity may not hold for Riemannian manifolds. Through calculation, the Hessian of ˆft on a Riemannian manifold is 2(ˆft)(X, X) = 1 Bδ(x) ( 2(ft)(η, η)(u) + ηη, ft(u) )ω, where η is a Killing field with η(x) = X. Since the quadratic form 2(ft)(η, η)(x) + ηη, ft(x) can be negative at some η Tp M, the g-convexity of ˆft is violated for some small δ. 4.3 Gradient Bound and Approximate G-convexity We first propose a key technique to analyze the derivative of local integration by introducing the Killing vector field. With the help of this technique, we manage to estimate the gradient of ˆft = 1 Vδ R Bδ(x) ft(u)ω in Lemma 11. Lemma 11 Suppose M is a homogeneous Hadamard manifold, f is a C1 function on M with bound C, and x M. Denote g(u) = f(u) exp 1 x (u) exp 1 x (u) . Then for a fixed δ > 0, the following statements hold. (i) If u is uniformly chosen from Sδ(x), then Sδ Vδ g(u) is an unbiased estimator of ˆf(x), i.e., Eu Sδ(x) h Sδ Vδ g(u) x i = 1 Sδ(x) f(u) exp 1 x (u) exp 1 x (u) ωSδ(x) = ˆf(x), x M; Online Optimization over Riemannian Manifolds (ii) If the sectional curvature of M is bounded lower by κ, then the estimator Sδ Vδ g(u) is bounded, i.e., Eu Sδ(x) h Sδ Vδ g x i Sδ δ + n|κ |δ)C, where κ = min{κ, 0}. The proof of Lemma 11 can be found in Appendix D. The first part of the lemma establishes a gradient estimator of ˆft(x), while the second part gives us an easy-to-compute bound of the gradient. In the proof, we develop a technique that transforms a derivation of an integration on Bδ(x) to an integration of a derivative of corresponding Killing vector field on Bδ(x), i.e., Bδ(x) f(u)ω = Z Bδ(x) η(f(u))ω, (4) where η is a Killing field such that η(x) = X. This technique does not rely on the curvature and other specific manifold structures. We also notice that although the function ˆft may not be g-convex, it is very close to be g-convex. We introduce the following definition. Definition 12 A function f : K M R is called to be (i) λ-sub g-convex if there exists a constant λ 0 such that for any x, y M f(y) f(x) f(x), exp 1 x (y) λ. (ii) µ-strongly λ-sub g-convex if there exist two constants λ, µ 0 such that for any x, y M f(y) f(x) f(x), exp 1 x (y) µ 2 d2(x, y) λ. Lemma 13 Suppose that (M, g) is a Hadamard homogeneous manifold. If K is a g-convex and bounded set of M, then there exists a constant ρ 0 depending only on the set K such that, the following statements hold. (i) For any g-convex and g-L-Lipschitz function f, the smoothed function ˆf is 2ρδL-sub g-convex. (ii) For any µ-strongly g-convex and g-L-Lipschitz function f, the smoothed function ˆf is µ-strongly 2(ρδL + µDδ)-sub g-convex. The proof of Lemma 13 is in Appendix D. It is worth mentioning that the constant ρ describes how close a smoothed function is to being g-convex. Notice that ρ = sup x,y,u K | 1 G exp 1 u φ(u) i| s.t. φ(x) = y does not depend on the function ˆft, and the time T. Moreover, for a given manifold M, once the set K is fixed and the explicit expression of φ is given, we can compute the constant ρ as a finite number. We briefly list the bound of φ in the following two types of manifolds. Wang, Tu, Hong, Wu, and Shi (i) Let manifold M be a Euclidean space. We can find the isometry φ(z) = z + y x. Hence we can conclude that ρ = 0 and ˆf is convex, which coincides with the result in Euclidean spaces. (ii) Let M be a 2-dimensional Poincar e disk. Then the isometry φ from x to y has the closed form of φ = φx φy, where φx(z) = x z 1 xz and φy(z) = y z 1 yz. 
Therefore, if K has a diameter D, we can figure out a bound of ρ in ρ 161 + tanh(D/2) 1 tanh(D/2)( 1 1 tanh(2D)2 + D tanh(D/2)), which implies that ρ may grow exponentially with respect to D. Although the value of ρ is generally difficult to calculate, our algorithm analysis and parameter selection do not depend on the specific value of ρ (see Theorems 14 and 15). 4.4 Regret Bounds With the above effort, we now carry out the analysis of the expected regret bounds of Algorithm 2. Denote B = n Theorem 14 (Convex Cases) Suppose that Assumptions 1, 8, 9 and 10 hold, and ft is g-convex for any t = 1, . . . , T. If we take δ = T 1 4 , θ = κ(D+r) sinh κ(D+r), τ = δ rθ, and ζ(κ,D)T , then the expected regret of Algorithm 2 is upper bounded by E[Reg(T)] n|κ|DC p ζ(κ, D)T 1 4 + n D p ζ(κ, D) + 2D2 rθ + (3 + D rθ + 2ρ)L T 3 4 . Theorem 15 (Strongly Convex Cases) Suppose that Assumptions 1, 8, 9 and 10 hold, and ft is µ-strongly g-convex for any t = 1, . . . , T. If we take δ = 3q n2C2(1+log T) T , θ = κ(D+r) sinh κ(D+r), τ = δ rθ, and αt = B µt, then the expected regret of Algorithm 2 is upper bounded by E[Reg(T)] 2n 8 3 C 8 3 Dκ2ζ(κ, D) + n 2 3 C 2 3 ζ(κ, D) µ + 3L + 2D2 rθ + 2ρL + 2Dµ (1 + log T) 1 3 T 2 3 . The proofs of Theorems 14 and 15 can be seen in Appendix E. Theorems 14 and 15 show that the regrets of the Riemannian bandit algorithm achieve O T 3 4 and O T 2 3 for g-convex loss functions and strongly g-convex functions on homogeneous Hadamard manifolds, which are same as the regret bounds in Euclidean spaces (Flaxman et al., 2005). We also note that, different from bandit algorithms in the Euclidean space, Theorems 14 and 15 introduce the parameter θ in the selection of τ = δ rθ to ensure that the ball Online Optimization over Riemannian Manifolds Bδ(yt) = Bθ τr(yt) always remains within the feasible set K, thereby ensuring the feasibility of the algorithm (see Lemma 45 for details). Because the value of θ in Theorems 14 and 15 is chosen to ensure the feasibility for all possible subsets K, we may find that the chosen value of θ is too small and conservative in practical situations. This may lead to an overshrinkage of the set (1 τ)K. However, we would like to point out that, for a specific subset K, as long as the value of θ satisfies the feasibility requirement of Bδ(yt), the same regrets result as Theorems 14 and 15 can be obtained. This means that we can choose a much larger value of θ specifically tailored to the subset K to achieve better practical application results. 5. R-OCO with Two Point Bandit Feedback In this section, we consider the Riemannian online convex optimization with the two-point bandit feedback. We first propose a Riemannian two-point bandit algorithm (R-2-BAN), which estimates the gradient of ft(yt) with two queries of values around yt, and then analyze regret bounds for g-convex and strongly g-convex functions. 5.1 Riemannian Two-point Bandit Algorithm In this subsection, we propose our R-2-BAN algorithm in Algorithm 3 with an additional assumption. Assumption 16 M is an n-dimensional symmetric Hadamard manifold with the sectional curvature lower bounded by a constant κ. The assumption of symmetry is important in the two-point bandit feedback setting. From symmetry, it is shown that for any y M, the minus map x := expy( exp 1 y (x)) defines a isometry in M. The result implies that the uniform distribution on the geodesic sphere Sδ(y) is symmetric, which is key insight in the Euclidean two bandit algorithm (Agarwal et al., 2010), as well as in our R-2-BAN analysis. 
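Concretely, the minus map supplies the second query point of the two-point estimator used in Algorithm 3 (stated in the next subsection). A minimal sketch, with the same assumed manifold helpers as before, is given below; the scaling of the difference term follows Algorithm 3, with any remaining constants absorbed into the step size.

```python
def r_2ban_round(y_t, f_t, alpha_t, delta, exp_map, random_unit_tangent,
                 project_to_K, project_to_shrunk_K):
    """One round of the two-point bandit update (sketch of Algorithm 3).

    The two queries are antipodal on the geodesic sphere S_delta(y_t):
    x_{t,2} = exp_{y_t}(-exp_{y_t}^{-1}(x_{t,1})) is the image of x_{t,1}
    under the minus map."""
    u = random_unit_tangent(y_t)
    x1 = exp_map(y_t, delta * u)            # first query
    x2 = exp_map(y_t, -delta * u)           # second query: minus map applied to x1
    v1, v2 = f_t(x1), f_t(x2)               # the two observed values
    g_t = 0.5 * (v1 - v2) * u               # two-point difference estimator
    y_next = project_to_shrunk_K(project_to_K(exp_map(y_t, -alpha_t * g_t)))
    return (x1, x2), 0.5 * (v1 + v2), y_next
```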
The symmetric Hadamard manifold is widely studied in differential geometry (Berestovskii and Nikonorov, 2020; Berger, 2009). Moreover, many manifolds with great practical value are symmetric manifolds, such as the Euclidean space Rn, the Hyperbolic space Hn, and the manifold of SPD matrices. 5.2 Regret Bound This subsection studies the regret of the R-2-BAN, which is defined as 1 2(ft(xt,1) + ft(xt,2)) min x K Since the uniform distribution on geodesic spheres is symmetric, we derive that the two-point estimator Sδ Vδ g is also unbiased and bounded gradient estimator of the smoothed function ˆf(x) = 1 Vδ R Wang, Tu, Hong, Wu, and Shi Algorithm 3: Riemannian Two-point Bandit Algorithm (R-2-BAN) Input: Manifold M, time T, step size (or schedule) αt, parameter δ, τ. Output: Sequence {xt}t=1,...,T for t = 1 to T do do Pick xt,1 uniformly from Sδ(yt) and set xt,2 = xt,1; Play xt,1 and xt,2, then observe ft(xt,1) and ft(xt,2); Compute 2 ft(xt,1) ft(xt,2) exp 1 yt (xt,1) exp 1 yt (xt,1) Update yt+1 with ( yt+1 = expyt( αt gt) yt+1 = P(1 τ)K PK( yt+1) , where the symbols PK and P(1 τ)K represent the projection mappings onto the feasible set K and the shrinking set (1 τ)K = {expp((1 τ)u)|u = exp 1 p (x), x K}, respectively. Return xt and suffer the loss 1 2 ft(xt,1) + ft(xt,2) ; end Lemma 17 Suppose M is a symmertic Hadamard manifold and f is a g-L-Lipschitz function on M. Then for a fixed δ > 0 the gradient estimator Sδ Vδ g = Sδ f(u) f( u) exp 1 x (u) exp 1 x (u) satisfies the following. (i) Eu Sδ(x) h Sδ Vδ g x i = ˆf(x), x M; (ii) Eu Sδ(x) h Sδ Vδ g x i Sδ Vδ δL n L(1 + κδ2). Then we carry out the regret analysis in Theorem 18 and Theorem 19. Notice that B = n δ + n|κ|δ and ρ is the constant in Lemma 13 that only depends on K. Theorem 18 (Convex Cases) Suppose that Assumptions 1, 8, 9, 10 and 16 hold, and ft is g-convex for any t = 1, . . . , T. If we take δ = 1 T , θ = κ(D+r) sinh κ(D+r), τ = δ rθ, and αt = n δL ζ(κ,D)T , then the expected regret of Algorithm 3 is upper bounded by E[Reg(T)] nκDL p ζ(κ, D) + 2D2 rθ + (3 + D Theorem 19 (Strongly Convex Cases) Suppose that Assumptions 1, 8, 9, 10 and 16 hold, and ft is µ-strongly g-convex for any t = 1, . . . , T. If we take δ = 1+log T Online Optimization over Riemannian Manifolds κ(D+r) sinh κ(D+r), τ = δ rθ, and αt = B µt, then the expected regret of Algorithm 3 is upper bounded by E[Reg(T)] 3n2C2κ2ζ(κ, D) + ζ(κ, D)n2L2 rθ + (3 + D rθ + 2ρ)L + 2µD (1 + log T). The proofs of Theorems 18 and 19 can be seen in Appendix F. Theorems 18 and 19 show that regrets of the Riemannian two-point bandit algorithm achieve O T and O log T for g-convex and strongly g-convex functions on symmetric Hadamard manifolds, which improves the regret bounds of the Riemannian one point bandit algorithm. Such improvement is consistent with the results in Euclidean spaces (Agarwal et al., 2010). 6. Generalizing to Non-Hadamard Cases In this section, we generalize the R-OGD, R-BAN and R-2-BAN algorithms to a manifold M with sectional curvature lower bounded by κ and upper bounded by K > 0. However, positive sectional curvature can make projection maps no longer constrictive so that a straightforward generalization may fail to be no-regret. We demonstrate how the non-Hadamard structure affects the property of projection maps and how non-constrictive projection maps cause trouble in regret analysis. 
Furthermore, we try to address the difficulty withour assuming the invariance condition adopted in the previous work (Ahn and Sra, 2020; Alimisis et al., 2021), and then construct no-regret bounds of Algorithm 1, 2 and 3. 6.1 Impact of Positive Curvature in Regret Analysis The projection map can be non-constrictive for non-Hadamard manifolds, since the positive sectional curvature affects the g-convexity of the norm of Jacobi fields. The following example provides an illustration. Example 1 Suppose that M is a manifold with positive sectional curvature, and K is a geodesic ξ(s) : [0, 1] M. Let U(s) be a parallel vector field on ξ that is normal to ξ. Then we consider the following variation γs(t) : [0, 1] [0, δ] M (s, t) expξ(s)(t U(s)). For a sufficiently small δ1 we have, PK(γs(t)) = ξ(s), t δ1. In addition, we notice that derivatives of the length of the s curve, namely L(t), are characterized by the first and second variation formula (do Carmo, 1992), L (0) = U(b), ξ(b) U(a), ξ(a) = 0 L (0) = Z 1 0 | γU|2 R(U, ξ, U, ξ)ds = Z 1 0 R(U, ξ, U, ξ)ds Wang, Tu, Hong, Wu, and Shi Because the sectional curvature is positive, we found L (0) < 0, which means that L(0) is a local maximum. As a result, we can take a sufficiently small δ2 > 0 such that L(t) < L(0) = d(ξ(a), ξ(b)), t δ2. Take δ3 = min{δ1, δ2}, p = γ0(δ3) and q = γ1(δ3) we have d(p, q) L(δ3) < L(0) = d(ξ(a), ξ(b)) = d(PK(p), PK(q)), which indicates the non-expansiveness of the projection map PK fails to hold. y1 (a) Minimal geodesics are not unique (b) Geodesics are loops Figure 1: Examples in S2 One may try to sidestep the projection map by assuming compactness of M and K = M. However, these assumptions do not work as g-convexity on non-Hadamard manifolds only holds locally. On one hand, positive sectional curvature admits conjugate points, where the connecting geodesic is not unique, leading K no longer to be uniquely g-convex. On the other hand, global g-convex functions may not exist on compact manifolds. From the study by Yau (1974), the existence of nontrivial global g-convex functions implies infinity of volume. As a result, there is no global g-convex function apart from constant functions on compact Riemannian manifolds. We take examples on the sphere S2 to illustrate the above points. Example 2 (i) Let x1, y1 be the north pole and the south pole of the sphere S2 (see Figure 1 (a)). Then every arc connecting x and y is a geodesic. Therefore S2 is not uniquely g-convex. (ii) Suppose that f is a g-convex function. Since for any x2, y2 in S2, the geodesic is the great circle connecting x2, y2 (see Figure 1 (b)), we can choose a geodesic loop which starts at x2 and ends at x2. By g-convexity, we have f(y2) (1 t)f(x2) + tf(x2) = f(x2). Choosing the geodesic loop that starts at y2, we can obtain f(x2) f(y2). Therefore there must hold f(x2) = f(y2), which indicates that f is actually a constant. Online Optimization over Riemannian Manifolds Our regret analysis of Algorithms 1, 2 and 3 on Hadamard manifolds greatly depends on non-expansiveness of projection maps. During the analysis, we use Lemma 35 to bound the loss ft(xt) ft(x ) with ft(xt) ft(x ) ft(xt), exp 1 xt (x ) 1 2αt (d2(xt, x ) d2(expxt( αt ft(xt), x )) + 1 2ζ(κ, d(xt, x))αt ft(xt) 2. Denote the intermediate point xt+1 = expxt( αt ft(xt)). Applying non-expansiveness of the projection map PK, we have d(xt+1, x ) = d(PK( xt+1), x ) d(expxt( xt+1, x )). 
(6) Combing (5) and (6) we get ft(xt) ft(x ) 1 2αt (d2(xt, x ) d2(xt+1, x )) + 1 2ζ(κ, d(xt, x))αt ft(xt) 2, (7) which is a key step to get sublinear regrets, as we can rearrange the summation and cancel term by term. Unfortunately, when it turns to non-Hadamard manifolds, equations (6) and (7) no longer hold. Thus the loss ft(xt) ft(x ) can be only bounded with ft(xt) ft(x ) 1 2αt (d2(xt, x ) d2( xt+1, x )) + 1 2ζ(κ, d(xt, x))αt ft(xt) 2, 1 2αt (d2(xt, x ) d2(xt+1, x2)) + 1 2ζ(κ, d(xt, x))αt ft(xt) 2 + 1 2αt (d2(xt+1, x ) d2(expxt( xt+1, x )), and there is an additional projection error term 1 2αt (d2(xt+1, x ) d2(expxt( xt+1, x )) (8) in the final regret, which is not clear to be sublinear. 6.2 Dealing with Projection Error and Regret Analysis An approach in recent research to deal with non-constrictive projection maps on non Hadamard manifolds is to omit projections by an invariant condition. The invariant condition assumes that all iterations of the algorithm remain in the unique g-convex feasible set K (Zhang and Sra, 2018; Ahn and Sra, 2020; Alimisis et al., 2021). In this section, we try to remove the invariant condition assumption by developing analysis on the projection error term (8) with control of the step size αt. First, we require the feasible set K to be bounded with the respect to the positive curvature bound K. Assumption 20 The diameter D of the g-convex subset K is less than π 2 Wang, Tu, Hong, Wu, and Shi Assumption 20 is a standard condition in the literature (e.g., Zhang and Sra, 2018; Ahn and Sra, 2020; Alimisis et al., 2021). From the conjugate point theorem (Lemma 31), Assumption 20 guarantees that there is no pair of conjugate points on K, thus K is uniquely g-convex and the inverse exponential map exp 1 x ( ) can be defined throughout K (even in the area that slightly deviates from K). Moreover, the Hessian comparison theorem (Lemma 33) shows that when the diameter of K is greater than π K , the subset K may be infinitely curved , i.e., the Hessian of the distance function d(x, ) reaches the infinity. Consequently, Assumption 20 is essential to establish theoretical regret bounds. In practical applications, the feasible set K depends on the prior knowledge, physical constraints, or even artificial tuning. Assumption 20 indicates that, in order to have guaranteed regret bounds, the choice of K should rely on the curvature bound K for non-Hadamard cases. Besides, our experiments (see Subsection 7.3) indicate that a feasible set K with a much larger diameter than π 2 K does not inherently impact the performance of the proposed algorithms. Under Assumption 20, projection error term (8) can be fixed by Lemma 21. Let we denote K and K > 0. Lemma 21 Suppose K M with the radius D < π 2 K . Assume that the iteration is as follows with αtgt D, ( xt+1 = expxt(αtgt) xt+1 = PK( xt+1) Then it holds that 1 2αt (d2(xt+1, x ) d2( xt+1, x )) σ(K, 2D) 1 2αt gt 2 (9) Now we look back on the R-OGD (Algorithm 1), the R-BAN (Algorithm 2) and the R-2-BAN (Algorithm 3). For g-convex cases, the condition (9) is generally fulfilled as αt f(xt) D L ζ(κ,D)T L D, (R-OGD) ζ(κ,D)T C D, (R-BAN) 22δL) D. (R-2-BAN) Furthermore, the summation PT t=1 αt gt 2 is O T for g-convex cases in Algorithms 1, 2 and 3. Consequently, sublinear regret bounds of Algorithms 1, 2 and 3 can be achieved. On the other hand, the condition (9) does not hold for strongly g-convex cases, since the strong convexity coefficient µ can be arbitrary. 
In these cases, we may set a sufficiently large constant c0 to the step size αt = 1 µ(t+c0) (or αt = B µ(t+c0) in bandit settings) such that Online Optimization over Riemannian Manifolds αt f(xt) (or αtgt , αt gt ) is small enough and thus ensure our Algorithms 1, 2 and 3 continue to be no-regret. In the following, we formally state our results of Algorithms 1, 2 and 3 over non Hadamard cases under Assumptions 20, along with a modified lemma for the constants in sub g-convexity. The analysis is quite similar to that in Hadamard cases, so we only present the proof of strong g-convex cases in Appendix G. Lemma 22 (modification of Lemma 13) Suppose that (M, g) is a Riemannian manifold whose sectional curvature is bounded above by K and below by κ. Let K be a g-convex set of M with diameter D. Denote κ = min{κ, 0} and ι = 2s(κ ,D) s(K,D) (see (10)). Then there exists a constant ρ 0 depending only on the set K such that, the following statements hold. (i) For any g-convex and g-L-Lipschitz function f, the smoothed function ˆf is 2ρδL + (n + n|κ |δ2)π2ιLδ -sub g-convex. (ii) For any µ-strongly g-convex and g-L-Lipschitz function f, the smoothed function ˆf is µ-strongly 2ρδL + 2µDδ + (n + n|κ |δ2)π2ιLδ -sub g-convex. Theorem 23 Suppose that the sectional curvature M is lower bounded by κ and upper bounded by K. Assume that the previous assumptions for the R-OGD, R-BAN, R-2-BAN and Assumption 20 hold. Denote κ = min{κ, 0}, ι = 2s(κ ,D) s(K,D) and θ = s(κ ,D+r) s(K,D+r). Then for g-convex loss functions on non-Hadamard manifolds, the following statements hold. (i) Setting step size αt = D L ζ(κ,D)t, the R-OGD algorithm achieves regret ζ(κ, D)T + DLσ(K, 2D) (ii) Setting τ = δ rθ, αt = D C ζ(κ,D)T and δ = T 1 4 , the R-BAN algorithm achieves regret E[Reg(T)] n|κ |DC p (ζ(κ, D) + σ(K, 2D) ζ(κ, D) ) + n|κ |π2ιL T 1 4 (ζ(κ, D) + σ(K, 2D) ζ(κ, D) + 3L rθ + 2ρL + nπ2ιL T 3 4 . (iii) Setting τ = δ rθ, αt = D δL ζ(κ,D)T and δ = T 1 2 , the R-2-BAN algorithm achieves regret E[Reg(T)] (n|κ |DL( p (ζ(κ, D) + σ(K, 2D) ζ(κ, D) ) + n|κ |π2ιL)T 1 (ζ(κ, D) + σ(K, 2D) ζ(κ, D) ) + 3L rθ + 2ρL + nπ2ιL Wang, Tu, Hong, Wu, and Shi Theorem 24 Suppose that the sectional curvature M is lower bounded by κ and upper bounded by K. Assume that the previous assumptions for the R-OGD, R-BAN, R-2-BAN, and Assumptions 20 hold. Denote κ = min{κ, 0}, B = n δ + n|κ |δ, ι = 2s(κ ,D) s(K,D) an and θ = s(κ ,D+r) s(K,D+r). Then for µ-strongly g-convex loss fucntions on non-Hadamard manifolds, the following statements hold. (i) Setting c0 L µD and step size αt = 1 µ(t+c0), the R-OGD algorithm achieves regret Reg(T) D2µc0 2(ζ(κ, D) + σ(K, 2D))G2 1 + log(T + c0) . (ii) Setting c0 BC µD , τ = δ rθ, αt = B µ(t+c0) and δ = 3q n C(1+log(T+c0)) T , the R-BAN algorithm achieves regret E[Reg(T)] D2µc0 2 + 4(c0 + 1) µ (ζ(κ, D) + σ(K, 2D))C 8 3 n 8 3 κ 2D ζ(κ, D) + σ(K, 2D) µ + |κ |D2π2ι n 4 3 C 4 3 + nπ2ι + 2ρL + 3L + DL + 2D2 rθ + 2µD n 1 3 C 1 3 (1 + log(T + c0)) 1 3 T 2 3 . (iii) Setting c0 BδL µD , τ = δ rθ, αt = B µ(t+c0) and δ = 1+log(T+c0) T , the R-2-BAN algorithm achieves regret E[Reg(T)] D2µc0 2 + 3(c0 + 1)n2κ 2L2 ζ(κ, D) + σ(K, 2D) 2µ + 3(c0 + 1)n|κ |π2ιL + ζ(κ, D) + σ(K, 2D)n2L2 µ + 3L + DL + 2D2 rθ + 2ρL + 2µD (1 + log(T + c0)). In Theorem 24, a regret bound of O n 4 3 (1 + log T) 1 3 T 2 3 has been established on the R-BAN algorithm for strongly g-convex functions. 
In contrast, for online optimization over Hadamard manifolds or in Euclidean spaces, such regret bound with strongly g-convex losses is of the order O n 2 3 (1 + log T) 1 3 T 2 3 . 7. Numerical Experiment In this section, we validate the findings of the proposed R-OGD, R-BAN and R-2-BAN algorithms over a number of tasks. We also compare our algorithms with the Riemannian online zeroth optimization (R-OZO) algorithm by Maass et al. (2022). The code is built with the help of the Pymanopt package (Townsend et al., 2016) and all experiments are performed in Python 3.8 on a 3.4 GHz AMD Ryzen5 machine with 16GB RAM. For reproduction of the results, all the source codes are accessible online1. 1. https://github.com/Riemannian OCO/experiments Online Optimization over Riemannian Manifolds 5000 6000 7000 8000 9000 10000 Learning rounds t E[Reg(t)] / t (a) (Expected) Average regrets vs learning rounds 0.0 0.2 0.4 0.6 0.8 1.0 Running times E[Reg(t)] / t (b) Average regrets vs running time Figure 2: Algorithm performance on hyperbolic Fr echet mean problem 7.1 Fr echet Mean on the Hyperbolic Space The Fr echet mean problem is known as finding the Riemannian centroid of a set of points on a manifold. The Fr echet mean problem has many applications, such as diffusion tensor magnetic resonance imaging (DT-MRI) (Cheng et al., 2012; Rathi et al., 2007), radar signal processing (Lapuyade-Lahorgue and Barbaresco, 2008), and batch normalization (Brooks et al., 2019). In the following, we study an online version of the Fr echet mean problem, which attempts to average a set of N time-variant points in a hyperbolic space. Denote by , M the Minkowski dot product i=1 xiyi xn+1yn+1. The hyperbolic space can be modeled as Hn = {x Rn+1| x, x M = 1}, with the metric gx(u, v) = u, v M. A hyperbolic space has constant curvature 1 and thus is a Hadamard manifold (Lee, 2018). Given points {At,1, At,2, . . . , At,N} in a hyperbolic space, the loss function ft of the online Fr echet mean problem is ft(xt) = 1 2N i=1 d2(xt, At,i) = 1 2N i=1 cosh 1( xt, At,i M)2 The loss function ft, yet is not convex in the Euclidean view, is 1-strongly g-convex (da Silva Alves et al., 2021) so that we can apply Algorithms 1, 2 and 3 to the problem. We consider the online Fr echet mean problem where [n, N, T] = [100, 10, 10000]. The first n indices of At,i are generated by an Gaussian distribution with the covariance matrix diag( n, . . . , n) and the last index is calculated by the equation At,i, At,i M = 1. We examine Algorithms 1, 2 and 3 for strongly g-convex cases with µ = 1. Additionally, we discuss the choice of δ, αt and τ in the R-BAN and the R-2-BAN algorithms as follows. Since the value of a point grows exponentially with its length in hyperbolic spaces, a large step size αt in the R-BAN and the R-2-BAN algorithms may cause numerical Wang, Tu, Hong, Wu, and Shi overflows. Consequently, we set αt = B µ(t+C0) for some C0 N, which means to start the optimization at the time C0. We take C0 = 2125 for the R-BAN algorithm and C0 = 170 for the R-2-BAN algorithm in the experiment, in case of the numerical stability. For the R-BAN algorithm, the theoretical δ turns out too conservative and not prac- tical. In the experiment, we set δ = 7 3q In this experiment, the feasible set K is a geodesic ball. Thus, we find that θ = 1 is sufficient to guarantee feasibility. As a result, we opted to use τ = δ r in our experiment. Figure 2 shows the performance of the average regret Reg(T) T versus the number of learning rounds and the running time. 
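(For reference, the per-round loss and its Riemannian gradient in this experiment can be evaluated as in the following minimal NumPy sketch; the closed-form inverse exponential map on the hyperboloid model is a standard expression, quoted here as an assumption of the illustration rather than as code released with the paper.)

```python
import numpy as np

def minkowski(u, v):
    """Minkowski product <u, v>_M = sum_{i<=n} u_i v_i - u_{n+1} v_{n+1}."""
    return np.dot(u[:-1], v[:-1]) - u[-1] * v[-1]

def hyperbolic_log(x, y):
    """exp_x^{-1}(y) on the hyperboloid model (points with <x, x>_M = -1 in
    this sign convention); standard closed form, assumed for illustration."""
    alpha = -minkowski(x, y)                           # equals cosh d(x, y)
    dist = np.arccosh(np.clip(alpha, 1.0, None))
    y_tan = y - alpha * x                              # tangent component of y at x
    norm = np.sqrt(max(minkowski(y_tan, y_tan), 1e-15))
    return dist * y_tan / norm

def frechet_loss_and_grad(x, A_t):
    """f_t(x) = (1/2N) sum_i d^2(x, A_{t,i}) and its Riemannian gradient
    -(1/N) sum_i exp_x^{-1}(A_{t,i}), as used by the gradient step here."""
    logs = [hyperbolic_log(x, a) for a in A_t]
    dists = [np.arccosh(np.clip(-minkowski(x, a), 1.0, None)) for a in A_t]
    loss = 0.5 * float(np.mean(np.square(dists)))
    grad = -np.mean(logs, axis=0)
    return loss, grad
```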
The expected average regrets of the R-BAN and the R-2-BAN are performed in the average of 100 random runs with error bars. Figure 2 indicates that the regrets of all three algorithms go sublinearly with the number of learning rounds T. As seen, the one-point bandit algorithm performs most poorly among the three algorithms, while the two-bandit algorithm achieves a comparable regret bound with the R-OGD algorithm in the full information feedback setting, which is consistent with our theoretical findings and also matches the empirical performance in Euclidean spaces (Lei et al., 2020). We also compare our bandit algorithms (Algorithms 2 and 3) with the Riemannian online zeroth algorithm (R-OZO) by Maass et al. (2022). We observe that in this case our R-BAN and R-2-BAN algorithms perform better regrets with less or equal information, since the R-OZO needs function values of two points. 7.2 Operator Scaling on SPD Matrices The operator scaling problem is an example of g-convex but not strongly g-convex Riemannian optimization, which is defined on the manifold of SPD matrices {X Rn n|XT = X, X 0}, with the metric g X(U, V ) = Tr(X 1UX 1V ). Given a tuple of n n matrices (A1, A2, . . . , AN), the operator scaling attempts to find X, Y Rn n such that ˆAi = Y 1Ai X is doubly stochastic for all i = 1, 2, 3, . . . , N. The operator scaling problem has drawn abundant interest in many areas, such as computing non-commutative rank (Ivanov et al., 2017) and computing Brascamp-Lieb constants (Garg et al., 2018). In this subsection, we study an online form of the operator scaling problem, which is to find Xi and Yi for time-varying matrices (At,1, . . . , At,N). The problem can be formulated in terms of minimizing the log capacity of the operator Tt(X) = PN i=1 At,i XAT t,i, that is ft(Xt) = log det(T(Xt)) log det(Xt), t = 1, 2, . . . , T. The loss function ft is g-convex, but not strongly g-convex. We test Algorithms 1, 2, 3 and the R-OZO for the case [n, N, T] = [5, 2, 50000] by taking D = 5, L = 2, and C = 7. For the R-BAN algorithm, we set the parameter δ = 0.67, which is 10 times as the theoretical value and also set τ = δ r. In addition, since the R-OZO Online Optimization over Riemannian Manifolds 30000 35000 40000 45000 50000 Learning rounds t (scaled) E[Reg(t)] / t (a) (Expected) average regrets vs learning rounds 0 5 10 15 20 25 Running times E[Reg(t)] / t (b) Average regrets vs running time Figure 3: Algorithm performance on operator scaling problem is designed for σ-strongly g-convex functions and requires a nonzero σ to set up the step size, we take σ = 0.001 in the R-OZO. The remaining parameters of the R-OZO are set as L = 1 (smooth coefficient) and V = 9. Figure 3 again shows that the (expected) regrets of Algorithms 1, 2 and 3 are sublinear with T. The R-2-BAN reaches a comparable bound under the full information setting in the online operator scaling problem and the R-OZO presents a regret bound comparable to that of the R-BAN. The above results showcase the applicability for our Algorithms 1, 2 and 3 in g-convex settings. 7.3 Principal Component Analysis on Grassmann Manifolds At last, we test the effectiveness of Algorithms 1, 2 and 3 on manifolds with positive curvature. An important instance is the principal component analysis (PCA) on Grassmann manifolds. Given a set of data points {A1, . . . 
, AN} in Rn, the PCA problem is to learn an orthogonal projector X Rn r that minimizes the sum of the squared residual errors between the projected data points and the original data points, which is a significant dimensionality reduction issue when handling high-dimensional data in the real world (Anzai, 2012). In this subsection, we consider an online PCA problem, where the loss at the time t is ft(Xt) = 1 2N i=1 At,i Xt XT t At,i 2 2, where {At,1, . . . At,N} is a batch of N data points, and Xt is a point on the Stiefel manifold St(d, n), which is formed by d n matrices with orthogonal columns. data set class sample feature experimental parameters iris 3 4 150 [n, N, T, d, µ, θ] = [4, 1, 150, 2, 2, 1] egg-eye-state 2 14 14980 [n, N, T, d, µ, θ] = [14, 1, 14000, 3, 0.1, 1] waveform-5000 3 40 5000 [n, N, T, d, µ, θ] = [40, 5, 1000, 10, 5, 1] Table 2: Descriptions and settings of testing data sets for online PCA Wang, Tu, Hong, Wu, and Shi 0 20 40 60 80 100 120 140 Learning rounds t E[Reg(t)] / t 400 500 600 700 800 900 1000 Learning rounds t E[Reg(t)] / t (b) Waveform Figure 4: Algorithm performance on the iris and waveform data set. Since the group action X XY does not change the value of ft(xt) for any orthogonal matrices Y O(d), we view ft as a function on the quotient Grassmann manifold Gr(d, n) = St(d, n)/O(d) with the metric g X(U, V ) = Tr(UT V ). Then the online problem is equivalent to find min ft(Xt) = 1 i=1 AT t,i Xt XT t At,i, Xt Gr(d, n). The Grassmann manifold is homogeneous and symmetric, and the sectional curvature of Grassmann manifolds takes value in [0, 2]. Consequently, we can apply Algorithms 1, 2 and 3 to the online PCA problem. We examine Algorithms 1, 2, 3 and the R-OZO on three real-world data sets from the openml database2, including iris, eeg-eye-state and waveform-5000. All the data sets are normalized, and the eeg-eye-state data set is randomly shuffled and excludes outliers. Figure 4 and Figure 5(a) show the average regret Reg(T) T in the three real-world data sets. All the R-BAN, R-2-BAN and R-OZO are conducted for 100 random runs and ploted with error bars. In the R-BAN and R-2-BAN algorithms, the step size αt and δ are taken referring to the theoretical values. For the R-OZO algorithm, we set σ = µ, L = 1 (smooth coefficient) and V = 2. As shown, the R-2-BAN performs comparably with the R-OGD, and the performance of the R-OZO is between that of the R-BAN algorithm and that of the R-2-BAN algorithm. Besides, All of our algorithms achieve sublinear regret. The results in PCA analysis demonstrate the effectiveness of our algorithms in non-Hadamard cases. At last, we test the eeg-eye-state data set for the situation when the diameter D π 2 K . In particular, we test D = π 2 (the injectivity radius of the Grassmann manifold) in Figures 5(b) and (d), and D = 2 (the diameter of the Grassmann manifold) in Figures 5(c) and (e). The initial points in Figures 5(b) and (c) are as same as that in Figure 5(c), and are extremely close to the boarder of the feasible set in Figures 5 (d) and (e). Figures 5(a), (b), and (c) show that, when the initial point does not change, diameter D does not influence the (expected) regrets of our Algorithms 1, 2 and 3. Furthermore, Figures 5(d) and (e) show that Algorithms 1, 2 and 3 can still achieve no-regret even when Assumption 20 does 2. 
We examine Algorithms 1, 2, 3 and the R-OZO on three real-world data sets from the openml database (https://www.openml.org/), including iris, eeg-eye-state and waveform-5000. All the data sets are normalized, and the eeg-eye-state data set is randomly shuffled with outliers excluded. Figure 4 and Figure 5(a) show the average regret Reg(T)/T on the three real-world data sets. The R-BAN, R-2-BAN and R-OZO are each conducted for 100 random runs and plotted with error bars. In the R-BAN and R-2-BAN algorithms, the step size α_t and the parameter δ are chosen according to their theoretical values. For the R-OZO algorithm, we set σ = µ, L = 1 (smooth coefficient) and V = 2. As shown, the R-2-BAN performs comparably with the R-OGD, and the performance of the R-OZO lies between that of the R-BAN algorithm and that of the R-2-BAN algorithm. Besides, all of our algorithms achieve sublinear regret. The results of the PCA experiments demonstrate the effectiveness of our algorithms in non-Hadamard cases.

At last, we test the eeg-eye-state data set for the situation where the diameter D exceeds π/(2√K). In particular, we test D = π/2 (the injectivity radius of the Grassmann manifold) in Figures 5(b) and (d), and D = 2 (the diameter of the Grassmann manifold) in Figures 5(c) and (e). The initial points in Figures 5(b) and (c) are the same as that in Figure 5(a), while the initial points in Figures 5(d) and (e) are extremely close to the border of the feasible set.

Figure 5: Algorithm performance on the eeg-eye-state data set under different diameters D and initial points ((a)-(e); average regret E[Reg(t)]/t vs. learning rounds t).

Figures 5(a), (b), and (c) show that, when the initial point does not change, the diameter D does not influence the (expected) regrets of our Algorithms 1, 2 and 3. Furthermore, Figures 5(d) and (e) show that Algorithms 1, 2 and 3 can still achieve no-regret even when Assumption 20 does not hold. The above results give evidence for the applicability of our Algorithms 1, 2 and 3 in practical scenarios.

8. Conclusion

We considered an online optimization problem on Riemannian manifolds in the full information, one-point bandit, and two-point bandit feedback settings, which extends the Euclidean counterpart. The upper regret bounds of the R-OGD, R-BAN, and R-2-BAN algorithms, together with a universal lower regret bound, were established with the influence of curvature clearly indicated. All of the regret bounds are consistent with their Euclidean counterparts. An interesting direction moving forward is to take retraction into consideration. A retraction map is a cheap approximation of the exponential map on manifolds and is a sensible choice in many real scenarios. In future work, we intend to design Riemannian online optimization methods with retraction maps, so that the resulting algorithms can be more effective in large-scale optimization problems.

Acknowledgments

The authors would like to express their gratitude to the action editor Silvia Villa and the anonymous reviewers for their constructive comments and valuable suggestions, which greatly improved this work. The authors would also like to thank Zihao Hu from the School of Computer Science at the Georgia Institute of Technology for his helpful comments and discussions on the feasibility and projection error of the Riemannian bandit algorithms, which played a significant role in enhancing the mathematical rigor of this work. This work is supported in part by the Shanghai Municipal Science and Technology Major Project under Grant 2021SHZDZX0100 and the National Natural Science Foundation of China under Grant 61733018, and in part by the Australian Research Council under Grants DP190103615, LP210200473, and DP230101014.

Appendix A. Basic Definitions and Technical Lemmas

In this section, we recall some basic definitions and results from Riemannian geometry, which are useful in our analysis.

Definition 25 (Divergence) For a vector field X, the divergence Div(X) is the trace of the operator ∇X. More precisely, if {e_1, ..., e_n} is an orthonormal basis of the tangent space T_x M, the divergence of X at x can be expressed as Div(X)(x) = Σ_{i=1}^n ⟨∇_{e_i} X(x), e_i⟩.

Lemma 26 (Berestovskii and Nikonorov, 2020, Prop. 3.1.6) If η is a Killing vector field on M, then it satisfies the following. (i) For every vector field X, ⟨∇_X η, X⟩ = 0. As a corollary, the divergence Div(η) ≡ 0. (ii) For every geodesic γ, η|_γ is a Jacobi field.

Lemma 27 (do Carmo, 1992, Prop. 3.6) If η is a Jacobi vector field along a geodesic γ : [0, 1] → M and we denote η(t) = η(γ(t)), then ⟨η(t), γ̇(t)⟩ = ⟨η(0), γ̇(0)⟩ + t⟨∇_{γ̇} η(0), γ̇(0)⟩ for all t ∈ [0, 1].

Lemma 28 (Nomizu, 1960) Let M be a simply connected complete Riemannian homogeneous manifold.
Then for every x ∈ M and every X ∈ T_x M, there exists a Killing vector field η such that η(x) = X. The flow of η exists and consists of a one-parameter group of isometries.

Lemma 29 (Divergence theorem, Lee, 2018, 2-22) Let M be a Riemannian manifold with volume form ω, let K ⊂ M have boundary ∂K, and let n be the (outer) unit normal vector field of ∂K. Then, for any vector field X and any differentiable function f,

∫_K X(f)(u) ω = ∫_{∂K} f(u) ⟨X, n⟩ ω_{∂K} − ∫_K Div(X) f(u) ω,

where ω_{∂K} is the volume form of ∂K induced by ω.

Lemma 30 (Jacobi Field Comparison, Lee, 2018, Thm. 11.9) Suppose that M is a Riemannian manifold, γ : [0, b] → M is a unit-speed geodesic segment without conjugate points, and J is a Jacobi field along γ such that J(0) = 0. Denote

s(κ, t) = t, if κ = 0; (1/√κ) sin(√κ t), if κ > 0; (1/√|κ|) sinh(√|κ| t), if κ < 0.

(i) If the sectional curvature of M is bounded above by a constant K > 0, then ‖J(t)‖ ≥ s(K, t) ‖J̇(0)‖ for all t ∈ [0, b_1], where b_1 = min{b, π/√K}.
(ii) If the sectional curvature of M is bounded below by a constant κ < 0, then ‖J(t)‖ ≤ s(κ⁻, t) ‖J̇(0)‖ for all t ∈ [0, b], where κ⁻ = min{κ, 0}.

Lemma 31 (Conjugate Point Theorem, Lee, 2018, Thm. 11.12) Suppose M is a Riemannian manifold whose sectional curvature is bounded above by K.
(i) If K ≤ 0, then no point of M has conjugate points along any geodesic.
(ii) If K > 0, then there are no conjugate points along any geodesic segment shorter than π/√K.

Since, by the Morse index theorem (Lee, 2018, Thm. 10.18), the absence of conjugate points is equivalent to unique g-convexity, we have the following corollary.

Corollary 32 Suppose (M, g) is a Riemannian manifold whose sectional curvature is bounded above by K.
(i) If K ≤ 0, then any g-convex set in M is uniquely g-convex.
(ii) If K > 0, then any g-convex set in M with diameter less than π/√K is uniquely g-convex.

Lemma 33 (Hessian Comparison Theorem, Lee, 2018, Thm. 11.7) Suppose M is a Riemannian manifold and x ∈ M. Denote ρ_x(y) = d(x, y) and

c(κ, t) = √κ cot(√κ t), if κ > 0; √|κ| coth(√|κ| t), if κ < 0.

(i) If the sectional curvature of M is bounded above by a constant K > 0, then the inequality ∇²ρ_x(y) ⪰ c(K, ρ_x(y)) Id holds in U := {y ∈ M | d(x, y) < π/√K}.
(ii) If the sectional curvature of M is bounded below by a constant κ < 0, then the inequality ∇²ρ_x(y) ⪯ c(κ, ρ_x(y)) Id holds on all of M.

Lemma 34 (Volume Comparison Theorem, Lee, 2018, Thm. 11.19) Let M denote an n-dimensional Riemannian manifold with sectional curvature lower bounded by κ. Given p ∈ M, we denote by V_r the volume of the ball of radius r about p, and by V_{r,κ} the volume of a ball of radius r in the n-dimensional constant-curvature model space with curvature κ, that is, the sphere S^n(1/√κ), the Euclidean space R^n, or the hyperbolic space H^n(1/√|κ|) when κ is positive, zero, or negative, respectively. Then the function r ↦ V_r / V_{r,κ} is non-increasing.

Lemma 35 (Zhang and Sra, 2016, Lemma 5) Let a, b, c be the side lengths of a geodesic triangle on a Riemannian manifold with sectional curvature lower bounded by κ, and let A be the angle between sides b and c. Then

a² ≤ ζ(κ, c) b² + c² − 2bc cos A.

Lemma 36 (Bačák, 2014) Let (M, g) be a Hadamard manifold and let K be a closed g-convex set. Then the mapping P_K is single-valued and nonexpansive, that is, for every x, y ∈ M,

d(P_K(x), P_K(y)) ≤ d(x, y).

Lemma 37 (Berestovskii and Nikonorov, 2020) For a given point x on a Riemannian symmetric manifold M, the symmetry s_x reverses every geodesic through the point x. Moreover, the derivative map ds_x at x is −Id on T_x M.
Lemma 38 (Berestovskii and Nikonorov, 2020) A Riemannian symmetric manifold is homogeneous. Appendix B. Proofs of Theorems 5 and 6 In this appendix, we prove Theorems 5 and 6 in Subsections B.1 and B.2, respectively. B.1 Proof of Theorem 5 By the g-convexity, we have ft(xt) ft(x ) ft(xt), exp 1 xt (x ) . (11) Denote xt+1 = expxt( αt ft(xt)). Recalling Lemma 35 in the geodesic triangle xt xt+1x αt ft(xt), exp 1 xt (x ) 1 2(d2(xt, x ) d2( xt+1, x )) + 1 2ζ(κ, d(xt, x )) αt ft(xt) 2. Since xt+1 = PK(expxt( αtgt)) = PK( xt+1), applying Lemma 36 we have d(xt+1, x ) d( xt+1, x ). Wang, Tu, Hong, Wu, and Shi αt ft(xt), exp 1 xt (x ) 1 2(d2(xt, x ) d2(xt+1, x )) + 1 2ζ(κ, d(xt, x )) αt ft(xt) 2. Combining (11) and (12), we get ft(xt) ft(x ) 1 2αt (d2(xt, x ) d2(xt+1, x )) + 1 2ζ κ, d(xt, x ) αt ft(xt) 2. With the Lipschitz constant L, we have ft(xt) ft(x ) 1 2αt (d2(xt, x ) d2(xt+1, x )) + 1 2ζ κ, d(xt, x ) L2αt. (13) Summing (13) from 1 to T, we obtain d2(xt, x ) d2(xt+1, x ) + 1 2ζ κ, d(xt, x ) L2αt t=2 d2(xt, x )( 1 2αt 1 2αt 1 ) + 1 2α1 d2(x1, x ) + 1 t=1 ζ κ, d(xt, x ) αt. Since the set K has diameter D, it follows immediately that d(xt, x ) D and ζ κ, d(xt, x ) ζ(κ, D) for every t = 1, 2, . . . , T, which implies Reg(T) D2 T X 2αt 1 2αt 1 ) + D2 1 2ζ(κ, D)L2 T X 2ζ(κ, D)L2 T X Setting αt = D L ζ(κ,D)t, we get Reg(T) DL p 2ζ(κ, D)L2 T X 2ζ(κ, D)L2 2D The second inequality is based on the inequality PT t=1 1 T, and then we complete our proof for Theorem 5. Online Optimization over Riemannian Manifolds B.2 Proof of Theorem 6 By the strong g-convexity, we have ft(xt) ft(x ) ft(xt), exp 1 xt (x ) µ 2 d2(xt, x ). With the help of Lemma 35, Lemma 36 and the Lipschitz constant L, we have ft(xt) ft(x ) 1 2αt d2(xt, x ) d2(xt+1, x ) + 1 2ζ κ, d(xt, x ) L2αt µ 2 d2(xt, x ). Summing (14) from 1 to T, we obtain d2(xt, x ) d2(xt+1, x ) + 1 2ζ(κ, d(xt, x ))L2αt 2 d2(xt, x ) t=2 d2(xt, x )( 1 2αt 1 2αt 1 µ 2 ) + d2(x1, x )( 1 t=1 ζ(κ, d(xt, x ))αt. Substituting d(xt, x ) D and ζ(κ, d(xt, x )) ζ(κ, D) for t = 1, 2, . . . , T, we obtain 2αt 1 2αt 1 µ 2 ) + D2( 1 t=1 ζ(κ, d(xt, x ))αt 2ζ(κ, D)L2 T X Setting αt = 1 Reg(T) 0 + 1 2ζ(κ, D)L2 T X t=1 αt ζ(κ, D)L2 2µ (1 + log T). We completed the proof. Appendix C. Proof of Theorem 7 In this appendix, we first introduce an instance of Riemannian online convex optimization called the Riemannian online Busemann optimization (ROBO) and then prove Theorem 7 by analyzing the worst-case regret of the ROBO problem. C.1 Riemannian Online Busemann Optimization We first introduce the definition of Busemann functions (Ballmann, 2012), which are used to study the large-scale geometry of Hadamard manifolds. Wang, Tu, Hong, Wu, and Shi Definition 39 (Ballmann, 2012) Let M be a Hadamard manifold and γ : [0, ) be a geodesic ray on M with γ(0) = 1. Then the Busemann function with γ is defined as fγ(x) = lim t d(x, γ(t)) t . Here are some properties of Busemann functions. Lemma 40 (Ballmann, 2012) If fγ is a Busemann function, then the following properties hold. (i) fγ is g-convex; (ii) fγ(γ(t)) = γ(t) for every t [0, ); (iii) fγ(x) 1 for every x M. Next, we introduce some notations. Let D, L > 0 be two constants, M be a Hadamard manifold, p M and γ : R M be a geodesic with γ(0) = 1 and γ(0) = p. Then we consider an instance of R-OCO problem termed Riemannian online Busemann optimization (ROBO) on M, where the convex set K is the ball centered p with radius D, i.e., K = {x M|d(x, p) D}, and the loss function ft is randomly and uniformly chosen in the set {Lf+, Lf }. 
Here, f+ and f are Busemann functions related to the geodesic rays γ+(t) = γ(t) and γ (t) = γ( t). The regret of the ROBO problem is t=1 ft(xt) min x K In the last part of the section, we propose a lemma on the minimum of PT t=1 ft(x). Lemma 41 The minimum of af+(t) + bf (t), (a, b N) in K is |a b|D. Proof By the convexity of f+ and f , we have af+(x) + bf (x) af+(p) + bf (p) + a f+(p) + b f (p), exp 1 p (x) , x K. Because f+(p) = γ(0), f (p) = γ(0) and f (p) = 0, we have af+(x) + bf (x) (a b) γ(0), exp 1 p (x) , x K. Moreover, since γ(0) = 1 and exp 1 p (x) = d(x, p) D, we have min x K af+(x) + bf (x) min x K (a b) γ(0), exp 1 p (x) |a b|D. (15) However, we see that af+(γ(D)) + bf (γ(D)) = (b a)D, and af+(γ( D)) + bf (γ( D)) = (a b)D, which imply that min x K af+(x) + bf (x) min n (b a)D, (a b)D o = |a b|D. (16) Following from (15) and (16), we complete our proof. Online Optimization over Riemannian Manifolds C.2 Proof of Theorem 7 We begin our proof with an analysis of the worst-case regret of the ROBO problem. In the ROBO, the expectation of the regret on loss functions {f1, f2, . . . , ft} is Ef1,...,ft[Reg(T)] =Ef1,...,ft[ t=1 ft(xt) min x K =Ef1,...,ft[ t=1 ft(xt)] Ef1,...,ft[min x K t=1 ft(x)]. (17) Since ft is uniformly and independently chosen in {f+, f }, we can get Ef1,...,ft[ t=1 ft(xt)] = t=1 Eft[ft(xt)] 1 2(Lf+(xt) + Lf (xt)) 2 min x K(f+(x) + f (x)). From Lemma 41, Ef1,...,ft[ t=1 ft(xt)] 0. (18) Putting (18) into (17), we obtain Ef1,...,ft[Reg(T)] Ef1,...,ft[min x K t=1 ft(x)]. By Lemma 41, Ef1,...,ft[Reg(T)] Ef1,...,ft[min x K = Ef1,...,ft = Eϵ1,...,ϵT = Eϵ1,...,ϵT where ϵt are i.i.d Rademacher variables ϵt = 1 with probability 1/2. From the Khinchine s inequality (Cesa-Bianchi and Lugosi, 2006), we finally get Ef1,...,ft[Reg(T)] DL 2 Eϵ1,...,ϵT Wang, Tu, Hong, Wu, and Shi which indicates that no matter how we choose strategies in the ROBO, there is a sequence of functions {f1, . . . , ft} {Lf+, Lf }T to make the regret not less than DL T. Considering that the diameter of the set K is 2D and the Lipschitz constant of {Lf+, Lf } is L, we complete our proof. Appendix D. Proofs of Lemmas 11 and 13 In this appendix, we prove Lemmas 11 and 13 in Subsections D.1 and D.2, respectively. D.1 Proof of Lemma 13 We initially examine the first part of the lemma. Take a vector X Mx arbitrarily. From Lemma 28, we can find a Killing vector field η on M such that η(x) = X. The flow of η consists of a one-parameter group of isometries {φt}t R. Then the directional derivative of ˆf along X can be written as X(ˆf(x)) = lim t 0 ˆf(φt(x)) ˆf(x) Vδ lim t 0 1 Bδ(φt(x)) f(u)ω Z Bδ(x) f(u)ω . (20) Since φt is an isometry that preserves the distance, φt(Bδ(x)) = Bδ(φt(x)). By the substitution rule of integration (Chern et al., 1999), we have Z Bδ(φt(x)) f(u)ω = Z Bδ(x) f(φt(u))φ t (ω). (21) Because φt preserves the metric g, it preserves the volume form, i.e., φ t (ω) = ω, which gives Z Bδ(φt(x)) f(u)ω = Z Bδ(x) f(φt(u))ω. (22) Combining equations (20) and (22) together, we have X(ˆf(x)) = 1 Bδ(x) lim t 0 f(φt(u)) f(u) t |t=0(f)ω. (23) By definition of the flow, φt(u) t |t=0 = η(u). Hence, we can rewrite (23) as X(ˆf(x)) = 1 Bδ(x) η(f)ω. According to Lemma 29, we have X(ˆf(x)) = 1 Sδ(x) f(u) η(u), n(u) ωSδ(x) 1 Bδ(x) Div(η)(u)f(u)ω Sδ(x) f(u) η(u), n(u) ωSδ(x) (24) Online Optimization over Riemannian Manifolds where ωSδ(x) is the volume form of Sδ(x) induced by ω and n is the (outer) unit normal vector field of Sδ(x). The last equation is due to Div(η) 0 stated in Lemma 26. 
Then we need to compute η, n for each point u Sδ(x). Since geodesics start at the center x are vertical to the sphere Sδ(x), the outer normal vector n(u) can be written as γu(1) γu(1) for the geodesic γu such that γu(0) = x and γu(1) = u. Therefore, we can write η(u), n(u) as η(u), n(u) = 1 γu(1) η(γu(1)), γu(1) . Since η is Killing, by Lemma 26, η(γu(t)) is Jacobi. By Lemma 27, we have η(u), n(u) = 1 γu(1) η(γu(1)), γu(1) η(γu(0)), γu(0) + 1 γuη(γu(0)), γu(0) , η(γu(0)), γu(0) + 0 . (25) Applying η(γu(0)) = η(x) = X and γu(0) = exp 1 x (u) to (25) yields η(u), n(u) = X, exp 1 x (u) exp 1 x (u) . (26) Substituting (26) to (24), we have X(ˆf(x)) = 1 Sδ(x) f(u) X, exp 1 x (u) exp 1 x (u) ωSδ(x) = 1 Sδ(x) f(u) exp 1 x (u) exp 1 x (u) ωSδ(x), X . Because the directional derivative X(ˆf(x)) coincides with the term ˆf(x), X , we can obtain ˆf(x), X = 1 Sδ(x) f(u) exp 1 x (u) exp 1 x (u) ωSδ(x), X . For the arbitrariness of the vector field X, we conclude that Sδ(x) f(u) exp 1 x (u) exp 1 x (u) ωSδ(x) = Sδ Vδ Eu Sδ(x) h f(u) exp 1 x (u) exp 1 x (u) which completes the proof of the first part. Then we examine the second part of the lemma. From the first part, it is clear to see Since the sectional curvature of M is lower bounded by κ , from Lemma 34, the function g(r) = Vr Vr,κ is non-increasing and so does log g(r). Therefore, d dr log(g(r)) = d dr log Vr d dr log Vr,κ 0. Wang, Tu, Hong, Wu, and Shi Since deriving the volume of a ball along the radius gives the surface area of its sphere, we can write d dr log(g(r)) = Sr Vr,κ 0, (27) where Sr and Sr,κ are the surface area of the balls in M and the corresponding constant curvature space, respectively. Setting r = δ in (27), we get Sδ Vδ Sδ,κ Vδ,κ . From calculation, it shows that |κ |δ) R δ 0 sinhn 1( |κ |t)dt, κ = κ < 0 So we have completed the proof for the case κ = 0. Then we focus on the case that κ < 0. By a change of variable u = sinh t, we find 0 sinhn 1( p |κ |t)dt = |κ | 1/2 Z sinh( 0 un 1(1 + u2) 1/2du Integration by parts gives 0 sinhn 1( p |κ |t)dt = sinhn( p |κ | cosh( p |κ |δ) + |κ | 1/2 Z sinh ( 1 nun+1(1 + u2) 3/2du |κ | cosh( p Putting it into the expression of Sδ Vδ , we get |κ | coth( p Applying the inequality coth(x) < x + 1/x, we have δ + n|κ |δ, δ > 0. Hence, for every δ > 0, δ + n|κ |δ , which completes our proof. Online Optimization over Riemannian Manifolds D.2 Proof of Lemma 11 First we examine (i). Without loss of generality, we assume f(x) = 0. By the homogeneity of the manifold M, we can find an isometry φ such that φ(x) = y. Denote the vector field V (u) = exp 1 u (φ(u)). Clearly, we obtain ˆf(y) ˆf(x) = 1 Bδ(y) f(u)ω Z Bδ(x) f(u)ω Bδ(φ(x)) f(u)ω Z Bδ(x) f(u)ω . With the method shown in (21) and (22), we obtain ˆf(y) ˆf(x) = 1 Bδ(x) f(φ(u)) f(u)ω. By the g-convexity of f, ˆf(y) ˆf(x) = 1 Bδ(x) f(φ(u)) f(u)ω Bδ(x) f(u), exp 1 u (φ(u)) ω Bδ(x) f(u), V (u) ω Bδ(x) V (f(u))ω. (28) By Lemma 29, we have Z Bδ(x) V (f(u))ω = Z Sδ(x) f(u) V (u), n(u) ωSδ(x) Z Bδ(x) Div(V )f(u)ω. (29) Hence, we rewrite (28) as ˆf(y) ˆf(x) 1 Sδ(x) f(u) V (u), n(u) ωSδ(x) Z Bδ(x) Div(V )f(u)ω . In Lemma 11, we have already shown that ˆf(x), exp 1 x (y) = 1 Sδ(x) f(u) exp 1 x (u) exp 1 x (u) ωSδ(x), exp 1 x (y) Sδ(x) f(u) exp 1 x (u) exp 1 x (u) ωSδ(x), V (x) . (30) Denote by m(u) the vector exp 1 x (u) exp 1 x (u) . Combining (29) and (30) gives ˆf(y) ˆf(x) ˆf(x), exp 1 x (y) 1 Sδ(x) f(u) V (u), n(u) V (x), m(u) ωSδ(x) Bδ(x) Div(V )f(u)ω . (31) Wang, Tu, Hong, Wu, and Shi Here we claim that V (u), n(u) V (x), m(u) 0, u Sδ(x). 
( ) This claim requires many geometric details that deviates our attention from the proof, and we will prove it afterwards. If the claim ( ) holds, then with the g-L-Lipschitzness of f and the condition f(x) = 0, we have Z Sδ(x) f(u) V (u), n(u) V (x), m(u) ωSδ(x) Sδ(x) δL V (u), n(u) V (x), m(u) ωSδ(x) Sδ(x) δL V (u), n(u) ωSδ(x) Z Sδ(x) δL V (x), m(u) ωSδ(x) . (32) By Lemma 11, 1 Vδ R Sδ(x) δL V (x), m(u) ωSδ(x) in (32) is the gradient of the function Bδ(x) δL ω δL, Sδ(x) δL V (x), m(u) ωSδ(x) = 0. (33) Combining (31)-(33), we have ˆf(y) ˆf(x) ˆf(x), exp 1 x (y) 1 Sδ(x) δL V (u), n(u) ωSδ(x) Bδ(x) Div(V )f(u)ω . Applying Lemma 29 again, we obtain Sδ(x) δL V (u), n(u) ωSδ(x) = 1 Bδ(x) V (δL)ω 1 Bδ(x) Div(V )δLω Bδ(x) Div(V )δLω . Therefore, there holds ˆf(y) ˆf(x) ˆf(x), exp 1 x (y) 1 Bδ(x) Div(V )(f(x) + δL)ω 2δL sup u Bδ(x) |Div(V (u))|. (34) Note that V (u) = exp 1 u (φ(u)) is continuous on p and φ, and φ is continuous on x and y. Thus, |Div(V (u))| is a continuous function of (x, y, u) K K K. Denote ρ = sup (x,y,u) K K K |Div(V (u))|. Online Optimization over Riemannian Manifolds Since the boundedness of K set yields the compactness of K K K, we have ρ < . Putting ρ into (34) establishes the desired result. Then we begin to prove ii). From the strong g-convexity of f, ˆf(y) ˆf(x) = 1 Bδ(x) V (f(u)) + µ 2 d2(u, φ(u))ω. Thus it remains to prove that 2 d2(u, φ(u))ω µ 2 d2(x, y) 2µD, which is obvious from the fact |d2(u, φ(u)) d2(x, y)| = (d(u, φ(u)) + d(x, y))|(d(u, φ(u)) d(x, y))| 2D 2δ = 4Dδ. D.3 Proof of the Claim ( ) Fix u Sδ(x) and denote by ξu(s) = expx(s m(u)) the geodesic with the initial tangent vector m(u). Consider the following rectangle map Γu : [0, 1] [0, δ] M (t, s) expξu(s)(t V (ξu(s))). Set T(t, s) = Γu t (t, s) and S(t, s) = Γu s (t, s). For a fixed t, the length of the curve γt(s) = Γu(t, s), (0 s δ) is defined as lu(t) = Z δ S(t, s), S(t, s) ds. The first variation formula (see Lee, 2018, Theorem 6.3) gives, l u(0) = T(0, δ), S(0, δ) T(0, 0), S(0, 0) . Because T(0, s) = V (ξu(s)), for all s [0, δ], S(0, 0) = m(u) and S(0, δ) = n(u), we have l u(0) = V (u), n(u) V (x), m(u) . To prove ( ), it is sufficient to show that l u(0) 0. Let us focus on the second derivative of the function lu(t), that is, l u(t) = d2 S(t, s), S(t, s) ds S(t, s), S(t, s) ds S T S, S )ds 0 1 S 3 T S, S 2 + 1 S T S, T S + 1 S T T S, S ds. (35) Wang, Tu, Hong, Wu, and Shi For every fixed s, the curve γs(t) = Γu(t, s) is a geodesic, hence S is the variation field of the geodesic γs(t) and becomes a Jacobi field. Putting the Jacobi equation into (35), we have l u(t) = Z δ 0 1 S 3 T S, S 2 + 1 S T S, T S + 1 S R(T, S, S, T)ds. By the Cauchy Schwarz inequality, T S, S 2 S 2 T S 2, which yields 0 1 S 3 S 2 T S 2 + 1 S T S, T S + 1 S R(T, S, S, T)ds 1 S R(T, S, S, T)ds. From the definition of sectional curvature, R(T, S, S, T) = K(Π)|T S|2, where K(Π) is the sectional curvature of the two-dimensional submanifold spanned by T and S. Under the assumption that M has nonpostive sectional curvature, we get 1 S R(T, S, S, T)ds 0, which means that lu(t) is convex in [0, 1]. Let us look back on the function lu(t). Note that the 0-curve is γs(0) = ξ(s), and the 1-curve is γs(1) = expξ(s)(V (ξ(s))) = expξ(s)(exp 1 ξ(s)(φ(ξ(s))) = φ(ξ(s)). Since the mapping φ is an isometry, the length of ξ(s) is equal to the length of φ(ξ(s)). As a result, there holds lu(0) = lu(1). (36) The convexity of lu immediately leads that which proves the claim ( ). Appendix E. 
Proofs of Theorems 14 and 15 Before the proof, we propose some lemmas about the expected online gradient descent for λ-sub g-convex functions. Lemma 42 (Sub-convex Cases) Suppose that S is a subset of a g-convex set K M with diameter D, and {ft}t=1,2,...,T be a series of λ1-sub g-convex smooth functions. If the sequence {xt}t=1,2,...,T is generated by xt+1 = H PK expxt( αgt) , (37) Online Optimization over Riemannian Manifolds where the random vector gt satisfies that E[gt|xt] = ft(xt) and E[ gt ] G, and for the operator H : K S, there exist a constant λ2 0 satisfies d2(H (x), y) d2(x, y) + λ2, x K, y S. Then with α = D G ζ(κ,D)T , we have t=1 ft(xt) i min x S t=1 ft(x) DG p ζ(κ, D)T + λ1T + λ2T. Proof Let x = arg minx S PT t=1 ft(x). From λ1-sub g-convexity, the difference between ft(xt) and ft(x ) is bounded by ft(xt) ft(x ) ft(xt), exp 1 xt (x ) + λ1 = E[gt|xt], exp 1 xt (x ) + λ1 = E h gt, exp 1 xt (x ) |xt i + λ1. Taking the expectation on both sides yields E[ft(xt) ft(x )] E h gt, exp 1 xt (x ) i + λ1. From Lemma 35 and 36, E[ft(xt) ft(x )] E h 1 2α(d2(xt, x ) d2(PK(expxt( αtgt)), x )) i 2ζ(κ, d(xt, x ))α gt 2i + λ1. Combining with the condition E[ gt ] G and d2(xt, x ) D, we have E[ft(xt) ft(x )] E h 1 2α(d2(xt, x ) d2(PK(expxt( αgt), x ))) i + 1 2ζ(κ, D)αG2 + λ1 2α(d2(xt, x ) d2 H PK expxt( αtgt) , x i 2ζ(κ, D)αG2 + λ1 + λ2. 2α(d2(xt, x ) d2(xt+1, x )) i + 1 2ζ(κ, D)αG2 + λ1 + λ2. (38) Summing (38) from 1 to T, we have t=1 E[ft(xt) ft(x )] 2α(d2(xt, x ) d2(xt+1, x )) i + 1 2ζ(κ, D)αG2T + λ1T + λ2T 2αd2(x1, x ) i + 1 2ζ(κ, D)αG2T + λ1T + λ2T 2ζ(κ, D)αG2T + λ1T + λ2T. (39) Wang, Tu, Hong, Wu, and Shi Putting α = D G ζ(κ,D)T in (39), we complete our proof. Lemma 43 (Strongly Sub-convex Cases) Suppose that S is a subset of a g-convex set K M with diameter D, and {ft}t=1,2,...,T be a series of µ-strongly λ1-sub g-convex smooth functions. If the sequence {xt}t=1,2,...,T is generated by xt+1 = H PK expxt( αgt) , where the random vector gt satisfies that E[gt|xt] = ft(xt), E[ gt |xt] G,and the operator H satisfies (37). Then with constant ν 1 and αt = ν µt we have that t=1 ft(xt) i min x S t=1 ft(x) 1 2ζ(κ, D)νG2(1 + log T) + λ1T + λ2T. Proof Let x = arg minx S PT t=1 ft(x). From µ-strongly λ1-sub g-convexity, the difference between ft(xt) and ft(x ) is bounded by ft(xt) ft(x ) ft(xt), exp 1 xt (x ) µ 2 d2(xt, x ) + λ1 = E[gt|xt], exp 1 xt (x ) µ 2 d2(xt, x ) + λ1 = E h gt, exp 1 xt (x ) |xt i µ 2 d2(xt, x ) + λ1. Taking the expectation on both sides yields E[ft(xt) ft(x )] E h gt, exp 1 xt (x ) µ 2 d2(xt, x ) i + λ1. From Lemma 35, E[ft(xt) ft(x )] E h 1 2αt (d2(xt, x ) d2(PK(expxt( αtgt))), x )) µ 2 d2(xt, x ) i 2ζ(κ, d(xt, x ))αt gt 2i + λ1. 2αt (d2(xt, x ) d2(xt+1, x )) µ 2 d2(xt, x ) i 2ζ(κ, d(xt, x ))αt gt 2i + λ1 + λ2 Combining with the condition E[ gt ] G and d2(xt, x ) D, we have E[ft(xt) ft(x )] E h 1 2αt (d2(xt, x ) d2(xt+1, x )) µ 2 d2(xt, x ) i 2ζ(κ, D)αt G2 + λ1 + λ2. (40) Online Optimization over Riemannian Manifolds Summing (40) from 1 to T, we have t=1 E[ft(xt) ft(x )] t=1 E h d2(xt, x )( 1 2αt 1 2αt 1 µ 2ζ(κ, D)G2 T X t=1 αt + λ1T + λ2T. (41) Since αt = ν µt and ν 1, 1 2αt 1 2αt 1 µ Putting it into (41), we find t=1 E[ft(xt) ft(x )] 1 2ζ(κ, D)G2ν 1 µt + λ1T + λ2T. 2µζ(κ, D)νG2(1 + log T) + λ1T + λ2T, which completes our proof. Lemma 44 relates the gap between regret of ˆft and the real regret of ft. Lemma 44 Suppose all ft are g-L-Lipschitz. 
The (expected) regret of the R-BAN algorithm satisfies E[Reg(T)] E h T X t=1 (ˆft(yt)) i min x (1 τ)K t=1 ˆft(x) + 3δLT + τDL.T Proof Denote by x τ the minimizer of the problem minx (1 τ)K PT t=1 ft(x). The expectation can be reformulated as follows E[Reg(T)] = t=1 E h ft(xt) ft(x ) i t=1 (ft(xt) ft(yt)) i + E h T X t=1 (ft(yt) ˆft(yt)) i + E h T X t=1 (ˆft(yt) ˆft(x τ)) i t=1 (ˆft(x τ) ft(x τ)) i + E h T X t=1 (ft(x τ) ft(x )) i . The Lipschitz condition leads to ft(xt) ft(yt) δL ft(yt) ˆft(yt) δL ˆft(x τ) ft(x τ) δL, Wang, Tu, Hong, Wu, and Shi which gives us E[Reg(T)] E h T X t=1 (ˆft(yt) ˆft(x τ)) i + E h T X t=1 (ft(x τ) ft(x )) i + 3δLT. (42) Next, we notice that (1 τ)K = {expp((1 τ)u)|u = exp 1 p (x) K}, It is easy to check that min x (1 τ)K t=1 ft(x) = min x K t=1 ft expp((1 τ) exp 1 p (x)) t=1 ft(x) + τDL) τDLT + min x K This forces t=1 (ft(x τ) ft(x )) τDLT. (43) Combining (42) and (43) together, we can conclude E[Reg(T)] E h T X t=1 (ˆft(yt) ˆft(x τ)) i + 3δLT + τDLT. which is the desired result. Lemma 45 guarantees the feasibility of the proposed bandit algorithms. Lemma 45 Let M be a Riemannian manifold with sectional curvature bounded below by a constant κ 0 and above by a constant K 0. Suppose there exists a g-convex set K M, a point p K, and two constants 0 < r D such that Br(p) K BD(p), where D π 2 K if K > 0. Denote θ = s(K,D+r) s(κ,D+r) 1. Then, for every y (1 τ)K, the geodesic ball Bθ τr(y) lies in K. Proof Let x K be arbitrary and y = expp (1 τ) exp 1 p (x) . For any v Ty M with v = 1, let z = expy θτr v . We will show that z K. Denote ξ(s) to be the geodesic with ξ(0) = y and ξ(1) = z. Then we can define a rectangle map Γ : [0, 1/τ] [0, 1] K; (t, s) 7 expx t exp 1 x ξ(s) . Now, we have a vector field v(t, s) = Γ s (s, t) over the rectangle. The vector field v(t, s) is a variation field of the geodesic γs(t) = expx t exp 1 x ξ(s) with v(0, s) = 0. Thus, v(s, t) is an initial zero Jacobi field and we can apply Theorem 30 at t = 1. Since the sectional Online Optimization over Riemannian Manifolds curvature of M is upper bounded by K, we can use the normalization of the geodesic γs(t) to get θτr = v(1, s) s K, d(x, ξ(s)) v(0, s) . (44) Next, we apply Theorem 30 to v(s, t) at t = 1 τ . Since the sectional curvature of M is lower bounded by κ, we can use the normalization of the geodesic γs(t) to get v(1/τ, s) s(κ, d(x, ξ(s)) τ ) v(0, s) (45) Combining (44) and (45), we obtain v(1/τ, s) s(κ, d(x,ξ(s)) τ ) s(K, d(x, ξ(s)))τθr. By concavity of the function sin(x) in [0, π], we can get Kd(x, ξ(s))) τ sin( K d(x, ξ(s)) which derives 1 τ s(K, d(x, ξ(s))) s(K, d(x, ξ(s)) v(1/τ, s) s(κ, d(x,ξ(s)) τ ) s(K, d(x, ξ(s)))τθr s(κ, d(x,ξ(s)) s(K, d(x,ξ(s)) Since d(x, ξ(s)) d(x, y) + d(y, z) τ(D + r), we have v(1/τ, s) s(κ, D + r) s(K, D + r)θr = s(κ, D + r) s(K, D + r) s(K, D + r) s(κ, D + r) r = r. Hence, the length of the curve c(s) = Γ(1/τ, s) is bounded by l(c(s)) = Z 1 0 v(1/τ, s) ds Z 1 Notice that p = Γ(1/τ, 0) and denote w = Γ(1/τ, 1), we have d(p, w) l(c(s)) = r, which means that w Br(p) K. Therefore, for the geodesic γ1(t), we have γ1(0) = x K and γ1(1/τ) = w K. Thus, by g-convexity, we have z = γ1(1) K, which completes our proof. Now we carry out the proofs of Theorems 14 and 15. 
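Before turning to the formal proofs, the abstract update x_{t+1} = H(P_K(exp_{x_t}(−α g_t))) analyzed in Lemmas 42 and 43 can be made concrete. The sketch below instantiates it on the SPD manifold with the affine-invariant metric, taking K to be a geodesic ball B_D(P) around a reference point P and H = P_{(1−τ)K}, so that both projections reduce to clipping the geodesic from the center; this choice of K, the helper names, and the closed-form exponential and logarithm maps are illustrative assumptions, not the exact parameterization of Algorithms 2 and 3.

```python
import numpy as np
from scipy.linalg import expm, logm

def spd_sqrt_and_inv_sqrt(X):
    # Matrix square root and inverse square root via an eigendecomposition (X is SPD).
    w, V = np.linalg.eigh(X)
    return V @ np.diag(np.sqrt(w)) @ V.T, V @ np.diag(1.0 / np.sqrt(w)) @ V.T

def spd_exp(X, U):
    # Exponential map of the affine-invariant metric: exp_X(U) = X^{1/2} expm(X^{-1/2} U X^{-1/2}) X^{1/2}.
    Xh, Xih = spd_sqrt_and_inv_sqrt(X)
    return Xh @ expm(Xih @ U @ Xih) @ Xh

def spd_log(X, Y):
    # Inverse exponential map: log_X(Y) = X^{1/2} logm(X^{-1/2} Y X^{-1/2}) X^{1/2}.
    Xh, Xih = spd_sqrt_and_inv_sqrt(X)
    return Xh @ np.real(logm(Xih @ Y @ Xih)) @ Xh

def spd_dist(X, Y):
    # Geodesic distance d(X, Y) = ||logm(X^{-1/2} Y X^{-1/2})||_F.
    _, Xih = spd_sqrt_and_inv_sqrt(X)
    return np.linalg.norm(np.real(logm(Xih @ Y @ Xih)), "fro")

def project_to_ball(center, X, radius):
    # Metric projection onto the geodesic ball B_radius(center): on a Hadamard manifold
    # this clips the geodesic from the center to X at the prescribed radius.
    d = spd_dist(center, X)
    if d <= radius:
        return X
    return spd_exp(center, (radius / d) * spd_log(center, X))

def projected_step(x, g, alpha, center, D, tau):
    # One update x_{t+1} = H(P_K(exp_x(-alpha * g))) with K = B_D(center) and H = P_{(1-tau)K}.
    y = spd_exp(x, -alpha * g)
    y = project_to_ball(center, y, D)                 # P_K
    return project_to_ball(center, y, (1 - tau) * D)  # H = P_{(1-tau)K}
```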
Wang, Tu, Hong, Wu, and Shi E.1 Proof of Theorem 14 First, we focus on the update rule of yt, that is yt+1 = P(1 τ)K PK expyt(αtgt) = P(1 τ)K PK ζ(κ, D)T gt = P(1 τ)K PK expyt D Sδ Vδ C p ζ(κ, D)T (Sδ gt = ft(xt) exp 1 yt (xt) exp 1 yt (xt) , from Lemma 11 we obtain E h Sδ Vδ gt yt i = ˆf(yt) and E h Sδ Vδ C. Moreover, according to Lemma 13, ˆft is 2δρL-sub g-convex. Also, for all x K and y (1 τ)K, we have d2(P(1 τ)K(x)), y) d2(x, y) 2D d(P(1 τ)K(x)), x) 2D d(expp((1 τ) exp 1 p (x)), x) (2D)(τD) = 2τD2. (46) Thus, the update rule in Algorithm 2 is exactly the expected gradient descent in Lemma 42 with parameters S = (1 τ)K, G = Sδ Vδ C, λ1 = 2δρL and λ2 = 2τD2. We can get t=1 ft(yt) i min x (1 τ)K t=1 ft(x) Sδ ζ(κ, D)T + 2δρLT + 2τD2T. (47) From the inequality Sδ δ + n|κ|δ in Lemma 11, we have t=1 ft(yt) i min x (1 τ)K t=1 ft(x) (n δ + n|κ|δ)DC p ζ(κ, D)T + 2δρLT + 2τD2T. (48) Applying Lemma 44 in (48), we have that E[Reg(T)] (n δ + n|κ|δ)DC p ζ(κ, D)T + 3δLT + τDLT + 2δρLT + 2τD2T. (49) Finally, taking τ = δ rθ and δ = T 1 4 in (49), we get E[Reg(T)] n ζ(κ, D)T + n|κ|δDC p ζ(κ, D)T + 3δLT + τDLT + 2δρLT + 2τD2T ζ(κ, D)T 3 4 + n|κ|DC p ζ(κ, D)T 1 4 + (3L + DL rθ + 2ρL)T 3 4 ζ(κ, D)T 1 4 + n DC p ζ(κ, D) + 3L + DL rθ + 2ρL T 3 4 , which completes our proof. Online Optimization over Riemannian Manifolds E.2 Proof of Theorem 15 As we have carried out in the proof of Theorem 14, we focus on the update rule of yt, that is yt+1 = P(1 τ)K PK = P(1 τ)K PK = P(1 τ)K PK gt = f(xt) exp 1 yt (xt) exp 1 yt (xt) , from Lemma 11 we obtain E h Sδ Vδ gt yt i = ˆf(yt) and E h Sδ Vδ gt i Sδ Vδ C. Additionally, according to Lemma 13, ˆft is µ-strongly (2ρL + 2Dµ)δ-sub g-convex. Thus, the update rule in Algorithm 2 is exactly the expected gradient descent in Lemma 43 with parameters S = (1 τ)K, G = Sδ Vδ C, λ1 = (2ρL + 2Dµ)δ, λ2 = 2τD2 and ν = BVδ Sδ . We can get, t=1 ft(yt) i min x (1 τ)K t=1 ft(x) 1 2µζ(κ, D)(Sδ Vδ )2C2 BVδ Sδ (1 + log T) + (2ρL + 2Dµ)δT + 2τD2T 2µBζ(κ, D)Sδ Vδ C2(1 + log T) + (2ρL + 2Dµ)δT + 2τD2T. From the inequality Sδ δ + n|κ|δ = B in Lemma 11, we have t=1 ft(yt) i min x (1 τ)K t=1 ft(x) 1 δ + n|κ|δ)2ζ(κ, D)C2(1 + log T) + (2ρL + 2Dµ)δT + 2τD2T δ2 + κ2δ2)ζ(κ, D)C2(1 + log T) + (2ρL + 2Dµ)δT + 2τD2T. Applying Lemma 44, we have that δ2 + n2κ2δ2)ζ(κ, D)C2(1 + log T) + (2ρL + 2Dµ)δT + 3δLT + τDLT + 2τD2T. (50) Wang, Tu, Hong, Wu, and Shi Taking τ = δ rθ and δ = 3q n2C2(1+log T) T in (50), we get E[Reg(T)] n 2 3 C 2 3 ζ(κ, D) µ (1 + log T) 1 3 T 2 3 + n2C2δ2κ2ζ(κ, D) µ (1 + log T) + n 2 3 C 2 3 (3L + DL rθ + 2ρL + 2Dµ)(1 + log T) 1 3 T 2 3 n 2 3 C 2 3 ζ(κ, D) µ (1 + log T) 1 3 T 2 3 + n 8 3 C 8 3 Dκ2ζ(κ, D) µ (1 + log T) 4 3 T 1 + n 2 3 C 2 3 (3L + DL rθ + 2ρL + 2Dµ)(1 + log T) 1 3 T 2 3 2n 8 3 C 8 3 Dκ2ζ(κ, D) + n 2 3 C 2 3 ζ(κ, D) µ + 3L + DL rθ + 2Dµ (1 + log T) 1 3 T 2 3 , where the last inequality is due to max T 1 3q 4 2.Then we completes our proof. Appendix F. Proofs of Theorems 18 and 19 In this section, we present proof of the regret bounds of the R-2-BAN algorithm. F.1 Proof of Lemma 17 We first examine (i). By Lemma 38, M is homogeneous. Thus from Lemma 11 we know Vδ Eu Sδ(x) h f(u) exp 1 x (u) exp 1 x (u) Sδ(x) f(u) exp 1 x (u) exp 1 x (u) ωSδ(x). (51) From Lemma 37, we notice that the symmetry sx maps a random variable u M to sx(u) = exp( exp 1 x (u)) = u. Substituting sx(u) with u in (51) yields Sδ(x) f(sx(u)) exp 1 x (sx(u)) exp 1 x (sx(u)) s xωSδ(x) Sδ(x) f( u) exp 1 x (u) exp 1 x (u) s xωSδ(x). Since sx is a isometry, we have s xωSδ(x) = ωSδ(x). 
Therefore, Sδ(x) f( u) exp 1 x (u) exp 1 x (u) ωSδ(x) Vδ Eu Sδ(x) h f( u) exp 1 x (u) exp 1 x (u) Online Optimization over Riemannian Manifolds Combining (51) and (52), we have 2 ˆf(x) = Sδ Vδ Eu Sδ(x) h f(u) exp 1 x (u) exp 1 x (u) Vδ Eu Sδ(x) h f( u) exp 1 x (u) exp 1 x (u) Vδ Eu Sδ(x) h (f(u) f( u)) exp 1 x (u) exp 1 x (u) Vδ Eu Sδ(x) g , which completes the proof of (i). Then we prove (ii). Notice that Eu Sδ(x)[Sδ 2Vδ |f(expx u) f(expx u)| Sδ 2Vδ 2Lδ. (53) It follows from Lemma 11 that δ + n|κ|δ. (54) Putting (53) and (54) together we get Eu Sδ(x)[Sδ δ + n|κ|δ)2Lδ = n L(1 + |κ|δ2), which completes our proof. F.2 Gap between Regrets As in the one-point bandit case, we also conclude the gap between the regret of ˆft and the real regret of ft in Lemma 46 for the two-point bandit case. Lemma 46 Suppose all ft are g-L-Lipschitz. The (expected) regret of the R-BAN algorithm satisfies E[Reg(T)] E h T X t=1 (ˆft(yt)) i min x (1 τ)K t=1 ˆft(x) + 3δLT + τDLT. Proof Denote by x τ the minimizer of the problem minx (1 τ)K PT t=1 ft(x). The expectation can be reformulated as follows E[Reg(T)] = 2(ft(xt,1) + ft(xt,2)) ft(x ) i 2(ft(xt,1) + ft(xt,2)) ft(yt)) i + E h T X t=1 (ft(yt) ˆft(yt)) i t=1 (ˆft(yt) ˆft(x τ)) i + E h T X t=1 (ˆft(x τ) ft(x τ)) i + E h T X t=1 (ft(x τ) ft(x )) i . Wang, Tu, Hong, Wu, and Shi The Lipschitz condition leads to ft(xt,1) ft(yt) δL ft(xt,2) ft(yt) δL ft(yt) ˆft(yt) δL ˆft(x τ) ft(x τ) δL. In addition, Lemma 44 shows that t=1 (ft(x τ) ft(x )) τDLT. Thus, we can conclude E[Reg(T)] E h T X t=1 (ˆft(yt) ˆft(x τ)) i + 3δLT + τDLT. which is the desired result. F.3 Proof of Theorem 18 Denote gt = Sδ 2Vδ ft(expxt ut) ft(expxt( ut)) ut ut . The update rule of yt is yt+1 = P(1 τ)K PK expyt(αt gt) = P(1 τ)K PK expyt( D ζ(κ, D)T gt) = P(1 τ)K PK expyt( D Sδ Vδ δL p Sδ Vδ gt) . 2 ft(xt,1) ft(xt,2) exp 1 yt (xt,1) exp 1 yt (xt,1) , from Lemma 17 we obtain E h Sδ Vδ gt yt i = ˆf(yt) and E h Sδ Vδ gt i Sδ Vδ δL. Additionally, according to Lemma 13, ˆft is 2δρL-sub g-convex. Thus, the update rule in Algorithm 2 is exactly the expected gradient descent in Lemma 42 with parameters S = (1 τ)K, G = Sδ Vδ δL, λ1 = 2δρL and λ2 = 2τD2. We can get t=1 ft(yt) i min x (1 τ)K t=1 ft(x) Sδ ζ(κ, D)T + 2δρLT + 2τD2T. (55) From the inequality Sδ δ + n|κ|δ in Lemma 11, we have t=1 ft(yt) i min x (1 τ)K t=1 ft(x) (n δ + n|κ|δ)DδL p ζ(κ, D)T + 2δρLT + 2τD2T. (56) Online Optimization over Riemannian Manifolds Applying Lemma 44 in (56), we have that E[Reg(T)] (n δ + n|κ|δ)DδL p ζ(κ, D)T + 3δLT + τDLT + 2δρLT + 2τD2T. (57) Finally, taking τ = δ rθ and δ = 1 T in (57), we get E[Reg(T)] n DL p ζ(κ, D)T + n|κ|δ2DL p ζ(κ, D)T + 3δLT + τDLT + 2δρLT + 2τD2T ζ(κ, D)T + n|κ|DL p T + (3L + DL ζ(κ, D) + 3L + DL which completes our proof. F.4 Proof of Theorem 19 As we do in the proof of Theorem 14, we focus on the update rule of yt, that is yt+1 = P(1 τ)K PK expyt αt gt = P(1 τ)K PK = P(1 τ)K PK f(expxt ut) ft(expxt( ut)) ut from Lemma 17 we obtain E h Sδ Vδ gt yt i = ˆf(yt) and E h Sδ Vδ gt i Sδ Vδ δL. Additionally, according to Lemma 13, ˆft is µ-strongly (2ρL + 2Dµ)δ-sub g-convex. Thus, the update rule in Algorithm 2 is exactly the expected gradient descent in Lemma 43 with parameters S = (1 τ)K, G = Sδ Vδ δL, λ1 = (2ρL + 2Dµ)δ, λ2 = 2τD2 and ν = BVδ Sδ . We can get, t=1 ft(yt) i min x (1 τ)K t=1 ft(x) 1 2µζ(κ, D)(Sδ Vδ )2δ2L2 BVδ Sδ (1 + log T) + (2ρL + 2Dµ)δT + 2τD2T 2µBζ(κ, D)Sδ Vδ δ2L2(1 + log T) + (2ρL + 2Dµ)δT + 2τD2T. 
Wang, Tu, Hong, Wu, and Shi From the inequality Sδ δ + n|κ|δ = B in Lemma 11, we have t=1 ft(yt) i min x (1 τ)K t=1 ft(x) 1 2µ(n + n|κ|δ2)2ζ(κ, D)L2(1 + log T) + (2ρL + 2Dµ)δT + 2τD2T µ (1 + κ2δ4)ζ(κ, D)L2(1 + log T) + (2ρL + 2Dµ)δT + 2τD2T. Applying Lemma 44, we have that E[Reg(T)] n2 µ (1 + κ2δ4)ζ(κ, D)L2(1 + log T) + 3δLT + τDLT + (2ρL + 2Dµ)δT+2τD2T. Taking τ = δ rθ and δ = 1+log T E[Reg(T)] n2L2ζ(κ, D) µ (1 + log T) + n2C2κ2ζ(κ, D) µ (1 + log T)5 rθ + 2ρL + 2Dµ)(1 + log T) 3n2C2κ2ζ(κ, D) 2µ + ζ(κ, D)n2L2 + 3L + DL + 2D2 rθ + 2ρL + 2Dµ (1 + log T). The last inequality is due to max T 1 (1+log T)5 3125 1024e 3 2. Then we completes our proof. Appendix G. Proof of Statements in Section 6 G.1 Proof of Lemma 21 Lemma 47 (Walter, 1974) Set K M, then for any y M \ K, exp 1 PK(y)(y), exp 1 PK(y)(x) 0, x K. Lemma 48 Suppose (M, g) is a Riemannian manifold whose sectional curvature is bounded above by K > 0 and x K. Then for any y M, if there exists a geodesic γ lying in the geodesic ball BD(x) for some D π K that connects y and PK(y), then, d2(x, PK(y)) d2(x, y) σ(κ, D)d2(y, PK(y)). Proof Since γ BD(x) B π K (x), we can apply Hessian comparison theorem (Theorem 33) to ρ(p) = d(x, p), which implies KD)Id, p γ. Online Optimization over Riemannian Manifolds Then for ρ2(p) = d2(x, p), we have 2ρ2(p) = ρ 2ρ + ρ ρT ( 0, if D π 2 KD)Id, if π 2 which means that 2ρ2(p) σ(K, D)Id, p γ. By the mean value theorem, there exist a q γ such that ρ2(y) ρ2(PK(y)) = ρ2(PK(y)), exp 1 PK(y)(y) + 2ρ(q)(exp 1 PK(y)(y), exp 1 PK(y)(y)) exp 1 PK(y)(x), exp 1 PK(y)(y) σ(κ, D)d2(y, PK(y)). (58) Lemma (47) yields exp 1 PK(y)(x), exp 1 PK(y)(y) 0. ρ2(y) ρ2(PK(y)) = d2(x, y) d2(x, PK(y)) σ(κ, D)d2(y, PK(y)), which completes our proof. We now carry out our proof of Lemma 21. Proof Since xt+1 = PK( xt+1) K, we have d(xt+1, x ) D < π 2 Also, we have d( xt+1, xt+1) = αtgt D < π 2 which implies there is an unique geodesic γ connecting xt+1 and xt+1 by Lemma 31. For any point s γ, we can find d(x , s) d(x , xt+1) + d(xt+1, s) d(x , xt+1) + d(xt+1, xt+1) So γ is contained in the geodesic ball B2D(x ) B D K (x ), which satisfies the condition in Lemma 48. It give us d2(xt+1, x ) d2( xt+1, x ) 1 2αt σ(κ, 2D)d2(xt+1, xt+1) (61) σ(κ, 2D)d2(xt, xt+1) (62) σ(κ, 2D)αt gt 2. (63) Summing t from 1 to T, we complete our proof. Wang, Tu, Hong, Wu, and Shi G.2 Proof of Lemma 22 The second part is directly from the first part of lemma. So we focus on the first part of the lemma. By the proof of Lemma 13, we have ˆf(y) ˆf(x) ˆf(x), exp 1 x (y) 1 Sδ(x) f(u) V (u), n(u) V (x), m(u) ωSδ(x) Bδ(x) Div(V )f(u)ω . Here we change ( ) with the claim ( ) V (u), n(u) V (x), m(u) 1 2π2ιδ 0, u Sδ(x). ( ) This claim requires many geometric details that deviates our attention from the proof, and we will prove it afterwards. If the claim ( ) holds, we have Z Sδ(x) f(u) V (u), n(u) V (x), m(u) ωSδ(x) Sδ(x) f(u) V (u), n(u) V (x), m(u) 1 2π2ιδ + f(u)1 2π2ιδ ωSδ(x). Then with the g-L-Lipschitz of f and the condition f(x) = 0, we have |f(u)| δL, thus Z Sδ(x) f(u) V (u), n(u) V (x), m(u) ωSδ(x) (65) Sδ(x) δL V (u), n(u) V (x), m(u) 1 2π2ιδ ωSδ(x) (66) Sδ(x) δL V (u), n(u) ωSδ(x) Z Sδ(x) δL V (x), m(u) ωSδ(x) π2ιδ2LSδ. (67) Continuing to analyze the first two terms with the method in the proof of Lemma 13, we have ˆf(y) ˆf(x) ˆf(x), exp 1 x (y) 1 Bδ(x) Div(V )(f(x) + δL)ω Sδ 2δL sup u Bδ(x) |Div(V (u))| (n δ + n|κ |δ)π2ιLδ2 = 2δL sup u Bδ(x) |Div(V (u))| (n + n|κ |δ2)π2ιLδ. 
Again, setting ρ = sup (x,y,u) K K K |Div(V (u))| < , we establish the 2ρδL + (n + n|κ |δ2)π2ιLδ -sub g-convexity. Online Optimization over Riemannian Manifolds For proving (ii), It is sufficient to show 2 d2(u, φ(u))ω µ 2 d2(x, y) 4δD, which is obvious from the fact d2(u, φ(u)) d2(x, y) = (d(u, φ(u)) + d(x, y))(d(u, φ(u)) d(x, y)) 2D 2δ = 4Dδ. G.3 Proof of the Claim ( ) We use the symbol as same as those in the proof of the Claim ( ). Note that the rectangle map Γ is defined by Γu : [0, 1] [0, δ] M (t, s) expξu(s)(t V (ξu(s))). and the length of the s-curve lu(t) is described by lu(t) = Z δ S(t, s), S(t, s) ds. l u(0) = V (u), n(u) V (x), m(u) , 1 S R(T, S, S, T)ds. The following analysis is quite different because the section curvature of M is no longer non positive, but bounded by K > 0. As a result, we have 1 S R(T, S, S, T)ds 1 S K T 2 S 2ds 0 K T 2 S d S 0 S d S. (68) Then we try to estimate the norm S(t, s) in [0, 1] [0, δ]. We separate the Jacobi field S(t, s) = S0(t, s) + S1(t, s). over the geodesic γs(t) for every t [0, 1], where Si(t, s) is also a Jacobi field with condition Si(i, s) = S(i, s) and Si(1 i, s) = 0. We estimate Si(t, s) with Theorem 30. W.l.o.g., we set i = 0. Wang, Tu, Hong, Wu, and Shi Since all sectional curvature of M is bounded below by κ , we have S0(t, s) s(κ , t T(t, s) ) S0(0, s) . (69) Also, since all sectional curvature of M is bounded above by K, we have 1 = S0(t, s) s(K, t T(t, s) ) S0(0, s) . (70) Putting (69) and (70) together, we have S0(t, s) s(κ , t T(t, s) ) s(K, t T(t, s) ). With the condition t T(t, s) T(t, s) D, we have S0(t, s) s(κ , D) s(K, D) = 1 Then we finally get S(t, s) = S0(t, s) + S1(t, s) 1 2ι = ι. (71) Putting (71) into (68), we have l u(t) KD2 Z δ 0 S d S (72) Hence, by the mean value theorem there exists a t (0, 1) such that lu(1) = lu(0) + l u(0) + 1 2l u(t) (75) lu(0) + l u(0) 1 2π2ιδ. (76) With the equality (36), we have 2π2ιδ, (77) which complete our proof of claim ( ). G.4 Proof of Theorem 24 Lemma 49 Suppose that S is subset of a unique g-convex set K M with the diameter D π 2 K , and {ft}t=1,2,...,T is a series of λ1-sub µ-strongly g-convex functions. Let the sequence be generated by xt+1 = H (PK(expxt( αtgt))), (78) Online Optimization over Riemannian Manifolds where the random vector gt satisfies that E[gt|xt] = ft(xt), gt G and the operator H satisfies (37). For constants ν 1 and c0 νG µD, if the step size αt = ν µ(t+c0), then i=1 ft(xt)] min x S i=1 ft(x) D2µc0 2 + λ1T + λ2T 2(ζ(κ, D) + σ(K, 2D))νG2(1 + log(T + c0))). Proof By Lemma 43, we have E[ft(xt) ft(x )] E h gt, exp 1 xt (x ) µ 2 d2(xt, x ) i + λ1. Denote xt+1 = expxt( αtgt). Recalling Lemma 35 in the geodesic triangle xt xt+1x gives that αtgt, exp 1 xt (x ) 1 2(d2(xt, x ) d2( xt+1, x )) + 1 2ζ(κ, d(xt, x)) αtgt 2. (79) Combining (11) and (12), we get E[ft(xt) ft(x )] 1 2αt (d2(xt, x ) d2( xt+1, x )) + 1 2ζ κ, d(xt, x ) αt gt 2 2 d2(xt, x ) + λ1 1 2αt (d2(xt, x ) d2(PK( xt+1), x )) + 1 2αt (d2(PK( xt+1), x ) d2( xt+1, x )) 2ζ κ, d(xt, x ) αt G2 µ 2 d2(xt, x ) + λ1 1 2αt (d2(xt, x ) d2(xt+1, x )) + 1 2αt (d2((PK( xt+1), x ) d2( xt+1, x )) 2ζ κ, d(xt, x ) αt G2 µ 2 d2(xt, x ) + λ1 + λ2. (80) Summing (80) from 1 to T, we obtain i=1 ft(xt)] min x S t=1 (d2(xt, x ) 1 2αt 1 2αt 1 µ 1 2ζ κ, d(xt, x ) G2αt + λ1T + λ2T 1 2αt (d2(PK( xt+1), x ) d2( xt+1, x )) t=1 d2(xt, x )( 1 2αt 1 2αt 1 µ 2ζ(κ, D)G2 T X 1 2αt (d2(PK( xt+1), x ) d2( xt+1, x )) + λ1T + λ2T. 
Wang, Tu, Hong, Wu, and Shi By Lemma 21, we have 1 2αt (d2(PK( xt+1), x ) d2( xt+1, x )) σ(K, 2D) 2σ(K, 2D)G2 T X i=1 ft(xt)] min x S t=1 (d2(xt, x )( 1 2αt 1 2αt 1 µ 2µ(ζ(κ, D) + σ(K, 2D))G2 T X t=1 αt + λ1T + λ2T. Taking αt = ν µ(t+c0), we have i=1 ft(xt)] min x S i=1 ft(x) d2(x1, x )(µ(c0 + 1) 2(ζ(κ, D) + σ(K, 2D))νG2 1 + log(T + c0) + λ1T + λ2T 2 + λ1T + λ2T 2µ(ζ(κ, D) + σ(K, 2D))νG2 1 + log(T + c0) , which proves Lemma 49. We are now ready to prove Theorem 24. Proof (i) R-OGD algorithm: The update rule in the R-OGD algorithm is actually (78) with S = K, G = L, ν = 1 and λ1 = λ2 = 0. By Lemma 49, the regret bound of the R-OGD is Reg(T) = E[ i=1 ft(xt)] min x K 2(ζ(κ, D) + σ(K, 2D))G2 1 + log(T + c0) . which completes the proof of the R-OGD algorithm. (ii) R-BAN algorithm: It follows from Lemma 44 that E[Reg(T)] E h T X t=1 (ˆft(yt) ˆft(x τ)) i + 3δLT + τDLT. (81) Online Optimization over Riemannian Manifolds As shown in the proof of Theorem 14, the update rule of yt is actually (78) with parameter S = (1 τ)K, ν = BVδ Vδ C, λ1 = 2ρδL+2Dµδ+(n+n|κ |δ2)π2ιLδ , and λ2 = 2τD2. By Lemma 49, t=1 (ˆft(yt) ˆft(x τ)) i D2µc0 2µ(ζ(κ, D) + σ(K, 2D))BVδ Vδ )2C2(1 + log(T + c0)) + 2δρLT + 2DµδT + 2τD2T + (n + n|κ |δ2)π2ιLδT 2µ(ζ(κ, D) + σ(K, 2D))B2C2(1 + log(T + c0)) + 2δρLT + 2DµδT + 2τD2T + (n + n|κ |δ2)π2ιLδT 2 + 2τD2T + 2δρLT + 2DµδT + (n + n|κ |δ2)π2ιLδT µ(ζ(κ, D) + σ(K, 2D))(n2 δ2 + n2κ 2δ2)C2(1 + log(T + c0)). Combining (81) and (82), we have E[Reg(T)] D2µc0 µ(ζ(κ, D) + σ(K, 2D))(n2 δ2 + n2κ 2δ2)C2(1 + log(T + c0)) + 2δρLT + 2DµδT + (n + n|κ |δ2)π2ιLδT + 3δLT + τDLT + 2τD2T. Taking τ = δ rθ and δ = 3q n C(1+log(T+c0)) T in (49), we get E[Reg(T)] D2µc0 µ(ζ(κ, D) + σ(K, 2D))n 4 3 C 4 3 (1 + log(T + c0)) 1 3 T 2 3 µ(ζ(κ, D) + σ(K, 2D))C2n2κ 2δ2(1 + log(T + c0)) + 2ρL + 3L + DL rθ + 2Dµ n 1 3 C 1 3 (1 + log(T + c0)) 1 3 T 2 3 + (nπ2ιL + n|κ |δ2π2ιL)n 1 3 C 1 3 (1 + log(T + c0)) 1 3 T 2 3 µ(ζ(κ, D) + σ(K, 2D))n 4 3 C 4 3 (1 + log(T + c0)) 1 3 T 2 3 µ(ζ(κ, D) + σ(K, 2D))C 8 3 n 8 3 κ 2D(1 + log(T + c0)) 4 3 T 1 + 2ρL + 3L + DL rθ + 2Dµ n 1 3 C 1 3 (1 + log(T + c0)) 1 3 T 2 3 + (nπ2ιL + n|κ |δ2π2ιL)n 1 3 C 1 3 (1 + log(T + c0)) 1 3 T 2 3 . Wang, Tu, Hong, Wu, and Shi Then, we have, E[Reg(T)] D2µc0 2 + 4(c0 + 1) µ (ζ(κ, D) + σ(K, 2D))C 8 3 n 8 3 κ 2D µ(ζ(κ, D) + σ(K, 2D))n 4 3 C 4 3 (1 + log(T + c0)) 1 3 T 2 3 + 2ρL + 3L + DL rθ + 2Dµ n 1 3 C 1 3 (1 + log(T + c0)) 1 3 T 2 3 + (n 4 3 C 1 3 π2ιL + 2n 4 3 C 4 3 |κ |D2π2ι)(1 + log(T + c0)) 1 3 T 2 3 . Finally, we have E[Reg(T)] = D2µc0 2 + 4(c0 + 1) µ (ζ(κ, D) + σ(K, 2D))C 8 3 n 8 3 κ 2D µ(ζ(κ, D) + σ(K, 2D)) + |κ |D2π2ι n 4 3 C 4 3 + + nπ2ι + 2ρL + 3L + DL rθ + 2Dµ n 1 3 C 1 3 L (1 + log(T + c0)) 1 3 T 2 3 . The last inequality is due to δ < D, δL < 2C and (1 + log(T + c0))4 T = max T 1 (1 + log(T + c0))4 T 4(c0 + 1). Then we completes the proof of the R-BAN algorithm. (iii) R-2-BAN algorithm: Notice that the update rule of yt is (78) with parameter S = (1 τ)K, ν = BVδ Sδ , G = Sδ Vδ δL, λ1 = 2ρδL + 2Dµδ + (n + n|κ |δ2)π2ιLδ , and λ2 = 2τD2. Hence, E[Reg(T)] E h T X t=1 (ˆft(yt) ˆft(x τ)) i + 3δLT + τDLT 2µ ζ(κ, D) + σ(K, 2D) B2δ2L2(1 + log(T + c0)) + 3δLT + τDLT + 2δρLT + 2DµδT + 2τD2T + ((n + n|κ |δ2)π2ιLδ)T, µ ζ(κ, D) + σ(K, 2D) (n2 + n2κ 2δ4)L2(1 + log(T + c0)) + 3δLT + τDLT + 2δρLT + 2DµδT + 2τD2T + ((n + n|κ |δ2)π2ιLδ)T. 
Online Optimization over Riemannian Manifolds Taking τ = δ rθ and δ = 1+log(T+c0) E[Reg(T)] D2µc0 2 + n|κ |π2ιL(1 + log(T + c0))3 µ(ζ(κ, D) + σ(K, 2D) n2κ 2L2 (1 + log(T + c0))5 µ(ζ(κ, D) + σ(K, 2D))n2L2 + 3L + 2D2 + DL rθ + 2ρL + 2Dµ (1 + log(T + c0)) µ(ζ(κ, D) + σ(K, 2D) 3(c0 + 1)n2κ 2L2 2 + 3(c0 + 1)n|κ |π2ιL µ(ζ(κ, D) + σ(K, 2D))n2L2 + 3L + DL + 2D2 rθ + 2ρL + 2Dµ (1 + log(T + c0)). (83) The last inequality is due to max T 1 (1+log(T+c0))5 T 4 and max T 1 (1+log(T+c0))3 T 2 are less than 3 2(c0 + 1). Then we complete our proof. Jacob Abernethy, Peter Bartlett, Alexander Rakhlin, and Ambuj Tewari. Optimal strategies and minimax lower bounds for online convex game. In Proceedings of the 21st Annual Conference on Learning Theory, pages 415 423, 2008. Pierre-Antoine Absil, Robert Mahony, and Rodolphe Sepulchre. Optimization Algorithms on Matrix Manifolds. Princeton University Press, 2009. Alekh Agarwal, Ofer Dekel, and Lin Xiao. Optimal algorithms for online convex optimization with multi-point bandit feedback. In Proceedings of the 23rd Annual Conference on Learning Theory, pages 28 40, 2010. Shmuel Agmon. The relaxation method for linear inequalities. Canadian Journal of Mathematics, 6:382 392, 1954. Kwangjun Ahn and Suvrit Sra. From Nesterov s estimate sequence to Riemannian acceleration. In Proceedings of the Annual 33rd Conference on Learning Theory, pages 84 118, 2020. Foivos Alimisis, Antonio Orvieto, Gary B ecigneul, and Aurelien Lucchi. Momentum improves optimization on Riemannian manifolds. In Proceedings of the 24th International Conference on Artificial Intelligence and Statistic, pages 1351 1359, 2021. Zeyuan Allen-Zhu, Ankit Garg, Yuanzhi Li, Rafael Oliveira, and Avi Wigderson. Operator scaling via geodesically convex optimization, invariant theory and polynomial identity Wang, Tu, Hong, Wu, and Shi testing. In Proceedings of the 50th Annual ACM Symposium on Theory of Computing, pages 172 181, 2018. Kimon Antonakopoulos, Elena V. Belmega, and Panayotis Mertikopoulos. Online and stochastic optimization beyond Lipschitz continuity: a Riemannian approach. In Proceedings of the 8th International Conference on Learning Representations, 2020. URL https://openreview.net/forum?id=rkx Zya Ntw B. Yuichiro Anzai. Pattern Recognition and Machine Learning. Elsevier, 2012. S ebastien Arnold, Pierre-Antoine Manzagol, Reza B. Harikandeh, Ioannis Mitliagkas, and Nicolas Le Roux. Reducing the variance in online optimization by transporting past gradients. In Proceedings of the 32nd Advances in Neural Information Processing Systems, pages 5392 5403, 2019. Miroslav Bac ak. Convex Analysis and Optimization in Hadamard Spaces. Walter de Gruyter Gmb H & Co KG, 2014. Werner Ballmann. Lectures on Spaces of Nonpositive Curvature. Birkh auser, 2012. Gary Becigneul and Octavian-Eugen Ganea. Riemannian adaptive optimization methods. In Proceedings of the 7th International Conference on Learning Representations, 2019. URL https://openreview.net/forum?id=r1eiqi09K7. Valerii Berestovskii and Yurii Nikonorov. Riemannian Manifolds and Homogeneous Geodesics. Springer, 2020. Marcel Berger. Geometry I. Springer Science & Business Media, 2009. Silvere Bonnabel. Stochastic gradient descent on Riemannian manifolds. IEEE Transactions on Automatic Control, 58(9):2217 2229, 2013. Nicolas Boumal and Pierre-Antoine Absil. Low-rank matrix completion via preconditioned optimization on the Grassmann manifold. Linear Algebra and its Applications, 475:200 239, 2015. Michael M. 
Bronstein, Joan Bruna, Taco Cohen, and Petar Veliˇckovi c. Geometric deep learning: grids, groups, graphs, geodesics, and gauges. ar Xiv preprint ar Xiv:2104.13478, 2021. Daniel Brooks, Olivier Schwander, Frederic Barbaresco, Jean-Yves Schneider, and Matthieu Cord. Riemannian batch normalization for SPD neural networks. In Proceedings of the 32nd Advances in Neural Information Processing Systems, 2019. Manfredo Perdigao do Carmo. Riemannian Geometry. Birkh auser, 1992. Nicolo Cesa-Bianchi and G abor Lugosi. Prediction, Learning, and Games. Cambridge University Press, 2006. Online Optimization over Riemannian Manifolds Guang Cheng, Hesamoddin Salehian, and Baba C. Vemuri. Efficient recursive algorithms for computing the mean diffusion tensor and applications to DTI segmentation. In Proceedings of the 12th European Conference on Computer Vision, pages 390 401, 2012. Shiing-Shen Chern, Weihuan Chen, and Kai Shue Lam. Lectures on Differential Geometry. World Scientific, 1999. Charlan Dellon da Silva Alves, Paulo Roberto Oliveira, and Ronaldo Malheiros Greg orio. lα Riemannian weighted centers of mass applied to compose an image filter to diffusion tensor imaging. Applied Mathematics and Computation, 390:125603, 2021. Jiashi Feng, Huan Xu, and Shuicheng Yan. Online robust PCA via stochastic optimization. In Proceedings of the 27th Advances in Neural Information Processing Systems, pages 404 412, 2013. Abraham D. Flaxman, Adam Tauman Kalai, and Brendan Mc Mahan. Online convex optimization in the bandit setting: gradient descent without a gradient. In Proceedings of the 16th Annual ACM-SIAM Symposium on Discrete Algorithms, pages 385 394, 2005. Simone Fiori. Manifold calculus in system theory and control fundamentals and first-order systems. Symmetry, 13(11):2092, 2021. Simone Fiori. Quasi-geodesic neural learning algorithms over the orthogonal group: a tutorial. Journal of Machine Learning Research, 6(26):743 781, 2005. Ankit Garg, Leonid Gurvits, Rafael Oliveira, and Avi Wigderson. Algorithmic and optimization aspects of Brascamp-Lieb inequalities, via operator scaling. Geometric and Functional Analysis, 28(1):100 145, 2018. Mohammad Ghomi and Joel Spruck. Total curvature and the isoperimetric inequality in cartan-hadamard manifolds. The Journal of Geometric Analysis, 32(2):50, 2022. Elad Hazan. Introduction to online convex optimization. MIT Press, 2022. Elad Hazan, Adam Kalai, Satyen Kale, and Amit Agarwal. Logarithmic regret algorithms for online convex optimization. In Proceedings of the 19th Annual Conference on Learning Theory, pages 499 513, 2006. Jiang Hu, Xin Liu, Zai-Wen Wen, and Ya-Xiang Yuan. A brief introduction to manifold optimization. Journal of the Operations Research Society of China, 8(2):199 248, 2020. Wen Huang, Pierre-Antoine Absil, and Kyle A. Gallivan. A Riemannian symmetric rank-one trust-region method. Mathematical Programming, 150(2):179 216, 2015. Sergey K. Ivanov, Anatoly M. Kamchatnov, Thibault Congy, and Nicolas Pavloff. Solution of the Riemann problem for polarization waves in a two-component Bose-Einstein condensate. Physical Review E, 96(6):062202, 2017. Alkis Koudounas and Simone Fiori. Gradient-based learning methods extended to smooth manifolds applied to automated clustering. Journal of Artificial Intelligence Research, 68:777 816, 2020. Wang, Tu, Hong, Wu, and Shi Jerome Lapuyade-Lahorgue and Frederic Barbaresco. Radar detection using Siegel distance between autoregressive processes, application to HF and X-band radar. 
In Proceedings of 2008 IEEE Radar Conference, pages 1 6, 2008. John M. Lee. Introduction to Riemannian Manifolds. Springer, 2018. Kuang-Chih Lee and David Kriegman. Online learning of probabilistic appearance manifolds for video-based recognition and tracking. In Proceedings of 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, volume 1, pages 852 859, 2005. Jinlong Lei, Peng Yi, Yiguang Hong, Jie Chen, and Guodong Shi. Online convex optimization over Erdos-Renyi random networks. In Proceedings of the 33rd Advances in Neural Information Processing Systems, pages 15591 15601, 2020. Alejandro I. Maass, Chris Manzie, Dragan Nesic, Jonathan H. Manton, and Iman Shames. Tracking and regret bounds for online zeroth-order Euclidean and Riemannian optimization. SIAM Journal on Optimization, 32(2):445 469, 2022. Jiazhong Nie, Wojciech Kot lowski, and Manfred K. Warmuth. Online PCA with optimal regret. Journal of Machine Learning Research, 17(173):1 49, 2016. Katsumi Nomizu. On local and global existence of Killing vector fields. Annals of Mathematics, 72(1):105 120, 1960. Yogesh Rathi, Allen Tannenbaum, and Oleg Michailovich. Segmenting images on the tensor manifold. In Proceedings of 2007 IEEE Conference on Computer Vision and Pattern Recognition, pages 1 8, 2007. Wolfgang Ring and Benedikt Wirth. Optimization methods on Riemannian manifolds and their application to shape space. SIAM Journal on Optimization, 22(2):596 627, 2012. Niranjan Srinivas, Andreas Krause, Sham Kakade, and Matthias Seeger. Gaussian process optimization in the bandit setting: no regret and experimental design. In Proceedings of the 27th International Conference on Machine Learning, page 1015 1022, 2010. Hao Tang, Zhiao Huang, Jiayuan Gu, Bao-Liang Lu, and Hao Su. Towards scale-invariant graph-related problem solving by iterative homogeneous GNNs. In Proceedings of the 33rd Advances in Neural Information Processing Systems, pages 15811 15822, 2020. James Townsend, Niklas Koep, and Sebastian Weichwald. Pymanopt: a python toolbox for optimization on manifolds using automatic differentiation. Journal of Machine Learning Research, 17(137):1 5, 2016. Quinten Tupker, Salem Said, and Cyrus Mostajeran. Online learning of Riemannian hidden Markov models in homogeneous hadamard spaces. In Proceedings of the 5th International Conference on Geometric Science of Information, pages 37 44, 2021. Rolf Walter. On the metric projection onto convex sets in Riemannian spaces. Archiv der Mathematik, 25(1):91 98, 1974. Online Optimization over Riemannian Manifolds Xi Wang, Zhipeng Tu, Yiguang Hong, Yingyi Wu, and Guodong Shi. No-regret online learning over Riemannian manifolds. In Proceedings of the 34th Advances in Neural Information Processing Systems, pages 28323 28335, 2021. Shing-Tung Yau. Non-existence of continuous convex functions on certain Riemannian manifolds. Mathematische Annalen, 207(4):269 270, 1974. Hongyi Zhang and Suvrit Sra. First-order methods for geodesically convex optimization. In Proceedings of the 29th Annual Conference on Learning Theory, pages 1617 1638, 2016. Hongyi Zhang and Suvrit Sra. An estimate sequence for geodesically convex optimization. In Proceedings of the 31st Annual Conference on Learning Theory, pages 1703 1723, 2018. Jingzhao Zhang, Hongyi Zhang, and Suvrit Sra. R-spider: A fast Riemannian stochastic optimization algorithm with curvature independent rate. ar Xiv preprint ar Xiv:1811.04194, 2018. Martin Zinkevich. 
Online convex programming and generalized infinitesimal gradient ascent. In Proceedings of the 20th International Conference on Machine Learning, pages 928-935, 2003.