# No-regret Online Learning over Riemannian Manifolds

Xi Wang (AMSS, Chinese Academy of Sciences, Beijing, China; wangxi14@mails.ucas.ac.cn), Zhipeng Tu (AMSS, Chinese Academy of Sciences, Beijing, China; tuzhipeng@amss.ac.cn), Yiguang Hong (Tongji University, Shanghai, China; yghong@iss.ac.cn), Yingyi Wu (University of Chinese Academy of Sciences, Beijing, China; wuyy@ucas.ac.cn), Guodong Shi (The University of Sydney, NSW, Australia; guodong.shi@sydney.edu.au)

Corresponding author: Y. Hong (+86-10-82541888). 35th Conference on Neural Information Processing Systems (NeurIPS 2021).

We consider online optimization over Riemannian manifolds, where a learner attempts to minimize a sequence of time-varying loss functions defined on Riemannian manifolds. Though many Euclidean online convex optimization algorithms have been proven useful in a wide range of areas, less attention has been paid to their Riemannian counterparts. In this paper, we study Riemannian online gradient descent (R-OGD) on Hadamard manifolds for both geodesically convex and strongly geodesically convex loss functions, and a Riemannian bandit algorithm (R-BAN) on Hadamard homogeneous manifolds for geodesically convex functions. We establish upper bounds on the regrets of the problem with respect to the time horizon, the manifold curvature, and the manifold dimension. We also find a universal lower bound for the achievable regret by constructing an online convex optimization problem on Hadamard manifolds. All the obtained regret bounds match the corresponding results in Euclidean spaces. Finally, some numerical experiments validate our theoretical results.

1 Introduction

Online optimization has been widely studied in the past decades, with applications in online routing, spam filtering, and machine learning [4, 23, 8]. Without prior knowledge of the loss functions, an online convex optimization algorithm predicts solutions before the loss function is revealed. In this paper, we consider the following Riemannian online convex optimization (R-OCO) problem,

min_{x_t ∈ K ⊆ M} f_t(x_t),  t = 1, 2, . . . , T,   (1)

where M is a complete Riemannian manifold equipped with a Riemannian metric g and K is a geodesically convex (g-convex) subset of M. Here, {f_t}_{t=1,2,...,T} is a sequence of unknown loss functions and every f_t is a geodesically convex (g-convex) function with sufficient smoothness. The R-OCO problem (1) extends online convex optimization in Euclidean spaces, with potential applications in machine learning such as online principal component analysis (PCA), dictionary learning, and neural networks [28, 19, 25].

The R-OCO problem (1) can be understood as a learning process of T rounds. At each round t = 1, 2, 3, . . . , T, an online learner chooses a strategy x_t from the g-convex subset K. Later or simultaneously, the adversary (or nature) produces a g-convex loss function f_t : K → R of which the learner has no prior knowledge. Finally, the learner receives the feedback and suffers the loss f_t(x_t). Generally, there are two types of information feedback. One is full information feedback, where the entire function f_t is revealed to the learner; the other is bandit feedback, where only the value f_t(x_t) is revealed. The goal of the R-OCO is to minimize the regret, defined as

R(T) = Σ_{t=1}^T f_t(x_t) − min_{x ∈ K} Σ_{t=1}^T f_t(x),

which measures the difference between the cost incurred by {x_t}_{t=1,...,T} and the cost of the best fixed point chosen in hindsight.
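To make the T-round protocol and the regret bookkeeping concrete, here is a minimal Python sketch of the full-information interaction; the learner, adversary, and loss objects are hypothetical placeholders of our own (the concrete algorithms are given in Sections 3 and 4), so this is a sketch of the protocol rather than of any particular method.

```python
# Minimal sketch of the R-OCO protocol and the regret bookkeeping. `learner` and
# `adversary` are hypothetical placeholders: any learner exposing play()/update(f)
# fits the full-information setting described above.

def run_online_rounds(learner, adversary, T):
    """Simulate T rounds; return the cumulative loss and the revealed loss functions."""
    cumulative_loss, revealed = 0.0, []
    for t in range(1, T + 1):
        x_t = learner.play()            # learner commits to a point x_t in K
        f_t = adversary.reveal(t)       # adversary reveals the g-convex loss f_t : K -> R
        cumulative_loss += f_t(x_t)     # learner suffers the loss f_t(x_t)
        learner.update(f_t)             # full-information feedback: the whole f_t is observed
        revealed.append(f_t)
    return cumulative_loss, revealed

def regret(cumulative_loss, revealed, best_fixed_point):
    """R(T) = sum_t f_t(x_t) - min_{x in K} sum_t f_t(x), with the minimizer supplied."""
    return cumulative_loss - sum(f(best_fixed_point) for f in revealed)
```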
An algorithm is called no-regret [32] if its regret grows sublinearly with the time horizon T.

For carrying out optimization on a manifold, some classical methods treat the manifold as a subset of an ambient Euclidean space and employ Euclidean constrained optimization techniques. For instance, [30] presented an algorithm for the online PCA problem, where the variables were updated in an ambient Euclidean space and then projected onto a manifold. However, in practical applications, the dimension of the ambient Euclidean space can be too high (e.g., the Grassmann manifold [13]), and the projection can be expensive to compute (e.g., the manifold of symmetric positive definite (SPD) matrices [37]). An alternative approach, termed Riemannian optimization, makes use of the intrinsic geometry of manifolds so that one can optimize directly on the manifold as an unconstrained problem, thus avoiding high-dimensional embeddings and the high computational cost of projections. Furthermore, this viewpoint benefits from g-convexity, by which a nonconvex optimization problem can be converted into a g-convex one [6]. Consequently, it is important to take a Riemannian approach to our problem (1). Although there are many existing algorithms for offline manifold optimization problems [2, 31, 5], very few results have been obtained for the Riemannian online optimization problem. [35] proposed an online algorithm for estimating hidden Markov chains on Hadamard homogeneous spaces, and [9] analyzed Riemannian adaptive methods on product manifolds in the regret sense. More recently, [29] studied a zeroth-order online optimization problem on Hadamard manifolds under a sublinearity assumption on the minimizer variation.

Contribution This paper aims to design no-regret algorithms for the R-OCO problem under both full information feedback and bandit feedback. The contribution of this paper is summarized as follows:
- We propose a Riemannian online gradient descent algorithm (R-OGD) for the R-OCO problem with full information feedback, and then establish upper regret bounds of the R-OGD algorithm for g-convex and strongly g-convex functions.
- We introduce a Riemannian bandit algorithm (R-BAN) for the R-OCO problem with bandit feedback and then establish an upper regret bound for g-convex functions. Moreover, we develop a key technique to analyze the derivative of a local integration on homogeneous manifolds, which can be applied to estimate gradients in Riemannian optimization and beyond.
- We focus on the worst-case regret and present a universal lower regret bound for R-OCO algorithms with g-convex losses on Hadamard manifolds, which matches the upper bound achieved by the R-OGD algorithm for g-convex functions.

The established lower and upper bounds on the achievable regret of R-OCO match their counterparts for Euclidean online convex optimization, e.g., [38, 24, 20, 1]. We briefly list our results in Table 1.

Related Work Euclidean online convex optimization was introduced in [38]. Inspired by the gradient descent method, [38] proposed the online gradient descent algorithm (OGD), whose upper regret bound was proven to be O(√T). Then [24] proceeded with the study of the OGD algorithm and established an upper regret bound of O(log T) for strongly convex functions. In addition, [1] gave a universal lower bound of Ω(√T) for online algorithms, which indicated that the bounds in [38] and [24] are essentially optimal. In the bandit setting, [20] provided a detailed exposition of a one-point bandit algorithm.
Table 1: Comparison of regrets between our work and corresponding results in Euclidean spaces. T: the time horizon; n: the dimension of the manifold; ζ and Λ: constants depending on the sectional curvature bound κ, the dimension n, and the domain K.

              Full information,    Full information,      Bandit                          Universal
              g-convex             strongly g-convex      feedback                        lower bound
Our Work      O(ζ^{1/2} √T)        O(ζ log T)             O(n^{1/2} ζ^{1/4} Λ T^{3/4})    Ω(√T)
Euclidean     O(√T) [38]           O(log T) [24]          O(n^{1/2} T^{3/4}) [20]         Ω(√T) [1]

By modifying the gradient in the OGD algorithm to a randomized estimator, [20] attained upper regret bounds of O(T^{3/4}) and O(T^{2/3}) for convex and strongly convex loss functions, respectively. By extending the one-point bandit algorithm, [3] developed a multi-point bandit algorithm and presented upper regret bounds of O(√T) and O(log T) for convex and strongly convex loss functions. The Riemannian online algorithms proposed in this paper, in the full information feedback and the one-point bandit feedback settings, are extensions of the algorithms in [38, 24, 20, 1] to Riemannian manifolds.

Riemannian optimization has drawn much research attention in the past decades. Many basic algorithms in Euclidean spaces, such as the gradient descent method, Newton's method, and trust-region methods, have been adapted to the Riemannian setting [2, 5, 31]. Some research on Riemannian stochastic optimization (R-SO) was intended to deal with time-varying optimization problems [12, 36, 37, 35]. Among them, [36] provided the first global complexity analysis for the R-SGD algorithm on geodesically convex problems over Hadamard manifolds, and [35] proposed an online algorithm to deal with hidden Markov chains on Hadamard homogeneous spaces. When performed in batch, R-SO methods minimize the average regret in the case of knowing the prior distribution of the loss functions. In this sense, R-SO can be viewed as a kind of R-OCO problem, whereas R-OCO algorithms can handle settings without such prior knowledge. The results on the R-OCO problem are quite limited. [7] proposed regularized online optimization methods via a Riemannian Lipschitz continuity condition, which focused on convex functions from an ambient Euclidean space. In the full information setting, [9] constructed regret upper bounds of Riemannian adaptive methods for g-convex functions, which required a product manifold structure. When the form of the loss function is not available, [29] proposed a zeroth-order online algorithm on Hadamard manifolds and analyzed the tracking error in an asymptotic sense as well as regret bounds, by assuming sublinearity of the term V_T, which is the summation of the distances between the minimizers of f_t and f_{t+1}. In contrast, the regret bounds established for our online gradient-based/bandit Riemannian optimization algorithms are sublinear for any time horizon, matching those for Euclidean online optimization.

2 Preliminaries

Riemannian manifolds A manifold M is a topological space locally diffeomorphic to Euclidean spaces. The tangent space T_x M is a linearization of the manifold M at the point x. A vector field X is a map assigning to every point x ∈ M a tangent vector X(x) ∈ T_x M; vector fields can also be viewed as differential operators on smooth functions on M, i.e., the operation X(f) defines a function X(f)(x) = lim_{t→0} (1/t)(f(ξ(t)) − f(x)) on M, where ξ is a curve that starts at x with tangent vector X(x). A Riemannian manifold is a smooth manifold M equipped with a metric tensor g (also called a Riemannian metric), which defines an inner product ⟨·, ·⟩_x on every tangent space T_x M, x ∈ M.
The Riemannian metric g induces a distance structure on M. A curve is a geodesic if it locally minimizes length; geodesics are the analogs of straight lines in Euclidean spaces. On Riemannian manifolds, a geodesic is uniquely determined by its starting point and initial tangent vector. Accordingly, the exponential map exp_x : T_x M → M is defined by mapping a vector X ∈ T_x M to γ(1) ∈ M, where γ is the geodesic with γ(0) = x and γ'(0) = X. Curvature reflects the geometry of manifolds. We focus on the sectional curvature, which is the Gauss curvature of a two-dimensional submanifold. Following [36], we consider Hadamard manifolds, i.e., simply connected and complete manifolds with nonpositive sectional curvature. The Cartan-Hadamard theorem [11] shows that the exponential map exp_x(·) is a diffeomorphism from the tangent space T_x M ≅ R^n to the manifold M. Therefore, the exponential map has a global inverse exp_x^{-1}(·) on Hadamard manifolds, and the distance can be expressed as d(x, y) = ‖exp_x^{-1}(y)‖_x.

Isometries of Riemannian manifolds have been widely studied in differential geometry [11, 10]. An isometry φ : M → M is a diffeomorphism preserving distance, i.e., d(x, y) = d(φ(x), φ(y)) for all x, y ∈ M. It is remarked that all isometries of a Riemannian manifold form a Lie group G. A Riemannian manifold is a homogeneous manifold if the group of isometries G acts on M transitively, i.e., for any points x, y ∈ M there exists an isometry φ such that φ(x) = y. Some properties of homogeneous manifolds are used in our work, for example, the Killing field, which represents the infinitesimal symmetry of isometries. We leave those properties to the appendix due to the space limitation.

Function Classes A set K is called geodesically convex (g-convex) if, for any points x, y ∈ K, there is a geodesic γ ⊆ K connecting x and y. A function f : K → R is called geodesically convex (g-convex) if for any geodesic γ : [0, 1] → M,

f(γ(t)) ≤ (1 − t) f(γ(0)) + t f(γ(1)),  t ∈ [0, 1].

The g-convexity has some equivalent characterizations. When f is differentiable, which means that there exists a gradient ∇f(x) such that ⟨∇f(x), X⟩ = X(f)(x) for any vector field X, g-convexity is equivalent to

f(y) ≥ f(x) + ⟨∇f(x), exp_x^{-1}(y)⟩,  ∀x, y ∈ M.

Furthermore, if f is twice differentiable, g-convexity is equivalent to

Hess(f)(x)(X, X) = X(X(f))(x) − (∇_X X)f(x) ≥ 0,  ∀x ∈ M, X ∈ T_x M,

where ∇ is the Levi-Civita connection of M, an analog of the differential of vector fields in Euclidean spaces (see [17]). A differentiable function f : M → R is geodesically µ-strongly convex (or µ-strongly g-convex) if there exists a constant µ > 0 such that for any x, y ∈ M,

f(y) ≥ f(x) + ⟨∇f(x), exp_x^{-1}(y)⟩ + (µ/2) d²(x, y).

We call a function geodesically L-Lipschitz (or g-L-Lipschitz) if there exists a constant L > 0 such that

|f(y) − f(x)| ≤ L d(x, y),  ∀x, y ∈ M.

When f is differentiable, g-L-Lipschitzness is equivalent to ‖∇f(x)‖ ≤ L for all x ∈ M.
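As a concrete instance of these objects, the sketch below implements the distance, the exponential map, and its inverse on the hyperboloid model of the hyperbolic space H^n, one of the Hadamard manifolds used later in the experiments; the function names and the Lorentz-product convention are our own illustrative choices, not notation from the paper.

```python
import numpy as np

# Hyperboloid model of H^n: points x in R^{n+1} with <x, x>_L = -1 and x[-1] > 0,
# where <u, v>_L = sum_{i < n} u_i v_i - u_n v_n is the Minkowski (Lorentz) product.

def minkowski(u, v):
    return np.dot(u[:-1], v[:-1]) - u[-1] * v[-1]

def dist(x, y):
    """Riemannian distance d(x, y) = arccosh(-<x, y>_L)."""
    return np.arccosh(max(-minkowski(x, y), 1.0))

def exp_map(x, v):
    """exp_x(v) for a tangent vector v at x (i.e., <x, v>_L = 0)."""
    nv = np.sqrt(max(minkowski(v, v), 0.0))
    if nv < 1e-12:
        return x
    return np.cosh(nv) * x + np.sinh(nv) * (v / nv)

def log_map(x, y):
    """exp_x^{-1}(y): the tangent vector at x pointing to y with norm d(x, y)."""
    d = dist(x, y)
    if d < 1e-12:
        return np.zeros_like(x)
    w = y + minkowski(x, y) * x        # projection of y onto the tangent space at x
    return d * w / np.sqrt(minkowski(w, w))

# quick sanity check: exp_x(exp_x^{-1}(y)) should recover y
x = np.array([0.0, 0.0, 1.0])                    # base point of H^2
y = exp_map(x, np.array([0.3, -0.2, 0.0]))       # another point on the hyperboloid
assert np.allclose(exp_map(x, log_map(x, y)), y, atol=1e-8)
```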
3 Riemannian Online Convex Optimization with Full Information Feedback

This section is devoted to the study of the R-OCO problem with full information feedback. We first propose our R-OGD algorithm and then analyze upper regret bounds of R-OGD for both g-convex and strongly g-convex loss functions. In addition, a universal lower regret bound is presented to illustrate that the regret bound of the R-OGD algorithm is tight up to a constant in the g-convex case.

3.1 Riemannian Online Gradient Algorithm

In the full information setting, we consider the following assumptions, which have been used in the literature on Euclidean online convex optimization and Riemannian optimization (e.g., [38, 36, 5]).

Assumption 1. There exists x* ∈ K such that x* = arg min_{x ∈ K} Σ_{t=1}^T f_t(x).

Assumption 2. (M, g) is a Hadamard manifold with the sectional curvature lower bounded by a constant κ (κ ≤ 0).

Remark 1. The Hadamard manifold plays an important role in Riemannian geometry [22]. Some well-known spaces, such as the Euclidean space R^n, the hyperbolic space H^n, and the manifold of SPD matrices, are all Hadamard manifolds [36, 5].

Assumption 3. The g-convex set K is a bounded set with diameter D, i.e., d(x, y) ≤ D for all x, y ∈ K.

Assumption 4. For all t = 1, . . . , T, f_t is differentiable and g-L-Lipschitz.

We now propose our Riemannian online gradient descent algorithm (R-OGD) in Algorithm 1, where the exponential map replaces the vector addition used in Euclidean online gradient descent [38]; a minimal implementation sketch is given at the end of Section 3.2.

Algorithm 1: Riemannian Online Gradient Descent Algorithm (R-OGD)
Input: Manifold M, time horizon T, step sizes {α_t}
Output: {x_t}_{t=1,...,T}
for t = 1 to T do
  Play x_t and observe the function f_t;
  Update x_{t+1} = P_K(exp_{x_t}(−α_t ∇f_t(x_t))), where P_K is the Riemannian projection onto K, that is, P_K(x) := arg min_{y ∈ K} d(x, y);
  Return x_{t+1} and suffer the loss f_t(x_t);
end

3.2 Regret Upper Bounds

In Theorems 1 and 2, we present upper bounds on the regret of the R-OGD algorithm for g-convex and strongly g-convex functions, respectively. Take ζ(κ, d) = √|κ| d / tanh(√|κ| d) (understood as 1 when κ = 0). By direct observation, ζ is an increasing function of |κ| and d.

Theorem 1 (Convex Case). Suppose that Assumptions 1-4 hold, and f_t is g-convex for any t = 1, . . . , T. Then the R-OGD algorithm with step sizes α_t = D/(L√(ζ(κ, D) t)) guarantees a regret bound of R(T) = O(DL√(ζ(κ, D) T)) for all T ≥ 1.

Theorem 2 (Strongly-convex Case). Suppose that Assumptions 1-4 hold, and f_t is µ-strongly g-convex for any t = 1, . . . , T. Then the R-OGD algorithm with step sizes α_t = 1/(µt) guarantees the following regret bound for all T ≥ 1,

R(T) ≤ (L² ζ(κ, D) / (2µ)) (1 + log T).

The proofs of Theorems 1 and 2 are in the appendix. A major challenge in proving Theorems 1 and 2 is that there is no vector space structure on Riemannian manifolds. Thanks to the trigonometric distance bound proposed in [36], we manage to obtain the regrets O(√T) and O(log T) for g-convex and strongly g-convex loss functions, respectively. By letting κ tend to zero, the results recover the regret bounds for Euclidean gradient descent in [38] and [24]. Theorems 1 and 2 also reveal the influence of curvature on the regret bounds. Since ζ(κ, d) is an increasing function of |κ|, the upper regret bounds of the R-OGD algorithm are larger than those in Euclidean spaces, and an increase of |κ| raises the upper regret bound. Therefore, a proper Riemannian metric should be chosen in the optimization to avoid a high sectional curvature bound.
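The following sketch illustrates Algorithm 1 with generic manifold primitives; `manifold.exp`, `project_K`, and the per-round gradient oracles are placeholder names of our own (any toolbox exposing an exponential map, e.g. a Pymanopt manifold, plus a projection onto K would do), so this is a sketch under those assumptions rather than a reference implementation.

```python
# A minimal sketch of Algorithm 1 (R-OGD). `manifold` is assumed to expose exp(x, v);
# `project_K` stands for the projection P_K onto the g-convex set K; each round reveals
# the loss f_t together with an oracle for its Riemannian gradient. Tangent vectors are
# represented as numpy-style arrays.

def r_ogd(manifold, project_K, x1, rounds, step_sizes):
    """Run R-OGD over `rounds`, a sequence of (f_t, grad_f_t) pairs revealed online."""
    x, total_loss = x1, 0.0
    for (f_t, grad_f_t), alpha_t in zip(rounds, step_sizes):
        total_loss += f_t(x)                                    # play x_t and suffer f_t(x_t)
        riem_grad = grad_f_t(x)                                 # full-information feedback
        x = project_K(manifold.exp(x, -alpha_t * riem_grad))    # x_{t+1} = P_K(exp_{x_t}(-alpha_t grad))
    return total_loss
```

With the step sizes of Theorem 1 one would pass step_sizes = [D / (L * (zeta * t) ** 0.5) for t in range(1, T + 1)] with zeta = ζ(κ, D); for Theorem 2, step_sizes = [1 / (mu * t) for t in range(1, T + 1)].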
3.3 Regret Lower Bound

This section is intended to answer the question of whether there exists an algorithm that attains a tighter regret bound than O(√T) for g-convex functions. Theorem 3 provides a negative answer.

Theorem 3. Suppose that Assumptions 1-4 hold. Then for any Hadamard manifold M, Riemannian online convex optimization incurs a regret of Ω(DL√T) for any possible strategy in the worst case.

The proof of Theorem 3 is in the appendix. The result illustrates that, as in Euclidean spaces, the regret of a Riemannian online convex algorithm cannot be less than Ω(√T) in the worst case. Moreover, Theorem 3 shows that the regret of the R-OGD algorithm in Theorem 1 is tight up to a constant.

4 Riemannian Online Convex Optimization with Bandit Feedback

In this section, we consider Riemannian online convex optimization with one-point bandit feedback. We first present the Riemannian bandit algorithm (R-BAN) on Hadamard homogeneous manifolds and then analyze the (expected) upper regret bound of our algorithm. In the rest of this section, for any point x, we denote by B_δ(x) the geodesic ball centered at x with radius δ, and by S_δ(x) the geodesic sphere centered at x with radius δ.

4.1 Riemannian Bandit Algorithm

In the bandit setting, Assumptions 2-4 are slightly modified as follows.

Assumption 5. M is an n-dimensional homogeneous Hadamard manifold with the sectional curvature lower bounded by a constant κ (κ ≤ 0).

Remark 2. The homogeneous Hadamard manifold has been widely studied in differential geometry [10, 11], and the property of homogeneity has received much attention in machine learning [33, 35, 14]. Many manifolds often considered in Riemannian optimization, such as the Euclidean space R^n, the hyperbolic space H^n, and the manifold of SPD matrices, are Hadamard homogeneous manifolds. Note that on Hadamard homogeneous manifolds, the volume and the surface area of a geodesic ball depend only on the radius and not on the center of the ball; thus we denote by V_δ the volume of B_δ(x) and by S_δ the surface area ((n−1)-dimensional volume) of S_δ(x) for all x ∈ M.

Assumption 6. There exists an interior point p ∈ K such that the set K contains a ball with radius r centered at p and is also contained in a ball with radius D, i.e., B_r(p) ⊆ K ⊆ B_D(p).

Assumption 7. For any t = 1, . . . , T, f_t is differentiable, g-L-Lipschitz, and bounded by C.

Inspired by the Euclidean bandit algorithm, we replace the gradient ∇f_t(x_t) with a randomized estimator g_t and propose our R-BAN in Algorithm 2 on Hadamard homogeneous manifolds.

Algorithm 2: Riemannian Bandit Algorithm (R-BAN)
Input: Manifold M, time horizon T, step size α, parameters δ, τ
Output: Sequence {x_t}_{t=1,...,T}
for t = 1 to T do
  Pick x_t uniformly from S_δ(y_t);
  Play x_t and observe f_t(x_t);
  Construct the gradient estimator g_t = f_t(x_t) exp_{y_t}^{-1}(x_t) / ‖exp_{y_t}^{-1}(x_t)‖;
  Update y_t with the rule y_{t+1} = P_{(1−τ)K}(exp_{y_t}(−α g_t)), where P_{(1−τ)K} denotes the projection onto the shrinking set (1 − τ)K = {exp_p((1 − τ)u) | u = exp_p^{-1}(x), x ∈ K};
  Return x_t and suffer the loss f_t(x_t);
end

4.2 Challenges from Geometry

Since Algorithm 2 is an extension of the Euclidean bandit algorithm in [20], it is worth reviewing the analysis in [20]. In the Euclidean setting, one uniformly chooses x_t on S_δ(y_t) and updates y_t by the rule

ĝ_t = f_t(x_t) (x_t − y_t)/‖x_t − y_t‖,    y_{t+1} = P_{(1−τ)K}(y_t − α ĝ_t).   (2)

The basic idea for the analysis is to introduce the smoothed loss function [20]

f̂_t^E(x) = E_{u∼B_δ(x)}[f_t(u)] = (1/|B_δ(x)|) ∫_{B_δ(x)} f_t(u) du,

where f̂_t^E is a convex approximation of f_t when δ is small. It is shown that (n/δ) ĝ_t is an unbiased estimator of the gradient ∇f̂_t^E(y_t); hence the bandit algorithm is actually an expected gradient descent method [20] with the loss function f̂_t^E. In this way, a Euclidean regret bound of the bandit algorithm is established in [20].
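The unbiasedness claim for the Euclidean one-point estimator can be checked numerically. The sketch below averages (n/δ) f(y + δu) u over uniform unit directions u and compares it with the gradient of the smoothed loss for a toy quadratic, a case where ∇f̂^E coincides with ∇f so no second layer of sampling is needed; the quadratic and all variable names are our own illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
n, delta, num_samples = 5, 0.5, 200_000

# Toy convex quadratic f(x) = x^T Q x + b^T x; for quadratics the gradient of the
# ball-averaged (smoothed) function equals the gradient of f itself, easing the check.
A = rng.standard_normal((n, n))
Q = A @ A.T + np.eye(n)
b = rng.standard_normal(n)
f = lambda x: x @ Q @ x + b @ x
y = rng.standard_normal(n)

# one-point estimates (n/delta) * f(y + delta*u) * u with u uniform on the unit sphere
U = rng.standard_normal((num_samples, n))
U /= np.linalg.norm(U, axis=1, keepdims=True)
values = np.array([f(y + delta * u) for u in U])
estimate = (n / delta) * (values[:, None] * U).mean(axis=0)

smoothed_grad = 2 * Q @ y + b                    # equals the smoothed gradient for a quadratic
print(np.linalg.norm(estimate - smoothed_grad))  # Monte Carlo error shrinks like 1/sqrt(num_samples)
```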
Back to the Riemannian case, we attempt to generalize the analysis of [20] in parallel by defining the "Riemannian version" of the smoothed loss function,

f̂_t(x) = E_{u∼B_δ(x)}[f_t(u)] = (1/V_δ) ∫_{B_δ(x)} f_t(u) ω,

where ω is the volume element with respect to the metric g. Analyzing this smoothed loss function on a Riemannian manifold, however, is fundamentally challenging for the following two reasons.

(i) The gradient is hard to compute. Computing the gradient of f̂_t is quite different from the Euclidean case, because the derivative operator and the integration operator ∫_{B_δ(y_t)} no longer commute. In Euclidean spaces, the derivative operator commutes with the integration operator ∫_{B_δ(y_t)}. Accordingly, for the Euclidean smoothed loss function f̂_t^E,

∇f̂_t^E(y_t) = ∇( (1/|B_δ(y_t)|) ∫_{B_δ(y_t)} f_t(u) du ) = (1/|B_δ(y_t)|) ∫_{B_δ(y_t)} ∇f_t(u) du,   (3)

which implies (n/δ) E[ĝ_t] = ∇f̂_t^E(y_t). However, on Riemannian manifolds the derivative operator does not commute with the integration operator ∫_{B_δ(y_t)}. Consequently, equation (3) fails for functions on Riemannian manifolds.

(ii) The convexity may be lost. Another essential challenge for the regret analysis is the convexity of f̂_t. In Euclidean spaces, one can easily conclude the convexity of f̂_t^E. However, convexity may not hold on a Riemannian manifold. Through calculation, the Hessian of f̂_t on Riemannian manifolds is

Hess(f̂_t)(X, X) = (1/V_δ) ∫_{B_δ(x)} ( Hess(f_t)(η, η)(u) + ⟨∇_η η, ∇f_t(u)⟩ ) ω,

where η is a Killing field such that η(x) = X. Since the quadratic form Hess(f_t)(η, η)(x) + ⟨∇_η η, ∇f_t(x)⟩ can be negative for some X ∈ T_x M, the g-convexity of f̂_t may be violated even for small δ.

4.3 Gradient Bound and Approximate g-Convexity

To address difficulty (i), we propose a key technique to analyze the derivative of a local integration by introducing the Killing vector field. With the help of this technique, we manage to compute the gradient of f̂_t in Lemma 1.

Lemma 1. Let M be a homogeneous Hadamard manifold, and let f be a C¹ function on M. Then the smoothed function f̂(x) = (1/V_δ) ∫_{B_δ(x)} f ω satisfies:
1) ∇f̂(x) = (1/V_δ) ∫_{S_δ(x)} f(u) (exp_x^{-1}(u)/‖exp_x^{-1}(u)‖) dS(u) = (S_δ/V_δ) E_{u∼S_δ(x)}[ f(u) exp_x^{-1}(u)/‖exp_x^{-1}(u)‖ ], where dS denotes the surface measure on S_δ(x);
2) if |f(x)| < C for all x, then ‖∇f̂(x)‖ ≤ C S_δ/V_δ for all δ > 0.

Remark 3. The proof of Lemma 1 can be found in the appendix. The first part of the lemma establishes a gradient estimator of ∇f̂_t(x), while the second part gives an easy-to-compute bound on the gradient. In the proof, we develop a key technique that transforms the derivative of an integration over B_δ(x) into the integration of the derivative along the corresponding Killing vector field over B_δ(x), i.e.,

X( ∫_{B_δ(x)} f(u) ω ) = ∫_{B_δ(x)} η(f)(u) ω,   (4)

where the derivative on the left is taken with respect to the center x and η is a Killing field with η(x) = X. This technique does not rely on the curvature or other specific manifold structures. Hence, the technique can serve as a basic tool for optimization on homogeneous spaces and perhaps in broader areas; a sketch of the resulting single-sample estimator is given below.
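Lemma 1 suggests a single-sample estimator of ∇f̂(x): draw a point on the geodesic sphere S_δ(x) and weight the unit direction exp_x^{-1}(u)/‖exp_x^{-1}(u)‖ by f(u) and by S_δ/V_δ; up to the factor S_δ/V_δ, this is the quantity g_t used in Algorithm 2. The sketch below assumes a manifold object with exp/log maps and a hypothetical random_tangent primitive; note that mapping a uniform unit direction through exp_x gives the uniform distribution on S_δ(x) on constant-curvature spaces, while general homogeneous manifolds may require reweighting by the Jacobian of exp_x.

```python
import numpy as np

def smoothed_gradient_estimate(manifold, f, x, delta, surface_to_volume, rng):
    """One draw of the Lemma-1 estimator (S_delta/V_delta) * f(u) * exp_x^{-1}(u)/||exp_x^{-1}(u)||.

    `manifold.exp`, `manifold.log` (= exp_x^{-1}) and `manifold.random_tangent` are assumed
    primitives; `surface_to_volume` is the ratio S_delta/V_delta, which on a homogeneous
    manifold depends only on delta (it equals n/delta in the Euclidean case).
    """
    v = manifold.random_tangent(x, rng)            # random direction in T_x M (array-valued)
    v = delta * v / np.linalg.norm(v)              # rescale to geodesic length delta
    u = manifold.exp(x, v)                         # point on the geodesic sphere S_delta(x)
    w = manifold.log(x, u)                         # exp_x^{-1}(u), a tangent vector of norm delta
    return surface_to_volume * f(u) * w / np.linalg.norm(w)
```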
For difficulty (ii), we notice that although the function f̂_t may not be g-convex, it is very close to being g-convex.

Lemma 2. Suppose that M is a Hadamard homogeneous manifold and K is a g-convex and bounded subset of M. Then there exists a constant ρ ≥ 0, depending only on the set K, such that for any g-convex and g-L-Lipschitz function f,

f̂(y) − f̂(x) ≥ ⟨∇f̂(x), exp_x^{-1}(y)⟩ − 2ρδL,  ∀x, y ∈ K.

The proof of Lemma 2 is in the appendix. It is worth mentioning that the constant ρ describes how close a smoothed function is to being g-convex. Notice that ρ is defined as a supremum over points x, y, u ∈ K of a quantity determined by an isometry φ satisfying φ(x) = y (its explicit expression is given in the appendix); in particular, ρ depends neither on the function f̂_t nor on the time horizon T. Moreover, for a given manifold M, once the set K is fixed and the explicit expression of φ is available, we can compute the constant ρ as a finite number. We briefly discuss the bound on ρ for the following two kinds of manifolds. (1) Let M be a Euclidean space; then we can take the isometry φ(z) = z + y − x. Hence ρ = 0 and f̂ is convex, which coincides with the known result in Euclidean spaces. (2) Let M be the 2-dimensional Poincaré disk; then the isometry φ mapping x to y has the closed form φ = φ_x ∘ φ_y, where φ_x(z) = (x − z)/(1 − x̄z) and φ_y(z) = (y − z)/(1 − ȳz). Therefore, if K has diameter D, one can derive a bound of the form

ρ ≤ 16 · (1 + tanh(D/2))/(1 − tanh(D/2)) · ( 1/(1 − tanh(2D)²) + D/tanh(D/2) ),

which implies that ρ may grow exponentially with respect to D. Although the value of ρ is generally difficult to calculate, our algorithm analysis and parameter selection do not depend on the specific value of ρ (see Theorem 4).

4.4 Regret Bound

With the above preparations, we carry out the analysis of the expected regret of Algorithm 2. Denote by Λ a constant determined by n, κ, C, D, L, r, and ζ(κ, D); its explicit expression, together with those of the auxiliary constants used in the proof, is given in the appendix.

Theorem 4. Suppose that Assumptions 1 and 5-7 hold, and f_t is g-convex for any t = 1, . . . , T. With the step size α and the smoothing radius δ chosen as in the appendix (as explicit powers of T and ζ(κ, D), with δ of order T^{-1/4}), the expected regret of Algorithm 2 is upper bounded by

E[R(T)] = O( n^{1/2} ζ(κ, D)^{1/4} Λ T^{3/4} ).

The proof of Theorem 4 can be found in the appendix. Theorem 4 shows that the regret of the Riemannian bandit algorithm achieves O(T^{3/4}) for g-convex loss functions on homogeneous Hadamard manifolds, which is the same as the regret bound in Euclidean spaces [20]. Our work is different from the result in [29] in two aspects. First, in [29] the sublinear regret bounds depend on the assumption of sublinearity of the term V_T, while in Theorem 4 the sublinear regret bound is expressed explicitly in the time horizon T with no additional assumptions. Second, to estimate the gradient, [29] used a biased estimator of the gradient, while in our bandit algorithm the gradient estimator g_t of the smoothed function is unbiased.

5 Numerical Experiments

We validate our findings on Riemannian manifolds for both g-convex and strongly g-convex losses. We also compare our results with the Riemannian online zeroth-order (R-OZO) algorithm [29] where possible (the R-OZO is only suitable for strongly g-convex cases). All experiments are performed with the help of the Pymanopt toolbox [34] (https://www.pymanopt.org/, BSD 3-Clause License) on a 64-bit Windows platform with a 3.4 GHz CPU (AMD Ryzen 5 2600), and the code used for the numerical experiments is provided in the supplementary materials.

5.1 Strongly Convex Cases

For strongly g-convex cases, we apply our R-OGD and R-BAN algorithms to the Fréchet mean problem, also known as finding the Riemannian centroid of a set of points on a manifold, which has many applications such as diffusion tensor magnetic resonance imaging (DT-MRI), radar signal processing, and batch normalization of neural networks [16, 27, 15]. In this subsection, we study an online form of the Fréchet mean problem for averaging a set of N time-varying points {A_{t,1}, A_{t,2}, A_{t,3}, . . . , A_{t,N}} on a manifold. The loss function is defined as

f_t(X_t) = (1/N) Σ_{i=1}^N d²(X_t, A_{t,i}),  t = 1, 2, . . . , T,

where d(X, Y) is the Riemannian distance on the manifold. It is remarked that f_t is g-convex and 2-strongly g-convex [18], so that we can apply our algorithms. In particular, we test the R-OGD and the R-BAN algorithm, respectively, on the manifold of SPD matrices and in the hyperbolic space, and we also compare our R-BAN algorithm with the R-OZO in [29].
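As an illustration of this loss, here is a hedged numpy sketch of f_t and its Riemannian gradient on the SPD manifold with the affine-invariant metric, using the standard formulas d²(X, A) = ‖log(X^{-1/2} A X^{-1/2})‖_F² and grad_X d²(X, A) = −2 Log_X(A); the helper names are our own, and the returned gradient is what the R-OGD update of Algorithm 1 would consume.

```python
import numpy as np

def _eig_fun(M, fun):
    """Apply a scalar function to a symmetric matrix through its eigendecomposition."""
    M = (M + M.T) / 2                       # symmetrize against round-off
    w, V = np.linalg.eigh(M)
    return (V * fun(w)) @ V.T

def spd_log(X, A):
    """Log_X(A) = X^{1/2} log(X^{-1/2} A X^{-1/2}) X^{1/2} (affine-invariant metric)."""
    Xh = _eig_fun(X, np.sqrt)
    Xih = _eig_fun(X, lambda w: 1.0 / np.sqrt(w))
    return Xh @ _eig_fun(Xih @ A @ Xih, np.log) @ Xh

def spd_exp(X, V):
    """Exp_X(V) = X^{1/2} exp(X^{-1/2} V X^{-1/2}) X^{1/2}, the exponential map used by R-OGD."""
    Xh = _eig_fun(X, np.sqrt)
    Xih = _eig_fun(X, lambda w: 1.0 / np.sqrt(w))
    return Xh @ _eig_fun(Xih @ V @ Xih, np.exp) @ Xh

def spd_dist2(X, A):
    """Squared affine-invariant distance d^2(X, A) = sum_i log(lambda_i)^2."""
    Xih = _eig_fun(X, lambda w: 1.0 / np.sqrt(w))
    M = Xih @ A @ Xih
    return float(np.sum(np.log(np.linalg.eigvalsh((M + M.T) / 2)) ** 2))

def frechet_loss_and_grad(X, As):
    """Online Frechet-mean loss (1/N) sum_i d^2(X, A_i) and its Riemannian gradient."""
    N = len(As)
    loss = sum(spd_dist2(X, A) for A in As) / N
    grad = (-2.0 / N) * sum(spd_log(X, A) for A in As)   # grad of d^2(., A) at X is -2 Log_X(A)
    return loss, grad
```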
Fréchet mean on the SPD Manifold On the manifold of SPD matrices {X ∈ R^{n×n} | X^T = X, X ≻ 0}, we run the R-OGD algorithm with two kinds of step sizes: the step size for the g-convex case, α_t = D/(L√(ζ(κ, D) t)) (R-OGD-C), and the step size for the strongly g-convex case, α_t = 1/(2t) (R-OGD-SC). In these two cases, we set [n, N, T] = [100, 10, 1000]. The matrices A_{t,i} are randomly generated by the method in the Pymanopt toolbox. We plot the average regret R(t)/t versus the learning round t and versus the running time in Figures 1(a) and 2(a). As seen, the regrets of the R-OGD algorithm grow sublinearly with t, and the average regret of R-OGD-SC converges faster than that of R-OGD-C in terms of both the iteration count and the running time, which is consistent with the results in Theorems 1 and 2.

Fréchet mean on the Hyperbolic Space We test the performance of the R-BAN algorithm in the hyperbolic space H^n = {x ∈ R^{n+1} | Σ_{i=1}^n x_i² − x_{n+1}² = −1}. The data A_{t,i} are randomly generated from standard Gaussian distributions. We consider the case [n, N, T] = [100, 10, 10000] and set δ = 0.399 (which is four times the theoretical value) and α = 0.006. Figures 1(b) and 2(b) compare the average performance of the R-BAN algorithm and the R-OZO algorithm over 100 random runs. We observe that our R-BAN algorithm can achieve a sublinear expected regret with less information, since the R-OZO needs function values at two points in this case.

Figure 1: Average regret vs. learning rounds: (a) Fréchet mean on the SPD manifold (R-OGD-C, R-OGD-SC); (b) Fréchet mean on the hyperbolic space (R-BAN, R-OZO); (c) operator scaling (R-OGD, R-BAN).
Figure 2: Average regret vs. running time: (a) Fréchet mean on the SPD manifold (R-OGD-C, R-OGD-SC); (b) Fréchet mean on the hyperbolic space (R-BAN, R-OZO); (c) operator scaling (R-OGD, R-BAN).

5.2 Convex Cases

An example of Riemannian optimization problems that is g-convex but not strongly g-convex is the operator scaling problem defined on the manifold of SPD matrices {X ∈ R^{n×n} | X^T = X, X ≻ 0}. The operator scaling problem has drawn abundant interest in many areas, such as computing noncommutative rank [26] and computing Brascamp-Lieb constants [21]. In this subsection, we study an online form of the operator scaling problem. Given a tuple of time-varying matrices (A_{t,1}, . . . , A_{t,N}), the online operator scaling can be formulated as minimizing the log-capacity of the operator T_t(X) = Σ_{i=1}^N A_{t,i} X A_{t,i}^T, that is,

f_t(X_t) = log det(T_t(X_t)) − log det(X_t),  t = 1, 2, . . . , T.

We test our R-OGD and R-BAN algorithms for the case [n, N, T] = [5, 2, 100000]. The entries of A_{t,i} are generated from the standard Gaussian distribution. We test the R-OGD taking the Lipschitz constant L = 2, and test the R-BAN with δ = 0.22 (which is five times the theoretical value) and α = 0.002 over 100 different runs. The results in Figures 1(c) and 2(c) again show the (expected) sublinear regret in this case, which supports our theoretical results.
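For reference, the operator scaling loss just described can be written compactly as below; the Riemannian-gradient conversion G ↦ X G X assumes the affine-invariant metric on SPD matrices (the standard choice under which this problem is g-convex [6]), and the function names are our own.

```python
import numpy as np

def operator_scaling_loss_and_grad(X, As):
    """f(X) = log det(T(X)) - log det(X) with T(X) = sum_i A_i X A_i^T.

    Returns the loss and the Riemannian gradient X G X, where
    G = sum_i A_i^T T(X)^{-1} A_i - X^{-1} is the Euclidean gradient; the conversion
    G -> X G X assumes the affine-invariant metric on SPD matrices.
    """
    TX = sum(A @ X @ A.T for A in As)
    _, logdet_t = np.linalg.slogdet(TX)
    _, logdet_x = np.linalg.slogdet(X)
    loss = logdet_t - logdet_x
    TX_inv = np.linalg.inv(TX)
    G = sum(A.T @ TX_inv @ A for A in As) - np.linalg.inv(X)
    return loss, X @ G @ X

# toy usage mirroring the spirit of the experiment's random data
rng = np.random.default_rng(0)
n, N = 5, 2
As = [rng.standard_normal((n, n)) for _ in range(N)]
loss, grad = operator_scaling_loss_and_grad(np.eye(n), As)
```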
6 Conclusion

We considered an online optimization problem on Riemannian manifolds in the full information and bandit feedback settings. We developed the R-OGD algorithm on Hadamard manifolds and the R-BAN algorithm on Hadamard homogeneous manifolds. The upper regret bounds of the R-OGD and R-BAN algorithms, together with a universal lower regret bound, were established with the influence of curvature clearly indicated. All of the regret bounds matched their Euclidean counterparts.

A limitation of our work is that we do not take retraction into consideration. A retraction map is a cheap approximation of the exponential map on manifolds and is a sensible choice in many real scenarios. In future work, we intend to design Riemannian online optimization methods with the retraction map, so that the resulting algorithms can be more effective in large-scale optimization problems.

Acknowledgement This work is supported in part by the Shanghai Municipal Science and Technology Major Project under Grant 2021SHZDZX0100 and the National Natural Science Foundation of China under Grant 61733018, and in part by the Australian Research Council under Grant DP190103615.

References

[1] Jacob Abernethy, Peter Bartlett, Alexander Rakhlin, and Ambuj Tewari. Optimal strategies and minimax lower bounds for online convex games. Technical Report UCB/EECS-2008-19, University of California, Berkeley, United States, 2008.
[2] P.-A. Absil, Robert Mahony, and Rodolphe Sepulchre. Optimization algorithms on matrix manifolds. Princeton University Press, 2009.
[3] Alekh Agarwal, Ofer Dekel, and Lin Xiao. Optimal algorithms for online convex optimization with multi-point bandit feedback. In Proceedings of the 23rd Annual Conference on Learning Theory (COLT), June 2010.
[4] Shmuel Agmon. The relaxation method for linear inequalities. Canadian Journal of Mathematics, 6:382-392, 1954.
[5] Kwangjun Ahn and Suvrit Sra. From Nesterov's estimate sequence to Riemannian acceleration. In Jacob Abernethy and Shivani Agarwal, editors, Proceedings of Thirty Third Conference on Learning Theory, volume 125 of Proceedings of Machine Learning Research, pages 84-118. PMLR, 09-12 Jul 2020.
[6] Zeyuan Allen-Zhu, Ankit Garg, Yuanzhi Li, Rafael Oliveira, and Avi Wigderson. Operator scaling via geodesically convex optimization, invariant theory and polynomial identity testing. In Proceedings of the 50th Annual ACM SIGACT Symposium on Theory of Computing, pages 172-181, 2018.
[7] Kimon Antonakopoulos, E. Veronica Belmega, and Panayotis Mertikopoulos. Online and stochastic optimization beyond Lipschitz continuity: A Riemannian approach. In International Conference on Learning Representations, 2020.
[8] Sébastien Arnold, Pierre-Antoine Manzagol, Reza Babanezhad Harikandeh, Ioannis Mitliagkas, and Nicolas Le Roux. Reducing the variance in online optimization by transporting past gradients. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc., 2019.
[9] Gary Becigneul and Octavian-Eugen Ganea. Riemannian adaptive optimization methods. In International Conference on Learning Representations, 2019.
[10] Valerii Berestovskii and Yurii Nikonorov. Riemannian manifolds and homogeneous geodesics. Springer, 2020.
[11] Marcel Berger. Geometry I. Springer Science & Business Media, 2009.
[12] Silvere Bonnabel. Stochastic gradient descent on Riemannian manifolds. IEEE Transactions on Automatic Control, 58(9):2217-2229, 2013.
[13] Nicolas Boumal and P.-A. Absil. Low-rank matrix completion via preconditioned optimization on the Grassmann manifold. Linear Algebra and its Applications, 475:200-239, 2015.
[14] Michael M. Bronstein, Joan Bruna, Taco Cohen, and Petar Velickovic. Geometric deep learning: Grids, groups, graphs, geodesics, and gauges. CoRR, abs/2104.13478, 2021.
[15] Daniel Brooks, Olivier Schwander, Frederic Barbaresco, Jean-Yves Schneider, and Matthieu Cord. Riemannian batch normalization for SPD neural networks. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc., 2019.
[16] Guang Cheng, Hesamoddin Salehian, and Baba C Vemuri. Efficient recursive algorithms for computing the mean diffusion tensor and applications to DTI segmentation. In European Conference on Computer Vision, pages 390-401. Springer, 2012.
[17] Shiing-Shen Chern, Weihuan Chen, and Kai Shue Lam. Lectures on differential geometry, volume 1. World Scientific, 1999.
[18] Charlan Dellon da Silva Alves, Paulo Roberto Oliveira, and Ronaldo Malheiros Gregório. ℓα Riemannian weighted centers of mass applied to compose an image filter to diffusion tensor imaging. Applied Mathematics and Computation, 390:125603, 2021.
[19] Jiashi Feng, Huan Xu, and Shuicheng Yan. Online robust PCA via stochastic optimization. In Advances in Neural Information Processing Systems, pages 404-412. Citeseer, 2013.
[20] Abraham D. Flaxman, Adam Tauman Kalai, and H. Brendan McMahan. Online convex optimization in the bandit setting: Gradient descent without a gradient. In Proceedings of the Sixteenth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA '05, pages 385-394, USA, 2005. Society for Industrial and Applied Mathematics.
[21] Ankit Garg, Leonid Gurvits, Rafael Oliveira, and Avi Wigderson. Algorithmic and optimization aspects of Brascamp-Lieb inequalities, via operator scaling. Geometric and Functional Analysis, 28(1):100-145, 2018.
[22] Mohammad Ghomi and Joel Spruck. Total curvature and the isoperimetric inequality in Cartan-Hadamard manifolds, 2021.
[23] Elad Hazan. Introduction to online convex optimization. arXiv preprint arXiv:1909.05207, 2019.
[24] Elad Hazan, Adam Kalai, Satyen Kale, and Amit Agarwal. Logarithmic regret algorithms for online convex optimization. In International Conference on Computational Learning Theory, pages 499-513. Springer, 2006.
[25] Jiang Hu, Xin Liu, Zai-Wen Wen, and Ya-Xiang Yuan. A brief introduction to manifold optimization. Journal of the Operations Research Society of China, 8(2):199-248, 2020.
[26] SK Ivanov, AM Kamchatnov, T Congy, and N Pavloff. Solution of the Riemann problem for polarization waves in a two-component Bose-Einstein condensate. Physical Review E, 96(6):062202, 2017.
[27] J Lapuyade-Lahorgue and F Barbaresco. Radar detection using Siegel distance between autoregressive processes, application to HF and X-band radar. In 2008 IEEE Radar Conference, pages 1-6. IEEE, 2008.
[28] Kuang-Chih Lee and David Kriegman. Online learning of probabilistic appearance manifolds for video-based recognition and tracking. In 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05), volume 1, pages 852-859. IEEE, 2005.
[29] Alejandro I Maass, Chris Manzie, Dragan Nesic, Jonathan H Manton, and Iman Shames. Online zeroth-order optimisation on Riemannian manifolds. arXiv preprint arXiv:2010.00211, 2020.
[30] Jiazhong Nie, Wojciech Kotłowski, and Manfred K Warmuth. Online PCA with optimal regret. The Journal of Machine Learning Research, 17(1):6022-6070, 2016.
[31] Wolfgang Ring and Benedikt Wirth. Optimization methods on Riemannian manifolds and their application to shape space. SIAM Journal on Optimization, 22(2):596-627, 2012.
[32] Niranjan Srinivas, Andreas Krause, Sham M Kakade, and Matthias Seeger. Gaussian process optimization in the bandit setting: No regret and experimental design. arXiv preprint arXiv:0912.3995, 2009.
[33] Hao Tang, Zhiao Huang, Jiayuan Gu, Bao-Liang Lu, and Hao Su. Towards scale-invariant graph-related problem solving by iterative homogeneous GNNs. Advances in Neural Information Processing Systems, 33, 2020.
[34] James Townsend, Niklas Koep, and Sebastian Weichwald. Pymanopt: A Python toolbox for optimization on manifolds using automatic differentiation. Journal of Machine Learning Research, 17(137):1-5, 2016.
[35] Quinten Tupker, Salem Said, and Cyrus Mostajeran. Online learning of Riemannian hidden Markov models in homogeneous Hadamard spaces. In Frank Nielsen and Frédéric Barbaresco, editors, Geometric Science of Information, pages 37-44, Cham, 2021. Springer International Publishing.
[36] Hongyi Zhang and Suvrit Sra. First-order methods for geodesically convex optimization. In Conference on Learning Theory, pages 1617-1638, 2016.
[37] Pan Zhou, Xiao-Tong Yuan, and Jiashi Feng. Faster first-order methods for stochastic nonconvex optimization on Riemannian manifolds. In Kamalika Chaudhuri and Masashi Sugiyama, editors, Proceedings of the Twenty-Second International Conference on Artificial Intelligence and Statistics, volume 89 of Proceedings of Machine Learning Research, pages 138-147. PMLR, 16-18 Apr 2019.
[38] Martin Zinkevich. Online convex programming and generalized infinitesimal gradient ascent. In Proceedings of the Twentieth International Conference on Machine Learning, ICML '03, pages 928-935. AAAI Press, 2003.