Riemannian Projection-free Online Learning

Zihao Hu, Guanghui Wang, Jacob Abernethy
College of Computing, Georgia Institute of Technology; Google Research
{zihaohu,gwang369}@gatech.edu, abernethyj@google.com

Abstract

The projection operation is a critical component in a wide range of optimization algorithms, such as online gradient descent (OGD), for enforcing constraints and achieving optimal regret bounds. However, it suffers from computational complexity limitations in high-dimensional settings or when dealing with ill-conditioned constraint sets. Projection-free algorithms address this issue by replacing the projection oracle with more efficient optimization subroutines. But to date, these methods have been developed primarily in the Euclidean setting, and while there has been growing interest in optimization on Riemannian manifolds, there has been essentially no work in trying to utilize projection-free tools here. An apparent issue is that non-trivial affine functions are generally non-convex in such domains. In this paper, we present methods for obtaining sub-linear regret guarantees in online geodesically convex optimization on curved spaces for two scenarios: when we have access to (a) a separation oracle or (b) a linear optimization oracle. For geodesically convex losses, and when a separation oracle is available, our algorithms achieve $O(T^{1/2})$, $O(T^{3/4})$ and $O(T^{1/2})$ adaptive regret guarantees in the full information setting, the bandit setting with one-point feedback and the bandit setting with two-point feedback, respectively. When a linear optimization oracle is available, we obtain regret rates of $O(T^{3/4})$ for geodesically convex losses and $O(T^{2/3}\log T)$ for strongly geodesically convex losses.

1 Introduction

Online convex optimization (OCO) offers a framework for modeling sequential decision-making problems (Hazan et al., 2016). The standard setting depicts the learning process as a zero-sum game between a learner and an adversary. At round $t$, the learner selects a decision $x_t$ from a convex set $\mathcal{K}$ and observes the encountered convex loss function $f_t$. The learner's goal is to minimize regret, defined as
$$\mathrm{Regret}_T := \sum_{t=1}^{T} f_t(x_t) - \min_{x \in \mathcal{K}} \sum_{t=1}^{T} f_t(x).$$
In Euclidean space, OCO boasts a robust theoretical foundation and numerous real-world applications, such as online load balancing (Molinaro, 2017), optimal control (Li et al., 2019), revenue maximization (Lin et al., 2019), and portfolio management (Jézéquel et al., 2022). The standard approach for OCO is online gradient descent (OGD), which performs $x_{t+1} = \Pi_{\mathcal{K}}(x_t - \eta_t \nabla f_t(x_t))$, where $\Pi_{\mathcal{K}}$ denotes the orthogonal projection onto $\mathcal{K}$, ensuring that the sequence $\{x_t\}_{t=1}^T$ remains feasible. However, the projection operation can be computationally expensive for high-dimensional or complex feasible sets. Projection-free online learning provides a reasonable way to handle this situation. At the heart of this approach is the understanding that, in many cases, the complexity of implementing an optimization oracle can be significantly lower than that of the orthogonal projection.

37th Conference on Neural Information Processing Systems (NeurIPS 2023).

Table 1: Summary of main results. Our approach allows us to invoke either a Separation Oracle (SO) or a Linear Optimization Oracle (LOO) for $O(T)$ times throughout the $T$ rounds. Notably, our results match those presented by Garber & Kretzu (2022).

Oracle | Losses | Feedback | Measure | Regret
SO | gsc-convex | full information | adaptive regret | $O(T^{1/2})$, Thm. 1
SO | gsc-convex | bandit, one-point | expected adaptive regret | $O(T^{3/4})$, Thm. 2
SO | gsc-convex | bandit, two-point | expected adaptive regret | $O(T^{1/2})$, Thm. 3
LOO | gsc-convex | full information | adaptive regret | $O(T^{3/4})$, Thm. 4
LOO | strongly gsc-convex | full information | regret | $O(T^{2/3}\log T)$, Thm. 5
As a result, both practical and theoretical interests lie in replacing the projection operation with these optimization oracles. The two most well-known projection-free optimization oracles are:
(Linear optimization oracle) $\ \mathrm{argmin}_{x \in \mathcal{K}} \langle g, x \rangle\quad$ and $\quad$ (Separation oracle) $\ g :\ \langle y - x, g \rangle > 0,\ \forall x \in \mathcal{K}$.
Whereas much attention has been given to the development of sub-linear regret guarantees using optimization oracles in Euclidean space (Hazan & Kale, 2012; Levy & Krause, 2019; Hazan & Minasyan, 2020; Wan et al., 2022; Garber & Kretzu, 2022; Mhammedi, 2022a; Wan et al., 2023), there is considerably less research on projection-free optimization on Riemannian manifolds. There are numerous scenarios where the space of feasible parameters is not a convex subset of Euclidean space, but instead has a manifold structure with a Riemannian metric; examples include non-negative PCA (Montanari & Richard, 2015), K-means clustering (Carson et al., 2017) and computing Wasserstein barycenters (Weber & Sra, 2022b). However, there has been significantly less research on efficiently computing projections or projection-free optimization in the Riemannian setting, even for highly structured geodesically convex (gsc-convex) feasible sets like
$$\mathcal{K} = \{x \mid \max_{1 \le i \le m} h_i(x) \le 0\},$$
where each $h_i(x)$ is gsc-convex.

In this paper we focus on projection-free optimization on Riemannian manifolds in the online setting. A Riemannian version of Online Gradient Descent (R-OGD) was given by Wang et al. (2021): $x_{t+1} = \Pi_{\mathcal{K}}\,\mathrm{Exp}_{x_t}(-\eta_t \nabla f_t(x_t))$, where $\mathcal{K}$ is now a gsc-convex set and $\Pi_{\mathcal{K}}$ is the metric projection onto $\mathcal{K}$. This method achieves sub-linear regret guarantees for Riemannian OCO. The metric projection is often the most expensive operation, and it is not always clear how to implement it on a manifold. In the present work, we propose the use of two alternative operations: a separation oracle (SO) or a linear optimization oracle (LOO) on the Riemannian manifold, with definitions deferred to Section 3. It is important to note that, in Euclidean space, both oracles rely on the definition of a hyperplane. In the realm of Riemannian manifolds, there are two natural extensions of a hyperplane: one is the sub-level set of a Busemann function, known as a horosphere (Bridson & Haefliger, 2013); the other relies on the inverse exponential map, as outlined in (2). While a horosphere is gsc-convex, the corresponding separation oracle exists if every boundary point of the feasible gsc-convex set has a locally supporting horosphere (Borisenko, 2002). The existence of a separation oracle defined via the inverse exponential map, on the other hand, is assured for any gsc-convex set on a Hadamard manifold (Silva Louzeiro et al., 2022). Hence, we adopt this latter definition. It is worth noting that this object is typically non-convex (Kristály et al., 2016), leading to geometric complexities that necessitate careful management.

Also, our work builds upon the findings of Garber & Kretzu (2022) and inherits the adaptive regret guarantee for gsc-convex losses, as defined by Hazan & Seshadhri (2009):
$$\mathrm{AdaptiveRegret}_T := \sup_{[s,e] \subseteq [T]} \Big\{ \sum_{t=s}^{e} f_t(x_t) - \min_{x \in \mathcal{K}} \sum_{t=s}^{e} f_t(x) \Big\}.$$
Our main contributions are summarized in Table 1.
More specifically:

- Given a separation oracle, we attain adaptive regret bounds of $O(T^{1/2})$, $O(T^{3/4})$ and $O(T^{1/2})$ for gsc-convex losses in the full information setting, the bandit convex optimization setting with one-point feedback,¹ and the bandit convex optimization setting with two-point feedback, respectively.
- Assuming access to a linear optimization oracle, we provide algorithms that enjoy $O(T^{3/4})$ adaptive regret for gsc-convex losses and $O(T^{2/3}\log T)$ regret for strongly gsc-convex losses.
- We highlight some key differences between convex sets on a Hadamard manifold and in Euclidean space. In particular, shrinking a gsc-convex set towards an interior point does not preserve convexity, and the Minkowski functional on a Hadamard manifold is non-convex. These results, which may not be well known within the machine learning community, could be of independent interest.

The technical challenges of this paper can be divided into two parts. Firstly, for the separation oracle, we need to bound the "thickness" of the part of the feasible set cut off by the separating hyperplane. While in Euclidean space this can be achieved through a convex combination argument (Garber & Kretzu, 2022), the task becomes challenging on manifolds due to the varying metric at different points and the non-convex nature of the separating hyperplane. Fortunately, this problem can be addressed using the Jacobi field comparison technique. Also, unlike in Euclidean space, one cannot directly construct a separation oracle for $(1-\delta)\mathcal{K}$ using a separation oracle for $\mathcal{K}$. This poses significant challenges in the bandit setting. Nonetheless, we have identified a novel solution to this issue in the two-point feedback setting. Secondly, in Euclidean space, the linear optimization oracle is typically invoked by online Frank-Wolfe (OFW) (Hazan & Kale, 2012; Kretzu & Garber, 2021) to achieve no-regret online learning. The analysis of OFW relies on the fact that the Hessian of a linear function is zero everywhere. But on Hadamard manifolds, the existence of such functions implies that the manifold has zero sectional curvature everywhere (Kristály et al., 2016). The algorithms in Garber & Kretzu (2022) do not require affinity and serve as a starting point for our results. However, the analysis still needs to be conducted carefully due to the non-convexity of the separating hyperplane on manifolds.

2 Related Work

In this section, we briefly review previous work on projection-free online learning in Euclidean space as well as online and projection-free optimization on Riemannian manifolds.

2.1 Projection-free OCO in Euclidean Space

Linear Optimization Oracle. The pioneering work of Hazan & Kale (2012) first introduced an online variant of the Frank-Wolfe algorithm (OFW) and achieved $O(T^{3/4})$ regret for convex functions. Hazan & Minasyan (2020) proposed a randomized algorithm that leverages smoothness to achieve $O(T^{2/3})$ expected regret. The insightful analyses of Wan & Zhang (2021) and Kretzu & Garber (2021) demonstrated that OFW indeed attains $O(T^{2/3})$ regret for strongly convex functions. Garber & Kretzu (2022) showed that it is possible to achieve $O(T^{3/4})$ adaptive regret and $O(T^{2/3}\log T)$ regret for convex and strongly convex functions, respectively. Mhammedi (2022b) illustrated how to obtain $\tilde{O}(T^{2/3})$ regret for convex functions, where $\tilde{O}(\cdot)$ hides logarithmic terms. Our results are in line with those of Garber & Kretzu (2022) and inherit the adaptive regret guarantee.

Separation Oracle and Membership Oracle.
Levy & Krause (2019) demonstrated that it is possible to achieve $O(\sqrt{T})$ and $O(\log T)$ regret bounds for convex and strongly convex functions when the feasible set is a sublevel set of a smooth and convex function. Mhammedi (2022a) generalized the idea of Levy & Krause (2019) and showed how to obtain $O(\sqrt{T})$ and $O(\log T)$ regret guarantees for general convex sets. Garber & Kretzu (2022) provided algorithms that ensure $O(\sqrt{T})$ and $O(T^{3/4})$ adaptive regret guarantees for convex losses in the full information and bandit settings, respectively.

¹For the one-point feedback setting, we allow the point at which we play and the point where we receive feedback to differ, thereby bypassing a fundamental challenge posed by Riemannian geometry. In the case of two-point feedback, however, we eliminate this non-standard setting.

2.2 Online and Projection-free Optimization on Manifolds

Online Optimization on Manifolds. Bécigneul & Ganea (2019) demonstrated that a series of adaptive optimization algorithms can be implemented on a product of Riemannian manifolds, with each factor manifold being assigned a learning rate. Antonakopoulos et al. (2020) proposed using Follow the Regularized Leader with a strongly gsc-convex regularizer to achieve $O(\sqrt{T})$ regret when the loss satisfies Riemannian Lipschitzness. Wang et al. (2021) introduced Riemannian OGD (R-OGD) and showed regret guarantees in the full information and bandit convex optimization settings. Hu et al. (2023) considered achieving optimistic and dynamic regret on Riemannian manifolds.

Projection-free Optimization on Manifolds. Rusciano (2018) provided a non-constructive cutting-hyperplane method on Hadamard manifolds. By comparison, our algorithms are constructive and deterministic. Weber & Sra (2022b) proposed Riemannian Frank-Wolfe (RFW) for gsc-convex optimization and showed some practical applications on the manifold of SPD matrices. In a subsequent work, Weber & Sra (2022a) generalized RFW to the stochastic and non-convex setting. We use RFW as a subroutine to invoke the linear optimization oracle and establish sub-linear regret guarantees. Hirai et al. (2023) implemented the interior point method on Riemannian manifolds and used a self-concordant barrier to enforce the constraint.

3 Preliminaries and Assumptions

In this section, we lay the groundwork for our study by presenting an overview of Riemannian manifolds and of the separation oracle and the linear optimization oracle within these spaces. Additionally, we establish key definitions and assumptions that will be integral to the following sections.

Riemannian Manifolds. We provide key notations in Riemannian geometry that will be employed throughout this paper. Readers looking for a more comprehensive treatment are encouraged to consult Petersen (2006); Lee (2018). Our proofs also rely on the concept of the Jacobi field, and we provide some background in Appendix E.1. We consider an $n$-dimensional smooth manifold $\mathcal{M}$ equipped with a Riemannian metric $g$. This metric confers a point-wise inner product $\langle u, v \rangle_x$ at every point $x \in \mathcal{M}$, where $u, v$ are vectors in the tangent space $T_x\mathcal{M}$ at $x$. This tangent space, a vector space of dimension $n$, encompasses all vectors tangent to $\mathcal{M}$ at $x$. The Riemannian metric also determines the norm of a tangent vector $u$ as $\|u\| := \sqrt{\langle u, u \rangle}$. A geodesic $\gamma(t) : [0, c] \to \mathcal{M}$ is a piecewise smooth curve with constant velocity that locally minimizes the distance between its endpoints, say, $x$ and $y$. The Riemannian distance between these two points is given by $d(x, y) := \int_0^c \|\dot{\gamma}(t)\|\,dt = \int_0^c \|\dot{\gamma}(0)\|\,dt$.
It is important to note that the Riemannian distance remains invariant under reparameterizations of $\gamma(t)$. Consider a geodesic $\gamma(t) : [0, 1] \to \mathcal{M}$ with $\gamma(0) = x$, $\gamma(1) = y$ and $\dot{\gamma}(0) = v$. The exponential map $\mathrm{Exp}_x(v)$ maps $v \in T_x\mathcal{M}$ to $y \in \mathcal{M}$, and the inverse exponential map $\mathrm{Exp}_x^{-1}y$ performs the inverse operation, mapping $y \in \mathcal{M}$ to $v \in T_x\mathcal{M}$. The inverse exponential map also offers a handy way to express the Riemannian distance: $d(x, y) = \|\mathrm{Exp}_x^{-1}y\|$. The sectional curvature at a point $x \in \mathcal{M}$ depends on two-dimensional subspaces of $T_x\mathcal{M}$ and describes the curvature near $x$. Generally, geodesics diverge on manifolds with negative sectional curvature and converge on manifolds with positive sectional curvature, while manifolds with zero sectional curvature are locally isometric to Euclidean space. In line with Zhang & Sra (2016); Wang et al. (2021), we primarily explore Hadamard manifolds, which are simply connected manifolds with non-positive curvature that admit a unique global minimizing geodesic between any pair of points. A set $\mathcal{K} \subseteq \mathcal{M}$ is geodesically convex (gsc-convex) if it includes the geodesic connecting $x$ and $y$ for any $x, y \in \mathcal{K}$. A $\mu$-strongly gsc-convex (or gsc-convex, when $\mu = 0$) function $f : \mathcal{M} \to \mathbb{R}$ fulfills
$$f(y) \ge f(x) + \langle \nabla f(x), \mathrm{Exp}_x^{-1}y \rangle + \frac{\mu}{2} d(x, y)^2, \qquad (1)$$
for any $x, y \in \mathcal{M}$, where $\nabla f(x) \in T_x\mathcal{M}$ is the Riemannian gradient.

Optimization Oracles on Riemannian Manifolds. In this part, we introduce the concept of a separation oracle and a linear optimization oracle on Riemannian manifolds. A separation oracle, given a point $y$ not in the gsc-convex set $\mathcal{K}$, returns a (non-convex) separating hyperplane determined by a vector $g \in T_y\mathcal{M}$ that satisfies the following condition:
$$\langle \mathrm{Exp}_y^{-1}x, g \rangle > 0, \quad \forall x \in \mathcal{K}. \qquad (2)$$
Even with the non-convexity of the separating hyperplane, for certain gsc-convex sets like $\mathcal{K} = \{x \mid \max_{1 \le i \le m} h_i(x) \le 0\}$, where each $h_i(x)$ is gsc-convex, a separation oracle can be efficiently implemented by Lemma 18 and Remark 4.² On the other hand, a linear optimization oracle is responsible for solving the following problem:
$$\mathrm{argmin}_{x \in \mathcal{K}} \langle g, \mathrm{Exp}_{x_0}^{-1}x \rangle$$
on a gsc-convex set $\mathcal{K}$, where $x_0 \in \mathcal{K}$ and $g \in T_{x_0}\mathcal{M}$. Although this objective is not gsc-convex, it can still be solved in closed form for certain problems. Examples include computing the geometric mean and the Bures-Wasserstein barycenter on the manifold of SPD matrices (Weber & Sra, 2022b).

In this paper, we rely on a series of definitions and assumptions, which we introduce here for clarity and reference in subsequent sections.

Assumption 1. The manifold $\mathcal{M}$ is Hadamard with sectional curvature bounded below by $\kappa \le 0$, so the sectional curvature of $\mathcal{M}$ lies in the interval $[\kappa, 0]$.

Assumption 2. The manifold $\mathcal{M}$ is a homogeneous Hadamard manifold with sectional curvature bounded below by $\kappa$. For every $t \in [T]$, the inequality $|f_t(x)| \le M$ holds. For the bandit setting with two-point feedback, we additionally require $\mathcal{M}$ to be symmetric.

Assumption 3. The set $\mathcal{K} \subseteq \mathcal{M}$ is a gsc-convex decision set and satisfies $B_p(r) \subseteq \mathcal{K} \subseteq B_p(R)$. Here, $B_p(r)$ represents the geodesic ball centered at $p \in \mathcal{K}$ with radius $r$.

Assumption 4. For every $t \in [T]$ and $x \in B_p(R)$, the function $f_t(x)$ is gsc-convex (or strongly gsc-convex), and the norm of its gradient is bounded by $G$, i.e., $\|\nabla f_t(x)\| \le G$.

Let us make two important comments. First, the homogeneity and the symmetry of $\mathcal{M}$ allow us to employ the unbiased estimator presented in Wang et al. (2023) for the bandit setting. It should be noted that homogeneous and symmetric Hadamard manifolds include two of the most commonly used ones: the hyperbolic space and the manifold of SPD matrices. Second, the projection onto a geodesic ball, denoted by $\Pi_{B_p(R)}(\cdot)$, is considered projection-free, as it can be computed to $\epsilon$ precision using $\log(1/\epsilon)$ bisections.³

²This fact is well-known in Euclidean space (Grötschel et al., 2012), but it appears we are the first to observe its counterpart on manifolds.
³A comprehensive explanation of this observation can be found in Remark 5.
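To make the preceding primitives concrete, the following sketch implements the exponential map, inverse exponential map, and Riemannian distance for the hyperboloid (Lorentz) model of hyperbolic space with curvature $-1$, one of the homogeneous and symmetric Hadamard manifolds mentioned above. This is an illustrative sketch only; the choice of model, the function names, and the numerical sanity check are ours and are not part of the paper's algorithms.

```python
import numpy as np

# Hyperboloid (Lorentz) model of hyperbolic space with curvature -1:
# M = {x in R^{n+1} : <x, x>_L = -1, x_0 > 0}, with the Minkowski product below.

def minkowski(u, v):
    """Minkowski inner product <u, v>_L = -u_0 v_0 + sum_{i >= 1} u_i v_i."""
    return -u[0] * v[0] + np.dot(u[1:], v[1:])

def dist(x, y):
    """Riemannian distance d(x, y) = arccosh(-<x, y>_L)."""
    return np.arccosh(np.clip(-minkowski(x, y), 1.0, None))

def exp_map(x, v):
    """Exp_x(v): follow the geodesic leaving x with initial velocity v (v tangent at x)."""
    nrm = np.sqrt(max(minkowski(v, v), 0.0))   # Riemannian norm of a tangent vector
    if nrm < 1e-12:
        return x
    return np.cosh(nrm) * x + np.sinh(nrm) * v / nrm

def log_map(x, y):
    """Exp_x^{-1}(y): the tangent vector at x pointing to y, with norm d(x, y)."""
    alpha = -minkowski(x, y)
    d = np.arccosh(np.clip(alpha, 1.0, None))
    u = y - alpha * x                          # component of y orthogonal to x at x
    nrm = np.sqrt(max(minkowski(u, u), 0.0))
    return d * u / nrm if nrm > 1e-12 else np.zeros_like(x)

# Sanity check: d(x, y) equals the norm of Exp_x^{-1}(y).
x = np.array([1.0, 0.0, 0.0])                       # base point
y = exp_map(x, np.array([0.0, 0.3, 0.4]))           # another point on the manifold
v = log_map(x, y)
assert abs(dist(x, y) - np.sqrt(minkowski(v, v))) < 1e-8
```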
Definition 1. We define a geometric constant $\zeta$ as $\zeta := 2R\sqrt{|\kappa|}\coth(2R\sqrt{|\kappa|})$.

Definition 2. Fixing $p \in \mathcal{K}$, for any $c \in (0, \infty)$ we define $c\mathcal{K} = \{\mathrm{Exp}_p(c\,\mathrm{Exp}_p^{-1}x) \mid x \in \mathcal{K}\}$.

Definition 3. We call $\tilde{y} \in \mathcal{M}$ an infeasible projection of $y \in \mathcal{M}$ onto a simply connected closed set $\tilde{\mathcal{K}} \subseteq \mathcal{M}$ if for every point $z \in \tilde{\mathcal{K}}$ the inequality $d(\tilde{y}, z) \le d(y, z)$ holds. We define $O_{IP}(\tilde{\mathcal{K}}, y)$ as an infeasible projection oracle for $\tilde{\mathcal{K}}$ which, given any $y$, returns an infeasible projection of $y$ onto $\tilde{\mathcal{K}}$.

We note that in Euclidean space, where the sectional curvature is zero everywhere, we have $\zeta = \lim_{\kappa \to 0} 2R\sqrt{|\kappa|}\coth(2R\sqrt{|\kappa|}) = 1$. We also observe that the definition of the infeasible projection in Garber & Kretzu (2022) requires $\tilde{\mathcal{K}}$ to be convex, which is in fact unnecessary. This distinction is essential because, in the case of separation-oracle-based OCO, we construct an infeasible projection oracle onto $\tilde{\mathcal{K}} = (1-\delta)\mathcal{K}$, which may be non-convex (Theorem 6).

4 Warm-up: the High-level Idea

We briefly illustrate the overarching strategy for achieving regret guarantees. Our basic algorithm is Algorithm 1, which generates by R-OGD a sequence $\{y_t\}_{t=1}^T$ that does not necessarily fall within the feasible set. A key insight is that we can build an infeasible projection oracle using either a separation oracle or a linear optimization oracle, resulting in a sequence $\{\tilde{y}_t\}_{t=1}^T$ that exhibits a desirable regret guarantee (as shown in Lemma 1). The design of the infeasible projection oracle rests on a straightforward fact: whenever $y_t$ deviates significantly from $\mathcal{K}$, we can call upon either oracle to produce a descent direction and then apply Lemma 2 to gauge the progress. Additional error terms, arising from the fact that $\tilde{y}_t$ may not necessarily lie in $\mathcal{K}$, can be quantified by leveraging the boundedness of the gradient in Assumption 4.

Algorithm 1: Infeasible Riemannian OGD
Data: horizon $T$, feasible set $\mathcal{K}$, step-sizes $\{\eta_t\}_{t=1}^T$, infeasible projection oracle $O_{IP}(\tilde{\mathcal{K}}, \cdot)$.
for $t = 1, \dots, T$ do
    Play $\tilde{y}_t$, and observe $\nabla f_t(\tilde{y}_t)$
    Update $y_{t+1} = \mathrm{Exp}_{\tilde{y}_t}(-\eta_t \nabla f_t(\tilde{y}_t))$, and set $\tilde{y}_{t+1} = O_{IP}(\tilde{\mathcal{K}}, y_{t+1})$
end

We have the following guarantee for Algorithm 1.

Lemma 1. (Proof in Appendix A.1) Assume $\tilde{y}_t \in B_p(R)$, let $\nabla_t = \nabla f_t(\tilde{y}_t)$, and let $\mathcal{K} \subseteq B_p(R)$ be a gsc-convex subset of $\mathcal{M}$. Consider $\tilde{\mathcal{K}} \subseteq \mathcal{K}$ as a simply connected and compact set, and let $O_{IP}(\tilde{\mathcal{K}}, \cdot)$ be an infeasible projection oracle as in Definition 3.
1) Suppose all losses are gsc-convex on $B_p(R)$. Fix some $\eta > 0$ and let $\eta_t = \eta$ for all $t \ge 1$. Algorithm 1 guarantees that the adaptive regret is upper bounded by
$$\forall I = [s, e] \subseteq [T]: \quad \sum_{t=s}^{e} f_t(\tilde{y}_t) - \min_{x_I \in \tilde{\mathcal{K}}} \sum_{t=s}^{e} f_t(x_I) \le \frac{d(\tilde{y}_s, x_I)^2}{2\eta} + \frac{\eta\zeta \sum_{t=s}^{e} \|\nabla_t\|^2}{2},$$
where $x_I$ denotes the minimizer on the left-hand side.
2) Suppose all losses are $\alpha$-strongly gsc-convex on $B_p(R)$ for some $\alpha > 0$. Let $\eta_t = \frac{1}{\alpha t}$ for all $t \ge 1$. Algorithm 1 guarantees that the static regret is upper bounded by
$$\sum_{t=1}^{T} f_t(\tilde{y}_t) - \min_{x \in \tilde{\mathcal{K}}} \sum_{t=1}^{T} f_t(x) \le \frac{\zeta}{2} \sum_{t=1}^{T} \eta_t \|\nabla_t\|^2 = \frac{\zeta}{2\alpha} \sum_{t=1}^{T} \frac{\|\nabla_t\|^2}{t}.$$

Remark 1. To apply Lemma 1, we need to ensure that $\tilde{y}_t \in B_p(R)$ for every $t \in [T]$. In the case of a separation oracle, we have $\tilde{\mathcal{K}} = (1-\delta)\mathcal{K}$ for some $\delta \in (0, 1)$, and $\tilde{y}_t \in \mathcal{K} \subseteq B_p(R)$ by Lemma 4. With a linear optimization oracle, we have $\tilde{\mathcal{K}} = \mathcal{K}$, and $\tilde{y}_t \in B_p(R)$ is guaranteed by Lemma 6.
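The sketch below mirrors the template of Algorithm 1 in Python. The manifold operations and the infeasible projection oracle are passed in as callables; in particular, the hypothetical infeasible_proj argument plays the role of $O_{IP}(\tilde{\mathcal{K}}, \cdot)$, which Sections 5 and 6 show how to build from a separation or a linear optimization oracle. This is a minimal illustration under the assumption that gradients are observed in full information.

```python
def infeasible_rogd(T, grad, exp_map, infeasible_proj, y1, step_sizes):
    """Template of Algorithm 1 (Infeasible Riemannian OGD).

    grad(t, y)         -> Riemannian gradient of f_t at y (tangent vector at y, as an array)
    exp_map(y, v)      -> Exp_y(v)
    infeasible_proj(y) -> a point y_tilde with d(y_tilde, z) <= d(y, z) for all z in K_tilde
    y1                 -> starting point, assumed to lie in K_tilde
    step_sizes         -> list of eta_t for t = 1, ..., T
    """
    y_tilde = y1
    played = []
    for t in range(1, T + 1):
        played.append(y_tilde)                 # play y_tilde_t, then observe grad f_t(y_tilde_t)
        g = grad(t, y_tilde)
        y_next = exp_map(y_tilde, -step_sizes[t - 1] * g)   # R-OGD step; may leave the set
        y_tilde = infeasible_proj(y_next)      # pull back via the infeasible projection oracle
    return played
```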
Lemma 2. (Proof in Appendix A.2) Consider $\tilde{\mathcal{K}} \subseteq \mathcal{K} \subseteq B_p(R)$ as a simply connected and compact subset of $\mathcal{M}$. If $y \notin \tilde{\mathcal{K}}$ and $g \in T_y\mathcal{M}$ satisfies $\langle \mathrm{Exp}_y^{-1}z, g \rangle \ge Q$ for every $z \in \tilde{\mathcal{K}}$, where $Q > 0$, then consider $\tilde{y} = \mathrm{Exp}_y(\gamma g)$. For $\gamma = \frac{Q}{\zeta C^2}$ and $\|g\| \le C$, and assuming $d(y, z) \le 2R$, we have
$$d(\tilde{y}, z)^2 \le d(y, z)^2 - \frac{Q^2}{\zeta C^2}.$$
Unlike Garber & Kretzu (2022), in Lemma 2 we also do not require $\tilde{\mathcal{K}}$ to be gsc-convex.

5 Riemannian OCO with a Separation Oracle

In this section, we show how to use a separation oracle to construct an infeasible projection oracle and achieve sublinear regret guarantees. We note that we rely on an infeasible projection oracle onto $(1-\delta)\mathcal{K}$ rather than directly onto $\mathcal{K}$. While we have a separation oracle guaranteeing $\langle \mathrm{Exp}_y^{-1}x, g \rangle > 0$ for all $x \in \mathcal{K}$, applying Lemma 2 to this separating hyperplane may yield only minuscule progress, given that $Q$ can be arbitrarily small. Consequently, achieving sublinear regret with only $O(T)$ oracle calls becomes unfeasible. In contrast, constructing an infeasible projection onto $(1-\delta)\mathcal{K}$ always ensures meaningful progress, as quantified by Lemmas 3 and 4.

Algorithm 2: Infeasible Projection onto $(1-\delta)\mathcal{K}$ with a Riemannian Separation Oracle
Data: feasible gsc-convex set $\mathcal{K}$, radius $r$, squeeze parameter $\delta$, initial point $y_0$.
$y_1 = \Pi_{B_p(R)}y_0$
for $i = 1, 2, \dots$ do
    if $y_i \notin \mathcal{K}$ then
        $SO_{\mathcal{K}}$ returns $g_i$ satisfying $\langle \mathrm{Exp}_{y_i}^{-1}x, g_i \rangle > 0$ for all $x \in \mathcal{K}$
        $y_{i+1} = \mathrm{Exp}_{y_i}(\gamma_i g_i)$, where $\gamma_i = \frac{\delta\tilde{r}}{\zeta\|g_i\|}$ and $\tilde{r}$ is defined in Equation (3)
    else
        return $\tilde{y} = y_i$
    end
end

Lemma 3. (Proof in Appendix B.1) Let $y \in B_p(R) \setminus \mathcal{K}$ and let $g \in T_y\mathcal{M}$ be the output of the separation oracle for $y$. Then, under Assumptions 1 and 3, we have $\langle \mathrm{Exp}_y^{-1}z, g \rangle > \delta\tilde{r}\|g\|$ for any $z \in (1-\delta)\mathcal{K}$, where
$$\tilde{r} := \frac{\sqrt{|\kappa|}(2R+r)}{\sinh(\sqrt{|\kappa|}(2R+r))} \cdot \frac{\sqrt{|\kappa|}(R+r)}{\sinh(\sqrt{|\kappa|}(R+r))} \, r. \qquad (3)$$

In Euclidean space, one can establish that $\langle z - y, g \rangle > \delta r \|g\|$ (Garber & Kretzu, 2022, Lemma 11). However, as indicated in Lemma 3, the result on manifolds is significantly worse with respect to $R$, given the exponential nature of $\sinh$. It is an interesting line of inquiry to explore whether this dependence is unavoidable.⁴

⁴An exponential dependence on the diameter is common on manifolds with strictly negative curvature. For instance, the volume of a geodesic ball grows exponentially with its radius in hyperbolic space, whereas this dependence is only polynomial in Euclidean space. This property has been leveraged to construct lower bounds for convex optimization on manifolds (Criscitiello & Boumal, 2022).

Based on Lemma 3, the number of calls to the separation oracle needed to implement an infeasible projection oracle onto $(1-\delta)\mathcal{K}$ is bounded in Lemma 4.

Lemma 4. (Proof in Appendix B.2) Under Assumptions 1 and 3, let $0 < \delta < 1$ and set $\gamma_i = \frac{\delta\tilde{r}}{\zeta\|g_i\|}$. Algorithm 2 executes at most $\frac{\zeta\left(d(y_0, (1-\delta)\mathcal{K})^2 - d(\tilde{y}, (1-\delta)\mathcal{K})^2\right)}{\delta^2\tilde{r}^2} + 1$ iterations and returns $\tilde{y} \in \mathcal{K}$ such that $d(\tilde{y}, z)^2 \le d(y_0, z)^2$ holds for any $z \in (1-\delta)\mathcal{K}$.

In the full information setting, infeasible R-OGD with a separation oracle is given in Algorithm 3.

Algorithm 3: Infeasible R-OGD with a separation oracle
Data: feasible gsc-convex set $\mathcal{K}$, radius $r$, step-size $\eta$ and squeeze parameter $\delta$.
$\tilde{y}_1 = p \in (1-\delta)\mathcal{K}$
for $t = 1, \dots, T$ do
    Play $\tilde{y}_t$ and receive $\nabla f_t(\tilde{y}_t)$
    $y_{t+1} = \mathrm{Exp}_{\tilde{y}_t}(-\eta\nabla f_t(\tilde{y}_t))$
    Update $\tilde{y}_{t+1}$ as the output of Algorithm 2 with set $\mathcal{K}$, radius $r$, squeeze parameter $\delta$ and initial point $y_{t+1}$.
end

We can show the following regret guarantee for Algorithm 3.

Theorem 1. (Proof in Appendix B.3) Under Assumptions 1, 3 and 4, set $\eta = \frac{2R}{G\sqrt{\zeta T}}$ and $\delta = \frac{1}{2\sqrt{T}}$. Then the adaptive regret of Algorithm 3 is upper bounded by
$$\sup_{[s,e] \subseteq [T]} \Big\{ \sum_{t=s}^{e} f_t(\tilde{y}_t) - \min_{x_I \in \mathcal{K}} \sum_{t=s}^{e} f_t(x_I) \Big\} \le \frac{5}{2} GR\sqrt{\zeta T},$$
and the number of calls to the separation oracle is $O(T)$.

Moving on, we demonstrate how to achieve a sublinear regret guarantee in the bandit convex optimization setting. A major challenge is that, while in Euclidean space we can construct a separation oracle for $(1-\delta)\mathcal{K}$ using the separation oracle for $\mathcal{K}$ (Garber & Kretzu, 2022, Lemma 11), on Hadamard manifolds $(1-\delta)\mathcal{K}$ can even be non-convex (Theorem 6), and thus a separation oracle for $(1-\delta)\mathcal{K}$ may not exist.
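Before turning to the bandit setting, the sketch below illustrates Algorithm 2, paired with the natural separation oracle for a set of the form $\mathcal{K} = \{x \mid \max_i h_i(x) \le 0\}$ with gsc-convex constraints: if $h_j(y) > 0$, then gsc-convexity gives $\langle \nabla h_j(y), \mathrm{Exp}_y^{-1}x \rangle \le h_j(x) - h_j(y) < 0$ for every $x \in \mathcal{K}$, so $g = -\nabla h_j(y)$ satisfies condition (2). This is our reading of the construction referenced via Lemma 18 and Remark 4; the code, its function names, and the iteration cap are illustrative assumptions rather than the paper's implementation.

```python
def separation_oracle(y, constraints, constraint_grads):
    """Separation oracle for K = {x : max_i h_i(x) <= 0} with gsc-convex h_i.

    If some h_j(y) > 0, returns g = -grad h_j(y); by gsc-convexity this g satisfies
    <Exp_y^{-1} x, g> > 0 for every x in K, i.e., condition (2). Returns None if y is feasible.
    """
    for h, grad_h in zip(constraints, constraint_grads):
        if h(y) > 0:
            return -grad_h(y)
    return None

def infeasible_proj_so(y0, sep_oracle, exp_map, norm, proj_ball, delta, r_tilde, zeta,
                       max_iter=100_000):
    """Sketch of Algorithm 2: infeasible projection onto (1 - delta)K via a separation oracle.

    norm(y, g)   -> Riemannian norm of the tangent vector g at y
    proj_ball(y) -> metric projection onto the geodesic ball B_p(R)
    """
    y = proj_ball(y0)                          # y_1 = Pi_{B_p(R)} y_0
    for _ in range(max_iter):                  # Lemma 4 bounds the number of iterations needed
        g = sep_oracle(y)
        if g is None:                          # y is in K: return it as the infeasible projection
            return y
        gamma = delta * r_tilde / (zeta * norm(y, g))
        y = exp_map(y, gamma * g)              # step along the separating direction
    return y
```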
For Riemannian BCO with one-point feedback, in Algorithm 4 we address this by resorting to a non-standard setting: we play $\tilde{z}_t \in \mathcal{K}$ but receive feedback at $z_t$, where $z_t$ and $\tilde{z}_t$ are nearby points. We present the algorithm and the corresponding regret guarantee in Algorithm 4 and Theorem 2.

Algorithm 4: One-point bandit convex optimization on manifolds with a separation oracle
Data: feasible gsc-convex set $\mathcal{K}$, radii $(R, r)$, step-size $\eta$, squeeze parameters $(\delta, \delta', \tau)$, $\tilde{y}_1 = p$.
for $t = 1, \dots, T$ do
    Sample $z_t \in S_{\tilde{y}_t}(\delta')$; play $\tilde{z}_t := \mathrm{Exp}_p\!\big(\frac{\mathrm{Exp}_p^{-1}z_t}{1+\tau}\big)$    // $\delta' = \frac{\sqrt{|\kappa|}(R+r)}{\sinh(\sqrt{|\kappa|}(R+r))}\tau r$
    Observe $f_t(z_t)$; $g_t = f_t(z_t)\,\frac{\mathrm{Exp}_{\tilde{y}_t}^{-1}z_t}{\|\mathrm{Exp}_{\tilde{y}_t}^{-1}z_t\|}$; $y_{t+1} = \mathrm{Exp}_{\tilde{y}_t}(-\eta g_t)$
    $\tilde{y}_{t+1} \leftarrow$ output of Algorithm 2 with $\mathcal{K}$, radius $r$, squeeze parameter $\delta$ and initial point $y_{t+1}$.
end

Theorem 2. (Proof in Appendix B.4) Under Assumptions 2, 3 and 4, set $\eta = T^{-1/2}$, $\delta = T^{-1/4}$ and $\tau = T^{-1/4}$. Then the expected adaptive regret of Algorithm 4 is upper bounded by
$$\sup_{[s,e] \subseteq [T]} \Big\{ \sum_{t=s}^{e} \mathbb{E}[f_t(\tilde{z}_t)] - \min_{x_I \in \mathcal{K}} \sum_{t=s}^{e} f_t(x_I) \Big\} = O\big(T^{3/4}\big),$$
and the number of calls to the separation oracle is $O(T)$.

Remark 2. An acute reader may notice a discrepancy between the step-size $\eta = O(T^{-1/2})$ in our work and the step-size $\eta = O(T^{-3/4})$ in Garber & Kretzu (2022, Theorem 15). It is important to highlight that our $\eta$ is equivalent to $n\eta/\delta$ in the notation of Garber & Kretzu (2022), ensuring that the parameters are in fact consistent. This reasoning is also applicable to Theorem 3.

Interestingly, in the two-point feedback setting we can get rid of the non-standard setting by adapting the algorithm. In Algorithm 5, we maintain the loop invariants $x_t \in \beta\mathcal{K}$ and $y_t \in \mathcal{K}$. Each round involves constructing an unbiased gradient estimator $g_t$ at $x_t$ using the two-point feedback. Subsequently, we map $g_t$ to the tangent space of $y_t$ and execute online gradient descent in that space. A key advantage of this parallel transport is the ability to employ the separation oracle on $\mathcal{K}$ to construct an infeasible projection onto $(1-\delta)\mathcal{K}$. Upon a meticulous analysis, we observe that, at each round, an additional distortion arises from this parallel transport:
$$\frac{S_{\delta'}}{V_{\delta'}} \mathbb{E}\big[\langle g_t,\ \Gamma_{y_t}^{x_t}\mathrm{Exp}_{y_t}^{-1}x - \mathrm{Exp}_{x_t}^{-1}x \rangle\big] \le \frac{S_{\delta'}}{V_{\delta'}} \mathbb{E}[\|g_t\|] \cdot O(\delta') = O(\delta'),$$
where $\Gamma_{y}^{x}$ denotes parallel transport from $T_y\mathcal{M}$ to $T_x\mathcal{M}$ along the connecting geodesic. This distortion accumulates to $O(\sqrt{T})$ over the horizon when choosing $\delta' = O\big(\frac{1}{\sqrt{T}}\big)$, ensuring that the regret bound remains unaffected. The regret guarantee of Algorithm 5 is detailed in Theorem 3.

Algorithm 5: Two-point bandit convex optimization on manifolds with a separation oracle
Data: feasible gsc-convex set $\mathcal{K}$, radii $(R, r)$, step-size $\eta$, parameters $(\delta, \delta', \beta)$: $\delta \in (0, 1)$, $\beta \in (0, 1)$, $\delta' = (1-\beta)\frac{\sqrt{|\kappa|}(R+r)}{\sinh(\sqrt{|\kappa|}(R+r))}r$.
Initialize $x_1 \in \beta\mathcal{K}$, $y_1 = \mathrm{Exp}_p\!\big(\frac{\mathrm{Exp}_p^{-1}x_1}{\beta}\big)$
for $t = 1, \dots, T$ do
    Sample $z_t \in S_{x_t}(\delta')$
    Play $z_t$ and its antipodal point $z_t'$
    Observe $f_t$ at $z_t$ and $z_t'$
    Construct $g_t$ by $g_t = \frac{1}{2}\big(f_t(z_t) - f_t(z_t')\big)\frac{\mathrm{Exp}_{x_t}^{-1}z_t}{\|\mathrm{Exp}_{x_t}^{-1}z_t\|}$
    $y'_{t+1} = \mathrm{Exp}_{y_t}(-\eta\,\Gamma_{x_t}^{y_t}g_t)$
    $y_{t+1} \leftarrow$ output of Algorithm 2 with $\mathcal{K}$, radius $r$, squeeze parameter $\delta$ and initial point $y'_{t+1}$
    $x_{t+1} = \mathrm{Exp}_p(\beta\,\mathrm{Exp}_p^{-1}y_{t+1})$    // $x_{t+1} \in \beta\mathcal{K}$
end

Theorem 3. (Proof in Appendix B.5) Under Assumptions 2, 3 and 4, set $\eta = 1$ and $\delta = 1 - \beta = T^{-1/2}$. Then the expected adaptive regret of Algorithm 5 is upper bounded by
$$\sup_{[s,e] \subseteq [T]} \Big\{ \sum_{t=s}^{e} \mathbb{E}[f_t(z_t)] - \min_{x \in \mathcal{K}} \sum_{t=s}^{e} f_t(x) \Big\} = O\big(\sqrt{T}\big),$$
and the number of calls to the separation oracle is $O(T)$.
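The following sketch isolates the two-point gradient estimator used in Algorithm 5: it draws a uniform tangent direction at $x_t$, queries the loss at the two antipodal points on $S_{x_t}(\delta')$, and forms $g_t$ exactly as in the algorithm. The sampling helper, the orthonormal-frame argument, and the function names are our assumptions; the scaling factors $S_{\delta'}/V_{\delta'}$ appear only in the analysis and are therefore not included here.

```python
import numpy as np

def two_point_estimator(f_t, x_t, delta_prime, exp_map, log_map, frame, rng):
    """Sketch of the two-point estimator from Algorithm 5.

    frame(x) -> matrix whose columns form an orthonormal basis of T_x M (w.r.t. the metric),
                so a Euclidean unit vector maps to a unit tangent vector at x.
    Returns (g_t, z_t, z_t_prime): the estimator and the two queried points.
    """
    basis = frame(x_t)
    u = rng.standard_normal(basis.shape[1])
    u = u / np.linalg.norm(u)                       # uniform direction on the unit sphere
    v = basis @ u                                   # unit tangent vector at x_t
    z_t = exp_map(x_t, delta_prime * v)             # point on the sphere S_{x_t}(delta')
    z_t_prime = exp_map(x_t, -delta_prime * v)      # its antipodal point
    w = log_map(x_t, z_t)                           # Exp_{x_t}^{-1} z_t
    # ||Exp_{x_t}^{-1} z_t|| = d(x_t, z_t) = delta', so dividing by delta' normalizes w.
    g_t = 0.5 * (f_t(z_t) - f_t(z_t_prime)) * w / delta_prime
    return g_t, z_t, z_t_prime
```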
6 Riemannian OCO with a Linear Optimization Oracle

In this section, we focus on performing projection-free OCO on Riemannian manifolds utilizing a linear optimization oracle. The linear oracle is invoked inside the Riemannian Frank-Wolfe (RFW) algorithm (Weber & Sra, 2022b). We outline how to obtain a separating hyperplane via RFW, as detailed in Algorithm 6 and Lemma 5.

Algorithm 6: Separating Hyperplane via RFW
Data: feasible gsc-convex set $\mathcal{K}$, error tolerance $\epsilon$, initial point $x_1 \in \mathcal{K}$, target point $y$.
for $i = 1, 2, \dots$ do
    $v_i = \mathrm{argmin}_{x \in \mathcal{K}} \langle -\mathrm{Exp}_{x_i}^{-1}y, \mathrm{Exp}_{x_i}^{-1}x \rangle$
    if $\langle \mathrm{Exp}_{x_i}^{-1}y, \mathrm{Exp}_{x_i}^{-1}v_i \rangle \le \epsilon$ or $d(x_i, y)^2 \le 3\epsilon$ then
        return $\bar{x} = x_i$
    end
    $\sigma_i = \mathrm{argmin}_{\sigma \in [0,1]}\ d\big(y, \mathrm{Exp}_{x_i}(\sigma\,\mathrm{Exp}_{x_i}^{-1}v_i)\big)^2$
    $x_{i+1} = \mathrm{Exp}_{x_i}(\sigma_i\,\mathrm{Exp}_{x_i}^{-1}v_i)$
end

Lemma 5. (Proof in Appendix C.1) Under Assumptions 1 and 3, for any $y \in B_p(R)$, Algorithm 6 terminates after at most $\lceil 27\zeta R^2/\epsilon \rceil - 2$ iterations and returns $\bar{x} \in \mathcal{K}$ satisfying:
1) $d(\bar{x}, y)^2 \le d(x_1, y)^2$.
2) At least one of the following holds: $d(\bar{x}, y)^2 \le 3\epsilon$, or $\langle \mathrm{Exp}_y^{-1}z, \mathrm{Exp}_y^{-1}\bar{x} \rangle \ge 2\epsilon$ for all $z \in \mathcal{K}$.
3) If $d(y, \mathcal{K})^2 \le \epsilon$ then $d(\bar{x}, y)^2 \le 3\epsilon$.

Remark 3. Note that the second item of Lemma 5 provides a separating hyperplane between $y$ and $\mathcal{K}$. One of the challenges in its proof is to find an analog of the Euclidean identity $\langle z - y, \bar{x} - y \rangle = \|\bar{x} - y\|^2 - \langle z - \bar{x}, y - \bar{x} \rangle$ on manifolds, which initially appears to be a daunting task. However, a clever application of Lemma 30 (Appendix E) provides a solution.

Algorithm 7: Closer Infeasible Projection via LOO
Data: feasible gsc-convex set $\mathcal{K}$, $x_0 \in \mathcal{K}$, initial point $y_0$, error tolerance $\epsilon$, step size $\gamma$.
$y_1 = \Pi_{B_p(R)}y_0$
if $d(x_0, y_0)^2 \le 3\epsilon$ then
    return $x = x_0$, $y = y_1$
end
for $i = 1, 2, \dots$ do
    $x_i \leftarrow$ output of Algorithm 6 with set $\mathcal{K}$, initial point $x_{i-1}$, target point $y_i$ and tolerance $\epsilon$
    if $d(x_i, y_i)^2 > 3\epsilon$ then
        $y_{i+1} = \mathrm{Exp}_{x_i}\big((1-\gamma)\mathrm{Exp}_{x_i}^{-1}y_i\big)$
    else
        return $x = x_i$, $y = y_i$
    end
end

Algorithm 7 demonstrates how to "pull" an initial point $y_0$ towards $\mathcal{K}$ using RFW, while Lemma 6 verifies that the output of Algorithm 7 is indeed an infeasible projection onto $\mathcal{K}$.

Lemma 6. (Proof in Appendix C.2) Under Assumptions 1 and 3, fix $\epsilon > 0$. Setting $\gamma = \frac{2\epsilon}{d(x_0, y_0)^2}$, Algorithm 7 stops after at most $\max\!\big\{ \frac{\zeta d(x_0, y_0)^2 (d(x_0, y_0)^2 - \epsilon)}{4\epsilon^2} + 1,\ 1 \big\}$ iterations, and returns $(x, y) \in \mathcal{K} \times B_p(R)$ such that $d(y, z)^2 \le d(y_0, z)^2$ for all $z \in \mathcal{K}$ and $d(x, y)^2 \le 3\epsilon$.

Given that Lemma 6 provides an infeasible projection oracle, we could combine Algorithms 1 and 7 to achieve sublinear regret by setting the error tolerance to $\epsilon = o(1)$. However, RFW requires $\Theta(1/\epsilon) = \omega(1)$ iterations in the worst case (Lemma 5), and the resulting algorithm would necessitate $\omega(T)$ calls to the linear optimization oracle. Garber & Kretzu (2022) use a blocking technique to address this challenge: the time horizon $T$ is divided into $T/B$ blocks of size $B$, and the infeasible projection is computed only once per block. We demonstrate that this technique can be implemented on Riemannian manifolds in Algorithm 8 below. We present the regret guarantees for gsc-convex and strongly gsc-convex losses in Theorem 4 and Theorem 5, respectively.
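Before presenting the block algorithm, the sketch below shows how Algorithms 6 and 7 fit together: a Riemannian Frank-Wolfe loop over the linear optimization oracle on $f(x) = d(x, y)^2/2$ either certifies that $y$ is within distance $\sqrt{3\epsilon}$ of the returned feasible point or stops with the separating point of Lemma 5, and the outer loop of Algorithm 7 then pulls $y$ toward $\mathcal{K}$. The function signatures, the grid-based line search, and the iteration caps are our simplifying assumptions; the exact stopping rules follow Lemmas 5 and 6.

```python
def line_search(x, v, y, exp_map, log_map, dist, grid=100):
    """Approximate argmin over sigma in [0, 1] of d(y, Exp_x(sigma Exp_x^{-1} v))^2."""
    step = log_map(x, v)
    best_s, best_d = 0.0, dist(x, y)
    for k in range(1, grid + 1):
        s = k / grid
        d = dist(exp_map(x, s * step), y)
        if d < best_d:
            best_s, best_d = s, d
    return best_s

def rfw_separating_point(y, x1, lin_opt, inner, exp_map, log_map, dist, eps, max_iter=100_000):
    """Sketch of Algorithm 6: Riemannian Frank-Wolfe applied to f(x) = d(x, y)^2 / 2.

    lin_opt(x, g)  -> argmin over u in K of <g, Exp_x^{-1} u>  (the linear optimization oracle)
    inner(x, u, v) -> Riemannian inner product of tangent vectors u, v at x
    """
    x = x1
    for _ in range(max_iter):
        v = lin_opt(x, -log_map(x, y))
        gap = inner(x, log_map(x, y), log_map(x, v))
        if gap <= eps or dist(x, y) ** 2 <= 3 * eps:
            return x
        sigma = line_search(x, v, y, exp_map, log_map, dist)
        x = exp_map(x, sigma * log_map(x, v))
    return x

def infeasible_proj_loo(y0, x0, lin_opt, inner, exp_map, log_map, dist, proj_ball, eps,
                        max_outer=100_000):
    """Sketch of Algorithm 7: pull y toward K with repeated calls to the RFW subroutine."""
    y = proj_ball(y0)
    if dist(x0, y0) ** 2 <= 3 * eps:
        return x0, y
    gamma = 2 * eps / dist(x0, y0) ** 2            # step size from Lemma 6
    x = x0
    for _ in range(max_outer):                     # Lemma 6 bounds the number of passes
        x = rfw_separating_point(y, x, lin_opt, inner, exp_map, log_map, dist, eps)
        if dist(x, y) ** 2 <= 3 * eps:
            break
        y = exp_map(x, (1 - gamma) * log_map(x, y))   # move y toward x
    return x, y
```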
Algorithm 8: Block OGD on manifolds with a linear optimization oracle
Data: horizon $T$, feasible gsc-convex set $\mathcal{K}$, block size $B$, step-sizes $\{\eta_i\}_{i=1}^{T/B}$, error tolerances $\{\epsilon_i\}_{i=1}^{T/B}$.
Choose $x_0, x_1 \in \mathcal{K}$
Set $\tilde{y}_0 = x_0$, $y_1 = x_0$, $\tilde{y}_1 = x_1$, $\bar{\nabla}_1 = 0 \in T_{\tilde{y}_0}\mathcal{M}$
for $t = 1, \dots, B$ do
    Play $x_0$ and observe $f_t$
    $\bar{\nabla}_1 = \bar{\nabla}_1 + \nabla f_t(\tilde{y}_0)$
end
$y_{B+1} = \mathrm{Exp}_{\tilde{y}_0}(-\eta_1\bar{\nabla}_1)$
for $i = 2, \dots, T/B$ do
    $(x_i, \tilde{y}_i) \in \mathcal{K} \times B_p(R) \leftarrow$ output of Algorithm 7 with input $\mathcal{K}$, $x_{i-2}$, $y_{(i-1)B+1}$ and $\epsilon_i$
    $\tilde{y}_{(i-1)B+1} = \tilde{y}_{i-1}$; $\bar{\nabla}_i = 0 \in T_{\tilde{y}_{i-1}}\mathcal{M}$
    for $s = 1, \dots, B$ do
        Play $x_{i-1}$ and observe $f_{(i-1)B+s}$
        $\bar{\nabla}_i = \bar{\nabla}_i + \nabla f_{(i-1)B+s}(\tilde{y}_{i-1})$
    end
    $y_{iB+1} = \mathrm{Exp}_{\tilde{y}_{i-1}}(-\eta_i\bar{\nabla}_i)$
end

Theorem 4. (Proof in Appendix C.3) Under Assumptions 1, 3 and 4, fix $\eta_i = \eta$ and $\epsilon_i = \epsilon = 60R^2\zeta^2T^{-1/2}$ for every $i \in \{1, \dots, T/B\}$, with the constant step size $\eta$ tuned as in Appendix C.3, and set $B = 5T^{1/2}$. Then the adaptive regret of Algorithm 8 for gsc-convex losses satisfies
$$\sup_{I=[s,e] \subseteq [T]} \Big\{ \sum_{t=s}^{e} f_t(x_t) - \min_{x_I \in \mathcal{K}} \sum_{t=s}^{e} f_t(x_I) \Big\} = O\big(\zeta^2 GR\,T^{3/4}\big),$$
and the number of calls to the linear optimization oracle is bounded by $T$.

Theorem 5. (Proof in Appendix C.4) Under Assumptions 1, 3 and 4, suppose all losses $\{f_t\}_{t=1}^T$ are $\alpha$-strongly gsc-convex for some known $\alpha > 0$. Choose $\epsilon_i = \big(\frac{20G}{\alpha(i+3)}\big)^2\zeta$ and $\eta_i = \frac{2}{\alpha i B}$ for every $i \in \{1, \dots, T/B\}$, set $B = \Theta(T^{2/3})$ as in Appendix C.4, and assume $T \ge 3B$. Then the regret of Algorithm 8 is bounded by
$$\sum_{t=1}^{T} f_t(x_t) - \min_{x \in \mathcal{K}} \sum_{t=1}^{T} f_t(x) = O\Big(\zeta\,\big(G^4R^2/\alpha\big)^{1/3}\,T^{2/3}\log T\Big),$$
and the total number of calls to the linear optimization oracle is bounded by $\zeta T$.

7 Conclusion and Perspective

This paper pioneers the exploration of projection-free online optimization on Riemannian manifolds. The primary technical challenges originate from the non-convex nature of the Riemannian hyperplane and the varying metric. These challenges are tackled effectively with the aid of the Jacobi field comparison, enabling us to establish a spectrum of sub-linear regret guarantees. Interested readers may question the difficulty of generalizing these techniques from Hadamard manifolds to CAT($\kappa$) manifolds. Some hints toward this generalization are provided in Appendix E.3.

There are several promising directions for future research. First, there is the potential to refine the regret bounds, particularly for strongly gsc-convex losses via a separation oracle. In the context of the separation oracle, reducing the dependence of the number of oracle calls on the diameter of the decision set would be an intriguing objective. Moreover, devising an efficient method to optimize the linear optimization oracle objective, $\mathrm{argmin}_{x \in \mathcal{K}} \langle g, \mathrm{Exp}_{x_0}^{-1}x \rangle$, remains a notable open problem. This paper does not discuss the membership oracle, primarily because related work (Mhammedi, 2022a; Lu et al., 2023) heavily relies on the convexity of the Minkowski functional in Euclidean space, a property not guaranteed to hold on Hadamard manifolds (Theorem 7). However, this does not rule out the potential for executing OCO or convex optimization on manifolds using a membership oracle. Thus, uncovering alternative strategies to tackle this issue remains a compelling research question.

Acknowledgments

We would like to thank five anonymous referees for constructive comments and suggestions. We gratefully thank the AI4OPT Institute for funding, as part of NSF Award 2112533. We also acknowledge the NSF for their support through Award IIS-1910077. GW would like to acknowledge an ARC-ACO fellowship provided by Georgia Tech. We would also like to thank Xi Wang from UCAS and Andre Wibisono from Yale for helpful discussions.

References

Ahn, K. and Sra, S. From Nesterov's estimate sequence to Riemannian acceleration. In Conference on Learning Theory, pp. 84-118. PMLR, 2020.
Alimisis, F., Orvieto, A., Bécigneul, G., and Lucchi, A. A continuous-time perspective for modeling acceleration in Riemannian optimization. In International Conference on Artificial Intelligence and Statistics, pp. 1297-1307. PMLR, 2020.
Antonakopoulos, K., Belmega, E. V., and Mertikopoulos, P. Online and stochastic optimization beyond Lipschitz continuity: A Riemannian approach. In ICLR 2020 - International Conference on Learning Representations, pp. 1-20, 2020.
Bacák, M. Convex analysis and optimization in Hadamard spaces. de Gruyter, 2014. Bécigneul, G. and Ganea, O. Riemannian adaptive optimization methods. In 7th International Conference on Learning Representations, 2019. Borisenko, A. Convex sets in hadamard manifolds. Differential Geometry and its Applications, 17 (2-3):111 121, 2002. Bridson, M. R. and Haefliger, A. Metric spaces of non-positive curvature, volume 319. Springer Science & Business Media, 2013. Carson, T., Mixon, D. G., and Villar, S. Manifold optimization for k-means clustering. In 2017 International Conference on Sampling Theory and Applications (Samp TA), pp. 73 77. IEEE, 2017. Criscitiello, C. and Boumal, N. Negative curvature obstructs acceleration for strongly geodesically convex optimization, even with exact first-order oracles. In Conference on Learning Theory, pp. 496 542. PMLR, 2022. Garber, D. and Kretzu, B. New projection-free algorithms for online convex optimization with adaptive regret guarantees. ar Xiv preprint ar Xiv:2202.04721, 2022. Grötschel, M., Lovász, L., and Schrijver, A. Geometric algorithms and combinatorial optimization, volume 2. Springer Science & Business Media, 2012. Hazan, E. and Kale, S. Projection-free online learning. ar Xiv preprint ar Xiv:1206.4657, 2012. Hazan, E. and Minasyan, E. Faster projection-free online learning. In Conference on Learning Theory, pp. 1877 1893. PMLR, 2020. Hazan, E. and Seshadhri, C. Efficient learning algorithms for changing environments. In Proceedings of the 26th annual international conference on machine learning, pp. 393 400, 2009. Hazan, E. et al. Introduction to online convex optimization. Foundations and Trends in Optimization, 2(3-4):157 325, 2016. Hirai, H., Nieuwboer, H., and Walter, M. Interior-point methods on manifolds: theory and applications. ar Xiv e-prints, pp. ar Xiv 2303, 2023. Hu, Z., Wang, G., and Abernethy, J. Minimizing dynamic regret on geodesic metric spaces. ar Xiv preprint ar Xiv:2302.08652, 2023. Jaggi, M. Revisiting frank-wolfe: Projection-free sparse convex optimization. In International conference on machine learning, pp. 427 435. PMLR, 2013. Jézéquel, R., Ostrovskii, D. M., and Gaillard, P. Efficient and near-optimal online portfolio selection. ar Xiv preprint ar Xiv:2209.13932, 2022. Kowalski, O. and Vanhecke, L. Ball-homogeneous and disk-homogeneous riemannian manifolds. Mathematische Zeitschrift, 180:429 444, 1982. Kretzu, B. and Garber, D. Revisiting projection-free online learning: the strongly convex case. In International Conference on Artificial Intelligence and Statistics, pp. 3592 3600. PMLR, 2021. Kristály, A., Li, C., López-Acedo, G., and Nicolae, A. What do convexities imply on hadamard manifolds? Journal of Optimization Theory and Applications, 170:1068 1074, 2016. Lee, J. M. Riemannian manifolds: an introduction to curvature, volume 176. Springer Science & Business Media, 2006. Lee, J. M. Introduction to Riemannian manifolds, volume 2. Springer, 2018. Levy, K. and Krause, A. Projection free online learning over smooth sets. In The 22nd International Conference on Artificial Intelligence and Statistics, pp. 1458 1466. PMLR, 2019. Li, Y., Chen, X., and Li, N. Online optimal control with linear dynamics and predictions: Algorithms and regret analysis. Advances in Neural Information Processing Systems, 32, 2019. Lin, Q., Yi, H., Pang, J., Chen, M., Wierman, A., Honig, M., and Xiao, Y. Competitive online optimization under inventory constraints. 
Proceedings of the ACM on Measurement and Analysis of Computing Systems, 3(1):1 28, 2019. Lu, Z., Brukhim, N., Gradu, P., and Hazan, E. Projection-free adaptive regret with membership oracles. In International Conference on Algorithmic Learning Theory, pp. 1055 1073. PMLR, 2023. Mhammedi, Z. Efficient projection-free online convex optimization with membership oracle. In Conference on Learning Theory, pp. 5314 5390. PMLR, 2022a. Mhammedi, Z. Exploiting the curvature of feasible sets for faster projection-free online learning. ar Xiv preprint ar Xiv:2205.11470, 2022b. Molinaro, M. Online and random-order load balancing simultaneously. In Proceedings of the Twenty-Eighth Annual ACM-SIAM Symposium on Discrete Algorithms, pp. 1638 1650. SIAM, 2017. Montanari, A. and Richard, E. Non-negative principal component analysis: Message passing algorithms and sharp asymptotics. IEEE Transactions on Information Theory, 62(3):1458 1484, 2015. Petersen, P. Riemannian geometry, volume 171. Springer, 2006. Rusciano, A. A riemannian corollary of helly s theorem. ar Xiv preprint ar Xiv:1804.10738, 2018. Sakai, T. Riemannian geometry, volume 149. American Mathematical Soc., 1996. Silva Louzeiro, M., Bergmann, R., and Herzog, R. Fenchel duality and a separation theorem on hadamard manifolds. SIAM Journal on Optimization, 32(2):854 873, 2022. Udriste, C. Convex functions and optimization methods on Riemannian manifolds, volume 297. Springer Science & Business Media, 2013. Wan, Y. and Zhang, L. Projection-free online learning over strongly convex sets. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, pp. 10076 10084, 2021. Wan, Y., Tu, W.-W., and Zhang, L. Online frank-wolfe with arbitrary delays. Advances in Neural Information Processing Systems, 35:19703 19715, 2022. Wan, Y., Zhang, L., and Song, M. Improved dynamic regret for online frank-wolfe. In Neu, G. and Rosasco, L. (eds.), Proceedings of Thirty Sixth Conference on Learning Theory, volume 195 of Proceedings of Machine Learning Research, pp. 3304 3327. PMLR, 12 15 Jul 2023. URL https://proceedings.mlr.press/v195/wan23a.html. Wang, X., Li, C., and Yao, J.-C. On some basic results related to affine functions on riemannian manifolds. Journal of Optimization Theory and Applications, 170(3):783 803, 2016. Wang, X., Tu, Z., Hong, Y., Wu, Y., and Shi, G. No-regret online learning over riemannian manifolds. Advances in Neural Information Processing Systems, 34:28323 28335, 2021. Wang, X., Tu, Z., Hong, Y., Wu, Y., and Shi, G. Online optimization over riemannian manifolds. Journal of Machine Learning Research, 24(84):1 67, 2023. Weber, M. and Sra, S. Projection-free nonconvex stochastic optimization on riemannian manifolds. IMA Journal of Numerical Analysis, 42(4):3241 3271, 2022a. Weber, M. and Sra, S. Riemannian optimization via frank-wolfe methods. Mathematical Programming, pp. 1 32, 2022b. Zhang, H. and Sra, S. First-order methods for geodesically convex optimization. In Conference on Learning Theory, pp. 1617 1638. PMLR, 2016. A Omitted Proofs in Section 4 A.1 Proof of Lemma 1 Proof. Fix t, since yt+1 is an infeasible projection of yt+1 onto K, and yt+1 = Exp yt( ηt t), by Lemma 29, we have for any x K, d( yt+1, x)2 d(yt+1, x)2 ζd(yt+1, yt)2 + d( yt, x)2 2 D Exp 1 yt yt+1, Exp 1 yt x E ζη2 t t 2 + d( yt, x)2 + 2ηt D t, Exp 1 yt x E . (4) To verify that ζ represents the correct geometric distortion, we need to show d( yt, x) 2R holds for any yt Bp(R) and x K. 
Since K K Bp(R), we can demonstrate this by the triangle inequality: d( yt, x) d( yt, p) + d(p, x) 2R. Rearranging Equation (4), we have D t, Exp 1 yt x E d( yt, x)2 2ηt d( yt+1, x)2 2ηt + ζηt t 2 Fix 1 s e T and sum over [s, e], we have D t, Exp 1 yt x E d( ys, x)2 2ηt 1 2ηt 1 d( yt, x)2 + Using gsc-convexity of ft( ) and set ηt = η, we have t=s ft( yt) ft(x) d( ys, x)2 2η + ηζ Pe t=s t 2 In case all losses are α-strongly gsc-convex, using ft(y) ft(x) ft(y), Exp 1 y x αd(x,y)2 2 and set (s, e) = (1, T), we have t=1 ft( yt) ft(x) d( y1, x)2+ 2ηt 1 2ηt 1 α d( yt, x)2. Setting ηt = 1 αt, we get the guarantee for the strongly gsc-convex case. A.2 Proof of Lemma 2 Proof. By Lemma 29, d( y, z)2 ζd(y, y)2 + d(y, z)2 2 Exp 1 y y, Exp 1 y z ζγ2C2 + d(y, z)2 2 γg, Exp 1 y z ζγ2C2 + d(y, z)2 2γQ =d(y, z)2 Q2 where the first inequality relies on Lemma 29 and d(y, z) 2R, the third inequality holds because Exp 1 y z, g Q, and for the last line, we plug in γ = Q ζC2 . B Omitted Proofs in Section 5 The technical difficulty lies in constructing an infeasible projection oracle using a separation oracle. The following lemma is instrumental to the later results. B.1 Proof of Lemma 3 Proof. Let x = Expy δ r g g + Exp 1 y z , if we can show x K, then by the definition of the separation oracle, we have Exp 1 y x, g = δ r g Exp 1 y z, g > 0, which in turn implies Exp 1 y z, g > δ r g . Note that when r = 0, x = z (1 δ)K K, thus there exists positive r such that x K for any z (1 δ)K. Let r = κ(R+r) sinh( κ(R+r)) r. By Lemma 19, we have Bz(δ r) K. Thus we only need to ensure d(x, z) δ r because this implies x Bz(δ r) K. We define an admissible family of curves: Γ : [0, 1] [0, 1] M (t, s) Expy(t(Exp 1 y z + sδ rv)) where v = g g . Then Γ(1, 0) = z and Γ(1, 1) = x. Let Js(t) := Γ(t,s) s , which is a variation field of the variation Γ, and is thus a Jacobi field. We can compute Js(0) = Γ(0,s) s = 0 and by Lemma 14, Dt Js(t)|t=0 = δ rv. To apply Lemma 15, we need to reparametrize Γ to make it unit speed. Let Γ(t, s) = Γ t R(s), s R(s) = d(Γ(1, s), Γ(0, s)) = Exp 1 y z + sδ rv Exp 1 y z + δ r 2R + δr 2R + r. Then we have Js(t) = Js(R(s)t) and Dt Js(t) = R(s)Dt Js(R(s)t). Now we can apply Lemma 15: Js(1) = Js(R(s)) sinh ( κR(s)) Dt Js(0) κ =sinh ( κR(s)) Dt Js(0) κR(s) = sinh ( κR(s))δ r κR(s) We would like to have d(x, z) δ r, and it suffices to choose r such that d(x, z) Z 1 0 Js(1) ds sinh( κ(2R + r)) κ(2R + r) δ r δ r. We then find a valid r is r = κ(2R + r) sinh( κ(2R + r)) r = κ(2R + r) sinh( κ(2R + r)) κ(R + r) sinh( κ(R + r)) r. B.2 Proof of Lemma 4 Proof. We use k to denote the number of iterations in Algorithm 2, then for any i < k, we have yi / K. By Lemma 3, we have Exp 1 yi z, gi δ r gi for any z (1 δ)K. We then invoke Lemma 2 with C = gi , Q = δ r gi to get d(yi+1, z)2 d(yi, z)2 δ2 r2 holds for any z (1 δ)K. To ensure the geometric distortion ζ is valid here, we need to prove d(yi, z) 2R holds for any i 1 and z (1 δ)K, which can be guaranteed by induction. The case of i = 1 is straightforward. As y1 Bp(R), z (1 δ)K Bp(R), we have d(y1, z) d(y1, p) + d(z, p) 2R. Now assume d(yi, z) 2R holds for some i 1 and any z (1 δ)K, then by Lemma 29 and Lemma 2, d(yi+1, z)2 ζ(κ, d(yi, z)) d(yi+1, yi)2 + d(yi, z)2 2 Exp 1 yi yi+1, Exp 1 yi z ζ(κ, 2R) d(yi+1, yi)2 + d(yi, z)2 2 Exp 1 yi yi+1, Exp 1 yi z =ζd(yi+1, yi)2 + d(yi, z)2 2 Exp 1 yi yi+1, Exp 1 yi z d(yi, z)2 δ2 r2 Thus, we know d(yi, z) 2R holds for any i 1 and z (1 δ)K. By Equation (7), d(y, z)2 d(y1, z)2 for all z (1 δ)K. 
Since y1 is the projection of y0 onto Bp(R), we have d(y1, z)2 d(y0, z)2 holds for z Bp(R). Thus, we indeed have d(y, z)2 d(y0, z)2 for any z (1 δ)K. It remains to bound the number of iterations in Algorithm 2. We must be careful because, on manifolds, (1 δ)K is not guaranteed to be gsc-convex (Theorem 6). Note that argminx (1 δ)K d(yi, x)2 is consistently non-empty because (1 δ)K is a closed set. Let x i argminx (1 δ)K d(yi, x)2 and invoke Equation (7), we have d(yi+1, (1 δ)K)2 = d(yi+1, x i+1)2 d(yi+1, x i )2 d(yi, x i )2 δ2 r2 ζ = d(yi, (1 δ)K)2 δ2 r2 The first inequality is due to the definition of x i+1 and the second one follows from Equation (7). We also need to show d(y1, (1 δ)K) d(y0, (1 δ)K) to finish the proof. We have d(y1, x 1) d(y1, x 0) d(y0, x 0), (8) where the first inequality is again due to the definition of x 1, while the second one comes from y1 is the metric projection of y0 onto Bp(R) and x 0 (1 δ)K Bp(R). Reminding y = yk and unrolling the recurrence, we have d(y, (1 δ)K)2 d(y0, (1 δ)K)2 (k 1)δ2 r2 Thus, in the worst case, we need k = ζ d(y0, (1 δ)K)2 d(y, (1 δ)K)2 iterations to stop. B.3 Proof of Theorem 1 Proof. Combining Algorithm 3 with Lemma 4, we know yt K thus Algorithm 3 plays a feasible point at each round. For a fixed interval I = [s, e] [T], let x I = argminx K Pe t=s ft(x) and x I = Expp((1 δ)Exp 1 p x I). Then d( x I, x I) = d(Expp((1 δ)Exp 1 p x I), x I) = d(Expx I(δExp 1 x I p), x I) = δd(p, x I) δR. By Lemma 4, yt K is an infeasible projection of yt over (1 δ)K. Then by Lemma 1 t=s (ft( yt) ft( x I)) d( ys, x I)2 2η + ηζ Pe t=s ft( yt) 2 On the other hand, by the gradient-Lipschitzness, we have ft( x I) ft(x I) G d( x I, x I) GδR. Combining the above two equations, we have t=s (ft( yt) ft(x I)) GRδ + G2ηζ Setting η = 2R G ζT and δ = 1 2 T < 1, then we have t=s (ft( yt) ft(x I)) 5 where we use ζ 1. It remains to bound the number of calls to the separation oracle. Denote x t argminx (1 δ)K d(x, yt). Due to yt+1 = Exp yt ( η ft( yt)), d(yt+1, (1 δ)K) d(x t , yt+1) d(x t , yt) + d( yt, yt+1) d( yt, (1 δ)K) + η ft( yt) , where the first inequality is because x t (1 δ)K, the second is due to the triangle inequality, while the third one follows from the definition of x t and yt+1 = Exp yt ( η ft( yt)). Squaring both sides, we have d(yt+1, (1 δ)K)2 d( yt, (1 δ)K)2 + 2d( yt, (1 δ)K)ηG + η2G2 d( yt, (1 δ)K)2 + 2δRηG + η2G2, (9) where the second inequality is because d( yt, (1 δ)K) d( yt, Expp((1 δ)Exp 1 p yt)) = d( yt, Exp yt(δExp 1 yt p)) = δd( yt, p) δR. Combining Equation (9) with Lemma 4, we can bound the number of separation oracle calls as: ζ d(yt+1, (1 δ)K)2 d( yt+1, (1 δ)K)2 ζ d( yt, (1 δ)K)2 + 2RδηG + η2G2 d( yt+1, (1 δ)K)2 r2δ + G2η2T where we use y1 = p (1 δ)K and thus d( y1, (1 δ)K) = 0 to derive the third inequality. B.4 Proof of Theorem 2 We first prove Lemma 7, which characterizes the relation between δ and τ. Lemma 7. For a gsc-convex subset K M where M is Hadamard, any point y K and Bp(r) K Bp(R), we have κ(R + r) sinh κ(R + r) Proof. The proof of this proposition takes inspiration from Lemma 19 (Wang et al., 2023), but with several significant adjustments. Notably, Lemma 19 hinges on the gsc-convexity of K to bound the radius of the geodesic ball By( r), where y (1 τ)K, which resides within K. However, the gsc-convexity of (1 + τ)K is unknown.5 Therefore, we must explore alternative strategies that leverage the gsc-convexity of K to meet our objectives. Let v Ty M be a unit vector such that v = 1 and z = Expy(θτrv). 
The goal is to find the maximum θ which ensures z (1 + τ)K for any v. Let y = Expp Exp 1 p y 1+τ and z = Expp Exp 1 p z 1+τ . It is immediate to see z K iff z (1 + τ)K. We denote ξτ(s) as the geodesic satisfying ξτ(0) = y and ξτ(1) = z . 5We conjecture that on Hadamard manifolds, (1 + τ)K is gsc-convex for any τ 0. We define an admissible family of curves: Γ : 0, 1 + τ [0, 1] M (t, s) Expy(t(Exp 1 y ξτ(s))). We can verify that Γ(0, 1) = y, Γ(1, 0) = y , Γ 1+τ τ , 0 = Expy 1+τ τ Exp 1 y y = p, and Γ(1, 1) = z . We also denote w := Γ 1+τ τ , 1 . The idea of the proof is to show d(p, w) r by Lemma 15, which implies w K since Bp(r) K. Combining with the fact that y K and z is on the geodesic connecting y and w, we have z K. Thus z (1 + τ)K. We notice that v(t, s) = Γ(s,t) s is a Jacobi field since it is a variation field of γs(t) = Expy(t Exp 1 y ξτ(s)). Let R(s) = d(Γ(0, s), Γ(1, s)). To apply Lemma 15, we need to normal- ize the geodesic γs(t) = γs t R(s) . Since M is Hadamard, by Lemma 15 v(1, s) R(s) γsv(0, s) . (11) Remind that v(1, s) = ξτ (s) s = ξτ(s) = d(y , z ). Now we use Lemma 20 to bound d(y , z ). We construct ( p, y, z) in Euclidean space, a comparison triangle of (p, y, z) with comparison points y [ p, y] and z [ p, z]. We restrict d(p, y ) = d E( p, y ) and d(p, z ) = d E( p, z ), then by Lemma 20, we have d(y , z ) d E( y , z ) (12) On the other hand, it is immediate to verify ( p, y, z) is similar to ( p, y , z ) by considering d E( p , z ) d E( p , z ) = d(p, z ) d(p, z) = 1 1 + τ = d(p, y ) d(p, y) = d E( p , y ) d E( p , y ). Thus we have d E( y , z ) = 1 1 + τ d E( y, z) = 1 1 + τ d(y, z) = θτr 1 + τ , (13) where the first equation is due to the property of similar triangles, the second one follows from the definition of the comparison triangle, and the last one is due to the definition of z. Combining Equations (11), (12) and (13), we have γsv(0, s) v(1, s) R(s) θτr (1 + τ)R(s). (14) We also need to apply Lemma 15 at t = 1+τ τ . Since the sectional curvature of M is lower bounded by κ, we have v 1 + τ τ , s 1 κ sinh κR(s) 1 + τ γsv(0, s) (15) Putting Equations (14) and (15) together, we have v 1 + τ τ , s sinh κR(s) 1+τ τ R(s)1 + τ τ θτr (1 + τ)R(s) =sinh κR(s) 1+τ Now we can bound θ. Note that R(s) = d(Γ(0, s), Γ(1, s)) = d(y, ξτ(s)) d(y, y ) + d(y , z ) τ 1 + τ R + θτr 1 + τ τ 1 + τ (R + r). By Lemma 21, we have sinh κR(s) 1+τ τ sinh κ(R + r) Thus we can take θ = κ(R + r) sinh κ(R + r) to ensure v 1+τ τ , s r. The length of the curve v 1+τ τ , s can be bounded by Z 1 This implies d(p, w) r and thus w K. Then we know z K because z lies on the geodesic connecting w and y, with both endpoints in K. This finally leads to the fact that z (1 + τ)K. Now we give the proof of Theorem 2. Proof of Theorem 2. Algorithm 4 always plays feasible points because by δ = κ(R + r) sinh κ(R + r) τr and Lemma 7, we have zt (1 + τ)K and zt K. By the first item of Lemma 23, E h D gt, Exp 1 yt x Ei = E h D E [gt| yt] , Exp 1 yt x Ei = Vδ Sδ E h D ˆft( yt), Exp 1 yt x Ei (17) where ˆft(x) is a smoothed version of ft(x). More specifically, ˆft(x) := 1 Bx(δ ) ft(u)ω where ω is the volume form. Applying Lemma 4 to Algorithm 4, we know yt is an infeasible projection of yt over (1 δ)K, which means d( yt+1, x)2 d(yt+1, x)2 d( yt, x)2 + ζη2 gt 2 2η D gt, Exp 1 yt x E (18) holds for any x (1 δ)K, where the second inequality is due to Lemma 29. Equation (18) is equivalent to D gt, Exp 1 yt xt E d( yt, x)2 d( yt+1, x)2 2 gt 2. 
(19) By the third item of Lemma 23, we have ˆf( yt) ˆf(x) D ˆf( yt), Exp 1 yt x E + 2δ ρG Vδ E h D gt, Exp 1 yt x Ei + 2δ ρG Vδ d( yt, x)2 d( yt+1, x)2 Vδ E gt 2 + 2δ ρG. where ρ is a constant solely depends on K. Combining Equations (17), (19), and (20), we have ˆft( yt) ˆft(x) # t=s E gt 2 + 2δ ρGT (21) holds for any x (1 δ)K, where we use the fact that d( ys, x) 2R. We also need to bound E h ft(ezt) ˆft( yt) i and E h ˆft( x I) ft(x I) i . For the former term, E h ft(ezt) ˆft( yt) i =E [ft(ezt) ft(zt)] + E [ft(zt) ft( yt)] + E h ft( yt) ˆft( yt) i G d(ezt, zt) + Gδ + Gδ GτR + 2Gδ , (22) where we use Lemma 22 and the gradient Lipschitzness. Similarly, for the latter term, E h ˆft( xt) ft(x I) i =E h ˆft( x I) ft( x I) i + E [ft( x I) ft(x I)] Gδ + G d( x I, x I) Gδ + GδR. (23) By the second term of Lemma 23, we have S2 δ V 2 δ gt 2 n δ + n|κ| 2 M 2. (24) Combining Equations (21), (22), (23) and (24), summing from t = s to e, we get t=s (ft(ezt) ft(x I)) ft(ezt) ˆft( yt) # ˆft( yt) ˆft(x I) # ˆft(x I) ft(x I) # (GτR + 2Gδ ) T + 2R2 2 Sδ Vδ T + 2ρδ GT + (Gδ + GδR) T (GτR + 2Gδ ) T + 2R2 δ + n|κ| + ηζM 2 δ + n|κ| T + 2ρδ GT + (Gδ + GδR) T. After plugging in η = T 1 2 , δ = T 1 4 , τ = T 1 4 and δ = κ(R+r) sinh( κ(R+r))T 1 4 r, we see 4 r and 1 δ sinh κ(R + r) κ(R + r) T 1 4 r . t=s (ft(ezt) ft(x I)) as claimed. Denote x t argminx (1 δ)K d(x, yt). We now bound the number of calls to the separation oracle. Due to yt+1 = Exp yt ( ηgt), d(yt+1, (1 δ)K) d(x t , yt+1) d(x t , yt) + d( yt, yt+1) d( yt, (1 δ)K) + η gt . Squaring both sides, we have d(yt+1, (1 δ)K)2 d( yt, (1 δ)K)2 + 2d( yt, (1 δ)K)ηM + η2M 2 d( yt, (1 δ)K)2 + 2δRηM + η2M 2, (26) where the second inequality is because d( yt, (1 δ)K) d( yt, Expp((1 δ)Exp 1 p yt)) = d( yt, Exp yt(δExp 1 yt p)) = δd( yt, p) δR and gt = |ft(zt)| M. Combining Equation (26) with Lemma 4, we can bound the number of separation oracle calls as: ζ d(yt+1, (1 δ)K)2 d( yt+1, (1 δ)K)2 ζ d( yt, (1 δ)K)2 + 2RδηM + η2M 2 d( yt+1, (1 δ)K)2 r2δ + M 2ζη2T r2 T 3 4 + M 2ζ where the last inequality follows from y1 = p (1 δ)K and thus d( y1, (1 δ)K) = 0. B.5 Proof of Theorem 3 Proof. We maintain the loop invariant xt βK. So by Lemma 19, B(xt, δ ) K, which means zt and z t are feasible points. For the two-point feedback setting, by Lemma 23, ˆft(xt) ˆft(x) D ˆft(xt), Exp 1 xt x E + 2δ ρG Vδ E gt, Exp 1 xt x + 2δ ρG Vδ E gt, Γxt yt Exp 1 yt x + gt, Γxt yt Exp 1 yt x Exp 1 xt x + 2δ ρG. By Lemma 17 of Wang et al. (2023), Sδ Vδ E [ gt ] n G(1 + |κ|δ 2). Sδ Vδ E gt, Γxt yt Exp 1 yt x Exp 1 xt x Sδ Vδ E [ gt ] ζd(xt, yt) n G(1 + |κ|δ 2) ζ(1 β)d(yt, p) n GRζ(1 β)(1 + |κ|δ 2) =n GRζ sinh( κ(R + r)) κ(R + r) δ r 1 + κ|δ 2 = O(δ ), where we use the ζ-smoothness of 1 2d(x, y)2 and the definition of δ . Since yt+1 is an infeasible projection of y t+1 onto (1 δ)K, d(yt+1, x)2 d(y t+1, x)2 d(yt, x)2 + ζd(yt, y t+1)2 2 Exp 1 yt y t+1, Exp 1 yt x =d(yt, x)2 + ζη2 gt 2 2 ηΓyt xtgt, Exp 1 yt x =d(yt, x)2 + ζη2 gt 2 2η gt, Γxt yt Exp 1 yt x holds for any x (1 δ)K, which means gt, Γxt yt Exp 1 yt x d(yt, x)2 d(yt+1, x)2 2η + ηζ gt 2 Taking expectation and using Sδ δ + n|κ|, we have Sδ Vδ E gt, Γxt yt Exp 1 yt x Sδ 2ηVδ d(yt, x)2 d(yt+1, x)2 + ηζSδ δ + n|κ| d(yt, x)2 d(yt+1, x)2 + n ηζ δ 2G2. (31) Combining Equations (28), (29), (30) and (31), and summing from t = s to e, we have ˆft(xt) ˆft(x) # ηδ + δ T + ηδ T holds for any x (1 δ)K. Following Equations (22) and (23), and taking x = Expp (1 δ)Exp 1 p x , we have E h ft(zt) ˆft(xt) i = O(δ ) and E h ˆft(x) ft(x ) i = O(δ + δ ). 
In sum, e X t=s E [ft(zt) ft(x )] = O 1 ηδ + (δ + δ )T + ηδ T . By choosing η = 1, δ = 1 β = T 1 2 , then δ = O T 1 2 and we can get O( Denote x t argminx (1 δ)K d(x, yt) and we again bound the number of calls to the separation oracle. Due to y t+1 = Expyt ηΓyt xtgt , d(y t+1, (1 δ)K) d(x t , y t+1) d(x t , yt) + d(yt, y t+1) d( yt, (1 δ)K) + η gt d( yt, (1 δ)K) + ηδ G where we use gt = 1 2|ft(zt) ft( zt)| δ G by Assumption 4. Squaring both sides, we have d(y t+1, (1 δ)K)2 d(yt, (1 δ)K)2 + 2d(yt, (1 δ)K)ηδ G + η2δ 2G2 d(yt, (1 δ)K)2 + 2δRηδ G + η2δ 2G2, (32) where the second inequality is again due to yt K and d(yt, (1 δ)K) d(yt, Expp((1 δ)Exp 1 p yt)) = d(yt, Expyt(δExp 1 yt p)) = δd(yt, p) δR. Combining Equation (32) with Lemma 4, we can bound the number of separation oracle calls as: ζ d(y t+1, (1 δ)K)2 d(yt+1, (1 δ)K)2 ζ d(yt, (1 δ)K)2 + 2Rδηδ G + η2δ 2G2 d(yt+1, (1 δ)K)2 δ r2 + ζ η2δ 2G2 δ2 r2 T + T where the last inequality follows from y1 = p (1 δ)K and thus d( y1, (1 δ)K) = 0. Since η = 1, δ = 1 β = T 1 δ = (1 β) κ(R + r) sinh κ(R + r) r = δ κ(R + r) sinh κ(R + r) r δr. Plugging these parameters into Equation (33), we have Ncalls ζ 2r RGT r2 + ζ r2G2T r2 + T = O(T). C Omitted Proofs in Section 6 We have the following convergence guarantee for RFW. C.1 Proof of Lemma 5 Proof. Since Algorithm 6 is indeed RFW with line-search (Algorithm 9) when f(x) = d(x, y)2/2, which is ζ gsc-smooth by Lemma 27 and f(x) = Exp 1 x y, the upper bound on the number of iterations follows from Lemma 26 with L = ζ and D = 2R. Item 1 follows from the line-search because f(xi) = d(xi, y)2/2 does not increase as i increases. For item 2, Algorithm 6 stops when either d( x, y)2 3ϵ or d( x, y)2 > 3ϵ and Exp 1 x y, Exp 1 x x ϵ. For the first case, item 2 obviously holds. Now we consider the second case. We first show Exp 1 y z, Exp 1 y x + Exp 1 x z, Exp 1 x y d( x, y)2 as follows. By Lemma 30, we have Exp 1 y z, Exp 1 y x 1 2(d(y, z)2 + d(y, x)2 d( x, z)2) Exp 1 x z, Exp 1 x y 1 2(d( x, y)2 + d( x, z)2 d(y, z)2). Adding the above two inequalities we have Exp 1 y z, Exp 1 y x + Exp 1 x z, Exp 1 x y d( x, y)2. Now item 2 follows from Exp 1 y z, Exp 1 y x d( x, y)2 Exp 1 x z, Exp 1 x y > 3ϵ Exp 1 x vi, Exp 1 x y > 2ϵ. For the third item, denote x = argminx K d(x, y)2, Suppose d(y, K)2 = d(x , y)2 < ϵ and d( x, y)2 > 3ϵ. Denote f(x) = d(x, y)2/2 and f(x) = Exp 1 x y. For the last iteration, Exp 1 x vi, Exp 1 x y = max v K Exp 1 x v, f( x) ϵ, which implies d( x, y)2 d(y, K)2 =2f( x) 2f(x ) 2 Exp 1 x x , f( x) 2 max v K Exp 1 x v, f( x) 2ϵ. And thus d( x, y)2 d(y, K)2 + 2ϵ 3ϵ as claimed. C.2 Proof of Lemma 6 Before proving Lemma 6, we need the following lemma. Lemma 8. Consider Algorithm 7 and fix some ϵ such that 0 < 3ϵ < d(x0, y0)2. Setting γ = 2ϵ d(x0,y0)2 , we have d(xi, yi) d(x0, y0) for any i. Proof. For all i > 1, we define the sequence yi by the relation yi = Expxi 1((1 γ)Exp 1 xi 1yi 1), and γ [0, 1). From this, we can deduce the following sequence of inequalities: d(xi 1, yi) = Exp 1 xi 1yi = (1 γ) Exp 1 xi 1yi 1 d(xi 1, yi 1). The inequality follows since γ (0, 1). By the first item of Lemma 5, for any i 1, we have d(xi, yi) d(xi 1, yi). Combining this with the previous inequality, we get d(xi, yi) d(xi 1, yi 1) . . . d(x1, y1) d(x0, y1) d(x0, y0), where the last inequality follows from the fact that x0 K and y1 is the projection of y0 onto Bp(R), while K Bp(R). Proof of Lemma 6. Given that y1 is the projection of y0 onto Bp(R) and K Bp(R), we have z K : d(y1, z)2 d(y0, z)2. 
The lemma trivially holds when d(x0, y0)2 3ϵ or d(x1, y1)2 3ϵ. Without loss of generality, we assume that d(x1, y1) > 3ϵ. Let k > 1 be the number of iterations in Algorithm 7. This means that d(xk, yk)2 3ϵ and d(xi, yi)2 > 3ϵ for any i < k. According to the second item of Lemma 5, we obtain Exp 1 yi z, Exp 1 yi xi 2ϵ for all i < k and z K. Lemma 8 also gives us d(xi, yi) d(x0, y0) for all i < k. Applying Lemma 2 with g = Exp 1 yi xi, C = d(x0, y0), and Q = 2ϵ, we deduce that for every 1 i < k, z K : d(z, yi+1)2 d(z, yi)2 4ϵ2 ζd(xi, yi)2 d(z, yi)2 4ϵ2 ζd(x0, y0)2 . (34) Consequently, for any z K, the returned point y satisfies d(y, z)2 d(y1, z)2 d(y0, z)2. Because p K, we have d(y, p) d(y1, p) R, which implies y Bp(R). Note that x K as it is the output of Algorithm 6. Thus, we know Algorithm 7 returns (x, y) K Bp(R). Now we bound the number of iterations in Algorithm 7. Denote x i = argminx K d(x, yi)2 for any i < k. Using Equation (34), d(yi+1, K)2 =d(x i+1, yi+1)2 d(x i , yi+1)2 d(x i , yi)2 4ϵ2 ζd(x0, y0)2 = d(yi, K)2 4ϵ2 ζd(x0, y0)2 . By applying the inequality d(y1, K)2 d(y0, K)2 and subsequently performing a telescoping sum, we can demonstrate that d(yi+1, K)2 d(y1, K)2 4iϵ2 ζd(x0, y0)2 d(y0, K)2 4iϵ2 ζd(x0, y0)2 d(y0, x0)2 4iϵ2 ζd(x0, y0)2 . After at most k 1 = (ζd(x0, y0)2(d(x0, y0)2 ϵ))/4ϵ2 iterations, we have d(yk, K)2 ϵ. By the third item of Lemma 5, the next iteration will be the last one, and the returned points x, y satisfy d(x, y)2 3ϵ, as claimed. C.3 Proof of Theorem 4 We first prove Lemma 9, which bounds the regret of the infeasible projection for gsc-convex losses. Lemma 9. With some fixed block size B, ηi = η and ϵi = ϵ for any i = 1, . . . , T B . Assume all losses are gsc-convex. We use i(t) to denote the block index of round t. Then the regret of playing yi(t) 1 as defined in Algorithm 8 can be bounded by sup I=[s,e] [T ] t=s ft( yi(t) 1) min x I K t=s ft(x I) 2 + 4RKG + 4R2 Proof. For convenience, we assume B divides T. Let Ti = {(i 1)B + 1, . . . , i B} be the set of rounds in the i-th block. By Lemma 6, yi+1 is the infeasible projection of yi B+1 onto K: d( yi+1, x)2 d(yi B+1, x)2 holds for any x K. By Algorithm 8, we have yi B+1 = Exp yi 1 s=1 f(i 1)B+s ( yi 1) t Ti ft( yi(t) 1) d( yi+1, x)2 d(yi B+1, x)2 d( yi 1, x)2 + ζη2B2G2 2η X D ft( yi 1), Exp 1 yi 1x E where the last inequality is due to Lemma 29 and ft(x) G. This is equivalent to D ft( yi 1), Exp 1 yi 1x E d( yi 1, x)2 d( yi+1, x)2 2η + ζηB2G2 For a certain interval [s, e] [T], let is and ie be the smallest and the largest block indices such that Tis and Tie are fully contained in [s, e]. Then we have D ft( yi(t) 1), Exp 1 yi(t) 1x E D ft( yis 2), Exp 1 yis 2x E D ft( yi 1), Exp 1 yi 1x E + D ft( yie), Exp 1 yie There are some undesirable terms in blocks is 1 and ie + 1, and we can bound them by noticing D ft( yi(t) 1), Exp 1 yi(t) 1x E ft( yi(t) 1) Exp 1 yi(t) 1x G 2R = 2GR. (37) Combining Equations (35), (36) and (37), we have ft( yi(t) 1) ft(x) D ft( yi(t) 1), Exp 1 yi(t) 1x E d( yi 1, x)2 d( yi+1, x)2 2η + ζηB2G2 where the last inequality is due to the number of blocks in [s, e] is upper bounded by T Now we give the proof of Theorem 4. Proof of Theorem 4. We still use i(t) to denote the block index of the t-th round. For an interval [s, e] [T], we can do the following decomposition ft(xi(t) 1) ft(x) ft(xi(t) 1) ft( yi(t) 1) + ft( yi(t) 1) ft(x) D ft(xi(t) 1), Exp 1 xi(t) 1 yi(t) 1 E + ft( yi(t) 1) ft(x) Algorithm 8 calls Algorithm 7 in each block to compute an infeasible projection. 
By Lemma 6 and Algorithm 8, (xi, yi) K Bp(R), which satisfies d(xi, yi)2 3ϵi = 3ϵ. Thus D ft(xi(t) 1), Exp 1 xi(t) 1 yi(t) 1 E Gd(xi(t) 1, yi(t)) Combining Equations (39), (40) and Lemma 9, we have ft(xi(t) 1) ft(x) 4RGB + 4R2 holds for any x K. By plugging in η, ϵ, and B, we get the claimed regret bound. We also need to bound the number of calls to the linear optimization oracle. Note that d(x1, y1) = 0 due to the initialization, and for any i 2, we have d(xi, yi)2 3ϵ due to Lemma 6. By Algorithm 8, we have yi B+1 = Exp yi 1 t=(i 1)B+1 ft( yi 1) i = 1, . . . , T Therefore d(xi 1, yi B+1) d(xi 1, yi 1) + d( yi 1, yi B+1) 3ϵ + ηBG. (41) Squaring on both sides, we have d(xi 1, yi B+1)2 3ϵ + ηBG 2 6ϵ + 2B2G2η2, (42) where (a + b)2 2(a2 + b2) is used. By Lemma 6, on the i-th block, Algorithm 7 stops after at most ( ζd(xi 1, yi B+1)2 d(xi 1, yi B+1)2 ϵ iterations. According to Lemma 5, each iteration calls the linear optimization oracle for at most ζ 27R2 ϵ2 2 times. Thus in the i-th block, the number of calls to the linear optimization oracle is bounded by Ncalls := max ( ζd(xi 1, yi B+1)2 d(xi 1, yi B+1)2 ϵ There are T B blocks in total, so the total number of calls to the linear optimization oracle is at most 2 ζ + 11B2G2η2 2ϵ + B4G4η4 Plugging in the values of B, η and ϵ, we find Ncalls T as claimed. C.4 Proof of Theorem 5 Proof. Based on a similar argument as in Equation (35), we have t Ti (ft( yi 1) ft(x)) D ft( yi 1), Exp 1 yi 1x E αB 2 d( yi 1, x)2 ! d( yi 1, x)2 d( yi+1, x)2 2ηi + ζηB2G2 2 d( yi 1, x)2 where we apply the strong gsc-convexity on the i-th block under the help of Lemma 28; the last inequality follows from the definition of ηi. If we consider the sum of regret over each block as in Equation (43), ft( yi(t) 1) ft(x) ζBG2 On the other hand, since ft( ) is G-Lipschitz, by Lemma 6, ft(xi(t) 1) ft( yi(t) 1) D ft(xi(t) 1), Exp 1 xi(t) 1 yi(t) 1 E where we use the fact that for the first block, i(t) = 1 and thus xi(t) 1 = yi(t) 1. Now as we combine Equations (44) and (45), we have ft(xi(t) 1) ft(x) By plugging in {ϵi} T B i=1, we get the stated regret guarantee. Now it remains to bound the number of calls to the linear optimization oracle. Note that for each block i T B , by Lemma 6, d(xi, yi)2 3ϵi. In view of Algorithm 8, we also have y(i+1)K+1 = Exp yi t=i B+1 ft( yi) By the triangle inequality, d(xi 1, yi B+1) d(xi 1, yi 1) + d( yi 1, yi B+1) p 3ϵi 1 + ηi BG. (46) Squaring both sides and making use of (a + b)2 2(a2 + b2), we have d(xi 1, yi B+1)2 6ϵi 1 + 2B2G2η2 i . (47) By Lemma 6, the number of iterations in Algorithm 7 for the i + 1-th block is at most max ζd(xi 1, yi B+1)2(d(xi 1, yi B+1)2 ϵi+1) 4ϵ2 i+1 + 1, 1 max ζ 30ϵ2 i 1 + 22B2G2η2 i ϵi 1 + 4B4G4η4 i 4ϵ2 i+1 + 1, 1 max ζ 61ϵ2 i 1 + 8B4G4η4 i 4ϵ2 i+1 + 1, 1 max 16ϵ2 i 1 ϵ2 i+1 ζ + 2B4G4η4 i ϵ4 i+1 ζ + 1, 1 Thus, the number of calls to the linear optimization oracle is bounded by 16ϵ2 i 1 ϵ2 i+1 ζ + 2B4G4η4 i ϵ4 i+1 ζ + 1 27R2ζ (i + 1)4 + 2(i + 3)4 4002(i 1)4ζ2 + 1 2 (i + 3)2ζ i=2 (i + 3)2 ζ where the last line follows from ζ 1, i+3 i 1 5 holds for any i 2. On the other hand, we have T B X i=2 (i + 3)2 Combining Equations (49), (50) and plugging in B = αR 3 T 2/3, we get the claimed Ncalls ζT. D Some Basic Facts about Convex Sets on Hadamard Manifolds In this section, we highlight two significant characteristics of convex sets in Euclidean space which do not apply to Hadamard manifolds, emphasizing the fundamental differences between these two spaces. 
Our counterexamples are constructed within the Poincaré half-plane H². Therefore, it is useful to first record some basic properties of this manifold.
Lemma 10. (Udriste, 2013; Wang et al., 2016; Kristály et al., 2016) The Poincaré half-plane H² can be described by M := {(t1, t2) ∈ R² | t2 > 0} with metric g_ij(u, v) = δ_ij/v², i, j ∈ {1, 2}. We have the following conclusions about geodesics and the distance on the Poincaré half-plane.
1) For the geodesic connecting x = (t1, t2) and y = (s1, s2), let
b_xy := (s1² + s2² − t1² − t2²)/(2(s1 − t1)), r_xy := √((s1 − b_xy)² + s2²);
then the geodesic γ_xy := (γ¹_xy, γ²_xy) connecting x and y is
γ¹_xy(s) := t1 if t1 = s1, and γ¹_xy(s) := b_xy − r_xy tanh((1 − s) artanh((b_xy − t1)/r_xy) + s artanh((b_xy − s1)/r_xy)) if t1 ≠ s1, (51)
γ²_xy(s) := e^{(1−s) ln t2 + s ln s2} if t1 = s1, and γ²_xy(s) := r_xy / cosh((1 − s) artanh((b_xy − t1)/r_xy) + s artanh((b_xy − s1)/r_xy)) if t1 ≠ s1. (52)
Indeed, we have (γ¹_xy(s) − b_xy)² + (γ²_xy(s))² = r_xy² for any s ∈ [0, 1] when t1 ≠ s1.
2) The distance between two points (u1, v1) and (u2, v2) can be calculated by
d_H((u1, v1), (u2, v2)) = arcosh(1 + ((u2 − u1)² + (v2 − v1)²)/(2 v1 v2)).
D.1 Shrinking Gsc-convex Sets on Hadamard Manifolds Does Not Preserve Convexity
In the context of projection-free online learning on manifolds, the set cK := {Exp_p(c Exp_p^{-1} x) | x ∈ K} plays a pivotal role, particularly when c < 1. For a convex set K in Euclidean space, the convexity of cK is guaranteed for any positive c. Nonetheless, this property does not necessarily hold on Hadamard manifolds. We construct a counterexample within the Poincaré half-plane H², relying on the following lemma.
Lemma 11. On the Poincaré half-plane H², the midpoint of the geodesic γ_xy which connects x = (−λ, µ) and y = (λ, µ) is (0, √(λ² + µ²)).
Proof. By symmetry we know γ¹_xy(1/2) = 0. To compute γ²_xy(1/2), note that b_xy = 0 and r_xy = √(λ² + µ²); then by Lemma 10,
γ²_xy(1/2) = r_xy / cosh((1/2) artanh(λ/r_xy) − (1/2) artanh(λ/r_xy)) = r_xy = √(λ² + µ²),
where we use the fact that artanh(·) is an odd function and cosh(0) = 1.
Now we prove the following theorem.
Theorem 6. There exists a counterexample on the Poincaré half-plane H² such that the radially contracted set (1 − α)K, where α ∈ (0, 1), is non-convex.
Proof. Let p = (0, 1), u = (−1, √2), v = (1, √2), w = (0, √3), and let K be the geodesic triangle with vertices p, u and v, which is a closed gsc-convex subset of H². Observe that γ_uv is an arc of the circle {(λ, µ) | λ² + µ² = 3} and, by Lemma 11, w is its midpoint. Following this, we calculate γ_pu, γ_pw and γ_pv, shrink these three geodesics by the factor (1 − α), and denote the contracted endpoints by u′ := γ_pu(1 − α), w′ := γ_pw(1 − α) and v′ := γ_pv(1 − α). Ultimately, we demonstrate that for some α the geodesic γ_{u′v′} rises strictly above w′; since the points of (1 − α)K on the vertical geodesic through p are exactly the contractions of points of K on that geodesic, whose highest image is w′, this implies that γ_{u′v′} is not contained in (1 − α)K, and hence that the radial contraction of K is non-convex.
We first compute γ_pu and γ_pv. Note that b_pu = −1 and r_pu = √2, so by Lemma 10 we can write
γ¹_pu(s) = −1 + √2 tanh((1 − s) artanh(1/√2)), γ²_pu(s) = √2 / cosh((1 − s) artanh(1/√2)).
By symmetry, we have
γ¹_pv(s) = 1 − √2 tanh((1 − s) artanh(1/√2)), γ²_pv(s) = √2 / cosh((1 − s) artanh(1/√2)).
We can also compute γ_pw(s) = (0, e^{s ln √3}). Now we have
u′ = γ_pu(1 − α) = (−1 + √2 tanh(α artanh(1/√2)), √2 / cosh(α artanh(1/√2))),
v′ = γ_pv(1 − α) = (1 − √2 tanh(α artanh(1/√2)), √2 / cosh(α artanh(1/√2))),
and w′ = (0, e^{(1−α) ln √3}) = (0, 3^{(1−α)/2}). Write u′ = (−λ′, µ′) and v′ = (λ′, µ′) with λ′ = 1 − √2 tanh(α artanh(1/√2)) and µ′ = √2 / cosh(α artanh(1/√2)). By Lemma 11, the highest point of γ_{u′v′} is its midpoint (0, √(λ′² + µ′²)), so it suffices to show that for some α ∈ (0, 1),
√(λ′² + µ′²) − 3^{(1−α)/2} > 0.
We set α = 0.1; a direct numerical evaluation gives √(λ′² + µ′²) − 3^{(1−α)/2} ≈ 0.019 > 0. Therefore the midpoint of γ_{u′v′} lies strictly above w′, so the geodesic γ_{u′v′} is not contained in (1 − α)K, which is thus non-convex.
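The final inequality in the proof of Theorem 6 is easy to verify numerically from the closed forms derived above; the short check below is ours and only confirms that the gap is strictly positive at α = 0.1.

```python
import math

def contraction_gap(alpha):
    """Apex height of the geodesic through u' = (-lam, mu) and v' = (lam, mu)
    (Lemma 11) minus the height of w' = (0, 3 ** ((1 - alpha) / 2)).

    A strictly positive value means w' lies below the geodesic joining u' and
    v', so the contracted set (1 - alpha) K cannot be gsc-convex.
    """
    theta = alpha * math.atanh(1.0 / math.sqrt(2.0))
    lam = 1.0 - math.sqrt(2.0) * math.tanh(theta)
    mu = math.sqrt(2.0) / math.cosh(theta)
    apex = math.sqrt(lam ** 2 + mu ** 2)       # midpoint height by Lemma 11
    return apex - 3.0 ** ((1.0 - alpha) / 2.0)

print(contraction_gap(0.1))   # approx. 0.019 > 0
```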
D.2 Non-convexity of the Minkowski Functional on Hadamard Manifolds In this part, we show the Minkowski functional (Gauge function), which is defined as γK(x) = inf λ > 0 : Expp Exp 1 p z λ can be non-convex on Hadamard manifolds, where K is a gsc-convex set of a Hadamard manifold M and p is an interior point in K. Theorem 7. The Minkowski functional defined in Equation (53) can be non-convex on Hadamard manifolds. Proof. We again construct a counterexample on the Poincaré plane H2: M := {(t1, t2) R2|t2 > 0}. To verify the geodesic convexity, we can verify an equivalent statement: a function f is gsc-convex iff f is convex when restricted to any geodesic. That is f(γ(t)) (1 t)f(γ(0)) + tf(γ(1)) holds for any geodesic γ(t) : [0, 1] M. The counterexample is as follows. Assume K is the gsc-convex set consists of all points (u1, u2) such that u2 1 + u2 2 3 and u2 > 0. x = ( 1, 3), y = ( 2, 2) and p = (0, 1). We show that γK(x) is non-convex on the geodesic γxy. We first compute bxy = 2 and rxy = 2. By Lemma 10, we have 2 2 tanh (1 s) arctanh 1 , 2 cosh (1 s) arctanh 1 :=( 2 2 cos θ(s), 2 sin θ(s)), where we let θ(s) := tanh (1 s) arctanh 1 2 to ease the elaboration. We denote z := γxy(s) = ( 2 2 cos θ(s), 2 sin θ(s)). Then we compute the geodesic γpz connecting z and p = (0, 1), and we have bpz(s) = (2 + 2 cos θ(s))2 + (2 sin θ(s))2 1 2( 2 2 cos θ(s)) = 7 + 8 cos θ(s) 4(1 + cos θ(s)) and rpz(s) = p bpz(s)2 + 1. Now we can compute the intersection of γpz(s) with K as the solution (q1, q2) of the following system of equations (q1 bpz(s))2 + q2 2 = bpz(s)2 + 1 q2 1 + q2 2 = 3 The intersection point is 1 bpz(s), q 3 1 bpz(s)2 . For each s [0, 1], we can evaluate the Minkowski functional at γxy(s) w.r.t. K as h(s) = d H(( 2 2 cos θ(s), 2 sin θ(s)), (0, 1))/d H 3 1 bpz(s)2 where d H is the hyperbolic distance defined in the second item of Lemma 10. If the Minkowski functional is gsc-convex, then we should have h(s) (1 s)h(0) + s h(1) for any s [0, 1]. However, by numerical computation, we find 2h(1) = 0.0057 > 0, which refutes the assumption that h(s) is convex. Thus our conclusion is established. E Technical Lemmas E.1 Background on the Jacobi Field The role of the Jacobi field as an essential tool for understanding the impact of curvature on the behavior of neighboring geodesics cannot be overstated. To make our discussion more comprehensive and self-contained, we will introduce related definitions and properties sourced from Lee (2006) and Lee (2018). We define an admissible curve as a continuous map, γ(t) : [a, b] M, representing a piecewise regular curve segment. An admissible family of curves, on the other hand, is a continuous map Γ(t, s) : [a, b] ( ϵ, ϵ) M which is smooth on each rectangle [ai 1, ai] ( ϵ, ϵ) for a finite subdivision a = a0 < < ak = b. Here, Γs(t) := Γ(t, s) is recognized as an admissible curve for each s ( ϵ, ϵ). When γ(t) is admissible and Γ(t, 0) = γ(t), we refer to Γ as a variation of γ(t). Upon an admissible curve Γ, we can delineate two sets of curves: the principal curves Γs(t), which hold s as constant, and the transverse curves Γ(t)(s) := Γ(t, s), which consider t as constant. We consider Γ as a variation through geodesics if each principal curve Γs(t) = Γ(t, s) is a geodesic segment. For an admissible family of curves Γ, we define a vector field along Γ as a map V : [a, b] ( ϵ, ϵ) TM such that V (t, s) TΓ(t,s)M and for each s, V (s, t) is piecewise smooth. The Jacobi field is a vector field along a variation through geodesics and is formally defined as follows. 
Lemma 12. (Lee, 2018, Theorem 10.1, The Jacobi Equation) Let (M, g) be a Riemannian manifold, γ a geodesic in M, and J a vector field along γ. If J is the variation field of a variation through geodesics, then J satisfies the Jacobi equation
D_t² J + R(J, γ̇)γ̇ = 0,
where R is the Riemann curvature endomorphism.
A smooth vector field J satisfying the Jacobi equation is called a Jacobi field. Given the values of J and D_t J at a single point, the Jacobi field is uniquely determined, as the following proposition shows.
Lemma 13. (Lee, 2018, Prop. 10.2, Existence and Uniqueness of Jacobi Fields) Let (M, g) be a Riemannian manifold and γ : I → M a geodesic, where I is an interval. Fix a ∈ I and let p = γ(a). For any u, v ∈ T_pM, there is a unique Jacobi field J along γ which satisfies J(a) = u and D_t J(a) = v.
The Jacobi field may seem a complicated object to study, but the following lemma shows that we can write down the initial conditions of the Jacobi field for certain variations through geodesics.
Lemma 14. (Lee, 2018, Lemma 10.9) Let (M, g) be a Riemannian manifold, I an interval containing 0, and γ : I → M a geodesic. Suppose J is a Jacobi field along γ with J(0) = 0. If M is geodesically complete or I is compact, then J is the variation field of the following variation of γ:
Γ(t, s) = Exp_p(t(u + sv)),
where p = γ(0), u = γ̇(0), and v = D_t J(0).
The advantage of introducing the Jacobi field is that the norm of J(t) can be bounded in terms of the initial value ‖D_t J(0)‖, as shown in the following Lemma 15.
Definition 4. We define
s(κ, t) := t if κ = 0; s(κ, t) := (1/√κ) sin(√κ t) if κ > 0; s(κ, t) := (1/√−κ) sinh(√−κ t) if κ < 0.
Lemma 15. (Lee, 2018, Theorem 11.9) Suppose (M, g) is a Riemannian manifold, γ : [0, b] → M is a unit-speed geodesic segment, and J is any Jacobi field along γ such that J(0) = 0. Let s(·, ·) be the function given in Definition 4.
1) If all sectional curvatures of M are bounded above by a constant κ2, then ‖J(t)‖ ≥ s(max{0, κ2}, t) ‖D_t J(0)‖ for all t ∈ [0, b1], where b1 = b if κ2 ≤ 0, and b1 = min{b, π/√κ2} otherwise.
2) If all sectional curvatures of M are bounded below by a constant κ1, then ‖J(t)‖ ≤ s(min{0, κ1}, t) ‖D_t J(0)‖ for all t ∈ [0, b2], where b2 is chosen so that γ(b2) is the first conjugate point to γ(0) along γ if there is one, and otherwise b2 = b.
E.2 Miscellaneous Technical Lemmas
Lemma 16. (Bacák, 2014, Section 2.1) Let M be a Hadamard manifold and f : M → (−∞, +∞] a gsc-convex lower semicontinuous function. Then any β-sublevel set of f, {x ∈ M : f(x) ≤ β}, is a closed gsc-convex set.
Lemma 17. (Bacák, 2014, Section 2.2) Let M be a Hadamard manifold and f_i : M → (−∞, +∞] for i = 1, . . . , m a collection of gsc-convex lower semicontinuous functions. Then sup_{i∈[m]} f_i(x) is gsc-convex.
Lemma 18. Let K := {x | max_{1≤i≤m} h_i(x) ≤ 0} ⊆ M, where each h_i(x) is gsc-convex and M is Hadamard. For any y ∉ K, there exists g ∈ T_yM such that ⟨Exp_y^{-1} x, g⟩ > 0 for all x ∈ K.
Proof. By Lemma 17, max_{1≤i≤m} h_i(x) is gsc-convex. Further, by Lemma 16, K is a closed gsc-convex subset of M. For y ∉ K, there exists j ∈ [m] such that h_j(y) > 0: if not, we would have h_i(y) ≤ 0 for all i = 1, . . . , m, which implies y ∈ K and leads to a contradiction. Applying the gsc-convexity of h_j, for any x ∈ K we have
⟨Exp_y^{-1} x, −∇h_j(y)⟩ ≥ h_j(y) − h_j(x) > 0,
where the second inequality is because h_j(y) > 0 and h_j(x) ≤ 0. Hence g = −∇h_j(y) gives a valid separation oracle between y and K.
Remark 4. Given K = {x | max_{1≤i≤m} h_i(x) ≤ 0} where each h_i(x) is gsc-convex and y ∉ K, we can check the sign of h_i(y) for i = 1, . . .
, m until we find i such that hi (y) > 0, then using the proof of Lemma 18, we obtain Exp 1 y x, hi (y) > 0 for every x K. This gives us a separation oracle between y and K. The computation of this separation oracle requires no more than m function value evaluations and a single gradient evaluation. This is efficiently implementable when m is of moderate size. Remark 5. Using Gauss s Lemma (Lee, 2006, Theorem 6.8), we can efficiently compute the projection onto a geodesic ball. Consider x / Bp(r) for which we aim to compute its projection onto Bp(r). First, we determine the geodesic segment, denoted as γ(t), connecting x and p. This segment intersects the geodesic ball at point z. Gauss s Lemma then tells us that the tangent vector of γ(t) at z is orthogonal to the geodesic sphere Sp(r). This indicates that z is the projection of x onto Bp(r). By performing O(log 1 ϵ ) rounds of binary search on γ(t), we can approximate z to within an ϵ precision. In each round, it suffices to verify if the condition d(xt, p) r is met. Lemma 19. (Wang et al., 2023, Lemma 45) Suppose the sectional curvature of M is in [κ1, κ2] where κ1 0 and κ2 0. Assume a gsc-convex set K satisfies Bp(r) K Bp(R) where R π 2 κ2 when κ2 > 0. Then for any y (1 τ)K, the geodesic ball By s(κ2,R+r) τr s(κ1,R+r) K, where s( , ) is defined in Definition 4. Lemma 20. (Bridson & Haefliger, 2013, Chapter II.1) Suppose (p, q, r) is a triangle on a Hadamard manifold M and ( p, q, r) is a comparison triangle in Euclidean space such corresponding side lengths are the same for both triangles. Then for any x [p, q], y [p, r], x [ p, q], y [ p, r] satisfying d(p, x) = d E( p, x) and d(p, y) = d E( p, y), we have d(x, y) d E( x, y). Lemma 21. The function sinh x x is increasing for any x [0, ]. Proof. We denote g(x) = sinh x g (x) = x cosh x sinh x It suffices to show h(x) := x cosh x sinh x 0 holds for any x [0, ). This is obvious because h(0) = 0 and h (x) = x sinh x 0 for any x [0, ). It remains to compute g (0). By L Hôpital s rule, g (0) = lim x 0 x cosh x sinh x x2 = lim x 0 x sinh x Lemma 22. Let f : M R be gsc-convex and G-Lipschitz on a gsc-convex set K M. Then Bx(δ) f(u)ω satisfies | ˆf(x) f(x)| Gδ. Proof. We have | ˆf(x) f(x)| = E v =1 [f(Expx(δv) f(x)] E v =1 [f(Expx(δv)) f(x)] E v =1 [G δv ] Gδ. where v Tx M. The first inequality is due to Jensen s inequality, and the second one is by the gradient Lipschitzness of f. Definition 5. Given a homogeneous Riemannian manifold M, we use Sδ and Vδ to denote the surface area and the volume of a geodesic ball with radius δ. Remark 6. We note that on homogeneous Riemannian manifolds, the volume and the area of a geodesic ball only depend on its radius (Kowalski & Vanhecke, 1982), so Sδ and Vδ are well-defined. Lemma 23. (Wang et al., 2023, Lemmas 11 and 13) For a homogeneous Hadamard manifold (M, g) with sectional curvature lower bounded by κ 0, f be a C1 and gsc-convex function defined on M such that |f| C for any x M. We define Bx(δ) f(u)ω where ω is the volume element, and g(u) = f(u) Exp 1 x (u) Exp 1 x (u) . is a corresponding gradient estimator. Then we have 1) Sδ Vδ g(u) is an unbiased estimator of ˆf(x). More specifically, for any x M, Vδ g(u) x = ˆf(x). 2) The expected norm of Sδ Vδ g(u) can be bounded by Sδ Vδ g(u) x δ + n|κ|δ C. 3) Suppose K M is a gsc-convex set and f(x) G, then ˆf satisfies ˆf(y) ˆf(x) D ˆf(x), Exp 1 x (y) E 2ρδG, where ρ solely depends on the set K. Remark 7. 
From the proof of Lemma 23, we can figure out the formal definition of ρ is ρ = sup x,y,u K GExp 1 u ϕ(u) i s.t. ϕ(x) = y. By Wang et al. (2023), this quantity can be O e D even on a 2-dimensional Poincaré disk. Algorithm 9: Riemannian Frank-Wolfe with line-search Data: feasible gsc-convex set K, initial point x0 K, gsc-convex objective f( ). for i = 0, . . . do vi = argminx K{ f(xi), Exp 1 xi x } σi = argminσ [0,1]{f(Expxi(σExp 1 xi vi))} xi+1 = Expxi(σi Exp 1 xi vi) end Lemma 24. (Weber & Sra, 2022b, Theorem 1) Suppose the diameter of the gsc-convex set K M is upper bounded by D and f is gsc-convex as well as L gsc-smooth on K. With a linear optimization oracle in hand, the primal convergence rate of Algorithm 9 can be described by f(xk) f(x ) 2LD2 Lemma 25. (Weber & Sra, 2022b, Lemma 2) Under the same assumptions as in Lemma 24, for a step xk+1 = Expxk(ηExp 1 xk v) with η [0, 1], f(xk+1) f(xk) + η f(xk), Exp 1 xk v + η2LD2 Lemma 26. Under the same assumptions as in Lemma 24, denote g(x) := max v K Exp 1 x v, f(x) . If Algorithm 9 is run for K 2 rounds, then there exists k such that 1 k K and g(xk ) 27LD2 Proof. The proof is an adaptation of Jaggi (2013, Theorem 2). We first assume the duality gap is large in the last one third of K iterations, then get a contradictory to prove conclusion. For simplicity, we define hk := f(xk) f(x ) and gk := g(xk). To ease the elaboration, we denote C := 2LD2, α := 2 8 and d := K + 2. We assume gk > b C K+2 = b C d holds for the last one third of the K iterations. More specifically, d for k { αd 2, . . . , K}. By Lemma 25 with η = 1 k+2, we have hk+1 hk 2 k + 2gk + 2LD2 =hk 2 k + 2gk + C (k + 2)2 Plugging in the assumption that the duality gap is large, we arrive at hk+1 < hk 2 k + 2 b C d + C (k + 2)2 d2 + C α2d2 = hk 2b C C/α2 for all k = αd 2, . . . , K, when the second inequality follows from αd 2 k K implies αd k + 2 d. Denote kmin := αd 2 and hmin = hkmin, then there are K kmin +1 = K ( αd 2)+1 (1 α)d steps in the last one third iterations. Thus summing Equation (57) from k = αd 2 to K, we have h K+1 < hmin (1 α)d2b C C/α2 αd (1 α)d2αb 1/α αd (1 (1 α) (2αb 1/α)) where hmin C αd is due to Lemma 24, while for the last step, we plug in the values α = 2 3 and b = 27 8 . However, h K+1 < 0 is impossible because h K+1 = f(x K+1) f(x ). Thus, there always exists k in the last one third iterations of K such that K + 2 = 27LD2 Lemma 27. (Ahn & Sra, 2020, Prop. H.1) Let M be a Riemannian manifold with sectional curvatures lower bounded by κ 0 and the distance function d(x) = 1 2d(x, p)2 where p M. For D 0, d( ) is ζ gsc-smooth within the domain {u M : d(u, p) D} where ζ = κD coth( κD) is the geometric constant defined in Definition 1. Lemma 28. Suppose fi(x) is α-strongly gsc-convex for any i = 1, . . . , N, then PN i=1 fi(x) is αN-strongly gsc-convex. Proof. By the α-strongly gsc-convexity of each fi(x), fi(y) fi(x) + fi(x), Exp 1 x (y) + α 2 d(x, y)2. Summing from i = 1 to N, we have i=1 fi(x) + fi(x), Exp 1 x y + i=1 fi(x) + i=1 fi(x), Exp 1 x y 2 (αN) d(x, y)2. This indeed implies PN i=1 fi(y) is αN-strongly gsc-convex. Lemma 29. (Zhang & Sra, 2016, Lemma 5). Let M be a Riemannian manifold with sectional curvature lower bounded by κ 0. Consider a geodesic triangle fully lies within M with side lengths a, b, c, we have a2 ζ(κ, c)b2 + c2 2bc cos A where ζ(κ, c) := κc coth( κc). Lemma 30. (Sakai, 1996, Prop. 4.5) Let M be a Riemannian manifold with sectional curvature upper bounded by κ 0. Consider N, a gsc-convex subset of M with diameter D. 
For a geodesic triangle that lies entirely within N, with side lengths a, b, c, we have a² ≥ b² + c² − 2bc cos A.
E.3 Extension to CAT(κ) Spaces
It is not too difficult to generalize the results in this work to CAT(κ) spaces, understood here as simply connected manifolds with sectional curvature upper bounded by κ. To achieve this extension, the following adjustments would be necessary in our manuscript. We may assume that the sectional curvature lies in the range [κ, K]; in this case, Lemma 30 in our draft can be replaced with Alimisis et al. (2020, Corollary 2.1), and when K > 0 we must further assume that the diameter of the decision set is upper bounded by π/(2√K). The separation theorem, stated for Hadamard manifolds in Silva Louzeiro et al. (2022), would need to be generalized to CAT(κ) spaces; this ensures that the separation oracle remains well-defined. Finally, Lemma 3 requires modification to accommodate recomputing the Jacobi field bounds under the manifold's sectional curvature; Lemmas 15 and 19 will be instrumental in achieving this.
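Most of the adjustments listed above amount to recomputing the curvature-dependent constants used throughout this appendix. As a small self-contained reference, the sketch below evaluates s(κ, t) from Definition 4 and the smoothness constant ζ(κ, c) = √|κ| c coth(√|κ| c) from Lemmas 27 and 29; the κ = 0 limits are handled by continuity, and this is our own convenience code, not part of the paper.

```python
import math

def s(kappa, t):
    """Comparison function s(kappa, t) from Definition 4."""
    if kappa == 0:
        return t
    if kappa > 0:
        return math.sin(math.sqrt(kappa) * t) / math.sqrt(kappa)
    return math.sinh(math.sqrt(-kappa) * t) / math.sqrt(-kappa)

def zeta(kappa, c):
    """Geometric constant zeta(kappa, c) for kappa <= 0 (Lemmas 27 and 29)."""
    if kappa > 0:
        raise ValueError("zeta is defined here for non-positive curvature only")
    if kappa == 0 or c == 0:
        return 1.0                      # flat / degenerate limit of x * coth(x)
    x = math.sqrt(-kappa) * c
    return x / math.tanh(x)

# Example: constants for curvature range [-1, 0] and R + r = 2, as they would
# enter the ball-radius ratio of Lemma 19 and the smoothness bound of Lemma 27.
print(s(0.0, 2.0) / s(-1.0, 2.0), zeta(-1.0, 2.0))
```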