# Riemannian Stochastic Recursive Gradient Algorithm

Hiroyuki Kasai 1, Hiroyuki Sato 2, Bamdev Mishra 3

Abstract

Stochastic variance reduction algorithms have recently become popular for minimizing the average of a large, but finite, number of loss functions on a Riemannian manifold. The present paper proposes a Riemannian stochastic recursive gradient algorithm (R-SRG), which does not require the inverse of retraction between two distant iterates on the manifold. Convergence analyses of R-SRG are performed on both retraction-convex and non-convex functions under computationally efficient retraction and vector transport operations. The key challenge is analysis of the influence of vector transport along the retraction curve. Numerical evaluations reveal that R-SRG competes well with state-of-the-art Riemannian batch and stochastic gradient algorithms.

1 Introduction

Let $f : \mathcal{M} \rightarrow \mathbb{R}$ be a smooth real-valued function on a Riemannian manifold $\mathcal{M}$ (Absil et al., 2008). The target problem concerns a given model variable $w \in \mathcal{M}$, and is expressed as

$$\min_{w \in \mathcal{M}} f(w) := \frac{1}{n} \sum_{i=1}^{n} f_i(w), \qquad (1)$$

where n is the total number of elements. This problem has many applications; for example, in principal component analysis (PCA) and the subspace tracking problem (Balzano et al., 2010) on the Grassmann manifold. The low-rank matrix/tensor completion problem is a promising application concerning the manifold of fixed-rank matrices/tensors (Mishra & Sepulchre, 2014; Kasai & Mishra, 2016). The linear regression problem is also defined on the manifold of fixed-rank matrices (Meyer et al., 2011).

1 The University of Electro-Communications, Japan. 2 Kyoto University, Japan. 3 Microsoft, India. Correspondence to: Hiroyuki Kasai. Proceedings of the 35th International Conference on Machine Learning, Stockholm, Sweden, PMLR 80, 2018. Copyright 2018 by the author(s).

A popular choice of algorithms for solving (1) is the Riemannian gradient descent method, which calculates the Riemannian full gradient estimation, i.e., $\mathrm{grad} f(w) = \frac{1}{n}\sum_{i=1}^{n} \mathrm{grad} f_i(w)$, at every iteration, where $\mathrm{grad} f_i(w)$ is the Riemannian gradient on the Riemannian manifold $\mathcal{M}$ for the i-th sample. However, this estimation is computationally costly when n is extremely large. A popular alternative is the Riemannian stochastic gradient descent algorithm (R-SGD), which extends the stochastic gradient descent algorithm (SGD) in the Euclidean space to the Riemannian manifold (Bonnabel, 2013). As R-SGD calculates only $\mathrm{grad} f_{i}(w)$ for the i-th sample, the complexity per iteration is independent of the sample size n. Although R-SGD requires retraction and vector transport operations in every iteration, their calculation costs can be ignored when they are lower than that of $\mathrm{grad} f_i(w)$; this applies to many important Riemannian optimization problems, including the low-rank tensor completion problem and the Riemannian centroid problem treated in Section 5. Similar to SGD (Robbins & Monro, 1951), R-SGD is hindered by a slow convergence rate due to a decaying step size sequence. To accelerate the rate of R-SGD, the Riemannian stochastic variance reduced gradient algorithm (R-SVRG) (Sato et al., 2017; Zhang et al., 2016) has recently been proposed; this technique reduces the variance of the stochastic gradient by exploiting the finite-sum form of (1), based on recent progress in variance reduction methods in the Euclidean space (Johnson & Zhang, 2013; Roux et al., 2012; Shalev-Shwartz & Zhang, 2013; Defazio et al., 2014; Reddi et al., 2016).
One distinguishing feature is the reduction of the variance of noisy stochastic gradients by periodic full gradient estimations, which yields a linear convergence rate. R-SQN-VR has also recently been proposed, in which a stochastic quasi-Newton algorithm and the variance reduction method are combined (Kasai et al., 2018). Although it achieves practical improvements for ill-conditioned problems, its convergence rate is worse than that of R-SVRG. Both R-SVRG and R-SQN-VR transport vectors between two distant iterates on the manifold $\mathcal{M}$; thus, they must calculate a tangent vector connecting them at every iteration, and they suffer from additional, larger errors caused by vector transport compared with the parallel translation approach. Here, we propose a Riemannian stochastic recursive gradient algorithm (R-SRG) that does not rely on two distant iterates. This feature of R-SRG is relevant for practical implementations and is also interesting from a theoretical analysis perspective. The R-SRG counterpart in the Euclidean space is proposed in (Nguyen et al., 2017a;b). The advantage of R-SRG over R-SVRG is more notable in the Riemannian case than in the Euclidean case.

Contributions. Our contributions are summarized below.

- Our convergence analysis of R-SRG deals with both (strongly) retraction-convex (Definition 3.3 and Lemma 3.6) and non-convex functions. Our analysis considers computationally efficient retraction and vector transport instead of the more restrictive exponential mapping and parallel translation. This is more challenging than the analysis of R-SVRG (Zhang et al., 2016), which involves exponential mapping and parallel translation.
- The obtained total complexity is the first result with respect to retraction and vector transport. Sato et al. (2017) analyzed R-SVRG under retraction and vector transport, but did not provide the total complexity. Zhang et al. (2016) provided the total complexity only under exponential mapping and parallel translation. Here, we derive a key fact (Lemma 3.8): the constants of L-smoothness and $L_l$-Lipschitz continuity are not identical with respect to retraction. Addressing the $L_l/L$ ratio and the deviation parameter θ between vector transport and parallel translation (Lemma 3.7), we provide completely new complexities.
- The proposed algorithm has a linear convergence rate for (strongly) retraction-convex functions; converges at a sublinear rate in a single outer loop for general non-convex functions; and achieves a linear rate in the case of gradient-dominated functions (Definition 3.4) (Polyak, 1963; Reddi et al., 2016; Zhang et al., 2016).

The advantages of R-SRG are summarized below.

- In R-SRG, computationally efficient retraction and vector transport are employed. Whereas R-SVRG transports vectors between two distant iterates, R-SRG transports vectors from the previous iterate. Thus, calculation of the inverse of retraction is avoided and R-SRG is computationally more efficient.
- R-SRG alleviates the additional errors caused by vector transport between two distant points encountered by R-SVRG.
- A practical variant of R-SRG accelerates the convergence speed by exploiting the linear convergence of the modified stochastic gradient in the inner loop.
- Use of retraction and vector transport enables application of R-SRG to a wider range of manifolds. For example, unlike our analysis and algorithm, the algorithm proposed by Zhang et al.
(2016) cannot be applied to the Stiefel and fixed-rank manifolds, because these manifolds do not have closed-form expressions for parallel translation.

This paper is organized as follows. Section 2 presents details of the proposed R-SRG. Sections 3 and 4 summarize the preliminaries and present the convergence analysis, respectively. In Section 5, numerical comparisons with R-SGD and R-SVRG on two manifolds are given. The results suggest superior performance of R-SRG. The codes of R-SRG are implemented in the Matlab toolbox Manopt (Boumal et al., 2014) and are available at https://github.com/hiroyuki-kasai/RSOpt. Complete proofs of theorems and details of additional experiments are provided as supplementary material.

2 Riemannian stochastic recursive gradient algorithm (R-SRG)

We assume that the manifold $\mathcal{M}$ is endowed with a Riemannian metric structure; i.e., a smooth inner product $\langle \cdot, \cdot \rangle_w$ is associated with the tangent space $T_w \mathcal{M}$ for each $w \in \mathcal{M}$ (Absil et al., 2008). The norm $\|\cdot\|_w$ of a tangent vector in $T_w \mathcal{M}$ is the one associated with the Riemannian metric. The metric structure allows a systematic framework for optimization over manifolds. Conceptually, the constrained optimization problem (1) is translated into an unconstrained problem over $\mathcal{M}$.

2.1 R-SGD and R-SVRG

R-SGD: Given a starting point $w_0 \in \mathcal{M}$, R-SGD produces a sequence $\{w_t\}$ in $\mathcal{M}$ that converges to a first-order critical point of (1). Specifically, it updates w as $w_{t+1} = R_{w_t}(-\alpha_t \mathrm{grad} f_{i_t}(w_t))$, where $\alpha_t$ is the step size and $\mathrm{grad} f_{i_t}(w_t)$ is a Riemannian stochastic gradient for the $i_t$-th sample, which is a tangent vector at $w_t \in \mathcal{M}$. $\mathrm{grad} f_{i_t}(w_t)$ is an unbiased estimator of the Riemannian full gradient $\mathrm{grad} f(w_t)$, i.e., $\mathbb{E}[\mathrm{grad} f_{i_t}(w_t) \mid \mathcal{F}_t] = \mathrm{grad} f(w_t)$, where $\mathbb{E}[\cdot \mid \mathcal{F}_t]$ denotes the expected value taken with respect to the distribution of the random variable $i_t$, and $\mathcal{F}_t = \sigma(w_0, i_1, \ldots, i_{t-1})$ is the σ-algebra generated by $(w_0, i_1, \ldots, i_{t-1})$. The update moves from $w_t$ in the direction $-\mathrm{grad} f_{i_t}(w_t)$ with step size $\alpha_t$, remaining on $\mathcal{M}$. The mapping involved, denoted $R_w : T_w \mathcal{M} \rightarrow \mathcal{M} : \zeta \mapsto R_w(\zeta)$, is called a retraction at w; it maps the tangent space $T_w \mathcal{M}$ onto $\mathcal{M}$ with a local rigidity condition that preserves gradients at w. The exponential mapping Exp is an instance of retraction. A curve defined by a retraction R is called a retraction curve in this paper; it is a geodesic when R is the exponential mapping.

R-SVRG: R-SVRG has a double loop structure in which the s-th outer loop, called an epoch, has $m_s - 1$ inner iterations. Retaining a selected $w \in \mathcal{M}$ at the end of the (s-1)-th epoch as $\tilde{w}$, R-SVRG computes and stores the full Riemannian gradient $\mathrm{grad} f(\tilde{w})$ for this stored $\tilde{w}$. At the t-th inner iteration of the s-th epoch at $w_t$, picking the $i_t$-th sample, it computes the Riemannian stochastic gradients $\mathrm{grad} f_{i_t}(w_t)$ and $\mathrm{grad} f_{i_t}(\tilde{w})$ for this sample. Then, it calculates a modified stochastic gradient $\xi_t$ by modifying $\mathrm{grad} f_{i_t}(w_t)$ using both $\mathrm{grad} f(\tilde{w})$ and $\mathrm{grad} f_{i_t}(\tilde{w})$. Because these gradients belong to different tangent spaces, their simple addition is not well-defined, as Riemannian manifolds are not vector spaces. Consequently, after $\mathrm{grad} f_{i_t}(\tilde{w})$ and $\mathrm{grad} f(\tilde{w})$ are transported to $T_{w_t}\mathcal{M}$ by $\mathcal{T}_{\tilde{w}}^{w_t}$, the update is performed as $w_{t+1} = R_{w_t}(-\alpha_t \xi_t)$, where $\xi_t$ is set as

$$\xi_t = \mathrm{grad} f_{i_t}(w_t) - \mathcal{T}_{\tilde{w}}^{w_t}\big(\mathrm{grad} f_{i_t}(\tilde{w}) - \mathrm{grad} f(\tilde{w})\big).$$

Here, $\mathcal{T}_{w}^{z}$ or $\mathcal{T}_\eta$ represents a vector transport from w to z satisfying $R_w(\eta) = z$. A vector transport $\mathcal{T} : T\mathcal{M} \oplus T\mathcal{M} \rightarrow T\mathcal{M}, (\eta, \xi) \mapsto \mathcal{T}_\eta \xi$ is associated with R, where $\eta, \xi \in T_w \mathcal{M}$ and $w \in \mathcal{M}$. It holds that (i) $\mathcal{T}_\eta \xi \in T_{R_w(\eta)}\mathcal{M}$, (ii) $\mathcal{T}_0 \xi = \xi$, and (iii) $\mathcal{T}_\eta$ is a linear map. Parallel translation is a special instance of vector transport, which transports a vector along a curve γ from w to z. It is represented by $P(\gamma)_w^z$, or simply $P_w^z$ when γ is clear; additionally, $P(\gamma)_\eta$ or $P_\eta$ are also used.
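To make these updates concrete, the following minimal Python sketch (an illustration under stated assumptions, not the paper's Manopt implementation) instantiates one R-SGD step and one R-SVRG inner step on the unit sphere, using the normalization retraction and a projection-based vector transport; the toy cost $f_i(w) = -(x_i^T w)^2$ and all function names are illustrative choices.

```python
import numpy as np

# Toy finite-sum problem on the unit sphere: f_i(w) = -(x_i^T w)^2 (leading-eigenvector PCA).
rng = np.random.default_rng(0)
n, d = 1000, 50
X = rng.standard_normal((n, d))

def proj(w, v):                       # orthogonal projection onto the tangent space T_w M
    return v - (w @ v) * w

def grad_i(w, i):                     # Riemannian stochastic gradient grad f_i(w)
    return proj(w, -2.0 * (X[i] @ w) * X[i])

def full_grad(w):                     # Riemannian full gradient grad f(w)
    return proj(w, -2.0 * X.T @ (X @ w) / n)

def retract(w, xi):                   # retraction R_w(xi) = (w + xi) / ||w + xi||
    v = w + xi
    return v / np.linalg.norm(v)

w_t = retract(np.zeros(d), rng.standard_normal(d))
i_t, alpha = rng.integers(n), 1e-3

# R-SGD: w_{t+1} = R_{w_t}(-alpha_t grad f_{i_t}(w_t)).
w_next_sgd = retract(w_t, -alpha * grad_i(w_t, i_t))

# R-SVRG: the correction built at the distant anchor w_tilde must be transported into T_{w_t} M.
# The projection transport used here needs only w_t; a general transport T_{w_tilde}^{w_t} would
# also need the connecting tangent vector R_{w_tilde}^{-1}(w_t), i.e., an inverse retraction.
w_tilde = retract(w_t, 0.1 * rng.standard_normal(d))   # stand-in for the stored outer-loop iterate
g_full_tilde = full_grad(w_tilde)                      # full gradient stored once per epoch
xi_t = grad_i(w_t, i_t) - proj(w_t, grad_i(w_tilde, i_t) - g_full_tilde)
w_next_svrg = retract(w_t, -alpha * xi_t)
```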
2.2 Proposed R-SRG

Similar to R-SVRG, R-SRG has double loops. However, differently from R-SVRG, the inner loop of R-SRG generates the modified stochastic gradient $v_t$ by adding and subtracting gradients to and from the previous $v_{t-1}$. More specifically, the recursive update of the stochastic gradient is calculated as $v_0 = \mathrm{grad} f(w_0)$ and, for $t \geq 1$, as

$$v_t = \mathrm{grad} f_{i_t}(w_t) - \mathcal{T}_{w_{t-1}}^{w_t} \mathrm{grad} f_{i_t}(w_{t-1}) + \mathcal{T}_{w_{t-1}}^{w_t} v_{t-1}. \qquad (2)$$

Then, the iterate update is calculated as $w_{t+1} = R_{w_t}(-\alpha_t v_t)$ with step size $\alpha_t > 0$. Note that, while the (modified) stochastic gradients of both R-SGD and R-SVRG are unbiased estimators of the full gradient, that of R-SRG is not, i.e.,

$$\mathbb{E}[v_t \mid \mathcal{F}_t] = \mathrm{grad} f(w_t) - \mathcal{T}_{w_{t-1}}^{w_t} \mathrm{grad} f(w_{t-1}) + \mathcal{T}_{w_{t-1}}^{w_t} v_{t-1} \neq \mathrm{grad} f(w_t).$$

However, the total expectation $\mathbb{E}[v_t] = \mathbb{E}[\mathrm{grad} f(w_t)]$ holds. The algorithm is summarized in Algorithm 1; a code sketch is given at the end of this section.

Algorithm 1 R-SRG algorithm
Require: Update frequency m and sequence {α_t} with α_t > 0.
1: Initialize w̃_0.
2: for s = 1, 2, . . . do
3:   Store w_0 = w̃_{s-1}.
4:   Calculate the Riemannian full gradient grad f(w_0).
5:   Store v_0 = grad f(w_0).
6:   Update w_1 = R_{w_0}(-α_0 v_0).
7:   for t = 1, 2, . . . , m - 1 do
8:     Choose i_t ∈ {1, 2, . . . , n} uniformly at random.
9:     Calculate v_t by (2).
10:    Update w_{t+1} = R_{w_t}(-α_t v_t).
11:  end for
12:  Set w̃_s = w_t for randomly chosen t ∈ {0, 1, . . . , m}.
13: end for

Inspired by (Nguyen et al., 2017a), we propose a practical variant of R-SRG called R-SRG+. R-SRG has a linearly convergent $v_t$ for retraction-convex functions (Proposition 4.4); thus, we propose an adaptive length for the inner loop size m. In detail, we stop the inner loop at $t = t_{\mathrm{last}} < m$ when the norm of $v_t$ decreases below a threshold fraction of that of $v_0$, and proceed to the next outer loop. The threshold control parameter is denoted ϑ (0 ≤ ϑ ≤ 1); the case ϑ = 0 is identical to R-SRG. We also adopt a practical selection of $\tilde{w}_s$ as $w_{t_{\mathrm{last}}+1}$. The proposed variant eliminates the need for careful selection of m in R-SRG. Note that this approach is inapplicable to R-SVRG because such a linearly convergent decrease of the modified stochastic gradient $\xi_t$ is absent.
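A minimal, self-contained sketch of Algorithm 1 (with the optional R-SRG+ early-stopping rule) is shown below, again on the unit sphere with the normalization retraction and a projection-based vector transport analogous to the cheap transport used in Section 5.2; the toy cost, the parameter defaults, and the threshold name `vartheta` are illustrative assumptions rather than the paper's Manopt code.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 1000, 50
X = rng.standard_normal((n, d))       # toy data for f_i(w) = -(x_i^T w)^2 on the sphere

def proj(w, v):                       # projection onto T_w M (used here as the vector transport)
    return v - (w @ v) * w

def grad_i(w, i):
    return proj(w, -2.0 * (X[i] @ w) * X[i])

def full_grad(w):
    return proj(w, -2.0 * X.T @ (X @ w) / n)

def retract(w, xi):                   # R_w(xi) = (w + xi) / ||w + xi||
    v = w + xi
    return v / np.linalg.norm(v)

def r_srg(w_init, alpha=1e-3, m=1000, outer=20, vartheta=0.0):
    """R-SRG sketch; vartheta > 0 activates the R-SRG+ adaptive inner-loop length."""
    w_tilde = w_init
    for s in range(outer):
        w_prev = w_tilde                      # w_0 = w_tilde_{s-1}
        v = full_grad(w_prev)                 # v_0 = grad f(w_0)
        v0_norm = np.linalg.norm(v)
        w_cur = retract(w_prev, -alpha * v)   # w_1 = R_{w_0}(-alpha_0 v_0)
        for t in range(1, m):
            i = rng.integers(n)
            # recursive update (2): grad f_i(w_{t-1}) and v_{t-1} live in T_{w_{t-1}} M and are
            # transported to T_{w_t} M (here by projection) before they are combined
            v = grad_i(w_cur, i) - proj(w_cur, grad_i(w_prev, i)) + proj(w_cur, v)
            w_prev, w_cur = w_cur, retract(w_cur, -alpha * v)
            if vartheta > 0.0 and np.linalg.norm(v) <= vartheta * v0_norm:
                break                         # R-SRG+: stop the inner loop once ||v_t|| is small
        w_tilde = w_cur                       # practical choice: w_tilde_s = w_{t_last + 1}
    return w_tilde

w_hat = r_srg(retract(np.zeros(d), rng.standard_normal(d)), vartheta=0.05)
print("final gradient norm:", np.linalg.norm(full_grad(w_hat)))
```

The printed full-gradient norm should become small as the outer loops proceed, since the stationary points of this toy cost are eigenvectors of the sample covariance.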
3 Preliminaries

For the convergence analysis, we derive the retraction $L_l$-Lipschitz lemma (Lemma 3.8), assuming a bound on the Hessian of f along a retraction curve and exploiting the retraction L-smooth lemma (Lemma 3.5). First, we briefly present definitions and assumptions, followed by the essential lemmas.

3.1 Definitions and assumptions

We first summarize some definitions. Let R be a retraction. For linear transformations in tangent spaces, we use the operator norm with respect to the inner product from the Riemannian metric.

Definition 3.1 (Upper-Hessian bounded). f is said to be upper-Hessian bounded in $U \subset \mathcal{M}$ with respect to R if there exists a constant L > 0 such that $\frac{d^2 f(R_w(t\eta))}{dt^2} \leq L$ for all $w \in U$ and $\eta \in T_w\mathcal{M}$ with $\|\eta\|_w = 1$, and all t such that $R_w(\tau\eta) \in U$ for all $\tau \in [0, t]$.

Definition 3.2 (Lower-Hessian bounded). f is said to be lower-Hessian bounded in $U \subset \mathcal{M}$ with respect to R if there exists a constant µ > 0 such that $\mu \leq \frac{d^2 f(R_w(t\eta))}{dt^2}$ for all $w \in U$ and $\eta \in T_w\mathcal{M}$ with $\|\eta\|_w = 1$, and all t such that $R_w(\tau\eta) \in U$ for all $\tau \in [0, t]$.

Definition 3.3 (Retraction convex (Huang et al., 2015b)). f is retraction convex in $S \subset \mathcal{M}$ with respect to R if, for all $w \in S$ and $\eta \in T_w\mathcal{M}$ with $\|\eta\|_w = 1$, the function $t \mapsto f(R_w(t\eta))$ is convex for all t that satisfy $R_w(\tau\eta) \in S$ for all $\tau \in [0, t]$.

Definition 3.4 (τ-gradient dominated (Polyak, 1963)). f is τ-gradient dominated in $U \subset \mathcal{M}$ if there exists a constant τ > 0 such that, for $w \in U$, $f(w) - f(w^*) \leq \tau \|\mathrm{grad} f(w)\|^2_w$, where $w^*$ is a global minimizer of f.

We make the following assumptions about (1).

Assumption 1. For problem (1), we assume the following:
(1.1) f and its components $f_1, f_2, \ldots, f_n$ are twice continuously differentiable.
(1.2) $\mathcal{T}$ is isometric on $\mathcal{M}$, i.e., $\langle \mathcal{T}_\xi\eta, \mathcal{T}_\xi\zeta\rangle_{R_w(\xi)} = \langle \eta, \zeta\rangle_w$ holds for any $w \in \mathcal{M}$ and $\xi, \eta, \zeta \in T_w\mathcal{M}$.
(1.3) The sequence generated by Algorithm 1 is continuously contained in a sufficiently small neighborhood $U \subset \mathcal{M}$ of an optimal solution $w^*$. There exists a constant $c_0 > 0$ such that $\mathcal{T}$ satisfies $\|\mathcal{T}_\eta - \mathcal{T}_{R_\eta}\| \leq c_0\|\eta\|_w$ and $\|\mathcal{T}_\eta^{-1} - \mathcal{T}_{R_\eta}^{-1}\| \leq c_0\|\eta\|_w$ for all $w, z \in U$ with $R_w(\eta) = z$, where $\mathcal{T}_R$ denotes the differentiated retraction, i.e., $\mathcal{T}_{R_\zeta}\xi = \mathrm{D}R_w(\zeta)[\xi]$ with $\xi \in T_w\mathcal{M}$.
(1.4) The norms of the Riemannian gradient and Hessian are bounded, i.e., there exist constants $C_g > 0$ and $C_h > 0$ such that $\|\mathrm{grad} f_i(w)\|_w \leq C_g$ and $\|\mathrm{Hess} f_i(w)\| \leq C_h$ for $w \in U$.
(1.5) The neighborhood U is a totally retractive neighborhood and totally normal neighborhood of $w^*$ (see (Huang et al., 2015b)).
(1.6) There exists a constant $c_R > 0$ such that $\|\mathrm{Exp}_w^{-1}(z) - R_w^{-1}(z)\|_w \leq c_R\|R_w^{-1}(z)\|^2_w$ for all $w, z \in U$.

Note that Assumption (1.1) is standard and (1.2) is guaranteed by the specific vector transport in (Huang et al., 2015b). Further, (1.3) is guaranteed when the vector transport is $C^0$, as derived from the Taylor expansion. Also, (1.4) holds when the manifold is compact, like the Grassmann manifold, or through slight modification of the objective function and the algorithm. For (1.5), see (Huang et al., 2015b) for details. Since R is a first-order approximation of Exp, (1.6) can also be considered natural.

3.2 Essential lemmas

Here, we present the lemmas essential for the convergence analysis under Assumption 1. The complete proofs are in the supplementary material.

Lemma 3.5 (Retraction L-smooth). Suppose that Assumptions (1.1) and (1.5) hold and that f is upper-Hessian bounded in U. Then, for all $w, z \in U$ and the constant L > 0 in Definition 3.1, we have
$$f(z) \leq f(w) + \langle \mathrm{grad} f(w), \xi\rangle_w + \frac{L}{2}\|\xi\|^2_w,$$
where $\xi \in T_w\mathcal{M}$ and $R_w(\xi) = z$. Here, such an f is called retraction L-smooth with respect to R.

Lemma 3.6 (Retraction µ-strongly convex (Huang et al., 2015b)). Suppose that Assumptions (1.1) and (1.5) hold and that f is lower-Hessian bounded in U. Then, for µ > 0 from Definition 3.2 and for all $w, z \in U$,
$$f(z) \geq f(w) + \langle \mathrm{grad} f(w), \xi\rangle_w + \frac{\mu}{2}\|\xi\|^2_w,$$
where $\xi \in T_w\mathcal{M}$ and $R_w(\xi) = z$. Here, such an f is called retraction µ-strongly convex with respect to R.

We also introduce the following lemma to quantify the difference between parallel translation and vector transport.

Lemma 3.7 (Difference between parallel translation and vector transport (Huang et al., 2015b, Lemma 3.5), (Huang et al., 2015a, Lemma 6)). Let $\mathcal{T} \in C^0$ be a vector transport associated with the same retraction R as the parallel translation $P \in C^\infty$. Under Assumption (1.3), there exists a constant θ > 0 such that, for all $w, z \in U$,
$$\|\mathcal{T}_\eta\xi - P_\eta\xi\|_z \leq \theta\|\xi\|_w\|\eta\|_w, \qquad \|\mathcal{T}_\eta^{-1}\chi - P_\eta^{-1}\chi\|_w \leq \theta\|\chi\|_z\|\eta\|_w,$$
where $\xi, \eta \in T_w\mathcal{M}$, $\chi \in T_z\mathcal{M}$, and $R_w(\eta) = z$.
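As a concrete illustration of Lemma 3.7 (a sketch under our assumptions, not part of the paper), the following snippet compares, on the unit sphere with the normalization retraction, the projection-based vector transport with the closed-form parallel translation along the connecting great circle (which is the retraction curve in this case), and reports the ratio $\|\mathcal{T}_\eta\xi - P_\eta\xi\|_z/(\|\xi\|_w\|\eta\|_w)$. The ratio stays bounded (here it even decays) as $\|\eta\|_w \to 0$, consistent with the lemma.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 20
w = rng.standard_normal(d); w /= np.linalg.norm(w)

def tangent(w, v):                    # project v onto T_w of the unit sphere
    return v - (w @ v) * w

def retract(w, eta):                  # R_w(eta) = (w + eta) / ||w + eta||; traces a great circle
    z = w + eta
    return z / np.linalg.norm(z)

def transport_proj(z, xi):            # projection-based vector transport into T_z
    return xi - (z @ xi) * z

def parallel_transport(w, z, xi):     # parallel translation along the great circle from w to z
    return xi - ((z @ xi) / (1.0 + w @ z)) * (w + z)

xi = tangent(w, rng.standard_normal(d))
eta_dir = tangent(w, rng.standard_normal(d))
eta_dir /= np.linalg.norm(eta_dir)

for step in [1.0, 0.3, 0.1, 0.03, 0.01]:
    eta = step * eta_dir
    z = retract(w, eta)
    diff = np.linalg.norm(transport_proj(z, xi) - parallel_transport(w, z, xi))
    print(f"||eta|| = {step:5.2f}   deviation / (||xi|| ||eta||) = {diff / (np.linalg.norm(xi) * step):.4f}")
```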
Finally, we derive the following key lemma.

Lemma 3.8 (Retraction $L_l$-Lipschitz). Let R be a retraction on $\mathcal{M}$. Suppose that Assumptions (1.1) and (1.3)-(1.5) hold. Then, there exists a constant $L_l > 0$ such that
$$\|P(\gamma)_z^w \mathrm{grad} f(z) - \mathrm{grad} f(w)\|_w \leq L_l\|\eta\|_w$$
for all $w, z \in U$, where $L_l = C_h(1 + C_\eta\theta)$, with $C_\eta$ being an upper bound on the norm of η for $\eta \in T_w\mathcal{M}$. Note that θ is the constant in Lemma 3.7; γ is the curve $\gamma(t) := R_w(t\eta)$ defined by R on $\mathcal{M}$ with $\gamma(0) = w$ and $\gamma(1) = z$; and $P(\gamma)_z^w(\cdot)$ is the parallel translation operator along γ from z to w.

L and $L_l$ are the counterparts of the constants of L-smoothness and L-Lipschitz continuity in the geodesically L-smooth case and the L-smooth Euclidean case. However, it should be specifically emphasized that L and $L_l$ are not identical in the retraction curve case.

Lemma 3.9. Suppose that Assumptions (1.3) and (1.6) hold. Then, there exists a constant ν > 0 such that
$$\langle \xi, \mathrm{Exp}_w^{-1}(z)\rangle_w \leq \langle \xi, R_w^{-1}(z)\rangle_w + \nu\|R_w^{-1}(z)\|^2_w$$
for all $w, z \in U$, where $\xi \in T_w\mathcal{M}$ with $\|\xi\|_w \leq 2C_g$, and $C_g$ is the constant in Assumption (1.4).

4 Convergence analysis

This section presents convergence analyses for both retraction-convex and non-convex functions under retraction and vector transport operations. To this end, we derive the bound on the number of iterations T needed to achieve an ϵ-accurate solution in terms of calls to the incremental first-order oracle (Agarwal & Bottou, 2015). In this particular case, we use the bound to guarantee that the expected squared norm of a stochastic gradient satisfies $\mathbb{E}[\|\mathrm{grad} f(w_T)\|^2] \leq \epsilon$. Note that we derive the total complexity with regard to standard parameters such as n, ϵ, and L, and also $\rho_l = L_l/L$ in Lemma 3.8 and θ in Lemma 3.7 for this purpose.

4.1 Retraction-convex functions

We further suppose that Assumption B.3 holds, which is equivalent to f being L-smooth and convex in the Euclidean case (see the supplementary material).

Theorem 4.1 (Convergence analysis within a single outer loop on retraction-convex functions). Let $\mathcal{M}$ be a Riemannian manifold and $w^* \in \mathcal{M}$ be a minimum of f. Suppose that Assumptions 1 and B.3 hold and that f is upper-Hessian bounded. Consider Algorithm 1 with α that satisfies α < 2/L and $(2((2L_l + 2\theta C_g + L)\theta C_g + \nu L)m - L^2)\alpha^2 + 3L\alpha - 2 \leq 0$. Then, for any s ≥ 1,
$$\mathbb{E}[\|\mathrm{grad} f(\tilde{w}_s)\|^2_{\tilde{w}_s}] \leq \frac{2}{\alpha(m+1)}\mathbb{E}[f(\tilde{w}_{s-1}) - f(w^*)] + \frac{\alpha L}{2-\alpha L}\mathbb{E}[\|\mathrm{grad} f(\tilde{w}_{s-1})\|^2_{\tilde{w}_{s-1}}].$$

Proof. The complete proof is in the supplementary material. The proof proceeds as follows: using Lemmas 3.5 and 3.8, and exploiting Lemma 3.7, a bound on $\sum_{t=0}^{m}\mathbb{E}[\|\mathrm{grad} f(w_t) - v_t\|^2_{w_t}]$ is derived. Then, conditioning α to eliminate the additional terms caused by vector transport, the bound on $\mathbb{E}[\|\mathrm{grad} f(\tilde{w}_s)\|^2_{\tilde{w}_s}]$ is derived from Lemma 3.5.

Suppose that θ in Lemma 3.7 and ν in Lemma 3.9 are sufficiently close to zero, i.e., $\mathcal{T}$ and R are close to P and Exp, respectively. This reasonable assumption yields $\beta < L^2$, where $\beta := 2((2L_l + 2\theta C_g + L)\theta C_g + \nu L)m$, and the step-size condition becomes $(L^2-\beta)\alpha^2 - 3L\alpha + 2 \geq 0$. The smaller root of $(L^2-\beta)\alpha^2 - 3L\alpha + 2 = 0$, namely $\alpha_l := \frac{3L - \sqrt{L^2+8\beta}}{2(L^2-\beta)}$, is smaller than 1/L, and the larger root exceeds 2/L. Theorem 4.1 with α ≤ 1/L implies
$$\mathbb{E}[\|\mathrm{grad} f(\tilde{w}_s)\|^2_{\tilde{w}_s}] \leq \frac{2}{\alpha(m+1)}\mathbb{E}[f(\tilde{w}_{s-1}) - f(w^*)] + \mathbb{E}[\|\mathrm{grad} f(\tilde{w}_{s-1})\|^2_{\tilde{w}_{s-1}}].$$
Consequently, selecting $\alpha = \bar{\alpha} := 2\alpha_l/3$, which satisfies $\bar{\alpha} \leq 1/L$, we have
$$\mathbb{E}[\|\mathrm{grad} f(\tilde{w}_s)\|^2_{\tilde{w}_s}] \leq \frac{6(L^2-\beta)}{(3L-\sqrt{L^2+8\beta})(m+1)}\mathbb{E}[f(\tilde{w}_{s-1}) - f(w^*)] + \mathbb{E}[\|\mathrm{grad} f(\tilde{w}_{s-1})\|^2_{\tilde{w}_{s-1}}].$$
This result means that the convergence rate within a single outer loop for retraction-convex functions is sublinear.
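As a quick check of the claim about the roots (this verification is ours and is not spelled out in the paper), write $q(\alpha) := (L^2-\beta)\alpha^2 - 3L\alpha + 2$ with $0 \leq \beta < L^2$ and evaluate it at $\alpha = 1/L$ and $\alpha = 2/L$:

$$q\!\left(\tfrac{1}{L}\right) = \frac{L^2-\beta}{L^2} - 3 + 2 = -\frac{\beta}{L^2} \leq 0,
\qquad
q\!\left(\tfrac{2}{L}\right) = \frac{4(L^2-\beta)}{L^2} - 6 + 2 = -\frac{4\beta}{L^2} \leq 0.$$

Since the leading coefficient $L^2-\beta$ is positive, $q(\alpha) \leq 0$ exactly between its two roots, so both 1/L and 2/L lie between them; hence the smaller root $\alpha_l$ satisfies $\alpha_l \leq 1/L$, and the larger root is at least 2/L, as stated.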
Next, the convergence rates with multiple outer steps are derived based on this bound in Theorem 4.1.

Theorem 4.2 (Convergence analysis on retraction-convex functions). Suppose that all the conditions in Theorem 4.1 hold, and define $\delta_k = \frac{2}{\alpha(m+1)}\mathbb{E}[f(\tilde{w}_k) - f(w^*)]$ for $k = 0, 1, \ldots, s-1$, and $\delta = \max_{0\leq k\leq s-1}\delta_k$. We also define $\Delta = \delta\big(1 + \frac{\alpha L}{2(1-\alpha L)}\big)$ and $\varphi = \frac{\alpha L}{2-\alpha L}$. Then, we have
$$\mathbb{E}[\|\mathrm{grad} f(\tilde{w}_s)\|^2_{\tilde{w}_s}] \leq \varphi^s\big(\|\mathrm{grad} f(\tilde{w}_0)\|^2_{\tilde{w}_0} - \Delta\big) + \Delta.$$

The proof of Theorem 4.2 follows an approach similar to that of (Nguyen et al., 2017a). Suppose $\beta \leq L^2/3$ and that we select Δ = ϵ/4, $\varphi \leq \frac{2L^2}{3(L^2-\beta)}$ (with $\alpha = \bar{\alpha} := 2\alpha_l/3 = \frac{3L-\sqrt{L^2+8\beta}}{3(L^2-\beta)}$), and m = O(1/ϵ) in Theorem 4.2. As each inner loop evaluates n + 2m gradients, the total complexity with respect to ϵ-accuracy is $O((n + \frac{1}{\epsilon})\log(\frac{1}{\epsilon})/\log(c(1-\beta/L^2)))$, where c > 1 is a constant and $O(\beta/L^2) = O(\rho_l\theta/L)$.

Theorem 4.3 (Convergence analysis on retraction µ-strongly convex functions). Suppose that all the conditions in Theorem 4.1 hold and further assume that f is lower-Hessian bounded, where µ > 0 satisfies the conditions in Definition 3.2 for both the retraction R and the exponential mapping Exp. Define $\sigma_m := \frac{1}{\mu\alpha(m+1)} + \frac{\alpha L}{2-\alpha L}$. Then, choosing α and m to satisfy $\sigma_m < 1$, we have
$$\mathbb{E}[\|\mathrm{grad} f(\tilde{w}_s)\|^2_{\tilde{w}_s}] \leq (\sigma_m)^s\|\mathrm{grad} f(\tilde{w}_0)\|^2_{\tilde{w}_0}.$$

Let κ := L/µ be the condition number of f. Suppose $\beta \leq L^2/5$ and choose $\alpha = \bar{\alpha} := \alpha_l/2$ and $m = 6.5\kappa$ for $T := \log(\|\mathrm{grad} f(\tilde{w}_0)\|^2_{\tilde{w}_0}/\epsilon)/\log\big(\frac{5(L^2-\beta)}{4L^2}\big)$ outer iterations in Theorem 4.3. Then, the total complexity with respect to ϵ-accuracy is $O((n+\kappa)\log(\frac{1}{\epsilon})/\log(c(1-\beta/L^2)))$, where c > 1 and $O(\beta/L^2) = O(\rho_l\theta/L)$. The proof of Theorem 4.3 is also similar to that of (Nguyen et al., 2017a).

Finally, we show the linear convergence of $v_t$ in Algorithm 1 when f is retraction µ-strongly convex.

Proposition 4.4 (Linear convergence of $v_t$ in the inner loop). Suppose that Assumption B.4 and all the conditions in Theorem 4.3 except those on α hold. Consider $v_t$ in Algorithm 1 with a constant step size α < 2/L. Then, there exist a function ϕ(α) and constants $a_0, a_1 > 0$ such that, for t ≥ 1, we have
$$\mathbb{E}[\|v_t\|^2_{w_t}] \leq \left[1 - \left(\frac{2}{\alpha L} - 1\right)(a_0\mu - a_1 C_g)^2\alpha^2 + \phi(\alpha)\right]^t\mathbb{E}[\|\mathrm{grad} f(w_0)\|^2_{w_0}].$$

When θ and ν are sufficiently close to zero, ϕ(α) is close to zero regardless of α. This result then indicates that $\|v_t\|^2_{w_t}$ converges linearly in expectation with rate $(1 - 1/\kappa^2)$ when selecting α = O(1/L).

4.2 Non-convex functions

We next derive the convergence rate for non-convex functions.

Theorem 4.5 (Convergence analysis within a single outer loop on non-convex functions). Let $w^* \in \mathcal{M}$ be a minimizer of f, and suppose that Assumption 1 holds and that f is upper-Hessian bounded. Consider Algorithm 1 with a constant step size $\alpha \leq \frac{2}{L + \sqrt{L^2 + 8m(L_l^2 + C_g^2\theta^2)}}$. Then, for $\tilde{w} = w_{\tilde{t}}$,
$$\mathbb{E}[\|\mathrm{grad} f(\tilde{w})\|^2_{\tilde{w}}] \leq \frac{2}{\alpha(m+1)}[f(w_0) - f(w^*)],$$
where $\tilde{t}$ is randomly chosen from {0, 1, . . . , m}.

Proof. The complete proof is in the supplementary file; its strategy is similar to those of Theorem 4.1 and (Nguyen et al., 2017b). Conditioning α to eliminate the additional terms caused by retraction and vector transport, $\mathbb{E}[\|\mathrm{grad} f(\tilde{w})\|^2_{\tilde{w}}]$ is upper bounded using Lemma 3.5.

We suppose that θ is sufficiently close to zero. Then, selecting α as its upper bound in Theorem 4.5 yields
$$\mathbb{E}[\|\mathrm{grad} f(\tilde{w})\|^2_{\tilde{w}}] \leq \frac{L + \sqrt{L^2 + 8m(L_l^2 + C_g^2\theta^2)}}{m+1}\mathbb{E}[f(w_0) - f(w^*)].$$
Hence, when selecting $m = O((L^2\rho_l^2 + \theta^2)/\epsilon^2)$ and $\alpha = O(1/\sqrt{m(L^2\rho_l^2 + \theta^2)})$, we have $\mathbb{E}[\|\mathrm{grad} f(w_{\tilde{t}})\|^2_{w_{\tilde{t}}}] \leq \epsilon$; that is, the expected squared gradient norm decreases at the sublinear rate $O(\sqrt{(L^2\rho_l^2 + \theta^2)/m})$ in m. Hence, the total complexity for ϵ-accuracy is $O(n + (L^2\rho_l^2 + \theta^2)/\epsilon^2)$.
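To spell out how the inner-loop length is chosen (our arithmetic; constants such as $C_g$ are absorbed into the O-notation), note that the coefficient in the bound above satisfies

$$\frac{L + \sqrt{L^2 + 8m(L_l^2 + C_g^2\theta^2)}}{m+1}
\;\leq\; \frac{2L}{m+1} + \sqrt{\frac{8(L_l^2 + C_g^2\theta^2)}{m+1}}
\;=\; O\!\left(\sqrt{\frac{L^2\rho_l^2 + \theta^2}{m}}\right).$$

Requiring this quantity to be at most ϵ gives the inner-loop length $m = O((L^2\rho_l^2 + \theta^2)/\epsilon^2)$ stated above, and substituting this m back into the step-size bound gives $\alpha = O(1/\sqrt{m(L^2\rho_l^2 + \theta^2)})$.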
Finally, the convergence rate with multiple outer steps is derived.

Theorem 4.6 (Convergence analysis on non-convex functions). Assume that all the conditions in Theorem 4.5 hold and that f is τ-gradient dominated. Consider Algorithm 1 with α as in Theorem 4.5 and assume $\bar{\sigma}_m := \frac{2\tau}{\alpha(m+1)} < 1$, i.e., $\alpha(m+1)/2 > \tau$. Then, we have
$$\mathbb{E}[\|\mathrm{grad} f(\tilde{w}_s)\|^2_{\tilde{w}_s}] \leq (\bar{\sigma}_m)^s\|\mathrm{grad} f(\tilde{w}_0)\|^2_{\tilde{w}_0}.$$

Here, we choose α as the upper bound in Theorem 4.5. We need $m = O(\tau^2(L^2\rho_l^2 + \theta^2))$ to achieve $\alpha(m+1)/2 > \tau$, and $s = O(\log(1/\epsilon))$ to achieve ϵ-accuracy in the outer loop. Consequently, the total complexity with respect to ϵ-accuracy is $O((n + \tau^2(L^2\rho_l^2 + \theta^2))\log(\frac{1}{\epsilon}))$.

4.3 Discussions

We here discuss the total complexity of R-SRG as summarized in Table 1, addressing the terms specifically related to retraction and vector transport, i.e., $\rho_l\,(= L_l/L)$, θ, and $\beta/L^2$. It should be noted that, as previously, we restrict the discussion to the case where $L_l$ and θ are sufficiently close to L and zero, respectively. Note that c in Table 1 is a constant satisfying $\beta \leq (1 - 1/c)L^2$, so that $c(1-\beta/L^2) \geq 1$.

Table 1. Comparison of total complexity.

| Function type | R-SRG (Proposed) |
| --- | --- |
| Retraction convex (vector transport) | $O((n + \frac{1}{\epsilon})\log(\frac{1}{\epsilon})/\log(c(1-\beta/L^2)))$ |
| Retraction µ-strongly convex (vector transport) | $O((n + \kappa)\log(\frac{1}{\epsilon})/\log(c(1-\beta/L^2)))$ |
| Non-convex (vector transport) | $O(n + \frac{L^2\rho_l^2+\theta^2}{\epsilon^2})$ |
| τ-gradient dominated (vector transport) | $O((n + \tau^2(L^2\rho_l^2+\theta^2))\log(\frac{1}{\epsilon}))$ |

| Function type | R-SRG (Proposed) | R-SVRG (Zhang et al., 2016) |
| --- | --- | --- |
| Geodesically convex (parallel translation) | $O((n + \frac{1}{\epsilon})\log(\frac{1}{\epsilon}))$ | – |
| Geodesically µ-strongly convex (parallel translation) | $O((n + \kappa)\log(\frac{1}{\epsilon}))$ | $O((n + \zeta\kappa^2)\log(\frac{1}{\epsilon}))$ |
| Non-convex (parallel translation) | $O(n + \frac{L^2}{\epsilon^2})$ | $O(n + \zeta^{\frac{1}{2}}n^{\frac{2}{3}}/\epsilon)$ |
| τ-gradient dominated (parallel translation) | $O((n + \tau^2 L^2)\log(\frac{1}{\epsilon}))$ | $O((n + L\tau\zeta^{\frac{1}{2}}n^{\frac{2}{3}})\log(\frac{1}{\epsilon}))$ |

Impact on complexity due to retraction and vector transport: Clearly, when a retraction curve deviates from the geodesic, the value of $\rho_l$ increases, deviating from 1. Furthermore, θ deviates from 0 when the vectors produced by the vector transport deviate from those of parallel translation. In those cases, the total complexity increases drastically. These deviations influence the non-convex cases more strongly than the convex cases. In the opposite case, however, the complexities retain values similar to those for exponential mapping and parallel translation. Finally, the derived complexity is a worst-case estimate for the use of retraction and vector transport compared with the case of exponential mapping and parallel translation. Therefore, we leave the investigation of retractions and vector transports that are more efficient than exponential mapping and parallel translation for future research.

Comparison with R-SVRG (Zhang et al., 2016): Additionally, we compare R-SRG with R-SVRG when exponential mapping and parallel translation are used. In this case, we have $\rho_l = 1$, θ = 0, and $\beta/L^2 = 0$; the results are also summarized as special cases in Table 1. Although R-SVRG has a curvature parameter ζ (≥ 1), R-SRG is superior to R-SVRG in the geodesically µ-strongly convex case. The convergence of R-SRG in a single outer loop for non-convex functions is inferior to that of R-SVRG in terms of ϵ, but superior with regard to n. R-SRG is superior to R-SVRG for τ-gradient dominated functions.

5 Numerical comparisons

In this section, we compare R-SRG(+) with R-SGD using a decaying step size sequence and with R-SVRG using a fixed step size. The decaying step size sequence is $\alpha_k = \alpha(1 + \alpha\lambda\lfloor k/m\rfloor)^{-1}$, where k is the number of inner iterations and $\lfloor\cdot\rfloor$ denotes the floor function. As references, we also perform comparisons with two Riemannian batch methods with backtracking line search, R-SD and R-CG,
which are the steepest descent and conjugate gradient algorithms on Riemannian manifolds, respectively (Absil et al., 2008). All experiments are executed in Matlab on a 4.0 GHz Intel Core i7 PC with 32 GB RAM, and are stopped when the gradient norm falls below $10^{-8}$ or a predefined maximum number of iterations is reached. All hyperparameters are selected by cross-validation. The supplementary material presents additional results.

5.1 Riemannian centroid computation on the symmetric positive-definite (SPD) manifold

We consider the problem of computing the Riemannian centroid on the $d \times d$ symmetric positive-definite (SPD) manifold $\mathbb{S}_{++}^d$, which frequently appears in computer vision problems such as visual object categorization and pose categorization (Jayasumana et al., 2015). Details of the SPD manifold are in the supplementary material. Given n points $\{X_1, \ldots, X_n\} \subset \mathbb{S}_{++}^d$, the Riemannian centroid is the solution of the problem
$$\min_{C \in \mathbb{S}_{++}^d} \frac{1}{2n}\sum_{i=1}^{n}(\mathrm{dist}(C, X_i))^2,$$
where $\mathrm{dist}(a, b) = \|\log(a^{-1/2} b\, a^{-1/2})\|_F$ is the distance along the corresponding geodesic between two points $a, b \in \mathbb{S}_{++}^d$ with respect to the affine-invariant Riemannian metric (AIRM). The Riemannian gradient of the loss function is computed as $-\frac{1}{n}\sum_{i=1}^{n}\log(X_i C^{-1})\,C$.
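The cost and gradient above can be evaluated with a few lines of linear algebra; the following sketch (our illustration, independent of the paper's Manopt implementation and of the specific retraction and vector transport of (Huang et al., 2015b)) computes the AIRM distance and the Riemannian gradient, written here as $-\frac{1}{n}\sum_i \mathrm{Log}_C(X_i)$ with $\mathrm{Log}_C(X) = C^{1/2}\log(C^{-1/2} X C^{-1/2})C^{1/2}$, via eigendecompositions of SPD matrices.

```python
import numpy as np

def spd_fun(S, fun):
    """Apply a scalar function to the eigenvalues of a symmetric positive-definite matrix."""
    lam, Q = np.linalg.eigh(S)
    return (Q * fun(lam)) @ Q.T

def airm_dist(C, X):
    """dist(C, X) = || log(C^{-1/2} X C^{-1/2}) ||_F (affine-invariant Riemannian metric)."""
    C_isqrt = spd_fun(C, lambda l: 1.0 / np.sqrt(l))
    return np.linalg.norm(spd_fun(C_isqrt @ X @ C_isqrt, np.log), 'fro')

def centroid_cost_and_grad(C, Xs):
    """Cost (1/2n) sum_i dist(C, X_i)^2 and Riemannian gradient -(1/n) sum_i Log_C(X_i)."""
    C_sqrt = spd_fun(C, np.sqrt)
    C_isqrt = spd_fun(C, lambda l: 1.0 / np.sqrt(l))
    n = len(Xs)
    cost, grad = 0.0, np.zeros_like(C)
    for X in Xs:
        L = spd_fun(C_isqrt @ X @ C_isqrt, np.log)   # log(C^{-1/2} X C^{-1/2})
        cost += 0.5 * np.linalg.norm(L, 'fro') ** 2 / n
        grad -= (C_sqrt @ L @ C_sqrt) / n            # Log_C(X) = C^{1/2} log(C^{-1/2} X C^{-1/2}) C^{1/2}
    return cost, grad

# Small sanity check with random SPD matrices.
rng = np.random.default_rng(0)
def random_spd(d):
    A = rng.standard_normal((d, d))
    return A @ A.T + d * np.eye(d)

Xs = [random_spd(5) for _ in range(10)]
print(centroid_cost_and_grad(np.eye(5), Xs)[0])
```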
We generate synthetic datasets and initialize randomly, after setting the maximum number of iterations to 20 for R-SVRG and R-SRG(+), and 60 for all others. α is tuned from $\{10^{-5}, \ldots, 10^{-1}\}$. m and the batch size are n and 10, respectively. ϑ = 0.05 is selected for R-SRG+. Our implementation for this particular problem uses the retraction and vector transport of (Huang et al., 2015b), which satisfy the requirements detailed in Section 4.

[Figure 1. Riemannian centroid problem on the SPD manifold: (a) optimality gap vs. number of gradient evaluations; (b) optimality gap vs. processing time; (c) norm of gradient vs. number of gradient evaluations; (d), (e) influence of ϑ on R-SRG+; (f) norm of the modified stochastic gradient.]

Figures 1(a)-(c), respectively, show the two optimality gap results in terms of the number of gradient evaluations and the processing time, and the norm of the gradient, when n = 10000 and d = 30. The optimality gap indicates the performance against the minimum loss, which is calculated in advance by R-CG with higher precision. As the figures show, R-SRG and R-SVRG outperform the batch methods, R-SD and R-CG, even in terms of processing time. Further, R-SRG is competitive with R-SVRG, and R-SRG+ outperforms the others, especially in terms of processing time. Next, to investigate the influence of ϑ on R-SRG+, we consider the optimality gap when ϑ is changed over $\{0.1, 0.05, 0.01, 0.005, 0.001, 0.0005, 0.0001\}$ (Figures 1(d), (e)). Clearly, all results in this ϑ range indicate superior performance over the original R-SRG and R-SVRG. More importantly, R-SRG+ is insensitive to ϑ. Finally, the norms of the modified stochastic gradients $\xi_t$ of R-SVRG and $v_t$ of R-SRG are compared in Figure 1(f); the results for the first three outer loops are shown. While $\xi_t$ of R-SVRG fluctuates through each outer loop, $v_t$ of R-SRG decreases monotonically, supporting Proposition 4.4.

5.2 PCA and matrix completion problems on the Grassmann manifold

We also consider the PCA and matrix completion (MC) problems on the Grassmann manifold Gr(r, d), where a point is an equivalence class represented by a $d \times r$ matrix U with orthonormal columns: $U^T U = I$. Note that the projection-based vector transport and the QR-decomposition-based retraction, which do not satisfy Assumption 1, are used in the following experiments for computational efficiency. The motivation is to show that our algorithm also performs well empirically without use of the specific vector transport. Details are given in the supplementary material.

[Figure 2. PCA and MC problems on the Grassmann manifold: (a) optimality gap for the PCA problem; (b) test MSE for the MC problem (Jester); (c) test MSE for the MC problem (MovieLens).]

PCA problem. Given an orthonormal matrix projector $U \in \mathrm{St}(r, d)$, the PCA problem involves minimization of the sum of squared residual errors between the projected data points and the original data, which is expressed as $\min_{U \in \mathrm{St}(r,d)} \frac{1}{n}\sum_{i=1}^{n}\|x_i - UU^T x_i\|_2^2$, where $x_i$ is a data vector of size $d \times 1$. This problem is equivalent to maximizing $\frac{1}{n}\sum_{i=1}^{n} x_i^T UU^T x_i$. The critical points in the space St(r, d) are not isolated, because the cost function remains unchanged under the group action $U \mapsto UO$ for all orthogonal matrices O of size $r \times r$; consequently, this is an optimization problem on the Grassmann manifold Gr(r, d). Figure 2(a) shows the results for the optimality gap when n = 50000, d = 200, and r = 10. α is chosen from $\{10^{-3}, 2\times 10^{-3}, \ldots, 10^{-2}\}$. The minimum loss for the optimality gap is obtained via the Matlab function pca. Figure 2(a) reveals that R-SRG(+) exhibits superior convergence performance to the alternatives.
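As an illustration of the ingredients used here (a sketch under our assumptions, not the paper's implementation), the following snippet evaluates the PCA cost and its Riemannian gradient on the Grassmann manifold, together with the QR-decomposition-based retraction and projection-based vector transport mentioned above; the data-generation recipe and step size are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, r = 5000, 200, 10
# Data with a strong rank-r component so that the gradient at a random point is nonzero.
X = rng.standard_normal((n, r)) @ rng.standard_normal((r, d)) + 0.1 * rng.standard_normal((n, d))

def pca_cost_and_grad(U):
    """Cost (1/n) sum_i ||x_i - U U^T x_i||^2 and its Riemannian gradient on Gr(r, d)."""
    XU = X @ U                                     # n x r
    cost = (np.sum(X * X) - np.sum(XU * XU)) / n
    egrad = -2.0 * (X.T @ XU) / n                  # Euclidean gradient w.r.t. U
    rgrad = egrad - U @ (U.T @ egrad)              # project onto the tangent (horizontal) space
    return cost, rgrad

def retract_qr(U, xi):
    """QR-decomposition-based retraction: the Q factor of U + xi, with sign fixing."""
    Q, R = np.linalg.qr(U + xi)
    s = np.sign(np.diag(R))
    s[s == 0] = 1.0                                # make diag(R) nonnegative for uniqueness
    return Q * s

def transport_proj(U_new, xi):
    """Projection-based vector transport into the tangent space at U_new."""
    return xi - U_new @ (U_new.T @ xi)

# One Riemannian gradient step; for a sufficiently small step the cost decreases.
U0 = np.linalg.qr(rng.standard_normal((d, r)))[0]
cost0, g0 = pca_cost_and_grad(U0)
U1 = retract_qr(U0, -1e-3 * g0)
print(cost0, pca_cost_and_grad(U1)[0])
```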
MC problem. The MC problem involves completion of an incomplete matrix X, say of size $d \times n$, from a small number of entries by assuming that the latent structure of the matrix is low-rank. If Ω is the set of known indices in X, the rank-r MC problem amounts to solving $\min_{U, A}\|P_\Omega(UA) - P_\Omega(X)\|_F^2$, where $U \in \mathbb{R}^{d\times r}$ and $A \in \mathbb{R}^{r\times n}$. The operator $P_\Omega$ acts as $P_\Omega(X_{pq}) = X_{pq}$ if $(p, q) \in \Omega$ and $P_\Omega(X_{pq}) = 0$ otherwise. Partitioning $X = [x_1, \ldots, x_n]$, the previous problem is equivalent to $\min_{U \in \mathbb{R}^{d\times r},\, a_i \in \mathbb{R}^r} \frac{1}{n}\sum_{i=1}^{n}\|P_{\Omega_i}(U a_i) - P_{\Omega_i}(x_i)\|_2^2$, where $x_i \in \mathbb{R}^d$ and $P_{\Omega_i}$ is the sampling operator for the i-th column. Given U, $a_i$ admits a closed-form solution. Consequently, the problem depends only on the column space of U and is posed on Gr(r, d) (Boumal & Absil, 2015). Here, we use the Jester dataset (Goldberg et al., 2001), consisting of ratings of 100 jokes from 24983 users. Each rating is a real number between -10 and 10. We randomly extract two ratings per user as the training set Ω and the test set Φ. α is chosen from $\{10^{-7}, \ldots, 10^{-2}\}$ for R-SGD, R-SVRG, and R-SRG(+); the batch size is 1, r = 5, and ϑ = 0.1. The maximum number of outer iterations is 30 for R-SVRG and R-SRG(+), and 60 for the others. The algorithms are initialized randomly.

We also use the MovieLens-1M dataset (Mov), containing one million ratings for 3952 movies (= n) from 6040 users (= d). We further randomly split this set into train/validation/test partitions of 80/10/10 percent of the entire dataset. α is chosen from $\{10^{-5}, 5\times 10^{-5}, \ldots, 10^{-2}, 5\times 10^{-2}\}$, the batch size is 50, r = 5, and ϑ = 0.5. The maximum number of outer iterations is 20 for R-SVRG and R-SRG(+), and 60 for the others. Figures 2(b) and (c) show the superior performance of R-SRG(+) on both datasets.

6 Conclusions

We have proposed a Riemannian stochastic recursive gradient algorithm (R-SRG) on manifolds that is well suited for finite-sum minimization problems, and presented convergence analyses. R-SRG makes explicit use of retraction and vector transport, making it appealing for a wide variety of manifolds. Numerical comparisons have shown the benefits of R-SRG for a number of applications, with notable advantages over R-SVRG in both theory and practice.

Acknowledgements

H. Kasai was partially supported by JSPS KAKENHI Grant Numbers JP16K00031 and JP17H01732. H. Sato was partially supported by JSPS KAKENHI Grant Number JP16K17647.

References

MovieLens-10M dataset. URL http://grouplens.org/datasets/movielens/.

Absil, P.-A., Mahony, R., and Sepulchre, R. Optimization Algorithms on Matrix Manifolds. Princeton University Press, 2008.

Agarwal, A. and Bottou, L. A lower bound for the optimization of finite sums. In ICML, 2015.

Balzano, L., Nowak, R., and Recht, B. Online identification and tracking of subspaces from highly incomplete information. In Allerton, pp. 704–711, 2010.

Bonnabel, S. Stochastic gradient descent on Riemannian manifolds. IEEE Trans. on Automatic Control, 58(9):2217–2229, 2013.

Boumal, N. and Absil, P.-A. Low-rank matrix completion via preconditioned optimization on the Grassmann manifold. Linear Algebra Appl., 475:200–239, 2015.

Boumal, N., Mishra, B., Absil, P.-A., and Sepulchre, R. Manopt: a Matlab toolbox for optimization on manifolds. JMLR, 15(1):1455–1459, 2014.

Defazio, A., Bach, F., and Lacoste-Julien, S. SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives. In NIPS, 2014.

Goldberg, K., Roeder, T., Gupta, D., and Perkins, C. Eigentaste: A constant time collaborative filtering algorithm. Inform. Retrieval, 4(2):133–151, 2001.

Huang, W., Absil, P.-A., and Gallivan, K. A. A Riemannian symmetric rank-one trust-region method. Math. Program., Ser. A, 150:179–216, 2015a.

Huang, W., Gallivan, K. A., and Absil, P.-A. A Broyden class of quasi-Newton methods for Riemannian optimization. SIAM J. Optim., 25(3):1660–1685, 2015b.

Jayasumana, S., Hartley, R., Salzmann, M., Li, H., and Harandi, M. Kernel methods on Riemannian manifolds with Gaussian RBF kernels. IEEE Trans. Pattern Anal. Mach. Intell., 37(12), 2015.

Johnson, R. and Zhang, T. Accelerating stochastic gradient descent using predictive variance reduction. In NIPS, 2013.

Kasai, H. and Mishra, B. Low-rank tensor completion: a Riemannian manifold preconditioning approach. In ICML, 2016.

Kasai, H., Sato, H., and Mishra, B. Riemannian stochastic quasi-Newton algorithm with variance reduction and its convergence analysis. In AISTATS, 2018.

Meyer, G., Bonnabel, S., and Sepulchre, R. Linear regression under fixed-rank constraints: A Riemannian approach. In ICML, 2011.

Mishra, B. and Sepulchre, R. R3MC: A Riemannian three-factor algorithm for low-rank matrix completion. In IEEE CDC, 2014.
Nguyen, L. M., Liu, J., Scheinberg, K., and Takac, M. SARAH: A novel method for machine learning problems using stochastic recursive gradient. In ICML, 2017a.

Nguyen, L. M., Liu, J., Scheinberg, K., and Takac, M. Stochastic recursive gradient algorithm for nonconvex optimization. arXiv preprint arXiv:1705.07261, 2017b.

Polyak, B. Gradient methods for the minimisation of functionals. USSR Computational Mathematics and Mathematical Physics, 3(4):864–878, 1963.

Reddi, S. J., Hefny, A., Sra, S., Poczos, B., and Smola, A. Stochastic variance reduction for nonconvex optimization. In ICML, 2016.

Robbins, H. and Monro, S. A stochastic approximation method. Ann. Math. Statistics, pp. 400–407, 1951.

Roux, N. L., Schmidt, M., and Bach, F. R. A stochastic gradient method with an exponential convergence rate for finite training sets. In NIPS, 2012.

Sato, H., Kasai, H., and Mishra, B. Riemannian stochastic variance reduced gradient. arXiv preprint arXiv:1702.05594, 2017.

Shalev-Shwartz, S. and Zhang, T. Stochastic dual coordinate ascent methods for regularized loss minimization. JMLR, 14:567–599, 2013.

Zhang, H., Reddi, S. J., and Sra, S. Fast stochastic optimization on Riemannian manifolds. In NIPS, 2016.