# Inexact trust-region algorithms on Riemannian manifolds

Hiroyuki Kasai, The University of Electro-Communications, Japan (kasai@is.uec.ac.jp)
Bamdev Mishra, Microsoft, India (bamdevm@microsoft.com)

We consider an inexact variant of the popular Riemannian trust-region algorithm for structured big-data minimization problems. The proposed algorithm approximates the gradient and the Hessian in addition to the solution of a trust-region sub-problem. Addressing large-scale finite-sum problems, we specifically propose sub-sampled algorithms with a fixed bound on the sizes of the sub-sampled Hessian and gradient, where the gradient and Hessian are computed by a random sampling technique. Numerical evaluations demonstrate that the proposed algorithms outperform state-of-the-art Riemannian deterministic and stochastic gradient algorithms across different applications.

1 Introduction

We consider the optimization problem

$$\min_{x \in \mathcal{M}} f(x), \qquad (1)$$

where $f : \mathcal{M} \to \mathbb{R}$ is a smooth real-valued function on a Riemannian manifold $\mathcal{M}$ [1]. The focus of the paper is on the case where $f$ has a finite-sum structure, which frequently arises in big-data problems in machine learning applications. Specifically, we consider the form $f(x) := \frac{1}{n}\sum_{i=1}^{n} f_i(x)$, where $n$ is the total number of samples and $f_i(x)$ is the cost function for the $i$-th ($i \in [n]$) sample.

Riemannian optimization translates the constrained optimization problem (1) into an unconstrained optimization problem over the manifold $\mathcal{M}$. This viewpoint has shown benefits in many applications. The principal component analysis (PCA) and subspace tracking problems are defined on the Grassmann manifold [2, 3]. The low-rank matrix completion (MC) and tensor completion problems are examples on the manifold of fixed-rank matrices and tensors [4, 5, 6, 7, 8, 9, 10]. The linear regression problem is defined on the manifold of fixed-rank matrices [11, 12]. The independent component analysis (ICA) problem requires a whitening step that is posed as a joint diagonalization problem on the Stiefel manifold [13, 14].

A popular choice for solving (1) is the Riemannian steepest descent (RSD) algorithm [1, Sec. 4], which can be traced back to [15]. RSD calculates the Riemannian full gradient $\mathrm{grad}f(x)$ at every iteration, which can be computationally heavy when the data size $n$ is extremely large. As an alternative, the Riemannian stochastic gradient descent (RSGD) algorithm is a computationally efficient approach [16], which extends stochastic gradient descent (SGD) in the Euclidean space to general Riemannian manifolds [17, 18, 19]. The benefit of RSGD is that it calculates only the Riemannian stochastic gradient $\mathrm{grad}f_i(x)$ corresponding to a particular $i$-th sample at every iteration. Consequently, the per-iteration complexity of RSGD is independent of the sample size $n$, which leads to higher scalability for large-scale data. Although the iterates generated by RSGD are not guaranteed to decrease the objective value, $\mathrm{grad}f_i(x)$ is a descent direction in expectation. However, similar to SGD, RSGD suffers from slow convergence due to a decaying stepsize sequence. To address this issue, variance reduction (VR) methods on Riemannian manifolds, including RSVRG [20, 21] and RSRG [22], have recently been proposed to accelerate the convergence of RSGD; they are generalizations of the corresponding algorithms in the Euclidean space [23, 24, 25, 26, 27, 28].
The core idea is to reduce the variance of the noisy stochastic gradients by periodic full gradient estimations, resulting in a linear convergence rate. It should, however, be pointed out that such Riemannian VR methods require retraction and vector transport operations at every iteration. As the computational cost of a retraction and vector transport operation is similar to that of a Riemannian stochastic gradient computation, Riemannian VR methods may have slower wall-clock performance per iteration than RSGD.

All the above algorithms are first-order algorithms, which guarantee convergence to the first-order optimality condition, i.e., $\|\mathrm{grad}f(x)\|_x = 0$, using only gradient information. As a result, their performance on ill-conditioned problems suffers due to poor curvature approximation. Second-order algorithms, on the other hand, alleviate the effect of ill-conditioning by exploiting curvature information effectively. Therefore, they are expected to converge to a solution that satisfies the second-order optimality conditions, i.e., $\|\mathrm{grad}f(x)\|_x = 0$ and $\mathrm{Hess}f(x) \succeq 0$, where $\mathrm{Hess}f(x)$ is the Riemannian Hessian of $f$ at $x$ [29]. The Riemannian Newton method is a second-order algorithm with a superlinear local convergence rate [1, Thm. 6.3.2]. The Riemannian Newton method, however, lacks global convergence, and a practical variant of it is computationally expensive to implement. A popular alternative is the Riemannian limited-memory BFGS algorithm (RLBFGS), which requires less memory. It, however, exhibits only a linear convergence rate and requires many vector transports of curvature-information pairs [30, 31, 32]. Finally, the Riemannian trust-region algorithm (RTR) comes with a global convergence property [1, Thm. 7.4.4] and a superlinear local convergence rate [1, Thm. 7.4.11]. It can alleviate a poor approximation of the local quadratic model (e.g., the one that the Newton method uses) by adjusting the trust-region radius at every iteration. Considering an $(\epsilon_g, \epsilon_H)$-approximate second-order optimality condition (Definition 2.1), RTR can return an $(\epsilon_g, \epsilon_H)$-optimal point in $\mathcal{O}(\max\{\epsilon_g^{-2}\epsilon_H^{-1}, \epsilon_H^{-3}\})$ iterations when the true Hessian is used in the model and a second-order retraction is used [33]. On the stochastic front, the VR methods have recently been extended to take curvature information into account [34]. Although they achieve practical improvements on ill-conditioned problems, their convergence rates are worse than those of RSVRG and RSRG.

A common issue among second-order algorithms is the higher computational cost of dealing with exact or approximate Hessian matrices, which is prohibitive in a large-scale setting. To address this issue, inexact techniques, including sub-sampling techniques, have recently been proposed in the Euclidean space [35, 36, 37, 38, 39]. However, no such work has been reported in the Riemannian setting. To this end, we propose an inexact Riemannian trust-region algorithm, inexact RTR, for (1). Additionally, we propose a sub-sampled trust-region algorithm, Sub-RTR, as a practical and efficient variant of inexact RTR for finite-sum problems. The theoretical convergence proof heavily relies on those of the original works in the Euclidean space [37, 38, 39] and of the RTR algorithm [33]. We particularly derive bounds on the sample sizes of the sub-sampled Riemannian Hessian and gradient, and show practical performance improvements of our algorithms over other Riemannian algorithms.
We specifically address the case of compact submanifolds of $\mathbb{R}^n$ by following [33]. Additionally, the numerical experiments include problems on the Grassmann manifold to show the effectiveness of our algorithms on more general quotient manifolds.

The paper is organized as follows. Section 2 describes the preliminaries and assumptions. We propose a novel inexact trust-region algorithm in the Riemannian setting in Section 3. In Section 4, we propose sub-sampled trust-region algorithms as its practical variants. Building upon the results in the Euclidean space [37, 38, 39] and that of the RTR algorithm [33], we derive the bounds on the sample sizes of the sub-sampled gradients and Hessians in Theorem 4.1, which only require a fixed sample size [37]. This has not been addressed in [37, 38, 39, 33]. In Section 5, numerical experiments on three different problems demonstrate significant speed-ups compared with state-of-the-art Riemannian deterministic and stochastic algorithms when the sample size $n$ is large. The implementation of the proposed algorithms uses the MATLAB toolbox Manopt [40] and is available at https://github.com/hiroyuki-kasai/Subsampled-RTR. The proofs of the theorems and additional experiments are provided as supplementary material.

2 Preliminaries and assumptions

We assume that $\mathcal{M}$ is endowed with a Riemannian metric structure, i.e., a smooth inner product $\langle \cdot, \cdot \rangle_x$ of tangent vectors is associated with the tangent space $T_x\mathcal{M}$ for all $x \in \mathcal{M}$. The norm $\|\cdot\|_x$ of a tangent vector in $T_x\mathcal{M}$ is the norm associated with the Riemannian metric. We also assume that $f$ is twice continuously differentiable throughout this paper.

2.1 Riemannian trust-region algorithm (RTR)

RTR is the generalization of the classical trust-region algorithm in the Euclidean space [41] to Riemannian manifolds [1, Chap. 7]. In comparison with the Euclidean case, in RTR, the approximation model $m_x$ of $f$ around $x$ is obtained from the Taylor expansion of the pullback $\hat{f}_x := f \circ R_x$ defined on the tangent space, where $R_x$ is the retraction operator that maps a tangent vector onto the manifold with a local rigidity condition that preserves the gradient at $x$ [1, Chap. 4]. The exponential mapping is an instance of a retraction. $\hat{f}_x$ is a real-valued function on the vector space $T_x\mathcal{M}$, namely the pullback of $f$ at $x$ to $T_x\mathcal{M}$ through $R_x$, considered around the origin $0_x$ of $T_x\mathcal{M}$. The corresponding model is denoted as $\hat{m}_x$, where $m_x = \hat{m}_x \circ R_x^{-1}$, and is chosen for $\xi \in T_x\mathcal{M}$ as
$$\hat{m}_x(\xi) = f(x) + \langle \mathrm{grad}f(x), \xi \rangle_x + \tfrac{1}{2} \langle H(x)[\xi], \xi \rangle_x, \qquad (2)$$
where $H(x) : T_x\mathcal{M} \to T_x\mathcal{M}$ is some symmetric operator on $T_x\mathcal{M}$.

The RTR algorithm starts with an initial point $x_0 \in \mathcal{M}$, an initial radius $\Delta_0$, and a maximum radius $\Delta_{\max}$. At iteration $k$, RTR defines a trust region of radius $\Delta_k$ around the current point $x_k \in \mathcal{M}$, within which the local model $\hat{m}_{x_k}$ can be trusted to be a reasonable approximation of the real objective function $\hat{f}_{x_k}$. It then finds the direction and the length of the step, denoted as $\eta_k$, simultaneously by solving a sub-problem based on the approximate model in this region. It should be noted that this calculation is performed in the vector space $T_{x_k}\mathcal{M}$. The next candidate iterate $x_k^+ = R_{x_k}(\eta_k)$ is accepted as $x_{k+1} = x_k^+$ when the decrease of the true objective function, $\hat{f}_k(0_{x_k}) - \hat{f}_k(\eta_k)$, is sufficiently large compared with that of the approximate model, $\hat{m}_k(0_{x_k}) - \hat{m}_k(\eta_k)$. Otherwise, we set $x_{k+1} = x_k$. Here, $\hat{f}_k$ and $\hat{m}_k$ represent $\hat{f}_{x_k}$ and $\hat{m}_{x_k}$, respectively, and hereinafter we use them for notational simplicity. The trust-region radius $\Delta_k$ is enlarged, kept unchanged, or shrunk by the parameter $\gamma > 1$ according to the degree of agreement between the model decrease and the true function decrease.
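To fix ideas, the following MATLAB sketch carries out one such step for the Rayleigh-quotient cost $f(x) = -x^\top A x$ on the unit sphere: it builds the model (2) with the Riemannian Hessian as $H(x)$, takes a Cauchy step within the trust region, retracts, and computes the acceptance ratio. The cost, the solver, and all names are illustrative choices of ours, not part of the paper; in Manopt [40], the manifold factories and the trustregions solver provide these ingredients.

```matlab
% Minimal sketch of a single RTR step on the unit sphere in R^d for the
% Rayleigh-quotient cost f(x) = -x'*A*x, with a Cauchy-point solver for the
% sub-problem (illustrative choices, not the paper's code).
rng(0);
d = 50; B = randn(d); A = (B + B')/2;         % symmetric data matrix
f     = @(x) -x'*A*x;                         % cost on the unit sphere
proj  = @(x, v) v - (x'*v)*x;                 % orthogonal projection onto T_x M
rgrad = @(x) proj(x, -2*A*x);                 % Riemannian gradient
Hop   = @(x, xi) proj(x, -2*A*xi) + 2*(x'*A*x)*xi;  % Riemannian Hessian as H(x)
retr  = @(x, eta) (x + eta)/norm(x + eta);    % second-order retraction on the sphere

x = randn(d, 1); x = x/norm(x);               % current iterate
Delta = 0.5;                                  % trust-region radius
g = rgrad(x);

% Trust-region model (2): mhat(eta) = f(x) + g'*eta + 0.5*eta'*H(x)[eta]
mhat = @(eta) f(x) + g'*eta + 0.5*eta'*Hop(x, eta);

% Cauchy step: minimize the model along -g within the trust region
gHg = g'*Hop(x, g);
if gHg <= 0
    tau = Delta/norm(g);
else
    tau = min(norm(g)^2/gHg, Delta/norm(g));
end
eta = -tau*g;

% Acceptance ratio: true decrease of f versus decrease predicted by the model
xplus = retr(x, eta);
rho   = (f(x) - f(xplus))/(mhat(zeros(d, 1)) - mhat(eta));
fprintf('model decrease = %.3e, rho = %.3f\n', mhat(zeros(d, 1)) - mhat(eta), rho);
```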
2.2 Essential assumptions

Since the first-order optimality condition, i.e., $\|\mathrm{grad}f(x)\|_x = 0$, is not sufficient in non-convex minimization problems due to the existence of saddle points and local maxima, we typically design algorithms that guarantee convergence to a point satisfying the second-order optimality conditions $\|\mathrm{grad}f(x)\|_x = 0$ and $\mathrm{Hess}f(x) \succeq 0$. In practice, however, we use an approximate condition, defined as $(\epsilon_g, \epsilon_H)$-optimality below.

Definition 2.1 ($(\epsilon_g, \epsilon_H)$-optimality [42]). Given $0 < \epsilon_g, \epsilon_H < 1$, $x$ is said to be an $(\epsilon_g, \epsilon_H)$-optimal point of (1) when $\|\mathrm{grad}f(x)\|_x \le \epsilon_g$ and $\mathrm{Hess}f(x) \succeq -\epsilon_H \mathrm{Id}$, where $\mathrm{grad}f(x)$ is the Riemannian gradient, $\mathrm{Hess}f(x)$ is the Riemannian Hessian of $f$ at $x$, and $\mathrm{Id}$ is the identity mapping.
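As a small illustration of Definition 2.1, the sketch below checks $(\epsilon_g, \epsilon_H)$-optimality numerically for the sphere setup of the previous sketch by testing the gradient norm and the smallest eigenvalue of the Hessian operator expressed in an orthonormal basis of the tangent space; the function name and the handles rgrad and Hop are ours, not the paper's. A call looks like ok = is_eps_optimal(x, rgrad, Hop, 1e-4, 1e-4).

```matlab
function ok = is_eps_optimal(x, rgrad, Hop, eps_g, eps_H)
    % Numerical check of (eps_g, eps_H)-optimality (Definition 2.1) on the unit
    % sphere: the gradient norm is at most eps_g and Hessf(x) >= -eps_H * Id.
    d = numel(x);
    g_ok = norm(rgrad(x)) <= eps_g;
    B = null(x');                      % orthonormal basis of T_x M (d-1 columns)
    H = zeros(d-1);
    for j = 1:d-1
        H(:, j) = B'*Hop(x, B(:, j));  % Hessian in tangent-space coordinates
    end
    H_ok = min(eig((H + H')/2)) >= -eps_H;
    ok = g_ok && H_ok;
end
```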
We now provide the essential assumptions. We consider the inexact Hessian $H(x_k) : T_{x_k}\mathcal{M} \to T_{x_k}\mathcal{M}$ and the inexact gradient $G(x_k) \in T_{x_k}\mathcal{M}$ in place of $\mathrm{grad}f(x)$ in (2). Hereinafter, we write $H_k := H(x_k)$ and $G_k := G(x_k)$ for notational simplicity.

Assumption 1 (Compact submanifold of $\mathbb{R}^n$ and second-order retraction). We consider compact submanifolds of $\mathbb{R}^n$. We also assume that the retraction is a second-order retraction. It should be noted that, although the Hessian of the pullback $\nabla^2 \hat{f}_x(0_x)$ and the Riemannian Hessian $\mathrm{Hess}f(x)$ are in general different from each other, they are identical under a second-order retraction [33, Lem. 17]. This assumption ensures that, as stated in Theorem 3.1, Algorithm 1 provides a solution that satisfies the $(\epsilon_g, \epsilon_H)$-optimality; otherwise, it gives a solution satisfying $\lambda_{\min}(H(x)) \ge -\epsilon_H$. It should be stressed that second-order retractions are available on many submanifolds, such as $R_x(\eta) = (x+\eta)/\|x+\eta\|$ in the case of the sphere manifold [1, Sec. 4].

Assumption 2 (Restricted Lipschitz Hessian [33, A.5]). If $\epsilon_H < \infty$, there exists $L_H \ge 0$ such that, for all $x_k$, $\hat{f}_k$ satisfies
$$\left| \hat{f}_k(\eta_k) - f(x_k) - \langle \mathrm{grad}f(x_k), \eta_k \rangle_{x_k} - \tfrac{1}{2} \langle \eta_k, \nabla^2 \hat{f}_k(0_{x_k})[\eta_k] \rangle_{x_k} \right| \le \tfrac{1}{2} L_H \|\eta_k\|_{x_k}^3$$
for all $\eta_k \in T_{x_k}\mathcal{M}$ such that $\|\eta_k\|_{x_k} \le \Delta_k$.

It should be noted that the retraction $R_x$ needs to be defined only within the radius $\Delta_k$. Since the manifold under consideration is compact, Assumption 2 holds [33, Lem. 9]. We also assume a bound on the norm of the inexact Riemannian Hessian $H_k$ [33, A.6].

Assumption 3 (Norm bound on $H_k$). There exists $K_H \ge 0$ such that, for all $x_k$, $H_k$ satisfies
$$\|H_k\|_{x_k} := \sup_{\eta \in T_{x_k}\mathcal{M},\, \|\eta\|_{x_k} \le 1} \langle \eta, H_k[\eta] \rangle_{x_k} \le K_H.$$

We now provide the essential assumption on the bounds for the approximation errors of the inexact Riemannian gradient $G_k$ and the inexact Riemannian Hessian $H_k$ at iteration $k$. As seen later in Section 4, this ensures that the sample sizes of the sub-sampling can be fixed.

Assumption 4 (Approximation error bounds on inexact gradient and Hessian). There exist constants $0 < \delta_g, \delta_H < 1$ such that the approximation of the gradient, $G_k$, and the approximation of the Hessian, $H_k$, at iterate $k$ satisfy
$$\|G_k - \mathrm{grad}f(x_k)\|_{x_k} \le \delta_g, \qquad (3)$$
$$\|(H_k - \nabla^2 \hat{f}_k(0_{x_k}))[\eta_k]\|_{x_k} \le \delta_H \|\eta_k\|_{x_k}. \qquad (4)$$
The latter is a weaker condition than the condition $\|H_k - \nabla^2 \hat{f}_k(0_{x_k})\|_{x_k} \le \delta_H$ of [33, A7]. It should be emphasized that the approximation error bound for $H_k$ is defined with the Hessian of the pullback of $f$ at $x_k$, i.e., $\nabla^2 \hat{f}_k(0_{x_k})$, instead of the Riemannian Hessian of $f$, i.e., $\mathrm{Hess}f(x_k)$.

Furthermore, it should be noted that Assumption 4 is a relaxed form of a typical condition in the Euclidean setting, which is defined as [43, AM.4]
$$\|(H_k - \nabla^2 \hat{f}_k(0_{x_k}))[\eta_k]\|_{x_k} \le \delta_H \|\eta_k\|_{x_k}^2. \qquad (5)$$
This typical form (5) is different from (4). It should be noted that condition (5) requires the sizes of the sub-sampled Hessian and gradient to be increased as the iterates approach convergence, whereas our new condition (4) allows the sizes to be fixed, as seen later in Section 4 [37, 38].

Finally, we give an assumption on the step $\eta_k$. We need a sufficient decrease in $\hat{m}_k(\eta_k)$, and there exist ways to solve the sub-problem exactly (see [41, 1] for more details). However, the calculation of the exact solution of the problem is prohibitive, especially in large-scale problems. To this end, various approximate solvers have been investigated in the literature that require certain conditions to be met. The popular conditions are the Cauchy-point and eigenpoint conditions [41]. The assumptions required for the convergence analysis of Algorithm 1, obtained by generalizing [37, Cond. 2], are provided below.

Assumption 5 (Sufficient descent relative to the Cauchy and eigen directions [41, 37]). We assume the first-order condition, based on the Cauchy step $\eta_k^C$,
$$\hat{m}_k(0_{x_k}) - \hat{m}_k(\eta_k) \ge \hat{m}_k(0_{x_k}) - \hat{m}_k(\eta_k^C) \ge \tfrac{1}{2} \|G_k\|_{x_k} \min\left\{ \Delta_k, \frac{\|G_k\|_{x_k}}{\|H_k\|_{x_k}} \right\},$$
and, when $\lambda_{\min}(H_k) < -\epsilon_H$, the second-order condition, based on the eigen step $\eta_k^E$, for some $\nu \in (0, 1]$,
$$\hat{m}_k(0_{x_k}) - \hat{m}_k(\eta_k) \ge \hat{m}_k(0_{x_k}) - \hat{m}_k(\eta_k^E) \ge \tfrac{1}{2} \nu |\lambda_{\min}(H_k)| \Delta_k^2,$$
where $\eta_k^C$ is the step along the negative gradient direction and $\eta_k^E$ is an approximation of the negative curvature direction such that $\langle \eta_k^E, H_k[\eta_k^E] \rangle_{x_k} \le \nu \lambda_{\min}(H_k) \|\eta_k^E\|_{x_k}^2 < 0$.

Assumption 5 is ensured by standard trust-region sub-problem solvers, e.g., the Steihaug–Toint truncated conjugate gradient algorithm [44].

Algorithm 1 Inexact Riemannian trust-region (inexact RTR) algorithm
Require: $0 < \Delta_{\max} < \infty$, $\epsilon_g, \epsilon_H \in (0, 1)$, threshold $\rho_{TH}$, and $\gamma > 1$.
1: Initialize $0 < \Delta_0 < \Delta_{\max}$ and a starting point $x_0 \in \mathcal{M}$.
2: for $k = 1, 2, \ldots$ do
3: Set the approximate (inexact) gradient $G_k$ and Hessian $H_k$.
4: if $\|G_k\|_{x_k} \le \epsilon_g$ and $\lambda_{\min}(H_k) \ge -\epsilon_H$ then return $x_k$. end if
5: if $\|G_k\|_{x_k} \le \epsilon_g$ then set $G_k = 0$. end if
6: Calculate $\eta_k \in T_{x_k}\mathcal{M}$ by solving $\eta_k \approx \arg\min_{\eta \in T_{x_k}\mathcal{M},\, \|\eta\|_{x_k} \le \Delta_k} f(x_k) + \langle G_k, \eta \rangle_{x_k} + \tfrac{1}{2} \langle \eta, H_k[\eta] \rangle_{x_k}$.
7: Set $\rho_k = \dfrac{\hat{f}_k(0_{x_k}) - \hat{f}_k(\eta_k)}{\hat{m}_k(0_{x_k}) - \hat{m}_k(\eta_k)}$.
8: if $\rho_k \ge \rho_{TH}$ then $x_{k+1} = R_{x_k}(\eta_k)$ and $\Delta_{k+1} = \gamma \Delta_k$.
9: else $x_{k+1} = x_k$ and $\Delta_{k+1} = \Delta_k/\gamma$. end if
10: end for
11: Output $x_k$.

3 Riemannian trust-regions with inexact Hessian and gradient

This section proposes an inexact variant of the Riemannian trust-region algorithm, i.e., inexact RTR, which approximates the gradient and the Hessian as well as the solution of the sub-problem. The proposed algorithm is summarized in Algorithm 1. The inexact RTR algorithm approximately solves a sub-problem $\hat{m}_k : T_{x_k}\mathcal{M} \to \mathbb{R}$ for $\eta \in T_{x_k}\mathcal{M}$ of the form
$$\eta_k = \arg\min_{\eta \in T_{x_k}\mathcal{M}} \hat{m}_k(\eta) \quad \text{subject to} \quad \|\eta\|_{x_k} \le \Delta_k, \qquad (6)$$
where $\hat{m}_k(\eta)$ is defined as
$$\hat{m}_k(\eta) = \begin{cases} f(x_k) + \langle G_k, \eta \rangle_{x_k} + \tfrac{1}{2} \langle \eta, H_k[\eta] \rangle_{x_k}, & \|G_k\|_{x_k} \ge \epsilon_g, \quad \text{(7a)} \\ f(x_k) + \tfrac{1}{2} \langle \eta, H_k[\eta] \rangle_{x_k}, & \text{otherwise}. \quad \text{(7b)} \end{cases}$$
It should be stressed that, as (7b) represents, we ignore the gradient when it is smaller than $\epsilon_g$, i.e., $\|G_k\|_{x_k} < \epsilon_g$, which is crucial for the convergence analysis in Theorem 3.1 [38].
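Putting the pieces together, the following sketch implements the main loop of Algorithm 1 under simplifying assumptions: the sub-problem (6) is solved only up to the Cauchy step of Assumption 5, and the eigen-step branch and the $\lambda_{\min}(H_k)$ tests of lines 4–5 are omitted, so the loop simply stops once the inexact gradient norm falls below $\epsilon_g$. The handles (f, G_fun, H_fun, retr) and option names are ours, not the authors' released code.

```matlab
% Sketch of the inexact RTR loop (Algorithm 1): f(x) exact cost, G_fun(x)
% inexact gradient, H_fun(x, eta) inexact Hessian-vector product, retr(x, eta)
% retraction.  Sub-problem (6) solved only by the Cauchy step; the eigen-step
% branch and the lambda_min(H_k) tests are omitted for brevity.
function x = inexact_rtr(f, G_fun, H_fun, retr, x0, opts)
    x = x0; Delta = opts.Delta0;
    for k = 1:opts.maxiter
        G = G_fun(x);
        if norm(G) <= opts.eps_g   % lines 4-5: small gradient; the eigen step
            break;                 % of (7b) would be tried here
        end
        % Cauchy step for mhat(eta) = f(x) + G'*eta + 0.5*eta'*H_k[eta]
        gHg = G'*H_fun(x, G);
        if gHg <= 0
            tau = Delta/norm(G);
        else
            tau = min(norm(G)^2/gHg, Delta/norm(G));
        end
        eta   = -tau*G;
        m_dec = -(G'*eta + 0.5*eta'*H_fun(x, eta));   % mhat(0) - mhat(eta) >= 0
        xplus = retr(x, eta);
        rho   = (f(x) - f(xplus))/m_dec;              % agreement ratio (line 7)
        if rho >= opts.rho_th                         % accept and enlarge the radius
            x = xplus; Delta = min(opts.gamma*Delta, opts.Delta_max);
        else                                          % reject and shrink the radius
            Delta = Delta/opts.gamma;
        end
    end
end
```

With the sphere handles of the earlier sketches, a call looks like x = inexact_rtr(f, rgrad, Hop, retr, x, struct('Delta0', 0.5, 'Delta_max', 2, 'maxiter', 100, 'eps_g', 1e-6, 'gamma', 2, 'rho_th', 0.1)).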
Now, we show the convergence analysis of the proposed inexact RTR. To this end, we assume an additional approximation condition on the inexact gradient and Hessian for the constants in Assumption 4 [38, Cond. 1]. This additional assumption is essential for the relaxed form (4).

Assumption 6 (Gradient and Hessian approximations for Algorithm 1 [38]). Let $\rho_{TH}$ be the threshold parameter for the ratio of the reduction of the true objective function to that of the approximate model in Algorithm 1. For $\nu \in (0, 1]$ in Assumption 5, we assume that the constants of the inexact gradient and Hessian satisfy $\delta_g < \frac{1 - \rho_{TH}}{4}\,\epsilon_g$ and $\delta_H < \min\{c_1, c_2\}\,\epsilon_H$ for constants $c_1, c_2$ depending only on $\rho_{TH}$ and $\nu$ (cf. [38, Cond. 1]). This implies that we only need $\delta_g \in \mathcal{O}(\epsilon_g)$ and $\delta_H \in \mathcal{O}(\epsilon_H)$ [38, Cond. 1].

Theorem 3.1 (Optimal complexity of Algorithm 1). Consider $0 < \epsilon_g, \epsilon_H < 1$. Suppose Assumptions 1, 2, and 3 hold. Also, suppose that the inexact Hessian $H_k$ and gradient $G_k$ satisfy Assumption 4 with the approximation tolerances $\delta_g$ and $\delta_H$. Suppose that the solution of the sub-problem (6) satisfies Assumption 5 and that Assumption 6 holds. Then, Algorithm 1 returns an $(\epsilon_g, \epsilon_H)$-optimal solution in, at most, $T \in \mathcal{O}(\max\{\epsilon_g^{-2}\epsilon_H^{-1}, \epsilon_H^{-3}\})$ iterations.

The proof of Theorem 3.1 follows those of [37, 38, 33]. Therefore, we only provide a proof sketch in Section B.1 of the supplementary material.

4 Sub-sampled Riemannian trust-regions for finite-sum problems

Particularly addressing large-scale finite-sum minimization problems, we propose an inexact gradient and Hessian trust-region algorithm, Sub-RTR, that exploits a sub-sampling technique to generate the inexact gradient and Hessian. The generated inexact gradient and Hessian satisfy Assumption 4 in a probabilistic way. More concretely, we derive sampling conditions based on probabilistic deviation bounds for random matrices, which originate from the Bernstein inequality in Lemma B.2 of the supplementary material. We first define the sub-sampled inexact gradient and Hessian as
$$G_k := \frac{1}{|\mathcal{S}_g|} \sum_{i \in \mathcal{S}_g} \mathrm{grad}f_i(x_k) \quad \text{and} \quad H_k := \frac{1}{|\mathcal{S}_H|} \sum_{i \in \mathcal{S}_H} \mathrm{Hess}f_i(x_k),$$
where $\mathcal{S}_g, \mathcal{S}_H \subset \{1, \ldots, n\}$ are the sets of sub-sampled indices for the estimates of the approximate gradient and Hessian, respectively. Their sizes, i.e., the cardinalities, are denoted as $|\mathcal{S}_g|$ and $|\mathcal{S}_H|$, respectively.

Next, we provide the sampling conditions. For simplicity, we use the standard Riemannian metric in the analysis. Equivalently, $\mathcal{M}$ is endowed with the smooth inner product $\langle \cdot, \cdot \rangle_2$ and the norm $\|\cdot\|_2$. We suppose that $\sup_{x \in \mathcal{M}} \|\mathrm{grad}f_i(x)\|_2 \le K_g^i$ and $\sup_{x \in \mathcal{M}} \|\mathrm{Hess}f_i(x)\|_2 \le K_H^i$ for $i = 1, 2, \ldots, n$, and we also define $K_g^{\max} := \max_i K_g^i$ and $K_H^{\max} := \max_i K_H^i$. As for the sufficient sub-sampling size to guarantee the convergence in Theorem 3.1, we have the following theorem.

Theorem 4.1 (Bounds on sampling size). Given $K_g^i$, $K_H^i$, and $0 < \delta, \delta_g, \delta_H < 1$, we define
$$|\mathcal{S}_g| \ge \frac{32 (K_g^{\max})^2 \left(\log(1/\delta) + 1/4\right)}{\delta_g^2} \quad \text{and} \quad |\mathcal{S}_H| \ge \frac{32 (K_H^{\max})^2 \left(\log(1/\delta) + 1/4\right)}{\delta_H^2}.$$
At any $x_k \in \mathcal{M}$, suppose that the sampling is done uniformly at random to generate $\mathcal{S}_g$ and $\mathcal{S}_H$. Then, we have
$$\Pr\left( \|G_k - \mathrm{grad}f(x_k)\|_2 \le \delta_g \right) \ge 1 - \delta, \qquad \Pr\left( \|(H_k - \nabla^2 \hat{f}_k(0_{x_k}))[\eta_k]\|_2 \le \delta_H \|\eta_k\|_2 \right) \ge 1 - \delta.$$

From Theorem 4.1, it can be easily seen that Assumption 4 follows with the same probability, with $K_g = K_g^{\max}$ and $K_H = K_H^{\max}$. It should be emphasized that if we use the typical condition (5) instead of Assumption 4, we obtain, e.g.,
$$|\mathcal{S}_H| \ge \frac{32 (K_H^{\max})^2 \left(\log(1/\delta) + 1/4\right)}{\delta_H^2 \|\eta_k\|_2^2}$$
for the sub-sampled Hessian $H_k$. Considering that $\eta_k$ becomes nearly zero as the iterations proceed, this bound indicates that $|\mathcal{S}_H|$ must increase accordingly. Consequently, the size of the sub-sampled Hessian needs to be increased towards convergence. On the other hand, our results ensure that the sample size can be fixed to guarantee the convergence of Algorithm 1.
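To make the sampling rule concrete, the sketch below evaluates the sample-size bounds of Theorem 4.1 and forms the sub-sampled Riemannian gradient and Hessian-vector product for a toy finite-sum cost $f(x) = -\frac{1}{n}\sum_i (z_i^\top x)^2$ on the unit sphere; the toy cost, the crude per-sample constants, and all variable names are our illustrative choices. The theoretical bounds are conservative; in the experiments of Section 5, the much smaller fixed sizes $|\mathcal{S}_g| = n/10$ and $|\mathcal{S}_H| = n/10^2$ are used.

```matlab
% Toy finite-sum cost on the sphere: f(x) = -(1/n) * sum_i (z_i'*x)^2.
rng(0);
n = 1e4; d = 50;
Z = randn(d, n)/sqrt(d);                     % data matrix, columns z_i
x = randn(d, 1); x = x/norm(x);              % current iterate on the sphere
proj = @(x, v) v - (x'*v)*x;                 % projection onto T_x M

delta = 0.01; dg = 0.1; dH = 0.1;            % failure probability, tolerances
Kg_max = 2*max(sum(Z.^2, 1));                % bound on per-sample gradient norms
KH_max = 4*max(sum(Z.^2, 1));                % bound on per-sample Hessian norms

% Sample sizes from Theorem 4.1 (clamped at n; the bounds are conservative)
Sg_size = min(n, ceil(32*Kg_max^2*(log(1/delta) + 1/4)/dg^2));
SH_size = min(n, ceil(32*KH_max^2*(log(1/delta) + 1/4)/dH^2));
Sg = randperm(n, Sg_size);                   % uniform sampling of indices
SH = randperm(n, SH_size);

% Inexact (sub-sampled) gradient G_k and Hessian-vector product H_k[eta]
Zg = Z(:, Sg); ZH = Z(:, SH);
G  = proj(x, -(2/Sg_size)*(Zg*(Zg'*x)));
Hv = @(eta) proj(x, -(2/SH_size)*(ZH*(ZH'*eta))) ...
            + (2/SH_size)*sum((ZH'*x).^2)*eta;
```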
5 Numerical comparisons

This section evaluates the performance of our two proposed inexact RTR algorithms: the sub-sampled Hessian RTR (Sub-H-RTR) and the sub-sampled Hessian and gradient RTR (Sub-HG-RTR). We compare them with the Riemannian deterministic algorithms RSD, Riemannian conjugate gradient (RCG), RLBFGS, and RTR. We also show comparisons with RSVRG [20, 21].

We compare the algorithms in terms of the total number of oracle calls and the run time, i.e., wall-clock time. The former measures the number of function, gradient, and Hessian-vector product computations. The sub-sampled RTR requires $(n + |\mathcal{S}_g| + r_s|\mathcal{S}_H|)$ oracle calls per iteration, whereas the original RTR requires $(2n + r_s n)$ oracle calls, where $r_s$ is the number of iterations required for approximately solving the trust-region sub-problem. RSD, RCG, and RLBFGS require $(n + r_l n)$ oracle calls per iteration, where $r_l$ is the number of line searches carried out. RSVRG requires $(n + mn)$ oracle calls per outer iteration, where $m$ is the update frequency of the outer loop. Algorithms are initialized randomly and are stopped when the gradient norm falls below a given threshold. Multiple constant stepsizes from $\{10^{-10}, 10^{-9}, \ldots, 1\}$ are used for RSVRG, and the best-tuned results are shown. Following [38], we set $|\mathcal{S}_g| = n/10$ and $|\mathcal{S}_H| = n/10^2$ except in Cases P5, P6, M4, and M5. We set the batch size to $n/10$ in RSVRG. All simulations are performed in MATLAB on a 4.0 GHz Intel Core i7 machine with 32 GB RAM.

We address the independent component analysis (ICA) problem on the Stiefel manifold and two problems on the Grassmann manifold, namely the principal component analysis (PCA) and the low-rank matrix completion (MC) problems. The Stiefel manifold is the set of orthogonal $r$-frames in $\mathbb{R}^d$ for some $r \le d$ and is viewed as an embedded submanifold of $\mathbb{R}^{d \times r}$ [1, Sec. 3.3]. On the other hand, the Grassmann manifold $\mathrm{Gr}(r, d)$ is the set of $r$-dimensional subspaces in $\mathbb{R}^d$ and is a Riemannian quotient manifold of the Stiefel manifold [1, Sec. 3.4]. The motivation behind including the latter two applications is to show that our proposed algorithms empirically work very well even if the manifold is not a submanifold. In all these problems, the full gradient methods, i.e., RSD, RCG, RLBFGS, and RTR, become prohibitively expensive when $n$ is very large, and the inexact approach is one promising way to achieve scalability. The details of the manifolds and the derivations of the Riemannian gradient and Hessian are provided as supplementary material.

Figure 1: Performance evaluations on the ICA problem. (a) Case I1, (b) Case I2, (c) Case I3; oracle calls and run time.

5.1 ICA problem

The ICA or blind source separation problem refers to separating a signal into components so that the components are as independent as possible [45]. A particular preprocessing step is the whitening step, which is performed through joint diagonalization on the Stiefel manifold [13], i.e.,
$$\min_{U \in \mathrm{St}(r, d)} -\frac{1}{n} \sum_{i=1}^{n} \|\mathrm{diag}(U^\top C_i U)\|_F^2,$$
where $\|\mathrm{diag}(A)\|_F^2$ denotes the sum of the squared diagonal elements of $A$. The symmetric matrices $C_i$ are of size $d \times d$ and can be cumulant matrices or time-lagged covariance matrices of different signal samples [13].
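A minimal Manopt-style [40] encoding of this joint-diagonalization cost is sketched below; the synthetic matrices C{i}, the helper egrad_ica, and the chosen sizes stand in for the GRCM descriptors used in the experiments, and only the Euclidean gradient is supplied (Manopt's trustregions solver then resorts to an approximate Hessian), so this is an illustration rather than the released code at the repository above.

```matlab
% Joint diagonalization cost on the Stiefel manifold St(r, d) in Manopt form:
% f(U) = -(1/n) * sum_i ||diag(U' * C_i * U)||_F^2.
d = 43; r = 43; n = 100;                      % sizes as in the GRCM setting below
C = cell(n, 1);
for i = 1:n
    Ci = randn(d); C{i} = (Ci + Ci')/2;       % synthetic symmetric stand-ins for C_i
end

problem.M     = stiefelfactory(d, r);         % Stiefel manifold of d-by-r matrices
problem.cost  = @(U) -mean(cellfun(@(Ci) norm(diag(U'*Ci*U))^2, C));
problem.egrad = @(U) egrad_ica(U, C);         % Euclidean gradient (converted by Manopt)
U = trustregions(problem);                    % exact RTR as a baseline

function G = egrad_ica(U, C)
    % d/dU of -(1/n) sum_i ||diag(U' C_i U)||_F^2 = -(4/n) sum_i C_i U ddiag(U' C_i U)
    n = numel(C); G = zeros(size(U));
    for i = 1:n
        D = diag(diag(U'*C{i}*U));
        G = G - (4/n)*(C{i}*(U*D));
    end
end
```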
We use three real-world datasets: Yale B [46], COIL-100 [47], and CIFAR-100 [48]. From these datasets, we create a Gabor-based region covariance matrix (GRCM) descriptor [49, 50, 51]. A $43 \times 43$ GRCM is computed from the pixel coordinates and the Gabor features obtained by convolving Gabor kernels with an intensity image. We set $m = 1$ in RSVRG. Figures 1(a), (b), and (c) show the results on the Yale B dataset with $(n, d, r) = (2015, 43, 43)$ (Case I1), the COIL-100 dataset with $(n, d, r) = (7.2 \times 10^3, 43, 43)$ (Case I2), and the CIFAR-100 dataset with $(n, d, r) = (6 \times 10^4, 43, 43)$ (Case I3), respectively. As seen, the proposed Sub-H-RTR and Sub-HG-RTR perform better than the others, except RSVRG, in terms of both the number of oracle calls and the run time. It should be emphasized that, although RSVRG performs comparably to or slightly better than our proposed algorithms, its results require fine tuning of the stepsize.

5.2 PCA problem

Given an orthonormal matrix $U \in \mathrm{St}(r, d)$, the PCA problem is to minimize the sum of squared residual errors between the projected data points and the original data, i.e.,
$$\min_{U \in \mathrm{St}(r, d)} \sum_{i=1}^{n} \|z_i - U U^\top z_i\|_2^2,$$
where $z_i$ is a data vector of size $d \times 1$. This problem is equivalent to
$$\min_{U \in \mathrm{St}(r, d)} -\sum_{i=1}^{n} z_i^\top U U^\top z_i.$$
Here, the critical points in the space $\mathrm{St}(r, d)$ are not isolated because the cost function remains unchanged under the group action $U \mapsto UO$ for all orthogonal matrices $O$ of size $r \times r$. Consequently, the PCA problem is an optimization problem on the Grassmann manifold $\mathrm{Gr}(r, d)$.
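The corresponding Manopt-style sketch of this PCA cost on the Grassmann manifold is as follows; Z is a d-by-n matrix whose columns are the samples z_i, and the factory and handle names follow Manopt [40] conventions rather than the authors' released code.

```matlab
% PCA cost on the Grassmann manifold Gr(r, d):
% f(U) = -(1/n) * sum_i z_i' * U * U' * z_i = -(1/n) * ||U' * Z||_F^2.
d = 100; r = 5; n = 5000;
Z = randn(d, n);                                  % synthetic data, columns z_i

problem.M     = grassmannfactory(d, r);
problem.cost  = @(U) -norm(U'*Z, 'fro')^2/n;
problem.egrad = @(U) -(2/n)*(Z*(Z'*U));           % Euclidean gradient
problem.ehess = @(U, H) -(2/n)*(Z*(Z'*H));        % Euclidean Hessian along direction H
U = trustregions(problem);                        % exact RTR as a baseline
```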
Figure 2: Performance evaluations on the PCA problem. (a) Case P1 and (b) Case P2: optimality gap versus oracle calls and run time; (c) Case P3: MNIST dataset; (d) Case P4: Covertype dataset; (e) Case P5: sampling-size insensitivity; (f) Case P6: sampling algorithms.

Figures 2(a) and (b) show the results on two synthetic datasets with $(n, d, r) = (5 \times 10^6, 10^2, 5)$ (Case P1) and $(n, d, r) = (5 \times 10^5, 10^3, 5)$ (Case P2). We set $m = 5$ in RSVRG. It should be noted that, although RSVRG is competitive in terms of oracle calls in (a), its run-time performance is poorer than that of the others. This is attributed to RSVRG requiring retraction and vector transport operations at every iteration. Overall, the proposed Sub-H-RTR outperforms the others, whereas the proposed Sub-HG-RTR is inferior to them. Figures 2(c) and (d) show the results on two real-world datasets with $r = 10$, where Case P3 deals with the MNIST dataset [52] with $(n, d) = (6 \times 10^4, 784)$ and Case P4 deals with the Covertype dataset [53] with $(n, d) = (581012, 54)$. From the figure, our proposed Sub-H-RTR outperforms the others. We also vary the sample size in Sub-H-RTR as $|\mathcal{S}_H| \in \{n/10, n/10^2, n/10^3\}$ in Case P1; we observe from Figure 2(e) (Case P5) that Sub-H-RTR has low sensitivity to the size $|\mathcal{S}_H|$.

Additionally, we compare three different ways to decide the sample sizes $|\mathcal{S}_H|$ and $|\mathcal{S}_g|$: (i) fixed, (ii) linear, and (iii) adaptive variants (Case P6). The fixed variant keeps the sizes at the initial $|\mathcal{S}_g|$ and $|\mathcal{S}_H|$, as theoretically supported by Theorem 4.1. The linear variant uses $k|\mathcal{S}_g|$ and $k|\mathcal{S}_H|$ at iteration $k$. The adaptive variant decides the sizes based on (5) [39]. The results on the same synthetic dataset as Case P2 show that all the proposed algorithms, except Sub-HG-RTR with a fixed sample size, outperform the original RTR.

5.3 MC problem

The MC problem amounts to completing an incomplete matrix $Z$, say of size $d \times n$, from a small number of entries by assuming a low-rank model for the matrix. If $\Omega$ is the set of indices for which we know the entries in $Z$, the rank-$r$ MC problem amounts to solving
$$\min_{U \in \mathbb{R}^{d \times r},\, A \in \mathbb{R}^{r \times n}} \|P_\Omega(UA) - P_\Omega(Z)\|_F^2,$$
where the operator defined by $P_\Omega(Z)_{pq} = Z_{pq}$ if $(p, q) \in \Omega$ and $P_\Omega(Z)_{pq} = 0$ otherwise is called the orthogonal sampling operator and is a mathematically convenient way to represent the subset of known entries. Partitioning $Z = [z_1, z_2, \ldots, z_n]$, the problem is equivalent to
$$\min_{U \in \mathbb{R}^{d \times r},\, a_i \in \mathbb{R}^{r}} \frac{1}{n} \sum_{i=1}^{n} \|P_{\Omega_i}(U a_i) - P_{\Omega_i}(z_i)\|_2^2,$$
where $z_i \in \mathbb{R}^d$ and the operator $P_{\Omega_i}$ is the sampling operator for the $i$-th column.
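Given the subspace basis U, the inner coefficient a_i can be eliminated in closed form as a small least-squares problem over the observed rows, giving a per-column cost f_i(U); a minimal sketch of this evaluation follows (the function and variable names are ours).

```matlab
function val = mc_column_cost(U, z, omega)
    % f_i(U) = min_a || P_{Omega_i}(U*a) - P_{Omega_i}(z_i) ||_2^2, evaluated by
    % restricting to the observed row indices omega of the column z.
    Uo = U(omega, :);            % rows of the d-by-r basis U at observed entries
    zo = z(omega);               % observed entries of the column
    a  = Uo \ zo;                % closed-form least-squares coefficient a_i
    val = norm(Uo*a - zo)^2;     % residual on the observed entries only
end
```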
Figure 3: Performance evaluations on the MC problem. (a) Case M1, (b) Case M2, and (c) Case M3: mean square error on the test set versus oracle calls and run time; (d) Case M4 and (e) Case M5: sampling algorithms.

We also compared our proposed algorithms with RTRMC [10], a state-of-the-art MC algorithm. The code of RTRMC is optimized for the MC problem; therefore, we mainly compare against the oracle calls of RTRMC for a fair comparison. We first consider a synthetic dataset with $(n, d, r) = (10^5, 10^2, 5)$. We show the mean square error (MSE) on a test set, which is different from the training set. The over-sampling ratio (OS) is 4, where the OS determines the number of entries that are known: an OS of 4 implies that $4(n + d - r)r$ randomly and uniformly selected entries are known a priori out of the total $nd$ entries. We also impose an exponential decay of the singular values. The ratio of the largest to the smallest singular value is known as the condition number (CN) of the matrix. We set $m = 5$ in RSVRG. We consider a well-conditioned case with CN = 5 (Case M1) and an ill-conditioned case with CN = 20 (Case M2).

Figures 3(a) and (b) show the relatively good performance of RSVRG for Case M1. RTRMC is, as expected, extremely fast in terms of run time (owing to its optimized code). Sub-H-RTR and Sub-HG-RTR show superior performance to the others, especially for the ill-conditioned Case M2. Next, we consider the Jester dataset 1 [54], consisting of ratings of 100 jokes by 24983 users (Case M3). Each rating is a real number between −10 and 10. The algorithms are run by fixing the rank to $r = 5$. Figure 3(c) shows the comparable or superior performance of the sub-sampled RTR algorithms on the test set compared with state-of-the-art algorithms. Finally, we compare the three variants, fixed, linear, and adaptive, for deciding the sample size in Cases M4 and M5 under the same conditions as Cases M2 and M3, respectively. Figures 3(d) and (e) show that all the proposed algorithms outperform the original RTR. In particular, the fixed variant gives superior performance to the others, as supported by Theorem 4.1.

6 Conclusion

We have proposed an inexact trust-region algorithm in the Riemannian setting with a worst-case total complexity bound. Additionally, we have proposed sub-sampled trust-region algorithms for finite-sum problems, which need only fixed bounds on the sizes of the sub-sampled gradient and Hessian. The numerical comparisons show the benefits of our proposed inexact RTR algorithms on a number of applications.

Acknowledgements

H. Kasai was partially supported by JSPS KAKENHI Grant Numbers JP16K00031 and JP17H01732. We thank Nicolas Boumal and Hiroyuki Sato for insightful discussions and also express our sincere appreciation to Jonas Moritz Kohler for sharing his expertise on sub-sampled algorithms in the Euclidean case.

References

[1] P.-A. Absil, R. Mahony, and R. Sepulchre. Optimization Algorithms on Matrix Manifolds. Princeton University Press, 2008.
[2] L. Balzano, R. Nowak, and B. Recht. Online identification and tracking of subspaces from highly incomplete information. In Allerton, 2010.
[3] B. Mishra, H. Kasai, P. Jawanpuria, and A. Saroop. A Riemannian gossip approach to subspace learning on Grassmann manifold. Machine Learning (to appear), 2019.
[4] B. Mishra and R. Sepulchre. R3MC: A Riemannian three-factor algorithm for low-rank matrix completion. In IEEE CDC, pages 1137–1142, 2014.
[5] H. Kasai and B. Mishra. Low-rank tensor completion: a Riemannian manifold preconditioning approach. In ICML, 2016.
[6] M. Nimishakavi, P. Jawanpuria, and B. Mishra. A dual framework for low-rank tensor completion. In NeurIPS, 2018.
[7] D. Kressner, M. Steinlechner, and B. Vandereycken. Low-rank tensor completion by Riemannian optimization. BIT Numer. Math., 54(2):447–468, 2014.
[8] B. Vandereycken. Low-rank matrix completion by Riemannian optimization. SIAM J. Optim., 23(2):1214–1236, 2013.
[9] C. Da Silva and F. J. Herrmann. Optimization on the hierarchical Tucker manifold – applications to tensor completion. Linear Algebra Its Appl., 481:131–173, 2015.
[10] N. Boumal and P.-A. Absil. Low-rank matrix completion via preconditioned optimization on the Grassmann manifold. Linear Algebra Its Appl., 475(15):200–239, 2015.
[11] G. Meyer, S. Bonnabel, and R. Sepulchre. Linear regression under fixed-rank constraints: a Riemannian approach. In ICML, 2011.
[12] U. Shalit, D. Weinshall, and G. Chechik. Online learning in the embedded manifold of low-rank matrices. J. Mach. Learn. Res., 13(Feb):429–458, 2012.
[13] F. J. Theis, T. P. Cason, and P.-A. Absil. Soft dimension reduction for ICA by joint diagonalization on the Stiefel manifold. In ICA, 2009.
[14] W. Huang, P.-A. Absil, and K. A. Gallivan. A Riemannian BFGS method for nonconvex optimization problems. In ENUMATH 2015. Springer, 2016.
[15] D. G. Luenberger. The gradient projection method along geodesics. Manag. Sci., 18(11):620.
[16] S. Bonnabel. Stochastic gradient descent on Riemannian manifolds. IEEE Trans. on Automatic Control, 58(9):2217–2229, 2013.
[17] H. Robbins and S. Monro. A stochastic approximation method. Ann. Math. Statistics, pages 400–407, 1951.
[18] L. Bottou, F. E. Curtis, and J. Nocedal. Optimization methods for large-scale machine learning. SIAM Rev., 60(2):223–311, 2018.
[19] H. Kasai. SGDLibrary: A MATLAB library for stochastic optimization algorithms. JMLR, 18(215):1–5, 2018.
[20] H. Sato, H. Kasai, and B. Mishra. Riemannian stochastic variance reduced gradient. arXiv preprint arXiv:1702.05594, 2017.
[21] H. Zhang, S. J. Reddi, and S. Sra. Riemannian SVRG: fast stochastic optimization on Riemannian manifolds. In NIPS, 2016.
[22] H. Kasai, H. Sato, and B. Mishra. Riemannian stochastic recursive gradient algorithm. In ICML, 2018.
[23] R. Johnson and T. Zhang. Accelerating stochastic gradient descent using predictive variance reduction. In NIPS, 2013.
[24] N. L. Roux, M. Schmidt, and F. R. Bach. A stochastic gradient method with an exponential convergence rate for finite training sets. In NIPS, 2012.
[25] S. Shalev-Shwartz and T. Zhang. Stochastic dual coordinate ascent methods for regularized loss minimization. JMLR, 14:567–599, 2013.
[26] A. Defazio, F. Bach, and S. Lacoste-Julien. SAGA: a fast incremental gradient method with support for non-strongly convex composite objectives. In NIPS, 2014.
[27] S. J. Reddi, A. Hefny, S. Sra, B. Poczos, and A. Smola. Stochastic variance reduction for nonconvex optimization. In ICML, 2016.
[28] L. M. Nguyen, J. Liu, K. Scheinberg, and M. Takac. SARAH: a novel method for machine learning problems using stochastic recursive gradient. In ICML, 2017.
[29] W. H. Yang, L.-H. Zhang, and R. Song. Optimality conditions for the nonlinear programming problems on Riemannian manifolds. Pac. J. Optim., 10(2):415–434, 2014.
[30] W. Huang, K. A. Gallivan, and P.-A. Absil. A Broyden class of quasi-Newton methods for Riemannian optimization. SIAM J. Optim., 25(3):1660–1685, 2015.
[31] D. Gabay. Minimizing a differentiable function over a differential manifold. J. Optim. Theory Appl., 37(2):177–219, 1982.
[32] W. Ring and B. Wirth. Optimization methods on Riemannian manifolds and their application to shape space. SIAM J. Optim., 22(2):596–627, 2012.
[33] N. Boumal, P.-A. Absil, and C. Cartis. Global rates of convergence for nonconvex optimization on manifolds. IMA J. Numer. Anal., 2018.
[34] H. Kasai, H. Sato, and B. Mishra. Riemannian stochastic quasi-Newton algorithm with variance reduction and its convergence analysis. In AISTATS, 2018.
[35] R. H. Byrd, G. M. Chin, W. Neveitt, and J. Nocedal. On the use of stochastic Hessian information in optimization methods for machine learning. SIAM J. Optim., 21(3):977–995, 2011.
[36] M. A. Erdogdu and A. Montanari. Convergence rates of sub-sampled Newton methods. In NIPS, 2015.
[37] P. Xu, F. Roosta-Khorasani, and M. W. Mahoney. Newton-type methods for non-convex optimization under inexact Hessian information. arXiv preprint arXiv:1708.07164, 2017.
[38] Z. Yao, P. Xu, F. Roosta-Khorasani, and M. W. Mahoney. Inexact non-convex Newton-type methods. arXiv preprint arXiv:1802.06925, 2018.
[39] J. M. Kohler and A. Lucchi. Sub-sampled cubic regularization for non-convex optimization. In ICML, 2017.
[40] N. Boumal, B. Mishra, P.-A. Absil, and R. Sepulchre. Manopt, a Matlab toolbox for optimization on manifolds. J. Mach. Learn. Res., 15(1):1455–1459, 2014.
[41] A. R. Conn, N. I. M. Gould, and P. L. Toint. Trust Region Methods. MOS-SIAM Series on Optimization. SIAM, 2000.
[42] J. Nocedal and S. J. Wright. Numerical Optimization. Springer, New York, USA, 2006.
[43] C. Cartis, N. I. M. Gould, and P. L. Toint. Adaptive cubic regularisation methods for unconstrained optimization. Part I: motivation, convergence and numerical results. Math. Program., 127(2):245–295, 2011.
[44] P. L. Toint. Towards an efficient sparsity exploiting Newton method for minimization. Sparse Matrices and Their Uses, 1981.
[45] A. Hyvärinen and E. Oja. Independent component analysis: algorithms and applications. Neural Networks, 13(4-5):411–430, 2000.
[46] The extended Yale Face Database B. http://vision.ucsd.edu/~leekc/ExtYaleDatabase/ExtYaleB.html.
[47] Columbia University Image Library (COIL-100). http://www1.cs.columbia.edu/CAVE/software/softlib/coil-
[48] The CIFAR-100 dataset. http://www.cs.toronto.edu/~kriz/cifar.html.
[49] F. Porikli and O. Tuzel. Fast construction of covariance matrices for arbitrary size image windows. In ICIP, 2006.
[50] O. Tuzel, F. Porikli, and P. Meer. Region covariance: a fast descriptor for detection and classification. In ECCV, 2006.
[51] Y. Pang, Y. Yuan, and X. Li. Gabor-based region covariance matrices for face recognition. IEEE Trans. Circuits Syst. Video Technol., 18(7):989–993, 2008.
[52] The MNIST database. http://yann.lecun.com/exdb/mnist/.
[53] Covertype dataset. https://archive.ics.uci.edu/ml/datasets/covertype.
[54] K. Goldberg, T. Roeder, D. Gupta, and C. Perkins. Eigentaste: a constant time collaborative filtering algorithm. Inform. Retrieval, 4(2):133–151, 2001.
[55] D. Gross. Recovering low-rank matrices from few coefficients in any basis. IEEE Trans. on Inf. Theory, 57(3):1548–1566, 2011.
[56] R. Kueng and D. Gross. RIPless compressed sensing from anisotropic measurements. Linear Algebra Its Appl., 441:110–123, 2014.