# Local Convergence Properties of SAGA/Prox-SVRG and Acceleration

Clarice Poon*1, Jingwei Liang*1, Carola-Bibiane Schönlieb1

Abstract

In this paper, we present a local convergence analysis for a class of stochastic optimisation methods: the proximal variance reduced stochastic gradient methods, and mainly focus on SAGA (Defazio et al., 2014) and Prox-SVRG (Xiao & Zhang, 2014). Under the assumption that the non-smooth component of the optimisation problem is partly smooth relative to a smooth manifold, we present a unified framework for the local convergence analysis of SAGA/Prox-SVRG: (i) the sequences generated by the methods are able to identify the smooth manifold in a finite number of iterations; (ii) then the sequence enters a local linear convergence regime. Furthermore, we discuss various possibilities for accelerating these algorithms, including adapting to better local parameters, and applying higher-order deterministic/stochastic optimisation methods which can achieve super-linear convergence. Several concrete examples arising from machine learning are considered to demonstrate the obtained results.

*Equal contribution. 1DAMTP, University of Cambridge, Cambridge, United Kingdom. Correspondence to: Clarice Poon, Jingwei Liang. Proceedings of the 35th International Conference on Machine Learning, Stockholm, Sweden, PMLR 80, 2018. Copyright 2018 by the author(s).

1. Introduction

1.1. Non-smooth Optimisation

Modern optimisation has become a core part of many fields, such as machine learning and signal/image processing. In a world of increasing data demands, there are two key driving forces behind modern optimisation.

Non-smooth regularisation. We are often faced with models of high complexity; however, the solutions of interest often lie on a manifold of low dimension which is promoted by the non-smooth regularisers. There have been several recent studies explaining how proximal methods identify this low-dimensional manifold and efficiently output solutions which take a certain structure; see for instance (Liang et al., 2017) for the case of deterministic proximal gradient methods.

Stochastic methods. The past decades have seen an exponential growth in the data sizes that we have to handle, and stochastic methods have been popular due to their low computational cost; see for instance (Schmidt et al., 2017; Defazio et al., 2014; Xiao & Zhang, 2014) and references therein.

The purpose of this paper is to show that proximal variance reduced stochastic gradient methods allow one to benefit from both efficient structure enforcement and low computational cost. In particular, we present a study of the manifold identification and local acceleration properties of these methods when applied to the following minimisation problem:

$$\min_{x \in \mathbb{R}^n} \Phi(x) \overset{\mathrm{def}}{=} R(x) + F(x), \tag{P}$$

where $R(x)$ is a non-smooth, structure-imposing penalty term and $F(x) \overset{\mathrm{def}}{=} \frac{1}{m}\sum_{i=1}^{m} f_i(x)$, where each $f_i$ is continuously differentiable. We are interested in problems where the value of $m$ is very large. A typical example of (P) is the LASSO problem, which reads

$$\min_{x \in \mathbb{R}^n} \mu\|x\|_1 + \frac{1}{m}\sum_{i=1}^{m} \tfrac{1}{2}\|K_i x - b_i\|^2,$$

where $\mu > 0$ is a positive trade-off parameter, $K_i$ is the $i$th row of a matrix $K \in \mathbb{R}^{m \times n}$, and $b_i$ is the $i$th element of the vector $b \in \mathbb{R}^m$.

Throughout this paper, we consider the following basic assumptions for problem (P):

(A.1) $R : \mathbb{R}^n \to \mathbb{R} \cup \{+\infty\}$ is proper, convex and lower semi-continuous;
(A.2) $F : \mathbb{R}^n \to \mathbb{R}$ is continuously differentiable with $\nabla F$ being $L_F$-Lipschitz continuous. For each index $i = 1, \dots, m$, $f_i$ is continuously differentiable with $L_i$-Lipschitz continuous gradient;
(A.3) $\mathrm{Argmin}(\Phi) \neq \emptyset$, that is, the set of minimisers is non-empty.

In addition to (A.2), define $L \overset{\mathrm{def}}{=} \max_{i \in \{1,\dots,m\}} L_i$, which is the uniform Lipschitz constant of the gradients of the functions $f_i$. Note that $L_F \leq \frac{1}{m}\sum_i L_i \leq L$ holds.
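To make the finite-sum structure of (P) and the constants just defined concrete, the following is a minimal sketch of the LASSO instance above (hypothetical synthetic data and sizes, not the paper's code or experiments):

```python
# A minimal sketch (hypothetical synthetic data, not the paper's code) of the LASSO
# instance of (P): Phi(x) = mu*||x||_1 + (1/m) * sum_i 0.5*(K_i x - b_i)^2, together
# with the constants L_i, L = max_i L_i and L_F defined above.
import numpy as np

rng = np.random.default_rng(0)
m, n, mu = 128, 256, 0.1                 # illustrative sizes and trade-off parameter
K = rng.standard_normal((m, n))
b = rng.standard_normal(m)

def R(x):                                # non-smooth penalty R(x) = mu*||x||_1
    return mu * np.abs(x).sum()

def f_i(x, i):                           # i-th smooth component
    return 0.5 * (K[i] @ x - b[i]) ** 2

def grad_f_i(x, i):                      # gradient of f_i
    return (K[i] @ x - b[i]) * K[i]

def F(x):                                # smooth part F(x) = (1/m) * sum_i f_i(x)
    return 0.5 * np.mean((K @ x - b) ** 2)

L_i = np.sum(K ** 2, axis=1)             # grad f_i is ||K_i||^2-Lipschitz
L = L_i.max()                            # uniform constant L = max_i L_i
L_F = np.linalg.norm(K, 2) ** 2 / m      # grad F is L_F-Lipschitz
print(L_F <= L_i.mean() <= L)            # the inequality L_F <= (1/m) sum_i L_i <= L
```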
1.2. Deterministic Forward-Backward Splitting

A classical approach to solving (P) is the Forward-Backward splitting (FBS) method (Lions & Mercier, 1979), which is also known as the proximal gradient descent method. The standard FBS iteration reads

$$x_{k+1} = \mathrm{prox}_{\gamma_k R}\big(x_k - \gamma_k \nabla F(x_k)\big), \quad \gamma_k \in \left]0, \tfrac{2}{L_F}\right[, \tag{1}$$

where $\mathrm{prox}_{\gamma R}$ is the proximity operator of $R$ defined by

$$\mathrm{prox}_{\gamma R}(\cdot) \overset{\mathrm{def}}{=} \operatorname*{argmin}_{x \in \mathbb{R}^n} \gamma R(x) + \tfrac{1}{2}\|x - \cdot\|^2. \tag{2}$$

In general, the advantages of FBS can be summarised as:

- Robust convergence guarantees. Both the sequence and the objective function values are convergent for $0 < \underline{\gamma} \leq \gamma_k \leq \bar{\gamma} < 2/L_F$, where $\underline{\gamma}, \bar{\gamma} > 0$ (Combettes & Wajs, 2005);
- Known convergence rates. There holds $\|x_k - x_{k-1}\| = o(1/\sqrt{k})$ (Liang et al., 2016) and $\Phi(x_k) - \Phi(x^\star) = o(1/k)$ (Molinari et al., 2018), where $x^\star \in \mathrm{Argmin}(\Phi)$. These rates can be improved to linear (also known as geometric or exponential convergence) if, for instance, $\Phi$ is strongly convex (Nesterov, 2004);
- Numerous acceleration techniques. Such as the inertial schemes, including inertial FBS (Liang et al., 2017), FISTA (Beck & Teboulle, 2009) and Nesterov's optimal methods (Nesterov, 2004);
- Structure adaptivity. Several recent works (Liang et al., 2014; 2017) study the manifold identification properties of FBS. It is shown in (Liang et al., 2017) that within finite time, the FBS iterates $x_k$ all lie on the same manifold as the optimal solution $x^\star$. Furthermore, upon identifying this manifold, the FBS iterates can be proved to converge linearly to the optimal solution $x^\star$.

However, despite these advantages, for problem (P), when $m$ is very large, computing $\nabla F(x_k)$ could be very expensive, which makes deterministic FBS-type methods unsuitable for solving large-scale problems.

1.3. Proximal Stochastic Gradient Descent

The most straightforward extension of stochastic gradient descent to problem (P) is proximal stochastic gradient descent (Prox-SGD), which reads, for $k = 0, 1, 2, 3, \dots$

$$\begin{aligned} &\text{sample } i_k \text{ uniformly from } \{1, \dots, m\},\\ &x_{k+1} = \mathrm{prox}_{\gamma_k R}\big(x_k - \gamma_k \nabla f_{i_k}(x_k)\big). \end{aligned} \tag{3}$$

Different from FBS, Prox-SGD only evaluates the gradient of one sampled function $f_{i_k}$ at each step. However, to ensure convergence, the step-size $\gamma_k$ of Prox-SGD has to converge to 0 at a proper speed (e.g. $\gamma_k = k^{-s}$ for $s \in \left]1/2, 1\right]$), leading to only an $O(1/\sqrt{k})$ convergence rate for $\Phi(x_k) - \Phi(x^\star)$. When $\Phi$ is strongly convex, the rate for the objective can only be improved to $O(1/k)$.

Perturbed Forward-Backward Splitting. An alternative perspective for treating Prox-SGD is as a perturbation of the deterministic FBS scheme. More precisely, iteration (3) can be written as the inexact FBS iteration with a stochastic approximation error for the gradient: for $k = 0, 1, 2, 3, \dots$

$$\begin{aligned} &\text{sample } \varepsilon_k \text{ from a finite distribution } \mathcal{D}_k,\\ &x_{k+1} = \mathrm{prox}_{\gamma_k R}\big(x_k - \gamma_k(\nabla F(x_k) + \varepsilon_k)\big). \end{aligned} \tag{4}$$

For most existing stochastic gradient methods, $\mathbb{E}[\varepsilon_k] = 0$ and $\mathbb{E}[\|\varepsilon_k\|^2]$ is the variance of the stochastic gradient. For Prox-SGD, the error $\varepsilon_k$ takes the form

$$\varepsilon_k^{\mathrm{SGD}} \overset{\mathrm{def}}{=} \nabla f_{i_k}(x_k) - \nabla F(x_k). \tag{5}$$

In general, the variance $\mathbb{E}[\|\varepsilon_k^{\mathrm{SGD}}\|^2]$ is only bounded, and does not vanish as $x_k \to x^\star$. One consequence of this non-vanishing variance is that, unlike for FBS, in general manifold identification cannot occur. In Section 1 of the supplementary material, we give an example to illustrate this point.
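The contrast between (1) and (3) is easy to see in code. Below is a minimal sketch on hypothetical LASSO data (all sizes and parameters are illustrative, not the paper's setup): FBS uses the full gradient with a fixed step-size, while Prox-SGD uses a single sampled gradient with a vanishing step-size; for $R = \mu\|\cdot\|_1$ the proximity operator (2) reduces to entry-wise soft-thresholding.

```python
# A minimal sketch (hypothetical LASSO data, not the paper's setup) of one FBS step (1)
# versus one Prox-SGD step (3); prox_{gamma*mu*||.||_1} is entry-wise soft-thresholding.
import numpy as np

rng = np.random.default_rng(0)
m, n, mu = 128, 256, 0.1
K, b = rng.standard_normal((m, n)), rng.standard_normal(m)
L_F = np.linalg.norm(K, 2) ** 2 / m               # Lipschitz constant of grad F

def prox_l1(v, tau):                              # prox of tau*||.||_1 (soft-thresholding)
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def grad_F(x):                                    # full gradient: m component gradients
    return K.T @ (K @ x - b) / m

def grad_f(x, i):                                 # gradient of a single component f_i
    return (K[i] @ x - b[i]) * K[i]

x_fbs = np.zeros(n)
x_sgd = np.zeros(n)
for k in range(1, 2001):
    # FBS: fixed step-size gamma in ]0, 2/L_F[
    x_fbs = prox_l1(x_fbs - (1.0 / L_F) * grad_F(x_fbs), mu / L_F)
    # Prox-SGD: one sampled gradient and a vanishing step-size gamma_k = k^{-s}
    gamma_k = k ** -0.75
    i = rng.integers(m)
    x_sgd = prox_l1(x_sgd - gamma_k * grad_f(x_sgd, i), gamma_k * mu)

print("FBS support size:", np.count_nonzero(x_fbs))
print("Prox-SGD support size:", np.count_nonzero(x_sgd))
```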
1.4. Variance Reduced Stochastic Gradient Methods

To overcome the vanishing step-size and slow convergence speed of Prox-SGD caused by the non-vanishing variance, several variance reduced schemes have been proposed (Defazio et al., 2014; Xiao & Zhang, 2014). The features of variance reduced techniques can be summarised as:

- Same as Prox-SGD, in expectation the stochastic gradient remains an unbiased estimate of the full gradient;
- Different from Prox-SGD, the variance $\mathbb{E}[\|\varepsilon_k\|^2]$ converges to 0 when $x_k$ approaches the solution $x^\star$.

SAGA Algorithm (Defazio et al., 2014). The key idea of SAGA for reducing the variance is utilising the gradient history for the evaluation of the current gradient. Given an initial point $x_0$, define the individual gradients $g_{0,i} \overset{\mathrm{def}}{=} \nabla f_i(x_0)$, $i = 1, \dots, m$. Then, for $k = 0, 1, 2, 3, \dots$

$$\begin{aligned} &\text{sample } i_k \text{ uniformly from } \{1, \dots, m\},\\ &w_k = x_k - \gamma_k\Big(\nabla f_{i_k}(x_k) - g_{k,i_k} + \tfrac{1}{m}\textstyle\sum_{i=1}^{m} g_{k,i}\Big),\\ &x_{k+1} = \mathrm{prox}_{\gamma_k R}(w_k),\\ &g_{k+1,i} = \begin{cases} \nabla f_i(x_k) & \text{if } i = i_k,\\ g_{k,i} & \text{otherwise.} \end{cases} \end{aligned} \tag{6}$$

In the context of (4), the stochastic approximation error $\varepsilon_k$ of SAGA takes the form

$$\varepsilon_k^{\mathrm{SAGA}} \overset{\mathrm{def}}{=} \nabla f_{i_k}(x_k) - g_{k,i_k} + \tfrac{1}{m}\textstyle\sum_{i=1}^{m} g_{k,i} - \nabla F(x_k). \tag{7}$$

Prox-SVRG Algorithm (Xiao & Zhang, 2014). Compared to SAGA, instead of using the gradient history, Prox-SVRG computes the full gradient at a given point and uses it for a certain number of iterations. Let $P$ be a positive integer; for $\ell = 0, 1, 2, \dots$

$$\begin{aligned} &g_\ell = \tfrac{1}{m}\textstyle\sum_{i=1}^{m} \nabla f_i(\tilde{x}_\ell), \quad x_{\ell,0} = \tilde{x}_\ell,\\ &\text{For } p = 1, \dots, P:\\ &\qquad \text{sample } i_p \text{ uniformly from } \{1, \dots, m\},\\ &\qquad w_k = x_{\ell,p-1} - \gamma_k\big(\nabla f_{i_p}(x_{\ell,p-1}) - \nabla f_{i_p}(\tilde{x}_\ell) + g_\ell\big),\\ &\qquad x_{\ell,p} = \mathrm{prox}_{\gamma_k R}(w_k).\\ &\text{Option I}: \ \tilde{x}_{\ell+1} = x_{\ell,P}, \qquad \text{Option II}: \ \tilde{x}_{\ell+1} = \tfrac{1}{P}\textstyle\sum_{p=1}^{P} x_{\ell,p}. \end{aligned} \tag{8}$$

In the context of (4), given $x_{\ell,p}$, denote $k = \ell P + p$; then we have $x_{\ell,p} = x_k$ and the stochastic approximation error $\varepsilon_k$ of Prox-SVRG reads

$$\varepsilon_k^{\mathrm{SVRG}} \overset{\mathrm{def}}{=} \nabla f_{i_p}(x_k) - \nabla f_{i_p}(\tilde{x}_\ell) + g_\ell - \nabla F(x_k). \tag{9}$$
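To make the two updates concrete, here is a minimal sketch (hypothetical LASSO data, not the authors' MATLAB implementation) of the SAGA iteration (6) with its gradient table and of the Prox-SVRG loops (8) with Option I; the step-size $\frac{1}{3L}$ mirrors the choices discussed later in the paper.

```python
# Minimal SAGA (6) and Prox-SVRG (8, Option I) sketches on hypothetical LASSO data.
import numpy as np

rng = np.random.default_rng(0)
m, n, mu = 128, 256, 0.1
K, b = rng.standard_normal((m, n)), rng.standard_normal(m)
L = np.max(np.sum(K ** 2, axis=1))          # L = max_i L_i for f_i(x) = 0.5*(K_i x - b_i)^2

def prox_l1(v, tau):                        # prox of tau*||.||_1 (soft-thresholding)
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def grad_f(x, i):                           # gradient of the i-th component f_i
    return (K[i] @ x - b[i]) * K[i]

def saga(x0, gamma, iters):
    x = x0.copy()
    g = np.array([grad_f(x, i) for i in range(m)])   # gradient table g_{0,i} = grad f_i(x_0)
    g_avg = g.mean(axis=0)                           # running average (1/m) sum_i g_{k,i}
    for _ in range(iters):
        i = rng.integers(m)
        gi = grad_f(x, i)                            # fresh gradient at the current point
        w = x - gamma * (gi - g[i] + g_avg)          # variance-reduced forward step
        x = prox_l1(w, gamma * mu)                   # backward (proximal) step
        g_avg += (gi - g[i]) / m                     # refresh only the sampled table entry
        g[i] = gi
    return x

def prox_svrg(x0, gamma, P, outer):
    x_tilde = x0.copy()
    for _ in range(outer):
        g_full = K.T @ (K @ x_tilde - b) / m         # full gradient at the snapshot
        x = x_tilde.copy()
        for _ in range(P):
            i = rng.integers(m)
            w = x - gamma * (grad_f(x, i) - grad_f(x_tilde, i) + g_full)
            x = prox_l1(w, gamma * mu)
        x_tilde = x                                  # Option I: keep the last inner iterate
    return x_tilde

x_saga = saga(np.zeros(n), 1.0 / (3 * L), 20 * m)
x_svrg = prox_svrg(np.zeros(n), 1.0 / (3 * L), P=m, outer=20)
print(np.count_nonzero(x_saga), np.count_nonzero(x_svrg))
```

Note that SAGA refreshes one table entry per iteration while Prox-SVRG recomputes the full gradient once per outer loop; in both cases the error $\varepsilon_k$ in (4) vanishes as the iterates converge, unlike for Prox-SGD.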
1.5. Contributions

In this paper, we present a very general framework for analysing the local convergence properties of variance reduced stochastic gradient methods. More precisely, this paper makes the following contributions.

Convergence of Sequence for SAGA/Prox-SVRG. Assuming only convexity, we prove the almost sure global convergence of the sequences generated by SAGA (Theorem 2.1) and by Prox-SVRG with Option I (Theorem 2.2), which is new to the literature. Moreover, for Prox-SVRG with Option I, an $O(1/k)$ ergodic convergence rate for the objective function is proved.

Finite Time Manifold Identification. Let $x^\star$ be a global minimiser of problem (P), and suppose that the sequence $\{x_k\}_{k \in \mathbb{N}}$ generated by the perturbed FBS iteration (4) converges to $x^\star$ almost surely. Then, under the additional assumptions that the non-smooth function $R$ is partly smooth at $x^\star$ relative to a $C^2$-smooth manifold $M_{x^\star}$ (Definition 3.1) and that a non-degeneracy condition (Eq. (ND)) holds at $x^\star$, in Theorem 3.2 we prove a general finite time manifold identification result for the perturbed FBS scheme (4). Manifold identification means that after a finite number of iterations, say $K$, there holds $x_k \in M_{x^\star}$ for all $k \geq K$. Specialising the result to the SAGA and Prox-SVRG algorithms, we prove the finite time manifold identification of these two algorithms (Corollary 3.4).

Local Linear Convergence. Building upon the manifold identification result, if moreover $F$ is locally $C^2$-smooth along $M_{x^\star}$ near $x^\star$ and a restricted injectivity condition (Eq. (RI)) is satisfied by the Hessian $\nabla^2 F(x^\star)$, we show that locally SAGA and Prox-SVRG converge linearly.

Local Accelerations. Another important implication of manifold identification is that the globally non-smooth optimisation problem becomes locally $C^2$-smooth along the manifold $M_{x^\star}$, and moreover is locally strongly convex if the restricted injectivity condition (RI) is satisfied. This implies that locally we have many acceleration options to choose from; for instance, we can turn to higher-order optimisation methods, such as (quasi-)Newton methods or Riemannian manifold based optimisation methods, which can lead to super-linear convergence.

Lastly, for the numerical experiments considered in this paper, the corresponding MATLAB source code to reproduce the results is available online at https://github.com/jliang993/Local-VRSGD.

1.6. Notations

Throughout the paper, $\mathbb{N}$ denotes the set of non-negative integers and $k \in \mathbb{N}$ denotes the iteration index. $\mathbb{R}^n$ is the Euclidean space of dimension $n$. For a non-empty convex set $\Omega \subset \mathbb{R}^n$, $\mathrm{ri}(\Omega)$ and $\mathrm{rbd}(\Omega)$ denote its relative interior and relative boundary, respectively. Let $R$ be proper, convex and lower semi-continuous; its sub-differential is defined by
$$\partial R(x) \overset{\mathrm{def}}{=} \big\{g \in \mathbb{R}^n \,\big|\, R(y) \geq R(x) + \langle g, y - x \rangle,\ \forall y \in \mathbb{R}^n\big\}.$$
A function $R$ is $\alpha$-strongly convex for some $\alpha > 0$ if $R(x) - \frac{\alpha}{2}\|x\|^2$ is convex.

2. Global Convergence of SAGA/Prox-SVRG

In this section, we prove the convergence of the sequences generated by SAGA and by Prox-SVRG with Option I without strong convexity, while in their respective original works (Defazio et al., 2014; Xiao & Zhang, 2014) only the convergence of the objective function value is provided.

We first present the convergence result for the SAGA algorithm; recall that $L$ is the uniform Lipschitz constant of the gradients of the functions $f_i$, $i = 1, \dots, m$.

Theorem 2.1 (Convergence of SAGA). For problem (P), suppose that conditions (A.1)-(A.3) hold. Let $\{x_k\}_{k \in \mathbb{N}}$ be the sequence generated by the SAGA algorithm (6) with $\gamma_k \equiv \frac{1}{3L}$. Then there exists an $x^\star \in \mathrm{Argmin}(\Phi)$ such that, almost surely, $\Phi(x_k) \to \Phi(x^\star)$, $x_k \to x^\star$ and $\varepsilon_k^{\mathrm{SAGA}} \to 0$.

Next we provide the convergence result for the Prox-SVRG algorithm with Option I. Given $\ell \in \mathbb{N}$ and $p \in \{1, \dots, P\}$, let $k = \ell P + p$ and denote $x_{\ell,p} = x_k$. For the sequence $\{x_k\}_{k \in \mathbb{N}}$, define the ergodic average $\bar{x}_k \overset{\mathrm{def}}{=} \frac{1}{k}\sum_{j=1}^{k} x_j$.

Theorem 2.2 (Convergence of Prox-SVRG). For problem (P), suppose that conditions (A.1)-(A.3) hold. Let $\{x_k\}_{k \in \mathbb{N}}$ be the sequence generated by the Prox-SVRG algorithm (8) with Option I. Then:
(i) If we fix $\gamma_k \equiv \gamma$ with $\gamma \leq \frac{1}{4L(P+2)}$, then there exists a minimiser $x^\star \in \mathrm{Argmin}(\Phi)$ such that $x_k \to x^\star$ and $\varepsilon_k^{\mathrm{SVRG}} \to 0$ almost surely. Moreover, for each $k = \ell P$ there holds $\mathbb{E}[\Phi(\bar{x}_k) - \Phi(x^\star)] = O(1/k)$;
(ii) Suppose that $R$ and $F$ are moreover $\alpha_R$- and $\alpha_F$-strongly convex, respectively. Then, if $4L\gamma(P+1) < 1$, there holds $\mathbb{E}\big[\|\tilde{x}_\ell - x^\star\|^2\big] = O(\rho_{\mathrm{SVRG}}^{\ell})$, where $\rho_{\mathrm{SVRG}} = \max\big\{\tfrac{1-\gamma\alpha_F}{1+\gamma\alpha_R},\ 4L\gamma(P+1)\big\}$.

Remark 2.3. To the best of our knowledge, the $O(1/k)$ ergodic convergence rate of $\{\mathbb{E}[\Phi(\bar{x}_k) - \Phi(x^\star)]\}_{k \in \mathbb{N}}$ is new to the literature; it also holds for SVRG (Johnson & Zhang, 2013).
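As a quick numeric reading of the rate in Theorem 2.2(ii), the snippet below evaluates the two terms of $\rho_{\mathrm{SVRG}}$; all constants ($L$, $\alpha_F$, $\alpha_R$, $P$) are made up for illustration and are not taken from the paper.

```python
# Illustrative evaluation of rho_SVRG from Theorem 2.2(ii); every constant here is made up.
L, alpha_F, alpha_R = 1000.0, 0.05, 0.01      # hypothetical Lipschitz / strong convexity moduli
P = 64                                        # length of the inner loop
gamma = 1.0 / (4.0 * L * (P + 2))             # step-size, so that 4*L*gamma*(P+1) < 1

term_fb = (1.0 - gamma * alpha_F) / (1.0 + gamma * alpha_R)   # forward-backward contraction factor
term_vr = 4.0 * L * gamma * (P + 1)                           # variance-reduction term
rho_svrg = max(term_fb, term_vr)
print(f"4*L*gamma*(P+1) = {term_vr:.4f}, rho_SVRG = {rho_svrg:.8f}")
```

With a step-size of order $1/(LP)$ the first term is extremely close to 1, which already hints at the discussion in Section 4.2 on how conservative these stochastic rate estimates can be compared with $\rho_{\mathrm{FBS}}$.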
3. Finite Time Manifold Identification

From this section on, we move to the local convergence properties of the SAGA/Prox-SVRG algorithms.

3.1. Partial Smoothness

Let $M_x$ be a $C^2$-smooth Riemannian manifold of $\mathbb{R}^n$ around a point $x$. Denote by $T_{M_x}(x)$ the tangent space of $M_x$ at the point $x$. Given a set $S$, $\mathrm{par}(S) \overset{\mathrm{def}}{=} \mathbb{R}(S - S)$ is the subspace parallel to the affine hull of $S$, and $(\cdot)^{\perp}$ denotes the orthogonal complement.

Definition 3.1 (Partial smoothness (Lewis, 2003)). Let $R : \mathbb{R}^n \to \mathbb{R} \cup \{+\infty\}$ be proper, convex and lower semi-continuous. Then $R$ is said to be partly smooth at $x$ relative to a set $M_x$ containing $x$ if $\partial R(x) \neq \emptyset$ and, moreover,
- Smoothness: $M_x$ is a $C^2$-manifold around $x$, and $R$ restricted to $M_x$ is $C^2$ around $x$;
- Sharpness: the tangent space $T_{M_x}(x)$ coincides with $T_x \overset{\mathrm{def}}{=} \mathrm{par}\big(\partial R(x)\big)^{\perp}$;
- Continuity: the set-valued mapping $\partial R$ is continuous at $x$ relative to $M_x$.

The class of partly smooth functions at $x$ relative to $M_x$ is denoted as $\mathrm{PSF}_x(M_x)$. Many popular non-smooth penalty functions are partly smooth, such as the sparsity promoting $\ell_1$-norm, the group sparsity promoting $\ell_{1,2}$-norm and the low rank promoting nuclear norm; see Table 1 for more details. We refer to (Liang et al., 2017) and references therein for detailed properties of these partly smooth functions.

3.2. An Abstract Finite Time Manifold Identification

Recall the perturbed FBS iteration (4). We have the following abstract manifold identification result.

Theorem 3.2 (Abstract manifold identification). For problem (P), suppose that conditions (A.1)-(A.3) hold. For the perturbed FBS iteration (4), suppose that:
(B.1) There exists $\underline{\gamma} > 0$ such that $\liminf_{k \to +\infty} \gamma_k \geq \underline{\gamma}$;
(B.2) The error sequence $\{\varepsilon_k\}_{k \in \mathbb{N}}$ converges to 0 almost surely;
(B.3) There exists an $x^\star \in \mathrm{Argmin}(\Phi)$ such that $\{x_k\}_{k \in \mathbb{N}}$ converges to $x^\star$ almost surely.
For the $x^\star$ in (B.3), suppose that $R \in \mathrm{PSF}_{x^\star}(M_{x^\star})$ and that the following non-degeneracy condition holds:
$$-\nabla F(x^\star) \in \mathrm{ri}\big(\partial R(x^\star)\big). \tag{ND}$$
Then, there exists a $K > 0$ such that for all $k \geq K$, we have $x_k \in M_{x^\star}$ almost surely.

Remark 3.3. In the deterministic setting, i.e. when $\varepsilon_k$ is a deterministic approximation error, the finite manifold identification of (4) is discussed in Section 3.3 of (Liang et al., 2017).

3.3. Manifold Identification of SAGA/Prox-SVRG

Specialising Theorem 3.2 to the case of the SAGA/Prox-SVRG algorithms, we obtain the corollary below. For Prox-SVRG, recall that $x_{\ell,p}$ is denoted as $x_k$ with $k = \ell P + p$.

Corollary 3.4. For problem (P), suppose that conditions (A.1)-(A.3) hold. Suppose that SAGA is applied under the conditions of Theorem 2.1, and that Prox-SVRG is run under the conditions of Theorem 2.2. Then there exists an $x^\star \in \mathrm{Argmin}(\Phi)$ such that the sequence $\{x_k\}_{k \in \mathbb{N}}$ generated by either algorithm converges to $x^\star$ almost surely. If, moreover, $R \in \mathrm{PSF}_{x^\star}(M_{x^\star})$ and the non-degeneracy condition (ND) holds, then there exists a $K > 0$ such that for all $k \geq K$, $x_k \in M_{x^\star}$ almost surely.

Remark 3.5. In (Lee & Wright, 2012; Duchi & Ruan, 2016), manifold identification properties of the regularised dual averaging algorithm (RDA) (Xiao, 2010) were reported. The main difference between the present work and those of (Lee & Wright, 2012; Duchi & Ruan, 2016) is that RDA is proposed for solving (P) with an infinite sum, while in this work we mainly focus on the finite-sum case. We remark also that although RDA has manifold identification properties, the method does not exhibit linear convergence under strong convexity; hence, in contrast to variance reduction methods, there can be no local acceleration in the convergence rate upon manifold identification.
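In practice, the finite-time identification of Corollary 3.4 can be detected by monitoring a signature of the manifold along the iterations, e.g. the support $\mathrm{supp}(x_k)$ for the $\ell_1$-norm (or the rank of $x_k$ for the nuclear norm), and declaring identification once that signature stops changing. The following is a minimal sketch of such a monitor (illustrative only; the patience threshold and zero tolerance are hypothetical choices, not from the paper):

```python
# A minimal sketch (illustrative, not from the paper) of monitoring manifold identification
# for R = ||.||_1 by tracking the support of the iterates.
import numpy as np

def support(x, tol=0.0):
    # indices of the (numerically) non-zero entries of x
    return frozenset(np.flatnonzero(np.abs(x) > tol).tolist())

class IdentificationMonitor:
    def __init__(self, patience):
        self.patience = patience      # e.g. patience = m iterations
        self.last = None
        self.stable_for = 0

    def update(self, x):
        s = support(x)
        self.stable_for = self.stable_for + 1 if s == self.last else 0
        self.last = s
        return self.stable_for >= self.patience   # True once the support has stabilised

# usage inside a SAGA/Prox-SVRG loop (x_k denotes the current iterate):
#   monitor = IdentificationMonitor(patience=m)
#   if monitor.update(x_k):
#       ...  # e.g. adapt the step-size or switch to a higher-order method (Section 5)
```

This is the kind of quantity plotted later in the experiments (Figure 1(a) shows $|\mathrm{supp}(x_k)|$), and it is also the natural trigger for the local accelerations discussed in Section 5.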
4. Local Linear Convergence

Now we turn to the local linear convergence properties of the SAGA/Prox-SVRG algorithms. Throughout this section, $x^\star \in \mathrm{Argmin}(\Phi)$ denotes a global minimiser of (P), $M_{x^\star}$ is a $C^2$-smooth manifold which contains $x^\star$, and $T_{x^\star}$ denotes the tangent space of $M_{x^\star}$ at $x^\star$.

Table 1: Examples of partly smooth functions. For $x \in \mathbb{R}^n$ and some subset of indices $b \subset \{1, \dots, n\}$, $x_b$ is the restriction of $x$ to the entries indexed in $b$. $D_{\mathrm{DIF}}$ stands for the finite differences operator, and $\mathrm{rank}(z)$ denotes the rank of a matrix.
- $\ell_1$-norm: $\|x\|_1 = \sum_{i=1}^{n} |x_i|$; $M_x = T_x = \{z \in \mathbb{R}^n : I_z \subseteq I_x\}$, where $I_x = \{i : x_i \neq 0\}$.
- $\ell_{1,2}$-norm: $\sum_{i=1}^{m} \|x_{b_i}\|$; $M_x = T_x = \{z \in \mathbb{R}^n : I_z \subseteq I_x\}$, where $I_x = \{i : x_{b_i} \neq 0\}$.
- $\ell_{\infty}$-norm: $\max_{i=1,\dots,n} |x_i|$; $M_x = T_x = \{z \in \mathbb{R}^n : z_{I_x} \in \mathbb{R}\,\mathrm{sign}(x_{I_x})\}$, where $I_x = \{i : |x_i| = \|x\|_{\infty}\}$.
- TV semi-norm: $\|x\|_{\mathrm{TV}} = \|D_{\mathrm{DIF}} x\|_1$; $M_x = T_x = \{z \in \mathbb{R}^n : I_{D_{\mathrm{DIF}} z} \subseteq I_{D_{\mathrm{DIF}} x}\}$, where $I_{D_{\mathrm{DIF}} x} = \{i : (D_{\mathrm{DIF}} x)_i \neq 0\}$.
- Nuclear norm: $\|x\|_* = \sum_{i=1}^{r} \sigma_i(x)$, where $\sigma(x)$ are the singular values of $x$; $M_x = \{z \in \mathbb{R}^{n_1 \times n_2} : \mathrm{rank}(z) = \mathrm{rank}(x) = r\}$.

4.1. Local Linear Convergence

Similar to (Liang et al., 2017) for the deterministic FBS-type methods, the key assumption for establishing local linear convergence of SAGA/Prox-SVRG is a so-called restricted injectivity condition; see below.

Restricted Injectivity. Let $F$ be locally $C^2$-smooth around the minimiser $x^\star$, and suppose moreover that the following restricted injectivity condition holds:
$$\ker\big(\nabla^2 F(x^\star)\big) \cap T_{x^\star} = \{0\}. \tag{RI}$$

Owing to Proposition 12 of (Liang et al., 2017), it can be shown that under condition (RI), $x^\star$ is actually the unique minimiser of problem (P), and $\Phi$ grows locally quadratically if moreover $R \in \mathrm{PSF}_{x^\star}(M_{x^\star})$.

Lemma 4.1 (Local quadratic growth (Liang et al., 2017)). For problem (P), suppose that assumptions (A.1)-(A.3) hold. Let $x^\star \in \mathrm{Argmin}(\Phi)$ be a global minimiser such that conditions (ND) and (RI) are fulfilled and $R \in \mathrm{PSF}_{x^\star}(M_{x^\star})$. Then $x^\star$ is the unique minimiser of (P) and there exist $\alpha > 0$ and $r > 0$ such that
$$\Phi(x) - \Phi(x^\star) \geq \alpha\|x - x^\star\|^2 \quad \text{for all } x \text{ such that } \|x - x^\star\| \leq r.$$

Remark 4.2. A similar result can be found in Theorem 5 of (Lee & Wright, 2012).

Lemma 4.1 implies that when a sequence-convergent stochastic method is applied to solve (P), eventually the generated sequence $\{x_k\}_{k \in \mathbb{N}}$ will enter a local neighbourhood of the solution $x^\star \in \mathrm{Argmin}(\Phi)$ in which the function has the quadratic growth property. If, moreover, this method is linearly convergent under strong convexity, then locally it will also converge linearly under quadratic growth. As a consequence, we have the following propositions.

Proposition 4.3 (Local linear convergence of SAGA). For problem (P), suppose that conditions (A.1)-(A.3) hold, and that the SAGA algorithm (6) is applied with $\gamma_k \equiv \frac{1}{3L}$. Then $x_k$ converges to $x^\star \in \mathrm{Argmin}(\Phi)$ almost surely. If, moreover, $R \in \mathrm{PSF}_{x^\star}(M_{x^\star})$ and conditions (ND)-(RI) are satisfied, then there exists a $K > 0$ such that for all $k \geq K$,
$$\mathbb{E}\big[\|x_k - x^\star\|^2\big] = O\big(\rho_{\mathrm{SAGA}}^{k-K}\big), \quad \text{where } \rho_{\mathrm{SAGA}} = 1 - \min\big\{\tfrac{1}{4m}, \tfrac{\alpha}{3L}\big\}$$
and $\alpha$ is the quadratic growth constant of Lemma 4.1. The claim is a direct consequence of Theorem 1 of (Defazio et al., 2014).

Proposition 4.4 (Local linear convergence of Prox-SVRG). For problem (P), suppose that conditions (A.1)-(A.3) hold, and that the Prox-SVRG algorithm (8) is applied such that Theorem 2.2 holds. Then $x_k$ converges to $x^\star \in \mathrm{Argmin}(\Phi)$ almost surely. If, moreover, $R \in \mathrm{PSF}_{x^\star}(M_{x^\star})$ and conditions (ND)-(RI) are satisfied, then there exists a $K > 0$ such that for all $\ell \geq K$,
$$\mathbb{E}\big[\|\tilde{x}_\ell - x^\star\|^2\big] = O\big(\rho_{\mathrm{SVRG}}^{\ell-K}\big), \quad \text{where } \rho_{\mathrm{SVRG}} = \max\big\{\tfrac{1-\gamma\alpha_F}{1+\gamma\alpha_R},\ 4L\gamma(P+1)\big\}$$
and $\gamma, P$ are chosen such that $\rho_{\mathrm{SVRG}} < 1$. The claim is a direct consequence of Theorem 2.2(ii).
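For the LASSO instance of (P), where $F(x) = \frac{1}{2m}\|Kx - b\|^2$ and $R = \mu\|\cdot\|_1$, both conditions can be checked a posteriori at a candidate solution $x^\star$ with support $S$ (Section 6.1 reports such an a posteriori check for its logistic regression example): (ND) amounts to $-\nabla F(x^\star)$ equalling $\mu\,\mathrm{sign}(x^\star)$ on $S$ and being strictly smaller than $\mu$ in magnitude off $S$, and (RI) amounts to the columns of $K$ indexed by $S$ being linearly independent. A minimal sketch of such a check (illustrative; the variable names are hypothetical):

```python
# A minimal sketch (illustrative, not from the paper) of an a-posteriori check of (ND) and
# (RI) for the LASSO instance of (P), where F(x) = (1/(2m))*||K x - b||^2 and R = mu*||.||_1.
import numpy as np

def check_nd_ri(K, b, x_star, mu, tol=1e-8):
    m = K.shape[0]
    g = K.T @ (K @ x_star - b) / m                   # grad F(x_star)
    S = np.abs(x_star) > tol                         # support of x_star
    # (ND): -grad F(x_star) in ri(mu * subdifferential of ||.||_1 at x_star):
    #       equality mu*sign on the support, strict inequality |.| < mu off the support.
    nd = (np.allclose(-g[S], mu * np.sign(x_star[S]), atol=1e-6)
          and np.all(np.abs(g[~S]) < mu - tol))
    # (RI): ker(hess F) meets T_{x_star} = {z : supp(z) in S} only at 0,
    #       i.e. the columns of K on the support are linearly independent.
    ri = np.linalg.matrix_rank(K[:, S]) == int(S.sum())
    return nd, ri

# usage (hypothetical): nd_ok, ri_ok = check_nd_ri(K, b, x_saga, mu)
```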
4.2. Better Linear Rate Estimation?

In this part, we briefly discuss the obtained linear rate estimations for SAGA/Prox-SVRG (i.e. $\rho_{\mathrm{SAGA}}$, $\rho_{\mathrm{SVRG}}$), in comparison to the one obtained in (Liang et al., 2017) for deterministic FBS. For the deterministic FBS algorithm, when $R$ is locally polyhedral around $x^\star$, Theorem 21 of (Liang et al., 2017) implies that the local convergence rate of FBS is $\rho_{\mathrm{FBS}} = 1 - \gamma\alpha$. In contrast, the rate estimations $\rho_{\mathrm{SAGA}}$ and $\rho_{\mathrm{SVRG}}$ both depend on the number of functions $m$, which means that their tightness (i.e. how close the rate estimation is to the practically observed rate of the algorithms) could be much worse than that of $\rho_{\mathrm{FBS}}$.

This naturally leads to the question of whether the rate estimations for SAGA and Prox-SVRG can be improved. Numerically, the answer seems to be problem dependent: for some problems $\rho_{\mathrm{FBS}}$ can be achieved, see Example 6.1 and Figure 1, while for other problems the practical performance of SAGA/SVRG is much slower than $\rho_{\mathrm{FBS}}$; see Section 4 of the supplementary material for details. Note that we are comparing the rates per iteration, not in terms of gradient evaluation complexity.

5. Beyond Local Convergence Analysis

As already discussed, manifold identification implies that the globally non-smooth problem $\min_{x \in \mathbb{R}^n} \Phi(x)$ locally becomes a $C^2$-smooth and possibly non-convex (e.g. for the nuclear norm) problem constrained to the identified manifold, $\min_{x \in M_{x^\star}} \Phi(x)$. Such a transition to local $C^2$-smoothness provides various choices of acceleration. For instance, in (Lee & Wright, 2012), the authors proposed a local version of RDA, called RDA+, which achieves linear convergence.

5.1. Better Local Lipschitz Continuity

If the dimension of the manifold $M_{x^\star}$ is much smaller than that of the whole space $\mathbb{R}^n$, then, constrained to $M_{x^\star}$, the Lipschitz property of the smooth part becomes much better. For each $i \in \{1, \dots, m\}$, denote by $L_{M_{x^\star},i}$ the Lipschitz constant of $\nabla f_i$ along the manifold $M_{x^\star}$, and let
$$L_{M_{x^\star}} \overset{\mathrm{def}}{=} \max_{i=1,\dots,m} L_{M_{x^\star},i}.$$
In general, locally around $x^\star$, we have $L_{M_{x^\star}} \leq L$. For SAGA/Prox-SVRG, and for other stochastic methods which have the manifold identification property, once the manifold is identified one can adapt the step-size to the local Lipschitz constants. Since the step-size is crucial to the convergence speed of these methods, the potential acceleration of such a local adaptive strategy can be significant. See Section 6.1 for a numerical example, and the supplementary material for how to compute or approximate $L_{M_{x^\star}}$.

5.2. Lower Computational Complexity

Another important aspect of the manifold identification property is that one can naturally reduce the computational cost, especially when $M_{x^\star}$ is of very low dimension. Take $R$ as the $\ell_1$-norm, for example, and suppose that the solution $x^\star$ of $\Phi$ is $\kappa$-sparse, i.e. the number of non-zero entries of $x^\star$ is $\kappa$. We then have two stages of gradient evaluation complexity for $\nabla f_i(x_k)$:
- before identification: $O(n)$;
- after identification: $O(\kappa)$.

Now let $R$ be the nuclear norm, suppose $F$ is quadratic, and let the solution $x^\star$ be of rank $\kappa$. We again have two stages of gradient evaluation complexity for $\nabla f_i(x_k)$ (after identification, $x_k$ is stored in terms of its SVD):
- before identification: $O(n^2)$;
- after identification: $O(\kappa n)$.

For both cases, the reduction in computational cost depends on the ratio $n/\kappa$; a small sketch of both points is given below.
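To make Sections 5.1 and 5.2 concrete in the $\ell_1$/least-squares case, the sketch below (hypothetical data; the identified support is hard-coded purely for illustration) computes the Lipschitz constant of $\nabla f_i$ restricted to the identified support, which governs the enlarged local step-size, and a restricted gradient whose cost is $O(\kappa)$ instead of $O(n)$:

```python
# A minimal sketch (hypothetical data; the "identified" support is hard-coded) of the
# local Lipschitz constant of Section 5.1 and the reduced gradient cost of Section 5.2,
# for f_i(x) = 0.5*(K_i x - b_i)^2 and R = ||.||_1.
import numpy as np

rng = np.random.default_rng(0)
m, n, kappa = 128, 256, 8
K = rng.standard_normal((m, n))
b = rng.standard_normal(m)
S = np.arange(kappa)                              # identified support (illustrative)

L_global = np.max(np.sum(K ** 2, axis=1))         # L   = max_i ||K_i||^2
L_local = np.max(np.sum(K[:, S] ** 2, axis=1))    # L_M = max_i ||(K_i)_S||^2 on the manifold
print(f"L = {L_global:.1f},  L_M = {L_local:.1f},  possible step-size gain ~ {L_global / L_local:.1f}x")

def grad_f_i_restricted(xS, i):
    # gradient of f_i restricted to {x : supp(x) = S}: touches only kappa coordinates, O(kappa)
    return (K[i, S] @ xS - b[i]) * K[i, S]
```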
5.3. Higher-order Acceleration

The last acceleration strategy is Riemannian manifold based higher-order acceleration. Recently, various Riemannian manifold based optimisation methods have been proposed in the literature (Kressner et al., 2014; Ring & Wirth, 2012; Vandereycken, 2013; Boumal et al., 2014), particularly for low-rank matrix recovery. However, an obvious drawback of this class of methods is that the manifold has to be known a priori, which limits their applications. The manifold identification property of proximal methods implies that one can first use the proximal method to identify the correct manifold, and then turn to manifold based optimisation methods. The higher-order methods that can be applied include Newton-type methods, when the restricted injectivity condition (RI) is satisfied, and Riemannian geometry based optimisation methods (Lemaréchal et al., 2000; Miller & Malick, 2005; Smith, 1994; Boumal et al., 2014; Vandereycken, 2013), for instance the non-linear conjugate gradient method (Smith, 1994). Stochastic Riemannian manifold based optimisation methods have also been studied in the literature; for instance, in (Zhang et al., 2016) the authors generalise SVRG to the manifold setting.

6. Numerical Experiments

We now consider several examples to verify the established results. Three choices of $R$ are considered: the sparsity promoting $\ell_1$-norm, the group sparsity promoting $\ell_{1,2}$-norm, and the low rank promoting nuclear norm. As the main focus of this work is the theoretical properties of the SAGA and Prox-SVRG algorithms, the scale of the problems considered is not very large. In the supplementary material, experiments on large-scale real data are presented.

6.1. Local Linear Convergence

We consider the sparse logistic regression problem to demonstrate the manifold identification and local linear convergence of the SAGA/Prox-SVRG algorithms. Moreover, in this experiment we provide only the rate estimation from the FBS scheme, which is $\rho_{\mathrm{FBS}} = 1 - \gamma\alpha$.

Example 6.1 (Sparse logistic regression). Let $m > 0$ and let $(z_i, y_i) \in \mathbb{R}^n \times \{\pm 1\}$, $i = 1, \dots, m$, be the training set. Sparse logistic regression seeks a linear decision function which minimises the objective
$$\min_{(x,b) \in \mathbb{R}^n \times \mathbb{R}} \mu\|x\|_1 + \frac{1}{m}\sum_{i=1}^{m} \log\big(1 + e^{-y_i f(z_i; x, b)}\big),$$
where $f(z; x, b) = b + z^{\mathrm{T}} x$.

The setting of the experiment is: $n = 256$, $m = 128$, $\mu = 1/\sqrt{m}$ and $L = 1188$. Notice that the dimension of the problem is larger than the number of training points. The parameter choices of SAGA and Prox-SVRG are: SAGA: $\gamma = \frac{1}{2L}$; Prox-SVRG: $\gamma = \frac{1}{3L}$.

Figure 1 (two panels: (a) $|\mathrm{supp}(x_k)|$, (b) $\|x_k - x^\star\|$): Finite manifold identification and local linear convergence of SAGA and Prox-SVRG for solving the sparse logistic regression problem in Example 6.1. (a) manifold identification; (b) local linear convergence. $\rho_{\mathrm{FBS}}$ is the rate estimation from the FBS scheme, that is $\rho_{\mathrm{FBS}} = 1 - \gamma\alpha$, where $\gamma$ is the step-size and $\alpha$ is from Lemma 4.1.

Remark 6.2. The reason for choosing different step-sizes for SAGA and Prox-SVRG is only to distinguish the red and black plots in Figure 1. For the considered synthetic example, the performance of the two algorithms is almost the same under the same step-size.

The observations of the experiments are shown in Figure 1. The observations of Prox-SVRG are for the inner-loop sequence $x_{\ell,p}$, which is denoted as $x_k$ by letting $k = \ell P + p$. The non-degeneracy condition (ND) and the restricted injectivity condition (RI) are checked a posteriori, and both are satisfied for the tested example. The local quadratic growth parameter and the local Lipschitz constant are $\alpha = 0.0156$ and $L_{M_{x^\star}} = 61$. Note that locally the Lipschitz constant becomes about 19 times better.

Finite Manifold Identification. In Figure 1(a), we plot the size of the support of the sequence $\{x_k\}_{k \in \mathbb{N}}$ generated by the two algorithms. The lines are sub-sampled, one out of every $m$ points. The two algorithms are run with the same initial point. It can be observed that SAGA shows slightly faster manifold identification than Prox-SVRG; this is due to the fact that the step-size of SAGA (i.e. $\gamma = \frac{1}{2L}$) is larger than that of Prox-SVRG (i.e. $\gamma = \frac{1}{3L}$). As mentioned in Remark 6.2, the identification speed of the two algorithms will be rather similar if they are run under the same choice of step-size.
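For reference, here is a minimal sketch (synthetic data, not the paper's training set) of the per-sample loss of Example 6.1 and its Lipschitz constant; since the curvature of the logistic loss is at most $1/4$, one has $L_i = (\|z_i\|^2 + 1)/4$ and $L = \max_i L_i$ (the paper reports $L = 1188$ for its data):

```python
# A minimal sketch (synthetic data, not the paper's training set) of the per-sample
# logistic loss of Example 6.1 and its Lipschitz constant L_i = (||z_i||^2 + 1)/4.
import numpy as np

rng = np.random.default_rng(0)
m, n = 128, 256
Z = rng.standard_normal((m, n))                # features z_i (rows of Z)
y = np.sign(rng.standard_normal(m))            # labels y_i in {-1, +1}

def f_i(x, b, i):
    u = y[i] * (b + Z[i] @ x)
    return np.logaddexp(0.0, -u)               # log(1 + exp(-u)), numerically stable

def grad_f_i(x, b, i):
    u = y[i] * (b + Z[i] @ x)
    s = -y[i] / (1.0 + np.exp(u))              # derivative of the loss w.r.t. (b + z_i^T x)
    return s * Z[i], s                         # gradient w.r.t. x and w.r.t. the intercept b

L_i = (np.sum(Z ** 2, axis=1) + 1.0) / 4.0     # the logistic curvature is bounded by 1/4
print("L = max_i L_i =", L_i.max())
```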
Local Linear Convergence. In Figure 1(b), we demonstrate the convergence rate of $\{\|x_k - x^\star\|\}_{k \in \mathbb{N}}$ for the two algorithms. The two solid lines are the practical observations of $\{\|x_k - x^\star\|\}_{k \in \mathbb{N}}$ generated by SAGA and Prox-SVRG, the two dashed lines are the theoretical estimations using $\rho_{\mathrm{FBS}}$, and the two dot-dashed lines are the practical observations of the acceleration of SAGA/Prox-SVRG based on the local Lipschitz constant $L_{M_{x^\star}}$. The lines are also sub-sampled, one out of every $m$ points. For the considered problem, given the values of $\alpha$ and $\gamma$ above, we have:
SAGA: $\rho_{\mathrm{FBS}} = 0.999993$, $\rho_{\mathrm{FBS}}^m = 0.99916$; Prox-SVRG: $\rho_{\mathrm{FBS}} = 0.999995$, $\rho_{\mathrm{FBS}}^m = 0.99944$.
For the considered problem setting, this rate estimation matches the practical observations very well.

To conclude this part, we highlight the benefits of adapting to the local Lipschitz continuity of the problem. For both SAGA and Prox-SVRG, the adaptive schemes (the dot-dashed lines) are about 16 times faster than the non-adaptive ones (the solid lines). This acceleration gain is of the same order as the ratio between the global and local Lipschitz constants, which is about 19. More importantly, the computational cost of evaluating the local Lipschitz constant is almost negligible, which makes the adaptive scheme preferable in practice.

6.2. Local Higher-order Acceleration

We consider two problems, group sparse regression and low-rank regression, to demonstrate local higher-order acceleration.

Example 6.3 (Group sparse and low-rank regression). Let $x_{\mathrm{ob}} \in \mathbb{R}^n$ be either a group sparse vector or a low-rank matrix (in vectorised form), and consider the following observation model
$$b = K x_{\mathrm{ob}} + \omega,$$
where the entries of $K \in \mathbb{R}^{m \times n}$ are sampled from an i.i.d. zero-mean, unit-variance Gaussian distribution, and $\omega \in \mathbb{R}^m$ is an additive error with bounded $\ell_2$-norm. Let $\mu > 0$, and let $R$ be either the group sparsity promoting $\ell_{1,2}$-norm or the low rank promoting nuclear norm. Consider the following problem to recover or approximate $x_{\mathrm{ob}}$:
$$\min_{x \in \mathbb{R}^n} \mu R(x) + \frac{1}{m}\sum_{i=1}^{m} \tfrac{1}{2}\|K_i x - b_i\|_2^2,$$
where $K_i$ and $b_i$ denote the $i$th row of $K$ and the $i$th entry of $b$, respectively.

We use the following settings for the two choices of $R$:
- $\ell_{1,2}$-norm: $(m, n) = (256, 512)$, and $x_{\mathrm{ob}}$ has 8 non-zero blocks of block-size 4;
- Nuclear norm: $(m, n) = (2048, 4096)$ and $\mathrm{rank}(x_{\mathrm{ob}}) = 4$.

We consider only the SAGA algorithm for this test, as the main purpose is to highlight higher-order acceleration. For the $\ell_{1,2}$-norm, the Newton method is applied after manifold identification, while for the nuclear norm a non-linear conjugate gradient method (Boumal et al., 2014) is applied after manifold identification.

Figure 2 (two panels: (a) Group sparsity, SAGA vs. SAGA + Newton; (b) Low rank, SAGA vs. SAGA + non-linear CG): Local higher-order acceleration after manifold identification for Example 6.3. (a) the Newton method is applied after the manifold is identified by SAGA; (b) non-linear conjugate gradient is applied after manifold identification. The black line is the observation of the SAGA algorithm, and the red line is the observation of the SAGA+higher-order scheme. The black lines of SAGA for both examples are not sub-sampled.

The numerical results are shown in Figure 2. For the $\ell_{1,2}$-norm, the black line is the observation of the SAGA algorithm with $\gamma = \frac{1}{3L}$, and the red line is the observation of the SAGA+Newton hybrid scheme. It should be noted that the lines are not sub-sampled. For the hybrid scheme, SAGA is used for manifold identification, and the Newton method is applied once the manifold is identified. As observed, the quadratically convergent Newton method converges in only a few steps.
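The hybrid scheme is straightforward to realise once identification has occurred. The sketch below is illustrative only and uses the sparse $\ell_1$/least-squares case rather than the paper's $\ell_{1,2}$ setting, purely to keep the reduced problem simple: on the identified support the reduced objective is a quadratic plus a linear term, so a single Newton step solves the reduced optimality system, mirroring the few Newton steps visible in Figure 2(a).

```python
# A minimal, illustrative sketch of the SAGA -> Newton hybrid for the sparse (l1) case:
# once the support S is identified, the reduced problem on {x : supp(x) = S} with fixed
# signs is  min_{x_S}  mu * sign(x_S)^T x_S + (1/(2m)) ||K_S x_S - b||^2,  which a single
# Newton step solves exactly (it is a quadratic).  Requires (RI), i.e. K_S injective.
import numpy as np

def newton_on_support(K, b, mu, x):
    m = K.shape[0]
    S = np.flatnonzero(x)                        # identified support
    K_S = K[:, S]
    H = K_S.T @ K_S / m                          # reduced Hessian of F on the manifold
    g = K_S.T @ (K_S @ x[S] - b) / m + mu * np.sign(x[S])   # reduced gradient of Phi
    x_new = x.copy()
    x_new[S] = x[S] - np.linalg.solve(H, g)      # Newton step on the manifold
    return x_new

# usage (hypothetical): once the support of the SAGA iterates has stabilised
# (e.g. using a support-stability check as sketched earlier),
#   x_k = newton_on_support(K, b, mu, x_k)
```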
For the nuclear norm, a non-linear conjugate gradient method is applied once the manifold is identified. Similar to the observation for the $\ell_{1,2}$-norm, the super-linearly convergent non-linear conjugate gradient method shows superior performance compared to SAGA.

7. Conclusion

In this paper, we proposed a unified framework of local convergence analysis for proximal variance reduced stochastic gradient methods, focusing especially on the SAGA and Prox-SVRG algorithms. Under partial smoothness, we established that these schemes identify the partly smooth manifold in finite time, and then converge locally linearly. Moreover, we proposed several practical acceleration approaches which can greatly improve the convergence speed of the algorithms.

Acknowledgements

The authors would like to thank F. Bach, J. Fadili and G. Peyré for helpful discussions.
References

Beck, A. and Teboulle, M. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences, 2(1):183-202, 2009.

Boumal, N., Mishra, B., Absil, P.-A., and Sepulchre, R. Manopt, a Matlab toolbox for optimization on manifolds. Journal of Machine Learning Research, 15(1):1455-1459, 2014.

Combettes, P. L. and Wajs, V. R. Signal recovery by proximal Forward-Backward splitting. Multiscale Modeling & Simulation, 4(4):1168-1200, 2005.

Defazio, A., Bach, F., and Lacoste-Julien, S. SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives. In Advances in Neural Information Processing Systems, pp. 1646-1654, 2014.

Duchi, J. and Ruan, F. Local asymptotics for some stochastic optimization problems: Optimality, constraint identification, and dual averaging. arXiv preprint arXiv:1612.05612, 2016.

Johnson, R. and Zhang, T. Accelerating stochastic gradient descent using predictive variance reduction. In Advances in Neural Information Processing Systems, pp. 315-323, 2013.

Kressner, D., Steinlechner, M., and Vandereycken, B. Low-rank tensor completion by Riemannian optimization. BIT Numerical Mathematics, 54(2):447-468, 2014.

Lee, S. and Wright, S. J. Manifold identification in dual averaging for regularized stochastic online learning. Journal of Machine Learning Research, 13(Jun):1705-1744, 2012.

Lemaréchal, C., Oustry, F., and Sagastizábal, C. The U-Lagrangian of a convex function. Transactions of the American Mathematical Society, 352(2):711-729, 2000.

Lewis, A. S. Active sets, nonsmoothness, and sensitivity. SIAM Journal on Optimization, 13(3):702-725, 2003.

Liang, J., Fadili, J., and Peyré, G. Local linear convergence of Forward-Backward under partial smoothness. In Advances in Neural Information Processing Systems, pp. 1970-1978, 2014.

Liang, J., Fadili, J., and Peyré, G. Convergence rates with inexact non-expansive operators. Mathematical Programming, 159(1):403-434, September 2016.

Liang, J., Fadili, J., and Peyré, G. Activity identification and local linear convergence of Forward-Backward-type methods. SIAM Journal on Optimization, 27(1):408-437, 2017.

Lions, P. L. and Mercier, B. Splitting algorithms for the sum of two nonlinear operators. SIAM Journal on Numerical Analysis, 16(6):964-979, 1979.

Miller, S. A. and Malick, J. Newton methods for nonsmooth convex minimization: connections among U-Lagrangian, Riemannian Newton and SQP methods. Mathematical Programming, 104(2-3):609-633, 2005.

Molinari, C., Liang, J., and Fadili, J. Convergence rates of Forward-Douglas-Rachford splitting method. arXiv preprint arXiv:1801.01088, 2018.

Nesterov, Y. Introductory Lectures on Convex Optimization: A Basic Course, volume 87. Springer, 2004.

Ring, W. and Wirth, B. Optimization methods on Riemannian manifolds and their application to shape space. SIAM Journal on Optimization, 22(2):596-627, 2012.

Schmidt, M., Le Roux, N., and Bach, F. Minimizing finite sums with the stochastic average gradient. Mathematical Programming, 162(1-2):83-112, 2017.

Smith, S. T. Optimization techniques on Riemannian manifolds. Fields Institute Communications, 3(3):113-135, 1994.

Vandereycken, B. Low-rank matrix completion by Riemannian optimization. SIAM Journal on Optimization, 23(2):1214-1236, 2013.

Xiao, L. Dual averaging methods for regularized stochastic learning and online optimization. Journal of Machine Learning Research, 11(Oct):2543-2596, 2010.

Xiao, L. and Zhang, T. A proximal stochastic gradient method with progressive variance reduction. SIAM Journal on Optimization, 24(4):2057-2075, 2014.

Zhang, H., Reddi, S. J., and Sra, S. Riemannian SVRG: Fast stochastic optimization on Riemannian manifolds. In Advances in Neural Information Processing Systems, pp. 4592-4600, 2016.