# SADAGRAD: Strongly Adaptive Stochastic Gradient Methods

Zaiyi Chen* (1,2), Yi Xu* (2), Enhong Chen (1), Tianbao Yang (2)

Abstract

Although the convergence rates of existing variants of ADAGRAD have a better dependence on the number of iterations under the strong convexity condition, their iteration complexities have an explicit linear dependence on the dimensionality of the problem. To alleviate this bad dependence, we propose a simple yet novel variant of ADAGRAD for stochastic (weakly) strongly convex optimization. Different from existing variants, the proposed variant (referred to as SADAGRAD) uses an adaptive restarting scheme in which (i) ADAGRAD serves as a sub-routine and is restarted periodically, and (ii) the number of iterations before each restart depends on the history of learning, which incorporates knowledge of the geometry of the data. In addition to the adaptive proximal functions and the adaptive number of iterations for restarting, we also develop a variant that is adaptive to the (implicit) strong convexity of the data, which together make the proposed algorithm strongly adaptive. In the worst case, SADAGRAD has an O(1/ϵ) iteration complexity for finding an ϵ-optimal solution, similar to other variants. However, it can enjoy faster convergence and a much better dependence on the problem's dimensionality when stochastic gradients are sparse. Extensive experiments on large-scale data sets demonstrate the efficiency of the proposed algorithms in comparison with several variants of ADAGRAD and the stochastic gradient method.

*Equal contribution. (1) University of Science and Technology of China, China. (2) The University of Iowa, USA. Correspondence to: Zaiyi Chen, Yi Xu, Tianbao Yang. Proceedings of the 35th International Conference on Machine Learning, Stockholm, Sweden, PMLR 80, 2018.

1. Introduction

ADAGRAD is a well-known method for general online and stochastic optimization that adopts a step size adaptive to each feature based on the learning history observed in earlier iterations. It has received tremendous interest for solving big-data learning problems (e.g., see (Dean et al., 2012)). A rigorous regret analysis of ADAGRAD for online convex optimization was provided in the original paper (Duchi et al., 2011), which can be easily translated into a convergence result in expectation for stochastic convex optimization.

In spite of the claimed/observed advantage of ADAGRAD over the stochastic gradient descent (SGD) method for general stochastic convex optimization (SCO), its benefit for stochastic strongly convex optimization (SSCO) over SGD diminishes due to its linear dependence on the problem's dimensionality and its marginal benefit from sparse stochastic gradients. In particular, SGD and its variants enjoy a convergence rate of $O(G^2/(\lambda T))$ for SSCO (Hazan & Kale, 2011; Rakhlin et al., 2012; Shamir & Zhang, 2013; Lacoste-Julien et al., 2012), where $T$ is the number of iterations, $G$ is a variance bound of the stochastic gradient, and $O(\cdot)$ only hides a constant factor independent of the problem's dimensionality. In contrast, the convergence result of ADAGRAD for strongly convex functions (Duchi et al., 2010) implies a convergence rate of $O\big(G^2\sum_{i=1}^{d}\log(\|g_{1:T,i}\|_2^2)/(\lambda T)\big)$, where $g_{1:T,i}$ is the vector of historical stochastic gradients of the $i$-th dimension and $G$ here denotes an upper bound of the stochastic gradient's infinity norm.
It is notable that $\|g_{1:T,i}\|_2$ enters through the logarithmic function, which yields only a marginal adaptive benefit from sparse stochastic gradients and a linear dependence on the dimensionality $d$ even in the presence of sparse stochastic gradients. The same dependence also exists in other variants for SSCO, such as SC-ADAGRAD (Mukkamala & Hein, 2017) and MetaGrad (van Erven & Koolen, 2016), two recent variants of ADAGRAD with rigorous regret analysis. It remains an open problem how to develop a variant of ADAGRAD that enjoys a greater benefit from sparse stochastic gradients and a better dependence on the problem's dimensionality while still enjoying an O(1/T) convergence rate for SSCO. This paper provides an affirmative answer to this problem.

In particular, we consider the following SCO problem:

$$\min_{w\in\Omega} F(w) \qquad (1)$$

where $\Omega\subseteq\mathbb{R}^d$ is a closed convex set and $F(w)$ is a proper lower semi-continuous convex function that satisfies (2). The stochasticity is in the access model (Hazan & Kale, 2011): the only access to $F(w)$ is via a stochastic subgradient oracle, which, given any point $w\in\Omega$, produces a random vector $g(w)$ whose expectation is a subgradient of $F(w)$ at $w$, i.e., $\mathrm{E}[g(w)]\in\partial F(w)$, where $\partial F(w)$ denotes the subdifferential of $F$ at $w$. The above SCO includes an important family of problems where $F(w)=\mathrm{E}_\xi[f(w;\xi)]$ and $f(w;\xi)$ is a proper lower semi-continuous convex function w.r.t. $w$ that depends on a random variable $\xi$. If the function $F(w)$ is strongly convex, i.e., there exists $\lambda>0$ such that for any $u,v\in\Omega$ we have

$$F(u)\ \ge\ F(v)+\partial F(v)^{\top}(u-v)+\frac{\lambda}{2}\|u-v\|_2^2,$$

then (1) becomes an instance of SSCO. In this paper, we consider the weaker strong convexity condition that $F(w)$ satisfies the following inequality with $\lambda>0$ for any $w\in\Omega$:

$$\frac{\lambda}{2}\|w-w_*\|_2^2\ \le\ F(w)-F(w_*), \qquad (2)$$

where $w_*$ denotes the optimal solution of (1) closest to $w$. This condition is also known as the second-order growth condition in the literature and is implied by the strong convexity condition (Necoara et al., 2016). Nevertheless, many interesting problems in machine learning may not satisfy the strong convexity condition but can satisfy the above condition (please refer to Xu et al. (2017) for discussions and examples).

Our major contributions are summarized below:

- We propose a novel variant of ADAGRAD, named SADAGRAD, for solving SSCO and, more generally, SCO satisfying (2), which employs ADAGRAD as a sub-routine and adaptively restarts it periodically.
- We provide a convergence analysis of SADAGRAD for achieving an ϵ-optimal solution, and demonstrate that it can enjoy a greater benefit from sparse stochastic gradients and a better dependence on the problem's dimensionality than ADAGRAD and its variants for SSCO.
- We also develop a proximal variant of SADAGRAD for stochastic composite optimization that reduces the effect of the non-stochastic regularizer on the iteration complexity, and a practical variant that can be run without knowing the strong convexity parameter λ and hence can adapt to the strong convexity of the data.

2. Related Work

The SGD method has received a lot of attention in the areas of machine learning and optimization, and many variants of SGD have been proposed and analyzed (Nemirovski et al., 2009; Rakhlin et al., 2012; Shamir & Zhang, 2013; Lacoste-Julien et al., 2012). It is well known that SGD (with appropriate step sizes and averaging scheme) suffers from an O(1/ϵ²) iteration complexity for solving a general SCO problem and enjoys an improved O(1/ϵ) iteration complexity for SSCO.
When a non-stochastic regularizer is present in SCO, different proximal variants of SGD have been developed (Duchi et al., 2010; Xiao, 2010). In contrast with ADAGRAD, these algorithms use the same step size across all features, which can slow down learning for rare features (those with smaller gradients). Epoch-GD (Hazan & Kale, 2011) is a variant of SGD that runs SGD in a stage-wise manner with an increasing number of iterations and a decreasing step size stage by stage. It was shown to achieve the optimal convergence rate for SSCO, and more generally for SCO satisfying (2). Recently, Xu et al. (2017) developed a new variant of SGD (named ASSG) that leverages a local growth condition of the problem and also runs SGD in multiple stages, halving the step size after each stage. Different from Epoch-GD, the number of iterations of each stage in ASSG is chosen based on the problem's local growth condition. For weakly strongly convex problems, a variant of ASSG (named RASSG) can be run without knowing the parameter λ in (2) and enjoys an iteration complexity similar to that of SGD for SSCO. A variant similar to ASSG also appears in (Chee & Toulis, 2018), but without a convergence analysis. The key differences between the proposed SADAGRAD and these variants of SGD are that (i) ADAGRAD is used as the sub-routine of SADAGRAD, and (ii) the number of iterations of each stage of SADAGRAD is adaptive to the history of learning instead of being a fixed sequence as in (Hazan & Kale, 2011; Xu et al., 2017). As a result, SADAGRAD enjoys more informative updates and convergence.

Due to its adaptive step sizes for different features, ADAGRAD has recently shown great potential for solving deep learning problems (Dean et al., 2012), where there is a large variation in the magnitude of gradients across different layers. Several descendants of ADAGRAD have been developed and found to be effective for deep learning, e.g., ADAM (Kingma & Ba, 2015; Reddi et al., 2018). For general SCO, ADAM enjoys a convergence guarantee similar to that of ADAGRAD, but there is no formal theoretical convergence analysis for SSCO. We will compare with ADAM empirically for solving SCO problems.

The main competitors of SADAGRAD are the variants of ADAGRAD for strongly convex functions (Duchi et al., 2010; Mukkamala & Hein, 2017; van Erven & Koolen, 2016). The advantage of SADAGRAD has been made clear in the Introduction and will be explained further in the next section. In addition, MetaGrad (van Erven & Koolen, 2016) employs multiple copies of an online Newton method to update the solution, which makes it inefficient for high-dimensional data. Nevertheless, MetaGrad can also enjoy O(d log T / T) convergence for problems satisfying a Bernstein condition. It is not clear whether the analyses in (Duchi et al., 2010; Mukkamala & Hein, 2017) can be extended to problems with the weak strong convexity condition (2).

3. Preliminaries and ADAGRAD

In the sequel, we let $g_t=g(w_t)$ denote a stochastic subgradient of $F(w)$ at $w_t$, i.e., $\mathrm{E}[g_t]\in\partial F(w_t)$.
Given a vector $s_t\in\mathbb{R}^d$, $\mathrm{diag}(s_t)$ denotes a diagonal matrix with entries equal to the corresponding elements of $s_t$. Denote by $I$ an identity matrix of appropriate size. Denote by $g_{1:t}=[g_1,\ldots,g_t]$ the cumulative stochastic subgradient matrix of size $d\times t$, and by $g_{1:t,i}$ the $i$-th row of $g_{1:t}$. Let $\mathcal{B}(x_0,D)$ denote the Euclidean ball centered at $x_0$ with radius $D$. Let $\|w\|_H=\sqrt{w^{\top}Hw}$ be a general norm, where $H\succ 0$ is a positive definite matrix, and let $\|w\|_{H^{-1}}=\sqrt{w^{\top}H^{-1}w}$ be its dual norm. Let $\psi(x;x_0)=\frac{1}{2}(x-x_0)^{\top}H(x-x_0)$. It is straightforward to show that $\psi(x;x_0)$ is a 1-strongly convex function of $x$ w.r.t. the norm $\|x\|_H$.

Similar to (Duchi et al., 2010; Mukkamala & Hein, 2017), we assume that the stochastic subgradient $g(w)$ has a bounded infinity norm on $\Omega$, i.e., $\|g(w)\|_\infty\le\gamma$ for all $w\in\Omega$. Given an initial solution $w_0$, we assume that there exists $\epsilon_0>0$ such that $F(w_0)-F(w_*)\le\epsilon_0$. For machine learning problems, $F(w_*)\ge 0$ and hence an upper bound $\epsilon_0$ of $F(w_0)$ satisfies $F(w_0)-F(w_*)\le\epsilon_0$.

Below, we first present a basic variant of ADAGRAD for solving (1). We note that the original paper developed and analyzed two basic variants of ADAGRAD, one based on the mirror descent update and the other on the primal-dual update. To save space, we present only the primal-dual variant here; our development can be extended to the mirror descent update. In addition, the original ADAGRAD paper explicitly handles a simple non-smooth component in the objective function by a proximal mapping. We leave this extension to Section 5. The detailed steps of the primal-dual variant of ADAGRAD with two modifications are presented in Algorithm 1.

Algorithm 1 ADAGRAD($w_0,\eta,\lambda,\epsilon,\epsilon_0$)
1: Input: $\eta>0$ and $w_0\in\mathbb{R}^d$
2: Initialize: $w_1=w_0$, $g_{1:0}=[\,]$, $H_0\in\mathbb{R}^{d\times d}$
3: while $T$ does not satisfy the condition in Proposition 1 do
4:   Compute a stochastic subgradient $g_t$
5:   Update $g_{1:t}=[g_{1:t-1},g_t]$, $s_{t,i}=\|g_{1:t,i}\|_2$
6:   Set $H_t=H_0+\mathrm{diag}(s_t)$ and $\psi_t(w)=\frac{1}{2}(w-w_1)^{\top}H_t(w-w_1)$
7:   Let $w_{t+1}=\arg\min_{w\in\Omega}\ \eta\, w^{\top}\frac{1}{t}\sum_{\tau=1}^{t}g_\tau+\frac{1}{t}\psi_t(w)$
8: end while
9: Output: $\widehat{w}_T=\sum_{t=1}^{T}w_t/T$

The modifications in Algorithm 1 are that (i) a non-zero initial solution $w_0$ is allowed and used for constructing the proximal function $\psi_t$, and (ii) the algorithm is terminated by comparing the number of iterations $T$ to the quantity given in the following proposition. These two modifications are mainly for our development of SADAGRAD. The proposition below establishes the iteration complexity of ADAGRAD for achieving an ϵ-optimal solution.

Proposition 1. Let $\epsilon>0$ be fixed, $H_0=\gamma I$ with $\gamma\ge\max_t\|g_t\|_\infty$, and $\mathrm{E}[F(w_1)-F(w_*)]\le\epsilon_0$. If

$$T\ \ge\ \frac{2}{\epsilon}\max\left\{\frac{\epsilon_0\,(\gamma+\max_i\|g_{1:T,i}\|_2)}{\eta\lambda},\ \eta\sum_{i=1}^{d}\|g_{1:T,i}\|_2\right\},$$

then Algorithm 1 returns a solution $\widehat{w}_T$ such that $\mathrm{E}[F(\widehat{w}_T)-F_*]\le\epsilon$.

Remark: The above result serves as the foundation of our analysis and is different from Eqn. (16) in (Duchi et al., 2011, Theorem 5). Note that $T$ is now a random variable instead of a fixed value. In the proof presented in the supplement, we have to exploit the stopping time of a martingale sequence.

4. Strongly Adaptive Stochastic Gradient Method (SADAGRAD)

In this section, we present SADAGRAD and its convergence analysis. SADAGRAD, shown in Algorithm 2, is built upon ADAGRAD in Algorithm 1. It runs in multiple stages, and in each stage it employs ADAGRAD using the solution returned from the previous stage both as the starting point and as the reference point in the proximal function $\psi^k_t(w)$ during the $k$-th stage.

Algorithm 2 SADAGRAD($w_0,\theta,\lambda,\epsilon,\epsilon_0$)
1: Input: $\theta>0$, $w_0\in\mathbb{R}^d$ and $K=\lceil\log_2(\epsilon_0/\epsilon)\rceil$
2: for $k=1,\ldots,K$ do
3:   Let $\eta_k=\theta\sqrt{\epsilon_k/\lambda}$, where $\epsilon_k=\epsilon_{k-1}/2$
4:   Let $w_k=\mathrm{ADAGRAD}(w_{k-1},\eta_k,\lambda,\epsilon_k,\epsilon_{k-1})$
5: end for
6: Output: $w_K$

In the following presentation, notations with superscript $k$ refer to the corresponding quantities in the $k$-th call of Algorithm 1. For example, $g^k_\tau$ refers to the stochastic subgradient at the $\tau$-th iteration during the $k$-th call of ADAGRAD.
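The restart structure is easiest to see in code. Below is a minimal Python sketch of Algorithms 1 and 2 (not from the paper), assuming $\Omega=\mathbb{R}^d$ so that step 7 has a closed form, a user-supplied stochastic-subgradient oracle `grad_oracle`, and a known bound `gamma` on the gradient infinity norm; a projection onto $\Omega$ would be needed in general.

```python
import numpy as np

def adagrad_stage(grad_oracle, w0, eta, lam, eps, eps0, gamma, max_iter=10**7):
    """One call of ADAGRAD (Algorithm 1, sketched): primal-dual update with a
    diagonal proximal term anchored at w0, stopped by the adaptive test of
    Proposition 1.  Assumes Omega = R^d so that step 7 has a closed form."""
    d = w0.size
    grad_sum = np.zeros(d)        # sum_{tau <= t} g_tau
    sq_sum = np.zeros(d)          # coordinate-wise sum of squared gradients
    w_avg = np.zeros(d)           # running sum of iterates w_1, ..., w_T
    w = w0.copy()                 # w_1 = w_0
    for t in range(1, max_iter + 1):
        g = grad_oracle(w)        # stochastic subgradient at w_t
        w_avg += w                # accumulate w_t before moving to w_{t+1}
        grad_sum += g
        sq_sum += g * g
        s = np.sqrt(sq_sum)       # s_{t,i} = ||g_{1:t,i}||_2
        h = gamma + s             # diagonal of H_t = gamma*I + diag(s_t)
        w = w0 - eta * grad_sum / h   # closed form of step 7 for Omega = R^d
        # Adaptive stopping test of Proposition 1:
        # T >= (2/eps) * max{ eps0*(gamma + max_i s_i)/(eta*lam), eta * sum_i s_i }
        if t >= (2.0 / eps) * max(eps0 * (gamma + s.max()) / (eta * lam),
                                  eta * s.sum()):
            return w_avg / t      # averaged iterate \hat{w}_T
    return w_avg / max_iter       # safeguard if the test is never met

def sadagrad(grad_oracle, w0, theta, lam, eps, eps0, gamma):
    """SADAGRAD (Algorithm 2, sketched): restart the ADAGRAD stage K times,
    halving the target accuracy and shrinking the step size each stage."""
    K = int(np.ceil(np.log2(eps0 / eps)))
    w, eps_k = w0, eps0
    for _ in range(K):
        eps_k /= 2.0                              # eps_k = eps_{k-1} / 2
        eta_k = theta * np.sqrt(eps_k / lam)      # eta_k = theta * sqrt(eps_k / lam)
        w = adagrad_stage(grad_oracle, w, eta_k, lam, eps_k, 2.0 * eps_k, gamma)
    return w
```

The key point is that each stage stops by testing the data-dependent quantity of Proposition 1, so on data with sparse or shrinking gradients a stage can terminate early rather than after a fixed number of iterations.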
It is worth mentioning that SADAGRAD is similar to Epoch-GD (Hazan & Kale, 2011) in terms of its multi-stage scheme. However, there are several key differences, highlighted below. (i) Epoch-GD is developed for a fixed total number of iterations, while SADAGRAD is developed for a fixed target precision ϵ. (ii) The initial step sizes of the two algorithms are different: that of Epoch-GD is $1/\lambda$, while that of SADAGRAD is $\theta\sqrt{\epsilon_0/(2\lambda)}$. For problems with a very small strong convexity parameter, the initial step size of Epoch-GD is very large, which usually leads to unstable behavior at the beginning. (iii) The number of iterations per stage of the two algorithms is different. In Epoch-GD, the number of iterations per stage is geometrically increased by a fixed factor, while in SADAGRAD it is adaptive to the history of learning, depending on the data, as exhibited in the following theorem that states the convergence of SADAGRAD.

Theorem 1. Consider SCO (1) with property (2) and a given $\epsilon>0$. Assume $H_0=\gamma I$ in Algorithm 1 with $\gamma\ge\max_{k,\tau}\|g^k_\tau\|_\infty$, $F(w_0)-F_*\le\epsilon_0$, and let $t_k$ be the minimum number such that

$$t_k\ \ge\ \frac{2}{\sqrt{\lambda\epsilon_k}}\max\left\{\frac{2(\gamma+\max_i\|g^k_{1:t_k,i}\|_2)}{\theta},\ \theta\sum_{i=1}^{d}\|g^k_{1:t_k,i}\|_2\right\},$$

where $\theta>0$ is a step size parameter of the algorithm. With $K=\lceil\log_2(\epsilon_0/\epsilon)\rceil$, we have $\mathrm{E}[F(w_K)-F_*]\le\epsilon$.

Different from the convergence results of SGD and its variants, the above convergence result of SADAGRAD is adaptive to the history of learning, similar to that of ADAGRAD; it therefore deserves more explanation. We first show that, in the worst case of dense stochastic gradients, SADAGRAD enjoys an optimal iteration complexity of $O(1/(\lambda\epsilon))$. To see this, we can bound $\|g^k_{1:t_k,i}\|_2\le\sqrt{t_k}\,\gamma$; by choosing $\theta\asymp 1/\sqrt{d}$ we have

$$\max\left\{\frac{2(\gamma+\max_i\|g^k_{1:t_k,i}\|_2)}{\theta},\ \theta\sum_{i=1}^{d}\|g^k_{1:t_k,i}\|_2\right\}\ \le\ O(\gamma\sqrt{d\,t_k}),$$

and then $t_k=O\big(\frac{d\gamma^2}{\lambda\epsilon_k}\big)$ satisfies the condition in Theorem 1, yielding a total iteration complexity of $O(d\gamma^2/(\lambda\epsilon))$. In comparison, the iteration complexity of Epoch-GD for SCO with property (2) is $O(\frac{G^2}{\lambda\epsilon})$, where $G$ is the Euclidean norm bound of the stochastic gradient, which is $\gamma\sqrt{d}$ under the assumption $\|g(w)\|_\infty\le\gamma$. Therefore, in the worst case of dense stochastic gradients, SADAGRAD enjoys the same iteration complexity as Epoch-GD.

Next, we compare SADAGRAD with ADAGRAD and SC-ADAGRAD for solving SSCO, since they have comparable per-iteration computational costs. Let us recall the convergence results of ADAGRAD and SC-ADAGRAD. For ADAGRAD, the regret bound of (Duchi et al., 2010) for strongly convex functions implies a convergence rate of

$$\frac{2\gamma^2\delta\|w_0-w_*\|_2^2}{\lambda T}+\frac{\gamma^2}{\lambda T}\sum_{i=1}^{d}\log\!\left(\frac{\|g_{1:T,i}\|_2^2}{\delta}+1\right). \qquad (3)$$

For SC-ADAGRAD, the regret bound of (Mukkamala & Hein, 2017) implies a convergence rate of

$$\frac{\delta d D^2}{2\alpha T}+\frac{\alpha}{T}\sum_{i=1}^{d}\log\!\left(\frac{\|g_{1:T,i}\|_2^2}{\delta}+1\right), \qquad (4)$$

where $D$ is an upper bound of $\|w_t-w_*\|_\infty$ and $\alpha\ge\frac{\gamma^2}{2\lambda}$. It is notable that the above two convergence rates are of a similar order.

First, let us consider a scenario in which the norm of the historical stochastic gradient vector $g_{1:t,i}$ grows more slowly than $\sqrt{t}$. Specifically, assume $\|g^k_{1:t_k,i}\|_2=O(\gamma t_k^{\alpha})$ with $\alpha\le 1/2$. Then, with a proper value of $\theta$ (by which we mean one that minimizes the lower bound on $t_k$ in Theorem 1), $t_k=O\big((\lambda\epsilon_k)^{-\frac{1}{2(1-\alpha)}}\big)$ satisfies the condition in Theorem 1. It is not difficult to show that $\sum_{k=1}^{K}t_k=O\big(d\gamma^2(\lambda\epsilon)^{-\frac{1}{2(1-\alpha)}}\big)$. Thus, when $\alpha<1/2$, the iteration complexity of SADAGRAD is $o(d\gamma^2/(\lambda\epsilon))$, growing more slowly than $O(1/\epsilon)$ in terms of $\epsilon$. In contrast, in this scenario the iteration complexity of both ADAGRAD and SC-ADAGRAD remains $\widetilde{O}\big(\frac{d\gamma^2}{\lambda\epsilon}\big)$, which justifies the advantage of SADAGRAD.
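To spell out the omitted calculation, note that with $\epsilon_k=\epsilon_0 2^{-k}$ the stage lengths form a geometric sequence dominated by the last stage. A sketch, writing $t_k\asymp C(\lambda\epsilon_k)^{-\frac{1}{2(1-\alpha)}}$ with a constant $C$ absorbing the $d\gamma^2$-type factor discussed above:

$$\sum_{k=1}^{K}t_k\;\asymp\;C\sum_{k=1}^{K}\big(\lambda\epsilon_0 2^{-k}\big)^{-\frac{1}{2(1-\alpha)}}\;=\;C(\lambda\epsilon_0)^{-\frac{1}{2(1-\alpha)}}\sum_{k=1}^{K}2^{\frac{k}{2(1-\alpha)}}\;\le\;\frac{2^{\frac{1}{2(1-\alpha)}}}{2^{\frac{1}{2(1-\alpha)}}-1}\,C\,(\lambda\epsilon_K)^{-\frac{1}{2(1-\alpha)}}\;=\;O\!\Big((\lambda\epsilon)^{-\frac{1}{2(1-\alpha)}}\Big),$$

since $\epsilon_K\ge\epsilon/2$ by the choice of $K$, so the total complexity is determined by the final stage.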
As an example supporting this scenario, consider the support vector machine, where $F(w)=\frac{1}{n}\sum_i\ell_c(y_i w^{\top}a_i)+\frac{\lambda}{2}\|w\|_2^2$ with $\ell_c(z)=\max(0,c-z)$, $y_i\in\{1,-1\}$, $a_i\in\mathbb{R}^d$, and a margin parameter $c>0$. To consider a stochastic gradient, we ignore the regularizer for the moment (it will be handled by a proximal mapping in the next section, where $g^k_\tau$ is still a stochastic gradient of the loss function) and have $g^k_\tau=\ell_c'\big(y^k_{i_\tau}(w^k_\tau)^{\top}a^k_{i_\tau}\big)\,y^k_{i_\tau}a^k_{i_\tau}$. For a linearly separable data set with margin $c$, i.e., $y_i w_*^{\top}a_i\ge c$ for all $i$, when $w^k_\tau\to w_*$ we have $g^k_\tau\to 0$ and consequently $\|g^k_{1:t_k,i}\|_2$ grows more slowly than $\sqrt{t_k}$ when $t_k$ is large. For non-linearly separable data, we can expect many components of $g^k_{1:t_k,i}$ to be zero as $w^k_\tau\to w_*$ for the easily classified training examples, which can also render $\|g^k_{1:t_k,i}\|_2$ much smaller than $\sqrt{t_k}$.

Second, we use a synthetic example from (Duchi et al., 2011) to demonstrate that the iteration complexity of SADAGRAD can have a better dependence on $d$ than that of ADAGRAD and SC-ADAGRAD in the presence of sparse stochastic gradients. In particular, Duchi et al. (2011) considered sparse random data in which, at each iteration $t$, feature $i$ appears with probability $p_i=\min\{1,c\,i^{-\alpha}\}$ for some $\alpha\ge 2$ and a constant $c$. It was shown that $\mathrm{E}\big[\sum_{i=1}^{d}\|g_{1:t,i}\|_2\big]\le O(\sqrt{t}\log d)$. Thus, SADAGRAD can enjoy an iteration complexity of $O\big(\frac{\gamma^2\log^2 d}{\lambda\epsilon}\big)$ in expectation with a proper value of $\theta$. In contrast, both ADAGRAD and SC-ADAGRAD still have an iteration complexity of $\widetilde{O}\big(\frac{\gamma^2 d}{\lambda\epsilon}\big)$.

Finally, we note that in practice a proper value of $\theta$ for SADAGRAD can be tuned by running a number of iterations of ADAGRAD to balance $\frac{2(\gamma+\max_i\|g^k_{1:t_k,i}\|_2)}{\theta}$ and $\theta\sum_{i=1}^{d}\|g^k_{1:t_k,i}\|_2$.

5. A Proximal Variant of SADAGRAD

In this section, we present a variant of SADAGRAD that handles a non-stochastic regularizer term by a proximal mapping. To be formal, we consider the following SCO:

$$\min_{w\in\Omega}F(w)\triangleq f(w)+\phi(w), \qquad (5)$$

where the stochasticity lies in the access model of $f(w)$ and $\phi(w)$ is a non-stochastic regularizer. In this section, we abuse the notation $g(w)$ to refer to the stochastic gradient of $f(w)$, and assume that $\|\partial f(w)\|_2\le G$ for all $w\in\Omega$.

Algorithm 3 ADAGRAD-PROX($w_0,\eta,\lambda,\epsilon,\epsilon_0$)
1: Input: $\eta>0$ and $w_0\in\mathbb{R}^d$
2: Initialize: $w_1=w_0$, $g_{1:0}=[\,]$, $H_0$
3: while $T$ does not satisfy the condition in Theorem 2 do
4:   Compute a stochastic subgradient $g_t$
5:   Update $g_{1:t}=[g_{1:t-1},g_t]$, $s_{t,i}=\|g_{1:t,i}\|_2$
6:   Set $H_t=H_0+\mathrm{diag}(s_t)$ and $\psi_t(w)=\frac{1}{2}(w-w_1)^{\top}H_t(w-w_1)$
7:   Let $w_{t+1}=\arg\min_{w\in\Omega}\ \eta\, w^{\top}\sum_{\tau=1}^{t}g_\tau/t+\eta\phi(w)+\frac{1}{t}\psi_t(w)$
8: end while
9: Output: $\widetilde{w}_T=\sum_{t=2}^{T+1}w_t/T$

Algorithm 4 SADAGRAD-PROX($w_0,\theta,\lambda,\epsilon,\epsilon_0$)
1: Input: $\theta>0$, $w_0\in\mathbb{R}^d$ and $K=\lceil\log_2(\epsilon_0/\epsilon)\rceil$
2: for $k=1,\ldots,K$ do
3:   Let $\epsilon_k=\epsilon_{k-1}/2$
4:   Let $\eta_k=\theta\sqrt{\epsilon_k/\lambda}$
5:   $w_k=\mathrm{ADAGRAD\text{-}PROX}(w_{k-1},\eta_k,\lambda,\epsilon_k,\epsilon_{k-1})$
6: end for
7: Output: $w_K$

Duchi et al. (2011) handled $\phi(w)$ by a proximal mapping, i.e., by replacing step 7 in Algorithm 1 with

$$\min_{w\in\Omega}\ \eta\, w^{\top}\sum_{\tau=1}^{t}g_\tau/t+\eta\phi(w)+\psi_t(w)/t. \qquad (6)$$

They then derived a convergence result similar to the non-proximal setting (see Theorem 5 of (Duchi et al., 2011)), where $\|g_{1:t,i}\|_2$ only captures the stochastic gradient of $f(w)$.
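For intuition about what (6) buys us, note that when $\Omega=\mathbb{R}^d$, $H_t$ is diagonal, and $\phi(w)=\zeta\|w\|_1$, the composite step decouples across coordinates into a soft-thresholding update. A hypothetical sketch (not the authors' code; the closed form follows from completing the square in each coordinate):

```python
import numpy as np

def adagrad_prox_step(w1, grad_sum, sq_sum, t, eta, gamma, zeta):
    """Closed form of the composite update (6) when Omega = R^d and
    phi(w) = zeta * ||w||_1 (a sketch).  w1 is the stage's reference point,
    grad_sum = sum_{tau<=t} g_tau, and sq_sum is the coordinate-wise sum of
    squared stochastic gradients of f."""
    h = gamma + np.sqrt(sq_sum)           # diagonal of H_t = gamma*I + diag(s_t)
    z = w1 - eta * grad_sum / h           # minimizer without the l1 term
    thresh = eta * zeta * t / h           # per-coordinate threshold from eta*phi(w)
    return np.sign(z) * np.maximum(np.abs(z) - thresh, 0.0)  # soft-thresholding
```

Coordinates whose accumulated gradients are small are set exactly to zero, which is why the proximal variant helps on the ℓ1-regularized problems in Section 7.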
However, the proximal mapping brings a new challenge when it is employed in the proposed SADAGRAD. To highlight the challenge, we first give a convergence result similar to Proposition 1 for ADAGRAD with the proximal mapping (6).

Proposition 2. Let $H_0=\gamma I$ with $\gamma\ge\max_t\|g_t\|_\infty$. For any $T\ge 1$ and $w\in\Omega$, we have

$$\frac{1}{T}\sum_{t=1}^{T}\big(f(w_t)+\phi(w_{t+1})-f(w)-\phi(w)\big)\ \le\ \frac{\eta\sum_{i=1}^{d}\|g_{1:T,i}\|_2}{T}+\frac{\gamma+\max_i\|g_{1:T,i}\|_2}{2\eta T}\|w-w_1\|_2^2+\frac{1}{T}\sum_{t=1}^{T}(\mathrm{E}[g_t]-g_t)^{\top}(w_t-w).$$

Remark: Note that there is a one-step shift between the stochastic component $f(w_t)$ and the non-stochastic component $\phi(w_{t+1})$. To handle this issue, Duchi et al. (2011) used the trick of adding $[\phi(w_1)-\phi(w_{T+1})]/T$ to both sides and assuming that $\phi(w)\ge 0$ and $\phi(w_1)=0$, so that the added term on the right-hand side is non-positive and does not affect the analysis. However, when we utilize the above result in the analysis of SADAGRAD, for epochs $k=2,\ldots,K$ the initial solution $w_{k-1}$ does not give a zero value of $\phi(w_{k-1})$, which causes a difficulty in the analysis. A simple remedy is to assume that the non-stochastic component $\phi(w)$ is uniformly bounded over the domain $\Omega$, i.e., $0\le\phi(w)\le B$. However, such an assumption may impose a strong restriction on the problem; for example, if $\Omega=\mathbb{R}^d$ and $\phi(w)=\|w\|_2^2/2$, it does not satisfy the uniform boundedness assumption. To avoid introducing the uniform boundedness assumption on $\phi(w)$, we propose instead to add $[f(w_{T+1})-f(w_1)]/T$ to both sides, which yields the following inequality for running ADAGRAD with the proximal mapping, i.e., Algorithm 3:

$$F(\widetilde{w}_T)-F(w)\ \le\ \frac{G\|w_1-w_{T+1}\|_2}{T}+\frac{1}{T}\sum_{t=1}^{T}(\mathrm{E}[g_t]-g_t)^{\top}(w_t-w)+\frac{\eta\sum_{i=1}^{d}\|g_{1:T,i}\|_2}{T}+\frac{\gamma+\max_i\|g_{1:T,i}\|_2}{2\eta T}\|w-w_1\|_2^2, \qquad (7)$$

where $\widetilde{w}_T=\sum_{t=2}^{T+1}w_t/T$ and the first term on the right-hand side is due to the $G$-Lipschitz continuity of $f(w)$. We present the modified algorithm in Algorithm 4, where the suffix PROX refers to the proximal mapping. The theorem below establishes the convergence guarantee of Algorithm 4.

Theorem 2. For a given $\epsilon>0$, let $K=\lceil\log_2(\epsilon_0/\epsilon)\rceil$. Assume $H_0=\gamma I$ with $\gamma\ge\max_{k,\tau}\|g^k_\tau\|_\infty$, $F(w_0)-F_*\le\epsilon_0$, and let $t_k$ be the minimum number such that

$$t_k\ \ge\ \frac{3}{\sqrt{\lambda\epsilon_k}}\max\left\{A_k,\ \frac{\sqrt{\lambda}\,G\,\|w^k_1-w^k_{t_k+1}\|_2}{\sqrt{\epsilon_k}}\right\},\quad\text{where}\quad A_k=\max\left\{\frac{2(\gamma+\max_i\|g^k_{1:t_k,i}\|_2)}{\theta},\ \theta\sum_{i=1}^{d}\|g^k_{1:t_k,i}\|_2\right\}.$$

Then Algorithm 4 guarantees that $\mathrm{E}[F(w_K)-F_*]\le\epsilon$.

Remark: Compared with Theorem 1, there is an additional term in the lower bound on $t_k$, which comes from the first term in (7). The convergence of SADAGRAD-PROX is still adaptive to the history of updates, though the analysis of the improvement over ADAGRAD becomes difficult due to the additional term depending on $\|w^k_1-w^k_{t_k+1}\|_2$. Nevertheless, using a worst-case analysis, we can bound $\|w^k_1-w^k_{t_k+1}\|_2\le O(\frac{1}{\lambda})$ for SSCO according to Lemma 4 of (Hazan & Kale, 2011), which implies a worst-case iteration complexity of $O(\frac{1}{\lambda\epsilon})$, similar to that of Epoch-GD. However, in practice $\|w^k_1-w^k_{t_k+1}\|_2$ will decrease as the epoch number $k$ increases, which can give much better performance than Epoch-GD and also faster convergence than SADAGRAD, as observed in our experiments.

6. A Practical Variant of SADAGRAD

In this section, we provide a practical variant of SADAGRAD that usually converges much faster. The issue we aim to address is related to the value of λ. In practice, the strong convexity parameter λ is usually underestimated, yielding slower convergence. For example, for $\ell_2^2$-norm regularized problems, the strong convexity parameter λ is usually set to the regularization parameter in front of the $\ell_2^2$ norm. However, such an approach ignores the curvature of the loss functions defined over the data, which is difficult to estimate. In addition, for some problems that satisfy (2) the value of λ is difficult to estimate; for example, $\ell_1$-regularized square loss minimization satisfies (2), but the value of λ is difficult to compute (Necoara et al., 2016, Theorem 10). To address this issue, we develop a restarting variant of SADAGRAD that starts with a relatively large value of λ, inspired by the technique in (Xu et al., 2017). It is presented in Algorithm 5, and its formal convergence guarantee is given in the following theorem. A similar extension can be made to SADAGRAD-PROX and is omitted.

Algorithm 5 r-SADAGRAD($w_0,\theta,\lambda_1,\epsilon,\epsilon_0,\tau$)
1: Input: $\theta>0$, $w^{(0)}\in\mathbb{R}^d$, $\lambda_1\ge\lambda$, $\tau\in(0,1]$
2: for $s=1,\ldots,S$ do
3:   $w^{(s)}=\mathrm{SADAGRAD}(w^{(s-1)},\theta,\lambda_s,\epsilon,\epsilon_{s-1})$
4:   $\lambda_{s+1}=\lambda_s/2$, $\epsilon_s=\tau\epsilon_{s-1}$
5: end for
6: Output: $w^{(S)}$
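A minimal sketch of this wrapper, reusing the `sadagrad` sketch given after Algorithm 2 (the number of calls is left as a parameter since λ is unknown in practice; Theorem 3 below shows that $S=\lceil\log_2(\lambda_1/\lambda)\rceil+1$ calls suffice):

```python
def r_sadagrad(grad_oracle, w0, theta, lam1, eps, eps0, gamma, tau=1.0, num_calls=8):
    """r-SADAGRAD (Algorithm 5, sketched): call SADAGRAD with a deliberately
    large strong-convexity estimate lam1 >= lambda, then halve the estimate
    (and scale the initial gap bound by tau) after each call."""
    w, lam_s, eps_s = w0, lam1, eps0
    for _ in range(num_calls):
        w = sadagrad(grad_oracle, w, theta, lam_s, eps, eps_s, gamma)
        lam_s, eps_s = lam_s / 2.0, tau * eps_s
    return w
```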
Theorem 3. Under the same assumptions as Theorem 1 and with $F(w_0)-F_*\le\epsilon_0$, where $w_0$ is the initial solution, let $\epsilon\le\frac{\epsilon_0}{2}$, $\tau=1$, $K=\lceil\log_2\frac{\epsilon_0}{\epsilon}\rceil$, and let $t^{(s)}_k$ satisfy

$$t^{(s)}_k\ \ge\ \frac{2}{\sqrt{\lambda_s\epsilon_k}}\max\left\{\frac{2(\gamma+\max_i\|g^k_{1:t_k,i}\|_2)}{\theta},\ \theta\sum_{i=1}^{d}\|g^k_{1:t_k,i}\|_2\right\}.$$

Then with at most $S=\lceil\log_2\frac{\lambda_1}{\lambda}\rceil+1$ calls of SADAGRAD and a worst-case iteration complexity of $O(1/(\lambda\epsilon))$, Algorithm 5 finds a solution $w^{(S)}$ such that $\mathrm{E}[F(w^{(S)})-F_*]\le\epsilon$.

Remark: $t^{(s)}_k$ is the number of iterations of the $k$-th stage in the $s$-th call of SADAGRAD. Note that $\lambda_s\ge\lambda$ for all $s\le S-1$, so the iteration complexity can be understood similarly to that in Theorem 1. However, since r-SADAGRAD starts with a relatively large value $\lambda_1$, we expect the number of iterations in the first several calls of SADAGRAD to be much smaller than that of Algorithm 2, especially when the underlying strong convexity parameter λ is small. In practice, the parameter τ can also be tuned for better performance.

7. Experiments

In this section, we present experiments showing the effectiveness of the proposed algorithms, SADAGRAD and SADAGRAD-PROX. In all experiments we implement their practical variants, referred to as r-SADAGRAD and r-SADAGRAD-PROX, respectively. We consider binary classification problems with two different formulations. The first formulation combines the smoothed hinge loss with an $\ell_1$-norm regularizer, i.e.,

$$\min_{w\in\mathbb{R}^d}\ \frac{1}{n}\sum_{i=1}^{n}f_i(w)+\zeta\|w\|_1, \qquad (8)$$

where

$$f_i(w)=\begin{cases}\frac{1}{2}-y_i w^{\top}x_i, & \text{if } y_i w^{\top}x_i\le 0,\\[2pt] \frac{1}{2}(1-y_i w^{\top}x_i)^2, & \text{if } 0<y_i w^{\top}x_i\le 1,\\[2pt] 0, & \text{otherwise},\end{cases}$$

$(x_i,y_i)$, $i=1,\ldots,n$, is the set of training examples, and $\zeta$ is the regularization parameter. The second formulation is the classical support vector machine (SVM) problem:

$$\min_{w\in\mathbb{R}^d}\ \frac{1}{n}\sum_{i=1}^{n}\max\{0,\,c-y_i w^{\top}x_i\}+\lambda\|w\|_2^2, \qquad (9)$$

where $c$ is the margin parameter and $\lambda$ is the regularization parameter. It is notable that (8) is a piecewise convex quadratic problem, which satisfies the weak strong convexity condition (2) on a compact set according to (Li, 2013); however, the value of its strong convexity modulus λ is unknown. Problem (9) is strongly convex and satisfies (2) with λ being the regularization parameter.

The experiments are performed on four data sets from the LIBSVM website (Chang & Lin, 2011) with different scales of instances and features, namely covtype, epsilon, rcv1, and news20. The statistics of these data sets are shown in Table 1.

Table 1. Statistics of real data sets

data     | # of instances | # of features | feature density
covtype  | 581,012        | 54            | 22.12%
epsilon  | 400,000        | 2,000         | 100%
rcv1     | 697,641        | 47,236        | 0.15%
news20   | 19,996         | 1,355,191     | 0.0336%
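To be concrete about the stochastic oracle used for formulation (8), a sampled subgradient can be computed as in the following sketch (a hypothetical helper, not the authors' code); for r-SADAGRAD-PROX the ℓ1 term would be excluded here and handled by the proximal mapping instead.

```python
import numpy as np

def smoothed_hinge_subgrad(w, x, y, zeta, include_l1=True):
    """Stochastic subgradient of (8) at a sampled example (x, y): the
    smoothed hinge loss f_i plus, optionally, a subgradient of zeta*||w||_1."""
    margin = y * np.dot(w, x)
    if margin <= 0.0:
        g = -y * x                      # gradient of 1/2 - y w'x
    elif margin <= 1.0:
        g = -(1.0 - margin) * y * x     # gradient of (1 - y w'x)^2 / 2
    else:
        g = np.zeros_like(w)            # flat region of the smoothed hinge
    if include_l1:
        g = g + zeta * np.sign(w)       # a subgradient of the l1 regularizer
    return g
```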
For solving (8), we compare SADAGRAD and SADAGRAD-PROX with ADAGRAD (Duchi et al., 2011), ADAM (Kingma & Ba, 2015), and RASSG (Xu et al., 2017). Since the strong convexity modulus λ of (8) is unknown, we do not compare with Epoch-GD and SC-ADAGRAD (Mukkamala & Hein, 2017) for solving (8), as they require knowing the value of λ. For solving (9), we compare SADAGRAD with ADAGRAD, ADAM, SC-ADAGRAD, RASSG, and Epoch-GD. Please note that RASSG can be considered an adaptive version of SGD that adapts to the problem's implicit strong convexity from the data; it is also observed to perform better than SGD in (Xu et al., 2017) and in our experiments, hence SGD is omitted from our comparison.

The parameters of Epoch-GD are chosen as recommended in the cited paper, except for the initial iteration number, which is tuned to achieve faster convergence. The parameters of RASSG are tuned following the guidance in (Xu et al., 2017). The step size of ADAM is tuned in $10^{[-2:2]}$, and its other parameters are chosen as recommended in the paper. For SC-ADAGRAD, the parameters α and ξ₁ in their paper are tuned in $10^{[-4:2]}$ and $[0.1,1]$, respectively. Based on the analysis in the previous sections, the step size parameter θ influences the convergence speed of both ADAGRAD and SADAGRAD, so we tuned this parameter for both algorithms on each data set: we run ADAGRAD for a number of iterations (i.e., 5,000) on each data set and set

$$\theta=\sqrt{\frac{2\big(\gamma+\max_i\|g_{1:5000,i}\|_2\big)}{\sum_{i=1}^{d}\|g_{1:5000,i}\|_2}}.$$

Besides, we set $\lambda_1=100\lambda$ for solving (9), $\lambda_1=100\zeta$ for solving (8), and $\tau=1$ for r-SADAGRAD and r-SADAGRAD-PROX.
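A sketch of this warm-up heuristic (hypothetical code; we assume a plain ADAGRAD run with unit step size, used only to accumulate the per-coordinate gradient norms):

```python
import numpy as np

def estimate_theta(grad_oracle, w0, gamma, warmup=5000):
    """Run a short ADAGRAD warm-up, track ||g_{1:T,i}||_2 per coordinate, and
    return the theta that balances the two terms inside the max of Theorem 1
    (the rule used in the experiments)."""
    grad_sum = np.zeros_like(w0)
    sq_sum = np.zeros_like(w0)
    w = w0.copy()
    for _ in range(warmup):
        g = grad_oracle(w)
        grad_sum += g
        sq_sum += g * g
        w = w0 - grad_sum / (gamma + np.sqrt(sq_sum))   # ADAGRAD step, eta = 1
    col_norms = np.sqrt(sq_sum)          # ||g_{1:warmup,i}||_2 for each i
    return np.sqrt(2.0 * (gamma + col_norms.max()) / col_norms.sum())
```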
We first present and discuss the results for solving the ℓ1-regularized smoothed hinge loss minimization problem (8) with varying regularization parameter ζ. The results are shown in Figure 1, where the y-axis is the log-scale gap between the objective value of the obtained solution and that of the optimal solution.

[Figure 1. Results for smoothed hinge loss + ℓ1 norm with varying ζ (panels: covtype, epsilon, rcv1, news20 for ζ = 0.1/n, 1/n, 10/n; axes: iteration number vs. log10(objective gap)).]

We have several observations from the results: (i) r-SADAGRAD-PROX performs consistently better than r-SADAGRAD on high-dimensional data, especially on the extremely high-dimensional data news20. This is due to the presence of the ℓ1-norm regularization and the proximal mapping of r-SADAGRAD-PROX, which reduces the effect of the regularizer. (ii) In most cases r-SADAGRAD-PROX has the best performance, except on epsilon with the two settings ζ = 0.1/n and ζ = 1/n, on which ADAM is better.

Next, we present the results for solving the SVM problem with varying λ and varying c. By varying c, we can control the growth of the stochastic gradient vector. The results of varying λ with fixed c = 1 on the four data sets are reported in Figure 2, and the results of varying c with fixed λ = 1/n on the two data sets epsilon and rcv1 are reported in Figure 3. Here, we only report the results of r-SADAGRAD-PROX.

[Figure 2. Results with varying λ for solving SVM (panels: covtype, epsilon, rcv1, news20 for λ = 0.01/n, 0.1/n, 1/n; axes: iteration number vs. log10(objective gap)).]

[Figure 3. Results with varying c for solving SVM (panels: epsilon and rcv1 for c = 1, 0.1, 0.01; axes: iteration number vs. log10(objective gap)).]

From the results, we can see that SADAGRAD is considerably faster than the other baselines in most cases. In addition, we have several interesting observations that are consistent with our analysis and theory: (i) for a smaller strong convexity parameter (corresponding to smaller values of λ), Epoch-GD performs poorly in the early iterations due to its very large step size; in contrast, SADAGRAD does not suffer from this issue because of its unique step size scheme (cf. the discussion above Theorem 1); (ii) ADAGRAD and SC-ADAGRAD perform poorly on the extremely high-dimensional data news20, especially when λ is small, which is consistent with our prediction of their bad dependence on the dimensionality, whereas SADAGRAD is more robust to high dimensionality and also enjoys adaptiveness to the history of learning, making it better than Epoch-GD; (iii) for a smaller margin parameter c, the improvement of SADAGRAD over ADAGRAD and SC-ADAGRAD becomes larger, which is consistent with our analysis (cf. the discussion below Theorem 1). Overall, the proposed SADAGRAD algorithms achieve very promising results compared with existing adaptive and non-adaptive stochastic algorithms.
8. Conclusion

In this paper, we have proposed a simple yet novel variant of ADAGRAD, namely SADAGRAD, for solving stochastic strongly convex optimization and, more generally, stochastic convex optimization satisfying the second-order growth condition. We analyzed the iteration complexity of the proposed variant and demonstrated that it not only achieves the optimal iteration complexity but can also enjoy faster convergence and a better dependence on the problem's dimensionality when the stochastic gradients are sparse. We also developed a proximal variant to reduce the effect of the non-stochastic regularizer. Experiments on large-scale real data sets demonstrate the effectiveness of the proposed SADAGRAD for solving the ℓ2-norm regularized hinge loss minimization problem and the ℓ1-regularized smoothed hinge loss minimization problem.

Acknowledgement

We thank the anonymous reviewers for their helpful comments. Most of the work of Z. Chen was done when he was visiting T. Yang's group at the University of Iowa. Y. Xu and T. Yang are partially supported by the National Science Foundation (IIS-1545995). Z. Chen and E. Chen are partially supported by the National Natural Science Foundation of China (Grants No. U1605251 and 61727809).

References

Chang, C.-C. and Lin, C.-J. LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology, 2:27:1-27:27, 2011.

Chee, J. and Toulis, P. Convergence diagnostics for stochastic gradient descent with constant learning rate. In International Conference on Artificial Intelligence and Statistics, pp. 1476-1485, 2018.

Dean, J., Corrado, G., Monga, R., Chen, K., Devin, M., Mao, M., Senior, A., Tucker, P., Yang, K., Le, Q. V., et al. Large scale distributed deep networks. In Advances in Neural Information Processing Systems, pp. 1223-1231, 2012.

Duchi, J., Hazan, E., and Singer, Y. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12(Jul):2121-2159, 2011.

Duchi, J. C., Hazan, E., and Singer, Y. Adaptive subgradient methods for online learning and stochastic optimization. In The 23rd Conference on Learning Theory (COLT), pp. 257-269, 2010.

Hazan, E. and Kale, S. Beyond the regret minimization barrier: an optimal algorithm for stochastic strongly-convex optimization. In Proceedings of the 24th Annual Conference on Learning Theory (COLT), pp. 421-436, 2011.

Kingma, D. and Ba, J. Adam: A method for stochastic optimization. In International Conference on Learning Representations, 2015.

Lacoste-Julien, S., Schmidt, M., and Bach, F. A simpler approach to obtaining an O(1/t) convergence rate for the projected stochastic subgradient method.
arXiv preprint arXiv:1212.2002, 2012.

Li, G. Global error bounds for piecewise convex polynomials. Mathematical Programming, 137(1-2):37-64, 2013.

Mukkamala, M. C. and Hein, M. Variants of RMSProp and Adagrad with logarithmic regret bounds. In International Conference on Machine Learning, pp. 2545-2553, 2017.

Necoara, I., Nesterov, Y., and Glineur, F. Linear convergence of first order methods for non-strongly convex optimization. Mathematical Programming, pp. 1-39, 2016.

Nemirovski, A., Juditsky, A., Lan, G., and Shapiro, A. Robust stochastic approximation approach to stochastic programming. SIAM Journal on Optimization, 19(4):1574-1609, 2009.

Rakhlin, A., Shamir, O., and Sridharan, K. Making gradient descent optimal for strongly convex stochastic optimization. In International Conference on Machine Learning, pp. 1571-1578, 2012.

Reddi, S. J., Kale, S., and Kumar, S. On the convergence of Adam and beyond. In International Conference on Learning Representations, 2018.

Shamir, O. and Zhang, T. Stochastic gradient descent for non-smooth optimization: Convergence results and optimal averaging schemes. In International Conference on Machine Learning, pp. 71-79, 2013.

van Erven, T. and Koolen, W. M. MetaGrad: Multiple learning rates in online learning. In Advances in Neural Information Processing Systems, pp. 3666-3674, 2016.

Xiao, L. Dual averaging methods for regularized stochastic learning and online optimization. Journal of Machine Learning Research, 11(Oct):2543-2596, 2010.

Xu, Y., Lin, Q., and Yang, T. Stochastic convex optimization: Faster local growth implies faster global convergence. In International Conference on Machine Learning, pp. 3821-3830, 2017.