# Conditional Accelerated Lazy Stochastic Gradient Descent

Guanghui Lan *1, Sebastian Pokutta *1, Yi Zhou *1, Daniel Zink *1

*Equal contribution. 1 ISyE, Georgia Institute of Technology, Atlanta, GA. Correspondence to: Daniel Zink. Proceedings of the 34th International Conference on Machine Learning, Sydney, Australia, PMLR 70, 2017. Copyright 2017 by the author(s).

In this work we introduce a conditional accelerated lazy stochastic gradient descent algorithm with an optimal number of calls to a stochastic first-order oracle and convergence rate $O(\tfrac{1}{\varepsilon^2})$, improving over the projection-free, Online Frank-Wolfe based stochastic gradient descent of (Hazan and Kale, 2012) with convergence rate $O(\tfrac{1}{\varepsilon^4})$.

## 1. Introduction

The conditional gradient method (also known as the Frank-Wolfe algorithm), proposed in (Frank and Wolfe, 1956), gained much popularity in recent years due to its simple projection-free scheme and fast practical convergence rates. We consider the basic convex programming (CP) problem

$$\min_{x \in X} f(x), \qquad (1)$$

where $X \subseteq \mathbb{R}^n$ is a closed convex set and $f: X \to \mathbb{R}$ is a smooth convex function such that for some $L > 0$,

$$\|f'(x) - f'(y)\| \le L \|x - y\|, \quad \forall x, y \in X. \qquad (2)$$

The classic conditional gradient (CG) method solves (1) iteratively by minimizing a series of linear approximations of $f$ over the feasible set $X$. More specifically, given $x_{k-1} \in X$ at the $k$-th iteration, it updates $x_k$ according to the following steps:

1) Call the first-order (FO) oracle to compute $(f(x_{k-1}), f'(x_{k-1}))$ and set $p_k = f'(x_{k-1})$.
2) Call the linear optimization (LO) oracle to compute
$$y_k \in \operatorname{argmin}_{x \in X} \langle p_k, x \rangle. \qquad (3)$$
3) Set $x_k = (1 - \lambda_k) x_{k-1} + \lambda_k y_k$ for some $\lambda_k \in [0, 1]$.

Compared to most other first-order methods, such as, e.g., gradient descent algorithms and accelerated gradient algorithms (Nesterov, 1983; 2004), the CG method is computationally cheaper in some cases, since it only requires the solution of a linear optimization subproblem (3) rather than an often costly projection onto the feasible region $X$. There has been extensive and fruitful research on the general class of linear-optimization-based convex programming (LCP) methods (which covers the CG method and its variants) and their applications in machine learning (e.g., (Ahipasaoglu and Todd, 2013; Bach et al., 2012; Beck and Teboulle, 2004; Cox et al., 2013; Clarkson, 2010; Freund and Grigas, 2013; Hazan, 2008; Harchaoui et al., 2012; Jaggi, 2011; 2013; Jaggi and Sulovský, 2010; Luss and Teboulle, 2013; Shen et al., 2012; Hazan and Kale, 2012; Lan, 2013; Lan and Zhou, 2014; Braun et al., 2016)). It should be noted that even the computational cost for the LO oracle to solve the linear optimization subproblem (3) is high for some complex feasible regions. Recently, several approaches have been considered to address this issue. Jaggi demonstrated practical speed-ups for the CG method by approximately solving (3) in (Jaggi, 2013). Braun, Pokutta, and Zink in (Braun et al., 2016) proposed a class of modified CG methods, namely the lazy conditional gradient (LCG) algorithms, which call a weak separation oracle rather than solving the linear subproblem (3) as in the classical CG method. In fact, the weak separation oracle is computationally more efficient than the approximate minimization used in (Jaggi, 2013), at the expense of not providing any guarantee for function value improvement with respect to (3).
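To make the three-step CG update above concrete, here is a minimal Python sketch of the classic Frank-Wolfe loop (our own illustration, not code from the paper); the quadratic objective, the box feasible region, and the open-loop step size $\lambda_k = 2/(k+2)$ are illustrative assumptions.

```python
import numpy as np

def frank_wolfe(grad, lo_oracle, x0, num_iters=100):
    """Minimal sketch of the classic CG (Frank-Wolfe) method.

    grad(x)      -> gradient f'(x) at x    (FO oracle)
    lo_oracle(p) -> argmin_{x in X} <p, x> (LO oracle)
    """
    x = x0.copy()
    for k in range(1, num_iters + 1):
        p = grad(x)                    # step 1: FO oracle call
        y = lo_oracle(p)               # step 2: LO oracle call, cf. (3)
        lam = 2.0 / (k + 2)            # step 3: open-loop step size (illustrative choice)
        x = (1 - lam) * x + lam * y    # convex combination stays inside X
    return x

# Illustrative instance: f(x) = 0.5*||x - c||^2 over the box X = [0, 1]^n.
n = 5
c = np.linspace(-1.0, 2.0, n)
grad = lambda x: x - c
lo_oracle = lambda p: (p < 0).astype(float)   # vertex of [0,1]^n minimizing <p, x>
x_approx = frank_wolfe(grad, lo_oracle, x0=np.zeros(n))
```

The only structural requirement on the feasible region is a working LO oracle for (3); everything else is plain vector arithmetic.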
Furthermore, as shown in (Jaggi, 2013; Lan, 2013), the total number of iterations for the LCP methods to find an $\varepsilon$-solution of (1) (i.e., a point $\bar{x} \in X$ s.t. $f(\bar{x}) - f^* \le \varepsilon$) cannot be smaller than $O(1/\varepsilon)$, which is not improvable even when the objective function $f$ is strongly convex. Improved complexity results can only be obtained under stronger assumptions on the LO oracle or the feasible set (see, e.g., (Garber and Hazan, 2013; Lan, 2013)). However, the $O(1/\varepsilon)$ bound does not preclude the existence of more efficient LCP algorithms for solving (1). Lan and Zhou in (Lan and Zhou, 2014) proposed a class of conditional gradient sliding methods (CGS), which significantly improve the complexity bounds in terms of the number of gradient evaluations while maintaining optimal complexity bounds for the LO oracle calls required by the LCP methods.

Inspired by (Braun et al., 2016) and (Lan and Zhou, 2014), in this paper we focus on a class of modified LCP methods that require only improving solutions for a certain separation problem, rather than solving the linear optimization subproblem (3) explicitly through LO oracle calls, while simultaneously minimizing the number of gradient evaluations when performing weak separation over the feasible set $X$. At first these two objectives seem to be incompatible, as (Braun et al., 2016) give up the dual guarantee to simplify the oracle, while the dual guarantee of CG iterations is at the core of the analysis in (Lan and Zhou, 2014). We overcome this impasse by carefully modifying both techniques.

It should be mentioned that Hazan and Kale in (Hazan and Kale, 2012) proposed the online Frank-Wolfe (OFW) algorithm, which obtains an $O(1/\varepsilon^4)$ rate of convergence for stochastic problems. Indeed, if we consider the objective function $f(x) := \mathbb{E}[F(x, \xi)]$ for stochastic optimization, the OFW method can be applied to solve (1) by viewing the iteratively observed function $f_t$ as the current realization of the true objective function $f$, i.e., $f_t(\cdot) = F(\cdot, \xi_t)$. Without re-evaluating the (sub)gradients at the updated points, OFW obtains an $O(T^{-1/4})$ bound for any (smooth or nonsmooth) objective function (see Theorem 4.4 in (Hazan and Kale, 2012)), which implies an $O(1/\varepsilon^4)$ rate of convergence in terms of the number of (sub)gradient evaluations for stochastic optimization. However, we can show that our proposed algorithm obtains an $O(1/\varepsilon)$ (resp., $O(1/\varepsilon^2)$) rate of convergence for smooth (resp., non-smooth) stochastic problems, which is much better than the convergence rate of the OFW method. We would like to stress that the stochastic optimization bound in (Hazan and Kale, 2012, Theorem 4.1), which gives a guarantee of $O(1/\varepsilon^2)$, requires re-evaluating all gradients at the current iterate, and as such the number of gradient evaluations required grows quadratically in $t$.

Moreover, Hazan and Luo (2016) proposed two methods for solving the special case of Problem (1) of the finite-sum form

$$\min_{x \in X} f(x) = \min_{x \in X} \tfrac{1}{m} \textstyle\sum_{i=1}^m f_i(x),$$

which allows for a potentially smaller number of SFO evaluations than $O(1/\varepsilon^2)$, the lower bound for the general problem. The two methods, Stochastic Variance-Reduced Frank-Wolfe (SVRF) and Stochastic Variance-Reduced Conditional Gradient Sliding (STORC), are obtained by applying the variance reduction idea of Johnson and Zhang (2013) and Mahdavi et al. (2013) to the CG method and the Stochastic CGS method, respectively.
Both algorithms, however, need a certain number of exact (or full) gradient evaluations, leading to a potentially undesirable dependence on the number of examples $m$.

**Contributions.** Our main contributions can be briefly summarized as follows. We consider stochastic smooth optimization, where we only have access to unbiased estimators of the gradients of $f$ via a stochastic first-order (SFO) oracle. By incorporating a modified LCG procedure (Braun et al., 2016) into a modified CGS method (Lan and Zhou, 2014), we obtain a new conditional accelerated lazy stochastic gradient descent algorithm (CALSGD), and we show that the number of calls to the weak separation oracle can be optimally bounded by $O(1/\varepsilon)$, while the optimal bound of $O(1/\varepsilon^2)$ on the total number of calls to the SFO oracle can be maintained. In addition, if the exact gradients of $f$ can be accessed by a FO oracle, the latter bound can be significantly improved to $O(1/\sqrt{\varepsilon})$.

In order to achieve the above we will present a modified lazy conditional gradient method and show that the total number of iterations (or calls to the weak separation oracle) performed by it can be bounded by $O(1/\varepsilon)$ under a stronger termination criterion, i.e., the primal-dual gap function.

We also consider strongly convex and smooth functions and show that, without enforcing any stronger assumptions on the weak separation oracle or the feasible set $X$, the total number of calls to the FO (resp., SFO) oracle can be optimally bounded by $O(\log 1/\varepsilon)$ (resp., $O(1/\varepsilon)$) for variants of the proposed method to solve deterministic (resp., stochastic) strongly convex and smooth problems.

Furthermore, we also generalize the proposed algorithms to solve an important class of non-smooth convex programming problems with a saddle point structure. By adaptively approximating the original non-smooth problem via a class of smooth functions, we are able to show that the deterministic version of CALSGD can obtain an $\varepsilon$-solution within $O(1/\varepsilon)$ linear operator evaluations and $O(1/\varepsilon^2)$ calls to the weak separation oracle, respectively. The former bound increases to $O(1/\varepsilon^2)$ for non-smooth stochastic optimization.

Finally, we demonstrate practical speed-ups of CALSGD through preliminary numerical experiments for the video co-localization problem, the structured regression problem, and quadratic optimization over the standard spectrahedron; an extensive study is beyond the scope of this paper and left for future work. In all cases we report substantial improvements in performance. In the main body of the paper we focus on the stochastic smooth case; several other results and their proofs have been relegated to the Supplementary Material.

### 1.1. Notation and Terminology

Let $X \subseteq \mathbb{R}^n$ be a convex compact set, and let $\|\cdot\|_X$ be the norm associated with the inner product in $\mathbb{R}^n$. For the sake of simplicity, we often skip the subscript in the norm $\|\cdot\|_X$. We define the diameter of the set $X$ as

$$D_X \equiv D_{X, \|\cdot\|} := \max_{x, y \in X} \|x - y\|. \qquad (4)$$

For a given norm $\|\cdot\|$, we denote its conjugate by $\|s\|_* = \max_{\|x\| \le 1} \langle s, x \rangle$. For a linear operator $A: \mathbb{R}^n \to \mathbb{R}^m$, we use $\|A\|$ to denote its operator norm, defined as $\|A\| := \max_{\|x\| \le 1} \|Ax\|$. Let $f: X \to \mathbb{R}$ be a convex function; we denote its linear approximation at $x$ by

$$\ell_f(x; y) := f(x) + \langle f'(x), y - x \rangle. \qquad (5)$$

Clearly, if $f$ satisfies (2), then

$$f(y) \le \ell_f(x; y) + \tfrac{L}{2} \|y - x\|^2, \quad \forall x, y \in X. \qquad (6)$$

Notice that the constant $L$ in (2) and (6) depends on $\|\cdot\|$. Moreover, we say $f$ is smooth with curvature at most $C$ if

$$f(y) \le \ell_f(x; y) + \tfrac{C}{2}, \quad \forall x, y \in X. \qquad (7)$$

It is clear that if $X$ is bounded, we have $C \le L D_X^2$. In the following we also use $\mathbb{R}_{++}$ to denote the set of strictly positive reals.
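As a quick numerical illustration of (5) and (6) (not part of the paper), the sketch below checks the smoothness inequality for a least-squares objective, for which $L = \|A^\top A\|$ is a valid Lipschitz constant; all names and the test data are ours.

```python
import numpy as np

rng = np.random.default_rng(0)
A, b = rng.standard_normal((20, 8)), rng.standard_normal(20)

f = lambda x: 0.5 * np.sum((A @ x - b) ** 2)
grad = lambda x: A.T @ (A @ x - b)
L = np.linalg.norm(A.T @ A, 2)          # Lipschitz constant of f' for this objective

def l_f(x, y):
    """Linear approximation l_f(x; y) = f(x) + <f'(x), y - x>, cf. (5)."""
    return f(x) + grad(x) @ (y - x)

x, y = rng.standard_normal(8), rng.standard_normal(8)
# Smoothness bound (6): f(y) <= l_f(x; y) + (L/2) * ||y - x||^2.
assert f(y) <= l_f(x, y) + 0.5 * L * np.linalg.norm(y - x) ** 2 + 1e-9
```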
## 2. Conditional Accelerated Lazy Stochastic Gradient Descent

We now present a new method for stochastic gradient descent that is based on the stochastic conditional gradient sliding (SCGS) method and the parameter-free lazy conditional gradient (LCG) procedure from Section 2.2, which we refer to as the Conditional Accelerated Lazy Stochastic Gradient Descent (CALSGD) method. We consider the stochastic optimization problem

$$\min_{x \in X} \{ f(x) = \mathbb{E}[F(x, \xi)] \}, \qquad (8)$$

where $f(x)$ is a smooth convex function satisfying (2).

### 2.1. The Algorithm

Throughout this section, we assume that there exists a stochastic first-order (SFO) oracle, which for a search point $z_k \in X$ outputs a stochastic gradient $F'(z_k, \xi_k)$ s.t.

$$\mathbb{E}[F'(z_k, \xi_k)] = f'(z_k), \qquad (9)$$

$$\mathbb{E}[\|F'(z_k, \xi_k) - f'(z_k)\|^2] \le \sigma^2. \qquad (10)$$

If $\sigma = 0$, the stochastic gradient $F'(z_k, \xi_k)$ is the exact gradient at the point $z_k$, i.e., $F'(z_k, \xi_k) = f'(z_k)$.

Our algorithmic framework is inspired by the SCGS method of (Lan and Zhou, 2014). However, instead of applying the classic CG method to solve the projection subproblem appearing in the accelerated gradient (AG) method, the CALSGD method utilizes a modified parameter-free LCG algorithm (see Section 2.2) to approximately solve the subproblem $\phi_k(x)$ defined in (16), and it skips the computation of the stochastic gradient $F'(z, \xi)$ from time to time when performing weak separation over the feasible region $X$. The main advantages of our method are that it does not solve a traditional projection problem and that it achieves the optimal bounds on the number of calls to the SFO and $\mathrm{LOsep}_X$ oracles (see Oracle 1 in Subsection 2.2) for solving problems (1) and (8). To the authors' best knowledge, no such algorithm has been developed before in the literature; we present the algorithm below in Algorithm 1.

**Algorithm 1** Conditional Accelerated Lazy Stochastic Gradient Descent (CALSGD)

Input: initial point $x_0 \in X$, iteration limit $N$, and weak separation oracle accuracy $\nu \ge 1$. Let $\beta_k \in \mathbb{R}_{++}$, $\gamma_k \in [0, 1]$, and $\eta_k \in \mathbb{R}_+$, $k = 1, 2, \ldots$, be given and set $y_0 = x_0$.

for $k = 1, 2, \ldots, N$ do
$$z_k = (1 - \gamma_k) y_{k-1} + \gamma_k x_{k-1}, \qquad (11)$$
$$g_k = \tfrac{1}{B_k} \textstyle\sum_{j=1}^{B_k} F'(z_k, \xi_{k,j}), \qquad (12)$$
$$x_k = \mathrm{LCG}(g_k, \beta_k, x_{k-1}, \nu, \eta_k), \qquad (13)$$
$$y_k = (1 - \gamma_k) y_{k-1} + \gamma_k x_k, \qquad (14)$$
where $F'(z_k, \xi_{k,j})$, $j = 1, \ldots, B_k$, are stochastic gradients computed by the SFO at $z_k$.
end for
Output: $y_N$.

We hasten to make some observations about the CALSGD method. Firstly, we apply mini-batches to estimate the gradient at the point $z_k$, where the parameter $B_k$ denotes the batch size used to compute $g_k$. It can be easily seen from (9), (10), and (12) that

$$\mathbb{E}[g_k - f'(z_k)] = 0 \quad \text{and} \quad \mathbb{E}[\|g_k - f'(z_k)\|^2] \le \tfrac{\sigma^2}{B_k}, \qquad (15)$$

and hence $g_k$ is an unbiased estimator of $f'(z_k)$. In fact, letting $S_{B_k} = \sum_{j=1}^{B_k} (F'(z_k, \xi_{k,j}) - f'(z_k))$, from (9) and (10), by induction, we have

$$\mathbb{E}[\|S_{B_k}\|^2] = \mathbb{E}\big[\|S_{B_k - 1}\|^2 + \|F'(z_k, \xi_{k,B_k}) - f'(z_k)\|^2 + 2\langle S_{B_k - 1}, F'(z_k, \xi_{k,B_k}) - f'(z_k)\rangle\big] = \mathbb{E}[\|S_{B_k - 1}\|^2] + \mathbb{E}[\|F'(z_k, \xi_{k,B_k}) - f'(z_k)\|^2] = \textstyle\sum_{j=1}^{B_k} \mathbb{E}[\|F'(z_k, \xi_{k,j}) - f'(z_k)\|^2] \le B_k \sigma^2,$$

which together with the fact that $g_k - f'(z_k) = \tfrac{1}{B_k} \sum_{j=1}^{B_k} [F'(z_k, \xi_{k,j}) - f'(z_k)] = \tfrac{1}{B_k} S_{B_k}$ implies the second relationship in (15).

Secondly, in view of the SCGS method in (Lan and Zhou, 2014), $x_k$ obtained in (13) should be an approximate solution to the gradient sliding subproblem

$$\min_{x \in X} \Big\{ \phi_k(x) := \langle g_k, x \rangle + \tfrac{\beta_k}{2} \|x - x_{k-1}\|^2 \Big\} \qquad (16)$$

such that for some $\eta_k \ge 0$ we have

$$\langle \phi_k'(x_k), x_k - x \rangle = \langle g_k + \beta_k (x_k - x_{k-1}), x_k - x \rangle \le \eta_k, \qquad (17)$$

for all $x \in X$. If we solve the subproblem (16) exactly (i.e., $\eta_k = 0$), then CALSGD reduces to the accelerated stochastic approximation method of (Lan, 2009; 2012). However, by employing the LCG procedure (see Procedure 1 in Subsection 2.2), we only need to use a weak separation oracle while still maintaining the optimal bounds on the number of calls to the stochastic first-order oracle as in (Lan, 2009; 2012; Lan and Zhou, 2014). Thirdly, observe that the CALSGD method so far is conceptual only, as we have not yet specified the LCG procedure and the parameters $\{B_k\}$, $\{\beta_k\}$, $\{\gamma_k\}$, and $\{\eta_k\}$. We will come back to this issue after introducing the LCG procedure and establishing its main convergence properties.
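The outer loop of Algorithm 1 translates almost line by line into code. The following sketch is our own illustration under the assumptions above: the SFO, the LCG subroutine, and the parameter schedules are supplied as callables, and the helper names are hypothetical.

```python
import numpy as np

def calsgd(sfo, lcg, x0, N, nu, beta, gamma, eta, batch):
    """Sketch of Algorithm 1 (CALSGD).

    sfo(z)                    -> one stochastic gradient F'(z, xi)
    lcg(g, beta, x, nu, eta)  -> approximate minimizer of
                                 phi(x') = <g, x'> + (beta/2)*||x' - x||^2, cf. (13), (16)
    beta, gamma, eta, batch   -> callables k -> beta_k, gamma_k, eta_k, B_k
    """
    x, y = x0.copy(), x0.copy()
    for k in range(1, N + 1):
        z = (1 - gamma(k)) * y + gamma(k) * x                     # (11)
        g = np.mean([sfo(z) for _ in range(batch(k))], axis=0)    # (12): mini-batch estimator
        x = lcg(g, beta(k), x, nu, eta(k))                        # (13): lazy inner solver
        y = (1 - gamma(k)) * y + gamma(k) * x                     # (14)
    return y
```

Note that the stochastic gradients are only queried in (12); the inner LCG solver works with the fixed mini-batch estimate $g_k$, which is what allows the method to skip gradient computations while performing weak separation.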
### 2.2. The Parameter-free Lazy Conditional Gradient

The classical CG method is a well-known projection-free algorithm which requires only the solution of a linear optimization subproblem (3), rather than a projection over $X$, per iteration. Therefore, it has computational advantages over many other first-order methods when projection over $X$ is costly. The LCG procedure presented in this subsection, a modification of the vanilla LCG method in (Braun et al., 2016), goes several steps further than CG and even the vanilla LCG method. Firstly, it replaces the LO oracle by a weaker separation oracle, LOsep, which is no harder than linear optimization and often much simpler. Secondly, it uses a stronger termination criterion than the vanilla LCG method, namely the Frank-Wolfe gap (cf. (18)). Finally, it maintains the same order of convergence rate as the CG and the vanilla LCG methods. We present the LOsep oracle in Oracle 1 below.

**Oracle 1** Weak Separation Oracle $\mathrm{LOsep}_P(c, x, \Phi, \nu)$

Input: $c \in \mathbb{R}^n$ linear objective, $x \in P$ point, $\nu \ge 1$ accuracy, $\Phi > 0$ objective value;
Output: $y \in P$ vertex with either
(1) $c^T (x - y) > \Phi / \nu$, or
(2) $y = \operatorname{argmax}_{z \in P} c^T (x - z)$ with $c^T (x - y) \le \Phi$.

Observe that the oracle has two output modes. In particular, Oracle 1 first verifies whether there exists an improving point $y \in P$ with the required guarantee and, if so, it outputs this point; we refer to this as a positive call. If no such point exists, the oracle certifies this by providing the maximizer $y$, which then also provides a new duality gap. We refer to this case as a negative call. The computational advantages of this oracle are that it can reuse previously seen solutions $y$ if they satisfy the improvement condition and that, even if the LO oracle has to be called, the optimization can be terminated early once the improvement condition is satisfied. Finally, the parameter $\nu$ allows to only approximately satisfy the improvement condition, making separation even easier; in our applications we set this parameter slightly larger than $1$.
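The following sketch shows one way to realize $\mathrm{LOsep}_X$ on top of an exact LO oracle, including the caching of previously seen vertices described above; early termination inside the LP solver (as done with Gurobi in Section 3) is omitted, and the wrapper names are our own.

```python
def make_lo_sep(lo_oracle, cache=None):
    """Weak separation oracle LOsep_X(c, x, Phi, nu), cf. Oracle 1.

    lo_oracle(c) -> argmin_{z in X} <c, z>  (exact linear minimizer over X)
    Returns (y, positive), where positive=True means c^T (x - y) > Phi / nu.
    """
    cache = [] if cache is None else cache

    def lo_sep(c, x, phi, nu):
        # Reuse any previously seen vertex that already gives enough improvement.
        for y in cache:
            if c @ (x - y) > phi / nu:
                return y, True                  # positive call answered from the cache
        y = lo_oracle(c)                        # fall back to the exact LO oracle
        cache.append(y)
        if c @ (x - y) > phi / nu:
            return y, True                      # positive call
        return y, False                         # negative call: y maximizes c^T (x - z) over X

    return lo_sep
```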
We present the LCG procedure based on (Braun et al., 2016) below. We adapted the parameter-free version to remove any dependence on hard-to-estimate parameters. For any smooth convex function $\phi$, we define its duality gap as

$$\mathrm{gap}_{\phi, X}(x) \equiv \mathrm{gap}_\phi(x) := \max_{y \in X} \nabla \phi(x)^T (x - y). \qquad (18)$$

Clearly, by convexity the duality gap is an upper bound on $\phi(x) - \phi(x^*)$. Given any accuracy parameter $\eta \ge 0$, the LCG procedure solves $\min_{x \in X} \phi(x)$ approximately with accuracy $\eta$, i.e., it outputs a point $\bar{u} \in X$ s.t. $\mathrm{gap}_\phi(\bar{u}) \le \eta$.

**Procedure 1** Parameter-free Lazy Conditional Gradients (LCG) procedure

Input: access to gradients of a smooth convex function $\phi$, vertex $u_1 \in X$, weak linear separation oracle $\mathrm{LOsep}_X$, accuracy $\nu \ge 1$, duality gap bound $\eta$
Output: $\bar{u} \in X$ with bounded duality gap, i.e., $\mathrm{gap}_\phi(\bar{u}) \le \eta$

1: $\Phi_0 \leftarrow \max_{u \in X} \nabla \phi(u_1)^T (u_1 - u)$
2: for $t = 1$ to $T - 1$ do
3:   $v_t \leftarrow \mathrm{LOsep}_X(\nabla \phi(u_t), u_t, \Phi_{t-1}, \nu)$
4:   if not $\nabla \phi(u_t)^T (u_t - v_t) > \Phi_{t-1}/\nu$ then  {negative call}
5:     if $\Phi_{t-1} \le \eta$ then
6:       return $\bar{u} = u_t$
7:     end if
8:     $\Phi_t \leftarrow \max\{\nabla \phi(u_t)^T (u_t - v_t),\, \Phi_{t-1}/2\}$  {update $\Phi_t$}
9:   else  {positive call}
10:    $\Phi_t \leftarrow \Phi_{t-1}$
11:  end if
12:  $\lambda_t \leftarrow \operatorname{argmin}_{\lambda \in [0,1]} \phi((1 - \lambda) u_t + \lambda v_t)$
13:  $u_{t+1} \leftarrow (1 - \lambda_t) u_t + \lambda_t v_t$
14: end for

The LCG procedure is a parameter-free algorithm. Note that while line search can be expensive in general, for our subproblems function evaluation is very cheap. The algorithm needs only one LO oracle call to estimate the initial functional value gap at Line 1. Alternatively, this can also be done approximately via binary search with LOsep. The algorithm maintains a sequence $\{\Phi_t\}$ that provides valid upper bounds on the functional value gap at the current iterate, i.e., $\phi(u_t) - \phi^* \le 2\Phi_{t-1}$ (see Theorem 5.1 of (Braun et al., 2016)), and it halves the value of $\Phi_t$ only when the current oracle call is negative. Finally, our LCG procedure exits at Line 5 whenever $\mathrm{LOsep}_X$ returns a negative call and $\Phi_{t-1} \le \eta$, which ensures that $\mathrm{gap}_\phi(\bar{u}) = \max_{y \in X} \langle \nabla \phi(\bar{u}), \bar{u} - y \rangle \le \eta$.

Theorem 2.1 below provides a bound on the total number of iterations (or calls to the $\mathrm{LOsep}_X$ oracle) that the LCG procedure requires to generate a point $\bar{u} \in X$ with $\mathrm{gap}_\phi(\bar{u}) \le \eta$.

**Theorem 2.1.** Procedure 1 returns a point $\bar{u} \in X$ such that the duality gap at $\bar{u}$ is bounded by $\eta$, i.e., $\mathrm{gap}_\phi(\bar{u}) \le \eta$. Furthermore, the total number of iterations $T$ (and hence $\mathrm{LOsep}_X$ calls) performed by Procedure 1 is at most

$$T \le \begin{cases} \big\lceil \log_2 \tfrac{\Phi_0}{\eta} \big\rceil + 4\nu + \tfrac{4\nu^2 C_\phi}{\eta} + 2, & \eta < C_\phi,\\ \big\lceil \log_2 \tfrac{\Phi_0}{\eta} \big\rceil + 8\nu + 2, & \eta \ge C_\phi. \end{cases} \qquad (19)$$

*Proof.* From the observations above, it is clear that the duality gap at the output point $\bar{u}$ is bounded by $\eta$. Also observe that the procedure calls $\mathrm{LOsep}_X$ once per iteration. In order to demonstrate the bound in (19), we split the LCG procedure into two phases and bound the number of iterations separately for each phase. Let $C_\phi$ denote the curvature of the smooth convex function $\phi$. We say Procedure 1 is in the first phase whenever $\Phi_{t-1} > \eta$. In view of Theorem 5.1 in (Braun et al., 2016), the number of iterations in the first phase can be bounded by $\lceil \log_2 (\Phi_0/\eta) \rceil + 4\nu$. Procedure 1 enters the second phase when $\Phi_{t-1} \le \eta$. Again with the argumentation of Theorem 5.1 in (Braun et al., 2016), the total number of positive calls in this phase can be bounded by $4\nu^2 C_\phi/\eta$ if $\eta < C_\phi$, or by $4\nu$ if $\eta \ge C_\phi$. Moreover, the procedure exits whenever the current $\mathrm{LOsep}_X$ oracle call is a negative call. Hence, the number of iterations in the second phase can be bounded by $4\nu^2 C_\phi/\eta + 1$ if $\eta < C_\phi$, and by $4\nu + 1$ if $\eta \ge C_\phi$. Thus, our bound in (19) is obtained from the above two bounds plus one more LO oracle call at Line 1. $\square$
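For the gradient sliding subproblem (16), Procedure 1 can be sketched as follows (our own reading of the procedure, not the authors' implementation); the exact line search uses the closed form available for the quadratic $\phi_k$, and the oracle interface matches the wrapper sketched after Oracle 1.

```python
import numpy as np

def lcg(g, beta, x_prev, lo_sep, lo_oracle, nu, eta, max_iter=10_000):
    """Sketch of Procedure 1 for phi(u) = <g, u> + (beta/2)*||u - x_prev||^2, cf. (16).

    lo_sep    -> weak separation oracle LOsep_X from the previous sketch
    lo_oracle -> exact LO oracle, used once for the initial gap estimate Phi_0 (Line 1)
    """
    grad_phi = lambda u: g + beta * (u - x_prev)
    u = x_prev.copy()
    d = grad_phi(u)
    phi_bound = d @ (u - lo_oracle(d))                      # Phi_0
    for _ in range(max_iter):
        d = grad_phi(u)
        v, positive = lo_sep(d, u, phi_bound, nu)
        if not positive:                                    # negative call
            if phi_bound <= eta:
                return u                                    # gap_phi(u) <= eta is certified
            phi_bound = max(d @ (u - v), phi_bound / 2.0)   # update Phi_t
        # exact line search for the quadratic phi along the segment [u, v]
        dv = v - u
        denom = beta * (dv @ dv)
        lam = 0.0 if denom == 0 else float(np.clip(d @ (u - v) / denom, 0.0, 1.0))
        u = u + lam * dv
    return u
```

Because $\phi_k$ is quadratic, the line search in Line 12 reduces to clipping a ratio of inner products, which is why function evaluations are cheap for our subproblems.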
### 2.3. The Convergence Properties of CALSGD

This subsection is devoted to establishing the main convergence properties of the CALSGD method. Since the algorithm is stochastic, we will establish the convergence results for finding a stochastic $\varepsilon$-solution, i.e., a point $\bar{x} \in X$ s.t. $\mathbb{E}[f(\bar{x}) - f(x^*)] \le \varepsilon$. We first state a simple technical result from (Lan and Zhou, 2014, Lemma 2.1) that we will use.

**Lemma 2.2.** Let $w_t \in (0, 1]$, $t = 1, 2, \ldots$, be given. Also let us denote

$$W_t := \begin{cases} 1, & t = 1,\\ (1 - w_t) W_{t-1}, & t \ge 2. \end{cases}$$

Suppose that $W_t > 0$ for all $t \ge 2$ and that the sequence $\{\delta_t\}_{t \ge 0}$ satisfies

$$\delta_t \le (1 - w_t) \delta_{t-1} + B_t, \quad t = 1, 2, \ldots.$$

Then for any $1 \le l \le k$, we have

$$\delta_k \le W_k \Big[ \tfrac{1 - w_l}{W_l} \delta_{l-1} + \textstyle\sum_{i=l}^k \tfrac{B_i}{W_i} \Big].$$

Theorem 2.3 describes the main convergence properties of the CALSGD method (cf. Algorithm 1). The proof of this theorem can be found in Supplementary Material A.

**Theorem 2.3.** Let $\Gamma_k$ be defined as follows:

$$\Gamma_k := \begin{cases} 1, & k = 1,\\ \Gamma_{k-1} (1 - \gamma_k), & k \ge 2. \end{cases} \qquad (20)$$

a) Suppose that $\{\beta_k\}$ and $\{\gamma_k\}$ in the CALSGD algorithm satisfy

$$\gamma_1 = 1 \quad \text{and} \quad L\gamma_k \le \beta_k, \quad k \ge 1, \qquad (21)$$

as well as

$$\tfrac{\beta_k \gamma_k}{\Gamma_k} \ge \tfrac{\beta_{k-1} \gamma_{k-1}}{\Gamma_{k-1}}, \quad k \ge 2. \qquad (22)$$

Then under assumptions (9) and (10), we have

$$\mathbb{E}[f(y_k) - f(x^*)] \le \tfrac{\beta_k \gamma_k D_X^2}{2} + \Gamma_k \textstyle\sum_{i=1}^k \Big[ \tfrac{\eta_i \gamma_i}{\Gamma_i} + \tfrac{\sigma^2 \gamma_i}{2 \Gamma_i B_i (\beta_i - L \gamma_i)} \Big], \qquad (23)$$

where $x^*$ is an arbitrary optimal solution of (8) and $D_X$ is defined in (4).

b) If

$$\tfrac{\beta_k \gamma_k}{\Gamma_k} \le \tfrac{\beta_{k-1} \gamma_{k-1}}{\Gamma_{k-1}}, \quad k \ge 2, \qquad (24)$$

(rather than (22)) is satisfied, then the result in part a) holds with $\beta_k \gamma_k D_X^2$ replaced by $\beta_1 \Gamma_k \|x_0 - x^*\|^2$ in the first term of the RHS of (23).

c) Under the assumptions in part a) or b), the number of inner iterations performed at the $k$-th outer iteration is bounded by

$$T_k \le \begin{cases} \big\lceil \log_2 \tfrac{\Phi_0^k}{\eta_k} \big\rceil + 4\nu + \tfrac{4\nu^2 \beta_k D_X^2}{\eta_k} + 2, & \eta_k < \beta_k D_X^2,\\ \big\lceil \log_2 \tfrac{\Phi_0^k}{\eta_k} \big\rceil + 8\nu + 2, & \eta_k \ge \beta_k D_X^2, \end{cases} \qquad (25)$$

where $\Phi_0^k$ denotes the initial gap estimate computed at Line 1 of the LCG procedure in the $k$-th outer iteration.

Now we provide two different sets of parameters $\{\beta_k\}$, $\{\gamma_k\}$, $\{\eta_k\}$, and $\{B_k\}$ which lead to optimal complexity bounds on the number of calls to the SFO and $\mathrm{LOsep}_X$ oracles.

**Corollary 2.4.** Suppose that $\{\beta_k\}$, $\{\gamma_k\}$, $\{\eta_k\}$, and $\{B_k\}$ in the CALSGD method are set to

$$\beta_k = \tfrac{4L}{k+2}, \quad \gamma_k = \tfrac{3}{k+2}, \quad \eta_k = \tfrac{L D_X^2}{k(k+1)}, \quad B_k = \Big\lceil \tfrac{\sigma^2 k (k+1)(k+2)}{(L D_X)^2} \Big\rceil, \quad k \ge 1, \qquad (26)$$

and assume that $\|f'(x^*)\|$ is bounded for any optimal solution $x^*$ of (8). Under assumptions (9) and (10), we have

$$\mathbb{E}[f(y_k) - f(x^*)] \le \tfrac{6 L D_X^2}{(k+2)^2} + \tfrac{9 L D_X^2}{2 (k+1)(k+2)}, \quad \forall k \ge 1. \qquad (27)$$

As a consequence, the total numbers of calls to the SFO and $\mathrm{LOsep}_X$ oracles performed by the CALSGD method for finding a stochastic $\varepsilon$-solution of (1) can be bounded by

$$O\Big( \sqrt{\tfrac{L D_X^2}{\varepsilon}} + \tfrac{\sigma^2 D_X^2}{\varepsilon^2} \Big) \qquad (28)$$

and

$$O\Big( \tfrac{\nu^2 L D_X^2}{\varepsilon} + \sqrt{\tfrac{L D_X^2}{\varepsilon}} \log \tfrac{L D_X^2}{\delta \varepsilon} \Big), \qquad (29)$$

respectively, the latter with probability $1 - \delta$.

*Proof.* It can be easily seen from (26) that (21) holds. Also note that by (26) we have

$$\Gamma_k = \tfrac{6}{k(k+1)(k+2)}, \qquad (30)$$

and $\tfrac{\beta_k \gamma_k}{\Gamma_k} = \tfrac{2 L k (k+1)}{k+2}$, which implies that (22) holds. It can also be easily checked from (30) and (26) that

$$\textstyle\sum_{i=1}^k \tfrac{\gamma_i}{\Gamma_i B_i (\beta_i - L \gamma_i)} \le \tfrac{k L D_X^2}{2 \sigma^2}.$$

Using the bound in (23), we obtain (27), which implies that the total number of outer iterations $N$ can be bounded by $O(\sqrt{L D_X^2/\varepsilon})$ under assumptions (9) and (10). The bound in (28) then immediately follows from this observation and the fact that the number of calls to the SFO oracle is bounded by

$$\textstyle\sum_{k=1}^N B_k \le N + \tfrac{\sigma^2 (N+3)^4}{4 (L D_X)^2}.$$

We now provide an estimate for $\Phi_0^k$ (cf. Line 1 of the LCG procedure) at the $k$-th outer iteration. In view of the definition of $\Phi_0^k$ and $\phi_k(\cdot)$ (cf. (16)), we have

$$\Phi_0^k = \max_{x \in X} \langle \phi_k'(x_{k-1}), x_{k-1} - x \rangle = \max_{x \in X} \langle g_k, x_{k-1} - x \rangle.$$

Moreover, let $A_k$ denote the event $\big\{ \|g_k - f'(z_k)\| \le \sqrt{N \sigma^2 / (\delta B_k)} \big\}$. By Chebyshev's inequality and (15), we obtain

$$\mathrm{Prob}\{A_k^c\} \le \mathbb{E}[\|g_k - f'(z_k)\|^2] \cdot \tfrac{\delta B_k}{N \sigma^2} \le \tfrac{\delta}{N}, \quad \forall \delta < 1, \; k \ge 1,$$

which implies that $\mathrm{Prob}\{\cap_{k=1}^N A_k\} \ge 1 - \delta$. Hence, by the Cauchy-Schwarz and triangle inequalities, we have with probability $1 - \delta$,

$$\Phi_0^k = \max_{x \in X} \big\{ \langle g_k - f'(z_k), x_{k-1} - x \rangle + \langle f'(z_k), x_{k-1} - x \rangle \big\} \le \Big( \sqrt{\tfrac{N \sigma^2}{\delta B_k}} + \|f'(z_k) - f'(x^*)\| + \|f'(x^*)\| \Big) D_X \le \Big( \sqrt{\tfrac{N \sigma^2}{\delta B_k}} + L D_X + \|f'(x^*)\| \Big) D_X, \qquad (31)$$

where the last inequality follows from (6) and (26). Note that we always have $\eta_k < \beta_k D_X^2$. Therefore, it follows from the bound in (25), (26), and (31) that the total number of inner iterations can be bounded by

$$O\Big( N \log \tfrac{N \big(1 + \|f'(x^*)\| / (L D_X)\big)}{\delta} + \nu^2 N^2 + \nu N \Big),$$

which implies our bound in (29). $\square$

We now provide a slightly improved complexity bound on the number of calls to the SFO oracle which depends on the distance from the initial point to the set of optimal solutions rather than on the diameter $D_X$. In order to obtain this improvement, we need an estimate $D_0 \ge \|x_0 - x^*\|$ and we need to fix the number of iterations $N$ in advance. This result will play an important role in the analysis of the CALSGD method for solving strongly convex problems (see Supplementary Material C.1).
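For concreteness, the stepsize policy of Corollary 2.4 can be wired into the CALSGD sketch from Section 2.1 as a set of closures encoding (26); the problem constants $L$, $D_X$, and $\sigma$ must be supplied by the user, and the function name is ours.

```python
import math

def corollary_2_4_schedule(L, D_X, sigma):
    """Parameter schedule of Corollary 2.4, cf. (26)."""
    beta  = lambda k: 4.0 * L / (k + 2)
    gamma = lambda k: 3.0 / (k + 2)
    eta   = lambda k: L * D_X**2 / (k * (k + 1))
    batch = lambda k: max(1, math.ceil(sigma**2 * k * (k + 1) * (k + 2) / (L * D_X) ** 2))
    return beta, gamma, eta, batch
```

The returned callables can be passed directly as `beta`, `gamma`, `eta`, and `batch` to the `calsgd` sketch above.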
**Corollary 2.5.** Suppose that there exists an estimate $D_0$ s.t. $\|x_0 - x^*\| \le D_0 \le D_X$, and assume that the outer iteration limit $N \ge 1$ is given. If

$$\beta_k = \tfrac{3L}{k}, \quad \gamma_k = \tfrac{2}{k+1}, \quad \eta_k = \tfrac{2 L D_0^2}{N k}, \quad B_k = \Big\lceil \tfrac{\sigma^2 N (k+1)^2}{(L D_0)^2} \Big\rceil, \quad k \ge 1, \qquad (32)$$

then under assumptions (9) and (10),

$$\mathbb{E}[f(y_N) - f(x^*)] \le \tfrac{8 L D_0^2}{N (N+1)}, \quad \forall N \ge 1.$$

As a consequence, the total numbers of calls to the SFO and $\mathrm{LOsep}_X$ oracles performed by the CALSGD method for finding a stochastic $\varepsilon$-solution of (1) can be bounded by

$$O\Big( \sqrt{\tfrac{L D_0^2}{\varepsilon}} + \tfrac{\sigma^2 D_0^2}{\varepsilon^2} \Big) \quad \text{and} \quad O\Big( \tfrac{\nu^2 L D_X^2}{\varepsilon} \Big),$$

respectively.

*Proof.* The proof is similar to that of Corollary 2.4, and hence the details are skipped. $\square$

It should be pointed out that the complexity bound for the number of calls to the LOsep oracle in (29) is established with probability $1 - \delta$. However, the probability parameter $\delta$ only appears in the non-dominant term.

## 3. Experimental Results

We present preliminary experimental results showing the performance of CALSGD compared to OFW for stochastic optimization. As examples we use the video co-localization problem, which can be solved by quadratic programming over a path polytope, different structured regression problems, and quadratic programming over the standard spectrahedron. In all cases we use objective functions of the form $\|Ax - b\|^2$, with $A \in \mathbb{R}^{m \times n}$, i.e., $m$ examples over a feasible region of dimension $n$. For comparability we use a batch size of 128 for all algorithms to compute each gradient and the full matrix $A$ for the actual objective function values. All graphs show the function value using a log scale on the vertical axis.

In Figure 1 we compare the performance of three algorithms: CALSGD, SCGS, and OFW. As described above, SCGS is the non-lazy counterpart of CALSGD. In the four graphs of Figure 1 we report the objective function value over the number of iterations, the wall clock time in seconds, the number of calls to the linear oracle, and the number of gradient evaluations, in that order. In all these measures, our proposed algorithms outperform OFW by multiple orders of magnitude. As expected, in the number of iterations and the number of gradient evaluations both versions, CALSGD and SCGS, perform equally well; however, in wall clock time and in the number of calls to the linear oracle we observe the advantage of the weaker LOsep oracle over LO.

Figure 1. Performance of CALSGD and its non-lazy variant SCGS on a structured regression problem compared to OFW. The feasible region is the flow-based formulation of the convex hull of Hamiltonian cycles on 9 nodes and has dimension $n = 162$.

In Figures 5 and 4 we show the performance of CALSGD on one video co-localization instance and one semi-definite convex programming instance. Due to space limitations we only report the function value over the number of iterations and the wall clock time in seconds; see Supplementary Material D for a detailed analysis as well as more examples.
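For the experiments' objective $\|Ax - b\|^2$, a mini-batch SFO of the kind fed to CALSGD can be sketched as follows (our own illustrative code; the batch size of 128 matches the setup described above, and sampling with replacement is an assumption).

```python
import numpy as np

def make_sfo(A, b, batch_size=128, rng=None):
    """Mini-batch SFO for f(x) = ||Ax - b||^2 with m examples (rows of A)."""
    rng = np.random.default_rng() if rng is None else rng
    m = A.shape[0]

    def sfo(x):
        idx = rng.integers(0, m, size=batch_size)       # sample a mini-batch of examples
        residual = A[idx] @ x - b[idx]
        # Unbiased estimate of f'(x) = 2 A^T (Ax - b): rescale the sampled rows by m/batch_size.
        return (2.0 * m / batch_size) * (A[idx].T @ residual)

    return sfo
```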
Figure 2. Two small video co-localization instances. On the left: the road_paths_01_DC_a instance ($n = 29682$ and $m = 10000$). On the right: the road_paths_01_DC_b instance ($n = 29682$ and $m = 10000$). Observe a significant difference in function value of multiple orders of magnitude after a few seconds.

Figure 3. Two medium sized video co-localization instances. On the left: the road_paths_02_DE_a instance ($n = 119520$ and $m = 10000$). On the right: the road_paths_02_DE_b instance ($n = 119520$ and $m = 10000$). Similar results as in Figure 2.

Figure 4. Performance of CALSGD and OFW on a medium sized convex programming instance with feasible region $S_n := \{X \in \mathbb{R}^{n \times n} \mid X \succeq 0,\ \mathrm{tr}(X) = 1\}$ with $n = 100$ and $m = 10000$. Similar to the results before, in both iterations and wall clock time our method performs better.

Figure 5. Performance of CALSGD compared to OFW on a small video co-localization instance. The dimension of the underlying path polytope is $n = 29682$; the time limit is 50 seconds. Our algorithm performs significantly better both in number of iterations as well as in wall clock time.

Figure 6. Structured regression problem over the convex hull of all Hamiltonian cycles of a graph on 11 nodes ($n = 242$) on the left and 12 nodes ($n = 288$) on the right. We used a density of $d = 0.6$ for $A$ and $m = 10000$. On both instances CALSGD achieves lower values much faster, both in number of iterations as well as in wall clock time.

**Implementation details.** Finally, we provide details of the implementation of LOsep. In the case of the structured regression problems and the quadratic optimization over the path polytope instances, we used Gurobi as a solver and included callbacks to terminate whenever the required improvement (given by $\Phi_{t-1}$) is reached; our approach is one out of many, and other approaches could have been used equally well. If the solver does not find a good enough solution, it returns with a lower bound on the Wolfe gap, which we use to update $\Phi_t$. In the case of convex programming over the feasible region $S_n := \{X \in \mathbb{R}^{n \times n} \mid X \succeq 0 \text{ and } \mathrm{tr}(X) = 1\}$, we compute a maximal eigenvector of the gradient (which is a matrix in this case) and use the rank-1 factor of the maximal eigenvector, which is an optimal point. In this case there is no early termination. However, in all cases we use caching, i.e., we store all previously seen points and check if any of them satisfies the improvement guarantee. If that is the case, we do not call Gurobi or the maximal eigenvector routine. The size of the cache is very small in all experiments; alternatively, one could use cache strategies such as, e.g., k-paging.
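To illustrate the linear minimization step used for the spectrahedron experiments, the sketch below returns the rank-1 vertex of $S_n$ minimizing $\langle C, X \rangle$ for a symmetric matrix $C$; which extreme eigenvalue one takes depends only on the sign convention of the passed-in objective, and the helper name is ours. The caching described above then amounts to storing the returned factors and re-checking the improvement condition of Oracle 1 before recomputing.

```python
import numpy as np

def spectrahedron_lo(C):
    """Exact LO oracle over S_n = {X : X PSD, tr(X) = 1} for the objective <C, X>.

    The minimizer is the rank-1 matrix v v^T, where v is a unit eigenvector
    for the smallest eigenvalue of the symmetric matrix C.
    """
    eigvals, eigvecs = np.linalg.eigh(C)   # eigenvalues in ascending order
    v = eigvecs[:, 0]                      # eigenvector of the extreme (smallest) eigenvalue
    return np.outer(v, v)                  # rank-1 vertex of the spectrahedron
```

For large $n$ one would typically replace the full eigendecomposition by an iterative extreme-eigenvector routine such as Lanczos.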
## Acknowledgements

We would like to thank Elad Hazan for providing references. Research reported in this paper was partially supported by NSF CAREER award CMMI-1452463.

## References

S. Ahipasaoglu and M. Todd. A modified Frank-Wolfe algorithm for computing minimum-area enclosing ellipsoidal cylinders: Theory and algorithms. Computational Geometry, 46:494–519, 2013.

F. Bach, S. Lacoste-Julien, and G. Obozinski. On the equivalence between herding and conditional gradient algorithms. In the 29th International Conference on Machine Learning, 2012.

A. Beck and M. Teboulle. A conditional gradient method with linear rate of convergence for solving convex linear systems. Math. Methods Oper. Res., 59:235–247, 2004.

G. Braun, S. Pokutta, and D. Zink. Lazifying conditional gradient algorithms. arXiv preprint arXiv:1610.05120, 2016.

Y. Chen, G. Lan, and Y. Ouyang. Optimal primal-dual methods for a class of saddle point problems. SIAM Journal on Optimization, 24(4):1779–1814, 2014.

K. L. Clarkson. Coresets, sparse greedy approximation, and the Frank-Wolfe algorithm. ACM Trans. Algorithms, 6(4):63:1–63:30, Sept. 2010.

B. Cox, A. Juditsky, and A. S. Nemirovski. Dual subgradient algorithms for large-scale nonsmooth learning problems. Manuscript, School of ISyE, Georgia Tech, Atlanta, GA, 30332, USA, 2013. Submitted to Mathematical Programming, Series B.

M. Frank and P. Wolfe. An algorithm for quadratic programming. Naval Research Logistics Quarterly, 3:95–110, 1956.

R. M. Freund and P. Grigas. New analysis and results for the Frank-Wolfe method. arXiv e-prints, July 2013.

D. Garber and E. Hazan. A linearly convergent conditional gradient algorithm with applications to online and stochastic optimization. arXiv e-prints, Jan. 2013.

S. Ghadimi and G. Lan. Optimal stochastic approximation algorithms for strongly convex stochastic composite optimization, I: A generic algorithmic framework. SIAM Journal on Optimization, 22:1469–1492, 2012.

S. Ghadimi and G. Lan. Optimal stochastic approximation algorithms for strongly convex stochastic composite optimization, II: Shrinking procedures and optimal algorithms. SIAM Journal on Optimization, 23:2061–2089, 2013.

Gurobi Optimization. Gurobi optimizer reference manual, version 6.5, 2016. URL https://www.gurobi.com/documentation/6.5/refman/.

Z. Harchaoui, A. Juditsky, and A. S. Nemirovski. Conditional gradient algorithms for machine learning. NIPS OPT workshop, 2012.

E. Hazan. Sparse approximate solutions to semidefinite programs. In E. Laber, C. Bornstein, L. Nogueira, and L. Faria, editors, LATIN 2008: Theoretical Informatics, volume 4957 of Lecture Notes in Computer Science, pages 306–316. Springer Berlin Heidelberg, 2008. ISBN 978-3-540-78772-3.

E. Hazan and S. Kale. Projection-free online learning. arXiv preprint arXiv:1206.4657, 2012.

E. Hazan and H. Luo. Variance-reduced and projection-free stochastic optimization. In Proceedings of The 33rd International Conference on Machine Learning, pages 1263–1271, 2016.

M. Jaggi. Sparse Convex Optimization Methods for Machine Learning. PhD thesis, ETH Zürich, 2011. http://dx.doi.org/10.3929/ethz-a-007050453.

M. Jaggi. Revisiting Frank-Wolfe: Projection-free sparse convex optimization. In the 30th International Conference on Machine Learning, 2013.

M. Jaggi and M. Sulovský. A simple algorithm for nuclear norm regularized problems. In the 27th International Conference on Machine Learning, 2010.

R. Johnson and T. Zhang. Accelerating stochastic gradient descent using predictive variance reduction. In Advances in Neural Information Processing Systems, pages 315–323, 2013.

A. Joulin, K. Tang, and L. Fei-Fei. Efficient image and video co-localization with Frank-Wolfe algorithm. In European Conference on Computer Vision, pages 253–268. Springer, 2014.
G. Lan. Convex Optimization Under Inexact First-Order Information. PhD dissertation, School of Industrial and Systems Engineering, Georgia Institute of Technology, Atlanta, GA 30332, USA, 2009.

G. Lan. An optimal method for stochastic composite optimization. Mathematical Programming, 133(1):365–397, 2012.

G. Lan. The complexity of large-scale convex programming under a linear optimization oracle. Technical report, 2013. Available on http://www.optimization-online.org/.

G. Lan and Y. Zhou. Conditional gradient sliding for convex optimization. Optimization-Online preprint (4605), 2014.

R. Luss and M. Teboulle. Conditional gradient algorithms for rank one matrix approximations with a sparsity constraint. SIAM Review, 55:65–98, 2013.

M. Mahdavi, L. Zhang, and R. Jin. Mixed optimization for smooth functions. In Advances in Neural Information Processing Systems, pages 674–682, 2013.

Y. E. Nesterov. A method for unconstrained convex minimization problem with the rate of convergence O(1/k²). Doklady AN SSSR, 269:543–547, 1983.

Y. E. Nesterov. Introductory Lectures on Convex Optimization: A Basic Course. Kluwer Academic Publishers, Massachusetts, 2004.

Y. E. Nesterov. Smooth minimization of nonsmooth functions. Mathematical Programming, 103:127–152, 2005.

C. Shen, J. Kim, L. Wang, and A. van den Hengel. Positive semidefinite metric learning using boosting-like algorithms. Journal of Machine Learning Research, 13:1007–1036, 2012.