# lazifying_conditional_gradient_algorithms__bfe5f7ba.pdf

Journal of Machine Learning Research 20 (2019) 1-42 Submitted 3/18; Revised 2/19; Published 3/19

Lazifying Conditional Gradient Algorithms

G abor Braun gabor.braun@isye.gatech.edu Sebastian Pokutta sebastian.pokutta@isye.gatech.edu Daniel Zink daniel.zink@gatech.edu School of Industrial & Systems Engineering Georgia Institute of Technology Atlanta, GA 30332, USA

Editor: Mark Schmidt

Conditional gradient algorithms (also often called Frank Wolfe algorithms) are popular due to their simplicity of only requiring a linear optimization oracle and more recently they also gained signiﬁcant traction for online learning. While simple in principle, in many cases the actual implementation of the linear optimization oracle is costly. We show a general method to lazify various conditional gradient algorithms, which in actual computations leads to several orders of magnitude of speedup in wall-clock time. This is achieved by using a faster separation oracle instead of a linear optimization oracle, relying only on few linear optimization oracle calls. Keywords: Frank Wolfe algorithm, conditional gradient, caching, linear optimization oracle, convex optimization

1. Introduction

Convex optimization is an important technique both from a theoretical and an applications perspective. Gradient descent based methods are widely used due to their simplicity and easy applicability to many real-world problems. We are interested in solving constraint convex optimization problems of the form

min x P f(x), (1)

where f is a smooth convex function and P is a polytope, with access to f being limited to ﬁrst-order information, i.e., we can obtain f(x) and f(x) for a given x P and access to P via a linear minimization oracle, which returns LPP (c) = argminx P cx for a given linear objective c. When solving Problem (1) using gradient descent approaches in order to maintain feasibility, typically a projection step is required. This projection back into the feasible region P is potentially computationally expensive, especially for complex feasible regions in very large dimensions. As such, projection-free methods gained a lot of attention recently, in particular the Frank Wolfe algorithm Frank and Wolfe (1956) (also known as conditional gradient descent Levitin and Polyak (1966); see also Jaggi (2013) for an overview) and its online version Hazan and Kale (2012) due to their simplicity. We recall the basic Frank Wolfe algorithm in Algorithm 1. These methods eschew the projection step and rather use a

c 2019 G abor Braun, Sebastian Pokutta, and Daniel Zink.

License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at http://jmlr.org/papers/v20/18-114.html.

Braun, Pokutta, and Zink

Algorithm 1 Frank Wolfe Algorithm Frank and Wolfe (1956) Input: smooth convex function f with curvature C, start vertex x1 P, linear minimization oracle LPP Output: points xt in P

1: for t = 1 to T 1 do

2: vt LPP ( f(xt))

3: xt+1 (1 γt)xt + γtvt with γt := 2 t+2 4: end for

linear optimization oracle to stay within the feasible region. While convergence rates and regret bounds are often suboptimal, in many cases the gain due to only having to solve a single linear optimization problem over the feasible region in every iteration still leads to signiﬁcant computational advantages (see e.g., (Hazan and Kale, 2012, Section 5)). This led to conditional gradient algorithms being used for e.g., online optimization and more generally machine learning. Also the property that these algorithms naturally generate sparse distributions over the extreme points of the feasible region is often helpful. Further increasing the relevance of these methods, it was shown recently that conditional gradient methods can also achieve linear convergence (see e.g., Garber and Hazan (2013); Lacoste-Julien and Jaggi (2015); Garber and Meshi (2016)) as well as that the number of total gradient evaluations can be reduced while maintaining the optimal number of oracle calls as shown in Lan and Zhou (2014). Unfortunately, for complex feasible regions even solving the linear optimization problem might be time-consuming and as such the cost of solving the LP might be non-negligible. This could be the case, e.g., when linear optimization over the feasible region is a hard problem or when solving large-scale optimization or learning problems. As such it is natural to ask the following questions:

(i) Does the linear optimization oracle have to be called in every iteration?

(ii) Does one need approximately optimal solutions for convergence?

(iii) Can one reuse information across iterations?

We will answer these questions in this work, showing that (i) the LP oracle is not required to be called in every iteration, (ii) much weaker guarantees are suﬃcient, and (iii) we can reuse information. To signiﬁcantly reduce the cost of oracle calls while maintaining identical convergence rates up to small constant factors, we replace the linear optimization oracle by a (weak) separation oracle (Oracle 1) which approximately solves a separation problem

Oracle 1 Weak Separation Oracle LPsep P (c, x, Φ, K) Input: linear objective c Rn, point x P, accuracy K 1, objective value Φ > 0; Output: Either (1) vertex y P with c(x y) > Φ/K, or (2) false: c(x z) Φ for all z P.

within a multiplicative factor and returns improving vertices. We stress that the weak separation oracle is signiﬁcantly weaker than approximate minimization, which has been

Lazifying Conditional Gradient Algorithms

already considered in Jaggi (2013). In fact, there is no guarantee that the improving vertices returned by the oracle are near to the optimal solution to the linear minimization problem. It is this relaxation of dual bounds and approximate optimality that will provide a signiﬁcant speedup as we will see later. However, if the oracle does not return an improving vertex (returns false), then this fact can be used to derive a reasonably small dual bound of the form: f(xt) f(x ) f(xt)(xt x ) Φt for some Φt > 0. While the accuracy K is presented here as a formal argument of the oracle, an oracle implementation might restrict to a ﬁxed value K > 1, which often makes implementation easier. We point out that the cases (1) and (2) potentially overlap if K > 1. This is intentional and in this case it is unspeciﬁed which of the cases the oracle should choose (and it does not matter for the algorithms). This new oracle encapsulates the smart use of the original linear optimization oracle, even though for some problems it could potentially be implemented directly without relying on a linear programming oracle. Concretely, a weak separation oracle can be realized by a single call to a linear optimization oracle and as such is no more complex than the original oracle. However it has two important advantages: it allows for caching and early termination. Caching refers to storing previous solutions, and ﬁrst searching among them to satisfy the oracle s separation condition. The underlying linear optimization oracle is called only, when none of the cached solutions satisfy the condition. Algorithm 2 formalizes this process. Early termination is the technique to stop the linear optimization algorithm before it ﬁnishes at an appropriate stage, when from its internal data a suitable oracle answer can be easily recovered; this is clearly an implementation dependent technique. The two techniques can be combined, e.g., Algorithm 2 could use an early terminating linear oracle or other implementation of the weak separation oracle in line 4.

Algorithm 2 LPsep P (c, x, Φ, K) via LP oracle Input: linear objective c Rn, point x P, accuracy K 1, objective value Φ > 0; Output: Either (1) vertex y P with c(x y) > Φ/K, or (2) false: c(x z) Φ for all z P.

1: if y P cached with c(x y) > Φ/K exists then

2: return y {Cache call}

4: y argmaxx P cx {LP call}

5: if c(x y) > Φ/K then

6: add y to cache

7: return y

9: return false

We call laziﬁcation the technique of replacing a linear programming oracle with a much weaker one, and we will demonstrate signiﬁcant speedups in wall-clock performance (see e.g., Figure 12), while maintaining identical theoretical convergence rates. To exemplify our approach we provide conditional gradient algorithms employing the weak separation oracle for the standard Frank Wolfe algorithm as well as the variants in

Braun, Pokutta, and Zink

Hazan and Kale (2012); Garber and Meshi (2016); Garber and Hazan (2013), which have been chosen due to requiring modiﬁed convergence arguments that go beyond those required for the vanilla Frank Wolfe algorithm. Complementing the theoretical analysis we report computational results demonstrating eﬀectiveness of our approach via a signiﬁcant reduction in wall-clock time compared to their linear optimization counterparts.

Related Work

There has been extensive work on Frank Wolfe algorithms and conditional gradient algorithms, so we will restrict to review work most closely related to ours. The Frank Wolfe algorithm was originally introduced in Frank and Wolfe (1956) (also known as conditional gradient descent Levitin and Polyak (1966) and has been intensely studied in particular in terms of achieving stronger convergence guarantees as well as aﬃne-invariant versions. We demonstrate our approach for the vanilla Frank Wolfe algorithm Frank and Wolfe (1956) (see also Jaggi (2013)) as an introductory example. We then consider more complicated variants that require non-trivial changes to the respective convergence proofs to demonstrate the versatility of our approach. This includes the linearly convergent variant via local linear optimization Garber and Hazan (2013) as well as the pairwise conditional gradient variant of Garber and Meshi (2016), which is especially eﬃcient in terms of implementation. However, our technique also applies to the Away-Step Frank Wolfe algorithm, the Fully-Corrective Frank Wolfe algorithm, the Pairwise Conditional Gradient algorithm, as well as the Block-Coordinate Frank Wolfe algorithm. Recently, in Freund and Grigas (2016) guarantees for arbitrary step-size rules were provided and an analogous analysis can be also performed for our approach. On the other hand, the analysis of the inexact variants, e.g., with approximate linear minimization does not apply to our case as our oracle is signiﬁcantly weaker than approximate minimization as pointed out earlier. For more information, we refer the interested reader to the excellent overview in Jaggi (2013) for Frank Wolfe methods in general as well as Lacoste-Julien and Jaggi (2015) for an overview with respect to global linear convergence. It was also recently shown in Hazan and Kale (2012) that the Frank Wolfe algorithm can be adjusted to the online learning setting and in this work we provide a lazy version of this algorithm. Combinatorial convex online optimization has been investigated in a long line of work (see e.g., Kalai and Vempala (2005); Audibert et al. (2013); Neu and Bart ok (2013)). It is important to note that our regret bounds hold in the structured online learning setting, i.e., our bounds depend on the ℓ1-diameter or sparsity of the polytope, rather than its ambient dimension for arbitrary convex functions (see e.g., Cohen and Hazan (2015); Gupta et al. (2016)). We refer the interested reader to Hazan (2016) for an extensive overview. A key component of the new oracle is the ability to cache and reuse old solutions, which accounts for the majority of the observed speed up. The idea of caching of oracle calls was already explored in various other contexts such as cutting plane methods (see e.g., Joachims et al. (2009)) as well as the Block-Coordinate Frank Wolfe algorithm in Shah et al. (2015); Osokin et al. (2016). Our laziﬁcation approach (which uses caching) is however much more lazy, requiring no multiplicative approximation guarantee; see (Osokin et al., 2016, Proof of Theorem 3. Appendix F) and Lacoste-Julien et al. (2013) for comparison to our setup.

Lazifying Conditional Gradient Algorithms

Contribution

The main technical contribution of this paper is a new approach, whereby instead of ﬁnding the optimal solution, the oracle is used only to ﬁnd a good enough solution or a certiﬁcate that such a solution does not exist, both ensuring the desired convergence rate of the conditional gradient algorithms. Our contribution can be summarized as follows:

(i) Lazifying approach. We provide a general method to lazify conditional gradient algorithms. For this we replace the linear optimization oracle with a weak separation oracle, which allows us to reuse feasible solutions from previous oracle calls, so that in many cases the oracle call can be skipped. In fact, once a simple representation of the underlying feasible region is learned no further oracle calls are needed. We also demonstrate how parameter-free variants can be obtained.

(ii) Laziﬁed conditional gradient algorithms. We exemplify our approach by providing lazy versions of the vanilla Frank Wolfe algorithm as well as of the conditional gradient methods in Hazan and Kale (2012); Garber and Hazan (2013); Garber and Meshi (2016).

(iii) Weak separation through augmentation. We show in the case of 0/1 polytopes how to implement a weak separation oracle with at most k calls to an augmentation oracle that on input c Rn and x P provides either an improving solution x P with cx < cx or ensures optimality, where k denotes the ℓ1-diameter of P. This is useful when the solution space is sparse.

(iv) Computational experiments. We demonstrate computational superiority by extensive comparisons of the weak separation based versions with their original versions. In all cases we report signiﬁcant speedups in wall-clock time often of several orders of magnitude.

It is important to note that in all cases, we inherit the same requirements, assumptions, and properties of the baseline algorithm that we lazify. This includes applicable function classes, norm requirements, as well as smoothness and (strong) convexity requirements. We also maintain identical convergence rates up to (small) constant factors. A previous version of this work appeared as extended abstract in Braun et al. (2017); this version has been signiﬁcantly revised over the conference version including a representative subset of more extensive computational results, full proofs for all described variants, as well as a variant that uses an augmentation oracle instead of linear optimization oracle (see Section 7).

We brieﬂy recall notation and notions in Section 2 and consider conditional gradient algorithms in Section 3. In Section 4 we consider parameter-free variants of the proposed algorithms, and in Section 5 we examine online versions. Finally, in Section 7 we show a realization of a weak separation oracle with an even weaker oracle in the case of combinatorial problem and we provide extensive computational results in Section 8.

Braun, Pokutta, and Zink

2. Preliminaries

Let be an arbitrary norm on Rn, and let denote the dual norm of . A function f is L-Lipschitz if |f(y) f(x)| L y x for all x, y dom f. A convex function f is smooth with curvature at most C if

f(γy + (1 γ)x) f(x) + γ f(x)(y x) + Cγ2/2

for all x, y dom f and 0 γ 1. A function f is S-strongly convex if

f(y) f(x) f(x)(y x) + S

for all x, y dom f. Unless stated otherwise Lipschitz continuity and strong convexity will be measured in the norm . Moreover, let Br (x) := {y | x y r} be the ball around x with radius r with respect to . . In the following, P will denote the feasible region, a polytope and the vertices of P will be denoted by v1, . . . , v N.

3. Lazy Conditional Gradient

We start with the most basic Frank Wolfe algorithm as a simple example for lazifying by means of a weak separation oracle. We then lazify more complex Frank Wolfe algorithms in Garber and Hazan (2013) and Garber and Meshi (2016). Throughout this section denotes the ℓ2-norm.

3.1. Lazy Conditional Gradient: a basic example

We start with lazifying the original Frank Wolfe algorithm (arguably the simplest Conditional Gradient algorithm), adapting the baseline argument from (Jaggi, 2013, Theorem 1). While the vanilla version has suboptimal convergence rate O(1/T), its simplicity makes it an illustrative example of the main idea of laziﬁcation. The lazy algorithm (Algorithm 3) maintains an upper bound Φt on the convergence rate, guiding its eagerness for progress when searching for an improving vertex vt. If the weak separation oracle provides an improving vertex vt we refer to this as a positive call and if the oracle claims there are no improving vertices we call it a negative call. The step size γt is chosen to (approximately) minimize Φt in Line 2; roughly Φt 1/KC.

Theorem 1 Assume f is convex and smooth with curvature C. Then Algorithm 3 with γt = 2(K2+1) K(t+K2+2) and f(x1) f(x ) Φ0 has convergence rate

f(xt) f(x ) 2 max{C, Φ0}(K2 + 1)

t + K2 + 2 ,

where x is a minimum point of f over P. Proof We prove by induction that

f(xt) f(x ) Φt 1.

Lazifying Conditional Gradient Algorithms

Algorithm 3 Lazy Conditional Gradient Input: smooth convex function f with curvature C, start vertex x1 P, weak linear separation oracle LPsep P , accuracy K 1, step sizes γt, initial upper bound Φ0 Output: points xt in P

1: for t = 1 to T 1 do

2: Φt Φt 1+ Cγ2 t 2 1+ γt

K 3: vt LPsep P ( f(xt), xt, Φt, K)

4: if vt = false then

5: xt+1 xt 6: else

7: xt+1 (1 γt)xt + γtvt 8: end if

The claim is clear for t = 1 by the choice of Φ0. Assuming the claim is true for t, we prove it for t + 1. We distinguish two cases depending on the return value of the weak separation oracle in Line 3. In case of a positive call, i.e., when the oracle returns an improving solution vt, then f(xt)(xt vt) Φt/K, which is used in the second inequality below. The ﬁrst inequality follows by smoothness of f, and the second inequality by the induction hypothesis and the fact that vt is an improving solution:

f(xt+1) f(x ) f(xt) f(x ) | {z } Φt 1

+γt f(xt)(vt xt) | {z } Φt/K

K + Cγ2 t 2 = Φt,

In case of a negative call, i.e., when the oracle returns no improving solution, then in particular f(xt)(xt x ) Φt, hence by Line 5

f(xt+1) f(x ) = f(xt) f(x ) f(xt)(xt x ) Φt.

Finally, using the speciﬁc values of γt we prove the upper bound

Φt 1 2 max{C, Φ0}(K2 + 1)

by induction on t. The claim is obvious for t = 1. The induction step is an easy computation relying on the deﬁnition of Φt on Line 2:

Φt = Φt 1 + Cγ2 t 2 1 + γt

2 max{C,Φ0}(K2+1)

t+K2+2 + max{C,Φ0}γ2 t 2 1 + γt

K 2 max{C, Φ0}(K2 + 1)

t + K2 + 3 .

Here the last inequality follows from the concrete value of γt.

Braun, Pokutta, and Zink

Note that by design, the algorithm converges at the worst-case rate that we postulate due to the negative calls when it does not move. Clearly, this is highly undesirable, therefore the algorithm should be understood as the textbook variant of lazy conditional gradient. We will present an improved, parameter-free variant of Algorithm 3 in Section 4 that converges at the best possible rate that the non-lazy variant would achieve (up to a small constant factor).

3.2. Lazy Pairwise Conditional Gradient

In this section we provide a lazy variant (Algorithm 4) of the Pairwise Conditional Gradient algorithm from Garber and Meshi (2016), using separation instead of linear optimization. We make identical assumptions: the feasible region is a 0/1 polytope, i.e., all vertices of P have only 0/1 entries, and moreover it is given in the form P = {x Rn | 0 x 1, Ax = b}, where 1 denotes the all-one vector.

Algorithm 4 Lazy Pairwise Conditional Gradient (LPCG) Input: polytope P, smooth and S-strongly convex function f with curvature C, accuracy K 1, non-increasing step-sizes ηt, eagerness t Output: points xt

1: x1 P arbitrary and Φ0 f(x1) f(x )

2: for t = 1, . . . , T do

3: e f(xt)i :=

( f(xt)i if xt,i > 0 if xt,i = 0

4: Φt 2Φt 1+η2 t C 2+ ηt K t

5: ct f(xt), e f(xt)

6: (v+ t , v t ) LPsep P P ct, (xt, xt), Φt

7: if (v+ t , v t ) = false then

8: xt+1 xt 9: else

10: ηt max{2 δ | δ Z 0, 2 δ ηt}

11: xt+1 xt + ηt(v+ t v t )

13: end for

Observe that Algorithm 4 calls the linear separation oracle LPsep on the cartesian product of P with itself. Choosing the objective function as in Line 5 allows us to simultaneously ﬁnd an improving direction and an away-step direction. Let card x denote the number of non-zero entries of the vector x.

Theorem 2 Let x be a minimum point of f in P, and Φ0 an upper bound of f(x1) f(x ).

Furthermore, let card(x ) α, M1 := q

S 8α, κ := min M1

KC , 1/ Φ0 , ηt := κ Φt 1 and

Lazifying Conditional Gradient Algorithms

S , then Algorithm 4 has convergence rate

f(xt+1) f(x ) Φt Φ0

where B := κ M1

We recall a technical lemma for the proof.

Lemma 3 ((Garber and Meshi, 2016, Lemma 2)) Let x, y P. Then x is a liner combination x = Pk i=1 λivi of some vertices vi of P (in particular, Pk i=1 λi = 1) with x y = Pk i=1 γi(vi z) for some 0 γi λi and z P such that Pk i=1 γi p

card(y) x y .

Proof [Proof of Theorem 2] The feasibility of the iterates xt is ensured by Line 10 and the monotonicity of the sequence {ηt}t 1 with the same argument as in (Garber and Meshi, 2016, Lemma 1 and Observation 2). We ﬁrst show by induction that

f(xt+1) f(x ) Φt.

For t = 0 we have Φ0 f(x1) f(x ). Now assume the statement for some t 0. In case of a negative call (Line 8), we use the guarantee of Oracle 1 to get

ct[(xt, xt) (z1, z2)] Φt

for all z1, z2 P, which is equivalent to (as ct(xt, xt) = 0)

e f(xt)z2 f(xt)z1 Φt

and therefore f(xt)( z2 z1) Φt

for all z2, z1 P with supp( z2) supp(xt), where supp(x) denotes the set of non-zero coordinates of x. We use Lemma 3 for the decompositions xt = Pk i=1 λivi and xt x = Pk i=1 γi(vi z) with 0 γi λi, z P and

card(x ) xt x

2 card(x )Φt 1

using the induction hypothesis and strong convexity in the second inequality. Then

f(xt+1) f(x ) = f(xt) f(x ) f(xt)(xt x ) =

i=1 γi f(xt)(vi z) | {z } Φt/ t

where we used Equation (2) for the last inequality.

Braun, Pokutta, and Zink

In case of a positive call (Lines 10 and 11) we get, using ﬁrst smoothness of f, then ηt/2 < ηt ηt and f(xt)(v+ t v t ) Φt/(K t), and ﬁnally the deﬁnition of Φt:

f(xt+1) f(x ) = f(xt) f(x ) + f(xt + ηt(v+ t v t )) f(xt)

Φt 1 + ηt f(xt)(v+ t v t ) + η2 t C

2 Φt K t + η2 t C

Plugging in the values of ηt and t to the deﬁnition of Φt gives the desired bound.

Φt = 2Φt 1 + η2 t C 2 + ηt K t = Φt 1 1 + κ2C/2 1 + κM1/K Φt 1 1 + B 1 + 2B Φ0

3.3. Lazy Local Conditional Gradient

In this section we provide a lazy version (Algorithm 5) of the conditional gradient algorithm from Garber and Hazan (2013). Let P Rn be any polytope, D denote an upper bound on the ℓ2-diameter of P, and µ 1 be an aﬃne invariant parameter of P satisfying Lemma 4 below, see (Garber and Hazan, 2013, Section 2) for a possible deﬁnition. As the algorithm is not aﬃne invariant by nature, we need a non-invariant version of smoothness: Recall that a convex function f is β-smooth if

f(y) f(x) f(x)(y x) + β y x 2/2.

Algorithm 5 Lazy Local Conditional Gradient (LLCG) Input: feasible polytope P, β-smooth and S-strongly convex function f, parameters K, S, β, µ; diameter D Output: points xt

1: x1 P arbitrary and Φ0 f(x1) f(x )

2: α S 2Kβnµ2 3: for t = 1, . . . , T do

5: Φt Φt 1+ β

2 α2 min{nµ2r2 t ,D2} 1+α/K 6: pt LLPsep P ( f(xt), xt, rt, Φt, K)

7: if pt = false then

8: xt+1 xt 9: else

10: xt+1 xt + α(pt xt)

12: end for

Lazifying Conditional Gradient Algorithms

Algorithm 6 Weak Local Separation LLPsep P (c, x, r, Φ, K) Input: polytope P together with invariant µ, linear objective c Rn, point x P, radius r > 0, objective value Φ > 0, accuracy K 1 Output: Either (1) y P with x y nµr and c(x y) > Φ/K, or (2) false: c(x z) Φ for all z P Br (x).

1: min n nµ

2: Decompose x: x = PM j=1 λjvj, λj > 0, P j λj = 1.

3: Sort vertices: i1, . . . , i M cvi1 cvi M .

4: k min{k : Pk j=1 λij }

5: p Pk 1 j=1 λijvij + Pk 1 j=1 λij vik

6: v LPsep P c, p

7: if v = false then

8: return false

10: return y x p + v

As an intermediary step, we ﬁrst implement a local weak separation oracle in Algorithm 6, a local version of Oracle 1, which ﬁnds improving points only in a small neighborhood of the original point, analogously to the local linear optimization oracle in Garber and Hazan (2013). To this end, we recall a technical lemma from Garber and Hazan (2013).

Lemma 4 (Garber and Hazan, 2013, Lemma 7) Let P Rn be a polytope and v1, . . . , v N be its vertices. Let x, y P and x = PN i=1 λivi a convex combination of the vertices of P. Then there are numbers 0 γi λi and z P satisfying

i [N] γi(z vi)

i [N] γi nµ

Now we prove the correctness of the weak local separation algorithm.

Lemma 5 Algorithm 6 is correct. In particular LLPsep P (c, x, r, Φ, K)

(i) returns either an y P with x y nµr and c(x y) Φ/K,

(ii) or returns false, and then c(x z) Φ for all z P Br (x).

Proof We ﬁrst consider the case when the algorithm exits in Line 10. Observe that y P since y is a convex combination of vertices of by construction: y = PM j=1(λij γj)vij + v

with = PM j=1 γj nµ

D r, where γj = λij for j < k, and γk = Pk 1 j=1 λij, and γj = 0 for j > k. Therefore

j=1 γj(vij v )

j=1 γj vij v nµr.

Braun, Pokutta, and Zink

Finally using the guarantee of LPsep P we get

c(x y) = c p

If the algorithm exits in Line 8, we use Lemma 4 to decompose any y P Br (x):

i=1 γi(vi z),

with z P and PM i=1 γi nµ

D x y . Since PM i=1 λi = 1 , there are numbers γi η i λi with PM i=1 η i = . Let

i=1 η i vi,

p+ := y x + p =

i=1 (η i γi)vi +

so that p+/ P. To bound the function value we ﬁrst observe that the choice of p in the algorithm assures that cu cp for all u = PM i=1 ηivi with PM i=1 ηi = and all 0 ηi λi. In particular, c p cp . The function value of the positive part p+ can be bounded with the guarantee of LPsep P :

i.e., c(p p+) Φ. Finally combining these bounds gives

c(x y) = c ( p p+) c(p p+) Φ

as desired.

We are ready to examine the Conditional Gradient Algorithm based on LLPsep P :

Theorem 6 Algorithm 5 converges with the following rate:

f(xt+1) f(x ) Φt Φ0

Proof The proof is similar to the proof of Theorem 2. We prove this rate by induction. For t = 0 the choice of Φ0 guarantees that f(x1) f(x ) Φ0. Now assume the theorem holds for t 0. With strong convexity and the induction hypothesis we get

S (f(xt) f(x )) 2

S Φt 1 = r2 t ,

i.e., x P Brt (xt). In case of a negative call, i.e., when pt = false, then case (ii) of Lemma 5 applies:

f(xt+1) f(x ) = f(xt) f(x ) f(xt)(xt x ) Φt.

Lazifying Conditional Gradient Algorithms

In case of a positive call, i.e., when Line 10 is executed, we get the same inequality via:

f(xt+1) f(x ) Φt 1 + α f(xt)(pt xt) + β

2 α2 xt pt 2

2 α2 min{nµ2r2 t , D2}

Therefore using the deﬁnition of α and rt we get the desired bound:

Φt Φt 1 + β

2 α2r2 t nµ2

1 + α/K = Φt 1

4. Parameter-free Conditional Gradient via Weak Separation

In this section we provide a parameter-free variant of the Lazy Frank Wolfe Algorithm, which is inspired by Pokutta (2017) and which exhibits a very favorable behavior in computations; the same technique applies to all other variants from Section 3 as well. The idea is that instead of using predetermined values for progress parameters, like Φt and γt in Algorithm 3, to optimize worst-case progress, the parameters are adjusted adaptively, using data encountered by the algorithm, and avoiding hard-to-estimate parameters, like the curvature C. In practice, this leads to faster convergence, as usual for adaptive methods, while the theoretical convergence rate is worse only by a small constant factor. See Figure 14 for a comparison and Section 8.1.1 for more experimental results. The occasional halving of the Φt is reminiscent of an adaptive restart strategy, considering successive iterates with the same Φt as an epoch. It ensures that Φt is always at least half of the primal gap, while quickly reducing it if it is too large for the algorithm to make progress, and as such it represents a reasonable amount of expected progress throughout the whole run of the algorithm, not just at the initial iterates.

Remark 7 (Additional LP call for initial bound) Note that Algorithm 7 ﬁnds a tight initial bound Φ0 with a single extra LP call. If this is undesired, this can be also done approximately as long as Φ0 is a valid upper bound, for example by means of binary search via the weak separation oracle.

Theorem 8 Let f be a smooth convex function with curvature C. Algorithm 7 converges at a rate proportional to 1/t. In particular to achieve a bound f(xt) f(x ) ε, given an initial upper bound f(x1) f(x ) 2Φ0, the number of required steps is upper bounded by

+ 1 + 4K log Φ0

Proof The main idea of the proof is to maintain an approximate upper bound on the optimality gap. We then show that negative calls halve the upper bound 2Φt and positive oracle calls make signiﬁcant objective function improvement.

Braun, Pokutta, and Zink

Algorithm 7 Parameter-free Lazy Conditional Gradient (LCG) Input: smooth convex function f, start vertex x1 P, weak linear separation oracle LPsep P , accuracy K 1 Output: points xt in P

1: Φ0 maxx P f(x1)(x1 x)/2 {Initial bound}

2: for t = 1 to T 1 do

3: vt LPsep P ( f(xt), xt, Φt 1, K)

4: if vt = false then

5: xt+1 xt 6: Φt Φt 1

2 {Update Φ}

8: γt argmin0 γ 1 f((1 γ)xt + γvt) {Line search}

9: xt+1 (1 γt)xt + γtvt {Update iterate}

10: Φt Φt 1 11: end if

12: end for

We analyze iteration t of the algorithm. If Oracle 1 in Line 3 returns a negative answer (i.e., false, case (2)), then this guarantees f(xt)(xt x) Φt 1 for all x P, in particular, using convexity, f(xt+1) f(x ) = f(xt) f(x ) f(xt)(xt x ) Φt 1 = 2Φt. If Oracle 1 returns a positive answer (case (1)), then we have f(xt) f(xt+1) γtΦt 1/K (C/2)γ2 t by smoothness of f. By minimality of γt, therefore f(xt) f(xt+1) min0 γ 1(γΦt 1/K (C/2)γ2), which is Φ2 t 1/(2CK2) if Φt 1 < KC, and Φt 1/K C/2 C

2 if Φt 1 KC. Now we bound the number t of consecutive positive oracle calls immediately following an iteration t with a negative oracle call. Note that the same argument bounds the number of initial consecutive positive oracle calls with the choice t = 0, as we only use f(xt+1) f(x ) 2Φt below. Note that Φt = Φt+1 = = Φt+t . Therefore

2Φt f(xt+1) f(x )

τ=t+1 (f(xτ) f(xτ+1))

( t Φ2 t 2CK2 if Φt < KC

2 if Φt KC ,

which gives in the case Φt < KC that t 4CK2/Φt, and in the case Φt KC that

2 = 4KΦt 2Φt KC 4KΦt 2Φt Φt = 4K.

Thus iteration t is followed by at most 4K consecutive positive oracle calls as long as Φt KC, and 4CK2/Φt < 2ℓ+1 4K ones for 2 ℓ 1KC < Φt 2 ℓKC with ℓ 0. Adding up the number of oracle calls gives the desired rate: in addition to the positive oracle calls we also have at most log(Φ0/ε) + 1 negative oracle calls, where log( ) is the binary logarithm and ε is the (additive) accuracy. Thus after a total of

+1+4K log Φ0

log(KC/ε) X

ℓ=0 2ℓ+1 4K log Φ0

+1+4K log Φ0

Lazifying Conditional Gradient Algorithms

iterations (or equivalently oracle calls) we have f(xt) f(x ) ε.

As seen from the proof, the algorithm receives few negative oracle calls by design; these are usually more expensive than positive ones as the oracle has to compute a certiﬁcate by, e.g., executing a full linear optimization oracle call.

Corollary 9 Algorithm 7 receives at most log Φ0/ε + 1 negative oracle answers.

Remark 10 (Improved use of Linear Optimization oracle) A possible improvement to Line 6 is Φt maxx P f(xt)(xt x)/2, assuming that at a negative call the oracle also provides the dual gap maxx P f(xt)(xt x) as well as the minimizer x P of the oracle call. This is the case e.g., when the weak separation oracle is implemented as in Algorithm 2. Clearly, the minimizer x can be also used to perform a progress step; albeit without guarantee w.r.t. to Φt.

Remark 11 (Line Search) If line search is too expensive we can choose γt = min{1, Φt/KC} in Algorithm 7. In this case an estimate of the curvature C is required.

5. Lazy Online Conditional Gradient

In this section we lazify the online conditional gradient algorithm of Hazan and Kale (2012) over arbitrary polytopes P = {x Rn | Ax b}, resulting in Algorithm 8. We slightly improve constant factors by replacing (Hazan and Kale, 2012, Lemma 3.1) with a better estimation via solving a quadratic inequality arising from strong convexity. In this section the norm can be arbitrary.

Theorem 12 Let 0 b, s < 1. Let K 1 be an accuracy parameter. Assume ft is L-Lipschitz, and smooth with curvature at most Ct b. Let D := maxy1,y2 P y1 y2 denote the diameter of P in norm . Then the following hold for the points xt computed by Algorithm 8 where x T is the minimizer of PT t=1 ft:

(i) With the choice γt = t (1 b)/2,

the xt satisfy 1 T

t=1 (ft(x T ) ft(x T )) AT (1 b)/2,

where A := CK 2(1 b) + L(K + 1)D.

(ii) Moreover, if all the ft are St s-strongly convex, then with the choice

γt = t(b+s 2)/3,

Braun, Pokutta, and Zink

Algorithm 8 Lazy Online Conditional Gradient (LOCG) Input: functions ft, start vertex x1 P, weak linear separation oracle LPsep P , parameters K, C, b, S, s; diameter D Output: points xt

1: for t = 1 to T 1 do

2: t ft(xt)

3: if t = 1 then

4: h1 min{ 1 D, 2 1 2 /S}

6: ht Φt 1 + min t D, t 2

St1 s + 2 r

2St1 s + Φt 1

8: Φt ht+ Ct1 bγ2 t 2(1 b) 1+ γt

K 9: vt LPsep P (Pt i=1 fi(xt), xt, Φt, K)

10: if vt = false then

11: xt+1 xt 12: else

13: xt+1 (1 γt)xt + γtvt 14: Φt ht Pt i=1 fi(xt) + Pt i=1 fi(xt+1)

16: end for

the xt satisfy 1 T

t=1 (ft(x T ) ft(x T )) AT (2(1+b) s)/3, (3)

A := 2 (K + 1)(K + 2)L2

S + CK 2(1 b)

Proof We prove only Claim (ii), as the proof of Claim (i) is similar and simpler. Let FT := PT t=1 ft. Furthermore, let h T := AT 1 (2(1+b) s)/3 be T times the right-hand side of Equation (3). In particular, FT is ST -strongly convex, and smooth with curvature at most CFT where

CFT := CT 1 b

t=1 t b, ST := ST 1 s S

We prove Ft(xt) Ft(x t ) ht ht by induction on t. The case t = 1 is clear. Let Φt denote the value of Φt in Line 8, while we reserve Φt to denote its value as used in Line 6. We start by showing Ft(xt+1) Ft(x t ) Φt Φt. We distinguish two cases depending on the oracle answer vt from Line 9. For a negative oracle answer (vt = false), we have Φt = Φt and the weak separation oracle asserts maxy P Ft(xt)(xt y) Φt, which combined with the convexity of Ft provides

Ft(xt+1) Ft(x t ) = Ft(xt) Ft(x t ) Ft(xt)(xt xt ) Φt = Φt.

Lazifying Conditional Gradient Algorithms

Otherwise, for a positive oracle answer, Line 14 and the induction hypothesis provides Ft(xt+1) Ft(x t ) ht +Ft(xt+1) Ft(xt) = Φt. To prove Φt Φt, we apply the smoothness of Ft followed by the inequality provided by the choice of vt:

Ft(xt+1) Ft(xt) CFtγ2 t 2 Ft(xt)(xt+1 xt) = γt Ft(xt)(vt xt) γtΦt

Rearranging provides the inequality:

Φt = ht + Ft(xt+1) Ft(xt) ht γtΦt

K + CFtγ2 t 2 = Φt.

For later use, we bound the diﬀerence between ht and Φt using the value of parameters, ht ht, and γt 1:

ht Φt ht ht + CFtγ2 t 2 1 + γt

K CFtγ2 t 2 1 + γt

K CFtγ2 t 2 1 + 1

K = A CK 2(1 b) K + 1 t[2s (1+b)]/3.

We now apply Ft(xt+1) Ft(x t ) Φt, together with convexity of ft+1, and the minimality Ft(x t ) Ft(x t+1) of x t , followed by strong convexity of Ft+1:

Ft+1(xt+1) Ft+1(x t+1) (Ft(xt+1) Ft(x t )) + (ft+1(xt+1) ft+1(x t+1))

Φt + t+1 xt+1 x t+1

2 St+1 (Ft+1(xt+1) Ft+1(x t+1)).

Solving the quadratic inequality provides

Ft+1(xt+1) Ft+1(x t+1) Φt + t+1 2

v u u t t+1 2

From Equation (4), ignoring the last line, we also obtain Ft+1(xt+1) Ft+1(x t+1) Φt + t+1 D via the estimate xt+1 x t+1 D. Thus Ft+1(xt+1) Ft+1(x t+1) ht+1, by Line 6, as claimed. Now we estimate the right-hand side of Equation (5) by using the actual value of the parameters, the estimate t+1 L, and the inequality s + b 2. In fact, we estimate a proxy for the right-hand side. Note that A was chosen to satisfy the second inequality:

2St1 s ht L2

S t[2s (1+b)]/3 + 2

t[2s (1+b)]/3 A CK 2(1 b) K + 1 t[2s (1+b)]/3

ht Φt ht Φt.

Braun, Pokutta, and Zink

In particular, L2 2St+1 + Φt ht hence combining with Equation (5) we obtain

ht+1 Φt + L2

5.1. Stochastic and Adversarial Versions

Complementing the oﬄine algorithms from Section 3, we will now derive various online versions. The presented cases here are similar to those in Hazan and Kale (2012) and thus we state them without proof. For stochastic cost functions ft, we obtain bounds from Theorem 12 (i) similar to (Hazan and Kale, 2012, Theorems 4.1 and 4.3) (with δ replaced by δ/T in the bound to correct an inaccuracy in the original argument). The proof is analogous and hence omitted, but note that y1 y2 2 p

y1 y2 1 y1 y2

k for all y1, y2 P.

Corollary 13 Let ft be convex functions sampled i.i.d. with expectation E [ft] = f , and δ > 0. Assume that the ft are L-Lipschitz in the 2-norm.

(i) If all the ft are smooth with curvature at most C, then Algorithm 8 applied to the ft (with b = 0) yields with probability 1 δ

t=1 f (xt) min x P

t=1 f (x) O C

n T log(n T 2/δ) log T .

(ii) Without any smoothness assumption, Algorithm 8 (applied to smoothenings of the ft) provides with probability 1 δ

t=1 f (xt) min x P

t=1 f (x) O n Lk T 2/3 + Lk p

n T log(n T 2/δ) log T .

Similar to (Hazan and Kale, 2012, Theorem 4.4), from Theorem 12 (ii) we obtain the following regret bound for adversarial cost functions with an analogous proof.

Corollary 14 For any L-Lipschitz convex cost functions ft, Algorithm 8 applied to the functions ft(x) := ft(xt)x + 2L

kt 1/4 x x1 2 2 (with b = s = 1/4, C = L

k, and Lipschitz constant 3L) achieving regret

t=1 ft(xt) min x P

t=1 ft(x) O(L

with at most T calls to the weak separation oracle.

Lazifying Conditional Gradient Algorithms

Note that the gradient of the ft are easily computed via the formula ft(x) = ft(xt) + 4Lt 1/4(x x1)/

k, particularly because the gradient of the ft need not be recomputed, so that we obtain a weak separation-based stochastic gradient descent algorithm, where we only have access to the ft through a stochastic gradient oracle, while retaining all the favorable properties of the Frank Wolfe algorithm with a convergence rate O(T 1/4) (c.f., Garber and Hazan (2013)).

6. Non-polytopal domains

So far we have formulated our results for the polytopal case as most of the base methods that we lazify are usually formulated for polytopal domains. However, whenever a base method extends to general compact convex sets as domains then so does our laziﬁcation of the base method. In fact it is not even required to use vertices or extremal points as answers to either the LP oracle or weak-separation oracle calls; this is usually only required to obtain iterates as convex combinations of extremal points but it is not necessary for convergence. Base methods that extend to general compact convex sets in particular include the vanilla Frank Wolfe algorithm, the Away-Step Frank Wolfe algorithm, and the Pairwise Frank Wolfe algorithm. Note, however that for the Away-Step Frank Wolfe algorithm and the Pairwise Frank Wolfe algorithm (irrespective of laziﬁcation) it is not known whether a linear rate of convergence can be achieved for strongly convex functions over general compact convex sets. Base variants that do not readily apply to the non-polyhedral case are the variants in Garber and Meshi (2016) (see Section 3.2) and Garber and Hazan (2013) (see Section 3.3). We further would like to mention, that often (but not always) for non-polyhedral domains the linear optimization oracle might be more expensive, so that laziﬁcation might oﬀer attractive beneﬁts in this case as the eﬀect of caching and early termination might be even more pronounced. We present computational tests for non-polyedral domains for matrix completion, where the feasible domain is given via X R, where is the (non-polyhedral) nuclear norm (see Section 8.1).

7. Weak Separation through Augmentation

So far we realized the weak separation oracle via lazy optimization. We will now create a (weak) separation oracle for integral polytopes, employing an even weaker, so-called augmentation oracle, which only provides an improving solution but provides no guarantee with respect to optimality. We call this approach lazy augmentation. This is especially useful when a fast augmentation oracle is available or the vertices of the underlying polytope P are particularly sparse, i.e., y1 y2 1 k n for all y1, y2 P, where n is the ambient dimension of P. As before theoretical convergence rates are maintained. For simplicity of exposition we restrict to 0/1 polytopes P here. For general integral polytopes, one considers a so-called directed augmentation oracle, which can be similarly linearized after splitting variables in positive and negative parts; we refer the interested reader to see Schulz and Weismantel (2002); Bodic et al. (2015) for an in-depth discussion.

Braun, Pokutta, and Zink

Let k denote the ℓ1-diameter of P. Upon presentation with a 0/1 solution x and a linear objective c Rn, an augmentation oracle either provides an improving 0/1 solution x with c x < cx or asserts optimality for c:

Oracle 2 Linear Augmentation Oracle AUGP (c, x) Input: linear objective c Rn, vertex x P Output: vertex x P with c x < cx when exists, otherwise x = x

Such an oracle is signiﬁcantly weaker than a linear optimization oracle but also significantly easier to implement and much faster; we refer the interested reader to Gr otschel and Lov asz (1993); Schulz et al. (1995); Schulz and Weismantel (2002) for an extensive list of examples. While augmentation and optimization are polynomially equivalent (even for convex integer programming Oertel et al. (2014)) the current best linear optimization algorithms based on an augmentation oracle are slow for general objectives. While optimizing an integral objective c Rn needs O(k log c ) calls to an augmentation oracle (see Schulz et al. (1995); Schulz and Weismantel (2002); Bodic et al. (2015)), a general objective function, such as the gradient in Frank Wolfe algorithms has only an O(kn3) guarantee in terms of required oracle calls (e.g., via simultaneous diophantine approximations Frank and Tardos (1987)), which is not desirable for large n. In contrast, here we use an augmentation oracle to perform separation, without ﬁnding the optimal solution. Allowing a multiplicative error K > 1, we realize an augmentation-based weak separation oracle (see Algorithm 9), which decides given a linear objective function c Rn, an objective value Φ > 0, and a starting point x P, whether there is a y P with c(x y) > Φ/K or c(x y) Φ for all y P. In the former case, it actually provides a certifying y P, i.e., with c(x y) > Φ/K. Note that a constant accuracy K requires a linear number of oracle calls in the diameter k, e.g., K = (1 1/e) 1 1.582 needs at most N k oracle calls, which can be much smaller than the ambient dimension of the polytope. At the beginning, in Line 2, the algorithm has to replace the input point x with an integral point x0. If the point x is given as a convex combination of integral points, then a possible solution is to evaluate the objective c on these integral points, and choose x0 the ﬁrst one with cx0 cx. This can be easily arranged for Frank Wolfe algorithms as they maintain convex combinations.

Proposition 15 Assume y1 y2 1 k for all y1, y2 P. Then Algorithm 9 is correct, i.e., it outputs either (1) y P with c(x y) > Φ/K, or (2) false. In the latter case c(x y) Φ for all y P holds. The algorithm calls AUGP at most N log(1 1/K)/ log(1 1/k) many times. Proof First note that (1 2x)v + x 1 = v x 1 for x, v {0, 1}n, hence Line 7 is equivalent to xi AUGP (c + Φ c(x xi 1)

k xi 1 1, xi 1). The algorithm obviously calls the oracle at most N times by design, and always returns a value, so we need to verify only the correctness of the returned value. We distinguish cases according to the output.

Lazifying Conditional Gradient Algorithms

Algorithm 9 Augmenting Weak Separation LPsep P (c, x, Φ, K) Input: linear objective c Rn, point x P, objective value Φ > 0; accuracy K > 1 Output: Either (1) y P vertex with c(x y) > Φ/K, or (2) false: c(x z) Φ for all z P.

1: N log(1 1/K)/ log(1 1/k)

2: Choose x0 P vertex with cx0 cx.

3: for i = 1 to N do

4: if c(x xi 1) Φ then

5: return xi 1 6: end if

7: xi AUGP (c + Φ c(x xi 1)

k (1 2xi 1), xi 1)

8: if xi = xi 1 then

9: return false

11: end for

12: return x N

Clearly, Line 5 always returns an xi 1 with c(x xi 1) Φ > [1 (1 1/k)N]Φ. When Line 9 is executed, the augmentation oracle just returned xi = xi 1, i.e., for all y P

cxi 1 cy + Φ c(x xi 1)

k y xi 1 1 cy + Φ c(x xi 1)

k k = c(y x)+cxi 1 +Φ,

so that c(x y) Φ, as claimed. Finally, when Line 12 is executed, the augmentation oracle has found an improving vertex xi at every iteration, i.e.,

cxi 1 > cxi + Φ c(x xi 1)

k xi xi 1 1 cxi + Φ c(x xi 1)

using xi xi 1 1 1 by integrality. Rearranging provides the convenient form

Φ c(x xi) < 1 1

[Φ c(x xi 1)],

which by an easy induction provides

Φ c(x x N) < 1 1

N [Φ c(x x0)] 1 1

i.e., c(x x N) Φ

K , ﬁnishing the proof.

8. Experiments

We implemented and compared the parameter-free variant of LCG (Algorithm 7) to the standard Frank Wolfe algorithm (CG), then Algorithm 4 (LPCG) to the Pairwise Conditional

Braun, Pokutta, and Zink

Gradient algorithm (PCG) of Garber and Meshi (2016), as well as Algorithm 8 (LOCG) to the Online Frank Wolfe algorithm (OCG) of Hazan and Kale (2012). While we did implement the Local Conditional Gradient algorithm of Garber and Hazan (2013) as well, the very large constants in the original algorithms made it impractical to run. Unless stated otherwise the weak separation oracle is implemented as sketched in Algorithm 2 through caching and early termination of the original LP oracle. We have used K = 1.1 and K = 1 as multiplicative factors for the weak separation oracle; for the impact of the choice of K see Section 8.2.2. For the baseline algorithms we use inexact variants, i.e., we solve linear optimization problems only approximately. This is a signiﬁcant speedup in favor of non-lazy algorithms at the (potential) cost of accuracy, while neutral to lazy optimization as it solves an even more relaxed problem anyways. To put things in perspective, the non-lazy baselines could not complete even a single iteration for a signiﬁcant fraction of the considered problems in the given time frame if we were to exactly solve the linear optimization problems. In terms of using line search, for all tests we treated all algorithms equally: either all or none used line search. If not stated otherwise, we used (simple backtracking) line search. The linear optimization oracle over P P for LPCG was implemented by calling the respective oracle over P twice: once for either component. Contrary to the non-lazy version, the lazy algorithms depend on the initial upper bound Φ0. For the instances that need a very long time to solve the (approximate) linear optimization even once, we used a binary search for Φ0 for the lazy algorithms: starting from a conservative initial value, using the update rule Φ0 Φ0/2 until the separation oracle returns an improvement for the ﬁrst time and then we start the algorithm with 2Φ0, which is an upper bound on the Wolfe gap and hence also on the primal gap. This initial phase is also included in the reported wall-clock time. Alternatively, if the linear optimization was less time consuming we used a single (approximate) linear optimization at the start to obtain an initial bound on Φ0 (see e.g., Section 4). In some cases, especially when the underlying feasible region has a high dimension and the (approximate) linear optimization can be solved relatively fast compared to the cost of computing an inner product, we observed that the costs of maintaining the cache was very high. In these cases we reduced the cache size every 100 steps by keeping only the 100 points that were used the most so far. Both the number of steps and the approximate size of the cache were chosen arbitrarily, however 100 for both worked very well for all our examples. Of course there are many diﬀerent strategies for maintaining the cache, which could be used here and which could lead to further improvements in performance. The stopping criteria for each of the experiments was a given wall clock time limit in seconds. The time limit was enforced separately for the main code and the oracle code, so in some cases the actual time used can be larger, when the last oracle call started before the time limit was reached and took longer than the time left. We implemented all algorithms in Python 2.7 with critical functions cythonized for performance employing Numpy. We used these packages from the Anaconda 4.2.0 distribution as well as Gurobi 7.0 Gurobi Optimization (2016) as a black box solver for the linear optimization oracle. The weak separation oracle was implemented via a callback function to stop linear optimization as soon as a good enough feasible solution has been found in a schema as outlined in Algorithm 2. The parameters for Gurobi were kept at their

Lazifying Conditional Gradient Algorithms

default settings except for enforcing the time limit of the tests and setting the acceptable duality gap to 10%, allowing Gurobi to terminate the linear optimization early avoiding the expensive proof of optimality. This is used to realize the inexact versions of the baseline algorithms. All experiments were performed on a 16-core machine with Intel Xeon E5-2630 v3 @ 2.40GHz CPUs and 128GB of main memory. While our code does not explicitly use multiple threads, both Gurobi and the numerical libraries use multiple threads internally.

8.1. Computational results

We performed computational tests on a large variety of diﬀerent problems that are instances of the three machine learning tasks video colocalization, matrix completion, and structured regression.

Video colocalization. Video colocalization is the problem of identifying objects in a sequence of multiple frames in a video. In Joulin et al. (2014) it is shown that video colocalization can be reduced to optimizing a quadratic objective function over a ﬂow or a path polytope, which is the problem we are going to solve. The resulting linear program is an instance of the minimum-cost network ﬂow problem, see (Joulin et al., 2014, Eq. (3)) for the concrete linear program and more details. The quadratic functions are of the form Ax b 2 where we choose the non-zero entries in A according to a density parameter at random and then each of these entries to be [0, 1]-uniformly distributed, while b is chosen as a linear combination of the columns of A with random multipliers from [0, 1]. For some of the instances we also use x b 2 as the objective function with bi [0, 1] uniformly at random.

Matrix completion. The formulation of the matrix completion problem we are going to use is the following:

(i,j) Ω |Xi,j Ai,j|2 s.t. X R, (6)

where denotes the nuclear norm, i.e., A = Tr(

At A). The set Ω, the matrix A and R are given parameters. Similarly to Lan and Zhou (2014) we generate the m n matrix A as the product of AL of size m r and AR of size r n. The entries in AL and AR are chosen from a standard Gaussian distribution. The set Ωis chosen uniformly of size s = min{5r(m+n r), 0.99mn }. The linear optimization oracle is implemented in this case by a singular value decomposition of the linear objective function and we essentially solve the LP to (approximate) optimality. The matrix completion tests will only demonstrate the impact of caching solutions. Note that this test is also informative as due to the roundness of the feasible region the solution of the actual LP oracle will induce a direction that is equal to the true gradient and as such it provides insight into how much per-iteration progress is lost due to working with gradient approximations from the weak separation oracle.

Structured regression. The structured regression problem consists of solving a quadratic function of the form Ax b 2 over some structured feasible set or a polytope P, i.e., we solve minx P Ax b 2. We construct the objective functions in the same way as for the video colocalization problem.

Braun, Pokutta, and Zink

Tests. In the following two sections we will present our results for various problems grouped by the versions of the considered algorithms. Every ﬁgure contains two columns, each containing one experiment. We use diﬀerent measures to report performance: we report progress of loss or function value in wall-clock time in the ﬁrst row (including time spent by the oracle), in the number of iterations in the second row, and in the number of linear optimization calls in the last row. Obviously, the latter only makes sense for the lazy algorithms. In some other cases we report in another row the dual bound or Wolfe gap in wall-clock time. The red line denotes the non-lazy algorithm and the green line denotes the lazy variants. For each experiment we also report the cache hit rate, which is the number of oracle calls answered with a point from the cache over all oracle calls given in percent. While we found convergence rates in the number of iterations quite similar (as expected!), we consistently observe a signiﬁcant speedup in wall-clock time. In particular for many large-scale or hard combinatorial problems, lazy algorithms performed several thousand iterations whereas the non-lazy versions completed only a handful of iterations due to the large time spent approximately solving the linear optimization problem. The observed cache hit rate was at least 90% in most cases, and often even above 99%. Compared to the non-lazy variants, the lazy variants might use weaker descent directions (due to employing the weak-separation oracle instead of the LP oracle), and hence one expects that the lazy algorithms in general require more iterations despite being faster in wall-clock time. However, the lazy and non-lazy algorithms usually generate diﬀerent sequences of iterates, and hence the lazy algorithms may converge faster even in the number of iterations by chance; this is even expected in a small number of cases by the law of large numbers. This happens for example in Figures 3 and 14.

8.1.1. Offline Results

We describe the considered instances in the oﬄine case separately for the vanilla Frank Wolfe method and the Pairwise Conditional Gradient method.

Vanilla Frank Wolfe Method We tested the vanilla Frank Wolfe algorithm on the six video colocalization instances with underlying path polytopes from http://lime.cs.elte. hu/~kpeter/data/mcf/netgen/ (Figure 1). In these instances we additionally report the dual bound or Wolfe gap in wall clock time. We further tested the vanilla Frank Wolfe algorithm on eight instances of the matrix completion problem generated as described above, for which we did not use line search; the parameter-free lazy variant is run with approximate minimization as described in Remark 11, the others use their respective standard step sizes. We provide the used parameters for each example in the ﬁgures below (Figures 2 and 3). The last tests for this version were performed on three instances of the structured regression problem, two with the feasible region containing ﬂow-based formulations of Hamiltonian cycles in graphs (Figure 4), and further tests on two spanning tree instances of diﬀerent size (Figure 5). We observed a signiﬁcant speedup of LCG compared to CG, due to the faster iteration of the lazy algorithm.

Pairwise Conditional Gradient Algorithm As we inherit structural restrictions of PCG on the feasible region, the problem repertoire is limited in this case. We tested the

Lazifying Conditional Gradient Algorithms

Pairwise Conditional Gradient algorithm on the structured regression problem with feasible regions from the MIPLIB instances eil33-2, air04 (Figure 6). Again similarly to the vanilla Frank Wolfe algorithm, we observed a signiﬁcant improvement in wall-clock time of LPCG compared to CG, due to the faster iteration of the lazy algorithm.

8.1.2. Online Results

Additionally to the quadratic objective functions above we tested the online version on random linear functions cx + b with c [ 1, +1]n and b [0, 1]. For online algorithms, each experiment used a random sequence of 100 diﬀerent random loss functions. In every ﬁgure the left column uses linear loss functions, while the right one uses quadratic loss functions over the same polytope. As customary, we did not use line search here but used the respective prescribed step sizes. As an instance of the structured regression problem we used the standard formulation of the cut polytope for graphs with 28 nodes as the feasible region (Figure 7). We also tested our algorithm on the quadratic unconstrained boolean optimization (QUBO) instances deﬁned on Chimera graphs Dash (2013), which are available at http://researcher.watson.ibm.com/ researcher/files/us-sanjeebd/chimera-data.zip. The instances are relatively hard albeit their rather small size and in general the problem is NP-hard. (Figure 8). One instance of the video colocalization problem uses a path polytope from http:// lime.cs.elte.hu/~kpeter/data/mcf/netgen/ that was generated with the netgen graph generator (Figure 9). Most of these instances are very large-scale minimum cost ﬂow instances with several tens of thousands nodes in the underlying graphs, therefore solving still takes considerable time despite the problem being in P. Finally, for the spanning tree problem, we used the well-known extended formulation with O(n3) inequalities for an n-node graph. We considered graphs with 25 nodes (Figures 10). We observed that similarly to the oﬄine case while OCG and LOCG converge comparably in the number of iterations, the lazy LOCG performed signiﬁcantly more iterations; for hard problems, where linear optimization is costly and convergence requires a large number of iterations, this led LOCG converging much faster in wall-clock time. In extreme cases OCG could not complete even a single iteration. This is due to LOCG only requiring some good enough solution, whereas OCG requires a stronger guarantee. This is reﬂected in faster oracle calls for LOCG.

Weak-Separation via Augmentation As discussed in Section 7 in some cases it can be very beneﬁcial to realize the weak-separation oracle by means of augmentation instead of linear optimization. To verify this we implemented a weak-separation oracle via an augmentation oracle for quadratic unconstrained boolean optimization (QUBO) instances (see above for details). For those instances, the primal heuristics of Gurobi (or CPLEX) can ﬁnd improving solutions very fast due to the structure of the instances. We obtain the augmentation oracle then by exiting the Gurobi call as soon as the ﬁrst improving solution is found. For comparability, apart from using the augmentation oracle we used the same conﬁguration as above. The results can be found in Figure 11, where we can observe a signiﬁcant speedup of the laziﬁed variant of OCG using augmentation over the base OCG algorithm. Also observe that this is in contrast to Figure 8, where the advantage of LOCG

Braun, Pokutta, and Zink

(without augmentation) over OCG was not that pronounced. To provide further insight in this case we also report oracle time, which denotes actual time spent in the augmentation oracle, which is minimal in both cases as can be seen.

8.2. Performance improvements, parameter sensitivity, and tuning

8.2.1. Effect of caching

As mentioned before, lazy algorithms have two improvements: caching and early termination. Here we depict the eﬀect of caching in Figure 12, comparing OCG (no caching, no early termination), LOCG (caching and early termination) and LOCG (only early termination) (see Algorithm 8). We did not include a caching-only OCG variant, because caching without early termination does not make much sense: in each iteration a new linear optimization problem has to be solved; previous solutions can hardly be reused as they are unlikely to be optimal for the new linear optimization problem.

8.2.2. Effect of K

If the parameter K of the oracle can be chosen, which depends on the actual oracle implementation, then we can increase K to bias the algorithm towards performing more positive calls. At the same time the steps get shorter. As such there is a natural trade-oﬀ between the cost of many positive calls vs. a negative call. We depict the impact of the parameter choice for K in Figure 13.

8.2.3. Parameter-free vs. textbook variant

For illustrative purposes, we compare the textbook variant of the lazy conditional gradient (Algorithm 3) with its parameter-free counterpart (Algorithm 7) in Figure 14. The parameterfree variant outperforms the textbook variant due to the active management of Φ combined with line search. Similar parameter-free variants can be derived for the other algorithms; see discussion in Section 4.

9. Final Remarks

As discussed above in Section 6, if a given baseline algorithm works over general compact convex sets P, then so does the laziﬁed version. In fact, as the laziﬁed algorithm runs, it produces a polyhedral approximation of the set P with very few vertices (subject to optimality vs. sparsity tradeoﬀs; see (Jaggi, 2013, Appendix C)). Moreover, the weak separation oracle does not need to return extreme points. All algorithms also work with maximal solutions that are not necessarily extremal (e.g., lying in a higher-dimensional face). However, in that case we lose the desirable property that the ﬁnal solution is a sparse convex combination of extreme points (typically vertices in the polyhedral setup). We would also like to brieﬂy address potential downsides of our approach. In fact, we believe the right perspective is the following: when using the lazy oracle over the LP oracle, we obtain potentially weaker approximations vt xt of the true gradient f(xt) compared

Lazifying Conditional Gradient Algorithms

to solving the actual LP, but the computation might be much faster. This is the tradeoﬀthat one has to consider: working with weaker approximations (which implies potentially less progress per iteration) vs. potentially signiﬁcantly faster computation of the approximations. If solving the LP is expensive, then laziﬁcation will be usually very beneﬁcial, if the LP is very cheap as in the case of P = [0, 1]n or P = n being the probability simplex, then laziﬁcation might be slower. A related remark in this context is that once the laziﬁed algorithm has obtained vertices x1, . . . , xm of P, so that the minimizer x of f satisﬁes x conv{x1, . . . , xm}, then from that point onwards no actual calls to the true LP oracle have to be performed anymore for primal progress and the algorithm will only use cache calls; the only remaining true LP calls are at most a logarithmic number for dual progress updates of the Φt.

Acknowledgements

We are indebted to Alexandre D Aspremont, Simon Lacoste-Julien, and George Lan for the helpful discussions and for providing us with relevant references. Research reported in this paper was partially supported by NSF CAREER award CMMI-1452463.

Jean-Yves Audibert, S ebastien Bubeck, and G abor Lugosi. Regret in online combinatorial optimization. Mathematics of Operations Research, 39(1):31 45, 2013.

Pierre Le Bodic, Jeﬀrey W Pavelka, Marc E Pfetsch, and Sebastian Pokutta. Solving MIPs via scaling-based augmentation. ar Xiv preprint ar Xiv:1509.03206, 2015.

G. Braun, S. Pokutta, and D. Zink. Lazifying Conditional Gradient Algorithms. Proceedings of ICML, 2017.

Alon Cohen and Tamir Hazan. Following the perturbed leader for online structured learning. In Proceedings of the 32nd International Conference on Machine Learning (ICML-15), pages 1034 1042, 2015.

Sanjeeb Dash. A note on QUBO instances deﬁned on Chimera graphs. preprint ar Xiv:1306.1202, 2013.

Andr as Frank and Eva Tardos. An application of simultaneous Diophantine approximation in combinatorial optimization. Combinatorica, 7(1):49 65, 1987.

Marguerite Frank and Philip Wolfe. An algorithm for quadratic programming. Naval research logistics quarterly, 3(1-2):95 110, 1956.

Robert M. Freund and Paul Grigas. New analysis and results for the Frank Wolfe method. Mathematical Programming, 155(1):199 230, 2016. ISSN 1436-4646. doi: 10.1007/s10107-014-0841-6. URL http://dx.doi.org/10.1007/s10107-014-0841-6.

Dan Garber and Elad Hazan. A linearly convergent conditional gradient algorithm with applications to online and stochastic optimization. ar Xiv preprint ar Xiv:1301.4666, 2013.

Braun, Pokutta, and Zink

Dan Garber and Ofer Meshi. Linear-memory and decomposition-invariant linearly convergent conditional gradient algorithm for structured polytopes. ar Xiv preprint, ar Xiv:1605.06492v1, May 2016.

Martin Gr otschel and L aszlo Lov asz. Combinatorial optimization: A survey, 1993.

Swati Gupta, Michel Goemans, and Patrick Jaillet. Solving combinatorial games using products, projections and lexicographically optimal bases. ar Xiv preprint ar Xiv:1603.00522, 2016.

Gurobi Optimization. Gurobi optimizer reference manual version 6.5, 2016. URL https: //www.gurobi.com/documentation/6.5/refman/.

Elad Hazan. Introduction to online convex optimization. Foundations and Trends in Optimization, 2(3 4):157 325, 2016. doi: 10.1561/2400000013. URL http://ocobook.cs. princeton.edu/.

Elad Hazan and Satyen Kale. Projection-free online learning. ar Xiv preprint ar Xiv:1206.4657, 2012.

Martin Jaggi. Revisiting Frank Wolfe: Projection-free sparse convex optimization. In Proceedings of the 30th International Conference on Machine Learning (ICML-13), pages 427 435, 2013.

Thorsten Joachims, Thomas Finley, and Chun-Nam John Yu. Cutting-plane training of structural SVMs. Machine Learning, 77(1):27 59, 2009.

Armand Joulin, Kevin Tang, and Li Fei-Fei. Eﬃcient image and video co-localization with Frank-Wolfe algorithm. In European Conference on Computer Vision, pages 253 268. Springer, 2014.

Adam Kalai and Santosh Vempala. Eﬃcient algorithms for online decision problems. Journal of Computer and System Sciences, 71(3):291 307, 2005.

Simon Lacoste-Julien and Martin Jaggi. On the global linear convergence of Frank Wolfe optimization variants. In C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 28, pages 496 504. Curran Associates, Inc., 2015. URL http://papers.nips.cc/paper/ 5925-on-the-global-linear-convergence-of-frank-wolfe-optimization-variants. pdf.

Simon Lacoste-Julien, Martin Jaggi, Mark Schmidt, and Patrick Pletscher. Block-coordinate Frank Wolfe optimization for structural SVMs. In ICML 2013 International Conference on Machine Learning, pages 53 61, 2013.

Guanghui Lan and Yi Zhou. Conditional gradient sliding for convex optimization. Optimization-Online preprint (4605), 2014.

Evgeny S Levitin and Boris T Polyak. Constrained minimization methods. USSR Computational mathematics and mathematical physics, 6(5):1 50, 1966.

Lazifying Conditional Gradient Algorithms

Gergely Neu and G abor Bart ok. An eﬃcient algorithm for learning with semi-bandit feedback. In Algorithmic Learning Theory, pages 234 248. Springer, 2013.

Timm Oertel, Christian Wagner, and Robert Weismantel. Integer convex minimization by mixed integer linear optimization. Oper. Res. Lett., 42(6-7):424 428, 2014.

Anton Osokin, Jean-Baptiste Alayrac, Isabella Lukasewitz, Puneet K Dokania, and Simon Lacoste-Julien. Minding the gaps for block Frank Wolfe optimization of structured SVMs. ICML 2016 International Conference on Machine Learning / ar Xiv preprint ar Xiv:1605.09346, 2016.

Sebastian Pokutta. Smooth convex optimization via geometric scaling. preprint, 2017.

Andreas S Schulz and Robert Weismantel. The complexity of generic primal algorithms for solving general integer programs. Mathematics of Operations Research, 27(4):681 692, 2002.

Andreas S. Schulz, Robert Weismantel, and G unter M. Ziegler. 0/1-integer programming: Optimization and augmentation are equivalent. In Algorithms ESA 95, Proceedings, pages 473 483, 1995.

Neel Shah, Vladimir Kolmogorov, and Christoph H Lampert. A multi-plane block-coordinate Frank Wolfe algorithm for training structural SVMs with a costly max-oracle. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2737 2745, 2015.

Braun, Pokutta, and Zink

0 150 300 450 Wall clock time (s)

Function value

0 150 300 450 Wall clock time (s)

Function value

150 300 450 Wall clock time (s)

106 107 108 109 1010 1011 1012 1013 1014 1015 1016 1017

150 300 450 Wall clock time (s)

107 108 109 1010 1011 1012 1013 1014 1015 1016 1017

0 10 20 30 40 Iterations

Number of LP calls

0 10 20 30 40 Iterations

Number of LP calls

cache hit rate: 48.72% cache hit rate: 50.00%

Figure 1: LCG vs. CG on large netgen instances netgen 16a (left) and netgen 16b (right) with quadratic objective functions. In both cases the diﬀerence in function value between the two versions of the algorithm is large. In the dual bound the performance of the lazy version is multiple orders of magnitude better than the performance of the non-lazy counterpart. The cache hit rates for these two instances are lower due to the high dimension of the polytope.

Lazifying Conditional Gradient Algorithms

0 150 300 450 Wall clock time (s)

Function value

0 150 300 450 Wall clock time (s)

Function value

0 500 1000 1500 Iterations

Function value

0 80 160 240 Iterations

Function value

0 500 1000 1500 Iterations

Number of LP calls

0 80 160 240 Iterations

Number of LP calls

cache hit rate: 94.24% cache hit rate: 84.80%

Figure 2: LCG vs. CG on two matrix completion instances. We solve the problem as given in Equation (6) with the parameters n = 3000, m = 1000, r = 10 and R = 30000 for the left instance and n = 10000, m = 100, r = 10 and R = 10000 for the right instance. In both cases the lazy version is slower in iterations, however signiﬁcantly faster in wall clock time. Note that we used the short-step rule for step sizes for both algorithms as line search for matrix completion is very expensive.

Braun, Pokutta, and Zink

0 250 500 750 1000 Wall clock time (s)

Function value

0 250 500 750 1000 Wall clock time (s)

Function value

0 150 300 450 600 Iterations

Function value

0 10 20 30 40 Iterations

Function value

0 200 400 600 Iterations

0 2 4 6 8 10 12 14 16

Number of LP calls

0 10 20 30 40 Iterations

0 2 4 6 8 10 12 14 16

Number of LP calls

cache hit rate: 97.50% cache hit rate: 59.46%

Figure 3: LCG vs. CG on two matrix completion instances. The parameters for Equation (6) are given by n = 5000, m = 4000, r = 10 and R = 50000 for the left instance and n = 100, m = 20000, r = 10 and R = 15000 for the right instance. In both of these cases the performance of the lazy and the non-lazy version are comparable in iterations, however in wall clock time the lazy version reaches lower function values faster. Note that we used the short-step rule for step sizes for both algorithms as line search for matrix completion is very expensive.

Lazifying Conditional Gradient Algorithms

0 1000 2000 3000 4000 Wall clock time (s)

10 4 10 3 10 2 10 1

100 101 102

Function value

0 1000 2000 3000 Wall clock time (s)

Function value

0 1500 3000 4500 Iterations

10 4 10 3 10 2 10 1

100 101 102

Function value

0 80 160 240 Iterations

Function value

0 1500 3000 4500 Iterations

Number of LP calls

0 80 160 240 Iterations

Number of LP calls

cache hit rate: 69.46% cache hit rate: 43.06%

Figure 4: LCG vs. CG on structured regression problems with feasible regions being a TSP polytope over 11 nodes (left) and 12 nodes (right). In both cases LCG is signiﬁcantly faster in wall-clock time.

Braun, Pokutta, and Zink

0 150 300 450 Wall clock time (s)

Function value

0 1000 2000 3000 4000 Wall clock time (s)

100 101 102 103 104

Function value

0 250 500 750 1000 Iterations

Function value

0 800 1600 2400 Iterations

100 101 102 103 104

Function value

0 250 500 750 1000 Iterations

0 10 20 30 40 50 60 70 80 90

Number of LP calls

0 800 1600 2400 Iterations

Number of LP calls

cache hit rate: 90.83% cache hit rate: 95.59%

Figure 5: LCG vs. CG on structured regression instances with extended formulation of the spanning tree problem on a 10 node graph on the left and a 15 node graph on the right.

Lazifying Conditional Gradient Algorithms

eil33-2, 4516 dimensions air04, 8904 dimensions

0 1000 2000 3000 4000 Wall-clock time (s)

Function value

0 1500 3000 4500 Wall-clock time (s)

Function value

0.0 0.5 1.0 1.5 2.0 Iterations 1e6

Function value

0 2 4 6 8 Iterations 1e5

Function value

0.0 0.5 1.0 1.5 2.0 Iterations 1e6

Number of LP calls

0 2 4 6 8 Iterations 1e5

Number of LP calls

cache hit rate: 99.8% cache hit rate: 99.999%

Figure 6: LPCG vs. PCG on two MIPLIB instances eil33-2 and air04. LPCG converges very fast, making millions of iterations with a relatively few oracle calls, while PCG completed only comparably few iterations due to the time-consuming oracle calls. This clearly illustrates the advantage of lazy methods when the cost of linear optimization is non-negligible. On the left, when reaching ε-optimality, LPCG performs many (negative) oracle calls to (re-)prove optimality; at that point one might opt for stopping the algorithm. On the right LPCG needed a rather long time for the initial bound tightening of Φ0, before converging signiﬁcantly faster than PCG.

Braun, Pokutta, and Zink

0 2000 4000 6000 8000 Wall-clock time (s)

0 2000 4000 6000 8000 Wall-clock time (s)

0 2000 4000 6000 8000

0 8000 16000 24000 Iterations

0 2000 4000 6000 8000

Number of LP calls

0 8000 16000 24000 Iterations

Number of LP calls

cache hit rate: 99.7% cache hit rate: 98.6%

Figure 7: LOCG vs. OCG over cut polytope for a 28-node graph. As for the smaller problem, this also illustrates the advantage of lazy algorithms when linear optimization is expensive. Again, LOCG needed no oracle calls after a small initial amount of time.

Lazifying Conditional Gradient Algorithms

0 2000 4000 6000 8000 Wall-clock time (s)

0 2000 4000 6000 8000 Wall-clock time (s)

0 3000 6000 9000 12000

0 8000 16000 24000

0 3000 6000 9000 12000

Number of LP calls

0 8000 16000 24000

0.0 0.5 1.0 1.5 2.0 2.5 3.0 3.5 4.0

Number of LP calls

cache hit rate: 99.8% cache hit rate: 99.99%

Figure 8: LOCG vs. OCG on a large QUBO instance. Both algorithms converge fast to the optimum. Interestingly, LOCG only performs 4 LP calls.

Braun, Pokutta, and Zink

0 150 300 450 600 Wall-clock time (s)

0 150 300 450 600 Wall-clock time (s)

0 150 300 450 Iterations

0 150 300 450 600 Iterations

0 150 300 450 Iterations

0 2 4 6 8 10 12 14

Number of LP calls

0 150 300 450 600 Iterations

Number of LP calls

cache hit rate: 97.0% cache hit rate: 89.6%

Figure 9: LOCG vs. OCG over a path polytope. Similar convergence rate in the number of iterations, but signiﬁcant diﬀerence in terms of wall-clock time.

Lazifying Conditional Gradient Algorithms

0 2000 4000 6000 8000 Wall-clock time (s)

0 2000 4000 6000 8000 Wall-clock time (s)

0 800 1600 2400 Iterations

0 1500 3000 4500 6000

0 800 1600 2400 Iterations

Number of LP calls

0 1500 3000 4500 6000

0 2 4 6 8 10 12 14

Number of LP calls

cache hit rate: 95.9% cache hit rate: 99.7%

Figure 10: LOCG vs. OCG over a spanning tree instance for a 25-node graph. On the left, early ﬂuctuation can be observed, bearing no consequence for later convergence rate. OCG did not get past this early stage. In both cases LOCG converges signiﬁcantly faster.

Braun, Pokutta, and Zink

0 100 200 300 400 500 600 600

LOCG (with AUG) OCG

Wall-clock time (s)

0 2000 4000 6000 8000 2500

LOCG (with AUG) OCG

Wall-clock time (s)

0 50 100 150 200 250 300 350 400 600

LOCG (with AUG) OCG

Oracle time (s)

0 1000 2000 3000 4000 Oracle time (s)

LOCG (with AUG) OCG

Figure 11: Lazy OCG (with augmentation) vs. OCG on two QUBO instances. In both cases the lazy variant together with augmentation signiﬁcantly outperforms OCG.

Lazifying Conditional Gradient Algorithms

0 2000 4000 6000 8000 Wall-clock time

LOCG with cache LOCG without cache OCG

0 2000 4000 6000 8000 Oracle time

LOCG with cache LOCG without cache OCG

Figure 12: Performance gain due to caching and early termination for online optimization over a maximum cut problem with linear losses. The red line is the OCG baseline, the green one is the lazy variant using only early termination, and the blue one uses caching and early termination. Left: loss vs. wall-clock time. Right: loss vs. total time spent in oracle calls. Time limit was 7200 seconds. Caching allows for a signiﬁcant improvement in loss reduction in wall-clock time. The eﬀect is even more obvious in oracle time as caching cuts out a large number of oracle calls.

0 150 300 450 Wall clock time (s)

Function value

LCG K=1 LCG K=1.5 LCG K=5 LCG K=10 LCG K=50 LCG K=100

0 200 400 600 Iterations

Function value

LCG K=1 LCG K=1.5 LCG K=5 LCG K=10 LCG K=50 LCG K=100

Figure 13: Impact of the oracle approximation parameter K depicted for the Lazy CG algorithm. We can see that increasing K leads to a deterioration of progress in iterations but improves performance in wall-clock time. The behavior is similar for other algorithms.

Braun, Pokutta, and Zink

0 1000 2000 3000 4000 Wall clock time (s)

Function value

LCG param free LCG

0 2000 4000 6000 8000 Wall clock time (s)

Function value

LCG param free LCG

0 40 80 120 Iterations

Function value

LCG param free LCG

0 200 400 600 800 Iterations

Function value

LCG param free LCG

0 40 80 120 Iterations

Nu m ber of LP calls

LCG param free LCG

0 200 400 600 800 Iterations

Number of LP calls

LCG param free LCG

Figure 14: Comparison of the textbook variant of the Lazy CG algorithm (Algorithm 3) vs. the Parameter-free Lazy CG (Algorithm 7) depicted for two sample instances to demonstrate behavior. The parameter-free variant usually has a slightly improved behavior in terms of iterations and a signiﬁcantly improved behavior in terms of wall-clock performance. In particular, the parameter-free variant can execute signiﬁcantly more oracle calls, due to the Φ-halving strategy and the associated bounded number of negative calls (see Theorem 9).