# Blended Conditional Gradients: The Unconditioning of Conditional Gradients

Gábor Braun¹, Sebastian Pokutta¹, Dan Tu¹, Stephen Wright²

Abstract

We present a blended conditional gradient approach for minimizing a smooth convex function over a polytope P, combining the Frank-Wolfe algorithm (also called conditional gradient) with gradient-based steps, different from away steps and pairwise steps, but still achieving linear convergence for strongly convex functions, along with good practical performance. Our approach retains all favorable properties of conditional gradient algorithms, notably avoidance of projections onto P and maintenance of iterates as sparse convex combinations of a limited number of extreme points of P. The algorithm is lazy, making use of inexpensive inexact solutions of the linear programming subproblem that characterizes the conditional gradient approach. It decreases measures of optimality rapidly, both in the number of iterations and in wall-clock time, outperforming even the lazy conditional gradient algorithms of (Braun et al., 2017). We also present a streamlined version of the algorithm that applies when P is the probability simplex.

1. Introduction

A common paradigm in convex optimization is minimization of a smooth convex function f over a polytope P. The conditional gradient (CG) algorithm, also known as Frank-Wolfe (Frank & Wolfe, 1956; Levitin & Polyak, 1966), is enjoying renewed popularity because it can be implemented efficiently to solve important problems in data analysis. It is a first-order method, requiring access to gradients ∇f(x) and function values f(x). In its original form,

¹ISyE, Georgia Institute of Technology, Atlanta, GA. ²Computer Sciences Department, University of Wisconsin, Madison, WI. Correspondence to: Gábor Braun, Sebastian Pokutta, Dan Tu, Stephen Wright.
Proceedings of the 36th International Conference on Machine Learning, Long Beach, California, PMLR 97, 2019. Copyright 2019 by the author(s).

CG employs a linear programming (LP) oracle to minimize a linear function over the polytope P at each iteration. The cost of this operation depends on the complexity of P. In this work, we describe a blended conditional gradient (BCG) approach, which takes one of several types of steps on the basis of the gradient ∇f at the current point. Our approach maintains an active vertex set, consisting of some solutions from previous iterations. Building on (Braun et al., 2017), BCG uses a weak-separation oracle to find a vertex of P for which the linear objective attains some fraction of the reduction in f promised by the LP oracle, typically by searching among the current set of active vertices. If no vertex yielding acceptable reduction can be found, the LP oracle used in the original FW algorithm may be deployed. On other iterations, BCG employs a simplex descent oracle, which takes a step within the convex hull of the active vertices, yielding progress either via reduction in function value (a descent step) or via culling of the active vertex set (a drop step). The size of the active vertex set typically remains small, which benefits both the efficiency of the method and the sparsity of the final solution (i.e., its representation as a convex combination of a relatively small number of vertices). BCG has theoretical convergence rates similar to several other variants of CG that have been studied recently, including pairwise-step and away-step variants and the lazy variants of (Braun et al., 2017). In several cases, we observe better empirical convergence for BCG than for these other variants.
While the lazy variant of (Braun et al., 2017) has an advantage over baseline CG when the LP oracle is expensive, our BCG approach consistently outperforms the other variants in more general circumstances, both in per-iteration progress and in wall-clock time.

Related work. There has been an extensive body of work on conditional gradient algorithms; see the excellent overview of (Jaggi, 2013). Here we review only those papers most closely related to our work. Our main inspiration comes from (Braun et al., 2017; Lan et al., 2017), which introduce the weak-separation oracle as a lazy alternative to calling the LP oracle in every iteration. It is influenced too by the method of (Rao et al., 2015), which maintains an active vertex set, using projected descent steps to improve the objective over the convex hull of this set, and culling the set on some steps to keep its size under control. While the latter method is a heuristic with no proven convergence bounds beyond those inherited from the standard Frank-Wolfe method, our BCG algorithm employs a criterion for an optimal trade-off between the various steps, with a proven convergence rate equal to state-of-the-art Frank-Wolfe variants up to a constant factor. Our main result shows linear convergence of BCG for strongly convex functions. Linearly convergent variants of CG were studied as early as (Guélat & Marcotte, 1986) for special cases and (Garber & Hazan, 2013) for the general case (though the latter work involves very large constants). More recently, linear convergence has been established for various pairwise-step and away-step variants of CG in (Lacoste-Julien & Jaggi, 2015), where the concept of an active vertex set is used to improve performance. Other memory-efficient decomposition-invariant variants were described in (Garber & Meshi, 2016) and (Bashiri & Zhang, 2017).
Modifications of descent directions and step sizes, reminiscent of the drop steps used in BCG, have been considered by (Freund & Grigas, 2016; Freund et al., 2017). The use of an inexpensive oracle based on a subset of the vertices of P, as an alternative to the full LP oracle, has been considered in (Kerdreux et al., 2018b). (Garber et al., 2018) proposes a fast variant of conditional gradients for matrix recovery problems. BCG is quite distinct from the fully-corrective Frank-Wolfe algorithm (FCFW) (see, for example, (Holloway, 1974; Lacoste-Julien & Jaggi, 2015)). Both approaches maintain active vertex sets, generate iterates that lie in the convex hulls of these sets, and alternate between Frank-Wolfe steps generating new vertices and correction steps optimizing within the current active vertex set. However, convergence analyses of the FCFW algorithm assume that the correction steps have unit cost, though they can be quite expensive in practice, requiring multiple evaluations of the gradient ∇f. For BCG, by contrast, we assume only a single step of gradient descent type having unit cost (disregarding the cost of line search). For further explanation of the differences between BCG and FCFW, see the computational results in Figure 12 and the discussion in Appendix D.

Contribution. Our contribution can be summarized as follows:

Blended Conditional Gradients (BCG). The BCG approach blends different types of descent steps: the traditional CG steps of (Frank & Wolfe, 1956), the lazified CG steps of (Braun et al., 2017), and gradient descent steps over the convex hull of the current active vertex set. It avoids projections onto P or onto the convex hull of the active vertices, and does not use away steps and pairwise steps, which are elements of other popular variants of CG. It achieves linear convergence for strongly convex functions (see Theorem 3.1), and O(1/t) convergence after t iterations for general smooth functions.
While the linear convergence proof of the Away-step Frank-Wolfe algorithm (Lacoste-Julien & Jaggi, 2015, Theorem 1, Footnote 4) requires the objective function f to be defined on the Minkowski sum P − P + P, BCG does not need f to be defined outside the polytope P. The algorithm has complexity comparable to pairwise-step or away-step variants of conditional gradients, both in per-iteration running time and in the space required to store vertices and iterates. It is affine-invariant and parameter-free; estimates of such parameters as smoothness, strong convexity, or the diameter of P are not required. It maintains iterates as (often sparse) convex combinations of vertices, typically much sparser than the baseline CG methods, a property that is important for some applications. Such sparsity is due to the aggressive reuse of active vertices, and the fact that new vertices are added only as a kind of last resort. In wall-clock time as well as per-iteration progress, our computational results show that BCG can be orders of magnitude faster than competing CG methods on some problems.

Simplex Gradient Descent (SiGD). In Section 4, we describe a new projection-free gradient descent procedure for minimizing a smooth function over the probability simplex, which can be used to implement the simplex descent oracle required by BCG.

Computational Experiments. We demonstrate the excellent computational behavior of BCG compared to other CG algorithms on standard problems, including video co-localization, sparse regression, structured SVM training, and structured regression. We observe significant computational speedups and, in several cases, empirically better convergence rates. We summarize preliminary material in Section 2, including the two oracles that are the foundation of our BCG procedure. BCG is described and analyzed in Section 3, establishing linear convergence rates.
The simplex gradient descent routine, which implements the simplex descent oracle, is described in Section 4. Our computational experiments are summarized in Section 5; more extensive experiments appear in Appendix D. Variants of the analysis and other auxiliary material are relegated to the appendix. We mention in particular a variant of BCG that applies when P is the probability simplex, a special case that admits several simplifications and improvements to the analysis.

2. Preliminaries

We use the following notation: $e_i$ is the i-th coordinate vector, $\mathbf{1} := (1, \dots, 1) = e_1 + e_2 + \cdots$ is the all-ones vector, $\|\cdot\|$ denotes the Euclidean norm (ℓ2-norm), $D = \operatorname{diam}(P) = \sup_{u,v \in P} \|u - v\|_2$ is the ℓ2-diameter of P, and conv S denotes the convex hull of a set S of points. The probability simplex $\Delta_k := \operatorname{conv}\{e_1, \dots, e_k\}$ is the convex hull of the coordinate vectors in dimension k. Let f be a differentiable convex function. Recall that f is L-smooth if
$$f(y) - f(x) - \nabla f(x)(y - x) \le L \|y - x\|^2 / 2 \quad \text{for all } x, y \in P.$$
The function f has curvature C if
$$f(\gamma y + (1-\gamma)x) \le f(x) + \gamma \nabla f(x)(y - x) + C\gamma^2/2 \quad \text{for all } x, y \in P \text{ and } 0 \le \gamma \le 1.$$
(Note that an L-smooth function always has curvature $C \le LD^2$.) Finally, f is strongly convex if for some $\alpha > 0$ we have
$$f(y) - f(x) - \nabla f(x)(y - x) \ge \alpha \|y - x\|^2 / 2 \quad \text{for all } x, y \in P.$$
We use the following fact about strongly convex functions when optimizing over P.

Fact 2.1 (Geometric strong convexity guarantee). (Lacoste-Julien & Jaggi, 2015, Theorem 6 and Eq. (28)) Given a strongly convex function f, there is a value $\mu > 0$, called the geometric strong convexity, such that
$$f(x) - \min_{y \in P} f(y) \le \frac{\big(\max_{y \in S,\, z \in P} \nabla f(x)(y - z)\big)^2}{2\mu}$$
for any x ∈ P and for any subset S of the vertices of P for which x lies in the convex hull of S. The value of µ depends both on f and the geometry of P.

2.1. Simplex Descent Oracle

Given a convex objective function f and an ordered finite set S = {v1, . . .
, vk} of points, we define $f_S : \Delta_k \to \mathbb{R}$ as follows:
$$f_S(\lambda) := f\Big(\sum_{i=1}^{k} \lambda_i v_i\Big). \tag{1}$$
When $f_S$ is $L_{f_S}$-smooth, Oracle 1 returns an improving point x′ in conv S together with a vertex set S′ ⊆ S such that x′ ∈ conv S′.

Oracle 1 Simplex Descent Oracle SiDO(x, S, f)
Input: finite set S ⊆ ℝⁿ, point x ∈ conv S, convex smooth function f : conv S → ℝ;
Output: finite set S′ ⊆ S, point x′ ∈ conv S′ satisfying either
drop step: f(x′) ≤ f(x) and S′ ⊊ S, or
descent step: f(x) − f(x′) ≥ [max_{u,v∈S} ∇f(x)(u − v)]² / (4 L_{f_S}) and S′ = S.

In Section 4 we provide an implementation (Algorithm 2) of this oracle via a single descent step, which avoids projection and does not require knowledge of the smoothness parameter $L_{f_S}$.

2.2. Weak-Separation Oracle

Oracle 2 Weak-Separation Oracle LPsep_P(c, x, Φ, K)
Input: linear objective c ∈ ℝⁿ, point x ∈ P, accuracy K ≥ 1, gap estimate Φ > 0;
Output: either (1) vertex y ∈ P with c(x − y) ≥ Φ/K, or (2) false: c(x − z) ≤ Φ for all z ∈ P.

The weak-separation oracle (Oracle 2) was introduced in (Braun et al., 2017) to replace the LP oracle traditionally used in the CG method. Provided with a point x ∈ P, a linear objective c, a target reduction value Φ > 0, and an inexactness factor K ≥ 1, it decides whether there exists y ∈ P with cx − cy ≥ Φ/K, or else certifies that cx − cz ≤ Φ for all z ∈ P. In our applications, c = ∇f(x) is the gradient of the objective at the current iterate x. Oracle 2 could be implemented simply by the standard LP oracle of minimizing cz over z ∈ P. However, it allows more efficient implementations, including the following. (1) Caching: testing previously obtained vertices y ∈ P (specifically, vertices in the current active vertex set) to see if one of them satisfies cx − cy ≥ Φ/K. If not, the traditional LP oracle could be called to either find a new vertex of P satisfying this bound, or else to certify that cx − cz ≤ Φ for all z ∈ P. (2) Early termination: terminating the LP procedure as soon as a vertex of P has been discovered that satisfies cx − cy ≥ Φ/K. (This technique requires an LP implementation that generates vertices as iterates.)
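As an illustration, the caching strategy above can be sketched in a few lines of Python. The helper names (`lpsep`, `lp_oracle`, `cache`) are our own, not the paper's, and `lp_oracle(c)` is assumed to return a minimizing vertex of P; this is a sketch, not the authors' implementation.

```python
import numpy as np

def lpsep(c, x, phi, K, cache, lp_oracle):
    """Weak-separation oracle sketch (Oracle 2) with vertex caching.

    Returns a vertex y with c.(x - y) >= phi / K, or None (the "false"
    answer), certifying c.(x - z) <= phi for all z in P.
    """
    # (1) Caching: try previously seen vertices first.
    for y in cache:
        if c @ (x - y) >= phi / K:
            return y
    # Fall back to the exact LP oracle: y minimizes c.z over P.
    y = lp_oracle(c)
    if c @ (x - y) >= phi / K:
        cache.append(y)
        return y
    # Since y is the exact minimizer, c.(x - z) <= c.(x - y) < phi/K <= phi
    # for all z in P (using K >= 1), so "false" is a valid certificate.
    return None
```

For example, on the probability simplex the exact LP oracle is simply `lambda c: np.eye(n)[np.argmin(c)]`, and repeated calls are answered from the cache whenever an active vertex already yields the required reduction.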
If the LP procedure runs to termination without finding such a point, it has certified that cx − cz ≤ Φ for all z ∈ P. In (Braun et al., 2017) these techniques resulted in orders-of-magnitude speedups in wall-clock time in the computational tests, as well as sparse convex combinations of vertices for the iterates xt, a desirable property in many contexts.

3. Blended Conditional Gradients

Our BCG approach is specified as Algorithm 1. We discuss the algorithm in this section and establish its convergence rate. The algorithm expresses each iterate xt, t = 0, 1, 2, . . ., as a convex combination of the elements of the active vertex set, denoted by St, as in the Pairwise and Away-step variants of CG. At each iteration, the algorithm calls either Oracle 1 or Oracle 2 in search of the next iterate, whichever promises the smaller function value, using a test in Line 6 based on an estimate of the dual gap. The same greedy principle is used in the Away-step CG approach and its lazy variants. A critical role in the algorithm (and particularly in the test of Line 6) is played by the value Φt, which is a current estimate of the primal gap: the difference between the current function value f(xt) and the optimal function value over P. When Oracle 2 returns false, the current value of Φt is discovered to be an overestimate of the dual gap, so it is halved (Line 13) and we proceed to the next iteration. In subsequent discussion, we refer to Φt as the gap estimate.

Algorithm 1 Blended Conditional Gradients (BCG)
Input: smooth convex function f, start vertex x0 ∈ P, weak-separation oracle LPsep_P, accuracy K ≥ 1
Output: points xt in P for t = 1, . . . , T
1: Φ0 ← max_{v∈P} ∇f(x0)(x0 − v)/2 {initial gap estimate}
2: S0 ← {x0}
3: for t = 0 to T − 1 do
4: v_t^A ← argmax_{v∈St} ∇f(xt)v
5: v_t^{FW-S} ← argmin_{v∈St} ∇f(xt)v
6: if ∇f(xt)(v_t^A − v_t^{FW-S}) ≥ Φt then
7: (xt+1, St+1) ← SiDO(xt, St) {either a drop step or a descent step}
8: Φt+1 ← Φt
9: else
10: vt ← LPsep_P(∇f(xt), xt, Φt, K)
11: if vt = false then
12: xt+1 ← xt
13: Φt+1 ← Φt/2 {gap step}
14: St+1 ← St
15: else
16: xt+1 ← argmin_{x∈[xt,vt]} f(x) {FW step, with line search}
17: Choose St+1 ⊆ St ∪ {vt} minimal such that xt+1 ∈ conv St+1.
18: Φt+1 ← Φt
19: end if
20: end if
21: end for

In Line 17, the active set St+1 is required to be minimal. By Carathéodory's theorem, this requirement ensures that |St+1| ≤ dim P + 1. In practice, the St are invariably small and no explicit reduction in size is necessary. The key requirement, in theory and practice, is that if, after a call to Oracle SiDO, the new iterate xt+1 lies on a face of the convex hull of the vertices in St, then at least one element of St is dropped to form St+1. This requirement ensures that the local pairwise gap in Line 6 is not made too large by stale vertices in St, which could block progress. Small size of the sets St is crucial to the efficiency of the algorithm, in rapidly determining the maximizer and minimizer of ∇f(xt) over the active set St in Lines 4 and 5. The constants in the convergence rate described in our main theorem (Theorem 3.1 below) depend on a modified curvature-like parameter of the function f. Given a vertex set S of P, recall from Section 2.1 the smoothness parameter $L_{f_S}$ of the function $f_S : \Delta_k \to \mathbb{R}$ defined by (1). Define the simplicial curvature $C^\Delta$ to be
$$C^\Delta := \max_{S : |S| \le 2 \dim P} L_{f_S}, \tag{2}$$
the maximum of the $L_{f_S}$ over all possible active sets. This affine-invariant parameter depends both on the shape of P and the function f.
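For concreteness, one iteration of the BCG loop can be sketched as follows. The oracle signatures (`sido`, `lpsep`) and the coarse grid line search are assumptions of this sketch; the paper's Line 16 uses exact line search on the segment.

```python
import numpy as np

def bcg_step(f, grad_f, x, S, phi, K, sido, lpsep):
    """One iteration of the BCG loop (a sketch of Algorithm 1).

    S is a list of active vertices (numpy arrays); sido and lpsep stand in
    for Oracles 1 and 2.  Returns the updated (x, S, phi).
    """
    g = grad_f(x)
    vals = [g @ v for v in S]
    v_away = S[int(np.argmax(vals))]        # away vertex over S
    v_fw_local = S[int(np.argmin(vals))]    # local FW vertex over S
    if g @ (v_away - v_fw_local) >= phi:    # Line 6: enough local gap
        x, S = sido(x, S, f)                # drop step or descent step
        return x, S, phi
    v = lpsep(g, x, phi, K)
    if v is None:                           # gap step: halve the estimate
        return x, S, phi / 2
    # FW step: line search on the segment [x, v] (grid search here),
    # then add the new vertex to the active set.
    gammas = np.linspace(0.0, 1.0, 101)
    gamma = min(gammas, key=lambda t: f(x + t * (v - x)))
    return x + gamma * (v - x), S + [v], phi
```

Note that the sketch omits the minimality requirement of Line 17; a faithful implementation would also prune S so that the new iterate remains a convex combination of a minimal vertex subset.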
This is the relative smoothness constant $L_{f,\mathcal{A}}$ from the predecessor of (Gutman & Peña, 2019), namely (Gutman & Peña, 2018, Definition 2a), with an additional restriction: the simplex is restricted to faces of dimension at most 2 dim P, which appears as a bound on the size of S in our formulation. This restriction improves the constant by removing dependence on the number of vertices of the polytope, and can probably replace the original constant in convergence bounds. We can immediately see the effect in the common case of L-smooth functions: the simplicial curvature is of reasonable magnitude, specifically, $C^\Delta \le LD^2 \dim P$, where D is the diameter of P. This result follows from (2) and the bound on $L_{f_S}$ from Lemma A.1 in the appendix. This bound is not directly comparable with the upper bound $L_{f,\mathcal{A}} \le LD^2/4$ in (Gutman & Peña, 2018, Corollary 2), because the latter uses the 1-norm on the standard simplex, while we use the 2-norm, the norm used by projected gradients and our simplex gradient descent. The additional factor dim P is explained by the n-dimensional standard simplex having constant minimum width 2 in the 1-norm, but minimum width dependent on the dimension n (specifically, Θ(1/√n)) in the 2-norm. Recall that the minimum width of a convex body P ⊆ ℝⁿ in a given norm is $\min_\varphi \max_{u,v \in P} \varphi(u - v)$, where φ runs over all linear maps ℝⁿ → ℝ having dual norm ‖φ‖ = 1. For the 2-norm, this is just the minimum distance between parallel hyperplanes such that P lies between the two hyperplanes. For another comparison, recall the curvature bound C ≤ LD². Note, however, that the algorithm and convergence rate below are affine-invariant, and the only restriction on the function f is that it has finite simplicial curvature. This restriction readily provides the curvature bound
$$C \le 2C^\Delta, \tag{3}$$
where the factor 2 arises as the square of the diameter √2 of the standard simplex $\Delta_k$. (See Lemma A.2 in the appendix for details.)
Note that S is allowed to be large enough that every point of P lies in the convex hull of some vertex subset S, by Carathéodory's theorem, and hence the simplicial curvature provides an upper bound on the curvature. We describe the convergence of BCG (Algorithm 1) in the following theorem.

Theorem 3.1. Let f be a strongly convex, smooth function over the polytope P with simplicial curvature $C^\Delta$ and geometric strong convexity µ. Then Algorithm 1 ensures $f(x_T) - f(x^*) \le \varepsilon$, where $x^*$ is an optimal solution to f in P, for some iteration index T that satisfies
$$T \le \left\lceil \log \frac{2\Phi_0}{\varepsilon} \right\rceil + 8K \left\lceil \log \frac{\Phi_0}{2KC^\Delta} \right\rceil + \frac{64K^2 C^\Delta}{\mu} \left\lceil \log \frac{4KC^\Delta}{\varepsilon} \right\rceil, \tag{4}$$
where log denotes logarithms to the base 2. For smooth but not necessarily strongly convex functions f, the algorithm ensures $f(x_T) - f(x^*) \le \varepsilon$ after $T = O(\max\{C^\Delta, \Phi_0\}/\varepsilon)$ iterations by a similar argument, which is omitted.

Proof. The proof tracks that of (Braun et al., 2017). We divide the iteration sequence into epochs that are demarcated by the gap steps, that is, the iterations for which the weak-separation oracle (Oracle 2) returns the value false, which results in Φt being halved for the next iteration. We then bound the number of iterates within each epoch. The result is obtained by aggregating across epochs. We start with a well-known bound on the function value using the Frank-Wolfe point $v_t^{FW} := \operatorname{argmin}_{v \in P} \nabla f(x_t)v$ at iteration t, which follows from convexity:
$$f(x_t) - f(x^*) \le \nabla f(x_t)(x_t - x^*) \le \nabla f(x_t)(x_t - v_t^{FW}).$$
If iteration t − 1 is a gap step, we have, using $x_t = x_{t-1}$ and $\Phi_t = \Phi_{t-1}/2$, that
$$f(x_t) - f(x^*) \le \nabla f(x_t)(x_t - v_t^{FW}) \le 2\Phi_t. \tag{5}$$
This bound also holds at t = 0, by definition of Φ0. Thus Algorithm 1 is guaranteed to satisfy $f(x_T) - f(x^*) \le \varepsilon$ at some iterate T such that T − 1 is a gap step and $2\Phi_T \le \varepsilon$. Therefore, the total number of gap steps NΦ required to reach this point satisfies
$$N_\Phi \le \left\lceil \log \frac{2\Phi_0}{\varepsilon} \right\rceil, \tag{6}$$
which is also a bound on the total number of epochs. The next stage of the proof finds bounds on the number of iterations of each type within an individual epoch.
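The gap-step count (6) can be checked with a toy computation: the gap estimate halves at every gap step, so at most ⌈log₂(2Φ₀/ε)⌉ gap steps can occur before 2Φ_t ≤ ε. The helper `gap_steps` below is our own illustration, not part of the algorithm.

```python
import math

def gap_steps(phi0, eps):
    """Count halvings of the gap estimate until 2 * phi <= eps."""
    n, phi = 0, phi0
    while 2 * phi > eps:
        phi /= 2
        n += 1
    return n

# Matches the ceiling-log bound: ceil(log2(2 * 8 / 1)) = 4 halvings.
assert gap_steps(8.0, 1.0) == math.ceil(math.log2(2 * 8.0 / 1.0))
```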
If iteration t − 1 is a gap step, we have $x_t = x_{t-1}$ and $\Phi_t = \Phi_{t-1}/2$, and because the condition in Line 6 of Algorithm 1 was false at iteration t − 1 (with gap estimate $\Phi_{t-1} = 2\Phi_t$), we have
$$\nabla f(x_t)(v_t^A - x_t) \le \nabla f(x_t)(v_t^A - v_t^{FW\text{-}S}) \le 2\Phi_t. \tag{7}$$
This condition also holds trivially at t = 0, since $v_0^A = v_0^{FW\text{-}S} = x_0$. By summing (5) and (7), we obtain $\nabla f(x_t)(v_t^A - v_t^{FW}) \le 4\Phi_t$, so it follows from Fact 2.1 that
$$f(x_t) - f(x^*) \le \frac{[\nabla f(x_t)(v_t^A - v_t^{FW})]^2}{2\mu} \le \frac{8\Phi_t^2}{\mu}.$$
By combining this inequality with (5), we obtain
$$f(x_t) - f(x^*) \le \min\{8\Phi_t^2/\mu,\ 2\Phi_t\} \tag{8}$$
for all t such that either t = 0 or else t − 1 is a gap step. In fact, (8) holds for all t, because (1) the sequence of function values {f(xs)}s is non-increasing; and (2) Φs = Φt for all s in the epoch that starts at iteration t. We now consider the epoch that starts at iteration t, and use s to index the iterations within this epoch. Note that Φs = Φt for all s in this epoch. We distinguish three types of iterations besides gap steps. The first type is a Frank-Wolfe step, taken when the weak-separation oracle returns an improving vertex $v_s \in P$ such that $\nabla f(x_s)(x_s - v_s) \ge \Phi_s/K = \Phi_t/K$ (Line 16). Using the definition of the curvature C, we have by standard Frank-Wolfe arguments (cf. (Braun et al., 2017)) that
$$f(x_s) - f(x_{s+1}) \ge \frac{\Phi_s}{2K} \min\left\{1, \frac{\Phi_s}{KC}\right\} \ge \frac{\Phi_t}{2K} \min\left\{1, \frac{\Phi_t}{2KC^\Delta}\right\}, \tag{9}$$
where we used Φs = Φt and $C \le 2C^\Delta$ (from (3)). We denote by $N^t_{FW}$ the number of Frank-Wolfe iterations in the epoch starting at iteration t. The second type of iteration is a descent step, in which Oracle SiDO (Line 7) returns a point $x_{s+1}$ that lies in the relative interior of conv Ss and has strictly smaller function value. We thus have $S_{s+1} = S_s$ and, by the definition of Oracle SiDO together with (2), it follows that
$$f(x_s) - f(x_{s+1}) \ge \frac{[\nabla f(x_s)(v_s^A - v_s^{FW\text{-}S})]^2}{4C^\Delta} \ge \frac{\Phi_s^2}{4C^\Delta} = \frac{\Phi_t^2}{4C^\Delta}. \tag{10}$$
We denote by $N^t_{desc}$ the number of descent steps that take place in the epoch that starts at iteration t. The third type of iteration is one in which Oracle 1 returns a point $x_{s+1}$ lying on a face of the convex hull of Ss, so that $S_{s+1}$ is strictly smaller than $S_s$.
Similarly to the Away-step Frank-Wolfe algorithm of (Lacoste-Julien & Jaggi, 2015), we call these steps drop steps, and denote by $N^t_{drop}$ the number of such steps that take place in the epoch that starts at iteration t. Note that since Ss is expanded only at Frank-Wolfe steps, and then only by at most one element, the total number of drop steps across the whole algorithm cannot exceed the total number of Frank-Wolfe steps. We use this fact and (6) in bounding the total number of iterations T required for $f(x_T) - f(x^*) \le \varepsilon$:
$$T \le N_\Phi + N_{desc} + N_{FW} + N_{drop} \le N_\Phi + N_{desc} + 2N_{FW} = N_\Phi + \sum_{t:\,\text{epoch start}} (N^t_{desc} + 2N^t_{FW}). \tag{11}$$
Here $N_{desc}$ denotes the total number of descent steps, $N_{FW}$ the total number of Frank-Wolfe steps, and $N_{drop}$ the total number of drop steps, which is bounded by $N_{FW}$, as just discussed. Next, we seek bounds on the iteration counts $N^t_{desc}$ and $N^t_{FW}$ within the epoch starting with iteration t. For the total decrease in function value during the epoch, Equations (9) and (10) provide a lower bound, while $f(x_t) - f(x^*)$ is an obvious upper bound, leading to the following estimates using (8). If $\Phi_t \ge 2KC^\Delta$, then
$$2\Phi_t \ge f(x_t) - f(x^*) \ge N^t_{desc} \frac{\Phi_t^2}{4C^\Delta} + N^t_{FW} \frac{\Phi_t}{2K} \ge N^t_{desc} \frac{\Phi_t K}{2} + N^t_{FW} \frac{\Phi_t}{2K} \ge (N^t_{desc} + 2N^t_{FW}) \frac{\Phi_t}{4K},$$
hence
$$N^t_{desc} + 2N^t_{FW} \le 8K. \tag{12}$$
If $\Phi_t < 2KC^\Delta$, a similar argument provides
$$\frac{8\Phi_t^2}{\mu} \ge f(x_t) - f(x^*) \ge N^t_{desc} \frac{\Phi_t^2}{4C^\Delta} + N^t_{FW} \frac{\Phi_t^2}{4K^2C^\Delta} \ge (N^t_{desc} + 2N^t_{FW}) \frac{\Phi_t^2}{8K^2C^\Delta},$$
hence
$$N^t_{desc} + 2N^t_{FW} \le \frac{64K^2C^\Delta}{\mu}. \tag{13}$$
There are at most $\lceil \log(\Phi_0/(2KC^\Delta)) \rceil$ epochs in the regime $\Phi_t \ge 2KC^\Delta$, and at most $\lceil \log(4KC^\Delta/\varepsilon) \rceil$ epochs in the regime $\Phi_t < 2KC^\Delta$. Combining (11) with the bounds (12) and (13) on $N^t_{FW}$ and $N^t_{desc}$, we obtain (4).

4. Simplex Gradient Descent

Here we describe the Simplex Gradient Descent approach (Algorithm 2), an implementation of the SiDO oracle (Oracle 1). Algorithm 2 requires only O(|S|) operations beyond the evaluation of ∇f(x) and the cost of line search. (It is assumed that x is represented as a convex combination of vertices of P, which is updated during Oracle 1.)
Apart from the (trivial) computation of the projection of ∇f(x) onto the linear space spanned by $\Delta_k$, no projections are computed. Thus, Algorithm 2 is typically faster even than a step of Frank-Wolfe for the typical small sets S. Alternative implementations of Oracle 1 are described in Section C.1. Section C.2 describes the special case in which P itself is a probability simplex; there, BCG and its oracles are combined into a single, simple method with better constants in the convergence bounds. In the algorithm, the form c1 denotes the scalar product of c and 1, i.e., the sum of the entries of c.

Algorithm 2 Simplex Gradient Descent Step (SiGD)
Input: polyhedron P, smooth convex function f : P → ℝ, subset S = {v1, v2, . . . , vk} of vertices of P, point x ∈ conv S
Output: set S′ ⊆ S, point x′ ∈ conv S′
1: Decompose x as a convex combination $x = \sum_{i=1}^{k} \lambda_i v_i$, with $\sum_{i=1}^{k} \lambda_i = 1$ and $\lambda_i \ge 0$, i = 1, 2, . . . , k
2: c ← [∇f(x)v1, . . . , ∇f(x)vk] {c = ∇f_S(λ); see (1)}
3: d ← c − (c1)1/k {projection onto the hyperplane of $\Delta_k$}
4: if d = 0 then
5: return x′ = v1, S′ = {v1} {arbitrary vertex}
6: end if
7: η ← max{η ≥ 0 : λ − ηd ≥ 0}
8: y ← x − η Σᵢ dᵢ vᵢ
9: if f(x) ≥ f(y) then
10: x′ ← y
11: Choose S′ ⊆ S, S′ ≠ S with x′ ∈ conv S′.
12: else
13: x′ ← argmin_{z∈[x,y]} f(z)
14: S′ ← S
15: end if
16: return x′, S′

To verify the validity of Algorithm 2 as an implementation of Oracle 1, note first that since y lies on a face of conv S by definition, it is always possible to choose a proper subset S′ ⊊ S in Line 11, for example, S′ := {vi : λi > ηdi}. The following lemma shows that with the choice h := f_S, Algorithm 2 correctly implements Oracle 1.

Lemma 4.1. Let $\Delta_k$ be the probability simplex in k dimensions and suppose that $h : \Delta_k \to \mathbb{R}$ is an $L_h$-smooth function. Given some $\lambda \in \Delta_k$, define $d := \nabla h(\lambda) - (\nabla h(\lambda)\mathbf{1}/k)\mathbf{1}$, and let $\eta \ge 0$ be the largest value for which $\tau := \lambda - \eta d \ge 0$. Let $\lambda' := \operatorname{argmin}_{z \in [\lambda, \tau]} h(z)$. Then either $h(\lambda) \ge h(\tau)$, or
$$h(\lambda) - h(\lambda') \ge \frac{[\max_{1 \le i,j \le k} \nabla h(\lambda)(e_i - e_j)]^2}{4L_h}.$$
Proof.
Let $g(\zeta) := h(\zeta + (1 - \zeta\mathbf{1})\mathbf{1}/k)$; then $\nabla g(\zeta) = \nabla h(\zeta + (1 - \zeta\mathbf{1})\mathbf{1}/k) - (\nabla h(\zeta + (1 - \zeta\mathbf{1})\mathbf{1}/k)\mathbf{1})\mathbf{1}/k$, and g is clearly $L_h$-smooth, too. In particular, $\nabla g(\lambda) = d$. From standard gradient descent bounds, not repeated here, we have the following inequalities, for $\gamma \le \min\{\eta, 1/L_h\}$:
$$h(\lambda) - h(\lambda - \gamma d) = g(\lambda) - g(\lambda - \gamma \nabla g(\lambda)) \ge \frac{\gamma \|\nabla g(\lambda)\|_2^2}{2} \ge \frac{\gamma\,[\max_{1 \le i,j \le k} \nabla g(\lambda)(e_i - e_j)]^2}{4} = \frac{\gamma\,[\max_{1 \le i,j \le k} \nabla h(\lambda)(e_i - e_j)]^2}{4}, \tag{14}$$
where the second inequality uses that the ℓ2-diameter of $\Delta_k$ is √2, and the last equality follows from $\nabla g(\lambda)(e_i - e_j) = \nabla h(\lambda)(e_i - e_j)$. When $\eta \ge 1/L_h$, we conclude that $h(\lambda') \le h(\lambda - (1/L_h)d) \le h(\lambda)$, hence
$$h(\lambda) - h(\lambda') \ge \frac{[\max_{i,j \in \{1,2,\dots,k\}} \nabla h(\lambda)(e_i - e_j)]^2}{4L_h},$$
which is the second case of the lemma. When $\eta < 1/L_h$, setting γ = η in (14) clearly provides $h(\lambda) - h(\tau) \ge 0$, which is the first case of the lemma.

5. Computational Experiments (Summary)

To compare our experiments to previous work, we used problems and instances similar to those in (Lacoste-Julien & Jaggi, 2015; Garber & Meshi, 2016; Rao et al., 2015; Braun et al., 2017; Lan et al., 2017). These include structured regression, sparse regression, video co-localization, sparse signal recovery, matrix completion, and Lasso. We compared various algorithms denoted by the following acronyms: our algorithm (BCG), the Away-step Frank-Wolfe algorithm (ACG) and the Pairwise Frank-Wolfe algorithm (PCG) from (Lacoste-Julien & Jaggi, 2015; Garber & Meshi, 2016), and the vanilla Frank-Wolfe algorithm (CG), as well as their lazified versions from (Braun et al., 2017). We add a prefix L for the lazified versions. Figure 1 summarizes our results on four test problems. Further details and more extensive computational results are reported in Appendix D.

Performance Comparison. We implemented Algorithm 1 as outlined above and used SiGD (Algorithm 2) for the descent steps as described in Section 4. For line search in Line 13 of Algorithm 2, we perform standard backtracking, and for Line 16 of Algorithm 1, we do ternary search.
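A minimal NumPy sketch of the SiGD step (Algorithm 2) follows. The grid line search and the drop tolerance are our own simplifications, standing in for the backtracking just mentioned; this is an illustration, not the authors' implementation.

```python
import numpy as np

def sigd_step(f, grad_f, V, lam):
    """One Simplex Gradient Descent step (a sketch of Algorithm 2).

    V: (k, n) array whose rows are the active vertices; lam: convex
    weights of the current iterate x over V.  Returns updated (V, lam).
    """
    x = lam @ V
    c = V @ grad_f(x)                    # c_i = grad f(x) . v_i
    d = c - c.sum() / len(c)             # project onto hyperplane of simplex
    if np.allclose(d, 0.0):
        return V[:1], np.array([1.0])    # Line 5: return an arbitrary vertex
    pos = d > 0
    eta = np.min(lam[pos] / d[pos])      # largest eta with lam - eta*d >= 0
    tau = lam - eta * d                  # weights of y on the face
    y = tau @ V
    if f(x) >= f(y):                     # drop step: move to the face
        keep = tau > 1e-12
        return V[keep], tau[keep] / tau[keep].sum()
    # Descent step: line search between lam and tau (grid search here).
    ts = np.linspace(0.0, 1.0, 101)
    t = min(ts, key=lambda t: f((lam + t * (tau - lam)) @ V))
    return V, lam + t * (tau - lam)
```

For instance, minimizing $\|x - e_2\|^2$ over the 2-dimensional simplex from the midpoint takes a single drop step to the optimal vertex, shrinking the active set from two vertices to one.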
In Figure 1, each of the four plots itself contains four subplots depicting results of four variants of CG on a single instance. The two subplots in each upper row measure progress in the logarithm (to base 2) of the function value, while the two subplots in each lower row report the logarithm of the gap estimate Φt from Algorithm 1. The subplots in the left column of each plot report performance in terms of the number of iterations, while the subplots in the right column report wall-clock time. As discussed earlier, 2Φt upper-bounds the primal gap (the difference between the function value at the current iterate and the optimal function value). The lazified algorithms (including BCG) halve Φt occasionally, which gives the graphs a stair-like appearance. In implementations, if a stronger bound on the primal gap is available (e.g., from an LP oracle call), we reset Φt to half of that value, thus removing unnecessary successive halving steps. For the non-lazified algorithms, we plot the dual gap $\max_{v \in P} \nabla f(x_t)(x_t - v)$ as a gap estimate. The dual gap does not necessarily decrease in a monotone fashion (though of course the primal gap is monotone decreasing), so the plots have a zigzag appearance in some instances.

6. Final Remarks

In (Lan et al., 2017), an accelerated method based on weak separation and conditional gradient sliding was described. This method provided optimal tradeoffs between (stochastic) first-order oracle calls and weak-separation oracle calls. An open question is whether the same tradeoffs and acceleration could be realized by replacing SiGD (Algorithm 2) with an accelerated method. After an earlier version of our work appeared online, (Kerdreux et al., 2018a) introduced the Hölder Error Bound condition (also known as sharpness or the Łojasiewicz growth condition). This is a family of conditions parameterized by 0 ≤ p ≤ 1, interpolating between strongly convex (p = 0) and convex functions (p = 1).
For such functions, convergence rates of O(1/ε^p) have been shown for Away-step Frank-Wolfe algorithms, among others. Our analysis can be similarly extended to objective functions satisfying this condition, leading to similar convergence rates.

Acknowledgements

We are indebted to Swati Gupta for the helpful discussions. Research reported in this paper was partially supported by NSF CAREER award CMMI-1452463, and also NSF Awards 1628384, 1634597, and 1740707; Subcontract 8F-30039 from Argonne National Laboratory; and Award N660011824020 from the DARPA Lagrange Program.

Figure 1. Four representative examples; each plot reports the log function value and the log gap estimate, per iteration and in wall-clock time, for ACG, BCG, CG, LACG, LCG, LPCG, and PCG. (Upper-left) Sparse signal recovery: $\min_{x \in \mathbb{R}^n : \|x\|_1 \le \tau} \|y - \Phi x\|_2^2$, where Φ is of size 1000 × 3000 with density 0.05. BCG made 1402 iterations with 155 calls to the weak-separation oracle LPsep_P. The final solution is a convex combination of 152 vertices. (Upper-right) Lasso. We solve $\min_{x \in P} \|Ax - b\|^2$ with P being the (scaled) ℓ1-ball. A is a 400 × 2000 matrix with 100 non-zeros. BCG made 2130 iterations, calling LPsep_P 477 times, with the final solution being a convex combination of 462 vertices. (Lower-left) Structured regression over the Birkhoff polytope of dimension 50. BCG made 2057 iterations with 524 calls to LPsep_P. The final solution is a convex combination of 524 vertices. (Lower-right) Video co-localization over the netgen 12b polytope with an underlying 5000-vertex graph. BCG made 140 iterations, with 36 calls to LPsep_P. The final solution is a convex combination of 35 vertices.

References

Bashiri, M. A. and Zhang, X.
Decomposition-invariant conditional gradient for general polytopes with line search. In Advances in Neural Information Processing Systems, pp. 2687 2697, 2017. Braun, G., Pokutta, S., and Zink, D. Lazifying conditional gradient algorithms. Proceedings of ICML, 2017. Frank, M. and Wolfe, P. An algorithm for quadratic programming. Naval research logistics quarterly, 3(1-2): 95 110, 1956. Freund, R. M. and Grigas, P. New analysis and results for the Frank Wolfe method. Mathematical Programming, 155(1):199 230, 2016. ISSN 1436-4646. doi: 10.1007/s10107-014-0841-6. URL http://dx.doi. org/10.1007/s10107-014-0841-6. Freund, R. M., Grigas, P., and Mazumder, R. An extended Frank Wolfe method with in-face directions, and its application to low-rank matrix completion. SIAM Journal on Optimization, 27(1):319 346, 2017. Garber, D. and Hazan, E. A linearly convergent conditional gradient algorithm with applications to online and stochastic optimization. ar Xiv preprint ar Xiv:1301.4666, 2013. Garber, D. and Meshi, O. Linear-memory and decomposition-invariant linearly convergent conditional gradient algorithm for structured polytopes. ar Xiv preprint, ar Xiv:1605.06492v1, May 2016. Garber, D., Sabach, S., and Kaplan, A. Fast generalized conditional gradient method with applications to matrix recovery problems. ar Xiv preprint ar Xiv:1802.05581, 2018. Gu elat, J. and Marcotte, P. Some comments on wolfe s away step . Mathematical Programming, 35(1):110 119, 1986. Gurobi Optimization. Gurobi optimizer reference manual version 6.5, 2016. URL https://www.gurobi. com/documentation/6.5/refman/. Gutman, D. H. and Pe na, J. F. The condition of a function relative to a polytope. ar Xiv preprint ar Xiv:1802.00271, February 2018. Gutman, D. H. and Pe na, J. F. The condition of a function relative to a set. ar Xiv preprint ar Xiv:1901.08359, January 2019. Holloway, C. A. An extension of the Frank and Wolfe method of feasible directions. Mathematical Programming, 6(1):14 27, 1974. Jaggi, M. 
Revisiting Frank Wolfe: Projection-free sparse convex optimization. In Proceedings of the 30th International Conference on Machine Learning (ICML-13), pp. 427 435, 2013. Joulin, A., Tang, K., and Fei-Fei, L. Efficient image and video co-localization with frank-wolfe algorithm. In European Conference on Computer Vision, pp. 253 268. Springer, 2014. Kerdreux, T., d Aspremont, A., and Pokutta, S. Restarting Frank Wolfe. ar Xiv preprint ar Xiv:1810.02429, 2018a. Kerdreux, T., Pedregosa, F., and d Aspremont, A. Frank Wolfe with subsampling oracle. ar Xiv preprint ar Xiv:1803.07348, 2018b. Koch, T., Achterberg, T., Andersen, E., Bastert, O., Berthold, T., Bixby, R. E., Danna, E., Gamrath, G., Gleixner, A. M., Heinz, S., Lodi, A., Mittelmann, H., Ralphs, T., Salvagnin, D., Steffy, D. E., and Wolter, K. MIPLIB 2010. Mathematical Programming Computation, 3(2):103 163, 2011. doi: 10.1007/s12532-011-00259. URL http://mpc.zib.de/index.php/MPC/ article/view/56/28. Lacoste-Julien, S. and Jaggi, M. On the global linear convergence of Frank Wolfe optimization variants. In Cortes, C., Lawrence, N. D., Lee, D. D., Sugiyama, M., and Garnett, R. (eds.), Advances in Neural Information Processing Systems, volume 28, pp. 496 504. Curran Associates, Inc., 2015. URL http: //papers.nips.cc/paper/5925-on-theglobal-linear-convergence-of-frankwolfe-optimization-variants.pdf. Lan, G., Pokutta, S., Zhou, Y., and Zink, D. Conditional accelerated lazy stochastic gradient descent. Proceedings of ICML, 2017. Lan, G. G. Lectures on Optimization for Machine Learning. ISy E, April 2017. Levitin, E. S. and Polyak, B. T. Constrained minimization methods. USSR Computational mathematics and mathematical physics, 6(5):1 50, 1966. Nemirovski, A. and Yudin, D. Problem complexity and method efficiency in optimization. Wiley, 1983. ISBN 0-471-10345-4. Rao, N., Shah, P., and Wright, S. Forward backward greedy algorithms for atomic norm regularization. 
IEEE Transactions on Signal Processing, 63(21):5798 5811, 2015.