# On Frank-Wolfe and Equilibrium Computation

Jacob Abernethy, Georgia Institute of Technology, prof@gatech.edu
Jun-Kun Wang, Georgia Institute of Technology, jimwang@gatech.edu

We consider the Frank-Wolfe (FW) method for constrained convex optimization, and we show that this classical technique can be interpreted from a different perspective: FW emerges as the computation of an equilibrium (saddle point) of a special convex-concave zero-sum game. This saddle-point trick relies on the existence of no-regret online learning both to generate a sequence of iterates and to provide a proof of convergence through vanishing regret. We show that our stated equivalence has several nice properties, as it exhibits a modularity that gives rise to various old and new algorithms. We explore a few such resulting methods, and provide experimental results to demonstrate correctness and efficiency.

1 Introduction

There has been a burst of interest in a technique known as the Frank-Wolfe method (FW) [10], also known as conditional gradient, for solving constrained optimization problems. FW is entirely a first-order method, does not require any projection operation, and instead relies on access to a linear optimization oracle. Given a compact and convex constraint set $X \subseteq \mathbb{R}^d$, we require the ability to (quickly) answer queries of the form $\mathcal{O}(v) := \arg\min_{x \in X} x^\top v$, for any vector $v \in \mathbb{R}^d$. Other techniques such as gradient descent methods require repeated projections onto the constraint set, which can be prohibitively expensive. Interior point algorithms, such as Newton path-following schemes [1], require computing a Hessian inverse at each iteration, which generally does not scale well with the dimension.

In the present paper we aim to give a new perspective on the Frank-Wolfe method by showing that, in a broad sense, it can be viewed as a special case of equilibrium computation via online learning. Indeed, when the optimization objective is cast as a particular convex-concave payoff function, we are able to extract the desired optimal point via the equilibrium of the associated zero-sum game.

Within machine learning there has been a lot of attention paid to the computation of optimal strategies for zero-sum games using online learning techniques. An amazing result, attributed to [12] yet now practically folklore in the literature, says that we can compute the optimal equilibrium in a zero-sum game by pitting two online learning strategies against each other and, as long as they achieve the desired regret-minimization guarantee, the long-run empirical average of their actions (strategy choices) must converge to the optimal equilibrium. This trick is both very beautiful and extremely useful: it was in some sense the core of early work in boosting [11], has been shown to generalize many linear programming techniques [3], serves as the key tool for recent advances in flow optimization problems [8], and has been instrumental in understanding differential privacy [9].

We begin in Section 2 by reviewing the method of proving a generalized minimax theorem using regret minimization, and we show that this proof is actually constructive and gives rise to a generic meta-algorithm. This meta-algorithm is especially modular, and allows for the substitution of various algorithmic tools that achieve, up to convergence rates, essentially the same core result.
We then show that the original Frank-Wolfe algorithm is simply one instantiation of this meta-algorithm, where the convergence rate follows as a trivial consequence of the main theorem, albeit with an additional $O(\log T)$ factor. We build upon this by showing that a number of variants of Frank-Wolfe are also simple instantiations of our meta-algorithm, with convergence rates that follow easily. For example, we propose a cumulative-gradient variant of Frank-Wolfe and prove that the same guarantee holds, yet it relies on a potentially more stable optimization oracle. We show that the stochastic-smoothing technique of [31] corresponds to implementing a Follow-the-Perturbed-Leader variant of our meta-algorithm. And finally, we use our framework to prove an entirely new result, showing that one obtains an $O(\log T / T)$ convergence rate even when the objective $f(\cdot)$ is not smooth, provided instead that the constraint set is strongly convex.

The results laid out in this paper provide value not only in proving rates and establishing new and existing algorithms, but also in setting forth a perspective on Frank-Wolfe-style methods that can leverage the wealth of results available from online learning and online convex optimization. At present, the possibilities and limits of various online learning problems have been thoroughly worked out [20, 7], with incredibly tight bounds. Using the connections we put forth, many of these results can provide a stronger theoretical framework for understanding projection-free conditional gradient methods.

Related works. Among projection-free algorithms, [25] gives an analysis of FW for smooth objectives and shows that FW converges at an $O(1/T)$ rate even when the linear oracle is solved approximately, under certain conditions. [30] develops a block-wise update strategy for FW on the dual objective of structural SVMs, where only a subset of the dual variables is updated at each iteration. In that algorithm, a smaller oracle is called due to the block-wise update, which reduces the computational time per iteration and leads to an overall speedup. [37] proposes updating multiple blocks at a time. [34] proposes using various measures to select a block for update. In another direction, some results have aimed at obtaining improved convergence rates. [14] shows that for strongly convex and smooth objective functions, FW can achieve an $O(1/T^2)$ convergence rate over a strongly convex set. [13, 15] first show that one can achieve linear convergence for strongly convex and smooth objectives over polytopes using a projection-free algorithm; the algorithm constructs a stronger oracle, which can be efficiently implemented for certain polytopes like the simplex. [29] shows that some variants of FW, such as away-step FW [38] or pairwise FW, enjoy an exponential convergence rate when the feasible set is a polytope. [5] provides a refined analysis of away-step FW. [17] extends [29] to some saddle-point optimization problems, where the constraint set is assumed to be a polytope and the objective is required to be strongly convex in one variable and strongly concave in the other. A drawback of away-step FW [38] is that it requires storing the previous outputs from the oracle. Very recently, [16] developed a new variant that avoids this issue for specific polytopes and also enjoys exponential convergence for strongly convex and smooth objectives.
Note that all of the exponential convergence results depend on some geometric properties of the underlying polytope. Other works include variants for the stochastic setting [23], the online learning setting [22], minimizing certain structured norms [19, 39], or reducing the number of gradient evaluations [32]. There is also a connection between subgradient descent and FW; Bach [4] shows that for certain types of objectives, subgradient descent applied in the primal domain is equivalent to FW applied in the dual domain.

Preliminaries and Notation

Definition 1. A convex set $Y \subseteq \mathbb{R}^m$ is an $\alpha$-strongly convex set w.r.t. a norm $\|\cdot\|$ if for any $u, v \in Y$ and any $\theta \in [0, 1]$, the ball centered at $\theta u + (1-\theta)v$ with radius $\theta(1-\theta)\frac{\alpha}{2}\|u - v\|^2$ is contained in $Y$. Please see [14] for examples of strongly convex sets.

Definition 2. A function $f$ is $\beta$-strongly smooth w.r.t. a norm $\|\cdot\|$ if $f$ is everywhere differentiable and $f(u) \le f(v) + \nabla f(v)^\top (u - v) + \frac{\beta}{2}\|u - v\|^2$. A function $f$ is $\beta$-strongly convex w.r.t. a norm $\|\cdot\|$ if $f(u) \ge f(v) + \nabla f(v)^\top (u - v) + \frac{\beta}{2}\|u - v\|^2$.

Definition 3. For a convex function $f(\cdot)$, its Fenchel conjugate is $f^*(x) := \sup_y \langle x, y\rangle - f(y)$.

Note that if $f$ is convex then so is its conjugate $f^*$, since it is defined as a supremum over linear functions of $x$ [6]. Furthermore, the biconjugate $f^{**}$ equals $f$ if and only if $f$ is closed and convex. It is known that $f$ is $\beta$-strongly convex w.r.t. $\|\cdot\|$ if and only if $f^*$ is $\frac{1}{\beta}$-strongly smooth w.r.t. the dual norm [26], assuming that $f$ is a closed and convex function.

2 Minimax Duality via No-Regret Learning

2.1 Brief review of online learning

In the task of online convex optimization, we assume a learner is provided with a compact and convex set $K \subseteq \mathbb{R}^n$ known as the decision set. Then, in an online fashion, the learner is presented with a sequence of $T$ loss functions $\ell_1(\cdot), \ell_2(\cdot), \ldots, \ell_T(\cdot) : K \to \mathbb{R}$. On each round $t$, the learner must select a point $x_t \in K$, and is then charged a loss of $\ell_t(x_t)$ for this choice. Typically it is assumed that, when the learner selects $x_t$ on round $t$, she has observed all loss functions $\ell_1(\cdot), \ldots, \ell_{t-1}(\cdot)$ up to, but not including, time $t$. However, we will also consider learners that are prescient, i.e. that can choose $x_t$ with knowledge of the loss functions up to and including time $t$. The objective of interest in most of the online learning literature is the learner's regret, defined as $R_T := \sum_{t=1}^T \ell_t(x_t) - \min_{x \in K} \sum_{t=1}^T \ell_t(x)$. Oftentimes we will want to refer to the average regret, the regret normalized by the time horizon $T$, which we will call $\overline{R}_T := \frac{R_T}{T}$. What has become a cornerstone of online learning research is the existence of no-regret algorithms, i.e. learning strategies that guarantee $\overline{R}_T \to 0$ as $T \to \infty$. Let us consider three very simple learning strategies, and note the available guarantee for each.

(Follow The Leader) Perhaps the most natural algorithm one might think of is to simply select $x_t$ as the best point in hindsight. That is, the learner chooses $x_t = \arg\min_{x \in K} \sum_{s=1}^{t-1} \ell_s(x)$.

Lemma 1 ([21]). If each $\ell_t(\cdot)$ is 1-Lipschitz and 1-strongly convex, then Follow The Leader achieves $\overline{R}_T \le \frac{c \log T}{T}$ for some constant $c$.

(Be The Leader) When the learner is prescient, we can do slightly better than Follow The Leader by incorporating the current loss function: $x_t = \arg\min_{x \in K} \sum_{s=1}^{t} \ell_s(x)$. This algorithm was named Be The Leader by [28], who also proved that it actually guarantees non-positive regret!

Lemma 2 ([28]). For any sequence of loss functions, Be The Leader achieves $\overline{R}_T \le 0$.
(Best Response) But perhaps the most trivial strategy for a prescient learner is to ignore the history of the $\ell_s$'s, and simply play the best choice of $x_t$ on the current round. We call this algorithm Best Response, defined as $x_t = \arg\min_{x \in K} \ell_t(x)$. A quick inspection reveals that Best Response satisfies $\overline{R}_T \le 0$.

2.2 Minimax Duality

The celebrated minimax theorem for zero-sum games, first discovered by John von Neumann in the 1920s [36, 33], is certainly a foundational result in the theory of games. It states that two players, playing a game with zero-sum payoffs, each have an optimal randomized strategy that can be played obliviously; that is, even announcing their strategy in advance to an optimal opponent would not damage their own respective payoff, in expectation.

In this paper we will focus on a more general minimax result, establishing duality for a class of convex-concave games, and we will show how this theorem can be proved without the need for Brouwer's Fixed Point Theorem [27]. The key inequality can be established through the use of no-regret learning strategies in online convex optimization, which we detail in the following section. The theorem below can also be proved using Sion's Minimax Theorem [35].

Theorem 1. Let $X, Y$ be compact convex subsets of $\mathbb{R}^n$ and $\mathbb{R}^m$ respectively. Let $g : X \times Y \to \mathbb{R}$ be convex in its first argument and concave in its second. Then we have that
$$\min_{x \in X} \max_{y \in Y} g(x, y) = \max_{y \in Y} \min_{x \in X} g(x, y). \qquad (1)$$

We want to emphasize that a meta-algorithm (Algorithm 1) actually emerges from our proof of Theorem 1; please see the supplementary material for details. It is important to point out that the meta-algorithm, as a routine for computing equilibria, is certainly not a novel technique; it has served implicitly as the underpinning of many works, including those already mentioned [11, 9, 8].

Algorithm 1 Meta-algorithm for equilibrium computation
1: Input: convex-concave payoff $g : X \times Y \to \mathbb{R}$, algorithms $\mathrm{OAlg}^X$ and $\mathrm{OAlg}^Y$
2: for $t = 1, 2, \ldots, T$ do
3:   $x_t := \mathrm{OAlg}^X\big(g(\cdot, y_1), \ldots, g(\cdot, y_{t-1})\big)$
4:   $y_t := \mathrm{OAlg}^Y\big(g(x_1, \cdot), \ldots, g(x_{t-1}, \cdot), g(x_t, \cdot)\big)$
5: end for
6: Output: $\bar{x}_T := \frac{1}{T}\sum_{t=1}^T x_t$ and $\bar{y}_T := \frac{1}{T}\sum_{t=1}^T y_t$

We close this section by summarizing the approximate equilibrium computation guarantee that follows from the above algorithm. This result is classical, and we explore it in greater detail in the Appendix. We let $\bar{x}_T := \frac{1}{T}\sum_{t=1}^T x_t$ and $\bar{y}_T := \frac{1}{T}\sum_{t=1}^T y_t$, and let $V^*$ be the value of the game, which is the quantity in (1).

Theorem 2. Algorithm 1 outputs $\bar{x}_T$ and $\bar{y}_T$ satisfying
$$\max_{y \in Y} g(\bar{x}_T, y) \le V^* + \epsilon_T + \delta_T \quad\text{and}\quad \min_{x \in X} g(x, \bar{y}_T) \ge V^* - (\epsilon_T + \delta_T), \qquad (2)$$
as long as $\mathrm{OAlg}^X$ and $\mathrm{OAlg}^Y$ guarantee average regret bounded by $\epsilon_T$ and $\delta_T$, respectively.

3 Relation to the Frank-Wolfe Method

We now return our attention to the problem of constrained optimization, and we review the standard Frank-Wolfe algorithm. We then use the technology presented in the previous section to recast Frank-Wolfe as an equilibrium computation, and we show that indeed the vanilla algorithm is an instantiation of our meta-algorithm (Alg. 1). We then proceed to show that the modularity of the minimax duality perspective allows us to immediately reproduce existing variants of Frank-Wolfe, as well as construct new algorithms, with convergence rates provided immediately by Theorem 2.

To begin, let us assume that we have a compact set $Y \subseteq \mathbb{R}^n$ and a convex function $f : Y \to \mathbb{R}$. Our primary goal is to solve the objective
$$\min_{y \in Y} f(y). \qquad (3)$$
We say that $y_0$ is an $\epsilon$-approximate solution as long as $f(y_0) - \min_{y \in Y} f(y) \le \epsilon$.
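Before specializing to Frank-Wolfe, it may help to see Algorithm 1 in code on the simplest convex-concave example: a bilinear game $g(x, y) = x^\top A y$ over probability simplices. The minimal sketch below uses Hedge (multiplicative weights) for the $x$-player and Best Response for the $y$-player; the payoff matrix, the step size, and all function names are illustrative choices of ours rather than anything prescribed in the text, and by Theorem 2 any pair of no-regret subroutines would serve.

```python
import numpy as np

def hedge_vs_best_response(A, T=2000, eta=None):
    """Sketch of the equilibrium meta-algorithm (Alg. 1) on the bilinear game
    g(x, y) = x^T A y: the x-player (minimizer) runs Hedge over the simplex,
    the y-player (maximizer) plays prescient Best Response to the current x_t."""
    n, m = A.shape
    if eta is None:
        eta = np.sqrt(np.log(n) / T)          # a standard Hedge step size
    cum_loss = np.zeros(n)                    # cumulative loss vector seen by Hedge
    x_avg, y_avg = np.zeros(n), np.zeros(m)
    for t in range(1, T + 1):
        # x-player: Hedge against the past losses l_s(x) = x^T A y_s, s < t
        logits = -eta * (cum_loss - cum_loss.min())   # shift for numerical stability
        x = np.exp(logits)
        x /= x.sum()
        # y-player: Best Response, y_t = argmax_y x_t^T A y (a simplex vertex)
        y = np.zeros(m)
        y[np.argmax(A.T @ x)] = 1.0
        cum_loss += A @ y                     # record the new loss vector for Hedge
        x_avg += x / T
        y_avg += y / T
    # duality gap of the averaged strategies; Theorem 2 bounds it by eps_T + delta_T
    gap = np.max(x_avg @ A) - np.min(A @ y_avg)
    return x_avg, y_avg, gap

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    A = rng.standard_normal((5, 4))           # an arbitrary payoff matrix for illustration
    x_bar, y_bar, gap = hedge_vs_best_response(A)
    print("duality gap of averaged strategies:", gap)
```

The reported duality gap of the averaged strategies plays the role of $\epsilon_T + \delta_T$ in Theorem 2 and shrinks as $T$ grows.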
3.1 A Brief Overview of Frank-Wolfe

Algorithm 2 Standard Frank-Wolfe algorithm
1: Input: objective $f : Y \to \mathbb{R}$, oracle $\mathcal{O}(\cdot)$, learning rates $\{\gamma_t \in [0,1]\}_{t=1,2,\ldots}$, initial point $w_0 \in Y$
2: for $t = 1, 2, 3, \ldots, T$ do
3:   $v_t \leftarrow \mathcal{O}(\nabla f(w_{t-1})) = \arg\min_{v \in Y} \langle v, \nabla f(w_{t-1})\rangle$
4:   $w_t \leftarrow (1 - \gamma_t)\, w_{t-1} + \gamma_t v_t$
5: end for
6: Output: $w_T$

The standard Frank-Wolfe algorithm (Algorithm 2) consists of making repeated calls to a linear optimization oracle (line 3), followed by a convex averaging step of the current iterate and the oracle's output (line 4). It initializes $w_0$ in the constraint set $Y$. Because of the convex-combination step, the iterate $w_t$ always remains within the constraint set, which is why the method is called projection-free. We restate a proposition from [10], who established the convergence rate of their algorithm.

Theorem 3 ([10]). Assume that $f(\cdot)$ is 1-strongly smooth. If Algorithm 2 is run for $T$ rounds, then there exists a sequence $\{\gamma_t\}$ such that the output $w_T$ is an $O\big(\frac{1}{T}\big)$-approximate solution to (3).

It is worth noting that the typical learning rate used throughout the literature is $\gamma_t = \frac{2}{2+t}$ [31, 25]. This choice emerges as the result of a recursive inequality.

3.2 Frank-Wolfe via the Meta-Algorithm

We now show that the meta-algorithm generalizes Frank-Wolfe, and provides a much more modular framework for producing similar algorithms. We will develop some of these novel methods and establish their convergence via Theorem 2. In order to utilize minimax duality, we have to define decision sets for two players, and we must produce a convex-concave payoff function. First we will assume, for convenience, that $f(y) := +\infty$ for any $y \notin Y$. That is, $f$ takes the value $+\infty$ outside of the convex and compact set $Y$, which ensures that $f$ is lower semi-continuous and convex. Now, let the $x$-player be given the set $X := \{\nabla f(y) : y \in Y\}$. One can check that the closure of the set $X$ is a convex set; please see Appendix 2 for the proof.

Theorem 4. The closure of the (sub-)gradient space $\{\nabla f(y) : y \in Y\}$ is a convex set.

The $y$-player's decision set will be $Y$, the constraint set of the primary objective (3). The payoff $g(\cdot, \cdot)$ will be defined as
$$g(x, y) := -x^\top y + f^*(x). \qquad (4)$$
The function $f^*(\cdot)$ is the Fenchel conjugate of $f$. We observe that $g(x, y)$ is indeed linear, and hence concave, in $y$, and it is also convex in $x$. Let us notice a few things about this particular game. Looking at the max-min expression,
$$\max_{y \in Y} \min_{x \in X} g(x, y) = \max_{y \in Y}\Big(-\max_{x \in X}\big\{x^\top y - f^*(x)\big\}\Big) = -\min_{y \in Y} f(y) = V^*, \qquad (5)$$
which follows by the fact that $f^{**} = f$.¹ Note, crucially, that the last term above corresponds to the objective we want to solve, up to a minus sign. Any $\bar{y}$ which is an $\epsilon$-approximate equilibrium strategy for the $y$-player will also be an $\epsilon$-approximate solution to (3). We now present the main result of this section, which is the connection between Frank-Wolfe (Alg. 2) and Alg. 1.

Theorem 5. When both are run for exactly $T$ rounds, the output $\bar{y}_T$ of Algorithm 1 is identically the output $w_T$ of Algorithm 2 as long as: (I) the initial $x_1$ in Alg. 1 equals $\nabla f(w_0)$ in Alg. 2; (II) Alg. 2 uses learning rate $\gamma_t := \frac{1}{t}$; (III) Alg. 1 receives the payoff $g(\cdot,\cdot)$ defined in (4); (IV) Alg. 1 sets $\mathrm{OAlg}^X$ := Follow The Leader; (V) Alg. 1 sets $\mathrm{OAlg}^Y$ := Best Response.

Proof. We will prove that the following three equalities are maintained throughout both algorithms, where the objects on the left correspond to Alg. 1 and those on the right to Alg. 2:
$$x_t = \nabla f(w_{t-1}), \qquad (6)$$
$$y_t = v_t, \qquad (7)$$
$$\bar{y}_t = w_t. \qquad (8)$$
We first note that the first condition of the theorem ensures that (6) holds for $t = 1$.
Second, the choice of learning rate $\gamma_t = \frac{1}{t}$ already guarantees that (7) implies (8), since this choice of rate ensures that $w_t$ is always a uniform average of the updates $v_1, \ldots, v_t$. It remains to establish (6) and (7) via induction. We begin with the former. Recall that $x_t$ is selected via Follow The Leader against the sequence of loss functions $\ell_t(\cdot) := g(\cdot, y_t)$. To write precisely what this means,
$$x_t := \arg\min_{x \in X} \Big\{ \tfrac{1}{t-1}\sum_{s=1}^{t-1} \ell_s(x) \Big\} = \arg\min_{x \in X} \Big\{ \tfrac{1}{t-1}\sum_{s=1}^{t-1} \big(-y_s^\top x + f^*(x)\big) \Big\} = \arg\max_{x \in X} \big\{ \bar{y}_{t-1}^\top x - f^*(x) \big\} = \nabla f(\bar{y}_{t-1}).$$
The final equality follows as a result of the Legendre transform [6]. Of course, by induction, we have that $\bar{y}_{t-1} = w_{t-1}$, and hence we have established (6).

¹ It was important how we defined $X$ here, as the Fenchel conjugate takes the value $+\infty$ at any point $x \notin \{\nabla f(y) : y \in Y\}$; hence the unconstrained supremum is the same as $\max_{x \in X}(\cdot)$.

Finally, let us consider how $y_t$ is chosen according to Best Response. Recall that the sequence of loss functions presented to the $y$-player is $h_t(\cdot) := -g(x_t, \cdot)$, the negated payoff, since she is the maximizing player. Utilizing Best Response for this sequence implies that
$$y_t = \arg\min_{y \in Y} h_t(y) = \arg\min_{y \in Y} \big\{ x_t^\top y - f^*(x_t) \big\} = \arg\min_{y \in Y} \nabla f(\bar{y}_{t-1})^\top y \;\;\text{(using (6) by induction)} = \arg\min_{y \in Y} \nabla f(w_{t-1})^\top y,$$
which is exactly $v_t$; the last equality follows by induction via (8). This completes the proof.

Note that the algorithm does not need to compute the conjugate $f^*$. While the Frank-Wolfe algorithm can be viewed as implicitly operating on the conjugate, it does so only through the expression $\arg\max_{x \in X}\{\bar{y}_{t-1}^\top x - f^*(x)\}$. This operation does not need to be computed in the naive way (i.e. by first computing $f^*$ and then doing the maximization); instead, the expression boils down to $\nabla f(\bar{y}_{t-1})$, which is just a gradient computation!

The equivalence we just established has several nice features, but it does not by itself provide a convergence rate for Algorithm 2. This should perhaps not be surprising, as nowhere did we use the smoothness of $f$ in the equivalence. Instead, the rate follows via a key application of Theorem 2, utilizing the fact that $f^*$ is strongly convex on the interior of the set $X$,² which grants Follow The Leader a logarithmic regret rate.

Corollary 1. Assume that $f(\cdot)$ is 1-strongly smooth. Then Algorithm 2, with learning rate $\gamma_t := \frac{1}{t}$, outputs $w_T$ with approximation error $O\big(\frac{\log T}{T}\big)$.

Proof. As a result of Theorem 5, we have established that Alg. 2 is a special case of Alg. 1, with the parameters laid out in the previous theorem. As a result of Theorem 2, the approximation error of $w_T$ is precisely the error $\epsilon_T + \delta_T$ of the point $\bar{y}_T$ when generated via Alg. 1 with subroutines $\mathrm{OAlg}^X$ := Follow The Leader and $\mathrm{OAlg}^Y$ := Best Response, assuming that these two learning algorithms guarantee average regret no more than $\epsilon_T$ and $\delta_T$, respectively. We noted that Best Response does not suffer regret, so $\delta_T = 0$. To bound the regret of Follow The Leader on the sequence of functions $g(\cdot, y_1), \ldots, g(\cdot, y_T)$, we observe that the smoothness of $f$ implies that $f^*$ is 1-strongly convex, which in turn implies that $g(x, y_t) = -x^\top y_t + f^*(x)$ is also 1-strongly convex (in $x$). Hence Lemma 1 guarantees that Follow The Leader has average regret $\epsilon_T := O\big(\frac{\log T}{T}\big)$, which completes the proof.

We emphasize that the above result leans entirely on existing work on regret bounds for online learning, and those tools are doing the heavy lifting. We explore this further in the following section.

4 Frank-Wolfe-style Algorithms, New and Old

We now have a factory for generating new algorithms using the approach laid out in Section 3.
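As a concrete illustration of the equivalence in Theorem 5 and the rate in Corollary 1, the sketch below runs standard Frank-Wolfe with $\gamma_t = 1/t$ written in its two-player form: the $x$-player's Follow The Leader step collapses to a gradient evaluation at the running average, and the $y$-player's Best Response is exactly the linear oracle call. The $\ell_1$-ball oracle and the quadratic objective are placeholder choices of ours, assumed only so the example runs end to end.

```python
import numpy as np

def lmo_l1_ball(grad, radius=1.0):
    """Linear minimization oracle for the l1 ball: argmin_{||v||_1 <= r} <grad, v>."""
    i = np.argmax(np.abs(grad))
    v = np.zeros_like(grad)
    v[i] = -radius * np.sign(grad[i])      # the minimizing vertex is a signed basis vector
    return v

def frank_wolfe_two_player(grad_f, lmo, w0, T=500):
    """Standard Frank-Wolfe with gamma_t = 1/t, phrased as the game of Theorem 5:
    FTL for the x-player is a gradient at the running average, Best Response for
    the y-player is the linear oracle, and w_t is the uniform average of the v's."""
    w = w0.copy()
    for t in range(1, T + 1):
        x = grad_f(w)                      # x_t = grad f(w_{t-1})  (FTL step)
        v = lmo(x)                         # v_t = argmin_{v in Y} <v, x_t>  (Best Response)
        w = ((t - 1) * w + v) / t          # gamma_t = 1/t keeps w_t a uniform average
    return w

if __name__ == "__main__":
    c = np.array([0.9, -0.3, 0.2])                    # placeholder target vector
    grad_f = lambda y: y - c                          # gradient of f(y) = 0.5 * ||y - c||^2
    w0 = np.zeros(3)                                  # any point inside the l1 ball
    w = frank_wolfe_two_player(grad_f, lmo_l1_ball, w0)
    print("approximate minimizer over the l1 ball:", w)
```

Any other constraint set with a cheap linear minimization oracle could be substituted for `lmo_l1_ball` without changing the loop.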
Theorem 5 shows that the standard Frank-Wolfe algorithm (with a particular learning rate) is obtained via the meta-algorithm using two particular online learning algorithms $\mathrm{OAlg}^X$ and $\mathrm{OAlg}^Y$. But we have full discretion to choose these two algorithms, as long as they provide the appropriate regret guarantees to ensure convergence.

4.1 Cumulative Gradients

We begin with one simple variant, which we call Cumulative-Gradient Frank-Wolfe, laid out in Algorithm 3. The one significant difference from vanilla Frank-Wolfe is that the linear optimization oracle receives as input the average of the gradients obtained thus far, as opposed to only the most recent one.

² We only need to assume $f$ is smooth on the interior of $Y$ to get the result (that $f$ is technically not smooth outside of $Y$ is not particularly relevant). The result that $f^*$ is strongly convex on the interior of the set $X$ is essentially proven by [26] in their appendix; this argument has been made elsewhere in various forms in the literature (e.g. [18]).

Algorithm 3 Cumulative-Gradient Frank-Wolfe
1: Initialize: any $w_0 \in Y$.
2: for $t = 1, 2, 3, \ldots, T$ do
3:   $v_t \leftarrow \arg\min_{v \in Y} \big\langle v,\; \tfrac{1}{t}\sum_{s=1}^{t} \nabla f(w_{s-1}) \big\rangle$
4:   $w_t \leftarrow (1 - \gamma_t)\, w_{t-1} + \gamma_t v_t$
5: end for
6: Output: $w_T$

The proof of convergence requires little effort.

Corollary 2. Assume that $f(\cdot)$ is 1-strongly smooth. Then Algorithm 3, with learning rate $\gamma_t := \frac{1}{t}$, outputs $w_T$ with approximation error $O\big(\frac{\log T}{T}\big)$.

Proof. The result follows almost identically to Corollary 1. A quick inspection verifies that the new linear optimization subroutine corresponds to implementing Be The Leader as $\mathrm{OAlg}^Y$ instead of Best Response. However, both Best Response and Be The Leader have non-positive regret ($\delta_T \le 0$; Lemma 2 in the supplementary material), and thus they yield the same convergence.

We note that a similar algorithm to the above can be found in [31], although their results consider more general weighted averages of the gradients.

4.2 Perturbation Methods and Stochastic Smoothing

Looking carefully at the proof of Corollary 1, the suitability of Follow The Leader for the vanilla FW analysis relies heavily on the strong convexity of the functions $\ell_t(\cdot) := g(\cdot, y_t)$, which in turn results from the smoothness of $f(\cdot)$. But what about when $f(\cdot)$ is not smooth; is there an alternative algorithm available? One of the nice techniques to grow out of the online learning community is the use of perturbations as a type of regularization to obtain vanishing regret guarantees [28]; their method is known as Follow the Perturbed Leader (FTPL). The main idea is to solve an optimization problem that has a random linear function added to the input, and to select³ as $x_t$ the expectation of the arg min under this perturbation. More precisely,
$$x_t := \mathbb{E}_Z\Big[ \arg\min_{x \in X} \Big\{ Z^\top x + \sum_{s=1}^{t-1} \ell_s(x) \Big\} \Big].$$
Here $Z$ is a random vector drawn according to an appropriately-chosen distribution, and $\ell_s(x)$ is the loss function of the $x$-player on round $s$; with the payoff function $g$ defined in (4), $\ell_s(x)$ is $-x^\top y_s + f^*(x)$. One can show that, as long as $Z$ is chosen from the right distribution, this algorithm guarantees average regret on the order of $O\big(\frac{1}{\sqrt{T}}\big)$, although obtaining the correct dimension dependence relies on careful probabilistic analysis. Recent work of [2] shows that the analysis of perturbation-style algorithms reduces to curvature properties of a stochastically-smoothed Fenchel conjugate.
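To make the FTPL step concrete in the simplest setting it was designed for, online linear optimization, the sketch below approximates $x_t = \mathbb{E}_Z\big[\arg\min_x \{Z^\top x + \sum_{s<t}\ell_s(x)\}\big]$ over the probability simplex by Monte-Carlo sampling, so that each perturbed leader is simply a vertex. The uniform perturbation, its scale, and the sample count are illustrative choices of ours and not the tuned quantities analyzed in [28].

```python
import numpy as np

def ftpl_step(cum_loss, scale=10.0, n_samples=200, rng=None):
    """One Follow-the-Perturbed-Leader step for linear losses over the simplex:
    x_t = E_Z[ argmin_{x in simplex} (cum_loss + Z)^T x ],
    with the expectation approximated by Monte-Carlo sampling."""
    rng = rng or np.random.default_rng()
    n = cum_loss.shape[0]
    x = np.zeros(n)
    for _ in range(n_samples):
        z = rng.uniform(0.0, scale, size=n)            # random linear perturbation
        x[np.argmin(cum_loss + z)] += 1.0 / n_samples  # each perturbed leader is a vertex
    return x                                           # the (approximate) expected arg min

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    cum_loss = np.zeros(4)
    for t in range(1, 51):
        x_t = ftpl_step(cum_loss, rng=rng)
        loss_vec = rng.standard_normal(4)              # a stand-in loss vector for round t
        cum_loss += loss_vec
    print("FTPL play at round 50:", x_t)
```

In the Frank-Wolfe game the inner arg min instead ranges over $X$ and includes the $f^*(x)$ term, which is what produces the smoothed-gradient interpretation derived next.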
What is intriguing about this perturbation approach is that it ends up being equivalent to an existing method proposed by [31] (Section 3.3), which also uses a stochastically smoothed objective function. We note that
$$\mathbb{E}_Z\Big[\arg\min_{x \in X}\Big\{Z^\top x + \sum_{s=1}^{t-1}\ell_s(x)\Big\}\Big] = \mathbb{E}_Z\Big[\arg\max_{x \in X}\big\{\big(\bar{y}_{t-1} - \tfrac{Z}{t-1}\big)^\top x - f^*(x)\big\}\Big] = \mathbb{E}_Z\big[\nabla f\big(\bar{y}_{t-1} - \tfrac{Z}{t-1}\big)\big] = \nabla \tilde{f}_{t-1}(\bar{y}_{t-1}), \qquad (9)$$
where $\tilde{f}_\alpha(x) := \mathbb{E}[f(x - Z/\alpha)]$. [31] suggests using precisely this modified $f$, and proves a rate on the order of $O\big(\frac{1}{\sqrt{T}}\big)$. As discussed, the same would follow from the vanishing regret of FTPL.

³ Technically speaking, the results of [28] only considered linear loss functions, and hence their analysis did not require taking averages over the input perturbation. While we will not address computational issues here due to space, actually computing the expected arg min is indeed non-trivial.

4.3 Boundary Frank-Wolfe

Algorithm 4 Modified meta-algorithm, with swapped roles
1: Input: convex-concave payoff $g : X \times Y \to \mathbb{R}$, algorithms $\mathrm{OAlg}^X$ and $\mathrm{OAlg}^Y$
2: for $t = 1, 2, \ldots, T$ do
3:   $y_t := \mathrm{OAlg}^Y\big(g(x_1, \cdot), \ldots, g(x_{t-1}, \cdot)\big)$
4:   $x_t := \mathrm{OAlg}^X\big(g(\cdot, y_1), \ldots, g(\cdot, y_{t-1}), g(\cdot, y_t)\big)$
5: end for
6: Output: $\bar{x}_T := \frac{1}{T}\sum_{t=1}^T x_t$ and $\bar{y}_T := \frac{1}{T}\sum_{t=1}^T y_t$

The meta-algorithm previously discussed assumed that the $x$-player acts first, followed by the $y$-player, who is allowed to be prescient. Here we reverse their roles and instead allow the $x$-player to be prescient. The new meta-algorithm is described in Algorithm 4. We are going to show that this framework leads to a new projection-free algorithm that works for non-smooth objective functions. Specifically, if the constraint set is strongly convex, then this yields a novel projection-free algorithm with $O(\log T/T)$ convergence even for non-smooth objective functions. The result relies on very recent work showing that Follow The Leader over strongly convex sets [24] enjoys an $O(\log T)$ regret rate. Prior work has considered strongly convex decision sets [14], yet with the additional assumption that the objective is smooth and strongly convex, leading to $O(1/T^2)$ convergence. Boundary Frank-Wolfe requires neither smoothness nor strong convexity of the objective. What we have shown, essentially, is that a strongly convex boundary of the constraint set can be used in place of smoothness of $f(\cdot)$ in order to achieve $O(1/T)$ convergence.

Algorithm 5 Boundary Frank-Wolfe
1: Input: objective $f : Y \to \mathbb{R}$, oracle $\mathcal{O}(\cdot)$ for $Y$, initial point $y_1 \in Y$.
2: for $t = 2, 3, \ldots, T$ do
3:   $y_t \leftarrow \arg\min_{y \in Y} \big\langle y,\; \tfrac{1}{t-1}\sum_{s=1}^{t-1} \nabla f(y_s) \big\rangle$
4: end for
5: Output: $\bar{y}_T = \frac{1}{T}\sum_{t=1}^T y_t$

We may now prove a result about Algorithm 5 using the same techniques laid out in Theorem 5.

Theorem 6. Algorithm 5 is an instance of Algorithm 4 if (I) the initial $y_1$ in Alg. 5 equals $y_1$ in Alg. 4; (II) Alg. 4 sets $\mathrm{OAlg}^Y$ := Follow The Leader; and (III) Alg. 4 sets $\mathrm{OAlg}^X$ := Best Response. Furthermore, when $Y$ is strongly convex and $\sum_{s=1}^{t} \nabla f(y_s)$ has non-zero norm for every $t$, then
$$f(\bar{y}_T) - \min_{y \in Y} f(y) = O\Big(\frac{M \log T}{L_T\, T}\Big),$$
where $M := \sup_{y \in Y} \|\nabla f(y)\|$, $L_T := \min_{1 \le t \le T} \|\Theta_t\|$, and $\Theta_t := \frac{1}{t}\sum_{s=1}^{t} \nabla f(y_s)$.

Proof. Please see Appendix 3 for the proof.

Note that the rate depends crucially on $L_T$, the smallest norm of the averaged gradients computed during the optimization. Depending on the underlying optimization problem, $L_T$ can be as small as $O(1/\sqrt{T})$ or can even be 0. Now let us discuss when Boundary FW works, namely the condition that keeps the cumulative gradient nonzero. If a convex combination of gradients equals 0, then clearly 0 lies in the convex hull of the subgradients $\nabla f(x)$ for boundary points $x$.
Since the closure of $\{\nabla f(x) : x \in Y\}$ is convex according to Theorem 4, this implies that $0 \in \mathrm{cl}(\{\nabla f(x) : x \in Y\})$. If we know in advance that $0 \notin \mathrm{cl}(\{\nabla f(x) : x \in Y\})$, we are assured that the cumulative gradient will not vanish. Hence, the proposed algorithm may only be useful when it is known, a priori, that the solution $y^*$ will occur not in the interior but on the boundary of $Y$. It is indeed an odd condition, but it does hold in many typical scenarios. One may also add a perturbation vector to the gradient and show that, with high probability, $L_T$ is non-zero. The downside of this approach is that it generally yields a slower convergence rate; it cannot achieve $\log(T)/T$, as the inclusion of the perturbation requires managing an additional trade-off.

References

[1] Jacob Abernethy and Elad Hazan. Faster convex optimization: Simulated annealing with an efficient universal barrier. In Proceedings of The 33rd International Conference on Machine Learning, pages 2520-2528, 2016.
[2] Jacob Abernethy, Chansoo Lee, Abhinav Sinha, and Ambuj Tewari. Online linear optimization via smoothing. In COLT, pages 807-823, 2014.
[3] Sanjeev Arora, Elad Hazan, and Satyen Kale. The multiplicative weights update method: a meta-algorithm and applications. Theory of Computing, 8(1):121-164, 2012.
[4] Francis Bach. Duality between subgradient and conditional gradient methods. SIAM Journal on Optimization, 2015.
[5] Amir Beck and Shimrit Shtern. Linearly convergent away-step conditional gradient for non-strongly convex functions. Mathematical Programming, 2016.
[6] Stephen Boyd and Lieven Vandenberghe. Convex Optimization. Cambridge University Press, 2004.
[7] Nicolò Cesa-Bianchi and Gábor Lugosi. Prediction, Learning, and Games. Cambridge University Press, 2006.
[8] Paul Christiano, Jonathan A. Kelner, Aleksander Madry, Daniel A. Spielman, and Shang-Hua Teng. Electrical flows, Laplacian systems, and faster approximation of maximum flow in undirected graphs. In Proceedings of the Forty-Third Annual ACM Symposium on Theory of Computing, pages 273-282. ACM, 2011.
[9] Cynthia Dwork, Aaron Roth, et al. The algorithmic foundations of differential privacy. Foundations and Trends in Theoretical Computer Science, 9(3-4):211-407, 2014.
[10] Marguerite Frank and Philip Wolfe. An algorithm for quadratic programming. Naval Research Logistics Quarterly, 3(1-2):95-110, 1956.
[11] Yoav Freund and Robert E. Schapire. Game theory, on-line prediction and boosting. In Proceedings of the Ninth Annual Conference on Computational Learning Theory, pages 325-332. ACM, 1996.
[12] Yoav Freund and Robert E. Schapire. Adaptive game playing using multiplicative weights. Games and Economic Behavior, 29(1-2):79-103, 1999.
[13] Dan Garber and Elad Hazan. Playing non-linear games with linear oracles. FOCS, 2013.
[14] Dan Garber and Elad Hazan. Faster rates for the Frank-Wolfe method over strongly-convex sets. ICML, 2015.
[15] Dan Garber and Elad Hazan. A linearly convergent conditional gradient algorithm with applications to online and stochastic optimization. SIAM Journal on Optimization, 2016.
[16] Dan Garber and Ofer Meshi. Linear-memory and decomposition-invariant linearly convergent conditional gradient algorithm for structured polytopes. NIPS, 2016.
[17] G. Gidel, T. Jebara, and S. Lacoste-Julien. Frank-Wolfe algorithms for saddle point problems. AISTATS, 2016.
[18] Gianluca Gorni. Conjugation and second-order properties of convex functions. Journal of Mathematical Analysis and Applications, 1991.
[19] Zaid Harchaoui, Anatoli Juditsky, and Arkadi Nemirovski. Conditional gradient algorithms for norm-regularized smooth convex optimization. Mathematical Programming, Series A, 2013.
[20] Elad Hazan. Introduction to online convex optimization. 2014.
[21] Elad Hazan, Amit Agarwal, and Satyen Kale. Logarithmic regret algorithms for online convex optimization. Machine Learning, 69(2-3):169-192, 2007.
[22] Elad Hazan and Satyen Kale. Projection-free online learning. ICML, 2012.
[23] Elad Hazan and Haipeng Luo. Variance-reduced and projection-free stochastic optimization. ICML, 2016.
[24] Ruitong Huang, Tor Lattimore, András György, and Csaba Szepesvári. Following the leader and fast rates in linear prediction: Curved constraint sets and other regularities. 2016.
[25] Martin Jaggi. Revisiting Frank-Wolfe: Projection-free sparse convex optimization. ICML, 2013.
[26] Sham M. Kakade, Shai Shalev-Shwartz, and Ambuj Tewari. On the duality of strong convexity and strong smoothness: Learning applications and matrix regularization. 2009.
[27] Shizuo Kakutani. A generalization of Brouwer's fixed point theorem. 1941.
[28] Adam Kalai and Santosh Vempala. Efficient algorithms for online decision problems. Journal of Computer and System Sciences, 71(3):291-307, 2005.
[29] Simon Lacoste-Julien and Martin Jaggi. On the global linear convergence of Frank-Wolfe optimization variants. NIPS, 2015.
[30] Simon Lacoste-Julien, Martin Jaggi, Mark Schmidt, and Patrick Pletscher. Block-coordinate Frank-Wolfe optimization for structural SVMs. ICML, 2013.
[31] Guanghui Lan. The complexity of large-scale convex programming under a linear optimization oracle. https://arxiv.org/abs/1309.5550, 2013.
[32] Guanghui Lan and Yi Zhou. Conditional gradient sliding for convex optimization. SIAM Journal on Optimization, 2014.
[33] John von Neumann, Oskar Morgenstern, et al. Theory of Games and Economic Behavior, 1944.
[34] Anton Osokin, Jean-Baptiste Alayrac, Isabella Lukasewitz, Puneet K. Dokania, and Simon Lacoste-Julien. Minding the gaps for block Frank-Wolfe optimization for structural SVMs. ICML, 2016.
[35] Maurice Sion. On general minimax theorems. Pacific Journal of Mathematics, 8(1):171-176, 1958.
[36] J. von Neumann. Zur Theorie der Gesellschaftsspiele. Mathematische Annalen, 100(1):295-320, 1928.
[37] Yu-Xiang Wang, Veeranjaneyulu Sadhanala, Wei Dai, Willie Neiswanger, Suvrit Sra, and Eric Xing. Parallel and distributed block-coordinate Frank-Wolfe algorithms. ICML, 2016.
[38] P. Wolfe. Convergence theory in nonlinear programming. Integer and Nonlinear Programming, 1970.
[39] Y. Yu, X. Zhang, and D. Schuurmans. Generalized conditional gradient for structured estimation. arXiv:1410.4828, 2014.