Boosting Frank-Wolfe by Chasing Gradients

Cyrille W. Combettes¹  Sebastian Pokutta²,³

¹School of Industrial and Systems Engineering, Georgia Institute of Technology, Atlanta, GA, USA. ²Institute of Mathematics, Technische Universität Berlin, Berlin, Germany. ³Department for AI in Society, Science, and Technology, Zuse Institute Berlin, Berlin, Germany. Correspondence to: Cyrille W. Combettes.

Proceedings of the 37th International Conference on Machine Learning, Online, PMLR 119, 2020. Copyright 2020 by the author(s).

Abstract

The Frank-Wolfe algorithm has become a popular first-order optimization algorithm because it is simple and projection-free, and it has been successfully applied to a variety of real-world problems. Its main drawback, however, lies in its convergence rate, which can be excessively slow due to naive descent directions. We propose to speed up the Frank-Wolfe algorithm by better aligning the descent direction with that of the negative gradient via a subroutine. This subroutine chases the negative gradient direction in a matching pursuit style while still preserving the projection-free property. Although the approach is reasonably natural, it produces very significant results. We derive convergence rates $\mathcal{O}(1/t)$ to $\mathcal{O}(e^{-\omega t})$ for our method, and we demonstrate its competitive advantage both per iteration and in CPU time over the state-of-the-art in a series of computational experiments.

1. Introduction

Let $(\mathcal{H}, \langle\cdot,\cdot\rangle)$ be a Euclidean space. In this paper, we address the constrained convex optimization problem
$$\min_{x\in\mathcal{C}} f(x) \qquad (1)$$
where $f : \mathcal{H} \to \mathbb{R}$ is a smooth convex function and $\mathcal{C} \subset \mathcal{H}$ is a compact convex set. A natural approach to solving Problem (1) is to apply any efficient method that works in the unconstrained setting and add projections back onto $\mathcal{C}$ when the iterates leave the feasible region. However, there are situations where projections can be very expensive while linear minimizations over $\mathcal{C}$ are much cheaper. For example, if $\mathcal{C} = \{X \in \mathbb{R}^{m\times n} : \|X\|_{\mathrm{nuc}} \le \tau\}$ is a nuclear norm-ball, a projection onto $\mathcal{C}$ requires computing an SVD, which has complexity $\mathcal{O}(mn\min\{m,n\})$, while a linear minimization over $\mathcal{C}$ requires only computing the pair of top singular vectors, which has complexity $\mathcal{O}(\mathrm{nnz})$, where $\mathrm{nnz}$ denotes the number of nonzero entries. Other examples include the flow polytope, the Birkhoff polytope, the matroid polytope, or the set of rotations; see, e.g., Hazan & Kale (2012). In these situations, the Frank-Wolfe algorithm (FW) (Frank & Wolfe, 1956), a.k.a. conditional gradient algorithm (Levitin & Polyak, 1966), becomes the method of choice, as it is a simple projection-free algorithm relying on a linear minimization oracle over $\mathcal{C}$. At each iteration, it calls the oracle $v_t \leftarrow \arg\min_{v\in\mathcal{C}} \langle\nabla f(x_t), v\rangle$ and moves in the direction of this vertex, ensuring that the new iterate $x_{t+1} \leftarrow x_t + \gamma_t(v_t - x_t)$ is feasible by convex combination, with a step-size $\gamma_t \in [0,1]$. Hence, FW can be seen as a projection-free variant of projected gradient descent, trading the gradient descent direction $-\nabla f(x_t)$ for the vertex direction $v_t - x_t$ minimizing the linear approximation of $f$ at $x_t$ over $\mathcal{C}$. FW has been applied to traffic assignment problems (LeBlanc et al., 1975), low-rank matrix approximation (Shalev-Shwartz et al., 2011), structural SVMs (Lacoste-Julien et al., 2013), video co-localization (Joulin et al., 2014), infinite RBMs (Ping et al., 2016), and, e.g., adversarial learning (Chen et al., 2020).
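To make the cost comparison above concrete, a linear minimization over the nuclear norm-ball only needs the top singular pair of the gradient matrix, which a truncated SVD provides. The following minimal Python sketch (using `scipy.sparse.linalg.svds`, the same routine used later in Section 4.4; the function name is ours and purely illustrative) is an illustration of the oracle under these assumptions, not a statement of the implementation used in the experiments.

```python
import numpy as np
from scipy.sparse.linalg import svds

def lmo_nuclear_ball(grad, tau):
    """Linear minimization over {X : ||X||_nuc <= tau}.

    argmin_X <grad, X> is attained at -tau * u1 @ v1.T, where (u1, v1) is the
    top singular pair of grad, so only a rank-one truncated SVD is needed
    (grad is an (m, n) array with min(m, n) > 1; it may be sparse).
    """
    u, s, vt = svds(grad, k=1)              # top singular triplet of grad
    return -tau * np.outer(u[:, 0], vt[0, :])
```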
The main drawback of FW is that the modified descent direction leads to a sublinear convergence rate $\mathcal{O}(1/t)$, which cannot be improved upon in general, as an asymptotic lower bound $\Omega(1/t^{1+\delta})$ holds for any $\delta > 0$ (Canon & Cullum, 1968). More recently, Jaggi (2013) provided a simple illustration of the phenomenon: if $f : x \in \mathbb{R}^n \mapsto \|x\|_2^2$ is the squared $\ell_2$-norm and $\mathcal{C} = \Delta_n$ is the standard simplex, then the primal gap at iteration $t \in \llbracket 1, n\rrbracket$ is lower bounded by $1/t - 1/n$; see also Lan (2013) for a lower bound $\Omega(LD^2/t)$ on an equivalent setup, exhibiting an explicit dependence on the smoothness constant $L$ of $f$ and the diameter $D$ of $\mathcal{C}$. Hence, a vast literature has been devoted to the analysis of higher convergence rates of FW when additional assumptions on the properties of $f$, the geometry of $\mathcal{C}$, or the location of $\arg\min_{\mathcal{C}} f$ are met. Important contributions include: (i) $\mathcal{O}(e^{-\omega t})$ if $\mathcal{C}$ is strongly convex and $\inf_{\mathcal{C}}\|\nabla f\| > 0$ (Levitin & Polyak, 1966), (ii) $\mathcal{O}(e^{-\omega t})$ if $f$ is strongly convex and $\arg\min_{\mathcal{C}} f \cap \operatorname{relint}(\mathcal{C}) \neq \emptyset$ (Guélat & Marcotte, 1986), (iii) $\mathcal{O}(1/t^2)$ if $f$ is gradient dominated and $\mathcal{C}$ is strongly convex (Garber & Hazan, 2015). More recently, several variants of FW have been proposed, achieving linear convergence rates without excessively increasing the per-iteration complexity. These include the following: (i) $\mathcal{O}(e^{-\omega t})$ when $f$ is strongly convex and $\mathcal{C}$ is a polytope (Garber & Hazan, 2016; Lacoste-Julien & Jaggi, 2015; Braun et al., 2019), (ii) $\mathcal{O}(e^{-\omega t})$ with constants depending on the sparsity of the solution when $f$ is strongly convex and $\mathcal{C}$ is a polytope, either of the form $\{x \in \mathbb{R}^n : Ax = b,\ x \ge 0\}$ with vertices in $\{0,1\}^n$ (Garber & Meshi, 2016) or of arbitrary form (Bashiri & Zhang, 2017).

Contributions. We propose the Boosted Frank-Wolfe algorithm (BoostFW), a new and intuitive method speeding up the Frank-Wolfe algorithm by chasing the negative gradient direction $-\nabla f(x_t)$ via a matching pursuit-style subroutine, and moving in this better-aligned direction. BoostFW thereby mimics gradient descent while remaining projection-free. We derive convergence rates $\mathcal{O}(1/t)$ to $\mathcal{O}(e^{-\omega t})$. Although the linear minimization oracle may be called multiple times per iteration, we demonstrate in a series of computational experiments the competitive advantage, both per iteration and in CPU time, of our method over the state-of-the-art. Furthermore, BoostFW does not require line search to achieve strong empirical performance, and it does not need to maintain the decomposition of the iterates. Naturally, our approach can also be used to boost the performance of any Frank-Wolfe-style algorithm.

Outline. We start with notation and definitions and we present some background material on the Frank-Wolfe algorithm (Section 2). We then move on to the intuition behind the design of the Boosted Frank-Wolfe algorithm and present its convergence analysis (Section 3). We validate the advantage of our approach in a series of computational experiments (Section 4). Finally, a couple of remarks conclude the paper (Section 5). All proofs are available in Appendix D. The Appendix further contains complementary plots (Appendix A), an application of our approach to boost the Decomposition-Invariant Pairwise Conditional Gradient algorithm (DICG) (Garber & Meshi, 2016) (Appendix B), and the convergence analysis of the line search-free Away-Step Frank-Wolfe algorithm (Appendix C). We were later informed that the latter analysis was already derived by Pedregosa et al. (2020) in a more general setting.
2. Preliminaries

We work in a Euclidean space $(\mathcal{H}, \langle\cdot,\cdot\rangle)$ equipped with the induced norm $\|\cdot\|$. Let $\mathcal{C} \subset \mathcal{H}$ be a nonempty compact convex set. If $\mathcal{C}$ is a polytope, let $\mathcal{V}$ be its set of vertices. Else, slightly abusing notation, we refer to any point in $\mathcal{V} := \mathcal{C}$ as a vertex. We denote by $D := \max_{x,y\in\mathcal{C}} \|y - x\|$ the diameter of $\mathcal{C}$.

2.1. Notation and definitions

For any $i, j \in \mathbb{N}$ satisfying $i \le j$, the brackets $\llbracket i, j\rrbracket$ denote the set of integers between (and including) $i$ and $j$. The indicator function of an event $A$ is $\mathbb{1}_A := 1$ if $A$ is true, else $0$. For any $x \in \mathbb{R}^n$ and $i \in \llbracket 1, n\rrbracket$, $[x]_i$ denotes the $i$-th entry of $x$. Given $p \ge 1$, the $\ell_p$-norm in $\mathbb{R}^n$ is $\|\cdot\|_p : x \in \mathbb{R}^n \mapsto (\sum_{i=1}^n |[x]_i|^p)^{1/p}$ and the closed $\ell_p$-ball of radius $\tau > 0$ is $\mathcal{B}_p(\tau) := \{x \in \mathbb{R}^n : \|x\|_p \le \tau\}$. The standard simplex in $\mathbb{R}^n$ is $\Delta_n := \{x \in \mathbb{R}^n : \mathbf{1}^\top x = 1,\ x \ge 0\} = \operatorname{conv}(e_1, \ldots, e_n)$, where $\{e_1, \ldots, e_n\}$ denotes the standard basis, i.e., $e_i = (\mathbb{1}_{\{1=i\}}, \ldots, \mathbb{1}_{\{n=i\}})^\top$. The conical hull of a nonempty set $\mathcal{A} \subseteq \mathcal{H}$ is $\operatorname{cone}(\mathcal{A}) := \{\sum_{k=1}^K \lambda_k a_k : K \in \mathbb{N}\setminus\{0\},\ \lambda_1, \ldots, \lambda_K \ge 0,\ a_1, \ldots, a_K \in \mathcal{A}\}$, and the number of elements of $\mathcal{A}$ is denoted by $|\mathcal{A}|$.

Let $f : \mathcal{H} \to \mathbb{R}$ be a differentiable function. We say that $f$ is:
(i) $L$-smooth if $L > 0$ and, for all $x, y \in \mathcal{H}$, $f(y) - f(x) - \langle\nabla f(x), y - x\rangle \le \frac{L}{2}\|y - x\|^2$,
(ii) $S$-strongly convex if $S > 0$ and, for all $x, y \in \mathcal{H}$, $f(y) - f(x) - \langle\nabla f(x), y - x\rangle \ge \frac{S}{2}\|y - x\|^2$,
(iii) $\mu$-gradient dominated if $\mu > 0$, $\arg\min_{\mathcal{H}} f \neq \emptyset$, and, for all $x \in \mathcal{H}$, $f(x) - \min_{\mathcal{H}} f \le \frac{\|\nabla f(x)\|^2}{2\mu}$.

Note that although Definition (iii) is stated with respect to the global optimal value $\min_{\mathcal{H}} f$, the bound holds for the primal gap of $f$ on any compact set $\mathcal{C} \subset \mathcal{H}$:
$$f(x) - \min_{\mathcal{C}} f \le f(x) - \min_{\mathcal{H}} f \le \frac{\|\nabla f(x)\|^2}{2\mu}.$$
Definition (iii) is also commonly referred to as the Polyak-Łojasiewicz inequality or PL inequality (Polyak, 1963; Łojasiewicz, 1963). It is a local condition, weaker than strong convexity (Fact 2.1), but it can still provide linear convergence rates for non-strongly convex functions (Karimi et al., 2016). For example, the least squares loss $x \in \mathbb{R}^n \mapsto \|Ax - b\|_2^2$, where $A \in \mathbb{R}^{m\times n}$ and $\operatorname{rank}(A) = m < n$, is not strongly convex; however, it is gradient dominated (Garber & Hazan, 2015). See also the Kurdyka-Łojasiewicz inequality (Kurdyka, 1998; Łojasiewicz, 1963) for a generalization to nonsmooth optimization (Bolte et al., 2017).

Fact 2.1. Let $f : \mathcal{H} \to \mathbb{R}$ be $S$-strongly convex. Then $f$ is $S$-gradient dominated.

2.2. The Frank-Wolfe algorithm

The Frank-Wolfe algorithm (FW) (Frank & Wolfe, 1956), a.k.a. conditional gradient algorithm (Levitin & Polyak, 1966), is presented in Algorithm 1. It is a simple first-order projection-free algorithm relying on a linear minimization oracle over $\mathcal{C}$. At each iteration, it minimizes over $\mathcal{C}$ the linear approximation of $f$ at $x_t$, i.e., $\ell_{f(x_t)} : z \in \mathcal{C} \mapsto f(x_t) + \langle\nabla f(x_t), z - x_t\rangle$, by calling the oracle (Line 2), and moves in that direction by convex combination (Line 3). Hence, the new iterate $x_{t+1}$ is guaranteed to be feasible by convexity, and there is no need for projections back onto $\mathcal{C}$. In short, FW solves Problem (1) by minimizing a sequence of linear approximations of $f$ over $\mathcal{C}$.

Algorithm 1 Frank-Wolfe (FW)
Input: Start point $x_0 \in \mathcal{C}$, step-size strategy $\gamma_t \in [0,1]$.
Output: Point $x_T \in \mathcal{C}$.
1: for $t = 0$ to $T-1$ do
2:   $v_t \leftarrow \arg\min_{v\in\mathcal{V}} \langle\nabla f(x_t), v\rangle$  ▷ FW oracle
3:   $x_{t+1} \leftarrow x_t + \gamma_t(v_t - x_t)$
4: end for

Note that FW has access to the feasible region $\mathcal{C}$ only via the linear minimization oracle, which receives any $c \in \mathcal{H}$ as input and outputs a point $v \in \arg\min_{z\in\mathcal{C}} \langle c, z\rangle$, which can always be chosen in $\arg\min_{v\in\mathcal{V}} \langle c, v\rangle$. For example, if $\mathcal{H} = \mathbb{R}^n$ and $\mathcal{C} = \{x \in \mathbb{R}^n : \|x\|_1 \le \tau\}$ is an $\ell_1$-ball, then $\mathcal{V} = \{\pm\tau e_1, \ldots, \pm\tau e_n\}$, so the linear minimization oracle simply picks the coordinate $e_i$ with the largest absolute magnitude $|[c]_i|$ and returns $-\operatorname{sign}([c]_i)\,\tau e_i$. In this case, FW accesses $\mathcal{C}$ only by reading coordinates. Some other examples are covered in the experiments (Section 4).
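The following Python sketch assembles Algorithm 1 with the $\ell_1$-ball oracle just described and the short-step rule discussed in the next paragraph; it is a minimal illustration (function names are ours, and no safeguards beyond a trivial stopping check are included), not the code used in the experiments.

```python
import numpy as np

def lmo_l1_ball(c, tau):
    """argmin_{||v||_1 <= tau} <c, v>: move to the signed vertex of the
    coordinate of c with largest magnitude."""
    i = np.argmax(np.abs(c))
    v = np.zeros_like(c, dtype=float)
    v[i] = -tau * np.sign(c[i])
    return v

def frank_wolfe(grad_f, lmo, x0, L, T):
    """Vanilla FW (Algorithm 1) with the short-step rule
    gamma_t = min(<-grad, d_t> / (L ||d_t||^2), 1)."""
    x = np.array(x0, dtype=float)
    for _ in range(T):
        g = grad_f(x)
        v = lmo(g)                       # FW oracle call
        d = v - x                        # vertex direction
        dd = np.dot(d, d)
        if dd == 0.0:
            break                        # x is already a vertex minimizer
        gamma = min(np.dot(-g, d) / (L * dd), 1.0)
        x = x + gamma * d
    return x

# Usage sketch: frank_wolfe(grad_f, lambda c: lmo_l1_ball(c, tau), x0, L, T)
```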
The general convergence rate of FW is $\mathcal{O}(LD^2/t)$, where $L$ is the smoothness constant of $f$ and $D$ is the diameter of $\mathcal{C}$ (Levitin & Polyak, 1966; Jaggi, 2013). Several step-size strategies achieve this rate. The default strategy is $\gamma_t \leftarrow 2/(t+2)$. It is very simple to implement, but it does not guarantee progress at each iteration. The next strategy, sometimes referred to as the short step strategy, and which does make FW a descent algorithm, is $\gamma_t \leftarrow \min\{\langle\nabla f(x_t), x_t - v_t\rangle/(L\|x_t - v_t\|^2), 1\}$. It minimizes the quadratic smoothness upper bound on $f$. If $\varepsilon_t := f(x_t) - \min_{\mathcal{C}} f$ denotes the primal gap, then
$$\varepsilon_t - \varepsilon_{t+1} \ge \begin{cases} \dfrac{\langle\nabla f(x_t), x_t - v_t\rangle^2}{2L\|x_t - v_t\|^2} & \text{if } \gamma_t < 1,\\[1.5ex] \varepsilon_t/2 & \text{if } \gamma_t = 1.\end{cases}$$
As we can already see here, a quadratic improvement in progress is obtained if the direction $v_t - x_t$ in which FW moves is better aligned with that of the negative gradient $-\nabla f(x_t)$. The third step-size strategy is a line search $\gamma_t \leftarrow \arg\min_{\gamma\in[0,1]} f(x_t + \gamma(v_t - x_t))$. It is the most expensive strategy, but it does not require (approximate) knowledge of $L$ and it often yields more progress per iteration in practice.

3. Boosting Frank-Wolfe

3.1. Motivation

Suppose that $\mathcal{C}$ is a polytope and that the set of global minimizers $\arg\min_{\mathcal{H}} f$ lies on a lower-dimensional face. Then FW can be very slow to converge, as it is only allowed to follow vertex directions. As a simple illustration, consider the problem of minimizing $f : x \in \mathbb{R}^2 \mapsto \|x\|_2^2/2$ over the convex hull of $\{(-1, 0)^\top, (1, 0)^\top, (0, 1)^\top\}$, starting from $x_0 = (0, 1)^\top$. The minimizer is $x^* = (0, 0)^\top$. We computed the first iterates of FW and we present their trajectory in Figure 1. We can see that the iterates try to reach $x^*$ by moving towards vertices, but clearly these directions $v_t - x_t$ are inadequate, as they become orthogonal to $x^* - x_t$.

Figure 1. FW yields an inefficient zig-zagging trajectory towards the minimizer.

To remedy this phenomenon, Wolfe (1970) proposed the Away-Step Frank-Wolfe algorithm (AFW), a variant of FW that allows moving away from vertices. The issue in Figure 1 is that the iterates are held back by the weight of vertex $x_0$ in their convex decomposition. Figure 2 shows that AFW is able to remove this weight and thereby to converge much faster to $x^*$. In fact, Lacoste-Julien & Jaggi (2015) established that AFW with line search converges at a linear rate $\mathcal{O}\big(LD^2\exp(-(S/(8L))(W/D)^2\,t)\big)$ for $S$-strongly convex functions over polytopes, where $W$ is the pyramidal width of the polytope.

Figure 2. AFW breaks the zig-zagging trajectory by performing away steps. Here, $x_4$ is obtained using an away step, which enables $x_5 = x^*$, speeding up the algorithm considerably.

However, these descent directions are still not as favorable as those of gradient descent, the pyramidal width is a dimension-dependent quantity, and AFW further requires maintaining the decomposition of the iterates onto $\mathcal{V}$, which can become very expensive both in memory usage and computation time (Garber & Meshi, 2016). Thus, we aim at improving the FW descent direction by directly estimating the gradient descent direction $-\nabla f(x_t)$ using $\mathcal{V}$, in order to maintain the projection-free property. Suppose that $-\nabla f(x_t) \in \operatorname{cone}(\mathcal{V} - x_t)$ and that we are able to compute its conical decomposition, i.e., we have $-\nabla f(x_t) = \sum_{k=0}^{K_t-1} \lambda_k(v_k - x_t)$ where $\lambda_0, \ldots, \lambda_{K_t-1} > 0$ and $v_0, \ldots, v_{K_t-1} \in \mathcal{V}$.
Then by normalizing by $\Lambda_t := \sum_{k=0}^{K_t-1} \lambda_k$, we obtain a feasible descent direction $g_t := (1/\Lambda_t)\sum_{k=0}^{K_t-1} \lambda_k(v_k - x_t)$, in the sense that $[x_t, x_t + g_t] \subseteq \mathcal{C}$. Therefore, building $x_{t+1}$ as a convex combination of $x_t$ and $x_t + g_t$ ensures that $x_{t+1} \in \mathcal{C}$, and the projection-free property holds as in a typical FW step, all the while moving in the direction of the negative gradient $-\nabla f(x_t)$.

3.2. Boosting via gradient pursuit

In practice, however, computing the exact conical decomposition of $-\nabla f(x_t)$, even when this is feasible, is not necessary and may be overkill. Indeed, all we want is to find a descent direction $g_t$ using $\mathcal{V}$ that is better aligned with $-\nabla f(x_t)$, and we do not mind if $\|{-\nabla f(x_t)} - g_t\|$ is arbitrarily large. Thus, we propose to chase the direction of $-\nabla f(x_t)$ by sequentially picking up vertices in a matching pursuit style (Mallat & Zhang, 1993). The procedure is described in Algorithm 2 (Lines 3-19). In fact, it implicitly addresses the cone-constrained quadratic optimization subproblem
$$\min_{d\in\operatorname{cone}(\mathcal{V} - x_t)} \frac{1}{2}\|{-\nabla f(x_t)} - d\|^2 \qquad (2)$$
via the Non-Negative Matching Pursuit algorithm (NNMP) (Locatello et al., 2017), without however the aim of solving it. At each round $k$, the procedure looks to reduce the residual $r_k$ by subtracting its projection $\lambda_k u_k$ onto the principal component $u_k$. The comparison of $\langle r_k, v_k - x_t\rangle$ vs. $\langle r_k, -d_k/\|d_k\|\rangle$ in Line 9 is less intuitive than the rest of the procedure, but it is necessary to ensure convergence; see Locatello et al. (2017). The normalization in Line 21 ensures the feasibility of the new iterate $x_{t+1}$.

Algorithm 2 Boosted Frank-Wolfe (BoostFW)
Input: Input point $y \in \mathcal{C}$, maximum number of rounds $K \in \mathbb{N}\setminus\{0\}$, alignment improvement tolerance $\delta \in \left]0,1\right[$, step-size strategy $\gamma_t \in [0,1]$.
Output: Point $x_T \in \mathcal{C}$.
1: $x_0 \leftarrow \arg\min_{v\in\mathcal{V}} \langle\nabla f(y), v\rangle$
2: for $t = 0$ to $T-1$ do
3:   $d_0 \leftarrow 0$
4:   $\Lambda_t \leftarrow 0$
5:   flag $\leftarrow$ false
6:   for $k = 0$ to $K-1$ do
7:     $r_k \leftarrow -\nabla f(x_t) - d_k$  ▷ $k$-th residual
8:     $v_k \leftarrow \arg\max_{v\in\mathcal{V}} \langle r_k, v\rangle$  ▷ FW oracle
9:     $u_k \leftarrow \arg\max_{u\in\{v_k - x_t,\ -d_k/\|d_k\|\}} \langle r_k, u\rangle$
10:    $\lambda_k \leftarrow \langle r_k, u_k\rangle/\|u_k\|^2$
11:    $d_k' \leftarrow d_k + \lambda_k u_k$
12:    if $\operatorname{align}(-\nabla f(x_t), d_k') - \operatorname{align}(-\nabla f(x_t), d_k) \ge \delta$ then
13:      $d_{k+1} \leftarrow d_k'$
14:      $\Lambda_t \leftarrow \begin{cases}\Lambda_t + \lambda_k & \text{if } u_k = v_k - x_t\\ \Lambda_t(1 - \lambda_k/\|d_k\|) & \text{if } u_k = -d_k/\|d_k\|\end{cases}$
15:    else
16:      flag $\leftarrow$ true
17:      break  ▷ exit $k$-loop
18:    end if
19:   end for
20:   $K_t \leftarrow k$ if flag = true else $K$
21:   $g_t \leftarrow d_{K_t}/\Lambda_t$  ▷ normalization
22:   $x_{t+1} \leftarrow x_t + \gamma_t g_t$
23: end for

Since we are only interested in the direction of $-\nabla f(x_t)$, the stopping criterion of the procedure (Line 12) is an alignment condition between $-\nabla f(x_t)$ and the current estimated direction $d_k$, which serves as the descent direction for BoostFW. The function $\operatorname{align}$, defined in (3), measures the alignment between a target direction $d \in \mathcal{H}\setminus\{0\}$ and its estimate $\hat{d} \in \mathcal{H}$. It is invariant to scaling of $d$ or $\hat{d}$, and the higher the value, the better the alignment:
$$\operatorname{align}(d, \hat{d}) := \begin{cases}\dfrac{\langle d, \hat{d}\rangle}{\|d\|\,\|\hat{d}\|} & \text{if } \hat{d} \neq 0,\\[1.5ex] -1 & \text{if } \hat{d} = 0.\end{cases} \qquad (3)$$
In order to optimize the trade-off between progress and complexity per iteration, we allow for (very) inexact alignments and we stop the procedure as soon as sufficient progress is not met (Lines 15-17). Furthermore, note that it is not possible to obtain a perfect alignment when $-\nabla f(x_t) \notin \operatorname{cone}(\mathcal{V} - x_t)$, but this is not an issue, as we only seek to better align the descent direction. The number of pursuit rounds at iteration $t$ is denoted by $K_t$ (Line 20). In the experiments (Section 4), we typically set $\delta \leftarrow 10^{-3}$ and $K \leftarrow +\infty$; the role of $K$ is only to cap the number of pursuit rounds per iteration when the FW oracle is particularly expensive (see Section 4.3). Note that if $K = 1$ then BoostFW reduces to FW.
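For concreteness, here is a condensed Python sketch of the pursuit rounds (Lines 3-21 of Algorithm 2), assuming an oracle `lmo(c)` that returns $\arg\min_{v\in\mathcal{V}}\langle c, v\rangle$; variable names are ours, and the degenerate cases handled in the analysis (e.g., a zero residual at an optimum) are only treated crudely.

```python
import numpy as np

def align(d, d_hat):
    """Cosine alignment (3) between a target direction d and its estimate d_hat."""
    n = np.linalg.norm(d_hat)
    return -1.0 if n == 0 else np.dot(d, d_hat) / (np.linalg.norm(d) * n)

def boost_direction(grad, x, lmo, K=np.inf, delta=1e-3):
    """Gradient pursuit: chase -grad within cone(V - x) and return the
    normalized direction g_t, or None if no round was accepted."""
    target = -grad
    d, lam_sum, k = np.zeros_like(x, dtype=float), 0.0, 0
    while k < K:
        r = target - d                          # k-th residual
        v = lmo(-r)                             # argmax_v <r, v> via the FW oracle
        candidates = [v - x]
        if np.linalg.norm(d) > 0:
            candidates.append(-d / np.linalg.norm(d))
        u = max(candidates, key=lambda c: np.dot(r, c))
        uu = np.dot(u, u)
        if uu == 0.0:
            break
        lam = np.dot(r, u) / uu                 # projection coefficient
        d_new = d + lam * u
        if align(target, d_new) - align(target, d) < delta:
            break                               # insufficient improvement: stop
        if u is candidates[0]:
            lam_sum += lam                      # vertex direction added
        else:
            lam_sum *= 1.0 - lam / np.linalg.norm(d)   # conic weights shrink
        d = d_new
        k += 1
    return d / lam_sum if lam_sum > 0 else None
```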
In the case of Figures 1-2, BoostFW exactly estimates the direction of $-\nabla f(x_0) = -(x_0 - x^*)$ in only two rounds and converges in one iteration. A more general illustration of the procedure is presented in Figure 3. See also Appendix A.2 for an illustration of the improvements in alignment of $d_k$ during the procedure. Lastly, note that BoostFW does not need to maintain the decomposition of the iterates, which is very favorable in practice (Garber & Meshi, 2016).

Figure 3. The gradient pursuit procedure builds a descent direction $g_t$ better aligned with the negative gradient direction $-\nabla f(x_t)$, while the FW descent direction is that of $v_0 - x_t$. We have $g_t = d_2/(\lambda_0 + \lambda_1)$ where $d_2 = \lambda_0 u_0 + \lambda_1 u_1$, $u_0 = v_0 - x_t$, and $u_1 = v_1 - x_t$. Furthermore, note that $[x_t, x_t + d_2] \not\subseteq \mathcal{C}$ but $[x_t, x_t + g_t] \subseteq \mathcal{C}$. Moving along the segment $[x_t, x_t + g_t]$ ensures feasibility of the new iterate $x_{t+1}$.

We present in Proposition 3.1 some properties satisfied by BoostFW (Algorithm 2). Proofs are available in Appendix D.2.

Proposition 3.1. For all $t \in \llbracket 0, T-1\rrbracket$,
(i) $d_1$ is well defined and $K_t \ge 1$,
(ii) $\lambda_0, \ldots, \lambda_{K_t-1} \ge 0$,
(iii) $d_k \in \operatorname{cone}(\mathcal{V} - x_t)$ for all $k \in \llbracket 0, K_t\rrbracket$,
(iv) $x_t + g_t \in \mathcal{C}$ and $x_{t+1} \in \mathcal{C}$,
(v) $\operatorname{align}(-\nabla f(x_t), g_t) \ge \operatorname{align}(-\nabla f(x_t), v_t - x_t) + (K_t - 1)\delta$, where $v_t \in \arg\min_{v\in\mathcal{V}} \langle\nabla f(x_t), v\rangle$ and $\operatorname{align}(-\nabla f(x_t), v_t - x_t) \ge 0$.

3.3. Convergence analysis

We denote by $\eta_t := \operatorname{align}(-\nabla f(x_t), g_t)$. We provide in Theorem 3.2 the general convergence rate of BoostFW. All proofs are available in Appendix D.3. Note that $\gamma_t = \eta_t\|\nabla f(x_t)\|/(L\|g_t\|) = \langle-\nabla f(x_t), g_t\rangle/(L\|g_t\|^2)$ corresponds to the short step strategy.

Theorem 3.2 (Universal rate). Let $f : \mathcal{H} \to \mathbb{R}$ be $L$-smooth, convex, and $\mu$-gradient dominated, and set $\gamma_t \leftarrow \min\{\eta_t\|\nabla f(x_t)\|/(L\|g_t\|), 1\}$ or $\gamma_t \leftarrow \arg\min_{\gamma\in[0,1]} f(x_t + \gamma g_t)$. Then for all $t \in \llbracket 0, T\rrbracket$,
$$f(x_t) - \min_{\mathcal{C}} f \le \big(f(x_0) - \min_{\mathcal{C}} f\big)\prod_{s=0}^{t-1}\left(1 - \mathbb{1}_{\{\gamma_s<1\}}\frac{\mu}{L}\eta_s^2 - \mathbb{1}_{\{\gamma_s=1\}}\frac{\|g_s\|}{2\|v_s - x_s\|}\right),$$
where $v_s \in \arg\min_{v\in\mathcal{V}} \langle\nabla f(x_s), v\rangle$ for all $s \in \llbracket 0, T-1\rrbracket$.

Strictly speaking, the rate in Theorem 3.2 is not explicit, although it still provides a quantitative estimation. Note that $\gamma_t = 1$ is extremely rare in practice, and we observed no more than one such iteration in each of the experiments (Section 4). This is a phenomenon similar to that in the Away-Step and Pairwise Frank-Wolfe algorithms (Lacoste-Julien & Jaggi, 2015). Similarly, $K_t > 1$ simply means that it is possible to increase the alignment by $\delta$ twice consecutively, where $\delta$ is typically set to a low value. In the experiments, we set $\delta \leftarrow 10^{-3}$ and we observed $K_t > 1$ (or even $K_t > 5$) almost every time. For completeness, we disregard these observations and address in Theorem 3.3 the case where the number of iterations with $\gamma_t < 1$ and $K_t > 1$ is not dominant, and we add a minor adjustment to Algorithm 2: if $\gamma_t = 1$ then we choose to perform a simple FW step, i.e., to move in the direction of $v_{k=0} - x_t$ instead of the direction of $g_t$, where $v_{k=0}$ is computed in the first round of the procedure (Line 8). Although this usually provides less progress, we do it for the sole purpose of presenting a fully explicit convergence rate; again, there is no need for such tweaks in practice, as typically almost every iteration satisfies $\gamma_t < 1$ and $K_t > 1$. Theorem 3.3 states the convergence rate for this scenario, which is very loose as it accommodates these FW steps.
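To see where the contraction factors in Theorems 3.2-3.4 come from, the key one-step estimate is the following (a sketch of the reasoning; the complete argument, including the $\gamma_t = 1$ case, is in Appendix D.3). By $L$-smoothness and the definition of $\eta_t$, for any $\gamma \in [0,1]$,
$$f(x_t + \gamma g_t) \le f(x_t) + \gamma\langle\nabla f(x_t), g_t\rangle + \frac{L}{2}\gamma^2\|g_t\|^2 = f(x_t) - \gamma\,\eta_t\|\nabla f(x_t)\|\,\|g_t\| + \frac{L}{2}\gamma^2\|g_t\|^2,$$
and minimizing the right-hand side over $\gamma$ yields the short step $\gamma_t = \eta_t\|\nabla f(x_t)\|/(L\|g_t\|)$. When $\gamma_t < 1$, this gives $f(x_{t+1}) \le f(x_t) - \eta_t^2\|\nabla f(x_t)\|^2/(2L)$, and combining with $\mu$-gradient dominance, $\|\nabla f(x_t)\|^2 \ge 2\mu\big(f(x_t) - \min_{\mathcal{C}} f\big)$, produces the factor $1 - \mu\eta_t^2/L$. Moreover, $\eta_t \ge (K_t - 1)\delta$ by Proposition 3.1(v), so every iteration with $\gamma_t < 1$ and $K_t > 1$ contracts the primal gap by at least $1 - \mu\delta^2/L$; this is the mechanism behind Theorem 3.4 and Remark 3.5 below.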
Theorem 3.3 (Worst-case rate). Let $f : \mathcal{H} \to \mathbb{R}$ be $L$-smooth, convex, and $\mu$-gradient dominated, and set $\gamma_t \leftarrow \min\{\eta_t\|\nabla f(x_t)\|/(L\|g_t\|), 1\}$ or $\gamma_t \leftarrow \arg\min_{\gamma\in[0,1]} f(x_t + \gamma g_t)$. Consider Algorithm 2 with the minor adjustment $x_{t+1} \leftarrow x_t + \gamma_t'(v_t - x_t)$ in Line 22 when $\gamma_t = 1$, where $v_t \leftarrow v_{k=0}$ is computed in Line 8 and $\gamma_t' \leftarrow \min\{\langle\nabla f(x_t), x_t - v_t\rangle/(L\|x_t - v_t\|^2), 1\}$ or $\gamma_t' \leftarrow \arg\min_{\gamma\in[0,1]} f(x_t + \gamma(v_t - x_t))$. Then for all $t \in \llbracket 0, T\rrbracket$,
$$f(x_t) - \min_{\mathcal{C}} f \le \frac{4LD^2}{t+2}.$$

We now provide in Theorem 3.4 the more realistic convergence rate of BoostFW, where $N_t := |\{s \in \llbracket 0, t-1\rrbracket : \gamma_s < 1,\ K_s > 1\}|$ is nonnegligible, i.e., $N_t \ge \omega t^p$ for some $\omega > 0$ and $p \in \left]0,1\right]$. This is the rate observed in practice, where $N_t \approx t - 1$, so $\omega \approx 1$ and $p = 1$ (Section 4).

Theorem 3.4 (Practical rate). Let $f : \mathcal{H} \to \mathbb{R}$ be $L$-smooth, convex, and $\mu$-gradient dominated, and set $\gamma_t \leftarrow \min\{\eta_t\|\nabla f(x_t)\|/(L\|g_t\|), 1\}$ or $\gamma_t \leftarrow \arg\min_{\gamma\in[0,1]} f(x_t + \gamma g_t)$. Assume that $|\{s \in \llbracket 0, t-1\rrbracket : \gamma_s < 1,\ K_s > 1\}| \ge \omega t^p$ for all $t \in \llbracket 0, T-1\rrbracket$, for some $\omega > 0$ and $p \in \left]0,1\right]$. Then for all $t \in \llbracket 0, T\rrbracket$,
$$f(x_t) - \min_{\mathcal{C}} f \le \frac{LD^2}{2}\exp\!\left(-\frac{\mu}{L}\,\delta^2\,\omega\, t^p\right).$$

Remark 3.5. Note that when $\gamma_t < 1$ and $K_t > 1$, we have (see the proofs in Appendix D.3)
$$\big(f(x_t) - \min_{\mathcal{C}} f\big) - \big(f(x_{t+1}) - \min_{\mathcal{C}} f\big) \ge \frac{\delta^2\|\nabla f(x_t)\|^2}{2L},$$
so if $N_T := |\{t \in \llbracket 0, T-1\rrbracket : \gamma_t < 1,\ K_t > 1\}|$, then
$$f(x_0) - \min_{\mathcal{C}} f \ge N_T\,\frac{\delta^2\inf_{\mathcal{C}}\|\nabla f\|^2}{2L}.$$
Thus, if $\inf_{\mathcal{C}}\|\nabla f\| > 0$ then
$$N_T \le \frac{2L\big(f(x_0) - \min_{\mathcal{C}} f\big)}{\delta^2\inf_{\mathcal{C}}\|\nabla f\|^2} \le \left(\frac{LD}{\delta\inf_{\mathcal{C}}\|\nabla f\|}\right)^2,$$
since $f(x_0) - \min_{\mathcal{C}} f \le LD^2/2$ (see the proofs in Appendix D.3). However, the assumption in Theorem 3.4 can still hold, as convergence is usually achieved within $T$ iterations where $\omega T^p \le \big(LD/(\delta\inf_{\mathcal{C}}\|\nabla f\|)\big)^2$ for some $\omega > 0$ and $p \in \left]0,1\right]$. In the experiments for example (Section 4), convergence is always achieved within $\mathcal{O}(10^3)$ iterations. Furthermore, early stopping to improve the generalization of a model also prevents $T$ from blowing up.

Lastly, we provide in Corollary 3.6 a bound on the number of FW oracle calls, i.e., the number of linear minimizations over $\mathcal{C}$, performed to achieve $\varepsilon$-convergence. In comparison, FW and AFW respectively require $\mathcal{O}(LD^2/\varepsilon)$ and $\mathcal{O}\big((L/S)(D/W)^2\ln(1/\varepsilon)\big)$ oracle calls, where $f$ is assumed to be $S$-strongly convex and $\mathcal{C}$ is assumed to be a polytope with pyramidal width $W$ for AFW (Lacoste-Julien & Jaggi, 2015). It is clear from its design that BoostFW performs more oracle calls per iteration; however, it uses them more efficiently and the progress obtained overcomes the cost. This is demonstrated in the experiments (Section 4).

Corollary 3.6. In order to achieve $\varepsilon$-convergence, the number of linear minimizations performed over $\mathcal{C}$ is
$$\mathcal{O}\!\left(\frac{LD^2}{\varepsilon}\min\{K, 1/\delta\}\right) \text{ in the worst-case scenario,} \qquad \mathcal{O}\!\left(\min\{K, 1/\delta\}\left(\frac{L}{\mu\,\omega\,\delta^2}\ln\frac{1}{\varepsilon}\right)^{1/p}\right) \text{ in the practical scenario.}$$
Note that the practical scenario assumes that we have set $K \ge 2$ in BoostFW ($K = 1$ reduces BoostFW to FW).

4. Computational experiments

We compared the Boosted Frank-Wolfe algorithm (BoostFW, Algorithm 2) to the Away-Step Frank-Wolfe algorithm (AFW) (Wolfe, 1970), the Decomposition-Invariant Pairwise Conditional Gradient algorithm (DICG) (Garber & Meshi, 2016), and the Blended Conditional Gradients algorithm (BCG) (Braun et al., 2019) in a series of computational experiments. We ran two strategies for AFW, one with the default line search (AFW-ls) and one using the smoothness of $f$ (AFW-L):
$$\gamma_t \leftarrow \begin{cases}\min\left\{\dfrac{\langle\nabla f(x_t), x_t - v_t^{\mathrm{FW}}\rangle}{L\|x_t - v_t^{\mathrm{FW}}\|^2}, 1\right\} & \text{if FW step},\\[2.5ex] \min\left\{\dfrac{\langle\nabla f(x_t), v_t^{\mathrm{away}} - x_t\rangle}{L\|v_t^{\mathrm{away}} - x_t\|^2}, \gamma_{\max}\right\} & \text{if away step},\end{cases}$$
where $\gamma_{\max}$ is defined in the algorithm (see Algorithm 5 in Appendix C). Contrary to common belief, both strategies yield the same linear convergence rate; see Lacoste-Julien & Jaggi (2015) for AFW-ls and Theorem C.3 in the Appendix for AFW-L (Pedregosa et al., 2020). For BoostFW, we also ran a line search strategy to demonstrate that the speed-up really comes from the boosting procedure and not from being line search-free. Results further show that the line search-free strategy $\gamma_t \leftarrow \min\{\eta_t\|\nabla f(x_t)\|/(L\|g_t\|), 1\} = \min\{\langle-\nabla f(x_t), g_t\rangle/(L\|g_t\|^2), 1\}$ performs very well in CPU time. The line search-free strategy of DICG was not competitive in the experiments. DICG is not applicable to optimization problems over the $\ell_1$-ball
$$\min_{x\in\mathbb{R}^n} f(x) \quad \text{s.t.}\quad \|x\|_1 \le \tau; \qquad (4)$$
however, we can perform the change of variables $[x]_i = [z]_i - [z]_{n+i}$ and use the following reformulation over the simplex:
$$\min_{z\in\mathbb{R}^{2n}} f\big([z]_{1:n} - [z]_{n+1:2n}\big) \quad \text{s.t.}\quad z \in \tau\Delta_{2n}, \qquad (5)$$
where $[z]_{1:n}$ and $[z]_{n+1:2n}$ denote the truncations of $z \in \mathbb{R}^{2n}$ to its first $n$ entries and its last $n$ entries respectively. Fact 4.1 formally states the equivalence between problems (4) and (5); a proof can be found in Appendix D.4.

Fact 4.1. Consider $\mathbb{R}^n$ and let $\tau > 0$. Then $\mathcal{B}_1(\tau) = \{[z]_{1:n} - [z]_{n+1:2n} : z \in \tau\Delta_{2n}\}$.
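To make the reformulation (5) concrete, here is a minimal Python sketch of the lifting to the scaled simplex together with its linear minimization oracle; the function names are ours and purely illustrative.

```python
import numpy as np

def lift_to_simplex(f, grad_f, n):
    """Express an l1-ball problem (4) over the scaled simplex (5) via
    x = z[:n] - z[n:], returning the lifted objective and gradient."""
    def f_z(z):
        return f(z[:n] - z[n:])
    def grad_z(z):
        g = grad_f(z[:n] - z[n:])
        return np.concatenate([g, -g])       # chain rule: dx/dz = [I, -I]
    return f_z, grad_z

def lmo_scaled_simplex(c, tau):
    """argmin_{z in tau * simplex} <c, z> = tau * e_i with i = argmin_i c_i."""
    z = np.zeros_like(c, dtype=float)
    z[np.argmin(c)] = tau
    return z
```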
We implemented all the algorithms in Python using the same code framework for fair comparisons. In the case of synthetic data, we generated them from Gaussian distributions. We ran the experiments on a laptop under Linux Ubuntu 18.04 with an Intel Core i7 3.5GHz CPU and 8GB RAM. Code is available at https://github.com/cyrillewcombettes/boostfw. In each experiment, we estimated the smoothness constant $L$ of the (convex) objective function $f : \mathbb{R}^n \to \mathbb{R}$, i.e., the Lipschitz constant of the gradient function $\nabla f : \mathbb{R}^n \to \mathbb{R}^n$, by sampling a few pairs of points $(x, y) \in \mathcal{C}\times\mathcal{C}$ and computing an upper bound on $\|\nabla f(y) - \nabla f(x)\|_2/\|y - x\|_2$. Unless specified otherwise, we set $\delta \leftarrow 10^{-3}$ and $K \leftarrow +\infty$ in BoostFW. The role of $K$ is only to cap the number of pursuit rounds per iteration when the FW oracle is particularly expensive (see Section 4.3).

4.1. Sparse signal recovery

Let $x^* \in \mathbb{R}^n$ be a signal which we want to recover as a sparse representation from observations $y = Ax^* + w$, where $A \in \mathbb{R}^{m\times n}$ and $w \sim \mathcal{N}(0, \sigma^2 I_m)$ is the noise in the measurements. The natural formulation of the problem is
$$\min_{x\in\mathbb{R}^n} \|y - Ax\|_2^2 \quad \text{s.t.}\quad \|x\|_0 \le \|x^*\|_0,$$
but the $\ell_0$-pseudo-norm $\|\cdot\|_0 : x \in \mathbb{R}^n \mapsto |\{i \in \llbracket 1, n\rrbracket : [x]_i \neq 0\}|$ is nonconvex and renders the problem intractable in many situations (Natarajan, 1995). To remedy this, the $\ell_1$-norm is often used as a convex surrogate and leads to the following lasso formulation (Tibshirani, 1996) of the problem:
$$\min_{x\in\mathbb{R}^n} \|y - Ax\|_2^2 \quad \text{s.t.}\quad \|x\|_1 \le \|x^*\|_1.$$
In order to compare to DICG, which is not applicable to this formulation, we ran all algorithms on the reformulation (5). We set $m = 200$, $n = 500$, $\sigma = 0.05$, and $\tau = \|x^*\|_1$. Since the objective function is quadratic, we can derive a closed-form solution to the line search, and there is no need for AFW-L or BoostFW-L. The results are presented in Figure 4.

Figure 4. Sparse signal recovery.
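Since the lasso objective above is quadratic, the line search used by the "-ls" variants admits a closed form along any direction $d$; the following small Python sketch illustrates it (dense $A$ is assumed, and the function name is ours).

```python
import numpy as np

def exact_line_search_lasso(A, y, x, d):
    """Exact line search for f(x) = ||y - A x||_2^2 along direction d:
    gamma* = argmin_{gamma in [0, 1]} f(x + gamma d), in closed form."""
    Ad = A @ d
    denom = 2.0 * np.dot(Ad, Ad)
    if denom == 0.0:
        return 0.0
    grad = 2.0 * A.T @ (A @ x - y)          # gradient of f at x
    gamma = -np.dot(grad, d) / denom        # unconstrained minimizer
    return float(np.clip(gamma, 0.0, 1.0))  # clip to the feasible step range
```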
4.2. Sparsity-constrained logistic regression

We consider the task of recognizing the handwritten digits 4 and 9 from the Gisette dataset (Guyon et al., 2005), available at https://archive.ics.uci.edu/ml/datasets/Gisette. The dataset includes a high number of distractor features with no predictive power. Hence, a sparsity-constrained logistic regression model is suited for the task. The sparsity-inducing constraint is realized via the $\ell_1$-norm:
$$\min_{x\in\mathbb{R}^n} \frac{1}{m}\sum_{i=1}^m \ln\big(1 + \exp(-y_i a_i^\top x)\big) \quad \text{s.t.}\quad \|x\|_1 \le \tau,$$
where $a_1, \ldots, a_m \in \mathbb{R}^n$ and $y \in \{-1, +1\}^m$. In order to compare to DICG, which is not applicable to this formulation, we ran all algorithms on the reformulation (5). We used $m = 2000$ samples, and the number of features is $n = 5000$. We set $\tau = 10$, $L = 0.5$, and $\delta \leftarrow 10^{-4}$ in BoostFW. The results are presented in Figure 5. As expected, AFW-L and BoostFW-L converge faster in CPU time, as they do not rely on line search; however, they converge more slowly per iteration, as each iteration provides less progress.

Figure 5. Sparse logistic regression on the Gisette dataset.

4.3. Traffic assignment

We consider the traffic assignment problem. The task is to assign vehicles on a traffic network in order to minimize congestion while satisfying travel demands. Let $\mathcal{A}$, $\mathcal{R}$, and $\mathcal{S}$ be the sets of links, routes, and origin-destination pairs respectively. For every pair $(i,j) \in \mathcal{S}$, let $\mathcal{R}_{i,j}$ and $d_{i,j}$ be the set of routes and the travel demand from $i$ to $j$. Let $x_a$ and $t_a$ be the flow and the travel time on link $a \in \mathcal{A}$, and let $y_r$ be the flow on route $r \in \mathcal{R}$. The Beckmann formulation of the problem (Beckmann et al., 1956), derived from the Wardrop equilibrium conditions (Wardrop, 1952), is
$$\min_{x\in\mathbb{R}^{|\mathcal{A}|}} \sum_{a\in\mathcal{A}}\int_0^{x_a} t_a(\xi)\,\mathrm{d}\xi \qquad (6)$$
$$\text{s.t.}\quad x_a = \sum_{r\in\mathcal{R}} \mathbb{1}_{\{a\in r\}}\, y_r \quad \forall a\in\mathcal{A}, \qquad \sum_{r\in\mathcal{R}_{i,j}} y_r = d_{i,j} \quad \forall (i,j)\in\mathcal{S}, \qquad y_r \ge 0 \quad \forall r\in\mathcal{R}_{i,j},\ (i,j)\in\mathcal{S}.$$
A commonly used expression for the travel time $t_a$ as a function of the flow $x_a$, developed by the Bureau of Public Roads, is $t_a : x_a \in \mathbb{R}_+ \mapsto \tau_a\big(1 + 0.15(x_a/c_a)^4\big)$, where $\tau_a$ and $c_a$ are the free-flow travel time and the capacity of the link. A linear minimization over the feasible region in (6) amounts to computing the shortest routes between all origin-destination pairs. Thus, the FW oracle is particularly expensive here, so we capped the maximum number of rounds in BoostFW to $K \leftarrow 5$; see Figure 12 in Appendix A.2. We implemented the oracle using the function all_pairs_dijkstra_path from the Python package networkx (Hagberg et al., 2008). We created a directed acyclic graph with 500 nodes split into 20 layers of 25 nodes each, and randomly dropped links with probability 0.5, so $|\mathcal{A}| \approx 6000$ and $|\mathcal{S}| \approx 113000$. We set $d_{i,j} \sim \mathcal{U}([0,1])$ for every $(i,j) \in \mathcal{S}$. DICG is not applicable here, and AFW-L and BoostFW-L were not competitive. The results are presented in Figure 6.

Figure 6. Traffic assignment.

4.4. Collaborative filtering

We consider the task of collaborative filtering on the MovieLens 100k dataset (Harper & Konstan, 2015), available at https://grouplens.org/datasets/movielens/100k/. The low-rank assumption on the solution and the approach of Mehta et al. (2007) lead to the following problem formulation:
$$\min_{X\in\mathbb{R}^{m\times n}} \frac{1}{|\mathcal{I}|}\sum_{(i,j)\in\mathcal{I}} h_\rho(Y_{i,j} - X_{i,j}) \quad \text{s.t.}\quad \|X\|_{\mathrm{nuc}} \le \tau,$$
where $h_\rho$ is the Huber loss with parameter $\rho > 0$ (Huber, 1964):
$$h_\rho : t \mapsto \begin{cases} t^2/2 & \text{if } |t| \le \rho,\\ \rho(|t| - \rho/2) & \text{if } |t| > \rho,\end{cases}$$
$Y \in \mathbb{R}^{m\times n}$ is the given matrix to complete, $\mathcal{I} \subseteq \llbracket 1, m\rrbracket\times\llbracket 1, n\rrbracket$ is the set of indices of observed entries in $Y$, and $\|\cdot\|_{\mathrm{nuc}} : X \in \mathbb{R}^{m\times n} \mapsto \operatorname{tr}(\sqrt{X^\top X}) = \sum_{i=1}^{\min\{m,n\}}\sigma_i(X)$ is the nuclear norm, equal to the sum of the singular values. It serves as a convex surrogate for the rank constraint (Fazel et al., 2001). Since $\{X \in \mathbb{R}^{m\times n} : \|X\|_{\mathrm{nuc}} \le 1\} = \operatorname{conv}(\{uv^\top : u \in \mathbb{R}^m,\ v \in \mathbb{R}^n,\ \|u\|_2 = \|v\|_2 = 1\})$, a linear minimization over the nuclear norm-ball of radius $\tau$ amounts to computing the top left and right singular vectors $u$ and $v$ of $-\nabla f(X_t)$ and returning $\tau uv^\top$. To this end, we used the function svds from the Python package scipy.sparse.linalg (Virtanen et al., 2020). We have $m = 943$, $n = 1682$, and $|\mathcal{I}| = 10^5$, and we set $\rho = 1$, $\tau = 5000$, and $L = 5\cdot 10^{-6}$. DICG is not applicable here. The results are presented in Figure 7.

Figure 7. Collaborative filtering on the MovieLens 100k dataset.
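For reference, the objective of this experiment is straightforward to evaluate on the observed entries only; a minimal Python sketch of $h_\rho$ and $f$ follows (names are ours and illustrative, with the observed index set passed as a pair of row/column index arrays).

```python
import numpy as np

def huber(t, rho):
    """Huber loss h_rho applied elementwise."""
    a = np.abs(t)
    return np.where(a <= rho, 0.5 * t**2, rho * (a - 0.5 * rho))

def cf_objective(X, Y, obs, rho):
    """f(X) = (1/|I|) * sum_{(i,j) in I} h_rho(Y_ij - X_ij),
    where obs = (rows, cols) indexes the observed entries I."""
    residuals = Y[obs] - X[obs]
    return huber(residuals, rho).mean()
```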
The time limit here was set to 500 seconds, but for AFW-L we reduced it to 250 seconds, as otherwise it raises a memory error on our machine shortly after. This is because AFW requires storing the decomposition of the iterates onto $\mathcal{V}$. Note that BoostFW-ls converges faster in CPU time than AFW-L, although it relies on line search, and that BoostFW-L converges faster per iteration than the other methods although it does not rely on line search.

4.5. Video co-localization

We consider the task of video co-localization on the aeroplane class of the YouTube-Objects dataset (Prest et al., 2012), using the problem formulation of Joulin et al. (2014). The goal is to localize (with bounding boxes) the aeroplane object across the video frames. It consists in minimizing $f : x \in \mathbb{R}^{660} \mapsto x^\top Ax/2 + b^\top x$ over a flow polytope, where $A \in \mathbb{R}^{660\times 660}$, $b \in \mathbb{R}^{660}$, and the polytope each encode a part of the temporal consistency in the video frames. We obtained the data from https://github.com/Simon-Lacoste-Julien/linearFW. A linear minimization over the flow polytope amounts to computing a shortest path in the corresponding directed acyclic graph. We implemented the boosting procedure for DICG, which we labeled BoostDICG; see details in Appendix B. Since the objective function is quadratic, we can derive a closed-form solution to the line search, and there is no need for AFW-L or BoostFW-L. We set $\delta \leftarrow 10^{-7}$ in BoostFW and $\delta \leftarrow 10^{-15}$ in BoostDICG. The results are presented in Figure 8.

Figure 8. Video co-localization on the YouTube-Objects dataset.

All algorithms provide a similar level of performance in function value. In Garber & Meshi (2016), the algorithms are compared with respect to the duality gap $\max_{v\in\mathcal{V}} \langle\nabla f(x_t), x_t - v\rangle$ (Jaggi, 2013) on the same experiment. For completeness, we report a similar study in Figure 9. The boosting procedure applied to DICG produces very promising empirical results. Appendix A.3 presents comparisons in duality gap for the other experiments. DICG converges faster than BoostFW in duality gap here (after closing it to $10^{-6}$ though), but this is not the case in the other experiments.

Figure 9. Video co-localization on the YouTube-Objects dataset.

5. Final remarks

We have proposed a new and intuitive method to speed up the Frank-Wolfe algorithm by descending in directions better aligned with those of the negative gradients $-\nabla f(x_t)$, all the while remaining projection-free. Our method does not need to maintain the decomposition of the iterates and can naturally be used to boost the performance of any Frank-Wolfe-style algorithm. Although the linear minimization oracle may be called multiple times per iteration, the progress obtained greatly overcomes this cost and leads to strong gains in performance. We demonstrated in a variety of experiments the computational advantage of our method, both per iteration and in CPU time, over the state-of-the-art. Furthermore, it does not require line search to produce strong performance in practice, which is particularly useful on instances where line searches are excessively expensive. Future work may replace the gradient pursuit procedure with a faster conic optimization algorithm to potentially reduce the number of oracle calls. It could also be interesting to investigate how to make each oracle call cheaper via, e.g., lazification (Braun et al., 2017) or subsampling (Kerdreux et al., 2018). Lastly, we expect significant gains in performance when applying our approach to chase the gradient estimators in (non)convex stochastic Frank-Wolfe algorithms as well (Hazan & Luo, 2016; Xie et al., 2020).
Acknowledgments

Research reported in this paper was partially supported by NSF CAREER Award CMMI-1452463.

References

Bashiri, M. A. and Zhang, X. Decomposition-invariant conditional gradient for general polytopes with line search. In Advances in Neural Information Processing Systems 30, pp. 2690–2700, 2017.

Beckmann, M. J., McGuire, C. B., and Winsten, C. B. Studies in the Economics of Transportation. Yale University Press, 1956.

Bolte, J., Nguyen, T.-P., Peypouquet, J., and Suter, B. W. From error bounds to the complexity of first-order descent methods for convex functions. Mathematical Programming, 165(2):471–507, 2017.

Braun, G., Pokutta, S., and Zink, D. Lazifying conditional gradient algorithms. In Proceedings of the 34th International Conference on Machine Learning, pp. 566–575, 2017.

Braun, G., Pokutta, S., Tu, D., and Wright, S. Blended conditional gradients: the unconditioning of conditional gradients. In Proceedings of the 36th International Conference on Machine Learning, pp. 735–743, 2019.

Canon, M. D. and Cullum, C. D. A tight upper bound on the rate of convergence of Frank-Wolfe algorithm. SIAM Journal on Control, 6(4):509–516, 1968.

Chen, J., Zhou, D., Yi, J., and Gu, Q. A Frank-Wolfe framework for efficient and effective adversarial attacks. In Proceedings of the 34th AAAI Conference on Artificial Intelligence, pp. 3486–3494, 2020.

Fazel, M., Hindi, H., and Boyd, S. P. A rank minimization heuristic with application to minimum order system approximation. In Proceedings of the American Control Conference, pp. 4734–4739, 2001.

Frank, M. and Wolfe, P. An algorithm for quadratic programming. Naval Research Logistics Quarterly, 3(1-2):95–110, 1956.

Garber, D. and Hazan, E. Faster rates for the Frank-Wolfe method over strongly-convex sets. In Proceedings of the 32nd International Conference on Machine Learning, pp. 541–549, 2015.

Garber, D. and Hazan, E. A linearly convergent variant of the conditional gradient algorithm under strong convexity, with applications to online and stochastic optimization. SIAM Journal on Optimization, 26(3):1493–1528, 2016.

Garber, D. and Meshi, O. Linear-memory and decomposition-invariant linearly convergent conditional gradient algorithm for structured polytopes. In Advances in Neural Information Processing Systems 29, pp. 1001–1009, 2016.

Guélat, J. and Marcotte, P. Some comments on Wolfe's "away step". Mathematical Programming, 35(1):110–119, 1986.

Guyon, I., Gunn, S., Ben-Hur, A., and Dror, G. Result analysis of the NIPS 2003 feature selection challenge. In Advances in Neural Information Processing Systems 17, pp. 545–552, 2005.

Hagberg, A. A., Schult, D. A., and Swart, P. J. Exploring network structure, dynamics, and function using NetworkX. In Proceedings of the 7th Python in Science Conference, pp. 11–15, 2008.

Harper, F. M. and Konstan, J. A. The MovieLens datasets: History and context. ACM Transactions on Interactive Intelligent Systems, 5(4):19:1–19:19, 2015.

Hazan, E. and Kale, S. Projection-free online learning. In Proceedings of the 29th International Conference on Machine Learning, 2012.

Hazan, E. and Luo, H. Variance-reduced and projection-free stochastic optimization. In Proceedings of the 33rd International Conference on Machine Learning, pp. 1263–1271, 2016.

Huber, P. J. Robust estimation of a location parameter. The Annals of Mathematical Statistics, 35(1):73–101, 1964.

Jaggi, M. Revisiting Frank-Wolfe: Projection-free sparse convex optimization.
In Proceedings of the 30th International Conference on Machine Learning, pp. 427–435, 2013.

Joulin, A., Tang, K., and Fei-Fei, L. Efficient image and video co-localization with Frank-Wolfe algorithm. In European Conference on Computer Vision, pp. 253–268, 2014.

Karimi, H., Nutini, J., and Schmidt, M. Linear convergence of gradient and proximal-gradient methods under the Polyak-Łojasiewicz condition. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pp. 795–811, 2016.

Kerdreux, T., Pedregosa, F., and d'Aspremont, A. Frank-Wolfe with subsampling oracle. In Proceedings of the 35th International Conference on Machine Learning, pp. 2591–2600, 2018.

Kurdyka, K. On gradients of functions definable in o-minimal structures. Annales de l'Institut Fourier, 48(3):769–783, 1998.

Lacoste-Julien, S. and Jaggi, M. On the global linear convergence of Frank-Wolfe optimization variants. In Advances in Neural Information Processing Systems 28, pp. 496–504, 2015.

Lacoste-Julien, S., Jaggi, M., Schmidt, M., and Pletscher, P. Block-coordinate Frank-Wolfe optimization for structural SVMs. In Proceedings of the 30th International Conference on Machine Learning, pp. 53–61, 2013.

Lan, G. The complexity of large-scale convex programming under a linear optimization oracle. Technical report, Department of Industrial and Systems Engineering, University of Florida, 2013.

LeBlanc, L. J., Morlok, E. K., and Pierskalla, W. P. An efficient approach to solving the road network equilibrium traffic assignment problem. Transportation Research, 9(5):309–318, 1975.

Levitin, E. S. and Polyak, B. T. Constrained minimization methods. USSR Computational Mathematics and Mathematical Physics, 6(5):1–50, 1966.

Locatello, F., Tschannen, M., Rätsch, G., and Jaggi, M. Greedy algorithms for cone constrained optimization with convergence guarantees. In Advances in Neural Information Processing Systems 30, pp. 773–784, 2017.

Łojasiewicz, S. Une propriété topologique des sous-ensembles analytiques réels. In Les Équations aux Dérivées Partielles, 117, pp. 87–89. Colloques Internationaux du CNRS, 1963.

Mallat, S. G. and Zhang, Z. Matching pursuits with time-frequency dictionaries. IEEE Transactions on Signal Processing, 41(12):3397–3415, 1993.

Mehta, B., Hofmann, T., and Nejdl, W. Robust collaborative filtering. In Proceedings of the 2007 ACM Conference on Recommender Systems, pp. 49–56, 2007.

Natarajan, B. K. Sparse approximate solutions to linear systems. SIAM Journal on Computing, 24(2):227–234, 1995.

Pedregosa, F., Négiar, G., Askari, A., and Jaggi, M. Linearly convergent Frank-Wolfe with backtracking line-search. In Proceedings of the 23rd International Conference on Artificial Intelligence and Statistics, pp. 1–10, 2020.

Ping, W., Liu, Q., and Ihler, A. T. Learning infinite RBMs with Frank-Wolfe. In Advances in Neural Information Processing Systems 29, pp. 3063–3071, 2016.

Polyak, B. T. Gradient methods for the minimisation of functionals. USSR Computational Mathematics and Mathematical Physics, 3(4):864–878, 1963.

Prest, A., Leistner, C., Civera, J., Schmid, C., and Ferrari, V. Learning object class detectors from weakly annotated video. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 3282–3289, 2012.

Shalev-Shwartz, S., Gonen, A., and Shamir, O. Large-scale convex minimization with a low-rank constraint. In Proceedings of the 28th International Conference on Machine Learning, pp. 329–336, 2011.

Tibshirani, R.
Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B (Methodological), 58(1):267–288, 1996.

Virtanen, P., Gommers, R., Oliphant, T. E., Haberland, M., Reddy, T., Cournapeau, D., Burovski, E., Peterson, P., Weckesser, W., Bright, J., van der Walt, S. J., Brett, M., Wilson, J., Millman, K. J., Mayorov, N., Nelson, A. R. J., Jones, E., Kern, R., Larson, E., Carey, C., Polat, İ., Feng, Y., Moore, E. W., VanderPlas, J., Laxalde, D., Perktold, J., Cimrman, R., Henriksen, I., Quintero, E. A., Harris, C. R., Archibald, A. M., Ribeiro, A. H., Pedregosa, F., van Mulbregt, P., and SciPy 1.0 Contributors. SciPy 1.0: Fundamental algorithms for scientific computing in Python. Nature Methods, 17(3):261–272, 2020.

Wardrop, J. G. Some theoretical aspects of road traffic research. In Proceedings of the Institute of Civil Engineers, volume 1, pp. 325–378, 1952.

Wolfe, P. Convergence theory in nonlinear programming. In Integer and Nonlinear Programming, pp. 1–36. North-Holland, 1970.

Xie, J., Shen, Z., Zhang, C., Qian, H., and Wang, B. Efficient projection-free online methods with stochastic recursive gradient. In Proceedings of the 34th AAAI Conference on Artificial Intelligence, pp. 6446–6453, 2020.