# Heavy Ball Momentum for Conditional Gradient

Bingcong Li, Alireza Sadeghi, Georgios B. Giannakis
University of Minnesota - Twin Cities, Minneapolis, MN, USA
{lixx5599, sadeg012, georgios}@umn.edu

Conditional gradient, aka Frank Wolfe (FW) algorithms, have well-documented merits in machine learning and signal processing applications. Unlike projection-based methods, momentum cannot improve the convergence rate of FW, in general. This limitation motivates the present work, which deals with heavy ball momentum and its impact on FW. Specifically, it is established that heavy ball offers a unifying perspective on primal-dual (PD) convergence, and enjoys a tighter per-iteration PD error rate for multiple choices of step sizes, where the PD error can serve as the stopping criterion in practice. In addition, it is asserted that restart, a scheme typically employed jointly with Nesterov's momentum, can further tighten this PD error bound. Numerical results demonstrate the usefulness of heavy ball momentum in FW iterations.

1 Introduction

This work studies momentum in Frank Wolfe (FW) methods [9,10,16,20] for solving

  $\min_{x \in \mathcal{X}} f(x).$   (1)

Here, $f$ is a convex function with Lipschitz continuous gradients, and the constraint set $\mathcal{X} \subseteq \mathbb{R}^d$ is assumed convex and compact, where $d$ is the dimension of the variable $x$. Throughout, we let $x^* \in \mathcal{X}$ denote a minimizer of (1). FW and its variants are prevalent in various machine learning and signal processing applications, such as traffic assignment [12], non-negative matrix factorization [30], video co-localization [17], image reconstruction [15], particle filtering [19], electric vehicle charging [36], recommender systems [11], optimal transport [26], and neural network pruning [34]. The popularity of FW is partially due to the elimination of projection compared with projected gradient descent (GD) [29], leading to computational efficiency especially when $d$ is large. In particular, FW solves a subproblem with a linear loss, i.e., $v_{k+1} \in \arg\min_{v \in \mathcal{X}} \langle \nabla f(x_k), v \rangle$, at the $k$th iteration, and then updates $x_{k+1}$ as a convex combination of $x_k$ and $v_{k+1}$. When dealing with a structured $\mathcal{X}$, a closed-form or efficient solution for $v_{k+1}$ is available [13,16], which is preferable over projection.

Unlike projection-based algorithms [14,32] though, momentum does not perform well with FW. Indeed, the lower bound in [16,20] demonstrates that at least $O(1/\epsilon)$ linear subproblems are required to ensure $f(x_k) - f(x^*) \le \epsilon$, which does not guarantee that momentum is beneficial for FW, because even vanilla FW achieves this lower bound. In this work, we contend that momentum is evidently useful for FW. Specifically, we prove that heavy ball momentum leads to a tightened and efficiently computed primal-dual error bound, as well as numerical improvement. To this end, we outline first the primal convergence.

Primal convergence. The primal error refers to $f(x_k) - f(x^*)$. It is guaranteed for FW that $f(x_k) - f(x^*) = O(1/k)$, $\forall k \ge 1$ [16,22]. This rate is tight in general since it matches the lower bound [16,20]. Other FW variants also ensure the same order of primal error; see e.g., [20,21].
Table 1: A comparison of HFW with relevant works. The "computation" column lists the number of FW subproblems that must be solved per iteration to calculate the PD error.

| reference | computation | PD conv. type | PD conv. rate |
|---|---|---|---|
| [16] | 1 subproblem | Type I | $\frac{27LD^2}{4(K+1)}$ |
| [18] | 2 subproblems | Type II | $\frac{2LD^2}{\sqrt{k+1}}$, $\forall k$ |
| [28] | 2 subproblems | Type II | $\frac{4LD^2}{k+1}$, $\forall k$ |
| This work (Alg. 2) | 1 subproblem | Type II | $\frac{2LD^2}{k+1}$, $\forall k$ |
| This work (Alg. 3) | 2 subproblems | Type II | $\frac{2LD^2}{k+1+c}$, $\forall k$, with $c \ge 0$ |

Primal-dual convergence. The primal-dual (PD) error quantifies the gap of both the primal and the dual objective values from the optimal objective; hence, it is an upper bound on the primal error. When the PD error is shown to converge, it can safely be used as the stopping criterion: whenever the PD error is less than some prescribed $\epsilon > 0$, $f(x_k) - f(x^*) \le \epsilon$ is ensured automatically. The PD error of FW is convenient to compute; hence FW is well suited to settings that require solving problems to a desired accuracy; see e.g., [33]. For pruning (two-layer) neural networks [34], the extra training loss incurred by removing neurons can be estimated via the PD error. However, due to technical difficulties, existing analyses of the PD error are not fully satisfactory and lack unification. It is established in [6,10,16] that the minimum PD error is sufficiently small, namely $\min_{k \in \{1,\dots,K\}} \mathrm{PDerror}_k = O(1/K)$, where $K$ is the total number of iterations. We term such a bound on the minimum PD error a Type I guarantee. Another, stronger guarantee, which directly implies a Type I bound, emphasizes per-iteration convergence, e.g., $\mathrm{PDerror}_k \le O(1/k), \forall k$. We term such guarantees Type II bounds. A Type II bound is reported in [18, Theorem 2], but with an unsatisfactory dependence on $k$. This is improved by [7,28] at the price of extra computational burden, since two FW subproblems per iteration are needed to compute the PD error. Several related works such as [10] provide a weaker PD error bound compared with [28]; see the summary in Table 1.

In this work, we show that a computationally affordable Type II bound can be obtained by simply relying on heavy ball momentum. Interestingly, FW based on heavy ball momentum (HFW) also maintains FW's neat geometric interpretation. Through a unified analysis, the resulting Type II PD error improves over existing bounds; see Table 1. This PD error of HFW is further tightened using restart. Although restart is more popular in projection-based methods together with Nesterov's momentum [31], we show that restart for FW is natural to adopt jointly with heavy ball. In succinct form, our contributions can be summarized as follows.

- We show through a unified analysis that HFW enables a tighter Type II guarantee on the PD error for multiple choices of the step size. When used as a stopping criterion, no extra subproblem is needed.
- The Type II bound can be further tightened by restart, triggered through a comparison between two PD-error-related quantities.
- Numerical tests on benchmark datasets support the effectiveness of heavy ball momentum. As a byproduct, a simple yet efficient means of computing local Lipschitz constants becomes available to improve the numerical efficiency of smooth step sizes [13,22].

Notation. Bold lowercase (capital) letters denote column vectors (matrices); $\|x\|$ stands for a norm of a vector $x$, whose dual norm is denoted by $\|x\|_*$; and $\langle x, y \rangle$ is the inner product of $x$ and $y$.
2 Preliminaries

This section outlines FW, starting with standard assumptions that are taken to hold throughout.

Assumption 1. (Lipschitz continuous gradient.) The objective function $f : \mathcal{X} \to \mathbb{R}$ has $L$-Lipschitz continuous gradients; i.e., $\|\nabla f(x) - \nabla f(y)\|_* \le L \|x - y\|$, $\forall x, y \in \mathcal{X}$.

Assumption 2. (Convexity.) The objective function $f : \mathcal{X} \to \mathbb{R}$ is convex; that is, $f(y) - f(x) \ge \langle \nabla f(x), y - x \rangle$, $\forall x, y \in \mathcal{X}$.

Assumption 3. (Convex and compact constraint set.) The constraint set $\mathcal{X} \subseteq \mathbb{R}^d$ is convex and compact with diameter $D$, that is, $\|x - y\| \le D$, $\forall x, y \in \mathcal{X}$.

Algorithm 1 FW [9]
1: Initialize: x_0 ∈ X
2: for k = 0, 1, ..., K−1 do
3:   v_{k+1} = arg min_{v ∈ X} ⟨∇f(x_k), v⟩
4:   x_{k+1} = (1 − η_k) x_k + η_k v_{k+1}
5: end for
6: Return: x_K

FW for solving (1) under Assumptions 1-3 is listed in Alg. 1. The subproblem in Line 3 can be visualized geometrically as minimizing a supporting hyperplane of $f(x)$ at $x_k$, i.e.,

  $v_{k+1} \in \arg\min_{v \in \mathcal{X}} f(x_k) + \langle \nabla f(x_k), v - x_k \rangle.$   (2)

For many constraint sets, an efficient implementation or a closed-form solution is available for $v_{k+1}$; see e.g., [16] for a comprehensive summary. Upon minimizing the supporting hyperplane in (2), $x_{k+1}$ is updated as a convex combination of $v_{k+1}$ and $x_k$ in Line 4, so that no projection is required. The choices of the step size $\eta_k \in [0, 1]$ are discussed shortly.

The PD error of Alg. 1 is captured by the so-termed FW gap, formally defined as

  $G_k := \langle \nabla f(x_k), x_k - v_{k+1} \rangle = \underbrace{f(x_k) - f(x^*)}_{\text{primal error}} + \underbrace{f(x^*) - \min_{v \in \mathcal{X}} \big[ f(x_k) + \langle \nabla f(x_k), v - x_k \rangle \big]}_{\text{dual error}}$   (3)

where the second equality follows from (2). It can be verified that both the primal and dual errors marked in (3) are nonnegative by appealing to the convexity of $f$. If $G_k$ converges, one can deduce that the primal error converges. For this reason, $G_k$ is typically used as a stopping criterion for Alg. 1. Next, we focus on the step sizes that ensure convergence.

Parameter-free step size. This type of step size does not rely on any problem-dependent parameters such as $L$ and $D$, and hence it is extremely simple to implement. The most commonly adopted step size is $\eta_k = \frac{2}{k+2}$, which ensures a converging primal error, $f(x_k) - f(x^*) \le \frac{2LD^2}{k+1}, \forall k \ge 1$, and a weaker claim on the PD error, $\min_{k \in \{1,\dots,K\}} G_k \le \frac{27LD^2}{4K}$ [16]. A variant of PD convergence has been established recently based on a modified FW gap [28]. Although Type II convergence is obtained, the modified FW gap therein is inefficient to serve as a stopping criterion, because an additional FW subproblem has to be solved per iteration to compute its value.

Smooth step size. When (an estimate of) the Lipschitz constant $L$ is available, one can adopt the following step size in Alg. 1 [22]

  $\eta_k = \min\Big\{ \frac{\langle \nabla f(x_k), x_k - v_{k+1} \rangle}{L \|v_{k+1} - x_k\|^2}, \, 1 \Big\}.$   (4)

Although the estimated $L$ is typically too pessimistic to capture the local Lipschitz continuity, such a step size ensures $f(x_{k+1}) \le f(x_k)$; see the derivations in Appendix A.1. The PD convergence is studied in [11], where the result is slightly weaker than that of [28]. A minimal code sketch of Alg. 1 with both step size options is given at the end of this section.
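The following is a minimal sketch of Alg. 1, ours rather than the authors' code, on a small synthetic logistic-regression instance of the kind used later in Sec. 5.1, with an ℓ2-norm ball constraint. The function and variable names are hypothetical; either the parameter-free step size or the smooth step size (4) can be selected.

```python
import numpy as np

def logistic_loss_grad(x, A, b):
    """Logistic loss f(x) = (1/N) sum_i ln(1 + exp(-b_i <a_i, x>)) and its gradient."""
    z = b * (A @ x)
    loss = np.mean(np.logaddexp(0.0, -z))
    sig_neg = 0.5 * (1.0 - np.tanh(0.5 * z))      # numerically stable sigma(-z) = 1/(1+e^z)
    grad = -(A.T @ (b * sig_neg)) / A.shape[0]
    return loss, grad

def lmo_l2_ball(g, R):
    """arg min_{||v||_2 <= R} <g, v> = -R g / ||g||_2."""
    nrm = np.linalg.norm(g)
    return np.zeros_like(g) if nrm == 0 else -R * g / nrm

def frank_wolfe(A, b, R, K=1000, L=None, eps=1e-6):
    """Alg. 1 with the FW gap (3) as stopping criterion.
    If L is given, the smooth step size (4) is used; otherwise eta_k = 2/(k+2)."""
    x = np.zeros(A.shape[1])
    for k in range(K):
        _, g = logistic_loss_grad(x, A, b)
        v = lmo_l2_ball(g, R)
        gap = g @ (x - v)                          # FW gap G_k in (3)
        if gap <= eps:
            break
        if L is None:
            eta = 2.0 / (k + 2)                    # parameter-free step size
        else:
            eta = min(gap / (L * np.dot(v - x, v - x)), 1.0)   # smooth step size (4)
        x = (1 - eta) * x + eta * v
    return x

# toy usage on synthetic data
rng = np.random.default_rng(0)
A = rng.standard_normal((200, 50))
b = np.sign(rng.standard_normal(200))
x_hat = frank_wolfe(A, b, R=5.0)
```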
3 FW with heavy ball momentum

After a brief recap of vanilla FW, we focus on the benefits of heavy ball momentum for FW under multiple step size choices, with special emphasis on PD errors.

3.1 Prelude

Algorithm 2 FW with heavy ball momentum
1: Initialize: x_0 ∈ X, g_0 = ∇f(x_0)
2: for k = 0, 1, ..., K−1 do
3:   g_{k+1} = (1 − δ_k) g_k + δ_k ∇f(x_k)
4:   v_{k+1} = arg min_{v ∈ X} ⟨g_{k+1}, v⟩
5:   x_{k+1} = (1 − η_k) x_k + η_k v_{k+1}
6: end for
7: Return: x_K

HFW is summarized in Alg. 2. Similar to GD with heavy ball momentum [14,32], Alg. 2 updates the decision variable using a weighted average of gradients, $g_{k+1}$. In addition, the update direction of Alg. 2 is no longer guaranteed to be a descent direction, because in HFW $\langle \nabla f(x_k), x_k - v_{k+1} \rangle$ can be negative. Although a stochastic version of heavy ball momentum was adopted in [27] and its variants, e.g., [37], to reduce the mean-square error of the gradient estimate, heavy ball is introduced here for a totally different purpose, namely to improve the PD error. The most significant differences are technical and are discussed in Sec. 3.4. Next, we gain some intuition on why heavy ball can be beneficial.

Consider $\mathcal{X}$ as an ℓ2-norm ball, that is, $\mathcal{X} = \{x \,|\, \|x\|_2 \le R\}$. In this case, Alg. 2 yields $v_{k+1} = -R \, g_{k+1} / \|g_{k+1}\|_2$. The momentum $g_{k+1}$ can smooth out the changes of $\{\nabla f(x_k)\}$, resulting in a more concentrated sequence $\{v_{k+1}\}$. Recall that the PD error is closely related to $v_{k+1}$ [cf. (3)]. We expect the concentration of $\{v_{k+1}\}$ to help reduce the changes of the PD error across consecutive iterations, so that a Type II PD error bound becomes attainable.

A few concepts are necessary to obtain a tightened PD error of HFW. First, we introduce the generalized FW gap associated with Alg. 2 that captures the PD error. Write $g_{k+1}$ explicitly as $g_{k+1} = \sum_{\tau=0}^{k} w_\tau^k \nabla f(x_\tau)$, where $w_\tau^k = \delta_\tau \prod_{j=\tau+1}^{k} (1-\delta_j) > 0$ for $\tau \ge 1$, and $w_0^k = \prod_{j=1}^{k} (1-\delta_j) > 0$. Then, define a sequence of linear functions $\{\Phi_k(x)\}$ as

  $\Phi_{k+1}(x) = \sum_{\tau=0}^{k} w_\tau^k \big[ f(x_\tau) + \langle \nabla f(x_\tau), x - x_\tau \rangle \big], \quad \forall k \ge 0.$   (5)

It is clear that $\Phi_{k+1}(x)$ is a weighted average of the supporting hyperplanes of $f(x)$ at $\{x_\tau\}_{\tau=0}^{k}$. The properties of $\Phi_{k+1}(x)$, and how they relate to Alg. 2, are summarized in the next lemma.

Lemma 1. For the linear function $\Phi_{k+1}(x)$ in (5), it holds that: i) $v_{k+1}$ minimizes $\Phi_{k+1}(x)$ over $\mathcal{X}$; and, ii) $f(x) \ge \Phi_{k+1}(x)$, $\forall k \ge 0$, $\forall x \in \mathcal{X}$.

From the last lemma, one can see that $v_k$ is obtained by minimizing $\Phi_k(x)$, which is an affine lower bound on $f(x)$. Hence, HFW admits a geometric interpretation similar to that of FW. In addition, based on $\Phi_k(x)$ we can define the generalized FW gap.

Definition 1. (Generalized FW gap.) The generalized FW gap w.r.t. $\Phi_k(x)$ is

  $\mathcal{G}_k := f(x_k) - \min_{x \in \mathcal{X}} \Phi_k(x) = f(x_k) - \Phi_k(v_k).$   (6)

In words, the generalized FW gap is defined as the difference between $f(x_k)$ and the minimal value of $\Phi_k(x)$ over $\mathcal{X}$. The newly defined $\mathcal{G}_k$ also reflects the PD error:

  $\mathcal{G}_k = f(x_k) - \Phi_k(v_k) = \underbrace{f(x_k) - f(x^*)}_{\text{primal error}} + \underbrace{f(x^*) - \Phi_k(v_k)}_{\text{dual error}}.$   (7)

For the dual error, we have $f(x^*) - \Phi_k(v_k) \ge \Phi_k(x^*) - \Phi_k(v_k) \ge 0$, where both inequalities follow from Lemma 1. Hence, $\mathcal{G}_k \ge 0$ automatically serves as an overestimate of both primal and dual errors. Once the convergence of $\mathcal{G}_k$ is established, it can be adopted as the stopping criterion for Alg. 2 (a code sketch of Alg. 2 that also tracks this gap is provided at the end of Sec. 3.2). Related claims have been made for the generalized FW gap [20,23,28]. Lacking heavy ball momentum, however, leads to inefficiency, because an additional FW subproblem is needed to compute this gap [28]. The works [20,23] focus on Nesterov's momentum for FW, which incurs additional memory relative to HFW; see also Sec. 3.4 for further elaboration. Having defined the generalized FW gap, we next pursue parameter choices that establish Type II convergence guarantees.

3.2 Parameter-free step size

We first consider a parameter-free choice for HFW to demonstrate the usefulness of heavy ball momentum:

  $\delta_k = \eta_k = \frac{2}{k+2}, \quad \forall k \ge 0.$   (8)

Such a choice of $\delta_k$ puts more weight on recent gradients when calculating $g_{k+1}$, since $w_\tau^k = O(\tau / k^2)$. The following theorem specifies the convergence of $\mathcal{G}_k$.

Theorem 1. If Assumptions 1-3 hold, then choosing $\delta_k$ and $\eta_k$ as in (8), Alg. 2 guarantees that

  $\mathcal{G}_k = f(x_k) - \Phi_k(v_k) \le \frac{2LD^2}{k+1}, \quad \forall k \ge 1.$

Theorem 1 provides a much stronger PD guarantee for all $k$ than vanilla FW [16, Theorem 2]. In addition to a readily computable generalized FW gap, our rate is tighter than that of [28], where the provided bound is $\frac{4LD^2}{k+1}$. In fact, the constants in our PD bound even match the best known primal error of vanilla FW.
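For intuition, here is a minimal sketch of Alg. 2, ours rather than the authors' released code, that also tracks the generalized FW gap of Definition 1 by maintaining $\Phi_{k+1}$ through its gradient $g_{k+1}$ and a scalar offset, consistent with (5); the paper's own iterative update is in Appendix C. The helpers `f`, `grad_f`, and `lmo` are assumed to be supplied by the user, and the step-size options mirror (8) and (9).

```python
import numpy as np

def hfw(f, grad_f, lmo, x0, K=500, L=None, eps=1e-6):
    """Alg. 2 (HFW). Maintains Phi_{k+1}(x) = <g, x> + c, so the generalized FW
    gap (6) at iterate k+1 is f(x_{k+1}) - (<g_{k+1}, v_{k+1}> + c_{k+1})."""
    x = np.array(x0, dtype=float)
    g = grad_f(x)
    c = f(x) - g @ x                    # offset of the supporting hyperplane at x_0
    gap = np.inf
    for k in range(K):
        delta = 2.0 / (k + 2)
        gk = grad_f(x)
        g = (1 - delta) * g + delta * gk                 # heavy ball momentum
        c = (1 - delta) * c + delta * (f(x) - gk @ x)    # offset of Phi_{k+1}
        v = lmo(g)                                       # arg min_{v in X} <g_{k+1}, v>
        if L is None:
            eta = delta                                  # choice (8)
        else:                                            # smooth step size (9), trimmed to [0, 1]
            denom = L * np.dot(v - x, v - x)
            eta = 0.0 if denom == 0 else max(0.0, min(gk @ (x - v) / denom, 1.0))
        x = (1 - eta) * x + eta * v
        gap = f(x) - (g @ v + c)                         # generalized FW gap, Definition 1
        if gap <= eps:
            break
    return x, gap
```

A usage example is `hfw(lambda x: f_val(x), lambda x: f_grad(x), lambda g: lmo_l2_ball(g, R), x0)`, reusing the oracles from the earlier sketch; no extra subproblem is needed to evaluate the gap, in line with Table 1.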
A direct consequence of Theorem 1 is the convergence of both primal and dual errors.

Corollary 1. Choosing the parameters as in Theorem 1, Alg. 2 guarantees for all $k \ge 1$ that

  primal conv.: $f(x_k) - f(x^*) \le \frac{2LD^2}{k+1}$;  dual conv.: $f(x^*) - \Phi_k(v_k) \le \frac{2LD^2}{k+1}$.

Proof. Combine Theorem 1 with $f(x_k) - f(x^*) \le \mathcal{G}_k$ and $f(x^*) - \Phi_k(v_k) \le \mathcal{G}_k$ [cf. (7)].

3.3 Smooth step size

Next, we focus on HFW with a variant of the smooth step size:

  $\delta_k = \frac{2}{k+2}$ and $\eta_k = \max\Big\{ 0, \min\Big\{ \frac{\langle \nabla f(x_k), x_k - v_{k+1} \rangle}{L \|v_{k+1} - x_k\|^2}, \, 1 \Big\} \Big\}.$   (9)

Compared with the smooth step size for vanilla FW in (4), the choice of $\eta_k$ in (9) has to be trimmed to $[0, 1]$ manually, because $\langle \nabla f(x_k), x_k - v_{k+1} \rangle$ is no longer guaranteed to be positive (a brief sketch of the calculation behind this step size is given at the end of this subsection). The smooth step size enables an adaptive means of adjusting the weight of $\nabla f(x_k)$. To see this, note that when $\eta_k = 0$, we have $x_{k+1} = x_k$. As a result, $g_{k+2} = (1-\delta_{k+1}) g_{k+1} + \delta_{k+1} \nabla f(x_{k+1}) = (1-\delta_{k+1}) g_{k+1} + \delta_{k+1} \nabla f(x_k)$; that is, the weight on $\nabla f(x_k)$ is adaptively increased to $\delta_k (1-\delta_{k+1}) + \delta_{k+1}$ if one further unpacks $g_{k+1}$. Another analytical benefit of the step size in (9) is that it guarantees a non-increasing objective value; see Appendix A.2 for derivations. Convergence of the generalized FW gap is established next.

Theorem 2. If Assumptions 1-3 hold, while $\eta_k$ and $\delta_k$ are chosen as in (9), Alg. 2 guarantees that

  $\mathcal{G}_k = f(x_k) - \Phi_k(v_k) \le \frac{2LD^2}{k+1}, \quad \forall k \ge 1.$

The proof of Theorem 2 follows from that of Theorem 1 after modifying just one inequality. This considerably simplifies the analysis of the (modified) FW gap compared with vanilla FW with smooth step size [11]. The PD convergence clearly implies the convergence of both primal and dual errors; a result similar to Corollary 1 can be obtained, but we omit it for brevity. We further extend Theorem 2 in Appendix B.4 by showing that if a slightly more difficult subproblem can be solved, it is possible to ensure per-step descent of the PD error, i.e., $\mathcal{G}_{k+1} \le \mathcal{G}_k$.

Line search. When choosing $\delta_k = \frac{2}{k+2}$ and $\eta_k$ via line search, HFW can guarantee a Type II PD error of $\frac{2LD^2}{k+1}$; please refer to Appendix B.5 due to space limitation. For completeness, an iterative manner of updating $\mathcal{G}_k$ for use as a stopping criterion is also described in Appendix C.
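For completeness, the short calculation behind step sizes of the form (4) and (9) is sketched here; it is the standard use of the descent lemma, and the paper's own derivations are in Appendices A.1-A.2. With $x_{k+1} = x_k + \eta (v_{k+1} - x_k)$, Assumption 1 gives a quadratic upper bound in $\eta$ whose unconstrained minimizer, clipped to $[0,1]$, is exactly the smooth step size:

$$
f(x_{k+1}) \le f(x_k) + \eta \,\langle \nabla f(x_k), v_{k+1} - x_k \rangle + \frac{L\eta^2}{2}\, \|v_{k+1} - x_k\|^2,
\qquad
\eta^\star = \frac{\langle \nabla f(x_k), x_k - v_{k+1} \rangle}{L \, \|v_{k+1} - x_k\|^2}.
$$

Clipping $\eta^\star$ to $[0,1]$ keeps $x_{k+1}$ a convex combination of $x_k$ and $v_{k+1}$. In vanilla FW the numerator is the FW gap and hence nonnegative, so only the minimum with 1 is needed, as in (4); in HFW, $v_{k+1}$ is computed from $g_{k+1}$ rather than $\nabla f(x_k)$, so the numerator can be negative, which is why (9) also takes the maximum with 0. Plugging the clipped value back into the quadratic bound shows $f(x_{k+1}) \le f(x_k)$, i.e., the non-increasing objective noted above.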
3.4 Further considerations

There are more choices of $\delta_k$ and $\eta_k$ that lead to (primal) convergence. For example, one can choose $\delta_k \equiv \delta \in (0, 1)$ and $\eta_k = O(1/k)$ as an extension of [27] (we are, however, unable to derive even a primal error bound using the analysis framework of [27] for the step sizes listed in Theorem 1). A proof is provided in Appendix B.7 for completeness. The analysis framework of [27] has two shortcomings: i) convergence can only be established for the ℓ2-norm (recall that Assumption 1 does not pose any requirement on the norm); and, ii) the resulting primal error (hence PD error) can only be worse than that of vanilla FW, because their analysis treats $g_{k+1}$ as $\nabla f(x_k)$ with errors rather than as momentum, which makes it difficult to obtain a PD bound as tight as Theorem 1. Our analytical techniques avoid these limitations.

When choosing $\delta_k = \eta_k = \frac{1}{k+1}$, we recover Algorithm 3 in [1]. Notice that such a choice of $\delta_k$ makes $g_{k+1}$ a uniform average of all past gradients. A slower convergence rate, $f(x_k) - f(x^*) = O\big(\frac{LD^2 \ln k}{k}\big)$, was established in [1] through a sophisticated derivation using no-regret online learning. Through our simpler analytical framework, we can attain the same rate while providing more options for the step size.

Theorem 3. Let Assumptions 1-3 hold, and select $\delta_k = \frac{1}{k+1}$ with $\eta_k$ chosen via one of the following options: i) $\eta_k = \frac{1}{k+1}$; ii) as in (9); or iii) line search as in (26b). The generalized FW gap of Alg. 2 then converges with rate

  $\mathcal{G}_k = f(x_k) - \Phi_k(v_k) = O\Big(\frac{LD^2 \ln(k+1)}{k+1}\Big).$

The rate in Theorem 3 has a worse dependence on $k$ relative to Theorems 1 and 2, partially because too much weight is put on past gradients in $g_{k+1}$, suggesting that large momentum may not be helpful.

Heavy ball versus Nesterov's momentum. A simple rule to distinguish these two types of momentum is whether the gradient is calculated at the converging sequence $\{x_k\}$. Heavy ball momentum follows this rule, while Nesterov's momentum computes the gradient at extrapolation points that are not used in Alg. 2. It is unclear how the original Nesterov's momentum benefits the PD error, but the $\infty$-memory variant of Nesterov's momentum [20,23,24], which can be viewed as a combination of heavy ball and Nesterov's momentum, yields a Type II PD error. However, compared with HFW, additional memory must be allocated. In sum, these observations suggest that heavy ball momentum is essential for improving the PD performance of FW. Nesterov's momentum, on the other hand, does not influence the PD error when used alone; it does, however, give rise to faster (local) primal rates under additional assumptions [20,23].

3.5 A side result: Directional smooth step sizes

Common to both FW and HFW is that the globally estimated $L$ might be too pessimistic for a local update. In this subsection, a local Lipschitz constant is investigated to further improve the numerical efficiency of the smooth step size in (9). This easily computed local Lipschitz constant is another merit of (H)FW over projection-based approaches.

Definition 2. (Directional Lipschitz continuity.) For two points $x, y \in \mathcal{X}$, the directional Lipschitz constant $L(x, y)$ ensures $\|\nabla f(\hat{x}) - \nabla f(\hat{y})\| \le L(x, y) \|\hat{x} - \hat{y}\|$ for any $\hat{x} = (1-\alpha)x + \alpha y$, $\hat{y} = (1-\beta)x + \beta y$ with $\alpha \in [0, 1]$ and $\beta \in [0, 1]$.

In other words, directional Lipschitz continuity describes the local behavior of $f$ on the segment between the points $x$ and $y$. It is clear that $L(x, y) \le L$. Using the logistic loss for binary classification as an example, we have $L(x, y) \le \frac{1}{4N} \sum_{i=1}^{N} \frac{\langle a_i, x - y \rangle^2}{\|x - y\|_2^2}$, where $N$ is the number of data and $a_i$ is the feature of the $i$th datum. As a comparison, the global Lipschitz constant satisfies $L \le \frac{1}{4N} \sum_{i=1}^{N} \|a_i\|_2^2$. We show in Appendix E that, at least for a class of functions including the widely used logistic and quadratic losses, $L(x, y)$ has an analytical form. Simply replacing $L$ in (9) with $L(x_k, v_{k+1})$, i.e.,

  $\eta_k = \max\Big\{ 0, \min\Big\{ \frac{\langle \nabla f(x_k), x_k - v_{k+1} \rangle}{L(x_k, v_{k+1}) \|v_{k+1} - x_k\|^2}, \, 1 \Big\} \Big\}$   (10)

we obtain what we term the directionally smooth step size. Upon exploiting the collinearity of $x_k$, $x_{k+1}$ and $v_{k+1}$, a simple modification of Theorem 2 ensures PD convergence.

Corollary 2. Choosing $\delta_k = \frac{2}{k+2}$ and $\eta_k$ via (10), Alg. 2 ensures $\mathcal{G}_k = f(x_k) - \Phi_k(v_k) \le \frac{2LD^2}{k+1}$, $\forall k \ge 1$.

The directional Lipschitz constant can also replace the global one in other FW variants, such as [13,22], with the theory therein still holding. As we shall see in the numerical tests, directionally smooth step sizes outperform the vanilla smooth step size by an order of magnitude.
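As an illustration of Definition 2, the following is a sketch, with hypothetical names, of the directionally smooth step size (10) for the logistic loss, assuming the bound $L(x, y) \le \frac{1}{4N} \sum_i \langle a_i, x - y \rangle^2 / \|x - y\|_2^2$ quoted above.

```python
import numpy as np

def directional_lipschitz_logistic(A, x, v):
    """Bound on L(x, v) for the logistic loss (15):
    (1/(4N)) * sum_i <a_i, x - v>^2 / ||x - v||_2^2."""
    d = x - v
    dd = np.dot(d, d)
    if dd == 0:
        return 0.0
    return np.sum((A @ d) ** 2) / (4.0 * A.shape[0] * dd)

def directional_smooth_step(grad_x, x, v, A):
    """Step size (10): clip( <grad, x - v> / (L(x, v) ||v - x||^2), [0, 1] )."""
    Ldir = directional_lipschitz_logistic(A, x, v)
    denom = Ldir * np.dot(v - x, v - x)
    if denom == 0:
        return 0.0
    return float(np.clip(grad_x @ (x - v) / denom, 0.0, 1.0))
```

Since the linear subproblem already returns $v_{k+1}$, computing $L(x_k, v_{k+1})$ only adds one pass over the data (the inner products $\langle a_i, x_k - v_{k+1} \rangle$) per iteration.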
4 Restart further tightens the PD error

Up to this point it has been established that heavy ball momentum enables a unified analysis with tighter Type II PD bounds. In this section, we show that if the computational resources suffice for solving two FW subproblems per iteration, the PD error can be further improved by restarting whenever the standard FW gap becomes smaller than the generalized FW gap.

Restart is typically employed with Nesterov's momentum in projection-based methods [31] to cope with the sensitivity to parameter estimates, and to capture the local geometry of problem (1). It is, however, natural to integrate restart with heavy ball momentum in the FW regime. In addition, restart provides an answer to the following question: which is smaller, the generalized FW gap or the vanilla one? Previous works using the generalized FW gap have not addressed this question [20,23,28].

Algorithm 3 FW with heavy ball momentum and restart
1: Initialize: x^0_0 ∈ X, g^0_0 = ∇f(x^0_0), s ← 0, C_0 = 0, 𝒢^0_0 = G^0_0
2: while [not terminated] do
3:   k ← 0, g^s_0 = ∇f(x^s_0)
4:   while [𝒢^s_k ≤ G^s_k or k = 0] and [not terminated] do        ▷ check whether restart is needed
5:     δ^s_k = 2 / (k + 2 + C_s)
6:     g^s_{k+1} = (1 − δ^s_k) g^s_k + δ^s_k ∇f(x^s_k)
7:     v^s_{k+1} = arg min_{x ∈ X} ⟨g^s_{k+1}, x⟩
8:     x^s_{k+1} = (1 − η^s_k) x^s_k + η^s_k v^s_{k+1}
9:     v̂^s_{k+1} = arg min_{x ∈ X} ⟨∇f(x^s_{k+1}), x⟩
10:    𝒢^s_{k+1} = f(x^s_{k+1}) − Φ^s_{k+1}(v^s_{k+1})               ▷ generalized FW gap
11:    G^s_{k+1} = ⟨∇f(x^s_{k+1}), x^s_{k+1} − v̂^s_{k+1}⟩           ▷ vanilla FW gap
12:    k ← k + 1
13:  end while
14:  K_s ← k, x^{s+1}_0 = x^s_{K_s}, C_{s+1} = 2LD² / 𝒢^s_{K_s}, s ← s + 1
15: end while

FW with heavy ball momentum and restart is summarized in Alg. 3. For clarity of exposition, the notation ← is used when updating counters such as $k$ and $s$. Alg. 3 contains two loops. The inner loop is the same as Alg. 2, except for computing a standard FW gap (Line 11) in addition to the generalized one (Line 10). The variable $K_s$, denoting the number of iterations of inner loop $s$, is introduced for analysis purposes. Alg. 3 can be terminated immediately whenever $\min\{\mathcal{G}^s_k, G^s_k\} \le \epsilon$ for a desired $\epsilon > 0$. A restart happens when the standard FW gap becomes smaller than the generalized FW gap, and after a restart $g^s_{k+1}$ is reset. For Alg. 3, the linear functions used for the generalized FW gap are defined stage-wise:

  $\Phi^s_0(x) = f(x^s_0) + \langle \nabla f(x^s_0), x - x^s_0 \rangle$   (11a)
  $\Phi^s_{k+1}(x) = (1-\delta^s_k) \Phi^s_k(x) + \delta^s_k \big[ f(x^s_k) + \langle \nabla f(x^s_k), x - x^s_k \rangle \big], \quad \forall k \ge 0.$   (11b)

It can be verified that $v^s_{k+1}$ minimizes $\Phi^s_{k+1}(x)$ over $\mathcal{X}$ for any $k \ge 0$. In addition, we also have $f(x^s_0) - \Phi^s_0(v^s_0) = G^{s-1}_{K_{s-1}}$, where $v^s_0 = \arg\min_{x \in \mathcal{X}} \Phi^s_0(x)$. There are two tunable parameters, $\eta^s_k$ and $\delta^s_k$. The choice of $\delta^s_k$ is given directly in Line 5, where it is adaptively decided using a variable $C_s$ related to the generalized FW gap. Three choices are readily available for $\eta^s_k$: i) $\eta^s_k = \delta^s_k$; ii) the smooth step size

  $\eta^s_k = \max\Big\{ 0, \min\Big\{ \frac{\langle \nabla f(x^s_k), x^s_k - v^s_{k+1} \rangle}{L \|v^s_{k+1} - x^s_k\|^2}, \, 1 \Big\} \Big\};$   (12)

and, iii) line search

  $\eta^s_k = \arg\min_{\eta \in [0,1]} f\big( (1-\eta) x^s_k + \eta v^s_{k+1} \big).$   (13)

Note that the directionally smooth step size, i.e., replacing $L$ with $L(x^s_k, v^s_{k+1})$ in (12), is also valid for convergence; we omit it to reduce repetition. (A structural code sketch of Alg. 3 is given at the end of this section.) Next we show how restart improves the PD error.

Theorem 4. Choose $\eta^s_k$ in one of the three manners: i) $\eta^s_k = \delta^s_k$; ii) as in (12); or iii) as in (13). If there is no restart (i.e., $s = 0$ upon termination), then Alg. 3 guarantees that

  $\mathcal{G}^0_k = f(x^0_k) - \Phi^0_k(v^0_k) \le \frac{2LD^2}{k+1}, \quad \forall k \ge 1.$   (14a)

If restart happens, then in addition to (14a) we have

  $\mathcal{G}^s_k = f(x^s_k) - \Phi^s_k(v^s_k) < \frac{2LD^2}{k + C_s}, \quad \forall k \ge 1, \, s \ge 1, \ \text{with} \ C_s \ge 1 + \sum_{j=0}^{s-1} K_j.$   (14b)

Besides the convergence of both primal and dual errors of Alg. 3, Theorem 4 implies that when no restart happens, the generalized FW gap is smaller than the standard one, demonstrating that the former is more suitable as a stopping criterion. When restarted, Theorem 4 provides a strictly improved bound compared with Theorems 1, 2, and 6, since the denominator of the RHS in (14b) is no smaller than the total iteration number. An additional comparison with [28], where two subproblems per iteration are also required, once again confirms the power of heavy ball momentum to improve the constants in the PD error rate, especially with the aid of restart. The restart scheme (with slight modification) can also be employed in [23,24,28] to tighten their PD error.
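The following is a structural sketch of Alg. 3, ours rather than the authors' code; `f`, `grad_f`, `lmo` and the constants `L`, `D` are assumed given, $\Phi^s_k$ is maintained as in (11a)-(11b) via a gradient/offset pair, and step size option i) is used for brevity.

```python
import numpy as np

def hfw_restart(f, grad_f, lmo, x0, L, D, eps=1e-6, max_iter=1000):
    """Skeleton of Alg. 3: restart when the vanilla FW gap drops below the generalized one."""
    x = np.array(x0, dtype=float)
    C = 0.0                                    # C_0 = 0
    total = 0
    while total < max_iter:
        g = grad_f(x)                          # g^s_0 = grad f(x^s_0)
        c = f(x) - g @ x                       # offset of Phi^s_0, cf. (11a)
        gen_gap = std_gap = np.inf
        k = 0
        while (gen_gap <= std_gap or k == 0) and total < max_iter:   # Line 4
            delta = 2.0 / (k + 2 + C)          # Line 5
            gk = grad_f(x)
            g = (1 - delta) * g + delta * gk                  # Line 6
            c = (1 - delta) * c + delta * (f(x) - gk @ x)     # (11b), offset part
            v = lmo(g)                          # Line 7
            eta = delta                         # option i); (12) or (13) also possible
            x = (1 - eta) * x + eta * v         # Line 8
            gnew = grad_f(x)
            v_hat = lmo(gnew)                   # Line 9: second FW subproblem
            gen_gap = f(x) - (g @ v + c)        # Line 10: generalized FW gap
            std_gap = gnew @ (x - v_hat)        # Line 11: vanilla FW gap
            k += 1
            total += 1
            if min(gen_gap, std_gap) <= eps:
                return x
        C = 2 * L * D ** 2 / gen_gap            # Line 14: C_{s+1} = 2LD^2 / G^s_{K_s}
    return x
```

The doubled per-iteration cost comes entirely from Line 9; as discussed in Sec. 5, the restart variant is therefore mostly of theoretical interest.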
[Figure 1: Performance of FW variants for binary classification with the constraint being an ℓ2-norm ball (first row), an ℓ1-norm ball (second row), and an n-support norm ball (third row); columns correspond to the datasets (a) w7a, (b) realsim, (c) mushroom, and (d) ijcnn1. Each panel plots $f(x_k) - f(x^*)$ versus $k$ for FW, FW-s, WFW, WFW-s, UFW, and UFW-s.]

5 Numerical tests

This section presents numerical tests to showcase the effectiveness of HFW on different machine learning problems. Since there are two parameter choices for HFW, in Theorems 1 and 3, we term the resulting algorithms weighted FW (WFW) and uniform FW (UFW), respectively, depending on the weights of $\{\nabla f(x_k)\}$ in $g_{k+1}$. When using the smooth step size, the corresponding algorithms are marked as WFW-s and UFW-s. For comparison, the benchmark algorithms include: FW with $\eta_k = \frac{2}{k+2}$ (FW); and FW with the smooth step size (4) (FW-s).

5.1 Binary classification

We first test the performance of Alg. 2 on binary classification using logistic regression:

  $\min_{x \in \mathcal{X}} \frac{1}{N} \sum_{i=1}^{N} \ln\big( 1 + \exp(-b_i \langle a_i, x \rangle) \big).$   (15)

Here $(a_i, b_i)$ is the (feature, label) pair of datum $i$, and $N$ is the number of data. Datasets from LIBSVM (https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html) are used in the numerical tests; details of the datasets are deferred to Appendix F due to space limitation.

ℓ2-norm ball constraint. We start with $\mathcal{X} = \{x \,|\, \|x\|_2 \le R\}$. The primal errors are plotted in the first row of Fig. 1; we use the primal error here for a fair comparison. It can be seen that the parameter-free step sizes achieve better performance than the smooth step sizes, mainly because of the quality of the $L$ estimate. This issue can be relieved through directionally smooth step sizes, as we shall see shortly. Among parameter-free step sizes, WFW consistently outperforms both UFW and FW on all tested datasets, while UFW converges faster than FW only on the datasets realsim and mushroom. For smooth step sizes, the per-step-descent property is validated. The excellent performance of HFW can be partially explained by the similarity of its update, namely $x_{k+1} = (1-\eta_k) x_k + \eta_k \big( -R \, g_{k+1} / \|g_{k+1}\|_2 \big)$, with the normalized gradient descent (NGD) update, given by $x_{k+1} = \mathrm{Proj}_{\mathcal{X}}\big( x_k - \eta_k \, g_{k+1} / \|g_{k+1}\|_2 \big)$. However, there is also a subtle difference between the HFW and NGD updates. Indeed, when the projection is in effect, the NGD iterate $x_{k+1}$ lies on the boundary of the ℓ2-norm ball. Due to the convex-combination nature of the HFW update, $x_{k+1}$ is unlikely to land exactly on the boundary, though it can come arbitrarily close.
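To make this comparison concrete, a two-line sketch (ours) of the two updates over $\mathcal{X} = \{x : \|x\|_2 \le R\}$:

```python
import numpy as np

def hfw_step_l2(x, g, eta, R):
    """HFW update on the l2 ball: convex combination with the vertex -R g/||g||_2."""
    v = -R * g / np.linalg.norm(g)
    return (1 - eta) * x + eta * v

def ngd_step_l2(x, g, eta, R):
    """Projected normalized GD: step along -g/||g||_2, then project onto the l2 ball."""
    y = x - eta * g / np.linalg.norm(g)
    nrm = np.linalg.norm(y)
    return y if nrm <= R else R * y / nrm
```

Both move against the (momentum) gradient direction; the difference shows up exactly when the NGD step leaves the ball and is projected back onto its boundary, whereas the HFW iterate remains strictly inside whenever $x_k$ does and $\eta_k < 1$.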
ℓ1-norm ball constraint. Here $\mathcal{X} = \{x \,|\, \|x\|_1 \le R\}$ is the constraint set, which promotes sparse solutions. In the simulation, $R$ is tuned for a solution with sparsity similar to that of the dataset itself. The results are showcased in the second row of Fig. 1. For smooth step sizes, FW-s, UFW-s, and WFW-s exhibit similar performance, and their curves are smooth. On the other hand, parameter-free step sizes eventually outperform smooth step sizes, though their curves zig-zag. (The curves on realsim are smoothed to improve figure quality.) UFW performs similarly to FW on w7a and mushroom and converges faster on the other datasets. Once again, WFW consistently outperforms FW and UFW.

n-support norm ball constraint. The n-support norm ball is a tighter relaxation of a sparsity-enforcing ℓ0-norm ball combined with an ℓ2-norm penalty, compared with the Elastic Net [38]. It gives rise to $\mathcal{X} = \mathrm{conv}\{x \,|\, \|x\|_0 \le n, \|x\|_2 \le R\}$, where $\mathrm{conv}\{\cdot\}$ denotes the convex hull [3]. The closed-form solution for $v_{k+1}$ is given in [25]. In the simulation, we choose $n = 2$ and tune $R$ for a solution whose sparsity is similar to that of the adopted dataset. The results are showcased in the third row of Fig. 1. For smooth step sizes, FW-s and WFW-s exhibit similar performance, while UFW-s converges slightly slower on ijcnn1. Regarding parameter-free step sizes, UFW does not offer faster convergence than FW on the tested datasets, but WFW again has numerical merits.
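For reference, sketches (ours) of the linear subproblems used in these two sets of experiments; the n-support oracle below follows the standard top-n construction and should be checked against the closed form given in [25].

```python
import numpy as np

def lmo_l1_ball(g, R):
    """arg min_{||v||_1 <= R} <g, v>: a signed, scaled coordinate vector."""
    i = np.argmax(np.abs(g))
    v = np.zeros_like(g)
    v[i] = -R * np.sign(g[i])
    return v

def lmo_n_support_ball(g, R, n):
    """arg min of <g, v> over conv{x : ||x||_0 <= n, ||x||_2 <= R}:
    keep the n largest-magnitude entries of g and scale to radius R."""
    idx = np.argpartition(np.abs(g), -n)[-n:]
    v = np.zeros_like(g)
    sub = g[idx]
    nrm = np.linalg.norm(sub)
    if nrm > 0:
        v[idx] = -R * sub / nrm
    return v
```

Both oracles return extremely sparse vertices, which is what makes FW iterates on these constraint sets sparse by construction.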
Directionally smooth step sizes. The results in Fig. 2 validate the effectiveness of directionally smooth (-ds) step sizes.

[Figure 2: Performance of directionally smooth step sizes on an ℓ2-norm ball and an ℓ1-norm ball; panels (a) and (c) are tested on mushroom, and (b) and (d) use ijcnn1. Each panel plots $f(x_k) - f(x^*)$ versus $k$ for FW, FW-s, WFW-s, UFW-s, FW-ds, WFW-ds, and UFW-ds.]

For all datasets tested, the benefit of adopting $L(x_k, v_{k+1})$ is evident, as it improves the performance of smooth step sizes by an order of magnitude. In addition, it is also observed that UFW-ds performs worse than WFW-ds, which suggests that putting too much weight on past gradients can be less attractive in practice.

Additional comparisons. We also compare HFW with a generalized version of [27], where we set $\delta_k = \delta \in (0, 1)$, $\forall k$ in Alg. 2. Two specific choices, $\delta = 0.6$ and $\delta = 0.8$, are plotted in Fig. 3, where the ℓ2-norm ball and the n-support norm ball are adopted as constraints. In both cases, WFW converges faster than the algorithm adapted from [27]. In addition, the choice of $\delta$ has a major impact on convergence behavior, whereas WFW avoids this need for manual tuning of $\delta$. The performance of WFW with restart, i.e., Alg. 3, is also shown in Fig. 3. Although it slightly outperforms WFW, restart also doubles the computational burden due to the need to solve two FW subproblems per iteration. From this point of view, WFW with restart is more of theoretical than practical interest. In addition, it is observed that Alg. 3 is not restarted after the first few iterations, which suggests that the generalized FW gap is smaller than the vanilla one, at least in the early stage of convergence. Thus, the generalized FW gap is attractive as a stopping criterion when a solution with moderate accuracy is desired.

[Figure 3: Comparison of HFW with other algorithms on mushroom; (a) ℓ2-norm ball and (b) n-support norm ball. Curves: the algorithm adapted from [27] with δ = 0.6 and δ = 0.8, and WFW-restart.]

In a nutshell, the numerical experiments suggest that heavy ball momentum performs best with parameter-free step sizes whose momentum weight is carefully adjusted. WFW is mainly recommended because it achieves improved empirical performance compared with UFW and FW, regardless of the constraint set. The smooth step sizes, on the other hand, eliminate the zig-zag behavior at the price of a convergence slowdown due to the need for $L$, while directionally smooth step sizes can help alleviate this slowdown.

5.2 Matrix completion

This subsection focuses on matrix completion problems for recommender systems. Consider a matrix $A \in \mathbb{R}^{m \times n}$ with partially observed entries, i.e., entries $A_{ij}$ for $(i, j) \in \mathcal{K}$ are known, where $\mathcal{K} \subseteq \{1, \dots, m\} \times \{1, \dots, n\}$. Based on the observed entries, which can be contaminated by noise, the goal is to predict the missing entries. Within the scope of recommender systems, a commonly adopted empirical observation is that $A$ is low rank [4,5,8], leading to the following problem formulation:

  $\min_{X} \ \sum_{(i,j) \in \mathcal{K}} (X_{ij} - A_{ij})^2 \quad \text{s.t.} \ \|X\|_{\text{nuc}} \le R.$   (16)

Problem (16) is difficult to solve using GD, because projection onto a nuclear norm ball requires a full SVD, which has complexity $O\big(mn(m \wedge n)\big)$ with $(m \wedge n) := \min\{m, n\}$. In contrast, FW and its variants are more suitable for (16), since the FW subproblem has complexity less than $O(mn)$ [2].

[Figure 4: Performance of FW variants for matrix completion on MovieLens100K: (a) objective and (b) rank, versus $k$, for FW, FW-s, WFW, WFW-s, UFW, and UFW-s.]

Heavy ball based FW is tested on the MovieLens100K dataset (https://grouplens.org/datasets/movielens/100k/). Following the initialization of [11], the numerical results can be found in Fig. 4. Subfigures (a) and (b) depict the optimality error and the rank versus $k$ for $R = 3$. For parameter-free step sizes, WFW converges faster than FW while finding solutions with lower rank. The low-rank solution of UFW is partially because it does not converge sufficiently. For smooth step sizes, UFW-s finds a solution with a slightly larger objective value but much lower rank compared with WFW-s and FW-s. Overall, when a small optimality error is the priority, WFW is more attractive, while UFW-s is useful for finding low-rank solutions.
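For the nuclear-norm ball in (16), the FW subproblem reduces to a top singular pair of the gradient; a sketch (ours), using scipy's sparse SVD. In practice the gradient is stored as a sparse matrix supported on the observed entries, which is what keeps the per-iteration cost below a full SVD.

```python
import numpy as np
from scipy.sparse.linalg import svds

def lmo_nuclear_ball(G, R):
    """arg min_{||V||_nuc <= R} <G, V> = -R u1 v1^T, with (u1, v1) the top
    singular vectors of the gradient matrix G (a rank-one update direction)."""
    u, s, vt = svds(G, k=1)
    return -R * np.outer(u[:, 0], vt[0, :])

def grad_matrix_completion(X, A, mask):
    """Gradient of sum_{(i,j) in K} (X_ij - A_ij)^2: zero outside the observed mask."""
    return 2.0 * mask * (X - A)
```

This rank-one oracle is also why the FW iterate after $k$ steps has rank at most $k + 1$, which underlies the rank curves reported in Fig. 4(b).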
6 Conclusions and future directions

This work demonstrated the merits of heavy ball momentum for FW. Multiple choices of the step size ensure a tighter Type II primal-dual error bound that can be efficiently computed when adopted as a stopping criterion. An even tighter PD error bound can be achieved by relying jointly on heavy ball momentum and restart. A novel and general approach was developed to compute local Lipschitz constants in FW-type algorithms. Numerical tests in the paradigms of logistic regression and matrix completion demonstrated the effectiveness of heavy ball momentum in FW. Our future research agenda includes the performance evaluation of heavy ball momentum for various learning tasks; for example, HFW holds great potential when fairness is to be accounted for [35].

Acknowledgement

This work is supported by NSF grants 1901134, 2126052, and 2128593. The authors would also like to thank the anonymous reviewers for their feedback.

References

[1] J. D. Abernethy and J.-K. Wang, On Frank-Wolfe and equilibrium computation, in Proc. Advances in Neural Info. Process. Syst., 2017, pp. 6584-6593.
[2] Z. Allen-Zhu, E. Hazan, W. Hu, and Y. Li, Linear convergence of a Frank-Wolfe type algorithm over trace-norm balls, in Proc. Advances in Neural Info. Process. Syst., 2017, pp. 6191-6200.
[3] A. Argyriou, R. Foygel, and N. Srebro, Sparse prediction with the k-support norm, in Proc. Advances in Neural Info. Process. Syst., 2012, pp. 1457-1465.
[4] R. M. Bell and Y. Koren, Lessons from the Netflix prize challenge, SIGKDD Explorations, vol. 9, no. 2, pp. 75-79, 2007.
[5] J. Bennett, S. Lanning et al., The Netflix prize, in Proc. KDD Cup and Workshop, vol. 2007, New York, NY, USA, 2007, p. 35.
[6] K. L. Clarkson, Coresets, sparse greedy approximation, and the Frank-Wolfe algorithm, ACM Transactions on Algorithms (TALG), vol. 6, no. 4, p. 63, 2010.
[7] J. Diakonikolas and L. Orecchia, The approximate duality gap technique: A unified theory of first-order methods, SIAM Journal on Optimization, vol. 29, no. 1, pp. 660-689, 2019.
[8] M. Fazel, Matrix rank minimization with applications, 2002.
[9] M. Frank and P. Wolfe, An algorithm for quadratic programming, Naval Research Logistics Quarterly, vol. 3, no. 1-2, pp. 95-110, 1956.
[10] R. M. Freund and P. Grigas, New analysis and results for the Frank-Wolfe method, Mathematical Programming, vol. 155, no. 1-2, pp. 199-230, 2016.
[11] R. M. Freund, P. Grigas, and R. Mazumder, An extended Frank-Wolfe method with in-face directions, and its application to low-rank matrix completion, SIAM Journal on Optimization, vol. 27, no. 1, pp. 319-346, 2017.
[12] M. Fukushima, A modified Frank-Wolfe algorithm for solving the traffic assignment problem, Transportation Research Part B: Methodological, vol. 18, no. 2, pp. 169-177, 1984.
[13] D. Garber and E. Hazan, Faster rates for the Frank-Wolfe method over strongly-convex sets, in Proc. Intl. Conf. on Machine Learning, 2015.
[14] E. Ghadimi, H. R. Feyzmahdavian, and M. Johansson, Global convergence of the heavy-ball method for convex optimization, in Proc. of European Control Conference, 2015, pp. 310-315.
[15] Z. Harchaoui, A. Juditsky, and A. Nemirovski, Conditional gradient algorithms for norm-regularized smooth convex optimization, Mathematical Programming, vol. 152, no. 1-2, pp. 75-112, 2015.
[16] M. Jaggi, Revisiting Frank-Wolfe: Projection-free sparse convex optimization, in Proc. Intl. Conf. on Machine Learning, 2013, pp. 427-435.
[17] A. Joulin, K. Tang, and L. Fei-Fei, Efficient image and video co-localization with Frank-Wolfe algorithm, in Proc. European Conf. on Computer Vision, Springer, 2014, pp. 253-268.
[18] S. Lacoste-Julien and M. Jaggi, On the global linear convergence of Frank-Wolfe optimization variants, in Proc. Advances in Neural Info. Process. Syst., 2015, pp. 496-504.
[19] S. Lacoste-Julien, F. Lindsten, and F. Bach, Sequential kernel herding: Frank-Wolfe optimization for particle filtering, in Proc. Intl. Conf. on Artificial Intelligence and Statistics, 2015, pp. 544-552.
[20] G. Lan, The complexity of large-scale convex programming under a linear optimization oracle, arXiv preprint arXiv:1309.5550, 2013.
[21] G. Lan and Y. Zhou, Conditional gradient sliding for convex optimization, SIAM Journal on Optimization, vol. 26, no. 2, pp. 1379-1409, 2016.
[22] E. S. Levitin and B. T. Polyak, Constrained minimization methods, USSR Computational Mathematics and Mathematical Physics, vol. 6, no. 5, pp. 1-50, 1966.
[23] B. Li, M. Coutino, G. B. Giannakis, and G. Leus, A momentum-guided Frank-Wolfe algorithm, IEEE Trans. on Signal Processing, vol. 69, pp. 3597-3611, 2021.
[24] B. Li, L. Wang, G. B. Giannakis, and Z. Zhao, Enhancing Frank-Wolfe with an extra subproblem, in Proc. of 35th AAAI Conf. on Artificial Intelligence, 2021.
[25] B. Liu, X.-T. Yuan, S. Zhang, Q. Liu, and D. N. Metaxas, Efficient k-support-norm regularized minimization via fully corrective Frank-Wolfe method, in Proc. Intl. Joint Conf. on Artificial Intelligence, 2016, pp. 1760-1766.
[26] G. Luise, S. Salzo, M. Pontil, and C. Ciliberto, Sinkhorn barycenters with free support via Frank-Wolfe algorithm, in Proc. Advances in Neural Info. Process. Syst., 2019, pp. 9318-9329.
[27] A. Mokhtari, H. Hassani, and A. Karbasi, Stochastic conditional gradient methods: From convex minimization to submodular maximization, arXiv preprint arXiv:1804.09554, 2018.
[28] Y. Nesterov, Complexity bounds for primal-dual methods minimizing the model of objective function, Mathematical Programming, vol. 171, no. 1-2, pp. 311-330, 2018.
[29] Y. Nesterov, Introductory Lectures on Convex Optimization: A Basic Course, Springer Science & Business Media, 2004, vol. 87.
[30] T. Nguyen, X. Fu, and R. Wu, Memory-efficient convex optimization for self-dictionary separable nonnegative matrix factorization: A Frank-Wolfe approach, arXiv preprint arXiv:2109.11135, 2021.
[31] B. O'Donoghue and E. Candes, Adaptive restart for accelerated gradient schemes, Foundations of Computational Mathematics, vol. 15, no. 3, pp. 715-732, 2015.
[32] B. T. Polyak, Some methods of speeding up the convergence of iteration methods, USSR Computational Mathematics and Mathematical Physics, vol. 4, no. 5, pp. 1-17, 1964.
[33] A. Schwing, T. Hazan, M. Pollefeys, and R. Urtasun, Globally convergent parallel MAP LP relaxation solver using the Frank-Wolfe algorithm, in Proc. Intl. Conf. on Machine Learning, 2014, pp. 487-495.
[34] M. Ye, C. Gong, L. Nie, D. Zhou, A. Klivans, and Q. Liu, Good subnetworks provably exist: Pruning via greedy forward selection, in Proc. Intl. Conf. on Machine Learning, 2020.
[35] M. B. Zafar, I. Valera, M. Gomez-Rodriguez, and K. P. Gummadi, Fairness constraints: A flexible approach for fair classification, The Journal of Machine Learning Research, vol. 20, no. 1, pp. 2737-2778, 2019.
[36] L. Zhang, G. Wang, D. Romero, and G. B. Giannakis, Randomized block Frank-Wolfe for convergent large-scale learning, IEEE Transactions on Signal Processing, vol. 65, no. 24, pp. 6448-6461, 2017.
[37] M. Zhang, Z. Shen, A. Mokhtari, H. Hassani, and A. Karbasi, One sample stochastic Frank-Wolfe, in Proc. Intl. Conf. on Artificial Intelligence and Statistics, PMLR, 2020, pp. 4012-4023.
[38] H. Zou and T. Hastie, Regularization and variable selection via the elastic net, Journal of the Royal Statistical Society: Series B (Statistical Methodology), vol. 67, no. 2, pp. 301-320, 2005.