# Parameter-free Locally Accelerated Conditional Gradients

Alejandro Carderera*1, Jelena Diakonikolas*2, Cheuk Yin (Eric) Lin*2, Sebastian Pokutta*3

Projection-free conditional gradient (CG) methods are the algorithms of choice for constrained optimization setups in which projections are often computationally prohibitive but linear optimization over the constraint set remains computationally feasible. Unlike in projection-based methods, globally accelerated convergence rates are in general unattainable for CG. However, a very recent work on Locally accelerated CG (LaCG) has demonstrated that local acceleration for CG is possible for many settings of interest. The main downside of LaCG is that it requires knowledge of the smoothness and strong convexity parameters of the objective function. We remove this limitation by introducing a novel, Parameter-Free Locally accelerated CG (PF-LaCG) algorithm, for which we provide rigorous convergence guarantees. Our theoretical results are complemented by numerical experiments, which demonstrate local acceleration and showcase the practical improvements of PF-LaCG over non-accelerated algorithms, both in terms of iteration count and wall-clock time.

*Equal contribution. 1Georgia Institute of Technology, Atlanta, GA, USA. 2Department of Computer Sciences, University of Wisconsin-Madison, Madison, WI, USA. 3Zuse Institute Berlin and Technische Universität Berlin, Berlin, Germany. Correspondence to: Jelena Diakonikolas. Proceedings of the 38th International Conference on Machine Learning, PMLR 139, 2021. Copyright 2021 by the author(s).

1. Introduction

Conditional gradient (CG) (or Frank-Wolfe (FW)) methods (Frank & Wolfe, 1956; Levitin & Polyak, 1966) are a fundamental class of projection-free optimization methods, most frequently used to minimize smooth convex objective functions over constraint sets onto which projections are computationally prohibitive; see Combettes & Pokutta (2021) for an overview. These methods have received significant recent attention in the machine learning and optimization communities, due to the fact that they eschew projections and produce solutions with sparse representations (Jaggi, 2013; Garber, 2016; Hazan & Luo, 2016; Braun et al., 2017; 2019; Lei et al., 2019; Tsiligkaridis & Roberts, 2020; Combettes et al., 2020).

While CG methods have been applied to many different problem settings (see, e.g., Hazan & Luo (2016); Zhou et al. (2018); Pedregosa et al. (2020); Lei et al. (2019); Tsiligkaridis & Roberts (2020); Dvurechensky et al. (2020); Zhang et al. (2020); Négiar et al. (2020); Carderera & Pokutta (2020); Kerdreux et al. (2021)), in this paper we are interested in using CG-type methods to solve problems of the form

$$\min_{x \in X} f(x), \tag{P}$$

where $f$ is an $L$-smooth (gradient-Lipschitz) and $m$-strongly convex function and $X \subseteq \mathbb{R}^n$ is a polytope. We assume that we are given access to the objective function $f$ and to the feasible set $X$ via the following two oracles:

Oracle 1.1 (First-Order Oracle (FOO)). Given $x \in X$, the FOO returns $f(x)$ and $\nabla f(x)$.

Oracle 1.2 (Linear Minimization Oracle (LMO)). Given $c \in \mathbb{R}^n$, the LMO returns $\mathrm{argmin}_{x \in X} \langle c, x \rangle$.
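To make the oracle model concrete, the following minimal Python sketch implements both oracles for an illustrative instance of our own choosing (a quadratic objective over the probability simplex, not one of the paper's benchmarks); for the simplex, the LMO simply returns the vertex corresponding to the smallest coordinate of $c$.

```python
import numpy as np

# Illustrative instance (an assumption for this sketch): f(x) = 0.5 * ||Ax - b||^2
# over the probability simplex X = {x >= 0, sum(x) = 1}.
rng = np.random.default_rng(0)
n = 50
A = rng.standard_normal((n, n))
b = rng.standard_normal(n)

def foo(x):
    """First-Order Oracle (Oracle 1.1): returns f(x) and grad f(x)."""
    r = A @ x - b
    return 0.5 * r @ r, A.T @ r

def lmo(c):
    """Linear Minimization Oracle (Oracle 1.2) for the probability simplex:
    argmin_{x in X} <c, x> is the vertex e_i with i = argmin_i c_i."""
    v = np.zeros_like(c)
    v[np.argmin(c)] = 1.0
    return v

x = np.full(n, 1.0 / n)      # a feasible starting point
f_val, grad = foo(x)
v = lmo(grad)                # Frank-Wolfe vertex
fw_gap = grad @ (x - v)      # the classical Frank-Wolfe gap at x
```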
While being of extreme practical importance, these LMO-based methods in general do not achieve the globally optimal rates for smooth (and possibly strongly convex) minimization that are attained by projection-based methods. In particular, LMO-based algorithms cannot converge globally faster than $O(1/k)$ for the class of smooth strongly convex functions, where $k$ is the number of iterations (up to the dimension threshold $n$) (Lan, 2013; Jaggi, 2013). Moreover, the dependence of the convergence rate on the dimension is unavoidable in general. At the same time, it was shown by Diakonikolas et al. (2020) that optimal rates can be obtained asymptotically, referred to as locally optimal rates. That is, after a finite number of iterations (independent of the target accuracy $\epsilon$) with potentially sub-optimal convergence, optimal convergence rates can be achieved.

While Diakonikolas et al. (2020) resolve the question of acceleration for CG-type methods, their approach unfortunately depends on knowledge of $m$ and $L$. Although the latter can be removed with common line-search-based arguments, knowledge of a good estimate of the former is crucial for achieving acceleration. This makes the Locally-accelerated Conditional Gradient (LaCG) algorithm from Diakonikolas et al. (2020), despite being of theoretical interest, of limited use in practice. Not only is it usually hard to come by a good estimate of the parameter $m$, but working with an estimated lower bound does not take advantage of the potentially better local strong convexity behavior in the vicinity of an iterate.

To remedy these shortcomings, we devise a new Parameter-Free Locally-accelerated Conditional Gradient algorithm (PF-LaCG) that is based on a similar coupling between a variant of the Away-step Frank-Wolfe (AFW) method (Guélat & Marcotte, 1986; Lacoste-Julien & Jaggi, 2015) and an accelerated method, as used in LaCG. However, beyond this basic inspiration, not much can be reused from Diakonikolas et al. (2020): in order to achieve parameter-freeness, we need to devise a completely new algorithm employing a gradient-mapping technique that out-of-the-box is incompatible with the approach used in LaCG.

1.1. Contributions and Further Related Work

Our main contributions can be summarized as follows (see Section 3.1 for a detailed overview of the main technical ideas).

Near-optimal and parameter-free acceleration with inexact projections. To devise PF-LaCG, we introduce a parameter-free accelerated algorithm for smooth strongly convex optimization that utilizes inexact projections onto low-dimensional simplices. While near-optimal (i.e., optimal up to poly-log factors) parameter-free projection-based algorithms were known in the literature prior to our work (Nesterov, 2013; Ito & Fukuda, 2019), their reliance on exact projections, which are computationally infeasible in our setting, makes them unsuitable for our purposes.

Parameter-free Locally-accelerated Conditional Gradient (PF-LaCG) algorithm. We propose a novel, parameter-free and locally accelerated CG-type method. Up to poly-logarithmic factors, our algorithm PF-LaCG attains an optimal accelerated local rate of convergence. PF-LaCG leverages efficiently computable projections onto low-dimensional simplices, but is otherwise projection-free (i.e., it does not assume access to a projection oracle for $X$). Local acceleration is achieved by coupling the parameter-free accelerated method with inexact projections described in the previous paragraph and AFW with a fractional exit condition (Kerdreux et al., 2019).
This coupling idea is inspired by the coupling between µAGD+ (Cohen et al., 2018) (where AGD stands for Accelerated Gradient Descent) and AFW (Guélat & Marcotte, 1986) used in Diakonikolas et al. (2020); however, most of the similarities between our work and Diakonikolas et al. (2020) stop there, as there are major technical challenges that have to be overcome to attain the results in the parameter-free setting.

Computational experiments. We demonstrate the efficacy of PF-LaCG using numerical experiments, comparing the performance of the proposed algorithms to several relevant CG-type algorithms. The use of PF-LaCG brings demonstrably faster local convergence in primal gap with respect to both iteration count and wall-clock time.

1.2. Outline

In Section 2 we introduce the notation and preliminaries that are required for stating our main results. We then present our approach to parameter-free local acceleration in Section 3 and derive our main results. Finally, we demonstrate the practicality of our approach with computational experiments in Section 4, and conclude with a discussion in Section 5.

2. Notation and Preliminaries

We denote the unique minimizer of Problem (P) by $x^*$. Let $\|\cdot\|$ and $\langle \cdot, \cdot \rangle$ denote the Euclidean norm and the standard inner product, respectively. We denote the diameter of the polytope $X$ by $D = \max_{x,y \in X} \|x - y\|$, and its vertices by $\mathrm{vert}(X) \subseteq X$. Given a non-empty set $S \subseteq \mathbb{R}^n$, we denote its convex hull by $\mathrm{co}(S)$. For any $x \in X$ we denote by $F(x)$ the minimal face of $X$ that contains $x$. We call a subset of vertices $S \subseteq \mathrm{vert}(X)$ a support of $x \in X$ if $x$ can be expressed as a convex combination of the elements of $S$. A support $S$ of $x$ is a proper support of $x$ if the weights associated with the convex decomposition are positive. Let $B(x, r)$ denote the ball around $x$ with radius $r$ with respect to $\|\cdot\|$. We say that $x$ is $r$-deep in a convex set $C \subseteq \mathbb{R}^n$ if $B(x, r) \cap \mathrm{aff}(C) \subseteq C$, where $\mathrm{aff}(C)$ denotes the smallest affine space that contains $C$. The point $x$ is contained in the relative interior of $C$, $x \in \mathrm{rel.int}(C)$, if there exists an $r > 0$ such that $x$ is $r$-deep in $C$.

Measures of optimality. The two key measures of optimality that we use in this paper are the strong Wolfe gap and the gradient mapping. We define the former as:

Definition 2.1 (Strong Wolfe gap). Given $x \in X$, the strong Wolfe gap $w(x)$ of $f$ over $X$ is defined as $w(x) := \min_{S \in \mathcal{S}_x} w(x, S)$, where $\mathcal{S}_x$ denotes the set of all proper supports of $x$ and $w(x, S) := \max_{y \in S,\, z \in X} \langle \nabla f(x), y - z \rangle$.

Note that any polytope $X$ satisfies what is known as a $\delta(X)$-scaling inequality, where $\delta(X)$ is the pyramidal width of the polytope (see, e.g., Lacoste-Julien & Jaggi, 2015; Beck & Shtern, 2017; Peña & Rodríguez, 2019; Gutman & Peña, 2018) and where the inequality is defined as:

Definition 2.2 ($\delta$-scaling inequality). There exists $\delta(X) > 0$ such that for all $x \in X \setminus \{x^*\}$, Problem (P) satisfies $w(x) \ge \delta(X)\, \langle \nabla f(x), x - x^* \rangle / \|x - x^*\|$.
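Since $w(x, S) = \max_{y \in S}\langle \nabla f(x), y\rangle - \min_{z \in X}\langle \nabla f(x), z\rangle$, the quantity in Definition 2.1 can be evaluated with a single LMO call plus a maximum over the active set. A minimal sketch (reusing the illustrative `foo`/`lmo` interfaces from Section 1) is shown below.

```python
def strong_wolfe_gap(grad, active_vertices, lmo):
    """w(x, S) from Definition 2.1 for a fixed proper support S.

    grad: gradient of f at x; active_vertices: list of vertices spanning S;
    lmo: linear minimization oracle for the polytope X.
    """
    away_value = max(grad @ y for y in active_vertices)  # max_{y in S} <grad, y>
    fw_value = grad @ lmo(grad)                          # min_{z in X} <grad, z>
    return away_value - fw_value
```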
To implement a parameter-free variant of a projection-based accelerated algorithm, we rely on a second measure of optimality, the gradient mapping. Recall that we do not assume the availability of projections onto the polytope $X$; instead, our algorithm only relies on low-complexity projections onto simplices spanned by a small number of vertices of $X$ (see Diakonikolas et al. (2020) for a more detailed discussion). In the following, given a convex set $C$, we denote the projection of $x \in \mathbb{R}^n$ onto $C$ by $P_C(x)$.

Definition 2.3 (Gradient mapping). Given a convex set $C \subseteq \mathbb{R}^n$, a differentiable function $f : C \to \mathbb{R}$, and a scalar $\eta > 0$, the gradient mapping of $f$ w.r.t. $\eta, C$ is defined by
$$G_\eta(x) = \eta\Big(x - P_C\Big(x - \frac{1}{\eta}\nabla f(x)\Big)\Big).$$

The gradient mapping is a generalization of the gradient to constrained sets: when $C = \mathbb{R}^n$, $G_\eta(x) = \nabla f(x)$. The norm of the gradient mapping can also be used as a measure of optimality: the gradient mapping at a point $x$ is zero if and only if $x$ minimizes $f$ over $C$; more generally, a small gradient mapping norm for a smooth function implies a small optimality gap. See Beck (2017, Chapter 10) and Appendix B for more useful properties.

2.1. Assumptions

We make two key assumptions. The first one, the strict complementarity assumption (Assumption 2.4), is key to proving the convergence of the iterates to $F(x^*)$ and is a common assumption in the Frank-Wolfe literature (Guélat & Marcotte, 1986; Garber, 2020); it is related to the stability of the solution with respect to noise and rules out degenerate instances.

Assumption 2.4 (Strict complementarity). We have that $\langle \nabla f(x^*), x - x^* \rangle = 0$ if and only if $x \in F(x^*)$. Stated equivalently, there exists a $\tau > 0$ such that $\langle \nabla f(x^*), x - x^* \rangle \ge \tau$ for all $x \in \mathrm{vert}(X) \setminus F(x^*)$.

Lastly, to achieve local acceleration, similar to Diakonikolas et al. (2020), we require that the optimal solution $x^*$ is sufficiently deep in the relative interior of a face of $X$. We make the following assumption about the problem, which covers all cases of interest.

Assumption 2.5 (Location of $x^*$). The optimum satisfies $x^* \notin \mathrm{vert}(X)$; equivalently, there exists an $r > 0$ such that $x^*$ is $r$-deep in a face $F^*$ of $X$.

Note that the entire polytope $X$ is an $n$-dimensional face of itself, and thus Assumption 2.5 allows $x^* \in \mathrm{rel.int}(X)$. Note also that if $x^* \in \mathrm{vert}(X)$ and the strict complementarity assumption (Assumption 2.4) is satisfied, then the projection-free algorithm presented in later sections reaches $x^*$ in a finite number of iterations (see Garber (2020)), and so there is no need for acceleration. For computational feasibility, we assume that $r$ is bounded away from zero and much larger than the desired accuracy $\epsilon > 0$.

3. Parameter-Free Local Acceleration

This section provides our main result: a parameter-free locally-accelerated CG method (PF-LaCG). Before delving into the technical details, we first describe the core ideas driving our algorithm and its analysis. Due to space constraints, most of the proofs are omitted and are instead provided in the supplementary material.

3.1. Overview of Main Technical Ideas

A standard idea for achieving acceleration in smooth and strongly convex setups where $m$ is unknown is to use a restart-based strategy, which can be described as follows: run an accelerated method for smooth (non-strongly) convex minimization and restart it every time some measure of optimality is reduced by a constant factor. The measures of optimality used in these strategies are either $f(x) - f(x^*)$ (Roulet & d'Aspremont, 2020) or $\|G_\eta(x)\|$ (Ito & Fukuda, 2019; Nesterov, 2013). Neither of these two optimality measures can be applied directly to our setting, as neither can be evaluated: (i) it is rarely the case that $f(x^*)$ is known, which is needed for evaluating $f(x) - f(x^*)$, and we make no such assumption here; and (ii) the gradient mapping norm $\|G_\eta(x)\|$ is a valid optimality measure only for the entire feasible set and requires computing projections onto it, and we do not assume availability of a projection operator onto $X$.
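For intuition about the quantity $G_\eta(x)$ from Definition 2.3, which reappears throughout Section 3.3, the sketch below evaluates it on a toy set of our own choosing (a box, whose projection is a coordinate-wise clip); for the simplices used later in the paper, `project` would instead be a simplex projection.

```python
import numpy as np

def gradient_mapping(x, grad, eta, project):
    """G_eta(x) = eta * (x - P_C(x - grad/eta)) from Definition 2.3.
    With project = identity (C = R^n), this reduces to grad itself."""
    return eta * (x - project(x - grad / eta))

# Toy example: C = [0, 1]^n, so P_C is a coordinate-wise clip.
project_box = lambda y: np.clip(y, 0.0, 1.0)
x = np.array([0.2, 0.9, 0.5])
grad = np.array([1.0, -2.0, 0.0])          # stand-in for grad f(x)
G = gradient_mapping(x, grad, 10.0, project_box)
```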
Even though our algorithm utilizes a restart-based strategy, the restarts are not used for local acceleration, but for the coupling of a CG method and a projection-based accelerated method. This idea is similar to Diakonikolas et al. (2020); however, there are important technical differences. First, the restarts in Diakonikolas et al. (2020) are scheduled and parameter-based; as such, they cannot be utilized in our parameter-free setting. Our idea is instead to use $w(x, S)$ as a measure of optimality over the polytope, which is naturally observable in many CG algorithms (see Definition 2.1), where $S$, dubbed the active set of the CG algorithm, is a proper support of $x$. We perform restarts each time $w(x, S)$ is halved. The idea of using $w(x, S)$ as a measure of optimality comes from Kerdreux et al. (2019); however, in that paper $w(x, S)$ was not used to obtain a locally accelerated algorithm.

The aforementioned idea of coupling an active-set-based CG method and an accelerated projection-based method can be summarized as follows. Because the objective function is strongly convex, any convergent algorithm (under any global optimality measure) reduces the distance between its iterates and the optimal solution $x^*$. The role of the CG method is to ensure such convergence without requiring projections onto the feasible set $X$. When $x^*$ is contained in the interior of a face of $X$, it can be argued that after a finite burn-in phase whose length is independent of the target accuracy $\epsilon$, every active set of the utilized CG method contains $x^*$ in its convex hull. As it is possible to keep the active sets reasonably small (and their size can never be larger than the current iteration count), projections onto the convex hull of an active set can be performed efficiently, using low-complexity projections onto a probability simplex. Thus, after the burn-in phase, we could switch to a projection-based accelerated algorithm that uses the convex hull of the active set as its feasible set and attains an accelerated convergence rate from then onwards.

There are, of course, several technical challenges related to implementing such a coupling strategy. To begin with, there is no computationally feasible approach we are aware of that could allow detecting the end of the burn-in phase. Thus, any reasonable locally accelerated algorithm needs to work without this information, i.e., without knowing when $x^*$ is contained in the convex hull of the active set. In Diakonikolas et al. (2020), this is achieved by using a parameter-based accelerated algorithm that monotonically decreases the objective value, running this accelerated algorithm and a CG method (AFW or the Pairwise-step Frank-Wolfe (PFW)) in parallel, and, on restarts, updating the iterate of the accelerated method and the active set to the ones from the coupled CG method whenever the iterate of CG provides a lower function value. Thus, after the burn-in phase, if the feasible set of the accelerated method does not contain $x^*$ (or its close approximation), CG eventually constructs a point with a lower function value, after which the accelerated algorithm takes over, leading to local acceleration.

We can rely on neither the scheduled restarts nor the accelerated algorithm used in Diakonikolas et al. (2020), as both are parameter-based. Instead, our monotonic progress is w.r.t. $w(x, S)$ (i.e., upon restarts we pick the point and active set with the lower value of $w(x, S)$), and we rely on a parameter-free accelerated method.
As mentioned before, even using a parameter-free projection-based acceleration requires new results, as we need to rely on inexact projections (and inexact gradient mappings). Further, for our argument to work, it is required that after a burn-in phase the accelerated method contracts $w(x, S)$ at an accelerated rate. Although this may seem like a minor point, we note that it is not true in general, as $w(x, S)$ can be related to other notions of optimality only when the algorithm iterates are contained in $F(x^*)$ and the primal gap is sufficiently small. This is where the strict complementarity assumption (Assumption 2.4) comes into play, as it allows us to show that after a burn-in phase (independent of $\epsilon$), we can upper bound $w(x, S)$ using $f(x) - f(x^*)$ (Theorem 3.4).

3.2. Burn-in Phase

The variant of CG used in our work is the Away-step Frank-Wolfe (AFW) method (Guélat & Marcotte, 1986; Lacoste-Julien & Jaggi, 2015) with a stopping criterion based on halving the Frank-Wolfe gap (Kerdreux et al., 2019), shown in Algorithm 1. For completeness, the useful technical results from Kerdreux et al. (2019) utilized in our analysis are provided in Appendix A.

Algorithm 1 Away-Step Frank-Wolfe Algorithm: AFW($x_0$, $S_0$)
1: $k := 0$
2: while $w(x_k, S_k) > w(x_0, S_0)/2$ do
3: $v_k := \mathrm{argmin}_{u \in X} \langle \nabla f(x_k), u \rangle$, $\;d^{\mathrm{FW}}_k := v_k - x_k$
4: $s_k := \mathrm{argmax}_{u \in S_k} \langle \nabla f(x_k), u \rangle$, $\;d^{\mathrm{Away}}_k := x_k - s_k$
5: if $\langle \nabla f(x_k), d^{\mathrm{FW}}_k \rangle \le \langle \nabla f(x_k), d^{\mathrm{Away}}_k \rangle$ then
6: $d_k := d^{\mathrm{FW}}_k$ with $\lambda_{\max} := 1$
7: else
8: $d_k := d^{\mathrm{Away}}_k$ with $\lambda_{\max} := \alpha^{s_k}_k / (1 - \alpha^{s_k}_k)$
9: end if
10: $x_{k+1} := x_k + \lambda_k d_k$ with $\lambda_k \in [0, \lambda_{\max}]$ via line search
11: Update the active set $S_{k+1}$ and the weights $\{\alpha^v_{k+1}\}_{v \in S_{k+1}}$
12: $k := k + 1$
13: end while
14: return $(x_k, S_k, w(x_k, S_k))$
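A minimal Python sketch of one call of Algorithm 1 is given below, assuming the `foo`/`lmo` interfaces sketched earlier and a user-supplied `line_search(x, d, lam_max)`; the active set is stored as a dictionary from vertex bytes to convex weights, which is a simplification of our own rather than the paper's implementation.

```python
import numpy as np

def afw(x0, S0, foo, lmo, line_search, tol=1e-12, max_iter=10_000):
    """Away-step Frank-Wolfe (Algorithm 1): runs until w(x_k, S_k) <= w(x_0, S_0)/2.

    S0 maps a vertex key (v.tobytes()) to its convex weight; foo(x) -> (f(x), grad);
    lmo(c) -> vertex minimizing <c, .> over X; line_search(x, d, lam_max) -> step size.
    """
    x, S = x0.copy(), dict(S0)
    vec = lambda k: np.frombuffer(k, dtype=x0.dtype)

    def w_gap(g):  # strong Wolfe gap w(x, S) over the current active set
        return max(g @ vec(k) for k in S) - g @ lmo(g)

    w0 = w_gap(foo(x)[1])
    for _ in range(max_iter):
        _, g = foo(x)
        if w_gap(g) <= w0 / 2:
            break
        v = lmo(g)
        a_key = max(S, key=lambda k: g @ vec(k))           # away vertex s_k
        d_fw, d_away = v - x, x - vec(a_key)
        if g @ d_fw <= g @ d_away:                         # Frank-Wolfe step
            d, lam_max, fw = d_fw, 1.0, True
        else:                                              # away step
            alpha = S[a_key]
            d, lam_max, fw = d_away, alpha / (1.0 - alpha), False
        lam = line_search(x, d, lam_max)
        x = x + lam * d
        scale = (1.0 - lam) if fw else (1.0 + lam)         # rescale existing weights
        S = {k: w * scale for k, w in S.items()}
        if fw:
            S[v.tobytes()] = S.get(v.tobytes(), 0.0) + lam
        else:
            S[a_key] -= lam
        S = {k: w for k, w in S.items() if w > tol}        # drop vanished vertices
    g = foo(x)[1]
    return x, S, w_gap(g)
```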
We now briefly outline how, after a finite number of iterations, we can guarantee that $x_k \in F(x^*)$ and $x^* \in \mathrm{co}(S_k)$ for all $k \ge K_0$. In particular, using Assumption 2.4, if the primal gap is made sufficiently small and the iterate $x_k$ is not contained in $F(x^*)$, then the AFW algorithm will continuously drop those vertices in $S_k$ that are not in $F(x^*)$, until the iterates reach $F(x^*)$ (see Garber (2020, Theorem 5), or Theorem A.5 in Appendix A.1, included for completeness).

Theorem 3.1. If the strict complementarity assumption (Assumption 2.4) is satisfied and the primal gap satisfies
$$f(x_k) - f(x^*) < \frac{1}{2}\min\Big\{\big(\tau/(2D(\sqrt{L/m} + 1))\big)^2/L,\ \tau,\ LD^2\Big\},$$
then the following holds for the AFW algorithm (Algorithm 5):
1. If $x_k \notin F(x^*)$, AFW will perform an away step that drops a vertex $s_k \in \mathrm{vert}(X) \setminus F(x^*)$.
2. If $x_k \in F(x^*)$, AFW will either perform a Frank-Wolfe step with a vertex $v_k \in \mathrm{vert}(F(x^*))$ or an away step with a vertex $s_k \in \mathrm{vert}(F(x^*))$. Regardless of which step is chosen, the iterate will satisfy $w(x_k, S_k) \le LD\sqrt{2(f(x_k) - f(x^*))/m}$.

Assuming that $x_0 \in \mathrm{vert}(X)$ in the AFW algorithm, and using the primal gap convergence guarantee in Lacoste-Julien & Jaggi (2015, Theorem 1), we can bound the number of iterations until $f(x_k) - f(x^*)$ satisfies the requirement in Theorem 3.1. Using this bound, and the fact that the AFW algorithm can pick up at most one vertex per iteration, we can bound the number of iterations until $x_k \in F(x^*)$. Note that by the second claim in Theorem 3.1, once $x_k \in F(x^*)$ the iterates do not leave $F(x^*)$.

Furthermore, once the iterates are inside the optimal face, there are two options: if $\{x^*\} = F(x^*)$, then the AFW algorithm will exit once $x_k \in F(x^*)$, as $w(x_k, S_k) = 0$; otherwise, if $x^* \notin \mathrm{vert}(X)$ (the case of interest in our setting, by Assumption 2.5), then we need to prove that after a given number of iterations the active set satisfies $x^* \in \mathrm{co}(S_k)$. We prove the latter using Fact 3.2 (a variation of Diakonikolas et al. (2020, Fact B.3)).

Fact 3.2 (Critical strong Wolfe gap). There exists a $w_c > 0$ such that for any subset $S \subseteq \mathrm{vert}(F(x^*))$ and point $x \in F(x^*)$ with $x \in \mathrm{co}(S)$ and $w(x, S) \le w_c$, it follows that $x^* \in \mathrm{co}(S)$.

Remark 3.3. The critical strong Wolfe gap in Fact 3.2 is a crucial parameter in the coming proofs. However, like the strict complementarity parameter $\tau$ (Guélat & Marcotte, 1986; Garber, 2020) and the critical radius $r_c$ defined in Diakonikolas et al. (2020), the critical strong Wolfe gap can be arbitrarily small for some problems. Fortunately, as we show in the proofs to come, it only affects the length of the burn-in phase of the accelerated algorithm, and moreover this dependence is logarithmic. In Remark A.8 in the Appendix we show a simple example for which one can compute $w_c$ exactly, and we give a lower bound and an upper bound for $w_c$ for Problem (P).

With these tools at hand, we have the bound shown in Theorem 3.4 (see Appendix A for the proof).

Theorem 3.4. Assume that the AFW algorithm (Algorithm 5) is run starting with $x_0 \in \mathrm{vert}(X)$. If the strict complementarity assumption (Assumption 2.4) is satisfied and $x^* \notin \mathrm{vert}(X)$, then for $k \ge K_0$ with
$$K_0 = \frac{32LD^2}{m\,\delta(X)^2 \ln 2}\,\log\!\left(\frac{2\,w(x_0, S_0)}{\min\{(\tau/(2D(\sqrt{L/m}+1)))^2/L,\ \tau,\ LD^2,\ 2w_c\}}\right),$$
where $\delta(X)$ is the pyramidal width from Definition 2.2 and $w_c > 0$ is the critical strong Wolfe gap from Fact 3.2, we have that $x_k \in F(x^*)$ and $x^* \in \mathrm{co}(S_k)$. Moreover, $w(x_k, S_k) \le LD\sqrt{2(f(x_k) - f(x^*))/m}$.

3.3. Parameter-free Projection-based Acceleration

The main idea for obtaining parameter-free projection-based acceleration is to use the regularization trick of Nesterov (Nesterov, 2012) to obtain a near-optimal method for minimizing $\|G_\eta(x)\|$ for a smooth convex function. A near-optimal method for smooth strongly convex minimization is then obtained by restarting this method every time $\|G_\eta(x)\|$ is halved. The restarting approach is important here, as it removes the requirement of knowing the parameter $m$. However, there are a few technical challenges in implementing the near-optimal method for minimizing $\|G_\eta(x)\|$ without knowledge of the parameter $L$ or of the distance to $x^*$, which is needed for setting the value of the regularization parameter. Some of these challenges were addressed in Ito & Fukuda (2019); Nesterov (2013). However, as discussed before, the approaches from Ito & Fukuda (2019) and Nesterov (2013) are insufficient for our purposes, as they assume access to exact projections onto the feasible set (and thus, exact evaluations of the gradient mapping).

In the following, we first present a near-optimal method for minimizing $\|G_\eta(x)\|$ for a smooth convex function that does not require knowledge of $L$ and works with inexact projections. We then show how to couple this method with adaptive tuning of the regularization parameter and restarts to obtain an overall near-optimal and parameter-free method for smooth strongly convex minimization. This subsection can be read independently from the rest of the paper.

3.3.1. Small Gradient Mapping of Smooth Functions

Let $C \subseteq \mathbb{R}^n$ be a closed, convex, nonempty set and assume that $f : \mathbb{R}^n \to \mathbb{R}$ is $L$-smooth on $C$.
The rough idea of the regularization trick is the following: instead of working directly with $f$, use a method for smooth strongly convex minimization to minimize $f_\sigma(x) = f(x) + \frac{\sigma}{2}\|x - x_0\|^2$ for some sufficiently small $\sigma > 0$ (for accuracy $\epsilon > 0$, $\sigma = \Theta(\epsilon / \|x^*_C - x_0\|^2)$ suffices, where $x^*_C \in \mathrm{argmin}_{x \in C} f(x)$). As we select $\sigma$ ourselves, the method can assume knowledge of the strong convexity parameter $\sigma$ of $f_\sigma(x)$.

The method that we employ here is a variant of µAGD+ from Cohen et al. (2018), which is similar to Diakonikolas et al. (2020). However, unlike Cohen et al. (2018); Diakonikolas et al. (2020), this method is adapted to work with an unknown smoothness parameter and to provide convergence guarantees on $\|G_\eta(x)\|$. One iteration of the algorithm is provided in AGD-Iter (Algorithm 2). In the algorithm statement, we use $\epsilon$-argmin to indicate that the function that follows the argmin is minimized to additive error $\epsilon$. The main result is summarized in the following lemma, while the complete analysis is deferred to Appendix B.

Algorithm 2 AGD-Iter($y_{k-1}$, $v_{k-1}$, $z_{k-1}$, $A_{k-1}$, $\eta_k$, $\sigma$, $\epsilon_0$, $\eta_0$)
1: $\eta_k = \eta_k / 2$
2: repeat
3: $\eta_k = 2\eta_k$
4: $\theta_k = \sqrt{\frac{\sigma}{2(\eta_k+\sigma)}}$, $\;a_k = \frac{\theta_k}{1-\theta_k} A_{k-1}$
5: $\epsilon^\ell_k = \theta_k \epsilon_0 / 4$, $\;\epsilon^M_k = a_k \epsilon_0 / 4$
6: $x_k = \frac{1}{1+\theta_k} y_{k-1} + \frac{\theta_k}{1+\theta_k} v_{k-1}$
7: $z_k = z_{k-1} - a_k \nabla f_\sigma(x_k) + \sigma a_k x_k$
8: $v_k \in \epsilon^M_k$-$\mathrm{argmin}_{u \in C}\, M_k(u)$, where $M_k(u) = -\langle z_k, u \rangle + \frac{\sigma A_k + \eta_0}{2}\|u\|^2$
9: $\hat y_k = (1-\theta_k) y_{k-1} + \theta_k v_k$
10: $y_k \in \epsilon^\ell_k$-$\mathrm{argmin}_{u \in C}\, \ell_k(u)$, where $\ell_k(u) = \langle \nabla f_\sigma(\hat y_k), u - \hat y_k \rangle + \frac{\eta_k+\sigma}{2}\|u - \hat y_k\|^2$
11: until $f(\hat y_k) \le f(x_k) + \langle \nabla f(x_k), \hat y_k - x_k \rangle + \frac{\eta_k}{2}\|\hat y_k - x_k\|^2$ and $f(y_k) \le f(\hat y_k) + \langle \nabla f(\hat y_k), y_k - \hat y_k \rangle + \frac{\eta_k}{2}\|y_k - \hat y_k\|^2$
12: $A_k = A_{k-1} + a_k$, $\;G^\sigma_{\eta_k+\sigma}(\hat y_k) = (\eta_k+\sigma)(\hat y_k - y_k)$
13: return $\eta_k$, $A_k$, $z_k$, $v_k$, $\hat y_k$, $y_k$, $G^\sigma_{\eta_k+\sigma}(\hat y_k)$

Lemma 3.5. Let $C \subseteq \mathbb{R}^n$ be a closed convex set and let $f : \mathbb{R}^n \to \mathbb{R}$ be an $L$-smooth function on $C$. Let $x_0 \in C$ be an arbitrary initial point and, given $\sigma > 0$, define $f_\sigma(x) = f(x) + \frac{\sigma}{2}\|x - x_0\|^2$ and $x^*_\sigma = \mathrm{argmin}_{x \in C} f_\sigma(x)$. Let $z_0 = (\eta_0+\sigma)x_0 - \nabla f(x_0)$ and $y_0 = v_0 = \hat y_0 \in \epsilon^M_0$-$\mathrm{argmin}_{u \in C} M_0(u)$, where $\epsilon^M_0 > 0$, $M_0(u)$ is defined as in Algorithm 2, and the estimate $\eta_0$ is doubled until $f(y_0) \le f(x_0) + \langle \nabla f(x_0), y_0 - x_0 \rangle + \frac{\eta_0}{2}\|y_0 - x_0\|^2$, same as in Algorithm 2. Given $\eta_0 > 0$, a sequence $\{a_k\}_{k \ge 0}$ with $A_k = \sum_{i=0}^k a_i$ and $\theta_k = a_k/A_k$, and sequences of errors $\{\epsilon^M_k\}_{k \ge 0}$, $\{\epsilon^\ell_k\}_{k \ge 0}$, let the sequences of points $\{\hat y_k, y_k\}_{k \ge 0}$ evolve according to Algorithm 2 for $k \ge 1$. If $a_0 = A_0 = 1$ and $\theta_k = \frac{a_k}{A_k} = \sqrt{\frac{\sigma}{2(\eta_k+\sigma)}}$ for $k \ge 1$, then for all $k \ge 1$,
$$\frac{1}{\eta_k+\sigma}\,\big\|G^\sigma_{\eta_k+\sigma}(\hat y_k)\big\|^2 \le \frac{\eta_0+\sigma}{A_k}\,\|x^*_\sigma - x_0\|^2 + \frac{2}{A_k}\sum_{i=0}^{k}\big(2\epsilon^M_i + \epsilon^\ell_i\big),$$
where $G^\sigma_{\eta_k+\sigma}(\hat y_k)$ is the gradient mapping w.r.t. $f_\sigma$ at $\hat y_k$. In particular, if $a_0 = A_0 = 1$, $\theta_k = \frac{a_k}{A_k} = \sqrt{\frac{\sigma}{2(\eta_k+\sigma)}}$ for $k \ge 1$, $\epsilon^M_k = \frac{a_k \epsilon^2}{8}$, and $\epsilon^\ell_k \le \frac{a_k}{A_k}\frac{\epsilon^2}{8}$, then $\frac{1}{\eta_k+\sigma}\|G^\sigma_{\eta_k+\sigma}(\hat y_k)\|^2 \le \epsilon^2$ after at most $k = O\big(\sqrt{L/\sigma}\,\log\big(\frac{L\,\|x^*_\sigma - x_0\|}{\epsilon}\big)\big)$ iterations. Further, the total number of first-order queries to $f$ and oracle queries to inexact projections is at most $\bar k = k + 2\log\frac{2L}{\eta_0}$.

3.3.2. Adaptive Regularization and Restarts

We now provide the full details for the updates of the accelerated sequence. There are two main questions that need to be addressed: (1) how to adaptively adjust the regularization parameter $\sigma$, and (2) how to perform adaptive restarts based on inexact evaluations of the gradient mapping. For the former, note that, to obtain a near-optimal algorithm, $\sigma$ should not be allowed to decrease too much (in fact, it needs to satisfy $\sigma = \Omega(m)$; see the proof of Theorem 3.6). For the latter, we can rely on the strong convexity of the function $\ell_k(u) = \langle \nabla f_\sigma(\hat y_k), u - \hat y_k \rangle + \frac{\eta_k+\sigma}{2}\|u - \hat y_k\|^2$ to relate the exact and the inexact gradient mapping at $\hat y_k$.
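Two of the building blocks above can be sketched in isolation: forming the regularized objective $f_\sigma$, and doubling the smoothness estimate $\eta$ until the descent inequality holds (the check inside the repeat loop of Algorithm 2). Note that in Algorithm 2 the candidate points are recomputed after every doubling, so the standalone check below only illustrates the inequality being tested; the helper names are our own.

```python
def make_regularized(foo, x0, sigma):
    """Oracle for f_sigma(x) = f(x) + (sigma/2)||x - x0||^2, which is
    sigma-strongly convex by construction (Nesterov's regularization trick)."""
    def foo_sigma(x):
        f, g = foo(x)
        diff = x - x0
        return f + 0.5 * sigma * diff @ diff, g + sigma * diff
    return foo_sigma

def smoothness_check(foo, x, y, eta):
    """True if f(y) <= f(x) + <grad f(x), y - x> + (eta/2)||y - x||^2."""
    fx, gx = foo(x)
    fy, _ = foo(y)
    d = y - x
    return fy <= fx + gx @ d + 0.5 * eta * (d @ d)

def double_eta_until_valid(foo, x, y, eta, max_doublings=60):
    """Double eta until the smoothness inequality holds, mirroring Algorithm 2;
    no knowledge of L is required, and eta never grows beyond roughly 2L."""
    for _ in range(max_doublings):
        if smoothness_check(foo, x, y, eta):
            return eta
        eta *= 2.0
    return eta
```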
We state the main convergence bound of the ACC algorithm (Algorithm 3) here, while the detailed description of the algorithm and its analysis are deferred to Appendix B for space considerations. Note that the algorithm name stems from ACCeleration.

Algorithm 3 ACC($x_0$, $\eta_0$, $\sigma$)
1: $\sigma = 2\sigma$
2: repeat
3: $\sigma = \sigma/2$
4: Run a minimization procedure for $\ell_0(u) = \langle \nabla f(x_0), u - x_0 \rangle + \frac{\eta_0+\sigma}{2}\|u - x_0\|^2$. Halt when the current iterate $y_0$ of the procedure satisfies $\ell_0(y_0) - \min_{u \in C} \ell_0(u) \le \epsilon_0$, where $\epsilon_0 = \frac{\eta_0+\sigma}{32}\|y_0 - x_0\|^2 = \frac{1}{32(\eta_0+\sigma)}\|G_{\eta_0+\sigma}(x_0)\|^2$.
5: Set $\hat y_0 = v_0 = y_0$; $z_0 = (\eta_0+\sigma)x_0 - \nabla f(x_0)$
6: $a_0 = A_0 = 1$
7: repeat
8: $k = k + 1$
9: $\eta_k, A_k, z_k, v_k, \hat y_k, y_k, G^\sigma_{\eta_k+\sigma}(\hat y_k)$ = AGD-Iter($y_{k-1}$, $v_{k-1}$, $z_{k-1}$, $A_{k-1}$, $\eta_{k-1}$, $\sigma$, $\epsilon_0$, $\eta_0$)
10: until $\frac{1}{\eta_k+\sigma}\|G^\sigma_{\eta_k+\sigma}(\hat y_k)\|^2 \le \frac{9\epsilon_0}{4}$
11: until $\frac{\sigma^2}{\eta_k+\sigma}\|\hat y_k - x_0\|^2 \le \epsilon_0$
12: return $\hat y_k$, $\eta_k$, $\sigma$

Theorem 3.6. Let $C \subseteq \mathbb{R}^n$ be a closed convex set, and let $f : \mathbb{R}^n \to \mathbb{R}$ be a function that is $L$-smooth and $m$-strongly convex on $C$. Let $x_0 \in C$ be an arbitrary initial point, and let $\sigma_0, \eta_0$ be such that $m \le \sigma_0 \le \eta_0 \le L$. Let $x^{\mathrm{out}}_0 = x_0$, and for $k \ge 0$ consider the following updates: $x^{\mathrm{out}}_{k+1}, \eta_{k+1}, \sigma_{k+1} = \mathrm{ACC}(x^{\mathrm{out}}_k, \eta_k, \sigma_k)$, where ACC is provided in Algorithm 3. Let $G_{\eta+\sigma}$ denote the gradient mapping of $f$ on $C$ with parameter $\eta + \sigma$. Then, for any $\epsilon > 0$, $\|G_{\eta_k+\sigma_k}(x^{\mathrm{out}}_k)\| \le \epsilon$ for $k = O\big(\log\frac{L\,\|G_{\eta_0+\sigma_0}(x_0)\|}{m\,\epsilon}\big)$. The algorithm for computing $x^{\mathrm{out}}_k$ utilizes a total number of $O\big(\sqrt{L/m}\,\log\frac{L}{m}\,\log\frac{L\,\|G_{\eta_0+\sigma_0}(x_0)\|}{m\,\epsilon}\big)$ queries to the FOO for $f$ and to an inexact and efficiently computable projection oracle for $C$, without knowledge of any of the problem parameters.

Observe that, up to the $\log(L/m)$ factor, the bound in Theorem 3.6 is optimal for the class of smooth strongly convex functions under the local oracle model (which includes the FOO model as a special case).

3.4. Coupling of Methods and Local Acceleration

We now show how to couple the AFW algorithm (Algorithm 5 in Appendix A) with the algorithm described in the previous subsection to achieve local acceleration. The key observation is that after a burn-in phase whose length is independent of $\epsilon$, every active set that AFW constructs contains $x^* = \mathrm{argmin}_{x \in X} f(x)$. Thus, minimizing $f(x)$ over $X$ becomes equivalent to minimizing $f(x)$ over the convex hull of the active set of AFW. Of course, the algorithm needs to adapt to this setting without knowledge of when the burn-in phase has ended.

The pseudocode for the resulting algorithm, Parameter-Free Locally Accelerated Conditional Gradients (PF-LaCG), is provided in Algorithm 4. The coupling between AFW and ACC in PF-LaCG is illustrated on a simple example in Fig. 1. For simplicity, Algorithm 4 is stated with the accelerated sequence started with $x_0 \in \mathrm{vert}(X)$ and with the active set $S_0 = \{x_0\}$. As $S_0$ contains only one vertex, there is no need to run the accelerated algorithm until a later restart (triggered by the condition in Line 6) at which the active set contains more than one vertex. Additionally, for the bound on the number of oracle queries in Theorem 3.7 to hold, we need the iterations of ACC (Algorithm 6) and AFW (Algorithm 5) to be aligned, in the sense that one iteration of ACC (Algorithm 2) occurs in parallel to one iteration of AFW. This is only needed for the theoretical bound on oracle complexity. In practice, the two algorithms can be run in parallel without aligning the iterations, so that the overall execution time (modulo coordination at restarts) is never higher than for running AFW alone.
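Schematically, the restart logic of Algorithm 4 can be sketched as follows, with the two methods interleaved step by step rather than run on separate processes; `afw_step`, `acc_step`, and `w_gap` are placeholders for one AFW iteration over $X$, one accelerated iteration over $\mathrm{co}(S^{\mathrm{ACC}})$, and the strong Wolfe gap, respectively.

```python
def pf_lacg(x0, eps, afw_step, acc_step, w_gap):
    """Schematic coupling of AFW and ACC with restarts when w(x, S) is halved.

    afw_step(x, S) -> (x, S); acc_step(x, S) -> x; w_gap(x, S) -> w(x, S).
    """
    x_afw, S_afw = x0, [x0]            # AFW state (active set as a vertex list)
    x_acc, S_acc = x0, [x0]            # ACC state; ACC optimizes over co(S_acc)
    w0 = w_gap(x_afw, S_afw)
    x_out, S_out = x_afw, S_afw
    w_out, w_afw_prev, w_acc = w0, w0, w0

    while w_out > eps:
        x_afw, S_afw = afw_step(x_afw, S_afw)   # run "in parallel" in the real algorithm
        x_acc = acc_step(x_acc, S_acc)
        w_afw = w_gap(x_afw, S_afw)
        if w_afw <= w_afw_prev / 2:             # AFW halved its gap: restart point
            w_afw_prev, w_acc_prev = w_afw, w_acc
            w_acc = w_gap(x_acc, S_acc)
            if w_afw <= min(w_acc, w_acc_prev / 2):    # keep the AFW pair
                x_acc, S_acc = x_afw, list(S_afw)      # restart ACC from AFW's state
                x_out, S_out, w_out = x_afw, S_afw, w_afw
            else:                                      # keep the ACC pair
                if len(S_acc) <= len(S_afw):
                    x_afw, S_afw = x_acc, list(S_acc)  # optionally reset AFW as well
                x_out, S_out, w_out = x_acc, S_acc, w_acc
    return x_out, S_out
```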
We are now ready to state our main result. The complete proof is deferred to Appendix C.

Theorem 3.7. Let $X \subseteq \mathbb{R}^n$ be a closed convex polytope of diameter $D$, and let $f : \mathbb{R}^n \to \mathbb{R}$ be a function that is $L$-smooth and $m$-strongly convex on $X$. Let the $\delta$-scaling inequality (Definition 2.2) and the strict complementarity assumption (Assumption 2.4) be satisfied for $f, X$. Denote $x^* = \mathrm{argmin}_{x \in X} f(x)$. Given $\epsilon > 0$, let $x^{\mathrm{out}}$ be the output point of PF-LaCG (Algorithm 4), initialized at an arbitrary vertex $x_0$ of $X$. Then $w(x^{\mathrm{out}}, S^{\mathrm{out}}) \le \epsilon$ and PF-LaCG uses a total of at most
$$K = O\Big(\min\Big\{\frac{LD^2}{m\,\delta(X)^2}\,\log\frac{w(x_0, S_0)}{\epsilon},\ \ K_0 + K_1 + \sqrt{\tfrac{L}{m}}\,\log\tfrac{L}{m}\,\log\frac{w(x_0, S_0)}{\epsilon}\Big\}\Big)$$
queries to the FOO for $f$ and the LMO for $X$, where
$$K_0 = \frac{32LD^2}{m\,\delta(X)^2 \ln 2}\,\log\!\left(\frac{2\,w(x_0, S_0)}{\min\{(\tau/(2D(\sqrt{L/m}+1)))^2/L,\ \tau,\ LD^2,\ 2w_c\}}\right)$$
is the burn-in bound from Theorem 3.4 and $K_1 = \frac{128LD^2}{m\,\delta(X)^2}$.

Figure 1. An example of coupling between AFW and ACC in PF-LaCG on a tetrahedron as the feasible set, starting from an initial point $x_0$ whose support $S_0$ is the base of the tetrahedron. The two algorithms are run in parallel from $x_0$: AFW optimizes over the entire tetrahedron, allowing it to add and remove vertices, while ACC optimizes over the base of the tetrahedron only and cannot converge to the optimal point $x^*$, as $x^* \notin \mathrm{co}(S_0)$. After several iterations, once the restart criterion for AFW is triggered, PF-LaCG chooses the output point of AFW over that of ACC, as $w^{\mathrm{AFW}} \le \min\{w^{\mathrm{ACC}}, w^{\mathrm{ACC}}_{\mathrm{prev}}/2\}$, hence a PF-LaCG restart occurs at $x_R$. For ease of exposition we assume that the point output by AFW is contained in $F(x^*)$ after a single halving of $w(x, S)$, although in practice several restarts may be needed for AFW to reach $F(x^*)$. Since $x_R$ is on the optimal face $F(x^*)$, PF-LaCG has completed the burn-in phase. The two algorithms again run in parallel from $x_R$ after the restart. However, ACC converges to the optimum $x^*$ at an accelerated rate, much faster than AFW. Hence, local acceleration is achieved by PF-LaCG while remaining at least as fast as vanilla AFW.

Algorithm 4 PF-LaCG($x_0 \in \mathrm{vert}(X)$, $\epsilon > 0$)
1: $S^{\mathrm{out}} = S^{\mathrm{AFW}} = S^{\mathrm{ACC}} = \{x_0\}$
2: $x^{\mathrm{out}} = x^{\mathrm{AFW}} = x^{\mathrm{ACC}} = x_0$
3: $w^{\mathrm{out}} = w^{\mathrm{AFW}}_{\mathrm{prev}} = w(x_0, S_0)$
4: while $w^{\mathrm{out}} > \epsilon$ do
5: Run AFW and restarted ACC (Theorem 3.6) in parallel, and let $(x^{\mathrm{AFW}}, S^{\mathrm{AFW}})$ and $(x^{\mathrm{ACC}}, S^{\mathrm{ACC}})$ denote their respective most recent output points and active sets; note that ACC is run on $C = \mathrm{co}(S^{\mathrm{ACC}})$
6: if $w^{\mathrm{AFW}} = w(x^{\mathrm{AFW}}, S^{\mathrm{AFW}}) \le \frac{1}{2} w^{\mathrm{AFW}}_{\mathrm{prev}}$ then
7: $w^{\mathrm{AFW}}_{\mathrm{prev}} = w^{\mathrm{AFW}}$, $w^{\mathrm{ACC}}_{\mathrm{prev}} = w^{\mathrm{ACC}}$
8: Compute $w^{\mathrm{ACC}} = w(x^{\mathrm{ACC}}, S^{\mathrm{ACC}})$
9: if $w^{\mathrm{AFW}} \le \min\{w^{\mathrm{ACC}}, w^{\mathrm{ACC}}_{\mathrm{prev}}/2\}$ then
10: $x^{\mathrm{ACC}} = x^{\mathrm{AFW}}$, $S^{\mathrm{ACC}} = S^{\mathrm{AFW}}$
11: $x^{\mathrm{out}} = x^{\mathrm{AFW}}$, $S^{\mathrm{out}} = S^{\mathrm{AFW}}$
12: $w^{\mathrm{out}} = w^{\mathrm{AFW}}$
13: else
14: if $|S^{\mathrm{ACC}}| \le |S^{\mathrm{AFW}}|$ then
15: $x^{\mathrm{AFW}} = x^{\mathrm{ACC}}$, $S^{\mathrm{AFW}} = S^{\mathrm{ACC}}$
16: $w^{\mathrm{AFW}} = w^{\mathrm{ACC}}$
17: end if
18: $x^{\mathrm{out}} = x^{\mathrm{ACC}}$, $S^{\mathrm{out}} = S^{\mathrm{ACC}}$, $w^{\mathrm{out}} = w^{\mathrm{ACC}}$
19: end if
20: end if
21: end while
22: return $x^{\mathrm{out}}$

Strong Wolfe Gap as an Upper Bound for the Primal Gap. Note that while in Theorem 3.7 we show a convergence rate for $w(x, S)$, this translates to the same convergence rate in primal gap, as $w(x, S)$ is an upper bound on the strong Wolfe gap and hence an upper bound on the primal gap (Kerdreux et al., 2019). Furthermore, we remark that $w(x, S) = 0$ if and only if $x = x^*$ when $S \subseteq \mathrm{vert}(X)$ is a proper support of $x$ with respect to the polytope $X$. In the next section we illustrate our computational results with respect to the primal gap $f(x) - f(x^*)$, as this quantity is more common in the optimization literature.
4. Computational Experiments

We numerically demonstrate that PF-LaCG (Algorithm 4), implemented in Python 3, outperforms other parameter-free CG methods both in iteration count and in wall-clock time. Most notably, we compare against the AFW, PFW, Decomposition-Invariant CG (DICG) (in the case of the probability simplex, as this algorithm only applies to 0-1 polytopes (Garber & Meshi, 2016)), and Lazy AFW (AFW (Lazy)) (Braun et al., 2017) algorithms. As predicted by our theoretical results, the improvement is observed locally, once the iterates of the algorithm reach the optimal face. Further, we empirically observe on the considered examples that the active sets do not become too large, which is important for the accelerated algorithm to have an edge over standard CG updates in overall wall-clock time. Our code can be found at https://github.com/ericlincc/Parameter-free-LaCG.

Similar to Diakonikolas et al. (2020), we solve the minimization subproblems (projections onto the convex hull of the active sets) within ACC (Algorithm 6) using Nesterov's accelerated gradient descent (Nesterov, 2018) with $O(n \log n)$ projections onto the simplex as described in Duchi et al. (2008, Algorithm 1). Even though in our analysis we consider coupling the accelerated sequence with AFW, we note that PF-LaCG can be coupled with any CG variant that maintains an active set, such as standard AFW or PFW.

Since the execution of local acceleration within PF-LaCG is completely independent of the execution of the coupled CG variant, it is possible to run the locally accelerated algorithm in parallel on a separate core within one machine, or even on a secondary machine, allowing us to utilize more computational power with the goal of solving large-scale problems to high accuracy in less computing time. In our experiments, we implemented the former approach: we run each process using one CPU core, and we run the accelerated algorithm of PF-LaCG on a separate process. This approach guarantees that PF-LaCG is not slower than the coupled CG variant (such as AFW or PFW), barring negligible process creation overhead and inter-process communication, while achieving much faster convergence once we have reached the optimal face.

Probability Simplex. The unit probability simplex, although a toy example, can give us insight into the behaviour of PF-LaCG. We know that after a restart in which the AFW active set satisfies $x^* \in \mathrm{co}(S_k)$, we should expect to see accelerated convergence from the iterates computed by the ACC algorithm. Given the structure of the probability simplex, this is easy to check in the experiments once we have a high-accuracy solution to the minimization problem. The function being minimized in this example is $f(x) = x^T(M^T M + \alpha I_n)x/2 + b^T x$, where $M \in \mathbb{R}^{n \times n}$ and $b \in \mathbb{R}^n$ have entries sampled uniformly at random between 0 and 1 and $n = 10000$. The parameter $\alpha = 500$ is set so that the objective function satisfies $m \ge 500$. The resulting condition number is $L/m = 50000$, and the number of nonzero elements in $x^*$ is around 320. The AFW algorithm in PF-LaCG (AFW) satisfies $x^* \in \mathrm{co}(S_k)$ around iteration 400, and consequently we achieve the accelerated convergence rate from then onwards. The same can be said for PF-LaCG (PFW) around iteration 350.
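The projections onto $\mathrm{co}(S)$ used by ACC reduce, in the weight space of the active set, to Euclidean projections onto the probability simplex; a minimal sketch of the standard sort-based $O(n \log n)$ projection of the kind referenced above is:

```python
import numpy as np

def project_simplex(y):
    """Euclidean projection of y onto {x : x >= 0, sum(x) = 1} via the
    standard sort-and-threshold rule; the sort makes it O(n log n)."""
    u = np.sort(y)[::-1]                               # decreasing order
    css = np.cumsum(u)
    k = np.arange(1, len(y) + 1)
    rho = np.nonzero(u + (1.0 - css) / k > 0)[0][-1]   # last valid index
    theta = (1.0 - css[rho]) / (rho + 1.0)             # common shift
    return np.maximum(y + theta, 0.0)

# Example: an infeasible weight vector is mapped back onto the simplex.
w = project_simplex(np.array([0.8, 0.5, -0.2]))        # -> [0.65, 0.35, 0.0]
```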
Structured LASSO Regression. The LASSO is an extremely popular sparse regression analysis method in statistics and has become an important tool in ML. In many applications, we can impose additional structure on the solution being learnt through the addition of linear constraints; this is useful, for example, when learning sparse physics dynamics from data (see Carderera et al. (2021)), in which case we can impose additional linear equality constraints on the problem to reflect the symmetries present in the physical problem. This allows us to learn dynamics that are consistent with the underlying physics, and which potentially generalize better when predicting on unseen data. The objective function is the same as in the last section, except we now have $n = 1000$, $\alpha = 100$, and we choose the elements of $b$ uniformly from 0 to 100. This results in a condition number of $L/m = 250000$. The feasible region is the intersection of the $\ell_1$ unit ball and a series of equality constraints. To generate the additional equality constraints, we sample 125 pairs of distinct integers $(i, j)$ with $1 \le i, j \le n$ without replacement, and we set $x_i = x_j$ for each pair, adding 125 linear constraints.

Constrained Birkhoff Polytope. We also solve a matching problem with the same quadratic function as in the previous section with $\alpha = 1$, and where we have scaled the matrix $M^T M$ to have a maximum eigenvalue of 100000. This results in an objective function with a condition number of $L/m = 100000$. The matching problem is solved over the Birkhoff polytope with $n = 400$, where we have imposed additional linear constraints. We sample 80 integers $i$ with $1 \le i \le n$ without replacement, and we set $x_i = 0$ for the first 40 integers (to represent that certain matchings are not possible) and $x_i \le 0.5$ for the remaining 40 integers to represent a maximum fractional matching. As in the LaCG algorithm (Diakonikolas et al., 2020), our approach is compatible with the lazification technique for AFW (Braun et al., 2017). In this last example we couple our algorithm with the AFW (Lazy) algorithm, resulting in the PF-LaCG (Lazy) algorithm. Moreover, we also benchmark against the AFW (Lazy) algorithm for reference.

Figure 2. Numerical performance of PF-LaCG on the probability simplex (left column), the structured LASSO problem (center column), and the constrained Birkhoff polytope (right column): the top row depicts primal gap convergence in terms of iteration count, while the bottom row depicts primal gap convergence in terms of wall-clock time.

5. Discussion

We have introduced a novel projection-free PF-LaCG algorithm for minimizing smooth and strongly convex functions over polytopes. This algorithm is parameter-free and locally accelerated. In particular, we have shown that after a finite burn-in phase (independent of the target error $\epsilon$), PF-LaCG achieves a near-optimal accelerated convergence rate without knowledge of any of the problem parameters (such as the smoothness or strong convexity of the function). As mentioned in the introduction, global acceleration is generally not possible for CG-type methods. We have also demonstrated the improved locally accelerated convergence rate of PF-LaCG using numerical experiments.

Some interesting questions for future research remain. For example, it is an interesting and practically relevant question whether local acceleration is possible for CG methods that are not active-set-based, such as, e.g., DICG (Garber & Meshi, 2016), which would possibly lead to even faster algorithms.
Acknowledgments

This research was partially funded by NSF grant CCF-2007757, by the Office of the Vice Chancellor for Research and Graduate Education at the University of Wisconsin-Madison with funding from the Wisconsin Alumni Research Foundation, and by the Deutsche Forschungsgemeinschaft (DFG) through the DFG Cluster of Excellence MATH+ and the Research Campus MODAL funded by the German Federal Ministry of Education and Research (fund numbers 05M14ZAM, 05M20ZBM).

References

Bashiri, M. A. and Zhang, X. Decomposition-invariant conditional gradient for general polytopes with line search. In NIPS, pp. 2690-2700, 2017.

Beck, A. First-Order Methods in Optimization. MOS-SIAM, 2017.

Beck, A. and Shtern, S. Linearly convergent away-step conditional gradient for non-strongly convex functions. Mathematical Programming, 164(1-2):1-27, 2017.

Braun, G., Pokutta, S., and Zink, D. Lazifying conditional gradient algorithms. In Proc. ICML'17, 2017.

Braun, G., Pokutta, S., Tu, D., and Wright, S. Blended conditional gradients: The unconditioning of conditional gradients. In Proc. ICML'19, 2019.

Carderera, A. and Pokutta, S. Second-order conditional gradient sliding. arXiv preprint arXiv:2002.08907, 2020.

Carderera, A., Pokutta, S., Schütte, C., and Weiser, M. CINDy: Conditional gradient-based identification of nonlinear dynamics (noise-robust recovery). arXiv preprint arXiv:2101.02630, 2021.

Chen, S. S., Donoho, D. L., and Saunders, M. A. Atomic decomposition by basis pursuit. SIAM Journal on Scientific Computing, 20(1):33-61, 1998.

Cohen, M. B., Diakonikolas, J., and Orecchia, L. On acceleration with noise-corrupted gradients. In Proc. ICML'18, 2018.

Combettes, C. W. and Pokutta, S. Complexity of linear minimization and projection on some sets. arXiv preprint arXiv:2101.10040, 2021.

Combettes, C. W., Spiegel, C., and Pokutta, S. Projection-free adaptive gradients for large-scale optimization. arXiv preprint arXiv:2009.14114, 2020.

Condat, L. Fast projection onto the simplex and the ℓ1 ball. Mathematical Programming, 158(1):575-585, 2016.

Diakonikolas, J. and Orecchia, L. The approximate duality gap technique: A unified theory of first-order methods. SIAM Journal on Optimization, 29(1):660-689, 2019.

Diakonikolas, J., Carderera, A., and Pokutta, S. Locally accelerated conditional gradients. In Proc. AISTATS'20, 2020.

Duchi, J., Shalev-Shwartz, S., Singer, Y., and Chandra, T. Efficient projections onto the ℓ1-ball for learning in high dimensions. In Proc. NIPS'08, 2008.

Dvurechensky, P., Ostroukhov, P., Safin, K., Shtern, S., and Staudigl, M. Self-concordant analysis of Frank-Wolfe algorithms. In Proc. ICML'20, 2020.

Frank, M. and Wolfe, P. An algorithm for quadratic programming. Naval Research Logistics Quarterly, 3(1-2):95-110, 1956.

Garber, D. Faster projection-free convex optimization over the spectrahedron. In Proc. NIPS'16, 2016.

Garber, D. Revisiting Frank-Wolfe for polytopes: Strict complementarity and sparsity. In Proc. NeurIPS'20, 2020.

Garber, D. and Meshi, O. Linear-memory and decomposition-invariant linearly convergent conditional gradient algorithm for structured polytopes. In Proc. NIPS'16, 2016.

Guélat, J. and Marcotte, P. Some comments on Wolfe's "away step". Mathematical Programming, 35(1):110-119, 1986.

Gutman, D. H. and Peña, J. F. The condition of a function relative to a polytope. arXiv preprint arXiv:1802.00271, 2018.

Hazan, E. and Luo, H. Variance-reduced and projection-free stochastic optimization. In Proc. ICML'16, 2016.
Held, M., Wolfe, P., and Crowder, H. P. Validation of subgradient optimization. Mathematical Programming, 6(1):62-88, 1974.

Ito, M. and Fukuda, M. Nearly optimal first-order methods for convex optimization under gradient norm measure: An adaptive regularization approach. arXiv preprint arXiv:1912.12004, 2019.

Jaggi, M. Revisiting Frank-Wolfe: Projection-free sparse convex optimization. In Proc. ICML'13, 2013.

Kerdreux, T., d'Aspremont, A., and Pokutta, S. Restarting Frank-Wolfe: Faster rates under Hölderian error bounds. In Proc. AISTATS'19, 2019.

Kerdreux, T., d'Aspremont, A., and Pokutta, S. Projection-free optimization on uniformly convex sets. In Proc. AISTATS'21, 2021.

Lacoste-Julien, S. and Jaggi, M. On the global linear convergence of Frank-Wolfe optimization variants. In Proc. NIPS'15, 2015.

Lam, S. K., Pitrou, A., and Seibert, S. Numba: A LLVM-based Python JIT compiler. In Proceedings of the Second Workshop on the LLVM Compiler Infrastructure in HPC, pp. 1-6, 2015.

Lan, G. The complexity of large-scale convex programming under a linear optimization oracle. arXiv preprint arXiv:1309.5550, 2013.

Lei, Q., Zhuo, J., Caramanis, C., Dhillon, I. S., and Dimakis, A. G. Primal-dual block generalized Frank-Wolfe. In Proc. NeurIPS'19, 2019.

Levitin, E. S. and Polyak, B. T. Constrained minimization methods. USSR Computational Mathematics and Mathematical Physics, 6(5):1-50, 1966.

Négiar, G., Dresdner, G., Tsai, A., El Ghaoui, L., Locatello, F., Freund, R., and Pedregosa, F. Stochastic Frank-Wolfe for constrained finite-sum minimization. In Proc. ICML'20, 2020.

Nesterov, Y. How to make the gradients small. Optima: Mathematical Optimization Society Newsletter, (88):10-11, 2012.

Nesterov, Y. Gradient methods for minimizing composite functions. Mathematical Programming, 140(1):125-161, 2013.

Nesterov, Y. Lectures on Convex Optimization. Springer, 2018.

Peña, J. and Rodríguez, D. Polytope conditioning and linear convergence of the Frank-Wolfe algorithm. Mathematics of Operations Research, 44(1):1-18, 2019.

Pedregosa, F., Négiar, G., Askari, A., and Jaggi, M. Linearly convergent Frank-Wolfe with backtracking line-search. In Proc. AISTATS'20, 2020.

Roulet, V. and d'Aspremont, A. Sharpness, restart, and acceleration. SIAM Journal on Optimization, 30(1):262-289, 2020.

Tibshirani, R. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B (Methodological), 58(1):267-288, 1996.

Tsiligkaridis, T. and Roberts, J. On Frank-Wolfe optimization for adversarial robustness and interpretability. arXiv preprint arXiv:2012.12368, 2020.

Zhang, M., Shen, Z., Mokhtari, A., Hassani, H., and Karbasi, A. One sample stochastic Frank-Wolfe. In Proc. AISTATS'20, 2020.

Zhou, S., Gupta, S., and Udell, M. Limited memory Kelley's method converges for composite convex and submodular objectives. arXiv preprint arXiv:1807.07531, 2018.