A Conditional-Gradient-Based Augmented Lagrangian Framework

Alp Yurtsever¹  Olivier Fercoq²  Volkan Cevher¹

Abstract

This paper considers a generic convex minimization template with affine constraints over a compact domain, which covers key semidefinite programming applications. The existing conditional gradient methods either do not apply to our template or are too slow in practice. To this end, we propose a new conditional gradient method, based on a unified treatment of smoothing and augmented Lagrangian frameworks. The proposed method maintains favorable properties of the classical conditional gradient method, such as cheap linear minimization oracle calls and sparse representation of the decision variable. We prove an $O(1/\sqrt{k})$ convergence rate for our method in the objective residual and the feasibility gap. This rate is essentially the same as that of the state-of-the-art CG-type methods for our problem template, but the proposed method is arguably superior in practice compared to existing methods in various applications.

1. Introduction

In this paper we focus on the following constrained convex minimization template:

$$\underset{x}{\text{minimize}} \;\; f(x) + g(Bx) \quad \text{subject to} \quad x \in \mathcal{X} \;\;\&\;\; Ax \in \mathcal{K} \tag{P}$$

where $x$ is the decision variable that lives on the convex and compact optimization domain $\mathcal{X} \subseteq \mathbb{R}^n$ with diameter $D_{\mathcal{X}} := \max_{x_1, x_2 \in \mathcal{X}} \|x_1 - x_2\|$; $f : \mathcal{X} \to \mathbb{R}$ is a convex differentiable function with $L_f$-Lipschitz continuous gradient; $A : \mathcal{X} \to \mathbb{R}^p$ and $B : \mathcal{X} \to \mathbb{R}^q$ are known linear maps; $g : \mathbb{R}^q \to \mathbb{R}$ is a convex function which can be non-smooth, but which we assume to be $L_g$-Lipschitz continuous; and $\mathcal{K} \subseteq \mathbb{R}^p$ is a convex set.

¹LIONS, École Polytechnique Fédérale de Lausanne, Switzerland. ²LTCI, Télécom ParisTech, Université Paris-Saclay, France. Correspondence to: Alp Yurtsever.

Proceedings of the 36th International Conference on Machine Learning, Long Beach, California, PMLR 97, 2019. Copyright 2019 by the author(s).
This template has a large number of applications in machine learning, signal processing, and computer science: from unsupervised clustering (Peng and Wei, 2007) to generalized eigenvector problems (Boumal et al., 2018), and from maximum cut (Goemans and Williamson, 1995) to phase retrieval (Candès et al., 2013). We refer the reader to Section 5 of (Yurtsever et al., 2018) for a detailed discussion of the special instances and applications of (P).

The conditional gradient method (CGM, a.k.a. the Frank-Wolfe algorithm) is one of the most scalable methods in the literature for solving convex optimization problems over a structured domain (Yurtsever et al., 2017). The main computational efficiency of CGM comes from the so-called linear minimization oracle (lmo), the main building block of CGM:

$$\mathrm{lmo}_{\mathcal{X}}(v) = \arg\min_{x \in \mathcal{X}} \; \langle x, v \rangle.$$

The lmo is significantly less expensive than the projection. For instance, the lmo returns a rank-1 solution when $\mathcal{X}$ is a nuclear-norm ball, which can be efficiently approximated via Krylov subspace methods; recall that the projection oracle requires a full singular value decomposition instead. The classical CGM originates with Frank and Wolfe (1956), but its resurgence in machine learning follows Hazan and Kale (2012) and Jaggi (2013). Unfortunately, CGM has restrictive assumptions such as the smoothness of the objective function. Therefore, extending CG-type methods to broader templates is an active research area (see Section 4 for some recent advancements).

In this work, we introduce the conditional gradient augmented Lagrangian framework (CGAL) for solving (P). (P) is significantly broader in applications in comparison with the classical CGM template. We can consider the non-smooth term $g(Bx)$ as a regularizer to promote some known structure of the solution, or directly as a non-smooth loss function (e.g., least absolute deviations) for robust optimization. More importantly, (P) contains an affine inclusion constraint $Ax \in \mathcal{K}$.
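The rank-1 lmo over a nuclear-norm ball mentioned above can be sketched in a few lines of dense linear algebra. The function name and the plain power iteration below are our illustrative choices; at scale one would use a Krylov method such as Lanczos instead.

```python
import numpy as np

def lmo_nuclear_ball(V, radius, iters=200):
    """lmo over {X : ||X||_* <= radius}: the minimizer of <X, V> is
    -radius * u1 v1^T, where (u1, v1) is the top singular pair of V."""
    rng = np.random.default_rng(0)
    x = rng.standard_normal(V.shape[1])
    for _ in range(iters):              # power iteration on V^T V
        x = V.T @ (V @ x)
        x /= np.linalg.norm(x)
    v1 = x
    u1 = V @ v1
    u1 /= np.linalg.norm(u1)
    return -radius * np.outer(u1, v1)   # rank-1 atom

V = np.diag([3.0, 1.0])                 # top singular value is 3
S = lmo_nuclear_ball(V, radius=2.0)
# <S, V> = -radius * sigma_max(V) = -6
assert abs(np.sum(S * V) + 6.0) < 1e-8
assert np.linalg.matrix_rank(S) == 1
```

The returned atom is rank-1, which is what makes CG iterates admit a sparse (low-rank) representation.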
In particular, it covers the standard semidefinite programming (SDP) template (with bounded trace constraint). Affine constraints are key for the flexibility of the template, but they pose substantial computational difficulty for solving the problem in the primal domain. As a result, primal-dual methods are typically preferred for solving these problems at large scale. Among the primal-dual approaches, the augmented Lagrangian provides a powerful framework for deriving fast methods. However, the majority of primal-dual methods for solving (P) rely on projection and/or proximal oracles, which do not scale well in many applications (SDPs in particular) and impose a computational bottleneck. In contrast, CGAL provides the utmost scalability by exploiting cheap linear minimization oracles.

CGAL can be viewed as an extension of the recent work of (Yurtsever et al., 2018) from the quadratic penalty to an augmented Lagrangian formulation in the spirit of (Bertsekas, 1976), with a focus on improving its empirical performance. CGAL guarantees an $O(1/\sqrt{k})$ convergence rate both in the objective residual and the feasibility gap. The simplicity of our analysis enables us to identify adaptive bounds, using which we propose explicit and implementable dual step-size rules that retain the theoretical convergence rates while significantly enhancing the practical performance of the algorithm. Our numerical evidence demonstrates superior performance.

The rest of this paper is organized as follows. We review the notions of smoothing, quadratic penalty and augmented Lagrangian in Section 2. In Section 3 we introduce CGAL and the main convergence theorem. We provide a detailed discussion of the related works in Section 4. Finally, in Section 5 we present the empirical evidence supporting the advantages of our framework. Technical details are deferred to the supplementary material.

Notation.
We use lowercase letters for vectors (or matrices when considering a vector space of matrices), uppercase letters for linear maps, and calligraphic letters for sets. We denote the Euclidean inner product by $\langle \cdot, \cdot \rangle$ and the Euclidean norm by $\|\cdot\|$. We denote the adjoint of a linear map $A$ by $A^\top$. For a set $\mathcal{K}$, its indicator function $\iota_{\mathcal{K}} : \mathbb{R}^q \to \mathbb{R} \cup \{+\infty\}$ is defined as

$$\iota_{\mathcal{K}}(z) = \begin{cases} 0 & \text{if } z \in \mathcal{K}, \\ +\infty & \text{otherwise.} \end{cases}$$

2. Preliminaries

Our algorithmic design is based on a unified treatment of the smoothing, quadratic penalty and augmented Lagrangian frameworks. This section reviews these notions and explains their similarities.

2.1. Nesterov Smoothing

In his seminal work, Nesterov (2005a) introduces a technique for solving some structured non-smooth optimization problems with efficiency estimate $O(1/\epsilon)$, which is much better than the theoretical lower bound $O(1/\epsilon^2)$. This technique is known as Nesterov smoothing, and it is commonly used to design efficient primal-dual methods (e.g., (Nesterov, 2005b; Tran-Dinh et al., 2018)).

Nesterov smoothing exploits an important class of non-smooth functions $\psi(x)$ that can be written in the following max-form:

$$\psi(x) = \max_{u \in \mathcal{U}} \; \big\{ \langle Bx, u \rangle - \hat{\phi}(u) \big\},$$

for some convex and compact set $\mathcal{U} \subseteq \mathbb{R}^q$ and a convex function $\hat{\phi} : \mathcal{U} \to \mathbb{R}$. Let us consider a prox-function $\delta(u)$ of $\mathcal{U}$, i.e., a strongly convex continuous function on $\mathcal{U}$. Define the center point of this prox-function as

$$\bar{u} = \arg\min_{u \in \mathcal{U}} \; \delta(u).$$

Without loss of generality, we assume that the strong convexity parameter of $\delta$ is $1$ and that $\delta(\bar{u}) = 0$. The smooth approximation $\psi_\beta(x)$ with smoothness parameter $\beta > 0$ is defined as

$$\psi_\beta(x) = \max_{u \in \mathcal{U}} \; \big\{ \langle Bx, u \rangle - \hat{\phi}(u) - \beta \delta(u) \big\}.$$

Then, $\psi_\beta$ is well defined, differentiable, convex and smooth. Moreover, it uniformly approximates $\psi$, in the sense that it satisfies the following envelope property:

$$\psi_\beta(x) \leq \psi(x) \leq \psi_\beta(x) + \beta D_{\mathcal{U}} \qquad (\forall x \in \mathcal{X}),$$

where $D_{\mathcal{U}} = \max_{u \in \mathcal{U}} \delta(u)$. See Theorem 1 in (Nesterov, 2005a) for the proof and more details.
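A classical one-dimensional instance of this construction, included here as a sanity check (the choice $\psi(x) = |x|$, $\mathcal{U} = [-1,1]$, $\delta(u) = u^2/2$ is ours): smoothing the absolute value yields the Huber function, and the envelope property holds with $D_{\mathcal{U}} = 1/2$.

```python
import numpy as np

def huber(x, beta):
    """Nesterov smoothing of psi(x) = |x| = max_{|u|<=1} x*u with
    prox-function delta(u) = u^2/2 (so D_U = 1/2):
    psi_beta(x) = max_{|u|<=1} x*u - beta*u^2/2, the Huber function."""
    return np.where(np.abs(x) <= beta, x**2 / (2 * beta),
                    np.abs(x) - beta / 2)

beta, DU = 0.1, 0.5
xs = np.linspace(-2.0, 2.0, 401)
# envelope property: psi_beta <= psi <= psi_beta + beta * D_U
assert np.all(huber(xs, beta) <= np.abs(xs) + 1e-12)
assert np.all(np.abs(xs) <= huber(xs, beta) + beta * DU + 1e-12)
```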
For notational convenience, we restrict ourselves to $g(B\,\cdot\,)$, a Lipschitz continuous function coupled with a linear map. Note that we can write $g(B\,\cdot\,)$ in the max-form by choosing $\psi(x) = g(Bx)$ and $\hat{\phi}(u) = g^*(u)$. Here, $g^*$ denotes the Fenchel conjugate of $g$:

$$g^*(u) = \max_z \; \langle u, z \rangle - g(z).$$

Since $g$ is convex and lower semicontinuous, Fenchel duality holds, and we have $g(Bx) = g^{**}(Bx)$. Moreover, the Lipschitz continuity assumption on $g$ ensures the boundedness of the dual domain (see Lemma 5 in (Dünner et al., 2016) for a formal statement of this well-known result).

In this work, we specifically focus on the Euclidean prox-functions, $\delta(u) = \frac{1}{2}\|u - \bar{u}\|^2$. By the definition of $\psi_\beta$, $g_\beta$ takes the following form:

$$g_\beta(Bx) = \max_{u \in \mathbb{R}^q} \; \langle Bx, u \rangle - g^*(u) - \frac{\beta}{2}\|u - \bar{u}\|^2.$$

The argument of this maximization subproblem can be written as $\mathrm{prox}_{\beta^{-1} g^*}(\bar{u} + \beta^{-1} Bx)$, where

$$\mathrm{prox}_g(z) = \arg\min_u \; g(u) + \frac{1}{2}\|u - z\|^2.$$

Finally, we can compute the gradient of $g_\beta$ as

$$\nabla g_\beta(Bx) = B^\top \mathrm{prox}_{\beta^{-1} g^*}\big(\bar{u} + \beta^{-1} Bx\big) = B^\top \bar{u} + \beta^{-1} B^\top \big(Bx - \mathrm{prox}_{\beta g}(\beta \bar{u} + Bx)\big),$$

where the second equality follows from the well-known Moreau decomposition.

2.2. Quadratic Penalty

The quadratic penalty method is an effective proxy for handling the affine constraint $Ax \in \mathcal{K}$. It works by replacing the constraint with a penalty function which favors the feasibility of the iterates. We consider the squared Euclidean distance, $\frac{\lambda}{2}\mathrm{dist}^2(Ax, \mathcal{K})$, as the penalty function, where $\lambda > 0$ is called the penalty parameter. We update this parameter as we progress in the optimization procedure so as to converge to a solution of the original constrained problem.

Surprisingly, the quadratic penalty approach is structurally equivalent to a de facto instance of Nesterov smoothing. Let us start by writing the Fenchel conjugate of the indicator function $\iota_{\mathcal{K}}(\cdot)$,

$$\iota_{\mathcal{K}}^*(z) = \max_{v \in \mathcal{K}} \; \langle v, z \rangle.$$

Then, we can write the affine constraint in the max-form by choosing $\hat{\phi}(z) = \iota_{\mathcal{K}}^*(z)$ and using the following formula:

$$\iota_{\mathcal{K}}(Ax) = \max_z \min_{v \in \mathcal{K}} \; \langle Ax - v, z \rangle = \max_z \; \big\{ \langle Ax, z \rangle - \iota_{\mathcal{K}}^*(z) \big\}.$$
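The prox-based gradient computation above hinges on the Moreau decomposition. A quick numerical check for the concrete choice $g = \|\cdot\|_1$ (so that $\mathrm{prox}_{\beta g}$ is soft-thresholding and $g^*$ is the indicator of the unit $\ell_\infty$-ball, whose prox is clipping); the value of $\beta$ and the test points are ours:

```python
import numpy as np

beta = 0.5
z = np.array([-2.0, -0.2, 0.0, 0.3, 1.5])

# prox of beta*||.||_1 is soft-thresholding at level beta
prox_g = np.sign(z) * np.maximum(np.abs(z) - beta, 0.0)
# g* is the indicator of the unit infinity-ball; its prox (at any
# scaling) is the projection, i.e. elementwise clipping
prox_gstar = np.clip(z / beta, -1.0, 1.0)

# Moreau decomposition: z = prox_{beta g}(z) + beta * prox_{g*/beta}(z/beta)
assert np.allclose(z, prox_g + beta * prox_gstar)
```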
By choosing the standard Euclidean prox-function $\delta(z) = \frac{1}{2}\|z\|^2$, we get the following smooth approximation:

$$(\iota_{\mathcal{K}})_\beta(Ax) = \max_z \min_{v \in \mathcal{K}} \; \langle Ax - v, z \rangle - \frac{\beta}{2}\|z\|^2 = \min_{v \in \mathcal{K}} \max_z \; \langle Ax - v, z \rangle - \frac{\beta}{2}\|z\|^2 = \frac{1}{2\beta}\,\mathrm{dist}^2(Ax, \mathcal{K}),$$

where the inversion of min and max holds due to Sion's minimax theorem (Sion, 1958). In summary, we can obtain the quadratic penalty with parameter $\lambda = \beta^{-1}$ by applying the Nesterov smoothing procedure to the indicator of an affine constraint.

Note that the quadratic penalty does not serve as a uniform approximation, because the dual domain is unbounded and the envelope property does not hold. Consequently, the common analysis techniques for smoothing do not apply to quadratic penalty methods. Nevertheless, one can exploit this structural similarity to design algorithms that work universally for both cases: composite problems with smoothing-friendly non-smooth regularizers, and problems with affine constraints.

Quadratic penalty methods are simple algorithms with interpretable steps, but they have limited practical applicability due to their poor empirical performance. To this end, the next subsection reviews augmented Lagrangian methods as an alternative approach.

2.3. Augmented Lagrangian

Augmented Lagrangian (AL) methods replace the affine constraint with a continuous function that promotes feasibility, similar to the quadratic penalty approach. This function is parametrized by a penalty parameter $\lambda > 0$ (a.k.a. the augmented Lagrangian parameter) and a dual vector $y \in \mathbb{R}^p$ (Lagrange multiplier). In the min-form, we can write this function as

$$\min_{v \in \mathcal{K}} \; \langle y, Ax - v \rangle + \frac{\lambda}{2}\|Ax - v\|^2.$$

We can view the augmented Lagrangian as a shifted quadratic penalty, since

$$\arg\min_x \; f(x) + \min_{v \in \mathcal{K}} \; \langle y, Ax - v \rangle + \frac{\lambda}{2}\|Ax - v\|^2 = \arg\min_x \; f(x) + \min_{v \in \mathcal{K}} \; \frac{\lambda}{2}\Big\|Ax - v + \frac{1}{\lambda}y\Big\|^2 = \arg\min_x \; f(x) + \frac{\lambda}{2}\,\mathrm{dist}^2\Big(Ax + \frac{1}{\lambda}y, \,\mathcal{K}\Big).$$

Therefore, it is not surprising that we can relate the augmented Lagrangian function to Nesterov smoothing.
To draw this relation, we simply follow similar arguments as in the quadratic penalty case, but this time we use the shifted prox-function $\delta(z) = \frac{1}{2}\|z - y\|^2$:

$$(\iota_{\mathcal{K}})_\beta(Ax) = \max_z \min_{v \in \mathcal{K}} \; \langle Ax - v, z \rangle - \frac{\beta}{2}\|z - y\|^2 = \min_{v \in \mathcal{K}} \; \langle y, Ax - v \rangle + \frac{1}{2\beta}\|Ax - v\|^2.$$

In conclusion, the augmented Lagrangian formulation is structurally equivalent to a de facto instance of Nesterov smoothing, applied to the indicator of the constraint, with a shifted Euclidean prox-function. The center point of this prox-function corresponds to the dual variable, and the penalty parameter corresponds to the inverse of the smoothness parameter ($\lambda = \beta^{-1}$). Once again, this approach does not serve as a uniform approximation, and the common analysis for Nesterov smoothing does not apply to the augmented Lagrangian.

3. Algorithm

In this section, we design CGAL for the special case $g(Bx) = 0$ for ease of presentation. One can extend CGAL in a straightforward way to the general case, based on the discussion in Section 2 and the analysis techniques in this work and (Yurtsever et al., 2018).

Algorithm 1 CGAL (for $g(Bx) = 0$)
Input: $x_1 \in \mathcal{X}$, $y_1 \in \mathbb{R}^p$, $\lambda_0 > 0$
for $k = 1, 2, \ldots$ do
  $\eta_k = 2/(k+1)$ and $\lambda_k = \lambda_0 \sqrt{k+1}$
  $r_k = \mathrm{proj}_{\mathcal{K}}\big(Ax_k + (1/\lambda_k)\, y_k\big)$
  $v_k = \nabla f(x_k) + A^\top y_k + \lambda_k A^\top (Ax_k - r_k)$
  $s_k = \arg\min_{x \in \mathcal{X}} \langle v_k, x \rangle$
  $x_{k+1} = x_k + \eta_k (s_k - x_k)$
  $r_{k+1} = \mathrm{proj}_{\mathcal{K}}\big(Ax_{k+1} + (1/\lambda_{k+1})\, y_k\big)$
  choose $\sigma_{k+1}$ using (decr.) or (const.)
  $y_{k+1} = y_k + \sigma_{k+1}(Ax_{k+1} - r_{k+1})$
end for

3.1. Design of CGAL

Let us introduce the slack variable $r = Ax \in \mathcal{K}$ and define the augmented Lagrangian function as

$$\mathcal{L}_\lambda(x, y) = f(x) + \min_{r \in \mathcal{K}} \; \langle y, Ax - r \rangle + \frac{\lambda}{2}\|Ax - r\|^2,$$

where $y \in \mathbb{R}^p$ is the Lagrange multiplier and $\lambda > 0$ is the penalty parameter. Clearly, $\mathcal{L}_\lambda(x, y)$ is a smooth convex function with respect to $x$. One CGAL iteration is composed of three basic steps: a primal step (conditional gradient step on $x$), a penalty parameter update (increment of $\lambda$), and a dual step (proximal gradient step on $y$).

Primal step.
CGAL is characterized by the conditional gradient step with respect to $\mathcal{L}_\lambda(\cdot, y)$ on the primal variable. Define

$$r_k = \mathrm{proj}_{\mathcal{K}}\Big(Ax_k + \frac{1}{\lambda_k} y_k\Big).$$

Then, we can evaluate $\nabla_x \mathcal{L}_{\lambda_k}(x_k, y_k)$ as

$$\nabla_x \mathcal{L}_{\lambda_k}(x_k, y_k) = \nabla f(x_k) + A^\top y_k + \lambda_k A^\top (Ax_k - r_k).$$

Next, we query the linear minimization oracle

$$s_k = \arg\min_{x \in \mathcal{X}} \; \big\langle \nabla_x \mathcal{L}_{\lambda_k}(x_k, y_k), \, x \big\rangle,$$

and we form the next iterate $x_{k+1}$ by combining the current iterate $x_k$ and $s_k$ with the CG step-size $\eta_k$. We use the classical step-size $\eta_k = 2/(k+1)$ of CG-type methods, but the same guarantees hold for the design variants with line-search or fully corrective updates.

Penalty parameter update. Penalty methods typically require the penalty parameter to be increased at a certain rate for provable convergence. In contrast, augmented Lagrangian methods can be designed with a fixed penalty parameter, because the saddle point formulation already favors the constraints. Unlike other augmented Lagrangian CG-type methods, we adopt an increasing penalty sequence in CGAL by choosing $\lambda_k = \lambda_0 \sqrt{k+1}$ for some $\lambda_0 > 0$.

Dual step. Once $x_{k+1}$ is formed, we update the dual variable $y_k$ by a gradient ascent step with respect to $\mathcal{L}_\lambda(x, \cdot)$. At iteration $k$, we evaluate the dual update by

$$y_{k+1} = y_k + \sigma_{k+1} \nabla_y \mathcal{L}_{\lambda_{k+1}}(x_{k+1}, y_k).$$

To compute $\nabla_y \mathcal{L}_{\lambda_{k+1}}$, we first define

$$r_{k+1} = \mathrm{proj}_{\mathcal{K}}\Big(Ax_{k+1} + \frac{1}{\lambda_{k+1}} y_k\Big).$$

Then, we can use the following formulation:

$$\nabla_y \mathcal{L}_{\lambda_{k+1}}(x_{k+1}, y_k) = Ax_{k+1} - r_{k+1}.$$

The choice of the dual step-size is crucial for the convergence guarantees. We propose two alternative schemes, with a decreasing or a constant bound on the step-size.

Decreasing bound on step-size. This variant cancels the positive quadratic terms in the majorization bounds due to the dual updates with the negative quadratic terms that come from the penalty parameter updates. Consequently, we choose the largest $\sigma_{k+1} \geq 0$ which satisfies

$$\sigma_{k+1} \leq \frac{\lambda_0}{2\sqrt{k+1}} \quad \& \quad \|y_{k+1}\| \leq D_{Y_{k+1}}. \tag{decr.}$$

Here, $D_{Y_{k+1}}$ is a sequence of positive numbers to be chosen, which acts as a dual domain diameter and appears in the final bounds.
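Putting the primal, penalty, and dual steps together, the loop of Algorithm 1 can be sketched on a toy instance. The instance (a quadratic over the unit simplex with a single linear equality), the dual-norm cap, and the simplified (decr.)-style step-size rule below are our illustrative choices, not the paper's tuned setup.

```python
import numpy as np

# Toy instance of (P) with g = 0:
#   minimize 0.5*||x - t||^2 over the unit simplex, subject to Ax = b.
t = np.array([0.8, 0.1, 0.1])
A = np.array([[1.0, -1.0, 0.0]])        # enforces x[0] = x[1], K = {b}
b = np.zeros(1)
lam0, DY = 1.0, 2.0

x, y = np.ones(3) / 3, np.zeros(1)
for k in range(1, 50001):
    eta, lam = 2.0 / (k + 1), lam0 * np.sqrt(k + 1)
    r = b                               # proj_K is trivial for K = {b}
    v = (x - t) + A.T @ y + lam * A.T @ (A @ x - r)
    s = np.eye(3)[np.argmin(v)]         # lmo over the simplex is a vertex
    x = x + eta * (s - x)
    grad_y = A @ x - b                  # dual gradient at (x_{k+1}, y_k)
    sigma = lam0 / (2.0 * np.sqrt(k + 1))    # decreasing step-size bound
    if np.linalg.norm(y + sigma * grad_y) <= DY:  # enforce ||y|| <= DY
        y = y + sigma * grad_y

x_star = np.array([0.45, 0.45, 0.10])   # closed-form solution of the toy
assert np.linalg.norm(A @ x - b) < 0.1          # feasibility gap
assert abs(0.5 * np.sum((x - t)**2)
           - 0.5 * np.sum((x_star - t)**2)) < 0.1  # objective residual
```

The dual cap is handled here by simply skipping the update when it would leave the ball of radius DY, a crude stand-in for "the largest admissible σ".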
We will specify a reasonable positive constant $D_Y = D_{Y_{k+1}}$ in the sequel from the final convergence bounds, by matching the factors of the dominating terms.

Constant bound on step-size. We observed significant performance improvements by slightly relaxing the decreasing upper bound on the step-size. To this end, we design this second variant. We do not cancel out the additional quadratic terms, but restrict them to be smaller than the other dominating terms in the majorization bound. To this end, we choose the largest $\sigma_{k+1} \geq 0$ which satisfies ($D_{Y_{k+1}}$ is similar as in the (decr.) case)

$$\sigma_{k+1} \leq \lambda_0, \qquad \|y_{k+1}\| \leq D_{Y_{k+1}}, \qquad \sigma_{k+1}\|Ax_{k+1} - r_{k+1}\|^2 \leq \frac{1}{2}\eta_k^2 \big(L_f + \lambda_{k+1}\|A\|^2\big) D_{\mathcal{X}}^2. \tag{const.}$$

We underline that the computation of $\sigma_k$ does not require an iterative line-search procedure. Instead, it can be computed by simple vector operations both in the (decr.) and (const.) variants. As a result, the computational cost of finding $\sigma_k$ is negligible.

3.2. Theoretical Guarantees of CGAL

We present the convergence guarantees of CGAL in this section. But first, we define some basic notions to be used in the sequel and state our main assumptions.

Solution set. We denote a solution of (P) by $x^\star$, and the set of all solutions by $\mathcal{X}^\star$. Similarly, we denote a solution of the dual problem by $y^\star$, and the set of all solutions by $\mathcal{Y}^\star$. Throughout, we assume that the solution set is nonempty and that there exists a finite dual solution, i.e., $\min_{y^\star \in \mathcal{Y}^\star} \|y^\star\| < \infty$.

$\epsilon$-solution. Given an accuracy level $\epsilon > 0$, we call a point $x \in \mathcal{X}$ an $\epsilon$-solution of (P) if

$$f(x) - f^\star \leq \epsilon \quad \text{and} \quad \mathrm{dist}(Ax, \mathcal{K}) \leq \epsilon.$$

We call $f(x) - f^\star$ the objective residual and $\mathrm{dist}(Ax, \mathcal{K})$ the feasibility gap. Note that the convergence of the objective residual alone is not enough to approximate the solution, since the iterates are non-feasible and $f(x) - f^\star$ can take negative values.

Strong duality. We assume that strong duality holds.
This assumption is common for primal-dual methods, and Slater's condition is a widely used sufficient condition for strong duality:

$$\mathrm{relint}(\mathcal{X} \times \mathcal{K}) \cap \big\{(x, r) \in \mathrm{dom}(f) \times \mathbb{R}^p : Ax = r\big\} \neq \emptyset,$$

where relint denotes the relative interior.

Theorem 3.1. The sequence $x_k$ generated by CGAL with the dual step-size conditions (const.) satisfies

$$f(x_k) - f^\star \geq -\|y^\star\|\,\mathrm{dist}(Ax_k, \mathcal{K}),$$

$$f(x_k) - f^\star \leq \frac{1}{\sqrt{k}}\Big(4 D_{\mathcal{X}}^2 C_0 + \frac{D_{Y_k}^2}{2\lambda_0}\Big),$$

$$\mathrm{dist}(Ax_k, \mathcal{K}) \leq \frac{2/\lambda_0}{\sqrt{k}}\Bigg(2\|y^\star\| + \|y_k - y^\star\| + \sqrt{\lambda_0\Big(4 D_{\mathcal{X}}^2 C_0 + \frac{D_{Y_k}^2}{2\lambda_0}\Big)}\Bigg),$$

where $C_0 = L_f + \|A\|^2 \lambda_0$. We can also bound $\|y_k - y^\star\|$ using the triangle inequality. Considering these bounds, it is reasonable to choose $D_Y$ proportional to $D_{\mathcal{X}} \|A\| \lambda_0$.

The sequence $x_k$ generated by CGAL with the dual step-size conditions (decr.) satisfies similar guarantees as with (const.), with a factor of $1/2$ for all terms involving $D_{\mathcal{X}}^2$.

We omit the design variants of CGAL with line-search and fully corrective updates, which are covered by our theory. The same guarantees hold for these variants.

3.3. Extension for Composite Problems

One can extend CGAL in a straightforward way to composite problems based on the discussions in Section 2. For this, we simply need to define the sum of the two non-smooth terms:

$$G(Ax, Bx) = \iota_{\mathcal{K}}(Ax) + g(Bx).$$

Then, CGAL guarantees $O(1/\sqrt{k})$ rates in the feasibility gap $\mathrm{dist}(Ax, \mathcal{K})$ and in the objective residual $f(x) + g(Bx) - f(x^\star) - g(Bx^\star)$. See the supplements for more details. Below we describe the extension of (const.) for this setting:

$$\sigma_{k+1} \leq \lambda_0 \quad \text{and} \quad \gamma_{k+1} \leq \beta_0,$$
$$\|y_{k+1}\| \leq D_{Y_{k+1}} \quad \text{and} \quad \|z_{k+1}\| \leq D_{Z_{k+1}},$$
$$\sigma_{k+1}\|Ax_{k+1} - r_{k+1}\|^2 \leq \frac{1}{4}\eta_k^2 L_{k+1} D_{\mathcal{X}}^2,$$
$$\gamma_{k+1}\|Bx_{k+1} - t_{k+1}\|^2 \leq \frac{1}{4}\eta_k^2 L_{k+1} D_{\mathcal{X}}^2, \tag{const.2}$$

where $L_{k+1} = L_f + \lambda_{k+1}\|A\|^2 + \beta_{k+1}^{-1}\|B\|^2$. A reasonable choice is $D_{Z_{k+1}} = D_Z = L_g$. One can similarly extend (decr.) to this setting. Alternatively, we can keep $z_k = 0$ fixed and perform dual updates only on $y_k$.

Algorithm 2 CGAL for (P)
Input: $x_1 \in \mathcal{X}$, $y_1 \in \mathbb{R}^p$, $z_1 \in \mathbb{R}^q$, $\lambda_0 > 0$, $\beta_0 > 0$
for $k = 1, 2, \ldots$
, do
  $\eta_k = 2/(k+1)$, $\lambda_k = \lambda_0\sqrt{k+1}$, $\beta_k = \beta_0/\sqrt{k+1}$
  $r_k = \mathrm{proj}_{\mathcal{K}}\big(Ax_k + (1/\lambda_k)\, y_k\big)$
  $t_k = \mathrm{prox}_{\beta_k g}\big(Bx_k + \beta_k z_k\big)$
  $v_k = A^\top y_k + \lambda_k A^\top (Ax_k - r_k)$
  $w_k = B^\top z_k + (1/\beta_k)\, B^\top (Bx_k - t_k)$
  $s_k = \arg\min_{x \in \mathcal{X}} \langle \nabla f(x_k) + v_k + w_k, \, x \rangle$
  $x_{k+1} = x_k + \eta_k (s_k - x_k)$
  $r_{k+1} = \mathrm{proj}_{\mathcal{K}}\big(Ax_{k+1} + (1/\lambda_{k+1})\, y_k\big)$
  choose $\sigma_{k+1}$ using (decr.2) or (const.2)
  $y_{k+1} = y_k + \sigma_{k+1}(Ax_{k+1} - r_{k+1})$
  $t_{k+1} = \mathrm{prox}_{\beta_{k+1} g}\big(Bx_{k+1} + \beta_{k+1} z_k\big)$
  choose $\gamma_{k+1}$ using (decr.2) or (const.2)
  $z_{k+1} = z_k + \gamma_{k+1}(Bx_{k+1} - t_{k+1})$
end for

4. Related Work

The majority of convex methods for solving (P) are based on computationally challenging oracles, e.g., some second-order oracle (for interior point methods), the projection onto $\mathcal{X}$ (for operator splitting methods), or a constrained proximal oracle (for the majority of the classical primal-dual methods). For these methods, we refer to (Wright, 1997), (Komodakis and Pesquet, 2015), (Ryu and Boyd, 2016) and the references therein. In the rest of this section, we focus on the lmo-based algorithms for solving (P) or some special instances of it.

Lan (2014) introduces a conditional gradient method for non-smooth minimization over a convex compact domain. His method is based on Nesterov smoothing, and it is, to the best of our knowledge, the first attempt to combine Nesterov smoothing and the conditional gradient approach. However, this method does not apply in the presence of affine constraints, since it relies on the boundedness of the dual domain and the uniform approximation property.

Yurtsever et al. (2015) present the universal primal-dual method (UPD), a primal-dual subgradient approach for solving convex minimization problems with affine constraints. The main template of UPD is fairly different from (P); it does not have the non-smooth term $g(Bx)$ or the smoothness assumption on $f$, but it assumes Hölder smoothness in the dual space instead. The method does not directly work with lmo's, but it leverages the so-called sharp operators.
For the standard form SDP formulation, however, the sharp operator is an instance of the lmo. UPD adopts the inexact line-search strategy introduced by Nesterov (2015). This strategy requires the target accuracy $\epsilon$ as an input parameter, and UPD is guaranteed to converge only up to $\epsilon$ accuracy, i.e., UPD guarantees $f(x) - f^\star \leq O(1/\sqrt{k}) + \epsilon$. The practical performance of UPD heavily depends on this parameter: choosing $\epsilon$ too small leads to very small step-sizes and hence slow convergence. The best values for $\epsilon$ are typically around $1/10$th to $1/100$th of $|f^\star|$, but UPD is difficult to tune unless the optimal value is roughly known.

Lan and Zhou (2016) propose the conditional gradient sliding method (CGS). This method is based on an inexact version of the accelerated gradient method of Nesterov (1987), where the projection subproblem is approximately solved by using the classical CGM. CGS is originally proposed for smooth minimization over a convex and compact domain, but the results are generalized to smoothing-friendly non-smooth functions in Section 4 of that paper, following the same approach as Lan (2014). Note that this generalization directly follows the standard approach of Nesterov smoothing, and it does not apply to affine constraints.

Yen et al. (2016b) propose the greedy direction method of multipliers (GDMM), a CGM variant for minimizing a linear objective function over an intersection of polytopes. GDMM relies on a consensus reformulation over the Cartesian product of these polytopes, where the consistency constraint is incorporated by the augmented Lagrangian. This method is further explored for the structural support vector machine (Yen et al., 2016a) and maximum a-posteriori inference (Huang et al., 2017) problems. Nevertheless, Gidel et al. (2018) point out some technical issues in the analysis of this approach; see Section B.1 in (Gidel et al., 2018).

Gidel et al. (2018) propose an augmented Lagrangian framework for the convex splitting problem (FW-AL).
Similar to CGAL, this method is characterized by one CGM step on $\mathcal{L}_\lambda(\cdot, y_k)$ followed by one dual gradient ascent step on $\mathcal{L}_\lambda(x_{k+1}, \cdot)$. In contrast to CGAL, the penalty parameter $\lambda$ of FW-AL is kept fixed. Originally, FW-AL is proposed for $Ax = 0$ types of constraints (i.e., splitting), but it can be applied to the $Ax = b$ case by using a simple product space technique. The analysis of FW-AL relies on error bounds (see Theorem 1 in (Gidel et al., 2018) for the conditions, and (Bolte et al., 2017) for more details about error bounds). Their dual step-size is $\sigma_{k+1} = 2\sigma_0/(k+2)$, where $\sigma_0$ must be chosen below a threshold that depends on the error bound constant $\alpha$. Hence, $\sigma_0$ is a tuning parameter, and the method has guaranteed convergence only if it is chosen small enough. However, $\alpha$ is typically unknown, and it can also be arbitrarily small.

Liu et al. (2018) introduce an inexact augmented Lagrangian method (IAL), where the Lagrangian subproblems are approximately solved by CGM up to a prescribed accuracy $\varepsilon_k = \varepsilon_0/k$ for some $\varepsilon_0 > 0$ to be tuned. This results in a double-loop algorithm, where each outer iteration runs multiple CGM iterations until the following condition is satisfied:

$$\max_{x \in \mathcal{X}} \; \big\langle \nabla f(x_{k+1}) + A^\top y_k + \lambda A^\top (Ax_{k+1} - b), \; x_{k+1} - x \big\rangle \leq \varepsilon_k.$$

Then, the algorithm takes a dual gradient ascent step. IAL provably generates an $\epsilon$-solution after $O(1/\epsilon^2)$ outer iterations, by choosing the penalty parameter $\lambda$ appropriately (proportional to $1/\sqrt{\epsilon}$). This method, however, requires multiple queries of the lmo at each outer iteration. Since the number of lmo calls per outer iteration scales as $6 L_f D_{\mathcal{X}}^2 / \varepsilon_k$ (see Theorem 2.2 in (Liu et al., 2018)), the overall lmo complexity of this method is $O(1/\epsilon^4)$. Note that this is much worse than the $O(1/\epsilon^2)$ calls required by our method.

Yurtsever et al. (2018) present a CG-type method (HCGM) for (P). This method relies on the quadratic penalty approach to handle affine constraints. HCGM guarantees an $O(1/\sqrt{k})$ convergence rate both in the objective residual and the feasibility gap, similar to CGAL.
As explained in Section 2, however, penalty methods typically exhibit their proven worst-case guarantees in practice. We can indeed observe that the empirical rate of HCGM is $O(1/\sqrt{k})$ in our numerical experiments (in Section 5), as well as in the experiments in (Yurtsever et al., 2018).

5. Numerical Experiments

This section presents the numerical evidence demonstrating the empirical superiority of CGAL, based on the max-cut, clustering, and generalized eigenvector problems. We compared CGAL against UPD and HCGM from Section 4. This choice is based on the practicality of the algorithms: FW-AL and IAL have two tuning parameters each, and it is very difficult to tune these methods for medium or large scale problems. On the other hand, CGAL, HCGM, and UPD have a single parameter to tune (the penalty parameter $\lambda_0$ for CGAL and HCGM, and the accuracy parameter $\epsilon$ for UPD). We tune all of these parameters by bisection (with factor 10), until the method with the chosen parameter outperforms the runs with $10\times$ and $1/10\times$ of that parameter. Although CGAL with (decr.) performed better than HCGM in all of our experiments, CGAL with (const.) uniformly outperformed both (decr.) and HCGM. Hence, in this section, we focus on CGAL with (const.). Note that the computational cost of all the algorithms is dominated by the lmo. Hence, we plot some of the results with respect to the number of lmo calls; arguably, this roughly represents the computation time.

5.1. Max-cut

Maximum cut is an NP-hard combinatorial problem from computer science. Denoting the symmetric $n \times n$ graph Laplacian matrix of a graph by $c$, this problem can be relaxed as (Goemans and Williamson, 1995):

$$\underset{x}{\text{maximize}} \;\; \frac{1}{4}\mathrm{tr}(cx) \quad \text{subject to} \quad x_{ii} = 1 \;\; \text{for } i = 1, 2, \ldots, n, \quad \mathrm{tr}(x) = n, \;\; x \in \mathbb{S}^n_+.$$

Tuning all the methods from Section 4 requires substantial computational effort, especially because some of these methods have multiple tuning parameters.
To this end, we first consider a small scale max-cut instance, where we compare CGAL against all of these methods (that apply to this problem). In this setup, we use the GD97_b dataset¹, which corresponds to a $47 \times 47$ dimensional problem.

Figure 1. Empirical performance of various methods for solving the max-cut problem (objective residual and feasibility gap against the number of lmo calls).

In Figure 1, we present the performance of each method with the best parameter choice obtained after an extensive search. We also provide the performance of each algorithm for all trials in the supplements, as well as some other variants of these methods.

¹V. Batagelj and A. Mrvar. Pajek datasets, http://vlado.fmf.uni-lj.si/pub/networks/data/

Next, we consider a medium scale experiment, where we compare CGAL, HCGM, and UPD for max-cut with the G1 ($800 \times 800$) and G40 ($2000 \times 2000$) datasets². We compile the results of these tests in Figure 2. Observe that HCGM converges at the $O(1/\sqrt{k})$ rate (which is the worst-case bound) while CGAL achieves a faster rate.

Figure 2. Empirical comparison of CGAL, HCGM and UPD on the max-cut problem setup (objective residual and feasibility gap against the number of lmo calls, for G1 and G40).

5.2. k-means Clustering

Consider the SDP formulation of the model-free k-means clustering problem by (Peng and Wei, 2007):

$$\underset{x}{\text{minimize}} \;\; \mathrm{tr}(cx) \quad \text{subject to} \quad x 1_n = 1_n, \;\; x \geq 0, \;\; x \in \mathbb{S}^n_+ \;\;\&\;\; \mathrm{tr}(x) = \alpha,$$

where $\alpha$ is the number of clusters and $c$ is the $n \times n$ Euclidean distance matrix. We denote the vector of ones by $1_n$; hence $x 1_n = 1_n$ and $x \geq 0$ together imply that each row of $x$ is on the unit simplex. The same applies to the columns of $x$ due to symmetry.
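The max-cut domain above, and the trace-constrained domains in the other experiments of this section, are sets of the form $\{x \in \mathbb{S}^n_+ : \mathrm{tr}(x) = \alpha\}$, for which the lmo amounts to an extreme eigenvector computation. A minimal dense-algebra sketch (function name ours; at scale a Krylov eigensolver replaces `eigh`):

```python
import numpy as np

def lmo_trace_psd(V, alpha):
    """lmo over {X : X psd, tr(X) = alpha}: the minimizer of <X, V>
    is alpha * u u^T, where u is a unit eigenvector for the smallest
    eigenvalue of V."""
    w, U = np.linalg.eigh((V + V.T) / 2)   # eigenvalues in ascending order
    u = U[:, 0]                            # eigenvector of lambda_min
    return alpha * np.outer(u, u)          # rank-1 atom

V = np.diag([2.0, -3.0, 1.0])
X = lmo_trace_psd(V, alpha=4.0)
# optimal value is alpha * lambda_min(V) = 4 * (-3) = -12
assert abs(np.sum(X * V) + 12.0) < 1e-10
assert abs(np.trace(X) - 4.0) < 1e-10
```

A single extreme eigenpair per iteration is exactly the cheap oracle that makes CG-type methods attractive for these SDPs.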
This problem is an instance of (P), where $f(x) = \mathrm{tr}(cx)$, $\mathcal{X} = \{x : x \in \mathbb{S}^n_+, \, \mathrm{tr}(x) = \alpha\}$, $A : \mathbb{S}^n_+ \to \mathbb{R}^n \times \mathbb{R}^{n \times n}$ maps $x \mapsto (x 1_n, x)$, and $\mathcal{K} = \{1_n\} \times \mathbb{R}^{n \times n}_+$.

We use the same setup as in (Yurtsever et al., 2018), which was designed and published online by Mixon et al. (2017). This setup contains a $1000 \times 1000$ dimensional dataset generated by sampling and preprocessing the MNIST dataset³ using a one-layer neural network. Further details on this setup and the dataset can be found in (Mixon et al., 2017).

²Y. Ye. Gset random graphs. https://www.cise.ufl.edu/research/sparse/matrices/gset/
³Y. Le Cun and C. Cortes. MNIST handwritten digit database, http://yann.lecun.com/exdb/mnist/

Figure 3. Empirical comparison of CGAL, HCGM and UPD for solving the generalized eigenvector problem with 4 different synthetic setups (Gaussian, Poly Decay, Exp Decay, Max Cut SDP). Dotted lines present the objective residual and feasibility gap of the atoms chosen by the linear minimization oracle ($s_k$).

Figure 4. Objective residual and feasibility gap for k-means clustering with the preprocessed MNIST dataset.

In Figure 4, we observe once again that CGAL outperforms HCGM, achieving an empirical $O(1/k)$ convergence rate. In this problem instance, we failed to tune UPD, even with the knowledge of $f^\star$. After extensive analysis and tests, we concluded that UPD has an implicit tuning parameter: it is possible to choose different accuracy terms for the objective and the feasibility in UPD, as also noted by the authors, simply by scaling the objective function with a constant.
The performance of UPD heavily depends on this scaling in addition to the tuning of the accuracy parameter; hence we omit UPD.

5.3. Generalized Eigenvector Problem

Consider the SDP relaxation of the generalized eigenvector problem from Boumal et al. (2018):

$$\underset{x}{\text{maximize}} \;\; \mathrm{tr}(\varphi x) \quad \text{subject to} \quad \mathrm{tr}(x) \leq \alpha, \;\; x \in \mathbb{S}^n_+ \;\;\&\;\; \mathrm{tr}(\psi x) = 1,$$

where $\varphi$ and $\psi$ are symmetric matrices of size $n \times n$ and $\alpha > 0$ is a model parameter. In this problem, we use synthetic setups, where we generate $\psi$ with iid Gaussian entries, and we consider 4 different cases for $\varphi$:

Gaussian - $\varphi$ generated by taking the symmetric part of a $1000 \times 1000$ iid Gaussian matrix
Poly Decay - $\varphi$ generated by randomly rotating $\mathrm{diag}(1^{-i}, 2^{-i}, \ldots, 1000^{-i})$ ($i = 1$)
Exp Decay - $\varphi$ generated by randomly rotating $\mathrm{diag}(10^{-i}, 10^{-2i}, \ldots, 10^{-1000i})$ ($i = 0.025$)
Max Cut SDP - $\varphi$ is a solution of the max-cut SDP with the G40 dataset ($2000 \times 2000$)

This problem highlights an important observation that partially explains why CGAL outperforms the base method HCGM. Remark that this problem has a rank-1 solution, and if we set $\alpha$ correctly, this solution becomes an extreme point of the domain. In this scenario, if the problem is well-conditioned, we might expect the lmo to pick this solution (or some close points). For the sake of better adaptation to the problem geometry, CGAL updates the dual variable (which corresponds to the center point of a quadratic penalty). In Figure 3, we provide empirical evidence of this adaptation: the dotted lines correspond to the extreme points chosen by the lmo. Unsurprisingly, these points ($s_k$) quickly converge (with linear rates) to a solution for CGAL, while we do not observe the same behavior for HCGM or UPD (we omit the lmo outputs of UPD in the figure, which do not converge).

Acknowledgements

This work was supported by the Swiss National Science Foundation (SNSF) under grant number 200021_178865/1.
This project has received funding from the European Research Council (ERC) under the European Union's Horizon 2020 research and innovation programme (grant agreement no 725594 - time-data).

References

D. P. Bertsekas. On penalty and multiplier methods for constrained minimization. SIAM J. Control Optim., 14(2):216-235, 1976.
J. Bolte, T. P. Nguyen, J. Peypouquet, and B. W. Suter. From error bounds to the complexity of first-order descent methods for convex functions. Math. Program., 165(2):471-507, 2017.
N. Boumal, V. Voroninski, and A. Bandeira. Deterministic guarantees for Burer-Monteiro factorizations of smooth semidefinite programs. arXiv:1804.02008v1, 2018.
E. J. Candès, T. Strohmer, and V. Voroninski. PhaseLift: Exact and stable signal recovery from magnitude measurements via convex programming. Communications on Pure and Applied Math., 66(8):1241-1274, 2013.
C. Dünner, S. Forte, M. Takáč, and M. Jaggi. Primal-dual rates and certificates. In Proc. 33rd Int. Conf. Machine Learning, 2016.
M. Frank and P. Wolfe. An algorithm for quadratic programming. Naval Research Logistics Quarterly, 3:95-110, 1956.
G. Gidel, F. Pedregosa, and S. Lacoste-Julien. Frank-Wolfe splitting via augmented Lagrangian method. In Proc. 21st Int. Conf. Artificial Intelligence and Statistics (AISTATS), 2018.
M. X. Goemans and D. P. Williamson. Improved approximation algorithms for maximum cut and satisfiability problems using semidefinite programming. Journal of the ACM, 43(6):1115-1145, 1995.
E. Hazan and S. Kale. Projection-free online learning. In Proc. 29th Int. Conf. Machine Learning, 2012.
X. Huang, I. E.-H. Yen, R. Zhang, Q. Huang, P. Ravikumar, and I. S. Dhillon. Greedy direction method of multiplier for MAP inference of large output domain. In Proc. 20th Int. Conf. Artificial Intelligence and Statistics (AISTATS), 2017.
M. Jaggi. Revisiting Frank-Wolfe: Projection-free sparse convex optimization. In Proc. 30th Int. Conf. Machine Learning, 2013.
N. Komodakis and J.-C. Pesquet. Playing with duality: An overview of recent primal-dual approaches for solving large-scale optimization problems. IEEE Signal Process. Mag., 32(6):31-54, 2015.
G. Lan. The complexity of large-scale convex programming under a linear optimization oracle. arXiv:1309.5550v2, 2014.
G. Lan and Y. Zhou. Conditional gradient sliding for convex optimization. SIAM J. Optim., 26(2):1379-1409, 2016.
Y.-F. Liu, X. Liu, and S. Ma. On the non-ergodic convergence rate of an inexact augmented Lagrangian framework for composite convex programming. To appear in Mathematics of Operations Research, 2018.
D. G. Mixon, S. Villar, and R. Ward. Clustering subgaussian mixtures by semidefinite programming. Information and Inference: A Journal of the IMA, 6(4):389-415, 2017.
Y. Nesterov. A method of solving a convex programming problem with convergence rate O(1/k^2). Soviet Mathematics Doklady, 27(2):372-376, 1983.
Y. Nesterov. Smooth minimization of non-smooth functions. Math. Program., 103:127-152, 2005a.
Y. Nesterov. Excessive gap technique in nonsmooth convex minimization. SIAM J. Optim., 16(1):235-249, 2005b.
Y. Nesterov. Universal gradient methods for convex optimization problems. Math. Program., 152(1-2):381-404, 2015.
J. Peng and Y. Wei. Approximating K-means-type clustering via semidefinite programming. SIAM J. Optim., 18(1):186-205, 2007.
E. K. Ryu and S. Boyd. Primer on monotone operator methods. Appl. Comput. Math., 15(1):3-43, 2016.
M. Sion. On general minimax theorems. Pacific J. Math., 8(1):171-176, 1958.
Q. Tran-Dinh, O. Fercoq, and V. Cevher. A smooth primal-dual optimization framework for nonsmooth composite convex minimization. SIAM J. Optim., 28(1):96-134, 2018.
S. J. Wright. Primal-Dual Interior-Point Methods. SIAM, Philadelphia, USA, 1997.
I. E.-H. Yen, K. Huang, R. Zhong, P. Ravikumar, and I. S. Dhillon. Dual decomposed learning with factorwise oracle for structural SVM with large output domain. In Advances in Neural Information Processing Systems 29, 2016a.
I. E.-H. Yen, X. Lin, J. Zhang, P. Ravikumar, and I. S. Dhillon. A convex atomic norm approach to multiple sequence alignment and motif discovery. In Proc. 33rd Int. Conf. Machine Learning, 2016b.
A. Yurtsever, Q. Tran-Dinh, and V. Cevher. A universal primal-dual convex optimization framework. In Advances in Neural Information Processing Systems 28, 2015.
A. Yurtsever, M. Udell, J. Tropp, and V. Cevher. Sketchy decisions: Convex low-rank matrix optimization with optimal storage. In Proc. 20th Int. Conf. Artificial Intelligence and Statistics (AISTATS), 2017.
A. Yurtsever, O. Fercoq, F. Locatello, and V. Cevher. A conditional gradient framework for composite convex minimization with applications to semidefinite programming. In Proc. 35th Int. Conf. Machine Learning, 2018.