Constrained convex minimization via model-based excessive gap

Quoc Tran-Dinh and Volkan Cevher
Laboratory for Information and Inference Systems (LIONS)
Ecole Polytechnique Fédérale de Lausanne (EPFL), CH-1015 Lausanne, Switzerland
{quoc.trandinh, volkan.cevher}@epfl.ch

Abstract

We introduce a model-based excessive gap technique to analyze first-order primal-dual methods for constrained convex minimization. As a result, we construct first-order primal-dual methods with optimal convergence rates on the primal objective residual and the primal feasibility gap of their iterates separately. Through a dual smoothing and prox-center selection strategy, our framework subsumes the augmented Lagrangian, alternating direction, and dual fast-gradient methods as special cases, where our rates apply.

1 Introduction

In [1], Nesterov introduced a primal-dual technique, called the excessive gap, for constructing and analyzing first-order methods for nonsmooth and unconstrained convex optimization problems. This paper builds upon the same idea for constructing and analyzing algorithms for the following class of constrained convex problems, which captures a surprisingly broad set of applications [2, 3, 4, 5]:

    $f^\star := \min_{x \in \mathbb{R}^n} \{ f(x) : Ax = b,\ x \in X \}$,    (1)

where $f : \mathbb{R}^n \to \mathbb{R}\cup\{+\infty\}$ is a proper, closed and convex function; $X \subseteq \mathbb{R}^n$ is a nonempty, closed and convex set; and $A \in \mathbb{R}^{m \times n}$ and $b \in \mathbb{R}^m$ are given. In the sequel, we show how Nesterov's excessive gap relates to the smoothed gap function for a variational inequality that characterizes the optimality condition of (1). In light of this connection, we enforce a simple linear model on the excessive gap, and use it to develop efficient first-order methods that numerically approximate an optimal solution $x^\star$ of (1). Then, we rigorously characterize how the following structural assumptions on (1) affect the computational efficiency of these methods.

Structure 1: Decomposability.
We say that problem (1) is p-decomposable if its objective function f and its feasible set X can be represented as follows:

    $f(x) := \sum_{i=1}^p f_i(x_i)$, and $X := \prod_{i=1}^p X_i$,    (2)

where $x_i \in \mathbb{R}^{n_i}$, $X_i \subseteq \mathbb{R}^{n_i}$, $f_i : \mathbb{R}^{n_i} \to \mathbb{R}\cup\{+\infty\}$ is proper, closed and convex for $i = 1, \dots, p$, and $\sum_{i=1}^p n_i = n$. Decomposability naturally arises in machine learning applications such as group-sparse linear recovery, consensus optimization, and the duality of empirical risk minimization problems [5]. As an important example, a composite convex minimization problem $\min_{x_1}\{f_1(x_1) + f_2(Kx_1)\}$ can be cast into (1) with a 2-decomposable structure by introducing an intermediate variable $x_2 = Kx_1$ via a linear constraint. Decomposable structure immediately supports parallel and distributed implementations on synchronous hardware architectures.

Structure 2: Proximal tractability. By proximal tractability, we mean that the following operation with a given proper, closed and convex function g can be computed efficiently (e.g., by a closed-form solution or by a polynomial-time algorithm) [6]:

    $\mathrm{prox}_g(z) := \arg\min_{w \in \mathbb{R}^{n_z}} \{ g(w) + (1/2)\|w - z\|^2 \}$.    (3)

When a constraint $z \in Z$ is present, we consider the proximal operator of $g(\cdot) + \delta_Z(\cdot)$ instead of g, where $\delta_Z$ is the indicator function of Z. Many smooth and non-smooth functions have tractable proximal operators, such as norms and projections onto simple sets [3, 7, 4, 5].

Scalable algorithms for constrained convex minimization and their limitations. We can obtain scalable numerical solutions of (1) when we augment the objective f with simple penalty functions on the constraints. Despite the fundamental difficulty of choosing the penalty parameter, this approach enhances our computational capabilities as well as numerical robustness, since we can apply modern proximal-gradient, alternating direction, and primal-dual methods. Unfortunately, existing approaches invariably feature one or both of the following two limitations:

Limitation 1: Non-ideal convergence characterizations.
Ideally, the convergence rate characterization of a first-order algorithm for solving (1) must be established simultaneously for its iterates $x^k \in X$, both on the objective residual $f(x^k) - f^\star$ and on the primal feasibility gap $\|Ax^k - b\|$ of its linear constraints. The constraint feasibility is critical for the primal convergence rate to have any significance. A rate on a joint measure of the objective residual and the feasibility gap is not necessarily meaningful, since (1) is a constrained problem and $f(x^k) - f^\star$ can easily be negative at all times, in contrast to the unconstrained setting, where we trivially have $f(x^k) - f^\star \ge 0$. Hitherto, the convergence results of state-of-the-art methods are far from ideal; see Table 1 in [28]. Most algorithms have guarantees only in the ergodic sense [8, 9, 10, 11, 12, 13, 14] with non-optimal rates, which diminishes their practical performance; they rely on special function properties to improve the convergence rates on the objective and the feasibility [12, 15], which reduces the scope of their applicability; they provide rates on dual functions [16], or on a weighted primal residual and feasibility score [13], which does not necessarily imply convergence of the primal residual or of the feasibility; or they obtain a convergence rate on a gap function value sequence composed of both the primal and dual variables via variational inequality and gap function characterizations [8, 10, 11], where the rate is scaled by a diameter parameter of the dual feasible set, which is not necessarily bounded.

Limitation 2: Computational inflexibility. Recent theoretical developments customize algorithms to special function classes for scalability, such as convex functions with globally Lipschitz gradient and strong convexity.
Unfortunately, these algorithms often require knowledge of function class parameters (e.g., the Lipschitz constant and the strong convexity parameter); they do not address the full scope of (1) (e.g., with self-concordant [barrier] functions or fully non-smooth decompositions); and they often have complicated algorithmic implementations with backtracking steps, which can create computational bottlenecks. These issues are compounded by their penalty parameter selection, which can significantly decrease numerical efficiency [17]. Moreover, they lack a natural ability to handle p-decomposability in a parallel fashion at optimal rates.

Our specific contributions. To this end, this paper addresses the question: Is it possible to efficiently solve (1) using only the proximal tractability assumption, with rigorous global convergence rates on the objective residual and the primal feasibility gap? The answer is indeed positive, provided that there exists a solution in a bounded feasible set X. Surprisingly, we can still leverage favorable function classes for fast convergence, such as strongly convex functions, and exploit p-decomposability at optimal rates. Our characterization is radically different from existing results, such as those in [18, 8, 19, 9, 10, 11, 12, 13]. Specifically, we unify primal-dual methods [20, 21], smoothing (both via Bregman distances and via augmented Lagrangian functions) [22, 21], and the excessive gap technique [1] in one framework. As a result, we develop an efficient algorithmic framework for solving (1) that covers the augmented Lagrangian method [23, 24], the [preconditioned] alternating direction method of multipliers ([P]ADMM) [8], and fast dual descent methods [18] as special cases. Based on the new technique, we establish rigorous convergence rates for a few well-known primal-dual methods, which are optimal (in the sense of first-order black-box models [25]) under our particular assumptions.
We also discuss adaptive strategies for trading off between the objective residual $|f(\bar x^k) - f^\star|$ and the feasibility gap $\|A\bar x^k - b\|$, which enhance practical performance. Finally, we describe how strong convexity of f can be exploited, and numerically illustrate the theoretical results.

2 Preliminaries

2.1. A semi-Bregman distance. Let Z be a nonempty, closed convex set in $\mathbb{R}^{n_z}$. A nonnegative, continuous and $\mu_b$-strongly convex function b is called a $\mu_b$-proximity function (or prox-function) of Z if $Z \subseteq \mathrm{dom}(b)$. Then $z^c := \arg\min_{z \in Z} b(z)$ exists and is unique; it is called the center point of b. Given a smooth $\mu_b$-prox-function b of Z (with $\mu_b = 1$), we define

    $d_b(z, \hat z) := b(\hat z) - b(z) - \nabla b(z)^T (\hat z - z)$, for $z, \hat z \in \mathrm{dom}(b)$,

as the Bregman distance between z and $\hat z$ given b. As an example, with $b(z) := (1/2)\|z\|_2^2$, we have $d_b(z, \hat z) = (1/2)\|z - \hat z\|_2^2$, the squared Euclidean distance. In order to unify both the Bregman distance and augmented Lagrangian smoothing methods, we introduce a new semi-Bregman distance $d_b(Sx, Sx^c)$ between x and $x^c$, given a matrix S. Since S is not necessarily square, we use the prefix "semi" for this measure. We also denote by

    $D^S_X := \sup \{ d_b(Sx, Sx^c) : x, x^c \in X \}$    (4)

the semi-diameter of X. If X is bounded, then $0 \le D^S_X < +\infty$.

2.2. The dual problem of (1). Let $\mathcal{L}(x, y) := f(x) + y^T(Ax - b)$ be the Lagrange function of (1), where $y \in \mathbb{R}^m$ is the vector of Lagrange multipliers. The dual problem of (1) is defined as

    $g^\star := \max_{y \in \mathbb{R}^m} g(y)$,    (5)

where g is the dual function, defined as

    $g(y) := \min_{x \in X} \{ f(x) + y^T(Ax - b) \}$.    (6)

Let us denote by $x^\star(y)$ a solution of (6) for a given $y \in \mathbb{R}^m$. Corresponding to $x^\star(y)$, we also define the domain of g as $\mathrm{dom}(g) := \{ y \in \mathbb{R}^m : x^\star(y)\ \text{exists} \}$. If f is continuous on X and X is bounded, then $x^\star(y)$ exists for all $y \in \mathbb{R}^m$. Unfortunately, g is nonsmooth, and numerical solutions of (5) are difficult to obtain [25]. In general, we have $g(y) \le f(x)$, which is the weak duality condition in convex analysis.
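As a tiny numerical illustration of the dual function (6) and weak duality, consider the following sketch; the quadratic objective $f(x) = \frac{1}{2}\|x - c\|^2$, the choice $X = \mathbb{R}^2$, and all data below are toy assumptions of ours, not taken from the paper:

```python
import numpy as np

# Toy instance of (1): f(x) = 0.5*||x - c||^2, X = R^2, one constraint a^T x = b.
a = np.array([1.0, 1.0]); b_rhs = 1.0; c = np.array([1.0, 1.0])
f = lambda x: 0.5 * np.sum((x - c) ** 2)

def g(y):
    """Dual function (6): g(y) = min_x { f(x) + y*(a^T x - b) }.
    For this quadratic f with X = R^2, the inner minimizer has the
    closed form x*(y) = c - y*a."""
    x_star = c - y * a
    return f(x_star) + y * (a @ x_star - b_rhs)

# Weak duality: g(y) <= f(x) for any primal-feasible x (here a^T x = b).
x_feas = np.array([1.0, 0.0])
```

For this instance $g(y) = y - y^2$, maximized at $y^\star = 0.5$ with $g^\star = 0.25$, which equals $f^\star$ (strong duality, consistent with Assumption A.1 below since the Slater condition trivially holds).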
To guarantee strong duality, i.e., $f^\star = g^\star$ for (1) and (5), we need the following assumption:

Assumption A.1. The solution set $X^\star$ of (1) is nonempty. The function f is proper, closed and convex. In addition, either X is a polytope or the Slater condition holds, i.e., $\{x \in \mathbb{R}^n : Ax = b\} \cap \mathrm{relint}(X) \ne \emptyset$, where $\mathrm{relint}(X)$ is the relative interior of X.

Under Assumption A.1, the solution set $Y^\star$ of (5) is also nonempty and bounded. Moreover, strong duality holds, i.e., $f^\star = g^\star$. Any point $(x^\star, y^\star) \in X^\star \times Y^\star$ is a primal-dual solution to (1) and (5), and is also a saddle point of $\mathcal{L}$, i.e., $\mathcal{L}(x^\star, y) \le \mathcal{L}(x^\star, y^\star) \le \mathcal{L}(x, y^\star)$ for all $(x, y) \in X \times \mathbb{R}^m$.

2.3. Mixed-variational inequality formulation and the smoothed gap function. We use $w := [x; y] \in \mathbb{R}^n \times \mathbb{R}^m$ to denote the primal-dual variable, and $F(w) := [A^T y;\ b - Ax]$ to denote a partial Karush-Kuhn-Tucker (KKT) mapping. Then, we can write the optimality condition of (1) as

    $f(x) - f(x^\star) + F(w^\star)^T (w - w^\star) \ge 0$, for all $w \in X \times \mathbb{R}^m$,    (7)

which is known as a mixed variational inequality problem (MVIP) [26]. If we define $W := X \times \mathbb{R}^m$ and

    $G(\bar w) := \max_{w \in W} \{ f(\bar x) - f(x) + F(\bar w)^T (\bar w - w) \}$,    (8)

then G is known as the Auslender gap function of (7) [27]. By the definition of F, we can see that

    $G(\bar w) = \max_{(x, y) \in W} \{ f(\bar x) - f(x) - (Ax - b)^T \bar y \} = f(\bar x) - g(\bar y) \ge 0$.

It is clear that $G(\bar w) = 0$ if and only if $\bar w := [\bar x; \bar y] \in W^\star := X^\star \times Y^\star$, i.e., strong duality is attained. Since G is generally nonsmooth, we smooth it by adding a strongly convex term:

    $d_{\gamma\beta}(w) \equiv d_{\gamma\beta}(x, y) := \gamma\, d_b(Sx, Sx^c) + (\beta/2)\|y\|^2$,    (9)

where $d_b$ is a Bregman distance, S is a given matrix, and $\gamma, \beta > 0$ are smoothness parameters. The smoothed gap function for G is defined as

    $G_{\gamma\beta}(\bar w) := \max_{w \in W} \{ f(\bar x) - f(x) + F(\bar w)^T (\bar w - w) - d_{\gamma\beta}(w) \}$,    (10)

where F is defined in (7). The function $G_{\gamma\beta}$ can be considered as a smoothed gap function for the MVIP (7). By the definitions of G and $G_{\gamma\beta}$, we can easily show that

    $G_{\gamma\beta}(\bar w) \le G(\bar w) \le G_{\gamma\beta}(\bar w) + \max \{ d_{\gamma\beta}(w) : w \in W \}$,    (11)

which is key to developing the algorithm in the next section.
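The Bregman distance $d_b$ from Section 2.1, which enters the smoothing term (9), can be sketched in a few lines; the helper name `bregman` and the concrete data below are our own illustrative choices, with $b(z) = \frac{1}{2}\|z\|_2^2$ recovering the squared Euclidean distance:

```python
import numpy as np

def bregman(b, grad_b, z, z_hat):
    """Bregman distance d_b(z, z_hat) = b(z_hat) - b(z) - <grad b(z), z_hat - z>,
    following the definition in Section 2.1."""
    return b(z_hat) - b(z) - grad_b(z) @ (z_hat - z)

# b(z) = 0.5*||z||^2 recovers the squared Euclidean distance.
b_sq = lambda z: 0.5 * (z @ z)
grad_sq = lambda z: z

z, zh = np.array([1.0, 2.0]), np.array([4.0, 6.0])
d = bregman(b_sq, grad_sq, z, zh)        # = 0.5*||z - zh||^2 = 0.5*(9 + 16) = 12.5

# The semi-Bregman distance of (4) evaluates d_b on Sx, Sx_c with a
# (not necessarily square) matrix S:
S = np.array([[1.0, 1.0]])
d_semi = bregman(b_sq, grad_sq, S @ z, S @ zh)   # = 0.5*(10 - 3)^2 = 24.5
```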
Problem (10) is convex, and its solution $w^\star_{\gamma\beta}(\bar w)$ can be computed as

    $w^\star_{\gamma\beta}(\bar w) := [x^\star_\gamma(\bar y);\ y^\star_\beta(\bar x)]$, where
    $x^\star_\gamma(\bar y) := \arg\min_{x \in X} \{ f(x) + \bar y^T (Ax - b) + \gamma\, d_b(Sx, Sx^c) \}$,
    $y^\star_\beta(\bar x) := \beta^{-1} (A\bar x - b)$.    (12)

In this case, the following concave function:

    $g_\gamma(y) := \min_{x \in X} \{ f(x) + y^T (Ax - b) + \gamma\, d_b(Sx, Sx^c) \}$    (13)

can be considered as a smooth approximation of the dual function g defined by (6).

2.4. Bregman distance smoother vs. augmented Lagrangian smoother. Depending on the choice of S and $x^c$, we deal with two smoothers:
1. If we choose S = I, the identity matrix, and $x^c$ to be the center point of b, then we obtain a Bregman distance smoother.
2. If we choose S = A, and $x^c \in X$ such that $Ax^c = b$, then we obtain the augmented Lagrangian smoother.
With both smoothing techniques, the function $g_\gamma$ is smooth and concave, and its gradient is Lipschitz continuous with Lipschitz constant $L^g_\gamma := \gamma^{-1}\|A\|^2$ and $L^g_\gamma := \gamma^{-1}$, respectively.

3 Construction and analysis of a class of first-order primal-dual algorithms

3.1. Model-based excessive gap technique for (1). Recall that $G(\bar w) = 0$ iff $\bar w = [x^\star; y^\star]$ is a primal-dual optimal solution of (1)-(5). The goal is therefore to construct a sequence $\{\bar w^k\}$ such that $G(\bar w^k) \to 0$, which implies that $\{\bar w^k\}$ converges to $w^\star$. As suggested by (11), if we can construct two sequences $\{\bar w^k\}$ and $\{(\gamma_k, \beta_k)\}$ such that $G_{\gamma_k\beta_k}(\bar w^k) \to 0^+$ as $\gamma_k\beta_k \to 0^+$, then $G(\bar w^k) \to 0$. Inspired by Nesterov's excessive gap idea in [1], we construct the following model-based excessive gap condition for (1) in order to achieve this goal.

Definition 1 (Model-based excessive gap). Given $\bar w^k \in W$ and $(\gamma_k, \beta_k) > 0$, a new point $\bar w^{k+1} \in W$ and parameters $(\gamma_{k+1}, \beta_{k+1}) > 0$ with $\gamma_{k+1}\beta_{k+1} < \gamma_k\beta_k$ are said to be firmly contractive (w.r.t. $G_{\gamma\beta}$ defined by (10)) if

    $G_{k+1}(\bar w^{k+1}) \le (1 - \tau_k)\, G_k(\bar w^k) + \psi_k$,    (14)

where $G_k := G_{\gamma_k\beta_k}$, $\tau_k \in [0, 1)$ and $\psi_k \ge 0$.

From Definition 1, if $\{\bar w^k\}$ and $\{(\gamma_k, \beta_k)\}$ satisfy (14), then by induction $G_k(\bar w^k) \le \omega_k G_0(\bar w^0) + \Psi_k$, where $\omega_k := \prod_{j=0}^{k-1}(1 - \tau_j)$ and $\Psi_k := \psi_0 + \sum_{j=1}^{k-1} \prod_{i=0}^{j-1}(1 - \tau_i)\,\psi_j$.
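Returning to the two components of (12), both can be sketched concretely; the quadratic objective $f(x) = \frac{1}{2}\|x - c\|^2$, the choice $X = \mathbb{R}^n$, and the augmented Lagrangian smoother with $b(\cdot) = \frac{1}{2}\|\cdot\|^2$ and $Ax^c = b$ are assumptions of this sketch made so that $x^\star_\gamma(y)$ reduces to a linear solve:

```python
import numpy as np

def x_star_gamma(y, gamma, A, b, c):
    """Smoothed primal subproblem in (12) for the toy choice
    f(x) = 0.5*||x - c||^2, X = R^n, S = A, b(.) = 0.5*||.||^2, A x_c = b:
        argmin_x 0.5*||x - c||^2 + y^T(Ax - b) + (gamma/2)*||Ax - b||^2.
    Setting the gradient to zero gives (I + gamma*A^T A) x = c - A^T y + gamma*A^T b."""
    n = A.shape[1]
    H = np.eye(n) + gamma * (A.T @ A)
    return np.linalg.solve(H, c - A.T @ y + gamma * (A.T @ b))

def y_star_beta(x, beta, A, b):
    """Dual component of (12): y*_beta(x) = (1/beta)*(Ax - b)."""
    return (A @ x - b) / beta
```

As a sanity check, for large gamma the smoothed minimizer $x^\star_\gamma(0)$ approaches the minimum-norm feasible point of $Ax = b$ when $c = 0$, reflecting that the smoothing term dominates.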
If $G_0(\bar w^0) \le 0$, then we can bound the objective residual $|f(\bar x^k) - f^\star|$ and the primal feasibility $\|A\bar x^k - b\|$ of (1):

Lemma 1 ([28]). Let $G_{\gamma\beta}$ be defined by (10), and let $\{\bar w^k\}_{k\ge 0} \subset W$ and $\{(\gamma_k, \beta_k)\}_{k\ge 0} \subset \mathbb{R}^2_{++}$ be sequences satisfying (14). Then it holds that

    $-\big[2\beta_k D^\star_Y + (2\gamma_k\beta_k D^S_X)^{1/2}\big] D^\star_Y \le f(\bar x^k) - f^\star \le \gamma_k D^S_X$,
    $\|A\bar x^k - b\| \le 2\beta_k D^\star_Y + (2\gamma_k\beta_k D^S_X)^{1/2}$,    (15)

where $D^\star_Y := \min\{\|y^\star\|_2 : y^\star \in Y^\star\}$ is the norm of a minimum-norm dual solution. Hence, we can derive algorithms that drive $(\gamma_k, \beta_k)$ to zero with a predictable convergence rate via (15). In the sequel, we manipulate $\tau_k$ and $\psi_k$ to do just that while preserving (14), à la Nesterov [1]. Finally, we say that $\bar x^k \in X$ is an $\varepsilon$-solution of (1) if $|f(\bar x^k) - f^\star| \le \varepsilon$ and $\|A\bar x^k - b\| \le \varepsilon$.

3.2. Initial points. We first show how to compute an initial point $\bar w^0$ such that $G_0(\bar w^0) \le 0$.

Lemma 2 ([28]). Given $x^0_c \in X$, the point $\bar w^0 := [\bar x^0; \bar y^0] \in W$ computed by

    $\bar x^0 = x^\star_{\gamma_0}(0^m) := \arg\min_{x \in X} \{ f(x) + \gamma_0\, d_b(Sx, Sx^0_c) \}$,
    $\bar y^0 = y^\star_{\beta_0}(\bar x^0) := \beta_0^{-1}(A\bar x^0 - b)$    (16)

satisfies $G_{\gamma_0\beta_0}(\bar w^0) \le -\gamma_0\, d_b(S\bar x^0, Sx^c) \le 0$ provided that $\beta_0\gamma_0 \ge L^g$, where $L^g$ is the Lipschitz constant associated with $\nabla g_\gamma$ for $g_\gamma$ given in Subsection 2.4.

3.3. An algorithmic template. Algorithm 1 combines the above ingredients for solving (1). The key computational step of Algorithm 1 is Step 3, where we update $[\bar x^{k+1}; \bar y^{k+1}]$. We provide two update schemes, (1P2D) and (2P1D), named after their numbers of primal and dual updates. The primal step $x^\star_{\gamma_k}(\bar y^k)$ is calculated via (12). In the (2P1D) scheme, the operator $\mathrm{prox}^S_{\beta f}$ is computed as

    $\mathrm{prox}^S_{\beta f}(\hat x, \hat y) := \arg\min_{x \in X} \{ f(x) + \hat y^T A(x - \hat x) + \beta^{-1} d_b(Sx, S\hat x) \}$,    (17)

where we overload the notation of the proximal operator prox defined above. At Step 2 of Algorithm 1, if we choose S := I, i.e., $d_b(Sx, Sx^c) := d_b(x, x^c)$ with $x^c$ the center point of b, then we set $L^g := \|A\|^2$. If S := A, i.e., $d_b(Sx, Sx^c) := (1/2)\|Ax - b\|^2$, then we set $L^g := 1$. Theorem 1 characterizes three variants of Algorithm 1; its proof can be found in [28].
Algorithm 1 (A primal-dual algorithmic template using model-based excessive gap)

Inputs: Fix $\gamma_0 > 0$. Choose $c_0 \in (-1, 1]$.
Initialization:
1: Compute $a_0 := 0.5\big(1 + c_0 + \sqrt{4(1 - c_0) + (1 + c_0)^2}\big)$, $\tau_0 := a_0^{-1}$, and $\beta_0 := \gamma_0^{-1} L^g$ (c.f. the text).
2: Compute $[\bar x^0; \bar y^0]$ by (16) in Lemma 2.
For $k = 0$ to $k_{\max}$, perform:
3: If the stopping criterion holds, terminate. Otherwise, apply one of the following update schemes:

    (2P1D): $\hat x^k := (1-\tau_k)\bar x^k + \tau_k x^\star_{\gamma_k}(\bar y^k)$; $\hat y^k := \beta_{k+1}^{-1}(A\hat x^k - b)$; $\bar x^{k+1} := \mathrm{prox}^S_{\beta_{k+1} f}(\hat x^k, \hat y^k)$; $\bar y^{k+1} := (1-\tau_k)\bar y^k + \tau_k \hat y^k$.

    (1P2D): $y^{\star k} := \beta_k^{-1}(A\bar x^k - b)$; $\hat y^k := (1-\tau_k)\bar y^k + \tau_k y^{\star k}$; $\bar x^{k+1} := (1-\tau_k)\bar x^k + \tau_k x^\star_{\gamma_k}(\hat y^k)$; $\bar y^{k+1} := \hat y^k + \gamma_k\big(A x^\star_{\gamma_k}(\hat y^k) - b\big)$.

4: Update $\beta_{k+1} := (1-\tau_k)\beta_k$ and $\gamma_{k+1} := (1 - c_k\tau_k)\gamma_k$. Update $c_{k+1}$ from $c_k$ (optional).
5: Update $a_{k+1} := 0.5\big(1 + c_{k+1} + \sqrt{4a_k^2 + (1 - c_{k+1})^2}\big)$ and set $\tau_{k+1} := a_{k+1}^{-1}$.
End For

Theorem 1. Let $(\bar x^k, \bar y^k)$ be the sequence generated by Algorithm 1 after k iterations. Then:

If S = A, i.e., using the augmented Lagrangian smoother, $\gamma_0 := L^g = 1$, and $c_k := 0$, then the (1P2D) update satisfies, for all $k \ge 0$:

    $\|A\bar x^k - b\|_2 \le \dfrac{8 D^\star_Y}{(k+1)^2}$, and $-\tfrac{1}{2}\|A\bar x^k - b\|_2^2 - D^\star_Y \|A\bar x^k - b\|_2 \le f(\bar x^k) - f^\star \le 0$.    (18)

As a consequence, the worst-case analytical complexity of Algorithm 1 to achieve an $\varepsilon$-solution $\bar x^k$ is $O(1/\sqrt{\varepsilon})$.

If S = I, i.e., using the Bregman distance smoother, $\gamma_0 := \sqrt{L^g} = \|A\|$, and $c_k := 1$, then, for the (2P1D) scheme, we have

    $\|A\bar x^k - b\| \le \dfrac{\|A\|\big(2D^\star_Y + \sqrt{2 D^I_X}\big)}{k+1}$, and $-D^\star_Y\|A\bar x^k - b\| \le f(\bar x^k) - f^\star \le \dfrac{\|A\|}{k+1}\, D^I_X$.    (19)

Similarly, if $\gamma_0 := \dfrac{2\sqrt{2}\,\|A\|}{K+1}$ and $c_k := 0$ for all $k = 0, 1, \dots, K$, then, for the (1P2D) scheme, we have

    $\|A\bar x^K - b\| \le \dfrac{2\sqrt{2}\,\|A\|\big(D^\star_Y + \sqrt{D^I_X}\big)}{K+1}$, and $-D^\star_Y\|A\bar x^K - b\| \le f(\bar x^K) - f^\star \le \dfrac{2\sqrt{2}\,\|A\|}{K+1}\, D^I_X$.    (20)

Hence, the worst-case analytical complexity to achieve an $\varepsilon$-solution $\bar x^k$ of (1) is $O(\varepsilon^{-1})$.

The (1P2D) scheme is closely related to some well-known primal-dual methods that we describe below. Unfortunately, (1P2D) has the drawback of fixing the total number of iterations a priori, which (2P1D) avoids at the expense of one more proximal operator calculation per iteration.

3.4. Impact of strong convexity.
We can improve the above schemes when $f \in \mathcal{F}_\mu$, i.e., f is strongly convex with parameter $\mu_f > 0$. In this case, the dual function g given in (6) is smooth with Lipschitz gradient, with constant $L^g_f := \mu_f^{-1}\|A\|^2$. Let us illustrate this when S = I, using the following modification of the (1P2D) scheme:

    $\hat y^k := (1-\tau_k)\bar y^k + \tau_k y^\star_{\beta_k}(\bar x^k)$,
    $\bar x^{k+1} := (1-\tau_k)\bar x^k + \tau_k x^\star(\hat y^k)$,
    $\bar y^{k+1} := \hat y^k + \dfrac{1}{L^g_f}\big(A x^\star(\hat y^k) - b\big)$.    (1P2Dµ)

We can still choose the starting point as in (16) with $\beta_0 := L^g_f$. The parameters $\beta_k$ and $\tau_k$ at Steps 4 and 5 of Algorithm 1 are updated as $\beta_{k+1} := (1-\tau_k)\beta_k$ and $\tau_{k+1} := \frac{\tau_k}{2}\big(\sqrt{\tau_k^2 + 4} - \tau_k\big)$, where $\tau_0 := (\sqrt{5}-1)/2$. The following corollary illustrates the convergence of Algorithm 1 using (1P2Dµ); see [28] for the detailed proof.

Corollary 1. Let $f \in \mathcal{F}_\mu$ and let $\{(\bar x^k, \bar y^k)\}_{k\ge 0}$ be generated by Algorithm 1 using (1P2Dµ). Then:

    $\|A\bar x^k - b\|_2 \le \dfrac{4\|A\|^2}{\mu_f (k+2)^2}\, D^\star_Y$, and $-D^\star_Y \|A\bar x^k - b\| \le f(\bar x^k) - f^\star \le 0$.

Moreover, we also have $\|\bar x^k - x^\star\| \le \dfrac{4\|A\|}{(k+2)\mu_f}\, D^\star_Y$. It is important to note that, when $f \in \mathcal{F}_\mu$, we only have one smoothness parameter β, and hence we do not need to fix the number of iterations a priori (in contrast with [18]).

4 Algorithmic enhancements through existing methods

Our framework can directly instantiate concrete variants of some popular primal-dual methods for (1). We illustrate three connections here and establish one convergence result for the second variant. We also borrow adaptation heuristics from other algorithms to enhance practical performance.

4.1. Proximal-point methods. We can choose $x^k_c := x^\star_{\gamma_{k-1}}(\hat y^{k-1})$. This makes Algorithm 1 similar to the proximal-based decomposition algorithm in [29], which employs the proximal term $d_b(\cdot, \hat x^{k-1})$ with the Bregman distance $d_b$. The convergence analysis can be found in [28].

4.2. Primal-dual hybrid gradient (PDHG). When f is 2-decomposable, i.e., $f(x) := f_1(x_1) + f_2(x_2)$, we can choose $x^k_c$ by applying one gradient step to the augmented Lagrangian term as:

    $x^k_c := [g^k_1; g^k_2]$, with $g^k_1 := x^k_1 - \|A_1\|^{-2} A_1^T (A_1 x^k_1 + A_2 x^k_2 - b)$,
    $g^k_2 := x^k_2 - \|A_2\|^{-2} A_2^T (A_1 x^{k+1}_1 + A_2 x^k_2 - b)$.
(PADMM)

In this case, (1P2D) leads to a new variant of PADMM [8] or of PDHG [9].

Corollary 2 ([28]). Let $\{(\bar x^k, \bar y^k)\}_{k\ge 0}$ be a sequence generated by (1P2D) in Algorithm 1 using $x^k_c$ as in (PADMM). If $\gamma_0 := \frac{2\sqrt{2}\,\|A\|^2}{K+1}$ and $c_k := 0$ for all $k = 0, 1, \dots, K$, then we have

    $\|A\bar x^K - b\| \le \dfrac{2\|A\|(D^\star_Y + D_X)}{K+1}$, and $-D^\star_Y\|A\bar x^K - b\| \le f(\bar x^K) - f^\star \le \dfrac{2\sqrt{2}\,\|A\|}{K+1}\, D_X^2$,    (21)

where $D_X := 4\max\{\|x - \hat x\| : x, \hat x \in X\}$.

4.4. ADMM. When f is 2-decomposable as $f(x) := f_1(x_1) + f_2(x_2)$, we can choose $d_b$, S and $x^k_c$ such that

    $d_b(Sx, Sx_c) := \tfrac{1}{2}\big(\|A_1 x_1 + A_2 x^k_2 - b\|^2 + \|A_1 x^{k+1}_1 + A_2 x_2 - b\|^2\big)$.

Then Algorithm 1 reduces to a new variant of ADMM. Its convergence guarantee is essentially the same as in Corollary 2. More details of the algorithm and its convergence can be found in [28].

4.5. Enhancements of our schemes. For the PADMM and ADMM methods, a great deal of adaptation techniques have been proposed to enhance their convergence. We can view some of these techniques in the light of the model-based excessive gap condition. For instance, Algorithm 1 decreases the smoothed gap function $G_{\gamma_k\beta_k}$ as stipulated in Definition 1. The actual decrease is then given by $f(\bar x^k) - f^\star \le \gamma_k(D^S_X - \Psi_k/\gamma_k)$. In practice, $D_k := D^S_X - \Psi_k/\gamma_k$ can be dramatically smaller than $D^S_X$ in the early iterations. This implies that increasing $\gamma_k$ can improve practical performance. Such a strategy indeed forms the basis of many adaptation techniques in PADMM and ADMM. Specifically, if $\gamma_k$ increases, then $\tau_k$ also increases and $\beta_k$ decreases. Since $\beta_k$ controls the primal feasibility gap $F_k := \|A\bar x^k - b\|$ by Lemma 1, we should only increase $\gamma_k$ if the feasibility gap $F_k$ is relatively high. Indeed, in the case $x^k_c := [g^k_1; g^k_2]$, we can compute the dual feasibility gap as $H_k := \gamma_k \|A_1^T A_2((\hat x_2)^{k+1} - (\hat x_2)^k)\|$. Then, if $F_k \ge s H_k$ for some $s > 0$, we increase $\gamma_{k+1} := c\gamma_k$ for some $c > 1$. We use $c_k \equiv c := 1.05$ in practice.
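The increase heuristic just described can be sketched as a small update rule; the function name `update_gamma` is ours, while the threshold test $F_k \ge sH_k$ and the factor $c = 1.05$ follow the text:

```python
def update_gamma(gamma_k, F_k, H_k, s=1.0, c=1.05):
    """Section 4.5 heuristic (sketch): increase gamma only while the
    primal feasibility gap F_k = ||A x_k - b|| dominates the dual
    feasibility gap H_k, i.e. F_k >= s*H_k; otherwise leave gamma alone."""
    return c * gamma_k if F_k >= s * H_k else gamma_k
```

For example, a large primal residual relative to the dual one triggers `update_gamma(1.0, 2.0, 1.0)` → `1.05`, while a small one leaves gamma unchanged.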
We can also decrease the parameter $\gamma_k$ in (1P2D) via $\gamma_{k+1} := (1 - c_k\tau_k)\gamma_k$ with $c_k := d_b(Sx^\star_{\gamma_k}(\hat y^k), Sx_c)/D^S_X \in [0, 1]$, after or during the update of $(\bar x^{k+1}, \bar y^{k+1})$ as in (2P1D), provided that we know the estimate $D^S_X$.

5 Numerical illustrations

5.1. Theoretical vs. practical bounds. We demonstrate the empirical performance of Algorithm 1 relative to its theoretical bounds via a basic non-overlapping sparse-group basis pursuit problem:

    $\min_{x \in \mathbb{R}^n} \Big\{ \sum_{i=1}^{n_g} w_i \|x_{g_i}\|_2 : Ax = b,\ \|x\|_\infty \le \rho \Big\}$,    (22)

where $\rho > 0$ bounds the signal magnitude, and the $g_i$ and $w_i$ are the group indices and weights, respectively.

[Figure 1: Actual performance vs. theoretical bounds, plotting $|f(\bar x^k) - f^\star|$ and $\|A\bar x^k - b\|$ in log-scale against the iteration count for the basic and tuned (1P2D) and (2P1D) variants: [top row] the decomposable Bregman distance smoother (S = I) and [bottom row] the augmented Lagrangian smoother (S = A).]

In this test, we fix $x_c = 0^n$ and $d_b(x, x_c) := (1/2)\|x\|^2$. Since ρ is given, we can evaluate $D_X$ numerically. By solving (22) with the SDPT3 interior-point solver [30] up to an accuracy of $10^{-8}$, we can estimate $D^\star_Y$ and $f^\star$. In the (2P1D) scheme, we set $\gamma_0 = \beta_0 = \sqrt{L^g}$, while in the (1P2D) scheme we set $\gamma_0 := 2\sqrt{2}\|A\|(K+1)^{-1}$ with $K := 10^4$, and generate the theoretical bounds defined in Theorem 1.
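To make the iteration concrete alongside the bounds of Theorem 1, the following minimal sketch runs the (1P2D) scheme with the augmented Lagrangian smoother ($S = A$, $\gamma_0 = L^g = 1$, $c_k = 0$) on a tiny equality-constrained quadratic, not on (22); the quadratic $f(x) = \frac{1}{2}\|x - c\|^2$ and $X = \mathbb{R}^n$ are assumptions of this sketch chosen so that the subproblem (12) is a closed-form linear solve:

```python
import numpy as np

# Toy instance of (1): min 0.5*||x - c||^2  s.t.  A x = b, with X = R^2.
A = np.array([[1.0, 1.0]]); b = np.array([1.0]); c = np.array([1.0, 1.0])
f = lambda x: 0.5 * np.sum((x - c) ** 2)

def x_star_gamma(y, gamma):
    """Subproblem (12) with S = A, b(.) = 0.5*||.||^2, A x_c = b:
    argmin_x f(x) + y^T(Ax - b) + (gamma/2)*||Ax - b||^2, solved in
    closed form since f is quadratic and X = R^n (sketch assumptions)."""
    H = np.eye(2) + gamma * (A.T @ A)
    return np.linalg.solve(H, c - A.T @ y + gamma * (A.T @ b))

# Initialization via (16); gamma0 = Lg = 1, beta0 = Lg / gamma0 = 1, c_k = 0.
gamma, beta = 1.0, 1.0
a_k = 0.5 * (1.0 + np.sqrt(5.0))            # a0 with c0 = 0
tau = 1.0 / a_k
x_bar = x_star_gamma(np.zeros(1), gamma)
y_bar = (A @ x_bar - b) / beta

for k in range(200):                        # (1P2D) updates from Algorithm 1
    y_s = (A @ x_bar - b) / beta            # y*_k = beta_k^{-1}(A x_bar - b)
    y_hat = (1 - tau) * y_bar + tau * y_s
    x_s = x_star_gamma(y_hat, gamma)
    x_bar = (1 - tau) * x_bar + tau * x_s
    y_bar = y_hat + gamma * (A @ x_s - b)
    beta *= (1 - tau)                       # beta_{k+1} = (1 - tau_k) beta_k
    a_k = 0.5 * (1.0 + np.sqrt(4 * a_k ** 2 + 1.0))  # c_{k+1} = 0
    tau = 1.0 / a_k

# The optimum here is x* = [0.5, 0.5] with f* = 0.25; per (18), the
# feasibility gap decays as O(1/k^2) and f(x_bar) approaches f* from below.
```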
We test the performance of the four variants on synthetic data: $n = 1024$, $m = n/3 = 341$, $n_g = n/8 = 128$, and $x^\star$ is an $n_g/8$-sparse vector. The matrix A is generated randomly with i.i.d. standard Gaussian entries, and $b := Ax^\star$. The group indices $g_i$ ($i = 1, \dots, n_g$) are also generated randomly. The empirical performance of the (2P1D) and (1P2D) variants of Algorithm 1 is shown in Figure 1. The basic algorithms refer to the case where $x^k_c := x_c = 0^n$ and the parameters are not tuned. Hence, each iteration of the basic (1P2D) uses only one proximal calculation and applies A and $A^T$ once each, while each iteration of the basic (2P1D) uses two proximal calculations and applies A twice and $A^T$ once. In contrast, the tuned (2P1D) and (1P2D) variants require one more application of $A^T$ per iteration for the adaptive parameter updates. As can be seen from Figure 1 (top row), the empirical performance of the basic variants roughly follows the O(1/k) convergence rate in terms of $|f(\bar x^k) - f^\star|$ and $\|A\bar x^k - b\|_2$. The deviations from the bound are due to the increasing sparsity of the iterates, which improves empirical convergence. With a kick-factor of $c_k = 0.02/\tau_k$ and an adaptive $x^k_c$, both tuned variants (2P1D) and (1P2D) significantly outperform the theoretical predictions. Indeed, they approach $x^\star$ up to $10^{-13}$ accuracy, i.e., $\|\bar x^k - x^\star\| \le 10^{-13}$, after a few hundred iterations. Similarly, Figure 1 (bottom row) illustrates the actual performance vs. the theoretical $O(1/k^2)$ bounds obtained with the augmented Lagrangian smoother. Here, we solve the subproblems (13) and (17) with FISTA [31] up to $10^{-8}$ accuracy, as suggested in [28]. In this case, the theoretical bounds and the actual performance of the basic variants are very close to each other, both in terms of $|f(\bar x^k) - f^\star|$ and $\|A\bar x^k - b\|_2$. When the parameter $\gamma_k$ is updated, the algorithms exhibit better performance.

5.2. Binary linear support vector machine.
This example is concerned with the following binary linear support vector machine problem:

    $\min_{x \in \mathbb{R}^n} \Big\{ F(x) := \sum_{j=1}^m \ell_j(y_j, w_j^T x - b_j) + g(x) \Big\}$,    (23)

where $\ell_j(s, \tau)$ is the hinge loss, given by $\ell_j(s, \tau) := \max\{0, 1 - s\tau\} = [1 - s\tau]_+$; $w_j^T$ is the j-th row of a given matrix $W \in \mathbb{R}^{m\times n}$; $b \in \mathbb{R}^m$ is the bias vector; $y \in \{-1, +1\}^m$ is a classifier vector; and g is a given regularization function, e.g., $g(x) := (\lambda/2)\|x\|^2$ for the $\ell_2$-regularizer or $g(x) := \lambda\|x\|_1$ for the $\ell_1$-regularizer, where $\lambda > 0$ is a regularization parameter. By introducing a slack variable $r = Wx - b$, we can write (23) in the form of (1) as

    $\min_{x \in \mathbb{R}^n,\, r \in \mathbb{R}^m} \Big\{ \sum_{j=1}^m \ell_j(y_j, r_j) + g(x) : Wx - r = b \Big\}$.    (24)

Now, we apply the (1P2D) variant to solve (24). We test this algorithm on (24) and compare it with LibSVM [32] using two problems from the LibSVM data set, available at http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/. The first problem is a1a, which has p = 119 features and N = 1605 data points, while the second problem is news20, which has p = 1,355,191 features and N = 19,996 data points. We compare Algorithm 1 and the LibSVM solver in terms of the final value $F(x^k)$ of the original objective function F, the computational time, and the classification accuracy $ca_\lambda := N^{-1}\sum_{j=1}^N \mathbb{1}\{\mathrm{sign}(w_j^T x^k - b_j) = y_j\}$ on both the training and the test data sets. We randomly select 30% of the data in a1a and news20 to form a test set, and the remaining 70% of the data is used for training. We perform 10 runs and compute the average results. These average results are plotted in Fig. 2 for the two problems, respectively. The upper and lower bounds show the maximum and minimum values over these 10 runs.
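The slack-variable reformulation (24) can be sketched mechanically; the function names `hinge` and `svm_as_linear_constraint` are ours, while the construction (stack $z = [x; r]$ and impose $[W, -I]\,z = b$) is exactly the transformation described above:

```python
import numpy as np

def hinge(y, r):
    """Hinge loss l_j(y_j, r_j) = max(0, 1 - y_j * r_j), vectorized over j."""
    return np.maximum(0.0, 1.0 - y * r)

def svm_as_linear_constraint(W, b):
    """Rewrite (23) in the form of (1) via the slack r = Wx - b:
    with stacked variables z = [x; r], the equality constraint is
    [W, -I] z = b, and the objective sum_j hinge(y_j, r_j) + g(x)
    is 2-decomposable in the blocks (x, r)."""
    m, n = W.shape
    A = np.hstack([W, -np.eye(m)])   # A z = W x - r
    return A, b                       # constraint: A z = b
```

Any x together with its induced slack $r = Wx - b$ then satisfies the stacked constraint exactly, which is what lets Algorithm 1 treat (24) through the generic template (1).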
[Figure 2: The average performance results of the two algorithms on the a1a (first row) and news20 (second row) problems: the objective values $F(x^k)$, the classification accuracy on the training and test sets, and the CPU time in seconds, each plotted against the parameter horizon $\lambda^{-1}$ for (1P2D) and LibSVM.]

As can be seen from these results, both solvers attain roughly the same objective values and classification accuracy on these two problems, while the computational time of (1P2D) is much lower than that of LibSVM. We note that LibSVM becomes slower as the parameter λ becomes smaller, due to its active-set strategy. The (1P2D) algorithm is almost insensitive to the regularization parameter λ, in contrast to active-set methods. In addition, the performance of (1P2D) could be further improved by exploiting its parallelization ability, which has not yet been fully exploited in our implementation.

6 Conclusions

We propose a model-based excessive gap (MEG) technique for constructing and analyzing first-order primal-dual methods that numerically approximate an optimal solution of the constrained convex optimization problem (1).
Thanks to a combination of smoothing strategies and MEG, we propose, to the best of our knowledge, the first primal-dual algorithmic schemes for (1) that theoretically obtain optimal convergence rates directly, without averaging the iterates, and that seamlessly handle the p-decomposability structure. In addition, our analysis techniques can simply be adapted to handle the inexact oracle produced by approximately solving the primal subproblems (c.f. [28]), which is important for the augmented Lagrangian versions with low iteration counts. We expect a deeper understanding of MEG and of different smoothing strategies to help us tailor adaptive update strategies for our schemes (as well as for several other connected and well-known schemes) in order to further improve their empirical performance.

Acknowledgments. This work is supported in part by the European Commission under the grants MIRG-268398 and ERC Future Proof, and by the Swiss Science Foundation under the grants SNF 200021-132548, SNF 200021-146750 and SNF CRSII2-147633.

References

[1] Y. Nesterov, "Excessive gap technique in nonsmooth convex minimization," SIAM J. Optim., vol. 16, no. 1, pp. 235-249, 2005.
[2] D. Bertsekas and J. N. Tsitsiklis, Parallel and Distributed Computation: Numerical Methods. Prentice Hall, 1989.
[3] V. Chandrasekaran, B. Recht, P. Parrilo, and A. Willsky, "The convex geometry of linear inverse problems," Laboratory for Information and Decision Systems, Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology, Tech. Report, 2012.
[4] M. B. McCoy, V. Cevher, Q. Tran-Dinh, A. Asaei, and L. Baldassarre, "Convexity in source separation: Models, geometry, and algorithms," IEEE Signal Processing Magazine, vol. 31, no. 3, pp. 87-95, 2014.
[5] M. J. Wainwright, "Structured regularizers for high-dimensional problems: Statistical and computational issues," Annual Review of Statistics and its Applications, vol. 1, pp. 233-253, 2014.
[6] N. Parikh and S. Boyd, "Proximal algorithms," Foundations and Trends in Optimization, vol. 1, no. 3, pp. 123-231, 2013.
[7] P. L. Combettes and V. R. Wajs, "Signal recovery by proximal forward-backward splitting," Multiscale Model. Simul., vol. 4, pp. 1168-1200, 2005.
[8] A. Chambolle and T. Pock, "A first-order primal-dual algorithm for convex problems with applications to imaging," Journal of Mathematical Imaging and Vision, vol. 40, no. 1, pp. 120-145, 2011.
[9] T. Goldstein, E. Esser, and R. Baraniuk, "Adaptive primal-dual hybrid gradient methods for saddle point problems," Tech. Report, http://arxiv.org/pdf/1305.0546v1.pdf, pp. 1-26, 2013.
[10] B. He and X. Yuan, "On non-ergodic convergence rate of Douglas-Rachford alternating direction method of multipliers," 2012 (submitted for publication).
[11] B. He and X. Yuan, "On the O(1/n) convergence rate of the Douglas-Rachford alternating direction method," SIAM J. Numer. Anal., vol. 50, pp. 700-709, 2012.
[12] Y. Ouyang, Y. Chen, G. Lan, and E. J. Pasiliao, "An accelerated linearized alternating direction method of multipliers," Tech. Report, 2014.
[13] R. Shefi and M. Teboulle, "Rate of convergence analysis of decomposition methods based on the proximal method of multipliers for convex minimization," SIAM J. Optim., vol. 24, no. 1, pp. 269-297, 2014.
[14] H. Wang and A. Banerjee, "Bregman alternating direction method of multipliers," Tech. Report, pp. 1-18, 2013. Online at: http://arxiv.org/pdf/1306.3203v1.pdf.
[15] H. Ouyang, N. He, L. Q. Tran, and A. Gray, "Stochastic alternating direction method of multipliers," JMLR W&CP, vol. 28, pp. 80-88, 2013.
[16] T. Goldstein, B. O'Donoghue, and S. Setzer, "Fast alternating direction optimization methods," SIAM J. Imaging Sci., vol. 7, no. 3, pp. 1588-1623, 2014.
[17] S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein, "Distributed optimization and statistical learning via the alternating direction method of multipliers," Foundations and Trends in Machine Learning, vol. 3, no. 1, pp. 1-122, 2011.
[18] A. Beck and M. Teboulle, "A fast dual proximal gradient algorithm for convex minimization and applications," Oper. Res. Letters, vol. 42, no. 1, pp. 1-6, 2014.
[19] W. Deng and W. Yin, "On the global and linear convergence of the generalized alternating direction method of multipliers," Rice University CAAM, Tech. Rep. TR12-14, 2012.
[20] D. P. Bertsekas, Constrained Optimization and Lagrange Multiplier Methods. Athena Scientific, 1996.
[21] R. T. Rockafellar, "Augmented Lagrangians and applications of the proximal point algorithm in convex programming," Mathematics of Operations Research, vol. 1, pp. 97-116, 1976.
[22] Y. Nesterov, "Smooth minimization of non-smooth functions," Math. Program., vol. 103, no. 1, pp. 127-152, 2005.
[23] G. Lan and R. Monteiro, "Iteration-complexity of first-order augmented Lagrangian methods for convex programming," Tech. Report, 2013.
[24] V. Nedelcu, I. Necoara, and Q. Tran-Dinh, "Computational complexity of inexact gradient augmented Lagrangian methods: Application to constrained MPC," SIAM J. Control Optim., vol. 52, no. 5, pp. 3109-3134, 2014.
[25] Y. Nesterov, Introductory Lectures on Convex Optimization: A Basic Course. Kluwer Academic Publishers, 2004, vol. 87.
[26] F. Facchinei and J.-S. Pang, Finite-Dimensional Variational Inequalities and Complementarity Problems. New York: Springer-Verlag, 2003, vol. 1-2.
[27] A. Auslender, Optimisation: Méthodes Numériques. Paris: Masson, 1976.
[28] Q. Tran-Dinh and V. Cevher, "A primal-dual algorithmic framework for constrained convex minimization," Tech. Report, LIONS, pp. 1-54, 2014.
[29] G. Chen and M. Teboulle, "A proximal-based decomposition method for convex minimization problems," Math. Program., vol. 64, pp. 81-101, 1994.
[30] K.-C. Toh, M. Todd, and R. Tütüncü, "On the implementation and usage of SDPT3 - a Matlab software package for semidefinite-quadratic-linear programming, version 4.0," NUS Singapore, Tech. Report, 2010.
[31] A. Beck and M. Teboulle, "A fast iterative shrinkage-thresholding algorithm for linear inverse problems," SIAM J. Imaging Sciences, vol. 2, no. 1, pp. 183-202, 2009.
[32] C.-C. Chang and C.-J. Lin, "LIBSVM: a library for support vector machines," ACM Transactions on Intelligent Systems and Technology, vol. 2, no. 27, pp. 1-27, 2011.