# generalized_conditional_gradient_for_sparse_estimation__596709ea.pdf

Journal of Machine Learning Research 18 (2017) 1-46 Submitted 8/14; Revised 10/17; Published 11/17

Generalized Conditional Gradient for Sparse Estimation

Yaoliang Yu yaoliang.yu@uwaterloo.ca School of Computer Science University of Waterloo Waterloo, ON, N2L 3G1, Canada

Xinhua Zhang zhangx@uic.edu Department of Computer Science University of Illinois at Chicago Chicago, IL 60607

Dale Schuurmans daes@ualberta.ca Department of Computing Science University of Alberta Edmonton, Alberta T6G 2E8, Canada

Editor: Koby Crammer

Sparsity is an important modeling tool that expands the applicability of convex formulations for data analysis, however it also creates signiﬁcant challenges for eﬃcient algorithm design. In this paper we investigate the generalized conditional gradient (GCG) algorithm for solving sparse optimization problems demonstrating that, with some enhancements, it can provide a more eﬃcient alternative to current state of the art approaches. After studying the convergence properties of GCG for general convex composite problems, we develop eﬃcient methods for evaluating polar operators, a subroutine that is required in each GCG iteration. In particular, we show how the polar operator can be eﬃciently evaluated in learning low-rank matrices, instantiated with detailed examples on matrix completion and dictionary learning. A further improvement is achieved by interleaving GCG with ﬁxedrank local subspace optimization. A series of experiments on matrix completion, multi-class classiﬁcation, and multi-view dictionary learning shows that the proposed method can signiﬁcantly reduce the training cost of current alternatives. Keywords: generalized conditional gradient, frank-wolfe, dictionary learning, matrix completion, multi-view learning, sparse estimation

1. Introduction

Sparsity is an important concept in high-dimensional statistics (B uhlmann and van de Geer, 2011) and signal processing (Eldar and Kutyniok, 2012), which has led to important application successes by reducing model complexity and improving interpretability of the results. Although it is common to promote sparsity by adding appropriate regularizers, such as the l1 norm, sparsity can also present itself in more sophisticated forms, such as group sparsity (Yuan and Lin, 2006), graph sparsity (Kim and Xing, 2009; Jacob et al., 2009), matrix rank sparsity (Cand es and Recht, 2009), and other combinatorial sparsity patterns (Obozinski and Bach, 2012). These recent notions of structured sparsity (Bach et al., 2012;

c 2017 Yaoliang Yu, Xinhua Zhang, and Dale Schuurmans.

License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at http://jmlr.org/papers/v18/14-348.html.

Yu, Zhang, and Schuurmans

Micchelli et al., 2013) have greatly enhanced the ability to model structural relationships in data; however, they also create signiﬁcant computational challenges. A currently popular optimization scheme for training sparse models is accelerated proximal gradient (APG) (Beck and Teboulle, 2009; Nesterov, 2013), which enjoys an optimal rate of convergence among black-box ﬁrst-order procedures (Nesterov, 2013). An advantage of APG is that each iteration consists solely of computing a proximal update (PU), which for simple regularizers can have a very low complexity. For example, under l1 norm regularization each iterate of APG reduces to a soft-shrinkage operator that can be computed in linear time, which partly explains its popularity for such problems. Unfortunately, for matrix low-rank regularizers the PU soon becomes a computational bottleneck. For example, the trace norm is often used to promote low rank solutions in matrix variable problems such as matrix completion (Cand es and Recht, 2009); but here the associated PU requires a full singular value decomposition (SVD) of the gradient matrix on each iteration, which prevents APG from being applied to large problems. Not only does the PU require nontrivial computation for matrix regularizers, it also yields dense intermediate iterates. To address the primary shortcomings with APG, we investigate the generalized conditional gradient (GCG) strategy for sparse optimization, which has been motivated by the promise of recent sparse approximation methods (Hazan, 2008; Clarkson, 2010). A special case of GCG, known as the conditional gradient (CG), was originally proposed by Frank and Wolfe (1956) and has received signiﬁcant renewed interest (Bach, 2015; Clarkson, 2010; Freund and Grigas, 2016; Hazan, 2008; Jaggi and Sulovsky, 2010; Jaggi, 2013; Shalev-Shwartz et al., 2010; Tewari et al., 2011; Yuan and Yan, 2013). The key advantage of GCG for sparse estimation is that it need only compute the polar of the regularizer in each iteration, which sometimes admits a much more eﬃcient update than the PU in APG. For example, under trace norm regularization, GCG only requires the spectral norm of the gradient to be computed in each iteration, which is an order of magnitude cheaper than evaluating the full SVD as required by APG. Furthermore, the greedy nature of GCG aﬀords explicit control over the sparsity of intermediate iterates, which is not available in APG. Although existing work on GCG has generally been restricted to constrained optimization (i.e. enforcing an upper bound on the sparsity-inducing regularizer), in this paper we consider the more general regularized version of the problem. Despite their theoretical equivalence, the regularized form allows more eﬃcient local optimization to be interleaved with the primary update, which provides a signiﬁcant acceleration in practice. This paper is divided into two major parts: ﬁrst we present a general treatment of GCG and establish its convergence properties for convex optimization in Section 3, with a special focus on convex gauge regularizers; then we apply the method to important case studies to demonstrate its eﬀectiveness on large-scale, sparse estimation problems in Section 4, with experiments presented in Section 5. In particular, in the ﬁrst part, after establishing notation and brieﬂy discussing relevant optimization schemes (Section 2), we present our ﬁrst main contribution (Section 3): the convergence properties of GCG for general convex composite optimization. The algorithm is then specialized to convex gauge regularizers with reﬁned analysis (Section 3.2). Indeed we prove that, under standard assumptions, GCG converges to the global optimum at the rate of O(1/t) which is a signiﬁcant improvement over previous bounds (Dudik et al., 2012). Although the convergence rate for GCG remains inferior to that of APG, these two

Generalized Conditional Gradient for Sparse Estimation

algorithms demonstrate alternative trade-oﬀs between the number of iterations required and the per-step complexity. As shown in our experiments, GCG can be signiﬁcantly faster than APG in terms of overall computation time for large problems. The results in Section 3 are new. Equipped with the technical results of GCG on convex optimization, the second part of the paper then applies the algorithm to the important application of low rank learning (Section 4). Since imposing a hard bound on matrix rank generally leads to an intractable problem, in Section 4.2 we ﬁrst present a generic relaxation strategy based on convex gauges (Chandrasekaran et al., 2012; Tewari et al., 2011) that yields a convex program (Bach et al., 2008; Bradley and Bagnell, 2009; Zhang et al., 2012). Conveniently, the resulting problem can be easily optimized using the GCG algorithm from Section 3.2. To further reduce computation time, we introduce in Section 4.3 an auxiliary ﬁxed rank subspace optimization within each iteration of GCG. Although similar hybrid approaches have been previously suggested (Burer and Monteiro, 2005; Laue, 2012; Mishra et al., 2013), we propose an eﬃcient new alternative that, instead of locally optimizing a constrained ﬁxed rank problem, optimizes an unconstrained surrogate objective based on a variational representation of matrix norms. This alternative strategy allows a far more eﬃcient local optimization without compromising GCG s convergence properties. In Section 4.4 we show that the approach can be applied to dictionary learning and matrix completion problems, as well as a non-trivial multi-view form of dictionary learning (White et al., 2012). Finally, in Section 5 we provide an extensive experimental evaluation that compares the performance of GCG (augmented with interleaved local search) to state-of-the-art optimization strategies, across a range of problems, including matrix completion, multi-class classiﬁcation, and multi-view dictionary learning.

1.1 Extensions over Previously Published Work

Some of the results in this article have appeared in a preliminary form in two conference papers (Zhang et al., 2012; White et al., 2012); however, several speciﬁc extensions have been added in this paper, including:

A more extensive and uniﬁed treatment of GCG in general Banach spaces.

A more reﬁned analysis of the convergence properties of GCG.

A new application of GCG to multi-view dictionary learning.

New experimental evaluations on larger data sets, including new results for matrix completion and image multi-class classiﬁcation.

Open source Matlab code released at https://www.cs.uic.edu/~zhangx/GCG.

2. Preliminaries

We ﬁrst establish some of the deﬁnitions and notation that will be used throughout the paper, before providing a brief background on the relevant optimization methods that establish a context for discussing GCG.

Yu, Zhang, and Schuurmans

2.1 Notation and Deﬁnitions

Throughout this paper we assume that the underlying space, unless otherwise stated, is a real Banach space B with norm ; for example, B could be the Euclidean space Rd with any norm. The dual space (the set of all real-valued continuous linear functions on B) is denoted B and is equipped with the dual norm g = sup{g(w) : w 1}. Note that for a diﬀerentiable function f : B R := R { }, its (Frech et) derivative f is in the dual space B . We use bold lowercase letters to denote vectors, where the i-th component of a vector w is denoted wi, while wt denotes some other vector. We use the shorthand w, g := g(w) for any g B and w B. We call a function f : B R σ-strongly convex (w.r.t. the norm ) if there exists some σ 0 such that for all 0 λ 1 and w, z B,

f(λw + (1 λ)z) + 1

2σλ(1 λ) w z 2 λf(w) + (1 λ)f(z). (1)

In the case where σ = 0 we simply say f is convex. A set C B is called convex if for all w, z C = λw + (1 λ)z C for all 0 λ 1. It follows from the deﬁnition that the domain of a convex function f, denoted as dom f := {w : f(w) < }, is a convex set. The function f is called proper if dom f = , and closed if all sublevel sets {w : f(w) α} are closed for all α R. The subdiﬀerential of a convex function f : B R at w is deﬁned as the set f(w) = {g B : f(z) f(w) + z w, g , z B}. Key to our subsequent development is two particular functions associated with a nonempty set in B: the indicator function

( 0, if w C , otherwise , (2)

and the gauge function

κ(w) = κA(w) := inf{ρ 0 : w ρA}, (3)

where ρA := {ρw : w A}, and the inﬁmum is taken to be inﬁnity over an empty set (i.e. when the condition w ρA is not satisﬁed for any ρ 0). Intuitively, κ(w) is the smallest rescaling of the set A to barely contain w; see Figure 1. Note that we always have κ(0) = 0, and the gauge function is always positive homogeneous (i.e., κ(tw) = tκ(w) for all t 0). Let B = Bκ := {w : κ(w) 1} denote the unit ball of κ, and then κB = κA. Bκ is obviously intrinsic to κ, while diﬀerent choices of A may yield the same gauge κA. Clearly, the indicator function ιC is convex iﬀC is convex, while the gauge function κ is convex iﬀits unit ball is convex. Moreover, the gauge κ is a semi-norm iﬀits unit ball B is convex and symmetric (i.e. B = B). Speciﬁcally, all semi-norms are convex gauge functions. In practice, it is natural to engineer κA from a general atomic set A, which does not have to be convex (and can even be discrete) or contain the origin. Examples include the set of permutation matrices (Chandrasekaran et al., 2012), and the set of xx with x 2 = 1 and xi 0 for low-rank nonnegative matrix factorization (x is the transpose of x). In this paper, we will work only with convex closed gauge functions, for which a suﬃcient but not necessary condition is that A is convex, closed, and containing the origin (an assumption

Generalized Conditional Gradient for Sparse Estimation

Figure 1: Deﬁning a convex gauge via the Minkowski functional of a convex set C.

made by many textbooks). For practical ﬂexibility, we intentionally avoid making the latter assumptions in the ﬁrst place, and the resulting technical complication is minimal. A related function that underpins our algorithm is the polar of a gauge function κA, deﬁned for all g B as

κ (g) = κ A(g) := inf{µ 0 : w, g µκ(w), w B} (4)

= sup w Bκ w, g = sup w A {0} w, g . (5)

As the notation suggests, the polar of a norm (a bona ﬁde convex gauge) is its dual norm. Lastly, a diﬀerential function ℓ: B R is said to be L-smooth (w.r.t. the norm ) if for all w, z B,

ℓ(w) ℓL(w; z) := ℓ(z) + w z, ℓ(z) + L

2 w z 2 , (6)

i.e., ℓis upper bounded by the quadratic approximation around z. It is well-known that if the gradient ℓis L-Lipschitz continuous, i.e., ℓ(w) ℓ(z) L w z for all w, z, then ℓis L-smooth. The converse is also true if ℓis additionally convex. We will make repeated use of this quadratic upper bound in our analysis below.

2.2 Composite Minimization Problem

Many machine learning problems, particularly those formulated in terms of regularized risk minimization, can be formulated as:

inf w B F(w), such that F(w) := ℓ(w) + f(w), (7)

where f is convex and ℓis convex and continuously diﬀerentiable. Typically, ℓis a loss function and f is a regularizer although their roles can be reversed in some common examples, such as support vector machines and Lasso. Due to the importance of this problem in machine learning, signiﬁcant eﬀort has been devoted to designing eﬃcient algorithms for solving (7). Standard methods are generally based on computing an update sequence {wt} that converges to a global minimizer of F when F is convex, or converges to a stationary point otherwise. For example, a popular

Yu, Zhang, and Schuurmans

example is the mirror descent (MD) algorithm (Beck and Teboulle, 2003), which takes an arbitrary subgradient gt F(wt) and successively updates by

wt+1 = arg min w

2ηt D(w, wt) (8)

with a suitable step size ηt 0; here D(w, z) := d(w) d(z) w z, d(z) denotes a Bregman divergence induced by some diﬀerentiable 1-strongly convex function d : B R. Under mild assumptions, MD converges (in terms of the function value) at a rate of O(1/

t) (Beck and Teboulle, 2003). However, this algorithm ignores the composite structure in (7) and can be impractically slow for many applications. Another widely used algorithm is the proximal gradient (PG) (e.g. Fukushima and Mine, 1981), also known as the forward-backward splitting procedure, which iteratively performs the proximal update (PU)

wt+1 = arg min w

w, ℓ(wt) + f(w) + 1

2ηt D(w, wt) . (9)

PU diﬀers from MD updates in that it does not linearize the regularizer f. The downside is that (9) is usually harder to solve than (8) for each iteration, with an upside that a faster O(1/t) rate of convergence can be achieved (Tseng, 2010). This rate can be further improved to O(1/t2) with the extrapolation variant known as accelerated proximal gradient (APG) (Beck and Teboulle, 2009; Nesterov, 2013). In particular, suppose B = Rd, D(w, z) = 1 2 w z 2 2, and f(w) = w 1, where the lp norm is deﬁned as w p := (Pd i=1 |wi|p)1/p

for p 1. Then PU recovers the soft-shrinkage operator, which plays an important role in sparse estimation:

zt = wt ηt ℓ(wt), and wt+1 = (1 ηt/ |zt|)+ zt. (10)

Here the algebraic operations in the latter formula are performed componentwise. Similarly, when B = Rm n, D(W, Z) = 1

2 W Z 2 F (Frobenius norm) and f(W) = W tr (trace norm, i.e. sum of singular values), one recovers the matrix analogue of soft-shrinkage used for matrix completion (Cand es and Recht, 2009): here the same gradient update of zt in (10) is applied, but the shrinkage operator on wt+1 in (10) is applied to the singular values of Zt while keeping the singular vectors intact. Unfortunately, such a PU can become prohibitively expensive for large matrices, since it requires a full singular value decomposition (SVD) at each iteration, at a cubic time cost. We will see in Section 3.2 that an alternative algorithm only requires computing the dual of the trace norm (namely the spectral norm sp, i.e. the greatest singular value), which requires only quadratic time per iteration.

3. Generalized Conditional Gradient for Convex Optimization

In this section we present the generalized conditional gradient (GCG) algorithm, which has recently been receiving renewed attention (Bredies and Lorenz, 2008; Hazan, 2008; Jaggi, 2013; Shalev-Shwartz et al., 2010; Yuan and Yan, 2013; Zhang et al., 2012). Our goal in this section will be to develop eﬃcient solution methods for (7), where f is convex and ℓ

Generalized Conditional Gradient for Sparse Estimation

is convex and L-smooth (cf. (6)). In particular, we will reﬁne our algorithms to convex gauge regularized problems where special computational savings can be achieved through its polar. These are clearly among the most important settings in practice, and after some enhancements the algorithm and convergence guarantees we establish in this section will be applied to dictionary learning in Section 4.

3.1 General Setting with Convex and Smooth ℓ

We ﬁrst develop the framework of the GCG algorithm and its analysis by focusing on general convex functions f. Formally, we make the following assumption throughout this Section 3.

Assumption 1 ℓis convex and L-smooth, and f is convex.

Protocol: To avoid unnecessary pathologies, we tacitly assume all functions considered in this work are proper and closed. To derive the optimization algorithm, ﬁrst observe that given the above assumption, F = ℓ+ f is convex and so any w B is globally optimal for (7) if, and only if,

0 F(w) = ℓ(w) + f(w). (11)

Let f (g) := supw w, g f(w) denote the Fenchel conjugate of f, and using the fact that g f(w) iﬀw f (g) when f is closed (Borwein and Vanderwerﬀ, 2010), one can observe that the necessary and suﬃcient condition for w to be globally optimal can also be expressed as

(11) w f ( ℓ(w)) w (1 η)w + η f ( ℓ(w)), (12)

where η (0, 1). Then, by the deﬁnition of f , we have

d f ( ℓ(w)) d arg min d { d, ℓ(w) + f(d)} . (13)

Thus, the condition for a global minimum can be characterized by the ﬁxed-point inclusion

w (1 η)w + ηd for some d arg min d { d, ℓ(w) + f(d)} . (14)

This particular ﬁxed-point condition immediately suggests an update that provides the foundation for the generalized conditional gradient algorithm (GCG) outlined in Algorithm 1. Each iteration of this procedure involves linearizing the smooth loss ℓ, solving the subproblem (13), selecting a step size, taking a convex combination, then conducting some form of local improvement. The GCG update based on (14) naturally generalizes the original conditional gradient update, which was ﬁrst studied by (Frank and Wolfe, 1956) for the case when f = ιC for a polyhedral set C. Note that the subproblem (13) shares some similarity with the proximal update (9): both choose to leave the potentially nonsmooth function f intact, but here the smooth loss ℓis replaced by its plain linearization rather than its linearization plus a (strictly convex) proximal regularizer such as the quadratic upper bound ℓL( ) in (6). Consequently, there might be multiple solutions, in which case we simply adopt any one, or no solution (e.g.

Yu, Zhang, and Schuurmans

Algorithm 1 Generalized Conditional Gradient (GCG).

1: Initialize: w0 dom f.

2: for t = 0, 1, . . . do

3: Compute dt arg mind { d, gt + f(d)}, where gt = ℓ(wt).

4: Choose step size ηt [0, 1] and perform update wt+1 = (1 ηt)wt + ηtdt.

5: wt+1 = Improve( wt+1, ℓ, f) Subroutine, see Deﬁnition 1

divergence), in which case we will impose extra assumptions to ensure boundedness (details below, especially Assumption 2 and Theorem 2). An important component of Algorithm 1 is the ﬁnal step, Improve, in Line 5, which is particularly important in practice: it allows for heuristic local improvement to be conducted on the iterates without sacriﬁcing any theoretical convergence properties of the overall procedure (see Section 4.3). A precursor of Improve appeared in (Meyer, 1974, Algorithm 2). Since Improve has access to both ℓand f it can be very powerful in principle we will consider the following variants that have respective consequences for practice and subsequent analysis.

Deﬁnition 1 The subroutine Improve in Algorithm 1 is called Null if for all t, wt+1 = wt+1; Descent if F(wt+1) F( wt+1); Monotone if F(wt+1) F(wt); and Relaxed if

F(wt+1) ℓL( wt+1; wt) + (1 ηt)f(wt) + ηtf(dt). (15)

Obviously, Relaxed can be easily morphed to further satisfy Descent and/or Monotone by comparing F(wt+1) with F( wt+1) and/or F(wt), whereas Null and Descent must be Relaxed (since ℓis L-smooth). Our main goal remains to establish that Algorithm 1 indeed converges to a global optimum under general conditions given in Assumption 1. In order to address the aforementioned issue of bounding dt, we next introduce one more assumption.

Assumption 2 The sequences {dt} and {wt} generated by the algorithm are bounded for any choice of ηt [0, 1].

Although this assumption is more stringent, it can fortunately be achieved under various conditions, which we summarize as follows. The proof is given in Appendix A.1.

Proposition 2 Let Assumption 1 hold. Then Assumption 2 is satisﬁed if any of the following holds.

(a) The subroutine Improve is Monotone; the sublevel set {w dom f : F(w) F(w0)} is compact; and f is coﬁnite, i.e., its Fenchel conjugate f has full domain.

(b) The subroutine Improve is Monotone; the sublevel set {w dom f : F(w) F(w0)} is bounded; and f is super-coercive, i.e., lim w f(w)/ w .

(c) dom f is bounded.

Generalized Conditional Gradient for Sparse Estimation

In particular, under any of the conditions in Theorem 2, Line 3 of Algorithm 1 is welldeﬁned. Note that if B = Rd, then condition (a) is in fact equivalent to condition (b). On the other hand, if f is a convex gauge, then none of the three conditions above holds, and we will deal with this in Section 3.2. Finally, for each w B we deﬁne a quantity, referred to as the duality gap, that will be useful in understanding the convergence properties of GCG:

G(w) := F(w) inf d {ℓ(w) + d w, ℓ(w) + f(d)} (linearizing ℓat w) (16)

= w, ℓ(w) + f(w) inf d { d, ℓ(w) + f(d)}

= w, ℓ(w) + f(w) + f ( ℓ(w)). (deﬁnition of Fenchel dual) (17)

Since f and ℓare convex in Assumption 1, the duality gap provides an upper bound on the suboptimality of any search point, as established in the following proposition.

Proposition 3 Let Assumption 1 hold. Then G(w) F(w) infd F(d) 0 for all w B. Furthermore, G(w) = 0 iﬀw is globally optimal, i.e. w satisﬁes the condition (11).

Proof As ℓis convex, (16) implies G(w) F(w) infd{ℓ(d)+f(d)} = F(w) infd F(d) 0. By the Fenchel-Young inequality, (17) implies G(w) = 0 iﬀw f ( ℓ(w)), i.e. (11) holds.

Since G(wt) is an upper bound on the suboptimality of wt, the duality gap gives a natural stopping criterion for (7) that can be easily computed in each iteration of Algorithm 1. We are now ready for the ﬁrst convergence result regarding Algorithm 1.

Theorem 4 Let Assumptions 1 and 2 hold. Also assume the subroutine is Relaxed, and the subproblem (13) is solved up to some additive error εt 0. Then, for any w dom F and t 0, Algorithm 1 yields

F(wt+1) F(w) + πt(1 η0)(F(w0) F(w)) +

πt πs η2 s(εs/ηs + L

2 ds ws 2), (18)

where πt := Qt s=1(1 ηs) with π0 = 1. Furthermore, for all t k 0, the minimal duality gap Gt k := mink s t G(ws) satisﬁes

Gt k 1 Pt s=k ηs

F(wk) F(wt+1) +

s=k η2 s εs/ηs + L

2 ds ws 2 #

From Theorem 4 we derive the following concrete rates of convergence.

Corollary 5 Under the same assumptions as Theorem 4, also let ηt = 2/(t+2), εt δηt/2 for some δ 0, and LF := supt L dt wt 2. Then, for all t 1 and w, Algorithm 1 yields

F(wt) F(w) + 2(δ + LF )

t + 3 , and Gt 1 3(δ + LF )

t ln 2 4.5(δ + LF )

Yu, Zhang, and Schuurmans

The proofs of Theorem 4 and Theorem 5 can be found in Appendix A.2 and Appendix A.3 respectively. Note that the simple step size rule ηt = 2/(t + 2) already leads to an O(1/t) bound on the objective value attained. Of course, it is possible to use other step size rules. For instance, both ηs = 1/(s + 1) and the constant rule ηs = 1 (t + 1)1/t lead to an O(1+log t

t+1 ) rate; see (Freund and Grigas, 2016) for detailed calculations. We emphasize that GCG with the open-loop step size rule ηt = 2/(t+2) requires neither the Lipschitz constant L nor the norm to be known. We can therefore freely adopt the best choices for the problem at hand in the analysis. Note that, even though the rate does not depend on the initial point w0 provided η0 = 1, by letting η0 = 1 the bound can be slightly improved (Freund and Grigas, 2016).

Remark 6 The observation that an O(1/t) rate can be obtained by an open-loop step size rule such as ηt = O(1/t) appears to have been ﬁrst made by Dunn and Harshbarger (1978). The same rate can be achieved by the closed-loop step size rule where ηt is selected to minimize ℓ((1 ηt)wt+ηtdt) or its quadratic upper bound ℓ((1 ηt)wt+ηtdt; wt) (Frank and Wolfe, 1956; Levitin and Polyak, 1966; Dem yanov and Rubinov, 1967). A similar rate on the minimal duality gap was also given in (Clarkson, 2010), and extended by (Jaggi, 2013). However, all these previous works focused on the special case where f = ιC for some compact set C. Fukushima and Mine (1981) was the ﬁrst to consider a general convex function f (and possibly a nonconvex ℓ), but they only established convergence for the algorithm. Bredies and Lorenz (2008) proved the O(1/t) rate by using a closed-loop step size rule. More recently, Bach (2015) considered the special case where f is strongly convex, and identiﬁed the equivalence between GCG and MD, hence also establishing the O(1/t) rate for this special case.

3.2 Improved Algorithm and Reﬁned Analysis when f is a Convex Gauge

Although the results in Theorem 4 and Theorem 5 hold for general convex functions f, they require {dt} and {wt} to be bounded as in Assumption 2. Some suﬃcient conditions were provided by Theorem 2. Unfortunately, these conditions cannot be satisﬁed by an important class of regularizers in machine learning: norms, semi-norms, and more generally convex gauge functions, cf. Section 2 for the detailed deﬁnition. Examples include the ℓp norms (p 1), the total variation semi-norm, the trace norm for matrices, etc. Indeed, the subproblem (13) in Line 3 of Algorithm 1 might diverge to , leaving the update undeﬁned. In this subsection, we develop a specialized form of GCG that circumvents this problem and exploits properties of convex gauges to signiﬁcantly reduce computational overhead. The resulting GCG formulation, which is one of our main contributions, will be used to achieve eﬃcient algorithms in important case studies in the second part of the paper, such as matrix completion under trace norm regularization (Section 4.2.1 below). In the literature, two main approaches have been used to bypass the unboundedness in solving (13); unfortunately, both remain unsatisfactory in diﬀerent ways. First, one obvious way to restore boundedness to (13) is to reformulate the regularized problem (7) in its equivalent constrained form

inf w ℓ(w) s.t. f(w) ζ, (21)

Generalized Conditional Gradient for Sparse Estimation

where it is well-known that a correspondence between (7) and (21) can be achieved if the constant ζ is chosen appropriately. In this case, Algorithm 1 can be directly applied as long as (21) has a bounded domain. Much recent work in machine learning has investigated this variant (Hazan, 2008; Jaggi and Sulovsky, 2010; Jaggi, 2013; Shalev-Shwartz et al., 2010; Tewari et al., 2011; Yuan and Yan, 2013). However, ﬁnding the appropriate value of ζ is costly in general, and often one cannot avoid considering the penalized formulation (e.g. when it is nested in another problem). Furthermore, the constraints in (21) preclude the application of many eﬃcient local Improve techniques designed for unconstrained problems. A second, alternative, approach is to square f, which ensures that it becomes supercoercive (Bradley and Bagnell, 2009). However, this modiﬁcation is somewhat arbitrary, in the sense that super-coerciveness can be achieved by raising the norm to the pth power for any p > 1 (when B = Rd). Moreover, the insertion of a local Improve step in such an approach requires the regularizer f to be evaluated at all iterates, which is expensive for applications such as matrix completion.

3.2.1 Generalized Convex Gauge Regularization

To address the issue of ensuring well deﬁned iterates, while also reducing their cost, we develop an alternative modiﬁcation of GCG. The algorithm we develop also allows a mild generalization of convex gauge regularization by introducing a composition with an increasing convex function. In particular, let κ denote a convex gauge induced by a convex, closed set C containing the origin (cf. (3)) and let h : R+ R denote an increasing convex function over R+ := [0, ]. Then regularizing by f( ) = h(κ( )) yields the particular form of (7) that we will address with this modiﬁed approach:

inf w ℓ(w) + h(κ(w)). (22)

For the structured sparse regularizers typically considered in machine learning, the set C used to construct the gauge function κ via (3) has additional structure that admits eﬃcient computation (Chandrasekaran et al., 2012). In particular, C is often constructed by taking the convex hull of some compact set of atoms A speciﬁc to the problem; i.e., C = conv A. Such a structured formulation allows one to re-express the gauge as the l1 norm over an appropriate atomic decomposition. In details:

κC(w) = inf

i=1 λi : w =

i=1 λiai, ai A, λi 0

κ C(g) = sup w A {0} w, g . (24)

That is, if we decompose w as a conic combination of the atoms in A, then κ(w) is the least l1 norm of the decomposition coeﬃcients. Determining the convex gauge κC directly from the set C is not always easy, and the following duality result may come into aid (Rockafellar and Wets, 1998, p. 491):

C is closed convex and 0 C = κC = (κ C) . (25)

Since A is compact, the polar κ C has full domain. Intuitively, κ has a much simpler form than κ if the structure of A is simple. In such cases, it is advantageous to design

Yu, Zhang, and Schuurmans

an optimization algorithm that circumvents evaluation of κ, by instead using κ which is generally far less expensive to evaluate. This will be the underpinning motivation of our algorithm design.

3.2.2 Modified GCG Algorithm for Generalized Gauge Regularization

The key to the modiﬁed algorithm is to avoid the possibly unbounded solution to (13) while also avoiding explicit evaluation of the gauge κ. First, to avoid unboundedness, we adopt the technique of moving the regularizer to the constraint, as in (21), but rather than applying the modiﬁcation directly to (22) we only use it in the subproblem (13) via:

dt arg min d:h(κ(d)) ζ d, ℓ(wt) . (26)

Then, to bypass the complication of dealing with h and the unknown bound ζ, we decompose dt into its normalized direction at and scale θt (i.e., such that dt = atθt/ηt), determining each separately. In particular, from this decomposition, the update from Line 4 of Algorithm 1 can be equivalently expressed as

wt+1 = (1 ηt)wt + ηtdt = (1 ηt)wt + θtat, (27)

hence the modiﬁed algorithm need only work with at and θt directly. Determining the direction. First, the normalized direction, at, can be recovered via

at arg min a:κ(a) 1 a, ℓ(wt) at arg min a A {0} a, ℓ(wt) (by (4)) (28)

at arg max a A {0} a, ℓ(wt) . (29)

Observe that (29) eﬀectively involves computation of the polar κ ( ℓ(w)) only, by (4). Since this optimization might still be challenging, we further allow it to be solved only approximately to within an additive error εt 0 and a multiplicative factor αt (0, 1]; that is, by relaxing (28) the modiﬁed algorithm only requires an at A to be found that satisﬁes

at, ℓ(wt) αt εt + min a A {0} a, ℓ(wt) = αt εt κ ( ℓ(wt)) . (30)

Intuitively, this formulation computes the normalized direction that has (approximately) the greatest negative correlation with the gradient ℓ, yielding the steepest local decrease in ℓ. Importantly, this formulation does not require evaluation of the gauge function κ. Determining the scale. Next, to choose the scale θt, it is natural to consider minimizing a simple upper bound on the objective that is obtained by replacing ℓ( ) with its quadratic upper approximation ℓL( ; wt) given in (6):

min θ 0 ℓL((1 ηt)wt + θat; wt) + h(κ((1 ηt)wt + θat)). (31)

Unfortunately, κ still participates in (31), so a further approximation is required. Here is where the decomposition of dt into at and θt is particularly advantageous. Observe that

h(κ((1 ηt)wt + θat)) h((1 ηt)κ(wt) + ηtκ(θat/ηt))

(1 ηt)h(κ(wt)) + ηth(κ(θat/ηt)) (32)

(1 ηt)h(κ(wt)) + ηth(θ/ηt)), (33)

Generalized Conditional Gradient for Sparse Estimation

Algorithm 2 GCG for positively homogeneous regularizers. Require: The set A whose convex hull C deﬁnes the gauge κ.

1: Initialize w0 and ρ0 κ(w0).

2: for t = 0, 1, . . . do

3: Choose normalized direction at that satisﬁes (30).

4: Choose step size ηt [0, 1] and set the scaling θt 0 by (34)

5: wt+1 = (1 ηt)wt + θtat, and ρt+1 = (1 ηt)ρt + θt 6: (wt+1, ρt+1) = Improve( wt+1, ρt+1, ℓ, f) Subroutine, see Theorem 8

where the ﬁrst inequality follows from the convexity of κ and the fact that h is increasing, the second inequality follows from the convexity of h, and (33) follows from the fact that κ(at) 1 by construction (at A in (4)). Although κ still participates in (33) we can now bypass evaluation of κ by maintaining a simple upper bound ρt κ(wt), yielding the relaxed update

θt = arg min θ 0

n ℓL((1 ηt)wt + θat; wt) + (1 ηt)h(ρt) + ηth(θ/ηt) o . (34)

Crucially, the upper bound ρt κ(wt) can be maintained by the simple update ρt+1 = (1 ηt)ρt + θt, which is suﬃcient to ensure that ρt+1 upper bounds κ(wt+1):

ρt+1 = (1 ηt)ρt + θt (1 ηt)κ(wt) + θtκ(at) κ((1 ηt)wt + θtat) κ(wt+1). (35)

From these components we obtain the ﬁnal modiﬁed procedure, Algorithm 2: a variant of GCG (Algorithm 1) that avoids the potential unboundedness of the subproblem (13) by replacing it with (30), while also completely avoiding explicit evaluation of the gauge κ. The Improve subroutine can be adapted by mirroring Theorem 1.

Deﬁnition 7 The subroutine Improve in Algorithm 2 is called Null if for all t, wt+1 = wt+1 and ρt+1 = ρt+1. Provided that ρt+1 κ(wt+1), Improve is called Descent if ℓ(wt+1) + h(ρt+1) ℓ( wt+1) + h( ρt+1); Monotone if ℓ(wt+1) + h(ρt+1) ℓ(wt) + h(ρt); and Relaxed if

ℓ(wt+1) + h(ρt+1) ℓL( wt+1; wt) + (1 ηt)h(ρt) + ηth(θt/ηt). (36)

Similar to Theorem 1, Relaxed can be easily morphed to further satisfy Descent and/or Monotone by comparing (wt+1, ρt+1) with ( wt+1, ρt+1) and/or (wt, ρt) over ℓ(w)+h(ρ). As before Null and Descent are specializations of Relaxed. For problems of the form (22), Algorithm 2 can provide signiﬁcantly more eﬃcient updates than Algorithm 1. Despite the relaxations introduced in Algorithm 2 to achieve faster iterates, the following theorem establishes that the procedure is still sound, preserving essentially the same convergence guarantees as the more expensive Algorithm 1.

Theorem 8 Let h : R+ R be an increasing convex function, κ be a convex gauge induced by the convex hull of a compact atomic set A {0}, and ℓbe L-smooth and convex. Let F = ℓ+ h κ and assume Algorithm 2 generates a bounded sequence {wt}. Let αt > 0,

Yu, Zhang, and Schuurmans

0 ηt 1, and the subroutine Improve be Relaxed. Then for any w dom F and t 0, we have

F(wt+1) F(w) + πt(1 η0) F(w0) F(w)

πt πs η2 s ρεs + h(ρ/αs) h(ρ) /ηs + L

αs as ws 2 , (37)

where ρ := κ(w) and πt := Qt s=1(1 ηs) with π0 = 1.

The proof is given in Appendix A.4. From this theorem, we obtain the following rate of convergence by pursuing a similar argument as in the proof of Theorem 5.

Corollary 9 Under the same setting as in Theorem 8, let ηt = 2/(t + 2), εt δηt/2 for some absolute constant δ > 0, αt = α > 0, ρ = κ(w), and LF := L supt ρ

αat wt 2. Then for all t 1 and w, we have

F(wt) F(w) + 2(ρδ + LF )

t + 3 + h(ρ/α) h(ρ). (38)

Moreover, if ℓ 0 and h is the identity map, then

αF(w) + 2(ρδ + LF )

t + 3 . (39)

Some remarks concerning this algorithm and its convergence properties are in order.

Remark 10 When αt = 1 for all t, and h is the indicator ι[0,ζ] for some ζ > 0, Theorem 9 implies the same convergence rate for the constrained problem (21), which immediately recovers (some of) the more specialized results discussed in (Clarkson, 2010; Hazan, 2008; Jaggi, 2013; Shalev-Shwartz et al., 2010; Tewari et al., 2011; Yuan and Yan, 2013). For the special case where h is the identity map, a similar O(1/t) rate achieved by a similar algorithm appeared in Harchaoui et al. (2015), whose workshop version appeared around the same time as our conference version (Zhang et al., 2012). Harchaoui et al. (2015) considered the slightly more restrictive setting where the convex gauge κ is the restriction of a norm into a convex cone and they did not consider (multiplicative) approximate polar computations. Their algorithm also needs to evaluate κ explicitly, while the key advantage of our method is to avoid this expensive computation by an upper bound. On the other hand, Harchaoui et al. (2015) investigated the memory based acceleration in (Holloway, 1974; Meyer, 1974) while we develop in the next section a ﬁxed-rank local improvement step.

Remark 11 The additional factor ρ in the above bounds is necessary because the approximation in (30) is not invariant to scaling, requiring some compensation. Note also that α 1, since the right-hand side of (30) must become negative as εt 0. The result in (39) states roughly that an α-approximate subroutine (for computing the polar of κ) leads to an α-approximate minimum , also at the rate of O(1/t). This observation was independently made in (Bach, 2013), where the multiplicative factor ρ is introduced to solve approximately some matrix factorization problems. In a similar vein, Cheng et al. (2016) leveraged the multiplicative guarantee to solve some tensor problems.

Generalized Conditional Gradient for Sparse Estimation

Remark 12 The assumption that {wt} is bounded can be satisﬁed easily, which in turn ensures LF < . The simplest way is to postulate that the sublevel set {w dom F : F(w) ℓ(w0) + h(ρ0)} is bounded. In this case {wt} is trivially bounded if Monotone is used. When Relaxed is used instead (which includes Null and Descent), we can strengthen it into Monotone via the roll-back trick in Theorem 7. At ﬁrst sight, the algorithm appears to be stuck, but in fact the diminishing step size ηt will eventually lead to progresses as guaranteed by Theorem 9. Computationally, keeping track of ℓ(wt)+h(ρt) does not incur any overhead. We will provide concrete values/bounds of LF for the case studies in Section 4.2 and Section 4.4.

Remark 13 The convergence results in Theorem 8 and Theorem 9 obviously hold when Null or Descent is employed for Improve, because they are both specializations of Relaxed. The update for θ in (34) requires knowledge of the Lipschitz constant L and the norm . If the norm is Hilbertian and h is the identity map, we have the formula

θt = 1 L at 2 max { at, Lηtwt ℓ(wt) 1, 0} .

However, it is also easy to devise other scaling and step size updates that do not require the knowledge of L or the norm , such as

(η t , θ t ) arg min 1 η 0,θ 0 ℓ((1 η)wt + θat) + (1 η)h(ρt) + ηh(θ/η), (40)

followed by the Null step. Note that (40) is jointly convex in η and θ, since ηh(θ/η) is a perspective function. Evidently this update can be treated as a Relaxed subroutine, since for any (ηt, θt) chosen in step 4 of Algorithm 2, it follows that

ℓ(wt+1) + h(ρt+1) ℓ((1 η t )wt + θ t at) + (1 η t )h(ρt) + η t h(θ t /η t )

ℓ((1 ηt)wt + θtat) + (1 ηt)h(ρt) + ηth(θt/ηt) (by (40))

ℓL( wt+1; wt) + (1 ηt)h(ρt) + ηth(θt/ηt).

The upper bound ρt+1 κ(wt+1) can be veriﬁed in the same way as for (35); hence the update (40) enjoys the same convergence guarantee as in Theorem 9.

Remark 14 An alternative workaround for solving (7) with gauge regularization would be to impose an upper bound ζ κ(w) and consider minw:κ(w) ζ {ℓ(w) + κ(w)}, while dynamically adjusting ζ. A standard GCG approach, similar to (Harchaoui et al., 2015), would then require solving

dt arg min d:κ(d) ζ { d, ℓ(wt) + κ(d)} (41)

as a subproblem, which can be as easy to solve as (30). If, however, one wished to include a local improvement step Improve (which is essential in practice), it is not clear how to eﬃciently maintain the constraint κ(w) ζ. Moreover, solving (41) up to some multiplicative factor might not be as easy as (30).

Yu, Zhang, and Schuurmans

3.3 Additional Discussions and Related Work

In this section we discuss some further properties and extensions of the GCG algorithm. Smoothing: The convergence rates we have established require the convex loss ℓto be L-smooth. Although this holds for a variety of losses such as the square and logistic loss, it does not hold for the hinge loss used in support vector machines. However, one can always approximate a nonsmooth (Lipschitz) loss by a smooth function (Nesterov, 2005) before applying GCG, at the expense of getting a slower O(1/

t) rate of convergence. Totally Corrective: As given, Algorithm 2 adds a single atom in each iteration, with a conic combination between the new atom and the previous aggregate. We will refer to it as vanilla GCG if Improve is further set to Null. If f = ιC for some compact and convex set C, we simply call it vanilla CG for disambiguation. An even more aggressive scheme is to re-optimize the coeﬃcients of all (or a subset of) existing atoms in each iteration. Such a procedure was ﬁrst studied in (Meyer, 1974; Holloway, 1974), which is often known as the totally (or fully) corrective update in the boosting literature and as the restricted simplicial decomposition in optimization. More generally, for the convex gauge κ, each iteration of a totally corrective procedure would involve solving

τ At στ, (42)

where At is a set of atoms accumulated up to the t-th iteration. When f = ιC, the totally corrective CG solves minσ ℓ P τ At στbτ over σ 0 and 1 σ = 1. As pointed out by Holloway (1974), as long as {wt, at} At, the totally corrective CG will converge at least as fast as the vanilla CG. Much faster convergence is usually observed in practice, although this advantage must be countered by the extra eﬀort spent in solving (42) and alike, which itself need not be trivial. In a ﬁnite dimensional setting, provided that the atoms are linearly independent and some restricted strong convexity is present, it is possible to prove that the totally corrective CG converges at a linear rate; see (Shalev-Shwartz et al., 2010; Yuan and Yan, 2013). Recent work of Lacoste-Julien and Jaggi (2015) has further relaxed the requirement of linear independence by making a weaker assumption that the number of atoms is ﬁnite (i.e. the feasible region is a polytope). GCG v.s. PG: The convergence rate established for GCG is on par with PG. When f = ιC, Canon and Cullum (1968) showed that the rate of vanilla CG cannot be improved in general even when ℓis strongly convex.1 In this sense, vanilla GCG (which subsumes vanilla CG) is slower than an optimal algorithm like APG. GCG s advantage, however, is that it only needs to solve a linear subproblem (13) in each iteration (i.e., computing the polar κ ), whereas APG (or PG) needs to compute the proximal update, which is a quadratic problem for the least square prox-function. The two approaches are complementary in the sense that the polar can be easier to compute in some cases, whereas in other cases the proximal update can be computed analytically. An additional advantage GCG has over APG, however, is its greedy sparse nature: Each iteration of GCG amounts to a single atom, hence the number of atoms never exceeds the total number of iterations for GCG. By contrast, APG can produce a dense update in each iteration, although in later stages the

1. Improvements in speciﬁc cases are still possible; see the next paragraph. Indeed the counter-example of Canon and Cullum (1968) only applies when the solution lies at the boundary of a polyhedral set.

Generalized Conditional Gradient for Sparse Estimation

estimates may become sparser due to the shrinkage eﬀect of the proximal update. Moreover, GCG is robust with respect to α-approximate atom selection (cf. (39) in Theorem 9), whereas similar results do not appear to be available for PG or APG when the proximal update is computed up to a multiplicative factor.

Faster Rates: Faster rates of convergence for CG (with a closed-loop step size rule), under the restriction f = ιC, have been considered by a number of works. Levitin and Polyak (1966) ﬁrst showed that if C is strongly convex and the gradient ℓis bounded away from 0, then a linear rate of convergence can be achieved by CG. Gu elat and Marcotte (1986) showed that if there is a minimizer in the interior of C, then again a linear rate of convergence can be obtained. Wolfe (1970) proposed an away step modiﬁcation of CG, and a variant of it designed by Gu elat and Marcotte (1986) attained linear rate of convergence when C is polyhedral. The analysis was further improved into a global linear rate in (Lacoste-Julien and Jaggi, 2015; Beck and Shtern, 2015). Dunn (1979) proved o(1/t) rate, linear rate and ﬁnite convergence for CG, depending on how continuous the polar operation is. Lastly, Garber and Hazan (2015) showed that for polytope C, a modiﬁcation of CG achieves the linear rate if ℓis of quadratic growth (in particular, strongly convex), and Garber and Hazan (2016) proved the O(1/t2) rate if ℓis of quadratic growth and C is strongly convex. Note that these faster rates are possible by putting restrictive assumptions on the constraint set C (polyhedral or strongly convex). Interestingly, results on faster rates for general convex functions f are scarce, although establishing it for GCG is conceivably not hard. For example, in (Zhang et al., 2012) we proved a linear rate for the totally corrective GCG when ℓis strongly convex (in a slightly stronger sense), f is a convex gauge, and B = Rn.

4. Application to Low Rank Learning

As a ﬁrst application of GCG, we consider the problem of optimizing low rank matrices; a problem that forms the cornerstone of many machine learning tasks such as dictionary learning and matrix completion. Since imposing a bound on matrix rank generally leads to an intractable problem, much recent work has investigated approaches that relax the hard rank constraint with carefully designed regularization penalties that indirectly encourage low rank structure (Bach et al., 2008; Bradley and Bagnell, 2009). In Section 4.2 we ﬁrst formulate a generic convex relaxation scheme for low rank problems arising in matrix approximation tasks, including dictionary learning and matrix completion. Conveniently, the modiﬁed GCG algorithm developed in Section 3.2, Algorithm 2, provides a general solution method for the resulting problems. Next, in Section 4.3 we show how the local improvement component of the modiﬁed GCG algorithm can be provided in this case by a procedure that optimizes an unconstrained surrogate (instead of a constrained problem) to obtain greater eﬃciency. This modiﬁcation signiﬁcantly improves the practical eﬃciency of GCG without aﬀecting its convergence properties. We illustrate these contributions by highlighting the example of matrix completion under trace norm regularization, where the improved GCG algorithm demonstrates state of the art performance. Finally, we demonstrate how the convex relaxation can be generalized to handle latent multi-view models (White et al., 2012) in Section 4.4, showing how the improved GCG algorithm can still be applied once the major challenge of eﬃciently computing the polar κ has been solved.

Yu, Zhang, and Schuurmans

4.1 Low Rank Learning Problems

Many machine learning tasks, such as dictionary learning and matrix completion, can be formulated as seeking low rank matrix approximations of a given data matrix. In these problems, one is presented with an n m matrix X (perhaps only partially observed) where each column corresponds to a training example and each row corresponds to a feature across examples; the goal is to recover a low rank matrix W that approximates (the observed entries of) X. To achieve this, we assume we are given a convex loss function ℓ(W) := L(W, X) that measures how well W approximates X.2 If we impose a hard bound, t, on the rank of W, the core training problem can then be expressed as

inf W:rank(W) t ℓ(W) = inf U Rn t inf V Rt m ℓ(UV ). (43)

Unfortunately, this problem is not convex, and except for special cases (such as SVD) is believed to be intractable. The two forms of this problem that we will explicitly illustrate in this paper are dictionary learning and matrix completion. First, for dictionary learning, the n m data matrix X is fully observed, the n t matrix U is interpreted as a dictionary consisting of t basis vectors, and the t m matrix V is interpreted as the coeﬃcients consisting of m code vectors. For this problem, the dictionary and coeﬃcient matrices need to be inferred simultaneously, in contrast to traditional signal approximation schemes where the dictionary (e.g., the Fourier or a wavelet basis) is ﬁxed a priori. Unfortunately, learning the dictionary with the coeﬃcients creates signiﬁcant computational challenge since the formulation is no longer jointly convex in the variables U and V , even when the loss ℓis convex. Indeed, for a ﬁxed dictionary size t, the problem is known to be NP-hard (Vavasis, 2010) except for very special cases (Yu and Schuurmans, 2011). The second learning problem we illustrate is low rank matrix completion. Here, one observes a small number of entries in X Rn m and the task is to infer the remaining entries. In this case, the optimization objective can be expressed as ℓ(W) := L(P(W X)), where L is a reconstruction loss (such as Frobenius norm squared) and P : Rn m Rn m

is a mask operator that simply ﬁlls the entries corresponding to unobserved indices in X with zero. Various recommendation problems, such as the Netﬂix prize,3 can be cast as matrix completion. Due to the ill-posed nature of the problem it is standard to assume that the matrix X can be well approximated by a low rank reconstruction W. Unfortunately, the rank function is not convex, hence hard to minimize in general. Conveniently, both the dictionary learning and matrix completion problems are subsumed by the general formulation we consider in (43).

4.2 General Convex Relaxation and Solution via GCG

To develop a convex relaxation of (43), we ﬁrst consider the factored formulation expressed in terms of U and V on the right hand side. Although it only involves an unconstrained minimization which can be convenient, care is needed to handle the scale invariance introduced by the factored form: since U can always be scaled and V counter-scaled to preserve

2. Note that the matrix W need not reconstruct the entries of X directly: a nonlinear transfer could be interposed while maintaining convexity of ℓ; for example, by using an appropriate Bregman divergence. 3. http://www.netflixprize.com//community/viewtopic.php?id=1537

Generalized Conditional Gradient for Sparse Estimation

the product UV , some form of penalty or constraint is required to restore well-posedness. A standard strategy is to constrain U, for which a particularly common choice is to constrain its columns to have unit length; i.e. U:i c 1 for all i, where U:i stands for the i-th column of U, and c simply denotes some norm on the columns. In dictionary learning, for example, this amounts to constraining the basis vectors in the dictionary U to lie within the unit ball of c. Similar normalization can be used to represent the rows Vi:, with the magnitude encoded by σi 0. In consequence, the unconstrained optimization over ℓ(UV ) in (43) can be reformulated as

min U Rn t,V Rt m,σ Rt ℓ(U diag(σ)V ), s.t. U:i c 1, Vi: r 1, and σi 0. (44)

Here σ := (σ1, σ2, . . . , σt) , and diag(σ) is a diagonal matrix with the i-th diagonal entry being σi. The speciﬁc form of the row norm r provides additional ﬂexibility in promoting diﬀerent structures; for example, the l1 norm leads to sparse solutions, the l2 norm yields low rank solutions, and block structured norms generate group sparsity. The speciﬁc form of the column norm c also has an eﬀect, as discussed in Section 4.4 below. Of course the problem remains non-convex. However, it has recently been observed that a simple relaxation of the rank constraint in (43) allows a convex reformulation to be achieved (Argyriou et al., 2008; Bach et al., 2008; Bradley and Bagnell, 2009; Zhang et al., 2011). Observe that replacing the rank constraint with a regularization penalty on the magnitude σi of each basis pair (U:i, Vi:) yields a relaxed form of (43):

inf U: U:i c 1 inf V : Vi: r 1 inf σ:σi 0 ℓ(U diag(σ)V ) + λ σ 1 , (45)

where the number of columns in U (and also the number of rows in V ) is not restricted. Similar to Lasso, the l1 norm on the singular values (combination weights) σ is used as a penalty that serves as a convex proxy for rank. By the shrinkage eﬀect of the l1 norm, many σi will become zero, implying that the corresponding columns of U (and rows of V ) can be dropped. As a result, the rank of the solution which is indeed unknown a priori can be chosen adaptively with the trade-oﬀparameter λ 0. Importantly, although the problem (45) is still not jointly convex in the factors U, V and σ, it can be exactly reformulated as a convex optimization as long as ℓis convex:

(45) = min W ℓ(W) + λ inf

i σi : σ 0, W = X

i σi U:i Vi:, U:i c 1, Vi: r 1

= min W ℓ(W) + λ κ(W), (47)

where the gauge κ is deﬁned by the following set A with uncountably many elements

A := {uv : u Rn, v Rm, u c 1, v r 1}. (48)

The resulting convex formulation (47) has been observed, in various forms, by, e.g., Argyriou et al. (2008); Bach et al. (2008); Zhang et al. (2011); White et al. (2012). However our derivation, ﬁrst presented in (Zhang et al., 2012), provides a much simpler and ﬂexible view based on convex gauges, which also establishes a direct connection to the GCG method

Yu, Zhang, and Schuurmans

outlined in Section 3.2, yielding an eﬃcient solution method with no additional eﬀort. The reformulation (47) reveals that the relaxed learning problem (45) is essentially considering a rank-one decomposition of W, which generalizes the singular value decomposition. Furthermore it provides a clearer understanding of the computational properties of the regularizer in (47) via its polar:

κ (G) = sup A A A, G = sup u c 1, v r 1 u Gv (49)

= sup v r 1 Gv c = sup u c 1

Recall that for gauge regularized problems in the form (47), Algorithm 2 only requires approximate computation of the polar (30) in Line 3; hence an eﬃcient method for (approximately) solving (49) immediately yields an eﬃcient implementation. However, this simplicity must be countered by the fact that the polar is not always tractable, since it involves maximizing a norm subject to a diﬀerent norm constraint.4 Nevertheless, there exist important cases where the polar can be eﬃciently evaluated, which we illustrate below. The polar also allows one to gain insight into the structure of the original gauge regularizer, since by duality κ = (κ ) yields an explicit formula for κ. For example, consider the important special case where r = c = 2. In this case, using (49) one can infer that the polar κ ( ) = sp is the spectral norm, and therefore the corresponding gauge regularizer κ( ) = tr is the trace norm. In this way, the trace norm regularizer, which is known to provide a convex relaxation of the matrix rank (Chandrasekaran et al., 2012), can be viewed from the perspective of dictionary learning as an induced norm that arises from a simple 2-norm constraint on the magnitude of the dictionary entries U:i and a simple 2-norm penalty on the magnitudes of the code vectors Vi:. This convex relaxation strategy based on convex gauges is quite ﬂexible and has been recently studied for some structured sparse problems (Chandrasekaran et al., 2012; Tewari et al., 2011; Zhang et al., 2012; Bach, 2013). Importantly, the generality of the column and row norms in (45) proﬀer considerable ﬂexibility to ﬁnely characterize the structures sought by low rank learning methods. We will exploit this ﬂexibility in Section 4.4 below to achieve a novel extension of this framework to multi-view dictionary learning (White et al., 2012).

4.2.1 Application to Matrix Completion

We end this subsection with a brief illustration of how the framework can be applied to matrix completion. Recall that in this case one is given a partially observed data matrix X, and the goal is to ﬁnd a low rank approximator W that minimizes the reconstruction error on observed entries. In particular, consider the following standard specialization of (47):

min W Rn m 1 2 P 0 P(X W) 2 F + λ W tr , (51)

where, as noted above, P : Rn m Rn m is the mask operator that ﬁlls entries where X is unobserved with zero. P 0 is the number of observed entries. Surprisingly, Cand es and

4. In this regard, the α-approximate polar oracle introduced in (30) and Theorem 9 might be useful, although we shall not develop this idea further here. See the work of Bach (2013) for some applications of this idea to matrix factorization and the work of Cheng et al. (2016) to tensor problems.

Generalized Conditional Gradient for Sparse Estimation

Recht (2009) have proved that, assuming X is (approximately) low-rank, the solution of the convex relaxation (51) will recover the true matrix X with high probability, even when only a small random portion of X is observed. Note that, since (51) is a specialization of (47), the modiﬁed GCG algorithm, Algorithm 2, can be immediately applied. Interestingly, such an approach contrasts with the most popular algorithm favored in the current literature: PG (or its accelerated variant APG; cf. Section 2). The two strategies, GCG versus PG, exhibit an interesting trade-oﬀ. Each iteration of PG (or APG) involves solving the PU (9), which in this case requires a full singular value decomposition (SVD) of the current approximant W. The modiﬁed GCG method, Algorithm 2, by contrast, only requires the polar of the gradient matrix ℓ(W) to be (approximately) computed in each iteration (i.e. dual of the trace norm; namely, the spectral norm), which is an order of magnitude cheaper to evaluate than a full SVD (cubic versus squared worst case time).5 On the other hand, GCG has a slower theoretical rate than APG, O(1/t) versus O(1/t2), in terms of the number of iterations required to achieve a given accuracy. Once the enhancements in the next section have been incorporated, our experiments in Section 5.1 will demonstrate that GCG is far more eﬃcient than APG for solving the matrix completion problem (51) to tolerances used in practice. Given the speciﬁc objective (51), we may derive a concrete bound on LF used in Theorem 9. Obviously, the smoothness constant is L = 1/ P 0. Suppose we use Monotone with W0 = 0. So all Wt (and the optimal W) need to satisfy λ Wt tr C, where C = PX 2 F /(2 P 0). Due to the normalization, C can be considered a universal constant. All matrices A A in (48) satisfy A F 1. Obviously 1

n Wt 2 tr Wt 2 F Wt 2 tr. So

LF 2L sup t

α2 At 2 F + Wt 2 F

α2 + Wt 2 tr

If λ needs to be less than 1 P 0 , then the upper bound on LF will be O( P 0). Notice that under Monotone, the expression of LF in Theorem 5 is on par with the condition number Lf D2 in (Harchaoui et al., 2015, Theorem 3).

4.3 Fixed-rank Local Optimization

Although the modiﬁed GCG algorithm, Algorithm 2, can be readily applied to the reformulated low rank learning problem (47), the sublinear rate of convergence established in Theorem 9 is still too slow in practice. Nevertheless, an eﬀective local improvement scheme can be developed that satisﬁes the properties of a Relaxed subroutine in Algorithm 2, which allows empirical convergence to be greatly accelerated. Recall that in Algorithm 2, the intermediate iterate Wt+1 is determined by a linear combination of the previous iterate Wt and the newly added atom At (we changed the notation from wt+1 to Wt+1 since

5. The per-iteration cost of computing the spectral norm can be further reduced for the sparse matrices occuring naturally in matrix completion, since the gradient matrix in (51) contains at most k nonzero elements for k equal to the number of observed entries in X. In such cases, an ϵ accurate estimate of the squared spectral norm can be obtained in O k log(n+m) ϵ time (Jaggi, 2013; Kuczy nski and Wo zniakowski,

1992), oﬀering a signiﬁcant speed up over the general case. Although some speed ups might also be possible when computing the SVD for sparse matrices, the gap between the computational cost for evaluating the spectral norm versus the trace norm for sparse matrices remains signiﬁcant.

Yu, Zhang, and Schuurmans

the optimization variables are now matrices). We will next demonstrate that Wt+1 can be further improved by solving a reformulated local problem, and then show how this can still preserve the convergence guarantees of Algorithm 2. The key reformulation relies on the following fact.

Proposition 15 The convex gauge κ induced by the set A in (48) can be re-expressed as

i=1 U:i c Vi: r : UV = W

U:i 2 c + Vi: 2 r : UV = W

for U Rn t and V Rt m, as long as t mn.

In Appendix B.1 we provide a direct proof that allows us to bound the number of terms involved in the summation. The above lower bound mn on t may not be sharp. For example, if r = c = 2, then κ is the trace norm, where Pt i=1( U:i 2 c + Vi: 2 r) is simply U 2 F+ V 2 F, the sum of the squared Frobenius norms; that is, Proposition 15 yields the well-known variational form of the trace norm in this case (Srebro et al., 2005). Observe that t in this case can be as low as the rank of W. Theorem 15 also appeared in (Bach et al., 2008; Bach, 2013), although without the explicit bound t mn. Based on Theorem 15, the objective in (47) can then be approximated at iteration t as

Ft+1(U, V ) := ℓ(UV ) + λ

U:i 2 c + Vi: 2 r

Our key intuition is to ﬁrst factorize Wt+1 into Uinit Vinit, and then ﬁnd improvement to (U, V ) with respect to Ft+1 (not the original F). This results in Ut+1 and Vt+1 satisfying Ft+1(Ut+1, Vt+1) Ft+1(Uinit, Vinit). Finally we restore the iterates based on Theorem 15

Wt+1 = Ut+1Vt+1 and ρt+1 = 1

(Ut+1):i 2 c + (Vt+1)i: 2 r . (54)

Using Ft+1 as a surrogate objective for local improvement is particularly advantageous in computation, because the problem is now unconstrained and the gauge function has been replaced by much simpler forms such as U:i 2 c and Vi: 2 r. This allows eﬃcient unconstrained minimization routines (e.g. L-BFGS) to be applied directly, improving the quality of the current iterate without increasing the size of the dictionary. Note however that (53) is not jointly convex in (U, V ) since the size t + 1 (the iteration counter) is ﬁxed. When t is suﬃciently large, any local minimizer of (53) globally solves the original problem (47) (Burer and Monteiro, 2005)6, but we locally reduce (53) in each iteration even for small t. Once this modiﬁcation is added to Algorithm 2 we obtain the ﬁnal variant, Algorithm 3. In this algorithm, Line 3 remains equivalent to before, and Line 4 adopts the slightly more sophisticated update rule in (40), although one can also use (34) with ηt = 2 t+2. Lines 5 to 8 collectively implement the Improve subroutine: Line 5 splits the iterate into two parts

6. The catch here is that seeking a local minimizer can still be highly non-trivial, sometimes it is even hard to certify the local optimality.

Generalized Conditional Gradient for Sparse Estimation

Algorithm 3 GCG variant for low rank learning. Require: The atomic set A.

1: Initialize W0 = 0, ρ0 = 0, U0 = V0 = Λ0 = .

2: for t = 0, 1, . . . do

3: (ut, vt) arg min uv A

4: (ηt, θt) arg min 0 η 1,θ 0 ℓ((1 η)Wt + θ utv t ) + λ((1 η)ρt + θ)

5: Uinit ( 1 ηt Ut, θtut), Vinit ( 1 ηt V t , θtvt)

6: (Ut+1, Vt+1) Reduce(Ft+1, Uinit, Vinit) s.t. Ft+1(Ut+1, Vt+1) Ft+1(Uinit, Vinit) e.g. ﬁnd a local minimum of Ft+1 with U and V initialized to Uinit and Vinit resp.

7: Wt+1 Ut+1Vt+1 Nominal as we only maintain (Ut, Vt) in practice

2 Pt+1 i=1( (Ut+1):i 2 c + (Vt+1)i: 2 r)

Uinit and Vinit, while Line 6 further reduces the reformulated local objective (53) with the designated initialization (e.g. by ﬁnding a local minimum). We referred to it as Reduce in order to be distinct from Improve. The last two steps restore the GCG iterate by (54). The only major challenge left for inserting a local optimization step in Algorithm 2 is that the Relaxed condition (36) in Theorem 7 must be maintained. This is the condition under which Algorithm 3 retains the O(1/t) rate of convergence established in Theorem 9. Fortunately this holds true as stated in Theorem 16 and its proof is given in Appendix B.2.

Proposition 16 (Uinit, Vinit) in Line 5 is a factorization of the intermediate iterate Wt+1 = (1 ηt)Wt + θtutv t . Let (Ut+1, Vt+1) improve upon it in the sense that Ft+1(Ut+1, Vt+1) Ft+1(Uinit, Vinit). Then the Wt+1 and ρt+1 recovered by (54) satisfy the Relaxed condition.

Although the Reduce subroutine is not guaranteed to achieve strict improvement in every iteration, we observe in our experiments in Section 5 below that Algorithm 3 is almost always signiﬁcantly faster than the Null version of Algorithm 2. In other words, local acceleration appears to make a big diﬀerence in practice.

Remark 17 Interleaving local improvement with a globally convergent procedure has been considered by Mishra et al. (2013) and Laue (2012). In particular, Laue (2012) considered the constrained problem (21), which unfortunately leads to considering a constrained minimization for local improvement that can be less eﬃcient than the unconstrained approach considered in (53). In addition, Laue (2012) used the original objective F directly in local optimization. This leads to signiﬁcant challenge when the regularizer κ is complicated such as the trace norm, and we avoid this issue by using a smooth surrogate objective Ft+1. Focusing on trace norm regularization only, Mishra et al. (2013) proposed a trust-region procedure that locally minimizes the original objective on the Stiefel manifold and the positive semideﬁnite cone, which unfortunately also requires constrained minimization and dynamic maintenance of the singular value decomposition of a small matrix. The rate of convergence of this procedure has not been analyzed.

Remark 18 We remark that local minimization of (53) constitutes the primary bottleneck of Algorithm 3, consuming over 50% of the computation. Since (54) is the same objective

Yu, Zhang, and Schuurmans

considered by the standard ﬁxed rank approach to matrix factorization, the overall performance of Algorithm 3 can be substantially improved by leveraging the techniques that have been developed for this speciﬁc objective. These improvements include system level optimizations such as stochastic solvers on distributed, parallel and asynchronous infrastructures (Zhuang et al., 2013; Yun et al., 2014). In the experiments conducted in Section 5 below, we merely solve (54) using a generic L-BFGS solver to highlight the eﬃciency of the GCG framework itself. Nevertheless, such a study leaves signiﬁcant room for further straightforward acceleration.

Remark 19 Much of our development in this section was inspired by the work of Bach et al. (2008), who ﬁrst demonstrated, although in a slightly less clear way, that the convex relaxation in (47) is possible, but did not observe the computational connection to GCG. Our contribution here is to present this result succinctly using the notion of convex gauges, connect it naturally to the GCG algorithm we developed in Section 3, and propose an eﬃcient ﬁxed-rank local improvement subroutine Reduce that does not aﬀect theoretical convergence guarantees. After our conference version (Zhang et al., 2012), Bach (2013) also independently extended the previous work (Bach et al., 2008) using convex gauges, and discussed a multiplicative approximate GCG variant similar to our Algorithm 2. However, the possibility to intervene ﬁxed-rank local subroutines based on Theorem 15 with GCG, and its practical computational consequences, were not observed in (Bach, 2013).

4.4 Multi-view Dictionary Learning

Finally, in this section we show how low rank learning can be generalized to a multi-view setting where, once again, the modiﬁed GCG algorithm can eﬃciently provide optimal solutions. The key challenge in this case is to develop column and row norms that can capture the more complex problem structure. Consider a two-view learning task where one is given m paired observations nhxj yj io

consisting of an x-view and a y-view with lengths n1 and n2 respectively. The goal is to infer a set of latent representations, hj (of dimension t < min(n1, n2)), and reconstruction models parameterized by the matrices A and B, such that Ahj xj and Bhj yj for all j. In particular, let X denote the n1 m matrix of x-view observations, Y the n2 m matrix of y-view observations, and Z = h X Y i the concatenated (n1 + n2) m matrix. The problem can then be expressed as recovering an (n1 + n2) t matrix of model parameters, C = h A B i , and a t m matrix of latent representations, H, such that Z CH. The key assumption behind multi-view learning is that each of the two views, xj and yj, is conditionally independent given the shared latent representation, hj. Although multiview data can always be concatenated hence treated as a single view, explicitly representing multiple views enables more accurate identiﬁcation of the latent representation (as we will see), when the conditional independence assumption holds. To better motivate this model, it is enlightening to ﬁrst reinterpret the classical formulation of multi-view dictionary learning, which is given by canonical correlation analysis (CCA). Typically, it is expressed as the problem of projecting two views so that the correlation between them is maximized. Assuming the data is centered (i.e. X1 = 0 and Y 1 = 0), the sample covariance of X and Y is given by XX /m and Y Y /m respectively. CCA can then be expressed as an

Generalized Conditional Gradient for Sparse Estimation

optimization over matrix variables

max U,V tr(U XY V ) s.t. U XX U = V Y Y V = I (55)

for U Rn1 t, V Rn2 t (De Bie et al., 2005). Here I is the identity matrix whose size is clear from the context. Although this classical formulation (55) does not make the shared latent representation explicit, CCA can be expressed by a generative model: given a latent representation, hj, the observations xj = Ahj + ϵj and yj = Bhj + νj are generated by a linear mapping plus independent zero mean Gaussian noise, ϵ N(0, Σx), ν N(0, Σy) (Bach and Jordan, 2006). Interestingly, one can show that the classical CCA problem (55) is equivalent to the following multi-view dictionary learning problem.

Proposition 20 (White et al. 2012, Proposition 1) Fix t, let

(A, B, H) = arg min A: A:i 2 1 min B: B:i 2 1 min H

(XX ) 1/2X (Y Y ) 1/2Y

Then U =(XX ) 1

2 A and V =(Y Y ) 1

2 B provide an optimal solution to (55).

Proposition 20 demonstrates how formulation (56) respects the conditional independence of the separate views: given a latent representation hj, the reconstruction losses on the two views, xj and yj, cannot inﬂuence each other, since the reconstruction models A and B are individually constrained. By contrast, in single-view dictionary learning (i.e. principal component analysis) A and B are concatenated in the larger variable C = h A B i , where C as a whole is constrained but A and B are not. A and B must then compete against each other to acquire magnitude to explain their respective views given hj (i.e. conditional independence is not enforced). Such sharing can be detrimental if the two views really are conditionally independent given hj. This matrix factorization viewpoint has recently allowed dictionary learning to be extended from the single-view setting to the multi-view setting (Jia et al., 2010). However, despite its elegance, the computational tractability of CCA hinges on its restriction to squared loss under a particular normalization, precluding other important losses. To combat these limitations, we apply the same convex relaxation principle as in Section 4.2. In particular, we reformulate (56) by ﬁrst incorporating an arbitrary loss function ℓthat is convex in its ﬁrst argument, and then relaxing the rank constraint by replacing it with the l1 norm of the singular values . This amounts to the following training problem that extends (Jia et al., 2010):

min A,B,H,σ ℓ A B

diag(σ)H; Z + λ σ 1 , s.t. Hi: 2 1, σi 0, A:i B:i

where D := a b

: a 2 β, b 2 γ . (57)

The l2 norm on the rows of H is a regularizer that encourages the rows to become sparse, hence reducing the size of the learned representation (Argyriou et al., 2008). Since the size of the latent representation (the number of rows of H) is not ﬁxed, our results in Section 4.2

Yu, Zhang, and Schuurmans

immediately allows the problem to be solved globally and eﬃciently for A, B and H. This considerably improves the previous local solutions in (Jia et al., 2010) which ﬁxed the latent dimension and approached the non-convex problem by alternating block coordinate descent between A, B and diag(σ)H. Indeed, (57) ﬁts exactly in the framework of (45), with r being the l2 norm and

c c = max 1

for c = a b

Next we will show how to eﬃciently compute the polar operator (50) that underpins Algorithm 3 in this multi-view model.

4.4.1 Efficient Polar Operator

The polar operator (50) for problem (57) turns out to admit an eﬃcient but nontrivial solution. For simplicity, we assume β = γ = 1 while the more general case can be dealt with in exactly the same way. The proof of the following key theorem is in Appendix B.3.

Proposition 21 Let I1 := diag( h 1 0 i ) and I2 := diag( h 0 1 i ) be the identity matrices on the subspaces spanned by x-view and y-view, respectively. For any µ > 0, denote Dµ = 1 + µI1 + p

1 + 1/µI2 as a diagonal scaling of the two identity matrices. Then, for the row norm r = 2 and column norm c deﬁned as in (58), the corresponding polar (49) is given by

κ (G) = min µ 0 DµG sp . (59)

Despite its variational form, the polar (59) can still be eﬃciently computed by considering its squared form and expanding:

DµG 2 sp = G DµDµG sp = G (1 + µ)I1 + (1 + 1/µ)I2 G sp .

Clearly, the term inside the spectral norm sp is convex in µ on R+ while the spectral norm is convex and increasing on the cone of positive semideﬁnite matrices, hence by the usual rules of convex composition the left hand side above is a convex function of µ. Consequently we can use a subgradient algorithm, or even golden section search, to ﬁnd the optimal µ in (59). Given an optimal µ, an optimal pair of atoms a and b can be recovered via the KKT conditions; see Appendix B.4 for details. Thus, by using (59) Algorithm 3 is able to solve the resulting optimization problem far more eﬃciently than the initial procedure oﬀered by White et al. (2012). Similar to the case of matrix completion for single view in Section 4.2.1, the LF in Theorem 5 for this multi-view setting can be bounded in exactly the same way as in (52). The only diﬀerence is that instead of W tr W F, we now have κ(W) 1

see it, just set µ = 1 in (59), and we immediately obtain κ (G)

2 G sp. Dualizing both sides leads to κ(W) 1

2 W F. Therefore we can proceed in analogy to (52):

LF 2L sup t

α2 At 2 F + Wt 2 F

α2 + 2κ(Wt)2 4 P 0

Generalized Conditional Gradient for Sparse Estimation

4.4.2 A Modified Power Iteration

In practice, one can compute the polar (59) much more eﬃciently by deploying a heuristic variant of the power iteration method, whose motivation arises from the KKT conditions in the proof of Theorem 21 (see (76) in Appendix B.4). In particular, two normalized vectors a and b are optimal for the polar (50) if and only if there exist nonnegative scalars µ1 and µ2 such that µ1 a µ2 b

, and µ1I1 + µ2I2 GG . (61)

Here A B asserts that A B is positive semideﬁnite. Therefore, after randomly initializing a and b with unit norm, it is natural to extend the power iteration method by repeating the update: s t

, then anew

" 1 s 2 s 1 t 2 t

This procedure usually converges to a solution where the Lagrange multipliers µ1 and µ2, recovered from (61), satisfy (61) hence conﬁrming their optimality. While we leave the investigation of the theoretical guarantees of this power iteration heuristic for future work, let us mention that it works very well in practice (as in the single view case) and when the occasional failure occurs we revert to (59) and follow the recovery rule detailed in Appendix B.4.

5. Experimental Evaluation

We evaluate the extended GCG approach developed above in three sparse learning problems: matrix completion, multi-class classiﬁcation, and multi-view dictionary learning. All algorithms were implemented in Matlab unless otherwise noted.7 All experiments were run on a single core of a cluster housed at National ICT Australia with AMD 6-Core Opteron 4184 (2.8GHz, 3M L2/6M L3 Cache, 95W TDP, 64 GB memory).

5.1 Matrix Completion with Trace Norm Regularization

Our ﬁrst set of experiments is on matrix completion (51), where the division by P 0 is omitted to be consistent with existing solvers. Five data sets are used and their statistics are given in Table 1. The data sets Yahoo252M (Dror et al., 2012) and Yahoo60M8 provide music ratings, and the data sets Netflix (Dror et al., 2012), Movie Lens1M, and Movie Lens10M9 all provide movie ratings. For Movie Lens1M and Yahoo60M, we randomly sampled 75% ratings as training data, using the rest for testing. The other three data sets came with training and testing partition. We compared the extended GCG approach (Algorithm 3) with three state-of-the-art solvers for trace norm regularized objectives: MMBS10 (Mishra et al., 2013), DHM (Dudik

7. All code is available for download at https://www.cs.uic.edu/~zhangx/GCG. 8. Yahoo60M is the R1 data set at http://webscope.sandbox.yahoo.com/catalog.php?datatype=r. 9. Both Movie Lens1M and Movie Lens10M were downloaded from http://www.grouplens.org/datasets/ movielens. 10. Downloaded from http://www.montefiore.ulg.ac.be/~mishra/softwares/trace Norm.html.

Yu, Zhang, and Schuurmans

data set n (#user) m (#product) #train #test λ Movie Lens1M 6,040 3,952 750,070 250,024 50 Movie Lens10M 71,567 65,133 9,301,174 698,599 100 Netflix 2,549,429 17,770 99,072,112 1,408,395 1 Yahoo60M 98,130 1,738,442 60,021,577 20,872,219 300 Yahoo252M 1,000,990 624,961 252,800,275 4,003,960 100

Table 1: Summary of the data sets used in matrix completion experiments. Here n is the total number of users, m is the number of products (movies or music), #train and #test are the number of ratings used for training and testing respectively.

et al., 2012), and JS (Jaggi and Sulovsky, 2010). The local optimization in GCG was performed by L-BFGS with the maximum number of iteration set to 20.11 Traditional solvers such as proximal methods (Toh and Yun, 2010; Pong et al., 2010) were not included in this comparison because they are much slower. MMBS is conﬁned to trace norm regularized problems. At each iteration t, they solve

min U Rn t, B Rt t, V Rt m ℓ(UBV ) + λ tr(B), s.t. U U = V V = I, and B 0. (62)

After obtaining a stationary point, the gradient of ℓis computed, with its top left and right singular vectors appended to U and V respectively, and repeat. Unfortunately, the optimization in (62) is highly challenging on the Stiefel manifold, and their algorithm relies on the directional Hessian which becomes rather complicated in computation for general ℓ. By contrast, the local surrogate (53) used in our approach is substantially easier to optimize. DHM is a special case of GCG, with Improve instantiated with a line search procedure (Dudik et al., 2012). The detail of line search is intricate, but in practice it is slower than totally corrective updates (42) solved by L-BFGS. To further streamline the optimization for (42), we buﬀered the quantities P(utv t), P(usv s) and P(utv t), P(X) , where (ut, vt) is the result of polar operator at iteration t (Line 3 of Algorithm 3). This allowed the objective (42) to be evaluated eﬃciently without having to compute matrix inner product. JS was proposed for solving the constrained problem: min W ℓ(W) s.t. W tr ζ, which makes it hard to directly compare with solvers for the penalized problem (51). As a workaround, given λ we ﬁrst found the optimal solution W for the penalized problem, and then we set ζ = W tr. To solve the constrained problem, JS lifts it to an SDP by observing

that W tr t

2 iﬀthere exist symmetric matrices A and B such that Z := A W W B

and tr(A) + tr(B) = t. Then the objective ℓ(W) can be rewritten in terms of Z, solved by the conventional conditional gradient algorithm for constrained problems. Figures 2 to 6 show how quickly the various algorithms drive the training objective down, while also providing the root mean square error (RMSE) on test data (i.e. comparing the reconstructed matrix each iteration to the ground truth). JS is comparable only on RMSE. The regularization constant λ, given in Table 1, was chosen from

11. Downloaded from http://www.cs.ubc.ca/~pcarbo/lbfgsb-for-matlab.html. Recall a suboptimal solution is suﬃcient for the ﬁxed-rank local optimization in Relaxed.

Generalized Conditional Gradient for Sparse Estimation

10 1 10 0 10 1 10 2 5.6

Running time (seconds)

Objective value (training)

(a) Objective value vs CPU time

10 1 10 0 10 1 10 2 0.86

Running time (seconds)

(b) Test RMSE vs CPU time

0 10 20 30 40 50 60 70 80 5.6

Number of iterations

Objective value (training)

(c) Objective value vs #iteration

Figure 2: Matrix completion on Movie Lens1M

10 1 10 2 10 3 10 4 5.8

Running time (seconds)

Objective value (training)

(a) Objective value vs CPU time

10 1 10 2 10 3 10 4 0.9

Running time (seconds)

(b) Test RMSE vs CPU time

0 10 20 30 40 50 60 70 80 5.8

Number of iterations

Objective value (training)

(c) Objective value vs #iteration

Figure 3: Matrix completion on Movie Lens10M

{1, 10, 50, 100, 200, 300, 400, 500, 1000} to minimize the test RMSE. On the Movie Lens1M and Movie Lens10M data sets, GCG and DHM exhibit similar eﬃciency in reducing the objective value with respect to CPU time (see Figure 2(a) and 3(a)). However, GCG achieves comparable test RMSE an order of magnitude faster than DHM, as shown in Figure 2(b) and 3(b). Interestingly, this discrepancy can be explained by investigating the rank of the solution, i.e. the number of outer iterations taken. In Figure 2(c) and 3(c), DHM clearly converges much more slowly in terms of the number of iterations, which is expected because it does not re-optimize the basis at each iteration, unlike GCG and MMBS. Therefore, although the overall computational cost for DHM appears on par with that for GCG, its solution has a much higher rank, resulting in a much higher test RMSE in these cases. MMBS takes nearly the same number of iterations as GCG to achieve the same objective value (see Figure 2(c) and 3(c)). However, its local search at each iteration is conducted on a constrained manifold, making it much more expensive than locally optimizing the unconstrained smooth surrogate (53). As a result, the overall computational complexity of MMBS is an order of magnitude greater than that of GCG, with respect to optimizing both the objective value and the test RMSE. The observed discrepancies between the relative eﬃciency of optimizing the objective value and test RMSE becomes much smaller on the other three data sets, which are much larger in size. In Figure 4 to 6, it is clear that GCG takes signiﬁcantly less CPU time to

Yu, Zhang, and Schuurmans

10 3 10 4 0.5

Running time (seconds)

Objective value (training)

(a) Objective value vs CPU time

10 3 10 4 1

Running time (seconds)

(b) Test RMSE vs CPU time

Figure 4: Matrix completion on Netflix

10 2 10 3 10 4

Running time (seconds)

Objective value (training)

(a) Objective value vs CPU time

10 2 10 3 10 4 1

Running time (seconds)

(b) Test RMSE vs CPU time

Figure 5: Matrix completion on Yahoo100m

Running time (seconds)

Objective value (training)

(a) Objective value vs CPU time

10 3 10 4 20

Running time (seconds)

(b) Test RMSE vs CPU time

Figure 6: Matrix completion on Yahoo252m

optimize both criteria. The declining trend of the objective value has a similar shape to that of the test RMSE. Here we observe that DHM is substantially slower than GCG and MMBS, conﬁrming the importance of optimizing the bases at the same time as optimizing their weights. JS, which does not exploit the penalized form of the objective, is considerably slower with respect to reducing the test RMSE.

5.2 Multi-class Classiﬁcation with Trace Norm Regularization

Next we compared the four algorithms on multi-class classiﬁcation problems also in the context of trace norm regularization. Here the task is to predict the class membership of input instances, xi Rn, where each instance is associated with a unique label yi {1, . . . , C}. In particular, we consider a standard linear classiﬁcation model where each class c is associated with a weight vector in Rn, stacked in a model matrix W Rn C. Then for each training example (xi, yi), the individual logistic loss L(W; xi, yi) is given by

L(W; xi, yi) = log exp(x i W:,yi) PC c=1 exp(x i W:c) ,

Generalized Conditional Gradient for Sparse Estimation

data set n C #train #test λ Fungus10 4,096 10 10 per class 10 per class 10 2

Fungus134 4,096 134 50 per class 50 per class 10 3

k1024 131,072 50 50 per class 50 per class 10 3

Table 2: Summary of the data sets used in multi-class classiﬁcation experiments. Here n is the number of features, C is the number of classes, #train and #test are the number of training and test images respectively.

and the complete loss is given by averaging over the training set, ℓ(W) := 1

m Pm i=1 L(W; xi, yi). It is reasonable to assume that W has a low rank; i.e. the weight vector of all classes lie in a low dimensional subspace (Akata et al., 2014), which motivates the introduction of a trace norm regularizer on W, yielding a training problem in the form of (47). We conducted experiments on three data sets extracted from the Image Net repository (Deng et al., 2009), with characteristics shown in Table 2 (Akata et al., 2014).12 k1024 is from the ILSVRC2011 benchmark, and Fungus10 and Fungus134 are from the Fungus group. All images are represented by the Fisher descriptor. As in the matrix completion case above, we again compared how fast the algorithms reduced the objective value and test accuracy. Here the value of λ was chosen from {10 4, 10 3, 10 2, 10 1} to maximize test accuracy. Figures 7 to 9 show that the models produced by GCG achieve comparable (or better) training objectives and test accuracies in far less time than MMBS, DHM and JS. Although DHM is often more eﬃcient than GCG early on, it is soon overtaken. This observation could be related to the diminishing return of adding bases, as shown in Figure 7(c), 8(c), and 9(c), since the addition of the ﬁrst few bases yields a much steeper decrease in objective value than that achieved by later bases. Hence, after the initial stage, the computational eﬃciency of a solver turns to be dominated by better exploitation of the bases. That is, using added bases as an initialization for local improvement (as in GCG) appears to be more eﬀective than simply ﬁnding an optimal conic combination of added bases (as in DHM). Also notice from Figures 7(c), 8(c) and 9(c) that MMBS and GCG are almost identical in terms of the number of iterations required to achieve a given objective value. However, the manifold based local optimization employed by MMBS at each iteration makes its overall computational time worse than that of DHM and JS. This is a diﬀerent outcome from the matrix completion experiments we saw earlier, and seems to suggest that the manifold approach in MMBS is more eﬀective for the least square loss than for the logistic loss.

5.3 Multi-view Dictionary Learning

Our third set of experiments focuses on optimizing the objective for the multi-view dictionary learning model (57) presented in Section 4.4. Here we compared the GCG approach outlined in Section 4.4 with a straightforward local solver that is based on block coordinate descent (BCD), which alternates between: (a) ﬁxing H and optimizing A and B (with norm

12. The three data sets were all downloaded from http://lear.inrialpes.fr/src/jsgd/.

Yu, Zhang, and Schuurmans

10 1 10 0 10 1 10 2 1.2

Running time (seconds)

Objective value (training)

(a) Objective value vs CPU time

10 1 10 0 10 1 10 2 15

Running time (seconds)

Test accuracy (%)

(b) Test accuracy vs CPU time

0 4 8 12 16 1.2

Number of iterations

Objective value (training)

(c) Test accuracy vs #iteration

Figure 7: Multi-class classiﬁcation on Fungus10

10 1 10 2 10 3 4.2

Running time (seconds)

Objective value (training)

(a) Objective value vs CPU time

10 1 10 2 10 3 0

Running time (seconds)

Test accuracy (%)

(b) Test accuracy vs CPU time

0 4 8 12 16 20 4.2

Number of iterations

Objective value (training)

(c) Objective value vs #iteration

Figure 8: Multi-class classiﬁcation on Fungus134

10 2 10 3 10 4

Running time (seconds)

Objective value (training)

(a) Objective value vs CPU time

10 2 10 3 10 4 5

Running time (seconds)

Test accuracy (%)

(b) Test accuracy vs CPU time

0 10 20 30 40 50

Number of iterations

Objective value (training)

(c) Objective value vs #iteration

Figure 9: Multi-class classiﬁcation on k1024

ball constraints), then (b) ﬁxing A and B and optimizing H (with nonsmooth regularizer). Both steps were solved by FISTA (Beck and Teboulle, 2009), initialized with the solution from the previous iteration. We tested several values of t, the dimensionality of the latent subspace of the solution, in these experiments. The GCG approach used Algorithm 3 with the local objective (53) instantiated by the norm expression in (58) and solved by L-BFGS with max iteration set to 20. We initialized the U and V matrices with 20 columns and rows respectively, each drawn i.i.d. from a uniform distribution on the unit sphere. Based

Generalized Conditional Gradient for Sparse Estimation

on the noise model we consider in these experiments, we started with a base loss ℓset to l1(Z, Z ) = P ij |Zij Z ij| where Z is the true underlying matrix; however, following (Becker et al., 2009, Eq 3.1), we smoothed this l1 loss by adding a quadratic prox-function with strong convexity modulus 10 4. Under the resulting loss, the optimization over A and B is decoupled given H. We terminated all algorithms after a maximum of running time (3,600 seconds) is reached.

5.3.1 Synthetic Data

We ﬁrst constructed experiments on a synthetic data set that fulﬁlls the conditional independence assumption (57) of the multi-view model (as discussed in Section 4.4). Here the xand y-views have basis matrices A Rn1 t and B Rn2 t respectively, where all columns of A were drawn i.i.d. from the l2 sphere of radius β = 1, and all columns of B were drawn i.i.d. from the l2 sphere of radius γ = 5. We set n1 = 800, n2 = 1000, and t = 10 (so that the dictionary size is indeed small). The latent representation matrix H Rt m has all elements drawn i.i.d. from the zero-mean unit-variance Gaussian distribution. The number of training examples is set to m = 200, which is much lower than the number of features n1 and n2. Then the clean versions of x-view and y-view were generated by X = A H and Y = B H respectively, while the noisy versions were obtained by adding i.i.d. noise to 15% entries of X and Y that were selected uniformly at random, yielding ˆX and ˆY respectively. The noise is uniformly distributed in [0, 10]. Given the noisy observations, the denoising task is to ﬁnd reconstructions X and Y for the two views, such that the error X ˆX 2 F+ Y ˆY 2 F is small. The composite training problem is formulated in (57).

The results of comparing the modiﬁed GCG method with local optimization are presented in Figures 10 to 12, where the regularization parameter λ has been varied among {80, 60, 40}. In each case, GCG optimizes the objective value signiﬁcantly faster than BCD for all values of t chosen for BCD. The diﬀerences between methods are even more evident when considering the reconstruction error. In particular, although the local optimization in GCG already achieves a low objective value in the ﬁrst outer iteration (see Figure 10(a) and 11(a)), the reconstruction error remains high, thus requiring further eﬀort to reduce it.

When t = 10 which equals the true rank t = 10, BCD is more likely to get stuck in a poor local minimum when λ is large, for example as shown in Figure 10(a) for λ = 80. This occurs because when optimizing H, the strong shrinkage in the proximal update stiﬂes major changes in H (and in A and B consequently). This problem is exacerbated when the rank of the solution is overly restricted, leading to even worse local optima. On the other hand, under-regularization does indeed cause overﬁtting in this case; for example, as shown in Figure 12 for λ = 40. Interestingly, BCD gets trapped in local optima for all t, while GCG eventually escapes suboptimal solutions and eventually ﬁnds a way to considerably reduce the objective value. However, this simply creates a dramatic increase in the test reconstruction error, a typical phenomenon of overﬁtting.

Yu, Zhang, and Schuurmans

10 1 10 2 10 3 2.8

Running time (seconds)

Objective value (training)

(a) Objective value vs CPU time

10 1 10 2 10 3 0

Running time (seconds)

Reconstruction Error

(b) Reconst. error vs CPU time

Figure 10: Denoising on synthetic data λ = 80

10 1 10 2 10 3 2.75

Running time (seconds)

Objective value (training)

(a) Objective value vs CPU time

10 1 10 2 10 3 0

Running time (seconds)

Reconstruction Error

(b) Reconst. error vs CPU time

Figure 11: Denoising on synthetic data λ = 60

10 0 10 1 10 2 10 3 2.4

Running time (seconds)

Objective value (training)

(a) Objective value vs CPU time

10 0 10 1 10 2 10 3 0

Running time (seconds)

Reconstruction Error

(b) Reconst. error vs CPU time

Figure 12: Denoising on synthetic data λ = 40

5.3.2 Image Denoising

We then conducted multi-view dictionary learning experiments on natural image data. The data set we considered is based on the Extended Yale Face Database B13 (Georghiades et al., 2001). In particular, the data set we used consists of grey level face images of 28 human subjects, each with 9 poses and 64 lighting conditions. To construct the data set, we set the x-view to a ﬁxed lighting (+000E+00) and the y-view to a diﬀerent ﬁxed lighting (+000E+20). We obtained paired views by randomly drawing a subject and a pose under the two lighting conditions. The underlying assumption is that each lighting has its own set of bases (A and B) and each (person, pose) pair has the same latent representation for the two lighting conditions. All images were down-sampled to 50-by-50, meaning n1 = n2 = 2500, and we randomly selected t = 50 pairs of (person, pose) for training. The x-view was kept clean, while pixel errors were added to the y-view, randomly setting 10% of the pixel values to 1. This noise model mimics natural phenomena such as occlusion and loss of pixel information from image transfer. The goal is again to enable appropriate reconstruction of

13. Data downloaded from http://vision.ucsd.edu/content/extended-yale-face-database-b-b.

Generalized Conditional Gradient for Sparse Estimation

10 0 10 1 10 2 10 3 2

Running time (seconds)

Objective value (training)

(a) Objective value vs CPU time

10 0 10 1 10 2 10 3 300

Running time (seconds)

Reconstruction Error

(b) Reconst. error vs CPU time

Figure 13: Denoising on face data with λ = 40

10 0 10 1 10 2 10 3 2

Running time (seconds)

Objective value (training)

(a) Objective value vs CPU time

10 0 10 1 10 2 10 3 300

Running time (seconds)

Reconstruction Error

(b) Reconst. error vs CPU time

Figure 14: Denoising on face data with λ = 30

10 0 10 1 10 2 10 3 1.5

Running time (seconds)

Objective value (training)

(a) Objective value vs CPU time

10 0 10 1 10 2 10 3 300

Running time (seconds)

Reconstruction Error

(b) Reconst. error vs CPU time

Figure 15: Denoising on face data with λ = 20

a noisy image using other views. Naturally, we set β = γ = 1 since the data is in the [0, 1] interval.

Figures 13 to 15 show the results comparing GCG with the local BCD for diﬀerent choices of t for BCD. Clearly the value of λ that yields the lowest reconstruction error is 40. However, here BCD simply gets stuck at the trivial local minimum H = 0, as shown in Figure 13(a). This occurs because under the random initialization of A and B, the proximal update simply sets H to 0 when the regularization is too strong. When λ is decreased to 30, as shown in Figure 14(a), BCD can escape the trivial H = 0 solution for t = 100, but again it becomes trapped in another local minimum that remains much worse than the globally optimal solution found by GCG. When λ is further reduced to 20, the BCD algorithms ﬁnally escape the trivial H = 0 point for all t, but still fail to ﬁnd an acceptable solution. Although GCG again converges to a much better global solution in terms of the objective value, overﬁtting can still occur: in Figure 15(b) the reconstruction error of GCG increases signiﬁcantly after an initial decline.

Yu, Zhang, and Schuurmans

6. Conclusion

We have presented a uniﬁed treatment of the generalized conditional gradient (GCG) algorithm that covers several useful extensions. After a thorough treatment of its convergence properties, we illustrated how an extended form of GCG can eﬃciently handle a convex gauge regularizer in particular. We illustrated the application of these GCG methods in learning low rank matrices, instantiated with examples on matrix completion, dictionary learning, and multi-view learning. In each case, we considered a generic convex relaxation procedure and focused on the eﬃcient computation of the polar operator one of the key steps to obtaining an eﬃcient implementation of GCG. To further accelerate its convergence, we interleaved the standard GCG update with a ﬁxed rank subspace optimization, which greatly improves performance in practice without aﬀecting the theoretical convergence properties. Extensive experiments on matrix completion, multi-class classiﬁcation, multi-view learning conﬁrmed the superiority of the modiﬁed GCG approach in practice. For future work, several interesting extensions are worth investigating. The current form of GCG is inherently framed as a batch method, where each iteration requires gradient of the loss to be computed over all training data. When the training data is large, it is preferable to only use a random subset, or distribute the training data across several processors. Another interesting direction is to interpolate between the polar operator and proximal updates; that is, one could incorporate a mini-batch of atoms at each iteration. For example, for trace norm regularization, one can consider adding k instead of just the 1 largest singular vectors. It remains an open question whether such a mini-batch of atoms can provably accelerate the overall optimization, but some recent work (Hsieh and Olsen, 2014) has demonstrated promise in practice.

Acknowledgments

This research was supported by the Alberta Innovates Center for Machine Learning (AICML), the Canada Research Chairs program, and NSERC. Most of this work was carried out when Y Yu and X Zhang were with the University of Alberta and AICML. X Zhang would like to thank the support of NICTA, where he worked as a researcher from Oct 2012 to Oct 2015 and conducted 50% of his work on this paper. We thank Csaba Szepesv ari, the action editor, and the reviewers at various stages of this work for the many constructive comments.

Appendix A. Proofs for Section 3

A.1 Proof of Theorem 2

Since ℓis Lipschitz continuous, it must be uniformly continuous on bounded set. (a): Let C be the closure of the sequence {wt}t. Due to the compactness assumption on the sublevel set and the monotonicity of F(wt), C is compact. Moreover C dom f since for all cluster points, say w, of wt we have from the closedness of F that F(w) lim inf F(wtk) F(w0) < . Since ℓis continuous, ℓ(C) is a compact subset of int(dom f ), the interior of dom f . Note that f is continuous on the interior of its domain, therefore its subdiﬀerential is locally bounded on ℓ(C), see e.g. (Borwein and

Generalized Conditional Gradient for Sparse Estimation

Vanderwerﬀ, 2010, Proposition 4.1.26). A standard compactness argument then establishes the boundedness of ( f )( ℓ(C)). Thus {dt}t is bounded. (b): Note ﬁrst that the boundedness of {wt}t follows immediately from the boundedness assumption on the sublevel set, thanks to the monotonic property of F(wt). Since ℓis uniformly continuous, the set { ℓ(wt)}t is again bounded. On the other hand, we know from (Borwein and Vanderwerﬀ, 2010, Theorem 4.4.13, Proposition 4.1.25) that f is supercoercive iﬀ f maps bounded sets into bounded sets. Therefore {dt}t is again bounded. (c): This is clear. Note that the convexity of ℓis not used in this proof.

A.2 Proof of Theorem 4

Since ℓis L-smooth, for the sequences {wt} and {dt} generated by Algorithm 1, and for all η [0, 1], we have

ℓ(wt + η(dt wt)) ℓ(wt) + η dt wt, ℓ(wt) + Lη2

2 dt wt 2 . (63)

Since the subroutine is Relaxed and the subproblem (13) is solved up to some εt 0,

F(wt+1) ℓ(wt) + ηt dt wt, ℓ(wt) + Lη2 t 2 dt wt 2 + (1 ηt)f(wt) + ηtf(dt)

F(wt) ηt G(wt) + Lη2 t 2 dt wt 2 + ηtεt

= F(wt) ηt G(wt) + η2 t (εt/ηt + L

2 dt wt 2).

Deﬁne t := F(wt) F(w) and Gt := G(wt). Thus

t+1 t ηt Gt + η2 t (εt/ηt + L

2 dt wt 2), and t Gt. (64)

Plug the latter into the former and expand:

t+1 πt(1 η0) 0 +

πt πs η2 s(εs/ηs + L

2 ds ws 2). (65)

To prove the second claim, we have from (64) ηt Gt t t+1 + η2 t (εt/ηt + L

2 dt wt 2). Summing from k to t: min k s t Gs

s=k ηs Gs F(wk) F(wt+1) +

s=k η2 s(εs/ηs + L

2 ds ws 2).

Rearranging completes the proof.

A.3 Proof of Theorem 5

As ηt = 2 t+2, we have η0 = 1 and πt = 2 (t+1)(t+2). For the ﬁrst claim (upper bound on F(wt))

all we need to verify is that, by induction, 1 (t+1)(t+2) Pt s=0 s+1 s+2 1 t+4. This is elementary. We prove the second claim by a sequence of calculations similar to that in (Freund and Grigas, 2016). First, using (19) and the bound on F(wt) in (20) with t = 1 and w = w2:

η1 F(w1) F(w2) + η2 1 2 (δ + LF ) 3

2 (δ + LF ) + 2

9 (δ + LF ) = 13

12 (δ + LF ),

Yu, Zhang, and Schuurmans

proving the second claim on Gt 1 for t = 1, 2 (note that G2 1 G1 1 by deﬁnition). For t 3, we consider k = t/2 1 if t is even and k = (t + 1)/2 1 otherwise. Clearly k 1, hence

1 s + 2 2 Z t

1 s + 2ds = 2 ln t + 2

k + 1 2 ln 2,

s=k η2 s/2 = 2

1 (s + 2)2 2 Z t+1

1 (s + 2)2 ds = 2 1 k + 2 1 t + 3

Using again (19), and the bound on F(wt) in (20) with t = k and w = wt+1:

Gt 1 Gt k δ + LF

2 k + 3 + 2 k + 2 2 t + 3

2 t + 2 + 1 t + 3

A.4 Proof of Theorem 8

A proof can be based on the simple observation that the objective value F satisﬁes

F := inf w {ℓ(w) + f(w)} = inf w,ρ:κ(w) ρ ℓ(w) + h(ρ). (66)

Note that if ρ were known, the theorem could have been proved as before. The key idea behind the step size choice (34) is to ensure that Algorithm 2 performs almost the same as if it knew the unknown but ﬁxed constant ρ = κ(w) beforehand. Consider an arbitrary w and let ρ = κ(w). We also use the shorthand ˆFt := ℓ(wt) + h(ρt) ℓ(wt) + f(wt) = F(wt). Then, the following chain of inequalities can be veriﬁed:

ˆFt+1 := ℓ(wt+1) + h(ρt+1)

ˆFt + θtat ηtwt, ℓ(wt) + L

2 θtat ηtwt 2 ηth(ρt) + ηth(θt/ηt)

ˆFt + ηt D ρ αt at wt, ℓ(wt) E + Lη2 t 2 ρ

αt at wt 2 ηth(ρt) + ηth(ρ/αt)

ˆFt + min z:κ(z) ρ ηt z wt, ℓ(wt) + ηtρεt + Lη2 t 2 ρ

αt at wt 2 ηth(ρt) + ηth(ρ/αt)

= ˆFt + min z:κ(z) ρ ηt z wt, ℓ(wt) ηth(ρt) + ηth(ρ)

αt at wt 2 + (ρεt + h(ρ/αt) h(ρ))/ηt | {z } :=δt

= ˆFt + η2 t δt ηt h wt, ℓ(wt) + h(ρt) min z:κ(z) ρ z, ℓ(wt) + h(ρ) | {z }

where the ﬁrst inequality is because the subroutine is Relaxed, the second inequality follows from the minimality of θt in (34), and the third inequality is due to the choice of at in Line 3 of Algorithm 2.

Generalized Conditional Gradient for Sparse Estimation

Recall that ρ = κ(w). Moreover, due to the convexity of ℓ,

ˆG(wt) = wt, ℓ(wt) + h(ρt) min z:κ(z) ρ

z, ℓ(wt) + h(ρ)

= h(ρt) h(κ(w)) + max z:κ(z) κ(w) wt z, ℓ(wt)

h(ρt) h(κ(w)) + max z:κ(z) κ(w) ℓ(wt) ℓ(z)

Thus we have retrieved the recursion:

ˆFt+1 F(w) ˆFt F(w) ηtˆG(wt) + η2 t δt, and ˆFt F(w) ˆG(wt).

Summing the indices as in the proof of Theorem 4 and noting that F(wt) ˆFt for all t completes the proof.

Appendix B. Proofs for Section 4

B.1 Proof of Theorem 15

The ﬁrst equivalence in Theorem 15 is well-known since Grothendieck s work on tensor products of Banach spaces (where it is usually called the projective tensor norm). The second equality follows directly from the arithmetic-geometric mean inequality. We ﬁrst note that the atomic set A in (48) is compact, so is its convex hull conv A. It is also easy to see that A is a connected set (by considering the continuous map (u, v) 7 uv ).

By (4), κ(W) = inf n ρ 0 : W = ρ X

i σiai, σi 0, X

i σi = 1, ai A o (67)

= inf {ρ 0 : W ρ conv A} .

Since conv A is compact with 0 int(conv A) we know the inﬁmum above is attained. Thus there exist ρ 0 and C conv A so that W = ρC and κ(W) = ρ. Applying Caratheodory s theorem for connected sets (Rockafellar and Wets, 1998, Theorem 2.29) we know C = Pt i=1 σiuiv i for some σi 0, P i σi = 1, uiv i A and t mn. Let U = ρ [ σ1u1, . . . , σtut], and V = ρ [ σ1v1, . . . , σtvt] . We then have W = UV

and κ(W) = ρ 1

2 Pt i=1 U:i 2 c + Vi: 2 r , which proves one side of Theorem 15.

On the other hand, consider any U and V that satisfy

j=1 U:j c Vj: r

U:i c Vi: r Pt j=1 U:j c Vj: r

Vi: Vi: r ,

assuming w.l.o.g. that U:i c Vi: r = 0 for all 1 i t. This, together with the deﬁnition (67), gives the other half of the equality: κ(W) Pt i=1 U:i c Vi: r . The proof is completed

by making the elementary observation that Pt i=1 U:i c Vi: r 1

2 Pt i=1 U:i 2 c + Vi: 2 r .

Yu, Zhang, and Schuurmans

B.2 Proof of Theorem 16

Since Uinit = ( 1 ηt Ut, θtut) and Vinit = ( 1 ηt V t , θtvt) , it is obvious that Wt+1 = Uinit Vinit. Note ut c 1 and vt r 1. By construction

ℓ(Wt+1) + λρt+1 = ℓ(Ut+1Vt+1) + λ

(Ut+1):i 2 c + (Vt+1)i: 2 r

= Ft+1(Ut+1, Vt+1) Ft+1(Uinit, Vinit)

ℓ((1 ηt)Ut Vt + θtutv t ) + λθt + λ(1 ηt)

(Ut):i 2 c + (Vt)i: 2 r

= ℓ((1 ηt)Wt + θtutv t ) + λ(1 ηt)ρt + λθt, (68)

where the inequality is by the deﬁnition of Uinit and Vinit, and the last equality is by the deﬁnition of ρt in (54). In addition, κ(Wt+1) ρt+1 follows from (54) and Theorem 15.

B.3 Proof of Theorem 21

By (50), the square of the polar is

(κ (G))2 = max n c ZZ c : a 2 1, b 2 1 o , where c = h a b i . (69)

Since it maximizes a convex quadratic function over a convex set, there must be a maximizer attained at an extreme point of the feasible region. Therefore the problem is equivalent to

(κ (G))2 = max n c ZZ c : a 2 = 1, b 2 = 1 o (70)

= max n tr(SZZ ) : tr(SI1) = 1, tr(SI2) = 1, S 0, rank(S) = 1 o , (71)

where S = cc . Key to this proof is the fact that the rank one constraint can be dropped without breaking the equivalence. To see why, notice that without this rank constraint the feasible region is the intersection of the positive semideﬁnite cone with two hyperplanes, hence the rank of all its extreme points must be upper bounded by one (Pataki, 1998). Furthermore the linearity of the objective implies that there must be a maximizing solution attained at an extreme point of the feasible region. Therefore we can continue by

(κ (G))2 = max n tr(SZZ ) : tr(SI1) = 1, tr(SI2) = 1, S 0 o

= max S 0 min µ1,µ2 R tr(SGG ) µ1(tr(SI1) 1) µ2(tr(SI2) 1)

= min µ1,µ2 R max S 0 tr(SGG ) µ1(tr(SI1) 1) µ2(tr(SI2) 1)

= min n µ1 + µ2 : µ1, µ2 R, GG µ1I1 + µ2I2 o (72)

= min n µ1 + µ2 : µ1 0, µ2 0, Dµ2/µ1G 2 sp µ1 + µ2 o (73)

= min n DµG 2 sp : µ 0 o , (74)

Generalized Conditional Gradient for Sparse Estimation

where the third equality is due to the Lagrangian duality, the ﬁfth equality follows from the equivalence Dµ2/µ1G 2 sp µ1 + µ2 Dµ2/µ1GG Dµ2/µ1 (µ1 + µ2)I

GG (µ1 +µ2)D 1 µ2/µ1D 1 µ2/µ1 = µ1I1 +µ2I2, and the last equality is obtained through the re-parameterization µ = µ2/µ1, ν = µ1 + µ2 and the elimination of ν.

B.4 Solving the multi-view polar

Here we show how to eﬃciently compute the polar (dual norm) in the multi-view setting. Let us ﬁrst backtrack in (73), which produces the optimal µ1 and µ2 via

µ1 + µ2 = DµG 2 sp , and µ2 µ1 = µ. (75)

Then the KKT condition for the step from (72) to (73) can be written as

M 0, Mc = 0, a 2 = b 2 = 1, where M := µ1I1 + µ2I2 GG . (76)

By duality, this is a suﬃcient and necessary condition for a and b to be the solution of the polar operator. Given the optimal µ1 and µ2, the ﬁrst condition M 0 must have been satisﬁed already. Let the null space of M be spanned by an orthonormal basis {q1, . . . , qk}.

Then we can parameterize c as c = Qα where Q = [q1, . . . , qk] =: Q1 Q2

0 = a 2 2 b 2 2 = α Q 1 Q1 Q 2 Q2 α. (77)

Let PΣP be the eigen-decomposition of Q 1 Q1 Q 2 Q2 with Σ being diagonal and P being unitary. Let v = P α. Then c = QPv and (77) simply becomes v Σv = 0. In addition, (76) also implies 2 = c 2 2 = v P Q QPv = v v. So ﬁnally the optimal solution to the polar problem (70) can be completely characterized by a b

= QPv : v Σv = 0, v 2 2 = 2 , (78)

which is simply a linear system after a straightforward change of variable. The major computational cost for recovery is to ﬁnd the null space of M, which can be achieved by QR decomposition. The complexity of eigen-decomposition for Q 1 Q1 Q 2 Q2 depends on the dimension of the null space of M, which is usually low in practice.

Zeynep Akata, Florent Perronnin, Zaid Harchaoui, and Cordelia Schmid. Good practice in large-scale learning for image classiﬁcation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 36(3):507 520, 2014.

Andreas Argyriou, Theodoros Evgeniou, and Massimiliano Pontil. Convex multi-task feature learning. Machine Learning, 73(3):243 272, 2008.

Francis Bach. Convex relaxations of structured matrix factorizations. HAL:00861118, 2013.

Yu, Zhang, and Schuurmans

Francis Bach. Duality between subgradient and conditional gradient methods. SIAM Journal of Optimization, 25(1):115 129, 2015.

Francis Bach and Michael Jordan. A probabilistic interpretation of canonical correlation analysis. Technical Report 688, Department of Statistics, University of California, Berkeley, 2006.

Francis Bach, Julien Mairal, and Jean Ponce. Convex sparse matrix factorizations. ar Xiv:0812.1869v1, 2008.

Francis Bach, Rodolphe Jenatton, Julien Mairal, and Guillaume Obozinski. Structured sparsity through convex optimization. Statistical Science, 27(4):450 468, 2012.

Amir Beck and Shimrit Shtern. Linearly convergent away-step conditional gradient for non-strongly convex functions, 2015.

Amir Beck and Marc Teboulle. Mirror descent and nonlinear projected subgradient methods for convex optimization. Operations Research Letters, 31:167 175, 2003.

Amir Beck and Marc Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences, 2(1):183 202, 2009.

Stephen Becker, J erˆome Bobin, and Emmanuel J. Cand es. NESTA: A fast and accurate ﬁrst-order method for sparse recovery. SIAM Journal on Imaging Sciences, 4(1):1 39, 2009.

Jonathan M. Borwein and John D. Vanderwerﬀ. Convex Functions: Constructions, Characterizations and Counterexamples. Cambridge University Press, 2010.

David M. Bradley and James Andrew Bagnell. Convex coding. In Conf. on Uncertainty in Artiﬁcial Intelligence, 2009.

Kristian Bredies and Dirk A. Lorenz. Iterated hard shrinkage for minimization problems with sparsity constraints. SIAM Journal on Scientiﬁc Computing, 30(2):657 683, 2008.

Peter B uhlmann and Sara van de Geer. Statistics for High-Dimensional Data. Springer, 2011.

Samuel Burer and Renato D C Monteiro. Local minima and convergence in low-rank semideﬁnite programming. Mathematical Programming, 103(3):427 444, 2005.

Emmanuel J. Cand es and Benjamin Recht. Exact matrix completion via convex optimization. Foundations of Computational Mathematics, 9:717 772, 2009.

Michael D. Canon and Clifton D. Cullum. A tight upper bound on the rate of convergence of Frank-Wolfe algorithm. SIAM Journal on Control, 6:509 516, 1968.

Venkat Chandrasekaran, Benjamin Recht, Pablo A. Parrilo, and Alan S.Willsky. The convex geometry of linear inverse problems. Foundations of Computational Mathematics, 12(6): 805 849, 2012.

Generalized Conditional Gradient for Sparse Estimation

H. Cheng, Y. Yu, X. Zhang, E. Xing, and D. Schuurmans. Scalable and sound low-rank tensor learning. In Proc. Intl. Conf. on Artiﬁcial Intelligence and Statistics, 2016.

Kenneth L. Clarkson. Coresets, sparse greedy approximation, and the Frank-Wolfe algorithm. ACM Transactions on Algorithms, 6(4):1 30, 2010.

Tij De Bie, Nello Cristianini, and Roman Rosipal. Eigenproblems in pattern recognition. In Handbook of Geometric Computing, pages 129 170, 2005.

Vladimir Fedorovich Dem yanov and Alexander M. Rubinov. The minimization of a smooth convex functional on a convex set. SIAM Journal on Control, 5:280 294, 1967.

Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-fei. Image Net: A largescale hierarchical image database. In Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2009.

Gideon Dror, Noam Koenigstein, Yehuda Koren, and Markus Weimer. The Yahoo! music dataset and KDD-Cup 11. JMLR Workshop and Conference Proceedings: Proceedings of KDD Cup 2011 Competition, 18:3 18, 2012.

Miroslav Dudik, Zaid Harchaoui, and Jerome Malick. Lifted coordinate descent for learning with trace-norm regularizations. In Proc. Intl. Conf. on Artiﬁcial Intelligence and Statistics, 2012.

J. Dunn. Rates of convergence for conditional gradient algorithms near singular and nonsingular extremals. SIAM Journal on Control and Optimization, 17:187 211, 1979.

J. Dunn and S. Harshbarger. Conditional gradient algorithms with open loop step size rules. Journal of Mathematical Analysis and Applications, 62:432 444, 1978.

Yonina C. Eldar and Gitta Kutyniok, editors. Compressed Sensing: Theory and Applications. Cambridge University Press, 2012.

Marguerite Frank and Philip Wolfe. An algorithm for quadratic programming. Naval Research Logistics Quarterly, 3(1-2):95 110, 1956.

Robert Freund and Paul Grigas. New analysis and results for the Frank-Wolfe method. Mathematical Programming, 155(1):199 230, 2016.

Masao Fukushima and Hisashi Mine. A generalized proximal point algorithm for certain non-convex minimization problems. International Journal of Systems Science, 12(8): 989 1000, 1981.

Dan Garber and Elad Hazan. Faster rates for the Frank-Wolfe method over strongly-convex sets. In Proc. Intl. Conf. on Machine Learning, 2015.

Dan Garber and Elad Hazan. A linearly convergent variant of the conditional gradient algorithm under strong convexity, with applications to online and stochastic optimization. SIAM Journal on Optimization, 26(3):1493 1528, 2016.

Yu, Zhang, and Schuurmans

Athinodoros S. Georghiades, Peter N. Belhumeur, and David J. Kriegman. From few to many: Illumination cone models for face recognition under variable lighting and pose. IEEE Transactions on Pattern Analysis and Machine Intelligence, 23:643 660, 2001.

J. Gu elat and P. Marcotte. Some comments on Wolfe s away step . Mathematical Programming, 35:110 119, 1986.

Zaid Harchaoui, Anatoli Juditsky, and Arkadi Nemirovski. Conditional gradient algorithms for norm-regularized smooth convex optimization. Mathematical Programming, 152:75 112, 2015.

Elad Hazan. Sparse approximate solutions to semideﬁnite programs. In Latin American Conference on Theoretical Informatics, 2008.

Charles A. Holloway. An extension of the Frank and Wolfe method of feasible directions. Mathematical Programming, 6(1):14 27, 1974.

Cho-Jui Hsieh and Peder A. Olsen. Nuclear norm minimization via active subspace selection. In Proc. Intl. Conf. on Machine Learning, 2014.

Laurent Jacob, Guillaume Obozinski, and Jean P. Vert. Group lasso with overlap and graph lasso. In Proc. Intl. Conf. on Machine Learning, 2009.

Martin Jaggi. Revisiting Frank-Wolfe: Projection-free sparse convex optimization. In Proc. Intl. Conf. on Machine Learning, 2013.

Martin Jaggi and Marek Sulovsky. A simple algorithm for nuclear norm regularized problems. In Proc. Intl. Conf. on Machine Learning, 2010.

Yangqing Jia, Mathieu Salzmann, and Trevor Darrell. Factorized latent spaces with structured sparsity. In Advances in Neural Information Processing Systems, pages 982 990, 2010.

Seyoung Kim and Eric P. Xing. Statistical estimation of correlated genome associations to a quantitative trait network. PLo S Genetics, 5(8):1 18, 2009.

Jacek Kuczy nski and Henryk Wo zniakowski. Estimating the largest eigenvalue by the power and Lanczos algorithms with a random start. SIAM Journal on Matrix Analysis and Applications, 13(4):1094 1122, 1992.

S. Lacoste-Julien and M. Jaggi. On the global linear convergence of Frank-Wolfe optimization variants. In Advances in Neural Information Processing Systems, 2015.

S oren Laue. A hybrid algorithm for convex semideﬁnite optimization. In Proc. Intl. Conf. on Machine Learning, 2012.

Evgeny S. Levitin and Boris T. Polyak. Constrained minimization problems. USSR Computational Mathematics and Mathematical Physics, 6(5):1 50, 1966.

Gerard Meyer. Accelerated Frank Wolfe algorithms. SIAM Journal on Control, 12:655 655, 1974.

Generalized Conditional Gradient for Sparse Estimation

Charles A. Micchelli, Jean M. Morales, and Massimiliano Pontil. Regularizers for structured sparsity. Advances in Computational Mathematics, 38(3):455 489, 2013.

Bamdev Mishra, Gilles Meyer, Francis Bach, and Rodolphe Sepulchre. Low-rank optimization with trace norm penalty. SIAM Journal on Optimization, 23(4):2124 2149, 2013.

Yurii Nesterov. Smooth minimization of non-smooth functions. Mathematical Programming, 103(1):127 152, 2005.

Yurii Nesterov. Gradient methods for minimizing composite functions. Mathematical Programming, Series B, 140:125 161, 2013.

Guillaume Obozinski and Francis Bach. Convex relaxation for combinatorial penalties. Technical Report HAL 00694765, 2012.

Gabor Pataki. On the rank of extreme matrices in semideﬁnite programs and the multiplicity of optimal eigenvalues. Mathematics of Operations Research, 23(2):339 358, 1998.

Ting Kei Pong, Paul Tseng, Shuiwang Ji, and Jieping Ye. Trace norm regularization: Reformulations, algorithms, and multi-task learning. SIAM Journal on Optimization, 20 (6):3465 3489, 2010.

Ralph Tyrell Rockafellar and Roger J-B Wets. Variational Analysis. Springer, 1998.

Shai Shalev-Shwartz, Nathan Srebro, and Tong Zhang. Trading accuracy for sparsity in optimization problem with sparsity constraint. SIAM Journal on Optimization, 20(6): 2807 2832, 2010.

Nathan Srebro, Jason D. M. Rennie, and Tommi S. Jaakkola. Maximum-margin matrix factorization. In Advances in Neural Information Processing Systems, 2005.

Ambuj Tewari, Pradeep Ravikumar, and Inderjit S. Dhillon. Greedy algorithms for structurally constrained high dimensional problems. In Advances in Neural Information Processing Systems, 2011.

Kim-Chuan Toh and Sangwoon Yun. An accelerated proximal gradient algorithm for nuclear norm regularized least squares problems. Paciﬁc Journal of Optimization, 6:615 640, 2010.

Paul Tseng. Approximation accuracy, gradient methods, and error bound for structured convex optimization. Mathematical Programming, Series B, 125:263 295, 2010.

Stephen A Vavasis. On the complexity of nonnegative matrix factorization. SIAM Journal on Optimization, 20(3):1364 1377, 2010.

Martha White, Yaoliang Yu, Xinhua Zhang, and Dale Schuurmans. Convex multi-view subspace learning. In Advances in Neural Information Processing Systems, 2012.

P. Wolfe. Convergence theory in nonlinear programming. In Integer and Nonlinear Programming. North-Holland, 1970.

Yu, Zhang, and Schuurmans

Yaoliang Yu and Dale Schuurmans. Rank/norm regularization with closed-form solutions: Application to subspace clustering. In Conf. on Uncertainty in Artiﬁcial Intelligence, 2011.

Ming Yuan and Yi Lin. Model selection and estimation in regression with grouped variables. Journal of The Royal Statistical Society Series B, 68(1):49 67, 2006.

Xiaotong Yuan and Shuicheng Yan. Forward basis selection for pursuing sparse representations over a dictionary. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(12):3025 3036, 2013.

Hyokun Yun, Hsiang-Fu Yu, Cho-Jui Hsieh, S V N Vishwanathan, and Inderjit Dhillon. NOMAD: Nonlocking, stochastic multi-machine algorithm for asynchronous and decentralized matrix completion. In Proc. Intl. Conf. on Very Large Data Bases, 2014.

Xinhua Zhang, Yaoliang Yu, Martha White, Ruitong Huang, and Dale Schuurmans. Convex sparse coding, subspace learning, and semi-supervised extension. In Proc. Nat l. Conf. Artiﬁcial Intelligence, 2011.

Xinhua Zhang, Yaoliang Yu, and Dale Schuurmans. Accelerated training for matrix-norm regularization: A boosting approach. In Advances in Neural Information Processing Systems, 2012.

Yong Zhuang, Wei-Sheng Chin, Yu Juan, and Chih-Jen Lin. A fast parallel SGD for matrix factorization in shared memory systems. In Proc. ACM Recommender System Conference, 2013.