# Efficient k-Support-Norm Regularized Minimization via Fully Corrective Frank-Wolfe Method

Bo Liu, Xiao-Tong Yuan, Shaoting Zhang, Qingshan Liu, Dimitris N. Metaxas
Department of Computer Science, Rutgers, The State University of New Jersey
Jiangsu Province Key Laboratory of Big Data Analysis Technology, Nanjing University of Information Science and Technology
Department of Computer Science, University of North Carolina at Charlotte
lb507@cs.rutgers.edu

The k-support-norm regularized minimization has recently been applied with success to sparse prediction problems. The proximal gradient method is conventionally used to minimize this composite model. However, it tends to suffer from a high per-iteration cost, since the proximity operator associated with the k-support-norm requires exhaustive search and can be time consuming in large-scale settings. In this paper, we reformulate the k-support-norm regularized formulation into an equivalent constrained formulation and propose a fully corrective Frank-Wolfe algorithm to minimize the constrained model. Our method is inspired by the observation that the convex hull structure of the k-support-norm ball allows the application of Frank-Wolfe-type algorithms with low iteration complexity. The convergence behavior of the proposed algorithm is analyzed. Extensive numerical results on learning tasks including logistic regression and matrix pursuit demonstrate the substantially improved computational efficiency of our algorithm over state-of-the-art proximal gradient algorithms.

1 Introduction

In many sparsity-inducing machine learning problems, a common practice is to use the $\ell_1$ norm as a convex relaxation of the model parameter cardinality constraint. The $\ell_1$ norm, however, tends to shrink an excessive number of variables to zero, regardless of the potential correlation among the variables. To alleviate this over-shrinkage issue of the $\ell_1$ norm, numerous methods, including the elastic net [Zou and Hastie, 2005] and trace Lasso [Grave et al., 2011], have been proposed in the literature. All of these methods tend to smooth the output parameters by averaging similar features rather than selecting a single one. More recently, the k-support-norm $\|\cdot\|_k^{sp}$ was proposed as a new alternative that provides the tightest convex relaxation of cardinality on the Euclidean unit ball [Argyriou et al., 2012]. Formally, the k-support-norm of a vector $w \in \mathbb{R}^p$ is defined as
$$\|w\|_k^{sp} = \min\Big\{ \sum_{g\in\mathcal{G}_k}\|v_g\|_2 \;:\; \mathrm{supp}(v_g)\subseteq g,\ \sum_{g\in\mathcal{G}_k} v_g = w \Big\},$$
where $\mathcal{G}_k$ denotes the set of all subsets of $\{1, 2, \dots, p\}$ of cardinality at most $k$. More properties of the k-support-norm are analyzed in [Argyriou et al., 2012]. As a regularizer, the k-support-norm is characterized by simultaneously selecting a few relevant groups and penalizing the $\ell_2$ norm of the selected individual groups. The following k-support-norm regularized model was considered in [Argyriou et al., 2012] for sparse prediction tasks:
$$\min_{w}\; f(w) + \lambda(\|w\|_k^{sp})^2, \qquad (1)$$
where $f(w)$ is a convex and differentiable objective function. The parameter $k$ is regarded as an upper-bound estimate of the number of non-zero elements in $w$. It has been shown that this model leads to improved learning guarantees as well as better algorithmic stability [Argyriou et al., 2012].
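As a quick check on this definition (a worked special case we add here for illustration, consistent with the properties noted in [Argyriou et al., 2012]): when $k = 1$ every admissible group is a singleton, so the decomposition is forced coordinate-wise, while when $k = p$ the single group $\{1,\dots,p\}$ is admissible and, by the triangle inequality, no decomposition can have smaller cost. Hence the k-support-norm interpolates between the $\ell_1$ and $\ell_2$ norms:
$$\|w\|_1^{sp} = \sum_{i=1}^{p}|w_i| = \|w\|_1, \qquad \|w\|_p^{sp} = \|w\|_2.$$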
Motivation. One issue that hinders the applicability of the k-support-norm regularized model (1) is its high computational complexity in large-scale settings. Indeed, proximal gradient methods are conventionally used for optimizing the composite minimization problem (1) [Argyriou et al., 2012; Lai et al., 2014; Eriksson et al., 2015]. Given the gradient vector, the per-iteration computational cost of proximal gradient methods is dominated by a proximity operator of the form
$$\hat w = \arg\min_{w}\; \frac{1}{2}\|w - z\|^2 + \lambda(\|w\|_k^{sp})^2, \qquad (2)$$
where $z$ is the point at which the operator is evaluated (e.g., a gradient-descent update of the current iterate). In the work of [Argyriou et al., 2012], a closed-form solution of (2) was derived based on an exhaustive search strategy. However, the complexity of such a strategy is $O(p(k + \log p))$, which is computationally expensive when $p$ is large. Although significant speed-ups have been reported in [Lai et al., 2014; Eriksson et al., 2015] by using binary search, those methods are still expensive for large-scale problems.

It is known that the k-support-norm ball $B^{\varrho}_{k,sp} := \{w\in\mathbb{R}^p : \|w\|_k^{sp} \le \varrho\}$ is identical to the convex hull of $C^{(2)}_{k,\varrho} = \{w\in\mathbb{R}^p : \|w\|_0 \le k,\ \|w\|_2 \le \varrho\}$. In this sense, the k-support-norm ball $B^{\varrho}_{k,sp}$ provides a convex envelope of the nonconvex set $C^{(2)}_{k,\varrho}$. Recently, Frank-Wolfe-type methods for minimizing a convex objective over a convex hull have received wide interest in optimization and machine learning [Zhang, 2003; Shalev-Shwartz et al., 2010; Yuan and Yan, 2013]. These algorithms have been shown to achieve a satisfying trade-off between convergence rate and iteration complexity. In this paper, inspired by the success of these algorithms, we consider applying the Frank-Wolfe method to solve the composite optimization problem (1).
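To make this convex-hull observation concrete: a linear function attains its minimum over $B^{\varrho}_{k,sp} = \mathrm{conv}(C^{(2)}_{k,\varrho})$ at a point of $C^{(2)}_{k,\varrho}$, and over that nonconvex set the minimizer has a simple closed form obtained by keeping the top-$k$ entries of the gradient in magnitude. The Python sketch below (our illustration, not code from the paper) checks this closed form against brute-force enumeration of supports on a tiny instance.

```python
import itertools
import numpy as np

def lmo_sparse_ball(g, k, radius):
    """argmin_u <g, u> s.t. ||u||_0 <= k, ||u||_2 <= radius.

    The minimizer keeps the k largest-magnitude entries of g,
    flips the sign, and scales to the given radius.
    """
    u = np.zeros_like(g)
    if np.all(g == 0):
        return u
    top = np.argsort(np.abs(g))[-k:]           # indices of the top-k |g_i|
    u[top] = -g[top] / np.linalg.norm(g[top])  # unit vector against g on that support
    return radius * u

# Brute-force check on a small instance: enumerate all supports of size k.
rng = np.random.default_rng(0)
p, k, radius = 8, 3, 2.0
g = rng.standard_normal(p)

best = np.inf
for support in itertools.combinations(range(p), k):
    gs = g[list(support)]
    # For a fixed support, min of <g, u> over ||u||_2 <= radius is -radius * ||g_S||.
    best = min(best, -radius * np.linalg.norm(gs))

u_star = lmo_sparse_ball(g, k, radius)
print(np.isclose(g @ u_star, best))  # True: the closed form matches brute force
```

In contrast to the $O(p(k+\log p))$ proximity operator (2), this linear step only needs a top-$k$ selection, which costs $O(p\log p)$ with sorting (or $O(p)$ in expectation with selection algorithms).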
Challenge and Contribution. In this paper we propose to convert the k-support-norm regularized formulation into an equivalent constrained formulation by introducing an augmented variable that bounds the regularizer $(\|w\|_k^{sp})^2$. We then develop a fully corrective variant of the Frank-Wolfe algorithm to optimize the augmented model; the proposed algorithm is referred to as k-FCFW in what follows. We analyze the convergence rate of the proposed algorithm and, in particular, prove that it converges linearly under proper conditions. This result is stronger than the sublinear rate of convergence obtained in [Harchaoui et al., 2015] for norm-regularized minimization with Frank-Wolfe methods. The advantage of the proposed algorithm over prior algorithms is demonstrated by empirical results on various learning tasks.

2 Related Work

2.1 k-Support-Norm Regularized Learning

The k-support-norm regularized learning models have been extensively studied in machine learning and computer vision. Multiple k-support-norm regularized convex models were investigated in [Blaschko, 2013], and the application of the k-support-norm to various computer vision problems was explored in [Lai et al., 2014]. The k-support-norm was also applied to the generalized Dantzig selector for linear models [Chatterjee et al., 2014]; that work uses the (unsquared) k-support-norm as the regularizer, which differs from the squared k-support-norm considered in [Argyriou et al., 2012; Lai et al., 2014; Eriksson et al., 2015] and in this paper. The authors of [Belilovsky et al., 2015b] showed that k-support-norm regularization is helpful in fMRI data analysis, including classification, regression, and data visualization. In [Belilovsky et al., 2015a], a total variation penalty is incorporated into the k-support framework and applied to image and neuroscience data analysis.

2.2 Frank-Wolfe Method

The history of the Frank-Wolfe method, also known as the conditional gradient method, dates back to [Frank and Wolfe, 1956] for the constrained optimization problem
$$\min_{w}\; f(w) \quad \text{s.t.}\quad w\in\mathcal{D},$$
where $\mathcal{D}$ is a polytope. Due to its potential for efficiency gains when solving minimization problems with certain forms of constraints, there has recently been an increasing trend to revisit and study this method. Details of the original Frank-Wolfe method, its variants, and their analysis can be found in [Jaggi, 2013; Garber and Hazan, 2014]. Frank-Wolfe-type methods have been developed to solve numerous optimization problems, including structural SVMs [Lacoste-Julien et al., 2013], trace-norm regularized problems [Dudik et al., 2012], and atomic-norm constrained problems [Rao et al., 2015]. In sparse learning, a number of Frank-Wolfe-type methods, e.g., forward greedy selection [Shalev-Shwartz et al., 2010] and gradient Lasso [Kim and Kim, 2004], have been developed to solve sparsity constrained problems. In the context of boosting classification, the restricted gradient projection algorithm of [Grubb and Bagnell, 2011] is essentially a forward greedy selection method over an $\ell_2$ functional space.

3 The k-FCFW Method for k-Support-Norm Regularized Minimization

To apply the Frank-Wolfe method, we reformulate (1) as the constrained optimization problem
$$\min_{w,\theta}\; G(w,\theta;\lambda) := f(w) + \lambda\theta \quad \text{s.t.}\quad (\|w\|_k^{sp})^2 \le \theta.$$
Here $\theta$ is an augmented variable bounding the term $(\|w\|_k^{sp})^2$. We introduce in Section 3.1 the k-FCFW algorithm for solving the above constrained formulation; the convergence rate of k-FCFW is analyzed in Section 3.2.

3.1 Algorithm Description

The k-FCFW algorithm for k-support-norm regularized minimization is outlined in Algorithm 1.

Algorithm 1: k-FCFW algorithm for the k-support-norm regularized problem.
Input: $f(w)$, $k$, $\lambda$. Initialization: $w^{(0)} = 0$, $\theta^{(0)} = 0$, $U^{(0)} = [w^{(0)}]$, $V^{(0)} = [\theta^{(0)}]$.
For $t = 1, 2, \dots$ do
(S1) Compute $\nabla f(w^{(t-1)})$.
(S2) Solve the linear subproblem $\{u^{(t)}, v^{(t)}\} = \arg\min_{u,v}\ \langle\nabla f(w^{(t-1)}), u\rangle + \lambda v$ s.t. $(\|u\|_k^{sp})^2 \le v$.  (3)
(S3) Update $U^{(t)} = [U^{(t-1)}, u^{(t)}]$, $V^{(t)} = [V^{(t-1)}, v^{(t)}]$.
(S4) Compute $\alpha^{(t)} = \arg\min_{\alpha\in\Delta_t}\ f(U^{(t)}\alpha) + \lambda V^{(t)}\alpha$, where $\Delta_t = \{\alpha\in\mathbb{R}^{t+1} : \alpha\ge 0,\ \|\alpha\|_1 = 1\}$.  (4)
(S5) Update $w^{(t)} = U^{(t)}\alpha^{(t)}$, $\theta^{(t)} = V^{(t)}\alpha^{(t)}$.
End. Output: $w^{(t)}$.

As noted in Section 1, the k-support-norm ball $B^{\varrho}_{k,sp} := \{w\in\mathbb{R}^p : \|w\|_k^{sp}\le\varrho\}$ is equivalent to the convex hull of $C^{(2)}_{k,\varrho} = \{w\in\mathbb{R}^p : \|w\|_0\le k,\ \|w\|_2\le\varrho\}$, that is, $B^{\varrho}_{k,sp} = \mathrm{conv}(C^{(2)}_{k,\varrho})$. Therefore, given $v > 0$, the optimal $u$ of (3) admits the closed-form solution
$$u = -\sqrt{v}\,\frac{\nabla_k f(w^{(t-1)})}{\|\nabla_k f(w^{(t-1)})\|}, \qquad (5)$$
where $\nabla_k f(w^{(t-1)})$ denotes a truncated version of $\nabla f(w^{(t-1)})$ with its top $k$ (in magnitude) entries preserved. Substituting this back into (3), we obtain $v^{(t)}$ through the one-dimensional problem
$$v^{(t)} = \arg\min_{v\ge 0}\; -\|\nabla_k f(w^{(t-1)})\|\sqrt{v} + \lambda v.$$
Obviously $v^{(t)} = \|\nabla_k f(w^{(t-1)})\|^2/(4\lambda^2)$, and thus
$$u^{(t)} = -\sqrt{v^{(t)}}\,\frac{\nabla_k f(w^{(t-1)})}{\|\nabla_k f(w^{(t-1)})\|} = -\frac{\nabla_k f(w^{(t-1)})}{2\lambda}.$$

At each time instance $t$, we record all previous updates in $\mathcal{U}^{(t)} = \{w^{(0)}, u^{(1)}, \dots, u^{(t)}\}$ and $\mathcal{V}^{(t)} = \{\theta^{(0)}, v^{(1)}, \dots, v^{(t)}\}$. At the $t$-th iteration, the optimal values of $w^{(t)}$ and $\theta^{(t)}$ are jointly searched over the convex hull defined by $U^{(t)}$ and $V^{(t)}$. The subproblem (4) of estimating $\alpha^{(t)}$ is a simplex-constrained smooth minimization problem whose scale is dominated by the value of $t$. It can be solved via off-the-shelf algorithms such as the projected quasi-Newton (PQN) method [Schmidt et al., 2009], which is used in our implementation.
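As a complement to this description, here is a minimal, self-contained sketch of Algorithm 1 for a least-squares loss $f(w) = \frac{1}{2}\|Xw - y\|^2$ (our illustration, not the Matlab implementation used in the experiments). Step (S2) uses the closed form derived above, and for step (S4) we substitute a simple projected-gradient solver over the simplex in place of the PQN solver; the function and variable names are ours.

```python
import numpy as np

def project_simplex(z):
    """Euclidean projection of z onto {a : a >= 0, sum(a) = 1}."""
    u = np.sort(z)[::-1]
    css = np.cumsum(u)
    idx = np.arange(1, len(z) + 1)
    rho = np.nonzero(u * idx > css - 1.0)[0][-1]
    tau = (css[rho] - 1.0) / (rho + 1.0)
    return np.maximum(z - tau, 0.0)

def top_k_truncate(g, k):
    """Keep the k largest-magnitude entries of g, zero out the rest."""
    gk = np.zeros_like(g)
    top = np.argsort(np.abs(g))[-k:]
    gk[top] = g[top]
    return gk

def k_fcfw_least_squares(X, y, k, lam, n_iter=50, inner_iter=200):
    p = X.shape[1]
    U = [np.zeros(p)]    # atoms for w, starting from w^(0) = 0
    V = [0.0]            # atoms for the augmented variable theta
    w = np.zeros(p)
    for _ in range(n_iter):
        grad = X.T @ (X @ w - y)                 # (S1) gradient of f
        gk = top_k_truncate(grad, k)
        if np.linalg.norm(gk) == 0:
            break
        U.append(-gk / (2.0 * lam))                             # (S2) u^(t)
        V.append(np.linalg.norm(gk) ** 2 / (4.0 * lam ** 2))    # (S2) v^(t)
        Umat, Vvec = np.column_stack(U), np.array(V)            # (S3)
        # (S4) fully corrective step: min over the simplex of f(U a) + lam * V a,
        # solved here by projected gradient as a stand-in for PQN.
        XU = X @ Umat
        step = 1.0 / max(np.linalg.norm(XU, 2) ** 2, 1e-12)
        alpha = np.ones(len(V)) / len(V)
        for _ in range(inner_iter):
            g_alpha = XU.T @ (XU @ alpha - y) + lam * Vvec
            alpha = project_simplex(alpha - step * g_alpha)
        w = Umat @ alpha                          # (S5) w^(t) = U^(t) alpha^(t)
    return w

# Tiny usage example.
rng = np.random.default_rng(0)
X = rng.standard_normal((40, 100))
y = X[:, :5] @ np.ones(5) + 0.01 * rng.standard_normal(40)
w_hat = k_fcfw_least_squares(X, y, k=5, lam=0.5)
print("residual norm:", round(float(np.linalg.norm(X @ w_hat - y)), 3))
```

The inner projected-gradient loop is only a stand-in; any simplex-constrained smooth solver (e.g., PQN) can be plugged into step (S4) without changing the rest of the algorithm.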
It is noteworthy that the subproblem (3) is equivalent to the following k-support-norm regularized linear program:
$$u^{(t)} = \arg\min_{u}\; \langle\nabla f(w^{(t-1)}), u\rangle + \lambda(\|u\|_k^{sp})^2.$$
This is in contrast to the proximal gradient method, which solves the quadratic proximity operator (2) at each iteration. The former is clearly less expensive to solve than the latter, which involves exhaustive search. In addition, when $t$ is of moderate value and warm start is adopted to initialize the parameters, the subproblem (4) can often be accurately solved within very few iterations. In sparse learning problems the value of $k$ is often much smaller than the model parameter dimension, hence the overall computational cost of k-FCFW is expected to be lower than that of the proximal gradient algorithms.

Given a constant radius $\varrho_c > 0$, the k-support-norm constrained minimization problem
$$\min_{w}\; f(w) \quad \text{s.t.}\quad \|w\|_k^{sp} \le \varrho_c \qquad (6)$$
is essentially a special case of the regularized form, obtained by directly fixing $\theta = \varrho_c^2$. The proposed algorithm can be easily modified to solve the constrained model (6).

3.2 Convergence Analysis

To analyze the convergence, we need the following key technical conditions imposed on the curvature of the objective function $f$ restricted on sparse subspaces.

Definition 1 (Smoothness and restricted strong convexity). We say $f$ is $L$-smooth if there exists a positive constant $L$ such that for any $w$ and $w'$,
$$f(w') - f(w) - \langle\nabla f(w), w' - w\rangle \le \frac{L}{2}\|w - w'\|^2. \qquad (7)$$
We say $f$ is $\rho(s)$-strongly convex at sparsity level $s$ if there exists a positive constant $\rho(s)$ such that for any $\|w - w'\|_0 \le s$,
$$f(w') - f(w) - \langle\nabla f(w), w' - w\rangle \ge \frac{\rho(s)}{2}\|w - w'\|^2. \qquad (8)$$

In the analysis to follow, we define $F(w;\lambda) := f(w) + \lambda(\|w\|_k^{sp})^2$ and $\bar w = \arg\min_w F(w;\lambda)$. Let $\bar s = \|\bar w\|_0$ and $\bar\theta = (\|\bar w\|_k^{sp})^2$. Consider the radius
$$r := \max\Big\{ \frac{\|\nabla_k f(w)\|}{2\lambda} \;:\; F(w;\lambda) \le F(w^{(0)};\lambda) \Big\}.$$
We now analyze the convergence of Algorithm 1. Before presenting the main result, we need some preliminaries.

Lemma 1. There exist $\bar U = [\bar u_1, \dots, \bar u_{\bar l}] \in \mathbb{R}^{p\times\bar l}$ with $\|\bar u_i\|_0 \le k$, $\|\bar u_i\|_2 = \sqrt{\bar\theta}$, and $\bar\alpha \in \Delta_{\bar l}$ such that $\bar w = \bar U\bar\alpha$.

Please refer to Appendix A.1 for the proof of Lemma 1. In the following analysis, we will consider such a decomposition $\bar w = \bar U\bar\alpha$ as guaranteed by Lemma 1. Given a matrix $M$, we write its calligraphic version $\mathcal{M}$ for the set consisting of the columns of $M$. Similarly, given a set $\mathcal{M}$ of vectors of the same size, we denote by $M$ a matrix whose columns are the elements of $\mathcal{M}$. Let $\sigma_{\min}(M)$ be the smallest singular value of the matrix $M$. We establish in the following theorem a linear rate of convergence for Algorithm 1.

Theorem 1. Let $s = \max_t \|w^{(t)}\|_0$ and $\mathcal{M}^{(t)} = \bar{\mathcal{U}}\cup\mathcal{U}^{(t)}$. Assume that there exists $\beta > 0$ such that $\sigma_{\min}(M^{(t)}) \ge \beta$ for all $t$. Assume that $f$ is $L$-smooth and $\rho(s+\bar s)$-strongly convex at sparsity level $s+\bar s$. Given $\epsilon > 0$, let us run Algorithm 1 with
$$t \ge \frac{1}{\eta}\ln\frac{F(w^{(0)};\lambda) - F(\bar w;\lambda)}{\epsilon}, \quad\text{where}\quad \eta := \min\Big\{ \frac{\rho(s+\bar s)\beta^2}{4\bar l L r^2},\ 1 \Big\}.$$
Then Algorithm 1 will output $w^{(t)}$ satisfying $F(w^{(t)};\lambda) \le F(\bar w;\lambda) + \epsilon$.

The proof of Theorem 1 is given in Appendix A.2.
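To make Definition 1 and the constants appearing in Theorem 1 concrete: for the least-squares loss $f(w) = \frac{1}{2}\|Xw - y\|^2$ one has $f(w') - f(w) - \langle\nabla f(w), w' - w\rangle = \frac{1}{2}\|X(w' - w)\|^2$, so $L$ is the largest eigenvalue of $X^\top X$ and $\rho(s)$ is the smallest eigenvalue among the principal submatrices of $X^\top X$ indexed by $s$ columns. The brute-force sketch below (ours, feasible only for tiny $p$) computes both constants.

```python
import itertools
import numpy as np

def least_squares_constants(X, s):
    """Smoothness L and restricted strong convexity rho(s) of f(w) = 0.5*||Xw - y||^2.

    For this f, f(w') - f(w) - <grad f(w), w' - w> = 0.5 * ||X (w' - w)||^2,
    so L is the largest eigenvalue of X^T X and rho(s) is the smallest
    eigenvalue over all s-column principal submatrices (brute force, tiny p only).
    """
    G = X.T @ X
    L = np.linalg.eigvalsh(G)[-1]
    p = X.shape[1]
    rho_s = min(
        np.linalg.eigvalsh(G[np.ix_(S, S)])[0]
        for S in itertools.combinations(range(p), s)
    )
    return L, rho_s

rng = np.random.default_rng(0)
X = rng.standard_normal((30, 8)) / np.sqrt(30)
L, rho = least_squares_constants(X, s=3)
print(L >= rho > 0)  # with n > p and generic X, every 3-column submatrix is well conditioned
```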
Remark 1. In general constrained minimization problems, the Frank-Wolfe method is known to have an $O(1/t)$ convergence rate [Jaggi, 2013], improved to $O(1/t^2)$ when the constraint set is strongly convex [Garber and Hazan, 2014]. Several linear convergence guarantees have been established by adding various specific assumptions on either the constraint set or the loss function, e.g., [Beck and Teboulle, 2004; Ñanculef et al., 2014], but they are not directly applicable to our problem. In a recent work [Lacoste-Julien and Jaggi, 2015], a global linear convergence rate was proved for constrained optimization over a polytope. Their analysis, however, does not apply to our algorithm, as the constraint $(\|w\|_k^{sp})^2 \le \theta$ defines a k-support-norm cone rather than a polytope, which imposes extra challenges in the analysis. Another related work is [Harchaoui et al., 2015], which also applies the Frank-Wolfe method to regularized minimization problems. It is restrictive, however, in that it requires an estimate of a bound on the regularizer, which is hard to obtain in practice. Also, the sublinear convergence rate established in that paper is weaker than the linear rate proved in Theorem 1.

Remark 2. In Algorithm 1 we have required the subproblem (4) in step (S4) to be solved exactly. This could be computationally demanding if the objective function $f$ is highly nonlinear and $t$ is relatively large. Instead of solving subproblem (4) exactly, a more realistic option in practice is to find a suboptimal solution up to a precision $\varepsilon > 0$ with respect to the first-order optimality condition; that is, $\{w^{(t)}, \theta^{(t)}\}$ satisfy, for any $w = U^{(t)}\alpha$ and $\theta = V^{(t)}\alpha$ with $\alpha\in\Delta_t$,
$$\langle\nabla f(w^{(t)}), w - w^{(t)}\rangle + \lambda(\theta - \theta^{(t)}) \ge -\varepsilon.$$
Following arguments similar to those in the proof of Theorem 1, we can prove that $F(w^{(t)};\lambda) \le F(\bar w;\lambda) + \epsilon + O(\varepsilon)$ after $t = O\big(\frac{1}{\eta}\ln\frac{1}{\epsilon}\big)$ iterations. In other words, the optimization error of the subproblem (4) does not accumulate over the iterations.

4 Experiments

This section is devoted to the empirical performance of k-FCFW in comparison with state-of-the-art algorithms for k-support-norm regularized minimization. All the considered algorithms are implemented in Matlab and tested on a computer equipped with a 3.0GHz CPU and 32GB RAM.

4.1 k-Support-Norm $\ell_2$-Logistic Regression

Given a binary training set $\{x_m, y_m\}_{m=1}^M$, $x_m\in\mathbb{R}^p$, $y_m\in\{-1,1\}$, the k-support-norm regularized logistic regression problem is formulated as
$$\min_{w}\; \sum_{m=1}^M \ell(w; x_m, y_m) + \frac{\mu}{2}\|w\|^2 + \lambda(\|w\|_k^{sp})^2, \qquad (9)$$
where $\ell(w; x_m, y_m) = \log\big(1 + \exp(-y_m w^\top x_m)\big)$ is the logistic loss on a training pair $(x_m, y_m)$ and the parameter $\mu$ controls the strong convexity of the loss function.

We test the efficiency of the algorithms on a synthetic dataset. The ground-truth model parameter $w_{gt}$ is designed to be a $p$-dimensional vector of the form
$$w_{gt} = [\underbrace{1, 1, \dots, 1}_{p_0},\ \underbrace{0, 0, \dots, 0}_{p - p_0}]^\top.$$
Within the support we partition the feature dimensions into $g$ groups, $\{x_{m,1}, \dots, x_{m,p_0/g}\}, \dots, \{x_{m,(g-1)p_0/g+1}, \dots, x_{m,p_0}\}$; the entries in each group follow a normal distribution whose group-wise mean is drawn from $N(0,1)$. The remaining dimensions are drawn from $N(0,1)$ as noise. Finally, each sample is normalized by $x_m = x_m/\|x_m\|$. The labels $\{y_m\}_{m=1}^M$ follow a Bernoulli distribution with probability $P(y_m = 1 \mid x_m) = \frac{\exp(w_{gt}^\top x_m)}{1+\exp(w_{gt}^\top x_m)}$. The task is to select the top $k$ most discriminative features for classification with the logistic regression model by solving (9). We produce the training data by setting $M = 500$, $p = 10^6$, $g = 20$, $p_0 = 10000$, and $k = 2000, 4000, 6000, 8000, 10000$, respectively. The parameter $\lambda$ is selected by grid search according to the performance on a validation set of size 100. We set the termination criterion to $\frac{|F(w^{(t)}) - F(w^{(t-1)})|}{F(w^{(t-1)})} \le 10^{-4}$. We replicate the experiment 10 times and report the mean of the numerical results.
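The following sketch reproduces this data-generation protocol at a much smaller scale than the $p = 10^6$ experiments (our reading of the description above; any detail not stated there, such as the unit variance within groups, is an assumption):

```python
import numpy as np

def make_synthetic(M=500, p=2000, p0=200, g=20, seed=0):
    """Synthetic logistic-regression data in the style of Section 4.1.

    Ground truth: first p0 coordinates equal 1, the rest 0. Within the
    support, features are split into g groups; each group's entries are
    normal with a group-specific mean drawn from N(0, 1). The remaining
    p - p0 coordinates are N(0, 1) noise. Rows are normalized to unit norm
    and labels are Bernoulli with the logistic probability.
    """
    rng = np.random.default_rng(seed)
    w_gt = np.concatenate([np.ones(p0), np.zeros(p - p0)])

    X = rng.standard_normal((M, p))          # noise dimensions ~ N(0, 1)
    group_size = p0 // g
    group_means = rng.standard_normal(g)     # one mean per group, drawn from N(0, 1)
    for j in range(g):
        cols = slice(j * group_size, (j + 1) * group_size)
        X[:, cols] = group_means[j] + rng.standard_normal((M, group_size))

    X /= np.linalg.norm(X, axis=1, keepdims=True)   # x_m <- x_m / ||x_m||
    prob = 1.0 / (1.0 + np.exp(-X @ w_gt))          # P(y_m = 1 | x_m)
    y = np.where(rng.random(M) < prob, 1, -1)
    return X, y, w_gt

X, y, w_gt = make_synthetic()
print(X.shape, y[:10])
```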
We compare the efficiency of k-FCFW with three state-of-the-art proximal gradient methods: (1) the Box-Norm solver (denoted BN) proposed in [McDonald et al., 2014]; (2) the binary-search based solver (denoted BS) proposed in [Lai et al., 2014]; and (3) the solver proposed in [Eriksson et al., 2015], which finds the active set (denoted AS) by a two-step binary search strategy. All of these proximal gradient solvers are implemented in the framework of FISTA [Beck and Teboulle, 2009]. The running times of the considered algorithms are shown in Figure 1(a). It can be observed that our method is significantly faster than all three competing solvers.

We also compare the efficiency of k-FCFW with ADMM [Parikh and Boyd, 2013], another popular framework for regularized minimization problems. Since AS has been observed to be superior to the other considered proximity operator solvers, we equip ADMM with AS as its proximity operator solver. The running time curves of ADMM-AS are also drawn in Figure 1(a). Clearly, ADMM-AS is inferior to k-FCFW as well as to the proximal gradient algorithms. In fact, we observe that ADMM-AS fails to converge to the desired accuracy within the maximum number of iterations. In Figure 1(b), we plot the convergence curves of the considered algorithms under k = 2000 and 10000. It can be observed that our method needs significantly fewer iterations to reach comparable optimization accuracy.

Figure 1: Results on the synthetic dataset: (a) running time (in seconds) curves of the compared methods under different values of k; (b) convergence curves of the compared methods under k = 2000 and 10000. All curves are drawn from the starting point F(w^{(1)}).

4.2 k-Support-Norm Matrix Pursuit

In this group of experiments, we apply the proposed method to the k-support-norm regularized matrix pursuit problem. In many graph-based machine learning algorithms, such as semi-supervised classification and subspace segmentation, a key step is learning the sample affinity matrix [Liu et al., 2010; Yuan and Li, 2014], and matrix pursuit has proved to be an effective approach. The results of [Lai et al., 2014] indicate that the k-support-norm regularized matrix pursuit method achieves superior performance in various applications. The k-support-norm regularized matrix pursuit is formulated as
$$\min_{W\in\mathbb{R}^{n\times n}}\; \frac{1}{2}\|X - XW\|_F^2 + \lambda(\|\mathrm{vec}(W)\|_k^{sp})^2,$$
where $X\in\mathbb{R}^{d\times n}$ is the data matrix with $n$ samples in $d$-dimensional space and $\mathrm{vec}(W)$ denotes the vectorization of $W$.

The MNIST [LeCun et al., 1998] and USPS [Hull, 1994] datasets are adopted for testing. For the MNIST dataset, we resize each image to $14\times 14$ and normalize the gray values into [0, 1]; the pixel values are then vectorized as the image feature. The USPS dataset is preprocessed as in [Cai et al., 2011]: each image is represented by a 256-dimensional feature vector with values normalized into [-1, 1]. Each image in both datasets is further normalized to be a unit vector. We select 100 images per digit from MNIST and 200 images per digit from USPS, so the dataset sizes are 1000 and 2000, respectively. We use the same termination criterion as in the previous experiment. The algorithms are tested under k = 4000, 6000, 8000, 10000, and 12000 in the k-support-norm, and the regularization parameter is set to λ = 20.

We first compare k-FCFW with the three FISTA algorithms that respectively employ the proximity operator solvers BN, BS, and AS. The comparison of average time cost over 10 replications is illustrated in Figure 2, where the time cost curves of ADMM-AS are also shown. The convergence curves of the considered methods, evaluated by F(W), for k = 4000 and 12000 are shown in Figure 3.

Figure 2: Running time (in seconds) curves of the compared methods on (a) the MNIST-1000 dataset and (b) the USPS-2000 dataset.
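As an implementation note for this experiment: k-FCFW operates on $\mathrm{vec}(W)$, so it only needs the value and gradient of $f(W) = \frac{1}{2}\|X - XW\|_F^2$, whose gradient is $\nabla f(W) = -X^\top(X - XW)$. A small wrapper (ours, for illustration) with a finite-difference check:

```python
import numpy as np

def matrix_pursuit_f_and_grad(X):
    """Return f and grad f for f(W) = 0.5 * ||X - X W||_F^2, in vec(W) form."""
    d, n = X.shape

    def f(w_vec):
        W = w_vec.reshape(n, n)
        R = X - X @ W
        return 0.5 * np.sum(R * R)

    def grad(w_vec):
        W = w_vec.reshape(n, n)
        G = -X.T @ (X - X @ W)        # gradient of f with respect to W
        return G.ravel()

    return f, grad

# Quick finite-difference check of the gradient on a tiny problem.
rng = np.random.default_rng(0)
X = rng.standard_normal((6, 4))
f, grad = matrix_pursuit_f_and_grad(X)
w = rng.standard_normal(16)
e = np.zeros(16); e[3] = 1e-6
fd = (f(w + e) - f(w - e)) / 2e-6
print(np.isclose(fd, grad(w)[3], rtol=1e-4))  # True
```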
From the results we can see that in these cases k-FCFW is the most efficient method for the optimization, and as the value of k increases, its efficiency advantage becomes more significant.

Figure 3: Convergence curves of the compared methods on (a) the MNIST-1000 and (b) the USPS-2000 dataset under k = 4000 and 12000. The starting point of each curve is F(W^{(1)}).

5 Conclusion

In this paper, we proposed k-FCFW, a fully corrective Frank-Wolfe algorithm for k-support-norm regularized loss minimization. We established a linear rate of convergence for the proposed algorithm, which to our knowledge is new for Frank-Wolfe-type algorithms applied to composite formulations. Compared with the conventionally adopted proximal gradient algorithms and ADMM, k-FCFW enjoys a superior rate of convergence in theory and in practice. Numerical results on logistic regression and matrix pursuit applications confirm that k-FCFW is significantly more efficient than several state-of-the-art proximal gradient methods, especially in large-scale settings. Both the theoretical analysis and the empirical observations therefore suggest that k-FCFW is a computationally attractive alternative to the proximal gradient algorithms for solving k-support-norm regularized minimization problems. As a future study along this line, we will investigate the generalization of k-FCFW to generic group-sparsity-norm regularized minimization problems, of which the model considered in this paper is a special case.

A Appendix

A.1 Proof of Lemma 1

Proof. Consider
$$\tilde{\mathcal{U}} = \arg\min\Big\{ \sum_i \|u_i\|_2 \;:\; \|u_i\|_0 \le k,\ \bar w = \sum_i u_i \Big\}.$$
Let $\bar l = |\tilde{\mathcal{U}}|$ and write $\tilde{\mathcal{U}} = \{\tilde u_i\}_{i=1}^{\bar l}$. Based on the definition of the k-support-norm we have $\bar w = \sum_i \tilde u_i$ and $\sum_i \|\tilde u_i\|_2 = \|\bar w\|_k^{sp} = \sqrt{\bar\theta}$. We construct $\bar U = [\bar u_1, \dots, \bar u_{\bar l}]$ with $\bar u_i := \sqrt{\bar\theta}\,\tilde u_i/\|\tilde u_i\|$. Then $\bar w$ admits a decomposition $\bar w = \bar U\bar\alpha$ with some $\bar\alpha$ lying in the $\bar l$-dimensional simplex $\Delta_{\bar l}$. Indeed, since $\bar w = \sum_i \bar u_i(\|\tilde u_i\|/\sqrt{\bar\theta})$, we may define $\bar\alpha_i = \|\tilde u_i\|/\sqrt{\bar\theta}$, so that $\sum_i \bar\alpha_i = \sum_i \|\tilde u_i\|_2/\sqrt{\bar\theta} = 1$. $\square$

A.2 Proof of Theorem 1

Proof. From the definition of $G(w,\theta;\lambda)$, step (S4) of Algorithm 1, and the assumption that $f$ is $L$-smooth, we have, for any $\gamma\in[0,1]$,
$$\begin{aligned}
G(w^{(t)},\theta^{(t)};\lambda) &= f(U^{(t)}\alpha^{(t)}) + \lambda V^{(t)}\alpha^{(t)} \\
&\le f\big((1-\gamma)U^{(t-1)}\alpha^{(t-1)} + \gamma u^{(t)}\big) + \lambda\big((1-\gamma)V^{(t-1)}\alpha^{(t-1)} + \gamma v^{(t)}\big) \\
&= f\big((1-\gamma)w^{(t-1)} + \gamma u^{(t)}\big) + \lambda\big((1-\gamma)\theta^{(t-1)} + \gamma v^{(t)}\big) \\
&\le f(w^{(t-1)}) + \gamma\Gamma(u^{(t)}) + \frac{\gamma^2}{2}r^2 L + \lambda\big[(1-\gamma)\theta^{(t-1)} + \gamma v^{(t)}\big] \\
&= f(w^{(t-1)}) + \lambda\theta^{(t-1)} + \gamma\Phi(u^{(t)}, v^{(t)}) + \frac{\gamma^2}{2}r^2 L \\
&= G(w^{(t-1)},\theta^{(t-1)};\lambda) + \gamma\Phi(u^{(t)}, v^{(t)}) + \frac{\gamma^2}{2}r^2 L,
\end{aligned}$$
where
$$\Gamma(u^{(t)}) = \langle\nabla f(w^{(t-1)}), u^{(t)} - w^{(t-1)}\rangle, \qquad \Phi(u^{(t)}, v^{(t)}) = \Gamma(u^{(t)}) + \lambda(v^{(t)} - \theta^{(t-1)}).$$
For simplicity, let us now denote $x^{(t)} = [w^{(t)}; \theta^{(t)}]\in\mathbb{R}^{p+1}$ as the concatenation of $w^{(t)}$ and $\theta^{(t)}$, and let $\bar x := [\bar w; \bar\theta]$. We define $\bar V = [\bar\theta, \dots, \bar\theta]\in\mathbb{R}^{1\times\bar l}$. Similarly, we denote $\bar X = [\bar U; \bar V]\in\mathbb{R}^{(p+1)\times\bar l}$ and $X^{(t)} = [U^{(t)}; V^{(t)}]\in\mathbb{R}^{(p+1)\times(t+1)}$. By writing $G(x^{(t)};\lambda) = G(w^{(t)},\theta^{(t)};\lambda)$ and $\tilde x^{(t)} := [u^{(t)}; v^{(t)}]$, the preceding inequality can be equivalently written as
$$G(x^{(t)};\lambda) \le G(x^{(t-1)};\lambda) + \gamma\langle\nabla G(x^{(t-1)};\lambda), \tilde x^{(t)} - x^{(t-1)}\rangle + \frac{\gamma^2}{2}r^2 L.$$
Let $\bar{\mathcal{X}}_c := \bar{\mathcal{X}}\setminus\mathcal{X}^{(t-1)}$ and assume $\bar{\mathcal{X}}_c \ne \emptyset$. From the update rule of $\{u^{(t)}, v^{(t)}\}$ in (S2), the following inequality holds for any $x\in\bar{\mathcal{X}}_c$:
$$\langle\nabla G(x^{(t-1)};\lambda), \tilde x^{(t)} - x^{(t-1)}\rangle \le \langle\nabla G(x^{(t-1)};\lambda), x - x^{(t-1)}\rangle.$$
Let $\bar\alpha_c := \sum_{x\in\bar{\mathcal{X}}_c}\bar\alpha_x$. From the above two inequalities we get
$$G(x^{(t)};\lambda) \le G(x^{(t-1)};\lambda) + \gamma\Big[\sum_{x\in\bar{\mathcal{X}}_c}\frac{\bar\alpha_x}{\bar\alpha_c}\langle\nabla G(x^{(t-1)};\lambda), x\rangle - \langle\nabla G(x^{(t-1)};\lambda), x^{(t-1)}\rangle\Big] + \frac{\gamma^2}{2}r^2 L. \qquad (11)$$
Since $\sum_{x\in\mathcal{X}^{(t-1)}}\bar\alpha_x/(1-\bar\alpha_c) = 1$, from the optimality of $\alpha^{(t-1)}$ (see step (S4)) we can derive that
$$\Big\langle\nabla G(x^{(t-1)};\lambda),\ \sum_{x\in\mathcal{X}^{(t-1)}}\frac{\bar\alpha_x}{1-\bar\alpha_c}\,x - x^{(t-1)}\Big\rangle \ge 0. \qquad (12)$$
Note that $\alpha^{(t-1)}_x = 0$ for $x\notin\mathcal{X}^{(t-1)}$ and $\bar\alpha_x = 0$ for $x\notin\bar{\mathcal{X}}$.
Therefore,
$$\begin{aligned}
\sum_{x\in\bar{\mathcal{X}}_c}\bar\alpha_x\langle\nabla G(x^{(t-1)};\lambda), x\rangle
&\le \sum_{x\in\bar{\mathcal{X}}_c}\bar\alpha_x\langle\nabla G(x^{(t-1)};\lambda), x\rangle + \Big\langle\nabla G(x^{(t-1)};\lambda),\ \sum_{x\in\mathcal{X}^{(t-1)}}\bar\alpha_x x - (1-\bar\alpha_c)\,x^{(t-1)}\Big\rangle \\
&= \Big\langle\nabla G(x^{(t-1)};\lambda),\ \sum_{x\in\mathcal{X}^{(t-1)}\cup\bar{\mathcal{X}}}\bar\alpha_x x - (1-\bar\alpha_c)\,x^{(t-1)}\Big\rangle \\
&= \langle\nabla G(x^{(t-1)};\lambda),\ \bar x - (1-\bar\alpha_c)\,x^{(t-1)}\rangle \\
&= \langle\nabla G(x^{(t-1)};\lambda),\ \bar x - x^{(t-1)}\rangle + \bar\alpha_c\langle\nabla G(x^{(t-1)};\lambda), x^{(t-1)}\rangle,
\end{aligned}$$
where the inequality follows from (12). Combining the preceding inequality with (8) we obtain
$$\begin{aligned}
\sum_{x\in\bar{\mathcal{X}}_c}\bar\alpha_x\langle\nabla G(x^{(t-1)};\lambda), x\rangle - \bar\alpha_c\langle\nabla G(x^{(t-1)};\lambda), x^{(t-1)}\rangle
&\le \langle\nabla G(x^{(t-1)};\lambda),\ \bar x - x^{(t-1)}\rangle \\
&= \langle\nabla f(w^{(t-1)}), \bar w - w^{(t-1)}\rangle + \lambda(\bar\theta - \theta^{(t-1)}) \\
&\le f(\bar w) - f(w^{(t-1)}) - \frac{\rho(s+\bar s)}{2}\|w^{(t-1)} - \bar w\|^2 + \lambda(\bar\theta - \theta^{(t-1)}) \\
&= G(\bar x;\lambda) - G(x^{(t-1)};\lambda) - \frac{\rho(s+\bar s)}{2}\|w^{(t-1)} - \bar w\|^2 \\
&\le G(\bar x;\lambda) - G(x^{(t-1)};\lambda) - \frac{\rho(s+\bar s)\beta^2}{2}\sum_{u\in\bar{\mathcal{U}}\setminus\mathcal{U}^{(t-1)}}\bar\alpha_u^2 \\
&\le G(\bar x;\lambda) - G(x^{(t-1)};\lambda) - \frac{\rho(s+\bar s)\beta^2\bar\alpha_c^2}{2\bar l},
\end{aligned}$$
where the last inequality follows from $\sum_{u\in\bar{\mathcal{U}}\setminus\mathcal{U}^{(t-1)}}\bar\alpha_u^2 \ge \big(\sum_{u\in\bar{\mathcal{U}}\setminus\mathcal{U}^{(t-1)}}\bar\alpha_u\big)^2/\bar l = \bar\alpha_c^2/\bar l$. By combining the above with (11) (using $\bar\alpha_c \le 1$ and $G(\bar x;\lambda) \le G(x^{(t-1)};\lambda)$) we get
$$G(x^{(t)};\lambda) \le G(x^{(t-1)};\lambda) - \gamma\Big[G(x^{(t-1)};\lambda) - G(\bar x;\lambda) + \frac{\rho(s+\bar s)\beta^2\bar\alpha_c}{2\bar l}\Big] + \frac{\gamma^2}{2}r^2 L. \qquad (13)$$
Let us choose $\gamma = \eta$ in the above inequality, with $\eta := \min\big\{\frac{\rho(s+\bar s)\beta^2}{4\bar l L r^2},\ 1\big\}$. Then we have
$$G(x^{(t)};\lambda) \le G(x^{(t-1)};\lambda) - \eta\big(G(x^{(t-1)};\lambda) - G(\bar x;\lambda)\big).$$
Let us denote $\delta_t := G(x^{(t)};\lambda) - G(\bar x;\lambda)$. Applying this inequality recursively we obtain $\delta_t \le \delta_0(1-\eta)^t$. Using the inequality $1 - x \le \exp(-x)$ and rearranging, we get $\delta_t \le \delta_0\exp(-\eta t)$. When $t \ge \frac{1}{\eta}\ln\frac{\delta_0}{\epsilon}$, it is guaranteed that $\delta_t \le \epsilon$ (note that $\delta_0 = F(w^{(0)};\lambda) - F(\bar w;\lambda)$ since $w^{(0)} = 0$ and $\theta^{(0)} = 0$). The desired result follows directly from
$$F(w^{(t)};\lambda) - F(\bar w;\lambda) \le G(w^{(t)},\theta^{(t)};\lambda) - G(\bar x;\lambda) = \delta_t. \qquad \square$$

Acknowledgments

Xiao-Tong Yuan was supported partially by the National Natural Science Foundation of China under Grants 61402232 and 61522308, and partially by the Natural Science Foundation of Jiangsu Province of China under Grant BK20141003. Qingshan Liu was supported partially by the National Natural Science Foundation of China under Grant 61532009.

References

[Argyriou et al., 2012] Andreas Argyriou, Rina Foygel, and Nathan Srebro. Sparse prediction with the k-support norm. In Advances in Neural Information Processing Systems, 2012.
[Beck and Teboulle, 2004] Amir Beck and Marc Teboulle. A conditional gradient method with linear rate of convergence for solving convex linear systems. Mathematical Methods of Operations Research, 59(2):235–247, 2004.
[Beck and Teboulle, 2009] Amir Beck and Marc Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences, 2(1):183–202, 2009.
[Belilovsky et al., 2015a] Eugene Belilovsky, Andreas Argyriou, Gaël Varoquaux, and Matthew Blaschko. Convex relaxations of penalties for sparse correlated variables with bounded total variation. Machine Learning, 100(2–3):533–553, 2015.
[Belilovsky et al., 2015b] Eugene Belilovsky, Katerina Gkirtzou, Michail Misyrlis, Anna B. Konova, Jean Honorio, Nelly Alia-Klein, Rita Z. Goldstein, Dimitris Samaras, and Matthew B. Blaschko. Predictive sparse modeling of fMRI data for improved classification, regression, and visualization using the k-support norm. Computerized Medical Imaging and Graphics, 46:40–46, 2015.
[Blaschko, 2013] Matthew Blaschko. A note on k-support norm regularized risk minimization. arXiv preprint arXiv:1303.6390, 2013.
[Cai et al., 2011] Deng Cai, Xiaofei He, Jiawei Han, and Thomas S. Huang. Graph regularized nonnegative matrix factorization for data representation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33(8):1548–1560, 2011.
[Chatterjee et al., 2014] Soumyadeep Chatterjee, Sheng Chen, and Arindam Banerjee. Generalized Dantzig selector: Application to the k-support norm. In Advances in Neural Information Processing Systems, 2014.
[Dudik et al., 2012] Miro Dudik, Zaid Harchaoui, and Jérôme Malick. Lifted coordinate descent for learning with trace-norm regularization. In International Conference on Artificial Intelligence and Statistics, 2012.
[Eriksson et al., 2015] Anders Eriksson, Trung Thanh Pham, Tat-Jun Chin, and Ian Reid. The k-support norm and convex envelopes of cardinality and rank. In IEEE Conference on Computer Vision and Pattern Recognition, 2015.
[Frank and Wolfe, 1956] Marguerite Frank and Philip Wolfe. An algorithm for quadratic programming. Naval Research Logistics Quarterly, 3(1–2):95–110, 1956.
[Garber and Hazan, 2014] Dan Garber and Elad Hazan. Faster rates for the Frank-Wolfe method over strongly-convex sets. In International Conference on Machine Learning, 2014.
[Grave et al., 2011] Edouard Grave, Guillaume R. Obozinski, and Francis R. Bach. Trace Lasso: a trace norm regularization for correlated designs. In Advances in Neural Information Processing Systems, 2011.
[Grubb and Bagnell, 2011] Alexander Grubb and J. Andrew Bagnell. Generalized boosting algorithms for convex optimization. In International Conference on Machine Learning, 2011.
[Harchaoui et al., 2015] Zaid Harchaoui, Anatoli Juditsky, and Arkadi Nemirovski. Conditional gradient algorithms for norm-regularized smooth convex optimization. Mathematical Programming, 152:75–112, 2015.
[Hull, 1994] Jonathan J. Hull. A database for handwritten text recognition research. IEEE Transactions on Pattern Analysis and Machine Intelligence, 16(5):550–554, 1994.
[Jaggi, 2013] Martin Jaggi. Revisiting Frank-Wolfe: Projection-free sparse convex optimization. In International Conference on Machine Learning, 2013.
[Kim and Kim, 2004] Y. Kim and J. Kim. Gradient Lasso for feature selection. In International Conference on Machine Learning, 2004.
[Lacoste-Julien and Jaggi, 2015] Simon Lacoste-Julien and Martin Jaggi. On the global linear convergence of Frank-Wolfe optimization variants. In Advances in Neural Information Processing Systems, 2015.
[Lacoste-Julien et al., 2013] Simon Lacoste-Julien, Martin Jaggi, Mark Schmidt, and Patrick Pletscher. Block-coordinate Frank-Wolfe optimization for structural SVMs. In International Conference on Machine Learning, 2013.
[Lai et al., 2014] Hanjiang Lai, Yan Pan, Canyi Lu, Yong Tang, and Shuicheng Yan. Efficient k-support matrix pursuit. In European Conference on Computer Vision, 2014.
[LeCun et al., 1998] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
[Liu et al., 2010] Bo Liu, Meng Wang, Richang Hong, Zhengjun Zha, and Xian-Sheng Hua. Joint learning of labels and distance metric. IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics, 40(3):973–978, 2010.
[McDonald et al., 2014] Andrew M. McDonald, Massimiliano Pontil, and Dimitris Stamos. Spectral k-support norm regularization. In Advances in Neural Information Processing Systems, 2014.
[Ñanculef et al., 2014] Ricardo Ñanculef, Emanuele Frandi, Claudio Sartori, and Héctor Allende. A novel Frank-Wolfe algorithm: analysis and applications to large-scale SVM training. Information Sciences, 285:66–99, 2014.
[Parikh and Boyd, 2013] Neal Parikh and Stephen Boyd. Proximal algorithms. Foundations and Trends in Optimization, 1(3):123–231, 2013.
[Rao et al., 2015] Nikhil Rao, Parikshit Shah, and Stephen Wright. Forward-backward greedy algorithms for atomic norm regularization. IEEE Transactions on Signal Processing, 63(21):5798–5811, 2015.
[Schmidt et al., 2009] Mark W. Schmidt, Ewout van den Berg, Michael P. Friedlander, and Kevin P. Murphy. Optimizing costly functions with simple constraints: A limited-memory projected quasi-Newton algorithm.
In International Conference on Artificial Intelligence and Statistics, 2009.
[Shalev-Shwartz et al., 2010] Shai Shalev-Shwartz, Nathan Srebro, and Tong Zhang. Trading accuracy for sparsity in optimization problems with sparsity constraints. SIAM Journal on Optimization, 20(6):2807–2832, 2010.
[Yuan and Li, 2014] Xiao-Tong Yuan and Ping Li. Sparse additive subspace clustering. In European Conference on Computer Vision, 2014.
[Yuan and Yan, 2013] Xiao-Tong Yuan and Shuicheng Yan. Forward basis selection for pursuing sparse representations over a dictionary. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(12):3025–3036, 2013.
[Zhang, 2003] Tong Zhang. Sequential greedy approximation for certain convex optimization problems. IEEE Transactions on Information Theory, 49(3):682–691, 2003.
[Zou and Hastie, 2005] Hui Zou and Trevor Hastie. Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 67(2):301–320, 2005.