# Spectral k-Support Norm Regularization

Andrew M. McDonald, Massimiliano Pontil, Dimitris Stamos
Department of Computer Science, University College London
{a.mcdonald,m.pontil,d.stamos}@cs.ucl.ac.uk

Abstract

The k-support norm has successfully been applied to sparse vector prediction problems. We observe that it belongs to a wider class of norms, which we call the box-norms. Within this framework we derive an efficient algorithm to compute the proximity operator of the squared norm, improving upon the original method for the k-support norm. We extend the norms from the vector to the matrix setting and we introduce the spectral k-support norm. We study its properties and show that it is closely related to the multitask learning cluster norm. We apply the norms to real and synthetic matrix completion datasets. Our findings indicate that spectral k-support norm regularization gives state of the art performance, consistently improving over trace norm regularization and the matrix elastic net.

1 Introduction

In recent years there has been a great deal of interest in the problem of learning a low rank matrix from a set of linear measurements. A widely studied and successful instance of this problem arises in the context of matrix completion or collaborative filtering, in which we want to recover a low rank (or approximately low rank) matrix from a small sample of its entries, see e.g. [1, 2]. One prominent method to solve this problem is trace norm regularization: we look for a matrix which closely fits the observed entries and has a small trace norm (sum of singular values) [3, 4, 5]. Besides collaborative filtering, this problem has important applications ranging from multitask learning, to computer vision and natural language processing, to mention but a few.

In this paper, we propose new techniques to learn low rank matrices. These are inspired by the notion of the k-support norm [6], which was recently studied in the context of sparse vector prediction and shown to empirically outperform the Lasso [7] and Elastic Net [8] penalties. We note that this norm can naturally be extended to the matrix setting, and its characteristic properties relating to the cardinality operator translate in a natural manner to matrices. Our approach is suggested by the observation that the k-support norm belongs to a broader class of norms, which makes it apparent that they can be extended to spectral matrix norms. Moreover, it provides a link between the spectral k-support norm and the cluster norm, a regularizer introduced in the context of multitask learning [9]. This result allows us to interpret the spectral k-support norm as a special case of the cluster norm, and furthermore adds a new perspective on the cluster norm as a perturbation of the former.

The main contributions of this paper are threefold. First, we show that the k-support norm can be written as a parametrized infimum of quadratics, which we term the box-norms, and which are symmetric gauge functions. This allows us to extend the norms to orthogonally invariant matrix norms using a classical result by von Neumann [10]. Second, we show that the spectral box-norm is essentially equivalent to the cluster norm, which in turn can be interpreted as a perturbation of the spectral k-support norm, in the sense of the Moreau envelope [11]. Third, we use the infimum framework to compute the box-norm and the proximity operator of the squared norm in O(d log d) time.
Apart from improving on the O(d(k + log d)) algorithm in [6], this method allows one to use optimal first order optimization algorithms [12] with the cluster norm. Finally, we present numerical experiments which indicate that the spectral k-support norm shows a significant improvement in performance over regularization with the trace norm and the matrix elastic net, on four popular matrix completion benchmarks.

The paper is organized as follows. In Section 2 we recall the k-support norm and define the box-norm. In Section 3 we study their properties, we introduce the corresponding spectral norms, and we observe the connection to the cluster norm. In Section 4 we compute the norm and we derive a fast method to compute the proximity operator. Finally, in Section 5 we report on our numerical experiments. The supplementary material contains derivations of the results in the body of the paper.

2 Preliminaries

In this section, we recall the k-support norm and we introduce the box-norm and its dual. The k-support norm $\|\cdot\|_{(k)}$ was introduced in [6] as the norm whose unit ball is the convex hull of the set of vectors of cardinality at most k and 2-norm no greater than one. The authors show that the k-support norm can be written as the infimal convolution [11]
$$\|w\|_{(k)} = \inf\Big\{\sum_{g \in \mathcal{G}_k} \|v_g\|_2 : v_g \in \mathbb{R}^d,\ \mathrm{supp}(v_g) \subseteq g,\ \sum_{g \in \mathcal{G}_k} v_g = w\Big\}, \quad w \in \mathbb{R}^d, \qquad (1)$$
where $\mathcal{G}_k$ is the collection of all subsets of $\{1, \dots, d\}$ containing at most k elements, and for any $v \in \mathbb{R}^d$ the set $\mathrm{supp}(v) = \{i : v_i \ne 0\}$ denotes the support of v. When used as a regularizer, the norm encourages vectors w to be a sum of a limited number of vectors with small support. The k-support norm is a special case of the group lasso with overlap [13], in which the cardinality of the support sets is at most k. Despite the complicated form of the primal norm, the dual norm has a simple formulation, namely the 2-norm of the k largest components of the vector,
$$\|u\|_{(k),*} = \sqrt{\sum_{i=1}^{k} \big(|u|^\downarrow_i\big)^2}, \quad u \in \mathbb{R}^d, \qquad (2)$$
where $|u|^\downarrow$ is the vector obtained from u by reordering its components so that they are non-increasing in absolute value [6]. The k-support norm includes the 1-norm and 2-norm as special cases. This is clear from the dual norm, since for k = 1 and k = d it is equal to the $\infty$-norm and the 2-norm, respectively. We note that while definition (1) involves a combinatorial number of variables, [6] observed that the norm can be computed in O(d log d).

We now define the box-norm; in the following section we will show that the k-support norm is a special case of this family.

Definition 2.1. Let $0 \le a \le b$ and $c \in [ad, bd]$, and let $\Theta = \{\theta \in \mathbb{R}^d : a \le \theta_i \le b,\ \sum_{i=1}^d \theta_i \le c\}$. The box-norm is defined as
$$\|w\|_\Theta = \sqrt{\inf_{\theta \in \Theta} \sum_{i=1}^d \frac{w_i^2}{\theta_i}}, \quad w \in \mathbb{R}^d. \qquad (3)$$

This formulation will be fundamental in deriving the proximity operator in Section 4.1. Note that we may assume without loss of generality that b = 1, as by rescaling we obtain an equivalent norm; however we do not explicitly fix b in the sequel.

Proposition 2.2. The norm (3) is well defined and the dual norm is
$$\|u\|_{\Theta,*} = \sqrt{\max_{\theta \in \Theta} \sum_{i=1}^d \theta_i u_i^2}, \quad u \in \mathbb{R}^d.$$

The result holds true in the more general case that $\Theta$ is a bounded convex subset of the strictly positive orthant (for related results see [14, 15, 16, 17, 18, 19] and references therein). In this paper we limit ourselves to the box constraints above. In particular, we note that the constraints are invariant with respect to permutations of the components of $\theta$, and as we shall see this property is key to extending the norm to matrices.
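To make Definition 2.1 and Proposition 2.2 concrete, the following minimal Python sketch (our illustration; the function names and the use of a generic solver are ours, not part of the paper) evaluates the box-norm directly from the variational formula (3), and the dual norm of Proposition 2.2 by greedily allocating the budget c to the largest components of $u_i^2$. With a close to zero, b = 1 and c = k, the dual reduces to the 2-norm of the k largest components, as in (2). The fast O(d log d) computation is derived in Section 4.

```python
import numpy as np
from scipy.optimize import minimize

def box_norm(w, a, b, c):
    """||w||_Theta from definition (3): sqrt( inf_theta sum_i w_i^2 / theta_i )
    over Theta = {a <= theta_i <= b, sum_i theta_i <= c}. Generic-solver sketch,
    not the fast O(d log d) method of Section 4."""
    d = len(w)
    objective = lambda theta: np.sum(w ** 2 / theta)
    constraints = [{"type": "ineq", "fun": lambda theta: c - theta.sum()}]  # sum(theta) <= c
    theta0 = np.full(d, c / d)                                              # feasible start
    res = minimize(objective, theta0, bounds=[(a, b)] * d, constraints=constraints)
    return np.sqrt(res.fun)

def dual_box_norm(u, a, b, c):
    """||u||_{Theta,*} from Proposition 2.2: sqrt( max_theta sum_i theta_i u_i^2 ).
    The maximum of this linear function over Theta is attained greedily: start all
    theta_i at a and spend the remaining budget c - d*a on the largest u_i^2."""
    u2 = np.sort(u ** 2)[::-1]
    theta = np.full(len(u2), float(a))
    budget = c - a * len(u2)
    for i in range(len(u2)):
        add = min(b - a, max(budget, 0.0))
        theta[i] += add
        budget -= add
    return np.sqrt(theta @ u2)

rng = np.random.default_rng(0)
w = rng.standard_normal(10)
print("box-norm (a=0.2, b=1, c=5):", box_norm(w, 0.2, 1.0, 5.0))
# k-support limit (a -> 0, b = 1, c = k): the dual is the 2-norm of the k largest entries
k = 3
print("dual norm, k-support case :", dual_box_norm(w, 1e-6, 1.0, k))
print("2-norm of k largest |w_i| :", np.linalg.norm(np.sort(np.abs(w))[::-1][:k]))
```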
3 Properties of the Norms

In this section, we study the properties of the vector norms and we extend the norms to the matrix setting. We begin by deriving the dual box-norm.

Proposition 3.1. The dual box-norm is given by
$$\|u\|_{\Theta,*}^2 = a\|u\|_2^2 + (b-a)\,\|u\|_{(k),*}^2 + (b-a)(\rho - k)\big(|u|^\downarrow_{k+1}\big)^2, \qquad (4)$$
where $\rho = \frac{c - da}{b - a}$ and k is the largest integer not exceeding $\rho$.

We see from (4) that the dual norm decomposes into two 2-norms plus a residual term, which vanishes if $\rho = k$; for the rest of this paper we assume this holds, which loses little generality. Note that setting a = 0, b = 1 and $c = k \in \{1, \dots, d\}$, the dual box-norm (4) is the 2-norm of the largest k components of u, and we recover the dual k-support norm in equation (2). It follows that the k-support norm is a box-norm with parameters a = 0, b = 1, c = k.

The following infimal convolution interpretation of the box-norm provides a link between the box-norm and the k-support norm, and illustrates the effect of the parameters.

Proposition 3.2. If $0 < a \le b$ and $c = (b-a)k + da$, for $k \in \{1, \dots, d\}$, then
$$\|w\|_\Theta = \inf\Big\{\sum_{g \in \mathcal{G}_k} \sqrt{\sum_{i \in g} \frac{v_{g,i}^2}{b} + \sum_{i \notin g} \frac{v_{g,i}^2}{a}} : v_g \in \mathbb{R}^d,\ \sum_{g \in \mathcal{G}_k} v_g = w\Big\}. \qquad (5)$$

Notice that if b = 1, then as a tends to zero we obtain the expression of the k-support norm (1), recovering in particular the support constraints. If a is small and positive, the support constraints are not imposed, however effectively most of the weight of each $v_g$ tends to be concentrated on g. Hence, Proposition 3.2 suggests that the box-norm regularizer will encourage vectors w whose dominant components are a subset of a union of a small number of groups $g \in \mathcal{G}_k$.

The previous results have characterized the k-support norm as a special case of the box-norm. Conversely, the box-norm can be seen as a perturbation of the k-support norm with a quadratic term.

Proposition 3.3. Let $\|\cdot\|_\Theta$ be the box-norm on $\mathbb{R}^d$ with parameters $0 < a < b$ and $c = k(b-a) + da$, for $k \in \{1, \dots, d\}$. Then
$$\|w\|_\Theta^2 = \min\Big\{\frac{1}{a}\|u\|_2^2 + \frac{1}{b-a}\|z\|_{(k)}^2 : u + z = w\Big\}.$$

Consider the regularization problem
$$\min_{w \in \mathbb{R}^d} \|Xw - y\|_2^2 + \lambda \|w\|_\Theta^2, \qquad (6)$$
with data X and response y. Using Proposition 3.3 and setting w = u + z, we see that this problem is equivalent to
$$\min_{u, z \in \mathbb{R}^d} \|X(u + z) - y\|_2^2 + \frac{\lambda}{a}\|u\|_2^2 + \frac{\lambda}{b-a}\|z\|_{(k)}^2.$$
Furthermore, if $(\hat u, \hat z)$ solves this problem, then $\hat w = \hat u + \hat z$ solves problem (6). The solution $\hat w$ can therefore be interpreted as the superposition of a vector which has small $\ell_2$ norm and a vector which has small k-support norm, with the parameter a regulating these two components. Specifically, as a tends to zero, in order to prevent the objective from blowing up, $\hat u$ must also tend to zero and we recover k-support norm regularization. Similarly, as a tends to b, $\hat z$ vanishes and we have a simple ridge regression problem.

3.1 The Spectral k-Support Norm and the Spectral Box-Norm

We now turn our focus to the matrix norms. For this purpose, we recall that a norm $\|\cdot\|$ on $\mathbb{R}^{d \times m}$ is called orthogonally invariant if $\|W\| = \|UWV\|$ for any orthogonal matrices $U \in \mathbb{R}^{d \times d}$ and $V \in \mathbb{R}^{m \times m}$. A classical result by von Neumann [10] establishes that a norm is orthogonally invariant if and only if it is of the form $\|W\| = g(\sigma(W))$, where $\sigma(W)$ is the vector formed by the singular values of W in nonincreasing order, and g is a symmetric gauge function, that is, a norm which is invariant under permutations and sign changes of the vector components.

Lemma 3.4. If $\Theta$ is a convex bounded subset of the strictly positive orthant in $\mathbb{R}^d$ which is invariant under permutations, then $\|\cdot\|_\Theta$ is a symmetric gauge function.

In particular, this readily applies to both the k-support norm and the box-norm. We can therefore extend both norms to orthogonally invariant norms, which we term the spectral k-support norm and the spectral box-norm respectively, and which we write (with some abuse of notation) as $\|W\|_{(k)} = \|\sigma(W)\|_{(k)}$ and $\|W\|_\Theta = \|\sigma(W)\|_\Theta$.
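Concretely, the spectral norms can be evaluated by applying the vector norm to the vector of singular values. The sketch below is our own illustration of this (it repeats the generic-solver computation of the vector box-norm from the sketch in Section 2 so that the snippet is self-contained); with a close to zero, b = 1 and c = k it evaluates the spectral k-support norm, which for k = 1 should approximately match the trace norm and for k = d the Frobenius norm.

```python
import numpy as np
from scipy.optimize import minimize

def vector_box_norm(w, a, b, c):
    # variational definition (3); generic-solver sketch as in Section 2
    objective = lambda theta: np.sum(w ** 2 / theta)
    constraints = [{"type": "ineq", "fun": lambda theta: c - theta.sum()}]
    res = minimize(objective, np.full(len(w), c / len(w)),
                   bounds=[(a, b)] * len(w), constraints=constraints)
    return np.sqrt(res.fun)

def spectral_box_norm(W, a, b, c):
    # ||W||_Theta = ||sigma(W)||_Theta: apply the vector norm to the singular values
    sigma = np.linalg.svd(W, compute_uv=False)
    return vector_box_norm(sigma, a, b, c)

rng = np.random.default_rng(1)
W = rng.standard_normal((8, 6))
d, eps = min(W.shape), 1e-3          # small a approximates the k-support case a -> 0

# spectral k-support norm: a -> 0, b = 1, c = k
print("k = 1 (approx. trace norm):", spectral_box_norm(W, eps, 1.0, 1.0),
      "vs", np.linalg.svd(W, compute_uv=False).sum())
print("k = d (Frobenius norm)    :", spectral_box_norm(W, eps, 1.0, float(d)),
      "vs", np.linalg.norm(W, "fro"))
```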
We note that since the k-support norm subsumes the 1-norm and 2-norm for k = 1 and k = d respectively, the corresponding spectral k-support norms are equal to the trace norm and the Frobenius norm respectively. We first characterize the unit ball of the spectral k-support norm.

Proposition 3.5. The unit ball of the spectral k-support norm is the convex hull of the set of matrices of rank at most k and Frobenius norm no greater than one.

Referring to the unit ball characterization of the k-support norm, we note that the restriction on the cardinality of the vectors whose convex hull defines the unit ball naturally extends to a restriction on the rank operator in the matrix setting. Furthermore, as noted in [6], regularization using the k-support norm encourages vectors to be sparse, but less so than the 1-norm. In matrix problems, as the extreme points of the unit ball have rank at most k, Proposition 3.5 suggests that the spectral k-support norm for k > 1 should encourage matrices to have low rank, but less so than the trace norm.

3.2 Cluster Norm

We end this section by briefly discussing the cluster norm, which was introduced in [9] as a convex relaxation of a multitask clustering problem. The norm is defined, for every $W \in \mathbb{R}^{d \times m}$, as
$$\|W\|_{\mathrm{cl}} = \sqrt{\inf_{S \in \mathcal{S}_m} \mathrm{tr}(S^{-1} W^\top W)}, \qquad (7)$$
where $\mathcal{S}_m = \{S \in \mathbb{R}^{m \times m},\ S \succ 0 : aI \preceq S \preceq bI,\ \mathrm{tr}\, S = c\}$ and $0 < a \le b$. In [9] the authors state that the cluster norm of W equals the box-norm of the vector formed by the singular values of W, where $c = (b-a)k + da$. Here we provide a proof of this result. Denote by $\lambda_i(\cdot)$ the eigenvalues of a matrix, which we write in nonincreasing order $\lambda_1(\cdot) \ge \lambda_2(\cdot) \ge \cdots \ge \lambda_m(\cdot)$. Note that if $\theta_i$ are the eigenvalues of S, then $1/\theta_i = \lambda_{m-i+1}(S^{-1})$. Writing the eigendecomposition $W^\top W = U \Sigma^2 U^\top$, we have that
$$\mathrm{tr}(S^{-1} W^\top W) = \mathrm{tr}(S^{-1} U \Sigma^2 U^\top) \ge \sum_{i=1}^m \lambda_{m-i+1}(S^{-1})\, \lambda_i(W^\top W) = \sum_{i=1}^m \frac{\sigma_i(W)^2}{\theta_i},$$
where we have used the trace inequality [20, Sec. H.1.h] for $S^{-1}, W^\top W \succeq 0$. Since this inequality is attained whenever $S = U\,\mathrm{Diag}(\theta)\,U^\top$, where U are the eigenvectors of $W^\top W$, we see that $\|W\|_{\mathrm{cl}} = \|\sigma(W)\|_\Theta$, that is, the cluster norm coincides with the spectral box-norm. In particular, we see that the spectral k-support norm is a special case of the cluster norm, in which we let a tend to zero, b = 1 and c = k. Moreover, the methods to compute the norm and its proximity operator described in the following section can directly be applied to the cluster norm.

As in the case of the vector norm (Proposition 3.3), the spectral box-norm, or cluster norm, can be written as a perturbation of the spectral k-support norm with a quadratic term.

Proposition 3.6. Let $\|\cdot\|_\Theta$ be a matrix box-norm with parameters a, b, c and let $k = \frac{c - da}{b - a}$. Then
$$\|W\|_\Theta^2 = \min_{Z}\Big\{\frac{1}{a}\|W - Z\|_F^2 + \frac{1}{b-a}\|Z\|_{(k)}^2\Big\}.$$

In other words, this result shows that the squared cluster norm can be seen as the Moreau envelope [11] of a squared spectral k-support norm.
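The key step in the argument above can be checked numerically: for $S = U\,\mathrm{Diag}(\theta)\,U^\top$ built from the eigenvectors U of $W^\top W$, the trace objective in (7) equals $\sum_i \sigma_i(W)^2/\theta_i$. The short sketch below is our own illustration of this identity; computing the cluster norm itself would additionally require minimizing over $\theta \in \Theta$, as described in Section 4.

```python
import numpy as np

rng = np.random.default_rng(2)
d, m = 7, 5
W = rng.standard_normal((d, m))

# eigendecomposition of W^T W; columns of U are its eigenvectors
evals, U = np.linalg.eigh(W.T @ W)
order = np.argsort(evals)[::-1]                     # nonincreasing order
evals, U = evals[order], U[:, order]
sigma2 = np.linalg.svd(W, compute_uv=False) ** 2    # equals evals up to floating point error

# any positive theta assigned to the eigenvectors of W^T W; for the cluster norm
# one would additionally restrict theta to the box and minimize over it
theta = np.array([1.0, 0.8, 0.5, 0.3, 0.1])
S = U @ np.diag(theta) @ U.T

lhs = np.trace(np.linalg.inv(S) @ W.T @ W)          # tr(S^{-1} W^T W)
rhs = np.sum(sigma2 / theta)                        # sum_i sigma_i(W)^2 / theta_i
print(lhs, rhs)                                     # agree up to floating point error
```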
4 Computing the Norms and their Proximity Operator

In this section, we compute the norm and the proximity operator of the squared norm by explicitly solving the optimization problem in (3). We begin with the vector norm.

Theorem 4.1. For every $w \in \mathbb{R}^d$ it holds that
$$\|w\|_\Theta = \sqrt{\frac{1}{b}\|w_Q\|_2^2 + \frac{1}{p}\|w_I\|_1^2 + \frac{1}{a}\|w_L\|_2^2}, \qquad (8)$$
where $w_Q = (|w|^\downarrow_1, \dots, |w|^\downarrow_q)$, $w_I = (|w|^\downarrow_{q+1}, \dots, |w|^\downarrow_{d-\ell})$, $w_L = (|w|^\downarrow_{d-\ell+1}, \dots, |w|^\downarrow_d)$, $p = c - qb - \ell a$, and q and $\ell$ are the unique integers in $\{0, \dots, d\}$ that satisfy $q + \ell \le d$ and
$$\frac{|w|^\downarrow_q}{b} \ge \frac{\|w_I\|_1}{p} > \frac{|w|^\downarrow_{q+1}}{b}, \qquad \frac{|w|^\downarrow_{d-\ell}}{a} > \frac{\|w_I\|_1}{p} \ge \frac{|w|^\downarrow_{d-\ell+1}}{a}, \qquad (9)$$
where we have defined $|w|^\downarrow_0 = +\infty$ and $|w|^\downarrow_{d+1} = 0$.

Proof. (Sketch) We need to solve the optimization problem
$$\inf\Big\{\sum_{i=1}^d \frac{w_i^2}{\theta_i} : a \le \theta_i \le b,\ \sum_{i=1}^d \theta_i \le c\Big\}. \qquad (10)$$
We assume without loss of generality that the $w_i$ are ordered nonincreasingly in absolute value, and it follows that at the optimum the $\theta_i$ are also ordered nonincreasingly. We further assume that $w_i \ne 0$ for all i and $c \le db$, so the sum constraint will be tight at the optimum. The Lagrangian is given by
$$L(\theta, \alpha) = \sum_{i=1}^d \frac{w_i^2}{\theta_i} + \frac{1}{\alpha^2}\Big(\sum_{i=1}^d \theta_i - c\Big),$$
where $1/\alpha^2$ is a strictly positive multiplier to be chosen such that $S(\alpha) := \sum_{i=1}^d \theta_i(\alpha) = c$. We can then solve the original problem by minimizing the Lagrangian over the constraint $\theta \in [a, b]^d$. Due to the decoupling effect of the multiplier, we can solve the simplified problem componentwise, obtaining the solution
$$\theta_i = \theta_i(\alpha) = \min(b, \max(a, \alpha|w_i|)), \qquad (11)$$
where $\alpha$ satisfies $S(\alpha) = c$. The minimizer has the form $\theta = (b, \dots, b, \theta_{q+1}, \dots, \theta_{d-\ell}, a, \dots, a)$, where q and $\ell$ are determined by the value of $\alpha$. From $S(\alpha) = c$ we get $\alpha = p / \big(\sum_{i=q+1}^{d-\ell} |w_i|\big)$. The value of the norm in (8) follows by substituting $\theta$ into the objective. Finally, by construction we have $\alpha|w_q| \ge b > \alpha|w_{q+1}|$ and $\alpha|w_{d-\ell}| > a \ge \alpha|w_{d-\ell+1}|$, which give rise to the conditions in (9).

Theorem 4.1 suggests two methods for computing the box-norm. First, we find $\alpha$ such that $S(\alpha) = c$; this value uniquely determines $\theta$ in (11), and the norm follows by substitution into (10). Alternatively, we identify q and $\ell$ that jointly satisfy (9) and we compute the norm using (8). Taking advantage of the structure of $\theta$ in the former method leads to a computation time that is O(d log d).

Theorem 4.2. The computation of the box-norm can be completed in O(d log d) time.

The k-support norm is a special case of the box-norm, and as a direct corollary of Theorem 4.1 and Theorem 4.2 we recover [6, Proposition 2.1].

4.1 Proximity Operator

Proximal gradient methods can be used to solve optimization problems of the form $\min_w f(w) + \lambda g(w)$, where f is a convex loss function with Lipschitz continuous gradient, $\lambda > 0$ is a regularization parameter, and g is a convex function for which the proximity operator can be computed efficiently, see [12, 21, 22] and references therein. The proximity operator of g with parameter $\rho > 0$ is defined as
$$\mathrm{prox}_{\rho g}(w) = \mathrm{argmin}\Big\{\frac{1}{2}\|x - w\|_2^2 + \rho\, g(x) : x \in \mathbb{R}^d\Big\}.$$
We now use the infimum formulation of the box-norm to derive the proximity operator of the squared norm.

Theorem 4.3. The proximity operator of the square of the box-norm at point $w \in \mathbb{R}^d$ with parameter $\frac{\lambda}{2}$ is given by $\mathrm{prox}_{\frac{\lambda}{2}\|\cdot\|_\Theta^2}(w) = \big(\frac{\theta_1 w_1}{\theta_1 + \lambda}, \dots, \frac{\theta_d w_d}{\theta_d + \lambda}\big)$, where
$$\theta_i = \theta_i(\alpha) = \min(b, \max(a, \alpha|w_i| - \lambda)) \qquad (12)$$
and $\alpha$ is chosen such that $S(\alpha) := \sum_{i=1}^d \theta_i(\alpha) = c$. Furthermore, the computation of the proximity operator can be completed in O(d log d) time.

The proof follows a similar reasoning to the proof of Theorem 4.1. Algorithm 1 illustrates the computation of the proximity operator for the squared box-norm in O(d log d) time.

Algorithm 1 Computation of $x = \mathrm{prox}_{\frac{\lambda}{2}\|\cdot\|_\Theta^2}(w)$.
Require: parameters a, b, c, $\lambda$.
1. Sort the points $\big\{\frac{a+\lambda}{|w_j|}, \frac{b+\lambda}{|w_j|}\big\}_{j=1}^d$ such that $\alpha_i \le \alpha_{i+1}$;
2. Identify points $\alpha_i$ and $\alpha_{i+1}$ such that $S(\alpha_i) \le c$ and $S(\alpha_{i+1}) \ge c$ by binary search;
3. Find $\alpha^*$ between $\alpha_i$ and $\alpha_{i+1}$ such that $S(\alpha^*) = c$ by linear interpolation;
4. Compute $\theta_i(\alpha^*)$ for $i = 1, \dots, d$;
5. Return $x_i = \frac{\theta_i w_i}{\theta_i + \lambda}$ for $i = 1, \dots, d$.

This includes the k-support norm as a special case, in which we let a tend to zero and set b = 1 and c = k. This improves upon the complexity of the O(d(k + log d)) computation provided in [6], and we illustrate the improvement empirically in Table 1.

4.2 Proximity Operator for Orthogonally Invariant Norms

The computational considerations outlined above can be naturally extended to the matrix setting by using von Neumann's trace inequality (see, e.g. [23]). Here we comment on the computation of the proximity operator, which is important for our numerical experiments in the following section. The proximity operator of an orthogonally invariant norm $\|\cdot\| = g(\sigma(\cdot))$ is given by
$$\mathrm{prox}_{\|\cdot\|}(W) = U\,\mathrm{diag}\big(\mathrm{prox}_g(\sigma(W))\big)\,V^\top, \quad W \in \mathbb{R}^{d \times m},$$
where U and V are the matrices formed by the left and right singular vectors of W (see e.g. [24, Prop. 3.1]). Using this result we can employ proximal gradient methods to solve matrix regularization problems using the squared spectral k-support norm and the squared spectral box-norm.
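The following Python sketch is our own illustration of Theorem 4.3 and Algorithm 1 (not the authors' implementation): it sorts the 2d breakpoints of the piecewise linear map $S(\alpha)$, locates the segment containing c by binary search, solves $S(\alpha) = c$ by linear interpolation, and then applies the vector proximity operator to the spectrum as in Section 4.2. Setting a close to zero, b = 1 and c = k gives the proximity operator of the squared (spectral) k-support norm.

```python
import numpy as np

def prox_squared_box_norm(w, a, b, c, lam):
    """prox of (lam/2)*||.||_Theta^2 at w, following Theorem 4.3 / Algorithm 1 (sketch).

    theta_i(alpha) = clip(alpha*|w_i| - lam, a, b); alpha is chosen so that
    S(alpha) = sum_i theta_i(alpha) = c, and then x_i = theta_i*w_i/(theta_i + lam).
    Assumes a*d <= c <= b*d as in Definition 2.1.
    """
    absw = np.abs(w)
    nz = absw[absw > 0]
    if nz.size == 0:
        return np.zeros_like(w)

    def S(alpha):  # nondecreasing and piecewise linear in alpha
        return np.clip(alpha * absw - lam, a, b).sum()

    # breakpoints at which some theta_i enters or leaves the interval (a, b)
    pts = np.sort(np.concatenate([(a + lam) / nz, (b + lam) / nz]))

    if S(pts[-1]) <= c:
        alpha = pts[-1]           # sum constraint not binding: every nonzero entry gets theta_i = b
    else:
        lo, hi = 0, len(pts) - 1  # invariant: S(pts[lo]) <= c < S(pts[hi])
        while hi - lo > 1:
            mid = (lo + hi) // 2
            if S(pts[mid]) <= c:
                lo = mid
            else:
                hi = mid
        s_lo, s_hi = S(pts[lo]), S(pts[hi])
        # S is linear on [pts[lo], pts[hi]], so S(alpha) = c is solved by interpolation
        alpha = pts[lo] + (c - s_lo) * (pts[hi] - pts[lo]) / (s_hi - s_lo)

    theta = np.clip(alpha * absw - lam, a, b)
    return theta * w / (theta + lam)

def prox_squared_spectral_box_norm(W, a, b, c, lam):
    """prox of (lam/2)*||sigma(.)||_Theta^2: apply the vector prox to the spectrum."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    return U @ np.diag(prox_squared_box_norm(s, a, b, c, lam)) @ Vt

# example: prox of the squared spectral k-support norm (a close to 0, b = 1, c = k)
rng = np.random.default_rng(3)
W = rng.standard_normal((50, 40))
X = prox_squared_spectral_box_norm(W, a=1e-6, b=1.0, c=5.0, lam=0.1)
print(np.round(np.linalg.svd(X, compute_uv=False)[:8], 3))  # tail of the spectrum is heavily damped
```

The sort costs O(d log d) and the binary search performs O(log d) evaluations of S, each costing O(d), so the overall cost matches the complexity stated in Theorem 4.3.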
5 Numerical Experiments

In this section, we report on the statistical performance of the spectral regularizers in matrix completion experiments. We also offer an interpretation of the role of the parameters in the box-norm, and we empirically verify the improved performance of the proximity operator computation (see Table 1). We compare the trace norm (tr) [25], the matrix elastic net (en) [26], the spectral k-support norm (ks) and the spectral box-norm (box). The Frobenius norm, which is equal to the spectral k-support norm for k = d, performed considerably worse than the trace norm and we omit the results here. We report test error and standard deviation, matrix rank (r) and the optimal parameter values for k and a, which were determined by validation, as were the regularization parameters. When comparing performance, we used a t-test to determine statistical significance at a level of p < 0.001.

For the optimization we used an accelerated proximal gradient method (FISTA), see e.g. [12, 21, 22], with the percentage change in objective as convergence criterion, with a tolerance of $10^{-5}$ for the simulated datasets and $10^{-3}$ for the real datasets. As is typical with spectral regularizers, we found that the spectrum of the learned matrix exhibited a rapid decay to zero. In order to explicitly impose a low rank on the solution, we included a final step in which we hard-threshold the singular values of the final matrix below a level determined by validation. We report on both sets of results below.
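As a hedged illustration of this pipeline, the sketch below runs FISTA on the objective $\frac{1}{2}\|\Omega \odot (W - Y)\|_F^2 + \frac{\lambda}{2}\|W\|_\Theta^2$, using the spectral proximity operator sketched at the end of Section 4 (assumed here to be importable as `prox_squared_spectral_box_norm`) and a relative-change stopping criterion. The regularization parameter, tolerance and the optional hard-thresholding level are placeholders to be chosen by validation, as described above.

```python
import numpy as np
# assumes the prox routine sketched at the end of Section 4 has been saved, e.g. as prox_box.py
from prox_box import prox_squared_spectral_box_norm

def fista_matrix_completion(Y, mask, a, b, c, lam, tol=1e-5, max_iter=2000):
    """FISTA sketch for min_W 0.5*||mask*(W - Y)||_F^2 + (lam/2)*||W||_Theta^2.

    The smooth term has a 1-Lipschitz gradient (mask entries are 0/1), so the
    prox step uses the regularization parameter lam directly.
    """
    W = np.zeros_like(Y)
    Z, t = W.copy(), 1.0
    obj_old = None
    for _ in range(max_iter):
        grad = mask * (Z - Y)                                 # gradient of the data-fit term at Z
        W_new = prox_squared_spectral_box_norm(Z - grad, a, b, c, lam)
        t_new = 0.5 * (1.0 + np.sqrt(1.0 + 4.0 * t * t))
        Z = W_new + ((t - 1.0) / t_new) * (W_new - W)         # momentum (extrapolation) step
        W, t = W_new, t_new
        # simplified stopping rule: relative change of the data-fit term only
        # (the paper's criterion uses the full objective, including the norm term)
        obj = 0.5 * np.sum((mask * (W - Y)) ** 2)
        if obj_old is not None and abs(obj_old - obj) <= tol * max(abs(obj_old), 1e-12):
            break
        obj_old = obj
    return W

def hard_threshold_spectrum(W, level):
    """Optional final step: zero out singular values below a validated level."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    return U @ np.diag(np.where(s >= level, s, 0.0)) @ Vt

# example usage (all parameter values are placeholders to be set by validation):
# W_hat = fista_matrix_completion(Y, mask, a=1e-3, b=1.0, c=5.0, lam=0.1, tol=1e-5)
# W_hat = hard_threshold_spectrum(W_hat, level=1.0)
```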
5.1 Simulated Data

Matrix Completion. We applied the norms to matrix completion on noisy observations of low rank matrices. Each m × m matrix was generated as $W = AB^\top + E$, where $A, B \in \mathbb{R}^{m \times r}$, $r \ll m$, and the entries of A, B and E are i.i.d. standard Gaussian. We set m = 100, $r \in \{5, 10\}$, and we sampled uniformly a percentage $\rho \in \{10\%, 20\%, 30\%\}$ of the entries for training, and used a fixed 10% for validation. The error was measured as $\|\text{true} - \text{predicted}\|_2 / \|\text{true}\|_2$ [5] and averaged over 100 trials. The results are summarized in Table 2. In the thresholding case, all methods recovered the rank of the true noiseless matrix. The spectral box-norm generated the lowest test errors in all regimes, with the spectral k-support norm a close second, in particular in the thresholding case. This suggests that the nonzero parameter a in the spectral box-norm counteracted the noise to some extent.

Table 1: Comparison of proximity operator algorithms for the k-support norm (time in seconds), k = 0.05d. Algorithm 1 is the method in [6], Algorithm 2 is our method.

d      | 1,000  | 2,000  | 4,000  | 8,000  | 16,000 | 32,000
Alg. 1 | 0.0443 | 0.1567 | 0.5907 | 2.3065 | 9.0080 | 35.6199
Alg. 2 | 0.0011 | 0.0016 | 0.0026 | 0.0046 | 0.0101 | 0.0181

[Figure 1: Impact of signal to noise on a.]
[Figure 2: Impact of matrix rank on k.]

Role of Parameters. In the same setting we investigated the role of the parameters in the box-norm. As previously discussed, parameter b can be set to 1 without loss of generality. Figure 1 shows the optimal value of a chosen by validation for varying signal to noise ratios (SNR), keeping k fixed. We see that for greater noise levels (smaller SNR), the optimal value for a increases. While for a > 0 the recovered solutions are not sparse, as we show below this can still lead to improved performance in experiments, in particular in the presence of noise. Figure 2 shows the optimal value of k chosen by validation for matrices with increasing rank, keeping a fixed. We notice that as the rank of the matrix increases, the optimal k value increases, which is expected since it is an upper bound on the sum of the singular values.

Table 2: Matrix completion on simulated data sets, without (left block) and with (right block) thresholding.

Without thresholding:
dataset          | norm | test error    | r   | k   | a
rank 5, rho=10%  | tr   | 0.8184 (0.03) | 20  | -   | -
                 | en   | 0.8164 (0.03) | 20  | -   | -
                 | ks   | 0.8036 (0.03) | 16  | 3.6 | -
                 | box  | 0.7805 (0.03) | 87  | 2.9 | 1.7e-2
rank 5, rho=20%  | tr   | 0.4085 (0.03) | 23  | -   | -
                 | en   | 0.4081 (0.03) | 23  | -   | -
                 | ks   | 0.4031 (0.03) | 21  | 3.1 | -
                 | box  | 0.3898 (0.03) | 100 | 1.3 | 9e-3
rank 10, rho=20% | tr   | 0.6356 (0.03) | 27  | -   | -
                 | en   | 0.6359 (0.03) | 27  | -   | -
                 | ks   | 0.6284 (0.03) | 24  | 4.4 | -
                 | box  | 0.6243 (0.03) | 89  | 1.8 | 9e-3
rank 10, rho=30% | tr   | 0.3642 (0.02) | 36  | -   | -
                 | en   | 0.3638 (0.02) | 36  | -   | -
                 | ks   | 0.3579 (0.02) | 33  | 5.0 | -
                 | box  | 0.3486 (0.02) | 100 | 2.5 | 9e-3

With thresholding:
dataset          | norm | test error    | r  | k    | a
rank 5, rho=10%  | tr   | 0.7799 (0.04) | 5  | -    | -
                 | en   | 0.7794 (0.04) | 5  | -    | -
                 | ks   | 0.7728 (0.04) | 5  | 4.23 | -
                 | box  | 0.7649 (0.04) | 5  | 3.63 | 8.1e-3
rank 5, rho=20%  | tr   | 0.3449 (0.02) | 5  | -    | -
                 | en   | 0.3445 (0.02) | 5  | -    | -
                 | ks   | 0.3381 (0.02) | 5  | 2.97 | -
                 | box  | 0.3380 (0.02) | 5  | 3.28 | 1.9e-3
rank 10, rho=20% | tr   | 0.6084 (0.03) | 10 | -    | -
                 | en   | 0.6074 (0.03) | 10 | -    | -
                 | ks   | 0.6000 (0.03) | 10 | 5.02 | -
                 | box  | 0.6000 (0.03) | 10 | 5.22 | 1.9e-3
rank 10, rho=30% | tr   | 0.3086 (0.02) | 10 | -    | -
                 | en   | 0.3082 (0.02) | 10 | -    | -
                 | ks   | 0.3025 (0.02) | 10 | 5.13 | -
                 | box  | 0.3025 (0.02) | 10 | 5.16 | 3e-4

Table 3: Matrix completion on real data sets, without (left block) and with (right block) thresholding.

Without thresholding:
dataset                 | norm | test error | r    | k    | a
MovieLens 100k, rho=50% | tr   | 0.2034     | 87   | -    | -
                        | en   | 0.2034     | 87   | -    | -
                        | ks   | 0.2031     | 102  | 1.00 | -
                        | box  | 0.2035     | 943  | 1.00 | 1e-5
MovieLens 1M, rho=50%   | tr   | 0.1821     | 325  | -    | -
                        | en   | 0.1821     | 319  | -    | -
                        | ks   | 0.1820     | 317  | 1.00 | -
                        | box  | 0.1817     | 3576 | 1.09 | 3e-5
Jester 1, 20 per line   | tr   | 0.1787     | 98   | -    | -
                        | en   | 0.1787     | 98   | -    | -
                        | ks   | 0.1764     | 84   | 5.00 | -
                        | box  | 0.1766     | 100  | 4.00 | 1e-6
Jester 3, 8 per line    | tr   | 0.1988     | 49   | -    | -
                        | en   | 0.1988     | 49   | -    | -
                        | ks   | 0.1970     | 46   | 3.70 | -
                        | box  | 0.1973     | 100  | 5.91 | 1e-3

With thresholding:
dataset                 | norm | test error | r  | k    | a
MovieLens 100k, rho=50% | tr   | 0.2017     | 13 | -    | -
                        | en   | 0.2017     | 13 | -    | -
                        | ks   | 0.1990     | 9  | 1.87 | -
                        | box  | 0.1989     | 10 | 2.00 | 1e-5
MovieLens 1M, rho=50%   | tr   | 0.1790     | 17 | -    | -
                        | en   | 0.1789     | 17 | -    | -
                        | ks   | 0.1782     | 17 | 1.80 | -
                        | box  | 0.1777     | 19 | 2.00 | 1e-6
Jester 1, 20 per line   | tr   | 0.1752     | 11 | -    | -
                        | en   | 0.1752     | 11 | -    | -
                        | ks   | 0.1739     | 11 | 6.38 | -
                        | box  | 0.1726     | 11 | 6.40 | 2e-5
Jester 3, 8 per line    | tr   | 0.1959     | 3  | -    | -
                        | en   | 0.1959     | 3  | -    | -
                        | ks   | 0.1942     | 3  | 2.13 | -
                        | box  | 0.1940     | 3  | 4.00 | 8e-4

5.2 Real Data

Matrix Completion (MovieLens and Jester). In this section we report on matrix completion on real data sets. We observe a percentage of the (user, rating) entries of a matrix and the task is to predict the unobserved ratings, with the assumption that the true matrix has low rank. The datasets we considered were MovieLens 100k and MovieLens 1M (http://grouplens.org/datasets/movielens/), which consist of user ratings of movies, and Jester 1 and Jester 3 (http://goldberg.berkeley.edu/jesterdata/), which consist of users and ratings of jokes (Jester 2 showed essentially identical performance to Jester 1). Following [4], for MovieLens we uniformly sampled $\rho$ = 50% of the available entries for each user for training, and for Jester 1 and Jester 3 we sampled 20, respectively 8, ratings per user, and we used 10% for validation.
The error was measured as the normalized mean absolute error,
$$\mathrm{NMAE} = \frac{\|\text{true} - \text{predicted}\|_1}{\#\text{observations} \cdot (r_{\max} - r_{\min})},$$
where $r_{\min}$ and $r_{\max}$ are lower and upper bounds for the ratings [4]. The results are outlined in Table 3. In the thresholding case, the spectral box and k-support norms had the best performance. In the absence of thresholding, the spectral k-support norm showed slightly better performance. Comparing to the synthetic data sets, this suggests that in the absence of noise the parameter a did not provide any benefit. We note that in the absence of thresholding our results for the trace norm on MovieLens 100k agreed with those in [3].

6 Conclusion

We showed that the k-support norm belongs to the family of box-norms and noted that these can be naturally extended from the vector to the matrix setting. We also provided a connection between the k-support norm and the cluster norm, which essentially coincides with the spectral box-norm. We further observed that the cluster norm is a perturbation of the spectral k-support norm, and we were able to compute the norm and its proximity operator. Our experiments indicate that the spectral box-norm and the spectral k-support norm consistently outperform the trace norm and the matrix elastic net on various matrix completion problems. With a single parameter to validate, compared to two for the spectral box-norm, our results suggest that the spectral k-support norm is a powerful alternative to the trace norm and to the elastic net, which has the same number of parameters. In future work, we would like to study the application of the norms to clustering problems in multitask learning [9], in particular the impact of centering. It would also be valuable to derive statistical inequalities and Rademacher complexities for these norms.

Acknowledgements

We would like to thank Andreas Maurer, Charles Micchelli and especially Andreas Argyriou for useful discussions. Part of this work was supported by EPSRC Grant EP/H027203/1.

References

[1] N. Srebro, J. D. M. Rennie, and T. S. Jaakkola. Maximum-margin matrix factorization. In Advances in Neural Information Processing Systems 17, 2005.
[2] J. Abernethy, F. Bach, T. Evgeniou, and J.-P. Vert. A new approach to collaborative filtering: Operator estimation with spectral regularization. Journal of Machine Learning Research, 10:803-826, 2009.
[3] M. Jaggi and M. Sulovsky. A simple algorithm for nuclear norm regularized problems. In Proceedings of the 27th International Conference on Machine Learning, 2010.
[4] K.-C. Toh and S. Yun. An accelerated proximal gradient algorithm for nuclear norm regularized least squares problems. SIAM Journal on Imaging Sciences, 4:573-596, 2011.
[5] R. Mazumder, T. Hastie, and R. Tibshirani. Spectral regularization algorithms for learning large incomplete matrices. Journal of Machine Learning Research, 11:2287-2322, 2010.
[6] A. Argyriou, R. Foygel, and N. Srebro. Sparse prediction with the k-support norm. In Advances in Neural Information Processing Systems 25, pages 1466-1474, 2012.
[7] R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B, 58:267-288, 1996.
[8] H. Zou and T. Hastie. Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society, Series B, 67(2):301-320, 2005.
[9] L. Jacob, F. Bach, and J.-P. Vert. Clustered multi-task learning: a convex formulation. In Advances in Neural Information Processing Systems 21, 2009.
[10] J. von Neumann. Some matrix-inequalities and metrization of matric-space. Tomsk Univ. Rev., Vol. I, 1937.
[11] R. T. Rockafellar. Convex Analysis. Princeton University Press, 1970.
[12] Y. Nesterov. Gradient methods for minimizing composite objective function. Center for Operations Research and Econometrics, Discussion Paper 76, 2007.
[13] L. Jacob, G. Obozinski, and J.-P. Vert. Group lasso with overlap and graph lasso. In Proceedings of the 26th International Conference on Machine Learning, 2009.
[14] Y. Grandvalet. Least absolute shrinkage is equivalent to quadratic penalization. In ICANN 98, pages 201-206. Springer London, 1998.
[15] C. A. Micchelli and M. Pontil. Learning the kernel function via regularization. Journal of Machine Learning Research, 6:1099-1125, 2005.
[16] M. Szafranski, Y. Grandvalet, and P. Morizet-Mahoudeaux. Hierarchical penalization. In Advances in Neural Information Processing Systems 21, 2007.
[17] C. A. Micchelli, J. M. Morales, and M. Pontil. Regularizers for structured sparsity. Advances in Computational Mathematics, 38:455-489, 2013.
[18] A. Maurer and M. Pontil. Structured sparsity and generalization. Journal of Machine Learning Research, 13:671-690, 2012.
[19] G. Obozinski and F. Bach. Convex relaxation for combinatorial penalties. CoRR, 2012.
[20] A. W. Marshall and I. Olkin. Inequalities: Theory of Majorization and its Applications. Academic Press, 1979.
[21] P. L. Combettes and J.-C. Pesquet. Proximal splitting methods in signal processing. In Fixed-Point Algorithms for Inverse Problems in Science and Engineering. Springer, 2011.
[22] A. Beck and M. Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences, 2(1):183-202, 2009.
[23] A. S. Lewis. The convex analysis of unitarily invariant matrix functions. Journal of Convex Analysis, 2:173-183, 1995.
[24] A. Argyriou, C. A. Micchelli, M. Pontil, L. Shen, and Y. Xu. Efficient first order methods for linear composite regularizers. CoRR, abs/1104.1436, 2011.
[25] J.-F. Cai, E. J. Candes, and Z. Shen. A singular value thresholding algorithm for matrix completion. SIAM Journal on Optimization, 20(4):1956-1982, 2008.
[26] H. Li, N. Chen, and L. Li. Error analysis for matrix elastic-net regularization algorithms. IEEE Transactions on Neural Networks and Learning Systems, 23(5):737-748, 2012.
[27] W. Rudin. Functional Analysis. McGraw-Hill, 1991.
[28] D. P. Bertsekas, A. Nedic, and A. E. Ozdaglar. Convex Analysis and Optimization. Athena Scientific, 2003.
[29] R. A. Horn and C. R. Johnson. Topics in Matrix Analysis. Cambridge University Press, 1991.