# robust_discriminative_clustering_with_sparse_regularizers__b743c3e0.pdf

Journal of Machine Learning Research 18 (2017) 1-50 Submitted 8/16; Revised 5/17; Published 8/17

Robust Discriminative Clustering with Sparse Regularizers

Nicolas Flammarion1 nicolas.flammarion@ens.fr Balamurugan Palaniappan2 palaniappan@telecom-paristech.fr Francis Bach1 francis.bach@ens.fr 1INRIA D epartement d Informatique de l ENS, Ecole Normale Sup erieure, CNRS, PSL Research University Paris, France.

2Signal, Statistique et Apprentissage (S2A) Group, IDS Department, LTCI Telecom-Paris Tech Paris, France.

Editor: Sathiya Keerthi

Clustering high-dimensional data often requires some form of dimensionality reduction, where clustered variables are separated from noise-looking variables. We cast this problem as ﬁnding a low-dimensional projection of the data which is well-clustered. This yields a one-dimensional projection in the simplest situation with two clusters, and extends naturally to a multi-label scenario for more than two clusters. In this paper, (a) we ﬁrst show that this joint clustering and dimension reduction formulation is equivalent to previously proposed discriminative clustering frameworks, thus leading to convex relaxations of the problem; (b) we propose a novel sparse extension, which is still cast as a convex relaxation and allows estimation in higher dimensions; (c) we propose a natural extension for the multi-label scenario; (d) we provide a new theoretical analysis of the performance of these formulations with a simple probabilistic model, leading to scalings over the form d = O( n) for the aﬃne invariant case and d = O(n) for the sparse case, where n is the number of examples and d the ambient dimension; and ﬁnally, (e) we propose an eﬃcient iterative algorithm with running-time complexity proportional to O(nd2), improving on earlier algorithms for discriminative clustering with the square loss, which had quadratic complexity in the number of examples.

1. Introduction

Clustering is an important and commonly used pre-processing tool in many machine learning applications, with classical algorithms such as K-means (Mac Queen, 1967), linkage algorithms (Gower and Ross, 1969) or spectral clustering (Ng et al., 2002). In high dimensions, these unsupervised learning algorithms typically have problems identifying the underlying optimal discrete nature of the data; for example, they are quickly perturbed by adding a few noisy dimensions. Clustering high-dimensional data thus requires some form of dimensionality reduction, where clustered variables are separated from non-informative noise-looking (e.g., Gaussian) variables.

Several frameworks aim at linearly separating noise from signal, that is ﬁnding projections of the data that extracts the signal and removes the noise. They diﬀer in the ways

c 2017 Nicolas Flammarion, Balamurugan Palaniappan and Francis Bach.

License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at http://jmlr.org/papers/v18/16-429.html.

Flammarion, Palaniappan and Bach

signals and noise are deﬁned. A line of work that dates back to projection pursuit (Friedman and Stuetzle, 1981) and independent component analysis (Hyv arinen et al., 2004) deﬁnes the noise as Gaussian while the signal is non-Gaussian (Blanchard et al., 2006; Le Roux and Bach, 2013; Diederichs et al., 2013). In this work, we follow De la Torre and Kanade (2006); Ding and Li (2007), along the alternative route where one deﬁnes the signal as being clustered while the noise is any non-clustered variable. In the simplest situation with two clusters, we may project the data into a one-dimensional subspace. Given a data matrix X Rn d composed of n d-dimensional points, the goal is to ﬁnd a direction w Rd such that Xw Rn is well-clustered, e.g., by K-means. This is equivalent to identifying both a direction to project, represented as w Rd and the labeling y { 1, 1}n that represents the partition into two clusters. Most existing formulations are non-convex and typically perform a form of alternating optimization (De la Torre and Kanade, 2006; Ding and Li, 2007), where given y { 1, 1}n, the projection w is found by linear discriminant analysis (or any binary classiﬁcation method), and given the projection w, the clustering is obtained by thresholding Xw or running K-means on Xw. As shown in Section 2, this alternating minimization procedure happens to be equivalent to maximizing the (centered) correlation between y { 1, 1}n

and the projection Xw Rd, that is

max w Rd,y { 1,1}n (y Πn Xw)2

Πny 2 2 Πn Xw 2 2 ,

where Πn = In 1

n1n1 n is the usual centering projection matrix (with 1n Rn being the vector of all ones, and In the n n identity matrix). This correlation is equal to one when the projection is perfectly clustered (independently of the number of elements per cluster). Existing methods are alternating minimization algorithms with no theoretical guarantees. In this paper, we relate this formulation to discriminative clustering formulations (Xu et al., 2004; Bach and Harchaoui, 2007), which consider the problem

min v Rd, b R, y { 1,1}n 1 n y Xv b1n 2 2, (1)

with the intuition of ﬁnding labels y which are easy to predict by an aﬃne function of the data. In particular, we show that given the relationship between the number of positive labels and negative labels (i.e., the squared diﬀerence between the respective number of elements), these two problems are equivalent, and hence discriminative clustering explicitly performs joint dimension reduction and clustering. While the discriminative framework is based on convex relaxations and has led to interesting developments and applications (Zhang et al., 2009; Li et al., 2009; Joulin et al., 2010a,b; Wang et al., 2010; Niu et al., 2013; Huang et al., 2015), it has several shortcomings when used with the square loss: (a) the running-time complexity of the semi-deﬁnite formulations is at least quadratic in n, and typically much more, (b) no theoretical analysis has ever been performed, (c) no convex sparse extension has been proposed to handle data with many irrelevant dimensions, (d) balancing of the clusters remains an issue, as it typically adds an extra hyperparameter which may be hard to set. In this paper, we focus on addressing these concerns.

Robust Discriminative Clustering with Sparse Regularizers

When there are more than two clusters, one considers either the multi-label or the multiclass settings. The multi-class problem assumes that the data are clustered into distinct classes, i.e., a single class per observation, whereas the multi-label problem assumes the data share diﬀerent labels, i.e., multiple labels per observation. We show in this work that discriminative clustering framework extends more naturally to multi-label scenarios and that this extension has the same convex relaxation. A summary of the contributions of this paper follows:

In Section 2, we relate discriminative clustering with the square loss to a joint clustering and dimension reduction formulation. The proposed formulation takes care of the balancing hyperparameter implicitly.

We propose in Section 3, a novel sparse extension to discriminative clustering and show that it can still be cast through a convex relaxation.

When there are more than two clusters, we extend naturally the sparse formulation to a multi-label scenario in Section 4.

We then proceed to provide a theoretical analysis of the proposed formulations with a simple probabilistic model in Section 5, which eﬀectively leads to scalings over the form d = O( n) for the aﬃne invariant case and d = O(n) for the 1-sparse case.

Finally, we propose in Section 6 eﬃcient iterative algorithms with running-time complexity for each step equal to O(nd2), the ﬁrst to be linear in the number of observations n for discriminative clustering with the square loss.

Throughout this paper we assume that X Rn d is centered, a common pre-processing step in unsupervised (and supervised) learning. This implies that X 1n = 0 and Πn X = X.

2. Joint Dimension Reduction and Clustering

In this section, we focus on the single binary label case, where we ﬁrst study the usual nonconvex formulation, before deriving convex relaxations based on semi-deﬁnite programming. Some of the following results are already known in the literature; however, we state them here for completeness.

2.1 Non-convex formulation

Following De la Torre and Kanade (2006); Ding and Li (2007); Ye et al. (2008), we consider a cost function which depends on y { 1, 1}n and w Rd, which is such that alternating optimization is exactly (a) running K-means with two clusters on Xw to obtain y given w (when we say running K-means , we mean solving the vector quantization problem exactly), and (b) performing linear discriminant analysis to obtain w given y.

Proposition 1 (Joint clustering and dimension reduction for two clusters) Given X Rn d such that X 1n = 0 and X has rank d, consider the optimization problem

max w Rd,y { 1,1}n (y Xw)2

Πny 2 2 Xw 2 2 . (2)

Flammarion, Palaniappan and Bach

Given y, the optimal w is obtained as w = (X X) 1X y, while given w, the optimal y is obtained by running K-means on Xw.

This equivalence might be straightforward, however it has not been precisely stated in the literature to the best of our knowledge. Proof Given y, we need to optimize the Rayleigh quotient w X yy Xw

w X Xw with a rank-one matrix in the numerator, which leads to w = (X X) 1X y. Given w, we show in Appendix A, that the averaged distortion measure of K-means once the means have been optimized is exactly equal to (y Xw)2/ Πny 2 2.

Algorithm. The proposition above leads to an alternating optimization algorithm. Note that K-means in one dimension may be run exactly in O(n log n) (Bellman, 1973). After having optimized with respect to w in Eq. (2), we then need to maximize with respect to y the function y X(X X) 1X y

Πny 2 2 , which happens to be exactly performing K-means on the whitened data (which is now in high dimension and not in 1 dimension). At ﬁrst, it seems that dimension reduction is simply equivalent to whitening the data and performing Kmeans; while this is a formally correct statement, the resulting K-means problem is not easy to solve as the clustered dimension is hidden in noise; for example, algorithms such as Kmeans++ (Arthur and Vassilvitskii, 2007), which have a multiplicative theoretical guarantee on the ﬁnal distortion measure, are not provably eﬀective here because the minimal ﬁnal distortion is not small (since the clusters are corrupted by some noisy dimensions), and the multiplicative guarantee is then meaningless.

2.2 Convex relaxation and discriminative clustering

The discriminative clustering formulation in Eq. (1) may be optimized for any y { 1, 1}n

in closed form with respect to b as b = 1 n (y Xv)

n since X is centered. Substituting b in Eq. (1) leads us to

min v Rd 1 n Πny Xv 2 2 = 1

n Πny 2 2 max w Rd (y Xw)2

Xw 2 2 , (3)

where v is obtained from any solution w as v = w y Xw

Xw 2 2 . Thus, given

n2 #{i, yi = 1} #{i, yi = 1} 2 = α [0, 1], (4)

which characterizes the asymmetry between clusters and with Πny 2 = n(1 α), we obtain from Eq. (3), an equivalent formulation to Eq. (2) (with the added constraint) as

min y { 1,1}n, v Rd 1 n Πny Xv 2 2 such that (y 1n)2

n2 = α. (5)

This is exactly equivalent to a discriminative clustering formulation with the square loss (Bach and Harchaoui, 2007) with an explicit cluster balance constraint. Consequently we

Robust Discriminative Clustering with Sparse Regularizers

have formally established that the discriminative clustering formulation in Eq. (5) is related to the joint clustering and dimension reduction formulation in Eq. (2). Following Bach and Harchaoui (2007), we may optimize Eq. (5) in closed form with respect to v as v = (X X) 1X y. Substituting v in Eq. (5) leads us to

min y { 1,1}n 1 ny Πn X(X X) 1X y such that (y 1n)2

n2 = α. (6)

This combinatorial optimization problem is NP-hard in general (Karp, 1972; Garey et al., 1976). Hence in practice, it is classical to consider the following convex relaxation of Eq. (6) (Luo et al., 2010). For any admissible y { 1, +1}n, the matrix Y = yy Rn n is a rank-one symmetric positive semi-deﬁnite matrix with unit diagonal entries and conversely any such Y may be written in the form Y = yy such that y is admissible for Eq. (6). Moreover by rewriting Eq. (6) as

min y { 1,1}n 1 n tr yy Πn X(X X) 1X such that 1 n (yy )1n

we see that the objective and constraints are linear in the matrix Y = yy and Eq. (6) is equivalent to

min Y 0, rank(Y )=1, diag(Y )=1 1 n tr Y Πn X(X X) 1X such that 1 n Y 1n

Then dropping the non-convex rank constraint leads us to the following classical convex relaxation:

min Y 0, diag(Y )=1 1 n tr Y Πn X(X X) 1X such that 1 n Y 1n

n2 = α. (7)

This is the standard (unregularized) formulation, which is cast as a semi-deﬁnite program. The complexity of interior-point methods is O(n7), but eﬃcient algorithms in O(n2) for such problems have been developed due to the relationship with the max-cut problem (Journ ee et al., 2010; Wen et al., 2012). We note that convex relaxation techniques are also used for semi-supervised methods (De Bie and Cristianini, 2003). Given the solution Y , one may traditionally obtain a candidate y { 1, 1}n by running K-means on the largest eigenvector of Y or by sampling (Goemans and Williamson, 1995). In this paper, we show in Section 5 that it may be advantageous to consider the ﬁrst two eigenvectors.

2.3 Unsuccessful full convex relaxation

The formulation in Eq. (7) imposes an extra parameter α that characterises the cluster imbalance. It is tempting to ﬁnd a direct relaxation of Eq. (2). It turns out to lead to a trivial relaxation, which we outline below. When optimizing Eq. (2) with respect to w, we obtain the following optimization problem

max y { 1,1}n y X(X X) 1X y

Flammarion, Palaniappan and Bach

leading to a quasi-convex relaxation as

max Y 0, diag(Y )=1 tr Y X(X X) 1X

whose solution is found by solving a sequence of convex problems (Boyd and Vandenberghe, 2004, Section 4.2.5). As shown in Appendix B, this may be exactly reformulated as a single convex problem: max M 0, diag(M)=1+ 1 M1

n2 tr MX(X X) 1X .

Unfortunately, this relaxation always leads to trivial solutions, and we thus need to consider the relaxation in Eq. (7) for several values of α = 1 n Y 1n/n2 (and then the non-convex algorithm can be run from the rounded solution of the convex problem, using Eq. (2) as a ﬁnal objective). Alternatively, we may solve the following penalized problem for several values of ν 0:

min Y 0, diag(Y )=1 1 n tr Y Πn X(X X) 1X + ν

n2 1 n Y 1n. (8)

For ν = 0, Y = 1n1 n is always a trivial solution. As outlined in our theoretical section and as observed in our experiments, it is suﬃcient to consider ν [0, 1]. By convex duality (Borwein and Lewis, 2000, Sec. 4.3), both constrained relaxation in Eq. (7) and penalized relaxation in Eq. (8) are formally equivalent for speciﬁc choices of constraint parameter α and penalization parameter ν. We will see in Section 6 that the formulation in Eq. (8) is more suitable for algorithmic design (Bach et al., 2012).

2.4 Equivalent relaxations

Optimizing Eq. (5) with respect to v in closed form as in Section 2.2 is feasible with no regularizer or with a quadratic regularizer. However, if one needs to add more complex regularizers, we need a diﬀerent relaxation. Therefore, we now propose a new formulation of the discriminative clustering framework. We start from the penalized version of Eq. (5),

min y { 1,1}n, v Rd 1 n Πny Xv 2 2 + ν (y 1n)2

which we expand as:

min y { 1,1}n, v Rd 1 n tr Πnyy 2

n tr Xvy + 1

n tr X Xvv + ν (y 1n)2

and relax as, using Y = yy , P = yv and V = vv ,

min V,P,Y 1 n tr Πn Y 2

n tr P X + 1

n tr X XV +ν 1 n Y 1n

n2 s.t. Y P P V

0, diag(Y ) = 1. (11)

When optimizing Eq. (11) with respect to V and P, we get exactly Eq. (8). Indeed, the optimum is attained for V = (X X) 1X Y X(X X) 1 and P = Y X(X X) 1 as shown in Appendix C.1. Therefore, the convex relaxation in Eq. (11) is equivalent to Eq. (8).

Robust Discriminative Clustering with Sparse Regularizers

However, we get an interesting behavior when optimizing Eq. (11) with respect to P and Y also in closed form. For ν = 1, we obtain, as shown in Appendix C.2, the following closed form expressions:

Y = Diag(diag(XV X )) 1/2XV X Diag(diag(XV X )) 1/2

P = Diag(diag(XV X )) 1/2XV,

leading to the problem:

min V 0 1 2

(XV X )ii + 1

n tr(V X X). (12)

The formulation above in Eq. (12) is interesting for several reasons: (a) it is formulated as an optimization problem in V Rd d, which will lead to algorithms whose running time will depend on n linearly (see Section 6), (b) it allows for easy adding of regularizers (see Section 3), which may be formulated as convex functions of V = vv . At ﬁrst sight this seems to be valid only for ν = 1. However we now propose a reformulation which can handle all possible ν [0, 1) through a simple data augmentation.

Reformulation for any ν When ν [0, 1), we may reformulate the objective function in Eq. (9) as follows:

1 n Πny Xv 2 2 + ν (y 1n)2

n2 = 1 n Πny Xv + ν y 1n

n 1n 2 2 ν y 1n

n 2 + ν y 1n

= 1 n y Xv (1 ν)y 1n

n 1n 2 2 + ν 1 ν (1 ν)y 1n

= min b R 1 n y Xv b1n 2 2 + ν 1 ν b2, (13)

since 1 n y Xv b1n 2 2 + ν 1 ν b2 can be optimized in closed form with respect to b as

b = (1 ν)y 1n

n . Note that the weighted imbalance ratio (1 ν)y 1n

n is made as an optimization variable in Eq. (13). Thus we have the following reformulation

min v Rd, y { 1,1}n 1 n Πny Xv 2 2 + ν (y 1n)2

= min v Rd, b R, y { 1,1}n 1 n y Xv b1n 2 2 + ν 1 ν b2, (14)

which is a non-centered penalized formulation on a higher-dimensional problem in the variable v b Rd+1. In the rest of the paper, we will focus on the case ν = 1 for ease of exposition. This enables the use of the formulation in Eq. (12), which is easier to optimize. It is worth noting that this is not an algorithmic restriction. Of course any problem with ν [0, 1) can be treated with equal ease by adding a constant term and a quadratic regularizer.

Flammarion, Palaniappan and Bach

3. Regularization

There are several natural possibilities. We consider norms Ωsuch that Ω(w)2 = Γ(ww ) for a certain convex function Γ; all norms have that form (Bach et al., 2012, Proposition 5.1). When ν = 1, Eq. (12) then becomes

max V 0 2 n

(XV X )ii 1

n tr(V X X) Γ(V ). (15)

The quadratic regularizers Γ(V ) = tr ΛV have already been tackled by Bach and Harchaoui (2007). They consider the regularized version of problem in Eq. (3)

min v Rd 1 n Πny Xv 2 2 + v Λv, (16)

optimize in closed form with respect to v as v = (X X + nΛ) 1X y. Substituting v in Eq. (16) leads them to

min Y 0, diag(Y )=1 1 n tr Y Πn X(X X + nΛ) 1X .

In this paper, we propose a novel sparse extension to discriminative clustering framework with the square loss. Speciﬁcally we formulate a non-trivial sparse regularizer which is a combination of weighted squared ℓ1-norm and ℓ2-norm. It leads to

Γ(V ) = tr[Diag(a)V Diag(a)] + Diag(c)V Diag(c) 1, (17)

such that Γ(vv ) = Pd i=1 a2 i v2 i + Pd i=1 ci|vi| 2. This allows to treat all situations simultaneously, with ν = 1 or with ν [0, 1). To be more precise, when ν [0, 1), we can consider in Eq. (14), a problem of size d + 1 with a design matrix [X, 1n] Rn (d+1), a direction of projection v b Rd+1 and diﬀerent weights for the last variable with ad+1 = ν 1 ν and cd+1 = 0. Note that the sparse regularizers on V introduced in this paper are signiﬁcantly diﬀerent when compared to the sparse regularizers on variable v in Eq. (3), for example, considered by Wang et al. (2013). A straightforward sparse regularizer on v in Eq. (3), despite leading to a sparse projection, does not yield natural generalizations of the discriminative clustering framework in terms of theory or algorithms. In our analysis and experiments for the balanced clusters (when ν = 1), the sparse regularization Γ( ) = λ 1, for λ R will often be considered. This is equivalent to setting a = 0d and c =

λ1d in Eq. (17). The problem in Eq. (15) then becomes

max V 0 2 n

(XV X )ii 1

n tr(V X X) λ V 1. (18)

The sparse regularizers considered in this paper have a signiﬁcant algorithmic appeal for certain applications in computer vision (Bojanowski et al., 2013; Alayrac et al., 2016), audio processing (Lajugie et al., 2016) and natural language processing (Grave, 2014). They also lead to robust cluster recovery under minor assumptions as will be illustrated on a simple example in Section 5. The practical beneﬁts of the sparse regularizers will be further demonstrated using empirical evaluation on synthetic and real data sets in Section 7.

Robust Discriminative Clustering with Sparse Regularizers

4. Extension to Multiple Labels

The discussion so far has focussed on two clusters. Yet it is key in practice to tackle more clusters. It is worth noting that the discrete formulations in Eq. (2) and Eq. (5) extend directly to more than two clusters. However two diﬀerent extensions of the initial problems Eq. (2) or Eq. (5) are conceivable. They lead to problems with diﬀerent constraints on diﬀerent optimization domains and, consequently, to diﬀerent relaxations. We discuss these possibilities next. One extension is the multi-class case. The multi-class problem which is dealt with by Bach and Harchaoui (2007) assumes that the data are clustered into K classes and the various partitions of the data points into clusters are represented by the K-class indicator matrices y {0, 1}n K such that y1K = 1n. The constraint y1K = 1n ensures that one data point belongs to only one cluster. However as discussed by Bach and Harchaoui (2007), by letting Y = yy , it is possible to lift these K-class indicator matrices into the outer convex approximations CK = {Y Rn n : Y = Y , diag(Y ) = 1n, Y 0, Y 1

K 1n1 n } (Frieze and Jerrum, 1995), which is diﬀerent for all values of K. Note that letting K = 2 corresponds to the previous sections. In this paper, we consider a diﬀerent novel extension for discriminative clustering to the multi-label case. The multi-label problem assumes that the data share k labels and the data-label membership is represented by matrices y { 1, +1}n k. In other words, the multi-class problem embeds the data in the extreme points of a simplex, while the multi-label problem does so in the extreme points of the hypercube. The discriminative clustering formulation of the multi-label problem is

min v Rd k, y { 1,1}n k 1 n Πny Xv 2 F , (19)

where the Frobenius norm is deﬁned for any vector or rectangular matrix as A 2 F = tr AA = tr A A. Letting k = 1 here corresponds to the previous sections. The discrete ensemble of matrices y { 1, +1}n k can be naturally lifted into Dk = {Y Rn n : Y = Y , diag(Y ) = k1n, Y 0}, since diag(Y ) = diag(yy ) = Pk i=1 y2 i,i = k. As the optimization problems in Eq. (7) and Eq. (8) have linear objective functions, we can change the variable from Y to Y = Y/k to change the constraint diag(Y ) = k1n to diag( Y ) = 1n without changing the optimizer of the problem. Thus the problems can be solved over the relaxed domain D = {Y Rn n : Y = Y , diag(Y ) = 1n, Y 0} which is independent of k. Note that the domain D is similar to that considered in the problems in Eq. (8) and Eq. (11) and these convex relaxations are the same regardless of the value of k. Hence the multi-label problem is a more natural extension of the discriminative framework, with a slight change in how the labels y are recovered from the solution Y (we discuss this in Section 5.3).

5. Theoretical Analysis

In this section, we provide the ﬁrst theoretical analysis for the discriminative clustering framework with the square loss. We start with the 2-clusters situation: the non-sparse case

Flammarion, Palaniappan and Bach

is considered ﬁrst and analysis is provided for both balanced and imbalanced clusters. Our study for the sparse case currently only provides results for the simple 1-sparse solution. However, the analysis also yields valuable insights on the scaling between n and d. We then derive results for multi-label situation.

For ease of analysis, we consider the constrained problem in Eq. (7), the penalized problem in Eq. (8) or their equivalent relaxations in Eq. (12) or Eq. (18) under various scenarios, for which we use the same proof technique. We ﬁrst try to characterize the low-rank solutions of these relaxations and then show in certain simple situations the uniqueness of such solutions, which are then non-ambiguously found by convex optimization. Perturbation arguments could extend these results by weakening our assumptions but are not within the scope of this paper, and hence we do not investigate them further in this section.

5.1 Analysis for two clusters: non-sparse problems

In this section, we consider several noise models for the problem, either adding irrelevant dimensions or perturbing the label vector with noise. We consider these separately for simplicity, but they could also be combined (with little extra insight).

5.1.1 Irrelevant dimensions

We consider an ideal design matrix X Rn d such that there exists a direction v along which the projection Xv is perfectly clustered into two distinct real values c1 and c2. Since Eq. (2) is invariant by aﬃne transformation, we can rotate the design matrix X to have X = [y, Z] with y { 1, 1}n, which is clustered into +1 or 1 along the direction v = 1 0d 1 . Then after being centered, the design matrix is written as X = [Πny, Z] with Z = [z1, . . . , zd 1] Rn (d 1). The columns of Z represent the noisy irrelevant dimensions added on top of the signal y.

5.1.2 Balanced problem

When the problem is well balanced (y 1n = 0), y is already centered and Πny = y. Thus the design matrix is represented as X = [y, Z]. We consider here the penalized formulation in Eq. (8) with ν = 1 which is the only scenario where we are able to provide a theoretical analysis.

Let us assume that the columns (zi)i=1,...,d 1 of Z are i.i.d. with symmetric distribution z, with Ez = Ez3 = 0 and such that z is almost surely bounded by R 0. We denote by Ez2 = m its second moment and by Ez4/(Ez2)2 = β its (unnormalized) kurtosis.

Surprisingly the clustered vector y happens to generate a solution yy of the relaxation Eq. (8) for all possible values of Z (see Lemma 11 in Appendix D.2 ). However the problem in Eq. (8) should have a unique solution in order to always recover the correct assignment y. Unfortunately the semideﬁnite constraint Y 0 of the relaxation makes the secondorder information arduous to study. Due to this reason, we consider the other equivalent relaxation in Eq. (12) for which V = vv is also solution with v (X X) 1X y (see Lemma 12 in Appendix D.3). Fortunately the semideﬁnite constraint V 0 of the problem in Eq. (12) may be ignored since the second-order information in V of the objective function

Robust Discriminative Clustering with Sparse Regularizers

already provides unicity for the unconstrained problem. Hence we are able to ensure the uniqueness of the solution with high probability.

Proposition 2 Let us assume d 3, β > 1 and m2 β 3 2(d+β 4):

(a) If n d2R4 1+(d+β)m2

m2(β 1) , V is the unique solution of the problem in Eq. (12) with high probability. (b) If n d2R4 min{m2(β 1),2m2,2m}, v is the principal eigenvector of any solution of the problem in Eq. (12) with high probability.

Let us make the following observations:

Proof technique: The proof relies on a computation of the Hessian of f(V ) = 2 n Pn i=1 p

(XV X )ii 1

n tr X XV which is the objective function in Eq. (12). We ﬁrst derive the expectation of 2f(V ) with respect to the distribution of X. By the law of large numbers, it amounts to have n going to inﬁnity in 2f(V ). Then we expand the spectrum of this operator E 2f(V ) to lower-bound its smallest eigenvalue. Finally we use concentration theory on matrices, following Tropp (2012), to bound the Hessian 2f(V ) for ﬁnite n.

Eﬀect of kurtosis: We remind that β 1, with equality if and only if z follows a Rademacher law (P(z = +1) = P(z = 1) = 1/2). Thus, if the noisy dimensions are clustered, then unsurprisingly, our guarantee is meaningless. Note that the constant β behaves like a distance of the distribution z to the Rademacher distribution. Moreover, β = 3 if z follows a standard normal distribution.

Scaling between d and n: If the noisy variables are not evenly clustered between the same clusters { 1} (i.e., β > 1), we recover a rank-one solution as long as n = O(d3); while, as long as n = O(d2), the solution is not unique but its principal eigenvector recovers the correct clustering. Moreover, as explained in the proof, its spectrum would be very spiky.

The assumption m2 β 3 2(d+β 4) is generally satisﬁed for large dimensions. Note that m2d is the total variance of the irrelevant dimensions, and when it is small, i.e., when m2 β 3 2(d+β 4), the problem is particularly simple, and we can also show that V is

the unique solution of the problem in Eq. (12) with high probability if n d2R4

m2 . Finally, note that for sub-Gaussian distributions (where β 3), the extra constraint is vacuous, while for super-Gaussian distributions (where β 3), this extra constraint only appears for small m.

This result provides the ﬁrst guarantee for discriminative clustering. However similar theoretical results have been derived for K-means by Ostrovsky et al. (2006) and Gaussian mixtures by Kalai et al. (2010); Moitra and Valiant (2010), where separation conditions between the two clusters are derived, under which the clustering problem is eﬃciently solved. It would be of great interest to relate these separation conditions to our condition on n and d but this is outside the scope of this work.

Flammarion, Palaniappan and Bach

5.1.3 Noise robustness for the one dimensional balanced problem

We assume now that the data are one-dimensional and are perturbed by some noise ε Rn

such that X = y + ε with y { 1, 1}n. The solution of the relaxation in Eq. (8) recovers the correct y in this setting only when each component of y and y + ε have the same sign (this is shown in Appendix D.5). This result comes out naturally from the information on whether the signs of y and y + ε are the same or not. Further if we assume that y and ε are independent, this condition is equivalent to ε < 1 almost surely.

5.1.4 Unbalanced problem

When the clusters are imbalanced (y 1n = 0), the natural rank-one candidates Y = yy

and V = vv are no longer solutions of the relaxations in Eq. (8) (for ν = 1) and Eq. (12), as proved in Appendix D.6. Nevertheless we are able to characterize some solutions of the penalized relaxation in Eq. (8) for ν = 0.

Lemma 3 For ν = 0 and for any non-negative a, b R such that a + b = 1,

Y = ayy + b1n1 n

is solution of the penalized relaxation in Eq. (8).

Hence any eigenvector of this solution Y would be supported by the directions y and 1n. Moreover when the value α = (1 n y

n )2 is known, it turns out that we can characterize some solutions of the constrained relaxation in Eq. (7), as stated in the following lemma.

Lemma 4 For α α ,

1 α yy + 1 1 α

is a rank-2 solution of the constrained relaxation in Eq. (7) with constraint parameter α.

The eigenvectors of Y enable to recover y for α α < 1. We conjecture (and checked empirically) that this rank-2 solution is unique under similar regimes to those considered for the balanced case. The proof would be more involved since, when ν = 1, we are not able to derive an equivalent problem in V for the penalized relaxation in Eq. (8) similar to Eq. (12) for the balanced case. We also note that Lemmas 3 and 4 will be direct consequences of Lemma 8 in Section 5.3. Thus Y being rank-2, one should really be careful and consider the ﬁrst two eigenvectors when recovering y from a solution Y . This can be done by rounding the principal eigenvector of Πn Y Πn = 1 α 1 α Πny(Πny) as discussed in the following lemma.

Lemma 5 Let yev be the principal eigenvector of Πn Y Πn where Y is deﬁned in Lemma 4, then sign(yev) = y.

Proof By deﬁnition of Y , yev = q

1 α 1 α Πny thus sign(yev) = sign(Πny) and since α 1

then sign(Πny) = sign(y α1n) = y.

Robust Discriminative Clustering with Sparse Regularizers

In practice, contrary to the standard procedure, we should, for any ν, solve the penalized relaxation in Eq. (8) and then do K-means on the principal eigenvector of the centered solution Πn Y Πn instead of the solution Y to recover the correct y. This procedure is followed in our experiments on real-world data in Section 7.2.

5.2 Analysis for two clusters: one-sparse problems

We assume here that the direction of projection v (such that Xv = y) is l-sparse (by l-sparse we mean v 0 = l). The ℓ1-norm regularized problem in Eq. (18) is no longer invariant by aﬃne transformation and we cannot consider that X = [y, Z] without loss of generality. Yet the relaxation Eq. (18) seems experimentally to only have rank-one solutions for the simple l = 1 situation. Hence we are able to derive some theoretical analysis only for this case. It is worth noting the l = 1 case is simple since it can be solved in O(d) by using K-means separately on all dimensions and ranking them. Nonetheless the proposed scaling also holds in practice for l 1 (see Figure 1b). Thereby we consider data X = [y, Z] with y { 1, 1}n and Z Rn (d 1) which are clustered in the direction v = [1, 0, . . . , 0] Rd. When adding a ℓ1-penalty, the initial problem in Eq. (5) for α = 0 is

min y { 1,1}n, v Rd 1 n y Xv 2 2 + λ v 2 1. (20)

When optimizing in v this problem is close to the Lasso (Tibshirani, 1996) and a solution is known to be v i = (y y + nλ) 1y y = 1 1+λ, i J and v i = 0, i {1, 2, . . . , d} \ J, where J is the support of v . The candidate V = v v is still a solution of the relaxation in Eq. (18) (see Lemma 15 in Appendix E.1) and we will investigate under which conditions on X this solution is unique. Let us assume as before (zi)i=1,...,d are i.i.d. with distribution z symmetric with Ez = Ez3 = 0, and denote by Ez2 = m and Ez4/(Ez2)2 = β. We also assume that z is almost surely bounded by 0 R 1. We are able to ensure the uniqueness of the solution with high-probability.

Proposition 6 Let us assume d 3. (a) If n d R2 1+(d+β)m2

m2(β 1) , V is the unique solution of the problem Eq. (12) with high probability. (b) If n d R2 m2(β 1), v is the principal eigenvector of any solution of the problem Eq. (12) with high probability.

The proof technique is very similar to the one of Proposition 2. With the function g(V ) = 2 n Pn i=1 p

(XV X )ii λ V 1 1

n tr X XV , we can certify that g will decrease around the solution V by analyzing the eigenvalues of its Hessian. The rank-one solution V is recovered by the principal eigenvector of the solution of the relaxation Eq. (18) as long as n = O(d). Thus we have a much better scaling when compared to the non-sparse setting where n = O(d2). We also conjecture a scaling of order n = O(ld) for a projection in a l-sparse direction (see Figure 1b for empirical results). The proposition does not state any particular value for the regularizer parameter λ. This makes sense since the proposition only holds for the simple situation when l = 1. We propose to use λ = 1/ n by analogy with the Lasso.

Flammarion, Palaniappan and Bach

5.3 Analysis for the multi-label extension

In this section, the signals share k labels which are corrupted by some extra noisy dimensions. We assume the centered design matrix to be X = [Πny, Z] where y { 1, +1}n k

and Z Rn (d k). We also assume that y is full-rank1. We denote by y = [y1, . . . , yk] and

αi = y i 1n

n 2 for i = 1, , k. We consider the discrete constrained problem

min v Rd k, y { 1,1}n k 1 n Πny Xv 2 F such that 1 n yy 1n

n2 = α2, (21)

and the discrete penalized problem for ν = 0

min v Rd k, y { 1,1}n k 1 n Πny Xv 2 F . (22)

As explained in Section 4, these two discrete problems admit the same relaxations in Eq. (7) and Eq. (8) we have studied for one label. We now investigate when the solution of the problems in Eq. (21) and in Eq. (22) generate solutions of the relaxations in Eq. (7) and Eq. (8). By analogy with Lemma 3, we want to characterize the solutions of these relaxations which are supported by the constant vector 1n and the labels (y1, . . . , yk). Their general form is Y = y A y where A Rk k is symmetric semi-deﬁnite positive and y = [1n, y]. However the initial y is easily recovered from the solution Y only when A is diagonal. To that end the following lemma derives some condition under which the only matrix A such that the corresponding Y satisﬁes the constraint of the relaxations in Eq. (7) and Eq. (8) is diagonal.

Lemma 7 The solutions of the matrix equation diag( y A y ) = 1n with unknown variable A are diagonal if and only if the family {1n, (yi)1 i k, (yi yj)1 i<j k} is linearly independent where we denoted by the Hadamard (i.e., pointwise) product between matrices.

In this way we are able to characterize the solution of relaxations in Eq. (7) and Eq. (8) with the following result:

Lemma 8 Let us assume that the family {1n, (yi)1 i k, (yi yj)1 i<j k} is linearly independent. If α αmin = min 1 i k{αi} with (αi)1 i k deﬁned above Eq. (21), the solutions of the

constrained relaxation in Eq. (7) supported by the vectors (1n, y1, , yk) are of the form:

Y = a2 01n1 n +

i=1 a2 i yiy i ,

where (ai)0 i k satisﬁes Pk i=0 a2 i = 1 and a2 0 + Pk i=1 a2 i αi = α.

1. This assumption is fairly reasonable since the probability of a matrix whose entries are i.i.d. Rademacher random variables to be singular is conjectured to be 1/2 + o(1) (Bourgain et al., 2010).

Robust Discriminative Clustering with Sparse Regularizers

Moreover the solutions of the penalized relaxation in Eq. (8) for ν = 0 which are supported by the vectors (1n, y1, , yk) are of the form:

Y = a2 01n1 n +

i=1 a2 i yiy i ,

where (ai)0 i k satisﬁes Pk i=0 a2 i = 1.

In the multi-label case, some combinations of the constant matrix 1n1 n and the rankone matrices yiy i are solutions of constrained or penalized relaxations. Furthermore, under some assumptions on the labels (yi)1 i k, these combinations are the only solutions which are supported by the vectors (1n, y1, , yk). And we conjecture (and checked empirically) that under assumptions similar to those made for the balanced one-label case, all the solutions of the relaxation are supported by the family (1n, y1, , yk) and consequently share the same form as in Lemma 8. Thus the eigenvector of the solution Y would be in the span of the directions (1n, y1, , yk). Let us consider an eigenvalue decomposition of Y = FF = Pk i=0 λieie i and denote by M = [a01n, a1y1, , akyk] where (ai)0 i k are deﬁned in Lemma 8. Since MM = FF , there is an orthogonal transformation R such that FR = M. We also denote the product FR by FR = [ξ0, , ξK]. We propose now an alternating minimization procedure to recover the labels (y1, , yk) from M.

Lemma 9 Consider the optimization problem

min M M, R Rk k: R R=Ik FR M 2 F ,

where M = {[a01n, a1y1, , akyk], a Rk+1 : a 2 = 1, yi { 1}n}. Given M, the problem is equivalent to the orthogonal Procrustes problem (Sch onemann, 1966). Denote by U V a singular value decomposition of F M. The optimal R is obtained as R = UV . While given R, the optimal M is obtained as

ξ1 2 1 + ξ2 2 1 + . . . + ξk 2 1 [ ξ0 1 sign(ξ0), , ξk 1 sign(ξk)].

Proof We give only the argument for the optimization problem with respect to M. Given R, the optimization problem in M is equivalent to max a Rk+1: a 2=1, y { 1,1}n k tr(FR) M and

tr(FR) M = a0ξ 0 1n + Pk i=1 aiξ i yi . Thus by property of the dual norms the solution is given by yi = sign(ξi) and ai = ξi 1

ξ1 2 1+ ξ2 2 1+...+ ξk 2 1 .

The minimization problem in Lemma 9 is non-convex; however we observe that performing few alternating optimizations is suﬃcient to recover the correct (y1, . . . , yk) from M.

5.4 Discussion

In this section we studied the tightness of convex relaxations under simple scenarios where the relaxed problem admits low-rank solutions generated by the solution of the original

Flammarion, Palaniappan and Bach

non-convex problem. Unfortunately the solutions lose the characterized rank when the initial problem is slightly perturbed since the rank of a matrix is not a continuous function. Nevertheless, the spectrum of the new solution is really spiked, and thus these results are quite conservative. We empirically observe that the principal eigenvectors keep recovering the correct information outside these scenarios. However this simple proof mechanism is not easily adaptable to handle perturbed problems in a straightforward way since it is diﬃcult to characterize the properties of eigenvectors of the solution of a semi-deﬁnite program. Hence we are able to derive a proper theoretical study only for these simple models.

6. Algorithms

In this section, we present an optimization algorithm which is adapted to large n settings, and avoids the n-dimensional semideﬁnite constraint.

6.1 Reformulation

We aim to solve the general regularized problem which corresponds to Eq. (15)

max V 0 2 n

(XV X )ii 1

n tr V (X X + n Diag(a)2) Diag(c)V Diag(c) 1. (23)

We consider a slightly diﬀerent optimization problem:

max V 0 1 n

(XV X )ii Diag(c)V Diag(c) 1 s.t. tr V ( 1

n X X + Diag(a)2) = 1. (24)

When c is equal to zero, then Eq. (24) is exactly equivalent to Eq. (23); when c is small (as will typically be the case in our experiments), the solutions are very similar in fact, one can show by Lagrangian duality that by a sequence of problems in Eq. (24), one may obtain the solution to Eq. (23).

6.2 Smoothing

By letting A= X X

n +Diag(a)2, we consider a strongly-convex approximation of Eq. (24) as:

max V 0 1 n

(XV X )ii Diag(c)V Diag(c) 1 ε tr[(A 1 2 V A 1 2 ) log(A 1 2 V A 1 2 )]

s.t. tr(A 1 2 V A 1 2 ) = 1, (25)

where tr M log(M) is a spectral convex function called the von-Neumann entropy (von Neumann, 1927). The diﬀerence in the two problems is known to be ε log(d) (Nesterov, 2007). As shown in Appendix G.1, the dual problem is

min u Rn +,C Rd d:|Cij| cicj

1 ui + φε A 1

2n X Diag(u)X C A 1

where φε(M) is an ε-smooth approximation to the maximal eigenvalue of the matrix M.

Robust Discriminative Clustering with Sparse Regularizers

6.3 Optimization algorithm

In order to solve Eq. (26), we split the objective function into a smooth part F(u, C) = φε A 1

2n X Diag(u)X C A 1

2 and a non-smooth part H(u, C) = I|Cij| cicj+ 1

2n Pn i=1 1 ui . We may then apply FISTA (Beck and Teboulle, 2009) updates to the smooth function φε(A 1

2n X Diag(u)X C)A 1

2 ), along with a proximal operator for the non-smooth terms I|Cij| cicj and 1 2n Pn i=1 1 ui , which may be computed eﬃciently. See details in Appendix G.2.

Running-time complexity. Since we need to project on the SDP cone of size d at each iteration, the running-time complexity per iteration is O(d3 + d2n); given that often n d, the dominating term is O(d2n). It is still an open problem to make this linear in d. Our function being O(1/ε)-smooth, the convergence rate is of the form O(1/(εt2)). Since we stop when the duality gap is ε log(d) (as we use smoothing, it is not useful to go lower), the number of iterations is of order 1/(ε p

log(d)). The proposed algorithm is a clear improvement over the existing approach by Bach and Harchaoui (2007) which is quadratic in n.

7. Experiments

We implemented the proposed algorithm in Matlab. The code has been made available in https://drive.google.com/uc?export=download&id=0B5Bx9jrp7cel Mk5p OFI4UGt0ZEk. Two sets of experiments were performed: one on synthetically generated data sets and the other on real-world data sets. The details about experiments follow.

7.1 Experiments on synthetic data

In this section, we illustrate our theoretical results and algorithms on synthetic examples. The synthetic data were generated by assuming a ﬁxed clustering with α [0, 1], along a single direction and the remaining variables were whitened. We consider clustering error deﬁned for a predictor y as 1 ( y y/n)2, with values in [0, 1] and equal to zero if and only if y = y.

Phase transition. We ﬁrst illustrate our theoretical results for the balanced case in Figure 1. We solve the relaxation in Eq. (12) and Eq. (18) for a large range of d and n using the cvx solver (Grant and Boyd, 2008, 2014). We show the results averaged over 4 replications and take λ = 1/ n for the sparse problems. In Figure 1a, we investigate whether cvx ﬁnds a rank-one solution for a problem of size (n, d) (the value is 1 if the solution is rank-one and 0 otherwise). We compare the performance of the algorithms without ℓ1-regularization in the aﬃne invariant case and with ℓ1-regularization in the 1sparse case. We observe a phase transition with a scaling over the form n = O(d2) for the aﬃne invariant case and n = O(d) for the 1-sparse case. This is better than what is expected by the theory and corresponds rather to the performance of the principal eigenvector of the solution. It is worth noting that it may be uncertain to really distinguish between a rank-one solution and a spiked solution. We also solve the relaxation for 4-sparse problems of diﬀerent sizes d and n and plot the clustering error. We compare, in Figure 1b, the performance of the formulation in Eq. (12)

Flammarion, Palaniappan and Bach

(without ℓ1-regularization) which corresponds to the aﬃne invariant case, against the ℓ1regularized formulation in Eq. (18). We notice a phase transition of the clustering error with a scaling over the form n = O(d2) for the aﬃne invariant case and n = O(d) for the 4-sparse case. It supports our conjecture on the scaling of order n = O(ld) for l-sparse problems. Comparing left plots of Figure 1a and Figure 1b, we observe that the two phase-transitions occur at the same scaling between n and d. Thus there are few values of (n, d) for which the cvx solver ﬁnds a solution whose rank is strictly larger than one and whose principal eigenvector has a low clustering error. This illustrates, in practice, this solver aims to ﬁnd a rank-one solution under the improved scaling n = O(d2).

Noise dimension d

Data size n

Noise dimension d

Data size n

(a) Phase transition for rank-one solution. Left: aﬃne invariant case. Right: 1-sparse case.

Noise dimension d

Data size n

10 20 30 40 50

Noise dimension d

Data size n

10 20 30 40 50

(b) Phase transition for clustering error. Left: aﬃne invariant case . Right: 4-sparse case.

Figure 1: Phase transition plots.

Unbalanced case. We generate an unbalanced problem for d = 10, n = 80 and α = 0.25 and we average the results over 10 replications. We compare the clustering error for the constrained and the penalized relaxations in Eq. (7) and Eq. (8) when we consider the sign of the ﬁrst or second eigenvector and when we use projection technique deﬁned as (Πn Y(2)Πn)(1) where Y(k) is the best rank-k approximation of Y , to extract the information of y. We see in Figure 2 that (a) for the constrained case, the range of α such that the sign of y is recovered is cut in two parts where one eigenvector is correct, whereas the projection method performs well on the whole set. (b) For the penalized case, the correct sign is

Robust Discriminative Clustering with Sparse Regularizers

Clustering Error

First eig. Second eig. Proj.

Rank of the solution

σ=0 σ=0.1 σ=0.3

Clustering Error

First eig. Second eig. Proj.

Figure 2: Unbalanced problem for n = 80, d = 10 and α = 0.25. Left: Clustering error for the constrained relaxation. Middle: Rank of the solution for diﬀerent level of noise σ. Right: Clustering error for the penalized relaxation.

recovered for ν close to 0 by the ﬁrst eigenvector and the projection method whereas the second one performs always badly. (c) When there is zero noise the rank of the solution is one for α {α , 1}, two for α (α , 1) and greater otherwise. These ﬁndings conﬁrm our analysis. However, when y is corrupted by some noise this result is no longer true.

Runtime experiments. We generated data with a k-sparse direction of projection v by adding d k noise variables to a randomly generated and rotated k-dimension data. The scalability of the FISTA based optimization algorithm illustrated in Section 6.3 to solve Eq. (24) (with c =

λ1d, a = 0d) was compared against a benchmark cvx solver (which solves Eq. (18)). Experiments were performed for λ = 0 and λ = 0.001, the coeﬃcient associated with the sparse V 1 term. For a ﬁxed d, cvx breaks down for large n values (typically n 1000). Similarly, the runtime required by cvx is generally high for λ = 0 and is comparable to our method for λ = 0.001. This behavior is illustrated in Figure 3. When λ = 0, the problem reduces exactly to the original Diﬀrac problem (Bach and Harchaoui, 2007). In the plots in Figure 3a our implementation using FISTA is compared to the baseline Diﬀrac which is solved with max-cut SDP (Boumal et al., 2014). We observed that our method is comparable in terms of runtime and clustering performance of low-rank methods for max-cut. However, for λ > 0, the equivalence with max-cut disappears. The plots in these ﬁgures show the behavior of FISTA for two diﬀerent stopping criteria: ε = 10 2/ log(d) and ε = 10 3/ log(d). It is observed that the choice 10 3/ log(d) gives a better accurate solution at the cost of more number of iterations (and hence higher runtime). For sparse problems in Figure 3b, we see that cvx gets a better clustering performance (while crashing for large n); the diﬀerence would be reduced with a smaller duality gap for FISTA.

Clustering performance. Experiments comparing the proposed method (Eq. (24) with c =

λ1d and a = 0d solved using FISTA based optimization algorithm, and Eq. (18) solved using benchmark cvx solver) with K-means and alternating optimization are given in Figure 4. K-means is run on the whitened variables in Rd. Alternating optimization is another popular method proposed by Ye et al. (2008) for dimensionality reduction with clustering (where alternating optimization of w and y is performed to solve the non-convex formulation (2)). The plots show that both K-means and alternating optimization fail when only a few dimensions of noise variables are present. The plots also show that with the introduction of a sparse regularizer (corresponding to the non-zero λ) the proposed

Flammarion, Palaniappan and Bach

0 500 1000 1500 2000 0

Num samples n

Average Clustering Error

fista (0.01/log(d))

fista (0.001/log(d))

0 500 1000 1500 2000 2500 10 0

Num samples n

Average CPU Time (sec)

fista (0.01/log(d))

fista (0.001/log(d))

0 20 40 60 80 100 0

Dimension d

Average Clustering Error

fista (0.01/log(d))

fista (0.001/log(d))

0 20 40 60 80 100 10 2

Dimension d

Average CPU Time (sec)

fista (0.01/log(d))

fista (0.001/log(d))

(a) cvx, max-cut comparison with λ = 0. Top: n varied with d = 50, k = 6. cvx crashed for n 1000. Bottom: d varied with n = 100, k = 2.

0 500 1000 1500 2000 0

Num samples n

Average Clustering Error

fista (0.01/log(d))

fista (0.001/log(d))

0 500 1000 1500 2000 10 1

Num samples n

Average CPU Time (sec)

fista (0.01/log(d))

fista (0.001/log(d))

0 50 100 150 0

Dimension d

Average Clustering Error

fista (0.01/log(d))

fista (0.001/log(d))

0 50 100 150 10 2

Dimension d

Average CPU Time (sec)

fista (0.01/log(d))

fista (0.001/log(d))

(b) cvx comparison with λ = 0.001. Top: n varied with d = 50, k = 6. cvx crashed for n 1000. Bottom: d varied with n = 100, k = 2.

Figure 3: Scalability experiments. 20

Robust Discriminative Clustering with Sparse Regularizers

method becomes more robust to noisy dimensions. As observed earlier, the performance of FISTA is also sensitive to the choice of ε. Finally we give a comparison of sparse discriminative clustering (cvx and FISTA) with max-margin clustering (Li et al., 2009) in Figure 5. We note that square loss is used in our framework whereas hinge loss is used in max-margin clustering. We have also included the behavior of K-means and alternating optimization methods in Figure 5 for completeness. From this plot, it is clear that the max-margin clustering is sensitive to noisy dimensions present in the data. Sparse discriminative clustering with square loss is able to maintain zero cluster error for a large number of noisy dimensions, while the performance of maxmargin clustering starts deteriorating after adding a few noisy dimensions. However, we note from Figure 5 that for large dimensions, the hinge loss used in max-margin clustering is observed to provide a better solution than the square loss used in our framework.

0 20 40 60 80 100 0.2

Dimension d

Average Clustering Error

fista (0.01/log(d))

fista (0.001/log(d))

0 20 40 60 80 0.2

Dimension d

Average Clustering Error

fista (0.01/log(d))

fista (0.001/log(d))

(b) λ = 0.01

0 20 40 60 80 0.2

Dimension d

Average Clustering Error

fista (0.01/log(d))

fista (0.001/log(d))

(c) λ = 0.001

Figure 4: Comparison with k-means and alternating optimization, n = 100.

7.2 Experiments on real-world data

Experiments on two-class data. Experiments were conducted on real two-class classiﬁcation datasets2 to compare the performance of sparse discriminative clustering against non-sparse discriminative clustering, alternating optimization, K-means and max-margin

2. The data sets were obtained from https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/

Flammarion, Palaniappan and Bach

0 10 20 30 40 50 60 70 80 90 0.2

Dimension d

Average Clustering Error

fista (0.01/log(d)) fista (0.001/log(d)) cvx k means alt opt mmc

Figure 5: Comparison with k-means, alternating optimization and max-margin clustering (mmc), n = 100. The plots for FISTA, cvx and mmc correspond to the best choice of regularization parameters.

clustering algorithms. For sparse and non-sparse discriminative clustering, we consider the problem in Eq. (24) and the algorithm detailed in Section 6.3 (with the regularization c = 0 for the non-sparse case). The alternating optimization method is described in Proposition 1. For the two-class datasets, the clustering performance for a cluster y {+1, 1}n obtained from an algorithm under comparison, was computed as 1 ( y y/n)2, where y is the original labeling. Here we explicitly compare the output of clustering with the original labels of the data points. The dataset details and clustering performance results are summarized in Table 1. The experiments for discriminative clustering were conducted for diﬀerent values of a, c {10 3, 10 2, 10 1}1d associated with the ℓ2-regularizer and ℓ1-regularizer respectively. The range of cluster imbalance parameter was chosen to be ν {0.01, 0.25, 0.5, 0.75, 1}. Note that for ν = 1, the reformulation given in Eq. (14) was used, as explained after Eq. (17) in Section 3. The results given in Table 1 pertain to the best choices of these parameters. Similarly, the values of regularization parameter for max-margin clustering (Li et al., 2009) were chosen from the set {10 5, 10 4, 10 3, 10 2, 0.1, 1, 10} and the cluster balance parameter was chosen from {0.1, 0.2, . . . , 0.9}. The results for alternating optimization and K-means show the average cluster error (and standard deviation) over 10 diﬀerent runs. These results show that the cluster error is quite high for many datasets. This is primarily due to the absence of an ambient low-dimensional clustering of the two-class data, which can be identiﬁed by the simple linear model presented in this paper. Since K-means does not provide explicit dimensionality reduction, it might not be able to take advantage of the existence of an ambient low-dimensional clustering of the two-class data and its performance is poor. The results show that max-margin clustering achieves best clustering performance on most datasets. This improved performance of max-margin clustering may be due to the use of hinge loss, as opposed to square loss used in discriminative clustering in this paper. However for the heart dataset, we note that the sparse version with the square loss performs

Robust Discriminative Clustering with Sparse Regularizers

signiﬁcantly better than the non-sparse version with the hinge loss (see additional experiments in Figure 5). The results also show that adding sparse regularizers to discriminative clustering helps in a better cluster identiﬁcation when compared to the non-sparse case.

Table 1: Experiments on two-class datasets

Dataset n d Cluster Error Sparse Non-sparse Alternating K-means Max-margin Discriminative Discriminative Optimization Clustering Clustering Clustering

Heart 270 3 0.52 0.61 0.97 0.03 0.91 0.09 0.93 Diabetes 768 8 0.88 0.88 0.91 0.05 0.93 0.06 0.88 Breast-cancer 683 10 0.15 0.15 0.48 0.17 0.68 0.24 0.15 Australian 690 14 0.5 0.5 0.88 0.17 0.87 0.21 0.5 Liver-disorder 345 6 0.97 0.97 0.99 0.01 0.99 0.01 0.73 Sonar 208 60 0.92 0.95 0.98 0.02 0.99 0.01 0.92 DNA(1 vs 2,3) 1400 180 0.75 0.83 0.99 0.01 0.98 0.02 0.71 a1a 1605 113 0.74 0.75 0.98 0.02 0.8 0.08 0.69 w1a 2270 290 0.11 0.11 0.92 0.08 0.16 0.06 0.11

Experiments on real multi-label data. Experiments were also conducted on the Microsoft COCO dataset3 to demonstrate the eﬀectiveness of the proposed method in discovering multiple labels. We considered n = 2000 images from the dataset, each of which was labeled with a subset of K = 80 labels. The labels identiﬁed the objects in the images like person, car, chair, table, etc. and the corresponding features for each image were extracted from the last layer of a conventional convolutional neural network (CNN). The CNN was originally trained over the imagenet data (Krizhevsky et al., 2012). For each image in the dataset, we obtained d = 1000 features. We then performed discriminative clustering on the 2000 1000 data matrix X and obtained the label matrix Y which was then subjected to the alternating optimization procedure (see Section 5.3). It is clearly unlikely to recover perfect labels; therefore we now describe a way of measuring the amount of information which is recovered. In order to extract meaningful cluster information from the result so-obtained, we computed the correlation matrix Y k Πn Ytrue where Ytrue is the n K label matrix containing actual labels and Πn is the n n centering matrix In 1

n1n1 n . The k predicted labels for the examples are present in the Yk matrix of size n k. In order to choose an appropriate value of k, we plotted Tr(ΦYtrueΦYk) (shown in Figure 6 along with a K-means baseline), where ΦYk = Yk(Yk Yk) 1Y k . From these plots, we chose k = 30 to be a suitable value for our interpretation purposes. After choosing an arbitrary value of k = 30, we plotted the correlations between the actual and predicted labels. The heat map of the normalized absolute correlations is given in Figure 7, where the columns and rows corresponding to the 80 true labels and 30 predicted labels respectively, are ordered according to the sum of squared correlations (the top-scoring labels appear to the left-bottom). From this plot, we extract following highly correlated

3. Dataset obtained from http://mscoco.org/dataset

Flammarion, Palaniappan and Bach

0 10 20 30 40 50 60 70 80 0

Tr(Φ(Ytrue) * Φ(Yk))

alt min k means

Figure 6: Plot of Tr(ΦYtrueΦYk).

True labels

Predicted labels

Figure 7: Heat map of correlations, Y k Πn Ytrue with k = 30, with columns and rows ordered according to the sum of squared correlations.

labels: person, dining table, car, chair, cup, tennis racket, bowl, truck, fork, pizza, showing that these labels were partially recovered by our unsupervised technique (note that the CNN features are learned with supervision on the diﬀerent dataset Imagenet, hence there is still some partial supervision).

8. Conclusion

In this paper, we provided a sparse extension of the discriminative clustering framework, and gave a ﬁrst analysis of its theoretical performance in the totally unsupervised situation, highlighting provable scalings between ambient dimension d, number of observations and clusterability of irrelevant variables. We also proposed an eﬃcient algorithm which is the ﬁrst of its kind to be linear in the number of observations for discriminative clustering with the square loss. Our work could be extended in a number of ways, e.g., extending the sparse

Robust Discriminative Clustering with Sparse Regularizers

analysis to l-sparse case with higher l, extending the framework to nonlinear clustering using kernels, considering related weakly supervised learning extensions (Joulin and Bach, 2012), going beyond uniqueness of rank-one solutions, and improving the complexity of our algorithm to O(nd), for example using stochastic gradient techniques.

Acknowledgments

The authors would like to thank Nicolas Boumal for interesting discussions and for his advice on max-cut implementation, Piotr Bojanowski for his help with multi-label datasets and the reviewers for their constructive and helpful comments which helped to signiﬁcantly improve the paper.

Flammarion, Palaniappan and Bach

Jean-Baptiste Alayrac, Piotr Bojanowski, Nishant Agrawal, Ivan Laptev, Josef Sivic, and Simon Lacoste-Julien. Unsupervised learning from narrated instruction videos. In Computer Vision and Pattern Recognition (CVPR), 2016.

D. Arthur and S. Vassilvitskii. k-means++: The advantages of careful seeding. In Proceedings of the eighteenth annual ACM-SIAM symposium on Discrete algorithms, 2007.

F. Bach and Z. Harchaoui. DIFFRAC : a discriminative and ﬂexible framework for clustering. In Advances in Advances in Neural Information Processing Systems (NIPS), 2007.

F. Bach, R. Jenatton, J. Mairal, and G. Obozinski. Optimization with sparsity-inducing penalties. Foundations and Trends R in Machine Learning, 4(1):1 106, 2012.

A. Beck and M. Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM J. Imaging Sci., 2(1):183 202, 2009.

R. Bellman. A note on cluster analysis and dynamic programming. Mathematical Biosciences, 1973.

G. Blanchard, M. Kawanabe, M. Sugiyama, V. Spokoiny, and K.-R. M uller. In search of non-Gaussian components of a high-dimensional distribution. The Journal of Machine Learning Research, 7:247 282, 2006.

Piotr Bojanowski, Francis Bach, Ivan Laptev, Jean Ponce, Cordelia Schmid, and Josef Sivic. Finding actors and actions in movies. In Proc. IEEE International Conference on Computer Vision, 2013.

J. M. Borwein and A. S. Lewis. Convex Analysis and Nonlinear Optimization, volume 3 of CMS Books in Mathematics/Ouvrages de Math ematiques de la SMC. Springer-Verlag, 2000. Theory and examples.

N. Boumal, B. Mishra, P.-A. Absil, and R. Sepulchre. Manopt, a Matlab Toolbox for Optimization on Manifolds. Journal of Machine Learning Research, 2014.

J. Bourgain, V. H. Vu, and P. M. Wood. On the singularity probability of discrete random matrices. Journal of Functional Analysis, 258(2):559 603, 2010.

S. Boyd and L. Vandenberghe. Convex optimization. Cambridge University Press, Cambridge, 2004.

T. De Bie and N. Cristianini. Convex methods for transduction. In Advances in Neural Information Processing Systems (NIPS), pages 73 80, 2003.

F. De la Torre and T. Kanade. Discriminative cluster analysis. In Proceedings of the conference on machine learning (ICML), 2006.

E. Diederichs, A. Juditsky, A. Nemirovski, and V. Spokoiny. Sparse non-Gaussian component analysis by semideﬁnite programming. Machine learning, 91(2):211 238, 2013.

Robust Discriminative Clustering with Sparse Regularizers

C. Ding and T. Li. Adaptive dimension reduction using discriminant analysis and K-means clustering. In Proceedings of the conference on machine learning (ICML), 2007.

D. Freedman. Statistical models: theory and practice. Cambridge University Press, 2009.

J. H. Friedman and W. Stuetzle. Projection pursuit regression. Journal of the American statistical Association, 76(376):817 823, 1981.

A. Frieze and M. Jerrum. Improved approximation algorithms for MAX k-CUT and MAX BISECTION. In Integer Programming and Combinatorial Optimization. Springer, 1995.

M. R. Garey, D. S. Johnson, and L. Stockmeyer. Some simpliﬁed NP-complete graph problems. Theoret. Comput. Sci., 1(3):237 267, 1976.

M. X. Goemans and D. P. Williamson. Improved Approximation Algorithms for Maximum Cut and Satisﬁability Problems Using Semideﬁnite Programming. J. ACM, 42(6):1115 1145, November 1995.

J. C. Gower and G. J. S. Ross. Minimum spanning trees and single Linkage cluster analysis. Journal of the Royal Statistical Society. Series C (Applied Statistics), 18(1), 1969.

M. Grant and S. Boyd. Graph implementations for nonsmooth convex programs. In V. Blondel, S. Boyd, and H. Kimura, editors, Recent Advances in Learning and Control, Lecture Notes in Control and Information Sciences, pages 95 110. Springer-Verlag Limited, 2008.

M. Grant and S. Boyd. CVX: Matlab Software for Disciplined Convex Programming, version 2.1, March 2014.

Edouard Grave. A convex relaxation for weakly supervised relation extraction. In EMNLP, 2014.

G. Huang, J. Zhang, S. Song, and Z. Chen. Maximin separation probability clustering. In Proceedings of the AAAI Conference on Artiﬁcial Intelligence, 2015.

A. Hyv arinen, J. Karhunen, and E. Oja. Independent component analysis, volume 46. John Wiley & Sons, 2004.

A. Joulin and F. Bach. A convex relaxation for weakly supervised classiﬁers. In Proceedings of the conference on machine learning (ICML), 2012.

A. Joulin, F. Bach, and J. Ponce. Discriminative clustering for image co-segmentation. In Proc. CVPR, 2010a.

A. Joulin, J. Ponce, and F. Bach. Eﬃcient optimization for discriminative latent class models. In Advances in Advances in Neural Information Processing Systems (NIPS), 2010b.

M. Journ ee, F. Bach, P-A Absil, and R. Sepulchre. Low-rank optimization on the cone of positive semideﬁnite matrices. SIAM Journal on Optimization, 2010.

Flammarion, Palaniappan and Bach

A. T. Kalai, A. Moitra, and G. Valiant. Eﬃciently learning mixtures of two Gaussians. In International Symposium on Theory of Computing (STOC), pages 553 562, 2010.

R. M. Karp. Reducibility among combinatorial problems. In Complexity of computer computations, pages 85 103. Plenum, New York, 1972.

A. Krizhevsky, I. Sutskever, and G. E. Hinton. Image Net Classiﬁcation with Deep Convolutional Neural Networks. In Advances in Advances in Neural Information Processing Systems (NIPS), 2012.

R emi Lajugie, Piotr Bojanowski, Philippe Cuvillier, Sylvain Arlot, and Francis Bach. A weakly-supervised discriminative model for audio-to-score alignment. In Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing, 2016.

N. Le Roux and F. Bach. Local component analysis. In Proceedings of the International Conference on Learning Representations, 2013.

Y.-F. Li, I. W. Tsang, J. T.-Y. Kwok, and Z-H. Zhou. Tighter and convex maximum margin clustering. In AISTATS, pages 344 351, 2009.

Z. Q. Luo, W. K. Ma, A. C. So, Y. Ye, and S. Zhang. Semideﬁnite relaxation of quadratic optimization problems. Signal Processing Magazine, IEEE, 2010.

J. B. Mac Queen. Some Methods for Classiﬁcation and Analysis of Multi Variate Observations. In Proc. of the ﬁfth Berkeley Symposium on Mathematical Statistics and Probability, volume 1, pages 281 297. University of California Press, 1967.

A. Moitra and G. Valiant. Settling the polynomial learnability of mixtures of Gaussians. In Annual Symposium on Foundations of Computer Science (FOCS), pages 93 102, 2010.

Y. Nesterov. Smoothing technique and its applications in semideﬁnite optimization. Math. Program., 2007.

A. Y. Ng, M. I. Jordan, and Y. Weiss. On Spectral Clustering: Analysis and an algorithm. In T.G. Dietterich, S. Becker, and Z. Ghahramani, editors, Advances in Advances in Neural Information Processing Systems (NIPS). 2002.

G. Niu, B. Dai, L. Shang, and M. Sugiyama. Maximum volume clustering: a new discriminative clustering approach. Journal of Machine Learning Research, 14(1):2641 2687, 2013.

R. Ostrovsky, Y. Rabani, L. J. Schulman, and C. Swamy. The eﬀectiveness of lloyd-type methods for the k-means problem. In Annual Symposium on Foundations of Computer Science (FOCS), pages 165 176, 2006.

P. H Sch onemann. A generalized solution of the orthogonal Procrustes problem. Psychometrika, 31(1), 1966.

R. Tibshirani. Regression shrinkage and selection via the Lasso. J. Roy. Statist. Soc. Ser. B, 1996.

Robust Discriminative Clustering with Sparse Regularizers

J. Tropp. User-friendly tail bounds for sums of random matrices. Found. Comput. Math., 2012.

J. von Neumann. Thermodynamik quantummechanischer Gesamheiten. G ott. Nach, (1): 273 291, 1927.

F. Wang, B. Zhao, and C. Zhang. Linear time maximum margin clustering. IEEE Transactions on Neural Networks, 2010.

H. Wang, F. Nie, and H. Huang. Multi-View Clustering and Feature Learning via Structured Sparsity. In Proceedings of the conference on machine learning (ICML), volume 28, 2013.

Z. Wen, D. Goldfarb, and K. Scheinberg. Block coordinate descent methods for semidefinite programming. In Handbook on Semideﬁnite, Conic and Polynomial Optimization. Springer, 2012.

L. Xu, J. Neufeld, B. Larson, and D. Schuurmans. Maximum margin clustering. In Advances in Advances in Neural Information Processing Systems (NIPS), 2004.

J. Ye, Z. Zhao, and M. Wu. Discriminative k-means for clustering. In Advances in Advances in Neural Information Processing Systems (NIPS), 2008.

K. Zhang, I. W. Tsang, and J. T. Kwok. Maximum margin clustering made practical. IEEE Transactions on Neural Networks, 2009.

Flammarion, Palaniappan and Bach

Appendix A. Joint clustering and dimension reduction

Given y, we need to optimize the Rayleigh quotient w X yy Xw

w X Xw with a rank-one matrix in the numerator, which leads to w = (X X) 1X y. Given w, we will show that the averaged distortion measure of K-means once the means have been optimized is exactly equal to (y Πn Xw)2/ Πny 2 2. Given the data matrix X Rn d, K-means to cluster the data into two components will tend to approximate the data points in X by the centroids c+ Rd and c Rd such that

2 c + (y 1n)

2 c (since y { 1, 1}n)

= y 2(c + c ) + 1

21n(c + + c ).

The objective of K-means can now be written as problem KM:

2(c + c ) 1

21n(c + + c )

= min y,c+,c

2 c + (1n y)

= min y,c+,c X 2 F + c + 2 F

2 + 2c c+ (y + 1n)

2 tr X (y + 1n)

2 c + + (1n y)

= min y,c+,c X 2 F + c + 2 F 1 2(n + 1 n y) + c 2 F 1 2(n 1 n y) 2c +X y + 1n

Fixing y and minimizing with respect to c+ and c , we get closed-form expressions for c+ and c as

c+ = X (y + 1n)

(n + 1 n y) and c = X (1n y)

(n 1 n y) .

Robust Discriminative Clustering with Sparse Regularizers

Substituting these expressions in KM, we have the following optimization problem in y:

min y X 2 F 1

2 X (y + 1n) 2 F (n + 1 n y) 1

2 X (1n y) 2 F (n 1 n y)

= min y X 2 F 1

2 tr XX (y + 1n)(y + 1n)

(n + 1 n y) 1

2 tr XX (1n y)(1n y)

= min y X 2 F 2 (n + 1 n y) tr XX y + 1n

2 (n 1 n y) tr XX 1n y

= min y tr XX 2 (n + 1 n y) tr XX y + 1n

2 (n 1 n y) tr XX 1n y

= min y tr XX I 1 2(n + 1 n y)(yy + 1n1 n + y1 n + 1ny )

1 2(n 1 n y)(1n1 n + yy 1ny y1 n ) .

By the centering of X, we have 1 n X = 0 and hence tr XX 1n1 n = tr XX 1ny = tr XX y1 n = 0. Therefore, we obtain

min y tr XX I 1 2(n + 1 n y)(yy ) 1 2(n 1 n y)(yy )

= min y tr XX I (yy ) 1 2(n + 1 n y) + 1 2(n 1 n y)

= min y tr XX I (yy ) n n2 (1 n y)2)

= min y tr XX I nyy

n2 (1 n y)2

Thus we have the equivalent K-means problem as

min y { 1,1}n 1 n tr Xww X I n n2 (y 1)2 yy = 1 max y { 1,1}n (w X y)2

n2 (y 1)2 .

Thus the averaged distortion measure of K-means with the optimized means is (y Πn Xw)2

Appendix B. Full (unsuccessful) relaxation

It is tempting to ﬁnd a direct relaxation of Eq. (2). It turns out to lead to a trivial relaxation, which we outline in this section. When optimizing Eq. (2) with respect to w, we obtain max y { 1,1}n y X(X X) 1X y

y Πny , leading to a quasi-convex relaxation as max Y 0, diag(Y )=1

tr Y X(X X) 1X

Unfortunately, this relaxation always leads to trivial solutions as described below.

Flammarion, Palaniappan and Bach

Consider the quasi-convex relaxation

max Y 0,diag(Y )=1 tr Y X(X X) 1X

tr Πn Y . (27)

By deﬁnition of Πn this relaxation is equal to:

max Y 0,diag(Y )=1 1 n tr Y X(X X) 1X

Let A = {Y 0, diag(Y ) = 1} the feasible set of this problem and deﬁne B = {M 0, diag(M) = 1+ 1 n M1n

n2 }. Let Y A, then M deﬁned by M = Y

n2 belongs to B since

1 + 1 n M1n

n2 = 1 + 1 n Y 1n n2 1 n Y 1n = 1

n2 = diag(M). Reciprocally for M B, we can deﬁne

n2 , such that diag(Y ) = 1 and Y A and then verify that M = Y

the problem Eq. (27) is equivalent to the relaxation

M 0,diag(M)=1+ 1 n M1n

1 n tr MX(X X) 1X . (28)

The Lagrangian function of this problem can be written as:

L(µ) = tr MX(X X) 1X µ

[diag(M) 1n 1 n M1n

= tr M[X(X X) 1X Diag(µ) + 1 n µ n2 1n1 n ] + 1

Using L(µ) and the PSD constraint M 0, the dual problem is given by

n s.t. Diag(µ) 1 n µ n2 1n1 n X(X X) 1X .

Since X(X X) 1X 0, this implies for the dual variable µ:

Diag(µ) 1 n µ n2 1n1 n 0 1 n Diag(µ) 11n n2

1 µi n2 Pn i=1 µi

1 µi 1 1 n Pn i=1 µi .

However for ν Rn , the harmonic mean 1

n Pn i=1 1 νi 1 is always smaller than the arithmetic mean 1

n Pn i=1 νi with equality if and only if ν = c1n for c R. Thus the dual variable µ is constant and the diagonal constraint in problem Eq. (28) simpliﬁes itself as a trace constraint. Therefore the problem is equivalent to the trivial relaxation below for which each eigenvector of X(X X) 1X is a solution:

M 0, tr(M)=n+ 1 n M1n

n tr MX(X X) 1X .

Robust Discriminative Clustering with Sparse Regularizers

Appendix C. Equivalent relaxations

In this section, we give details about two equivalent relaxations.

C.1 First equivalent relaxation

We start from the penalized version of Eq. (5),

min y { 1,1}n, v Rd 1 n Πny Xv 2 2 + ν (y 1n)2

which we expand as:

min y { 1,1}n, v Rd 1 n tr Πnyy 2

n tr Xvy + 1

n tr X Xvv + ν (y 1n)2

and relax as, using Y = yy , P = yv and V = vv ,

min V,P,Y 1 n tr Πn Y 2

n tr P X + 1

n tr X XV +ν 1 n Y 1n

n2 s.t. Y P P V

0, diag(Y ) = 1. (31)

When optimizing Eq. (31) with respect to V and P, we get exactly Eq. (8). Indeed we solve this problem by ﬁxing the matrix Y such that Y = Y0 and diag(Y0) = 1n. Then the Lagrangian function of the problem in Eq. (31) can be written as

L(A) = 1 n tr Πn Y 2

n tr P X + 1

n tr X XV + ν 1 n Y 1n

n2 + tr A(Y Y0)

n2 1n1 n + A 1

n X 1 n X X

Using L(A) and the psd constraint Y P P V

0, we write the dual problem as

min A tr AY0 s.t. 1

n2 1n1 n + A 1

n X 1 n X X

From the Schur s complement condition of 1

n2 1n1 n + A 1

n X 1 n X X

0, we obtain

n2 1n1 n + A 1

n X(X X) 1X . Substituting the bound for A we get the optimal objective function value

n tr X(X X) 1X Y0 1

n tr Πn Y0 ν

n2 1 n Y01n.

Note that the optimal dual objective value D corresponds to a ﬁxed Y0. Hence by maximizing with respect to Y we obtain exactly Eq. (8) and therefore, the convex relaxation in Eq. (11) is equivalent to Eq. (8). Moreover the Karush-Kuhn-Tucker (KKT) conditions gives P X + V X X = 0 and Y X + PX X = 0

Thus the optimum is attained for P = Y X(X X) 1 and V = (X X) 1X Y X(X X) 1.

Flammarion, Palaniappan and Bach

C.2 Second equivalent relaxation

For ν = 1, we solve the problem in Eq. (31) by ﬁxing the matrix V = V0. Then the Lagrangian function of this problem can be written as

ˆL(µ, B) = 1 n tr Πn Y 2

n tr P X + 1

n tr X XV + ν 1 n Y 1n

n2 + µ (diag(Y ) 1n) + tr B(V V0)

n In + diag(µ) 1

n X 1 n X X + B

µ 1n tr BV0.

Using ˆL(µ, B) and the psd constraint Y P P V

0, the dual problem is given by

min µ,B µ 1n + tr BV0 s.t. 1

n In + diag(µ) 1

n X 1 n X X + B

From the Schur s complement condition of 1

n In + diag(µ) 1

n X 1 n X X + B

0, we obtain

B 1 n2 X diag(µ + 1n/n) 1X 1

n X X. Substituting the bound for B we get the dual problem as

min µ µ 1n + 1

n2 tr V0X diag(µ + 1n/n) 1X 1

µi + 1 n2µi + nx i V0xi

n tr V0X X.

Solving for µi, we get

Substituting µ i into the dual objective function, we get the optimal objective function value

(XV X )ii 1 1

n tr V0X X.

Furthermore the KKT conditions give

Y diag(ν + 1n/n) 1

n PX = 0 and P diag(ν + 1n/n) 1

Thus we obtain the following closed form expressions:

P = Diag(diag(XV X )) 1/2XV

Y = Diag(diag(XV X )) 1/2XV X Diag(diag(XV X )) 1/2.

The optimal dual objective value ˆD corresponds to a ﬁxed V0. Therefore, maximizing with respect to V leads to the problem:

min V 0 1 2

(XV X )ii + 1

n tr(V X X). (32)

Robust Discriminative Clustering with Sparse Regularizers

Appendix D. Auxiliary results for Section 5.1

Here we provide auxiliary results for Section 5.1.

D.1 Auxiliary lemma

The matrix X(X X) 1X has the following properties (see e.g., (Freedman, 2009)).

Lemma 10 The matrix H = X(X X) 1X is the orthogonal projection onto the column space of the design matrix X since:

H is symmetric.

H is idempotent (H2) = H.

X is invariant under H, that is HX = X.

D.2 Rank-one solution of the relaxation Eq. (8)

We denote by (xi)i=1...n the lines (or rows) of X.

Lemma 11 The rank-one solution Y = yy is always a solution of the relaxation Eq. (8).

Proof We give an elementary proof of this result without using convex optimization tools. Using Proposition 1 and Lemma 10 we have Hy = y, thus

tr HY = tr Hyy = tr yy = n.

Moreover all M 0 can always be decomposed as Pn i=1 λiuiu i with λi 0 and (ui)i=1,...,n an orthonormal family. Since H is an orthogonal projection (ui) Hui = (Hui) Hui = Hui 2 ui 2 1. Thus tr HM = Pn i=1 λi tr Hui(ui) = Pn i=1 λi(ui) Hui Pn i=1 λi = tr M. Then for all matrix M feasible we have tr HM n since diag(M) = 1n and tr HY = n which conclude the lemma.

D.3 Rank-one solution of the relaxation Eq. (12)

Lemma 12 The rank-one solution V = vv is always a solution of the relaxation Eq. (12).

Proof The Karush-Kuhn-Tucker (KKT) optimality conditions for the problem are for the dual variable A 0:

n XX = A and AV = 0 (Complementary Slackness).

Since x i w = yi, q

x i V xi = |yi| = 1, V and the dual variable A = 0 satisfy the KKT conditions and then V is solution of this problem.

Flammarion, Palaniappan and Bach

D.4 Proof of Proposition 2

In the following lemma, we use a Taylor expansion to lower-bound f around its minimum.

Lemma 13 For d 3 and δ [0, 1). If β 3 and m2 β 3 2(d+β 4), then with probability at least 1 d exp δ2nm2

2R4d2 , for any symmetric matrix :

f(V ) f(V + ) > 2(1 δ)m2 2 F + o( 2) 0.

Otherwise with probability at least 1 d exp δ2nµ1

4R4d2 , for any symmetric matrix :

f(V ) f(V + ) > (1 δ)µ1 2 F + o( 2) 0,

with µ1 m2(β 1) 1+(d+β 2)m2 . Moreover we also have with probability at least 1 d exp δ2nµ2

for any symmetric matrix min:

f(V ) f(V + ) > (1 δ)µ2 2 F + o( 2) 0,

where µ2 = min{2m2, m2(β 1), 2m} and min = 1 0 0 cmin Id 1

is deﬁned in the proof

and satisﬁes |cmin| m |(d + β 2)m2 1|.

This lemma directly implies Proposition 2. Proof For S(d) and δ R we compute for f(V ) = 1

dδ2 f(V + δ ) = 1

(x i xi)2 q

x i (V + δ )xi

Thus the second directional derivative in V = V along is

2 f(V ) = lim δ 0 d2

dδ2 f(V + δ ) = 1

i=1 (x i xi)2.

Let Tx be the semideﬁnite positive quadratic form of S(d) deﬁned for S(d), by

Tx : 7 (x x)2. (33)

Then there exists a positive linear operator Tx from S(d) to S(d) such that Tx( ) = , Tx . Therefore the function f will be strictly concave if for all directions S(d)

i=1 Txi( ) > 0. (34)

Robust Discriminative Clustering with Sparse Regularizers

We will bound the empirical expectation in Eq. (34) by ﬁrst showing that its expectation remains away from 0. Then we will use a concentration inequality for matrices to control the distance between the sum in Eq. (34) and its expectation. We ﬁrst derive conditions so that the result is true in expectation, i.e., for the operator T deﬁned by T = ETx for x following the same law as (y, z ) . We denote by m = Ez2

and by β = Ez4/m2 its kurtosis.

We let = a b

and then have x x = a + 2yb z + z Cz. Thus

Tx( ) = a2 + 4ayb z + 2az Cz + 4b (zz )b + (z Cz)2 + 4yb z(z Cz).

Therefore we can express the value of the operator T only in function of the elements of :

T ( ) = (a + m tr C)2 + 4m b 2 2 + 2m2 C Diag(diag(C)) 2 F + m2(β 1) diag(C) 2,

where we have used

E(z Cz)2 = E X

i,j,k,l zizjzkzlci,jck,l

i (zi)4c2 i,i + E X

i,k =i z2 i z2 kci,ick,k + 2E X

i,j =i z2 i z2 j c2 i,j

i c2 i,i + m2 X

i,k =i ci,ick,k + 2m2 X

i,j =i c2 i,j

= m2(β 3) X

i c2 i,i + m2 X

i,k ci,ick,k + 2m2 X

= m2(β 3) diag(C) 2 + m2 2 C 2 F + tr(C)2

= m2(β 3) diag(C) 2 + m2 2 C Diag(diag(C)) 2 F + tr(C)2 .

Since β 1, we get

T ( ) (a + m tr C)2 + 4m b 2 2 + 2m2( C 2 F diag(C) 2).

Thus T ( ) = 0 if and only if β = 1 with b = 0d 1 and C = diag(c) with c 1d = a

m2 . With the condition β = 1 meaning that var(z2) = 0 and thus z2 is constant a.s., i.e., z follows a Rademacher law. However we would like to bound T ( ) away from zero by some constant and for that we are looking for the smallest eigenvalue of the operator ETx. Unfortunately we are not able to solve the optimization problem

min S(d), 2 F =1 T ( ),

and we have to compute all the spectrum of this operator to be able to ﬁnd the smallest using ETx = 1/2 T ( ) . We have

1/2 T ( ) =

a + m tr(C) 2mb

2mb (a + m tr(C))m2Id 1 + 2m2C +m2(β 3) Diag(diag(C))

Flammarion, Palaniappan and Bach

For all b Rd 1 we have for = 0 b

, 1/2 T ( ) = 2m . Thus 2m is an

eigenvalue of multiplicity d 1.

For all C R(d 1) (d 1) with diag(C) = 0d 1 we have for = 0 0 0 C

, 1/2 T ( ) =

2m2 . Thus 2m2 is an eigenvalue of multiplicity (d 1)(d 2)

For all c Rd 1 with c 1d 1 = 0 we have for = 0 0 0 diag(C)

, 1/2 T ( ) =

m2(β 1) . Thus m2(β 1) is an eigenvalue of multiplicity d 2.

For all a, c R2 we have for = a 0 0 c Id 1

1/2 T ( ) = a + m(d 1)c 0 0 [ma + m2(d + β 2)c]Id 1

= Diag h 1 m1 d 1 m1d 1 (d + β 2)m2Id 1

Thus an eigenvalue of 1 (d 1)m m (d + β 2)m2

with an eigenvector [a, c] would be an

eigenvalue of the operator ETx with a corresponding eigenvector a 0 0 c Id 1

matrix has two simple eigenvalues

µ = 1 + (d + β 2)m2 p

(1 + (d + β 2)m2)2 4m2(β 1)

Moreover when we add all the multiplicity of the found eigenvalues we get d 1+ (d 1)(d 2)

2 + d 2+2 = d(d+1)

2 which is the dimension of S(d), therefore we have found all the eigenvalues of the linear operator ETx. We will prove now than the smallest eigenvalue is µ when the dimension d is large enough with regards to m2 and 2m2 otherwise.

Lemma 14 Let µ1 and µ2 be the two smallest eigenvalues of the operator ETx. Let us assume that d 3 (the case d = 2 will also be done in the proof). If β 3 and m2 β 3 2(d+β 4) then

µ1 = µ m2(β 1) 1 + (d + β 2)m2 and µ2 = min{2m2, m2(β 1), 2m}.

Moreover we denote by min = 1 0 0 cmin Id 1

the eigenvector associated to µ for

which we have set without loss of generality the ﬁrst component a = 1. Then

|cmin| m |(d + β 2)m2 1|.

Robust Discriminative Clustering with Sparse Regularizers

Unfortunately µ can become small when the dimension increases as explained by the tight bound µ m2(β 1) 1+(d+β 2)m2 . However the corresponding eigenvector has a particular structure we will be able to exploit. Proof First we note that µ m2(β 1) and compute

µ 2m2 1 + (d + β 2)m2 p

(1 + (d + β 2)m2)2 4m2(β 1) 4m2 0

1 + (d + β 2)m2 4m2 p

(1 + (d + β 2)m2)2 4m2(β 1)

(1 + (d + β 2)m2 4m2)2 (1 + (d + β 2)m2)2 4m2(β 1)

and 1 + (d + β 6)m2 0

16m4 8m2(1 + (d + β 2)m2) 4m2(β 1)

and 1 + (d + β 6)m2 0

2(d + β 4)m2 β 3 | {z } R1

and 1 + (d + β 6)m2 0 | {z } R2

If β 3 we have necessary that β 2 and the ﬁrst term R1 implies m2 3 β 2(2 β) and the second term R2 implies m2 1/(4 β). Thus we should have (4 β)(3 β) 2(2 β) which is not possible since the polynomial β2 5β+8 0.

If β 3, the ﬁrst term R1 implies m2 β 3 2(β 2) 1 and the second term R2

implies m2 1/(4 β) β 3 2(β 2) 1 for β 4 and is always satisﬁed otherwise.

If d 3, the ﬁrst term R1 implies that β 3 for which the second term R2 is always satisﬁed. It also implies that m2 β 3 2(d+β 4) 1.

We denote by min = 1 0 0 cmin Id 1

the eigenvector for which we have set without loss

of generality a = 1 and

cmin = 1 2(d 1)m

((d + β 2)m2 1)2 + 4(d 1)m2 (d + β 2)m2 + 1 i .

Consequently cmin 0 and by convexity of the square root we have

((d + β 2)m2 1)2 + 4(d 1)m2 ((d + β 2)m2 1) + 2(d 1)m2

|(d + β 2)m2 1|.

Therefore |cmin| m |(d + β 2)m2 1|.

We will control now the behavior of the empirical expectation by its expectation thanks to concentration theory. By deﬁnition Tx is a symmetric positive linear operator as its projection T x onto the orthogonal space of min. We can thus apply the Matrix Chernoﬀinequality from Tropp (2012, Theorem 5.1.1) to these two operators using Tx op

Flammarion, Palaniappan and Bach

xx 2 tr(xx )2 x 4 2 R4d2. Then:

k=1 Txk nδµ1

inµ1/(2R4d2) de (1 δ)2nµ1/(4R4d2),

k=1 T xk nδµ2

inµ2/(2R4d2) de (1 δ)2nµ2/(4R4d2),

For m = 1 and d 3 we have µ1 = µ β 1

β+d min{β 1

2d } min{1/3, β 1

D.5 Noise robustness for the one dimensional balanced problem

We want a condition on ε such that the solution of the relaxation Eq. (8) recovers the right y. We recall the dual problem of the relaxation Eq. (8)

min µ 1n s.t. Diag(µ) X(X X) 1X .

The KKT conditions are:

Dual feasibility: Diag(µ) X(X X) 1X .

Primal feasibility: Diag(Y ) = 1n and Y 0.

Complimentary slackness : Y [Diag(µ) X(X X) 1X ] = 0

For Y = yy a rank one matrix, the last condition implies Diag(µ)y = Hy and

µi = (X(X X) 1X y)i

For X = y +ε, we denote by y = y +ε, then X(X X) 1X = y y

y 2 and X(X X) 1X y =

y y y 2 y. Thus

y 2 yi yi .

Assume that all yiyi have the same sign, without loss of generality we assume yiyi > 0. By deﬁnition of µ, µ 0. To show the dual feasibility we have to show that Diag(µ) H which is equivalent to Diag( yi

y y, to In Diag( q

y y Diag( q

yi ) 0 and to P yi yi y y

which is obviously true. Reciprocally if µ is dual feasible then Diag(µ) 0 and all the yiyi have the same sign. Therefore we have shown that y is solution of the relaxation Eq. (8) if and only if all the yiyi have the same sign. If ε and y are independent this is equivalent to ε 1 a.s.

Robust Discriminative Clustering with Sparse Regularizers

D.6 The rank-one candidates are not solutions of the relaxation Eq. (12)

We assume now that 1 n y = 0 thus y = Πny, which means we do not have the same proportion in the two clusters. Let us assume that Πny takes two values {πy , πy+} that is by deﬁnition of Πn: πy+ = 1 1 n y

n and πy = 1 1 n y

n . For V deﬁned as before, we get x i V xi = (πyi)2 and with I the set of indices such that Πnyi = πy respectively, the KKT conditions for V = V can be written as

πy+ 1 xix i + X

1 πy 1 xix i i = An 0 and An V = 0.

We check that with n = #{I }:

w Anw = 0 = X

πy+ 1 (πy+)2 + X

1 πy 1 (πy )2

πy+ 1 (πy+)2 + n 1 πy 1 (πy )2

= n+πy+ n πy n+(πy+)2 + n (πy )2

= y Πny (Πny) Πny = y Πny y Πny = 0.

And An = 1 2n P i I+ α+xix i +P i I α xix i with α+ = 1 πy+ 1 and α = 1 πy 1 . Unfortunately α+α 0, and An is not necessary negative. Even worse we will show that EA is not semi-deﬁnite negative which will conclude the proof since by the law of large number lim n 1 n An = EA. Assume that the proportions of the two clusters stay constant with n = ρ n, then

(πy+)2 0 0 I

(πy )2 0 0 I

And ρ+α+(πy+)2 + ρ α (πy )2 = 0 since w Anw = 0. Then

ρ+α+ + ρ α = ρ+πy ρ πy+ πy+πy

= (ρ+ + ρ ) 1 n y

n (ρ+ ρ ) + (1 (1 n y)2)

n (ρ+ ρ ) + (1 n y

n )2) = 2(1 n y

Thus A = 2(1 n y)2

(n2 (1 n y)2)

is not semi-deﬁnite negative and V is not solution of the

relaxation Eq. (12).

Appendix E. Auxiliary results for sparse extension

In this section, we provide some results for sparse extension.

Flammarion, Palaniappan and Bach

E.1 There is a rank-one solution of the relaxation Eq. (18)

Lemma 15 The rank-one solution V = v v is solution of the relaxation Eq. (18) if the design matrix X is such that 1

n X X has all its diagonal entries less than one.

Proof The KKT conditions for problem Eq. (18) are

x i V xi λU 1

n X X = A 0 and AV = 0,

with U such that Uij = sign(Vij) if Vij = 0 and Uij [ 1, 1] otherwise. For V = v v this gives

A = (1 + λ)

n X X = λ h X X

n U i with U1,1 = 1 and Ui,j [ 1, 1] otherwise.

We check that AV = 0. If the design matrix X is such that 1

n X X has all its diagonal entries less than one, we can choose a sub-gradient U such that the dual variable A = 0 and thus V is solution. Otherwise by property of semi-deﬁnite matrices, there is a diagonal entry of 1

n X X which is bigger than 1 which prevents A to be semi-deﬁnite negative since the corresponding diagonal entry of X X

n U will be positive. This shows that V does not solve the problem.

E.2 Proof of proposition 6

Lemma 16 For δ [0, 1), with probability 1 5d2 exp δ2n(β 1) 2d R4(1/m2+β+d) , for any direction such that V + 0, we have:

g(V ) g(V + ) > (1 δ) h λ Diag( ) 1+ β 1 β + d + 1/m2 (1 + λ)3

4 Diag( ) 2 2 i +o( 2) 0.

Moreover we also have with probability at least 1 5d2 exp δ2nm2(β 1)

2d R4 , for any symmetric matrix such that V + 0 and Diag( ) (emin) :

g(V ) g(V + ) > (1 δ) h λ Diag( ) 1+m2(β 1)(1 + λ)3

4 Diag( ) 2 2 i +o( 2) 0,

where emin = [1, cmin1d 1] is deﬁned in the proof and cmin satisﬁes

|cmin| m |(d + β 2)m2 1|.

E.2.1 Proof outline

We will investigate under which conditions on X the solution is unique, ﬁrst for a deterministic design matrix. We make the following deterministic assumptions on X for δ, ζ 0 and S Rd:

Robust Discriminative Clustering with Sparse Regularizers

n 1 (A3) Z Z

n Diag(diag( 1

n δ (A4) λS min X 2(X 2)

where we denoted by the Hadamard (i.e., pointwise) product between matrices and λS min the minimum eigenvalue of a linear operator restricted to a subspace S. Then with

x i V xi λ V 1 1

n tr X XV , we can certify that g will decrease around the solution V .

Lemma 17 Let us assume that the noise matrix veriﬁes assumptions (A1,A2,A3,A4), then for any direction such that V + 0 and diag( ) S we have:

g(V ) g(V + ) λ(1 δ) Diag(diag( )) 1 + ζ (1 + λ)3

4 Diag( ) 2 2 + o( 2) > 0.

Let us assume now that (zi)i=1,.,d are i.i.d of law z symmetric with Ez = Ez3 = 0, Ez2 = m = 1, Ez4/(Ez2)2 = β and such that z is a.s. bounded by 0 R 1. Then the matrix X satisﬁes a.s. assumption (A1). Using multiple Hoeﬀding s inequalities we will prove lemma 17.

Lemma 18 If z does not follow a Rademacher law, the design matrix X satisﬁes assumptions (A1,A2,A3,A4) with probability greater than 1 8d2 exp δ2n(β 1) 2d(β+d)R4 for S = Rd,

and with probability greater than 1 8d2 exp δ2n min{β 1,2}

2d R4 for S = [1, cmin1d 1] where cmin is deﬁned in the proof and satisﬁes

|emin| 1 d + β 3.

This lemma concludes the proof of proposition 6. We will now prove these two lemmas.

E.2.2 Proof of lemma 17

Proof Since the dual variable A for the PSD constraint is 0 (see the proof of lemma 15), this constraint V 0 is not active and we will show that the function decreases in a set of directions which include the one for which V + 0.

Therefore we consider a direction = a b

, with C 0, which is slightly more

general than V + 0. We denote by f(W) = 2 n Pn i=1 q

n tr X XW the smooth part of g. By Taylor-Young, we have for all W:

f(W) f(W + ) = f (W), 1

2 , f (W) + o( 2).

g(W) g(W + ) = f (W), 1

2 , f (W) + λ( W + 1 W 1) + o( 2).

Flammarion, Palaniappan and Bach

Letting W = V this gives with X X = n y Z Z y Z Z

g(W) g(W + ) = λ X X

2 , f (V ) + λ(a + 2 b 1 + C 1) + o( 2)

= λ 2( b 1 1

nb Z y) + C 1 1

n tr(Z ZC) 1

2 , f (V ) + o( 2).

And with H older s inequality and assumption (A2)

nb Z y b 1(1 1

n Z y ) (1 δ) b 1.

Nevertheless we will show in lemma 19 that C 1 1

n tr(Z ZC) (1 δ) C diag(C) 1, thus g(W) g(W + ) λ(1 δ)(2 b 1 + C diag(C) 1) + o( 2). (36)

However in Eq. (36), g(W) g(W + ) = 0 for b = 0 and C diagonal, therefore we have to investigate second order conditions, i.e., to show for = diag(e) with e Rd that , f (V ) > 0. And with assumption (A4)

4 (1 + λ)3 diag(e), f (V ) diag(e) = 1 n

i=1 (x i diag(e)xi)2

j=1 ej(xj i)2)2

i=1 e [x 2 i (x 2 i ) ]e

λmin X 2(X 2)

n e 2 ζ e 2 2.

Thus we can conclude:

g(W) g(W + ) λ(1 δ)(2 b 1 + C diag(C) 1) + ζ (1 + λ)3

4 e 2 2 + o( 2).

E.2.3 Auxiliary lemma

Lemma 19 For all matrix C symmetric semi-deﬁnite positive we have under assumptions (A1) and (A3):

C (1 δ) C diag(C) 1 > 0.

Robust Discriminative Clustering with Sparse Regularizers

Proof We denote by Σn = Z Z

n . We always have C 1 tr(Σn C) = tr(S Σn)C where Si,j = sign(Ci,j), thus if diag(C) > 0 then diag(S) = 1 and diag(S Σn) 0 from assumption (A1). Moreover since Σni,j [ 1, 1] then sign(S Σn) = sign(S). Thus tr(S Σn)C = P i Ci,i(S Σn)i,i+P i =j Ci,j(S Σn)i,j P i =j Ci,j(S Σn)i,j 0. Furthermore from assumption (A3), |(Σn)i,j| δ for i = j. Therefore

tr(S Σn)C X

i =j Ci,j(S Σn)i,j X

i =j |Ci,j|(1 δ) (1 δ) C diag(C) 1 > 0.

If there is a diagonal element of C which is 0, then the corresponding row and column in C will also be 0 and we can look at the same problem as before by erasing oﬀfrom C and Σn

the corresponding column and row.

E.2.4 Proof of lemma 18

Proof We will ﬁrst show that the noise matrix Z satisﬁes assumptions (A2,A3). By Hoeﬀding s inequality we have with probability 1 2 exp( δ2n/(2R2))

i=1 zj i | δ.

Then, since the law of z is symmetric yizi will have the same law as zi and with probability 1 2 exp( δ2n/(2R2)), the design matrix Z satisﬁes assumption (A2):

Likewise we have with probability 1 2 exp( δ2n/(2R4)) that for j = j

i=1 zj i zj i | δ.

Thus we also have with probability 1 2d2 exp( δ2n/(2R4)) that Z satisﬁes assumption (A3):

n Z Z diag( 1

Thus with probability 1 4d2 exp( δ2n/(2R4)), the noise matrix Z satisﬁes assumptions (A1, A2, A3). We proceed as in the proof of proposition 2 to show that X satisﬁes assumption (A4). We ﬁrst derive a condition to have the result in expectation, then we use an inequality concentration on matrix to bound the empirical expectation. This will be very similar, but we will get a better scaling since is diagonal. Using the same arguments as in the proof of proposition 2 we have for the diagonal matrix = diag(e) with e = (a, c) Rd:

e E(x 2(x 2) )e = E(x x)2 = (a + mc 1n 1)2 + m2(β 1) c 2 2 > 0 if β > 1.

Flammarion, Palaniappan and Bach

We can show that m2(β 1) is an eigenvalue of multiplicity d 2 and µ are eigenvalues of multiplicity one of the operator 7 E(x x)2 with eigenvectors e . Thus we have

λmin(Ex 2(x 2) ) = 1 + (d + β 2)m2 p

(1 + (d + β 2)m2)2 4m2(β 1)

m2(β 1) 1 + (d + β 2)m2 ,

λ e min(Ex 2(x 2) ) = m2(β 2).

λmax x 2(x 2) = (x 2) x 2 =

j=1 (xi)4 d R4.

Thus we can apply the Matrix Chernoﬀinequality from (Tropp, 2012) for µS = λS min(Ex 2(x 2) ):

λS min X 2(X 2)

de δ2nµS/(2d R4).

Thus with probability 1 5d2 exp( δ2nµ /(2d R4)) the design matrix X satisﬁes assumptions (A1,A2,A3,A4) with ζ = (1 δ)µ and S = Rd. And with probability 1 5d2 exp( δ2n min{β 1, 2}/(2d R4)) the design matrix X satisﬁes assumptions (A1,A2,A3,A4) with ζ = (1 δ) min{β 1, 2} and S = e .

Appendix F. Proof of multi-label results

We ﬁrst prove lemma 7: Proof Let A Rk k symmetric semi-deﬁnite positive such that diag( y A y ) = 1n, then

diag( y A y ) =

i=0 ai,i1n + 2

i=1 a0,iyi + 2 X

1 i<j k ai,jyi yj

i=1 a0,iyi + 2 X

1 i<j k ai,jyi yj = (1

i=0 ai,i)1n

And this system admits as unique solution 0n if and only if the family {1n, (yi)1 i k, (yiyj)1 i<j k} is linearly independent.

Then we prove the lemma 8: Proof Since a0 + Pk i=1 a2 i αi αmin Pk i=0 a2 i = αmin we should have α αmin. We have already seen that such Y satisﬁes the constraint. The KKT conditions are: B =

Robust Discriminative Clustering with Sparse Regularizers

diag(µ) H ν11 0 and BY = 0. Since yi = Πnyi + (y i 1n)

Hyi = HΠnyi + (y i 1n)H1n = Πny

= (yi 1 n yi

i=1 a2 i Hyiy i

i=1 a2 i (yi 1 n yi

i=1 a2 i (yiy i 1 n yi

and tr(HY ) = Pk i=1 a2 i (n nαi) = n(1 a2 0 + a2 0 α) = n(1 α). Furthermore since 1 n diag(Y ) = n and 1 n M1n = n2α, for µ = 1n and ν = 1/n, B.Y = n n(1 α) nα = 0. And since B = In 1

n1n1 n H, B2 = B and B = B, thus B is a symmetric projection and consequently symmetric semi-deﬁnite positive. Hence the primal variable Y and the dual variables µ = 1n and ν = 1/n satisfy the KKT conditions, thus Y is solution of this problem.

Appendix G. Eﬃcient optimization problem

We now give the details of an eﬃcient optimization algorithm.

G.1 Dual computation

We consider the following strongly-convex approximation of Eq. (24), augmented with the von-Neumann entropy:

max V 0 1 n

(XV X )ii Diag(c)V Diag(c) 1 ε tr[(A 1 2 V A 1 2 ) log(A 1 2 V A 1 2 )] s.t. tr(A 1 2 V A 1 2 ) = 1.

Introducing dual variables, we have

min u Rn +,C:|Cij| cicj max V 0 1 2n

ui((XV X )ii) + 1

tr CV ε tr[(A 1 2 V A 1 2 ) log(A 1 2 V A 1 2 )]

s.t. tr(A 1 2 V A 1 2 ) = 1.

By ﬁxing u and C, and letting Q = A 1 2 V A 1 2 , we can write the max problem as

max Q 0 tr A 1

2n X Diag(u)X C)A 1

2 Q ε tr[Q log(Q)]

s.t. tr Q = 1.

Flammarion, Palaniappan and Bach

This problem is of the form

max Q 0 tr DQ ε

i=1 σi(Q) log σi(Q)

s.t. tr Q = 1

where D = A 1

2n X Diag(u)X C)A 1

2 and σi(Q) denotes the i-th largest eigen value of the matrix Q. If we consider the matrix D to be of the form D = U Diag(θ)U with θ denoting the vector of ordered eigen values of D, then it turns out that at optimality Q has the form Q = U Diag(σ)U , with σ denoting the ordered vector of eigen values of Q. Therefore the above optimization problem can be cast in terms of σ as:

max σ Rn θ σ ε

i=1 σi log σi

i=1 σi = 1.

The solution of this problem is σi = eθi/ε Pn j=1 eθj/ε , which leads to

min θ Rn φε(θ) = ε log

In terms of the original matrix variables, we have

min φε(D) = ε log tr e D

Using the appropriate expansion of D, we have the overall optimization problem as

min u Rn +,C:|Cij| cicj

1 ui + φε(A 1

2n X Diag(u)X C)A 1

At optimality, we have

A 1 2 V A 1 2 = e

2n X Diag(u)X C)A 1

2 ) ε / tr e

2n X Diag(u)X C)A 1

The error of approximation is at most ε log d and the Lipschitz constant associated with the function φε( ) is 1

G.2 Algorithm details

We write the optimization problem Eq. (38) as:

min u Rn + F(u, C) + H(u, C)

Robust Discriminative Clustering with Sparse Regularizers

H(u, C) = φε(A 1

2n X Diag(u)X C)A 1

is the smooth part and

F(u, C) = IC:|Cij| cicj + 1

is the non-smooth part. The gradient u of H(u, C) with respect to u is

u = diag(B U Diag(σ)U B).

where B = 1

2 X and the gradient of H(u, C) with respect to C is

2 U Diag(σ)U A 1

The Lipschitz constant L associated with the gradient H(u, C) is

ε max λmax(B B B B), λ2 max(A 1) , (39)

where λmax(M) denotes the maximum eigen value of matrix M. Computing L takes O(max(n, d)3) time and L needs to be computed once at the beginning of the algorithm. The resultant FISTA procedure is described in Algorithm 1. Note that the FISTA procedure ﬁrst computes intermediate iterates ( uk 1

2 ) (Step 7, Algorithm 1) by taking descent steps along the respective gradient directions. Then two distinct problems in u and C (respectively Steps 8 and 9 in Algorithm 1) are solved. The sub-problem in u (Step 8) can be eﬃciently solved using a Newton procedure followed by a thresholding step, as illustrated in Algorithm 2. The sub-problem in C (Step 9) can also be solved using a simple thresholding step.

Flammarion, Palaniappan and Bach

Algorithm 1 FISTA Algorithm to solve Eq. (38)

1: Input X.

2: Compute Lipschitz constant L.

3: Let (u0, C0) be an arbitrary starting point.

4: Let ( u0, C0) = (u0, C0), t0 = 1.

5: Set the maximum iterations to be K.

6: for k = 1, 2, . . . , K do The loop can also be terminated based on duality gap.

8: Obtain uk = argminu Rn +

2n Pn i=1 1 ui } by Algorithm 2.

9: Obtain Ck = argmin C n IC:|Cij| cicj + L

2 2 F o by thresholding.

10: tk = 1+ q

1+4t2 k 1 2 .

11: ( uk, Ck) = (uk, Ck) + (tk 1 1)

(uk, Ck) (uk 1, Ck 1) .

12: end for

13: Output (u K, CK).

Algorithm 2 Newton method to solve u sub-problem

1: Input uk 1

2: u0 i = max(u k 1

(2n L) 1 3 ), i = 1, 2, . . . , n.

3: Set M to be the max number of Newton steps.

4: for t = 1, 2, . . . , M do

5: for i = 1, 2, . . . , n do

6: ut i = 2n L(ut 1 i )3u k 1

2 i +3ut i 2(n L(ut 1 i )3+1) .

9: Output max(u M, 0).