# Robust Discriminative Clustering with Sparse Regularizers

Journal of Machine Learning Research 18 (2017) 1-50. Submitted 8/16; Revised 5/17; Published 8/17.

Nicolas Flammarion¹ (nicolas.flammarion@ens.fr), Balamurugan Palaniappan² (palaniappan@telecom-paristech.fr), Francis Bach¹ (francis.bach@ens.fr)

¹ INRIA, Département d'Informatique de l'ENS, École Normale Supérieure, CNRS, PSL Research University, Paris, France.
² Signal, Statistique et Apprentissage (S2A) Group, IDS Department, LTCI, Télécom ParisTech, Paris, France.

Editor: Sathiya Keerthi

**Abstract.** Clustering high-dimensional data often requires some form of dimensionality reduction, where clustered variables are separated from noise-looking variables. We cast this problem as finding a low-dimensional projection of the data which is well-clustered. This yields a one-dimensional projection in the simplest situation with two clusters, and extends naturally to a multi-label scenario for more than two clusters. In this paper, (a) we first show that this joint clustering and dimension reduction formulation is equivalent to previously proposed discriminative clustering frameworks, thus leading to convex relaxations of the problem; (b) we propose a novel sparse extension, which is still cast as a convex relaxation and allows estimation in higher dimensions; (c) we propose a natural extension for the multi-label scenario; (d) we provide a new theoretical analysis of the performance of these formulations with a simple probabilistic model, leading to scalings of the form $d = O(\sqrt{n})$ for the affine-invariant case and $d = O(n)$ for the sparse case, where $n$ is the number of examples and $d$ the ambient dimension; and finally, (e) we propose an efficient iterative algorithm with running-time complexity proportional to $O(nd^2)$, improving on earlier algorithms for discriminative clustering with the square loss, which had quadratic complexity in the number of examples.
©2017 Nicolas Flammarion, Balamurugan Palaniappan and Francis Bach. License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at http://jmlr.org/papers/v18/16-429.html.

## 1. Introduction

Clustering is an important and commonly used pre-processing tool in many machine learning applications, with classical algorithms such as K-means (MacQueen, 1967), linkage algorithms (Gower and Ross, 1969) or spectral clustering (Ng et al., 2002). In high dimensions, these unsupervised learning algorithms typically have problems identifying the underlying optimal discrete nature of the data; for example, they are quickly perturbed by adding a few noisy dimensions. Clustering high-dimensional data thus requires some form of dimensionality reduction, where clustered variables are separated from non-informative noise-looking (e.g., Gaussian) variables.

Several frameworks aim at linearly separating noise from signal, that is, finding projections of the data that extract the signal and remove the noise. They differ in the ways signals and noise are defined. A line of work that dates back to projection pursuit (Friedman and Stuetzle, 1981) and independent component analysis (Hyvärinen et al., 2004) defines the noise as Gaussian while the signal is non-Gaussian (Blanchard et al., 2006; Le Roux and Bach, 2013; Diederichs et al., 2013). In this work, we follow De la Torre and Kanade (2006); Ding and Li (2007), along the alternative route where one defines the signal as being clustered while the noise is any non-clustered variable. In the simplest situation with two clusters, we may project the data into a one-dimensional subspace. Given a data matrix $X \in \mathbb{R}^{n \times d}$ composed of $n$ $d$-dimensional points, the goal is to find a direction $w \in \mathbb{R}^d$ such that $Xw \in \mathbb{R}^n$ is well-clustered, e.g., by K-means.
This is equivalent to identifying both a direction to project, represented as $w \in \mathbb{R}^d$, and the labeling $y \in \{-1, 1\}^n$ that represents the partition into two clusters. Most existing formulations are non-convex and typically perform a form of alternating optimization (De la Torre and Kanade, 2006; Ding and Li, 2007), where given $y \in \{-1,1\}^n$, the projection $w$ is found by linear discriminant analysis (or any binary classification method), and given the projection $w$, the clustering is obtained by thresholding $Xw$ or running K-means on $Xw$. As shown in Section 2, this alternating minimization procedure happens to be equivalent to maximizing the (centered) correlation between $y \in \{-1,1\}^n$ and the projection $Xw \in \mathbb{R}^n$, that is,

$$\max_{w \in \mathbb{R}^d,\, y \in \{-1,1\}^n} \frac{(y^\top \Pi_n X w)^2}{\|\Pi_n y\|_2^2\, \|\Pi_n X w\|_2^2},$$

where $\Pi_n = I_n - \frac{1}{n}\mathbf{1}_n\mathbf{1}_n^\top$ is the usual centering projection matrix (with $\mathbf{1}_n \in \mathbb{R}^n$ being the vector of all ones, and $I_n$ the $n \times n$ identity matrix). This correlation is equal to one when the projection is perfectly clustered (independently of the number of elements per cluster). Existing methods are alternating minimization algorithms with no theoretical guarantees. In this paper, we relate this formulation to discriminative clustering formulations (Xu et al., 2004; Bach and Harchaoui, 2007), which consider the problem

$$\min_{v \in \mathbb{R}^d,\, b \in \mathbb{R},\, y \in \{-1,1\}^n} \frac{1}{n}\|y - Xv - b\mathbf{1}_n\|_2^2, \qquad (1)$$

with the intuition of finding labels $y$ which are easy to predict by an affine function of the data. In particular, we show that given the relationship between the numbers of positive and negative labels (i.e., the squared difference between the respective numbers of elements), these two problems are equivalent, and hence discriminative clustering explicitly performs joint dimension reduction and clustering.
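As a concrete illustration, the centered-correlation objective above is easy to evaluate numerically; the following sketch (our own, with a hypothetical helper name) checks that it equals one for a perfectly clustered projection.

```python
import numpy as np

def centered_correlation(y, X, w):
    """Squared centered correlation between labels y and the projection Xw:
    (y' Pi_n X w)^2 / (||Pi_n y||^2 * ||Pi_n X w||^2).
    Hypothetical helper, for illustration only."""
    n = len(y)
    Pi = np.eye(n) - np.ones((n, n)) / n       # centering projection Pi_n
    py, pXw = Pi @ y, Pi @ (X @ w)
    return (py @ pXw) ** 2 / ((py @ py) * (pXw @ pXw))

# Two balanced clusters along the first coordinate, plus noisy dimensions.
rng = np.random.default_rng(0)
y = np.array([1.0] * 5 + [-1.0] * 5)
X = np.column_stack([y, rng.standard_normal((10, 2))])
corr = centered_correlation(y, X, np.array([1.0, 0.0, 0.0]))
# corr equals 1: the projection on the first axis is perfectly clustered
```

Note that the value is invariant to the relative sizes of the two clusters, as stated above.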
While the discriminative framework is based on convex relaxations and has led to interesting developments and applications (Zhang et al., 2009; Li et al., 2009; Joulin et al., 2010a,b; Wang et al., 2010; Niu et al., 2013; Huang et al., 2015), it has several shortcomings when used with the square loss: (a) the running-time complexity of the semi-definite formulations is at least quadratic in $n$, and typically much more; (b) no theoretical analysis has ever been performed; (c) no convex sparse extension has been proposed to handle data with many irrelevant dimensions; (d) balancing of the clusters remains an issue, as it typically adds an extra hyperparameter which may be hard to set. In this paper, we focus on addressing these concerns.

When there are more than two clusters, one considers either the multi-label or the multi-class setting. The multi-class problem assumes that the data are clustered into distinct classes, i.e., a single class per observation, whereas the multi-label problem assumes the data share different labels, i.e., multiple labels per observation. We show in this work that the discriminative clustering framework extends more naturally to multi-label scenarios and that this extension has the same convex relaxation.

A summary of the contributions of this paper follows. In Section 2, we relate discriminative clustering with the square loss to a joint clustering and dimension reduction formulation. The proposed formulation takes care of the balancing hyperparameter implicitly. We propose in Section 3 a novel sparse extension to discriminative clustering and show that it can still be cast through a convex relaxation. When there are more than two clusters, we naturally extend the sparse formulation to a multi-label scenario in Section 4.
We then proceed to provide a theoretical analysis of the proposed formulations with a simple probabilistic model in Section 5, which effectively leads to scalings of the form $d = O(\sqrt{n})$ for the affine-invariant case and $d = O(n)$ for the 1-sparse case. Finally, we propose in Section 6 efficient iterative algorithms with running-time complexity for each step equal to $O(nd^2)$, the first to be linear in the number of observations $n$ for discriminative clustering with the square loss.

Throughout this paper we assume that $X \in \mathbb{R}^{n \times d}$ is centered, a common pre-processing step in unsupervised (and supervised) learning. This implies that $X^\top \mathbf{1}_n = 0$ and $\Pi_n X = X$.

## 2. Joint Dimension Reduction and Clustering

In this section, we focus on the single binary label case, where we first study the usual non-convex formulation, before deriving convex relaxations based on semi-definite programming. Some of the following results are already known in the literature; however, we state them here for completeness.

### 2.1 Non-convex formulation

Following De la Torre and Kanade (2006); Ding and Li (2007); Ye et al. (2008), we consider a cost function which depends on $y \in \{-1,1\}^n$ and $w \in \mathbb{R}^d$, and which is such that alternating optimization is exactly (a) running K-means with two clusters on $Xw$ to obtain $y$ given $w$ (when we say "running K-means", we mean solving the vector quantization problem exactly), and (b) performing linear discriminant analysis to obtain $w$ given $y$.

Proposition 1 (Joint clustering and dimension reduction for two clusters) Given $X \in \mathbb{R}^{n \times d}$ such that $X^\top \mathbf{1}_n = 0$ and $X$ has rank $d$, consider the optimization problem

$$\max_{w \in \mathbb{R}^d,\, y \in \{-1,1\}^n} \frac{(y^\top X w)^2}{\|\Pi_n y\|_2^2\, \|Xw\|_2^2}. \qquad (2)$$

Given $y$, the optimal $w$ is obtained as $w = (X^\top X)^{-1} X^\top y$, while given $w$, the optimal $y$ is obtained by running K-means on $Xw$.

This equivalence might be straightforward; however, it has not been precisely stated in the literature, to the best of our knowledge.
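The alternating scheme of Proposition 1 (least squares for $w$ given $y$, exact one-dimensional K-means for $y$ given $w$) can be sketched as follows; this is our own minimal implementation, not the authors' reference code.

```python
import numpy as np

def kmeans_1d_two_clusters(t):
    """Exact 2-means in one dimension: sort the points and try every split.
    (An O(n log n) dynamic program exists, cf. Bellman (1973); this is the
    simple quadratic variant, for clarity.)"""
    order = np.argsort(t)
    ts = t[order]
    n = len(ts)
    costs = [k * np.var(ts[:k]) + (n - k) * np.var(ts[k:]) for k in range(1, n)]
    k_best = 1 + int(np.argmin(costs))
    y = -np.ones(n)
    y[order[k_best:]] = 1.0        # upper cluster gets label +1
    return y

def alternating_minimization(X, y0, n_iter=10):
    """Alternate w = (X'X)^{-1} X'y and exact 1-D K-means on Xw."""
    y = y0.copy()
    for _ in range(n_iter):
        w = np.linalg.solve(X.T @ X, X.T @ y)   # discriminant direction
        y = kmeans_1d_two_clusters(X @ w)       # relabel along the projection
    return w, y
```

As with any alternating scheme, each step improves the objective in Eq. (2) but there is no guarantee of reaching the global optimum; the convex relaxations below address exactly this point.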
Proof Given $y$, we need to optimize the Rayleigh quotient $\frac{w^\top X^\top y y^\top X w}{w^\top X^\top X w}$ with a rank-one matrix in the numerator, which leads to $w = (X^\top X)^{-1} X^\top y$. Given $w$, we show in Appendix A that the averaged distortion measure of K-means, once the means have been optimized, is exactly equal to $(y^\top X w)^2 / \|\Pi_n y\|_2^2$.

Algorithm. The proposition above leads to an alternating optimization algorithm. Note that K-means in one dimension may be run exactly in $O(n \log n)$ (Bellman, 1973). After having optimized with respect to $w$ in Eq. (2), we then need to maximize with respect to $y$ the function

$$\frac{y^\top X (X^\top X)^{-1} X^\top y}{\|\Pi_n y\|_2^2},$$

which happens to be exactly performing K-means on the whitened data (which is now in high dimension and not in one dimension). At first, it seems that dimension reduction is simply equivalent to whitening the data and performing K-means; while this is a formally correct statement, the resulting K-means problem is not easy to solve, as the clustered dimension is hidden in noise. For example, algorithms such as K-means++ (Arthur and Vassilvitskii, 2007), which have a multiplicative theoretical guarantee on the final distortion measure, are not provably effective here because the minimal final distortion is not small (since the clusters are corrupted by some noisy dimensions), and the multiplicative guarantee is then meaningless.

### 2.2 Convex relaxation and discriminative clustering

The discriminative clustering formulation in Eq. (1) may be optimized for any $y \in \{-1,1\}^n$ in closed form with respect to $b$ as $b = \frac{1}{n}\mathbf{1}_n^\top (y - Xv) = \frac{y^\top \mathbf{1}_n}{n}$, since $X$ is centered. Substituting $b$ in Eq. (1) leads us to

$$\min_{v \in \mathbb{R}^d} \frac{1}{n}\|\Pi_n y - Xv\|_2^2 = \frac{1}{n}\Big(\|\Pi_n y\|_2^2 - \max_{w \in \mathbb{R}^d} \frac{(y^\top Xw)^2}{\|Xw\|_2^2}\Big), \qquad (3)$$

where $v$ is obtained from any solution $w$ as $v = w\, \frac{y^\top Xw}{\|Xw\|_2^2}$. Thus, given

$$\frac{1}{n^2}\big(\#\{i :\, y_i = 1\} - \#\{i :\, y_i = -1\}\big)^2 = \alpha \in [0, 1], \qquad (4)$$

which characterizes the asymmetry between clusters, and with $\|\Pi_n y\|_2^2 = n(1 - \alpha)$, we obtain from Eq. (3) an equivalent formulation to Eq.
(2) (with the added constraint) as

$$\min_{y \in \{-1,1\}^n,\, v \in \mathbb{R}^d} \frac{1}{n}\|\Pi_n y - Xv\|_2^2 \quad \text{such that} \quad \frac{(y^\top \mathbf{1}_n)^2}{n^2} = \alpha. \qquad (5)$$

This is exactly equivalent to a discriminative clustering formulation with the square loss (Bach and Harchaoui, 2007) with an explicit cluster balance constraint. Consequently, we have formally established that the discriminative clustering formulation in Eq. (5) is related to the joint clustering and dimension reduction formulation in Eq. (2). Following Bach and Harchaoui (2007), we may optimize Eq. (5) in closed form with respect to $v$ as $v = (X^\top X)^{-1} X^\top y$. Substituting $v$ in Eq. (5) leads us to

$$\min_{y \in \{-1,1\}^n} \frac{1}{n}\, y^\top \big(\Pi_n - X(X^\top X)^{-1}X^\top\big) y \quad \text{such that} \quad \frac{(y^\top \mathbf{1}_n)^2}{n^2} = \alpha. \qquad (6)$$

This combinatorial optimization problem is NP-hard in general (Karp, 1972; Garey et al., 1976). Hence, in practice, it is classical to consider the following convex relaxation of Eq. (6) (Luo et al., 2010). For any admissible $y \in \{-1,+1\}^n$, the matrix $Y = yy^\top \in \mathbb{R}^{n \times n}$ is a rank-one symmetric positive semi-definite matrix with unit diagonal entries, and conversely any such $Y$ may be written in the form $Y = yy^\top$ such that $y$ is admissible for Eq. (6). Moreover, by rewriting Eq. (6) as

$$\min_{y \in \{-1,1\}^n} \frac{1}{n}\operatorname{tr}\big( yy^\top (\Pi_n - X(X^\top X)^{-1}X^\top)\big) \quad \text{such that} \quad \frac{\mathbf{1}_n^\top (yy^\top) \mathbf{1}_n}{n^2} = \alpha,$$

we see that the objective and constraints are linear in the matrix $Y = yy^\top$, and Eq. (6) is equivalent to

$$\min_{Y \succeq 0,\ \operatorname{rank}(Y)=1,\ \operatorname{diag}(Y)=\mathbf{1}_n} \frac{1}{n}\operatorname{tr}\big( Y (\Pi_n - X(X^\top X)^{-1}X^\top)\big) \quad \text{such that} \quad \frac{\mathbf{1}_n^\top Y \mathbf{1}_n}{n^2} = \alpha.$$

Then, dropping the non-convex rank constraint leads us to the following classical convex relaxation:

$$\min_{Y \succeq 0,\ \operatorname{diag}(Y)=\mathbf{1}_n} \frac{1}{n}\operatorname{tr}\big( Y (\Pi_n - X(X^\top X)^{-1}X^\top)\big) \quad \text{such that} \quad \frac{\mathbf{1}_n^\top Y \mathbf{1}_n}{n^2} = \alpha. \qquad (7)$$

This is the standard (unregularized) formulation, which is cast as a semi-definite program. The complexity of interior-point methods is $O(n^7)$, but efficient algorithms in $O(n^2)$ for such problems have been developed due to the relationship with the max-cut problem (Journée et al., 2010; Wen et al., 2012).
We note that convex relaxation techniques are also used for semi-supervised methods (De Bie and Cristianini, 2003). Given the solution $Y$, one may traditionally obtain a candidate $y \in \{-1,1\}^n$ by running K-means on the largest eigenvector of $Y$ or by sampling (Goemans and Williamson, 1995). In this paper, we show in Section 5 that it may be advantageous to consider the first two eigenvectors.

### 2.3 Unsuccessful full convex relaxation

The formulation in Eq. (7) imposes an extra parameter $\alpha$ that characterizes the cluster imbalance. It is tempting to find a direct relaxation of Eq. (2). It turns out to lead to a trivial relaxation, which we outline below. When optimizing Eq. (2) with respect to $w$, we obtain the optimization problem

$$\max_{y \in \{-1,1\}^n} \frac{y^\top X(X^\top X)^{-1} X^\top y}{\|\Pi_n y\|_2^2},$$

leading to a quasi-convex relaxation as

$$\max_{Y \succeq 0,\ \operatorname{diag}(Y)=\mathbf{1}_n} \frac{\operatorname{tr}\big(Y X(X^\top X)^{-1}X^\top\big)}{\operatorname{tr}(Y \Pi_n)},$$

whose solution is found by solving a sequence of convex problems (Boyd and Vandenberghe, 2004, Section 4.2.5). As shown in Appendix B, this may be exactly reformulated as a single convex problem:

$$\max_{M \succeq 0,\ \operatorname{diag}(M) = \left(1 + \frac{\mathbf{1}_n^\top M \mathbf{1}_n}{n^2}\right)\mathbf{1}_n} \operatorname{tr}\big(M X(X^\top X)^{-1}X^\top\big).$$

Unfortunately, this relaxation always leads to trivial solutions, and we thus need to consider the relaxation in Eq. (7) for several values of $\alpha = \mathbf{1}_n^\top Y \mathbf{1}_n / n^2$ (and then the non-convex algorithm can be run from the rounded solution of the convex problem, using Eq. (2) as a final objective). Alternatively, we may solve the following penalized problem for several values of $\nu \geq 0$:

$$\min_{Y \succeq 0,\ \operatorname{diag}(Y)=\mathbf{1}_n} \frac{1}{n}\operatorname{tr}\big( Y (\Pi_n - X(X^\top X)^{-1}X^\top)\big) + \nu\, \frac{\mathbf{1}_n^\top Y \mathbf{1}_n}{n^2}. \qquad (8)$$

For $\nu = 0$, $Y = \mathbf{1}_n\mathbf{1}_n^\top$ is always a trivial solution. As outlined in our theoretical section and as observed in our experiments, it is sufficient to consider $\nu \in [0, 1]$. By convex duality (Borwein and Lewis, 2000, Sec. 4.3), the constrained relaxation in Eq. (7) and the penalized relaxation in Eq. (8) are formally equivalent for specific choices of the constraint parameter $\alpha$ and penalization parameter $\nu$.
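The classical rounding step mentioned above (thresholding or K-means on the largest eigenvector of $Y$) can be sketched as follows; on the ideal rank-one solution $Y = yy^\top$ it recovers $y$ up to a global sign. This is our own illustration, not the authors' code.

```python
import numpy as np

def round_principal_eigenvector(Y):
    """Round a relaxed solution Y (PSD) to labels in {-1, +1}^n by taking
    the sign of the eigenvector associated with the largest eigenvalue."""
    eigvals, eigvecs = np.linalg.eigh(Y)   # eigenvalues in ascending order
    u = eigvecs[:, -1]                     # principal eigenvector
    return np.where(u >= 0, 1.0, -1.0)

y = np.array([1.0, 1.0, -1.0, -1.0, -1.0])
labels = round_principal_eigenvector(np.outer(y, y))
# labels equals y or -y; either way the two-cluster partition is identical
```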
We will see in Section 6 that the formulation in Eq. (8) is more suitable for algorithmic design (Bach et al., 2012).

### 2.4 Equivalent relaxations

Optimizing Eq. (5) with respect to $v$ in closed form as in Section 2.2 is feasible with no regularizer or with a quadratic regularizer. However, if one needs to add more complex regularizers, a different relaxation is required. Therefore, we now propose a new formulation of the discriminative clustering framework. We start from the penalized version of Eq. (5),

$$\min_{y \in \{-1,1\}^n,\, v \in \mathbb{R}^d} \frac{1}{n}\|\Pi_n y - Xv\|_2^2 + \nu\, \frac{(y^\top \mathbf{1}_n)^2}{n^2}, \qquad (9)$$

which we expand as

$$\min_{y \in \{-1,1\}^n,\, v \in \mathbb{R}^d} \frac{1}{n}\operatorname{tr}(\Pi_n y y^\top) - \frac{2}{n}\operatorname{tr}(X v y^\top) + \frac{1}{n}\operatorname{tr}(X^\top X v v^\top) + \nu\, \frac{(y^\top \mathbf{1}_n)^2}{n^2}, \qquad (10)$$

and relax, using $Y = yy^\top$, $P = yv^\top$ and $V = vv^\top$, as

$$\min_{V, P, Y} \frac{1}{n}\operatorname{tr}(\Pi_n Y) - \frac{2}{n}\operatorname{tr}(P^\top X) + \frac{1}{n}\operatorname{tr}(X^\top X V) + \nu\, \frac{\mathbf{1}_n^\top Y \mathbf{1}_n}{n^2} \quad \text{s.t.} \quad \begin{pmatrix} Y & P \\ P^\top & V \end{pmatrix} \succeq 0,\ \operatorname{diag}(Y) = \mathbf{1}_n. \qquad (11)$$

When optimizing Eq. (11) with respect to $V$ and $P$, we get exactly Eq. (8). Indeed, the optimum is attained for $V = (X^\top X)^{-1} X^\top Y X (X^\top X)^{-1}$ and $P = Y X (X^\top X)^{-1}$, as shown in Appendix C.1. Therefore, the convex relaxation in Eq. (11) is equivalent to Eq. (8).

However, we get an interesting behavior when optimizing Eq. (11) with respect to $P$ and $Y$ in closed form instead. For $\nu = 1$, we obtain, as shown in Appendix C.2, the closed-form expressions

$$Y = \operatorname{Diag}(\operatorname{diag}(XVX^\top))^{-1/2}\, XVX^\top\, \operatorname{Diag}(\operatorname{diag}(XVX^\top))^{-1/2}, \qquad P = \operatorname{Diag}(\operatorname{diag}(XVX^\top))^{-1/2}\, XV,$$

leading to the problem

$$\min_{V \succeq 0}\ -\frac{2}{n}\sum_{i=1}^n \sqrt{(XVX^\top)_{ii}} + \frac{1}{n}\operatorname{tr}(V X^\top X). \qquad (12)$$

The formulation above in Eq. (12) is interesting for several reasons: (a) it is formulated as an optimization problem in $V \in \mathbb{R}^{d \times d}$, which will lead to algorithms whose running time depends linearly on $n$ (see Section 6); (b) it allows for the easy addition of regularizers (see Section 3), which may be formulated as convex functions of $V = vv^\top$. At first sight this seems to be valid only for $\nu = 1$. However, we now propose a reformulation which can handle all possible $\nu \in [0, 1)$ through a simple data augmentation.
Reformulation for any ν. When $\nu \in [0, 1)$, we may reformulate the objective function in Eq. (9) as follows:

$$\frac{1}{n}\|\Pi_n y - Xv\|_2^2 + \nu\,\frac{(y^\top\mathbf{1}_n)^2}{n^2} = \frac{1}{n}\Big\|y - Xv - (1-\nu)\frac{y^\top\mathbf{1}_n}{n}\,\mathbf{1}_n\Big\|_2^2 + \frac{\nu}{1-\nu}\Big((1-\nu)\frac{y^\top\mathbf{1}_n}{n}\Big)^2 = \min_{b \in \mathbb{R}}\ \frac{1}{n}\|y - Xv - b\mathbf{1}_n\|_2^2 + \frac{\nu}{1-\nu}\,b^2, \qquad (13)$$

since $\frac{1}{n}\|y - Xv - b\mathbf{1}_n\|_2^2 + \frac{\nu}{1-\nu}b^2$ can be optimized in closed form with respect to $b$ as $b = (1-\nu)\frac{y^\top\mathbf{1}_n}{n}$. Note that the weighted imbalance ratio $(1-\nu)\frac{y^\top\mathbf{1}_n}{n}$ becomes an optimization variable in Eq. (13). Thus we have the reformulation

$$\min_{v \in \mathbb{R}^d,\, y \in \{-1,1\}^n} \frac{1}{n}\|\Pi_n y - Xv\|_2^2 + \nu\,\frac{(y^\top\mathbf{1}_n)^2}{n^2} = \min_{v \in \mathbb{R}^d,\, b \in \mathbb{R},\, y \in \{-1,1\}^n} \frac{1}{n}\|y - Xv - b\mathbf{1}_n\|_2^2 + \frac{\nu}{1-\nu}\,b^2, \qquad (14)$$

which is a non-centered penalized formulation of a higher-dimensional problem in the variable $(v, b) \in \mathbb{R}^{d+1}$. In the rest of the paper, we will focus on the case $\nu = 1$ for ease of exposition. This enables the use of the formulation in Eq. (12), which is easier to optimize. It is worth noting that this is not an algorithmic restriction: any problem with $\nu \in [0, 1)$ can be treated with equal ease by adding a constant term and a quadratic regularizer.

## 3. Regularization

There are several natural possibilities for regularization. We consider norms $\Omega$ such that $\Omega(w)^2 = \Gamma(ww^\top)$ for a certain convex function $\Gamma$; all norms have this form (Bach et al., 2012, Proposition 5.1). When $\nu = 1$, Eq. (12) then becomes

$$\max_{V \succeq 0}\ \frac{2}{n}\sum_{i=1}^n \sqrt{(XVX^\top)_{ii}} - \frac{1}{n}\operatorname{tr}(V X^\top X) - \Gamma(V). \qquad (15)$$

Quadratic regularizers $\Gamma(V) = \operatorname{tr}(\Lambda V)$ have already been tackled by Bach and Harchaoui (2007). They consider the regularized version of the problem in Eq. (3),

$$\min_{v \in \mathbb{R}^d} \frac{1}{n}\|\Pi_n y - Xv\|_2^2 + v^\top \Lambda v, \qquad (16)$$

and optimize in closed form with respect to $v$ as $v = (X^\top X + n\Lambda)^{-1} X^\top y$. Substituting $v$ in Eq. (16) leads them to

$$\min_{Y \succeq 0,\ \operatorname{diag}(Y)=\mathbf{1}_n} \frac{1}{n}\operatorname{tr}\big( Y (\Pi_n - X(X^\top X + n\Lambda)^{-1}X^\top)\big).$$

In this paper, we propose a novel sparse extension to the discriminative clustering framework with the square loss. Specifically, we formulate a non-trivial sparse regularizer which combines a weighted squared ℓ1-norm and a weighted squared ℓ2-norm.
It leads to

$$\Gamma(V) = \operatorname{tr}\big[\operatorname{Diag}(a)\, V\, \operatorname{Diag}(a)\big] + \big\|\operatorname{Diag}(c)\, V\, \operatorname{Diag}(c)\big\|_1, \qquad (17)$$

such that $\Gamma(vv^\top) = \sum_{i=1}^d a_i^2 v_i^2 + \big(\sum_{i=1}^d c_i |v_i|\big)^2$. This allows us to treat all situations simultaneously, with $\nu = 1$ or with $\nu \in [0, 1)$. To be more precise, when $\nu \in [0, 1)$, we can consider in Eq. (14) a problem of size $d+1$ with a design matrix $[X, \mathbf{1}_n] \in \mathbb{R}^{n \times (d+1)}$, a direction of projection $(v, b) \in \mathbb{R}^{d+1}$ and different weights for the last variable, with $a_{d+1}^2 = \frac{\nu}{1-\nu}$ and $c_{d+1} = 0$.

Note that the sparse regularizers on $V$ introduced in this paper are significantly different from sparse regularizers on the variable $v$ in Eq. (3), as considered, for example, by Wang et al. (2013). A straightforward sparse regularizer on $v$ in Eq. (3), despite leading to a sparse projection, does not yield natural generalizations of the discriminative clustering framework in terms of theory or algorithms.

In our analysis and experiments for balanced clusters (when $\nu = 1$), we will often consider the sparse regularization $\Gamma(\cdot) = \lambda \|\cdot\|_1$ for $\lambda \geq 0$. This is equivalent to setting $a = 0_d$ and $c = \sqrt{\lambda}\,\mathbf{1}_d$ in Eq. (17). The problem in Eq. (15) then becomes

$$\max_{V \succeq 0}\ \frac{2}{n}\sum_{i=1}^n \sqrt{(XVX^\top)_{ii}} - \frac{1}{n}\operatorname{tr}(V X^\top X) - \lambda \|V\|_1. \qquad (18)$$

The sparse regularizers considered in this paper have a significant algorithmic appeal for certain applications in computer vision (Bojanowski et al., 2013; Alayrac et al., 2016), audio processing (Lajugie et al., 2016) and natural language processing (Grave, 2014). They also lead to robust cluster recovery under minor assumptions, as will be illustrated on a simple example in Section 5. The practical benefits of the sparse regularizers will be further demonstrated through empirical evaluation on synthetic and real data sets in Section 7.

## 4. Extension to Multiple Labels

The discussion so far has focused on two clusters. Yet it is key in practice to tackle more clusters. It is worth noting that the discrete formulations in Eq. (2) and Eq. (5) extend directly to more than two clusters.
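The objective in Eq. (18) is straightforward to evaluate for a given $V$; the following function (our own transcription, for illustration) computes it using only the diagonal of $XVX^\top$, without forming the full $n \times n$ matrix.

```python
import numpy as np

def sparse_objective(V, X, lam):
    """Objective of Eq. (18):
    (2/n) sum_i sqrt((X V X')_ii) - (1/n) tr(V X'X) - lam * ||V||_1."""
    n = X.shape[0]
    # diag(X V X^T) without the n x n product: O(n d^2) time
    q = np.einsum('ij,jk,ik->i', X, V, X)
    return (2.0 / n) * np.sqrt(np.maximum(q, 0.0)).sum() \
        - np.trace(V @ (X.T @ X)) / n - lam * np.abs(V).sum()
```

Evaluating only the diagonal is what keeps the per-step cost linear in $n$, as exploited by the algorithms of Section 6.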
However, two different extensions of the initial problems in Eq. (2) or Eq. (5) are conceivable. They lead to problems with different constraints on different optimization domains and, consequently, to different relaxations. We discuss these possibilities next.

One extension is the multi-class case. The multi-class problem, which is dealt with by Bach and Harchaoui (2007), assumes that the data are clustered into $K$ classes, and the various partitions of the data points into clusters are represented by the $K$-class indicator matrices $y \in \{0,1\}^{n \times K}$ such that $y\mathbf{1}_K = \mathbf{1}_n$. The constraint $y\mathbf{1}_K = \mathbf{1}_n$ ensures that each data point belongs to only one cluster. However, as discussed by Bach and Harchaoui (2007), by letting $Y = yy^\top$, it is possible to lift these $K$-class indicator matrices into the outer convex approximations $\mathcal{C}_K = \{Y \in \mathbb{R}^{n\times n} :\ Y = Y^\top,\ \operatorname{diag}(Y) = \mathbf{1}_n,\ Y \succeq 0,\ Y \succeq \frac{1}{K}\mathbf{1}_n\mathbf{1}_n^\top\}$ (Frieze and Jerrum, 1995), which is different for each value of $K$. Note that letting $K = 2$ corresponds to the previous sections.

In this paper, we consider a different, novel extension of discriminative clustering to the multi-label case. The multi-label problem assumes that the data share $k$ labels, and the data-label membership is represented by matrices $y \in \{-1,+1\}^{n \times k}$. In other words, the multi-class problem embeds the data in the extreme points of a simplex, while the multi-label problem does so in the extreme points of a hypercube. The discriminative clustering formulation of the multi-label problem is

$$\min_{v \in \mathbb{R}^{d \times k},\ y \in \{-1,1\}^{n \times k}} \frac{1}{n}\|\Pi_n y - Xv\|_F^2, \qquad (19)$$

where the Frobenius norm is defined for any vector or rectangular matrix as $\|A\|_F^2 = \operatorname{tr}(AA^\top) = \operatorname{tr}(A^\top A)$. Letting $k = 1$ here corresponds to the previous sections. The discrete ensemble of matrices $y \in \{-1,+1\}^{n \times k}$ can be naturally lifted into $\mathcal{D}_k = \{Y \in \mathbb{R}^{n\times n} :\ Y = Y^\top,\ \operatorname{diag}(Y) = k\mathbf{1}_n,\ Y \succeq 0\}$, since $\operatorname{diag}(yy^\top)_i = \sum_{j=1}^k y_{i,j}^2 = k$. As the optimization problems in Eq. (7) and Eq.
(8) have linear objective functions, we can change the variable from $Y$ to $\tilde{Y} = Y/k$ to turn the constraint $\operatorname{diag}(Y) = k\mathbf{1}_n$ into $\operatorname{diag}(\tilde{Y}) = \mathbf{1}_n$ without changing the optimizer of the problem. Thus the problems can be solved over the relaxed domain $\mathcal{D} = \{Y \in \mathbb{R}^{n\times n} :\ Y = Y^\top,\ \operatorname{diag}(Y) = \mathbf{1}_n,\ Y \succeq 0\}$, which is independent of $k$. Note that the domain $\mathcal{D}$ is similar to the one considered in the problems in Eq. (8) and Eq. (11), and these convex relaxations are the same regardless of the value of $k$. Hence the multi-label problem is a more natural extension of the discriminative framework, with a slight change in how the labels $y$ are recovered from the solution $Y$ (we discuss this in Section 5.3).

## 5. Theoretical Analysis

In this section, we provide the first theoretical analysis of the discriminative clustering framework with the square loss. We start with the two-cluster situation: the non-sparse case is considered first, and the analysis is provided for both balanced and imbalanced clusters. Our study of the sparse case currently only provides results for the simple 1-sparse solution. However, the analysis also yields valuable insights on the scaling between $n$ and $d$. We then derive results for the multi-label situation.

For ease of analysis, we consider the constrained problem in Eq. (7), the penalized problem in Eq. (8) or their equivalent relaxations in Eq. (12) or Eq. (18) under various scenarios, for which we use the same proof technique. We first characterize the low-rank solutions of these relaxations and then show, in certain simple situations, the uniqueness of such solutions, which are then non-ambiguously found by convex optimization. Perturbation arguments could extend these results by weakening our assumptions, but they are not within the scope of this paper, and hence we do not investigate them further in this section.
### 5.1 Analysis for two clusters: non-sparse problems

In this section, we consider several noise models for the problem, either adding irrelevant dimensions or perturbing the label vector with noise. We consider these separately for simplicity, but they could also be combined (with little extra insight).

#### 5.1.1 Irrelevant dimensions

We consider an ideal design matrix $X \in \mathbb{R}^{n \times d}$ such that there exists a direction $v$ along which the projection $Xv$ is perfectly clustered into two distinct real values $c_1$ and $c_2$. Since Eq. (2) is invariant by affine transformation, we can rotate the design matrix $X$ to have $X = [y, Z]$ with $y \in \{-1,1\}^n$, which is clustered into $+1$ or $-1$ along the direction $v = (1, 0_{d-1})$. Then, after centering, the design matrix is written as $X = [\Pi_n y, Z]$ with $Z = [z_1, \ldots, z_{d-1}] \in \mathbb{R}^{n \times (d-1)}$. The columns of $Z$ represent the noisy irrelevant dimensions added on top of the signal $y$.

#### 5.1.2 Balanced problem

When the problem is well balanced ($y^\top \mathbf{1}_n = 0$), $y$ is already centered and $\Pi_n y = y$. Thus the design matrix is represented as $X = [y, Z]$. We consider here the penalized formulation in Eq. (8) with $\nu = 1$, which is the only scenario where we are able to provide a theoretical analysis. Let us assume that the columns $(z_i)_{i=1,\ldots,d-1}$ of $Z$ are i.i.d. with symmetric distribution $z$, with $\mathbb{E}z = \mathbb{E}z^3 = 0$, and such that $z$ is almost surely bounded by $R \geq 0$. We denote by $\mathbb{E}z^2 = m$ its second moment and by $\mathbb{E}z^4/(\mathbb{E}z^2)^2 = \beta$ its (unnormalized) kurtosis.

Surprisingly, the clustered vector $y$ happens to generate a solution $yy^\top$ of the relaxation in Eq. (8) for all possible values of $Z$ (see Lemma 11 in Appendix D.2). However, the problem in Eq. (8) should have a unique solution in order to always recover the correct assignment $y$. Unfortunately, the semi-definite constraint $Y \succeq 0$ of the relaxation makes the second-order information arduous to study. For this reason, we consider the other equivalent relaxation in Eq. (12), for which $V = vv^\top$ with $v \propto (X^\top X)^{-1} X^\top y$ is also a solution (see Lemma 12 in Appendix D.3).
Fortunately, the semi-definite constraint $V \succeq 0$ of the problem in Eq. (12) may be ignored, since the second-order information in $V$ of the objective function already provides uniqueness for the unconstrained problem. Hence we are able to ensure the uniqueness of the solution with high probability.

Proposition 2 Let us assume $d \geq 3$, $\beta > 1$ and $m^2 \geq \frac{\beta - 3}{2(d + \beta - 4)}$: (a) If $n \gtrsim d^2 R^4\, \frac{1 + (d+\beta)m^2}{m^2(\beta-1)}$, $V$ is the unique solution of the problem in Eq. (12) with high probability. (b) If $n \gtrsim \frac{d^2 R^4}{\min\{m^2(\beta-1),\ 2m^2,\ 2m\}}$, $v$ is the principal eigenvector of any solution of the problem in Eq. (12) with high probability.

Let us make the following observations:

Proof technique: the proof relies on a computation of the Hessian of $f(V) = \frac{2}{n}\sum_{i=1}^n \sqrt{(XVX^\top)_{ii}} - \frac{1}{n}\operatorname{tr}(X^\top X V)$, which is the objective function in Eq. (12). We first derive the expectation of $\nabla^2 f(V)$ with respect to the distribution of $X$. By the law of large numbers, this amounts to letting $n$ go to infinity in $\nabla^2 f(V)$. Then we expand the spectrum of the operator $\mathbb{E}\,\nabla^2 f(V)$ to lower-bound its smallest eigenvalue. Finally, we use concentration theory on matrices, following Tropp (2012), to bound the Hessian $\nabla^2 f(V)$ for finite $n$.

Effect of kurtosis: we recall that $\beta \geq 1$, with equality if and only if $z$ follows a Rademacher law ($\mathbb{P}(z = +1) = \mathbb{P}(z = -1) = 1/2$). Thus, if the noisy dimensions are clustered, then, unsurprisingly, our guarantee is meaningless. Note that the constant $\beta$ behaves like a distance of the distribution of $z$ to the Rademacher distribution. Moreover, $\beta = 3$ if $z$ follows a standard normal distribution.

Scaling between d and n: if the noisy variables are not themselves evenly clustered on $\{-1, 1\}$ (i.e., $\beta > 1$), we recover a rank-one solution as long as $n = O(d^3)$; while, as long as $n = O(d^2)$, the solution is not unique, but its principal eigenvector recovers the correct clustering. Moreover, as explained in the proof, its spectrum would be very spiky.
The assumption $m^2 \geq \frac{\beta-3}{2(d+\beta-4)}$ is generally satisfied for large dimensions. Note that $m^2 d$ is the total variance of the irrelevant dimensions; when it is small, i.e., when $m^2 \leq \frac{\beta-3}{2(d+\beta-4)}$, the problem is particularly simple, and we can also show that $V$ is the unique solution of the problem in Eq. (12) with high probability if $n \gtrsim \frac{d^2 R^4}{m^2}$. Finally, note that for sub-Gaussian distributions (where $\beta \leq 3$), the extra constraint is vacuous, while for super-Gaussian distributions (where $\beta \geq 3$), this extra constraint only appears for small $m$.

This result provides the first guarantee for discriminative clustering. Similar theoretical results have been derived for K-means by Ostrovsky et al. (2006) and for Gaussian mixtures by Kalai et al. (2010); Moitra and Valiant (2010), where separation conditions between the two clusters are derived under which the clustering problem is efficiently solved. It would be of great interest to relate these separation conditions to our condition on $n$ and $d$, but this is outside the scope of this work.

#### 5.1.3 Noise robustness for the one-dimensional balanced problem

We now assume that the data are one-dimensional and perturbed by some noise $\varepsilon \in \mathbb{R}^n$, such that $X = y + \varepsilon$ with $y \in \{-1,1\}^n$. The solution of the relaxation in Eq. (8) recovers the correct $y$ in this setting only when each component of $y$ and $y + \varepsilon$ have the same sign (this is shown in Appendix D.5). This result comes out naturally from the information on whether the signs of $y$ and $y + \varepsilon$ are the same or not. Further, if we assume that $y$ and $\varepsilon$ are independent, this condition is equivalent to $|\varepsilon_i| < 1$ almost surely.

#### 5.1.4 Unbalanced problem

When the clusters are imbalanced ($y^\top \mathbf{1}_n \neq 0$), the natural rank-one candidates $Y = yy^\top$ and $V = vv^\top$ are no longer solutions of the relaxations in Eq. (8) (for $\nu = 1$) and Eq. (12), as proved in Appendix D.6. Nevertheless, we are able to characterize some solutions of the penalized relaxation in Eq. (8) for $\nu = 0$.
Lemma 3 For $\nu = 0$ and any non-negative $a, b \in \mathbb{R}$ such that $a + b = 1$, $Y = a\,yy^\top + b\,\mathbf{1}_n\mathbf{1}_n^\top$ is a solution of the penalized relaxation in Eq. (8).

Hence any eigenvector of such a solution $Y$ is supported by the directions $y$ and $\mathbf{1}_n$. Moreover, when the value $\alpha^\star = \big(\frac{y^\top\mathbf{1}_n}{n}\big)^2$ is known, it turns out that we can characterize some solutions of the constrained relaxation in Eq. (7), as stated in the following lemma.

Lemma 4 For $\alpha \geq \alpha^\star$, $Y = \frac{1-\alpha}{1-\alpha^\star}\, yy^\top + \frac{\alpha - \alpha^\star}{1-\alpha^\star}\,\mathbf{1}_n\mathbf{1}_n^\top$ is a rank-2 solution of the constrained relaxation in Eq. (7) with constraint parameter $\alpha$.

The eigenvectors of $Y$ enable the recovery of $y$ for $\alpha^\star \leq \alpha < 1$. We conjecture (and have checked empirically) that this rank-2 solution is unique under regimes similar to those considered for the balanced case. The proof would be more involved since, when $\nu \neq 1$, we are not able to derive an equivalent problem in $V$ for the penalized relaxation in Eq. (8) similar to Eq. (12) in the balanced case. We also note that Lemmas 3 and 4 are direct consequences of Lemma 8 in Section 5.3.

Thus, $Y$ being rank-2, one should be careful and consider the first two eigenvectors when recovering $y$ from a solution $Y$. This can be done by rounding the principal eigenvector of $\Pi_n Y \Pi_n = \frac{1-\alpha}{1-\alpha^\star}\, \Pi_n y\, (\Pi_n y)^\top$, as discussed in the following lemma.

Lemma 5 Let $y_{\mathrm{ev}}$ be the principal eigenvector of $\Pi_n Y \Pi_n$, where $Y$ is defined in Lemma 4; then $\mathrm{sign}(y_{\mathrm{ev}}) = y$.

Proof By definition of $Y$, $y_{\mathrm{ev}} = \sqrt{\frac{1-\alpha}{1-\alpha^\star}}\, \Pi_n y$ up to normalization; thus $\mathrm{sign}(y_{\mathrm{ev}}) = \mathrm{sign}(\Pi_n y)$, and since $\alpha^\star < 1$, $\mathrm{sign}(\Pi_n y) = \mathrm{sign}\big(y - \sqrt{\alpha^\star}\,\mathbf{1}_n\big) = y$ (taking, without loss of generality, $y^\top \mathbf{1}_n \geq 0$).

In practice, contrary to the standard procedure, we should, for any $\nu$, solve the penalized relaxation in Eq. (8) and then run K-means on the principal eigenvector of the centered solution $\Pi_n Y \Pi_n$, instead of the solution $Y$, to recover the correct $y$. This procedure is followed in our experiments on real-world data in Section 7.2.
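The recovery procedure of Lemma 5 can be sketched as follows: centering the rank-2 solution with $\Pi_n$ on both sides removes the $\mathbf{1}_n\mathbf{1}_n^\top$ component, after which the sign of the principal eigenvector gives the labels. This is our own illustration on a synthetic rank-2 solution.

```python
import numpy as np

def recover_labels_centered(Y):
    """Round the principal eigenvector of Pi_n Y Pi_n (cf. Lemma 5)."""
    n = Y.shape[0]
    Pi = np.eye(n) - np.ones((n, n)) / n
    eigvals, eigvecs = np.linalg.eigh(Pi @ Y @ Pi)
    u = eigvecs[:, -1]
    return np.where(u >= 0, 1.0, -1.0)

# Imbalanced clusters: a rank-2 mixture of yy^T and 1_n 1_n^T, as in Lemma 4.
y = np.array([1.0] * 6 + [-1.0] * 2)
Y = 0.7 * np.outer(y, y) + 0.3 * np.ones((8, 8))
labels = recover_labels_centered(Y)
# labels matches y up to a global sign: centering kills the 1_n 1_n^T term,
# which would otherwise bias the principal eigenvector of Y toward 1_n
```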
5.2 Analysis for two clusters: one-sparse problems

We assume here that the direction of projection v (such that $Xv = y$) is l-sparse (by l-sparse we mean $\|v\|_0 = l$). The $\ell_1$-norm regularized problem in Eq. (18) is no longer invariant by affine transformation, and we cannot consider that $X = [y, Z]$ without loss of generality. Yet the relaxation in Eq. (18) seems, experimentally, to have rank-one solutions only in the simple l = 1 situation; hence we are able to derive a theoretical analysis only for this case. It is worth noting that the l = 1 case is simple, since it can be solved in O(d) time by running K-means separately on each dimension and ranking the dimensions. Nonetheless, the proposed scaling also holds in practice for $l \geq 1$ (see Figure 1b). We thus consider data $X = [y, Z]$ with $y \in \{-1,1\}^n$ and $Z \in \mathbb{R}^{n\times(d-1)}$, which are clustered in the direction $v = (1, 0, \dots, 0)^\top \in \mathbb{R}^d$. When adding an $\ell_1$-penalty, the initial problem in Eq. (5) for $\alpha = 0$ is

$\min_{y\in\{-1,1\}^n,\ v\in\mathbb{R}^d}\ \frac1n\|y - Xv\|_2^2 + \lambda\|v\|_1^2. \quad (20)$

When optimizing in v, this problem is close to the Lasso (Tibshirani, 1996) and a solution is known to be $v^\star_i = (y^\top y + n\lambda)^{-1}y^\top y = \frac{1}{1+\lambda}$ for $i \in J$ and $v^\star_i = 0$ for $i \in \{1,\dots,d\}\setminus J$, where J is the support of $v^\star$. The candidate $V = v^\star v^{\star\top}$ is still a solution of the relaxation in Eq. (18) (see Lemma 15 in Appendix E.1), and we investigate under which conditions on X this solution is unique. Let us assume, as before, that the $(z_i)$ are i.i.d. with a symmetric distribution z such that $\mathbb{E}z = \mathbb{E}z^3 = 0$; denote $m = \mathbb{E}z^2$ and $\beta = \mathbb{E}z^4/(\mathbb{E}z^2)^2$. We also assume that z is almost surely bounded by a constant $R \geq 1$. We are able to ensure the uniqueness of the solution with high probability.

Proposition 6 Let us assume $d \geq 3$. (a) If $n \gtrsim dR^2\,\frac{1+(d+\beta)m^2}{m^2(\beta-1)}$, V is the unique solution of the problem in Eq. (12) with high probability. (b) If $n \gtrsim \frac{dR^2}{m^2(\beta-1)}$, v is the principal eigenvector of any solution of the problem in Eq. (12) with high probability.

The proof technique is very similar to that of Proposition 2.
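The O(d) procedure for l = 1 mentioned above can be sketched as follows (our illustrative implementation, not the authors' code): for each dimension, one-dimensional 2-means is solved exactly by scanning split points of the sorted values, and dimensions are then ranked by their normalized within-cluster distortion:

```python
import numpy as np

def best_1d_2means(x):
    """Exact 1-D 2-means: minimal within-cluster sum of squares over threshold splits."""
    s = np.sort(x)
    n = len(s)
    csum, csq = np.cumsum(s), np.cumsum(s ** 2)
    best = np.inf
    for k in range(1, n):  # left cluster = s[:k], right cluster = s[k:]
        left = csq[k - 1] - csum[k - 1] ** 2 / k
        right = (csq[-1] - csq[k - 1]) - (csum[-1] - csum[k - 1]) ** 2 / (n - k)
        best = min(best, left + right)
    return best

def clustered_dimension(X):
    """Rank dimensions by within-cluster distortion normalized by total variance."""
    scores = [best_1d_2means(X[:, j]) / (len(X) * np.var(X[:, j]))
              for j in range(X.shape[1])]
    return int(np.argmin(scores))

rng = np.random.default_rng(0)
n, d = 300, 10
y = np.where(rng.random(n) < 0.5, 1.0, -1.0)
X = rng.standard_normal((n, d))
X[:, 0] = y + 0.1 * rng.standard_normal(n)  # dimension 0 carries the two clusters
print(clustered_dimension(X))  # 0
```

This works because the optimal one-dimensional 2-means partition is always an interval split of the sorted values, so each dimension costs O(n) after sorting.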
With the function $g(V) = \frac2n\sum_{i=1}^n\sqrt{(XVX^\top)_{ii}} - \lambda\|V\|_1 - \frac1n\operatorname{tr} X^\top XV$, we can certify that g decreases around the solution V by analyzing the eigenvalues of its Hessian. The rank-one solution V is recovered by the principal eigenvector of the solution of the relaxation in Eq. (18) as long as n = O(d). Thus we have a much better scaling than in the non-sparse setting, where n = O(d²). We also conjecture a scaling of order n = O(ld) for a projection on an l-sparse direction (see Figure 1b for empirical results). The proposition does not prescribe any particular value for the regularization parameter $\lambda$; this makes sense since the proposition only holds in the simple situation l = 1. We propose to use $\lambda = 1/\sqrt{n}$ by analogy with the Lasso.

5.3 Analysis for the multi-label extension

In this section, the signals share k labels which are corrupted by some extra noisy dimensions. We assume the centered design matrix to be $X = [\Pi_n y, Z]$, where $y \in \{-1,+1\}^{n\times k}$ and $Z \in \mathbb{R}^{n\times(d-k)}$. We also assume that y is full-rank.1 We write $y = [y_1,\dots,y_k]$ and $\alpha_i = \big(\frac{y_i^\top 1_n}{n}\big)^2$ for $i = 1,\dots,k$. We consider the discrete constrained problem

$\min_{v\in\mathbb{R}^{d\times k},\ y\in\{-1,1\}^{n\times k}}\ \frac1n\|\Pi_n y - Xv\|_F^2\quad\text{such that}\quad\Big(\frac{y_i^\top 1_n}{n}\Big)^2 = \alpha_i,\ i=1,\dots,k,\quad(21)$

and the discrete penalized problem for $\nu = 0$,

$\min_{v\in\mathbb{R}^{d\times k},\ y\in\{-1,1\}^{n\times k}}\ \frac1n\|\Pi_n y - Xv\|_F^2.\quad(22)$

As explained in Section 4, these two discrete problems admit the same relaxations in Eq. (7) and Eq. (8) that we studied for one label. We now investigate when the solutions of the problems in Eq. (21) and Eq. (22) generate solutions of the relaxations in Eq. (7) and Eq. (8). By analogy with Lemma 3, we want to characterize the solutions of these relaxations which are supported by the constant vector $1_n$ and the labels $(y_1,\dots,y_k)$. Their general form is $Y = \tilde y A \tilde y^\top$, where $A \in \mathbb{R}^{(k+1)\times(k+1)}$ is symmetric positive semidefinite and $\tilde y = [1_n, y]$. However, the initial y is easily recovered from the solution Y only when A is diagonal.
To that end, the following lemma derives a condition under which the only matrices A such that the corresponding Y satisfies the constraints of the relaxations in Eq. (7) and Eq. (8) are diagonal.

Lemma 7 The solutions of the matrix equation $\operatorname{diag}(\tilde y A \tilde y^\top) = 1_n$ with unknown variable A are diagonal if and only if the family $\{1_n, (y_i)_{1\leq i\leq k}, (y_i \circ y_j)_{1\leq i<j\leq k}\}$ is linearly independent (where $\circ$ denotes the elementwise product).

For $\nu \neq 0$, the equivalence with max-cut disappears. The plots in Figure 3 show the behavior of FISTA for two different stopping criteria: $\varepsilon = 10^{-2}/\log(d)$ and $\varepsilon = 10^{-3}/\log(d)$. The choice $10^{-3}/\log(d)$ gives a more accurate solution, at the cost of more iterations (and hence a higher runtime). For sparse problems in Figure 3b, we see that cvx achieves better clustering performance (while crashing for large n); the difference would be reduced with a smaller duality gap for FISTA.

Clustering performance. Experiments comparing the proposed method (Eq. (24) with $c = \lambda 1_d$ and $a = 0_d$, solved using the FISTA-based optimization algorithm, and Eq. (18) solved using the benchmark cvx solver) with K-means and alternating optimization are given in Figure 4. K-means is run on the whitened variables in $\mathbb{R}^d$. Alternating optimization is another popular method, proposed by Ye et al. (2008), for dimensionality reduction with clustering (alternating optimization of w and y is performed to solve the non-convex formulation in Eq. (2)). The plots show that both K-means and alternating optimization fail when only a few noisy dimensions are present.
The plots also show that, with the introduction of a sparse regularizer (corresponding to a non-zero $\lambda$), the proposed method becomes more robust to noisy dimensions. As observed earlier, the performance of FISTA is also sensitive to the choice of $\varepsilon$.

Figure 3: Scalability experiments, reporting average clustering error and average CPU time (in seconds) for FISTA with $\varepsilon \in \{0.01/\log(d), 0.001/\log(d)\}$. (a) cvx and max-cut comparison with $\lambda = 0$; top: n varied with d = 50, k = 6 (cvx crashed for $n \geq 1000$); bottom: d varied with n = 100, k = 2. (b) cvx comparison with $\lambda = 0.001$; top: n varied with d = 50, k = 6 (cvx crashed for $n \geq 1000$); bottom: d varied with n = 100, k = 2.

Finally, we give a comparison of sparse discriminative clustering (cvx and FISTA) with max-margin clustering (Li et al., 2009) in Figure 5. We note that the square loss is used in our framework, whereas the hinge loss is used in max-margin clustering. We have also included the behavior of the K-means and alternating optimization methods in Figure 5 for completeness. From this plot, it is clear that max-margin clustering is sensitive to the noisy dimensions present in the data.
Sparse discriminative clustering with the square loss is able to maintain zero cluster error for a large number of noisy dimensions, while the performance of max-margin clustering starts deteriorating after only a few noisy dimensions are added. However, we note from Figure 5 that for large dimensions, the hinge loss used in max-margin clustering is observed to provide a better solution than the square loss used in our framework.

Figure 4: Comparison with k-means and alternating optimization, n = 100. The panels show average clustering error against the dimension d for FISTA with $\varepsilon \in \{0.01/\log(d), 0.001/\log(d)\}$; (b) $\lambda = 0.01$; (c) $\lambda = 0.001$.

Figure 5: Comparison with k-means, alternating optimization and max-margin clustering (mmc), n = 100. The plots for FISTA, cvx and mmc correspond to the best choice of regularization parameters.

7.2 Experiments on real-world data

Experiments on two-class data. Experiments were conducted on real two-class classification datasets2 to compare the performance of sparse discriminative clustering against non-sparse discriminative clustering, alternating optimization, K-means and max-margin clustering algorithms. For sparse and non-sparse discriminative clustering, we consider the problem in Eq. (24) and the algorithm detailed in Section 6.3 (with the regularization c = 0 for the non-sparse case). The alternating optimization method is described in Proposition 1.

2. The data sets were obtained from https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/
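As a point of reference, the alternating optimization baseline can be sketched as follows. This is an illustrative simplification, not the authors' implementation: it alternates the least-squares direction $w = (X^\top X)^{-1}X^\top y$ with a one-dimensional rounding of the projection Xw (here a median split, in place of the one-dimensional K-means step), and may converge to different fixed points depending on the random initialization:

```python
import numpy as np

def alternating_clustering(X, iters=20, seed=0):
    """Alternate w <- argmin_w ||y - Xw||^2 and y <- rounding of the projection Xw."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    X = X - X.mean(axis=0)                     # center the data
    y = np.where(rng.random(n) < 0.5, 1.0, -1.0)
    for _ in range(iters):
        w = np.linalg.lstsq(X, y, rcond=None)[0]
        p = X @ w
        y_new = np.where(p >= np.median(p), 1.0, -1.0)  # simple balanced rounding
        if np.array_equal(y_new, y):           # fixed point reached
            break
        y = y_new
    return y, w

rng = np.random.default_rng(1)
n = 200
y_true = np.concatenate([np.ones(n // 2), -np.ones(n // 2)])
X = np.column_stack([y_true + 0.1 * rng.standard_normal(n),
                     rng.standard_normal((n, 3))])
y_hat, w = alternating_clustering(X)
err = 1 - (y_hat @ y_true / n) ** 2
```

Being non-convex, this procedure has no recovery guarantee, which is consistent with its behavior in the comparisons reported below.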
For the two-class datasets, the clustering performance for a cluster assignment $\hat y \in \{+1,-1\}^n$ obtained from an algorithm under comparison was computed as $1 - (\hat y^\top y/n)^2$, where y is the original labeling. Here we explicitly compare the output of clustering with the original labels of the data points. The dataset details and clustering performance results are summarized in Table 1. The experiments for discriminative clustering were conducted for values of $a, c \in \{10^{-3}, 10^{-2}, 10^{-1}\}\cdot 1_d$ associated with the $\ell_2$-regularizer and the $\ell_1$-regularizer respectively. The range of the cluster imbalance parameter was chosen to be $\nu \in \{0.01, 0.25, 0.5, 0.75, 1\}$. Note that for $\nu = 1$, the reformulation given in Eq. (14) was used, as explained after Eq. (17) in Section 3. The results given in Table 1 pertain to the best choices of these parameters. Similarly, the values of the regularization parameter for max-margin clustering (Li et al., 2009) were chosen from the set $\{10^{-5}, 10^{-4}, 10^{-3}, 10^{-2}, 0.1, 1, 10\}$ and the cluster balance parameter was chosen from $\{0.1, 0.2, \dots, 0.9\}$. The results for alternating optimization and K-means show the average cluster error (and standard deviation) over 10 different runs. These results show that the cluster error is quite high for many datasets. This is primarily due to the absence of an ambient low-dimensional clustering of the two-class data that could be identified by the simple linear model presented in this paper. Since K-means does not provide explicit dimensionality reduction, it cannot take advantage of an ambient low-dimensional clustering of the two-class data even when one exists, and its performance is poor. The results show that max-margin clustering achieves the best clustering performance on most datasets. This improved performance of max-margin clustering may be due to the use of the hinge loss, as opposed to the square loss used in discriminative clustering in this paper.
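For concreteness, this metric can be computed as follows (an illustrative sketch of the formula above, not the authors' evaluation code); note that $1-(\hat y^\top y/n)^2$ is invariant to a global swap of the two cluster labels:

```python
import numpy as np

def cluster_error(y_hat, y):
    """Two-class clustering error 1 - (y_hat^T y / n)^2, for labels in {-1, +1}^n."""
    n = len(y)
    return 1.0 - (np.dot(y_hat, y) / n) ** 2

y = np.array([1, 1, -1, -1, 1, -1])
print(cluster_error(y, y))    # 0.0  (perfect clustering)
print(cluster_error(-y, y))   # 0.0  (same clustering, labels swapped)
print(cluster_error(np.array([1, -1, 1, -1, 1, -1]), y))  # close to 1 for weakly correlated labels
```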
However, for the heart dataset, we note that the sparse version with the square loss performs significantly better than the non-sparse version with the hinge loss (see additional experiments in Figure 5). The results also show that adding sparse regularizers to discriminative clustering helps in better cluster identification when compared to the non-sparse case.

Table 1: Experiments on two-class datasets. Cluster error is reported; for alternating optimization and K-means, mean and standard deviation over 10 runs are given.

Dataset          n     d    Sparse disc.  Non-sparse disc.  Alt. opt.     K-means       Max-margin
                            clustering    clustering        optimization                clustering
Heart            270   3    0.52          0.61              0.97 +- 0.03  0.91 +- 0.09  0.93
Diabetes         768   8    0.88          0.88              0.91 +- 0.05  0.93 +- 0.06  0.88
Breast-cancer    683   10   0.15          0.15              0.48 +- 0.17  0.68 +- 0.24  0.15
Australian       690   14   0.5           0.5               0.88 +- 0.17  0.87 +- 0.21  0.5
Liver-disorder   345   6    0.97          0.97              0.99 +- 0.01  0.99 +- 0.01  0.73
Sonar            208   60   0.92          0.95              0.98 +- 0.02  0.99 +- 0.01  0.92
DNA(1 vs 2,3)    1400  180  0.75          0.83              0.99 +- 0.01  0.98 +- 0.02  0.71
a1a              1605  113  0.74          0.75              0.98 +- 0.02  0.8 +- 0.08   0.69
w1a              2270  290  0.11          0.11              0.92 +- 0.08  0.16 +- 0.06  0.11

Experiments on real multi-label data. Experiments were also conducted on the Microsoft COCO dataset3 to demonstrate the effectiveness of the proposed method in discovering multiple labels. We considered n = 2000 images from the dataset, each of which was labeled with a subset of K = 80 labels. The labels identify objects in the images such as person, car, chair, table, etc., and the corresponding features for each image were extracted from the last layer of a convolutional neural network (CNN). The CNN was originally trained on the ImageNet data (Krizhevsky et al., 2012). For each image in the dataset, we obtained d = 1000 features. We then performed discriminative clustering on the 2000 x 1000 data matrix X and obtained the label matrix Y, which was then subjected to the alternating optimization procedure (see Section 5.3).
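The quantity $\operatorname{tr}(\Phi_{Y_{true}}\Phi_{Y_k})$ used below to choose the number of predicted labels k, with $\Phi_Y = Y(Y^\top Y)^{-1}Y^\top$ the orthogonal projection onto the column space of Y, can be sketched as follows (an illustration with synthetic data, not the authors' code):

```python
import numpy as np

def proj(Y):
    """Orthogonal projection onto the column space of Y (pinv handles rank deficiency)."""
    return Y @ np.linalg.pinv(Y.T @ Y) @ Y.T

def overlap(Y_true, Y_k):
    """tr(Phi_{Y_true} Phi_{Y_k}): overlap between the two label subspaces."""
    return np.trace(proj(Y_true) @ proj(Y_k))

rng = np.random.default_rng(0)
n = 50
Y_true = rng.standard_normal((n, 5))
print(round(overlap(Y_true, Y_true), 6))             # 5.0: identical 5-dimensional subspaces
Y_other = rng.standard_normal((n, 3))
print(0 <= overlap(Y_true, Y_other) <= 3 + 1e-9)     # True: overlap is at most min of the ranks
```

The score equals the common subspace dimension when the column spaces coincide, which is why it saturates as k grows and provides a natural elbow for choosing k.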
It is clearly unlikely that the labels are recovered perfectly; we therefore describe a way of measuring the amount of information which is recovered. In order to extract meaningful cluster information from the result so obtained, we computed the correlation matrix $Y_k^\top \Pi_n Y_{true}$, where $Y_{true}$ is the $n \times K$ label matrix containing the actual labels and $\Pi_n$ is the $n \times n$ centering matrix $I_n - \frac1n 1_n1_n^\top$. The k predicted labels for the examples are given in the matrix $Y_k$ of size $n \times k$. In order to choose an appropriate value of k, we plotted $\operatorname{tr}(\Phi_{Y_{true}}\Phi_{Y_k})$ (shown in Figure 6 along with a K-means baseline), where $\Phi_{Y_k} = Y_k(Y_k^\top Y_k)^{-1}Y_k^\top$. From these plots, we chose k = 30 as a suitable value for our interpretation purposes. Having chosen k = 30, we plotted the correlations between the actual and predicted labels. The heat map of the normalized absolute correlations is given in Figure 7, where the columns and rows, corresponding to the 80 true labels and the 30 predicted labels respectively, are ordered according to the sum of squared correlations (the top-scoring labels appear at the bottom left). From this plot, we extract the following highly correlated labels: person, dining table, car, chair, cup, tennis racket, bowl, truck, fork, pizza, showing that these labels were partially recovered by our unsupervised technique (note that the CNN features are learned with supervision on the different dataset ImageNet, hence there is still some partial supervision).

3. Dataset obtained from http://mscoco.org/dataset

Figure 6: Plot of $\operatorname{tr}(\Phi_{Y_{true}}\Phi_{Y_k})$ as a function of k, for the alternating minimization procedure and a K-means baseline.

Figure 7: Heat map of correlations $Y_k^\top \Pi_n Y_{true}$ with k = 30, with columns and rows ordered according to the sum of squared correlations.

8.
Conclusion

In this paper, we provided a sparse extension of the discriminative clustering framework, and gave a first analysis of its theoretical performance in the totally unsupervised situation, highlighting provable scalings between the ambient dimension d, the number of observations n, and the clusterability of irrelevant variables. We also proposed an efficient algorithm which is the first of its kind to be linear in the number of observations for discriminative clustering with the square loss. Our work could be extended in a number of ways, e.g., extending the sparse analysis to the l-sparse case with larger l, extending the framework to nonlinear clustering using kernels, considering related weakly supervised learning extensions (Joulin and Bach, 2012), going beyond uniqueness of rank-one solutions, and improving the complexity of our algorithm to O(nd), for example using stochastic gradient techniques.

Acknowledgments

The authors would like to thank Nicolas Boumal for interesting discussions and for his advice on the max-cut implementation, Piotr Bojanowski for his help with multi-label datasets, and the reviewers for their constructive and helpful comments which helped to significantly improve the paper.

References

Jean-Baptiste Alayrac, Piotr Bojanowski, Nishant Agrawal, Ivan Laptev, Josef Sivic, and Simon Lacoste-Julien. Unsupervised learning from narrated instruction videos. In Computer Vision and Pattern Recognition (CVPR), 2016.

D. Arthur and S. Vassilvitskii. k-means++: The advantages of careful seeding. In Proceedings of the eighteenth annual ACM-SIAM symposium on Discrete algorithms, 2007.

F. Bach and Z. Harchaoui. DIFFRAC: a discriminative and flexible framework for clustering. In Advances in Neural Information Processing Systems (NIPS), 2007.

F. Bach, R. Jenatton, J. Mairal, and G. Obozinski. Optimization with sparsity-inducing penalties.
Foundations and Trends in Machine Learning, 4(1):1-106, 2012.

A. Beck and M. Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM J. Imaging Sci., 2(1):183-202, 2009.

R. Bellman. A note on cluster analysis and dynamic programming. Mathematical Biosciences, 1973.

G. Blanchard, M. Kawanabe, M. Sugiyama, V. Spokoiny, and K.-R. Müller. In search of non-Gaussian components of a high-dimensional distribution. The Journal of Machine Learning Research, 7:247-282, 2006.

Piotr Bojanowski, Francis Bach, Ivan Laptev, Jean Ponce, Cordelia Schmid, and Josef Sivic. Finding actors and actions in movies. In Proc. IEEE International Conference on Computer Vision, 2013.

J. M. Borwein and A. S. Lewis. Convex Analysis and Nonlinear Optimization, volume 3 of CMS Books in Mathematics/Ouvrages de Mathématiques de la SMC. Springer-Verlag, 2000. Theory and examples.

N. Boumal, B. Mishra, P.-A. Absil, and R. Sepulchre. Manopt, a Matlab toolbox for optimization on manifolds. Journal of Machine Learning Research, 2014.

J. Bourgain, V. H. Vu, and P. M. Wood. On the singularity probability of discrete random matrices. Journal of Functional Analysis, 258(2):559-603, 2010.

S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, Cambridge, 2004.

T. De Bie and N. Cristianini. Convex methods for transduction. In Advances in Neural Information Processing Systems (NIPS), pages 73-80, 2003.

F. De la Torre and T. Kanade. Discriminative cluster analysis. In Proceedings of the conference on machine learning (ICML), 2006.

E. Diederichs, A. Juditsky, A. Nemirovski, and V. Spokoiny. Sparse non-Gaussian component analysis by semidefinite programming. Machine Learning, 91(2):211-238, 2013.

C. Ding and T. Li. Adaptive dimension reduction using discriminant analysis and K-means clustering. In Proceedings of the conference on machine learning (ICML), 2007.

D. Freedman.
Statistical Models: Theory and Practice. Cambridge University Press, 2009.

J. H. Friedman and W. Stuetzle. Projection pursuit regression. Journal of the American Statistical Association, 76(376):817-823, 1981.

A. Frieze and M. Jerrum. Improved approximation algorithms for MAX k-CUT and MAX BISECTION. In Integer Programming and Combinatorial Optimization. Springer, 1995.

M. R. Garey, D. S. Johnson, and L. Stockmeyer. Some simplified NP-complete graph problems. Theoret. Comput. Sci., 1(3):237-267, 1976.

M. X. Goemans and D. P. Williamson. Improved approximation algorithms for maximum cut and satisfiability problems using semidefinite programming. J. ACM, 42(6):1115-1145, November 1995.

J. C. Gower and G. J. S. Ross. Minimum spanning trees and single linkage cluster analysis. Journal of the Royal Statistical Society, Series C (Applied Statistics), 18(1), 1969.

M. Grant and S. Boyd. Graph implementations for nonsmooth convex programs. In V. Blondel, S. Boyd, and H. Kimura, editors, Recent Advances in Learning and Control, Lecture Notes in Control and Information Sciences, pages 95-110. Springer-Verlag Limited, 2008.

M. Grant and S. Boyd. CVX: Matlab software for disciplined convex programming, version 2.1, March 2014.

Edouard Grave. A convex relaxation for weakly supervised relation extraction. In EMNLP, 2014.

G. Huang, J. Zhang, S. Song, and Z. Chen. Maximin separation probability clustering. In Proceedings of the AAAI Conference on Artificial Intelligence, 2015.

A. Hyvärinen, J. Karhunen, and E. Oja. Independent Component Analysis, volume 46. John Wiley & Sons, 2004.

A. Joulin and F. Bach. A convex relaxation for weakly supervised classifiers. In Proceedings of the conference on machine learning (ICML), 2012.

A. Joulin, F. Bach, and J. Ponce. Discriminative clustering for image co-segmentation. In Proc. CVPR, 2010a.

A. Joulin, J. Ponce, and F. Bach. Efficient optimization for discriminative latent class models.
In Advances in Neural Information Processing Systems (NIPS), 2010b.

M. Journée, F. Bach, P.-A. Absil, and R. Sepulchre. Low-rank optimization on the cone of positive semidefinite matrices. SIAM Journal on Optimization, 2010.

A. T. Kalai, A. Moitra, and G. Valiant. Efficiently learning mixtures of two Gaussians. In International Symposium on Theory of Computing (STOC), pages 553-562, 2010.

R. M. Karp. Reducibility among combinatorial problems. In Complexity of Computer Computations, pages 85-103. Plenum, New York, 1972.

A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems (NIPS), 2012.

Rémi Lajugie, Piotr Bojanowski, Philippe Cuvillier, Sylvain Arlot, and Francis Bach. A weakly-supervised discriminative model for audio-to-score alignment. In Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing, 2016.

N. Le Roux and F. Bach. Local component analysis. In Proceedings of the International Conference on Learning Representations, 2013.

Y.-F. Li, I. W. Tsang, J. T.-Y. Kwok, and Z.-H. Zhou. Tighter and convex maximum margin clustering. In AISTATS, pages 344-351, 2009.

Z. Q. Luo, W. K. Ma, A. C. So, Y. Ye, and S. Zhang. Semidefinite relaxation of quadratic optimization problems. Signal Processing Magazine, IEEE, 2010.

J. B. MacQueen. Some methods for classification and analysis of multivariate observations. In Proc. of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, volume 1, pages 281-297. University of California Press, 1967.

A. Moitra and G. Valiant. Settling the polynomial learnability of mixtures of Gaussians. In Annual Symposium on Foundations of Computer Science (FOCS), pages 93-102, 2010.

Y. Nesterov. Smoothing technique and its applications in semidefinite optimization. Math. Program., 2007.

A. Y. Ng, M. I. Jordan, and Y. Weiss.
On spectral clustering: analysis and an algorithm. In T. G. Dietterich, S. Becker, and Z. Ghahramani, editors, Advances in Neural Information Processing Systems (NIPS), 2002.

G. Niu, B. Dai, L. Shang, and M. Sugiyama. Maximum volume clustering: a new discriminative clustering approach. Journal of Machine Learning Research, 14(1):2641-2687, 2013.

R. Ostrovsky, Y. Rabani, L. J. Schulman, and C. Swamy. The effectiveness of Lloyd-type methods for the k-means problem. In Annual Symposium on Foundations of Computer Science (FOCS), pages 165-176, 2006.

P. H. Schönemann. A generalized solution of the orthogonal Procrustes problem. Psychometrika, 31(1), 1966.

R. Tibshirani. Regression shrinkage and selection via the Lasso. J. Roy. Statist. Soc. Ser. B, 1996.

J. Tropp. User-friendly tail bounds for sums of random matrices. Found. Comput. Math., 2012.

J. von Neumann. Thermodynamik quantenmechanischer Gesamtheiten. Gött. Nach., (1):273-291, 1927.

F. Wang, B. Zhao, and C. Zhang. Linear time maximum margin clustering. IEEE Transactions on Neural Networks, 2010.

H. Wang, F. Nie, and H. Huang. Multi-view clustering and feature learning via structured sparsity. In Proceedings of the conference on machine learning (ICML), volume 28, 2013.

Z. Wen, D. Goldfarb, and K. Scheinberg. Block coordinate descent methods for semidefinite programming. In Handbook on Semidefinite, Conic and Polynomial Optimization. Springer, 2012.

L. Xu, J. Neufeld, B. Larson, and D. Schuurmans. Maximum margin clustering. In Advances in Neural Information Processing Systems (NIPS), 2004.

J. Ye, Z. Zhao, and M. Wu. Discriminative k-means for clustering. In Advances in Neural Information Processing Systems (NIPS), 2008.

K. Zhang, I. W. Tsang, and J. T. Kwok. Maximum margin clustering made practical. IEEE Transactions on Neural Networks, 2009.

Appendix A.
Joint clustering and dimension reduction

Given y, we need to optimize the Rayleigh quotient $\frac{w^\top X^\top yy^\top Xw}{w^\top X^\top Xw}$, with a rank-one matrix in the numerator, which leads to $w = (X^\top X)^{-1}X^\top y$. Given w, we will show that the averaged distortion measure of K-means, once the means have been optimized, is exactly equal to $(y^\top \Pi_n Xw)^2/\|\Pi_n y\|_2^2$.

Given the data matrix $X \in \mathbb{R}^{n\times d}$, K-means with two components approximates the data points in X by the centroids $c_+ \in \mathbb{R}^d$ and $c_- \in \mathbb{R}^d$; since $y \in \{-1,1\}^n$, the matrix of assigned centroids is

$\tfrac12(y+1_n)c_+^\top + \tfrac12(1_n-y)c_-^\top = \tfrac12 y(c_+-c_-)^\top + \tfrac12 1_n(c_++c_-)^\top.$

The objective of K-means can now be written as the problem (KM):

$\min_{y,c_+,c_-}\ \big\|X - \tfrac12(y+1_n)c_+^\top - \tfrac12(1_n-y)c_-^\top\big\|_F^2 = \min_{y,c_+,c_-}\ \|X\|_F^2 + \tfrac12(n+1_n^\top y)\|c_+\|_2^2 + \tfrac12(n-1_n^\top y)\|c_-\|_2^2 - c_+^\top X^\top(y+1_n) - c_-^\top X^\top(1_n-y).$

Fixing y and minimizing with respect to $c_+$ and $c_-$, we get the closed-form expressions

$c_+ = \frac{X^\top(y+1_n)}{n+1_n^\top y}\qquad\text{and}\qquad c_- = \frac{X^\top(1_n-y)}{n-1_n^\top y}.$

Substituting these expressions in (KM), we obtain the following optimization problem in y:

$\min_y\ \|X\|_F^2 - \frac12\frac{\|X^\top(y+1_n)\|_2^2}{n+1_n^\top y} - \frac12\frac{\|X^\top(1_n-y)\|_2^2}{n-1_n^\top y} = \min_y\ \operatorname{tr} XX^\top - \frac{\operatorname{tr} XX^\top(y+1_n)(y+1_n)^\top}{2(n+1_n^\top y)} - \frac{\operatorname{tr} XX^\top(1_n-y)(1_n-y)^\top}{2(n-1_n^\top y)}.$

Since X is centered, $1_n^\top X = 0$ and hence $\operatorname{tr} XX^\top 1_n1_n^\top = \operatorname{tr} XX^\top 1_ny^\top = \operatorname{tr} XX^\top y1_n^\top = 0$. Therefore, we obtain

$\min_y\ \operatorname{tr} XX^\top\Big(I - \frac{yy^\top}{2(n+1_n^\top y)} - \frac{yy^\top}{2(n-1_n^\top y)}\Big) = \min_y\ \operatorname{tr} XX^\top\Big(I - \frac{n\,yy^\top}{n^2-(1_n^\top y)^2}\Big),$

using $\frac{1}{2(n+1_n^\top y)} + \frac{1}{2(n-1_n^\top y)} = \frac{n}{n^2-(1_n^\top y)^2}$. Applying this to the projected data Xw, we have the equivalent K-means problem

$\min_{y\in\{-1,1\}^n}\ \frac1n\operatorname{tr}\Big[Xww^\top X^\top\Big(I - \frac{n\,yy^\top}{n^2-(y^\top 1_n)^2}\Big)\Big] = \frac1n\|Xw\|_2^2 - \max_{y\in\{-1,1\}^n}\ \frac{(w^\top X^\top y)^2}{n^2-(y^\top 1_n)^2}.$
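The key identity of this derivation, namely that for centered X the K-means objective with optimized means equals $\operatorname{tr}\,XX^\top\big(I - \frac{n\,yy^\top}{n^2-(1_n^\top y)^2}\big)$, can be checked numerically (a sanity-check sketch, not part of the original paper):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 40, 5
X = rng.standard_normal((n, d))
X -= X.mean(axis=0)                 # center the design matrix: 1_n^T X = 0
y = np.ones(n)
y[: n // 3] = -1.0                  # unbalanced labels

# direct within-cluster sum of squares, with the optimal (class-mean) centroids
obj = 0.0
for s in (1.0, -1.0):
    pts = X[y == s]
    obj += np.sum((pts - pts.mean(axis=0)) ** 2)

# closed-form expression from the derivation above
s1 = y.sum()
formula = np.trace(X @ X.T) - n * np.sum((X.T @ y) ** 2) / (n ** 2 - s1 ** 2)
print(np.isclose(obj, formula))  # True
```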
Thus the averaged distortion measure of K-means with the optimized means is $(y^\top \Pi_n Xw)^2/\|\Pi_n y\|_2^2$.

Appendix B. Full (unsuccessful) relaxation

It is tempting to find a direct relaxation of Eq. (2). It turns out to lead to a trivial relaxation, which we outline in this section. When optimizing Eq. (2) with respect to w, we obtain

$\max_{y\in\{-1,1\}^n}\ \frac{y^\top X(X^\top X)^{-1}X^\top y}{y^\top\Pi_n y},$

leading to the quasi-convex relaxation

$\max_{Y\succeq0,\ \operatorname{diag}(Y)=1_n}\ \frac{\operatorname{tr} YX(X^\top X)^{-1}X^\top}{\operatorname{tr}\Pi_n Y}.\quad(27)$

Unfortunately, this relaxation always leads to trivial solutions, as described below. By definition of $\Pi_n$, this relaxation is equal to

$\max_{Y\succeq0,\ \operatorname{diag}(Y)=1_n}\ \frac{\operatorname{tr} YX(X^\top X)^{-1}X^\top}{n - \frac{1_n^\top Y1_n}{n}}.$

Let $\mathcal A = \{Y\succeq0,\ \operatorname{diag}(Y)=1_n\}$ be the feasible set of this problem and define $\mathcal B = \{M\succeq0,\ \operatorname{diag}(M) = \big(1+\frac{1_n^\top M1_n}{n^2}\big)1_n\}$. For $Y\in\mathcal A$, the matrix $M = \frac{n^2}{n^2-1_n^\top Y1_n}\,Y$ belongs to $\mathcal B$, since

$1 + \frac{1_n^\top M1_n}{n^2} = 1 + \frac{1_n^\top Y1_n}{n^2-1_n^\top Y1_n} = \frac{n^2}{n^2-1_n^\top Y1_n},$

which is the common value of the diagonal entries of M. Reciprocally, for $M\in\mathcal B$, we can define $Y = \frac{n^2}{n^2+1_n^\top M1_n}\,M$, so that $\operatorname{diag}(Y)=1_n$ and $Y\in\mathcal A$. Thus the problem in Eq. (27) is equivalent to the relaxation

$\max_{M\succeq0,\ \operatorname{diag}(M)=(1+\frac{1_n^\top M1_n}{n^2})1_n}\ \frac1n\operatorname{tr} MX(X^\top X)^{-1}X^\top.\quad(28)$

The Lagrangian function of this problem can be written as

$L(\mu) = \frac1n\operatorname{tr} MX(X^\top X)^{-1}X^\top - \mu^\top\Big[\operatorname{diag}(M) - 1_n - \frac{1_n^\top M1_n}{n^2}1_n\Big] = \operatorname{tr} M\Big[\frac1nX(X^\top X)^{-1}X^\top - \operatorname{Diag}(\mu) + \frac{1_n^\top\mu}{n^2}1_n1_n^\top\Big] + \mu^\top1_n.$

Using $L(\mu)$ and the psd constraint $M\succeq0$, the dual problem is given by

$\min_\mu\ \mu^\top1_n\quad\text{s.t.}\quad\operatorname{Diag}(\mu) - \frac{1_n^\top\mu}{n^2}1_n1_n^\top \succeq \frac1nX(X^\top X)^{-1}X^\top.$

Since $X(X^\top X)^{-1}X^\top\succeq0$, this implies for the dual variable $\mu$ that $\operatorname{Diag}(\mu) - \frac{1_n^\top\mu}{n^2}1_n1_n^\top\succeq0$; testing this inequality on the vector with components $1/\mu_i$ yields

$\Big(\frac1n\sum_{i=1}^n\frac{1}{\mu_i}\Big)^{-1} \geq \frac1n\sum_{i=1}^n\mu_i.$

However, for $\nu\in\mathbb{R}^n_{++}$, the harmonic mean $\big(\frac1n\sum_{i=1}^n\frac{1}{\nu_i}\big)^{-1}$ is always smaller than the arithmetic mean $\frac1n\sum_{i=1}^n\nu_i$, with equality if and only if $\nu = c1_n$ for some $c\in\mathbb{R}$. Thus the dual variable $\mu$ is constant, and the diagonal constraint in problem Eq. (28) reduces to a trace constraint. Therefore the problem is equivalent to the trivial relaxation below, for which every eigenvector of $X(X^\top X)^{-1}X^\top$ yields a solution:

$\max_{M\succeq0,\ \operatorname{tr}(M)=n+\frac{1_n^\top M1_n}{n}}\ \frac1n\operatorname{tr} MX(X^\top X)^{-1}X^\top.$
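The harmonic-arithmetic mean inequality invoked in this argument is elementary to check numerically (illustration only):

```python
import numpy as np

def harmonic(v):
    """Harmonic mean of a vector with positive entries."""
    return len(v) / np.sum(1.0 / v)

def arithmetic(v):
    return np.mean(v)

rng = np.random.default_rng(0)
v = rng.random(10) + 0.1                       # positive, non-constant entries
print(harmonic(v) <= arithmetic(v))            # True: harmonic mean <= arithmetic mean
c = np.full(10, 0.7)
print(np.isclose(harmonic(c), arithmetic(c)))  # True: equality for a constant vector
```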
Appendix C. Equivalent relaxations

In this section, we give details about two equivalent relaxations.

C.1 First equivalent relaxation

We start from the penalized version of Eq. (5),

$\min_{y\in\{-1,1\}^n,\ v\in\mathbb{R}^d}\ \frac1n\|\Pi_n y - Xv\|_2^2 + \nu\,\frac{(y^\top1_n)^2}{n^2},$

which we expand as

$\min_{y\in\{-1,1\}^n,\ v\in\mathbb{R}^d}\ \frac1n\operatorname{tr}\Pi_n yy^\top - \frac2n\operatorname{tr} Xvy^\top + \frac1n\operatorname{tr} X^\top Xvv^\top + \nu\,\frac{(y^\top1_n)^2}{n^2},$

and relax, using $Y = yy^\top$, $P = yv^\top$ and $V = vv^\top$, as

$\min_{V,P,Y}\ \frac1n\operatorname{tr}\Pi_n Y - \frac2n\operatorname{tr} P^\top X + \frac1n\operatorname{tr} X^\top XV + \nu\,\frac{1_n^\top Y1_n}{n^2}\quad\text{s.t.}\quad\begin{pmatrix}Y&P\\P^\top&V\end{pmatrix}\succeq0,\ \operatorname{diag}(Y)=1_n.\quad(31)$

When optimizing Eq. (31) with respect to V and P, we get exactly Eq. (8). Indeed, we solve this problem by fixing the matrix Y = Y_0 with diag(Y_0) = 1_n. The Lagrangian of the problem in Eq. (31), with a symmetric multiplier A for the constraint Y = Y_0, can then be written as

$L(A) = \frac1n\operatorname{tr}\Pi_n Y - \frac2n\operatorname{tr} P^\top X + \frac1n\operatorname{tr} X^\top XV + \nu\,\frac{1_n^\top Y1_n}{n^2} + \operatorname{tr} A(Y-Y_0) = \operatorname{tr}\begin{pmatrix}Y&P\\P^\top&V\end{pmatrix}\begin{pmatrix}\frac1n\Pi_n + \frac{\nu}{n^2}1_n1_n^\top + A & -\frac1nX\\ -\frac1nX^\top & \frac1nX^\top X\end{pmatrix} - \operatorname{tr} AY_0.$

Using L(A) and the psd constraint $\begin{pmatrix}Y&P\\P^\top&V\end{pmatrix}\succeq0$, the dual problem is given by

$\min_A\ \operatorname{tr} AY_0\quad\text{s.t.}\quad\begin{pmatrix}\frac1n\Pi_n + \frac{\nu}{n^2}1_n1_n^\top + A & -\frac1nX\\ -\frac1nX^\top & \frac1nX^\top X\end{pmatrix}\succeq0.$

From the Schur complement condition, we obtain $\frac1n\Pi_n + \frac{\nu}{n^2}1_n1_n^\top + A \succeq \frac1nX(X^\top X)^{-1}X^\top$. Substituting this bound for A, we get the optimal objective value

$D = \frac1n\operatorname{tr} X(X^\top X)^{-1}X^\top Y_0 - \frac1n\operatorname{tr}\Pi_n Y_0 - \frac{\nu}{n^2}1_n^\top Y_01_n.$

The optimal dual objective value D corresponds to a fixed Y_0; hence, by maximizing with respect to Y, we obtain exactly Eq. (8), and therefore the convex relaxation in Eq. (11) is equivalent to Eq. (8). Moreover, the Karush-Kuhn-Tucker (KKT) conditions give

$-YX + PX^\top X = 0\quad\text{and}\quad -P^\top X + VX^\top X = 0.$

Thus the optimum is attained for $P = YX(X^\top X)^{-1}$ and $V = (X^\top X)^{-1}X^\top YX(X^\top X)^{-1}$.

C.2 Second equivalent relaxation

For $\nu = 1$, we solve the problem in Eq. (31) by fixing the matrix V = V_0. The Lagrangian of this problem, with multipliers $\mu\in\mathbb{R}^n$ and symmetric B, can then be written as (using $\frac1n\Pi_n + \frac{1}{n^2}1_n1_n^\top = \frac1nI_n$)

$\hat L(\mu,B) = \frac1n\operatorname{tr}\Pi_n Y - \frac2n\operatorname{tr} P^\top X + \frac1n\operatorname{tr} X^\top XV + \frac{1_n^\top Y1_n}{n^2} + \mu^\top(\operatorname{diag}(Y)-1_n) + \operatorname{tr} B(V-V_0) = \operatorname{tr}\begin{pmatrix}Y&P\\P^\top&V\end{pmatrix}\begin{pmatrix}\frac1nI_n+\operatorname{Diag}(\mu)&-\frac1nX\\-\frac1nX^\top&\frac1nX^\top X+B\end{pmatrix} - \mu^\top1_n - \operatorname{tr} BV_0.$
Using $\hat L(\mu,B)$ and the psd constraint $\begin{pmatrix}Y&P\\P^\top&V\end{pmatrix}\succeq0$, the dual problem is given by

$\min_{\mu,B}\ \mu^\top1_n + \operatorname{tr} BV_0\quad\text{s.t.}\quad\begin{pmatrix}\frac1nI_n+\operatorname{Diag}(\mu)&-\frac1nX\\-\frac1nX^\top&\frac1nX^\top X+B\end{pmatrix}\succeq0.$

From the Schur complement condition, we obtain $B \succeq \frac{1}{n^2}X^\top\operatorname{Diag}(\mu+1_n/n)^{-1}X - \frac1nX^\top X$. Substituting this bound for B, we get the dual problem

$\min_\mu\ \mu^\top1_n + \frac{1}{n^2}\operatorname{tr} V_0X^\top\operatorname{Diag}(\mu+1_n/n)^{-1}X - \frac1n\operatorname{tr} V_0X^\top X,$

which is separable in the $\mu_i$, each term being $\mu_i + \frac{1}{n^2}\,\frac{x_i^\top V_0x_i}{\mu_i+1/n}$. Solving for $\mu_i$, we get $\mu_i^\star = \frac1n\big(\sqrt{(XV_0X^\top)_{ii}}-1\big)$. Substituting $\mu_i^\star$ into the dual objective function, we get the optimal objective function value

$\hat D = \frac2n\sum_{i=1}^n\sqrt{(XV_0X^\top)_{ii}} - 1 - \frac1n\operatorname{tr} V_0X^\top X.$

Furthermore, the KKT conditions give $Y\operatorname{Diag}(\mu+1_n/n) = \frac1nPX^\top$ and $P^\top\operatorname{Diag}(\mu+1_n/n) = \frac1nVX^\top$. Thus we obtain the following closed-form expressions:

$P = \operatorname{Diag}(\operatorname{diag}(XVX^\top))^{-1/2}XV,\qquad Y = \operatorname{Diag}(\operatorname{diag}(XVX^\top))^{-1/2}XVX^\top\operatorname{Diag}(\operatorname{diag}(XVX^\top))^{-1/2}.$

The optimal dual objective value $\hat D$ corresponds to a fixed V_0. Therefore, maximizing with respect to V leads to the problem

$\min_{V\succeq0}\ -\frac2n\sum_{i=1}^n\sqrt{(XVX^\top)_{ii}} + \frac1n\operatorname{tr}(VX^\top X).\quad(32)$

Appendix D. Auxiliary results for Section 5.1

Here we provide auxiliary results for Section 5.1.

D.1 Auxiliary lemma

The matrix $X(X^\top X)^{-1}X^\top$ has the following properties (see, e.g., Freedman, 2009).

Lemma 10 The matrix $H = X(X^\top X)^{-1}X^\top$ is the orthogonal projection onto the column space of the design matrix X: H is symmetric; H is idempotent, $H^2 = H$; and X is invariant under H, that is, $HX = X$.

D.2 Rank-one solution of the relaxation Eq. (8)

We denote by $(x_i)_{i=1,\dots,n}$ the rows of X.

Lemma 11 The rank-one matrix $Y = yy^\top$ is always a solution of the relaxation in Eq. (8).

Proof We give an elementary proof of this result without using convex optimization tools. Using Proposition 1 and Lemma 10 we have $Hy = y$, thus $\operatorname{tr} HY = \operatorname{tr} Hyy^\top = \operatorname{tr} yy^\top = n$. Moreover, any $M\succeq0$ can be decomposed as $\sum_{i=1}^n\lambda_iu_iu_i^\top$ with $\lambda_i\geq0$ and $(u_i)_{i=1,\dots,n}$ an orthonormal family. Since H is an orthogonal projection, $u_i^\top Hu_i = (Hu_i)^\top Hu_i = \|Hu_i\|^2 \leq \|u_i\|^2 \leq 1$.
Thus $\operatorname{tr} H M = \sum_{i=1}^n \lambda_i \operatorname{tr} H u_i u_i^\top = \sum_{i=1}^n \lambda_i\, u_i^\top H u_i \leq \sum_{i=1}^n \lambda_i = \operatorname{tr} M$. Then any feasible matrix M satisfies $\operatorname{tr} H M \leq n$, since $\operatorname{diag}(M) = 1_n$ implies $\operatorname{tr} M = n$, while $\operatorname{tr} H Y^\star = n$, which concludes the lemma.

D.3 Rank-one solution of the relaxation Eq. (12)

Lemma 12 The rank-one solution $V^\star = v^\star (v^\star)^\top$ is always a solution of the relaxation Eq. (12).

Proof The Karush-Kuhn-Tucker (KKT) optimality conditions for the problem are, for the dual variable A:
$$\frac{1}{n}\sum_{i=1}^n \frac{x_i x_i^\top}{\sqrt{x_i^\top V x_i}} - \frac{1}{n} X^\top X = A \quad \text{and} \quad A V = 0 \ \text{(complementary slackness)}.$$
Since $x_i^\top w^\star = y_i$, we have $\sqrt{x_i^\top V^\star x_i} = |y_i| = 1$, so $V^\star$ and the dual variable $A = 0$ satisfy the KKT conditions, and $V^\star$ is thus a solution of this problem.

D.4 Proof of Proposition 2

In the following lemma, we use a Taylor expansion to lower-bound the decrease of f around $V^\star$.

Lemma 13 Let $d \geq 3$ and $\delta \in [0,1)$. If $\beta \geq 3$ and $m^2 \leq \frac{\beta-3}{2(d+\beta-4)}$, then with probability at least $1 - d\exp\big(-\frac{\delta^2 n m^2}{2R^4 d^2}\big)$, for any symmetric matrix Δ:
$$f(V^\star) - f(V^\star + \Delta) \geq 2(1-\delta)\, m^2\, \|\Delta\|_F^2 + o(\|\Delta\|^2) > 0.$$
Otherwise, with probability at least $1 - d\exp\big(-\frac{\delta^2 n \mu_1}{4R^4 d^2}\big)$, for any symmetric matrix Δ:
$$f(V^\star) - f(V^\star + \Delta) \geq (1-\delta)\, \mu_1\, \|\Delta\|_F^2 + o(\|\Delta\|^2) > 0, \quad \text{with } \mu_1 \geq \frac{m^2(\beta-1)}{1 + (d+\beta-2)m^2}.$$
Moreover, we also have, with probability at least $1 - d\exp\big(-\frac{\delta^2 n \mu_2}{4R^4 d^2}\big)$, for any symmetric matrix $\Delta \perp \Delta_{\min}$:
$$f(V^\star) - f(V^\star + \Delta) \geq (1-\delta)\, \mu_2\, \|\Delta\|_F^2 + o(\|\Delta\|^2) > 0,$$
where $\mu_2 = \min\{2m^2, m^2(\beta-1), 2m\}$ and $\Delta_{\min} = \begin{pmatrix} 1 & 0 \\ 0 & c_{\min} I_{d-1} \end{pmatrix}$ is defined in the proof and satisfies $|c_{\min}| \leq \frac{m}{|(d+\beta-2)m^2 - 1|}$.

This lemma directly implies Proposition 2.

Proof For $\Delta \in \mathcal{S}(d)$ and $\delta \in \mathbb{R}$, we compute the second directional derivative of f at $V^\star$ along Δ; since $x_i^\top V^\star x_i = 1$,
$$\nabla^2 f(V^\star)[\Delta, \Delta] = \lim_{\delta \to 0} \frac{d^2}{d\delta^2} f(V^\star + \delta \Delta) = -\frac{1}{2n}\sum_{i=1}^n (x_i^\top \Delta x_i)^2.$$
Let $T_x$ be the positive semidefinite quadratic form on $\mathcal{S}(d)$ defined, for $\Delta \in \mathcal{S}(d)$, by
$$T_x : \Delta \mapsto (x^\top \Delta x)^2. \tag{33}$$
Then there exists a positive linear operator $\mathcal{T}_x$ from $\mathcal{S}(d)$ to $\mathcal{S}(d)$ such that $T_x(\Delta) = \langle \Delta, \mathcal{T}_x \Delta\rangle$. Therefore, the function f will be strictly concave if for all directions $\Delta \in \mathcal{S}(d)$,
$$\frac{1}{n}\sum_{i=1}^n T_{x_i}(\Delta) > 0. \tag{34}$$
We will bound the empirical expectation in Eq.
(34) by first showing that its expectation stays bounded away from 0. Then we will use a matrix concentration inequality to control the distance between the sum in Eq. (34) and its expectation.

We first derive conditions under which the result holds in expectation, i.e., for the operator $\mathcal{T}$ defined by $\mathcal{T} = \mathbb{E}\mathcal{T}_x$ for x distributed as $(y, z^\top)^\top$. We denote by $m = \mathbb{E}z^2$ the variance of z and by $\beta = \mathbb{E}z^4/m^2$ its kurtosis. We let $\Delta = \begin{pmatrix} a & b^\top \\ b & C \end{pmatrix}$ and then have $x^\top \Delta x = a + 2y\, b^\top z + z^\top C z$. Thus
$$T_x(\Delta) = a^2 + 4 a y\, b^\top z + 2 a\, z^\top C z + 4 (b^\top z)^2 + (z^\top C z)^2 + 4 y\, b^\top z\, (z^\top C z).$$
Therefore we can express the value of the quadratic form $\mathcal{T}$ only as a function of the elements of Δ:
$$\mathcal{T}(\Delta) = (a + m\operatorname{tr} C)^2 + 4m\,\|b\|_2^2 + 2m^2\,\|C - \operatorname{Diag}(\operatorname{diag}(C))\|_F^2 + m^2(\beta-1)\,\|\operatorname{diag}(C)\|^2,$$
where we have used
$$\mathbb{E}(z^\top C z)^2 = \mathbb{E}\sum_{i,j,k,l} z_i z_j z_k z_l\, c_{i,j} c_{k,l} = \mathbb{E}\sum_i z_i^4\, c_{i,i}^2 + \mathbb{E}\sum_{i,\, k \neq i} z_i^2 z_k^2\, c_{i,i} c_{k,k} + 2\,\mathbb{E}\sum_{i,\, j \neq i} z_i^2 z_j^2\, c_{i,j}^2$$
$$= \beta m^2 \sum_i c_{i,i}^2 + m^2 \sum_{i,\, k \neq i} c_{i,i} c_{k,k} + 2 m^2 \sum_{i,\, j \neq i} c_{i,j}^2 = m^2\Big( (\beta-3)\,\|\operatorname{diag}(C)\|^2 + \operatorname{tr}(C)^2 + 2\,\|C\|_F^2 \Big)$$
$$= m^2\Big( (\beta-1)\,\|\operatorname{diag}(C)\|^2 + \operatorname{tr}(C)^2 + 2\,\|C - \operatorname{Diag}(\operatorname{diag}(C))\|_F^2 \Big).$$
Since $\beta \geq 1$, we get $\mathcal{T}(\Delta) \geq (a + m\operatorname{tr} C)^2 + 4m\,\|b\|_2^2 + 2m^2(\|C\|_F^2 - \|\operatorname{diag}(C)\|^2)$. Thus $\mathcal{T}(\Delta) = 0$ if and only if $\beta = 1$, $b = 0_{d-1}$ and $C = \operatorname{Diag}(c)$ with $c^\top 1_{d-1} = -a/m$. The condition $\beta = 1$ means that $\operatorname{var}(z^2) = 0$, thus $z^2$ is constant a.s., i.e., z follows a Rademacher law. However, we would like to bound $\mathcal{T}(\Delta)$ away from zero by some constant, and for that we are looking for the smallest eigenvalue of the operator $\mathbb{E}\mathcal{T}_x$. Unfortunately, we are not able to solve the optimization problem $\min_{\Delta \in \mathcal{S}(d),\, \|\Delta\|_F = 1} \mathcal{T}(\Delta)$ directly, and we have to compute the whole spectrum of this operator to find the smallest eigenvalue, using $\mathbb{E}\mathcal{T}_x \Delta = \frac{1}{2}\nabla\mathcal{T}(\Delta)$. We have
$$\frac{1}{2}\nabla\mathcal{T}(\Delta) = \begin{pmatrix} a + m\operatorname{tr}(C) & 2mb^\top \\ 2mb & (a + m\operatorname{tr}(C))\, m I_{d-1} + 2m^2 C + m^2(\beta-3)\operatorname{Diag}(\operatorname{diag}(C)) \end{pmatrix}.$$
For all $b \in \mathbb{R}^{d-1}$ we have, for $\Delta = \begin{pmatrix} 0 & b^\top \\ b & 0 \end{pmatrix}$, $\frac{1}{2}\nabla\mathcal{T}(\Delta) = 2m\,\Delta$. Thus $2m$ is an eigenvalue of multiplicity $d-1$. For all symmetric $C \in \mathbb{R}^{(d-1)\times(d-1)}$ with $\operatorname{diag}(C) = 0_{d-1}$ we have, for $\Delta = \begin{pmatrix} 0 & 0 \\ 0 & C \end{pmatrix}$, $\frac{1}{2}\nabla\mathcal{T}(\Delta) = 2m^2\,\Delta$.
Thus $2m^2$ is an eigenvalue of multiplicity $\frac{(d-1)(d-2)}{2}$. For all $c \in \mathbb{R}^{d-1}$ with $c^\top 1_{d-1} = 0$ we have, for $\Delta = \begin{pmatrix} 0 & 0 \\ 0 & \operatorname{Diag}(c) \end{pmatrix}$, $\frac{1}{2}\nabla\mathcal{T}(\Delta) = m^2(\beta-1)\,\Delta$. Thus $m^2(\beta-1)$ is an eigenvalue of multiplicity $d-2$. For all $a, c \in \mathbb{R}$ we have, for $\Delta = \begin{pmatrix} a & 0 \\ 0 & c I_{d-1} \end{pmatrix}$,
$$\frac{1}{2}\nabla\mathcal{T}(\Delta) = \begin{pmatrix} a + m(d-1)c & 0 \\ 0 & \big[ma + m^2(d+\beta-2)c\big] I_{d-1} \end{pmatrix}.$$
Thus an eigenvalue of $\begin{pmatrix} 1 & (d-1)m \\ m & (d+\beta-2)m^2 \end{pmatrix}$ with eigenvector $[a, c]^\top$ is an eigenvalue of the operator $\mathbb{E}\mathcal{T}_x$ with corresponding eigenvector $\begin{pmatrix} a & 0 \\ 0 & c I_{d-1} \end{pmatrix}$. This matrix has two simple eigenvalues
$$\mu_\pm = \frac{1 + (d+\beta-2)m^2 \pm \sqrt{\big(1 + (d+\beta-2)m^2\big)^2 - 4m^2(\beta-1)}}{2}.$$
Moreover, when we add up the multiplicities of the eigenvalues found so far, we get $d-1 + \frac{(d-1)(d-2)}{2} + d-2 + 2 = \frac{d(d+1)}{2}$, which is the dimension of $\mathcal{S}(d)$; therefore we have found all the eigenvalues of the linear operator $\mathbb{E}\mathcal{T}_x$. We now prove that the smallest eigenvalue is $\mu_-$ when the dimension d is large enough with respect to $m^2$, and $2m^2$ otherwise.

Lemma 14 Let $\mu_1$ and $\mu_2$ be the two smallest eigenvalues of the operator $\mathbb{E}\mathcal{T}_x$, and let us assume that $d \geq 3$ (the case $d = 2$ is also treated in the proof). If $\beta \geq 3$ and $m^2 \geq \frac{\beta-3}{2(d+\beta-4)}$, then
$$\mu_1 = \mu_- \geq \frac{m^2(\beta-1)}{1 + (d+\beta-2)m^2} \quad \text{and} \quad \mu_2 = \min\{2m^2,\, m^2(\beta-1),\, 2m\}.$$
Moreover, we denote by $\Delta_{\min} = \begin{pmatrix} 1 & 0 \\ 0 & c_{\min} I_{d-1} \end{pmatrix}$ the eigenvector associated with $\mu_-$, for which we have set without loss of generality the first component $a = 1$. Then $|c_{\min}| \leq \frac{m}{|(d+\beta-2)m^2 - 1|}$.

Unfortunately, $\mu_-$ can become small when the dimension increases, as shown by the tight bound $\mu_- \geq \frac{m^2(\beta-1)}{1 + (d+\beta-2)m^2}$. However, the corresponding eigenvector has a particular structure that we will be able to exploit.
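The closed form for $\mu_\pm$ can be checked numerically: the $2\times 2$ matrix above has trace $1 + (d+\beta-2)m^2$ and determinant $(d+\beta-2)m^2 - (d-1)m^2 = m^2(\beta-1)$, which gives exactly the stated roots. A minimal sketch (the values of d, β and m below are arbitrary test values, not taken from the paper):

```python
import numpy as np

d, beta, m = 6, 3.5, 0.8  # arbitrary test values
M = np.array([[1.0, (d - 1) * m],
              [m, (d + beta - 2) * m**2]])

# Trace and determinant of M, hence the two eigenvalues in closed form.
tr = 1 + (d + beta - 2) * m**2
det = m**2 * (beta - 1)
disc = np.sqrt(tr**2 - 4 * det)
mu_plus, mu_minus = (tr + disc) / 2, (tr - disc) / 2

# The numerical spectrum of M matches the closed form.
assert np.allclose(np.sort(np.linalg.eigvals(M)), [mu_minus, mu_plus])
# Lower bound used in Lemma 14: mu_- = det / mu_+ together with mu_+ <= tr.
assert mu_minus >= det / tr - 1e-12
```

The lower bound $\mu_- \geq m^2(\beta-1)/(1 + (d+\beta-2)m^2)$ is simply $\mu_- = \det/\mu_+$ combined with $\mu_+ \leq \operatorname{tr}$.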
Proof First we note that $\mu_+ \geq 1$ (since $\operatorname{tr} - \det = 1 + (d-1)m^2 \geq 1$ for the matrix above), hence $\mu_- = \det/\mu_+ \leq m^2(\beta-1)$. We then compute when $\mu_- \geq 2m^2$:
$$\mu_- \geq 2m^2 \iff 1 + (d+\beta-2)m^2 - \sqrt{\big(1+(d+\beta-2)m^2\big)^2 - 4m^2(\beta-1)} - 4m^2 \geq 0$$
$$\iff 1 + (d+\beta-2)m^2 - 4m^2 \geq \sqrt{\big(1+(d+\beta-2)m^2\big)^2 - 4m^2(\beta-1)}$$
$$\iff \big(1 + (d+\beta-2)m^2 - 4m^2\big)^2 \geq \big(1+(d+\beta-2)m^2\big)^2 - 4m^2(\beta-1) \ \text{ and } \ 1 + (d+\beta-6)m^2 \geq 0$$
$$\iff 16m^4 - 8m^2\big(1+(d+\beta-2)m^2\big) \geq -4m^2(\beta-1) \ \text{ and } \ 1 + (d+\beta-6)m^2 \geq 0$$
$$\iff \underbrace{2(d+\beta-4)m^2 \leq \beta-3}_{R_1} \ \text{ and } \ \underbrace{1 + (d+\beta-6)m^2 \geq 0}_{R_2}.$$
If $\beta \leq 3$, $R_1$ forces $d + \beta < 4$, hence necessarily $d = 2$ and $\beta < 2$; then $R_1$ implies $m^2 \geq \frac{3-\beta}{2(2-\beta)}$ and $R_2$ implies $m^2 \leq \frac{1}{4-\beta}$. Thus we should have $(4-\beta)(3-\beta) \leq 2(2-\beta)$, which is not possible since the polynomial $\beta^2 - 5\beta + 8$ is positive for all β. If $\beta \geq 3$ and $d = 2$, $R_1$ implies $m^2 \leq \frac{\beta-3}{2(\beta-2)} \leq 1$, and $R_2$ (i.e., $m^2 \leq \frac{1}{4-\beta}$) is implied by $R_1$ for $\beta \leq 4$ and is always satisfied otherwise. If $d \geq 3$, $R_1$ implies $\beta \geq 3$, for which $R_2$ is always satisfied; it also implies $m^2 \leq \frac{\beta-3}{2(d+\beta-4)} \leq 1$. Hence, under the assumptions of the lemma, $\mu_- \leq 2m^2$, so that $\mu_1 = \mu_-$ and $\mu_2 = \min\{2m^2, m^2(\beta-1), 2m\}$.

We denote by $\Delta_{\min} = \begin{pmatrix} 1 & 0 \\ 0 & c_{\min} I_{d-1} \end{pmatrix}$ the eigenvector for which we have set, without loss of generality, $a = 1$ and
$$c_{\min} = \frac{1}{2(d-1)m}\Big[\sqrt{\big((d+\beta-2)m^2 - 1\big)^2 + 4(d-1)m^2} - \big((d+\beta-2)m^2 - 1\big)\Big].$$
Consequently $c_{\min} \geq 0$, and by concavity of the square root we have
$$\sqrt{\big((d+\beta-2)m^2 - 1\big)^2 + 4(d-1)m^2} \leq \big|(d+\beta-2)m^2 - 1\big| + \frac{2(d-1)m^2}{|(d+\beta-2)m^2 - 1|}.$$
Therefore $|c_{\min}| \leq \frac{m}{|(d+\beta-2)m^2 - 1|}$.

We now control the behavior of the empirical expectation via its expectation, thanks to concentration theory. By definition, $\mathcal{T}_x$ is a symmetric positive linear operator, as is its projection $\mathcal{T}_x^\perp$ onto the orthogonal complement of $\Delta_{\min}$. We can thus apply the matrix Chernoff inequality from Tropp (2012, Theorem 5.1.1) to these two operators, using $\|\mathcal{T}_x\|_{op} \leq \|x x^\top\|_F^2 = \operatorname{tr}(x x^\top)^2 = \|x\|_2^4 \leq R^4 d^2$. Then:
$$\mathbb{P}\Big(\lambda_{\min}\Big(\sum_{k=1}^n \mathcal{T}_{x_k}\Big) \leq \delta\, n\mu_1\Big) \leq d\, e^{-(1-\delta)^2 n\mu_1/(4R^4 d^2)}, \qquad \mathbb{P}\Big(\lambda_{\min}\Big(\sum_{k=1}^n \mathcal{T}^\perp_{x_k}\Big) \leq \delta\, n\mu_2\Big) \leq d\, e^{-(1-\delta)^2 n\mu_2/(4R^4 d^2)}.$$
For $m = 1$ and $d \geq 3$ we have $\mu_1 = \mu_- \geq \frac{\beta-1}{\beta+d} \geq \min\big\{\frac{\beta-1}{2d}, \frac{1}{3}\big\}$.

D.5 Noise robustness for the one-dimensional balanced problem

We want a condition on ε such that the solution of the relaxation Eq. (8) recovers the right y. We recall the dual problem of the relaxation Eq. (8): $\min_\mu \mu^\top 1_n$ s.t.
$\operatorname{Diag}(\mu) \succcurlyeq X(X^\top X)^{-1} X^\top$. The KKT conditions are: dual feasibility, $\operatorname{Diag}(\mu) \succcurlyeq X(X^\top X)^{-1} X^\top$; primal feasibility, $\operatorname{diag}(Y) = 1_n$ and $Y \succcurlyeq 0$; complementary slackness, $Y\big[\operatorname{Diag}(\mu) - X(X^\top X)^{-1} X^\top\big] = 0$. For $Y = y y^\top$ a rank-one matrix, the last condition implies $\operatorname{Diag}(\mu)\, y = H y$, that is, $\mu_i = (X(X^\top X)^{-1} X^\top y)_i / y_i$.

For $X = y + \varepsilon \in \mathbb{R}^n$, we denote $\hat{y} = y + \varepsilon$; then $X(X^\top X)^{-1} X^\top = \frac{\hat{y}\hat{y}^\top}{\|\hat{y}\|^2}$ and $X(X^\top X)^{-1} X^\top y = \frac{\hat{y}^\top y}{\|\hat{y}\|^2}\,\hat{y}$. Thus $\mu_i = \frac{\hat{y}^\top y}{\|\hat{y}\|^2}\,\frac{\hat{y}_i}{y_i} = \frac{\hat{y}^\top y}{\|\hat{y}\|^2}\,\hat{y}_i y_i$, since $y_i^2 = 1$. Assume that all the $\hat{y}_i y_i$ have the same sign; without loss of generality, $\hat{y}_i y_i > 0$. By definition of µ, $\mu \geq 0$. To show dual feasibility we have to show that $\operatorname{Diag}(\mu) \succcurlyeq H$, which is equivalent to $(\hat{y}^\top y)\operatorname{Diag}(\hat{y}_i y_i) \succcurlyeq \hat{y}\hat{y}^\top$, to $(\hat{y}^\top y)\, I_n \succcurlyeq \operatorname{Diag}(\hat{y}_i y_i)^{-1/2}\,\hat{y}\hat{y}^\top\,\operatorname{Diag}(\hat{y}_i y_i)^{-1/2}$, and to $\sum_i \hat{y}_i y_i \leq \hat{y}^\top y$, which is obviously true. Conversely, if µ is dual feasible then $\operatorname{Diag}(\mu) \succcurlyeq 0$ and all the $\hat{y}_i y_i$ have the same sign. Therefore we have shown that y is a solution of the relaxation Eq. (8) if and only if all the $\hat{y}_i y_i$ have the same sign. If ε and y are independent, this is equivalent to $\|\varepsilon\|_\infty \leq 1$ a.s.

D.6 The rank-one candidates are not solutions of the relaxation Eq. (12)

We now assume that $1_n^\top y \neq 0$, thus $y \neq \Pi_n y$: we do not have the same proportion in the two clusters. $\Pi_n y$ takes two values $\{\pi_{y-}, \pi_{y+}\}$, that is, by definition of $\Pi_n$:
$$\pi_{y+} = 1 - \frac{1_n^\top y}{n} \quad \text{and} \quad \pi_{y-} = -1 - \frac{1_n^\top y}{n}.$$
For $V^\star$ defined as before, we get $x_i^\top V^\star x_i = (\Pi_n y)_i^2$ and, with $I_\pm$ the sets of indices such that $(\Pi_n y)_i = \pi_{y\pm}$, the KKT conditions for $V = V^\star$ can be written as
$$A_n = \frac{1}{2n}\Big[\sum_{i \in I_+} \alpha_+\, x_i x_i^\top + \sum_{i \in I_-} \alpha_-\, x_i x_i^\top\Big] \preccurlyeq 0 \quad \text{and} \quad A_n V^\star = 0, \qquad \text{with } \alpha_+ = \frac{1}{\pi_{y+}} - 1 \ \text{ and } \ \alpha_- = -\frac{1}{\pi_{y-}} - 1.$$
We check that, with $n_\pm = \# I_\pm$ and $w^\star$ such that $X w^\star = \Pi_n y$:
$$2n\, w^{\star\top} A_n w^\star = n_+ \alpha_+ (\pi_{y+})^2 + n_- \alpha_- (\pi_{y-})^2 = n_+\pi_{y+} - n_-\pi_{y-} - \big(n_+(\pi_{y+})^2 + n_-(\pi_{y-})^2\big) = y^\top \Pi_n y - (\Pi_n y)^\top \Pi_n y = y^\top \Pi_n y - y^\top \Pi_n y = 0.$$
Unfortunately $\alpha_+\alpha_- \leq 0$, and $A_n$ is not necessarily negative semidefinite. Even worse, we will show that $\mathbb{E}A := \lim_{n\to\infty} A_n$, which exists by the law of large numbers, is not negative semidefinite, which will conclude the proof.
Assume that the proportions of the two clusters stay constant, with $n_\pm = \rho_\pm n$. Then
$$\mathbb{E}A = \frac{1}{2}\Big[\rho_+\alpha_+ \begin{pmatrix} (\pi_{y+})^2 & 0 \\ 0 & I \end{pmatrix} + \rho_-\alpha_- \begin{pmatrix} (\pi_{y-})^2 & 0 \\ 0 & I \end{pmatrix}\Big],$$
and $\rho_+\alpha_+(\pi_{y+})^2 + \rho_-\alpha_-(\pi_{y-})^2 = 0$ since $w^{\star\top} A_n w^\star = 0$. Then
$$\rho_+\alpha_+ + \rho_-\alpha_- = \frac{\rho_+\pi_{y-} - \rho_-\pi_{y+}}{\pi_{y+}\pi_{y-}} - 1 = \frac{1 + (1_n^\top y/n)^2}{1 - (1_n^\top y/n)^2} - 1 = \frac{2\,(1_n^\top y)^2}{n^2 - (1_n^\top y)^2} > 0.$$
Thus $\mathbb{E}A$ has a strictly positive eigenvalue, proportional to $\frac{2(1_n^\top y)^2}{n^2 - (1_n^\top y)^2}$; hence $\mathbb{E}A$ is not negative semidefinite and $V^\star$ is not a solution of the relaxation Eq. (12).

Appendix E. Auxiliary results for the sparse extension

In this section, we provide some results for the sparse extension.

E.1 There is a rank-one solution of the relaxation Eq. (18)

Lemma 15 The rank-one solution $V^\star = v^\star (v^\star)^\top$ is a solution of the relaxation Eq. (18) if the design matrix X is such that $\frac{1}{n} X^\top X$ has all its diagonal entries less than one.

Proof The KKT conditions for problem Eq. (18) are
$$\frac{1}{n}\sum_{i=1}^n \frac{x_i x_i^\top}{\sqrt{x_i^\top V x_i}} - \lambda U - \frac{1}{n} X^\top X = A \preccurlyeq 0 \quad \text{and} \quad A V = 0,$$
with U such that $U_{ij} = \operatorname{sign}(V_{ij})$ if $V_{ij} \neq 0$ and $U_{ij} \in [-1,1]$ otherwise. For $V = V^\star$, for which $\sqrt{x_i^\top V^\star x_i} = \frac{1}{1+\lambda}$, this gives
$$A = (1+\lambda)\frac{X^\top X}{n} - \lambda U - \frac{X^\top X}{n} = \lambda\Big[\frac{X^\top X}{n} - U\Big],$$
with $U_{1,1} = 1$ and $U_{i,j} \in [-1,1]$ otherwise. We check that $A V^\star = 0$. If the design matrix X is such that $\frac{1}{n} X^\top X$ has all its diagonal entries less than one, we can choose a subgradient U such that the dual variable $A = 0$, and thus $V^\star$ is a solution. Otherwise, by the properties of positive semidefinite matrices, there is a diagonal entry of $\frac{1}{n} X^\top X$ which is bigger than 1; this prevents A from being negative semidefinite, since the corresponding diagonal entry of $\frac{X^\top X}{n} - U$ will be positive. This shows that $V^\star$ does not solve the problem.

E.2 Proof of Proposition 6

Lemma 16 For $\delta \in [0,1)$, with probability at least $1 - 5d^2\exp\big(-\frac{\delta^2 n(\beta-1)}{2dR^4(1/m^2 + \beta + d)}\big)$, for any direction Δ such that $V^\star + \Delta \succcurlyeq 0$, we have:
$$g(V^\star) - g(V^\star + \Delta) \geq (1-\delta)\Big[\lambda\,\|\Delta - \operatorname{Diag}(\operatorname{diag}(\Delta))\|_1 + \frac{\beta-1}{\beta + d + 1/m^2}\,\frac{(1+\lambda)^3}{4}\,\|\operatorname{diag}(\Delta)\|_2^2\Big] + o(\|\Delta\|^2) > 0.$$
Moreover, we also have, with probability at least $1 - 5d^2\exp\big(-\frac{\delta^2 n m^2(\beta-1)}{2dR^4}\big)$, for any symmetric matrix Δ such that $V^\star + \Delta \succcurlyeq 0$ and $\operatorname{diag}(\Delta) \perp e_{\min}$:
$$g(V^\star) - g(V^\star + \Delta) \geq (1-\delta)\Big[\lambda\,\|\Delta - \operatorname{Diag}(\operatorname{diag}(\Delta))\|_1 + m^2(\beta-1)\,\frac{(1+\lambda)^3}{4}\,\|\operatorname{diag}(\Delta)\|_2^2\Big] + o(\|\Delta\|^2) > 0,$$
where $e_{\min} = [1, c_{\min} 1_{d-1}^\top]^\top$ is defined in the proof and $c_{\min}$ satisfies $|c_{\min}| \leq \frac{m}{|(d+\beta-2)m^2 - 1|}$.

E.2.1 Proof outline

We will investigate under which conditions on X the solution is unique, first for a deterministic design matrix. We make the following deterministic assumptions on X, for $\delta, \zeta \geq 0$ and a subspace $S \subseteq \mathbb{R}^d$:

(A1) $\operatorname{diag}\big(\frac{Z^\top Z}{n}\big) \leq 1$ entrywise;

(A2) $\big\|\frac{1}{n} Z^\top y\big\|_\infty \leq \delta$;

(A3) $\big\|\frac{Z^\top Z}{n} - \operatorname{Diag}\big(\operatorname{diag}\big(\frac{Z^\top Z}{n}\big)\big)\big\|_\infty \leq \delta$;

(A4) $\lambda^S_{\min}\big(\frac{(X^{\circ 2})^\top X^{\circ 2}}{n}\big) \geq \zeta$;

where we denoted by ∘ the Hadamard (i.e., pointwise) product between matrices (so that $X^{\circ 2}$ is the entrywise square of X) and by $\lambda^S_{\min}$ the minimum eigenvalue of a linear operator restricted to the subspace S. Then, with $g(V) = \frac{2}{n}\sum_{i=1}^n \sqrt{x_i^\top V x_i} - \lambda\|V\|_1 - \frac{1}{n}\operatorname{tr} X^\top X V$, we can certify that g will decrease around the solution $V^\star$.

Lemma 17 Let us assume that the noise matrix satisfies assumptions (A1,A2,A3,A4). Then, for any direction Δ such that $V^\star + \Delta \succcurlyeq 0$ and $\operatorname{diag}(\Delta) \in S$, we have:
$$g(V^\star) - g(V^\star + \Delta) \geq \lambda(1-\delta)\,\|\Delta - \operatorname{Diag}(\operatorname{diag}(\Delta))\|_1 + \zeta\,\frac{(1+\lambda)^3}{4}\,\|\operatorname{diag}(\Delta)\|_2^2 + o(\|\Delta\|^2) > 0.$$

Let us now assume that $(z_i)_{i=1,\dots,d}$ are i.i.d. with the law of z, symmetric with $\mathbb{E}z = \mathbb{E}z^3 = 0$, $\mathbb{E}z^2 = m = 1$ and $\mathbb{E}z^4/(\mathbb{E}z^2)^2 = \beta$, and such that z is a.s. bounded by R. Then the matrix X satisfies a.s. assumption (A1). Using multiple Hoeffding inequalities, we will prove the following lemma.

Lemma 18 If z does not follow a Rademacher law, the design matrix X satisfies assumptions (A1,A2,A3,A4) with probability greater than $1 - 8d^2\exp\big(-\frac{\delta^2 n(\beta-1)}{2d(\beta+d)R^4}\big)$ for $S = \mathbb{R}^d$, and with probability greater than $1 - 8d^2\exp\big(-\frac{\delta^2 n \min\{\beta-1, 2\}}{2dR^4}\big)$ for $S = e_{\min}^\perp$ with $e_{\min} = [1, c_{\min} 1_{d-1}^\top]^\top$, where $c_{\min}$ is defined in the proof and satisfies $|c_{\min}| \leq \frac{1}{d+\beta-3}$.

This lemma concludes the proof of Proposition 6. We will now prove these two lemmas.
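Before the proofs, the concentration phenomenon behind assumption (A3) can be illustrated with a small Monte Carlo sketch; the sample size, dimension and noise law below are arbitrary choices (uniform noise, hence bounded and symmetric but not Rademacher), not the paper's setting:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 20_000, 5                          # arbitrary test sizes
Z = rng.uniform(-1.0, 1.0, size=(n, d))   # symmetric, bounded noise

Sigma_n = Z.T @ Z / n
off_diag = Sigma_n - np.diag(np.diag(Sigma_n))
# Each off-diagonal entry is an average of n i.i.d. products in [-1, 1]
# with mean 0, so by Hoeffding's inequality it is O(1/sqrt(n)) w.h.p.
assert np.max(np.abs(off_diag)) < 0.05
```

This is exactly the quantity controlled by assumption (A3): the off-diagonal part of $Z^\top Z/n$ is uniformly small, while the diagonal concentrates around $\mathbb{E}z^2$.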
E.2.2 Proof of lemma 17

Proof Since the dual variable A for the psd constraint is 0 (see the proof of Lemma 15), the constraint $V \succcurlyeq 0$ is not active, and we will show that the function decreases in a set of directions which includes the ones for which $V^\star + \Delta \succcurlyeq 0$. We therefore consider a direction $\Delta = \begin{pmatrix} a & b^\top \\ b & C \end{pmatrix}$ with $C \succcurlyeq 0$, which is slightly more general than $V^\star + \Delta \succcurlyeq 0$. We denote by $f(W) = \frac{2}{n}\sum_{i=1}^n \sqrt{x_i^\top W x_i} - \frac{1}{n}\operatorname{tr} X^\top X W$ the smooth part of g. By Taylor-Young, we have for all W:
$$f(W) - f(W+\Delta) = -\langle f'(W), \Delta\rangle - \frac{1}{2}\langle \Delta, f''(W)\Delta\rangle + o(\|\Delta\|^2),$$
and thus
$$g(W) - g(W+\Delta) = -\langle f'(W), \Delta\rangle - \frac{1}{2}\langle \Delta, f''(W)\Delta\rangle + \lambda\big(\|W+\Delta\|_1 - \|W\|_1\big) + o(\|\Delta\|^2).$$
Letting $W = V^\star$, for which $f'(V^\star) = \frac{\lambda}{n} X^\top X$, this gives, with $X^\top X = \begin{pmatrix} n & y^\top Z \\ Z^\top y & Z^\top Z \end{pmatrix}$:
$$g(V^\star) - g(V^\star+\Delta) = -\frac{\lambda}{n}\langle X^\top X, \Delta\rangle + \lambda\big(a + 2\|b\|_1 + \|C\|_1\big) - \frac{1}{2}\langle \Delta, f''(V^\star)\Delta\rangle + o(\|\Delta\|^2)$$
$$= \lambda\Big[2\Big(\|b\|_1 - \frac{1}{n} b^\top Z^\top y\Big) + \|C\|_1 - \frac{1}{n}\operatorname{tr}(Z^\top Z\, C)\Big] - \frac{1}{2}\langle \Delta, f''(V^\star)\Delta\rangle + o(\|\Delta\|^2).$$
By Hölder's inequality and assumption (A2),
$$\frac{1}{n} b^\top Z^\top y \leq \|b\|_1 \Big\|\frac{1}{n} Z^\top y\Big\|_\infty \leq \delta\,\|b\|_1, \quad \text{so that} \quad \|b\|_1 - \frac{1}{n} b^\top Z^\top y \geq (1-\delta)\|b\|_1.$$
Moreover, we will show in Lemma 19 that $\|C\|_1 - \frac{1}{n}\operatorname{tr}(Z^\top Z\, C) \geq (1-\delta)\,\|C - \operatorname{Diag}(\operatorname{diag}(C))\|_1$, thus
$$g(V^\star) - g(V^\star+\Delta) \geq \lambda(1-\delta)\big(2\|b\|_1 + \|C - \operatorname{Diag}(\operatorname{diag}(C))\|_1\big) - \frac{1}{2}\langle \Delta, f''(V^\star)\Delta\rangle + o(\|\Delta\|^2). \tag{36}$$
However, in Eq. (36) the first term vanishes for $b = 0$ and C diagonal; we therefore have to investigate second-order conditions, i.e., to show for $\Delta = \operatorname{Diag}(e)$ with $e \in \mathbb{R}^d$ that $-\langle \Delta, f''(V^\star)\Delta\rangle > 0$. Since $x_i^\top V^\star x_i = \frac{1}{(1+\lambda)^2}$, we have, with assumption (A4),
$$-\frac{1}{2}\big\langle \operatorname{Diag}(e), f''(V^\star)\operatorname{Diag}(e)\big\rangle = \frac{(1+\lambda)^3}{4}\,\frac{1}{n}\sum_{i=1}^n \big(x_i^\top \operatorname{Diag}(e)\, x_i\big)^2 = \frac{(1+\lambda)^3}{4}\,\frac{1}{n}\sum_{i=1}^n \Big(\sum_{j=1}^d e_j (x_i^j)^2\Big)^2$$
$$= \frac{(1+\lambda)^3}{4}\; e^\top\Big[\frac{1}{n}\sum_{i=1}^n x_i^{\circ 2}(x_i^{\circ 2})^\top\Big] e \geq \frac{(1+\lambda)^3}{4}\,\lambda^S_{\min}\Big(\frac{(X^{\circ 2})^\top X^{\circ 2}}{n}\Big)\,\|e\|_2^2 \geq \zeta\,\frac{(1+\lambda)^3}{4}\,\|e\|_2^2.$$
Thus we can conclude:
$$g(V^\star) - g(V^\star+\Delta) \geq \lambda(1-\delta)\big(2\|b\|_1 + \|C - \operatorname{Diag}(\operatorname{diag}(C))\|_1\big) + \zeta\,\frac{(1+\lambda)^3}{4}\,\|e\|_2^2 + o(\|\Delta\|^2).$$

E.2.3 Auxiliary lemma

Lemma 19 For any symmetric positive semidefinite matrix C, we have under assumptions (A1) and (A3):
$$\|C\|_1 - \frac{1}{n}\operatorname{tr}(Z^\top Z\, C) \geq (1-\delta)\,\|C - \operatorname{Diag}(\operatorname{diag}(C))\|_1 \geq 0.$$

Proof We denote $\Sigma_n = \frac{Z^\top Z}{n}$. We always have $\|C\|_1 - \operatorname{tr}(\Sigma_n C) = \operatorname{tr}\big((S - \Sigma_n)C\big)$, where $S_{i,j} = \operatorname{sign}(C_{i,j})$. Thus, if $\operatorname{diag}(C) > 0$, then $\operatorname{diag}(S) = 1$ and $\operatorname{diag}(S - \Sigma_n) \geq 0$ from assumption (A1). Moreover, since $(\Sigma_n)_{i,j} \in [-1,1]$, the off-diagonal entries of $S - \Sigma_n$ have the same sign as those of S. Thus $\operatorname{tr}\big((S - \Sigma_n)C\big) = \sum_i C_{i,i}(S - \Sigma_n)_{i,i} + \sum_{i \neq j} C_{i,j}(S - \Sigma_n)_{i,j} \geq \sum_{i \neq j} C_{i,j}(S - \Sigma_n)_{i,j} \geq 0$. Furthermore, from assumption (A3), $|(\Sigma_n)_{i,j}| \leq \delta$ for $i \neq j$.
Therefore
$$\operatorname{tr}\big((S - \Sigma_n)C\big) \geq \sum_{i \neq j} C_{i,j}(S - \Sigma_n)_{i,j} \geq \sum_{i \neq j} |C_{i,j}|\,(1-\delta) = (1-\delta)\,\|C - \operatorname{Diag}(\operatorname{diag}(C))\|_1 \geq 0.$$
If some diagonal element of C equals 0, then, since $C \succcurlyeq 0$, the corresponding row and column of C are also 0, and we can consider the same problem after erasing the corresponding row and column from C and $\Sigma_n$.

E.2.4 Proof of lemma 18

Proof We first show that the noise matrix Z satisfies assumptions (A2, A3). By Hoeffding's inequality, we have with probability $1 - 2\exp(-\delta^2 n/(2R^2))$ that $\big|\frac{1}{n}\sum_{i=1}^n z_i^j\big| \leq \delta$. Since the law of z is symmetric, $y_i z_i$ has the same law as $z_i$, and thus with probability $1 - 2\exp(-\delta^2 n/(2R^2))$ the design matrix Z satisfies assumption (A2). Likewise, we have with probability $1 - 2\exp(-\delta^2 n/(2R^4))$ that, for $j \neq j'$, $\big|\frac{1}{n}\sum_{i=1}^n z_i^j z_i^{j'}\big| \leq \delta$. Thus we also have with probability $1 - 2d^2\exp(-\delta^2 n/(2R^4))$ that Z satisfies assumption (A3): $\big\|\frac{1}{n} Z^\top Z - \operatorname{Diag}\big(\operatorname{diag}\big(\frac{1}{n} Z^\top Z\big)\big)\big\|_\infty \leq \delta$. Thus, with probability $1 - 4d^2\exp(-\delta^2 n/(2R^4))$, the noise matrix Z satisfies assumptions (A1, A2, A3).

We proceed as in the proof of Proposition 2 to show that X satisfies assumption (A4): we first derive a condition for the result to hold in expectation, then we use a matrix concentration inequality to bound the empirical expectation. This is very similar, but we will get a better scaling since Δ is diagonal. Using the same arguments as in the proof of Proposition 2, we have for the diagonal matrix $\Delta = \operatorname{Diag}(e)$ with $e = (a, c^\top)^\top \in \mathbb{R}^d$:
$$e^\top\,\mathbb{E}\big(x^{\circ 2}(x^{\circ 2})^\top\big)\, e = \mathbb{E}(x^\top \Delta x)^2 = (a + m\, c^\top 1_{d-1})^2 + m^2(\beta-1)\,\|c\|_2^2 > 0 \ \text{ if } \beta > 1.$$
We can show that $m^2(\beta-1)$ is an eigenvalue of multiplicity $d-2$, and that $\mu_\pm$ are eigenvalues of multiplicity one of the operator $\mathbb{E}\,x^{\circ 2}(x^{\circ 2})^\top$, with eigenvectors $e_\pm$. Thus
$$\lambda_{\min}\big(\mathbb{E}\,x^{\circ 2}(x^{\circ 2})^\top\big) = \frac{1 + (d+\beta-2)m^2 - \sqrt{\big(1+(d+\beta-2)m^2\big)^2 - 4m^2(\beta-1)}}{2} \geq \frac{m^2(\beta-1)}{1 + (d+\beta-2)m^2},$$
$$\lambda^{e_{\min}^\perp}_{\min}\big(\mathbb{E}\,x^{\circ 2}(x^{\circ 2})^\top\big) \geq m^2\,\min\{\beta-1, 2\},$$
and
$$\lambda_{\max}\big(x^{\circ 2}(x^{\circ 2})^\top\big) = (x^{\circ 2})^\top x^{\circ 2} = \sum_{j=1}^d (x^j)^4 \leq d\, R^4.$$
Thus we can apply the matrix Chernoff inequality from Tropp (2012), with $\mu_S = \lambda^S_{\min}\big(\mathbb{E}\,x^{\circ 2}(x^{\circ 2})^\top\big)$:
$$\mathbb{P}\Big(\lambda^S_{\min}\Big(\frac{(X^{\circ 2})^\top X^{\circ 2}}{n}\Big) \leq (1-\delta)\,\mu_S\Big) \leq d\, e^{-\delta^2 n \mu_S/(2dR^4)}.$$
Thus with probability $1 - 5d^2\exp(-\delta^2 n \mu_-/(2dR^4))$ the design matrix X satisfies assumptions (A1,A2,A3,A4) with $\zeta = (1-\delta)\mu_-$ and $S = \mathbb{R}^d$, and with probability $1 - 5d^2\exp(-\delta^2 n \min\{\beta-1, 2\}/(2dR^4))$ the design matrix X satisfies assumptions (A1,A2,A3,A4) with $\zeta = (1-\delta)\min\{\beta-1, 2\}$ and $S = e_{\min}^\perp$.

Appendix F. Proof of multi-label results

We first prove Lemma 7.

Proof Let $A \in \mathbb{R}^{k \times k}$ be symmetric positive semidefinite such that $\operatorname{diag}(\bar{y} A \bar{y}^\top) = 1_n$; then
$$\operatorname{diag}(\bar{y} A \bar{y}^\top) = \sum_{i=0} a_{i,i}\, 1_n + 2\sum_{i=1} a_{0,i}\, y_i + 2\sum_{1 \leq i < j} \cdots$$