# Matrix Completion with Noisy Side Information

Kai-Yang Chiang (University of Texas at Austin), Cho-Jui Hsieh (University of California at Davis), Inderjit S. Dhillon (University of Texas at Austin). {kychiang,inderjit}@cs.utexas.edu, chohsieh@ucdavis.edu

We study the matrix completion problem with side information. Side information has been considered in several matrix completion applications, and has been empirically shown to be useful in many cases. Recently, researchers studied the effect of side information for matrix completion from a theoretical viewpoint, showing that sample complexity can be significantly reduced given completely clean features. However, since in reality most given features are noisy or only weakly informative, the development of a model that can handle a general feature set, and an investigation of how much noisy features can help matrix recovery, remain important issues. In this paper, we propose a novel model that balances features and observations simultaneously in order to leverage feature information yet remain robust to feature noise. Moreover, we study the effect of general features in theory and show that, by using our model, the sample complexity can be lower than that of matrix completion as long as features are sufficiently informative. This result provides theoretical insight into the usefulness of general side information. Finally, we consider synthetic data and two applications, relationship prediction and semi-supervised clustering, and show that our model outperforms other methods for matrix completion that use features, both in theory and in practice.

## 1 Introduction

Low rank matrix completion is an important topic in machine learning and has been successfully applied to many practical applications [22, 12, 11]. One promising direction in this area is to exploit side information, or features, to help matrix completion tasks. For example, in the famous Netflix problem, besides rating history, profiles of users and/or genres of movies might also be given, and one could possibly leverage such side information for better prediction. Observing that such additional features are usually available in real applications, how to better incorporate features into matrix completion becomes an important problem with both theoretical and practical aspects. Several approaches have been proposed for matrix completion with side information, and most of them empirically show that features are useful for certain applications [1, 28, 9, 29, 33]. However, there is surprisingly little analysis of the effect of features on general matrix completion. More recently, Jain and Dhillon [18] and Xu et al. [35] provided non-trivial guarantees for matrix completion with side information. They showed that if perfect features are given, under certain conditions, one can substantially reduce the sample complexity by solving a feature-embedded objective. This result suggests that completely informative features are extremely powerful for matrix completion, and the algorithm has been successfully applied in many applications [29, 37]. However, this model is still quite restrictive: if features are not perfect, it fails to guarantee recoverability and can even suffer poor performance in practice. A more general model with recovery analysis that can handle noisy features is thus desired. In this paper, we study the matrix completion problem with general side information.
We propose a dirty statistical model that balances feature and observation information simultaneously to complete a matrix. As a result, our model can leverage feature information yet remain robust to noisy features. Furthermore, we provide a theoretical foundation that shows the effectiveness of our model. We formally quantify the quality of features and show that the sample complexity of our model depends on feature quality. Two notable results can thus be inferred. First, unlike [18, 35], given any feature set our model is guaranteed to achieve recovery with at most $O(n^{3/2})$ samples in a distribution-free manner, where $n$ is the dimensionality of the matrix. Second, if features are reasonably good, we can improve the sample complexity to $o(n^{3/2})$. We emphasize that since $\Omega(n^{3/2})$ is the lower bound on sample complexity for distribution-free, trace-norm regularized matrix completion [32], our result suggests that even noisy features can asymptotically reduce the number of observations needed for matrix completion. In addition, we empirically show that our model outperforms other completion methods on synthetic data as well as in two applications: relationship prediction and semi-supervised clustering. Our contributions can be summarized as follows:

- We propose a dirty statistical model for matrix completion with general side information, where the matrix is learned by balancing features and pure observations simultaneously.
- We quantify the effectiveness of features in the matrix completion problem. We show that our model is guaranteed to recover the matrix with any feature set, and moreover, that the sample complexity can be lower than that of standard matrix completion given informative features.

The paper is organized as follows. Section 2 discusses related research. In Section 3, we introduce our proposed model for matrix completion with general side information. We theoretically analyze the effectiveness of features in our model in Section 4, and show experimental results in Section 5.

## 2 Related Work

Matrix completion has been widely applied to many machine learning tasks, such as recommender systems [22], social network analysis [12] and clustering [11]. Several theoretical foundations have also been established. One remarkable milestone is the strong guarantee provided by Candès et al. [7, 5], who prove that $O(n\,\mathrm{poly}\log n)$ observations are sufficient for exact recovery provided entries are sampled uniformly at random. Several works also study recovery under non-uniform distributional assumptions [30, 10], in the distribution-free setting [32], and with noisy observations [21, 4]. Several works also consider side information in matrix completion [1, 28, 9, 29, 33]. Although most of them find, with experimental support, that features are helpful for certain applications [28, 33] and in the cold-start setting [29], their proposed methods focus on non-convex matrix factorization formulations without theoretical guarantees. Compared to them, our model focuses on a convex trace-norm regularized objective and on theoretical insight into the effect of features. On the other hand, Jain and Dhillon [18] (see also [38]) studied an inductive matrix completion objective to incorporate side information, and follow-up work [35] considers a similar formulation with a trace-norm regularized objective. Both of them show that recovery guarantees can be attained with lower sample complexity when features are perfect.
However, if features are imperfect, such models cannot recover the underlying matrix and can suffer poor performance in practice. We give a detailed discussion of the inductive matrix completion model in Section 3. Our proposed model is also related to the family of dirty statistical models [36], where the model parameter is expressed as the sum of a number of parameter components, each of which has its own structure. Dirty statistical models have been proposed mostly for robust matrix completion, graphical model estimation, and multi-task learning, to decompose a sparse component (noise) and a low-rank component (model parameters) [6, 8, 19]. Our proposed algorithm is quite different: we aim to decompose the model into two parts, the part that can be described by side information and the part that has to be recovered purely from observations.

## 3 A Dirty Statistical Model for Matrix Completion with Features

Let $R \in \mathbb{R}^{n_1 \times n_2}$ be the underlying rank-$k$ matrix that we aim to recover, where $k \ll \min(n_1, n_2)$ so that $R$ is low rank. Let $\Omega$ be the set of observed entries sampled from $R$, with cardinality $|\Omega| = m$. Furthermore, let $X \in \mathbb{R}^{n_1 \times d_1}$ and $Y \in \mathbb{R}^{n_2 \times d_2}$ be the feature set, where each row $x_i$ (or $y_i$) denotes the feature vector of the $i$-th row (or column) entity of $R$. Both $d_1, d_2 \ll \min(n_1, n_2)$, but they can be either smaller or larger than $k$. Thus, given a set of observations $\Omega$ and the feature set $X$ and $Y$ as side information, the goal is to recover the underlying low rank matrix $R$.

To begin with, consider an ideal case where the given features are perfect in the following sense:
$$\mathrm{col}(R) \subseteq \mathrm{col}(X) \quad\text{and}\quad \mathrm{row}(R) \subseteq \mathrm{col}(Y). \tag{1}$$
Such a feature set can be thought of as perfect since it fully describes the true latent feature space of $R$. Then, instead of recovering the low rank matrix $R$ directly, one can recover a smaller matrix $M \in \mathbb{R}^{d_1 \times d_2}$ such that $R = XMY^T$. The resulting formulation, called inductive matrix completion (IMC for brevity) [18], has been shown to be both theoretically preferable [18, 35] and useful in real applications [37, 29]. Details of this model can be found in [18, 35]. However, in practice, most given features $X$ and $Y$ will not be perfect. In fact, they could be quite noisy or only weakly correlated with the latent feature space of $R$. Although in some cases applying IMC with imperfect $X, Y$ may still yield decent performance, in many other cases the performance drops drastically as features become noisy. This weakness of IMC can also be seen empirically in Section 5. Therefore, a more robust model is desired to better handle noisy features.

We now introduce a dirty statistical model for matrix completion with (possibly noisy) features. The core concept of our model is to learn the underlying matrix by balancing feature information and observations. Specifically, we propose to learn $R$ jointly from two parts: one is the low rank estimate $XMY^T$ from the feature space, and the other, $N$, is the part outside the feature space. Thus, $N$ can be used to capture the information that noisy features fail to describe, which is then estimated from pure observations. Naturally, both $XMY^T$ and $N$ should be low rank since they are aggregated to estimate a low rank matrix $R$. This further suggests a preference for $M$ to be low rank as well, since one expects that only a small subspace of $X$ and a small subspace of $Y$ are jointly effective in forming the low rank space $XMY^T$.
Putting all of the above together, we consider solving the following problem:
$$\min_{M, N} \sum_{(i,j)\in\Omega} \ell\big((XMY^T + N)_{ij}, R_{ij}\big) + \lambda_M \|M\|_* + \lambda_N \|N\|_*, \tag{2}$$
where $M$ and $N$ are regularized with the trace norm because of the low rank prior. The underlying matrix $R$ can thus be estimated by $XM^*Y^T + N^*$. We refer to our model as Dirty IMC for convenience. To solve the convex problem (2), we propose an alternating minimization scheme that solves for $N$ and $M$ iteratively. Our algorithm is stated in detail in Appendix A. One remark about this algorithm is that it is guaranteed to converge to a global optimum, since the problem is jointly convex in $M$ and $N$.

The parameters $\lambda_M$ and $\lambda_N$ are crucial for controlling the relative importance of features and residual. When $\lambda_M = \infty$, $M$ is forced to $0$, so features are disregarded and (2) becomes a standard matrix completion objective. Another special case is $\lambda_N = \infty$, in which $N$ is forced to $0$ and the objective becomes IMC. Intuitively, with an appropriate ratio $\lambda_M/\lambda_N$, the proposed model can incorporate the useful part of the features, yet remain robust to the noisy part by compensating with pure observations. Some natural questions arise here: How do we quantify the quality of features? What are the right $\lambda_M$ and $\lambda_N$ given a feature set? And, beyond intuition, how much can we benefit from features using our model in theory? We formally answer these questions in Section 4.

## 4 Theoretical Analysis

We now analyze the usefulness of features in our model from a theoretical perspective. We first quantify the quality of features and show that with reasonably good features, our model achieves recovery with lower sample complexity. Finally, we compare our results to matrix completion and IMC. Due to space limitations, detailed proofs of theorems and lemmas are left to Appendix B.

### 4.1 Preliminaries

Recall that our goal is to recover a rank-$k$ matrix $R$ given the observed entry set $\Omega$ and the feature set $X$ and $Y$ described in Section 3. Recovering the matrix with our model (Equation (2)) is equivalent to solving the hard-constraint problem:
$$\min_{M, N} \sum_{(i,j)\in\Omega} \ell\big((XMY^T + N)_{ij}, R_{ij}\big), \quad\text{subject to } \|M\|_* \le \mathcal{M},\ \|N\|_* \le \mathcal{N}. \tag{3}$$
For simplicity, we consider $d = \max(d_1, d_2) = O(1)$ so that feature dimensions do not grow as a function of $n$. We assume each entry $(i,j) \in \Omega$ is sampled i.i.d. from an unknown distribution, with index set $\{(i_\alpha, j_\alpha)\}_{\alpha=1}^{m}$. Also, each entry of $R$ is assumed to be upper bounded, i.e. $\max_{ij} |R_{ij}| \le \bar{R}$ (so that the trace norm of $R$ is $O(\sqrt{n_1 n_2})$). This is consistent with real scenarios such as the Netflix problem, where users rate movies on a scale from 1 to 5. For convenience, let $\theta = (M, N)$ be any feasible solution, and let $\Theta = \{(M, N) \mid \|M\|_* \le \mathcal{M},\ \|N\|_* \le \mathcal{N}\}$ be the feasible solution set. Also, let $f_\theta(i, j) = x_i^T M y_j + N_{ij}$ be the estimation function for $R_{ij}$ parameterized by $\theta$, and let $F_\Theta = \{f_\theta \mid \theta \in \Theta\}$ be the set of feasible functions. We are interested in the following two $\ell$-risk quantities:
$$\text{Expected } \ell\text{-risk: } R_\ell(f) = \mathbb{E}_{(i,j)}\big[\ell(f(i,j), R_{ij})\big], \qquad \text{Empirical } \ell\text{-risk: } \hat{R}_\ell(f) = \frac{1}{m}\sum_{(i,j)\in\Omega} \ell(f(i,j), R_{ij}).$$
Thus, our model solves for the $\theta^*$ that parameterizes $f^* = \arg\min_{f\in F_\Theta} \hat{R}_\ell(f)$, and it suffices to show that recovery is attained if $R_\ell(f^*)$ approaches zero for large enough $n$ and $m$.

### 4.2 Measuring the Quality of Features

We now link the quality of features to Rademacher complexity, a learning-theoretic tool for measuring the complexity of a function class. We will show that quality features result in a lower model complexity and thus a smaller error bound. Under this viewpoint, the upper bound on the Rademacher complexity can be used to measure the quality of features.
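As a concrete companion to Section 3, the following is a small illustrative sketch of one way to carry out the alternating minimization for objective (2): proximal-gradient steps on $M$ and $N$ under a squared loss, with singular value soft-thresholding as the trace-norm proximal operator. The step sizes, default parameters, and function names below are our own illustrative choices, not the exact procedure of Appendix A.

```python
import numpy as np

def svt(A, tau):
    """Singular value soft-thresholding: the proximal operator of tau * ||.||_*."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return U @ np.diag(np.maximum(s - tau, 0.0)) @ Vt

def dirty_imc(R_obs, mask, X, Y, lam_M=0.1, lam_N=0.1, step=0.5, iters=200):
    """Alternating proximal-gradient sketch for
        min_{M,N} 0.5 * sum_{(i,j) in Omega} ((X M Y^T + N)_ij - R_ij)^2
                  + lam_M ||M||_* + lam_N ||N||_*.
    R_obs holds the observed entries (zeros elsewhere); mask is the boolean set Omega."""
    (n1, d1), (n2, d2) = X.shape, Y.shape
    M, N = np.zeros((d1, d2)), np.zeros((n1, n2))
    # Conservative step size for M: the smooth term is (||X||_2 ||Y||_2)^2-smooth in M.
    step_M = step / (np.linalg.norm(X, 2) ** 2 * np.linalg.norm(Y, 2) ** 2)
    for _ in range(iters):
        E = mask * (X @ M @ Y.T + N - R_obs)                  # residual on observed entries
        N = svt(N - step * E, step * lam_N)                   # proximal step in N
        E = mask * (X @ M @ Y.T + N - R_obs)
        M = svt(M - step_M * (X.T @ E @ Y), step_M * lam_M)   # proximal step in M
    return M, N                                               # estimate of R: X @ M @ Y.T + N
```

Note that taking `lam_N` (respectively `lam_M`) very large drives $N$ (respectively $M$) to zero, recovering the IMC and standard matrix completion special cases discussed above.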
To begin with, we apply the following lemma to bound the expected $\ell$-risk.

**Lemma 1** (Bound on Expected $\ell$-risk [2]). Let $\ell$ be a loss function with Lipschitz constant $L_\ell$ and bounded by $B$ with respect to its first argument, and let $\delta$ be a constant with $0 < \delta < 1$. Let $\mathcal{R}(F_\Theta)$ be the Rademacher complexity of the function class $F_\Theta$ (w.r.t. $\Omega$ and associated with $\ell$), defined as
$$\mathcal{R}(F_\Theta) = \mathbb{E}_\sigma\Big[\sup_{f \in F_\Theta} \frac{1}{m}\sum_{\alpha=1}^{m} \sigma_\alpha\, \ell\big(f(i_\alpha, j_\alpha), R_{i_\alpha j_\alpha}\big)\Big],$$
where each $\sigma_\alpha$ takes values in $\{\pm 1\}$ with equal probability. Then with probability at least $1-\delta$, for all $f \in F_\Theta$ we have
$$R_\ell(f) \le \hat{R}_\ell(f) + 2\,\mathbb{E}_\Omega[\mathcal{R}(F_\Theta)] + B\sqrt{\frac{\ln(1/\delta)}{2m}}.$$

Clearly, to guarantee a small enough $R_\ell$, both $\hat{R}_\ell$ and the model complexity $\mathbb{E}_\Omega[\mathcal{R}(F_\Theta)]$ have to be bounded. The next key lemma shows that, in the matrix completion context, the model complexity term $\mathbb{E}_\Omega[\mathcal{R}(F_\Theta)]$ is related to feature quality. Before diving into the details, we first provide some intuition about the meaning of good features. Consider any imperfect feature set that violates (1). One can imagine that such a feature set is perturbed by some misleading noise that is not correlated with the true latent features. However, the features should still be effective if such noise does not weaken the true latent feature information too much. Thus, if a large portion of the true latent features lies in the informative part of the feature spaces $X$ and $Y$, they should still be somewhat informative and helpful for recovering the matrix $R$. More formally, the model complexity can be bounded in terms of $\mathcal{M}$ and $\mathcal{N}$ by the following lemma:

**Lemma 2.** Let $\mathcal{X} = \max_i \|x_i\|_2$, $\mathcal{Y} = \max_i \|y_i\|_2$ and $n = \max(n_1, n_2)$. Then the model complexity of the function class $F_\Theta$ is upper bounded by
$$\mathbb{E}_\Omega[\mathcal{R}(F_\Theta)] \le \sqrt{\frac{9\,C\,L_\ell B\,\mathcal{N}(\sqrt{n_1} + \sqrt{n_2})}{m}}.$$

Then, by Lemmas 1 and 2, one can carefully construct a feasible solution set (by setting $\mathcal{M}$ and $\mathcal{N}$) such that both $\hat{R}_\ell(f^*)$ and $\mathbb{E}_\Omega[\mathcal{R}(F_\Theta)]$ are controlled to be reasonably small. We now suggest a witness pair $\mathcal{M}$ and $\mathcal{N}$ constructed as follows. Let $\gamma$ be defined as
$$\gamma = \min\Big(\frac{\min_i \|x_i\|_2}{\mathcal{X}},\ \frac{\min_i \|y_i\|_2}{\mathcal{Y}}\Big).$$
Let $T_\mu(\cdot): \mathbb{R}_+ \to \mathbb{R}_+$ be the thresholding operator with $T_\mu(x) = x$ if $x \ge \mu$ and $T_\mu(x) = 0$ otherwise. In addition, let $X = \sum_{i=1}^{d_1} \sigma_i u_i v_i^T$ be the reduced SVD of $X$, and define $X_\mu = \sum_{i=1}^{d_1} \sigma_1 T_\mu(\sigma_i/\sigma_1) u_i v_i^T$ to be the $\mu$-informative part of $X$. The $\nu$-informative part of $Y$, denoted $Y_\nu$, is defined similarly. Now consider setting $\mathcal{M} = \|\hat{M}\|_*$ and $\mathcal{N} = \|R - X_\mu \hat{M} Y_\nu^T\|_*$, where
$$\hat{M} = \arg\min_{M} \big\|R - X_\mu M Y_\nu^T\big\|_F$$
is the optimal solution for approximating $R$ under the informative feature spaces $X_\mu$ and $Y_\nu$. The following lemma shows that the trace norm of $\hat{M}$ does not grow as $n$ increases.

**Lemma 3.** Fix $\mu, \nu \in (0, 1]$, and let $\hat{d} = \min(\mathrm{rank}(X_\mu), \mathrm{rank}(Y_\nu))$. Then for some universal constant $C'$:
$$\|\hat{M}\|_* \le \frac{C'\,\hat{d}}{\mu^2 \nu^2 \gamma^2\, \mathcal{X}\mathcal{Y}}.$$

Moreover, by combining Lemmas 1-3, we can upper bound $R_\ell(f^*)$ for Dirty IMC as follows:

**Theorem 1.** Consider problem (3) with $\mathcal{M} = \|\hat{M}\|_*$ and $\mathcal{N} = \|R - X_\mu \hat{M} Y_\nu^T\|_*$. Then with probability at least $1-\delta$, the expected $\ell$-risk of an optimal solution $(M^*, N^*)$ is bounded by
$$R_\ell(f^*) \le \sqrt{\frac{36\,C\,L_\ell B\,\mathcal{N}(\sqrt{n_1} + \sqrt{n_2})}{m}} + \frac{4 L_\ell\,\hat{d}\,C'}{\mu^2 \nu^2 \gamma^2} + B\sqrt{\frac{\ln(1/\delta)}{2m}}.$$

### 4.3 Sample Complexity Analysis

From Theorem 1, we can derive the following sample complexity guarantee for our model. For simplicity, in the following discussion we assume $k = O(1)$ so that it does not grow as $n$ increases.

**Corollary 1.** Suppose we aim to $\epsilon$-recover $R$, i.e. $\mathbb{E}_{(i,j)}\big[\ell\big((XMY^T + N)_{ij}, R_{ij}\big)\big] < \epsilon$ for an arbitrarily small $\epsilon$. Then for the Dirty IMC model, $O\big(\min(\mathcal{N}\sqrt{n},\ \mathcal{N}^2 \log n)/\epsilon^2\big)$ observations are sufficient for $\epsilon$-recovery provided $n$ is sufficiently large.

Corollary 1 suggests that the sample complexity of our model depends only on the trace norm of the residual $N$.
This matches the intuition about good features stated in Section 4.2: if features are good, $X\hat{M}Y^T$ will cover most of $R$, so $\mathcal{N}$ will be small and one enjoys a small sample complexity by exploiting quality features.

We also compare our sample complexity result with other models. First, suppose features are perfect (so that $\mathcal{N} = O(1)$); our result then says that only $O(\log n)$ samples are required for recovery. This matches the result of [35], in which the authors show that given perfect features, $O(\log n)$ observations are enough for exact recovery by solving the IMC objective. However, IMC does not guarantee recovery when features are not perfect, while our result shows that recovery is still attainable by Dirty IMC with $O\big(\min(\mathcal{N}\sqrt{n},\ \mathcal{N}^2\log n)/\epsilon^2\big)$ samples. We also empirically justify this result in Section 5. On the other hand, for standard matrix completion (i.e. when no features are considered), the most well-known guarantee is that, under certain conditions, one can achieve $O(n\,\mathrm{poly}\log n)$ sample complexity for both $\epsilon$-recovery [34] and exact recovery [5]. However, these bounds only hold under distributional assumptions on the observed entries. For sample complexity without any distributional assumptions, Shamir et al. [32] recently showed that $O(n^{3/2})$ entries are sufficient for $\epsilon$-recovery, and this bound is tight if no further distribution on the observed entries is assumed. Compared to those results, our analysis also requires no assumptions on the distribution of observed entries, and our sample complexity is also $O(n^{3/2})$ in the worst case, by the fact that $\mathcal{N} \le \|R\|_* = O(n)$. Note that it is reasonable to meet the lower bound $\Omega(n^{3/2})$ even given features, since in an extreme case $X, Y$ could be random matrices with no correlation to $R$, in which case the given information is the same as in standard matrix completion. However, in many applications features will be far from random, and our result provides theoretical insight showing that features can be useful even if they are imperfect. Indeed, as long as features are informative enough that $\mathcal{N} = o(n)$, our sample complexity is asymptotically lower than $O(n^{3/2})$. Here we provide two concrete instances of such a scenario. In the first, we consider the rank-$k$ matrix $R$ to be generated from the random orthogonal model [5], as follows:

**Theorem 2.** Let $R \in \mathbb{R}^{n\times n}$ be generated from the random orthogonal model, where $U = \{u_i\}_{i=1}^{k}$, $V = \{v_i\}_{i=1}^{k}$ are random orthogonal bases and $\sigma_1 \ge \dots \ge \sigma_k$ are singular values of arbitrary magnitude. Let $\sigma_t$ be the largest singular value such that $\lim_{n\to\infty} \sigma_t/\sqrt{n} = 0$. Then, given noisy features $X, Y$ where $X_{:i} = u_i$ (and $Y_{:i} = v_i$) if $i < t$, and $X_{:i}$ (and $Y_{:i}$) is any basis vector orthogonal to $U$ (and $V$) if $i \ge t$, $o(n)$ samples are sufficient for Dirty IMC to achieve $\epsilon$-recovery.

Theorem 2 says that, under the random orthogonal model, if features are not too noisy in the sense that noise only corrupts the part of the true subspace associated with smaller singular values, we can approximately recover $R$ with only $o(n)$ observations. An empirical justification of this result is presented in Appendix C. The second scenario takes $R$ to be the product of two rank-$k$ Gaussian matrices:

**Theorem 3.** Let $R = UV^T$ be a rank-$k$ matrix, where $U, V \in \mathbb{R}^{n\times k}$ are true latent row/column features with each $U_{ij}, V_{ij} \sim N(0, \sigma^2)$ i.i.d. Suppose now we are given a feature set $X, Y$ in which $g(n)$ row items and $h(n)$ column items have corrupted features.
Moreover, each corrupted row/column item has a perturbed feature $x_i = u_i + \Delta u_i$ (or $y_i = v_i + \Delta v_i$), where $\|\Delta u_i\|_2 \le \xi_1$ and $\|\Delta v_i\|_2 \le \xi_2$ for some constants $\xi_1$ and $\xi_2$. Then for the Dirty IMC model (3), with high probability, $O\big(\sqrt{(g(n)+h(n))\,n}\,\log n\big)$ observations are sufficient for $\epsilon$-recovery.

Figure 1: Performance of various methods (SVDfeature, MC, IMC, Dirty IMC) for matrix completion under different sparsity and feature quality; relative error versus feature noise level $\rho_f$ for (a) $\rho_s = 0.1$, (b) $\rho_s = 0.25$, (c) $\rho_s = 0.4$, and relative error versus sparsity $\rho_s$ for (d) $\rho_f = 0.1$, (e) $\rho_f = 0.5$, (f) $\rho_f = 0.9$. Compared to other feature-based completion methods, the top figures show that Dirty IMC is less sensitive to noisy features at each $\rho_s$, and the bottom figures show that the error of Dirty IMC always decreases to 0 with more observations, for any feature quality.

Theorem 3 says that if the features have good quality, in the sense that not too many items have corrupted features, for example $g(n), h(n) = O(\log n)$, then the sample complexity of Dirty IMC can be $O(\sqrt{n\log n}\,\log n) = o(n^{3/2})$ as well. Thus, both Theorems 2 and 3 provide concrete examples showing that given imperfect yet informative features, the sample complexity of our model can be asymptotically lower than the lower bound for pure matrix completion (which is $\Omega(n^{3/2})$).

## 5 Experimental Results

In this section, we show the effectiveness of the Dirty IMC model (2) for matrix completion with features on both synthetic datasets and real-world applications. For the synthetic datasets, we show that the Dirty IMC model better recovers low rank matrices under various qualities of features. For real applications, we consider relationship prediction and semi-supervised clustering, where the current state-of-the-art methods are based on matrix completion and IMC respectively. We show that by applying the Dirty IMC model to these two problems, we can further improve performance by making better use of features.

### 5.1 Synthetic Experiments

We consider matrix recovery with features on synthetic data generated as follows. We create a low rank matrix $R = UV^T$ with true latent row/column spaces $U, V \in \mathbb{R}^{200\times 20}$, $U_{ij}, V_{ij} \sim N(0, 1/20)$. We then randomly sample a fraction $\rho_s$ of entries $\Omega$ from $R$ as observations, and construct a perfect feature set $X^*, Y^* \in \mathbb{R}^{200\times 40}$ which satisfies (1). To examine performance under different qualities of features, we generate features $X, Y$ with a noise parameter $\rho_f$, where $X$ and $Y$ are derived by replacing a fraction $\rho_f$ of the bases of $X^*$ (and $Y^*$) with bases orthogonal to $X^*$ (and $Y^*$). We then consider recovering the underlying matrix $R$ given $X$, $Y$ and the subset $\Omega$ of entries of $R$. We compare our Dirty IMC model (2) with standard trace-norm regularized matrix completion (MC) and two other feature-based completion methods: IMC [18] and SVDfeature [9]. The standard relative error $\|\hat{R} - R\|_F / \|R\|_F$ is used to evaluate a recovered matrix $\hat{R}$. For each method, we select parameters from the set $\{10^\alpha\}_{\alpha=-3}^{2}$ and report the one with the best recovery.
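The following is a minimal sketch of this synthetic generator. The use of QR factorizations to build the perfect feature bases and to pick the corrupting orthogonal directions is our own illustrative choice; only the dimensions, the Gaussian latent factors, and the fractions $\rho_s$ and $\rho_f$ come from the description above.

```python
import numpy as np

def make_synthetic(n=200, k=20, d=40, rho_s=0.25, rho_f=0.5, seed=0):
    """Generate R = U V^T, noisy features X, Y, and an observation mask."""
    rng = np.random.default_rng(seed)
    U = rng.normal(0.0, np.sqrt(1.0 / k), size=(n, k))
    V = rng.normal(0.0, np.sqrt(1.0 / k), size=(n, k))
    R = U @ V.T                                    # rank-k ground-truth matrix

    def noisy_features(latent):
        # Perfect features: an orthonormal basis whose span contains col(latent),
        # so that condition (1) holds.
        perfect, _ = np.linalg.qr(np.hstack([latent, rng.normal(size=(n, d - k))]))
        # Orthogonal complement, used to corrupt a fraction rho_f of the basis vectors.
        full, _ = np.linalg.qr(perfect, mode="complete")
        comp = full[:, d:]
        noisy = perfect.copy()
        idx = rng.choice(d, size=int(round(rho_f * d)), replace=False)
        noisy[:, idx] = comp[:, :idx.size]
        return noisy

    X, Y = noisy_features(U), noisy_features(V)
    mask = rng.random((n, n)) < rho_s              # observe a fraction rho_s of entries
    return R, X, Y, mask
```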
All results are averaged over 5 random trials. Figure 1 shows the recovery of each method under sparsity levels $\rho_s = 0.1, 0.25, 0.4$ and feature noise levels $\rho_f = 0.1, 0.5, 0.9$. We first observe in the top figures that IMC and SVDfeature perform similarly under different $\rho_s$. This suggests that with sufficient observations, the performance of IMC and SVDfeature depends mainly on feature quality and is not affected much by the number of observations. As a result, given good features (Figure 1(d)), they achieve smaller error than MC with few observations, but as features become noisy (Figures 1(e)-1(f)), they suffer poor performance by trying to learn the underlying matrix in a biased feature space. Another interesting finding is that even when good features are given (Figure 1(d)), IMC (and SVDfeature) still fails to achieve 0 relative error as the number of observations increases, which reconfirms that IMC cannot guarantee recoverability when features are not perfect. On the other hand, we see that the performance of Dirty IMC can be improved both by better features and by more observations. In particular, it makes use of informative features to achieve lower error than MC, and it is also less sensitive to noisy features than IMC and SVDfeature. Finer-grained recovery results over $\rho_s$ and $\rho_f$ can be found in Appendix C.

### 5.2 Real-world Applications

**Relationship Prediction in Signed Networks.** As the first application, we consider the relationship prediction problem in the online review website Epinions [26], where people can write reviews and trust or distrust others based on their reviews. Such a social network can be modeled as a signed network, where trust/distrust relationships are modeled as positive/negative edges between entities [24], and the problem is to predict the unknown relationship between any two users given the network. A state-of-the-art approach is the low rank model [16, 12], in which one first conducts matrix completion on the adjacency matrix and then uses the sign of the completed matrix for relationship prediction. Therefore, if features of users are available, we can also consider the low rank model with our model used for the matrix completion step. This approach can be regarded as an improvement over [16] obtained by incorporating feature information. In this dataset, there are about $n = 105$K users and $m = 807$K observed relationship pairs, of which 15% are distrust relationships. In addition to who-trusts-whom information, we also have a user feature matrix $Z \in \mathbb{R}^{n\times 41}$ in which a 41-dimensional feature vector is collected for each user based on the user's review history, such as the number of positive/negative reviews the user gave/received. We then consider the low rank model of [16] where matrix completion is conducted by Dirty IMC with the non-convex relaxation (5) (Dirty IMC), by IMC [18] (IMC), and by the matrix factorization proposed in [16] (MF-ALS), along with two other prediction methods, HOC-3 and HOC-5 [12]. Note that both row and column entities are users, so $X = Y = Z$ is set for both the Dirty IMC and IMC models. We conduct the experiment using 10-fold cross-validation on the observed edges, with parameters chosen from the set $\bigcup_{\alpha=-3}^{2}\{10^\alpha, 5\cdot 10^\alpha\}$. The averaged accuracy and AUC of each method are reported in Table 1.

| Method   | Dirty IMC       | MF-ALS [16]     | IMC [18]        | HOC-3 [12]      | HOC-5 [12]      |
|----------|-----------------|-----------------|-----------------|-----------------|-----------------|
| Accuracy | 0.9474 ± 0.0009 | 0.9412 ± 0.0011 | 0.9139 ± 0.0016 | 0.9242 ± 0.0010 | 0.9297 ± 0.0011 |
| AUC      | 0.9506          | 0.9020          | 0.9109          | 0.9432          | 0.9480          |

Table 1: Relationship prediction on Epinions. Compared with other approaches, the Dirty IMC model gives the best performance in terms of both accuracy and AUC.
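For completeness, the following sketch shows how a completed adjacency matrix is turned into relationship predictions and scored. The function and variable names are ours; the completed matrix could come from any of the methods in Table 1.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def evaluate_sign_prediction(A_hat, test_pairs, test_signs):
    """A_hat: completed adjacency matrix (e.g. X M Y^T + N with X = Y = Z).
    test_pairs: held-out (i, j) index pairs; test_signs: their true +1/-1 labels."""
    scores = np.array([A_hat[i, j] for i, j in test_pairs])
    accuracy = np.mean(np.sign(scores) == test_signs)         # sign of the completed entry
    auc = roc_auc_score(np.asarray(test_signs) > 0, scores)   # trust (+1) as positive class
    return accuracy, auc
```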
We first observe that IMC performs worse than MF-ALS even though IMC takes features into account. This is because the features are only weakly related to the relationship matrix, and as a result IMC is misled by these noisy features. On the other hand, Dirty IMC performs the best among all prediction methods. In particular, it performs slightly better than MF-ALS in terms of accuracy, and much better in terms of AUC. This shows that Dirty IMC can still exploit weakly informative features without being trapped by feature noise.

**Semi-supervised Clustering.** We now consider the semi-supervised clustering problem as another application. Given $n$ items, the item feature matrix $Z \in \mathbb{R}^{n\times d}$, and $m$ pairwise constraints specifying whether items $i$ and $j$ are similar or dissimilar, the goal is to find a clustering of the items such that most similar items are within the same cluster. We note that the problem can indeed be solved by matrix completion. Consider $S \in \mathbb{R}^{n\times n}$ to be the signed similarity matrix defined by $S_{ij} = 1$ (or $-1$) if items $i$ and $j$ are similar (or dissimilar), and $0$ if their similarity is unknown. Then solving semi-supervised clustering becomes equivalent to finding a clustering of the symmetric signed graph $S$, where the goal is to cluster nodes so that most edges within the same group are positive and most edges between groups are negative [12]. As a result, a matrix completion approach [12] can be applied to solve the signed graph clustering problem on $S$.

Clearly, the above solution is not optimal for semi-supervised clustering, as it disregards features. Many semi-supervised clustering algorithms have thus been proposed that take both item features and constraints into consideration [13, 25, 37]. The current state-of-the-art method is the MCCC algorithm [37], which essentially solves semi-supervised clustering with the IMC objective. In [37], the authors show that by running k-means on the top-$k$ eigenvectors of the completed matrix $ZMZ^T$, MCCC outperforms other state-of-the-art algorithms. We now consider solving semi-supervised clustering with our Dirty IMC model. Our algorithm, summarized in Algorithm 2 in Appendix D, first completes the pairwise matrix with the Dirty IMC objective (2) instead of IMC (with both $X$ and $Y$ set to $Z$), and then runs k-means on the top-$k$ eigenvectors of the completed matrix to obtain a clustering. This algorithm can be viewed as an improved version of MCCC that handles noisy features $Z$. We compare our algorithm with k-means, signed graph clustering via matrix completion [12] (Sign MC), and MCCC [37]. Since MCCC has been shown to outperform most other state-of-the-art semi-supervised clustering algorithms in [37], comparing with MCCC is sufficient to demonstrate the effectiveness of our algorithm.

Figure 2: Semi-supervised clustering on real-world datasets (pairwise error versus number of observed pairs for k-means, Sign MC, MCCC and Dirty IMC). For the Mushrooms dataset, where features are almost ideal, both MCCC and Dirty IMC achieve 0 error rate. For Segment and Covtype, where features are noisier, our model outperforms MCCC as its error decreases given more constraints.

| Dataset   | number of items $n$ | feature dimension $d$ | number of clusters $k$ |
|-----------|---------------------|-----------------------|------------------------|
| Mushrooms | 8124                | 112                   | 2                      |
| Segment   | 2319                | 19                    | 7                      |
| Covtype   | 11455               | 54                    | 7                      |

Table 2: Statistics of the semi-supervised clustering datasets.
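The following is a minimal sketch of the pipeline in Algorithm 2 (Appendix D): build the partially observed signed similarity matrix from the pairwise constraints, complete it with a Dirty-IMC-style solver (passed in as `complete_fn`, since the solver itself is given in the appendix), and cluster the top-$k$ eigenvectors with k-means. The interface and names are illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_with_constraints(Z, constraints, k, complete_fn):
    """Z: (n, d) item features; constraints: iterable of (i, j, s) with s = +1/-1;
    complete_fn(S_obs, mask, X, Y): any completion routine, e.g. a solver for (2)."""
    n = Z.shape[0]
    S_obs = np.zeros((n, n))
    mask = np.zeros((n, n), dtype=bool)
    for i, j, s in constraints:                 # fill in the observed signed similarities
        S_obs[i, j] = S_obs[j, i] = s
        mask[i, j] = mask[j, i] = True
    S_hat = complete_fn(S_obs, mask, Z, Z)      # complete with X = Y = Z
    S_hat = (S_hat + S_hat.T) / 2               # symmetrize before the eigendecomposition
    w, V = np.linalg.eigh(S_hat)
    top_k = V[:, np.argsort(w)[-k:]]            # top-k eigenvectors of the completed matrix
    return KMeans(n_clusters=k, n_init=10).fit_predict(top_k)
```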
We run each method on three real-world datasets: Mushrooms, Segment and Covtype.¹ All of them are classification benchmarks for which both the features and the ground-truth class of each item are available, and their statistics are summarized in Table 2. For each dataset, we randomly sample $m = \{1, 5, 10, 15, 20, 25, 30\} \times n$ pairwise constraints and run each algorithm to derive a clustering $\pi$, where $\pi_i$ is the cluster index of item $i$. We then evaluate $\pi$ by the following pairwise error with respect to the ground truth:
$$\text{pairwise error} = \frac{1}{\binom{n}{2}}\bigg(\sum_{(i,j):\,\pi^*_i = \pi^*_j} \mathbf{1}(\pi_i \neq \pi_j) \;+\; \sum_{(i,j):\,\pi^*_i \neq \pi^*_j} \mathbf{1}(\pi_i = \pi_j)\bigg),$$
where $\pi^*_i$ is the ground-truth class of item $i$.

Figure 2 shows the results of each method on all three datasets. We first see that for the Mushrooms dataset, where features are perfect (100% training accuracy can be attained by a linear SVM), both MCCC and Dirty IMC obtain a perfect clustering, which shows that MCCC is indeed effective with perfect features. For the Segment and Covtype datasets, we observe that the performance of k-means and MCCC is dominated by feature quality. Although MCCC still benefits from constraint information, as it outperforms k-means, it clearly does not make the best use of the constraints: its performance does not improve even as the number of constraints increases. On the other hand, the error rate of Sign MC can always be driven down to 0 by increasing $m$; however, since it disregards features, it suffers a much higher error rate than the feature-based methods when constraints are few. We again see that Dirty IMC combines the advantages of MCCC and Sign MC, as it makes use of features when few constraints are observed, yet simultaneously leverages constraint information to avoid being trapped by feature noise. This experiment shows that our model outperforms state-of-the-art approaches for semi-supervised clustering.

**Acknowledgement.** We thank David Inouye and Hsiang-Fu Yu for helpful comments and discussions. This research was supported by NSF grants CCF-1320746 and CCF-1117055.

¹ All datasets are available at http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/. For Covtype, we subsample the entire dataset so that each cluster has a balanced size.

## References

[1] J. Abernethy, F. Bach, T. Evgeniou, and J.-P. Vert. A new approach to collaborative filtering: Operator estimation with spectral regularization. JMLR, 10:803-826, 2009.
[2] P. L. Bartlett and S. Mendelson. Rademacher and Gaussian complexities: Risk bounds and structural results. JMLR, 3:463-482, 2003.
[3] D. P. Bertsekas. Nonlinear Programming. Athena Scientific, Belmont, MA, 1999.
[4] E. Candès and Y. Plan. Matrix completion with noise. Proceedings of the IEEE, 98(6):925-936, 2010.
[5] E. Candès and B. Recht. Exact matrix completion via convex optimization. Commun. ACM, 55(6):111-119, 2012.
[6] E. J. Candès, X. Li, Y. Ma, and J. Wright. Robust principal component analysis? J. ACM, 58(3):11:1-11:37, 2011.
[7] E. J. Candès and T. Tao. The power of convex relaxation: Near-optimal matrix completion. IEEE Trans. Inf. Theor., 56(5):2053-2080, 2010.
[8] V. Chandrasekaran, P. A. Parrilo, and A. S. Willsky. Latent variable graphical model selection via convex optimization. The Annals of Statistics, 2012.
[9] T. Chen, W. Zhang, Q. Lu, K. Chen, Z. Zheng, and Y. Yu. SVDFeature: A toolkit for feature-based collaborative filtering. JMLR, 13:3619-3622, 2012.
[10] Y. Chen, S. Bhojanapalli, S. Sanghavi, and R. Ward. Coherent matrix completion. In ICML, 2014.
[11] Y. Chen, A. Jalali, S. Sanghavi, and H. Xu. Clustering partially observed graphs via convex optimization. JMLR, 15(1):2213-2238, 2014.
[12] K.-Y. Chiang, C.-J. Hsieh, N. Natarajan, I. S. Dhillon, and A. Tewari. Prediction and clustering in signed networks: A local to global perspective. JMLR, 15:1177-1213, 2014.
[13] J. V. Davis, B. Kulis, P. Jain, S. Sra, and I. S. Dhillon. Information-theoretic metric learning. In ICML, pages 209-216, 2007.
[14] U. Feige and G. Schechtman. On the optimality of the random hyperplane rounding technique for MAX CUT. Random Struct. Algorithms, 20(3):403-440, 2002.
[15] L. Grippo and M. Sciandrone. Globally convergent block-coordinate techniques for unconstrained optimization. Optimization Methods and Software, 10:587-637, 1999.
[16] C.-J. Hsieh, K.-Y. Chiang, and I. S. Dhillon. Low rank modeling of signed networks. In KDD, 2012.
[17] C.-J. Hsieh and P. A. Olsen. Nuclear norm minimization via active subspace selection. In ICML, 2014.
[18] P. Jain and I. S. Dhillon. Provable inductive matrix completion. CoRR, abs/1306.0626, 2013.
[19] A. Jalali, P. Ravikumar, S. Sanghavi, and C. Ruan. A dirty model for multi-task learning. In NIPS, 2010.
[20] S. M. Kakade, K. Sridharan, and A. Tewari. On the complexity of linear prediction: Risk bounds, margin bounds, and regularization. In NIPS, pages 793-800, 2008.
[21] R. Keshavan, A. Montanari, and S. Oh. Matrix completion from noisy entries. JMLR, 2010.
[22] Y. Koren, R. M. Bell, and C. Volinsky. Matrix factorization techniques for recommender systems. IEEE Computer, 42:30-37, 2009.
[23] B. Laurent and P. Massart. Adaptive estimation of a quadratic functional by model selection. The Annals of Statistics, 28(5):1302-1338, 2000.
[24] J. Leskovec, D. Huttenlocher, and J. Kleinberg. Predicting positive and negative links in online social networks. In WWW, 2010.
[25] Z. Li and J. Liu. Constrained clustering by spectral kernel learning. In ICCV, 2009.
[26] P. Massa and P. Avesani. Trust-aware bootstrapping of recommender systems. In Proceedings of the ECAI 2006 Workshop on Recommender Systems, pages 29-33, 2006.
[27] R. Meir and T. Zhang. Generalization error bounds for Bayesian mixture algorithms. JMLR, 2003.
[28] A. K. Menon, K.-P. Chitrapura, S. Garg, D. Agarwal, and N. Kota. Response prediction using collaborative filtering with hierarchies and side-information. In KDD, pages 141-149, 2011.
[29] N. Natarajan and I. S. Dhillon. Inductive matrix completion for predicting gene-disease associations. Bioinformatics, 30(12):60-68, 2014.
[30] S. Negahban and M. J. Wainwright. Restricted strong convexity and weighted matrix completion: Optimal bounds with noise. JMLR, 13(1):1665-1697, 2012.
[31] M. Rudelson and R. Vershynin. Smallest singular value of a random rectangular matrix. Comm. Pure Appl. Math., pages 1707-1739, 2009.
[32] O. Shamir and S. Shalev-Shwartz. Matrix completion with the trace norm: Learning, bounding, and transducing. JMLR, 15(1):3401-3423, 2014.
[33] D. Shin, S. Cetintas, K.-C. Lee, and I. S. Dhillon. Tumblr blog recommendation with boosted inductive matrix completion. In CIKM, pages 203-212, 2015.
[34] N. Srebro and A. Shraibman. Rank, trace-norm and max-norm. In COLT, pages 545-560, 2005.
[35] M. Xu, R. Jin, and Z.-H. Zhou. Speedup matrix completion with side information: Application to multi-label learning. In NIPS, 2013.
[36] E. Yang and P. Ravikumar. Dirty statistical models. In NIPS, 2013.
[37] J. Yi, L. Zhang, R. Jin, Q. Qian, and A. Jain. Semi-supervised clustering by input pattern assisted pairwise similarity matrix completion. In ICML, 2013.
[38] K. Zhong, P. Jain, and I. S. Dhillon. Efficient matrix sensing using rank-1 Gaussian measurements.
In International Conference on Algorithmic Learning Theory (ALT), 2015.