# PU Learning for Matrix Completion

Cho-Jui Hsieh (CJHSIEH@CS.UTEXAS.EDU), Nagarajan Natarajan (NAGA86@CS.UTEXAS.EDU), Inderjit S. Dhillon (INDERJIT@CS.UTEXAS.EDU)
Department of Computer Science, The University of Texas, Austin, TX 78721, USA

In this paper, we consider the matrix completion problem when the observations are one-bit measurements of some underlying matrix $M$, and in particular the observed samples consist only of ones and no zeros. This problem is motivated by modern applications such as recommender systems and social networks where only "likes" or "friendships" are observed. The problem is an instance of PU (positive-unlabeled) learning, i.e., learning from only positive and unlabeled examples, which has been studied in the context of binary classification. Under the assumption that $M$ has bounded nuclear norm, we provide recovery guarantees for two different observation models: 1) $M$ parameterizes a distribution that generates a binary matrix, and 2) $M$ is thresholded to obtain a binary matrix. For the first case, we propose a shifted matrix completion method that recovers $M$ using only a subset of indices corresponding to ones; for the second case, we propose a biased matrix completion method that recovers the (thresholded) binary matrix. Both methods yield strong error bounds: if $M \in \mathbb{R}^{n \times n}$, the error is bounded as $O\!\left(\frac{1}{(1-\rho)n}\right)$, where $1-\rho$ denotes the fraction of ones observed. This implies a sample complexity of $O(n \log n)$ ones to achieve a small error, when $M$ is dense and $n$ is large. We extend our analysis to the inductive matrix completion problem, where rows and columns of $M$ have associated features. We develop efficient and scalable optimization procedures for both the proposed methods and demonstrate their effectiveness for link prediction (on real-world networks consisting of over 2 million nodes and 90 million links) and semi-supervised clustering tasks.

## 1. Introduction

The problem of recovering a matrix from a given subset of its entries arises in many practical problems of interest. The famous Netflix problem of predicting user-movie ratings is one example that motivates the traditional matrix completion problem, where we want to recover the underlying (ratings) matrix given partial observations. Strong theoretical guarantees have been developed in the recent past for the low-rank matrix completion problem (Candès & Recht, 2009). An important variant of the matrix completion problem is to recover an underlying matrix from one-bit quantization of its entries.

Modern applications of the matrix completion problem reveal a conspicuous gap between existing matrix completion theory and practice. For example, consider the problem of link prediction in social networks. Here, the goal is to recover the underlying friendship network from a given snapshot of the social graph consisting of observed friendships. We can pose the problem as recovering the adjacency matrix $A$ of the network, such that $A_{ij} = 1$ if users $i$ and $j$ are related and $A_{ij} = 0$ otherwise. In practice, we only observe positive relationships between users, corresponding to 1's in $A$. Thus, there is not only one-bit quantization in the observations, but also a one-sided nature to the sampling process: no negative entries are sampled.
In the context of classification, methods for learning in the presence of positive and unlabeled examples only, called positive-unlabeled (PU) learning, have been studied in the past (Elkan & Noto, 2008; Liu et al., 2003). For matrix completion, can one guarantee recovery when only a subset of positive entries is observed? In this paper, we formulate the PU matrix completion problem and answer the question in the affirmative under different settings.

Minimizing the squared loss on the observed entries corresponding to 1's, subject to low-rank constraints, yields a degenerate solution: the rank-1 matrix with all its entries equal to 1 achieves zero loss. In practice, a popular heuristic is to complete the matrix by treating some or all of the missing observations as true 0's, which seems to be a good strategy when the underlying matrix has a small number of positive entries, i.e., a small number of 1's. This motivates viewing the problem of learning from only positive samples as a certain noisy matrix completion problem. Existing theory for noise-tolerant matrix completion (Candès & Plan, 2009; Davenport et al., 2012) does not sufficiently address recoverability under PU learning (see Section 2).

In our work, we assume that the true matrix $M \in \mathbb{R}^{m \times n}$ has a bounded nuclear norm $\|M\|_*$. The PU learning model for matrix completion is specified by a one-bit quantization process that generates a binary matrix $Y$ from $M$, and a one-sided sampling process that reveals a subset of the positive entries of $Y$. In particular, we consider two recovery settings for PU matrix completion. The first setting is non-deterministic: $M$ parameterizes a probability distribution which is used to generate the entries of $Y$. We show that it is possible to recover $M$ using only a subset of the positive entries of $Y$. The idea is to minimize an unbiased estimator of the squared loss between the estimated and the observed noisy entries, motivated by the approach in (Natarajan et al., 2013). We recast the objective as a shifted matrix completion problem that facilitates a scalable optimization algorithm. The second setting is deterministic: $Y$ is obtained by thresholding the entries of $M$, and then a subset of the positive entries of $Y$ is revealed. While recovery of $M$ is not possible (see Section 2), we show that we can recover $Y$ with low error. To this end, we propose a scalable biased matrix completion method where the observed and the unobserved entries of $Y$ are penalized differently.

Recently, an inductive approach to matrix completion was proposed (Jain & Dhillon, 2013) where the matrix entries are modeled as a bilinear function of real-valued features associated with the rows and the columns. We extend our methods under the two aforementioned settings to the inductive matrix completion problem and establish similar recovery guarantees. Our contributions are summarized below:

1. To the best of our knowledge, this is the first paper to formulate and study PU learning for matrix completion, necessitated by the applications of matrix completion. Furthermore, we extend our results to the recently proposed inductive matrix completion problem.
2. We provide strong guarantees for recovery; for example, in the non-deterministic setting, the error in recovering an $n \times n$ matrix is $O\!\left(\frac{1}{(1-\rho)n}\right)$ for our method compared to $O\!\left(\frac{1}{(1-\rho)\sqrt{n}}\right)$ implied by the method in (Davenport et al., 2012), where $1-\rho$ is the fraction of observed 1's.
3. Our results provide a theoretical insight for the heuristic approach used in practice, namely, biased matrix completion.
4. We give efficient, scalable optimization algorithms for our methods; experiments on real-world data (social networks consisting of over 2 million users and 90 million links) demonstrate the superiority of the proposed methods for the link prediction problem.

Outline of the paper. We begin by establishing some hardness results and describing our PU learning settings in Section 2. In Section 3, we propose methods and give recovery guarantees for the matrix completion problem under the different settings. We extend the results to PU learning for the inductive matrix completion problem in Section 4. We describe efficient optimization procedures for the proposed methods in Section 5. Experimental results on synthetic and real-world data are presented in Section 6.

Related Work. In the last few years, there has been a tremendous amount of work on the theory of matrix completion since the remarkable result concerning recovery of low-rank matrices by Candès & Recht (2009). Strong results on recovery from noisy observations have also been established (Candès & Plan, 2009; Keshavan et al., 2010). Recently, Davenport et al. (2012) studied the problem of recovering matrices from 1-bit observations, motivated by the nature of observations in domains such as recommender systems where matrix completion is heavily applied. Our work draws motivation from recommender systems as well, but differs from (Davenport et al., 2012) in that we seek to understand the case when only 1's in the matrix are observed. One of the algorithms we propose for PU matrix completion is based on using different costs in the objective for observed and unobserved entries. The approach has been used before, albeit heuristically, in the context of matrix completion in recommender system applications (Hu et al., 2008; Rendle et al., 2009; Sindhwani et al., 2010). Compressed sensing is a field that is closely related to matrix completion. Recently, compressed sensing theory has been extended to the case of single-bit quantization, where the goal is to recover an $s$-sparse signal when the observations consist of only the signs of the measurements; remarkable recovery guarantees have been proved for this single-bit quantization case (Boufounos & Baraniuk, 2008).

## 2. Problem Settings

We assume that the underlying matrix $M \in \mathbb{R}^{m \times n}$ has a bounded nuclear norm, i.e., $\|M\|_* \leq t$, where $t$ is a constant independent of $m$ and $n$. If $M_{ij} \in \{0, 1\}$ for all $(i, j)$, stating the PU matrix completion problem is straightforward: we only observe a subset $\Omega_1$ randomly sampled from $\{(i, j) \mid M_{ij} = 1\}$, and the goal is to recover $M$ based on this one-sided sampling. We call this the "basic setting". However, in real-world applications it is unlikely that the underlying matrix is binary. In the following, we consider two general settings, which include the basic setting as a special case.

### 2.1. Non-deterministic setting

In the non-deterministic setting, we assume $M_{ij}$ has bounded values; without loss of generality we can assume $M_{ij} \in [0, 1]$ for all $(i, j)$ by normalizing.
We then consider each entry as specifying a probability distribution which generates a clean 0-1 matrix $Y \in \mathbb{R}^{m \times n}$:

$$P(Y_{ij} = 1) = M_{ij}, \qquad P(Y_{ij} = 0) = 1 - M_{ij}.$$

In the classical matrix completion setting, we would observe partial entries sampled randomly from $Y$; in our PU learning model, we assume only a subset of the positive entries of $Y$ is observed. More precisely, we observe a subset $\Omega_1$ of $Y$, where $\Omega_1$ is sampled uniformly from $\{(i, j) \mid Y_{ij} = 1\}$. We assume $|\Omega_1| = \bar{s}$ and denote the number of 1's in $Y$ by $s$. With only $\Omega_1$ given, the goal of PU matrix completion is to recover the underlying matrix $M$. Equivalently, letting $A \in \{0, 1\}^{m \times n}$ denote the observations, where $A_{ij} = 1$ for $(i, j) \in \Omega_1$ and $A_{ij} = 0$ for all $(i, j) \notin \Omega_1$, the non-deterministic setting can be specified as observing $A$ generated by the process

$$P(A_{ij} = 1) = M_{ij}(1-\rho), \qquad P(A_{ij} = 0) = 1 - M_{ij}(1-\rho), \tag{1}$$

where $\rho = 1 - \bar{s}/s$ is the noise rate of flipping a 1 to 0 (equivalently, $1-\rho$ is the sampling rate used to obtain $\Omega_1$ from $Y$).

Hardness of recovering $M$: The 1-bit matrix completion approach of (Davenport et al., 2012) can be applied to this setting: given a matrix $M$, a subset $\Omega$ is sampled uniformly at random from $M$, and the observed values are quantized by a known probability distribution. We can transform our problem to the 1-bit matrix completion problem by assuming all the unobserved entries are zeros. When $M \in \mathbb{R}^{n \times n}$, the following maximum likelihood estimator can be used to recover the ground truth:

$$\hat{M} = \underset{X:\ \|X\|_* \leq t}{\operatorname{argmax}} \ \sum_{(i,j) \in \Omega:\ Y_{ij}=1} \log\big(f(X_{ij})\big) + \sum_{(i,j) \in \Omega:\ Y_{ij}=0} \log\big(1 - f(X_{ij})\big), \tag{2}$$

and the following error bound for the estimator can be obtained from the result in (Davenport et al., 2012):

$$\frac{1}{n^2}\|\hat{M} - M\|_F^2 = O\!\left(\frac{1}{1-\rho}\sqrt{\frac{r}{n}}\right), \tag{3}$$

where $\operatorname{rank}(M) \leq r$ (see Appendix B for details). The main drawback of using this approach for PU matrix completion is computational: the time complexity of solving (2) is $O(n^2)$, which makes the approach prohibitive for large matrices. Moreover, the average error on each element is $O(1/\sqrt{n})$ (in contrast, our algorithm has $O(1/n)$ average error). To see how this affects the sample complexity for recovery, assume $\sum_{i,j} M_{ij} = O(n^2)$ (the number of 1's is of the same order as the number of 0's in the original matrix) and $O(n \log n)$ 1's are observed. Then $1-\rho = O\!\left(\frac{\log n}{n}\right)$, and the average error according to (3) is $\frac{1}{n^2}\|\hat{M} - M\|_F^2 = O\!\left(\frac{\sqrt{rn}}{\log n}\right)$, which diverges as $n \to \infty$. In contrast, the average error of our estimator diminishes to 0 as $n \to \infty$.

### 2.2. Deterministic setting

In the deterministic setting, a clean 0-1 matrix $Y$ is obtained from $M$ by the thresholding process $Y_{ij} = I(M_{ij} > q)$, where $I(\cdot)$ is the indicator function and $q \in \mathbb{R}$ is the threshold. Again, in our PU learning model, we assume only a subset of the positive entries of $Y$ is observed, i.e., we observe $\Omega_1$ from $Y$, where $\Omega_1$ is sampled uniformly from $\{(i, j) \mid Y_{ij} = 1\}$. Equivalently, we use $A$ to denote the observations, where $A_{ij} = 1$ if $(i, j) \in \Omega_1$ and $A_{ij} = 0$ otherwise.

It is impossible to recover $M$ even if we observe all the entries of $Y$. A trivial example: every matrix $\eta ee^T$ with $\eta > q$ gives $Y = ee^T$, and we cannot recover $\eta$ from $Y$. Therefore, in the deterministic setting we can only hope to recover the underlying 0-1 matrix $Y$ from the given observations. To the best of our knowledge, there is no existing work that gives a reasonable guarantee of recovering $Y$. For example, if we apply the noisy matrix completion algorithm proposed in (Candès & Plan, 2009), the resulting estimator $\hat{Y}$ only has an error bound of the form $\|\hat{Y} - Y\|_F \lesssim \|A - Y\|_F$, which indicates that the error of $\hat{Y}$ is not guaranteed to be better than that of the trivial estimator $A$ (see Appendix C for details).
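Both observation processes are easy to simulate. The following NumPy sketch (our illustration, not part of the paper) generates $Y$ and the one-sided observations $A$ under the two settings; for simplicity it reveals each 1 of $Y$ independently with probability $1-\rho$, an i.i.d. approximation of sampling a uniform subset $\Omega_1$ of fixed size $\bar{s}$.

```python
import numpy as np

rng = np.random.default_rng(0)

def observe_nondeterministic(M, rho):
    """Non-deterministic setting, eq. (1): Y_ij ~ Bernoulli(M_ij), then each
    1 of Y is revealed independently with probability 1 - rho."""
    Y = (rng.random(M.shape) < M).astype(float)     # clean 0-1 matrix
    reveal = rng.random(M.shape) < (1.0 - rho)      # one-sided sampling of the 1's
    A = Y * reveal                                  # A_ij = 1 only on revealed ones
    return Y, A

def observe_deterministic(M, q, rho):
    """Deterministic setting: Y_ij = I(M_ij > q), then a (1 - rho) fraction
    of the 1's of Y is revealed."""
    Y = (M > q).astype(float)
    reveal = rng.random(M.shape) < (1.0 - rho)
    A = Y * reveal
    return Y, A
```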
## 3. Proposed Algorithms for PU Matrix Completion

In this section, we introduce two algorithms: shifted matrix completion for non-deterministic PU matrix completion, and biased matrix completion for deterministic PU matrix completion. All proofs are deferred to the Appendix.

### 3.1. Shifted Matrix Completion for Non-deterministic Setting (ShiftMC)

We want to find a matrix $X$ such that the loss $\|M - X\|_F^2$ is bounded, using the noisy observation matrix $A$ generated from $M$ by (1). Observe that, conditioned on $Y$, the noise in $A_{ij}$ is asymmetric, i.e., $P(A_{ij} = 0 \mid Y_{ij} = 1) = \rho$ and $P(A_{ij} = 1 \mid Y_{ij} = 0) = 0$. Asymmetric label noise has been studied in the context of binary classification, and recently (Natarajan et al., 2013) proposed an unbiased-estimator method to bound the true loss using only noisy observations. In our case, we aim to find a matrix minimizing an unbiased estimator defined on each element, which leads to the following optimization problem:

$$\min_X \ \sum_{i,j} \tilde{\ell}(X_{ij}, A_{ij}) \quad \text{such that } \|X\|_* \leq t,\ \ 0 \leq X_{ij} \leq 1\ \ \forall (i,j), \tag{4}$$

where

$$\tilde{\ell}(X_{ij}, A_{ij}) = \begin{cases} \dfrac{(X_{ij}-1)^2 - \rho X_{ij}^2}{1-\rho} & \text{if } A_{ij} = 1, \\[6pt] X_{ij}^2 & \text{if } A_{ij} = 0. \end{cases} \tag{5}$$

The bound constraint on $X$ in the above estimator ensures the loss has a bounded Lipschitz constant. This optimization problem is equivalent to the traditional trace-norm regularized problem

$$\min_X \ \sum_{i,j} \tilde{\ell}(X_{ij}, A_{ij}) + \lambda\|X\|_* \quad \text{such that } 0 \leq X_{ij} \leq 1\ \ \forall (i,j), \tag{6}$$

where $\lambda$ has a one-to-one mapping to $t$. We use $\tilde{\ell}$ instead of the original loss $\ell$ because it is an unbiased estimator of the underlying squared loss $\ell(X_{ij}, Y_{ij}) = (X_{ij} - Y_{ij})^2$, as formalized below: by using $\tilde{\ell}$ on the observed $A_{ij}$, we minimize the loss with respect to $Y_{ij}$ in expectation.

Lemma 1. For any $X \in \mathbb{R}^{m \times n}$, $\ \frac{1}{mn}\mathbb{E}\Big[\sum_{i,j}(X_{ij} - Y_{ij})^2\Big] = \frac{1}{mn}\mathbb{E}\Big[\sum_{i,j}\tilde{\ell}(X_{ij}, A_{ij})\Big]$.

Interestingly, we can rewrite $\tilde{\ell}$ as $\tilde{\ell}(X_{ij}, 1) = \left(X_{ij} - \frac{1}{1-\rho}\right)^2 - \frac{\rho}{(1-\rho)^2}$. Therefore, (6) can be rewritten as the following shifted matrix completion problem:

$$\hat{X} = \underset{X}{\operatorname{argmin}} \ \sum_{i,j:\,A_{ij}=1}\!\Big(X_{ij} - \frac{1}{1-\rho}\Big)^2 + \sum_{i,j:\,A_{ij}=0}\! X_{ij}^2 + \lambda\|X\|_* \quad \text{s.t. } 0 \leq X_{ij} \leq 1\ \ \forall (i,j). \tag{7}$$

We want to show that the average error of the ShiftMC estimator $\hat{X}$ decays as $O(1/n)$. In order to do so, we first need to bound the difference between the expected error and the empirical error. We define the hypothesis space $\mathcal{X} := \{X \mid X \in \mathbb{R}^{m \times n},\ \|X\|_* \leq t\}$. The expected error can be written as $\mathbb{E}_A[R_{\tilde{\ell}}(X)] = \mathbb{E}_A\big[\frac{1}{mn}\sum_{i,j}\tilde{\ell}(X_{ij}, A_{ij})\big]$, and the empirical error is $\hat{R}_{\tilde{\ell}}(X) = \frac{1}{mn}\sum_{i,j}\tilde{\ell}(X_{ij}, A_{ij})$. We first show that the difference between the expected error and the empirical error can be upper bounded:

Theorem 1. Let $\mathcal{X} := \{X \in \mathbb{R}^{m \times n} \mid \|X\|_* \leq t,\ 0 \leq X_{ij} \leq 1\}$. Then, with probability at least $1-\delta$, for all $X \in \mathcal{X}$,

$$\mathbb{E}_A[R_{\tilde{\ell}}(X)] - \hat{R}_{\tilde{\ell}}(X) \ \leq\ \frac{tC\big(\sqrt{n} + \sqrt{m} + 4\sqrt{\bar{s}}\big)}{(1-\rho)mn} + 3\sqrt{\frac{\log(2/\delta)}{mn(1-\rho)}},$$

where $C$ is a constant, $\mathbb{E}_A[R_{\tilde{\ell}}(X)] := \mathbb{E}_A\big[\frac{1}{mn}\sum_{i,j}\tilde{\ell}(X_{ij}, A_{ij})\big]$ is the expected error, and $\hat{R}_{\tilde{\ell}}(X) = \frac{1}{mn}\sum_{i,j}\tilde{\ell}(X_{ij}, A_{ij})$ is the empirical error.

Combining Lemma 1 and Theorem 1, we have our first main result:

Theorem 2 (Main Result 1). With probability at least $1-\delta$,

$$\frac{1}{mn}\sum_{i,j}(M_{ij} - \hat{X}_{ij})^2 \ \leq\ 6\sqrt{\frac{\log(2/\delta)}{mn(1-\rho)}} + \frac{2Ct\big(\sqrt{n} + \sqrt{m} + 4\sqrt{\bar{s}}\big)}{(1-\rho)mn}.$$

The average error is of the order of $O\!\left(\frac{1}{n(1-\rho)}\right)$ when $M \in \mathbb{R}^{n \times n}$, where $1-\rho$ denotes the fraction of observed 1's. This shows that even when we observe only a very small fraction of the 1's in the matrix, we can still estimate $M$ accurately when $n$ is large enough.
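To make the estimator concrete, here is a small NumPy sketch (our own illustration, with arbitrary sizes and the i.i.d. sampling approximation used in the earlier snippet) of the unbiased loss (5), together with a Monte Carlo check of Lemma 1: averaged over many draws of $(Y, A)$, the surrogate loss matches the clean squared loss against $Y$.

```python
import numpy as np

def shifted_loss(X, A, rho):
    """Unbiased surrogate (5): entries with A_ij = 1 are pulled toward the
    shifted target 1/(1 - rho); entries with A_ij = 0 are pulled toward 0."""
    pos = ((X - 1.0) ** 2 - rho * X ** 2) / (1.0 - rho)   # case A_ij = 1
    neg = X ** 2                                          # case A_ij = 0
    return np.where(A == 1, pos, neg).sum()

rng = np.random.default_rng(0)
M = rng.random((50, 50))      # hypothetical ground truth in [0, 1]
X = rng.random((50, 50))      # an arbitrary candidate estimate
rho = 0.9
clean, surrogate = [], []
for _ in range(2000):
    Y = (rng.random(M.shape) < M).astype(float)
    A = Y * (rng.random(M.shape) < (1 - rho))
    clean.append(((X - Y) ** 2).sum())
    surrogate.append(shifted_loss(X, A, rho))
print(np.mean(clean), np.mean(surrogate))  # the two averages should be close (Lemma 1)
```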
### 3.2. Biased Matrix Completion for Deterministic Setting (BiasMC)

In the deterministic setting, we propose to solve the matrix completion problem with a label-dependent loss (Scott, 2012). Let $\ell(x, a) = (x - a)^2$ denote the squared loss, for $a \in \{0, 1\}$. The $\alpha$-weighted loss is defined by

$$\ell_\alpha(x, a) = \alpha\, \mathbf{1}_{a=1}\, \ell(x, 1) + (1-\alpha)\, \mathbf{1}_{a=0}\, \ell(x, 0), \tag{8}$$

where $\mathbf{1}_{a=1}, \mathbf{1}_{a=0}$ are indicator functions. We then recover the ground truth by solving the following biased matrix completion (BiasMC) problem:

$$\hat{X} = \underset{X:\,\|X\|_* \leq t}{\operatorname{argmin}} \ \sum_{i,j} \ell_\alpha(X_{ij}, A_{ij}) \ =\ \underset{X:\,\|X\|_* \leq t}{\operatorname{argmin}} \ \alpha\!\!\sum_{i,j:\,A_{ij}=1}\!\!(X_{ij} - 1)^2 + (1-\alpha)\!\!\sum_{i,j:\,A_{ij}=0}\!\! X_{ij}^2. \tag{9}$$

The underlying binary matrix $Y$ is then recovered by the thresholding operator, $\bar{X}_{ij} = I(\hat{X}_{ij} > q)$. A similar formulation has been used in (Sindhwani et al., 2010) to recommend items to users in a who-bought-what network. Here, we show that this biased matrix factorization technique can be used to provably recover $Y$.

For convenience, we define the thresholding operator $\operatorname{thr}(x) = 1$ if $x > q$ and $\operatorname{thr}(x) = 0$ if $x \leq q$. We first define the recovery error as $R(X) = \frac{1}{mn}\sum_{i,j}\mathbf{1}_{\operatorname{thr}(X_{ij}) \neq Y_{ij}}$, where $Y$ is the underlying 0-1 matrix. Define the label-dependent error

$$U_\alpha(x, a) = (1-\alpha)\,\mathbf{1}_{\operatorname{thr}(x)=1}\,\mathbf{1}_{a=0} + \alpha\,\mathbf{1}_{\operatorname{thr}(x)=0}\,\mathbf{1}_{a=1}, \tag{10}$$

and the $\alpha$-weighted expected error $R_{\alpha,\rho}(X) = \mathbb{E}\big[\sum_{i,j} U_\alpha(X_{ij}, A_{ij})\big]$. The following lemma is a special case of Theorem 9 in (Natarajan et al., 2013), showing that $R(X)$ and $R_{\alpha,\rho}(X)$ can be related by a linear transformation:

Lemma 2. For the choice $\alpha^* = \frac{1+\rho}{2}$, there exist a constant $\beta > 0$ and a constant $b$, both independent of $X$, such that for any matrix $X$, $R_{\alpha^*\!,\rho}(X) = \beta R(X) + b$.

Therefore, minimizing the $\alpha$-weighted expected error in the partially observed situation is equivalent to minimizing the true recovery error $R$. By further relating $R_{\alpha^*\!,\rho}(X)$ and $R_{\ell_{\alpha^*},\rho}(X) := \mathbb{E}\big[\sum_{i,j}\ell_{\alpha^*}(X_{ij}, A_{ij})\big]$, we can show:

Theorem 3 (Main Result 2). Let $\hat{X}$ be the minimizer of (9), and let $\bar{X}$ be the thresholded 0-1 matrix of $\hat{X}$. Then, with probability at least $1-\delta$, we have

$$R(\bar{X}) \ \leq\ \frac{2\eta}{1-\rho}\left(\frac{Ct\big(\sqrt{n} + \sqrt{m} + 4\sqrt{\bar{s}}\big)}{mn} + 3\sqrt{\frac{\log(2/\delta)}{mn(1-\rho)}}\right),$$

where $\eta = \max(1/q^2, 1/(1-q)^2)$ and $C$ is a constant. The average error is of the order of $O\!\left(\frac{1}{n(1-\rho)}\right)$ when $M \in \mathbb{R}^{n \times n}$, where $1-\rho$ denotes the fraction of observed 1's, similar to the ShiftMC estimator.
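The biased objective and the thresholded recovery are straightforward to express in code. The sketch below (ours, not from the paper) writes (9) in penalized form, with a nuclear-norm term standing in for the constraint $\|X\|_* \leq t$; the function names and the penalized form are illustrative choices.

```python
import numpy as np

def biased_objective(X, A, alpha, lam):
    """alpha-weighted squared loss of eq. (9) plus a nuclear-norm penalty
    (a Lagrangian stand-in for the constraint ||X||_* <= t)."""
    pos = alpha * ((X[A == 1] - 1.0) ** 2).sum()          # observed ones pulled toward 1
    neg = (1 - alpha) * (X[A == 0] ** 2).sum()            # unobserved entries pulled toward 0
    nuc = lam * np.linalg.svd(X, compute_uv=False).sum()  # nuclear norm of X
    return pos + neg + nuc

def recover_binary(X_hat, q):
    """Recover the 0-1 matrix by thresholding the estimate at q."""
    return (X_hat > q).astype(int)
```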
## 4. PU Inductive Matrix Completion

In this section, we extend our approaches to the inductive matrix completion problem, where, in addition to the samples, row and column features $F_u \in \mathbb{R}^{m \times d}$ and $F_v \in \mathbb{R}^{n \times d}$ are also given. In the standard inductive matrix completion problem (Jain & Dhillon, 2013), the observations $A_\Omega$ are sampled from the ground truth $M \in \mathbb{R}^{m \times n}$, and we want to recover $M$ by solving the following optimization problem:

$$\min_{D \in \mathbb{R}^{d \times d}} \ \sum_{(i,j)\in\Omega} \big(A_{ij} - (F_u D F_v^T)_{ij}\big)^2 + \lambda\|D\|_*. \tag{11}$$

Matrix completion is a special case of inductive matrix completion with $F_u = I$, $F_v = I$. In the multi-label learning problem, $M$ represents the label matrix and $F_u$ corresponds to examples (typically $F_v = I$) (Yu et al., 2014; Xu et al., 2013). The technique has also been applied to gene-disease prediction (Natarajan & Dhillon, 2014) and semi-supervised clustering (Yi et al., 2013), and studied theoretically in (Jain & Dhillon, 2013). The problem is fairly recent, and we wish to extend the PU learning analysis to it, which is also well motivated in many real-world applications. For example, in multi-label learning with partially observed labels, negative labels are usually not available. In the experiments, we consider another interesting application: the semi-supervised clustering problem with only positive and unlabeled relationships.

### 4.1. Shifted Inductive Matrix Completion for Non-deterministic Setting (ShiftIMC)

In the non-deterministic setting, we consider the inductive version of ShiftMC:

$$\min_{D \in \mathbb{R}^{d \times d}} \ \sum_{i,j} \tilde{\ell}\big((F_u D F_v^T)_{ij}, A_{ij}\big) \quad \text{s.t. } \|D\|_* \leq t,\ \ 0 \leq (F_u D F_v^T)_{ij} \leq 1\ \ \forall (i,j), \tag{12}$$

where the unbiased estimator of the loss $\tilde{\ell}(\cdot)$ is defined in (5). Note that we can assume $F_u, F_v$ are orthogonal (otherwise we can conduct a preprocessing step to normalize them). Let $u_i$ be the $i$-th row of $F_u$ (the feature vector for row $i$) and $v_j$ be the $j$-th row of $F_v$. We define the constants $X_u = \max_i \|u_i\|$ and $X_v = \max_j \|v_j\|$. Since the output of inductive matrix completion is $F_u D F_v^T$, it can only recover the original matrix when the underlying matrix $M$ can be written in this form. Following (Xu et al., 2013; Yi et al., 2013), we assume the features are good enough such that $M = F_u F_u^T M F_v F_v^T$. Recall $\|M\|_* \leq t$. We now extend Theorem 2 to PU inductive matrix completion.

Theorem 4. Assume $\hat{D}$ is the optimal solution of (12) and the ground truth $M$ lies in the subspace formed by $F_u$ and $F_v$, i.e., $M = F_u F_u^T M F_v F_v^T$, and let $\hat{M} = F_u \hat{D} F_v^T$. Then, with probability at least $1-\delta$,

$$\frac{1}{mn}\|M - \hat{M}\|_F^2 \ \leq\ 6\sqrt{\frac{\log(2/\delta)}{mn(1-\rho)}} + \frac{4t\sqrt{\log 2d}}{(1-\rho)\sqrt{mn}}\,X_u X_v.$$

Therefore, if $t$ and $d$ are bounded, the mean squared error of ShiftIMC is $O(1/n)$ in the non-deterministic setting.

### 4.2. Biased Inductive Matrix Completion for Deterministic Setting (BiasIMC)

In the deterministic setting, we propose to solve the inductive version of BiasMC:

$$\hat{D} = \underset{D:\,\|D\|_* \leq t}{\operatorname{argmin}} \ \alpha\!\!\sum_{i,j:\,A_{ij}=1}\!\!\big((F_u D F_v^T)_{ij} - 1\big)^2 + (1-\alpha)\!\!\sum_{i,j:\,A_{ij}=0}\!\!(F_u D F_v^T)_{ij}^2. \tag{14}$$

The clean 0-1 matrix $Y$ can then be recovered by

$$\hat{Y}_{ij} = I\big((F_u \hat{D} F_v^T)_{ij} > q\big) = \begin{cases} 1 & \text{if } (F_u \hat{D} F_v^T)_{ij} > q, \\ 0 & \text{if } (F_u \hat{D} F_v^T)_{ij} \leq q. \end{cases} \tag{15}$$

Similar to the case of matrix completion, Lemma 2 shows that the expected 0-1 error $R(X)$ and the $\alpha$-weighted expected error under noisy observations $R_{\alpha,\rho}(X)$ can be related by a linear transformation when $\alpha^* = \frac{1+\rho}{2}$. With this choice of $\alpha^*$, Lemma 2 continues to hold in this case, which allows us to extend Theorem 3 to PU inductive matrix completion and bound $R(\hat{Y}) = \frac{1}{mn}\|Y - \hat{Y}\|_F^2$:

Theorem 5. Let $\hat{D}$ be the minimizer of (14) with $\alpha = (1+\rho)/2$, and let $\hat{Y}$ be generated from $\hat{D}$ by (15). Then, with probability at least $1-\delta$, we have

$$R(\hat{Y}) \ \leq\ \frac{2\eta}{1-\rho}\left(\frac{4t\sqrt{\log 2d}}{\sqrt{mn}}\,X_u X_v + 6\sqrt{\frac{\log(2/\delta)}{mn(1-\rho)}}\right),$$

where $\eta = \max(1/q^2, 1/(1-q)^2)$. Again, if $t$ and $d$ are bounded, the mean squared error of BiasIMC is $O(1/n)$.
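For illustration, the sketch below (ours, under the same assumptions as the earlier snippets) shows how the row and column features enter the biased inductive objective (14) through the bilinear map $F_u D F_v^T$, and how the 0-1 matrix is recovered via (15).

```python
import numpy as np

def biased_imc_objective(D, Fu, Fv, A, alpha, lam):
    """alpha-weighted objective of eq. (14) with a nuclear-norm penalty on D
    standing in for the constraint ||D||_* <= t."""
    P = Fu @ D @ Fv.T                                      # bilinear predictions (m x n)
    pos = alpha * ((P[A == 1] - 1.0) ** 2).sum()           # observed ones
    neg = (1 - alpha) * (P[A == 0] ** 2).sum()             # everything else
    return pos + neg + lam * np.linalg.svd(D, compute_uv=False).sum()

def recover_Y(D_hat, Fu, Fv, q):
    """Eq. (15): threshold the bilinear estimate to recover the 0-1 matrix."""
    return (Fu @ D_hat @ Fv.T > q).astype(int)
```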
## 5. Optimization Techniques for PU Matrix Completion

In this section, we show that BiasMC can be solved very efficiently for large-scale datasets (millions of rows and columns), and that ShiftMC can be solved efficiently after a relaxation. First, consider the optimization problem for BiasMC:

$$\underset{X}{\operatorname{argmin}} \ \alpha\!\!\sum_{i,j:\,A_{ij}=1}\!\!(X_{ij} - 1)^2 + (1-\alpha)\!\!\sum_{i,j:\,A_{ij}=0}\!\! X_{ij}^2 + \lambda\|X\|_* \ :=\ f_b(X) + \lambda\|X\|_*, \tag{16}$$

which is equivalent to the constrained problem (9) with a suitable $\lambda$. The typical proximal gradient descent update is $X \leftarrow S(X - \eta\nabla f_b(X), \lambda)$, where $\eta$ is the learning rate and $S$ is the soft-thresholding operator on singular values (Ji & Ye, 2009). The (approximate) SVD of $G := X - \eta\nabla f_b(X)$ can be computed efficiently using the power method or the Lanczos algorithm if we have a fast procedure to compute $GP$ for a tall-and-thin matrix $P \in \mathbb{R}^{n \times k}$. In order to do so, we first rewrite $f_b(X)$ as

$$f_b(X) = (1-\alpha)\|X - A\|_F^2 + (2\alpha - 1)\!\!\sum_{i,j:\,A_{ij}=1}\!\!(X_{ij} - A_{ij})^2. \tag{17}$$

Assume the current solution is stored in low-rank form $X = WH^T$ and $R = (X - A)_{\Omega_1}$ is the residual on $\Omega_1$. Then

$$GP = XP - 2\eta\big[(1-\alpha)(X - A) + (2\alpha - 1)R\big]P = \big(1 - 2\eta(1-\alpha)\big)WH^TP + 2\eta\big[(1-\alpha)A - (2\alpha - 1)R\big]P,$$

where the first term can be computed in $O(mk^2 + nk^2)$ flops and the remaining terms can be computed in $O(|\Omega_1|k)$ flops. With this approach, we can efficiently compute the proximal operator. The same idea can also be applied to other, faster nuclear norm solvers (for example, (Hsieh & Olsen, 2014)).

Next we show that the non-convex form of BiasMC can also be solved efficiently, and thus can scale to millions of nodes and billions of observations. It is well known that the nuclear norm regularized problem $\min_X f_b(X) + \lambda\|X\|_*$ is equivalent to

$$\min_{W \in \mathbb{R}^{m \times k},\, H \in \mathbb{R}^{n \times k}} \ f_b(WH^T) + \frac{\lambda}{2}\big(\|W\|_F^2 + \|H\|_F^2\big) \tag{18}$$

when $k$ is sufficiently large. We can use a trick similar to (17) to compute the gradient and Hessian efficiently:

$$\frac{1}{2}\nabla_W f_b(WH^T) = \big[(1-\alpha)(WH^T - A) + (2\alpha - 1)R_\Omega\big]H, \qquad \frac{1}{2}\nabla^2_{W_{i,:}} f_b(WH^T) = (1-\alpha)H^TH + (2\alpha - 1)H_{\Omega_i}^T H_{\Omega_i},$$

where $H_{\Omega_i}$ is the sub-matrix of $H$ containing the rows $\{h_j : j \in \Omega_i\}$, and $\Omega_i$ is the set of column indices of observations in the $i$-th row. Thus, we can efficiently apply Alternating Least Squares (ALS) or Coordinate Descent (CD) to solve (18). For example, when applying CCD++ (Yu et al., 2013), each coordinate descent update needs only $O(|\Omega_i| + k)$ flops. We apply this technique to solve large-scale link prediction problems (see Section 6).

The optimization problem for ShiftMC is harder to solve because of the bound constraint. We can apply the bounded matrix factorization technique (Kannan et al., 2014) to solve the non-convex form of (6), where the time complexity is $O(mn)$ because of the constraint $0 \leq (WH^T)_{ij} \leq 1$ for all $(i, j)$. To scale to large datasets, we relax the bound constraint and solve

$$\min_{W \in \mathbb{R}^{m \times k},\, H \in \mathbb{R}^{n \times k}} \ \Big\|\tfrac{1}{1-\rho}A - WH^T\Big\|_F^2 + \frac{\lambda}{2}\big(\|W\|_F^2 + \|H\|_F^2\big) \quad \text{s.t. } 0 \leq W, H \leq p \ \text{(elementwise)}.$$

This approach (ShiftMC-relax) is easy to solve by ALS or CD with $O(|\Omega|k)$ complexity per sweep (similar to BiasMC). In our experiments, we show that ShiftMC-relax performs even better than ShiftMC in practice.
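As a concrete illustration of the cost argument above, the following sketch (ours; the SciPy sparse representation and the function name are our choices) computes the gradient of (18) with respect to $W$ using the split in (17), so that the dense matrix $WH^T$ is never formed and the cost is $O((m+n)k^2 + |\Omega_1|k)$.

```python
import numpy as np
from scipy.sparse import csr_matrix

def bias_mc_grad_W(W, H, A_pos, alpha, lam):
    """Gradient w.r.t. W of the factored BiasMC objective (18).
    The dense part (1 - alpha) * ||WH^T - A||_F^2 contributes W (H^T H) - A H,
    and the (2*alpha - 1) correction touches only the observed ones Omega_1.
    A_pos is a scipy.sparse 0-1 matrix marking the observed positions."""
    rows, cols = A_pos.nonzero()
    vals = np.einsum('ij,ij->i', W[rows], H[cols]) - 1.0   # residual (WH^T)_ij - 1 on Omega_1
    R = csr_matrix((vals, (rows, cols)), shape=A_pos.shape)
    grad = 2 * (1 - alpha) * (W @ (H.T @ H) - A_pos @ H)   # dense part without forming WH^T
    grad += 2 * (2 * alpha - 1) * (R @ H)                  # sparse correction on Omega_1
    grad += lam * W                                        # from (lambda / 2) * ||W||_F^2
    return grad
```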
## 6. Experiments

We first use synthetic data to show that our bounds are meaningful, and then demonstrate the effectiveness of our algorithms in real-world applications.

### 6.1. Synthetic Data

We assume the underlying matrix $M \in \mathbb{R}^{n \times n}$ is generated by $UU^T$, where $U \in \mathbb{R}^{n \times k}$ is the orthogonal basis of a random Gaussian $n \times k$ matrix with mean 0 and variance 1. For the non-deterministic setting, we linearly scale $M$ to have values in $[0, 1]$ and then generate training samples as described in Section 2. For the deterministic setting, we choose $q$ so that $Y$ has an equal number of zeros and ones. We fix $\rho = 0.9$ (so that only 10% of the 1's are observed). From Lemma 2, $\alpha = 0.95$ is optimal. We fix $k = 10$ and test our algorithms with different sizes $n$. The results are shown in Figure 1(a)-(b). Interestingly, the results reflect our theory: the error of our estimators decreases with $n$; in particular, the error decays linearly with $n$ in log-log scale, which suggests a rate of $O(1/n)$, as shown in Theorems 2 and 3. Directly minimizing $\|A - X\|_F^2$ gives very poor results. For BiasMC, we also plot the performance of estimators with various $\alpha$ values in Figure 1(b). As our theory suggests, $\alpha = \frac{1+\rho}{2}$ performs the best. We also observe that the error is well-behaved in a certain range of $\alpha$. A principled way of selecting $\alpha$ is an interesting problem for further research.

Figure 1. (a) Synthetic data: non-deterministic setting. (b) Synthetic data: deterministic setting. (c) FPR-FNR on the ca-GrQc dataset. (d) FPR-FNR on the ca-HepPh dataset. (e) FPR-FNR on the LiveJournal dataset. (f) FPR-FNR on the MySpace dataset. (a)-(b): Recovery error of ShiftMC and BiasMC on synthetic data; without shifting or biasing, the error does not decrease with $n$ (the black lines), while the error of our estimators decreases approximately as $1/n$, as proved in Theorems 2 and 3. (c)-(f): Comparison of link prediction methods; ShiftMC and BiasMC consistently perform better than the rest.

### 6.2. Parameter Selection

Before showing the experimental results on real-world problems, we discuss the selection of the parameter $\rho$ in our PU matrix completion model (see eq. (1)). Note that $\rho$ indicates the noise rate of flipping a 1 to 0. If there were an equal number of positive and negative elements in the underlying matrix $Y$, we would have $\rho = 1 - 2\bar{s}$, where $\bar{s}$ = (# observed positive entries)/(# total entries). In practice (e.g., in link prediction problems) the number of 1's is usually smaller than the number of 0's in the underlying matrix, but we do not know the ratio. Therefore, in all the experiments we chose $\rho$ from the set $\{1 - 2\bar{s},\ 1 - 10(2\bar{s}),\ 1 - 100(2\bar{s}),\ 1 - 1000(2\bar{s})\}$ based on a random validation set, and used the corresponding $\alpha$ in the optimization problems.

### 6.3. Matrix completion for link prediction

One of the important applications that motivated our analysis in this paper is the link prediction problem. Note that matrix completion has been used for link prediction on signed networks (Chiang et al., 2014), but the application to unsigned networks has not been discussed before. Here, we are given $n$ nodes (users) and a set of edges $\Omega_{\text{train}}$ (relationships), and the goal is to predict the missing edges $\Omega_{\text{test}}$. We use 4 real-world datasets: 2 co-author networks, ca-GrQc (4,158 nodes and 26,850 edges) and ca-HepPh (11,204 nodes and 235,368 edges), where we randomly split edges into training and test sets such that $|\Omega_{\text{train}}| = |\Omega_{\text{test}}|$; and 2 social networks, LiveJournal (1,770,961 nodes, $|\Omega_{\text{train}}| = 83{,}663{,}478$ and $|\Omega_{\text{test}}| = 2{,}055{,}288$) and MySpace (2,137,264 nodes, $|\Omega_{\text{train}}| = 90{,}333{,}122$ and $|\Omega_{\text{test}}| = 1{,}315{,}594$), where the train/test split is done using timestamps.

For our proposed methods BiasMC, ShiftMC, and ShiftMC-relax, we solve the non-convex form with $k = 50$ for ca-GrQc and ca-HepPh, and $k = 100$ for LiveJournal and MySpace. The $\alpha$ and $\lambda$ values are chosen using a validation set. We compare with competing link prediction methods (Liben-Nowell & Kleinberg, 2003): Common Neighbors, Katz, and SVD-Katz (which computes Katz using the rank-$k$ approximation $A \approx U_k \Sigma_k V_k^T$). Note that the classical matrix factorization approach in this case is equivalent to SVD on the given 0-1 training matrix, and SVD-Katz slightly improves over SVD by further computing the Katz values based on the low-rank approximation (see (Shin et al., 2012)), so we omit the SVD results in the figures. The ldNMF algorithm proposed in (Sindhwani et al., 2010) is time-consuming because there are $O(n^2)$ hidden variables to estimate; therefore, we compare with it only on an $n = 500$ subset of ca-GrQc. On this subset, BiasMC obtained 1.14% and ldNMF obtained 1.08% top-10 prediction accuracy.

Based on the training matrix, each link prediction method outputs a list of $k$ candidate entries. We evaluate the quality of the top-$k$ entries by computing the False Positive Rate (FPR) and False Negative Rate (FNR) on the test snapshot.
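The evaluation can be sketched as follows (our own illustration: the exact ranking protocol, the exclusion of training edges, and the FPR/FNR conventions below are assumptions rather than details taken from the paper).

```python
import numpy as np

def fpr_fnr_at_k(scores, train_mask, test_mask, k):
    """Top-k link prediction evaluation: rank all entries that are not known
    training edges, predict the k highest-scoring ones, and compare against
    the held-out test edges."""
    eligible = ~train_mask                            # training edges are never re-predicted
    masked = np.where(eligible, scores, -np.inf)
    top = np.argsort(masked, axis=None)[::-1][:k]     # flat indices of the k largest scores
    pred = np.zeros(scores.shape, dtype=bool)
    pred[np.unravel_index(top, scores.shape)] = True
    fp = np.sum(pred & ~test_mask)                    # predicted edges absent from the test set
    fn = np.sum(test_mask & ~pred)                    # test edges missed by the top-k list
    fpr = fp / max(np.sum(eligible & ~test_mask), 1)
    fnr = fn / max(np.sum(test_mask), 1)
    return fpr, fnr
```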
The results are shown in Figure 1(c)-(f). ca-GrQc is a small dataset, so we can solve the original ShiftMC problem accurately, although ShiftMC-relax achieves similar performance here. For larger datasets, we show only the performance of ShiftMC-relax. In general, BiasMC performs the best, and ShiftMC tends to perform better at the beginning of the curve. Overall, our methods achieve lower FPR and FNR compared to the other methods, which indicates that we obtain a better link prediction model by solving the PU matrix completion problem. BiasMC is also highly efficient: it takes 516 seconds for 10 coordinate descent sweeps on the largest dataset (MySpace), whereas computing the top 100 eigenvectors using eigs in MATLAB requires 2,408 seconds.

### 6.4. Inductive matrix completion

We use the semi-supervised clustering problem to evaluate our PU inductive matrix completion methods. PU inductive matrix completion can be applied to many real-world problems, including recommender systems with features and 0-1 observations, and the semi-supervised clustering problem when we can only observe positive relationships. Here we use the latter as an example to demonstrate the usefulness of our algorithm. In semi-supervised clustering problems, we are given $n$ samples with features $\{x_i\}_{i=1}^n$ and pairwise relationships $A \in \mathbb{R}^{n \times n}$, where $A_{ij} = 1$ if two samples are in the same cluster, $A_{ij} = -1$ if they are in different clusters, and $A_{ij} = 0$ if the relationship is unobserved. Note that the ground-truth matrix $M \in \{+1, -1\}^{n \times n}$ exhibits a simple structure and is a low-rank as well as low-trace-norm matrix; it is shown in (Yi et al., 2013) that we can recover $M$ using IMC when there are both positive and negative observations. We consider the setting where only positive relationships are observed, so $A$ is a 0-1 matrix, and show that biased IMC can recover $M$ using very few positive relationships.

We test the algorithms on two datasets: the Mushroom dataset with 8,142 samples, 112 features, and 2 classes, and the Segment dataset with 2,310 samples, 19 features, and 7 classes. The results are presented in Figure 2. We compare BiasMC-inductive with (a) MC-inductive, which considers all the unlabeled pairs as zeros and minimizes $\|F_u D F_v^T - A\|_F^2$, and (b) spectral clustering, which does not use feature information. Since the data come from classification datasets, the ground truth $M$ is known and can be used to evaluate the results. In Figure 2, the vertical axis is the clustering error rate, defined as the fraction of entries in $M$ predicted with an incorrect sign. Figure 2 shows that BiasMC-inductive is much better than the other approaches for this task.

Figure 2. (a) Mushroom dataset. (b) Segment dataset. Semi-supervised clustering performance of BiasMC-inductive on two real datasets. BiasMC-inductive performs better than MC-inductive (which treats unlabeled relationships as zeros) and spectral clustering (which does not use features), achieving under 10% error using just 100 samples.

## 7. Conclusions

Motivated by modern applications of matrix completion, our work attempts to bridge a gap between the theory of matrix completion and practice. We have shown that even when there is noise in the form of one-bit quantization, as well as a one-sided sampling process that reveals the measurements, the underlying matrix can be accurately recovered. We consider two recovery settings, both of which are natural for PU learning, and provide similar recovery guarantees for both. Our error bounds are strong and useful in practice.
Our work serves to provide the first theoretical insight into the biased matrix completion approach that has been employed as a heuristic for similar problems in the past. Experimental results on synthetic data conform to our theory, and the effectiveness of the methods is evident for the link prediction task in real-world networks. A principled way of selecting or estimating the bias $\alpha$ in BiasMC seems worthy of exploration given our encouraging results.

Acknowledgements. This research was supported by NSF grants CCF-1320746 and CCF-1117055. C.-J.H. also acknowledges support from an IBM PhD fellowship.

## References

Boufounos, P. T. and Baraniuk, R. G. 1-bit compressive sensing. In 42nd Annual Conference on Information Sciences and Systems (CISS), pp. 16-21. IEEE, 2008.

Candès, E. J. and Plan, Y. Matrix completion with noise. Proceedings of the IEEE, 98(6):925-936, 2009.

Candès, E. J. and Recht, B. Exact matrix completion via convex optimization. Foundations of Computational Mathematics, 9(6):717-772, 2009.

Chiang, K.-Y., Hsieh, C.-J., Natarajan, N., Tewari, A., and Dhillon, I. S. Prediction and clustering in signed networks: A local to global perspective. Journal of Machine Learning Research (JMLR), 15:1177-1213, March 2014.

Davenport, M. A., Plan, Y., Berg, E., and Wootters, M. 1-bit matrix completion. arXiv preprint arXiv:1209.3672, 2012.

Elkan, C. and Noto, K. Learning classifiers from only positive and unlabeled data. In Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 213-220. ACM, 2008.

Hsieh, C.-J. and Olsen, P. A. Nuclear norm minimization via active subspace selection. In ICML, 2014.

Hu, Y., Koren, Y., and Volinsky, C. Collaborative filtering for implicit feedback datasets. In ICDM, 2008.

Jain, P. and Dhillon, I. S. Provable inductive matrix completion. CoRR, abs/1306.0626, 2013.

Ji, S. and Ye, J. An accelerated gradient method for trace norm minimization. In ICML, 2009.

Kakade, S. M., Sridharan, K., and Tewari, A. On the complexity of linear prediction: Risk bounds, margin bounds, and regularization. In NIPS, volume 21, pp. 793-800, 2008.

Kannan, R., Ishteva, M., and Park, H. Bounded matrix factorization for recommender system. Knowledge and Information Systems, 2014.

Keshavan, R. H., Montanari, A., and Oh, S. Matrix completion from noisy entries. Journal of Machine Learning Research, 11:2057-2078, 2010.

Latala, R. Some estimates of norms of random matrices. Proceedings of the AMS, 2005.

Liben-Nowell, D. and Kleinberg, J. The link prediction problem for social networks. In CIKM, 2003.

Liu, B., Dai, Y., Li, X., Lee, W. S., and Yu, P. S. Building text classifiers using positive and unlabeled examples. In ICDM, pp. 179-186. IEEE, 2003.

Natarajan, N. and Dhillon, I. S. Inductive matrix completion for predicting gene-disease associations. Bioinformatics, 2014.

Natarajan, N., Tewari, A., Dhillon, I. S., and Ravikumar, P. Learning with noisy labels. In NIPS, 2013.

Rendle, S., Freudenthaler, C., Gantner, Z., and Schmidt-Thieme, L. BPR: Bayesian personalized ranking from implicit feedback. In UAI, 2009.

Scott, C. Calibrated asymmetric surrogate losses. Electronic Journal of Statistics, 6:958-992, 2012.

Shamir, O. and Shalev-Shwartz, S. Collaborative filtering with the trace norm: Learning, bounding, and transducing. In COLT, 2011.
Shawe-Taylor, J. and Cristianini, N. Kernel Methods for Pattern Analysis. Cambridge University Press, 2004.

Shin, D., Si, S., and Dhillon, I. S. Multi-scale link prediction. In Proceedings of the 21st ACM Conference on Information and Knowledge Management (CIKM), October 2012.

Sindhwani, V., Bucak, S. S., Hu, J., and Mojsilovic, A. One-class matrix completion with low-density factorization. In ICDM, 2010.

Xu, M., Jin, R., and Zhou, Z.-H. Speedup matrix completion with side information: Application to multi-label learning. In NIPS, 2013.

Yi, J., Zhang, L., Jin, R., Qian, Q., and Jain, A. K. Semi-supervised clustering by input pattern assisted pairwise similarity matrix completion. In ICML, 2013.

Yu, H.-F., Hsieh, C.-J., Si, S., and Dhillon, I. S. Parallel matrix factorization for recommender systems. Knowledge and Information Systems, 2013.

Yu, H.-F., Jain, P., Kar, P., and Dhillon, I. S. Large-scale multi-label learning with missing labels. In ICML, pp. 593-601, 2014.