Journal of Machine Learning Research 18 (2017) 1-25. Submitted 2/15; Revised 9/17; Published 12/17.

# Matrix Completion with Noisy Entries and Outliers

Raymond K. W. Wong (raywong@stat.tamu.edu), Department of Statistics, Texas A&M University, College Station, TX 77843, USA

Thomas C. M. Lee (tcmlee@ucdavis.edu), Department of Statistics, University of California, Davis, CA 95616, USA

Editor: Xiaotong Shen

©2017 Raymond K. W. Wong and Thomas C. M. Lee. License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at http://jmlr.org/papers/v18/15-076.html.

Abstract

This paper considers the problem of matrix completion when the observed entries are noisy and contain outliers. It begins by introducing a new optimization criterion whose solution defines the recovered matrix. This criterion uses the celebrated Huber function from the robust statistics literature to downweight the effects of outliers. A practical algorithm is developed to solve the resulting optimization problem. This algorithm is fast, straightforward to implement, and monotonically convergent. Furthermore, the proposed methodology is shown to be stable in a well-defined theoretical sense. Its promising empirical performance is demonstrated via a sequence of simulation experiments, including image inpainting.

Keywords: ES-algorithm, Huber function, robust methods, Soft-Impute, stable recovery

1. Introduction

The goal of matrix completion is to impute the missing entries of a large matrix based on the knowledge of its relatively few observed entries. It has many practical applications, ranging from collaborative filtering (Rennie and Srebro, 2005) to computer vision (Weinberger and Saul, 2006) to positioning (Montanari and Oh, 2010). Its application to recommender systems is perhaps the best-known example, made widely popular by the so-called Netflix prize problem (Bennett and Lanning, 2007). In this problem a large matrix of movie ratings is partially observed. Each row of this matrix consists of ratings from a particular customer, while each column records the ratings of a particular movie. In the Netflix data set there are around $5 \times 10^5$ customers and $2 \times 10^4$ movies, with less than 1% of the ratings observed. Without any prior knowledge, a reasonable full recovery of the matrix is virtually impossible. To overcome this issue, it is common to assume that the matrix is of low rank, reflecting the belief that the users' ratings are driven by a relatively small number of factors. This low rank assumption is sensible in many applications, although the resulting optimizations are combinatorially hard (Srebro and Jaakkola, 2003). To this end, various convex relaxations and related optimization algorithms have been proposed to provide computationally feasible solutions; see, for example, Candès and Recht (2009); Candès and Plan (2010); Keshavan et al. (2010b,a); Mazumder et al. (2010); Marjanovic and Solo (2012) and Hastie et al. (2014). In addition to computational advances, the theoretical properties of matrix completion using nuclear norm minimization have also been well studied. For example, when the observed entries are noiseless, Candès and Recht (2009) show that perfect recovery of a low rank matrix is possible; see also Keshavan et al. (2010b), Gross (2011) and Recht (2011).
This result of Candès and Recht (2009) has been extended to noisy measurements by Candès and Plan (2010): with high probability, the recovery is subject to an error bound proportional to the noise level. Techniques that achieve this desirable property are often referred to as stable. See also Keshavan et al. (2010a) and Koltchinskii et al. (2011) for other theoretical developments on matrix completion from noisy measurements.

The original formulation of matrix completion assumes that the observed entries are noiseless; it was later extended to the more realistic situation where the entries are observed with noise. This paper further extends the formulation to allow simultaneously for both noisy entries and outliers. To the authors' knowledge, such an extension has not been considered before, although similar work exists. In Candès et al. (2011) a method called principal component pursuit (PCP) is developed to recover a matrix whose observed entries are mostly noiseless apart from a small number of outliers. This is done by modeling the observed matrix as the sum of a low rank matrix and a sparse matrix. Zhou et al. (2010) extend this PCP method to noisy entries but assume the matrix is fully observed, so their setting does not fall into the class of matrix completion problems. Lastly, Chen et al. (2011) extend PCP to safeguard against a special outlying structure, namely outlying columns; however, their method handles only outliers among otherwise noiseless entries. Due to the similarity between matrix completion and principal component analysis, it is worth mentioning that there is related work (Karhunen, 2011; Luttinen et al., 2012) on robust principal component analysis with missing values.

The primary contribution of this paper is the development of a new robust matrix completion method that can be applied to recover a matrix with missing, noisy and/or outlying entries. This method is shown to be stable in the sense of Candès and Plan (2010), as discussed above. As opposed to the PCP approach referenced above, which decomposes the matrix into the sum of a low rank matrix and a sparse matrix, the new approach is motivated by the statistical literature on robust estimation, which modifies the least squares criterion to downweight the effects of outliers. In particular, we make use of the Huber function for this modification. We provide a theoretical result that establishes an intrinsic link between the two approaches. To cope with the nonlinearity introduced by the Huber function, we propose a fast, simple, and easy-to-implement algorithm for solving the resulting nonlinear optimization problem. This algorithm is motivated by the ES-algorithm for robust nonparametric smoothing (Oh et al., 2007). As shown below, it can transform a rich class of (non-robust) matrix completion algorithms into algorithms for robust matrix completion.

The rest of this paper is organized as follows. Section 2 provides further background on matrix completion and proposes a new optimization criterion for robust matrix recovery. Fast algorithms for computing the robust matrix estimate are developed in Section 3. Theoretical and empirical properties of the proposed methodology are studied in Section 4 and Section 5 respectively. Concluding remarks are given in Section 6, while technical details are relegated to the appendix.
2. Matrix Completion with Noisy Observations and Outliers

Suppose $X$ is an $n_1 \times n_2$ matrix which is observed only on a subset of entries $\Omega_{\mathrm{obs}} \subseteq [n_1] \times [n_2]$, where $[n]$ denotes $\{1, \ldots, n\}$. Let $\Omega_{\mathrm{obs}}^{\perp}$ be the complement of $\Omega_{\mathrm{obs}}$. Define the projection operator $P_{\Omega_{\mathrm{obs}}}$ by $P_{\Omega_{\mathrm{obs}}} B = C$, where $C_{ij} = B_{ij}$ if $(i,j) \in \Omega_{\mathrm{obs}}$ and $C_{ij} = 0$ if $(i,j) \in \Omega_{\mathrm{obs}}^{\perp}$, for any $n_1 \times n_2$ matrix $B = (B_{ij})_{i \in [n_1], j \in [n_2]}$. The following is a standard formulation of matrix completion using a low rank assumption:

$$\underset{Y}{\text{minimize}} \ \ \mathrm{rank}(Y) \quad \text{subject to} \quad \frac{1}{2}\,\big\| P_{\Omega_{\mathrm{obs}}} X - P_{\Omega_{\mathrm{obs}}} Y \big\|_F^2 \le e,$$

where $e > 0$ and $\|\cdot\|_F$ is the Frobenius norm. Carrying out this rank minimization enables a good recovery of a low rank matrix with missing entries. Note that, to accommodate noisy measurements, the constraint above allows for a slight discrepancy between the recovered and the observed matrices. However, this minimization is combinatorially hard (e.g., Srebro and Jaakkola, 2003). To achieve fast computation, the following convex relaxation is often used:

$$\underset{Y}{\text{minimize}} \ \ \|Y\|_* \quad \text{subject to} \quad \frac{1}{2}\,\big\| P_{\Omega_{\mathrm{obs}}} X - P_{\Omega_{\mathrm{obs}}} Y \big\|_F^2 \le e,$$

where $\|Y\|_*$ is the nuclear norm of $Y$ (i.e., the sum of the singular values of $Y$). The Lagrangian form of this optimization is

$$\underset{Y}{\text{minimize}} \ \ f(Y \mid X) := \frac{1}{2}\,\big\| P_{\Omega_{\mathrm{obs}}} X - P_{\Omega_{\mathrm{obs}}} Y \big\|_F^2 + \gamma \|Y\|_*, \qquad (1)$$

where $\gamma > 0$ has a one-to-one correspondence to $e$. The squared loss in the first term measures the fit of the recovered matrix to the observed matrix. It is widely known that such a squared loss is very sensitive to outliers and often leads to unsatisfactory recovery results when outliers exist. Motivated by the literature on robust statistics (e.g., Huber and Ronchetti, 2011), we propose replacing this squared loss by the Huber loss function

$$\rho_c(x) = \begin{cases} x^2, & |x| \le c, \\ c(2|x| - c), & |x| > c, \end{cases}$$

with tuning parameter $c$. Compared with the squared loss, the Huber loss downweights the effects of extreme measurements. Our proposed solution for robust matrix completion is given by the following minimization:

$$\underset{Y}{\text{minimize}} \ \ g(Y) := \frac{1}{2} \sum_{(i,j) \in \Omega_{\mathrm{obs}}} \rho_c(X_{ij} - Y_{ij}) + \gamma \|Y\|_*. \qquad (2)$$

Note that the convexity of $\rho_c$ guarantees the convexity of the objective criterion (2). For many robust statistical estimation problems the tuning parameter $c$ is pre-set to $c = 1.345\hat\sigma$ to achieve 95% statistical efficiency, where $\hat\sigma$ is an estimate of the standard deviation of the noise. For the current problem, however, the choice of $c$ is suggested by Theorem 4 below: $c = \gamma/\sqrt{n_{(1)}p}$, where $n_{(1)} = \max\{n_1, n_2\}$ and $p$ is the proportion of observed entries. This choice of $c$ was used throughout all our numerical work.

3. Fast Algorithms for Minimization of (2)

Since the gradient of the Huber function is non-linear, (2) is a harder optimization problem than many typical matrix completion formulations such as (1). As an example, consider (1) when $X$ is fully observed, i.e., $\Omega_{\mathrm{obs}} = [n_1] \times [n_2]$. Through sub-gradient analysis (e.g., Cai et al., 2010; Ma et al., 2011), one can derive a closed-form solution to (1), denoted by $S_\gamma(X)$, where $S_\gamma$ is the soft-thresholding operator defined in Mazumder et al. (2010) and also given in (6) below. However, even if $X$ were fully observed, (2) would not have a closed-form solution. The goal of this section is to develop fast methods for minimizing (2).

3.1 A General Algorithm

In Oh et al. (2007) a method based on the theoretical construct of so-called pseudo data is proposed for robust wavelet regression. The idea is to transform a Huber-type minimization problem into a sequence of fast and well understood squared loss minimization problems.
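To fix ideas, here is a minimal NumPy sketch of the Huber loss $\rho_c$ and its derivative $\psi_c = \rho_c'$ as defined above; the function names and the vectorized treatment are our own, not from the paper.

```python
import numpy as np

def rho_c(x, c):
    """Huber loss: quadratic inside [-c, c], linear outside."""
    x = np.asarray(x, dtype=float)
    return np.where(np.abs(x) <= c, x ** 2, c * (2.0 * np.abs(x) - c))

def psi_c(x, c):
    """Derivative of rho_c; it saturates at +/- 2c for large residuals."""
    x = np.asarray(x, dtype=float)
    return np.where(np.abs(x) <= c, 2.0 * x, 2.0 * c * np.sign(x))
```

Note that $\tfrac12\psi_c$ acts as a winsorizer: it is the identity on $[-c, c]$ and is clipped at $\pm c$ beyond, which is exactly how the pseudo data defined in (3) below tempers outlying residuals.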
This subsection modifies this idea and proposes an algorithm for minimizing (2). Similarly to Oh et al. (2007), we define a pseudo data matrix as

$$Z = P_{\Omega_{\mathrm{obs}}} Y + \frac{1}{2}\psi_c(E), \qquad (3)$$

where $Y$ is the current estimate of the target matrix, $E = P_{\Omega_{\mathrm{obs}}} X - P_{\Omega_{\mathrm{obs}}} Y$ is the residual matrix, and $\psi_c = \rho_c'$ is the derivative of $\rho_c$. With a slight abuse of notation, when $\psi_c$ is applied to a matrix, it is evaluated in an element-wise fashion. Straightforward algebra shows that the sub-gradient of $f(Y \mid Z)$ (with respect to $Y$) evaluated at the current estimate $Y$,

$$-(P_{\Omega_{\mathrm{obs}}} Z - P_{\Omega_{\mathrm{obs}}} Y) + \gamma\,\partial\|Y\|_*, \qquad (4)$$

is equivalent to the sub-gradient of $g(Y)$ (with respect to $Y$) evaluated at the same $Y$,

$$-\frac{1}{2}\,\psi_c(P_{\Omega_{\mathrm{obs}}} X - P_{\Omega_{\mathrm{obs}}} Y) + \gamma\,\partial\|Y\|_*. \qquad (5)$$

The proposed algorithm iteratively updates $Y = \arg\min_Y f(Y \mid Z)$ and $Z$ using (3). Upon convergence (implied by Proposition 1 below), the sub-gradient (4) contains 0 at the converged $Y$, and thus the sub-gradient (5) also contains 0 at this converged $Y$. Therefore this $Y$ is the solution to (2). Details of this algorithm based on the pseudo data matrix are given in Algorithm 1.

Algorithm 1 The General Robust Algorithm
1: Perform (non-robust) matrix completion on $X$ and assign $Y^{\mathrm{old}} \leftarrow \arg\min_Y f(Y \mid X)$. This $Y^{\mathrm{old}}$ is the initial estimate (starting point of the algorithm).
2: Repeat:
  (a) Compute $E \leftarrow P_{\Omega_{\mathrm{obs}}} X - P_{\Omega_{\mathrm{obs}}} Y^{\mathrm{old}}$.
  (b) Compute $Z \leftarrow P_{\Omega_{\mathrm{obs}}} Y^{\mathrm{old}} + \frac{1}{2}\psi_c(E)$.
  (c) Perform (non-robust) matrix completion on $Z$ and assign $Y^{\mathrm{new}} \leftarrow \arg\min_Y f(Y \mid Z)$.
  (d) If $\|Y^{\mathrm{new}} - Y^{\mathrm{old}}\|_F^2 / \|Y^{\mathrm{old}}\|_F^2 < \varepsilon$, exit.
  (e) Assign $Y^{\mathrm{old}} \leftarrow Y^{\mathrm{new}}$.
3: Output $Y^{\mathrm{new}}$.

Algorithm 1 has several attractive properties. First, it can be paired with any existing (non-robust) matrix completion algorithm (or software), as can easily be seen in Step 2(c). This is a huge advantage, as a rich body of existing (non-robust) methods can be made robust against outliers. Second, once such a (non-robust) algorithm is available, the rest of the implementation is straightforward and simple, and no expensive matrix operations are required. Lastly, it has strong theoretical backing, as reported in Section 4.

3.2 Further Integration with Existing Matrix Completion Algorithms

Many existing matrix completion algorithms are iterative. A direct application of Algorithm 1 would lead to an algorithm with iterations within iterations. Although our extensive numerical experience suggests that such direct implementations typically converge within a few iterations and give reasonably fast execution times, it would still be advantageous to speed up the overall procedure. Here we show that it is possible to further improve the speed of the overall robust algorithm by embedding the pseudo data matrix idea directly into a non-robust algorithm. We illustrate this with the Soft-Impute algorithm proposed by Mazumder et al. (2010). To proceed, we first recall the definition of their thresholding operator $S_\gamma$: for any matrix $Z$ of rank $r$,

$$S_\gamma(Z) = U D_\gamma V^\top, \qquad (6)$$

where $Z = U D V^\top$ is the singular value decomposition of $Z$, $D = \mathrm{diag}[d_1, \ldots, d_r]$ and $D_\gamma = \mathrm{diag}[(d_1 - \gamma)_+, \ldots, (d_r - \gamma)_+]$. Now the main idea is to suitably replace an iterative matrix estimate with the pseudo data matrix estimate given by (3). With Soft-Impute, the resulting robust algorithm is given in Algorithm 2. We call this algorithm Robust-Impute. As shown by the numerical studies below, Robust-Impute is very fast and produces very promising empirical results. Our algorithm also retains the sparse-plus-low-rank structure in the singular value thresholding step (Step 2a(iii)).
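For concreteness, the following is a minimal NumPy sketch of the thresholding operator $S_\gamma$ in (6) and of the inner loop of Robust-Impute for a single value of $\gamma$. The boolean-mask encoding of $\Omega_{\mathrm{obs}}$, the function names, and the stopping defaults are our own illustrative choices; a full implementation would, as in Algorithm 2 below, warm-start along a decreasing sequence $\gamma_1 > \cdots > \gamma_K$.

```python
import numpy as np

def soft_threshold(Z, gamma):
    """Singular value soft-thresholding S_gamma, Eq. (6)."""
    U, d, Vt = np.linalg.svd(Z, full_matrices=False)
    return (U * np.maximum(d - gamma, 0.0)) @ Vt

def robust_impute_single(X, mask, gamma, c, eps=1e-7, max_iter=500):
    """Inner loop of Algorithm 2; mask is True on the observed entries."""
    psi_c = lambda E: np.where(np.abs(E) <= c, 2.0 * E, 2.0 * c * np.sign(E))
    Y_old = soft_threshold(np.where(mask, X, 0.0), gamma)        # Step 1
    for _ in range(max_iter):
        E = np.where(mask, X - Y_old, 0.0)                       # Step 2a(i)
        Z = np.where(mask, Y_old, 0.0) + 0.5 * psi_c(E)          # Step 2a(ii), Eq. (3)
        Y_new = soft_threshold(np.where(mask, Z, Y_old), gamma)  # Step 2a(iii)
        if np.sum((Y_new - Y_old) ** 2) < eps * np.sum(Y_old ** 2):
            break                                                # Step 2a(iv)
        Y_old = Y_new
    return Y_new
```

The only expensive operation per iteration is one SVD, which is why the sparse-plus-low-rank structure discussed next matters computationally.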
This linear algebra structure has a positive impact on the computational complexity; see Section 5 of Mazumder et al. (2010) for details. Moreover, the monotonicity and convergence of our algorithm are guaranteed by Proposition 1 and Theorem 2.

Algorithm 2 Robust-Impute
1: Initialize $Y^{\mathrm{old}} \leftarrow S_{\gamma_1}(P_{\Omega_{\mathrm{obs}}} X)$ and $Z \leftarrow X$.
2: Do for $\gamma_1 > \gamma_2 > \cdots > \gamma_K$:
  (a) Repeat:
    (i) Compute $E \leftarrow P_{\Omega_{\mathrm{obs}}} X - P_{\Omega_{\mathrm{obs}}} Y^{\mathrm{old}}$.
    (ii) Compute $Z \leftarrow P_{\Omega_{\mathrm{obs}}} Y^{\mathrm{old}} + \frac{1}{2}\psi_c(E)$.
    (iii) Compute $Y^{\mathrm{new}} \leftarrow S_{\gamma_k}(P_{\Omega_{\mathrm{obs}}} Z + P_{\Omega_{\mathrm{obs}}^{\perp}} Y^{\mathrm{old}})$.
    (iv) If $\|Y^{\mathrm{new}} - Y^{\mathrm{old}}\|_F^2 / \|Y^{\mathrm{old}}\|_F^2 < \varepsilon$, exit.
    (v) Assign $Y^{\mathrm{old}} \leftarrow Y^{\mathrm{new}}$.
  (b) Assign $\hat Y_{\gamma_k} \leftarrow Y^{\mathrm{new}}$.
3: Output the sequence of solutions $\hat Y_{\gamma_1}, \ldots, \hat Y_{\gamma_K}$.

4. Theoretical Properties

This section presents theoretical backing for the proposed methodology.

4.1 Monotonicity and Global Convergence

We first present the following proposition concerning the monotonicity of the algorithms. The proof can be found in Appendix A.1. We also provide an alternative proof, suggested by a referee and based on the idea of alternating minimization, in Appendix A.1.

Proposition 1 (Monotonicity) Let $Y^{(k)}$ and $Z^{(k)} = P_{\Omega_{\mathrm{obs}}} Y^{(k-1)} + \psi_c(P_{\Omega_{\mathrm{obs}}} X - P_{\Omega_{\mathrm{obs}}} Y^{(k-1)})/2$ be, respectively, the estimate and the pseudo data matrix in the $k$-th iteration. If $Y^{(k+1)}$ is the next estimate, such that $f(Y^{(k+1)} \mid Z^{(k+1)}) \le f(Y^{(k)} \mid Z^{(k+1)})$, then $g(Y^{(k+1)}) \le g(Y^{(k)})$.

For the general version (Algorithm 1), the condition $f(Y^{(k+1)} \mid Z^{(k+1)}) \le f(Y^{(k)} \mid Z^{(k+1)})$ is obviously satisfied, as $Y^{(k+1)}$ is the result of the minimization $Y^{\mathrm{new}} \leftarrow \arg\min_Y f(Y \mid Z)$. For the specialized version Robust-Impute (Algorithm 2), this condition is implied by Lemma 2 of Mazumder et al. (2010). Therefore both versions are monotonic.

As pointed out by a referee, the proposed algorithms can also be viewed as instances of the majorization-minimization (MM) algorithm (Lange et al., 2000; Hunter and Lange, 2004). It can be shown that, for $(i,j) \in \Omega_{\mathrm{obs}}$,

$$\rho_c(X_{ij} - Y_{ij}) \le \rho_c(X_{ij} - Y^{\mathrm{old}}_{ij}) - (Y_{ij} - Y^{\mathrm{old}}_{ij})\,\psi_c(X_{ij} - Y^{\mathrm{old}}_{ij}) + (Y_{ij} - Y^{\mathrm{old}}_{ij})^2 = \Big( Y_{ij} - Y^{\mathrm{old}}_{ij} - \tfrac{1}{2}\psi_c(X_{ij} - Y^{\mathrm{old}}_{ij}) \Big)^2 + \text{constant} = (Y_{ij} - Z_{ij})^2 + \text{constant}.$$

Therefore, up to an additive constant that does not depend on $Y$, $h(Y \mid Y^{\mathrm{old}}) = f(Y \mid Z) = \frac{1}{2}\sum_{(i,j) \in \Omega_{\mathrm{obs}}} (Z_{ij} - Y_{ij})^2 + \gamma\|Y\|_*$ is a majorization of the objective function $g$. With this majorization, Algorithm 1 can be viewed as an MM algorithm. Additionally, setting $Z_{ij} = Y^{\mathrm{old}}_{ij}$ for the unobserved entries, one can majorize their (zero) contribution by $(Y_{ij} - Z_{ij})^2 = (Y_{ij} - Y^{\mathrm{old}}_{ij})^2 \ge 0$; together with the above majorization of the observed entries, Algorithm 2 can also be shown to be an MM algorithm. Therefore the monotonicity of the proposed algorithms can also be obtained from the general theory of MM algorithms (e.g., Lange, 2010). Moreover, the explicit connection to the MM algorithm allows possible extensions of the current algorithm to other robust loss functions such as Tukey's biweight loss. However, due to the non-differentiability of the objective function, the typical convergence analysis of MM algorithms (e.g., Lange, 2010, Ch. 15) does not apply to our case. We summarize the global convergence rates of both Algorithm 1 and Algorithm 2 in the following theorem.

Theorem 2 Let $Y^{(k)}$ and $Y^{(0)}$ be, respectively, the estimate in the $k$-th iteration and the starting point of Algorithm 1 or Algorithm 2. Then for any $k \ge 1$,

$$\text{Algorithm 1:} \quad g(Y^{(k)}) - g(Y^*) \le \frac{\|P_{\Omega_{\mathrm{obs}}} Y^{(0)} - P_{\Omega_{\mathrm{obs}}} Y^*\|_F^2}{2k}, \quad \forall\, Y^* \in \mathcal{Y}^*,$$

$$\text{Algorithm 2:} \quad g(Y^{(k)}) - g(Y^*) \le \frac{\|Y^{(0)} - Y^*\|_F^2}{2k}, \quad \forall\, Y^* \in \mathcal{Y}^*,$$

where $\mathcal{Y}^*$ is the set of all global minimizers of $g$ (i.e., $\mathcal{Y}^* = \arg\min_{Y \in \mathbb{R}^{n_1 \times n_2}} g(Y)$).
The global convergence analysis of Algorithm 1 can be carried out similarly to that of Beck and Teboulle (2009) for the proximal gradient method, even though Algorithm 1 is not a proximal gradient method. For completeness, we give the proof of Theorem 2 for Algorithm 1 in Appendix A.2. As for Robust-Impute (Algorithm 2), we can rewrite it as an instance of the proximal gradient method applied to $g(Y) = g_1(Y) + g_2(Y)$, where $g_1(Y) = \frac{1}{2}\sum_{(i,j) \in \Omega_{\mathrm{obs}}} \rho_c(X_{ij} - Y_{ij})$ and $g_2(Y) = \gamma\|Y\|_*$. In our case, the proximal gradient method with parameter $L$ iterates $Y^{(k+1)} = \xi_L(Y^{(k)})$ with

$$\xi_L(\tilde Y) = \arg\min_Y \ g_2(Y) + \frac{L}{2}\Big\| Y - \Big( \tilde Y - \frac{1}{L}\nabla g_1(\tilde Y) \Big) \Big\|_F^2,$$

where $L$ is a constant greater than or equal to the Lipschitz constant of $\nabla g_1$. Note that $\nabla g_1$ has Lipschitz constant 1. If we take $L = 1$, we have the following simplification:

$$\xi_1(\tilde Y) = \arg\min_Y \ g_2(Y) + \frac{1}{2}\Big\| Y - \tilde Y - \frac{1}{2}\psi_c(P_{\Omega_{\mathrm{obs}}} X - P_{\Omega_{\mathrm{obs}}} \tilde Y) \Big\|_F^2 = \arg\min_Y \ g_2(Y) + \frac{1}{2}\Big\| Y - \big( P_{\Omega_{\mathrm{obs}}^{\perp}} \tilde Y + P_{\Omega_{\mathrm{obs}}} Z \big) \Big\|_F^2,$$

where $Z = P_{\Omega_{\mathrm{obs}}} \tilde Y + \frac{1}{2}\psi_c(P_{\Omega_{\mathrm{obs}}} X - P_{\Omega_{\mathrm{obs}}} \tilde Y)$ as in (3). The minimization defining $\xi_1$ is equivalent to Step 2a(iii) of Algorithm 2. Therefore the proximal gradient method coincides with Robust-Impute. This connection allows us to apply convergence results for the proximal gradient method to Robust-Impute directly: Theorem 2 for Algorithm 2 follows from Theorem 3.1 of Beck and Teboulle (2009). Lastly, Nesterov's method (Nesterov, 2007) can be applied directly to accelerate Algorithm 2. The resulting accelerated version is expected to be faster in terms of convergence. However, the acceleration in Nesterov's method ruins the computationally beneficial sparse-plus-low-rank structure (Mazumder et al., 2010) in the singular value thresholding step (Step 2a(iii)). Hence, for large matrices, the non-accelerated version is still preferred in terms of overall computation. A detailed discussion can be found in Section 5 of Mazumder et al. (2010).

4.2 Stable Recovery

Recall that the stability property of Candès and Plan (2010) implies that, with high probability, the recovered matrix is subject to an error bound proportional to the noise level. This subsection shows that the robust matrix completion defined by (2) is also stable. Although the formulation (2) has its roots in classical robust statistics, it is also related to the more recent principal component pursuit (PCP) proposed by Candès et al. (2011). PCP assumes that the entries of the observed matrix are noiseless, and that this matrix can be decomposed as the sum of a low rank matrix and a sparse matrix, where the sparse matrix is treated as the gross error. In Candès et al. (2011) it is shown that perfect recovery using PCP is possible with or without missing entries in the observed matrix. Another notable work, by Chandrasekaran et al. (2011), provides completely deterministic conditions for PCP to succeed when there are no missing data. See Section 1.5 of Candès et al. (2011) for a detailed comparison between these two pieces of work. For the case of noisy measurements without missing entries, Zhou et al. (2010) extend PCP to stable PCP (SPCP), which is shown to be stable. However, to the best of our knowledge, there are no existing theoretical results for the case of noisy (and/or outlying) measurements with missing entries.

Inspired by She and Owen (2011), we first establish a useful link between robust matrix completion (2) and PCP in the following proposition. The proof can be found in Appendix A.3.

Proposition 3 (Equivalence) The minimization (2) is equivalent to

$$\underset{L,\,S}{\text{minimize}} \ \ \frac{1}{2}\big\| P_{\Omega_{\mathrm{obs}}} X - P_{\Omega_{\mathrm{obs}}}(L + S) \big\|_F^2 + \gamma\|L\|_* + c\|S\|_1. \qquad (7)$$

That is, the minimizing $Y$ of (2) and the minimizing $L$ of (7) coincide.
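The scalar identity behind Proposition 3 is easy to verify numerically: profiling out $S$ from (7) entrywise is a soft-thresholding step, and the profiled objective is exactly one half of the Huber loss. Here is a small self-contained check (our own construction, not from the paper):

```python
import numpy as np

def rho_c(r, c):
    """Huber loss from Section 2."""
    return np.where(np.abs(r) <= c, r ** 2, c * (2.0 * np.abs(r) - c))

c = 1.5
r = np.linspace(-5.0, 5.0, 1001)                       # residual X_ij - Y_ij
s_star = np.sign(r) * np.maximum(np.abs(r) - c, 0.0)   # minimizing S entry
profiled = 0.5 * (r - s_star) ** 2 + c * np.abs(s_star)
assert np.allclose(profiled, 0.5 * rho_c(r, c))        # equals Huber/2
```

This is why the minimizing $Y$ of (2) and the minimizing $L$ of (7) coincide: minimizing (7) over $S$ with $L$ held fixed leaves precisely the objective of (2).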
Minimization (7) has a high degree of similarity to both PCP and SPCP. It is equivalent to

$$\underset{L,\,S}{\text{minimize}} \ \ \|L\|_* + \lambda\|S\|_1 \quad \text{subject to} \quad \big\| P_{\Omega_{\mathrm{obs}}} X - P_{\Omega_{\mathrm{obs}}}(L + S) \big\|_F^2 \le \delta^2, \qquad (8)$$

where $\lambda = c/\gamma$ and $\delta > 0$ has a one-to-one correspondence to $\gamma$. Compared with PCP, (7) permits the observed matrix to differ from the recovered matrix $(L + S)$, to allow for noisy measurements. Compared with SPCP, (7) permits missing entries, which is necessary for matrix completion problems.

Proposition 3 has two immediate implications. First, the proposed Algorithm 1 provides a general methodology for turning a large and well-developed class of matrix completion algorithms into algorithms for solving SPCP with missing entries. Second, many useful results from PCP can be borrowed to study the theoretical properties of robust matrix completion (2). In particular, we show that (2) leads to stable recovery. With Proposition 3, it suffices to show that (7) achieves stable recovery of $(L_0, S_0')$ from the data $P_{\Omega_{\mathrm{obs}}}(X)$ generated by $P_{\Omega_{\mathrm{obs}}}(L_0 + S_0)$ obeying $\|P_{\Omega_{\mathrm{obs}}} X - P_{\Omega_{\mathrm{obs}}}(L_0 + S_0)\|_F \le \delta$ and $S_0' = P_{\Omega_{\mathrm{obs}}} S_0$. Note that $L_0 = X_0$.

We need some notation to proceed. For simplicity, we assume $n = n_1 = n_2$, but our results can easily be extended to rectangular matrices ($n_1 \ne n_2$). The Euclidean inner product $\langle Q, R\rangle$ is defined as $\mathrm{trace}(Q^\top R)$. Let $p_0$ be the proportion of observed entries. Write $\Gamma \subseteq \Omega_{\mathrm{obs}}$ for the set of locations where the measurements are noisy (but not outliers), and $\Omega = \Omega_{\mathrm{obs}} \setminus \Gamma$ for the support of $S_0' = P_{\Omega_{\mathrm{obs}}} S_0$, i.e., the locations of the outliers. Denote their complements by, respectively, $\Gamma^{\perp}$ and $\Omega^{\perp}$. We define $P_\Gamma$, $P_\Omega$, $P_{\Gamma^{\perp}}$ and $P_{\Omega^{\perp}}$ similarly to the definition of $P_{\Omega_{\mathrm{obs}}}$. Let $r$ be the rank of $L_0$ and $U D V^\top$ the corresponding singular value decomposition of $L_0$, where $U, V \in \mathbb{R}^{n \times r}$ and $D \in \mathbb{R}^{r \times r}$. Similarly to Candès et al. (2011), we consider the linear space of matrices $T := \{ U Q^\top + R V^\top : Q, R \in \mathbb{R}^{n \times r} \}$. Write $P_T$ and $P_{T^{\perp}}$ for the projection operators onto $T$ and $T^{\perp}$ respectively. As in Zhou et al. (2010), we define a set of notations for any pair of matrices $M = (L, S)$. Here, let $\|M\|_F := \sqrt{\|L\|_F^2 + \|S\|_F^2}$ and $\|M\|_\diamond := \|L\|_* + \lambda\|S\|_1$. We also define the pairwise projection operators $P_T \otimes P_\Gamma : (L, S) \mapsto (P_T L, P_\Gamma S)$ and $P_{T^{\perp}} \otimes P_{\Gamma^{\perp}} : (L, S) \mapsto (P_{T^{\perp}} L, P_{\Gamma^{\perp}} S)$. In our theoretical development, we consider the following special subspaces:

$$\Psi := \{ (L, S) : L, S \in \mathbb{R}^{n \times n},\ P_{\Omega_{\mathrm{obs}}} L = P_{\Omega_{\mathrm{obs}}} S,\ P_{\Omega_{\mathrm{obs}}^{\perp}} L = P_{\Omega_{\mathrm{obs}}^{\perp}} S = 0 \},$$

$$\Psi^{\perp} := \{ (L, S) : L, S \in \mathbb{R}^{n \times n},\ P_{\Omega_{\mathrm{obs}}} L + P_{\Omega_{\mathrm{obs}}} S = 0 \},$$

and we write the corresponding projection operators as $P_\Psi$ and $P_{\Psi^{\perp}}$ respectively. Let $M_0 = (L_0, S_0')$. Lastly, for any linear operator $\mathcal{A}$, the operator norm, denoted by $\|\mathcal{A}\|$, is $\sup_{\|Q\|_F = 1} \|\mathcal{A} Q\|_F$. In what follows, we say that an event occurs with high probability if it holds with probability at least $1 - O(n^{-10})$.

To avoid certain pathological cases (see, e.g., Candès and Recht, 2009), an incoherence condition on $U$ and $V$ is usually assumed. Specifically, this condition with parameter $\mu$ is:

$$\max_i \|U^\top e_i\|^2 \le \frac{\mu r}{n_1}, \qquad \max_i \|V^\top e_i\|^2 \le \frac{\mu r}{n_2}, \qquad \|U V^\top\|_\infty \le \sqrt{\frac{\mu r}{n_1 n_2}}, \qquad (9)$$

where $\|Q\|$ is the operator norm or 2-norm of a matrix $Q$ (i.e., the largest singular value of $Q$) and $\|Q\|_\infty = \max_{i,j} |Q_{ij}|$. This condition guarantees that, for small $\mu$, the singular vectors are reasonably spread out.

Theorem 4 (Stable Recovery) Suppose that $L_0$ obeys (9) and that $\Omega_{\mathrm{obs}}$ is uniformly distributed among all sets of cardinality $m = p_0 n^2$, with $p_0 > 0$ being the proportion of observed entries. Further suppose that each observed entry is grossly corrupted (to become an outlier) with probability $\tau$, independently of the others.
Suppose $L_0$ and $S_0$ satisfy $r \le \rho_r n \mu^{-1} (\log n)^{-2}$ and $\tau \le \tau_s$, with $\rho_r, \tau_s$ being positive numerical constants. Choose $\lambda = 1/\sqrt{n p_0}$. Then, with high probability (over the choices of $\Omega$ and $\Omega_{\mathrm{obs}}$), for any $X$ obeying $\|P_{\Omega_{\mathrm{obs}}} X - P_{\Omega_{\mathrm{obs}}}(L_0 + S_0)\|_F \le \delta$, the solution $(\hat L, \hat S)$ to (8) satisfies

$$\|\hat L - L_0\|_F \le (2\sqrt{2} + 8)\sqrt{n}\Big(1 + \sqrt{\tfrac{8}{p_0}}\Big)\delta \quad \text{and} \quad \|\hat S - S_0'\|_F \le (2\sqrt{2} + 8)\,n\sqrt{p_0}\Big(1 + \sqrt{\tfrac{8}{p_0}}\Big)\delta,$$

where $S_0' = P_{\Omega_{\mathrm{obs}}}(S_0)$. The proof of this theorem can be found in Appendix A.4.

5. Empirical Performance

Two sets of numerical experiments and a real data application were conducted to evaluate the practical performance of the proposed methodology. In particular, the performance of the proposed procedure Robust-Impute is compared with that of Soft-Impute, developed by Mazumder et al. (2010). Soft-Impute is selected for comparison because it is one of the most popular matrix completion methods, owing to its simplicity and scalability, and because Mazumder et al. (2010) show that it generally produces results superior to those of other common matrix completion methods such as MMMF of Rennie and Srebro (2005), SVT of Cai et al. (2010) and OptSpace of Keshavan et al. (2010b).

5.1 Experiment 1: Gaussian Entries

This experiment covers the settings used in Mazumder et al. (2010, Section 9) plus additional settings with different proportions of missing entries and outliers. For each simulated data set, the target matrix was generated as $X_0 = U V^\top$, where $U$ and $V$ are random matrices of size $100 \times r$ with independent standard Gaussian entries. Each entry of $X_0$ is then contaminated by additional independent Gaussian noise with standard deviation $\sigma$, which is set to a value such that the signal-to-noise ratio (SNR) is 1. Here the SNR is defined as $\sqrt{\mathrm{Var}(X_0)}/\sigma$, where $\mathrm{Var}(X_0)$ is the variance over all the entries of $X_0$ conditional on $U$ and $V$. Next, for each entry, with probability $p$ yet another, much larger, independent Gaussian noise (with SNR 1/4) is added; these entries are treated as outliers. We denote this contaminated version of $X_0$ by $X$. Lastly, $\Omega_{\mathrm{obs}}$ is uniformly random over the indices of the matrix, with missing proportion $q$. In this study, we used two values for $r$ (5, 10), three values for $p$ (0, 0.05, 0.1) and three values for $q$ (0.25, 0.5, 0.75), giving 18 simulation settings in total. For each setting 200 simulated data sets were generated, and both the non-robust method Soft-Impute and the proposed Robust-Impute were applied to recover $X_0$. We also provide two oracle fits as references. They were produced by applying Soft-Impute to the simulated data set with the outlying observed entries removed (i.e., treated as missing entries), and with the outlying observed entries replaced by non-outlying contaminated entries (i.e., contaminated by independent Gaussian noise with standard deviation $\sigma$), respectively. The first oracle fit is referred to as oracle1 and the second as oracle2 in the following.
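A minimal sketch of this data-generating mechanism is given below. It reflects our reading of the setup, namely that the base noise level is chosen so that SNR equals 1 and that outlying entries receive extra noise four times as large (SNR 1/4); the variable names and conventions are our own.

```python
import numpy as np

rng = np.random.default_rng(0)
n, r, p, q = 100, 10, 0.1, 0.5        # size, rank, outlier prob., missing prob.

U = rng.standard_normal((n, r))
V = rng.standard_normal((n, r))
X0 = U @ V.T                          # low rank target matrix

sigma = X0.std()                      # base noise sd, so that SNR = 1
X = X0 + sigma * rng.standard_normal((n, n))

outlier = rng.random((n, n)) < p      # entries hit by much larger extra noise
X = X + outlier * 4.0 * sigma * rng.standard_normal((n, n))

mask = rng.random((n, n)) >= q        # True marks an observed entry
```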
For the two simulation settings with $r = 10$ and $q = 0.5$, one with $p = 0$ and the other with $p = 0.1$, Figure 1 summarizes the average number of singular value decompositions (SVDs) used and the average test error. Here the test error is defined as

$$\text{Test error} = \frac{\|P_\Gamma(X_0 - \hat X)\|_F^2}{\|P_\Gamma X_0\|_F^2},$$

where $P_\Gamma$ is the projection operator onto the set $\Gamma$ of locations of the observed noisy entries (but not outliers), and $\hat X$ is an estimate of $X_0$.

From Figure 1 (Top), one can see that the performance of Robust-Impute is slightly inferior to that of Soft-Impute in the case of no outliers ($p = 0$), while Robust-Impute gave significantly better results when outliers were present ($p = 0.1$). The inferior performance of Robust-Impute in the absence of outliers is not surprising, as it is widely known in the statistical literature that a small fraction of statistical efficiency is lost when a robust method is applied to a data set without outliers. However, it is also known that the gain can be substantial if outliers are present. As for computational requirements, one can see from Figure 1 (Bottom) that Robust-Impute used only slightly more SVDs on average. For ranks greater than 5, the number of SVDs used by Robust-Impute differs from that of Soft-Impute on average by less than 1. This suggests that Robust-Impute is only slightly more computationally demanding than Soft-Impute. Similar experimental results were obtained for the remaining 16 simulation settings; for brevity, the corresponding results are omitted here but can be found in the supplementary document.

From this experiment some empirical conclusions can be drawn. When there are no outliers, Soft-Impute gives slightly better results, while with outliers, the results from Robust-Impute are substantially better. Since in practice one often does not know whether outliers are present, and Robust-Impute is not much more computationally demanding than Soft-Impute, Robust-Impute appears to be the method of choice if one wants to be conservative.

5.2 Experiment 2: Image Inpainting

In this experiment the target matrix is the so-called Lena image, which has been used by many authors in the image processing literature. It consists of $256 \times 256$ pixels and is shown in Figure 2 (Left). The simulated data sets were generated by contaminating this Lena image with Gaussian noise and/or outliers in the following manner. First, independent Gaussian noise was added to each pixel, where the standard deviation of the noise was set such that the SNR is 3. Next, 10% of the pixels were selected as outliers, and to them additional independent Gaussian noise with SNR 3/4 was added. For selecting missing pixels, two mechanisms were considered. In the first, 40% of the pixels were randomly chosen as missing pixels, while in the second only 10% were missing but they were clustered together to form patches. Two typical simulated data sets are shown in Figure 2 (Middle). Note that Theorem 4 does not cover the second missing mechanism.

Figure 1: Top: the average test errors with their standard error bands (plus or minus one standard error). Bottom: the average number of singular value decompositions used, with standard error bands (plus or minus one standard error). Left: results for the simulation setting $r = 10$, $p = 0$ and $q = 0.5$. Right: results for the simulation setting $r = 10$, $p = 0.1$ and $q = 0.5$.
| Missing mechanism | Method | Training error, rank 50 | 75 | 100 | 125 | Testing error, rank 50 | 75 | 100 | 125 |
|---|---|---|---|---|---|---|---|---|---|
| independent | Soft-Impute | 0.0499 | 0.0351 | 0.0221 | 0.0113 | 0.0578 | 0.0565 | 0.0581 | 0.0620 |
| independent | Robust-Impute | 0.0486 | 0.0371 | 0.0282 | 0.0252 | 0.0546 | 0.0540 | 0.0557 | 0.0571 |
| clustered | Soft-Impute | 0.0487 | 0.0386 | 0.0296 | 0.0214 | 0.0756 | 0.0751 | 0.0760 | 0.0781 |
| clustered | Robust-Impute | 0.0468 | 0.0390 | 0.0321 | 0.0268 | 0.0716 | 0.0714 | 0.0723 | 0.0742 |

Table 1: The average training and testing errors for the Lena experiment.

For each missing mechanism, 200 data sets were generated and both Soft-Impute and Robust-Impute were applied to reconstruct Lena. The average training and testing errors¹ of the recovered images of matrix ranks 50, 75, 100 and 125 are reported in Table 1. For both missing mechanisms, Soft-Impute tends to have lower training errors but larger testing errors than Robust-Impute. In other words, Soft-Impute tends to over-fit the data, and Robust-Impute seems to provide better results. Lastly, for visual evaluation, the recovered images of rank 100 using Robust-Impute are displayed in Figure 2 (Right). From these one can see that the proposed Robust-Impute provided good recoveries under both missing mechanisms.

1. The solution path (formed by the pre-specified set of $\gamma$'s) may not contain any solution of rank 50, 75, 100 or 125. Thus, the average errors were computed over those fits that contained the corresponding fitted ranks. At most 2% of the fits were discarded for this reason.

5.3 Real Data Application: Landsat Thematic Mapper

In this application the target matrix is an image from a Landsat Thematic Mapper data set publicly available at http://ternauscover.science.uq.edu.au/. This data set contains 149 multiband images of $100 \times 100$ pixels, with each image consisting of six bands (blue, green and red, plus three infrared bands). The scene is centered on the Tumbarumba flux tower on the western slopes of the Snowy Mountains in Australia. Due to wild fires or related reasons, some pixels have value zero and can be treated as missing. Also, due to detector malfunction, some isolated pixels have values much higher than the remaining pixels; these can be treated as outliers. We selected an image band with a high missing rate (27.6%) to test our procedure. To evaluate the recovered matrix, the observed pixels were split into training, validation and testing sets consisting of 80%, 10% and 10% of the observed (nonzero) entries respectively. We used the validation set to tune $\gamma$. The validation errors are computed in two ways: the mean squared error (MSE) $\sqrt{\sum_{(i,j) \in \mathcal{V}} (X_{ij} - \hat X_{ij})^2 / |\mathcal{V}|}$ and the mean absolute deviation (MAD) $\mathrm{median}\{|X_{ij} - \hat X_{ij}| : (i,j) \in \mathcal{V}\}$, where $\mathcal{V}$ represents the validation set. The testing errors are computed in the same two ways. Note that the validation and testing sets may contain outliers, and therefore the MAD serves as a robust and reliable performance measure. The corresponding results are shown in Table 2. From this table it can be seen that, in the presence of outliers, Robust-Impute provided better results.

Figure 2: Left: the Lena image. Middle: degraded Lena images from the independent missing mechanism (top) and the clustered missing mechanism (bottom). Right: corresponding recovered images of rank 100 via Robust-Impute.

| Method | Tuning by MSE: rank | MSE | MAD | Tuning by MAD: rank | MSE | MAD |
|---|---|---|---|---|---|---|
| Soft-Impute | 24 | 45.20 | 31.15 | 21 | 45.23 | 31.15 |
| Robust-Impute | 24 | 44.63 | 29.00 | 29 | 44.57 | 28.76 |

Table 2: Rank and testing errors for the real data application.
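For reference, the two validation criteria used to produce Table 2 amount to the following hypothetical helper (our own; note that the MSE here is the root form given in the text, and that the MAD, being a median, is what provides robustness to outlying validation entries):

```python
import numpy as np

def validation_errors(X, X_hat, val_idx):
    """MSE and MAD over a validation set given as a list of (i, j) pairs."""
    rows, cols = np.asarray(val_idx).T
    resid = X[rows, cols] - X_hat[rows, cols]
    mse = np.sqrt(np.mean(resid ** 2))    # sqrt( sum of squares / |V| )
    mad = np.median(np.abs(resid))        # median absolute deviation
    return mse, mad
```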
6. Concluding Remarks

In this paper a classical idea from robust statistics has been brought to the matrix completion problem. The result is a new matrix completion method that can handle noisy and outlying entries. This method uses the Huber function to downweight the effects of outliers. A new algorithm is developed to solve the corresponding optimization problem. This algorithm is relatively fast, easy to implement and monotonically convergent. It can be paired with any existing (non-robust) matrix completion method to make that method robust against outliers. We also developed a specialized version of this algorithm, called Robust-Impute. Its promising empirical performance has been illustrated via numerical experiments. Lastly, we have shown that the proposed method is stable; that is, with high probability, the error of the recovered matrix is bounded by a quantity proportional to the noise level.

Acknowledgments

The authors are most grateful to the referees and the action editor for their many constructive and useful comments, which led to a much improved version of the paper. The work of Wong was partially supported by the National Science Foundation under Grants DMS-1612985 and DMS-1711952 (subcontract). The work of Lee was partially supported by the National Science Foundation under Grants DMS-1512945 and DMS-1513484.

Appendix A. Technical Details

In this section, we provide the technical details of our theoretical results.

A.1 Proofs of Proposition 1

Proof By rewriting

$$\|P_{\Omega_{\mathrm{obs}}} Z^{(k+1)} - P_{\Omega_{\mathrm{obs}}} Y^{(k+1)}\|_F^2 = \|P_{\Omega_{\mathrm{obs}}} Z^{(k+1)} - P_{\Omega_{\mathrm{obs}}} Y^{(k)}\|_F^2 + \|P_{\Omega_{\mathrm{obs}}} Y^{(k)} - P_{\Omega_{\mathrm{obs}}} Y^{(k+1)}\|_F^2 + 2\,\mathrm{trace}\big[ \{P_{\Omega_{\mathrm{obs}}} Z^{(k+1)} - P_{\Omega_{\mathrm{obs}}} Y^{(k)}\}^\top \{P_{\Omega_{\mathrm{obs}}} Y^{(k)} - P_{\Omega_{\mathrm{obs}}} Y^{(k+1)}\} \big],$$

and using $f(Y^{(k+1)} \mid Z^{(k+1)}) \le f(Y^{(k)} \mid Z^{(k+1)})$, we have

$$\frac{1}{2}\|P_{\Omega_{\mathrm{obs}}} Y^{(k)} - P_{\Omega_{\mathrm{obs}}} Y^{(k+1)}\|_F^2 + \mathrm{trace}\big[ \{P_{\Omega_{\mathrm{obs}}} Z^{(k+1)} - P_{\Omega_{\mathrm{obs}}} Y^{(k)}\}^\top \{P_{\Omega_{\mathrm{obs}}} Y^{(k)} - P_{\Omega_{\mathrm{obs}}} Y^{(k+1)}\} \big] + \gamma\|Y^{(k+1)}\|_* \le \gamma\|Y^{(k)}\|_*.$$

Thus, substituting $Z^{(k+1)} = P_{\Omega_{\mathrm{obs}}} Y^{(k)} + \frac{1}{2}\rho_c'(P_{\Omega_{\mathrm{obs}}} X - P_{\Omega_{\mathrm{obs}}} Y^{(k)})$,

$$\frac{1}{2}\|P_{\Omega_{\mathrm{obs}}} Y^{(k)} - P_{\Omega_{\mathrm{obs}}} Y^{(k+1)}\|_F^2 + \frac{1}{2}\,\mathrm{trace}\big[ \rho_c'(P_{\Omega_{\mathrm{obs}}} X - P_{\Omega_{\mathrm{obs}}} Y^{(k)})^\top \{P_{\Omega_{\mathrm{obs}}} Y^{(k)} - P_{\Omega_{\mathrm{obs}}} Y^{(k+1)}\} \big] + \gamma\|Y^{(k+1)}\|_* \le \gamma\|Y^{(k)}\|_*. \qquad (10)$$

Here we abuse notation slightly so that $\rho_c'$ of a matrix simply means the matrix formed by applying $\rho_c'$ to its entries. Note that for each $(i,j) \in \Omega_{\mathrm{obs}}$, by Taylor's expansion,

$$\rho_c(X_{ij} - Y^{(k+1)}_{ij}) = \rho_c(X_{ij} - Y^{(k)}_{ij}) + (Y^{(k)}_{ij} - Y^{(k+1)}_{ij})\,\rho_c'(X_{ij} - Y^{(k)}_{ij}) + \int_{X_{ij} - Y^{(k)}_{ij}}^{X_{ij} - Y^{(k+1)}_{ij}} (X_{ij} - Y^{(k+1)}_{ij} - t)\,\rho_c''(t)\,dt,$$

and the last integral term is less than or equal to $(Y^{(k)}_{ij} - Y^{(k+1)}_{ij})^2$ because $\rho_c'' \le 2$ almost everywhere. Thus,

$$\sum_{(i,j) \in \Omega_{\mathrm{obs}}} \rho_c(X_{ij} - Y^{(k+1)}_{ij}) \le \sum_{(i,j) \in \Omega_{\mathrm{obs}}} \rho_c(X_{ij} - Y^{(k)}_{ij}) + \mathrm{trace}\big[ \rho_c'(P_{\Omega_{\mathrm{obs}}} X - P_{\Omega_{\mathrm{obs}}} Y^{(k)})^\top \{P_{\Omega_{\mathrm{obs}}} Y^{(k)} - P_{\Omega_{\mathrm{obs}}} Y^{(k+1)}\} \big] + \|P_{\Omega_{\mathrm{obs}}} Y^{(k)} - P_{\Omega_{\mathrm{obs}}} Y^{(k+1)}\|_F^2.$$

Now, multiplying this inequality by $1/2$ and plugging it into (10), we have $g(Y^{(k+1)}) \le g(Y^{(k)})$.

Proof (Alternative proof of Proposition 1) Similarly to the proof of Proposition 3 in Section A.3, one can show that

$$g(Y) = \min_S \ \frac{1}{2}\|P_{\Omega_{\mathrm{obs}}} X - P_{\Omega_{\mathrm{obs}}} Y - P_{\Omega_{\mathrm{obs}}} S\|_F^2 + \gamma\|Y\|_* + c\|S\|_1, \qquad (11)$$

where the minimizer is $S(Y) = P_{\Omega_{\mathrm{obs}}} X - P_{\Omega_{\mathrm{obs}}} Y - \frac{1}{2}\psi_c(P_{\Omega_{\mathrm{obs}}} X - P_{\Omega_{\mathrm{obs}}} Y)$, i.e., the entrywise soft-thresholding at level $c$ of the observed residuals. It follows that

$$Z^{(k+1)} = P_{\Omega_{\mathrm{obs}}} Y^{(k)} + \frac{1}{2}\psi_c(P_{\Omega_{\mathrm{obs}}} X - P_{\Omega_{\mathrm{obs}}} Y^{(k)}) = P_{\Omega_{\mathrm{obs}}} X - \Big\{ P_{\Omega_{\mathrm{obs}}} X - P_{\Omega_{\mathrm{obs}}} Y^{(k)} - \frac{1}{2}\psi_c(P_{\Omega_{\mathrm{obs}}} X - P_{\Omega_{\mathrm{obs}}} Y^{(k)}) \Big\} = P_{\Omega_{\mathrm{obs}}} X - S(Y^{(k)}).$$

Now, due to (11),

$$g(Y^{(k+1)}) = \min_S \ \frac{1}{2}\|P_{\Omega_{\mathrm{obs}}} X - P_{\Omega_{\mathrm{obs}}} Y^{(k+1)} - P_{\Omega_{\mathrm{obs}}} S\|_F^2 + \gamma\|Y^{(k+1)}\|_* + c\|S\|_1$$
$$\le \frac{1}{2}\|P_{\Omega_{\mathrm{obs}}} X - P_{\Omega_{\mathrm{obs}}} Y^{(k+1)} - P_{\Omega_{\mathrm{obs}}} S(Y^{(k)})\|_F^2 + \gamma\|Y^{(k+1)}\|_* + c\|S(Y^{(k)})\|_1$$
$$= f(Y^{(k+1)} \mid P_{\Omega_{\mathrm{obs}}} X - S(Y^{(k)})) + c\|S(Y^{(k)})\|_1 = f(Y^{(k+1)} \mid Z^{(k+1)}) + c\|S(Y^{(k)})\|_1$$
$$\le f(Y^{(k)} \mid Z^{(k+1)}) + c\|S(Y^{(k)})\|_1 = \frac{1}{2}\|P_{\Omega_{\mathrm{obs}}} X - P_{\Omega_{\mathrm{obs}}} Y^{(k)} - P_{\Omega_{\mathrm{obs}}} S(Y^{(k)})\|_F^2 + \gamma\|Y^{(k)}\|_* + c\|S(Y^{(k)})\|_1 = g(Y^{(k)}).$$
A.2 Proof of Theorem 2 for Algorithm 1

Proof This proof closely follows the proofs of Lemma 2.3 and Theorem 3.1 in Beck and Teboulle (2009), with their approximation model replaced by

$$\zeta(Y, \tilde Y) = g_1(\tilde Y) + \langle Y - \tilde Y, \nabla g_1(\tilde Y)\rangle + \frac{1}{2}\|P_{\Omega_{\mathrm{obs}}} Y - P_{\Omega_{\mathrm{obs}}} \tilde Y\|_F^2 + g_2(Y)$$
$$= g_1(\tilde Y) - \frac{1}{2}\langle Y - \tilde Y, \psi_c(P_{\Omega_{\mathrm{obs}}} X - P_{\Omega_{\mathrm{obs}}} \tilde Y)\rangle + \frac{1}{2}\|P_{\Omega_{\mathrm{obs}}} Y - P_{\Omega_{\mathrm{obs}}} \tilde Y\|_F^2 + g_2(Y)$$
$$= g_1(\tilde Y) - \frac{1}{2}\langle P_{\Omega_{\mathrm{obs}}} Y - P_{\Omega_{\mathrm{obs}}} \tilde Y, \psi_c(P_{\Omega_{\mathrm{obs}}} X - P_{\Omega_{\mathrm{obs}}} \tilde Y)\rangle + \frac{1}{2}\|P_{\Omega_{\mathrm{obs}}} Y - P_{\Omega_{\mathrm{obs}}} \tilde Y\|_F^2 + g_2(Y),$$

where $\langle X, Y\rangle = \sum_{i,j} X_{ij} Y_{ij}$. Note that $\arg\min_Y \zeta(Y, \tilde Y)$ is the same as $\arg\min_Y f(Y \mid Z)$, where $Z = P_{\Omega_{\mathrm{obs}}} \tilde Y + \frac{1}{2}\psi_c(P_{\Omega_{\mathrm{obs}}} X - P_{\Omega_{\mathrm{obs}}} \tilde Y)$, in Steps 2(a)-(c) of Algorithm 1. Let $\Pi(\tilde Y) = \arg\min_Y \zeta(Y, \tilde Y)$; therefore $Y^{(k+1)} = \Pi(Y^{(k)})$. Moreover,

$$g_1(Y) \le g_1(\tilde Y) + \langle Y - \tilde Y, \nabla g_1(\tilde Y)\rangle + \frac{1}{2}\|P_{\Omega_{\mathrm{obs}}} Y - P_{\Omega_{\mathrm{obs}}} \tilde Y\|_F^2$$

for any $Y$ and $\tilde Y$. Therefore $g(\Pi(\tilde Y)) \le \zeta(\Pi(\tilde Y), \tilde Y)$ for any $\tilde Y \in \mathbb{R}^{n_1 \times n_2}$. To proceed, we need a modified version of Lemma 2.3 in Beck and Teboulle (2009).

Lemma 5 For any $Y, \tilde Y \in \mathbb{R}^{n_1 \times n_2}$,

$$g(Y) - g(\Pi(\tilde Y)) \ge \frac{1}{2}\|P_{\Omega_{\mathrm{obs}}} \Pi(\tilde Y) - P_{\Omega_{\mathrm{obs}}} \tilde Y\|_F^2 + \langle P_{\Omega_{\mathrm{obs}}} \tilde Y - P_{\Omega_{\mathrm{obs}}} Y,\ P_{\Omega_{\mathrm{obs}}} \Pi(\tilde Y) - P_{\Omega_{\mathrm{obs}}} \tilde Y\rangle.$$

This lemma is proved as follows. Since $\Pi(\tilde Y)$ is the minimizer of the convex function $\zeta(\cdot, \tilde Y)$, there exists $b(\tilde Y) \in \partial g_2(\Pi(\tilde Y))$, the subdifferential of $g_2$ at $\Pi(\tilde Y)$, such that

$$\nabla g_1(\tilde Y) + P_{\Omega_{\mathrm{obs}}} \Pi(\tilde Y) - P_{\Omega_{\mathrm{obs}}} \tilde Y + b(\tilde Y) = 0.$$

By the convexity of $g_1$ and $g_2$,

$$g_1(Y) \ge g_1(\tilde Y) - \frac{1}{2}\langle Y - \tilde Y, \psi_c(P_{\Omega_{\mathrm{obs}}} X - P_{\Omega_{\mathrm{obs}}} \tilde Y)\rangle, \qquad g_2(Y) \ge g_2(\Pi(\tilde Y)) + \langle Y - \Pi(\tilde Y), b(\tilde Y)\rangle,$$

so that

$$g(Y) \ge g_1(\tilde Y) - \frac{1}{2}\langle Y - \tilde Y, \psi_c(P_{\Omega_{\mathrm{obs}}} X - P_{\Omega_{\mathrm{obs}}} \tilde Y)\rangle + g_2(\Pi(\tilde Y)) + \langle Y - \Pi(\tilde Y), b(\tilde Y)\rangle. \qquad (12)$$

Since $g(\Pi(\tilde Y)) \le \zeta(\Pi(\tilde Y), \tilde Y)$, we have $g(Y) - g(\Pi(\tilde Y)) \ge g(Y) - \zeta(\Pi(\tilde Y), \tilde Y)$. Plugging in (12), the definition of $\zeta$, and the condition on $b(\tilde Y)$, the conclusion of the lemma follows.

Using Lemma 5 with $Y = Y^*$ and $\tilde Y = Y^{(k)}$, we have

$$2\{g(Y^*) - g(Y^{(k+1)})\} \ge \|P_{\Omega_{\mathrm{obs}}} Y^* - P_{\Omega_{\mathrm{obs}}} Y^{(k+1)}\|_F^2 - \|P_{\Omega_{\mathrm{obs}}} Y^* - P_{\Omega_{\mathrm{obs}}} Y^{(k)}\|_F^2.$$

Summing this over $k = 0, \ldots, m-1$,

$$2\Big\{ m\,g(Y^*) - \sum_{k=0}^{m-1} g(Y^{(k+1)}) \Big\} \ge \|P_{\Omega_{\mathrm{obs}}} Y^* - P_{\Omega_{\mathrm{obs}}} Y^{(m)}\|_F^2 - \|P_{\Omega_{\mathrm{obs}}} Y^* - P_{\Omega_{\mathrm{obs}}} Y^{(0)}\|_F^2. \qquad (13)$$

Applying Lemma 5 with $Y = \tilde Y = Y^{(k)}$,

$$2\{ g(Y^{(k)}) - g(Y^{(k+1)}) \} \ge \|P_{\Omega_{\mathrm{obs}}} Y^{(k+1)} - P_{\Omega_{\mathrm{obs}}} Y^{(k)}\|_F^2.$$

Multiplying this by $k$ and summing over $k = 0, \ldots, m-1$,

$$2\Big\{ -m\,g(Y^{(m)}) + \sum_{k=0}^{m-1} g(Y^{(k+1)}) \Big\} \ge \sum_{k=0}^{m-1} k\,\|P_{\Omega_{\mathrm{obs}}} Y^{(k+1)} - P_{\Omega_{\mathrm{obs}}} Y^{(k)}\|_F^2. \qquad (14)$$

Adding (13) and (14),

$$2m\{ g(Y^*) - g(Y^{(m)}) \} \ge \|P_{\Omega_{\mathrm{obs}}} Y^* - P_{\Omega_{\mathrm{obs}}} Y^{(m)}\|_F^2 - \|P_{\Omega_{\mathrm{obs}}} Y^* - P_{\Omega_{\mathrm{obs}}} Y^{(0)}\|_F^2 + \sum_{k=0}^{m-1} k\,\|P_{\Omega_{\mathrm{obs}}} Y^{(k+1)} - P_{\Omega_{\mathrm{obs}}} Y^{(k)}\|_F^2,$$

and hence

$$g(Y^{(m)}) - g(Y^*) \le \frac{\|P_{\Omega_{\mathrm{obs}}} Y^* - P_{\Omega_{\mathrm{obs}}} Y^{(0)}\|_F^2}{2m}.$$

A.3 Proof of Proposition 3

Proof Since both (2) and (7) are convex, we only need to consider the sub-gradients. The sub-gradient condition for a minimizer of (2) is

$$0 \in -\frac{1}{2}\rho_c'(P_{\Omega_{\mathrm{obs}}} X - P_{\Omega_{\mathrm{obs}}} Y) + \gamma\,\partial\|Y\|_*, \qquad (15)$$

where $\partial\|\cdot\|_*$ represents the set of subgradients of the nuclear norm. The sub-gradient conditions for a minimizer of (7) are

$$0 \in -P_{\Omega_{\mathrm{obs}}}(X - L - S) + \gamma\,\partial\|L\|_*, \qquad (16)$$
$$0 \in -P_{\Omega_{\mathrm{obs}}}(X - L - S) + c\,\partial\|S\|_1, \qquad (17)$$

where $\partial\|\cdot\|_1$ represents the set of subgradients of $\|\cdot\|_1$. Here (17) implies, for $(i,j) \in \Omega_{\mathrm{obs}}$,

$$S_{ij} = \begin{cases} X_{ij} - L_{ij} - c, & X_{ij} - L_{ij} > c, \\ 0, & |X_{ij} - L_{ij}| \le c, \\ X_{ij} - L_{ij} + c, & X_{ij} - L_{ij} < -c, \end{cases}$$

and $S_{ij} = 0$ for $(i,j) \in \Omega_{\mathrm{obs}}^{\perp}$. Note that, for $(i,j) \in \Omega_{\mathrm{obs}}$, $X_{ij} - L_{ij} - S_{ij} = \rho_c'(X_{ij} - L_{ij})/2$. Plugging this into (16), we obtain (15), which proves the proposition.

A.4 Proof of Theorem 4

To prove Theorem 4, we first show three lemmas and one proposition.

Lemma 6 (Modified Lemma A.2 in Candès et al. 2011) Assume that, for any matrix $Q$, $\|P_T P_{\Gamma^{\perp}} Q\|_F \le n\,\|P_{T^{\perp}} P_{\Gamma^{\perp}} Q\|_F$. Suppose there is a pair $(W, F)$ obeying

$$U V^\top + W + P_T D = \lambda(\mathrm{sgn}(S_0') + F) \qquad (19)$$

with $P_T W = 0$, $\|W\| < 1/2$, $P_\Gamma F = 0$, $\|F\|_\infty < 1/2$ and $\|P_T D\|_F \le n^{-2}$. Then, for any perturbation $H = (H_L, H_S)$ satisfying $P_{\Omega_{\mathrm{obs}}} H_L + P_{\Omega_{\mathrm{obs}}} H_S = 0$,

$$\|M_0 + H\|_\diamond \ge \|M_0\|_\diamond + a(n)\,\|P_{T^{\perp}} H_L\|_* + b(n, \lambda)\,\|P_\Gamma H_L\|_1,$$

where $a(n) > 0$ and $b(n, \lambda) > 0$ are quantities that satisfy $1/a(n) + \lambda/b(n, \lambda) \le 4$ whenever $\lambda > 4/n$ and $n \ge 3$. The proof of this lemma can be found in Candès et al. (2011).
To proceed, we write $\|M\|_{F,\lambda}^2 = \|L\|_F^2 + \lambda^2\|S\|_F^2$ for any pair of matrices $M = (L, S)$.

Lemma 7 Let $M = (M_L, M_S)$ be any pair of matrices. Suppose $\|P_T P_\Omega\|^2 \le p_0/8$ and $\|P_{\Omega_{\mathrm{obs}}} P_T M_L\|_F^2 \ge p_0\|P_T M_L\|_F^2/2$. Then

$$\|P_\Psi (P_T \otimes P_\Omega) M\|_{F,\lambda}^2 \ge \frac{(1+\lambda^2)p_0}{16}\,\|(P_T \otimes P_\Omega) M\|_F^2.$$

Proof (Proof of Lemma 7) Note that for any $M' = (M_L', M_S')$,

$$P_\Psi M' = \Big( \frac{P_{\Omega_{\mathrm{obs}}}(M_L' + M_S')}{2},\ \frac{P_{\Omega_{\mathrm{obs}}}(M_L' + M_S')}{2} \Big),$$

so that

$$\|P_\Psi (P_T \otimes P_\Omega) M\|_{F,\lambda}^2 = \frac{1+\lambda^2}{4}\,\|P_{\Omega_{\mathrm{obs}}}(P_T M_L + P_\Omega M_S)\|_F^2 = \frac{1+\lambda^2}{4}\Big( \|P_{\Omega_{\mathrm{obs}}} P_T M_L\|_F^2 + \|P_\Omega M_S\|_F^2 + 2\langle P_{\Omega_{\mathrm{obs}}} P_T M_L, P_\Omega M_S\rangle \Big),$$

where the last equality is due to $\Omega \subseteq \Omega_{\mathrm{obs}}$. By $\|P_T P_\Omega\|^2 \le p_0/8$,

$$|\langle P_{\Omega_{\mathrm{obs}}} P_T M_L, P_\Omega M_S\rangle| = |\langle P_T M_L, P_\Omega M_S\rangle| = |\langle P_T M_L, (P_T P_\Omega) P_\Omega M_S\rangle| \le \|P_T P_\Omega\|\,\|P_T M_L\|_F\,\|P_\Omega M_S\|_F \le \sqrt{\frac{p_0}{8}}\,\|P_T M_L\|_F\,\|P_\Omega M_S\|_F.$$

Combining this with $\|P_{\Omega_{\mathrm{obs}}} P_T M_L\|_F^2 \ge p_0\|P_T M_L\|_F^2/2$, we have

$$\|P_\Psi (P_T \otimes P_\Omega) M\|_{F,\lambda}^2 \ge \frac{1+\lambda^2}{4}\Big( \frac{p_0}{2}\|P_T M_L\|_F^2 + \|P_\Omega M_S\|_F^2 - \sqrt{\frac{p_0}{2}}\,\|P_T M_L\|_F\,\|P_\Omega M_S\|_F \Big).$$

Since $2(x^2 + y^2 - xy) \ge x^2 + y^2$ for $x, y \ge 0$ (applied with $x = \sqrt{p_0/2}\,\|P_T M_L\|_F$ and $y = \|P_\Omega M_S\|_F$),

$$\|P_\Psi (P_T \otimes P_\Omega) M\|_{F,\lambda}^2 \ge \frac{1+\lambda^2}{8}\Big( \frac{p_0}{2}\|P_T M_L\|_F^2 + \|P_\Omega M_S\|_F^2 \Big) \ge \frac{(1+\lambda^2)p_0}{16}\,\|(P_T \otimes P_\Omega) M\|_F^2.$$

Lemma 8 Let $M = (M_L, M_S)$ be any pair of matrices. Then $\|P_\Psi M\|_{F,\lambda}^2 \le \|M\|_{F,\lambda}^2 / 2$.

Proof (Proof of Lemma 8) Write $M^\Psi = (M_L^\Psi, M_S^\Psi) = P_\Psi M$. Since $\|M_L^\Psi\|_F^2 = \|M_S^\Psi\|_F^2$,

$$\|P_\Psi M\|_{F,\lambda}^2 = \|M_L^\Psi\|_F^2 + \lambda^2\|M_S^\Psi\|_F^2 = \frac{1}{2}\big( \|M_L^\Psi\|_F^2 + \|M_S^\Psi\|_F^2 \big) + \frac{\lambda^2}{2}\big( \|M_L^\Psi\|_F^2 + \|M_S^\Psi\|_F^2 \big) = \frac{1+\lambda^2}{2}\,\|M^\Psi\|_F^2 \le \frac{1}{2}\,\|M\|_{F,\lambda}^2.$$

Proposition 9 Assume that, for any matrix $Q$, $\|P_T P_{\Gamma^{\perp}} Q\|_F \le n\,\|P_{T^{\perp}} P_{\Gamma^{\perp}} Q\|_F$ and $\|P_{\Omega_{\mathrm{obs}}} P_T Q\|_F^2 \ge p_0\|P_T Q\|_F^2/2$. Further suppose that $4/n < \lambda \le 1$, $n \ge 3$, $p_0 > 0$, $\|P_T P_\Omega\|^2 \le p_0/8$ and that there exists a pair $(W, F)$ obeying (19). Then the solution $\hat M = (\hat L, \hat S)$ to (7) satisfies

$$\|\hat M - M_0\|_{F,\lambda} \le \Big( \sqrt{1+\lambda^2} + 4\Big(1 + \sqrt{\tfrac{8}{p_0}}\Big) \Big)\big( \sqrt{n} + n\lambda\sqrt{p_0} \big)\,\delta,$$

where $M_0 = (L_0, S_0')$ is such that $\|P_{\Omega_{\mathrm{obs}}} X - P_{\Omega_{\mathrm{obs}}}(L_0 + S_0)\|_F \le \delta$ and $S_0' = P_{\Omega_{\mathrm{obs}}} S_0$. Further, if $\lambda = 1/\sqrt{n p_0}$ (which implies $1/n < p_0 < n/16$), we obtain

$$\|\hat L - L_0\|_F \le (2\sqrt{2} + 8)\sqrt{n}\Big(1 + \sqrt{\tfrac{8}{p_0}}\Big)\delta \quad \text{and} \quad \|\hat S - S_0'\|_F \le (2\sqrt{2} + 8)\,n\sqrt{p_0}\Big(1 + \sqrt{\tfrac{8}{p_0}}\Big)\delta.$$

Proof (Proof of Proposition 9) Write $\hat M = M_0 + H$, where $H = (H_L, H_S)$, and let $H^\Psi = (H_L^\Psi, H_S^\Psi) = P_\Psi H$ and $H^{\Psi^{\perp}} = (H_L^{\Psi^{\perp}}, H_S^{\Psi^{\perp}}) = P_{\Psi^{\perp}} H$. We want to bound

$$\|H\|_{F,\lambda} = \|H^\Psi + H^{\Psi^{\perp}}\|_{F,\lambda} \le \|H^\Psi\|_{F,\lambda} + \|H^{\Psi^{\perp}}\|_{F,\lambda} \le \|H^\Psi\|_{F,\lambda} + \|(P_{T^{\perp}} \otimes P_\Gamma) H^{\Psi^{\perp}}\|_{F,\lambda} + \|(P_T \otimes P_{\Gamma^{\perp}}) H^{\Psi^{\perp}}\|_{F,\lambda}. \qquad (20)$$

We start with the first term of (20). Since $H_L^\Psi = H_S^\Psi = \frac{1}{2}P_{\Omega_{\mathrm{obs}}}(H_L + H_S)$,

$$\|H^\Psi\|_{F,\lambda} \le \frac{\sqrt{1+\lambda^2}}{2}\,\|P_{\Omega_{\mathrm{obs}}}(H_L + H_S)\|_F \le \frac{\sqrt{1+\lambda^2}}{2}\Big( \|P_{\Omega_{\mathrm{obs}}}(\hat L + \hat S - X)\|_F + \|P_{\Omega_{\mathrm{obs}}}(L_0 + S_0' - X)\|_F \Big) \le \sqrt{1+\lambda^2}\,\delta,$$

where the last inequality is due to the fact that both $M_0$ and $\hat M$ are feasible.

We next turn to the second term of (20). First, we have

$$\|M_0\|_\diamond \ge \|\hat M\|_\diamond = \|M_0 + H\|_\diamond \ge \|M_0 + H^{\Psi^{\perp}}\|_\diamond - \|H^\Psi\|_\diamond.$$

By Lemma 6,

$$\|M_0 + H^{\Psi^{\perp}}\|_\diamond \ge \|M_0\|_\diamond + a(n)\,\|P_{T^{\perp}} H_L^{\Psi^{\perp}}\|_* + b(n, \lambda)\,\|P_\Gamma H_L^{\Psi^{\perp}}\|_1,$$

where $a(n)$ and $b(n, \lambda)$ are as in Lemma 6. Combining the above inequalities,

$$\|H^\Psi\|_\diamond \ge a(n)\,\|P_{T^{\perp}} H_L^{\Psi^{\perp}}\|_* + b(n, \lambda)\,\|P_\Gamma H_L^{\Psi^{\perp}}\|_1. \qquad (21)$$

By the assumptions $\lambda > 4/n$ and $n \ge 3$, we have $a(n) > 0$, $b(n, \lambda) > 0$ and $1/a(n) + \lambda/b(n, \lambda) \le 4$. In particular, (21) implies $\|H^\Psi\|_\diamond \ge a(n)\|P_{T^{\perp}} H_L^{\Psi^{\perp}}\|_*$ and $\|H^\Psi\|_\diamond \ge b(n, \lambda)\|P_\Gamma H_L^{\Psi^{\perp}}\|_1$. Now we are ready to bound the second term of (20). Since $P_\Gamma H_S^{\Psi^{\perp}} = -P_\Gamma H_L^{\Psi^{\perp}}$ (recall $\Gamma \subseteq \Omega_{\mathrm{obs}}$ and $H^{\Psi^{\perp}} \in \Psi^{\perp}$),

$$\|(P_{T^{\perp}} \otimes P_\Gamma) H^{\Psi^{\perp}}\|_{F,\lambda} \le \|P_{T^{\perp}} H_L^{\Psi^{\perp}}\|_F + \lambda\,\|P_\Gamma H_S^{\Psi^{\perp}}\|_F \le \|P_{T^{\perp}} H_L^{\Psi^{\perp}}\|_* + \lambda\,\|P_\Gamma H_S^{\Psi^{\perp}}\|_1 \le \Big( \frac{1}{a(n)} + \frac{\lambda}{b(n,\lambda)} \Big)\|H^\Psi\|_\diamond \le 4\big( \|H_L^\Psi\|_* + \lambda\,\|H_S^\Psi\|_1 \big).$$

As for the third term of (20), we apply Lemma 7 and the bound on the second term. As $P_\Psi H^{\Psi^{\perp}} = 0$, we have $P_\Psi (P_T \otimes P_{\Gamma^{\perp}}) H^{\Psi^{\perp}} + P_\Psi (P_{T^{\perp}} \otimes P_\Gamma) H^{\Psi^{\perp}} = 0$. Therefore, due to Lemma 8,

$$\|P_\Psi (P_T \otimes P_{\Gamma^{\perp}}) H^{\Psi^{\perp}}\|_{F,\lambda} = \|P_\Psi (P_{T^{\perp}} \otimes P_\Gamma) H^{\Psi^{\perp}}\|_{F,\lambda} \le \frac{1}{\sqrt{2}}\,\|(P_{T^{\perp}} \otimes P_\Gamma) H^{\Psi^{\perp}}\|_{F,\lambda}.$$

Since $P_{\Omega_{\mathrm{obs}}^{\perp}} H_S$ does not affect the feasibility of $M_0 + H$, and $H$ is chosen so that $\|M_0 + H\|_\diamond$ is minimized, we have $P_{\Omega_{\mathrm{obs}}^{\perp}} H_S^{\Psi^{\perp}} = P_{\Omega_{\mathrm{obs}}^{\perp}} H_S = 0$, which implies $(P_T \otimes P_{\Gamma^{\perp}}) H^{\Psi^{\perp}} = (P_T \otimes P_\Omega) H^{\Psi^{\perp}}$. Thus, by Lemma 7,

$$\|(P_T \otimes P_{\Gamma^{\perp}}) H^{\Psi^{\perp}}\|_F \le \sqrt{\frac{8}{(1+\lambda^2)p_0}}\,\|(P_{T^{\perp}} \otimes P_\Gamma) H^{\Psi^{\perp}}\|_{F,\lambda} \le \sqrt{\frac{8}{p_0}}\,\|(P_{T^{\perp}} \otimes P_\Gamma) H^{\Psi^{\perp}}\|_{F,\lambda},$$

and $\|(P_T \otimes P_{\Gamma^{\perp}}) H^{\Psi^{\perp}}\|_{F,\lambda} \le \|(P_T \otimes P_{\Gamma^{\perp}}) H^{\Psi^{\perp}}\|_F$ as $\lambda \le 1$. Collecting the above bounds for the three terms, we derive the bound for $\|H\|_{F,\lambda}$:

$$\|H\|_{F,\lambda} \le \sqrt{1+\lambda^2}\,\delta + 4\Big(1 + \sqrt{\tfrac{8}{p_0}}\Big)\big( \|H_L^\Psi\|_* + \lambda\,\|H_S^\Psi\|_1 \big).$$

Finally, $\|H_L^\Psi\|_* \le \sqrt{n}\,\|H_L^\Psi\|_F$, $\|H_S^\Psi\|_1 \le \sqrt{p_0 n^2}\,\|H_S^\Psi\|_F$ (since $H_S^\Psi$ is supported on $\Omega_{\mathrm{obs}}$), and $\|H_L^\Psi\|_F = \|H_S^\Psi\|_F = \|P_{\Omega_{\mathrm{obs}}}(H_L + H_S)\|_F/2 \le \delta$. Therefore,

$$\|H\|_{F,\lambda} \le \Big( \sqrt{1+\lambda^2} + 4\Big(1 + \sqrt{\tfrac{8}{p_0}}\Big) \Big)\big( \sqrt{n} + n\lambda\sqrt{p_0} \big)\,\delta.$$

Assume that $\lambda = 1/\sqrt{n p_0}$.
First we note that, due to $\lambda > 4/n$, this choice imposes a range on $p_0$: $1/n < p_0 < n/16$. We now simplify the bound for $\|H\|_{F,\lambda}$. With $\lambda = 1/\sqrt{n p_0}$ we have $n\lambda\sqrt{p_0} = \sqrt{n}$ and $\sqrt{1+\lambda^2} \le \sqrt{2}$, so

$$\Big( \sqrt{1+\lambda^2} + 4\Big(1 + \sqrt{\tfrac{8}{p_0}}\Big) \Big)\big( \sqrt{n} + n\lambda\sqrt{p_0} \big) \le \Big( \sqrt{2} + 4\Big(1 + \sqrt{\tfrac{8}{p_0}}\Big) \Big)\, 2\sqrt{n} \le (2\sqrt{2} + 8)\sqrt{n}\Big(1 + \sqrt{\tfrac{8}{p_0}}\Big).$$

This implies

$$\|H_L\|_F \le (2\sqrt{2} + 8)\sqrt{n}\Big(1 + \sqrt{\tfrac{8}{p_0}}\Big)\delta \quad \text{and} \quad \|H_S\|_F \le \frac{1}{\lambda}\,\|H\|_{F,\lambda} \le (2\sqrt{2} + 8)\,n\sqrt{p_0}\Big(1 + \sqrt{\tfrac{8}{p_0}}\Big)\delta.$$

To prove Theorem 4, we establish one additional lemma.

Lemma 10 Suppose $\|P_T - p_0^{-1} P_T P_{\Omega_{\mathrm{obs}}} P_T\| \le 1/2$. Then for any matrix $Q$,

$$\|P_{\Omega_{\mathrm{obs}}} P_T Q\|_F^2 \ge \frac{p_0}{2}\,\|P_T Q\|_F^2.$$

Proof (Proof of Lemma 10) By the assumption, for any matrix $Q$,

$$\|P_{\Omega_{\mathrm{obs}}} P_T Q\|_F^2 = \langle P_{\Omega_{\mathrm{obs}}} P_T Q, P_{\Omega_{\mathrm{obs}}} P_T Q\rangle = \langle P_T Q, P_T P_{\Omega_{\mathrm{obs}}} P_T Q\rangle = p_0\,\langle P_T Q, p_0^{-1} P_T P_{\Omega_{\mathrm{obs}}} P_T Q\rangle$$
$$= p_0\Big( \|P_T Q\|_F^2 + \langle P_T Q, (p_0^{-1} P_T P_{\Omega_{\mathrm{obs}}} P_T - P_T) Q\rangle \Big) \ge \frac{p_0}{2}\,\|P_T Q\|_F^2.$$

Proof (Proof of Theorem 4) Recall that we say an event occurs with high probability if it holds with probability at least $1 - O(n^{-10})$. Due to the asymptotic nature of Theorem 4, we only require the conditions of Proposition 9 to hold asymptotically with high probability. By Lemma A.3 of Candès et al. (2011), $\|P_T P_{\Gamma^{\perp}} Q\|_F \le n\,\|P_{T^{\perp}} P_{\Gamma^{\perp}} Q\|_F$ for all $Q$, with high probability. By Lemma 10 and Theorem 2.6 of Candès et al. (2011) (see also Candès and Recht, 2009, Theorem 4.1), $\|P_{\Omega_{\mathrm{obs}}} P_T Q\|_F^2 \ge \frac{p_0}{2}\|P_T Q\|_F^2$ for all $Q$, with high probability. Further, by Candès and Recht (2009), $\|P_T P_\Omega\|^2 \le p_0/8$ occurs with high probability. Candès et al. (2011, pp. 33-35) show that there exist dual certificates $(W, F)$ obeying (19) with high probability. For sufficiently large $n$, the conditions on $\lambda$ and $p_0$ in Proposition 9 are fulfilled. Therefore, Theorem 4 follows from Proposition 9.

References

Amir Beck and Marc Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences, 2(1):183-202, 2009.

James Bennett and Stan Lanning. The Netflix prize. In Proceedings of KDD Cup and Workshop, volume 2007, page 35, 2007.

Jian-Feng Cai, Emmanuel J. Candès, and Zuowei Shen. A singular value thresholding algorithm for matrix completion. SIAM Journal on Optimization, 20(4):1956-1982, 2010.

Emmanuel J. Candès and Yaniv Plan. Matrix completion with noise. Proceedings of the IEEE, 98(6):925-936, 2010.

Emmanuel J. Candès and Benjamin Recht. Exact matrix completion via convex optimization. Foundations of Computational Mathematics, 9(6):717-772, 2009.

Emmanuel J. Candès, Xiaodong Li, Yi Ma, and John Wright. Robust principal component analysis? Journal of the ACM, 58(3):Article 11, 2011.

Venkat Chandrasekaran, Sujay Sanghavi, Pablo A. Parrilo, and Alan S. Willsky. Rank-sparsity incoherence for matrix decomposition. SIAM Journal on Optimization, 21(2):572-596, 2011.

Yudong Chen, Huan Xu, Constantine Caramanis, and Sujay Sanghavi. Robust matrix completion and corrupted columns. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pages 873-880, 2011.

David Gross. Recovering low-rank matrices from few coefficients in any basis. IEEE Transactions on Information Theory, 57(3):1548-1566, 2011.

T. Hastie, R. Mazumder, J. Lee, and R. Zadeh. Matrix completion and low-rank SVD via fast alternating least squares. Unpublished manuscript, 2014.

Peter J. Huber and Elvezio M. Ronchetti. Robust Statistics, volume 693. John Wiley & Sons, New Jersey, second edition, 2011.

David R. Hunter and Kenneth Lange. A tutorial on MM algorithms. The American Statistician, 58(1):30-37, 2004.

Juha Karhunen. Robust PCA methods for complete and missing data. Neural Network World, 21(5):357, 2011.

Raghunandan H. Keshavan, Andrea Montanari, and Sewoong Oh. Matrix completion from noisy entries.
Journal of Machine Learning Research, 11:2057-2078, 2010a.

Raghunandan H. Keshavan, Andrea Montanari, and Sewoong Oh. Matrix completion from a few entries. IEEE Transactions on Information Theory, 56(6):2980-2998, 2010b.

Vladimir Koltchinskii, Karim Lounici, and Alexandre B. Tsybakov. Nuclear-norm penalization and optimal rates for noisy low-rank matrix completion. The Annals of Statistics, 39(5):2302-2329, 2011.

Kenneth Lange. Numerical Analysis for Statisticians. Springer, New York, 2010.

Kenneth Lange, David R. Hunter, and Ilsoon Yang. Optimization transfer using surrogate objective functions. Journal of Computational and Graphical Statistics, 9(1):1-20, 2000.

Jaakko Luttinen, Alexander Ilin, and Juha Karhunen. Bayesian robust PCA of incomplete data. Neural Processing Letters, 36(2):189-202, 2012.

Shiqian Ma, Donald Goldfarb, and Lifeng Chen. Fixed point and Bregman iterative methods for matrix rank minimization. Mathematical Programming, 128(1-2):321-353, 2011.

Goran Marjanovic and Victor Solo. On $\ell_q$ optimization and matrix completion. IEEE Transactions on Signal Processing, 60(11):5714-5724, 2012.

Rahul Mazumder, Trevor Hastie, and Robert Tibshirani. Spectral regularization algorithms for learning large incomplete matrices. Journal of Machine Learning Research, 11:2287-2322, 2010.

Andrea Montanari and Sewoong Oh. On positioning via distributed matrix completion. In Sensor Array and Multichannel Signal Processing Workshop (SAM), 2010 IEEE, pages 197-200, 2010.

Yurii Nesterov. Gradient methods for minimizing composite objective function. Technical report, CORE, 2007.

Hee-Seok Oh, Douglas W. Nychka, and Thomas C. M. Lee. The role of pseudo data for robust smoothing with application to wavelet regression. Biometrika, 94(4):893-904, 2007.

Benjamin Recht. A simpler approach to matrix completion. Journal of Machine Learning Research, 12:3413-3430, 2011.

Jasson D. M. Rennie and Nathan Srebro. Fast maximum margin matrix factorization for collaborative prediction. In Proceedings of the 22nd International Conference on Machine Learning, pages 713-719, 2005.

Yiyuan She and Art B. Owen. Outlier detection using nonconvex penalized regression. Journal of the American Statistical Association, 106(494):626-639, 2011.

Nathan Srebro and Tommi Jaakkola. Weighted low-rank approximations. In ICML, volume 3, pages 720-727, 2003.

Kilian Q. Weinberger and Lawrence K. Saul. Unsupervised learning of image manifolds by semidefinite programming. International Journal of Computer Vision, 70(1):77-90, 2006.

Zihan Zhou, Xiaodong Li, John Wright, Emmanuel Candès, and Yi Ma. Stable principal component pursuit. In 2010 IEEE International Symposium on Information Theory Proceedings (ISIT), pages 1518-1522, 2010.