Journal of Machine Learning Research 19 (2018) 1-39. Submitted 8/17; Revised 10/18; Published 11/18

Robust PCA by Manifold Optimization

Teng Zhang (teng.zhang@ucf.edu)
Department of Mathematics, University of Central Florida, 4000 Central Florida Blvd, Orlando, FL 32816, USA

Yi Yang (yi.yang6@mcgill.ca)
Department of Mathematics and Statistics, McGill University, 805 Sherbrooke Street West, Montreal, QC H3A 0B9, Canada

Editor: Michael Mahoney

Abstract

Robust PCA is a widely used statistical procedure to recover an underlying low-rank matrix with grossly corrupted observations. This work considers the problem of robust PCA as a nonconvex optimization problem on the manifold of low-rank matrices and proposes two algorithms based on manifold optimization. It is shown that, with a properly designed initialization, the proposed algorithms are guaranteed to converge to the underlying low-rank matrix linearly. Compared with a previous work based on the factorization of low-rank matrices Yi et al. (2016), the proposed algorithms reduce the theoretical dependence on the condition number of the underlying low-rank matrix. Simulations and real data examples confirm the competitive performance of our method.

Keywords: principal component analysis, low-rank modeling, manifold of low-rank matrices.

1. Introduction

In many problems, the underlying data matrix is assumed to be approximately low-rank. Examples include problems in computer vision Epstein et al. (1995); Ho et al. (2003), machine learning Deerwester et al. (1990), and bioinformatics Price et al. (2006). For such problems, principal component analysis (PCA) is a standard statistical procedure for recovering the underlying low-rank matrix. However, PCA is highly sensitive to outliers in the data, and robust PCA Candès et al. (2011); Chandrasekaran et al. (2011); Clarkson and Woodruff (2013); Frieze et al. (2004); Bhojanapalli et al. (2015); Yi et al.
(2016); Chen and Wainwright (2015); Gu et al. (2016); Cherapanamjeri et al. (2016); Netrapalli et al. (2014) has hence been proposed as a modification to handle grossly corrupted observations. Mathematically, the robust PCA problem is formulated as follows: given a data matrix $Y \in \mathbb{R}^{n_1 \times n_2}$ that can be written as the sum of a low-rank matrix $L^\star$ (signal) and a sparse matrix $S^\star$ (corruption) with only a few nonzero entries, can we recover both components accurately?

(© 2018 Teng Zhang and Yi Yang. License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at http://jmlr.org/papers/v19/17-473.html.)

Robust PCA has found use in many real-life applications, including background detection Li et al. (2004), face recognition Basri and Jacobs (2003), ranking, and collaborative filtering Candès et al. (2011).

Since the set of all low-rank matrices is nonconvex, there is no tractable general-purpose optimization algorithm for the resulting nonconvex problem, and it is generally difficult to obtain an algorithm with a theoretical guarantee. Here we review a few carefully designed algorithms for which guarantees on the recovery of the underlying low-rank matrix exist. The works Candès et al. (2011); Chandrasekaran et al. (2011) consider a convex relaxation of the original problem instead:

$$\min_{L,S}\ \|L\|_* + \|S\|_1, \quad \text{s.t. } Y = L + S, \tag{1}$$

where $\|L\|_*$ represents the nuclear norm (i.e., Schatten 1-norm) of $L$, defined as the sum of its singular values, and $\|S\|_1$ represents the sum of the absolute values of all entries of $S$. Since this problem is convex, (1) can be solved in polynomial time. In addition, it is shown that the solution recovers the correct low-rank matrix when $S$ has at most a $\gamma = O(1/(\mu^2 r))$ fraction of corrupted nonzero entries, where $r$ is the rank of $L$ and $\mu$ is the incoherence level of $L$ Hsu et al. (2011). If the sparsity pattern of $S$ is assumed to be random, then Candès et al.
(2011) shows that the algorithm succeeds with high probability, even when the fraction of corrupted entries is of order $O(1)$, provided the rank satisfies $r = O\big(\min(n_1,n_2)/(\mu \log^2 \max(n_1,n_2))\big)$, where $\mu$ is a coherence parameter of the low-rank matrix $L$ (this work defines $\mu$ slightly differently from Candès et al. (2011) and (16) in this work, but the values are comparable).

However, the aforementioned algorithms based on convex relaxation have a computational complexity of $O(n_1 n_2 \min(n_1,n_2))$ per iteration, which can be prohibitive when $n_1$ and $n_2$ are very large. Alternatively, some faster algorithms have been proposed based on nonconvex optimization. In particular, the work by Kyrillidis and Cevher (2012) proposes a method based on projected gradient descent. However, it assumes that the sparsity pattern of $S$ is random, and the algorithm still has the same computational complexity as the convex methods. Netrapalli et al. (2014) proposes a method based on alternating projections, which allows $\gamma \lesssim 1/(\mu^2 r)$, with a computational complexity of $O(r^2 n_1 n_2)$ per iteration. Chen and Wainwright (2015) assumes that $L^\star$ is positive semidefinite and applies gradient descent to the Cholesky decomposition factor of $L^\star$, but the positive semidefinite assumption is not satisfied in many applications. Gu et al. (2016) factorizes $L$ into the product of two matrices and performs alternating minimization over both matrices. It shows that the algorithm allows $\gamma = O\big(1/(\mu^{2/3} r^{2/3} \min(n_1,n_2))\big)$ and has a complexity of $O(r^2 n_1 n_2)$ per iteration. Yi et al. (2016) applies a similar factorization together with an alternating gradient descent algorithm, with a complexity of $O(r n_1 n_2)$ per iteration, and allows $\gamma = O\big(1/(\kappa^2 \mu r^{3/2})\big)$, where $\kappa$ is the condition number of the underlying low-rank matrix. There is another line of work that further reduces the complexity of the algorithm by subsampling the entries of the observation matrix $Y$, including Mackey et al.
(2011); Li and Haupt (2015); Rahmani and Atia (2017); Cherapanamjeri et al. (2016) and (Yi et al., 2016, Algorithm 2); this setting will also be discussed in this paper as the partially observed case.

The common idea shared by Gu et al. (2016) and Yi et al. (2016) is as follows. Since any matrix $L \in \mathbb{R}^{n_1\times n_2}$ with rank $r$ can be written as the product of two low-rank matrices, $L = UV^T$ with $U \in \mathbb{R}^{n_1\times r}$ and $V \in \mathbb{R}^{n_2\times r}$, we can optimize over the pair $(U, V)$ instead of $L$, and a smaller computational cost is expected since $(U, V)$ has $(n_1+n_2)r$ parameters, which is smaller than $n_1 n_2$, the number of parameters in $L$. In fact, this re-parametrization technique has a long history Ruhe (1974), and was popularized by Burer and Monteiro (2003, 2005) for solving semidefinite programs (SDPs). The same idea has been used in other low-rank matrix estimation problems such as dictionary learning Sun et al. (2017), phase synchronization Boumal (2016), community detection Bandeira et al. (2016), matrix completion Jain et al. (2013), recovery of a matrix from linear measurements Tu et al. (2016), and even general problems Chen and Wainwright (2015); Wang et al. (2017); Park et al. (2016); Park et al. (2017). In addition, the properties of the associated stochastic gradient descent algorithm are studied in De Sa et al. (2015).

The main contribution of this work is a novel robust PCA algorithm based on gradient descent on the manifold of low-rank matrices, with a theoretical guarantee of exact recovery of the underlying low-rank matrix. Compared with Yi et al. (2016), the proposed algorithm utilizes tools from manifold optimization, which leads to a simpler and more naturally structured algorithm with a stronger theoretical guarantee. In particular, with a proper initialization, our method can still succeed with $\gamma = O\big(1/(\kappa\mu r^{3/2})\big)$, which means that it can tolerate more corruption than Yi et al.
(2016) by a factor of $\kappa$. In addition, the theoretical convergence rate is faster than that of Yi et al. (2016) by a factor of $\kappa$. Simulations also verify the advantage of the proposed algorithm over Yi et al. (2016). We remark that while manifold optimization has been applied to robust PCA in Cambier and Absil (2016), our work studies a different algorithm and gives theoretical guarantees. Considering the popularity of methods based on the factorization of low-rank matrices, it is expected that manifold optimization could be applied to other low-rank matrix estimation problems as well. In addition, we implement our method in an efficient and user-friendly R package, morpca, which is available at https://github.com/emeryyi/morpca.

The paper is organized as follows. We first present the algorithms in Section 2 and explain how they are derived in Section 3. Their theoretical properties are studied and compared with previous algorithms in Section 4. In Section 5, simulations and a real data analysis on the Shoppingmall dataset show that the proposed algorithms are competitive in many scenarios and have superior performance to the algorithm based on matrix factorization. A discussion of the proposed algorithms is then presented in Section 6, followed by the proofs of the results in the Appendix.

2. Algorithm

In this work, we consider the robust PCA problem in two settings: the fully observed setting and the partially observed setting. The problem under the fully observed setting is formulated as follows: given $Y = L^\star + S^\star$, where $L^\star$ is a low-rank matrix and $S^\star$ is a sparse matrix, can we recover $L^\star$ from $Y$? To recover $L^\star$, we solve the following optimization problem:

$$\hat L = \arg\min_{\mathrm{rank}(L)=r} f(L), \quad \text{where } f(L) = \frac{1}{2}\|F(L - Y)\|_F^2, \tag{2}$$

where $F : \mathbb{R}^{n_1\times n_2} \to \mathbb{R}^{n_1\times n_2}$ is the hard thresholding procedure defined in (3):

$$F(A)_{ij} = \begin{cases} 0, & \text{if } |A_{ij}| > |A_{i,\cdot}|_{[\gamma]} \text{ and } |A_{ij}| > |A_{\cdot,j}|_{[\gamma]}, \\ A_{ij}, & \text{otherwise.} \end{cases}$$
(3)

Here $A_{i,\cdot}$ represents the $i$-th row of the matrix $A$, and $A_{\cdot,j}$ represents the $j$-th column of $A$. $|A_{i,\cdot}|_{[\gamma]}$ and $|A_{\cdot,j}|_{[\gamma]}$ denote the $(1-\gamma)$-th percentiles of the absolute values of the entries of $A_{i,\cdot}$ and $A_{\cdot,j}$, for $\gamma \in [0,1)$. In other words, the removed entries are those that are simultaneously among the largest $\gamma$-fraction, in absolute value, of both their row and their column of $A$. The threshold $\gamma$ is set by the user. If some entries of $A_{i,\cdot}$ or $A_{\cdot,j}$ have identical absolute values, ties can be broken arbitrarily.

The motivation is that if $S^\star$ is sparse, in the sense that the fraction of nonzero entries in each row and each column is smaller than $\gamma$, then $F(L^\star - Y) = F(-S^\star)$ is zero by definition, and thus $f(L^\star) = 0$. Since $f$ is nonnegative, $L^\star$ is a solution of (2). To solve (2), we propose Algorithm 1, based on manifold optimization, with its derivation deferred to Section 3.3.1.

Algorithm 1: Gradient descent on the manifold under the fully observed setting.
Input: observation $Y \in \mathbb{R}^{n_1\times n_2}$; rank $r$; thresholding value $\gamma$; step size $\eta$.
Initialization: set $k = 0$; initialize $L^{(0)}$ using the rank-$r$ approximation of $F(Y)$.
Loop: iterate Steps 1-4 until convergence.
1: Let $L^{(k)} = U^{(k)}\Sigma^{(k)}V^{(k)T}$.
2: Let $D^{(k)} = F(L^{(k)} - Y)$.
3(a): (Option 1) Let $\Omega^{(k)} = U^{(k)}U^{(k)T}D^{(k)} + D^{(k)}V^{(k)}V^{(k)T} - U^{(k)}U^{(k)T}D^{(k)}V^{(k)}V^{(k)T}$, and let $U^{(k+1)} \in \mathbb{R}^{n_1\times r}$, $\Sigma^{(k+1)} \in \mathbb{R}^{r\times r}$, and $V^{(k+1)} \in \mathbb{R}^{n_2\times r}$ be the matrices consisting of the top $r$ left singular vectors / singular values / right singular vectors of $L^{(k)} - \eta\,\Omega^{(k)}$.
3(b): (Option 2) Let $Q_1 R_1$ be the QR decomposition of $(L^{(k)} - \eta D^{(k)})^T U^{(k)}$ and $Q_2 R_2$ be the QR decomposition of $(L^{(k)} - \eta D^{(k)})V^{(k)}$. Then $U^{(k+1)} = Q_2$, $V^{(k+1)} = Q_1$, and $\Sigma^{(k+1)} = R_2\big[U^{(k)T}(L^{(k)} - \eta D^{(k)})V^{(k)}\big]^{-1}R_1^T$.
4: $k := k + 1$.
Output: estimate of the low-rank matrix $L^\star$, given by $\lim_{k\to\infty} L^{(k)}$.

Under the partially observed setting, in addition to the gross corruption $S^\star$, the observed matrix $Y$ has a large number of missing values; that is, many entries of $Y$ are not observed.
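Before turning to the partially observed setting, we note that the thresholding operator $F$ in (3) can be sketched in a few lines of numpy (an illustrative implementation under our own naming, not the code of the morpca package):

```python
import numpy as np

def hard_threshold(A, gamma):
    """Sketch of the operator F in (3): zero every entry of A that is
    simultaneously among the largest gamma-fraction, in absolute value,
    of both its row and its column; keep all other entries."""
    absA = np.abs(A)
    # (1 - gamma)-th percentile of the absolute values, per row and per column
    row_q = np.quantile(absA, 1 - gamma, axis=1, keepdims=True)
    col_q = np.quantile(absA, 1 - gamma, axis=0, keepdims=True)
    return np.where((absA > row_q) & (absA > col_q), 0.0, A)
```

For instance, with $\gamma = 0.3$ a single gross outlier in an otherwise flat matrix is removed, while the remaining entries pass through unchanged.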
We denote the set of all observed entries by $\Phi = \{(i,j) : Y_{ij} \text{ is observed}\}$, and define $\tilde F : \mathbb{R}^{n_1\times n_2} \to \mathbb{R}^{n_1\times n_2}$ by

$$\tilde F(A)_{ij} = \begin{cases} 0, & \text{if } |A_{ij}| > |A_{i,\cdot}|_{[\gamma,\Phi]} \text{ and } |A_{ij}| > |A_{\cdot,j}|_{[\gamma,\Phi]}, \\ A_{ij}, & \text{otherwise.} \end{cases} \tag{4}$$

Here $|A_{i,\cdot}|_{[\gamma,\Phi]}$ and $|A_{\cdot,j}|_{[\gamma,\Phi]}$ represent the $(1-\gamma)$-th percentiles of the absolute values of the observed entries of $A_{i,\cdot}$ and $A_{\cdot,j}$, respectively. As a generalization of Algorithm 1, we propose to solve

$$\hat L = \arg\min_{\mathrm{rank}(L)=r} \tilde f(L), \quad \tilde f(L) = \frac{1}{2}\sum_{(i,j)\in\Phi} \tilde F(L - Y)_{ij}^2, \tag{5}$$

which is similar to (2) except that only the observed entries are considered. The implementation is presented in Algorithm 2, and its derivation is deferred to Section 3.3.2.

Algorithm 2: Gradient descent on the manifold under the partially observed setting.
Input: observation $Y \in \mathbb{R}^{n_1\times n_2}$; set $\Phi$ of all observed entries; rank $r$; thresholding value $\gamma$; step size $\eta$.
Initialization: set $k = 0$; initialize $L^{(0)}$ using the rank-$r$ approximation of $\tilde F(Y)$.
Loop: iterate Steps 1-4 until convergence.
1: Let $L^{(k)}$ be the sparse matrix with support $\Phi$ whose nonzero entries are the corresponding entries of $U^{(k)}\Sigma^{(k)}V^{(k)T}$.
2: Let $D^{(k)} = \tilde F(L^{(k)} - Y)$.
3(a): (Option 1) Let $\Omega^{(k)} = U^{(k)}U^{(k)T}D^{(k)} + D^{(k)}V^{(k)}V^{(k)T} - U^{(k)}U^{(k)T}D^{(k)}V^{(k)}V^{(k)T}$, and let $U^{(k+1)} \in \mathbb{R}^{n_1\times r}$, $\Sigma^{(k+1)} \in \mathbb{R}^{r\times r}$, and $V^{(k+1)} \in \mathbb{R}^{n_2\times r}$ be the matrices consisting of the top $r$ left singular vectors / singular values / right singular vectors of $L^{(k)} - \eta\,\Omega^{(k)}$.
3(b): (Option 2) Let $Q_1 R_1$ be the QR decomposition of $(L^{(k)} - \eta D^{(k)})^T U^{(k)}$ and $Q_2 R_2$ be the QR decomposition of $(L^{(k)} - \eta D^{(k)})V^{(k)}$. Then $U^{(k+1)} = Q_2$, $V^{(k+1)} = Q_1$, and $\Sigma^{(k+1)} = R_2\big[U^{(k)T}(L^{(k)} - \eta D^{(k)})V^{(k)}\big]^{-1}R_1^T$.
4: $k := k + 1$.
Output: estimate of the low-rank matrix $L^\star$, given by $\lim_{k\to\infty} L^{(k)}$.

For Algorithm 1, the memory usage is $O(n_1 n_2)$, due to the storage of $Y$. For Algorithm 2, storing $Y$ and $L^{(k)}$ requires $O(|\Phi|)$ memory and storing $U^{(k)}$ and $V^{(k)}$ requires $O(r(n_1+n_2))$. Adding these together, the memory usage is $O(|\Phi| + r(n_1+n_2))$.
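Step 1 of Algorithm 2 and the $O(|\Phi| + r(n_1+n_2))$ memory bound rely on never forming $U^{(k)}\Sigma^{(k)}V^{(k)T}$ in full; the entries on the support $\Phi$ can be computed directly, as in the following sketch (a helper of our own naming, assuming $\Phi$ is given as parallel index arrays):

```python
import numpy as np

def low_rank_entries(U, s, V, rows, cols):
    """Entries (rows[t], cols[t]) of U @ diag(s) @ V.T, computed without
    forming the full n1-by-n2 matrix: O(|Phi| * r) time and memory."""
    return np.sum(U[rows] * s * V[cols], axis=1)
```

Each requested entry costs one length-$r$ inner product, which is what keeps the per-iteration cost of Option 2 at $O(r^2(n_1+n_2) + r|\Phi|)$.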
For both Algorithm 1 and Algorithm 2 with Option 1, the singular value decomposition is the most computationally intensive step; as a result, the complexity per iteration is $O(r n_1 n_2)$. For Algorithm 1 and Algorithm 2 with Option 2, the computational complexities per iteration are of order $O(r n_1 n_2)$ and $O(r^2(n_1+n_2) + r|\Phi|)$, respectively.

3. Derivation of the Proposed Algorithms

This section gives the derivations of Algorithms 1 and 2. Since they are derived from manifold optimization, we first review manifold optimization in Section 3.1 and the geometry of the manifold of low-rank matrices in Section 3.2.

3.1. Manifold optimization

The purpose of this section is to review the framework of gradient descent on manifolds. It mostly summarizes the framework used in Vandereycken (2013); Shalit et al. (2012); Absil et al. (2009), and we refer readers to these works for more details. Given a smooth manifold $\mathcal M \subset \mathbb{R}^n$ and a differentiable function $f : \mathcal M \to \mathbb{R}$, the gradient descent procedure for solving $\min_{x\in\mathcal M} f(x)$ is as follows:

Step 1. Consider $f(x)$ as a differentiable function from $\mathbb{R}^n$ to $\mathbb{R}$ and calculate the Euclidean gradient $\nabla f(x)$.

Figure 1: Visualization of gradient descent algorithms on the manifold $\mathcal M$. The black solid line is the Euclidean gradient. The blue solid line is the projection of the Euclidean gradient onto the tangent space. The red solid line represents the orthographic retraction, while the red dashed line represents the projective retraction.

Step 2. Calculate the Riemannian gradient, which is the direction of steepest ascent of $f(x)$ among all directions in the tangent space $T_x\mathcal M$. This direction is given by $P_{T_x\mathcal M}\nabla f(x)$, where $P_{T_x\mathcal M}$ is the projection operator onto the tangent space $T_x\mathcal M$.

Step 3. Define a retraction $R_x$ that maps the tangent space back to the manifold, i.e., $R_x : T_x\mathcal M \to \mathcal M$, where $R_x$ needs to satisfy the conditions in (Vandereycken, 2013, Definition 2.2).
In particular, $R_x(0) = x$, $R_x(y) = x + y + O(\|y\|^2)$ as $y \to 0$, and $R_x$ needs to be smooth. The gradient descent update $x^+$ is then defined by

$$x^+ = R_x\big(-\eta\, P_{T_x\mathcal M}\nabla f(x)\big), \tag{6}$$

where $\eta$ is the step size. We remark that in differential geometry the standard retraction is the exponential map from the tangent space to the manifold; however, in this work (as in many works on manifold optimization) the term represents a generic mapping from the tangent space to the manifold. As a result, the definition of the retraction is not unique in this work. In Figure 1, we visualize the gradient descent method on the manifold $\mathcal M$ with two different kinds of retractions (orthographic and projective). We will discuss the details of these two retractions in Section 3.2.

3.2. The geometry of the manifold of low-rank matrices

To apply the gradient descent algorithm of Section 3.1 to the manifold of low-rank matrices, the projection $P_{T_X\mathcal M}$ and the retraction $R_X$ need to be defined. In this section, we let $\mathcal M$ be the manifold of all $\mathbb{R}^{n_1\times n_2}$ matrices with rank $r$, let $X \in \mathcal M$ be a matrix of rank $r$, and derive explicit expressions for $P_{T_X\mathcal M}$ and $R_X$. The tangent space $T_X\mathcal M$ and the retraction $R_X$ of the manifold of low-rank matrices have been well studied Absil and Oseledets (2015): assuming the SVD of $X$ is $X = U\Sigma V^T$, the tangent space $T_X\mathcal M$ can be written as

$$T_X\mathcal M = \{AVV^T + UU^T B : A, B \in \mathbb{R}^{n_1\times n_2}\}$$

according to Absil and Oseledets (2015). The explicit formula for the projection $P_{T_X\mathcal M}$ is given in (Absil and Oseledets, 2015, (9)):

$$P_{T_X\mathcal M}(D) = UU^T D + DVV^T - UU^T DVV^T, \quad D \in \mathbb{R}^{n_1\times n_2}. \tag{7}$$

For completeness, a proof of (7) is presented in the Appendix. There are various ways of defining retractions for the manifold of low-rank matrices, and we refer the reader to Absil and Oseledets (2015) for more details. In this work, we consider two types of retractions. One is called the projective retraction Shalit et al.
(2012); Vandereycken (2013): given any $\delta \in T_X\mathcal M$, the retraction is defined as the nearest rank-$r$ matrix to $X + \delta$ in Frobenius norm:

$$R^{(1)}_X(\delta) = \arg\min_{Z\in\mathcal M} \|X + \delta - Z\|_F. \tag{8}$$

The solution is the rank-$r$ approximation of $X + \delta$ (for any matrix $W$, its rank-$r$ approximation is given by $\sum_{i=1}^r \sigma_i u_i v_i^T$, where $\sigma_i, u_i, v_i$ are the ordered singular values and singular vectors of $W$). In order to further improve computational efficiency, we also consider the orthographic retraction Absil and Oseledets (2015). Denoted by $R^{(2)}_X(\delta)$, it is the nearest rank-$r$ matrix to $X + \delta$ such that their difference is orthogonal to the tangent space $T_X\mathcal M$:

$$R^{(2)}_X(\delta) = \arg\min_{Z\in\mathcal M} \|X + \delta - Z\|_F, \quad \text{s.t. } \big\langle R^{(2)}_X(\delta) - (X+\delta),\, Z\big\rangle_F = 0 \ \text{for all } Z \in T_X\mathcal M, \tag{9}$$

and the explicit solution of (9), given in (Absil and Oseledets, 2015, Section 3.2), is

$$R^{(2)}_X(\delta) = (X+\delta)V\big[U^T(X+\delta)V\big]^{-1}U^T(X+\delta); \tag{10}$$

a proof is given in the Appendix.

3.3. Derivation of the proposed algorithms

3.3.1. Derivation of Algorithm 1

The gradient descent iteration (6) for solving (2) can be written as

$$L^{(k+1)} = R_{L^{(k)}}\big(-\eta\, P_{T_{L^{(k)}}\mathcal M}\nabla f(L^{(k)})\big), \tag{11}$$

where $P_{T_{L^{(k)}}\mathcal M}$ is defined in (7) and $R_{L^{(k)}}$ is defined in (8) or (10). To derive the explicit algorithm, it remains to find the gradient $\nabla f$. If the absolute values of all entries of $A = L - Y$ are distinct, then we have

$$\nabla f(L) = F(L - Y). \tag{12}$$

The proof of (12) is deferred to the Appendix. When some entries of $A$ have equal absolute values and there is a tie in evaluating $F(L - Y)$, the objective function may be non-differentiable. However, it can be shown that by breaking the tie arbitrarily, $F(L - Y)$ is a subgradient of $f(L)$. The corresponding gradient descent method (or subgradient method when $f$ is not differentiable) with the projective retraction can be written as

$$L^{(k+1)} := \text{rank-}r\text{ approximation of } \Big[L^{(k)} - \eta\, P_{T_{L^{(k)}}\mathcal M} F(L^{(k)} - Y)\Big], \tag{13}$$

where the rank-$r$ approximation has been defined after (8). This leads to Algorithm 1 with Option 1.
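The full iteration (13) can be sketched compactly in numpy. The following is a minimal illustration of Algorithm 1 with Option 1, under our own naming and default parameters; it is not the morpca implementation, and the fixed step size is chosen for illustration only:

```python
import numpy as np

def threshold(A, gamma):
    # Operator F in (3): zero the entries that are simultaneously among the
    # largest gamma-fraction, in absolute value, of their row and their column.
    absA = np.abs(A)
    row_q = np.quantile(absA, 1 - gamma, axis=1, keepdims=True)
    col_q = np.quantile(absA, 1 - gamma, axis=0, keepdims=True)
    return np.where((absA > row_q) & (absA > col_q), 0.0, A)

def rank_r_approx(W, r):
    # Projective retraction target: best rank-r approximation in Frobenius norm.
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    return (U[:, :r] * s[:r]) @ Vt[:r]

def robust_pca_manifold(Y, r, gamma, eta=0.5, n_iter=300):
    L = rank_r_approx(threshold(Y, gamma), r)      # initialization
    for _ in range(n_iter):
        U, s, Vt = np.linalg.svd(L, full_matrices=False)
        U, V = U[:, :r], Vt[:r].T
        D = threshold(L - Y, gamma)                # (sub)gradient of f, (12)
        # Riemannian gradient: projection onto the tangent space, (7)
        Omega = U @ (U.T @ D) + (D @ V) @ V.T - U @ ((U.T @ D) @ V) @ V.T
        L = rank_r_approx(L - eta * Omega, r)      # projective retraction, (13)
    return L
```

On a synthetic rank-one matrix with a few gross corruptions per row and column (below the threshold $\gamma$), this sketch recovers the low-rank component to high accuracy.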
For the orthographic retraction, i.e., $R_{L^{(k)}}$ defined according to (10), writing $D = F(L^{(k)} - Y)$, the update formula (11) can be simplified to

$$L^{(k+1)} := (L^{(k)} - \eta D)V^{(k)}\big[U^{(k)T}(L^{(k)} - \eta D)V^{(k)}\big]^{-1}U^{(k)T}(L^{(k)} - \eta D), \tag{14}$$

where $U^{(k)} \in \mathbb{R}^{n_1\times r}$ is any matrix whose column space is the same as the column space of $L^{(k)}$, and $V^{(k)} \in \mathbb{R}^{n_2\times r}$ is any matrix whose column space is the same as the row space of $L^{(k)}$. The derivation of (14) is deferred to the Appendix, and it can be shown that the implementation of (14) leads to Algorithm 1 with Option 2.

3.3.2. Derivation of Algorithm 2

By an argument similar to that of the previous section, we can conclude that when all entries of $|L - Y|$ are distinct, applying the same procedure used to derive (12) gives $\nabla\tilde f(L) = \tilde F(L - Y)$; and $\tilde F(L - Y)$ is a subgradient when $\tilde f(L)$ is not differentiable. Based on this observation, the algorithm under the partially observed setting is identical to (13) or (14), with $F$ replaced by $\tilde F$. This gives the implementation of Algorithm 2.

3.3.3. Basic convergence properties of Algorithms 1 and 2

A natural question is whether we can still expect the algorithms to have reasonable basic properties, such as convergence to a critical point. Unfortunately, such a theoretical guarantee is impossible if a fixed step size $\eta$ is chosen: in general, the subgradient method with a fixed step size does not have a convergence guarantee when the objective function is non-differentiable. However, if we choose the step size by line search, then any accumulation point $\hat L$ of $\{L^{(k)}\}$ has the property that either the objective function is not differentiable at $\hat L$, or $\hat L$ is a critical point in the sense that its Riemannian gradient is zero.
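One step of the orthographic update (14) can be sketched as follows (a numpy illustration with our own naming; here $U$ and $V$ are orthonormal bases of the column and row spaces of $L$, and $D = F(L - Y)$):

```python
import numpy as np

def orthographic_update(L, U, V, D, eta):
    """Update (14): with M = L - eta * D, return M V [U^T M V]^{-1} U^T M.
    The result has rank at most r, realizing Option 2's retraction."""
    M = L - eta * D
    core = U.T @ M @ V                      # r-by-r matrix, assumed invertible
    return (M @ V) @ np.linalg.solve(core, U.T @ M)
```

Note that when $D = 0$ the update returns $L$ itself, consistent with the retraction property $R_X(0) = X$; the cost is dominated by a few thin matrix products rather than a full SVD, which is the point of Option 2.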
For example, the line search strategy for Algorithm 1 can be described as follows: start the step size $\eta_k$ at a relatively large value, and repeatedly shrink it by a factor of $\beta \in (0,1)$ until the following condition is satisfied: for $L^{(k+1)} = R_{L^{(k)}}\big(-\eta_k P_{T_{L^{(k)}}\mathcal M}\nabla f(L^{(k)})\big)$,

$$f(L^{(k)}) - f(L^{(k+1)}) > c\,\eta_k\,\big\|P_{T_{L^{(k)}}\mathcal M}\nabla f(L^{(k)})\big\|^2,$$

where $c \in (0,1)$ is prespecified. The argument for convergence follows the same argument as the proof of (Absil et al., 2009, Theorem 4.3.1).

3.4. Prior works on manifold optimization

The idea of optimization on manifolds has been well investigated in the literature Vandereycken (2013); Shalit et al. (2012); Absil et al. (2009). For example, Absil et al. (2009) give a summary of many advances in the field of optimization on manifolds. Manifold optimization has been applied to many matrix estimation problems, including recovering a low-rank matrix from a subset of its entries, i.e., matrix completion Keshavan et al. (2010); Vandereycken (2013); Wei et al. (2016), and robust matrix completion Cambier and Absil (2016). In fact, the problem studied in this work can be reformulated as the problem analyzed in Cambier and Absil (2016); in comparison, our work studies a different algorithm and gives additional theoretical guarantees. In another respect, while Wei et al. (2016) studies matrix completion, it shares some similarities with this work: both works study manifold optimization algorithms and have theoretical guarantees showing that the proposed algorithms recover the underlying low-rank matrix exactly. In fact, Wei et al. (2016) can be considered as our problem under the partially observed setting, without the corruption $S^\star$. It proposes to solve

$$\arg\min_{L\in\mathbb{R}^{n_1\times n_2},\ \mathrm{rank}(L)=r}\ \sum_{(i,j)\in\Phi}(Y_{ij} - L_{ij})^2,$$

which can be considered as $\tilde f$ in (5) when $\gamma = 0$.

4. Theoretical Analysis

In this section, we analyze the theoretical properties of Algorithms 1 and 2 and compare them with previous algorithms.
Since the goal is to recover the low-rank matrix $L^\star$ and the sparse matrix $S^\star$ from $Y = L^\star + S^\star$, to avoid identifiability issues we need to assume that $L^\star$ cannot be both low-rank and sparse. Specifically, we make the following standard assumptions on $L^\star$ and $S^\star$.

Assumption 1. Each row of $S^\star$ contains at most $\gamma^\star n_2$ nonzero entries and each column of $S^\star$ contains at most $\gamma^\star n_1$ nonzero entries. In other words, for $\gamma^\star \in [0,1)$, assume $S^\star \in \mathcal S_{\gamma^\star}$, where

$$\mathcal S_{\gamma^\star} := \big\{A \in \mathbb{R}^{n_1\times n_2} : \|A_{i,\cdot}\|_0 \le \gamma^\star n_2 \text{ for } 1\le i\le n_1;\ \|A_{\cdot,j}\|_0 \le \gamma^\star n_1 \text{ for } 1\le j\le n_2\big\}. \tag{15}$$

Assumption 2. The low-rank matrix $L^\star$ is not near-sparse. To achieve this, we require that $L^\star$ be $\mu$-coherent: given the singular value decomposition (SVD) $L^\star = U^\star\Sigma^\star V^{\star T}$, where $U^\star \in \mathbb{R}^{n_1\times r}$ and $V^\star \in \mathbb{R}^{n_2\times r}$, we assume there exists an incoherence parameter $\mu$ such that

$$\|U^\star\|_{2,\infty} \le \sqrt{\frac{\mu r}{n_1}}, \qquad \|V^\star\|_{2,\infty} \le \sqrt{\frac{\mu r}{n_2}}, \tag{16}$$

where the norm $\|\cdot\|_{2,\infty}$ is defined by $\|A\|_{2,\infty} = \max_{\|z\|_2=1}\|Az\|_\infty$ and $\|x\|_\infty = \max_i |x_i|$.

4.1. Analysis of Algorithm 1

With Assumptions 1 and 2, we have the following theoretical results regarding the convergence rate, initialization, and stability of Algorithm 1.

Theorem 1 (Linear convergence rate, fully observed case) Suppose that $\|L^{(0)} - L^\star\|_F \le a\,\sigma_r(L^\star)$, where $\sigma_r(L^\star)$ is the $r$-th largest singular value of $L^\star$, $a \le 1/2$, $\gamma > 2\gamma^\star$, and

$$C_1 := 4(\gamma + 2\gamma^\star)\mu r + \sqrt{\frac{4\gamma^\star}{\gamma - \gamma^\star}} + a^2 < \frac{1}{2}.$$

Then there exists $\eta_0 = \eta_0(C_1, a) > 0$ that does not depend on $k$, such that for all $\eta \le \eta_0$,

$$\|L^{(k)} - L^\star\|_F \le \Big(1 - \frac{1 - 2C_1}{8}\,\eta\Big)^k\,\|L^{(0)} - L^\star\|_F.$$

Remark 2 (Choices of parameters) It is shown in the proof that $\eta_0$ can be set to the solution of an equation whose left-hand side, $\eta_0(1+C_1)^2/\big(1 - \eta_0(1+C_1)a\big)$, is an increasing function of $\eta_0$ that equals zero at $\eta_0 = 0$, and whose right-hand side is a positive number; hence a positive solution exists. The condition $C_1 < 1/2$ requires $\sqrt{4\gamma^\star/(\gamma - \gamma^\star)} < 1/2$, i.e., $\gamma > 17\gamma^\star$. In practice a much smaller $\gamma$ can be used: in Section 5, $\gamma = 1.5\gamma^\star$ is used and works well for a large number of examples. This suggests that some constants in Theorem 1 might be due to technicalities in the proof and can potentially be improved.
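For completeness, the step in Remark 2 from the requirement $\sqrt{4\gamma^\star/(\gamma-\gamma^\star)} < 1/2$ to the condition $\gamma > 17\gamma^\star$ is elementary:

$$\sqrt{\frac{4\gamma^\star}{\gamma - \gamma^\star}} < \frac{1}{2} \iff \frac{4\gamma^\star}{\gamma - \gamma^\star} < \frac{1}{4} \iff 16\gamma^\star < \gamma - \gamma^\star \iff \gamma > 17\gamma^\star.$$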
Remark 3 (Simplified choices of parameters) There exist $c_1$ and $c_2$ such that if $a < c_1$, $\gamma^\star\mu r < c_2$, and $\gamma = 65\gamma^\star$, then one can choose $\eta_0 = 1/8$. In this sense, if the initialization of the algorithm is good, then the algorithm can handle $\gamma^\star$ as large as $O(1/(\mu r))$. In addition, it requires $O(\log(1/\epsilon))$ iterations to achieve $\|L^{(k)} - L^\star\|_F / \|L^{(0)} - L^\star\|_F < \epsilon$.

Since the statements require proper initializations (i.e., small $a$), the question arises of how to choose them. The work by Yi et al. (2016) shows that if the rank-$r$ approximation of $F(Y)$ is used as the initialization $L^{(0)}$, then an upper bound on $\|L^{(0)} - L^\star\|$ holds according to the proofs of (Yi et al., 2016, Theorems 1 and 3) (we borrow this estimate along with the fact that $\|L^{(0)} - L^\star\|_F \le \sqrt{2r}\,\|L^{(0)} - L^\star\|$).

Theorem 4 (Initialization, fully observed case) If $\gamma > \gamma^\star$ and we initialize $L^{(0)}$ using the rank-$r$ approximation of $F(Y)$, then

$$\|L^{(0)} - L^\star\|_F \le 8\gamma\mu r^{3/2}\,\sigma_1(L^\star).$$

The combination of Theorems 1 and 4 and the fact that $\gamma = O(\gamma^\star)$ implies that, under the fully observed setting, the tolerance of the proposed algorithm to corruption is at most $\gamma^\star = O\big(\tfrac{1}{\mu r^{3/2}\kappa}\big)$, where $\kappa = \sigma_1(L^\star)/\sigma_r(L^\star)$ is the condition number of $L^\star$. We also study the stability of Algorithm 1 in the following statement.

Theorem 5 (Stability, fully observed case) Let $L$ be the current value, and let $L^+$ be the next update obtained by applying Algorithm 1 to $L$ for one iteration. Assume $Y = L^\star + S^\star + N^\star$, where the entries of $N^\star$ are i.i.d.
sampled from $N(0,\sigma^2)$, $\gamma > 10\gamma^\star$, and $(\gamma + 2\gamma^\star)\mu r < 1/64$. Then there exist $C, a, c, \eta_0 > 0$ such that when $\eta < \eta_0$,

$$\Pr\Big(\|L^+ - L^\star\|_F \le (1 - c\eta)\,\|L - L^\star\|_F \ \text{for all } L \in \Gamma\Big) \to 1, \quad \text{as } n_1, n_2 \to \infty, \tag{17}$$

where

$$\Gamma = \Big\{L \in \mathbb{R}^{n_1\times n_2} : \mathrm{rank}(L) = r,\ C\sigma\sqrt{(n_1+n_2)\,r\ln(n_1 n_2)} \le \|L - L^\star\|_F \le a\,\sigma_r(L^\star)\Big\}.$$

Since $1 - c\eta < 1$, Theorem 5 shows that when the observation $Y$ is contaminated with random Gaussian noise, if $L^{(0)}$ is properly initialized so that $\|L^{(0)} - L^\star\|_F < a\,\sigma_r(L^\star)$, Algorithm 1 converges to the neighborhood of $L^\star$ given by $\{L : \|L - L^\star\|_F \le C\sigma\sqrt{(n_1+n_2)\,r\ln(n_1 n_2)}\}$ in

$$\Big[\log\big(\|L^{(0)} - L^\star\|_F\big) - \log\big(C\sigma\sqrt{(n_1+n_2)\,r\ln(n_1 n_2)}\big)\Big]\Big/\big[-\log(1 - c\eta)\big]$$

iterations, with probability tending to 1 as $n_1, n_2 \to \infty$.

4.2. Analysis of Algorithm 2

For the partially observed setting, we assume that each entry of $Y = L^\star + S^\star$ is observed with probability $p$; that is, for any $1\le i\le n_1$ and $1\le j\le n_2$, $\Pr\big((i,j)\in\Phi\big) = p$. Then we have the following statement on convergence.

Theorem 6 (Linear convergence rate, partially observed case) There exists $c > 0$ such that for $n = \max(n_1,n_2)$, if

$$p \ge \max\Big(\frac{c\,\mu r\log n}{\epsilon^2\min(n_1,n_2)},\ \frac{56\log n}{3\gamma\min(n_1,n_2)}\Big),$$

then with probability at least $1 - 2n^{-3} - 6n^{-1}$,

$$\frac{\|L^{(k)} - L^\star\|_F}{\|L^{(0)} - L^\star\|_F} \le \Bigg[\sqrt{1 - 2\eta\,p^2(1-\epsilon)^2\Big(1 - \tilde C_1 - \frac{ap(1+\epsilon)}{2(1-a)}(1+\tilde C_1)\Big) + \eta^2(1+\tilde C_1)^2} + \frac{\eta^2 a^2 p^2(1+\epsilon)^2(1+\tilde C_1)^2}{1 - \eta a p(1+\epsilon)(1+\tilde C_1)}\Bigg]^k \tag{18}$$

for

$$\tilde C_1 = \frac{1}{p(1-\epsilon)}\Big[6(\gamma + 2\gamma^\star)p\mu r + \sqrt{\tfrac{4}{3}\gamma}\,\big(p(1+\epsilon) + \tfrac{a}{2}\big)^2 + a^2\Big].$$

Remark 7 (Choice of parameters) Note that when $\eta$ is small, the right-hand side of (18) is of order

$$1 - p^2(1-\epsilon)^2\Big(1 - \tilde C_1 - \frac{ap(1+\epsilon)}{2(1-a)}(1+\tilde C_1)\Big)\eta + O(\eta^2).$$

As a result, to ensure that the right-hand side of (18) is smaller than 1 for small $\eta$, we assume that

$$1 - \tilde C_1 - \frac{ap(1+\epsilon)}{2(1-a)}(1+\tilde C_1) > 0. \tag{19}$$

For example, when $ap(1+\epsilon) = 4(1-a)$, it requires that $\tilde C_1 < 1/3$. If (19) holds, then there exists $\eta_0 = \eta_0(\tilde C_1, p, \epsilon, a)$ such that for all $\eta \le \eta_0$, the right-hand side of (18) is smaller than 1. The practical choices of $\eta$ and $\gamma$ will be discussed in Section 5.

Remark 8 (Simplified choice of parameters) There exist $\{c_i\}_{i=1}^4 > 0$ such that when $\epsilon < 1/2$, $a < c_1 p$, $\gamma^\star\mu r < c_2$, and $\gamma = c_3\gamma^\star$, then for $\eta < 1/8$,

$$\frac{\|L^{(k)} - L^\star\|_F}{\|L^{(0)} - L^\star\|_F} \le (1 - c_4\eta p^2)^k.$$
Compared with the result in Theorem 1, the additional parameter $p$ appears both in the initialization requirement $a < c_1 p$ and in the convergence rate. This makes the result weaker, but we suspect that the dependence on the subsampling ratio $p$ could be improved through a better estimate in (39) and a better estimate of $\tilde C_1$ in Lemma 16; we leave this as a possible future direction.

We present a method of obtaining a proper initialization in Theorem 9. Combining it with Theorem 6, Algorithm 2 allows the corruption level $\gamma^\star$ to be of order $O\big(\tfrac{p}{\mu r^{3/2}\kappa}\big)$.

Theorem 9 (Initialization, partially observed case) There exist $c_1, c_2, c_3 > 0$ such that if $\gamma > 2\gamma^\star$ and $p \ge c_2\,\mu r^2\log n/\min(n_1,n_2)$, and we initialize $L^{(0)}$ using the rank-$r$ approximation of $\tilde F(Y)$, then with probability at least $1 - c_3 n^{-1}$,

$$\|L^{(0)} - L^\star\|_F \le 16\gamma\mu r\,\sigma_1(L^\star),$$

where $\sigma_1(L^\star)$ is the largest singular value of $L^\star$.

4.3. Comparison with Alternating Gradient Descent

Since our objective functions are equivalent to the objective functions of the alternating gradient descent (AGD) method in Yi et al. (2016), it is interesting to compare these two works. The only difference between the two works lies in the algorithmic implementation: our methods use gradient descent on the manifold of low-rank matrices, while the methods in Yi et al. (2016) use alternating gradient descent on the factors of the low-rank matrix. In the following we compare the results of both works in four aspects:

1. Accuracy of initialization. What is the largest value $t$ that the algorithm can tolerate, such that for any initialization $L^{(0)}$ satisfying $\|L^{(0)} - L^\star\|_F \le t$, the algorithm is guaranteed to converge to $L^\star$?

2. Convergence rate. What is the smallest number of iterations $k$ such that the algorithm reaches a given convergence criterion $\epsilon$, i.e., $\|L^{(k)} - L^\star\|_F / \|L^{(0)} - L^\star\|_F < \epsilon$?

3. Corruption level (perfect initialization). Suppose that the initialization is in a sufficiently small neighborhood of $L^\star$ (i.e.,
there exists a very small $\epsilon_0 > 0$ such that $L^{(0)}$ satisfies $\|L^{(0)} - L^\star\|_F < \epsilon_0$); what is the maximum corruption level that can be tolerated in the convergence analysis?

4. Corruption level (proper initialization). Suppose that the initialization is given by the procedure in Theorem 4 (for the fully observed case) or Theorem 9 (for the partially observed case); what is the maximum corruption level that can be tolerated?

These comparisons are summarized in Table 1. We can see that under the fully observed setting, our results remove or reduce the dependence on the condition number $\kappa$, while keeping the other quantities unchanged. Under the partially observed setting, our results still have the advantage of a weaker dependence on $\kappa$, but sometimes require an additional dependence on the subsampling ratio $p$. The simulation results discussed in the next section also verify that when $\kappa$ is large our algorithms have better performance, and that the slowing effect of $p$ under the partially observed setting is not significant. As discussed after Theorem 6, we suspect that this dependence could be removed by a more careful analysis (or additional assumptions).

Table 1: Comparison of the theoretical guarantees in our work and the alternating gradient descent (AGD) algorithm in Yi et al. (2016). The four criteria are explained in detail in Section 4.3.

                 Accuracy of       Convergence       Max corruption    Max corruption
                 initialization    rate              (perfect init.)   (proper init.)
  Algorithm 1    O(σ_r(L*))        O(log(1/ϵ))       O(1/(µr))         O(1/(µ r^{3/2} κ))
  AGD (full)     O(σ_r(L*)/κ)      O(κ log(1/ϵ))     O(1/(κ²µr))       O(1/max(µ r^{3/2} κ^{3/2}, κ²µr))
  Algorithm 2    O(p σ_r(L*))      O(log(1/ϵ)/p²)    O(1/(µr))         O(p/(µ r^{3/2} κ))
  AGD (partial)  O(σ_r(L*)/κ)      O(κµr log(1/ϵ))   O(1/(κ²µr))       O(1/max(µ r^{3/2} κ^{3/2}, κ²µr))

Here we use a simple example to give some intuition for why the proposed methods work better than the gradient descent method based on factorization.
Let us consider the following simple optimization problem: arg min_{z∈R^m} f(z), where z = Ax + By with x, y ∈ R^n and A, B ∈ R^{m×n}. In this example, (x, y) can be considered as the factors of z. The gradient descent method on the factors (x, y) is given by x+ = x − ηAᵀ∇f(z), y+ = y − ηBᵀ∇f(z). (20) Writing the update formula (20) in terms of z, it becomes z+ = Ax+ + By+ = z − η(AAᵀ + BBᵀ)∇f(z). As a result, gradient descent on the factors (x, y) moves in the direction −(AAᵀ + BBᵀ)∇f(z). In comparison, gradient descent on the variable z moves in the direction −∇f(z), which is the direction in which f decreases fastest. If AAᵀ + BBᵀ is a matrix with a large condition number, then we expect that the gradient descent method on the factors (x, y) will not work well. This example suggests that, generally, it is better to apply gradient descent to the variable itself than to its factors. Similarly, our method applies gradient descent to L itself, while Yi et al. (2016) applies gradient descent to the factors of L. 4.4. Comparison with other robust PCA algorithms In this section we compare our result with other robust PCA methods and summarize the comparison in Table 2. Some criteria from Table 1 are not included since they do not apply. For example, (Netrapalli et al., 2014, Alternating Projection) only analyzes the algorithm with a specific initialization, so criteria 1 and 3 in Table 1 do not apply to that work. As a result, in Table 2 we only compare the maximum corruption ratio that these methods can handle and the computational complexity per iteration. As for the convergence rate, it depends on assumptions on parameters such as the coherence parameter, the rank, and the size of the matrix: the alternating projection method of Netrapalli et al. (2014) requires 10 log(4n1µ²r ‖Y − L(0)‖2 / (ϵ√(n1n2))) iterations to achieve an accuracy ϵ, under the assumption that γ < 1/(512µ²r) and with a tuning parameter chosen to be 4µ²r/√(n1n2).
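The factor-versus-variable example from Section 4.3 above can be checked numerically. Below is a minimal sketch; the quadratic choice of f, the diagonal A and B, and all variable names are ours, not from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
m = n = 20

# Toy quadratic objective f(z) = 0.5 * ||z - z_star||^2, so grad f(z) = z - z_star.
z_star = rng.standard_normal(m)
grad_f = lambda z: z - z_star

# Factors z = A x + B y chosen so that A A^T + B B^T is ill-conditioned
# (its eigenvalues 2 d_i^2 span [2, 200], condition number 100).
d = np.linspace(1.0, 10.0, m)
A = np.diag(d)
B = np.diag(d)

def gd_on_variable(steps, eta=0.5):
    """Gradient descent directly on z: direction -grad f(z)."""
    z = np.zeros(m)
    for _ in range(steps):
        z = z - eta * grad_f(z)
    return np.linalg.norm(z - z_star)

def gd_on_factors(steps, eta=0.009):
    """Gradient descent on (x, y): in z it moves along -(AA^T + BB^T) grad f(z).
    eta must satisfy eta * lambda_max(AA^T + BB^T) < 2 for stability."""
    x = np.zeros(n)
    y = np.zeros(n)
    for _ in range(steps):
        g = grad_f(A @ x + B @ y)
        x, y = x - eta * A.T @ g, y - eta * B.T @ g
    return np.linalg.norm(A @ x + B @ y - z_star)

print(gd_on_variable(200))  # essentially zero
print(gd_on_factors(200))   # still visibly nonzero: slowed by the conditioning of AA^T + BB^T
```

The slow modes of the factorized iteration are exactly the small eigenvalues of AAᵀ + BBᵀ, which illustrates why a large condition number hurts gradient descent on the factors.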
The alternating minimization method of Gu et al. (2016) has the guarantee that if ‖L(1) − L∗‖2 ≤ σ1(L∗), then ‖L(k+1) − L∗‖F ≤ ( 24√2 νµ√r (s∗/d)^{3/2} κσ1 / (1 − 24√2 νµ√r (s∗/d)^{3/2} κσ1) )^k ‖L(1) − L∗‖F, where ν is a parameter concerning the coherence of L∗, s∗ is the number of nonzero entries in S∗, and d = min(n1, n2). As a result, s∗/d is approximately γ∗ max(n1, n2) in our notation. If ν is of the order O(1), then this result requires that µ√r κσ1 max(n1, n2)^{3/2} γ∗^{3/2} ≤ O(1), which is more restrictive than our assumption in Theorem 1 that γ∗µr ≤ O(1). Convex methods usually have convergence rate guarantees based on convexity; for example, the accelerated proximal gradient method Toh and Yun (2010) has a convergence rate of O(1/k²). While this rate is slower than the results in Theorem 1 of this work or in Netrapalli et al. (2014); Gu et al. (2016), and the method does not necessarily converge to the correct solution, the guarantee does not depend on any assumption on the low-rank matrix or the corruption ratio.

Criterion | Maximum corruption level | Complexity per iteration
Algorithm 1 | O(1/(κµr^{3/2})) | O(rn1n2)
Convex methods | O(1/(µ²r)) | O(n1n2 min(n1, n2))
Netrapalli et al. (2014) | 1/(512µ²r) | O(r²n1n2)
Gu et al. (2016) | O(1/(µ^{2/3}r^{2/3} min(n1, n2))) | O(r²n1n2)
Yi et al. (2016) | O(1/(κ²µr^{3/2})) | O(rn1n2)

Table 2: Comparison of the theoretical guarantees in our work and some other robust PCA algorithms. The stability result in Theorem 5 is comparable to the analysis in Netrapalli et al. (2014) (the works Gu et al. (2016) and (Yi et al., 2016, Alternating Gradient Descent) do not have a stability analysis). The work Netrapalli et al. (2014) assumes that ‖N∗‖∞ < σr(L∗)/(100n2) and proves that the output L̂ of their algorithm satisfies ‖L̂ − L∗‖F ≤ ϵ + 2µ²r( 7‖N∗‖2 + 8√(n1n2/r) ‖N∗‖∞ ), where ϵ is the error of the algorithm when there is no noise. If N∗ is i.i.d. sampled from N(0, σ²), this result suggests that ‖L̂ − L∗‖F is bounded above by ϵ + O(µ²√(rn1n2) σ).
In comparison, Theorem 5 shows that after a few iterations, ‖L(k) − L∗‖F is bounded above by Cσ√((n1 + n2)r ln(n1n2)) with high probability, which is a tighter upper bound. 5. Simulations In this section, we test the performance of the proposed algorithms on simulated and real data sets. The MATLAB implementation of our algorithm used in this section is available at https://sciences.ucf.edu/math/tengz/. For simulated data sets, we generate L∗ by UΣVᵀ, where U ∈ R^{n1×r} and V ∈ R^{n2×r} are random matrices with entries i.i.d. sampled from N(0, 1), and Σ ∈ R^{r×r} is a diagonal matrix. As for S∗ ∈ R^{n1×n2}, each entry is sampled from N(0, 100) with probability q, and is zero with probability 1 − q. That is, q represents the level of sparsity of the sparse matrix S∗. It measures the overall corruption level of Y and is associated with the corruption level γ∗ (γ∗ measures the row- and column-wise corruption level). For the partially observed case, we assume that each entry of Y is observed with probability p. 5.1. Choice of parameters We first investigate the performance of the proposed algorithms, in particular the dependence on the parameters η and γ. In the simulations, we let [n1, n2] = [500, 600], r = 3, Σ = I, and q = 0.02. For the partially observed case, we let p = 0.2. The first simulation investigates the following questions: should we use Algorithms 1 and 2 with Option 1 or Option 2, and what is an appropriate choice of the step size η? The simulation results for Options 1 and 2 with various step sizes are visualized in Figure 2, which shows that the two options perform similarly. Usually the algorithms converge faster when the step size is larger; however, if the step size is too large then they might diverge. As a result, we use the step size η = 0.7 for Algorithm 1 and 0.7/p for Algorithm 2 in the following simulations. The second simulation concerns the choice of γ. We test γ = cγ∗ for a few choices of c (γ∗ can be calculated from the zero pattern of S∗).
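The simulated-data model just described can be sketched as follows. This is a hedged NumPy sketch of the data generation only (the paper's experiments use MATLAB); the seed and variable names are ours.

```python
import numpy as np

rng = np.random.default_rng(1)
n1, n2, r = 500, 600, 3
q, p = 0.02, 0.2

# Low-rank part L* = U Sigma V^T with i.i.d. N(0, 1) factors; here Sigma = I as in Section 5.1.
U = rng.standard_normal((n1, r))
V = rng.standard_normal((n2, r))
Sigma = np.eye(r)
L_star = U @ Sigma @ V.T

# Sparse part S*: each entry is N(0, 100) (std 10) with probability q, zero otherwise.
mask = rng.random((n1, n2)) < q
S_star = np.where(mask, rng.normal(0.0, 10.0, (n1, n2)), 0.0)

Y = L_star + S_star

# Partially observed case: each entry of Y is observed independently with probability p.
observed = rng.random((n1, n2)) < p
Y_partial = np.where(observed, Y, 0.0)

print(np.linalg.matrix_rank(L_star))  # 3
print(round(mask.mean(), 3))          # approximately q = 0.02
```

Note that q is the overall fraction of corrupted entries, while the theory is stated in terms of γ∗, the worst row- or column-wise corruption fraction, which can be read off from the zero pattern of `S_star`.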
Figure 3 shows that if γ is too small, for example 0.5γ∗, then the algorithm fails to converge to the correct solution; and if γ is too large, then the convergence is slow. Following these observations, we use γ = 1.5γ∗ as the default choice in the following experiments, which is also used in Yi et al. (2016).

Figure 2: The dependence of the estimation error on the number of iterations for different step sizes η: (a) Algorithm 1 (Option 1); (b) Algorithm 1 (Option 2); (c) Algorithm 2 (Option 1); (d) Algorithm 2 (Option 2).

Figure 3: The convergence of the algorithm depending on the choice of γ: (a) fully observed setting; (b) partially observed setting.

5.2. Performance of the proposed algorithm In this section, we analyze the convergence behavior as the parameters (overall ratio of corrupted entries q, condition number κ, rank r, subsampling ratio p) change, and visualize the results in Figure 4. Figure 4(a) shows the simulation for the corruption level q: we use the setting in Section 5.1, but replace the corruption level by q = 0.1, 0.2, 0.3, 0.4. Figure 4 shows that the algorithm converges more slowly with more corruption, which is expected since less information is available. However, the algorithm still converges even with an overall corruption level of 0.4. Figure 4(b) shows the simulation for the rank r: we use the setting in Section 5.1, but replace r by r = 3, 10, 30, 100, 300 respectively. Simulations show that the algorithm works well for ranks r = 3, 10, 30, 100 and converges more slowly for rank r = 300.
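The parameter γ discussed above enters through the row- and column-wise hard-thresholding operator F, which zeroes out the entries that are simultaneously among the γ-fraction largest (in absolute value) within their row and within their column. The sketch below is our reading of that definition; it uses NumPy's linear-interpolation quantile, which may differ slightly from the paper's order statistic |·|^[γ].

```python
import numpy as np

def hard_threshold(A, gamma):
    """Sketch of the thresholding operator F: an entry is zeroed if its magnitude
    exceeds both the (1 - gamma)-quantile of its row and of its column, i.e. it is
    among the gamma-fraction largest in absolute value in both its row and column."""
    absA = np.abs(A)
    row_thresh = np.quantile(absA, 1.0 - gamma, axis=1)[:, None]
    col_thresh = np.quantile(absA, 1.0 - gamma, axis=0)[None, :]
    keep = ~((absA > row_thresh) & (absA > col_thresh))
    return np.where(keep, A, 0.0)

# Tiny demo: only entries that are large in BOTH their row and column are removed.
A = np.array([[1.0, 2.0, 100.0],
              [3.0, 4.0, 5.0],
              [6.0, 200.0, 7.0]])
print(hard_threshold(A, 0.4))  # zeroes the outliers 100.0 and 200.0, keeps the rest
```

Setting γ somewhat above the true γ∗ (e.g. the default 1.5γ∗) errs on the side of removing too many entries, which matches the observation that a too-small γ breaks convergence while a too-large γ merely slows it.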
Figure 4(c) shows the simulation for the condition number κ of L∗: we use the setting in Section 5.1, but replace Σ by Σ = diag(1, 1, 1/κ) and try various values of κ. While the algorithm converges for κ up to 30 in the simulation, for larger κ the error decreases slowly at the beginning and then drops quickly to zero. We suspect that the initialization is not sufficiently good, so it takes a while for the algorithm to reach the local neighborhood of convergence. We also remark that an L∗ with a very large condition number, e.g. κ = 100, is generally challenging for any nonconvex optimization algorithm, as shown in Figure 5, Setting 4. This is because when κ is large, the solution is close to a matrix of rank less than r, a singular point on the manifold of rank-r matrices, so the geometry of the manifold is not smooth enough there. We observe that when κ = 100, our algorithm performs well if the rank r is set to 2 (instead of the true value 3); in fact, when κ = 100, the underlying matrix is approximately of rank 2, since the third singular value is very small. We test the algorithm with various matrix sizes using the setting in Section 5.1 with [n1, n2] = [1000, 1200], [5000, 6000], [10000, 12000]. Figure 4(d) shows that Algorithm 1 converges within a few iterations for all of these choices. In the last simulation, we test Algorithm 2 under the setting in Section 5.1 with various choices of the subsampling ratio p. Figure 4(e) suggests that the algorithm converges for p as small as 0.1, though the convergence is slow for small p. 5.3.
Comparison with other robust PCA algorithms In this section, we compare our algorithm with the accelerated proximal gradient method (APG) and the alternating direction method of multipliers (ADMM) based on the convex relaxation (1); the robust matrix completion algorithm (RMC) Cambier and Absil (2016), based on the manifold optimization problem arg min_{rank(L)=r} Σ_{(i,j)∈Φ} |Lij − Yij| + λ Σ_{(i,j)∉Φ} L²ij; as well as the alternating gradient descent method (AGD) in Yi et al. (2016), which solves the same optimization problem as this work, but with an implementation based on matrix factorization rather than manifold optimization. We use the implementation of APG from Toh and Yun (2010) and the implementation of ADMM from https://github.com/dlaptev/RobustPCA. In these two algorithms, we use the parameter choice λ = 1/√max(n1, n2), which is the default choice in the implementation of Toh and Yun (2010) and in the theoretical analysis of Candès et al. (2011). For ADMM, the augmented Lagrangian parameter is set by default to 10λ. For RMC and AGD, we use their default parameter settings. Since the setting of Algorithm 2 does not apply to the implementations of APG/ADMM, we compare them under the fully observed setting. We compare them in the following four settings:

Figure 4: Dependence of the estimation error on the number of iterations for different (a) overall ratios of corrupted entries q (Algorithm 1); (b) ranks r (Algorithm 1); (c) condition numbers κ (Algorithm 1); (d) matrix sizes [n1, n2] (Algorithm 1); (e) subsampling ratios p (Algorithm 2).
Figure 5: The comparison of the performance of the algorithms under the fully observed setting. The running time is measured in seconds.

Setting 1: the same setting as in Section 5.1. Setting 2 (large condition number): replace Σ by diag(1, 1, 0.1) in Setting 1. Setting 3 (large matrix): replace n1 and n2 by 3000 and 4000 in Setting 1. Setting 4 (very large condition number): replace Σ by diag(1, 1, 0.01) in Setting 1. Figure 5 shows that under Settings 1, 2 and 3, Algorithm 1 converges faster than the other algorithms. In particular, the advantage over the AGD algorithm is very clear under Setting 2, where the condition number is larger. This verifies our theoretical analysis, in which the convergence rate is faster than the analysis in Yi et al. (2016) by a factor of κ. In Setting 3, the algorithms RMC, AGD and Algorithm 1 converge much faster than APG and ADMM, which verifies the computational advantage of nonconvex algorithms when the matrix size is large. However, in Setting 4, the convex algorithms converge to the correct solution while the nonconvex algorithms converge to a local minimizer different from the correct solution. This is because the nonconvex formulations have more than one minimizer, and an algorithm that is not initialized well can get trapped in a local minimizer.

Figure 6: The comparison of the performances of the algorithms under the partially observed setting. The running time is measured in seconds.

In practice, we observed that if the initialization is well chosen and close to the true L∗, then Algorithm 1 converges quickly to the correct solution.
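For reference, the structure of one iteration of the manifold approach compared here (project the gradient onto the tangent space of the rank-r manifold, then retract by a truncated SVD) can be sketched as below. This is our own hedged sketch, not the authors' implementation: it uses a noiseless least-squares gradient L − L∗ in place of the paper's thresholded residual F(L − Y), and the truncated-SVD retraction R^(1) (the best rank-r approximation).

```python
import numpy as np

def tangent_project(D, U, V):
    """Projection onto the tangent space of the rank-r manifold at L = U Sigma V^T:
    P_{T_L M}(D) = U U^T D + D V V^T - U U^T D V V^T  (formula (7) of the paper)."""
    return U @ (U.T @ D) + (D @ V) @ V.T - U @ (U.T @ D @ V) @ V.T

def manifold_gradient_step(L, G, r, eta):
    """One step L+ = R_L(-eta * P_{T_L M}(G)): project the gradient G onto the
    tangent space at L, take a step, and retract to the rank-r manifold by SVD."""
    U, _, Vt = np.linalg.svd(L, full_matrices=False)
    U, V = U[:, :r], Vt[:r, :].T
    step = L - eta * tangent_project(G, U, V)
    Us, s, Vts = np.linalg.svd(step, full_matrices=False)
    return (Us[:, :r] * s[:r]) @ Vts[:r, :]

# Tiny noiseless demo with f(L) = 0.5 * ||L - L_star||_F^2, so grad f(L) = L - L_star.
rng = np.random.default_rng(2)
n1, n2, r = 30, 40, 2
L_star = rng.standard_normal((n1, r)) @ rng.standard_normal((r, n2))
L = L_star + 0.1 * rng.standard_normal((n1, n2))
L = manifold_gradient_step(L, np.zeros((n1, n2)), r, 0.0)  # start from a rank-r approximation
for _ in range(50):
    L = manifold_gradient_step(L, L - L_star, r, eta=0.7)
print(np.linalg.norm(L - L_star))  # decreases toward 0
```

In the full algorithms the gradient G would be the thresholded residual (restricted to observed entries in the partially observed case), and the paper also considers a cheaper retraction that avoids a full SVD.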
We also compare the performance of RMC, AGD and Algorithm 2 under the partially observed setting. We use Setting 1 with p = 0.3 and visualize the results in Figure 6. The results are similar to those of the fully observed setting: AGD and Algorithm 2 are comparable, and RMC converges faster at the beginning but then does not achieve higher accuracy, possibly due to its choice of the regularization parameter. We also test the proposed algorithms in a real data application: video background subtraction. We adopt the public data set Shoppingmall studied in Yi et al. (2016).¹ A few frames are visualized in the first column of Figure 7. There are 1000 frames in this video sequence, represented by a matrix of size 81920 × 1000, where each column corresponds to a frame of the video and each row corresponds to a pixel. We apply our algorithms with r = 3, γ = 0.1, step size η = 0.7, and p = 0.5 for the partially observed case. We stop the algorithm after 100 iterations. Figure 7 shows that our algorithms obtain desirable low-rank approximations within 100 iterations. In Figure 8, we compare our algorithms with AGD in terms of the convergence of the objective function value. In this figure, the relative error is defined as ‖F(L − Y)‖F / ‖Y‖F, a scaled objective value. A smaller relative error implies a better low-rank approximation. Figure 8 shows that our algorithms obtain smaller objective values within 100 iterations under both the fully observed and partially observed cases. 1. The data set is originally from http://perception.i2r.a-star.edu.sg/bk_model/bk_index.html, and is available at https://sciences.ucf.edu/math/tengz/.

Figure 7: The performance of Algorithms 1 and 2 in video background subtraction, with three rows representing three frames in the video sequence. For Algorithm 2, a subsampling ratio of p = 0.5 is used.
Figure 8: The relative error of Algorithms 1, 2, and AGD with respect to the number of iterations, for both the fully observed and partially observed cases in the background subtraction experiment.

6. Conclusion This paper proposes two robust PCA algorithms (one for the fully observed case and one for the partially observed case) based on gradient descent on the manifold of low-rank matrices. Theoretically, compared with the gradient descent algorithm based on matrix factorization, our approach has a faster convergence rate and a better tolerance of initialization error and corruption level; it removes or reduces the dependence of the guarantees on the condition number of the underlying low-rank matrix. Numerically, the proposed algorithms' performance is less sensitive to the choice of step sizes. We also find that, under the partially observed setting, the performance of the proposed algorithm is not significantly affected by the additional dependence on the observation probability. Considering the popularity of methods based on matrix factorization, it is an interesting future direction to apply manifold optimization to other low-rank matrix estimation problems. Acknowledgements The authors thank the editor, an associate editor and the referees for their helpful comments and suggestions. The authors also thank David Fleischer for his help with the coding. Yang's research is partially supported by NSERC RGPIN-2016-05174. Zhang's research is partially supported by National Science Foundation (NSF) grant CNS-1739736. Appendix for "Robust PCA by Manifold Optimization" A. Technical Derivations in Section 3 Verification of (7). Formula (7) can be verified as follows.
Let ⟨·, ·⟩F denote the Frobenius inner product of two matrices. Then ⟨D − P_{TX M}(D), AVVᵀ⟩F = ⟨(I − UUᵀ)D(I − VVᵀ), AVVᵀ⟩F = ⟨(I − UUᵀ)D(I − VVᵀ)VVᵀ, A⟩F = ⟨0, A⟩F = 0, and similarly ⟨D − P_{TX M}(D), UUᵀB⟩F = 0. As a result, ⟨D − P_{TX M}(D), AVVᵀ + UUᵀB⟩F = 0 for all A ∈ R^{n1×n1} and B ∈ R^{n2×n2}, which verifies formula (7) by showing that D − P_{TX M}(D) is orthogonal to TX M. Verification of (10). It is clear that R^(2)_X(δ) defined in (10) has rank r; to show that ⟨R^(2)_X(δ) − (X + δ), Z⟩F = 0 for all Z ∈ TX M, we first write this property as [R^(2)_X(δ) − (X + δ)] ⊥ TX M for simplicity, and since TX M = {AVVᵀ + UUᵀB : A ∈ R^{n1×n1}, B ∈ R^{n2×n2}}, we just need to show that ⟨R^(2)_X(δ) − (X + δ), AVVᵀ + UUᵀB⟩F = 0 for all A ∈ R^{n1×n1} and B ∈ R^{n2×n2}. This is easy to verify: since R^(2)_X(δ)V = (X + δ)V, we have ⟨R^(2)_X(δ) − (X + δ), AVVᵀ⟩F = ⟨(R^(2)_X(δ)V − (X + δ)V)Vᵀ, A⟩F = ⟨0, A⟩F = 0, (21) and similarly, since Uᵀ R^(2)_X(δ) = Uᵀ(X + δ), we have ⟨R^(2)_X(δ) − (X + δ), UUᵀB⟩F = 0, and therefore [R^(2)_X(δ) − (X + δ)] ⊥ TX M. As a result, there exists a unique R^(2)_X such that rank(R^(2)_X) = r and [R^(2)_X(δ) − (X + δ)] ⊥ TX M. Verification of (12). We first define the operator S : R^{n1×n2} → R^{n1×n2} such that F(A) = S(A) ∘ A (∘ represents the elementwise product), i.e., S(A)ij = 0 if |Aij| > |A_{i,·}|^[γ] and |Aij| > |A_{·,j}|^[γ], and S(A)ij = 1 otherwise. Then if the absolute values of all entries of A are different, the sparsity pattern does not change under a small perturbation ∆, i.e., S(A) = S(A + ∆). Then by the definition of f(·), f(L + ∆) − f(L) = (1/2)‖S(L − Y + ∆) ∘ (L − Y + ∆)‖²F − (1/2)‖S(L − Y) ∘ (L − Y)‖²F = (1/2)‖S(L − Y) ∘ (L − Y + ∆)‖²F − (1/2)‖S(L − Y) ∘ (L − Y)‖²F = ⟨S(L − Y) ∘ (L − Y), ∆⟩F + O(‖∆‖²F), where ∘ represents the Hadamard product, i.e., the elementwise product between matrices. Verification of (14). It is sufficient to prove the case where U(k) and V(k) are given by the SVD decomposition L(k) = U(k)Σ(k)V(k)ᵀ. Denote D = ∇f(L(k)) = F(L(k) − Y).
Set X = L(k) and δ = ηPTL(k)M(D) in (10), we have L(k+1) := (L(k) ηPTL(k)M(D))V(k)[U(k)T (L(k) ηPTL(k)M(D))V(k)] 1 (22) U(k)T (L(k) ηPTL(k)M(D)) Zhang and Yang On the other hand, from (7) we have the projection PTL(k)M(D) = U(k)U(k)T D + DV(k)V(k)T U(k)U(k)T DV(k)V(k)T . As a result PTL(k)M(D)V(k) = [U(k)U(k)T D + DV(k)V(k)T U(k)U(k)T DV(k)V(k)T ]V(k) = U(k)U(k)T DV(k) + DV(k)V(k)T V(k) U(k)U(k)T DV(k)V(k)T V(k) = DV(k) (23) and similarly, U(k)T PTL(k)M(D) = U(k)T D. (24) Combining (23), (24) with (22), the update formula (14) is verified. B. Proof of Theorem 1 In this proof, we will investigate L+ L F , where L+ = RL( ηPTLF(L Y)). It is sufficient to prove that when L L aσr(L ) with the value a satisfying the conditions in Theorem 1, then L+ L F 1 1 2C1 8 η L L F . (25) To prove (25), we first introduce three auxiliary lemmas. Lemma 10 (a) Let D = L L F(L Y) = L L F(L L S ), then D 2 F C2 1 L L 2 F . (26) (b) For the noisy setting where Y = L + S + N , and D = L L N F(L Y), we have D 2 F 2C2 1 L L 2 F + 2(γ + 5γ )N1, (27) where N1 = n2 Pn1 i=1 |N i, |max + n1 Pn2 j=1 |N ,j|max. Lemma 11 If L L F aσr(L ) and a 1, then (L L ) PTL(L L ) F a 2(1 a) L L F , (28) (L L ) PTL (L L ) F a 2 L L F . (29) Lemma 12 For X TLM, then R(i) L (X) (L + X) F X 2 F 2(σr(L) X ), for either i = 1 or 2. Robust PCA by Manifold Optimization To prove (25), first we note that L L 2 F L ηPTLF(L Y) L 2 F = L L 2 F L L 2 F + 2η L L , PTLF(L Y) F ηPTLF(L Y) 2 F =2η L L , PTLF(L Y) F ηPTLF(L Y) 2 F =2η L L , PTL(L L ) PTL(L L F(L Y)) F η2 PTLF(L Y) 2 F =2η PTL(L L ), PTL(L L ) PTLD F η2 PTLF(L Y) 2 F 2η( PTL(L L ) 2 F D F PTL(L L ) F ) η2( L L F + D F )2. (30) The fourth line is obtained by PTL(L L F(L Y)) = L L PTLF(L Y) F . The fifth line is because L L = PTL(L L ) + P TL(L L ). The last line uses Cauchy Schwarz inequality PTL(L L ), PTLD F D F PTL(L L ) F and triangular inequality PTLF(L Y) F L L F + PTL(D) F L L F + D F . 
Lemma 11 and the assumptions L L F aσr(L ) and q 1 ( a 2(1 a))2 > 1 PTL(L L ) F 1 2 L L F . (31) Combining it with the estimation of D F in Lemma 10, we have L L 2 F L ηPTLF(L Y) L 2 F 2 C1) L L 2 F η2(1 + C1)2 L L 2 F . (32) When ths RHS of (32) is positive (i.e., when (1 2C1) 2η(1 + C1)2), (32) implies L L F > L ηPTLF(L Y) L F and L L F L ηPTLF(L Y) L F 2 C1) L L 2 F η2(1 + C1)2 L L 2 F L L F + L ηPTLF(L Y) L F 2 C1) η2(1 + C1)2 L L F . (33) In addition, PTLF(L Y) F F(L Y) F = L L F + D F (1 + C1) L L F (34) and Lemma 12 give L+ L F L ηPTLF(L Y) L F L ηPTLF(L Y) L+ F η2 PTLF(L Y) 2 F σr(L ) η PTLF(L Y) F η2a2(1 + C1)2 1 ηa(1 + C1) L L F . (35) Combining (33) and (35), L L F L+ L F 4η(1 2C1) η2(1 + C1)2 1 1 η(1 + C1)a Therefore, Theorem 1 is proved when C1 < 1/2, and η0 is chosen such that η0(1 + C1)2 1 1 η0(1 + C1)a Zhang and Yang C. Proof of Theorem 5 The proof of the noisy case also follows similarly from the proofs of Theorem 1 and 6. Note that F(L Y) = L L N D , and define Q = PTL(L L ), then following the proof of Theorem 1 and applying Lemma 10 (b), we have L L 2 F L ηPTLF(L Y) L 2 F =2η L L , PTLF(L Y) F + O(η2) = 2η PTL(L L ), PTLF(L Y) F + O(η2) =2η PTL(L L ), PTL(L L N D ) F + O(η2) 2η Q 2 F N , Q F Q F q 2C2 1 L L 2 F + 2(γ + 5γ )N1 In addition, (35) gives L+ L F L ηPTLF(L Y) L F = O(η2). Combining it with the estimation of C1, N1, and N , Q F in Lemma 13 and the fact that (1 a 2(1 a)) L L F Q F (1 + a 2(1 a)) L L F (which follows from Lemma 11), the Theorem is proved. Lemma 13 If N Rn1 n2 is elementwisely i.i.d. sampled from N(0, σ2), then (a) with probability 1 4 n7 1n7 2 , Pn1 i=1(|N i, |max)2 16σ2n1 ln(n1n2), and Pn2 j=1(|N ,j|max)2 16σ2n2 ln(n1n2), and as a result, N1 32σ2n1n2 ln(n1n2). (b) There exists C6 > 0 such that as n1 + n2 , the probability that N , PTL(L L ) F 1 4 PTL(L L ) 2 F (36) holds for all {L : C6σ p (n1 + n2)r ln(n1n2) L L F aσr(L )} converges to 1. D. 
Proof of Theorem 6 This proof borrows two lemmas from (Yi et al., 2016, Lemmas 9, 10) as follows. Lemma 14 (Yi et al., 2016, Lemma 9) There exists c > 0 such that for all 0 < ϵ < 1, if p cµr log(n)/ϵ2 min(n1, n2), then with probability at least 1 2n 3, for all X in the tangent plane TL , i.e., all X that can be written as L A + BL , where A Rn2 n2 and B Rn1 n1, (1 ϵ) X 2 F 1 p PΦX 2 F (1 + ϵ) X 2 F . Lemma 15 (Yi et al., 2016, Lemma 10) If p 56 3 log n γ min(n1,n2), the with probability at least 1 6n 1, the number of entries in Φ per row is in the interval [pn2/2, 3pn2/2], and the number of entries in Φ per column is in [pn1/2, 3pn1/2]. Then we introduce the following lemma parallel to Lemma 10: Robust PCA by Manifold Optimization Lemma 16 When the events in Lemmas 14 and 15 hold, for D = PΦ[L L F(L Y)] we have D 2 F C2 1 L L 2 F , (37) with C1 = 1 p(1 ϵ) h 6(γ + 2γ )pµr + 4 3γ p(1 + ϵ) + a 2)2 + a2i . The proof of Theorem 6 is parallel to the proof of Theorem 1, with L+ defined slightly differently by L+ = RL( ηPTL F(L Y)). Defining PΦ : Rn1 n2 Rn1 n2 by ( Xij, if (i, j) Φ, 0, if (i, j) / Φ. Then F(L Y) = PΦ F(L Y). Following a similar analysis as (30), L L 2 F L ηPTLPΦ F(L Y) L 2 F =2η L L , PTLPΦ F(L Y) F ηPTLPΦ F(L Y) 2 F 2η PΦPTL(L L ), PΦ F(L Y) F ηPΦ F(L Y) 2 F 2η PΦ(L L ) PΦP TL(L L ), PΦ(L L ) D F η2( PΦ(L L ) F + D F )2, (38) here P TL represents the projector to the subspace orthogonal to TL. Lemma 11 and Lemma 14 imply PΦP TL(L L ) F PΦ(L L ) F P TL(L L ) F PΦ(L L ) F ap(1 + ϵ) 2(1 a) , (39) and combining it with the estimation of D in Lemma 16, the RHS of (38) is larger than PΦ(L L ) 2 F 2η 1 C1 ap(1 + ϵ) 2(1 a) (1 + C1) η2(1 + C1)2 . (40) In addition, Lemma 14 implies PΦ F(L Y) F PΦ(L L ) F + PΦ D F (1 + C1) PΦ(L L ) F , (1 + C1)p(1 + ϵ) L L and combining it with Lemma 12, L+ L F L ηPTLPΦ F(L Y) L F η2a2(p + pϵ)2(1 + C1)2 1 ηa(p + pϵ)(1 + C1) L L F . 
Zhang and Yang Combining it with (40) and Lemma 11, we have 1 p2(1 ϵ)2 2η 1 C1 ap(1 + ϵ) 2(1 a) (1 + C1) η2(1 + C1)2 +η2a2(p + pϵ)2(1 + C1)2 1 ηa(p + pϵ)(1 + C1) , and Theorem 6 is proved. E. Proof of Lemmas Lemma 10(a) Proof By the definition of F, for any matrix A, A F(A) is a sparse matrix, therefore D = L L S F(L L S ) + S is a sparse matrix. Denote the locations of the nonzero entries of D by S, and divide it into two sets S1 S2 defined as follows: S1 = {(i, j) : |[L L S ]ij| > |[L L S ]i, |[γ] and |[L L S ]ij| > |[L L S ] ,j|[γ]}, S2 = {(i, j) / S1 : Dij = [L L ]ij F(L L S )ij = 0}. For (i, j) S1, [F(L L S )]ij = 0. As a result, Dij = [L L ]ij. In addition, by definition of F( ), each row or column of D has at most γ percentage of points in S1. For (i, j) S2, since [F(L L S )]ij = [L L S ]ij, we have Dij = S ij = 0. By Assumption 1, therefore, for each row or column of D, at most γ percentage of points lie in S2. Combine the results |[L L S ]i, |[γ] {|[L L ]i, | + |[S ]i, |}[γ] |[L L ]i, |[γ γ ], |[L L S ]j, |[γ] {|[L L ]j, | + |[S ]j, |}[γ] |[L L ]j, |[γ γ ], with [F(L L S )]ij = [L L S ]ij, we have for (i, j) S2 |Dij| =|[L L F(L L S )]ij| |[L L ]ij| + |F(L L S )]ij| |[L L ]ij| + max(|[L L S ]i, |[γ], |[L L S ] ,j|[γ]) |[L L ]ij| + max(|[L L ]i, |[γ γ ], |[L L ] ,j|[γ γ ]). 
Applying the estimations above, and repeatedly use the fact that (x + y)2 2x2 + 2y2, we have Robust PCA by Manifold Optimization (i,j) S D2 ij = X (i,j) S1 D2 ij + X (i,j) S2 D2 ij X (i,j) S1 [L L ]2 ij n |[L L ]ij| + max(|[L L ]i, |[γ γ ], |[L L ] ,j|[γ γ ]) o2 (i,j) S1 [L L ]2 ij + 2 X (i,j) S2 [L L ]2 ij + 2 X (i,j) S2 max{|[L L ]i, |[γ γ ], |[L L ] ,j|[γ γ ]}2 (i,j) S1 [L L ]2 ij + 2 X (i,j) S2 [L L ]2 ij + 2 X (i,j) S2 {|[L L ]i, |[γ γ ] + |[L L ] ,j|[γ γ ]}2 (i,j) S1 [L L ]2 ij + 2 X (i,j) S2 [L L ]2 ij + 4 X (i,j) S2 {|[L L ]i, |[γ γ ]}2 + {|[L L ] ,j|[γ γ ]}2 (i,j) S [L L ]2 ij + X (i,j) S2 [L L ]2 ij + 4 γ γ γ L L 2 F (i,j) S [PTL (L L )]2 ij + 2 X (i,j) S2 [PTL (L L )]2 ij + 4 γ γ γ L L 2 F (i,j) S [L L PTL (L L )]2 ij + 2 X (i,j) S2 [L L PTL (L L )]2 ij (i,j) S [PTL (L L )]2 ij + 2 X (i,j) S2 [PTL (L L )]2 ij + 4 γ γ γ L L 2 F + 4 L L PTL (L L ) 2 F . (41) Note that from line 5 to line 6, we used the fact that for x Rn, and k n k(x(k))2 (x(k))2 + (x(k+1))2 + + (x(n 1)) + (x(n)) (x(1)) + + (x(n)) = x 2 F , where x(k) is the k-th order statistics of x1, . . . , xn, i.e. the k-th smallest value. This gives us (γ γ )n2|[L L ]i, |[γ γ ] [L L ]i, 2 2; (γ γ )n2|[L L ] ,j|[γ γ ] [L L ] ,j 2 2. Therefore X (i,j) S2 |[L L ]i, |[γ γ ] γ n2 (γ γ )n2 [L L ]i, 2 F ; (42) (i,j) S2 |[L L ] ,j|[γ γ ] γ n1 (γ γ )n1 [L L ] ,j 2 F . (43) The values γ n2 and γ n1 in the numerator of the right hand sides in 42 and 43 are due to the fact that, in each row or column of D, at most γ percentage of points lie in S2. On the other hand, Lemma 11 implies L L PTL (L L ) F a 2 L L F . 
(44) Zhang and Yang In addition, using the fact that there exists A Rn1 r and B Rn2 r, such that PTL (L L ) = AVT + UBT and PTL (L L ) 2 F = AVT 2 F + UBT 2 F , and that for each row or column, at most γ + γ percentage of points lie in S, we have X (i,j) S [PTL (L L )]2 ij 2 X (i,j) S [ (AVT )ij 2 + (UBT )ij 2] 2(γ + γ )µr X 1 i n1,1 j n2 [ (AVT )ij 2 + (UBT )ij 2] = 2(γ + γ )µr PTL (L L ) 2 F 2(γ + γ )µr L L 2 F . (45) Similarly, A 2, = max z 2=1 Az X (i,j) S2 [PTL (L L )]2 ij 2γ µr L L 2 F , (46) Combining (41)-(46), (26) is proved. Lemma 10(b) Proof Let L = L N , then applying the fact that for any x, y Rn, |[x + y]|[γ] |[x]|[γ] + |[x]|max, where |[x]|max represents the largest value of |[x]|. We have (i,j) S [L L ]2 ij + X (i,j) S2 [L L ]2 ij (i,j) S2 {(|[L L ]i, |[γ γ ])2 + (|[L L ] ,j|[γ γ ])2} (i,j) S [L L ]2 ij + |N ij|2 + X (i,j) S2 [L L ]2 ij + |N ij|2 (i,j) S2 {(|[L L ]i, |[γ γ ])2 + (|N i, |max)2 + (|[L L ] ,j|[γ γ ])2 + (|N ,j|max)2} 2C2 1 L L 2 F + 2(γ + 5γ )N1, where the last inequality follows from the proof of part (a) and the definition of N1. Lemma 11 Proof Let the SVD decomposition of L be L = UΣV, U and V be orthogonal matrices of sizes Rn1 (n1 r) and Rn2 (n2 r) such that Col(U ) Col(U) and Col(V ) Col(V) (here Col(U) represents the subspace spanned by the columns of U). Let L (1,1) UT L V, L (1,2) UT L V , L (2,1) U T L V, L (2,2) U T L V . Robust PCA by Manifold Optimization Since rank(L ) = r, we have L (2,2) = L (2,1)L (1,1) 1L (1,2). Since all singular values of L (1,1) are larger than (1 a)σr(L ), if the singular value decom- position of L (1,1) 1 is given by L (1,1) 1 = U0Σ0VT 0 , then the Σ0 1/(1 a)σr(L ). 
Applying AB 2 F A 2 F B 2 F and the fact that for a square, diagonal matrix Σ, |[XΣ]ij| = |XijΣjj| Σ |Xij|, we have L (2,2) F = L (2,1)U0Σ0VT 0 L (1,2) F L (2,1)U0Σ0 F VT 0 L (1,2) F 1 (1 a)σr(L ) L (2,1)U0 F VT 0 L (1,2) F 1 (1 a)σr(L ) L (2,1) F L (1,2) F 1 (1 a)σr(L ) L (2,1) 2 F + L (1,2) 2 F 2 1 (1 a)σr(L ) 2(1 a)σr(L ), (47) and (28) is proved. The proof of (29) is similar. Lemma 12 Proof Let the SVD decomposition of L be L = UΣV, and L(1,1) = UT (X + L)V, L(1,2) = UT (X + L)V = UT XV , L(2,1) = U T (X + L)V = U T XV, then it is clear that R(2) L (X) = L + X + U L(2,1)L(1,1) 1L(1,2)V T , and using the same argument as in (47), L(2,1)L(1,1) 1L(1,2) F 1 σr(L (1,1)) L(1,2) F L(2,1) F L(2,1) 2 F + L(1,2) 2 F 2 1 σr(L) X X 2 F 2 . Zhang and Yang So Lemma 12 is proved for R(2) L (X). By definition, R(1) L (X) is the closest matrix to L + X that has rank r, so R(1) L (X) (L + X) F R(2) L (X) (L + X) F and Lemma 12 is also proved for R(1) L (X). Lemma 13 Proof WLOG, we assume σ = 1 and the generic cases can be proved similarly. (a) It follows from the estimation of the distribution of the maximum of n1 i.i.d. Gaussian variables {gi}n1 i=1: Pr{ max 1 i n1 |gi| 4 p ln(n1n2)} 1 2 exp (4 p 1 2n1 exp (4 p = 1 2n 7 1 n 8 2 , where the first inequality applies the estimation of the cumulative distribution function of the Gaussian distribution (Ledoux and Talagrand, 1991, pg 8). Combining this estimation for each column of N and applying the union bound, the second inequality in part (a) holds with probability 1 2n 7 1 n 7 2 . Similarly, the first inequality in part (a) holds with the same probability. (b) First, we parameterize L by g(L) = PL (L L ). Then we claim that, for any L and L such that L L F , L L F aσr(L ), there exists C0 depending on a such that PTL(L L ) PTL (L L ) F C0 g(L) g(L ) F . (48) To prove (48), apply (29) and obtain L L F 1 1 a 2 g(L) g(L ) F . 
(49) Since PTL = ULUT L + VLVT L ULUT LVLVT L, and using Davis-Kahan theorem Davis and Kahan (1970) and the assumption L L F aσr(L ), there exists c1, c2 depending on a such that ULUT L UL UT L F c1, VLVT L VL VT L F c2, so there exists C depending on a such that PTL(L L ) PTL (L L ) F (50) = [PTL (L L ) PTL (L L )] + [PTL(L L ) PTL (L L )] F L L F + C L L F . Combining (49) and (50), (48) is proved. Second, based on (48), we will apply an ϵ-net covering argument to finish the proof that combines probabilistic estimation for each L and a union bound (ϵ-net covering argument is a standard argument in probabilistic estimation Vershynin (2012)). Use the estimation of Robust PCA by Manifold Optimization the cumulative distribution function of the Gaussian distribution (Ledoux and Talagrand, 1991, pg 8), for any L , Pr N , PTL (L L ) F t PTL (L L ) F 1 For any L such that g(L ) g(L) F < ϵ, applying (48), Pr { N , PTL(L L ) F t PTL(L L ) F + C0ϵ( N F + t)} 1 Using union bound, there is an ϵ-net of the set {g(L) : g(L) F = x} with at most (C5x/ϵ)n1r+n2r r2 points. Therefore, for all L such that x ϵ PTL(L L ) F x + ϵ, Pr { N , PTL(L L ) F t PTL(L L ) F + 2C0ϵ( N F + t)} n1r+n2r r2 . (51) Let t = x/8 and ϵ = x/16C0 N F , then when N F 1 (which holds with high probability as n1n2 goes to infinity), then using C0 1 we have ϵ x/16, and when x 4, t PTL(L L ) F + 2C0ϵ( N F + t) x 8(x + ϵ) + x 8 N F ( N F + t) 8(x + ϵ) + x 8 17 16 + x 8 17 16 + x2 4 PTL(L L ) 2 F , (52) where the last inequality applies the assumption x ϵ PTL(L L ) F . Combining (51) and (52) and recall that t = x/8, we have that for all L such that x x/16C0 N F PTL(L L ) F x + x/16C0 N F , N , PTL(L L ) F 1 4 PTL(L L ) 2 F , for all L s.t. PTL(L L ) F x x 16C0 N F 16C5C0 N F n1r+n2r r2 . 
Let $x_i = \sqrt{n_1 + n_2 + 128(n_1r + n_2r - r^2)\ln(16C_5C_0\|N\|_F)}\,\big(1 + 1/(16C_0\|N\|_F)\big)^i$ for $i = 1, 2, \ldots$; then
$$\sum_{i=1}^{\infty}\exp\Big(-\frac{x_i^2}{128}\Big)\big(16C_5C_0\|N\|_F\big)^{n_1r+n_2r-r^2} \le \exp\Big(-\frac{n_1+n_2}{128}\Big)\sum_{i=1}^{\infty}\exp\big(-(1 + 1/(16C_0\|N\|_F))^{2i}\big)$$
$$\le \exp\Big(-\frac{n_1+n_2}{128}\Big)\sum_{i=1}^{\infty}\exp\big(-1 - i/(8C_0\|N\|_F)\big) = \exp\Big(-\frac{n_1+n_2}{128} - 1\Big)\frac{\exp(-1/(8C_0\|N\|_F))}{1 - \exp(-1/(8C_0\|N\|_F))} \le 8C_0\|N\|_F\exp\Big(-\frac{n_1+n_2}{128} - 1\Big), \qquad (54)$$
where the last inequality uses $\exp(c) - 1 \ge c$ when $c \ge 0$. Clearly, the RHS goes to 0 as $n_1 + n_2 \to \infty$. Combining the estimation (53) for $\{x_i\}_{i=1}^{\infty}$, with probability at least $1 - 8C_0\|N\|_F\exp(-\frac{n_1+n_2}{128} - 1)$, the event (36) holds for all $L$ such that $\|g(L)\|_F \ge \max\big(\sqrt{n_1+n_2+128(n_1r+n_2r-r^2)\ln(16C_5C_0\|N\|_F)},\, 4\big)$. Combining it with (29), the event (36) holds for all $L$ such that
$$a\sigma_r(L^*) \ge \|L - L^*\|_F \ge \max\Big(\sqrt{n_1+n_2+128(n_1r+n_2r-r^2)\ln(16C_5C_0\|N\|_F)},\, 4\Big).$$
Considering that $\sqrt{n_1+n_2+128(n_1r+n_2r-r^2)\ln(16C_5C_0\|N\|_F)}$ is the dominant term when $n_1, n_2 \to \infty$, Lemma 13(b) is proved.

Proof of Lemma 16. Following (41) and the proof of Lemma 10(a), and noting that by Lemma 15, $\gamma^*$ and $\gamma$ are replaced by arbitrary numbers in the intervals $[0.5p\gamma^*, 1.5p\gamma^*]$ and $[0.5p\gamma, 1.5p\gamma]$ respectively, we have
$$\|D\|_F^2 \le 6(\gamma + 2\gamma^*)p\mu r\,\|L - L^*\|_F^2 + 4\sqrt{3\gamma}\,\|P_\Phi(L - L^*)\|_F^2 + a^2\|L - L^*\|_F^2.$$
Applying Lemma 11 and (44), we have
$$\|P_\Phi(L - L^*)\|_F \le \|P_\Phi P_{T_L}(L - L^*)\|_F + \|P_\Phi P_{T_L}^{\perp}(L - L^*)\|_F \le \sqrt{p}(1+\epsilon)\|L - L^*\|_F + a\sqrt{2}\,\|L - L^*\|_F.$$
Combining it with the estimation of $\|P_\Phi(L - L^*)\|_F$ in Lemma 11, we have $\|D\|_F \le C_1\|P_\Phi(L - L^*)\|_F$ with
$$C_1 = \frac{1}{p(1-\epsilon)}\Big[6(\gamma + 2\gamma^*)p\mu r + 4\sqrt{3\gamma}\big(\sqrt{p}(1+\epsilon) + a\sqrt{2}\big)^2 + a^2\Big].$$

P.-A. Absil and I. V. Oseledets. Low-rank retractions: a survey and new results. Computational Optimization and Applications, 62(1):5-29, 2015. ISSN 1573-2894. doi: 10.1007/s10589-014-9714-4. URL http://dx.doi.org/10.1007/s10589-014-9714-4.

P.-A. Absil, R. Mahony, and R. Sepulchre. Optimization Algorithms on Matrix Manifolds. Princeton University Press, 2009. ISBN 9781400830244. URL http://books.google.com/books?id=NSQGQeLN3NcC.

Afonso S. Bandeira, Nicolas Boumal, and Vladislav Voroninski.
On the low-rank approach for semidefinite programs arising in synchronization and community detection. In Vitaly Feldman, Alexander Rakhlin, and Ohad Shamir, editors, 29th Annual Conference on Learning Theory, volume 49 of Proceedings of Machine Learning Research, pages 361-382, Columbia University, New York, New York, USA, 23-26 Jun 2016. PMLR. URL http://proceedings.mlr.press/v49/bandeira16.html.

R. Basri and D. Jacobs. Lambertian reflectance and linear subspaces. IEEE Transactions on Pattern Analysis and Machine Intelligence, 25(2):218-233, February 2003.

Srinadh Bhojanapalli, Prateek Jain, and Sujay Sanghavi. Tighter low-rank approximation via sampling the leveraged element. In Proceedings of the Twenty-sixth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA '15, pages 902-920, Philadelphia, PA, USA, 2015. Society for Industrial and Applied Mathematics. URL http://dl.acm.org/citation.cfm?id=2722129.2722191.

Nicolas Boumal. Nonconvex phase synchronization. SIAM Journal on Optimization, 26(4):2355-2377, 2016. doi: 10.1137/16M105808X. URL http://dx.doi.org/10.1137/16M105808X.

Samuel Burer and Renato D.C. Monteiro. A nonlinear programming algorithm for solving semidefinite programs via low-rank factorization. Mathematical Programming, 95(2):329-357, 2003. ISSN 1436-4646. doi: 10.1007/s10107-002-0352-8. URL http://dx.doi.org/10.1007/s10107-002-0352-8.

Samuel Burer and Renato D.C. Monteiro. Local minima and convergence in low-rank semidefinite programming. Mathematical Programming, 103(3):427-444, 2005. ISSN 1436-4646. doi: 10.1007/s10107-004-0564-1. URL http://dx.doi.org/10.1007/s10107-004-0564-1.

Léopold Cambier and P.-A. Absil. Robust low-rank matrix completion by Riemannian optimization. SIAM Journal on Scientific Computing, 38(5):S440-S460, 2016. doi: 10.1137/15M1025153. URL https://doi.org/10.1137/15M1025153.

Emmanuel J. Candès, Xiaodong Li, Yi Ma, and John Wright. Robust principal component analysis? J. ACM, 58(3):11:1-11:37, June 2011.
ISSN 0004-5411. doi: 10.1145/1970392.1970395. URL http://doi.acm.org/10.1145/1970392.1970395.

Venkat Chandrasekaran, Sujay Sanghavi, Pablo A. Parrilo, and Alan S. Willsky. Rank-sparsity incoherence for matrix decomposition. SIAM Journal on Optimization, 21(2):572-596, 2011. doi: 10.1137/090761793. URL http://dx.doi.org/10.1137/090761793.

Yudong Chen and Martin J. Wainwright. Fast low-rank estimation by projected gradient descent: General statistical and algorithmic guarantees. CoRR, abs/1509.03025, 2015.

Yeshwanth Cherapanamjeri, Kartik Gupta, and Prateek Jain. Nearly-optimal robust matrix completion. CoRR, abs/1606.07315, 2016. URL http://arxiv.org/abs/1606.07315.

Kenneth L. Clarkson and David P. Woodruff. Low rank approximation and regression in input sparsity time. In Proceedings of the Forty-fifth Annual ACM Symposium on Theory of Computing, STOC '13, pages 81-90, New York, NY, USA, 2013. ACM. ISBN 978-1-4503-2029-0. doi: 10.1145/2488608.2488620. URL http://doi.acm.org/10.1145/2488608.2488620.

Chandler Davis and W. M. Kahan. The rotation of eigenvectors by a perturbation. III. SIAM Journal on Numerical Analysis, 7(1):1-46, 1970. doi: 10.1137/0707001. URL http://dx.doi.org/10.1137/0707001.

Christopher De Sa, Kunle Olukotun, and Christopher Ré. Global convergence of stochastic gradient descent for some non-convex matrix problems. In Proceedings of the 32nd International Conference on International Conference on Machine Learning - Volume 37, ICML '15, pages 2332-2341. JMLR.org, 2015. URL http://dl.acm.org/citation.cfm?id=3045118.3045366.

Scott Deerwester, Susan T. Dumais, George W. Furnas, Thomas K. Landauer, and Richard Harshman. Indexing by latent semantic analysis. Journal of the American Society for Information Science, 41(6):391-407, 1990. ISSN 1097-4571. doi: 10.1002/(SICI)1097-4571(199009)41:6<391::AID-ASI1>3.0.CO;2-9. URL http://dx.doi.org/10.1002/(SICI)1097-4571(199009)41:6<391::AID-ASI1>3.0.CO;2-9.

R. Epstein, P. Hallinan, and A.
Yuille. 5±2 eigenimages suffice: An empirical investigation of low-dimensional lighting models. In IEEE Workshop on Physics-based Modeling in Computer Vision, pages 108-116, June 1995.

Alan Frieze, Ravi Kannan, and Santosh Vempala. Fast Monte-Carlo algorithms for finding low-rank approximations. J. ACM, 51(6):1025-1041, November 2004. ISSN 0004-5411. doi: 10.1145/1039488.1039494. URL http://doi.acm.org/10.1145/1039488.1039494.

Quanquan Gu, Zhaoran Wang, and Han Liu. Low-rank and sparse structure pursuit via alternating minimization. In Arthur Gretton and Christian C. Robert, editors, AISTATS, volume 51 of JMLR Workshop and Conference Proceedings, pages 600-609. JMLR.org, 2016. URL http://dblp.uni-trier.de/db/conf/aistats/aistats2016.html#GuWL16.

J. Ho, M. Yang, J. Lim, K. Lee, and D. Kriegman. Clustering appearances of objects under varying illumination conditions. In Proceedings of International Conference on Computer Vision and Pattern Recognition, volume 1, pages 11-18, 2003.

D. Hsu, S. M. Kakade, and T. Zhang. Robust matrix decomposition with sparse corruptions. IEEE Transactions on Information Theory, 57(11):7221-7234, November 2011. ISSN 0018-9448. doi: 10.1109/TIT.2011.2158250.

Prateek Jain, Praneeth Netrapalli, and Sujay Sanghavi. Low-rank matrix completion using alternating minimization. In Proceedings of the Forty-fifth Annual ACM Symposium on Theory of Computing, STOC '13, pages 665-674, New York, NY, USA, 2013. ACM. ISBN 978-1-4503-2029-0. doi: 10.1145/2488608.2488693. URL http://doi.acm.org/10.1145/2488608.2488693.

R. H. Keshavan, A. Montanari, and S. Oh. Matrix completion from a few entries. IEEE Transactions on Information Theory, 56(6):2980-2998, June 2010. ISSN 0018-9448. doi: 10.1109/TIT.2010.2046205.

A. Kyrillidis and V. Cevher. Matrix ALPS: Accelerated low rank and sparse matrix reconstruction. In 2012 IEEE Statistical Signal Processing Workshop (SSP), pages 185-188, August 2012. doi: 10.1109/SSP.2012.6319655.

M.
Ledoux and M. Talagrand. Probability in Banach Spaces: Isoperimetry and Processes. A Series of Modern Surveys in Mathematics Series. Springer, 1991. ISBN 9783540520139. URL https://books.google.com/books?id=cyKYDfvxRjsC.

L. Li, W. Huang, I. Gu, and Q. Tian. Statistical modeling of complex backgrounds for foreground object detection. Image Processing, IEEE Transactions on, 13(11):1459-1472, November 2004. ISSN 1057-7149. doi: 10.1109/TIP.2004.836169.

X. Li and J. Haupt. Identifying outliers in large matrices via randomized adaptive compressive sampling. IEEE Transactions on Signal Processing, 63(7):1792-1807, April 2015. ISSN 1053-587X. doi: 10.1109/TSP.2015.2401536.

Lester W. Mackey, Michael I. Jordan, and Ameet Talwalkar. Divide-and-conquer matrix factorization. In J. Shawe-Taylor, R. S. Zemel, P. L. Bartlett, F. Pereira, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 24, pages 1134-1142. Curran Associates, Inc., 2011. URL http://papers.nips.cc/paper/4486-divide-and-conquer-matrix-factorization.pdf.

Praneeth Netrapalli, Niranjan U N, Sujay Sanghavi, Animashree Anandkumar, and Prateek Jain. Non-convex robust PCA. In Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 27, pages 1107-1115. Curran Associates, Inc., 2014. URL http://papers.nips.cc/paper/5430-non-convex-robust-pca.pdf.

Dohyung Park, Anastasios Kyrillidis, Srinadh Bhojanapalli, Constantine Caramanis, and Sujay Sanghavi. Provable Burer-Monteiro factorization for a class of norm-constrained matrix problems. arXiv preprint arXiv:1606.01316, 2016.

Dohyung Park, Anastasios Kyrillidis, Constantine Caramanis, and Sujay Sanghavi. Non-square matrix sensing without spurious local minima via the Burer-Monteiro approach.
In Aarti Singh and Jerry Zhu, editors, Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, volume 54 of Proceedings of Machine Learning Research, pages 65-74, Fort Lauderdale, FL, USA, 20-22 Apr 2017. PMLR. URL http://proceedings.mlr.press/v54/park17a.html.

Alkes L. Price, Nick J. Patterson, Robert M. Plenge, Michael E. Weinblatt, Nancy A. Shadick, and David Reich. Principal components analysis corrects for stratification in genome-wide association studies. Nature Genetics, 38(8):904-909, August 2006. ISSN 1061-4036 (Print). doi: 10.1038/ng1847.

M. Rahmani and G. K. Atia. High dimensional low rank plus sparse matrix decomposition. IEEE Transactions on Signal Processing, 65(8):2004-2019, April 2017. ISSN 1053-587X. doi: 10.1109/TSP.2017.2649482.

A. Ruhe. Numerical Computation of Principal Components when Several Observations are Missing. Univ., 1974. URL https://books.google.com/books?id=CgbyjgEACAAJ.

Uri Shalit, Daphna Weinshall, and Gal Chechik. Online learning in the embedded manifold of low-rank matrices. J. Mach. Learn. Res., 13(1):429-458, February 2012. ISSN 1532-4435. URL http://dl.acm.org/citation.cfm?id=2503308.2188399.

J. Sun, Q. Qu, and J. Wright. Complete dictionary recovery over the sphere I: Overview and the geometric picture. IEEE Transactions on Information Theory, 63(2):853-884, February 2017. ISSN 0018-9448. doi: 10.1109/TIT.2016.2632162.

Kim-Chuan Toh and Sangwoon Yun. An accelerated proximal gradient algorithm for nuclear norm regularized linear least squares problems. Pacific Journal of Optimization, 6(615-640):15, 2010.

Stephen Tu, Ross Boczar, Max Simchowitz, Mahdi Soltanolkotabi, and Ben Recht. Low-rank solutions of linear matrix equations via Procrustes flow. In Proceedings of the 33rd International Conference on Machine Learning, ICML 2016, New York City, NY, USA, June 19-24, 2016, pages 964-973, 2016. URL http://jmlr.org/proceedings/papers/v48/tu16.html.

Bart Vandereycken.
Low-rank matrix completion by Riemannian optimization. SIAM Journal on Optimization, 23(2):1214-1236, 2013. doi: 10.1137/110845768. URL http://dx.doi.org/10.1137/110845768.

Roman Vershynin. Introduction to the non-asymptotic analysis of random matrices. In Yonina C. Eldar and Gitta Kutyniok, editors, Compressed Sensing: Theory and Practice, pages 210-268. Cambridge University Press, 2012. ISBN 9780511794308. doi: 10.1017/CBO9780511794308.006.

Lingxiao Wang, Xiao Zhang, and Quanquan Gu. A Unified Computational and Statistical Framework for Nonconvex Low-rank Matrix Estimation. In Aarti Singh and Jerry Zhu, editors, Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, volume 54 of Proceedings of Machine Learning Research, pages 981-990, Fort Lauderdale, FL, USA, 20-22 Apr 2017. PMLR. URL http://proceedings.mlr.press/v54/wang17b.html.

Ke Wei, Jian-Feng Cai, Tony F. Chan, and Shingyu Leung. Guarantees of Riemannian optimization for low rank matrix recovery. SIAM Journal on Matrix Analysis and Applications, 37(3):1198-1222, 2016. doi: 10.1137/15M1050525. URL https://doi.org/10.1137/15M1050525.

Xinyang Yi, Dohyung Park, Yudong Chen, and Constantine Caramanis. Fast algorithms for robust PCA via gradient descent. In Advances in Neural Information Processing Systems 29: Annual Conference on Neural Information Processing Systems 2016, December 5-10, 2016, Barcelona, Spain, pages 4152-4160, 2016. URL http://papers.nips.cc/paper/6445-fast-algorithms-for-robust-pca-via-gradient-descent.