# Optimal Mean Robust Principal Component Analysis

Feiping Nie (FEIPINGNIE@GMAIL.COM), Jianjun Yuan (WRIYJJ@GMAIL.COM), Heng Huang (HENG@UTA.EDU)
Computer Science and Engineering Department, University of Texas at Arlington, Arlington, TX 76019

Proceedings of the 31st International Conference on Machine Learning, Beijing, China, 2014. JMLR: W&CP volume 32. Copyright 2014 by the author(s).

Abstract

Principal Component Analysis (PCA) is the most widely used unsupervised dimensionality reduction approach. In recent research, several robust PCA algorithms were presented to enhance the robustness of the PCA model. However, the existing robust PCA methods incorrectly center the data using the ℓ2-norm distance to calculate the mean, which is not the optimal mean given the ℓ1-norm used in their objective functions. In this paper, we propose novel robust PCA objective functions that remove the optimal mean automatically. Both theoretical analysis and empirical studies demonstrate that our new methods reduce data dimensionality more effectively than previous robust PCA methods.

1. Introduction

Machine learning techniques have been widely applied to scientific domains as diverse as engineering, astronomy, biology, remote sensing, and economics. The dimensionality of scientific data can reach into the thousands, as with digital images and videos, gene expressions and DNA copy numbers, documents, and financial time series. As a result, data analysis on such data sets suffers from the curse of dimensionality. To address this problem, dimensionality reduction (subspace learning) algorithms have been proposed to project the original high-dimensional feature space to a low-dimensional space in which the important statistical properties are well preserved. Among these methods, unsupervised dimensionality reduction methods are the most useful in practical applications, since labeled data are usually expensive to obtain and we often have no prior knowledge for new scientific problems. Thus, in this work, we focus on unsupervised dimensionality reduction.

Although PCA (Jolliffe, 2002) is the most popular method, it is sensitive to data outliers because of its squared ℓ2-norm based objective function. In real-world applications, data outliers appear frequently, so PCA may not achieve optimal performance. To address this problem, multiple robust PCA methods have been presented, such as the rotational invariant L1 PCA (R1PCA) (Ding et al., 2006) and convex robust PCA (Wright et al., 2009; Xu et al., 2012). R1PCA minimizes the ℓ2,1-norm reconstruction error by imposing the ℓ2-norm on the feature dimension and the ℓ1-norm on the data-point dimension, such that the effect of data outliers is reduced by the ℓ1-norm. This idea was later extended to robust tensor factorization (Huang & Ding, 2009). The convex robust PCA methods use convex relaxation objectives, so global solutions can be achieved.

However, all existing robust PCA methods neglect the mean calculation problem. Because the ℓ1-norm or ℓ2,1-norm is used in the robust PCA objectives, the squared ℓ2-norm distance based mean is no longer the correct mean. In this paper, we propose novel robust PCA objective functions that remove the optimal mean automatically. We first show that the Euclidean distance based mean is valid only for traditional PCA.
In robust PCA formulations, the ℓ1-norm or ℓ2,1-norm is used as the loss function, such that the Euclidean distance based mean is no longer the correct one. Starting from two widely used robust PCA formulations, we propose corresponding optimal mean robust PCA objectives. We integrate the mean calculation into the dimensionality reduction optimization objective, such that the optimal mean can be obtained to enhance the dimensionality reduction. Optimization algorithms are derived to solve the proposed non-smooth objectives, with convergence analysis. Both theoretical analysis and empirical studies demonstrate that our new methods reduce data dimensionality more effectively than previous robust PCA methods.

Notations: For a vector v, we denote the ℓ2-norm of v by \|v\|_2 = \sqrt{v^T v}. For a matrix M, we denote the (i,j)-th element by m_{ij}, the i-th column by m_i, and the i-th row by m^i. Denote \|M\|_F^2 = \mathrm{Tr}(M^T M), where \mathrm{Tr}(\cdot) is the trace operator for a square matrix, and denote \|M\|_* = \mathrm{Tr}((M^T M)^{1/2}) as the trace norm (nuclear norm). In this paper, we denote \|M\|_{2,1} = \sum_i \|m_i\|_2 as the ℓ2,1-norm of matrix M, denote \|M\|_1 = \sum_{i,j} |m_{ij}| as the ℓ1-norm of matrix M, and denote \|M\|_2 = \sigma_{\max}(M) as the ℓ2-norm of matrix M, where \sigma_{\max}(M) is the largest singular value of M.

2. Principal Component Analysis Revisited

Given a data matrix X = [x_1, x_2, ..., x_n] \in R^{d \times n}, where d is the dimensionality of the data points and n is the number of data points, Principal Component Analysis (PCA) essentially finds a low-rank matrix that best approximates the given data matrix in the sense of Euclidean distance. Suppose the data are centered, i.e., the mean of the data is zero. Then PCA solves the following problem:

  \min_{\mathrm{rank}(Z)=k} \|X - Z\|_F^2 .   (1)

According to the full rank decomposition, any matrix Z \in R^{d \times n} with rank k can be decomposed as Z = UV^T, where U \in R^{d \times k}, V \in R^{n \times k}. Denoting an identity matrix by I, problem (1) can be rewritten as

  \min_{U \in R^{d \times k}, V \in R^{n \times k}, U^T U = I} \|X - UV^T\|_F^2 .   (2)

Taking the derivative w.r.t. V and setting it to zero, we have V = X^T U. As a result, problem (2) becomes

  \max_{U \in R^{d \times k}, U^T U = I} \mathrm{Tr}(U^T X X^T U) .   (3)

The columns of the optimal solution U to problem (3) are the k eigenvectors of XX^T corresponding to the k largest eigenvalues.

In the above derivation, we assume the mean of the data is zero. In the general case where the mean of the data is not zero, PCA best approximates the given data matrix with an optimal mean removed. Denoting by 1 a column vector with all elements equal to one, the PCA problem becomes

  \min_{b, \mathrm{rank}(Z)=k} \|X - b1^T - Z\|_F^2 .   (4)

Note that b \in R^{d \times 1} in problem (4) is also a variable to be optimized. Similarly, problem (4) can be rewritten as

  \min_{b, U \in R^{d \times k}, V \in R^{n \times k}, U^T U = I} \|X - b1^T - UV^T\|_F^2 .   (5)

Taking the derivative w.r.t. V and setting it to zero, we have V = (X - b1^T)^T U. Then problem (5) becomes

  \min_{b, U \in R^{d \times k}, U^T U = I} \|X - b1^T - UU^T(X - b1^T)\|_F^2 .   (6)

Taking the derivative w.r.t. b and setting it to zero, we have (I - UU^T)(b1^T - X)1 = 0. Suppose U_\perp is the orthogonal complement of U, i.e., [U, U_\perp] is an orthonormal matrix. Then the vector (b1^T - X)1 can be represented as

  (b1^T - X)1 = U\alpha + U_\perp\beta .   (7)

So we have (I - UU^T)(U\alpha + U_\perp\beta) = 0 \Rightarrow U_\perp\beta = 0 \Rightarrow \beta = 0. Then Eq. (7) becomes

  b = \frac{1}{n}(X1 + U\alpha) ,   (8)

where \alpha could be any k-dimensional column vector. Denote the centering matrix H = I - \frac{1}{n}11^T. Substituting Eq. (8) into problem (6), we have

  \max_{U \in R^{d \times k}, U^T U = I} \mathrm{Tr}(U^T X H X^T U) .   (9)

We can see that whether or not the data X are centered, problem (9) is unchanged. We can simply set \alpha = 0 in Eq. (8), so the optimal mean in problem (4) is b = \frac{1}{n}X1. That is to say, in traditional PCA, we can simply center the data such that X1 = 0, and then solve Eq. (3) instead of Eq. (9).
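To make Eq. (9) and the optimal mean b = \frac{1}{n}X1 concrete, here is a minimal NumPy sketch (our own illustration; the function and variable names are not from the paper):

```python
import numpy as np

def pca_optimal_mean(X, k):
    """Solve problem (9): columns of U are the top-k eigenvectors of X H X^T.

    X: d x n data matrix (columns are data points), k: target dimension.
    Returns (U, b) with U in R^{d x k} and the Euclidean mean b = (1/n) X 1.
    """
    b = X.mean(axis=1, keepdims=True)           # b = (1/n) X 1
    Xc = X - b                                  # equivalent to X H with H = I - (1/n) 1 1^T
    # Top-k eigenvectors of Xc Xc^T, obtained from the thin SVD of Xc.
    U, _, _ = np.linalg.svd(Xc, full_matrices=False)
    return U[:, :k], b

# Usage: rank-k reconstruction b 1^T + U U^T (X - b 1^T).
rng = np.random.default_rng(0)
X = rng.standard_normal((20, 100))
U, b = pca_optimal_mean(X, k=5)
X_rec = U @ (U.T @ (X - b)) + b
print(np.linalg.norm(X - X_rec, 'fro'))
```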
3. Robust Principal Component Analysis with Optimal Mean

Problem (4) of traditional PCA can be written as \min_{b, \mathrm{rank}(Z)=k} \sum_{i=1}^n \|x_i - b - z_i\|_2^2. It is known that the squared loss function is very sensitive to outliers, so the squared approximation errors make traditional PCA not robust to outliers in the data. Previous robust PCA methods use a non-squared loss function to improve robustness, but still center the data via the ℓ2-norm distance based mean, which is incorrect.

Instead of using the ℓ2-norm distance based mean, we propose a new robust PCA in which an optimal mean is automatically removed from the given data. Our new robust PCA solves the following problem:

  \min_{b, \mathrm{rank}(Z)=k} \sum_{i=1}^n \|x_i - b - z_i\|_2 ,   (10)

where we optimize the mean within the robust PCA objective. Problem (10) can be written in matrix form as

  \min_{b, \mathrm{rank}(Z)=k} \|X - b1^T - Z\|_{2,1} .   (11)

Similarly, problem (11) can be rewritten as

  \min_{b, U \in R^{d \times k}, V \in R^{n \times k}, U^T U = I} \sum_{i=1}^n \|x_i - b - U(v^i)^T\|_2 .   (12)

For each i, setting the derivative w.r.t. v^i to zero gives v^i = (x_i - b)^T U. Substituting it into Eq. (12), we arrive at the following problem (in practice, similar to (Nie et al., 2010), \|z\|_2 is replaced by \sqrt{z^T z + \varepsilon} with \varepsilon \to 0):

  \min_{b, U \in R^{d \times k}, U^T U = I} \sum_{i=1}^n \|(I - UU^T)(x_i - b)\|_2 .   (13)

In this paper, we use an iteratively re-weighted method to solve problem (13). The detailed procedure is outlined in Algorithm 1, and its theoretical analysis is given at the end of this section. In each iteration, we solve the following problem:

  \min_{b, U \in R^{d \times k}, U^T U = I} \sum_{i=1}^n d_{ii} \|(I - UU^T)(x_i - b)\|_2^2 ,   (14)

where the weights d_{ii} are calculated as in Algorithm 1. Taking the derivative w.r.t. b and setting it to zero, we have (I - UU^T)(b1^T - X)D1 = 0. As before, we can let (b1^T - X)D1 = U\alpha + U_\perp\beta and get \beta = 0, so (b1^T - X)D1 = U\alpha and thus

  b = \frac{XD1}{1^T D1} + \frac{U\alpha}{1^T D1} ,   (15)

where \alpha could be any k-dimensional column vector. Substituting Eq. (15) into problem (14), the problem becomes

  \max_{U \in R^{d \times k}, U^T U = I} \mathrm{Tr}(U^T X H_d X^T U) ,   (16)

where H_d = D - \frac{D11^T D}{1^T D1} is the weighted centering matrix. The columns of the optimal solution U to problem (16) are the k eigenvectors of X H_d X^T corresponding to the k largest eigenvalues. When the dimensionality is larger than the number of data points, i.e., d > n, problem (16) can be solved efficiently via the SVD of the matrix X\left(D^{1/2} - \frac{D11^T D^{1/2}}{1^T D1}\right).

Algorithm 1: Algorithm to solve problem (13).
  Initialize D as an identity matrix.
  while not converged do
    1. Update the columns of U by the k left singular vectors of X\left(D^{1/2} - \frac{D11^T D^{1/2}}{1^T D1}\right) corresponding to the k largest singular values.
    2. Update b by b = \frac{XD1}{1^T D1}.
    3. Update the diagonal matrix D, whose i-th diagonal element is d_{ii} = \frac{1}{2\|(I - UU^T)(x_i - b)\|_2}.
  end while
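Before analyzing its convergence, here is a minimal NumPy sketch of how Algorithm 1 might be implemented (our own reading of the steps; all names are ours, and the \varepsilon-smoothing of the denominators follows the remark above):

```python
import numpy as np

def rpca_om(X, k, n_iter=50, eps=1e-8):
    """Iteratively re-weighted solver for problem (13), a sketch of Algorithm 1.

    X: d x n data matrix (columns are data points), k: subspace dimension.
    Returns (U, b): projection basis and the optimal mean under the l2,1 loss.
    """
    d, n = X.shape
    Ddiag = np.ones(n)                              # diagonal of D, initialized to the identity
    for _ in range(n_iter):
        s = Ddiag.sum()                             # 1^T D 1
        # Step 1: U <- top-k left singular vectors of X (D^{1/2} - D 1 1^T D^{1/2} / (1^T D 1)).
        M = X * np.sqrt(Ddiag) - (X @ Ddiag)[:, None] @ np.sqrt(Ddiag)[None, :] / s
        U = np.linalg.svd(M, full_matrices=False)[0][:, :k]
        # Step 2: b <- X D 1 / (1^T D 1), the weighted mean.
        b = (X @ Ddiag)[:, None] / s
        # Step 3: re-weight, d_ii = 1 / (2 ||(I - U U^T)(x_i - b)||_2), smoothed by eps.
        R = (X - b) - U @ (U.T @ (X - b))           # residuals (I - U U^T)(x_i - b)
        Ddiag = 1.0 / (2.0 * np.sqrt((R ** 2).sum(axis=0) + eps))
    return U, b
```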
3.1. Theoretical Analysis of Algorithm 1

Theorem 1. Algorithm 1 monotonically decreases the objective of problem (13) in each iteration until it converges.

Proof: In Steps 1 and 2 of Algorithm 1, denote the updated U and b by \tilde{U} and \tilde{b}, respectively. Since \tilde{U} and \tilde{b} are the optimal solutions of problem (14), and according to the definition of d_{ii} in Step 3 of Algorithm 1, we have

  \sum_i \frac{\|(I - \tilde{U}\tilde{U}^T)(x_i - \tilde{b})\|_2^2}{2\|(I - UU^T)(x_i - b)\|_2} \le \sum_i \frac{\|(I - UU^T)(x_i - b)\|_2^2}{2\|(I - UU^T)(x_i - b)\|_2} .   (18)

According to Lemma 1 in (Nie et al., 2010), we know

  \sum_i \left( \|(I - \tilde{U}\tilde{U}^T)(x_i - \tilde{b})\|_2 - \frac{\|(I - \tilde{U}\tilde{U}^T)(x_i - \tilde{b})\|_2^2}{2\|(I - UU^T)(x_i - b)\|_2} \right) \le \sum_i \left( \|(I - UU^T)(x_i - b)\|_2 - \frac{\|(I - UU^T)(x_i - b)\|_2^2}{2\|(I - UU^T)(x_i - b)\|_2} \right) .   (19)

Summing Eq. (18) and Eq. (19) on both sides, we have

  \sum_i \|(I - \tilde{U}\tilde{U}^T)(x_i - \tilde{b})\|_2 \le \sum_i \|(I - UU^T)(x_i - b)\|_2 .   (20)

Since problem (13) has an obvious lower bound of 0, Algorithm 1 converges. The equality in Eq. (20) holds only when Algorithm 1 has converged. Thus, in each iteration, Algorithm 1 monotonically decreases the objective of problem (13) until it converges.

Theorem 2. Algorithm 1 converges to a local minimum of problem (13).

Proof: The Lagrangian function of problem (13) is

  L_1(U, b, \Lambda) = \sum_i \|(I - UU^T)(x_i - b)\|_2 - \mathrm{Tr}((U^T U - I)\Lambda) .   (21)

Taking the derivatives w.r.t. U and b, respectively, and setting them to zero, we get the KKT conditions of problem (13):

  \frac{\partial \sum_i \|(I - UU^T)(x_i - b)\|_2}{\partial U} - U\Lambda = 0 , \qquad \frac{\partial \sum_i \|(I - UU^T)(x_i - b)\|_2}{\partial b} = 0 .   (22)

Using matrix calculus, Eq. (22) can be written as

  \sum_i \frac{(I - UU^T)(x_i - b)(b - x_i)^T U}{\|(I - UU^T)(x_i - b)\|_2} - U\Lambda = 0 , \qquad \sum_i \frac{(I - UU^T)(b - x_i)}{\|(I - UU^T)(x_i - b)\|_2} = 0 .   (23)

In each iteration of Algorithm 1, we find the optimal solution to problem (14), so the converged solution of Algorithm 1 satisfies the KKT conditions of problem (14). The Lagrangian function of problem (14) is

  L_2(U, b, \Lambda) = \sum_i d_{ii} \|(I - UU^T)(x_i - b)\|_2^2 - \mathrm{Tr}((U^T U - I)\Lambda) .   (24)

Taking the derivatives w.r.t. U and b, respectively, and setting them to zero, we get the KKT conditions of problem (14):

  \frac{\partial \sum_i d_{ii}\|(I - UU^T)(x_i - b)\|_2^2}{\partial U} - U\Lambda = 0 , \qquad \frac{\partial \sum_i d_{ii}\|(I - UU^T)(x_i - b)\|_2^2}{\partial b} = 0 .   (25)

Similarly, using matrix calculus, Eq. (25) can be written as

  \sum_i 2 d_{ii} (I - UU^T)(x_i - b)(b - x_i)^T U = U\Lambda , \qquad \sum_i 2 d_{ii} (I - UU^T)(b - x_i) = 0 .   (26)

According to the definition of d_{ii} in Algorithm 1, Eq. (26) is the same as Eq. (23) when Algorithm 1 has converged. Therefore, the converged solution of Algorithm 1 satisfies Eq. (23), the KKT condition of problem (13), and is thus a local minimum of problem (13).

3.2. Extension to a General Algorithm

It is worth noting that the content in this section is parallel work with (Nie et al., 2010). (The motivation for deriving this kind of algorithm is similar to the motivation of Eq. (21) in (Nie et al., 2009) for solving the trace ratio problem, which is a very useful trick for algorithm derivation.) Later, we found that the re-weighted algorithm applied in Algorithm 1 can be used to solve the following more general problem:

  \min_{x \in C} f(x) + \sum_i h_i(g_i(x)) ,   (27)

where h_i(\cdot) is an arbitrary concave function on the domain of g_i(x). The procedure for solving problem (27) is described in Algorithm 2, where h'_i(g_i(x)) denotes any supergradient of the concave function h_i at the point g_i(x).

Algorithm 2: Algorithm to solve problem (27).
  Initialize x \in C.
  while not converged do
    1. For each i, calculate D_i = h'_i(g_i(x)).
    2. Update x by the optimal solution to the problem \min_{x \in C} f(x) + \sum_i \mathrm{Tr}(D_i^T g_i(x)).
  end while

According to the definition of the supergradient, it can easily be proved that Algorithm 2 converges, and the converged solution satisfies the KKT conditions of problem (27); thus Algorithm 2 converges to a local optimum of problem (27). Empirical evidence shows that Algorithm 2 converges very fast, usually within 20 iterations.

Algorithm 2 is easy to implement and powerful. For example, it can be used to minimize the ℓp-norm, the ℓ2,p-norm, the Schatten ℓp-norm (Nie et al., 2012), and many robust loss functions such as the capped (truncated) ℓp-norm, where 0 < p \le 2.
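As a small illustration of Algorithm 2 (our own sketch, not from the paper), take f(x) = 0, g_i(b) = \|x_i - b\|_2^2 and the concave h_i(t) = \sqrt{t}, so that problem (27) becomes \min_b \sum_i \|x_i - b\|_2. A supergradient is h'_i(t) = 1/(2\sqrt{t}), and Step 2 becomes a weighted least-squares problem with the closed-form solution b = \sum_i d_i x_i / \sum_i d_i:

```python
import numpy as np

def algorithm2_robust_mean(X, n_iter=50, eps=1e-10):
    """Algorithm 2 specialized to min_b sum_i ||x_i - b||_2 with h(t) = sqrt(t), g_i(b) = ||x_i - b||_2^2.

    X: d x n matrix whose columns are the data points. Returns the robust mean b.
    """
    b = X.mean(axis=1, keepdims=True)                     # start from the ordinary l2 mean
    for _ in range(n_iter):
        # Step 1: supergradient weights D_i = h'(g_i(b)) = 1 / (2 ||x_i - b||_2), smoothed by eps.
        w = 1.0 / (2.0 * np.sqrt(((X - b) ** 2).sum(axis=0) + eps))
        # Step 2: minimize sum_i w_i ||x_i - b||_2^2, i.e. take the weighted mean.
        b = (X @ w)[:, None] / w.sum()
    return b
```

This recovers a classical re-weighting iteration for the geometric median, i.e., the kind of optimal mean under the ℓ2,1 loss that motivates problem (13) when no subspace is removed.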
More interestingly, if we want to maximize the objective in Eq. (27), we only require that h_i(\cdot) be an arbitrary convex function on the domain of g_i(x). In this case, h'_i(g_i(x)) in the first step of Algorithm 2 is changed to any subgradient of h_i at the point g_i(x), and the "min" in the second step is changed to "max". Therefore, it can be verified that Algorithm 2 can also be used to maximize the ℓp-norm (Nie et al., 2011), the ℓ2,p-norm, and the Schatten ℓp-norm, where p \ge 1.

4. Optimal Mean Robust PCA with Convex Relaxation

Recently, a convex relaxed robust PCA was proposed to solve the following problem (Wright et al., 2009):

  \min_Z \|X - Z\|_1 + \gamma\|Z\|_* .   (28)

To better pursue outliers among the data points, a more recent work proposed to solve the following problem (Xu et al., 2012):

  \min_Z \|X - Z\|_{2,1} + \gamma\|Z\|_* .   (29)

However, none of these works takes the optimal mean into account. Although we can center the data such that X1 = 0 before solving problem (29), the mean b = \frac{1}{n}X1 is not necessarily the optimal mean for problem (29), since the ℓ2,1-norm loss function instead of the Frobenius norm loss function is used in the objective.

In this section, we consider the optimal mean for the ℓ2,1-norm loss function and propose to solve the following problem:

  \min_{b, Z} \|X - b1^T - Z\|_{2,1} + \gamma\|Z\|_* .   (30)

We can see that problem (30) is a convex relaxation of problem (11). We use the Augmented Lagrangian Multiplier (ALM) method to solve problem (30). First, problem (30) can be rewritten as

  \min_{b, Z, E:\; X - b1^T - Z = E} \|E\|_{2,1} + \gamma\|Z\|_* .   (31)

With the ALM method, we need to solve the following augmented problem in each iteration:

  \min_{b, Z, E} \|E\|_{2,1} + \gamma\|Z\|_* + \frac{\mu}{2}\left\|X - b1^T - Z - E + \frac{1}{\mu}\Lambda\right\|_F^2 .

Solving this problem jointly over b, Z, E is difficult, so we use an alternating method. The detailed procedure is outlined in Algorithm 3. When E is fixed, we solve problem (32) to update b and Z; when b and Z are fixed, we solve problem (33) to update E.

Algorithm 3: Algorithm to solve problem (30).
  Let 1 < \rho < 2. Initialize \mu = 0.1, E = 0, \Lambda = 0.
  while not converged do
    1. Update b and Z by solving
         \min_{b,Z} \frac{1}{2}\|\tilde{X} - b1^T - Z\|_F^2 + \tilde{\gamma}\|Z\|_* ,   (32)
       where \tilde{X} = X - E + \frac{1}{\mu}\Lambda and \tilde{\gamma} = \frac{\gamma}{\mu}.
    2. Update E by solving
         \min_E \frac{1}{2}\|\tilde{X} - E\|_F^2 + \tilde{\gamma}\|E\|_{2,1} ,   (33)
       where \tilde{X} = X - b1^T - Z + \frac{1}{\mu}\Lambda and \tilde{\gamma} = \frac{1}{\mu}.
    3. Update \Lambda by \Lambda = \Lambda + \mu(X - b1^T - Z - E).
    4. Let \mu = \min(\rho\mu, 10^8).
  end while

Problem (33) is easily solved and has the closed-form solution

  e_i = \frac{(\|\tilde{x}_i\|_2 - \tilde{\gamma})_+}{\|\tilde{x}_i\|_2}\,\tilde{x}_i ,

where (s)_+ is defined as (s)_+ = \max(0, s).
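A minimal sketch of this shrinkage step (our own code; it assumes, as in the notation above, that the ℓ2,1-norm sums the ℓ2-norms of the columns):

```python
import numpy as np

def l21_shrink(X_tilde, gamma):
    """Closed-form solution of problem (33): soft-thresholding of the column l2 norms.

    Each column e_i = max(||x_i||_2 - gamma, 0) / ||x_i||_2 * x_i.
    """
    norms = np.linalg.norm(X_tilde, axis=0)
    scale = np.maximum(norms - gamma, 0.0) / np.maximum(norms, 1e-12)  # guard against division by zero
    return X_tilde * scale
```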
We will see that problem (32) can also be solved in closed form. First, we have the following lemmas.

Lemma 1. Suppose A = USV^T, where the ℓ2-norms of all the columns of U and V are 1 and S is a nonnegative diagonal matrix. Then \|A\|_* \le \mathrm{Tr}\,S.

Proof: Note that \|A\|_* = \max_{X^T X = I, Y^T Y = I} \mathrm{Tr}(X^T A Y), so we have

  \|A\|_* = \max_{X^T X = I, Y^T Y = I} \mathrm{Tr}(X^T U S V^T Y) = \max_{X^T X = I, Y^T Y = I} \sum_j s_j \sum_i x_i^T u_j v_j^T y_i \le \sum_j s_j = \mathrm{Tr}\,S .

The inequality holds since \sum_i x_i^T u_j v_j^T y_i = v_j^T Y X^T u_j \le \sigma_{\max}(YX^T) \le 1, where \sigma_{\max}(YX^T) denotes the largest singular value of the matrix YX^T, which is at most 1 since X^T X = I and Y^T Y = I.

Lemma 2. \|Z\|_* = \min_{Z = AB^T} \frac{1}{2}(\|A\|_F^2 + \|B\|_F^2).

Proof: Suppose Z = AB^T. Then

  \|Z\|_* = \|AB^T\|_* \le \sum_i \|a_i\|_2 \|b_i\|_2 \le \sqrt{\sum_i \|a_i\|_2^2}\sqrt{\sum_i \|b_i\|_2^2} = \|A\|_F\|B\|_F \le \frac{1}{2}(\|A\|_F^2 + \|B\|_F^2) ,

where the second step holds according to Lemma 1 and the third step holds according to the Cauchy-Schwarz inequality. On the other hand, suppose the SVD of Z is Z = U\Sigma V^T, and let A = U\Sigma^{1/2} and B = V\Sigma^{1/2}. Then Z = AB^T and

  \|Z\|_* = \|U\Sigma V^T\|_* = \mathrm{Tr}(\Sigma) = \frac{1}{2}(\|U\Sigma^{1/2}\|_F^2 + \|\Sigma^{1/2}V^T\|_F^2) = \frac{1}{2}(\|A\|_F^2 + \|B\|_F^2) .

Therefore, under the constraint Z = AB^T, the minimum of \frac{1}{2}(\|A\|_F^2 + \|B\|_F^2) is \|Z\|_*.

Lemma 3. If ZH \ne Z, then \|ZH\|_* < \|Z\|_*.

Proof: Suppose \{\hat{A}, \hat{B}\} = \arg\min_{Z = AB^T} \frac{1}{2}(\|A\|_F^2 + \|B\|_F^2). Then according to Lemma 2 we have \|Z\|_* = \frac{1}{2}(\|\hat{A}\|_F^2 + \|\hat{B}\|_F^2) and Z = \hat{A}\hat{B}^T. Thus ZH = \hat{A}(H\hat{B})^T, and according to Lemma 2,

  \|ZH\|_* = \min_{ZH = AB^T} \frac{1}{2}(\|A\|_F^2 + \|B\|_F^2) \le \frac{1}{2}(\|\hat{A}\|_F^2 + \|H\hat{B}\|_F^2) = \frac{1}{2}\left(\|\hat{A}\|_F^2 + \|\hat{B}\|_F^2 - \frac{1}{n}1^T\hat{B}\hat{B}^T1\right) < \frac{1}{2}(\|\hat{A}\|_F^2 + \|\hat{B}\|_F^2) = \|Z\|_* ,   (34)

where the last inequality holds since ZH \ne Z and Z = \hat{A}\hat{B}^T imply \hat{B}^T 1 \ne 0. Therefore, ZH \ne Z \Rightarrow \|ZH\|_* < \|Z\|_*.

Problem (32) can then be solved according to the following theorem.

Theorem 3. The unique optimal solution (\hat{b}, \hat{Z}) to problem (32) is \hat{b} = \frac{1}{n}\tilde{X}1 and \hat{Z} = U(\Sigma - \tilde{\gamma}I)_+V^T, where \tilde{X}H = U\Sigma V^T is the compact Singular Value Decomposition (SVD) of \tilde{X}H and the (i,j)-th element of (M)_+ is defined as \max(0, m_{ij}).

Proof: Taking the derivative of Eq. (32) w.r.t. b and setting it to zero, we have

  b = \frac{1}{n}(\tilde{X} - Z)1 .   (35)

So problem (32) becomes

  \min_Z \frac{1}{2}\|\tilde{X}H - ZH\|_F^2 + \tilde{\gamma}\|Z\|_* .   (36)

First, we verify that \hat{Z} = U(\Sigma - \tilde{\gamma}I)_+V^T satisfies

  0 \in \hat{Z}H - \tilde{X}H + \tilde{\gamma}\,\partial\|\hat{Z}\|_* .   (37)

Denote the compact SVD of \tilde{X}H as \tilde{X}H = U\Sigma V^T = U_1\Sigma_1V_1^T + U_2\Sigma_2V_2^T, where the singular values in \Sigma_1 are all greater than \tilde{\gamma} and the singular values in \Sigma_2 are all smaller than or equal to \tilde{\gamma}. Then \hat{Z} can be written as U_1(\Sigma_1 - \tilde{\gamma}I)V_1^T. On the other hand, we have \tilde{X}H1 = 0 \Rightarrow U\Sigma V^T1 = 0 \Rightarrow V^T1 = 0 \Rightarrow \hat{Z}1 = 0 \Rightarrow \hat{Z}H = \hat{Z}. So we have

  \tilde{X}H - \hat{Z}H = U_1\Sigma_1V_1^T + U_2\Sigma_2V_2^T - U_1(\Sigma_1 - \tilde{\gamma}I)V_1^T = \tilde{\gamma}U_1V_1^T + U_2\Sigma_2V_2^T .   (38)

It is known (Watson, 1992) that the set of subgradients of \|\hat{Z}\|_* is

  \partial\|\hat{Z}\|_* = \{U_1V_1^T + M : U_1^T M = 0,\; MV_1 = 0,\; \|M\|_2 \le 1\} .

So we have U_1V_1^T + \frac{1}{\tilde{\gamma}}U_2\Sigma_2V_2^T \in \partial\|\hat{Z}\|_*, and according to Eq. (38) we have

  \tilde{X}H - \hat{Z}H \in \tilde{\gamma}\,\partial\|\hat{Z}\|_* .   (39)

Therefore, \hat{Z} = U(\Sigma - \tilde{\gamma}I)_+V^T satisfies Eq. (37). Since problem (32) is convex, \hat{Z} = U(\Sigma - \tilde{\gamma}I)_+V^T is one of its optimal solutions.

Unlike problem (4), which has many optimal solutions, we further show that the solution (\hat{b}, \hat{Z}) is the unique optimal solution to problem (32). According to Lemma 3, the optimal solution \hat{Z} to problem (36) must satisfy \hat{Z}H = \hat{Z}; thus the optimal \hat{b} is \hat{b} = \frac{1}{n}\tilde{X}1 according to Eq. (35), and problem (36) is equivalent to

  \min_Z \frac{1}{2}\|\tilde{X}H - Z\|_F^2 + \tilde{\gamma}\|Z\|_* .   (40)

Since problem (40) is strictly convex, its optimal solution is unique. Therefore, \hat{b} = \frac{1}{n}\tilde{X}1, \hat{Z} = U(\Sigma - \tilde{\gamma}I)_+V^T is the unique optimal solution to problem (32).
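Combining Theorem 3 with the shrinkage step above, one pass of Algorithm 3 could be sketched as follows (our own illustration; `l21_shrink` refers to the sketch given after problem (33), and all other names are ours):

```python
import numpy as np

def solve_32(X_tilde, gamma_tilde):
    """Closed-form solution of problem (32) per Theorem 3: optimal mean plus singular value thresholding."""
    b = X_tilde.mean(axis=1, keepdims=True)                       # b_hat = (1/n) X_tilde 1
    U, s, Vt = np.linalg.svd(X_tilde - b, full_matrices=False)    # compact SVD of X_tilde H
    Z = (U * np.maximum(s - gamma_tilde, 0.0)) @ Vt               # Z_hat = U (Sigma - gamma I)_+ V^T
    return b, Z

def cr_rpca_om(X, gamma, rho=1.5, n_iter=100):
    """A sketch of Algorithm 3 (ALM) for problem (30): min ||X - b 1^T - Z||_{2,1} + gamma ||Z||_*."""
    d, n = X.shape
    E = np.zeros((d, n))
    Lam = np.zeros((d, n))
    mu = 0.1
    for _ in range(n_iter):
        b, Z = solve_32(X - E + Lam / mu, gamma / mu)      # step 1: update b, Z
        E = l21_shrink(X - b - Z + Lam / mu, 1.0 / mu)     # step 2: update E with threshold 1/mu
        Lam = Lam + mu * (X - b - Z - E)                   # step 3: dual update
        mu = min(rho * mu, 1e8)                            # step 4: increase the penalty
    return b, Z, E
```

The (b, Z) step is simply a mean removal followed by singular value thresholding of \tilde{X}H, as Theorem 3 prescribes.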
5. Experimental Results

The main goal of PCA is to reduce the dimensionality such that the reduced features represent and reconstruct the original data as well as possible. In the experiments, we show how well the proposed optimal mean robust PCA methods reconstruct the data compared to previous PCA and robust PCA methods. The compared methods include the original PCA (denoted as PCA), robust PCA with L1 maximization (denoted as L1PCA) (Kwak, 2008), R1PCA (Ding et al., 2006), and robust PCA with convex relaxation, i.e., solving Eq. (29) (denoted as CRPCA) (Xu et al., 2012). The proposed optimal mean robust PCA methods solving Eq. (11) and Eq. (30) are denoted as RPCA-OM and CRPCA-OM, respectively.

5.1. Experimental Setup

In the experiments, we use 12 benchmark image datasets: AT&T (Samaria & Harter, 1994), UMIST (Graham & Allinson, 1998), Yale (face data), Yale B (Georghiades et al., 2001), Palm (Hou et al., 2009), CMU-PIE (Sim & Baker, 2003), FERET (Philips et al., 1998), MSRA, Coil (Nene et al., 1996), JAFFE, MNIST, and AR. We downloaded the image data from different websites. Some of the images were resized by previous work, but this does not affect our evaluation. In each dataset, we randomly select 20% of the images and randomly place an occlusion of 1/4 the image size in each selected image. The reconstruction error is calculated as \sum_i \|x_i^r - x_i^o\|_2, where x_i^o is the original image without occlusion and x_i^r is the reconstructed image.

5.2. Reconstruction Error Comparison for Robust PCA Methods

We first compare the reconstruction error of PCA, L1PCA, R1PCA, and our proposed RPCA-OM method on the 12 benchmark datasets in Table 1. Because these methods share a common reduced dimensionality, their reconstruction errors can be compared under the same dimension. In Table 1, we compare the reconstruction errors under nine different dimensions from 10 to 50; we cannot list more results due to limited space. From Table 1, we can observe that: 1) The robust PCA methods R1PCA and RPCA-OM are consistently better than the other two PCA methods when there are occlusions in the data (except for projected dimension 15 on YALEB), which indicates that these robust PCA methods are effective and robust to outliers and noise in the data. 2) L1PCA is better than PCA in some cases but worse in others; the reason is that L1PCA maximizes the ℓ1-norm rather than minimizing the reconstruction error. 3) Since the optimal mean is considered in the reconstruction error minimization, our RPCA-OM method consistently outperforms the other three methods in most cases.

MNIST (×10^4)
Dimension       10      15      20      25      30      35      40      45      50
PCA             20.964  17.354  15.849  14.866  13.890  13.193  12.559  11.924  10.776
L1PCA           18.255  16.384  15.198  14.282  13.584  12.895  12.256  11.666  11.359
R1PCA           17.911  16.005  14.865  13.895  13.044  12.350  11.719  11.148  10.774
RPCA-OM (our)   17.885  15.974  14.859  13.899  13.037  12.303  11.707  11.064  10.627

JAFFE (×10^4)
Dimension       10      15      20      25      30      35      40      45      50
PCA             34.429  29.642  25.499  24.791  24.020  23.188  22.321  21.066  12.855
L1PCA           17.162  15.852  14.958  14.472  14.052  13.635  13.328  13.021  12.594
R1PCA           16.824  15.567  14.847  14.414  13.973  13.662  13.275  12.993  12.641
RPCA-OM (our)   16.640  15.420  14.716  14.265  13.818  13.526  13.155  12.867  12.500

YALEB (×10^5)
Dimension       10      15      20      25      30      35      40      45      50
PCA             10.451  9.6717  9.0074  8.5250  8.0723  7.7539  7.4673  7.2194  6.8686
L1PCA           9.9809  9.3133  8.7559  8.2912  7.8940  7.5742  7.2877  7.0594  6.8297
R1PCA           10.020  9.3594  8.7214  8.2502  7.8399  7.5261  7.2370  6.9991  6.7686
RPCA-OM (our)   9.9742  9.3290  8.6944  8.2169  7.8154  7.4969  7.2078  6.9719  6.7420

Coil20 (×10^5)
Dimension       10      15      20      25      30      35      40      45      50
PCA             19.632  18.912  17.766  17.092  16.554  16.029  15.528  15.204  14.907
L1PCA           19.694  18.091  17.095  16.446  15.913  15.497  15.124  14.799  14.509
R1PCA           19.569  17.988  16.972  16.326  15.828  15.411  15.032  14.706  14.419
RPCA-OM (our)   19.397  17.905  16.903  16.270  15.766  15.376  14.995  14.667  14.380

MSRA (×10^5)
Dimension       10      15      20      25      30      35      40      45      50
PCA             28.314  27.593  26.326  24.612  23.127  21.938  21.511  20.592  20.039
L1PCA           16.181  15.372  14.765  14.265  13.827  13.476  13.180  12.945  12.737
R1PCA           16.237  15.440  14.786  14.180  13.796  13.462  13.162  12.927  12.728
RPCA-OM (our)   16.112  15.302  14.683  14.068  13.708  13.345  13.074  12.837  12.631

FERET (×10^5)
Dimension       10      15      20      25      30      35      40      45      50
PCA             21.619  20.394  19.543  18.864  18.294  17.753  17.307  16.880  16.518
L1PCA           21.570  20.308  19.470  18.805  18.232  17.726  17.293  16.871  16.536
R1PCA           21.481  20.302  19.435  18.791  18.220  17.673  17.232  16.811  16.442
RPCA-OM (our)   21.430  20.262  19.395  18.763  18.154  17.638  17.200  16.784  16.413
CMU-PIE (×10^3)
Dimension       10      15      20      25      30      35      40      45      50
PCA             7.8659  6.6812  5.8340  5.2123  4.7648  4.3930  4.0672  3.8033  3.5799
L1PCA           7.8727  6.6765  5.8485  5.2148  4.7691  4.3852  4.0694  3.8020  3.5889
R1PCA           7.8431  6.6478  5.8068  5.1777  4.7334  4.3555  4.0319  3.7716  3.5473
RPCA-OM (our)   7.8010  6.6134  5.7799  5.1547  4.7089  4.3315  4.0096  3.7547  3.5345

AT&T (×10^4)
Dimension       10      15      20      25      30      35      40      45      50
PCA             29.333  27.426  26.152  25.119  24.270  23.613  23.002  22.425  21.931
L1PCA           28.948  27.141  26.026  24.980  24.128  23.432  22.873  22.319  21.798
R1PCA           28.877  27.109  25.784  24.857  23.903  23.266  22.742  22.176  21.676
RPCA-OM (our)   28.774  27.036  25.713  24.756  23.814  23.153  22.631  22.064  21.549

UMIST (×10^4)
Dimension       10      15      20      25      30      35      40      45      50
PCA             28.917  27.248  25.966  24.910  24.053  23.330  22.780  22.180  21.698
L1PCA           28.568  27.022  25.765  24.726  23.888  23.243  22.640  22.054  21.628
R1PCA           28.486  26.971  25.709  24.603  23.740  23.050  22.532  21.964  21.448
RPCA-OM (our)   28.383  26.900  25.600  24.498  23.655  22.944  22.424  21.850  21.312

AR
Dimension       10      15      20      25      30      35      40      45      50
PCA             23.293  21.004  19.601  18.428  17.544  16.816  16.164  15.617  15.185
L1PCA           23.298  21.006  19.612  18.499  17.637  16.905  16.225  15.723  15.232
R1PCA           23.032  20.837  19.417  18.251  17.356  16.644  15.998  15.428  14.999
RPCA-OM (our)   22.912  20.767  19.321  18.193  17.279  16.530  15.879  15.306  14.874

YALE (×10^4)
Dimension       10      15      20      25      30      35      40      45      50
PCA             22.119  19.445  17.995  16.940  16.152  15.658  15.139  14.728  14.218
L1PCA           17.959  16.591  15.514  14.872  14.310  13.745  13.308  12.874  12.434
R1PCA           17.742  16.253  15.207  14.526  14.015  13.530  12.997  12.631  12.331
RPCA-OM (our)   17.692  15.150  14.461  14.009  13.507  12.890  12.889  12.502  12.200

PALM (×10^5)
Dimension       10      15      20      25      30      35      40      45      50
PCA             14.703  13.355  12.376  11.596  11.011  10.561  10.119  9.7871  9.4969
L1PCA           14.734  13.373  12.377  11.628  11.033  10.580  10.156  9.7885  9.4881
R1PCA           14.665  13.319  12.300  11.527  10.954  10.461  10.053  9.7027  9.4058
RPCA-OM (our)   14.651  13.300  12.287  11.499  10.935  10.437  10.035  9.6745  9.3826

Table 1. Reconstruction error comparisons of four PCA methods on 12 benchmark datasets using different dimensions. The best reconstruction result under each dimension is bolded.

5.3. Reconstruction Error Comparison for Convex Robust PCA Methods

In CRPCA and our CRPCA-OM methods, the projection dimension cannot be selected; we can only obtain the reconstructed data by adjusting the parameter γ. Thus, we compare these two methods together.
We choose the range of γ based on the suggestion in (Wright et al., 2009), where γ is suggested to be on the scale of m^{1/2} (m being the dimension of the matrix Z). Considering the size of the images used in our experiments, we select γ from 30 to 90. The reconstruction error comparison of the two methods is shown in Fig. 1.

[Figure 1. Reconstruction errors under different γ obtained by CRPCA and our CRPCA-OM methods, with one panel per dataset (e.g., panel (e) shows CMU-PIE).]

From Fig. 1, we can conclude that: 1) As the regularization parameter γ increases, the reconstruction error of both methods increases as well, because the weight placed on the reconstruction error decreases and the algorithm pays less attention to minimizing it. 2) Our CRPCA-OM method is consistently better than the CRPCA approach, because CRPCA-OM takes the optimal mean into account in Eq. (29), which, as discussed previously, is not the Frobenius norm loss function's mean but the ℓ2,1-norm loss function's mean.

6. Conclusions

In this paper, we proposed novel optimal mean robust PCA models that automatically remove the correct data mean. To solve the proposed non-smooth objectives, we derived new optimization algorithms with proved convergence. Both theoretical analysis and empirical results show that our new robust PCA with optimal mean models consistently outperform the existing robust PCA methods.

Acknowledgments

Corresponding Author: Heng Huang (heng@uta.edu). This work was partially supported by US NSF IIS-1117965, IIS-1302675, IIS-1344152.

References

Ding, Chris, Zhou, Ding, He, Xiaofeng, and Zha, Hongyuan. R1-PCA: Rotational invariant L1-norm principal component analysis for robust subspace factorization. Int'l Conf. Machine Learning, 2006.

Face data, Yale. http://cvc.yale.edu/projects/yalefaces/yalefaces.html.

Georghiades, A., Belhumeur, P., and Kriegman, D. From few to many: Illumination cone models for face recognition under variable lighting and pose. IEEE Transactions on PAMI, 23(6):643-660, 2001.

Graham, Daniel B and Allinson, Nigel M. Face recognition: From theory to applications. NATO ASI Series F, Computer and Systems Sciences, 163:446-456, 1998.

Hou, Chenping, Nie, Feiping, Zhang, Changshui, and Wu, Yi. Learning an orthogonal and smooth subspace for image classification. IEEE Signal Process. Lett., 16(4):303-306, 2009.

Huang, Heng and Ding, Chris. Robust tensor factorization using R1 norm. CVPR, 2009.

Jolliffe, I. T. Principal Component Analysis. Springer Series in Statistics, New York: Springer-Verlag, 2nd edition, 2002.

Kwak, N. Principal component analysis based on L1-norm maximization. IEEE Transactions on PAMI, 30(9):1672-1680, 2008. ISSN 0162-8828.

Nene, Sameer, Nayar, Shree K., and Murase, Hiroshi. Columbia object image library (COIL-100). Technical report, 1996.

Nie, Feiping, Xiang, Shiming, Jia, Yangqing, and Zhang, Changshui. Semi-supervised orthogonal discriminant analysis via label propagation. Pattern Recognition, 42(11):2615-2627, 2009.

Nie, Feiping, Huang, Heng, Cai, Xiao, and Ding, Chris. Efficient and robust feature selection via joint ℓ2,1-norms minimization. In NIPS, 2010.

Nie, Feiping, Huang, Heng, Ding, Chris H. Q., Luo, Dijun, and Wang, Hua. Robust principal component analysis with non-greedy L1-norm maximization. In IJCAI, pp. 1433-1438, 2011.

Nie, Feiping, Huang, Heng, and Ding, Chris H. Q. Low-rank matrix recovery via efficient Schatten p-norm minimization. In AAAI, 2012.

Philips, I., Wechsler, H., Huang, J., and Rauss, P. The FERET database and evaluation procedure for face recognition algorithms. Image and Vision Computing, 16:295-306, 1998.
Samaria, Ferdinando and Harter, Andy. Parameterisation of a stochastic model for human face identification, 1994. http://www.cl.cam.ac.uk/research/dtg/attarchive/facedatabase.html.

Sim, T. and Baker, S. The CMU pose, illumination, and expression database. IEEE Transactions on PAMI, 25(12):1615-1617, 2003.

Watson, G. A. Characterization of the subdifferential of some matrix norms. Linear Algebra and its Applications, 170:33-45, 1992.

Wright, J., Ganesh, A., Rao, S., and Ma, Y. Robust principal component analysis: Exact recovery of corrupted low-rank matrices via convex optimization. NIPS, 2009.

Xu, Huan, Caramanis, Constantine, and Sanghavi, Sujay. Robust PCA via outlier pursuit. IEEE Transactions on Information Theory, 58(5):3047-3064, 2012.