Online Robust Low Rank Matrix Recovery

Xiaojie Guo
State Key Laboratory of Information Security, Institute of Information Engineering, Chinese Academy of Sciences
xj.max.guo@gmail.com

Abstract

Low rank matrix recovery has shown its importance as a theoretic foundation in many areas of information processing. Its solutions are usually obtained in batch mode, which requires loading all the data into memory during processing and is therefore hardly applicable to large scale data. Moreover, a fraction of the data may be severely contaminated by outliers, which makes accurate recovery significantly more challenging. This paper proposes a novel online robust low rank matrix recovery method to address these difficulties. In particular, we first introduce an online algorithm to solve the problem of low rank matrix completion. Then we move on to low rank matrix recovery from observations with intensive outliers. The outlier support is robustly estimated from a mixture model perspective. Experiments on both synthetic and real data are conducted to demonstrate the efficacy of our method and show its superior performance over the state of the art.

1 Introduction

Low Rank Matrix Recovery (LRMR) aims to seek a low-dimensional subspace from given data such that the residual between the given data and the recovery is minimized with respect to some optimality criterion [Recht et al., 2010; Chandrasekaran et al., 2011]. Its benefit, besides being a standard tool for dimensionality reduction, has been witnessed by a wide spectrum of low level signal processing applications, such as image denoising [Gu et al., 2014], colorization [Wang and Zhang, 2012], reflection separation [Guo et al., 2014a] and rectification [Zhang et al., 2012], as well as higher level tasks, like collaborative filtering [Jamali and Ester, 2011; Park et al., 2013; Shi et al., 2013; Chen et al., 2014], visual saliency detection [Shen and Wu, 2012], video editing [Guo et al., 2013] and dictionary learning [Mairal et al., 2010].

The quality of LRMR largely depends on the choice of optimality criterion for the task at hand. Principal Component Analysis (PCA) [Pearson, 1901] adopts the $\ell_2$ loss (also known as the squared loss), which assumes that the residual in the observation follows a Gaussian distribution. Although the $\ell_2$ is arguably the most commonly used loss, its performance may sharply degenerate when dealing with data polluted by outliers, which are frequently encountered in real-world applications. To be robust against outliers, the $\ell_0$ loss is the ideal option. Unfortunately, its non-convexity, non-smoothness and difficulty of approximation (NP-hardness) make the $\ell_0$ unsuitable for practical use. Alternatively, its tightest convex surrogate, the $\ell_1$, is employed to achieve robustness; it corresponds to the least absolute deviations technique and is actually optimal for Laplacian noise. In this spirit, Candès et al. proved that their proposed method, namely PCP [Candès et al., 2011], can exactly recover the underlying low rank matrix from the observation even with the fraction of outliers up to about 1/3 under some conditions. In parallel, a few works have been carried out from probabilistic standpoints. Probabilistic Matrix Factorization (PMF) [Salakhutdinov and Mnih, 2008] and Probabilistic Robust Matrix Factorization (PRMF) [Wang et al., 2012] are two representatives, whose residual penalties are equivalent to using the $\ell_2$ (PCA) and the $\ell_1$ (PCP), respectively.
Although the techniques above have made great progress in LRMR, their tolerance to outliers can still be improved. Another issue of concern is the applicability to large-scale data. Conventional solvers of LRMR, like PCA [Pearson, 1901], PCP [Candès et al., 2011], PMF [Salakhutdinov and Mnih, 2008] and PRMF [Wang et al., 2012], process data in batch mode, which is both computationally expensive and memory demanding. Moreover, they cannot flexibly incorporate new samples incrementally. To overcome the shortcomings of batch mode, several online strategies have been developed recently. GROUSE [Balzano et al., 2011], an online extension of PCA, is derived from incremental gradient descent on the Grassmannian manifold of subspaces. GRASTA [He et al., 2012] follows the same technical line as GROUSE with further consideration of outliers, and can be seen as an online version of PCP. Wang et al. modified PRMF into Online PRMF for sequentially processing samples [Wang et al., 2012]. Online Robust PCA (OR-PCA) [Feng et al., 2013] has been proven to converge asymptotically to the same optimal solution as PCP [Candès et al., 2011]. The methods mentioned above indeed endow the traditional schemes with scalability, but room for improvement remains, especially in terms of robustness to heavier corruption.

This work proposes an online technique to robustly recover the low rank component from observations with heavy outliers. Please notice that if the outlier support is given or precisely estimated, the problem reduces to a completion task that is much easier to conquer. Specifically, we first develop an online algorithm to solve low rank matrix completion. Then, we step forward to a more challenging problem, namely online low rank matrix recovery with robust outlier support estimation. As regards residual modeling, we employ a Gaussian component to fit the noise and a uniformly distributed component to host the outliers. With this strategy, we are able to handle more severely contaminated data. Extensive experiments on both synthetic and real data are conducted to verify the effectiveness of the proposed technique and demonstrate its advantages over the state of the art.

2 Our Method

Suppose we are given a finite set of data vectors $Y = [y_1, ..., y_n] \in \mathbb{R}^{m \times n}$. It is helpful to consider that low rank matrices can be factorized as $Y = UV^T$, where $U \in \mathbb{R}^{m \times r}$ and $V \in \mathbb{R}^{n \times r}$. In this way, the rank of the recovered matrix is guaranteed never to exceed $r$. This work formally concentrates on the following general problem:

$$\min_{U,V} \|W \odot (Y - UV^T)\|_F^2. \qquad (1)$$

To prevent $U$ from taking arbitrarily large values, the common practice is to constrain the $\ell_2$ norm of each of its columns to be at most one. For the rest of the paper, the discussion is always under this constraint. When $W = [w_1, ..., w_n] \in \{0, 1\}^{m \times n}$ indicates which elements of $Y$ are observed, the associated problem is designated as Low Rank Matrix Completion (LRMC). The difficulty, however, greatly increases when the data are corrupted by outliers whose support is unknown; we call this problem Robust Low Rank Matrix Recovery (RLRMR). In the next subsections, we first develop an online algorithm based on stochastic optimization to handle LRMC, and then design a robust outlier support estimation scheme for solving RLRMR in an online fashion.
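To fix ideas, the following is a minimal NumPy sketch of the masked objective in (1); it is our own illustration rather than the authors' code, and all sizes and variable names are assumptions.

```python
import numpy as np

# Minimal sketch of the masked objective (1): ||W .* (Y - U V^T)||_F^2.
rng = np.random.default_rng(0)
m, n, r = 50, 200, 5
U = rng.standard_normal((m, r))
V = rng.standard_normal((n, r))
Y = U @ V.T + 0.01 * rng.standard_normal((m, n))   # noisy low rank observation
W = (rng.random((m, n)) < 0.8).astype(float)       # 1 = observed entry, 0 = missing

def masked_objective(Y, W, U, V):
    """Squared Frobenius norm of the residual restricted to observed entries."""
    R = W * (Y - U @ V.T)
    return float(np.sum(R ** 2))

print(masked_objective(Y, W, U, V))
```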
2.1 Online Low Rank Matrix Completion

With the data matrix $Y$ and its corresponding support $W$, solving problem (1) is equivalent to minimizing the following empirical cost function:

$$F_n(U) \triangleq \frac{1}{n}\sum_{i=1}^{n} f(y_i, w_i, U), \qquad (2)$$

where the loss function for each sample is defined as:

$$f(y_i, w_i, U) \triangleq \min_{v_i} \|w_i \odot (y_i - U v_i)\|_2^2. \qquad (3)$$

As pointed out by Bottou and Bousquet [Bottou and Bousquet, 2008], one may be interested in minimizing the expected cost instead of the empirical cost, as the empirical cost is only an approximation of the expected one. The corresponding expected cost function is:

$$F(U) \triangleq \mathbb{E}_{y,w}\big[f(y, w, U)\big] = \lim_{n \to \infty} F_n(U) \ \text{a.s.}, \qquad (4)$$

where a.s. means almost sure convergence, and the expectation is taken with respect to the distribution of the samples and the known supports. However, classical batch optimization methods become impractical in terms of speed and/or memory requirements. As a consequence, online techniques are desired. In the following, we first establish an online algorithm to minimize the empirical cost function (2). At the end of this subsection, we show that the proposed algorithm converges to the solution of the expected cost function (4) asymptotically.

When a new sample $y_t$ arrives, two things have to be dealt with, i.e., computing $v_t$ and updating $U^{(t)}$. With the help of the alternating minimization strategy, we first focus on $v_t$, which is associated with the following optimization problem with $U^{(t-1)}$ (obtained from the previous iteration) fixed:

$$v_t = \arg\min_{v_t} \|w_t \odot (y_t - U^{(t-1)} v_t)\|_2^2 = \arg\min_{v_t} \|D_t y_t - D_t U^{(t-1)} v_t\|_2^2, \qquad (5)$$

where $D_t \triangleq \mathrm{Diag}(w_t)$. As can be seen from (5), this is a classic least squares problem and has a closed-form solution:

$$v_t = (D_t U^{(t-1)})^{\dagger} D_t y_t, \qquad (6)$$

where $(D_t U^{(t-1)})^{\dagger}$ denotes the Moore-Penrose pseudo-inverse of $D_t U^{(t-1)}$.

For updating $U$ at the $t$th time instance, i.e. $U^{(t)}$, we minimize the following objective function:

$$S_t(U) \triangleq \frac{1}{t}\sum_{i=1}^{t} \|D_i y_i - D_i U v_i\|_2^2. \qquad (7)$$

We can observe that $S_t(U)$ is quadratic in $U$ and aggregates the past data through a few sufficient statistics computed from the $v_i$. It is obvious that $S_t(U)$ upper-bounds $F_t(U)$ defined in (2). Similar to [Mairal et al., 2010], we can adopt $U^{(t-1)}$ as the warm restart for calculating $U^{(t)}$, because $S_{t-1}(U)$ and $S_t(U)$ become closer as $t$ gets larger, and so do $U^{(t-1)}$ and $U^{(t)}$. This inspires us to employ block-coordinate descent with warm restarts [Bertsekas, 1999] to update $U$ efficiently. Moreover, since some entries in the samples may be missing, we handle different rows of $U$ separately based on the visibility of the corresponding elements, for the sake of information balance. In other words, if an element in a sample is invisible, it should contribute nothing to the update of the corresponding row of $U$. Derived from (7), for each row $U_j$ we have:

$$S_t^j(U_j) \triangleq \frac{1}{t_j}\sum_{i_j=1}^{t_j} \big([y_{i_j}]_j - U_j v_{i_j}\big)^2, \qquad (8)$$

where $[y_{i_j}]_j$ denotes the $j$th entry of the vector $y_{i_j}$, $t_j$ represents the number of times that the $j$th row of the samples has been observed, and $i_j$ is the corresponding refreshed observation index. The details of solving problem (7) are provided in Algorithm 1. Please note that $A_j^{(t)} \triangleq \sum_{i_j=1}^{t_j} v_{i_j} v_{i_j}^T$ and $b_j^{(t)} \triangleq \sum_{i_j=1}^{t_j} [y_{i_j}]_j v_{i_j}$. For clarity, we summarize the whole procedure for solving problem (2) in Algorithm 2.
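Before the pseudocode, here is a minimal sketch of the coefficient step (5)-(6); it is our own illustration under the stated assumption that $D_t U^{(t-1)}$ has full column rank, not the authors' implementation.

```python
import numpy as np

def coefficient_step(y_t, w_t, U_prev):
    """Return v_t = (D_t U^{(t-1)})^+ D_t y_t, with D_t = Diag(w_t), as in (6)."""
    D = np.diag(w_t)                           # observation mask as a diagonal matrix
    return np.linalg.pinv(D @ U_prev) @ (D @ y_t)

rng = np.random.default_rng(0)
m, r = 30, 4
U_prev = rng.standard_normal((m, r))
y_t = rng.standard_normal(m)
w_t = (rng.random(m) < 0.7).astype(float)      # 1 = observed entry
v_t = coefficient_step(y_t, w_t, U_prev)
print(v_t.shape)                               # -> (4,)
```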
Algorithm 1: The Basis Update
Input: $U = [u_1, ..., u_r] \in \mathbb{R}^{m \times r}$; for every $k \in \{1, ..., m\}$, $A_k = [a_{k1}, ..., a_{kr}] \in \mathbb{R}^{r \times r}$ and $b_k \in \mathbb{R}^r$.
for j = 1 to r do
  for k = 1 to m do
    $[u_j]_k \leftarrow \frac{1}{[a_{kj}]_j}\big([b_k]_j - U_k a_{kj}\big) + [u_j]_k$;
  end
  $u_j \leftarrow \frac{u_j}{\max(\|u_j\|_2, 1)}$;
end
Output: Newly updated $U$

Algorithm 2: Online Low Rank Matrix Completion
Input: The observed data revealed sequentially $\{y_1, ..., y_T\}$, the corresponding supports $\{w_1, ..., w_T\}$, and $U^{(0)} \in \mathbb{R}^{m \times r}$.
For every $k \in \{1, ..., m\}$: $A_k^{(0)} \leftarrow 0 \in \mathbb{R}^{r \times r}$; $b_k^{(0)} \leftarrow 0 \in \mathbb{R}^r$.
for t = 1 to T do
  $v_t \leftarrow \arg\min_{v_t} \|w_t \odot (y_t - U^{(t-1)} v_t)\|_2^2$;
  for k = 1 to m do
    $A_k^{(t)} \leftarrow A_k^{(t-1)} + [w_t]_k\, v_t v_t^T$;
    $b_k^{(t)} \leftarrow b_k^{(t-1)} + \big([w_t]_k [y_t]_k\big) v_t$;
  end
  Update $U^{(t)}$ using $U^{(t-1)}$ as warm restart by Algorithm 1;
end
Output: $U^{(T)}$

Prior to analyzing the convergence of the designed online solver for LRMC, we state the assumptions under which our analysis holds. The first assumption is that the data are bounded, which is natural for realistic data such as audio signals, images and videos. The second is that the solution to (5) for $v_t$ is unique, which holds if and only if $D_t U^{(t-1)}$ has full column rank. In practice, this condition is easy to meet, since $r \ll m$ and the fraction of missing elements in $y_i$ is typically not extremely large. The third is that the quadratic surrogate functions $S_t^j(U_j^{(t)})$ are strictly convex with lower-bounded Hessians. This hypothesis holds experimentally after processing several data points with reasonable initialization (we will see that random initialization also works sufficiently well in Sec. 3), and has also been experimentally verified for the task of online dictionary learning [Mairal et al., 2010]. The convergence property of the proposed Algorithm 2 is given by Theorem 1.

Theorem 1. Assume the observations are always bounded. Given the rank of the optimal solution to (4) as $r$, and supposing the solution $U^{(t)} \in \mathbb{R}^{m \times r}$ obtained by Algorithm 2 has full column rank, $U^{(t)}$ converges to at least a stationary point of problem (4) asymptotically.

Proof Sketch. From formula (7), we obtain a set of row-wise sub-problems as given in (8), which are independent of each other. Hence the theorem holds if it holds for each sub-problem. Please note that the separated sub-problems have the same form as problem (6) in [Mairal et al., 2010], and thus can be proved in the way offered there. Due to the space limit, instead of the complete proof, we only provide a sketch consisting of three steps: 1) the surrogate function $S_t^j(U_j^{(t)})$ converges almost surely; 2) the variations of $U_j^{(t)}$ are asymptotically small, i.e. $\|U_j^{(t+1)} - U_j^{(t)}\|_F = O(1/t)$; and 3) $F(U_j^{(t)}) - S_t^j(U_j^{(t)}) \to 0$ almost surely.
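To make the row-wise bookkeeping in Algorithms 1 and 2 concrete, the following is a minimal sketch (our own, not the released implementation) of one online step. Here `A` has shape (m, r, r) and stacks the per-row accumulators $A_k$, `b` has shape (m, r) and stacks the $b_k$; both start from zero, as in Algorithm 2.

```python
import numpy as np

def basis_update(U, A, b, n_sweeps=1):
    """Row-wise block-coordinate update of U in the spirit of Algorithm 1."""
    m, r = U.shape
    for _ in range(n_sweeps):
        for j in range(r):                        # columns of U
            for k in range(m):                    # rows of U
                a_kj = A[k, :, j]                 # j-th column of A_k
                if A[k, j, j] > 1e-12:            # skip rows never observed
                    U[k, j] += (b[k, j] - U[k, :] @ a_kj) / A[k, j, j]
            # project each column onto the unit l2 ball
            U[:, j] /= max(np.linalg.norm(U[:, j]), 1.0)
    return U

def online_completion_step(y_t, w_t, U, A, b):
    """One iteration of Algorithm 2: coefficient step, accumulation, basis update."""
    D = np.diag(w_t)
    v_t = np.linalg.pinv(D @ U) @ (D @ y_t)       # closed-form solution (6)
    A += w_t[:, None, None] * np.outer(v_t, v_t)  # A_k += [w_t]_k v_t v_t^T
    b += (w_t * y_t)[:, None] * v_t               # b_k += [w_t]_k [y_t]_k v_t
    return basis_update(U, A, b), A, b
```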
2.2 Online Robust Low Rank Matrix Recovery

In many real world applications, the observed data are very likely contaminated by outliers, whose support is usually unknown in advance. For the purpose of accurately recovering the underlying matrices, it is essential to eliminate or reduce the disturbance caused by these outliers. We now move on from OLRMC to its upgraded problem, namely Online Robust Low Rank Matrix Recovery (ORLRMR). Certainly, if the outlier support is at hand, ORLRMR degenerates to OLRMC. Thus, compared with OLRMC, ORLRMR additionally requires detecting outliers from the observations automatically and robustly. Recall that each entry $[y_i]_j$ ($i = 1, ..., n$; $j = 1, ..., m$) of the sample $y_i$ can be modeled as $[y_i]_j = U_j v_i + [e_i]_j$. Since the observations are always bounded, an outlier can take any value in the bounded range with equal probability, instead of being Gaussian ($\ell_2$: PCA and PMF) or Laplacian ($\ell_1$: PCP and PRMF).

Therefore, we assume that the outliers follow the uniform distribution with density $\frac{1}{\epsilon}$ over a range of length $\epsilon$. As for the residuals caused by other factors, we simply assume that they (approximately) follow a Gaussian distribution $\mathcal{N}(0, \sigma_i^2)$. Let $\pi_i^u$ and $\pi_i^g$, with $\pi_i^u + \pi_i^g = 1$, denote the (unknown) proportions of the outliers and of the rest in the $i$th sample, respectively. As a result, each $[e_i]_j$ can be viewed as a sample from a mixture of distributions with probability $p([e_i]_j) = \pi_i^g \mathcal{N}(0, \sigma_i^2) + \pi_i^u \frac{1}{\epsilon}$. The likelihood of $Y$ can then be written in the following form:

$$p(Y \,|\, U, V, \Theta) = \prod_{i,j} \Big(\pi_i^g\, \mathcal{N}\big([y_i]_j \,|\, U_j v_i, \sigma_i^2\big) + \frac{\pi_i^u}{\epsilon}\Big), \qquad (9)$$

where $\Theta_i = \{\sigma_i^2, \pi_i^u, \pi_i^g\}$ is the parameter vector. The negative log-likelihood function of (9) is:

$$\mathcal{L}(U, V, \Theta) \triangleq -\sum_{i,j} \log\Big(\pi_i^g\, \mathcal{N}\big([y_i]_j \,|\, U_j v_i, \sigma_i^2\big) + \frac{\pi_i^u}{\epsilon}\Big). \qquad (10)$$

So far, the outlier support has not been explicitly touched upon. But, as can be seen from (10), this is a typical mixture model that can be directly minimized by the Expectation Maximization (EM) algorithm. This reminds us that there are latent variables $[z_i^k]_j \in \{0, 1\}$ with $\sum_{k=1}^{K}[z_i^k]_j = 1$, which indicate the assignment of $[e_i]_j$ to a specific component of the mixture containing $K$ models in total. In this task, $z_i^g$, which corresponds to the Gaussian component, can be employed as the desired support. To be explicit with respect to the latent variables, and to make the model more compatible with additional constraints, we apply Proposition 1 and obtain the following cost function:

$$\mathcal{C}(U, V, \Theta, \Phi) \triangleq \sum_{i,j} [\Phi_i^g]_j \log\frac{[\Phi_i^g]_j}{\pi_i^g\, \mathcal{N}\big([y_i]_j \,|\, U_j v_i, \sigma_i^2\big)} + [\Phi_i^u]_j \log\frac{\epsilon\,[\Phi_i^u]_j}{\pi_i^u}. \qquad (11)$$

Proposition 1. Given two functions $\pi_k(x) > 0$ and $p_k(x) > 0$, we have the following [Liu et al., 2013; Guo et al., 2014b]:

$$-\log\sum_{k=1}^{K} \pi_k(x)\,p_k(x) = \min_{\Phi(x) \in \Delta^+} \sum_{k=1}^{K} \Phi_k(x) \log\frac{\Phi_k(x)}{\pi_k(x)\,p_k(x)},$$

where $\Phi(x) = \{\Phi_1(x), ..., \Phi_K(x)\}$ are hidden variables, and $\Delta^+ = \{\Phi(x) : 0 < \Phi_k(x) < 1 \ \text{and} \ \sum_{k=1}^{K}\Phi_k(x) = 1\}$ is a convex relaxation of the characteristic function decomposition.

Proof. The proof can be easily done via the Lagrangian multiplier method. Due to the limited space, we omit it here.

With this modification of (10), the cost function (11) is explicit in the latent variables $\Phi$ and, more importantly, is ready to take into account extra constraints on $\Phi$, such as sparsity of the outliers and smoothness of the support. In the following, for instance, we introduce the sparsity of the outliers, i.e. $\|\Phi^u\|_0$, which is a commonly used prior in a variety of real cases. It is well known that the $\ell_0$ norm is non-convex and is usually replaced with the $\ell_1$. As the elements of $\Phi$ lie between 0 and 1, a slight algebraic transformation gives the following formula:

$$\hat{\mathcal{C}} \triangleq \sum_{i,j} [\Phi_i^g]_j \log\frac{[\Phi_i^g]_j}{\pi_i^g\, \mathcal{N}\big([y_i]_j \,|\, U_j v_i, \sigma_i^2\big)} + [\Phi_i^u]_j \log\frac{\epsilon\,[\Phi_i^u]_j}{\pi_i^u} + \lambda[\Phi_i^u]_j, \qquad (12)$$

where $\lambda$ is a coefficient controlling the weight of the outlier sparsity. Please note that we write $\hat{\mathcal{C}}$ for $\hat{\mathcal{C}}(U, V, \Theta, \Phi)$ for brevity. Following the same spirit as in solving problem (2), we can seek the solution of (12) in an online manner.
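Since the proof of Proposition 1 is omitted above, we sketch the standard Lagrangian argument here (our own reconstruction, not taken from the paper); it also explains the form of the E-step updates that follow. Introducing a multiplier $\eta$ for the constraint $\sum_k \Phi_k(x) = 1$ and setting the derivative to zero,

$$\frac{\partial}{\partial \Phi_k}\Big[\sum_{l}\Phi_l\log\frac{\Phi_l}{\pi_l p_l} + \eta\Big(\sum_{l}\Phi_l - 1\Big)\Big] = \log\frac{\Phi_k}{\pi_k p_k} + 1 + \eta = 0 \;\;\Rightarrow\;\; \Phi_k^{\star} = \frac{\pi_k p_k}{\sum_{l}\pi_l p_l},$$

and substituting $\Phi^{\star}$ back yields the minimum value $-\log\sum_k \pi_k p_k$, as claimed. The same stationarity calculation, applied per entry with the extra $\lambda$ term and followed by normalization, produces the E-step updates in (13) and (14) below.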
When a new data vector $y_t$ arrives, we first calculate $\Phi_t$, $\Theta_t$ and $v_t$ based on the previously updated $U^{(t-1)}$, using the alternating minimization strategy as follows:

$$[\hat{\Phi}_t^u]_j = \arg\min_{[\Phi_t^u]_j} \ [\Phi_t^u]_j \log\frac{\epsilon\,[\Phi_t^u]_j}{\pi_t^u} + \lambda[\Phi_t^u]_j = \frac{\pi_t^u}{\epsilon}\exp\big(-(1+\lambda)\big),$$
$$[\hat{\Phi}_t^g]_j = \arg\min_{[\Phi_t^g]_j} \ [\Phi_t^g]_j \log\frac{[\Phi_t^g]_j}{\pi_t^g\, \mathcal{N}\big([y_t]_j \,|\, U_j^{(t-1)} v_t, \sigma_t^2\big)} = \frac{\pi_t^g}{\sqrt{2\pi\sigma_t^2}}\exp\Big(-1 - \frac{\big([y_t]_j - U_j^{(t-1)} v_t\big)^2}{2\sigma_t^2}\Big). \qquad (13)$$

To further enforce that $[\Phi_t^u]_j$ and $[\Phi_t^g]_j$ sum to one, we simply normalize them:

$$[\Phi_t^u]_j = \frac{[\hat{\Phi}_t^u]_j}{[\hat{\Phi}_t^g]_j + [\hat{\Phi}_t^u]_j}; \qquad [\Phi_t^g]_j = \frac{[\hat{\Phi}_t^g]_j}{[\hat{\Phi}_t^g]_j + [\hat{\Phi}_t^u]_j}. \qquad (14)$$

It is worth noting that the processing above corresponds to the E step of EM. Next, we focus on updating the parameters of the mixture model, including $\sigma_t^2$, $\pi_t^u$ and $\pi_t^g$, which can be achieved via [Dempster et al., 1977]:

$$\pi_t^u = \frac{\sum_j [\Phi_t^u]_j}{m}; \qquad \pi_t^g = \frac{\sum_j [\Phi_t^g]_j}{m}; \qquad \sigma_t^2 = \frac{\sum_j [\Phi_t^g]_j \big([y_t]_j - U_j^{(t-1)} v_t\big)^2}{\sum_j [\Phi_t^g]_j}. \qquad (15)$$

In turn, $v_t$ can be obtained by optimizing the following:

$$v_t = \arg\min_{v_t} \sum_{j=1}^{m} [\Phi_t^g]_j \log\frac{[\Phi_t^g]_j}{\pi_t^g\, \mathcal{N}\big([y_t]_j \,|\, U_j^{(t-1)} v_t, \sigma_t^2\big)} = \arg\min_{v_t} \|\Omega_t \odot (y_t - U^{(t-1)} v_t)\|_2^2, \qquad (16)$$

where $[\Omega_t]_j \triangleq \sqrt{[\Phi_t^g]_j}$. This has the same form as (5) and can be solved in the same way. Please notice that the updates of $\Theta_t$ and $v_t$ can be viewed as the M step of EM. $\Phi_t$, $\Theta_t$ and $v_t$ are iteratively updated via the procedure above until convergence. Moreover, each element of $\Phi_t$ lies in $(0, 1)$ rather than being binary, although it is typically very close to either 0 or 1. To make the weight $\Phi_t^g$ consistent with the binary support used in LRMC, we binarize $\Phi_t^g$ into $\hat{w}_t$ with a pre-defined threshold (e.g. 0.5). With the constructed support $\hat{w}_t$ and $v_t$, we are ready to update the basis $U$. For completeness, the solver of ORLRMR is summarized in Algorithm 3.

Algorithm 3: Online Robust LRMR
Input: The observed data revealed sequentially $\{y_1, ..., y_T\}$, and a non-negative $\lambda$.
For every $k \in \{1, ..., m\}$: $A_k^{(0)} \leftarrow 0 \in \mathbb{R}^{r \times r}$; $b_k^{(0)} \leftarrow 0 \in \mathbb{R}^r$.
for t = 1 to T do
  while not converged do
    Update $\Phi_t$ via (13), (14);
    Update $\Theta_t$ via (15);
    Update $\hat{v}_t$ via (16);
  end
  $\hat{w}_t \leftarrow \mathrm{binarize}(\Phi_t^g)$;
  $v_t \leftarrow \arg\min_{v_t} \|\hat{w}_t \odot (y_t - U^{(t-1)} v_t)\|_2^2$;
  for k = 1 to m do
    $A_k^{(t)} \leftarrow A_k^{(t-1)} + [\hat{w}_t]_k\, v_t v_t^T$;
    $b_k^{(t)} \leftarrow b_k^{(t-1)} + \big([\hat{w}_t]_k [y_t]_k\big) v_t$;
  end
  Update $U^{(t)}$ via Algorithm 1;
end
Output: $U^{(T)}$
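The per-sample E/M alternation in (13)-(16) is compact enough to sketch directly. The snippet below is our own illustration, not the authors' Matlab code; the default `lam=2.0` matches the value used in our experiments, while `eps=2.0` (the assumed range length of the uniform component, for data in $[-1, 1]$) and the iteration count are assumptions for the sketch.

```python
import numpy as np

def robust_support_step(y_t, U_prev, lam=2.0, eps=2.0, n_iter=20):
    """Estimate the outlier support of one sample via the mixture model (13)-(16)."""
    m, r = U_prev.shape
    pi_u, pi_g, sigma2 = 0.5, 0.5, 1.0
    v_t = np.linalg.pinv(U_prev) @ y_t               # initial coefficients
    for _ in range(n_iter):
        resid2 = (y_t - U_prev @ v_t) ** 2
        # E step (13): unnormalized responsibilities (the shared exp(-1) cancels in (14))
        phi_u = (pi_u / eps) * np.exp(-(1.0 + lam)) * np.ones(m)
        phi_g = pi_g / np.sqrt(2 * np.pi * sigma2) * np.exp(-1.0 - resid2 / (2 * sigma2))
        z = phi_u + phi_g                            # normalization (14)
        phi_u, phi_g = phi_u / z, phi_g / z
        # M step (15): mixture proportions and Gaussian variance
        pi_u, pi_g = phi_u.mean(), phi_g.mean()
        sigma2 = max((phi_g * resid2).sum() / (phi_g.sum() + 1e-12), 1e-8)
        # M step (16): weighted least squares with weights sqrt(phi_g)
        omega = np.sqrt(phi_g)
        v_t = np.linalg.pinv(omega[:, None] * U_prev) @ (omega * y_t)
    w_t = (phi_g > 0.5).astype(float)                # binarized support fed to Algorithm 2
    return w_t, v_t
```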
3 Experimental Verification

In this section, we conduct experiments on synthetic data to reveal the convergence speed and the robustness to outliers of the proposed ORLRMR, and on real data to show the advantages of our method compared against alternative approaches. Because OLRMC is a special case of ORLRMR, we concentrate on ORLRMR. Please note that ORLRMR has one free parameter, $\lambda$, for balancing the outlier sparsity. From Eqs. (12) and (13), we can see that the effect of $\lambda$ is essentially to pull the uniform density $\frac{1}{\epsilon}$ down to $\frac{1}{\epsilon}\exp(-\lambda)$. No doubt, as $\lambda$ grows, the probability of a residual being assigned to the outlier component decreases. However, the margin between the tail of the Gaussian distribution and $\frac{1}{\epsilon}$ is usually large, so ORLRMR performs stably over a relatively wide range of $\lambda$, especially for Gaussian noise with small variance. For the experiments shown in this paper, we uniformly set $\lambda$ to 2. The experiments are carried out on a MacBook Pro running the 64-bit OS X operating system with an Intel Core i7 2.8GHz CPU and 16GB RAM. Our code is implemented in Matlab.

3.1 Synthetic Data

Data Preparation. For fair comparison, we generate data in a similar way to PCP [Candès et al., 2011] and OR-PCA [Feng et al., 2013]. That is, a set of $n$ clean data points is produced by $Y = UV^T \in \mathbb{R}^{m \times n}$, where $U \in \mathbb{R}^{m \times r}$ and $V \in \mathbb{R}^{n \times r}$. The entries of both $U$ and $V$ are i.i.d. samples from the $\mathcal{N}(0, 1/n)$ distribution, and the data are then projected into $[-1, 1]$. The basis of the generated subspace is $U$, and $r$ is the intrinsic rank (dimension) of the spanned subspace. As for the corruption, we contaminate the data by replacing a fraction $p_o$ of the entries of $Y$ with outliers drawn from a uniform distribution over the interval $[-1, 1]$.

Task and Quantitative Metric. We evaluate the ability to correctly recover the subspace of corrupted observations under varying settings of the error density and the intrinsic subspace rank. The $U^{(0)}$ fed to the competitors is randomly initialized in the same fashion as the data generation. To measure the similarity between the recovered subspace $\hat{U}$ and the ground truth $U$, the Expressed Variance (E.V.) [Xu et al., 2010; Feng et al., 2013] is employed as our metric, defined as:

$$\mathrm{E.V.}(\hat{U}, U) \triangleq \frac{\mathrm{tr}\big(\mathrm{orth}(\hat{U})^T U U^T \mathrm{orth}(\hat{U})\big)}{\mathrm{tr}(U U^T)}, \qquad (17)$$

where $\mathrm{orth}(\cdot)$ and $\mathrm{tr}(\cdot)$ stand for the orthogonalization and trace operators, respectively. The averaged results from 10 independent trials are finally reported.

Competitors. We compare our ORLRMR with the recently proposed online techniques GRASTA and OR-PCA. The code of GRASTA is downloaded from the authors' website; its core is implemented in C++. Its most important parameter is the sub-sampling ratio; to obtain its best possible results, we turn off the sub-sampling so that the whole information is taken into account. The code of OR-PCA was not available when this paper was prepared, so we implemented it in Matlab by strictly following the pseudo code given in [Feng et al., 2013] and adopting the parameters suggested by the authors.

Performance Analysis. We provide the performance curves of ORLRMR, OR-PCA and GRASTA against the number of samples in Fig. 1, under the setting of $m = 400$ and $n = 10^4$, to examine the robustness of the contestants to corruptions and how the performance improves as more samples are revealed.

Figure 1: Expressed variance versus number of samples. Top-Left: intrinsic rank $r = 80$ and outlier ratio $p_o = 0.2$. Top-Right: $r = 80$ and $p_o = 0.3$. Bottom-Left: $r = 80$ and $p_o = 0.4$. Bottom-Right: $r = 120$ and $p_o = 0.3$.

As can be seen from the plots, all three methods achieve better results with more observations. In all four cases, the performance of GRASTA is much inferior to the others, possibly because the intrinsic dimension relative to the ambient one is relatively large for GRASTA, and because the basis update of GRASTA takes only one gradient descent step while both ORLRMR and OR-PCA have closed-form solutions. We will see later that GRASTA attains reasonably good results when the intrinsic rank is relatively small. In the case of intrinsic rank $r = 80$ and outlier ratio $p_o = 0.2$, the curves of OR-PCA and ORLRMR are very close, both reaching about E.V. = 0.9 after processing $10^4$ samples. As the outlier ratio increases, the convergence speed drops. Even so, ORLRMR gives E.V. ≈ 0.7 and E.V. ≈ 0.6 for $p_o = 0.3$ and $p_o = 0.4$, respectively, outperforming OR-PCA by about 0.2. This indicates that ORLRMR can bear more corruption than the others. In addition, we show a case with $r = 120$ and $p_o = 0.3$ to view the effect of $r$. The curves show that OR-PCA and ORLRMR perform similarly to the case of $r = 80$ and $p_o = 0.3$ with only a very slight degradation, which means that the change of $r$ does not influence them much. Please note that the E.V. of the initialized basis increases as the intrinsic rank approaches the ambient dimension.
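For reference, the metric (17) is straightforward to compute; the snippet below is a minimal illustration of it (our own, not the evaluation code used for the reported numbers).

```python
import numpy as np

def expressed_variance(U_hat, U):
    """Expressed Variance (17): fraction of the ground-truth subspace energy
    captured by the recovered subspace."""
    Q, _ = np.linalg.qr(U_hat)                  # orth(U_hat): orthonormal basis
    num = np.trace(Q.T @ U @ U.T @ Q)
    den = np.trace(U @ U.T)
    return num / den

# toy check: any basis spanning the same subspace expresses all of the variance
rng = np.random.default_rng(0)
U = rng.standard_normal((400, 80))
R = np.linalg.qr(rng.standard_normal((80, 80)))[0]   # random rotation
print(round(expressed_variance(U @ R, U), 4))        # -> 1.0
```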
3.2 Real Data

Dataset. The Star dataset consists of 9 real world surveillance videos covering a variety of scenarios, including Escalator (Esc: indoor, dynamic background, crowd), Hall (Hal: indoor, static background, crowd), Boot Strap (BS: indoor, static background, crowd), Lobby (Lob: indoor, light switch, people), Shopping Mall (SM: indoor, static background, people), Campus (Cam: outdoor, dynamic background, cars), Water Surface (WS: outdoor, dynamic background, lingering person), Fountain (Fou: outdoor, dynamic background, people), and Curtain (Cur: indoor, dynamic background, people).

Task and Quantitative Metric. The task is to separate foregrounds from surveillance videos, for which the ground-truth foreground is given but the ground-truth background model is not available. Moreover, different from the simulation, the background may comprise dynamic factors, such as fountains and waving curtains. To assess the performance on foreground detection, we use the Receiver Operating Characteristic (ROC) curve as our metric. As the rank of the background of surveillance videos is typically small, we empirically set it to 5 for all 9 sequences.

Figure 2: Quantitative results on the Star dataset (ROC curves of foreground detection for each sequence). The legends also report the processing speed of each method on each sequence. Esc: PRMF 4.9fps, OR-PCA 29.1fps, GRASTA 61.2fps, ORLRMR 43.7fps; Hal: PRMF 6.3fps, OR-PCA 23.3fps, GRASTA 50.3fps, ORLRMR 42.7fps; BS: PRMF (fps not shown), OR-PCA 30.8fps, GRASTA 65.4fps, ORLRMR 51.1fps; Lob: PRMF 6.9fps, OR-PCA 86.7fps, GRASTA 29.9fps, ORLRMR 50.8fps; SM: PRMF 1.5fps, OR-PCA 4.4fps, GRASTA 13.4fps, ORLRMR 11.2fps; Cam: PRMF 4.6fps, OR-PCA 31.6fps, GRASTA 63.6fps, ORLRMR 62.3fps; WS: PRMF 5.8fps, OR-PCA 58.9fps, GRASTA 64.0fps, ORLRMR 51.3fps; Fou: PRMF 6.2fps, OR-PCA 50.8fps, GRASTA 61.8fps, ORLRMR 53.6fps; Cur: PRMF 7.4fps, OR-PCA 49.8fps, GRASTA 66.1fps, ORLRMR 67.8fps.

Competitors. Besides GRASTA and OR-PCA, we further employ a batch method, PRMF [Wang et al., 2012], as a reference. PRMF is a probabilistic approach to robust matrix factorization that holds an equivalence relation with PCP. PCP [Candès et al., 2011] itself is not used, mainly because its memory and time costs are too high for processing long sequences. The Matlab code of PRMF is downloaded from the authors' website, with the parameters set to the values provided in the released implementation package. In addition, to obtain the best possible performance of GRASTA and OR-PCA, the first 5 frames of each sequence are used to initialize their bases (without a good initialization, OR-PCA cannot give reasonable results in this experiment, possibly because the subspace rank relative to the ambient dimension is too small and the number of samples is inadequate for convergence). To show the insensitivity of ORLRMR to initialization, its basis is randomly generated, as in the previous experiment.
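Before turning to the results, the following is a minimal sketch (our own illustration, not the evaluation code behind Fig. 2) of how ROC points can be read off a soft per-pixel foreground score, such as the outlier responsibility $\Phi^u$, against a binary ground-truth mask.

```python
import numpy as np

def roc_points(score_map, gt_mask, n_thresh=50):
    """Sweep thresholds over a soft foreground score map and return
    (false positive rate, true positive rate) pairs against a binary mask."""
    scores = score_map.ravel()
    gt = gt_mask.ravel().astype(bool)
    fprs, tprs = [], []
    for th in np.linspace(scores.min(), scores.max(), n_thresh):
        pred = scores >= th
        tp = np.sum(pred & gt)
        fp = np.sum(pred & ~gt)
        tprs.append(tp / max(gt.sum(), 1))
        fprs.append(fp / max((~gt).sum(), 1))
    return np.array(fprs), np.array(tprs)
```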
Performance Analysis. Figure 2 summarizes the ROC plots for the 9 videos. From the top row, we can see that our ORLRMR outperforms GRASTA, OR-PCA and even the batch approach PRMF. The reason is that these three videos contain crowds, and ORLRMR handles heavy outliers better than the others. For the remaining videos, PRMF attains the best results (except for the Campus sequence, in which GRASTA slightly exceeds PRMF), while ORLRMR competes very favorably with PRMF in most cases. The only exception is the Curtain video, whose background contains constantly moving curtains. Although the ground-truth foreground excludes these curtains, from the perspective of low rank recovery they behave like outliers, and ORLRMR treats them as such. Please notice that GRASTA performs reasonably well on these real videos, since the subspace dimension is relatively small. As regards OR-PCA, its inferior performance in most cases may be due to the large fractions of foreground and/or the insufficient number of streaming samples for achieving convergence. Together with the simulation, we can conclude that ORLRMR is better at dealing with heavily polluted data and converges faster than OR-PCA. In terms of speed, PRMF is, unsurprisingly, the slowest among the contestants (5.7 frames per second on average over the 9 sequences), due to its batch nature. ORLRMR achieves 48.3 fps, which outperforms OR-PCA (40.6 fps) and is competitive with GRASTA (52.9 fps), even though the core of GRASTA is implemented in C++. The fps numbers for each sequence can be found in the legends of Fig. 2.

We would like to emphasize that the experiments in this part are not intended to show that ORLRMR beats methods specially designed for the task of foreground detection. The aim is to demonstrate the merits of our general model without involving domain knowledge. It is certain that equipping any of the general models, like GRASTA, OR-PCA, ORLRMR and PRMF, with proper priors for specific problems can further improve their performance.

4 Conclusion

In this paper, we have proposed to address the low rank matrix completion and recovery tasks in an online fashion, which scales up gracefully to large data sets. The mixture model consisting of a Gaussian component and a uniformly distributed component has shown significant improvement in robustness to heavy outliers. The theoretical and experimental results have demonstrated the properties and advantages of our scheme compared with alternative methods. It is worth noting that our technical framework is ready to embrace additional task-specific constraints for further boosting the performance. For instance, spatial smoothness can be introduced to more accurately detect foregrounds in surveillance videos. We believe that our proposed technique can be widely applied to tabulated data processing applications.

Acknowledgments

This work was supported in part by the National Natural Science Foundation of China under Grant 61402467 and in part by the Excellent Young Talent Program through the Institute of Information Engineering, Chinese Academy of Sciences.

References

[Balzano et al., 2011] L. Balzano, R. Nowak, and B. Recht. Online identification and tracking of subspaces from highly incomplete information. arXiv:1006.4046, 2011.
[Bertsekas, 1999] D. Bertsekas. Nonlinear Programming. Athena Scientific, 1999.
[Bottou and Bousquet, 2008] L. Bottou and O. Bousquet. The trade-offs of large scale learning. In Proceedings of NIPS, pages 161-168, 2008.
[Candès et al., 2011] E. Candès, X. Li, Y. Ma, and J. Wright.
Robust principal component analysis? Journal of the ACM, 58(3):1-37, 2011.
[Chandrasekaran et al., 2011] V. Chandrasekaran, S. Sanghavi, P. Parrilo, and A. Willsky. Rank-sparsity incoherence for matrix decomposition. SIAM Journal on Optimization, 21(2):572-596, 2011.
[Chen et al., 2014] C. Chen, X. Zheng, Y. Wang, F. Hong, and Z. Lin. Context-aware collaborative topic regression with social matrix factorization for recommender systems. In Proceedings of AAAI, pages 9-15, 2014.
[Dempster et al., 1977] A. Dempster, N. Laird, and D. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, 39(1):1-38, 1977.
[Feng et al., 2013] J. Feng, H. Xu, and S. Yan. Online robust PCA via stochastic optimization. In Proceedings of NIPS, pages 1-9, 2013.
[Gu et al., 2014] S. Gu, L. Zhang, W. Zuo, and X. Feng. Weighted nuclear norm minimization with applications to image denoising. In Proceedings of CVPR, pages 2862-2869, 2014.
[Guo et al., 2013] X. Guo, X. Cao, X. Chen, and Y. Ma. Video editing with temporal, spatial and appearance consistency. In Proceedings of CVPR, pages 2283-2290, 2013.
[Guo et al., 2014a] X. Guo, X. Cao, and Y. Ma. Robust separation of reflection from multiple images. In Proceedings of CVPR, pages 2195-2202, 2014.
[Guo et al., 2014b] X. Guo, X. Wang, L. Yang, X. Cao, and Y. Ma. Robust foreground detection using smoothness and arbitrariness constraints. In Proceedings of ECCV, pages 535-550, 2014.
[He et al., 2012] J. He, L. Balzano, and A. Szlam. Incremental gradient on the Grassmannian for online foreground and background separation in subsampled video. In Proceedings of CVPR, pages 1568-1575, 2012.
[Jamali and Ester, 2011] M. Jamali and M. Ester. A transitivity aware matrix factorization model for recommendation in social networks. In Proceedings of IJCAI, pages 2644-2649, 2011.
[Liu et al., 2013] J. Liu, X. Tai, H. Huang, and Z. Huan. A weighted dictionary learning model for denoising images corrupted by mixed noise. IEEE TIP, 22(3):1108-1120, 2013.
[Mairal et al., 2010] J. Mairal, F. Bach, J. Ponce, and G. Sapiro. Online learning for matrix factorization and sparse coding. JMLR, 11(1):19-60, 2010.
[Park et al., 2013] S. Park, Y. Kim, and S. Choi. Hierarchical Bayesian matrix factorization with side information. In Proceedings of IJCAI, pages 1593-1599, 2013.
[Pearson, 1901] K. Pearson. On lines and planes of closest fit to systems of points in space. Philosophical Magazine, 1901.
[Recht et al., 2010] B. Recht, M. Fazel, and P. Parrilo. Guaranteed minimum-rank solutions of linear matrix equations via nuclear norm minimization. SIAM Review, 52(3):471-501, 2010.
[Salakhutdinov and Mnih, 2008] R. Salakhutdinov and A. Mnih. Probabilistic matrix factorization. In Proceedings of NIPS, pages 1-8, 2008.
[Shen and Wu, 2012] X. Shen and Y. Wu. A unified approach to salient object detection via low rank matrix recovery. In Proceedings of CVPR, pages 853-860, 2012.
[Shi et al., 2013] J. Shi, N. Wang, Y. Xia, D. Yeung, I. King, and J. Jia. SCMF: Sparse covariance matrix factorization for collaborative filtering. In Proceedings of IJCAI, 2013.
[Wang and Zhang, 2012] S. Wang and Z. Zhang. Colorization by matrix completion. In Proceedings of AAAI, pages 1169-1174, 2012.
[Wang et al., 2012] N. Wang, T. Yao, J. Wang, and D. Yeung. A probabilistic approach to robust matrix factorization. In Proceedings of ECCV, pages 126-139, 2012.
[Xu et al., 2010] H. Xu, C. Caramanis, and S. Sanghavi.
Principal component analysis with contaminated data: The high dimensional case. In Proceedings of COLT, 2010.
[Zhang et al., 2012] Z. Zhang, A. Ganesh, X. Liang, and Y. Ma. TILT: Transform invariant low-rank textures. IJCV, 99(1):1-24, 2012.