# Residual-Based Sampling for Online Outlier-Robust PCA

Tianhao Zhu 1   Jie Shen 1

1 Department of Computer Science, Stevens Institute of Technology, Hoboken, New Jersey, USA. Correspondence to: Tianhao Zhu, Jie Shen.

Proceedings of the 39th International Conference on Machine Learning, Baltimore, Maryland, USA, PMLR 162, 2022. Copyright 2022 by the author(s).

Abstract

Outlier-robust principal component analysis (ORPCA) has been broadly applied in scientific discovery over the last decades. In this paper, we study online ORPCA, an important variant that addresses the practical challenge that data points arrive in a sequential manner, and the goal is to recover the underlying subspace of the clean data with one pass over the data. Our main contribution is the first provable algorithm that enjoys a recovery guarantee comparable to the best known batch algorithm, while significantly improving upon the state-of-the-art online ORPCA algorithms. The core technique is a robust version of the residual norm which, informally speaking, leverages not only the importance of a data point, but also how likely it is to behave as an outlier.

1. Introduction

Principal Component Analysis (PCA) is a fundamental tool for analyzing high-dimensional data. The key idea is to compute the optimal low-dimensional subspace approximation. In the absence of outliers, such a subspace is spanned by the top-k left singular vectors, or principal components, of the sample covariance matrix. This is one of the most important problems in machine learning and has been studied for a long time. The PCA problem is well understood when the data matrix is fully observed and there is no noise. Thus, a large body of recent works focus on the online setting or on corrupted data.

Robust PCA. Many robust PCA algorithms have been proposed because traditional PCA breaks down immediately in the presence of outliers. Some of them addressed robust PCA in the low-dimensional regime: they either applied standard PCA to a robust estimate of the covariance matrix (Xu & Yuille, 1995; Yang & Wang, 1999; la Torre & Black, 2001; Brubaker, 2009; Klivans et al., 2009), or sought a low-dimensional projection that maximizes a robust estimate of the projected spread, known as projection pursuit (Croux et al., 2007). A more challenging setting is the high-dimensional regime, where the ambient dimension exceeds the number of observed data points. In (Candès et al., 2011; Chandrasekaran et al., 2011), exact recovery of the low-rank and sparse components via convex optimization was established by minimizing a weighted combination of the nuclear norm of one component and the ℓ1-norm of the other. Specifically, Xu et al. (2012) showed that matrix decomposition using nuclear norm minimization can recover the column space and identify the outliers. A nearly optimal recovery guarantee was also achieved by Cherapanamjeri et al. (2017) using threshold-based approaches, which reduce the computational cost significantly. However, these algorithms require not only observing all the samples, but also that the inliers be incoherent (so that inliers are not thresholded). Meanwhile, another widely used convex surrogate is max-norm regularization (Srebro et al., 2004), where the max-norm promotes a low-rank structure. Srebro & Shraibman (2005) studied collaborative filtering and showed a tighter generalization bound than the nuclear norm theoretically, and Lee et al.
(2010); Jalali & Srebro (2012) showed its appealing performance in some practical applications empirically. The very recent work Deshpande & Pratap (2021) extended the outlier low-rank approximation problem under ℓp-metric, and extended the analysis over M-estimator loss functions and affine subspace approximation. Readers may refer to Lerman & Maunu (2018) for a comprehensive survey of the works on robust subspace recovery. Online PCA. The online setting is more restricted than offline in that we can only observe the samples in a sequential manner. Note that this is different from stochastic optimization where the algorithms can access an arbitrary sample in each iteration (Ozawa et al., 2004; Nie et al., 2013; Arora et al., 2013; Garber & Hazan, 2015; Hallgren & Northrop, 2018). The recent work Garber (2019) cast the online PCA into the regret minimization framework. It proposed the regularized Online Gradient Ascent model, and enjoyed the poly-logarithmic regret bound, requiring only linear memory and run-time per iteration. Another way is to progressively expand the subspace of inliers. Boutsidis et al. (2015) Residual-Based Sampling for Online Outlier-Robust PCA proposed the first online algorithm for computing the PCA embedding under such setting, resulting in an additive error guarantee. In a recent work, Bhaskara et al. (2019) gave an online algorithm for Column Subsection Selection(CSS) as well as PCA, and achieved the multiplicative approximation by residual-based sampling method. Meanwhile, based on Krasulina s method Krasulina (1969) and Oja s rule Oja & Karhunen (1985), some works achieved the non-asymptotic convergence guarantee for the streaming PCA problem (Jain et al., 2016; Li et al., 2016; Allen-Zhu & Li, 2017; Tang, 2019; Amid & Warmuth, 2020). However, it turns out that all of these algorithms fail in the presence of outliers. Online Robust PCA. As a combination of robust and online settings, there are two variants of Online Robust PCA. One assumes that the features of each newly observed sample are sparsely corrupted; see e.g. (Feng et al., 2013b; Shen et al., 2014). Another class assumes that some samples are arbitrary or even adversarial while the rest form a low-dimensional space, which is the regime of our interest. In this spectrum, The most related work to this paper is Feng et al. (2013a) in the sense that they considered online optimization for outlier-robust PCA. Unfortunately, their theoretical guarantees were less favorable for practical problems for two reasons: 1) they required strong conditions on the initial iterate, which is hard to satisfy; and 2) their results held only in an asymptotic sense while we present finite convergence guarantee. The reader may refer to Table 1 for a comparison. 1.1. Main results The main algorithmic contribution of the paper is a novel online outlier-robust PCA (ORPCA) algorithm that uses adaptive sampling technique which shows promising results in the non-robust setting (Bhaskara et al., 2019). Let matrix A Rd n be the observed matrix with n samples in d dimensions, with Ai being the i-th column. Let k < min{d, n} be the target rank of the subspace of inliers. We assume the following conditions. Assumption 1. The point arrives one after another, and we can only make a one-time pass over it. Assumption 2. The matrix consists inliers Ain and outliers Aout, where the entries of Aout can be arbitrary and shown in columns in arbitrary order. There exists an upper bound z on the number of outliers. Namely, |Aout| z. 
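Throughout, the guarantees are stated relative to the best rank-k approximation error of the inlier columns. As a point of reference only (it is a batch quantity that an online algorithm cannot compute, and the parameter ξ introduced next is simply an upper bound on it), a minimal numpy sketch of this benchmark is:

```python
import numpy as np

def rank_k_error(A_in: np.ndarray, k: int) -> float:
    """Best rank-k approximation error of the inliers, ||A_in - SVD_k(A_in)||_F^2.

    This is the batch benchmark (later denoted OPT_k); the algorithm's
    parameter xi is any upper bound on this quantity."""
    s = np.linalg.svd(A_in, compute_uv=False)  # singular values, descending
    return float(np.sum(s[k:] ** 2))           # tail energy = optimal rank-k error
```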
Let ξ be a quantity such that ‖Ain − SVDk(Ain)‖²F ≤ ξ. Let M be the set of points marked as outliers, and Ain\M be the inliers not marked as outliers. We present our first theorem below.

Theorem 1. If Assumption 1 and Assumption 2 are satisfied, then there exists an efficient algorithm that, upon seeing each Ai, either adds it to the outlier set M or outputs an embedding Yi ∈ R^r. In the end, the error on the inliers not marked as outliers satisfies

min_{Φ∈R^{d×r}: ΦᵀΦ=I} ‖Ain\M − ΦY‖²F ≤ O(ξ · log(‖Ain‖²F/ξ)),

with the output dimension bounded by r ≤ O(k · (log(‖Ain‖²F/ξ))²). The number of marked outliers satisfies |M| ≤ O(z · (log(‖Ain‖²F/ξ))²).

Firstly, let us compare the bound above with that of Bhaskara et al. (2019), which is the state-of-the-art online PCA algorithm. For a direct comparison, think of the approximation error ξ as (1/γ)·‖Ain‖²F, where γ ≥ 1 is a constant. Then we obtain the error guarantee O((log γ / γ)·‖Ain‖²F) using an embedding dimension O(k·(log γ)²) with high probability. It is worth noticing that they assume every point is an inlier, so their algorithm only works in the noiseless setting; our algorithm, in contrast, is robust to at most z outliers and enjoys the same error and embedding dimension guarantee over the inliers not marked as outliers, at the price of discarding O(z·(log γ)²) points as outliers.

We note that the error is bounded by roughly O(ξ · log(‖Ain‖²F/ξ)), where ξ is a parameter of the algorithm. This parameter is important: it is not only needed to run Algorithms 1 and 2, but also appears in their theoretical guarantees. It is unclear whether the logarithmic overhead is an artifact of our analysis or is fundamental to the problem.

We also note that if we relax Assumption 1 and permit a second pass over the data, then we can combine our algorithm with an additive approximation algorithm in an incremental fashion and obtain an error guarantee with a constant approximation factor. We state our second result as follows.

Theorem 2. If Assumption 2 is satisfied and we allow a second pass over the data, then there exists an efficient algorithm that, upon seeing each Ai, either adds it to the outlier set M or outputs an embedding Yi ∈ R^r. In the end, the error on the inliers not marked as outliers satisfies

min_{Φ∈R^{d×r}: ΦᵀΦ=I} ‖Ain\M − ΦY‖²F ≤ ‖Ain − SVDk(Ain)‖²F + ϵξ,

with the output dimension bounded by r ≤ O((k/ϵ²) · (log(‖Ain‖²F/ξ))⁴).

On the positive side, the above theorem significantly improves the approximation error. On the other hand, there is a trade-off between the approximation error and the output dimension. Indeed, we show that by paying an additional O((1/ϵ²)·(log(‖Ain‖²F/ξ))²) factor in the embedding dimension, we can improve the approximation ratio from data-dependent to a constant.
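For concreteness, and ignoring the hidden constants: if ξ = e⁻⁴·‖Ain‖²F, then log(‖Ain‖²F/ξ) = 4, so Theorem 1 gives an error of O(4ξ) ≈ O(0.073·‖Ain‖²F) with output dimension O(16k) and at most O(16z) marked outliers, whereas Theorem 2 drives the error down to ‖Ain − SVDk(Ain)‖²F + ϵξ at the price of an O(256k/ϵ²) output dimension.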
Table 1. A comparison with prior algorithms. In the Online column, the mark for Algorithm 2 indicates that the algorithm requires two passes over the data; with this one additional pass, the algorithm can correct its earlier predictions and hence improve the performance guarantee. In the Error column, "unknown" for Feng et al. (2013a) means that their algorithm does not provably converge to a vanishing approximation error; notably, they only show computational convergence to a stationary point with unknown statistical properties.

Work | Online? | Outlier-robust? | Error
Xu et al. (2012) |  | ✓ | ‖Ain − SVDk(Ain)‖²F
Cherapanamjeri et al. (2017) |  | ✓ | ‖Ain − SVDk(Ain)‖²F
Boutsidis et al. (2015) | ✓ |  | ‖Ain − SVDk(Ain)‖²F + ϵ‖Ain‖²F
Bhaskara et al. (2019) | ✓ |  | O((log(‖Ain‖²F/ξ))² · ξ)
Feng et al. (2013a) | ✓ | ✓ | unknown
This work (Algorithm 1) | ✓ | ✓ | O((log(‖Ain‖²F/ξ))² · ξ)
This work (Algorithm 2) | ✓ (two passes) | ✓ | ‖Ain − SVDk(Ain)‖²F + ϵξ

1.2. Overview of main techniques

Our algorithm is inspired in part by Bhaskara et al. (2019), but has crucial differences. We present an overview of the techniques below and highlight our novelty.

1) Online PCA using adaptive sampling. At a high level, the online PCA algorithm proceeds in phases. In each phase, a new direction is added to the subspace once it is deemed "significant". Under adaptive sampling, the significance of a point is proportional to its residual norm (to be defined) with respect to the current subspace. Each point is therefore either informative (its direction is added to the subspace directly) or non-informative (it is folded into a running sketch that accumulates such vectors until the sum of their residual norms becomes informative, at which point the sketch's direction is added to the subspace). Classic adaptive sampling is related to column subset selection: for example, Deshpande & Rademacher (2010) and Paul et al. (2015) show that there exists a sub-matrix whose span, once all points are projected onto it, achieves a favorable ratio to the best rank-k approximation. We show that adaptive sampling can be slightly modified to sample according to residual norms so that it solves online PCA. Moreover, committing a direction to the subspace and outputting the embedding can be done in a one-shot manner, so the algorithm enjoys favorable computational complexity. The bound on the number of phases is formally shown in Lemma 7.

2) Threshold-based outlier removal. Suppose the data is corrupted by z ≥ k outliers. Intuitively, this is challenging in the online model: when we encounter a point far from the current subspace V, we cannot tell whether it represents a new direction or is simply an outlier. We categorize the phases into inlier and outlier phases, identified by whether the residual norm of the points falls under or over the threshold ξ/z. This outlier threshold ensures that once z/k of the distant points have been seen, there is a sufficient probability of picking up the direction. We show that our adaptive sampling design marks at most O(z · log(‖Ain‖²F/ξ)) points as outliers. Putting these ideas together leads to an O(ξ · log(‖Ain‖²F/ξ)) error bound for online PCA; see Section 2.1.

3) Incremental additive approximation algorithm to improve the error. We call an algorithm incremental if it maintains and incrementally adjusts its objective at every step; for example, our algorithm updates the subspace V and a covariance matrix U incrementally as new points arrive. Incremental updates are common in the online setting, where memory and computation are limited. Intuitively, if two algorithms are both incremental, we can combine them asynchronously (when the second algorithm requires the results of the first) or synchronously (when the two algorithms are independent). Observing that the error of Algorithm 1 is itself a residual, the main idea is to feed this residual into an additive approximation algorithm (for example, the first algorithm of Boutsidis et al. (2015)). This, however, requires a second pass over the data. We show that we can process the residuals of the marked inliers and reduce the error over inliers not marked as outliers; see Section 2.2.

4) Removing the dependence on ξ.
Furthermore, we note that the two theorems assumed that the parameter ξ is known and satisfies the bound ξ Ain SVDk(Ain) 2 F , which can only be estimated in hindsight. Indeed, we show that we can remove this assumption by assigning an arbitrary small approximation error, and double its value when sufficient phases are met; see Section 2.3. 1.3. Notations We use capital letters, for example, M, to denote a matrix, and Mi denotes its ith column. The capital letter A is Residual-Based Sampling for Online Outlier-Robust PCA reserved for the data matrix. The ℓ2-norm of a vector Mi is denoted by Mi . The Frobenius norm of a matrix M is denoted by M F . The size of the set is denoted by | |. In the ith phase, the algorithm observes the sample Ai and maintains the subspace V i. We denote the output embedding Yi = (V i)T Ai, where the subspace V i is a set of normalized residuals over previous phases. The projection of Ai to the space orthogonal to V i is denoted by Π V i Ai = Ai V i(V i)T Ai. We also use Π i = I V i(V i)T . Recall that each column of A is a sample. By Assumption 2, the set of all columns of matrix A can be partitioned into inliers and the outliers. That is, A = Ain Aout. We assume there exists an upper bound z on the number of outliers: |Aout| z. Let SVDk(Ain) be the optimal rankk approximation of Ain, then the optimum error of the kdimension embedding is OPTk = Ain SVDk(Ain) 2 F . (1) 1.4. Roadmap In Section 2, we describe our main algorithms, and in Section 3 we show the guarantees. We conclude our work in Section 4, and defer all the proof details and numerical experiments to the appendix. 2. Main Algorithms In this section, we present our main algorithms. 2.1. Logarithmic approximation We present our online robust logarithmic approximation algorithm in Algorithm 1. Intuitively, our algorithm uses adaptive sampling based on residual vectors to form the subspace, and process the outlier removal. Definition 3. In the iteration i, we assign the residual norm in approximating the new point Ai using the current subspace V i by Π V i Ai = Ai V i(V i)T Ai (2) A collection of Ai would be combined as a proxy vector until the sum of these residual square norm values are O(ξ/k). Then the phase ends, and we add the direction of proxy vector in that phase to V i. At a high level, the proxy vector with the sum of residual square norm O(ξ/k) is deemed significant, thus adding its direction to V i will occur a smaller error in the future. In Bhaskara et al. (2019), the author uses one threshold of residual square norm O(ξ/k) to identify the informative and non-informative points. On top of that, our algorithm adds an additional threshold O(ξ/z) to identify the inliers and outliers. By our assumption, we have z k. Thus, we can always get O(ξ/k) O(ξ/z), and classify points into three categories. 1. When the point residual square norm is O(ξ/z), we name it non-special inlier, and accumulate them until the sum of their residual norms O(ξ/k). Then the span of the proxy vector can be treated as an important direction in the subspace, and we add it to the subspace. If the phase terminates with a nonspecial inlier, it is called an non-special inlier phase; see Steps 4 10. 2. When the point residual square norm is between O(ξ/k) and O(ξ/z), the point could be an noninformative outlier, or an non-informative inlier. Such threshold can be thought as the cross-field of inliers and outliers. We name such point non-special outlier. 
Then we linearly combine it to the proxy vector with probability O(k Π V i Ai /ξ). When z/k number of such points are observed, the proxy vector is deemed important, and we add it to the subspace. If the phase terminates with a non-special outlier, it is called an non-special outlier phase; see Steps 13 18. 3. When the point residual square norm is O(ξ/k), the point could be an informative outlier, or an informative inlier lying in the new directions. Intuitively, there should be a decent number of inliers lying in the new directions, while outliers are separated from other points. Therefore, a special point is an inlier near the new direction with high probability. We name such point special outlier, and add it to the subspace with probability O(k/z). If the phase terminates with a special point, it is called a special phase; see Steps 19 21. We note that when outliers are arbitrarily far from authentic data, rendering distance-based sampling is prone to pick outliers, so that we can mitigate their effect. The special outliers and non-special outliers are distant from the current subspace, so it is reasonable to mark them as an outlier. Yet, it is possible that inliers can also be classified as outliers, for example, when a new direction starts to form. We claim that if one direction has more than z/k points, then once z/k of them are observed, there is a sufficiently large probability of adding this direction to the subspace. Meanwhile, for the directions having < z/k points, the total number of picked points near these directions is < z. Thus classifying the points as outliers would only increase a factor 2 in the number of marked outliers. 2.2. Constant approximation We observe that for each vector Ai, Algorithm 1 outputs an embedding of Ai on the current subspace spanned by V i Residual-Based Sampling for Online Outlier-Robust PCA Algorithm 1 Online ORPCA with Logarithmic Approximation Error Require: Matrix A Rd n whose columns {Ai} arrive one by one, parameter ξ that upper bounds the optimal approximation error over the inlier points, an upper bound z on the number of outliers, parameter k > 1. Ensure: The online low-dimensional embedding yi; the subspace V at the end. 1: V , U 0d d, w 0, t 0, α 0, β 0, r 2k log Ain 2 F ξ . 2: while columns Ai arrive do 3: Π V Ai Ai UAi, p Ai k Π V Ai 2 4: if p Ai < k z then 5: α α + p Ai, w w + XAi, where X is 1 uniformly at random. 6: if α 1 then 7: w Π V w Π V w , add w to V , U U + w w T . 8: Reset w, α, t and β to 0. 9: end if 10: else 11: Mark Ai as outliers. 12: if p Ai < 1 then 13: With probability p Ai: t t + X Ai p Ai , where X is 1 uniformly at random, β β + k z . 14: if β 1 then 15: t Π V t Π V t , add t to V , U U + t t T . 16: Reset w, α, t and β to 0. 17: end if 18: else if p Ai 1 then 19: With probability k/z: Set A i Π V Ai Π V Ai , w Π V w Π V w , add A i and w to V , U U + A i A T i + w w T . 20: Reset w, α, t and β to 0. 21: end if 22: end if 23: Return the embedding yi V T Ai, resized to dimension r by adding zeros, so that the dimension of all embedding is fixed. 24: end while return V . with the guarantee of the residual square norm. We can pass the residual and its guarantee to the additive algorithm of Theorem 4 only for marked inliers. The final output is the joint embedding of the outputs from the two algorithms. Additive approximation algorithm. To obtain the desired bound on error and output dimension, we need to use the previous work of Boutsidis et al. (2015). It proposed an algorithm for online PCA. 
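Before turning to the additive routine, the following minimal sketch (in Python/numpy) makes the per-point update of Algorithm 1 above concrete. It paraphrases the pseudocode rather than reproducing it exactly: variable names are ours, constants and the fixed-dimension zero-padding of the embedding are omitted, and the two directions committed in a special phase are orthogonalized sequentially rather than jointly.

```python
import numpy as np

def online_orpca_sketch(columns, d, k, z, xi, seed=0):
    """Illustrative sketch of the residual-based sampling update of Algorithm 1."""
    rng = np.random.default_rng(seed)
    V = []                          # committed directions (approximately orthonormal)
    U = np.zeros((d, d))            # running projector V V^T
    w = np.zeros(d); alpha = 0.0    # proxy vector / accumulated mass for non-special inliers
    t = np.zeros(d); beta = 0.0     # proxy vector / accumulated mass for non-special outliers
    outliers = set()

    def commit(vec):
        """Add the (normalized) residual direction of `vec` to the subspace."""
        nonlocal U
        res = vec - U @ vec
        nrm = np.linalg.norm(res)
        if nrm > 1e-12:
            u = res / nrm
            V.append(u)
            U = U + np.outer(u, u)

    def reset():
        nonlocal w, t, alpha, beta
        w = np.zeros(d); t = np.zeros(d); alpha = 0.0; beta = 0.0

    for i, a in enumerate(columns):
        res = a - U @ a                        # residual w.r.t. the current subspace
        p = k * float(res @ res) / xi          # sampling importance of the point
        if p < k / z:                          # squared residual below xi/z: non-special inlier
            alpha += p
            w = w + rng.choice([-1.0, 1.0]) * a
            if alpha >= 1:                     # proxy has accumulated enough residual mass
                commit(w); reset()
        else:
            outliers.add(i)                    # squared residual at least xi/z: mark as outlier
            if p < 1:                          # non-special outlier
                if rng.random() < p:
                    t = t + rng.choice([-1.0, 1.0]) * a / p
                    beta += k / z
                if beta >= 1:
                    commit(t); reset()
            elif rng.random() < k / z:         # special point (squared residual at least xi/k)
                commit(a); commit(w); reset()
        # the online embedding of a is [v @ a for v in V], zero-padded to a fixed length r
    return V, outliers
```

The three branches mirror the three categories above: cheap points only feed the random-sign proxy w, moderately distant points are marked as outliers and sampled into the proxy t, and very distant points are committed directly with probability k/z. Returning now to the additive routine of Boutsidis et al. (2015):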
The algorithm requires the knowl- Algorithm 2 Online ORPCA with Constant Approximation Require: Matrix A Rd n whose columns arrive one by one, parameters ξ, k and ϵ. Ensure: The online low-dimensional embedding yi; the subspace W at the end. 1: Initialize V , V . 2: Set output dimension l O(k /ϵ 2)+O(k log Ain 2 F ξ ), where the first term is from Theorem 4, and the second from Theorem 5. 3: while columns Ai arrive do 4: Execute Algorithm 1 with input Ai; this updates V and updates y i (V )T Ai. 5: if Π V Ai 2 < ξ z then 6: Execute a step of OPCA-ADD(Π V Ai, k , ϵ , Γ), where k , ϵ , Γ are defined in (3); this updates V , and outputs y i (V )T (Π V Ai) 7: Let W be an orthogonal basis for span(V V ) 8: end if 9: Return the embedding yi W T Ai 10: end while return W. edge of the Frobenius norm of the entire matrix A 2 F and achieved an additive error ϵ A 2 F , and maintains the subspace U by only appending the new direction if necessary when the new sample arrives. We denote it OPCA-ADD, and leverage it into Algorithm 2. We established the following theorem for OPCA-ADD algorithm and will invoke it to analyze Algorithm 2. Theorem 4. Given an input matrix A Rd n, a parameter ϵ > 0 and an upper bound Γ on A 2 F , there exist an algorithm for online PCA that, at every time i, upon seeing a vector Ai, outputs an embedding yi Rl, where l = O(k/ϵ2), and maintains a matrix V i with d rows and orthonormal columns. V i is only incremented as the algorithm proceeds. The embedding yi of the vector Ai is precisely (V i)T Ai. One has the guarantee that F A SVDk(Ain) 2 F + ϵΓ. We now formally present the constant approximation algorithm in Algorithm 2. The OPCA-ADD refers to the additive approximation of Boutsidis et al. (2015). Residual-Based Sampling for Online Outlier-Robust PCA Define the following: k = 20k log Ain 2 F ξ , (3) log Ain 2 F ξ , (4) Γ = ξ log Ain 2 F ξ . (5) In Algorithm 2, when a point Ai arrives, we first feed it to Algorithm 1 and update V as necessary. If it is non-special inlier, we apply the residual projection Π V Ai to OPCAADD and get the second set V . It is worth noting that we only execute OPCA-ADD algorithm for non-special inliers. This is because we do not need to refine the cost we have incurred on marked outliers. The output W is the union of V and V . 2.3. Remove the dependence on ξ The assumption on having a parameter ξ such that ξ OPTk is important, because it not only appears in the theoretical guarantee but also is actually required to run the algorithm. In the Algorithm 3, we show how to apply a general way to remove this assumption, at the expense of an additional factor of log Ain SVDk(Ain) 2 F ξ in the embedding dimension and the number of marked outliers. Let us denote Lδ = log Ain 2 F ξ + log (1/δ). We show the removing procedure in two steps. Firstly, we show an analog of Theorem 5 without the assumption on ξ. Then for the same incremental property, we can run the residuals through the instantiation of the algorithm in Boutsidis et al. (2015) and output the constant approximation. First step. We start with the given value of ξ0, and run Algorithm 1. If the total number of phases with the current ξ0 exceeds k Lδ, we conclude that ξ0 is too small, and double ξ0. Once ξ0 Ain SVDk(Ain) 2 F , we will no longer exceed the bound on the number of phases. The number of doubling steps needed is log Ain SVDk(Ain) 2 F /ξ. 
Since this number is bounded by Lδ, with probability at least 1 δ, the output dimension becomes O(k L2 δ) and the number of marked outliers becomes O(z L2 δ), which establishes Theorem 1. Second step. We observe that Algorithm 3 also has the incremental property. That is, we maintain the subset Vold + V and output the projection on this space. In the end, we have an O(Lδ) approximation to the error. Thereby, we can combine it with Algorithm 1 from Boutsidis et al. (2015), with the following parameters: k = 20k L2 δ, ϵ = ϵ 1 Lδ , Γ = ξ Lδ. (6) Algorithm 3 Online ORPCA Logarithmic Approximation without Parameter ξ Require: Matrix A Rd n whose columns {Ai} arrive one by one, arbitrary ξ, and parameters k, ϵ. Ensure: The online low-dimensional embedding yi; the subspace Vold at the end. 1: Initialize Vold , V , Uold , U 0d d, w 0, t 0 and running sum α 0, β 0. 2: while columns Ai arrive do 3: Invoke Algorithm 1 with input Π Vold Ai; this updates updates V . 4: Return the embedding yi V T old Ai V T Ai. 5: if number of phases (the dimension of V ) exceeds k Lδ then 6: Vold Vold V ; Uold Uold+U; V , ξ 2ξ. 7: end if 8: end while return Vold. Thus the number of columns used overall is O(k /ϵ 2) = O( k L4 δ ϵ2 ). This establishes Theorem 2. We defer the detailed proof to the appendix. 3. Performance Guarantee We state the guarantee of our algorithms. Recall that the analysis of Algorithm 3 has been given in Section 2.3. 3.1. Logarithmic approximation We start by showing the following theorem that characterizes the performance of Algorithm 1, which is almost our result of Theorem 1, except for the requirement of ξ OPTk. Theorem 5. If Assumption 1 and Assumption 2 are satisfied, and δ > 0, then with probability 1 δ, Algorithm 1 satisfies: the number of phases, and the number of columns r of the subspace V , is O(k log Ain 2 F ξ + log 1/δ). The number of points marked as outliers is O(z log Ain 2 F ξ + z k log 1/δ). The objective cost for the inlier points not marked as outliers is O(ξ log Ain 2 F ξ + ξ k log 1/δ). The running time of each step is O(d2). To ease the discussion of the theorem, we need the following definition. Definition 6. A phase is said to be a majority outlier phase if one of the following conditions hold: 1. The phase is non-special inlier and the following inequality holds (where {ui}r i=1 are the non-special in- liers in the phase): P i [r] ui A\Ain k Π V ui 2 2. The phase is non-special outlier and the following in- Residual-Based Sampling for Online Outlier-Robust PCA equality holds (where {ui}r i=1 are the non-special outliers in the phase): P i [r] ui A\Ain k z 1 If a phase is not majority outlier, then it is said to be a majority inlier phase. Observe that for a majority outlier phases, the number of outliers in the phase has the lower bound z 2k, whereas we have an upper bound for the total number of outliers z. Thereby, the number of a successful majority outliers phases must be at most O(k), otherwise the number of outliers would exceed z, which contradict to our Assumption 2. For majority inlier phases, we show that they are either special, non-special inlier or non-special outlier. To get the support of their bounds on number, we use a crucial geometric lemma proposed in Bhaskara et al. (2019). The key observation is that if each vector has a non-trivial orthogonal component to all of the preceding vectors, we can find upper bound the total number of such vectors in terms of k. Lemma 7 (Bhaskara et al. (2019)). 
Let v1, v2, ..., vr Rd be a set of linearly independent vectors, r d. Let c > 0 be any constant, and let Γ be a parameter satisfying Γ 1 c V SVDk(k) 2 F . Suppose that vi satisfy Π i 1 2 γ. Suppose additionally that γ2 2cΓ k . Then the number of columns r satisfies the bound r 2k log V 2 F 2cΓ By lemma 7, we immediately have the estimate of number of special phases, since the total number of the special points over inliers and outliers is O(k log Ain 2 F ξ + z), and we accept them with probability k/z. Recall k/z < 1, by the law of large numbers, we have the number of special phases O(k log Ain 2 F ξ ). The following definition summarizes the sense that deemed successful , in which we require w to be a proxy for the phase. We note that the analysis of non-special inlier and non-special outlier phases are similar, so we only show that of non-special inlier phase. For the full and detailed proof, readers can refer to the appendix. Lemma 8. (Non-special inlier phases with majority inliers) Let the phase be a majority inlier non-special inlier phase. Let {ui}t i=1 be the non-special inliers in the phase. Then with probability at least 1/4 we have: 8 Pt i=0 Π V ui 2 . 2. If Πk is the projection matrix orthogonal to the k-SVD space of the inliers Ain, then 3. w 2 16 Pt i=1 ui 2 Definition 9. A non-special inlier phase is said to be successful if all the inequalities in Lemma 8 are satisfied. Otherwise, it is said unsuccessful. Combining the lemma 7 and 8, we get that the number of successful majority inlier phases of non-special inlier and non-special outlier is O(k log Ain 2 F ξ ). Consider each phase as a coin toss, then the successful phase can be thought as the head result of the coin. Assume that we know the number of successful phases, by the Chernoff bound, we can see that it is very unlikely that the number of unsuccessful phase is much larger. Now we are ready to present the proof sketch of our main theorem. Proof Sketch of Theorem 5. Combined the conclusion above, we can get that with probability at least 1 δ, the total number of phases is O(k log Ain 2 F ξ + log(1/δ)). We remark that for either special, non-special inlier, non-special outlier phases, the cumulative probability over non-special inliers should be bounded by 2, which shows that the cost of inliers not marked as outliers in each phase is O(ξ/k). Thereby, the total cost over inliers not marked as outliers is O(ξ log Ain 2 F ξ + ξ k log(1/δ)). Similarly, we observe that for either phase, the number of points marked as outliers should be bounded by 2z k . Multiply it with the total number of phases we have the total number of marked outliers is O(z log Ain 2 F ξ + z k log 1/δ). For the running time, we note that in each iteration, we only need to maintain the covariance matrix U = V V T , where V Rd 1. Thus, the running time is only O(d2). 3.2. Constant approximation Then we present the guarantee of Algorithm 2. Again, this result is almost our result of Theorem 2, except for the requirement of ξ OPTk. Theorem 10. If Assumption 2 is satisfied, and we allow a second-time pass over the point, and δ > 0, then we have that with probability at least 1 δ, Algorithm 2 satisfies: the number of phase and the number of columns r to be O( k ϵ2 (log Ain 2 F ξ + log 1/δ)3). The objective cost for the output embedding Y , over the points not marked as outliers OPTk + ϵξ. Proof Sketch of Theorem 10. 
We show that the residual squared norm in Algorithm 1 is bounded by O(ξ log Ain 2 F ξ ), so we can set ϵ = ϵ/(log Ain 2 F ξ ) and satisfy the desired additive error. By Theorem 4, we have P i Ai W T yi 2 F OPTk + ϵξ (it is satisfied because we set k k, so OPTk OPTk). Thus we get the de- Residual-Based Sampling for Online Outlier-Robust PCA sired error bound. We plugged the value from Eq. (3) to the embedding dimension l, and the dominant part in the dimension is O(( k ϵ2 )(log Ain 2 F ξ + log 1/δ)3). 4. Conclusion and Future Work In this paper, we have presented a robust PCA algorithm in the online manner that builds up a low-rank embedding by adding the new directions to the subspace in each iteration, when the data is corrupted by outliers. Prior to this work, existing PCA algorithms either only considered partial assumption, or failed to present the convergence analysis within finite data. To our best knowledge, this paper is the first to provide a provable algorithm using sampling method. We raise three open questions for future work. Firstly, our work assumes that the inlier data are noiseless. It would always be interesting to examine whether our model can preserve the guarantee when the inliers are with additive noise. Secondly, the number of outliers marked by our algorithm slightly violates the bounds on the actual number of outliers z and the dimension of the subspace k by a logarithmic factor. It would be interesting to examine whether we can get a better dependence on these parameters. For example, Bhaskara & Kumar (2018) proposes an offline robust sampling algorithm that violates the number of outliers by only a factor (1 + δ), where δ is an additional input, which improves the dependence on z from a logarithmic factor to a constant ratio. Finally, our work focuses on finding low-rank approximation under Frobenius error, and it would be interesting to study whether our residual-based algorithm can solve the problem under general ℓp error. Acknowledgements We thank the anonymous reviewers and meta-reviewer for valuable comments on improving the presentation. This work is supported by NSF-IIS-1948133 and the startup funding from Stevens Institute of Technology. Allen-Zhu, Z. and Li, Y. First efficient convergence for streaming k-PCA: A global, gap-free, and near-optimal rate. In Proceedings of the 58th IEEE Annual Symposium on Foundations of Computer Science, pp. 487 492, 2017. Amid, E. and Warmuth, M. K. An implicit form of krasulina s k-PCA update without the orthonormality constraint. In Proceedings of the 34th AAAI Conference on Artificial Intelligence, pp. 3179 3186, 2020. Arora, R., Cotter, A., and Srebro, N. Stochastic optimization of PCA with capped MSG. In Proceedings of the 27th Annual Conference on Neural Information Processing Systems, pp. 1815 1823, 2013. Bhaskara, A. and Kumar, S. Low rank approximation in the presence of outliers. In Proceedings of Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques, volume 116, pp. 4:1 4:16, 2018. Bhaskara, A., Lattanzi, S., Vassilvitskii, S., and Zadimoghaddam, M. Residual based sampling for online low rank approximation. In Proceedings of the 60th IEEE Annual Symposium on Foundations of Computer Science, pp. 1596 1614, 2019. Boutsidis, C., Garber, D., Karnin, Z. S., and Liberty, E. Online principal components analysis. In Proceedings of the 2015 Annual ACM-SIAM Symposium on Discrete Algorithms, 2015. Brubaker, S. Robust PCA and clustering in noisy mixtures. pp. 1078 1087, 01 2009. Candès, E. 
J., Li, X., Ma, Y., and Wright, J. Robust principal component analysis? Journal of the ACM, 58(3):11:1 11:37, 2011. Chandrasekaran, V., Sanghavi, S., Parrilo, P. A., and Willsky, A. S. Rank-sparsity incoherence for matrix decomposition. SIAM Journal on Optimization, 21(2):572 596, 2011. Cherapanamjeri, Y., Jain, P., and Netrapalli, P. Thresholding based outlier robust PCA. In Proceedings of the 30th Conference on Learning Theory, pp. 593 628, 2017. Croux, C., Filzmoser, P., and Oliveira, M. R. Algorithms for projection-pursuit robust principal component analysis. Chemometrics and Intelligent Laboratory Systems, 87: 218 225, 06 2007. Deshpande, A. and Pratap, R. Sampling-based dimension reduction for subspace approximation with outliers. Theoretical Computer Science, 858:100 113, 2021. Deshpande, A. and Rademacher, L. Efficient volume sampling for row/column subset selection. In Proceedings of the 51th Annual IEEE Symposium on Foundations of Computer Science, pp. 329 338, 2010. Feng, J., Xu, H., Mannor, S., and Yan, S. Online PCA for contaminated data. In Proceedings of the 27th Annual Conference on Neural Information Processing Systems, pp. 764 772, 2013a. Feng, J., Xu, H., and Yan, S. Online robust PCA via stochastic optimization. In Proceedings of the 27th Annual Conference on Neural Information Processing Systems, pp. 404 412, 2013b. Residual-Based Sampling for Online Outlier-Robust PCA Garber, D. On the regret minimization of nonconvex online gradient ascent for online PCA. In Proceedings of the 32th Conference on Learning Theory, pp. 1349 1373, 2019. Garber, D. and Hazan, E. Fast and simple PCA via convex optimization. Co RR, abs/1509.05647, 2015. Hallgren, F. and Northrop, P. Incremental kernel PCA and the nyström method. Co RR, abs/1802.00043, 2018. Jain, P., Jin, C., Kakade, S. M., Netrapalli, P., and Sidford, A. Streaming PCA: matching matrix bernstein and nearoptimal finite sample guarantees for Oja s algorithm. In Proceedings of the 29th Conference on Learning Theory, pp. 1147 1164, 2016. Jalali, A. and Srebro, N. Clustering using max-norm constrained optimization. In Proceedings of the 29th International Conference on Machine Learning, 2012. Klivans, A. R., Long, P. M., and Servedio, R. A. Learning halfspaces with malicious noise. In Proceedings of the 36th International Colloquium, Automata, Languages and Programming, pp. 609 621, 2009. Krasulina, T. P. The method of stochastic approximation for the determination of the least eigenvalue of a symmetrical matrix. Ussr Computational Mathematics and Mathematical Physics, 9:189 195, 1969. la Torre, F. D. and Black, M. J. Robust principal component analysis for computer vision. In Proceedings of the 8th International Conference On Computer Vision, pp. 362 369, 2001. Lee, J. D., Recht, B., Salakhutdinov, R., Srebro, N., and Tropp, J. A. Practical large-scale optimization for maxnorm regularization. In Proceedings of the 24th Annual Conference on Neural Information Processing Systems, pp. 1297 1305, 2010. Lerman, G. and Maunu, T. An overview of robust subspace recovery. Proceedings of the IEEE, 106(8):1380 1410, 2018. Li, C.-L., Lin, H.-T., and Lu, C.-J. Rivalry of two families of algorithms for memory-restricted streaming PCA. In Proceedings of the 19th International Conference on Artificial Intelligence and Statistics, pp. 473 481, 2016. Nie, J., Kotlowski, W., and Warmuth, M. K. Online PCA with optimal regrets. In Proceedings of 24th International Conference on Algorithmic Learning Theory, pp. 98 112, 2013. Oja, E. and Karhunen, J. 
On stochastic approximation of the eigenvectors and eigenvalues of the expectation of a random matrix. Journal of Mathematical Analysis and Applications, 106(1):69 84, 1985. Ozawa, S., Pang, S., and Kasabov, N. A modified incremental principal component analysis for on-line learning of feature space and classifier. In Proceedings of the 8th Pacific Rim International Conference on Artificial Intelligence, 2004. Paul, S., Magdon-Ismail, M., and Drineas, P. Column selection via adaptive sampling. In Proceedings of the 29th Annual Conference on Neural Information Processing Systems, pp. 406 414, 2015. Shen, J., Xu, H., and Li, P. Online optimization for maxnorm regularization. In Proceedings of the 28th Annual Conference on Neural Information Processing Systems, pp. 1718 1726, 2014. Srebro, N. and Shraibman, A. Rank, trace-norm and maxnorm. In Proceedings of the 18th Annual Conference on Learning Theory, pp. 545 560, 2005. Srebro, N., Rennie, J. D. M., and Jaakkola, T. S. Maximummargin matrix factorization. In Proceedings of the 17th Annual Conference on Neural Information Processing Systems, pp. 1329 1336, 2004. Tang, C. Exponentially convergent stochastic k-PCA without variance reduction. In Proceedings of the 30th Annual Conference on Neural Information Processing Systems, pp. 12393 12404, 2019. Xu, H., Caramanis, C., and Sanghavi, S. Robust PCA via outlier pursuit. IEEE Transactions on Information Theory, 58(5):3047 3064, 2012. Xu, L. and Yuille, A. L. Robust principal component analysis by self-organizing rules based on statistical physics approach. IEEE Transactions on Neural Networks, 6(1): 131 143, 1995. Yang, T. and Wang, S. Robust algorithms for principal component analysis. Pattern Recognition Letters, 20(9): 927 933, 1999. Residual-Based Sampling for Online Outlier-Robust PCA A. Restatement of Useful Notations Recall we have a set of n points with d features, represented by a matrix A Rd n, and a low-rank embedding dimension k < min{d, n}. A can be partitioned into inliers and the outliers. That is, A = Ain Aout. We assume there exists an upper bound z on the number of outliers: |Aout| z. We also set ξ be the approximation error over Ain, and SVDk(Ain) be the optimal rank-k approximation of Ain, ξ Ain SVDk(Ain) 2 F . B. Omitted Proofs for Tightness In this section, we show an example to explain why the logarithmic term log Ain 2 F ξ is a unavailable for residual-based sampling algorithm. Lemma 11 (Bhaskara et al. (2019)). Let k = 1, and let t, z > 2 be parameters that will be fixed shortly. There exists a matrix A of dimensions t t, such that A 2 F = z2t, the rank-1 approximation error A SVD-1(A) 2 F 2t2/z2, and further, such that every column of A has a squared projection at least 1 orthogonal to the previous columns. Proof. We choose the matrix in the following, where z > 2. It is easy to see that each column has a length 1 orthogonal to the previous columns. The bound on the Frobenius norm is also easy to check. 1 z2 z3 zt 1 0 1 z2 zt 2 Let σ1 σ2 σt 0 be the singular values of M. We demonstrate an explicit subspace S of dimension t 1 such that max x S, x =1 Mx 2 2t By the min-max characterization of singular values, this implies the desired claim. Now define S = {x Rt : x1 zt 1 + x2 zt 2 + + xt = 0} Take any unit vector x S. Consider the ith coordinate of Mx. This is precisely (Mx)i = xi + xi+1 z + + xt zt 1 (xt 1 where we used the definition of x S. 
Thus by Cauchy-Schwartz, we have that (Mx)2 i t(x2 t 1 z2 + x2 t 2 z4 + + x2 1 z2t 2 ) Since z > 2, we have P i(Mx)2 i 2t i x2 i , thus proving σ2 2 2t z2 . This immediately implies that A SVDk(A) 2 F Assume z = 2t, and ξ = 1. Now we know that A SVDk(A) 2 F ξ. If we run an algorithm that samples the columns (as they arrive) with probability min (1, Π Ai 2 2 /ξ) (as in our algorithms, Π is the projection orthogonal to the chosen columns), this algorithm will indeed pick all the columns. Thus the algorithm is choosing Ω(k log Ain 2 F ξ / log log Ain 2 F ξ ) columns. This matches the upper bound up to a log Ain 2 F ξ factor. Residual-Based Sampling for Online Outlier-Robust PCA C. Omitted Proofs for Logarithmic Approximation Algorithm We can get the number of majority outlier phases immediately from the Definition 6. Lemma 12. The total number of majority outlier phases is 2k. Proof. Suppose we have t majority outliers phases. Let us consider one of these phases. If the phase is special or non-special outlier, and all points in that phase are v1, v2, ..., vr, by the definition we would have P i [r] vi V \Vin k z 1 If the phase is non-special inlier, and all points in that phase are v1, v2, ..., vr, by definition we would have P i [r] vi V \Vin k Π V v 2 2, and since k z k Π V v 2 ξ , we can also obtain that P i [r] vi V \Vin k z 1 Therefore we can see that in either case, the number of outlier points in a phase is at lease z 2k. Suppose that t > 2k, we would have the number of outliers > z, which contradicts to the setting of the upper bound z on the number of outliers. This implies t 2k. That is, the number of majority outliers phases is 2k. Before talking about the majority inlier phase, we state the non-trivial Geometric Lemma, which would be used in the majority inlier phase analysis. Lemma 13 (Restatement of Lemma 7). . Let v1, v2, ..., vr Rd be a set of linearly independent vectors, r d. Let c > 0 be any constant, and let Γ be a parameter satisfying Γ 1 c V V (k) 2 F . Suppose that vi satisfy Π i 1 2 γ. Suppose additionally that γ2 2cΓ k . Then the number of columns r satisfies the bound r 2k log V 2 F 2cΓ Proof. Let K be the parallelopiped formed by the columns of V . It is well-known that vol(K) = p det(V T V ). The volume of the parallelopiped can also be computed iteratively using the base times height formula. If ℓi is the length of the projection of vi orthogonal to span{v1, ..., vi 1}, then the volume is precisely Πr i=1ℓi. In our case, this is at least γr by hypothesis. Let σ1 σ2 ... σr be the singular values of the matrix V . Now, if V V (k) 2 F cΓ, then σ2 2k+1 cΓ k . For suppose not, then: σ2 k+1 + σ2 k+2 + ... + σ2 2k+1 (k + 1)σ2 2k+1 cΓ, contradicting the bound on V V (k) 2 F . Next, using the formula for the volume, we have Π2k i=1σi cΓ (r 2k)/2 γr. Now, using standard convexity, we have P2k i=1 σ2 i 2k Combine the two equations above, we have V 2 F 2cΓ Take logarithms on both sides we have k log ( V 2 F 2cΓ ) r Residual-Based Sampling for Online Outlier-Robust PCA Thereby we get r 2k log ( V 2 F 2cΓ ) where we assume γ2 2cΓ k , so log ( γ2k cΓ ) log 2, which gives the desired bound r 2k log ( V 2 F 2cΓ ). Applying the Geometric Lemma, we can get the number of mjaority inliers phase. Lemma 14. The total number of majority inliers special phase is 2k log A2 in 2ξ . Proof. We claim that for the arriving point Ai, p Ai = k Π V Ai 2 ξ , where n 2 is a constant number. Let T be the matrix consisting of all the columns that made the phases special. 
Note that any column of T must have squared projection nξ/k orthogonal to the span of all previous columns of T. T is the subset of A, so we have T 2 F Ain 2 F , and T T (k) 2 F Ain SVDk(Ain) 2 F ξ. Thus the hypothesis of lemma 7 holds. This implies that #cols(T) 2k log T 2 F nξ 2k log T 2 F 2ξ 2k log Lemma 15 (Restatement of Lemma 8). (Non-special inlier phases with majority inliers) Let the phase be a majority inlier non-special inlier phase. Let {ui}t i=1 be the non-special inliers in the phase. Then with probability at least 1/4 we have: 8 Pt i=0 Π V ui 2 . 2. If Πk is the projection matrix orthogonal to the k-SVD space of the inliers Ain, then 3. w 2 16 Pt i=1 ui 2 Proof. By definition, w = Pt i=1 Xi Ai where Xi is the uniformly at random sign for vector Ai. In the end of the phase, we have Π V w = Pt i=1 XiΠ V Ai. Denote the random variable Z to be Π V w 2, we can obtain Z = (Pt i=1 Π V Xi Ai)2 = Pt i=1 Π V Ai 2 + 2 P i 0, with probability at least 1 δ, the total number of majority inliers non-special phases is O(k log Ain 2 F ξ + log(1/δ)). Proof. We consider each phase as coin toss. A head in the coin toss is associated with the phase being a successful one. For majority inliers non-special inlier phases, each phase has a probability 1/4 of being successful. Applying it to the Theorem 19, with probability 1 δ/2, the bound on the number of majority inliers non-special inlier phases is O(k k log Ain 2 F ξ + log(1/δ)). Similarly, for majority inliers non-special outlier phases, each phase has a probability 1/80 of successful. We also have that with probability 1 δ/2, the bound on the number of majority inliers non-special outlier phases is O(k k log Ain 2 F ξ + log(1/δ)). Then the probability that both bounds hold is (1 δ/2)2 1 δ, and the number of the majority inliers non-special phases is O(k log Ain 2 F ξ + log(1/δ)). Lemma 21. For any δ > 0, with probability at least 1 δ, the total number of phases is O(k log Ain 2 F ξ + log(1/δ)). Proof. Combining the Lemmas 12, 14 and 20, we can immediately prove this lemma. Then, we get the number of columns in subspace, which is also the embedding dimension, |V |. Lemma 22. Given any δ > 0, with probability at least 1 δ, the number of columns in subspace and the embedding dimension is O(k log Ain 2 F ξ + log(1/δ)). Proof. Every special phases and non-special outlier phases output at most two columns, while the non-special inlier phases output one column, so we have additional O(k log Ain 2 F ξ ) number of columns compared with the number of phases, which implies this lemma. Then we are ready to prove our error guarantee and the number of marked outliers. Lemma 23. Define the inlier points not marked as outliers as Ain\M. Given any δ > 0, in the end, with probability 1 δ, min Φ Rd r,ΦT Φ=I Ain\M ΦY 2 F ξ O(log Ain 2 F ξ + log (1/δ)/k). Proof. Firstly, we show the bound on the cost in each phase. To this end, let A1, ..., Ar be the points in a phase, and let Vpre be the basis vector of points at the start of the phase, and Vcur be the basis vector of points after the vector Ar has been processed. We now consider two cases, If the phase is non-special inlier: X i [r] Ai Vin\M Π Vcuru 2 X i [r] Ai Vin\M Π Vpreu 2 X i [r] Ai Vin\M p Ai ξ k 2 ξ The last inequality follows from that the sum of the selection probabilities of non-special inliers in a non-special inlier phase is 2. 
Residual-Based Sampling for Online Outlier-Robust PCA If the phase is special or non-special outlier: i [r] Ai Vin\M Π Vcuru 2 X i [r] Ai Vin\M Π Vpreu 2 X i [r] Ai Vin\M p Ai ξ k ξ The last inequality follows from that the sum of the selection probabilities of non-special inliers or outliers is < 1. This implies that in either case P i [r] Ai Vin\M Π Vcuru 2 ξ k. Combined with the bound on the number of phases, we have that with probability at least 1 δ, the total cost over the inlier points which are not marked as outliers is ξ O(log Ain 2 F ξ + log (1/δ)/k). Lemma 24. The number of marked outliers satisfies |M| z O(log Ain 2 F ξ + log (1/δ) Proof. We will first show the bound on number of points marked as outliers in each phase. Let the set of points marked as outliers be M. We consider three cases. case 1: The phase is non-special inliers: X The inequality follows from the definition that if the phase is non-special inlier, then β < 1. case 2: The phase is non-special outlier: X The inequality follows from the definition that if the phase is non-special outlier, then 1 β < 2. case 3: The phase is special: X i [r 1] u M Therefore, we can see that in case 1 and case 2, |i [r] u M| 2z k . In case 3, we have the bound on the number of phase being 1 + z k, while by our setting z k > 1, we also obtain |i [r 1] u M| + 1 2z k . Combined with the bound on the number of phases, with probability at least 1 δ, |M| 2z k O(k L + log(1/δ)) = z O(log Ain 2 F ξ + log(1/δ) k ). So we have the desire bound on the second term. Theorem 25 (Restatement of Theorem 5). If Assumption 1 and Assumption 2 are satisfied, and δ > 0, then with probability 1 δ, Algorithm 1 satisfies: the number of phases, and the number of columns r of the subspace V , is O(k log Ain 2 F ξ + log 1/δ). The number of points marked as outliers is O(z log Ain 2 F ξ + z k log 1/δ). The objective cost for the inlier points not marked as outliers is O(ξ log Ain 2 F ξ + ξ k log 1/δ). The running time of each step is O(d2). Proof. This is an immediate result by combining Lemmas 22, 24 and 23. The proof for running time has been stated in the last part of Section 2.1. D. Omitted Proofs for Constant Approximation Algorithm We remark that since we only process the marked inliers to Algorithm 2, the guarantee for the number of marked outliers is identical. Thus in the section, we only show the analysis of output embedding and the total cost over inliers not marked as outliers. Lemma 26. The cost for the output embedding Y , over the points not marked as outliers is OPTk + ϵξ. Residual-Based Sampling for Online Outlier-Robust PCA Proof. In the ith iteration, let V i denote the subspace that Algorithm 1 maintains, and r i denote the residual of the Algorithm 1, r i = Π V i Ai, which is also the input of the Algorithm 2. Let V i denote the subspace that Algorithm 2 maintains and r i denote its residual, r i = Π V i r i. Let R denote the matrix whose ith column is r i, and R denote the matrix whose ith column is r i . For our choice of Γ, we have that with probability at least 1 δ, Then with probability at least 1 δ, we can get the error guarantee from Theorem 4 r i 2 R (R )(k ) + ϵ Γ. (8) Based on the notion of residual, we have that r i = Ai V i y i, and according to Theorem 5, V i at any time contains at most O(k L + k + log(1/δ)). Thus the rank-k approximation to the matrix R has error at most the rank-k approximation to the matrix Ai. Namely, R R (k ) 2 Ain SVDk(Ain) 2 (9) Based on Equation 3, we have ϵξ = ϵ Γ. 
Combining it with Equation 8 and 9, we get: r i 2 Ain SVDk(Ain) 2 F + ϵξ (10) As a final step, we observe that r i = r i V i y i = Ai V i y i V i y i . This is the difference of Ai and a liner combination of the space V i V i . Thus the length of r i is upper bounded by the distance of Ai to the span of the space V i n V i n, which is Ai Wi W T i Ai. Namely, we get: Ai Wi W T i Ai 2 X r i 2 . (11) Combining Equation 10 with 11, and since Y = P i W T i Ai, we get A WY 2 F Ain SVDk(Ain) 2 F + ϵξ, which completes the proof. Lemma 27. For any δ > 0, with probability at least 1 δ, the number of phase and the number of output dimension r is O( k ϵ2 (log Ain 2 F ξ + log 1/δ)3). Proof. Since the dominating term in output dimension l is O(k /ϵ ). Plugging in the values from Equation 3 completes the proof. Theorem 28 (Restatement of Theorem 10). If Assumption 1 and Assumption 2 are satisfied, and δ > 0, then we have that w.p. at least 1 δ, Algorithm 2 satisfies: the number of phase and the number of output dimension r is O( k ϵ2 (log Ain 2 F ξ + log 1/δ)3). The objective cost for the output embedding Y , over the points not marked as outliers is OPTk + ϵξ. Proof. This is an immediate result by combining Lemmas 26 and 27. Residual-Based Sampling for Online Outlier-Robust PCA E. Omitted Proofs for Algorithm without the dependence on ξ E.1. Proof of Theorem 1 In this section, we present the proof for guarantee of the Logarithmic Approximation algorithm without the assumption ξ OPTk. The guarantee for Constant Approximation has been explained in Section 2.3. Recall Lδ = log Ain 2 F ξ + log (1/δ), and 0 < δ 1. Lemma 29. For any δ > 0, with probability at least 1 δ, the output dimension is O(k L2 δ). Proof. Assume we have a set of vectors A = {A1, A2, ..., An}. For a fixed subspace T, let B = {B1, B2, ..., Bn} denote the orthogonal projections from B to T, that is, Bi = Π T Ai. By the fact that the rank-k error only reduces upon projection, we get A A(k) 2 F . By Algorithm 3, the guessing approximation error ξj does not increase once ξj A A(k) 2 F , so the total number of the doubling is log F ξ , which is trivially bounded by Lδ, so the output dimension is bounded by Lδ O(k Lδ) = O(k L2 δ). Lemma 30. Suppose we have the initial approximation error ξ0, and δ > 0, in the end, with probability at least 1 δ, the cost over inliers not marked as outliers is at most O(ξ0Lδ). Proof. After doubling j times, the guessing approximation error is ξj = 2jξ0. Since the number of phases of each ξj is bounded by k Lδ, and the cumulative residual errors over non-special inliers in each phase is < 2ξj k , we conclude that the total residual error over inliers not marked as outliers is at most k + ... + 2j+1ξ0 Lemma 31. For any δ > 0, with probability at least 1 δ, the total number of marked outliers is O(z L2 δ). Proof. Since the number of doublings is bounded by Lδ, the number of phases of each ξj is bounded by k Lδ, and the number of marked outliers in each phase is < 2z/k, we conclude that the total number of marked outliers is bounded by k Lδ (2z/k + 2z/k + ... + 2z/k) = O(z L2 δ) This establishes Theorem 1. F. Numerical Results In this section, we report some numerical results on synthetic data. Our goal is to illustrate the properties of the online robust algorithm discussed in section 1, and compare our residual-based sampling for online robust PCA algorithm with the algorithm in Feng et al. (2013a), which uses matrix decomposition to update the subspace in each iteration. 
To make a fair comparison, we simulate the contaminated data as follows. We randomly generate a d × k matrix A and scale it so that the magnitudes of its leading eigenvalues equal 2. Then we multiply A with another uniformly generated matrix X ∈ R^{k×n} to form L = AX. A fraction λ of the columns are outliers generated from the uniform distribution over [−20, 20], where z = λn is the number of outliers. The algorithm returns a subspace V after receiving each sample, and we use it to compute the residual loss over inliers, ‖Ain − V Vᵀ Ain‖F, as the performance measure.

We first show simulations with n = 1000 samples and d = 500 dimensions. Simulation results for the optimal low rank k = 5 with different numbers of outliers, z = 100, 150, 200, are shown in Figure 1. While the results show that our algorithm recovers the low-rank structure of the inlier points, we also care about the dimension of the output embedding and the number of marked outliers. By the bicriteria approximation, we do not expect these numbers to be much larger than the true values. Moreover, we expect our algorithm to enjoy a better computation time than Feng et al. (2013a); Tables 2, 3 and 4 support this. Figure 2 and Tables 5 and 6 show results of a similar numerical study with n = 1000 samples and d = 1000 dimensions, where we observe a similar trend.

[Figure 1: residual loss vs. number of samples (x-axis: Samples, y-axis: Residual Loss; legend: Feng et al. (2013a), Our Algorithm) for z = 100, 150, 200.]
Figure 1. Performance comparison of our algorithm (red line) with the algorithm in Feng et al. (2013a) (blue line). Here d = 500, n = 1000, k = 5. We observe that the algorithm in Feng et al. (2013a) converges faster, but our algorithm attains a more accurate approximation. We also notice that the red curve is a broken (piecewise) line, while the blue curve is smooth. This makes sense because our algorithm only updates the subspace when a new direction is informative, while Feng et al. (2013a) updates the subspace in almost every iteration.

[Figure 2: residual loss vs. number of samples (x-axis: Samples, y-axis: Residual Loss; legend: Feng et al. (2013a), Our Algorithm).]
Figure 2. Performance comparison of our algorithm (red line) with Feng et al. (2013a) (blue line). Here d = 1000, n = 1000, k = 5.

Table 2. Comparison of the embedding dimension, marked outliers, and execution time of our algorithm and Feng et al. (2013a) when d = 500, k = 5, z = 100. We observe that our algorithm sacrifices some embedding dimension and number of marked outliers to obtain the approximation. In contrast, Feng et al. (2013a) keeps the dimension, but marks too few points as outliers. We also show that our algorithm is faster than Feng et al. (2013a).

Algorithm | Embedding Dimension | #Marked Outliers | Running Time (s)
Our algorithm | 12 | 132 | 7.37
Feng et al. (2013a) | 5 | 29 | 41.52

Table 3. Comparison of the embedding dimension, marked outliers, and execution time when d = 500, k = 5, z = 150.

Algorithm | Embedding Dimension | #Marked Outliers | Running Time (s)
Our algorithm | 14 | 235 | 9.26
Feng et al. (2013a) | 5 | 50 | 43.79

Table 4. Comparison of the embedding dimension, marked outliers, and execution time when d = 500, k = 5, z = 200.

Algorithm | Embedding Dimension | #Marked Outliers | Running Time (s)
Our algorithm | 6 | 564 | 8.42
Feng et al. (2013a) | 5 | 58 | 50.96
Table 5. Comparison of the embedding dimension, marked outliers, and execution time when d = 1000, k = 5, z = 100.

Algorithm | Embedding Dimension | #Marked Outliers | Running Time (s)
Our algorithm | 10 | 358 | 35.14
Feng et al. (2013a) | 5 | 20 | 325.01

Table 6. Comparison of the embedding dimension, marked outliers, and running time when d = 1000, k = 5, z = 150.

Algorithm | Embedding Dimension | #Marked Outliers | Running Time (s)
Our algorithm | 14 | 391 | 29.08
Feng et al. (2013a) | 5 | 46 | 292.80
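For reference, a minimal numpy sketch of the data generation and the residual-loss metric described at the beginning of this section is given below. The spectral-norm scaling of the rank-k factor and the [0, 1) range of X are our illustrative choices where the text leaves details unspecified; everything else follows the description above.

```python
import numpy as np

def make_contaminated_data(d=500, n=1000, k=5, lam=0.1, seed=0):
    """Rank-k inliers L = B X with the leading factor scaled to magnitude 2,
    plus a fraction `lam` of outlier columns drawn uniformly from [-20, 20].
    (B plays the role of the d x k factor called A in the text.)"""
    rng = np.random.default_rng(seed)
    B = rng.standard_normal((d, k))
    B = 2.0 * B / np.linalg.norm(B, 2)      # scale so the leading singular value is 2
    X = rng.uniform(size=(k, n))            # uniformly generated coefficient matrix
    A = B @ X                               # rank-k inlier matrix
    z = int(lam * n)                        # number of outliers, z = lam * n
    out_idx = rng.choice(n, size=z, replace=False)
    A[:, out_idx] = rng.uniform(-20.0, 20.0, size=(d, z))
    inlier_mask = np.ones(n, dtype=bool)
    inlier_mask[out_idx] = False
    return A, inlier_mask

def residual_loss(A, inlier_mask, V):
    """Residual loss over inliers, ||A_in - V V^T A_in||_F, for a subspace with
    orthonormal columns V of shape (d, r)."""
    A_in = A[:, inlier_mask]
    return float(np.linalg.norm(A_in - V @ (V.T @ A_in)))
```

After each sample is processed, the subspace returned so far can be plugged into residual_loss to trace curves of the kind shown in Figures 1 and 2.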