# Nearly Optimal Robust Subspace Tracking

Praneeth Narayanamurthy¹, Namrata Vaswani¹

¹Department of Electrical and Computer Engineering, Iowa State University, USA. Correspondence to: Namrata Vaswani. Proceedings of the 35th International Conference on Machine Learning, Stockholm, Sweden, PMLR 80, 2018. Copyright 2018 by the author(s).

Robust subspace tracking (RST) can be simply understood as a dynamic (time-varying) extension of robust PCA. More precisely, it is the problem of tracking data lying in a fixed or slowly changing low-dimensional subspace while being robust to sparse outliers. This work develops a recursive projected compressive sensing algorithm called Nearly Optimal RST (NORST), and obtains one of the first guarantees for it. We show that NORST provably solves RST under weakened standard RPCA assumptions, slow subspace change, and a lower bound on (most) outlier magnitudes. Our guarantee shows that (i) NORST is online (after initialization) and enjoys near-optimal values of tracking delay, lower bound on required delay between subspace change times, and memory complexity; and (ii) it has a significantly improved worst-case outlier tolerance compared with all previous robust PCA or RST methods, without requiring any model on how the outlier support is generated.

1. Introduction

According to its modern definition (Candès et al., 2011), robust PCA (RPCA) is the problem of decomposing a data matrix into the sum of a low-rank matrix (true data) and a sparse matrix (outliers). The column space of the low-rank matrix then gives the desired principal subspace (PCA solution). A common application is in video analytics, in separating a video into a slow-changing background image sequence (modeled as a low-rank matrix) and a foreground image sequence consisting of moving objects or people (sparse) (Candès et al., 2011). Many fast and provably correct RPCA approaches have appeared in recent years: PCP (Candès et al., 2011; Chandrasekaran et al., 2011; Hsu et al., 2011), AltProj (Netrapalli et al., 2014), RPCA-GD (Yi et al., 2016), and NO-RMC (Cherapanamjeri et al., 2016).

Robust Subspace Tracking (RST) can be simply interpreted as a dynamic (time-varying) extension of RPCA. It assumes that the true data lie in a low-dimensional subspace that can change with time, albeit slowly. The goal is to track this changing subspace over time in the presence of sparse outliers. The offline version of this problem can be called dynamic (or time-varying) RPCA. RST requires the tracking delay to be small, while dynamic RPCA does not. A time-varying subspace is a more appropriate model for long data sequences, e.g., long surveillance videos: if one tries to use a single low-dimensional subspace to represent the entire data sequence, the required dimension may end up being quite large. Short tracking delays are critical for applications where real-time or near real-time estimates are needed, e.g., video-based surveillance/tracking, or detecting anomalies in dynamic social networks (Ozdemir et al., 2017). While standard RPCA has been extensively studied in recent years, there is much less work on provable dynamic RPCA or RST; it consists only of original-ReProCS (Qiu et al., 2014; Lois & Vaswani, 2015; Zhan et al., 2016), modified-PCP (Zhan & Vaswani, 2015), and simple-ReProCS (Narayanamurthy & Vaswani, 2018a). Another related approach that comes with a partial guarantee¹ is ORPCA (Feng et al., 2013).

Modeling subspace change.
All existing guarantees for subspace tracking algorithms (except our previous RST work mentioned above) assume the statistically stationary setting of data being generated from a single unknown subspace; and almost all of them are partial guarantees. All work on subspace tracking without outliers, and with or without missing data, e.g., (Yang, 1995; Chi et al., 2013; Zhang & Balzano, 2016), is in this category. On the other hand, the most general (nonstationary) model that allows the subspace to change at each time is not even identifiable, since at least r data points are needed to compute an r-dimensional subspace even in the ideal setting of no noise or missing entries. When the data is noisy, missing, or outlier corrupted, even more data points are needed. Since only one data vector comes in at each time t, the only way to ensure this is to assume that the subspace remains constant for a while and then changes (piecewise constant). We first introduced this assumption in (Qiu et al., 2014). Here, we use a weaker version of the same model.

¹Needs assumptions on intermediate algorithm estimates.

Robust Subspace Tracking (RST) and Dynamic RPCA Problem. At each time $t$, we get a $y_t \in \mathbb{R}^n$ that satisfies

$y_t := \ell_t + x_t + v_t$, for $t = 1, 2, \dots, d$,

where $v_t$ is small unstructured noise, $x_t$ is the sparse outlier vector, and $\ell_t$ is the true data vector that lies in a fixed or slowly changing low-dimensional subspace of $\mathbb{R}^n$, i.e., $\ell_t = P_{(t)} a_t$, where $P_{(t)}$ is an $n \times r$ basis matrix (matrix with mutually orthonormal columns) with $r \ll n$ and with $\|(I - P_{(t-1)} P_{(t-1)}{}') P_{(t)}\|$ small compared to $\|P_{(t)}\| = 1$. Here, and elsewhere, $\|\cdot\|$ denotes the induced $\ell_2$ norm and $'$ denotes transpose. We use $\mathcal{T}_t$ to denote the support set of $x_t$. Given an initial subspace estimate, $\hat P_0$, the goal is to track $\mathrm{span}(P_{(t)})$ and $\ell_t$ either immediately or within a short delay. A by-product is that $\ell_t$, $x_t$, and $\mathcal{T}_t$ can also be tracked on-the-fly. The initial subspace estimate, $\hat P_0$, can be computed by applying any of the solutions for static RPCA, e.g., PCP or AltProj, to the first roughly $r$ data points, $Y_{[1, t_{\mathrm{train}}]}$ with $t_{\mathrm{train}} = Cr$. Here and below, $[a, b]$ refers to all integers between $a$ and $b$, inclusive, $[a, b) := [a, b-1]$, and $M_{\mathcal{T}}$ denotes the sub-matrix of $M$ formed by its columns indexed by entries in the set $\mathcal{T}$.

Dynamic RPCA is the offline version of the above problem. Define matrices $L, X, V, Y$ with $L = [\ell_1, \ell_2, \dots, \ell_d]$ and $Y, X, V$ similarly defined. The goal is to recover the matrix $L$ and its column space with $\epsilon$ error. We use $r_L$ to denote the rank of $L$. The maximum fraction of nonzeros in any row (column) of the outlier matrix $X$ is denoted by max-outlier-frac-row (max-outlier-frac-col). For basis matrices $P_1, P_2$, we use $\mathrm{SE}(P_1, P_2) := \|(I - P_1 P_1{}') P_2\|$ as a measure of Subspace Error (distance) between their respective column spans. This is the sine of the largest principal angle between the subspaces. We reuse the letters $C, c$ to denote different numerical constants in each use, with $C \ge 1$ and $c < 1$.

Identifiability. The above problem definition does not ensure identifiability since either of $L$ or $X$ can be both low-rank and sparse. Also, if the subspace changes at every time, it is impossible to correctly estimate all the subspaces. One way to ensure identifiability of changing subspaces is to assume that they are piecewise constant, i.e.,

$P_{(t)} = P_j$ for all $t \in [t_j, t_{j+1})$, $j = 1, 2, \dots, J$,

and to lower bound $d_j := t_{j+1} - t_j$. Let $t_0 = 1$ and $t_{J+1} = d$.
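For concreteness, the following NumPy sketch (ours, not taken from the paper) generates data from this piecewise-constant model $y_t = \ell_t + x_t + v_t$ and evaluates the SE metric defined above; all dimensions, the noise level, and the outlier statistics are illustrative placeholders.

```python
import numpy as np

def subspace_error(P_hat, P):
    """SE(P_hat, P) = ||(I - P_hat P_hat') P||: sine of the largest principal angle."""
    proj = P - P_hat @ (P_hat.T @ P)            # (I - P_hat P_hat') P
    return np.linalg.norm(proj, 2)              # largest singular value

def orth(A):
    """Orthonormal basis for the column span of A."""
    Q, _ = np.linalg.qr(A)
    return Q

rng = np.random.default_rng(0)
n, r, d = 500, 10, 2000
t_change = 1000                                  # single change time (piecewise-constant model)
P1 = orth(rng.standard_normal((n, r)))
P2 = orth(P1 + 0.01 * rng.standard_normal((n, r)))   # slightly rotated subspace (slow change)

Y = np.zeros((n, d))
for t in range(d):
    P = P1 if t < t_change else P2
    a = rng.uniform(-1.0, 1.0, r)                # bounded coefficients a_t
    x = np.zeros(n)
    support = rng.choice(n, size=n // 20, replace=False)
    x[support] = rng.uniform(10, 20, size=support.size)  # sparse outliers
    v = 1e-4 * rng.standard_normal(n)            # small unstructured noise
    Y[:, t] = P @ a + x + v

print(subspace_error(P1, P2))                    # small value => slow subspace change
```

Since SE is the sine of the largest principal angle, it is 0 for identical subspaces and 1 when some direction of the second subspace is orthogonal to the first.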
With this model, $r_L = rJ$ in general (except if subspace directions are repeated). One can ensure that $X$ is not low-rank by imposing upper bounds on max-outlier-frac-col and max-outlier-frac-row. For the RST problem, max-outlier-frac-col $:= \max_t |\mathcal{T}_t|/n$. Consider max-outlier-frac-row. Since RST is a tracking problem, it involves processing incoming data either one sample at a time or using a mini-batch of $\alpha$ samples at a time. The subspace update step of our proposed algorithm does the latter. Because of this, we need to replace a bound on max-outlier-frac-row by a bound on max-outlier-frac-row$^\alpha$, defined as follows.

Definition 1.1. Define max-outlier-frac-row$^\alpha$ as the maximum fraction of outliers (nonzeros) per row of any sub-matrix of $X$ with $\alpha$ consecutive columns. For a time interval $\mathcal{J}$, let $\gamma(\mathcal{J}) := \max_{i=1,2,\dots,n} \frac{1}{|\mathcal{J}|} \sum_{t \in \mathcal{J}} \mathbb{1}\{i \in \mathcal{T}_t\}$, where $\mathbb{1}_S$ is the indicator function for statement $S$. Thus $\gamma(\mathcal{J})$ is the maximum outlier fraction in any row of the sub-matrix $X_{\mathcal{J}}$ of $X$. Let $\mathcal{J}^\alpha$ denote a time interval of duration $\alpha$. Then max-outlier-frac-row$^\alpha := \max_{\mathcal{J}^\alpha \subseteq [1,d]} \gamma(\mathcal{J}^\alpha)$.

One way to ensure that $L$ is not sparse is by requiring that its left and right singular vectors are dense (non-sparse) or incoherent w.r.t. a sparse vector. This is quantified as follows (Candès et al., 2011).

Definition 1.2. An $n \times r_P$ basis matrix $P$ is $\mu$-incoherent if $\max_i \|P^{(i)}\|^2 \le \mu \frac{r_P}{n}$ ($P^{(i)}$ is the $i$-th row of $P$).

Using our subspace model, the union of the column spans of all the $P_j$'s is equal to the span of the left singular vectors of $L$. Thus, for RST, left incoherence is equivalent to assuming that the $P_j$'s are $\mu$-incoherent. We replace right incoherence by statistical assumptions on the $a_t$'s: assume that the $a_t$'s are element-wise bounded, mutually independent, have identical covariances, and are zero mean. See Remark 3.5 of (Narayanamurthy & Vaswani, 2018b) for the connection.

Contributions. (1) We develop a novel algorithm for RST and dynamic RPCA that we call Nearly Optimal RST, or NORST for short. NORST relies on the previously introduced recursive projected compressive sensing solution framework (Qiu et al., 2014), but has a significantly improved, and simpler, subspace update step. Our most important contribution is the first provable guarantee for RST that ensures near-optimal tracking delay and needs a near-optimal lower bound on how long a subspace needs to remain constant. Both equal $Cr \log n \log(1/\epsilon)$. It also needs no other model on how the subspace changes, and still tolerates an $O(1)$ maximum fraction of outliers in any row and an $O(1/r)$ fraction in any column. Of course it uses extra assumptions: slow subspace change and a lower bound on most outlier magnitudes.

Figure 1. Background Recovery. (Columns: original video frame, NORST, AltProj, RPCA-GD, GRASTA, PCP; time taken per frame for the two sequences: 16.5/85.4 ms, 26.0/95.7 ms, 29.5/122.5 ms, 2.5/22.6 ms, 44.6/318.3 ms.) NORST is the only technique that works for both sequences. Row 2: only NORST does not contain the person or even his shadow. NORST is faster than all except GRASTA (which does not work). The GRASTA output slightly lags the actual frame, and hence it appears that the person is sitting.

Finally, NORST is provably online (after initialization), fast (has the same complexity as vanilla r-SVD), and nearly memory-optimal: its memory complexity is of order $nr \log n \log(1/\epsilon)$.
Here the term online means the following: after each subspace change, the algorithm updates the subspace estimate every $\alpha = Cr \log n$ time instants, and the subspace recovery error bound decays exponentially with each such step². This means one gets an $\epsilon$-accurate estimate within $C \log(1/\epsilon)$ steps.

(2) Our guarantees hold under weakened standard RPCA assumptions, slow subspace change, and a lower bound on (most) outlier magnitudes. We say weakened because we show that, after initialization, NORST can tolerate a constant maximum fraction of outliers per row without needing any assumptions on the outlier support. For the video application, this implies that it tolerates slow-moving and occasionally static foreground objects much better than all other existing RPCA approaches. This fact is also corroborated on real videos, e.g., see Fig. 1. All other existing RPCA approaches (except simple-ReProCS) need more: AltProj, RPCA-GD, and NO-RMC need this fraction to be $O(1/r_L)$; PCP and mod-PCP need the outlier support to be uniformly random (a strong requirement: for video it implies that objects are very small and jump around randomly); and original-ReProCS needs it to satisfy a specific moving object model described later (very restrictive). Moreover, NORST can also detect a subspace change within a short delay of $Cr \log n$.

(3) Unlike previous dynamic RPCA work, NORST only needs a coarse subspace initialization, which can be computed using at most $C \log r$ iterations of any batch RPCA method such as AltProj applied to $Cr$ initial samples. In fact, if the outlier magnitudes were very large (or if the outliers were absent) for an initial set of $O(r \log n \log r)$ time instants, even a random initialization would suffice.

²The reason that just $O(r \log n)$ samples suffice for each update is because we assume that the $a_t$'s are bounded, $v_t$ is very small and with effective dimension $r$ or smaller (see Theorem 2.1), and both are mutually independent over time. These, along with the specific structure of the PCA problem we encounter (the noise/error seen by the PCA step depends on the $\ell_t$'s and thus has effective dimension $r$), is why so few samples suffice.

(4) A direct corollary of our result is a guarantee that a minor modification of NORST also solves subspace tracking with missing data (ST-missing) and the dynamic matrix completion problem (dynamic MC); see follow-up work (?). All existing guarantees for ST-missing hold only for the case of a single unknown subspace and are partial guarantees. Ours is a complete guarantee that provably allows tracking of changing subspaces. From the MC perspective, our solution does not assume any model on the observed entries, unlike most others. The disadvantage is that it needs more observed entries than other MC solutions.

2. The Algorithm and Main Result

Nearly-Optimal RST via ReProCS (ReProCS-NORST). ReProCS-NORST starts with a good estimate of the initial subspace, which is obtained by $C \log r$ iterations of AltProj on $Y_{[1, t_{\mathrm{train}}]}$ with $t_{\mathrm{train}} = Cr$. It then iterates between (a) Projected Compressive Sensing (CS) or Robust Regression, in order to estimate the sparse outliers $x_t$, and hence the $\ell_t$'s, and (b) Subspace Update, to update the subspace estimate $\hat P_{(t)}$. Projected CS is borrowed from original-ReProCS (Qiu et al., 2014), while the subspace update step is new and significantly simplified. Projected CS proceeds as follows.
At time $t$, if the previous subspace estimate, $\hat P_{(t-1)}$, is accurate enough, then, because of slow subspace change, projecting $y_t$ onto its orthogonal complement will nullify most of $\ell_t$. We compute $\tilde y_t := \Psi y_t$ where $\Psi := I - \hat P_{(t-1)} \hat P_{(t-1)}{}'$. Thus, $\tilde y_t = \Psi x_t + \Psi(\ell_t + v_t)$, and $\Psi(\ell_t + v_t)$ is small due to slow subspace change and small $v_t$. Recovering $x_t$ from $\tilde y_t$ is now a CS / sparse recovery problem in small noise (Candes, 2008). We compute $\hat x_{t,cs}$ using noisy $\ell_1$ minimization followed by thresholding-based support estimation to obtain $\hat{\mathcal{T}}_t$. A Least Squares (LS) based debiasing step on $\hat{\mathcal{T}}_t$ returns the final $\hat x_t$. We then estimate $\ell_t$ as $\hat\ell_t = y_t - \hat x_t$.

The $\hat\ell_t$'s are used for the Subspace Update step, which involves (i) detecting subspace change, and (ii) obtaining improved estimates of the new subspace by $K$ steps of r-SVD, each done with a new set of $\alpha$ samples of $\hat\ell_t$. While this step is designed under the piecewise constant subspace assumption (needed for identifiability), if the goal is only to get good estimates of $\ell_t$ or $x_t$, the method works even for real videos (where this assumption may or may not hold).

For ease of understanding, we present a basic version of NORST in Algorithm 1. This assumes that the change times $t_j$ are known. The actual algorithm that we study and implement, which detects these automatically, is given as Algorithm 2 in the long version (Narayanamurthy & Vaswani, 2018b).

Main Result. We use $\hat t_j$ to denote the time instant at which the $j$-th subspace change is detected.

Theorem 2.1. Consider NORST (Algorithm 2 of (Narayanamurthy & Vaswani, 2018b))³. Let $\Lambda := \mathbb{E}[a_1 a_1{}']$, $\lambda^+ := \lambda_{\max}(\Lambda)$, $\lambda^- := \lambda_{\min}(\Lambda)$, $f := \lambda^+/\lambda^-$, $\alpha := C f^2 r \log n$, and let $x_{\min} := \min_t \min_{i \in \mathcal{T}_t} |(x_t)_i|$ denote the minimum outlier magnitude. Pick an $\varepsilon \le \min(0.01,\ 0.03 \min_j \mathrm{SE}(P_{j-1}, P_j)^2 / f)$. Let $K := C \log(1/\varepsilon)$. If

1. incoherence: the $P_j$'s are $\mu$-incoherent; and the $a_t$'s are zero mean, mutually independent over time $t$, have identical covariance matrices, i.e., $\mathbb{E}[a_t a_t{}'] = \Lambda$, are element-wise uncorrelated ($\Lambda$ is diagonal), are element-wise bounded (for a numerical constant $\eta$, $(a_t)_i^2 \le \eta \lambda_i(\Lambda)$), and are independent of all outlier supports $\mathcal{T}_t$;

2. $v_t$: $\|v_t\|^2 \le c r \|\mathbb{E}[v_t v_t{}']\|$, $\|\mathbb{E}[v_t v_t{}']\| \le c \varepsilon^2 \lambda^-$, zero mean, mutually independent, independent of $x_t$, $\ell_t$;

3. max-outlier-frac-col $\le \frac{c_1}{\mu r}$, max-outlier-frac-row$^\alpha \le \frac{c_2}{f^2}$;

4. subspace change: let $\Delta := \max_j \mathrm{SE}(P_{j-1}, P_j)$; (a) $t_{j+1} - t_j > (K+2)\alpha$, and (b) $\Delta \le 0.8$ and $C_1 \sqrt{r\lambda^+}(\Delta + 2\varepsilon) \le x_{\min}$;

5. init⁴: $\mathrm{SE}(\hat P_0, P_0) \le 0.25$, $C_1 \sqrt{r\lambda^+}\,\mathrm{SE}(\hat P_0, P_0) \le x_{\min}$;

and (6) algorithm parameters are appropriately set; then, with probability (w.p.) at least $1 - 10 d n^{-10}$, at all times $t$, $\hat{\mathcal{T}}_t = \mathcal{T}_t$, $t_j \le \hat t_j \le t_j + 2\alpha$, and

$\mathrm{SE}(\hat P_{(t)}, P_{(t)}) \le (\varepsilon + \Delta)$ if $t \in [t_j, \hat t_j + \alpha)$, $\le (0.3)^{k-1}(\varepsilon + \Delta)$ if $t \in [\hat t_j + (k-1)\alpha, \hat t_j + k\alpha)$, and $\le \varepsilon$ if $t \in [\hat t_j + K\alpha + \alpha, t_{j+1})$;

and $\|\hat x_t - x_t\| = \|\hat\ell_t - \ell_t\| \le 1.2\,(\mathrm{SE}(\hat P_{(t)}, P_{(t)}) + \varepsilon)\,\|\ell_t\|$.

Offline-NORST (lines 26-30): $\mathrm{SE}(\hat P^{\mathrm{off}}_{(t)}, P_{(t)}) \le \varepsilon$ and $\|\hat x^{\mathrm{off}}_t - x_t\| = \|\hat\ell^{\mathrm{off}}_t - \ell_t\| \le \varepsilon \|\ell_t\|$ at all $t$. Memory complexity is $O(nr \log n \log(1/\varepsilon))$ and time complexity is $O(ndr \log(1/\varepsilon))$.

³Algorithm 1 with a subspace change detection step included.

⁴This can be satisfied by applying AltProj (Netrapalli et al., 2014) on the first $Cr$ data samples and assuming that these have outlier fractions in any row or column bounded by $c/r$.

Algorithm 1: Basic-NORST (with $t_j$ known). The actual algorithm that detects $t_j$ automatically is Algo. 3 in (Narayanamurthy & Vaswani, 2018b). Notation: $\hat L_{t;\alpha} := [\hat\ell_{t-\alpha+1}, \dots, \hat\ell_t]$ and $SVD_r[M]$ refers to the top $r$ left singular vectors of the matrix $M$. Obtain $\hat P_0$ by $C(\log r)$ iterations of AltProj on $Y_{[1, t_{\mathrm{train}}]}$ with $t_{\mathrm{train}} = Cr$, followed by SVD on the output $\hat L$.
1: Input: $y_t$; Output: $\hat x_t$, $\hat\ell_t$, $\hat P_{(t)}$, $\hat{\mathcal{T}}_t$
2: Parameters: $K \leftarrow C \log(1/\varepsilon)$, $\alpha \leftarrow C f^2 r \log n$, $\omega_{supp} \leftarrow x_{\min}/2$, $\xi \leftarrow x_{\min}/15$, $r$
3: Initialize: $j \leftarrow 1$, $k \leftarrow 1$, $\hat P_{t_{\mathrm{train}}} \leftarrow \hat P_0$
4: for $t > t_{\mathrm{train}}$ do
5:   $\Psi \leftarrow I - \hat P_{(t-1)} \hat P_{(t-1)}{}'$
6:   $\tilde y_t \leftarrow \Psi y_t$
7:   $\hat x_{t,cs} \leftarrow \arg\min_x \|x\|_1$ s.t. $\|\tilde y_t - \Psi x\| \le \xi$
8:   $\hat{\mathcal{T}}_t \leftarrow \{i : |(\hat x_{t,cs})_i| > \omega_{supp}\}$
9:   $\hat x_t \leftarrow I_{\hat{\mathcal{T}}_t} (\Psi_{\hat{\mathcal{T}}_t}{}' \Psi_{\hat{\mathcal{T}}_t})^{-1} \Psi_{\hat{\mathcal{T}}_t}{}' \tilde y_t$
10:  $\hat\ell_t \leftarrow y_t - \hat x_t$
11:  if $t = t_j + u\alpha$ for $u = 1, 2, \dots, K$ then
12:    $\hat P_{j,k} \leftarrow SVD_r[\hat L_{t;\alpha}]$, $\hat P_{(t)} \leftarrow \hat P_{j,k}$, $k \leftarrow k+1$
13:  else
14:    $\hat P_{(t)} \leftarrow \hat P_{(t-1)}$
15:  end if
16:  if $t = t_j + K\alpha$ then
17:    $\hat P_j \leftarrow \hat P_{(t)}$, $k \leftarrow 1$, $j \leftarrow j+1$
18:  end if
19: end for

Remark 2.2 (Bi-level outliers). With minor changes, we can actually prove the following, which relaxes the requirement that all outlier magnitudes be lower bounded to only requiring that most outlier magnitudes are lower bounded, while the others have small enough magnitudes so that their squared sum is upper bounded. Both bounds decrease as the subspace estimate improves. Assume that the outlier magnitudes are such that the following holds: $x_t$ can be split as $x_t = (x_t)_{small} + (x_t)_{large}$, with the two components having disjoint supports and being such that $\|(x_t)_{small}\| \le b_{v,t}$ and the smallest nonzero entry of $(x_t)_{large}$ is greater than $30\, b_{v,t}$, with $b_{v,t}$ defined as follows: $b_{v,t} := C(2\varepsilon + \Delta)\sqrt{r\lambda^+}$ for $t \in [t_j, \hat t_j + \alpha)$ (before the first subspace update), $b_{v,t} := 0.3^{k-1} C(2\varepsilon + \Delta)\sqrt{r\lambda^+}$ for $t \in [\hat t_j + (k-1)\alpha, \hat t_j + k\alpha - 1]$, $k = 2, 3, \dots, K$ (after the $k$-th subspace update), and $b_{v,t} := C\varepsilon\sqrt{r\lambda^+}$ for $t \in [\hat t_j + K\alpha, t_{j+1})$. If the above is true, and if the vectors $(x_t)_{small}$ are zero mean, mutually independent, and independent of the $\ell_t$'s and of the support of $(x_t)_{large}$, then all conclusions of Theorem 2.1 hold except the exact support recovery conclusion (this gets replaced by exact recovery of the support of $(x_t)_{large}$).

Discussion. Theorem 2.1 shows that, with high probability (whp), when using NORST, a subspace change gets detected within a delay of at most $2\alpha = Cf^2(r \log n)$ time instants, and the subspace gets estimated to $\varepsilon$ error within at most $(K+2)\alpha = Cf^2(r \log n)\log(1/\varepsilon)$ time instants. The same is also true for the recovery error of $x_t$ and $\ell_t$. If offline processing is allowed, with a delay of at most $Cf^2(r \log n)\log(1/\varepsilon)$ samples, we can guarantee all recoveries within normalized error $\varepsilon$. Theorem 2.1 allows a constant maximum fraction of outliers per row (after initialization), without making any assumption on how the outliers are generated. As noted earlier, this is better than what all other RPCA solutions allow. The same is true for the NORST memory complexity, which is almost $d/r$ times better. The time complexity is worse than that of only NO-RMC, but NO-RMC needs $d$ to be of the same order as $n$, an unreasonable requirement for videos, which often have many fewer frames $d$ than the image size $n$. NO-RMC is so fast because it is actually a robust matrix completion solution, and it deliberately undersamples the entire data matrix $Y$ to get a faster RPCA algorithm. But this also means that it cannot recover $X$. Finally, NORST also needs the best outlier-fraction-per-column bound of $O(1/r)$, instead of $O(1/r_L)$. Notice that if $J$ is large, e.g., if $J = d/(r \log n)$, it is possible that $r_L \gg r$.

We should clarify that NORST allows the maximum fraction of outliers per row to be $O(1)$, but this does not necessarily imply that the number of outliers in each row can be this high. The reason is that it allows the fraction per column to be only $O(1/r)$. Thus, for a matrix of size $n \times \alpha$, it allows the total number of outliers to be $O(\min(n\alpha, n\alpha/r)) = O(n\alpha/r)$. Thus the average fraction allowed is only $O(1/r)$.
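To make the recursion of Algorithm 1 concrete, here is a minimal NumPy sketch (ours, not the released implementation). It assumes a single known change time, replaces the constrained $\ell_1$ program of line 7 with a few ISTA iterations on its Lagrangian version, and uses placeholder parameter values (`alpha`, `K`, `omega_supp`, the $\ell_1$ weight `lam`) rather than the values prescribed by the theorem.

```python
import numpy as np

def soft_threshold(z, tau):
    return np.sign(z) * np.maximum(np.abs(z) - tau, 0.0)

def l1_min_ista(Psi, y_tilde, lam, n_iter=200):
    """Stand-in for line 7 of Algorithm 1: approximately solves
    min_x lam*||x||_1 + 0.5*||y_tilde - Psi x||^2 by ISTA.
    Since Psi is an orthogonal projector, a unit step size is valid."""
    x = np.zeros(Psi.shape[1])
    for _ in range(n_iter):
        x = soft_threshold(x + Psi.T @ (y_tilde - Psi @ x), lam)
    return x

def norst_basic(Y, P0, t_change, r, alpha, K, omega_supp, lam):
    """Sketch of Basic-NORST with one known change time. Y is n x d; P0 is the initial basis."""
    n, d = Y.shape
    P_hat = P0.copy()
    L_hat = np.zeros((n, d))
    next_update = t_change + alpha               # first subspace update after the change
    k = 0
    for t in range(d):
        Psi = np.eye(n) - P_hat @ P_hat.T        # project away the current subspace estimate
        y_tilde = Psi @ Y[:, t]
        x_cs = l1_min_ista(Psi, y_tilde, lam)              # projected CS (approximate)
        T_hat = np.where(np.abs(x_cs) > omega_supp)[0]     # support estimate by thresholding
        x_hat = np.zeros(n)
        if T_hat.size > 0:                                 # LS debiasing on the estimated support
            x_hat[T_hat] = np.linalg.lstsq(Psi[:, T_hat], y_tilde, rcond=None)[0]
        L_hat[:, t] = Y[:, t] - x_hat                      # l_t estimate
        # subspace update: r-SVD on the last alpha estimates, K times after the change
        if k < K and t + 1 == next_update:
            U, _, _ = np.linalg.svd(L_hat[:, t + 1 - alpha:t + 1], full_matrices=False)
            P_hat = U[:, :r]
            k += 1
            next_update += alpha
    return L_hat, P_hat
```

In the full algorithm the update schedule is not tied to a known $t_j$: a change is declared when the detection statistic $\lambda_{\max}\big(\frac{1}{\alpha}\sum_t \Phi \hat\ell_t \hat\ell_t{}' \Phi\big)$ exceeds a threshold $\omega_{evals}$ (see Lemma 3.6 below), after which the same $K$-step SVD refinement is run.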
NORST has the above advantages only if a few extra assumptions hold. The first is element-wise boundedness of the $a_t$'s. This, along with mutual independence and identical covariances of the $a_t$'s, is similar to the right incoherence assumption needed by all static RPCA methods; see Remark 3.5 in the longer version. The zero-mean assumption on the $a_t$'s is a minor one. The assumption that $\Lambda$ be diagonal is also minor⁵. The main extra requirement is that $x_{\min}$ be lower bounded as given in the last two assumptions of Theorem 2.1, or that most outliers are lower bounded (a more relaxed requirement), as given in the long version. The required lower bound is reasonable as long as the initial subspace estimate is accurate enough and the subspace changes slowly enough so that both $\Delta$ and $\mathrm{SE}(\hat P_0, P_0)$ are $O(1/\sqrt{r})$. This requirement may seem restrictive at first glance, but actually it is not. The reason is that $\mathrm{SE}(\cdot)$ is only measuring the largest principal angle. This bound on SE still allows the chordal distance⁶ between the two subspaces to be $O(1)$. This also matches what s-ReProCS requires.

⁵It only implies that $P_j$ is the matrix of principal components of $\mathbb{E}[L_j L_j{}']$, where $L_j := [\ell_{t_j}, \ell_{t_j+1}, \dots, \ell_{t_{j+1}-1}]$. If $\Lambda$ is not diagonal, it is easy to right-multiply $P_j$ by an orthonormal matrix $R$ such that $R' \Lambda R$ is diagonal, and to use $\ell_t = (P_j R)\tilde a_t$ with $\tilde a_t := R' a_t$. $(P_j R)$ has the same incoherence as $P_j$.

Why NORST is better than RPCA-every-α. RPCA-every-α refers to using any batch RPCA solution on $\alpha = Cr \log n$ samples at a time. (a) NORST is significantly better than RPCA-every-α for settings in which the number of outliers in some rows is very large: our guarantee shows that NORST can tolerate a constant maximum fraction of outliers per row because it exploits slow subspace change and one other mild assumption (a lower bound on most outlier magnitudes). On the other hand, all RPCA methods require this fraction to be only $O(1/r)$, which is much smaller. (b) Moreover, NORST provides guarantees for subspace recovery (and recovery of $\ell_t$, $x_t$ and $\mathcal{T}_t$) at each time $t$, including during the dwell time (the first $\alpha = Cr \log n$ time instants after a subspace change). This is possible because NORST assumes slow subspace change and the algorithm is recursive (it uses the previous subspace estimate at the current time). As shown in Theorem 2.1, during the dwell time, the subspace recovery error is essentially bounded by the amount of subspace change. However, RPCA-every-α will not provide any estimates during the dwell time; it will require a delay of this much time before it provides any estimates. (c) NORST can detect subspace changes automatically and quickly (with just one projection-SVD step), while RPCA-every-α will take much more computation time to do this. Finally, in practical comparisons for videos involving slow-moving objects (a large number of outliers in some rows), NORST is both significantly better than, and faster than, all RPCA methods. This is true both when the RPCA methods are applied to the entire dataset and when they are applied to α-length pieces of the dataset. We do not report these results due to lack of space, and because applying RPCA to the whole matrix resulted in the better performance of the two options.

Outliers vs. Subspace Assumptions. When there are fewer outliers in the data, or when outliers are easy to detect, one would expect to need weaker assumptions on the true data's subspace and/or on its rate of change. This is indeed true.
The max-outlier-frac-col bound relates max-outlier-frac-col to $\mu$ (the not-denseness parameter) and $r$ (the subspace dimension). The upper bound on $\Delta$ implies that, if $x_{\min}$ is larger (outliers are easier to detect), a larger amount of subspace change can be tolerated. The relation of max-outlier-frac-row to the rate of subspace change is not evident from the way the guarantee is stated above, because we have assumed max-outlier-frac-row $\le b_0 := c/f^2$ with $c$ a numerical constant, and used this to get a simple expression for $K$. If we did not do this, we would get $K = C\,\frac{1}{-\log(\sqrt{b_0}\,f)} \log\!\big(\frac{c\Delta}{0.8\varepsilon}\big)$; see Remark A.1 of the long version. Since we need $t_{j+1} - t_j \ge (K+2)\alpha$, a smaller $b_0$ means a larger $\Delta$ can be tolerated for the same delay, or vice versa.

⁶The $\ell_2$ norm of the vector containing the sines of all principal angles.

Table 1. Comparing all RPCA or RST solutions. All algorithms also require left and right incoherence, or left incoherence and $a_t$'s bounded (which is similar to right incoherence), and hence these are not compared. The incoherence parameter $\mu$ and the condition numbers are treated as constants. In general, $r_L = rJ$. Strong or unrealistic assumptions are shown in red.

| Algorithm | Outlier tolerance | Other assumptions | Memory, Time, # params |
|---|---|---|---|
| PCP(C), mod-PCP | max-outlier-frac-row $\in O(1)$; max-outlier-frac-col $\in O(1)$ | outlier support: unif. random; $r_L \le c \min(n,d)/\log^2 n$ | Memory: $O(nd)$; Time: $O(nd^2 \frac{1}{\epsilon})$; # params: zero |
| AltProj | max-outlier-frac-row $\in O(1/r_L)$; max-outlier-frac-col $\in O(1/r_L)$ | $d \ge c\, r_L$ | Memory: $O(nd)$; Time: $O(nd r_L^2 \log\frac{1}{\epsilon})$; # params: 2 |
| RPCA-GD | max-outlier-frac-row $\in O(1/r_L^{1.5})$; max-outlier-frac-col $\in O(1/r_L^{1.5})$ | $d \ge c\, r_L$ | Memory: $O(nd)$; Time: $O(nd r_L \log\frac{1}{\epsilon})$; # params: 5 |
| NO-RMC | max-outlier-frac-row $\in O(1/r_L)$; max-outlier-frac-col $\in O(1/r_L)$ | $c_2 n \le d \le c n$ | Memory: $O(nd)$; Time: $O(n r_L^3 \log^2 n \log^2\frac{1}{\epsilon})$; # params: 4 |
| orig-ReProCS (dynamic RPCA; tracking delay too large) | max-outlier-frac-row $\in O(1)$; max-outlier-frac-col $\in O(1/r_L)$ | outlier support: moving object model; unrealistic subspace change model; changed eigenvalues small for some time; outlier mag. lower bounded; init data: AltProj assumptions; $a_t$'s independent; $d \ge c r^2/\epsilon^2$ | Memory: $O(nr^2/\epsilon^2)$; Time: $O(ndr \log\frac{1}{\epsilon})$; # params: 5 |
| s-ReProCS (solves d-RPCA & RST w/ sub-optimal delay) | max-outlier-frac-row $\in O(1)$; max-outlier-frac-col $\in O(1/r)$ | subspace change: only 1 direction at a time; outlier mag. lower bounded; $a_t$'s independent; $d \ge c r \log n \log\frac{1}{\epsilon}$; init data: AltProj assumptions | Memory: $O(nr \log n \log\frac{1}{\epsilon})$; Time: $O(ndr \log\frac{1}{\epsilon})$; # params: 4 |
| NORST (this work; solves d-RPCA & RST w/ near-optimal delay) | max-outlier-frac-row $\in O(1)$; max-outlier-frac-col $\in O(1/r)$ | subspace change: none; outlier mag. lower bounded; $a_t$'s independent; $d \ge c r \log n \log\frac{1}{\epsilon}$; first $Cr$ samples: AltProj assumptions | Memory: $O(nr \log n \log\frac{1}{\epsilon})$; Time: $O(ndr \log\frac{1}{\epsilon})$; # params: 4 |

Algorithm Parameters. Algorithm 1 (and Algorithm 2 of (Narayanamurthy & Vaswani, 2018b)) assumes knowledge of four model parameters, $r$, $\lambda^+$, $\lambda^-$ and $x_{\min}$, to set the algorithm parameters. The initial dataset used for estimating $\hat P_0$ (using AltProj) can be used to get an accurate estimate of $r$, $\lambda^-$ and $\lambda^+$ using standard techniques. Thus one really only needs to set $x_{\min}$. If continuity over time is assumed, we can let it be time-varying and set it as $\min_{i \in \hat{\mathcal{T}}_{t-1}} |(\hat x_{t-1})_i|$ at time $t$.

Related Work. For a summary of comparisons, see Table 1. The earliest dynamic RPCA result was a partial guarantee (a guarantee that depends on intermediate algorithm estimates satisfying certain assumptions) for the original ReProCS approach (original-ReProCS) (Qiu et al., 2014).
This was followed up by two complete guarantees for ReProCS-based approaches with minor modifications (Lois & Vaswani, 2015; Zhan et al., 2016). For simplicity we will still call these original-ReProCS. Parallel work also included a guarantee for modified-PCP, which is a solution for RPCA with partial subspace knowledge (Zhan & Vaswani, 2015). This provides a piecewise batch solution to dynamic RPCA. More recent work developed and analyzed simple-ReProCS (s-ReProCS), which removed most of the disadvantages of previous works. s-ReProCS has the same tracking delay and memory complexity as NORST and needs the same outlier assumptions; however, it assumes that only one subspace direction can change at each change time. This is a much more restrictive model than what NORST needs (NORST allows all $r$ directions to change), and it implies that the tracking delay of s-ReProCS is $r$ times sub-optimal. Moreover, s-ReProCS requires a complicated projection-SVD step for subspace update (as opposed to SVD in NORST). These two facts imply that s-ReProCS needs an $\epsilon$-accurate subspace initialization in order to ensure that the later changed subspaces can be tracked with $\epsilon$-accuracy; and it does not provide a useful solution for ST-missing or dynamic MC. The advantage of s-ReProCS over NORST is that it is a little faster (it needs 1-SVD instead of r-SVD most of the time), and its required bound on outlier fractions and on the delay between subspace change times has a weaker dependence on the condition number of the true data covariance.

The original-ReProCS guarantees needed very strong assumptions and their tracking delay was $O(nr^2/\epsilon^2)$. Since $\epsilon$ can be very small, this factor can be quite large, and hence one cannot claim that original-ReProCS solves RST. Our work is a very significant improvement over all these works. (i) The guaranteed memory complexity, tracking delay, and required delay between subspace change times of NORST are all $r/\epsilon^2$ times lower than those of original-ReProCS. (ii) The original-ReProCS guarantees needed a very specific assumption on how the outlier support could change: they required an outlier support model inspired by a video moving object that moves in one direction for a long time; and whenever it moves, it must move by a fraction of $s := \max_t |\mathcal{T}_t|$. This is a very specific model, with the requirement of moving by a fraction of $s$ being the most restrictive. Our result replaces this with just a bound on max-outlier-frac-row. We explain in Sec. 3 why this is possible. (iii) Their subspace change model required one or more new directions orthogonal to $P_{j-1}$ to be added at each $t_j$. This is an unrealistic model for slow subspace change; e.g., in 3D, it implies that the subspace needs to change from the x-y plane to the y-z plane. Moreover, because of this model, their results needed the energy (eigenvalues) along the newly added directions to be small for a period of time after each subspace change. This is a strong (and hard to interpret) requirement. We replace all these requirements with a bound on $\mathrm{SE}(P_{j-1}, P_j)$, which is much more realistic. Thus, in 3D, we allow the x-y plane to change to a slightly tilted version of itself.

Since the modified-PCP guarantee adapted the PCP proof techniques from (Candès et al., 2011), its pros and cons are similar to those of PCP; e.g., it also needs a uniformly randomly generated outlier support, and it also cannot detect subspace change. Also see Table 1.
We also provide a comparison with provably correct standard RPCA approaches in Table 1. In summary, NORST has significantly better memory complexity than all of them (all of which are batch methods); it has the best outlier tolerance (after initialization) and the second-best time complexity, as long as its extra assumptions hold. It can also detect subspace change quickly, which can be a useful feature.

3. Proof Outline with $v_t = 0$

Remark 3.1. When stating Theorem 2.1, we just used numerical constants $c_1, c_2, C_1$ for simplicity. The result holds with $c_1 = 0.01$, $c_2 = 0.01$, and $C_1 = 15\sqrt{\eta}$.

For simplicity, we outline the proof for the $v_t = 0$ case. The changes when $v_t \neq 0$ are minor; see (Narayanamurthy & Vaswani, 2018b). First consider the simpler case when the $t_j$'s are known, i.e., consider Algorithm 1. In this case, $\hat t_j = t_j$. Define

1. bound on max-outlier-frac-row: $b_0 := 0.01/f^2$;
2. $q_0 := 1.2(\varepsilon + \mathrm{SE}(P_{j-1}, P_j))$, $q_k := (0.3)^k q_0$;
3. $e_t := \hat x_t - x_t$; since $v_t = 0$, $e_t = \ell_t - \hat\ell_t$;
4. Events: $\Gamma_{0,0} := \{$assumed bound on $\mathrm{SE}(\hat P_0, P_0)\}$, $\Gamma_{0,k} := \Gamma_{0,k-1} \cap \{\mathrm{SE}(\hat P_{0,k}, P_0) \le 0.3^k\, \mathrm{SE}(\hat P_0, P_0)\}$, $\Gamma_{j,0} := \Gamma_{j-1,K}$, $\Gamma_{j,k} := \Gamma_{j,k-1} \cap \{\mathrm{SE}(\hat P_{j,k}, P_j) \le q_{k-1}/4\}$ for $j = 1, 2, \dots, J$ and $k = 1, 2, \dots, K$;
5. Using the expression for $K$ given in the theorem, it follows that $\Gamma_{j,K}$ implies $\mathrm{SE}(\hat P_{j,K}, P_j) \le \varepsilon$.

Observe that if we can show that $\Pr(\Gamma_{J,K} \mid \Gamma_{0,0}) \ge 1 - dn^{-10}$, we will have obtained all the subspace recovery bounds of Theorem 2.1. The next two lemmas, applied sequentially, help show that this is true for Algorithm 1 ($t_j$ known). The correctness of the actual algorithm follows using these and Lemma 3.6.

Lemma 3.2 (first subspace update interval). Under the conditions of Theorem 2.1, conditioned on $\Gamma_{j,0}$,

1. for all $t \in [\hat t_j, \hat t_j + \alpha)$, $\|\Psi \ell_t\| \le (\varepsilon + \Delta)\sqrt{\eta r \lambda^+} < x_{\min}/15$, $\|\hat x_{t,cs} - x_t\| \le 7 x_{\min}/15 < x_{\min}/2$, $\hat{\mathcal{T}}_t = \mathcal{T}_t$, and the error $e_t := \hat x_t - x_t = \ell_t - \hat\ell_t$ satisfies

$e_t = I_{\mathcal{T}_t} (\Psi_{\mathcal{T}_t}{}' \Psi_{\mathcal{T}_t})^{-1} I_{\mathcal{T}_t}{}' \Psi \ell_t$,   (1)

and $\|e_t\| \le 1.2(\varepsilon + \Delta)\sqrt{\eta r \lambda^+}$;

2. w.p. at least $1 - 10n^{-10}$, the first subspace estimate $\hat P_{j,1}$ satisfies $\mathrm{SE}(\hat P_{j,1}, P_j) \le q_0/4$, i.e., $\Gamma_{j,1}$ holds.

Lemma 3.3 ($k$-th subspace update interval). Under the conditions of Theorem 2.1, conditioned on $\Gamma_{j,k-1}$,

1. for all $t \in [\hat t_j + (k-1)\alpha, \hat t_j + k\alpha - 1)$, all claims of the first part of Lemma 3.2 hold, $\|\Psi \ell_t\| \le 0.3^{k-1}(\varepsilon + \Delta)\sqrt{\eta r \lambda^+}$, and $\|e_t\| \le (0.3)^{k-1}\cdot 1.2(\varepsilon + \Delta)\sqrt{\eta r \lambda^+}$;

2. w.p. at least $1 - 10n^{-10}$, the subspace estimate $\hat P_{j,k}$ satisfies $\mathrm{SE}(\hat P_{j,k}, P_j) \le q_{k-1}/4$, i.e., $\Gamma_{j,k}$ holds.

Remark 3.4. For the case of $j = 0$, in both lemmas above, $\Delta$ gets replaced with $\mathrm{SE}(\hat P_0, P_0)$ and $\varepsilon$ by zero.

We prove these lemmas in the long version. The projected CS proof (part one of both lemmas) uses the following lemma from (Qiu et al., 2014), which relates the s-Restricted Isometry Constant (RIC) (Candes, 2008) of a projection matrix to the incoherence of its orthogonal complement.

Lemma 3.5. For an $n \times r$ basis matrix $P$, $\delta_s(I - PP') = \max_{|\mathcal{T}| \le s} \|I_{\mathcal{T}}{}' P\|^2 \le s \max_{i=1,2,\dots,n} \|I_i{}' P\|^2 \le s\mu r/n$.

The last bound follows using Definition 1.2. We apply this lemma with $s =$ max-outlier-frac-col $\cdot\, n$. The subspace update step proof uses a guarantee for PCA in sparse data-dependent noise, Corollary 4.17 of (Vaswani & Narayanamurthy, 2017). Notice that $e_t = \ell_t - \hat\ell_t$ is the noise/error seen by the subspace update step. By (1), this is sparse and depends on the true data $\ell_t$. A careful application of the (Vaswani & Narayanamurthy, 2017) result is the reason why we are able to remove the moving object model assumption on the outlier support needed by the earlier dynamic RPCA guarantees (i.e., orig-ReProCS).
Applied to our problem, this result requires $\|\sum_{t \in \mathcal{J}^\alpha} I_{\mathcal{T}_t} I_{\mathcal{T}_t}{}'\|/\alpha$ to be bounded by a constant less than one. It is not hard to see that $\max_{\mathcal{J}^\alpha \subseteq [1,d]} \|\sum_{t \in \mathcal{J}^\alpha} I_{\mathcal{T}_t} I_{\mathcal{T}_t}{}'\|/\alpha =$ max-outlier-frac-row. This is also why a constant bound on max-outlier-frac-row suffices for our setting.

The key to our overall proof is to show that the subspace recovery error after each PCA step is sufficiently smaller (e.g., 0.3 times smaller) than the instantaneous noise/error level, $\|\mathbb{E}[e_t e_t{}']\|$, seen by the $\hat\ell_t$'s used for this step. The reason we can show this is that the PCA step error is proportional to the ratio between the time-averaged noise level (plus the time-averaged signal-noise correlation) and the minimum signal space eigenvalue, $\lambda^-$ (Vaswani & Narayanamurthy, 2017). Using the sparsity of $e_t$ with support $\mathcal{T}_t$ and the max-outlier-frac-row bound, one can show that the time-averaged noise plus signal-noise correlation, $\|(\sum_t \mathbb{E}[e_t e_t{}'] + \sum_t \mathbb{E}[\ell_t e_t{}'])/\alpha\|$, is smaller than its instantaneous value, $\|\mathbb{E}[e_t e_t{}']\| + \|\mathbb{E}[\ell_t e_t{}']\|$, by a factor that depends on max-outlier-frac-row. The above, in turn, implies that both the instantaneous and the time-averaged error/noise levels in the next interval are 0.3 times smaller. Put together, one can show exponential decay of both $\mathrm{SE}(\hat P_{j,k}, P_j)$ and the error/noise level.

Consider the actual case of unknown $t_j$. The following lemma is used to show that, whp, we can detect a subspace change within $2\alpha$ time instants. This lemma assumes detection threshold $\omega_{evals} = 2\varepsilon^2\lambda^+$ (see the long version).

Lemma 3.6 (Subspace Change Detection). Consider an $\alpha$-length time interval $\mathcal{J}^\alpha \subseteq [t_j, t_{j+1}]$ (so that $\ell_t = P_j a_t$).

1. If $\Phi := I - \hat P_{j-1} \hat P_{j-1}{}'$ and $\mathrm{SE}(\hat P_{j-1}, P_{j-1}) \le \varepsilon$, then, with probability at least $1 - 10n^{-10}$, $\lambda_{\max}\big(\frac{1}{\alpha}\sum_{t \in \mathcal{J}^\alpha} \Phi \hat\ell_t \hat\ell_t{}' \Phi\big) \ge 0.8\,\lambda^- \mathrm{SE}^2(P_{j-1}, P_j)$.

2. If $\Phi := I - \hat P_j \hat P_j{}'$ and $\mathrm{SE}(\hat P_j, P_j) \le \varepsilon$, then, with probability at least $1 - 10n^{-10}$, $\lambda_{\max}\big(\frac{1}{\alpha}\sum_{t \in \mathcal{J}^\alpha} \Phi \hat\ell_t \hat\ell_t{}' \Phi\big) \le 1.37\,\varepsilon^2 \lambda^+ < \omega_{evals}$.

We prove these lemmas in the long version. Theorem 2.1 is an easy consequence of them.

4. Empirical Evaluation

The complete details for all experiments are provided in (Narayanamurthy & Vaswani, 2018b). All codes can be found at https://github.com/praneethmurthy/NORST.

Figure 2. Subspace error $\mathrm{SE}(\hat P_{(t)}, P_{(t)})$ versus time for the online RST algorithms (GRASTA, ORPCA, s-ReProCS, NORST), plotted every $\alpha = 300$ time instants. Time taken per sample (milliseconds) is shown in legend parentheses.

Table 2. Comparison of $\|\hat L - L\|_F / \|L\|_F$ for offline RPCA methods, including offline NORST. Average time per frame is given in parentheses.

| Outlier Model | RPCA-GD (92.2 ms) | AltProj (70.8 ms) | Offline-NORST (1.7 ms) |
|---|---|---|---|
| Mov. Obj. | 4.283 | 4.363 | $3.5 \times 10^{-4}$ |
| Bernoulli | 0.158 | 0.269 | $3.4 \times 10^{-4}$ |

Synthetic Data. We generated the subspaces using $P_j = e^{\delta_j B_j} P_{j-1}$ (as done in (He et al., 2012)), where $\delta_j$ controls the amount of subspace change and the $B_j$'s are skew-symmetric matrices. In our experiments we use $n = 1000$, $d = 12000$, $r = 30$, $J = 2$, $t_1 = 3000$, $t_2 = 8000$, $\delta_1 = \delta_2 = 0.001$. $P_0$ is generated by ortho-normalizing the columns of an $n \times r$ i.i.d. standard normal matrix. The subspace coefficients $a_t \in \mathbb{R}^r$ are generated as independent zero-mean, bounded (uniform) random variables, with the condition number of their covariance being $f = 50$. We generated the support $\mathcal{T}_t$ of the sparse outliers using the Moving Object Model of (Narayanamurthy & Vaswani, 2018a). We used the first $t_{\mathrm{train}} = 3.3r = 100$ frames as the training data with fewer outliers: for this period we used $b_0 = 0.01$, and for $t > t_{\mathrm{train}}$ we used $s/n = 0.05$ and $b_0 = 0.3$. The non-zero magnitudes of $x_t$ are generated as i.i.d. uniform[10, 20].
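A minimal simulation sketch of this data generation follows (our reconstruction of the description above, not the released code). The eigenvalue spacing of $\Lambda$ is an arbitrary choice that only matches the condition number $f = 50$, the Moving Object support model is replaced by a Bernoulli support with the same column fraction purely to keep the snippet short, and $v_t$ is omitted.

```python
import numpy as np
from scipy.linalg import expm, qr

rng = np.random.default_rng(0)
n, d, r, J = 1000, 12000, 30, 2
t_change = [3000, 8000]
delta = [0.001, 0.001]

# P0: orthonormalized i.i.d. standard normal matrix
P, _ = qr(rng.standard_normal((n, r)), mode='economic')
subspaces = [P]
for j in range(J):
    B = rng.standard_normal((n, n))
    B = (B - B.T) / 2                                  # skew-symmetric B_j
    subspaces.append(expm(delta[j] * B) @ subspaces[-1])   # P_j = e^{delta_j B_j} P_{j-1}

# bounded (uniform) coefficients; eigenvalues of Lambda spaced so that f = 50 (illustrative)
lams = np.linspace(1.0, 50.0, r)
half_width = np.sqrt(3 * lams)                         # uniform[-w, w] has variance w^2 / 3

Y = np.zeros((n, d))
for t in range(d):
    j = sum(t >= tc for tc in t_change)                # which piecewise-constant subspace is active
    a = rng.uniform(-half_width, half_width)
    x = np.zeros(n)
    frac = 0.01 if t < 100 else 0.05                   # fewer outliers during the training period
    supp = rng.random(n) < frac                        # Bernoulli stand-in for the moving-object support
    x[supp] = rng.uniform(10, 20, size=supp.sum())
    Y[:, t] = subspaces[j] @ a + x
```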
The algorithm parameters are set as $K = 8$, $\alpha = 300$. As shown in Fig. 2, NORST is significantly better than all the compared online RST methods: s-ReProCS, and two popular heuristics from the literature, ORPCA and GRASTA. We provide a comparison of offline-NORST with the batch RPCA techniques in Table 2. As can be seen, offline-NORST outperforms all the batch RPCA methods, both for the moving-object outlier support model and for the commonly used random Bernoulli support model. All results in this table are averaged over 10 independent runs.

Video. We also evaluated NORST for background subtraction; see Figure 1. The NORST parameters were set as $\alpha = 60$, $K = 3$, $r = 40$ and $\xi_t = \|\Psi \hat\ell_{t-1}\|$.

Acknowledgements

The authors thank Praneeth Netrapalli and Prateek Jain of Microsoft Research India for asking an important question that helped remove a redundant assumption.

References

Candes, E. The restricted isometry property and its implications for compressed sensing. C. R. Math. Acad. Sci. Paris, Serie I, 2008.

Candès, E. J., Li, X., Ma, Y., and Wright, J. Robust principal component analysis? J. ACM, 58(3), 2011.

Chandrasekaran, V., Sanghavi, S., Parrilo, P. A., and Willsky, A. S. Rank-sparsity incoherence for matrix decomposition. SIAM Journal on Optimization, 21, 2011.

Cherapanamjeri, Y., Gupta, K., and Jain, P. Nearly-optimal robust matrix completion. In ICML, 2016.

Chi, Y., Eldar, Y. C., and Calderbank, R. PETRELS: Parallel subspace estimation and tracking by recursive least squares from partial observations. IEEE Trans. Sig. Proc., December 2013.

Feng, J., Xu, H., and Yan, S. Online robust PCA via stochastic optimization. In NIPS, 2013.

He, J., Balzano, L., and Szlam, A. Incremental gradient on the Grassmannian for online foreground and background separation in subsampled video. In IEEE Conf. on Comp. Vis. Pat. Rec. (CVPR), 2012.

Hsu, D., Kakade, S. M., and Zhang, T. Robust matrix decomposition with sparse corruptions. IEEE Trans. Info. Th., Nov. 2011.

Lois, B. and Vaswani, N. Online matrix completion and online robust PCA. In IEEE Intl. Symp. Info. Th. (ISIT), 2015.

Narayanamurthy, P. and Vaswani, N. Provable dynamic robust PCA or robust subspace tracking. arXiv:1705.08948, being revised for IEEE Trans. Info. Theory (short version in ISIT'18), 2018a.

Narayanamurthy, P. and Vaswani, N. Nearly optimal robust subspace tracking. In Intnl. Conf. Machine Learning (ICML), longer version at arXiv:1712.06061 [cs.IT] and submitted to IEEE Trans. Info. Theory, 2018b.

Netrapalli, P., Niranjan, U. N., Sanghavi, S., Anandkumar, A., and Jain, P. Non-convex robust PCA. In NIPS, 2014.

Ozdemir, A., Bernat, E. M., and Aviyente, S. Recursive tensor subspace tracking for dynamic brain network analysis. IEEE Transactions on Signal and Information Processing over Networks, 2017.

Qiu, C., Vaswani, N., Lois, B., and Hogben, L. Recursive robust PCA or recursive sparse recovery in large but structured noise. IEEE Trans. Info. Th., pp. 5007-5039, August 2014.

Vaswani, N. and Narayanamurthy, P. Finite sample guarantees for PCA in non-isotropic and data-dependent noise. In Allerton 2017, long version at arXiv:1709.06255, 2017.

Yang, B. Projection approximation subspace tracking. IEEE Trans. Sig. Proc., pp. 95-107, 1995.

Yi, X., Park, D., Chen, Y., and Caramanis, C. Fast algorithms for robust PCA via gradient descent. In NIPS, 2016.

Zhan, J. and Vaswani, N. Robust PCA with partial subspace knowledge. IEEE Trans. Sig. Proc., July 2015.

Zhan, J., Lois, B., Guo, H., and Vaswani, N. Online (and offline) robust PCA: Novel algorithms and performance guarantees. In Intnl. Conf. Artif. Intell. Stat. (AISTATS), 2016.

Zhang, D. and Balzano, L. Global convergence of a Grassmannian gradient descent algorithm for subspace estimation. In AISTATS, 2016.