# Robust Streaming PCA

Daniel Bienstock (IEOR Department, Columbia University), Minchan Jeong (Graduate School of AI, KAIST), Apurv Shukla (IEOR Department, Columbia University), Se-Young Yun (Graduate School of AI, KAIST)
{dano,as5197}@columbia.edu, {mcjeong,yunseyoung}@kaist.ac.kr
The authors are ordered alphabetically. 36th Conference on Neural Information Processing Systems (NeurIPS 2022).

We consider streaming principal component analysis when the stochastic data-generating model is subject to perturbations. While existing models assume a fixed covariance, we adopt a robust perspective in which the covariance matrix belongs to a temporal uncertainty set. Under this setting, we provide fundamental limits on the convergence of any algorithm recovering principal components. We analyze the convergence of the noisy power method and Oja's algorithm, both previously studied for the stationary data-generating model, and argue that the noisy power method is rate-optimal in our setting. Finally, we demonstrate the validity of our analysis through numerical experiments on synthetic and real-world datasets.

1 Introduction

Principal component analysis (PCA) is one of the most extensively studied methods for obtaining low-dimensional representations of observed data [22]. However, classical algorithms for PCA store all the observations and have cubic time complexity, thereby imposing prohibitively large computation-time and space requirements. Recently, several works on PCA have focused on the design and analysis of streaming algorithms with near-optimal memory and storage complexity [25, 29, 40, 43]. These algorithms assume that all the observations belong to the same low-dimensional space. This assumption is unlikely to hold when unknown or unexplored alterations corrupt a system's observations. For instance, it is well known that typical data attacks on power grids can significantly change the estimated covariance matrix of the data observed from sensors [9, 10, 23]. Similarly, PCA can be used to explain stock returns in terms of macroeconomic factors [27], and to price products taking into account cross-product elasticities and demands [45]. In all these scenarios, the underlying data-generating model changes at every instant, and decisions are based on identifying the changed model.

Existing work considers perturbations of data lying in a fixed low-dimensional space [49]. It determines the worst-case position of an adversarial data point that incurs the maximum error in the subspace estimated through PCA, and measures the distance between the two subspaces using the notion of the principal angle between them. Another line of work considers PCA through the lens of stochastic optimization [5, 42, 51, 52]. Our work differs from these approaches since we assume that the data-generating model changes at every time instant. Further, we propose near-optimal algorithms for recovering principal components under this framework.

We assume a system relies on a time series of $p$-dimensional vectors sampled from a time-varying model. The available observations are the vectors $(x_t)_{t=1}^T$. The noisy observation $x_t \in \mathbb{R}^p$ is a vector lying in the column space of an unobserved full-rank matrix $A_t \in \mathbb{R}^{p \times k}$. Precisely, extending the standard spiked covariance model [35], we consider the time-dependent environment:

$$x_t \sim \mathcal{N}\big(0_{p \times 1},\ A_t A_t^\top + \sigma^2 I_{p \times p}\big), \qquad (1)$$

where $A_t$ can vary with time. The parameter $k \le p$ is the desired number of principal components.
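As a concrete illustration of the data-generating model (1), the following is a minimal NumPy sketch of sampling a stream from a given sequence $(A_t)_{t=1}^T$; the helper name `sample_stream` and the way the sequence is supplied are our own choices, not part of the paper.

```python
# Minimal sketch (not the paper's code): sample T observations from the
# time-varying spiked covariance model (1), x_t ~ N(0, A_t A_t^T + sigma^2 I).
import numpy as np

def sample_stream(A_seq, sigma, rng=None):
    """Draw one x_t ~ N(0, A_t A_t^T + sigma^2 I) for every A_t in A_seq."""
    rng = np.random.default_rng() if rng is None else rng
    xs = []
    for A in A_seq:                      # each A_t has shape (p, k)
        p = A.shape[0]
        cov = A @ A.T + sigma**2 * np.eye(p)
        xs.append(rng.multivariate_normal(np.zeros(p), cov))
    return np.array(xs)                  # shape (T, p)
```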
Assuming $A_t$ belongs to a temporal uncertainty set (defined in equation (2)), our goal is to recover the top-$k$ principal components associated with the terminal matrix $A_T$. In financial applications, our model captures market evolution in terms of the changing $A_t$, with the ultimate goal of explaining the market conditions for an appropriately chosen $T$. Previous works on the stationary environment assume $A_t = A$ for all $t$ and have focused on computing a basis for the column space of the matrix $A$ using streaming algorithms [21, 29, 33, 36, 38, 43]. The accurate reconstruction of the principal components for the standard streaming PCA problem depends on the magnitude of the observation noise $\sigma$, the dimension of the observations $p$, the number of principal components $k$ of the matrix $A$, and the spectral gap $\delta$ between the $k$-th and $(k{+}1)$-th singular values of $AA^\top$. In a marked departure from previous work, we study the case when the column space of $A_t$ varies across time. This paper explores these avenues and proposes a tractable analysis framework for the streaming PCA problem that is robust to perturbations in the data-generating model. Our contributions can be summarized as follows:

1. (Lower Bound; Section 4) Our first contribution is a fundamental lower bound for estimating the principal components when the covariance matrix belongs to a temporal uncertainty set $\mathcal{T}_u(\delta, \Gamma)$. In Theorem 1, we derive minimax bounds on the expected error of any streaming algorithm recovering the top singular vectors. Because the underlying distribution can vary, observations from the far past become less informative, and the estimate of the principal components associated with $A_T$ should be determined by a subset of the samples. Simultaneously, it becomes imperative to find the block size $B$ that can be used to recover the principal components. This is in sharp contrast to the standard spiked covariance model. We show that: for $T = O(\Gamma^{-2/3})$, the minimax estimation error decreases as $O(p^{1/2}T^{-1/2})$; on the other hand, for $T = \Omega(\Gamma^{-2/3})$, the error stagnates at $O(p^{1/3}\Gamma^{1/3})$ and does not decrease upon collecting more observations.¹

2. (Algorithm Analysis; Section 5) We then analyze two algorithms, extensively used in the standard streaming PCA setting, for recovering the principal components: the noisy power method [29] and Oja's algorithm [48]. These algorithms represent two very different design principles for computing principal components from streaming data, processing data in blocks versus single observations. We determine the optimal choice of the critical parameters: the block size in the noisy power method (Lemma 1) and the learning rate in Oja's algorithm (Lemma 2). We then leverage these results to obtain upper bounds on the convergence error of these algorithms in Theorems 2 and 3, respectively. From these results, we have: the block size $B$ for the noisy power method and the inverse learning rate $1/\eta$ for Oja's algorithm play a similar role even in the non-stationary environment; in either case, the optimal parameter scales as $\Theta(\Gamma^{-2/3})$, where $\Gamma$ is the perturbation budget. The derived upper bound on the estimation error of the noisy power method matches the minimax error in terms of $p$, $\delta$, and $\Gamma$, and the method becomes rate-optimal under mild conditions.

¹For a precise explanation of $\Gamma$ and $\mathcal{T}_u(\delta, \Gamma)$, please refer to Section 3.

Notation. We fix notation for the main body of the paper. Matrices are denoted by bold uppercase letters (e.g., $M$) and vectors by bold lowercase letters (e.g., $x$). For $M \in \mathbb{R}^{p \times k}$, $M_i$ denotes the $i$-th column of $M$, $M_{i:j}$ the submatrix consisting of the $i$-th through $j$-th columns of $M$, and $M_{i,j}$ the $(i,j)$-element of $M$. $\|\cdot\|$ denotes the matrix 2-norm (equivalently, the operator norm) for matrices and the standard 2-norm for vectors. We write $\mathrm{St}_k(\mathbb{R}^p)$ for the class of orthogonal matrices in $\mathbb{R}^{p \times k}$. $b(M)$ is an orthogonal matrix whose columns form a basis of $\mathrm{ran}(M)$, and $s_i(M)$ denotes the $i$-th largest singular value of $M$. The singular value decomposition of $M$ is written $\mathrm{SVD}(M) = UDV^\top$, where $U \in \mathrm{St}_p(\mathbb{R}^p)$, $V \in \mathrm{St}_k(\mathbb{R}^k)$, and $D \in \mathbb{R}^{p \times k}$ is a diagonal matrix whose $i$-th diagonal element equals $s_i(M)$; without loss of generality, the singular values and the corresponding singular vectors are ordered from largest to smallest. We denote by $M^{\perp}$ the orthogonal projection onto the orthogonal complement of $\mathrm{ran}(M)$; hence, if $\mathrm{rk}(M) = r$ and $\mathrm{SVD}(M) = UDV^\top$, then $M^{\perp} = I - U_{1:r}U_{1:r}^\top$. Moreover, when $\mathrm{rk}(M) = \mathrm{rk}(\tilde{M})$ and $\mathrm{SVD}(\tilde{M}) = \tilde{U}\tilde{D}\tilde{V}^\top$, the distance between $\mathrm{ran}(M)$ and $\mathrm{ran}(\tilde{M})$ is defined by

$$d\big(\mathrm{ran}(M), \mathrm{ran}(\tilde{M})\big) = \big\| b(M)b(M)^\top - b(\tilde{M})b(\tilde{M})^\top \big\| = \big\| U_{1:k}U_{1:k}^\top - \tilde{U}_{1:k}\tilde{U}_{1:k}^\top \big\|.$$

We write $d(M, \tilde{M})$ for $d(\mathrm{ran}(M), \mathrm{ran}(\tilde{M}))$ whenever clear from the context. The letter $\mathcal{A}$ abbreviates the sequence of matrices $(A_t)_{t=1}^T$. We write $X \sim \mathcal{A}$ when each element $x_t$ of the sequence $X = (x_t)_{t=1}^T$ is drawn from $\mathcal{N}(0, A_t A_t^\top + \sigma^2 I_{p \times p})$, and denote the expectation of $f(X)$ over $x_t \sim \mathcal{N}(0, A_t A_t^\top + \sigma^2 I_{p \times p})$ by $\mathbb{E}_{X \sim \mathcal{A}}[f]$. We denote by $\tilde{O}, \tilde{\Omega}$ the $O, \Omega$ notation that ignores multiplicative factors of $\log(pT^2)$ or smaller.
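The subspace distance $d(\cdot,\cdot)$ above can be computed directly from the definition, as the operator norm of the difference of the two orthogonal projectors. Below is a minimal sketch; `orth_basis` and `subspace_distance` are illustrative helper names rather than the paper's code.

```python
# Minimal sketch: d(ran(M), ran(M~)) = || b(M)b(M)^T - b(M~)b(M~)^T ||_2.
import numpy as np

def orth_basis(M):
    """Orthonormal basis b(M) of ran(M) via a thin QR factorization."""
    Q, _ = np.linalg.qr(M)
    return Q

def subspace_distance(M, M_tilde):
    P = orth_basis(M) @ orth_basis(M).T              # projector onto ran(M)
    P_tilde = orth_basis(M_tilde) @ orth_basis(M_tilde).T
    return np.linalg.norm(P - P_tilde, ord=2)        # spectral norm
```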
2 Related Work

Principal component analysis (PCA) has been extensively studied across operations research, computer science, and other communities. We highlight how our work differs from the existing literature.

Robust PCA. Robust PCA deals with the problem of retrieving principal components that are robust to the presence of outliers in the data. The cornerstone work in this direction is the principal component pursuit framework, wherein the matrix of observations $M$ is assumed to decompose into a low-rank matrix $L$ and a sparse matrix $S$ (whose entries may have arbitrarily large magnitude) [15]. Several works consider the robust PCA problem and propose algorithms in the offline, batch, and online settings [17, 24, 26, 46]. Our work differs from this line of literature in two aspects: the assumptions about the data-generating model and the applicability of convex optimization techniques. First, our data-generation model is unrelated to those considered in the robust PCA literature. Second, while rich theories from convex optimization can be used in that framework to design efficient algorithms, our problem is not amenable to such techniques.

Streaming PCA. Streaming algorithms for PCA have been proposed, among other works on PCA [2, 29, 34, 43, 60]. The algorithms analyzed in this work, the noisy power method [29] and Oja's algorithm [48], are iterative methods for estimating the principal components. These iterative schemes are instances of stochastic-approximation-based solutions to the optimization formulation of the PCA problem [5]. Stochastic approximation is a root-finding framework extensively used for stochastic optimization [11, 37]; Oja's algorithm, originally proposed by [48], was the first such scheme for PCA. This framework is also used to analyze gradient-type and proximal-type incremental methods akin to algorithms for convex optimization.
Stochastic gradient descent-based algorithms for the streaming PCA problem, where a single observation is used at every time step to update the principal components estimate, have been extensively studied. Along this line, [30] proposes GRASTA, an incremental online gradient method for learning over different subspaces. Similarly, [6] proposes GROUSE, based on the idea of gradient updates over the Grassmannian manifold. These and other related works consider Oja's algorithm for the standard streaming PCA problem [1, 16, 31, 53, 59, 62]. However, all these works are based on a completely different modeling assumption than this paper and do not provide theoretical guarantees for our setting. Streaming and robust PCA algorithms are also used in the presence of outliers or data with many missing entries [18, 55]. From this literature, the closest to our work is the work on robust subspace tracking [2, 30, 60]. However, the robustness considered there is against erasures or sparse outliers. While these algorithms provide theoretical guarantees in that setting, those guarantees cannot be extended to our model. Considerations of erasures and outliers under the model proposed in this paper are beyond the scope of this work and remain an interesting future direction.

3 Mathematical Framework

Observations from the standard spiked covariance model belong to the fixed $k$-dimensional column space of a matrix $A \in \mathbb{R}^{p \times k}$. Previous work has focused on reconstructing this space from the observed time series. Under our framework, we assume that the sequence of observations $(x_t)_{t=1}^T$ follows the time-dependent spiked covariance model (1), and we consider the problem of computing the top-$k$ singular vectors of $A_T$. We formulate this model as a robust optimization problem via a minimax formulation. If the adversary were allowed to select a completely arbitrary sequence of matrices $(A_t)_{t=1}^T$, it would be impossible to accurately recover the column space of $A_T$. Instead, we define temporal uncertainty sets that restrict the power of the adversary.

Definition 1. Let $\Gamma, \delta \ge 0$. We only allow sequences of matrices $(A_t)_{t=1}^T$ that lie in the temporal uncertainty set $\mathcal{T}_u(\delta, \Gamma)$ defined as

$$\mathcal{T}_u(\delta, \Gamma) := \Big\{ (A_t)_{t=1}^T :\ s_k(A_t A_t^\top) \ge \delta,\ \ \|A_t A_t^\top - A_{t+1} A_{t+1}^\top\| \le \Gamma \ \text{ for all } t \Big\}. \qquad (2)$$

We constrain the difference between any two consecutive covariance matrices of the underlying process by $\Gamma$. The assumption $s_k(A_t A_t^\top) \ge \delta$ is crucial for establishing bounds on the estimation error. It is also justified in any setting where the underlying phenomenology guarantees a covariance of rank $k$; the stated lower bound excludes pathological cases of near-rank $k-1$ or smaller.
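For concreteness, a minimal sketch of checking whether a given sequence lies in $\mathcal{T}_u(\delta, \Gamma)$ follows; the helper `in_uncertainty_set` is our naming, and it assumes the whole sequence is available at once, which a streaming algorithm of course does not.

```python
# Minimal sketch of a membership check for the temporal uncertainty set
# T_u(delta, Gamma) of Definition 1 (illustrative only).
import numpy as np

def in_uncertainty_set(A_seq, delta, gamma):
    """Check s_k(A_t A_t^T) >= delta and ||A_t A_t^T - A_{t+1} A_{t+1}^T|| <= Gamma."""
    covs = [A @ A.T for A in A_seq]
    k = A_seq[0].shape[1]
    sk_ok = all(np.linalg.svd(C, compute_uv=False)[k - 1] >= delta for C in covs)
    drift_ok = all(np.linalg.norm(covs[t + 1] - covs[t], ord=2) <= gamma
                   for t in range(len(covs) - 1))
    return sk_ok and drift_ok
```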
Having described the constraints on the perturbation power, we present the algorithms and the performance metric of interest. In particular, we consider (i) streaming algorithms, i.e., algorithms that make a single pass through the time series in chronological order and store only a limited number of previous samples at any point in time. These characteristics model the behavior of an algorithm that receives data in real time and can only store samples in limited memory. Among streaming algorithms, we consider (ii) algorithms whose output is a set of $k$ orthonormal vectors in $\mathbb{R}^p$. Let us denote by $\Phi$ the family of algorithms just described. The output of an algorithm $\phi \in \Phi$ for given observations $X = (x_t)_{t=1}^T$, sampled from model (1) under a sequence in $\mathcal{T}_u(\delta, \Gamma)$, is a set of $k$ orthonormal vectors, which we denote by $\phi_X$. We will also view $\phi_X$ as a matrix in $\mathbb{R}^{p \times k}$ or as the subspace it generates. The following definition specifies how we measure the estimation error and when we call an algorithm optimal. We note that this formulation has been widely studied [14, 57, 58].

Definition 2. Let $\mathcal{A} = (A_t)_{t=1}^T \in \mathcal{T}_u(\delta, \Gamma)$ and $\phi \in \Phi$.

1. The estimation error of $\phi$ given $X \sim \mathcal{A}$ is the distance between the space spanned by $\phi_X$ and the column space of $A_T$, i.e., $d(\mathrm{ran}(A_T), \phi_X)$. The metric can easily be extended to a cumulative error; see Appendix A for the discussion.

2. $R_\phi$ is the maximum expected estimation error of $\phi$ over all $\mathcal{A} \in \mathcal{T}_u(\delta, \Gamma)$:
$$R_\phi := \sup_{\mathcal{A} \in \mathcal{T}_u(\delta, \Gamma)} \mathbb{E}_{X \sim \mathcal{A}}\Big[ d\big(\mathrm{ran}(A_T), \phi_X\big) \Big].$$

3. $R^*$ is the minimax estimation error, defined as the minimum over $\phi \in \Phi$ of the largest expected estimation error:
$$R^* := \inf_{\phi \in \Phi}\ \sup_{\mathcal{A} \in \mathcal{T}_u(\delta, \Gamma)} \mathbb{E}_{X \sim \mathcal{A}}\Big[ d\big(\mathrm{ran}(A_T), \phi_X\big) \Big].$$

4. An algorithm $\phi \in \Phi$ is rate-optimal if $R_\phi \le C \cdot R^*$, where the constant $C > 0$ is independent of the problem parameters $T, \delta, p, k$, and $\Gamma$.

In this work, we establish the minimax estimation error $R^*$ and propose rate-optimal, sublinear-time, single-pass algorithms for robust streaming PCA.

4 Minimax Lower Bound

When $\mathcal{A} = (A_t)_{t=1}^T$ belongs to the temporal uncertainty set $\mathcal{T}_u(\delta, \Gamma)$ (Definition 1), an algorithm designed to recover the principal components of $A_T$ from the observations cannot guarantee zero estimation error. Our first goal is to obtain a minimax lower bound on the estimation error in terms of the problem parameters $T, \delta, p, k$, and $\Gamma$. To establish the lower bound, we leverage the fundamental limits of hypothesis testing [57]. The crux of the proof lies in constructing a set of worst-case hypotheses and establishing a lower bound on the probability of error in distinguishing between these hypotheses using the observed data. The complete proof is provided in Appendix D.

Theorem 1 (Lower Bound). Assume $\delta > \Gamma \ge 0$ and $p > 2k + 1$. For any algorithm $\phi \in \Phi$, there exists a sequence $\mathcal{A} \in \mathcal{T}_u(\delta, \Gamma)$ such that $\mathbb{E}_{X \sim \mathcal{A}}\big[d\big(\mathrm{ran}(A_T), \phi_X\big)\big]$ is lower bounded, in order, by
$$\Omega\!\left( \left(\frac{\Gamma}{\delta}\right)^{1/3}\!\left(\frac{p\sigma^2(\sigma^2 + \delta)}{\delta^2}\right)^{1/3} + \frac{1}{\delta}\sqrt{\frac{p\sigma^2(\sigma^2 + \delta)}{T}} \right).$$
By taking the infimum over $\phi \in \Phi$, the same lower bound holds for $R^*$.

For the standard streaming PCA problem (Theorem 1 with $\Gamma = 0$), the fundamental limit is $\Omega\big((\sigma/\delta)(p(\sigma^2 + \delta)/T)^{1/2}\big)$, which has the expected $\Omega(1/\sqrt{T})$ dependence [14, 58]. On the other hand, in the presence of perturbations ($\Gamma > 0$), the lower bound exhibits a phase-transition phenomenon, with the first term representing the effect of model ambiguity. To this end, define the critical time $T_c$ as
$$T_c := \left(\frac{\delta}{\Gamma}\right)^{2/3}\left(\frac{p\sigma^2(\sigma^2 + \delta)}{\delta^2}\right)^{1/3},$$
the time at which the two terms of the lower bound are of the same order. For $T = O(T_c)$, the lower bound decreases at the rate $1/\sqrt{T}$. However, when $T = \Omega(T_c)$, the first term dominates the second, and $R^*$ becomes independent of the number of observations $T$. In this regime, the error stagnates at $O\big((\Gamma/\delta)^{1/3}\,(p\sigma^2(\sigma^2 + \delta)/\delta^2)^{1/3}\big)$. Therefore, as our intuition suggests, information quickly becomes stale in a dynamic environment.

Theorems 2 and 3 will show that the noisy power method and Oja's algorithm attain near-optimal convergence guarantees. Theorem 2 guarantees that if $s_1(A_t A_t^\top) = \Theta(\delta)$, the upper bound on the estimation error of the noisy power method is of order
$$\tilde{O}\!\left( \left(\frac{\Gamma}{\delta}\right)^{1/3}\!\left(\frac{(p\sigma^2 + k\delta)(\sigma^2 + \delta)}{\delta^2}\right)^{1/3} \right).$$
That is, if $p\sigma^2$ dominates $k\delta$, the noisy power method becomes rate-optimal under the controlled uncertainty $\mathcal{T}_u(\delta, \Gamma)$. This regime corresponds to noisy practical situations, in which $\sigma^2$ is not much smaller than $\delta$.
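The following is an illustrative computation, under the form of the bound stated above and ignoring absolute constants, of the two terms of the Theorem 1 lower bound and of the critical time $T_c$ at which they cross; the function names are ours.

```python
# Illustrative only: evaluate the stagnation (Gamma-driven) and decaying
# (1/sqrt(T)) terms of the Theorem 1 lower bound, ignoring constants.
import numpy as np

def lower_bound_terms(T, p, sigma, delta, gamma):
    stagnation = (gamma / delta) ** (1 / 3) * (p * sigma**2 * (sigma**2 + delta) / delta**2) ** (1 / 3)
    decaying = np.sqrt(p * sigma**2 * (sigma**2 + delta) / T) / delta
    return stagnation, decaying

def critical_time(p, sigma, delta, gamma):
    # T_c: the horizon beyond which the stagnation term dominates.
    return (delta / gamma) ** (2 / 3) * (p * sigma**2 * (sigma**2 + delta) / delta**2) ** (1 / 3)

# Example: with (sigma, delta, p) = (0.15, 1.0, 100), collecting observations
# beyond T ~ critical_time(100, 0.15, 1.0, gamma) no longer reduces the bound.
```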
5 Convergence Analysis

In this section, we analyze two algorithms for the robust streaming PCA problem. A generic template for the algorithms $\phi \in \Phi$ of interest to us is as follows: (i) $\phi$ is initialized with a random matrix with orthonormal columns $\hat{U} \in \mathbb{R}^{p \times k}$; (ii) a running estimate of the principal components is maintained as the columns of $\hat{U}$; (iii) observations are projected onto the column space of $\hat{U}$ to update this estimate.

We consider two algorithms: a robust version of the noisy power method (Algorithm 1) and Oja's algorithm (Algorithm 2). The critical difference between the two is the data used to update the estimate of the principal components. In the noisy power method, the estimate is updated after a block of observations, whereas in Oja's algorithm, the estimate is updated after every observation, scaled by the learning rate. Therefore, the parameters determining the performance of these algorithms are the block size $B$ for the robust power method and the learning rate $\eta$ for Oja's algorithm. The analysis of these algorithms cannot be readily carried out with existing techniques when the covariance matrix belongs to a temporal uncertainty set, since those techniques rely on showing that the estimate improves at every iteration. Further, applying many concentration results requires the random matrices to be bounded, which is not the case when the observations are sampled from (1). Therefore, to simplify the analysis of both algorithms, we introduce Assumption 1, adapted from [34].

Assumption 1. Let $(A_t)_{t=1}^T \in \mathcal{T}_u(\delta, \Gamma)$. For $\bar\delta \ge \delta$ and $M, V > 0$, the observations $x_t$ for $t \in [T]$ satisfy:
$$\mathbb{E}[x_t x_t^\top] = A_t A_t^\top + \sigma^2 I_{p \times p}, \quad \big\| x_t x_t^\top - (A_t A_t^\top + \sigma^2 I_{p \times p}) \big\| \le M \ \text{a.s.}, \quad \Big\| \mathbb{E}\big[\big(x_t x_t^\top - (A_t A_t^\top + \sigma^2 I_{p \times p})\big)^2\big] \Big\| \le V.$$

When the observations follow model (1), we condition our analysis on a high-probability event $\mathcal{E}$.

Definition 3. Let $\mathrm{SVD}(A_t A_t^\top + \sigma^2 I_{p \times p}) = U_t D_t U_t^\top$ and write $x_t = U_t D_t^{1/2} z_t$, where $z_t \sim \mathcal{N}(0, I_{p \times p})$. We define the event $\mathcal{E}$ as
$$\mathcal{E} := \Big\{ -\sqrt{2\log(2pT^2)} \le z_{t,i} \le \sqrt{2\log(2pT^2)} \ \text{ for all } t \in [T],\ i \in [p] \Big\}.$$

The observations from model (1) satisfy Assumption 1 with $M = 2(k\bar\delta + p\sigma^2)\big(1 + \sqrt{\log(pT^2)/T}\big)$ and $V = 2M(\bar\delta + \sigma^2)$, with probability $\mathbb{P}[\mathcal{E}] \ge 1 - 1/T$. Although, after conditioning, $\mathbb{E}[x_t x_t^\top \mid \mathcal{E}] \neq A_t A_t^\top + \sigma^2 I_{p \times p}$, all the results in Sections 5.1 and 5.2 remain valid up to a multiplicative logarithmic factor. Please refer to Appendix B for the details.

5.1 Noisy Power Method

Algorithm 1: Noisy Power Method with block size $B$ [29]
1: Input: stream of vectors $(x_t)_{t=1}^T$, block size $B$, dimensions $p, k$
2: Sample each element of $\hat{U}(0)$ i.i.d. from $\mathcal{N}(0, 1)$
3: for $\ell = 1 : L = \lfloor T/B \rfloor$ do
4:   $Y(\ell) \leftarrow 0 \in \mathbb{R}^{p \times k}$
5:   for $t = (\ell - 1)B + 1 : \ell B$ do
6:     $Y(\ell) \leftarrow Y(\ell) + \frac{1}{B}\, x_t x_t^\top \hat{U}(\ell - 1)$
7:   end for
8:   $\hat{U}(\ell) \leftarrow \mathrm{Gram\text{-}Schmidt}(Y(\ell))$
9: end for
10: Output: $\hat{U}(L)$

The noisy power method is an iterative algorithm for computing the top-$k$ principal components of a matrix. Starting from the random matrix $\hat{U}(0) \in \mathbb{R}^{p \times k}$, the algorithm runs for $L$ iterations, each processing $B$ samples. By repeating this procedure, we expect the algorithm to reconstruct the covariance matrix when the observations are drawn from a fixed distribution.
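A direct NumPy sketch of Algorithm 1 is given below; Gram-Schmidt is realized with a thin QR factorization, and the function name `noisy_power_method` is ours rather than the paper's.

```python
# Minimal sketch of Algorithm 1 (noisy power method with block size B).
import numpy as np

def noisy_power_method(xs, k, B, rng=None):
    """xs: array of shape (T, p); returns a p x k orthonormal estimate U_hat(L)."""
    rng = np.random.default_rng() if rng is None else rng
    T, p = xs.shape
    U_hat = rng.standard_normal((p, k))                 # random init U_hat(0)
    L = T // B
    for ell in range(1, L + 1):
        block = xs[(ell - 1) * B : ell * B]             # B observations
        Y = (block.T @ (block @ U_hat)) / B             # (1/B) sum_t x_t x_t^T U_hat
        U_hat, _ = np.linalg.qr(Y)                      # Gram-Schmidt step
    return U_hat
```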
When observations are drawn from model (1) under Assumption 1, later observations can be sampled from distributions with shifted covariance. Unlike the standard case, for any algorithm $\phi \in \Phi$, the presence of perturbations prevents the columns of $\hat{U}(\ell)$ from converging to the singular vectors of the final covariance matrix: the columns of $\hat{U}(\ell)$ do not converge towards a fixed set of vectors but must keep tracking the time-varying principal components. Recall that our ultimate objective is to recover the principal components associated with the terminal observation. Hence, we decompose the covariance of each block in terms of the last observation of the block and the remaining samples. For the $\ell$-th block, we denote the covariance matrix of the $\ell B$-th observation by $M(\ell)$ and have
$$\frac{1}{B} \sum_{t=(\ell-1)B+1}^{\ell B} x_t x_t^\top \;=\; \mathbb{E}\big[x_{\ell B} x_{\ell B}^\top\big] + E(\ell) \;=\; \underbrace{A_{\ell B} A_{\ell B}^\top + \sigma^2 I}_{M(\ell)} \,+\, E(\ell). \qquad (6)$$

Due to the perturbations in the robust model, $E(\ell)$ is a non-zero-mean random variable. Therefore, we first decompose $E(\ell)$ into the contribution of the inherent noise and that of the perturbations the robust model allows. Then, in Lemma 1, we bound the error $\|E(\ell)\|$ in terms of the block size $B$ and the allowed perturbation $\Gamma$. The complete proof is provided in Appendix E.

Lemma 1 (Spectral norm of noise). Assume that the observations $(x_t)_{t=1}^T$ are generated according to Assumption 1. With probability greater than $1 - 1/T$, the matrices $E(\ell)$ in equation (6) are bounded by
$$\max_{1 \le \ell \le L} \|E(\ell)\| \;\le\; (1 + \sqrt{3})\sqrt{\frac{V \log(2pT^2)}{B}} \;+\; B\,\Gamma,$$
when the block size $B$ is larger than $M^2 \log(2pT^2)/V$.

Lemma 1 highlights the effect of allowing perturbations in the data-generating model. Classical results in statistics suggest that the effect of noise washes out as the block size increases, i.e., the first term decays with the inverse square root of the block size; hence, barring memory and data constraints, a larger block size is better when $\Gamma = 0$. In contrast, covariance perturbations add an error proportional to the block size. Consequently, there is a trade-off between the two terms, and an optimal block size exists depending on $\Gamma$.

We establish the convergence guarantee of the robust power method in Theorem 2. In the proof of Theorem 2 (Appendix F), we bound the distance $d\big(U_{1:k}(L), \hat{U}(L)\big)$ between the output of Algorithm 1, $\hat{U}(L)$, and the $k$ orthonormal vectors spanning the column space of $A_T$, $U_{1:k}(L)$. We identify the optimal block size $B$, the unique parameter of the noisy power method, and establish an upper bound on its estimation error.

Theorem 2 (Robust power method). Assume that $\delta \ge 0.71\sigma^2$ and $\Gamma = O\big(\delta^3/(V \log(2pT^2))\big)$. When the observations $(x_t)_{t=1}^T$ satisfy Assumption 1, for $B = \Theta\big(V^{1/3} \log(2pT^2)^{1/3}/\Gamma^{2/3}\big)$ we have
$$d\big(\mathrm{ran}(A_T), \mathrm{NPM}_X\big) = O\!\left( \frac{\big(V\, \Gamma\, \log(2pT^2)\big)^{1/3}}{\delta} \right)$$
with probability $1 - 2/T - c^{-(p-k+1)} - e^{-\Omega(p)}$.

When $\Gamma = \Omega\big(\delta^3/(V \log(2pT^2))\big)$, we have $\big(V \Gamma \log(2pT^2)\big)^{1/3}/\delta = \Omega(1)$; therefore, the condition on $\Gamma$ in Theorem 2 is necessary to avoid the trivial upper bound $\Omega(1)$. This condition encompasses several applications of interest alluded to earlier. For example, in the financial applications mentioned above, individual market changes of interest happen on a millisecond time scale, and it is of significant interest to detect incremental market changes at the terminal time. Our results tolerate large cumulative changes and allow us to study such settings. They further imply that the noisy power method is rate-optimal for non-trivial values of $\Gamma$.

When the observations follow model (1), under the event $\mathcal{E}$, Theorem 2 shows that the robust power method achieves an estimation error of
$$\tilde{O}\!\left( \left(\frac{\Gamma}{\delta}\right)^{1/3}\!\left(\frac{(p\sigma^2 + k\delta)(\sigma^2 + \delta)}{\delta^2}\right)^{1/3} \right)$$
if $T = \Omega\big(\max(T_c,\, \delta(p\sigma^2)^{-1})\big)$, $\Gamma = \Omega\big((c^{-(p-k+1)} + e^{-\Omega(p)})\,\delta^2 (p\sigma^2)^{-1}\big)$, and $\bar\delta = \Theta(\delta)$. This is order-wise identical to the fundamental limit established in Theorem 1 whenever $p\sigma^2$ dominates $k\delta$. The first condition, on $T$, specifies when past observations become less critical. The success probability of the noisy power method [29] with random initialization must not be too small if one is to convert the high-probability bound into a bound in expectation; we address this regime through the condition on $\Gamma$, which is coarse due to the exponential terms and the factor $(p\sigma^2)^{-1}$.
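The following toy computation (ours, with illustrative constants) makes the trade-off of Lemma 1 concrete: the concentration term shrinks like $1/\sqrt{B}$ while the drift term grows like $B\Gamma$, so the minimizing block size grows like $\Gamma^{-2/3}$, consistent with the choice of $B$ in Theorem 2.

```python
# Illustrative only: minimize a proxy for the Lemma 1 error over the block size.
import numpy as np

def error_proxy(B, V, gamma, log_term):
    return np.sqrt(V * log_term / B) + B * gamma     # concentration + drift

def best_block_size(V, gamma, log_term, B_grid):
    errs = [error_proxy(B, V, gamma, log_term) for B in B_grid]
    return B_grid[int(np.argmin(errs))]

B_grid = np.arange(1, 5001)
for gamma in (1e-2, 1e-3, 1e-4):
    B_star = best_block_size(V=1.0, gamma=gamma, log_term=10.0, B_grid=B_grid)
    print(gamma, B_star)     # B_star grows roughly like gamma**(-2/3)
```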
Establishing bounds on the estimation error when the underlying singular vectors change is intricate, since the subspaces to which consecutive observations belong are potentially different. Conventional proofs analyzing the noisy power method or Oja's algorithm show that, under a variety of assumptions, the distance between the estimated and true subspaces, $d\big(U(\ell), \hat{U}(\ell)\big)$, decreases at every iteration $\ell$. For instance, the proof in [29] requires $\|E(\ell)\,U(\ell)\| = O(\sqrt{k/p})$, which does not hold under our model since $\|M(\ell) - M(\ell-1)\|$ is, in general, greater than $\sqrt{k/p}$. Similarly, the concentration approach in [33] can be used only when the covariance matrix is time-invariant.

We briefly describe our proof technique for Theorem 2, deferring the details to Appendix F. Let $\mathcal{M}(L)$ denote the product $\prod_{\ell=1}^{L}\big(M(\ell) + E(\ell)\big)$. The output of the algorithm, $\hat{U}_{1:k}(L)$, is an orthonormal basis of $\mathcal{M}(L)\hat{U}(0)$, which estimates the first $k$ principal components $U_{1:k}$ of $\mathcal{M}(L)$. To address this, we construct sequences of $(p-k)$- and $k$-dimensional subspaces of $\mathbb{R}^p$ from the observations $(x_t)_{t=1}^{\ell B}$, denoted $\{N(\ell)\}_{\ell=1}^{L} \subset \mathrm{St}_{p-k}(\mathbb{R}^p)$ and $\{W(\ell)\}_{\ell=1}^{L} \subset \mathrm{St}_k(\mathbb{R}^p)$ respectively, such that for all iterations $\ell$:

1. $\mathrm{ran}\big((M(\ell) + E(\ell))\,N(\ell)\big) \subseteq \mathrm{ran}\big(N(\ell+1)\big)$ and $\mathrm{ran}\big((M(\ell) + E(\ell))\,W(\ell)\big) \subseteq \mathrm{ran}\big(W(\ell+1)\big)$;
2. $\big\|(M(\ell) + E(\ell))\,N(\ell)\big\| \cdot \big\|\big((M(\ell) + E(\ell))\,W(\ell)\big)^{-1}\big\| < 1$;
3. $d\big(U_{1:k}(\ell), W(\ell+1)\big),\ d\big(U_{k+1:p}(\ell), N(\ell+1)\big) = O\big(\|E(\ell)\|\big)$.

The initial random matrix $\hat{U}_{1:k}(0)$ has non-trivial components in both $N(1)$ and $W(1)$ with high probability. By the first two properties, at every iteration $\ell$ the projection of $\hat{U}_{1:k}(\ell-1)$ onto $W(\ell)$ is amplified more than that onto $N(\ell)$, so $\hat{U}_{1:k}(L)$ becomes very close to $W(L)$ after sufficiently many iterations $L$. By the last property, $W(L)$ is close to $U_{1:k}(L)$, with the distance between them proportional to $\|E(L)\|$. Combining these ideas with Lemma 1 establishes the result.

5.2 Oja's Algorithm

Algorithm 2: Oja's Algorithm with learning rate $\eta$ [48]
1: Input: stream of vectors $(x_t)_{t=1}^T$, learning rate $\eta$, dimensions $p, k$
2: Sample each element of $\hat{U}(0)$ i.i.d. from $\mathcal{N}(0, 1)$
3: for $t = 1 : T$ do
4:   $\hat{U}(t) \leftarrow \mathrm{Gram\text{-}Schmidt}\big((I_{p \times p} + \eta\, x_t x_t^\top)\, \hat{U}(t-1)\big)$
5: end for
6: Output: $\hat{U}(T)$

We now establish the convergence guarantee for Oja's algorithm (Algorithm 2) when the observations follow equation (1). Unlike the noisy power method, Oja's algorithm is multiplicative in its construction of the estimated subspace. We extend the existing analyses of Oja's algorithm [3, 33, 34] by considering a virtual block of $B_\eta = 1/\eta$ observations. Building upon the analysis framework developed for the noisy power method and the intuition from the binomial approximation $(1 + \eta x)^{1/\eta} \simeq 1 + x$, we establish the convergence guarantee for Oja's algorithm.
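A direct NumPy sketch of Algorithm 2 follows; as before, Gram-Schmidt is realized with a thin QR factorization and the function name `oja` is ours rather than the paper's.

```python
# Minimal sketch of Algorithm 2 (Oja's algorithm with learning rate eta).
import numpy as np

def oja(xs, k, eta, rng=None):
    """xs: array of shape (T, p); returns a p x k orthonormal estimate U_hat(T)."""
    rng = np.random.default_rng() if rng is None else rng
    T, p = xs.shape
    U_hat = rng.standard_normal((p, k))                 # random init U_hat(0)
    for x in xs:
        U_hat = U_hat + eta * np.outer(x, x @ U_hat)    # (I + eta x x^T) U_hat
        U_hat, _ = np.linalg.qr(U_hat)                  # Gram-Schmidt step
    return U_hat
```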
As in the previous section, we decompose the product of updates over a virtual block into a target matrix, built from $\big(I_{p \times p} + \eta\,\mathbb{E}[x_t x_t^\top]\big)$, and an error matrix $E'(\ell)$:
$$\prod_{t=(\ell-1)B_\eta+1}^{\ell B_\eta}\!\big(I_{p \times p} + \eta\, x_t x_t^\top\big) \;=\; \big(I_{p \times p} + \eta\,\mathbb{E}[x_{\ell B_\eta} x_{\ell B_\eta}^\top]\big)^{B_\eta} + E(\ell) \;=\; M_{\mathrm{Oja}}(\ell) + e^{\eta(\bar\delta + \sigma^2)}\, E'(\ell), \qquad (10)$$
where $E'(\ell)$ is the scaled error matrix. In Lemma 2, we bound the scaled error matrix in terms of the learning rate $\eta$, or equivalently the virtual block size $B_\eta = 1/\eta$, analogously to Lemma 1. The proof is provided in Appendix G.

Lemma 2 (Spectral norm of noise, Oja's algorithm). Assume that the observations $(x_t)_{t=1}^T$ are generated according to Assumption 1. With probability greater than $1 - 1/T$, the matrices $E'(\ell)$ in equation (10) are bounded by
$$\max_{1 \le \ell \le L} \|E'(\ell)\| \;\le\; 2e^2 \sqrt{\frac{2 M^2 \log(pT^2)}{B_\eta}} \;+\; O(B_\eta\, \Gamma), \qquad (11)$$
when the virtual block size $B_\eta = 1/\eta$ is larger than $2M^2 \log(pT^2)$.

The appearance of the parameter $M$, rather than $V$ as in Lemma 1, arises from the use of a different concentration inequality. The estimator in the noisy power method averages outer products of vectors (equation (6)), which makes it straightforward to use Bernstein's inequality. Oja's algorithm, on the other hand, involves a product of random matrices rather than a sum; we therefore use a multiplicative concentration inequality [32], which requires a (probabilistic) norm bound on the matrices. Combining multiplicative concentration inequalities with Lemma 2 and the techniques developed for the noisy power method, we obtain a convergence guarantee for Oja's algorithm in Theorem 3. The complete proof is provided in Appendix H.

Theorem 3 (Oja's algorithm). Assume that $\delta \ge 0.71\sigma^2$ and $\Gamma = O\big(\delta^3/(e^{3\eta(\bar\delta - \delta)} M^2 \log(pT^2))\big)$. When the observations $(x_t)_{t=1}^T$ satisfy Assumption 1, for $1/\eta = \Theta\big(M^{2/3} \log(pT^2)^{1/3}/\Gamma^{2/3}\big)$ we have
$$d\big(\mathrm{ran}(A_T), \mathrm{Oja}_X\big) = O\!\left( \frac{\big(M^2\, \Gamma\, \log(pT^2)\big)^{1/3}}{\delta} \right)$$
with probability $1 - 2/T - c^{-(p-k+1)} - e^{-\Omega(p)}$.

Unlike for the noisy power method, the upper bound in Theorem 3 scales as $\tilde{O}(p^{2/3})$ rather than with the optimal dependence $\tilde{O}(p^{1/3})$ from Theorem 1. It is unclear whether this upper bound can be improved. Sharpening our analysis with a two-phase strategy [33, 39] (wherein the first phase identifies a good initial point and the second phase establishes convergence from that initial point) is an interesting direction for future investigation.

6 Numerical Results

The key observations from Theorems 2 and 3 are (i) the existence of an optimal block size $B$ and learning rate $\eta$ that minimize the recovery error, and (ii) the $\Gamma^{-2/3}$ dependence of the optimal $B$ and $1/\eta$. To verify these results for both algorithms, this section reports the performance of the algorithms in various environments. We synthesize $\mathcal{A} = (A_t)_{t=1}^T$ and sample $X = (x_t)_{t=1}^T$ from $\mathcal{A}$. We generate each $A_t \in \mathbb{R}^{p \times k}$ as the product of three matrices, $U_t \in \mathrm{St}_p(\mathbb{R}^p)$, $D_t \in \mathbb{R}^{p \times k}$ (diagonal), and $V_t \in \mathrm{St}_k(\mathbb{R}^k)$. To obtain the matrix of the next step, we rotate the first matrix as $U_t = U_{t-1} R_t$ with $R_t \in SO(p)$. The vectors $x_t$ are sampled from model (1). More details of the experimental setup are described in Appendix I.
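A minimal sketch of such a drifting-factor generator is shown below; the rotation construction via a skew-symmetric generator and the fixed diagonal $D$ are simplifying assumptions of ours and do not reproduce the exact protocol of Appendix I.

```python
# Illustrative sketch of a drifting-subspace generator in the spirit of Section 6:
# A_t = U_t D V^T, where U_t is rotated by a small random rotation each step.
import numpy as np

def drifting_factors(p, k, T, angle, rng=None):
    rng = np.random.default_rng() if rng is None else rng
    U, _ = np.linalg.qr(rng.standard_normal((p, p)))     # U_0 in St_p(R^p)
    V, _ = np.linalg.qr(rng.standard_normal((k, k)))     # V in St_k(R^k)
    D = np.vstack([np.diag(np.linspace(2.0, 1.0, k)), np.zeros((p - k, k))])
    A_seq = []
    for _ in range(T):
        G = rng.standard_normal((p, p))
        S = angle * (G - G.T)                            # skew-symmetric generator
        R = np.eye(p) + S + 0.5 * S @ S                  # approximately exp(S), a small rotation
        U, _ = np.linalg.qr(U @ R)                       # re-orthonormalize U_t
        A_seq.append(U @ D @ V.T)                        # A_t of shape (p, k)
    return A_seq
```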
Figure 1: Numerical results with synthesized streams of vectors, using the setting $(\sigma, \delta, p, k) = (0.15, 1.0, 100, 5)$. (a) Distance between the true column space of $A_T$ (i.e., $U_{1:k}$) and the estimated space at $t = T$ (i.e., $\hat{U}_{1:k}$) as $\Gamma$ varies. (b) Empirically optimal block size $B$ and learning rate $\eta$ as $\Gamma$ varies.

Our first observation in Figure 1a is that, without any covariance perturbation ($\Gamma = 0$), the convergence error decreases as the block size increases. This behavior is expected, since using more past information yields better accuracy for the recovered space in the absence of a covariance shift. However, when a covariance perturbation exists ($\Gamma > 0$), an optimal learning parameter emerges, and the optimal block size shrinks as the perturbation grows stronger. This phenomenon is also expected, since a larger adversarial budget implies that past information becomes less relevant. These observations corroborate our theoretical results in Theorem 1. In Figure 1b, we focus on the optimal values of the block size $B$ and of the inverse learning rate $1/\eta$, and their variation with the perturbation budget $\Gamma$. We plot the empirically optimal parameters for the noisy power method and Oja's algorithm together with the reference scaling $\Gamma^{-2/3}$, and observe that the optimal block size and inverse learning rate are proportional to $\Gamma^{-2/3}$. This experimental $\Gamma^{-2/3}$ dependence verifies the theoretical predictions of Theorems 2 and 3.

6.1 Experiments on Stock Price Dataset

We provide a real-world benchmark using the S&P 500 stock market dataset [44] from Kaggle to test our findings in a non-stationary environment. Refer to Appendix J for the non-stationarity of the environment and further experimental details that do not appear in the main paper. The stream of vectors consists of the normalized daily returns of $133\,(=p)$ companies. Since each company in the S&P 500 has a different time horizon of available information, we considered 133 companies with price information from Mar. 18, 1980 to Jul. 22, 2022 ($T = 10677$). We then calculated the daily return, i.e., the difference of the adjusted close price between two consecutive days, normalized by the adjusted close price of the day. The stream of vectors from this environment can be seen as sampled from time-varying distributions with $\Gamma \approx 0.17$. The objective is to predict the principal components of the covariance matrix of daily returns, as in [4, 54, 61]. We tested the noisy power method and Oja's algorithm on the preprocessed stream. For the target space $A_T$, we used the $k$-dimensional subspace consisting of the top-$k$ singular vectors of the covariance estimator calculated from the final 500 samples ($k = 1, \ldots, 5$).

Figure 2: S&P 500 stock market daily-return prediction, for $k = 1, 2, \ldots, 5$.

The results in Figure 2 indicate that our findings remain valid in a real-world environment with covariance shifts. First, each result exhibits an optimal-parameter regime, visible as U-shaped recovery-error curves. Furthermore, the recovery error of the noisy power method as the block size $B$ varies is akin to the recovery error of Oja's algorithm plotted against the scaled inverse learning rate $1/\eta$. These two observations support the main findings in Figure 1a. We also note that a different scaling for each $k$ is natural, since the spectral gap $\delta$ between the $k$-th and $(k{+}1)$-th singular values differs across $k$.

7 Conclusion

For the streaming PCA setting with time-varying covariance, we analyzed the fundamental lower bound on the minimax error and the estimation errors of the noisy power method and Oja's algorithm. Under this framework, when no perturbation exists ($\Gamma = 0$), our lower bound coincides, in order, with the traditional result. When perturbations exist ($\Gamma > 0$), there is an unavoidable positive minimax recovery error even as the time horizon becomes arbitrarily long. Next, we found that the noisy power method order-wise achieves this recovery error and becomes rate-optimal if $\sigma^2 = \Omega(k\delta/p)$. For the learning parameters $B$ and $1/\eta$, we showed that the optimal block size and the inverse learning rate minimizing the recovery error agree up to a multiplicative factor and are proportional to $\Gamma^{-2/3}$. Experimental results on both synthetic data and real-world environments support the theoretical findings on the learning parameters. Although our analysis requires a spectral gap assumption, a recent line of literature considers the effective rank [12, 50] as a crucial metric; establishing guarantees for robust streaming PCA in terms of the effective rank is a promising direction for future work.

Acknowledgements. DB and AS were supported by a DARPA Lagrange award.
MJ and SY were supported by an Institute of Information & Communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) (No. 2022-0-00311, Development of Goal-Oriented Reinforcement Learning Techniques for Contact-Rich Robotic Manipulation of Everyday Objects; No. 2019-0-00075, Artificial Intelligence Graduate School Program (KAIST)).

References

[1] K. Abed-Meraim, S. Attallah, A. Chkeif, and Y. Hua. Orthogonal Oja algorithm. IEEE Signal Processing Letters, 7(5):116–119, 2000.
[2] K. Abed-Meraim, A. Chkeif, and Y. Hua. Fast orthonormal PAST algorithm. IEEE Signal Processing Letters, 7(3):60–62, 2000.
[3] Zeyuan Allen-Zhu and Yuanzhi Li. First efficient convergence for streaming k-PCA: A global, gap-free, and near-optimal rate. In 2017 IEEE 58th Annual Symposium on Foundations of Computer Science (FOCS), pages 487–492, 2017.
[4] Samreen Fatima. The application of principal component analysis and factor analysis to stock markets returns. International Journal of Advanced Research, 7(5):97–105, May 2019.
[5] Raman Arora, Andrew Cotter, Karen Livescu, and Nathan Srebro. Stochastic optimization for PCA and PLS. In Communication, Control, and Computing (Allerton), 2012 50th Annual Allerton Conference on, pages 861–868. IEEE, 2012.
[6] Laura Balzano, Robert Nowak, and Benjamin Recht. Online identification and tracking of subspaces from highly incomplete information. In 2010 48th Annual Allerton Conference on Communication, Control, and Computing (Allerton), pages 704–711. IEEE, 2010.
[7] Thomas Bendokat, Ralf Zimmermann, and P.-A. Absil. A Grassmann manifold handbook: Basic geometry and computational aspects, 2020.
[8] Rajendra Bhatia. Matrix Analysis, volume 169. Springer Science & Business Media, 2013.
[9] Daniel Bienstock and Mauro Escobar. Stochastic defense against complex grid attacks. IEEE Transactions on Control of Network Systems, 7(2):842–854, 2020.
[10] Daniel Bienstock and Apurv Shukla. Variance-aware optimal power flow: Addressing the tradeoff between cost, security, and variability. IEEE Transactions on Control of Network Systems, 6(3):1185–1196, 2019.
[11] Vivek S. Borkar. Stochastic Approximation: A Dynamical Systems Viewpoint, volume 48. Springer.
[12] Florentina Bunea and Luo Xiao. On the sample covariance matrix estimator of reduced effective rank population matrices, with applications to fPCA. Bernoulli, 21(2):1200–1230, 2015.
[13] John Burkardt. The truncated normal distribution. Department of Scientific Computing Website, Florida State University, 1:35, 2014.
[14] Tony Cai, Zongming Ma, and Yihong Wu. Optimal estimation and rank detection for sparse spiked covariance matrices. Probability Theory and Related Fields, 161(3):781–815, April 2015.
[15] Emmanuel J. Candès, Xiaodong Li, Yi Ma, and John Wright. Robust principal component analysis? Journal of the ACM (JACM), 58(3):1–37, 2011.
[16] Minshuo Chen, Lin Yang, Mengdi Wang, and Tuo Zhao. Dimensionality reduction for stationary time series via stochastic nonconvex optimization. arXiv preprint arXiv:1803.02312, 2018.
[17] Yeshwanth Cherapanamjeri, Prateek Jain, and Praneeth Netrapalli. Thresholding based outlier robust PCA. In Conference on Learning Theory, pages 593–628. PMLR, 2017.
[18] Yuejie Chi, Yonina C. Eldar, and Robert Calderbank. PETRELS: Parallel subspace estimation and tracking by recursive least squares from partial observations. IEEE Transactions on Signal Processing, 61(23):5947–5959, 2013.
[19] John H. Conway, Ronald H. Hardin, and Neil J. A. Sloane. Packing lines, planes, etc.: Packings in Grassmannian spaces. Experimental Mathematics, 5(2):139–159, 1996.
[20] Wei Dai, Youjian Liu, and B. Rider. Quantization bounds on Grassmann manifolds of arbitrary dimensions and MIMO communications with feedback. In GLOBECOM '05. IEEE Global Telecommunications Conference, 2005, volume 3, pages 5 pp., 2005.
[21] David Donoho, Matan Gavish, and Iain Johnstone. Optimal shrinkage of eigenvalues in the spiked covariance model. The Annals of Statistics, 46(4):1742–1778, 2018.
[22] G. H. Dunteman. Principal Components Analysis. Number 69 in A Sage Publications. SAGE Publications, 1989.
[23] Mauro Escobar, Daniel Bienstock, and Michael Chertkov. Learning from power system data stream. In 2019 IEEE Milan PowerTech, pages 1–6. IEEE, 2019.
[24] Jiashi Feng, Huan Xu, and Shuicheng Yan. Online robust PCA via stochastic optimization. In Advances in Neural Information Processing Systems, pages 404–412, 2013.
[25] Mina Ghashami, Edo Liberty, Jeff M. Phillips, and David P. Woodruff. Frequent directions: Simple and deterministic matrix sketching. SIAM Journal on Computing, 45(5):1762–1792, 2016.
[26] John Goes, Teng Zhang, Raman Arora, and Gilad Lerman. Robust stochastic principal component analysis. In Artificial Intelligence and Statistics, pages 266–274. PMLR, 2014.
[27] Donald Goldfarb and Garud Iyengar. Robust portfolio selection problems. Mathematics of Operations Research, 28(1):1–38, 2003.
[28] Gene H. Golub and Charles F. Van Loan. Matrix Computations (4th ed.), volume 3. JHU Press, 2013.
[29] Moritz Hardt and Eric Price. The noisy power method: A meta algorithm with applications. In Advances in Neural Information Processing Systems, pages 2861–2869, 2014.
[30] Jun He, Laura Balzano, and John C. S. Lui. Online robust subspace tracking from partial information. arXiv e-prints, arXiv:1109.3827, September 2011.
[31] Amelia Henriksen and Rachel Ward. AdaOja: Adaptive learning rates for streaming PCA, 2019.
[32] De Huang, Jonathan Niles-Weed, Joel A. Tropp, and Rachel Ward. Matrix concentration for products. arXiv preprint arXiv:2003.05437, 2020.
[33] De Huang, Jonathan Niles-Weed, and Rachel Ward. Streaming k-PCA: Efficient guarantees for Oja's algorithm, beyond rank-one updates. In Mikhail Belkin and Samory Kpotufe, editors, Conference on Learning Theory, COLT 2021, 15-19 August 2021, Boulder, Colorado, USA, volume 134 of Proceedings of Machine Learning Research, pages 2463–2498. PMLR, 2021.
[34] Prateek Jain, Chi Jin, Sham M. Kakade, Praneeth Netrapalli, and Aaron Sidford. Streaming PCA: Matching matrix Bernstein and near-optimal finite sample guarantees for Oja's algorithm. In Conference on Learning Theory, pages 1147–1164, 2016.
[35] Iain M. Johnstone. On the distribution of the largest eigenvalue in principal components analysis. The Annals of Statistics, 29(2):295–327, 2001.
[36] Iain M. Johnstone and Debashis Paul. PCA in high dimensions: An orientation. Proceedings of the IEEE, 106(8):1277–1292, 2018.
[37] Harold Kushner and G. George Yin. Stochastic Approximation and Recursive Algorithms and Applications, volume 35. Springer Science & Business Media, 2003.
[38] Chris Junchi Li, Mengdi Wang, Han Liu, and Tong Zhang. Near-optimal stochastic approximation for online principal component estimation. Mathematical Programming, 167(1):75–97, 2018.
[39] Xin Liang. On the optimality of the Oja's algorithm for online PCA. CoRR, abs/2104.00512, 2021.
[40] Edo Liberty. Simple and deterministic matrix sketching. In Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '13, pages 581–588, New York, NY, USA, 2013. Association for Computing Machinery.
[41] André L. G. Mandolesi. Grassmann angles between real or complex subspaces, 2021.
[42] Poorya Mianjy and Raman Arora. Stochastic PCA with ℓ2 and ℓ1 regularization. In ICML, 2018.
[43] Ioannis Mitliagkas, Constantine Caramanis, and Prateek Jain. Memory limited, streaming PCA. In Proceedings of the 26th International Conference on Neural Information Processing Systems - Volume 2, NIPS '13, pages 2886–2894, Red Hook, NY, USA, 2013. Curran Associates Inc.
[44] Paul Mooney. Stock Market Data (NASDAQ, NYSE, S&P500). https://www.kaggle.com/datasets/paultimothymooney/stock-market-data, Version 64.
[45] Jonas W. Mueller, Vasilis Syrgkanis, and Matt Taddy. Low-rank bandit methods for high-dimensional dynamic pricing. Advances in Neural Information Processing Systems, 32, 2019.
[46] Praneeth Narayanamurthy and Namrata Vaswani. Provable dynamic robust PCA or robust subspace tracking. IEEE Transactions on Information Theory, 65(3):1547–1577, 2018.
[47] Yuri Neretin. On Jordan angles and the triangle inequality in Grassmann manifolds. Geometriae Dedicata, 86:81–91, 2001.
[48] Erkki Oja and Juha Karhunen. On stochastic approximation of the eigenvectors and eigenvalues of the expectation of a random matrix. Journal of Mathematical Analysis and Applications, 106(1):69–84, 1985.
[49] Daniel L. Pimentel-Alarcón, Aritra Biswas, and Claudia R. Solís-Lemus. Adversarial principal component analysis. In 2017 IEEE International Symposium on Information Theory (ISIT), pages 2363–2367. IEEE, 2017.
[50] Olivier Roy and Martin Vetterli. The effective rank: A measure of effective dimensionality. In 2007 15th European Signal Processing Conference, pages 606–610. IEEE, 2007.
[51] Ohad Shamir. Convergence of stochastic gradient descent for PCA. In International Conference on Machine Learning, pages 257–265, 2016.
[52] Ohad Shamir. Fast stochastic algorithms for SVD and PCA: Convergence properties and convexity. In International Conference on Machine Learning, pages 248–256, 2016.
[53] Wenjie Song, Jianke Zhu, Yang Li, and Chun Chen. Image alignment by online robust PCA via stochastic gradient descent. IEEE Transactions on Circuits and Systems for Video Technology, 26(7):1241–1250, 2015.
[54] Wataru Souma. Characteristics of principal components in stock price correlation. Frontiers in Physics, 9, 2021.
[55] Le Trung Thanh, Nguyen Viet Dung, Nguyen Linh Trung, and Karim Abed-Meraim. Robust subspace tracking with missing data and outliers: Novel algorithm with convergence guarantee. IEEE Transactions on Signal Processing, 69:2070–2085, 2021.
[56] Joel A. Tropp. An introduction to matrix concentration inequalities. Foundations and Trends in Machine Learning, 8(1-2):1–230, 2015.
[57] Alexandre B. Tsybakov. Introduction to Nonparametric Estimation. Springer Science & Business Media, 2008.
[58] Vincent Q. Vu and Jing Lei. Minimax sparse principal subspace estimation in high dimensions. The Annals of Statistics, 41(6):2905–2947, 2013.
[59] Sissi Xiaoxiao Wu, Hoi-To Wai, Anna Scaglione, and Neil A. Jacklin. The power-Oja method for decentralized subspace estimation/tracking. In 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 3524–3528, 2017.
[60] Bin Yang. Projection approximation subspace tracking. IEEE Transactions on Signal Processing, 43(1):95–107, 1995.
[61] Libin Yang. An Application of Principal Component Analysis to Stock Portfolio Management. PhD thesis, Department of Economics and Finance, University of Canterbury, 2015.
[62] Puyudi Yang, Cho-Jui Hsieh, and Jane-Ling Wang. History PCA: A new algorithm for streaming PCA. arXiv preprint arXiv:1802.05447, 2018.