Journal of Machine Learning Research 20 (2019) 1-62. Submitted 12/16; Revised 2/19; Published 2/19.

Streaming Principal Component Analysis From Incomplete Data

Armin Eftekhari (armin.eftekhari@epfl.ch), Institute of Electrical Engineering, École Polytechnique Fédérale de Lausanne, Lausanne, VD 1015, Switzerland

Gregory Ongie (gongie@uchicago.edu), Department of Statistics, University of Chicago, Chicago, IL 60637, USA

Laura Balzano (girasole@umich.edu), Department of Electrical Engineering and Computer Science, University of Michigan, Ann Arbor, MI 48109, USA

Michael B. Wakin (mwakin@mines.edu), Department of Electrical Engineering, Colorado School of Mines, Golden, CO 80401, USA

Editor: Tong Zhang

Abstract

Linear subspace models are pervasive in the computational sciences, particularly for large datasets, which are often incomplete due to privacy issues or sampling constraints. A critical problem is therefore to develop an algorithm for detecting low-dimensional linear structure from incomplete data that is efficient in terms of both computational complexity and storage. In this paper we propose a streaming subspace estimation algorithm called Subspace Navigation via Interpolation from Partial Entries (SNIPE) that efficiently processes blocks of incomplete data to estimate the underlying subspace model. In every iteration, SNIPE finds the subspace that best fits the new data block but remains close to the previous estimate. We show that SNIPE is a streaming solver for the underlying nonconvex matrix completion problem, that it converges globally to a stationary point of this program regardless of initialization, and that the convergence is locally linear with high probability. We also find that SNIPE shows state-of-the-art performance in our numerical simulations.
Keywords: Principal component analysis, Subspace identification, Matrix completion, Streaming algorithms, Nonconvex optimization, Global convergence

© 2019 Armin Eftekhari, Gregory Ongie, Laura Balzano, Michael B. Wakin. License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at http://jmlr.org/papers/v20/16-627.html.

1. Introduction

Linear models are the backbone of computational science, and Principal Component Analysis (PCA) in particular is an indispensable tool for detecting linear structure in collected data (van Overschee and de Moor, 2012; Ardekani et al., 1999; Krim and Viberg, 1996; Tong and Perreau, 1998). Principal components of a dataset are used, for example, to perform linear dimensionality reduction, which is in turn at the heart of classification, regression, and other learning tasks that often suffer from the curse of dimensionality, where having a small number of training samples in relation to the number of features typically leads to overfitting (Hastie et al., 2013). In this work, we are particularly interested in applying PCA to data that is presented sequentially to a user, with limited processing time available for each item. Moreover, due to hardware limitations, we assume the user can only store small amounts of data.

Figure 1: The sequence of generic vectors $\{s_t\}_{t=1}^T$ drawn from an unknown r-dimensional subspace $\mathcal{S}$ in panel (a) is only partially observed on random index sets $\{\omega_t\}_{t=1}^T$. That is, we only have access to the incomplete data vectors $\{y_t\}_{t=1}^T$ in panel (b), where the white entries are missing. Our objective is to estimate the subspace $\mathcal{S}$ from the incomplete data vectors when limited storage and processing resources are available. See Section 1 for the detailed setup.
Finally, we also consider the possibility that the incoming data is incomplete, either due to physical sampling constraints, or because it is deliberately subsampled to speed up processing or to address privacy concerns. As one example, consider monitoring network traffic over time, where acquiring complete network measurements at fine time-scales is impractical and subsampling is necessary (Lakhina et al., 2004; Gershenfeld et al., 2010). As another example, suppose we have a network of cheap, battery-powered sensors that must relay summary statistics of their measurements, say their principal components, to a central node on a daily basis. Each sensor can neither store nor process all of its daily measurements locally, nor does it have the power to relay all the raw data to the central node. Moreover, many measurements are unreliable and can be treated as missing. It is in this and similar contexts that we hope to develop a streaming algorithm for PCA from incomplete data.

More formally, we consider the following problem. Let $\mathcal{S}$ be an r-dimensional subspace of $\mathbb{R}^n$ with orthonormal basis $S \in \mathbb{R}^{n\times r}$. For an integer T, let the coefficient vectors $\{q_t\}_{t=1}^T \subset \mathbb{R}^r$ be independent copies of a random vector $q \in \mathbb{R}^r$ with bounded expectation, namely $\mathbb{E}\|q\|_2 < \infty$. Consider the sequence of vectors $\{Sq_t\}_{t=1}^T \subset \mathcal{S}$ and set $s_t := Sq_t$ for short. At each time $t \in [1:T] := \{1, 2, \dots, T\}$, we observe each entry of $s_t$ independently with probability p and collect the observed entries in $y_t \in \mathbb{R}^n$. Formally, we let $\omega_t \subseteq [1:n]$ be the random index set over which $s_t$ is observed and write this measurement process as $y_t = P_{\omega_t} s_t$, where $P_{\omega_t} \in \mathbb{R}^{n\times n}$ is the projection onto the coordinate set $\omega_t$; namely, it equals one on the diagonal entries corresponding to the index set $\omega_t$ and is zero elsewhere. Our objective in this paper is to design a streaming algorithm to identify the subspace $\mathcal{S}$ from the incomplete data $\{y_t\}_{t=1}^T$ supported on the index sets $\{\omega_t\}_{t=1}^T$.
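The sampling model above is easy to simulate. The following NumPy sketch (with illustrative dimensions of our choosing, not taken from the paper) generates the hidden vectors $s_t = Sq_t$ and their partial observations $y_t = P_{\omega_t} s_t$:

```python
import numpy as np

rng = np.random.default_rng(0)
n, r, T, p = 50, 3, 200, 0.4      # illustrative sizes, not from the paper

# Orthonormal basis S of the true r-dimensional subspace.
S, _ = np.linalg.qr(rng.standard_normal((n, r)))

# Coefficient vectors q_t, drawn here as standard Gaussians (bounded expectation).
Q = rng.standard_normal((r, T))
full = S @ Q                       # hidden data: column t is s_t = S q_t

# Observe each entry of s_t independently with probability p.
mask = rng.random((n, T)) < p      # column t encodes the index set omega_t
Y = np.where(mask, full, 0.0)      # y_t = P_{omega_t} s_t
```

Each column of `Y` keeps a random fraction of roughly p of the entries of the corresponding column of `full` and zeroes out the rest, exactly as the operator $P_{\omega_t}$ does.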
Put differently, our objective is to design a streaming algorithm to compute the leading r principal components of the full (but hidden) data matrix $[s_1\ s_2\ \cdots\ s_T]$ from the incomplete observations $[y_1\ y_2\ \cdots\ y_T]$; see Figure 1. By the Eckart-Young-Mirsky Theorem, this task is equivalent to computing the leading r left singular vectors of the full data matrix from its partial observations (Eckart and Young, 1936; Mirsky, 1966). Assuming that $r = \dim(\mathcal{S})$ is known a priori (or estimated from data by other means), we present the SNIPE algorithm for this task in Section 2. SNIPE is designed based on the principle of least change and, in every iteration, finds the subspace that best fits the new data block but remains close to the previous estimate. SNIPE requires $O(n)$ bits of memory and performs $O(n)$ flops in every iteration, which is optimal in its dependence on the ambient dimension n. As discussed in Section 3, SNIPE has a natural interpretation as a streaming algorithm for low-rank matrix completion (Davenport and Romberg, 2016). Section 4 discusses the global and local convergence of SNIPE. In particular, the local convergence rate is linear near the true subspace; namely, the estimation error of SNIPE reduces by a factor of $1 - cp$ in every iteration, for a certain factor c and with high probability. This local convergence guarantee for SNIPE is a key technical contribution of this paper, absent in its close competitors; see Section 5.

Even though we limit ourselves to the noise-free case $y_t = P_{\omega_t} s_t$ in this paper, SNIPE can also be applied (after minimal changes) when $y_t = P_{\omega_t}(s_t + n_t)$, where we might think of $n_t \in \mathbb{R}^n$ as measurement noise from a signal processing viewpoint. Alternatively, from a statistical viewpoint, $n_t$ represents the tail of the covariance matrix of the generative model from which $\{s_t\}_t$ are drawn. Moreover, SNIPE can be easily adapted to the dynamic case where the underlying subspace $\mathcal{S} = \mathcal{S}(t)$ changes over time.
We leave the convergence analysis of SNIPE under a noisy, time-variant model to future work. Similarly, entries of the incoming vectors are observed uniformly at random in our theoretical analysis, but SNIPE also applies to any pattern of incomplete data. A review of prior art is presented in Section 5, and the performance of SNIPE and rival algorithms is examined numerically in Section 6, where we find that SNIPE shows state-of-the-art performance. Technical proofs appear in Section 7 and in the appendices, with Appendix A (Toolbox) collecting some frequently used mathematical tools. Finally, Appendix L offers an alternative initialization for SNIPE.

2. SNIPE

In this section, we present Subspace Navigation via Interpolation from Partial Entries (SNIPE), a streaming algorithm for subspace identification from incomplete data received sequentially. Let us first introduce some additional notation. Recall that we denote the incoming sequence of incomplete vectors by $\{y_t\}_{t=1}^T \subset \mathbb{R}^n$, supported on the index sets $\{\omega_t\}_{t=1}^T \subseteq [1:n]$. For a block size $b \ge r$, we concatenate every b consecutive vectors into a data block, thereby partitioning the incoming data into $K = T/b$ non-overlapping blocks $\{Y_k\}_{k=1}^K$, where $Y_k \in \mathbb{R}^{n\times b}$ for every k. We assume for convenience that K is an integer. We also often take $b = O(r)$ to maximize the efficiency of SNIPE, as discussed below.

At a high level, SNIPE processes the first incomplete block $Y_1$ to produce an estimate $\hat{\mathcal{S}}_1$ of the true subspace $\mathcal{S}$. This estimate is then iteratively updated after receiving each of the new incomplete blocks $\{Y_k\}_{k=2}^K$, thereby producing a sequence of estimates $\{\hat{\mathcal{S}}_k\}_{k=2}^K$; see Figure 2. Every $\hat{\mathcal{S}}_k$ is an r-dimensional subspace of $\mathbb{R}^n$ with orthonormal basis $\hat{S}_k \in \mathbb{R}^{n\times r}$; the particular choice of orthonormal basis is inconsequential throughout the paper.
Figure 2: SNIPE concatenates every b incoming vectors into a block and iteratively updates its estimate of the true subspace $\mathcal{S}$ after receiving each new block. That is, SNIPE updates its estimate of $\mathcal{S}$ from $\hat{\mathcal{S}}_{k-1}$ to $\hat{\mathcal{S}}_k$ after receiving the incomplete data block $Y_k \in \mathbb{R}^{n\times b}$; see Section 2 for the details.

More concretely, SNIPE sets $\hat{\mathcal{S}}_1$ to be the span of the leading r left singular vectors of $Y_1$, namely the left singular vectors corresponding to the r largest singular values of $Y_1$, with ties broken arbitrarily. Then, at iteration $k \in [2:K]$ and given the previous estimate $\hat{\mathcal{S}}_{k-1} = \mathrm{span}(\hat{S}_{k-1})$, SNIPE processes the columns of the kth incomplete block $Y_k$ one by one and forms the matrix $R_k \in \mathbb{R}^{n\times b}$ with columns

$$y_t + P_{\omega_t^C}\, \hat{S}_{k-1} \left( \hat{S}_{k-1}^*\, P_{\omega_t}\, \hat{S}_{k-1} + \lambda I_r \right)^{\dagger} \hat{S}_{k-1}^*\, y_t, \qquad t \in [(k-1)b+1 : kb], \quad (1)$$

where $\dagger$ denotes the pseudo-inverse and $\lambda \ge 0$ is a parameter. Above, $P_{\omega_t^C} = I_n - P_{\omega_t} \in \mathbb{R}^{n\times n}$ projects a vector onto the complement of the index set $\omega_t$. The motivation for this particular choice of $R_k$ will become clear in Section 3. SNIPE then updates its estimate by setting $\hat{\mathcal{S}}_k$ to be the span of the leading r left singular vectors of $R_k$. Algorithm 1 summarizes these steps. Note that Algorithm 1 rejects ill-conditioned updates in Step 3 for the convenience of analysis and that similar reject options have precedent in the literature (Balzano and Wright, 2015). We found, however, that implementing this reject option was unnecessary in our numerical simulations.

Remark 1 [Computational complexity of SNIPE] We measure the per-iteration algorithmic complexity of SNIPE by the average number of floating-point operations (flops) performed on an incoming vector. Every iteration of SNIPE involves finding the leading r left singular vectors of an $n\times b$ matrix. Assuming that $b = O(r)$, this can be done with $O(nr^2)$ flops. At the kth iteration with $k \ge 2$, SNIPE also requires computing the pseudo-inverse of $P_{\omega_t}\hat{S}_{k-1} \in \mathbb{R}^{n\times r}$ for each incoming vector, which costs $O(nr^2)$ flops.
Therefore the overall computational complexity of SNIPE is $O(nr^2)$ flops per vector. As further discussed in Section 5, this matches the complexity of other algorithms for streaming PCA, even though here the received data is highly incomplete.

Remark 2 [Storage requirements of SNIPE] We measure the storage required by SNIPE by the number of memory elements stored at any given instant. At the kth iteration, SNIPE must store the current estimate $\hat{S}_{k-1} \in \mathbb{R}^{n\times r}$ (if available) and the new incomplete block $Y_k \in \mathbb{R}^{n\times b}$. Assuming that $b = O(r)$, this translates into $O(nr) + O(pnr) = O(nr)$ memory elements. SNIPE therefore requires $O(nr)$ bits of storage, which is optimal up to a constant factor.

Algorithm 1: SNIPE for streaming PCA from incomplete data

Input: Dimension r; received data $\{y_t\}_{t=1}^T \subset \mathbb{R}^n$ supported on index sets $\{\omega_t\}_{t=1}^T \subseteq [1:n]$, presented sequentially in $K = T/b$ blocks of size $b \ge r$; tuning parameter $\lambda \ge 0$; reject thresholds $\sigma_{\min}, \tau \ge 0$.

Output: r-dimensional subspace $\hat{\mathcal{S}}_K$.

Initialization: Form $Y_1 \in \mathbb{R}^{n\times b}$ by concatenating the first b received vectors $\{y_t\}_{t=1}^b$. Let $\hat{\mathcal{S}}_1$, with orthonormal basis $\hat{S}_1 \in \mathbb{R}^{n\times r}$, be the span of the leading r left singular vectors of $Y_1$, namely those corresponding to the r largest singular values, with ties broken arbitrarily.

For $k \in [2:K]$, repeat:

1. Set $R_k \leftarrow \{\}$.

2. For $t \in [(k-1)b+1 : kb]$, append to $R_k$ the column

$$y_t + P_{\omega_t^C}\, \hat{S}_{k-1} \left( \hat{S}_{k-1}^*\, P_{\omega_t}\, \hat{S}_{k-1} + \lambda I_r \right)^{\dagger} \hat{S}_{k-1}^*\, y_t,$$

where $P_{\omega_t} \in \mathbb{R}^{n\times n}$ equals one on the diagonal entries corresponding to the index set $\omega_t$ and is zero elsewhere. Likewise, $P_{\omega_t^C}$ projects a vector onto the complement of the index set $\omega_t$.

3. If $\sigma_r(R_k) < \sigma_{\min}$ or $\sigma_r(R_k) \le (1+\tau)\,\sigma_{r+1}(R_k)$, then set $\hat{\mathcal{S}}_k \leftarrow \hat{\mathcal{S}}_{k-1}$. Otherwise, let $\hat{\mathcal{S}}_k$, with orthonormal basis $\hat{S}_k \in \mathbb{R}^{n\times r}$, be the span of the leading r left singular vectors of $R_k$, with ties broken arbitrarily. Here, $\sigma_i(R_k)$ is the ith largest singular value of $R_k$.

Return $\hat{\mathcal{S}}_K$.
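As a concrete companion to Algorithm 1, here is a minimal NumPy sketch of SNIPE; the reject step (Step 3) is omitted for brevity, and the function and variable names are our own:

```python
import numpy as np

def snipe(Y, mask, r, b, lam=0.0):
    """Minimal sketch of SNIPE (Algorithm 1) without the reject step.

    Y    : n x T matrix of zero-filled incomplete observations,
    mask : n x T boolean matrix; column t encodes the index set omega_t,
    r    : target subspace dimension, b : block size (assumes b divides T).
    Returns an n x r orthonormal basis for the final estimate.
    """
    n, T = Y.shape
    # Initialization: span of the leading r left singular vectors of Y_1.
    U, _, _ = np.linalg.svd(Y[:, :b], full_matrices=False)
    S_hat = U[:, :r]
    for k in range(1, T // b):
        R = np.empty((n, b))
        for j in range(b):
            t = k * b + j
            w = mask[:, t]                              # observed indices omega_t
            # Coefficients (S'P_w S + lam I)^dagger S'y_t from Equation (1).
            G = S_hat[w, :].T @ S_hat[w, :] + lam * np.eye(r)
            c = np.linalg.pinv(G) @ (S_hat[w, :].T @ Y[w, t])
            col = S_hat @ c                             # interpolate on omega^C
            col[w] = Y[w, t]                            # keep observed entries
            R[:, j] = col
        U, _, _ = np.linalg.svd(R, full_matrices=False)
        S_hat = U[:, :r]                                # leading r left sing. vecs
    return S_hat
```

With full observations (mask all True) and data of exact rank r, each $R_k$ equals the true block, so the estimate matches the true subspace; with partial observations, the estimate is refined block by block.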
3. Interpretation of SNIPE

SNIPE has a natural interpretation as a streaming algorithm for low-rank matrix completion, which we now discuss. First let us enrich our notation. Recall the incomplete data blocks $\{Y_k\}_{k=1}^K \subset \mathbb{R}^{n\times b}$ and let the random index set $\Omega_k \subseteq [1:n]\times[1:b]$ be the support of $Y_k$ for every k. We write $Y_k = P_{\Omega_k}(S_k)$ for every k, where the complete (but hidden) data block $S_k \in \mathbb{R}^{n\times b}$ is formed by concatenating $\{s_t\}_{t=(k-1)b+1}^{kb}$. Here, $P_{\Omega_k}(S_k)$ retains only the entries of $S_k$ on the index set $\Omega_k$, setting the rest to zero. By design, $s_t = Sq_t$ for every t, and we may therefore write $S_k = S Q_k$ for the coefficient matrix $Q_k \in \mathbb{R}^{r\times b}$ formed by concatenating $\{q_t\}_{t=(k-1)b+1}^{kb}$. To summarize, $\{S_k\}_{k=1}^K$ is formed by partitioning $\{s_t\}_{t=1}^T$ into K blocks. Likewise, $\{Q_k, Y_k, \Omega_k\}_{k=1}^K$ are formed by partitioning $\{q_t, y_t, \omega_t\}_{t=1}^T$, respectively.

With this introduction, let us form $Y \in \mathbb{R}^{n\times T}$ by concatenating the incomplete blocks $\{Y_k\}_{k=1}^K \subset \mathbb{R}^{n\times b}$, supported on the index set $\Omega \subseteq [1:n]\times[1:T]$. To find the true subspace $\mathcal{S}$, one might consider solving

$$\min_{X,\,\mathcal{U}} \ \| P_{\mathcal{U}^\perp} X \|_F^2 + \lambda \| P_{\Omega^C}(X) \|_F^2 \quad \text{subject to} \quad P_{\Omega}(X) = Y, \quad (2)$$

where the minimization is over a matrix $X \in \mathbb{R}^{n\times T}$ and an r-dimensional subspace $\mathcal{U} \subset \mathbb{R}^n$. Above, $P_{\mathcal{U}^\perp} \in \mathbb{R}^{n\times n}$ is the orthogonal projection onto the orthogonal complement of the subspace $\mathcal{U}$, and $P_{\Omega}(X)$ retains only the entries of X on the index set $\Omega$, setting the rest to zero. Note that Program (2) encourages its solution(s) to be low-rank while matching the observations Y on the index set $\Omega$. The term $\lambda \| P_{\Omega^C}(X) \|_F^2$ for $\lambda \ge 0$ is a Tikhonov regularizer that controls the energy of the solution(s) on the complement of the index set $\Omega$. With complete data, namely when $\Omega = [1:n]\times[1:T]$, Program (2) reduces to PCA, as it returns $X = Y$ and searches for an r-dimensional subspace that captures most of the energy of Y.
That is, Program (2) reduces to $\min_{\mathcal{U}} \| P_{\mathcal{U}^\perp} Y \|_F^2$ when $\Omega = [1:n]\times[1:T]$, the solution of which is the span of the leading r left singular vectors of Y in light of the Eckart-Young-Mirsky Theorem (Eckart and Young, 1936; Mirsky, 1966). In this sense, then, Program (2) performs PCA from incomplete data. Note crucially that Program (2) is a nonconvex problem because the Grassmannian $G(n, r)$, the set of all r-dimensional subspaces of $\mathbb{R}^n$, is a nonconvex set.¹ However, given a fixed subspace $\mathcal{U} \in G(n, r)$, Program (2) reduces to the simple least-squares program

$$\min_{X} \ \| P_{\mathcal{U}^\perp} X \|_F^2 + \lambda \| P_{\Omega^C}(X) \|_F^2 \quad \text{subject to} \quad P_{\Omega}(X) = Y, \quad (3)$$

where the minimization is over $X \in \mathbb{R}^{n\times T}$. If in addition $\lambda$ is positive, then Program (3) is strongly convex and has a unique minimizer. Given a fixed feasible $X \in \mathbb{R}^{n\times T}$, Program (2) has the same minimizers as

$$\min_{\mathcal{U} \in G(n,r)} \ \| P_{\mathcal{U}^\perp} X \|_F^2. \quad (4)$$

1. The Grassmannian can be embedded in $\mathbb{R}^{n\times n}$ via the map that takes $\mathcal{U} \in G(n, r)$ to the corresponding orthogonal projection $P_{\mathcal{U}} \in \mathbb{R}^{n\times n}$. The resulting submanifold of $\mathbb{R}^{n\times n}$ is a nonconvex set.

That is, for a fixed feasible X, Program (2) simply performs PCA on X. We might also view Program (2) from a matrix completion perspective. More specifically, let

$$\rho_r^2(X) = \sum_{i \ge r+1} \sigma_i^2(X) \quad (5)$$

be the residual of X, namely the energy of its trailing singular values $\sigma_{r+1}(X) \ge \sigma_{r+2}(X) \ge \cdots$. Like the popular nuclear norm $\|X\|_* = \sum_{i \ge 1} \sigma_i(X)$ in (Davenport and Romberg, 2016), the residual $\rho_r(X)$ gauges the rank of X. In particular, $\rho_r(X) = 0$ if and only if $\mathrm{rank}(X) \le r$. Unlike the nuclear norm, however, the residual is a nonconvex function of X. We now rewrite Program (2) as

$$\min_{X,\,\mathcal{U}} \left\{ \| P_{\mathcal{U}^\perp} X \|_F^2 + \lambda \| P_{\Omega^C}(X) \|_F^2 : P_{\Omega}(X) = Y \right\} = \min_{X} \left\{ \min_{\mathcal{U} \in G(n,r)} \| P_{\mathcal{U}^\perp} X \|_F^2 + \lambda \| P_{\Omega^C}(X) \|_F^2 : P_{\Omega}(X) = Y \right\} = \min_{X} \left\{ \rho_r^2(X) + \lambda \| P_{\Omega^C}(X) \|_F^2 : P_{\Omega}(X) = Y \right\}. \quad (6)$$

That is, if we ignore the regularization term $\lambda \| P_{\Omega^C}(X) \|_F^2$, Program (2) searches for a matrix with the least residual, as a proxy for least rank, that matches the observations Y.
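The residual in (5) depends only on the singular values, so it is straightforward to evaluate numerically; a small NumPy sketch (function name ours):

```python
import numpy as np

def residual(X, r):
    """rho_r(X): square root of the energy of the trailing singular
    values sigma_{r+1}(X), sigma_{r+2}(X), ..., as in (5)."""
    s = np.linalg.svd(X, compute_uv=False)
    return np.sqrt(np.sum(s[r:] ** 2))
```

In particular, `residual(X, r)` vanishes exactly when $\mathrm{rank}(X) \le r$, which is why it serves as a (nonconvex) proxy for rank.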
In this sense, then, Program (2) is a relaxation of the low-rank matrix completion problem. Several other formulations of the matrix completion problem are reviewed in (Davenport and Romberg, 2016; Eftekhari et al., 2018b,a). We can also rewrite Program (2) in terms of its data blocks by considering the equivalent program

$$\min \sum_{k=1}^K \| P_{\mathcal{U}^\perp} X_k \|_F^2 + \lambda \| P_{\Omega_k^C}(X_k) \|_F^2 \quad \text{subject to} \quad P_{\Omega_k}(X_k) = Y_k, \ k \in [1:K], \quad (7)$$

where the minimization is over matrices $\{X_k\}_{k=1}^K \subset \mathbb{R}^{n\times b}$ and a subspace $\mathcal{U} \in G(n, r)$. Let us additionally introduce a number of auxiliary variables into Program (7) by considering the equivalent program

$$\min \sum_{k=1}^K \| P_{\mathcal{U}_k^\perp} X_k \|_F^2 + \lambda \| P_{\Omega_k^C}(X_k) \|_F^2 \quad \text{subject to} \quad P_{\Omega_k}(X_k) = Y_k, \ k \in [1:K], \quad \mathcal{U}_1 = \mathcal{U}_2 = \cdots = \mathcal{U}_K, \quad (8)$$

where the minimization is over matrices $\{X_k\}_{k=1}^K \subset \mathbb{R}^{n\times b}$ and subspaces $\{\mathcal{U}_k\}_{k=1}^K \subset G(n, r)$. Indeed, Programs (2), (7), and (8) are all equivalent and all nonconvex. Now consider the following approximate solver for Program (8) that alternately solves for matrices and subspaces. Setting $X_1 = Y_1$ in Program (8), we minimize $\| P_{\mathcal{U}_1^\perp} Y_1 \|_F^2$ over $\mathcal{U}_1 \in G(n, r)$ and, by the Eckart-Young-Mirsky Theorem, find a minimizer to be the span of the leading r left singular vectors of $Y_1$, which coincides with $\hat{\mathcal{S}}_1$ in SNIPE. For $k \in [2:K]$, repeat:

Setting $\mathcal{U}_k = \hat{\mathcal{S}}_{k-1}$ in Program (8), we solve

$$\min_{X_k} \ \| P_{\hat{\mathcal{S}}_{k-1}^\perp} X_k \|_F^2 + \lambda \| P_{\Omega_k^C}(X_k) \|_F^2 \quad \text{subject to} \quad P_{\Omega_k}(X_k) = Y_k, \quad (9)$$

over the matrix $X_k \in \mathbb{R}^{n\times b}$. We verify in Appendix B that the minimizer of Program (9) coincides with $R_k$ in SNIPE; see (1). If $\sigma_r(R_k) < \sigma_{\min}$ or $\sigma_r(R_k) \le (1+\tau)\,\sigma_{r+1}(R_k)$, then no update is made, namely we set $\hat{\mathcal{S}}_k = \hat{\mathcal{S}}_{k-1}$. Otherwise, setting $X_k = R_k$ in Program (8), we solve $\min \| P_{\mathcal{U}_k^\perp} R_k \|_F^2$ over $\mathcal{U}_k \in G(n, r)$ to find $\hat{\mathcal{S}}_k$. That is, by the Eckart-Young-Mirsky Theorem again, $\hat{\mathcal{S}}_k$ is the span of the leading r left singular vectors of $R_k$. The output of this step matches $\hat{\mathcal{S}}_k$ produced in SNIPE. To summarize, following the above procedure produces $\{R_k\}_{k=1}^K$ and $\{\hat{\mathcal{S}}_k\}_{k=1}^K$ in SNIPE.
In other words, we might think of SNIPE as an approximate solver for Program (2); namely, SNIPE is a streaming algorithm for low-rank matrix completion. In fact, the output of SNIPE always converges to a stationary point of Program (2) in the sense described in Section 4.

Another insight into the choice of $R_k$ in (1) is as follows. Let us set $\lambda = 0$ for simplicity. At the beginning of the kth iteration of SNIPE with $k \ge 2$, the available estimate of the true subspace is $\hat{\mathcal{S}}_{k-1}$ with orthonormal basis $\hat{S}_{k-1}$. Given a new incomplete vector $y \in \mathbb{R}^n$ supported on the index set $\omega \subseteq [1:n]$, the vector $z = \hat{S}_{k-1} (P_{\omega} \hat{S}_{k-1})^{\dagger} y$ best approximates y in $\hat{\mathcal{S}}_{k-1}$ in the $\ell_2$ sense. In order to agree with the measurements, we minimally adjust this to $y + P_{\omega^C} z$, where $P_{\omega^C}$ projects onto the complement of the index set $\omega$. This indeed matches the expression for the columns of $R_k$ in SNIPE. We note that this type of least-change strategy has been used successfully in the development of quasi-Newton methods for optimization (Nocedal and Wright, 2006, Chapter 6).

4. Performance of SNIPE

To measure the performance of SNIPE, whose output is a subspace, we naturally use principal angles as an error metric. More specifically, recall that $\mathcal{S}$ and $\hat{\mathcal{S}}_K$ denote the true subspace and the output of SNIPE, respectively. Then the ith largest singular value of $P_{\mathcal{S}^\perp} P_{\hat{\mathcal{S}}_K}$ equals $\sin(\theta_i(\mathcal{S}, \hat{\mathcal{S}}_K))$, where $\theta_1(\mathcal{S}, \hat{\mathcal{S}}_K) \ge \theta_2(\mathcal{S}, \hat{\mathcal{S}}_K) \ge \cdots \ge \theta_r(\mathcal{S}, \hat{\mathcal{S}}_K)$ denote the principal angles between the two r-dimensional subspaces $\mathcal{S}$ and $\hat{\mathcal{S}}_K$ (Golub and Van Loan, 2013). The estimation error of SNIPE is then

$$d_G(\mathcal{S}, \hat{\mathcal{S}}_K) := \sqrt{ \sum_{i=1}^{r} \sin^2 \theta_i(\mathcal{S}, \hat{\mathcal{S}}_K) } = \| P_{\mathcal{S}^\perp} P_{\hat{\mathcal{S}}_K} \|_F \le \sqrt{r}, \quad (10)$$

which also naturally induces a metric topology on the Grassmannian $G(n, r)$. Note also that we will always reserve calligraphic letters for subspaces and capital letters for their orthonormal bases; for example, subspace $\mathcal{S}$ and its orthonormal basis S.
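The metric (10) is easy to evaluate from two orthonormal bases, since the cosines of the principal angles are the singular values of $S_1^* S_2$; a small NumPy sketch (function name ours):

```python
import numpy as np

def dist_grassmann(S1, S2):
    """d_G of (10): l2 norm of the sines of the principal angles
    between the subspaces spanned by the n x r orthonormal bases S1, S2."""
    cos_theta = np.clip(np.linalg.svd(S1.T @ S2, compute_uv=False), 0.0, 1.0)
    return np.linalg.norm(np.sqrt(1.0 - cos_theta ** 2))
```

As a sanity check, the distance vanishes for identical subspaces and attains its maximum $\sqrt{r}$ for orthogonal ones.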
Our first result, loosely speaking, states that a subsequence of SNIPE iterates converges to a stationary point of the nonconvex Program (2) as T goes to infinity; see Section 7.1 for the proof.

Theorem 3 [Global convergence] Consider an r-dimensional subspace $\mathcal{S}$ with orthonormal basis $S \in \mathbb{R}^{n\times r}$. For an integer T, let the coefficient vectors $\{q_t\}_{t=1}^T \subset \mathbb{R}^r$ be independent copies of a random vector $q \in \mathbb{R}^r$ with bounded expectation, namely $\mathbb{E}\|q\|_2 < \infty$. For every $t \in [1:T]$, we observe each coordinate of $s_t = Sq_t \in \mathcal{S}$ independently with probability p and collect the observations in $y_t \in \mathbb{R}^n$, supported on a random index set $\omega_t \subseteq [1:n]$. Fix positive $\lambda$, block size $b \ge r$, and positive reject thresholds $\sigma_{\min}$ and $\tau$, and consider the output sequence of SNIPE in Algorithm 1, namely $\{(R_k, \hat{\mathcal{S}}_k)\}_k$. Also, by partitioning $\{q_t, s_t, y_t, \omega_t\}_t$, form the coefficient blocks $\{Q_k\}_k \subset \mathbb{R}^{r\times b}$, data blocks $\{S_k\}_k \subset \mathbb{R}^{n\times b}$, and incomplete data blocks $\{Y_k\}_k \subset \mathbb{R}^{n\times b}$ supported on the index sets $\{\Omega_k\}_k \subseteq [1:n]\times[1:b]$, as described in Section 3. For every integer l, there exists an integer $k_l$ for which the following asymptotic statement is almost surely true as $T \to \infty$. Consider the restriction of Program (2) to iteration $k_l$, namely

$$\min_{X,\,\mathcal{U}} \ \| P_{\mathcal{U}^\perp} X \|_F^2 + \lambda \| P_{\Omega_{k_l}^C}(X) \|_F^2 \quad \text{subject to} \quad P_{\Omega_{k_l}}(X) = Y_{k_l}, \quad (11)$$

where the minimization is over a matrix $X \in \mathbb{R}^{n\times b}$ and an r-dimensional subspace $\mathcal{U}$. Then there exist $R \in \mathbb{R}^{n\times b}$ and an r-dimensional subspace $\hat{\mathcal{S}}$ such that

- $\hat{\mathcal{S}}$ is the span of the leading r left singular vectors of R,
- $(R, \hat{\mathcal{S}})$ is a stationary pair of Program (11), namely it satisfies the first-order optimality conditions of Program (11),
- $\lim_{l\to\infty} \|R_{k_l} - R\|_F = 0$ and $\lim_{l\to\infty} d_G(\hat{\mathcal{S}}_{k_l}, \hat{\mathcal{S}}) = 0$.

Remark 4 [Discussion of Theorem 3] Theorem 3, roughly speaking, states that there is a subsequence of SNIPE iterates that converges to a stationary point of Program (2), which is the program designed in Section 3 for PCA from incomplete data or, from a different perspective, for low-rank matrix completion.
Theorem 3 is, however, silent about the nature of this stationary point: whether it is a local or global minimizer or maximizer, or a saddle point. To some extent, this question is addressed below in Proposition 6. More generally, we have been able to show that this stationary point is in fact rank-r. When the block size of SNIPE is sufficiently large, namely when $b = \Omega(n)$, we can further establish that the limit point of SNIPE is indeed a global minimizer of Program (2)² and, moreover, that SNIPE recovers the true subspace, namely $\lim_{l} d_G(\hat{\mathcal{S}}_{k_l}, \mathcal{S}) = 0$, with high probability and under certain standard conditions on the coherence of the true subspace $\mathcal{S}$ and the sampling probability p. We have not included these results here because SNIPE is intended as a streaming algorithm and we are therefore more interested in the setting where $b = O(r)$; see Remarks 1 and 2 about the implementation of SNIPE. It is not currently clear to us when SNIPE converges in general but, as suggested by Proposition 6 below, if SNIPE converges, it does indeed converge to the true subspace $\mathcal{S}$.

2. Another possible error metric is simply the largest principal angle $\theta_1(\mathcal{S}, \hat{\mathcal{S}}_K)$. The two metrics are very closely related: $\sin\theta_1(\mathcal{S}, \hat{\mathcal{S}}_K) \le d_G(\mathcal{S}, \hat{\mathcal{S}}_K) \le \sqrt{r}\,\sin\theta_1(\mathcal{S}, \hat{\mathcal{S}}_K)$. However, we find that $\theta_1(\mathcal{S}, \hat{\mathcal{S}}_K)$ is not amenable to the analysis of our problem, as opposed to $d_G(\mathcal{S}, \hat{\mathcal{S}}_K)$.

Remark 5 [Technical point about Theorem 3] Note that Theorem 3 is proved for positive (but possibly arbitrarily small) $\sigma_{\min}$ and $\tau$. In particular, an update $(R_k, \hat{\mathcal{S}}_k)$ is rejected in Algorithm 1 if

$$\frac{\sigma_r(R_k)}{\sigma_{r+1}(R_k)} \le 1 + \tau, \quad (12)$$

for an (otherwise arbitrary) positive $\tau$ and whenever the ratio is well-defined. Here, $\sigma_i(R_k)$ is the ith largest singular value of $R_k$. This is merely a technical nuance, limited to the rth singular value gap, that helps prevent the output subspace $\hat{\mathcal{S}}_k$ from oscillating in the limit.
Likewise, Theorem 3 does not address the case $\lambda = 0$ in Program (11), even though $\lambda$ can be made arbitrarily small in Theorem 3. This is again for technical convenience and, in fact, the numerical simulations in Section 6 are all conducted with $\lambda = 0$.

Our second result establishes that, if SNIPE converges, then it converges to the true subspace $\mathcal{S}$; see Section 7.2 for the proof.

Proposition 6 [Convergence] Consider the setup in the first paragraph of Theorem 3. Suppose that r independent copies of the random coefficient vector $q \in \mathbb{R}^r$ almost surely form a basis for $\mathbb{R}^r$.³ Suppose also that the output of SNIPE converges to an r-dimensional subspace $\hat{\mathcal{S}}$, namely

$$\lim_{k\to\infty} d_G(\hat{\mathcal{S}}_k, \hat{\mathcal{S}}) = 0. \quad (13)$$

Then, almost surely, it must hold that $\hat{\mathcal{S}} = \mathcal{S}$.

Remark 7 [Discussion of Proposition 6] Proposition 6 does not specify the conditions under which SNIPE converges. Indeed, if the sampling probability p is too small, namely if very few of the entries of the incoming vectors are observed, then SNIPE might not converge at all, as the numerical evidence suggests; see also Remark 4. However, if SNIPE converges, then it converges to the true subspace $\mathcal{S}$. The local rate of convergence is specified below.

The concept of coherence is critical in specifying the local convergence rate, since we consider entrywise subsampling. The coherence of an r-dimensional subspace $\mathcal{S}$ with orthonormal basis $S \in \mathbb{R}^{n\times r}$ is defined as

$$\eta(\mathcal{S}) := \frac{n}{r} \max_{i \in [1:n]} \| S[i,:] \|_2^2, \quad (14)$$

where $S[i,:]$ is the ith row of S. It is easy to verify that $\eta(\mathcal{S})$ is independent of the choice of orthonormal basis S and that

$$1 \le \eta(\mathcal{S}) \le \frac{n}{r}. \quad (15)$$

It is also common to say that $\mathcal{S}$ is coherent (respectively, incoherent) when $\eta(\mathcal{S})$ is large (respectively, small). Loosely speaking, when $\mathcal{S}$ is coherent, its orthonormal basis S is spiky. An example is when $\mathcal{S}$ is the span of a column-subset of the identity matrix.

3. For example, this requirement is met if the entries of q are independent Gaussian random variables with zero mean and unit variance.
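Both extremes of the coherence (14) are easy to check numerically. In the sketch below (NumPy, example bases ours), a column-subset of the identity attains the maximal value n/r, while a flat, Hadamard-like basis attains the minimal value 1:

```python
import numpy as np

def coherence(S):
    """eta of (14) for an n x r orthonormal basis S:
    (n / r) times the largest squared row norm of S."""
    n, r = S.shape
    return (n / r) * np.max(np.sum(S ** 2, axis=1))

spiky = np.eye(6)[:, :2]                     # identity columns: eta = n/r = 3
flat = np.array([[1, 1], [1, -1],
                 [1, 1], [1, -1]]) / 2.0     # equal row norms: eta = 1
```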
In contrast, when $\mathcal{S}$ is incoherent, the entries of S tend to be diffuse. Not surprisingly, identifying a coherent subspace from subsampled data may require many more samples (Balzano and Wright, 2015; Mitliagkas et al., 2014; Chen, 2015). We will also use $\lesssim$ and $\gtrsim$ below to suppress (most of) the universal constants. Moreover, throughout, C represents a universal constant, the value of which is subject to change in every appearance.

Our next results specify the local convergence rate of SNIPE. Indeed, the convergence speed near the true subspace $\mathcal{S}$ is linear, as detailed in Theorems 8 and 10 and proved in Section 7.3. In particular, Theorem 8 states that, when the estimation error is sufficiently small, the expected estimation error of SNIPE reduces by a factor of $1 - p/32$ in every iteration.

Theorem 8 [Locally linear convergence of SNIPE in expectation] Consider the setup in the first paragraph of Theorem 3. Fix a positive tuning parameter $\alpha$, an iteration $k \in [2:K]$, and let $E_{k-1}$ be the event where conditions (16), (17), and (18) hold. Loosely speaking, and up to universal constants and logarithmic factors in n, b, and $1/p$: condition (16) requires the sampling probability p to be neither too small nor too large; condition (17) requires the estimation error to be small, namely $d_G(\mathcal{S}, \hat{\mathcal{S}}_{k-1}) \lesssim p^{3/2}$; and condition (18) is a technical requirement involving the reject threshold $\sigma_{\min}$ of SNIPE. Let also $A_k$ be the event where the kth iteration of SNIPE is not rejected (see Step 3 of Algorithm 1) and let $1_{A_k}$ be the indicator for this event, taking the value one if the event happens and zero otherwise. Then it holds that

$$\mathbb{E}\left[ 1_{A_k}\, d_G(\mathcal{S}, \hat{\mathcal{S}}_k) \,\middle|\, \hat{\mathcal{S}}_{k-1},\, E_{k-1} \right] \le \left(1 - \frac{p}{32}\right) d_G(\mathcal{S}, \hat{\mathcal{S}}_{k-1}). \quad (19)$$

Remark 9 [Discussion of Theorem 8] When the sampling probability p is large enough and SNIPE is near the true subspace $\mathcal{S}$, Theorem 8 states that the expected estimation error of SNIPE reduces by a factor of $1 - p/32$ if the iterate of SNIPE is not rejected. Note that:

1. The lower bound on the sampling probability p in (16) matches the one in the low-rank matrix completion literature up to a logarithmic factor (Davenport and Romberg, 2016). Indeed, SNIPE can be interpreted as a streaming matrix completion algorithm, as discussed in Section 3.
The upper bound on p in (16) is merely for technical convenience and gives a tidier presentation in the most interesting regime for p. Indeed, since we often take $b = O(r)$ with $r \ll n$, one might loosely read (16) as a two-sided bound on p involving n, r, and $\eta(\mathcal{S})$, in which the upper bound hardly poses a restriction even for moderately large data dimension n, as it merely forces $r = O(n^{1/3})$.

2. Ignoring the logarithmic factors for simplicity, we may read (17) as $d_G(\mathcal{S}, \hat{\mathcal{S}}_{k-1}) \lesssim p^{3/2}$, which activates (19). In other words, the basin of attraction of the true subspace $\mathcal{S}$, as a (possibly local) minimizer of the (nonconvex) Program (2), has a radius of $O(p^{3/2})$.

3. The indicator $1_{A_k}$ in (19) removes the rejected iterates; similar conditions exist implicitly in the analysis of other streaming PCA algorithms (Balzano and Wright, 2015).

Note that Theorem 8 cannot, by itself, determine the local convergence rate of SNIPE, even in expectation. Indeed, the expected reduction in the estimation error of SNIPE, specified in (19), is not enough to activate (17) for the next iteration (namely, with k in place of k-1). That is, we cannot apply Theorem 8 iteratively to find the expected convergence rate of SNIPE. A key technical contribution of this paper is specifying the local behaviour of SNIPE, as follows. With high probability, the estimation error does not increase by much in any iteration near the true subspace; however, only in some of these iterations does the error reduce. Overall, over a long enough interval, the estimation error of SNIPE near the true subspace indeed reduces substantially and with high probability, as detailed in Theorem 10 and proved in Section 7.3. Performance guarantees for stochastic algorithms over long intervals are not uncommon; see, for example, (Cartis and Scheinberg, 2017).

Theorem 10 [Locally linear convergence of SNIPE] Consider the setup in the first paragraph of Theorem 3.
Suppose that the output $\hat{\mathcal{S}}_{K_0}$ of SNIPE at iteration $K_0 \in [2:K]$ is sufficiently close to the true subspace, in the sense that $d_G(\mathcal{S}, \hat{\mathcal{S}}_{K_0})$ is at most a quantity of order $p^{\frac{7}{2}} nb \,/\, (\tilde{\eta} \log b \log(C\tilde{\eta} n) \log^2(K-K_0))$, condition (21), that a mild technical condition involving the reject threshold $\sigma_{\min}$ of SNIPE holds, and that the interval is long enough, namely

$$K - K_0 \gtrsim \frac{\tilde{\eta} \log b \log(K - K_0)}{p^2 nb}. \quad (22)$$

Above, C is a universal constant that might change in every appearance. Then it holds that

$$\prod_{k=K_0+1}^{K} 1_{A_k} \cdot d_G(\mathcal{S}, \hat{\mathcal{S}}_K) \le \left( 1 - \frac{C p^3 nb}{\tilde{\eta} \log b \log(K-K_0)} \right)^{K-K_0} d_G(\mathcal{S}, \hat{\mathcal{S}}_{K_0}), \quad (23)$$

except with a probability of at most

$$b^{-C\log(K-K_0)} + \sum_{k=K_0+1}^{K} \Pr\left[ \|Q_k\| > \left(1 + \frac{C p^3 nb}{\tilde{\eta} \log b \log(K-K_0)}\right) \sqrt{b} \right], \quad (24)$$

and provided that the sampling probability satisfies, up to logarithmic factors in n and b,

$$p \gtrsim \frac{\eta(\mathcal{S})\, r}{nb}. \quad (25)$$

Above,

$$\tilde{\eta} = \max_{k \in [K_0:K]} \tilde{\eta}_k, \quad (26)$$

$$\tilde{\eta}_k := \frac{nb\, \| P_{\hat{\mathcal{S}}_{k-1}^\perp} S_k \|_{\max}^2}{\| P_{\hat{\mathcal{S}}_{k-1}^\perp} S_k \|_F^2}, \quad (27)$$

where $\|\cdot\|_{\max}$ denotes the largest entry in magnitude.

Remark 11 [Discussion of Theorem 10] Loosely speaking, Theorem 10 states that, with high probability, the estimation error of SNIPE over $O(\tilde{\eta} n)$ iterations reduces linearly (that is, exponentially fast) when SNIPE is near the true subspace and the sampling probability p is large enough. Most of the remarks about Theorem 8 are also valid here. Let us also point out that the dependence on the coefficient matrices $Q_k$ in (24) is mild but necessary. As an example, consider the case where the coefficient vectors $\{q_t\}_t$ are standard Gaussian random vectors, so that the coefficient matrices $\{Q_k\}_k$ are standard Gaussian random matrices, namely populated with independent zero-mean, unit-variance Gaussian random variables. Then standard bounds on the spectral norm of a Gaussian matrix yield

$$\Pr\left[ \|Q_k\| > \left(1 + \frac{C p^3 nb}{\tilde{\eta} \log b \log(K-K_0)}\right) \sqrt{b} \right] \le e^{-Cb}, \quad (28)$$

and consequently the failure probability in (24) becomes $b^{-C\log(K-K_0)} + (K-K_0)\, e^{-Cb}$, which can be made arbitrarily small by modestly increasing the block size b. For the reader's convenience, Appendix K collects the relevant spectral properties of a standard Gaussian random matrix. The dependence on $Q_k$ in Theorem 10 is not an artifact of our proof techniques. Indeed, when $\|Q_k\| \gg 1$, it is likely that certain directions in $\mathcal{S}$ are over-represented, which will skew the estimate of SNIPE in their favor.
Remark 12 [Coherence factor $\widetilde{\eta}$] A key quantity in Theorem 10 is the new coherence factor $\widetilde{\eta}$, which is absent from the expected behavior of SNIPE in Theorem 8. Somewhat similarly to the coherence $\eta(\cdot)$ in (14), $\widetilde{\eta}_k$ measures how spiky the matrix $P_{\widehat{S}_{k-1}^{\perp}} S_k \in \mathbb{R}^{n\times b}$ is. In fact, one may easily verify that
$$\widetilde{\eta}_k \le \mathrm{rank}\left(P_{\widehat{S}_{k-1}^{\perp}} S_k\right) \nu\left(P_{\widehat{S}_{k-1}^{\perp}} S_k\right)^2 \eta\left(\mathrm{span}(P_{\widehat{S}_{k-1}^{\perp}} S_k)\right) \eta\left(\mathrm{span}(S_k^{\top} P_{\widehat{S}_{k-1}^{\perp}})\right), \qquad (29)$$
where $\nu(P_{\widehat{S}_{k-1}^{\perp}} S_k)$ is the condition number of $P_{\widehat{S}_{k-1}^{\perp}} S_k$, namely, the ratio of its largest to smallest nonzero singular values. The number of iterations needed to see a reduction in the estimation error in (22) and the convergence rate of SNIPE in (23) both prefer $\widetilde{\eta}$ to be small, namely, they prefer that the matrices $\{P_{\widehat{S}_{k-1}^{\perp}} S_k\}_k$ all be incoherent as measured by $\widetilde{\eta}_k$. When SNIPE is close enough to the true subspace $S$, as required in (21), one would expect the iterates of SNIPE to be nearly as coherent as $S$ itself, in the sense that $\eta(\widehat{S}_k) \approx \eta(S)$. This intuition is indeed correct and is also utilized in our analysis. However, even when $d_G(S,\widehat{S}_{k-1})$ is small and $\eta(\widehat{S}_{k-1}), \eta(S)$ are both small, $\widetilde{\eta}_k$ might be very large; that is, $P_{\widehat{S}_{k-1}^{\perp}} S_k$ might be a spiky matrix. Indeed, when $b = r$, $(\widehat{S}_{k-1}^{\perp})^{\top} S_k$ is approximately the (horizontal) tangent at $\widehat{S}_{k-1}$ to the geodesic on the Grassmannian $G(n,r)$ that connects $\widehat{S}_{k-1}$ to $S$. Even though both $\widehat{S}_{k-1}$ and $S$ are incoherent subspaces, namely, $\eta(\widehat{S}_{k-1})$ and $\eta(S)$ are both small, the tangent direction connecting the two is not necessarily incoherent. Despite the dependence of our results on $\widetilde{\eta}$, it is entirely possible that SNIPE, with high probability, approaches the true subspace $S$ from an incoherent tangent direction, of which there are many; such a result has remained beyond our reach. In fact, similar questions arise in matrix completion. Iterative Hard Thresholding (IHT) is a powerful algorithm for matrix completion with excellent empirical performance (Tanner and Wei, 2013), whose convergence rate has remained unknown, to the best of our knowledge.
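To make the notion concrete, the spikiness ratio in (27) can be computed directly. The sketch below (using `numpy`, and assuming $\|\cdot\|_\infty$ denotes the largest entry in magnitude) evaluates the factor at its two extremes:

```python
import numpy as np

def eta_tilde(M):
    """Spikiness factor: (n*b) * ||M||_inf^2 / ||M||_F^2, where ||M||_inf
    is the largest entry in magnitude. Equals 1 for a perfectly flat
    matrix and n*b for a matrix supported on a single entry."""
    n, b = M.shape
    return n * b * np.max(np.abs(M)) ** 2 / np.linalg.norm(M, "fro") ** 2

flat = np.full((6, 4), 0.5)   # all entries equal: maximally incoherent
spike = np.zeros((6, 4))
spike[0, 0] = 3.0             # all energy on one entry: maximally spiky
```

On these two examples, `eta_tilde(flat)` is 1 and `eta_tilde(spike)` is $nb = 24$, the endpoints of the admissible range.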
With $\{M_k\}_k$ denoting the iterates of IHT, it is not difficult to see that if the differences $\{M_k - M_{k-1}\}_k$ are incoherent matrices (i.e., not spiky), then a linear convergence rate for IHT would follow from rather standard arguments.

5. Related Work

In this paper, we presented SNIPE for streaming PCA from incomplete data; from a different perspective, SNIPE might be considered a streaming matrix completion algorithm, see Section 3. In other words, SNIPE is a subspace tracking algorithm that identifies the linear structure of data as it arrives. Note also that $t$ in our framework need not correspond to time, see Figure 1. For example, when only a small portion of a large data matrix $Y$ fits in the fast-access memory of the processing unit, SNIPE can be used to fetch and process the data in small chunks and iteratively update the principal components. Moreover, SNIPE can easily be adapted to the dynamic case, where the distribution of the data changes over time. In dynamic subspace tracking, for example, the (hidden) data vector $s_t$ is drawn from a subspace $S(t) \in G(n,r)$ that varies with time. Likewise, it is easy to slightly modify SNIPE to handle noisy observations, or equivalently the case where $\{s_t\}_t$ are generated from a distribution with a possibly full-rank covariance matrix. We leave both of these directions to future work.

Among the several algorithms that have been proposed for tracking low-dimensional structure in a dataset from partially observed streaming data (Mitliagkas et al., 2014; Chi et al., 2013; Mardani et al., 2015; Xie et al., 2013; Eftekhari et al., 2017), SNIPE is perhaps most closely related to GROUSE (Balzano et al., 2010; Balzano and Wright, 2013). GROUSE performs streaming PCA from incomplete data using stochastic gradient projection on the Grassmannian, updating its estimate of the true subspace with each new incomplete vector.
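Returning to the IHT remark above: here is a minimal IHT sketch for matrix completion (the generic SVP-style scheme with a unit step size, not the specific implementation of Tanner and Wei; the test instance is ours). One could monitor the spikiness of the differences $M_k - M_{k-1}$ along such a trajectory:

```python
import numpy as np

def iht(Y, mask, r, iters=300):
    """Generic Iterative Hard Thresholding for matrix completion: take a
    unit-step gradient move toward the observed entries, then hard-threshold
    back to the rank-r manifold via a truncated SVD."""
    M = np.zeros_like(Y)
    for _ in range(iters):
        G = M + mask * (Y - M)              # gradient step on observed entries
        U, s, Vt = np.linalg.svd(G, full_matrices=False)
        M = (U[:, :r] * s[:r]) @ Vt[:r]     # best rank-r approximation of G
    return M

rng = np.random.default_rng(0)
X = rng.standard_normal((20, 1)) @ rng.standard_normal((1, 20))  # rank-1 truth
mask = rng.random((20, 20)) < 0.7                                # 70% observed
M = iht(X * mask, mask, r=1)
```

With this generous sampling rate the iterates recover the low-rank matrix to high accuracy.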
Both GROUSE and SNIPE were designed based on the principle of least change, discussed in Section 3. In fact, when GROUSE is sufficiently close to the true subspace and uses a specific choice of step length, the two algorithms have nearly identical updates, see (Balzano and Wright, 2015, Equation 1.9). A weaker analogue of Theorem 8 for GROUSE was recently established in (Balzano and Wright, 2015). More specifically, (Balzano and Wright, 2015) shows that, if the current estimate $\widehat{S}_k$ is sufficiently close to the true subspace $S$, then $\widehat{S}_{k+1}$ will be even closer to $S$ in expectation. Such a result, however, cannot determine the local convergence rate, even in expectation; see the discussion right before Theorem 10 above. In this sense, a key technical contribution of our work is establishing the local linear convergence of SNIPE, see Theorem 10, which is missing for its close competitor GROUSE. In fact, the global convergence guarantees listed in Theorem 3 and Proposition 6 are also unique to SNIPE; such theoretical guarantees are not available for GROUSE. It might be interesting to add that our proposed update in SNIPE was inspired by that of GROUSE, after we found zero-filled updates to be unreliable (Eftekhari et al., 2017). However, GROUSE was derived as a purely streaming algorithm, and it is therefore not designed to leverage the common low-rank structure that may be revealed when a block of vectors is processed at once. As a result, for each block, SNIPE often achieves a more significant reduction in error than is possible with GROUSE. Lastly, both SNIPE and GROUSE have a computational complexity of $O(nr^2)$ flops per incoming vector (see Remark 1), and both require $O(nr)$ memory elements of storage (see Remark 2).
With complete data, namely, when no entries are missing, close relatives of both SNIPE and GROUSE are the incremental SVD algorithms, a class of algorithms that efficiently compute the SVD of streaming data (Bunch and Nielsen, 1978; Oja and Karhunen, 1985; Watanabe and Pakvasa, 1973; Balsubramani et al., 2013; Brand, 2002). A streaming PCA algorithm might also be interpreted as a stochastic algorithm for PCA (Arora et al., 2012). Stochastic projected gradient ascent in this context is closely related to the classical power method. In particular, the algorithm in (Mitliagkas et al., 2014) extends the power method to handle missing data, in part by improving the main result of (Lounici, 2014). With high probability, this algorithm converges globally and linearly to the true subspace and, most notably, succeeds for arbitrarily small sampling probability $p$, provided that the scope $T$ of the algorithm is large enough. This algorithm too has a computational complexity of $O(nr^2)$ operations per vector and a storage requirement of $O(nr)$ memory elements. A disadvantage of the power method is that it only updates its estimate of the true subspace once every $O(n)$ incoming vectors; the waiting time might be prohibitively long if $n$ is large. In contrast, SNIPE updates its estimate frequently, once every $b = O(r)$ incoming vectors. As we will see in Section 6, SNIPE substantially outperforms the power method in practice. Let us add that POPCA (Gonen et al., 2016) is closely related to the power method; its authors provide lower bounds on the achievable sample complexity. However, POPCA has a substantially greater memory demand than SNIPE, since it maintains an estimate of the possibly dense $n\times n$ sample covariance matrix of the incoming data.
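As a sketch of this family of methods (not the exact algorithm of Mitliagkas et al.; the $1/p$ rescaling is a standard debiasing device and the diagonal correction of the actual algorithm is omitted), one block power-method update might look like:

```python
import numpy as np

def power_block_update(Q, Y, p):
    """One block power-method step: multiply the current orthonormal basis Q
    by the empirical covariance of the zero-filled block Y (rescaled by 1/p^2
    off the diagonal; the diagonal bias is ignored in this sketch), then
    re-orthonormalize with a QR factorization."""
    n, T = Y.shape
    A = (Y @ Y.T) / (p ** 2 * T)
    Qnew, _ = np.linalg.qr(A @ Q)
    return Qnew

rng = np.random.default_rng(1)
n, r, T = 60, 4, 600
S, _ = np.linalg.qr(rng.standard_normal((n, r)))
Y = S @ rng.standard_normal((r, T))                       # complete data, p = 1
Q = power_block_update(rng.standard_normal((n, r)), Y, p=1.0)
```

With complete data the empirical covariance has rank exactly $r$, so a single update already spans the true subspace.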
The PETRELS algorithm (Chi et al., 2013) operates on one column at a time (rather than on blocks), and global convergence of PETRELS, namely, convergence to a stationary point of the underlying nonconvex program, is known. Designed for streaming matrix completion, the algorithm in (Mardani et al., 2015) also operates on one column at a time, and asymptotic convergence to the true subspace is established in Propositions 2 and 3 therein; this framework is also extended to tensors. MOUSSE (Xie et al., 2013) tracks a union of subspaces rather than a single one; SNIPE would function more like an ingredient of such an algorithm. Asymptotic consistency of MOUSSE is also established there. The theoretical guarantees of SNIPE are more comprehensive in the sense that we also offer a local convergence rate for SNIPE, see Theorems 8 and 10. ReProCS, introduced in (Lois and Vaswani, 2015), tracks a slowly changing subspace when initialized sufficiently close. In the next section, we compare the performance of several of these algorithms in practice and find that SNIPE competes empirically with state-of-the-art algorithms.

Even though we consider uniform random sampling of the entries of the incoming vectors, SNIPE can be applied to any incomplete data. For example, instead of the uniform sampling analyzed here, one could sample the entries of every incoming vector based on their estimated importance. More specifically, in iteration $k$, one might observe each entry of an incoming vector with a probability proportional to the leverage score of the corresponding row of the current estimate $\widehat{S}_{k-1}$. In batch or offline matrix completion, leveraged sampling (as opposed to uniform sampling) alleviates the dependence on the coherence factor $\eta(S)$ in (16) (Eftekhari et al., 2018a; Chen, 2015). While interesting, we have not pursued this direction in the current work.
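The leveraged-sampling idea just sketched can be written down in a few lines (helper names are ours; probabilities are clipped at one):

```python
import numpy as np

def row_leverage_scores(U):
    """Leverage scores of the rows of an orthonormal basis U (n x r):
    l_i = ||U[i, :]||^2. They are nonnegative and sum to r."""
    return np.sum(U ** 2, axis=1)

def entry_sampling_probs(U, budget):
    """Probability of observing each entry of the next incoming vector,
    proportional to the leverage of the corresponding row of U, with the
    expected number of observations close to `budget`."""
    l = row_leverage_scores(U)
    return np.minimum(1.0, budget * l / l.sum())

rng = np.random.default_rng(2)
U, _ = np.linalg.qr(rng.standard_normal((50, 3)))
probs = entry_sampling_probs(U, budget=15)
```

Rows that carry more of the subspace's energy are then observed more often, which is precisely what reduces the dependence on the coherence factor.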
6. Simulations

This section consists of two parts: first, we empirically study the dependence of SNIPE on various parameters; second, we compare SNIPE with existing algorithms for streaming subspace estimation with missing data. In all simulations, we consider an $r$-dimensional subspace $S \subset \mathbb{R}^n$ and a sequence of generic vectors $\{s_t\}_{t=1}^T \subset S$. Each entry of these vectors is observed with probability $p \in (0,1]$, and the observations are collected in vectors $\{y_t\}_{t=1}^T \subset \mathbb{R}^n$. Our objective is to estimate $S$ from $\{y_t\}_t$, as described in Section 1.

Sampling probability. We first set $n = 100$, $r = 5$, and let $S$ be a generic $r$-dimensional subspace, namely, the span of an $n\times r$ standard random Gaussian matrix. For various values of the probability $p$, we run SNIPE with block size $b = 2r = 10$ and scope $T = 500r = 2500$, recording the average estimation error $d_G(S,\widehat{S}_K)$ over 50 trials, see (10). The average error versus probability is plotted in Figure 3a.

Subspace dimension. With the same setting as in the previous paragraph, we now set $p = 3r/n = 0.15$ and vary the subspace dimension $r$, with block size $b = 2r$ and scope $T = 500r$. The average error versus subspace dimension is plotted in Figure 3b.

Ambient dimension. This time, we set $r = 5$, $p = 3r/n$, $b = 2r$, $T = 500r$, and vary the ambient dimension $n$. In other words, we vary $n$ while keeping the number of samples per vector fixed at about $pn = 3r$. The average error versus ambient dimension is plotted in Figure 3c. Observe that the performance of SNIPE steadily degrades as $n$ increases. This is in agreement with Theorem 8: substituting $p = 3r/n$ there, the error reduces by a factor of $1 - Cr/n$ in expectation. A similar behavior is observed for our close competitor GROUSE (Balzano and Wright, 2015).

Block size. Next we set $n = 100$, $r = 5$, $p = 3r/n$, $T = 500r$, and vary the block size $b$. The average error versus block size is depicted in Figure 3d.
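The common experimental setup in the paragraphs above can be reproduced with a short generator (a sketch; the function and variable names are ours):

```python
import numpy as np

def make_incomplete_stream(n, r, T, p, rng):
    """Draw a generic r-dimensional subspace S (span of a Gaussian n x r
    matrix), a stream of T vectors s_t = S q_t from it, and a Bernoulli(p)
    observation mask; unobserved entries are zeroed out."""
    S, _ = np.linalg.qr(rng.standard_normal((n, r)))
    coeffs = rng.standard_normal((r, T))
    X = S @ coeffs
    mask = rng.random((n, T)) < p
    return S, X * mask, mask

rng = np.random.default_rng(3)
S, Y, mask = make_incomplete_stream(n=100, r=5, T=2500, p=0.3, rng=rng)
```

The columns of `Y` play the role of the incomplete vectors $\{y_t\}_t$ fed to SNIPE.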
From Step 3 of Algorithm 1, a block size of $b \ge r$ is necessary for the success of SNIPE and, qualitatively speaking, larger values of $b$ lead to better stability in the face of missing data, which might explain the poor performance of SNIPE for very small values of $b$. However, as $b$ increases, the number of blocks $K = T/b$ decreases because the scope $T$ is held fixed. As the estimation error of SNIPE scales like $(1-cp)^K$ in Theorem 10 for a certain factor $c$, the performance suffers in Figure 3d. The choice of $b = 2r$ in SNIPE appears to give the best empirical performance.

Figure 3: Performance of SNIPE as (a) the sampling probability $p$, (b) the subspace dimension $r$, (c) the ambient dimension $n$, and (d) the block size $b$ vary. $\widehat{S}_K$ is the output of SNIPE and $d_G(S,\widehat{S}_K)$ is its distance to the true subspace $S$, which generated the input of SNIPE. See Section 6 for details, and note that each panel is generated with a different, random subspace $S$.

Coherence. Lastly, we set $n = 300$, $r = 10$, $p = 3r/n$, $b = 2r$, and $T = 500r$. We then test the performance of SNIPE as the coherence of $S$ varies, see (14). To that end, let $S \subset \mathbb{R}^n$ be a generic subspace with orthonormal basis $S \in \mathbb{R}^{n\times r}$; in particular, $S$ is obtained by orthogonalizing the columns of a standard $n\times r$ random Gaussian matrix. The average coherence of $S$ over 50 trials was $3.3334 \ll n/r$, and the average estimation error of SNIPE was $2.795\times 10^{-5}$. On the other hand, let $D \in \mathbb{R}^{n\times n}$ be a diagonal matrix with entries $D[i,i] = i^{-1}$, and consider $S' = DS$.
Unlike $S$, the new subspace $S' := \mathrm{span}(S')$ is typically coherent, since the energy of its orthonormal basis is mostly concentrated along its first few rows. This time, the average coherence of $S'$ over 50 trials was $19.1773$, much closer to the maximal value $n/r$, and the average estimation error of SNIPE was substantially worse, at $0.4286$.

Table 1: Average number of revealed data columns needed to reach the indicated subspace recovery error, for various sampling probabilities $p$, over 100 random trials. Standard deviations are given in parentheses.

                   p = 0.15                        p = 0.30                      p = 0.45                     p = 0.60                     p = 0.75
               d_G<1e-3         d_G<1e-7       d_G<1e-3      d_G<1e-7       d_G<1e-3      d_G<1e-7       d_G<1e-3      d_G<1e-7       d_G<1e-3     d_G<1e-7
GROUSE         878.0 (76.1)    1852.4 (85.2)   294.9 (20.6)   646.1 (26.8)  181.6 (13.8)  391.2 (18.0)   130.2 (10.3)  277.2 (12.8)   105.1 (11.7) 213.5 (12.8)
PETRELS        1689.0 (1394.1) 2853.7 (916.9)  421.8 (31.3)  1100.5 (64.2)  262.1 (25.3)  802.0 (31.1)   181.8 (20.2)  671.4 (22.8)   133.3 (21.3) 599.1 (24.0)
SNIPE          1815.7 (137.9)  3946.9 (182.4)  537.4 (39.9)  1236.4 (62.5)  282.4 (25.8)  649.6 (41.5)   171.4 (17.2)  391.4 (25.4)   105.5 (9.1)  241.6 (15.6)
SNIPE-overlap  1588.3 (183.8)  3319.5 (232.9)  318.7 (27.1)   704.6 (36.8)  131.8 (11.6)  296.4 (17.4)    71.3 (5.7)   155.6 (9.9)     44.2 (4.2)   91.8 (6.8)

Comparisons. Next we empirically compare SNIPE with GROUSE (Balzano et al., 2010; Zhang and Balzano, 2016), PETRELS (Chi et al., 2013), and the modified power method of (Mitliagkas et al., 2014). In addition to the version of SNIPE given in Algorithm 1, we also include comparisons with a simple variant of SNIPE, which we call SNIPE-overlap. Unlike SNIPE, which processes disjoint blocks, SNIPE-overlap processes all overlapping blocks of data. More precisely, for a block size $b$, SNIPE-overlap first processes data columns $t = 1, 2, \dots, b$, followed by columns $t = 2, 3, \dots, b+1$, and so on, whereas regular SNIPE processes columns $t = 1, 2, \dots, b$ followed by $t = b+1, b+2, \dots, 2b$, and so on.
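The coherence experiment above can be replicated in a few lines (a sketch with our own seed, so the numbers will differ from the reported averages); the diagonal scaling visibly inflates the coherence of the resulting span:

```python
import numpy as np

def coherence(B):
    """eta(S) = (n/r) * max_i ||P_S e_i||^2 for S = span(B), computed from an
    orthonormalized basis; always between 1 and n/r."""
    n, r = B.shape
    U, _ = np.linalg.qr(B)
    return (n / r) * np.max(np.sum(U ** 2, axis=1))

rng = np.random.default_rng(4)
n, r = 300, 10
G = rng.standard_normal((n, r))
D = np.diag(1.0 / np.arange(1, n + 1))  # D[i, i] = 1 / i
eta_generic = coherence(G)              # generic subspace: mildly coherent
eta_skewed = coherence(D @ G)           # energy piles onto the first rows
```

Here `eta_generic` is small while `eta_skewed` approaches the maximal value $n/r = 30$, mirroring the two regimes tested in the text.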
The theory developed in this paper does not hold for SNIPE-overlap because of the lack of statistical independence between its iterations, but we include this variant in the comparisons since it represents a minor modification of the SNIPE framework and appears to have some empirical benefits, as detailed below. In these experiments, we set $n = 100$, $r = 5$, $T = 5000$, take $S \subset \mathbb{R}^n$ to be a generic $r$-dimensional subspace, and simulate noiseless data samples as before. In Figure 4 we compare the algorithms for three values of the sampling probability $p$; the figure shows the estimation error (with respect to the metric $d_G$), averaged over 100 trials, as a function of the number of revealed data columns. For SNIPE, we used the block size $b = 2r$. To get the best performance from GROUSE, we used the greedy step size proposed in (Zhang and Balzano, 2016). For (Mitliagkas et al., 2014), we set the block size to $b = 1000$, which we found empirically to yield the lowest subspace error after $T = 5000$ iterations. In Table 1 we also compare the average number of revealed columns needed to reach a given error tolerance (as measured by the metric $d_G$) for each algorithm and various values of the sampling probability $p$. We omit the modified power method from these results, since it was unable to reach the given error tolerances in all cases. For the medium and high sampling rates $p = 0.45, 0.60, 0.75$, SNIPE-overlap is the fastest to converge, while regular SNIPE is competitive with GROUSE and PETRELS. For the lower sampling rates $p = 0.15, 0.30$, we find that GROUSE yields the fastest convergence, although SNIPE-overlap is also competitive with GROUSE at $p = 0.30$.
Figure 4: Average subspace estimation error versus the number of revealed data columns, at sampling probabilities (a) $p = 0.30$, (b) $p = 0.45$, and (c) $p = 0.60$; each panel compares SNIPE, SNIPE-overlap, GROUSE, PETRELS, and the power method. See Section 6 for details.

7. Proofs

In this section, we prove the technical results presented in Section 4. A short word on notation is in order first. We will frequently use MATLAB's matrix notation, so that, for example, $A[i,j]$ is the $[i,j]$-th entry of $A$, and the row vector $A[i,:]$ is the $i$-th row of $A$. By $\{\epsilon_i\}_i \overset{\text{ind.}}{\sim} \text{Bernoulli}(p)$, we mean that $\{\epsilon_i\}_i$ are independent Bernoulli random variables taking the value one with probability $p$ and zero otherwise. Throughout, $E_{i,j}$ stands for the $[i,j]$-th canonical matrix, so that $E_{i,j}[i,j] = 1$ is its only nonzero entry; the size of $E_{i,j}$ may be inferred from the context. As usual, $\|\cdot\|$ and $\|\cdot\|_F$ stand for the spectral and Frobenius norms, respectively. In addition, $\|A\|_\infty$ and $\|A\|_{2,\infty}$ return the largest entry of a matrix $A$ (in magnitude) and the largest $\ell_2$ norm of its rows, respectively. Singular values of a matrix $A$ are denoted by $\sigma_1(A) \ge \sigma_2(A) \ge \cdots$. For purely aesthetic reasons, we will occasionally use the notation $a \vee b = \max(a,b)$ and $a \wedge b = \min(a,b)$.

7.1. Convergence of SNIPE to a Stationary Point (Proof of Theorem 3)

Consider Program (2), namely,
$$\begin{cases} \min\; f_{\Omega}(X,U) := \|P_{U^{\perp}} X\|_F^2 + \lambda \|P_{\Omega^C}(X)\|_F^2, \\ P_{\Omega}(X) = Y, \end{cases} \qquad (30)$$
where the minimization is over matrices $X \in \mathbb{R}^{n\times T}$ and subspaces $U \in G(n,r)$. Before proceeding with the rest of the proof, let us record the partial derivatives of $f_\Omega$ for future reference. Consider a small perturbation of $X$ of the form $X + \Delta$, where $\Delta \in \mathbb{R}^{n\times T}$.
Let $U \in \mathbb{R}^{n\times r}$ and $U^{\perp} \in \mathbb{R}^{n\times(n-r)}$ be orthonormal bases for $U$ and its orthogonal complement $U^{\perp}$, respectively. Consider also a small perturbation of $U$ of the form $U + U^{\perp}\Delta' \in \mathbb{R}^{n\times r}$, where $\Delta' \in \mathbb{R}^{(n-r)\times r}$. The perturbation of $f_\Omega(X,U)$ can be written as
$$f_\Omega\left(X+\Delta,\, U+U^{\perp}\Delta'\right) = f_\Omega(X,U) + \langle \Delta, \nabla_X f_\Omega(X,U)\rangle + \langle \Delta', \nabla_U f_\Omega(X,U)\rangle + o(\|\Delta\|_F) + o(\|\Delta'\|_F), \qquad (31)$$
where $o(\cdot)$ is the standard little-o notation. The partial derivatives of $f_\Omega$ are listed below and derived in Appendix D.

Lemma 13 For $f_\Omega$ in Program (30), the first-order partial derivatives at $(X,U) \in \mathbb{R}^{n\times T}\times G(n,r)$ are
$$\nabla_X f_\Omega(X,U) = 2 P_{U^{\perp}} X + 2\lambda P_{\Omega^C}(X) \in \mathbb{R}^{n\times T}, \qquad \nabla_U f_\Omega(X,U) = -2 (U^{\perp})^{\top} X X^{\top} U \in \mathbb{R}^{(n-r)\times r}, \qquad (32)$$
where $U \in \mathbb{R}^{n\times r}$ and $U^{\perp} \in \mathbb{R}^{n\times(n-r)}$ are orthonormal bases for $U$ and its orthogonal complement, respectively.

Recall that $Q_k \in \mathbb{R}^{r\times b}$ is a random matrix with bounded expectation, namely, $\mathbb{E}\|Q_k\|_F < \infty$. As $K \to \infty$, $\{Q_k\}_{k=1}^K$ therefore has a bounded subsequence. To keep the notation simple, and without any loss of generality, we assume that the sequence $\{Q_k\}_k$ is itself bounded. As $K \to \infty$ and for any integer $l$, we can always find an interval of length $l$ over which the same index set and nearly the same coefficient matrix repeat. More specifically, consider an index set $\widehat{\Omega} \subset [1:n]\times[1:b]$ and a matrix $\widehat{Q} \in \mathbb{R}^{r\times b}$ in the support of the distributions from which $\{\Omega_k\}_{k=1}^K$ and $\{Q_k\}_{k=1}^K$ are drawn. For every integer $l$, as a result of the second Borel-Cantelli lemma (Durrett, 2010, pg. 64), there almost surely exists a contiguous interval
$$\kappa_l := [k_l - l + 1 : k_l], \qquad (33)$$
such that
$$\Omega_k = \widehat{\Omega}, \qquad k \in \kappa_l, \qquad (34)$$
$$\max_{k\in\kappa_l} \|Q_k - \widehat{Q}\|_F \le \frac{1}{l}. \qquad (35)$$
As $l \to \infty$, the measurements corresponding to the interval $\kappa_l$ converge. To be specific, let $\widehat{Y} := P_{\widehat{\Omega}}(S\widehat{Q})$ and note that
$$\lim_{l\to\infty}\max_{k\in\kappa_l}\|Y_k - \widehat{Y}\|_F = \lim_{l\to\infty}\max_{k\in\kappa_l}\|P_{\widehat{\Omega}}(SQ_k) - \widehat{Y}\|_F = \lim_{l\to\infty}\max_{k\in\kappa_l}\|P_{\widehat{\Omega}}(S(Q_k - \widehat{Q}))\|_F \le \lim_{l\to\infty}\max_{k\in\kappa_l}\|Q_k - \widehat{Q}\|_F = 0. \qquad (36)$$
The above observation encourages us to exchange $(\Omega_k, Y_k)$ with $(\widehat{\Omega}, \widehat{Y})$ on the interval $\kappa_l$.
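The expressions in Lemma 13 are easy to sanity-check numerically. The sketch below (a small instance of our own) compares both partial derivatives against central differences along random directions:

```python
import numpy as np

rng = np.random.default_rng(5)
n, T, r, lam = 8, 6, 2, 0.7
Qfull, _ = np.linalg.qr(rng.standard_normal((n, n)))
U, Uperp = Qfull[:, :r], Qfull[:, r:]   # orthonormal bases of U and its complement
mask = rng.random((n, T)) < 0.5         # the index set Omega
X = rng.standard_normal((n, T))

def f_of_X(Xm):
    # f_Omega(X, U) = ||P_{U-perp} X||_F^2 + lam * ||P_{Omega^c}(X)||_F^2, U fixed
    return np.linalg.norm(Uperp.T @ Xm) ** 2 + lam * np.linalg.norm(Xm * ~mask) ** 2

def f_of_Delta(Delta):
    # f with U perturbed to U + Uperp @ Delta and X held fixed
    Up = U + Uperp @ Delta
    return (np.linalg.norm(X) ** 2 - np.linalg.norm(Up.T @ X) ** 2
            + lam * np.linalg.norm(X * ~mask) ** 2)

grad_X = 2 * Uperp @ (Uperp.T @ X) + 2 * lam * X * ~mask  # Lemma 13, first identity
grad_U = -2 * Uperp.T @ X @ X.T @ U                       # Lemma 13, second identity

h = 1e-6
D1 = rng.standard_normal((n, T))
D2 = rng.standard_normal((n - r, r))
fd_X = (f_of_X(X + h * D1) - f_of_X(X - h * D1)) / (2 * h)       # central differences
fd_U = (f_of_Delta(h * D2) - f_of_Delta(-h * D2)) / (2 * h)
```

Since $f_\Omega$ is quadratic in each argument, the central differences agree with the inner products against the stated gradients up to floating-point error.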
Let us therefore study the program
$$\begin{cases}\min\; f_{\widehat{\Omega}}(X,U),\\ P_{\widehat{\Omega}}(X) = \widehat{Y},\end{cases} \qquad (37)$$
where the minimization is over all matrices $X \in \mathbb{R}^{n\times b}$ and subspaces $U \in G(n,r)$. From a technical viewpoint, it is in fact more convenient to relax the equality constraint above as
$$\begin{cases}\min\; f_{\widehat{\Omega}}(X,U),\\ \|P_{\widehat{\Omega}}(X) - \widehat{Y}\|_F \le \epsilon,\end{cases} \qquad (38)$$
for $\epsilon > 0$; we fix $\epsilon$ for now. Let us next use alternating minimization to solve Program (38). More specifically, recall (33) and consider the initialization $\widehat{S}_{k_l-l,\epsilon} := \widehat{S}_{k_l-l}$, where $\widehat{S}_{k_l-l}$ is the output of SNIPE at iteration $k_l-l$, see Algorithm 1. For every $k \in \kappa_l$, consider the program
$$\begin{cases}\min_X\; f_{\widehat{\Omega}}(X, \widehat{S}_{k-1,\epsilon}),\\ \|P_{\widehat{\Omega}}(X) - \widehat{Y}\|_F \le \epsilon,\end{cases} \qquad (39)$$
and let $R_{k,\epsilon}$ be a minimizer of Program (39). We then update the subspace by solving
$$\min_{U\in G(n,r)}\; f_{\widehat{\Omega}}(R_{k,\epsilon}, U), \qquad (40)$$
and setting $\widehat{S}_{k,\epsilon}$ to be a minimizer of Program (40). Recalling the definition of $f_{\widehat{\Omega}}$ in Program (30), and in light of the Eckart-Young-Mirsky theorem, Program (40) can be solved by computing the top $r$ left singular vectors of $R_{k,\epsilon}$ (Eckart and Young, 1936; Mirsky, 1966). For future reference, note that the optimality, and hence stationarity, of $\widehat{S}_{k,\epsilon}$ in Program (40) dictates that
$$\nabla_U f_{\widehat{\Omega}}(R_{k,\epsilon}, \widehat{S}_{k,\epsilon}) = 0, \qquad k \in \kappa_l, \qquad (41)$$
where $\nabla_U f$ was specified in Lemma 13. From the above construction of the sequence $\{(R_{k,\epsilon}, \widehat{S}_{k,\epsilon})\}_{k\in\kappa_l}$, we also observe that
$$0 \le f_{\widehat{\Omega}}(R_{k,\epsilon}, \widehat{S}_{k,\epsilon}) \le f_{\widehat{\Omega}}(R_{k,\epsilon}, \widehat{S}_{k-1,\epsilon}) \le f_{\widehat{\Omega}}(R_{k-1,\epsilon}, \widehat{S}_{k-1,\epsilon}), \qquad (42)$$
for every $k \in [k_l-l+2 : k_l] \subset \kappa_l$, see (33). That is, $\{f_{\widehat{\Omega}}(R_{k,\epsilon}, \widehat{S}_{k,\epsilon})\}_{k\in\kappa_l}$ is a nonincreasing and nonnegative sequence. It therefore holds that
$$\lim_{l\to\infty}\left( f_{\widehat{\Omega}}(R_{k_l-1,\epsilon}, \widehat{S}_{k_l-1,\epsilon}) - f_{\widehat{\Omega}}(R_{k_l,\epsilon}, \widehat{S}_{k_l-1,\epsilon}) \right) = 0. \qquad (43)$$
By the feasibility of $R_{k_l-1,\epsilon}$ in Program (39) and by the continuity of $f_{\widehat{\Omega}}(X,U)$ in $X$, we conclude in light of (43) that $R_{k_l-1,\epsilon}$ too is a minimizer (and hence also a stationary point) of Program (39) in the limit of $l\to\infty$. Writing the stationarity conditions of Program (39) at $R_{k_l,\epsilon}$, we therefore find that
$$\|P_{\widehat{\Omega}}(R_{k_l,\epsilon}) - \widehat{Y}\|_F \le \epsilon, \qquad (44)$$
$$\nabla_X f_{\widehat{\Omega}}(R_{k_l,\epsilon}, \widehat{S}_{k_l,\epsilon}) + \lambda_{k_l,\epsilon}\left(P_{\widehat{\Omega}}(R_{k_l,\epsilon}) - \widehat{Y}\right) = 0, \qquad (45)$$
for some nonnegative $\lambda_{k_l,\epsilon}$.
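The subspace step (40) is exactly a truncated SVD, per Eckart-Young-Mirsky; a minimal sketch:

```python
import numpy as np

def rank_r_truncation(R, r):
    """Best rank-r approximation of R in Frobenius norm (Eckart-Young-Mirsky),
    together with the span of its top r left singular vectors, which solves
    min_U ||P_{U-perp} R||_F^2 over r-dimensional subspaces U."""
    U, s, Vt = np.linalg.svd(R, full_matrices=False)
    return U[:, :r], (U[:, :r] * s[:r]) @ Vt[:r]

R = np.diag([3.0, 2.0, 1.0])
U2, R2 = rank_r_truncation(R, 2)
```

For this diagonal example, the residual $\|R - R_2\|_F$ equals the discarded singular value $\sigma_3(R) = 1$, and the returned basis spans the top two coordinate directions.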
Recalling the definition of $f_{\widehat{\Omega}}$ and that $\lambda > 0$ by assumption, we observe that Program (39) is strongly convex in $P_{\widehat{\Omega}^C}(X)$, and consequently any pair of minimizers of Program (39) must agree on the complement of the index set $\widehat{\Omega}$. The optimality of $R_{k_l,\epsilon}$ and the limit optimality of $R_{k_l-1,\epsilon}$ in Program (39) therefore imply that
$$\lim_{l\to\infty}\|P_{\widehat{\Omega}^C}(R_{k_l-1,\epsilon} - R_{k_l,\epsilon})\|_F = 0. \qquad (46)$$
On the index set $\widehat{\Omega}$, on the other hand, the feasibility of both $R_{k_l-1,\epsilon}$ and $R_{k_l,\epsilon}$ in Program (39) implies that
$$\|P_{\widehat{\Omega}}(R_{k_l-1,\epsilon} - R_{k_l,\epsilon})\|_F \le \|P_{\widehat{\Omega}}(R_{k_l-1,\epsilon}) - \widehat{Y}\|_F + \|P_{\widehat{\Omega}}(R_{k_l,\epsilon}) - \widehat{Y}\|_F \le 2\epsilon. \qquad \text{(triangle inequality)} \quad (47)$$
Combining (46) and (47) yields
$$\lim_{l\to\infty}\|R_{k_l-1,\epsilon} - R_{k_l,\epsilon}\|_F \le \lim_{l\to\infty}\|P_{\widehat{\Omega}}(R_{k_l-1,\epsilon} - R_{k_l,\epsilon})\|_F + \lim_{l\to\infty}\|P_{\widehat{\Omega}^C}(R_{k_l-1,\epsilon} - R_{k_l,\epsilon})\|_F \le 2\epsilon. \qquad \text{(see (46), (47))} \quad (48)$$
In light of (39), $\{R_{k_l,\epsilon}\}_l$ is bounded and consequently has a convergent subsequence. Without loss of generality, and to simplify the notation, we assume that $\{R_{k_l,\epsilon}\}_l$ is itself convergent, namely, that there exists $R_\epsilon \in \mathbb{R}^{n\times b}$ for which
$$\lim_{l\to\infty}\|R_{k_l,\epsilon} - R_\epsilon\|_F \le 2\epsilon. \qquad (49)$$
Let us now send $\epsilon$ to zero in (49) to obtain
$$\lim_{\epsilon\to0}\lim_{l\to\infty}\|R_{k_l,\epsilon} - R_\epsilon\|_F \le \lim_{\epsilon\to0} 2\epsilon = 0. \qquad (50)$$
We next show that it is possible to essentially exchange the order of the limits above, and also conclude that $(R, \widehat{S})$ coincides with the output of SNIPE in the limit. The following result is proved in Appendix E.

Lemma 14 With the setup above, there exist a sequence $\{\epsilon_i\}_i$ with $\lim_{i\to\infty}\epsilon_i = 0$ and a matrix $R \in \mathbb{R}^{n\times b}$ such that
$$\lim_{l,i\to\infty}\|R_{k_l,\epsilon_i} - R\|_F = \lim_{l\to\infty}\lim_{i\to\infty}\|R_{k_l,\epsilon_i} - R\|_F = \lim_{i\to\infty}\lim_{l\to\infty}\|R_{k_l,\epsilon_i} - R\|_F = 0. \qquad (51)$$
Moreover, suppose that the output of SNIPE in every iteration has a spectral gap, in the sense that there exists $\tau > 0$ such that
$$\frac{\sigma_r(R_k)}{\sigma_{r+1}(R_k)} \ge 1+\tau, \qquad (52)$$
for every $k$. Let $\widehat{S}_{k_l,\epsilon_i}$ and $\widehat{S}$ be the spans of the top $r$ left singular vectors of $R_{k_l,\epsilon_i}$ and $R$, respectively. Then it holds that
$$\lim_{l,i\to\infty} d_G(\widehat{S}_{k_l,\epsilon_i}, \widehat{S}) = 0. \qquad (53)$$
Lastly, in the limit of $l\to\infty$, SNIPE produces $(R, \widehat{S})$ in every iteration, namely,
$$\lim_{l\to\infty}\|R_{k_l} - R\|_F = \lim_{l\to\infty} d_G(\widehat{S}_{k_l}, \widehat{S}) = 0, \qquad (54)$$
where $(R_{k_l}, \widehat{S}_{k_l})$ is the output of SNIPE in iteration $k_l$, see Algorithm 1.
In fact, the pair $(R, \widehat{S})$ from Lemma 14 is stationary in the limit, in the sense described next and proved in Appendix F.

Lemma 15 The pair $(R, \widehat{S})$ in Lemma 14 is a stationary point of the program
$$\begin{cases}\min\; f_{\Omega_{k_l}}(X,U),\\ P_{\Omega_{k_l}}(X) = Y_{k_l},\end{cases} \qquad (55)$$
as $l\to\infty$. The minimization above is over all matrices $X \in \mathbb{R}^{n\times b}$ and subspaces $U \in G(n,r)$. More specifically, it holds that
$$\lim_{l\to\infty}\|\nabla_U f_{\Omega_{k_l}}(R, \widehat{S})\|_F = 0, \qquad (56)$$
$$\lim_{l\to\infty}\|P_{\Omega_{k_l}}(R) - Y_{k_l}\|_F = 0, \qquad (57)$$
$$\lim_{l\to\infty}\|P_{\Omega_{k_l}^C}\left(\nabla_X f_{\Omega_{k_l}}(R, \widehat{S})\right)\|_F = 0. \qquad (58)$$
In words, Lemmas 14 and 15 together imply that the output of SNIPE is, in the limit, a stationary point of Program (55). This completes the proof of Theorem 3.

7.2. Convergence of SNIPE (Proof of Proposition 6)

In iteration $k$ of SNIPE, we partially observe the data block $SQ_k \in \mathbb{R}^{n\times b}$ on a random index set $\Omega_k \subset [1:n]\times[1:b]$, where $Q_k \in \mathbb{R}^{r\times b}$ is a random coefficient matrix. We collect the observations in $Y_k = P_{\Omega_k}(SQ_k) \in \mathbb{R}^{n\times b}$; see Sections 2 and 3 for the detailed setup. Note that $R_k$ in (1) can be written as
$$R_k = Y_k + P_{\Omega_k^C}(\widehat{S}_{k-1} Q'_k) = P_{\Omega_k}(SQ_k) + P_{\Omega_k^C}(\widehat{S}_{k-1} Q'_k), \qquad (59)$$
where the columns of $Q'_k$ are
$$q'_t = \left(\widehat{S}_{k-1}^{\top} P_{\omega_t} \widehat{S}_{k-1} + \lambda I_r\right)^{-1} \widehat{S}_{k-1}^{\top} y_t. \qquad (60)$$
By (13), there exist $Q^*_k \in \mathbb{R}^{r\times b}$ and
$$R^*_k := P_{\Omega_k}(SQ_k) + P_{\Omega_k^C}(\widehat{S} Q^*_k), \qquad (61)$$
such that
$$\lim_{k\to\infty}\|R_k - R^*_k\|_F = 0. \qquad (62)$$
In (61) above, $\widehat{S} \in \mathbb{R}^{n\times r}$ is an orthonormal basis for the subspace $\widehat{S}$. In Algorithm 1, the rank-$r$ truncated SVD of $R_k$ spans $\widehat{S}_k \in G(n,r)$, namely, the output of SNIPE in iteration $k$. Let also $\widehat{S}^*_k \in G(n,r)$ denote the span of the rank-$r$ truncated SVD of $R^*_k$. The existence of the reject option in Algorithm 1 with positive $\tau$ implies that $R_k$ has a spectral gap, and therefore $\widehat{S}_k$ is uniquely defined. Combining this with (62), we find that $\widehat{S}^*_k$ too is uniquely defined in the limit of $k\to\infty$. Therefore, another consequence of (62) is that
$$\lim_{k\to\infty} d_G(\widehat{S}_k, \widehat{S}^*_k) = 0. \qquad (63)$$
Then we have
$$\lim_{k\to\infty} d_G(\widehat{S}^*_k, \widehat{S}) \le \lim_{k\to\infty} d_G(\widehat{S}^*_k, \widehat{S}_k) + \lim_{k\to\infty} d_G(\widehat{S}_k, \widehat{S}) = 0, \qquad \text{(see (63), (13))} \quad (64)$$
namely, $\widehat{S}^*_k$ converges to $\widehat{S}$ in the limit as well.
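Putting (59) and (60) together, one SNIPE iteration can be sketched as follows (schematic only: the reject test of Algorithm 1 is omitted, the $\lambda$ handling is simplified, and the names are ours):

```python
import numpy as np

def snipe_update(S_hat, Y, mask, lam=1e-6):
    """One SNIPE block update: regress each incomplete column on the current
    basis over its observed entries as in (60), fill the missing entries with
    the interpolation S_hat @ q_t as in (59), then re-estimate the subspace
    from the top r left singular vectors of the completed block R_k."""
    n, b = Y.shape
    r = S_hat.shape[1]
    R = np.array(Y, dtype=float)
    for t in range(b):
        w = mask[:, t]
        Sw = S_hat[w, :]                 # rows of the basis at the observed entries
        q = np.linalg.solve(Sw.T @ Sw + lam * np.eye(r), Sw.T @ Y[w, t])
        R[~w, t] = (S_hat @ q)[~w]       # least-change fill on the complement
    U, _, _ = np.linalg.svd(R, full_matrices=False)
    return U[:, :r]

rng = np.random.default_rng(6)
n, r, b = 50, 3, 10
S, _ = np.linalg.qr(rng.standard_normal((n, r)))
Y = S @ rng.standard_normal((r, b))                  # one fully observed block
S_next = snipe_update(rng.standard_normal((n, r)), Y, np.ones((n, b), dtype=bool))
```

With a fully observed block the completed $R_k$ equals $Y_k$, so the update returns an orthonormal basis for the true subspace regardless of the previous estimate.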
Let us now rewrite $R^*_k$ as
$$R^*_k = P_{\Omega_k}(SQ_k - \widehat{S} Q^*_k) + \widehat{S} Q^*_k, \qquad \text{(see (61))} \quad (65)$$
which, together with (64), implies that
$$\lim_{k\to\infty}\|\widehat{S}^{\top} P_{\Omega_k}(SQ_k - \widehat{S} Q^*_k)\|_F = 0. \qquad (66)$$
We can rewrite the above limit in terms of the data vectors (rather than data blocks) to obtain
$$\lim_{t\to\infty}\|\widehat{S}^{\top} P_{\omega_t}(S q_t - \widehat{S} q^*_t)\|_2 = 0, \qquad (67)$$
where $\{q_t, q^*_t\}_t$ form the columns of the blocks $\{Q_k, Q^*_k\}_k$, and the index sets $\{\omega_t\}_t \subset [1:n]$ form $\{\Omega_k\}_k$. There almost surely exists a subsequence $\{t_i\}_i$ over which $\omega_{t_i} = \{1\}$, namely, a subsequence along which we only observe the first entry of the incoming data vector. Consider a vector $q^1 \in \mathbb{R}^r$ in the support of the distribution from which $\{q_t\}_t$ are drawn. Then there also exists a subsequence of $\{t_i\}_i$, denoted by $\{t_{i_j}\}_j$, such that $\lim_{j\to\infty}\|q_{t_{i_j}} - q^1\|_2 = 0$. Restricted to the subsequence $\{t_{i_j}\}_j$, (67) reads
$$0 = \lim_{j\to\infty}\|\widehat{S}^{\top} P_{\omega_{t_{i_j}}}(S q_{t_{i_j}} - \widehat{S} q^*_{t_{i_j}})\|_2 = \lim_{j\to\infty}\|\widehat{S}^{\top} P_{\{1\}}(S q_{t_{i_j}} - \widehat{S} q^*_{t_{i_j}})\|_2 = \|\widehat{S}^{\top} P_{\{1\}}(S q^1 - \widehat{S} q^{*,1})\|_2, \qquad (68)$$
where we set $q^{*,1} := \lim_j q^*_{t_{i_j}}$; the limit exists by (60). Likewise, we can show that
$$v_l := P_{\{l\}}(S q^l - \widehat{S} q^{*,l}) \in \widehat{S}^{\perp}, \qquad l \in [1:n], \qquad (69)$$
where $\{q^{*,l}\}_l$ are defined similarly. Because $\dim(\widehat{S}^{\perp}) = n-r$, at most $n-r$ of the vectors $\{v_l\}_{l=1}^n$ are linearly independent. Because the supports of $\{v_l\}_l$ are disjoint, it follows that at most $n-r$ of the vectors $\{v_l\}_{l=1}^n$ are nonzero. Put differently, there exists an index set $I \subset [1:n]$ of size at least $r$ such that
$$S q^l = \widehat{S} q^{*,l}, \qquad l \in I. \qquad (70)$$
Almost surely, $\{q^l\}_{l\in I} \subset \mathbb{R}^r$ form a basis for $\mathbb{R}^r$, and therefore $S \subseteq \widehat{S}$. Because $\widehat{S} \in G(n,r)$ by assumption, it follows that $\widehat{S} = S$, which completes the proof of Proposition 6.

7.3. Locally Linear Convergence of SNIPE (Proof of Theorems 8 and 10)

At iteration $k \in [2:K]$, SNIPE uses the current estimate $\widehat{S}_{k-1}$ and the new incomplete block $Y_k$ to produce a new estimate $\widehat{S}_k$ of the true subspace $S$. The main challenge here is to compare the new and old principal angles with $S$, namely, to compare $d_G(S, \widehat{S}_k)$ with $d_G(S, \widehat{S}_{k-1})$.
Lemma 16 below, proved in Appendix G, loosely speaking states that $d_G(S, \widehat{S}_k)$ reduces by a factor of $1 - O(p)$ in expectation in every iteration when $d_G(S, \widehat{S}_{k-1}) \lesssim p^{\frac52}$, ignoring all other parameters in this qualitative discussion. In other words, when it is sufficiently small, the estimation error of SNIPE reduces in every iteration, in expectation. The actual behavior of SNIPE is more nuanced. Indeed, Lemma 16 below also adds that the estimation error $d_G(S, \widehat{S}_k)$ in fact contracts in some iterations by a factor of $1 - Cp$, namely, $d_G(S, \widehat{S}_k) \le (1-Cp)\, d_G(S, \widehat{S}_{k-1})$, provided that $d_G(S, \widehat{S}_{k-1}) \lesssim p^{\frac52}$. That is, when sufficiently small, the estimation error of SNIPE reduces in some, but not all, iterations. In the rest of the iterations, the error does not increase by much, namely, $d_G(S, \widehat{S}_k) \approx d_G(S, \widehat{S}_{k-1})$, with high probability and provided that $d_G(S, \widehat{S}_{k-1}) \lesssim p^{\frac52}$.

Lemma 16 Fix $k \in [2:K]$, $\alpha, \nu \ge 1$, and $c > 0$. Let $E_{k-1}$ be the event where
$$p \ge \frac{C\alpha^2\, \eta(\widehat{S}_{k-1})\, r\, \log^2 b\, \log n}{n}, \qquad (71)$$
$$d_G(S, \widehat{S}_{k-1}) \le \frac{p^{\frac72}\sqrt{nb}}{\alpha c\, \log b\, \sqrt{r}\, \log n}, \qquad (72)$$
and let $E'_k$ be the event where $\|Q_k\| \le \nu\,\sigma_{\min}$, where $\sigma_{\min}$ is the reject threshold in SNIPE, see Algorithm 1. Then it holds that
$$\mathbb{E}\left[ d_G(S, \widehat{S}_k) \,\middle|\, \widehat{S}_{k-1}, E_{k-1}, E'_k \right] \le \nu\left(1 - \frac{p}{4}\right) d_G(S, \widehat{S}_{k-1}) + b^{-C\alpha}. \qquad (73)$$
Moreover, conditioned on $\widehat{S}_{k-1}$ and the event $E_{k-1}\cap E'_k$, it holds that
$$d_G(S, \widehat{S}_k) \le \nu\left(1 + p^3 nb\right) d_G(S, \widehat{S}_{k-1}), \qquad (74)$$
except with a probability of at most $b^{-C\alpha}$. Lastly, a stronger bound holds conditioned on $\widehat{S}_{k-1}$ and the event $E_{k-1}\cap E'_k$, namely,
$$d_G(S, \widehat{S}_k) \le \nu\left(1 - cp\right) d_G(S, \widehat{S}_{k-1}), \qquad (75)$$
with a probability of at least
$$\varphi_k(\alpha) := 1 - \exp\left(-\frac{C_1 p^2 nb}{\widetilde{\eta}_k}\right) - b^{-C\alpha}, \qquad (76)$$
where
$$\widetilde{\eta}_k = \widetilde{\eta}\left(P_{\widehat{S}_{k-1}^{\perp}} S_k\right) := \frac{nb\, \|P_{\widehat{S}_{k-1}^{\perp}} S_k\|_{\infty}^2}{\|P_{\widehat{S}_{k-1}^{\perp}} S_k\|_F^2}. \qquad (77)$$
Let us now use Lemma 16 to complete the proofs of Theorems 8 and 10.

7.3.1.
Proof of Theorem 8

With the choices $c = 4p^2nb$ and $\nu = 1/\sqrt{1-p/4}$, (73) reads
$$\mathbb{E}\left[d_G(S, \widehat{S}_k)\,\middle|\,\widehat{S}_{k-1}, E_{k-1}, E'_k\right] \le \nu\left(1-\frac{p}{4}\right) d_G(S, \widehat{S}_{k-1}) + b^{-C\alpha} = \sqrt{1-\frac{p}{4}}\; d_G(S, \widehat{S}_{k-1}) + b^{-C\alpha} \le \left(1-\frac{p}{8}\right) d_G(S, \widehat{S}_{k-1}) + b^{-C\alpha}. \qquad (78)$$
With the choice
$$\alpha = \frac{C\log\left(16/\left(p\, d_G(S, \widehat{S}_{k-1})\right)\right)}{\log b}, \qquad (79)$$
for an appropriate constant $C$ above, we have $b^{-C\alpha} \le \frac{p}{16}\, d_G(S, \widehat{S}_{k-1})$, and the bound in (78) simplifies to
$$\mathbb{E}\left[d_G(S, \widehat{S}_k)\,\middle|\,\widehat{S}_{k-1}, E_{k-1}, E'_k\right] \le \left(1-\frac{p}{8}\right) d_G(S, \widehat{S}_{k-1}) + \frac{p}{16}\, d_G(S, \widehat{S}_{k-1}) = \left(1-\frac{p}{16}\right) d_G(S, \widehat{S}_{k-1}). \qquad (80)$$
Lastly, we remove the conditioning on $E'_k$ above. Using the law of total expectation, we write
$$\mathbb{E}\left[d_G(S, \widehat{S}_k)\,\middle|\,\widehat{S}_{k-1}, E_{k-1}\right] \le \mathbb{E}\left[d_G(S, \widehat{S}_k)\,\middle|\,\widehat{S}_{k-1}, E_{k-1}, E'_k\right] + \Pr[E'^C_k] \le \left(1-\frac{p}{16}\right) d_G(S, \widehat{S}_{k-1}) + \Pr[E'^C_k] \le \left(1-\frac{p}{32}\right) d_G(S, \widehat{S}_{k-1}), \qquad (81)$$
where the first inequality uses the boundedness of $d_G$ in (10), the second uses (80), and the last line holds if $\Pr[E'^C_k] \le \frac{p}{32}\, d_G(S, \widehat{S}_{k-1})$.

With the choices of $c, \nu, \alpha$ above, let us also rewrite the event $E_{k-1}$ in Lemma 16. First, we rewrite (72) as
$$d_G(S, \widehat{S}_{k-1}) \le \frac{p^{\frac72}\sqrt{nb}}{\alpha c\, \log b\, \sqrt{r}\, \log n} = \frac{C p^{\frac32}}{\alpha\, \sqrt{nb}\, \log b\, \sqrt{r}\, \log n}, \qquad \text{(see (79))} \quad (82)$$
where $\alpha$ itself depends on $d_G(S, \widehat{S}_{k-1})$ through (79). Second, we replace the coherence $\eta(\widehat{S}_{k-1})$ in (71) with the simpler quantity $\eta(S)$. We can do so thanks to Lemma 20, which roughly speaking states that a pair of subspaces $A$ and $B$ with a small principal angle have similar coherences, namely, $\theta_1(A,B) \approx 0 \Rightarrow \eta(A) \approx \eta(B)$. More concretely, note that
$$\sqrt{\eta(\widehat{S}_{k-1})} \le \sqrt{\eta(S)} + d_G(S, \widehat{S}_{k-1})\sqrt{\frac{n}{r}} \le \sqrt{\eta(S)} + 1 \le 2\sqrt{\eta(S)}, \qquad (83)$$
where the first inequality is Lemma 20, the second uses (82) and the assumption (15) on $p$, and the last uses $\eta(S) \ge 1$. This completes the proof of Theorem 8.

7.3.2. Proof of Theorem 10

For $K_0 \in [1:K]$, we condition on $\widehat{S}_{K_0}$. For positive $c$ to be set later, suppose that
$$d_G(S, \widehat{S}_{K_0}) \le e^{-\frac{Cp^3nb}{\widetilde{\eta}}} \cdot \frac{p^{\frac72}\sqrt{nb}}{\alpha c\, \log b\, \log n}. \qquad (84)$$
In particular, (84) implies that the error at iteration $K_0$ is small enough to activate Lemma 16, see (72).
For ν ≥ 1 to be set later, we condition for now on the event

E⋆ := ∩_{k=K₀+1}^{K} E⋆_k,   (85)

where the event E⋆_k was defined in Lemma 16. Suppose also that (71) holds for every k ∈ [K₀ + 1 : K], namely,

p ≥ α² log² b · log n · max_{k ∈ [K₀ : K−1]} η(Ŝ_k) r / n,   (86)

which will next allow us to apply Lemma 16 repeatedly to all iterations in the interval [K₀ + 1 : K]. With the success probability φ_k(α) defined in Lemma 16, let us also define

φ(α) := min_{k ∈ [K₀+1 : K]} φ_k(α) = 1 − exp(−C₁ p² nb / η̃) − b^{−Cα},   η̃ := max_{k ∈ [K₀+1 : K]} η̃_k ≥ 1,   (87)

where the inequality above follows because η̃_k ≥ 1 for every k, see (77). We now partition [K₀ + 1 : K] into (non-overlapping) intervals {I_i}_i, each of length

l = C₂ log b log(K − K₀) / φ(α),   (88)

except possibly the last interval, which might be shorter. Consider one of these intervals, say I₁. Then, by Lemma 16 and the union bound, (74) holds for every iteration k ∈ I₁ except with a probability of at most l b^{−Cα}, because the length of I₁ is l. That is, the estimation error does not increase by much in any iteration in the interval I₁. In some of these iterations, the error in fact decreases. More specifically, (75) holds in iteration k with a probability of at least φ_k(α), see (76). While b^{−Cα} in (76) can be made arbitrarily small by increasing the tuning parameter α, this of course would not necessarily make φ_k(α) arbitrarily close to one. That is, there is a sizable chance that the estimation error does not contract in iteration k. However, (75) holds in at least one iteration in the interval I₁ except with a probability of at most

(1 − φ(α))^l ≤ e^{−φ(α) l} = b^{−C₂ log(K−K₀)}. (1 + a ≤ e^a)

Therefore, except with a probability of at most l b^{−Cα} + b^{−C₂ log(K−K₀)}, (75) holds at least once and (74) holds for all iterations in the interval I₁. It immediately follows that

d_G(S, Ŝ_{K₀+l}) ≤ ν^l √(1 − p/2) (1 + p³nb/c)^{l−1} d_G(S, Ŝ_{K₀}) (see (74,75))
≤ ν^l exp( −p/4 + (l − 1) p³nb/c ) d_G(S, Ŝ_{K₀}),   (89)

except with a probability of at most l b^{−Cα} + b^{−C₂ log(K−K₀)}.
(90) In particular, suppose that

c ≥ 4 l p² nb,   (91)

so that the exponent in the last line of (89) is negative. Let i_max := ⌊(K − K₀)/l⌋ and note that

d_G(S, Ŝ_K) / d_G(S, Ŝ_{K₀}) = Π_{i=1}^{i_max} [ d_G(S, Ŝ_{K₀+il}) / d_G(S, Ŝ_{K₀+(i−1)l}) ] · [ d_G(S, Ŝ_K) / d_G(S, Ŝ_{K₀+i_max l}) ].   (92)

By applying the bound in (89) to all intervals {I_i}_i, we then conclude that

d_G(S, Ŝ_K) ≤ ν^{K−K₀} exp( −i_max (p/4 − l p³nb/c) ) (1 + p³nb/c)^{K−K₀−i_max l} d_G(S, Ŝ_{K₀}) (see (92,89,74))
≤ ν^{K−K₀} exp( −(K − K₀) p / (8l) ) d_G(S, Ŝ_{K₀}), (if K − K₀ ≥ l and (91) holds)   (93)

except with a probability of at most

((K − K₀)/l) ( l b^{−Cα} + b^{−C₂ log(K−K₀)} ) ≤ (K − K₀) ( b^{−Cα} + b^{−C₂ log(K−K₀)} ) (if K − K₀ ≥ l) ≤ (K − K₀) e^{−Cα} + b^{−CC₂ log(K−K₀)},   (94)

which follows from an application of the union bound to the failure probability in (90). With the choice of α = α′ log b log(K − K₀) with sufficiently large α′, the failure probability in (94) simplifies to

(K − K₀) e^{−Cα} + b^{−C log(K−K₀)} = (K − K₀) (K − K₀)^{−Cα′ log b} + b^{−C log(K−K₀)} ≤ b^{−Cα′ log(K−K₀)} + b^{−C log(K−K₀)} ≤ b^{−C log(K−K₀)}.   (95)

The next step involves elementary bookkeeping to upper-bound the last line of (93). Suppose that

α ≥ log( C η̃ / (p² nb) ) / log b,   (96)

and that (97) holds. Using (96) and (97) with appropriate constants replacing the ones above, we may verify that

b^{−Cα} ≤ C₁ p² nb / (4 η̃), (see (96))   (98)

exp( −C₁ p² nb / η̃ ) ≤ 1 − C₁ p² nb / (2 η̃), ((97) and η̃ ≥ 1)   (99)

φ(α) = 1 − exp(−C₁ p² nb / η̃) − b^{−Cα} (see (87)) ≥ C₁ p² nb / (2 η̃) − b^{−Cα} ((99) and e^{−a} ≤ 1 − a/2) ≥ C₁ p² nb / (4 η̃), (see (98))   (100)

l = C₂ log b log(K − K₀) / φ(α) (see (88)) ≤ 4 C₂ η̃ log b log(K − K₀) / (C₁ p² nb).
(see (100)) (101) Now with the choice of

c = (96 C₂ / C₁) η̃ log b log(K − K₀),   (102)

we may verify that

l p³nb / c ≤ p/8, (see (101,102))   (103)

and, revisiting (93), we find that

d_G(S, Ŝ_K) ≤ ν^{K−K₀} exp( −(K − K₀) p / (8l) ) d_G(S, Ŝ_{K₀}) (see (103))
≤ ν^{K−K₀} exp( −C p³nb (K − K₀) / (η̃ log b log(K − K₀)) ) d_G(S, Ŝ_{K₀}) (see (101))
≤ ν^{K−K₀} ( 1 − C p³nb / (η̃ log b log(K − K₀)) )^{K−K₀} d_G(S, Ŝ_{K₀}) ((97) and η̃ ≥ 1)
≤ ( 1 − C p³nb / (η̃ log b log(K − K₀)) )^{K−K₀} d_G(S, Ŝ_{K₀}),   (104)

where we set

ν = 1 + C p³nb / (η̃ log b log(K − K₀)),   (105)

for an appropriate choice of the constant C. To reiterate, conditioned on the event E⋆ in (85), (104) is valid provided that (84,86,96,97) hold and except with a probability of at most b^{−C log(K−K₀)}, see (95). In particular, to better interpret (86), we next replace the coherence η(Ŝ_k) therein with the simpler quantity η(S). We can do so thanks to Lemma 20, which roughly speaking states that a pair of subspaces A and B with a small principal angle have similar coherences, namely, θ₁(A, B) ≈ 0 ⟹ η(A) ≈ η(B). More concretely, Lemma 20 implies that

√(η(Ŝ_k)) ≤ √(η(S)) + d_G(S, Ŝ_k) √(n/r),   (106)

for every k. To bound the distance in the last line above, we observe that (104) holds also after replacing K with any k ∈ [K₀ + l : K], implying in particular that

d_G(S, Ŝ_k) ≤ d_G(S, Ŝ_{K₀}),   k ∈ [K₀ + l : K].   (107)

When k ∈ [K₀ + 1 : K₀ + l − 1], however, we cannot guarantee that the error decreases, and all we can say is that the error does not increase by much. That is, for every k ∈ [K₀ + 1 : K₀ + l − 1], we have that

d_G(S, Ŝ_k) ≤ ν^{k−K₀} (1 + p³nb/c)^{k−K₀} d_G(S, Ŝ_{K₀}) (see (74))
≤ ν^l (1 + p³nb/c)^l d_G(S, Ŝ_{K₀})
= ν^l ( 1 + C p³nb / (η̃ log b log(K − K₀)) )^l d_G(S, Ŝ_{K₀})
≤ ( 1 + C p³nb / (η̃ log b log(K − K₀)) )^{2l} d_G(S, Ŝ_{K₀}), (see (105))   (108)

with an appropriate choice of C in (105). We continue by writing that

d_G(S, Ŝ_k) ≤ ( 1 + C p³nb / (η̃ log b log(K − K₀)) )^{2l} d_G(S, Ŝ_{K₀}) (see (108))
≤ exp( 2C p³nb l / (η̃ log b log(K − K₀)) ) d_G(S, Ŝ_{K₀}) (1 + a ≤ e^a)
≤ e^{Cp} d_G(S, Ŝ_{K₀}) (see (101))
≤ C d_G(S, Ŝ_{K₀}).
(p ≤ 1)   (109)

Combining (107) and (109), we arrive at

d_G(S, Ŝ_k) ≲ d_G(S, Ŝ_{K₀}),   (110)

for every k ∈ [K₀ + 1 : K], provided that (84,86,96,97) hold and except with a probability of at most b^{−C log(K−K₀)}, see (95). Substituting the above bound into (106) yields that

√(η(Ŝ_k)) ≤ √(η(S)) + d_G(S, Ŝ_k) √(n/r) (see (106))
≤ √(η(S)) + C d_G(S, Ŝ_{K₀}) √(n/r) (see (110))
≤ √(η(S)) + C p^{7/2} n^{3/2} √b / (αc log b √(log n)) (see (84))
≤ √(η(S)) + C p^{7/2} n^{3/2} √b / (α η̃ log² b log n log(K − K₀)) (see (102))
≤ √(η(S)) + 1 ((97) and η̃ ≥ 1)
≲ √(η(S)). (η(S) ≥ 1)   (111)

Plugging the bound above back into (86) yields that

p ≳ α² log² b · log n · η(S) r / n.   (112)

To summarize, conditioned on the event E⋆, we have established that (104) is valid under (84,96,97,112) and except with a probability of at most b^{−C log(K−K₀)}. The event E⋆ = ∩_{k=K₀+1}^{K} E⋆_k itself holds except with a probability of at most Σ_{k=K₀+1}^{K} Pr[E⋆C_k] by the union bound. With an application of the law of total probability, (104) is therefore valid except with a probability of at most b^{−C log(K−K₀)} + Σ_{k=K₀+1}^{K} Pr[E⋆C_k]. This completes the proof of Theorem 10.

Acknowledgments

AE would like to thank Anand Vidyashankar and Chris Williams for separately pointing out the possibility of a statistical interpretation of SNIPE, as discussed at the end of Section 3. AE is also extremely grateful to Raphael Hauser for his helpful insights. Lastly, the authors would like to acknowledge and thank Dehui Yang for his involvement in the early phases of this project. For this project, AE was supported by the Alan Turing Institute under the EPSRC grant EP/N510129/1 and partially by the Turing Seed Funding grant SF019. GO and LB were supported by ARO Grant W911NF-14-1-0634. LB was also supported by DARPA grant 16-43-D3M-FP-037. MBW was partially supported by NSF grant CCF-1409258 and NSF CAREER grant CCF-1149225.

Appendix A. Toolbox

This section collects a number of standard results for the reader's convenience.
We begin with the following large-deviation bounds, which are used repeatedly in the rest of the appendices (Gross, 2011; Tropp, 2012). Throughout, C is a universal constant whose value might change at every appearance.

Lemma 17 [Hoeffding inequality] Let {z_i}_i be a finite sequence of zero-mean independent random variables, and assume that almost surely every z_i belongs to a compact interval of length l_i on the real line. Then, for positive α and except with a probability of at most e^{−Cα²/Σ_i l_i²}, it holds that |Σ_i z_i| ≤ α.

Lemma 18 [Matrix Bernstein inequality for spectral norm] Let {Z_i}_i ⊂ R^{n×b} be a finite sequence of zero-mean independent random matrices, and set

β := max_i ‖Z_i‖,   σ² := max( ‖Σ_i E[Z_i^⊤ Z_i]‖, ‖Σ_i E[Z_i Z_i^⊤]‖ ).

Then, for α ≥ 1 and except with a probability of at most e^{−Cα}, it holds that

‖Σ_i Z_i‖ ≤ α · max( log(n ∨ b) · β, √(log(n ∨ b)) · σ ).

For two r-dimensional subspaces A and B with principal angles θ₁(A, B) ≥ θ₂(A, B) ≥ ⋯ ≥ θ_r(A, B), recall the following useful identities about the principal angles between them:

sin(θ₁(A, B)) = ‖P_{A⊥} P_B‖ = ‖P_A − P_B‖,   (113)

Σ_{i=1}^{r} sin²(θ_i(A, B)) = ‖P_{A⊥} P_B‖²_F = (1/2) ‖P_A − P_B‖²_F.   (114)

Note also the following perturbation bound, which is slightly stronger than the standard ones but proved similarly nonetheless (Wedin, 1972).

Lemma 19 [Perturbation bound] Fix a rank-r matrix A and let A = span(A). For a matrix B, let B_r be a rank-r truncation of B obtained via SVD and set B_r = span(B_r). Then it holds that

‖P_A − P_{B_r}‖ = ‖P_{A⊥} P_{B_r}‖ ≤ ‖P_{A⊥} B‖ / σ_r(B) ≤ ‖B − A‖ / (σ_r(A) − ‖B − A‖),

(1/√2) ‖P_A − P_{B_r}‖_F = ‖P_{A⊥} P_{B_r}‖_F ≤ ‖P_{A⊥} B‖_F / σ_r(B) ≤ ‖B − A‖_F / (σ_r(A) − ‖B − A‖),

where σ_r(A) is the rth largest singular value of A.

Proof Let B_{r+} := B − B_r denote the residual and note that

P_{A⊥} P_{B_r} = P_{A⊥} B_r B_r^† (B_r = span(B_r))
= P_{A⊥} (B − B_{r+}) B_r^† (B = B_r + B_{r+})
= P_{A⊥} B B_r^†, (span(B_{r+}^⊤) ⊥ span(B_r^⊤))

so that ‖P_{A⊥} P_{B_r}‖ ≤ ‖P_{A⊥} B‖ ‖B_r^†‖ = ‖P_{A⊥} B‖ / σ_r(B). Moreover, P_{A⊥} A = 0 implies that ‖P_{A⊥} B‖ = ‖P_{A⊥}(B − A)‖ ≤ ‖B − A‖, and σ_r(B) ≥ σ_r(A) − ‖B − A‖ by Weyl's inequality. The proof is identical for the claim with the Frobenius norm and is therefore omitted. Lastly, let us record what happens to the coherence of a subspace under a small perturbation, see (14).
Lemma 20 [Coherence under perturbation] Let A, B be two r-dimensional subspaces in R^n, and let d_G(A, B) denote their distance, see (10). Then their coherences are related as

√(η(B)) ≤ √(η(A)) + d_G(A, B) √(n/r).

Proof Let θ_i = θ_i(A, B) be shorthand for the ith principal angle between the subspaces A and B. It is well known (Golub and Van Loan, 2013) that there exist orthonormal bases A, B ∈ R^{n×r} for the subspaces A and B, respectively, such that

A^⊤ B = diag[cos θ₁, cos θ₂, ⋯, cos θ_r] =: Γ ∈ R^{r×r},   (115)

where diag(a) is the diagonal matrix formed from the vector a. There also exists A′ ∈ R^{n×r} with orthonormal columns such that

(A′)^⊤ B = diag[sin θ₁, sin θ₂, ⋯, sin θ_r] =: Σ ∈ R^{r×r},   A^⊤ A′ = 0,   (116)

and, moreover,

span([A B]) = span([A A′]).   (117)

With A′ = span(A′), it follows that

B = P_A B + P_{A′} B = A A^⊤ B + A′ (A′)^⊤ B = A Γ + A′ Σ. (see (115) and (116))   (118)

Consequently,

√(η(B)) = √(n/r) max_i ‖B[i, :]‖₂ (see (14))
≤ √(n/r) max_i ‖A[i, :] Γ‖₂ + √(n/r) max_i ‖A′[i, :] Σ‖₂ ((118) and the triangle inequality)
≤ √(n/r) max_i ‖A[i, :]‖₂ ‖Γ‖ + √(n/r) max_i ‖A′[i, :]‖₂ ‖Σ‖
≤ √(η(A)) ‖Γ‖ + √(η(A′)) ‖Σ‖ (see (14))
≤ √(η(A)) + √(n/r) sin θ₁ (‖Γ‖ ≤ 1 and η(A′) ≤ n/r, see (115,116))
≤ √(η(A)) + d_G(A, B) √(n/r), (see (10,113))   (119)

which completes the proof of Lemma 20.

Appendix B. Supplement to Section 3

In this section, we verify that

R_k = argmin_{X_k} ‖P_{Ŝ⊥_{k−1}} X_k‖²_F + λ ‖P_{Ω^C_k}(X_k)‖²_F   subject to   P_{Ω_k}(X_k) = Y_k,   (120)

when k ≥ 2. The optimization above is over X_k ∈ R^{n×b}. First note that Program (120) is separable and equivalent to the following b programs:

R_k[:, j] = argmin_x ‖P_{Ŝ⊥_{k−1}} x‖²₂ + λ ‖P_{ω^C_t} x‖²₂   subject to   P_{ω_t} x = y_t,   t = (k − 1)b + j,   j ∈ [1 : b].   (121)

Above, R_k[:, j] ∈ R^n is the jth column of R_k in MATLAB's matrix notation, and the optimization is over x ∈ R^n. To solve the jth program in (121), we make the change of variables x = y_t + W_{ω^C_t} z. Here, z ∈ R^{n−m} and W_{ω^C_t} ∈ {0, 1}^{n×(n−m)} is defined naturally so that P_{ω^C_t} = W_{ω^C_t} W_{ω^C_t}^⊤.
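Before carrying out the algebra, the endpoint of this appendix can be checked numerically: solving the unconstrained program in z directly and assembling x = y_t + W_{ω^C_t} z agrees with the closed form (125) derived below. The following sketch does exactly that; the sizes n, r and the regularization weight λ here are hypothetical, and NumPy is used purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
n, r, lam = 12, 3, 0.1

S_hat, _ = np.linalg.qr(rng.standard_normal((n, r)))   # basis for the current estimate
omega = np.sort(rng.choice(n, size=7, replace=False))  # observed indices (omega_t)
omega_c = np.setdiff1d(np.arange(n), omega)            # missing indices (omega_t^C)

y = np.zeros(n)
y[omega] = rng.standard_normal(omega.size)             # incomplete data vector y_t

# Direct solution of the per-column program: minimize over the free entries z
#   || P_{S_hat^perp} (y + W z) ||_2^2 + lam * ||z||_2^2,
# whose normal equations are (W^T P_perp W + lam I) z = -W^T P_perp y.
P_perp = np.eye(n) - S_hat @ S_hat.T
W = np.eye(n)[:, omega_c]                              # W_{omega^C}
z = np.linalg.solve(W.T @ P_perp @ W + lam * np.eye(omega_c.size),
                    -W.T @ P_perp @ y)
x_direct = y + W @ z

# Closed form (125): x = y + P_{omega^C} S_hat (lam I_r + S_hat^T P_omega S_hat)^{-1} S_hat^T y.
P_omega = np.zeros((n, n))
P_omega[omega, omega] = 1.0
x_closed = y + (np.eye(n) - P_omega) @ S_hat @ np.linalg.solve(
    lam * np.eye(r) + S_hat.T @ P_omega @ S_hat, S_hat.T @ y)

assert np.allclose(x_direct, x_closed)
```

Here `omega` plays the role of ω_t; the two expressions agree for any λ > 0, since both solve the same strictly convex program.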
We now rewrite (121) as the following b unconstrained programs: zj := arg min P b S k 1yt + P b S k 1WωC t z 2 2 + λ PωC t WωC t z 2 F = arg min (b S k 1) yt + (b S k 1) WωC t z 2 2 + λ z 2 F , t = (k 1)b + j, j [1 : b]. (122) Above, b S k 1 Rn (n r) is an orthonormal basis for the subspace b S k 1 G(n, n r) and in particular P b S k 1 = b S k 1(b S k 1) . The optimization above is over z Rn m. Note that zj = W ωC t P b S k 1WωC t + λIn m 1 W ωC t P b S k 1yt, j [1 : b], (123) Eftekhari, Ongie, Balzano, and Wakin are solutions of the least squares programs in (122) when m is large enough. For fixed j, we simplify the expression for zj as follows: zj = λIn m + W ωC t P b S k 1WωC t 1 W ωC t P b S k 1yt = (1 + λ)In m W ωC t P b Sk 1WωC t 1 W ωC t P b S k 1yt P b S k 1 = In P b Sk 1 In m + W ωC t b Sk 1 (1 + λ)Ir b S k 1PωC t b Sk 1 1 b S k 1WωC t W ωC t P b S k 1yt (inversion lemma) In m + W ωC t b Sk 1 (1 + λ)Ir b S k 1PωC t b Sk 1 1 b S k 1WωC t W ωC t P b S k 1Pωtyt (yt = Pωtyt) In m + W ωC t b Sk 1 (1 + λ)Ir b S k 1PωC t b Sk 1 1 b S k 1WωC t W ωC t P b Sk 1yt WωC t Pωt = 0 = WωC t P b Sk 1yt 1 + λ + W ωC t b Sk 1 (1 + λ)Ir b S k 1PωC t b Sk 1 1 b S k 1PωC t P b Sk 1yt PωC t = WωC t W ωC t = WωC t P b Sk 1yt 1 + λ + W ωC t b Sk 1 (1 + λ)Ir b S k 1PωC t b Sk 1 1 b S k 1PωC t b Sk 1 (1 + λ)Ir b S k 1yt + W ωC t b Sk 1 (1 + λ)Ir b S k 1PωC t b Sk 1 1 b S k 1yt = W ωC t b Sk 1 (1 + λ)Ir b S k 1PωC t b Sk 1 1 b S k 1yt = W ωC t b Sk 1 λIr + b S k 1Pωt b Sk 1 1 b S k 1yt, (124) which means that yt + WωC t zj = yt + PωC t b Sk 1 λIr + b S k 1Pωt b Sk 1 1 b S k 1yt, (125) is a solution of the jth program in (121) which indeed matches the jth column of Rk defined in (1). Appendix C. Proof of Proposition 24 Let us form the blocks Q1 Rb1 r, S1 Rn b1, and Ω1 [1 : n] [1 : b1] as usual, see Section 3. As in that section, we also write the measurement process as Y1 = PΩ1(S1), where PΩ1( ) projects onto the index set Ω1. Let us fix Q1 for now. 
Also let Y1,r Rn b1 be a rank-r truncation of Y1 obtained via SVD. SNIPE then sets b S1 = span(Y1,r). Our objective here is to control PS P b S1 F . Since PS P b S1 F r PS P b S1 , b S1 G(n, r) (126) it suffices to bound the spectral norm. Conditioned on Q1, it is easy to verify that E[Y1] = E[PΩ1(S1)] = p S1, suggesting that we might consider Y1 as a perturbed copy of p S1 and perhaps consider b S1 = span(Y1,r) as a perturbation of S = span(p S1). Indeed, the perturbation bound in Lemma 19 dictates that PS P b S1 Y1 p S1 p σr (S1) Y1 p S1 = Y1 p S1 p σr (Q1) Y1 p S1 (S1 = SQ1, S S = Ir) σr (Q1) . if Y1 p SQ1 p 2 σr(Q1) (127) It remains to bound the norm in the last line above. To that end, we study the concentration of Y1 about its expectation by writing that Y1 p S1 = PΩ1 (S1) p S1 = X i,j (ϵi,j p) S1[i, j] Ei,j =: X i,j Zi,j, (128) where {ϵi,j}i,j ind. Bernoulli(p) and Ei,j Rn b1 is the [i, j]th canonical matrix. Additionally, {Zi,j}i,j are independent zero-mean random matrices. In order to appeal to the matrix Bernstein inequality (Lemma 18), we compute the β and σ parameters below, starting with β: Zi,j = (ϵi,j p) S1[i, j] Ei,j = |(ϵi,j p) S1[i, j]| ( Ei,j = 1) |S1[i, j]| (ϵi,j {0, 1}) S1 S 2 Q1 2 (S1 = SQ1, AB A 2 B 2 ) b1 Q1 (see (14)) =: β. (129) Above, A and A 2 return the largest entry of A in magnitude and the largest ℓ2 norm of the rows of matrix A, respectively. As for σ, we write that E i,j Zi,j Z i,j i,j E h (ϵi,j p)2i S1[i, j]2 Ei,i i,j p(1 p)S1[i, j]2 Ei,i (ϵi,j Bernoulli(p)) i,j S1[i, j]2 Ei,j i S1[i, :] 2 2 Ei,i = p max i S1[i, :] 2 2 = p S1 2 2 p S 2 2 Q1 2 (S1 = SQ1, AB 2 A 2 B ) = p η (S) r n Q1 2. (see (14)) (130) In a similar fashion, we find that E i,j Z i,j Zi,j j S1[:, j] 2 2 Ej,j = p S 1 2 2 p S 2 Q1 2 2 (S1 = SQ1, AB 2 A 2 B ) b1 Q1 2, ( S = 1, see (14)) (131) and eventually i,j Z i,j Zi,j i,j Zi,j Z i,j (η (S) η (Q1)) Q1 2.
(see (130) and (131)) (132) max log(n b1) β, p log(n b1) σ max log(n b1) r log(n b1) rpr η (S) η (Q1) Q1 (see (129) and (132)) log(n b1) rpr η (S) η (Q1) Q1 . if p log (n b1) r The Bernstein inequality now dictates that (see (128)) Streaming PCA From Incomplete Data α max log(n b1) β, p log(n b1 σ (see Lemma 18) log(n b1) rrp η (S) η (Q1) Q1 , (see (133)) except with a probability of at most e α. In particular, suppose that p α2ν(Q1)2 1 n (η (S) η (Q1)) r log(n b1) n , ν(Q1) = Q1 so that (127) holds. Then, by substituting (134) back into (127) and then applying (126), we find that PS P b S1 F r PS P b S1 (see (126)) σr(Q1) (see (127)) log(n b1) r (η (S) η (Q1)) ν (Q1) (see (134)) =: δ1 (ν(Q1), η(Q1)) , (136) except with a probability of at most e α and for fixed Q1. In order to remove the conditioning on Q1, fix ν 1, 1 η1 b1 r , and recall the following inequality for events A and B: Pr [A] = Pr [A|B] Pr [B] + Pr A|BC Pr BC Pr [A|B] + Pr BC . (137) Set Q1 = span(Q 1) and let E be the event where both ν(Q1) ν and η(Q1) η1. Thanks to the inequality above, we find that " PS P b S1 F r δ1 (ν, η1) " PS P b S1 F r δ (ν, η1) | E + Pr EC (see (137)) e α + Pr [ν(Q1) > ν] + Pr [η(Q1) > η1] , (see (136)) (138) which completes the proof of Proposition 24. Appendix D. Proof of Lemma 13 We conveniently define the orthonormal basis B = U U Rn n. Then the perturbation from U can be written more compactly as U + U = B Ir Eftekhari, Ongie, Balzano, and Wakin In particular, orthogonal projection onto span(U + U ) is (U + U )(U + U ) = B Ir Ir + 1 Ir O B + o( F ), (139) where 0n r R(n r) (n r) is the matrix of zeros of size n r. From (30), note also that fΩ(X, U) = PU , XX + λ PΩC(X) 2 F = D In UU , XX E + λ PΩC(X) 2 F . 
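The objective f_Ω(X, U) = ⟨I_n − UU^⊤, XX^⊤⟩ + λ‖P_{Ω^C}(X)‖²_F just defined is simple enough to differentiate numerically; the following sketch confirms by finite differences that its partial derivative with respect to X is 2 P_{U⊥} X + 2λ P_{Ω^C}(X), the expression obtained below in (142). All sizes are hypothetical and NumPy is used purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(5)
n, b, r, lam = 8, 6, 2, 0.3

U, _ = np.linalg.qr(rng.standard_normal((n, r)))  # orthonormal U
mask = rng.random((n, b)) < 0.5                   # observed index set Omega
X = rng.standard_normal((n, b))

P_perp = np.eye(n) - U @ U.T                      # projection onto span(U)^perp

def f(X):
    # f_Omega(X, U) = <I_n - U U^T, X X^T> + lam * ||P_{Omega^C}(X)||_F^2
    return np.trace(P_perp @ X @ X.T) + lam * np.sum(X[~mask] ** 2)

# Closed-form partial derivative, as in (142): 2 P_{U^perp} X + 2 lam P_{Omega^C}(X).
G = 2 * P_perp @ X + 2 * lam * np.where(mask, 0.0, X)

# Central finite differences over every entry of X agree with the closed form
# (exactly, up to roundoff, since f is quadratic in X).
eps = 1e-6
G_fd = np.zeros_like(X)
for i in range(n):
    for j in range(b):
        E = np.zeros_like(X)
        E[i, j] = eps
        G_fd[i, j] = (f(X + E) - f(X - E)) / (2 * eps)

assert np.allclose(G, G_fd, atol=1e-5)
```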
We can now write that fΩ X + , U + U B , (X + ) (X + ) + 2λPΩC(X) + o( F ) + o( F ) (see (139)) B , (X + ) (X + ) + 2λPΩC(X) + o( F ) + o( F ) , B (XX + X + X ) B + 2λPΩC(X) + o( F ) + o( F ) = 0r 0r (n r) 0(n r) r In r + 0r 0r (n r) 0(n r) r In r , B X B + B X B , B XX B + 2λPΩC(X) + o( F ) + o( F ). (140) We can further simplify the above expansion as fΩ X + , U + U = fΩ(X, U) + 2 , PU X 2 D , (U ) XX U E + 2λPΩC(X) + o( F ) + o( F ). Therefore the partial derivatives of f are XfΩ(X, U) = 2PU X + 2λPΩC(X) Rn T , UfΩ(X, U) = 2(U ) XX U R(n r) r, (142) which completes the proof of Lemma 13.

Appendix E. Proof of Lemma 14

By definition in (39), {Rϵ }ϵ ϵ is a bounded set, see also Program (30) for the definition of fbΩ. Therefore there exist a subsequence {Rϵi}i and a matrix R Rn b such that lim i ϵi = 0, (143) lim i Rϵi R F = 0. (144) That is, {Rϵi}i converges to R. On the other hand, for every δ 0, (49) implies that there exists an integer lδ that depends on δ such that Rkl,ϵ Rϵ F 2ϵl + δ, l lδ. (145) Restricted to the sequence {ϵi}i above, (145) reads as Rkl,ϵi Rϵi F 2ϵil + δ, l lδ, (146) which, in the limit of i , yields that lim i Rkl,ϵi Rϵi F lim i 2ϵi + δ = δ, l lδ. (147) We used (143) to obtain the identity above. An immediate consequence of (147) is that lim l lim i Rkl,ϵi Rϵi F = 0. (148) Invoking (144), it then follows that lim l lim i Rkl,ϵi R F = lim l lim i Rkl,ϵi Rϵi F (see (144)) = 0. (see (148)) (149) Exchanging the order of limits above yields that lim i lim l Rkl,ϵi R F lim i lim l Rkl,ϵi Rϵi F + lim i Rϵi R F (triangle inequality) = lim i lim l Rkl,ϵi Rϵi F (see (144)) = 0. (see (50)) (150) Therefore, (149) and (150) together state that Rkl,ϵi converges to R as l, i , namely, lim l,i Rkl,ϵi R F = 0, (151) thereby proving the first claim in Lemma 14, see (51). In order to prove the claim about subspaces in Lemma 14, we proceed as follows. Recall the output of SNIPE, namely, {(Rk, b Sk)}k constructed in Algorithm 1.
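For intuition about this output, here is a minimal numerical sketch of one iteration of the kind SNIPE performs (cf. (164) and (165) in Appendix G): each incomplete column is completed using the least-squares coefficients of its observed entries against the current estimate, and the estimate is then updated to the span of the rank-r truncation of the completed block. All sizes below are hypothetical and NumPy is used purely for illustration; this is not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(6)
n, b, r, p = 2000, 100, 3, 0.7

# Planted subspace S and one incomplete block Y_k = P_Omega_k(S_k), S_k = S Q_k^T.
S_basis, _ = np.linalg.qr(rng.standard_normal((n, r)))
Sk = S_basis @ rng.standard_normal((r, b))
mask = rng.random((n, b)) < p                      # Omega_k: observed entries
Yk = np.where(mask, Sk, 0.0)

# Current estimate S_hat_{k-1}: a slightly perturbed copy of the true subspace.
S_hat, _ = np.linalg.qr(S_basis + 0.004 * rng.standard_normal((n, r)))

# One iteration in the spirit of (164)-(165): fill the missing entries of each
# column using the interpolant S_hat (P_omega S_hat)^+ y, then update the
# estimate to the span of the rank-r truncation of the completed block R_k.
Rk = Yk.copy()
for j in range(b):
    obs = mask[:, j]
    coef = np.linalg.pinv(S_hat[obs]) @ Yk[obs, j]  # (P_omega_t S_hat)^+ y_t
    Rk[~obs, j] = S_hat[~obs] @ coef                # fill the Omega^C entries
U, _, _ = np.linalg.svd(Rk, full_matrices=False)
S_new = U[:, :r]                                    # S_hat_k = span(R_{k,r})

def dist(A, B):
    # ||P_A - P_B||_F / sqrt(2) = sqrt(r - ||A^T B||_F^2) for orthonormal A, B
    return np.sqrt(max(A.shape[1] - np.linalg.norm(A.T @ B, 'fro') ** 2, 0.0))

print(dist(S_basis, S_hat), '->', dist(S_basis, S_new))
```

On typical draws in this well-conditioned regime the printed error decreases after the update, in line with the local contraction quantified by Lemma 16.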
In light of Section 3, Rk is also the unique minimizer of Program Eftekhari, Ongie, Balzano, and Wakin (9). Recall that κl = [kl l + 1 : kl] from (33). Recall also the construction of sequence {(Rk,ϵ, b Sk,ϵ)}k κl in the beginning of Section 7.1 and note that both procedures are initialized identically at the beginning of interval κl, namely, b Skl l,ϵ = b Skl 1. Therefore, for fixed l, observe that4 lim ϵ 0 Rk,ϵ Rk F = 0, k κl, (152) which, when restricted to {ϵi}i, reads as lim i Rk,ϵi Rk F = 0, k κl. (153) By design, every Rk has a spectral gap in the sense that there exists τ > 0 such that σr(Rk) σr+1(Rk) 1 + τ, (154) for every k. Recall that b Sk is the span of top r left singular vectors of Rk. An immediate consequence of (154) is that b Sk is uniquely defined, namely, there are no ties in the spectrum of Rk. By (153), there are no ties in the spectrum of Rk,ϵi as well for sufficiently large i, namely, lim i σr(Rk,ϵi) σr+1(Rk,ϵi) 1 + τ, k κl. (155) By sending l to infinity above, we find that 1 + τ lim l lim i σr(Rkl,ϵi) σr+1(Rkl,ϵi) (see (155)) = lim l,i σr(Rkl,ϵi) σr+1(Rkl,ϵi) (see (151)) = σr(R) σr+1(R). (see (151)) (156) Recall that b Sk,ϵi is by definition the span of leading r left singular vectors of Rk,ϵi. Likewise, let b S be the span of leading r left singular vectors of R. An immediate consequence of the second line of (156) is that b Skl,ϵi is uniquely defined, namely, no ties in the spectrum of Rkl,ϵi in the limit of l, i . The third line of (156) similarly implies that b S is uniquely defined. Given the uniqueness of these subspaces, another implication of (151) is that lim l,i d G( b Skl,ϵi, b S) = 0, (157) where d G is the metric on Grassmannian defined in (10). Lastly we show that SNIPE in the limit produces copies of (R, b S) on the interval κl,l . This is done by simply noting that lim l Rkl R F = lim l lim i Rkl,ϵi R F (see (153)) 4. 
To verify (152), note that for every feasible Xϵ in Program (38), X = Y + PbΩC(Xϵ) is feasible for Program (37). Moreover, as ϵ 0, Xϵ X F 0 and consequently, by continuity, |fbΩ(Xϵ, U) fbΩ(X, U)| 0. On the other hand, by definition, Rk is the unique minimizer of Program (37), namely, fbΩ(Rk, U) < fbΩ(X, U) for any X = Rk feasible for Program (37). For sufficiently small ϵ, it follows that fbΩ(Rk, U) < fbΩ(Xϵ, U) for any Xϵ that is feasible for Program (38) and lim infϵ 0 Xϵ Rk > 0. That is, the unique minimizer of Program (38) approaches Rk in the limit, namely, limϵ 0 Rk,ϵ Rk F = 0. Streaming PCA From Incomplete Data = lim l,i Rkl,ϵi R F (see (151)) = 0, (see (151)) (158) lim l d G( b Skl, b S) = lim l lim i d G( b Skl,ϵi, b S) (see (153)) = lim l,i lim i d G( b Skl,ϵi, b S) (see (157)) = 0, (see (157)) (159) which completes the proof of Lemma 14. Appendix F. Proof of Lemma 15 By restricting (41) to {ϵi}i, we find that 0 = UfbΩ(Rkl,ϵi, b Sk,ϵi) (see (41)) = UfΩkl(Rkl,ϵi, b Sk,ϵi), (see (34)) (160) for every l, i. By the joint continuity of UfΩkl in Lemma 13, it follows that 0 = lim l,i UfΩkl(Rkl,ϵi, b Skl,ϵi) F UfΩkl(R, b S) F , (see Lemma 14) (161) which establishes (56). To establish (57), we restrict (44) to {ϵi}i and then send i, l to infinity to find that 0 = lim i lim l PbΩ(Rkl,ϵi) b Y F (see (44)) = lim i lim l PΩk(Rkl,ϵi) Ykl F (see (34,36)) = lim l PΩkl(R) Ykl F . (see Lemma 14) (162) To establish (58), we restrict (45) to {ϵi}i and then send i to infinity to find that 0 = lim i lim l XfbΩ(Rkl,ϵi, b Skl,ϵi) + λkl,ϵi(PbΩ(Rkl,ϵi) b Y ) 2 = lim i lim l PbΩC XfbΩ(Rkl,ϵi, b Skl,ϵi) 2 + lim i lim l PbΩ XfbΩ(Rkl,ϵ, b Skl,ϵi) + λk,ϵi(Rkl,ϵi b Y ) 2 lim i lim l PbΩC XfbΩ(Rkl,ϵi, b Skl,ϵi) 2 XfΩkl(R, b S) 2 F . (see (34) and Lemma 14) (163) This completes the proof of Lemma 15. Eftekhari, Ongie, Balzano, and Wakin Appendix G. 
Proof of Lemma 16 Recall that by construction in Section 3, the rows of the coefficient matrix Qk Rb r are independent copies of the random vector q Rr. Setting Sk = S Q k Rn b, we observe in iteration k each entry of the data block Sk independently with a probability of p, collect the observed entries in Yk Rn b, supported on the index set Ωk [1 : n] [1 : b]. We write this as Yk = PΩk(Sk), where the linear operator PΩk retains the entries on the index set Ωk and sets the rest to zero. Recall also that Qk is obtained by concatenating the coefficient vectors {qt}kb t=(k 1)b+1 Rr. To unburden the notation, we enumerate these vectors as {qj}b j=1. Likewise, we use the indexing {sj, yj, ωj}b j=1 for the data vectors, incomplete data vectors, and their supports, respectively. Given the new incomplete block Yk at iteration k, we update our estimate of the true subspace S from the old b Sk 1 as follows. We calculate the random matrix Rn b Rk := Yk + PΩC k (O (Yk)) , (164) where the linear operator PΩC k projects onto the complement of index set Ωk, and O(Yk) = h b Sk 1(Pωj b Sk 1) yj i Rn b. (165) Above, b Sk 1 Rn r is an orthonormal basis for the r-dimensional subspace b Sk 1. If σr(Rk) < σmin, reject this iteration, see Algorithm 1. Otherwise, let Rk,r denote a rank-r truncation of Rk obtained via SVD. Then our updated estimate is b Sk = span(Rk,r). We condition on the subspace b Sk 1 and the coefficient matrix Qk for now. To control the estimation error d G(S, b Sk), our strategy is to treat Rk as a perturbed copy of Sk = SQ k and in turn treat b Sk = span(Rk,r) as a perturbed copy of S = span(Sk). Indeed, an application of the perturbation bound in Lemma 19 yields that PS P b Sk F PS Rk F σr (Rk) . (166) To control the numerator above, we begin with some preparation. First, recalling the definition of O(Yk) from (165), we observe that O (Yk) = O (PΩk (Sk)) (Yk = PΩk (Sk)) = O(PΩk(P b Sk 1Sk)) + O(PΩk(P b S k 1Sk)) (linearity of O) = P b Sk 1Sk + O(PΩk(P b S k 1Sk)). 
(see (165)) (167) The above decomposition allows us to rewrite R in (164) as Rk = Yk + PΩC k (O (Yk)) (see (164)) = PΩk(Sk) + PΩC k (O (Yk)) (Yk = PΩk(Sk)) = PΩk (Sk) + PΩC k (P b Sk 1Sk) + [PΩC k O PΩk](P b S k 1Sk) (see (167), f g(x) := f(g(x))) Streaming PCA From Incomplete Data = Sk PΩC k (Sk) + PΩC k (P b Sk 1Sk) + [PΩC k O PΩk](P b S k 1Sk) = Sk PΩC k (P b S k 1Sk) + [PΩC k O PΩk](P b S k 1Sk) = Sk PΩC k (P b S k 1Sk) [PΩk O PΩk] (P b S k 1Sk) + [O PΩk] (P b S k 1Sk). (168) Since PS Sk = PS SQ k = 0, it immediately follows that PS Rk = PS PΩC k (P b S k 1Sk) PS [PΩk O PΩk](P b S k 1Sk)+PS [O PΩk] (P b S k 1Sk). In particular, with an application of the triangle inequality above, we find that PΩC k (P b S k 1Sk) + [PΩk O PΩk](P b S k 1Sk) F + PS [O PΩk] (P b S k 1Sk) F . (169) We proceed by controlling each norm on the right-hand side above using the next two technical lemmas, proved in Appendices H and I, respectively. Lemma 21 It holds that E PΩC k (P b S k 1Sk) + [PΩk O PΩk] (P b S k 1Sk) F | b Sk 1, Qk 1 p PS P b Sk 1 F Qk . (170) For fixed b Sk 1 and Qk, we also have PΩC k (P b S k 1Sk) + [PΩk O PΩk] (P b S k 1Sk) F PS P b Sk 1 F Qk . (171) and the stronger bound PΩC k (P b S k 1Sk) + [PΩk O PΩk] (P b S k 1Sk) F p 1 p/2 PS P b Sk 1 F Qk , (172) except with a probability of at most eηk = eη(P b S k 1Sk) = nb P b S k 1Sk 2 P b S k 1Sk 2 F . (174) Lemma 22 For fixed b Sk 1, Qk and α 1, it holds that PS [O PΩk] (P b S k 1Sk) F α log b p PS P b Sk 1 PS P b Sk 1 F Qk , (175) except with a probability of at most b Cα and provided that p α2 log2 b log n η( b Sk 1)r/n. Eftekhari, Ongie, Balzano, and Wakin We next use Lemmas 21 and 22 to derive two bounds for the numerator of (166), the weaker bound holds with high probability but the stronger bound holds with only some probability. 
More specifically, substituting (171) and (175) into (169) yields that PΩC k (P b S k 1Sk) + [PΩk O PΩk] (P b S k 1Sk) F + PS [O PΩk] (P b S k 1Sk) F (see (169)) 1 + Cα log b p PS P b Sk 1 PS P b Sk 1 F Qk , (176) except with a probability of at most b Cα and provided that p α2 log2 b log n η( b Sk 1)r/n. For positive c to be set later, let us further assume that PS P b Sk 1 PS P b Sk 1 F p 7 2 nb cα log b log n. (177) With an appropriate constant replacing above, (176) simplifies to PS Rk F 1 + p3nb PS P b Sk 1 F Qk . (178) A stronger bound is obtained by substituting (172, 175) into (169), namely, PΩC k (P b S k 1Sk) + [PΩk O PΩk](P b S k 1Sk) F + PS [O PΩk](P b S k 1Sk) F (see (169)) PS P b Sk 1 F Qk (see (172,175,177)) PS P b Sk 1 F Qk , (179) provided that p α2 log2 b log n η( b Sk 1)r/n and except with a probability of at most Note that (178) and (179) offer two alternative bounds for the numerator in the last line (166), which we will next use to complete the proof of Lemma 16. Fix the subspace b Sk 1 for now. Let Ek 1 be the event where p α2 log2 b log n η( b Sk 1)r/n and (177) holds. For ν 1 to be set later, let E k be the event where Qk ν σmin. (180) Conditioned on the event Ek 1 E k, we write that PS P b Sk F PS Rk F σr (Rk) (see (166)) Streaming PCA From Incomplete Data PS P b Sk 1 F , (see (178)) (181) except with a probability of at most b Cα. A stronger bound is obtained from (179), namely, PS P b Sk F PS Rk F σr (Rk) (see (166)) PS P b Sk 1 F , (see (179)) (182) conditioned on the event Ek 1 E k and except with a probability of at most + b Cα. (183) This completes the proof of the probabilistic claims in Lemma 16, namely, (74) and (75). To complete the proof of Lemma 16, we next derive a bound for the conditional expectation of PS P b Sk F . Let E k be the event where p α2 log2 b log n η( b Sk 1)r/n and PS [O PΩ] P b S k 1Sk F p3nb c PS P b Sk 1 F Qk . (184) In light of Lemma 22, we have that Pr[E k| b Sk 1, Ek, E k] 1 b Cα. 
(185) Using the law of total expectation, we now write that E h PS P b Sk F | b Sk 1, Ek 1, E k i = E h PS P b Sk F | b Sk 1, Ek 1, E k, E k i Pr[E k| b Sk 1, Ek 1, E k] + E h PS P b Sk F | b Sk 1, Ek 1, E k, E C k i Pr[E C k | b Sk 1, Ek 1, E k] E h PS P b Sk F | b Sk 1, Ek 1, E k, E k i + r Pr[E C k | b Sk 1, Ek 1, E k] b Sk G(n, r) E h PS P b Sk F | b Sk 1, Ek 1, E k, E k i + rb Cα (see (185)) σr(Rk) | b Sk 1, Ek 1, E k, E k + b Cα ((166) and b r) σ 1 min E h PS Rk F | b Sk 1, Ek 1, E k, E k i + b Cα. (186) We next bound the remaining expectation above by writing that E h PS Rk F | b Sk 1, Ek 1, E k, E k i E PΩC k (P b S k 1Sk) + [PΩk O PΩk] (P b S k 1Sk) F Eftekhari, Ongie, Balzano, and Wakin + PS [O PΩ] (P b S k 1Sk) F | b Sk 1, Ek 1, E k, E k (see (169)) E PΩC k (P b S k 1Sk) + [PΩk O PΩk] (P b S k 1Sk) F c PS P b Sk 1 F Qk | b Sk 1, Ek 1, E k, E k (see (184)) = E E PΩC k (P b S k 1Sk) + [PΩk O PΩk] (P b S k 1Sk) F | b Sk 1, Qk c PS P b Sk 1 F Qk | b Sk 1, Ek 1, E k, E k 1 p PS P b Sk 1 F Qk + p3nb c PS P b Sk 1 F Qk | b Sk 1, Ek 1, E k, E k (see (170)) PS P b Sk 1 F | b Sk 1, Ek 1, E k, E k (see (180)) PS P b Sk 1 F PS P b Sk 1 F . (187) Plugging the bound above back into (186) yields that E h PS P b Sk F | b Sk 1, Ek 1, E k i σ 1 min E h PS Rk F | b Sk 1, Ek 1, E k, E k i + b Cα (see (186)) PS P b Sk 1 F + b Cα, (188) which proves (73) and completes the proof of Lemma 16. Appendix H. Proof of Lemma 21 Throughout, b Sk 1, Qk is fixed. Recalling the definition of operator O( ) from (165), we write that [PΩk O PΩk] (P b S k 1Sk) = h Pωj b Sk 1 Pωj b Sk 1 Pωj P b S k 1Sqj i (see (165)) = h Pωj b Sk 1 Pωj b Sk 1 P b S k 1Sqj i = h P b Sk 1,j P b S k 1Sqj i . b Sk 1,j := span(Pωj b Sk 1) (189) Let also b SC k 1,j := span(PωC j b Sk 1). Note that b Sk 1 = Pωj b Sk 1 + PωC j b Sk 1 and that (Pωj b Sk 1) (PωC j b Sk 1) = 0. Consequently, b Sk 1,j b SC k 1,j and then P b Sk 1 = P b Sk 1,j + Streaming PCA From Incomplete Data P b SC k 1,j. 
Put differently, b Sk 1 = b Sk 1,j b SC k 1,j. Using this decomposition, we simplify [PΩk O PΩk] (P b S k 1Sk) = h P b Sk 1,j P b S k 1Sqj i (see (189)) = h (P b Sk 1 P b SC k 1,j) P b S k 1Sqj i b Sk 1 = b Sk 1,j b SC k 1,j = h P b SC k 1,j P b S k 1Sqj i . (190) It immediately follows that PΩC k (P b S k 1Sk) + [PΩk O PΩk] (P b S k 1Sk) = h PωC j P b S k 1Sqj P b SC k 1,j P b S k 1Sqj i = h PωC j (In P b SC k 1,j)PωC j P b S k 1Sqj i , b SC k 1,j = span(PωC j b Sk 1) (191) and consequently PΩC k (P b S k 1Sk) + [PΩk O PΩk] (P b S k 1Sk) 2 PωC j (In P b SC k 1,j)PωC j P b S k 1Sqj 2 2 (see (191)) PωC j P b S k 1Sqj 2 = PΩC k (P b S k 1SQk) 2 = PΩC k (P b S k 1Sk) 2 F . (Sk = SQk) (192) E PΩC k (P b S k 1Sk) + [PΩk O PΩk] (P b S k 1Sk) F | b Sk 1, Qk E PΩC k (P b S k 1Sk) + [PΩk O PΩk] (PS k 1Sk) 2 F | b Sk 1, Qk (Jensen s inequality) E PΩC k (P b S k 1Sk) 2 F | b Sk 1, Qk (see (192)) v u u u t E i,j (1 ϵi,j) (P b S k 1Sk)[i, j] 2 | b Sk 1, Qk 1 p P b S k 1Sk F Eftekhari, Ongie, Balzano, and Wakin 1 p P b S k 1S F Qk (Sk = SQ k, AB F A F B ) (193) where {ϵi,j}i,j ind. Bernoulli(p). We thus proved the first claim in Lemma 21, namely, (170). Then note that PΩC k (P b S k 1Sk) + [PΩk O PΩk] (PS k 1Sk) 2 PΩC k (P b S k 1Sk) 2 F (see (192)) P b S k 1Sk 2 F P b S k 1S 2 F Qk 2, (Sk = SQk) (194) which proves the second claim, namely, (171). In fact, with some probability, a stronger bound can be derived by controlling the deviation from the expectation in (193) using the Hoeffding inequality (Lemma 17). 
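The identity behind (193), namely E ‖P_{Ω^C}(M)‖²_F = (1 − p) ‖M‖²_F for a fixed matrix M whose entries are erased independently with probability p, and the concentration just mentioned are easy to see empirically. A small Monte Carlo sketch (hypothetical sizes; NumPy purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(7)
n, b, p, trials = 40, 25, 0.3, 2000

M = rng.standard_normal((n, b))            # stand-in for the fixed matrix in (193)
total = np.sum(M ** 2)                     # ||M||_F^2

vals = np.empty(trials)
for t in range(trials):
    mask = rng.random((n, b)) < p          # each entry observed w.p. p
    vals[t] = np.sum(M[~mask] ** 2)        # ||P_{Omega^C}(M)||_F^2

# Empirical mean matches (1 - p) ||M||_F^2, and individual draws concentrate
# around it, which is what the Hoeffding step sharpens into (195).
print(vals.mean() / ((1 - p) * total))
```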
With α = p 2 P b S k 1Sk 2 F in Lemma 17 and recalling that b Sk 1, Qk are fixed for now, we find that PΩC k (P b S k 1Sk) 2 F E PΩC k (P b S k 1Sk) 2 F | b Sk 1, Qk = (1 p/2) P b S k 1Sk 2 F (see (193)) = (1 p/2) P b S k 1SQk 2 F (Sk = SQk) (1 p/2) P b S k 1S 2 F Qk 2 , (195) except with a probability of at most C1p2 P b S k 1Sk 4 F P i,j |(P b S k 1Sk)[i, j]|4 C1p2 P b S k 1Sk 4 F P b S k 1Sk 2 P b S k 1Sk 2 F C1p2 P b S k 1Sk 2 F P b S k 1Sk 2 C1p2nb eη(P b S k 1Sk) where A returns the largest entry of A in magnitude. Substituting (195) back into (194) yields that PΩC k (P b S k 1Sk) + [PΩk O PΩk] (P b S k 1Sk) F PΩC k (P b S k 1Sk) F (see (194)) 1 p/2 P b S k 1S F Qk , (197) which proves the last claim in Lemma 21, namely, (172). Streaming PCA From Incomplete Data Appendix I. Proof of Lemma 22 We fix b Sk 1, Qk throughout. We begin by bounding the target quantity as PS [O PΩk] (P b S k 1Sk) F = PS P b Sk 1 [O PΩk] (P b S k 1Sk) F (see (165)) PS P b Sk 1 [O PΩk] (P b S k 1Sk) F ( AB F A B F ) . (198) We bound the random norm in the last line above in Appendix J. Lemma 23 For α 1 and except with a probability of at most b Cα, it holds that [O PΩk] (P b S k 1Sk) F α log b p PS P b Sk 1 F Qk , (199) provided that p α2 log2 b log n η( b Sk 1)r/n. In light of the above lemma, we conclude that PS [O PΩk] (P b S k 1Sk) F PS P b Sk 1 [O PΩk] (P b S k 1Sk) F (see (198)) p PS P b Sk 1 PS P b Sk 1 F Qk , (see (199)) (200) except with a probability of at most b Cα and provided that p α2 log2 b log n η( b Sk 1)r/n. This completes the proof of Lemma 22. Appendix J. Proof of Lemma 23 Using the definition of operator O in (165), we write that [O PΩk] (P b S k 1Sk) 2 b Sk 1(Pωj b Sk 1) Pωj P b S k 1Sqj 2 2 ((165) and Sk = SQk) b Sk 1(Pωj b Sk 1) P b S k 1Sqj 2 (Pωj b Sk 1) P b S k 1Sqj 2 2 . 
\[
(\widehat{S}_{k-1}^{*} \widehat{S}_{k-1} = I_r) \tag{201}
\]
For fixed $j \in [1:b]$, consider the summand in the last line above, again with $P^{\perp} := P_{\widehat{S}_{k-1}}^{\perp}$:
\[
\left\| (P_{\omega_j} \widehat{S}_{k-1})^{\dagger} P^{\perp} S q_j \right\|_2
= \left\| (P_{\omega_j} \widehat{S}_{k-1})^{\dagger} P^{\perp} \, P^{\perp} S q_j \right\|_2
= \left\| (P_{\omega_j} \widehat{S}_{k-1})^{\dagger} \widehat{S}_{k-1}^{\perp} (\widehat{S}_{k-1}^{\perp})^{*} P^{\perp} S q_j \right\|_2 \qquad (P^{\perp} = \widehat{S}_{k-1}^{\perp} (\widehat{S}_{k-1}^{\perp})^{*})
\]
\[
\le \Big\| \underbrace{(P_{\omega_j} \widehat{S}_{k-1})^{\dagger} \widehat{S}_{k-1}^{\perp}}_{=:\, \widehat{Z}_j} \Big\| \cdot \left\| P^{\perp} S q_j \right\|_2. \tag{202}
\]
Above, $\widehat{S}_{k-1}^{\perp} \in \mathbb{R}^{n \times (n-r)}$ is as usual an orthonormal basis for the subspace $\widehat{\mathcal{S}}_{k-1}^{\perp}$. We can now revisit (201) and write that
\[
\left\| \mathcal{O}(P^{\perp} S_k) \right\|_F^2
\le \sum_{j=1}^{b} \left\| (P_{\omega_j} \widehat{S}_{k-1})^{\dagger} P^{\perp} S q_j \right\|_2^2 \qquad \text{(see (201))}
\le \max_j \| \widehat{Z}_j \|^2 \sum_{j=1}^{b} \left\| P^{\perp} S q_j \right\|_2^2 \qquad \text{(see (202))}
\]
\[
= \max_j \| \widehat{Z}_j \|^2 \, \left\| P^{\perp} S Q_k \right\|_F^2
\le \max_j \| \widehat{Z}_j \|^2 \, \left\| P^{\perp} S \right\|_F^2 \left\| Q_k \right\|^2. \qquad (\|AB\|_F \le \|A\|_F \|B\|) \tag{203}
\]
It remains to control the maximum in the last line above. We first focus on controlling $\|\widehat{Z}_j\|$ for fixed $j \in [1:b]$. Observe that $\widehat{Z}_j$ is a solution of the least-squares problem
\[
\min_{Z \in \mathbb{R}^{r \times (n-r)}} \left\| \widehat{S}_{k-1}^{\perp} - (P_{\omega_j} \widehat{S}_{k-1}) Z \right\|_F
\]
and therefore satisfies the normal equation $(P_{\omega_j} \widehat{S}_{k-1})^{*} \big( (P_{\omega_j} \widehat{S}_{k-1}) \widehat{Z}_j - \widehat{S}_{k-1}^{\perp} \big) = 0$, which is itself equivalent to
\[
\left( \widehat{S}_{k-1}^{*} P_{\omega_j} \widehat{S}_{k-1} \right) \widehat{Z}_j = \widehat{S}_{k-1}^{*} P_{\omega_j} \widehat{S}_{k-1}^{\perp}. \qquad (P_{\omega_j}^2 = P_{\omega_j}) \tag{204}
\]
In fact, since
\[
\mathbb{E}\left[ \widehat{S}_{k-1}^{*} P_{\omega_j} \widehat{S}_{k-1}^{\perp} \right] = p \, \widehat{S}_{k-1}^{*} \widehat{S}_{k-1}^{\perp} = 0,
\qquad
\mathbb{E}\left[ \widehat{S}_{k-1}^{*} P_{\omega_j} \widehat{S}_{k-1} \right] = p \, I_r \qquad (\widehat{S}_{k-1}^{*} \widehat{S}_{k-1} = I_r),
\]
we can rewrite (204) as
\[
\left( \widehat{S}_{k-1}^{*} P_{\omega_j} \widehat{S}_{k-1} - \mathbb{E}\left[ \widehat{S}_{k-1}^{*} P_{\omega_j} \widehat{S}_{k-1} \right] \right) \widehat{Z}_j + p \, \widehat{Z}_j
= \widehat{S}_{k-1}^{*} P_{\omega_j} \widehat{S}_{k-1}^{\perp} - \mathbb{E}\left[ \widehat{S}_{k-1}^{*} P_{\omega_j} \widehat{S}_{k-1}^{\perp} \right].
\]
An application of the triangle inequality above immediately implies that
\[
p \, \| \widehat{Z}_j \|
\le \left\| \widehat{S}_{k-1}^{*} P_{\omega_j} \widehat{S}_{k-1} - \mathbb{E}\left[ \widehat{S}_{k-1}^{*} P_{\omega_j} \widehat{S}_{k-1} \right] \right\| \| \widehat{Z}_j \|
+ \left\| \widehat{S}_{k-1}^{*} P_{\omega_j} \widehat{S}_{k-1}^{\perp} - \mathbb{E}\left[ \widehat{S}_{k-1}^{*} P_{\omega_j} \widehat{S}_{k-1}^{\perp} \right] \right\|. \tag{205}
\]
To control $\|\widehat{Z}_j\|$, we therefore need to derive large-deviation bounds for the two remaining norms on the right-hand side above. For the first spectral norm, we write that
\[
\widehat{S}_{k-1}^{*} P_{\omega_j} \widehat{S}_{k-1} - \mathbb{E}\left[ \widehat{S}_{k-1}^{*} P_{\omega_j} \widehat{S}_{k-1} \right]
= \sum_i (\epsilon_i - p) \, \widehat{S}_{k-1}^{*} E_{i,i} \widehat{S}_{k-1}
=: \sum_i A_i, \tag{206}
\]
where $\{\epsilon_i\}_i \overset{\text{ind.}}{\sim} \text{Bernoulli}(p)$ and $E_{i,i} \in \mathbb{R}^{n \times n}$ is the $[i,i]$th canonical matrix. Furthermore, $\{A_i\}_i \subset \mathbb{R}^{r \times r}$ are independent and zero-mean random matrices.
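The passage from the least-squares characterization of $\widehat{Z}_j$ to the normal equation (204) can be checked numerically. The sketch below builds a random orthonormal basis and coordinate mask (all sizes are illustrative assumptions) and verifies that the minimum-norm pseudoinverse solution satisfies the normal equation:

```python
import numpy as np

rng = np.random.default_rng(1)
n, r, p = 30, 3, 0.6

# Orthonormal basis S_hat for an r-dimensional subspace and an orthonormal
# basis S_perp for its orthogonal complement.
Q, _ = np.linalg.qr(rng.standard_normal((n, n)))
S_hat, S_perp = Q[:, :r], Q[:, r:]

# Random coordinate projection P_omega (diagonal 0/1 matrix).
P_omega = np.diag((rng.random(n) < p).astype(float))

# Minimum-norm least-squares solution Z = (P_omega S_hat)^+ S_perp ...
Z = np.linalg.pinv(P_omega @ S_hat) @ S_perp

# ... satisfies the normal equation
# (S_hat^T P_omega S_hat) Z = S_hat^T P_omega S_perp, using P_omega^2 = P_omega.
lhs = S_hat.T @ P_omega @ S_hat @ Z
rhs = S_hat.T @ P_omega @ S_perp
assert np.allclose(lhs, rhs, atol=1e-8)
```

Here `Z` plays the role of $\widehat{Z}_j$; when enough coordinates are observed, $\widehat{S}_{k-1}^{*} P_{\omega_j} \widehat{S}_{k-1}$ is invertible and the normal equation pins $\widehat{Z}_j$ down uniquely.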
To apply the Bernstein inequality (Lemma 18), we first compute the parameter $\beta$ as
\[
\| A_i \| = |\epsilon_i - p| \left\| \widehat{S}_{k-1}^{*} E_{i,i} \widehat{S}_{k-1} \right\| \qquad \text{(see (206))}
\le \left\| \widehat{S}_{k-1}^{*} E_{i,i} \widehat{S}_{k-1} \right\| \qquad (\epsilon_i \in \{0,1\})
= \left\| \widehat{S}_{k-1}[i,:] \right\|_2^2
\le \frac{\eta(\widehat{S}_{k-1}) r}{n} =: \beta. \qquad \text{(see (14))} \tag{207}
\]
To compute the weak variance $\sigma^2$, we write that
\[
\left\| \sum_i \mathbb{E} A_i^2 \right\|
= \left\| \sum_i \mathbb{E}\left[ (\epsilon_i - p)^2 \right] \left( \widehat{S}_{k-1}^{*} E_{i,i} \widehat{S}_{k-1} \right)^2 \right\| \qquad \text{(see (206))}
= p(1-p) \left\| \sum_i \left( \widehat{S}_{k-1}^{*} E_{i,i} \widehat{S}_{k-1} \right)^2 \right\| \qquad (\epsilon_i \sim \text{Bernoulli}(p))
\]
\[
\le p \left\| \sum_i \left\| \widehat{S}_{k-1}[i,:] \right\|_2^2 \, \widehat{S}_{k-1}^{*} E_{i,i} \widehat{S}_{k-1} \right\| \qquad \left( (\widehat{S}_{k-1}^{*} E_{i,i} \widehat{S}_{k-1})^2 = \|\widehat{S}_{k-1}[i,:]\|_2^2 \, \widehat{S}_{k-1}^{*} E_{i,i} \widehat{S}_{k-1} \right)
\]
\[
\le p \max_i \left\| \widehat{S}_{k-1}[i,:] \right\|_2^2 \left\| \sum_i \widehat{S}_{k-1}^{*} E_{i,i} \widehat{S}_{k-1} \right\|
= p \max_i \left\| \widehat{S}_{k-1}[i,:] \right\|_2^2 \qquad \left( \sum_i E_{i,i} = I_n,\ \widehat{S}_{k-1}^{*} \widehat{S}_{k-1} = I_r \right)
\le \frac{p \, \eta(\widehat{S}_{k-1}) r}{n} =: \sigma^2. \qquad \text{(see (14))} \tag{208}
\]
It also follows that
\[
\max\left\{ \beta \log r, \ \sigma \sqrt{\log r} \right\}
\le \sqrt{\frac{p \, \eta(\widehat{S}_{k-1}) r \log r}{n}}, \qquad \text{if } p \ge \frac{\eta(\widehat{S}_{k-1}) r \log r}{n}. \qquad \text{(see (207) and (208))} \tag{209}
\]
As a result, for $\alpha \ge 1$ and except with a probability of at most $e^{-C\alpha}$, it holds that
\[
\left\| \widehat{S}_{k-1}^{*} P_{\omega_j} \widehat{S}_{k-1} - \mathbb{E}\left[ \widehat{S}_{k-1}^{*} P_{\omega_j} \widehat{S}_{k-1} \right] \right\|
\le \alpha \max\left\{ \beta \log r, \ \sigma \sqrt{\log r} \right\} \qquad \text{(see Lemma 18)}
\le \alpha \sqrt{\frac{p \, \eta(\widehat{S}_{k-1}) r \log r}{n}}. \tag{210}
\]
On the other hand, in order to apply the Bernstein inequality to the second spectral norm in (205), we write that
\[
\widehat{S}_{k-1}^{*} P_{\omega_j} \widehat{S}_{k-1}^{\perp} - \mathbb{E}\left[ \widehat{S}_{k-1}^{*} P_{\omega_j} \widehat{S}_{k-1}^{\perp} \right]
= \sum_i (\epsilon_i - p) \, \widehat{S}_{k-1}^{*} E_{i,i} \widehat{S}_{k-1}^{\perp}
=: \sum_i A_i, \tag{211}
\]
where $\{\epsilon_i\}_i \overset{\text{ind.}}{\sim} \text{Bernoulli}(p)$, $E_{i,i} \in \mathbb{R}^{n \times n}$ is the $[i,i]$th canonical matrix, and $\{A_i\}_i \subset \mathbb{R}^{r \times (n-r)}$ are zero-mean and independent random matrices. To compute the parameter $\beta$ here, we write that
\[
\| A_i \| = |\epsilon_i - p| \left\| \widehat{S}_{k-1}^{*} E_{i,i} \widehat{S}_{k-1}^{\perp} \right\| \qquad \text{(see (211))}
\le \left\| \widehat{S}_{k-1}^{*} E_{i,i} \right\| \left\| \widehat{S}_{k-1}^{\perp} \right\| \qquad (\epsilon_i \in \{0,1\},\ \|\widehat{S}_{k-1}^{\perp}\| = 1)
= \left\| \widehat{S}_{k-1}[i,:] \right\|_2
\le \sqrt{\frac{\eta(\widehat{S}_{k-1}) r}{n}} =: \beta.
\]
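The concentration scale that the Bernstein inequality predicts for $\widehat{S}_{k-1}^{*} P_{\omega_j} \widehat{S}_{k-1}$ around its mean $p I_r$ can be observed empirically. The sketch below is not from the paper; the dimensions and trial count are illustrative, and the factor of 5 in the final check is a loose slack on the constant:

```python
import numpy as np

rng = np.random.default_rng(2)
n, r, p = 500, 4, 0.2

# Orthonormal basis for a generic (hence incoherent) r-dimensional subspace.
S_hat, _ = np.linalg.qr(rng.standard_normal((n, r)))
coh = n / r * np.max(np.sum(S_hat**2, axis=1))  # coherence eta(S_hat), as in (14)

# Spectral-norm deviation of S_hat^T P_omega S_hat from its mean p * I_r.
devs = []
for _ in range(200):
    obs = rng.random(n) < p                     # random index set omega
    M = S_hat[obs].T @ S_hat[obs]               # = S_hat^T P_omega S_hat
    devs.append(np.linalg.norm(M - p * np.eye(r), 2))

# The Bernstein-type bound predicts deviations on the order of
# sqrt(p * eta * r * log(r) / n), well below the mean's scale p.
scale = np.sqrt(p * coh * r * np.log(r) / n)
assert np.median(devs) < 5 * scale
assert np.median(devs) < p
```

That the typical deviation is far below $p$ is exactly what makes the perturbed identity $p\,\widehat{Z}_j + (\text{deviation})\,\widehat{Z}_j = \cdots$ solvable for $\|\widehat{Z}_j\|$.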
(see (14)) (212)

To compute the weak variance $\sigma^2$, we notice that
\[
\left\| \sum_i \mathbb{E}\left[ A_i A_i^{*} \right] \right\|
= \left\| \sum_i \mathbb{E}\left[ (\epsilon_i - p)^2 \right] \widehat{S}_{k-1}^{*} E_{i,i} \widehat{S}_{k-1}^{\perp} (\widehat{S}_{k-1}^{\perp})^{*} E_{i,i} \widehat{S}_{k-1} \right\| \qquad \text{(see (211))}
= p(1-p) \left\| \sum_i \widehat{S}_{k-1}^{*} E_{i,i} \widehat{S}_{k-1}^{\perp} (\widehat{S}_{k-1}^{\perp})^{*} E_{i,i} \widehat{S}_{k-1} \right\| \qquad (\epsilon_i \sim \text{Bernoulli}(p))
\]
\[
\le p \left\| \sum_i \widehat{S}_{k-1}^{*} E_{i,i} E_{i,i} \widehat{S}_{k-1} \right\| \qquad (\widehat{S}_{k-1}^{\perp} (\widehat{S}_{k-1}^{\perp})^{*} \preceq I_n)
= p. \qquad \left( \sum_i E_{i,i} = I_n,\ \widehat{S}_{k-1}^{*} \widehat{S}_{k-1} = I_r \right) \tag{213}
\]
In a similar fashion, we find that
\[
\left\| \sum_i \mathbb{E}\left[ A_i^{*} A_i \right] \right\|
\le p \left\| \sum_i (\widehat{S}_{k-1}^{\perp})^{*} E_{i,i} \widehat{S}_{k-1} \widehat{S}_{k-1}^{*} E_{i,i} \widehat{S}_{k-1}^{\perp} \right\|
\le p \max_i \left\| \widehat{S}_{k-1}[i,:] \right\|_2^2 \left\| \sum_i (\widehat{S}_{k-1}^{\perp})^{*} E_{i,i} \widehat{S}_{k-1}^{\perp} \right\|
\]
\[
= p \max_i \left\| \widehat{S}_{k-1}[i,:] \right\|_2^2 \qquad \left( \sum_i E_{i,i} = I_n,\ (\widehat{S}_{k-1}^{\perp})^{*} \widehat{S}_{k-1}^{\perp} = I_{n-r} \right)
\le \frac{p \, \eta(\widehat{S}_{k-1}) r}{n}. \qquad \text{(see (14))} \tag{214}
\]
Hence,
\[
\sigma^2 = \max\left\{ p, \ \frac{p \, \eta(\widehat{S}_{k-1}) r}{n} \right\} = p. \qquad \text{(see (213) and (214), and } \eta(\widehat{S}_{k-1}) r \le n) \tag{215}
\]
We now compute
\[
\max\left\{ \beta \log n, \ \sigma \sqrt{\log n} \right\} \le \sqrt{p \log n}, \qquad \text{if } p \ge \frac{\eta(\widehat{S}_{k-1}) r \log n}{n}. \qquad \text{(see (212) and (215))} \tag{216}
\]
Therefore, for $\alpha \ge 1$ and except with a probability of at most $e^{-C\alpha}$, it holds that
\[
\left\| \widehat{S}_{k-1}^{*} P_{\omega_j} \widehat{S}_{k-1}^{\perp} - \mathbb{E}\left[ \widehat{S}_{k-1}^{*} P_{\omega_j} \widehat{S}_{k-1}^{\perp} \right] \right\|
\le \alpha \max\left\{ \beta \log n, \ \sigma \sqrt{\log n} \right\} \qquad \text{(see Lemma 18)}
\le \alpha \sqrt{p \log n}. \tag{217}
\]
Overall, by substituting the large-deviation bounds (210) and (217) into (205), we find that
\[
p \, \| \widehat{Z}_j \|
\le \alpha \sqrt{\frac{p \, \eta(\widehat{S}_{k-1}) r \log r}{n}} \, \| \widehat{Z}_j \| + \alpha \sqrt{p \log n}, \qquad \text{(see (205), (210), and (217))}
\]
except with a probability of at most $e^{-C\alpha}$ and under (209) and (216). It immediately follows that
\[
\| \widehat{Z}_j \| \le 2 \alpha \sqrt{\frac{\log n}{p}}, \qquad \text{if } \frac{\alpha^2 \log r \cdot \eta(\widehat{S}_{k-1}) r}{p n} \le \frac{1}{4}, \tag{218}
\]
except with a probability of at most $e^{-C\alpha}$. In light of (209) and (216), we assume that $p \gtrsim \alpha^2 \log n \cdot \eta(\widehat{S}_{k-1}) r / n$. Then, using the union bound and with the choice of $\alpha \to \alpha \log b$, it follows that
\[
\max_{j \in [1:b]} \| \widehat{Z}_j \| \lesssim \alpha \log b \sqrt{\frac{\log n}{p}},
\]
provided that $p \gtrsim \alpha^2 \log^2 b \log n \cdot \eta(\widehat{S}_{k-1}) r / n$ and except with a probability of at most $b \, e^{-C\alpha \log b} \le b^{-C'\alpha}$. Invoking (203), we finally conclude that
\[
\left\| \mathcal{O}(P^{\perp}_{\widehat{S}_{k-1}} S_k) \right\|_F
\le \max_j \| \widehat{Z}_j \| \, \left\| P^{\perp}_{\widehat{S}_{k-1}} P_S \right\|_F \left\| Q_k \right\| \qquad \text{(see (203))}
\lesssim \alpha \log b \sqrt{\frac{\log n}{p}} \, \left\| P^{\perp}_{\widehat{S}_{k-1}} P_S \right\|_F \left\| Q_k \right\|.
\]
A bound in expectation also easily follows. Let $\delta$ denote the coefficient of $\alpha$ in the last line above.
Then we have that
\[
\mathbb{E} \left\| P_S [\mathcal{O} \circ P_{\Omega_k}](P^{\perp}_{\widehat{S}_{k-1}} S_k) \right\|_F
= \delta \int_0^{\infty} \Pr\left[ \left\| P_S [\mathcal{O} \circ P_{\Omega_k}](P^{\perp}_{\widehat{S}_{k-1}} S_k) \right\|_F > \alpha \delta \right] d\alpha
\]
\[
\le \delta \left( 1 + \int_1^{\infty} \Pr\left[ \left\| P_S [\mathcal{O} \circ P_{\Omega_k}](P^{\perp}_{\widehat{S}_{k-1}} S_k) \right\|_F > \alpha \delta \right] d\alpha \right)
\lesssim \delta
\lesssim \log b \sqrt{\frac{\log n}{p}} \, \left\| P^{\perp}_{\widehat{S}_{k-1}} P_S \right\|_F \left\| Q_k \right\|. \tag{219}
\]
This completes the proof of Lemma 23.

Appendix K. Properties of a Standard Random Gaussian Matrix

As a supplement to Remark 25, we show here that a standard random Gaussian matrix $G \in \mathbb{R}^{b \times r}$ is well-conditioned and incoherent when $b$ is sufficiently large. From (Vershynin, 2012a, Corollary 5.35) and for fixed $\alpha \ge 1$, recall that
\[
\sqrt{b} - C_3 \alpha \sqrt{r} \le \sigma_r(G) \le \sigma_1(G) \le \sqrt{b} + C_3 \alpha \sqrt{r}, \tag{220}
\]
except with a probability of at most $e^{-\alpha^2 r}$. It follows that
\[
\nu(G) = \frac{\sigma_1(G)}{\sigma_r(G)} \le \frac{\sqrt{b} + C_3 \alpha \sqrt{r}}{\sqrt{b} - C_3 \alpha \sqrt{r}}, \tag{221}
\]
which can be made close to one by choosing $b \gtrsim \alpha^2 r$. For the coherence, note that $G (G^{*} G)^{-\frac{1}{2}} \in \mathbb{R}^{b \times r}$ is an orthonormal basis for $\operatorname{span}(G)$. Using the definition of coherence in (14), we then write that
\[
\eta(\operatorname{span}(G))
= \frac{b}{r} \max_{i \in [1:b]} \left\| G[i,:] (G^{*} G)^{-\frac{1}{2}} \right\|_2^2 \qquad \text{(see (14))}
\le \frac{b}{r} \max_i \left\| G[i,:] \right\|_2^2 \left\| (G^{*} G)^{-1} \right\|
= \frac{b}{r} \max_i \left\| G[i,:] \right\|_2^2 \, \sigma_r(G)^{-2}
\]
\[
\le \frac{b}{r} \max_i \left\| G[i,:] \right\|_2^2 \left( \sqrt{b} - C_3 \alpha \sqrt{r} \right)^{-2} \qquad \text{(see (220))}
\lesssim \frac{b}{r} \max_i \left\| G[i,:] \right\|_2^2 \left( b - C_4 \alpha^2 r \right)^{-1}, \qquad \left( (a-b)^2 \ge \tfrac{a^2}{2} - b^2 \right) \tag{222}
\]
except with a probability of at most $e^{-\alpha^2 r}$. For fixed $i$, $\|G[i,:]\|_2^2$ is a chi-squared random variable with $r$ degrees of freedom, so that
\[
\Pr\left[ \| G[i,:] \|_2^2 \ge \beta r \right] \le e^{-C\beta}, \tag{223}
\]
for $\beta \ge 1$. An application of the union bound and the choice of $\beta = C \alpha \log b$ then lead us to
\[
\Pr\left[ \max_{i \in [1:b]} \| G[i,:] \|_2^2 \ge C \alpha \log b \cdot r \right] \le b \cdot b^{-C'\alpha} \le b^{-C\alpha}. \tag{224}
\]
Substituting the bound above back into (222) yields that
\[
\eta(\operatorname{span}(G))
\lesssim \frac{b}{r} \max_i \| G[i,:] \|_2^2 \left( b - C_4 \alpha^2 r \right)^{-1} \qquad \text{(see (222))}
\lesssim \frac{b}{r} \cdot r \alpha \log b \cdot \left( b - C_4 \alpha^2 r \right)^{-1} \qquad \text{(see (224))}
= \frac{\alpha b \log b}{b - C_4 \alpha^2 r}, \tag{225}
\]
except with a probability of at most $e^{-\alpha^2 r} + b^{-C\alpha}$. In particular, when $b \ge 2 C_4 \alpha^2 r$, we find that $\eta(\operatorname{span}(G)) \lesssim \alpha \log b$, except with a probability of at most $e^{-\alpha^2 r} + b^{-C\alpha}$.
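The two conclusions of Appendix K, namely that $\nu(G) \to 1$ and that the coherence stays on the order of $\log b$ as the Gaussian matrix gets taller, are easy to observe numerically. This is an illustrative sketch (the values of $b$ and $r$ and the slack constants are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(3)
r, stats = 5, {}
for b in (50, 500, 5000):
    G = rng.standard_normal((b, r))
    s = np.linalg.svd(G, compute_uv=False)
    Q, _ = np.linalg.qr(G)                             # orthonormal basis of span(G)
    stats[b] = {
        "nu": s[0] / s[-1],                            # condition number nu(G)
        "eta": b / r * np.max(np.sum(Q**2, axis=1)),   # coherence, as in (14)
    }

# As b grows, nu(G) approaches one (cf. (221)) and the coherence
# remains on the order of log b (cf. (225)).
assert stats[5000]["nu"] < stats[50]["nu"]
assert stats[5000]["nu"] < 1.3
assert stats[5000]["eta"] < 5 * np.log(5000)
```

The QR factor is used in place of $G(G^*G)^{-1/2}$; both are orthonormal bases for $\operatorname{span}(G)$, and the coherence depends only on the span.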
Appendix L. Alternative Initialization

SNIPE in Algorithm 1 is initialized by truncating the SVD of the first incomplete block $Y_1 \in \mathbb{R}^{n \times b}$, where we often take $b = O(r)$ to keep the computational and storage requirements of SNIPE minimal; see Remarks 1 and 2. Put differently, with the notation of Section 3, even though our end goal is to compute the rank-$r$ truncated SVD of the full (but hidden) data block $S_1 \in \mathbb{R}^{n \times b}$, SNIPE is initialized with the truncated SVD of the incomplete (but available) block $Y_1 = P_{\Omega_1}(S_1) \in \mathbb{R}^{n \times b}$, which fills in the missing entries with zeros. Indeed, when the first incomplete block $Y_1$ arrives, there is no prior knowledge to leverage, and zero-filling the missing entries of $Y_1$ is a sensible strategy. In contrast, for the remaining blocks ($k \ge 2$), SNIPE uses its previous estimate $\widehat{\mathcal{S}}_{k-1}$ to fill in the erased entries of $Y_k$ before updating its estimate to $\widehat{\mathcal{S}}_k$; see Algorithm 1.

One might instead initialize SNIPE with a larger block. More specifically, suppose that we change the first block size to $b_1 \ge b$ while keeping the rest of the blocks at the same size $b$. Then we set $\widehat{\mathcal{S}}_1$ to be the span of the $r$ leading left singular vectors of the first incomplete block $Y_1 \in \mathbb{R}^{n \times b_1}$, while the rest of the steps in Algorithm 1 do not change. As the size $b_1$ of the first block increases, $\widehat{\mathcal{S}}_1$ approximates the true subspace $\mathcal{S}$ increasingly well. Indeed, one might consider $Y_1 = P_{\Omega_1}(S_1) = S_1 + (P_{\Omega_1}(S_1) - S_1)$ as a noisy copy of $S_1$, where the noise is due to the erasures. Roughly speaking, then, as $b_1$ increases, the energy of the signal part, namely $\|S_1\|_F$, grows faster than and eventually dominates the energy of the random noise $\|P_{\Omega_1}(S_1) - S_1\|_F$. This intuition is made precise by the following result, which loosely speaking states that $d_G(\mathcal{S}, \widehat{\mathcal{S}}_1)$ is small when $b_1 = \Omega(n)$. This result is proved in Appendix C with the aid of standard large-deviation bounds.

Proposition 24 Consider an $r$-dimensional subspace $\mathcal{S}$ with orthonormal basis $S \in \mathbb{R}^{n \times r}$.
For an integer $b_1 \ge r$, let the coefficient vectors $\{q_t\}_{t=1}^{b_1} \subset \mathbb{R}^r$ be independent copies of a random vector $q \in \mathbb{R}^r$. For every $t \in [1:b_1]$, we observe each coordinate of $s_t = S q_t \in \mathcal{S}$ independently with a probability of $p$ and collect the observed entries in $y_t \in \mathbb{R}^n$, supported on the random index set $\omega_t \subset [1:n]$. We set $Q_1 = [q_1\ q_2\ \cdots\ q_{b_1}] \in \mathbb{R}^{r \times b_1}$ and $Y_1 = [y_1\ y_2\ \cdots\ y_{b_1}] \in \mathbb{R}^{n \times b_1}$ for short. Let also $\widehat{\mathcal{S}}_1$ be the span of the $r$ leading left singular vectors of $Y_1$. Then, for fixed $\alpha, \nu \ge 1$ and $1 \le \eta \le b_1 / r$, it holds that
\[
d_G(\mathcal{S}, \widehat{\mathcal{S}}_1) \lesssim \alpha \, \nu \, (\eta \vee \eta(\mathcal{S})) \, r \sqrt{\log(n b_1)} \, \cdots, \tag{226}
\]
except with a probability of at most $e^{-\alpha} + \Pr[\nu(Q_1) > \nu] + \Pr[\eta(Q_1) > \eta]$. Above, $\nu(Q_1)$ is the condition number of $Q_1$, $\eta(Q_1)$ is the coherence of $\mathcal{Q}_1 = \operatorname{span}(Q_1^{*})$ (see (14)), and $a \vee b := \max\{a, b\}$.

Remark 25 [Discussion of Proposition 24] As (226) suggests, for $\widehat{\mathcal{S}}_1$ to be close to $\mathcal{S}$, the first block should be a wide matrix, namely, $b_1 = \Omega(n)$. This dependence on the block size was anticipated. Indeed, it is well understood that one needs $O(n)$ samples in order for the sample covariance matrix to closely approximate the covariance matrix of a random vector in $\mathbb{R}^n$ (Vershynin, 2012b). As an example, consider the case where the coefficient vectors $\{q_t\}_{t=1}^{b_1}$ are standard random Gaussian vectors, so that $Q_1 \in \mathbb{R}^{r \times b_1}$ is a standard random Gaussian matrix, namely, populated with independent zero-mean Gaussian random variables of unit variance. Then both probabilities above are small when $b_1$ is sufficiently large. More specifically, we show in Appendix K that
\[
\Pr[\nu(Q_1) > \nu] \le \exp\left( -C (\nu - 1)^2 b_1 \right), \qquad \text{when } b_1 \gtrsim r, \tag{227}
\]
\[
\Pr[\eta(Q_1) > \eta] \le \exp\left( -C \eta / \log b_1 \right), \qquad \text{when } b_1 \log^2 b_1 \gtrsim \eta^2 r, \tag{228}
\]
for a Gaussian coefficient matrix $Q_1$. It is also worth noting that initializing SNIPE with a large first block can be done without increasing the storage requirement or computational complexity of SNIPE; namely, replacing the block size $b = O(r)$ in the first step of Algorithm 1 with $b_1 = O(n)$ can be done without losing the streaming nature of SNIPE.
More specifically, with the alternative initialization, naively computing the truncated SVD of the first block requires $O(b_1 n) = O(n^2)$ bits of storage and $O(r b_1 n) = O(r n^2)$ flops. These requirements can be significantly lowered by implementing a state-of-the-art streaming PCA algorithm, such as the power method of Mitliagkas et al. (2013). This suggests a two-phase algorithm. In the first phase, the power method is applied to the incoming data, with the missing entries filled with zeros. This phase produces the estimate $\widehat{\mathcal{S}}_1$ in SNIPE, which serves as an initialization for the second phase, in which the main loop of SNIPE is applied to the incoming blocks, producing the estimates $\{\widehat{\mathcal{S}}_k\}_{k \ge 2}$. If $b_1$ is sufficiently large, the first phase brings us within the basin of attraction of the true subspace $\mathcal{S}$ and activates the locally linear convergence of SNIPE to $\mathcal{S}$ in the second phase; see Theorems 8 and 10.

References

B. A. Ardekani, J. Kershaw, K. Kashikura, and I. Kanno. Activation detection in functional MRI using subspace modeling and maximum likelihood estimation. IEEE Transactions on Medical Imaging, 18(2):101–114, 1999.

R. Arora, A. Cotter, K. Livescu, and N. Srebro. Stochastic optimization for PCA and PLS. In Annual Allerton Conference on Communication, Control, and Computing, pages 861–868. IEEE, 2012.

A. Balsubramani, S. Dasgupta, and Y. Freund. The fast convergence of incremental PCA. In Advances in Neural Information Processing Systems, pages 3174–3182, 2013.

L. Balzano and S. J. Wright. On GROUSE and incremental SVD. In IEEE International Workshop on Computational Advances in Multi-Sensor Adaptive Processing (CAMSAP), pages 1–4. IEEE, 2013.

L. Balzano and S. J. Wright. Local convergence of an algorithm for subspace identification from partial data. Foundations of Computational Mathematics, 15(5):1279–1314, 2015.

L. Balzano, R. Nowak, and B. Recht. Online identification and tracking of subspaces from highly incomplete information.
In Annual Allerton Conference on Communication, Control, and Computing, pages 704–711. IEEE, 2010.

M. Brand. Incremental singular value decomposition of uncertain data with missing values. In European Conference on Computer Vision, pages 707–720. Springer, 2002.

J. R. Bunch and C. P. Nielsen. Updating the singular value decomposition. Numerische Mathematik, 31(2):111–129, 1978.

C. Cartis and K. Scheinberg. Global convergence rate analysis of unconstrained optimization methods based on probabilistic models. Mathematical Programming, pages 1–39, 2017.

Y. Chen. Incoherence-optimal matrix completion. IEEE Transactions on Information Theory, 61(5):2909–2923, 2015.

Y. Chi, Y. C. Eldar, and R. Calderbank. PETRELS: Parallel subspace estimation and tracking by recursive least squares from partial observations. IEEE Transactions on Signal Processing, 61(23):5947–5959, 2013.

M. A. Davenport and J. Romberg. An overview of low-rank matrix recovery from incomplete observations. IEEE Journal of Selected Topics in Signal Processing, 10(4):608–622, 2016.

R. Durrett. Probability: Theory and Examples. Cambridge University Press, 2010.

C. Eckart and G. Young. The approximation of one matrix by another of lower rank. Psychometrika, 1:211–218, 1936. doi: 10.1007/BF02288367.

A. Eftekhari, L. Balzano, and M. B. Wakin. What to expect when you are expecting on the Grassmannian. IEEE Signal Processing Letters, 24(6):872–876, 2017.

A. Eftekhari, M. B. Wakin, and R. A. Ward. MC2: A two-phase algorithm for leveraged matrix completion. Information and Inference: A Journal of the IMA, 7(3):581–604, 2018a.

A. Eftekhari, D. Yang, and M. B. Wakin. Weighted matrix completion and recovery with prior subspace information. IEEE Transactions on Information Theory, 64(6):4044–4071, 2018b.

N. Gershenfeld, S. Samouhos, and B. Nordman. Intelligent infrastructure for energy efficiency. Science, 327(5969):1086–1088, 2010.

G. H. Golub and C. F. Van Loan. Matrix Computations.
Johns Hopkins Studies in the Mathematical Sciences. Johns Hopkins University Press, 2013.

A. Gonen, D. Rosenbaum, Y. C. Eldar, and S. Shalev-Shwartz. Subspace learning with partial information. Journal of Machine Learning Research, 17(52):1–21, 2016.

D. Gross. Recovering low-rank matrices from few coefficients in any basis. IEEE Transactions on Information Theory, 57(3):1548–1566, 2011.

T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer Series in Statistics. Springer New York, 2013.

H. Krim and M. Viberg. Two decades of array signal processing research: The parametric approach. IEEE Signal Processing Magazine, 13(4):67–94, 1996.

A. Lakhina, M. Crovella, and C. Diot. Diagnosing network-wide traffic anomalies. In ACM SIGCOMM Computer Communication Review, volume 34, pages 219–230. ACM, 2004.

B. Lois and N. Vaswani. Online matrix completion and online robust PCA. In IEEE International Symposium on Information Theory (ISIT), pages 1826–1830. IEEE, 2015.

K. Lounici. High-dimensional covariance matrix estimation with missing observations. Bernoulli, 20(3):1029–1058, 2014.

M. Mardani, G. Mateos, and G. B. Giannakis. Subspace learning and imputation for streaming big data matrices and tensors. IEEE Transactions on Signal Processing, 63(10):2663–2677, 2015.

L. Mirsky. Symmetric gauge functions and unitarily invariant norms. Quart. J. Math. Oxford, pages 1156–1159, 1966.

I. Mitliagkas, C. Caramanis, and P. Jain. Memory limited, streaming PCA. In Advances in Neural Information Processing Systems, pages 2886–2894, 2013.

I. Mitliagkas, C. Caramanis, and P. Jain. Streaming PCA with many missing entries. Preprint, 2014.

J. Nocedal and S. J. Wright. Numerical Optimization. Springer Series in Operations Research and Financial Engineering. Springer New York, 2006.

E. Oja and J. Karhunen.
On stochastic approximation of the eigenvectors and eigenvalues of the expectation of a random matrix. Journal of Mathematical Analysis and Applications, 106(1):69–84, 1985.

J. Tanner and K. Wei. Normalized iterative hard thresholding for matrix completion. SIAM Journal on Scientific Computing, 35(5):S104–S125, 2013.

L. Tong and S. Perreau. Multichannel blind identification: From subspace to maximum likelihood methods. Proceedings of the IEEE, 86:1951–1968, 1998.

J. A. Tropp. User-friendly tail bounds for sums of random matrices. Foundations of Computational Mathematics, 12(4):389–434, 2012.

P. van Overschee and B. L. de Moor. Subspace Identification for Linear Systems: Theory, Implementation, Applications. Springer US, 2012.

R. Vershynin. Introduction to the non-asymptotic analysis of random matrices. In Y. C. Eldar and G. Kutyniok, editors, Compressed Sensing: Theory and Applications, pages 95–110. Cambridge University Press, 2012a.

R. Vershynin. How close is the sample covariance matrix to the actual covariance matrix? Journal of Theoretical Probability, 25(3):655–686, 2012b.

S. Watanabe and N. Pakvasa. Subspace method of pattern recognition. In Proceedings of the 1st International Joint Conference on Pattern Recognition, pages 25–32, 1973.

P. Wedin. Perturbation bounds in connection with singular value decomposition. BIT Numerical Mathematics, 12(1):99–111, 1972.

Y. Xie, J. Huang, and R. Willett. Change-point detection for high-dimensional time series with missing data. IEEE Journal of Selected Topics in Signal Processing, 7(1):12–27, 2013.

D. Zhang and L. Balzano. Global convergence of a Grassmannian gradient descent algorithm for subspace estimation. In Proceedings of the International Conference on Artificial Intelligence and Statistics, pages 1460–1468, 2016.