# stochastic_canonical_correlation_analysis__dda355b1.pdf Journal of Machine Learning Research 20 (2019) 1-46 Submitted 2/18; Published 10/19 Stochastic Canonical Correlation Analysis Chao Gao chaogao@galton.uchicago.edu University of Chicago Chicago, IL 60637, USA Dan Garber dangar@technion.ac.il Technion Israel Institute of Technology Haifa, 3200003, Israel Nathan Srebro nati@ttic.edu Toyota Technological Institute at Chicago Chicago, IL 60637, USA Jialei Wang jialei@uchicago.edu University of Chicago Chicago, IL 60637, USA Weiran Wang weiranwang@ttic.edu Toyota Technological Institute at Chicago Chicago, IL 60637, USA Editor: John Shawe-Taylor We study the sample complexity of canonical correlation analysis (CCA), i.e., the number of samples needed to estimate the population canonical correlation and directions up to arbitrarily small error. With mild assumptions on the data distribution, we show that in order to achieve ϵ-suboptimality in a properly defined measure of alignment between the estimated canonical directions and the population solution, we can solve the empirical objective exactly with N(ϵ, , γ) samples, where is the singular value gap of the whitened cross-covariance matrix and 1/γ is an upper bound of the condition number of auto-covariance matrices. Moreover, we can achieve the same learning accuracy by drawing the same level of samples and solving the empirical objective approximately with a stochastic optimization algorithm; this algorithm is based on the shift-and-invert power iterations and only needs to process the dataset for O log 1 ϵ passes. Finally, we show that, given an estimate of the canonical correlation, the streaming version of the shift-and-invert power iterations achieves the same learning accuracy with the same level of sample complexity, by processing the data only once. Keywords: Canonical correlation analysis, sample complexity, shift-and-invert preconditioning, streaming CCA 1. Introduction Let x Rdx and y Rdy be two random vectors with a joint probability distribution P(x, y). The objective of CCA (Hotelling, 1936) in the population setting is to find u Rdx and v Rdy such that projections of the random variables onto these directions are c 2019 Chao Gao, Dan Garber, Nathan Srebro, Jialei Wang, and Weiran Wang. License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at http://jmlr.org/papers/v20/18-095.html. Gao, Garber, Srebro, Wang, and Wang maximally correlated:1 max u,v E[(u x)(v y)] p E[(u x)2] p E[(v y)2] . (1) This objective can be written in the equivalent constrained form max u,v u Exyv s.t. u Exxu = v Eyyv = 1 (2) where the crossand auto-covariance matrices are defined as Exy = E[xy ], Exx = E[xx ], Eyy = E[yy ]. (3) The global optimum of (2), denoted by (u , v ), can be computed in closed-form. Define 2 xx Exy E 1 2 yy Rdx dy, (4) and let (a1, b1) be the (unit-length) top left and right singular vector pair associated with T s largest singular value ρ1 = σ1(T). Then the optimal objective value, i.e., the canonical correlation between x and y, is ρ1 1 (see Lemma 20), achieved by (u , v ) = 2 xx a1, E 1 2 yy b1). In practice, we do not have access to the population covariance matrices, but observe samples pairs (x1, y1), . . . , (x N, y N) drawn from P(x, y). In this paper, we are concerned with both the number of samples N(ϵ) needed to approximately solve (2), and the time complexity for obtaining the approximate solution. Note that the CCA objective is not a stochastic convex program due to the ratio form (1), and standard stochastic approximation methods do not apply (Arora et al., 2012). Globally convergent stochastic optimization of CCA has long been a challenge even for the empirical objective, and attracted continuous effort (Lu and Foster, 2014; Ma et al., 2015; Wang and Livescu, 2016), until the recent breakthrough by Ge et al. (2016); Wang et al. (2016). And our understanding of the stochastic objective, e.g., the existence of an efficient algorithm and the sample complexity, has been very limited. Our contributions The contributions of our paper are summarized as follows. First, we provide the ERM sample complexity of CCA. We show that in order to achieve ϵ-suboptimality in the alignment between the estimated canonical directions and the population solution (relative to the population covariances, see Section 2), we can solve the empirical objective exactly with N(ϵ, , γ) samples where is the singular value gap of the whitened cross-covariance and 1/γ is a upper bound of the condition number of the auto-covariance, for several general classes of distributions widely used in statistics and machine learning. 1. For simplicity (especially for the streaming setting), we assume that E[x] = 0 and E[y] = 0. Nonzero means can be easily handled in the ERM approach (see Remark 4). Stochastic Canonical Correlation Analysis Second, to alleviate the high computational complexity of exactly solving the empirical objective, we show that we can achieve the same learning accuracy by drawing the same level of samples and solving the empirical objective approximately with the stochastic optimization algorithm of Wang et al. (2016). This algorithm is based on the shift-and-invert power iterations (Saad, 1992; Garber and Hazan, 2015; Garber et al., 2016). We provide tightened analysis of the algorithm s time complexity, removing an extra log 1 ϵ factor from the complexity given by Wang et al. (2016). Our analysis shows that asymptotically it suffices to process the sample set for O log 1 ϵ passes. While near-linear runtime in the required number of samples is known and achieved for convex learning problems using SGD, no such result was estabilished for the nonconvex CCA objective previously. Third, we show that the streaming version of shift-and-invert power iterations achieves the same learning accuracy with the same level of sample complexity, given a good estimate of the canonical correlation. This approach requires only O(d) memory where d := dx + dy is the input dimensionality, and thus further alleviates the memory cost of solving the empirical objective. This addresses the challenge of the existence of a stochastic algorithm for CCA proposed by Arora et al. (2012). Notation We use σi(A) to denote the i-th largest singular value of a matrix A, and use σmax(A) and σmin(A) to denote the largest and smallest singular values of A respectively. We use to denote the spectral norm of a matrix or the ℓ2-norm of a vector. For a positive definite matrix M, the vector norm M is defined as w M = w Mw for any w. We use C and C to denote universal constants that are independent of problem parameters, and their specific values may vary among appearances. We hide poly-logarithmic dependencies in the notation O( ). 2. Problem setup Assumptions We assume the following properties of the input random variables. 1. Bounded covariances: Eigenvalues of population auto-covariance matrices are bounded:2 max ( Exx , Eyy ) 1, γ := min (σmin(Exx), σmin(Eyy)) > 0. Hence Exx and Eyy are invertible with condition numbers bounded by 1/γ. 2. Singular value gap: For the purpose of learning the canonical directions (u , v ), we assume that there exists a positive singular value gap := σ1(T) σ2(T) (0, 1), such that the top leftand right-singular vector pair of T is uniquely defined. Distribution classes In this paper, we analyze three input distribution classes commonly used in the statistics and machine learning literature. Let z = Exx Exy E xy Eyy 2. CCA is invariant to linear transformations of the inputs, so we could always rescale the data. Gao, Garber, Srebro, Wang, and Wang the distribution classes are defined with (x, y, z) as follows. (Sub-Gaussian) Let z be isotropic and sub-Gaussian, that is, E zz = I and there exists constant C > 0 such that P q z > t exp( Ct2) for any unit vector q. (Regular polynomial-tail, Srivastava and Vershynin, 2013) Let z be isotropic and regular polynomial-tail, that is, E zz = I and there exist constants r > 1, C > 0 such that P Vz 2 > t Ct 1 r for any orthogonal projection V in Rd and any t > C rank (V). Note that this class is general and only implies the existence of a (4 + δ)-moment condition for some δ > 0. (Bounded) Let x and y be bounded and in particular sup x 2 , y 2 1 (which implies max ( Exx , Eyy ) 1 as in Assumption 1). As shown later, these classes satisfy the same concentration property, allowing us to study them (and potentially other distributions) in a unified framework. Measure of error For an estimate (u, v) of the optimal solution to (2), which need not be correctly normalized (i.e., they may not satisfy the constraints of (2)), we can always define (u, v) := E 1 2 xxu , v as the correctly normalized version. And we can measure the quality of these directions by the alignment (cosine of the angle) between , or the sum of alignment between E alignment between E 1 2yyv (all vectors have unit length): align ((u, v); (u , v )) := 1 1 2xxu + v Eyyv This measure of alignment is invariant to the lengths of u and v, and achieves the maximum of 1 if (u, v) lie in the same direction as (u , v ). Intuitively, this measure respects the geometry imposed by the CCA constraints that the projections of each view have unit length. As we will show later, this measure is also closely related to the learning guarantee we can achieve with power iterations. Moreover, high alignment implies accurate estimate of the canonical correlation. Lemma 1 Let η (0, 1). If align ((u, v); (u , v )) 1 η v Eyyv ρ1(1 η). All proofs are deferred to the appendix. Stochastic Canonical Correlation Analysis 3. The sample complexity of ERM One approach to address this problem is empirical risk minization (ERM): We draw N samples {(xi, yi)}N i=1 from P(x, y) and solve the empirical version of (2): max u,v u Σxyv s.t. u Σxxu = v Σyyv = 1 (6) where the empirical covariance matrices are defined as i=1 xiy i , Σxx = 1 i=1 xix i , Σyy = 1 i=1 yiy i . (7) Similarly, define the empirical version of T as 2 xx ΣxyΣ 1 2 yy Rdx dy. (8) We will approximate the population canonical correation and directions based on solution to the above empirical objective. Before going to the detailed analysis, we highlight the key property that enable us to study different input distributions in a unified manner. In fact this property is the only place we handle the stochasticity of data in studying ERM. Proposition 2 (Concentration property) For any ν > 0, with sufficiently large sample sizes N0(ν), the following inequality is satisfied with high probability by sub-Gaussian, regular polynomial-tail, and bounded random variables: 3 2 xx Σxx E 1 2 xx I , E 1 2 yy Σyy E 1 2 yy I , E 1 2 xx (Σxy Exy)E 1 2 yy ν. (9) We provide detailed bounds on N0(ν) for different distributions in Lemma 3. Roadmap for this section We proceed to analyze the sample complexities, eventually obtained in Theorem 9. We first analyze the concentration property of different classes in Lemma 3, and provide the number of samples needed to guarantee small perturbation between b T and T in Lemma 6, which by the Weyl s inequality (Horn and Johnson, 1986) provides the sample complexity for learning canonical correlations (regardless of the existence of a singular value gap for T). Then by the perturbation of singular vectors and after fixing the issue of normalization, we obtain guarantees for the alignment between the estimated and the optimal canonical directions. 3.1. Approximating the canonical correlation We first discuss the error of approximating ρ1 by bρ1 = σ1(b T). Observe that, although the empirical covariance matrices are unbiased estimates of their population counterparts, we do not have E[b T] = T due to the nonlinear operations (matrix multiplication, inverse, and square root) involved in computing T. Nonetheless, we can provide approximation 3. We refrain from specifying the failure probability as it only adds additional mild dependences to our results. Gao, Garber, Srebro, Wang, and Wang guarantee based on concentrations. We will separate the probabilistic property of data the concentration property in Proposition 2 from the deterministic error analysis, and we show below that it is satisfied by distributions considered here. Lemma 3 Let Assumption 1 hold for the random variables. Then the concentration property (9) is satisfied with high probability, if ν2 for the sub-Gaussian class, N0(ν) C d ν2(1+r 1) for the polynomial-tail class, N0(ν) C 1 ν2γ2 for the bounded class. Remark 4 When (x, y) have nonzero means, we use the unbiased estimate of covariance matrices Σxy = PN i=1(xi x)(yi y) N 1 , Σxx = PN i=1(xi x)(xi x) N 1 , and Σyy = PN i=1(yi y)(yi y) N 1 instead of those in (7), where x = 1 N PN i=1 xi and y = 1 N PN i=1 yi. We have similar concentration results, and all results in Sections 3 and 4 still apply. We will decompose the difference T b T and apply the above concentration results. In the decomposition, we need to bound terms of the form E 1 1 2xx I. Such bounds can be derived from our assumption on E 1 2 xx Σxx E 1 2 xx I using Lemma 5 below. This lemma is derived from the main result of Mathias (1997), with extra effort taken to understand the size of perturbation for which higher order error terms can be safely ignored. Lemma 5 (Perturbation of matrix square root) Let H Rd d be positive definite, with eigenvalues in the range [σmin, σmax] for some σmin > 0. Let Θ Rd d be Hermitian, satisfying H 1 2 = 1. Then for ζ 3 4σ 2 maxσ2 min, we have (H + ζ Θ) 1 2 H 1 where Cd = O(log d) is independent of ζ. Lemma 6 Assume that we draw N samples {(xi, yi)}N i=1 independently from the underlying joint distribution P(x, y) for computing the sample covariance matrices in (7), and P(x, y) satisfies Assumption 1 and the concentration property (9). Then for ν 1 4γ2, we have |bρ1 ρ1| T b T 4Cd ν where Cd is the same constant in Lemma 5. We note that the requirement of ν = O(γ2) is not too constraining, since the size of the perturbation ν is closely related to the statistical error, and we are mainly interested in the regime of the statistical error going to zero. It is then straightforward to combine Lemma 3 and Lemma 6 to obtain the following sample complexities for the three distribution classes. Stochastic Canonical Correlation Analysis Corollary 7 (Sample complexity for learning canonical correlation by ERM) Let ϵ (0, 1) and ϵ Cdγ2. Then for N N0 ϵ 4Cd N C d log2 d ϵ 2 for the sub-Gaussian class, N C d log2(1+r 1) d ϵ 2(1+r 1) for the polynomial-tail class, ϵ 2γ2 for the bounded class, we have with high probability that |bρ1 ρ1| ϵ . Remark 8 Due to better concentration properties, the sample complexity for the sub-Gaussian and regular polynomial-tail classes are independent of the condition number 1 γ of the autocovariances. Comparison to Arora et al. (2017) In a parallel work by Arora et al. (2017), the authors studied the top-k stochastic CCA for bounded inputs, and proposed stochastic approximation-type algorithms with O 1 ϵ 2γ2 sample-complexity upper bound for approximating the top canonical correlation. We note, however, their stochastic algorithms are derived from the convex relaxation of stochastic CCA, which lifts the original problem into the space of matrices in Rdx dy and requires a whitening operation (multiplying each fresh sample by Σ 1 2 xx or Σ 1 2 yy ) and a projection operation (onto the set of low 2-norm and nuclearnorm matrices) in each iteration, which are inefficient in high dimensions. Our work studies three different classes of input distributions in a uniform manner4, with the goal of matching the statistical limits for the Gaussian inputs (see Section 5.2). The algorithms we provide in the next sections require only elementary vector operations and thus more practical for high dimensional data. 3.2. Approximating the canonical directions We now discuss the error in learning (u , v ) by ERM, when T has a singular value gap > 0. Let the nonzero singular values of T be 1 ρ1 ρ2 ρr, where r = rank(T) min(dx, dy), and the corresponding (unit-length) singular vector pairs be (a1, b1), . . . , (ar, br). Define C = 0 T T 0 The eigenvalues of C are ρ1 ρr > 0 = = 0 > ρr ρ1, 4. If we only study the case of bounded inputs, we can bypass Lemma 5 and the log d dependence in our bound can be reduced. Gao, Garber, Srebro, Wang, and Wang with corresponding unit eigenvectors , . . . , 1 , . . . , 1 , . . . , 1 Thus, learning canonical directions (u , v ) reduces to learning the top eigenvector of C. We denote the empirical version of C by b C, and the singular vector pairs of b T by {(bai, bbi)}. Due to the block structure of C and b C, we have C b C = T b T . Let the ERM solution be (bu, bv) = Σ 1 2 xx ba1, Σ 1 , which satisfy Σ 1 2xxbu = Σ 1 2yybv = 1. We now state the sample complexity for learning the canonical directions by ERM. Theorem 9 Let ϵ (0, 1) and ϵ 16C2 dγ4 2 . Then for N N0 ϵ N C d log2 d ϵ 2 for the sub-Gaussian class, N C d log2(1+r 1) d ϵ(1+r 1) 2 for the regular polynomial-tail class, ϵ 2γ2 for the bounded class, we have with high probability that align ((bu, bv); (u , v )) 1 ϵ. Proof sketch The proof of Theorem 9 consists of two steps. We first bound the error between b C s top eigenvector 1 and C s top eigenvector 1 a standard result on perturbation of eigenvectors, namely the Davis-Kahan sin θ theo- rem (Davis and Kahan, 1970) which states sin2 θ C b C 2 2 where θ is the angle between top eigenvectors of C and b C. We then show that 1 is very close to the correctly normalized 1 , so the later still aligns well with the population solution. Comparison to prior analysis For the sub-Gaussian class, the tightest analysis of the sample complexity upper bound we are aware of was by Gao et al. (2017). However, their proof relies on the assumption that ρ2 = o(ρ1), i.e., they require that ρ2 ρ1. In contrast, we do not require this assumption, and our bound is sharp in terms of the gap = ρ1 ρ2. Up to the log2 d factor, our ERM sample complexity for the same loss matches the minimax lower bound d ϵ 2 given by Gao et al. (2017) (see also Section 5.2). 4. Stochastic optimization for ERM A disadvantage of the empirical risk minimization approach is that it can be time and memory consuming. To obtain the exact solution to (6), we need to explicitly form and Stochastic Canonical Correlation Analysis store the covariance matrices and to compute their singular value decompositions (SVDs); these steps have a time complexity of O(Nd2 + d3) and a memory complexity of O(d2). In this section, we study the stochastic optimization of the empirical objective, and show that the computational complexity is low: We just need to process a large enough dataset (with the same level of samples as ERM requires) nearly constant times in order to achieve small error with respect to the population objective. The basic algorithm we use here is the shift-and-invert meta-algorithm proposed by Wang et al. (2016). However, in this section we provide refined analysis of the algorithm s time complexity than that provided by Wang et al. (2016). We show that, using a better measure of progress and careful initializations for each least squares problem, the algorithm enjoys linear convergence (see Theorem 12), i.e., the time complexity for achieving η-suboptimalilty in the empirical objective depends on log 1 η, whereas the result of Wang et al. (2016) has a dependence of log2 1 We also note that the recent work of Allen-Zhu and Li (2016) and Allen-Zhu and Li (2017) have extended the ERM problem to extracting the top k 1 pairs of canonical directions, and applied the technique of peeling/deflation together with shift-and-invert. However, their convergence rate for the fist pair of canonical directions does not improve that of Wang et al. (2016).5 As mentioned above, our result strictly improves that of Wang et al. (2016), and in particular replaces the O( ) notation with the O( ) notation in total runtime, achieving true linear convergence. Roadmap for this section We first introduce the shift-and-invert power iterations and provide its iteration complexity, assuming that each matrix-vector multiplication or equivalently a convex least squares problem is solved to sufficient accuracy (Lemma 10). We then show each least squares can be warm-started using rescaled estimates from the previous iteration (Lemma 11). Finally, we plug in the time complexity of SVRG for each subproblem, and give runtime complexities for each distribution class which have different condition numbers (Corollary 13). The condition numbers depend on, among other things, the smallest eigenvalues of the covariance matrices, which are bounded away from zero as discussed below. Eigenvalues of empirical covariance According to the analysis of ERM from previous section, we have been working in the regime that the concentration parameter in (9) satisfies ν γ2 2. Thus in view of Assumption 1, we have with high probability that Σxx Exx = E 2 xx Σxx E 1 2 xx Σxx E 1 and similarly Σyy Eyy γ 2. According to Weyl s inequality, these inequalities make sure eigenvalues of Σxx and Σyy lie in [γ 2], and consequently the involved subproblems are strongly-convex and can be solved efficiently. 5. See the second and third last lines of Table 1, and the last paragraph of Section 1.2 in Allen-Zhu and Li (2017): Our running time matches that of [29] when k = 1 . Gao, Garber, Srebro, Wang, and Wang 4.1. Shift-and-invert power iterations Our algorithm runs the shift-and-invert power iterations on the following matrix c Mλ = λI b C 1 = " λI b T b T λI where λ > bρ1. It is straightforward to see that c Mλ is positive definite with eigenvalues 1 λ bρ1 1 λ bρr 1 λ + bρr 1 λ + bρ1 , and has the same set of eigenvectors as b C. Assume that there exists a singular value gap for b T (this can be guaranteed by drawing sufficiently many samples so that the singular values of b T are within a fraction of the gap of T), denoted as b = bρ1 bρ2. The key observation is that, as opposed to running power iterations on b C (which is essentially done by Ge et al. 2016), c Mλ has a large eigenvalue gap when λ = bρ1 + c(bρ1 bρ2) with c = O(1), and thus power iterations on c Mλ converge more quickly. In particular, we assume for now the availability of an estimated eigenvalue λ such that λ bρ1 [l b , ub ] where 0 < l < u < 1; locating such a λ is discussed later in Remark 14. Define b Aλ := λΣxx Σxy Σ xy λΣyy , b B := Σxx 0 0 Σyy and we have c Mλ = b B 1 2 b A 1 λ b B 1 2 . And by the relationship b Aλ = b B 1 2 c M 1 λ b B 1 2 , eigenvalues of b Aλ are bounded: σmax b Aλ σmax c M 1 λ σmax b B (λ + bρ1)(1 + γ σmin b Aλ σmin c M 1 λ σmin b B (λ bρ1)γ/2. It is convenient to study the convergence in the concatenated variables , rt := b B 1 2 wt = 1 Define the following quantities using the ERM solution , br := b B 1 2 bw = 1 which satisfy bw b Bbw = 1 and br br = 1 respectively. Stochastic Canonical Correlation Analysis 4.2. Convergence of inexact shift-and-invert Our algorithm iteratively applies the approximate matrix-vector multiplications: for t = 0, 1, . . . rt+1 c Mλrt, wt+1 b A 1 λ b Bwt. (12) This equivalence allows us to directly work with (ut, vt) and avoids computing Σ 1 2yy explicitly. Note that we do not perform normalizations of the form wt wt/ b B 1 2 wt at each iteration as done by Wang et al. (2016) (Phase-I of their SI meta-algorithm); the length of each iterate is irrelevant for the purpose of optimizing the alignment between vectors and we could always perform the normalization in the end to satisfy the length constants. Exact power iterations is known to converge linearly when there exist an eigenvalue gap (Golub and van Loan, 1996). The matrix-vector multiplication b A 1 λ b Bwt is equivalent to solving the least squares problem min w ft+1(w) := 1 2w b Aλw w b Bwt (13) whose unique solution is w t+1 = b A 1 λ b Bwt with the optimal objective f t+1 = 1 2w t b B b A 1 λ b Bwt. Of course, solving the problem exactly is costly and we will apply stochastic gradient methods to it. We will show that, when the least squares problems are solved accurately enough, the iterates are of the same quality as those of the exact solutions and enjoys linear convergence. We begin by introducing the measure of progress for the iterates. Denote the eigenvalues of c Mλ by β1 β2 βd, with corresponding eigenvectors p1, . . . , pd forming an orthonormal basis of Rd. Recall that p1 = br, p i c Mλpi = βi for i = 1, . . . , d, and p i c Mλpj = 0 for i = j. We therefore can write each iterate as a linear combination of the eigenvectors: rt rt = Pd i=1 ξtipi, where ξti = r t pi rt for i = 1, . . . , d, and Pd i=1 ξ2 ti = 1. The potential function we use to evaluate the progress of each iteration is c M 1 λ P rt rt q Pd i=2 ξ2 ti/βi p where P and P denote projections onto the subspaces perpendicular and parallel to br respectively. The same potential function was used by Garber et al. (2016) for analyzing the convergence of shift-and-invert for PCA. The potential function is invariant to the length of rt, and is equivalent to the criterion |tan θt| := q Pd i=2 ξ2 ti ξ2 t1 where θt is the angle between rt and br: in the following sense: β1 β2 |tan θt| G(rt) β1 βd |tan θt| . Gao, Garber, Srebro, Wang, and Wang The lemma below shows that under the iterative scheme (12), {G(rt)}t=1,... converges linearly to 0. Lemma 10 Let η (0, 1). Assume that for each approximate matrix-vector multiplication, we solve the least squares problem so accurately that the approximate solution wt+1 satisfies ϵt := ft+1(wt+1) f t+1 w t b Bwt min i=2 ξ2 ti/βi, ξ2 t1/β1 Let T = log 7 η . Then we have |sin θt| G(rt) η for all t T. 4.3. Bounding initial error for least squares It is natural to use an initialization of the form αwt for minimizing ft+1(w). The following lemma provides the optimal α and the resulting initial suboptimality, see detailed analysis in Appendix D.2. Lemma 11 (Warm start for least squares) Initializing minw ft+1(w) with α t wt where α t = w t b Bwt w t b Aλwt , it suffices to set the ratio between the initial and the final error to be 64 max (1, G(rt)) so that (14) is satisfied. This result indicates that in the converging stage (G(rt) 1), we just need to set the ratio between the initial and the final error to the constant 64 (and set it to be the constant 64G(r0) before that). This will ensure that the time complexity of least squares has no dependence on the final error ϵ. 4.4. Solving the least squares by SGD The least squares objective (13) is the sum of N functions: ft+1(w) = 1 N PN i=1 fi t+1(w) where fi t+1(w) = 1 2w λxix i xiy i yix i λyiy i w w Σxx 0 0 Σyy There has been much recent progress on developping linearly convergent stochastic algorithms for solving finite-sum problems. We use SVRG (Johnson and Zhang, 2013) here due to its algorithmic simplicity and memory efficiency; in the next section, we will be using the online version of SVRG for stochastic CCA in the streaming setting. Note that although ft+1(w) is convex, each component fi t+1 may not be convex. We provide the time complexity of SVRG for this case (based on Garber and Hazan, 2015, Appendix B), as well as the condition number for the three classes of distributions in Appendix D.3 and D.4 respectively. 4.5. Total time complexity We first provide the runtime for solving the empirical objective using the (offline) shift-andinvert CCA algorithm. Stochastic Canonical Correlation Analysis Theorem 12 Let η (0, 1). Draw N samples for ERM such that σmin(Σxx) γ 2 and σmin(Σyy)) γ 2. Initialize w0 = w0 q w 0 b B w0 where entries of w0 Rd are randomly sampled from the standard Gaussian distribution. Then with high probability, offline shift-and-invert outputs an (u T , v T ) satisfying min Σ 1 2 xxu T , v T Σyybv Σ 1 2 yyv T 1 η in total time b γ log d b γη for sub-Gaussian/polynomial-tail, O d N + 1 b 2γ2 b γ log d b γη for the bounded class. We have already shown in Theorem 9 that the ERM solution aligns well with the population solution. By drawing slighly more samples and requiring our algorithm to find an approximate solution that aligns well with the ERM solution, we can guarantee high alignment for the approximate solution as shown in the following corollary. Corollary 13 Let ϵ (0, 1) and ϵ 64C2 dγ4 2 . Draw N = N0 ϵ samples for the ERM objective, and use the initialization strategy in Theorem 12. Then with high probability, the total time for offline shift-and-invert to output (u T , v T ) with align ((u T , v T ); (u , v )) 1 ϵ is O d d log2 d for sub-Gaussian, d log2(1+r 1) d ϵ(1+r 1) 2 + d2 for polynomial-tail, ϵ 2γ2 + 1 2γ2 for the bounded class. The ϵ-dependent term is near-linear in the ERM sample complexity N(ϵ, , γ) and is also the dominant term in the total runtime (when ϵ = o(γ2) for the first two classes). For sub-Gaussian/regular polynomial-tail classes, we incur an undesirable d2 dependence for the least squares problem s condition number (see more details in Appendix D.3), mainly due to weak concentration regarding the data norm (we have stronger concentration for the streaming setting discussed next). One can alleviate the issue of large condition number using accelerated SVRG (Lin et al., 2015). Remark 14 We have assumed so far the availability of λ = bρ1+c(bρ1 bρ2) with c = O(1) for shift-and-invert to work. There exists an efficient algorithm for locating such an λ, see the repeat-until loop of Algorithm 3 in Wang et al. (2016). This procedure computes O log 1 approximate matrix-vector multiplications, and its time complexity does not depend on ϵ as we only want to achieve good estimate of the top eigenvalue (and not the top eigenvector). So the cost of locating λ is not dominant in the total runtime. Gao, Garber, Srebro, Wang, and Wang 5. Streaming shift-and-invert CCA A disadvantage of the ERM approach is that we need to store all the samples in order to go through the dataset multiple times. We now study the shift-and-invert algorithms in the streaming setting in which we draw samples from the underlying distribution P(x, y) and process them once. Clearly, the streaming approach requires only O(d) memory. In this section, we assume the availability of a λ = ρ1 + c , where 0 < c < 1.6 Our algorithm is the same as in the ERM case, except that we now directly work with the population covariances through fresh samples instead of their empirical estimates. With slight abuse of notation, we use (Aλ, B, Mλ) to denote the population version of ( b Aλ, b B, c Mλ): Aλ := λExx Exy E xy λEyy , B := Exx 0 0 Eyy , Mλ = B 1 2 A 1 λ B 1 2 , use {(βi, pi)}d i=1 to denote the eigensystem of Mλ, and use (ut, vt) as well as , rt = B 1 2 wt = 1 t = 0, . . . to denote the iterates of our algorithm. Also, define ξti, θt and G(rt) similarly as in Section 4. Handling normalizations It is sufficient to achieve high alignment between u T Exxu T + v T Eyyv T where (u, v) are normalized jointly, and r = 1 where (u, v) are normal- ized separately. According to the lemma below, this would imply high alignment between 1 2xxu T / E 1 2yyv T / E and r which is our final goal. Lemma 15 (Conversion from joint alignment to separate alignment) Let η (0, 1). If the output (u T , v T ) of our online shift-and-invert algorithm satisfy that 2 u Exxu T + v Eyyv T q u T Exxu T + v T Eyyv T 1 η we also have align ((u T , v T ); (u , v )) = 1 u T Exxu T + v Eyyv T q 6. Based on the same intuition given in Remark 14, we believe that a procedure similar to that of Wang et al. (2016) also works in the streaming setting and the cost in locating λ is not dominant, although we do not have a formal analysis. Stochastic Canonical Correlation Analysis Algorithm 1 Streaming SVRG for minw f(w). Input: Initialization w0 = 0, stepsize scaling factor s = 1 352, (µ, S, σ2) are respectively the strong convexity, streaming smoothness, and streaming variance given in Lemma 16. for τ = 1, . . . , Γ do µ , kτ max 44S µ , 20σ2 2τ 1 Draw kτ samples (x1, y1), . . . , (xkτ , ykτ ) and estimate the batch gradient i=1 φ( z; xi, yi) Sample emτ uniformly at random from {1, . . . , mτ} z z for i = 1, . . . , emτ do Draw sample (xi, yi) z z s S ( φ(z; xi, yi) φ( z; xi, yi) + g) end for wτ z end for Output: Return wΓ as the approximate solution. Note that Lemma 15 improves over a similar result by Wang et al. (2016, Theorem 5), which requires the joint alignment to be O(η2)-suboptimal for the separate alignment to be O(η)-suboptimal. 5.1. Solving least squares by streaming SVRG Turning to the streaming algorithm, the least squares problem at iteration t + 1, is now a stochastic program: min w ft+1(w) = 1 2w Aλw w Bwt = E [φt+1(w; x, y)] where φt+1(w; x, y) := 1 2w λxx xy w w xx 0 0 yy wt, and the ex- pectation is computed over P(x, y). The optimal solution to this stochastic program is w t+1 = A 1 λ Bwt. Due to the high sample complexity of accurately estimating α t = w t Bwt w t Aλwt in the streaming setting, we instead initialize each linear systems with the zero vector. With this initialization, we have ft+1(0) f t+1 = 0 1 2w t BA 1 λ Bwt We then solve the linear system with the streaming SVRG algorithm proposed by Frostig et al. (2015), as detailed in Algorithm 1. This is the same approach taken by Garber et al. Gao, Garber, Srebro, Wang, and Wang (2016) for streaming PCA, and our analysis follows the same structure. Streaming SVRG is a natural choice here since it is the online version of the SVRG algorithm for optimizing empirical objectives and enjoys the same algorithmic simplicity and low computational complexity. Moreover, for stochastic least squares problems, streaming SVRG is shown to have the same sample complexity as solving the ERM problem (Frostig et al., 2015), which aligns well with our goal of an overall sample efficient algorithm. With this choice, the final algorithm is very similar to the stochastic optimization algorithm in Section 4, except that fresh samples are used for each update. To analyze the sample complexity of streaming SVRG, we first calculate the streaming smoothness and streaming variance parameters for the three classes of distributions. Lemma 16 (Parameters of streaming SVRG) For any w, w Rd, we have Strong convexity: ft+1(w) ft+1(w ) + ft+1(w ), w w + µ Streaming smoothness: E h φt+1(w) φt+1(w t+1) 2i 2S ft+1(w) f t+1 , Streaming variance: φ(w t+1) 2 ( 2f(w t+1)) 1 σ2. Here µ := γ β1 C γ for some C > 0, and , σ2 = O dβ3 1 rt 2 for sub-Gaussian/regular polynomial-tail classes, and for the bounded class. The proof of Lemma 16 is somewhat technical for the sub-Gaussian/regular polynomialtail classes, which repeatedly applies the concentration properties of these two classes. But this lemma is the key for the sample complexity of our streaming algorithm to match the lower bound in the case of sub-Gaussian inputs: since we always draw fresh samples in the streaming setting, the condition number S/µ for these two classes depend on d only linearly (as opposed to quadratically in approximate ERM). These quantities determine the number of samples to be used in each round τ of Algorithm 1: mτ is on the order of the condition number, and with mτ stochastic updates, one can reduce the suboptimality by a constant factor in each round; κτ has to eventually increase geometrically to make sure the variance is reduced at the same pace. Based on these quantities, we can apply the structural result of Frostig et al. (2015) and give the sampling complexity for driving the final suboptimality to ηt times the initial suboptimality in (16). Stochastic Canonical Correlation Analysis Lemma 17 (Sample complexity of streaming SVRG for least squares) Let ηt (0, 1). Applying streaming-SVRG in Algorithm 1 to minw ft+1(w) with initialization 0, we have E ft+1(wτ) f t+1 ηt for τ Γ = O log 1 . The sample complexity of the first Γ iterations is O d 2ηt + d 2γ2 log 1 for the sub-Gaussian/regular polynomial-tail classes, and O 1 2γ2ηt for the bounded class. Based on the linear convergence of shift-and-invert, we need only solve O log 1 ϵ linear systems, and we can bound 1 η by a geometrically increasing series where the last term is O 1 ϵ (so the sum of this truncated series is still O 1 ϵ ). This results in the following total sample complexity. Theorem 18 (Total sample complexity of streaming shift-and-invert CCA) Let ϵ (0, 1). After solving T = O log 1 ϵ linear systems to sufficient accuracy, streaming shiftand-invert CCA algorithm outputs (u T , v T ) with align ((u T , v T ); (u , v )) 1 ϵ. Our algorithm processes each sample in O(d) time, and has a total sample complexity of ϵ 2 + d 2γ2 log2 1 for the sub-Gaussian/regular polynomial-tail classes, for the bounded class. Interestingly, the sample complexity of our streaming CCA algorithm (assuming the parameter λ) improves over that of ERM we showed in Theorem 9: it removes small log d factors for all classes, and most remarkably achieves polynomial improvement in ϵ for the regular polynomial-tail class. This is due to the fact that the sample complexity of streaming SVRG basically only uses the moments, and does not require concentration of the whole covariance in Lemma 3. As a result, it is not clear if our analysis of ERM is the tightest possible. 5.2. Lower bound for Gaussian inputs Consider the following Gaussian distribution named single canonical pair model (Chen et al., 2013): x y where φ = ψ = 1. It is straightforward to check that T = Exy = φψ for such a distribution. Observe that T is of rank one and has a singular value gap , and the single pair of canonical directions are (u , v ) = (φ, ψ). Denote this class of model by F(dx, dy, ). We have the following minimax lower bound for CCA under this model, which is an application of the result of Gao et al. (2017) for sparse CCA (by using rank r = 1 and hard sparsity, i.e., q = 0 and sparsity level d in their Theorem 3.2). Gao, Garber, Srebro, Wang, and Wang Table 1: Summary of sample, time (measured in floating point operations), and memory complexities of different approaches, in terms of (d, , ϵ), for stochastic CCA with Gaussian inputs. We give the dominant term in complexities as ϵ 0. Note that the time complexity of exact ERM is dominated by forming the eigen-system, while the memory complexity of ERM is dominated by saving the dataset. Method Sample Time Memory Exact ERM O d ϵ 2 O d3 ϵ 2 Approximate ERM by shift-and-invert ϵ 2 O d2 ϵ 2 Streaming shift-and-invert (assuming close init. for λ) O d Lemma 19 (Lower bound for single canonical pair model) Suppose the data is generated by the single canonical pair model. Let (u, v) be some estimate of the canonical directions (u , v ) based on N samples. Then, there is a universal constant C, so that for N sufficiently large, we have: inf u,v sup u ,v F(dx,dy, ) E [1 align ((u T , v T ); (u , v ))] C d 2N . This lemma implies that, to estimate the canonical directions up to ϵ-suboptimality in our measure of alignment, we expect to use at least O d ϵ 2 samples. We therefore observe that, for Gaussian inputs, the sample complexity of the our streaming algorithm matches that of the minimax rate of CCA, up to small factors. In Table 1, we collect the complexities of different approaches, namely exact optimization of ERM (Section 3), stochastic optimization of ERM with shift-and-invert (Section 4), and streaming shift-and-invert (Section 5). We observe that while all three approaches are sample efficient (up to small factors), stochastic and streaming algorithms are more efficient in time and memory. 6. Conclusion In this paper, we have studied the sample complexity of population CCA for several classes of input distributions, and proposed sample-efficient algorithms for learning the first pair of canonical directions. While the original problem is nonconvex, we exploit its structure as an eigenvalue problem, and analyze the statistical performance of the shift-and-invert power iterations. Based on the deflation/peeling scheme (Allen-Zhu and Li, 2016, 2017) for eigenvalue problems, our results shall be extended to extracting the top-k canonical direction pairs. Our algorithms also apply to related eigenvalue problems in machine learning, such as partial least squares (Chen et al., 2017) and linear discriminant analysis (Bach and Jordan, 2005), which are special versions of CCA with the population covariances being identity Stochastic Canonical Correlation Analysis (i.e., Exx = Eyy = I) and y being one-hot representations for class labels respectively. It is an interesting question if our general approach can be adapted to study the statistical performance of the kernel extension of CCA (Fukumizu et al., 2007). Acknowledgement Research partially supported by NSF BIGDATA award 1546462. Zeyuan Allen-Zhu and Yuanzhi Li. Lazy SVD: Even faster SVD decomposition yet without agonizing pain. In Advances in Neural Information Processing Systems, 2016. Zeyuan Allen-Zhu and Yuanzhi Li. Doubly accelerated methods for faster CCA and generalized eigendecomposition. In Proc. of the International Conference on Machine Learning, 2017. Raman Arora, Andy Cotter, Karen Livescu, and Nati Srebro. Stochastic optimization for PCA and PLS. In 50th Annual Allerton Conference on Communication, Control, and Computing, 2012. Raman Arora, Teodor V. Marinov, Poorya Mianjy, and Nathan Srebro. Stochastic approximation for canonical correlation analysis. In Advances in Neural Information Processing Systems, 2017. Sanjeev Arora, Satish Rao, and Umesh Vazirani. Expander flows, geometric embeddings and graph partitioning. Journal of the ACM, (2), 2009. Francis R. Bach and Michael I. Jordan. A probabilistic interpretation of canonical correlation analysis. Technical Report 688, Department of Statistics, University of California, Berkeley, April 21 2005. Marcus Carlsson. Perturbation theory for the matrix square root and matrix modulus. ar Xiv:1810.01464 [math.FA], October 2 2018. Mengjie Chen, Chao Gao, Zhao Ren, and Harrison H. Zhou. Sparse CCA via precision adjusted iterative thresholding. ar Xiv:1311.6186 [math.ST], November 24 2013. Zhehui Chen, Lin F. Yang, Chris J. Li, and Tuo Zhao. Dropping convexity for more efficient and scalable online multiview learning. In Proc. of the International Conference on Machine Learning, 2017. Chandler Davis and W. M. Kahan. The rotation of eigenvectors by a perturbation III. SIAM Journal of Numerical Analysis, 7(1):1 46, 1970. Roy Frostig, Rong Ge, Sham M. Kakade, and Aaron Sidford. Competing with the empirical risk minimizer in a single pass. In Conference on Learning Theory (COLT), pages 728 763, 2015. Gao, Garber, Srebro, Wang, and Wang Kenji Fukumizu, Francis R. Bach, and Arthur Gretton. Statistical consistency of kernel canonical correlation analysis. Journal of Machine Learning Research, 8:361 383, February 2007. Chao Gao, Zongming Ma, and Harrison H. Zhou. Sparse CCA: Adaptive estimation and computational barriers. Annals of Statistics, 45(5):2074 2101, 2017. Dan Garber and Elad Hazan. Fast and simple PCA via convex optimization. ar Xiv:1509.05647 [math.OC], November25 2015. Dan Garber, Elad Hazan, Chi Jin, Sham M. Kakade, Cameron Musco, Praneeth Netrapalli, and Aaron Sidford. Faster eigenvector computation via shift-and-invert preconditioning. In Proc. of the International Conference on Machine Learning, 2016. Rong Ge, Chi Jin, Sham M. Kakade, Praneeth Netrapalli, and Aaron Sidford. Efficient algorithms for large-scale generalized eigenvector computation and canonical correlation analysis. In Proc. of the International Conference on Machine Learning, 2016. Eckart Gekeler. On the pointwise matrix product and the mean value theorem. Linear Algebra and its Applications, 35:183 191, 1981. Gene H. Golub and Charles F. van Loan. Matrix Computations. Johns Hopkins University Press, 1996. Roger A. Horn and Charles R. Johnson. Matrix Analysis. Cambridge University Press, 1986. Roger A. Horn and Charles R. Johnson. Topics in Matrix Analysis. Cambridge University Press, 1991. Harold Hotelling. Relations between two sets of variates. Biometrika, 28(3/4):321 377, 1936. Daniel Hsu, Sham Kakade, and Tong Zhang. A tail inequality for quadratic forms of subgaussian random vectors. Electron. Commun. Probab., 17(52):1 6, 2012. Rie Johnson and Tong Zhang. Accelerating stochastic gradient descent using predictive variance reduction. In Advances in Neural Information Processing Systems, 2013. Hongzhou Lin, Julien Mairal, and Zaid Harchaoui. A universal catalyst for first-order optimization. In Advances in Neural Information Processing Systems, 2015. Yichao Lu and Dean P. Foster. Large scale canonical correlation analysis with iterative least squares. In Advances in Neural Information Processing Systems, 2014. Zhuang Ma, Yichao Lu, and Dean Foster. Finding linear structure in large datasets with scalable canonical correlation analysis. In Proc. of the International Conference on Machine Learning, 2015. Roy Mathias. A bound for the matrix square root with application to eigenvector perturbation. SIAM J. Matrix Anal. and Apps., 18(4):861 867, 1997. Stochastic Canonical Correlation Analysis Yousef Saad. Numerical Methods for Large Eigenvalue Problems. Manchester University Press, 1992. Nikhil Srivastava and Roman Vershynin. Covariance estimation for distributions with 2+εmoments. Annals of Probability, 41(5):3081 3111, 2013. Roman Vershynin. Compressed Sensing: Theory and Applications, chapter Introduction to the Non-asymptotic Analysis of Random Matrices. Cambridge University Press, 2012. Weiran Wang and Karen Livescu. Large-scale approximate kernel canonical correlation analysis. In Proc. of the International Conference on Learning Representations, 2016. Weiran Wang, Jialei Wang, Dan Garber, and Nathan Srebro. Globally convergent stochastic optimization for canonical correlation analysis. In Advances in Neural Information Processing Systems, 2016. Lin Xiao and Tong Zhang. A proximal stochastic gradient method with progressive variance reduction. SIAM Journal on Optimization, 24(4):2057 2075, 2014. Appendix A. Auxiliary Lemmas Lemma 20 The population canonical correlation is bounded by 1, i.e., ρ1 = σ1 (T) 1. Proof By the Cauchy-Schwarz inequality of random variables, we have ρ1 = E[(u x)(v y)] q E[(u x)2] q E[(v y)2] = p v Eyyv = 1. Lemma 21 (Distance between normalized vectors) For two nonzero vectors a, b Rd, we have a a b b Proof By direct calculation, we have a a b b a + b | a b | Gao, Garber, Srebro, Wang, and Wang where we have used the triangle inequality in the two inequalities. Lemma 22 (Conversion from joint alignment to separate alignment) Let η 0, 1 4 . Consider the four nonzero vectors a, x Rdx and b, y Rdy such that a = b = 1. If 2 a x + b y q x 2 + y 2 1 η, (18) we also have Proof By the Cauchy-Schwarz inequality, we have a x + b y q x 2 + y 2 = a x x 2 + y 2 + b y Thus according to (18), we obtain 2 2(1 η)2 2 4η. y 2 1, this implies where the last step is due to the fact that x x for x (0, 1). Similarly we have b y 1 4η. Then the theorem follows. Lemma 23 (Moment inequalities of sub-Gaussian and regular polynomial-tail random vectors) Let z Rd be isotropic and sub-Gaussian or regular polynomial-tail (see their definitions in Lemma 3). Then for some constant C > 0, we have E z 2 d, E z 4 C d2, E q z 4 C where q is any unit vector. Stochastic Canonical Correlation Analysis Proof Sub-Gaussian case The first bound is by E z 2 = E tr zz = tr (I) = d. To prove the second one, note that according to Theorem 2.1 in Hsu et al. (2012), we have P z 2 > C1(d + t) < e t for all t > 0. Therefore 0 P z 4 > s ds 0 P z 4 > s ds + Z C2 1d2 P z 4 > s ds C2 1d2 exp s E q z 4 = Z 0 P q z 4 > s ds Regular polynomial-tail case The first bound is still by E z 2 = E tr zz = tr (I) = d. When r > 1, we have 0 P z 4 > s ds 0 P z 4 > s ds + Z C2d2 P z 4 > s ds C2d2 Cs 1+r To prove the last bound, take V = qq in the definition of regular polynomial-tail random vectors, and then P q z 2 > t Ct 1 r, for any t > C. We have E q z 4 = Z 0 P q z 4 > s ds 0 P q z 4 > s ds + Z C2 P q z 4 > s ds Gao, Garber, Srebro, Wang, and Wang Appendix B. Proofs for Section 1 B.1. Proof of Lemma 1 Proof Using the fact that u Exxu E 1 2 xxu and v Eyyv E 1 2 yyv are at most 1, the condition on alignment Since {ai}r i=1 and {bi}r i=1 are orthonormal, we have Observe that v Eyyv = (E 1 2xxu) T(E v u u u u u t 2v u u u u u t where we have used the Cauchy-Schwarz inequality in the first inequality. Appendix C. Proofs for Section 3 C.1. Proof of Lemma 3 Proof Sub-Gaussian/regular polynomial-tail cases Consider the random variable z defined in (5), and draw i.i.d. samples z1, . . . , zn of z. It is known that when the sample Stochastic Canonical Correlation Analysis size n is large enough (as specified in the lemma), we have i=1 ziz i I with high probability for the sub-Gaussian class (Vershynin, 2012) and for the regular polynomial-tail class (Srivastava and Vershynin, 2013), given N > C d ν2 and N C d ν2(1+r 1) respectively. We then turn to bounding the error in each covariance matrix. We note that the covariance of f := is Ξ = I T T I with Ξ = 1 + ρ1 2 (since the eigenvalues of Ξ are of the form 1 σi(T)). On the other hand, define Φ := Exx Eyy 2 Exx Exy E xy Eyy 2 satisfying ΦΦ = Ξ and we have f = Φz by the definition of z. Furthermore, fi = Φzi, i = 1, . . . , N are i.i.d. samples of f. Therefore, it holds that " 1 N PN i=1 E 1 2 xx xix i E 1 2 xx I 1 N PN i=1 E 1 2 xx xiy i E 1 1 N PN i=1 E 1 2 yy yix i E 1 2 xx T 1 N PN i=1 E 1 2 yy yiy i E 1 i=1 fif i Ξ i=1 ziz i I i=1 ziz i I i=1 ziz i I Since the norm of each block is bounded by the norm of the entire matrix, we conclude that the error in estimating each covariance matrix is bounded by ν, as required by Proposition 2. Remark 24 In view of Lemma 23 and the proof technique here, for the sub-Gaussian/regular polynomial-tail cases, the bound of z 2 leads to a bound for x 2 and y 2: E( x 2 + y 2) Exx E E 1 2 + Eyy E E 1 2 E f 2 2E z 2 Cd for some constant C > 0, where we have used Assumption 1 in the second inequality. And similarly, we have E( x 2 + y 2)2 E f 4 4E z 4 C d2 for some constant C > 0. Bounded case Consider the joint covariance matrix Exx Exy E xy Eyy Gao, Garber, Srebro, Wang, and Wang which has eigenvalue bounded by 2 due to the assumption that x 2 + y 2 2. Applying Vershynin (2012, Corollary 5.52), we obtain that Σxx Σxy Σ xy Σyy Exx Exy E xy Eyy with probability at least 1 d t2 when N C(t/ν )2 log d for some constant C > 0. Setting the failure probability δ = d t2 gives t2 = log 1 δ log d , and thus we require N C 1 δ for 1 δ success probability. Due to the block structure of the joint covariance matrix, (19) implies Σxy Exy ν , Σxx Exx ν , Σyy Eyy ν hold simultaneously. Now, to satisfy the first inequality of (9), observe that 2 xx Σxx E 1 2 xx I = E 1 2 xx (Σxx Exx)E 1 Σxx Exx E 1 where we have used the assumption that σmin(Exx) γ in the last inequality. Therefore, we obtain E 1 2 xx Σxx E 1 2 xx I ν by setting ν = γν in (19), and this yields the N0(ν) chosen in the lemma. The other two inequalities of (9) can be obtained analogously. C.2. Proof of Lemma 5 Proof This result can be derived from the main result of Mathias (1997) with a bit of detective work, which is needed to understand the higher order error term. As in Mathias (1997), assume without loss of generality that H = diag (λ1, . . . , λd) is diagonal. In the proof of Theorem 2, Mathias (1997) applied the Daleckii-Krein formula in his equation (4) with the O-notation, which can be rephrased as (see also Carlsson (2018)[Theorem 2.1]): for ζ in a neighborhood of 0, it holds (H + ζΘ) 1 2 H 1 2 ζ where denotes elementwise (Hadamard) multiplication. To locate the neighborhood of ζ for which the above is true, we apply the matrix mean value theorem (Gekeler, 1981) with the second order derivative of matrix square root (Horn Stochastic Canonical Correlation Analysis and Johnson, 1991, Theorem 6.6.30 and page 549) to obtain: 2 max τ [0,ζ] h ck(τ)c k (τ) i U(τ) (20) where H + τΘ = U(τ) diag (λ1(τ), . . . , λd(τ)) U(τ) is the eigenvalue decomposition of the perturbation of H, and ck(τ) is the k-th column of X(τ) = U(τ) ΘU(τ). Define Z(τ) = 1 i,j=1 . The summation enclosed in () of the right hand side of (20) can be written as Z(τ) (Z(τ) X(τ))2. Thus continuing from (20) yields 2 max τ [0,ζ] Z(τ) (Z(τ) X(τ))2 . On the one hand, by the assumption that H σmax, we have X(τ) = Θ = H 1 2 (H 1 2 )H 1 2 H H 1 2 σmax. (21) On the other hand, the matrix Z(τ) is positive semidefinite (see Horn and Johnson, 1991, Problem 9, page 348). Using the fact that A B (maxi Aii) B for positive semidefinite A and Hermitian B (Horn and Johnson, 1991, Theorem 5.5.18), we conclude ε ζ2σ2 max 2 max τ [0,ζ] max i 1 4σ 1 maxσmin. Then in view of (21) and the Weyl s inequality, λi(τ) σmin τ [0, ζ], and we have ε 1 2ζ2σ2 maxσ 3 2 min. To sum up, we have shown so far the following first order approximation: for certain error matrix E Rd d, it holds (H + ζΘ) 1 2 = H 1 2 + ζ 2 ) + E where E 1 2ζ2σ2 maxσ 3 Consequently, we have (H + ζΘ) 1 2 H 1 Mathias (1997) showed that the norm of the first term on the right hand size is of the order O(log d ζ). Combining this with the fact that EH 1 2ζ2σ2 maxσ 2 min, we conclude that EH 1 2 = O(ζ) for ζ = O(σ 2 maxσ2 min) and the lemma follows. Gao, Garber, Srebro, Wang, and Wang C.3. Proof of Lemma 6 Proof In view of the Weyl s inequality, we have |bρ1 ρ1| b T T = Σ 1 2 xx ΣxyΣ 1 2 xx Exy E 1 For the right hand side of (22), we have the following decomposition 2 xx ΣxyΣ 1 2 xx Exy E 1 2 xx (Σxy Exy) Σ 1 By the equality 2 B 1 2 A 1 2 A 1 the first term of the RHS of (23) becomes Σ 1 2 xx ΣxyΣ 1 2 xx Σxx E 1 2 xx I ν, according to Lemma 5, we have (by making the identification that H = Exx, ζ = E 1 2 xx (Σxx Exx)E 1 , and Θ = (Σxx Exx)/ E 1 2 xx (Σxx Exx)E 1 4γ2. Combining with the fact that Σ 1 2 xx ΣxyΣ 1 A similar bound can be obtained for the third term of (23). Observe that when E 1 2 yy Σyy E 1 ν < 1, we have Σyy Eyy ν and all eigenvalues of Σyy lie in [γ ν, 1 + ν]. Addi- tionally, E 1 2 yy Σyy E 1 2 yy is invertible, and all eigenvalues of Σ 1 2 yy EyyΣ 1 2 yy lie in [ 1 1+ν , 1 1 ν ], impling that Σ 1 2 yy EyyΣ 1 2 yy I ν 1 ν . According to Lemma 5, we have (by mak- ing the identification that H = Σyy, ζ = Σ 1 2 yy (Σyy Eyy)Σ 1 , and Θ = (Σyy 2 yy (Σyy Eyy)Σ 1 Stochastic Canonical Correlation Analysis 4 γ ν 1+ν 2 , which is satisfied for ν 1 4γ2. Therefore, we can bound the third term of (23) as E 1 2 xx Exy E 1 2 xx Exy E 1 where we have used the fact that E 1 2 xx Exy E 1 1 by Lemma 20. For the second term of (23), we have by assumption that E 1 2 xx (Σxy Exy) Σ 1 Applying the triangle inequality, we obtain from (23) that Σ 1 2 xx ΣxyΣ 1 2 xx Exy E 1 4Cd ν. (25) To sum up, it suffices to set ν = ϵ 4Cd to ensure b T T ϵ , and this yields the desired sample complexity. C.4. Proof of Theorem 9 Proof Apply Lemma 6 with ϵ = ϵ 4 where ϵ (0, 1) is the desired accuracy in Theorem 9. Since T ˆT ϵ < 4 , the eigenvalues of ˆT are within 4 of those of T due to Weyl s inequality, so there exists a positive singular gap of 2 for the empirical estimate ˆT, whose first pair of singular vectors is unique. In view of the off-diagonal structure of b C, we observe that C ˆC = T ˆT ϵ and that the top eigenvector of ˆC is unique. Then with the number of samples given in the theorem ensuring the ϵ perturbation, according to the Davis-Kahan sin θ theorem (Davis and Kahan, 1970), the top eigenvectors of C and b C are well aligned: where θ is the angle between the top eigenvector of C and that of b C. This is equivalent to Gao, Garber, Srebro, Wang, and Wang In the rest of the proof, we fix the issue of incorrect normalization of (bu, bv). Recall we have shown in the proof of Lemma 6 that (see e.g., (24)) I E Consequently, we have E where we have used the facts that (x + y)2 2x2 + 2y2 and Σ 1 2xxbu = 1 in the second inequality. According to Lemma 21, we then have and thus the alignment between these two vectors is A similar bound is obtained for bv: Averaging the above two inequalities yields the desired result. Requiring that ϵ = ϵ Cdγ2 as in Corollary 7 leads to the extra condition that ϵ 16C2 dγ4 Stochastic Canonical Correlation Analysis Appendix D. Proofs for Section 4 D.1. Proof of Lemma 10 Proof If we obtain an approximate solution wt+1 to (13), such that ft+1(wt+1) ft+1(w t+1) = ϵt(w t b Bwt), it holds that ϵt b B 1 2 wt 2 = 1 2 wt+1 w t+1 b Aλ wt+1 w t+1 b B 1 2 wt+1 b B 1 2 w t+1 b B 1 2 b Aλ b B 1 2 b B 1 2 wt+1 b B 1 2 w t+1 2 rt+1 r t+1 c M 1 λ rt+1 r t+1 = 1 rt+1 r t+1 2 c M 1 λ , or equivalently rt+1 r t+1 c M 1 λ = Note that our choice of ϵt is also invariant to the length of rt (or whether normalization is performed). For the exact solution to the linear system, we have r t+1 = c Mλrt = rt i=1 βiξtipi. As a result, we can bound the numerator and denominator of G(rt+1) respectively: P rt+1 rt+1 c M 1 λ 1 rt+1 P r t+1 c M 1 λ + P rt+1 r t+1 c M 1 λ P r t+1 c M 1 λ + rt+1 r t+1 c M 1 λ i=2 βiξ2 ti + P rt+1 rt+1 c M 1 λ 1 rt+1 P r t+1 c M 1 λ P rt+1 r t+1 c M 1 λ P r t+1 c M 1 λ rt+1 r t+1 c M 1 λ Gao, Garber, Srebro, Wang, and Wang Consequently, we have q Pd i=2 βiξ2 ti + 2ϵt p β1ξ2 t1 2ϵt β2 q Pd i=2 ξ2 ti/βi + 2ϵt ξ2 t1/β1 2ϵt β2 + 2ϵt q Pd i=2 ξ2 ti/βi As long as 2ϵt min q Pd i=2 ξ2 ti/βi, p i=2 ξ2 ti/βi, ξ2 t1/β1 we are guaranteed that G(rt+1) G(rt) β1 + 3β2 Substituting in βi = 1 λ bρi with λ bρ1 b , we obtain that β1 + 3β2 3β1 + β2 5 This means that if (14) holds for each least squares problem, the sequence {G(rt)}t=0,... decreases (at least) at a constant geometric rate of 5 7. Therefore, the number of inexact matrix-vector multiplications T needed to achieve |sin θT | η is log 7 D.2. Bounding the initial error for each least squares We can minimize the initial suboptimality for the least squares problem ft+1 for reducing the time complexity of its solver. It is natural to use an initialization of the form αwt, a scaled version of the previous iterate, which gives the following objective ft+1(αwt) = (w t b Aλwt) 2 α2 (w t b Bwt)α. This is a quadratic function of α, and minimizing ft+1(αwt) over α gives the optimal scaling α t = w t b Bwt w t b Aλwt (and this quantity is also invariant to the length of wt). Observe that α t naturally measures the quality of wt: As wt converges to bw, α t converges to β1. This initialization technique plays an important role in showing the linear convergence of our algorithm, and was used by Ge et al. (2016) for their standard power iterations (alternating least squares) scheme for CCA. Stochastic Canonical Correlation Analysis Proof [Proof of Lemma 11] With the given initialization, we have ft+1(α t wt) f t+1 ft+1(β1wt) f t+1 = β2 1r t c M 1 λ rt 2 β1r t rt + rtc Mλrt β2 1 βi 2β1 + βi ξ2 ti βi (β1 βi)2 (w t b Bwt) Therefore, in view of (14), it suffices to set the ratio between the initial and the final error of ft+1 to max (1, G(rt)) 16β2 1 (β1 β2)2 . In the initial phase, G(rt) is large, we can set the ratio to be G(r0) 16β2 1 (β1 β2)2 , until it is reduced to 1 after O (log G(r0)) iterations. Afterwards, we can set the ratio to be the constant of 16β2 1 (β1 β2)2 , until we reach the desired accuracy. Observe that β2 1 (β1 β2)2 = 1 λ bρ1 1 λ bρ1 1 λ bρ2 2 (u + 1)2 4. D.3. Time complexity of SVRG for finite sum with nonconvex component Lemma 25 (Time complexity of SVRG for (15)) With the initialization α t wt, SVRG outputs an wt+1 such that ft+1(wt+1) f t+1 ϵt(w t b Bwt) in time O d(N + κ2) log (64 max (G(rt), 1)) κ , where κ = maxi Li Λ with Li being the gradient Lipschitz constant of fi t+1, and Λ is the stronglyconvex constant of ft+1. Futhermore, if we sample each component fi t+1 non-uniformly with probability proportional to L2 i for the SVRG stochastic updates, we have instead κ = q 1 N PN i=1 L2 i Λ2 . Although not explicitly stated by Garber and Hazan (2015), the result for non-uniform sampling is straightforward by a careful investigation of their analysis; we provide detailed proof of this result in Appendix F. The effect of improved dependence on Li s through nonuniform sampling agrees with related work (Xiao and Zhang, 2014). The purpose of the non-uniform sampling variant is to bound κ2 with high probability for sub-Gaussian/regular polynomial-tail inputs. Gao, Garber, Srebro, Wang, and Wang D.4. Bounding the condition number for SVRG The next lemma upper-bounds the condition number κ2. Lemma 26 Solving min w ft+1(w) using SVRG with non-uniform sampling, we have κ2 = O d2 b 2γ2 for the sub-Gaussian/regular polynomial-tail classes with high probability over the sample set, and κ2 = O 1 b 2γ2 for the bounded class. Proof The gradient Lipschitz constant Li is bounded by the largest eigenvalue (in absolute value) of its Hessian Qi λ = λxix i xiy i yix i λyiy i and the largest eigenvalue is defined as max gx Rdx,gy Rdy β := [g x , g y ]Qi λ s.t. gx 2 + gy 2 = 1. β = λ(g x xi)2 + λ(g y yi)2 2(g x xi)(g y yi) λ(g x xi)2 + λ(g y yi)2 + 2 g x xi g y yi λ(g x xi)2 + λ(g y yi)2 + (g x xi)2 + (g y yi)2 = (λ + 1) (g x xi)2 + (g y yi) (λ + 1) gx 2 xi 2 + gy 2 yi 2) (λ + 1) max xi 2 , yi 2 (λ + 1) xi 2 + yi 2 where we have used the Cauchy-Schwarz inequality in the third inequality. Note that, for bounded inputs, we have xi 2 + yi 2 2 and so L2 i 4(λ + 1)2 for all i = 1, . . . , N. For sub-Gaussian/regular polynomial-tail inputs, we have i=1 L2 i (λ + 1)2 1 xi 2 + yi 2 2 = O((λ + 1)2d2) with high probability in view of Remark 24. On the other hand, we have shown that Λ = σmin (Aλ) (λ bρ1)γ/2. Recalling λ = bρ1 + cb with c (0, 1), we have λ 2 and Λ cb γ/2. Combining this with the data norm bound above yields the desired result. Stochastic Canonical Correlation Analysis D.5. Proof of Theorem 12 Proof Since u T Σxxbu Σ 1 2 xxu T 1 and v T Σyybv Σ 1 2 yyv T 1, it suffices to require u T Σxxbu Σ + v T Σyybv Σ According to Lemma 22 (making the identification that a = Σ 1 2xxbu, x = Σ 1 2xxu T , b = 1 2yybv, and y = Σ 1 2yyv T ), it then suffices to have 2 bu Σxxu T + bv Σyyv T q u T Σxxu T + v T Σyyv T 1 η Since cos θT = p 1 sin2 θT 1 sin2 θT , we just need |sin θT | η 8, and we ensure it by requiring G(r T ) η 8. Applying results from the previous sections, we need to solve O log G(r0) η linear systems, and the time complexity for solving each is at most O (N + κ2) log (G(r0) κ) for SVRG. It remains to bound G(r0). By the definition of G( ), we have q Pd i=2 ξ2 0i/βi p β1 βd 1 |ξ01| where |ξ01| is the alignment between r0 and p1 which, by the relationship between r0 and w0, satisfy w 0 b B 1 2 p1 b B 1 2 w0 w 0 (b B 1 2 p1) σmax(b B 1 2 ) w0 = b B 1 2 p1 ( w0/ w0 ) (b B 1 2 p1/ b B 1 2 p1 ) σmax(b B 1 2 ) σmin(b B 1 2 ) σmax(b B 1 2 ) b B 1 2 p1 b B 1 2 p1 According to the way w0 is initialized and Arora et al. (2009, Lemma 5), we have with prob- ability at least 1 C that w0 w0 b B 1 2 p1 b B 1 2 p1 d. On the other hand, we have β1 b ) and σmax(b B 1 2 ) σmin(b B 1 2 ) = O 1 γ . Combining these results yields that G(r0) = O q with high probability. Then the theorem follows. Gao, Garber, Srebro, Wang, and Wang D.6. Proof of Corollary 13 Proof Denote er := 1 1 2xxu T / Σ 1 2yyv T / Σ , with er = 1. Assume without loss of generality that Σ = 1; this does not affect our measure of alignment, and can be ensured by a final (separate) normalization step with cost O(Nd) (Wang et al., 2016). Apply Lemma 6 with ϵ = ϵ 8 ; requiring that ϵ = ϵ 8 Cdγ2 as in Corollary 7 leads to the extra condition that ϵ 64C2 dγ4 2 . With the specified sample complexity, we have that with high probability In view of the Weyl s inequality, (29) implies that b 3 be the top eigenvector of C. And recall br := 1 the top eigenvector of b C. According to the Davis-Kahan sin θ theorem (Davis and Kahan, 1970), with the number of samples given in the theorem, the top eigenvectors of C and b C are well aligned: where θ is the angle between r and br. This implies that br r = cos θ = p 1 sin2 θ 1 sin2 θ 1 ϵ We now show that the theorem follows if we manage to solve the ERM objective so accurately that bu Σxxu T q u T Σxxu T + bv Σyyv T q To see this, first observe that (30) implies 2 2(er br) ϵ and as a result er r br r (er br) r br r er br 1 ϵ Stochastic Canonical Correlation Analysis Consequently, we have = er r 2 = 2 1 er r ϵ ϵ 8. We are now in the same situation as (27); we can fix the incorrect normalization of er analogously and then our lemma follows. It remains to show the time complexity to achieve (30). According to Lemma 22, it suffices to have 2 bu Σxxu T + bv Σyyv T q u T Σxxu T + v T Σyyv T 1 ϵ2 In turn, it suffices to have |sin θT | ϵ 256 and we ensure it by requiring G(r T ) ϵ 256. We obtain the stated time complexity by applying Theorem 12 with η = ϵ 256. Appendix E. Proofs for Section 5 E.1. Proof of Lemma 15 Proof The desired result is a direct consequence of Lemma 22, by making the identification that 1 2xxu , x = E 1 2xxu T , b = E 1 2yyv , y = E E.2. Parameters of Streaming SVRG for stochastic least squares We divide Lemma 16 into the following three propositions. Proposition 27 (Strong convexity) For any w, w Rd, we have ft+1(w) ft+1(w ) + ft+1(w ), w w + µ where µ := γ β1 C γ for some C > 0. Proof Just observe that the Hessian of ft+1(w) is Aλ = B 1 2 M 1 λ B 1 2 , whose eigenvalues are bounded from below: σmin (Aλ) (λ ρ1) σmin (B) = γ/β1. The lemma follows from the assumption that λ = ρ1 + c for c (0, 1). Gao, Garber, Srebro, Wang, and Wang Proposition 28 (Streaming smoothness) For any w Rd, we have E h φt+1(w) φt+1(w t+1) 2i 2S ft+1(w) f t+1 where S = O dβ1 γ = O d γ for the sub-Gaussian/regular polynomial-tail classes, and γ = O 1 γ for the bounded class. Proof Observe that φt+1(w) = λxx xy As shown in Lemma 26, this gradient function is Lipschitz continuous: φt+1(w) φt+1(w t+1) (λ + 1) sup ( x , y ) w w t+1 . Note that λ ρ1 + u where ρ1 1, 1, and u < 1 by assumption, and thus λ 2. As a result, we obtain E φt+1(w) φt+1(w t+1) 2 9E h x 2 + y 2i w w t+1 2 . For the distributions of P(x, y) considered here, E x 2 and E y 2 are both O(d) for the sub-Gaussian/regular polynomial-tail inputs (see Remark 24), and bounded by 1 for the bounded inputs. On the other hand, according to Lemma 27, we have f(w) f(w t+1) C γ w w t+1 2 for some C > 0. Combining the above two inequalities gives the desired result. Proposition 29 (Streaming variance) We have φ(w t+1) 2 ( 2f(w t+1)) 1 σ2. where σ2 = O dβ3 1 rt 2 for the sub-Gaussian/regular polynomial-tail classes, and σ2 = O β3 1 rt 2 γ2 for the bounded class. Proof Observe that w t+1 = A 1 λ Bwt and φ(w t+1) = λxx xy Stochastic Canonical Correlation Analysis Define the shorthands D = B 1 2 and E = B 1 Then we have φ(w t+1) 2 ( 2f(w t+1)) 1 = E 1 φ(w t+1) 2 A 1 λ 2 φ(w t+1) 2 B 1 2 A 1 λ B 1 2 2E B 1 2 wt (MλD E) Mλ (DMλ E) B 1 2 wt 2E h r t (MλD E) Mλ (DMλ E)rt i . (32) Bounded case For the bounded case where sup x 2 , y 2 1, the derivation is relatively simple. We can bound D 3 γ , and thus φ(w t+1) 2 ( 2f(w t+1)) 1 Mλ rt 2 2 E DMλ E 2 β1 rt 2 E DMλ 2 + E E 2 where we have used the triangle inequality and the fact that (x + y)2 2x2 + 2y2 in the second inequality. Sub-Gaussian/regular polynomial tail cases We now omit the subscript λ in Mλ and t from iterates for convenience. Using the fact that x + y 2 2 x 2 + 2 y 2 with x = M 1 2 DMr and y = M 1 2 Er, we continue from (32) and obtain φ(w t+1) 2 ( 2f(w t+1)) 1 E h r MDMDMr i + E h r EMEr i . Introduce the notation u = E 1 2 xx x, v = E 1 2 yy y, and parition r and M according to (x, y): , M = Mxx Mxy Myx Myy In view of Lemma 23, we can assume max E u 4 , E v 4 Cd2. From now on, we use C for a generic constant whose specific value may change between appearances. We have EME = uu Mxxuu uu Mxyvv vv Myxuu vv Myyvv r EMEr = r x uu Mxxuu rx + r y vv Myyvv ry + 2r x uu Mxyvv ry. Gao, Garber, Srebro, Wang, and Wang We have for the first term that E h r x uu Mxxuu rx i = E r x u 2 u Mxxu E |r x u|4 q E |u Mxxu|2 E |u Mxxu|2 C Mxx rx 2 q C M rx 2 d. Similar arguments also lead to E h r y vv Myyvv ry i C M ry 2 d. For the third term, we have E h r x uu Mxyvv ry i q E |r x u|2 r y v 2q E |u Mxyv|2 M E r x u 4 1 4 E r y v 4 1 C M rx ry d. E h r EMEr i C M r 2 d. Now we need to bound E r MDMDMr . Using the fact that x + y 2 2 x 2 + 2 y 2 with x = M 1 2 D1Mr and y = M 1 2 D2Mr, this can be bounded by two terms: E h r MDMDMr i 2E h r MD1MD1Mr i + 2E h r MD2MD2Mr i D1 = λ uu 0 0 vv , D2 = 0 uv The bound for E r MD1MD1Mr can be derived using the same argument that bounds E r EMEr (now Mr plays the role of r in bounding E r EMEr ), and thus we have E h r MD1MD1Mr i Cλ2 M Mr 2 d C M 3 r 2 λ2d. Finally, we bound E r MD2MD2Mr . Note that D2MD2 = uv Myyvu uv Myxuv vu Mxyvu vu Mxxuv Stochastic Canonical Correlation Analysis r MD2MD2Mr = m x uv Myyvu mx + m y vu Mxxuv my + 2m x uv Myxuv my. Similarly to what we have done above, E m x uv Myyvu mx q E |m x u|4q E |v Myyv|2 C M 3 r 2 d. The same bound also holds for E m y vu Mxxuv my with the same argument. For the term E m x uv Myxuv my , we have E m x uv Myxuv my M E |mxu|4 1 4 E |myv|4 1 C M mx my d C M 3 r 2 d. Combining all the terms, and noting that λ 2, we have shown that E h r MDMDMr i C M 3 r 2 d. And the final bound is E h r (MD E)M(DM E)r i C h M 3 + M i r 2 d = O β3 1 r 2 d . E.3. Proof of Lemma 17 Proof For notational simplicity, we omit the subscript t + 1 below. According to Frostig et al. (2015, Theorem 4.1), we have that for iteration τ of Algorithm 1 E [f(wτ) f ] 1 1 4s S µmτs + 4s E f(wτ 1) f S µE [f(wτ 1) f ] + σ Gao, Garber, Srebro, Wang, and Wang Using the inequality (x + y)2 2(x2 + y2), it holds that S µE [f(wτ 1) f ] + σ µ E f(wτ 1) f + 2σ2. Now, set for this iteration s = c2 µc2 2 , and kτ = max S β1 rt 2c3 , for some c2, c3 (0, 1). We continue from (33) and have E [f(wτ) f ] 1 1 4s S µmτs + 4s + 2 + 4s E f(wτ 1) f + 2 + 4s 2 + 2 + 4c2 E f(wτ 1) f + 4 + c2 22c2 E f(wτ 1) f + 10c3 β1 rt 2 We can now calculate the number of samples used in this iteration, which is kτ + mτ = O dβ2 1 c3 + dβ2 1 γ2c2 2 = O d 2c3 + d 2γ2c2 2 for sub-Gaussian/regular polynomial-tail inputs, and kτ + mτ = O β2 1 γ2c3 + β2 1 γ2c2 2 = O 1 2γ2c3 + 1 2γ2c2 2 for bounded inputs. Let us fix c2 = 1 44 for τ = 1, . . . , Γ. In view of our initialization strategy (16), setting c3 = 1 20 for τ = 1 gives E f(w1) f β1 rt 2 2 . Afterwards, we halve c3 at each outer loop τ = 2, . . . , and this makes sure that E [f(wτ) f ] β1 rt 2 To achieve the desired accuracy, we need Γ = O log 1 outer iterations. Summing (34) and (34) over τ = 1, . . . , Γ, and noting PΓ τ=1 2τ 1 = O 1 ηt , the total sample complexity is τ=1 2τ 1 + 442d = O d 2ηt + d 2γ2 log 1 for sub-Gaussian/regular polynomial-tail inputs, and τ=1 2τ 1 + 442 log 1 = O 1 2γ2ηt for bounded inputs (we have dropped the second term since log 1 ηt is of lower order compared with 1 ηt ). Stochastic Canonical Correlation Analysis E.4. Proof of Theorem 18 Proof Recall that our streaming CCA algorithm performs shift-and-invert power iterations on the population matrices directly. Following the same argument in the ERM case in Corollary 13, as long as each least squares objective is solved to sufficient accuracy, i.e., ft+1(wt+1) f t+1 w t b Bwt min i=2 ξ2 ti/βi, ξ2 t1/β1 the algorithm converges linearly, and therefore we only need to solve T = O log 1 ϵ linear systems. But due to the zero initialization we use in the online setting, the ratio between initial error and final error for each ft+1 is different from the offline setting. When G(rt) > 1, we are in the regime where Pd i=2 ξ2 ti/βi ξ2 t1/β1, and we can ensure the sufficient accuracy in (36) by setting the ratio between the initial and the final error to be ηt = (β1 β2)2 ξ2 t1 in Lemma 17. Since β2 1 (β1 β2)2 4, this implies that 1 ηt 64 cos2 θt = 64(1 + tan2 θt) 64 1 + β2 β1 G2(rt) 64 1 + G2(rt) 64 1 + G2(r0) . Note that the sample complexity of this phase does not depend on the final accuracy in alignment. When G(rt) 1, indicating that we are in the converging regime where Pd i=2 ξ2 ti/βi ξ2 t1/β1, we can ensure the sufficient accuracy in (36) by setting ηt = (β1 β2)2 Pd i=2 ξ2 ti 16β2 1 in Lemma 17. This implies that 1 ηt 64 sin2 θt . Our goal is to have sin2 θT ϵ 4, as this implies cos θT = p 1 sin2 θT 1 sin2 θT 1 ϵ 4, and by Lemma 15 this further implies align ((u T , v T ); (u , v )) 1 ϵ as desired. Since sin2 θt G2(rt), and we have shown that G2(rt) decreases at a geometric rate, we can bound 1 sin2 θt by a geometrically increasing series where the last term is 4 ϵ, and the sum of the truncated series up to time T is of the same order of the last term, i.e., PT t=1 1 ηt = O 1 ϵ . And the theorem follows from Lemma 17, by summing the sample complexity of least squares problems over the outer shift-and-invert iterations. We remark that to achieve the result with probability 1 δ, we require each least squares problem to be solved to the desired accuracy with failure probability δ/ log(1/ϵ) (using the Markov inequality) and finally apply the union bound. This would only cause additional log(1/ϵ) factors in the total sample complexity. Gao, Garber, Srebro, Wang, and Wang Algorithm 2 Non-uniform sampling SVRG for optimizing finite-sum of nonconvex components F(w) = 1 n Pn i=1 fi(w). Input: Stepsize s. Initialize w0 Rd. for j = 1, 2, . . . , M do u wj 1 Evaluate the batch gradient F( u) = 1 n Pn i=1 fi( u) u0 wj 1 for t = 1, 2, . . . , m do Randomly pick it from {1, . . . , n} with probability {pi}n i=1. ut ut 1 s fit(ut 1) fit( u) pitn + F( u) end for wj 1 n Pm t=1 wt end for Output: w M is the approximate solution. Appendix F. SVRG with non-uniform sampling for finite-sum of nonconvex components In this section, we show that for optimizing a convex objective that is the finite-sum of nonconvex components, sampling each components with probability proportional to its smoothness parameter, as shown in Algorithm 2, leads to improved convergence rate. In particular, the final time complexity depends on average smoothness parameter rather than the maximum smoothness. Lemma 30 Let F(w) = 1 n Pn i=1 fi(w), where F(w) is µ-strongly convex, and each component fi(w) is Li-smooth. Let w = arg minw F(w). In the inner loop of Algorithm 2, sample it using weighted sampling probability {pi}n i=1 from {1, . . . , n} where pi = L2 i Pn j=1 L2 j , and set s = 2µn 11 Pn i=1 L2 i , m = 121 Pn i=1 L2 i 8nµ2 . Then the iteration complexity (number of vector operations) to reach ϵ-suboptimality is n + Pn i=1 L2 i nµ2 n Pn i=1 Li) (F(w0) F(w )) Proof For the inner loop of Algorithm 2, we are performing updates of the following form: ut ut 1 svt, vt = fit(ut 1) fit( u) pitn + F( u). Taking expectation over the random choice of component it, we have Et[vt] = F(ut 1) Stochastic Canonical Correlation Analysis We now upper bound the variance of vt: Et vt F(ut 1) 2 =Et fit(ut 1) fit( u) pitn + F( u) F(ut 1) 2 1 (pitn)2 fit(ut 1) fit( u) 2 F(ut 1) F( u) 2 1 pin fi(ut 1) fi( u) 2 fi(ut 1) fi(w ) 2 + fi( u) fi(w ) 2 ut 1 w 2 + u w 2 where we have used the fact that E x E[x] 2 = E x 2 (E[x])2 for a random vector x in the second equality, and that x + y 2 2 x 2 + 2 y 2 in the second inequality. By choosing pi = L2 i Pn i=1 L2 i , the above inequality becomes Et vt F(ut 1) 2 2 Pn i=1 L2 i n ut 1 w 2 + u w 2 . Define L = 1 n Pn i=1 Li which is an upper bound of the smoothness parameter of the average function F(w) as F(a) F(b) = 1 i=1 fi(a) fi(b) 1 i=1 fi(b), a b + Li = F(b), a b + 1 n Pn i=1 Li 2 a b 2 , and define ˆL = q 1 n Pn i=1 L2 i . We then bound the distance from each iterate to the optimum: Et ut w 2 = ut 1 w 2 2s ut 1 w , Et [vt] + s2Et vt 2 ut 1 w 2 2s ut 1 w , F(ut 1) + s2 F(ut 1) 2 + 2s2 ˆL2 ut 1 w 2 + u w 2 ut 1 w 2 2sµ ut 1 w 2 + s2 L2 ut 1 w 2 + 2s2 ˆL2 ut 1 w 2 + u w 2 where we have used the fact that E x = (E[x])2 + E x E[x] 2 in the first inequality, and the smoothness and strong convexity of F(w) in the second inequality. By the Jensen s inequality, we have L ˆL. Therefore, we continue from above and obtain Et h ut w 2i E h ut 1 w 2i 3s2 ˆL2 2sµ ut 1 w 2 + 2s2 ˆL2 u w 2 . Gao, Garber, Srebro, Wang, and Wang Summing the above inequality over the inner loop yields E um w 2 E u0 w 2 3s2 ˆL2 2sµ m X t=1 ut 1 w 2 + 2ms2 ˆL2 u w 2 Using u0 = u and rearranging terms, we have 2sµ 3s2 ˆL2 m X t=1 ut 1 w 2 1 + 2ms2 ˆL2 u w 2 Using u = wj 1 and wj = 1 m Pm 1 t=0 ut, we obtain E wj w 2 1 + 2ms2 ˆL2 2msµ 3ms2 ˆL2 E wj 1 w 2 . 11ˆL2 , m = 1 2s2 ˆL2 = 121ˆL2 2E wj 1 w 2 . Therefore the squared distance to minimum decreases geometrically for each outer loop. After M iterations, we have F(w M) F(w ) L 2 w M w 2 L 2 M w0 w 2 L µ M (F(w0) F(w )). Setting the right hand side to ε gives the number of outer iterations M = O log ( L/µ) (F(w0) F(w )) ε . Finally, the total iteration complexity to reach ε-suboptimality is O ((n + m)M) = O log L(F(w0) F(w ))