# Towards Practical Alternating Least-Squares for CCA

Zhiqiang Xu and Ping Li
Cognitive Computing Lab, Baidu Research
No.10 Xibeiwang East Road, Beijing, 10085, China
10900 NE 8th St, Bellevue, WA 98004, USA
{xuzhiqiang04,liping11}@baidu.com

## Abstract

Alternating least-squares (ALS) is a simple yet effective solver for canonical correlation analysis (CCA). In terms of ease of use, ALS is arguably practitioners' first choice. Despite recent provably guaranteed variants, the empirical performance often remains unsatisfactory. To promote the practical use of ALS for CCA, we propose truly alternating least-squares. Instead of approximately solving two independent linear systems, in each iteration it simply solves two coupled linear systems of half the size. It turns out that this coupling is able to bring significant performance improvements in practical settings. Inspired by the accelerated power method, we further propose faster alternating least-squares, where momentum terms are introduced into the update equations. Theoretically, both algorithms enjoy a linear convergence rate. To make faster ALS even more practical, we put forward adaptive alternating least-squares, which avoids tuning the momentum parameter and is as easy to use as plain ALS while retaining the advantages of the fast version. Experiments on several datasets empirically demonstrate the superiority of the proposed algorithms over several recent CCA solvers.

## 1 Introduction

Canonical correlation analysis [11] is a classical statistical tool for finding directions of maximal correlation between data sources of the same phenomenon. It has found widespread applications in high-dimensional data analysis such as regression [12], clustering [5], classification [13], and word embedding [7], to name a few. Let $X \in \mathbb{R}^{d_x \times n}$ and $Y \in \mathbb{R}^{d_y \times n}$ be the data matrices of two views (assumed row-centered at the origin), with empirical cross-covariance matrix and two auto-covariance matrices given by
$$C_{xy} = \tfrac{1}{n} X Y^\top, \qquad C_{xx} = \tfrac{1}{n} X X^\top + r_x I, \qquad C_{yy} = \tfrac{1}{n} Y Y^\top + r_y I,$$
respectively, where $r_x$ and $r_y$ are positive regularization parameters for avoiding ill-conditioned matrices and $I$ represents the identity matrix of the appropriate size. CCA aims to find projection matrices $\Phi \in \mathbb{R}^{d_x \times k}$ and $\Psi \in \mathbb{R}^{d_y \times k}$ such that the cumulative correlation between the two views is maximized after projection [19, 16]:
$$\max_{\Phi^\top C_{xx} \Phi = \Psi^\top C_{yy} \Psi = I} \operatorname{tr}(\Phi^\top C_{xy} \Psi). \tag{1}$$
It is well known that the global optimum of Problem (1), also known as the canonical subspaces, can be obtained via a k-truncated singular value decomposition (SVD) of the whitened cross-covariance matrix $C = C_{xx}^{-1/2} C_{xy} C_{yy}^{-1/2}$, i.e.,
$$(U, V) = \big(C_{xx}^{-1/2} P,\; C_{yy}^{-1/2} Q\big), \tag{2}$$
where $P$ and $Q$ are the top-k left and right singular subspaces of $C$. Simply computing the partial SVD of $C$ by inverting the matrices $C_{xx}$ and $C_{yy}$, however, is computationally prohibitive for high-dimensional datasets, as the complexity of the matrix inversions can be as high as $O(d^3)$, where $d = \max\{d_x, d_y\}$, and the sparsity of $X$ and $Y$ cannot be exploited in that case. To address this computational issue of CCA, a range of algorithms has been proposed recently in different settings [22, 15, 16, 10, 20, 1, 8, 2, 6, 4]. In this work, we focus on the block and off-line setting, where $k > 1$ and the collection of instance pairs, i.e., $(X, Y)$, is available in advance.
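For intuition, and only for problems small enough that forming and factoring the covariance matrices is feasible, the closed-form solution (2) can be computed directly. The following MATLAB sketch is a reference baseline under that assumption; the function name cca_svd and the dense sqrtm/svds route are ours for illustration, not the iterative solvers studied in this paper.

```matlab
% Reference (non-scalable) CCA via the whitened cross-covariance matrix, Eq. (2).
% Assumes row-centered data X (dx x n), Y (dy x n) with small dx, dy.
function [U, V, S] = cca_svd(X, Y, k, rx, ry)
    n   = size(X, 2);
    Cxy = X * Y' / n;
    Cxx = X * X' / n + rx * eye(size(X, 1));
    Cyy = Y * Y' / n + ry * eye(size(Y, 1));
    Wx  = sqrtm(Cxx);                 % Cxx^{1/2}
    Wy  = sqrtm(Cyy);                 % Cyy^{1/2}
    C   = Wx \ Cxy / Wy;              % whitened cross-covariance Cxx^{-1/2} Cxy Cyy^{-1/2}
    [P, S, Q] = svds(C, k);           % top-k singular triplets of C
    U = Wx \ P;                       % U = Cxx^{-1/2} P
    V = Wy \ Q;                       % V = Cyy^{-1/2} Q
end
```

The matrix square roots and the dense SVD are exactly the $O(d^3)$ bottleneck mentioned above, which motivates the least-squares-based iterative solvers that follow.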
In terms of ease of use, in this setting, alternating least-squares (ALS) [20, 10] is arguably the first choice from a user's perspective, by virtue of its simplicity, few parameters, and guaranteed convergence. Nonetheless, as we will see in our experiments, its effectiveness, especially for solutions of high accuracy, often comes at the cost of slow convergence. In particular, [20] considered inexact alternating least-squares for the vector case $k = 1$. However, in order for the block case to work, one has to set the block size to $2k$ rather than $k$ and needs a post-processing step that randomly projects the resulting $2k$-dimensional subspace onto a $k$-dimensional one, as demonstrated in [10]. The update equations of alternating least-squares in both [20] and [10] are derived from the power method on an augmented real symmetric matrix, i.e.,
$$A = \begin{pmatrix} 0 & C_{xy} \\ C_{xy}^\top & 0 \end{pmatrix}.$$
However, the power method can only find the top eigenspaces corresponding to the eigenvalues that are largest in magnitude, rather than largest in real part. Given the special eigen-structure of $A$ [23, 20, 10], the block size has to be at least $2k$ to recover a top-k canonical subspace $(U, V)$. Clearly, the way the block CCA solver proceeds not only causes a significant increase in both time and space, but may also degrade the quality of the final solution due to the random projection. Thus, the question one would naturally ask is: Is there any variant of ALS that is able to recover $(U, V)$ with block size $k$?

In this paper, we offer a simple answer in the affirmative. Recall that the power iteration in [20, 10] produces simultaneous approximations to the exact iterates $\Phi_t$ and $\Psi_t$ of the two canonical variables and ends up solving two independent linear systems. What we change here is to perform sequential approximations to $\Phi_t$ and $\Psi_t$ with block size $k$, arriving at an algorithm that approximately solves two coupled linear systems of half the size per iteration (see Algorithm 1). To stress the difference, the proposed CCA algorithm is called truly alternating least-squares (TALS). It not only inherits the theoretical properties of global convergence and linear complexity from alternating least-squares, but also enjoys a speedup roughly by a factor of $\frac{\sigma_k}{\sigma_k + \sigma_{k+1}}$, where $\sigma_k$ denotes the k-th largest singular value of $C$. Most important to practitioners, remarkable performance improvements can be achieved in practice, as will be shown in our experiments, with only a slight algorithmic change.

Moreover, we develop another variant of ALS. Inspired by recent work on the accelerated power method [21], we consider faster alternating least-squares (FALS) for CCA with momentum acceleration. The main idea is to add a momentum term to the update equations of the iterates $\Phi_t$ and $\Psi_t$ on top of truly alternating least-squares, which gives rise to Algorithm 2. Compared to other fast methods, e.g., those based on shift-and-invert preconditioning [20, 1], especially in the block case, the advantage here is that the fast version inherits the simple structure of the plain one and the iterates are still updated sequentially. At the least, local linear convergence can be guaranteed. On the other hand, the algorithm is no longer nearly parameter-free, due to the momentum parameter which needs to be tuned. Although we can leverage this parameter to pursue better performance by hand-tuning, this requires multiple runs of the algorithm, which may not be computationally affordable in practice.
To tackle this, we put forward adaptive alternating least-squares (AALS) with automatic momentum tuning during the iterations, so that it is as easy to use as the plain version while still retaining the advantages of the fast one. Experiments show that the adaptive version achieves performance comparable to its predecessor, i.e., faster alternating least-squares, and often outperforms truly alternating least-squares.

The rest of the paper is organized as follows. We discuss recent literature in Section 2 and then present our algorithms, with convergence guarantees, for truly alternating least-squares in Section 3 and for the fast versions in Section 4. Our experimental studies are reported in Section 5. Finally, the paper is concluded with discussions in Section 6.

## 2 Related Work

There is a rich literature on CCA. We focus here on block and off-line algorithms proposed recently. [3] proposed a randomized CCA algorithm for a pair of tall and thin matrices. It first performs a randomized dimensionality reduction on the matrices and then runs an off-the-shelf CCA algorithm on the reduced matrices. However, it has quite a high complexity and, as was pointed out in [16], it does not work for large $d_x$ and $d_y$. To cope with this issue, building on [3], the problem is cast into solving a sequence of iterative least-squares problems in [15]. But only sub-optimal results can be achieved this way due to the coarse approximation, as noted in [10]. [16] proposed an iterative method with a low per-iteration cost, but there is no guarantee of global convergence and the performance is worse than CCALin, i.e., the alternating least-squares method proposed in [10]. These algorithms directly solve Problem (1). Alternating least-squares solves Problem (1) indirectly, by targeting an equivalent problem, i.e., generalized eigenspace computation, of the following form:
$$\max_{\Omega^\top B \Omega = I} \operatorname{tr}(\Omega^\top A \Omega), \qquad \text{where } B = \operatorname{diag}(C_{xx}, C_{yy}).$$
[20] proposed inexact alternating least-squares with a sub-linear convergence analysis for the vector case $k = 1$. The block case was considered, with block size set to $2k$, and given a linear convergence analysis in [10]. While both algorithms enjoy global convergence, they have the drawbacks mentioned in Section 1. The truly alternating least-squares proposed in this paper is a natural extension of these two algorithms without the drawbacks.

Most of the fast CCA algorithms rely on the shift-and-invert preconditioning paradigm originally designed for eigenvector computation [9]. [20] extended the paradigm to the CCA setting and achieved better performance than alternating least-squares in the vector case. [1] further extended it to the block setting, using the vector version as a meta-algorithm to recursively find the top-k canonical subspaces via deflation. While both algorithms have theoretically faster convergence, pragmatic concerns arise: the shift-and-invert preconditioning paradigm bears a complicated algorithmic structure and is difficult to deploy in practice, especially in the block setting. The deployment relies on a number of tuning parameters, including a nontrivial estimate of the spectral gap [20]. The deflation further complicates the task in the block case. In contrast, the fast CCA algorithm presented in this paper follows the momentum acceleration scheme, also originally designed for eigenvector computation [21], and outperforms alternating least-squares, particularly in the block case. The underlying algorithm is simple and has far fewer parameters.
Furthermore, the adaptive version does not even need to tune the momentum parameter, making it more practical. We also note that there are a number of recent CCA algorithms that handle the streaming setting [22, 8, 2, 6, 4]. It will be interesting to investigate how our algorithms extend to this setting.

## 3 Truly Alternating Least-Squares (TALS)

In this section, we detail our proposed truly alternating least-squares (TALS) for CCA, starting from the existing alternating least-squares solvers. The update equations of alternating least-squares in [20] can be written as
$$\tilde\phi_{t+1} = C_{xx}^{-1} C_{xy} \psi_t + \xi_t, \quad \phi_{t+1} = \frac{\tilde\phi_{t+1}}{\|\tilde\phi_{t+1}\|_2}; \qquad \tilde\psi_{t+1} = C_{yy}^{-1} C_{xy}^\top \phi_t + \eta_t, \quad \psi_{t+1} = \frac{\tilde\psi_{t+1}}{\|\tilde\psi_{t+1}\|_2}, \tag{3}$$
where $\phi_t \in \mathbb{R}^{d_x}$, $\psi_t \in \mathbb{R}^{d_y}$, and $\xi_t$, $\eta_t$ are the errors incurred in approximating $C_{xx}^{-1} C_{xy}\psi_t$ and $C_{yy}^{-1} C_{xy}^\top\phi_t$ by least-squares, respectively. For example, $C_{xx}^{-1} C_{xy}\psi_t$ is the exact solution of the linear system $C_{xx}\tilde\phi = C_{xy}\psi_t$ with unknown $\tilde\phi$, or equivalently of the least-squares problem
$$\min_{\tilde\phi \in \mathbb{R}^{d_x}} l_t(\tilde\phi) = \frac{1}{2n}\|X^\top\tilde\phi - Y^\top\psi_t\|_2^2 + \frac{r_x}{2}\|\tilde\phi\|_2^2.$$
The approximation can be obtained by running a least-squares solver, warm-started at $\phi_t$, for only a few iterations. The block version for $k > 1$ in [10] takes the following form:
$$\tilde\Phi_{t+1} = C_{xx}^{-1} C_{xy} \Psi_t + \xi_t, \quad \Phi_{t+1} = \tilde\Phi_{t+1}\big(\tilde\Phi_{t+1}^\top C_{xx}\tilde\Phi_{t+1} + \tilde\Psi_{t+1}^\top C_{yy}\tilde\Psi_{t+1}\big)^{-1/2},$$
$$\tilde\Psi_{t+1} = C_{yy}^{-1} C_{xy}^\top \Phi_t + \eta_t, \quad \Psi_{t+1} = \tilde\Psi_{t+1}\big(\tilde\Phi_{t+1}^\top C_{xx}\tilde\Phi_{t+1} + \tilde\Psi_{t+1}^\top C_{yy}\tilde\Psi_{t+1}\big)^{-1/2}, \tag{4}$$
where $\Phi_t \in \mathbb{R}^{d_x \times 2k}$ and $\Psi_t \in \mathbb{R}^{d_y \times 2k}$, rather than $\Phi_t \in \mathbb{R}^{d_x \times k}$ and $\Psi_t \in \mathbb{R}^{d_y \times k}$. It is easy to see that the update equations in both (3) and (4) yield two independent linear systems. It turns out that this independence hampers the empirical performance of alternating least-squares for CCA. To overcome these drawbacks, especially in the block case, we propose the following truly (and inexact) alternating least-squares:
$$\tilde\Phi_{t+1} = C_{xx}^{-1} C_{xy} \Psi_t + \xi_t, \quad \Phi_{t+1} = \tilde\Phi_{t+1}\big(\tilde\Phi_{t+1}^\top C_{xx}\tilde\Phi_{t+1}\big)^{-1/2},$$
$$\tilde\Psi_{t+1} = C_{yy}^{-1} C_{xy}^\top \Phi_{t+1} + \eta_{t+1}, \quad \Psi_{t+1} = \tilde\Psi_{t+1}\big(\tilde\Psi_{t+1}^\top C_{yy}\tilde\Psi_{t+1}\big)^{-1/2}, \tag{5}$$
where we now have $\Phi_t \in \mathbb{R}^{d_x \times k}$ and $\Psi_t \in \mathbb{R}^{d_y \times k}$. Compared to (3) and (4), the two induced linear systems in (5) are coupled together and, in the block setting, of half the size. The corresponding algorithmic steps are given in Algorithm 1, where the subroutine $\mathrm{GS}_H(\cdot)$ performs generalized Gram-Schmidt orthogonalization with inner product $\langle\cdot,\cdot\rangle_H$ for a positive definite matrix $H$. Note that our initial points for the least-squares solver differ from those in both [20] and [10].

Algorithm 1 TALS-CCA
1: Input: $T$, $k$, data matrices $X$, $Y$.
2: Output: approximate top-k canonical subspaces $(\Phi_T, \Psi_T)$.
3: Initialize $\Phi_0 = \mathrm{GS}_{C_{xx}}(\Phi_{\mathrm{init}}) \in \mathbb{R}^{d_x \times k}$, $\Psi_0 = \mathrm{GS}_{C_{yy}}(\Psi_{\mathrm{init}}) \in \mathbb{R}^{d_y \times k}$, where the entries of $\Phi_{\mathrm{init}}$, $\Psi_{\mathrm{init}}$ are i.i.d. standard normal samples.
4: for $t = 1, 2, \dots, T$ do
5: Approximately solve the least-squares problem $\tilde\Phi_t \approx \arg\min_{\tilde\Phi \in \mathbb{R}^{d_x \times k}} l_t(\tilde\Phi) = \frac{1}{2n}\|X^\top\tilde\Phi - Y^\top\Psi_{t-1}\|_F^2 + \frac{r_x}{2}\|\tilde\Phi\|_F^2$ with initial point $\tilde\Phi^{(0)} = \Phi_{t-1}(\Phi_{t-1}^\top C_{xx}\Phi_{t-1})^{-1}(\Phi_{t-1}^\top C_{xy}\Psi_{t-1})$.
6: $\Phi_t = \mathrm{GS}_{C_{xx}}(\tilde\Phi_t)$.
7: Approximately solve the least-squares problem $\tilde\Psi_t \approx \arg\min_{\tilde\Psi \in \mathbb{R}^{d_y \times k}} s_t(\tilde\Psi) = \frac{1}{2n}\|Y^\top\tilde\Psi - X^\top\Phi_t\|_F^2 + \frac{r_y}{2}\|\tilde\Psi\|_F^2$ with initial point $\tilde\Psi^{(0)} = \Psi_{t-1}(\Psi_{t-1}^\top C_{yy}\Psi_{t-1})^{-1}(\Psi_{t-1}^\top C_{xy}^\top\Phi_t)$.
8: $\Psi_t = \mathrm{GS}_{C_{yy}}(\tilde\Psi_t)$.
9: end for

Recall that $P$ and $Q$ are the top-k left and right singular subspaces of $C$ with respect to their respective Euclidean metrics, corresponding to the singular values $\Sigma = \operatorname{diag}(\sigma_1, \dots, \sigma_k)$ in descending order, i.e., $\sigma_i \geq \sigma_j$ for $1 \leq i < j \leq \operatorname{rank}(C)$. Thus, by Equation (2), the ground truth $U$ and $V$ are their counterparts with respect to the metrics $C_{xx}$ and $C_{yy}$, respectively. Let $\theta_t = \max\{\theta_{\max}(\Phi_t, U), \theta_{\max}(\Psi_t, V)\}$, where $\theta_{\max}(\Phi_t, U)$ denotes the largest principal angle between the subspaces² $\Phi_t$ and $U$ in the underlying metric $C_{xx}$, i.e., $\theta_{\max}(\Phi_t, U) = \cos^{-1}(\sigma_{\min}(U^\top C_{xx} \Phi_t))$.
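To complement Algorithm 1, the following MATLAB sketch instantiates the coupled update (5) with exact inner solves (i.e., $\xi_t = \eta_t = 0$) and dense covariance matrices. The helper gs_orth, which realizes the $\mathrm{GS}_H$ step through a Cholesky factor rather than an explicit Gram-Schmidt loop, is our own illustrative naming; this is a sketch, not the paper's stochastic implementation.

```matlab
% Minimal TALS sketch: exact least-squares solves, dense Cxx, Cyy, Cxy assumed.
function [Phi, Psi] = tals_cca(Cxx, Cyy, Cxy, k, T)
    Phi = gs_orth(randn(size(Cxx, 1), k), Cxx);
    Psi = gs_orth(randn(size(Cyy, 1), k), Cyy);
    for t = 1:T
        Phi = gs_orth(Cxx \ (Cxy  * Psi), Cxx);   % update Phi from the current Psi
        Psi = gs_orth(Cyy \ (Cxy' * Phi), Cyy);   % coupling: uses the freshly updated Phi
    end
end

function Q = gs_orth(M, H)
    % H-orthonormalization: returns Q with Q'*H*Q = I and range(Q) = range(M).
    R = chol(M' * (H * M));
    Q = M / R;
end
```

In Algorithm 1, the backslash solves are replaced by a few warm-started passes of a least-squares solver over the raw data matrices, which is what keeps the per-iteration cost proportional to nnz(X, Y).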
Let $\mathrm{nnz}(X)$ denote the number of nonzero entries of $X$ and $\kappa(C_{xx})$ the condition number of $C_{xx}$. Algorithm 1 then has the properties described by the following theorem.

²For brevity, we use $\Phi_t$ to represent the subspace spanned by the columns of $\Phi_t$, or one of its bases in the underlying metric $C_{xx}$, without risk of confusion.

Algorithm 2 FALS-CCA
1: Input: $T$, $k$, momentum parameter $\beta$, data matrices $X$, $Y$.
2: Output: approximate top-k canonical subspaces $(\Phi_T, \Psi_T)$.
3: Initialize $\Phi_{-1} = 0 \in \mathbb{R}^{d_x \times k}$, $\Phi_0 = \mathrm{GS}_{C_{xx}}(\Phi_{\mathrm{init}}) \in \mathbb{R}^{d_x \times k}$, $\Psi_0 = \mathrm{GS}_{C_{yy}}(\Psi_{\mathrm{init}}) \in \mathbb{R}^{d_y \times k}$, where the entries of $\Phi_{\mathrm{init}}$, $\Psi_{\mathrm{init}}$ are i.i.d. standard normal samples.
4: for $t = 1, 2, \dots, T$ do
5: Approximately solve the least-squares problem $\tilde\Phi_t \approx \arg\min_{\tilde\Phi \in \mathbb{R}^{d_x \times k}} l_t(\tilde\Phi) = \frac{1}{2n}\|X^\top(\tilde\Phi + \beta\Phi_{t-2}) - Y^\top\Psi_{t-1}\|_F^2 + \frac{r_x}{2}\|\tilde\Phi + \beta\Phi_{t-2}\|_F^2$ with initial point $\tilde\Phi^{(0)} = \Phi_{t-1}(\Phi_{t-1}^\top C_{xx}\Phi_{t-1})^{-1}(\Phi_{t-1}^\top C_{xy}\Psi_{t-1})$.
6: $\Phi_t = \mathrm{GS}_{C_{xx}}(\tilde\Phi_t)$.
7: Approximately solve the least-squares problem $\tilde\Psi_t \approx \arg\min_{\tilde\Psi \in \mathbb{R}^{d_y \times k}} s_t(\tilde\Psi) = \frac{1}{2n}\|Y^\top(\tilde\Psi + \beta\Psi_{t-1}) - X^\top\Phi_t\|_F^2 + \frac{r_y}{2}\|\tilde\Psi + \beta\Psi_{t-1}\|_F^2$ with initial point $\tilde\Psi^{(0)} = \Psi_{t-1}(\Psi_{t-1}^\top C_{yy}\Psi_{t-1})^{-1}(\Psi_{t-1}^\top C_{xy}^\top\Phi_t)$.
8: $\Psi_t = \mathrm{GS}_{C_{yy}}(\tilde\Psi_t)$.
9: end for

Theorem 1 Given data matrices $X$ and $Y$, TALS-CCA computes a $d_x \times k$ matrix $\Phi_T$ and a $d_y \times k$ matrix $\Psi_T$ which are estimates of the top-k canonical subspaces $(U, V)$ with error $\epsilon$, i.e., $\Phi_T^\top C_{xx}\Phi_T = \Psi_T^\top C_{yy}\Psi_T = I$ and $\tan\theta_T \leq \epsilon$, in $T = O\big(\frac{\sigma_k^2}{\sigma_k^2 - \sigma_{k+1}^2}\log\frac{1}{\epsilon\cos\theta_0}\big)$ iterations. If Nesterov's accelerated gradient descent is used as the least-squares solver, the running time is at most
$$O\Big(\tfrac{k\sigma_k^2}{\sigma_k^2 - \sigma_{k+1}^2}\,\mathrm{nnz}(X,Y)\,\kappa(X,Y)\Big(\log\tfrac{1}{\cos\theta_0}\log\tfrac{\sigma_1}{(\sigma_k^2 - \sigma_{k+1}^2)\cos\theta_0} + \log\tfrac{\sigma_1}{\epsilon(\sigma_k^2 - \sigma_{k+1}^2)}\Big) + \tfrac{k^2\sigma_k^2}{\sigma_k^2 - \sigma_{k+1}^2}\max\{d_x, d_y\}\log\tfrac{1}{\epsilon\cos\theta_0}\Big),$$
where $\mathrm{nnz}(X, Y) = \mathrm{nnz}(X) + \mathrm{nnz}(Y)$ and $\kappa(X, Y) = \max\{\kappa(C_{xx}), \kappa(C_{yy})\}$.

Note that random initialization of $\Phi_0$ and $\Psi_0$ results in $\cos\theta_0 > 0$ with high probability, by Lemma 13 in [10]. Thus, TALS-CCA is globally and linearly convergent. Proofs are provided in the supplementary material. Compared to alternating least-squares, e.g., CCALin in [10], it is roughly faster by a factor of $\frac{\sigma_k}{\sigma_k + \sigma_{k+1}}$, whereas the empirical improvements are often more pronounced. Note that this makes a difference especially in cases of a small singular value gap $\sigma_k - \sigma_{k+1}$.

## 4 Faster Alternating Least-Squares (FALS)

In this section, we consider momentum acceleration for CCA, motivated by the accelerated power method for eigenvector computation [21]. To derive the update equations of faster alternating least-squares (FALS), we first cast CCA into the setting of eigenvector computation on real symmetric matrices and then introduce momentum to speed it up. Recall that
$$A = \begin{pmatrix} 0 & C_{xy} \\ C_{xy}^\top & 0 \end{pmatrix} \quad \text{and} \quad B = \begin{pmatrix} C_{xx} & 0 \\ 0 & C_{yy} \end{pmatrix}.$$
Let $W_t = B^{1/2}\binom{\Phi_t}{\Psi_t}$ and $\widetilde W_t = B^{1/2}\binom{\tilde\Phi_t}{\tilde\Psi_t}$. The momentum acceleration applied to $B^{-1/2} A B^{-1/2}$ then yields
$$\widetilde W_{t+1} = B^{-1/2} A B^{-1/2}\, W_t - \beta W_{t-1}, \qquad W_{t+1} = \widetilde W_{t+1}\big(\widetilde W_{t+1}^\top \widetilde W_{t+1}\big)^{-1/2}. \tag{6}$$

Algorithm 3 AALS-CCA
1: Input: $T$, $k$, data matrices $X$, $Y$.
2: Output: approximate top-k canonical subspaces $(\Phi_T, \Psi_T)$.
3: Initialize $\Phi_{-1} = 0 \in \mathbb{R}^{d_x \times k}$, $\Phi_0 = \mathrm{GS}_{C_{xx}}(\Phi_{\mathrm{init}}) \in \mathbb{R}^{d_x \times k}$, $\Psi_0 = \mathrm{GS}_{C_{yy}}(\Psi_{\mathrm{init}}) \in \mathbb{R}^{d_y \times k}$, where the entries of $\Phi_{\mathrm{init}}$, $\Psi_{\mathrm{init}}$ are i.i.d. standard normal samples.
4: for $t = 1, 2, \dots, T$ do
5: Set $\beta_{t,1} = \frac{1}{4}\min_{1 \leq i \leq k}\big(\Sigma^{(t-1,1)}_{ii}\big)^2$, where $\Sigma^{(t-1,1)} = (\Phi_{t-1}^\top C_{xx}\Phi_{t-1})^{-1}(\Phi_{t-1}^\top C_{xy}\Psi_{t-1})$.
6: Approximately solve the least-squares problem $\tilde\Phi_t \approx \arg\min_{\tilde\Phi \in \mathbb{R}^{d_x \times k}} l_t(\tilde\Phi) = \frac{1}{2n}\|X^\top(\tilde\Phi + \beta_{t,1}\Phi_{t-2}) - Y^\top\Psi_{t-1}\|_F^2 + \frac{r_x}{2}\|\tilde\Phi + \beta_{t,1}\Phi_{t-2}\|_F^2$ with initial point $\tilde\Phi^{(0)} = \Phi_{t-1}\Sigma^{(t-1,1)}$.
7: $\Phi_t = \mathrm{GS}_{C_{xx}}(\tilde\Phi_t)$.
8: Set $\beta_{t,2} = \frac{1}{4}\min_{1 \leq i \leq k}\big(\Sigma^{(t-1,2)}_{ii}\big)^2$, where $\Sigma^{(t-1,2)} = (\Psi_{t-1}^\top C_{yy}\Psi_{t-1})^{-1}(\Psi_{t-1}^\top C_{xy}^\top\Phi_t)$.
9: Approximately solve the least-squares problem $\tilde\Psi_t \approx \arg\min_{\tilde\Psi \in \mathbb{R}^{d_y \times k}} s_t(\tilde\Psi) = \frac{1}{2n}\|Y^\top(\tilde\Psi + \beta_{t,2}\Psi_{t-1}) - X^\top\Phi_t\|_F^2 + \frac{r_y}{2}\|\tilde\Psi + \beta_{t,2}\Psi_{t-1}\|_F^2$ with initial point $\tilde\Psi^{(0)} = \Psi_{t-1}\Sigma^{(t-1,2)}$.
10: $\Psi_t = \mathrm{GS}_{C_{yy}}(\tilde\Psi_t)$.
11: end for
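For concreteness, here is a minimal dense-matrix sketch that combines the momentum update of Algorithm 2 with the adaptive choice of $\beta$ from Algorithm 3, again using exact inner solves; aals_cca is our illustrative name and gs_orth is the same helper as in the TALS sketch above, so this is a sketch rather than the stochastic implementation used in the experiments.

```matlab
% Minimal AALS sketch: momentum-accelerated coupled updates with adaptive beta.
function [Phi, Psi] = aals_cca(Cxx, Cyy, Cxy, k, T)
    Phi_prev = zeros(size(Cxx, 1), k);             % plays the role of Phi_{t-2} (Phi_{-1} = 0)
    Phi = gs_orth(randn(size(Cxx, 1), k), Cxx);
    Psi = gs_orth(randn(size(Cyy, 1), k), Cyy);
    for t = 1:T
        S1 = Phi' * (Cxy * Psi);                   % Sigma^{(t-1,1)}, since Phi'*Cxx*Phi = I
        b1 = 0.25 * min(diag(S1).^2);              % beta_{t,1}
        Phi_new = gs_orth(Cxx \ (Cxy * Psi) - b1 * Phi_prev, Cxx);
        S2 = Psi' * (Cxy' * Phi_new);              % Sigma^{(t-1,2)}, since Psi'*Cyy*Psi = I
        b2 = 0.25 * min(diag(S2).^2);              % beta_{t,2}
        Psi = gs_orth(Cyy \ (Cxy' * Phi_new) - b2 * Psi, Cyy);
        Phi_prev = Phi;
        Phi = Phi_new;
    end
end
```

Setting b1 = b2 to a fixed hand-tuned value recovers the FALS update of Algorithm 2, and setting them to zero recovers TALS.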
In (6), $\beta W_{t-1}$ is known as the momentum term and $\beta$ is the momentum parameter. Expanding (6) into two inexact update equations in $\Phi_t$ and $\Psi_t$ and coupling them together as in TALS, we arrive at our faster (truly and inexact) alternating least-squares:
$$\tilde\Phi_{t+1} = C_{xx}^{-1} C_{xy}\Psi_t - \beta\Phi_{t-1} + \xi_t, \quad \Phi_{t+1} = \tilde\Phi_{t+1}\big(\tilde\Phi_{t+1}^\top C_{xx}\tilde\Phi_{t+1}\big)^{-1/2},$$
$$\tilde\Psi_{t+1} = C_{yy}^{-1} C_{xy}^\top\Phi_{t+1} - \beta\Psi_t + \eta_{t+1}, \quad \Psi_{t+1} = \tilde\Psi_{t+1}\big(\tilde\Psi_{t+1}^\top C_{yy}\tilde\Psi_{t+1}\big)^{-1/2},$$
where $\Phi_t \in \mathbb{R}^{d_x \times k}$ and $\Psi_t \in \mathbb{R}^{d_y \times k}$. The algorithmic steps are given in Algorithm 2, which remains as simple as plain alternating least-squares. Despite the simplicity, the analysis of faster convergence is very difficult (see our discussion in Section 6). Nonetheless, it is at least locally linearly convergent, as stated in the following theorem.

Theorem 2 Given data matrices $X$ and $Y$, FALS-CCA computes a $d_x \times k$ matrix $\Phi_T$ and a $d_y \times k$ matrix $\Psi_T$ which are estimates of the top-k canonical subspaces $(U, V)$ with error $\epsilon$, i.e., $\Phi_T^\top C_{xx}\Phi_T = \Psi_T^\top C_{yy}\Psi_T = I$ and $\tan\theta_T \leq \epsilon$, in $T = O\big(\frac{\sigma_k^2 - c\sigma_1\beta}{\sigma_k^2 - \sigma_{k+1}^2 - 4c\sigma_1\beta}\log\frac{1}{\epsilon\cos\theta_0}\big)$ iterations, provided $\theta_0 \leq \frac{\pi}{4}$. If Nesterov's accelerated gradient descent is used as the least-squares solver, the running time is at most
$$O\Big(\tfrac{k(\sigma_k^2 - c\sigma_1\beta)}{\sigma_k^2 - \sigma_{k+1}^2 - 4c\sigma_1\beta}\,\mathrm{nnz}(X,Y)\,\kappa(X,Y)\Big(\log\tfrac{1}{\cos\theta_0}\log\tfrac{\sigma_1}{(\sigma_k^2 - \sigma_{k+1}^2)\cos\theta_0} + \log\tfrac{\sigma_1}{\epsilon(\sigma_k^2 - \sigma_{k+1}^2)}\Big) + \tfrac{k^2(\sigma_k^2 - c\sigma_1\beta)}{\sigma_k^2 - \sigma_{k+1}^2 - 4c\sigma_1\beta}\max\{d_x, d_y\}\log\tfrac{1}{\epsilon\cos\theta_0}\Big),$$
where $c > 0$ is a constant.

Clearly, the momentum parameter plays a key role in making Algorithm 2 work, and it is essential to find sensible ways to set it in practice. Given the tight analysis (see Theorem 11 in [21]) for the exact update (6) in $W_t$, the optimal value of $\beta$ should be around $\frac{\sigma_{k+1}^2}{4}$. On the other hand, the optimal solution satisfies $C_{xy}V = C_{xx}U\Sigma$ and $C_{xy}^\top U = C_{yy}V\Sigma$. We can thus write
$$(U^\top C_{xx}U)^{-1}U^\top C_{xy}V = \Sigma = (V^\top C_{yy}V)^{-1}V^\top C_{xy}^\top U.$$
Accordingly, we have the following options for estimating $\Sigma$ for sufficiently large $t$:
$$\Sigma^{(t,1)} \approx (\Phi_t^\top C_{xx}\Phi_t)^{-1}(\Phi_t^\top C_{xy}\Psi_t), \qquad \Sigma^{(t,2)} \approx (\Psi_t^\top C_{yy}\Psi_t)^{-1}(\Psi_t^\top C_{xy}^\top\Phi_{t+1}),$$
$$\Sigma^{(t,3)} \approx (\Phi_t^\top C_{xx}\Phi_t + \Psi_t^\top C_{yy}\Psi_t)^{-1}(\Phi_t^\top C_{xy}\Psi_t + \Psi_t^\top C_{xy}^\top\Phi_t).$$
Before the iterates $\Phi_t$ and $\Psi_t$ converge, $\min_{1 \leq i \leq k}\Sigma^{(t,j)}_{ii}$ is in general strictly less than $\sigma_k$; meanwhile, $\sigma_{k+1}$ is bounded above by $\sigma_k$. Therefore, our first strategy for approximating the optimal momentum parameter is to run a small number of TALS iterations, which can be viewed as a burn-in process, and then set $\beta_j = \frac{1}{4}\min_{1 \leq i \leq k}\big(\Sigma^{(T_0,j)}_{ii}\big)^2$ for FALS. We denote the resulting algorithm FALS-T0.

Adaptive Alternating Least-Squares (AALS). To further avoid choosing the burn-in parameter $T_0$, our second strategy is to automatically and adaptively adjust the momentum parameter $\beta$ during the iterations, as described in Algorithm 3. Compared to Algorithm 2, there is no additional cost in running AALS. It remains as easy to use as plain alternating least-squares while retaining the advantages of the fast version.

## 5 Experiments

In this section, we examine and compare the empirical behavior of existing and proposed alternating least-squares algorithms. Three real-world datasets are used: Mediamill [18], JW11 [17], and MNIST [14]; see Table 1 for statistics and descriptions. They are commonly used to test CCA solvers [20].

Table 1: Statistics of Datasets

| Data | Description | $d_x$ | $d_y$ | $n$ |
|---|---|---|---|---|
| Mediamill | images and their labels | 100 | 120 | 30000 |
| JW11 | acoustic and articulation measurements | 273 | 112 | 30000 |
| MNIST | left and right halves of images | 392 | 392 | 60000 |
| Youtube | UCI Youtube audio and vision streams | 64 | 1024 | 122041 |
In order to show the inability of plain alternating least-squares with block size $k$ to solve CCA, we adapt the alternating least-squares methods of [20] and [10] to block size $k$, denoted ALS-k and CCALin-k, respectively. Note that the post-processing step is no longer needed for CCALin-k. The two algorithms differ only in the initial point given to the least-squares solver. The original CCALin algorithm is also included as a baseline. They are compared with our TALS, FALS-T0 (i.e., FALS with burn-in parameter $T_0$), and AALS. In particular, $T_0 \in \{4, 6\}$ is used. Regularization parameters are fixed to $r_x = r_y = 0.1$. Stochastic variance reduced gradient (SVRG) is the least-squares solver used for each algorithm. Throughout the experiments, the solver runs 2 epochs, each of $n$ iterations, with constant step sizes $\alpha_\Phi = \frac{1}{\max_i \|x_i\|_2^2}$ for $\Phi_t$ and $\alpha_\Psi = \frac{1}{\max_i \|y_i\|_2^2}$ for $\Psi_t$, where $x_i$ is the $i$-th column of $X$. All algorithms were implemented in MATLAB and run on a laptop with 8 GB memory.

The quality measures we use are as follows: $\sin^2\theta_u \triangleq \sin^2\theta_{\max}(\Phi_t, U)$, the squared sine of the largest principal angle between $\Phi_t$ and $U$; and $\sin^2\theta_v \triangleq \sin^2\theta_{\max}(\Psi_t, V)$, the squared sine of the largest principal angle between $\Psi_t$ and $V$. The ground truth $(P, \Sigma, Q)$ is obtained by MATLAB's svds function for evaluation purposes. Smaller is better for each measure. It is worth noting that these two measures are more indicative of the performance of all the algorithms considered here than the relative objective error
$$\frac{f^* - f}{f^*} = \frac{\operatorname{tr}(\Sigma) - \operatorname{tr}(\Phi_t^\top C_{xy}\Psi_t)}{\operatorname{tr}(\Sigma)},$$
because the algorithms do not directly optimize the objective function of Problem (1), i.e., $\operatorname{tr}(\Phi^\top C_{xy}\Psi)$; this holds especially for CCALin. Convergence results in terms of $(f^* - f)/f^*$ are reported in the supplementary material.

[Figure 1: Performance of different ALS algorithms for CCA. Panels: Mediamill (k=4), JW11 (k=10), MNIST (k=10); x-axes: time (seconds) and number of passes over the data; compared methods: ALS-k, CCALin-k, CCALin, TALS, FALS-T0=4.]

Convergence curves of all the considered ALS algorithms are plotted in a 4x3 array of panels in Figure 1, with one column per dataset. The upper and lower halves of the rows correspond to $\sin^2\theta_u$ and $\sin^2\theta_v$, respectively, while the upper and lower rows within each half report results against running time and against passes over the data, respectively. Note that the curve patterns in running time and in passes are not necessarily the same, e.g., for CCALin. From these empirical results, we first observe that both ALS-k and CCALin-k indeed fail to work, as the values of both measures remain high throughout the iterations across all datasets. This is because the target ground truth of these two algorithms does not cover the top-k canonical subspaces. Second, CCALin takes much longer than our ALS algorithms to find a solution even of low precision. Third, our TALS outperforms CCALin by a large margin in both measures, demonstrating the advantage of the coupling in ALS for CCA.
Last, further speedups over TALS are observed for the fast versions, which showcases the potential of momentum acceleration for CCA. In particular, the adaptive version, AALS, performs as well as FALS-T0 without the need to tune the momentum parameter or set the burn-in parameter, demonstrating its practical value to a certain extent.

Additional experiments are provided in the supplementary material, aiming to demonstrate: 1) the performance of all the considered algorithms with varying block sizes; 2) the performance of our ALS algorithms, especially the fast versions, in comparison with the shift-and-invert (SI) preconditioning method [20] in the vector setting; and 3) the performance of the algorithms on additional datasets (n = 122041). These experiments indicate that truly alternating least-squares can sometimes achieve performance equal to that of its fast versions. In the vector case, faster alternating least-squares may even work better than the SI method, even though the latter is given the advantage of knowledge of the spectral gap at k = 1 and of other tuning parameters.

## 6 Discussion

In this work, we study alternating least-squares as a block CCA solver. Noting the drawbacks of current alternating least-squares methods, we propose truly alternating least-squares, which, thanks to coupling, only needs to solve update equations of half the size. Both theory and practice show that the coupling can significantly improve the performance of alternating least-squares. On top of that, we further propose faster alternating least-squares with momentum acceleration. To make it practical, two strategies are put forward for setting the momentum parameter. One is to introduce a burn-in phase that sets it by running truly alternating least-squares for a few iterations. The other is to automatically adjust the momentum parameter during the iterations, making the method as easy to use as plain alternating least-squares without sacrificing fast convergence. Experiments show that both strategies work well. Despite the excellent performance of the fast versions, a tight convergence analysis explaining their empirical behavior is still lacking. This appears to be quite difficult, given that no such theory exists thus far even for momentum acceleration of basic eigenvector computation in a corresponding setting. The coupling in our context further complicates the analysis. We leave it to future work, where other settings, e.g., streaming or robust CCA, may be considered as well.

## References

[1] Zeyuan Allen-Zhu and Yuanzhi Li. Doubly accelerated methods for faster CCA and generalized eigendecomposition. In Proceedings of the 34th International Conference on Machine Learning, ICML 2017, Sydney, NSW, Australia, 6-11 August 2017, pages 98-106, 2017.

[2] Raman Arora, Teodor Vanislavov Marinov, Poorya Mianjy, and Nati Srebro. Stochastic approximation for canonical correlation analysis. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, 4-9 December 2017, Long Beach, CA, USA, pages 4778-4787, 2017.

[3] Haim Avron, Christos Boutsidis, Sivan Toledo, and Anastasios Zouzias. Efficient dimensionality reduction for canonical correlation analysis. SIAM Journal on Scientific Computing, 36(5), 2014.

[4] Kush Bhatia, Aldo Pacchiano, Nicolas Flammarion, Peter L. Bartlett, and Michael I. Jordan. Gen-Oja: Simple & efficient algorithm for streaming generalized eigenvector computation.
In Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems 2018, NeurIPS 2018, 3-8 December 2018, Montréal, Canada, pages 7016-7025, 2018.

[5] Kamalika Chaudhuri, Sham M. Kakade, Karen Livescu, and Karthik Sridharan. Multi-view clustering via canonical correlation analysis. In Proceedings of the 26th Annual International Conference on Machine Learning, ICML 2009, Montreal, Quebec, Canada, June 14-18, 2009, page 17, 2009.

[6] Zhehui Chen, Xingguo Li, Lin Yang, Jarvis D. Haupt, and Tuo Zhao. On constrained nonconvex stochastic optimization: A case study for generalized eigenvalue decomposition. In The 22nd International Conference on Artificial Intelligence and Statistics, AISTATS 2019, 16-18 April 2019, Naha, Okinawa, Japan, pages 916-925, 2019.

[7] Paramveer Dhillon, Dean P. Foster, and Lyle H. Ungar. Multi-view learning of word embeddings via CCA. In Advances in Neural Information Processing Systems 24, pages 199-207. Curran Associates, Inc., 2011.

[8] Chao Gao, Dan Garber, Nathan Srebro, Jialei Wang, and Weiran Wang. Stochastic canonical correlation analysis. CoRR, abs/1702.06533, 2017.

[9] Dan Garber, Elad Hazan, Chi Jin, Sham M. Kakade, Cameron Musco, Praneeth Netrapalli, and Aaron Sidford. Faster eigenvector computation via shift-and-invert preconditioning. In International Conference on Machine Learning, pages 2626-2634, 2016.

[10] Rong Ge, Chi Jin, Sham M. Kakade, Praneeth Netrapalli, and Aaron Sidford. Efficient algorithms for large-scale generalized eigenvector computation and canonical correlation analysis. In International Conference on Machine Learning, pages 2741-2750, 2016.

[11] Harold Hotelling. Relations between two sets of variates. Biometrika, 28(3/4):321-377, 1936.

[12] Sham M. Kakade and Dean P. Foster. Multi-view regression via canonical correlation analysis. In Proceedings of the 20th Annual Conference on Learning Theory, COLT 2007, pages 82-96, Berlin, Heidelberg, 2007. Springer-Verlag.

[13] Nikos Karampatziakis and Paul Mineiro. Discriminative features via generalized eigenvectors. In Proceedings of the 31st International Conference on Machine Learning, ICML 2014, Beijing, China, 21-26 June 2014, pages 494-502, 2014.

[14] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, pages 2278-2324, 1998.

[15] Yichao Lu and Dean P. Foster. Large scale canonical correlation analysis with iterative least squares. In Advances in Neural Information Processing Systems 27: Annual Conference on Neural Information Processing Systems 2014, December 8-13 2014, Montreal, Quebec, Canada, pages 91-99, 2014.

[16] Zhuang Ma, Yichao Lu, and Dean P. Foster. Finding linear structure in large datasets with scalable canonical correlation analysis. In Proceedings of the 32nd International Conference on Machine Learning, ICML 2015, Lille, France, 6-11 July 2015, pages 169-178, 2015.

[17] J. R. Westbury. X-ray microbeam speech production database user's handbook. IEEE Personal Communications, 1994.

[18] Cees G. M. Snoek, Marcel Worring, Jan C. van Gemert, Jan-Mark Geusebroek, and Arnold W. M. Smeulders. The challenge problem for automated detection of 101 semantic concepts in multimedia. In Proceedings of the 14th ACM International Conference on Multimedia, MM 2006, pages 421-430, New York, NY, USA, 2006. ACM.

[19] Hrishikesh Vinod.
Canonical ridge and econometrics of joint production. Journal of Econometrics, 4:147-166, 1976.

[20] Weiran Wang, Jialei Wang, Dan Garber, and Nati Srebro. Efficient globally convergent stochastic optimization for canonical correlation analysis. In Advances in Neural Information Processing Systems 29: Annual Conference on Neural Information Processing Systems 2016, December 5-10, 2016, Barcelona, Spain, pages 766-774, 2016.

[21] Peng Xu, Bryan D. He, Christopher De Sa, Ioannis Mitliagkas, and Christopher Ré. Accelerated stochastic power iteration. In International Conference on Artificial Intelligence and Statistics, AISTATS 2018, 9-11 April 2018, Playa Blanca, Lanzarote, Canary Islands, Spain, pages 58-67, 2018.

[22] Florian Yger, Maxime Berar, Gilles Gasso, and Alain Rakotomamonjy. Adaptive canonical correlation analysis based on matrix manifolds. In Proceedings of the 29th International Conference on Machine Learning, ICML 2012, Edinburgh, Scotland, UK, June 26 - July 1, 2012, 2012.

[23] Zhihua Zhang. The singular value decomposition, applications and beyond. CoRR, abs/1510.08532, 2015.