# EigenGame: PCA as a Nash Equilibrium

Published as a conference paper at ICLR 2021

Ian Gemp, Brian McWilliams, Claire Vernade & Thore Graepel
DeepMind
{imgemp,bmcw,vernade,thore}@google.com

## ABSTRACT

We present a novel view on principal component analysis (PCA) as a competitive game in which each approximate eigenvector is controlled by a player whose goal is to maximize their own utility function. We analyze the properties of this PCA game and the behavior of its gradient-based updates. The resulting algorithm, which combines elements from Oja's rule with a generalized Gram-Schmidt orthogonalization, is naturally decentralized and hence parallelizable through message passing. We demonstrate the scalability of the algorithm with experiments on large image datasets and neural network activations. We discuss how this new view of PCA as a differentiable game can lead to further algorithmic developments and insights.

## 1 INTRODUCTION

The principal components of data are the vectors that align with the directions of maximum variance. These have two main purposes: a) as interpretable features and b) for data compression. Recent methods for principal component analysis (PCA) focus on the latter, explicitly stating objectives to find the k-dimensional subspace that captures maximum variance (e.g., (Tang, 2019)), and leaving the problem of rotating within this subspace to, for example, a more efficient downstream singular value decomposition (SVD) step¹. This point is subtle, yet critical. For example, any pair of two-dimensional, orthogonal vectors spans all of $\mathbb{R}^2$ and, therefore, captures maximum variance of any two-dimensional dataset. However, for these vectors to be principal components, they must, in addition, align with the directions of maximum variance, which depends on the covariance of the data. By learning the optimal subspace, rather than the principal components themselves, objectives focused on subspace error ignore the first purpose of PCA. In contrast, modern nonlinear representation learning techniques focus on learning features that are both disentangled (uncorrelated) and low dimensional (Chen et al., 2016; Mathieu et al., 2018; Locatello et al., 2019; Sarhan et al., 2019).

It is well known that the PCA solution of the d-dimensional dataset $X \in \mathbb{R}^{n \times d}$ is given by the eigenvectors of $X^\top X$ or, equivalently, the right singular vectors of $X$. Impractically, the cost of computing the full SVD scales with $O(\min\{nd^2, n^2 d\})$-time and $O(nd)$-space (Shamir, 2015; Tang, 2019). For moderately sized data, randomized methods can be used (Halko et al., 2011). Beyond this, stochastic or online methods based on Oja's rule (Oja, 1982) or power iterations (Rutishauser, 1971) are common. Another option is to use streaming k-PCA algorithms such as Frequent Directions (FD) (Ghashami et al., 2016) or Oja's algorithm² (Allen-Zhu and Li, 2017) with storage complexity $O(kd)$. Sampling or sketching methods also scale well but, again, focus on the top-k subspace (Sarlos, 2006; Cohen et al., 2017; Feldman et al., 2020).

In contrast to these approaches, we view each principal component (equivalently, eigenvector) as a player in a game whose objective is to maximize their own local utility function in controlled competition with other vectors. The proposed utility gradients are interpretable as a combination of Oja's rule and a generalized Gram-Schmidt process.
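As a rough illustration of this structure, the sketch below implements a gradient-ascent step of the flavor described above. It is our own minimal JAX sketch, not the paper's exact Algorithm 1: we assume player $i$'s ascent direction combines an Oja-style term $M\hat{v}_i$ (with $M = X^\top X$) with generalized Gram-Schmidt corrections that subtract $M$-weighted projections onto the parent vectors $\hat{v}_{j<i}$, followed by renormalization to the unit sphere.

```python
# Illustrative EigenGame-style update (our sketch, not the paper's exact
# Algorithm 1). Player i ascends a utility whose gradient combines an
# Oja-style term M @ v_i with generalized Gram-Schmidt corrections that
# penalize alignment, through M = X^T X, with the parents v_1, ..., v_{i-1}.
import jax.numpy as jnp


def player_gradient(M, V, i):
    """Ascent direction for player i; V holds the current unit-norm guesses as columns."""
    v_i = V[:, i]
    reward = M @ v_i                      # Oja-style term: pull toward high-variance directions
    penalty = jnp.zeros_like(v_i)
    for j in range(i):                    # generalized Gram-Schmidt against parents j < i
        v_j = V[:, j]
        penalty = penalty + (v_i @ M @ v_j) / (v_j @ M @ v_j) * (M @ v_j)
    return reward - penalty


def player_update(M, V, i, lr=0.1):
    """One ascent step for player i followed by projection back onto the unit sphere."""
    v_new = V[:, i] + lr * player_gradient(M, V, i)
    return V.at[:, i].set(v_new / jnp.linalg.norm(v_new))
```

Note that updating player $i$ only requires reading the parents $\hat{v}_{j<i}$, which is what makes the method amenable to decentralization via message passing.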
We make the following contributions:

- A novel formulation of PCA as finding the Nash equilibrium of a suitable game,
- A sequential, globally convergent algorithm for approximating the Nash on full-batch data,
- A decentralized algorithm with experiments demonstrating the approach as competitive with modern streaming k-PCA algorithms on synthetic and real data,
- In demonstration of the scaling of the approach, we compute the top-32 principal components of the matrix of RESNET-200 activations on the IMAGENET dataset ($n \approx 10^6$, $d \approx 20 \times 10^6$).

¹ After learning the top-k subspace $V \in \mathbb{R}^{d \times k}$, the rotation can be recovered via an SVD of $XV$.
² FD approximates the top-k subspace; Oja's algorithm approximates the top-k eigenvectors.

Each of these contributions is important. Novel formulations often lead to deeper understanding of problems, thereby opening doors to improved techniques. In particular, k-player games are in general complex and hard to analyze. In contrast, PCA has been well studied. By combining the two fields we hope to develop useful analytical tools. Our specific formulation is important because it obviates the need for any centralized orthonormalization step and lends itself naturally to decentralization. And lastly, theory and experiments support the viability of this approach for continued research.

## 2 PCA AS AN EIGEN-GAME

We adhere to the following notation. Vectors and matrices meant to approximate principal components (equivalently, eigenvectors) are designated with hats, $\hat{v}$ and $\hat{V}$ respectively, whereas true principal components are $v$ and $V$. Subscripts indicate which eigenvalue a vector is associated with. For example, $v_i$ is the eigenvector associated with the $i$th largest eigenvalue. In this work, we will assume each eigenvalue is distinct. By an abuse of notation, $\hat{v}_{j<i}$ denotes the set of vectors $\{\hat{v}_1, \ldots, \hat{v}_{i-1}\}$.

Player $i$'s utility can be expressed in terms of its angular deviation from $v_i$:

$$u_i(z_i \mid \hat{v}_{j<i}) = \Lambda_{ii}\cos^2(\theta_i) + \sin^2(\theta_i) \sum_{l} z_l^2 \Lambda_{ll} \qquad (8)$$

where $\theta_i$ is the angular deviation and $z \in \mathcal{S}^{d-1}$ parameterizes the deviation direction. Note that $\sin^2$ has period $\pi$ instead of $2\pi$, which simply reflects the fact that $v_i$ and $-v_i$ are both eigenvectors.

⁴ EigenGame sans order learns at most 1 PC, and sans order+normalization, 5 PCs on the data in Figure 3a.
⁵ EigenGame runtimes are longer than those of EigenGame$^R$ in the synthetic experiments despite strictly requiring fewer FLOPS; apparently this is due to low-level floating point arithmetic specific to the experiments.

An error propagation analysis reveals that it is critical to learn the parents to a given degree of accuracy. The angular distance between $v_i$ and the maximizer of player $i$'s utility with approximate parents has a $\tan^{-1}$ dependence (i.e., a soft step-function; see Lemma N.5 and Figure 13 in Appendix N).

Theorem 4.1 (Global convergence). Algorithm 1 achieves finite sample convergence to within $\theta_{tol}$ angular error of the top-k principal components, independent of initialization. Furthermore, if each $\hat{v}_i$ is initialized to within $\frac{\pi}{4}$ of $v_i$, Algorithm 1 returns the components with angular error less than $\theta_{tol}$ in $T = \left\lceil \mathcal{O}\!\left(k \left[\tfrac{(k-1)!}{g_i}\right]^2\right) \right\rceil$ iterations.

Proofs are deferred to Appendices O.4 and O.5. Angular error is defined as the angle between $\hat{v}_i$ and $v_i$: $\theta_i = \sin^{-1}\!\big(\sqrt{1 - \langle v_i, \hat{v}_i \rangle^2}\big)$. The first $k$ in the formula for $T$ appears from a naive summing of worst-case bounds on the number of iterations required to learn each $\hat{v}_j$. For Matrix Krasulina, we first compute the optimal matching from $\hat{v}_i$ to ground truth before measuring angular error.
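For concreteness, the angular-error definition above maps directly to code. A minimal JAX sketch (the helper name is ours; inputs are assumed unit-norm):

```python
# Angular error between a ground-truth eigenvector and its estimate,
# following the definition above: theta_i = arcsin(sqrt(1 - <v_i, v_hat_i>^2)).
# Inputs are assumed unit-norm; the square makes the measure sign-invariant,
# since v_i and -v_i are both eigenvectors.
import jax.numpy as jnp


def angular_error(v_true, v_hat):
    inner_sq = jnp.clip((v_true @ v_hat) ** 2, 0.0, 1.0)  # clip guards against round-off
    return jnp.arcsin(jnp.sqrt(1.0 - inner_sq))


# Example: the tolerance used for the main synthetic results is pi/8.
# converged = angular_error(v_true, v_hat) < jnp.pi / 8
```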
We present the longest streak as opposed to the number of eigenvectors found because, in practice, no ground truth is available and we think the user should be able to place higher confidence in the larger eigenvectors being correct. If an algorithm returns k vectors, k−2 of which are accurate components, but does not indicate which, this is less helpful. We measure normalized subspace distance using $1 - \frac{1}{k}\mathrm{Tr}(U P) \in [0, 1]$ where $U = V V^\top$ and $P = \hat{V}\hat{V}^\top$, similarly to Tang (2019) (both metrics are sketched in code after Figure 4 below).

Synthetic data. Experiments on synthetic data demonstrate the viability of our approach (Figure 3a). Oja's algorithm performs best on synthetic experiments because strictly enforcing orthogonalization with an expensive QR step greatly helps when solving for all eigenvectors. EigenGame is able to effectively parallelize this over k machines, and the advantage of QR diminishes in Figure 3b. The remaining algorithms perform similarly on a linearly decaying spectrum; however, EigenGame performs better on an exponentially decaying spectrum, possibly due to instability of Riemannian gradients near the equilibrium (see Appendix J for further discussion). GHA and EigenGame$^R$ are equivalent under specific conditions (see Proposition K.1).

Figure 4a shows EigenGame solves for the eigenvectors up to a high degree of accuracy ($\pi/32$), i.e., the convergence results in Figure 3a are not the result of using a loose tolerance of $\pi/8$. With the lower tolerance, all algorithms take slightly more iterations to learn the eigenvectors of the linear spectrum; it is difficult to see any performance change for the exponential spectrum. Although Theorem 4.1 assumes distinct eigenvalues, Figure 4b supports the claim that EigenGame does not require distinct eigenvalues for convergence. We leave proving convergence in this setting to future work.

⁶ See Table 1 in (Allen-Zhu and Li, 2017).
⁷ A detailed discussion of Frequent Directions (Ghashami et al., 2016) can be found in Appendix H.

[Figure 4 panels: longest correct eigenvector streak versus number of training iterations (0 to 1000) for EigenGame, EigenGame$^R$, and Krasulina's, on linearly and exponentially decaying spectra; panel group (a) "Synthetic data: Stricter Tolerance", panel group (b) "Synthetic: Repeated Eigenvalues".]

Figure 4: (a) Repeats the analysis of Figure 3a but for a lower angular tolerance of $\pi/32$. (b) Repeats the analysis of Figure 3a with an angular tolerance of $\pi/8$ as before, but with eigenvalues 10–19 of the ordered spectrum overwritten with $\lambda_{10}$ of the original spectrum. We compute angular error for the eigenvectors on either side of this "bubble" to show that EigenGame finds these eigenvectors despite repeated eigenvalues in the spectrum; note 40/50 is optimal in this experiment.
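As referenced before Figure 4, a minimal JAX sketch of the two evaluation metrics follows; helper names are ours and eigenvectors are assumed to be stored as unit-norm columns:

```python
# Sketch of the two evaluation metrics described above (helper names are ours).
# Eigenvectors are stored as unit-norm columns of V_true and V_hat.
import jax.numpy as jnp


def longest_correct_streak(V_true, V_hat, tol=jnp.pi / 8):
    """Number of leading estimates within `tol` angular error of their targets."""
    inner = jnp.sum(V_true * V_hat, axis=0)                    # per-column <v_i, v_hat_i>
    errors = jnp.arcsin(jnp.sqrt(jnp.clip(1.0 - inner ** 2, 0.0, 1.0)))
    correct = errors < tol
    return jnp.sum(jnp.cumprod(correct))                       # streak stops at first failure


def normalized_subspace_distance(V_true, V_hat):
    """1 - (1/k) Tr(U P) in [0, 1], with U = V V^T and P = V_hat V_hat^T, as in Tang (2019)."""
    k = V_true.shape[1]
    U = V_true @ V_true.T
    P = V_hat @ V_hat.T
    return 1.0 - jnp.trace(U @ P) / k
```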
[Figure 5 panels: (a) "Principal Components": top-8 PCs (PC 1 through PC 8) arranged block-wise (Blocks 2 through 5 labeled; per-block dimensions on the right y-axis: 802816, 2408448, 9633792, 7225344, 301056); (b) "Block-1 Mean Filter Maps": one activation map per PC, PC 1 through PC 32.]

Figure 5: (a) Top-8 principal components of the activations of a RESNET-200 on IMAGENET ordered block-wise by network topology (dimension of each block on the right y-axis). Block 1 is closest to the input and Block 5 is the output of the network. Color coding is based on relative variance between blocks across the top-8 PCs, from blue (low) to red (high). (b) Block-1 mean activation maps of the top-32 principal components of RESNET-200 on IMAGENET computed with EigenGame.

MNIST handwritten digits. We compare EigenGame against GHA, Matrix Krasulina, and Oja's algorithm on the MNIST dataset (Figure 3b). We flatten each image in the training set to obtain a $60{,}000 \times 784$ dimensional matrix. EigenGame is competitive with Oja's in a high batch size regime (1024 samples per mini-batch). The performance gap between EigenGame and the other methods shrinks as the mini-batch size is reduced (see Appendix I), as expected, due to biased gradients.

The principal components of RESNET-200 activations on IMAGENET are edge filters. A primary goal of PCA is to obtain interpretable low-dimensional representations. To this end, we present an example of using EigenGame to compute the top-32 principal components of the activations of a pretrained RESNET-200 on the IMAGENET dataset. We concatenate the flattened activations from the output of each residual block, resulting in a $d \approx 20$M dimensional vector representation for each of the roughly 1.2M input images. It is not possible to store the entire 195TB matrix in memory, nor incrementally compute the Gram/covariance matrix. We implemented a data-and-model parallel version of EigenGame in JAX (Bradbury et al., 2018) where each $\hat{v}_i$ is assigned to its own TPU (Jouppi et al., 2017). Each device keeps a local copy of the RESNET parameters and the IMAGENET datastream. Sampling a mini-batch (of size 128), computing the network activations, and updating $\hat{v}_i$ are all performed locally. The broadcast($\hat{v}_i$) step is handled by the pmap and lax.all_gather functions (a minimal sketch of this pattern is given at the end of this section). Computing the top-32 principal components takes approximately nine hours on 32 TPUv3s.

Figure 5a shows the top principal components of the activations of the trained network organized by network topology (consisting of five residual blocks). Note that EigenGame is not applied block-wise, but on all 20M dimensions. We do not assume independence between blocks, and the eigenvector has unit norm across all blocks. We observe that Block 1 (closest to the input) of PC 1 has very small magnitude activations relative to the other PCs. This is because PC 1 should capture the variance which discriminates most between the classes in the dataset. Since Block 1 is mainly concerned with learning low-level image filters, it stands to reason that although these are important for good performance, they do not necessarily extract abstract representations which are useful for classification. Conversely, we see that PC 1 has larger relative activations in the later blocks. We visualize the average principal activation in Block 1⁸ in Figure 5b. The higher PCs learn distinct filters (Gabor filters, Laplacian-of-Gaussian filters; cf. (Bell and Sejnowski, 1997)).

⁸ The activations in Block 1 result from convolving 64 filters over the layer's input. We take the mean over the 64 channels and plot the resulting 112 × 112 image.
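The decentralized broadcast pattern described above (one $\hat{v}_i$ per device, local mini-batch updates, sharing via pmap and lax.all_gather) can be sketched as follows. This is an illustration under our own assumptions, not the production implementation: the update reuses the illustrative gradient form from the sketch in Section 1, and the function names are ours.

```python
# Sketch of the data-and-model-parallel pattern described above: each device
# owns one v_hat_i, updates it from its local mini-batch of activations, and
# receives the other players' vectors via all_gather. The gradient reuses the
# illustrative form from the Section 1 sketch; function names are ours.
import functools

import jax
import jax.numpy as jnp


def local_update(v_i, all_v, x_batch, player_id, lr=1e-3):
    """One EigenGame-style step for one player from a local batch of activations."""
    xv_i = x_batch @ v_i                                # (batch,)
    penalty = jnp.zeros_like(v_i)
    for j in range(all_v.shape[0]):
        is_parent = jnp.where(j < player_id, 1.0, 0.0)  # only players j < i act as parents
        xv_j = x_batch @ all_v[j]
        penalty = penalty + is_parent * (xv_i @ xv_j) / (xv_j @ xv_j + 1e-8) * (x_batch.T @ xv_j)
    grad = x_batch.T @ xv_i - penalty                   # Oja term minus Gram-Schmidt corrections
    v_new = v_i + lr * grad
    return v_new / jnp.linalg.norm(v_new)


@functools.partial(jax.pmap, axis_name='players')
def parallel_step(v_i, x_batch):
    player_id = jax.lax.axis_index('players')
    all_v = jax.lax.all_gather(v_i, 'players')          # broadcast(v_hat_i): shape (k, d)
    return local_update(v_i, all_v, x_batch, player_id)


# Usage (one device per player): vs has shape (k, d) and batches has shape
# (k, batch, d), where k equals the number of devices.
# vs = parallel_step(vs, batches)
```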
## 7 CONCLUSION

"It seems easier to train a bi-directional LSTM with attention than to compute the SVD of a large matrix." (Chris Ré, NeurIPS 2017 Test-of-Time Award; Rahimi and Recht, 2017)

In this work we motivated PCA from the perspective of a multi-player game. This inspired a decentralized algorithm which enables large-scale principal component estimation. To demonstrate this, we used EigenGame to analyze a large neural network through the lens of PCA. To our knowledge this is the first academic analysis of its type and scale (for reference, Tang (2019) computes the top-6 PCs of the d = 2300 outputs of VGG). EigenGame also opens a variety of research directions.

Scale. In experiments, we broadcast across all edges in Figure 1 every iteration. Introducing lag or broadcasting with dropout may improve efficiency. Can we further reduce our memory footprint by storing only scalars of the losses and avoiding congestion through online bandit or reinforcement learning techniques? Our decentralized algorithm may have implications for federated and privacy-preserving learning as well (Heinze et al., 2016; Heinze-Deml et al., 2018; Bonawitz et al., 2019).

Games. EigenGame has a unique Nash equilibrium due to the fixed DAG structure, but vectors are initialized randomly, so $\hat{v}_k$ may start closer to $v_1$ than $\hat{v}_1$ does. Adapting the DAG could make sense, but might also introduce spurious fixed points or suboptimal Nash equilibria. Might replacing vectors with populations accelerate extraction of the top principal components?

Core ML. EigenGame could be useful as a diagnostic or for accelerating training (Desjardins et al., 2015; Krummenacher et al., 2016); similarly, spectral normalization has been shown to be a valuable tool for stabilizing GAN training (Miyato et al., 2018). Lastly, GANs (Goodfellow et al., 2014) recently reformulated learning a generative model as a two-player zero-sum game. Here, we show how another fundamental unsupervised learning task can be formulated as a k-player game. While two-player, zero-sum games are well understood, research on k-player, general-sum games lies at the forefront of machine learning. We hope that marrying a fundamental, well-understood task in PCA with the relatively less understood domain of many-player games will help advance techniques on both ends.

## ACKNOWLEDGEMENTS

We are grateful to Trevor Cai for his help scaling the JAX implementation of EigenGame to handle the large IMAGENET experiment, and to Daniele Calandriello for sharing his expert knowledge of related work and advice on revising parts of the manuscript.

## REFERENCES

P-A. Absil, Robert Mahony, and Rodolphe Sepulchre. Optimization Algorithms on Matrix Manifolds. Princeton University Press, 2009.

Zeyuan Allen-Zhu and Yuanzhi Li. First efficient convergence for streaming k-PCA: a global, gap-free, and near-optimal rate. In 2017 IEEE 58th Annual Symposium on Foundations of Computer Science (FOCS), pages 487–492. IEEE, 2017.

Ehsan Amid and Manfred K Warmuth. An implicit form of Krasulina's k-PCA update without the orthonormality constraint. arXiv preprint arXiv:1909.04803, 2019.

Anthony J Bell and Terrence J Sejnowski. The independent components of natural scenes are edge filters. Vision Research, 37(23):3327–3338, 1997.
Keith Bonawitz, Hubert Eichner, Wolfgang Grieskamp, Dzmitry Huba, Alex Ingerman, Vladimir Ivanov, Chloe Kiddon, Jakub Konecny, Stefano Mazzocchi, H Brendan McMahan, et al. Towards federated learning at scale: system design. arXiv preprint arXiv:1902.01046, 2019.

Nicolas Boumal, Pierre-Antoine Absil, and Coralia Cartis. Global rates of convergence for nonconvex optimization on manifolds. IMA Journal of Numerical Analysis, 39(1):1–33, 2019.

James Bradbury, Roy Frostig, Peter Hawkins, Matthew James Johnson, Chris Leary, Dougal Maclaurin, and Skye Wanderman-Milne. JAX: composable transformations of Python+NumPy programs, 2018. URL http://github.com/google/jax.

Xi Chen, Yan Duan, Rein Houthooft, John Schulman, Ilya Sutskever, and Pieter Abbeel. InfoGAN: interpretable representation learning by information maximizing generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2172–2180, 2016.

Michael B Cohen, Cameron Musco, and Christopher Musco. Input sparsity time low-rank approximation via ridge leverage score sampling. In Proceedings of the Twenty-Eighth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 1758–1777. SIAM, 2017.

Constantinos Daskalakis, Paul W Goldberg, and Christos H Papadimitriou. The complexity of computing a Nash equilibrium. SIAM Journal on Computing, 39(1):195–259, 2009.

Guillaume Desjardins, Karen Simonyan, Razvan Pascanu, et al. Natural neural networks. In Advances in Neural Information Processing Systems, pages 2071–2079, 2015.

Dan Feldman, Melanie Schmidt, and Christian Sohler. Turning big data into tiny data: constant-size coresets for k-means, PCA, and projective clustering. SIAM Journal on Computing, 49(3):601–657, 2020.

Arpita Gang, Haroon Raja, and Waheed U Bajwa. Fast and communication-efficient distributed PCA. In ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 7450–7454. IEEE, 2019.

Mina Ghashami, Edo Liberty, Jeff M Phillips, and David P Woodruff. Frequent Directions: simple and deterministic matrix sketching. SIAM Journal on Computing, 45(5):1762–1792, 2016.

Benyamin Ghojogh, Fakhri Karray, and Mark Crowley. Eigenvalue and generalized eigenvalue problems: tutorial. arXiv preprint arXiv:1903.11240, 2019.

Itzhak Gilboa and Eitan Zemel. Nash and correlated equilibria: some complexity considerations. Games and Economic Behavior, 1(1):80–93, 1989.

Gene H Golub and Henk A Van der Vorst. Eigenvalue computation in the 20th century. Journal of Computational and Applied Mathematics, 123(1-2):35–65, 2000.

Gene H Golub and Charles F Van Loan. Matrix Computations, volume 3. JHU Press, 2012.

Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2672–2680, 2014.

Nathan Halko, Per-Gunnar Martinsson, and Joel A Tropp. Finding structure with randomness: probabilistic algorithms for constructing approximate matrix decompositions. SIAM Review, 53(2):217–288, 2011.

Donald Olding Hebb. The Organization of Behavior: A Neuropsychological Theory. Psychology Press, 2005.

Christina Heinze, Brian McWilliams, and Nicolai Meinshausen. DUAL-LOCO: distributing statistical estimation using random projections. In Artificial Intelligence and Statistics, pages 875–883, 2016.

Christina Heinze-Deml, Brian McWilliams, and Nicolai Meinshausen. Preserving privacy between features in distributed estimation. Stat, 7(1):e189, 2018.
Roger A Horn and Charles R Johnson. Matrix Analysis. Cambridge University Press, 2012.

Ian T Jolliffe. Principal components in regression analysis. In Principal Component Analysis. Springer, 2002.

Norman P Jouppi, Cliff Young, Nishant Patil, David Patterson, Gaurav Agrawal, Raminder Bajwa, Sarah Bates, Suresh Bhatia, Nan Boden, Al Borchers, et al. In-datacenter performance analysis of a tensor processing unit. In Proceedings of the 44th Annual International Symposium on Computer Architecture, pages 1–12, 2017.

T P Krasulina. The method of stochastic approximation for the determination of the least eigenvalue of a symmetrical matrix. USSR Computational Mathematics and Mathematical Physics, 9(6):189–195, 1969.

Gabriel Krummenacher, Brian McWilliams, Yannic Kilcher, Joachim M Buhmann, and Nicolai Meinshausen. Scalable adaptive stochastic optimization using random projections. In Advances in Neural Information Processing Systems, pages 1750–1758, 2016.

Shengqiao Li. Concise formulas for the area and volume of a hyperspherical cap. Asian Journal of Mathematics and Statistics, 4(1):66–70, 2011.

Francesco Locatello, Stefan Bauer, Mario Lucic, Gunnar Raetsch, Sylvain Gelly, Bernhard Schölkopf, and Olivier Bachem. Challenging common assumptions in the unsupervised learning of disentangled representations. In Proceedings of the International Conference on Machine Learning, pages 4114–4124, 2019.

Emile Mathieu, Tom Rainforth, N Siddharth, and Yee Whye Teh. Disentangling disentanglement in variational autoencoders. arXiv preprint arXiv:1812.02833, 2018.

Takeru Miyato, Toshiki Kataoka, Masanori Koyama, and Yuichi Yoshida. Spectral normalization for generative adversarial networks. arXiv preprint arXiv:1802.05957, 2018.

Erkki Oja. Simplified neuron model as a principal component analyzer. Journal of Mathematical Biology, 15(3):267–273, 1982.

Ali Rahimi and Benjamin Recht. Reflections on random kitchen sinks, 2017. URL http://www.argmin.net/2017/12/05/kitchen-sinks/.

Haroon Raja and Waheed U Bajwa. Distributed stochastic algorithms for high-rate streaming principal component analysis. arXiv preprint arXiv:2001.01017, 2020.

H Rutishauser. Simultaneous iteration method for symmetric matrices. In Handbook for Automatic Computation, pages 284–302. Springer, 1971.

Terence D Sanger. Optimal unsupervised learning in a single-layer linear feedforward neural network. Neural Networks, 2(6):459–473, 1989.

Mhd Hasan Sarhan, Abouzar Eslami, Nassir Navab, and Shadi Albarqouni. Learning interpretable disentangled representations using adversarial VAEs. In Domain Adaptation and Representation Transfer and Medical Image Learning with Less Labels and Imperfect Data, pages 37–44. Springer, 2019.

Tamas Sarlos. Improved approximation algorithms for large matrices via random projections. In 2006 47th Annual IEEE Symposium on Foundations of Computer Science (FOCS'06), pages 143–152. IEEE, 2006.

Ohad Shamir. A stochastic PCA and SVD algorithm with an exponential convergence rate. In Proceedings of the International Conference on Machine Learning, pages 144–152, 2015.

Ohad Shamir. Convergence of stochastic gradient descent for PCA. In Proceedings of the International Conference on Machine Learning, pages 257–265, 2016a.

Ohad Shamir. Fast stochastic algorithms for SVD and PCA: convergence properties and convexity. In Proceedings of the International Conference on Machine Learning, pages 248–256, 2016b.
Cheng Tang. Exponentially convergent stochastic k-PCA without variance reduction. In Advances in Neural Information Processing Systems, pages 12393–12404, 2019.

Aladin Virmaux and Kevin Scaman. Lipschitz regularity of deep neural networks: analysis and efficient estimation. In Advances in Neural Information Processing Systems, pages 3835–3844, 2018.