# Loss Landscapes of Regularized Linear Autoencoders

Daniel Kunin*¹, Jonathan M. Bloom*², Aleksandrina Goeva², Cotton Seed²

*Equal contribution. ¹Institute for Computational and Mathematical Engineering, Stanford University, Stanford, California, USA. ²Broad Institute of MIT and Harvard, Cambridge, Massachusetts, USA. Correspondence to: Daniel Kunin, Jonathan M. Bloom.

*Proceedings of the 36th International Conference on Machine Learning*, Long Beach, California, PMLR 97, 2019. Copyright 2019 by the author(s).

## Abstract

Autoencoders are a deep learning model for representation learning. When trained to minimize the distance between the data and its reconstruction, linear autoencoders (LAEs) learn the subspace spanned by the top principal directions but cannot learn the principal directions themselves. In this paper, we prove that L2-regularized LAEs are symmetric at all critical points and learn the principal directions as the left singular vectors of the decoder. We smoothly parameterize the critical manifold and relate the minima to the MAP estimate of probabilistic PCA. We illustrate these results empirically and consider implications for PCA algorithms, computational neuroscience, and the algebraic topology of learning.

## 1. Introduction

Consider a data set consisting of points $x_1, \ldots, x_n$ in $\mathbb{R}^m$. Let $X \in \mathbb{R}^{m \times n}$ be the data matrix with columns $x_i$. We will assume throughout that $k \leq \min\{m, n\}$ and that the singular values of $X$ are positive and distinct.

An autoencoder consists of an encoder $f: \mathbb{R}^m \to \mathbb{R}^k$ and a decoder $g: \mathbb{R}^k \to \mathbb{R}^m$; the latter maps the latent representation $f(x_i)$ to the reconstruction $\hat{x}_i = g(f(x_i))$ (Goodfellow et al., 2016). The full network is trained to minimize reconstruction error, typically the squared Euclidean distance between the data set $X$ and its reconstruction $\hat{X}$ (or equivalently, the Frobenius norm of $X - \hat{X}$). When the activations of the network are the identity, the model class reduces to one encoder layer $W_1 \in \mathbb{R}^{k \times m}$ and one decoder layer $W_2 \in \mathbb{R}^{m \times k}$. We refer to this model as a linear autoencoder (LAE) with loss function defined by

$$\mathcal{L}(W_1, W_2) = \|X - W_2 W_1 X\|_F^2.$$

Parameterizing $\mathcal{L}$ by the product $W = W_2 W_1$, the Eckart–Young Theorem (Eckart & Young, 1936) states that the optimal $W$ orthogonally projects $X$ onto the subspace spanned by its top $k$ principal directions. (The principal directions of $X$ are the eigenvectors of the $m \times m$ covariance of $X$ in descending order by eigenvalue, or equivalently the left singular vectors of the mean-centered $X$ in descending order by squared singular value.)

Without regularization, LAEs learn this subspace but cannot learn the principal directions themselves due to the symmetry of $\mathcal{L}$ under the action of the group $\mathrm{GL}_k(\mathbb{R})$ of invertible $k \times k$ matrices defined by $(W_1, W_2) \mapsto (G W_1, W_2 G^{-1})$:

$$X - (W_2 G^{-1})(G W_1) X = X - W_2 W_1 X. \tag{1}$$

Indeed, $\mathcal{L}$ achieves its minimum value on a smooth submanifold of $\mathbb{R}^{k \times m} \times \mathbb{R}^{m \times k}$ diffeomorphic to $\mathrm{GL}_k(\mathbb{R})$; the learned latent representation is only defined up to deformation by invertible linear maps; and the $k$-dimensional eigenspace of $W$ with eigenvalue one has no preferred basis.

In light of the above, the genesis of this work was our surprise at the theorem (since retracted following our correspondence) and empirical observation in Plaut (2018) that the principal directions of $X$ are recovered from a trained LAE as the left singular vectors of the decoder (or as the right singular vectors of the encoder). We realized by looking at the code that training was done with the common practice of L2-regularization:

$$\mathcal{L}_\sigma(W_1, W_2) = \mathcal{L}(W_1, W_2) + \lambda \left( \|W_1\|_F^2 + \|W_2\|_F^2 \right).$$

In this paper, we prove that LAEs with L2-regularization do in fact learn the principal directions in this way, while shrinking eigenvalues in the manner of probabilistic PCA (Tipping & Bishop, 1999).
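As a concrete illustration of the symmetry in (1), and of how the regularizer in $\mathcal{L}_\sigma$ breaks it, here is a minimal NumPy sketch (ours, not part of the paper's released code):

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, k = 5, 20, 2

X = rng.standard_normal((m, n))        # data matrix with columns x_i
W1 = rng.standard_normal((k, m))       # encoder
W2 = rng.standard_normal((m, k))       # decoder
G = rng.standard_normal((k, k))        # generically invertible

def loss(W1, W2, X):
    """Unregularized LAE reconstruction loss ||X - W2 W1 X||_F^2."""
    return np.linalg.norm(X - W2 @ W1 @ X) ** 2

# The reconstruction loss is unchanged under (W1, W2) -> (G W1, W2 G^{-1}) ...
print(np.isclose(loss(W1, W2, X), loss(G @ W1, W2 @ np.linalg.inv(G), X)))  # True

# ... but the sum regularizer ||W1||_F^2 + ||W2||_F^2 is not, which is what
# breaks the GL_k symmetry down to the orthogonal group O_k.
reg = lambda W1, W2: np.linalg.norm(W1) ** 2 + np.linalg.norm(W2) ** 2
print(np.isclose(reg(W1, W2), reg(G @ W1, W2 @ np.linalg.inv(G))))          # generically False
```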
The key idea is that regularization reduces the symmetry group from $\mathrm{GL}_k(\mathbb{R})$ to the orthogonal group $O_k(\mathbb{R})$, which preserves the structure of the SVD. We further prove that the encoder and decoder are transposes at all critical points, with implications for whether the brain could plausibly implement error backpropagation.

### 1.1. Related work

Building on the original work of Eckart & Young (1936) on low-rank matrix approximation, Izenman (1975) demonstrated a connection between a rank-reduced regression model similar to an LAE and PCA. Bourlard & Kamp (1988) characterized the minima of an unregularized LAE; Baldi & Hornik (1989) extended this analysis to all critical points.

Several studies of the effect of regularization on LAEs have emerged of late. The rank-reduced regression model was extended in Mukherjee & Zhu (2011) to rank-reduced ridge regression. A similar extension of the LAE model was given in Josse & Wager (2016). An in-depth analysis of the linear denoising autoencoder was given in Pretorius et al. (2018), and most recently Mianjy et al. (2018) explored the effect of dropout regularization on the minima of an LAE.

While L2-regularization is a foundational technique in statistical learning, its effect on autoencoder models has not been fully characterized. Recent work of Mehta et al. (2018) on L2-regularized deep linear networks applies techniques from algebraic geometry to highlight how algebraic symmetries result in flat critical manifolds and how L2-regularization breaks these symmetries to produce isolated critical points. We instead apply elementary linear algebra, with intuition from algebraic topology, to completely resolve dynamics in the special case of LAEs.

### 1.2. Our contributions

The contributions of our paper are as follows.

In Section 2, we consider LAEs with (i) no regularization, (ii) L2-regularization of the composition of the encoder and decoder, and (iii) L2-regularization of the encoder and decoder separately as in $\mathcal{L}_\sigma$. We build intuition by analyzing the scalar case, consider the relationship between regularization and orthogonality, and deduce that the encoder and decoder are transposes at all critical points of $\mathcal{L}_\sigma$.

In Section 3, we realize all three LAE models as generative processes, most notably relating the minimum of $\mathcal{L}_\sigma$ and the MAP estimate of probabilistic PCA.

In Section 4, we characterize all three loss landscapes. To build intuition, we first leave the overparameterized world of coordinate representations to think geometrically about the squared distance from a plane to a point cloud. We expand on this topological viewpoint in Appendix E.

In Section 5, we illustrate these results empirically, with all code and several talks available on GitHub (github.com/danielkunin/Regularized-Linear-Autoencoders).

In Section 6, we discuss implications for eigendecomposition algorithms, computational neuroscience, and deep learning.

The connections we draw between regularization and orthogonality, LAEs and probabilistic PCA, and the topology of Grassmannians are novel and provide a deeper understanding of the loss landscapes of regularized linear autoencoders.
## 2. Regularized LAEs

In Appendix A, we provide a self-contained derivation of the fact that an LAE with bias parameters is equivalent to an LAE without bias parameters trained on mean-centered data (Bourlard & Kamp, 1988). So without loss of generality, we assume $X$ is mean-centered and consider the following three LAE loss functions for fixed $\lambda > 0$:

$$\mathcal{L}(W_1, W_2) = \|X - W_2 W_1 X\|_F^2$$
$$\mathcal{L}_\pi(W_1, W_2) = \mathcal{L}(W_1, W_2) + \lambda \|W_2 W_1\|_F^2$$
$$\mathcal{L}_\sigma(W_1, W_2) = \mathcal{L}(W_1, W_2) + \lambda \left( \|W_1\|_F^2 + \|W_2\|_F^2 \right)$$

We call these the unregularized, product, and sum losses, respectively. The product and sum losses mirror the loss functions of a linear denoising autoencoder (DAE) (Vincent et al., 2010) and a linear contractive autoencoder (CAE) (Rifai et al., 2011), respectively. See Appendix C for details.

### 2.1. Visualizing LAE loss landscapes

We can visualize these loss functions directly in the case $n = m = k = 1$, as shown in Figure 1. In fact, working out the critical points in this scalar case led us to conjecture the general result in Section 4. We invite the reader to enjoy deriving the following results and experimenting with these loss landscapes using our online visualization tool.

*Figure 1. Scalar loss landscapes with $x^2 = 4$: (a) Unregularized, (b) Product ($\lambda = 2$), (c) Sum ($\lambda = 2$), (d) Sum ($\lambda = 4$). Yellow points are saddles and red curves and points are global minima.*

For all three losses, the origin $w_1 = w_2 = 0$ is the unique rank-0 critical point. For $\mathcal{L}$ and $\mathcal{L}_\pi$, the origin is always a saddle point, while for $\mathcal{L}_\sigma$ the origin is either a saddle point or the global minimum depending on the value of $\lambda$.

For $\mathcal{L}$, the global minima are rank-1 and consist of the hyperbola $w_2 w_1 = 1$ (identified with the components of $\mathrm{GL}_1(\mathbb{R}) = \mathbb{R} \setminus \{0\}$).

For $\mathcal{L}_\pi$, the global minima are rank-1 and consist of this hyperbola shrunk toward the origin as in ridge regression, $w_2 w_1 = (1 + \lambda x^{-2})^{-1}$.

For $\mathcal{L}_\sigma$, the critical points depend on the scale of $\lambda$ relative to $x^2$. For $\lambda < x^2$, the origin is a saddle point and the global minima are the two isolated rank-1 critical points (identified with the components of $O_1(\mathbb{R}) = \{\pm 1\}$) cut out by the equations

$$w_2 w_1 = 1 - \lambda x^{-2}, \qquad w_1 = w_2.$$

As $\lambda$ increases toward $x^2$, these minima move toward the origin, which remains a saddle point. As $\lambda$ exceeds $x^2$, the origin becomes the unique global minimum. This loss of information was our first hint at the connection to probabilistic PCA formalized in Theorem 3.1.

### 2.2. Regularization and orthogonality

Adding L2-regularization to the encoder and decoder separately reduces the symmetries of the loss from $\mathrm{GL}_k(\mathbb{R})$ to the orthogonal group $O_k(\mathbb{R})$. Two additional facts about the relationship between regularization and orthogonality have guided our intuition:

(a) Orthogonal matrices are the determinant $\pm 1$ matrices of minimal Frobenius norm,
$$\arg\min_A \|A\|_F^2 \quad \text{s.t.} \quad \det(A)^2 = 1.$$
Geometrically: the unit-volume parallelotope of minimal total squared side length is the unit hypercube.

(b) Orthogonal matrices are the inverse matrices of minimum total squared Frobenius norm,
$$\arg\min_{A, B} \|A\|_F^2 + \|B\|_F^2 \quad \text{s.t.} \quad AB = I,$$
and in particular $A = B^\top$ at all minima.

Both facts follow from the inequality of arithmetic and geometric means after casting the problems in terms of squared singular values (the squared Frobenius norm is their sum and the squared determinant is their product).
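As a quick numerical check of the scalar analysis in Section 2.1, the following sketch (ours, using central finite differences) verifies that the gradients of all three scalar losses vanish at the claimed critical points:

```python
import numpy as np

x, lam = 2.0, 2.0          # scalar data point and regularization strength (x^2 = 4 > lam)

L      = lambda w1, w2: (x - w2 * w1 * x) ** 2
L_prod = lambda w1, w2: L(w1, w2) + lam * (w2 * w1) ** 2
L_sum  = lambda w1, w2: L(w1, w2) + lam * (w1 ** 2 + w2 ** 2)

def grad(f, w1, w2, eps=1e-6):
    """Central-difference gradient of a scalar loss at (w1, w2)."""
    return np.array([(f(w1 + eps, w2) - f(w1 - eps, w2)) / (2 * eps),
                     (f(w1, w2 + eps) - f(w1, w2 - eps)) / (2 * eps)])

# Unregularized: any point on the hyperbola w2 * w1 = 1 is a global minimum.
print(grad(L, 2.0, 0.5))                       # ~[0, 0]

# Product: the hyperbola shrinks to w2 * w1 = (1 + lam * x**-2) ** -1.
p = 1 / (1 + lam / x ** 2)
print(grad(L_prod, 2.0, p / 2.0))              # ~[0, 0]

# Sum: for lam < x**2 the minima are the two points w1 = w2 = +/- sqrt(1 - lam / x**2).
w = np.sqrt(1 - lam / x ** 2)
print(grad(L_sum, w, w), grad(L_sum, -w, -w))  # all ~[0, 0]
```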
While it was not immediately clear to us that the transpose relationship in (b) also holds at the minima of $\mathcal{L}_\sigma$, in fact all critical points of an L2-regularized linear autoencoder are symmetric:

**Theorem 2.1 (Transpose Theorem).** All critical points of $\mathcal{L}_\sigma$ satisfy $W_1 = W_2^\top$.

Our proof uses elementary properties of positive definite matrices, as reviewed in Appendix B. Note that without regularization, all critical points are pseudoinverses, $W_1 = W_2^+$, as is clear in the scalar case and derived in Section 4.

*Proof.* Critical points of $\mathcal{L}_\sigma$ satisfy

$$\frac{\partial \mathcal{L}_\sigma}{\partial W_1} = 2 W_2^\top (W_2 W_1 - I) X X^\top + 2\lambda W_1 = 0,$$
$$\frac{\partial \mathcal{L}_\sigma}{\partial W_2} = 2 (W_2 W_1 - I) X X^\top W_1^\top + 2\lambda W_2 = 0.$$

We first prove that the matrix $C = (I - W_2 W_1) X X^\top$ is positive semi-definite. (Intuitively, we expect this property so long as $W_2 W_1$ shrinks the principal directions of $X$, so that $I - W_2 W_1$ does as well.) Rearranging $\frac{\partial \mathcal{L}_\sigma}{\partial W_2} W_2^\top = 0$ gives

$$X X^\top (W_2 W_1)^\top = (W_2 W_1) X X^\top (W_2 W_1)^\top + \lambda W_2 W_2^\top.$$

Both terms on the right are positive semi-definite, so their sum on the left is as well, and therefore $X X^\top (W_2 W_1)^\top \succeq (W_2 W_1) X X^\top (W_2 W_1)^\top$. Cancelling $(W_2 W_1)^\top$ via Lemma B.1 gives $C \succeq 0$.

We now show the difference $A = W_1 - W_2^\top$ is zero. Rearranging terms using the symmetry of $C$ gives

$$\frac{\partial \mathcal{L}_\sigma}{\partial W_1} - \left( \frac{\partial \mathcal{L}_\sigma}{\partial W_2} \right)^\top = 2 A (C + \lambda I) = 0.$$

Since $C \succeq 0$ and $\lambda > 0$ imply $C + \lambda I \succ 0$, we conclude from $A (C + \lambda I) A^\top = 0$ that $A = 0$. ∎

## 3. Bayesian Models

In this section, we identify Bayesian counterparts of our three loss functions and derive a novel connection between (regularized) LAEs and (probabilistic) PCA. This connection enables the application of any LAE training method to Bayesian MAP estimation of the corresponding model.

Consider the rank-$k$ (self-)regression model

$$x_i = W_2 W_1 x_i + \varepsilon_i = W x_i + \varepsilon_i,$$

where $W_1$ and $W_2$ act through their product $W$ and $\varepsilon_i \sim \mathcal{N}_m(0, 1)$.

- $\mathcal{L}$ is rank-$k$ regression. The prior on $W$ is the uniform distribution on $\mathbb{R}^{m \times m}$ restricted to rank-$k$ matrices. (Note that rank-$(k-1)$ matrices are a measure-zero subset of rank-$k$ matrices.)
- $\mathcal{L}_\pi$ is rank-$k$ ridge regression. The prior on $W$ is $\mathcal{N}_{m \times m}(0, \lambda^{-1})$ restricted to rank-$k$ matrices.
- $\mathcal{L}_\sigma$ is the model with $W_1$ and $W_2^\top$ independently drawn from $\mathcal{N}_{k \times m}(0, \lambda^{-1})$.

Theorem 4.2 shows that the minima of $\mathcal{L}_\sigma$, or equivalently the MAP of the Bayesian model, are such that $W_2$ is the orthogonal projection onto the top $k$ principal directions followed by compression in direction $i$ via multiplication by the factor $(1 - \lambda \sigma_i^{-2})^{\frac{1}{2}}$ for $\sigma_i^2 > \lambda$ and zero otherwise. Notably, for principal directions with eigenvalues dominated by $\lambda$, all information is lost no matter the number of data points. The same phenomenon occurs for pPCA with respect to eigenvalues dominated by the variance of the noise, $\sigma^2$.

Let's consider these Bayesian models side by side, with $W_0 \in \mathbb{R}^{m \times k}$ the parameter of pPCA (see Chapter 12.2 of Bishop (2006) for background on pPCA):

| Bayesian $\mathcal{L}_\sigma$ | pPCA |
|---|---|
| $W_1, W_2^\top \sim \mathcal{N}_{k \times m}(0, \lambda^{-1})$ | $z_i \sim \mathcal{N}_k(0, 1)$ |
| $\varepsilon_i \sim \mathcal{N}_m(0, 1)$ | $\varepsilon_i \sim \mathcal{N}_m(0, \sigma^2)$ |
| $x_i = W_2 W_1 x_i + \varepsilon_i$ | $x_i = W_0 z_i + \varepsilon_i$ |

Comparing the critical points of $\mathcal{L}_\sigma$ in Theorem 4.2,

$$W_1^\top = W_2 = U_I (I_\ell - \lambda \Sigma_I^{-2})^{\frac{1}{2}} O^\top, \tag{2}$$

and of pPCA (Tipping & Bishop, 1999),

$$W_0 = U_I \Sigma_I (I_\ell - \sigma^2 \Sigma_I^{-2})^{\frac{1}{2}} O^\top, \tag{3}$$

where $O \in \mathbb{R}^{k \times \ell}$ has orthonormal columns, we see that $\lambda$ corresponds to $\sigma^2$ (rather than the precision $\sigma^{-2}$) in the sense that principal directions with eigenvalues dominated by either are collapsed to zero. The critical points only differ in the factor by which the remaining principal directions are shrunk. More precisely:

**Theorem 3.1 (pPCA Theorem).** With $\sigma^2 = \lambda$, the critical points of

$$\mathcal{L}^0_\sigma(W_1, W_2) = \mathcal{L}_\sigma\!\left(W_1 (X X^\top)^{-\frac{1}{2}},\ (X X^\top)^{-\frac{1}{2}} W_2\right)$$

coincide with the critical points of pPCA.

*Proof.* Multiplying the expression for $W_2$ in (2) on the left by $(X X^\top)^{\frac{1}{2}}$ gives the expression for $W_0$ in (3). ∎
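To make the correspondence concrete, here is a small NumPy check (our sketch; variable names are ours) that builds the critical decoder of $\mathcal{L}_\sigma$ from equation (2) and confirms that multiplying it on the left by $(XX^\top)^{1/2}$ yields the pPCA parameter in equation (3) when $\sigma^2 = \lambda$:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, k, lam = 6, 50, 3, 2.0

# Data with distinct singular values; we take sigma^2 = lam for the pPCA comparison.
X = rng.standard_normal((m, n))
U, S, Vt = np.linalg.svd(X, full_matrices=False)   # left singular vectors and values

idx = np.arange(k)                                 # index set I = top-k directions
U_I, S_I = U[:, idx], S[idx]
assert np.all(S_I**2 > lam), "formula (2) requires sigma_i^2 > lambda"

O = np.eye(k)                                      # any k x l matrix with orthonormal columns

# Equation (2): critical decoder of the sum loss.
W2 = U_I @ np.diag(np.sqrt(1 - lam / S_I**2)) @ O.T

# Equation (3): pPCA parameter with sigma^2 = lam.
W0 = U_I @ np.diag(S_I * np.sqrt(1 - lam / S_I**2)) @ O.T

# (X X^T)^{1/2} = U diag(S) U^T; multiplying (2) on the left by it gives (3).
XXt_half = U @ np.diag(S) @ U.T
print(np.allclose(XXt_half @ W2, W0))              # True
```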
Interestingly, the generative model for $\mathcal{L}^0_\sigma$ differs from that of pPCA. For example, in the scalar case $\mathcal{L}^0_\sigma(w_1, w_2)$ is

$$x^2 \left(1 - x^{-2} w_2 w_1\right)^2 + \lambda x^{-2} (w_1^2 + w_2^2),$$

whereas the negative log-likelihood of pPCA is

$$\frac{1}{2} \left[ \ln(2\pi) + \ln(w_0^2 + \sigma^2) + x^2 (w_0^2 + \sigma^2)^{-1} \right].$$

## 4. Loss Landscapes

Having contextualized LAE models in a Bayesian framework, we now turn to understanding their loss landscapes. Symmetries such as (1) exist because the model is expressed in an overparameterized coordinate form rooted in classical linear algebra. This results in flat critical manifolds rather than a finite number of critical points. In Section 4.1, we remove all symmetries by expressing the loss geometrically over a topological domain. This results in $\binom{m}{k}$ critical points, and in particular a unique minimum. This intuition will pay off in Sections 4.2 and 4.3, where we fully characterize the critical manifolds and local curvatures of all three LAE loss landscapes.

### 4.1. Critical points

We now consider reconstruction loss over the domain of $k$-dimensional planes through the origin in $\mathbb{R}^m$. This space has the structure of a $k(m-k)$-dimensional smooth, compact manifold called the Grassmannian of $k$-planes in $\mathbb{R}^m$, denoted $\mathrm{Gr}_k(\mathbb{R}^m)$ (Hatcher, 2002). We'll build intuition with a few simple examples.

$\mathrm{Gr}_1(\mathbb{R}^2)$ is the space of lines through the origin in the plane, which may be smoothly parameterized by a counterclockwise angle of rotation of the $x$-axis modulo the half turn that maps a line to itself.

$\mathrm{Gr}_1(\mathbb{R}^3)$ is the space of lines through the origin in 3-space, also known as the real projective plane. We can visualize $\mathrm{Gr}_1(\mathbb{R}^3)$ as the northern hemisphere of the 2-sphere with equator glued by the antipodal map.

$\mathrm{Gr}_2(\mathbb{R}^3)$ is identified with $\mathrm{Gr}_1(\mathbb{R}^3)$ by mapping a plane to its 1-dimensional orthogonal complement.

A point cloud $X$ in $\mathbb{R}^m$ determines a smooth function

$$\mathcal{L}_X : \mathrm{Gr}_k(\mathbb{R}^m) \to \mathbb{R}$$

whose value on a $k$-plane is the sum of squared distances from the points to the plane.

*Figure 2. (a) Lines in the plane. (b) Lines in space. Left: principal directions of a point cloud $X$. Middle: $\mathcal{L}_X$ as a height function on the manifold of lines through the origin. Right: negative gradient flow of $\mathcal{L}_X$.*

Figure 2 depicts $\mathcal{L}_X$ as a height function for $\mathrm{Gr}_1(\mathbb{R}^2)$ and $\mathrm{Gr}_1(\mathbb{R}^3)$. Note that the min and max in (a) are the principal directions $u_1$ and $u_2$, while the min, saddle, and max in (b) are the principal directions $u_1$, $u_2$, and $u_3$. At right, we depict the negative gradient flow of $\mathcal{L}_X$ on each space, with (b) represented as a disk with glued boundary. In (a), $u_2$ may descend to $u_1$ by rotating clockwise or counterclockwise (formally, we mean there are two gradient trajectories, modulo time translation, that converge asymptotically to these critical points in each time direction, namely the left and right halves of the circle). In (b), $u_3$ may descend to $u_1$ or $u_2$ by rotating in either of two directions in the plane they span, and $u_2$ may similarly descend to $u_1$.

The following theorem requires our assumption that the singular values of $X$ are distinct.

**Theorem 4.1 (Grassmannian Theorem).** $\mathcal{L}_X$ is a smooth function with $\binom{m}{k}$ critical points given by all rank-$k$ principal subspaces. In local coordinates near the critical point with principal directions $i_1 < \cdots < i_k$, $\mathcal{L}_X$ takes the form of a standard non-degenerate saddle with

$$d_I = \sum_{j=1}^{k} (i_j - j) \tag{4}$$

descending directions.

The latter formula counts the total number of pairs $i < j$ with $i \notin I$ and $j \in I$. These correspond to directions to flow along $-\nabla \mathcal{L}_X$ by rotating one principal direction $u_j$ to another $u_i$ of higher eigenvalue, fixing the rest.
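The counting in (4) is easy to check directly. The following short Python sketch (ours) enumerates the critical subspaces for a small example and verifies that the formula and the pair-counting description agree:

```python
from itertools import combinations
from math import comb

m, k = 4, 2   # example dimensions; critical points are indexed by k-subsets of {1, ..., m}

subsets = list(combinations(range(1, m + 1), k))
assert len(subsets) == comb(m, k)                       # m choose k critical points

for I in subsets:
    # Equation (4): number of descending directions at the critical subspace span{u_i : i in I}.
    d_I = sum(i_j - j for j, i_j in enumerate(I, start=1))
    # Equivalent count: pairs i < j with i not in I and j in I, i.e. ways to rotate a
    # principal direction in the subspace toward one of higher eigenvalue outside it.
    pairs = sum(1 for j in I for i in range(1, j) if i not in I)
    assert d_I == pairs
    print(f"I={I}: d_I={d_I}")

# I=(1, 2) is the top principal subspace with d_I = 0 (the unique local, hence global, minimum);
# I=(3, 4) has d_I = k*(m-k) = 4 descending directions (the maximum).
```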
In Appendix E, we prove a stronger form of the Grassmannian Theorem by combining Theorem 4.2, a commutative diagram (10) relating $\mathcal{L}_X$ and $\mathcal{L}$, and techniques from algebraic topology.

### 4.2. Critical manifolds

Translating the Grassmannian Theorem back to the coordinate representation $\mathbb{R}^{k \times m} \times \mathbb{R}^{m \times k}$ introduces two additional phenomena we saw in the scalar case in Section 2.1:

- Each critical point on $\mathrm{Gr}_k(\mathbb{R}^m)$ corresponds to a manifold $\mathrm{GL}_k(\mathbb{R})$ or $O_k(\mathbb{R})$ of rank-$k$ critical points.
- Critical manifolds appear with rank less than $k$. In particular, $(0, 0)$ is a critical point for all three losses.

Now let's combine our topological and scalar intuition to understand the loss landscapes of LAEs in all dimensions and for all three losses.

Theorem 4.2 requires our assumption that $X$ has distinct singular values $\sigma_1 > \cdots > \sigma_m > 0$ (for the sum loss, we also assume $\lambda$ is distinct from all $\sigma_i^2$). Let $u_1, \ldots, u_m$ denote the corresponding left singular vectors (or principal directions) of $X$. For an index set $I \subseteq \{1, \ldots, m\}$ we define:

- $\ell = |I|$ and increasing indices $i_1 < \cdots < i_\ell$,
- $\Sigma_I = \mathrm{diag}(\sigma_{i_1}, \ldots, \sigma_{i_\ell}) \in \mathbb{R}^{\ell \times \ell}$,
- $U_I \in \mathbb{R}^{m \times \ell}$, consisting of columns $i_1, \ldots, i_\ell$ of $U$,
- $F_I$, the open submanifold of $\mathbb{R}^{k \times \ell}$ whose points are matrices $G$ with independent columns (the manifold of $\ell$-frames in $\mathbb{R}^k$),
- $V_I$, the closed submanifold of $\mathbb{R}^{k \times \ell}$ whose points are matrices $O$ with orthonormal columns (the Stiefel manifold $V_\ell(\mathbb{R}^k)$).

**Theorem 4.2 (Landscape Theorem).** For each loss, the critical points form a smooth submanifold of $\mathbb{R}^{k \times m} \times \mathbb{R}^{m \times k}$. For $\mathcal{L}$ and $\mathcal{L}_\pi$, this submanifold is diffeomorphic to the disjoint union of $F_I$ over all $I \subseteq \{1, \ldots, m\}$ of size at most $k$. For $\mathcal{L}_\sigma$, this submanifold is diffeomorphic to the disjoint union of $V_I$ over all $I \subseteq \{1, \ldots, m_0\}$ of size at most $k$, where $m_0$ is the largest index such that $\sigma_{m_0}^2 > \lambda$. These diffeomorphisms map $G \in F_I$ or $O \in V_I$ to a critical point $(W_1, W_2)$ as follows:

| Loss | $W_1$ | $W_2$ |
|---|---|---|
| $\mathcal{L}$ | $G U_I^\top$ | $U_I G^+$ |
| $\mathcal{L}_\pi$ | $G (I_\ell + \lambda \Sigma_I^{-2})^{-\frac{1}{2}} U_I^\top$ | $U_I (I_\ell + \lambda \Sigma_I^{-2})^{-\frac{1}{2}} G^+$ |
| $\mathcal{L}_\sigma$ | $O (I_\ell - \lambda \Sigma_I^{-2})^{\frac{1}{2}} U_I^\top$ | $U_I (I_\ell - \lambda \Sigma_I^{-2})^{\frac{1}{2}} O^\top$ |

The proof of the Landscape Theorem 4.2 follows quickly from the Transpose Theorem 2.1 and Proposition 4.3 below.
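Before turning to the proof, here is a numerical sanity check of the closed form for $\mathcal{L}_\sigma$ (our own NumPy sketch, not the authors' code): it constructs the critical point for an index set $I$ from the table above and verifies that both gradients of the sum loss vanish there.

```python
import numpy as np

rng = np.random.default_rng(1)
m, n, k, lam = 6, 40, 3, 4.0

X = rng.standard_normal((m, n))
U, S, _ = np.linalg.svd(X, full_matrices=False)

I = [0, 2]                                  # any index set with sigma_i^2 > lam
ell = len(I)
U_I, S_I = U[:, I], S[I]
assert np.all(S_I**2 > lam)

# Any k x ell matrix O with orthonormal columns (here from a QR factorization).
O, _ = np.linalg.qr(rng.standard_normal((k, ell)))

# Landscape Theorem, sum-loss row: W1 = O (I - lam Sigma^{-2})^{1/2} U_I^T, and W2 = W1^T.
D = np.diag(np.sqrt(1 - lam / S_I**2))
W1 = O @ D @ U_I.T
W2 = W1.T

# Gradients of L_sigma from the proof of the Transpose Theorem.
XXt = X @ X.T
g1 = 2 * W2.T @ (W2 @ W1 - np.eye(m)) @ XXt + 2 * lam * W1
g2 = 2 * (W2 @ W1 - np.eye(m)) @ XXt @ W1.T + 2 * lam * W2
print(np.allclose(g1, 0), np.allclose(g2, 0))   # True True
```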
**Proposition 4.3.** Let $D, S \in \mathbb{R}^{m \times m}$ be diagonal matrices such that $S$ is invertible and the diagonal of $D^2 S^{-2} D^2$ has distinct non-zero elements. Then the critical points of

$$\mathcal{L}'(Q_1, Q_2) = \mathrm{tr}\!\left( Q_2 Q_1 S^2 Q_1^\top Q_2^\top - 2 Q_2 Q_1 D^2 \right)$$

are smoothly parameterized as the disjoint union of $F_I$ over all $I \subseteq \{1, \ldots, m\}$ of size at most $k$. The diffeomorphism maps $G \in F_I$ to the critical point $(Q_1, Q_2)$ given by

$$Q_1 = G\, D_I S_I^{-1} I_I^\top, \qquad Q_2 = I_I D_I S_I^{-1} G^+,$$

where $D_I$ and $S_I$ are defined like $\Sigma_I$ and $I_I$ is defined like $U_I$.

*Proof of Theorem 4.2.* Given the singular value decomposition $X = U \Sigma V^\top$, let $Q_1 = W_1 U$ and $Q_2 = U^\top W_2$. By invariance of the Frobenius norm under the smooth action of the orthogonal group, we may instead parameterize the critical points of the following loss functions and then pull the result back to $W_1 = Q_1 U^\top$ and $W_2 = U Q_2$:

$$\mathcal{L}(Q_1, Q_2) = \|\Sigma - Q_2 Q_1 \Sigma\|_F^2$$
$$\mathcal{L}_\pi(Q_1, Q_2) = \mathcal{L}(Q_1, Q_2) + \lambda \|Q_2 Q_1\|_F^2$$
$$\mathcal{L}_\sigma(Q_1, Q_2) = \mathcal{L}(Q_1, Q_2) + \lambda \left( \|Q_1\|_F^2 + \|Q_2\|_F^2 \right)$$

$\mathcal{L}$ expands to $\mathrm{tr}\!\left( Q_2 Q_1 \Sigma^2 Q_1^\top Q_2^\top - 2 Q_2 Q_1 \Sigma^2 + \Sigma^2 \right)$. By Proposition 4.3 with $S = \Sigma$ and $D = \Sigma$, the critical points have the form

$$Q_1 = G I_I^\top, \qquad Q_2 = I_I G^+.$$

$\mathcal{L}_\pi$ expands to $\mathrm{tr}\!\left( Q_2 Q_1 (\Sigma^2 + \lambda I) Q_1^\top Q_2^\top - 2 Q_2 Q_1 \Sigma^2 + \Sigma^2 \right)$. By Proposition 4.3 with $S = (\Sigma^2 + \lambda I)^{\frac{1}{2}}$ and $D = \Sigma$, the critical points have the form

$$Q_1 = G (I_\ell + \lambda \Sigma_I^{-2})^{-\frac{1}{2}} I_I^\top, \qquad Q_2 = I_I (I_\ell + \lambda \Sigma_I^{-2})^{-\frac{1}{2}} G^+.$$

By Lemma A.1 with $A = Q_2$ and $B = Q_1$, $\mathcal{L}_\sigma$ expands to the sum of two functions:

$$\mathcal{L}_1(Q_1, Q_2) = \mathrm{tr}\!\left( Q_2 Q_1 \Sigma^2 Q_1^\top Q_2^\top - 2 Q_2 Q_1 (\Sigma^2 - \lambda I) + \Sigma^2 \right),$$
$$\mathcal{L}_2(Q_1, Q_2) = \lambda \|Q_1 - Q_2^\top\|_F^2.$$

So at a critical point, $\nabla \mathcal{L}_1(Q_1, Q_2) = -\nabla \mathcal{L}_2(Q_1, Q_2)$ and $Q_1 = Q_2^\top$ by the Transpose Theorem 2.1. The latter also implies $\nabla \mathcal{L}_2(Q_1, Q_2) = 0$. So the critical points of $\mathcal{L}_\sigma$ coincide with the critical points of $\mathcal{L}_1$ such that $Q_1 = Q_2^\top$. By Proposition 4.3 with $S = \Sigma$ and $D = (\Sigma^2 - \lambda I)^{\frac{1}{2}}$, these critical points have the form

$$Q_1 = O (I_\ell - \lambda \Sigma_I^{-2})^{\frac{1}{2}} I_I^\top, \qquad Q_2 = I_I (I_\ell - \lambda \Sigma_I^{-2})^{\frac{1}{2}} O^\top.$$

In particular, real solutions do not exist for $\sigma_i^2 < \lambda$. ∎

### 4.3. Local curvature

By Theorem 4.2, the critical landscape is a disjoint union of smooth manifolds, each at some height. We now prove that the Hessian is non-degenerate in the normal directions to the critical landscape. As discussed in Appendix E, such functions are called Morse–Bott and studied extensively in differential and algebraic topology.

**Theorem 4.4 (Curvature Theorem).** In local coordinates near any point on the critical manifold indexed by $I$, all three losses take the form of a standard degenerate saddle with $d_I + (k - \ell)(m - \ell)$ descending directions. $\mathcal{L}$ and $\mathcal{L}_\pi$ have $k\ell$ flat directions; $\mathcal{L}_\sigma$ has $k\ell - \binom{\ell+1}{2}$ flat directions. The remaining directions are ascending.

*Proof.* In addition to the descending directions of $\mathcal{L}_X$, there are $(k - \ell)(m - \ell)$ more that correspond to scaling one of the $k - \ell$ remaining slots in $G$ or $O$ toward one of the $m - \ell$ available principal directions. (In Figure 1c, these two gradient trajectories descend from the yellow saddle at the origin to the two red minima.) For $\mathcal{L}$ and $\mathcal{L}_\pi$, the ascending directions are the $k(m - k) - d_I$ ascending directions of $\mathcal{L}_X$; for $\mathcal{L}_\sigma$, an additional $\binom{\ell+1}{2}$ ascending directions preserve the reconstruction term while increasing the regularization term by decreasing orthogonality. The remaining (flat) directions are tangent to the critical manifold, itself diffeomorphic to the manifold of (orthonormal) $\ell$-frames in $\mathbb{R}^k$. ∎

## 5. Empirical Illustration

In this section, we illustrate the Landscape Theorem 4.2 by training an LAE on synthetic data and visualizing properties of the learned weight matrices. See Appendix D.1 for experiments on real data. The Curvature Theorem 4.4 implies that gradient descent and its extensions will reach a global minimum, regardless of the initialization, if trained for a sufficient number of epochs with a small enough learning rate (Zhu et al., 2018).

### 5.1. Synthetic data

In the following experiments, we set $k = \lambda = 10$ and fix a data set $X \in \mathbb{R}^{20 \times 20}$ with singular values $\sigma_i = i$ and random left and right singular vectors under Haar measure. We train the LAE for each loss using the Adam optimizer for 4000 epochs with random normal initialization, full batch, and learning rate 0.05.

*Figure 3. Distance between $W_1$ and $W_2^\top$ during training.*

Figure 3 tracks the squared distance between $W_1$ and $W_2^\top$ during training. Indeed, only the sum loss pulls the encoder and (transposed) decoder together, as claimed in the Transpose Theorem 2.1.
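A minimal NumPy version of this experiment (ours; plain gradient descent with our own step size rather than Adam) reproduces the qualitative behavior: the squared distance $\|W_1 - W_2^\top\|_F^2$ shrinks toward zero only for the sum loss.

```python
import numpy as np

rng = np.random.default_rng(0)
m = n = 20
k, lam, lr, epochs = 10, 10.0, 5e-4, 10_000

# Synthetic data: singular values sigma_i = i with Haar-random singular vectors.
Uh, _ = np.linalg.qr(rng.standard_normal((m, m)))
Vh, _ = np.linalg.qr(rng.standard_normal((n, n)))
X = Uh @ np.diag(np.arange(1, m + 1, dtype=float)) @ Vh.T
XXt = X @ X.T

def train(reg):
    """Full-batch gradient descent on an LAE; reg in {'none', 'product', 'sum'}."""
    W1 = 0.1 * rng.standard_normal((k, m))
    W2 = 0.1 * rng.standard_normal((m, k))
    for _ in range(epochs):
        E = (W2 @ W1 - np.eye(m)) @ XXt          # shared factor in both gradients
        g1, g2 = 2 * W2.T @ E, 2 * E @ W1.T
        if reg == 'product':
            g1 += 2 * lam * W2.T @ W2 @ W1
            g2 += 2 * lam * W2 @ W1 @ W1.T
        elif reg == 'sum':
            g1 += 2 * lam * W1
            g2 += 2 * lam * W2
        W1 -= lr * g1
        W2 -= lr * g2
    return np.linalg.norm(W1 - W2.T) ** 2

for reg in ['none', 'product', 'sum']:
    print(reg, train(reg))   # only 'sum' drives ||W1 - W2^T||_F^2 toward zero
```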
Let $W^*$ be the product $W_2 W_1$ after training and fix the singular value decompositions $X = U \Sigma V^\top$ and $W^* = U' \Sigma' V'^\top$.

*Figure 4 ((a) Unregularized, (b) Product): heat map comparing the singular vectors $U$ of $X$ with the singular vectors $U'$, $V'$ of $W^*$. Black and white correspond to $-1$ and $1$, respectively.*

For each loss, the heat map in Figure 4 of the quadrant matrix

$$\begin{pmatrix} U^\top U & U^\top V' \\ U'^\top U & U'^\top V' \end{pmatrix}$$

is consistent with $W^*$ approximating a global minimum defined by the Landscape Theorem. Namely, the lower right quadrant shows $U'^\top V'$ equal to the identity for each loss, and the upper right and lower left quadrants show $U^\top V'$ and $U'^\top U$ equal to the identity up to column sign for the product and sum losses, but not for the unregularized loss. That is, for the product and sum losses, the left singular vectors of $X$ are obtained as the right and left singular vectors of $W^*$.

The Landscape Theorem also gives explicit formulae for the eigenvalues of $W^*$ at convergence. Letting $\sigma_i^2$ and $\tau_i^2$ be the $i$th largest eigenvalues of $X X^\top$ and $W^*$, respectively, in Figure 5 we plot the points $(\sigma_i^2, \tau_i^2)$ for many values of $\lambda$. We superimpose a curve for each value of $\lambda$ defined by the theoretical relationship between $\sigma_i^2$ and $\tau_i^2$ in the Landscape Theorem. The (literal) alignment of theory and practice is visually perfect.

*Figure 5 ((a) Unregularized, (b) Product): relationship between the eigenvalues of the weight matrix ($\tau^2$) and of the data matrix ($\sigma^2$) for various values of $\lambda$. Points are empirical and lines are theoretical.*

## 6. Implications

### 6.1. PCA algorithms

The Landscape Theorem for $\mathcal{L}_\sigma$ implies that those principal directions of $X$ with eigenvalues greater than $\lambda$ coincide with the top left singular vectors of the trained decoder. Hence we can perform PCA by training a regularized LAE and computing the SVD of the decoder. For example, full-batch gradient descent corresponds to the following algorithm, which is guaranteed to converge to a (global) minimum for sufficiently small learning rate by the Curvature Theorem.

```
Algorithm 1: LAE-PCA
input:      X ∈ R^{m×n}; k ≤ m; λ, α > 0
initialize: W1, W2^⊤ ∈ R^{k×m}
while not converged:
    W1 ← W1 − α (W2^⊤ (W2 W1 − I) X X^⊤ + λ W1)
    W2 ← W2 − α ((W2 W1 − I) X X^⊤ W1^⊤ + λ W2)
U, Σ, _ ← SVD(W2)
return U, λ(I − Σ²)^{−1}
```

Note that the SVD step is trivial because the decoder is only $m \times k$ dimensional with $k \ll n$. Hence optimizing this formulation of PCA reduces to optimizing the training of a regularized LAE. We have so far explored several simple ideas for optimizing performance (NumPy implementations and benchmarks of these ideas are available on GitHub):

**First order.** The Transpose Theorem suggests tying $W_1 = W_2^\top$ a priori. In fact, for $\lambda = 0$ and $k = 1$, the gradient update for $\mathcal{L}_\sigma$ with tied weights is equivalent to Oja's rule for how neurons in the brain adapt synaptic strength (Oja, 1982). See Appendix A for a derivation.

**Second order.** The loss is convex in each parameter fixing the other, so $W_1$ and $W_2$ may be updated exactly per iteration by solving a system of $m$ or $k$ linear equations, respectively.

While many other methods have been proposed for recovering principal components with neural networks, they all require specialized algorithms for iteratively updating weights, similar to classical numerical SVD approaches (Warmuth & Kuzmin, 2007; Feng et al., 2013a;b). By contrast, reducing PCA to a small SVD problem by training a regularized LAE more closely parallels randomized SVD (Halko et al., 2011). We hope others will join us in investigating how far one can push the performance of LAE-PCA.
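For reference, a self-contained NumPy sketch of Algorithm 1 (ours, not the authors' benchmarked implementation; the step-size heuristic and convergence test are illustrative assumptions):

```python
import numpy as np

def lae_pca(X, k, lam, alpha=None, tol=1e-8, max_iter=100_000, seed=0):
    """PCA via a sum-regularized linear autoencoder (sketch of Algorithm 1, LAE-PCA).

    Returns (U, evals): principal directions with eigenvalue > lam as columns of U,
    and eigenvalue estimates lam / (1 - s^2) for the singular values s of W2.
    """
    m, n = X.shape
    rng = np.random.default_rng(seed)
    XXt = X @ X.T
    if alpha is None:
        # Heuristic step size: inverse of a rough curvature scale (an assumption of ours).
        alpha = 0.5 / (2 * np.linalg.norm(XXt, 2) + lam)
    W1 = 0.01 * rng.standard_normal((k, m))
    W2 = 0.01 * rng.standard_normal((m, k))
    for _ in range(max_iter):
        E = (W2 @ W1 - np.eye(m)) @ XXt
        g1 = W2.T @ E + lam * W1
        g2 = E @ W1.T + lam * W2
        W1 -= alpha * g1
        W2 -= alpha * g2
        if max(np.abs(g1).max(), np.abs(g2).max()) < tol:
            break
    U, s, _ = np.linalg.svd(W2, full_matrices=False)
    keep = s > 1e-6                    # only directions with sigma_i^2 > lam survive
    return U[:, keep], lam / (1 - s[keep] ** 2)

# Usage: compare against a direct eigendecomposition of X X^T.
rng = np.random.default_rng(1)
X = rng.standard_normal((8, 200))
U, evals = lae_pca(X, k=3, lam=5.0)
print(evals)                                        # top eigenvalues of X X^T above lam
print(np.sort(np.linalg.eigvalsh(X @ X.T))[::-1][:3])
```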
### 6.2. Neural alignment

In gradient descent via backpropagation, the weight matrix before a layer is updated by an error signal that is propagated by the transposed weight matrix after the layer. In the brain, there is no known physical mechanism to reuse feedforward connections for feedback or to enforce weight symmetry between distinct forward and backward connections. The latter issue is known as the weight transport problem (Grossberg, 1987) and is regarded as central to the implausibility of backpropagation in the brain. Recently, Lillicrap et al. (2016) showed that forward weights can sufficiently align to fixed, random feedback weights to support learning in shallow networks, but Bartunov et al. (2018) demonstrated that this feedback alignment and other biologically-plausible architectures break down for deep networks.

We are now investigating two approaches to weight transport inspired by the results herein. First, the Transpose Theorem 2.1 suggests that weight symmetry emerges dynamically when feedback weights are updated to optimize forward-backward reconstruction between consecutive layers with weight decay. This information alignment algorithm balances task prediction, information transmission, and energy efficiency. Second, Lemma A.1 expresses weight symmetry as balancing weight decay and self-amplification, lending greater biological plausibility to symmetric alignment. We have verified that both information and (not surprisingly) symmetric alignment indeed align weights and are competitive with backprop when trained on MNIST and CIFAR10. In parallel, Akrout et al. (2019) have independently shown that similar forms of local dynamic alignment scale to ImageNet.

### 6.3. Morse homology

In Appendix E, we expand from the algebraic topology of learning PCA to that of learning by gradient descent in general. For example, we explain how Morse homology provides a principled foundation for the empirical observation of low-lying valley passages between minima in Garipov et al. (2018). We are hopeful that this perspective will yield insights for efficient training and improved robustness and interpretation of consensus representations and ensemble predictions, for non-convex models arising in matrix factorization and deep learning.

## 7. Conclusion

In 1989, Baldi & Hornik (1989) characterized the loss landscape of an LAE. In 2018, Zhou & Liang (2018) characterized the loss landscape of an autoencoder with ReLU activations on a single hidden layer. This paper fills out and ties together the rich space of research on linear networks over the last forty years by deriving from first principles a unified characterization of the loss landscapes of LAEs with and without regularization, while introducing a rigorous topological lens. By considering a simple but fundamental model with respect to regularization, orthogonality, and Morse homology, this work also suggests new principles and algorithms for learning.

## Acknowledgements

We thank Mehrtash Babadi, Daniel Bear, Yoshua Bengio, Alex Bloemendal, Bram Gorissen, Matthew J. Johnson, Scott Linderman, Vardan Papyan, Elad Plaut, Tomaso Poggio, and Patrick Schultz for helpful discussions.

## References

Akrout, M., Wilson, C., Humphreys, P. C., Lillicrap, T., and Tweed, D. Deep Learning without Weight Transport. 2019.

Alain, G., Bengio, Y., and Rifai, S. Regularized Auto-Encoders Estimate Local Statistics. CoRR, abs/1211.4246, 2012.

Baldi, P. and Hornik, K. Neural Networks and Principal Component Analysis: Learning from Examples Without Local Minima. Neural Networks, 2(1):53–58, 1989.

Banyaga, A. and Hurtubise, D. Lectures on Morse Homology. Springer Netherlands, 2004.
Bartunov, S., Santoro, A., Richards, B. A., Marris, L., Hinton, G. E., and Lillicrap, T. P. Assessing the Scalability of Biologically-Motivated Deep Learning Algorithms and Architectures. In NIPS, pp. 9390–9400, 2018.

Bishop, C. M. Pattern Recognition and Machine Learning. Springer, 2006.

Bloom, J. M. The Local Structure of Smooth Maps of Manifolds, 2004. URL http://www.math.harvard.edu/theses/phd/bloom/ThesisXFinal.pdf.

Bourlard, H. and Kamp, Y. Auto-association by Multilayer Perceptrons and Singular Value Decomposition. Biological Cybernetics, 59:291–294, 1988.

Choromanska, A., Henaff, M., Mathieu, M., Ben Arous, G., and LeCun, Y. The loss surfaces of multilayer networks. Journal of Machine Learning Research, 38:192–204, 2015.

Duan, H. Morse functions and cohomology of homogeneous spaces. 2004.

Eckart, C. and Young, G. The Approximation of One Matrix by Another of Lower Rank. Psychometrika, 1(3):211–218, 1936.

Feng, J., Xu, H., Mannor, S., and Yan, S. Online PCA for Contaminated Data. In NIPS, 2013a.

Feng, J., Xu, H., and Yan, S. Online Robust PCA via Stochastic Optimization. In NIPS, 2013b.

Garipov, T., Izmailov, P., Podoprikhin, D., Vetrov, D., and Wilson, A. G. Loss Surfaces, Mode Connectivity, and Fast Ensembling of DNNs. In NIPS, 2018.

Goodfellow, I., Bengio, Y., and Courville, A. Deep Learning. MIT Press, 2016. http://www.deeplearningbook.org.

Grossberg, S. Competitive learning: From interactive activation to adaptive resonance. Cognitive Science, 11(1):23–63, 1987.

Halko, N., Martinsson, P. G., and Tropp, J. A. Finding Structure with Randomness: Probabilistic Algorithms for Constructing Approximate Matrix Decompositions. SIAM Review, 53(2):217–288, May 2011.

Hansen, M. C. Morse Theory on Complex Grassmannians. Master's thesis, University of Copenhagen, 2012.

Hatcher, A. Algebraic Topology. Cambridge University Press, 2002.

Izenman, A. Reduced-rank regression for the multivariate linear model. Journal of Multivariate Analysis, 5:248–264, 1975.

Josse, J. and Wager, S. Bootstrap-Based Regularization for Low-Rank Matrix Estimation. Journal of Machine Learning Research, 17(124):1–29, 2016.

LeCun, Y. and Cortes, C. The MNIST Database of handwritten digits. http://yann.lecun.com/exdb/mnist/.

Lillicrap, T. P., Cownden, D., Tweed, D. B., and Akerman, C. J. Random synaptic feedback weights support error backpropagation for deep learning. Nature Communications, 7, 2016.

Mehta, D., Chen, T., Tang, T., and Hauenstein, J. D. The loss surface of deep linear networks viewed through the algebraic geometry lens. CoRR, abs/1810.07716, 2018.

Mianjy, P., Arora, R., and Vidal, R. On the Implicit Bias of Dropout. In ICML, 2018.

Milnor, J. Morse Theory. Princeton University Press, 1963.

Mukherjee, A. and Zhu, J. Reduced Rank Ridge Regression and Its Kernel Extensions. Statistical Analysis and Data Mining, 4(6):612–622, 2011.

Oja, E. Simplified neuron model as a principal component analyzer. Journal of Mathematical Biology, 15(3):267–273, 1982.

Plaut, E. From Principal Subspaces to Principal Components with Linear Autoencoders. ArXiv e-prints, 2018.

Pretorius, A., Kroon, S., and Kamper, H. Learning Dynamics of Linear Denoising Autoencoders. In ICML, 2018.

Rifai, S., Vincent, P., Muller, X., Glorot, X., and Bengio, Y. Contractive Auto-Encoders: Explicit Invariance During Feature Extraction. In ICML, 2011.

Solgun, M. Perfect Morse Function on SO(n). European Journal of Pure and Applied Mathematics, 9(3):314–321, 2016.

Tipping, M. E. and Bishop, C. M. Probabilistic Principal Component Analysis. Journal of the Royal Statistical Society, 61(3):611–622, 1999.
van den Bos, A. Parameter Estimation for Scientists and Engineers, chapter Appendix C. John Wiley & Sons, 2007.

Vincent, P., Larochelle, H., Lajoie, I., Bengio, Y., and Manzagol, P.-A. Stacked Denoising Autoencoders: Learning Useful Representations in a Deep Network with a Local Denoising Criterion. Journal of Machine Learning Research, 11:3371–3408, 2010.

Warmuth, M. K. and Kuzmin, D. Randomized Online PCA Algorithms with Regret Bounds that are Logarithmic in the Dimension. 2007.

Zhou, Y. and Liang, Y. Critical Points of Linear Neural Networks: Analytical Forms and Landscape Properties. In ICLR, 2018.

Zhu, Z., Li, Q., Tang, G., and Wakin, M. B. Global Optimality in Low-Rank Matrix Optimization. IEEE Transactions on Signal Processing, 66(13):3614–3628, 2018.