Kernel Similarity Matching with Hebbian Networks

Kyle Luther
Princeton Neuroscience Institute
kluther@princeton.edu

H. Sebastian Seung
Princeton Neuroscience Institute
Princeton Computer Science Department
sseung@princeton.edu

Abstract

Recent works have derived neural networks with online correlation-based learning rules to perform kernel similarity matching. These works applied existing linear similarity matching algorithms to nonlinear features generated with random Fourier methods. In this paper we attempt to perform kernel similarity matching by directly learning the nonlinear features. Our algorithm proceeds by deriving and then minimizing an upper bound for the sum of squared errors between output and input kernel similarities. The construction of our upper bound leads to online correlation-based learning rules which can be implemented with a one-layer recurrent neural network. In addition to generating high-dimensional linearly separable representations, we show that our upper bound naturally yields representations which are sparse and selective for specific input patterns. We compare the approximation quality of our method to the neural random Fourier method and to variants of the popular but non-biological Nyström method for approximating the kernel matrix. Our method appears to be comparable to or better than randomly sampled Nyström methods when the outputs are relatively low dimensional (although still potentially higher dimensional than the inputs) but less faithful when the outputs are very high dimensional.

1 Introduction

Brain-inspired learning algorithms have a long history in the field of neural networks and machine learning [Rosenblatt, 1958, Olshausen and Field, 1996, Lee and Seung, 1999, Riesenhuber and Poggio, 1999, Hinton, 2007, Lillicrap et al., 2016]. While many algorithms have diverged from their biological roots, the motivation to study biology remains clear: the human brain is such a powerful learning agent that there must be insights to be gained by making our algorithms look "brain-like".

This paper is focused on merging biological constraints with the well-established field of kernel-based machine learning. A common assumption in brain-inspired models of learning is that synaptic update rules should be (a) online, meaning the algorithm only has access to a single input pattern at a time, and (b) local, meaning synapses should only be modified using information immediately available to the synapse, often just the pre- and post-firing rates of the neurons to which it is connected. Learning rules with these properties are commonly referred to as Hebbian learning rules [Chklovskii, 2016].

Recent works have devised neural networks with Hebbian learning rules that perform linear similarity matching. These networks map every input $x^t$ to a representation $y^t$ such that linear output similarities match linear input similarities, $y^s \cdot y^t \approx x^s \cdot x^t$. These networks are interesting as models for real brains because they display a number of interesting biological properties: they are recurrent networks with correlation-based learning rules [Pehlevan et al., 2018] and can be modified to include non-negativity [Pehlevan and Chklovskii, 2014], sparsity, and convolutional structure [Obeid et al., 2019]. However, there is a problem if one believes these networks should ultimately generate representations which are useful for downstream tasks.
If similarities are actually matched, that is if $y^s \cdot y^t = x^s \cdot x^t$, then the outputs are simply an orthogonal transformation of the inputs, $y^t = Q x^t$, which is unlikely to have significant impact on downstream tasks. Bahroun et al. [2017] identified this problem and proposed a solution: instead of matching linear input similarities, one can match nonlinear input similarities, $y^s \cdot y^t \approx f(x^s, x^t)$. The authors provided a method that can be applied to any shift-invariant kernel. They applied the random Fourier feature method of Rahimi et al. [2007] to map inputs to nonlinear feature vectors $x \mapsto \psi(x)$ and then applied the linear similarity matching framework of Pehlevan et al. [2018] to these nonlinear features.

In this paper, we tackle the same neural kernel similarity matching problem with a different approach. Instead of using random nonlinear features, we directly optimize for the features with Hebbian learning rules that resemble the learning rules derived in the original works on linear similarity matching. To derive our learning rules, we show that for any kernel we can upper bound the sum of squared errors $|y^s \cdot y^t - f(x^s, x^t)|^2$ with a correlation-based energy. Gradient-based optimization of our upper bound will lead to a neural network with correlation-based learning rules.

2 Correlation-based bound for kernel similarity matching

Roadmap for this section. We first define the kernel similarity matching problem (Eq. 1). We then derive a correlation-based optimization (Eq. 6) which is an upper bound to Eq. 1 (up to a constant that does not depend on the representations). We then use a Legendre transform to derive an equivalent (except for the numerical stability parameter $\lambda$) optimization problem in Eq. 9 that lends itself to online updates.

Kernel similarity matching. Assume we are given a set of input vectors $\{x^t \in \mathbb{R}^M\}_{t=1}^T$ and a positive semi-definite kernel function $f : \mathbb{R}^M \times \mathbb{R}^M \to \mathbb{R}$ which defines the similarity between input vectors. The goal is to find a corresponding set of representations $\{y^t \in \mathbb{R}^N\}_{t=1}^T$ such that for all pairs $(s, t)$ we have $y^s \cdot y^t \approx f(x^s, x^t)$. We will assume that $T \gg N > M$: the dimensionalities of $x$ and $y$ are much lower than the number of samples $T$, but the $y$ are still higher dimensional than the inputs. This is formalized by minimizing the sum of squared errors:

$$\min_{\{y^t\}} \; \frac{1}{T^2} \sum_{s,t=1}^{T} \big( f(x^s, x^t) - y^s \cdot y^t \big)^2 \qquad (1)$$

This is known as the classical multidimensional scaling objective [Borg and Groenen, 2005]. For arbitrary nonlinearity $f$ this can be solved exactly by finding the top $N$ eigenvectors of the $T \times T$ input similarity matrix [Borg and Groenen, 2005], and is therefore closely related to kernel PCA [Schölkopf et al., 1997]. However, this requires computing and storing similarities for all pairs of input vectors, which breaks the online constraint that we require for biological realism. The purpose of this paper is to find an online algorithm, with correlation-based computations, that can at least approximately minimize Eq. 1.
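For reference, the following is a minimal, non-online sketch of this exact eigendecomposition solution (our own function and variable names; the Gaussian kernel is included only as an example of a `kernel_fn`, and this is not the paper's implementation of its "kernel PCA" baseline).

```python
import numpy as np

def exact_similarity_matching(X, kernel_fn, N):
    """Offline solution of Eq. 1: top-N eigendecomposition of the T x T kernel matrix.

    X         : (T, M) array of inputs
    kernel_fn : callable mapping (X, X) -> (T, T) kernel matrix
    N         : output dimensionality
    Returns Y : (T, N) array whose rows y^t satisfy Y @ Y.T ~ kernel matrix (best rank-N fit).
    """
    F = kernel_fn(X, X)                      # (T, T) input similarity matrix
    evals, evecs = np.linalg.eigh(F)         # eigenvalues in ascending order
    idx = np.argsort(evals)[::-1][:N]        # take the top-N eigenpairs
    lam = np.clip(evals[idx], 0.0, None)     # PSD kernel: clip tiny negative eigenvalues
    return evecs[:, idx] * np.sqrt(lam)

def gaussian_kernel(A, B, sigma=0.3):
    """Example kernel: f(u, v) = exp(-||u - v||^2 / (2 sigma^2))."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma ** 2))

X = np.random.randn(200, 2)                  # placeholder inputs
Y = exact_similarity_matching(X, gaussian_kernel, N=16)
```

Note that this baseline needs the full kernel matrix in memory, which is exactly what the online algorithm derived below avoids.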
Correlation-based upper bound. In this section we provide an upper bound to Eq. 1 which does not require computing $f(x^s, x^t)$ for any pair $(s, t)$. The first step is to expand the square in Eq. (1) to yield:

$$\frac{1}{T^2} \sum_{s,t} \big( f(x^s, x^t) - y^s \cdot y^t \big)^2 = -\frac{2}{T^2} \sum_{s,t} f(x^s, x^t) \, y^s \cdot y^t + \frac{1}{T^2} \sum_{s,t} (y^s \cdot y^t)^2 + \text{const} \qquad (2)$$

We will now show how to bound the first term on the right hand side.

Theorem 1. If $f$ is a positive semidefinite kernel function, then

$$\frac{1}{2T^2} \sum_{s,t} y^s y^t f(x^s, x^t) \;\geq\; \frac{1}{T} \sum_t q\, y^t f(x^t, w) - \frac{1}{2} q^2 f(w, w) \qquad (3)$$

for all $q$ and $w$.

Proof. Because $f$ is a positive semi-definite kernel, we can assign to any set of $M$-dimensional vectors $\{w, x^1, \ldots, x^T\}$ a corresponding set of (at most) $(T+1)$-dimensional vectors $\{\phi_w, \phi_1, \ldots, \phi_T\}$ whose inner products yield the similarity defined by $f$:

$$\phi_t \cdot \phi_{t'} = f(x^t, x^{t'}) \qquad \phi_t \cdot \phi_w = f(x^t, w) \qquad \phi_w \cdot \phi_w = f(w, w) \qquad (4)$$

Now consider the vector difference $\frac{1}{T} \sum_t y^t \phi_t - q \phi_w$. The squared norm of this difference is of course non-negative. Additionally, we can expand out this square:

$$0 \;\leq\; \frac{1}{2} \Big\| \frac{1}{T} \sum_t y^t \phi_t - q \phi_w \Big\|^2 = \frac{1}{2T^2} \sum_{s,t} y^s y^t \, \phi_s \cdot \phi_t - \frac{1}{T} \sum_t q\, y^t \, \phi_t \cdot \phi_w + \frac{1}{2} q^2 \, \phi_w \cdot \phi_w \qquad (5)$$

At this point we can simply replace all dot products with the equivalent nonlinear similarities $f(\cdot, \cdot)$ from Equation 4 and rearrange the terms to yield our key inequality (Eq. 3).

Our inequality (Eq. 3) still holds if we maximize the right hand side with respect to $q$ and $w$. For every index $i$ of $y$, we find the optimal $w_i, q_i$, and then replace the first pairwise sum in Eq. (2) with our upper bounds. Additionally, we rearrange the order of the summations in the second term on the right hand side of Eq. (2) to yield the following upper bound for the $y$-dependent terms in Eq. (2):

$$\min_{\{y^t\}} \; \min_{\{q_i, w_i\}} \; \frac{1}{T} \sum_{t=1}^{T} \sum_{i=1}^{N} \Big[ -q_i y_i^t f(w_i, x^t) + \frac{1}{2} q_i^2 f(w_i, w_i) \Big] + \frac{1}{2} \sum_{i,j} \Big( \frac{1}{T} \sum_{t=1}^{T} y_i^t y_j^t \Big)^2 \qquad (6)$$

Online-focused reformulation. We can further remove the square of the correlation matrix $C_{ij} = \frac{1}{T} \sum_t y_i^t y_j^t$ (another impediment to online learning) by introducing a Legendre transformation, $\frac{1}{2} C_{ij}^2 = \max_{L_{ij}} \big[ C_{ij} L_{ij} - \frac{1}{2} L_{ij}^2 \big]$, which yields:

$$\min_{W, Y, q} \; \max_{L} \; \frac{1}{T} \sum_{t=1}^{T} \Big[ \sum_{i=1}^{N} \Big( -q_i y_i^t f(w_i, x^t) + \frac{1}{2} q_i^2 f(w_i, w_i) \Big) + \frac{1}{2} \sum_{i,j} L_{ij} y_i^t y_j^t \Big] - \frac{1}{4} \sum_{i,j} L_{ij}^2 \qquad (7)$$

We can swap the order of the $y$ and $L$ optimizations, because the objective obeys the strong min-max property with $W, q$ fixed (Appendix Section A of Pehlevan et al. [2018]). We add one final term $\frac{\lambda}{NT} \sum_{t=1}^{T} \sum_{i=1}^{N} (y_i^t)^2$ to the objective, which can be important for the numerical stability of our resulting algorithm. In our experiments $\lambda = 0.001$. Finally, to better motivate our online algorithm, we define the per-sample energy:

$$e(y^t, x^t; W, q, L) := \sum_{i=1}^{N} \Big[ -q_i y_i^t f(w_i, x^t) + \frac{1}{2} q_i^2 f(w_i, w_i) \Big] + \frac{1}{2} \sum_{i,j} L_{ij} y_i^t y_j^t - \frac{1}{4} \sum_{i,j} L_{ij}^2 + \frac{\lambda}{N} \sum_{i=1}^{N} (y_i^t)^2 \qquad (8)$$

where $e^t := e(y^t, x^t; W, q, L)$. The final optimization we will perform, which is equivalent to the optimization in Eq. 6 and is derived as an upper bound to Eq. 1, is thus:

$$\min_{W, q} \; \max_{L} \; \min_{Y} \; \frac{1}{T} \sum_{t=1}^{T} e(y^t, x^t; W, q, L) \qquad (9)$$

3 Neural network optimization

Applying a stochastic gradient descent-ascent algorithm to Eq. (9) yields a neural network (Fig. 1) in which $y_i^t$ is the response of neuron $i$ to input pattern $t$, $w_i$ is the vector of incoming connections to neuron $i$ from the input layer, $q_i$ is a term which modulates the strength of these incoming connections, and $L_{ij}$ is a matrix of lateral recurrent connections between outputs.

Figure 1: Neural network implementation of the optimization in Eq. 9. (a) network architecture, (b) recurrent network dynamics, (c) steady-state network response, (d) Hebbian update rules for the special case of Gaussian kernels (the precise form of these updates will depend on the kernel).

Specifically, for the neural algorithm we initialize $W_{ia} \sim \mathcal{N}(0, 1)$, $q_i = 1$, and $L_{ij} = I_{ij}$. At each iteration we sample a minibatch of inputs $\{x^b\}_{b=1}^B$. Using Eq. 12 we compute the $\{y^b\}$ which minimize Eq. 9 for fixed synapses. Using these optimal $\{y^b\}$, we compute the minibatch energy $\bar{e} = \frac{1}{B} \sum_b e(y^b, x^b; W, q, L)$ and take a gradient descent step for $q$, a rescaled gradient descent step for $w$, and a gradient ascent step for $L$:

$$w_i \leftarrow w_i - \eta_w \frac{\partial \bar{e}}{\partial w_i} \qquad q_i \leftarrow q_i - \eta_q \frac{\partial \bar{e}}{\partial q_i} \qquad L_{ij} \leftarrow L_{ij} + \eta_l \frac{\partial \bar{e}}{\partial L_{ij}} \qquad (10)$$
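The main portion of the training code (in PyTorch) is included in the appendix. As a self-contained illustration, the following is our own sketch of one minibatch descent-ascent step, specialized to a Gaussian kernel; the learning rates, the $1/|q_i|$ rescaling of the $w$ update, and the constant factors in the energy are illustrative assumptions rather than the paper's exact values.

```python
import torch

def gaussian_kernel(A, B, sigma=0.3):
    """f(a, b) = exp(-||a - b||^2 / (2 sigma^2)) for all pairs of rows of A and B."""
    return torch.exp(-torch.cdist(A, B) ** 2 / (2 * sigma ** 2))

def train_step(X, W, q, L, lam=1e-3, lr_w=1e-2, lr_q=1e-2, lr_l=1e-1):
    """One minibatch step of the gradient descent-ascent of Eq. 10 (illustrative sketch).

    X : (B, M) minibatch; W : (N, M), q : (N,), L : (N, N) are leaf tensors with
    requires_grad=True. L is assumed symmetric (it stays symmetric under these updates).
    """
    N = W.shape[0]
    Fx = gaussian_kernel(X, W)                       # (B, N) feedforward drive f(w_i, x^b)
    fww = torch.ones(N)                              # f(w_i, w_i) = 1 for the Gaussian kernel
    with torch.no_grad():                            # inner loop: optimal responses via Eq. 12
        Y = (q * Fx) @ torch.linalg.inv(L + lam * torch.eye(N))
    # Minibatch energy: average of the per-sample energy of Eq. 8 (constant factors illustrative).
    e = (-(q * Y * Fx).sum(1)
         + 0.5 * (q ** 2 * fww).sum()
         + 0.5 * ((Y @ L) * Y).sum(1)
         + lam * (Y ** 2).sum(1) / N).mean() - 0.25 * (L ** 2).sum()
    gW, gq, gL = torch.autograd.grad(e, (W, q, L))
    with torch.no_grad():
        W -= lr_w * gW / q.abs().clamp_min(1e-3)[:, None]   # rescaled descent step for W
        q -= lr_q * gq                                      # descent step for q
        L += lr_l * gL                                      # ascent step for L (largest rate)
    return float(e)

# Initialization as described in the text: W_ia ~ N(0, 1), q_i = 1, L = I.
N, M = 16, 2
W = torch.randn(N, M, requires_grad=True)
q = torch.ones(N, requires_grad=True)
L = torch.eye(N, requires_grad=True)
for _ in range(1000):
    X = torch.rand(64, M)        # placeholder minibatch; real data goes here
    train_step(X, W, q, L)
```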
Convergence of the neural algorithm. We treat the convergence of this gradient descent-ascent algorithm as an empirical issue. We adopt the two-time-scale strategy that has shown empirical successes for training generative adversarial networks [Heusel et al., 2017]. We choose the learning rates such that $\eta_q, \eta_w \ll \eta_l$. Intuitively, when choosing $\eta_l$ to be large, the $L_{ij}$ can approximately maximize Eq. 9 for any particular $q, W$, so that the min-max ordering is roughly preserved. In practice this ratio is important for convergence. We do not observe convergence when the ratios $\eta_w / \eta_l$ or $\eta_q / \eta_l$ are large. Unfortunately, it is an empirical question of what is "too large". If we could show that the objective were concave in $L$, then gradient descent-ascent with smaller learning rates for $W, q$ would indeed converge to a saddle point [Lin et al., 2020, Seung, 2019]. However, this question will have to be left for future work. Empirically, it is sometimes observed that $q_i$ quickly shrinks to a small value early in training, which subsequently leads to small gradients for $w$. The rescaling of the $w_i$ updates provides an adaptive learning rate that appeared to improve training times in practice. We have attached the main portion of the training code, written using PyTorch, in the appendix.

3.1 Network dynamics

Assuming fixed parameters $q, W, L$, the gradient for $y$ can be computed for any input pattern $x$. Gradient descent can be used to perform the inner loop minimization in Equation 9:

$$\dot{y}_i \propto -\frac{\partial e}{\partial y_i} = q_i f(w_i, x) - \sum_{j=1}^{N} L_{ij} y_j - \lambda y_i \qquad (11)$$

Like previous works on linear similarity matching, these dynamics can be interpreted as the dynamics of a one-layer recurrent neural network with all-to-all inhibition $\sum_{j=1}^{N} L_{ij} y_j$ between units. A diagram of this network is shown in Figure 1. Note that we can analytically perform the inner loop minimization with a non-neural algorithm:

$$y_i = \sum_j [L + \lambda I]^{-1}_{ij} \, q_j f(w_j, x) \qquad (12)$$

This is useful both conceptually and for speeding up the training process in our experiments. This formula shows us that $y$ is a linear function of the nonlinear feedforward input $f(w_i, x)$. This is different from Seung and Zung [2017], Pehlevan and Chklovskii [2014], where the neurons are nonlinear functions (due to non-negativity constraints) of linear feedforward input $w_i \cdot x$.
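A minimal sketch contrasting the two ways of computing the response described above: iterating the recurrent dynamics of Eq. 11 versus the direct solve of Eq. 12. The Euler step size and iteration count are our illustrative choices; the two agree at the fixed point whenever $L + \lambda I$ is positive definite.

```python
import numpy as np

def response_recurrent(f_wx, q, L, lam=1e-3, dt=0.1, n_steps=200):
    """Run the recurrent dynamics of Eq. 11 to (approximate) steady state.

    f_wx : (N,) feedforward drive f(w_i, x) for one input x
    q    : (N,) gains; L : (N, N) lateral inhibition
    """
    y = np.zeros_like(f_wx)
    for _ in range(n_steps):
        y += dt * (q * f_wx - L @ y - lam * y)    # Euler step on Eq. 11
    return y

def response_closed_form(f_wx, q, L, lam=1e-3):
    """Steady-state response via the matrix solve of Eq. 12."""
    N = L.shape[0]
    return np.linalg.solve(L + lam * np.eye(N), q * f_wx)
```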
3.2 Synaptic learning rules: arbitrary kernel

In the previous section we saw how $W$ could be interpreted as feedforward synapses, $q$ as feedforward regulatory terms, and $L$ as inhibitory synapses. Gradient descent on $W, q$ and gradient ascent on $L$ provides an algorithm for performing the optimization in Equation 9. At each step, we compute the optimal $y$. For simplicity, we consider the case with a single input, in which case we drop the index $b$ on $x^b, y^b$. The stochastic gradients for $W$ lead to the update:

$$\Delta w_i \;\propto\; y_i \, \frac{\partial f(w_i, x)}{\partial w_i} - q_i \, \frac{\partial f(w_i, w_i)}{\partial w_i} \qquad (13)$$

Classically, Hebbian rules have been defined so that the update is linear in the input $x$ (although they can be nonlinear in the output $y$, which is a function of $x$); see Eq. 1 of Brito and Gerstner [2016]. This rule is more general, as it is a nonlinear function of the input vector, $\partial f(w_i, x) / \partial w_i = h(x, w_i)$. However, we note that the spirit of Hebb is still here, as this is an online, local, correlation-based learning rule.

The regulatory terms (essentially controlling the strength of the feedforward input) can be updated with:

$$\Delta q_i \;\propto\; y_i f(w_i, x) - f(w_i, w_i)\, q_i \qquad (14)$$

Here we have the correlation between the feedforward input and the neuron's response. Finally, the gradients for the inhibitory synapses are:

$$\Delta L_{ij} \;\propto\; y_i y_j - L_{ij} \qquad (15)$$

This is exactly the same anti-Hebbian update seen in previous linear similarity matching works [Pehlevan et al., 2018]. The inhibition grows in strength as the correlation between neurons grows.

3.3 Synaptic learning rules: radial basis function kernel

Before moving on, we'll consider the form of the update rules in Eqs. 13, 14 when the kernel is a radial basis function, i.e. when the kernel is a function of the Euclidean distance. For simplicity we'll also assume the kernel is normalized so that $f(v, v) = 1$:

$$f(u, v) := g(\|u - v\|) \quad \text{and} \quad g(0) = 1 \qquad (16)$$

In this case we get the gradient updates for $w_i, q_i$:

$$\Delta w_i \;\propto\; [y_i g'_i]\, x - [y_i g'_i]\, w_i \qquad \Delta q_i \;\propto\; y_i g_i - q_i \qquad (17)$$

where $g_i := g(\|x - w_i\|)$ and $g'_i$ is a scalar derived from the derivative of $g$ at $\|x - w_i\|$. The update for $w_i$ is proportional to the input $x$, but modulated by the output response ($y_i$) and a function of the feedforward input ($g'_i$). The updates for $L_{ij}$ do not depend on the form of the kernel.

4 Experiments

We train networks using a Gaussian kernel for the half moons dataset and a power-cosine kernel (defined in Section 4.2) for the MNIST dataset. We compare the approximation error (Eq. 1) of our method to the approximation error given by (a) the optimal eigenvector-based solution (which we label as kernel PCA), (b) the Nyström approximation with uniformly sampled landmarks, (c) the Nyström approximation using KMeans to generate the landmarks, (d) the Nyström approximation using our learned features ($w_i$) as the landmarks, and (e) the random Fourier feature method (this method is not applicable to the cosine-based kernel we use for the MNIST dataset). The dimensionality refers to the number of components for the PCA method, the number of landmarks for the Nyström methods, and the number of Fourier features for the Fourier method. See the appendix for more details regarding each of these five methods. Method (e) is the only other explicitly neural method.

4.1 Half Moons Dataset

We train our algorithm on a simple half moons dataset (which can be generated with scikit-learn [Pedregosa et al., 2011]), shown in Figure 2. It consists of 1600 input vectors $x = [x_1, x_2]$ drawn from a distribution of two noisy interleaving half circles. We use a Gaussian kernel with $\sigma = 0.3$ to measure input similarities: $f(u, v) = e^{-\|u - v\|^2 / 2\sigma^2}$. We vary the number of neurons $n \in \{2, 4, 8, 16, 32, 64\}$. See the appendix for training details.

Figure 2: Overview of our Hebbian radial basis function network on the half moons dataset. (a) dataset, (b) features $\{w_i\}_{i=1}^{16}$, (c, d, e) response profiles of 3 neurons.

Emergence of sparse, template-tuned neurons. In Fig. 2 we show the learned features $\{w_i\}$ when we train our algorithm with 16 neurons. We observe that the features appear to tile the input space. We also show the tuning properties of 3 of the output neurons over the dataset. To generate these figures, we color code each sample in the dataset with the response of neuron $y_i$: gray indicates zero response, red indicates a positive response, and blue indicates a negative response. We observe that neurons appear to respond with large positive values centered around a small localized region of the input dataset. The features closely resemble the cluster centers returned by KMeans.
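A minimal sketch of the resulting online algorithm for the Gaussian kernel on half-moons data (Eqs. 12, 14, 15, 17). The learning rates, the noise level, and the constant factor $g'_i = g_i / \sigma^2$ (which follows from the chain rule for the Gaussian kernel) are our illustrative choices; the actual experiments use minibatches and the training procedure of Section 3.

```python
import numpy as np
from sklearn.datasets import make_moons

def gaussian_drive(W, x, sigma=0.3):
    """g_i = f(w_i, x) = exp(-||x - w_i||^2 / (2 sigma^2)) for all neurons i."""
    return np.exp(-((W - x) ** 2).sum(1) / (2 * sigma ** 2))

def online_step(x, W, q, L, sigma=0.3, lam=1e-3, lr_w=0.05, lr_q=0.05, lr_l=0.5):
    """One online update with the Gaussian kernel (constant factors illustrative)."""
    g = gaussian_drive(W, x, sigma)                         # (N,) feedforward input
    y = np.linalg.solve(L + lam * np.eye(len(q)), q * g)    # Eq. 12: network response
    W += lr_w * (y * g)[:, None] * (x - W) / sigma ** 2     # Eq. 17: move w_i toward x
    q += lr_q * (y * g - q)                                 # Eq. 14 with f(w_i, w_i) = 1
    L += lr_l * (np.outer(y, y) - L)                        # Eq. 15: anti-Hebbian inhibition
    return y

# Half-moons data (Pedregosa et al., 2011) and a small network; noise level is an assumption.
X, _ = make_moons(n_samples=1600, noise=0.1)
N, M = 16, 2
rng = np.random.default_rng(0)
W, q, L = rng.normal(size=(N, M)), np.ones(N), np.eye(N)
for x in X:
    online_step(x, W, q, L)
```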
Kernel approximation error. In panel (a) of Fig. 3 we show the learned features for $n = \{4, 16, 64\}$. In panel (b) we plot the input similarities vs. the output similarities generated by our neural algorithm with 16-dimensional outputs. In panel (c), we plot the normalized mean squared error for our method compared to random Fourier features [Rahimi et al., 2007], orthogonal Fourier features [Yu et al., 2016], Nyström methods, and kernel PCA. The neural method of Bahroun et al. [2017] gives equivalent results to the random Fourier method. Kernel PCA is optimal, but non-neural. We observe that for small dimensionality ($n \le 16$) our method actually seems to marginally outperform the Nyström+KMeans method, which outperforms the Nyström method with randomly sampled landmarks. Additionally, using the Nyström approximation with our features seems to be uniformly better than the representations we generate with the neural net. Essentially, our algorithm learns useful landmarks, but for the most faithful representation it is better to just throw away the neural responses and simply use the Nyström approximation with our landmarks. It is worth mentioning that as the dimensionality increases further, the random Fourier method ultimately does converge to zero error, unlike our method.

Figure 3: Approximation error vs. dimensionality for the half moons dataset with the Gaussian kernel. (a) learned features for n = 4, 16, 64, (b) input-output similarities for 16-dimensional networks, (c) normalized root-mean-square error between the true kernel matrix and various approximation methods. The neural method (dashed blue) that we derive in Section 3 performs well for $n \le 16$, but the approximation actually gets worse as we increase the dimensionality.

Utility of representations evaluated by KMeans clustering. In Fig. 4 we visualize the principal components of the inputs $x$ and the 16-dimensional representations $y$. Of course, the principal components of $x$ are not too interesting; they are just a reflected version of the original 2D dataset. The top two components of $y$ are more linearly separable than the inputs, and this indicates that a strong nonlinear transformation has been applied to the inputs. Additionally, we run KMeans on $x$ and on $y$ (we use the implementation of scikit-learn, and take the lowest-energy solution using 100 inits). We observe that the clustering yields nearly perfect labels when performed on $y$. The kernel similarity matching vectors appear to be better suited for downstream learning tasks than the original inputs.

Figure 4: Utility of kernel similarity matching for downstream tasks. (a) principal components of the input vectors x, (b) principal components of the 16-dimensional neural representations y, (c) clustering generated by KMeans on x, (d) clustering generated by KMeans on y. For (a, b) the colors are given by ground-truth labels while in (c, d) the colors are given by the KMeans clustering.

4.2 MNIST Dataset

We train our algorithm on the MNIST handwritten digits dataset [LeCun, 1998]. The dataset consists of 70,000 images of 28x28 handwritten digits, which we cropped by 4 pixels on each side to yield 20x20 images (which become 400-dimensional inputs). We use kernels of the form $f(u, v) = \|u\| \, \|v\| \, (\hat{u} \cdot \hat{v})^\alpha$ and a varying number of neurons; the linear kernel is recovered by setting $\alpha = 1$. Training details are provided in the appendix.

Figure 5: Approximation error vs. dimensionality for the MNIST dataset. (a) $f(u, v) = u \cdot v$ (linear kernel), (b) $f(u, v) = \|u\| \|v\| (\hat{u} \cdot \hat{v})^3$ (a nonlinear kernel). For the linear kernel all methods give relatively small approximation error once n > 100, although yet again the neural method does not continue to decrease as the dimensionality increases beyond 200, even in the linear setting.
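A minimal sketch of this kernel family (our own helper name), checking that $\alpha = 1$ recovers the linear kernel and that the kernel is homogeneous of degree one in each argument, which underlies the contrast-invariance property discussed next.

```python
import numpy as np

def power_cosine_kernel(u, v, alpha=3):
    """f(u, v) = ||u|| ||v|| (u_hat . v_hat)^alpha  (Section 4.2)."""
    nu, nv = np.linalg.norm(u), np.linalg.norm(v)
    return nu * nv * (u @ v / (nu * nv)) ** alpha

u, v = np.random.randn(400), np.random.randn(400)
assert np.isclose(power_cosine_kernel(u, v, alpha=1), u @ v)                       # alpha = 1: linear kernel
assert np.isclose(power_cosine_kernel(2 * u, v), 2 * power_cosine_kernel(u, v))    # linear scaling in u
```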
We are not aware of other works using this exact power-cosine kernel before; however, it is motivated by the arc-cosine kernel studied in the context of wide random ReLU networks [Cho and Saul, 2009]. An important property of our kernel network is the linear input-output scaling, meaning that rescaling an input $x \to a x$ will cause the corresponding representation to also be rescaled by the same factor, $y \to a y$. This will allow our nonlinear networks to have the same contrast-invariant tuning properties that are thought to be displayed by simple cells in cat visual cortex [Skottun et al., 1987].

Approximation error. We display the normalized approximation errors for $\alpha = 1$ and $\alpha = 3$ in Fig. 5. For the linear kernel ($\alpha = 1$) all methods yield a relatively small error even for low dimensionality. An error of 0.01 is hard to see by eye when plotting input-output similarity scatter plots as done in Fig. 3. For both $\alpha = 1, 3$ we observe again a strange behavior of our method: it seems to bottom out, and the error stops decreasing and even begins to increase as the dimensionality increases. This may be related to unstable convergence properties of gradient descent-ascent. For $\alpha = 3$ we observe that the kernel PCA method largely outperforms all methods. We observe that Nyström with either our features or KMeans landmarks appears to outperform sampled Nyström methods. The sampled Nyström method is worse than our representations for low dimensionality but eventually catches up and surpasses our neural representations.

Emergence of sparse representations. We train networks with $\alpha = \{1, 2, 3, 4\}$ and $n = 800$ neurons (so the output dimensionality is exactly twice the input dimensionality). There is a sign degeneracy when $\alpha$ is odd: we can multiply both $w_i$ and $y_i$ by $-1$ and the objective is unchanged. When we look at the response histogram for single neurons, we observe that for $\alpha = 3$ the response tends to be heavily skewed, so that when the response has a high magnitude, it is either always positive or always negative. We remove this degeneracy by multiplying both $w_i$ and $y_i$ by the sign of $\bar{y}_i$. After removing this degeneracy we plot the neuron responses over all patterns in Figure 6. For $\alpha = 1$ (linear neurons), neuron responses are roughly centered around zero: neuron responses are neither sparse nor skewed. For $\alpha = 2, 3, 4$, neurons appear to have a heavy-tailed distribution; they frequently have small responses, but occasionally have large positive responses. Neurons become increasingly sparse and heavy tailed as we increase $\alpha$, although this effect is not that strong.

Figure 6: (a) response distribution (after the procedure we describe in the text for removing the sign degeneracy). The nonlinear kernels ($\alpha$ = 2, 3, 4) naturally give rise to sparse distributions. (b) test set accuracy of a linear classifier for MNIST, (c) train set accuracy of the corresponding linear classifier. Interestingly, all nonlinear kernels give nearly identical train and test set results. The linear kernel gives nearly identical results to simply training the classifier directly on the pixels.
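The degeneracy removal is a simple post-processing step; below is a sketch of our reading of it, assuming the sign is taken from each neuron's mean response $\bar{y}_i$ over the dataset.

```python
import numpy as np

def remove_sign_degeneracy(W, Y):
    """Flip (w_i, y_i) pairs so each neuron's dominant response is positive.

    W : (N, M) learned features; Y : (T, N) responses over the dataset.
    The sign is taken from the mean response of each neuron (our reading of the
    procedure in the text); for odd alpha the objective is unchanged by the flip.
    """
    s = np.sign(Y.mean(axis=0))     # (N,) one sign per neuron
    s[s == 0] = 1.0                 # leave exactly-zero-mean neurons alone
    return W * s[:, None], Y * s
```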
Evaluating the representations via linear classification. We train a linear classifier on the inputs ($x$) and the outputs ($y$) for $\alpha = 1, 2, 3, 4$ and $n = 800$. We train every configuration using $k \in \{1, 3, 10, 30, 100, 300, 1000, 3000\}$ labels per class. For each configuration we use the weight decay parameter $\lambda \in \{10^{-5}, 10^{-4}, 10^{-3}, 10^{-2}, 10^{-1}, 1\}$ which yields the highest test accuracy. We average the accuracy for every configuration over 5 random seeds. The results are shown in Fig. 6. As expected, the performance of the inputs and of $\alpha = 1$ (linear similarity matching) is nearly identical on both test and train sets. Surprisingly, the test performance of $\alpha = 2, 3, 4$ is nearly identical. Perhaps these curves can be partially explained by the spectra of the output similarity matrix, which we show in the Appendix. While the shapes of the spectra are different in every case, $\alpha = 1$ has roughly 200 nonzero eigenvalues while $\alpha = 2, 3, 4$ all have nearly 800 nonzero eigenvalues. Perhaps the number of nonzero eigenvalues is more influential for the linear classification performance than the detailed shapes of these spectra.

5 Related work

Kernel similarity matching with random Fourier features. The most closely related work to ours is kernel similarity matching with random Fourier features [Bahroun et al., 2017]. The key difference between our methods is that instead of learning the features $w$, they use random Fourier features to directly generate nonlinear feature vectors $\phi^t = \sqrt{2/n} \, \cos(W x^t + b)$, which they then feed into a standard linear similarity matching network. This leads them to a different architecture (one feedforward layer plus one recurrent layer, instead of our single recurrent-layer net) and a different set of learning rules. A benefit of the random feature approach is that it will theoretically lead to perfect matching, so long as the number of random features is sufficiently large. However, the feature learning aspect of our algorithm naturally led to a sparse set of responses, which lends our model an added degree of biological plausibility. Additionally, our method generalizes to non-shift-invariant kernels, and empirically it yields better approximation error when the dimensionality of the output is sufficiently low. Our method can be seen as a biased method for approximation, which can be useful when the dimensionality is low, but ultimately will underperform compared to unbiased methods such as random Fourier methods or Nyström methods.

Nyström approximation. While not obviously biological, Nyström methods are perhaps the most commonly used methods for approximating kernel matrices. The Nyström approximation uses a set of landmarks $\{w_i : i = 1, 2, \ldots, N\}$ to construct a low-rank approximation of the original kernel matrix [Williams and Seeger, 2001]. To more clearly see the relationship between this method and ours, one can slightly modify the original formulation to generate "Nyström features":

$$\text{Nyström features:}\quad y_j^t = \sum_i f(x^t, w_i) \, M_{ij} \quad\text{where}\quad M_{ij} = [(B^{+})^{1/2}]_{ij} \;\text{and}\; B_{ij} = f(w_i, w_j) \qquad (18)$$

Here $B^{+}$ indicates the pseudo-inverse. Multiplying two such vectors together yields the Nyström approximation $\hat{F}_{st} = y^s \cdot y^t = \sum_{ij} f(x^s, w_i) [B^{+}]_{ij} f(x^t, w_j)$. Our method produces representations of the same functional form, but our $M$ matrix is derived from parameters learned by the correlations:

$$\text{Our features:}\quad y_j^t = \sum_i f(x^t, w_i) \, M_{ij} \quad\text{where}\quad M_{ij} = [L + \lambda I]^{-1}_{ij} \, q_j \qquad (19)$$

As measured by squared error, the Nyström approximation was actually a better approximation than our representations when we used the same set of landmarks (Figs. 3, 5).
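To make the comparison concrete, here is a minimal sketch of both feature constructions (Eqs. 18 and 19); `kernel_fn(A, B)` is an assumed helper returning the matrix of pairwise kernel values between rows of A and rows of B.

```python
import numpy as np
from scipy.linalg import sqrtm

def nystrom_features(X, W, kernel_fn):
    """Eq. 18: y^t_j = sum_i f(x^t, w_i) M_ij with M = (B^+)^{1/2}, B_ij = f(w_i, w_j)."""
    B = kernel_fn(W, W)                         # (N, N) landmark-landmark similarities
    M = np.real(sqrtm(np.linalg.pinv(B)))       # matrix square root of the pseudo-inverse
    return kernel_fn(X, W) @ M                  # (T, N) features

def hebbian_features(X, W, q, L, kernel_fn, lam=1e-3):
    """Eq. 19: same functional form, but M = (L + lam I)^{-1} diag(q) is learned."""
    M = np.linalg.inv(L + lam * np.eye(len(q))) * q    # column j scaled by q_j
    return kernel_fn(X, W) @ M
```

The only difference between the two is the mixing matrix M, which is why the learned landmarks can be reused directly inside a standard Nyström approximation.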
The variation in Nyström methods primarily comes from the method used to generate the landmarks. Two broad categories of landmark selection can be defined: template vs. pseudo-landmark. Template-based approaches choose landmarks as a subset of the inputs, $w \in \{x^1, \ldots, x^T\}$, typically chosen via sampling schemes [Williams and Seeger, 2001, Drineas et al., 2005, Musco and Musco, 2017]. Pseudo-landmark approaches do not require the landmarks to be inputs. Zhang et al. [2008] used the cluster centers generated by KMeans as the landmarks. Fu [2014] formulates landmark selection as an optimization problem in the reproducing kernel Hilbert space. Our method can be seen as a pseudo-landmark approach, as our landmarks are directly generated via Hebbian learning rules and in general will not be exactly equal to any particular input pattern. Our method is similar in spirit to the approach of Fu [2014]. A key difference is that we use a different objective, a correlation-based upper bound to the sum of squared errors, which gives rise to correlation-based learning rules.

6 Discussion

We have extended the neural random Fourier feature method of Bahroun et al. [2017] for kernel similarity matching to instead be applicable to arbitrary differentiable kernels. Rather than using random nonlinear features, we learned the features with Hebbian learning rules. Both this work and that of Bahroun et al. [2017] can be seen as extensions of the linear similarity matching works of Hu et al. [2014], Pehlevan and Chklovskii [2014], Pehlevan et al. [2015, 2018]. By using a nonlinear input similarity, the representations learned by our network are capable of learning high-dimensional nonlinear functions of the input, without requiring any constraints such as non-negativity. To our knowledge this is the first work that attempts to directly optimize the sum of squared errors in Eq. 1 without relying on sampling schemes or direct computation of the input similarity matrix. It would be interesting to relax the correlation-based constraint we have imposed on ourselves. This might allow for a variety of different types of bounds (Eq. 3) to be derived, which in turn could lead to more faithful approximations than the one presented in our paper.

Our work falls in line with an increasing body of literature that derives nonlinear Hebbian/local learning rules by starting with kernel/similarity matching measures [Pogodin and Latham, 2020, Nøkland and Eidnes, 2019]. A key contribution of our work is the use of an unsupervised objective, rather than a supervised objective, and a purely bottom-up flow of information from pixels to features.

The computational complexity of our method (after learning) is comparable to Nyström methods. Let $d$ be the input dimensionality and $D$ be the feature dimensionality. Then Nyström methods have computational complexity $Dd + D^3$ and random Fourier methods only require $Dd$. Like Nyström, our proposed method requires $Dd + D^3$ if using the fast inverse method (Eq. 12). Implementation of the proposed method with a recurrent network (Eq. 11) requires $Dd + D^3 K$, where $K$ is the number of recurrent network iterations.

A key obstacle faced by users of this algorithm is the stochastic gradient descent-ascent procedure. Empirically, the convergence of our algorithm is quite sensitive to the learning rates. This method does not provide the same sorts of theoretical guarantees or empirically observed robustness as sampling-based methods. Use of more robust descent-ascent optimization methods could be useful for making this class of algorithms more accessible to the practitioner.
Acknowledgements and Funding

We would like to thank Lawrence Saul and Runzhe Yang for their helpful insights and discussions. This research was supported by the Intelligence Advanced Research Projects Activity (IARPA) via Department of Interior / Interior Business Center (DoI/IBC) contract number D16PC0005, and by NIH/NIMH RF1MH117815 and RF1MH123400. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright annotation thereon. Disclaimer: The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of IARPA, DoI/IBC, or the U.S. Government.

References

Yanis Bahroun, Eugénie Hunsicker, and Andrea Soltoggio. Neural networks for efficient nonlinear online clustering. In International Conference on Neural Information Processing, pages 316-324. Springer, 2017.

I. Borg and P. J. F. Groenen. Classical Scaling, pages 261-267. Springer New York, New York, NY, 2005. ISBN 978-0-387-28981-6. doi: 10.1007/0-387-28981-X_12. URL https://doi.org/10.1007/0-387-28981-X_12.

Carlos S. N. Brito and Wulfram Gerstner. Nonlinear Hebbian learning as a unifying principle in receptive field formation. PLoS Computational Biology, 12(9):e1005070, 2016.

Dmitri Chklovskii. The search for biologically plausible neural computation: The conventional approach, 2016. URL http://www.offconvex.org/2016/11/03/MityaNN1/.

Youngmin Cho and Lawrence Saul. Kernel methods for deep learning. In Y. Bengio, D. Schuurmans, J. Lafferty, C. Williams, and A. Culotta, editors, Advances in Neural Information Processing Systems, volume 22. Curran Associates, Inc., 2009. URL https://proceedings.neurips.cc/paper/2009/file/5751ec3e9a4feab575962e78e006250d-Paper.pdf.

Petros Drineas, Michael W. Mahoney, and Nello Cristianini. On the Nyström method for approximating a Gram matrix for improved kernel-based learning. Journal of Machine Learning Research, 6(12), 2005.

Zhouyu Fu. Optimal landmark selection for Nyström approximation. In Chu Kiong Loo, Keem Siah Yap, Kok Wai Wong, Andrew Teoh, and Kaizhu Huang, editors, Neural Information Processing, pages 311-318, Cham, 2014. Springer International Publishing. ISBN 978-3-319-12640-1.

Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. Advances in Neural Information Processing Systems, 30, 2017.

Geoffrey E. Hinton. Boltzmann machine. Scholarpedia, 2(5):1668, 2007.

Tao Hu, Cengiz Pehlevan, and Dmitri B. Chklovskii. A Hebbian/anti-Hebbian network for online sparse dictionary learning derived from symmetric matrix factorization. In 2014 48th Asilomar Conference on Signals, Systems and Computers, pages 613-619, 2014. doi: 10.1109/ACSSC.2014.7094519.

Yann LeCun. The MNIST database of handwritten digits. http://yann.lecun.com/exdb/mnist/, 1998.

Daniel D. Lee and H. Sebastian Seung. Learning the parts of objects by non-negative matrix factorization. Nature, 401:788-791, 1999.

Timothy P. Lillicrap, Daniel Cownden, Douglas B. Tweed, and Colin J. Akerman. Random synaptic feedback weights support error backpropagation for deep learning. Nature Communications, 7(1):1-10, 2016.

Tianyi Lin, Chi Jin, and Michael Jordan. On gradient descent ascent for nonconvex-concave minimax problems. In International Conference on Machine Learning, pages 6083-6093. PMLR, 2020.
Cameron Musco and Christopher Musco. Recursive sampling for the Nyström method. In I. Guyon, U. Von Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc., 2017. URL https://proceedings.neurips.cc/paper/2017/file/a03fa30821986dff10fc66647c84c9c3-Paper.pdf.

Arild Nøkland and Lars Hiller Eidnes. Training neural networks with local error signals. In International Conference on Machine Learning, pages 4839-4850. PMLR, 2019.

Dina Obeid, Hugo Ramambason, and Cengiz Pehlevan. Structured and deep similarity matching via structured and deep Hebbian networks. In Proceedings of the 33rd International Conference on Neural Information Processing Systems, Red Hook, NY, USA, 2019. Curran Associates Inc.

B. A. Olshausen and D. J. Field. Emergence of simple-cell receptive field properties by learning a sparse code for natural images. Nature, 381:607-609, June 1996.

F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825-2830, 2011.

Cengiz Pehlevan and Dmitri B. Chklovskii. A Hebbian/anti-Hebbian network derived from online non-negative matrix factorization can cluster and discover sparse features. In 2014 48th Asilomar Conference on Signals, Systems and Computers, pages 769-775, 2014. doi: 10.1109/ACSSC.2014.7094553.

Cengiz Pehlevan, Tao Hu, and Dmitri B. Chklovskii. A Hebbian/anti-Hebbian neural network for linear subspace learning: A derivation from multidimensional scaling of streaming data. Neural Computation, 27(7):1461-1495, 2015. doi: 10.1162/NECO_a_00745.

Cengiz Pehlevan, Anirvan M. Sengupta, and Dmitri B. Chklovskii. Why do similarity matching objectives lead to Hebbian/anti-Hebbian networks? Neural Computation, 30(1):84-124, 2018. doi: 10.1162/neco_a_01018.

Roman Pogodin and Peter Latham. Kernelized information bottleneck leads to biologically plausible 3-factor Hebbian learning in deep networks. Advances in Neural Information Processing Systems, 33:7296-7307, 2020.

Ali Rahimi, Benjamin Recht, et al. Random features for large-scale kernel machines. In NIPS, volume 3, page 5. Citeseer, 2007.

Maximilian Riesenhuber and Tomaso Poggio. Hierarchical models of object recognition in cortex. Nature Neuroscience, 2(11):1019-1025, 1999.

Frank Rosenblatt. The perceptron: a probabilistic model for information storage and organization in the brain. Psychological Review, 65(6):386, 1958.

Bernhard Schölkopf, Alexander Smola, and Klaus-Robert Müller. Kernel principal component analysis. In International Conference on Artificial Neural Networks, pages 583-588. Springer, 1997.

H. Sebastian Seung. Convergence of gradient descent-ascent analyzed as a Newtonian dynamical system with dissipation, 2019.

H. Sebastian Seung and Jonathan Zung. A correlation game for unsupervised learning yields computational interpretations of Hebbian excitation, anti-Hebbian inhibition, and synapse elimination. CoRR, abs/1704.00646, 2017. URL http://arxiv.org/abs/1704.00646.

Bernt C. Skottun, Arthur Bradley, Gary Sclar, Izumi Ohzawa, and Ralph D. Freeman. The effects of contrast on visual orientation and spatial frequency discrimination: a comparison of single cells and behavior. Journal of Neurophysiology, 57(3):773-786, 1987.

Christopher Williams and Matthias Seeger. Using the Nyström method to speed up kernel machines.
In T. Leen, T. Dietterich, and V. Tresp, editors, Advances in Neural Information Processing Systems, volume 13. MIT Press, 2001. URL https://proceedings.neurips.cc/paper/2000/file/19de10adbaa1b2ee13f77f679fa1483a-Paper.pdf.

Felix Xinnan X. Yu, Ananda Theertha Suresh, Krzysztof M. Choromanski, Daniel N. Holtmann-Rice, and Sanjiv Kumar. Orthogonal random features. Advances in Neural Information Processing Systems, 29, 2016.

Kai Zhang, Ivor W. Tsang, and James T. Kwok. Improved Nyström low-rank approximation and error analysis. In Proceedings of the 25th International Conference on Machine Learning, pages 1232-1239, 2008.