Published as a conference paper at ICLR 2025

ASSOCIATIVE MEMORY AND DEAD NEURONS

Vladimir Fanaskov, AIRI, Skoltech, fanaskov.vladimir@gmail.com
Ivan Oseledets, AIRI, Skoltech

ABSTRACT

In Large Associative Memory Problem in Neurobiology and Machine Learning, Dmitry Krotov and John Hopfield introduced a general technique for the systematic construction of neural ordinary differential equations with a non-increasing energy or Lyapunov function. We study this energy function and identify that it is vulnerable to the problem of dead neurons. Each point in the state space where a neuron dies is contained in a non-compact region with constant energy. In these flat regions, the energy function alone does not completely determine all degrees of freedom and, as a consequence, cannot be used to analyze stability or find steady states or basins of attraction. We perform a direct analysis of the dynamical system and show how to resolve the problems caused by flat directions corresponding to dead neurons: (i) all information about the state vector at a fixed point can be extracted from the energy and the Hessian matrix (of the Lagrange function), (ii) it is enough to analyze stability in the range of the Hessian matrix, (iii) if a steady state touching a flat region is stable, the whole flat region is the basin of attraction. The analysis of the Hessian matrix can be complicated for realistic architectures, so we show that for a slightly altered dynamical system (with the same structure of steady states), one can derive a diverse family of Lyapunov functions that do not have flat regions corresponding to dead neurons. In addition, these energy functions allow one to use Lagrange functions with Hessian matrices that are not necessarily positive definite and even consider architectures with non-symmetric feedforward and feedback connections.

1 INTRODUCTION

Associative or content-addressable memory is a system that retrieves the most appropriate stored pattern based on a partially known or distorted input pattern. One particularly influential realization of associative memory was proposed by John Hopfield in Hopfield (1982) for discrete variables and in Hopfield (1984) for continuous variables. Both models are distinguished by their biological plausibility, autonomy, asynchronous operation of constituent parts, robustness to noise, and strong theoretical guarantees. Later, in Krotov & Hopfield (2020), it was shown that one can develop a general biologically plausible model that unites many previously known models and allows building novel associative memory systems Krotov (2023).

The model in Krotov & Hopfield (2020) is based on a nonlinear dynamical system that evolves in time from a given initial state. Nonlinear dynamical systems show an exceptionally diverse set of behaviors Strogatz (2018), so one needs to select an appropriate class of ordinary differential equations suitable for modeling associative memory. The most crucial requirement is the ability of the system to evolve to a single state from many initial conditions that are close according to some problem-specific metric. This fact suggests one should use a dynamical system with many stable steady states. If one selects such a system, each stable steady state corresponds to a particular memory, and the basin of attraction (all initial conditions that evolve to a selected state) defines a measure of similarity between states.
The technique of choice to study the stability of a steady state is to construct an energy or Lyapunov function Lyapunov (1992). This function, defined on the state of a dynamical system, is non-increasing on trajectories of the dynamical system. If it is possible to find such a function, its isolated local minima will correspond to steady states. Different variants of energy functions with this property are available in all previous works on associative memory Hopfield (1982), Hopfield (1984), Krotov & Hopfield (2016), and the recent work Krotov & Hopfield (2020) provides the most general energy function that embraces all previous ones.

Figure 1 (panels labeled ReLU and softmax): (a) energy function from Krotov & Hopfield (2020); (b) energy function proposed in this article. The figure shows vector fields of the dynamical systems (top row) and level sets of the energy functions (bottom row) for energy functions (2) (the one from Krotov & Hopfield (2020)) and (10) (proposed in this article). The vector fields suggest that all steady states are stable, but energy function (2) does not indicate that. The reason is that energy function (2) has non-compact flat regions touching each point where one or more neurons die (see Proposition 1 for a precise statement).

Since associative memory from Krotov & Hopfield (2020) has an associated energy function, one can argue that it is a member of a broad family of energy-based models LeCun et al. (2006), including diffusion models Hoover et al. (2023). In addition to that, the dynamical system used is clearly related to another subfield of machine learning known as Neural Ordinary Differential Equations Chen et al. (2018). Finally, in Krotov (2021), Hoover et al. (2022) it was shown that one can select the parameters of the model Krotov & Hopfield (2020) in such a way that the dynamical variables correspond to the activation pattern of a deep neural network with feedback connections. Given that, one can argue that modern Hopfield associative memory Krotov & Hopfield (2020) uniquely ties together several major paradigms in deep learning.

In this article, we study the Lyapunov function proposed in Krotov & Hopfield (2020) and later adopted and modified in other recent papers, e.g., Hoover et al. (2024), Hoover et al. (2022), Millidge et al. (2022a). Our main observation is that the energy function is vulnerable to the problem of dead neurons (Footnote 1) Lu et al. (2019), which results in flat energy directions. We illustrate this in the left panel of Figure 1, where one can find both the vector field of the dynamical system and isolines of the energy function. Clearly, the energy function from Krotov & Hopfield (2020) does not ensure stability and is not helpful in identifying a steady state when dead neurons are present. As a remedy, we propose a slightly modified dynamical system that has no flat directions (see the right panel of Figure 1) but retains other good properties of the Krotov and Hopfield model. In particular, it is still an energy-based model and a neural ODE, and it has steady states related to deep neural networks. A more detailed breakdown of our contribution is below:

1. In Section 2 we provide examples of flat energy directions caused by dead neurons (Figure 1; Examples 2, 3, 4) and formally characterize architectures that are vulnerable to that problem in Proposition 1.

Footnote 1: Usually, neurons are called dead when the activation function saturates, e.g., x < 0 for ReLU or $|x| \gg 1$ for sigmoid.
In this article we use an extended definition of dead neurons and say that a neuron is dead when it is impossible to reconstruct the input to the activation function from its output. This is explained in more detail in Section 2.

2. Flat energy directions, as we show in Section 3, cause several undesirable consequences for sensitivity and stability:
(a) Problems with sensitivity described in Section 3.1 include: (i) the energy is completely independent of the degrees of freedom corresponding to dead neurons; because of that, activation functions are effectively invertible and the energy function presented in Krotov & Hopfield (2020) is completely equivalent to the old one given in Hopfield (1984); (ii) in the flat regions the energy function is not sensitive to changes in the bias term (Section 3.1).
(b) Stability is discussed in Section 3.2 with the following main takeaways: (i) for a steady state with dead neurons, the stability condition cannot be ensured from the Lyapunov function alone; (ii) an independent analysis of the dynamical system shows that dead neurons do not compromise stability, and only the motion in the range of the Hessian of the Lagrange function is relevant; (iii) if the steady state is stable, the whole flat direction belongs to the basin of attraction; (iv) if, in addition to the energy, the range of the Hessian is available (e.g., as the orthogonal projector), this information is enough to restore the degrees of freedom that the energy alone misses at a steady state.

3. As a remedy, in Section 4, we define a slightly modified dynamical system and a family of energy functions with good properties: (i) Proposition 5 shows that in general one may construct a large family of Lyapunov functions with no flat directions; (ii) Example 5 shows several concrete choices of Lyapunov functions for symmetric weight matrices with no restrictions on the Hessian matrix of the Lagrange function; (iii) Examples 6 and 7 further relax restrictions on the parameters and allow one to consider memory models with non-symmetric weight matrices.

2 DEAD NEURONS

In Krotov & Hopfield (2020) the authors proposed to write the activation function g(y) as the gradient of a Lagrange function L(y), i.e., $g(y) = \partial L(y)/\partial y$. This formalism allows one to describe a large class of memory models with a Lyapunov function on common grounds. The results we obtain in this article apply to all of them, but for simplicity we consider the dense model taken from (Krotov, 2021, Equation (2)) (see Appendix A for further discussion of the relations between different models). This model reads:

$\dot y(t) = W g(y(t)) - y(t) + b, \qquad y(0) = y_0,$   (1)

$E(y) = (y - b)^\top g(y) - L(y) - \tfrac{1}{2}\, g(y)^\top W g(y).$   (2)

Equation (1) describes the temporal dynamics of a feature vector y starting from $y_0$. The equation contains weights W, bias b and activation function g(y). Energy function (2) is non-increasing on trajectories of (1) if and only if $W = W^\top$ and the Hessian of the Lagrange function $\Lambda = \partial^2 L(y)/\partial y^2$ is positive semi-definite (see (Krotov & Hopfield, 2020, Equation (4), Appendix A) and (Krotov, 2021, Equation (4), Appendix A)).
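To make the setup concrete, the following is a minimal NumPy sketch (our own illustration, not code from the paper) that integrates (1) with explicit Euler and evaluates the energy (2) along the trajectory. The ReLU Lagrange function, the random symmetric weights, the step size and all function names are our choices for the example; with $W = W^\top$ and a convex Lagrange function the recorded energy should be non-increasing up to discretization error.

```python
import numpy as np

def relu(y):
    return np.maximum(y, 0.0)

def energy(y, W, b):
    # Eq. (2) with the ReLU Lagrange function L(y) = sum_i 0.5*ReLU(y_i)^2,
    # for which g(y) = dL/dy = ReLU(y)
    g = relu(y)
    return (y - b) @ g - 0.5 * np.sum(g ** 2) - 0.5 * g @ W @ g

def integrate(W, b, y0, dt=1e-2, n_steps=3000):
    # explicit Euler for Eq. (1): dy/dt = W g(y) - y + b
    y = y0.copy()
    energies = [energy(y, W, b)]
    for _ in range(n_steps):
        y = y + dt * (W @ relu(y) - y + b)
        energies.append(energy(y, W, b))
    return y, np.array(energies)

rng = np.random.default_rng(0)
N = 8
A = rng.normal(size=(N, N))
A_sym = 0.5 * (A + A.T)
W = A_sym / (1.2 * np.abs(np.linalg.eigvalsh(A_sym)).max())  # symmetric, spectral radius < 1
b = rng.normal(size=N)

y_star, E_traj = integrate(W, b, rng.normal(size=N))
print("steady-state residual:", np.linalg.norm(W @ relu(y_star) - y_star + b))
print("max energy increase along the trajectory:", np.diff(E_traj).max())
```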
Dynamical system (1) has several favorable properties (see the discussion in Section 1 and Krotov & Hopfield (2020), Krotov (2021)): (i) steady states can be used as memory vectors, (ii) dynamics in the basins of attraction naturally models the memory recovery process, (iii) the energy function can be used to ensure the existence of steady states and basins of attraction, (iv) the memory is related to neural ODEs, so it can be trained end-to-end, (v) with a special choice of W, steady states resemble activation patterns of classical deep learning architectures. The example below illustrates the last point (see (Krotov, 2021, Section 3)).

Example 1 (MLP with feedback connections). For simplicity, we consider four layers; extensions to a larger number of layers are straightforward. Weights and state vectors are partitioned into blocks of conformable size:

$W = \begin{pmatrix} 0 & W_{12} & 0 & 0 \\ W_{21} & 0 & W_{23} & 0 \\ 0 & W_{32} & 0 & W_{34} \\ 0 & 0 & W_{43} & 0 \end{pmatrix}, \qquad y = \begin{pmatrix} y_1 \\ y_2 \\ y_3 \\ y_4 \end{pmatrix}, \qquad b = \begin{pmatrix} b_1 \\ b_2 \\ b_3 \\ b_4 \end{pmatrix}.$

Note that W is symmetric, so $W_{12} = W_{21}^\top$. With that choice dynamical system (1) becomes

$\dot y_i = W_{i,i-1}\, g(y_{i-1}) + W_{i,i+1}\, g(y_{i+1}) - y_i + b_i, \quad i = 2, 3,$
$\dot y_1 = W_{12}\, g(y_2) - y_1 + b_1, \qquad \dot y_4 = W_{43}\, g(y_3) - y_4 + b_4,$

so the steady state indeed resembles an MLP but with feedback connections that are symmetric.

Besides the MLP described in Example 1, associative memory (1) leads to many more interesting and fruitful connections. In Krotov & Hopfield (2020) it allowed the authors to reconstruct dense associative memory Krotov & Hopfield (2016) and the modern Hopfield network Ramsauer et al. (2020). In Krotov (2021) and Hoover et al. (2022) it was used to build memory models with dense hidden layers and convolutional neural networks with pooling layers. In Hoover et al. (2024) the authors introduced an energy transformer using the same formalism. The authors of Tang & Kopp (2021) also noticed that the MLP-mixer Tolstikhin et al. (2021) is related to associative memory (1). Clearly, the technique is valuable and versatile.

As we have seen, steady states of dynamical systems are important to emulate deep learning architectures and classical memory models. The role of the Lyapunov function is to ensure stability. Unfortunately, if one looks closer at the energy function (2), it becomes clear that it shows some pathological behavior. Before providing a formal description, we illustrate this with three examples.

Example 2 (flat energy with ReLU activations). The Lagrange function $L(y) = \sum_{i=1}^{N} \tfrac{1}{2}(\mathrm{ReLU}(y_i))^2$ corresponds to a fully-connected neural network with N neurons and $\mathrm{ReLU}(x) = \tfrac{1}{2}(x + |x|)$ nonlinearity. The energy function becomes
$E(y) = (y - b)^\top \mathrm{ReLU}(y) - \sum_{i=1}^{N} \tfrac{1}{2}(\mathrm{ReLU}(y_i))^2 - \tfrac{1}{2}\,\mathrm{ReLU}(y)^\top W\,\mathrm{ReLU}(y).$
It is easy to see that if neuron i dies, i.e., $y_i \le 0$, the energy is unchanged, $E(\tilde y) = E(y)$, for all $\tilde y = y + \alpha e_i$, where $\alpha \le 0$, $e_i$ is the i-th column of the identity matrix and y is a state vector with $y_i \le 0$. In other words, the energy is non-discriminative in a non-compact region of state space.

Example 3 (flat energy with sigmoid activations). For the sigmoid activation function $\sigma(x) = (1 + \exp(-x))^{-1}$ the Lagrange function reads $L(y) = \sum_i \log(1 + e^{y_i})$, and the energy is
$E(y) = (y - b)^\top \sigma(y) - \sum_i \log(1 + e^{y_i}) - \tfrac{1}{2}\,\sigma(y)^\top W \sigma(y).$
Neuron i dies when the sigmoid saturates (Footnote 2) for some $y_i \gg 1$. After that, for all $\tilde y = y + \alpha e_i$ with $\alpha \ge 0$ the energy has a flat region, i.e., $E(y) = E(\tilde y(\alpha))$ for all admissible α.

Example 4 (flat energy with softmax activations). The activation $\mathrm{softmax}(y) = \exp(y)/\sum_{i=1}^{N}\exp(y_i)$ corresponds to $L(y) = \log\sum_{i=1}^{N}\exp(y_i)$ and energy
$E(y) = (y - b)^\top \mathrm{softmax}(y) - \log\sum_{i=1}^{N}\exp(y_i) - \tfrac{1}{2}\,\mathrm{softmax}(y)^\top W\,\mathrm{softmax}(y).$
Consider the variable $\tilde y(c) = y + c$, where c is a vector with all components equal to the same scalar $c \in \mathbb{R}$. Under this constant shift $\mathrm{softmax}(y) = \mathrm{softmax}(\tilde y(c))$, the Lagrange function shifts by c, and $\tilde y(c)^\top \mathrm{softmax}(\tilde y(c)) = y^\top \mathrm{softmax}(y) + c$, so the overall energy remains the same, i.e., $E(y) = E(\tilde y(c))$ for arbitrary c.

The examples above demonstrate that energy function (2) fails to distinguish states on a large fraction of the state space. In Examples 2 and 3 this happens because the activation function saturates for some inputs, and in Example 4 it is a consequence of invariance. The latter case is typically not considered a dead neuron, but since the number of degrees of freedom decreases by one, we call this effective neuron dead all the same (Footnote 3). The formal definition of dead neurons appears below (see Maas et al. (2013), Lu et al. (2019), Cui & Fearn (2018) for a discussion of dead neurons in the context of the ReLU activation function).

Definition (dead neurons). At a given point $y \in \mathbb{R}^N$ the activation function g has k dead neurons if it is possible to find $V \in \mathbb{R}^{N \times k}$ such that $g(y + Vc) = g(y)$ for all $c \in \mathbb{R}^k_+$.

The examples above hint that most activation functions will lead to dead neurons. The first large class is activation functions with saturation: hyperbolic tangent, sigmoid, ReLU, GELU, SiLU, SELU, Gaussian, etc. (Footnote 4). The second class is activation functions with symmetries, e.g., softmax and layer norm.

Footnote 2: Formally, the sigmoid function saturates only in the limit $y \to \infty$. However, for computations in floating-point arithmetic it is safe to say that the sigmoid saturates when the absolute value of the argument is sufficiently large but finite.

Footnote 3: One can also say that softmax is always saturated. If we consider the input y in a new basis where $\tilde y_1 = \sum_i y_i$, it is easy to see that after the softmax this component becomes 1 regardless of the input.

Figure 2: Sketch of three problematic energy functions: (a) unbounded from below with two stable states (we analyze this situation in Appendix C), (b) bounded from below but with a compact flat region, (c) bounded from below but with a non-compact flat region. According to Proposition 1, case (c) is realized in models Krotov & Hopfield (2020) when neurons die. In the red regions stability properties do not follow from Lyapunov theorems. LaSalle's invariance principle (Haddad & Chellaboina, 2008, Theorem 3.3) ensures that in case (b) isolated steady states are stable, but for case (c) a separate stability analysis is needed (see Section 3.2 for details).

With the definition of dead neurons, we can formalize the pathological behavior of energy function (2).

Proposition 1. If at a given point y the activation function g has k dead neurons defined by $V \in \mathbb{R}^{N \times k}$, energy function (2) has constant value on the region $D = \{y + Vc : c \in \mathbb{R}^k_+\}$. Proof: Appendix B.

Flat energy directions are illustrated in Figure 1 and Figure 2c. Intuitively, it is clear that having large regions with flat energy is not good. In the next section, we explain in detail what will go awry. As illustrated in Figure 2, the energy function can appear problematic for reasons not directly related to dead neurons. We argue that these two situations are less grim. The energy illustrated in Figure 2a can still be utilized with proper initial conditions, e.g., $\|y_0\|_2 \le R$ for some small R. Besides, we demonstrate in Appendix C that restricting parameters to ensure the energy is bounded from below can lead to a catastrophic capacity reduction.
The energy from Figure 2b is less problematic for stability analysis because the invariance principle (Haddad & Chellaboina, 2008, Theorem 3.3) can be used.

3 CONSEQUENCES OF DEAD NEURONS

Proposition 1 implies that for most architectures used in practice, there are large regions of state space with a flat energy function. This has several negative consequences that we somewhat arbitrarily classify as problems with sensitivity and stability.

3.1 SENSITIVITY

As evident from Figure 1, the energy is flat in the directions V corresponding to dead neurons. This can be formalized if one considers new variables $y_d = VV^\top y$, $y_a = (I - VV^\top)y$ corresponding to dead and alive neurons. Since $y = y_d + y_a$, the energy (2) becomes
$E(y_d, y_a) = (y_d + y_a - b)^\top g(y_a) - L(y_d + y_a) - \tfrac{1}{2}\, g(y_a)^\top W g(y_a) = (y_a - b)^\top g(y_a) - L(y_a) - \tfrac{1}{2}\, g(y_a)^\top W g(y_a),$
where we used $g(y_d + y_a) = g(y_a)$ and $L(y_d + y_a) = L(y_a) + g(y_a)^\top y_d$.

Footnote 4: For the sigmoid, hyperbolic tangent and Gaussian functions the saturation is exact only in the limit $x \to \pm\infty$. However, as explained in Example 3, in floating-point arithmetic these functions effectively saturate for finite values of the argument. Moreover, even with arbitrary precision, an approximately flat direction cannot be used to reliably store any information, since it would require weights with exponentially large magnitudes.

We see that the energy does not depend on the variables $y_d$ corresponding to dead neurons, which means the effective number of degrees of freedom decreases from N to N - k, where k is the number of dead neurons. If only the energy is available, it is not possible to recover the values of $y_d$, and steady states cannot be found, as in the examples from Figure 1. The variables $y_d$ are from the kernel of g, which means the activation function becomes effectively invertible. In Hopfield (1984), the energy function for this situation has already been introduced. It is instructive to compare the new energy function Krotov & Hopfield (2020) with the old one. The Lyapunov function from the 1984 paper, in the notation used in this article, reads
$E(u) = \sum_i \int_0^{g_i(u_i)} d\tau_i\, g_i^{-1}(\tau_i) - b^\top g(u) - \tfrac{1}{2}\, g(u)^\top W g(u),$   (3)
and the dynamical system is precisely the same as (1). The activation functions $g_i$ are invertible and monotone by assumption. In Appendix D we explain precisely how to adapt the model from Hopfield (1984) to our context. The first term of (3) is seemingly distinct from any term of (2). To show that this is an illusion we simplify the integrals as follows:
$\int_0^{g(u)} d\tau\, g^{-1}(\tau) = -\int_{g^{-1}(0)}^{u} d\rho\, g(\rho) + \rho\, g(\rho)\Big|_{g^{-1}(0)}^{u} = -\int_{g^{-1}(0)}^{u} d\rho\, g(\rho) + u\, g(u),$
where in the first step we used $\tau = g(\rho)$ and integration by parts. Using this simplification and the additional definition of the Lagrange function $L(u): g(u) = \partial L(u)/\partial u$, we immediately recognize that the energy function from Krotov & Hopfield (2020) is precisely the same as that from Hopfield (1984), but without the assumption that g should be invertible. The new energy from Krotov & Hopfield (2020) does not formally contain the inverse of the activation function but, as we see, still implies that g is invertible.

Besides insensitivity to the values of dead neurons, there is another problematic fact. Proposition 1 implies that, when at least one neuron is dead, the energy function has the invariance $E(y) = E(y + Vc)$. The steady state of dynamical system (1) does not have this symmetry, so, when one transforms $y \to y + Vc$ and preserves the energy, this corresponds to a new dynamical system with $b \to b - Vc$. In other words, in regions where dead neurons are present, the energy is not sensitive to changes in the bias term. A small numerical check of this insensitivity is given below.
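The insensitivity described above is easy to check numerically. The snippet below (our own check, using the ReLU model of Example 2 with arbitrary test values) shifts a dead coordinate further into the flat region and perturbs the bias along a dead direction: the energy (2) does not move in either case, while the right-hand side of (1) does change.

```python
import numpy as np

def relu(y):
    return np.maximum(y, 0.0)

def energy(y, W, b):
    # Eq. (2) with the ReLU Lagrange function of Example 2
    g = relu(y)
    return (y - b) @ g - 0.5 * np.sum(g ** 2) - 0.5 * g @ W @ g

rng = np.random.default_rng(1)
N = 6
A = rng.normal(size=(N, N))
W = 0.2 * (A + A.T)
b = rng.normal(size=N)

y = rng.normal(size=N)
y[0] = -0.5                              # neuron 0 is dead: ReLU is flat at y_0
shift = np.zeros(N)
shift[0] = -3.0                          # move further into the flat region

# the energy is blind to the dead coordinate and to bias changes along it ...
print(energy(y + shift, W, b) - energy(y, W, b))        # ~ 0
b_shifted = b.copy()
b_shifted[0] += 5.0
print(energy(y, W, b_shifted) - energy(y, W, b))        # ~ 0

# ... but the vector field of (1) is not
rhs = lambda z: W @ relu(z) - z + b
print(np.linalg.norm(rhs(y + shift) - rhs(y)))          # > 0
```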
We summarise the arguments made in this section in the proposition below.

Proposition 2. For energy function (2) the following is true:
1. The energy function does not depend on variables from the kernel of g, so one can always assume that the activation function g is invertible.
2. Suppose at a given point y dead neurons are described by the matrix $V \in \mathbb{R}^{N \times k}$. In this case the energy function E(y) is a Lyapunov function for dynamical systems (1) with $b = \tilde b - Vc$ for any $c \in \mathbb{R}^k_+$.

3.2 STABILITY OF DYNAMICAL SYSTEM

One of the main applications of the Lyapunov function is to perform stability analysis Haddad & Chellaboina (2008). This part is relevant for predictive coding and related approaches Xie & Seung (2003), Millidge et al. (2022b) to ensure the existence of at least some steady states. Besides, when associative memory is considered, one can only recover stable steady states, so stability directly influences memory capacity. The Lyapunov function also characterizes the basin of attraction of a given steady state. In the context of associative memory, basins of attraction define a measure of similarity between inputs, since inputs from the same basin result in the recovery of the same memory. Below, we show that in regions with dead neurons, Lyapunov function (2) alone fails to help in stability analysis.

Stability analysis with the Lyapunov function boils down to the study of the energy landscape near the steady state. Using the Taylor series, we find
$E(y^\star + \delta) - E(y^\star) = \tfrac{1}{2}\,\delta^\top \frac{\partial^2 E(u)}{\partial u^2}\Big|_{u = y^\star + t\delta}\delta, \qquad t(\delta, y^\star) \in (0, 1),$
where $y^\star$ satisfies $W g(y^\star) - y^\star + b = 0$. Computing the derivatives, we find
$E(y^\star + \delta) - E(y^\star) \approx \tfrac{1}{2}\,\delta^\top\left(\Lambda(y^\star) - \Lambda(y^\star) W \Lambda(y^\star)\right)\delta, \qquad \|\delta\|_2 \ll 1.$
For stability in the vicinity of a steady state, it is sufficient that the matrix above is positive definite, and for instability, it is sufficient that at least one eigenvalue of this matrix is negative. In situations such as in Figure 1 and Figure 2c, when the energy has a flat direction, nothing can be said about the dynamics in this region (Footnote 5). It is easy to show that if dead neurons are present at a point u, the matrix V lies in the zero eigenspace of Λ(u). To show this we consider the Taylor series
$g(u + Vc) = g(u) + \Lambda(u + tVc)\, Vc, \qquad t \in (0, 1),$
and since $g(u + Vc) = g(u)$ we confirm that $\Lambda V = 0$, i.e., a zero eigenspace is always present.

Since the energy is insufficient to analyze stability, we need to use dynamical system (1) directly. We suppose that in the region of interest V, the matrix that describes dead neurons, does not change, and use $y_d$, $y_a$ defined in the previous section. With that, one can show that the dynamical system becomes
$\dot y_a(t) = P_a W g(y_a(t)) - y_a + P_a b, \qquad \dot y_d(t) = P_d W g(y_a(t)) - y_d + P_d b,$   (4)
where $P_d = VV^\top$ and $P_a = I - P_d$ (Footnote 6). One can immediately see that dead neurons do not influence stability and, when the steady state $y_a^\star$ is reached, one can find $y_d^\star = P_d W g(y_a^\star) + P_d b$. Moreover, when $y_a(t)$ reaches the steady state, $y_d(t)$ converges exponentially from an arbitrary starting point, i.e., the whole flat region is a basin of attraction. The sufficient condition for stability/instability of $y_a^\star$ can be obtained from (4) by linearisation. More specifically, stability is defined by the matrix
$S = \bar V^\top W \bar V\, \bar V^\top \Lambda \bar V - I \in \mathbb{R}^{(N-k)\times(N-k)},$
the restriction of $W\Lambda - I$ to the range of $P_a$, where $\bar V \in \mathbb{R}^{N\times(N-k)}$ is a matrix with orthonormal columns such that $P_a = \bar V\bar V^\top$. When the matrix S does not have eigenvalues with a positive real part, the steady state is stable; in case there is at least one eigenvalue with a positive real part, the steady state is unstable (see (Haddad & Chellaboina, 2008, Theorem 3.19)). A numerical sketch of this check is given below.
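As an illustration (again our own sketch, not the authors' code), the check based on the matrix S can be carried out directly for the ReLU model: there the dead directions are coordinate axes, $\bar V$ is a slice of the identity, and Λ is a 0/1 diagonal matrix. We find a steady state by the fixed-point iteration $y \leftarrow W\,\mathrm{ReLU}(y) + b$ (a contraction for the weights chosen below) and then inspect the eigenvalues of S.

```python
import numpy as np

def relu(y):
    return np.maximum(y, 0.0)

def reduced_stability_matrix(y_star, W):
    # Lambda = diag(g'(y*)); for ReLU the dead directions are the coordinate
    # axes with y*_i <= 0, and V_bar collects the remaining (alive) axes.
    lam = (y_star > 0).astype(float)
    alive = np.flatnonzero(lam > 0)
    V_bar = np.eye(len(y_star))[:, alive]
    # for ReLU the restriction of Lambda is the identity, so S is the
    # alive-alive block of W minus the identity
    S = (V_bar.T @ W @ V_bar) @ (V_bar.T @ np.diag(lam) @ V_bar) - np.eye(len(alive))
    return S

rng = np.random.default_rng(2)
N = 10
A = rng.normal(size=(N, N))
A_sym = 0.5 * (A + A.T)
W = A_sym / (1.5 * np.abs(np.linalg.eigvalsh(A_sym)).max())  # ||W|| < 1: contraction
b = rng.normal(size=N)

y = rng.normal(size=N)
for _ in range(500):                     # fixed-point iteration y <- W g(y) + b
    y = W @ relu(y) + b

S = reduced_stability_matrix(y, W)
print("max Re(eig(S)) =", np.linalg.eigvals(S).real.max())  # negative -> stable
```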
The structure of the matrix S suggests that, in fact, the dynamics is stable when the energy function indicates that. Indeed, writing $\bar W = \bar V^\top W \bar V$ and $\bar\Lambda = \bar V^\top \Lambda \bar V$, the matrix $\bar\Lambda^{1/2}\bar W\bar\Lambda^{1/2}$ has the same spectrum as $\bar W\bar\Lambda$, so the sufficient condition for S is equivalent to the sufficient condition $\bar\Lambda^{1/2}\bar W\bar\Lambda^{1/2} \preceq I$. Finally, since $\bar\Lambda$ is full rank, we find that if $\bar\Lambda\bar W\bar\Lambda - \bar\Lambda$ is negatively stable (has non-positive eigenvalues), the dynamics of $y_a$ is stable. Since this matrix is the restriction of $\Lambda W\Lambda - \Lambda$ to the range of the Hessian, we conclude that one can analyze the stability of a steady state using the energy if a projector on the range of the Hessian is known. It is important to note that the matrix $\Lambda W\Lambda - \Lambda$ can have other flat directions not corresponding to the nullspace of the Hessian; these directions will cause instability, so one needs to distinguish them from directions corresponding to dead neurons.

Clearly, zero modes of Λ do not negatively affect the spectrum of the matrix S. On the contrary, one may expect a stabilizing effect, since the Cauchy interlacing theorem (Horn & Johnson, 2012, Theorem 4.3.28, Corollary 4.3.37) ensures that the spectrum of $\bar W$ lies between the minimal and maximal eigenvalues of W. So when $\lambda_{\max}(W\Lambda) \lesssim 1$ it is likely that $\lambda_{\max}(\bar W\bar\Lambda) < 1$. To demonstrate the effect, one needs to make some assumptions about the spectrum of W. Given the standard deep learning practice of using random matrices for the initialization of model weights Glorot & Bengio (2010), He et al. (2015), it is natural to assume that W is drawn from the Gaussian orthogonal ensemble. Under this condition, one may show a stabilization effect, as demonstrated next.

Proposition 3. Suppose $W \in \mathbb{R}^{N\times N}$ is a matrix from the Gaussian orthogonal ensemble. If k neurons are dead, $\mathbb{E}\|\bar W\|_2 = \sqrt{\tfrac{N-k}{N}}\;\mathbb{E}\|W\|_2$ holds for large N. Proof: Appendix E.

We gather the main results of this section in the proposition that follows.

Proposition 4. For energy function (2) the following is true:
1. Flat directions of dead neurons characterised by V form the nullspace of the Hessian Λ.
2. A steady state is stable if the energy function predicts local stability in the range of the Hessian, i.e., if the matrix $\Lambda - \Lambda W\Lambda$ restricted to the range of the Hessian is positive definite.
3. If one performed direct optimization of the energy function and found that $\tilde y$ corresponds to the minimum, it is possible to recover the steady state of dynamical system (1) as follows: (i) compute the projector $P_d$ on the nullspace of Λ; (ii) the steady state of the dynamical system reads $y^\star = \tilde y + P_d\left(W g(\tilde y) - \tilde y + b\right)$.
4. When the steady state $(y_a^\star, y_d^\star)$ is stable, the whole non-compact flat region $(y_a^\star, y_d^\star + Vc)$ belongs to the basin of attraction.

Footnote 5: Note, however, that for the case given in Figure 2b it is possible to construct a positively invariant set based on energy levels. After that, LaSalle's invariance principle (Haddad & Chellaboina, 2008, Theorem 3.3) guarantees that isolated steady states are asymptotically stable. This material is standard and covered in many books, e.g., (Haddad & Chellaboina, 2008, Section 3.3).

Footnote 6: We assume for simplicity that V has orthonormal columns.

4 ASSOCIATIVE MEMORIES WITHOUT DEAD NEURONS

We have seen that dead neurons present certain problems for energy function (2): (i) flat regions make optimization and analysis of the energy cumbersome; (ii) one needs to distinguish between harmless flat directions of the Hessian and other flat directions of the energy that can compromise stability; (iii) the energy does not contain full information, and one needs to build a projector onto the nullspace of the Hessian to restore the whole state; a small numerical sketch of this recovery step is given below.
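The sketch below is our own construction of a test case for the ReLU model (the projector and the fixed-point loop are not from the paper): we take a steady state of (1), move its dead coordinates, which leaves the energy unchanged, and then restore the steady state with the projector $P_d$ onto the nullspace of the Hessian, following item 3 of Proposition 4.

```python
import numpy as np

def relu(y):
    return np.maximum(y, 0.0)

rng = np.random.default_rng(3)
N = 10
A = rng.normal(size=(N, N))
A_sym = 0.5 * (A + A.T)
W = A_sym / (1.5 * np.abs(np.linalg.eigvalsh(A_sym)).max())
b = rng.normal(size=N)

# a steady state of (1): y* = W g(y*) + b, found by fixed-point iteration
y_star = rng.normal(size=N)
for _ in range(500):
    y_star = W @ relu(y_star) + b

# shift the dead coordinates: the energy is flat there, so an energy
# minimizer may return such a point instead of y*
dead = y_star <= 0
y_tilde = y_star.copy()
y_tilde[dead] -= rng.uniform(0.1, 2.0, size=dead.sum())

# repair with P_d, the projector onto the nullspace of the Hessian
# (for ReLU this is simply the projector onto the dead coordinates)
P_d = np.diag(dead.astype(float))
y_rec = y_tilde + P_d @ (W @ relu(y_tilde) - y_tilde + b)

print("residual before repair:", np.linalg.norm(W @ relu(y_tilde) - y_tilde + b))
print("residual after repair: ", np.linalg.norm(W @ relu(y_rec) - y_rec + b))
```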
The last point implies that one needs the Hessian even when stability analysis is not performed and one merely wants to compute a local minimum of the energy, which can often be done successfully with first-order methods. To avoid these difficulties we slightly modify the dynamics of associative memory (1) so as to minimally alter the steady states and obtain a better energy function. In simplified terms, our idea is to modify the right-hand side of (1) such that it becomes a conservative vector field. If this is the case, flat energy directions disappear. For example, the simplest model of this kind is
$\dot u(t) = W^\top g(Wu(t) + b) - u(t) = -\frac{\partial E(u)}{\partial u}$ with energy $E(u) = \tfrac{1}{2}\, u^\top u - L(Wu + b)$.
A more general dynamical system along the same lines reads:
$\dot u(t) = R(u)\left(g(Wu(t) + b) - u(t)\right), \qquad u(0) = u_0,$   (5)
where R(u) is a matrix-valued function to be specified later. If one assumes that it is possible to select R(u) with an empty nullspace for all u, i.e., that information is not lost after multiplication by R(u) (a condition that we verify later), the steady states of the newly introduced system (5) follow from the old ones via the affine transformation $y = Wu + b$. This means all architectures possible with the old model (1) are also possible with the new one (5). More precisely, the structure of steady states is the same for memory (1) and memory (5), but the structure of the basins of attraction is different owing to the matrix R(u) absent in model (1).

Dynamical systems (5) and (1) have equivalent steady states, but (5) allows one to construct a large set of Lyapunov functions with good properties in a systematic manner.

Proposition 5. For the parametric energy function
$E(u; \alpha, \beta, \gamma, S) = \alpha E_1(u) + \beta E_2(u) + \gamma E_3(u; S), \qquad \alpha, \beta, \gamma \in \mathbb{R},$
$E_1(u) = u^\top W g(Wu + b) - L(Wu + b) - \tfrac{1}{2}\, g(Wu + b)^\top W g(Wu + b),$
$E_2(u) = \tfrac{1}{2}\, u^\top W u - L(Wu + b),$
$E_3(u; S) = \tfrac{1}{2}\,\left(u - g(Wu + b)\right)^\top S \left(u - g(Wu + b)\right),$   (6)
the following is true:
1. $E(u; \alpha, \beta, \gamma, S)$ is a Lyapunov function for dynamical system (5) if one can find $Q \succeq 0$, R(u), $S = S^\top$ such that for all u
$R(u)^\top F(u) + F(u)^\top R(u) \succeq Q, \qquad F(u) = W^\top\Lambda(Wu + b)\,(\alpha W - \gamma S) + \beta W + \gamma S.$   (7)
2. For arbitrary parameters α, β, γ, S one can always find an orthogonal matrix R(u) such that $E(u; \alpha, \beta, \gamma, S)$ is a Lyapunov function for dynamical system (5).
3. $E_2(u)$ and $E_3(u; S)$ do not have flat directions corresponding to dead neurons.
4. A sufficient condition for stability of the steady state $u^\star$, $u^\star = g(Wu^\star + b)$, is
$F(u^\star)^\top\left(I - \Lambda(Wu^\star + b)W\right) + \left(I - W^\top\Lambda(Wu^\star + b)\right)F(u^\star) > 0.$   (8)
Proof: Appendix F.

Proposition 5 allows one to make several useful conclusions. First, the construction of the Lyapunov function boils down to the analysis of matrix inequality (7). In general, this inequality is nonlinear, but in particular cases it reduces to a Sylvester equation, which can be systematically analyzed Simoncini (2016). Below we provide several explicit choices of R that result in a valid Lyapunov function. The second point of Proposition 5 shows that a suitable R always exists. Moreover, this R does not alter the steady state, since it is an orthogonal matrix. Whether this choice of R is practical depends on the parametrization of the weights and other details of the model. The third point shows that unless β = γ = 0, the energy function does not have a flat direction, and the fourth point gives a sufficient condition for the stability of a particular state. This part is harder to analyze in general, so we will discuss this condition for selected examples below.
Finally, Proposition 5 does not require the Hessian to be positive definite, which allows one to use a richer set of activation functions.

Example 5 (α = γ = 0, β = 1). In this case $E = E_2$ and condition (7) becomes $R^\top W + W^\top R \succeq 0$. There are many strategies to select R and W:
1. The simplest choice is to take R = W, since in this case condition (7) reduces to $W^2 \succeq 0$, which is true for an arbitrary symmetric matrix.
2. Unfortunately, if the matrix W has low rank and one takes R = W, dynamical system (5) can lead to false memories whenever f(u(t)) reaches the nullspace of W. To exclude this possibility one can parametrize W = OP in terms of its polar decomposition and select R = O. With that choice condition (7) becomes $P \succeq 0$, which automatically follows from the polar decomposition.
3. We can restrict W to be positive definite by taking $W = KK^\top$ for some full-rank K. In this case, a large set of positive definite matrices R exists such that $R^\top W + W^\top R \succeq 0$. More specifically, one needs to restrict the condition number of R, as explained in Nicholson (1979).
4. If W is positive definite, the other option is to select an arbitrary Q > 0 and consider the Lyapunov equation $R^\top W + W^\top R = Q$, which is known to have a unique solution for arbitrary Q > 0. One can explicitly provide it in many forms Lancaster (1970), Simoncini (2016), for example $R = \int_0^\infty d\tau\, \exp(-\tau W)\, Q \exp(-\tau W)$.
The condition for the stability of a state (8) in this case reduces to $W - W\Lambda(Wu^\star + b)W > 0$. Note that the nullspace of Λ does not compromise stability, since it does not transfer to the nullspace of $W - W\Lambda W$.

Interestingly, one can also construct associative memory with non-symmetric weights, as explained in the next two examples.

Example 6 (α = β = 0, γ = 1 and $W \ne W^\top$). $E_3$ is the only energy function that is defined for $W \ne W^\top$. If we account for that in Proposition 5, existence condition (7) becomes $R^\top\left(I - W^\top\Lambda\right)S + S\left(I - \Lambda W\right)R \succeq 0$. Two examples of suitable R, S, W are given below:
1. The simplest choice is to take S = I and $R = I - W^\top\Lambda$. One may also ensure that R is invertible with $\|W\|_2 < \|\Lambda\|_2^{-1}$.
2. Another possible choice is to take R = Λ and ensure $\Lambda - \tfrac{1}{2}\Lambda\left(W + W^\top\right)\Lambda \succeq 0$, which can be simplified to $\omega(W) \le \lambda_{\min}(\Lambda)/\lambda_{\max}(\Lambda^2)$, where ω(W) is the numerical abscissa $\omega(W) = \lambda_{\max}(W + W^\top)/2$. For example, if $\Lambda \succ 0$, any W with non-positive real part of the spectrum suffices.

Example 7 (associative memory with smooth Leaky ReLU). Here we build a concrete example of hierarchical associative memory Krotov (2021) with a smooth Leaky ReLU Biswas et al. (2021) to avoid dealing with an ODE having a non-smooth right-hand side. The activation function reads
$g(u) = \frac{u\left(5 + 3\,\mathrm{erf}\left(\tfrac{3u}{8}\right)\right)}{8}$
and its derivative is
$g'(u) = \frac{g(u)}{u} + \frac{9u\exp\left(-\tfrac{9u^2}{64}\right)}{32\sqrt{\pi}},$
where all operations are pointwise and erf is the error function. The weights and state vector are selected as in Example 1 but without the requirement $W_{i,i+1} = W_{i+1,i}^\top$. After that we select $R = I - W^\top\Lambda$ and obtain the following dynamical system
$\dot u(t) = \left(I - W^\top g'(Wu(t) + b)\right)\left(g(Wu(t) + b) - u(t)\right), \qquad u(0) = u_0.$
For this system the Lyapunov function $E_3(u; I) = \tfrac{1}{2}\left(g(Wu(t) + b) - u(t)\right)^\top\left(g(Wu(t) + b) - u(t)\right)$ is non-increasing on trajectories. The matrix $I - W^\top g'(Wu(t) + b)$ can in principle have a non-trivial nullspace. To avoid this one observes that $\|W\|_2 \le \max_i\|W_{i,i+1}\|_2 + \max_i\|W_{i,i-1}\|_2$ (the bound is tight) and takes $W_{i,i\pm 1} = \widetilde W_{i,i\pm 1}/\left(2\|\widetilde W_{i,i\pm 1}\|_2\right)$ for some $\widetilde W$ with the same block structure. Note that in this construction Λ is not positive definite and W is not symmetric. All theoretical statements that we derived are valid only for the modified dynamical system (5). A minimal numerical sketch in the spirit of this construction is given below.
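Below is a minimal sketch of the modified dynamics, with two simplifications that are ours rather than the paper's: the activation is tanh instead of the smooth Leaky ReLU, and the weights are a dense non-symmetric random matrix rather than the block-structured W of Example 1. The energy $E_3(u; I) = \tfrac{1}{2}\|u - g(Wu + b)\|_2^2$ should be non-increasing along the discretized trajectory up to step-size error.

```python
import numpy as np

def g(x):
    # a smooth activation; Example 7 uses a smooth Leaky ReLU,
    # we substitute tanh purely for brevity
    return np.tanh(x)

def g_prime(x):
    return 1.0 - np.tanh(x) ** 2

def energy_E3(u, W, b):
    # E_3(u; I) = 0.5 * ||u - g(Wu + b)||^2: no flat directions from dead neurons
    f = u - g(W @ u + b)
    return 0.5 * f @ f

def step(u, W, b, dt=1e-2):
    # modified dynamics (5): du/dt = R(u) (g(Wu + b) - u),
    # with the choice R(u) = I - W^T Lambda(Wu + b) as in Examples 6 and 7
    lam = g_prime(W @ u + b)
    R = np.eye(len(u)) - W.T * lam          # W.T * lam == W^T @ diag(lam)
    return u + dt * R @ (g(W @ u + b) - u)

rng = np.random.default_rng(4)
N = 12
W = 0.4 * rng.normal(size=(N, N)) / np.sqrt(N)   # non-symmetric weights
b = 0.1 * rng.normal(size=N)
u = rng.normal(size=N)

energies = [energy_E3(u, W, b)]
for _ in range(3000):
    u = step(u, W, b)
    energies.append(energy_E3(u, W, b))
print("final energy:", energies[-1])
print("max energy increase along the trajectory:", max(np.diff(energies)))
```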
However, it is easy to see that the energy function $E_3$ from Proposition 5 can be adapted for the original temporal dynamics (1). The resulting dynamical system and energy function read
$\dot y(t) = R(y(t))\left(W g(y(t)) - y(t) + b\right), \qquad y(0) = y_0,$   (9)
$E(y) = \tfrac{1}{2}\left(W g(y(t)) - y(t) + b\right)^\top S\left(W g(y(t)) - y(t) + b\right).$   (10)
We used temporal dynamics (9) with $R = I - W\Lambda(y)$ and energy function (10) with S = I to generate the results given in Figure 1. One can also prove results similar to Proposition 5 for energy function (10), but since the extension is straightforward we do not pursue it further.

5 CONCLUSIONS

We describe the effect of dead neurons on the energy landscape of associative memory, analyze the consequences for stability, and provide several remedies, including new dynamical systems with good energy functions. We think it is appropriate to discuss the overall significance of the Lyapunov function for associative memory. Is it necessary to have this function at all? How is this function used in practice?

One may observe that, currently, the Lyapunov function is underutilized. As a rule, one uses several steps or even a single step of the temporal dynamics of the ODE during training and inference Hoover et al. (2024), Hoover et al. (2022), Millidge et al. (2022a), Ramsauer et al. (2020), Krotov & Hopfield (2016). The role of the Lyapunov function is merely to provide comfort that some steady states may exist somewhere. As we show in this article, it is often a false comfort. It is important to note that the Lyapunov function in itself does not ensure stability: (i) it may be the case that no steady state exists, (ii) limit cycles may still be present, (iii) all steady states may be unstable. Besides that, for any steady state a Lyapunov function can be constructed (by solving the Lyapunov equation) even when a global Lyapunov function is unavailable. The Lyapunov function, as it is currently considered, is not of huge help.

We can suggest several appropriate use cases: (i) model parameters may be adjusted at the training stage to make the Lyapunov function unstable for particular states, something that cannot be easily done by integrating the dynamical system alone; this may help to learn from negative and adversarial examples Wang et al. (2024), Goodfellow et al. (2014); (ii) the Lyapunov function may help in the theoretical understanding of memory capacity, e.g., in the present article we have already shown that a whole flat direction corresponds to one basin of attraction and cannot support more than a single memory; (iii) the Lyapunov function may be directly used to compute and manipulate basins of attraction, potentially speeding up learning and making memory more robust to adversarial attacks. All in all, we think that the Lyapunov function is a powerful tool that can lead to novel theoretical results and practical learning techniques in the field of associative memory. We hope that our results will inspire further research in this direction.

REFERENCES

Koushik Biswas, Sandeep Kumar, Shilpak Banerjee, and Ashish Kumar Pandey. SMU: Smooth activation function for deep networks using smoothing maximum technique. arXiv preprint arXiv:2111.04682, 2021.

Ricky T. Q. Chen, Yulia Rubanova, Jesse Bettencourt, and David K. Duvenaud. Neural ordinary differential equations. Advances in Neural Information Processing Systems, 31, 2018.

Marco Chiani. Distribution of the largest eigenvalue for real Wishart and Gaussian random matrices and a simple approximation for the Tracy-Widom distribution.
Journal of Multivariate Analysis, 129:69-81, 2014.

Chenhao Cui and Tom Fearn. Modern practical convolutional neural networks for multivariate regression: Applications to NIR calibration. Chemometrics and Intelligent Laboratory Systems, 182:9-20, 2018.

Xavier Glorot and Yoshua Bengio. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pp. 249-256. JMLR Workshop and Conference Proceedings, 2010.

Ian J. Goodfellow, Jonathon Shlens, and Christian Szegedy. Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572, 2014.

Wassim M. Haddad and VijaySekhar Chellaboina. Nonlinear Dynamical Systems and Control: A Lyapunov-Based Approach. Princeton University Press, 2008.

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. In Proceedings of the IEEE International Conference on Computer Vision, pp. 1026-1034, 2015.

Benjamin Hoover, Duen Horng Chau, Hendrik Strobelt, and Dmitry Krotov. A universal abstraction for hierarchical Hopfield networks. In The Symbiosis of Deep Learning and Differential Equations II, 2022.

Benjamin Hoover, Hendrik Strobelt, Dmitry Krotov, Judy Hoffman, Zsolt Kira, and Duen Horng Chau. Memory in plain sight: A survey of the uncanny resemblances between diffusion models and associative memories. arXiv preprint arXiv:2309.16750, 2023.

Benjamin Hoover, Yuchen Liang, Bao Pham, Rameswar Panda, Hendrik Strobelt, Duen Horng Chau, Mohammed Zaki, and Dmitry Krotov. Energy transformer. Advances in Neural Information Processing Systems, 36, 2024.

John J. Hopfield. Neural networks and physical systems with emergent collective computational abilities. Proceedings of the National Academy of Sciences, 79(8):2554-2558, 1982.

John J. Hopfield. Neurons with graded response have collective computational properties like those of two-state neurons. Proceedings of the National Academy of Sciences, 81(10):3088-3092, 1984.

Roger A. Horn and Charles R. Johnson. Matrix Analysis. Cambridge University Press, 2012.

Dmitry Krotov. Hierarchical associative memory. arXiv preprint arXiv:2107.06446, 2021.

Dmitry Krotov. A new frontier for Hopfield networks. Nature Reviews Physics, 5(7):366-367, 2023.

Dmitry Krotov and John Hopfield. Large associative memory problem in neurobiology and machine learning. arXiv preprint arXiv:2008.06996, 2020.

Dmitry Krotov and John J. Hopfield. Dense associative memory for pattern recognition. Advances in Neural Information Processing Systems, 29, 2016.

Peter Lancaster. Explicit solutions of linear matrix equations. SIAM Review, 12(4):544-566, 1970.

Yann LeCun, Sumit Chopra, Raia Hadsell, M. Ranzato, Fujie Huang, et al. A tutorial on energy-based learning. Predicting Structured Data, 1(0), 2006.

Lu Lu, Yeonjong Shin, Yanhui Su, and George Em Karniadakis. Dying ReLU and initialization: Theory and numerical examples. arXiv preprint arXiv:1903.06733, 2019.

Aleksandr Mikhailovich Lyapunov. The general problem of the stability of motion. International Journal of Control, 55(3):531-534, 1992.

Andrew L. Maas, Awni Y. Hannun, Andrew Y. Ng, et al. Rectifier nonlinearities improve neural network acoustic models. In Proc. ICML, volume 30, pp. 3. Atlanta, GA, 2013.

Beren Millidge, Tommaso Salvatori, Yuhang Song, Thomas Lukasiewicz, and Rafal Bogacz.
Universal Hopfield networks: A general framework for single-shot associative memory models. In International Conference on Machine Learning, pp. 15561-15583. PMLR, 2022a.

Beren Millidge, Yuhang Song, Tommaso Salvatori, Thomas Lukasiewicz, and Rafal Bogacz. Backpropagation at the infinitesimal inference limit of energy-based models: Unifying predictive coding, equilibrium propagation, and contrastive Hebbian learning. arXiv preprint arXiv:2206.02629, 2022b.

D. W. Nicholson. Eigenvalue bounds for AB+BA, with A, B positive definite matrices. Linear Algebra and its Applications, 24:173-184, 1979. ISSN 0024-3795. doi: https://doi.org/10.1016/0024-3795(79)90157-5. URL https://www.sciencedirect.com/science/article/pii/0024379579901575.

Richard S. Palais. A simple proof of the Banach contraction principle. Journal of Fixed Point Theory and Applications, 2:221-223, 2007.

Hubert Ramsauer, Bernhard Schäfl, Johannes Lehner, Philipp Seidl, Michael Widrich, Thomas Adler, Lukas Gruber, Markus Holzleitner, Milena Pavlović, Geir Kjetil Sandve, et al. Hopfield networks is all you need. arXiv preprint arXiv:2008.02217, 2020.

Valeria Simoncini. Computational methods for linear matrix equations. SIAM Review, 58(3):377-441, 2016.

Steven H. Strogatz. Nonlinear Dynamics and Chaos: With Applications to Physics, Biology, Chemistry, and Engineering. CRC Press, 2018.

Fei Tang and Michael Kopp. A remark on a paper of Krotov and Hopfield [arXiv:2008.06996]. arXiv preprint arXiv:2105.15034, 2021.

Ilya O. Tolstikhin, Neil Houlsby, Alexander Kolesnikov, Lucas Beyer, Xiaohua Zhai, Thomas Unterthiner, Jessica Yung, Andreas Steiner, Daniel Keysers, Jakob Uszkoreit, et al. MLP-Mixer: An all-MLP architecture for vision. Advances in Neural Information Processing Systems, 34:24261-24272, 2021.

Renxi Wang, Haonan Li, Xudong Han, Yixuan Zhang, and Timothy Baldwin. Learning from failure: Integrating negative examples when fine-tuning large language models as agents. arXiv preprint arXiv:2402.11651, 2024.

Xiaohui Xie and H. Sebastian Seung. Equivalence of backpropagation and contrastive Hebbian learning in a layered network. Neural Computation, 15(2):441-454, 2003.

A RELATIONS BETWEEN MEMORY MODELS

Our model is most directly related to the Dense Associative Memory described in (Krotov, 2021, Equation (2)). This equation reads
$\tau_I \frac{dx_I}{dt} = \sum_{J=1}^{N} W_{IJ}\, g_J - x_I,$
where $x_I$, $I = 1, \ldots, N$, are activities of individual neurons, $g_J = \partial L(\{x_I\})/\partial x_J$ is the activation function, which is a gradient of the Lagrange function L that depends on all activities, and $\tau_I$ controls the relaxation time of individual neurons. The energy function used in that work is available in (Krotov, 2021, Equation (3)) and is given below:
$E = \sum_{I=1}^{N} x_I g_I - L - \tfrac{1}{2}\sum_{I,J=1}^{N} g_I W_{IJ} g_J.$
It is easy to see that there is a direct relation between dynamical system (1) and energy (2) we use in this article and the ones from Krotov (2021): $W_{IJ}$ are the elements of the matrix W, and the activities $x_I$ are the components of y. Two differences between the model from this article and Dense Associative Memory are: the bias term b that we use is absent from Dense Associative Memory Krotov (2021), and the time scales $\tau_I$ are absent from our model. The bias term can be simply removed by taking b = 0, and since it is not used in the stability analysis, this does not affect our results. Observe that if b = 0, the energy that we use (2) is precisely the same as in Krotov (2021).
Similarly, the time scales $\tau_I$ are absent from the definition of the energy function in Krotov (2021), so they do not influence flat energy regions. However, they mildly influence the stability analysis, since after linearisation the Jacobian will be multiplied by $D^{-1}$, where D is a diagonal matrix containing the time scales on the diagonal.

Next we discuss the relation to the model (Krotov & Hopfield, 2020, Equation (1)). This model seems to be quite different from our formulation, but it is a specific case of Dense Associative Memory (Krotov, 2021, Equation (2)) with a bias term. To see this we describe the model given in (Krotov & Hopfield, 2020, Equation (1)). In the article Krotov & Hopfield (2020) the authors split neurons into two parts: the activations of memory neurons are $h_\mu$, $\mu = 1, \ldots, N_h$, and the activations of feature neurons are $v_i$, $i = 1, \ldots, N_f$. These neurons have distinct activation functions $f_\mu = f(h_\mu)$ and $g_i = g(v_i)$, which are defined as derivatives of the corresponding Lagrange functions, $f_\mu = \partial L_h(\{h_\mu\})/\partial h_\mu$ and $g_i = \partial L_v(\{v_i\})/\partial v_i$ (Krotov & Hopfield, 2020, Equation (3)). The dynamical equations for these activations (Krotov & Hopfield, 2020, Equation (1)) are reproduced below:
$\tau_f \frac{dv_i}{dt} = \sum_{\mu=1}^{N_h} \xi_{i\mu} f_\mu - v_i + I_i, \qquad \tau_h \frac{dh_\mu}{dt} = \sum_{i=1}^{N_f} \xi_{\mu i} g_i - h_\mu,$
where $\xi_{\mu i} = \xi_{i\mu}$ are symmetric weights describing the weights of synapses and $I_i$ is the input current into the feature neurons. The Lyapunov function given in (Krotov & Hopfield, 2020, Equation (2)) reads
$E = \sum_{i=1}^{N_f}(v_i - I_i)\, g_i - L_v + \sum_{\mu=1}^{N_h} h_\mu f_\mu - L_h - \sum_{\mu, i} f_\mu \xi_{\mu i} g_i.$
To see that this model is related to (Krotov, 2021, Equation (2)), consider a special weight matrix $W_{IJ}$, state $x_I$ and time scales $\tau_I$ with the following block structure:
$W = \begin{pmatrix} 0 & \xi \\ \xi^\top & 0 \end{pmatrix}, \qquad \tau = \begin{pmatrix} \tau_f e_v \\ \tau_h e_h \end{pmatrix},$
where $e_v$ and $e_h$ are all-ones vectors of sizes $N_f$, $N_h$. With this split and the Lagrangian $L(\{x_I\}) = L_v(\{v_i\}) + L_h(\{h_\mu\})$ we have
$\sum_J W_{IJ}\, g_J - x_I \;\longleftrightarrow\; W\frac{\partial L}{\partial x} - x = \begin{pmatrix} 0 & \xi \\ \xi^\top & 0 \end{pmatrix}\begin{pmatrix} \partial L_v/\partial v \\ \partial L_h/\partial h \end{pmatrix} - \begin{pmatrix} v \\ h \end{pmatrix},$
so we see that we reproduce (Krotov & Hopfield, 2020, Equation (1)) without the bias term $I_i$. In the same way we reproduce the energy function, again without $I_i$. Given that, both models (Krotov & Hopfield, 2020, Equation (1)) and (Krotov, 2021, Equation (2)) are directly related to (1). The only difference is that these models contain the τ variables that speed up or slow down the temporal dynamics for selected neurons. Importantly, the energy functions from the models Krotov & Hopfield (2020), Krotov (2021) are equivalent to (2).

Note that other modifications of memory models are available. The main problematic part that leads to flat energy is the Lagrange function introduced in Krotov & Hopfield (2020). So, whenever this function is used, the energy will have non-compact flat regions corresponding to dead neurons. To give an example, in Millidge et al. (2022a) the authors build a modified model but use a non-trivial Lagrange function $L_h$ in their Equation (3). If $f(h) = \mathrm{sep}(h)$ can produce dead neurons, the energy from Millidge et al. (2022a) will have a non-compact flat region. One such situation occurs when the softmax function is used as $\mathrm{sep}(h)$ in (Millidge et al., 2022a, Equation (5)).

B PROOF OF PROPOSITION 1: DEAD NEURONS LEAD TO NON-COMPACT FLAT ENERGY REGIONS

We assume that we are at a point y in a region where neurons are dead, so there is V such that $g(y + Vc) = g(y)$ for arbitrary c with positive components. We substitute $y + Vc$ into the definition of the energy to obtain
$E(y + Vc) = (y + Vc - b)^\top g(y + Vc) - L(y + Vc) - \tfrac{1}{2}\, g(y + Vc)^\top W g(y + Vc).$
First, we use $g(y + Vc) = g(y)$ to remove the shift from the activation functions:
$E(y + Vc) = (y + Vc - b)^\top g(y) - L(y + Vc) - \tfrac{1}{2}\, g(y)^\top W g(y).$
Next, we use the Taylor series with mean-value remainder to expand the Lagrange function:
$L(y + Vc) = L(y) + \frac{\partial L}{\partial y}\Big|_{y + \tau Vc}^\top Vc, \qquad \tau \in (0, 1).$
Since by definition $g(y) = \partial L(y)/\partial y$, the expression above can be presented as follows:
$L(y + Vc) = L(y) + g(y + \tau Vc)^\top Vc, \qquad \tau \in (0, 1).$
Now, since $\tau > 0$, we use $g(y + \tau Vc) = g(y)$ to remove the shift and obtain
$L(y + Vc) = L(y) + g(y)^\top Vc.$
We substitute this into the last expression for the energy, which gives
$E(y + Vc) = (y + Vc - b)^\top g(y) - L(y) - g(y)^\top Vc - \tfrac{1}{2}\, g(y)^\top W g(y).$
The two remaining terms containing c, namely $(Vc)^\top g(y)$ and $-g(y)^\top Vc$, cancel each other, so we obtain
$E(y + Vc) = (y - b)^\top g(y) - L(y) - \tfrac{1}{2}\, g(y)^\top W g(y) = E(y),$
as claimed in the proposition.

C EXAMPLE OF MEMORY CAPACITY DEGRADATION FOR ENERGY FUNCTION BOUNDED FROM BELOW

Figure 2a shows an energy function that is unbounded from below but supports two stable states. In this section we argue that unbounded energy is not problematic. Moreover, if one is too strict about this property, the memory can severely degrade in capacity. It is easy to find papers on associative memory that claim that it is necessary to have an energy function bounded from below to ensure stable memory recovery, e.g., see the discussion after equation (4) (in both cases) in Krotov (2021) and Krotov & Hopfield (2020). In Xie & Seung (2003) the authors even incorrectly claim "Furthermore, with appropriately chosen fk, such as sigmoid functions, E(x) is also bounded below, in which case E(x) is a Lyapunov function". Clearly, a Lyapunov function is not required to be bounded.

We certainly agree that if the energy function is bounded and informative (not flat), a memory vector will be recovered eventually from arbitrary initial conditions. However, we have several arguments against the overall importance of this condition:

1. Unbounded at infinity only. Activation functions used for associative memory are at least continuous, meaning the energy can be unbounded only when the activities of neurons reach infinity. This kind of run-away behavior is very easy to detect and prevent in an actual implementation of associative memory.
2. Finite number of updates in practice. In practice the ODE is not integrated for a very long time. As we discuss in Section 5, one usually performs a small number of steps of the discretised dynamics, e.g., 10 or even 1 step. Under these conditions it is not possible to diverge using the ODEs considered in this article.
3. Stability is a local condition. For associative memory one cares more about local stability in the basins of attraction, and the fact that the energy is bounded from below is global information, which is seldom important for local stability (with the important exception discussed below). In practice one can always rescale input vectors (initial conditions) to localize them on a sufficiently small sphere of radius R. Memory trained and used this way has little chance to diverge.
4. Extra constraints can compromise capacity. If we require the energy to be bounded from below, we will put extra constraints on the parameters of the associative memory. These constraints, as we will show below, can decrease capacity dramatically.

The last point, the decrease in capacity, is especially interesting. We illustrate it with the ReLU memory model
$\frac{dy(t)}{dt} = W\,\mathrm{ReLU}(y(t)) - y(t) + b, \qquad y(0) = y_0,$
that has energy
$E(y) = (y - b)^\top\mathrm{ReLU}(y) - \sum_i\tfrac{1}{2}(\mathrm{ReLU}(y_i))^2 - \tfrac{1}{2}\,\mathrm{ReLU}(y)^\top W\,\mathrm{ReLU}(y).$
To analyze this energy function we observe that $y^\top\mathrm{ReLU}(y) = \mathrm{ReLU}(y)^\top\mathrm{ReLU}(y)$.
This fact allows us to simplify the energy to
$E(y) = \tfrac{1}{2}\,\mathrm{ReLU}(y)^\top(I - W)\,\mathrm{ReLU}(y) - b^\top\mathrm{ReLU}(y).$
It is possible to bound this energy function from below using the spectral radius ρ of W:
$E(y) \ge \frac{1 - \rho(W)}{2}\,\mathrm{ReLU}(y)^\top\mathrm{ReLU}(y) - b^\top\mathrm{ReLU}(y);$
note that this bound is tight unless we restrict the allowed weights, since it is saturated for W = αI for scalar α. From this lower bound we can observe that the memory is bounded from below if $\rho(W) \le 1$. Moreover, in light of our previous comment, it is a necessary and sufficient condition.

Suppose now we restrict W such that $\rho(W) = 1 - \epsilon$ with arbitrarily small but nonzero ϵ. In this case the ReLU memory has a single memory vector. We start by showing that a steady state of the temporal dynamics exists. For that we consider the discrete iteration $y^{(n+1)} = W\,\mathrm{ReLU}(y^{(n)}) + b$. Observe that the right-hand side is a Lipschitz function with Lipschitz constant $1 - \epsilon$:
$\|W\,\mathrm{ReLU}(c_1) + b - (W\,\mathrm{ReLU}(c_2) + b)\|_2 = \|W(\mathrm{ReLU}(c_1) - \mathrm{ReLU}(c_2))\|_2 \le \|W\|_2\,\|\mathrm{ReLU}(c_1) - \mathrm{ReLU}(c_2)\|_2 \le \|W\|_2\,\|c_1 - c_2\|_2 = (1 - \epsilon)\,\|c_1 - c_2\|_2.$
Given that, by the Banach fixed point theorem Palais (2007) these iterations have a unique fixed point. This fixed point is a steady state, since $y^\star = W\,\mathrm{ReLU}(y^\star) + b$ implies $W\,\mathrm{ReLU}(y^\star) - y^\star + b = 0$, so the right-hand side of the dynamical system is zero.

Now, we will show that this steady state is unique. For that we consider two trajectories starting from (arbitrary) distinct initial conditions $y_1(t)$, $y_2(t)$: $y_1(0) = y_1^0$, $y_2(0) = y_2^0$. We will show that the distance between $y_1(t)$ and $y_2(t)$ decreases with time. To do that we consider the derivative of this distance:
$\tfrac{1}{2}\frac{d}{dt}\|y_1(t) - y_2(t)\|_2^2 = (y_1(t) - y_2(t))^\top\left(\frac{dy_1(t)}{dt} - \frac{dy_2(t)}{dt}\right) = (y_1(t) - y_2(t))^\top\left(W\,\mathrm{ReLU}(y_1(t)) - y_1(t) - W\,\mathrm{ReLU}(y_2(t)) + y_2(t)\right) = -\|y_1(t) - y_2(t)\|_2^2 + (y_1(t) - y_2(t))^\top\left(W\,\mathrm{ReLU}(y_1(t)) - W\,\mathrm{ReLU}(y_2(t))\right).$
To bound the second term we use that $a \le |a|$ and the Cauchy-Schwarz inequality:
$\tfrac{1}{2}\frac{d}{dt}\|y_1(t) - y_2(t)\|_2^2 \le -\|y_1(t) - y_2(t)\|_2^2 + \|y_1(t) - y_2(t)\|_2\,\|W\,\mathrm{ReLU}(y_1(t)) - W\,\mathrm{ReLU}(y_2(t))\|_2.$
Finally, we use the result that $W\,\mathrm{ReLU}(\cdot)$ is Lipschitz with Lipschitz constant $1 - \epsilon$ to obtain
$\tfrac{1}{2}\frac{d}{dt}\|y_1(t) - y_2(t)\|_2^2 \le -\|y_1(t) - y_2(t)\|_2^2 + (1 - \epsilon)\,\|y_1(t) - y_2(t)\|_2^2 = -\epsilon\,\|y_1(t) - y_2(t)\|_2^2.$
Using the Grönwall inequality we obtain
$\|y_1(t) - y_2(t)\|_2^2 \le e^{-2\epsilon t}\,\|y_1(0) - y_2(0)\|_2^2.$
In other words, separated trajectories converge exponentially fast. Since there is a fixed point, one of the trajectories might as well start from $y^\star$, which means this fixed point is the only one, and it is exponentially stable. The result that we demonstrate here means one should be careful restricting the parameters of associative memory, since it may lead to a degradation of capacity to a single memory. The proof above does not work for $\rho(W) = 1$, but since ϵ can be arbitrarily small one can recover the same result in the limit $\epsilon \to 0$.

D MEMORY MODEL FROM HOPFIELD (1984)

In this section we align our notation with the one from Hopfield (1984). The model of interest is given in (Hopfield, 1984, Equation (5)). We replicate this model below:
$C_i\frac{du_i}{dt} = \sum_j T_{ij} V_j - u_i/R_i + I_i, \qquad u_i = g_i^{-1}(V_i),$
where $u_i$ is an instantaneous input to neuron i, $V_i$ is the output or "short term average of the firing rate of the cell i", $g_i$ is an activation function or "the input-output characteristic of a nonlinear amplifier with negligible response time", $T_{ij}$ are weights and $T_{ij}^{-1}$ a "finite impedance between the output $V_j$ and the cell body of cell i", $R_i$ is the "transmembrane resistance", $C_i$ is the "input capacitance", and $I_i$ is a bias or "any other (fixed) input current to neuron i".
If we put aside the biological content of Hopfield (1984), this equation is already very similar to all models we discussed in Appendix A. To make the resemblance even more evident, observe that $R_i$ can be removed without loss of generality, since we can multiply by it and redefine $C_i \to R_i C_i$, $T_{ij} \to R_i T_{ij}$, $I_i \to R_i I_i$. After that we also substitute $V_i = g(u_i)$ and obtain the following equation:
$C_i\frac{du_i}{dt} = \sum_j T_{ij}\, g_j(u_j) - u_i + I_i,$
which is precisely the Dense Associative Memory (Krotov, 2021, Equation (2)) with weights $T_{ij}$, time variables $C_i$ and an extra bias term $I_i$. Since we already discussed the relation of Dense Associative Memory to our model in Appendix A, we can be sure that the model from Hopfield (1984) is in the same class of memory models.

Dynamical system (Hopfield, 1984, Equation (5)) comes with the energy function (Hopfield, 1984, Equation (7)):
$E = -\tfrac{1}{2}\sum_{ij} T_{ij} V_i V_j + \sum_i \frac{1}{R_i}\int_0^{V_i} dV\, g_i^{-1}(V) - \sum_i I_i V_i.$
Using $V_i = g_i(u_i)$ we obtain
$E = -\tfrac{1}{2}\sum_{ij} T_{ij}\, g_i(u_i)\, g_j(u_j) + \sum_i \frac{1}{R_i}\int_0^{g_i(u_i)} dV\, g_i^{-1}(V) - \sum_i I_i\, g_i(u_i).$
In addition to that, we can drop $R_i$ for the reason explained above:
$E = -\tfrac{1}{2}\sum_{ij} T_{ij}\, g_i(u_i)\, g_j(u_j) + \sum_i \int_0^{g_i(u_i)} dV\, g_i^{-1}(V) - \sum_i I_i\, g_i(u_i).$
The first and last sums are easily mapped onto our notation, and the second sum is explained in Section 3.1. All in all, one can say that our model is similar to the one from Hopfield (1984) but with $C_i = R_i = 1$. As we argued, $R_i$ is irrelevant and $C_i$ does not appear in the energy function.

E PROOF OF PROPOSITION 3: STABILISATION OF DYNAMICS FOR RANDOM MATRICES

The matrix $\bar W$ can be written as $\bar W = D^\top O^\top W O D$, where O is an orthogonal full-rank matrix and D consists of the first N - k columns of the identity I. Since the GOE is invariant under orthogonal transformations, $O^\top W O$ is also a matrix from the GOE. The effect of D is to select a principal submatrix with N - k rows and columns. Given that, $\bar W$ is also from the GOE but with N - k rows and columns. For the GOE with large N, the distribution of the leading eigenvalue quickly converges to the Tracy-Widom distribution (see Chiani (2014), equation (50), and numerical results), so, on average, the spectral radius of a matrix with N - k rows and columns is $\sqrt{2(N-k)}$ (see below) and
$\frac{\mathbb{E}\|\bar W\|_2}{\mathbb{E}\|W\|_2} = \sqrt{\frac{N-k}{N}}.$
To obtain the average spectral radius of a GOE matrix M we use the theory described in Section 4 of Chiani (2014). Let $\lambda_1$ be the spectral radius of the matrix $M \in \mathbb{R}^{N\times N}$. We define the Gaussian orthogonal ensemble as a set of random symmetric matrices with independent identically normally distributed entries (for the upper triangular part) with variances 1 and 1/2 for entries on and off the diagonal, respectively. For sufficiently large N it is known that
$\bar\lambda_1 = \mu_N + \mu_{TW1}\,\sigma_N,$   (11)
where $\bar\lambda_1$ is the mean value of the spectral radius, $\mu_{TW1} \approx -1.2$ is the mean value of the Tracy-Widom distribution, $\mu_N = \sqrt{2N - 1}$, and $\sigma_N$ is the corresponding Tracy-Widom scale parameter of order $N^{-1/6}$ (see Chiani (2014) for its exact form). So we see that for large N the average spectral radius of the GOE matrix is $\sqrt{2N}$.

F PROOF OF PROPOSITION 5: ASSOCIATIVE MEMORY WITHOUT DEAD NEURONS

Observe that $\frac{\partial L(Wu+b)}{\partial u} = W^\top g(Wu + b)$ and $\frac{\partial g(Wu+b)}{\partial u} = \Lambda(Wu + b)\, W$.

1. Using the identities above we find
$\frac{\partial E_1}{\partial u} = W\Lambda(Wu + b)\, W\, (u - g(Wu+b)), \quad \frac{\partial E_2}{\partial u} = W\, (u - g(Wu+b)), \quad \frac{\partial E_3}{\partial u} = \left(I - W^\top\Lambda(Wu + b)\right) S\, (u - g(Wu+b)).$
Multiplying the gradients by α, β, γ and adding them gives $\frac{\partial E(u)}{\partial u} = F(u)\, f(u)$. Since $\dot u = -R(u)\, f(u(t))$, where $f(u(t)) = u - g(Wu+b)$, from $\dot E = \dot u^\top \frac{\partial E}{\partial u}$ we obtain
$\dot E = -\tfrac{1}{2}\, f(u)^\top\left(R(u)^\top F(u) + F(u)^\top R(u)\right) f(u).$
If one can find a positive semidefinite matrix Q such that $R^\top F(u) + F(u)^\top R \succeq Q$, the energy is non-increasing on trajectories since in this case $\dot E \le -\tfrac{1}{2} f^\top Q f \le 0$.

2. Consider the polar decomposition of the matrix $F(u) = O(u)P(u)$ and take $R(u) = O(u)$.
Since $P(u) \succeq 0$, we have $R^\top F(u) + F(u)^\top R = 2P(u) \succeq 0$ and $\dot E \le 0$.

3. From the proof of Proposition 1 we know that if $Wu + b \to Wu + b + Vc$, the Lagrange function shifts by $g(Wu + b)^\top Vc$. The first term of $E_2$ in (6) is quadratic and does not contain g, so in general the energy $E_2$ does not have flat directions. Similarly, $E_3$ has the form $f(u)^\top S f(u)$, where f is not invariant under the transformation $Wu + b \to Wu + b + Vc$, meaning $E_3$ is not invariant as well.

4. Since $\frac{\partial E}{\partial u} = F(u) f(u)$ and $f(u^\star) = 0$, we find for sufficiently small δ
$E(u^\star + \delta) - E(u^\star) \approx \delta^\top F(u^\star)^\top \frac{\partial f(u)}{\partial u}\Big|_{u = u^\star}\delta.$
For stability one needs this quadratic form to be positive definite, which gives (8).