Long Sequence Hopfield Memory

Hamza Tahir Chaudhry1,2, Jacob A. Zavatone-Veth2,3, Dmitry Krotov5, Cengiz Pehlevan1,2,4
1John A. Paulson School of Engineering and Applied Sciences, 2Center for Brain Science, 3Department of Physics, 4Kempner Institute for the Study of Natural and Artificial Intelligence, Harvard University, Cambridge, MA 02138
5MIT-IBM Watson AI Lab, IBM Research, Cambridge, MA 02142
hchaudhry@g.harvard.edu, jzavatoneveth@g.harvard.edu, krotov@ibm.com, cpehlevan@seas.harvard.edu

Abstract

Sequence memory is an essential attribute of natural and artificial intelligence that enables agents to encode, store, and retrieve complex sequences of stimuli and actions. Computational models of sequence memory have been proposed where recurrent Hopfield-like neural networks are trained with temporally asymmetric Hebbian rules. However, these networks suffer from limited sequence capacity (maximal length of the stored sequence) due to interference between the memories. Inspired by recent work on Dense Associative Memories, we expand the sequence capacity of these models by introducing a nonlinear interaction term, enhancing separation between the patterns. We derive novel scaling laws for sequence capacity with respect to network size, significantly outperforming existing scaling laws for models based on traditional Hopfield networks, and verify these theoretical results with numerical simulation. Moreover, we introduce a generalized pseudoinverse rule to recall sequences of highly correlated patterns. Finally, we extend this model to store sequences with variable timing between state transitions and describe a biologically plausible implementation, with connections to motor neuroscience.

1 Introduction

Memory is an essential ability of intelligent agents that allows them to encode, store, and retrieve information and behaviors they have learned throughout their lives. In particular, the ability to recall sequences of memories is necessary for a large number of cognitive tasks with temporal or causal structure, including navigation, reasoning, and motor control [1–9]. Computational models with varying degrees of biological plausibility have been proposed for how neural networks can encode sequence memory [1–3, 10–22]. Many of these are based on the concept of associative memory, also known as content-addressable memory, which refers to the ability of a system to recall a set of objects or ideas when prompted by a distortion or subset of them. Modeling associative memory has been an extremely active area of research in computational neuroscience and deep learning for many years, with the Hopfield network becoming the canonical model [23–25].

Unfortunately, a major limitation of the traditional Hopfield Network and related associative memory models is their capacity: the number of memories they can store and reliably retrieve scales only linearly with the number of neurons in the network. This limitation is due to interference between different memories during recall, also known as crosstalk, which decreases the signal-to-noise ratio. Large amounts of crosstalk result in the recall of undesired attractor states of the network [26–29].

37th Conference on Neural Information Processing Systems (NeurIPS 2023).
Figure 1: SeqNet and the Polynomial DenseNet (d = 2) are simulated with N = 300 neurons and P = 100 patterns. One hundred curves are plotted as a function of time, each representing the overlap of the network state at time t with one of the patterns, $m^\mu = \frac{1}{N}\sum_{i=1}^{N}\xi^\mu_i S_i$. The curves are ordered using the color code described on the right (patterns at the beginning and end of the sequence are shaded in yellow and red, respectively). A. SeqNet quickly loses the correct sequence, indicated by the lack of alignment of the network state with the correct pattern in the sequence ($m^\mu \ll 1$). B. The Polynomial DenseNet faithfully recalls the entire sequence and maintains alignment with one of the patterns at any moment in time, $m^\mu \approx 1$.

Recent modifications of the Hopfield Network, known as Dense Associative Memories or Modern Hopfield Networks (MHNs), overcome this limitation by introducing a strong nonlinearity when computing the overlap between the state of the network and memory patterns stored in the network [30, 31]. This leads to greater separation between partially overlapping memories, thereby reducing crosstalk, increasing the signal-to-noise ratio, and increasing the probability of successful recall [32].

Most models based on the Hopfield Network are autoassociative, meaning they are designed for the robust storage and recall of individual memories. Thus, they are incapable of storing sequences of memories. In order to adapt these models to store sequences, one must utilize asymmetric weights that drive the network from one activity pattern to the next. Many such models use temporally asymmetric Hebbian learning rules to strengthen synaptic connections between neural activity at one time step and the next, thereby learning temporal associations between patterns in a sequence [1, 3, 10, 11, 16, 17, 22].

In this paper, we extend Dense Associative Memories to the setting of asymmetric weights in order to store and recall long sequences of memories. We work directly with the update rule for the state of the network, allowing us to provide an analytical derivation of the sequence capacity of our proposed network. We find a close match between theoretical calculation and numerical simulation, and further establish the ability of this model to store and recall sequences of correlated patterns. Additionally, we examine the dynamics of a model containing both symmetric and asymmetric terms. Finally, we describe applications of our network as a model of biological motor control.

2 DenseNets for Sequence Storage

Traditional Hopfield Networks and MHNs, as described in Appendices A and B, are capable of storing individual memories. What about storing sequences? Assume that we want to store a sequence of P patterns, $\xi^1 \to \xi^2 \to \cdots \to \xi^P$, where $\xi^\mu_j \in \{\pm 1\}$ is the jth neuron of the µth pattern and the network will transition from pattern $\xi^\mu$ to $\xi^{\mu+1}$. Let N be the number of neurons in the network and $S(t) \in \{-1, +1\}^N$ be the state of the network at time t. We want to design a network with dynamics such that when the network is initialized in pattern $\xi^1$, it will traverse the entire sequence.1 We define a network, SeqNet, which follows a discrete-time synchronous update rule2:
\[ T_{\mathrm{SN}}(S)_i := \mathrm{sgn}\Big[\sum_{j\neq i} J_{ij} S_j\Big] = \mathrm{sgn}\Big[\sum_{\mu=1}^{P} \xi^{\mu+1}_i m^\mu_i\Big], \qquad m^\mu_i := \frac{1}{N-1}\sum_{j\neq i}\xi^\mu_j S_j, \tag{1} \]

1We impose periodic boundary conditions and define $\xi^{P+1} \equiv \xi^1$.
Boundary terms have a sub-leading contribution to the crosstalk, so a model with open boundary conditions will have the same scaling of capacity.
2One can also consider an asynchronous update rule in which one neuron is updated at a time [23, 26].

Here $S(t+1) = T_{\mathrm{SN}}(S(t))$ and $J_{ij} = \frac{1}{N}\sum_{\mu=1}^{P}\xi^{\mu+1}_i\xi^\mu_j$ is an asymmetric matrix connecting pattern $\xi^\mu$ to $\xi^{\mu+1}$. Note that we are excluding self-interaction terms $i = j$. We also rewrote the dynamics in terms of $m^\mu_i$, the overlap of the network state S with pattern $\xi^\mu$. When the network is aligned most closely with pattern $\xi^\mu$, the overlap $m^\mu_i$ is the largest contribution in the sum and pushes the network to pattern $\xi^{\mu+1}$. When multiple patterns have similar overlaps, meaning they are correlated, the signal-to-noise ratio is low. This correlation between patterns limits the capacity of the network, restricting SeqNet's capacity to scale only linearly with network size.

To overcome the capacity limitations of SeqNet, we take inspiration from Dense Associative Memories [30] to define the DenseNet update rule:
\[ T_{\mathrm{DN}}(S)_i := \mathrm{sgn}\Big[\sum_{\mu=1}^{P} \xi^{\mu+1}_i f(m^\mu_i)\Big], \tag{2} \]
where f is a nonlinear, monotonically increasing interaction function. Similar to MHNs, f reduces the crosstalk between patterns and, as we will analyze in detail, leads to improved capacity. Figure 1 demonstrates this improvement for $f(x) = x^2$.

2.1 Sequence capacity

To derive analytical results for the capacity, we must choose a distribution to generate the patterns. As is standard in studies of the classic HN and MHNs [26–31, 33–36], we choose this to be the Rademacher distribution, where $\xi^\mu_j \in \{-1, +1\}$ with equal probability for all neurons j in all patterns µ, and calculate the capacity for different update rules. If one is allowed to specially engineer the patterns, even SeqNet can store a sequence of length $2^N$ [37], but this construction is not relevant to associative recall of realistic sequences. Rademacher patterns are a more appropriate model for generic patterns while remaining theoretically tractable.

We consider both the robustness of a single transition and the robustness of propagation through the full sequence. For a fixed network size $N \in \{2, 3, \ldots\}$ and an error tolerance $c \in [0, 1)$, we define the single-transition and sequence capacities by
\[ P_T(N, c) = \max\Big\{ P \in \{2, \ldots, 2^N\} : \mathbb{P}\big[T_{\mathrm{DN}}(\xi^1) = \xi^2\big] \geq 1 - c \Big\}, \tag{3} \]
\[ P_S(N, c) = \max\Big\{ P \in \{2, \ldots, 2^N\} : \mathbb{P}\Big[\bigcap_{\mu=1}^{P}\{T_{\mathrm{DN}}(\xi^\mu) = \xi^{\mu+1}\}\Big] \geq 1 - c \Big\}, \tag{4} \]
respectively, where the probability is taken over the random patterns. Note that for the single-transition capacity we could focus on any pair of subsequent patterns due to the translation invariance arising from periodic boundary conditions. Also note that the full sequence capacity is defined by demanding that all transitions are correct.

For perfect recall, we want to take the threshold $c \to 0$. In the thermodynamic limit in which $N, P \to \infty$, we expect there to exist a sharp transition in the recall probabilities as a function of P, with almost-surely perfect recall below the threshold value and vanishing probability of recall above it [26–29, 31, 33–36]. Thus, we expect the capacity to become insensitive to the value of c in the thermodynamic limit; this is known rigorously for the classic Hopfield network from the work of Bovier [34]. As we detail in Appendix C, all of our theoretical results are obtained under two approximations. We will validate the accuracy of the resulting capacity predictions through comparison with numerical experiments.
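As a concrete illustration of the update rules (1) and (2), here is a minimal NumPy sketch of the synchronous SeqNet and DenseNet dynamics. This is our own illustrative code, not the authors' implementation; for simplicity it retains the self-interaction term, which only perturbs the overlaps at O(1/N).

```python
import numpy as np

def overlaps(patterns, S):
    """m^mu = (1/N) sum_j xi^mu_j S_j for each pattern mu.

    patterns: (P, N) array of +/-1 entries, S: (N,) state vector.
    (For simplicity the self-interaction term is kept; this changes
    the overlaps only at O(1/N).)
    """
    return patterns @ S / patterns.shape[1]

def dense_net_step(patterns, S, f):
    """One synchronous DenseNet update, Eq. (2); f is the interaction function."""
    m = overlaps(patterns, S)                       # overlap with each stored pattern
    next_patterns = np.roll(patterns, -1, axis=0)   # xi^{mu+1}, periodic boundary
    return np.sign(next_patterns.T @ f(m))          # sgn[ sum_mu xi^{mu+1}_i f(m^mu_i) ]

def seq_net_step(patterns, S):
    """SeqNet, Eq. (1), is the special case f(x) = x."""
    return dense_net_step(patterns, S, lambda x: x)

# Example: recall a sequence with a quadratic DenseNet (d = 2), as in Figure 1B.
rng = np.random.default_rng(0)
N, P = 300, 100
patterns = rng.choice([-1.0, 1.0], size=(P, N))
S = patterns[0].copy()
for t in range(P):
    S = dense_net_step(patterns, S, lambda x: x**2)
print("overlap with starting pattern after P steps:", patterns[0] @ S / N)  # ~1 (periodic)
```

Replacing the quadratic interaction with the identity (SeqNet) loses the sequence well before P = 100 at this network size, consistent with Figure 1A.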
First, following Petritis's [33] analysis of the classic Hopfield network, we use union bounds to control the single-transition and full-sequence capacities in terms of the single-bitflip error probability $\mathbb{P}[T_{\mathrm{DN}}(\xi^1)_1 \neq \xi^2_1]$. Using the fact that the patterns are i.i.d., this gives $\mathbb{P}[T_{\mathrm{DN}}(\xi^\mu) = \xi^{\mu+1}] \geq 1 - N\,\mathbb{P}[T_{\mathrm{DN}}(\xi^1)_1 \neq \xi^2_1]$ and $\mathbb{P}[\bigcap_{\mu=1}^{P}\{T_{\mathrm{DN}}(\xi^\mu) = \xi^{\mu+1}\}] \geq 1 - NP\,\mathbb{P}[T_{\mathrm{DN}}(\xi^1)_1 \neq \xi^2_1]$, respectively, resulting in the lower bounds
\[ P_T(N, c) \geq \max\big\{ P \in \{2, \ldots, 2^N\} : N\,\mathbb{P}[T_{\mathrm{DN}}(\xi^1)_1 \neq \xi^2_1] \leq c \big\}, \tag{5} \]
\[ P_S(N, c) \geq \max\big\{ P \in \{2, \ldots, 2^N\} : NP\,\mathbb{P}[T_{\mathrm{DN}}(\xi^1)_1 \neq \xi^2_1] \leq c \big\}. \tag{6} \]
From studies of the classic Hopfield network, we expect these bounds to be tight in the thermodynamic limit ($N \to \infty$), but we will not attempt to prove that this is so [33, 34].

Second, our theoretical results are obtained by approximating $\mathbb{P}[T_{\mathrm{DN}}(\xi^1)_1 \neq \xi^2_1]$ in the regime $N, P \gg 1$ by a Gaussian tail probability. Concretely, we write the single-bitflip probability as
\[ \mathbb{P}[T_{\mathrm{DN}}(\xi^1)_1 \neq \xi^2_1] = \mathbb{P}[C < -f(1)] \tag{7} \]
in terms of the crosstalk
\[ C := \sum_{\mu=2}^{P} \xi^2_1 \xi^{\mu+1}_1 f\Big(\frac{1}{N-1}\sum_{j=2}^{N} \xi^\mu_j \xi^1_j\Big), \tag{8} \]
which represents interference between patterns that can lead to a bitflip. Then, as the crosstalk is the sum of P − 1 i.i.d. random variables, we approximate its distribution as Gaussian. We then extract the capacity by determining how P should scale with N such that the error probability tends to zero as $N \to \infty$, corresponding to taking $c \to 0$ with increasing N. Within the Gaussian approximation, we can also estimate the capacity at fixed c by using the asymptotics of the inverse Gaussian tail distribution function to determine how P should scale with N such that the error probability is asymptotically bounded by c as $N \to \infty$. This predicts that the effect of non-negligible c should vanish as $N \to \infty$.

For P large but finite, this Gaussian approximation amounts to retaining only the leading term in the Edgeworth expansion of the tail distribution function [38–41]. We will not endeavour to rigorously control the error of this approximation in the regime of interest in which N is also large. To convert our heuristic results into fully rigorous asymptotics, one would want to construct an Edgeworth-type series expansion for the tail probability $\mathbb{P}[C < -f(1)]$ that is valid in the joint limit with rigorously controlled asymptotic error, accounting for the fact that the crosstalk is a sum of discrete random variables [38–41]. As a simple probe of Gaussianity, we will consider the excess kurtosis of the crosstalk distribution, which determines the leading correction to the Gaussian approximation in the Edgeworth expansion and describes whether its tails are heavier or narrower than Gaussian [38–41].

2.2 Polynomial DenseNet

Consider the DenseNet with polynomial interaction function, $f(x) = x^d$, which we will call the Polynomial DenseNet. In Appendix C.1, we argue that the leading asymptotics of the transition and sequence capacities for perfect recall are given by
\[ P_T \simeq \frac{N^d}{2(2d-1)!!\,\log(N)}, \qquad P_S \simeq \frac{N^d}{2(d+1)(2d-1)!!\,\log(N)}. \tag{9} \]
Note that this polynomial scaling of the single-transition capacity with network size coincides with the capacity scaling of the symmetric MHN [30]. Indeed, as we have excluded self-interaction terms in the update rule, the single-bitflip probabilities for these two models coincide exactly for unbiased Rademacher patterns (Appendix C.1). This allows us to adapt arguments from Demircigil et al. [31] to show that (9) is in fact a rigorous asymptotic lower bound on the capacity (Appendix D). We compare our results for the single-transition and sequence capacities to numerical simulation in Figure 2.
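The snippet below sketches this Gaussian approximation numerically: it draws Monte Carlo samples of the crosstalk (8) for a Polynomial DenseNet, using the Rademacher symmetries (each sign prefactor is an independent Rademacher variable, and each overlap is a centred, scaled Binomial), and compares the resulting bitflip probability with the Gaussian tail estimate. The sampling scheme and parameter choices are our own; it is an illustrative check, not the authors' code.

```python
import numpy as np
from scipy.stats import kurtosis, norm

rng = np.random.default_rng(1)

def crosstalk_samples(N, P, f, n_samples=10000):
    """Monte Carlo samples of the crosstalk C in Eq. (8) for i.i.d. Rademacher patterns.

    Assumed simplification: the sign prefactors xi^2_1 xi^{mu+1}_1 are jointly i.i.d.
    Rademacher and independent of the overlaps, and each overlap
    (1/(N-1)) sum_j xi^mu_j xi^1_j is a centred, scaled Binomial(N-1, 1/2).
    """
    signs = rng.choice([-1.0, 1.0], size=(n_samples, P - 1))
    m = (2.0 * rng.binomial(N - 1, 0.5, size=(n_samples, P - 1)) - (N - 1)) / (N - 1)
    return (signs * f(m)).sum(axis=1)

N, P, d = 100, 1000, 2
C = crosstalk_samples(N, P, lambda x: x**d)
p_gauss = norm.sf(1.0 / C.std())      # Gaussian tail approximation of P[C < -f(1)]
p_mc = (C < -1.0).mean()              # direct Monte Carlo estimate
print(f"std(C)={C.std():.3f}, excess kurtosis={kurtosis(C):.3f}")
print(f"bitflip probability: Gaussian {p_gauss:.3e} vs Monte Carlo {p_mc:.3e}")
```

The excess kurtosis printed here is the same finite-size diagnostic used below to explain deviations from the Gaussian theory.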
The simulation matches the theoretical prediction for large network size N. For smaller N, there are finite-size effects that result in deviation from the theoretical prediction: the crosstalk has non-negligible kurtosis in finite-size networks, which leads to deviation from the Gaussian approximation. Furthermore, we point out that for fixed N, the network capacity does not monotonically increase with the degree d. Since the double factorial grows faster than the exponential function, every finite network of size N has a polynomial degree $d_{\max}$ beyond which the capacity actually decreases. This is also true for the standard MHN. We demonstrate this numerically in Figure 2B, again noting mild deviations between theory and simulation due to finite-size effects.

2.3 Exponential DenseNet

We have shown that the DenseNet's capacity can scale polynomially with network size. Can it scale exponentially? Consider the DenseNet with exponential interaction function, $f(x) = e^{(N-1)(x-1)}$, which we call the Exponential DenseNet. This function reduces crosstalk dramatically: $f(m^\mu(S)) = 1$ when $m^\mu(S) = 1$, and is otherwise sent to zero exponentially fast. In Appendix C.2, we show that under the abovementioned approximations one has the leading asymptotics
\[ P_T \simeq \frac{\beta^{N-1}}{2\log N} \quad \text{and} \quad P_S \simeq \frac{\beta^{N-1}}{2\log(\beta)\,N}, \quad \text{where } \beta = \frac{\exp(2)}{\cosh(2)} \approx 1.964\ldots \tag{10} \]

Figure 2: Testing the transition and sequence capacities of DenseNets with polynomial and exponential nonlinearities. A. Scaling of transition capacity (log10(PT), left) and sequence capacity (log10(PS), right) with network size. As network size increases, the variance of the crosstalk decreases and the theoretical approximations become more accurate, resulting in a tight match between theory (solid lines) and simulation (points with error bars). The theory curves are given by Equations 9 and 10. Error bars are computed across realizations of the random patterns (see Appendix G). There is significant deviation between theory and simulation for the sequence capacity of the Exponential DenseNet; we show that this is due to finite-size effects in Section 2.3. B. Transition capacity of Polynomial DenseNets as a function of degree. For any finite network size N, there is a degree d that maximizes the transition capacity. The same would be true for the sequence capacity. C. Crosstalk variance (left) and excess kurtosis (right) for the Exponential DenseNet as a function of P and N. Variance is proportional to P and inversely proportional to N, while the opposite is true for excess kurtosis. See Appendix G for details of our numerical methods.

In Figure 2, numerical simulations confirm that this model scales significantly better than the Polynomial DenseNet and enables one to store exponentially long sequences relative to network size. While the ratio between transition and sequence capacities remains bounded for the Polynomial DenseNet, where $P_T/P_S \to d + 1$, the gap for the Exponential DenseNet diverges with network size. However, we can see in Figure 2A that the empirically measured capacity, particularly the sequence capacity of the Exponential DenseNet, deviates substantially from the predictions of our approximate Gaussian theory.
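For concreteness, the helper functions below evaluate the leading-order scaling laws (9) and (10) at a few parameter values; the functions and parameter choices are ours. Even at N = 100 the exponential prediction is astronomically large, which already hints at why the simulations discussed next are restricted to small networks.

```python
import numpy as np
from scipy.special import factorial2

def poly_capacities(N, d):
    """Leading-order transition and sequence capacities of the Polynomial DenseNet, Eq. (9)."""
    PT = N**d / (2 * factorial2(2 * d - 1) * np.log(N))
    return PT, PT / (d + 1)

def exp_capacities(N):
    """Leading-order transition and sequence capacities of the Exponential DenseNet, Eq. (10)."""
    beta = np.exp(2) / np.cosh(2)
    return beta**(N - 1) / (2 * np.log(N)), beta**(N - 1) / (2 * np.log(beta) * N)

for d in (1, 2, 3):
    PT, PS = poly_capacities(100.0, d)
    print(f"polynomial, N=100, d={d}: P_T ~ {PT:.0f}, P_S ~ {PS:.0f}")
print("exponential, N=100: P_T ~ %.2e, P_S ~ %.2e" % exp_capacities(100.0))
```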
Due to computational constraints, our numerical simulations are limited to small network sizes (Appendix G). Computing the excess kurtosis of the crosstalk distribution with a number of patterns comparable to the capacity predicted by the Gaussian theory reveals that, for the range of system sizes we can simulate, the distribution should deviate strongly from a Gaussian. In particular, if we take $P \simeq \beta^{N-1}/(\alpha N)$ for some constant factor α, then the excess kurtosis increases with network size up to around $N \approx 56$ (Appendix C.2). Increasing the size of an Exponential DenseNet therefore has competing effects: for a fixed sequence length P, increasing network size N decreases the crosstalk variance, which should reduce the bitflip probability, but also increases the excess kurtosis, which reflects a fattening of the crosstalk distribution tails that should increase the bitflip probability. This is illustrated in Figure 2C.

The competition between increasing P and N for the Exponential DenseNet is easy to understand intuitively. For a fixed N, increasing P means that the crosstalk is equal in distribution to the sum of an increasingly large number of i.i.d. random variables, and thus by the central limit theorem should become increasingly Gaussian. Conversely, for a fixed P, increasing N means that each of the P − 1 contributions to the crosstalk is equal in distribution to the product of an increasing number of i.i.d. random variables, as
\[ f\Big(\frac{1}{N-1}\sum_{j=2}^{N}\xi^\mu_j \xi^1_j\Big) = \prod_{j=2}^{N} \exp\big(\xi^\mu_j \xi^1_j - 1\big), \]
and thus by the multiplicative central limit theorem each term should tend to a lognormal distribution. In this regime, then, the crosstalk is roughly a mixture of lognormals, which is decidedly non-Gaussian. In contrast, for a Polynomial DenseNet, memorization is easy in the limit where N tends to infinity for fixed P, as the crosstalk should tend almost surely to zero: each term $f\big(\frac{1}{N-1}\sum_{j=2}^{N}\xi^\mu_j\xi^1_j\big) \to 0$ almost surely.

Figure 3: A. Recall of a sequence of 200,000 correlated images from the Moving MNIST dataset using DenseNets of size N = 784. We showcase a 10-image subsequence. The top row depicts the true sequence, the second row depicts SeqNet's performance, the next rows depict the Polynomial DenseNets' performance, which increases with degree d, and the final row depicts the Exponential DenseNet's performance, which yields perfect recall. B. Transition capacity of Polynomial DenseNets of size N = 100 relative to pattern bias ϵ. Increasing ϵ monotonically decreases capacity. Networks with stronger nonlinearities maintain high capacity for large correlation strength. Implementing the generalized pseudoinverse rule decorrelates these patterns and maintains high sequence capacity for much larger correlations. See Appendix G for details of numerical methods.

2.4 Recalling Sequences of Correlated Patterns

The full-sequence capacity scaling laws for these models were derived under the assumption of i.i.d. Rademacher random patterns. While theoretically convenient, this is unrealistic for real-world data. We therefore test these networks in more realistic settings by storing correlated sequences of patterns, which will lead to greater crosstalk in each transition and thus smaller single-transition and full-sequence capacities relative to network size [26, 36]. However, the nonlinear interaction functions should still assist in separating correlated patterns to enable successful sequence recall.
For demonstration, we store a sequence of 200,000 highly correlated images from the Moving MNIST dataset and attempt to recall this sequence using DenseNets with different nonlinearities [42]. The entire sequence is composed of 10,000 unique subsequences concatenated together, where each subsequence is composed of 20 images of two hand-written digits slowly moving through one another. This means there is significant correlation between patterns, which results in large amounts of crosstalk. The results of the DenseNets are shown in Figure 3A, where increasing the nonlinearity of the Polynomial DenseNets slowly improves recall, but not entirely, while the exponential network achieves perfect recall. SeqNet and the Polynomial DenseNets with degree up to approximately d = 50 are entirely unable to recall any part of any image, despite the DenseNets being well within the capacity limits predicted by theoretical calculations on uncorrelated patterns.

2.5 Generalized pseudoinverse rule

Can we overcome the DenseNet's limited ability to store correlated patterns? Drawing inspiration from the pseudoinverse learning rule introduced by Kanter and Sompolinsky [43] for the classic Hopfield network, we propose a generalized pseudoinverse (GPI) transition rule
\[ T_{\mathrm{GPI}}(S)_i := \mathrm{sgn}\Big[\sum_{\mu=1}^{P} \xi^{\mu+1}_i f\Big(\sum_{\nu=1}^{P} (O^+)_{\mu\nu}\, m^\nu(S)\Big)\Big], \qquad O_{\mu\nu} := \frac{1}{N}\sum_{j=1}^{N} \xi^\mu_j \xi^\nu_j, \tag{11} \]
where the overlap matrix $O_{\mu\nu}$ is positive-semidefinite, so we can define its pseudoinverse $O^+$ by inverting the non-zero eigenvalues. With f(x) = x, this reduces to the pseudoinverse rule of [43]. If the patterns are linearly independent, such that O is full-rank, this rule can perfectly recall the full sequence (Appendix E). This matches the classic pseudoinverse rule's ability to perfectly store any set of linearly independent patterns; this is why we choose to sum over ν inside the separation function in (11). For i.i.d. Rademacher patterns, linear independence holds almost surely in the thermodynamic limit provided that P < N.

In Figure 3B, we demonstrate the effect of correlation on the Polynomial DenseNet by studying the recall of biased patterns $\xi^\mu_i$ with $\mathbb{P}(\xi^\mu_i = 1) = \frac{1}{2}(1 + \epsilon)$ for $\epsilon \in [0, 1)$.3 We see that the Polynomial DenseNet has better recall at all levels of bias ϵ as the degree d increases, although we still expect there to be a maximum useful degree, as described before. However, at large correlation values they all have low recall, suggesting the need for alternative methods to decorrelate these patterns. This failure is easy to understand theoretically, following van Hemmen and Kühn's [44] analysis of the classic Hopfield model: for patterns with bias ϵ, the Polynomial DenseNet update rule expands as
\[ T_{\mathrm{DN}}(\xi^\mu)_i = \mathrm{sgn}\Big[\xi^{\mu+1}_i + (P-1)\epsilon^{2d+1} + O\big(\sqrt{P/N}\big)\Big]. \tag{12} \]
Therefore, even if N is large, for $\epsilon \neq 0$ there must be some value of P for which the constant bias overwhelms the signal. If $N \to \infty$ for any fixed P, then we must have $P < \epsilon^{-(2d+1)} + 1$ for the signal to dominate. In Figure 3B, we show that the generalized pseudoinverse update rule is more robust to large correlations than the Polynomial DenseNet. While this rule can also be applied to the Exponential DenseNet, simulations fail due to numerical instability arising from small values in the pseudoinverse.

3At ϵ = 1, the patterns will be deterministic with $\xi^\mu_i = +1$.
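To make the GPI rule concrete, here is a minimal NumPy sketch of the update (11) applied to a short sequence of correlated patterns. Variable names, the overlap normalization $m^\nu(S) = \frac{1}{N}\sum_j \xi^\nu_j S_j$, and the synthetic test sequence are our own assumptions, not the authors' implementation.

```python
import numpy as np

def gpi_step(patterns, S, f):
    """One step of the generalized pseudoinverse (GPI) rule, Eq. (11).

    patterns: (P, N) array of +/-1 entries; S: (N,) state; f: interaction function.
    Naive O(P^2) sketch with our own conventions.
    """
    P, N = patterns.shape
    O = patterns @ patterns.T / N                  # overlap matrix O_{mu nu}
    O_pinv = np.linalg.pinv(O)                     # pseudoinverse over non-zero eigenvalues
    m = patterns @ S / N                           # raw overlaps m^nu(S)
    a = O_pinv @ m                                 # decorrelated overlaps
    next_patterns = np.roll(patterns, -1, axis=0)  # xi^{mu+1}, periodic boundary
    return np.sign(next_patterns.T @ f(a))

# If the patterns are linearly independent (P < N), a = O^+ m is the one-hot indicator of the
# current pattern, so the update selects xi^{mu+1} exactly, even for correlated patterns.
rng = np.random.default_rng(0)
base = rng.choice([-1.0, 1.0], size=100)
patterns = np.array([np.where(rng.random(100) < 0.3, -base, base) for _ in range(20)])
S = patterns[0].copy()
for _ in range(20):
    S = gpi_step(patterns, S, lambda x: x**2)
print("returned to start of the stored sequence:", np.array_equal(S, patterns[0]))
```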
3 MixedNets for variable timing

Thus far, we have considered sequence recall in purely asymmetric networks. These networks transition to the next pattern in the sequence at every timestep, preventing them from storing sequences with longer timing between elements. In this section, we aim to construct a model where the network stays in a pattern for τ steps. Our starting model is an associative memory model for storing sequences known as the Temporal Association Network (TAN) [1, 10], defined as
\[ T_{\mathrm{TAN}}(S)_i := \mathrm{sgn}\Big[\sum_{\mu=1}^{P}\big(\xi^\mu_i m^\mu_i + \lambda\,\xi^{\mu+1}_i \bar{m}^\mu_i\big)\Big], \qquad \bar{m}^\mu_i := \frac{1}{N-1}\sum_{j\neq i}\xi^\mu_j \bar{S}_j, \tag{13} \]
where $\bar{m}^\mu_i$ represents the normalized overlap of each pattern $\xi^\mu$ with a weighted time-average of the network over the past τ timesteps, $\bar{S}_i(t) = \sum_{\rho=0}^{\tau} w(\rho) S_i(t-\rho)$. The weight function w(t) is generally taken to be a low-pass convolutional filter (e.g., a Heaviside step function or exponential decay). This network combines a symmetric and an asymmetric term for robust recall of multiple sequences. The symmetric term containing $m^\mu_i(t)$, also referred to as a "fast" synapse, stabilizes the network in pattern $\xi^\mu$ for a desired amount of time. The asymmetric term containing $\bar{m}^\mu_i(t)$, also referred to as a "slow" synapse, drives the network transition to pattern $\xi^{\mu+1}$. The λ parameter controls the strength of the transition signal. If λ is too small, no transitions will occur, since the symmetric term will overpower it. If λ is too large, transitions will occur too quickly for the network to stabilize in a desired pattern and the sequence will quickly destabilize. For the TAN, Sompolinsky and Kanter [10] used numerical simulations to estimate the capacity as approximately $P_{\mathrm{TAN}} \approx 0.1N$, defining capacity as the ability to recall the sequence in the correct order with high overlap (meaning that a small proportion of incorrect bits is allowed in each transition).

Note that this model can fail in two ways: (i) it can fail to recall the correct sequence of patterns, or (ii) it can fail to stay in each state for the desired amount of time. To address these issues, we consider the following dynamics:
\[ T_{\mathrm{MN}}(S)_i := \mathrm{sgn}\Big[\sum_{\mu=1}^{P}\big(\xi^\mu_i f_S(m^\mu_i) + \lambda\,\xi^{\mu+1}_i f_A(\bar{m}^\mu_i)\big)\Big]. \tag{14} \]

Figure 4: Capacity of the Polynomial MixedNet. A. We simulate MixedNets with N = 100, τ = 5, and attempt to store P = 40 patterns. The Temporal Association Network (left), corresponding to a linear MixedNet with $d_S = 1 = d_A$, fails to recover the sequence. Increasing the nonlinearities to $d_S = 2 = d_A$ (center) recovers the correct sequence order, but not the timing. Increasing the nonlinearities to $d_S = 10 = d_A$ (right) recovers the correct sequence order and timing. B. Transition capacity log10(PT) of the Polynomial MixedNet as a function of network size. Each panel has a fixed symmetric nonlinearity $f_S(x) = x^{d_S}$ indicated by the panel's title. As network size increases, the crosstalk variance decreases and the theoretical approximations (Equation 15) become more accurate, tightly matching the simulations. Note that, as expected, the capacity scales according to the minimum of $d_S$ and $d_A$. C. As in B, but for the sequence capacity log10(PS).
We call this model the MixedNet, and seek to analyze the relationship between the symmetric and asymmetric terms in driving the network dynamics and their impact on sequence capacity. As before, the asymmetric term will try to push the network to the next state at every timestep, while the symmetric term tries to maintain it in its current state for τ timesteps. We will allow different nonlinearities for $f_S$ and $f_A$, and analyze their effect on the transition and sequence capacities.

We demonstrate the effectiveness of the Polynomial MixedNet, where for simplicity we set $f_S(x) = f_A(x) = x^d$, in Figure 4A. While the TAN fails completely, a polynomial nonlinearity of d = 2 enables recall of the pattern order, but the network does not stay in each pattern for τ = 5 timesteps. Further increasing the nonlinearity to d = 10 recovers the desired sequence with correct order and timing.

Theoretical analysis of the capacity of the MixedNet (14) for general memory length τ is challenging due to the extended temporal interactions. We therefore consider single-step memory (τ = 1), and show that even in this relatively tractable special case new complications arise relative to our analysis of the DenseNet. Alternatively, we can interpret the MixedNet with τ = 1 as an imperfectly learned DenseNet: if one imagines that the network learns its weights through a temporally asymmetric Hebbian rule with an extended plasticity kernel, and its state is not perfectly clamped to the desired transition, the coupling from $\xi^\mu$ to $\xi^{\mu+1}$ could be corrupted by coupling $\xi^\mu$ to itself [22].

We first consider the setting where both interaction functions are polynomial, $f_S(x) = x^{d_S}$ and $f_A(x) = x^{d_A}$, and refer to this network as the Polynomial MixedNet. This model is analyzed in detail in Appendix F.1. Interestingly, this model's crosstalk has a bimodal distribution, as shown in Figure F.1. This complicates the analysis, but once bimodality is accounted for one can approximate the capacity using a similar argument to that of the DenseNet. We find that
\[ P_T \simeq \frac{(\lambda-1)^2}{2\,\gamma_{d_S,d_A}}\,\frac{N^{\min\{d_S,d_A\}}}{\log N}, \qquad P_S \simeq \frac{(\lambda-1)^2}{2(\min\{d_S,d_A\}+1)\,\gamma_{d_S,d_A}}\,\frac{N^{\min\{d_S,d_A\}}}{\log N}, \tag{15} \]
where $\gamma_{d_S,d_A}$ is a multiplicative factor defined as
\[ \gamma_{d_S,d_A} = \begin{cases} (2d_S-1)!! & \text{if } d_S < d_A, \\ (\lambda^2+1)(2d_S-1)!! + 2\lambda\,[(d_S-1)!!]^2\,\mathbb{1}\{d_S \text{ even}\} & \text{if } d_S = d_A, \\ \lambda^2\,(2d_A-1)!! & \text{if } d_S > d_A. \end{cases} \tag{16} \]
In Figure 4B-C, we show that simulations match the theory curves well as N increases. We demonstrate theoretical and simulation results for the Exponential MixedNet in Appendix F.2.
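A schematic implementation of the MixedNet dynamics (14) is sketched below, using a simple box filter over the τ + 1 most recent past states for the slow overlap; the filter convention, parameter values, and variable names are our own assumptions rather than the authors' choices. With strong nonlinearities and λ > 1, the network holds each pattern for roughly τ steps before advancing, qualitatively matching Figure 4A.

```python
import numpy as np
from collections import deque

def mixed_net_run(patterns, lam, tau, f_S, f_A, T):
    """Run the MixedNet dynamics of Eq. (14).

    The slow overlap here averages the tau + 1 most recent past states (a box filter);
    other low-pass choices behave similarly. All conventions are ours.
    """
    P, N = patterns.shape
    next_patterns = np.roll(patterns, -1, axis=0)              # xi^{mu+1}, periodic boundary
    S = patterns[0].copy()                                     # start in the first pattern
    history = deque([patterns[-1].copy()] * (tau + 1), maxlen=tau + 1)
    overlaps = []
    for _ in range(T):
        m = patterns @ S / N                                   # fast overlaps m^mu
        m_bar = patterns @ np.mean(list(history), axis=0) / N  # slow overlaps \bar{m}^mu
        S_new = np.sign(patterns.T @ f_S(m) + lam * (next_patterns.T @ f_A(m_bar)))
        history.appendleft(S)                                  # current state enters the memory trace
        S = S_new
        overlaps.append(patterns @ S / N)
    return np.array(overlaps)

# N = 100, P = 40, tau = 5: strong nonlinearities recover both order and timing (cf. Fig. 4A).
rng = np.random.default_rng(0)
patterns = rng.choice([-1.0, 1.0], size=(40, 100))
traj = mixed_net_run(patterns, lam=1.5, tau=5, f_S=lambda x: x**10, f_A=lambda x: x**10, T=60)
print(traj.argmax(axis=1)[:30])   # each pattern index is held for roughly tau steps before advancing
```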
4 Biologically-Plausible Implementation

Since biological neural networks must store sequence memories [2, 5–8], one naturally asks if these results can be generalized to biologically plausible neural networks. A straightforward biological interpretation of the DenseNet is problematic, as a network with a polynomial interaction function of degree d is equivalent to a neural network with many-body synapses between d + 1 neurons. This can be seen by expanding the Polynomial DenseNet in terms of a weight tensor over d + 1 neurons:
\[ S_i(t+1) = \mathrm{sgn}\Big[\sum_{j_1,\ldots,j_d} J_{i j_1 \ldots j_d}\, S_{j_1}(t)\cdots S_{j_d}(t)\Big], \qquad J_{i, j_1,\ldots,j_d} = \frac{1}{N^d}\sum_{\mu=1}^{P} \xi^{\mu+1}_i \xi^\mu_{j_1}\cdots\xi^\mu_{j_d}. \tag{17} \]
This is biologically unrealistic, as synaptic connections usually occur between two neurons [45]. In the case of the Exponential DenseNet, one can interpret its interaction function via a Taylor series expansion, implying synaptic connections between infinitely many neurons, which is even more problematic. Similar difficulties arise in models with sums of terms with different powers [46].

Figure 5: Biologically-plausible implementation of the DenseNet with two-body synapses.

To address this issue, we again take inspiration from earlier work on MHNs. Krotov and Hopfield [47] addressed this concern for symmetric MHNs by reformulating the network using two-body synapses, where the network is partitioned into a bipartite graph with visible and hidden neurons (see [48] for an extension of this idea to deeper networks). The visible neurons correspond to the neurons in our network dynamics, $S_j$, while the hidden neurons correspond to the individual memories stored within the network. They are connected through a weight matrix. Since we are working with an asymmetric network, we modify their approach and define two sets of synaptic weights: $W_{j\mu}$ connects visible neuron $v_j$ to hidden neuron $h_\mu$, and $M_{\mu j}$ connects hidden neuron $h_\mu$ to visible neuron $v_j$. This yields the same dynamics as in Equation (2), absorbing the nonlinearity into the hidden neurons' dynamics. For the DenseNet, we define the weights as $W_{j\mu} := \frac{1}{N}\xi^\mu_j$ and $M_{\mu j} := \xi^{\mu+1}_j$. For the MixedNet, we redefine the weight matrix $M_{\mu j} = \xi^\mu_j + \lambda\,\xi^{\mu+1}_j$. The update rules for the neurons are as follows:
\[ h_\mu(t) := f\Big(\sum_j W_{j\mu} v_j(t)\Big), \qquad v_j(t+1) := \mathrm{sgn}\Big[\sum_\mu M_{\mu j} h_\mu(t)\Big]. \tag{18} \]
Note that these networks' transition and sequence capacities, $P_T$ and $P_S$, now scale linearly with respect to the total number of neurons in this model, N visible neurons and P hidden neurons. However, the network capacity still scales nonlinearly with respect to the number of visible neurons.

Finally, we remark that this network is reminiscent of recent computational models for motor action selection and control via the cortico-basal ganglia-thalamo-cortical loop, in which the basal ganglia inhibits thalamic neurons that are bidirectionally connected to a recurrent cortical network [5, 49, 50]. This relates to our model as follows: the motor cortex (visible neurons) executes an action, each thalamic unit (hidden neurons) encodes a motor motif, and the basal ganglia silences thalamic neurons (an external network modulating context). In particular, the role of the basal ganglia in this network suggests a novel mechanism of context-dependent gating within Hopfield Networks [51]. Rather than modulating synapses or feature neurons in a network, one can directly inhibit (activate) memory neurons in order to decrease (increase) the likelihood of transitioning to the associated state. Similarly, thalamocortical loops have been found to be important for song generation in zebra finches [52]. Thus, the biological implementation of the DenseNet can provide insight into how biological agents reliably store and generate complex sequences.
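The sketch below shows the two-layer (visible/hidden) form of Equation (18) with the DenseNet weights; the code and names are our own, not the authors' implementation.

```python
import numpy as np

def two_layer_dense_net(patterns, S0, f, T):
    """Two-body (visible/hidden) implementation of the DenseNet, Eq. (18).

    W projects the visible state onto each memory (hidden unit); M maps each hidden
    unit to the *next* pattern in the sequence. Sketch with our own conventions.
    """
    P, N = patterns.shape
    W = patterns.T / N                          # W_{j mu} = xi^mu_j / N, shape (N, P)
    M = np.roll(patterns, -1, axis=0)           # M_{mu j} = xi^{mu+1}_j, shape (P, N)
    v = S0.copy()
    states = [v.copy()]
    for t in range(T):
        h = f(v @ W)                            # hidden units: h_mu = f(sum_j W_{j mu} v_j)
        v = np.sign(h @ M)                      # visible units: v_j = sgn(sum_mu M_{mu j} h_mu)
        states.append(v.copy())
    return np.array(states)

rng = np.random.default_rng(0)
patterns = rng.choice([-1.0, 1.0], size=(100, 300))     # P = 100, N = 300
states = two_layer_dense_net(patterns, patterns[0], lambda x: x**2, T=100)
print("overlap with the first pattern after P steps:", states[-1] @ patterns[0] / 300)
```

In this factorized form, clamping a hidden unit $h_\mu$ to zero removes the push toward $\xi^{\mu+1}$, which is one way to realize the context-dependent gating of transitions described above.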
5 Discussion and Future Directions

We introduced the DenseNet for the reliable storage and recall of long sequences of patterns, derived the scaling of its single-transition and full-sequence capacities, and verified these results in numerical simulation. We found that, depending on the choice of nonlinear interaction function, the DenseNet's capacity could scale polynomially or exponentially. We tested the ability of these models to recall sequences of correlated patterns by comparing the recall of a sequence of Moving MNIST images with different nonlinearities. As expected, the network's reconstruction capabilities increased with the nonlinearity power d, with perfect recall achieved by the exponential nonlinearity. To further increase the capacity, we introduced the generalized pseudoinverse rule and demonstrated in simulation its ability to maintain high capacity for highly correlated patterns. We also introduced and analyzed the MixedNet to maintain patterns within sequences for longer periods of time. Finally, we described a biologically plausible implementation of the models with connections to motor control.

There has recently been renewed interest in storing sequences of memories. Steinberg and Sompolinsky [53] store sequences in Hopfield networks by using a vector-symbolic architecture to bind each pattern to its temporal order in the sequence, thus storing the entire sequence as a single attractor. However, this model suffers from the same capacity limitations as the Hopfield Network. Whittington et al. [54] suggest a mechanism to control sequence retrieval via an external controller, analogous to the role we ascribe to the basal ganglia for context-dependent gating. Herron et al. [55] investigate a mechanism for robust sequence recall within complex systems more broadly, reducing crosstalk by directly modulating interactions between neurons rather than the inputs into neurons. Tang et al. [56] propose a model for sequential recall akin to SeqNet with an implicit statistical whitening process. Karuvally et al. [57] introduce a model closely related to the biologically-plausible implementation of our MixedNet and analyze it in the setting of continuous-time dynamics, allowing for intralayer synapses within the hidden layer and different timescales between the hidden and feature layers.

While we have focused on a generalization of the fixed-point capacity for sequence memory, this is not the only notion of capacity one could consider. In other studies of MHNs, instead of considering stability as the probability of staying at a fixed point, researchers quantify the probability that the network will reach a fixed point within a single transition [31, 58, 59]. This approach allows one to quantify noise-robustness and the size of each memory's basin of attraction [35]. More broadly, one could consider other definitions of associative memory capacity not addressed here, including those that depend only on network architecture and not on the assumption of a particular learning rule [60, 61]. However, as compared to the relatively simple analysis that is possible for the fixed-point capacity of a Hopfield network with a Hebbian learning rule, analyzing these alternative notions of capacity in nonlinear networks can pose significant technical challenges [61–63].

In this work, we limited ourselves to theoretical analysis of discrete-time networks storing binary patterns. An important direction for future research would be to go beyond the Gaussian theory in order to develop accurate predictions of the Exponential DenseNet capacity. There are also many potential avenues for extending these models and methods to continuous-time networks and continuous-valued patterns, computing capacity for correlated patterns, testing different weight functions, and examining different network topologies. Finally, we hope to take inspiration from the recent resurgence of RNNs in long sequence modeling to use this model for real-world tasks [64, 65].
Acknowledgments and Disclosure of Funding We thank Matthew Farrell, Shanshan Qin, and Sabarish Sainathan for useful discussions and comments on earlier versions of our manuscript. HC was supported by the GFSD Fellowship, Harvard GSAS Prize Fellowship, and Harvard James Mills Peirce Fellowship. JAZ-V and CP were supported by NSF Award DMS-2134157 and NSF CAREER Award IIS-2239780. CP received additional support from a Sloan Research Fellowship. This work has been made possible in part by a gift from the Chan Zuckerberg Initiative Foundation to establish the Kempner Institute for the Study of Natural and Artificial Intelligence. The computations in this paper were run on the FASRC Cannon cluster supported by the FAS Division of Science Research Computing Group at Harvard University. [1] D Kleinfeld and H Sompolinsky. Associative neural network model for the generation of temporal patterns. theory and application to central pattern generators. Biophysical Journal, 54 (6):1039 1051, 1988. [2] Michael A. Long, Dezhe Z. Jin, and Michale S. Fee. Support for a synaptic chain model of neuronal sequence generation. Nature, 468(7322):394 399, Nov 2010. ISSN 1476-4687. doi:10.1038/nature09514. URL https://doi.org/10.1038/nature09514. [3] Maxwell Gillett, Ulises Pereira, and Nicolas Brunel. Characteristics of sequential activity in networks with temporally asymmetric Hebbian learning. Proceedings of the National Academy of Sciences, 117(47):29948 29958, November 2020. doi:10.1073/pnas.1918674117. [4] Stefano Recanatesi, Ulises Pereira-Obilinovic, Masayoshi Murakami, Zachary Mainen, and Luca Mazzucato. Metastable attractors explain the variable timing of stable behavioral action sequences. Neuron, 110(1):139 153, 2022. [5] Luca Mazzucato. Neural mechanisms underlying the temporal organization of naturalistic animal behavior. e Life, 11:e76577, 2022. [6] Edmund T Rolls and Patrick Mills. The generation of time in the hippocampal memory system. Cell Reports, 28(7):1649 1658, 2019. [7] Alexander B. Wiltschko, Matthew J. Johnson, Giuliano Iurilli, Ralph E. Peterson, Jesse M. Katon, Stan L. Pashkovski, Victoria E. Abraira, Ryan P. Adams, and Sandeep Robert Datta. Mapping sub-second structure in mouse behavior. Neuron, 88(6):1121 1135, 2015. ISSN 08966273. doi:https://doi.org/10.1016/j.neuron.2015.11.031. URL https://www.sciencedirect. com/science/article/pii/S0896627315010375. [8] Jeffrey E. Markowitz, Winthrop F. Gillis, Maya Jay, Jeffrey Wood, Ryley W. Harris, Robert Cieszkowski, Rebecca Scott, David Brann, Dorothy Koveal, Tomasz Kula, Caleb Weinreb, Mohammed Abdal Monium Osman, Sandra Romero Pinto, Naoshige Uchida, Scott W. Linderman, Bernardo L. Sabatini, and Sandeep Robert Datta. Spontaneous behaviour is structured by reinforcement without explicit reward. Nature, 614(7946):108 117, Feb 2023. ISSN 1476-4687. doi:10.1038/s41586-022-05611-2. URL https://doi.org/10.1038/ s41586-022-05611-2. [9] Cengiz Pehlevan, Farhan Ali, and Bence P Ölveczky. Flexibility in motor timing constrains the topology and dynamics of pattern generator circuits. Nature communications, 9(1):977, 2018. doi:https://doi.org/10.1038/s41467-018-03261-5. [10] H. Sompolinsky and I. Kanter. Temporal Association in Asymmetric Neural Networks. Physical Review Letters, 57(22):2861 2864, December 1986. doi:10.1103/Phys Rev Lett.57.2861. [11] Zijian Jiang, Ziming Chen, Tianqi Hou, and Haiping Huang. Spectrum of non Hermitian deep-Hebbian neural networks. Physical Review Research, 5:013090, Feb 2023. doi:10.1103/Phys Rev Research.5.013090. 
URL https://link.aps.org/doi/10.1103/ Phys Rev Research.5.013090. [12] Ulises Pereira and Nicolas Brunel. Unsupervised learning of persistent and sequential activity. Frontiers in Computational Neuroscience, 13:97, 2020. [13] Christian Leibold and Richard Kempter. Memory capacity for sequences in a recurrent network with biological constraints. Neural Computation, 18(4):904 941, 2006. [14] Jeff Hawkins, Dileep George, and Jamie Niemasik. Sequence memory for prediction, inference and behaviour. Philosophical Transactions of the Royal Society B: Biological Sciences, 364 (1521):1203 1209, 2009. [15] Jeff Hawkins and Subutai Ahmad. Why neurons have thousands of synapses, a theory of sequence memory in neocortex. Frontiers in Neural Circuits, page 23, 2016. [16] Daniel J Amit. Neural networks counting chimes. Proceedings of the National Academy of Sciences, 85(7):2141 2145, 1988. [17] H. Gutfreund and M. Mezard. Processing of temporal sequences in neural networks. Phys. Rev. Lett., 61:235 238, Jul 1988. doi:10.1103/Phys Rev Lett.61.235. URL https://link.aps. org/doi/10.1103/Phys Rev Lett.61.235. [18] Kanaka Rajan, Christopher D. Harvey, and David W. Tank. Recurrent network models of sequence generation and memory. Neuron, 90(1):128 142, 2016. ISSN 0896-6273. doi:https://doi.org/10.1016/j.neuron.2016.02.009. URL https://www.sciencedirect.com/ science/article/pii/S0896627316001021. [19] Markus Diesmann, Marc-Oliver Gewaltig, and Ad Aertsen. Stable propagation of synchronous spiking in cortical neural networks. Nature, 402(6761):529 533, Dec 1999. ISSN 1476-4687. doi:10.1038/990101. URL https://doi.org/10.1038/990101. [20] Nicholas F Hardy and Dean V Buonomano. Neurocomputational models of interval and pattern timing. Current Opinion in Behavioral Sciences, 8:250 257, 2016. ISSN 2352-1546. doi:https://doi.org/10.1016/j.cobeha.2016.01.012. URL https://www.sciencedirect.com/ science/article/pii/S2352154616300195. Time in perception and action. [21] Dina Obeid, Jacob A. Zavatone-Veth, and Cengiz Pehlevan. Statistical structure of the trial-to-trial timing variability in synfire chains. Phys. Rev. E, 102:052406, Nov 2020. doi:10.1103/Phys Rev E.102.052406. URL https://link.aps.org/doi/10.1103/ Phys Rev E.102.052406. [22] Matthew Farrell and Cengiz Pehlevan. Recall tempo of Hebbian sequences depends on the interplay of Hebbian kernel with tutor signal timing. bio Rxiv, 2023. doi:10.1101/2023.06.07.542926. URL https://www.biorxiv.org/content/early/2023/06/07/2023.06.07.542926. [23] John J Hopfield. Neural networks and physical systems with emergent collective computational abilities. Proceedings of the National Academy of Sciences, 79(8):2554 2558, 1982. [24] John J Hopfield. Neurons with graded response have collective computational properties like those of two-state neurons. Proceedings of the national academy of sciences, 81(10):3088 3092, 1984. [25] S-I Amari. Learning patterns and pattern sequences by self-organizing nets of threshold elements. IEEE Transactions on computers, 100(11):1197 1206, 1972. [26] John Hertz, Anders Krogh, and Richard G Palmer. Introduction to the theory of neural computation. CRC Press, 2018. [27] Daniel J Amit, Hanoch Gutfreund, and Haim Sompolinsky. Spin-glass models of neural networks. Physical Review A, 32(2):1007, 1985. [28] Daniel J. Amit, Hanoch Gutfreund, and H. Sompolinsky. Storing infinite numbers of patterns in a spin-glass model of neural networks. Phys. Rev. Lett., 55:1530 1533, Sep 1985. doi:10.1103/Phys Rev Lett.55.1530. 
URL https://link.aps.org/doi/10.1103/ Phys Rev Lett.55.1530. [29] Daniel J Amit, Hanoch Gutfreund, and H Sompolinsky. Statistical mechanics of neural networks near saturation. Annals of Physics, 173(1):30 67, 1987. ISSN 0003-4916. doi:https://doi.org/10.1016/0003-4916(87)90092-3. URL https://www.sciencedirect. com/science/article/pii/0003491687900923. [30] Dmitry Krotov and John J. Hopfield. Dense associative memory for pattern recognition. Advances in Neural Information Processing Systems, 29, 2016. [31] Mete Demircigil, Judith Heusel, Matthias Löwe, Sven Upgang, and Franck Vermet. On a model of associative memory with huge storage capacity. Journal of Statistical Physics, 168(2): 288 299, July 2017. ISSN 0022-4715, 1572-9613. doi:10.1007/s10955-017-1806-y. [32] Dmitry Krotov. A new frontier for hopfield networks. Nature Reviews Physics, pages 1 2, 2023. [33] Dimitri Petritis. Thermodynamic formalism of neural computing. In Eric Goles and Servet Martínez, editors, Dynamics of Complex Interacting Systems, pages 81 146. Springer Netherlands, Dordrecht, 1996. doi:10.1007/978-94-017-1323-8_3. URL https://doi.org/10. 1007/978-94-017-1323-8_3. [34] Anton Bovier. Sharp upper bounds on perfect retrieval in the Hopfield model. Journal of Applied Probability, 36(3):941 950, 1999. doi:10.1239/jap/1032374647. [35] R. Mc Eliece, E. Posner, E. Rodemich, and S. Venkatesh. The capacity of the Hopfield associative memory. IEEE Transactions on Information Theory, 33(4):461 482, July 1987. ISSN 00189448. doi:10.1109/TIT.1987.1057328. [36] G. Weisbuch and F. Fogelman-Soulié. Scaling laws for the attractors of Hopfield networks. J. Physique Lett., 46(14):623 630, 1985. doi:10.1051/jphyslet:019850046014062300. URL https://doi.org/10.1051/jphyslet:019850046014062300. [37] Samuel P. Muscinelli, Wulfram Gerstner, and Johanni Brea. Exponentially Long Orbits in Hopfield Neural Networks. Neural Computation, 29(2):458 484, 02 2017. ISSN 0899-7667. doi:10.1162/NECO_a_00919. [38] V. V. Petrov. Sums of Independent Random Variables. De Gruyter, Berlin, Boston, 1975. ISBN 9783112573006. doi:doi:10.1515/9783112573006. Trans. A. A. Brown. [39] John E. Kolassa. Series Approximation Methods in Statistics. Springer New York, 1997. doi:10.1007/978-1-4757-4277-0. URL https://doi.org/10.1007/978-1-4757-4277-0. [40] John E. Kolassa and Peter Mc Cullagh. Edgeworth Series for Lattice Distributions. The Annals of Statistics, 18(2):981 985, 1990. doi:10.1214/aos/1176347637. URL https: //doi.org/10.1214/aos/1176347637. [41] Dmitry Dolgopyat and Kasun Fernando. An Error Term in the Central Limit Theorem for Sums of Discrete Random Variables. International Mathematics Research Notices, 05 2023. ISSN 1073-7928. doi:10.1093/imrn/rnad088. URL https://doi.org/10.1093/imrn/rnad088. rnad088. [42] Nitish Srivastava, Elman Mansimov, and Ruslan Salakhudinov. Unsupervised learning of video representations using lstms. In International conference on machine learning, pages 843 852. PMLR, 2015. [43] I. Kanter and H. Sompolinsky. Associative recall of memory without errors. Phys. Rev. A, 35: 380 392, Jan 1987. doi:10.1103/Phys Rev A.35.380. URL https://link.aps.org/doi/10. 1103/Phys Rev A.35.380. [44] J. Leo van Hemmen and Reimer Kühn. Collective phenomena in neural networks. In Eytan Domany, J. Leo van Hemmen, and Klaus Schulten, editors, Models of Neural Networks I, pages 1 113, Berlin, Heidelberg, 1995. Springer Berlin Heidelberg. ISBN 9783-642-79814-6. doi:10.1007/978-3-642-79814-6_1. URL https://doi.org/10.1007/ 978-3-642-79814-6_1. 
[45] Eric R Kandel, James H Schwartz, Thomas M Jessell, Steven Siegelbaum, A James Hudspeth, Sarah Mack, et al. Principles of neural science. Mc Graw-hill New York, 6 edition, 2021. [46] Thomas F Burns and Tomoki Fukai. Simplicial Hopfield networks. In The Eleventh International Conference on Learning Representations, 2023. [47] Dmitry Krotov and John J. Hopfield. Large associative memory problem in neurobiology and machine learning. In International Conference on Learning Representations, 2021. URL https://openreview.net/forum?id=X4y_10OX-h X. [48] Dmitry Krotov. Hierarchical associative memory. ar Xiv preprint ar Xiv:2107.06446, 2021. [49] Ta-Chu Kao, Mahdieh S Sadabadi, and Guillaume Hennequin. Optimal anticipatory control as a theory of motor preparation: A thalamo-cortical circuit model. Neuron, 109(9):1567 1581, 2021. [50] Laureline Logiaco, LF Abbott, and Sean Escola. Thalamic control of cortical dynamics in a model of flexible motor sequencing. Cell Reports, 35(9):109090, 2021. [51] Nicolas Y. Masse, Gregory D. Grant, and David J. Freedman. Alleviating catastrophic forgetting using context-dependent gating and synaptic stabilization. Proceedings of the National Academy of Sciences, 115(44), October 2018. ISSN 0027-8424, 1091-6490. doi:10.1073/pnas.1803839115. [52] Felix W. Moll, Devorah Kranz, Ariadna Corredera Asensio, Margot Elmaleh, Lyn A. Ackert Smith, and Michael A. Long. Thalamus drives vocal onsets in the zebra finch courtship song. Nature, 616(7955):132 136, Apr 2023. ISSN 1476-4687. doi:10.1038/s41586-023-05818-x. URL https://doi.org/10.1038/s41586-023-05818-x. [53] Julia Steinberg and Haim Sompolinsky. Associative memory of structured knowledge. Scientific Reports, 12(1):21808, Dec 2022. ISSN 2045-2322. doi:10.1038/s41598-022-25708-y. URL https://doi.org/10.1038/s41598-022-25708-y. [54] James C. R. Whittington, Joseph Warren, and Timothy E. J. Behrens. Relating transformers to models and neural representations of the hippocampal formation, March 2022. [55] Lukas Herron, Pablo Sartori, and Bing Kan Xue. Robust retrieval of dynamic sequences through interaction modulation. ar Xiv preprint ar Xiv:2211.17152, 2022. [56] Mufeng Tang, Helen Barron, and Rafal Bogacz. Sequential memory with temporal predictive coding. ar Xiv preprint ar Xiv:2305.11982, 2023. [57] Arjun Karuvally, Terry J Sejnowski, and Hava T Siegelmann. Energy-based general sequential episodic memory networks at the adiabatic limit. ar Xiv preprint ar Xiv:2212.05563, 2022. [58] Hubert Ramsauer, Bernhard Schäfl, Johannes Lehner, Philipp Seidl, Michael Widrich, Lukas Gruber, Markus Holzleitner, Thomas Adler, David Kreil, Michael K Kopp, Günter Klambauer, Johannes Brandstetter, and Sepp Hochreiter. Hopfield networks is all you need. In International Conference on Learning Representations, 2021. URL https://openreview.net/forum? id=t L89Rnz Ii Cd. [59] Carlo Lucibello and Marc Mézard. The exponential capacity of dense associative memories. ar Xiv preprint ar Xiv:2304.14964, 2023. [60] Andreas Knoblauch, Günther Palm, and Friedrich T Sommer. Memory capacities for synaptic and structural plasticity. Neural Computation, 22(2):289 341, 2010. [61] Jacob A. Zavatone-Veth and Cengiz Pehlevan. On neural network kernels and the storage capacity problem. Neural Computation, 34(5):1136 1142, 04 2022. ISSN 0899-7667. doi:10.1162/neco_a_01494. URL https://doi.org/10.1162/neco_a_01494. [62] Jacob A. Zavatone-Veth and Cengiz Pehlevan. Activation function dependence of the storage capacity of treelike neural networks. Phys. Rev. 
E, 103:L020301, Feb 2021. doi:10.1103/Phys Rev E.103.L020301. URL https://link.aps.org/doi/10.1103/ Phys Rev E.103.L020301. [63] Rémi Monasson and Riccardo Zecchina. Weight space structure and internal representations: A direct approach to learning and generalization in multilayer neural networks. Phys. Rev. Lett., 75:2432 2435, Sep 1995. doi:10.1103/Phys Rev Lett.75.2432. URL https://link.aps.org/ doi/10.1103/Phys Rev Lett.75.2432. [64] Albert Gu, Karan Goel, and Christopher Ré. Efficiently modeling long sequences with structured state spaces. ar Xiv preprint ar Xiv:2111.00396, 2021. [65] Antonio Orvieto, Samuel L Smith, Albert Gu, Anushan Fernando, Caglar Gulcehre, Razvan Pascanu, and Soham De. Resurrecting recurrent neural networks for long sequences. ar Xiv preprint ar Xiv:2303.06349, 2023. [66] Dmitry Krotov and John Hopfield. Dense associative memory is robust to adversarial inputs. Neural computation, 30(12):3151 3167, 2018. [67] HH Chen, YC Lee, GZ Sun, HY Lee, Tom Maxwell, and C Lee Giles. High order correlation model for associative memory. In AIP Conference Proceedings, volume 151, pages 86 99. American Institute of Physics, 1986. [68] Demetri Psaltis and Cheol Hoon Park. Nonlinear discriminant functions and associative memories. In AIP conference Proceedings, volume 151, pages 370 375. American Institute of Physics, 1986. [69] Pierre Baldi and Santosh S Venkatesh. Number of stable points for spin-glasses and neural networks of higher orders. Physical Review Letters, 58(9):913, 1987. [70] E Gardner. Multiconnected neural network models. Journal of Physics A: Mathematical and General, 20(11):3453, 1987. [71] Laurence F Abbott and Yair Arian. Storage capacity of generalized networks. Physical review A, 36(10):5091, 1987. [72] D Horn and M Usher. Capacities of multiconnected memory models. Journal de Physique, 49 (3):389 395, 1988. [73] Elena Agliari, Alberto Fachechi, and Chiara Marullo. Nonlinear PDEs approach to statistical mechanics of dense associative memories. Journal of Mathematical Physics, 63(10):103304, 2022. doi:10.1063/5.0095411. URL https://doi.org/10.1063/5.009541. [74] Elena Agliari, Linda Albanese, Francesco Alemanno, Andrea Alessandrelli, Adriano Barra, Fosca Giannotti, Daniele Lotito, and Dino Pedreschi. Dense Hebbian neural networks: a replica symmetric picture of unsupervised learning. ar Xiv, 2022. doi:10.48550/ARXIV.2211.14067. URL https://arxiv.org/abs/2211.14067. [75] Linda Albanese, Francesco Alemanno, Andrea Alessandrelli, and Adriano Barra. Replica symmetry breaking in dense Hebbian neural networks. Journal of Statistical Physics, 189(2): 24, Sep 2022. ISSN 1572-9613. doi:10.1007/s10955-022-02966-8. URL https://doi.org/ 10.1007/s10955-022-02966-8. [76] Stéphane Boucheron, Gábor Lugosi, and Pascal Massart. Concentration Inequalities: A Nonasymptotic Theory of Independence. Oxford University Press, 02 2013. ISBN 9780199535255. doi:10.1093/acprof:oso/9780199535255.001.0001. URL https://doi. org/10.1093/acprof:oso/9780199535255.001.0001. [77] DLMF. NIST Digital Library of Mathematical Functions. http://dlmf.nist.gov/, Release 1.1.1 of 2021-03-15, 2021. URL http://dlmf.nist.gov/. F. W. J. Olver, A. B. Olde Daalhuis, D. W. Lozier, B. I. Schneider, R. F. Boisvert, C. W. Clark, B. R. Miller, B. V. Saunders, H. S. Cohl, and M. A. Mc Clain, eds. [78] Yann Le Cun, Corinna Cortes, and CJ Burges. MNIST handwritten digit database. ATT Labs [Online], 2, 2010. URL http://yann.lecun.com/exdb/mnist. 
A Review of Modern Hopfield Networks

Here we review the Hopfield network and its modern generalization as an auto-associative memory model. These ideas will be helpful for storing sequences in network dynamics.

A.1 The Hopfield Network

We first introduce the classic Hopfield Network [23]. Let N be the number of neurons in the network and $S(t) \in \{-1, +1\}^N$ be the state of the network at time t. The task is to store P patterns, $\{\xi^1, \ldots, \xi^P\}$, where $\xi^\mu_j \in \{\pm 1\}$ is the jth neuron of the µth pattern. The goal is to design a network with dynamics such that when the network is initialized with a pattern, it will converge to one of the stored memories. The Hopfield Network [23] attempts this by following the discrete-time synchronous update rule4
\[ S(t+1) = T_{\mathrm{HN}}(S(t)), \tag{A.1} \]
where the transition operator $T_{\mathrm{HN}}(\cdot)_i$ for neuron i is defined in terms of symmetric Hebbian weights:
\[ T_{\mathrm{HN}}(S)_i = \mathrm{sgn}\Big[\sum_{j\neq i} J_{ij} S_j\Big], \qquad J_{ij} = \frac{1}{N}\sum_{\mu=1}^{P} \xi^\mu_i \xi^\mu_j. \tag{A.2} \]
Note that we are excluding self-interaction terms ($J_{ii}$) in Equation A.2. To interpret these dynamics from another useful point of view, we define the overlap, or Mattis magnetization, $m^\mu_i$ of the network state S with pattern $\xi^\mu$. We can then rewrite the update rule for the Hopfield Network as
\[ T_{\mathrm{HN}}(S)_i := \mathrm{sgn}\Big[\sum_{\mu=1}^{P} \xi^\mu_i m^\mu_i\Big], \qquad m^\mu_i := \frac{1}{N-1}\sum_{j\neq i}\xi^\mu_j S_j. \tag{A.3} \]
We interpret this as follows: at every time t, the network tries to identify the pattern $\xi^\mu$ it is closest to and updates neuron i to the value for that pattern.

4For the Hopfield network, one can also consider an asynchronous update rule in which only one neuron is updated at each timestep [23, 26].

A natural question to ask about associative memory networks is their capacity: how many patterns can be stored and recalled with minimal error? This question has been the subject of many studies [23, 27–29, 33–36]. Intuitively, in recalling a pattern $\xi^\nu$, what limits the network's capacity is the overlap between the pattern $\xi^\nu$ and other patterns, referred to as the crosstalk [26, 36]. A precise answer to the storage capacity question can be given under the assumption that the patterns $\{\xi^\mu\}$ are sampled from some probability distribution. While different notions of capacity have been considered in the literature [23, 27–29, 33–36], we focus on the fixed-point capacity, which characterizes the probability that, when initialized at a given pattern, the network dynamics do not move the state away from that point. To render the problem analytically tractable, it is usually assumed that the pattern components are i.i.d. Rademacher random variables, i.e., $\mathbb{P}(\xi^\mu_j = \pm 1) = 1/2$ for all j and µ. Then, at finite network size one can define the capacity as
\[ P_{\mathrm{HN}}(N, c) = \max\Big\{ P \in \{2, \ldots, 2^N\} : \mathbb{P}\Big[\bigcap_{\mu=1}^{P}\{T_{\mathrm{HN}}(\xi^\mu) = \xi^\mu\}\Big] \geq 1 - c \Big\}, \tag{A.4} \]
where $c \in [0, 1)$ is a fixed error tolerance. As we review in detail in Appendix B, one finds an asymptotic capacity estimate $P_{\mathrm{HN}} \simeq \frac{N}{4\log(N)}$ for $c \to 0$, which can be shown to be a sharp threshold [33–35].

A.2 Modern Hopfield Networks

Recent work from Krotov and Hopfield [30, 66] reinvigorated a line of research into generalized Hopfield Networks with larger capacity [67–72], resulting in what are now called Dense Associative Memories or Modern Hopfield Networks:
\[ T_{\mathrm{MHN}}(S)_i := \mathrm{sgn}\Big[\sum_{\mu=1}^{P} \xi^\mu_i f(m^\mu_i)\Big], \]
where f, referred to as the interaction function, is a nonlinear, monotonically increasing function whose purpose is to separate the pattern overlaps for a better signal-to-noise ratio. Since $m^\mu_i(t)$ has a maximum value of 1, contributions from patterns with partial overlaps will be reduced by the interaction function.
This diminishes the crosstalk and thereby increases the probability of transitioning to the correct pattern. If the interaction function is chosen to be f(x) = xd, then the MHN s capacity has been shown to scale polynomially with network size as P βd Nd log(N), where βd is a numerical constant depending on the degree d [30, 73 75]. Using a different definition of capacity, Demircigil et al. [31] have also shown that an exponential nonlinearity can lead to exponential scaling of the capacity. See [32] for a recent review of these results. B Review of Hopfield network fixed-point capacity In this Appendix, we review the computation of the classical Hopfield network fixed-point capacity. Our approach will follow but not exactly match that of Petritis [33]. Though these results are standard, we review them in detail both because this approach will inspire in part our approach to the Dense Net, and because several important steps of the analysis are significantly simpler than the corresponding steps for the Dense Net. We begin by recalling that the Hopfield network update can be written as THN(S)i := sgn j =i ξµ j Sj and that our goal is to determine PHN(N, c) = max P {2, . . . , 2N} : P P µ=1{THN(ξµ) = ξµ} 1 c (B.2) for some absolute constant 0 c < 1, at least in the regime where N, P 1 [33 36]. As is standard in theoretical studies of Hopfield model capacity [26 29, 33 36], we take in these probabilities the pattern components ξµ k to be independent and identically distributed Rademacher random variables. We can expand the memorization probability as a union of single-bitflip events: µ=1 {THN(ξµ) = ξµ} i=1 {THN(ξµ)i = ξµ i } This illustrates why analyzing the memorization probability is complicated: the single-pattern events THN(ξµ) = ξµ are not independent across patterns µ, and each single-pattern event is itself the intersection of non-independent single-neuron events THN(ξµ)i = ξµ i . However, as the single-bitflip probabilities P[THN(ξµ)j = ξµ j ] are identical for all µ and j, we can obtain a straightforward union bound µ=1 {THN(ξµ) = ξµ} i=1 {THN(ξµ)i = ξµ i } i=1 P [THN(ξµ)i = ξµ i ] (B.5) = 1 NPP[THN(ξ1)1 = ξ1 1], (B.6) where we focus without loss of generality on the first element of the first pattern. Therefore, if we can control the single-bitflip probability P[THN(ξ1)1 = ξ1 1], we can obtain a lower bound on the true capacity. In particular, PHN(N, c) max P {2, . . . , 2N} : NPP[THN(ξ1)1 = ξ1 1] c (B.7) From the definition of the Hopfield network update rule, we have P[THN(ξ1)1 = ξ1 1] = P j =i ξµ 1 ξµ j ξ1 j j =i ξ1 1ξµ 1 ξ1 j ξµ j < 0 = P [C > 1] , (B.10) where we have defined j =i ξ1 1ξµ 1 ξ1 j ξµ j (B.11) and used the fact that the distribution of C is symmetric. C is referred to as the crosstalk, because it represents the effect of interference between the first pattern and the other P 1 patterns on recall of the first pattern. We can simplify the crosstalk using the fact that, since we have assumed i.i.d. Rademacher patterns, we have the equality in distribution ξ1 j ξµ j d= ξµ j (B.12) for all µ = 2, . . . , P and j = 1, . . . , N, yielding j =i ξµ 1 ξµ j . (B.13) Similarly, we have ξµ 1 ξµ j d= ξµ j (B.14) for all µ = 2, . . . , P and j = 2, . . . , N, which finally yields j =i ξµ j . (B.15) Therefore, for the classic Hopfield network the crosstalk is equal in distribution to the sum of (P 1)(N 1) i.i.d. Rademacher random variables. 
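As a quick numerical sanity check on this distributional identity, the sketch below (ours, not from the paper's repository) samples the classic Hopfield crosstalk using the fact that a sum of K i.i.d. Rademacher variables equals 2 Binomial(K, 1/2) - K, and estimates the single-bitflip probability P[C > 1].

```python
import numpy as np

def classic_bitflip_prob(N, P, trials=1_000_000, seed=0):
    """Monte-Carlo estimate of P[C > 1] for the classic Hopfield network,
    where C is distributed as 1/(N-1) times a sum of (P-1)(N-1) i.i.d.
    Rademacher random variables."""
    rng = np.random.default_rng(seed)
    K = (P - 1) * (N - 1)
    # Sum of K Rademacher variables = 2 * Binomial(K, 1/2) - K.
    C = (2.0 * rng.binomial(K, 0.5, size=trials) - K) / (N - 1)
    return np.mean(C > 1.0)

N = 500
for P in (20, 40, 60, 80):
    print(f"N={N}, P={P}: estimated P[C > 1] = {classic_bitflip_prob(N, P):.2e}")
```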
B.1 Approach 1: Hoeffding s inequality Now, we can immediately apply Hoeffding s inequality [76], which implies that for any t > 0 k=2 ξµ k > t We then have that k=2 ξµ k > N 1 We then have the bound PHN(N, c) max P {2, . . . , 2N} : NP exp 1 We now want to consider the regime N 1, and demand that the error probability should tend to zero as we increase N. If we substitute in the Ansatz P N α log N , (B.19) the bound is easily seen to tend to zero for all α 4, yielding an estimated capacity of PHN N 4 log N . (B.20) As this estimates follows from a sequence of lower bounds on the memorization probability, it is a lower bound on the true capacity of the model [33]. However, via a more involved argument that accounts for the associations between the events THN(ξµ) = ξµ, it was shown by Bovier [34] to be tight. For the classical Hopfield network, the single bitflip probability P[C > 1] is easy to control using elementary concentration inequalities because the crosstalk can be expressed as a sum of (P 1)(N 1) i.i.d. random variables. Therefore, we expect the crosstalk to concentrate whenever N or P or both together are large. However, for the Dense Net, we will find in Appendix C that the crosstalk is given as the sum of P 1 i.i.d. random variables, each of which is a nonlinear function applied to the sum of N 1 i.i.d. Rademacher random variables. Naïve application of Hoeffding s inequality is then not particularly useful. We will therefore take a simpler, though less rigorously controlled approach, which can also be applied to the classical Hopfield network: we approximate the distribution of the crosstalk as Gaussian [26]. B.2 Approach 2: Gaussian approximation For the classical Hopfield network, the fact that the crosstalk can be expressed as a sum of (P 1)(N 1) i.i.d. Rademacher random variables means that the classical central limit theorem implies that it tends in distribution to a Gaussian whenever (P 1)(N 1) tends to infinity. By symmetry, the mean of the crosstalk is zero, while its variance is easily seen to be var(C) = P 1 N 1. (B.21) If we approximate the distribution of the crosstalk for N and P large but finite by a Gaussian, we therefore have where H(x) = erfc(x/ 2)/2 is the Gaussian tail distribution function. We want to have P[C > 1] 0, so we must have (P 1)/(N 1) 0. Then, we can use the asymptotic expansion [26] as z (B.23) to obtain the heuristic Gaussian approximation (P 1) 2π(N 1) exp (N 1) If we use this Gaussian approximation instead of the Hoeffding bound applied above, we can easily see that we will obtain identical estimates for the capacity with an error tolerance tending to zero. However, we have given up the rigor of the bound from Hoeffding s inequality, since we have not controlled the rate of convergence to the Gaussian tail probability. In particular, the Berry-Esseen theorem would give in this case a uniform additive error bound of 1/ p (P 1)(N 1), which in the regime P N/[α log N] cannot compete with the factors of N or NP which we want P[C > 1] to overwhelm. We will not worry about this issue, as we are concerned more with whether we can get accurate capacity estimates that match numerical experiment than whether we can prove those estimates completely rigorously. We can also use the Gaussian approximation to estimate the capacity for a non-zero error threshold c at finite N. 
Concretely, if we demand that the union bound is saturated, i.e., NP P[THN(ξ1)1 = ξ1 1] = c, (B.25) under the Gaussian approximation for the bitflip probability we have the self-consistent equation for P, which we can re-write as P 1 = N 1 [H 1(c/NP)]2 . (B.27) This is a transcendental self-consistent equation, which is not easy to solve analytically. However, we can make some progress at small c/(NP). Using the asymptotic expansion of the inverse of the complementary error function [77], we have [H 1(x)]2 = 2 inverfc(2x)2 (B.28) log 4πx2 log 1 = 2 log(x) log(4π) log log 1 2 log(x) (B.31) as x 0. Then, assuming c is such that log(c) is negligible relative to log(NP), we have P N 2 log(NP), (B.32) which we can solve for P as P N 2W0(N 2/2), (B.33) where W0 is the principal branch of the Lambert-W function [77]. But, at large N, we can use the asymptotic W0(N) log(N) to obtain the approximate scaling P N 4 log(N), (B.34) which agrees with our earlier result. Conceptually, this intuition is consistent with there being a sharp transition in the thermodynamic limit, as proved rigorously by Bovier [34]. C Dense Net Capacity In this Appendix, we analyze the capacity of the Dense Net. As introduced in Section 2.1 of the main text, there are two notions of robustness to consider: the robustness of a single transition and the robustness of the full sequence. For a fixed N {2, 3, . . .} and an error tolerance c [0, 1), we define these two notions of capacity as PT (N, c) = max P {2, . . . , 2N} : P TDN(ξ1) = ξ2 1 c (C.1) and PS(N, c) = max P {2, . . . , 2N} : P P µ=1{TDN(ξµ) = ξµ+1} 1 c , (C.2) respectively. Our goal is to approximately compute the capacity in the regime in which N and P are large. Following Petritis [33] s approach to the HN, to make analytical progress, we can use a union bound to control the single-step error probability in terms of the probability of a single bitflip: P TDN(ξµ) = ξµ+1 i=1 {TDN(ξµ)i = ξµ+1 i } i=1 P h TDN(ξµ)i = ξµ+1 i i (C.4) = 1 NP[TDN(ξ1)1 = ξ1 2]. (C.5) where we use the fact that all elements of all patterns are i.i.d. by assumption. We use a similar approach to control the sequence error probability in terms of the probability of a single bitflip: µ=1 {TDN(ξµ) = ξµ+1} i=1 {TDN(ξµ)i = ξµ+1 i } i=1 P h TDN(ξµ)i = ξµ+1 i i (C.7) = 1 NPP[TDN(ξ1)1 = ξ1 2]. (C.8) Thus, as claimed in the main text, we have the lower bounds PT (N, c) max P {2, . . . , 2N} : NP[TDN(ξ1)1 = ξ1 2] c (C.9) PS(N, c) max P {2, . . . , 2N} : NPP[TDN(ξ1)1 = ξ1 2] c . (C.10) As introduced in the main text, for perfect recall, we want to take the threshold c to be zero, or at least to tend to zero as N and P tend to infinity. The capacities estimated through this argument are lower bounds on the true capacities, as they are obtained from lower bounds on the true recall probability. However, we expect for these bounds to in fact be tight in the thermodynamic limit [33, 34]. 
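Self-consistent equations of this finite-c type, such as (B.27) above and the Dense Net analogues derived below, can also be solved numerically rather than asymptotically. A minimal sketch, assuming SciPy's erfcinv and lambertw and using our own fixed-point iteration (not the authors' code):

```python
import numpy as np
from scipy.special import erfcinv, lambertw

def Hinv(y):
    """Inverse of the Gaussian tail function H(x) = erfc(x / sqrt(2)) / 2."""
    return np.sqrt(2.0) * erfcinv(2.0 * y)

def hopfield_capacity_finite_c(N, c, iters=100):
    """Solve P - 1 = (N - 1) / Hinv(c / (N P))**2 by fixed-point iteration,
    starting from the asymptotic estimate P ~ N / (4 log N)."""
    P = N / (4.0 * np.log(N))
    for _ in range(iters):
        P = 1.0 + (N - 1) / Hinv(c / (N * P)) ** 2
    return P

N = 1000
for c in (0.3, 0.1, 0.01):
    print(f"c = {c}: self-consistent P = {hopfield_capacity_finite_c(N, c):.1f}")

# Small-argument limit: P ~ N / (2 W0(N^2 / 2)) ~ N / (4 log N).
P_lambert = N / (2.0 * np.real(lambertw(N**2 / 2.0)))
print("Lambert-W estimate:", P_lambert, " N/(4 log N):", N / (4 * np.log(N)))
```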
By the definition of the Dense Net update rule with interaction function f given in Equation (2), we have TDN(ξ1)1 = sgn µ=1 ξµ+1 1 f j=2 ξµ j ξ1 j and therefore the single-bitflip probability is P[TDN(ξ1)1 = ξ2 1] = P µ=1 ξµ+1 1 f j=2 ξµ j ξ1 j µ=1 ξµ+1 1 f j=2 ξµ j ξ1 j f(1) + ξ2 1 µ=2 ξµ+1 1 f j=2 ξµ j ξ1 j For both the polynomial (f(x) = xd) and exponential (f(x) = e(N 1)(x 1)) interaction functions, f(1) = 1, and so P[TDN(ξ1)1 = ξ2 1] = P µ=2 ξ2 1ξµ+1 1 f j=2 ξµ j ξ1 j We refer to the random variable µ=2 ξ2 1ξµ+1 1 f j=2 ξµ j ξ1 j on the left-hand-side of this inequality as the crosstalk, because it represents the effect of interference between the first pattern and all other patterns [26, 36]. We now observe that, as we have excluded self-interactions (i.e., the sum over neurons inside the interaction function does not include j = 1), we can use the periodic boundary conditions to shift indices as ξµ 1 ξµ+1 1 for all µ, yielding µ=2 ξ1 1ξµ 1 f j=2 ξµ j ξ1 j Thus, the single-bitflip probability for this Dense Net is identical to that for the corresponding MHN with symmetric interactions. Then, we can use the fact that ξµ j ξ1 j d= ξµ j for all µ = 2, . . . , P to obtain Now, define the P 1 random variables χµ = ξµ 1 f for µ = 2, . . . , P, such that the crosstalk is their sum, µ=2 χµ. (C.20) As the patterns ξµ j are i.i.d., χµ are i.i.d. random variables of mean E[χµ] = E[ξµ 1 ]E and variance var(χµ) = E which is bounded from above for any sensible interaction function. We observe also that the distribution of each χµ is symmetric because of the symmetry of the distribution of ξµ 1 . We will therefore simply write χ for any given χµ. Then, the classical central limit theorem implies that the crosstalk tends in distribution to a Gaussian of mean zero and variance (P 1) var(χ) as P , at lease for any fixed N. However, we are interested in the joint limit in which N, P together. We will proceed by approximating the distribution of C as Gaussian, and will not attempt to rigorously control its behavior in the joint limit. Approximating the distribution of the crosstalk for N, P 1 by a Gaussian, we then have P[TDN(ξ1)1 = ξ2 1] H (P 1) var(χ) where H(x) = erfc(x/ 2)/2 is the Gaussian tail distribution function. We want to have P[TDN(ξ1)1 = ξ2 1] 0, so we must have (P 1) var(χ) 0. Then, we can use the asymptotic expansion [26] as z (C.24) P[TDN(ξ1)1 = ξ2 1] (P 1) var(χ) 2π exp 1 2(P 1) var(χ) For each model, we can evaluate var(χ) and then determine the resulting predicted capacity. As we did for the classic Hopfield network in Appendix B, we can estimate the capacity at finite c within the Gaussian approximation by inverting the Gaussian tail distribution function. Concretely, under the union bound, we can estimate the transition capacity by solving (PT 1) var(χ) which yields PT 1 = 1 var(χ)[H 1(c/N)]2 , (C.27) and the sequence capacity by solving the transcendental self-consistent equation (PS 1) var(χ) which we can re-write as PS 1 = 1 var(χ)[H 1(c/NPS)]2 . (C.29) As in the classic Hopfield case, we can simplify these complicated equations somewhat by assuming that c/N and c/(NPS) are small. Concretely, using the asymptotic [H 1(x)]2 2 log(x) (C.30) for x 0, the transition capacity simplifies to PT 1 1 2 var(χ) log(N) (C.31) under the assumption that log(c) is negligible relative to log(N). 
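The Gaussian approximation can be probed directly by Monte Carlo: sample the summands chi^mu = xi^mu_1 f(Xi^mu), estimate var(chi), and compare the empirical bitflip probability with the Gaussian tail H(1/sqrt((P-1) var(chi))). The sketch below (our own, with illustrative parameters deliberately chosen above capacity so that the error probability is visible) does this for the polynomial interaction function.

```python
import numpy as np
from scipy.stats import norm
from scipy.special import factorial2

rng = np.random.default_rng(0)

def sample_chi(N, f, size):
    """Sample chi = xi_1 * f(Xi), where Xi is the mean of N-1 i.i.d. Rademacher
    variables (drawn through a Binomial for efficiency) and xi_1 = +/-1."""
    Xi = (2.0 * rng.binomial(N - 1, 0.5, size=size) - (N - 1)) / (N - 1)
    xi1 = rng.choice([-1, 1], size=size)
    return xi1 * f(Xi)

N, P, d = 100, 800, 2
f = lambda x: x**d

chi = sample_chi(N, f, size=500_000)
var_chi = chi.var()
# Compare with the closed-form value (2d-1)!! / N**d derived in Section C.1 below.
print("var(chi): Monte Carlo", var_chi, " vs (2d-1)!!/N^d =", factorial2(2 * d - 1) / N**d)

# Empirical single-bitflip probability (by symmetry P[C < -1] = P[C > 1])
# versus the Gaussian tail H(1 / sqrt((P-1) var(chi))).
trials = 20_000
C = sample_chi(N, f, size=trials * (P - 1)).reshape(trials, P - 1).sum(axis=1)
print("empirical P[C < -1]:", np.mean(C < -1.0),
      " Gaussian tail:", norm.sf(1.0 / np.sqrt((P - 1) * var_chi)))
```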
For the sequence capacity, we can follow an identical argument to that used for the classic Hopfield network to simplify the self-consistent equation to PS 1 2 var(χ) log(NPS) (C.32) under the assumption that log(c) is negligible relative to log(NPS), which we can solve to obtain PS 1 2 var(χ)W0[N/2 var(χ)]. (C.33) Assuming that N/ var(χ) as N , we can use the asymptotic W0(N) log(N) to obtain the asymptotic PS 1 2 var(χ) log[N/ var(χ)]. (C.34) Our first check on the accuracy of the Gaussian approximation will be comparison of the resulting predictions for capacity with numerical experiment. As another diagnostic, we will consider the excess kurtosis κ = κ4(C)/κ2(C) for κn(C) the n-th cumulant of C. If the distribution is indeed Gaussian, the excess kurtosis vanishes, while large values of the excess kurtosis indicate deviations from Gaussianity. By the additivity of cumulants, we have κn(C) = (P 1)κn(χ). (C.35) By symmetry, all odd cumulants of χ and therefore all odd cumulants of C are identically zero. As noted above, we have var(χ) = κ2(χ) = E If C is indeed Gaussian, then all cumulants above the second should vanish. As the third cumulant vanishes by symmetry, the leading possible correction to Gaussianity is the fourth cumulant, which as χ has zero mean is given by κ4(χ) = E[(χ)4] 3E[(χ)2]2 (C.37) Rather than considering the fourth cumulant directly, we will consider the excess kurtosis κ2(C)2 = 1 P 1 κ4(χ) κ2(χ)2 , (C.39) which is a more useful metric because it is normalized. C.1 Polynomial Dense Net Capacity We first consider the Polynomial Dense Net, with interaction function f(x) = xd for d N>0. To compute the capacity, our goal is then to evaluate at large N. From the central limit theorem, we expect (N 1)d . (C.41) We can make this quantitatively precise through the following straightforward argument. Let j=2 ξ2 j . (C.42) We then have immediately that the moment generating function of Ξ is M(t) = E[etΞ] = cosh t N 1 , (C.43) hence the cumulant generating function is K(t) = log M(t) = (N 1) log cosh t The function x 7 log cosh(x) is an even function of x, and is analytic near the origin, with the first few orders of its Mac Laurin series being log cosh(x) = x2 12 + O(x6). (C.45) Then, the odd cumulants of Ξ vanish as we expect from symmetry while the even cumulants obey κ2k = C2k (N 1)k 1 (C.46) for combinatorial factors C2k that do not scale with N. We have, in particular, C2 = 1 and C4 = 2. By the moments-cumulants formula, we have E[Ξ2k] = B2k(0, κ2, 0, κ4, , κ2k) (C.47) for B2k the 2k-th complete exponential Bell polynomial. From this, it follows that E[Ξ2k] = (2k 1)!! + O(N 1), (C.48) as all cumulants other than κ2 = 1 are O(N 1). Therefore, neglecting subleading terms, we have Following the general arguments above, we then approximate P[TDN(ξ1)1 = ξ2 1] 2πN d exp N d To determine the single-transition capacity following the argument in Section 2.1, we must determine how large we can take P = P(N) such that NP[TDN(ξ1)1 = ξ2 1] 0. Following the requirement that P var(χ) 0, we make the Ansatz α(2d 1)!! log N (C.51) for some α. We then have NP[TDN(ξ1)1 = ξ2 1] r 1 2πα log N N 1 α/2. (C.52) This tends to zero if α 2, meaning that the predicted capacity in this case is 2(2d 1)!! log N . (C.53) We now want to determine the sequence capacity, which requires the stronger condition NPP[TDN(ξ1)1 = ξ2 1] 0. Again making the Ansatz α(2d 1)!! log N (C.54) for some α, we then have NPP[TDN(ξ1)1 = ξ2 1] 1 2π(2d 1)!! (α log N)3/2 N d+1 α/2, (C.55) which tends to zero if α 2d + 2. 
Then, the predicted sequence capacity is

P_S ≈ N^d / (2(d+1)(2d−1)!! log N). (C.56)

If we consider the alternative asymptotic formulas obtained above from the finite-c argument, we have

P_T − 1 ≈ 1/(2 var(χ) log(N)) = N^d / (2(2d−1)!! log(N)), (C.57)

P_S − 1 ≈ 1/(2 var(χ) log[N/var(χ)]) = N^d / (2(2d−1)!! log[N^{d+1}/(2d−1)!!]) ≈ N^d / (2(d+1)(2d−1)!! log(N)), (C.58)

which agree with these results. For numerical evidence supporting the finite-c argument for the polynomial Dense Net, see Figure C.1.

Figure C.1: The transition capacity of the polynomial Dense Net is shown for different values of the error tolerance c. Even for c = 0 we obtain similar scaling curves, although the capacities increase slightly but consistently as c increases, indicated by a transition from dark to light. We plot curves from c = 0.0 to c = 0.5 for each degree d, with the legend labeling curves up to c = 0.3 to demonstrate the general trend.

Using the Gaussian approximation for the moments of χ given above, we can easily work out that

κ_4(χ) = E[χ^4] − 3E[χ^2]^2 (C.59)
= (1/N^{2d}) {(4d−1)!! − 3[(2d−1)!!]^2} [1 + O(1/N)]. (C.60)

Then, the excess kurtosis of the Polynomial Dense Net's crosstalk is

κ = (1/(P−1)) {(4d−1)!!/[(2d−1)!!]^2 − 3} [1 + O(1/N)]. (C.62)

Thus, for the Polynomial Dense Net, we expect the excess kurtosis to be small for any fixed d so long as P and N are both fairly large, without any particular requirement on their relationship. In particular, under the Gaussian approximation we predicted above that the transition and sequence capacities should both scale as

P ≈ N^d / (α_d log N), (C.63)

where α_d depends on d but not on N. This gives an excess kurtosis of

κ ≈ (α_d log N / N^d) {(4d−1)!!/[(2d−1)!!]^2 − 3} [1 + O(1/N)], (C.64)

which for any fixed d rapidly tends to zero with increasing N. This suggests that the Gaussian approximation should be reasonably accurate even at modest N, but of course does not constitute a proof of its accuracy because we have not considered higher cumulants. However, it matches the results of the numerical simulations shown in Figure 2.

C.2 Exponential Dense Net capacity

We now turn our attention to the Exponential Dense Net, with separation function f(x) = e^{(N−1)(x−1)}. In this case, we have

var(χ) = exp[−2(N−1)] E[exp(2 Σ_{j=2}^N ξ^2_j)] = exp[−2(N−1)] Π_{j=2}^N E[exp(2 ξ^2_j)] (C.66)
= exp[−2(N−1)] cosh(2)^{N−1} (C.67)
= 1/β^{N−1}, (C.68)

where we have defined the constant

β = e^2/cosh(2) ≈ 1.96403. (C.69)

Then, we have the Gaussian approximation

P[T_DN(ξ^1)_1 ≠ ξ^2_1] ≈ sqrt((P−1)/(2π β^{N−1})) exp(−β^{N−1}/(2(P−1))). (C.70)

As in the polynomial case, we first determine the single-transition capacity by demanding that N P[T_DN(ξ^1)_1 ≠ ξ^2_1] → 0. We plug in the Ansatz

P − 1 = β^{N−1}/(α log N) (C.71)

for some α, which yields

N P[T_DN(ξ^1)_1 ≠ ξ^2_1] ≈ sqrt(1/(2πα log N)) N^{1−α/2}. (C.72)

This tends to zero if α ≥ 2, which gives a predicted capacity of

P_T ≈ β^{N−1}/(2 log N). (C.73)

Considering the sequence capacity, which again requires that N P P[T_DN(ξ^1)_1 ≠ ξ^2_1] → 0, we plug in the Ansatz

P − 1 = β^{N−1}/(αN), (C.74)

which yields

N P P[T_DN(ξ^1)_1 ≠ ξ^2_1] ≈ (1/α) sqrt(1/(2παN)) exp{[log β − α/2] N}. (C.75)

This tends to zero for α ≥ 2 log β, meaning that the predicted capacity is in this case

P_S ≈ β^{N−1}/(2 log(β) N). (C.76)
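For reference, the closed-form capacity predictions derived in this Appendix are easy to tabulate. The sketch below is our own illustration of the formulas above (not the authors' plotting code); it evaluates the c → 0 predictions for the polynomial and exponential Dense Nets.

```python
import numpy as np
from scipy.special import factorial2

def poly_capacities(N, d):
    """Gaussian-theory predictions for the polynomial Dense Net:
    P_T ~ N^d / (2 (2d-1)!! log N),  P_S ~ N^d / (2 (d+1) (2d-1)!! log N)."""
    df = factorial2(2 * d - 1)
    PT = N**d / (2.0 * df * np.log(N))
    PS = N**d / (2.0 * (d + 1) * df * np.log(N))
    return PT, PS

def exp_capacities(N):
    """Gaussian-theory predictions for the exponential Dense Net:
    P_T ~ beta^(N-1) / (2 log N),  P_S ~ beta^(N-1) / (2 N log beta),
    with beta = e^2 / cosh(2)."""
    beta = np.exp(2.0) / np.cosh(2.0)
    PT = beta ** (N - 1) / (2.0 * np.log(N))
    PS = beta ** (N - 1) / (2.0 * N * np.log(beta))
    return PT, PS

for d in (1, 2, 3, 4):
    PT, PS = poly_capacities(1000, d)
    print(f"poly d={d}: P_T ~ {PT:.3g}, P_S ~ {PS:.3g}")
for N in (10, 20, 25):
    PT, PS = exp_capacities(N)
    print(f"exp  N={N}: P_T ~ {PT:.3g}, P_S ~ {PS:.3g}")
```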
Therefore, while the ratio of the predicted single-transition to sequence capacities is finite for the Polynomial Dense Net (it is simply P_S/P_T → 1/(d+1)), for the Exponential Dense Net it tends to zero as P_S/P_T ≈ log N/[log(β) N]. Using the asymptotic formulas obtained above from the finite-c argument, we have

P_T − 1 ≈ 1/(2 var(χ) log(N)) = β^{N−1}/(2 log(N)), (C.77)

P_S − 1 ≈ 1/(2 var(χ) log[N/var(χ)]) = β^{N−1}/(2 log[N β^{N−1}]) ≈ β^{N−1}/(2 log(β) N), (C.78)

which agree with these results. For numerical evidence supporting the finite-c argument for the exponential Dense Net, see Figure C.2.

Figure C.2: The transition capacity of the exponential Dense Net is shown for different values of the error tolerance c. Even for c = 0 we obtain similar scaling curves, although the capacities increase slightly but consistently as c increases.

Now considering the fourth cumulant, we can easily compute

κ_4(χ) = e^{−4(N−1)} [cosh(4)^{N−1} − 3 cosh(2)^{2(N−1)}], (C.79)

which yields an excess kurtosis of

κ = (1/(P−1)) {[cosh(4)/cosh(2)^2]^{N−1} − 3}. (C.80)

For this to be small, P must be exponentially large in N, which contrasts with the situation for the Polynomial Dense Net, in which the excess kurtosis is small for any reasonably large P. If we consider taking

P − 1 = β^{N−1}/(α log N), (C.81)

for a constant α, as the Gaussian theory predicts for the Exponential Dense Net transition capacity, we have

κ ≈ (α log N / β^{N−1}) [cosh(4)/cosh(2)^2]^{N−1} = α log N [cosh(4)/(e^2 cosh(2))]^{N−1} ≈ α log(N) (0.9823)^{N−1}. (C.84)

This tends to zero as N increases, but only very slowly. In particular, log(N)(0.9823)^{N−1} increases with N up to around N ≈ 19, where it attains a maximum value around 2, before decreasing towards zero. The situation is even worse for the sequence capacity, for which the Gaussian theory predicts

P − 1 = β^{N−1}/(αN), (C.85)

giving

κ ≈ (αN / β^{N−1}) [cosh(4)/cosh(2)^2]^{N−1} = αN [cosh(4)/(e^2 cosh(2))]^{N−1} ≈ αN (0.9823)^{N−1}. (C.88)

N(0.9823)^{N−1} increases with N up to around N ≈ 56, where it attains a value of approximately 21. Taken together, these results suggest that we might expect substantial finite-size corrections to the Gaussian theory's prediction for the capacity. In particular, as the excess kurtosis of the crosstalk is positive, the tails of the crosstalk distribution should be heavier than Gaussian, suggesting that the Gaussian theory should overestimate the true capacity. This holds provided that the lower bound on the memorization probability resulting from the union bound is reasonably tight.

D Bounding the polynomial Dense Net capacity

Here, we adapt Demircigil et al. [31]'s proof of a rigorous asymptotic lower bound on the polynomial MHN's capacity to obtain a rigorous asymptotic lower bound on the Dense Net capacity. This proof is a step-by-step adaptation of the proof of Theorem 1.2 of Demircigil et al. [31], which we spell out in detail for clarity. Our objective is to obtain an upper bound on the single-bitflip probability

P[T_DN(ξ^1)_1 ≠ ξ^2_1], (D.1)

which we have argued can be expressed in terms of the crosstalk C as

P[T_DN(ξ^1)_1 ≠ ξ^2_1] = P[C < −1] (D.2)

for the crosstalk C = Σ_{µ=2}^P χ^µ = Σ_{µ=2}^P ξ^µ_1 f(Ξ^µ) introduced in Appendix C (up to equality in distribution). (D.3)

Our goal is to prove the following: First, letting α > 2(2d−1)!! and P = N^d/(α log N), we have

N P[T_DN(ξ^1)_1 ≠ ξ^2_1] → 0 (D.4)

as N → ∞. Second, letting α > 2(d+1)(2d−1)!! and P = N^d/(α log N), we have

N P P[T_DN(ξ^1)_1 ≠ ξ^2_1] → 0 (D.5)

as N → ∞. By Chernoff's inequality (also known as the exponential Chebyshev inequality) [76], we then have

P[T_DN(ξ^1)_1 ≠ ξ^2_1] = P[C < −1] ≤ e^{−t(N−1)^d} E[exp(−t(N−1)^d C)] (D.6)

for any t > 0.
Using the fact that the pattern elements are i.i.d., we have j=2 ξµ j , (D.10) and expand the expectation as a sum over the possible values m {0, (N 1) 1/2, . . . , (N 1)1/2} of M: m cosh h t(N 1)d/2mdi P[M = m]. (D.11) For N 1, the distribution of M is nearly Gaussian. We thus split the sum over m to allow us to treat tail events separately. We fix β > 0, and split the sum at log(N)β: X m cosh h t(N 1)d/2mdi P[M = m] = X |m| log(N)β cosh h t(N 1)d/2mdi P[M = m] log(N)β<|m| N cosh h t(N 1)d/2mdi P[M = m], where we have used the fact that M We first consider the tail sum over |m| > log(N)β. As cosh is even and non-decreasing in the modulus of its argument, we have X log(N)β<|m| N cosh h t(N 1)d/2mdi P[M = m] (D.13) 2 cosh t(N 1)d P[M > log(N)β] (D.14) 2 cosh t(N 1)d exp 1 2 log(N)2β (D.15) 2 exp t(N 1)d 1 2 log(N)2β , (D.16) where in the second line we have applied Hoeffding s inequality to bound P[M > log(N)β] from above, and in the third line we have used the bound cosh(z) exp(z) for any z > 0. We now consider the sum over |m| log(N)β. Using the bound cosh(z) exp(z2/2), we have |m| log(N)β cosh h t(N 1)d/2mdi P[M = m] X |m| log(N)β exp 1 2t2(N 1)dm2d P[M = m]. Using the series expansion of the exponential, we have X |m| log(N)β exp 1 2t2(N 1)dm2d P[M = m] (D.18) |m| log(N)β 2t2(N 1)dm2d + (t2(N 1)dm2d)k P[M = m] (D.19) = P[|M| log(N)β] + 1 2t2(N 1)d X |m| log(N)β m2d P[M = m] |m| log(N)β (t2(N 1)dm2d)k P[M = m] (D.20) where on the third line we have used the linearity of summation. We will now bound each of the three contributions. The first term is trivially bounded from above by 1: P[|M| log(N)β] 1. (D.21) To handle the second, we first observe that X |m| log(N)β m2d P[M = m] E[m2d]. (D.22) Then, we observe that as m is the normalized sum of N 1 Rademacher random variables, its moments tend to those of a standard normal from below as N , and are for any N strictly bounded from above by those of the standard normal. Therefore, we have E[m2d] (2d 1)!!. (D.23) To handle the third term, we first use the fact that for any |m| log(N)β we have m2d log(N)2βd, which gives |m| log(N)β (t2(N 1)dm2d)k |m| log(N)β (t2(N 1)d log(N)2βd)k P[M = m] (D.24) P[|M| log(N)β] (t2(N 1)d log(N)2βd)k 2kk! (D.25) (t2(N 1)d log(N)2βd)k 2kk! (D.26) At this point, [31] uses the bound (t2(N 1)d log(N)2βd)k 4(e 2)(t2(N 1)d log(N)2βd)2 (D.27) which holds provided that we choose the arbitrary parameter t such that t2(N 1)d log(N)2βd 2. (D.28) Assuming that condition is satisfied, we can then combine these results to obtain X |m| log(N)β cosh h t(N 1)d/2mdi P[M = m] (D.29) 2t2(N 1)d(2d 1)!! + 1 4(e 2)(t2(N 1)d log(N)2βd)2 (D.30) 2t2(N 1)d(2d 1)!! + 1 4(t2(N 1)d log(N)2βd)2 (D.31) 2t2(N 1)d(2d 1)!! + 1 4(t2(N 1)d log(N)2βd)2 , (D.32) where on the the second line we have used the fact that e 2 0.718 . . . < 1 and on the third line we have used the bound 1 + x exp(x) for x 0. Combining this result with the bound on the tail sum obtained previously, we have that 2t2(N 1)d(2d 1)!! + 1 4(t2(N 1)d log(N)2βd)2 + 2 exp t(N 1)d 1 2 log(N)2β (D.33) for any β > 1/2 and 2 (N 1)d log(N)2βd . (D.34) Therefore, we have P[TDN(ξ1)1 = ξ1 2] e t(N 1)d ( 2t2(N 1)d(2d 1)!! + 1 4(t2(N 1)d log(N)2βd)2 + 2 exp t(N 1)d 1 2 log(N)2β )P 1 subject to these conditions on β and t. We now want to determine the single-transition and full-sequence capacities. To do so, we fix α > 0, and let P = N d/(α log N). As t is arbitrary, fix γ > 0, and let t = γ/P. 
For our choice of P, this gives t2(N 1)d log(N)2βd = γ2α2 (N 1)d N 2d log(N)2(βd+1) (D.36) which is clearly less than 2 for N sufficiently large. Therefore, we can apply the bound obtained above, which for this choice of t simplifies to P[TDN(ξ1)1 = ξ1 2] e t(N 1)d ( 2γ2α2(2d 1)!!log(N)2 4γ4α4 log(N)4(βd 1) + 2 exp γα 1 2 log(N)2β 1 log(N) [1 + o(1)] We can see that the first term in the curly braces tends to 1 with increasing N as its exponent tends to zero while the second term tends to zero as the term in the round brackets within the exponent is negative for sufficiently large N provided that β > 1/2. We may therefore neglect the second term, which gives the simplification P[TDN(ξ1)1 = ξ1 2] exp αγ 1 1 2γ(2d 1)!! log(N) [1 + o(1)]. (D.38) To determine the single-transition capacity under the union bound, we want NP[TDN(ξ1)1 = ξ1 2] to tend to zero. We have NP[TDN(ξ1)1 = ξ1 2] exp 1 αγ 1 1 2γ(2d 1)!! log(N) [1 + o(1)]. (D.39) For this bound to tend to zero, we should have 2γ(2d 1)!! < 0. (D.40) As γ is arbitrary, we may let γ = 1/(2d 1)!!, hence the required condition is clearly satisfied if α > 2(2d 1)!!, (D.41) as predicted by the Gaussian approximation. Next, to determine the sequence capacity, we want NPP[TDN(ξ1)1 = ξ1 2] to tend to zero. We have NPP[TDN(ξ1)1 = ξ1 2] 1 α log N exp d + 1 αγ 1 1 2γ(2d 1)!! log(N) [1 + o(1)], hence an identical line of reasoning to that used for the single-transition capacity shows that we must have α 2(d + 1)(2d 1)!!. (D.43) Again, this agrees with the Gaussian theory. E Generalized pseudoinverse rule capacity Here, we show that the generalized pseudoinverse rule can perfectly recall any sequence of linearlyindependent patterns. We recall from (11) that the GPI update rule is TGP I(S)i = sgn µ=1 ξµ+1 i f ν=1 (O+)µνmν(S) j=1 ξµ j ξν j (E.2) the Gram matrix of the patterns. If the patterns are linearly independent, then O is full rank, and the pseudoinverse reduces to the ordinary inverse: O+ = O 1. Under this assumption, we have TGP I(ξµ)i = sgn ν=1 ξν+1 i f(δµν) f(1)ξµ+1 i + f(0) X ν =µ ξν+1 i for all µ and i, hence for separation functions satisfying f(1) > 0 and |f(0)| < f(1)/(P 1) we are guaranteed to have TGP I(ξµ)i = ξµ+1 i as desired. For f(x) = xd, this condition is always satisfied as f(0) = 0 and f(1) = 1. For f(x) = e(N 1)(x 1), we have f(0) = e (N 1) and f(1) = 1; the condition P 1 < e N 1 must therefore be satisfied. However, as P N is required for linear independence, this condition is satisfied so long as N > 3. F Mixed Net Capacity In this Appendix, we compute the capacity of the mixed network, which from the update rule defined in (14) has TMN(ξ1)1 = sgn j=2 ξµ j ξ1 j + λξµ+1 1 f A j=2 ξµ j ξ1 j Then, assuming that f S(1) = f A(1) = 1 as is true for the interaction functions considered here, we have P[TMN(ξ1)1 = ξ2 1] (F.2) µ=1 ξµ 1 f S j=2 ξµ j ξ1 j µ=1 ξµ+1 1 f A j=2 ξµ j ξ1 j = P {C < λ} , (F.4) where we have defined the crosstalk C = ξ2 1ξ1 1 + µ=2 ξ2 1ξµ 1 f S j=2 ξµ j ξ1 j µ=2 ξ2 1ξµ+1 1 f A j=2 ξµ j ξ1 j For j = 2, . . . , N and µ = 2, . . . , P, we have the equality in distribution ξµ j ξ1 j d= ξµ j , hence C d= ξ2 1ξ1 1 + µ=2 ξ2 1ξµ 1 f S(Ξµ) + λ µ=2 ξ2 1ξµ+1 1 f A(Ξµ). (F.6) where to lighten our notation we define j=2 ξµ j . (F.7) However, unlike in the Dense Net, we cannot similarly simplify the terms outside the separation functions. 
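Before working through the moments, the shape of this crosstalk is easy to see by direct simulation. The sketch below (ours, with d_S = d_A = 3 and λ = 2.5, matching the parameters of Figure F.1) samples the Mixed Net crosstalk at neuron 1 and splits it by the sign ξ^2_1 ξ^1_1, exhibiting the bimodal structure analyzed next.

```python
import numpy as np

rng = np.random.default_rng(0)

def mixed_crosstalk_samples(N, P, lam, dS, dA, n_seq=2000):
    """Sample the Mixed Net crosstalk at neuron 1 when the network sits at
    pattern xi^1: project the total input onto the target sign xi^2_1 and
    subtract the signal term lam * f_A(1) contributed by the mu = 1 transition."""
    C = np.empty(n_seq)
    s = np.empty(n_seq)
    for k in range(n_seq):
        xi = rng.choice([-1, 1], size=(P, N))          # xi[mu - 1] is pattern mu
        m = xi[:, 1:] @ xi[0, 1:] / (N - 1)            # overlaps with xi^1, neuron 1 excluded
        nxt = np.roll(xi, -1, axis=0)                  # xi^{mu+1}, periodic in mu
        field = np.sum(xi[:, 0] * m**dS) + lam * np.sum(nxt[:, 0] * m**dA)
        C[k] = xi[1, 0] * field - lam                  # remove the mu = 1 signal term
        s[k] = xi[1, 0] * xi[0, 0]                     # sign xi^2_1 * xi^1_1
    return C, s

C, s = mixed_crosstalk_samples(N=100, P=1000, lam=2.5, dS=3, dA=3)
for sign in (-1, 1):
    sel = s == sign
    print(f"xi^2_1 xi^1_1 = {sign:+d}: mean C = {C[sel].mean():+.3f}, std = {C[sel].std():.3f}")
# The two conditional means sit near -1 and +1: the bimodality visible in Figure F.1.
```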
Returning to the analytical treatment: recalling that we have assumed periodic boundary conditions, we have

C = ξ^2_1 ξ^1_1 + λ ξ^2_1 ξ^1_1 f_A(Ξ^P) + f_S(Ξ^2) + Σ_{µ=3}^P ξ^2_1 ξ^µ_1 f_S(Ξ^µ) + λ Σ_{µ=2}^{P−1} ξ^2_1 ξ^{µ+1}_1 f_A(Ξ^µ) (F.8)
d= ξ^1_1 + C_1 + C_2 + C_3 + C_4, (F.9)

where we have defined

C_1 = f_S(Ξ^2) + λ ξ^3_1 f_A(Ξ^2), (F.10)
C_2 = ξ^P_1 f_S(Ξ^P) + λ ξ^1_1 f_A(Ξ^P), (F.11)
C_3 = Σ_{µ=3}^{P−1} ξ^µ_1 f_S(Ξ^µ), and (F.12)
C_4 = λ Σ_{µ=3}^{P−1} ξ^{µ+1}_1 f_A(Ξ^µ). (F.13)

Importantly, in this case the influence of ξ^1_1 on the crosstalk is O(1), and the distribution is not well approximated by a single Gaussian. Instead, as shown in Figure F.1, it is bimodal. We will therefore approximate it by a mixture of two Gaussians, one for each value of ξ^1_1. This approximation can be justified by noting that the boundary terms in C_1 and C_2 should be negligible at large N and P, while C_3 and C_4 should give a Gaussian contribution at sufficiently large P.

We now observe that, for any f_S and f_A, the conditional means of each term are

E[C_1 | ξ^1_1] = E[f_S(Ξ)] (F.14)
E[C_2 | ξ^1_1] = λ ξ^1_1 E[f_A(Ξ)] (F.15)
E[C_3 | ξ^1_1] = 0 (F.16)
E[C_4 | ξ^1_1] = 0, (F.17)

where we note that all Ξ^µ are identically distributed, so we can simply write Ξ for any one of them. Then, the conditional mean of the crosstalk is

E[C | ξ^1_1] = ξ^1_1 + Σ_{j=1}^4 E[C_j | ξ^1_1] (F.18)
= ξ^1_1 {1 + λ E[f_A(Ξ)]} + E[f_S(Ξ)]. (F.19)

Considering the variance of C, the variances of the different contributions are

var[C_1 | ξ^1_1] = var[f_S(Ξ)] + λ^2 E[f_A(Ξ)^2] (F.20)
var[C_2 | ξ^1_1] = E[f_S(Ξ)^2] + λ^2 var[f_A(Ξ)] (F.21)
var[C_3 | ξ^1_1] = (P−3) E[f_S(Ξ)^2] (F.22)
var[C_4 | ξ^1_1] = λ^2 (P−3) E[f_A(Ξ)^2], (F.23)

while the covariances are

cov[C_1, C_2 | ξ^1_1] = 0 (F.24)
cov[C_1, C_3 | ξ^1_1] = λ E[f_A(Ξ)] E[f_S(Ξ)] (F.25)
cov[C_1, C_4 | ξ^1_1] = 0 (F.26)
cov[C_2, C_3 | ξ^1_1] = 0 (F.27)
cov[C_2, C_4 | ξ^1_1] = λ E[f_S(Ξ)] E[f_A(Ξ)], (F.28)
cov[C_3, C_4 | ξ^1_1] = λ Σ_{µ,ν=3}^{P−1} E[ξ^µ_1 ξ^{ν+1}_1] E[f_S(Ξ^µ) f_A(Ξ^ν)] (F.29)
= λ Σ_µ E[f_S(Ξ^µ)] E[f_A(Ξ^{µ−1})] (F.30)
= λ (P−3) E[f_S(Ξ)] E[f_A(Ξ)]. (F.31)

Figure F.1: Crosstalk of the Polynomial Mixed Net with N = 100, λ = 2.5, d_S = d_A = 3, and P = 1000 stored patterns. Histograms are generated from 5000 randomly generated sequences, and the theoretical curves are overlaid. Green represents the full crosstalk for the Mixed Net; blue and red represent the asymmetric and symmetric terms of the crosstalk, respectively. Observe that the bimodality in the full model comes from bimodality in the symmetric term.

Therefore, the conditional variance of the crosstalk is

var[C | ξ^1_1] = Σ_{j=1}^4 var[C_j | ξ^1_1] + 2 Σ_{k>j} cov[C_j, C_k | ξ^1_1] (F.32)
= (P−3){E[f_S(Ξ)^2] + 2λ E[f_S(Ξ)] E[f_A(Ξ)] + λ^2 E[f_A(Ξ)^2]} + var[f_S(Ξ)] + λ^2 E[f_A(Ξ)^2] + E[f_S(Ξ)^2] + λ^2 var[f_A(Ξ)] + 4λ E[f_A(Ξ)] E[f_S(Ξ)] (F.33)
= (P−1){E[f_S(Ξ)^2] + 2λ E[f_S(Ξ)] E[f_A(Ξ)] + λ^2 E[f_A(Ξ)^2]} − E[f_S(Ξ)]^2 − λ^2 E[f_A(Ξ)]^2. (F.34)

For large P and N, the last two terms of this result will be subleading, as they do not scale with P and have identical or subleading scaling with N relative to the terms that do scale with P. That is, we have

var[C | ξ^1_1] ≈ P{E[f_S(Ξ)^2] + 2λ E[f_S(Ξ)] E[f_A(Ξ)] + λ^2 E[f_A(Ξ)^2]}. (F.35)

Collecting these results, we have

E[C | ξ^1_1] = ξ^1_1{1 + λ E[f_A(Ξ)]} + E[f_S(Ξ)] (F.36)
var[C | ξ^1_1] ≈ P{E[f_S(Ξ)^2] + 2λ E[f_S(Ξ)] E[f_A(Ξ)] + λ^2 E[f_A(Ξ)^2]}.
(F.37) By the law of total probability, we have P[TMN(ξ1)1 = ξ2 1] = P[C < λ] (F.38) 2P[C < λ | ξ1 1 = 1] + 1 2P[C < λ | ξ1 1 = +1] (F.39) λ + E[C | ξ1 1 = 1] p var[C | ξ1 1 = 1] λ + E[C | ξ1 1 = +1] p var[C | ξ1 1 = +1] under the bimodal Gaussian approximation to the crosstalk distribution. To have P[TMN(ξ1)1 = ξ2 1], both of these conditional probabilities must tend to zero. By basic concentration arguments, we expect to have E[C | ξ1 1] ξ1 1 (F.41) up to corrections that are small in an absolute sense. Moreover, we have E[C | ξ1 1 = +1] E[C | ξ1 1 = 1] = 2{1 + λE[f A(Ξ)]} (F.42) which for the separation functions considered here is strictly positive. As we keep λ constant with N and P, we must have E[C | ξ1 1 = 1] > λ (F.43) and var[C | ξ1 1 = 1] 0 in order to have P[TMN(ξ1)1 = ξ2 1] 0. But, given the formula above, var[C | ξ1 1 = 1] = var[C | ξ1 1 = +1], so this implies that the ξ1 1 = +1 contribution to the probability will be exponentially suppressed. hen, we can apply an identical argument to that which we used for the Dense Net in Appendix C to obtain the asymptotic behavior of P[C < λ | ξ1 1 = 1], yielding P[TMN(ξ1)1 = ξ2 1] 1 λ + E[C | ξ1 1 = 1] p var[C | ξ1 1 = 1] var[C | ξ1 1 = 1] λ + E[C | ξ1 1 = 1] exp 1 2 (λ + E[C | ξ1 1 = 1])2 var[C | ξ1 1 = 1] For this to work, we must clearly have λ > 1. We could in principle compute the excess kurtosis of the crosstalk for the Mixed Net as we did for the Dense Net, but we will not do so here as the computation would be tedious and would not yield substantial new insight beyond that for the Dense Net. F.1 Polynomial Mixed Net We first consider the polynomial mixed network, with f S(x) = xd S and f A(x) = xd A for two possibly differing degrees d S, d A N>0. We can apply the same reasoning as in Appendix C.1 to obtain the required moments at large N, which yields the first moments E[f S(Ξ)] = E[Ξd S] = d S even (F.46) E[f A(Ξ)] = E[Ξd A] = d A even, (F.47) and the second moments E[f S(Ξ)2] = E[Ξ2d S] = (2d S 1)!! E[f A(Ξ)2] = E[Ξ2d A] = (2d A 1)!! Then, the conditional mean of the crosstalk is given by E[C | ξ1 1] ξ1 1 (F.50) up to corrections which vanish in an absolute, not a relative, sense, while the conditional variance is asymptotic to var[C | ξ1 1] P (2d S 1)!! N d S + 2λ(d S 1)!! (d A 1)!! N (d S+d A)/2 1{d S, d A even} + λ2 (2d A 1)!! We must now determine the storage capacity. We recall that, in all case, we want P to tend to infinity slowly enough that var[C | ξ1 1] tends to zero. Then, we can see that what matters is which of the terms inside the curly brackets in the expression for the conditional variance above tends to zero with N the slowest. This is of course determined by min{d S, d A}, but the constant factor multiplying the leading term will depend on which is smaller, or if they are equal. First, consider the case in which d S = d A = d. Then, we have var[C | ξ1 1] P N d (2d 1)!! + 2λ(d 1)!! (d 1)!!1{d even} + λ2(2d 1)!! . (F.52) Now, consider the case in which d S < d A. Then, (d S + d A)/2 > d S, hence the N d S term dominates and we have var[C | ξ1 1] P N d S (2d S 1)!!. (F.53) Similarly, if d A > d S, the N d A term dominates, and we have var[C | ξ1 1] P N d A λ2(2d A 1)!!. (F.54) We can summarize these results as var[C | ξ1 1] γd S,d A P N min{d S,d A} , (F.55) (2d S 1)!! if d S < d A, (λ2 + 1)(2d S 1)!! + 2λ[(d S 1)!!]21{d S even} if d S = d A, λ2(2d A 1)!! if d S > d A. 
Using the general arguments presented above, we then have P[TMN(ξ1)1 = ξ2 1] 1 2 γd S,d AP (λ 1)2N min{d S,d A} exp (λ 1)2 2 N min{d S,d A} for any λ > 1. We must first determine the single-transition capacity, which requires that NP[TMN(ξ1)1 = ξ2 1] 0. Recalling that our argument requires us to take P slowly enough that var[C | ξ1 1] 0, we make the Ansatz that N min{d S,d A} log N (F.58) for some α. This yields NP[TMN(ξ1)1 = ξ2 1] 1 2 2πα log N N 1 α/2, (F.59) which tends to zero if α 2, yielding a predicted capacity of N min{d S,d A} log N . (F.60) We now consider the sequence capacity, which requires that NPP[TMN(ξ1)1 = ξ2 1] 0. Then, making the same Ansatz for P as above, we have NPP[TMN(ξ1)1 = ξ2 1] 1 2 1 (α log N)3/2 N min{d S,d A}+1 α/2, (F.61) which tends to zero provided that α 2(min{d S, d A} + 1), yielding a predicted capacity of 2(min{d S, d A} + 1)γd S,d A N min{d S,d A} log N . (F.62) F.2 Exponential Mixed Net We now consider the Exponential Mixed Net, with f S(x) = f A(x) = e(N 1)(x 1). With this, we have the first moments E[f S(Ξ)] = E[f A(Ξ)] = exp[ (N 1)]E = exp[ (N 1)] j=2 E[exp(ξj)] (F.64) and the second moments E[f S(Ξ)2] = E[f A(Ξ)2] = cosh(2) N 1 = 1 βN 1 , (F.66) where as in Appendix C.2 we let cosh(2) 1.96403. (F.67) Noting that exp(1) cosh(1) 1.76159, (F.68) the conditional mean of the crosstalk is then E[C | ξ1 1] = ξ1 1{1 + λE[f A(Ξ)]} + E[f S(Ξ)] (F.69) 1 + λ cosh(1) ξ1 1, (F.71) where the corrections are exponentially small in an absolute sense. The leading part of the conditional variance of the crosstalk is var[C | ξ1 1] P{E[f S(Ξ)2] + 2λE[f S(Ξ)]E[f A(Ξ)] + λ2E[f A(Ξ)2]} (F.72) 1 + 2λ cosh(1)2 P βN 1 (1 + λ2), (F.74) where we note that cosh(2) 0.632901 (F.75) hence the other contribution is exponentially suppressed in a relative sense. We thus have E[C | ξ1 1] ξ1 1 (F.76) var[C | ξ1 1] P βN 1 (1 + λ2), (F.77) hence from the general argument above we have P[TMN(ξ1)1 = ξ2 1] 1 2 P βN 1 exp 1 1 + λ2 βN 1 for λ > 1. We now want to determine the capacity, starting with the single-transition capacity, for which we must have NP[TMN(ξ1)1 = ξ2 1] 0. Recalling that we want to have var[C | ξ1 1] 0, we make the Ansatz λ2 + 1 βN 1 log N (F.79) for some α, which yields NP[TMN(ξ1)1 = ξ2 1] 1 2 2πα log N N 1 α/2. (F.80) This tends to zero if α 2, hence we conclude that the Gaussian theory predicts λ2 + 1 βN 1 log N . (F.81) We now want to determine the sequence capacity, which requires that NPP[TMN(ξ1)1 = ξ2 1] 0. Following our analysis of the Exponential Dense Net in Appendix C.2, we make the Ansatz that λ2 + 1 βN 1 which yields NPP[TMN(ξ1)1 = ξ2 1] 1 2 2παN 1 αβ (λ 1)2 λ2 + 1 exp h log β α N i . (F.83) This tends to zero if α 2 log β, giving a predicted sequence capacity of PS = 1 2 log β (λ 1)2 λ2 + 1 βN 1 Thus, for both definitions of capacity, the Gaussian theory s prediction of the capacity of the Exponential Mixed Net is λ2 + 1 (F.85) times the capacity of the Exponential Dense Net analyzed in Appendix C.2. This factor tends to zero from above as λ 1, and gradually increases to 1 as λ . Note that even without explicitly computing the excess kurtosis, we expect the intuition from the Exponential Dense Net to carry over to this setting. Indeed, the numerical simulations in Figure F.2 show that the transition capacity is well captured by the Gaussian theory while the sequence capacity shows significant deviation for small Mixed Nets. 
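To see how the symmetric term rescales the exponential-model predictions, the following sketch (our own illustration of the formulas above; the factor (λ − 1)^2/(λ^2 + 1) is the one appearing in (F.81)–(F.85)) tabulates the Exponential Mixed Net capacities as the Exponential Dense Net capacities multiplied by that factor.

```python
import numpy as np

def exp_mixed_capacities(N, lam):
    """Gaussian-theory predictions for the Exponential Mixed Net, written as the
    Exponential Dense Net capacities rescaled by (lam - 1)^2 / (lam^2 + 1)."""
    beta = np.exp(2.0) / np.cosh(2.0)
    scale = (lam - 1.0) ** 2 / (lam**2 + 1.0)
    PT = scale * beta ** (N - 1) / (2.0 * np.log(N))
    PS = scale * beta ** (N - 1) / (2.0 * N * np.log(beta))
    return PT, PS

for lam in (1.5, 2.5, 5.0, 20.0):
    PT, PS = exp_mixed_capacities(20, lam)
    scale = (lam - 1.0) ** 2 / (lam**2 + 1.0)
    print(f"lambda={lam:5.1f}: scale={scale:.3f}, P_T ~ {PT:.3g}, P_S ~ {PS:.3g}")
```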
Figure F.2: The capacities of Exponential Mixed Nets with λ = 2.5 are plotted as a function of network size. (A) Transition capacity for the Exponential Mixed Net, which closely matches the theoretical prediction. The predicted capacity is shown by the solid line with dots, while squares with error bars show the results of numerical experiments. (B) Sequence capacity for the Exponential Mixed Net, which diverges from the theoretical prediction.

G Numerical implementation

Source code is available on GitHub at https://github.com/Pehlevan-Group/LongSequenceHopfieldMemory. Experiments were run on the Harvard University FAS RC Cannon HPC cluster (https://www.rc.fas.harvard.edu/), using Nvidia A100 80GB GPUs. This limited the maximum number of patterns we could store in memory simultaneously to approximately 10^6 patterns, restricting our experimental evaluation of the Exponential Dense Net to approximately N = 25 neurons.

G.1 Transition capacity

Numerical simulations for the transition capacity were conducted as follows. For a given model of size N, we start by initializing 100 sequences of Rademacher-distributed patterns of length P_0, where P_0 = 2P* is well above the model's predicted capacity P*. This initialization for P_0 was found through trial and error; the method detects if the search is accidentally started below capacity. The model's update rule is applied in parallel across all patterns and across all sequences. If an error is made for any pattern in any sequence, 100 new random sequences are generated with the smaller length P_1 = 0.99 P_0. This is repeated, with the new sequence length being P_{t+1} = 0.99 P_t, until 100 sequences are generated for which no error is made in any transition. This entire process is repeated 20 times starting from P_0 in order to obtain error bars.

G.2 Sequence capacity

Numerical simulations for the sequence capacity were conducted in a similar fashion. For a given model of size N, we start by initializing 100 sequences of Rademacher-distributed patterns of length P_0, where P_0 is well above the model's capacity. Starting from the first pattern of each sequence, the model's update rule is applied serially for each sequence. As soon as an error is obtained within any sequence, 100 new random sequences are generated with the smaller length P_1 = 0.99 P_0. This is repeated, with the new sequence length being P_{t+1} = 0.99 P_t, until 100 sequences are generated for which no error is made. This entire process is repeated 20 times starting from P_0 in order to obtain error bars.

G.3 Moving MNIST

For the Moving MNIST experiments in Section 2.4, the images were pre-processed to have binarized pixel values. The sequence was built from 10000 subsequences, each consisting of 20 images in which two handwritten digits from the MNIST dataset move through each other; these were concatenated to construct the entire sequence of 200000 images [78]. Different models were then run from initialization, and their outputs at different time steps are displayed in Figure 3.

G.4 Generalized pseudoinverse rule

For numerical simulations of the generalized pseudoinverse rule in Section 2.5, the transition capacity of the Polynomial Dense Net was simulated using a method similar to that described above. However, the Exponential Dense Net suffered from numerical instability when calculating the pseudoinverse of the overlap matrix, resulting in floating-point errors.
Therefore, we showed results only for the Polynomial Dense Net.
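To make the generalized pseudoinverse rule concrete, here is a minimal NumPy sketch of the GPI update from Appendix E (our own illustration using numpy.linalg.pinv, not the released implementation); because the exponential interaction function shows the pseudoinverse round-off sensitivity noted above, the example uses a polynomial f.

```python
import numpy as np

def gpi_update(S, patterns, f=lambda x: x**2):
    """One step of the generalized pseudoinverse (GPI) rule:
    T(S)_i = sgn( sum_mu xi^{mu+1}_i f( sum_nu (O^+)_{mu nu} m_nu(S) ) ),
    with O the Gram matrix of the patterns (overlaps over all neurons, as in
    Appendix E) and periodic boundary conditions xi^{P+1} = xi^1."""
    P, N = patterns.shape
    O = patterns @ patterns.T / N          # Gram matrix O_{mu nu} = (1/N) xi^mu . xi^nu
    O_pinv = np.linalg.pinv(O)
    m = patterns @ S / N                   # overlaps m_nu(S)
    proj = O_pinv @ m                      # decorrelated overlaps; equals delta_{mu rho} at S = xi^rho
    next_patterns = np.roll(patterns, -1, axis=0)
    h = next_patterns.T @ f(proj)
    return np.where(h >= 0, 1, -1)

# Correlated patterns (biased Rademacher entries); the GPI rule should step
# through the sequence exactly whenever the patterns are linearly independent.
rng = np.random.default_rng(0)
N, P = 200, 20
patterns = np.where(rng.random((P, N)) < 0.7, 1, -1)
state = patterns[0].copy()
correct = 0
for mu in range(P):
    state = gpi_update(state, patterns)
    correct += np.array_equal(state, patterns[(mu + 1) % P])
print(f"correct transitions: {correct}/{P}")
```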