Nonparametric Modern Hopfield Models

Jerry Yao-Chieh Hu * 1  Bo-Yu Chen * 2  Dennis Wu 1  Feng Ruan 3  Han Liu 1 3

*Equal contribution. 1Department of Computer Science, Northwestern University, Evanston, IL, USA. 2Department of Physics and Computer Science, National Taiwan University, Taipei, Taiwan. 3Department of Statistics and Data Science, Northwestern University, Evanston, IL, USA. Correspondence to: Jerry Yao-Chieh Hu, Bo-Yu Chen, Dennis Wu, Feng Ruan, Han Liu. Proceedings of the 42nd International Conference on Machine Learning, Vancouver, Canada. PMLR 267, 2025. Copyright 2025 by the author(s).

Abstract

We present a nonparametric interpretation of deep learning compatible modern Hopfield models and use this new perspective to debut efficient variants. Our key contribution stems from interpreting the memory storage and retrieval processes in modern Hopfield models as a nonparametric regression problem subject to a set of query-memory pairs. Interestingly, our framework not only recovers the known results from the original dense modern Hopfield model but also fills the void in the literature regarding efficient modern Hopfield models, by introducing sparse-structured modern Hopfield models with sub-quadratic complexity. We establish that this sparse model inherits the appealing theoretical properties of its dense analogue: connection with transformer attention, fixed-point convergence, and exponential memory capacity. Additionally, we showcase the versatility of our framework by constructing a family of modern Hopfield models as extensions, including linear, random masked, top-K, and positive random feature modern Hopfield models. Empirically, we validate our framework in both synthetic and realistic settings for memory retrieval and learning tasks. Code is available on GitHub; future updates are on arXiv.

1 Introduction

We tackle the challenges in the computational efficiency of modern Hopfield models (Wu et al., 2024b; Hu et al., 2023; Ramsauer et al., 2020), a class of transformer-compatible associative memories. In short, we present a nonparametric framework1 and then debut efficient modern Hopfield models with sub-quadratic complexity and appealing theoretical properties. Such a construction is of practical importance. As in many Hopfield-centric methods (Hu et al., 2024a; Wu et al., 2024a; Xu et al., 2024; Wu et al., 2024b; Schimunek et al., 2023; Fürst et al., 2022; Paischer et al., 2022; Seidl et al., 2022; Widrich et al., 2020), modern Hopfield models (and their derived deep learning layers) serve as powerful alternatives to the attention mechanism with additional functionalities, but lack efficient implementations for gigantic deep models (Hu et al., 2023, Section C.2). This issue becomes more prominent in this era of large foundation models (Bommasani et al., 2021). Foundation models are huge transformer-based models pretrained on massive datasets, and they play a central role not only in machine learning but also in a wide range of scientific domains, such as ChatGPT (Brown et al., 2020; Floridi & Chiriatti, 2020) for natural language, BloombergGPT (Wu et al., 2023) for finance, DNABERT (Zhou et al., 2024a;b; Ji et al., 2021) for genomics, and many others. To push toward Hopfield-based large foundation models, this work provides a timely, efficient solution, backed by solid theoretical grounding.
Modern Hopfield models (Ramsauer et al., 2020), motivated by the dense associative memory models (Demircigil et al., 2017; Krotov & Hopfield, 2016), are (auto-)associative memory models that (i) have exponential memory capacity, (ii) retrieve stored patterns based on input queries with only one retrieval step, and (iii) are compatible with deep learning architectures. They achieve (i) by adopting highly nonlinear energy functions, (ii) by adopting a memory-retrieval dynamics ensuring monotonic minimization of the energy function, and (iii) by the connection between their memory retrieval dynamics and attention mechanism. Deepening (ii) and (iii), Hu et al. (2023) and Wu et al. (2024b) propose a theoretical framework for deriving modern Hopfield models 1After completing the draft, the authors became aware of the independent study by Nguyen et al. (2024) on a nonparametric (primal-dual) formulation for Transformer attention. To our knowledge, Nguyen et al. (2024) were the first to cast attention as an ϵ-SVR solution (Awad et al., 2015; Vapnik, 2013; Kar & Karnick, 2012; Sch olkopf & Smola, 2002). Our study presents a similar framework but focuses on Hopfield-style associative memory. Nonparametric Modern Hopfield Models using various entropic regularizers. In addition, they introduce a sparse extension of the original modern Hopfield model to handle its computational burden and vulnerability to noise. As a result, their proposal not only connects to sparse attention mechanism (Correia et al., 2019; Martins & Astudillo, 2016) but also offers both provably computational advantages and robust empirical performance. However, there are still some missing pieces: (P1) Lack of Efficiency. Computationally, while Hu et al. (2023) and Wu et al. (2024b) indeed introduce sparsity into their model, this sparsity does not implies computational efficiency. In fact, it only increases efficiency at the level of memory retrieval, (i.e. the sparsity in (Hu et al., 2023; Wu et al., 2024b) only leads to faster memory retrieval but not necessarily shorter running time, as discussed in (Hu et al., 2023, Section C.2)). Namely, the sparse modern Hopfield model still suffers by the O(n2) complexity (with the input sequence length n), which hampers its scalability2. (P2) Lack of Rigorous Analysis on Sparsity. Theoretically, because Hu et al. (2023) choose not to make strong assumptions (on the memory and query patterns) in order to maintain their model s generality, they only offer qualitative justifications (Hu et al., 2023, Section 3). They do not rigorously characterize how sparsity impacts different aspects of the sparse model, e.g., the retrieval error, the well-separation condition, and the memory capacity. (P3) Incomplete Connection between Attention and Hopfield Models. Methodologically, while numerous variants of the attention module exist (Choromanski et al., 2021; Katharopoulos et al., 2020; Beltagy et al., 2020; Child et al., 2019), Hu et al. (2023) only bridge a subset of them to modern Hopfield models. A natural question arises: How can we integrate the advancements of stateof-the-art attention into modern Hopfield models? As noted in (Hu et al., 2024b; 2023; Wu et al., 2024b), this question is far from trivial. Naively substituting the softmax activation function with other alternatives does not necessarily yield well-defined Hopfield models and might sabotage their desirable properties and functionalities. Overview of Our Theoretical Results. 
To fill these gaps, this work presents a nonparametric framework for deep learning compatible modern Hopfield models. To fill (P1), this framework allows us to not only recover the standard dense modern Hopfield model (Ramsauer et al., 2020), but also introduce an efficient modern Hopfield model, termed sparse-structured model (Theorem 3.2). To fill (P2), our framework facilitates the derivation of a 2See Remark 3.6 for the connection between time complexity of attention and of modern Hopfield models. retrieval error bound of the sparse modern Hopfield with explicit sparsity dependence (Theorem 4.1). This bound offers rigorous characterizations of the sparsity-induced advantages of the sparse model compared with its dense counterpart, including higher precision in memory retrieval (Corollary 4.1.1 and Corollary 4.1.2), enhanced robustness to noise (Remark 4.2) and exponential-in-d capacity (Lemma 4.1 and Proposition 4.1, d refers to pattern size). Interestingly, unlike existing Hopfield models (Hu et al., 2023; Wu et al., 2023; Ramsauer et al., 2020) requiring an explicit energy function to guarantee the stability of the model, we show that the sparse modern Hopfield model guarantees the fixedpoint convergence even without details of the Hopfield energy function (Corollary 4.1.3). To fill (P3), beyond introducing the sparse modern Hopfield model, our framework supports a family of modern Hopfield models that connect with various attention variants. This complements the findings in (Hu et al., 2023; Wu et al., 2024b), pushing us toward a more unified understanding. Contributions. Our contributions are as follows: We propose a nonparametric framework for deep learning compatible modern Hopfield models. Building upon this, we introduce the first efficient sparse modern Hopfield model with sub-quadratic complexity. We provide rigorous characterizations of the sparsityinduced advantages of the proposed efficient model: tighter retrieval error bound (Corollary 4.1.1 and Corollary 4.1.2), stronger noise robustness (Remark 4.2) and exponential-d-capacity (Lemma 4.1 and Proposition 4.1). Based on the proposed framework, we construct a family of modern Hopfield models connecting to many existing attention variants (Choromanski et al., 2021; Zaheer et al., 2020; Beltagy et al., 2020; Katharopoulos et al., 2020). We also verify their efficacy through thorough numerical experiments in both synthetic and realistic settings (memory retrieval and learning performance in Appendices G.1 to G.4 and efficiency in Appendix G.5.). Notations. We denote vectors by lower case bold letters, and matrices by upper case bold letters. We write a, b := a Tb as the inner product for vectors a, b. Let a[i] denotes the i-th element of vector a. The index set {1, , I} is denoted by [I], where I N+. The spectral norm is denoted by , which is equivalent to the l2-norm when applied to a vector. We denote the memory patterns by ξ Rd and the query pattern by x Rd, and Ξ := [ξ1, , ξM] Rd M as shorthand for stored memory patterns {ξµ}µ [M]. Moreover, we set m := Maxµ [M] ξµ . Organization. Section 2 reviews modern Hopfield models. Section 3 presents a nonparametric construction for Nonparametric Modern Hopfield Models Figure 1. A High-level Visualization. The upper row formalizes the memorization and retrieval of standard modern Hopfield model (Wu et al., 2024b; Hu et al., 2023; Ramsauer et al., 2020). The lower row conceptualizes our nonparametric interpertation. 
modern Hopfield models, and debut the sparse-structured (efficient) modern Hopfield models. Section 4 provides the theoretical analysis on the sparse-structured modern Hopfield models. Appendix E includes a family of modern Hopfield models as possible extensions. We conduct numerical experiments to support our framework in Appendix G. 1.1 Related Work Modern Hopfield Models for Deep Learning. The classical Hopfield models (Amari, 1972; Hopfield, 1984; 1982; Krotov & Hopfield, 2016) are canonical models of the human brain s associative memory. Their primary function is the storage and retrieval of specific memory patterns. Recently, a resurgence of interest in Hopfield models within the machine learning field is attributed to developments in understanding memory storage capacities (Krotov & Hopfield, 2016; Demircigil et al., 2017; Wu et al., 2024a), innovative architecture (Hoover et al., 2023; Seidl et al., 2022; F urst et al., 2022; Ramsauer et al., 2020), and their biological plausibility (Kozachkov et al., 2022; Krotov & Hopfield, 2021). Notably, the modern Hopfield models (Wu et al., 2024b; Hu et al., 2023; Ramsauer et al., 2020; Brandstetter, 2021), demonstrate not only a strong connection to the transformer attention mechanisms in deep learning, but also superior performance, and a theoretically guaranteed exponential memory capacity. In this regard, seeing the modern Hopfield models as an advanced extension of attention mechanisms opens up prospects for crafting Hopfieldcentric architectural designs. Therefore, their applicability spans diverse areas like drug discovery (Schimunek et al., 2023), immunology (Widrich et al., 2020), tabular learning (Xu et al., 2024), time series forecasting (Wu et al., 2024b; Auer et al., 2024), reinforcement learning (Paischer et al., 2022), and large foundation models (Hu et al., 2024a; F urst et al., 2022). This work emphasizes refining this line of research towards efficient models. We posit that this effort is crucial in guiding future research towards Hopfield-driven design paradigms, especially for larger models. Sparse Modern Hopfield Model. (Ramsauer et al., 2020) establish a connection between Hopfield models and the vanilla softmax attention. Motivated by this connection, (Hu et al., 2023; Wu et al., 2024b) (and later (Martins et al., 2023)) propose a theoretical framework for modern Hopfield models based on the relationship between entropic regularizers and finite-domain distributions with varying support sets. Importantly, they not only show that (Ramsauer et al., 2020) is just special case within their framework but also propose a sparse extension with superior properties (e.g., robust representation learning, fast fixed-point convergence, and exponential memory capacity) and connection to certain types of sparse attention. However, this is not end of the story. As highlighted in (Hu et al., 2023, Section E), their framework only bridges a subset of existing attention variants (with dense quadratic attention score matrix) and hence is not complete. This work fills this theoretical gap by providing a principle construction for the many modern Hopfield models with theoretical guarantees. 
Moreover, our framework supports a family of modern Hopfield models mirroring many popular structured efficient attention mechanisms, including Attention with Pre-defined Patterns (each sequence token attends to a predetermined subset of tokens instead of the entire sequence, e.g, Big Bird (Zaheer et al., 2020), Longformer (Beltagy et al., 2020), Blockwise (Qiu et al., 2019), Sparse (Child et al., 2019)), and Kernelized Attention (e.g., Performer (Choromanski et al., 2021), Linear (Clevert et al., 2015) and Multi-head (Vaswani et al., 2017)). Sparse-Structured Hopfield Models. After completing this work, the authors attended ICML 2024 and learned of a study by Santos et al. (2024b) proposing a different model, also named the sparse-structured Hopfield network. Both models emphasize sparse retrieval patterns. However, Santos et al. (2024a;b) differ in (i) introducing sparsity via optimized Fenchel-Young energies and (ii) enhancing efficiency using the Sparse MAP transformation and active set algorithm (Niculae et al., 2018) with predefined k-ary relations among stored patterns or top-k operations. Nonparametric Modern Hopfield Models 2 Background: Modern Hopfield Models This section presents the ideas we build on. Let x Rd be the input query pattern and Ξ = [ξ1, , ξM] Rd M the M memory patterns. Hopfield Models. The aim of Hopfield models (Amari, 1972; Hopfield, 1982; 1984) is to store these memory patterns Ξ and retrieve a specific memory ξµ when given a query x. They achieve these by embedding the memories in the energy landscape E(x) of a physical system, where each memory ξµ corresponds to a local minimum. When a query x is presented, the model initiates energy-minimizing retrieval dynamics T at the query, which then navigate the energy landscape to find the nearest local minimum, effectively retrieving the memory most similar to the query. These models comprise two primary components: an energy function E(x) that encodes memories into its local minima, and a retrieval dynamics T (x) that fetches a memory by iteratively minimizing E(x) starting with a query. Constructing E(x), is straightforward. As outlined in (Krotov & Hopfield, 2016), memories get encoded into E(x) using the overlap-construction: E(x) = F(ΞTx), where F : RM R is a smooth function. This encourages the memories {ξµ}µ [M] to reside at the stationary points of E(x), i.e., x F(ΞTx)|ξµ = 0 for all µ [M]. The choice of F results in different Hopfield model types, as demonstrated in (Krotov & Hopfield, 2021; Ramsauer et al., 2020; Demircigil et al., 2017; Krotov & Hopfield, 2016). However, determining a suitable retrieval dynamics, T , for a given energy E(x) is more challenging. For effective memory retrieval, the iterative retrieval dynamics T must: (T1) Monotonically reduce E(x) when applied iteratively. (T2) Ensure its fixed points coincide with the stationary points of E(x) for precise retrieval. Modern Hopfield Models. Ramsauer et al. (2020) propose the modern Hopfield model with a specific set of E and T satisfying above requirements, and integrate it into deep learning architectures via its strong connection with attention mechanism, offering enhanced performance, and theoretically guaranteed exponential memory capacity. Specifically, they introduce the energy function: E(x) = lse(β, ΞTx) + 1 2 x, x , (2.1) where the retrieval dynamics is given by xnew = TDense(x) = Ξ Softmax(βΞTx). (2.2) The function lse (β, z) := log PM µ=1 exp{βzµ} /β is the log-sum-exponential for any given vector z RM and β > 0. 
Their analysis reveals that: 1. TDense dynamics converge well (T2) and can retrieve patterns accurately in just one step (T1). 2. Modern Hopfield model from (2.1) possesses an exponential memory capacity in pattern size d. 3. Notably, the one-step approximation of TDense mirrors the attention mechanism in transformers, leading to a novel deep learning architecture design: the Hopfield layers. Attention Modern Hopfield Model. To see above 3., suppose that X and Ξ are embedded from the raw query R and Y memory patterns, respectively, via XT = RWQ := Q, and ΞT = YWK := K, with some projection matrices WQ and WK. Then, taking the transpose of T in (2.2) and multiplying with WV such that V := KWV , we obtain Z := Qnew WV = Softmax βQKT V. (2.3) This enables modern Hopfield models to serve as alternatives to attention mechanism with extra functionalities. Given the equivalence (2.3), one might wonder if the quest for efficient modern Hopfield models is equivalent to seeking efficient attention mechanisms (Tay et al., 2022), specifically in terms of finding efficient implementations of the Softmax matrix computation. We contend that they are not the same. To build a modern Hopfield model, we expect not only its retrieval dynamics to connect to attention mechanism, but also it to serve as an associative memory model (Hu et al., 2024a; Wu et al., 2024b; Hu et al., 2023; Ramsauer et al., 2020) by design. Moreover, we observe that (T1) and (T2) are essentially about encoding memories onto the fixed points of T . These motivate us to view the construction of T as a learning problem: we aim to learn a function T satisfying (T2) from a dataset consisting of query-memory pairs. Thus, rather than using the traditional Hopfield model s learning rule where the model memorizes memories by defining an energy function, like the overlap-construction (Hu et al., 2023) we interpret the memorization process as learning a function that maps queries to memories. This new perspective allows us to construct novel modern Hopfield models that are equivalent to various attention variants. 3 Nonparametric Modern Hopfield Models High-Level Overview. In Section 3.1, we formulate the memory storage and retrieval of modern Hopfield models as a nonparametric regression problem. We first align the definition of T (the retrieval dynamics (2.2)) with a nonparametric regression problem subject to a set of query-memory pairs. Then, by solving for optimality, we derive a nonparametric formulation of T . Nonparametric Modern Hopfield Models In Section 3.2, we showcase our framework with two special cases: the standard dense modern Hopfield model (Ramsauer et al., 2020) (Lemma 3.1), and a new, efficient sparse-structured modern Hopfield model (Theorem 3.2). 3.1 Retrieval Dynamics The retrieval dynamics (2.2) TΞ(x) : Rd Rd maps an input query x to TΞ(x), with the aim of retrieving the memory pattern ξµ closest to x. To formalize this notion of retrieval, we need a few definitions and notation. Definition 3.1 (Generalized Fixed Point (Sriperumbudur & Lanckriet, 2009)). We say a set S Rd a generalized fixed point with respect to TΞ if TΞ(y) S for every y S. Remark 3.1 (Fixed Point). In contrast, a fixed point of TΞ is a point y for which TΞ(y) = y. In particular, if the retrieval dynamics is initiated at x S where S is an invariant set3, then subsequent iterates such as TΞ(x), TΞ TΞ(x), . . . remain in the invariant set S. 
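As a small illustration of Definition 3.1 and Remark 3.1, the sketch below iterates the dense update (2.2) from a query started near a stored pattern and checks that the iterates stay in, and quickly settle inside, a neighborhood of that pattern. The synthetic data, the radius, and the number of iterations are illustrative assumptions, not the paper's experimental setup.

```python
import numpy as np

def retrieve_dense(x, Xi, beta):
    s = beta * (Xi.T @ x)
    p = np.exp(s - s.max()); p /= p.sum()
    return Xi @ p

rng = np.random.default_rng(1)
d, M = 64, 12
beta = 1.0 / np.sqrt(d)
Xi = rng.standard_normal((d, M))
target = 3
# Radius of the neighborhood around each memory: half the minimal pairwise distance.
R = 0.5 * min(np.linalg.norm(Xi[:, i] - Xi[:, j])
              for i in range(M) for j in range(M) if i != j)

x = Xi[:, target] + 0.1 * rng.standard_normal(d)   # start near the target memory
for t in range(5):
    x = retrieve_dense(x, Xi, beta)
    err = np.linalg.norm(x - Xi[:, target])
    print(f"iter {t}: distance to target = {err:.4f}, inside neighborhood: {err <= R}")
# Successive iterates remain in the target's neighborhood and the distance stabilizes,
# illustrating the generalized fixed point (invariant set) behavior of Remark 3.1.
```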
Now we introduce a neighborhood Sµ, a ball of radius R centered at every memory pattern ξµ:

Sµ := {ξ : ‖ξ − ξµ‖ ≤ R},  where  R := (1/2) Min_{µ,ν∈[M]; µ≠ν} ‖ξµ − ξν‖.

By definition, neighborhoods associated with distinct memory patterns do not overlap: Sµ ∩ Sν = ∅ for µ ≠ ν. To measure the progress of the dynamics in retrieving the memory pattern, we introduce the notion of memory storage and ϵ-retrieval.

Definition 3.2 (Storage and ϵ-Retrieval). A memory pattern ξµ is stored if Sµ is a generalized fixed point of T. A memory pattern ξµ is ϵ-retrieved by TΞ with an input query x if ‖TΞ(x) − ξµ‖ ≤ ϵ.

Below, when the context is clear, we suppress the dependence of TΞ on the memory patterns Ξ for simplicity. Definition 3.2 states that, for an input x around a stored memory pattern ξ, the corresponding output T(x) should be located in the same sphere S. This motivates us to view T as a function that maps the query x onto its nearest memory ξ within an error-tolerance margin R. More precisely, we construct such a function satisfying Definition 3.2 as a learning problem, using memory patterns as data. A natural choice for constructing this function is the soft-margin SVR (Awad et al., 2015; Vapnik, 2013; Kar & Karnick, 2012; Schölkopf & Smola, 2002) (also see Appendix C.1 for a concise overview): it fits the best hyperplane to the data points within a predefined error margin, aiming to minimize the error rate while ensuring the model remains insensitive to errors within a certain threshold.

3 A generalized fixed point S with respect to TΞ is also an invariant set with respect to TΞ.

We first define the regression model. Given a weight matrix W ∈ R^{d×DΦ} and a feature map Φ : R^d → R^{DΦ}, denote by f_{W,Φ} : R^d → R^d the mapping

f_{W,Φ}(x) = WΦ(x).  (3.1)

Denote K(x1, x2) := ⟨Φ(x1), Φ(x2)⟩. This is a positive semidefinite kernel, and there is a unique RKHS H associated with this kernel K (Wainwright, 2019, Theorem 12.11). To cast T as an SVR problem using (3.1), we now specify the data points that f(x) should fit. Since the goal of T is to retrieve the memory pattern most similar to a given query x, we consider the training dataset D = {(ξµ + δξµ, ξµ)}_{µ∈[M]}. Namely, the input query x = ξµ + δξµ is the target memory pattern contaminated with noise δξµ, and the output y = ξµ is the target memory pattern. For convenience, we write Ξδ := [ξ1 + δξ1, ..., ξM + δξM] ∈ R^{d×M} as shorthand for the contaminated memory patterns. Next, we frame the memorization in modern Hopfield models as fitting f to the dataset D, and obtain the following nonparametric (support vector) regression problem. Given a dataset D = {(ξµ + δξµ, ξµ)}_{µ∈[M]}, consider the support vector regression using the feature map Φ:

Min_{W,η,η̃}  (1/2)‖W‖² + C Σ_{µ=1}^M ⟨1, ηµ + η̃µ⟩   s.t.  (3.2)
  −(ϵ1 + η̃µ) ≤ ξµ − WΦ(ξµ + δξµ) ≤ ϵ1 + ηµ,
  ϵ1 + ηµ ≤ (ϵ̄/√d) 1,
  ηµ ≥ 0,  η̃µ ≥ 0,  ∀µ ∈ [M],

where the constraints are component-wise, ϵ > 0 is the component-wise error margin, C ≥ 0 is a penalty coefficient, and ϵ̄ > 0 is the memory retrieval error. We denote the (unique, by the strong convexity of the optimization problem (3.2)) minimizer as (W*_Φ, η*_Φ, η̃*_Φ), and the solution to (3.2) as TSVR(x). By solving the optimality conditions via Lagrangian duality, we obtain the following.

Theorem 3.1. Let α, α̃ denote the Lagrange multipliers of the dual problem of (3.2), and let W* := (w*_1, . . . , w*_d)^T ∈ R^{d×DΦ} denote the minimizer of (3.2). Then, for each i ∈ [d],

w*_i = Σ_{µ=1}^M (αµ[i] − α̃µ[i]) Φ(ξµ + δξµ),

where αµ[i] − α̃µ[i] ∈ R, Φ(ξµ + δξµ) ∈ R^{DΦ}, and a[i] denotes the i-th element of a vector a.

Proof. See Appendix D.1 for a detailed proof.
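To ground the optimization problem (3.2), the following sketch fits a per-coordinate kernel SVR on noisy query-memory pairs and uses the fitted map as a retrieval dynamics, in the spirit of TSVR. It relies on scikit-learn's generic ϵ-SVR with an RBF kernel as a stand-in for the feature map Φ; this is an illustrative assumption, since the paper works with the exact program (3.2), including the extra upper bound on the slack variables, which sklearn does not expose.

```python
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)
d, M = 16, 8
beta = 1.0 / np.sqrt(d)
Xi = rng.standard_normal((d, M))                  # memory patterns xi_mu (columns)
noise = 0.05 * rng.standard_normal((d, M))
X_train = (Xi + noise).T                          # inputs: contaminated memories xi_mu + delta xi_mu
Y_train = Xi.T                                    # outputs: clean target memories

# One epsilon-SVR per output coordinate (the rows w_i of W in Theorem 3.1).
models = [SVR(kernel="rbf", gamma=beta, C=10.0, epsilon=0.01).fit(X_train, Y_train[:, i])
          for i in range(d)]

def T_svr(x):
    """Nonparametric retrieval map: apply the d fitted regressors to a query x."""
    x = x.reshape(1, -1)
    return np.array([m.predict(x)[0] for m in models])

query = Xi[:, 2] + 0.05 * rng.standard_normal(d)  # a contaminated version of memory 2
print("retrieval error:", np.linalg.norm(T_svr(query) - Xi[:, 2]))
```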
For any featurization map Φ, Theorem 3.1 introduces a map TSVR,Φ := f W Φ,Φ. By construction, for any Φ, TSVR,Φ Nonparametric Modern Hopfield Models obeys the ϵ-retrieval property TSVR,Φ(x) ξµ ϵ, for any µ and x Sµ. Hence, we arrive a nonparametric interpretation for constructing many modern Hopfield models. Given an input query x, the i-th component of the retrieved pattern by applying TSVR(x) once is xnew[i] := TSVR(x)[i] = w i , Φ(x) . (3.3) Remark 3.2. Note that ϵ is the component-wise SVR error, not the ϵ in Hopfield retrieval error defined in Definition 3.2. Remark 3.3. Without any assumption on ϵ, TSVR converges to generalized fixed points, in contrast to the fixed point convergence in (Hu et al., 2023; Ramsauer et al., 2020). Thus, there is no multiple update convergence for TSVR without specifying Φ (and thereby proving the fixed point convergence property.) We provide specific Φ with provably fixed point convergence in Section 3.2 and Remark 4.3. Remark 3.4. This regression problem is nonparametric. That is, it does not assume a specific functional form for TSVR and is flexible in the number of parameters, allowing the number of support vectors to adjust based on the data. Intuitively, this optimization problem learns a TSVR,Φ to replace T from the training dataset D = {(ξµ + δξµ, ξµ)}µ [M]. Thus, for any given query ξµ+δξµ, TSVR,Φ retrieves a target memory pattern ξµ with ϵ precision, for all µ [M]. Specifically, this ϵ precision comes from the upper bound of the maximum component-wise error ϵ + ηµ[i] (and ϵ + eηµ[i]) ϵ/ d, defined in (3.2). This choice of SVR error margin mimics the ϵ-retrieval of modern Hopfield models via the flexibility of soft-margin SVR. As a result, the objective of the SVR problem (3.2) coincides with the memorization and retrieval processes of modern Hopfield models. While T retrieves memory patterns {ξµ}µ [M] based on x with an error tolerance ϵ, the SVR problem (3.2) (Memorization:) Fits a function TSVR satisfying Definition 3.2, which (Retrieval:) Maps queries onto memory patterns within a component-wise error-margin ϵ/ Importantly, Theorem 3.1 enables us to derive a family of nonparametric modern Hopfield models through constructing their retrieval dynamics with various kernel functions Φ( ), including Dense (Ramsauer et al., 2020), Linear (Katharopoulos et al., 2020), Multi-Head (Vaswani et al., 2017), Sparse-Structured (Zaheer et al., 2020; Beltagy et al., 2020; Child et al., 2019) and Generalized Kernelizable or Positive Random Features (Choromanski et al., 2021) modern Hopfield models. Appendix E includes constructions of these models as extensions from our framework. 3.2 Nonparametric Dense and Sparse-Structured Modern Hopfield Models In this section, we showcase the nonparametric framework Theorem 3.1 with two special cases. First, we recover the standard dense modern Hopfield model (Ramsauer et al., 2020). Then, we introduce the efficient sparse-structured modern Hopfield models with sub-quadratic complexity. Dense Modern Hopfield Model (Ramsauer et al., 2020). Lemma 3.1 (Nonparametric Dense Modern Hopfield Model). Let Φ( ) = (ϕ(0) 0 , ϕ(1) 1 , . . . , ϕ(1) D1, . . . , ϕ(n) 1 , . . . , ϕ(n) Dn, . . .) with the formulation, for 1 D Dn with Dn := d+n 1 n , ϕ(n) D := ( βx1)ℓ1 ( βxd)ℓd PM µ=1 Φ(ξµ + δξµ), Φ(x) ℓ1! ℓd! , (3.4) where ℓ1 + + ℓd = n. By Theorem 3.1, fitting TSVR on D following (3.2) gives TDense(x) = Ξ Softmax βΞT δ x Rd, (3.5) where Ξδ := [ξ1 + δξ1, , ξM + δξM] Rd M denotes the contaminated memory patterns. Proof Sketch. 
We first select Φ to be the Taylor expansion of the exp function via the exp kernel's feature expansion (Nguyen et al., 2024; Hamid et al., 2014; Kar & Karnick, 2012; Schölkopf & Smola, 2002). By solving the optimization problem (3.2), we arrive at a retrieval dynamics resembling (2.2). See Appendix D.2 for a detailed proof.

Remark 3.5 (Hetero- vs. Auto-Associative Memory). So far, we derive a nonparametric framework for hetero-associative modern Hopfield models, differentiating x and y by incorporating inherent noise δξ into D. If we eliminate the noises {δξµ}µ∈[M] from the training memory patterns, (3.5) reduces to the standard auto-associative dense modern Hopfield model, as shown in (2.2).

With Remark 3.5, Lemma 3.1 facilitates the replication of known results from the standard dense modern Hopfield model (Ramsauer et al., 2020). The recovery of the dense modern Hopfield model provides a sanity check for our nonparametric framework.

Sparse-Structured Modern Hopfield Models. Next, we present a set of efficient modern Hopfield models with sparse-structured patterns via the following mask.

Definition 3.3 (Sparse-Structured Mask). Let M := {M(1), . . . , M(k)} ⊆ {1, . . . , M} be the reduced support set for TSVR, of size k ≤ M. Then, for µ ∈ M, the optimization problem in (3.2) reduces to

Min_{W,η,η̃}  (1/2)‖W‖² + C Σ_{µ∈M} ⟨1, ηµ + η̃µ⟩   s.t.  (3.6)
  −(ϵ1 + η̃µ) ≤ ξµ − WΦ(ξµ + δξµ) ≤ ϵ1 + ηµ,
  ϵ1 + ηµ ≤ (ϵ̄/√d) 1,
  ηµ ≥ 0,  η̃µ ≥ 0,  ∀µ ∈ M.

With Definition 3.3, we obtain the following sparse-structured retrieval dynamics (and thereby its corresponding Hopfield model(s)) by fitting TSVR on D masked by M.

Theorem 3.2 (Sparse-Structured Modern Hopfield Models). Let Φ(·) = (ϕ(0)_0, ϕ(1)_1, . . . , ϕ(1)_{D1}, . . . , ϕ(n)_1, . . . , ϕ(n)_{Dn}, . . .), where, for 1 ≤ D ≤ Dn with Dn := (d+n−1 choose n),

ϕ(n)_D := [(√β x1)^{ℓ1} · · · (√β xd)^{ℓd}] / ([Σ_{µ∈M} ⟨Φ(ξµ + δξµ), Φ(x)⟩] · √(ℓ1! · · · ℓd!)),

with ℓ1 + · · · + ℓd = n. By Theorem 3.1, fitting TSVR on D masked by M following (3.6) gives

TSparse(x) = ΞM Softmax(β ΞM^T x),  (3.7)

where ΞM := [· · · , ξj + δξj, · · ·] ∈ R^{d×|M|} with j ∈ [|M|].

Proof. See Appendix D.3 for a detailed proof.

We emphasize that (3.7) is in fact generic and is able to describe many sparse-structured modern Hopfield models with various support sets. Importantly, it allows us to construct efficient variants with sub-quadratic complexity, and hence fills the void in the literature regarding efficient modern Hopfield models, as discussed in (Hu et al., 2023). We present three efficient variants based on (3.7) below (see also the sketch after Example 2). To analyze efficiency for long query sequences4, we first generalize (3.7) from a single query x to a sequence of L queries, denoted by X = [x1, . . . , xL]. Let the binary matrix IM be the corresponding sparse-structured mask.

4 Considering long query sequences is crucial, as they contribute to inefficiency (see (Hu et al., 2023, Section C.2)).

Example 1 (Random Masked Modern Hopfield Model with O(kL) Complexity). By setting M to randomly mask (M − k) entries, we obtain an efficient modern Hopfield model with a sub-quadratic O(kL) complexity. This model connects to the random attention of BigBird (Zaheer et al., 2020).

Example 2 (Efficient Modern Hopfield Model with O(L√L) Complexity). By setting M for each query in a way that IM reproduces the sliding-window pattern of window size √L, we obtain an efficient modern Hopfield model with a sub-quadratic O(L√L) complexity. This model connects to the Longformer attention (Beltagy et al., 2020) by design.
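A minimal sketch of the sparse-structured update (3.7) with a random mask as in Example 1. The mask size k, the synthetic data, and the choice to force the target memory's index into M (so retrieval is possible, as the analysis in Section 4 assumes) are illustrative assumptions.

```python
import numpy as np

def retrieve_sparse(x, Xi, beta, support):
    """Sparse-structured retrieval (3.7): softmax restricted to the support set M."""
    Xi_M = Xi[:, support]                    # d x |M| masked memory matrix
    s = beta * (Xi_M.T @ x)
    p = np.exp(s - s.max()); p /= p.sum()
    return Xi_M @ p

rng = np.random.default_rng(0)
d, M, k = 64, 32, 8
beta = 1.0 / np.sqrt(d)
Xi = rng.standard_normal((d, M))
target = 7
x = Xi[:, target] + 0.1 * rng.standard_normal(d)

# Example 1: random mask of size k, with the target index included.
support = rng.choice(np.delete(np.arange(M), target), size=k - 1, replace=False)
support = np.concatenate([[target], support])

x_new = retrieve_sparse(x, Xi, beta, support)
print("sparse retrieval error:", np.linalg.norm(x_new - Xi[:, target]))
# Cost per query scales with k rather than M; over L queries this is the O(kL) regime of Example 1.
```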
Example 3 (Top-K Modern Hopfield Model). Let the sequence {pµ}µ∈[M] be the inner products of the memories {ξµ}µ∈[M] and the query x, i.e., pµ := ⟨x, ξµ⟩, and let p⋆ be the K-th largest element in {pµ}µ∈[M]. Then we obtain a sparse-structured mask M such that

µ ∈ M, if pµ ≥ p⋆;   µ ∉ M, if pµ < p⋆.  (3.8)

With (3.8), we arrive at a top-K modern Hopfield model with quadratic complexity, i.e., inefficient. This model connects to the top-K attention (Gupta et al., 2021) by design.

Remark 3.6 (Time Complexity of Modern Hopfield Models and Attention Mechanism). The time complexity of modern Hopfield models and Hopfield layers is given as follows. The time complexity of a modern Hopfield model is O(Md²). When used as cross-attention (Hopfield layer) with a length-L (query) and a length-M (memory) input sequence, it is O(LMd²). When used as self-attention with a length-L input sequence (set M = L), it is O(L²d²). Our efficient modern Hopfield models achieve high efficiency through two means: a sparse-structured mask and various choices of the kernel Φ. The sparse-structured mask, with a support set of size k ≤ M, reduces the complexity from O(Md²) to O(kd²). Additionally, different choices of kernel, such as the linear kernel and the positive random feature kernel in Appendix E, lead to efficient implementations. Numerically, we verify their performance in Appendices G.1 to G.4 and their efficiency in Appendix G.5 (e.g., runtime in Figure 6 and floating-point operations in Figure 7).

4 Theoretical Analysis

In this section, we characterize how sparsity affects the sparse-structured models defined in (3.7). Our theoretical analysis of these new sparse models5 consists of the following two aspects:
1. Derive the sparsity-dependent retrieval error bound and prove their fixed-point convergence.
2. Characterize the fundamental limit of memory capacity.

5 We use the plural "models" as M in (3.7) is a generic expression for many models with different sparse patterns.

As a reminder, we adopt Definition 3.2 for memory storage and retrieval. Additionally, we recall the following definition regarding the separation between memory patterns.

Definition 4.1 (Separation of Patterns). The separation of a memory pattern ξµ from all other memory patterns Ξ is defined as its minimal inner product difference to any other pattern:

∆µ := Min_{ν; ν≠µ} [⟨ξµ, ξµ⟩ − ⟨ξµ, ξν⟩].

4.1 Memory Retrieval: Error Bounds and Fixed Point Convergence

Memory Retrieval Error Bounds. To analyze the accuracy of memory retrieval, we derive an upper bound on the retrieval error of the sparse-structured models.

Theorem 4.1 (Sparsity-Dependent Retrieval Error). Let TSparse be the sparse-structured retrieval dynamics (3.7). For a query x ∈ Sµ, it holds that

‖TSparse(x) − ξµ‖ ≤ m (M + k − 2) exp{−β (⟨ξµ, x⟩ − Max_{ν∈[M], ν≠µ} ⟨ξµ, ξν⟩)},  (4.1)

for all µ ∈ M, where k := |M| ∈ [M] denotes the size of the support set M, and m = Maxµ ‖ξµ‖.

Proof. See Appendix D.4 for a detailed proof.

Interestingly, the retrieval error bound in Theorem 4.1 is sparsity-dependent: it is governed by the size of the support set M, i.e., the sparsity dimension k := |M|. A numerical check of (4.1) appears in the sketch below.

Remark 4.1 (Comparing with the Sparse Modern Hopfield Model (Hu et al., 2023)). Compared to the retrieval error bound in (Hu et al., 2023), which lacks explicit dependence on its input (data)-dependent sparsity, the sparsity (size of M) here is pre-specified. When there are fewer elements in the sparse-structured mask, i.e., when k is small, the retrieval error bound is tighter, and vice versa.
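The sketch below evaluates the right-hand side of (4.1) against the actual one-step retrieval error for a few support-set sizes k. The normalized patterns, the value of β, the noise level, and the requirement that the mask contain the target index are synthetic, illustrative assumptions.

```python
import numpy as np

def retrieve_sparse(x, Xi, beta, support):
    Xi_M = Xi[:, support]
    s = beta * (Xi_M.T @ x)
    p = np.exp(s - s.max()); p /= p.sum()
    return Xi_M @ p

rng = np.random.default_rng(0)
d, M, beta = 64, 32, 10.0
Xi = rng.standard_normal((d, M))
Xi /= np.linalg.norm(Xi, axis=0)               # normalize so m = max_mu ||xi_mu|| = 1
mu = 0
x = Xi[:, mu] + 0.01 * rng.standard_normal(d)
m = np.linalg.norm(Xi, axis=0).max()
gap = Xi[:, mu] @ x - max(Xi[:, mu] @ Xi[:, nu] for nu in range(M) if nu != mu)

for k in (4, 8, 16, 32):
    support = np.concatenate([[mu], rng.choice(np.arange(1, M), size=k - 1, replace=False)])
    err = np.linalg.norm(retrieve_sparse(x, Xi, beta, support) - Xi[:, mu])
    bound = m * (M + k - 2) * np.exp(-beta * gap)    # right-hand side of (4.1)
    print(f"k={k:2d}: error={err:.5f}  bound={bound:.5f}")
# The bound, and typically the error as well, shrinks as k decreases, matching Remark 4.1.
```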
Remark 4.2 (Noise Robustness). By Theorem 4.1, in cases involving a contaminated query or memory, i.e., x̃ = x + δx (noise in the query) or ξ̃ = ξ + δξ (noise in the memory), the impact of noise on the sparse retrieval error (4.1) is smaller than its impact on the dense counterpart, due to the smaller coefficient (M + k − 2).

Corollary 4.1.1. Let TDense and TSparse be the dense (3.5) and sparse-structured (3.7) retrieval dynamics, respectively. For any query pattern x ∈ Sµ and µ ∈ M, it holds that ‖TSparse(x) − ξµ‖ ≤ ‖TDense(x) − ξµ‖.

Proof. See Appendix D.5 for a detailed proof.

Computationally, Corollary 4.1.1 suggests that TSparse needs fewer iterations to reach fixed points than TDense, given the same error tolerance level. In other words, TSparse retrieves stored memory patterns faster than TDense.

Remark 4.3 (Multiple-Update). Another important implication of Corollary 4.1.1 is that TSparse exhibits the same multiple-update functionality as existing models (Hu et al., 2023; Wu et al., 2024b; Ramsauer et al., 2020).

To bridge to deep learning methodologies, we show in the following corollary that TSparse retrieves memory patterns with high accuracy after a single activation, akin to (Hu et al., 2023; Wu et al., 2024b).

Corollary 4.1.2 (One-Step Retrieval with High Accuracy). For any query x ∈ Sµ and µ ∈ M, TSparse retrieves the memory pattern ξµ with a retrieval error ϵ exponentially suppressed by ∆µ.

Proof. See Appendix D.5 for a detailed proof.

Corollary 4.1.2 indicates that, with sufficiently large ∆µ, TSparse retrieves memory patterns in a single iteration, allowing the integration of sparse-structured modern Hopfield models into deep learning architectures similarly to (Xu et al., 2024; Hu et al., 2024a; Wu et al., 2024b; Schimunek et al., 2023; Hoover et al., 2023; Seidl et al., 2022).

Fixed Point Convergence. By design, the retrieval dynamics constructed via Lemma 3.1 satisfy (T2). We now verify this adherence as a sanity check. Interestingly, while previous studies (Hu et al., 2023; Wu et al., 2024b; Ramsauer et al., 2020) rely on detailed energy functions to show the convergence properties of modern Hopfield models, we prove them for sparse-structured modern Hopfield models in the next corollary even without knowing E. It affirms that TSparse satisfies (T2).

Corollary 4.1.3 (Fixed Point Convergence). Let TSparse be the sparse-structured retrieval dynamics (3.7). For all µ ∈ M, the query x ∈ Sµ converges to a fixed point if TSparse is applied to it iteratively.

Proof. See Appendix D.6 for a detailed proof.

4.2 Memory Capacity

To characterize the fundamental limit of memory capacity, we ask the following two questions for sparse-structured modern Hopfield models, following (Hu et al., 2023):
(A) What is the necessary condition for a pattern ξµ to be considered well-stored and correctly retrieved?
(B) What is the expected number of memory patterns such that the above condition is satisfied?

Well-Separation Condition. To address (A), we identify the necessary condition for a pattern to be well-stored and retrieved by the sparse-structured modern Hopfield models.

Lemma 4.1 (Well-Separation Condition). Following Definition 3.2, for µ ∈ M, suppose every memory pattern {ξµ}µ∈M is enclosed by a sphere Sµ := {x : ‖x − ξµ‖ ≤ R}, with finite radius R := (1/2) Min_{µ,ν∈M; µ≠ν} ‖ξµ − ξν‖. Then, the retrieval dynamics TSparse maps Sµ to itself if
1. The starting point x is inside Sµ: x ∈ Sµ.
2. The well-separation condition holds: ∆µ ≥ (1/β) ln[(M + k − 2) m / R].

Proof. See Appendix D.7 for a detailed proof.
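To make Lemma 4.1 concrete, the sketch below computes the separation ∆µ, the radius R, and the well-separation threshold (1/β) ln[(M + k − 2) m / R] for several support sizes k, then checks that one sparse update keeps a query inside Sµ. The synthetic unit-norm patterns, the value of β, and the masks containing the target index are illustrative assumptions.

```python
import numpy as np

def retrieve_sparse(x, Xi, beta, support):
    Xi_M = Xi[:, support]
    s = beta * (Xi_M.T @ x)
    p = np.exp(s - s.max()); p /= p.sum()
    return Xi_M @ p

rng = np.random.default_rng(0)
d, M, beta = 64, 16, 10.0
Xi = rng.standard_normal((d, M))
Xi /= np.linalg.norm(Xi, axis=0)                  # m = max_mu ||xi_mu|| = 1
m, mu = 1.0, 0
R = 0.5 * min(np.linalg.norm(Xi[:, i] - Xi[:, j])
              for i in range(M) for j in range(M) if i != j)
delta_mu = min(Xi[:, mu] @ Xi[:, mu] - Xi[:, mu] @ Xi[:, nu] for nu in range(M) if nu != mu)

x = Xi[:, mu] + (0.5 * R) * rng.standard_normal(d) / np.sqrt(d)   # a query inside S_mu
for k in (2, 4, 8, 16):
    support = np.concatenate([[mu], rng.choice(np.arange(1, M), size=k - 1, replace=False)])
    threshold = np.log((M + k - 2) * m / R) / beta                # threshold in Lemma 4.1
    err = np.linalg.norm(retrieve_sparse(x, Xi, beta, support) - Xi[:, mu])
    print(f"k={k:2d}: Delta_mu={delta_mu:.3f}  threshold={threshold:.3f}  "
          f"one-step error={err:.4f}  stays in S_mu: {err <= R}")
# The threshold decreases as k decreases, so sparser masks satisfy the
# well-separation condition more easily.
```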
Intuitively, the well-separation condition establishes a threshold that ensures any pattern {ξµ}µ∈M is distinguishable from all others, enabling patterns to be well-stored at a fixed point of TSparse and retrieved with R precision by TSparse. Notably, Lemma 4.1 reveals that the lower bound on ∆µ diminishes as k decreases. Consequently, as M becomes sparser, satisfying the well-separation condition becomes easier, facilitating the storage of patterns and leading to a larger memory capacity lower bound for sparse-structured modern Hopfield models.

Memory Capacity. To address (B), we derive the lower bound on the maximum number of memory patterns that are well-stored and retrievable according to Lemma 4.1:

Proposition 4.1 (Modified from (Hu et al., 2023)). Define the probability of storing and retrieving a memory pattern as 1 − p. The memory capacity, i.e., the maximum number of patterns randomly sampled from a sphere of radius m that the sparse modern Hopfield models can store and retrieve, has a lower bound

MSparse ≥ √p · C^{(d−1)/4},

where C is the solution of the identity C = b/W0(exp{a + ln b}), with W0 the principal branch of the Lambert W function, a := (4/(d−1)) (ln[m(√p + k − 1)/R] + 1), and b := 4m²β/(5(d−1)).

Proof. See Appendix D.8 for a detailed proof.

Remark 4.4. Proposition 4.1 gives a memory capacity exponential in the pattern size d for any allowed value of k. Since k ≤ M, the scaling behavior of sparse-structured modern Hopfield models is similar to that of (Ramsauer et al., 2020; Hu et al., 2023). This result mirrors findings in (Wu et al., 2024b; Hu et al., 2023; Ramsauer et al., 2020).

5 Conclusion and Discussion

We introduce a nonparametric framework for modern Hopfield models. We use two examples to validate our framework: the original dense and the sparse-structured modern Hopfield models. With Lemma 3.1, we replicate the known results of the original modern Hopfield model (Ramsauer et al., 2020). With Theorem 3.2, we introduce the efficient sparse-structured Hopfield models with robust theoretical properties: tighter retrieval error bound (Corollary 4.1.1 and Corollary 4.1.2), stronger noise robustness (Remark 4.2), and exponential-in-d capacity (Lemma 4.1 and Proposition 4.1).

Comparing with Existing Works. Our framework complements existing works (Hu et al., 2023; Wu et al., 2024b; Martins et al., 2023) by filling the efficiency gaps and connecting to various attention mechanisms. Notably, when the size of the support set is k = M, the results of Theorem 4.1, Lemma 4.1, and Proposition 4.1 reduce to those of the dense modern Hopfield model (Ramsauer et al., 2020).

Extensions. In Appendix E, we present a family of modern Hopfield models connecting to many other existing attention mechanisms, including Linear (Katharopoulos et al., 2020), Multi-Head (Vaswani et al., 2017), and Generalized Kernelizable or PRF (Positive Random Features) (Choromanski et al., 2021) modern Hopfield models.

Hopfield Layers and Numerical Experiments. In line with (Hu et al., 2023; Wu et al., 2024b; Ramsauer et al., 2020), we introduce deep learning layers as competitive attention alternatives with memory-enhanced functionalities (Remark F.1), corresponding to our nonparametric modern Hopfield models (sparse-structured and the above extensions), in Appendix F. Numerically, we verify their memory retrieval (as associative memory models) and supervised learning (as transformer alternatives) performance in Appendices G.1 to G.4 and their efficiency in Appendix G.5.

Accuracy-Efficiency Tradeoff.
For learning tasks conducted in Appendix G, we do not expect generally superior performance from efficient models. Ultimately, there is the provably accuracy-efficiency tradeoff (Keles et al., 2023; Deng et al., 2023) based on complexity analysis of matrix multiplication (hence, this result is transferable to modern Hopfield models (Hu et al., 2024b)). This work only provides a theoretical framework supporting the derivation of efficient variants of modern Hopfield model, with no strictly superior performance guarantee. However, we do observe that, in many cases, linear and random features modern Hopfield models deliver acceptable results. Limitations and Future Work. We defer the discussion of limitations and future work to Appendix B. Nonparametric Modern Hopfield Models Impact Statement By the theoretical nature of this work, we expect no negative social impacts. Acknowledgments JH thanks Thomas F. Burns, Dmitry Krotov, Saul Santos, Andre Martins, Mimi Gallagher, Sara Sanchez, Dino Feng, and Andrew Chen for enlightening discussions; Weimin Wu and Jennifer Zhang for collaborations on related topics; Jaiyi Wang for assistance with numerical experiments; and the Red Maple Family for their support. The authors also thank the anonymous reviewers and program chairs for their constructive comments, especially the Area Chair of Neur IPS 2024 for pointing out key typos in appendix. JH is partially supported by the Walter P. Murphy Fellowship. HL is partially supported by NIH R01LM1372201, Abb Vie and Dolby. This research was supported in part through the computational resources and staff contributions provided for the Quest high performance computing facility at Northwestern University which is jointly supported by the Office of the Provost, the Office for Research, and Northwestern University Information Technology. The content is solely the responsibility of the authors and does not necessarily represent the official views of the funding agencies. Amari, S.-I. Characteristics of random nets of analog neuron-like elements. IEEE Transactions on systems, man, and cybernetics, (5):643 657, 1972. Auer, A., Gauch, M., Klotz, D., and Hochreiter, S. Conformal prediction for time series with modern hopfield networks. Advances in Neural Information Processing Systems, 36, 2024. Awad, M., Khanna, R., Awad, M., and Khanna, R. Support vector regression. Efficient learning machines: Theories, concepts, and applications for engineers and system designers, pp. 67 80, 2015. Beltagy, I., Peters, M. E., and Cohan, A. Longformer: The long-document transformer. ar Xiv preprint ar Xiv:2004.05150, 2020. Bommasani, R., Hudson, D. A., Adeli, E., Altman, R., Arora, S., von Arx, S., Bernstein, M. S., Bohg, J., Bosselut, A., Brunskill, E., et al. On the opportunities and risks of foundation models. ar Xiv preprint ar Xiv:2108.07258, 2021. Brandstetter, J. Blog post: Hopfield networks is all you need, 2021. URL https://ml-jku.github.io/hopfield-layers/. Accessed: April 4, 2023. Brauchart, J. S., Reznikov, A. B., Saff, E. B., Sloan, I. H., Wang, Y. G., and Womersley, R. S. Random point sets on the sphere-hole radii, covering, and separation. Experimental Mathematics, 27(1):62 81, 2018. Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al. Language models are few-shot learners. Advances in neural information processing systems, 33: 1877 1901, 2020. Carbonneau, M.-A., Cheplygina, V., Granger, E., and Gagnon, G. 
Multiple instance learning: A survey of problem characteristics and applications. Pattern Recognition, 77:329 353, 2018. Child, R., Gray, S., Radford, A., and Sutskever, I. Generating long sequences with sparse transformers. ar Xiv preprint ar Xiv:1904.10509, 2019. Choromanski, K. M., Likhosherstov, V., Dohan, D., Song, X., Gane, A., Sarlos, T., Hawkins, P., Davis, J. Q., Mohiuddin, A., Kaiser, L., Belanger, D. B., Colwell, L. J., and Weller, A. Rethinking attention with performers. In International Conference on Learning Representations, 2021. Clevert, D.-A., Unterthiner, T., and Hochreiter, S. Fast and accurate deep network learning by exponential linear units (elus). ar Xiv preprint ar Xiv:1511.07289, 2015. Correia, G. M., Niculae, V., and Martins, A. F. Adaptively sparse transformers. ar Xiv preprint ar Xiv:1909.00015, 2019. Demircigil, M., Heusel, J., L owe, M., Upgang, S., and Vermet, F. On a model of associative memory with huge storage capacity. Journal of Statistical Physics, 168:288 299, 2017. Deng, Y., Song, Z., and Zhou, T. Superiority of softmax: Unveiling the performance edge over linear attention. ar Xiv preprint ar Xiv:2310.11685, 2023. Floridi, L. and Chiriatti, M. Gpt-3: Its nature, scope, limits, and consequences. Minds and Machines, 30:681 694, 2020. F urst, A., Rumetshofer, E., Lehner, J., Tran, V. T., Tang, F., Ramsauer, H., Kreil, D., Kopp, M., Klambauer, G., Bitto, A., et al. Cloob: Modern hopfield networks with infoloob outperform clip. Advances in neural information processing systems, 35:20450 20468, 2022. Gupta, A., Dar, G., Goodman, S., Ciprut, D., and Berant, J. Memory-efficient transformers via top-k attention. ar Xiv preprint ar Xiv:2106.06899, 2021. Nonparametric Modern Hopfield Models Hamid, R., Xiao, Y., Gittens, A., and De Coste, D. Compact random feature maps. In International conference on machine learning, pp. 19 27. PMLR, 2014. Hoover, B., Liang, Y., Pham, B., Panda, R., Strobelt, H., Chau, D. H., Zaki, M. J., and Krotov, D. Energy transformer. ar Xiv preprint ar Xiv:2302.07253, 2023. Hoover, B., Chau, D. H., Strobelt, H., Ram, P., and Krotov, D. Dense associative memory through the lens of random features. In The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024. Hopfield, J. J. Neural networks and physical systems with emergent collective computational abilities. Proceedings of the national academy of sciences, 79(8):2554 2558, 1982. Hopfield, J. J. Neurons with graded response have collective computational properties like those of two-state neurons. Proceedings of the national academy of sciences, 81(10): 3088 3092, 1984. Hu, J. Y.-C., Yang, D., Wu, D., Xu, C., Chen, B.-Y., and Liu, H. On sparse modern hopfield model. In Thirty-seventh Conference on Neural Information Processing Systems, 2023. Hu, J. Y.-C., Chang, P.-H., Luo, R., Chen, H.-Y., Li, W., Wang, W.-P., and Liu, H. Outlier-efficient hopfield layers for large transformer-based models. In Forty-first International Conference on Machine Learning, 2024a. Hu, J. Y.-C., Lin, T., Song, Z., and Liu, H. On computational limits of modern hopfield models: A fine-grained complexity analysis. In Forty-first International Conference on Machine Learning, 2024b. Hu, J. Y.-C., Wu, D., and Liu, H. Provably optimal memory capacity for modern hopfield models: Transformercompatible dense associative memories as spherical codes. Advances in Neural Information Processing Systems, 38, 2024c. Iatropoulos, G., Brea, J., and Gerstner, W. 
Kernel memory networks: A unifying framework for memory modeling. Advances in Neural Information Processing Systems, 35: 35326 35338, 2022. Ilse, M., Tomczak, J., and Welling, M. Attention-based deep multiple instance learning. In International conference on machine learning, pp. 2127 2136. PMLR, 2018. Jaggi, M. An equivalence between the lasso and support vector machines. Regularization, optimization, kernels, and support vector machines, pp. 1 26, 2014. Ji, Y., Zhou, Z., Liu, H., and Davuluri, R. V. Dnabert: pre-trained bidirectional encoder representations from transformers model for dna-language in genome. Bioinformatics, 37(15):2112 2120, 2021. Kandemir, M., Zhang, C., and Hamprecht, F. A. Empowering multiple instance histopathology cancer diagnosis by cell graphs. In Medical Image Computing and Computer Assisted Intervention MICCAI 2014: 17th International Conference, Boston, MA, USA, September 14-18, 2014, Proceedings, Part II 17, pp. 228 235. Springer, 2014. Kar, P. and Karnick, H. Random feature maps for dot product kernels. In Artificial intelligence and statistics, pp. 583 591. PMLR, 2012. Katharopoulos, A., Vyas, A., Pappas, N., and Fleuret, F. Transformers are rnns: Fast autoregressive transformers with linear attention. In International conference on machine learning, pp. 5156 5165. PMLR, 2020. Keles, F. D., Wijewardena, P. M., and Hegde, C. On the computational complexity of self-attention. In International Conference on Algorithmic Learning Theory, pp. 597 619. PMLR, 2023. Kozachkov, L., Kastanenka, K. V., and Krotov, D. Building transformers from neurons and astrocytes. bio Rxiv, pp. 2022 10, 2022. Krotov, D. and Hopfield, J. J. Dense associative memory for pattern recognition. Advances in neural information processing systems, 29, 2016. Krotov, D. and Hopfield, J. J. Large associative memory problem in neurobiology and machine learning. In International Conference on Learning Representations, 2021. Martins, A. and Astudillo, R. From softmax to sparsemax: A sparse model of attention and multi-label classification. In International conference on machine learning, pp. 1614 1623. PMLR, 2016. Martins, A. F. T., Niculae, V., and Mc Namee, D. Sparse modern hopfield networks. Associative Memory & Hopfield Networks in 2023. Neur IPS 2023 workshop., 2023. Nguyen, T. M., Nguyen, T., Ho, N., Bertozzi, A. L., Baraniuk, R. G., and Osher, S. J. A primal-dual framework for transformers and neural networks. ar Xiv preprint ar Xiv:2406.13781, 2024. Niculae, V., Martins, A., Blondel, M., and Cardie, C. Sparsemap: Differentiable sparse structured inference. In International Conference on Machine Learning, pp. 3799 3808. PMLR, 2018. Nonparametric Modern Hopfield Models Olver, F. W., Lozier, D. W., Boisvert, R. F., and Clark, C. W. NIST handbook of mathematical functions hardback and CD-ROM. 2010. Paischer, F., Adler, T., Patil, V., Bitto-Nemling, A., Holzleitner, M., Lehner, S., Eghbal-Zadeh, H., and Hochreiter, S. History compression via language models in reinforcement learning. In International Conference on Machine Learning, pp. 17156 17185. PMLR, 2022. Qiu, J., Ma, H., Levy, O., Yih, S. W.-t., Wang, S., and Tang, J. Blockwise self-attention for long document understanding. ar Xiv preprint ar Xiv:1911.02972, 2019. Ramsauer, H., Schafl, B., Lehner, J., Seidl, P., Widrich, M., Adler, T., Gruber, L., Holzleitner, M., Pavlovic, M., Sandve, G. K., et al. Hopfield networks is all you need. ar Xiv preprint ar Xiv:2008.02217, 2020. Santos, S., Niculae, V., Mc Namee, D., and Martins, A. F. 
Hopfield-fenchel-young networks: A unified framework for associative memory retrieval. ar Xiv preprint ar Xiv:2411.08590, 2024a. Santos, S., Niculae, V., Mc Namee, D., and Martins, A. F. Sparse and structured hopfield networks. ar Xiv preprint ar Xiv:2402.13725, 2024b. Schimunek, J., Seidl, P., Friedrich, L., Kuhn, D., Rippmann, F., Hochreiter, S., and Klambauer, G. Context-enriched molecule representations improve few-shot drug discovery. In The Eleventh International Conference on Learning Representations, 2023. Sch olkopf, B. and Smola, A. J. Learning with kernels: support vector machines, regularization, optimization, and beyond. MIT press, 2002. Seidl, P., Renz, P., Dyubankova, N., Neves, P., Verhoeven, J., Wegner, J. K., Segler, M., Hochreiter, S., and Klambauer, G. Improving few-and zero-shot reaction template prediction using modern hopfield networks. Journal of chemical information and modeling, 62(9):2111 2120, 2022. Sriperumbudur, B. K. and Lanckriet, G. R. On the convergence of the concave-convex procedure. In Advances in neural information processing systems, volume 9, pp. 1759 1767, 2009. Tay, Y., Dehghani, M., Bahri, D., and Metzler, D. Efficient transformers: A survey. ACM Computing Surveys, 55(6): 1 28, 2022. Vapnik, V. The nature of statistical learning theory. Springer science & business media, 2013. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., and Polosukhin, I. Attention is all you need. Advances in neural information processing systems, 30, 2017. Wainwright, M. J. High-dimensional statistics: A nonasymptotic viewpoint, volume 48. Cambridge university press, 2019. Widrich, M., Sch afl, B., Pavlovi c, M., Ramsauer, H., Gruber, L., Holzleitner, M., Brandstetter, J., Sandve, G. K., Greiff, V., Hochreiter, S., et al. Modern hopfield networks and attention for immune repertoire classification. Advances in Neural Information Processing Systems, 33:18832 18845, 2020. Wu, D., Hu, J. Y.-C., Hsiao, T.-Y., and Liu, H. Uniform memory retrieval with larger capacity for modern hopfield models. In Forty-first International Conference on Machine Learning, 2024a. Wu, D., Hu, J. Y.-C., Li, W., Chen, B.-Y., and Liu, H. STanhop: Sparse tandem hopfield model for memoryenhanced time series prediction. In The Twelfth International Conference on Learning Representations, 2024b. Wu, S., Irsoy, O., Lu, S., Dabravolski, V., Dredze, M., Gehrmann, S., Kambadur, P., Rosenberg, D., and Mann, G. Bloomberggpt: A large language model for finance. ar Xiv preprint ar Xiv:2303.17564, 2023. Xu, C., Huang, Y.-C., Hu, J. Y.-C., Li, W., Gilani, A., Goan, H.-S., and Liu, H. Bishop: Bi-directional cellular learning for tabular data with generalized sparse modern hopfield model. In Forty-first International Conference on Machine Learning, 2024. Zaheer, M., Guruganesh, G., Dubey, K. A., Ainslie, J., Alberti, C., Ontanon, S., Pham, P., Ravula, A., Wang, Q., Yang, L., et al. Big bird: Transformers for longer sequences. Advances in neural information processing systems, 33:17283 17297, 2020. Zhou, Z., Ji, Y., Li, W., Dutta, P., Davuluri, R. V., and Liu, H. DNABERT-2: Efficient foundation model and benchmark for multi-species genomes. In The Twelfth International Conference on Learning Representations, 2024a. Zhou, Z., Wu, W., Ho, H., Wang, J., Shi, L., Davuluri, R. V., Wang, Z., and Liu, H. Dnabert-s: Learning species-aware dna embedding with genome foundation models. ar Xiv preprint ar Xiv:2402.08777, 2024b. 
Nonparametric Modern Hopfield Models Supplementary Material A Table of Notations 14 B Limitations and Future Work 15 C Supplementary Theoretical Backgrounds 16 C.1 Soft-Margin Support Vector Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16 D Proofs of Main Text 17 D.1 Theorem 3.1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17 D.2 Lemma 3.1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17 D.3 Theorem 3.2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20 D.4 Theorem 4.1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21 D.5 Corollary 4.1.1 and Corollary 4.1.2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21 D.6 Corollary 4.1.3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22 D.7 Lemma 4.1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22 D.8 Proposition 4.1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23 E Nonparametric Modern Hopfield Family 25 E.1 Linear Modern Hopfield Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25 E.2 Multi-Head Modern Hopfield Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25 E.3 PRFs (Positive Random Features) Kernel Modern Hopfield Model . . . . . . . . . . . . . . . . . . . . . 26 F Nonparametric Modern Hopfield Layers for Deep Learning 27 G Experimental Studies 28 G.1 Memory Retrieval Task (Figure 2) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28 G.2 Multiple Instance Learning on MNIST (Figure 3 & Figure 4) . . . . . . . . . . . . . . . . . . . . . . . . 30 G.3 Multiple Instance Learning on Real World Datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33 G.4 Time Series Prediction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35 G.4.1 Implementation Details . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35 G.5 Computational Efficiency . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37 Nonparametric Modern Hopfield Models A Table of Notations Table 1. Mathematical Notations and Symbols Symbol Description a[i] The i-th component of vector a a, b Inner product for vectors a, b Rd [I] Index set {1, , I}, where I N+ Spectral norm, equivalent to the l2-norm when applied to a vector d Dimension of patterns M Number of stored memory patterns β Scaling factor of the energy function controlling the learning dynamics. We set β = 1/ d in practice x State/configuration/query pattern in Rd x Stationary points of the Hopfield energy function ξ Memory patterns (keys) in Rd δξ Noises in memory patterns in Rd D Training data set {(ξµ + δξµ, ξµ)}µ [M] Ξ Shorthand for M stored memory (key) patterns {ξµ}µ [M] in Rd M Ξδ Shorthand for M contaminated memory (key) patterns {δξµ}µ [M] in Rd M ΞTx M-dimensional overlap vector ( ξ1, x , , ξµ, x , , ξM, x ) in RM Φ( ) Kernelized feature mapping Φ( ) : Rd Dϕ ϕ Element in the Φ( ) = (ϕ(0) 0 , ϕ(1) 1 , . . . , ϕ(1) D1, . . . , ϕ(n) 1 , . . . , ϕ(n) Dn, . . .) 
D_Φ           Dimension of the kernel space, i.e., the dimension of the output of Φ(·)
h(·)          Normalization mapping in the regression model defined by (3.1)
W             Weight matrix in the regression model defined by (3.1), in R^{d×D_Φ}
w_i           i-th row of the weight matrix W, in R^{D_Φ}
K(·,·)        Kernel function of inner-product form K(·,·) = ⟨Φ(·), Φ(·)⟩, with K : R^{D_Φ} × R^{D_Φ} → R_+
ϵ             Component-wise error margin in the support vector regression problem
η, η̃          Slack variables in the support vector regression
C             Penalty coefficient of the support vector regression
L             Lagrangian corresponding to (3.2)
α, α̃, λ, λ̃    Dual variables in the Lagrangian L
M             Reduced support set for T_SVR, M := {M(1), ..., M(k)} ⊆ {1, ..., M}
1_M(µ)        Indicator function of M: 1_M(µ) = 1 for µ ∈ M and 1_M(µ) = 0 for µ ∉ M
k             Size of the support set M, defined as k := |M|
m             Largest norm among the memory patterns, m := max_{µ∈[M]} ‖ξ_µ‖
R             Half the minimal Euclidean distance over all pairs of memory patterns, R := (1/2) min_{µ,ν∈[M]} ‖ξ_µ − ξ_ν‖
S_µ           Sphere centered at memory pattern ξ_µ with finite radius R
x*_µ          Fixed point of T covered by S_µ, i.e., x*_µ ∈ S_µ
Δ_µ           Separation of a memory pattern ξ_µ from all other memory patterns Ξ, defined in (4.1)
Δ̃_µ           Separation of ξ_µ at a given x from all memory patterns Ξ, defined in (4.1)

B Limitations and Future Work

By the theoretical nature of this work, we rely on certain simplifying assumptions, which limit the generality of our results.

We require a specific feature mapping Φ to guarantee fixed-point convergence (Section 3.2). This requirement imposes structure on the retrieval dynamics. Without it, an unconstrained nonparametric Hopfield update can converge to a generalized fixed point rather than the intended memory (Remark 3.3); multiple iterations may then fail to recover the true stored pattern. We emphasize that meeting this requirement is easy (Section 3.2); see also the nonparametric modern Hopfield model family in Appendix E.

Our relative error analysis (Theorem 4.1) assumes that the correct memory item is in the chosen support set. If a random-masking step removes that item, retrieval fails (Figure 2). We remark that while this assumption is restrictive, it is necessary for a tractable analysis.

Our exponential capacity result in Proposition 4.1 requires a strong well-separation condition: each pattern must be sufficiently distinct from the others. Many real datasets have correlated or structured patterns, so this assumption may be hard to satisfy. Still, it is standard in the modern Hopfield model literature (Krotov & Hopfield, 2016; Demircigil et al., 2017; Ramsauer et al., 2020; Iatropoulos et al., 2022; Hu et al., 2023; Wu et al., 2024b; Santos et al., 2024a;b). To mitigate this in practice, Wu et al. (2024a) and Hu et al. (2024c) relax this data dependency by optimizing the Hopfield energy landscape for larger memory capacity.

Lastly, we do not analyze the extensions introduced in Appendix E; in particular, we do not prove their convergence or capacity. Looking ahead, we plan to address these theoretical gaps and to relax strict assumptions such as well-separated patterns.

C Supplementary Theoretical Backgrounds

C.1 Soft-Margin Support Vector Regression

Soft-margin Support Vector Regression (SVR) (Awad et al., 2015; Jaggi, 2014; Vapnik, 2013; Kar & Karnick, 2012; Schölkopf & Smola, 2002) generalizes Support Vector Machines (SVM) to regression tasks. It finds a function f(x) = W Φ(x) + b that remains within an ϵ-tube of the target outputs, while allowing limited violations (soft margin) when data points lie outside this tube.
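For concreteness, the following minimal sketch (our own illustration, not from the paper) fits a soft-margin ϵ-SVR on toy one-dimensional data using scikit-learn, which implements the standard ϵ-insensitive, L1-slack variant with an RBF kernel rather than the exact formulation in (C.1) below. It illustrates the ϵ-tube behavior and the data-dependent set of support vectors discussed in Remark C.1; all names and hyperparameters are illustrative assumptions.

```python
# Minimal illustration of soft-margin epsilon-SVR (our own sketch; scikit-learn's
# standard L1-slack epsilon-SVR with an RBF kernel, not the exact problem (C.1)).
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))                   # inputs x_mu in R^d (d = 1 here)
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(200)    # noisy targets y_mu

model = SVR(kernel="rbf", C=10.0, epsilon=0.1)          # C: slack penalty, epsilon: tube width
model.fit(X, y)

pred = model.predict(X)
inside_tube = np.mean(np.abs(pred - y) <= 0.1 + 1e-8)
print(f"fraction of points inside the eps-tube: {inside_tube:.2f}")
# Nonparametric flavor: the fitted function is a kernel expansion over the
# support vectors, whose number grows with the data size (cf. Remark C.1).
print(f"number of support vectors: {len(model.support_)} / {len(X)}")
```

The dual coefficients returned by the solver play the role of (α_µ − α̃_µ) in Remark C.1: only the support vectors enter the fitted regressor.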
Notation and Setup. {(x_µ, y_µ)}_{µ=1}^{M} is the training set, where x_µ ∈ R^d and y_µ ∈ R^d. Φ : R^d → R^{D_Φ} is a feature map into a (possibly high-dimensional) space. W ∈ R^{d×D_Φ} and b ∈ R^d are the regression parameters.

Primal Formulation. To tolerate errors beyond ϵ while penalizing them, SVR introduces nonnegative slack variables η_µ and η̃_µ for each data point. The soft-margin SVR with ℓ2-loss is formulated as

\begin{align}
\min_{W,\,\eta,\,\tilde{\eta}} \quad & \frac{1}{2}\|W\|^2 + C\sum_{\mu=1}^{M}\big\langle \mathbf{1},\, \eta_\mu + \tilde{\eta}_\mu \big\rangle \nonumber\\
\text{subject to} \quad & y_\mu - \langle W, \Phi(x_\mu)\rangle - b \le \epsilon\,\mathbf{1} + \eta_\mu, \nonumber\\
& \langle W, \Phi(x_\mu)\rangle + b - y_\mu \le \epsilon\,\mathbf{1} + \tilde{\eta}_\mu, \nonumber\\
& \eta_\mu,\ \tilde{\eta}_\mu \ge 0, \quad \mu \in [M],
\tag{C.1}
\end{align}

where C > 0 controls the trade-off between model fidelity and the penalty on large deviations (η_µ, η̃_µ). Since (C.1) is strongly convex, it admits a unique global minimizer. This formulation follows the standard SVR derivation; see (Schölkopf & Smola, 2002) for a comprehensive treatment.

Lagrangian and Dual Problem. To solve (C.1), we form the Lagrangian

\begin{align*}
\mathcal{L}
&= \frac{1}{2}\sum_{i=1}^{d}\|w_i\|^2
+ C\sum_{\mu=1}^{M}\sum_{i=1}^{d}\big(\eta_\mu[i] + \tilde{\eta}_\mu[i]\big)
- \sum_{\mu=1}^{M}\sum_{i=1}^{d}\big(\lambda_\mu[i]\,\eta_\mu[i] + \tilde{\lambda}_\mu[i]\,\tilde{\eta}_\mu[i]\big) \\
&\quad - \sum_{\mu=1}^{M}\sum_{i=1}^{d}\Big[\alpha_\mu[i]\big(\epsilon + \eta_\mu[i] - y_\mu[i] + \langle w_i, \Phi(x_\mu)\rangle + b[i]\big)
+ \tilde{\alpha}_\mu[i]\big(\epsilon + \tilde{\eta}_\mu[i] - \langle w_i, \Phi(x_\mu)\rangle - b[i] + y_\mu[i]\big)\Big],
\end{align*}

where λ_µ, λ̃_µ are multipliers for the slack constraints η_µ, η̃_µ ≥ 0, and α_µ, α̃_µ are multipliers for the ϵ-tube constraints. By applying the Karush-Kuhn-Tucker (KKT) conditions to L, one obtains the dual problem. In practical kernelized SVR, one uses a kernel K(x_µ, x_ν) = ⟨Φ(x_µ), Φ(x_ν)⟩, which may bypass the explicit construction of Φ.

Summary. Soft-margin SVR balances a tight ϵ-tube fit with a penalty-based tolerance for outliers. Strong convexity guarantees a unique solution in the primal, while the dual formulation reveals a sparse, data-driven representation in terms of support vectors. This dependence of model complexity on the training data classifies SVR as a nonparametric method.

Remark C.1 (Nonparametric Nature of SVR). Although the primal objective in (C.1) involves a fixed matrix W, the dual solution is data-dependent. In particular, the regressor can be written as

\begin{align*}
f(x) = \sum_{\mu=1}^{M}\big(\alpha_\mu - \tilde{\alpha}_\mu\big)\,\langle \Phi(x_\mu), \Phi(x)\rangle + b,
\end{align*}

where only the points with nonzero (α_µ − α̃_µ) (the support vectors) affect f. The number of such points can grow with M, making SVR a nonparametric method whose capacity adapts to the size of the training data.

D Proofs of Main Text

D.1 Theorem 3.1

Proof of Theorem 3.1. The Lagrangian of the convex optimization problem defined in (3.2) is

\begin{align}
\mathcal{L}
&= \frac{1}{2}\sum_{i=1}^{d}\|w_i\|^2
+ C\sum_{\mu=1}^{M}\sum_{i=1}^{d}\big(\eta_\mu[i] + \tilde{\eta}_\mu[i]\big)
- \sum_{\mu=1}^{M}\sum_{i=1}^{d}\big(\lambda_\mu[i]\,\eta_\mu[i] + \tilde{\lambda}_\mu[i]\,\tilde{\eta}_\mu[i]\big) \nonumber\\
&\quad - \sum_{\mu=1}^{M}\sum_{i=1}^{d}\alpha_\mu[i]\big(\epsilon + \eta_\mu[i] - \xi_\mu[i] + \langle w_i, \Phi(\xi_\mu + \delta\xi_\mu)\rangle\big)
- \sum_{\mu=1}^{M}\sum_{i=1}^{d}\tilde{\alpha}_\mu[i]\big(\epsilon + \tilde{\eta}_\mu[i] - \langle w_i, \Phi(\xi_\mu + \delta\xi_\mu)\rangle + \xi_\mu[i]\big),
\tag{D.1}
\end{align}

where λ_µ[i], λ̃_µ[i], α_µ[i] and α̃_µ[i] are Lagrange multipliers. Next, we solve the stationarity conditions of the above Lagrangian with respect to w_i, η_µ[i] and η̃_µ[i], and derive the corresponding optimal solution. The Lagrangian in (D.1) admits a stationary solution, which is given by

\begin{align}
w_i - \sum_{\mu=1}^{M}\big(\alpha_\mu[i] - \tilde{\alpha}_\mu[i]\big)\,\Phi(\xi_\mu + \delta\xi_\mu) = 0,
\qquad
C - \lambda_\mu[i] - \alpha_\mu[i] = 0,
\qquad
C - \tilde{\lambda}_\mu[i] - \tilde{\alpha}_\mu[i] = 0.
\tag{D.2}
\end{align}

Substituting (D.2) into (3.1), we write x_new[i] = T_SVR(x)[i] := ⟨w*_i, Φ(x)⟩, with the learned weight matrix composed of rows

\begin{align*}
w^{\star}_i = \sum_{\mu=1}^{M}\underbrace{\big(\alpha_\mu[i] - \tilde{\alpha}_\mu[i]\big)}_{\in\,\mathbb{R}}\,\underbrace{\Phi(\xi_\mu + \delta\xi_\mu)}_{\in\,\mathbb{R}^{D_\Phi}}.
\end{align*}

The complementary slackness conditions and dual feasibility of (D.1) are given by

\begin{align}
\alpha_\mu[i]\big(\epsilon + \eta_\mu[i] - \xi_\mu[i] + \langle w_i, \Phi(\xi_\mu + \delta\xi_\mu)\rangle\big) &= 0, \nonumber\\
\tilde{\alpha}_\mu[i]\big(\epsilon + \tilde{\eta}_\mu[i] - \langle w_i, \Phi(\xi_\mu + \delta\xi_\mu)\rangle + \xi_\mu[i]\big) &= 0, \nonumber\\
\alpha_\mu[i],\ \tilde{\alpha}_\mu[i],\ \lambda_\mu[i],\ \tilde{\lambda}_\mu[i] &\ge 0,
\tag{D.3}
\end{align}

for all µ ∈ [M] and i ∈ [d]. This completes the proof.

D.2 Lemma 3.1

To simplify our proofs, we fold the normalization into the feature map and define

\begin{align}
\Phi(x) := \frac{\Phi(x)}{h(x)},
\tag{D.4}
\end{align}

where h(·) : R^d → R is some normalization function, introduced for later convenience.
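Before turning to the auxiliary lemmas, a small numerical sketch (our own illustration, not part of the paper's proofs) may help fix ideas: with the exponential dot-product kernel of Lemma D.3 below, the kernelized retrieval map derived above reduces to one-step softmax retrieval over the contaminated memories, which is exactly what Lemma 3.1 states. All names and hyperparameters here are illustrative assumptions.

```python
# Our own numerical sketch (not from the paper): with the exponential
# dot-product kernel, the kernel-regression retrieval map reduces to
#   T_Dense(x) = Xi @ softmax(beta * Xi_delta.T @ x)   (cf. (D.7)).
import numpy as np

rng = np.random.default_rng(1)
d, M, beta = 32, 8, 1.0 / np.sqrt(32)                    # beta = 1/sqrt(d) as in Table 1
Xi = rng.standard_normal((d, M))                         # stored memory patterns xi_mu (columns)
Xi_delta = Xi + 0.05 * rng.standard_normal((d, M))       # contaminated memories xi_mu + delta xi_mu

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def T_dense(x):
    # kernel weights K(x, xi_mu + delta xi_mu) = exp(beta <x, xi_mu + delta xi_mu>),
    # normalized by h(x) = sum_nu K(x, xi_nu + delta xi_nu)
    return Xi @ softmax(beta * (Xi_delta.T @ x))

query = Xi[:, 3] + 0.1 * rng.standard_normal(d)          # noisy query near memory 3
retrieved = T_dense(query)
print("closest memory:", np.argmin(np.linalg.norm(Xi - retrieved[:, None], axis=0)))
```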
To prove Lemma 3.1, we introduce the following three auxiliary lemmas. Nonparametric Modern Hopfield Models Lemma D.1. Let αµ[i] 0, eαµ[i] 0 be a solution to (D.2) with KKT conditions (D.3). Then αµ[i] eαµ[i] has the following bounds C αµ[i] eαµ[i] C, µ [M], i [d]. Proof of Lemma D.1. We prove this lemma by contradiction. Recall that for each fixed values of µ and i αµ[i] 0, eαµ[i] 0. Firstly, we assume αµ[i], eαµ[i] R+ (non-zero), for all µ [M] and i [d]. Recall complementary slackness conditions from (D.3) ( ϵ + ηµ[i] ξµ[i] + wi, Φ(ξµ + δξµ) = 0 ϵ + eηµ[i] wi, Φ(ξµ + δξµ) + ξµ[i] = 0. Combine above two equations to write ηµ[i] + eηµ[i] = 2ϵ 0. Since the component-wise error ϵ 0, we have ηµ[i] + eηµ[i] 0. This conclusion contradicts the assumption of the non-negative condition on slack variables ηµ[i], eηµ[i] 0. Therefore, together with (D.2), at least one of αµ[i],eαµ[i] must be 0, for all µ and all i. Subsequently, we have 0 αµ[i] C and 0 eαµ[i] C, which leads to C αµ[i] eαµ[i] C. This completes the proof. Lemma D.2 (Multinomial Expansion). Given x, y Rd. The identity xℓ1 1 xℓd d ℓ1! ℓd! ! yℓ1 1 yℓd d ℓ1! ℓd! holds for all n N. n !(x1y1 + + xdyd)n (x1y1)n + + n! ℓ1! ℓd! i=1 (xiyi)ℓi + + (xdyd)n # Pd i=1 ℓi = n i=1 (xi)ℓi d Y xℓ1 1 xℓd d yℓ1 1 yℓd d xℓ1 1 xℓd d ℓ1! ℓd! ! yℓ1 1 yℓd d ℓ1! ℓd! This completes the proof. Nonparametric Modern Hopfield Models Our next lemma restates a known result that the exponential dot-product kernel admits an infinite-dimensional feature expansion via its power series (Nguyen et al., 2024; Hamid et al., 2014; Kar & Karnick, 2012; Sch olkopf & Smola, 2002). We include the derivation here for completeness. Lemma D.3 (Closed-Form Exponential Dot-Product Kernel). Let K( , ) be the exponential dot-product kernel: K(x, y) := exp{ x, y } = Φ(x), Φ(y) = where x, y Rd and Φ maps the feature vectors x and y into infinite dimensional space. Then, Φ( ) = (ϕ(0) 0 , ϕ(1) 1 , . . . , ϕ(1) D1, . . . , ϕ(n) 1 , . . . , ϕ(n) Dn, . . .), has a closed form solution ϕ(n) D = xℓ1 1 xℓd d ℓ1 ! ℓd !, where ℓ1 + + ℓd = n, 1 D Dn and Dn := d+n 1 n . Proof of Lemma D.3. Applying Lemma D.2 on the exp kernel, we have Φ(x), Φ(y) = i=1 (xi)ℓi d Y xℓ1 1 xℓd d yℓ1 1 yℓd d xℓ1 1 xℓd d ℓ1! ℓd! ! yℓ1 1 yℓd d ℓ1! ℓd! From above, we observe that, for each fixed n, there are d+n 1 n terms in the summation. Consequently, Φ(x) has a solution Φ(x) = (ϕ(0) 0 , ϕ(1) 1 , . . . , ϕ(1) D1 | {z } ( d+1 1 1 ) elements , . . . , ϕ(n) 1 , . . . , ϕ(n) Dn | {z } ( d+n 1 n ) elements where Dn = d+n 1 n and ϕ(n) D = xℓ1 1 xℓd d ℓ1 ! ℓd !, for 1 D Dn and ℓ1 + + ℓd = n. This completes the proof. Proof of Lemma 3.1. Recall that the learned weight matrix W is composed of µ=1 (αµ[i] eαµ[i]) Φ(ξµ + δξµ) Substitute w into (3.1) to write TDense(x) = αµ[1] eαµ[1] h(ξµ + δξµ) Φ(ξµ + δξµ), Φ(x) αµ[d] eαµ[d] h(ξµ + δξµ) Φ(ξµ + δξµ), Φ(x) Nonparametric Modern Hopfield Models Let ξµ := αµ[1] eαµ[1] h(ξµ+δξµ) , . . . , αµ[d] eαµ[d] h(ξµ+δξµ) and h(x) := PM µ=1 Φ(ξν + δξν), Φ(x) . Then TDense reduces to TDense(x) = Φ(ξµ + δξµ), Φ(x) Φ(ξν + δξν), Φ(x) ξµ. Following Lemma D.3, here we define the inner product of Φ as a kernel K : RDϕ RDϕ R+ Φ(x), Φ(ξµ + δξµ) := K(x, ξµ + δξµ). TDense is now given by TDense(x) = K(x, ξµ + δξµ) PM ν=1 K(x, ξν + δξν) ξµ. (D.5) Observe that (2.2) TDense takes a Boltzmann form: exp{ }/ PM ν=1 exp{ }. By Lemma D.3, we take ϕ(n) D = ( βx1)ℓ1 ( βxd)ℓd with the kernel K(x, ξµ + δξµ) = ( βx, βξµ + βδξµ )n Substitute (D.6) into (D.5) and write TDense(x) = P n=0 βx, βξµ + βδξµ n /n! 
PM ν=1 P t=0 βx, βξν + βδξν t /t! ξµ. By Taylor s theorem, TDense takes the form TDense(x) = exp{β x, ξµ + δξµ } PM ν=1 exp{β x, ξν + δξν } ξµ = Ξ Softmax βΞT δ x , (D.7) where Ξ = (ξ1, , ξM) Rd M and Ξδ = (ξ1 + δξ1, , ξM + δξM) Rd M denote memories and noises in memories, respectively. This completes the proof. D.3 Theorem 3.2 Proof of Theorem 3.2. To take w for the sparse-structured model, the partial derivatives of L with respect to wi, ηµ[i] and eηµ[i] must satisfy the stationarity condition µ M (αµ[i] eαµ[i]) Φ(ξµ + δξµ) = 0, C λµ[i] αµ[i] = 0, C eλµ[i] eαµ[i] = 0. Then, we arrive µ M (αµ[i] eαµ[i]) Φ(ξµ + δξµ) Following a similar approach as in Appendix D.2, we derive the retrieval dynamics for the sparse-structured modern Hopfield model: TSparse(x) = X Softmax(βΞ Mx) where the softmax is computed over βΞ Mx with ΞM := [ , ξj + δξj ] | {z } j [|M|] . This completes the proof. Nonparametric Modern Hopfield Models D.4 Theorem 4.1 Proof of Theorem 4.1. To connect TSparse with µ, first we derive the bound on TSparse(x) ξµ via (Ramsauer et al., 2020) for µ M TSparse(x) ξµ Softmax βΞT δ x (1 [Softmax βΞT δ x ]µ)ξµ + X ν M,ν =M [Softmax βΞT δ x ]νξν eϵ ξµ + eϵ M 1 ν M,ν =µ ξν eϵm + eϵ M 1(k 1)m = m(M + k 2) exp β x, ξµ Max ν [M] x, ξν , (D.8) where k := |M|, m := Maxµ ξµ , eϵ := (M 1) exp β x, ξµ Maxν [M] x, ξν and the inequality Softmax(βΞTx) ν = exp{β ( x, ξν x, ξµ )} 1 + P ν =µ exp{β ( x, ξν x, ξµ )} exp β x, ξµ Max ν [M] x, ξν , is used in (D.8). This completes the proof. D.5 Corollary 4.1.1 and Corollary 4.1.2 Proof of Corollary 4.1.1 and Corollary 4.1.2. Since ξµ, x ξν, x for all ν = µ, we have [Softmax βΞT δ x ]µ [Softmax βΞT δ x ]ν, for ν = µ. For the support set M, we have [Softmax βΞT Mx ]µ = ξµ, x P j M ξj, x (D.9) [Softmax βΞT δ x ]µ. By the smaller denominator in (D.9) This implies P ν M Softmax βΞT Mx ν ξν is pulled more strongly toward ξµ than PM ν=1 Softmax βΞT δ x To see this, we write TSparse(x) ξµ = ( Softmax βΞT Mx µ 1) | {z } :=(I) Softmax βΞT Mx TDense(x) ξµ = ( Softmax βΞT δ x µ 1) | {z } (II) Softmax βΞT δ x Nonparametric Modern Hopfield Models By (D.9), (I) (II). This means the pull toward ξµ is strictly stronger, and hence the softmax-weighted average P ν M Softmax βΞT Mx ν ξν lies closer to ξµ: TSparse(x) ξµ TDense(x) ξµ 0 TSparse(x) ξµ TDense(x) ξµ . (D.10) This completes the proof of Corollary 4.1.1. From (Ramsauer et al., 2020, Theorem 4), for any query x, TDense approximately retrieves a memory pattern ξµ with retrieval error ϵ exponentially suppressed by µ: T (x) ξµ 2m(M 1) exp β µ 2m Max x ξµ , x x µ . By (D.10), TSparse also enjoys above retrieval error bound. Therefore, TSparse(x) retrieves a memory pattern ξµ with high accuracy after a single activation with a sufficiently large µ. This completes the proof. D.6 Corollary 4.1.3 Proof of Corollary 4.1.3. Recall (Hu et al., 2023, Lemma 2.2) that for initial query x0 Sµ lim t xt ξµ = 0, (D.11) where {xt} t=0 is a sequence generated by TDense from x0, i.e. TDense(xt) = xt+1. Moreover, recall that for any query pattern x Sµ 0 TSparse(x) ξµ TDense(x) ξµ . (D.12) By applying squeeze theorem on (D.12) and (D.11), we have lim t ext ξµ = 0, where {ext} t=0 is a sequence generated by TSparse, i.e. TSparse(ext) = ext+1. This completes the proof. D.7 Lemma 4.1 Proof of 4.1. Following (Wu et al., 2024b; Hu et al., 2023), we define the separation of ξµ at a given x from all memory patterns Ξ as e µ := Min ν,ν =µ [ x, ξµ x, ξν ] . Plug above into (D.8), and get TSparse(x) ξµ m(M + k 2) exp n β e µ o . 
By Cauchy-Schwartz inequality, for all µ M, | ξµ, ξµ x, ξµ | ξµ x ξµ ξµ x m, we write e µ in terms of µ: e µ = µ 2 ξµ x m = µ 2m R, where R is radius of the sphere Sµ. Since T is a mapping T : Sµ Sµ, output of the mapping T falls in Sµ with radius R. Therefore, R is lower-bounded by R (M + k 2) exp{ β( µ 2m R)}m T (x) ξµ , β ln (M + k 2)m This completes the proof. Nonparametric Modern Hopfield Models D.8 Proposition 4.1 We built our proof on top of (Hu et al., 2023, Lemma 2.1), which consists 3 steps: (Step 1.) We establish a more refined well-separation condition, ensuring that patterns {ξµ}µ [M] are well-stored in H and can be retrieved by T with an error ϵ at most R. (Step 2.) This condition is then related to the cosine similarity of memory patterns, from which we deduce an inequality governing the probability of successful pattern storage and retrieval. (Step 3.) We pinpoint the conditions for exponential memory capacity and confirm their satisfaction. Proof of Proposition 4.1. Our proof is built on top of (Hu et al., 2023, Corollary 3.1.1) with a different well-separation condition. Let min := Minµ [M] µ and θµν Here we define min and θµν be the angle between two patterns ξµ and ξν. In order for a pattern ξµ to be well-stored, by Lemma 4.1, we need β ln (M + k 2)m On the other hand, we observe min = Min 1 µ ν M m2 (1 cos(θµν)) = m2 [1 cos(θmin)] , where θmin := Min1 µ ν M θµν [0, π]. Then, we have m2 [1 cos(θmin)] 1 β ln (M + k 2)m + 2m R. (D.13) As a result, the probability of successful storage and retrieval, i.e., the minimal separation min that satisfies Lemma 4.1, is given by β ln (M + k 2)m + 2m R = 1 p. Inserting (D.13) into above, we obtain P m2 [1 cos(θmin)] 1 β ln (M + k 2)m + 2m R = 1 p. From (Olver et al., 2010, Equation (4.22.2)), for 0 cos(θmin) 1, cos(θmin) has an upper bound cos(θmin) 1 θ2 min P m2θ2 min 5 1 β ln (M + k 2)m + 2m R = 1 p, which leads to M 2 d 1 θmin β ln (M + k 2)m Nonparametric Modern Hopfield Models For later convenience, here we introduce an extra M 2/d 1 on both sides. Let ωd := 2π d+1/2 2 ) be the area of a d-dimensional unit sphere manifold, with Γ( ) denoting the gamma function. Following (Brauchart et al., 2018, Lemma 3.5), we have M 2 d 1 θmin β ln (M + k 2)m 2 M 2m (d 1) 1 β ln (M + k 2)m where γd is the ratio between the surface areas of the unit spheres in (d 1) and d dimensions: ωd = 1 d π Γ d+1 Recall d, M N+, p [0, 1]. Hence, it holds M = p C d 1 4 for some real values C R. Then, by (D.14), we have 4 2 m (d 1) ( 1 β ln 2 m (d 1) ( 1 β ln 2 1. (D.15) Further, we rewrite (D.15) as and identify ln m( p + k 1) , b := 4m2β 5(d 1). By (Hu et al., 2023, Lemam 3.1), C takes the form C = b W0(exp{a + ln b}), (D.16) where W0( ) is the upper branch of the Lambert W function. Since the domain of the Lambert W function is x > ( 1/e, ) and the fact exp{a + ln b} > 0, the solution for (D.16) exists. When the inequality (D.15) holds, the lower bound on the exponential storage capacity M can be written as: In particular, the above lower bound takes a form similar to (Ramsauer et al., 2020, Theorem 3). This completes the proof. Nonparametric Modern Hopfield Models E Nonparametric Modern Hopfield Family In this section, we derive a family of modern Hopfield models as possible extensions based on the proposed framework (Theorem 3.1).6 E.1 Linear Modern Hopfield Model Proposition E.1 (Linear Modern Hopfield Model). Let Φ(x) = (ϕ1(x), . . . 
, ϕd(x)) with the component ϕ: ϕi(x) := elu(x[i]) + 1 PM µ=1 Φ(x), Φ(ξµ + δξµ) , i [d], (E.1) where elu( ) denotes the exponential linear unit activation function proposed by (Clevert et al., 2015). By Theorem 3.1, fitting TSVR on D following (3.2) gives TLinear(x) = PM µ=1 Φ(x), Φ(ξµ + δξµ) ξµ PM ν=1 Φ(x), Φ(ξν + δξν) . By setting the kernel mapping Φ to linear feature map (E.1), we obtain a linear modern Hopfield model with linear complexity O(n). Compared with dense modern Hopfield model, our proposed linear modern Hopfield model has time and memory complexity O(n) instead of O(n2) since we only need to compute PM µ=1 Φ(ξµ + δξµ)ξµ and PM µ=1 Φ(ξµ + δξµ) once and reuse them for the computation of every query pattern. This model is by design connected to the random attention of linear attention (Katharopoulos et al., 2020). E.2 Multi-Head Modern Hopfield Models To derive the multi-head Hopfield model, we cast TMulti as multiple SVR problems such that the memorization of memory patterns Ξ corresponds to training a regression model TMulti on datasets {Ξs}s [H] with noises {Ξ}. These S training data sets are given as {(ξ1 µ + δξ1 µ, ξ1 µ)}µ [M], , {(ξH µ + δξH µ , ξH µ )}µ [M]. To handle multiple regression problems, we extend the regression model (3.1) into the following. Definition E.1 (Multi-Head Regression Model). Given an input vector x Rd. The output by Rd of the regression model Tmulti is defined as: by = TMulti(x) := s=1 Ws O (WsΦs(x)) Rd, where Ws O Rd d, Ws = [ws 1, , ws d]T Rd DΦ for all s [H], and Φs(x) = (ϕs 1(x), , ϕs DΦ(x)) : Rd RDΦ denote a series of output projection matrices, weighted matrix and kernel mapping, respectively. Adopting this multi-head regression model, we introduce the following multi-head modern Hopfield model. Proposition E.2 (Multi-Head Modern Hopfield Models). Let Φ( ) = (ϕ(0) 0 , ϕ(1) 1 , . . . , ϕ(1) D1, . . . , ϕ(n) 1 , . . . , ϕ(n) Dn, . . .) with, for 1 D Dn, ϕ(n) D := ( βx1)ℓ1 ( βxd)ℓd PM µ=1 Φ(ξµ + δξµ), Φ(x) ℓ1! ℓd! , where ℓ1 + + ℓd = n, and Dn := d+n 1 n . By Theorem 3.1, fitting TSVR on H training data sets {(ξ1 µ + δξ1 µ, ξ1 µ)}µ [M], , {(ξH µ + δξH µ , ξH µ )}µ [M] following (3.2) gives TMulti(x) = s=1 Ws O Ξs Softmax(βΞT δ x) . 6Hu et al. (2024b) provide a theoretical characterization of these possible extensions from the perspective of fine-grained complexity theory. Nonparametric Modern Hopfield Models This model is by design connected to the standard multi-head attention. E.3 PRFs (Positive Random Features) Kernel Modern Hopfield Model Proposition E.3 (Positive Random Features Modern Hopfield Model). Let Φ( ) = (ϕ1, . . . , ϕDΦ) with Φ(x) := Ψ(x) DΦ (ψ1( p1, x ), . . . , ψ1( pm, x ), . . . , ψl( p1, x ), . . . , ψl( pm, x )), where DΦ = l m, Ψ : Rd R , ψ1, . . . , ψm are functions that map from R R, and p1, . . . , pm iid P are vectors from some distribution P d ( d := {p Rd + | Pd i=1 pi = 1} is the (d 1)-dimensional unit simplex.). By Theorem 3.1, fitting TSVR on D following (3.2) gives µ=1 ED[ b D 1 Φ(x), Φ(ξµ + δξµ) ]ξµ, where we adopt the normalization map b D 1 := ξ1, x given by (Choromanski et al., 2021). Comparing with regular modern Hopfield model, PRF Hopfield model7 only has the linear space and time complexity, without any additional treatment such as introducing sparsity or low-rankness. 
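To make the linear-complexity claim concrete, here is a minimal sketch (our own illustration, not code from the paper) of PRF-style retrieval. It follows the Performer-style positive random features of Choromanski et al. (2021) with the Gaussian choice of P noted below; the names `prf`, `S`, `z`, and all hyperparameters are illustrative assumptions.

```python
# A minimal sketch (our illustration) of PRF Hopfield retrieval and why it is
# linear in the number of memories M: the two memory-dependent sums are
# precomputed once and reused for every query.
import numpy as np

rng = np.random.default_rng(2)
d, M, m, beta = 32, 100, 256, 1.0 / np.sqrt(32)
Xi = rng.standard_normal((d, M))               # memory patterns (columns)
W = rng.standard_normal((m, d))                # random projections p_1, ..., p_m (Gaussian P)

def prf(x):
    # positive random features: phi(x) = exp(W x - ||x||^2 / 2) / sqrt(m),
    # so that <phi(q), phi(k)> approximates exp(<q, k>) in expectation
    return np.exp(W @ x - 0.5 * np.dot(x, x)) / np.sqrt(m)

# Precompute once (O(M) work), then reuse for every query:
Phi_mem = np.stack([prf(np.sqrt(beta) * Xi[:, mu]) for mu in range(M)], axis=1)  # (m, M)
S = Xi @ Phi_mem.T        # (d, m)
z = Phi_mem.sum(axis=1)   # (m,)

def T_prf(x):
    # per-query cost is O(m d), independent of M
    fx = prf(np.sqrt(beta) * x)
    return (S @ fx) / (z @ fx)

query = Xi[:, 7] + 0.05 * rng.standard_normal(d)
print("closest memory:", np.argmin(np.linalg.norm(Xi - T_prf(query)[:, None], axis=0)))
```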
The significance of this representational capability is that it enables a precise comparison between the softmax kernel and alternative kernels on large-scale tasks, going beyond what regular modern Hopfield models allow and supporting a systematic exploration of suitable kernels. This model is by design connected to the Performer-type attention (Choromanski et al., 2021). In practice, the default option for P is the standard Gaussian (Choromanski et al., 2021).

7 Along the same line of research, Hoover et al. (2024) also utilize a random feature approximation in a recurrent setting to facilitate compressed memory storage for associative memory models.

F Nonparametric Modern Hopfield Layers for Deep Learning

Building on the link between the nonparametric modern Hopfield models and attention mechanisms, we introduce the Nonparametric Hopfield (NPH) layers for deep learning. Following (Hu et al., 2023; Ramsauer et al., 2020), we say X and Ξ live in the associative (embedded) space, as they are embedded from the raw query patterns R and memory patterns Y, respectively, via X^T = R W_Q := Q and Ξ^T = Y W_K := K, with some W_Q and W_K. Taking the transpose of T in (3.3) (with a given feature map Φ) and multiplying with W_V such that V := K W_V, we have

\begin{align}
Z := Q_{\text{new}} W_V = \mathcal{T}_{\text{SVR}}\big(\beta Q K^{\mathsf{T}}\big)\, V,
\tag{F.1}
\end{align}

which leads to an attention mechanism with various T_SVR as activation functions. Plugging back the raw patterns R and Y, we arrive at the Nonparametric Modern Hopfield (NPH) layer(s),

\begin{align}
\mathrm{NPH}(R, Y) = \mathcal{T}_{\text{SVR}}\big(\beta R W_Q W_K^{\mathsf{T}} Y^{\mathsf{T}}\big)\, Y W_K W_V,
\tag{F.2}
\end{align}

which can be seamlessly integrated into deep learning architectures. Concretely, the NPH layers take matrices R, Y as inputs, together with the weight matrices W_Q, W_K, W_V. Depending on the configuration, they offer several functionalities:

1. Memory Retrieval: In this learning-free setting, the weight matrices W_K, W_Q, and W_V are set to identity matrices. Here, R represents the query input, and Y denotes the stored memory patterns for retrieval.

2. NPH: This configuration takes R and Y as inputs. Intended as a substitute for the attention mechanism, the weight matrices W_K, W_Q, and W_V are rendered learnable. Furthermore, R, Y, and Y serve as the sources for query, key, and value, respectively. Achieving a self-attention-like mechanism requires setting R equal to Y.

3. NPHPooling: With inputs Q and Y, this layer uses Q as a static prototype pattern, while Y contains the patterns over which pooling is desired. Given that the query pattern is replaced by the static prototype pattern Q, the only learnable weight matrices are W_K and W_V.

4. NPHLayer: The NPHLayer takes the query R as its single input. The layer is equipped with learnable weight matrices W_K and W_V, which serve as our stored patterns and their corresponding projections. This design ensures that the key and value are decoupled from the input. In practice, we set W_Q and Y as identity matrices.

Remark F.1. We emphasize that Hopfield memory models and Hopfield networks/layers are conceptually distinct:

Hopfield memory model: A fixed content-addressable memory model with no training; retrieval is based on similarity to stored patterns. This is our focus.

Hopfield networks (e.g., Hopfield Layers (Brandstetter, 2021)): Neural layers integrated into deep learning and trained with backpropagation. They build on the correspondence between Dense Associative Memory and Transformer attention (Ramsauer et al., 2020), later generalized by (Hu et al., 2023; Wu et al., 2024b).
These include architectural innovations such as additional memory-enhanced functionalities studied in prior works (e.g., (Ramsauer et al., 2020) for prototype learning, (Schimunek et al., 2023) for template enriching, and (Wu et al., 2024b) for fast test-time adaptation). Our work extends the prior modern Hopfield memory model by framing it as a nonparametric regression problem. This allows efficient variants (Sections 3 and 4) and connects it to attention mechanisms (e.g., Performer (Choromanski et al., 2021)) under a rigorous, unified theory.

G Experimental Studies

Tasks. We validate the methods proposed in the main text with the following experimental sections, which focus on (1) validating the theoretical results, (2) real-world applications, and (3) computational efficiency.

Appendix G.1: Memory Retrieval Task (Figure 2).
Appendix G.2: Multiple Instance Learning on MNIST (Figure 4).
Appendix G.3: Multiple Instance Learning on Real World Datasets.
Appendix G.4: Time Series Prediction.
Appendix G.5: Computational Efficiency.

Baselines and Considered Models. We consider the following variations of modern Hopfield models in this paper:

Dense Modern Hopfield (Ramsauer et al., 2020)
Sparse Modern Hopfield (Hu et al., 2023)
Sparse-Structured Modern Hopfield: Random Masked Modern Hopfield, Window Modern Hopfield, Top-K Modern Hopfield
Linear Modern Hopfield
Random Feature Modern Hopfield

Experiment Environment. All experiments are conducted on a platform with an NVIDIA GeForce RTX 2080 Ti and an Intel Xeon Silver 4214 @ 2.20GHz. We use PyTorch 1.8.0 for all experiments and Ray Tune for hyperparameter search.

Remark G.1 (One-Step Retrieval). For simplicity and as a proof of concept, all experiments in this work use single-step (feed-forward) retrieval without iterative updates. This avoids confusion with multi-step or recurrent retrieval.

G.1 Memory Retrieval Task (Figure 2)

In the memory retrieval task, we examine two datasets: MNIST (sparse) and CIFAR10 (dense). We employ the sum-of-squares distance between the retrieved image and the ground truth image to measure retrieval error. This experiment encompasses two settings: 1. half-masked image recovery, and 2. noisy image recovery. In the half-masked image recovery scenario, we obscure half of the pixels in each image, vary the memory set size (M) from 10 to 200, and report the average retrieval error (sum-of-squares difference) over 50 runs. In the noisy image recovery scenario, we fix the memory set size at 100 and introduce varying scales of Gaussian noise to the image, with variance ranging from 0.1 to 1.4.

Implementation Details. The memory set itself is chosen randomly from the dataset in each iteration. We adhere to the implementation outlined in (Hu et al., 2023).

Results. We first see clear differences in retrieval success among the different Hopfield models (Figure 2). For Top-K Hopfield models, retrieval success depends directly on the sparsity set by k: lower k means higher sparsity and lower retrieval errors. This aligns with our theory (Theorem 4.1). The Top-K model performs similarly to the Sparse Hopfield model (Hu et al., 2023) on sparse data (MNIST) with smaller k (higher sparsity). This aligns with our theoretical result on capacity (Proposition 4.1). Our sparse-structured model maintains exponential memory capacity when the target memory is within the support set, which the Top-K operation ensures.
In contrast, random masked Hopfield models perform poorly, especially on MNIST (higher sparsity). This is because random masking may remove the target memory. This violates a key assumption (µ M) in Theorem 4.1. Thus, their poor performance is expected. These numerical results (corresponding to Theorem 4.1 and Proposition 4.1) confirm that careful sparse strategies is capable of matching or exceeding dense Hopfield retrieval in sparse settings. Figure 2. Numerical Justifications for Theoretical Results: Memory Capacity and Noise Robustness. (Upper): Retrieval success from half-masked queries (Theorem 4.1 and Proposition 4.1). (Lower): Retrieval success with different levels of Gaussian noise (Remark 4.2). We set β = 0.01 (MNIST) and 0.1 (CIFAR10). Lines show averages over 10 runs. Shaded areas show standard deviations. Top-k Hopfield models have retrieval rates tied to their sparsity: smaller k gives lower success, aligning our theory (Theorem 4.1). Top-k models perform similarly to Sparse Hopfield (Hu et al., 2023) on sparse data (MNIST). This aligns our theory on memory capacity (Proposition 4.1). Random masked models perform poorly, especially on MNIST. This is expected since random masking can remove the target memory, violating our the µ M assumption in Theorem 3.2. These results confirm our theory: careful sparsity can match or surpass dense models in sparse settings. Nonparametric Modern Hopfield Models G.2 Multiple Instance Learning on MNIST (Figure 3 & Figure 4) Quoted from (Hu et al., 2023, Section 4.2): Multiple Instance Learning (MIL) (Ilse et al., 2018; Carbonneau et al., 2018) is a variation of supervised learning where the training set consists of labeled bags, each containing multiple instances. The goal of MIL is to predict the bag labels based on the instances they contain, which makes it particularly useful in scenarios where labeling individual instances is difficult or impractical, but bag-level labels are available. Examples of such scenarios include medical imaging (where a bag could be an image, instances could be patches of the image, and the label could indicate the presence or absence of disease) and document classification (where a bag could be a document, instances could be the words or sentences in the document, and the label could indicate the topic or sentiment of the document). In this experiment, we evaluate Dense Hopfield, Sparse Hopfield, and Top-K Hopfield models on a Multiple Instance Learning (MIL) task using MNIST bags. This task is standard in modern Hopfield model literture (Ramsauer et al., 2020; Hu et al., 2023; Santos et al., 2024b). Setup. We designate one digit from MNIST as a negative signal, and the remaining digits as positive signals. The objective is to predict whether a given bag of instances (digits) contains the negative signal. We vary the memory size (number of instances per bag) to study how task difficulty affects each model s performance and convergence. We vary the memory set size (M) from 5 to 100 and report the mean accuracy over 10 runs. We compare the performance of Dense Hopfield, Sparse Hopfield, Top-K Hopfield (with 20%, 50%, and 80%), Random Feature Hopfield, Random Masked Hopfield and Linear Hopfield models. We omit the Window Hopfield model for reasons mentioned earlier. Implementation Details. We employ an embedding layer to project the flattened MNIST images into the hidden space, followed by a layer of layer normalization. 
Subsequently, we utilize the Hopfield Pooling layer to pool over all the instances in the bag, followed by a second layer normalization layer. Finally, a fully connected layer is used to project the hidden representation of the bag into the label space. All models are trained using the Adam W optimizer for 150 epochs, with a cosine annealing learning rate decay applied to all models. Note that we exclude Window Hopfield in this and the subsequent MIL experiment since Window Hopfield requires both the query and memory pattern numbers to be large to perform the sliding window operation. However, in our model structure, the number of query patterns in the pooling layer is set to 2. The details of the hyperparameters can be found in Table 2. Results. We report the results in Figure 3. Additionally, we also conduct a convergence analysis in Figure 4 with bag size = 50. We plot the loss and accuracy curve on MNIST MIL training and test set. Below, we summarize the key findings, connecting them to our theoretical guarantees (e.g. ϵ-sparse convergence from Theorem 3 and the sparse masking assumption): Robust Performance with Increasing Memory Size (Figure 3): We measure accuracy as the bag size grows (x-axis of Table 2). Top-K Hopfield and Sparse Hopfield keep high accuracy even with large bags. Dense Hopfield struggles: it attends to all instances, so distractors dilute the target signal. As a result, its performance drops when the bag is big. In contrast, Sparse and Top-K Hopfield focus on the most relevant entries. They ignore low-similarity memories and avoid noise. Thus, they hold strong performance as bag size increases. This aligns with our theory (Theorem 4.1 and Corollary 4.1.1), which shows that masking out small similarities enforces a clear margin for correct retrieval. Random masked models do poorly if the mask removes the target memory, violating the µ M assumption. Hence, selective sparsity helps retrieve the correct instance reliably, even with many distractors. Convergence and Training Dynamics (Figure 4): We compare the training loss and accuracy curves of various Hopfield models on the MNIST MIL task. Sparse and Top-K Hopfield models converge quickly and stably: their loss drops sharply and accuracy climbs rapidly, reaching near-perfect performance within a few epochs. Notable, Random Feature Hopfield model also exhibits relatively fast convergence and competitive performance. This behavior aligns with our ϵ-sparse convergence theorem (Theorem 4.1), as restricting updates to the largest memory entries drives the network to retrieve the correct instance in each bag without interference. In contrast, Dense Hopfield model (Ramsauer et al., 2020) improves more gradually and can plateau or oscillate before converging, reflecting weaker theoretical guarantees when many small memory entries compete. Also, Random masked models perform poorly with large masking (e.g. 80%). This is expected since random masking can remove the target memory, violating our the µ M Nonparametric Modern Hopfield Models assumption in Theorem 3.2. Ultimately, all models converge, but the sparse variants reach high accuracy faster and more reliably. Figure 3. MIL Accuracy vs. Memory Size on MNIST. The y-axis represents the accuracy on test set. We compare test accuracy for Dense Hopfield (no sparsity), Sparse Hopfield (Hu et al., 2023), and Top-k Hopfield (exactly k active memory slots) models as the number of instances per bag increases. 
Larger memory size (more instances in a bag) makes the task more difficult due to more distractors. Dense (blue), Linear (orange), and Random Masked Hopfield models lose accuracy significantly with larger bags. In contrast, Sparse Hopfield (pink) and Top-k models maintain high accuracy. Namely, sparse models filter irrelevant instances effectively. This aligns with our theoretical prediction that sparse retrieval improves robustness to large bag sizes (Theorem 4.1). Table 2. Hyperparameter used in the MIL MNIST experiment. parameter values batch size 256 learning rate 1e-3 embedding dimension 256 number of heads 4 head dimension 64 test set size 500 train set size 2000 scaling 0.1 num of pattern 2 epochs 150 Nonparametric Modern Hopfield Models Figure 4. Convergence Analysis of Hopfield Models on MNIST MIL. (Upper): Training loss and accuracy curves for various Hopfield models on the MNIST multiple instance learning task (Theorem 4.1). (Bottom): Validation loss and accuracy curves on the same task (Theorem 4.1). All models are trained for 150 epochs with cosine annealing learning rate decay. Each line is the mean over 10 runs. Sparse Hopfield model by (Hu et al., 2023) attains the nearly the highest validation accuracy. Random Feature Hopfield model also converges quickly with competitive performance. Top-20% Hopfield model converges fast and shows minimal performance drop. Dense Hopfield model converges more slowly, exhibiting occasional plateaus. Again, Random masked models perform poorly with large masking (e.g. 80%). This is expected since random masking can remove the target memory, violating our the µ M assumption in Theorem 3.2. In contrast, sparse updates (e.g. Sparse and Top-K) retrieve relevant entries more reliably, avoid spurious attractors, and ensure stable convergence. Consequently, these models learn faster and reach higher accuracy on sparse data (Theorem 4.1 and Corollary 4.1.1), while Dense Hopfield model faces interference from many memory entries. See Appendix G for more details. Nonparametric Modern Hopfield Models G.3 Multiple Instance Learning on Real World Datasets For this experiment, we follow (Ramsauer et al., 2020; Hu et al., 2023) to conduct MIL experiments on real world datasets. However, we employ a simpler model structure and a smaller hyperparameter search space, rendering our results incomparable. We utilize four datasets: Elephant, Fox, and Tiger for image annotation (Ilse et al., 2018), and UCSB breast cancer classification (Kandemir et al., 2014). We compare Dense Hopfield, Sparse Hopfield, Top K Hopfield at 20%, 50%, and 80%, Random Feature Hopfield, and Linear Hopfield. Random Masked Hopfield is excluded due to its non-deterministic inference, and Window Hopfield is omitted as previously mentioned. The results are presented in Table 5. Dataset Details. The experiment is conducted on four MIL datasets. Elephant, Fox, and Tiger are designed for image annotation and consist of preprocessed and segmented colored images. Each image is characterized by descriptors for color, texture, and shape. These datasets each contain 100 positive images featuring the specified animal and 100 negative images drawn from a set of images depicting other animals. Additionally, we evaluate our model on the UCSB breast cancer classification task. In the UCSB dataset, each instance comprises a patch of a histopathological image depicting either cancerous or normal tissue. The detailed statistics of the datasets are reported in Table 3. Table 3. 
Statistics of MIL benchmark datasets Name Instances Features Bags +bags bags Elephant 1391 230 200 100 100 Fox 1302 230 200 100 100 Tiger 1220 230 200 100 100 UCSB 2002 708 58 26 32 Implementation Details. We follow the experimental setting in (Ramsauer et al., 2020) and employ stratified 10-fold cross-validation to evaluate the performance of each baseline Hopfield model. In each fold, we utilize a stratified sampling process to partition the data into a training set and a validation set, with a split rate of 0.1. Hyperparameters are optimized via random search by maximizing the ROC-AUC score on the validation set. All reported ROC-AUC scores represent the average results over 5 runs with different random seeds. The random search space is delineated in Table 4, with the number of trials set to 50 for each fold. The embedding layer, a pre-Hopfield Pooling linear network, has its layer width determined by the number of hidden units. A dropout operation, also referred to as bag dropout, is applied post the embedding layer and the Hopfield Pooling layer. Notably, to better showcase the performance of Top-k Hopfield, dropout is not applied to the attention weight. All models are trained using the Adam optimizer over 50 epochs. To mitigate overfitting, an early-stopping mechanism is employed, selecting the best checkpoint based on the validation set. Results. For real-world MIL datasets, Sparse Hopfield dominates most tasks (except for UCSB). However, other sparsestructured Hopfield models, especially Top-20% Hopfield, show comparable performance with Sparse Hopfield, indicating a potential trade-off between computational efficiency and model performance. For random feature and linear Hopfield, they did not outperform other baselines. However, their retrieval dynamics behave differently than other sparse-structured Hopfield models. Understanding how to fully utilize their potential and identifying the scenarios where they are most suitable is worth studying in the future Nonparametric Modern Hopfield Models Table 4. Hyperparameter random search space on the respective validation sets of the Elephant, Fox, Tiger and UCSB breast cancer datasets. parameter values batch size {4, 8, 16} learning rates {10 3, 10 4, 10 5} weight decay {0, 10 3, 10 4,} layer width {128, 256, 512} number of heads {4, 8} scaling factors {0.1, 1} dropout {0.0, 0.3 0.5} Table 5. Results for MIL benchmark datasets in terms of AUC score. The results suggest that the proposed model achieves performance comparable to the existing Dense and Sparse Modern Hopfield models (Hu et al., 2023; Ramsauer et al., 2020). Note that, since our aim here is to conduct an atomic setting for fair comparison, we employ a simpler network structure (with smaller hyperparameter search space) compared to the ones used in (Hu et al., 2023; Ramsauer et al., 2020). Consequently, our results do not align with those in (Hu et al., 2023) for Dense and Sparse Modern Hopfield Models. 
Method                                     Tiger   Fox     Elephant   UCSB
Dense Hopfield (Ramsauer et al., 2020)     0.813   0.563   0.877      0.524
Sparse Hopfield (Hu et al., 2023)          0.830   0.573   0.893      0.585
Top-20% Hopfield                           0.824   0.562   0.848      0.586
Top-50% Hopfield                           0.812   0.566   0.852      0.572
Top-80% Hopfield                           0.812   0.560   0.872      0.551
Random Feature Hopfield                    0.802   0.508   0.875      0.566
Linear Hopfield                            0.797   0.571   0.869      0.561

G.4 Time Series Prediction

We further showcase the performance (Table 6) and efficiency (Figure 5) of the proposed nonparametric modern Hopfield models on multivariate time series prediction tasks.

Table 6. Time series prediction using different Hopfield layers (Appendix F) across five datasets. We evaluate each dataset with different prediction horizons (shown in the second column). We report the average Mean Square Error (MSE) and Mean Absolute Error (MAE) metrics over 5 runs. RF denotes the Random Feature Hopfield layer. One notable observation is that the noise level of the dataset significantly influences time series prediction; therefore, employing Hopfield layers with strong noise robustness offers performance improvements. Moreover, based on our results, the proposed efficient Hopfield models not only offer significant computational efficiency but also maintain comparable performance. In particular, the Random Feature and Linear Hopfield layers not only match but even outperform the Dense Hopfield model in several settings. As a side note, Window Hopfield exhibits significant performance degradation in most settings. This degradation arises because it solely focuses on local information: being the only Hopfield model that does not span the entire associative range (i.e., the sequence length), it overlooks a substantial portion of the autoregressive correlation present in time series data.

We also record the time used for one epoch on the ETTh1 dataset with different prediction horizons (and input lengths). The duration per epoch is shown in Figure 5.
Models Dense Sparse Top20% Top50% Top80% Window RF Linear Metric MSE MAE MSE MAE MSE MAE MSE MAE MSE MAE MSE MAE MSE MAE MSE MAE 96 0.137 0.307 0.144 0.314 0.148 0.319 0.153 0.321 0.147 0.318 1.043 0.881 0.147 0.312 0.149 0.320 192 0.153 0.326 0.152 0.325 0.146 0.318 0.161 0.333 0.150 0.320 1.003 0.870 0.158 0.332 0.141 0.313 336 0.148 0.319 0.146 0.319 0.156 0.327 0.122 0.286 0.160 0.333 0.889 0.767 0.151 0.322 0.138 0.307 720 0.169 0.331 0.148 0.314 0.184 0.345 0.161 0.327 0.123 0.287 0.756 0.761 0.141 0.271 0.171 0.333 96 0.148 0.301 0.147 0.301 0.144 0.311 0.151 0.310 0.142 0.31o 0.943 0.854 0.151 0.314 0.155 0.319 192 0.189 0.350 0.187 0.340 0.191 0.347 0.185 0.338 0.188 0.341 1.054 0.893 0.190 0.347 0.192 0.348 336 0.163 0.320 0.165 0.322 0.168 0.331 0.161 0.312 0.169 0.330 0.873 0.334 0.175 0.333 0.176 0.337 720 0.159 0.300 0.161 0.303 0.165 0.313 0.167 0.313 0.169 0.320 0.764 0.731 0.162 0.309 0.165 0.310 96 0.378 0.371 0.373 0.370 0.384 0.382 0.386 0.386 0.383 0.376 0.989 0.854 0.390 0.403 0.365 0.378 192 0.486 0.426 0.535 0.507 0.502 0.427 0.501 0.464 0.519 0.481 1.000 0.843 0.543 0.438 0.549 0.464 336 0.748 0.693 0.760 0.688 0.650 0.549 0.674 0.571 0.638 0.545 1.012 0.849 0.767 0.588 0.672 0.578 720 0.961 0.711 0.993 0.758 1.145 0.843 1.166 0.847 1.211 0.872 1.061 0.865 1.362 0.896 1.052 0.770 96 0.347 0.474 0.347 0.477 0.348 0.474 0.348 0.474 0.356 0.479 0.952 0.819 0.345 0.470 0.355 0.476 192 0.399 0.505 0.386 0.497 0.360 0.482 0.370 0.490 0.361 0.482 0.977 0.828 0.368 0.487 0.354 0.478 336 0.407 0.512 0.387 0.501 0.376 0.489 0.397 0.503 0.403 0.505 0.931 0.808 0.392 0.504 0.407 0.514 720 0.669 0.631 0.632 0.623 0.590 0.604 0.569 0.593 0.618 0.618 0.564 0.595 0.564 0.595 0.747 0.676 96 1.466 0.654 1.489 0.638 1.483 0.645 1.517 0.630 1.477 0.638 1.520 0.625 1.515 0.635 1.489 0.644 192 1.551 0.654 1.550 0.657 1.557 0.649 1.548 0.657 1.551 0.652 1.570 0.637 1.551 0.654 1.551 0.653 336 1.595 0.663 1.595 0.662 1.599 0.663 1.592 0.665 1.604 0.657 1.612 0.646 1.613 0.646 1.614 0.646 720 1.660 0.681 1.671 0.671 1.664 0.674 1.676 0.663 1.682 0.661 1.683 0.661 1.682 0.661 1.681 0.660 G.4.1 IMPLEMENTATION DETAILS For ease of comparison, we employ the simplest possible architecture: an embedding layer to project each signal into a hidden space, followed by a single Hopfield layer. By doing so, we treat every signal as a query pattern. Next, we employ a Hopfield Pooling layer to pool over all the signals into a single hidden vector. Finally, we utilize a fully connected layer to generate the prediction. For all experiments, we maintain the same input and prediction horizon for simplicity. The results can be found in Table 6 and Figure 5. Datasets. We conduct the experiments on four multivariate time series real-world datasets: ETTh1 (Electricity Transformer Temperature-hourly), ETTm1 (Electricity Transformer Temperature-minutely), WTH (Weather), ECL (Electricity Consuming Load), Traffic. Setup. For each dataset, we use their univariate setting for our time series prediction experiment. We choose Dense, Sparse, Random Feature, Linear, Top K and Window Hopfield as baselines. We select 4 different prediction horizons for demonstration, which are 96, 196, 336, 720. We report the average error of 5 runs, evaluated using Mean Square Error (MSE) and Mean Absolute Error (MAE) metrics. For window Hopfield, we set the window size as 8, 12, 14, 16, w.r.t. 96, 196, 336, 720. Nonparametric Modern Hopfield Models Figure 5. 
The processing time comparison among different Hopfield models utilized in the time series prediction task described in Table 6. We evaluate the efficiency of multivariate time series prediction on the ETTh1 dataset. The findings are consistent with the efficiency discussion in Section 3.2, where the Sparse/Dense/Top-K models (all with O(d2) complexity) necessitate more time to complete an epoch. In conjunction with the results in Figure 4, it is evident that the efficient modern Hopfield models (Window, Linear, Random Feature) not only converge in fewer or comparable epochs but also require less time per epoch than the less efficient (Sparse/Dense/Top-K) Hopfield models.

G.5 Computational Efficiency

Here we examine the computational overhead of the different efficient modern Hopfield variants, focusing on wall-clock duration and FLOPs (the number of floating point operations). The results show the following. For Random Masked Hopfield, the computational time scales with the masking probability. Random Feature, Linear, and Window Hopfield exhibit low computational overhead in practice. In addition, these efficient Hopfield models also enjoy significantly fewer floating point operations with only a marginal sacrifice in performance. Under the PyTorch (version 1.11.0) framework, Random Masked Hopfield does not obtain a computational efficiency improvement despite its sparse-structured nature.

Figure 6. (LHS:) Comparison of duration (ms) per batch for different Hopfield models. (RHS:) The scaling behavior of Random Masked Hopfield with different masking ratios; the probability denotes the ratio being masked out. We employ various variants of the Hopfield layers to process a batch of tensors, with a batch size of 4 and a hidden dimension of 16, and vary the input memory size (input length). Note that we separate Random Masked Hopfield from the other baselines since the sparse matrix operations in PyTorch, still in the beta stage, may not be as fully optimized as dense tensor operations.

Figure 7. (LHS:) The FLOPs comparison for Random Masked Hopfield with different probabilities. The lines for Dense and Sparse Hopfield overlap, as do the lines for Random Feature Hopfield and Linear Hopfield. (RHS:) The FLOPs comparison across different Hopfield models. We employ the same settings as in the duration figure. Note that the fvcore package may count sparse matrix operations as regular floating point operations, which is why we might not see a difference.

Implementation Details. In this section, we exclusively evaluate the computational efficiency of the different Hopfield models with respect to varying input lengths using the Hopfield layer. We report the average duration per batch, as shown in Figure 6, and the FLOPs for different input lengths (memory sizes), as depicted in Figure 7. It is notable that different code implementations could potentially affect computational efficiency. We use a randomized batched tensor as input x, where x ∈ R^{memory size × 16}, and the batch size is 4 [8]. For Random Feature Hopfield and Linear Hopfield, we adhere to the Performer implementation [9], while for Window Hopfield, we follow the Longformer implementation [10]. For Random Masked Hopfield, we utilize the torch.sparse.sampled_addmm feature [11], and for the other baselines, we employ standard PyTorch built-in functions. We report the average forward pass time over 10 runs, alongside the FLOPs, with both metrics evaluated on different input lengths. FLOPs are calculated using the fvcore package [12]. Note that most publicly available packages for FLOPs profiling are either under development or in beta, hence some calculation error is anticipated. Additionally, the torch.sparse package is also in beta, implying its performance may not be fully optimized, especially regarding FLOPs calculation and operation overhead.

Discussion. Note that, by nature, both Dense and Sparse Hopfield exhibit the same FLOPs. Moreover, Random Feature Hopfield and Linear Hopfield also share the same FLOPs, as the only distinction between them lies in the kernel function. The FLOPs of Window Hopfield fall in between, demonstrating notable efficiency compared to both Dense and Sparse Hopfield. In terms of duration per batch, Sparse Hopfield appears slightly faster than its dense counterpart, likely due to the additional zeros generated by sparsemax. Window Hopfield, on the other hand, shows a significant reduction in duration compared to Sparse Hopfield. Lastly, the processing times of Random Feature Hopfield and Linear Hopfield converge as the memory size increases.

[8] Approximately (4 × 4 × 16 × memory size) bytes.
[9] https://github.com/lucidrains/performer-pytorch
[10] https://github.com/allenai/longformer
[11] https://pytorch.org/docs/stable/generated/torch.sparse.sampled_addmm.html#torch.sparse.sampled_addmm
[12] https://github.com/facebookresearch/fvcore
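For readers who want to reproduce the qualitative trend in Figure 6, the following rough sketch (our own code, not the paper's benchmark script) times one-step dense (softmax) retrieval against a linear-feature retrieval in PyTorch with the same batch size (4) and hidden dimension (16) as above. The function names and the choice of elu(·)+1 features (as in Proposition E.1) are illustrative assumptions.

```python
# A rough, self-contained timing sketch (ours): one-step dense retrieval vs.
# linear-feature retrieval, batch size 4, hidden dimension 16, varying memory size.
import time
import torch

torch.manual_seed(0)
batch, d = 4, 16

def dense_retrieval(query, memory):
    # query, memory: (batch, n, d); O(n^2) attention-style retrieval
    attn = torch.softmax(query @ memory.transpose(-1, -2) / d ** 0.5, dim=-1)
    return attn @ memory

def linear_retrieval(query, memory):
    # elu(.) + 1 feature map as in Proposition E.1; O(n) once the two sums are cached
    q = torch.nn.functional.elu(query) + 1
    k = torch.nn.functional.elu(memory) + 1
    kv = k.transpose(-1, -2) @ memory                                    # (batch, d, d)
    normalizer = q @ k.sum(dim=1, keepdim=True).transpose(-1, -2)       # (batch, n, 1)
    return (q @ kv) / normalizer

for n in (256, 1024, 2048):
    x = torch.randn(batch, n, d)
    for name, fn in (("dense", dense_retrieval), ("linear", linear_retrieval)):
        t0 = time.perf_counter()
        for _ in range(10):
            fn(x, x)
        print(f"n={n:5d}  {name:6s}  {(time.perf_counter() - t0) / 10 * 1e3:.2f} ms/pass")
```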