# Sparse and Structured Hopfield Networks

Saul Santos 1 2, Vlad Niculae 3, Daniel McNamee 4, André F. T. Martins 1 2 5

*Equal contribution. 1 Instituto Superior Técnico, Universidade de Lisboa, Lisbon, Portugal; 2 Instituto de Telecomunicações, Lisbon, Portugal; 3 Language Technology Lab, University of Amsterdam, The Netherlands; 4 Champalimaud Research, Lisbon, Portugal; 5 Unbabel, Lisbon, Portugal. Correspondence to: Saul Santos. Proceedings of the 41st International Conference on Machine Learning, Vienna, Austria. PMLR 235, 2024. Copyright 2024 by the author(s).

Modern Hopfield networks have enjoyed recent interest due to their connection to attention in transformers. Our paper provides a unified framework for sparse Hopfield networks by establishing a link with Fenchel-Young losses. The result is a new family of Hopfield-Fenchel-Young energies whose update rules are end-to-end differentiable sparse transformations. We reveal a connection between loss margins, sparsity, and exact memory retrieval. We further extend this framework to structured Hopfield networks via the Sparse MAP transformation, which can retrieve pattern associations instead of a single pattern. Experiments on multiple instance learning and text rationalization demonstrate the usefulness of our approach.

1. Introduction

Hopfield networks are a kind of biologically-plausible neural network with associative memory capabilities (Hopfield, 1982). Their attractor dynamics make them suitable for modeling the retrieval of episodic memories in humans and animals (Tyulmankov et al., 2021; Whittington et al., 2021). The limited storage capacity of classical Hopfield networks was recently overcome through new energy functions and continuous state patterns (Krotov & Hopfield, 2016; Demircigil et al., 2017; Ramsauer et al., 2021), leading to exponential storage capacities and sparking renewed interest in modern Hopfield networks. In particular, Ramsauer et al. (2021) revealed striking connections to transformer attention, but their model is incapable of exact retrieval and may require a low temperature to avoid metastable states (states which mix multiple input patterns). In this paper, we bridge this gap by making a connection between Hopfield energies and Fenchel-Young losses (Blondel et al., 2020). We extend the energy functions of Ramsauer et al. (2021) and Hu et al. (2023) to a wider family induced by generalized entropies. The minimization of these energy functions leads to sparse update rules which include as particular cases α-entmax (Peters et al., 2019), α-normmax (Blondel et al., 2020), and Sparse MAP (Niculae et al., 2018), where a structural constraint ensures the retrieval of k patterns. Unlike the Hopfield layers of Ramsauer et al. (2021), our update rules pave the way for exact retrieval, leading to the emergence of sparse associations among patterns while ensuring end-to-end differentiability. This endeavour aligns with the strong neurobiological motivation to seek new Hopfield energies capable of sparse and structured retrieval. Indeed, sparse neural activity patterns forming structured representations underpin core principles of cortical computation (Simoncelli & Olshausen, 2001; Tse et al., 2007; Palm, 2013).
With respect to memory formation circuits, the sparse firing of neurons in the dentate gyrus (DG), a distinguished region within the hippocampal network, underpins its proposed role in pattern separation during memory storage (Yassa & Stark, 2011; Severa et al., 2017). Evidence suggests that such sparsified activity aids in minimizing interference; however, an integrative theoretical account linking sparse coding and attractor network functionality to clarify these empirical observations is lacking (Leutgeb et al., 2007; Neunuebel & Knierim, 2014).

Our main contributions are:

- We introduce Hopfield-Fenchel-Young energy functions as a generalization of modern Hopfield networks (§3.1).
- We leverage properties of Fenchel-Young losses which relate sparsity and margins to obtain new theoretical results for exact memory retrieval, proving exponential storage capacity in a stricter sense than prior work (§3.2).
- We propose new structured Hopfield networks via the Sparse MAP transformation, which return pattern associations instead of single patterns. We show that Sparse MAP has a structured margin and provide sufficient conditions for exact retrieval of pattern associations (§4).
- Experiments on synthetic and real-world tasks (multiple instance learning and text rationalization) showcase the usefulness of our proposed models using various kinds of sparse and structured transformations (§5).

An overview of the utility of our methods can be seen in Fig. 1. (Our code is available at https://github.com/deep-spin/SSHN.)

Figure 1: Overview of the Hopfield networks proposed in this paper: sparse transformations (entmax and normmax) aim to retrieve the closest pattern to the query, and they have exact retrieval guarantees. Structured variants find pattern associations. The k-subsets transformation returns a mixture of the top-k patterns, and sequential k-subsets favors contiguous retrieval.

Notation. We denote by $\triangle_N$ the $(N-1)$-dimensional probability simplex, $\triangle_N := \{p \in \mathbb{R}^N : p \ge 0,\ \mathbf{1}^\top p = 1\}$. The convex hull of a set $\mathcal{Y} \subseteq \mathbb{R}^M$ is $\mathrm{conv}(\mathcal{Y}) := \{\sum_{i=1}^N p_i y_i : p \in \triangle_N,\ y_1, \ldots, y_N \in \mathcal{Y},\ N \in \mathbb{N}\}$. We have $\triangle_N = \mathrm{conv}(\{e_1, \ldots, e_N\})$, where $e_i \in \mathbb{R}^N$ is the $i$th basis (one-hot) vector. Given a convex function $\Omega: \mathbb{R}^N \to \overline{\mathbb{R}}$, where $\overline{\mathbb{R}} = \mathbb{R} \cup \{+\infty\}$, we denote its domain by $\mathrm{dom}(\Omega) := \{y \in \mathbb{R}^N : \Omega(y) < +\infty\}$ and its Fenchel conjugate by $\Omega^*(\theta) = \sup_{y \in \mathbb{R}^N} y^\top\theta - \Omega(y)$. We denote by $I_{\mathcal{C}}$ the indicator function of a convex set $\mathcal{C}$, defined as $I_{\mathcal{C}}(y) = 0$ if $y \in \mathcal{C}$, and $I_{\mathcal{C}}(y) = +\infty$ otherwise.

2. Background

2.1. Hopfield Networks

Let $X \in \mathbb{R}^{N \times D}$ be a matrix whose rows hold a set of examples $x_1, \ldots, x_N \in \mathbb{R}^D$ ("memory patterns") and let $q^{(0)} \in \mathbb{R}^D$ be a query vector (or "state pattern"). Hopfield networks iteratively update $q^{(t)} \mapsto q^{(t+1)}$ for $t \in \{0, 1, \ldots\}$ according to a certain rule, eventually converging to a fixed-point attractor state $q^*$ which either corresponds to one of the memorized examples, or to a mixture thereof. This update rule corresponds to the minimization of an energy function, which for classic Hopfield networks (Hopfield, 1982) takes the form $E(q) = -\frac{1}{2} q^\top W q$, with $q \in \{\pm 1\}^D$ and $W = X^\top X \in \mathbb{R}^{D \times D}$, leading to the update rule $q^{(t+1)} = \mathrm{sign}(W q^{(t)})$. A limitation of this classical network is that it has only $N = O(D)$ memory storage capacity, above which patterns start to interfere (Amit et al., 1985; McEliece et al., 1987).
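For illustration, the following minimal numpy sketch implements the classical binary Hopfield network just described, with Hebbian weights $W = X^\top X$ and the update $q^{(t+1)} = \mathrm{sign}(Wq^{(t)})$ iterated to a fixed point. It is a self-contained toy, not the implementation released with this paper; the tie-breaking rule and iteration cap are illustrative choices.

```python
import numpy as np

def classical_hopfield_retrieve(X, q, max_iters=100):
    """Iterate q <- sign(W q), with Hebbian weights W = X^T X,
    until a fixed point (attractor) is reached.

    X: (N, D) array of +/-1 memory patterns (one per row).
    q: (D,) initial +/-1 state pattern (the query).
    """
    W = X.T @ X                       # Hebbian weight matrix, D x D
    for _ in range(max_iters):
        q_new = np.sign(W @ q)
        q_new[q_new == 0] = 1.0       # break ties deterministically (a convention)
        if np.array_equal(q_new, q):  # converged to an attractor
            break
        q = q_new
    return q

# Toy usage: store 3 random patterns and retrieve one from a corrupted copy.
rng = np.random.default_rng(0)
X = rng.choice([-1.0, 1.0], size=(3, 64))
q0 = X[0].copy()
q0[:10] *= -1                         # flip a few bits of the stored pattern
print(np.array_equal(classical_hopfield_retrieve(X, q0), X[0]))  # typically True
```

With only a handful of stored patterns relative to D, the corrupted query is typically cleaned up exactly; with too many stored patterns the retrieved state degrades, which is the O(D) capacity limitation mentioned above.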
Recent work sidesteps this limitation through alternative energy functions (Krotov & Hopfield, 2016; Demircigil et al., 2017), paving the way for a class of models known as modern Hopfield networks, with superlinear (often exponential) memory capacity. In Ramsauer et al. (2021), $q \in \mathbb{R}^D$ is continuous and the following energy is used:

$$E(q) = -\beta^{-1}\log\sum_{i=1}^N \exp(\beta x_i^\top q) + \tfrac{1}{2}\|q\|^2 + \text{const.} \quad (1)$$

Ramsauer et al. (2021) revealed an interesting relation between the updates in this modern Hopfield network and the attention layers in transformers. Namely, the minimization of the energy (1) using the concave-convex procedure (CCCP; Yuille & Rangarajan 2003) leads to the update rule:

$$q^{(t+1)} = X^\top \mathrm{softmax}(\beta X q^{(t)}). \quad (2)$$

Each update matches the computation performed in the attention layer of a transformer with a single attention head and with identity projection matrices. This triggered interest in developing variants of Hopfield layers which can be used as drop-in replacements for multi-head attention layers (Hoover et al., 2023). While Ramsauer et al. (2021) have derived useful theoretical properties of these networks (including their exponential storage capacity under a relaxed notion of retrieval), the use of softmax in (2) prevents exact convergence and may lead to undesirable metastable states. Recently (and closely related to our work), Hu et al. (2023) introduced sparse Hopfield networks and proved favorable retrieval properties, but still in an approximate sense, where attractors are close to but distinct from the stored patterns. We overcome these drawbacks in our paper, where we provide a unified treatment of sparse and structured Hopfield networks along with theoretical analysis showing that exact retrieval is possible without sacrificing exponential storage capacity.

2.2. Sparse Transformations and Fenchel-Young Losses

Our construction and results flow from the properties of Fenchel-Young losses, which we next review. Given a strictly convex function $\Omega: \mathbb{R}^N \to \overline{\mathbb{R}}$ with domain $\mathrm{dom}(\Omega)$, the $\Omega$-regularized argmax transformation (Niculae & Blondel, 2017), $\hat{y}_\Omega: \mathbb{R}^N \to \mathrm{dom}(\Omega)$, is:

$$\hat{y}_\Omega(\theta) := \nabla\Omega^*(\theta) = \operatorname*{argmax}_{y \in \mathrm{dom}(\Omega)} \theta^\top y - \Omega(y). \quad (3)$$

Let us assume for now that $\mathrm{dom}(\Omega) = \triangle_N$ (the probability simplex). One instance of (3) is the softmax transformation, obtained when the regularizer is the Shannon negentropy, $\Omega(y) = \sum_{i=1}^N y_i \log y_i + I_{\triangle_N}(y)$. Another instance is the sparsemax transformation, obtained when $\Omega(y) = \frac{1}{2}\|y\|^2 + I_{\triangle_N}(y)$ (Martins & Astudillo, 2016). The sparsemax transformation is the Euclidean projection onto the probability simplex. Softmax and sparsemax are both particular cases of α-entmax transformations (Peters et al., 2019), parametrized by a scalar $\alpha \ge 0$ (called the entropic index), which corresponds to the following choice of regularizer, called the Tsallis α-negentropy (Tsallis, 1988):

$$\Omega^T_\alpha(y) = \frac{-1 + \|y\|_\alpha^\alpha}{\alpha(\alpha-1)} + I_{\triangle_N}(y). \quad (4)$$

When $\alpha \to 1$, the Tsallis α-negentropy $\Omega^T_\alpha$ becomes Shannon's negentropy, leading to the softmax transformation, and when $\alpha = 2$, it becomes the squared $\ell_2$-norm (up to a constant) and we recover the sparsemax. Another example is the norm α-negentropy (Blondel et al., 2020, §4.3),

$$\Omega^N_\alpha(y) = -1 + \|y\|_\alpha + I_{\triangle_N}(y), \quad (5)$$

which, when $\alpha \to +\infty$, is called the Berger-Parker dominance index (May, 1975). We call the resulting transformation α-normmax.
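To make these transformations concrete, here is a small numpy sketch (illustrative only, not tied to our released implementation) of sparsemax, i.e. 2-entmax, computed with the standard sort-and-threshold procedure, compared against softmax on the score vector used in Fig. 2. As β grows, the sparsemax output hits the simplex boundary and becomes exactly sparse, while softmax stays dense.

```python
import numpy as np

def softmax(theta):
    e = np.exp(theta - np.max(theta))
    return e / e.sum()

def sparsemax(theta):
    """Euclidean projection of theta onto the probability simplex
    (the 2-entmax transformation); the output can have exact zeros."""
    z = np.sort(theta)[::-1]                 # scores sorted in decreasing order
    cssv = np.cumsum(z) - 1.0
    ks = np.arange(1, len(theta) + 1)
    support = z - cssv / ks > 0              # coordinates kept in the support
    k = ks[support][-1]                      # support size
    tau = cssv[k - 1] / k                    # threshold
    return np.maximum(theta - tau, 0.0)

theta = np.array([1.0716, 1.1221, 0.3288, 0.3368, 0.0425])  # scores from Fig. 2
for beta in (0.5, 1.0, 5.0):
    print(beta, softmax(beta * theta).round(3), sparsemax(beta * theta).round(3))
```

Other values of α have dedicated entmax algorithms (Peters et al., 2019), and α-normmax is computed with the bisection procedure of App. B (a small sketch accompanies Alg. 1 there).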
Figure 2: Sparse and structured transformations used in this paper and their regularization path. In each plot, we show $\hat{y}_\Omega(\beta\theta) = \hat{y}_{\beta^{-1}\Omega}(\theta)$ as a function of the temperature $\beta^{-1}$, where $\theta = [1.0716, 1.1221, 0.3288, 0.3368, 0.0425]^\top$. Additional examples can be found in App. E.1.

While the Tsallis and norm negentropies have similar expressions and the resulting transformations both tend to be sparse, they have important differences, as suggested in Fig. 2: normmax favors distributions closer to uniform over the selected support. The Ω-Fenchel-Young loss (Blondel et al., 2020) is

$$L_\Omega(\theta; y) := \Omega(y) + \Omega^*(\theta) - \theta^\top y. \quad (6)$$

When $\Omega$ is Shannon's negentropy, we have $\Omega^*(\theta) = \log\sum_{i=1}^N \exp(\theta_i)$, and $L_\Omega$ is the cross-entropy loss, up to a constant. Intuitively, Fenchel-Young losses quantify how compatible a score vector $\theta \in \mathbb{R}^N$ (e.g., logits) is with a desired target $y \in \mathrm{dom}(\Omega)$ (e.g., a probability vector). Fenchel-Young losses have relevant properties (Blondel et al., 2020, Prop. 2): (i) they are non-negative, $L_\Omega(\theta, y) \ge 0$, with equality iff $y = \hat{y}_\Omega(\theta)$; (ii) they are convex in $\theta$ and their gradient is $\nabla_\theta L_\Omega(\theta, y) = -y + \hat{y}_\Omega(\theta)$. But, most importantly, they have well-studied margin properties, which, as we shall see, will play a pivotal role in this work.

Definition 1 (Margin). A loss function $L(\theta; y)$ has a margin if there exists a finite $m \ge 0$ such that

$$\forall i \in [N], \quad L(\theta, e_i) = 0 \iff \theta_i \ge \max_{j \ne i} \theta_j + m. \quad (7)$$

The smallest such $m$ is called the margin of $L$.

If $L_\Omega$ is a Fenchel-Young loss, (7) is equivalent to $\hat{y}_\Omega(\theta) = e_i$. A famous example of a loss with a margin of 1 is the hinge loss of support vector machines. On the other hand, the cross-entropy loss does not have a margin. Blondel et al. (2020, Prop. 7) have shown that Tsallis negentropies $\Omega^T_\alpha$ with $\alpha > 1$ have a margin of $m = (\alpha-1)^{-1}$, and that norm-entropies $\Omega^N_\alpha$ with $\alpha > 1$ have a margin $m = 1$, independently of α. We will use these facts in the sequel.

3. Sparse Hopfield-Fenchel-Young Energies

We now use Fenchel-Young losses (6) to define a new class of energy functions for modern Hopfield networks.

3.1. Definition and update rule

We start by assuming that the regularizer $\Omega$ has domain $\mathrm{dom}(\Omega) = \triangle_N$ and that it is a generalized negentropy, i.e., null when $y$ is a one-hot vector, strictly convex, and permutation-invariant (see Appendix A for details). These conditions imply that $\Omega \le 0$ and that $\Omega(y)$ is minimized when $y = \mathbf{1}/N$ is the uniform distribution (Blondel et al., 2020, Prop. 4). Tsallis negentropies (4) for $\alpha \ge 1$ and norm negentropies (5) for $\alpha > 1$ both satisfy these properties. We define the Hopfield-Fenchel-Young (HFY) energy as

$$E(q) = \underbrace{-\beta^{-1} L_\Omega(\beta X q;\, \mathbf{1}/N)}_{E_{\mathrm{concave}}(q)} + \underbrace{\tfrac{1}{2}\|q - \mu_X\|^2 + \tfrac{1}{2}\big(M^2 - \|\mu_X\|^2\big)}_{E_{\mathrm{convex}}(q)}, \quad (8)$$

where $\mu_X := X^\top \mathbf{1}/N \in \mathbb{R}^D$ is the empirical mean of the patterns, and $M := \max_i \|x_i\|$. This energy extends that of (1), which is recovered when $\Omega$ is Shannon's negentropy. The concavity of $E_{\mathrm{concave}}$ holds from the convexity of Fenchel-Young losses in their first argument and from the fact that the composition of a convex function with an affine map is convex. $E_{\mathrm{convex}}$ is convex because it is quadratic.² These two terms compete when minimizing the energy (8): Minimizing $E_{\mathrm{concave}}$ is equivalent to maximizing $L_\Omega(\beta X q; \mathbf{1}/N)$, which pushes $q$ to be as far as possible from a uniform average and closer to a single pattern. Minimizing $E_{\mathrm{convex}}$ serves as a proximity regularization, encouraging the state pattern $q$ to stay close to $\mu_X$. The next result, proved in App. D.1, establishes bounds and derives the Hopfield update rule for energy (8), generalizing Ramsauer et al. (2021, Lemma A.1, Theorem A.1).

Proposition 1 (Update rule of HFY energies).
Let the query q be in the convex hull of the rows of X, i.e., q = X y for some y N. Then, the energy (8) satisfies 0 E(q) min 2M 2, β 1Ω(1/N) + 1 2M 2 . Furthermore, minimizing (8) with the CCCP algorithm 2Up to constants, for this choice of Ωour convex-concave decomposition is the same as Ramsauer et al. (2021). (Yuille & Rangarajan, 2003) leads to the updates: q(t+1) = X ˆyΩ(βXq(t)). (9) In particular, when Ω= ΩT α (the Tsallis α-negentropy (4)), the update (9) corresponds to the adaptively sparse transformer of Correia et al. (2019). The α-entmax transformation can be computed in linear time for α {1, 1.5, 2} and for other values of α an efficient bisection algorithm was proposed by Peters et al. (2019). The case α = 2 (sparsemax) corresponds to the sparse modern Hopfield network recently proposed by Hu et al. (2023). When Ω= ΩN α (the norm α-negentropy (5)), we obtain the α-normmax transformation. This transformation is harder to compute since ΩN α is not separable, but we derive in App. B an efficient bisection algorithm which works for any α > 1. 3.2. Margins, sparsity, and exact retrieval Prior work on modern Hopfield networks (Ramsauer et al., 2021, Def. 1) defines pattern storage and retrieval in an approximate sense: they assume a small neighbourhood around each pattern xi containing an attractor x i , such that if the initial query q(0) is close enough, the Hopfield updates will converge to x i , leading to a retrieval error of x i xi . For this error to be small, a large β may be necessary. We consider here a stronger definition of exact retrieval, where the attractors coincide with the actual patterns (rather than being nearby). Our main result is that zero retrieval error is possible in HFY networks as long as the corresponding Fenchel-Young loss has a margin (Def. 1). Given that ˆyΩbeing a sparse transformation is a sufficient condition for LΩhaving a margin (Blondel et al., 2020, Prop. 6), this is a general statement about sparse transformations.3 Definition 2 (Exact retrieval). A pattern xi is exactly retrieved for query q(0) iff there is a finite number of steps T such that iterating (9) leads to q(T ) = xi T T. The following result gives sufficient conditions for exact retrieval with T = 1 given that patterns are well separated and that the query is sufficiently close to the retrieved pattern. It establishes the exact autoassociative property of HFY networks: if all patterns are slightly perturbed, the Hopfield dynamics are able to recover the original patterns exactly. Following Ramsauer et al. (2021, Def. 2), we define the separation of pattern xi from data as i = x i xi maxj =i x i xj. Proposition 2 (Exact retrieval in a single iteration). Assume LΩhas margin m, and let xi be a pattern outside the convex hull of the other patterns. Then, xi is a station- 3At first sight, this might seem a surprising result, given that both queries and patterns are continuous. The reason why exact convergence is possible hinges crucially on sparsity. Sparse and Structured Hopfield Networks ary point of the energy (8) iff i mβ 1. In addition, if the initial query q(0) satisfies q(0) (xi xj) mβ 1 for all j = i, then the update rule (9) converges to xi exactly in one iteration. Moreover, if the patterns are normalized, xi = M for all i, and well-separated with i mβ 1 + 2Mϵ, then any q(0) ϵ-close to xi ( q(0) xi ϵ) will converge to xi in one iteration. The proof is in Appendix D.2. For the Tsallis negentropy case Ω= ΩT α with α > 1 (the sparse case), we have m = (α 1) 1 (cf. Def. 
1), leading to the condition i 1 (α 1)β . This result is stronger than that of Ramsauer et al. (2021) for their energy (which is ours for α = 1), according to which memory patterns are only ϵ-close to stationary points, where a small ϵ = O(exp( β)) requires a low temperature (large β). It is also stronger than the retrieval error bound recently derived by Hu et al. (2023, Theorem 2.2) for the case α = 2, which has an additive term involving M and therefore does not provide conditions for exact retrieval. For the normmax negentropy case Ω= ΩN α with α > 1, we have m = 1, so the condition above becomes i 1 Given that exact retrieval is a stricter definition, one may wonder whether requiring it sacrifices storage capacity. Reassuringly, the next result, inspired but stronger than Ramsauer et al. (2021, Theorem A.3), shows that HFY networks with exact retrieval also have exponential storage capacity. Proposition 3 (Storage capacity with exact retrieval). Assume patterns are placed equidistantly on the sphere of radius M. The HFY network can store and exactly retrieve N = O((2/ 3)D) patterns in one iteration under a ϵ-perturbation as long as M 2 > 2mβ 1 and 4 m 2βM . (10) Assume patterns are randomly placed on the sphere with uniform distribution. Then, with probability 1 p, the HFY network can store and exactly retrieve N = O( pζ D 1 2 ) patterns in one iteration under a ϵ-perturbation if m 2βM . (11) The proof is in Appendix D.3. 4. Structured Hopfield Networks In 3, we considered the case where y dom(Ω) = N, the scenario studied by Ramsauer et al. (2021) and Hu et al. (2023). We now take one step further and consider the more general structured case, where y is a vector of marginals associated to some given structured set. This structure can reflect pattern associations that we might want to induce when querying the Hopfield network with q(0). Possible structures include k-subsets of memory patterns, potentially leveraging sequential memory structure, tree structures, matchings, etc. In these cases, the set of pattern associations we can form is combinatorial, hence it can be considerably larger than the number N of memory patterns. 4.1. Unary scores and structured constraints We consider first a simpler scenario where there is a predefined set of structures Y {0, 1}N and N unary scores, one for each memory pattern. We show in 4.2 how this framework can be generalized for higher-order interactions modeling soft interactions among patterns. Let Y {0, 1}N be a set of binary vectors indicating the underlying set of structures, and let conv(Y) [0, 1]N denote its convex hull, called the marginal polytope associated to the structured set Y (Wainwright et al., 2008). Example 1 (k-subsets). We may be interested in retrieving subsets of k patterns, e.g., to take into account a k-ary relation among patterns or to perform top-k retrieval. In this case, we define Y = {y {0, 1}N : 1 y = k}, where k [N]. If k = 1, we get Y = {e1, ..., e N} and conv(Y) = N, recovering the scenario studied in 3. For larger k, |Y| = N k N. With a simple rescaling, the resulting marginal polytope is equivalent to the capped probability simplex described by Blondel et al. (2020, 7.3). Given unary scores θ RN, the structure with the largest score is y = argmaxy Y θ y = argmaxy conv(Y) θ y, where the last equality comes from the fact that conv(Y) is a polytope, therefore the maximum is attained at a vertex. 
As in (3), we consider a regularized prediction version of this problem via a convex regularizer Ω: conv(Y) R: ˆyΩ(θ) := argmax y conv(Y) θ y Ω(y). (12) By choosing Ω(y) = 1 2 y 2 + Iconv(Y)(y), we obtain Sparse MAP, a structured version of sparsemax which can be computed efficiently via an active set algorithm as long as an argmax oracle is available (Niculae et al., 2018). 4.2. General case: factor graph, high order interactions In general, we may want to consider soft interactions among patterns, for example due to temporal dependencies, hierarchical structure, etc. These interactions can be expressed as a bipartite factor graph (V, F), where V = {1, ..., N} are variable nodes (associated with the patterns) and F 2V are factor nodes representing the interactions (Kschischang et al., 2001). A structure can be represented as a bit vector y = [y V ; y F ], where y V and y F indicate configurations of variable and factor nodes, respectively. The Sparse MAP transformation has the same form (12) with Sparse and Structured Hopfield Networks 2 y V 2 + Iconv(Y)(y) (note that only the unary variables are regularized). Full details are given in App. C. Example 2 (sequential k-subsets). Consider the k-subset problem of Example 1 but now with a sequential structure. This can be represented as a pairwise factor graph (V, F) where V = {1, ..., N} and F = {(i, i + 1)}N 1 i=1 . The budget constraint forces exactly k of the N variable nodes to have the value 1. The set Y contains all bit vectors satisfying these constraints as well as consistency among the variable and factor assignments. To promote consecutive memory items to be both retrieved or neither retrieved, one can define attractive Ising higher-order (pairwise) scores θ(i,i+1), in addition to the unary scores. The MAP inference problem can be solved with dynamic programming in runtime O(Nk), and the Sparse MAP transformation can be computed with the active set algorithm (Niculae et al., 2018) by iteratively calling this MAP oracle. 4.3. Structured Fenchel-Young losses and margins The notion of margin in Def. 1 can be extended to the structured case (Blondel et al., 2020, Def. 5): Definition 3 (Structured margin). A loss L(θ; y) has a structured margin if 0 m < such that y Y: θ y max y Y 2 y y 2 L(θ; y) = 0. The smallest such m is called the margin of L. Note that this generalizes the notion of margin in Def. 1, recovered when Y = {e1, ..., e N}. Note also that, since we are assuming Y {0, 1}L, the term y y 2 is a Hamming distance, which counts how many bits need to be flipped to transform y into y. A well-known example of a loss with a structured separation margin is the structured hinge loss (Taskar et al., 2003; Tsochantaridis et al., 2005). We show below that the Sparse MAP loss has a structured margin (our result, proved in App. D.4, extends that of Blondel et al. (2020), who have shown this only for structures without high order interactions): Proposition 4. Let Y {0, 1}L be contained in a sphere, i.e., for some r > 0, y = r for all y Y. Then: 1. If there are no high order interactions, then the Sparse MAP loss has a structured margin m = 1. 2. If there are high order interactions and, for some r V and r F with r2 V + r2 F = r2, we have y V = r V and y F = r F for any y = [y V ; y F ] Y, then the Sparse MAP loss has a structured margin m 1. The assumptions above are automatically satisfied with the factor graph construction in 4.2, with r2 V = |V |, r2 F = |F|, and r2 = |V | + |F|. 
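As a concrete special case of (12), for the plain k-subsets polytope of Example 1 (unary scores only, no factor scores) the Sparse MAP transformation reduces to a Euclidean projection onto the scaled capped simplex, which the following numpy sketch computes by bisection on a threshold. This is an illustrative special case of ours, not the general active set algorithm of Niculae et al. (2018).

```python
import numpy as np

def sparsemap_k_subsets(theta, k, iters=60):
    """Sparse MAP for the k-subsets polytope {y in [0,1]^N : sum(y) = k}:
    solves argmax_y theta @ y - 0.5 * ||y||^2, i.e. the Euclidean projection
    of theta onto the polytope, via bisection on the threshold tau such that
    y_i = clip(theta_i - tau, 0, 1) sums to k."""
    theta = np.asarray(theta, dtype=float)
    lo, hi = theta.min() - 1.0, theta.max()      # sum(y) is decreasing in tau
    for _ in range(iters):
        tau = 0.5 * (lo + hi)
        total = np.clip(theta - tau, 0.0, 1.0).sum()
        if total > k:
            lo = tau
        else:
            hi = tau
    return np.clip(theta - 0.5 * (lo + hi), 0.0, 1.0)

theta = np.array([1.0716, 1.1221, 0.3288, 0.3368, 0.0425])
for beta in (1.0, 5.0):
    y = sparsemap_k_subsets(beta * theta, k=2)
    print(beta, y.round(3))   # as beta grows, y approaches a 2-hot indicator
```

In the general structured case (e.g., sequential k-subsets with pairwise factors), Sparse MAP is instead computed with the active set algorithm, which only requires repeated calls to a MAP oracle such as the O(Nk) dynamic program of Example 2.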
For the k-subsets example, we have r2 = k, and for the sequential k-subsets example, we have r2 V = N, r2 F = N 1, and r2 = 2N 1. 4.4. Guarantees for retrieval of pattern associations We now consider a structured HFY network using Sparse MAP. We obtain the following updates: q(t+1) = X Sparse MAP(βXq(t)). (13) In the structured case, we aim to retrieve not individual patterns but pattern associations of the form X y, where y Y. Naturally, when Y = {e1, ..., e N}, we recover the usual patterns, since xi = X ei. We define the separation of pattern association yi Y from data as i = y i XX yi maxj =i y i XX yj. The next proposition, proved in App. D.5, states conditions for exact convergence in a single iteration, generalizing Prop. 2. Proposition 5 (Exact structured retrieval). Let Ω(y) be the Sparse MAP regularizer and assume the conditions of Prop. 4 hold. Let yi Y be such that i D2 i 2β , where Di = max yi yj 2r. Then, X yi is a stationary point of the Hopfield energy. In addition, if q(0) satisfies q(0) X (yi yj) D2 i 2β for all j = i, then the update rule (13) converges to the pattern association X yi in one iteration. Moreover, if i D2 i 2β + ϵ min{σmax(X)Di, MD2 i }, where σmax(X) is the spectral norm of X and M = maxk xk , then any q(0) ϵ-close to X yi will converge to X yi in one iteration. Note that the bound above on i includes as a particular case the unstructured bound in Prop. 2 applied to sparsemax (entmax with α = 2, which has margin m = 1/(α 1) = 1), since for Y = N we have r = 1 and Di = 2, which leads to the condition i β 1 + 2Mϵ. For the particular case of the k-subsets problem (Example 1), we have r = 2k, leading to the condition i k β + 2Mkϵ. This recovers sparsemax when k = 1. For the sequential k-subsets problem in Example 2, we have r = 2N 1. Noting that any two distinct y and y differ in at most 2k variable nodes, and since each variable node can affect 6 bits (2 for y V and 4 for y F ), the Hamming distance between y and y is at most 12k, therefore we have Di = 12k, which leads to the condition i 6k 5. Experiments We now present experiments using synthetic and real-world datasets to validate our theoretical findings and illustrate the usefulness of sparse and structured Hopfield networks. Sparse and Structured Hopfield Networks 1.5-entmax 2-entmax 2-normmax 5-normmax Figure 3: Left: contours of the energy function and optimization trajectory of the CCCP iteration (β = 1). Right: attraction basins associated with each pattern. (White sections do not converge to a single pattern but to a metastable state; β = 10 (a larger β is needed to allow for the 1-entmax to get ϵ-close to a single pattern); for α = 1 we allow a tolerance of ϵ = .01). Additional plots for different β can be found in App. E.2. 5.1. Hopfield dynamics and basins of attraction Fig. 3 shows optimization trajectories and basins of attraction for various queries and artificially generated pattern configurations for the two families of sparse transformations, α-entmax and α-normmax. We use α {1, 1.5, 2} for α-entmax and α {2, 5} for α-normmax (where we apply the bisection algorithm described in App. B). As α increases, α-entmax converges more often to a single pattern, whereas α-normmax tends to converge towards an attractor which is a uniform average of some patterns. 5.2. Metastable state distributions in MNIST We next investigate how often our Hopfield networks converge to metastable states, a crucial aspect for understanding the network s dynamics. 
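As a concrete companion to the dynamics of §5.1 and the metastable-state probing of §5.2, the following numpy sketch (ours, with sparsemax standing in for ŷΩ) iterates the HFY update rule (9) and reports the support size of the final retrieval distribution, i.e. the metastable state size counted in Tab. 1. With well-separated patterns and a large enough β, the support size is 1 and the query converges exactly to a stored pattern.

```python
import numpy as np

def sparsemax(theta):
    # Euclidean projection onto the simplex (see the Section 2.2 sketch).
    z = np.sort(theta)[::-1]
    cssv = np.cumsum(z) - 1.0
    ks = np.arange(1, len(theta) + 1)
    k = ks[z - cssv / ks > 0][-1]
    return np.maximum(theta - cssv[k - 1] / k, 0.0)

def hfy_retrieve(X, q, beta=1.0, transform=sparsemax, max_iters=100, tol=1e-12):
    """Iterate the HFY update q <- X^T yhat_Omega(beta X q) until convergence.
    Returns the final query and the support size of the retrieval distribution
    (support size 1 means a single pattern is retrieved exactly)."""
    for _ in range(max_iters):
        p = transform(beta * X @ q)
        q_new = X.T @ p
        if np.linalg.norm(q_new - q) < tol:
            break
        q = q_new
    return q, int(np.count_nonzero(p))

# Toy setup (an assumption for illustration): random normalized patterns.
rng = np.random.default_rng(0)
X = rng.normal(size=(8, 16))
X /= np.linalg.norm(X, axis=1, keepdims=True)
q0 = X[3] + 0.02 * rng.normal(size=16)          # slightly perturbed query
q_star, support = hfy_retrieve(X, q0, beta=10.0)
print(support, np.allclose(q_star, X[3], atol=1e-6))  # expect 1 and True when
                                                      # patterns are well separated
```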
To elucidate how often metastable states arise, we examine $\hat{y}_\Omega(\beta X q^{(t)})$ for the MNIST dataset, probing the number of nonzeros of these vectors. We set a threshold of 0.01 for the softmax method (1-entmax), counting entries above it as nonzero. For the sparse transformations we do not need a threshold, since they have exact retrieval. Results in Tab. 1 suggest that α-entmax is capable of retrieving single patterns for higher values of α. Despite α-normmax's ability to induce sparsity, we observe that it tends to stabilize in small but persistent metastable states as α is increased, whereas Sparse MAP with k-subsets is capable of retrieving associations of k patterns, as expected.

5.3. Multiple instance learning

In multiple instance learning (MIL), instances are grouped into bags and a bag is labeled as positive if it contains at least one instance from a given class. We also consider an extended variant, denoted K-MIL, where bags are considered positive if they contain K or more positive instances. Ramsauer et al. (2021) tackle MIL via a Hopfield pooling layer, where the query q is learned and the keys X are instance embeddings. We experiment with sparse Hopfield pooling layers using our proposed α-entmax and α-normmax transformations (§3), as well as structured Hopfield pooling layers using Sparse MAP with k-subsets (§4), varying α and k in each case. Note that 1-entmax recovers Ramsauer et al. (2021) and 2-entmax recovers Hu et al. (2023). We run these models for K-MIL problems in the MNIST dataset (choosing 9 as target) and in three MIL benchmarks: Elephant, Fox, and Tiger (Ilse et al., 2018). Further details can be found in App. F.1 and F.2.

Tab. 2 shows the results. We observe that, for MNIST with K = 1, 1-entmax surprisingly outperforms the remaining methods. Normmax shows consistent results across datasets, achieving near-optimal performance, arguably due to its ability to adapt to near-uniform metastable states of varying size. We also observe that, for K > 1, the k-subsets approach achieves top performance when k = K, as expected. We also see that, in the MIL benchmarks, Sparse MAP pooling surpasses the sparse pooling variants for 2 out of 3 datasets.

5.4. Structured Rationalizers

We experiment with rationalizer models in sentiment prediction tasks, where the inputs are sentences or documents in natural language and the rationales are text highlights (see Fig. 4). These models, sometimes referred to as select-predict or explain-predict models (Jacovi & Goldberg, 2021; Zhang et al., 2021), consist of a rationale generator and a predictor. The generator processes the input text and extracts the rationale as a subset of words to be highlighted, and the predictor classifies the input based solely on the extracted rationale, which generally involves concealing non-rationale words through the application of a binary mask.

Table 1: Distribution of metastable states (in %) in MNIST. The training set is memorized and the test set is used as queries.
β = 0.1:

| Metastable state size | 1-entmax | 1.5-entmax | 2-entmax | 2-normmax | 5-normmax | k-subsets (k=2) | k-subsets (k=4) | k-subsets (k=8) |
|---|---|---|---|---|---|---|---|---|
| 1 | 3.5 | 69.2 | 88.1 | 81.4 | 51.4 | 0.0 | 0.0 | 0.0 |
| 2 | 2.1 | 8.6 | 5.2 | 6.7 | 31.4 | 87.3 | 0.0 | 0.0 |
| 3 | 1.6 | 3.9 | 2.6 | 1.9 | 7.0 | 6.1 | 0.0 | 0.0 |
| 4 | 1.2 | 2.3 | 1.6 | 1.0 | 2.1 | 2.5 | 80.0 | 0.0 |
| 5 | 1.2 | 1.6 | 1.1 | 0.9 | 1.5 | 2.0 | 11.9 | 0.0 |
| 6 | 0.9 | 0.9 | 0.8 | 0.5 | 1.5 | 1.1 | 4.4 | 0.0 |
| 7 | 1.1 | 0.6 | 0.4 | 0.4 | 1.3 | 0.6 | 2.1 | 0.0 |
| 8 | 0.8 | 0.6 | 0.1 | 0.8 | 1.0 | 0.2 | 1.0 | 60.0 |
| 9 | 1.0 | 0.3 | 0.0 | 0.5 | 0.8 | 0.1 | 0.4 | 26.0 |
| 10 | 1.1 | 0.1 | 0.0 | 0.5 | 0.6 | 0.0 | 0.1 | 9.2 |
| 10+ | 85.5 | 11.9 | 0.1 | 5.4 | 1.4 | 0.1 | 0.0 | 4.8 |

β = 1:

| Metastable state size | 1-entmax | 1.5-entmax | 2-entmax | 2-normmax | 5-normmax | k-subsets (k=2) | k-subsets (k=4) | k-subsets (k=8) |
|---|---|---|---|---|---|---|---|---|
| 1 | 97.8 | 99.9 | 100.0 | 100.0 | 99.8 | 0.0 | 0.0 | 0.0 |
| 2 | 0.9 | 0.1 | 0.0 | 0.0 | 0.2 | 99.9 | 0.0 | 0.0 |
| 3 | 0.4 | 0.0 | 0.0 | 0.0 | 0.0 | 0.1 | 0.0 | 0.0 |
| 4 | 0.3 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 99.3 | 0.0 |
| 5 | 0.2 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.7 | 0.0 |
| 6 | 0.1 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.1 | 0.0 |
| 7 | 0.1 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 8 | 0.1 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 95.0 |
| 9 | 0.1 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 4.7 |
| 10 | 0.1 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.2 |
| 10+ | 0.1 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.1 |

Table 2: Results for MIL. We show accuracies for MNIST and ROC AUC for the MIL benchmarks, averaged across 5 runs.

| Methods | MNIST K=1 | MNIST K=2 | MNIST K=3 | MNIST K=5 | Fox | Tiger | Elephant |
|---|---|---|---|---|---|---|---|
| 1-entmax (softmax) | 98.4 ± 0.2 | 94.6 ± 0.5 | 91.1 ± 0.5 | 89.0 ± 0.3 | 66.4 ± 2.0 | 87.1 ± 1.6 | 92.6 ± 0.6 |
| 1.5-entmax | 97.6 ± 0.8 | 96.0 ± 0.9 | 90.4 ± 1.1 | 92.4 ± 1.4 | 66.3 ± 2.0 | 87.3 ± 1.5 | 92.4 ± 1.0 |
| 2.0-entmax (sparsemax) | 97.9 ± 0.2 | 96.7 ± 0.5 | 92.9 ± 0.9 | 91.6 ± 1.0 | 66.1 ± 0.6 | 87.7 ± 1.4 | 91.8 ± 0.6 |
| 2.0-normmax | 97.9 ± 0.3 | 96.6 ± 0.6 | 93.9 ± 0.7 | 92.4 ± 0.7 | 66.1 ± 2.5 | 86.4 ± 0.8 | 92.4 ± 0.7 |
| 5.0-normmax | 98.2 ± 0.5 | 97.2 ± 0.3 | 95.8 ± 0.4 | 93.2 ± 0.5 | 66.4 ± 2.3 | 85.5 ± 0.6 | 93.0 ± 0.7 |
| Sparse MAP, k = 2 | 97.9 ± 0.3 | 97.7 ± 0.3 | 95.1 ± 0.5 | 92.6 ± 1.1 | 66.8 ± 2.7 | 85.3 ± 0.5 | 93.2 ± 0.7 |
| Sparse MAP, k = 3 | 98.0 ± 0.6 | 96.1 ± 1.0 | 96.5 ± 0.5 | 92.2 ± 1.2 | 67.4 ± 2.0 | 86.1 ± 0.8 | 92.6 ± 1.7 |
| Sparse MAP, k = 5 | 98.2 ± 0.4 | 96.2 ± 1.4 | 95.1 ± 1.1 | 95.1 ± 1.5 | 67.0 ± 2.0 | 86.3 ± 0.8 | 91.2 ± 1.0 |

Rationalizers are usually trained end-to-end, and the discreteness of the latent rationales is either handled with stochastic methods, via score function estimators or the reparametrization trick (Lei et al., 2016; Bastings et al., 2019), or with deterministic methods, via structured continuous relaxations (Guerreiro & Martins, 2021). In either case, the model imposes sparsity and contiguity penalties to ensure rationales are short and tend to extract adjacent words. Our model architecture is adapted from SPECTRA (Guerreiro & Martins, 2021), but the combination of the generator and predictor departs from prior approaches (Lei et al., 2016; Bastings et al., 2019; Guerreiro & Martins, 2021): our predictor does not mask the input tokens; instead, it takes as input the pooled vector that results from the Hopfield pooling layer (either a sequential or non-sequential Sparse MAP k-subsets layer). By integrating this Hopfield pooling layer into the predictor, we transform the sequence of word embeddings into a single vector from which the prediction is made. The rationale is formed by the pattern associations (word tokens) extracted by the Hopfield layer.

Tab. 3 shows the results on the downstream task (classification for SST, Ag News, IMDB; regression for Beer Advocate) as well as the F1 overlap with human rationales for the Beer Advocate dataset (McAuley et al., 2012). Compared to strong baselines (Bastings et al., 2019; Guerreiro & Martins, 2021), our proposed methods achieve equal or slightly superior performance for all datasets. Moreover, our sequential k-subsets model outperforms the baselines in terms of overlap with human rationales. This is explained by the fact that human rationales tend to contain adjacent words, which is encouraged by our sequential model. Additional details and results with other baselines are shown in App. F.3.

6.
Related Work Recent research on modern Hopfield networks and dense associative memories includes (Krotov & Hopfield, 2016; Demircigil et al., 2017; Ramsauer et al., 2021; Millidge et al., 2022; Hoover et al., 2023, inter alia). The closest to our work is Hu et al. (2023), who proposed sparse Hopfield networks (equivalent to our 2-entmax, i.e., sparsemax) and derived retrieval error bounds tighter than the dense analog. Concurrently to our work, Wu et al. (2024) further proposed a generalized sparse Hopfield model , which corresponds to α-entmax with learnable α, and applied it successfully to time series prediction problems. However, neither Hu et al. (2023) or Wu et al. (2024) seem to have realized the possibility of exact retrieval enabled by sparse transformations. Our paper fills this gap by providing a unified framework for sparse Hopfield networks with stronger theoretical guarantees for retrieval and coverage (cf. Prop. 2 3). Our Sparse and Structured Hopfield Networks Table 3: Text rationalization results. We report mean and min/max F1 scores across five random seeds on test sets for all datasets but Beer, where we report MSE. Hard Kuma and SPECTRA results are taken from (Guerreiro & Martins, 2021). We also report human rationale overlap (HRO) as F1 score. We bold the best-performing system(s). SST Ag News IMDB Beer (MSE) Beer (HRO) Hard Kuma (Bastings et al., 2019) .80 (.80/.81) .90 (.87/.88) .87 (.90/.91) .019 (.016/.020) .37 (.00/.90) SPECTRA (Guerreiro & Martins, 2021) .80 (.79/.81) .92 (.92/.93) .90 (.89/.90) .017 (.016/.019) .61 (.56/.68) Sparse MAP k-subsets (ours) .81 (.81/.82) .93 (.92/.93) .90 (.90/.90) .017 (.017/.018) .42 (.29/.62) Sparse MAP seq. k-subsets (ours) .81 (.80/.83) .93 (.93/.93) .90 (.90/.90) .020 (.018/.021) .63 (.49/.70) Sequential k-subsets k-subsets a darkish golden pour from tap with a small white lacing around glass . you can't miss the sweet smell . the word snappy fits this beer well . it is a winter warmer but not from the usual alcohol burn . the alcohol is almost completely hidden . the warm comes from the mix of cinnamon , hops , and most of all spiciness . the alcohol must be there because i sure did feel it after finishing the glass . a darkish golden pour from tap with a small white lacing around glass . you can't miss the sweet smell . the word snappy fits this beer well . it is a winter warmer but not from the usual alcohol burn . the alcohol is almost completely hidden . the warm comes from the mix of cinnamon , hops , and most of all spiciness . the alcohol must be there because i sure did feel it after finishing the glass . Figure 4: Example of human rationale overlap for the aspect appearance . The yellow highlight indicates the model s rationale, while italicized and bold font represents the human rationale. Red font identifies mismatches with human annotations. Sparse MAP with sequential k-subsets prefers more contiguous rationales, which better match humans. Additional examples are shown in App. F.3. framework generalizes their constructions and widens the scope to new families, such as α-normmax, for which we provide an effective bisection algorithm (Alg. 1 in App. B). The link with Fenchel-Young losses (Blondel et al., 2020) is a key dimension of our work. Many results derived therein, such as the margin conditions, were found to have direct application to sparse Hopfield networks. In our paper, we extend some of their results, such as the structured margin of Sparse MAP (Prop. 
4), key to establish exact retrieval of pattern associations (Prop. 5). Our k-subsets example relates to the top-k retrieval model of Davydov et al. (2023): their model differs from ours as it uses an entropic regularizer not amenable to sparsity, making exact retrieval impossible. Our sparse and structured Hopfield layers with sparsemax and Sparse MAP involve a quadratic regularizer, which relates to the differentiable layers of Amos & Kolter (2017). The use of Sparse MAP and its active set algorithm (Niculae et al., 2018) allows to exploit the structure of the problem to ensure efficient Hopfield updates and implicit derivatives. 7. Conclusions We presented a unified framework for sparse and structured Hopfield networks. Our framework hinges on a broad family of energy functions written as a difference of a quadratic regularizer and a Fenchel-Young loss, parametrized by a generalized negentropy function. A core result of our paper is the link between the margin property of certain Fenchel-Young losses and sparse Hopfield networks with provable conditions for exact retrieval. We further extended this framework to incorporate structure via the Sparse MAP transformation, which is able to retrieve pattern associations instead of a single pattern. Empirical evaluation confirmed the usefulness of our approach in multiple instance learning and text rationalization problems. Impact Statement Hopfield networks are increasingly relevant for practical applications and not only as theoretical models. While our work is mostly a theoretical advancement, promising experimental results signal potentially wider impact. Sparse HFY networks are applicable in the same scenarios where modern Hopfield networks would be, and we do not foresee any specific societal consequences of sparse transformations or exact retrieval in such cases. In the structured case, the practicioner has more hands-on control for encoding inductive biases through the design of a factor graph and the choice of its parameters (e.g., k). These choices may in turn reflect human biases and have societal implications: for example, contiguous rationales might be well suited for English but not for other languages. We encourage care when designing such methods for practical applications. Acknowledgments This work was supported by the European Research Council (DECOLLAGE, ERC-2022-Co G 101088763), by the Portuguese Recovery and Resilience Plan through project C645008882-00000055 (Center for Responsible AI), by Fundac ao para a Ciˆencia e Tecnologia through contract UIDB/50008/2020, and by the Dutch Research Council (NWO) via VI.Veni.212.228. Sparse and Structured Hopfield Networks Amit, D. J., Gutfreund, H., and Sompolinsky, H. Storing infinite numbers of patterns in a spin-glass model of neural networks. Physical Review Letters, 55(14):1530, 1985. Amos, B. and Kolter, J. Z. Optnet: Differentiable optimization as a layer in neural networks. In International Conference on Machine Learning, pp. 136 145. PMLR, 2017. Bastings, J., Aziz, W., and Titov, I. Interpretable neural predictions with differentiable binary variables. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp. 2963 2977, 2019. Blondel, M., Martins, A. F., and Niculae, V. Learning with Fenchel-Young losses. Journal of Machine Learning Research, 21(1):1314 1382, 2020. Brauchart, J. S., Reznikov, A. B., Saff, E. B., Sloan, I. H., Wang, Y. G., and Womersley, R. S. Random point sets on the sphere hole radii, covering, and separation. 
Experimental Mathematics, 27(1):62 81, 2018. Chabauty, C. Resultats sur l empilement de calottes egales sur une perisphere de Rn et correction a un travail anterieur. Comptes Rendus Hebdomadaires des Seances de l Academie des Sciences, 236(15):1462 1464, 1953. Conway, J. H. and Sloane, N. J. A. Sphere packings, lattices and groups, volume 290. Springer Science & Business Media, 2013. Correia, G. M., Niculae, V., and Martins, A. F. Adaptively sparse transformers. In Proceedings of EMNLP-IJCNLP, 2019. Davydov, A., Jaffe, S., Singh, A., and Bullo, F. Retrieving k-nearest memories with modern Hopfield networks. In Associative Memory & Hopfield Networks in 2023, 2023. Demircigil, M., Heusel, J., L owe, M., Upgang, S., and Vermet, F. On a model of associative memory with huge storage capacity. Journal of Statistical Physics, 168:288 299, 2017. Guerreiro, N. M. and Martins, A. F. T. Spectra: Sparse structured text rationalization. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pp. 6534 6550, 2021. Hoover, B., Liang, Y., Pham, B., Panda, R., Strobelt, H., Chau, D. H., Zaki, M. J., and Krotov, D. Energy transformer. In Advances in Neural Information Processing Systems, 2023. Hopfield, J. J. Neural networks and physical systems with emergent collective computational abilities. Proceedings of the National Academy of Sciences, 79(8):2554 2558, 1982. Hu, J. Y.-C., Yang, D., Wu, D., Xu, C., Chen, B.-Y., and Liu, H. On sparse modern hopfield model. In Advances in Neural Information Processing Systems, 2023. URL https://arxiv.org/abs/2309.12673. Ilse, M., Tomczak, J., and Welling, M. Attention-based deep multiple instance learning. In International Conference on Machine Learning, pp. 2127 2136. PMLR, 2018. URL https://arxiv.org/abs/1802.04712. Jacovi, A. and Goldberg, Y. Aligning faithful interpretations with their social attribution. Transactions of the Association for Computational Linguistics, 9:294 310, 2021. Jenssen, M., Joos, F., and Perkins, W. On kissing numbers and spherical codes in high dimensions. Advances in Mathematics, 335:307 321, 2018. Krotov, D. and Hopfield, J. J. Dense associative memory for pattern recognition. In Advances in Neural Information Processing Systems, 2016. Kschischang, F. R., Frey, B. J., and Loeliger, H.-A. Factor graphs and the sum-product algorithm. IEEE Transactions on information theory, 47(2):498 519, 2001. Lei, T., Barzilay, R., and Jaakkola, T. Rationalizing neural predictions. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pp. 107 117, 2016. Leutgeb, J., Leutgeb, S., Moser, M., and Moser, E. Pattern separation in the dentate gyrus and CA3 of the hippocampus. Science, 315(5814):961 966, 2007. Li, Z. and Arora, S. An exponential learning rate schedule for deep learning. In International Conference on Learning Representations, 2020. URL https:// openreview.net/forum?id=r Jg8Te SFDH. Loshchilov, I. and Hutter, F. Decoupled weight decay regularization. ar Xiv preprint ar Xiv:1711.05101, 2017. Martins, A. and Astudillo, R. From softmax to sparsemax: A sparse model of attention and multi-label classification. In Proceedings of ICML, 2016. May, R. M. Patterns of species abundance and distribution. Ecology and Evolution of Communities, pp. 81 120, 1975. Sparse and Structured Hopfield Networks Mc Auley, J. J., Leskovec, J., and Jurafsky, D. Learning attitudes and attributes from multi-aspect reviews. In ICDM, 2012. Mc Eliece, R., Posner, E., Rodemich, E., and Venkatesh, S. 
The capacity of the hopfield associative memory. IEEE transactions on Information Theory, 33(4):461 482, 1987. Millidge, B., Salvatori, T., Song, Y., Lukasiewicz, T., and Bogacz, R. Universal hopfield networks: A general framework for single-shot associative memory models. In International Conference on Machine Learning, pp. 15561 15583. PMLR, 2022. Neunuebel, J. and Knierim, J. CA3 retrieves coherent representations from degraded input: direct evidence for CA3 pattern completion and dentate gyrus pattern separation. Neuron, 81(2):416 427, 2014. Niculae, V. and Blondel, M. A regularized framework for sparse and structured neural attention. Advances in Neural Information Processing Systems, 30, 2017. Niculae, V., Martins, A. F., Blondel, M., and Cardie, C. Sparsemap: Differentiable sparse structured inference. In Proceedings of the International Conference on Machine Learning (ICML), 2018. Palm, G. Neural associative memories and sparse coding. Neural Netw, 37:165 171, 2013. Peters, B., Niculae, V., and Martins, A. F. Sparse sequenceto-sequence models. In Proceedings of ACL, 2019. Ramsauer, H., Sch afl, B., Lehner, J., Seidl, P., Widrich, M., Gruber, L., Holzleitner, M., Adler, T., Kreil, D., Kopp, M. K., Klambauer, G., Brandstetter, J., and Hochreiter, S. Hopfield networks is all you need. In Proceedings of ICLR, 2021. URL https://openreview.net/ forum?id=t L89Rnz Ii Cd. Severa, W., Parekh, O., James, C., and Aimone, J. A combinatorial model for dentate gyrus sparse coding. Neural Computation, 29(1):94 117, 2017. Shannon, C. E. Probability of error for optimal codes in a gaussian channel. Bell System Technical Journal, 38(3): 611 656, 1959. Simoncelli, E. P. and Olshausen, B. A. Natural image statistics and neural representation. Annual Review of Neuroscience, 24(1):1193 1216, 2001. Taskar, B., Guestrin, C., and Koller, D. Max-margin markov networks. In Advances in Neural Information Processing Systems, 2003. Tsallis, C. Possible generalization of boltzmann-gibbs statistics. Journal of Statistical Physics, 52:479 487, 1988. Tse, D., Langston, R., Kakeyama, M., Bethus, I., Spooner, P., Wood, E., Witter, M., and Morris, R. Schemas and memory consolidation. Science, 316(5821):76 82, 2007. doi: 10.1126/science.1135935. Tsochantaridis, I., Joachims, T., Hofmann, T., Altun, Y., and Singer, Y. Large margin methods for structured and interdependent output variables. Journal of Machine Learning Research, 6(9), 2005. Tyulmankov, D., Fang, C., Vadaparty, A., and Yang, G. R. Biological learning in key-value memory networks. In Advances in Neural Information Processing Systems, 2021. Wainwright, M. J., Jordan, M. I., et al. Graphical models, exponential families, and variational inference. Foundations and Trends in Machine Learning, 1(1 2):1 305, 2008. Whittington, J. C., Warren, J., and Behrens, T. E. Relating transformers to models and neural representations of the hippocampal formation. In Proceedings of ICLR, 2021. Wu, D., Hu, J. Y.-C., Li, W., Chen, B.-Y., and Liu, H. STanhop: Sparse tandem hopfield model for memoryenhanced time series prediction. In Proceedings of ICLR, 2024. URL https://openreview.net/forum? id=6iwg437CZs. Wyner, A. D. Capabilities of bounded discrepancy decoding. Bell System Technical Journal, 44(6):1061 1122, 1965. Yassa, M. and Stark, C. Pattern separation in the hippocampus. Trends Neurosci, 34(10):515 525, 2011. Yuille, A. L. and Rangarajan, A. The concave-convex procedure. Neural computation, 15(4):915 936, 2003. Zhang, Z., Rudra, K., and Anand, A. 
Explain and predict, and then predict again. In Proceedings of the 14th ACM International Conference on Web Search and Data Mining, 2021. Sparse and Structured Hopfield Networks A. Generalized Negentropies We recall here the definition of generalized negenetropies from Blondel et al. (2020, 4.1): Definition 4. A function Ω: N R is a generalized negentropy iff it satisfies the following conditions: 1. Zero negentropy: Ω(y) = 0 if y is a one-hot vector (delta distribution), i.e., y = ei for any i {1, . . . , N}. 2. Strict convexity: Ω((1 λ)y + λy ) < (1 λ)Ω(y) + λΩ(y ) for λ ]0, 1[ and y, y N with y = y . 3. Permutation invariance: Ω(P y) = Ω(y) for any permutation matrix P (i.e., square matrices with a single 1 in each row and each column, zero elsewhere). This definition implies that Ω 0 and that Ωis minimized when y = 1/N is the uniform distribution (Blondel et al., 2020, Prop. 4). This justifies the name generalized negentropies. B. Bisection Algorithm for the Normmax Transformation We derive here expressions for the normmax transformation along with a bisection algorithm to compute this transformation for general α. Letting Ω(y) = 1 + y α + I N (y) be the norm entropy, we have ( Ω )(θ) = arg max y N θ y y q. (14) The Lagrangian function is L(y, λ, µ) = θ y + y α λ y + µ(1 y 1). Equating the gradient to zero and using the fact that y α = y y α α 1 , we get: 0 = y L(y, λ, µ) = θ + y y α α 1 λ + µ1. (15) The complementarity slackness condition implies that, if yi > 0, we must have λi = 0, therefore, we have for such i supp(y): θi + yi y α α 1 + µ = 0 yi = (θi µ) 1 α 1 y α. (16) Since we must have P i supp(y) yi = 1, we obtain: i supp(y) (θi µ) 1 α 1 y α = y α = 1 P i supp(y)(θi µ) 1 α 1 . (17) Combining the two last equations and noting that, from (15), we have θi < µi for i / supp(y), we get, for i [N]: yi = (θi µ) j supp(y)(θj µ) 1 α 1 + . (18) Moreover, since P i supp(y) yα i = y α α, we obtain from (16): i supp(y) (θi µ) α α 1 y α α X i supp(y) (θi µ) α α 1 = 1. (19) In order to compute the solution (18) we need to find µ satisfying (19). This can be done with a simple bisection algorithm if we find a lower and upper bound on µ. We have, from (18), that µ = θi (yi/ y α)α 1 for any i supp(y). Letting θmax = maxi θi and ymax = maxi yi, we have in particular that µ = θmax (ymax/ y α)α 1. We also have that ymax = y y α, which implies Sparse and Structured Hopfield Networks Algorithm 1 Compute α-normmax by bisection. 1: Input: Scores θ = [θ1, ..., θN] RN, parameter α > 1. 2: Output: Probability vector y = [y1, ..., y N] N. 3: Define θmax maxi θi 4: Compute µmin θmax 1 5: Compute µmax θmax N 1 α 6: for t 1, . . . , T do 7: µ (µmin + µmax)/2 α α 1 + 9: if Z < 1 then µmax µ else µmin µ 10: end for 11: Return y = [y1, ..., y N] with yi = (θi µ) ymax/ y α 1. Since 1/N ymax 1 and y α 1 for any y N, we also obtain ymax/ y α (1/N)/1 = N 1. Therefore we have θmax 1 | {z } µmin µ θmax N 1 α | {z } µmax The resulting algorithm is shown as Alg. 1. C. Structured Prediction with Factor Graphs We define our notation and setup for structured prediction with factor graph representations, based on Niculae et al. (2018). We assume the interactions among patterns (for example due to temporal dependencies, hierarchical or link structure, etc.) can be expressed as a bipartite factor graph (V, F), where V is a set of variable nodes and F are factor nodes (Kschischang et al., 2001). Each factor f F is linked to a subset of variable nodes Vf V . 
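Complementing Alg. 1 above, here is a numpy transcription of the α-normmax bisection, following the derivation in this appendix (it bisects on µ until Σᵢ(θᵢ − µ)₊^{α/(α−1)} = 1 and then normalizes yᵢ ∝ (θᵢ − µ)₊^{1/(α−1)}). Treat it as a sketch rather than the reference implementation released with this paper.

```python
import numpy as np

def normmax_bisect(theta, alpha=2.0, n_iter=50):
    """alpha-normmax by bisection (cf. Alg. 1); assumes alpha > 1.

    Solves argmax_{y in simplex} theta @ y - Omega_alpha^N(y), with
    Omega_alpha^N(y) = -1 + ||y||_alpha, by finding the threshold mu with
    sum_i (theta_i - mu)_+^{alpha/(alpha-1)} = 1."""
    theta = np.asarray(theta, dtype=float)
    N = len(theta)
    theta_max = theta.max()
    mu_lo = theta_max - 1.0                   # lower bound on mu
    mu_hi = theta_max - N ** (1.0 - alpha)    # upper bound on mu
    exponent = alpha / (alpha - 1.0)
    for _ in range(n_iter):
        mu = 0.5 * (mu_lo + mu_hi)
        Z = np.sum(np.maximum(theta - mu, 0.0) ** exponent)
        if Z < 1.0:
            mu_hi = mu                        # mu too large: move down
        else:
            mu_lo = mu                        # mu too small: move up
    y = np.maximum(theta - 0.5 * (mu_lo + mu_hi), 0.0) ** (1.0 / (alpha - 1.0))
    return y / y.sum()

theta = np.array([1.0716, 1.1221, 0.3288, 0.3368, 0.0425])
print(normmax_bisect(2.0 * theta, alpha=5.0).round(3))
# large alpha: close to uniform over the selected support (cf. Fig. 2)
```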
We assume each variable v V can take one of Nv possible values, and we denote by yv {0, 1}Nv a one-hot vector indicating a value for this variable. Likewise, each factor f F has Nf possible configurations, with Nf = Q v Vf Nv, and we associate to it a one-hot vector yf {0, 1}Nf indicating a configuration for that factor. The global configuration of the factor graph is expressed through the bit vectors y V = [yv : v V ] {0, 1}NV and y F = [yf : f F] {0, 1}NF , with NV = P v V Nv and NF = P f F Nf. A particular structure is expressed through the bit vector y = [y V ; y F ] {0, 1}NV +NF . Finally, we define the set of valid structures Y {0, 1}NV +NF this set contains all the bit vectors which correspond to valid structures, which must satisfy consistency between variable and factor assignments, as well as any additional hard constraints. We associate unary scores θV = [θv : v V ] RNV to configurations of variable nodes and higher-order scores θF = [θf : f F] RNF to configurations of factor nodes. We denote θ = [θV ; θF ] RNV +NF . The problem of finding the highest-scoring structure, called the maximum a posteriori (MAP) inference problem, is y = argmax y Y θ y = argmax y conv(Y) θ V y V + θ F y F . (21) As above, we consider regularized variants of (21) via a convex regularizer Ω: conv(Y) R. Sparse MAP corresponds to Ω(y) = 1 2 y V 2 (note that only the unary variables are regularized), which leads to the quadratic optimization problem (12). Niculae et al. (2018) developed an effective and computationally efficient active set algorithm for solving this problem, which requires only a MAP oracle to solve instances of the problem (21). D. Proofs of Main Text D.1. Proof of Proposition 1 We start by proving that E(q) 0. We show first that for any Ωsatisfying conditions 1 3 above, we have LΩ(θ; 1/N) max i θi 1 θ/N. (22) Sparse and Structured Hopfield Networks From the definition of Ω and the fact that Ω(y) Ω(1/N) for any y N, we have that, for any θ, Ω (θ) = maxy N θ y Ω(y) maxy N θ y Ω(1/N) = maxi θi Ω(1/N), which leads to (22). Let now k = arg maxi q xi, i.e., xk is the pattern most similar to the query q. We have E(q) = β 1LΩ(βXq; 1/N) + 1 2 q µX 2 + 1 2(M 2 µX 2) β 1(β max i q xi β1 Xq/N) + 1 2 q µX 2 + 1 2(M 2 µX 2) = q xk + q µX + 1 2 q µX 2 + 1 2(M 2 µX 2) 2 xk q 2 0. The zero value of energy is attained when X = 1q (all patterns are equal to the query), in which case µX = q, M = q = µX , and we get Econvex(q) = Econcave(q) = 0. Now we prove the two upper bounds. For that, note that, for any y N, we have 0 LΩ(θ, y) = LΩ(θ, 1/N) Ω(1/N) + Ω(y) (y 1/N) θ LΩ(θ, 1/N) Ω(1/N) (y 1/N) θ, due to the assumptions 1 3 which ensure Ωis non-positive. That is, LΩ(θ, 1/N) Ω(1/N) + (y 1/N) θ. Therefore, with q = X y, we get Econcave(q) β 1Ω(1/N) y Xq + q µX = β 1Ω(1/N) q 2 + q µX, and E(q) = Econcave(q) + Econvex(q) β 1Ω(1/N) q 2 + q µX + 1 2 q µX 2 + 1 2(M 2 µX 2) = β 1Ω(1/N) 1 2M 2 β 1Ω(1/N) + 1 To show the second upper bound, use the fact that Econcave(q) 0, which leads to E(q) Econvex(q) = 1 2 q µX 2 + 1 2(M 2 µX 2) = 1 2 q 2 q µX + 1 2M 2. Note that q = X y P i yi xi M and that, from the Cauchy Schwarz inequality, we have q µX µX q M 2. Therefore, we obtain E(q) 1 2 q 2 q µX + 1 2M 2 1 2M 2 + M 2 + 1 2M 2 = 2M 2. We now turn to the update rule. 
The CCCP algorithm works as follows: at the tth iteration, it linearizes the concave function Econcave by using a first-order Taylor approximation around q(t), Econcave(q) Econcave(q) := Econcave(q(t)) + Econcave(q(t)) Then, it computes a new iterate by solving the convex optimization problem q(t+1) := arg minq Econvex(q) + Econcave(q), which leads to the equation Econvex(q(t+1)) = Econcave(q(t)). Using the fact that LΩ(θ, y) = ˆyΩ(θ) y and the chain rule leads to Econcave(q) = β 1 q LΩ(βXq; 1/N) = X (1/N ˆyΩ(βXq)) = µX X ˆyΩ(βXq) Econvex(q) = q µX, (23) giving the update equation (9). D.2. Proof of Proposition 2 A stationary point is a solution of the equation Econcave(q) = Econvex(q). Using the expression for gradients (23), this is equivalent to q = X ˆyΩ(βXq). If xi = X ei is not a convex combination of the other memory patterns, xi is a stationary point iff ˆyΩ(βXxi) = ei. We now use the margin property of sparse transformations (7), according to which the latter is equivalent to βx i xi maxj =i βx i xj m. Noting that the left hand side equals β i leads to the desired result. Sparse and Structured Hopfield Networks If the initial query satisfies q(0) (xi xj) m β for all j = i, we have again from the margin property that ˆyΩ(βXq(0)) = ei, which combined to the previous claim ensures convergence in one step to xi. Finally, note that, if q(0) is ϵ-close to xi, we have q(0) = xi + ϵr for some vector r with r = 1. Therefore, we have (q(0)) (xi xj) = (xi + ϵr) (xi xj) i + ϵr (xi xj) i ϵ r |{z} =1 xi xj , (24) where we invoked the Cauchy-Schwarz inequality in the last step. Since the patterns are normalized (with norm M),4 we have from the triangle inequality that xi xj xi + xj = 2M; using the assumption that i m β + 2Mϵ, we obtain q(0) (xi xj) m β , which from the previous points ensures convergence to xi in one iteration. D.3. Proof of Proposition 3 For the first statement, we follow a similar argument as the one made by Ramsauer et al. (2021) in the proof of their Theorem A.3 however their proof has a mistake, which we correct here.5 Given a separation angle αmin, we lower bound the number of patterns N we can can place in the sphere separated by at least this angle. Estimating this quantity is an important open problem in combinatorics, related to determining the size of spherical codes (of which kissing numbers are a particular case; Conway & Sloane 2013). We invoke a lower bound due to Chabauty (1953), Shannon (1959), and Wyner (1965) (see also Jenssen et al. (2018) for a tighter bound), who show that N (1 + o(1)) 2πD cos αmin (sin αmin)D 1 . For αmin = π 3 , which corresponds to the kissing number problem, we obtain the bound N (1 + o(1)) In this scenario, we have i = M 2(1 cos αmin) by the definition of i. From Proposition 2, we have exact retrieval under ϵ-perturbations if i mβ 1 + 2Mϵ. Combining the two expressions, we obtain ϵ M 2 (1 cos αmin) m 2βM . Setting αmin = π 3 , we obtain ϵ M 2 m 2βM = M 4 m 2βM . For the right hand side to be positive, we must have M 2 > 2m/β. Assume now patterns are placed uniformly at random in the sphere. From Brauchart et al. (2018) we have, for any δ > 0: P(N 2 D 1 αmin δ) 1 κD 1 2 δD 1, with κD := 1 D π Γ((D + 1)/2) Γ(D/2) . 
D.3. Proof of Proposition 3

For the first statement, we follow a similar argument to the one made by Ramsauer et al. (2021) in the proof of their Theorem A.3; however, their proof has a mistake, which we correct here.⁵ Given a separation angle $\alpha_{\min}$, we lower bound the number of patterns $N$ we can place on the sphere separated by at least this angle. Estimating this quantity is an important open problem in combinatorics, related to determining the size of spherical codes (of which kissing numbers are a particular case; Conway & Sloane, 2013). We invoke a lower bound due to Chabauty (1953), Shannon (1959), and Wyner (1965) (see also Jenssen et al. (2018) for a tighter bound), who show that
$$N \ge (1 + o(1))\,\sqrt{2\pi D}\;\frac{\cos\alpha_{\min}}{(\sin\alpha_{\min})^{D-1}}.$$
For $\alpha_{\min} = \frac{\pi}{3}$, which corresponds to the kissing number problem, we obtain the bound
$$N \ge (1 + o(1))\,\frac{\sqrt{2\pi D}}{2}\left(\frac{2}{\sqrt{3}}\right)^{D-1}.$$
In this scenario, we have $\Delta_i = M^2(1 - \cos\alpha_{\min})$ by the definition of $\Delta_i$. From Proposition 2, we have exact retrieval under $\epsilon$-perturbations if $\Delta_i \ge m\beta^{-1} + 2M\epsilon$. Combining the two expressions, we obtain
$$\epsilon \le \frac{M}{2}(1 - \cos\alpha_{\min}) - \frac{m}{2\beta M}.$$
Setting $\alpha_{\min} = \frac{\pi}{3}$, we obtain
$$\epsilon \le \frac{M}{2}\cdot\frac{1}{2} - \frac{m}{2\beta M} = \frac{M}{4} - \frac{m}{2\beta M}.$$
For the right hand side to be positive, we must have $M^2 > 2m/\beta$.

Assume now that the patterns are placed uniformly at random on the sphere. From Brauchart et al. (2018) we have, for any $\delta > 0$,
$$P\Big(N^{\frac{2}{D-1}}\alpha_{\min} \ge \delta\Big) \ge 1 - \frac{\kappa_{D-1}}{2}\,\delta^{D-1}, \quad \text{with } \kappa_D := \frac{1}{D\sqrt{\pi}}\,\frac{\Gamma((D+1)/2)}{\Gamma(D/2)}. \qquad (25)$$
Given our failure probability $p$, we need to have
$$P\big(M^2(1 - \cos\alpha_{\min}) \ge m\beta^{-1} + 2M\epsilon\big) \ge 1 - p, \qquad (26)$$
which is equivalent to
$$P\left(N^{\frac{2}{D-1}}\alpha_{\min} \ge N^{\frac{2}{D-1}}\arccos\Big(1 - \frac{m}{\beta M^2} - \frac{2\epsilon}{M}\Big)\right) \ge 1 - p.$$
Therefore, we set
$$p = \frac{\kappa_{D-1}}{2}\,\delta^{D-1} = \frac{\kappa_{D-1}}{2}\,N^2\arccos^{D-1}\Big(1 - \frac{m}{\beta M^2} - \frac{2\epsilon}{M}\Big).$$
Choosing $N = \sqrt{\frac{2p}{\kappa_{D-1}}}\,\zeta^{\frac{D-1}{2}}$ patterns for some $\zeta > 1$, we obtain
$$1 = \zeta\arccos\Big(1 - \frac{m}{\beta M^2} - \frac{2\epsilon}{M}\Big).$$
Therefore, the failure rate $p$ is attainable provided the perturbation error satisfies
$$\epsilon \le \frac{M}{2}\Big(1 - \cos\frac{1}{\zeta}\Big) - \frac{m}{2\beta M}. \qquad (30)$$
For the right hand side to be positive, we must have $\cos\frac{1}{\zeta} < 1 - \frac{m}{\beta M^2}$, i.e., $\zeta < \frac{1}{\arccos\big(1 - \frac{m}{\beta M^2}\big)}$.

⁵ Concretely, Ramsauer et al. (2021) claim that, given a separation angle $\alpha_{\min}$, we can place $N = (2\pi/\alpha_{\min})^{D-1}$ patterns equidistant on the sphere, but this is not correct.
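To get a feel for the quantities appearing in this proof, the snippet below evaluates the spherical-code lower bound (dropping the $(1+o(1))$ factor) and the admissible perturbation radius from Proposition 3 at $\alpha_{\min} = \pi/3$; the chosen values of $D$, $M$, $m$, and $\beta$ are arbitrary illustrative assumptions.

```python
import numpy as np
from math import pi

def capacity_lower_bound(D, alpha_min):
    """Chabauty/Shannon/Wyner-style lower bound on the number of patterns
    separated by at least alpha_min on the sphere in R^D (leading term only)."""
    return np.sqrt(2 * pi * D) * np.cos(alpha_min) / np.sin(alpha_min) ** (D - 1)

def max_perturbation(M, m, beta, alpha_min):
    """Admissible perturbation radius:
    eps <= (M/2) * (1 - cos(alpha_min)) - m / (2 * beta * M)."""
    return 0.5 * M * (1 - np.cos(alpha_min)) - m / (2 * beta * M)

D, M, m, beta = 64, 1.0, 1.0, 8.0          # arbitrary illustrative values
alpha = pi / 3                             # kissing-number regime
print(f"patterns storable (lower bound): {capacity_lower_bound(D, alpha):.3e}")
print(f"admissible perturbation eps:     {max_perturbation(M, m, beta, alpha):.4f}")
```

As the text notes, the perturbation bound is only positive when $M^2 > 2m/\beta$, which the example values satisfy.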
D.4. Proof of Proposition 4

The first statement in the proposition is stated and proved by Blondel et al. (2020) as a corollary of their Proposition 8. We prove here a more general version, which includes the second statement as a novel result. Using Blondel et al. (2020, Proposition 8), the structured margin of $L_\Omega$ is given by the expression
$$m = \sup_{y \in \mathcal{Y},\ \mu \in \mathrm{conv}(\mathcal{Y})} \cdots,$$
if the supremum exists. For Sparse MAP, using $\Omega(\mu) = \tfrac{1}{2}\|\mu_F\|^2$ for any $\mu \in \mathrm{conv}(\mathcal{Y})$, and using the fact that $\|y\| = r$ for any $y \in \mathcal{Y}$, we obtain
$$m = \sup_{y \in \mathcal{Y},\ \mu \in \mathrm{conv}(\mathcal{Y})} \cdots \ \overset{(*)}{\le}\ \cdots \ \overset{(**)}{\le}\ 1 - \inf_{y \in \mathcal{Y},\ \mu \in \mathrm{conv}(\mathcal{Y})} \cdots,$$
where the inequality $(*)$ follows from the convexity of $\tfrac{1}{2}\|\cdot\|^2$, which implies that $\tfrac{1}{2}\|\mu_F\|^2 \le \tfrac{1}{2}\|y_F\|^2 = \tfrac{1}{2}r_F^2$; and the inequality $(**)$ follows from the fact that both the numerator and the denominator in the second term are non-negative, the latter due to the Cauchy–Schwarz inequality and the fact that $\|\mu\| \le r$. This proves the second part of Proposition 4.

To prove the first part, note first that, if there are no higher-order interactions, then $r_F = 0$ and $\mu_F$ is an empty vector, which implies that $(*)$ is an equality. We now prove that, in this case, $(**)$ is also an equality, which implies that $m = 1$. We do that by showing that, for any $y \in \mathcal{Y}$, we have $\inf_{\mu \in \mathrm{conv}(\mathcal{Y})} y^\top(y - \mu) = 0$. Indeed, $y^\top(y - \mu) \ge 0$ for any $\mu \in \mathrm{conv}(\mathcal{Y})$, since $y$ is binary and $\mu$ has entries in $[0,1]$; and, choosing $\mu = t y' + (1-t) y$ for an arbitrary $y' \in \mathcal{Y} \setminus \{y\}$ and letting $t \to 0^+$, we obtain $y^\top(y - \mu) = t\, y^\top(y - y') \to 0$.

D.5. Proof of Proposition 5

A point $q$ is stationary iff it satisfies $q = X^\top\hat{y}_\Omega(\beta Xq)$. Therefore, $X^\top y_i$ is guaranteed to be a stationary point if⁶ $\hat{y}_\Omega(\beta X X^\top y_i) = y_i$, which is equivalent to zero loss, i.e., to the existence of a margin,
$$\underbrace{\beta\, y_i^\top X X^\top (y_i - y_j)}_{\ge\ \beta\Delta_i} \ \ge\ \tfrac{1}{2}\|y_i - y_j\|^2 \quad \text{for all } j.$$
Since we are assuming $\Delta_i \ge \frac{D_i^2}{2\beta} \ge \frac{\|y_i - y_j\|^2}{2\beta}$ for all $j$, we have $\beta\Delta_i \ge \tfrac{1}{2}\|y_i - y_j\|^2$ for all $j$, which implies the margin condition above.

If the initial query satisfies $q^\top X^\top(y_i - y_j) \ge \frac{D_i^2}{2\beta}$ for all $j \ne i$, we have again from the margin property that $\hat{y}_\Omega(\beta Xq) = y_i$, which ensures convergence in one step to $X^\top y_i$. If $q$ is $\epsilon$-close to $X^\top y_i$, then we have $q = X^\top y_i + \epsilon r$ for some vector $r$ with $\|r\| = 1$. Therefore, we have
$$q^\top X^\top(y_i - y_j) = (X^\top y_i + \epsilon r)^\top X^\top(y_i - y_j) \ge \Delta_i + \epsilon\, r^\top X^\top(y_i - y_j). \qquad (31)$$
We now bound $r^\top X^\top(y_i - y_j)$ in two possible ways. Using the Cauchy–Schwarz inequality, we have $|r^\top X^\top(y_i - y_j)| \le \|Xr\|\,\|y_i - y_j\| \le \sigma_{\max}(X)\, D_i$, where $\sigma_{\max}(X)$ denotes the largest singular value of $X$, i.e., the spectral norm of $X$. On the other hand, denoting $R_i := \max_j \|y_i - y_j\|_1$, we can also use Hölder's inequality to obtain $|r^\top X^\top(y_i - y_j)| \le \|Xr\|_\infty\,\|y_i - y_j\|_1 \le M R_i$, where we used the fact that $\|Xr\|_\infty = \max_k |x_k^\top r| \le \max_k \|x_k\|\,\|r\| = M$. Combining the two inequalities, we obtain
$$q^\top X^\top(y_i - y_j) \ge \Delta_i - \epsilon \min\{\sigma_{\max}(X) D_i,\, M R_i\}.$$
Using the assumption that $\Delta_i \ge \frac{D_i^2}{2\beta} + \epsilon\min\{\sigma_{\max}(X) D_i,\, M R_i\}$, we obtain $q^\top X^\top(y_i - y_j) \ge \frac{D_i^2}{2\beta}$, which from the previous points ensures convergence to $X^\top y_i$ in one iteration. The result follows by noting that, since the elements of $\mathcal{Y}$ are binary vectors, we have $R_i = D_i^2$.

⁶ But not necessarily only if: in general, we could have $X^\top y_i$ in the convex hull of the other pattern associations.

E. Additional Details and Experiments

E.1. Sparse and structured transformations

We show additional sparse and structured transformations and their regularization paths in Figure 5.

Figure 5: Sparse and structured transformations used in this paper and their regularization paths: $\alpha$-entmax for $\alpha \in \{1.0, 1.5, 2.0, 10.0, 10000.0\}$ (top row), $\alpha$-normmax for $\alpha \in \{1.0, 2.0, 5.0, 10.0, 10000.0\}$ (middle row), and Sparse MAP with $k$-subsets for $k \in \{1, \dots, 5\}$ (bottom row). In each plot, we show the five coordinates of $\hat{y}_\Omega(\beta\theta) = \hat{y}_{\beta^{-1}\Omega}(\theta)$ as a function of the temperature $\beta^{-1}$, where $\theta = [1.0716, 1.1221, 0.3288, 0.3368, 0.0425]^\top$.

The difference between the entmax and normmax regularizers is subtle but important: when $\alpha \to \infty$, the entmax regularizer vanishes and entmax becomes argmax, returning a one-hot vector. However, when $\alpha \to \infty$, the normmax regularizer becomes $\Omega(y) = -1 + \max_i y_i$ and the transformation returns a uniform distribution with a sparse support, as shown in the rightmost plot of the middle row of Figure 5. On the other hand, when $\alpha \to 1$, entmax becomes softmax, but the normmax regularizer vanishes and normmax becomes argmax. Sparse MAP with $k$-subsets, as $\beta$ increases, tends to return a $k$-hot vector; for $k = 1$ it corresponds to entmax with $\alpha = 2$ (sparsemax).

E.2. Hopfield dynamics and basins of attraction

In Figure 6, we provide additional plots with varying $\beta$ values. In particular, as $\beta$ increases, the optimization trajectories tend to converge to a single pattern, in contrast with the prevalence of metastable states observed for smaller $\beta$ values. Moreover, the basins of attraction become increasingly colorful with higher $\beta$ values (i.e., fewer white, metastable regions), consistent with the convergence behaviour just described.

Figure 6: Top: contours of the energy function and the optimization trajectory of the CCCP iteration (left), alongside the attraction basins associated with each pattern (right), for $\beta = 4$ and for 1.5-entmax, 2-entmax, 2-normmax, and 5-normmax. (Note that white sections do not converge to a single pattern but to a metastable state.) Bottom: the same representation for $\beta = 10$.

F. Experimental Details

F.1. MNIST K-MIL

For K-MIL, we created 4 datasets by grouping the MNIST examples into bags, for $K \in \{1, 2, 3, 5\}$. A bag is positive if it contains at least $K$ targets, where the target is the digit 9 (we chose 9 as it is easily confused with 7 or 4). The embedding architecture is the same as in Ilse et al. (2018), but instead of attention-based pooling we use our $\alpha$-entmax pooling, with $\alpha = 1$ mirroring the pooling method of Ramsauer et al. (2021) and $\alpha = 2$ corresponding to the pooling of Hu et al. (2023). Additionally, we incorporate $\alpha$-normmax pooling and Sparse MAP pooling with $k$-subsets.
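To make the bag construction concrete, the sketch below samples the composition of one K-MIL bag following the sampling scheme described in Table 4 below (bag size $L_i = \max\{K, L'_i\}$ with $L'_i \sim \mathcal{N}(\mu, \sigma^2)$, and the number of targets drawn uniformly from $\{K, \dots, L_i\}$ for positive bags and from $\{0, \dots, K-1\}$ for negative bags). The helper name and the choice to return digit labels only (rather than the MNIST images themselves) are ours.

```python
import numpy as np

def sample_bag(K, mu, sigma, positive, rng):
    """Sample the label composition of one MNIST K-MIL bag.

    Returns a list of digit labels; instances with label 9 are targets.
    """
    L = max(K, int(round(rng.normal(mu, sigma))))            # bag size
    n_pos = rng.integers(K, L + 1) if positive else rng.integers(0, K)
    non_targets = [d for d in range(10) if d != 9]
    labels = [9] * n_pos + list(rng.choice(non_targets, size=L - n_pos))
    return list(rng.permutation(labels))

rng = np.random.default_rng(0)
bag = sample_bag(K=2, mu=11, sigma=2, positive=True, rng=rng)
print(len(bag), sum(d == 9 for d in bag))   # bag size and number of targets (>= 2 here)
```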
Further details of the K-MIL datasets are shown in Table 4. We train the models with 5 different random seeds, the first of which is used for tuning the hyperparameters. The reported test accuracies are the average across these seeds. We use 500 bags for testing and 500 bags for validation.

Table 4: Dataset sample details for the MNIST K-MIL experiment. The size $L_i$ of the $i$th bag is determined through $L_i = \max\{K, L'_i\}$, where $L'_i \sim \mathcal{N}(\mu, \sigma^2)$. The number of positive instances in a bag is uniformly sampled between $K$ and $L_i$ for positive bags and between $0$ and $K-1$ for negative bags.

Dataset          µ    σ    Features    Pos. training bags    Neg. training bags
MNIST, K = 1     10   1    28 × 28     1000                  1000
MNIST, K = 2     11   2    28 × 28     1000                  1000
MNIST, K = 3     12   3    28 × 28     1000                  1000
MNIST, K = 5     14   5    28 × 28     1000                  1000

Table 5: Hyperparameter space for the MNIST MIL experiment. Hidden size is the dimension of keys and queries, and γ is a parameter of the exponential learning rate scheduler (Li & Arora, 2020).

Parameter          Range
learning rate      {10^-5, 10^-6}
γ                  {0.98, 0.96}
hidden size        {16, 64}
number of heads    {8, 16}
β                  {0.25, 0.5, 1.0, 2.0, 4.0, 8.0}
bag dropout        {0.0, 0.75}

The hyperparameters are tuned via grid search, with the grid space shown in Table 5. We apply a dropout hyperparameter, commonly referred to as bag dropout, to the Hopfield association matrix to mitigate the risk of overfitting (as done by Ramsauer et al. (2021)). All models were trained for 50 epochs. We incorporated an early-stopping mechanism, with patience 5, that selects the optimal checkpoint based on performance on the validation set.

F.2. MIL benchmarks

The MIL benchmark datasets (Fox, Tiger, and Elephant) comprise preprocessed and segmented color images sourced from the Corel dataset (Ilse et al., 2018). Each image is composed of distinct segments or blobs, each defined by descriptors such as color, texture, and shape. The datasets include 100 positive and 100 negative example images, with the negative ones randomly selected from a pool of photos featuring various other animals.

The Hopfield pooling layers (α-entmax; α-normmax; Sparse MAP with k-subsets) take as input a collection of embedded instances, along with a trainable yet constant query. This query pattern is used to average class-indicative instances, thereby compressing bags of variable size into a consistent representation. This compression is important for effectively discriminating between different bags. To tune the model, a manual hyperparameter search was conducted on a validation set.

For the tasks involving Elephant, Fox, and Tiger, we followed an architecture similar to that of Ramsauer et al. (2021):
1. The first two layers are fully connected linear embedding layers with ReLU activation.
2. The output of the second layer serves as the input to the Hopfield pooling layer, where the pooling operation is executed.
3. Subsequently, we employ a single linear output layer with a sigmoid for classification.

During the hyperparameter search, various configurations were tested, including different hidden layer widths and learning rates. Particular attention was given to the hyperparameters of the Hopfield pooling layers, such as the number of heads, the head dimension, and the inverse temperature β. To avoid overfitting, bag dropout (dropout on the attention weights) was used as the regularization technique. All hyperparameters tested are shown in Table 6.
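A minimal PyTorch sketch of this three-stage architecture is shown below, using sparsemax pooling with a trainable constant query. The layer sizes, the 230-dimensional instance features, and all class and function names are illustrative assumptions of ours, not the released configuration, and the other pooling variants (α-normmax, Sparse MAP) would replace the sparsemax call.

```python
import torch
import torch.nn as nn

def sparsemax(z, dim=-1):
    """Sparsemax along `dim`, implemented via sorting (Martins & Astudillo, 2016)."""
    z_sorted, _ = torch.sort(z, dim=dim, descending=True)
    k = torch.arange(1, z.size(dim) + 1, device=z.device, dtype=z.dtype)
    k = k.view([-1 if d == (dim % z.dim()) else 1 for d in range(z.dim())])
    cssv = z_sorted.cumsum(dim) - 1.0
    support = (z_sorted * k > cssv).to(z.dtype)
    k_support = support.sum(dim, keepdim=True)
    tau = cssv.gather(dim, k_support.long() - 1) / k_support
    return torch.clamp(z - tau, min=0.0)

class SparseHopfieldPooling(nn.Module):
    """Pool a bag of instance embeddings with a learnable constant query."""
    def __init__(self, dim, hidden, beta=1.0):
        super().__init__()
        self.q = nn.Parameter(torch.randn(hidden))        # trainable constant query
        self.keys = nn.Linear(dim, hidden, bias=False)
        self.values = nn.Linear(dim, hidden, bias=False)
        self.beta = beta

    def forward(self, bag):                               # bag: (L, dim)
        scores = self.beta * self.keys(bag) @ self.q      # (L,)
        p = sparsemax(scores, dim=-1)                     # sparse attention over instances
        return p @ self.values(bag)                       # (hidden,)

class MILClassifier(nn.Module):
    def __init__(self, in_dim=230, emb=128, hidden=64, beta=1.0):
        super().__init__()
        self.embed = nn.Sequential(nn.Linear(in_dim, emb), nn.ReLU(),
                                   nn.Linear(emb, emb), nn.ReLU())
        self.pool = SparseHopfieldPooling(emb, hidden, beta)
        self.out = nn.Linear(hidden, 1)

    def forward(self, bag):                               # bag: (L, in_dim)
        return torch.sigmoid(self.out(self.pool(self.embed(bag))))

model = MILClassifier()
prob = model(torch.randn(12, 230))   # a bag of 12 instances -> bag-level probability
```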
We trained for 50 epochs with early stopping (patience 5), using the Adam optimizer (Loshchilov & Hutter, 2017) with exponential learning rate decay. Model validation was conducted through 10-fold nested cross-validation, repeated five times with different data splits, where the first seed is used for hyperparameter tuning. The reported test ROC AUC scores represent the average across these repetitions.

Table 6: Hyperparameter space for the MIL benchmark experiments. Hidden size is the space in which keys and queries are associated, and γ is a parameter of the exponential learning rate scheduler.

Parameter               Range
learning rate           {10^-3, 10^-5}
γ                       {0.98, 0.96}
embedding dimensions    {32, 128}
embedding layers        {2}
hidden size             {32, 64}
number of heads         {12}
β                       {0.1, 1, 10}
bag dropout             {0.0, 0.75}

Table 7: Text rationalization results. We report mean and min/max F1 scores across five random seeds on test sets for all datasets but Beer, for which we report MSE. All entries except Sparse MAP are taken from Guerreiro & Martins (2021). We also report human rationale overlap (HRO) as an F1 score. We bold the best-performing rationalized model(s).

Method             Rationale               SST             AG News         IMDB            Beer               Beer (HRO)
SFE                top-k                   .76 (.71/.80)   .92 (.92/.92)   .84 (.72/.88)   .018 (.016/.020)   .19 (.13/.30)
                   contiguous              .71 (.68/.75)   .86 (.85/.86)   .65 (.57/.73)   .020 (.019/.024)   .35 (.18/.42)
SFE w/ Baseline    top-k                   .78 (.76/.80)   .92 (.92/.93)   .82 (.72/.88)   .019 (.017/.020)   .17 (.14/.19)
                   contiguous              .70 (.64/.75)   .86 (.84/.86)   .76 (.73/.80)   .021 (.019/.025)   .41 (.37/.42)
Gumbel             top-k                   .70 (.67/.72)   .78 (.73/.84)   .74 (.71/.78)   .026 (.018/.041)   .27 (.14/.39)
                   contiguous              .67 (.67/.68)   .77 (.74/.81)   .72 (.72/.73)   .043 (.040/.048)   .42 (.41/.42)
HardKuma           -                       .80 (.80/.81)   .90 (.87/.88)   .87 (.90/.91)   .019 (.016/.020)   .37 (.00/.90)
Sparse Attention   sparsemax               .82 (.81/.83)   .93 (.93/.93)   .89 (.89/.90)   .019 (.016/.021)   .48 (.41/.55)
                   fusedmax                .81 (.81/.82)   .92 (.91/.92)   .88 (.87/.89)   .018 (.017/.019)   .39 (.29/.53)
SPECTRA            sequential k-subsets    .80 (.79/.81)   .92 (.92/.93)   .90 (.89/.90)   .017 (.016/.019)   .61 (.56/.68)
Sparse MAP         k-subsets               .81 (.81/.82)   .93 (.92/.93)   .90 (.90/.90)   .017 (.017/.018)   .42 (.29/.62)
                   sequential k-subsets    .81 (.80/.83)   .93 (.93/.93)   .90 (.90/.90)   .020 (.018/.021)   .63 (.49/.70)

F.3. Text Rationalization

We use the same hyperparameters reported by Guerreiro & Martins (2021). We used a head dimension of 200, to match the dimension of the encoder vectors (the size of the projection matrices associated with the static query and keys), and a head dropout of 0.5 (applied to the output of the Hopfield layer). We used a single attention head to better match the SPECTRA model. Additionally, we use a transition score of 0.001 and a train temperature of 0.1.

We present in Table 7 an extended version of Table 3 with additional baselines, corresponding to prior work in text rationalization. We show additional examples of rationales extracted by our models in Figure 7.
[Figure 7 shows three highlighted beer reviews for the aspect "appearance", each displayed twice: once with the rationale extracted by Sparse MAP with sequential k-subsets and once with the rationale extracted by Sparse MAP with k-subsets.]

Figure 7: Examples of human rationale overlap for the aspect "appearance". The yellow highlight indicates the model's rationale, while italicized and bold font represents the human rationale. Red font identifies mismatches with the human annotations. Sparse MAP with sequential k-subsets prefers more contiguous rationales, which better match the human ones.