# PROVABLE MEMORIZATION CAPACITY OF TRANSFORMERS

Published as a conference paper at ICLR 2023

Junghwan Kim, CSE Department, University of Michigan, Ann Arbor, MI (kimjhj@umich.edu)
Michelle Young Jin Kim, CSE Department, Michigan State University, East Lansing, MI (kimmic16@msu.edu)
Barzan Mozafari, CSE Department, University of Michigan, Ann Arbor, MI (mozafari@umich.edu)

ABSTRACT

Quantifying memorization capacity is essential for understanding the expressiveness and generalizability of deep learning model architectures. However, the memorization capacity of the Transformer architecture has yet to be explored. In this work, we present the first study of the memorization capacity of the Transformer architecture. We prove that Transformers are capable of memorizing N sequence-to-sequence mappings of length n with d-dimensional input tokens using Õ(d + n + √(nN)) parameters. Our theory supports memorization both with and without permutation equivariance, utilizing positional encodings in the latter case. Building on our theory, we also analyze the memorization capacity of Transformers in the sequence classification and language modeling tasks. To verify these theoretical findings, we conduct experiments analyzing the memorization capacity of Transformers in the natural language domain.

1 INTRODUCTION

Transformer networks (Vaswani et al., 2017) have shown tremendous success in natural language processing tasks (Devlin et al., 2019; Yang et al., 2019; Brown et al., 2020; Fedus et al., 2022), rapidly becoming the standard architecture for natural language modeling. The success of Transformers has also transferred to various other sequence and set modeling tasks, including image recognition (Parmar et al., 2018; Dosovitskiy et al., 2021), semantic segmentation (Zheng et al., 2021), video understanding (Akbari et al., 2021; Bertasius et al., 2021), reinforcement learning (Parisotto et al., 2020; Chen et al., 2021; Janner et al., 2021), 3D point cloud processing (Zhao et al., 2021), protein structure prediction (Jumper et al., 2021), and automatic theorem proving (Polu & Sutskever, 2020).

Despite this success across various areas, the theoretical understanding of Transformers lags behind that of standard fully-connected networks. The major strength of Transformers lies in their efficient scaling, which is enabled through parallel token processing with parameter sharing and simple dot-product-based token interaction. Surprisingly, even though the parameter sharing and simple token interaction impose constraints on the function space of Transformers, Yun et al. (2020a) show that Transformers can approximate any continuous function from input to output sequences. However, their result focuses on the function approximation capacity with infinite precision, leaving the finite sample memorization capacity with finite precision unexplored. We note that universal function approximation does not automatically imply efficient memorization in terms of the number of parameters. Generalizing infinite precision results to the finite precision case is not straightforward and may not be possible in some cases. For example, Transformers are Turing complete only with infinite precision (Pérez et al., 2019), but not with finite precision (Dehghani et al., 2019).

Understanding the memorization capacity of a model is critical for choosing an appropriate model size.
Practitioners often choose a model size with enough representation capacity to achieve zero training loss (i.e., a size larger than the memorization capacity). Moreover, the memorization capacity has generalization implications, as observed in the double descent phenomenon (Belkin et al., 2019; Nakkiran et al., 2021): as the network size increases, generalization performance exhibits a bias-variance tradeoff until memorization becomes possible and then improves monotonically afterward. Understanding the memorization capacity of Transformers requires answers to the following questions: How large should the size and precision of the Transformer architecture be to enable memorization of any given number of input-output sequence pairs? How does the memorization capacity of Transformers differ for various problem settings in practical application scenarios?

In this paper, we answer these questions by proving that Transformers can memorize N sequences of d-dimensional tokens with length n using Õ(d + n + √(nN)) parameters. Our proof constructs permutation equivariant Transformers that can memorize all permutations of N input sequences. We extend this construction to memorization without permutation equivariance by adding positional encodings. In addition, we derive the memorization capacity for the sequence classification task from our proposed theory. The key technical component of our construction is an efficient contextual mapping, which requires only n self-attention layers. Our contextual mapping also applies to sparse-attention Transformers, making fewer assumptions on sparsity patterns than Yun et al. (2020b) and Zaheer et al. (2020). Furthermore, we present a generalization of our contextual mapping to function approximation settings, vastly improving the parameter efficiency of attention layers compared to the selective-shifting-based contextual mapping in Yun et al. (2020a). Our main contributions are summarized as follows:

- We prove the memorization capacity of Transformers for sequence-to-sequence mappings with and without permutation equivariance.
- We analyze the memorization capacity in other standard task settings, such as sequence classification and language modeling.
- We show that the efficient contextual mapping presented in our theoretical analysis extends to sparse attention settings and improves the function approximation results.
- We provide experiments validating the memorization capacity of Transformers for token classification and sequence classification tasks.

1.1 RELATED WORKS

Memorization capacity. Characterizing the memorization capacity of neural networks has been an active research area with a long history (Baum, 1988; Sontag, 1997; Huang & Babri, 1998; Huang, 2003; Zhang et al., 2017; Yun et al., 2019; Bubeck et al., 2020; Vershynin, 2020; Rajput et al., 2021; Park et al., 2021; Vardi et al., 2022). Recently, Park et al. (2021) constructed neural networks with O(N^{2/3}) parameters to memorize N data points. They bypass the Ω(N) lower bound in Sontag (1997) by assuming a simple separation (i.e., ‖x_i − x_j‖ ≥ δ for i ≠ j). Vardi et al. (2022) improve further, showing that O(N^{1/2}) parameters are sufficient. They also prove the matching lower bound of Ω(N^{1/2}) through a VC-dimension analysis. Inspired by Park et al. (2021) and Vardi et al. (2022), our construction assumes a similar separation, but between pairs of distinct tokens rather than between whole sequence pairs. (See Definition 3.1 and the discussion that follows the definition.)
In addition, our construction uses the same pipeline of projection, string matching, and bit extraction as in Vardi et al. (2022). However, we introduce an additional critical step: efficient contextual mapping to complement projection by summarizing all token information via self-attention layers. In contrast to the extensive results on fully-connected networks, there are few studies on the memorization capacity of specific modern architectures. Hardt & Ma (2017) show that the residual network with Re LU activation and O(N) hidden neurons can memorize N data points under the separation assumption. Nguyen & Hein (2018) show that the convolutional network with O(N) hidden neurons can memorize N data points. To the best of our knowledge, there is no existing literature on the memorization capacity of the Transformer architecture. Transformer expressivity. Given the recent empirical success of Transformers observed across multiple areas, several papers have studied the expressivity of Transformers. Yun et al. (2020a) establish the first universal approximation theorem for Transformers, and the result is later extended to sparse-attention Transformers (Yun et al., 2020b; Zaheer et al., 2020) and Transformers with hard constraints (Kratsios et al., 2022). All these results study the function approximation but not Published as a conference paper at ICLR 2023 the finite sample memorization as in our paper. We also note that our construction can reduce the number of self-attention layers in the function approximation setting. There are other lines of studies focusing on different aspects of the representation capacity of Transformers. Some papers aim to characterize the representation capacity of a single self-attention layer. Bhojanapalli et al. (2020) suggest that the small size of attention heads limits the rank of a selfattention matrix. Dong et al. (2021) show that the rank of self-attention decays exponentially when self-attention layers are composed without skip-connections or feedforward layers. Likhosherstov et al. (2021) show that a fixed self-attention module can approximate any sparsity pattern. Other papers investigate a tradeoff between width and depth since it is a crucial issue when scaling Transformers. Levine et al. (2020) demonstrate the depth efficiency in modeling feature interaction through the separation rank analysis of Transformers. Wies et al. (2021) identify the rank of the input embedding matrix as a bottleneck for the network width s contribution to expressivity. Finally, Vuckovic et al. (2020) and Kim et al. (2021) study the Lipschitz smoothness of attention operations. Edelman et al. (2022) derive the norm-based generalization bounds from the Lipschitz smoothness of norm-bounded attention layers to analyze the inductive bias of attention layers. Wei et al. (2021) study the expressivity of Transformers under the constraint of statistical learnability. Our memorization study provides a complementary understanding of the capacity of Transformers. 2 PRELIMINARIES This section establishes the notation and defines the Transformer architecture. 2.1 NOTATION Denote the number of input-output pairs as N, the number of output classes as C, the token embedding dimension as d, and the sequence length as n. We use O( ) to hide logarithmic factors and O( ) to hide constant factors. We use σR to represent the Re LU activation function. We let σS and σH be the softmax and hardmax operators, respectively. 
These operators take a matrix as input, apply the softmax/hardmax column-wise, and output a column-stochastic matrix of the same size. For m ∈ ℕ, we define [m] = {1, 2, …, m}. We use |S| to denote the number of elements in a set S. We denote a set or a function by an upper-case calligraphic letter, a matrix by an upper-case bold letter, and a vector by a lower-case bold letter. We denote the standard unit vector with all but the i-th coordinate 0 as e_i, the m-dimensional vector with all coordinates 1 as 1_m, and the m-dimensional vector with all coordinates 0 as 0_m. For a vector x, we represent its i-th entry as x[i] and its Euclidean norm as ‖x‖. For a matrix X, we use X[i, j], X[i, :], and X[:, j] to represent the (i, j)-th entry, the i-th row, and the j-th column, respectively. We use ‖X‖_F to represent the Frobenius norm of the matrix X.

2.2 TRANSFORMER ARCHITECTURE

We define a Transformer N : R^{d×n} → R^{1×n} of depth L as a composition of L Transformer blocks with input and output embedding mappings: N = E_out ∘ F_L ∘ ⋯ ∘ F_2 ∘ F_1 ∘ E_in, where each Transformer block F_l : R^{m×n} → R^{m×n} is a sequence-to-sequence function consisting of two subblocks: a self-attention subblock and a tokenwise feedforward subblock. The input embedding block E_in : R^{d×n} → R^{m×n} and the output embedding block E_out : R^{m×n} → R^{1×n} are 1-layer tokenwise linear mappings.

The self-attention subblock represents the interaction among tokens. Formally, given an input Z ∈ R^{m×n}, the self-attention subblock F^{(SA)}_l with h heads and head size k computes

F^{(SA)}_l(Z) = Z + Σ_{i=1}^{h} W^{(O)}_{l,i} W^{(V)}_{l,i} Z · σ_S[ (W^{(K)}_{l,i} Z)^T W^{(Q)}_{l,i} Z ],

where W^{(O)}_{l,i} ∈ R^{m×k} and W^{(V)}_{l,i}, W^{(K)}_{l,i}, W^{(Q)}_{l,i} ∈ R^{k×m} are the weight matrices parametrizing the self-attention subblock. We include a skip-connection in the self-attention subblock.

The feedforward subblock processes each token independently in parallel by applying two feedforward layers. Given an input H ∈ R^{m×n}, the feedforward subblock F^{(FF)}_l with dimension q computes

F^{(FF)}_l(H) = H + W^{(2)}_l σ_R( W^{(1)}_l H + b^{(1)}_l 1_n^T ) + b^{(2)}_l 1_n^T,

where W^{(2)}_l ∈ R^{m×q}, b^{(2)}_l ∈ R^m, W^{(1)}_l ∈ R^{q×m}, and b^{(1)}_l ∈ R^q parametrize the feedforward subblock. The feedforward subblock also includes a skip-connection. Finally, the Transformer block composes the two subblocks as F_l(Z) = F^{(FF)}_l(F^{(SA)}_l(Z)).

Unlike the original formulation in Vaswani et al. (2017), our definition excludes layer normalization, as in Yun et al. (2020a), to simplify our analysis. Since layer normalization mainly contributes to optimization without much effect on expressivity, our definition still captures the representation aspect of the Transformer architecture. Since each Transformer block consists of a fixed number of layers even in the most fine-grained sense, we use the number of blocks L as the depth of the network. We define the width of the network as max{m, kh, q}¹. The number of parameters is the number of non-zero weights in our network. We note that a single parameter is reused n times for a sequence of length n but is still counted as one parameter. The bit complexity of the network is the maximum bit complexity of its weights, where the bit complexity of a weight is the number of bits required to represent it. We adopt these definitions of the number of parameters and the bit complexity from the convention in the VC-dimension literature (Bartlett et al., 2019) and the recent paper on the optimal memorization capacity of fully-connected networks (Vardi et al., 2022).
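To make the definitions above concrete, the following is a minimal NumPy sketch of a single Transformer block as defined in this section (column-wise softmax, skip-connections in both subblocks, no layer normalization). The function and parameter names are ours, chosen for illustration; this is not the construction used in the proofs.

```python
import numpy as np

def softmax_cols(A):
    """Column-wise softmax: each column becomes a stochastic vector."""
    A = A - A.max(axis=0, keepdims=True)  # numerical stability
    E = np.exp(A)
    return E / E.sum(axis=0, keepdims=True)

def self_attention(Z, Wq, Wk, Wv, Wo):
    """Single-head self-attention subblock: Z is (m, n); Wq, Wk, Wv are (k, m); Wo is (m, k)."""
    scores = (Wk @ Z).T @ (Wq @ Z)                    # (n, n) dot-product scores
    return Z + Wo @ (Wv @ Z) @ softmax_cols(scores)   # skip-connection

def feedforward(H, W1, b1, W2, b2):
    """Tokenwise two-layer ReLU subblock applied to every column, with skip-connection."""
    hidden = np.maximum(W1 @ H + b1[:, None], 0.0)    # ReLU; bias broadcast over tokens
    return H + W2 @ hidden + b2[:, None]

def transformer_block(Z, p):
    """F_l = F^(FF) o F^(SA), as defined in Section 2.2 (single head shown)."""
    Z = self_attention(Z, p["Wq"], p["Wk"], p["Wv"], p["Wo"])
    return feedforward(Z, p["W1"], p["b1"], p["W2"], p["b2"])

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    m, n, k, q = 8, 5, 1, 16
    p = {"Wq": rng.normal(size=(k, m)), "Wk": rng.normal(size=(k, m)),
         "Wv": rng.normal(size=(k, m)), "Wo": rng.normal(size=(m, k)),
         "W1": rng.normal(size=(q, m)), "b1": rng.normal(size=q),
         "W2": rng.normal(size=(m, q)), "b2": rng.normal(size=m)}
    Z = rng.normal(size=(m, n))
    print(transformer_block(Z, p).shape)  # (8, 5)
```

A multi-head version would sum the attention term over the h heads; a single head suffices here since the constructions in our theorems use h = 1.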
3 MEMORIZATION CAPACITY OF TRANSFORMERS In this section, we describe the problem setting and present our main theorem on the memorization capacity of the Transformer architecture. Then, we sketch the proof for our main theorem and discuss the memorization capacity of Transformers in other standard task settings. 3.1 PROBLEM SETTING We consider the memorization of N input-output sequence pairs (X(1), Y (1)), , (X(N), Y (N)) where each input X(i) Rd n is a sequence of n token vectors in dimension d. Each output Y (i) [C]1 n is a sequence of n labels where each label Y (i)[1, k] is assigned to a token X(i)[:, k]. We define the context of each input sequence X(i) as V(i) = {v Rd : v = X(i)[:, k] for some k [n]}. We define the vocabulary V = S i [N] V(i) as the set of all tokens appearing in the input sequences. Note that |V| n N. As pointed out in Park et al. (2021) and Vardi et al. (2022), we must assume some conditions on the dataset to bypass the lower bound in Sontag (1997) and memorize N data points with o(N) parameters. We present a natural generalization of the separation condition defined in Vardi et al. (2022) to sequence modeling settings. Definition 3.1. Let r 1, 0 < δ 1. Let X(1), , X(N) Rd n be N input sequences with vocabulary V. We say that X(1), , X(N) are tokenwise (r, δ)-separated if 1. v r for all v V and 2. v v δ for all v, v V with v = v . This condition requires (1) each token to have a bounded norm and (2) each pair of distinct tokens to be separated. We note that the tokenwise separation is a stronger condition than the separation of input sequences2. However, the condition better captures many practical settings where the number of tokens in the vocabulary is much smaller than the number of input sequences. 1We recall that m is the embedding dimension, h is the number of attention heads, k is the attention head size, and q is the feedforward dimension. 2When X(1), , X(N) Rd n are distinct and (r, δ)-separated, then (1) X(i) F r n for i [N] and (2) X(i) X(j) F δ for i, j [N] with i = j. Published as a conference paper at ICLR 2023 For the permutation equivariant mappings, we need the following label consistency condition. Definition 3.2. Let (X(1), Y (1)), , (X(N), Y (N)) Rd n [C]1 n be N input-output pairs of sequences. We say that (X(1), Y (1)), , (X(N), Y (N)) are consistently labeled if X(i)[:, k] = X(i)[:, l] implies Y (i)[1, k] = Y (i)[1, l] for every i [N] and k, l [n]. We emphasize that we impose this condition only on the memorization with permutation equivariance but not on the memorization without permutation equivariance3. The condition implies that two identical tokens appearing in the same context should have the same label. Consider a permutation equivariant mapping F : Rd n R1 n and an input sequence X Rd n. Define X by swapping two tokens X[:, k] and X[:, l] in X. Then, we have F(X)[1, k] = F(X )[1, l] due to the permutation equivariance. However, if two tokens X[:, k] and X[:, l] were identical, then X = X and consequently F(X)[1, k] = F(X)[1, l]. 3.2 MAIN RESULTS We now present our main theorem on the memorization capacity of Transformers. Theorem 3.1. Let N, d, n, C N and r 1, 0 < δ 1. Let (X(1), Y (1)), , (X(N), Y (N)) Rd n [C]1 n be N input-output sequence pairs of sequences where input sequences are distinct and tokenwise (r, δ)-separated. 1. (With permutation equivariance) Suppose that contexts V(i) are distinct and sequences are consistently labeled. 
Then, there exists a Transformer network N : Rd n R1 n such that N(X(i)P ) = Y (i)P for every i [N] and for every permutation matrix P Rn n. 2. (Without permutation equivariance) There exists a Transformer network N : Rd n R1 n and positional encoding E Rd n such that N(X(i) + E) = Y (i) for every i [N]. In both cases, the Transformer N has width 16 (m = 8, h = k = 1 and q = 16), depth n N log(n N) + n N log(n N) max{log C, log R} and bit complexity bounded by n N log(n N) max{log C, log R} where we denote R := 8000r2δ 2dn5N 6. Theorem 3.1 shows that O(d + n + n N) parameters are enough to memorize N sequence-tosequence mapping of length n with token dimension d since the initial embedding layer has d parameters and the rest layers have a constant number of parameters. We provide the proof sketch in Section 3.3 and the full proof in Appendix A. Remark 3.2. Extensions to real vector outputs. Some application scenarios of Transformers require real vector outputs. As proposed in Park et al. (2021) and Vardi et al. (2022), extension to real scalar values is easily achievable by using O( 1 ϵ ) classes when the output has a bounded range. More concretely, we partition the output range into ϵ-length intervals and match each class to one partition to perform regression with error ϵ per token. This replaces log C in Theorem 3.1 with log( 1 3The general sequence-to-sequence mappings do not satisfy the condition. For example, the same word appearing multiple times in a sentence may have different meanings. Published as a conference paper at ICLR 2023 Similarly, extension to vector outputs is possible when the output has a bounded domain. Suppose that we aim to minimize the tokenwise L2 distances in dimension p. We partition the output range into ϵ p-length cubes and match each class to one cube. Then, we use O ( p ϵ )p classes to perform regression with error ϵ per token. This construction replaces log C in Theorem 3.1 with p log( p Remark 3.3. Large width and fixed bit complexity. Theorem 3.1 uses fixed width and large bit complexity to minimize the number of parameters. However, a common approach to scaling Transformers is to increase width (Levine et al., 2020) while using the same number of bits per parameter. Using a similar argument as Vardi et al. (2022), we extend Theorem 3.1 to the cases with a larger width and with bounded bit complexity. When the larger width is allowed, Transformers of width O(n N/L2), depth O(n + L) and bit complexity O(L) memorize the same dataset for some L n N. When the bit complexity is bounded, Transformers of width O(1), depth O(n+n N/B) and bit complexity O(B) memorize the same dataset for some B n N. We provide the formal theorem and the proof in Appendix B. Remark 3.4. Tightness in the order of bit counts. Suppose the token dimension d and the sequence length n are both O(N). Theorem 3.1 shows that a Transformer memorizes N input-output sequence pairs using O( n N) parameters of bit complexity O( n N), which sum up to O(n N) bits. Without any additional assumption on a dataset, representing models that memorize all Cn N possible labels4 for N input sequences requires Ω(n N) bits. Thus, Theorem 3.1 is tight upto logarithmic factors in the order of bit counts. Remark 3.5. Comparison against fully-connected Re LU networks. With a slight modification of results in Vardi et al. (2022)5, fully-connected Re LU networks require O(dn + n N) parameters. The dependence on the number of data points N is the same as O( n N). 
However, for the permutation equivariant case, Transformers memorize all permutations of each input sequence. That is, Transformers are capable of memorizing at most n! times more data points at the cost of reusing each parameter n times6. Our construction has a better dependence on d and n: O(d+n) for Transformers and O(dn) for fully-connected Re LU networks. Transformers exploit the structure of sequence data through parameter sharing. As a result, Transformers do not need O(dn) parameters to read all dn values in the input sequence. We note that this comparison is not completely fair because (1) our result makes a slightly stronger assumption of the separation between tokens than the separation between whole input; and (2) fullyconnected networks are not designed for permutation equivariant datasets. However, the difference in assumption affects the final bound only to the logarithmic terms. Furthermore, there is no known result on the memorization capacity of other permutation equivariant architectures. 3.3 PROOF SKETCH We outline the proof of Theorem 3.1. A more formal statement of each stage with detailed proof is in Appendix A. Our proof adopts the approach from Vardi et al. (2022) and shares similar steps. We discuss our technical novelty after sketching the main ideas of our proof. Our proof constructs a Transformer in 4 stages. The first two stages assemble input values and encode as a contextual token id that identify each token in the different context of sequence. We ensure that the contextual token id is permutation equivariant and that the contextual token id is uniquely assigned to each token in each context. Then, the last two stages map each contextual token id to the corresponding label using the bit-extraction network adopted from Vardi et al. (2022). We describe key ideas of each stage. 4There are C possible labels for n tokens in N sequences. 5We consider inputs as dn-dimensional flattened vectors and outputs as values from [Cn]. We choose a different balancing point between stages 2 and 3 in their construction to balance the extra factors. 6A similar benefit of Transformers has been previously observed in the function approximation (Yun et al., 2020a): the number of parameters is reduced by (n 1)! times to successfully approximate permutation equivariant functions. Published as a conference paper at ICLR 2023 Stage 1. Tokenwise Projection. We project each token vector to a scalar token id while keeping distinct tokens well separated. Stage 2. Contextual Mapping. We compose a sequence id as a linear combination of token ids. Weights of each id in the linear combination depend on the order of token id within each sequence. The resulting sequence ids are permutation invariant. 7 We concatenate the token id and the sequence id to obtain a contextual token id. Stage 3. String Lookup. We partition all n N contextual token ids into intervals, each containing the same number of ids. We construct two encoding numbers for each interval by concatenating all corresponding contextual token ids and token labels. Then, we find which group the contextual token id falls into and retrieve the corresponding encoding numbers. Stage 4. Bit Extraction. We extract each contextual token id and token label from the encoding numbers. If the extracted contextual token id agrees with the one composed in stage 2, then we output the corresponding token label. 
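As a rough illustration of how the four stages fit together, the following toy NumPy sketch mimics the data flow on a small random dataset: a projection turns tokens into scalar token ids (stage 1), a permutation-invariant combination of the sorted unique ids forms a sequence id (stage 2), and a plain dictionary stands in for the string-lookup and bit-extraction stages (stages 3 and 4), which are implemented by feedforward layers in the actual construction of Appendix A. All names are ours and the sketch is only conceptual.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, N = 4, 3, 5                              # token dim, sequence length, #sequences
X = rng.normal(size=(N, d, n))                 # toy dataset of N sequences
Y = rng.integers(0, 3, size=(N, n))            # one label per token

# Stage 1: tokenwise projection to a scalar token id (a random direction keeps
# distinct tokens separated with high probability).
u = rng.normal(size=d)
token_ids = np.einsum("d,idk->ik", u, X)       # shape (N, n)

# Stage 2: contextual mapping -- a sequence id built from the sorted unique
# token ids, hence invariant to any permutation of the tokens.
w = rng.normal(size=n)
def sequence_id(ids):
    uniq = np.sort(np.unique(np.round(ids, 6)))[::-1]   # unique ids, descending order
    return float(w[: len(uniq)] @ uniq)

# Stages 3-4: string lookup and bit extraction, played here by a dictionary
# keyed on the contextual token id (token id, sequence id).
lookup = {}
for i in range(N):
    s = round(sequence_id(token_ids[i]), 6)
    for k in range(n):
        lookup[(round(token_ids[i, k], 6), s)] = Y[i, k]

# Memorization check: every token in every sequence maps back to its label.
ok = all(
    lookup[(round(token_ids[i, k], 6), round(sequence_id(token_ids[i]), 6))] == Y[i, k]
    for i in range(N) for k in range(n)
)
print("memorized:", ok)
```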
Our technical novelty in this proof is in (1) the implementation of the contextual mapping in stage 2 using self-attention subblocks and (2) generalizing stages 3 and 4 to incorporate skip connections. Remark 3.6. Contextual mapping. The number of self-attention subblocks that our proof uses is proportional to the length of the sequence n but independent of the number of data points N. This requirement is in striking contrast to the selective-shifting-based contextual mapping from Yun et al. (2020a), which requires 1 δ dn layers for shifting each grid cell of side length δ. In the memorization setting, we may remove layers for grid cells without any data point. But the selective-shifting-based contextual mapping would still need n N self-attention subblocks, which alone is already larger than the number of required layers in our result. In contrast, our efficient contextual mapping uses n layers, improving all the above when δ, N > 1. 8 We showcase the benefit of our contextual mapping in Appendix C. Specifically, we show that our contextual mapping are capable of incorporating sparse self-attention settings with minimal parameter overhead. Moreover, we show that the same idea of our contextual mapping can be applied in the function approximation setting. Remark 3.7. Parameter contribution. In our construction, self-attention subblocks contribute only O(n) parameters while feedforward subblocks contribute O( n N). When n < N, most of the parameter count comes from feedforward subblocks. Although the self-attention layers play a critical role in contextual mapping, we do not need many of them. This explains the model design in practice, where more than half of the parameters are in the tokenwise feedforward subblocks (Vaswani et al., 2017; Devlin et al., 2019; Brown et al., 2020; Geva et al., 2021). Moreover, Mandava et al. (2020) observes that the self-attention layers can be further reduced without much performance loss. 4 MEMORIZATION CAPACITY IN OTHER TASKS In this section, we generalize Theorem 3.1 to analyze the memorization capacity for other standard task settings. We consider sequence classification in Theorem 4.1, masked language modeling in Theorem D.1 and autoregressive language modeling in Theorem D.2. Due to the limited space, formal results on language modeling tasks and the proofs of the theorems are provided in Appendix D. 4.1 SEQUENCE CLASSIFICATION In sequence classification, we assign a single label y(i) [C] for each input sequence X(i). We present our theorem for sequence classification task. Theorem 4.1. Let N, d, n, C N and r 1, 0 < δ 1. Let (X(1), y(1)), , (X(N), y(N)) Rd n [C] be N input-output pairs of sequences where input sequences are distinct and tokenwise (r, δ)-separated. 1. (With permutation invariance) Suppose that contexts V(i) are distinct. Then, there exists a Transformer network N : Rd n R1 n such that N(X(i)P )[1, k] = y(i) 7An appropriate choice of positional encoding bypasses permutation invariance by collecting token ids in the position order. See the second part of Theorem 3.1 for the result and Section A.5 for the details. 8The most practical settings fall into this regime. Published as a conference paper at ICLR 2023 for every i [N], k [n] and for every permutation matrix P Rn n. 2. (Without permutation invariance) There exists a Transformer network N : Rd n R1 n and positional encoding E Rd n such that N(X(i) + E)[1, k] = y(i) for every i [N], k [n]. 
In both cases, the Transformer N has width 16 (m = 6, h = k = 1 and q = 16), depth √(N log N) max{log C, log R} and bit complexity bounded by √(N log N) max{log C, log R}, where we denote R := 8000 r^2 δ^{-2} d n^5 N^6. Theorem 4.1 shows that Õ(d + n + √N) parameters are enough to memorize N sequence classification examples of length n with token dimension d. Compared to the sequence-to-sequence mapping, there is a √n factor of saving in the last term.

4.2 LANGUAGE MODELING

We consider two language modeling tasks commonly used for pre-training Transformers: masked language modeling and autoregressive language modeling. We consider the memorization of all possible length-n sequences obtainable from a given corpus of length T. The input is embedded in the d-dimensional space while the output is mapped to one of V tokens in the dictionary. The memorization of masked language modeling requires Õ(d + n + √(n^{m+1} m T)) parameters (Theorem D.1), while the memorization of autoregressive language modeling requires Õ(d + n + √T) parameters (Theorem D.2). Compared to autoregressive language modeling, masked language modeling has the additional factor n^m m that comes from memorizing all masking patterns separately and the factor n that comes from memorizing all masked tokens instead of one next token. For more details on the settings and formal statements of the results, we refer to Appendix D.

5 EXPERIMENTS

5.1 EXPERIMENTAL SETUP

We complement our theory with experiments on real-world datasets. We train encoder-only Transformer models (Vaswani et al., 2017) on a token classification task where each token is assigned a label, as in Theorem 3.1, and a sequence classification task where each sequence is assigned a label, as in Theorem 4.1. We study the relationship between the memorized dataset size and the model size. For token classification, we use 14,000 randomly selected examples among the 14,041 training examples in the named entity recognition dataset from CoNLL-2003 (Tjong Kim Sang & De Meulder, 2003). For sequence classification, we use 50,000 randomly selected examples among the 392,702 training examples in the MNLI dataset from the GLUE benchmark (Wang et al., 2019). We vary the model size through the embedding size m while fixing the number of layers at L = 6. We fix the number of attention heads at h = 12, the embedding-to-head-size ratio at m/k = h = 12, and the feedforward-to-embedding-size ratio at q/m = 4, as commonly done in practice. More details on the experiments are in Appendix F.

5.2 RESULTS

Figure 1 shows heatmaps of training errors as the dataset size and the model size vary. There is a clear trend that the training error is smaller (darker in color) for the smaller dataset size and the larger model size.

Figure 1: Heatmaps of training errors. We show the color-coded training errors as the dataset size and the model size vary. The dataset size is represented on the Y-axis while the model size is represented on the X-axis as the embedding size. The training error tends to get better for the smaller dataset size and the larger model size. Panels: (a) Token Classification; (b) Sequence Classification.

Figure 2: The number of parameters required for memorization. We show the number of parameters (Y-axis) of the smallest model that achieves less than 0.005 training error for each dataset size (X-axis). We observe a linear trend in the number of parameters with a subtle concavity as our theory predicts, but only at the low data regime. Panels: (a) Token Classification; (b) Sequence Classification.

To see the relationship between the model size and the dataset size more clearly, we plot in Figure 2 the minimum model size that achieves less than 0.005 training error for each dataset size. In general, there is a linear increase in the model size as the memorized dataset size increases. We also observe a slight downward curvature (concavity), as our theory predicts, but only at small dataset sizes. We conjecture that the linear trend may be due to the fixed depth and bit complexity during the experiments. See Remark 3.3 for the discussion of this bounded depth and bit complexity regime and see Appendix B for the formal results in these regimes. Indeed, Theorem B.1 and Theorem B.2 predict a linear dependence of the model size on the dataset size.

6 CONCLUSIONS

In this paper, we prove that Transformers are capable of memorizing N length-n sequence-to-sequence mappings with Õ(d + n + √(nN)) parameters. We extend our theory to analyze the memorization capacity of Transformers in other standard task settings. Our proof constructs a contextual mapping with O(n) self-attention layers, which significantly improves the previously proposed selective-shifting-based contextual mapping in terms of parameter efficiency. Finally, we provide experimental results that verify our theory.

REFERENCES

Hassan Akbari, Liangzhe Yuan, Rui Qian, Wei-Hong Chuang, Shih-Fu Chang, Yin Cui, and Boqing Gong. VATT: Transformers for multimodal self-supervised learning from raw video, audio and text. In M. Ranzato, A. Beygelzimer, Y. Dauphin, P.S. Liang, and J. Wortman Vaughan (eds.), Advances in Neural Information Processing Systems, volume 34, pp. 24206-24221. Curran Associates, Inc., 2021. URL https://proceedings.neurips.cc/paper/2021/file/cb3213ada48302953cb0f166464ab356-Paper.pdf.

Peter L. Bartlett, Nick Harvey, Christopher Liaw, and Abbas Mehrabian. Nearly-tight VC-dimension and pseudodimension bounds for piecewise linear neural networks. Journal of Machine Learning Research, 20(63):1-17, 2019. URL http://jmlr.org/papers/v20/17-612.html.

Eric B Baum. On the capabilities of multilayer perceptrons. Journal of Complexity, 4(3):193-215, 1988. ISSN 0885-064X. doi: 10.1016/0885-064X(88)90020-9. URL https://www.sciencedirect.com/science/article/pii/0885064X88900209.

Mikhail Belkin, Daniel Hsu, Siyuan Ma, and Soumik Mandal. Reconciling modern machine-learning practice and the classical bias-variance trade-off. Proceedings of the National Academy of Sciences, 116(32):15849-15854, 2019. doi: 10.1073/pnas.1903070116. URL https://www.pnas.org/doi/abs/10.1073/pnas.1903070116.

Gedas Bertasius, Heng Wang, and Lorenzo Torresani. Is space-time attention all you need for video understanding? In Marina Meila and Tong Zhang (eds.), Proceedings of the 38th International Conference on Machine Learning, volume 139 of Proceedings of Machine Learning Research, pp. 813-824. PMLR, 18-24 Jul 2021. URL https://proceedings.mlr.press/v139/bertasius21a.html.

Srinadh Bhojanapalli, Chulhee Yun, Ankit Singh Rawat, Sashank Reddi, and Sanjiv Kumar. Low-rank bottleneck in multi-head attention models.
In Hal Daum e III and Aarti Singh (eds.), Proceedings of the 37th International Conference on Machine Learning, volume 119 of Proceedings of Machine Learning Research, pp. 864 873. PMLR, 13 18 Jul 2020. URL https: //proceedings.mlr.press/v119/bhojanapalli20a.html. Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel Ziegler, Jeffrey Wu, Clemens Winter, Chris Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam Mc Candlish, Alec Radford, Ilya Sutskever, and Dario Amodei. Language models are few-shot learners. In H. Larochelle, M. Ranzato, R. Hadsell, M.F. Balcan, and H. Lin (eds.), Advances in Neural Information Processing Systems, volume 33, pp. 1877 1901. Curran Associates, Inc., 2020. URL https://proceedings.neurips.cc/paper/2020/file/ 1457c0d6bfcb4967418bfb8ac142f64a-Paper.pdf. Sebastien Bubeck, Ronen Eldan, Yin Tat Lee, and Dan Mikulincer. Network size and size of the weights in memorization with two-layers neural networks. In H. Larochelle, M. Ranzato, R. Hadsell, M.F. Balcan, and H. Lin (eds.), Advances in Neural Information Processing Systems, volume 33, pp. 4977 4986. Curran Associates, Inc., 2020. URL https://proceedings.neurips.cc/paper/2020/file/ 34609bdc08a07ace4e1526bbb1777673-Paper.pdf. Lili Chen, Kevin Lu, Aravind Rajeswaran, Kimin Lee, Aditya Grover, Misha Laskin, Pieter Abbeel, Aravind Srinivas, and Igor Mordatch. Decision transformer: Reinforcement learning via sequence modeling. In M. Ranzato, A. Beygelzimer, Y. Dauphin, P.S. Liang, and J. Wortman Vaughan (eds.), Advances in Neural Information Processing Systems, volume 34, pp. 15084 15097. Curran Associates, Inc., 2021. URL https://proceedings.neurips.cc/paper/2021/ file/7f489f642a0ddb10272b5c31057f0663-Paper.pdf. Mostafa Dehghani, Stephan Gouws, Oriol Vinyals, Jakob Uszkoreit, and Lukasz Kaiser. Universal transformers. In International Conference on Learning Representations, 2019. URL https: //openreview.net/forum?id=Hyzd Ri R9Y7. Published as a conference paper at ICLR 2023 Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp. 4171 4186, Minneapolis, Minnesota, June 2019. Association for Computational Linguistics. doi: 10.18653/v1/N19-1423. URL https: //aclanthology.org/N19-1423. Yihe Dong, Jean-Baptiste Cordonnier, and Andreas Loukas. Attention is not all you need: pure attention loses rank doubly exponentially with depth. In Marina Meila and Tong Zhang (eds.), Proceedings of the 38th International Conference on Machine Learning, volume 139 of Proceedings of Machine Learning Research, pp. 2793 2803. PMLR, 18 24 Jul 2021. URL https: //proceedings.mlr.press/v139/dong21a.html. Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. In International Conference on Learning Representations, 2021. URL https: //openreview.net/forum?id=Yicb Fd NTTy. 
Benjamin L Edelman, Surbhi Goel, Sham Kakade, and Cyril Zhang. Inductive biases and variable creation in self-attention mechanisms. In Kamalika Chaudhuri, Stefanie Jegelka, Le Song, Csaba Szepesvari, Gang Niu, and Sivan Sabato (eds.), Proceedings of the 39th International Conference on Machine Learning, volume 162 of Proceedings of Machine Learning Research, pp. 5793 5831. PMLR, 17 23 Jul 2022. URL https://proceedings.mlr.press/v162/ edelman22a.html. William Fedus, Barret Zoph, and Noam Shazeer. Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity. Journal of Machine Learning Research, 23(120):1 39, 2022. URL http://jmlr.org/papers/v23/21-0998.html. Mor Geva, Roei Schuster, Jonathan Berant, and Omer Levy. Transformer feed-forward layers are key-value memories. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pp. 5484 5495, Online and Punta Cana, Dominican Republic, November 2021. Association for Computational Linguistics. doi: 10.18653/v1/2021.emnlp-main.446. URL https://aclanthology.org/2021.emnlp-main.446. Moritz Hardt and Tengyu Ma. Identity matters in deep learning. In International Conference on Learning Representations, 2017. URL https://openreview.net/forum?id= ryx B0Rtxx. Guang-Bin Huang. Learning capability and storage capacity of two-hidden-layer feedforward networks. IEEE Transactions on Neural Networks, 14(2):274 281, 2003. doi: 10.1109/TNN.2003. 809401. Guang-Bin Huang and H.A. Babri. Upper bounds on the number of hidden neurons in feedforward networks with arbitrary bounded nonlinear activation functions. IEEE Transactions on Neural Networks, 9(1):224 229, 1998. doi: 10.1109/72.655045. Michael Janner, Qiyang Li, and Sergey Levine. Offline reinforcement learning as one big sequence modeling problem. In M. Ranzato, A. Beygelzimer, Y. Dauphin, P.S. Liang, and J. Wortman Vaughan (eds.), Advances in Neural Information Processing Systems, volume 34, pp. 1273 1286. Curran Associates, Inc., 2021. URL https://proceedings.neurips. cc/paper/2021/file/099fe6b0b444c23836c4a5d07346082b-Paper.pdf. John Jumper, Richard Evans, Alexander Pritzel, Tim Green, Michael Figurnov, Olaf Ronneberger, Kathryn Tunyasuvunakool, Russ Bates, Augustin ˇZ ıdek, Anna Potapenko, et al. Highly accurate protein structure prediction with alphafold. Nature, 596(7873):583 589, 2021. Hyunjik Kim, George Papamakarios, and Andriy Mnih. The lipschitz constant of self-attention. In Marina Meila and Tong Zhang (eds.), Proceedings of the 38th International Conference on Machine Learning, volume 139 of Proceedings of Machine Learning Research, pp. 5562 5571. PMLR, 18 24 Jul 2021. URL https://proceedings.mlr.press/v139/kim21i. html. Published as a conference paper at ICLR 2023 Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In ICLR (Poster), 2015. URL http://arxiv.org/abs/1412.6980. Anastasis Kratsios, Behnoosh Zamanlooy, Tianlin Liu, and Ivan Dokmani c. Universal approximation under constraints is possible with transformers. In International Conference on Learning Representations, 2022. URL https://openreview.net/forum?id=JGO8Cv G5S9. Yoav Levine, Noam Wies, Or Sharir, Hofit Bata, and Amnon Shashua. Limits to depth efficiencies of self-attention. In H. Larochelle, M. Ranzato, R. Hadsell, M.F. Balcan, and H. Lin (eds.), Advances in Neural Information Processing Systems, volume 33, pp. 22640 22651. Curran Associates, Inc., 2020. 
URL https://proceedings.neurips.cc/paper/2020/file/ ff4dfdf5904e920ce52b48c1cef97829-Paper.pdf. Valerii Likhosherstov, Krzysztof Choromanski, and Adrian Weller. On the expressive power of selfattention matrices. Co RR, abs/2106.03764, 2021. URL https://arxiv.org/abs/2106. 03764. Swetha Mandava, Szymon Migacz, and Alex Fit-Florea. Pay attention when required. Co RR, abs/2009.04534, 2020. URL https://arxiv.org/abs/2009.04534. Preetum Nakkiran, Gal Kaplun, Yamini Bansal, Tristan Yang, Boaz Barak, and Ilya Sutskever. Deep double descent: where bigger models and more data hurt. Journal of Statistical Mechanics: Theory and Experiment, 2021(12):124003, dec 2021. doi: 10.1088/1742-5468/ac3a74. URL https://doi.org/10.1088/1742-5468/ac3a74. Quynh Nguyen and Matthias Hein. Optimization landscape and expressivity of deep CNNs. In Jennifer Dy and Andreas Krause (eds.), Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pp. 3730 3739. PMLR, 10 15 Jul 2018. URL https://proceedings.mlr.press/v80/nguyen18a. html. Emilio Parisotto, Francis Song, Jack Rae, Razvan Pascanu, Caglar Gulcehre, Siddhant Jayakumar, Max Jaderberg, Rapha el Lopez Kaufman, Aidan Clark, Seb Noury, Matthew Botvinick, Nicolas Heess, and Raia Hadsell. Stabilizing transformers for reinforcement learning. In Hal Daum e III and Aarti Singh (eds.), Proceedings of the 37th International Conference on Machine Learning, volume 119 of Proceedings of Machine Learning Research, pp. 7487 7498. PMLR, 13 18 Jul 2020. URL https://proceedings.mlr.press/v119/parisotto20a.html. Sejun Park, Jaeho Lee, Chulhee Yun, and Jinwoo Shin. Provable memorization via deep neural networks using sub-linear parameters. In Mikhail Belkin and Samory Kpotufe (eds.), Proceedings of Thirty Fourth Conference on Learning Theory, volume 134 of Proceedings of Machine Learning Research, pp. 3627 3661. PMLR, 15 19 Aug 2021. URL https://proceedings.mlr. press/v134/park21a.html. Niki Parmar, Ashish Vaswani, Jakob Uszkoreit, Lukasz Kaiser, Noam Shazeer, Alexander Ku, and Dustin Tran. Image transformer. In Jennifer Dy and Andreas Krause (eds.), Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pp. 4055 4064. PMLR, 10 15 Jul 2018. URL https://proceedings. mlr.press/v80/parmar18a.html. Stanislas Polu and Ilya Sutskever. Generative language modeling for automated theorem proving. Co RR, abs/2009.03393, 2020. URL https://arxiv.org/abs/2009.03393. Jorge P erez, Javier Marinkovi c, and Pablo Barcel o. On the turing completeness of modern neural network architectures. In International Conference on Learning Representations, 2019. URL https://openreview.net/forum?id=Hy GBdo0q Fm. Shashank Rajput, Kartik Sreenivasan, Dimitris Papailiopoulos, and Amin Karbasi. An exponential improvement on the memorization capacity of deep threshold networks. In M. Ranzato, A. Beygelzimer, Y. Dauphin, P.S. Liang, and J. Wortman Vaughan (eds.), Advances in Neural Information Processing Systems, volume 34, pp. 12674 12685. Curran Associates, Inc., 2021. URL https://proceedings.neurips.cc/paper/2021/file/ 69dd2eff9b6a421d5ce262b093bdab23-Paper.pdf. Published as a conference paper at ICLR 2023 Eduardo D. Sontag. Shattering All Sets of k Points in General Position Requires (k 1)/2 Parameters. Neural Computation, 9(2):337 348, 02 1997. ISSN 0899-7667. doi: 10.1162/neco. 1997.9.2.337. URL https://doi.org/10.1162/neco.1997.9.2.337. Erik F. 
Tjong Kim Sang and Fien De Meulder. Introduction to the Co NLL-2003 shared task: Language-independent named entity recognition. In Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003, pp. 142 147, 2003. URL https: //aclanthology.org/W03-0419. Gal Vardi, Gilad Yehudai, and Ohad Shamir. On the optimal memorization power of re LU neural networks. In International Conference on Learning Representations, 2022. URL https:// openreview.net/forum?id=Mk TPtnje YTV. Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Ł ukasz Kaiser, and Illia Polosukhin. Attention is all you need. In I. Guyon, U. Von Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett (eds.), Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc., 2017. URL https://proceedings.neurips.cc/paper/2017/file/ 3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf. Roman Vershynin. Memory capacity of neural networks with threshold and rectified linear unit activations. SIAM Journal on Mathematics of Data Science, 2(4):1004 1033, 2020. doi: 10. 1137/20M1314884. URL https://doi.org/10.1137/20M1314884. James Vuckovic, Aristide Baratin, and Remi Tachet des Combes. A mathematical theory of attention. Co RR, abs/2007.02876, 2020. URL https://arxiv.org/abs/2007.02876. Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R. Bowman. GLUE: A multi-task benchmark and analysis platform for natural language understanding. In International Conference on Learning Representations, 2019. URL https://openreview. net/forum?id=r J4km2R5t7. Colin Wei, Yining Chen, and Tengyu Ma. Statistically meaningful approximation: a case study on approximating turing machines with transformers. Co RR, abs/2107.13163, 2021. URL https: //arxiv.org/abs/2107.13163. Noam Wies, Yoav Levine, Daniel Jannai, and Amnon Shashua. Which transformer architecture fits my data? a vocabulary bottleneck in self-attention. In Marina Meila and Tong Zhang (eds.), Proceedings of the 38th International Conference on Machine Learning, volume 139 of Proceedings of Machine Learning Research, pp. 11170 11181. PMLR, 18 24 Jul 2021. URL https://proceedings.mlr.press/v139/wies21a.html. Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Russ R Salakhutdinov, and Quoc V Le. Xlnet: Generalized autoregressive pretraining for language understanding. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alch e-Buc, E. Fox, and R. Garnett (eds.), Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc., 2019. URL https://proceedings.neurips.cc/paper/2019/file/ dc6a7e655d7e5840e66733e9ee67cc69-Paper.pdf. Chulhee Yun, Suvrit Sra, and Ali Jadbabaie. Small relu networks are powerful memorizers: a tight analysis of memorization capacity. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alch e Buc, E. Fox, and R. Garnett (eds.), Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc., 2019. URL https://proceedings.neurips.cc/ paper/2019/file/dbea3d0e2a17c170c412c74273778159-Paper.pdf. Chulhee Yun, Srinadh Bhojanapalli, Ankit Singh Rawat, Sashank Reddi, and Sanjiv Kumar. Are transformers universal approximators of sequence-to-sequence functions? In International Conference on Learning Representations, 2020a. URL https://openreview.net/forum? id=Byx RM0Ntvr. Chulhee Yun, Yin-Wen Chang, Srinadh Bhojanapalli, Ankit Singh Rawat, Sashank Reddi, and Sanjiv Kumar. 
O(n) connections are expressive enough: Universal approximability of sparse Published as a conference paper at ICLR 2023 transformers. In H. Larochelle, M. Ranzato, R. Hadsell, M.F. Balcan, and H. Lin (eds.), Advances in Neural Information Processing Systems, volume 33, pp. 13783 13794. Curran Associates, Inc., 2020b. URL https://proceedings.neurips.cc/paper/2020/file/ 9ed27554c893b5bad850a422c3538c15-Paper.pdf. Manzil Zaheer, Guru Guruganesh, Kumar Avinava Dubey, Joshua Ainslie, Chris Alberti, Santiago Ontanon, Philip Pham, Anirudh Ravula, Qifan Wang, Li Yang, and Amr Ahmed. Big bird: Transformers for longer sequences. In H. Larochelle, M. Ranzato, R. Hadsell, M.F. Balcan, and H. Lin (eds.), Advances in Neural Information Processing Systems, volume 33, pp. 17283 17297. Curran Associates, Inc., 2020. URL https://proceedings.neurips.cc/paper/2020/ file/c8512d142a2d849725f31a9a7a361ab9-Paper.pdf. Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding deep learning requires rethinking generalization. In International Conference on Learning Representations, 2017. URL https://openreview.net/forum?id=Sy8gd B9xx&. Hengshuang Zhao, Li Jiang, Jiaya Jia, Philip H.S. Torr, and Vladlen Koltun. Point transformer. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pp. 16259 16268, October 2021. Sixiao Zheng, Jiachen Lu, Hengshuang Zhao, Xiatian Zhu, Zekun Luo, Yabiao Wang, Yanwei Fu, Jianfeng Feng, Tao Xiang, Philip H.S. Torr, and Li Zhang. Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 6881 6890, June 2021. Published as a conference paper at ICLR 2023 A PROOF OF THEOREM 3.1 Our proof of Theorem 3.1 consists of four stages as described in Section 3.3. We state and prove the main lemma for each stage in the following subsections. Then, we combine all stages in the end. The lemmas in each stage assumes permutation equivariance. We analyze how to circumvent permutation equivariance through the positional encoding in a separate subsection. A.1 STAGE 1: TOKENWISE PROJECTION The main lemma for stage 1 is stated below. Lemma A.1. Let N, d, n N and r 1, 0 < δ 1. Let X(1), , X(N) Rd n be a set of N input sequences that are distinct and tokenwise (r, δ)-separated. Denote r = 2 2n2N 2 Then, there exists a network N1 : Rd n R1 n consisting of the input embedding block and one tokenwise feedforward subblock with feedforward dimension q = 1 and bit complexity log(2rn2N 2d πδ 1) such that N1(X(i)), i [N] are non-negative and tokenwise (2r , 2)- separated. Moreover, for i, j [N] and k, l [n], N1(X(i))[1, k] = N1(X(j))[1, l] if and only if X(i)[:, k] = X(j)[:, l]. Proof. Our proof defines a vector u Rd such that sequences x(i) = u T X(i) R1 n, i [N] are tokenwise (r , 2)-separated and satisfy that, for i, j [N] and k, l [n], x(i)[1, k] = x(j)[1, l] if and only if X(i)[:, k] = X(j)[:, l] Then, we construct the network N1 with the required size and bit complexity that computes N1(X) = u T X + r 1T n. Construction of u. Recall the definition of the vocabulary V = S i [N] V(i) = {v Rd : v = X(i)[:, k] for some i [N], k [n]}. Note that |V| n N. We use Lemma E.1 on V to find u such that 8 πd v v 1 |V|2 8 πd v v u T (v v ) v v for every v, v V. Let ˆu Rd be a vector with each coordinate being the first log(n2N 2d π) bits of the corre- sponding coordinate of u. 
We note that ˆu u d 2log(n2N2d π) = 1 n2N 2 q Define u = S ˆu with S = 2n2N 2 πdδ 1 2. We now check that x(i) = u T X(i), i [N] are tokenwise (r , 2)-separated with r = 2S r . Let i, j [N], k, l [n] with X(i)[:, k] = X(j)[:, l] and v, v V with v = X(i)[:, k], v = X(j)[:, l]. Then, we have x(i)[1, k] = u T X(i)[:, k] S u T v + (ˆu u)T v v + 1 n2N 2 Published as a conference paper at ICLR 2023 We also have x(i)[1, k] x(j)[1, l] = u T (X(i)[:, k] X(j)[:, l]) = S ˆu T (v v ) S u T (v v ) (ˆu u)T (v v ) 8 πd v v 1 n2N 2 which also implies x(i)[1, k] = x(j)[1, l] if and only if X(i)[:, k] = X(j)[:, l]. Construction of N1. We construct N1 as a composition of the input embedding block Ein : Rn d R1 n and a tokenwise feedforward block F(F F ) : R1 n R1 n with a skip-connection. We define Ein(X) = ˆu T X and F(F F )(z) = z + (S 1)σR(z + 2 r 1T n) + 2 r 1T n. Then, we have N1(X) = F(F F )(Ein(X)) = ˆu T X + (S 1)σR(ˆu T X + 2 r 1T n) + 2 r 1T n = ˆu T X + (S 1)(ˆu T X + 2 r 1T n) + 2 r 1T n = S ˆu T X + 2S r 1T n = u T X + r 1T n, where we removed the Re LU activation because ˆu T X + 2 r 1T n have all positive values. It is straightforward from the definition of F(F F ) that the feedforward dimension of the network is 1. Moreover, we can represent each coordinate of ˆu with log(n2N 2d π) bits, the weights in F(F F ) with log S = log(2n2N 2 πdδ 1) bits, and the biases in F(F F ) with log 2r bits. Thus, the bit complexity of the network N1 is log(2rn2N 2d πδ 1) . A.2 STAGE 2: CONTEXTUAL MAPPING Before proceeding on to the stage 2, we reindex the tokens of each output from Lemma A.1. Due to the permutation equivariance of the Transformer architecture, we may reorder tokens in any order as we want without loss of generality. For each i [N], we consider x(i) Rn as a vector and reindex as follows. Suppose that there are ni unique tokens in X(i). Then, there are also ni unique values in x(i). We assign indices for tokens so that the first ni tokens have ni unique values of x(i) in descending order: x(i)[1] > x(i)[2] > > x(i)[ni]. Then, we index the remaining redundant tokens in descending order: x(i)[ni + 1] x(i)[ni + 2] x(i)[n]. We note that the resulting vector x(i) is uniquely defined. That is, X(i) corresponds to one and only one valid resulting vector x(i). We need the following definition in this section. Definition A.1. Let N, n N and r 1, 0 < δ 1. Let x(1), , x(N) Rn be N data instances. We say that x(1), , x(N) are (r, δ)-separated if Published as a conference paper at ICLR 2023 x(i) r for all i [N] and x(i) x(j) δ for all i, j [N] with i = j. We now state the main lemma for stage 2. Lemma A.2. Let N, n, r N. Let x(1), , x(N) Rn be a set of N input sequences that are non-negative and tokenwise (2r , 2)-separated. Denote R = 4 2N 2 πn r n . Then, there exists a network N2 : R3 n R3 n consisting of 2n Transformer blocks with the number of head h = 1, head size k = 1, feedforward dimension q = 4 and bit complexity log(2r n N 2 π) such that 0T n 0T n z(i)1T n where z(i) R, i [N] are (R + 1, 2)-separated. Proof. Our proof defines a vector w Rn such that values z(i) = w[1 : ni]T x(i)[1 : ni] R, i [N] are (R , 4)-separated where we denote the number of unique values in x(i) as ni. Then, we construct the network N2 with the required size and bit complexity that computes 0T n 0T n z(i)1T n where z(i) approximates z(i) within 1 as z(i) z(i) 1. 
Since we have z(i) z(i) + z(i) z(i) R + 1 for i [N] and z(i) z(j) z(i) z(j) z(i) z(i) z(j) z(j) 4 1 1 = 2 for i, j [N] with i = j, we conclude that z(i) R, i [N] are (R + 1, 2)-separated. Construction of w. For i [N], we define x(i) Rn as a vector having the same values as x(i) in the first ni coordinates and 0 in the rest. We use Lemma E.1 on x(1), x(2), , x(N) to find w such that 8 πn x(i) x(j) w T x(i) x(j) x(i) x(j) for every i, j [N]. Let ˆw Rd be a vector with each coordinate being the first log(n N 2 π) bits of the corresponding coordinate of w. We note that ˆw w n 2log(n N2 π) = 1 N2 q Published as a conference paper at ICLR 2023 Define w = P ˆw with P = 2N 2 πn . We now check that z(i) = w[1 : ni]T x(i)[1 : ni] = w T x(i), i [N] are (R , 4)-separated. Let i, j [N] with i = j. Then, we have z(i) = w T x(i) = P ˆw T x(i) P w T x(i) + ( ˆw w)T x(i) and z(i) z(j) = w T ( x(i) x(j)) = P ˆw T ( x(i) x(j)) P w T ( x(i) x(j)) ( ˆw w)T ( x(i) x(j)) 8 πn x(i) x(j) 1 1 πn x(i) x(j) 1 πn x(i) x(j) Construction of N2. We construct N2 = F2n F2n 1 F1 in n steps. Each step l [n] consists of 2 Transformer blocks F2l 1, F2l : R3 n R3 n. First, we use Lemma E.2 to obtain a self-attention module F(SA) 2l 1 : R1 n R1 n that computes a vector with all coordinates holding 1 2P n-approximation ˆxmax of xmax = maxi [n] x[i] given x Rn. We extend F(SA) 2l 1 to define a valid self-attention subblock F(SA) 2l 1 : R3 n R3 n as 0T n F(SA) 2l 1 (x T ) 0T n ˆxmax1T n z T for x, z Rn. The subblock F(SA) 2l 1 finds the 1 2P n-approximate maximum token id among the first coordinate values and outputs in the second coordinate. Next, we use Lemma E.3 to obtain a tokenwise feedforward module F(F F ) 2l 1 : R2 n R1 n that outputs r for tokens with the same value in two coordinates. We define a tokenwise feedforward subblock F(F F ) 2l 1 : R3 n R3 n by extending F(F F ) 2l 1 so that, for x, y, z Rn, F(F F ) 2l 1 2 F(F F ) 2l 1 ([x, y]T ) 0T n 0T n x[i] = x[i] 2r if |x[i] y[i]| < 1 2 x[i] if |x[i] y[i]| > 1 . Published as a conference paper at ICLR 2023 The subblock F(F F ) 2l 1 compares the first two rows of the input and subtracts 2r from the first row if two values are not separated. Since x[i] is bounded above by 2r , the subtracted entries become negative. Since we do not need the self-attention subblock from F2l, we set all weights of F(SA) 2l to zero. We define F(F F ) 2l : R3 n R3 n as "1 0 0 1/P 0 ˆwk e T 1 Pe T 2 "1 0 0 1/P 0 ˆwk x T + σR( x T ) y T σR(y T ) z T + ˆwk PσR(y T ) σR(x T ) σR( y T ) z T + wkσR(y T ) where P > 0 and wk = ˆwk P are defined earlier. When y 0n, we can further simplify as σR(x T ) 0T n z T + wky T Let Z(i,l) = F2l F2l 1 F1 R3 n be the output of the l-th step when the input is x(i) for l = 0, , n. We show inductively that Z(i,l)[1, k] = x(i)[k] if l < ni and x(i)[k] < x(i)[l] 0 otherwise , Z(i,l)[2, k] = 0, Z(i,l)[3, k] = w[1 : min{l, ni}]T ˆx(i)[1 : min{l, ni}] = min{l,ni} X j=1 w[j]ˆx(i)[j] for i [N], l, k [n], where ˆx(i)[j] is 1 2P n-approximation of x(i)[j]. For l = 0, the conditions 1 hold for the input . Suppose that the conditions 1 hold for l = l 1. When l > ni, the induction hypothesis implies that Z(i, l 1)[1, :] = 0T n. 
Thus, we obtain Published as a conference paper at ICLR 2023 the output as F(F F ) 2 l F(F F ) 2 l 1 F(SA) 2 l 1 Z(i, l 1) = F(F F ) 2 l F(F F ) 2 l 1 F(SA) 2 l 1 0T n 0T n Z(i, l 1)[3, :] = F(F F ) 2 l F(F F ) 2 l 1 0T n 0T n Z(i, l 1)[3, :] = F(F F ) 2 l 2r 1T n 0T n Z(i, l 1)[3, :] 0T n 0T n Z(i, l 1)[3, :] When l ni, the induction hypothesis implies that Z(i, l 1)[1, :] has non-zero entries among which the largest value is x(i)[ l]. Then, it follows that F(F F ) 2 l F(F F ) 2 l 1 F(SA) 2 l 1 Z(i, l 1) = F(F F ) 2 l F(F F ) 2 l 1 F(SA) 2 l 1 Z(i, l 1)[1, :] Z(i, l 1)[2, :] Z(i, l 1)[3, :] = F(F F ) 2 l F(F F ) 2 l 1 F(SA) 2 l 1 Z(i, l 1)[1, :] 0T n Z(i, l 1)[3, :] = F(F F ) 2 l F(F F ) 2 l 1 Z(i, l 1)[1, :] ˆx(i)[ l]1T n Z(i, l 1)[3, :] = F(F F ) 2 l ˆx(i)[ l]1T n Z(i, l 1)[3, :] σR x(i, l 1)T 0T n Z(i, l 1)[3, :] + w[ l]ˆx(i)[ l]1T n where ˆx(i)[ l] is 1 2P n-approximation of x(i)[ l]. Here, x(i, l 1) denotes x(i, l 1)[k] = ( Z(i, l 1)[1, k] 2r if |Z(i, l 1)[1, k] ˆx(i)[ l]| < 1 2 Z(i, l 1)[1, k] if |Z(i, l 1)[1, k] ˆx(i)[ l]| > 1 σR( x(i, l 1)[k]) = ( 0 if |Z(i, l 1)[1, k] ˆx(i)[ l]| < 1 2 Z(i, l 1)[1, k] if |Z(i, l 1)[1, k] ˆx(i)[ l]| > 1 . Since distinct tokens are separated by 2, 1 2P n-approximations of them are separated by 1. On the other hand, since 1 2P n < 1 2, the 1 2P n-approximation of the maximum element stays closer than 1 Consequently, σR( x(i, l 1)T ) is the same as Z(i, l)[1, :]. Thus, the conditions 1 hold and we conclude our induction proof. In the end of n steps, the output is Z(i,n) = 0T n 0T n z(i)1T n with z(i) = w T ˆx(i) where each entry of ˆx(i) 1 2P n-approximates the corresponding entries of x(i). We check that z(i) approximates z(i) within Published as a conference paper at ICLR 2023 1 as z(i) z(i) w T ˆx(i) w T x(i) w ˆx(i) x(i) P ˆw n 2P n 2 ( w + ˆw w ) Our construction of N2 involves 2n Transformer blocks. From Lemma E.2, the subblock F(SA) 2l 1 uses 1 head (h = 1) with head size 1 (k = 1) and bit complexity log log(8n3/2r P) = log log(16n2N 2r π) . From Lemma E.3, the subblock F(F F ) 2l 1 uses feedforward dimen- sion 4 with bit complexity log 2r . The subblock F(F F ) 2l uses feedforward dimension 2. All weights in F(F F ) 2l are either 0, 1, P, 1/P or ˆwk. Since we represent each coordinate of ˆwk using log(n N 2 π) bits, the bit complexity of F(F F ) 2l is max{ log P , log(n N 2 π) } = log(2n N 2 π) . Thus, the network N2 has 1 head (h = 1) with the head size 1 (k = 1), the feedforward dimension 4 and the bit complexity log(2r n N 2 π) . A.3 STAGE 3: STRING LOOKUP In stages 3 and 4, we map each token X(i)[:, k] to the corresponding label using both the token id x(i)[k] from stage 1 and the sequence id z(i) from stage 2. Now that the token id x(i)[k] and the sequence id z(i) hold enough information to identify the label, we process each token independently using tokenwise feedforward blocks from now on. We first combine two ids into a single contextual token id and map to the corresponding token label in the last two stages. We adopt stages 2 and 3 in Vardi et al. (2022) to our architecture that involves skip-connections. We can extend stage 2 by using extra dimension to pass x(i) from stage 1 without any additional parameter. Then, after stage 2, the k-th token in the i-th sequence contains z(i) and x(i)[k], which are enough to identify the corresponding label y(i)[k]. We note that 0 x(i)[k] 2r and z(i) R + 1 with r , R > 6. We define a(i)[k] = ( z(i) + R + 1)(2r + 1) + x(i)[k] + 1 to be the unique integer id for each token in each sequence. 
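As a quick sanity check before bounding a^{(i)}[k] below, the following minimal sketch (hypothetical values; the variables r_prime and R_prime stand in for the bounds r′ and R′ in the text) shows how this pairing keeps distinct (sequence id, token id) combinations at least 2 apart:

```python
# Minimal sketch of the contextual token id a = (z + R' + 1) * (2r' + 1) + x + 1,
# with hypothetical small values standing in for the bounds r' (token ids) and
# R' (sequence ids); not part of the construction itself.
r_prime, R_prime = 10, 50          # hypothetical: 0 <= x <= 2r', |z| <= R' + 1

def contextual_id(z, x):
    return (z + R_prime + 1) * (2 * r_prime + 1) + x + 1

# Distinct (sequence id, token id) pairs map to distinct integers, and because
# token ids and sequence ids are each separated by at least 2, the resulting
# contextual ids are separated by at least 2 as well.
pairs = [(-3, 0), (-3, 4), (7, 0), (7, 4)]
ids = [contextual_id(z, x) for z, x in pairs]
assert len(set(ids)) == len(ids)
assert min(abs(a - b) for i, a in enumerate(ids) for b in ids[i + 1:]) >= 2
print(ids)
```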
Then, we have 1 ≤ a^{(i)}[k] < (2R′ + 3)(2r′ + 1) < 9r′R′ for every i ∈ [N], and |a^{(i)}[k] − a^{(j)}[l]| ≥ 2 for all i, j ∈ [N], k, l ∈ [n] with V^{(i)} ≠ V^{(j)} (the set of distinct tokens appearing in X^{(i)}) or X^{(i)}[:, k] ≠ X^{(j)}[:, l]. We denote R = 9r′R′. Now, our goal is to map a^{(i)}[k] ∈ [R] to y^{(i)}[k] ∈ [C] using tokenwise feedforward subblocks. We denote the number of distinct a^{(i)}[k]'s as N′ ≤ nN. We reindex each unique contextual token id and its corresponding label as a^{(i)} and y^{(i)} for i ∈ [N′], respectively. Without loss of generality, we suppose that 1 ≤ a^{(1)} < ⋯ < a^{(N′)} < R. Then, we partition the N′ contextual token ids into A groups of B ids. For each group g ∈ [A], we construct two strings u_g and w_g. The binary string u_g is a concatenation of the B ids in group g, where each id is represented as an integer of ρ = ⌈log R⌉ bits. The binary string w_g is a concatenation of the B labels corresponding to the B ids in group g, where each label is represented as an integer of γ = ⌈log C⌉ bits. Thus, u_g and w_g are strings of ρB and γB bits, respectively.

We now state the main lemma for stage 3.

Lemma A.3. Let N′, R ∈ ℕ and let 1 ≤ a^{(1)} < ⋯ < a^{(N′)} < R be distinct integer ids that identify each token in each sequence. Suppose that |a^{(i)} − a^{(j)}| ≥ 2 for i, j ∈ [N′] with i ≠ j. Let A, B, b ∈ ℕ with A < N′ and B = ⌈N′/A⌉. Let w_1, …, w_A ∈ ℕ with at most b bits in their binary representations. Then, there exists a network N_3 : ℝ² → ℝ² consisting of A feedforward blocks with skip-connections every 2 layers, feedforward dimension q = 4 and bit complexity b + ⌈log(2R + 1)⌉ such that

N_3(a^{(i)}, 0) = (a^{(i)}, w_{⌈i/B⌉}) for every i ∈ [N′].

Proof. We construct N_3 = F_A ∘ F_{A−1} ∘ ⋯ ∘ F_1 as a composition of A feedforward blocks F_l : ℝ² → ℝ², l ∈ [A], where each feedforward block contains a single hidden layer and a skip-connection. Let l ∈ [A]. We use Lemma E.4 to define F̃_l : ℝ → ℝ such that F̃_l(x) = w_l if x ∈ [a^{((l−1)B+1)}, a^{(lB)}] and F̃_l(x) = 0 if x ∉ [a^{((l−1)B+1)} − 1/2, a^{(lB)} + 1/2], where we regard any index lB > N′ as N′. We extend F̃_l to define a valid feedforward block F_l as

F_l(x, y) = (x, y) + (0, F̃_l(x)) = (x, y + F̃_l(x)).

Throughout the computation of N_3, the first coordinate is the same across all blocks. For i ∈ [N′], a^{(i)} activates only one of the F̃_l, l ∈ [A]. In particular, F̃_l(a^{(i)}) = w_l if l = ⌈i/B⌉ and F̃_l(a^{(i)}) = 0 otherwise. Thus, we get N_3(a^{(i)}, 0) = (a^{(i)}, w_{⌈i/B⌉}). Our construction involves A feedforward blocks with skip-connections every 2 layers. From Lemma E.4, the feedforward dimension is q = 4 and the bit complexity is b + ⌈log(2R + 1)⌉.

Remark A.4. We can extend the construction of Lemma A.3 to output multiple strings corresponding to the same range without additional feedforward dimension. In particular, the first layer in the construction of Lemma E.4 does not depend on w. Thus, we can reuse these four units with different output weights to output additional strings corresponding to the same range. In stage 4, we need 2 strings for each range.

A.4 STAGE 4: BIT EXTRACTION

We first define BIN_{i:j}(n) to be the substring of n consisting of its bits in places i to j (inclusive), for n, i, j ∈ ℕ with i ≤ j. We now state the main lemma for stage 4.

Lemma A.5. Let B, ρ, γ ∈ ℕ and u, w ∈ ℕ. Suppose that the numbers of bits in the binary representations of u and w are ρB and γB, respectively. We assume that |BIN_{ρ(i−1)+1:ρi}(u) − BIN_{ρ(j−1)+1:ρj}(u)| ≥ 2 for i, j ∈ [B] with i ≠ j. Then, there exists a network N_4 : ℝ³ → ℝ consisting of an output embedding block and (max{ρ, γ} + 2)B + 2 feedforward blocks with skip-connections every 2 layers, feedforward dimension q = 16 and bit complexity 2 max{ρ, γ}B such that N_4(x, u, w) = BIN_{γ(i−1)+1:γi}(w) whenever there exists i ∈ [B] such that x = BIN_{ρ(i−1)+1:ρi}(u).
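Before the proof, the following plain-integer sketch (hypothetical values; bit positions counted from the most significant bit) spells out the lookup that N_4 realizes with ReLU feedforward blocks: find the ρ-bit block of u equal to the query x and return the γ-bit block of w at the same position.

```python
# Plain-integer sketch of the lookup performed by the bit-extraction network N4
# (hypothetical values; the network realizes the same map with ReLU blocks).

def BIN(value, i, j, total_bits):
    """Bits of `value` in places i..j (inclusive, counted from the most
    significant bit of a `total_bits`-bit representation), as an integer."""
    bits = format(value, f"0{total_bits}b")
    return int(bits[i - 1:j], 2)

rho, gamma, B = 4, 2, 3                   # bits per id, bits per label, group size
ids    = [3, 9, 14]                       # B token ids, each stored in rho bits
labels = [2, 0, 3]                        # B labels, each stored in gamma bits
u = int("".join(format(a, f"0{rho}b") for a in ids), 2)      # concatenated ids
w = int("".join(format(c, f"0{gamma}b") for c in labels), 2) # concatenated labels

def lookup(x):
    for l in range(1, B + 1):             # the network unrolls this loop in depth
        if BIN(u, rho * (l - 1) + 1, rho * l, rho * B) == x:
            return BIN(w, gamma * (l - 1) + 1, gamma * l, gamma * B)
    return 0

assert [lookup(a) for a in ids] == labels
print(u, w, [lookup(a) for a in ids])
```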
Proof. We construct N4 = Fpost FB FB 1 F1 Fpre in B steps with pre-processing Fpre : R3 R8 and post-processing Fpost : R8 R. We define the pre-processing network as Fpre(x, u, w) = x, u 2ρB + 1 2ρB+1 , u 2ρB + 1 2ρB+2 , 0, w 2γB + 1 2γB+1 , w 2γB + 1 2γB+2 , 0, 0 and the post-processing network as Fpost(z1, z2, , z8) = z8. Published as a conference paper at ICLR 2023 In each step l [B], Fl : R8 R8 first implements two sub-networks Fu l : R3 R3 and Fw l : R3 R3 in parallel on the middle 6 coordinates. Two sub-networks uses Lemma E.5 to compute ϕ(ρ(l 1)) u 2ρB + 1 2ρB+1 ϕ(ρ(l 1)) u 2ρB + 1 2ρB+2 ϕ(ρl) u 2ρB + 1 2ρB+1 ϕ(ρl) u 2ρB + 1 2ρB+2 BINρ(l 1)+1:ρl(u) ϕ(γ(l 1)) w 2γB + 1 2γB+1 ϕ(γ(l 1)) w 2γB + 1 2γB+2 2γB + 1 2γB+1 2γB + 1 2γB+2 BINγ(l 1)+1:γl(w) Then, Fl combines the result using two additional feedforward blocks as x ϕ(ρl) u 2ρB + 1 2ρB+1 ϕ(ρl) u 2ρB + 1 2ρB+2 BINρ(l 1)+1:ρl(u) ϕ(γl) w 2γB + 1 2γB+1 2γB + 1 2γB+2 BINγ(l 1)+1:γl(w) 0 x ϕ(ρl) u 2ρB + 1 2ρB+1 ϕ(ρl) u 2ρB + 1 2ρB+2 cl = F1 l (x, BINρ(l 1)+1:ρl(u)) ϕ(γl) w 2γB + 1 2γB+1 2γB + 1 2γB+2 BINγ(l 1)+1:γl(w) 0 x ϕ(ρl) u 2ρB + 1 2ρB+1 ϕ(ρl) u 2ρB + 1 2ρB+2 2γB + 1 2γB+1 2γB + 1 2γB+2 0 F2 l (cl, BINγ(l 1)+1:γl(w)) where F1 l uses Lemma E.3 to compute F1 l (x, y) = 2γ if |x y| < 1 2 0 if |x y| > 1 and F2 l computes, for 0 y 2γ F2 l (x, y) = σR(x 2γ + y) = y if x = 2γ 0 if x = 0 . The resulting value in the last coordinate is BINγ(l 1)+1:γl(w) if |x BINρ(l 1)+1:ρl(u)| < 1 2 and 0 if |x BINρ(l 1)+1:ρl(u)| > 1. Therefore, the last coordinate throughout each step of N4 keeps the value 0 until it finds x = BINρ(l 1)+1:ρl(u). When such l is found, the value is updated to BINγ(l 1)+1:γl(w) as the requirement. Finally, the parallel implementation of Fu l and Fw l requires max{ρ, γ} feedforward blocks, feedforward dimension 16 = 8 + 8 and bit complexity 2 max{ρ, γ}B. Moreover, each of F1 l and F2 l requires 1 feedforward block and bit complexity γ. The feedforward dimension of F1 l is 5 where F1 l incurs 4 from Lemma E.3 and 1 additional unit wipes out the (positive) carried value in 4-th coordinate from the skip-connection. The feedforward dimension of F2 l is 3 where F2 l incurs 1 and 2 additional units wipe out the carried values in 4-th and 7-th coordinates from the skip-connection. In total, each step l [B] consists of max{ρ, γ} + 2 feedforward blocks with skip-connections, feedforward dimension 16 and bit complexity 2 max{ρ, γ}B. The pre-processing network Fpre and the post-processing network Fpost can be implemented with 1 feedforward block, feedforward dimension at most 2 (to carry u and w or z8) and the bit complexity at most max{ρ, γ}B + 2. Thus, N4 consists of (max{ρ, γ}+2)B +2 feedforward blocks with skip-connections, feedforward dimension 16 and bit complexity 2 max{ρ, γ}B. Published as a conference paper at ICLR 2023 A.5 POSITIONAL ENCODING Consider the token id u T X(n) in stage 1. If we define the positional encoding as E = r u[n 1, n 2, , 1, 0], then we have u T (X(n) + E) = u T X(n) + r [n 1, n 2, , 1, 0], Since every element of u T X(n) are bounded above by r , positional encoding enforces the decreasing order used in stage 2 as the usual sequential order. Thus, the sequence id is no longer permutation equivariant. The upper bound on the magnitude of token ids increases by a factor of n, but it only affects parameter complexity and bit complexity logarithmically through R. 
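A small numerical sketch of this effect follows (with our own normalization of E, chosen so that u^T E is exactly r′·[n−1, …, 1, 0]; the sizes are hypothetical). Since the token ids are bounded by r′, the position offsets dominate and the descending-id order used in stage 2 coincides with the left-to-right order.

```python
import numpy as np

# Sketch of the stage-1 positional encoding, with our own normalization of E
# chosen so that u^T E equals r' * [n-1, ..., 1, 0] exactly.  Token ids u^T X
# lie in [0, r'), so position offsets growing in steps of r' force the
# "descending token id" order used by stage 2 to match the sequential order.
rng = np.random.default_rng(0)
d, n, r_prime = 8, 5, 10.0
u = rng.normal(size=d)
offsets = r_prime * np.arange(n - 1, -1, -1.0)     # r' * [n-1, ..., 1, 0]
E = np.outer(u, offsets) / (u @ u)                 # so that u @ E == offsets

X = rng.uniform(size=(d, n))
token_ids = u @ X
token_ids = 0.9 * r_prime * (token_ids - token_ids.min()) / np.ptp(token_ids)

with_pos = token_ids + u @ E                       # ids after adding the encoding
assert np.all(np.argsort(-with_pos) == np.arange(n))   # descending ids = positions
print(np.round(with_pos, 2))
```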
A.6 PROOF OF THEOREM 3.1

The final Transformer network combines all 4 stages as N = N_4 ∘ N_3 ∘ N_2 ∘ N_1, where N_3 and N_4 apply the same function to each token independently. The mismatch in the embedding and feedforward dimensions is easily resolved by using the maximum dimension required and setting all weights in the unused dimensions to zero. We modify N_3 to output both u_{⌈i/B⌉} and w_{⌈i/B⌉}, as mentioned in Remark A.4. We summarize all stages:

1. N_1 projects the input sequence X^{(i)} ∈ ℝ^{d×n} to the token ids x^{(i)} ∈ ℝ^n. This stage consists of 1 feedforward block of dimension 1 and bit complexity ⌈log(2rn²N²√(dπ)δ⁻¹)⌉ ≤ log(r√d) + log(R′).

2. N_2 further projects the token ids x^{(i)} ∈ ℝ^n to the sequence id z^{(i)} ∈ ℝ permutation equivariantly. This stage consists of 2n Transformer blocks of attention dimension 1, feedforward dimension 4 and bit complexity ⌈log(2r′√n N²√π)⌉ ≤ log(R′).

3. N_3 combines a token id and a sequence id to obtain the contextual token id and finds the two group strings. The N′ ≤ nN memorized contextual token ids are partitioned into A groups of B ids so that N′ ≤ AB. The two group strings are crafted as concatenations of the contextual token ids and the corresponding labels in each group. This stage consists of A feedforward blocks of dimension 8 and bit complexity max{log R, log C}·B + log R + 1.

4. N_4 extracts the correct label from the crafted strings. This stage consists of (max{ρ, γ} + 2)B + 2 feedforward blocks of dimension 16 and bit complexity 2 max{log R, log C}·B.

In total, the network N uses 1 + 2n + A + (max{ρ, γ} + 2)B + 2 = O(n + A + max{ρ, γ}B) Transformer blocks of dimension 16 and bit complexity log(r√d) + log(R′) + max{log R, log C}·B + log R + 1 + 2 max{log R, log C}·B = O(log d + max{log R, log C}·B). We note that R = 9r′R′ ≤ 150r′²nN² ≤ 8000r²n⁵N⁶dδ⁻². Finally, we balance A and B as A = √(nN·log(nN)) and B = √(nN/log(nN)) and conclude the proof of Theorem 3.1.

B LARGE WIDTH AND FIXED BIT COMPLEXITY

In this section, we formally study the generalization of Theorem 3.1 to the cases of large width (Section B.1) and fixed bit complexity (Section B.2) from Remark 3.3. To simplify the argument, we state and prove the results only for the case without permutation equivariance; the theorems for the permutation equivariant case follow straightforwardly from the stated theorems.

B.1 LARGE WIDTH

We state and prove the theorem for large width.

Theorem B.1. Assume the same setting as in Theorem 3.1. Let L ≤ √(nN). There exists a Transformer network N : ℝ^{d×n} → ℝ^{1×n} and a positional encoding E ∈ ℝ^{d×n} such that N(X^{(i)} + E) = Y^{(i)} for every i ∈ [N]. The Transformer N has width 16nN/L² (m = 6nN/L², h = k = 1 and q = 16nN/L²), depth (L/log L)·max{log C, log R} and bit complexity bounded by (L/log L)·max{log C, log R}, where we denote R := 8000r²δ⁻²dn⁵N⁶.

Proof. Stages 1 and 2 are the same as in the proof of Theorem 3.1. For stages 3 and 4, instead of memorizing the nN contextual token ids directly, we construct nN/L² subnetworks, each of which memorizes L² contextual token ids. Stacking the subnetworks horizontally across the width gives the result. We remark that the width is not increased for the self-attention layers, which are not used in the parallelized stages 3 and 4.

Theorem B.1 shows that, if the depth is bounded above by O(L) with L > n, then O(d + n + nN/L) parameters are enough to memorize N sequence classification examples of length n with token dimension d.
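A rough count behind this claim, suppressing logarithmic factors and constants and using the linear-in-width accounting explained next, is

    (width of the parallelized stages 3–4) × (depth) ≈ (nN/L²) · Õ(L) = Õ(nN/L),

which, together with the O(d + n) parameters of stages 1 and 2, gives Õ(d + n + nN/L) parameters overall.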
We count the number of parameters as linear in width instead of quadratic in width because our construction uses parallel n N/L2 subnetworks without interaction among them. B.2 FIXED BIT COMPLEXITY We state and prove the theorem for fixed bit complexity. Theorem B.2. Assume the same setting as in Theorem 3.1. Let B n N. There exists a Transformer network N : Rd n R1 n and positional encoding E Rd n such that N(X(i) + E) = Y (i) for every i [N]. In both cases, the Transformer N has width 16 (m = 6, h = k = 1 and q = 16), depth log(B) + n N 1 log(B) max{log C, log R} and bit complexity bounded by 1 log(B) max{log C, log R} where we denote R := 8000r2δ 2dn5N 6. Proof. Stage 1 and 2 are the same as the proof of Theorem 3.1. For stage 3 and 4, instead of directly memorizing n N contextual token ids directly, we construct n N/B2 subnetworks where each memorize B2 contextual token ids. By stacking subnetworks vertically across depth, we obtain the result. Theorem B.2 shows that, if bit complexity is bounded above by O(B), then O(d + n + n N/B) parameters are enough to memorize N sequence classification examples of length n with token dimension d. We count the number of parameters as linear in width instead of quadratic in width because our construction uses parallel n N/L2 subnetworks without interaction among them. Published as a conference paper at ICLR 2023 C CONTEXTUAL MAPPING IMPLICATIONS C.1 SPARSE ATTENTION TRANSFORMERS This section shows how our result generalizes to sparse-attention Transformers. Sparse-attention Transformers replaces self-attention subblocks with sparse counterparts. Let Al k [n] be the lth sparsity pattern of k-th token where k [n], l [p]. Given an input Z Rm n, the sparse self-attention subblock F (SSA) l with h heads and head size k computes F (SSA) l (Z) = Z + i=1 W (O) l,i W (V ) l,i ZAl k W (K) l,i ZAl k T W (Q) l,i ZAl k where ZAl k Rm |Al k| denote the submatrix consisting of columns of Z in the index set Al k. Again, W (O) l,i Rm k and W (V ) l,i , W (K) l,i , W (Q) l,i Rk m are the weight matrices parametrizing the sparse self-attention subblock. We make the following assumption on the sparsity pattern, which is the last condition among three conditions in Assumption 1 in Yun et al. (2020b). Thus, our assumption is strictly weaker. Assumption C.1. (Relaxed version of Assumption 1 in Yun et al. (2020b)) Define a sequence of set {St k}t 1 as S1 k := A1 k, St k := [ j A(t 1) mod p+1 k We assume that the sparsity patterns {Al k} satisfy that there exists a finite s N such that s = min{u|Su k = [n] for all k [n]}. We provide the sparse-attention version of our main result. Theorem C.1. Let N, d, n, C, s N and r 1, 0 < δ 1. Let (X(1), Y (1)), , (X(N), Y (N)) Rd n [C]1 n be N input-output pairs of sequences that satisfies. Denote R := 8000r2δ 2dn5N 6. Then, there exists a Transformer network F : Rd n R1 n with width 16 (m = 6, h = k = 1 and r = 16), depth n N log N + n N log N max{log C, log R} and bit complexity bounded by n N log N max{log C, log R} such that F(X(i)P ) = Y (i)P for every i [N] and for every permutation matrix P Rn n. Proof. The stage 1, 3 and 4 are the same as the proof of Theorem 3.1. In stage 2, each step of N2 in Lemma A.2 computes the maximum token id over the whole sequence using 1 self-attention layer. Under Assumption C.1, we instead can compute the maximum token id over the allowed sparsity pattern. 
Since the whole sequence is covered within a recursion of s consecutive sparsity patterns and taking maximum is associative, repeating s self-attention layer will give the desired maximum token id over the whole sequence again. Now, the other component in stage 2 works as before and the resulting memorization is achieved in the same way. The only overhead in this approach is the s times larger number of self-attention layers. The simplicity of our contextual mapping enables easy generalization to the sparse attention. Since the sparse attention Transformer only constrains the sparsity pattern in the self-attention subblock, stages 1, 3 and 4 in our construction works without any modification. For the contextual mapping in stage 2, we achieve the same memorization capacity with only s times more self-attention layers. The number of parameters is O(d + sn + Published as a conference paper at ICLR 2023 C.2 IMPROVED CONTEXTUAL MAPPING FOR FUNCTION APPROXIMATION Our idea can be used to reduce the number of self-attention layers for the contextual mapping in function approximation settings, too. We recall the formal definition of the contextual mapping as follows. Definition C.2. (Definition 3.1 in Yun et al. (2020a), Contextual Mapping) Consider a finite set L Rd n. A map q : L R1 n is a contextual mapping if the map satisfies the following: 1. For any L L, the n entries in q(L) are distinct. 2. For any L, L L with L = L , all entries of q(L) and q(L ) are distinct. We state our improvement of the contextual mapping. Theorem C.2. (Improved version of Lemma 6 in Yun et al. (2020a)) Consider the following subset of Gδ = {0, δ, , 1 δ}d n: Gδ := {L Gδ|L:,i = L:,j for all i = j}. Assume that n 2 and δ 1 2. Then, there exist a function gc : R4 n R4 n composed of 3n Transformer blocks with h = 1, k = 1 and q = 4 that employ the hardmax operator, vectors w Rd, u R4, constants tl, tr R (0 < tl < tr), such that q(L) := u T gc(w T L, w T L, 0, 0) satisfies the following propeties: 1. For any L Gδ, the entries of q(L) are all distinct. 2. For any L, L Gδ such that L is not a permutation of L , all entries of q(L), q(L ) are distinct. 3. For any L Gδ, all the entries of q(L) are in [tl, tr]. 4. For any L G+ δ \ Gδ, all the entries of q(L) are outside [tl, tr]. Proof. Since token embeddings are δ-discretized, we can concatenate each coordinate to obtain the token id as in Yun et al. (2020a). Let w be the vector that represents such concatenation as a linear operation. As in stage 2 of our proof for Theorem 3.1, we concatenate the token id in the decreasing order of magnitude to obtain the sequence id in n steps. Then, we set u appropriately to obtain the contextual token id or the concatenation of token id and sequence id in q(L). Then, the first three conditions are easy to check. The first condition is trivially true because distinct token will have distinct token id and consequently distinct contextual token id. The second condition is also true because if L is not a permutation of L , then their sequence ids should differ. The third condition is true since a linear function in a compact region is bounded. Consider the final condition. In any step of our efficient contextual mapping, when the maximum token id is zero, there must be duplicate tokens in the input sequence. Conversely, if there are duplicate tokens in the input sequence, the maximum token id should be zero at some steps of our efficient contextual mapping. 
We may use one more feedforward block in each step of the efficient contextual mapping to subtract M from the sequence id whenever such a zero maximum token id is observed, where M is the maximum value possible for the sequence id. Then, the sequence id is still negative at the end, so q(L) is also negative. Thus, the final condition is also true.

We highlight the differences in the architecture. The major difference is the number of Transformer blocks used. We use 3n layers (linear in the sequence length), while Yun et al. (2020a) use δ^{−d} + 1 layers (exponential in the embedding dimension). Since the sequence length and the embedding dimension are of the same order in practice, our construction exponentially improves Lemma 6 in Yun et al. (2020a). The minor differences in the architecture are the intermediate embedding dimension (4 in ours versus d in Yun et al. (2020a)) and the number of attention heads (1 in ours versus 2 in Yun et al. (2020a)).

D OTHER TASKS

We provide the proof of Theorem 4.1 (Section D.1) and formal results on language modeling tasks. We consider the two language modeling tasks that are commonly used to pre-train Transformers: masked language modeling (Section D.2) and autoregressive language modeling (Section D.3).

D.1 PROOF OF THEOREM 4.1

Stages 1 and 2 are the same as in the proof of Theorem 3.1. Instead of classifying the contextual token ids, we can directly classify the sequence ids in stages 3 and 4. Since there are N possible sequence ids, we replace nN with N in the parameter complexity from Theorem 3.1.

D.2 MASKED LANGUAGE MODELING

Let X ∈ ℝ^{d×T} and Y ∈ [V]^T be corpus data of T tokens represented as embedding vectors and token ids, respectively. The corpus data is divided into P = T − n + 1 sequences X^{(1)}, …, X^{(P)} ∈ ℝ^{d×n} and Y^{(1)}, …, Y^{(P)} ∈ [V]^n of length n by taking X^{(i)} = X[:, i : i+n] and Y^{(i)} = Y[i : i+n]. Here, we use the slice index notation i : j for items from i (inclusive) to j (exclusive). We mask m out of the n tokens in each sequence, with Q = C(n, m) possible masking patterns M^{(1)}, …, M^{(Q)} ∈ {0, 1}^n. We define the masked sequences M^{(j)} ⊙ X^{(i)} extracted from X as the sequence X^{(i)} with columns masked according to the pattern M^{(j)}. We say that a Transformer network N : ℝ^{d×n} → ℝ^{1×n} and positional encoding E ∈ ℝ^{d×n} memorize masked language modeling of X, Y if N(M^{(j)} ⊙ X^{(i)} + E) = Y^{(i)} for every i ∈ [P], j ∈ [Q].

Theorem D.1. Let T, d, n, m, V ∈ ℕ and r ≥ 1, 0 < δ ≤ 1. Let X ∈ ℝ^{d×T}, Y ∈ [V]^T be corpus data of T tokens represented as embedding vectors and token ids, respectively. Suppose that the masked sequences extracted from X are distinct and tokenwise (r, δ)-separated. Then, there exists a Transformer network N : ℝ^{d×n} → ℝ^{1×n} and a positional encoding E ∈ ℝ^{d×n} that memorize masked language modeling of X, Y. The Transformer N has width 16 (m = 6, h = k = 1 and q = 16), depth √(nPQ·log(nPQ)) + √(nPQ/log(nPQ))·max{log V, log R} and bit complexity bounded by √(nPQ/log(nPQ))·max{log V, log R}, where we denote P = T − n + 1, Q = C(n, m) and R := 8000r²δ⁻²dn⁵P⁶Q⁶.

Proof. We apply Theorem 3.1 to memorize the PQ masked sequences M^{(j)} ⊙ X^{(i)}.

D.3 AUTOREGRESSIVE LANGUAGE MODELING

Let X ∈ ℝ^{d×T} and Y ∈ [V]^T be corpus data of T tokens represented as embedding vectors and token ids, respectively. The corpus data is divided into P = T − n sequences X^{(1)}, …, X^{(P)} ∈ ℝ^{d×n} of length n and targets y^{(1)}, …, y^{(P)} ∈ [V] by taking X^{(i)} = X[:, i : i+n] and y^{(i)} = Y[i+n]. We call the X^{(i)} the input sequences extracted from X.
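A minimal sketch of this data construction (hypothetical sizes; 0-indexed slices) is given below, before the formal memorization statement that follows.

```python
import numpy as np

# Sketch of the autoregressive data construction above (hypothetical sizes,
# 0-indexed slices): every window of n consecutive token embeddings is paired
# with the id of the token that immediately follows it.
rng = np.random.default_rng(0)
d, T, n = 4, 12, 3
X = rng.normal(size=(d, T))                  # embeddings of the T corpus tokens
Y = rng.integers(1, 100, size=T)             # token ids in [V]

P = T - n                                    # number of extracted sequences
inputs  = [X[:, i:i + n] for i in range(P)]  # X^(i) = X[:, i : i + n]
targets = [Y[i + n]      for i in range(P)]  # y^(i) = Y[i + n]

assert len(inputs) == len(targets) == P
print(inputs[0].shape, targets[0])           # a (d, n) window and its next-token id
```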
We say that a Transformer network N : Rd n R1 n and positional encoding E Rd n memorizes autoregressive language modeling of X, Y if N(X(i) + E) = y(i) for every i [P]. Published as a conference paper at ICLR 2023 Theorem D.2. Let T, d, n, m, V N and r 1, 0 < δ 1. Let X Rd T , Y [V ]T be a corpus data of T tokens represented as embedding vectors and token ids, respectively. Suppose that the input sequences extracted from X are distinct and tokenwise (r, δ)-separated. Then, there exists a Transformer network N : Rd n R1 n and positional encoding E Rd n that memorizes autoregressive language modeling of X, Y . The Transformer N has width 16 (m = 6, h = k = 1 and q = 16), depth P log P max{log V, log R} and bit complexity bounded by P log P max{log V, log R} where we denote P = T n and R := 8000r2δ 2dn5P 6. Proof. We apply Theorem 4.1 to memorize P input sequences X(i). E TECHNICAL LEMMAS Here, we state technical lemmas that are used in our proofs. Lemma E.1. (Lemma 13 from Park et al. (2021)) Let N, d N and x(1), x(N) Rd. Then, there exists a unit vector u Rd such that 1 N2 q 8 πd x(i) x(j) |u T (x(i) x(j))| x(i) x(j) for every i, j [N]. Lemma E.2. Let n N and r , P > 1. Then, there exists a neural network F : R1 n R1 n consisting of a single softmax self-attention with 1 head, head size 1 and bit complexity log log(8n3/2r P) such that F(x T ) = c1T n with xmax 1 2P n c xmax whenever x Rn |x[i]| 2r for i [n] and x[i] xmax 2 for i [n] with x[i] = xmax, where we denote xmax = maxi [n] x[i] . Proof. When we have hardmax activation on the attention matrix, it is easy to construct the network that satisfies the condition. Consider the following self-attention: F(x) = 1 (1 x)σH (1 x)T (0 x + 1T n) = xσH x T 1T n = max i [n] xi Indeed, the output is exactly xmax and the bit complexity is 1 since all weights are either 0 or 1. To approximate hardmax with softmax, we introduce some large factor t > 0 in the attention matrix. Consider the following self-attention: F(x) = 1 (1 x)σS (t x)T (0 x + 1T n) = xσS tx T 1T n = c1T n i=1 x[i] exp(tx[i]) Pn j=1 exp(tx[j]). Since c is a convex combination of x[i] s, it is easy to see that xmax upper bounds c. It suffices to find t that satisfies the lower bound condition. Published as a conference paper at ICLR 2023 Choose t = 1 2 log(8n3/2r P) . We lower bound the softmax weights on xmax as P i:x[i]=xmax exp(tx[i]) Pn i=1 exp(tx[i]) i:x[i]=xmax exp(tx[i]) P i:x[i]=xmax exp(tx[i]) + P i:x[i] =xmax exp(tx[i]) i:x[i]=xmax exp(txmax) P i:x[i]=xmax exp(txmax) + P i:x[i] =xmax exp(t(xmax 2)) = nmax nmax + (n nmax) exp( 2t) = 1 1 + ( n nmax 1) exp( 2t) 1 1 + ( n nmax 1) 1 8n3/2r P 1 1 + 1 8r P n , where nmax := |{i : x[i] = xmax}|. Now, we can lower bound c as c xmaxpmax 2r (1 pmax) = xmax (xmax + 2r )(1 pmax) xmax 4r (1 pmax) 1 1 1 + 1 8r P n 1 2P n 1 + 1 8r P n xmax 1 2P n. This self-attention module only has 1 head with head size 1. All weights are either 0, 1 or t so the bit complexity is log t log log(8n3/2r P) . Lemma E.3. Let r N. Then, there exists a neural network F : R2 R with 1 hidden layer, width 4 and bit complexity log 2r such that F(x, y) = r if |x y| < 1 2 0 if |x y| > 1 . Proof. Consider the following neural network: F(x, y) = [r r r r ] σR 2 2 2 2 2 2 2 2 = r (σR(2x 2y + 2) σR(2x 2y + 1) σR(2x 2y 1) + σR(2x 2y 2)) . It is straightforward to see that the network F computes the desired function. This network has 1 hidden layer and width 4. All parameters are either 1, 2 or r so the bit complexity is log 2r . Lemma E.4. Let a, b, w N with a < b. 
Then, there exists a neural network F : R R with 1 hidden layer, width 4 and bit complexity log w + log(2b + 1) such that F(x) = w if x [a, b] 0 if x / [a 1 Published as a conference paper at ICLR 2023 Proof. Consider the following neural network: F(x, y) = [w w w w] σR 2a 1 2a 2b 2b + 1 = w (σR(2x 2a + 1) σR(2x 2a) σR(2x 2b) + σR(2x 2b 1)) . It is straightforward to see that the network F computes the desired function. This network has 1 hidden layer and width 4. All parameters are either w, 2 or 2a 1, 2a, 2b, 2b + 1 so the bit complexity is log w + log(2b + 1) . Lemma E.5. Let i, j, n, x, y N with i < j n. Denote ϕ(z) = σR(σR(2z) σR(4z 2)). Then, there exists a neural network F : R3 R3 consisting of j i + 1 feedforward blocks with skip-connections every 2 layers, feedforward dimension 8 and bit complexity 2n such that 2n + 1 2n+1 2n + 1 2n+2 2n + 1 2n+1 2n + 1 2n+2 y + BINi:j(x) Proof. We construct F = Fj Fj 1 Fi as a composition of j i + 1 feedforward blocks Fl : R2 R2, i l j where each feedforward block contains a single feedforward layer with 8 hidden units and a skip-connection. In step l, the block Fl computes 2n + 1 2n+1 2n + 1 2n+2 2n + 1 2n+1 2n + 1 2n+2 y + 2j l BINl(x) Since we have BINi:j(x) = l=i 2j l BINl(x), the block computation in Equation 2 ensures that F computes the desired function. Construction of Fl. To obtain Equation 2, we define the feedforward block Fl with a skipconnection as "ϕ(z1) z1 ϕ(z2) z2 fl(z1, z2) " ϕ(z1) ϕ(z2) z3 + fl(z1, z2) fl(z1, z2) = 2j l 2n+1 lϕ(z2) 2n+1 lϕ(z1) + 1 We first note that the triangle function ϕ can be implemented with one hidden layer as ϕ(z) = σR(2z) σR(4z 2) + σR(2z 2). Also, we can implement identity function as Thus, the following 8 hidden units σR (2z1) , σR (4z1 2) , σR (2z1 2) , σR ( 2z1) σR (2z2) , σR (4z2 2) , σR (2z2 2) , σR ( 2z2) are enough to represent Fl. Published as a conference paper at ICLR 2023 We check Equation 2 as follows: 2n + 1 2n+1 2n + 1 2n+2 2n + 1 2n+1 2n + 1 2n+2 y + fl ϕ(l 1) x 2n + 1 2n+1 , ϕ(l 1) x 2n + 1 2n+2 2n + 1 2n+1 2n + 1 2n+2 y + 2j l BINl(x) BINl(x) = 2n+1 lϕ(l) x 2n + 1 2n+2 2n+1 lϕ(l) x 2n + 1 2n+1 Finally, since all parameters are have the form 2k for some 1 k 2n, the bit complexity is 2n. F EXPERIMENTAL SETUP We use Hugging Face 9 Py Torch implementation of the BERT model for our experiments. All experiments are conducted on an Nvidia Quatro RTX 5000, 16 GB memory GPU in a machine with Intel(R) Xeon(R) Silver 4214 CPU @ 2.20GHz. As mentioned in the main paper, we use 14,000 random samples in the named entity recognition dataset from Co NLL-2003 (Tjong Kim Sang & De Meulder, 2003) for token classification and 50,000 random samples in the MNLI dataset from GLUE benchmark (Wang et al., 2019) for sequence classification. For token classification, the task is to classify the named entity type for each token among 9 possible classes. The sequence classification dataset aims to classify the relationship between sentence pairs as 3 classes: entailment, contradiction, and neutral. We vary the dataset size by randomly order examples and picking first p% for p = 10, 20, , 100. We vary the model size through the embedding size m, which is varied by 12 and 96 for token and sequence classification tasks, respectively. As mentioned in the main paper, we fix the number of layers as L = 6, the number of attention head as h = 12, the embedding to head size ratio as m/k = h = 12 and the feedforward to embedding size ratio as q/m = 4, as commonly done in practice. 
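For concreteness, here is a minimal sketch of this model sizing using the Hugging Face BertConfig; the field values are illustrative rather than the exact experiment scripts, and the training settings follow in the next paragraph.

```python
from transformers import BertConfig, BertForTokenClassification

# Sketch of the model sizing described above (illustrative values only):
# L = 6 layers, h = 12 heads, head size k = m / 12, feedforward size q = 4m,
# with the embedding size m swept to vary the parameter count.
m = 96  # embedding size; swept in steps of 12 (token cls.) or 96 (sequence cls.)
config = BertConfig(
    hidden_size=m,
    num_hidden_layers=6,       # L = 6
    num_attention_heads=12,    # h = 12, so the head size is k = m / 12
    intermediate_size=4 * m,   # q / m = 4
    num_labels=9,              # e.g., the 9 CoNLL-2003 entity classes
)
model = BertForTokenClassification(config)
print(sum(p.numel() for p in model.parameters()))
```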
We optimize using the Adam optimizer (Kingma & Ba, 2015) with a learning rate of 0.00002, a batch size of 32, and a dropout rate of 10%. We train our models for 1,500 and 7,500 steps for token and sequence classification, respectively; these step counts are chosen so that the training error does not improve for at least the last 3 epochs. For Figure 2, we select the minimum-size memorizing model as the smallest model that reaches a training error of 0.005. The maximum training errors of the selected models are 0.00499 and 0.00450 for the token and sequence classification tasks, respectively, and the average training errors of the selected models are 0.00464 and 0.00259, respectively.

9 https://huggingface.co/