Attention-Only Transformers via Unrolled Subspace Denoising

Peng Wang 1  Yifu Lu 1  Yaodong Yu 2  Druv Pai 2  Qing Qu 1  Yi Ma 3

1 Department of Electrical Engineering and Computer Science, University of Michigan, Ann Arbor, USA. 2 Department of Electrical Engineering and Computer Science, University of California, Berkeley, USA. 3 Institute of Data Science & School of Computing and Data Science, University of Hong Kong, Hong Kong, China. Correspondence to: Peng Wang.

Proceedings of the 42nd International Conference on Machine Learning, Vancouver, Canada. PMLR 267, 2025. Copyright 2025 by the author(s).

Abstract

Despite the popularity of transformers in practice, their architectures are empirically designed and neither mathematically justified nor interpretable. Moreover, as indicated by many empirical studies, some components of transformer architectures may be redundant. To derive a fully interpretable transformer architecture with only necessary components, we contend that the goal of representation learning is to compress a set of noisy initial token representations towards a mixture of low-dimensional subspaces. To compress these noisy token representations, an associated denoising operation naturally takes the form of a multi-head (subspace) self-attention. By unrolling such iterative denoising operations into a deep network, we arrive at a highly compact architecture that consists of only self-attention operators with skip connections at each layer. Moreover, we show that each layer performs highly efficient denoising: it improves the signal-to-noise ratio of token representations at a linear rate with respect to the number of layers. Despite its simplicity, extensive experiments on vision and language tasks demonstrate that such a transformer achieves performance close to that of standard transformer architectures such as GPT-2 and CRATE.

1. Introduction

Over the past decade, transformers (Vaswani et al., 2017) have achieved remarkable empirical success across various modern machine learning applications, including large language models (LLMs) (Brown et al., 2020; Devlin, 2018), vision generative models (Bao et al., 2023; Chen et al., 2020), and reinforcement learning (Chen et al., 2021). Transformer architectures are generally constructed by stacking multiple identical layers designed to process and learn from data. Each layer is composed of several interacting components arranged in a specific sequence, including multi-head self-attention operators, layer normalization, multi-layer perceptron (MLP) networks, and skip connections. In practice, transformers such as BERT (Devlin, 2018) and GPT-4 (Achiam et al., 2023) are very deep, often with dozens or even hundreds of layers, and significantly over-parameterized, containing millions or even billions of parameters. This considerable depth and over-parameterization endow transformers with impressive learning capabilities, allowing them to capture complex patterns and relationships.

Despite the remarkable success of transformers, their deep and over-parameterized architecture renders them black boxes, hindering an understanding of their inner mechanisms. To address this challenge, a common approach involves systematically removing or modifying certain components in transformers to simplify the architecture; see, e.g., Dong et al. (2021); Alcalde et al. (2024); Noci et al. (2024); Geshkovski et al. (2023a); Geva et al. (2020); Guo et al. (2024).
For example, Alcalde et al. (2024) studied pure-attention hard-max transformers with skip connections and showed that the output converges to a clustered equilibrium as the number of layers goes to infinity. Noci et al. (2024) analyzed a modified softmax-based attention model with skip connections, demonstrating that the limiting distribution can be described by a stochastic differential equation. These studies indicate that the most basic components of transformers are self-attention layers and skip connections. Although existing studies have provided valuable insights into different components of transformers, few of them elucidate the underlying mechanisms by which transformers process and transform input into output across layers.

Existing empirical studies suggest that some components of transformers may not be essential and can be removed or modified without compromising performance. For example, He & Hofmann (2024) empirically demonstrated that the transformer architecture can be simplified by removing components such as skip connections, value matrices, and normalization layers without degrading performance. Additionally, Sukhbaatar et al. (2019) investigated the effect of removing MLP blocks from transformers and augmenting the self-attention layers to play a similar role, showing that performance can be preserved. Similarly, Pires et al. (2023) examined the potential for reducing the frequency of MLP layers in transformers. Other works have studied further simplifications of transformers, such as linear attention (Katharopoulos et al., 2020) and shared-QK attention (Kitaev et al., 2020). Based on these discussions, this work focuses on addressing the following question:

Can we design a minimalistic transformer architecture consisting of fully interpretable layers that achieves performance close to that of standard transformers?

1.1. Related Works

Figure 1. Each layer of our proposed transformer architecture.

Existing studies of self-attention mechanisms. A major source of the power of transformers is the self-attention layer, which enables the model to capture long-range dependencies and contextual relationships by weighing token relationships across the input sequence (Vaswani et al., 2017). To explore the mechanism behind self-attention, numerous studies have investigated the performance of pure self-attention networks, often incorporating only one additional component to prevent rank collapse and maintain expressiveness; see, e.g., Dong et al. (2021); Geshkovski et al. (2023a;b); Wu et al. (2024). We refer the reader to Appendix A for more discussion.

Network architecture design via unrolled optimization. Several lines of work have proposed that the success of modern deep networks largely stems from their ability to transform raw data into compact and structured representations, which facilitates downstream tasks (Chan et al., 2022; Chen et al., 2023; Ma et al., 2022; Yu et al., 2024; Huh et al., 2024). A principled and interpretable approach to learning such representations is to construct an architecture that incrementally transforms tokens into these representations by unrolling optimization steps as layers of a deep network (Chan et al., 2022; Monga et al., 2021; Wang et al., 2016; Yu et al., 2024; Zhang & Ghanem, 2018). Appendix A contains more discussion on this point.

Linear representation and superposition hypotheses.
Recent empirical studies of language tasks have raised the linear representation hypothesis, which posits that token representations can be linearly encoded as one-dimensional feature vectors in the activation space of LLMs (Jiang et al., 2024; Park et al., 2024b), and the superposition hypothesis, which further hypothesizes that token representations are sparse linear combinations of these feature vectors (Elhage et al., 2022; Yun et al., 2021). Building on these hypotheses, various approaches have been proposed to understand and utilize token representations. For example, Templeton (2024) employed sparse autoencoders to decompose the token representations of Claude 3 Sonnet into more interpretable components. Luo et al. (2024) leveraged sparse dictionary learning to explore token representations, decomposing them into interpretable components based on a concept dictionary. Recently, Engels et al. (2025) conjectured that token representations in LLMs are the sum of many sparse multi-dimensional features. This conjecture is supported by their experiments on GPT-2 and Mistral 7B, where they used sparse autoencoders to identify multi-dimensional features. Notably, all of these empirical studies conclude that token representations lie on a union of (possibly many) low-dimensional subspaces.

1.2. Our Contributions

Based on the above discussions and motivated by the referenced empirical findings, we propose a simple yet evocative model for the structure of token representations in trained transformers. Specifically, we model the underlying distribution of token representations as a mixture of low-rank Gaussians, each supported on a subspace and corrupted by noise (see Definition 2.1). Then, the goal of representation learning is to denoise a set of noisy initial token representations towards the corresponding subspaces. Our contributions are summarized as follows:

Attention-only transformer via unrolled optimization. Under the mixture of low-rank Gaussian model, we interpret multi-head (subspace) self-attention as a denoising operator that compresses noisy token representations towards their corresponding supporting subspaces. By iteratively unrolling the multi-head (subspace) self-attention operator, we construct a new transformer architecture with a streamlined design, consisting of only self-attention layers with skip connections (see Figure 1).1 This design is much simpler than that of standard transformers.

Theoretical guarantees for the proposed transformer. To quantify the denoising performance of the proposed transformer, we introduce a signal-to-noise ratio (SNR) metric (see Eq. (6)) for the token representations. We prove that each layer of the proposed transformer improves the SNR at a linear rate when the initial token representations are sampled from a noisy mixture of low-rank Gaussians (see Theorem 3.1). This indicates that the multi-head (subspace) self-attention operator is highly effective in denoising token representations towards their corresponding subspaces.

1 In practice, LayerNorm layers may be added to enhance performance.

Figure 2. Layers of a transformer gradually denoise token representations Z^(l) towards their corresponding subspaces.

Understanding the roles of self-attention and MLP layers. Notably, the proposed transformer is a valuable model for understanding the mechanism of attention, as it ablates the effect of MLP layers.
Moreover, comparing the proposed transformer to standard transformers provides insights into the specific role and empirical benefits of the MLP layers in different tasks, such as for images and text (see experiments in Section 4).

Finally, we conduct extensive experiments on both vision and language tasks, including supervised image classification, causal language modeling, and in-context learning, to complement our theory and demonstrate the potential of our proposed transformer architecture. We emphasize that the goal of our experiments is not to strive for state-of-the-art performance on these tasks. Instead, they are intended to validate our theory about the components of the transformer.

Notation. Given an integer n, we denote by [n] the set {1, . . . , n}. Given a vector a, let ∥a∥ denote the Euclidean norm of a and diag(a) denote the diagonal matrix with a as its diagonal. Given a matrix A, let ∥A∥ denote the spectral norm of A, ∥A∥_F denote the Frobenius norm, and a_{ij} denote the (i, j)-th element. For sequences of positive numbers {a_n} and {b_n}, we write a_n ≲ b_n or b_n ≳ a_n if there exists an absolute constant C > 0 such that a_n ≤ C b_n. Given a constant τ > 0, we define I(x > τ) = 1 if x > τ and I(x > τ) = 0 otherwise. We use O^{n×d} to denote the set of all n × d matrices that have orthonormal columns.

2. Technical Approach and Justification

In this section, we introduce the basic setup of transformers for learning representations from real-world data. Real-world data, such as images, videos, and text, are often modeled as random samples drawn from a high-dimensional probability distribution with low-dimensional intrinsic structures (Wright & Ma, 2022). Instead of directly inputting raw data samples into transformers, a common preprocessing step is to convert each sample into a sequence of tokens, where each token represents a localized segment of the data, such as an image patch, a snippet of text, or a frame in a video. Consequently, the input to transformers is typically a sequence of tokens denoted as X = [x_1, . . . , x_N] ∈ R^{D×N}. The goal of transformers is then to learn a map f that transforms these tokens into structured and compact token representations that facilitate downstream tasks, such as classification (Dosovitskiy et al., 2021) and generation (Saharia et al., 2022), by capturing the underlying patterns in the data.

2.1. Unrolled Optimization for Token Representations

In this subsection, we introduce how to learn token representations via the approach of unrolling optimization algorithms (Chan et al., 2022; Gregor & LeCun, 2010; Monga et al., 2021; Sun et al., 2019; Wang et al., 2016; Yu et al., 2024; Zhang & Ghanem, 2018). Specifically, this approach constructs each layer of a neural network according to a step of an iterative optimization algorithm. That is, the network's architecture is designed to implement a specific optimization algorithm, where each layer corresponds to a single iterative step. By unrolling the algorithm, we construct a white-box transformer architecture as a multi-layer neural network that incrementally transforms input tokens into structured and compact representations.
This iterative process can be described as follows:

f : X →(f^0) Z^{(0)} →(f^1) · · · →(f^l) Z^{(l)} →(f^{l+1}) · · · →(f^L) Z^{(L)} =: Z,

where f^0 : R^{D×N} → R^{d×N} is a pre-processing mapping (e.g., positional encoding, token embedding) that transforms the input tokens X ∈ R^{D×N} into initial token representations Z^{(0)} ∈ R^{d×N}, f^l : R^{d×N} → R^{d×N} denotes an iterative step or layer, and Z^{(l)} denotes the token representations at the l-th layer for each l ∈ [L]. A key question is then how to design the operator f^l at each layer so that it learns meaningful token representations in a principled manner.

2.2. A Model for Token Representations

Before we design such an operator f^l, we model the structure of token representations in pretrained LLMs. Notably, extensive works (Templeton, 2024; Luo et al., 2024; Engels et al., 2025) have empirically demonstrated that token representations in trained LLMs usually lie approximately in a union of low-dimensional subspaces. These subspaces encode distinct semantic meanings, capturing various linguistic or contextual features that contribute to the model's overall understanding and performance. This motivates us to model the token representations as follows:

Definition 2.1. Let C_1, . . . , C_K be a partition of the index set [N] and let U_k ∈ O^{d×p_k} denote the orthonormal basis of the k-th subspace for each k ∈ [K]. We say that the token representations {z_i}_{i=1}^N ⊆ R^d are sampled from a mixture of noisy low-rank Gaussian distributions if for each k ∈ [K],

z_i = U_k a_i [signal] + Σ_{j≠k} U_j e_{i,j} [noise],  ∀ i ∈ C_k,   (1)

where a_i ~ N(0, I_{p_k}) i.i.d. and e_{i,j} ~ N(0, δ² I_{p_j}) i.i.d. for all i ∈ C_k and k ∈ [K], {a_i} and {e_{i,j}} are each mutually independent, and {a_i} is independent of {e_{i,j}}.

Before we proceed, let us make some remarks on this model.

An idealized model for token representations. This model serves as an idealized framework for approximating token representations in real-world pretrained LLMs. It assumes that the token representations are sampled from a mixture of multiple low-rank Gaussian distributions with noise. Under this model, the goal of representation learning is to compress a set of noisy initial token representations onto the corresponding subspaces. We should point out that in real-world applications, where token representations exhibit more complicated structures, the goal of representation learning is to find a compact and structured representation via compressing token sets, as argued in Yu et al. (2024). In addition, this model has been widely used in other machine learning problems, such as subspace clustering (Wang et al., 2022; Elhamifar & Vidal, 2013) and diffusion models (Wang et al., 2024).

Connections to hypotheses on token representations. This model aligns well with two well-established hypotheses about the structure of token representations in pretrained LLMs: the linear representation hypothesis (Jiang et al., 2024; Park et al., 2024b) and the superposition hypothesis (Elhage et al., 2022; Yun et al., 2021; Arora et al., 2018). The linear representation hypothesis posits that token representations in LLMs lie in low-dimensional linear subspaces that encode semantic features. Similarly, the superposition hypothesis suggests that these representations can be approximately expressed as sparse linear combinations of feature vectors. In the context of our model, each basis U_k can be interpreted as a set of semantic features, where each feature corresponds to a specific aspect of the token's meaning. Token representations are then approximately expressed as sparse linear combinations of these subspace bases, capturing the essential semantic components of the token while ignoring irrelevant dimensions.
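To make Definition 2.1 concrete, the following NumPy sketch samples token representations from the mixture of noisy low-rank Gaussians with orthogonal subspaces of equal dimension; the function name, dimensions, and noise level are illustrative choices rather than settings used in the paper.

```python
import numpy as np

def sample_tokens(d=256, K=4, p=16, N_per=64, delta=0.2, seed=0):
    """Sample token representations from the mixture of noisy low-rank
    Gaussians in Definition 2.1 (orthogonal subspaces, equal cluster sizes)."""
    rng = np.random.default_rng(seed)
    # Orthonormal bases U_1, ..., U_K with mutually orthogonal columns,
    # obtained from a QR decomposition of a random d x (K*p) matrix.
    Q, _ = np.linalg.qr(rng.standard_normal((d, K * p)))
    U = [Q[:, k * p:(k + 1) * p] for k in range(K)]
    Z, labels = [], []
    for k in range(K):
        A = rng.standard_normal((p, N_per))        # signal coefficients a_i
        Z_k = U[k] @ A
        for j in range(K):
            if j != k:                             # noise from the other subspaces
                E = delta * rng.standard_normal((p, N_per))
                Z_k = Z_k + U[j] @ E
        Z.append(Z_k)
        labels += [k] * N_per
    return np.concatenate(Z, axis=1), np.array(labels), U

Z0, labels, U = sample_tokens()
print(Z0.shape)  # (256, 256): a d x N token matrix Z^(0)
```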
2.3. Denoising Operator for Token Representations

Now, we introduce a denoising operator that compresses token representations onto their corresponding subspaces when the initial token representations Z^{(0)} are generated according to Definition 2.1. To simplify our development, we assume that the subspaces in Definition 2.1 are orthogonal to each other, i.e., U_k^T U_j = 0 for all k ≠ j. Note that this assumption is not restrictive: in high-dimensional spaces, random low-dimensional subspaces are incoherent with each other with high probability, i.e., U_k^T U_j ≈ 0 (Wright & Ma, 2022).2

Multi-head subspace self-attention. Without loss of generality, we rearrange the initial token representations Z^{(0)} so that those from the same subspace are concatenated together, i.e., Z^{(0)} = [Z^{(0)}_1, . . . , Z^{(0)}_K] with

Z^{(0)}_k = U_k A_k + Σ_{j≠k} U_j E_{k,j},  ∀ k ∈ [K].

Here, the columns of Z^{(0)}_k are the token representations from the k-th subspace, the columns of A_k ∈ R^{p_k×N_k} consist of {a_i}_{i∈C_k}, and the columns of E_{k,j} ∈ R^{p_j×N_k} consist of {e_{i,j}}_{i∈C_k}, where N_k = |C_k| for each k ∈ [K]. Projecting token representations onto their corresponding subspace separates the signal from the noise components, since U_k^T U_j = 0 for all k ≠ j:

U_k U_k^T Z^{(0)}_ℓ = U_k A_k if ℓ = k, and U_k E_{ℓ,k} if ℓ ≠ k.   (2)

To denoise the token representations from the k-th subspace, we compute the similarity of projected token representations via (U_k^T Z)^T (U_k^T Z) and verify that the similarity between projected token representations from the k-th subspace is high, while the similarity between other pairs of projected token representations is low when δ < 1. We then convert these similarities into a membership distribution via a function φ, such as a hard-thresholding or soft-max function, and denoise the token representations towards the corresponding subspace using this membership. We formalize the resulting operator as follows: for each l = 0, 1, . . . , L − 1,

Z^{(l+1)} = Z^{(l)} + η MSSA(Z^{(l)}),   (3)

where η > 0 is the denoising step size, φ(·) : R^N → R^N is an operator applied column-wise to the similarity matrix, and

MSSA(Z) := Σ_{k=1}^K U_k U_k^T Z φ(Z^T U_k U_k^T Z).   (4)

2 One may straightforwardly generalize our results to non-orthogonal subspaces, with slightly more sophisticated analysis.

Figure 3. The attention-only transformer (AoT) architecture. Each layer consists of the MSSA operator and a skip connection. Additionally, LayerNorm can be incorporated to enhance performance. In practice, backpropagation is applied to train the model parameters.

Notably, the operator in (4), referred to as multi-head subspace self-attention (MSSA), was first proposed by Yu et al. (2024; 2023) to approximately optimize the compression term of the sparse rate reduction objective when constructing a transformer-like architecture. It is worth noting that Yu et al. (2023); Pai et al. (2023) showed that the negative gradient of the compression objective points from a token representation towards its corresponding subspace. However, they did not study the quantitative denoising efficiency of the MSSA operator (4).
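The update (3) with the MSSA operator (4) can be written in a few lines. Below is a minimal NumPy sketch assuming known bases U_k (e.g., from the sampling sketch above) and a column-wise soft-max for φ; the helper names are ours, not the authors' implementation.

```python
import numpy as np

def col_softmax(M):
    """Column-wise soft-max, used as the membership function phi in Eq. (4)."""
    M = M - M.max(axis=0, keepdims=True)
    E = np.exp(M)
    return E / E.sum(axis=0, keepdims=True)

def mssa(Z, U_list):
    """MSSA operator of Eq. (4): sum over subspace heads of
    U_k U_k^T Z phi(Z^T U_k U_k^T Z)."""
    out = np.zeros_like(Z)
    for U_k in U_list:
        proj = U_k @ (U_k.T @ Z)        # project tokens onto the k-th subspace
        attn = col_softmax(Z.T @ proj)  # membership distribution phi(.)
        out = out + proj @ attn
    return out

def aot_forward(Z0, U_list, eta=0.5, num_layers=8):
    """Unrolled denoising layers of Eq. (3): Z <- Z + eta * MSSA(Z)."""
    Z = Z0
    for _ in range(num_layers):
        Z = Z + eta * mssa(Z, U_list)
    return Z
```

In practice, the bases are learned layer-wise via backpropagation (see Section 3.1), and φ may additionally include the thresholding step analyzed in Theorem 3.1.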
Connections to multi-head self-attention. Notably, the denoising operator (4) is essentially a special instance of multi-head self-attention (MHSA) implemented with a skip connection in transformers. Specifically, multi-head self-attention takes the following form:

MHSA(Z) := W^O [head_1; . . . ; head_K],   (5)

where W^Q_k, W^K_k, W^V_k are learnable weight matrices for the queries, keys, and values of head k, W^O is another learnable weight matrix, and head_k = (W^V_k)^T Z φ(Z^T W^Q_k (W^K_k)^T Z). Comparing the MHSA operator with the MSSA operator in (4), setting W^Q_k = W^K_k = W^V_k = U_k and W^O = [U_1, . . . , U_K] in (5) recovers the MSSA operator in (4). In this special case, MHSA can be interpreted as a denoising operation onto different subspaces.
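To make this correspondence explicit, the sketch below implements the MHSA form (5) and indicates, in comments, how the above weight tying reduces it to the MSSA operator (4). It reuses the mssa and col_softmax helpers from the previous sketch, omits practical details such as query/key scaling and causal masking, and is illustrative only.

```python
import numpy as np

def mhsa(Z, WQ, WK, WV, WO, phi):
    """MHSA operator of Eq. (5): head_k = (W^V_k)^T Z phi(Z^T W^Q_k (W^K_k)^T Z),
    heads stacked vertically and mixed by the output matrix W^O."""
    heads = [Wv.T @ Z @ phi(Z.T @ Wq @ Wk.T @ Z)
             for Wq, Wk, Wv in zip(WQ, WK, WV)]
    return WO @ np.concatenate(heads, axis=0)

# Weight tying that recovers MSSA (Eq. (4)), with U, Z0, col_softmax, and mssa
# taken from the earlier sketches:
#   WO = np.concatenate(U, axis=1)            # W^O = [U_1, ..., U_K]
#   out = mhsa(Z0, U, U, U, WO, col_softmax)  # W^Q_k = W^K_k = W^V_k = U_k
#   np.allclose(out, mssa(Z0, U))             # -> True
```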
However, token representations in state-of-the-art large models are inherently more complex than the simplified structures assumed in Definition 2.1. In practice, these token representations are subject to a variety of factors, such as noise, context dependence, and intricate dependencies, that make their structure more dynamic and multifaceted. In this context, the more flexible MHSA mechanism may provide a better way to denoise these complex token representations. To sum up, while state-of-the-art models necessitate more advanced mechanisms like MHSA to effectively denoise and optimize token representations, the model in Definition 2.1 offers a useful framework for understanding token representations in an idealized yet evocative setting.

3. Main Results

In this section, we formally present an attention-only transformer architecture using unrolled optimization and provide a theoretical guarantee on its denoising performance.

3.1. Attention-Only Transformer

Armed with the setup in Section 2, we formally introduce the proposed attention-only transformer architecture. Specifically, by unrolling the iterative optimization steps (3) as layers of a deep network, we construct the transformer architecture in Figure 3. Each layer of the proposed architecture consists only of the MSSA operator and a skip connection. In practice, we may additionally incorporate LayerNorm before the MSSA operator to enhance performance. The complete architecture is built by stacking such layers, along with essential task-specific pre-processing and post-processing steps, such as positional encoding, token embedding, and a final task-specific head to adapt to different applications. Notably, if we apply the same procedure to (5), we obtain an attention-only transformer that consists only of the MHSA operator.

Comparison to standard transformers. Generally speaking, the standard decoder-only transformer architecture is composed of the following key components (Brown et al., 2020; Radford et al., 2019): (1) positional encoding, (2) multi-head QKV self-attention mechanisms, (3) feed-forward MLP networks, (4) layer normalization, and (5) residual connections. In contrast, our proposed transformer architecture adopts a streamlined design by incorporating several key simplifications. Specifically, it employs shared QKV subspace self-attention mechanisms, excludes MLP layers, and reduces the frequency of LayerNorm.

Figure 4. Denoising performance of the attention-only transformer. Here, we sample initial token representations from a mixture of low-rank Gaussians in Definition 2.1. Then, we apply (4) to update token representations and report the SNR at each layer. Left: noise level δ = 0.2. Right: noise level δ = 0.5.

Differences from previous works on attention-only transformers. In the literature, some theoretical works have studied attention-only transformers. For example, Dong et al. (2021); Wu et al. (2024) showed that pure-attention transformers with skip connections or LayerNorm can prevent rank collapse. Additionally, Alcalde et al. (2024) studied the clustering behavior of attention-only hardmax transformers. While these studies contribute significantly to our understanding of the role of self-attention in transformers, they lack empirical validation and practical implications. In contrast to these works, we not only show that each layer of the proposed attention-only transformer can denoise token representations but also conduct experiments on real-world language and vision tasks to demonstrate its potential.

The role of backward propagation. Notably, our approach constructs a transformer architecture in the forward pass by interpreting each layer as a denoising operator, conditioned on the assumption that the subspace bases {U_k}_{k=1}^K are known. However, in practice, these subspace bases are unknown and need to be learned gradually via backpropagation. Hence, the forward denoising operator (4) at the l-th layer becomes the following: for each l = 0, 1, . . . , L − 1,

Z^{(l+1)} = Z^{(l)} + η Σ_{k=1}^K U^{(l)}_k (U^{(l)}_k)^T Z^{(l)} φ( (Z^{(l)})^T U^{(l)}_k (U^{(l)}_k)^T Z^{(l)} ).

Now, the parameters {U^{(l)}_k} depend on the layer index l and may be different across layers. These matrices are learned through end-to-end training via backpropagation.

3.2. Denoising via Attention-Only Transformer

In this subsection, we study the denoising performance of the proposed transformer when the initial token representations are sampled from a mixture of low-rank Gaussians as introduced in Definition 2.1. To quantify the denoising performance, we define the signal-to-noise ratio (SNR) for each cluster of the token representations at the l-th layer as

SNR(Z^{(l)}_k) := ∥U_k U_k^T Z^{(l)}_k∥_F / ∥(I − U_k U_k^T) Z^{(l)}_k∥_F,  ∀ k ∈ [K].   (6)

To simplify our analysis, we assume that

p = p_1 = · · · = p_K,  N_1 = · · · = N_K = N/K,  and  [U_1, . . . , U_K] ∈ O^{d×Kp}.   (7)

With the above setup, we now prove the following theorem.

Theorem 3.1. Let Z^{(0)} be generated according to Definition 2.1 and Z^{(l)} be generated according to (3) for each l ∈ [L]. Here, φ(x) = h(σ(x)), where σ : R^N → R^N is the soft-max function and h : R^N → R^N is an element-wise thresholding function with h_i(x) = τ I(x_i > τ) for each i ∈ [N]. Suppose that p ≳ log N, δ ≲ √(log N)/√p, and τ ∈ (1/2, 1/(1 + N exp(−9p/32))). For sufficiently large N, it holds with probability at least 1 − KLN^{−Ω(1)} that for each l ∈ [L − 1],

SNR(Z^{(l+1)}_k) = (1 + ητ) SNR(Z^{(l)}_k),  ∀ k ∈ [K].   (8)

The proof is deferred to Appendix B. Here we comment on the significance of this theorem:

Linear denoising performance of the attention-only transformer. When the initial token representations are sampled from a mixture of low-rank Gaussian distributions with a noise level of O(√(log N)/√p), we show that each layer of the proposed transformer denoises token representations at a linear rate. This indicates the MSSA operator's efficiency in reducing noise across layers. Notably, our theoretical results are well supported by the experimental observations in Figure 4, which further validate the practical denoising capability of the proposed transformer (see also the simulation sketch at the end of this subsection).

Difficulties in analyzing the dynamics of the update (3). Note that the update (3) is highly nonlinear and complicated. These characteristics lead to intricate interactions among consecutive updates that complicate the analysis of the learning dynamics. Compared to existing works (Ahn et al., 2023; Zhang et al., 2024; Schlag et al., 2021) that mainly focus on linear self-attention with φ(·) being the identity function, our analysis provides more pertinent results for understanding the denoising performance and learning dynamics of attention mechanisms, capturing the nonlinear interactions and transformations across the layers of modern transformer architectures.
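As an illustrative sanity check in the spirit of Figure 4, the sketch below tracks the empirical SNR of Eq. (6) across unrolled layers, reusing the sample_tokens and mssa helpers defined in the earlier sketches. It uses a plain soft-max φ rather than the thresholded φ analyzed in Theorem 3.1, so the growth is only approximately linear.

```python
import numpy as np

def snr(Z_k, U_k):
    """Empirical SNR of Eq. (6): Frobenius norm of the component inside the
    subspace over the component outside it."""
    P = U_k @ U_k.T
    return np.linalg.norm(P @ Z_k) / np.linalg.norm(Z_k - P @ Z_k)

def track_snr(Z, labels, U, eta=0.5, num_layers=8):
    """Report the per-layer mean SNR while applying the unrolled update (3)."""
    for layer in range(num_layers):
        snrs = [snr(Z[:, labels == k], U[k]) for k in range(len(U))]
        print(f"layer {layer}: mean SNR = {np.mean(snrs):.2f}")
        Z = Z + eta * mssa(Z, U)   # mssa as defined in the earlier sketch
    return Z

# Example (using sample_tokens from the earlier sketch):
# Z, labels, U = sample_tokens(delta=0.2)
# track_snr(Z, labels, U)
```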
4. Experimental Verification

In this section, we evaluate our proposed attention-only transformer (AoT) architecture using the MSSA (denoted by AoT-MSSA) and MHSA (denoted by AoT-MHSA) operators on both vision and language tasks. Since the model configurations for vision and language tasks differ, we use AoT-MSSA-V and AoT-MHSA-V to denote the models applied to vision tasks, and AoT-MSSA-L and AoT-MHSA-L for those applied to language tasks. Due to limited computing resources, the goal of our experiments is not to outperform state-of-the-art transformers but to verify that AoT can achieve comparable performance on both language and vision tasks. In our implementations, we set the operator φ(·) in Eq. (4) to be the softmax function.

4.1. Vision Transformers for Image Classification

In this subsection, we evaluate the performance of AoT as a backbone architecture for supervised image classification on ImageNet and compare it against several state-of-the-art models. To construct the AoT-based model, we adopt the same preprocessing pipeline and classification head as defined in (Yu et al., 2024, Section 4.1.1).

Comparison between MSSA and CRATE. We consider the AoT-MSSA-V model and compare it against the CRATE model in Yu et al. (2024). We employ the Lion optimizer (Chen et al., 2024) to pre-train the AoT-MSSA-V transformer on ImageNet-21K for 90 epochs and to fine-tune it on ImageNet-1K (Deng et al., 2009) for 50 epochs by minimizing the cross-entropy (CE) loss. We use different hyperparameters for pre-training and fine-tuning. During pre-training, we use a learning rate of 2×10^-4, weight decay of 0.7, label smoothing with a parameter of 0.2, and a batch size of 4096. For fine-tuning, the corresponding values are 5×10^-4, 0.3, 0.1, and 2048, respectively. Standard data augmentation techniques, including random cropping, random horizontal flipping, and random augmentation, are used in our implementation, following the same setup as in Yu et al. (2023).

Table 1. Top-1 accuracy on ImageNet: evaluation of AoT-MSSA-V and comparison to CRATE.
Models | Accuracy | # of Parameters
AoT-MSSA-V | 71.7% | 22M
CRATE | 79.5% | 39M

Comparison between MHSA and ViT. We next train AoT-MHSA-V from scratch on ImageNet-1K for 150 epochs and compare its performance with ViT (Dosovitskiy et al., 2021). The training setup follows the same configuration as above: we use the Lion optimizer with a learning rate of 5×10^-4, a weight decay of 0.1, label smoothing with a smoothing parameter of 0.1, a batch size of 2048, and identical data augmentation strategies.

Table 2. Top-1 accuracy on ImageNet: evaluation of AoT-MHSA-V and comparison to ViT.
Models | Accuracy | # of Parameters
AoT-MHSA-V | 69.5% | 15M
ViT | 72.4% | 22M

Based on the above experimental setup, we report the top-1 accuracy of AoT-MSSA-V and CRATE in Table 1, and that of AoT-MHSA-V and ViT in Table 2. Due to the absence of MLP layers in AoT, the AoT-based models achieve slightly lower yet comparable accuracy relative to CRATE and ViT while using only about half the number of parameters.
This result shows the effectiveness of the attention-only architecture. We provide visualizations of the self-attention heatmaps of AoT-MSSA-V trained on ImageNet-1K in Figure 7 in Appendix C.3. We observe that each head captures similar semantic meanings across different images, demonstrating the interpretability of our proposed architecture in practice.

4.2. Decoder-Only Transformers for Language Tasks

To study the performance of our architecture on language tasks, we consider the widely used Generative Pre-Training (GPT) task (Radford et al., 2019). In causal language modeling, the goal is to predict the next token in a sequence based on the preceding context. To adapt to this task, we modify the AoT architecture by changing the MSSA (resp., MHSA) operator to a causally masked MSSA (resp., MHSA) operator. We follow the same pre-processing and post-processing steps as in (Yu et al., 2024, Section 4.1.4). Our implementation of the GPT-2-type transformers and training pipeline is based on the framework outlined in (Karpathy, 2022).3

3 https://github.com/karpathy/nanoGPT.git

4.2.1. LANGUAGE MODELING

Pre-training language models. We pre-train the AoT-MSSA-L and AoT-MHSA-L models of different sizes, along with GPT-2 (see Table 3 for model sizes), on OpenWebText (Gokaslan & Cohen, 2019). We defer the details of the model architectures to Table 4. We train these models over a 1024-token context using the AdamW optimizer (Loshchilov & Hutter, 2019). We plot the training loss and validation loss against the number of training iterations in Figure 5(a) and (b), respectively. We observe that medium- and large-sized AoT-based models achieve training and validation losses comparable to those of the GPT-2 base model. In addition, the AoT-MHSA-L model is identical to the GPT-2 base model except for the absence of MLP layers. As shown in Figure 5, incorporating MLP layers can accelerate the training process.

Figure 5. Evaluating models on language tasks. We plot the training loss (left) and validation loss (right) of the AoT and GPT-2 models pretrained on OpenWebText.

Table 3. Zero-shot results on several language benchmark datasets and tasks: evaluation of different sizes of AoT with the MSSA and MHSA operators and comparison to the GPT-2 model.
Models (# of parameters) | LAMBADA (val loss) | PTB (val loss) | WikiText (val loss) | LAMBADA (acc) | CBT CN (acc) | CBT NE (acc)
AoT-MSSA-L Base (102M) | 4.70 | 6.03 | 4.65 | 0.25 | 0.80 | 0.74
AoT-MSSA-L Medium (182M) | 4.47 | 5.08 | 4.22 | 0.29 | 0.84 | 0.77
AoT-MHSA-L Base (122M) | 4.42 | 5.52 | 4.19 | 0.38 | 0.86 | 0.82
GPT-2 Base (124M) | 4.32 | 5.75 | 4.13 | 0.40 | 0.87 | 0.84

Zero-shot evaluation. Using the above pre-trained models, we compute the cross-entropy validation loss, without any further training, on the WikiText (Merity et al., 2017)4, LAMBADA (Paperno et al., 2016)5, and PTB (Marcus et al., 1993) datasets in Table 3. In addition, we report zero-shot accuracy in Table 3 on LAMBADA for predicting the final word of sentences, as well as on the Children's Book Test (CBT) (Hill et al., 2015), where the task is to choose either common nouns (CN) or named entities (NE) from 10 possible options for an omitted word in a paragraph. We observe that the AoT models with medium and large parameter sizes can achieve performance comparable to the GPT-2 base model.
Moreover, we found that adding MLP layers to AoT does not improve zero-shot performance. These results highlight the potential of attention-only models to achieve competitive results while maintaining interpretability.

4 For WikiText2 and WikiText103 (Merity et al., 2017), the test splits are the same, so we merge them into a single dataset referred to as WikiText.
5 To obtain the accuracy on the LAMBADA dataset, we use greedy decoding.

4.2.2. IN-CONTEXT LEARNING

Figure 6. Evaluating models on in-context learning tasks. We plot the normalized squared error as a function of the number of in-context examples for linear regression (left) and sparse linear regression (right) tasks.

In-context learning (ICL) refers to the ability of modern language models to perform tasks by using examples provided in the input prompt, along with a new query input, generating outputs without updating the parameters (Brown et al., 2020; Garg et al., 2022; Park et al., 2024a). We evaluate the ICL capabilities of our AoT models and compare their performance with that of GPT-2 (Radford et al., 2019). Each model is trained from scratch on specific tasks, including linear and sparse linear regression. We mainly follow the setup in (Garg et al., 2022) to train models to learn linear functions in context. Specifically, for a given function class G, we generate random prompts by sampling a function g ∈ G from a distribution D_G over functions and random inputs x_1, . . . , x_N ∈ R^d i.i.d. from a distribution D_X over inputs. Evaluating g on these inputs, we create a prompt P = (x_1, g(x_1), . . . , x_N, g(x_N)). We train the model f_θ(·) to minimize the expected loss over all prompt prefixes:

min_θ E_P [ (1/N) Σ_{i=1}^N ( f_θ(P^i) − g(x_i) )² ],

where P^i = (x_1, g(x_1), . . . , x_i) is the prompt prefix up to the i-th in-context input.
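For concreteness, the following sketch shows one way to generate such regression prompts and evaluate the prefix-wise squared-error objective, following the setup of Garg et al. (2022) described above; the function names and the way prefixes are passed to the model are illustrative assumptions, not the exact pipeline used in the paper.

```python
import numpy as np

def sample_prompt(d=20, n_examples=40, sparse=False, seed=0):
    """Sample one prompt P = (x_1, g(x_1), ..., x_N, g(x_N)) for a (sparse)
    linear function g(x) = w^T x with isotropic Gaussian inputs."""
    rng = np.random.default_rng(seed)
    w = rng.standard_normal(d)
    if sparse:  # keep only 3 non-zero coordinates of w
        mask = np.zeros(d)
        mask[rng.choice(d, size=3, replace=False)] = 1.0
        w = w * mask
    X = rng.standard_normal((n_examples, d))
    return X, X @ w

def prefix_loss(model, X, y):
    """Average squared error over prompt prefixes P^i = (x_1, g(x_1), ..., x_i);
    `model` maps a prefix (past inputs/labels plus the current query) to a scalar."""
    errs = []
    for i in range(len(y)):
        prediction = model(X[: i + 1], y[:i])  # inputs up to x_i, labels up to g(x_{i-1})
        errs.append((prediction - y[i]) ** 2)
    return float(np.mean(errs))
```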
Tasks. We consider both linear functions and sparse linear functions with dimension d = 20. The in-context examples x_i are sampled from the isotropic Gaussian distribution. For linear functions, we define G = {g : g(x) = w^T x}, where w is sampled from the isotropic Gaussian distribution as well. For sparse linear functions, the setup is similar, but with a modification: only 3 coordinates of the vector w are non-zero, while the remaining ones are set to zero.

Training and evaluation. For all experiments, we set the number of heads to 8 and the embedding size to 128. We present the model configurations in Table 5 in Appendix C. To train the models, we sample batches of 64 random prompts and train for 50,000 iterations using the Adam optimizer (Kingma & Ba, 2014). We evaluate the models using the same D_G and D_X to sample 1280 prompts. We refer the reader to (Park et al., 2024a) for more details.

We plot the estimation error against the number of in-context examples in Figure 6. We observe that our AoT architecture can in-context learn linear functions and sparse linear functions, achieving performance close to that of the GPT-2 transformer.

5. Conclusion

In this work, we proposed a new and minimalistic transformer architecture by interpreting each layer as a subspace denoising operator applied to token representations, where these representations are assumed to be sampled from a mixture of low-rank Gaussians. Remarkably, this simple architecture consists of only multi-head (subspace) self-attention and skip connections at each layer, without any MLP layers. We rigorously proved that each such layer improves the signal-to-noise ratio of token representations at a linear rate with respect to the number of layers. Extensive experiments on both language and vision tasks demonstrate that this simplified architecture achieves performance comparable to that of standard transformers. Our theoretical and empirical findings suggest that subspace denoising via attention heads is the core mechanism underlying transformer effectiveness, with MLP layers contributing only marginal performance gains. We believe this work lays a foundation for future exploration of more efficient and principled architectural designs.

Impact Statement

This paper presents work whose goal is to advance the field of machine learning. Both the theoretical analysis and the experimental evaluation presented in this paper aim to help people better understand the attention layer in a standard transformer architecture. Beyond its academic value, we do not anticipate any social or ethical implications.

Acknowledgment

Peng Wang and Qing Qu would like to acknowledge support from the NSF grant #2402950. Druv Pai would like to acknowledge support from the UC Berkeley College of Engineering Fellowship. Yi Ma would like to acknowledge support from the joint Simons Foundation-NSF DMS grant #2031899, the ONR grant N00014-22-1-2102, the NSF grant #2402951, and also support from the HKU startup, the Hong Kong Center for Construction Robotics Limited (HKCRC) Award 052245, and the JC Club of Hong Kong.

References

Achiam, J., Adler, S., Agarwal, S., Ahmad, L., Akkaya, I., Aleman, F. L., Almeida, D., Altenschmidt, J., Altman, S., Anadkat, S., et al. GPT-4 technical report. arXiv preprint arXiv:2303.08774, 2023. Ahn, K., Cheng, X., Daneshmand, H., and Sra, S. Transformers learn to implement preconditioned gradient descent for in-context learning. In Advances in Neural Information Processing Systems, volume 36, pp. 45614-45650, 2023. Alcalde, A., Fantuzzi, G., and Zuazua, E. Clustering in pure-attention hardmax transformers and its role in sentiment analysis. arXiv preprint arXiv:2407.01602, 2024. Arora, S., Li, Y., Liang, Y., Ma, T., and Risteski, A. Linear algebraic structure of word senses, with applications to polysemy. Transactions of the Association for Computational Linguistics, 6:483-495, 2018. Bao, F., Nie, S., Xue, K., Cao, Y., Li, C., Su, H., and Zhu, J. All are worth words: A ViT backbone for diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 22669-22679, 2023. Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., Agarwal, S., Herbert-Voss, A., Krueger, G., Henighan, T., Child, R., Ramesh, A., Ziegler, D., Wu, J., Winter, C., Hesse, C., Chen, M., Sigler, E., Litwin, M., Gray, S., Chess, B., Clark, J., Berner, C., McCandlish, S., Radford, A., Sutskever, I., and Amodei, D. Language models are few-shot learners. In Advances in Neural Information Processing Systems, volume 33, pp. 1877-1901, 2020. Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., and Joulin, A. Emerging properties in self-supervised vision transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650-9660, 2021. Chan, K. H. R., Yu, Y., You, C., Qi, H., Wright, J., and Ma, Y. ReduNet: A white-box deep network from the principle of maximizing rate reduction. Journal of Machine Learning Research, 23(114):1-103, 2022. Chen, L., Lu, K., Rajeswaran, A., Lee, K., Grover, A., Laskin, M., Abbeel, P., Srinivas, A., and Mordatch, I.
Decision transformer: Reinforcement learning via sequence modeling. In Advances in Neural Information Processing Systems, volume 34, pp. 15084 15097, 2021. Chen, M., Radford, A., Child, R., Wu, J., Jun, H., Luan, D., and Sutskever, I. Generative pretraining from pixels. In International Conference on Machine Learning, pp. 1691 1703. PMLR, 2020. Chen, X., Liang, C., Huang, D., Real, E., Wang, K., Pham, H., Dong, X., Luong, T., Hsieh, C.-J., Lu, Y., et al. Symbolic discovery of optimization algorithms. In Advances in Neural Information Processing Systems, volume 36, 2024. Chen, Y., Yun, Z., Ma, Y., Olshausen, B., and Le Cun, Y. Minimalistic unsupervised representation learning with the sparse manifold transform. In The Eleventh International Conference on Learning Representations, 2023. Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., and Fei-Fei, L. Image Net: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248 255. IEEE, 2009. Devlin, J. Bert: Pre-training of deep bidirectional transformers for language understanding. ar Xiv preprint ar Xiv:1810.04805, 2018. Dong, Y., Cordonnier, J.-B., and Loukas, A. Attention is not all you need: Pure attention loses rank doubly exponentially with depth. In International Conference on Machine Learning, pp. 2793 2803. PMLR, 2021. Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., and Houlsby, N. An image is worth 16x16 words: Transformers for image recognition at scale. In International Conference on Learning Representations, 2021. ar Xiv preprint ar Xiv:2010.11929. Elhage, N., Hume, T., Olsson, C., Schiefer, N., Henighan, T., Kravec, S., Hatfield-Dodds, Z., Lasenby, R., Drain, D., Chen, C., et al. Toy models of superposition. ar Xiv preprint ar Xiv:2209.10652, 2022. Elhamifar, E. and Vidal, R. Sparse subspace clustering: Algorithm, theory, and applications. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(11): 2765 2781, 2013. Engels, J., Michaud, E. J., Liao, I., Gurnee, W., and Tegmark, M. Not all language model features are onedimensionally linear. In International Conference on Learning Representations, 2025. Garg, S., Tsipras, D., Liang, P. S., and Valiant, G. What can transformers learn in-context? a case study of simple function classes. In Advances in Neural Information Processing Systems, volume 35, pp. 30583 30598, 2022. Geshkovski, B., Letrouit, C., Polyanskiy, Y., and Rigollet, P. The emergence of clusters in self-attention dynamics. Attention-Only Transformers In Advances in Neural Information Processing Systems, volume 36, pp. 57026 57037, 2023a. Geshkovski, B., Letrouit, C., Polyanskiy, Y., and Rigollet, P. A mathematical perspective on transformers. ar Xiv preprint ar Xiv:2312.10794, 2023b. Geva, M., Schuster, R., Berant, J., and Levy, O. Transformer feed-forward layers are key-value memories. ar Xiv preprint ar Xiv:2012.14913, 2020. Gokaslan, A. and Cohen, V. Openwebtext corpus. http://Skylion007.github.io/ Open Web Text Corpus, 2019. Gregor, K. and Le Cun, Y. Learning fast approximations of sparse coding. In International Conference on Machine Learning, pp. 399 406, 2010. Guo, J., Chen, X., Tang, Y., and Wang, Y. Slab: Efficient transformers with simplified linear attention and progressive re-parameterized batch normalization. In International Conference on Machine Learning, pp. 16802 16812. PMLR, 2024. He, B. and Hofmann, T. 
Simplifying transformer blocks. In The Twelfth International Conference on Learning Representations, 2024. Hill, F., Bordes, A., Chopra, S., and Weston, J. The goldilocks principle: Reading children s books with explicit memory representations. ar Xiv preprint ar Xiv:1511.02301, 2015. Huh, M., Cheung, B., Wang, T., and Isola, P. Position: The platonic representation hypothesis. In International Conference on Machine Learning, volume 235, pp. 20617 20642. PMLR, 21 27 Jul 2024. Jiang, Y., Rajendran, G., Ravikumar, P. K., Aragam, B., and Veitch, V. On the origins of linear representations in large language models. In International Conference on Machine Learning, volume 235, pp. 21879 21911. PMLR, 21 27 Jul 2024. Karpathy, A. Nano GPT. https://github.com/ karpathy/nano GPT, 2022. Katharopoulos, A., Vyas, A., Pappas, N., and Fleuret, F. Transformers are RNNs: Fast autoregressive transformers with linear attention. In International Conference on Machine Learning, pp. 5156 5165. PMLR, 2020. Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization. ar Xiv preprint ar Xiv:1412.6980, 2014. Kitaev, N., Kaiser, L., and Levskaya, A. Reformer: The efficient transformer. In International Conference on Learning Representations, 2020. Loshchilov, I. and Hutter, F. Decoupled weight decay regularization. In International Conference on Learning Representations, 2019. Luo, J., Ding, T., Chan, K. H. R., Thaker, D., Chattopadhyay, A., Callison-Burch, C., and Vidal, R. Pace: Parsimonious concept engineering for large language models. In Advances in Neural Information Processing Systems, 2024. Ma, Y., Tsao, D., and Shum, H.-Y. On the principles of parsimony and self-consistency for the emergence of intelligence. Frontiers of Information Technology & Electronic Engineering, 23(9):1298 1323, 2022. Marcus, M., Santorini, B., and Marcinkiewicz, M. A. Building a large annotated corpus of english: The penn treebank. Computational Linguistics, 19(2):313 330, 1993. Merity, S., Xiong, C., Bradbury, J., and Socher, R. Pointer sentinel mixture models. In International Conference on Learning Representations, 2017. Monga, V., Li, Y., and Eldar, Y. C. Algorithm unrolling: Interpretable, efficient deep learning for signal and image processing. IEEE Signal Processing Magazine, 38(2): 18 44, 2021. Noci, L., Li, C., Li, M., He, B., Hofmann, T., Maddison, C. J., and Roy, D. The shaped transformer: Attention models in the infinite depth-and-width limit. In Advances in Neural Information Processing Systems, volume 36, 2024. Pai, D., Buchanan, S., Wu, Z., Yu, Y., and Ma, Y. Masked completion via structured diffusion with white-box transformers. In The Twelfth International Conference on Learning Representations, 2023. Paperno, D., Kruszewski, G., Lazaridou, A., Pham, N.-Q., Bernardi, R., Pezzelle, S., Baroni, M., Boleda, G., and Fern andez, R. The lambada dataset: Word prediction requiring a broad discourse context. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 1525 1534, 2016. Park, J., Park, J., Xiong, Z., Lee, N., Cho, J., Oymak, S., Lee, K., and Papailiopoulos, D. Can Mamba learn how to learn? a comparative study on in-context learning tasks. In International Conference on Machine Learning, pp. 39793 39812, 2024a. Park, K., Choe, Y. J., and Veitch, V. The linear representation hypothesis and the geometry of large language models. In International Conference on Machine Learning, pp. 39643 39666. PMLR, 2024b. 
Attention-Only Transformers Pires, T., Lopes, A. V., Assogba, Y., and Setiawan, H. One wide feedforward is all you need. In Proceedings of the Eighth Conference on Machine Translation, pp. 1031 1044, 2023. Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., Sutskever, I., et al. Language models are unsupervised multitask learners. Open AI blog, 1(8):9, 2019. Saharia, C., Chan, W., Saxena, S., Li, L., Whang, J., Denton, E. L., Ghasemipour, K., Gontijo Lopes, R., Karagol Ayan, B., Salimans, T., et al. Photorealistic text-to-image diffusion models with deep language understanding. Advances in Neural Information Processing Systems, 35: 36479 36494, 2022. Schlag, I., Irie, K., and Schmidhuber, J. Linear transformers are secretly fast weight programmers. In International Conference on Machine Learning, pp. 9355 9366. PMLR, 2021. Sukhbaatar, S., Grave, E., Lample, G., Jegou, H., and Joulin, A. Augmenting self-attention with persistent memory. ar Xiv preprint ar Xiv:1907.01470, 2019. Sun, X., Nasrabadi, N. M., and Tran, T. D. Supervised deep sparse coding networks for image classification. IEEE Transactions on Image Processing, 29:405 418, 2019. Templeton, A. Scaling monosemanticity: Extracting interpretable features from Claude 3 sonnet. Anthropic, 2024. Tsai, Y.-H. H., Bai, S., Yamada, M., Morency, L.-P., and Salakhutdinov, R. Transformer dissection: An unified understanding for transformer s attention via the lens of kernel. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pp. 4344 4353, 2019. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., and Polosukhin, I. Attention is all you need. In Advances in Neural Information Processing Systems, volume 30, 2017. Vershynin, R. High-dimensional probability: An introduction with applications in data science, volume 47. Cambridge university press, 2018. Wang, P., Liu, H., So, A. M.-C., and Balzano, L. Convergence and recovery guarantees of the K-subspaces method for subspace clustering. In International Conference on Machine Learning, pp. 22884 22918. PMLR, 2022. Wang, P., Zhang, H., Zhang, Z., Chen, S., Ma, Y., and Qu, Q. Diffusion models learn low-dimensional distributions via subspace clustering. ar Xiv preprint ar Xiv:2409.02426, 2024. Wang, S., Fidler, S., and Urtasun, R. Proximal deep structured models. Advances in Neural Information Processing Systems, 29, 2016. Wright, J. and Ma, Y. High-dimensional data analysis with low-dimensional models: Principles, computation, and applications. Cambridge University Press, 2022. Wu, X., Ajorlou, A., Wang, Y., Jegelka, S., and Jadbabaie, A. On the role of attention masks and layernorm in transformers. In Advances in Neural Information Processing Systems, volume 37, 2024. Yu, Y., Buchanan, S., Pai, D., Chu, T., Wu, Z., Tong, S., Haeffele, B., and Ma, Y. White-box transformers via sparse rate reduction. In Advances in Neural Information Processing Systems, volume 36, pp. 9422 9457, 2023. Yu, Y., Buchanan, S., Pai, D., Chu, T., Wu, Z., Tong, S., Bai, H., Zhai, Y., Haeffele, B. D., and Ma, Y. White-box transformers via sparse rate reduction: compression is all there is? Journal of Machine Learning Research, 25 (300):1 128, 2024. Yun, Z., Chen, Y., Olshausen, B. A., and Le Cun, Y. Transformer visualization via dictionary learning: contextualized embedding as a linear superposition of transformer factors. 
ar Xiv preprint ar Xiv:2103.15949, 2021. Zhang, J. and Ghanem, B. ISTA-Net: Interpretable optimization-inspired deep network for image compressive sensing. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1828 1837, 2018. Zhang, R., Frei, S., and Bartlett, P. L. Trained transformers learn linear models in-context. Journal of Machine Learning Research, 25(49):1 55, 2024. Attention-Only Transformers To simplify our development, we introduce some further notations. We use Blk Diag(X1, . . . , XK) to denote a block diagonal matrix whose diagonal blocks are X1, . . . , XK. A. Related Literature Existing studies on self-attention mechanisms. It is widely believed that the power of transformers primarily stems from their self-attention layers, which enable the model to capture long-range dependencies and contextual relationships between tokens by dynamically weighing token relationships across the input sequence (Tsai et al., 2019; Vaswani et al., 2017). To explore the mechanism behind self-attention, numerous studies have investigated the performance of pure self-attention networks, often incorporating only one additional component to prevent rank collapse and maintain expressiveness. For example, Dong et al. (2021) showed that in pure-attention transformers without skip connections and MLP layers, token representations collapse exponentially to a rank-1 matrix across layers. They also showed that self-attention networks with skip connections prevent rank collapse. Geshkovski et al. (2023a;b) have studied the dynamics of multi-head self-attentions and characterized clustering behaviors of learned representations. Recently, Wu et al. (2024) showed that pure self-attention networks with Layer Norm can prevent rank collapse. While these studies have advanced the theoretical understanding of self-attention mechanisms in simplified transformer architectures, they cannot provide any empirical validation on real-world vision or language tasks, offering little insight into the role of self-attention in practice. Network architecture design via unrolled optimization. It is commonly believed that the success of modern deep networks largely stems from their ability to transform the raw data into compact and structured representations, which facilitates downstream tasks (Chan et al., 2022; Chen et al., 2023; Ma et al., 2022; Yu et al., 2024). A principled and interpretable approach to learning such representations with transformers is to construct an architecture that incrementally transforms tokens into these representations via unrolling optimization steps as layers of a deep network (Chan et al., 2022; Monga et al., 2021; Wang et al., 2016; Yu et al., 2023; Zhang & Ghanem, 2018). Notably, Monga et al. (2021) demonstrate that such unrolled networks are more interpretable, parameter-efficient, and effective compared to generic networks. Using this approach, each iteration of an algorithm for learning compact and structured representations is represented as one layer of deep networks. For example, Gregor & Le Cun (2010) have demonstrated that sparse coding algorithms, such as ISTA, can be used to construct MLPs. Recently, Chan et al. (2022) constructed a white-box network based on an iterative gradient descent scheme to optimize the maximal coding rate reduction objective. More recently, Yu et al. (2024) designed a white-box transformer architecture by implementing an approximate alternating minimization to optimize the sparse rate reduction objective. 
The proposed transformer achieves performance comparable to some popular ones such as Vi T (Dosovitskiy et al., 2021) and DINO (Caron et al., 2021). B. Proof of Theorem 3.1 B.1. Preliminary Results To prove Theorem 3.1, we first establish several probabilistic results about Gaussian random vectors. First, we present a probabilistic bound on the deviation of the norm of Gaussian random vectors from its mean. This is an extension of Vershynin (2018, Theorem 3.1.1). Lemma B.1. Let x N(0, δ2Id) be a Gaussian random vector. It holds with probability at least 1 2 exp t2/2δ2 that x δ d t + 2δ. (10) Based on the above lemma, we can respectively estimate the norm of coefficients in the signal and noise parts, the products between different pairs of Gaussian random vectors, and the bounds on the soft-max values of these products. Lemma B.2. Consider the setting in Definition 2.1 with p = p1 = = p K and N1 = = NK = N/K. Suppose that p 16( log N + 1)2 and N 8πK2 log3 N, δ 1 The following statements hold: Attention-Only Transformers (i) With probability at least 1 2KN 1, we have | ai p| 2 p log N + 1 , i [N], (12) | ei,l δ p| 2δ p log N + 1 , i Ck, l = k [K]. (13) (ii) With probability at least 1 4KN 2, we have | ai, aj | 3 p log N ai , i = j Ck, k [K], (14) | ai, ej,l | 3 p log N ej,l , i Ck, j Cl, k = l [K], (15) | ei,k, ej,k | 3δ p log N ej,k , i Cl, j Cm, l, m = k. (16) (iii) With probability at least 1 2N 1, we have max i Ck ai, ej,k p log N ej,k , j Cl, l = k [K]. (17) (iv) With probability at least 1 4KN 1, we have exp ( ai, ej,k ) P i Ck exp ( ai , ej,k ) 1 2, i Ck, j Cl, k = l [K], (18) exp ( ei,k, ej,k ) P i =j,i Cl exp ( ei ,k, ej,k ) 1 2, i = j, i Cl, j Cm, l, m = k. (19) Proof. (i) Applying Lemma B.1 to ai N(0, Ip) with t = 2 log N yields P | ai p| 2( p log N + 1) 1 2N 2. This, together with the union bound, yields that (12) holds for all i [N] with probability at least 1 2N 1. Using the same argument, we obtain that (13) holds for all i Ck and l = k [K] with probability at least 1 2(K 1)N 1. Finally, applying the union bound yields that the probability is 1 2KN 1. (ii) For each pair (i, j) with i = j Ck and k [K], conditioned on ai, we have ai, aj N(0, ai 2). According to the tail bound the Gaussian random variable, we have P | ai, aj | 3 ai p log N ai 2N 4. This, together with the union bound, implies that conditioned on ai, it holds with probability at least 1 2N 2 that | ai, aj | 2 ai log N for all i = j Ck and k [K]. Using the same argument, we obtain (15) and (16). Finally, applying the union bound yields the probability. (iii) Conditioned on ej,k, we obtain that Xi := ai, ej,k / ej,k N(0, 1) for each i Ck are i.i.d. standard normal random variables. Then, we have P max i Ck Xi p log N = 1 P X1 < p log N Nk . (20) Using the property of the standard Gaussian random variable, we have Taking t = log N, we obtain log N = 1 log N 2π exp log N 1 2 2πN log N , (21) Attention-Only Transformers where the inequality follows from N exp(2). Substituting this into (20) yields P max i Ck Xi p log N 1 1 1 2 2πN log N N 2K 2π log N where the second inequality uses 1 x exp ( x) for all x > 0 and the last inequality follows from N 8πK2 log3 N. This, together with the definition of Xi, implies (17). (iv) Conditioned on ej,k, we have Xi := ai, ej,k N(0, ej,k 2) for each i Ck are i.i.d. normal random variables. Suppose that (13) holds for all i Ck, l = k [K], which happens with probability at least 1 2(K 1)N 1 according to (i). 
B.2. Proof of Theorem 3.1

To simplify our development, for a parameter $\theta \ge 1$ we define
$$ M_1 := \begin{bmatrix} \theta^2 A_1^T A_1 & \theta A_1^T E_{2,1} & \dots & \theta A_1^T E_{K,1} \\ \theta E_{2,1}^T A_1 & E_{2,1}^T E_{2,1} & \dots & E_{2,1}^T E_{K,1} \\ \vdots & \vdots & \ddots & \vdots \\ \theta E_{K,1}^T A_1 & E_{K,1}^T E_{2,1} & \dots & E_{K,1}^T E_{K,1} \end{bmatrix}, \qquad
M_K := \begin{bmatrix} E_{1,K}^T E_{1,K} & E_{1,K}^T E_{2,K} & \dots & \theta E_{1,K}^T A_K \\ E_{2,K}^T E_{1,K} & E_{2,K}^T E_{2,K} & \dots & \theta E_{2,K}^T A_K \\ \vdots & \vdots & \ddots & \vdots \\ \theta A_K^T E_{1,K} & \theta A_K^T E_{2,K} & \dots & \theta^2 A_K^T A_K \end{bmatrix}, \qquad (26) $$
and $M_2, \dots, M_{K-1}$ analogously; that is, the $k$-th block row and block column of $M_k$ carry the factor $\theta$, and its $(k, k)$-th block is $\theta^2 A_k^T A_k$. Recall that
$$ Z^{(0)} = \begin{bmatrix} Z_1^{(0)} & \dots & Z_K^{(0)} \end{bmatrix} = \begin{bmatrix} U_1 A_1 + \sum_{j \neq 1} U_j E_{1,j} & \dots & U_K A_K + \sum_{j \neq K} U_j E_{K,j} \end{bmatrix}. \qquad (27) $$

Lemma B.3. Consider the setting in Definition 2.1 with $p = p_1 = \dots = p_K$ and $N_1 = \dots = N_K = N/K$. Let $\phi(\cdot)$ be
$$ \phi(x) = h(\sigma(x)), \qquad (28) $$
where $\sigma: \mathbb{R}^N \to \mathbb{R}^N$ is the soft-max function and $h: \mathbb{R}^N \to \mathbb{R}^N$ is an element-wise thresholding function with $h(x)_i = \tau \mathbb{1}\{x_i > \tau\}$ for each $i \in [N]$. Suppose that (11) holds. Suppose in addition that $p \ge 64(\sqrt{\log N} + 1)^2$ and
$$ \tau \in \left( \frac{1}{2},\ \frac{1}{1 + N\exp(-9p/32)} \right). \qquad (29) $$
Then, with probability at least $1 - KN^{-\Omega(1)}$, it holds that
$$ \phi(M_1) = \mathrm{BlkDiag}(\tau I, 0, \dots, 0),\ \dots,\ \phi(M_K) = \mathrm{BlkDiag}(0, 0, \dots, \tau I). \qquad (30) $$

Proof. Suppose that (12)–(19) hold, which happens with probability at least $1 - KN^{-\Omega(1)}$ according to Lemma B.2, (11), and the union bound. We focus on studying $M_1$ as defined in (26). For ease of exposition, we denote the $i$-th column of $M_1$ by $m_i \in \mathbb{R}^N$ for each $i \in [N]$. Moreover, recall that $C_1 = \{1, 2, \dots, N/K\}, \dots, C_K = \{(K-1)N/K + 1, (K-1)N/K + 2, \dots, N\}$. We divide the proof into two cases: we first study the $i$-th column of $M_1$ for each $i \in C_1$, and then study the $i$-th column of $M_1$ for each $i \in C_k$ with $k \neq 1$.

Case 1. According to (26), we have for each $i \in C_1$,
$$ m_{ij} = \theta^2 \langle a_i, a_j \rangle, \ j \in C_1, \qquad m_{ij} = \theta \langle a_i, e_{j,1} \rangle, \ j \in C_k,\ k \neq 1. $$
For each pair $(i, j)$ with $i \neq j \in C_1$, we compute
$$ \frac{\sigma_i(m_i)}{\sigma_j(m_i)} = \exp\left( m_{ii} - m_{ij} \right) \ge \exp\left( \theta \|a_i\| \left( \theta \|a_i\| - 3\sqrt{\log N} \right) \right) \ge \exp\left( \frac{9\theta^2 p}{32} \right), $$
where the first inequality follows from (14) and the second uses (12), $\theta \ge 1$, and $\sqrt{p} \ge 8(\sqrt{\log N} + 1)$. Using the same argument, for each pair $(i, j)$ with $i \in C_1$, $j \in C_k$, and $k \neq 1$, we obtain $\sigma_i(m_i)/\sigma_j(m_i) \ge \exp(9\theta^2 p/32)$. This, together with $\sum_{j=1}^N \sigma_j(m_i) = 1$, yields
$$ \frac{1}{1 + (N-1)\exp\left(-9\theta^2 p/32\right)} \le \sigma_i(m_i) \le 1. $$
Therefore, we have for each $i \in C_1$,
$$ \sigma_i(m_i) \ge \frac{1}{1 + N\exp\left(-9\theta^2 p/32\right)} > \frac{1}{2}, \qquad \sigma_j(m_i) < \frac{1}{2}, \ \forall j \neq i, \qquad (32) $$
where the last inequality follows from $p \ge 64(\sqrt{\log N} + 1)^2$. This, together with the value of $\tau$ in (29), yields for each $i \in C_1$,
$$ \sigma_j(m_i) < \tau < \sigma_i(m_i), \quad \forall j \neq i. $$
Using this and (28), we have for each $i \in C_1$,
$$ h\left(\sigma_i(m_i)\right) = \tau, \qquad h\left(\sigma_j(m_i)\right) = 0, \ \forall j \neq i. $$

Case 2. For each $i \in C_k$ with $k \neq 1$, it follows from (26) that
$$ m_{ij} = \theta \langle e_{i,1}, a_j \rangle, \ j \in C_1, \qquad m_{ij} = \langle e_{i,1}, e_{j,1} \rangle, \ j \in C_l,\ l \neq 1. $$
Consider a fixed $i \in C_k$ with $k \neq 1$. It follows from (17) that there exists $j_i \in C_1$ such that $m_{i j_i} \ge \theta \|e_{i,1}\| \sqrt{\log N}$. This implies
$$ \frac{\sigma_i(m_i)}{\sigma_{j_i}(m_i)} = \exp\left( m_{ii} - m_{i j_i} \right) \le \exp\left( \|e_{i,1}\|^2 - \theta \|e_{i,1}\| \sqrt{\log N} \right) \le \exp\left( \frac{25\delta^2 p}{16} - \frac{3\theta\delta\sqrt{p\log N}}{4} \right), $$
where the second inequality follows from (13) and $\sqrt{p} \ge 8(\sqrt{\log N} + 1)$. This, together with $\sigma_i(m_i) + \sigma_{j_i}(m_i) \le 1$, implies
$$ \sigma_i(m_i) < \frac{1}{1 + \exp\left( 3\theta\delta\sqrt{p\log N}/4 - 25\delta^2 p/16 \right)} < \frac{1}{1 + \exp\left( \theta\delta\sqrt{p\log N}/2 \right)} < \frac{1}{2}, \qquad (33) $$
where the second inequality uses $\delta\sqrt{p} \le \sqrt{\log N}/8$ due to (11). On the other hand, it follows from (18) and (19) that $\sigma_j(m_i) \le 1/2$ for all $j \neq i$. This, together with (33), $\delta \le 1/8$, $\sqrt{p} \ge 8(\sqrt{\log N} + 1)$, and the value of $\tau$ in (29), yields for each $i \in C_k$ with $k \neq 1$,
$$ \sigma_j(m_i) < \tau, \quad \forall j \in [N]. \qquad (34) $$
This directly implies $h(\sigma(m_i)) = 0$ for all $i \in C_k$ with $k \neq 1$. Then, we have
$$ \phi(M_1) = \begin{bmatrix} \tau I & 0 \\ 0 & 0 \end{bmatrix} = \mathrm{BlkDiag}(\tau I, 0, \dots, 0). $$
Applying the same argument to $M_2, \dots, M_K$, we obtain (30).
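The block-diagonal selection behavior established in Lemma B.3 can also be observed numerically. The following is a minimal sketch under our reading of the token model (signal coefficients $a_i \sim \mathcal{N}(0, I_p)$ and cross-subspace noise coefficients scaled by $\delta$, which we assume matches Definition 2.1); the dimensions, $\delta$, and $\tau$ below are illustrative and not tuned to satisfy (11) and (29) exactly.

```python
import numpy as np

def phi(M, tau):
    # phi = h(softmax(M)) applied column-wise: softmax each column, then
    # hard-threshold, mapping entries > tau to tau and all others to 0 (cf. (28)).
    M = M - M.max(axis=0, keepdims=True)            # numerically stable softmax
    S = np.exp(M) / np.exp(M).sum(axis=0, keepdims=True)
    return tau * (S > tau)

# Toy instance: K clusters of n tokens each, cluster k lying near subspace U_k.
rng = np.random.default_rng(0)
D, p, K, n, delta, tau = 256, 32, 4, 64, 0.05, 0.6
U = np.linalg.qr(rng.standard_normal((D, K * p)))[0].reshape(D, K, p).transpose(1, 0, 2)
Z = np.concatenate(
    [U[k] @ rng.standard_normal((p, n))
     + delta * sum(U[j] @ rng.standard_normal((p, n)) for j in range(K) if j != k)
     for k in range(K)], axis=1)

# M_1 = Z^T U_1 U_1^T Z; phi(M_1) should be supported (approximately) on the
# diagonal of the first block, mirroring the statement (30) of Lemma B.3.
M1 = Z.T @ U[0] @ U[0].T @ Z
out = phi(M1, tau)
print("nonzeros inside the cluster-1 block:", np.count_nonzero(out[:n, :n]))
print("nonzeros outside the cluster-1 block:", np.count_nonzero(out) - np.count_nonzero(out[:n, :n]))
```

This is only a qualitative illustration of (30), not a verification of the stated probability bounds.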
Armed with the above result, we are ready to prove Theorem 3.1.

Proof of Theorem 3.1. For ease of exposition, let $M_k^{(l)} := Z^{(l)T} U_k U_k^T Z^{(l)}$ for each $k \in [K]$ and $l \in [L]$. Suppose that (30) holds, which happens with probability at least $1 - KN^{-\Omega(1)}$ according to (11), (29), and Lemma B.3. We claim that for each $l \in [L]$,
$$ Z^{(l)} = \begin{bmatrix} (1 + \eta\tau)^l U_1 A_1 + \sum_{j \neq 1} U_j E_{1,j} & \dots & (1 + \eta\tau)^l U_K A_K + \sum_{j \neq K} U_j E_{K,j} \end{bmatrix}. \qquad (35) $$
This, together with (6), yields for each $k \in [K]$ and $l \in [L]$,
$$ \mathrm{SNR}\left(Z_k^{(l)}\right) = \frac{\left\| U_k U_k^T Z_k^{(l)} \right\|_F}{\left\| \left(I - U_k U_k^T\right) Z_k^{(l)} \right\|_F} = \frac{(1 + \eta\tau)^l \left\| A_k \right\|_F}{\left\| \sum_{j \neq k} U_j E_{k,j} \right\|_F}, $$
which directly implies (8) for each $k \in [K]$ and $l \in [L-1]$. According to the union bound, the overall probability is at least $1 - KLN^{-\Omega(1)}$.

The rest of the proof is devoted to proving the claim (35) by induction. First, we consider the base case $l = 1$. According to (27) and (7), we compute
$$ U_1 U_1^T Z^{(0)} = \begin{bmatrix} U_1 A_1 & U_1 E_{2,1} & \dots & U_1 E_{K,1} \end{bmatrix}, $$
$$ M_1^{(0)} = \left( U_1 U_1^T Z^{(0)} \right)^T \left( U_1 U_1^T Z^{(0)} \right) = \begin{bmatrix} A_1^T A_1 & A_1^T E_{2,1} & \dots & A_1^T E_{K,1} \\ E_{2,1}^T A_1 & E_{2,1}^T E_{2,1} & \dots & E_{2,1}^T E_{K,1} \\ \vdots & \vdots & \ddots & \vdots \\ E_{K,1}^T A_1 & E_{K,1}^T E_{2,1} & \dots & E_{K,1}^T E_{K,1} \end{bmatrix}. $$
Using the same argument, we can compute $M_k^{(0)}$ for each $k \in [K]$. This, together with (30) for each $k \in [K]$, yields
$$ \sum_{k=1}^K U_k U_k^T Z^{(0)} \phi\left( M_k^{(0)} \right) = \begin{bmatrix} \tau U_1 A_1 & \tau U_2 A_2 & \dots & \tau U_K A_K \end{bmatrix}. $$
Using this, (27), and (4), we directly obtain that (35) holds for $l = 1$.

Next, suppose that (35) holds for some $l \ge 1$. We compute
$$ U_1 U_1^T Z^{(l)} = \begin{bmatrix} (1 + \eta\tau)^l U_1 A_1 & U_1 E_{2,1} & \dots & U_1 E_{K,1} \end{bmatrix}, $$
$$ M_1^{(l)} = \begin{bmatrix} (1 + \eta\tau)^{2l} A_1^T A_1 & (1 + \eta\tau)^l A_1^T E_{2,1} & \dots & (1 + \eta\tau)^l A_1^T E_{K,1} \\ (1 + \eta\tau)^l E_{2,1}^T A_1 & E_{2,1}^T E_{2,1} & \dots & E_{2,1}^T E_{K,1} \\ \vdots & \vdots & \ddots & \vdots \\ (1 + \eta\tau)^l E_{K,1}^T A_1 & E_{K,1}^T E_{2,1} & \dots & E_{K,1}^T E_{K,1} \end{bmatrix}. $$
Using the same argument, we can compute $M_k^{(l)}$ for each $k \in [K]$. Note that $M_k^{(l)}$ takes the form of $M_k$ in (26) with $\theta = (1 + \eta\tau)^l \ge 1$, so Lemma B.3 applies. This, together with (30) for each $k \in [K]$, yields
$$ \sum_{k=1}^K U_k U_k^T Z^{(l)} \phi\left( M_k^{(l)} \right) = \begin{bmatrix} (1 + \eta\tau)^l \tau U_1 A_1 & (1 + \eta\tau)^l \tau U_2 A_2 & \dots & (1 + \eta\tau)^l \tau U_K A_K \end{bmatrix}. $$
Using this, (35), and (4), we directly obtain that (35) holds for $l + 1$. This proves the claim and completes the proof.
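The geometric growth of the signal-to-noise ratio predicted by (35) can likewise be simulated. The sketch below assumes that the layer update (4) takes the form $Z^{(l+1)} = Z^{(l)} + \eta \sum_k U_k U_k^T Z^{(l)} \phi(M_k^{(l)})$, which is our reading of the MSSA-style update; all dimensions and hyperparameters are illustrative.

```python
import numpy as np

def phi(M, tau):
    # Column-wise softmax followed by hard thresholding at tau (cf. (28)).
    M = M - M.max(axis=0, keepdims=True)
    S = np.exp(M) / np.exp(M).sum(axis=0, keepdims=True)
    return tau * (S > tau)

def snr(Zk, Uk):
    # SNR of one cluster: ||U_k U_k^T Z_k||_F / ||(I - U_k U_k^T) Z_k||_F, cf. (6).
    P = Uk @ Uk.T
    return np.linalg.norm(P @ Zk) / np.linalg.norm(Zk - P @ Zk)

rng = np.random.default_rng(1)
D, p, K, n, delta, eta, tau, L = 256, 32, 4, 64, 0.02, 0.1, 0.6, 8
U = np.linalg.qr(rng.standard_normal((D, K * p)))[0].reshape(D, K, p).transpose(1, 0, 2)
Z = np.concatenate(
    [U[k] @ rng.standard_normal((p, n))
     + delta * sum(U[j] @ rng.standard_normal((p, n)) for j in range(K) if j != k)
     for k in range(K)], axis=1)

snr0 = snr(Z[:, :n], U[0])
for l in range(L):
    growth = snr(Z[:, :n], U[0]) / snr0
    print(f"layer {l}: SNR growth = {growth:.3f}, predicted (1 + eta*tau)^l = {(1 + eta * tau) ** l:.3f}")
    # One attention-only layer with a skip connection (our reading of (4)).
    Z = Z + eta * sum(U[k] @ U[k].T @ Z @ phi(Z.T @ U[k] @ U[k].T @ Z, tau)
                      for k in range(K))
```

If the thresholding in $\phi$ stays exact across layers, the printed SNR growth should track $(1 + \eta\tau)^l$, mirroring (35).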
Table 4. The architectures of GPT-2 and AoT models. For GPT-2, each layer consists of an attention operator and an MLP, while each layer of AoT has only an attention operator.

Models | Num. of params | Num. of layers | Embedding dim. | Num. of heads | Attention type | Has MLP
AoT-MSSA-L Base | 102M | 24 | 1024 | 16 | MSSA | No
AoT-MSSA-L Medium | 182M | 36 | 1280 | 20 | MSSA | No
AoT-MHSA-L Base | 122M | 24 | 896 | 14 | MHSA | No
GPT-2 Base | 124M | 12 | 768 | 12 | MHSA | Yes

C. Additional Experimental Details

C.1. Language Model Configuration

We provide details of the language model architectures in Table 4.

C.2. ICL Configuration

We study decoder-only transformer models in the GPT-2 family (Radford et al., 2019) and their corresponding AoT variants. As in Park et al. (2024a), we perform the same grid search over learning rates in $\{10^{-4}, 5 \times 10^{-5}, 2 \times 10^{-4}, 4 \times 10^{-4}\}$ and clip the gradient norm to a value in $\{5.0, 10.0, 50.0\}$.

Table 5. The detailed architectures of the transformer and AoT models used in the ICL experiments. To ensure a fair comparison, all AoT models are designed with a larger number of layers to match the size of the transformer.

Models | Num. of params | Num. of layers | Embedding dim. | Num. of heads | Attention type | Has MLP
AoT-MSSA-L | 7.5M | 32 | 128 | 8 | MSSA | No
AoT-MHSA-L | 8.55M | 32 | 128 | 8 | MHSA | No
Transformer | 9.63M | 16 | 128 | 8 | MHSA | Yes

C.3. Emergence of Semantic Meaning

The attention heads in our models carry different semantic meanings, which demonstrates the interpretability of our proposed architecture in practice. In Figure 7, we train the AoT model with the MSSA operator on ImageNet-1K and visualize the self-attention heatmaps between the [CLS] token and the other image patches. Note that the [CLS] token is the class token, a trainable model parameter inserted along with the other image tokens to represent class information. We select 5 attention heads by manual inspection and find that they capture different parts of objects, displaying different semantic meanings.

Figure 7. Visualization of attention heads on ImageNet-1K. We feed a trained AoT-MSSA model a mini-batch of images and extract the attention maps of different heads from the penultimate layer. These heads capture certain semantic meanings across different images.
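For completeness, the following is a minimal sketch of one way to extract the [CLS]-to-patch attention heatmaps described above from a ViT-style attention-only model. The `return_attention` interface, the returned tensor layout, and the commented plotting calls are hypothetical and would need to be adapted to the actual implementation.

```python
import torch

@torch.no_grad()
def cls_attention_heatmaps(model, images, layer_index=-2):
    """Collect [CLS]-to-patch attention maps from a ViT-style attention-only model.

    Assumes (hypothetically) that `model(images, return_attention=True)` returns
    a tuple (logits, attentions), where `attentions` is a list with one tensor of
    shape (batch, heads, tokens, tokens) per layer and token 0 is [CLS].
    """
    _, attentions = model(images, return_attention=True)
    attn = attentions[layer_index]          # penultimate layer by default
    cls_to_patches = attn[:, :, 0, 1:]      # [CLS] query against all patch keys
    b, h, n = cls_to_patches.shape
    side = int(n ** 0.5)                    # assumes a square grid of patches
    return cls_to_patches.reshape(b, h, side, side)

# Hypothetical usage (e.g., with matplotlib) for the first image in a batch:
#   maps = cls_attention_heatmaps(model, images)
#   import matplotlib.pyplot as plt
#   for head in range(5):
#       plt.subplot(1, 5, head + 1)
#       plt.imshow(maps[0, head].cpu(), cmap="viridis")
#       plt.axis("off")
#   plt.show()
```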