# trained_transformers_learn_linear_models_incontext__0a2c0bfe.pdf Journal of Machine Learning Research 25 (2024) 1-55 Submitted 8/23; Published 1/24 Trained Transformers Learn Linear Models In-Context Ruiqi Zhang rqzhang@berkeley.edu Department of Statistics University of California, Berkeley 367 Evans Hall, Berkeley, CA 94720-3860, USA Spencer Frei sfrei@ucdavis.edu Department of Statistics University of California, Davis 4118 Mathematical Sciences Building 399 Crocker Ave., Davis, CA 95616, USA Peter L. Bartlett peter@berkeley.edu Department of Statistics and Department of Electrical Engineering and Computer Sciences University of California, Berkeley 367 Evans Hall, Berkeley, CA 94720-3860, USA Google Deep Mind 1600 Amphitheatre Parkway Mountain View, CA 94040, USA Editor: Daniel Hsu Attention-based neural networks such as transformers have demonstrated a remarkable ability to exhibit in-context learning (ICL): Given a short prompt sequence of tokens from an unseen task, they can formulate relevant per-token and next-token predictions without any parameter updates. By embedding a sequence of labeled training data and unlabeled test data as a prompt, this allows for transformers to behave like supervised learning algorithms. Indeed, recent work has shown that when training transformer architectures over random instances of linear regression problems, these models predictions mimic those of ordinary least squares. Towards understanding the mechanisms underlying this phenomenon, we investigate the dynamics of ICL in transformers with a single linear self-attention layer trained by gradient flow on linear regression tasks. We show that despite non-convexity, gradient flow with a suitable random initialization finds a global minimum of the objective function. At this global minimum, when given a test prompt of labeled examples from a new prediction task, the transformer achieves prediction error competitive with the best linear predictor over the test prompt distribution. We additionally characterize the robustness of the trained transformer to a variety of distribution shifts and show that although a number of shifts are tolerated, shifts in the covariate distribution of the prompts are not. Motivated by this, we consider a generalized ICL setting where the covariate distributions can vary across prompts. We show that although gradient flow succeeds at finding a global minimum in this setting, the trained transformer is still brittle under mild covariate shifts. We complement this finding with experiments on large, nonlinear transformer architectures which we show are more robust under covariate shifts. c 2024 Ruiqi Zhang, Spencer Frei and Peter L. Bartlett. License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at http://jmlr.org/papers/v25/23-1042.html. Zhang, Frei, Bartlett Keywords: in-context learning, transformers, neural networks, self-attention, generalization 1. Introduction Transformer-based neural networks have quickly become the default machine learning model for problems in natural language processing, forming the basis of chatbots like Chat GPT (Open AI, 2023), and are increasingly popular in computer vision (Dosovitskiy et al., 2021). These models can take as input sequences of tokens and return relevant next-token predictions. When trained on sufficiently large and diverse datasets, these models are often able to perform in-context learning (ICL): when given a short sequence of input-output pairs (called a prompt) from a particular task as input, the model can formulate predictions on test examples without having to make any updates to the parameters in the model. Recently, Garg et al. (2022) initiated the investigation of ICL from the perspective of learning particular function classes. At a high-level, this refers to when the model has access to instances of prompts of the form (x1, h(x1), . . . , x N, h(x N), xquery) where xi, xquery are sampled i.i.d. from a distribution Dx and h is sampled independently from a distribution over functions in a function class H. The transformer succeeds at in-context learning if when given a new prompt (x 1, h (x 1), . . . , x N, h (x N), x query) corresponding to an independently sampled h it is able to formulate a prediction for x query that is close to h (x query) given a sufficiently large number of examples N. The authors showed that when transformer models are trained on prompts corresponding to instances of training data from a particular function class (e.g., linear models, neural networks, or decision trees), they succeed at incontext learning, and moreover the behavior of the trained transformers can mimic those of familiar learning algorithms like ordinary least squares. Following this, a number of follow-up works provided constructions of transformerbased neural network architectures which are capable of achieving small prediction error for query examples when the prompt takes the form (x1, w, x1 , . . . , x N, w, x N , xquery) where xi, xquery, w i.i.d. N(0, Id) (von Oswald et al., 2022; Aky urek et al., 2022). However, this leaves open the question of how it is that gradient-based optimization algorithms over transformer architectures produce models which are capable of in-context learning.1 In this work, we investigate the learning dynamics of gradient flow in a simplified transformer architecture when the training prompts consists of random instances of linear regression datasets. Our main contributions are as follows. We establish that for a class of transformers with a single layer and with a linear selfattention module (LSAs), gradient flow on the population loss with a suitable random initialization converges to a global minimum of the population objective, despite the non-convexity of the underlying objective function. We characterize the learning algorithm that is encoded by the transformer at convergence, as well as the prediction error achieved when the model is given a test prompt corresponding to a new (and possibly nonlinear) prediction task. 1. We note a concurrent work also explores the optimization question we consider here (Ahn et al., 2023); we shall provide a more detailed comparison to this work in Section 2. Trained Transformers Learn Linear Models In-Context We use this to conclude that transformers trained by gradient flow indeed in-context learn the class of linear models. Moreover, we characterize the robustness of the trained transformer to a variety of distribution shifts. We show that although a number of shifts can be tolerated, shifts in the covariate distribution of the features xi cannot. Motivated by this failure under covariate shift, we consider a generalized setting of incontext learning where the covariate distribution can vary across prompts. We provide global convergence guarantees for LSAs trained by gradient flow in this setting and show that even when trained on a variety of covariate distributions, LSAs still fail under covariate shift. We then empirically investigate the behavior of large, nonlinear transformers when trained on linear regression prompts. We find that these more complex models are able to generalize better under covariate shift, especially when trained on prompts with varying covariate distributions. 2. Additional Related Work The literature on transformers and non-convex optimization in machine learning is vast. In this section, we will focus on those works most closely related to theoretical understanding of in-context learning of function classes. As mentioned previously, Garg et al. (2022) empirically investigated the ability for transformer architectures to in-context learn a variety of function classes. They showed that when trained on random instances of linear regression, the models predictions are very similar to those of ordinary least squares. Additionally, they showed that transformers can in-context learn two-layer Re LU networks and decision trees, showing that by training on differently-structured data, the transformers learn to implement distinct learning algorithms. A number of works further investigated the types of algorithms implemented by transformers trained on in-context examples of linear models (Ahuja et al., 2023; Ahuja and Lopez-Paz, 2023). Aky urek et al. (2022) and von Oswald et al. (2022) examined the behavior of transformers when trained on random instances of linear regression, as we do in this work. They considered the setting of isotropic Gaussian data with isotropic Gaussian weight vectors, and showed that the trained transformer s predictions mimic those of a single step of gradient descent. They also provided a construction of transformers which implement this single step of gradient descent. By contrast, we explicitly show that gradient flow provably converges to transformers which learn linear models in-context. Moreover, our analysis holds when the covariates are anisotropic Gaussians, for which a single step of vanilla gradient descent is unable to achieve small prediction error.2 Let us briefly mention a number of other works on understanding in-context learning in transformers and other sequence-based models. Han et al. (2023) suggests that Bayesian inference on prompts can be asymptotically interpreted as kernel regression. Dai et al. (2022) 2. To see this, suppose (xi, yi) are i.i.d. with x N(0, Λ) and y = w, x . A single step of gradient descent under the squared loss from a zero initialization yields the predictor x 7 x 1 n Pn i=1 yixi = x 1 n Pn i=1 xix i w x Λw. Clearly, this can differ from x w when Λ = Id. Zhang, Frei, Bartlett interprets ICL as implicit fine-tuning, viewing large language models as meta-optimizers performing gradient-based optimization. Xie et al. (2021) regards ICL as implicit Bayesian inference, with transformers learning a shared latent concept between prompts and test data, and they prove the ICL property when the training distribution is a mixture of HMMs. Similarly, Wang et al. (2023) perceives ICL as a Bayesian selection process, implicitly inferring information pertinent to the designated tasks. Li et al. (2023a) explores the functional resemblance between a single layer of self-attention and gradient descent on a softmax regression problem, offering upper bounds on their difference. Min et al. (2022) notes that the alteration of label parts in prompts does not drastically impair the ICL ability. They contend that ICL is invoked when prompts reveal information about the label space, input distribution, and sequence structure. Another collection of works have sought to understand transformers from an approximation theoretic perspective. Yun et al. (2019, 2020) established that transformers can universally approximate any sequence-to-sequence function under some assumptions. Investigations by Edelman et al. (2022); Likhosherstov et al. (2021) indicate that a single-layer self-attention can learn sparse functions of the input sequence, where sample complexity and hidden size are only logarithmic relative to the sequence length. Further studies by P erez et al. (2019); Dehghani et al. (2019); Bhattamishra et al. (2020) indicate that the vanilla transformer and its variants exhibit Turing completeness. Liu et al. (2023) showed that transformers can approximate finite-state automata with few layers. Bai et al. (2023) showed that transformers can implement a variety of statistical machine learning algorithms as well as model selection procedures. Abernethy et al. (2023) showed that a pretrained transformer can be used to define a transformer that segments a prompt into examples and labels and learns to solve a sparse retrieval task. Zhang et al. (2023) interpreted in-context learning via a Bayesian model averaging process. A handful of recent works have developed provable guarantees for transformers trained with gradient-based optimization. Jelassi et al. (2022) analyzed the dynamics of gradient descent in vision transformers for data with spatial structure. Li et al. (2023c) demonstrated that a single-layer transformer trained by a gradient method could learn a topic model, treating learning semantic structure as detecting co-occurrence between words and theoretically analyzing the two-stage dynamics during the training process. Finally, we note a concurrent work by Ahn et al. (2023) on the optimization landscape of single layer transformers with linear self-attention layers. They show that there exist global minima of the population objective of the transformer that can achieve small prediction error with anisotropic Gaussian data, and they characterize some critical points of deep linear self-attention networks. In this work, we show that despite nonconvexity, gradient flow with a suitable random initialization converges to a global minimum that achieves small prediction error for anistropic Gaussian data. We also characterize the prediction error when test prompts come from a new (possibly nonlinear) task, when there is distribution shift, and when transformers are trained on prompts with possibly different covariate distributions across prompts. Trained Transformers Learn Linear Models In-Context 3. Preliminaries Notation We first describe the notation we use in the paper. We write [n] = {1, 2, ..., n}. We use to denote the Kronecker product, and Vec the vectorization operator in columnwise order. For example, Vec 1 2 3 4 = (1, 3, 2, 4) . We write the inner product of two matrices A, B Rm n as A, B = tr(AB ). We use 0n and 0m n to denote the zero vector and zero matrix of size n and m n, respectively. For a general matrix A, Ak: and A:k denote the k-th row and k-th column, respectively. We denote the matrix operator norm and Frobenius norm as op and F . We use Id to denote the d-dimensional identity matrix and sometimes we also use I when the dimension is clear from the context. For a positive semi-definite matrix A, we write x 2 A := x Ax. Unless otherwise defined, we use lower case letters for scalars and vectors, and use upper case letters for matrices. 3.1 In-context learning We begin by describing a framework for in-context learning of function classes, as initiated by Garg et al. (2022). In-context learning refers to the behavior of models that operate on sequences, called prompts, of input-output pairs (x1, y1, . . . , x N, y N, xquery), where yi = h(xi) for some (unknown) function h and examples xi and query xquery. The goal for an incontext learner is to use the prompt to form a prediction by(xquery) for the query such that by(xquery) h(xquery). From this high-level description, one can see that at a surface level, the behavior of in-context learning is no different than that of a standard learning algorithm: the learner takes as input a training dataset and returns predictions on test examples. For instance, one can view ordinary least squares as an in-context learner for linear models. However, the rather unique feature of in-context learners is that these learning algorithms can be the solutions to stochastic optimization problems defined over a distribution of prompts. We formalize this notion in the following definition. Definition 1 (Trained on in-context examples) Let Dx be a distribution over an input space X, H YX a set of functions X Y, and DH a distribution over functions in H. Let ℓ: Y Y R be a loss function. Let S = n N{(x1, y1, . . . , xn, yn) : xi X, yi Y} be the set of finite-length sequences of (x, y) pairs and let FΘ = {fθ : S X Y, θ Θ} be a class of functions parameterized by θ in some set Θ. For N > 0, we say that a model f : S X Y is trained on in-context examples of functions in H under loss ℓw.r.t. (DH, Dx) if f = fθ where θ Θ satisfies θ argminθ ΘEP=(x1,h(x1),...,x N,h(x N),xquery) [ℓ(fθ(P), h(xquery))] , (1) where xi, xquery i.i.d. Dx and h DH are independent. We call N the length of the prompts seen during training. As mentioned above, this definition naturally leads to a method for learning a learning algorithm from data: Sample independent prompts by sampling a random function h DH Zhang, Frei, Bartlett and feature vectors xi, xquery i.i.d. Dx, and then minimize the objective function appearing in (1) using stochastic gradient descent or other stochastic optimization algorithms. This procedure returns a model that is learned from in-context examples and can form predictions for test (query) examples given a sequence of training data. This leads to the following natural definition that quantifies how well such a model performs on in-context examples corresponding to a particular hypothesis class. Definition 2 (In-context learning of a hypothesis class) Let Dx be a distribution on an input space X, H YX a class of functions X Y, and DH a distribution on functions in H. Let ℓ: Y Y R be a loss function. Let S = n N{(x1, y1, . . . , xn, yn) : xi X, yi Y} be the set of finite-length sequences of (x, y) pairs. We say that a model f : S X Y defined on prompts of the form P = (x1, h(x1), . . . , x M, h(x M), xquery) in-context learns a hypothesis class H under loss ℓwith respect to (DH, Dx) up to error η R if there exists a function MDH,Dx(ε) : (0, 1) N such that for every ε (0, 1), and for every prompt P of length M MDH,Dx(ε), EP=(x1,h(x1),...,x M,h(x M),xquery) ℓ f(P), h (xquery) η + ε, (2) where the expectation is over the randomness in xi, xquery i.i.d. Dx and h DH. The additive error term η in Definition 2 above allows for the possibility that the model does not achieve arbitrarily small error. This error could come from using a model which is not complex enough to learn functions in H or from considering a non-realizable setting where it is not possible to achieve arbitrarily small error. With these two definitions in hand, we can formulate the following questions: suppose a function class FΘ is given and DH corresponds to random instances of hypotheses in a hypothesis class H. Can a model from FΘ that is trained on in-context examples of functions in H w.r.t. (DH, Dx) in-context learn the hypothesis class H w.r.t. (DH, Dx) with small prediction error? Do standard gradient-based optimization algorithms suffice for training the model from in-context examples? How long must the contexts be during training and at test time to achieve small prediction error? In the remaining sections, we shall answer these questions for the case of one-layer transformers with linear self-attention modules when the hypothesis class is linear models, the loss of interest is the squared loss, and the marginals are (possibly anisotropic) Gaussian marginals. 3.2 Linear self-attention networks Before describing the particular transformer models we analyze in this work, we first recall the definition of the softmax-based single-head self-attention module (Vaswani et al., 2017). Let E Rde d N be an embedding matrix formed using a prompt (x1, y1, . . . , x N, y N, xquery) of length N. The user has the freedom to determine how this embedding matrix is formed from the prompt. One natural way to form E is to stack (xi, yi) Rd+1 as the first N columns of E and to let the final column be (xquery, 0) ; if xi Rd, yi R, we would then have de = d + 1 and d N = N + 1. Let W K, W Q Rdk de and W V Rdv de be the key, query, and value weight matrices, W P Rde dv the projection matrix, and ρ > 0 Trained Transformers Learn Linear Models In-Context a normalization factor. The softmax self-attention module takes as input an embedding matrix E of width d N and outputs a matrix of the same size, f Attn(E; W K, W Q, W V , W P ) = E + W P W V E softmax (W KE) W QE where softmax is applied column-wise and, given a vector input of v, the i-th entry of softmax(v) is given by exp(vi)/ P s exp(vs). The d N d N matrix appearing inside the softmax is referred to as the self-attention matrix. Note that f Attn can take as its input a sequence of arbitrary length. In this work, we consider a simplified version of the single-layer self-attention module, which is more amenable to theoretical analysis and yet is still capable of in-context learning linear models. In particular, we consider a single-layer linear self-attention (LSA) model, which is a modified version of f Attn where we remove the softmax nonlinearity, merge the projection and value matrices into a single matrix W PV Rde de, and merge the query and key matrices into a single matrix W KQ Rde de. We concatenate these matrices into θ = (W KQ, W PV ) and denote f LSA(E; θ) = E + W PV E E W KQE We note that recent theoretical works on understanding transformers looked at identical models (von Oswald et al., 2022; Li et al., 2023b; Ahn et al., 2023). It is noteworthy that recent empirical work has shown that state-of-the-art trained vision transformers with standard softmax-based attention modules are such that (W K) W Q and W P W V are nearly multiples of the identity matrix (Trockman and Kolter, 2023), which can be represented under the parameterization we consider. The user has the flexibility to determine the method for constructing the embedding matrix from a prompt P = (x1, y1, . . . , x N, y N, xquery). In this work, for a prompt of length N, we shall use the following embedding, which stacks (xi, yi) Rd+1 into the first N columns with (xquery, 0) Rd+1 as the last column: E = E(P) = x1 x2 x N xquery y1 y2 y N 0 R(d+1) (N+1). (4) We take the normalization factor ρ to be the width of embedding matrix E minus one, i.e., ρ = d N 1, since each element in E E is a inner product of two vectors of length d N. Under the above token embedding, we take ρ = N. We note that there are alternative ways to form the embedding matrix with this data, e.g. by padding all inputs and labels into vectors of equal length and arranging them into a matrix (Aky urek et al., 2022), or by stacking columns that are linear transformations of the concatenation (xi, yi) (Garg et al., 2022), although the dynamics of in-context learning will differ under alternative parameterizations. The network s prediction for the token xquery will be the bottom-right entry of matrix output by f LSA, namely, byquery = byquery(E; θ) = [f LSA(E; θ)](d+1),(N+1). Here and after, we may occasionally suppress dependence on θ and write byquery(E; θ) as byquery. Since the prediction takes only the right-bottom entry of the token matrix output Zhang, Frei, Bartlett by the LSA layer, actually only part of W PV and W KQ affect the prediction. To see how, let us denote W PV 11 w PV 12 (w PV 21 ) w PV 22 R(d+1) (d+1), W KQ = W KQ 11 w KQ 12 (w KQ 21 ) w KQ 22 R(d+1) (d+1), (5) where W PV 11 Rd d; w PV 12 , w PV 21 Rd; w PV 22 R; and W KQ 11 Rd d; w KQ 12 , w KQ 21 Rd; w KQ 22 R. Then, the prediction byquery is byquery = (w PV 21 ) w PV 22 xquery, (6) since only the last row of W PV and the first d columns of W KQ affects the prediction, which means we can simply take all other entries zero in the following sections. 3.3 Training procedure In this work, we will consider the task of in-context learning linear predictors. We will assume training prompts are sampled as follows. Let Λ be a positive definite covariance matrix. Each training prompt, indexed by τ N, takes the form Pτ = (xτ,1, hτ(xτ1), . . . , xτ,N, hτ(xτ,N), xτ,query), where task weights wτ i.i.d. N(0, Id), inputs xτ,i, xτ,query i.i.d. N(0, Λ), and labels hτ(x) = wτ, x . Each prompt corresponds to an embedding matrix Eτ, formed using the transformation (4): Eτ := xτ,1 xτ,2 xτ,N xτ,query wτ, xτ,1 wτ, xτ,2 wτ, xτ,N 0 R(d+1) (N+1). We denote the prediction of the LSA model on the query label in the task τ as byτ,query, which is the bottom-right element of f LSA(Eτ), where f LSA is the linear self-attention model defined in (3). The empirical risk over B independent prompts is defined as byτ,query wτ, xτ,query 2 . (7) We shall consider the behavior of gradient flow-trained networks over the population loss induced by the limit of infinite training tasks/prompts B : L(θ) = lim B b L(θ) = 1 2Ewτ,xτ,1, ,xτ,N,xτ,query (byτ,query wτ, xτ,query )2 (8) Above, the expectation is taken w.r.t. the covariates {xτ,i}N i=1 {xquery} in the prompt and the weight vector wτ, i.e. over xτ,i, xquery i.i.d. N(0, Λ) and wτ N(0, Id). Gradient Trained Transformers Learn Linear Models In-Context flow captures the behavior of gradient descent with infinitesimal step size and has dynamics given by the following differential equation: d dtθ = L(θ). (9) We will consider gradient flow with an initialization that satisfies the following. Assumption 3 (Initialization) Let σ > 0 be a parameter, and let Θ Rd d be any matrix satisfying ΘΘ F = 1 and ΘΛ = 0d d. We assume W PV (0) = σ 0d d 0d 0 d 1 , W KQ(0) = σ ΘΘ 0d 0 d 0 This initialization is satisfied for a particular class of random initialization schemes: if M has i.i.d. entries from a continuous distribution, then by setting ΘΘ = MM / MM F , the assumption is satisfied almost surely. The reason we use this particular initialization scheme will be made more clear in Section 5 when we describe the proof, but at a highlevel this is due to the fact that the predictions (6) can be viewed as the output of a two-layer linear network, and initializations satisfying Assumption 3 allow for the layers to be balanced throughout the gradient flow trajectory. Random initializations that induce this balancedness condition have been utilized in a number of theoretical works on deep linear networks (Du et al., 2018; Arora et al., 2018, 2019; Azulay et al., 2021). We leave the question of convergence under alternative random initialization schemes for future work. 4. Main results In this section, we present the main results of this paper. First, in Section 4.1, we prove the gradient flow on the population loss will converge to a specific global optimum. We characterize the prediction error of the trained transformer at this global minimum when given a prompt from a new prediction task. Our characterization allows for the possibility that this new prompt comes from a nonlinear prediction task. We then instantiate our results for well-specified linear regression prompts and characterize the number of samples needed to achieve small prediction error, showing that transformers can in-context learn linear models when trained on in-context examples of linear models. Next, in Section 4.2, we analyze the behavior of the trained transformer under a variety of distribution shifts. We show the transformer is robust to a number of distribution shifts, including task shift (when the labels in the prompt are not deterministic linear functions of their input) and query shift (when the query example xquery has a possibly different distribution than the test prompt). On the other hand, we show that the transformer suffers from covariate distribution shifts, i.e. when the training prompt covariate distribution differs from the test prompt covariate distribution. Finally, motivated by the failure of the trained transformer under covariate distribution shift, we consider in Section 4.3 the setting of training on in-context examples with varying covariate distributions across prompts. We prove that transformers with a single linear selfattention layer trained by gradient flow converge to a global minimum of the population objective, but that the trained transformer still fails to perform well on new prompts. We complement our proof in the linear self-attention case with experiments on large, nonlinear transformer architectures which we show are more robust under covariate shifts. Zhang, Frei, Bartlett 4.1 Convergence of gradient flow and prediction error for new tasks First, we prove that under suitable initialization, gradient flow will converge to a global optimum. Theorem 4 (Convergence and limits) Consider gradient flow of a linear self-attention network f LSA defined in (3) over the population loss (8). Suppose the initialization satisfies Assumption 3 with initialization scale σ > 0 satisfying σ2 Γ op d < 2 where we have defined N tr(Λ)Id Rd d. Then gradient flow converges to a global minimum of the population loss (8). Moreover, W PV and W KQ converge to W PV and W KQ respectively, where W KQ = tr Γ 2 1 , W PV = tr Γ 2 1 The full proof of this theorem appears in Appendix A. We note that if we restrict our setting to Λ = Id, then the limiting solution described found by gradient flow is quite similar to the construction of von Oswald et al. (2022). Since the prediction of the transformer is the same if we multiply W PV by a constant c = 0 and simultaneously multiply W KQ by c 1, the only difference (up to scaling) is that the top-left entry of their W KQ matrix is Id rather than the (1 + (d + 1)/N) 1Id that we find for the case Λ = Id. Next, we would like to characterize the prediction error of the trained network described above when the network is given a new prompt. Let us consider a prompt of the form (x1, w, x1 , . . . , x M, w, x M , xquery) where w Rd and xi, xquery i.i.d. N(0, Λ). A simple calculation shows that the prediction byquery at the global optimum with parameters W KQ and W PV is given by byquery = 0 d 1 i=1 xix i + 1 M xqueryx query 1 M i=1 xix i w i=1 w xix i 1 M i=1 w xix i w = x queryΓ 1 1 M When the length of prompts seen during training N is large, Γ 1 Λ 1, and when the test prompt length M is large, 1 M PM i=1 xix i Λ, so that byquery x queryw. Thus, for sufficiently large prompt lengths, the trained transformer indeed in-context learns the class of linear predictors. In fact, we can generalize the above calculation for test prompts which could take a significantly different form than the training prompts. Consider prompts that are of the form (x1, y1, . . . , xn, yn, xquery) where, for some joint distribution D over (x, y) pairs with marginal Trained Transformers Learn Linear Models In-Context distribution x N(0, Λ), we have (xi, yi) i.i.d. D and xquery N(0, Λ) independently. Note that this allows for a label yi to be a nonlinear function of the input xi. The prediction of the trained transformer for this prompt is then byquery = 0 d 1 1 M PM i=1 xix i + 1 M xqueryx query 1 M PM i=1 xiyi 1 M PM i=1 x i yi 1 M PM i=1 y2 i = x queryΓ 1 1 M Just as before, when N is large we have Γ 1 Λ 1, and so when M is large as well this implies byquery x queryΛ 1E(x,y) D[yx] = x query argmin w Rd E(x,y) D[(y w, x )2] This suggests that trained transformers in-context learn the best linear predictor over a distribution when the test prompt consists of i.i.d. samples from a joint distribution over feature-response pairs. In the following theorem, we formalize the above and characterize the prediction error when prompts take this form. Theorem 5 Let D be a distribution over (x, y) Rd R, whose marginal distribution on x is Dx = N(0, Λ). Assume ED[y], ED[xy], ED[y2xx ] exist and are finite. Assume the test prompt is of the form P = (x1, y1, . . . , x M, y M, xquery), where (xi, yi), (xquery, yquery) i.i.d. D. Let f LSA be the LSA model with parameters W PV and W KQ in (11), and byquery is the prediction for xquery given the prompt. If we define a := Λ 1E(x,y) D [xy] , Σ := E(x,y) D h xy E (xy) xy E (xy) i , (15) then, for Γ = Λ + 1 N tr(Λ)Id. we have, E (byquery yquery)2 = min w Rd E ( w, xquery yquery)2 | {z } Error of best linear predictor M tr ΣΓ 2Λ + 1 h a 2 Γ 2Λ3 + 2 tr(Λ) a 2 Γ 2Λ2 + tr(Λ)2 a 2 Γ 2Λ i , where the expectation is over (xi, yi), (xquery, yquery) i.i.d. D. The full proof is deferred to Appendix B. Let us now make a few remarks on the above theorem before considering particular instances of D where we may provide more explicit bounds on the prediction error. First, this theorem shows that, provided the length of prompts seen during training (N) and the length of the test prompt (M) is large enough, a transformer trained by Zhang, Frei, Bartlett gradient flow from in-context examples achieves prediction error competitive with the best linear model. Next, our bound shows that the length of prompts seen during training and the length of prompts seen at test-time have different effects on the expected prediction error: ignoring dimension and covariance-dependent factors, the prediction error is at most O(1/M + 1/N2), decreasing more rapidly as a function of the training prompt length N compared to the test prompt length M. Let us now consider when D corresponds to noiseless linear models, so that for some w Rd, we have (x, y) = (x, w, x ), in which case the prediction of the trained transformer is given by (12). Moreover, a simple calculation shows that the Σ from Theorem 5 takes the form Σ = w 2 ΛΛ + Λww Λ. Hence Theorem 5 implies the prediction error for the prompt P = (x1, w, x1 , . . . , x M, w, x M , xquery) is Ex1,...,x M,xquery (byquery w, xquery )2 n w 2 Γ 2Λ3 + tr(Γ 2Λ2) w 2 Λ o n w 2 Γ 2Λ3 + 2 w 2 Γ 2Λ2 tr(Λ) + w 2 Γ 2Λ tr(Λ)2o M w 2 Λ + 1 h w 2 Λ + 2 w 2 2 tr(Λ) + w 2 Λ 1 tr(Λ)2i . The inequality above uses that Γ Λ. Finally, if we assume that w N(0, Id) and denote κ as the condition number of Λ, then by taking expectations over w we get the following: Ex1,...,x M,xquery,w (byquery w, xquery )2 (d + 1) tr(Λ) N2 tr(Λ) + 2d tr(Λ) + tr(Λ 1) tr(Λ)2 (d + 1) tr(Λ) M + (1 + 2d + d2κ) tr(Λ) From the upper bound above, we can see the rate w.r.t M and N are still at most O(1/M) and O(1/N2) respectively. Moreover, the generalization risk also scales with dimension d, tr(Λ) and the condition number κ. This suggests that for in-context examples involving covariates of greater variance, or a more ill-conditioned covariance matrix, the generalization risk will be higher for the same lengths of training and testing prompts. Putting the above together with Theorem 5, Definition 1 and Definition 2, we get the following corollary. Corollary 6 The transformer f LSA trained on length-N prompts of in-context examples of functions in {x 7 w, x } w.r.t. w N(0, Id) and Dx = N(0, Λ) by gradient flow on the population loss (8) for initializations satisfying Assumption 3 converges to the model f LSA( ; W KQ , W PV ). This model takes a prompt P = (x1, y1, . . . , x M, y M, xquery) and returns a prediction byquery for xquery given by byquery = [f LSA(P; W KQ , W PV )]d+1,M+1 = x query N Λ + tr(Λ) This model in-context learns the class of linear models {x 7 w, x } with respect to w N(0, Id) and Dx = N(0, Λ) up to error η := (1+2d+d2κ) tr(Λ)/N2 (where κ is the condition Trained Transformers Learn Linear Models In-Context number of Λ): provided M (d + 1) tr(Λ)ε 1, the model achieves prediction error at most η + ε. It is worth emphasizing that the transformer f LSA( ; W KQ , W PV ) only learns the function class up to error η = O(1/N2) in the sense of Definition 2. In particular, training on finite-length prompts leads to prediction error bounded away from zero. 4.2 Behavior of trained transformer under distribution shifts Using the identity (13), it is straightforward to characterize the behavior of the trained transformer under a variety of distribution shifts. In this section, we shall examine a number of shifts that were first explored empirically for transformer architectures by Garg et al. (2022). Although their experiments were for transformers trained by gradient descent, we find that (in the case of linear models) many of the behaviors of the trained transformers under distribution shift are identical to those predicted by our theoretical characterizations of the performance of transformers with a single linear self-attention layer trained by gradient flow on the population. Following Garg et al. (2022), for prompts of the form (x1, h(x1), . . . , x N, h(x N), xquery), let us assume for training prompts that xi, xquery i.i.d. Dtrain x and h Dtrain H , while for test prompts xi i.i.d. Dtest x , xquery Dtest query, and h Dtest H . We will consider the following distinct categories of shifts: Task shifts: Dtrain H = Dtest H . Query shifts: Dtest query = Dtest x . Covariate shifts: Dtrain x = Dtest x . In the following, we shall fix Dtrain x = N(0, Λ) and vary the other distributions. Recall from (13) that the prediction for a test prompt (x1, y1, . . . , x N, y N, xquery) is given by (for N large), byquery = x queryΓ 1 1 M x queryΛ 1 1 M Task shifts. These shifts are tolerated easily by the trained transformer. As Theorem 5 shows, the trained transformer is competitive with the best linear model provided the prompt length during training and at test time is large enough. In particular, even if the prompt is such that the labels yi are not given by w, xi for some w N(0, Id), the trained transformer will compute a prediction which has error competitive with the best linear model that fits the test prompt. For example, consider a prompt corresponding to a noisy linear model, so that the prompt consists of a sequence of (xi, yi) pairs where yi = w, xi + εi for some arbitrary vector w Rd and independent sub-Gaussian noise εi. Then from (17), the prediction of the transformer on query examples is byquery x queryΛ 1 1 M = x queryΛ 1 1 M w+x queryΛ 1 1 M Zhang, Frei, Bartlett Since εi is mean zero and independent of xi, this is approximately x queryw when M is large. And note that this calculation holds for an arbitrary vector w, not just those which are sampled from an isotropic Gaussian or those with a particular norm. This behavior coincides with that of the trained transformers observed by Garg et al. (2022). Query shifts. Continuing from (17), since yi = w, xi , byquery x queryΛ 1 1 M From this we see that whether query shifts can be tolerated hinges upon the distribution of the xi s. Since Dtrain x = Dtest x , if M is large then byquery x queryΛ 1Λw = x queryw. (18) Thus, very general shifts in the query distribution can be tolerated. On the other hand, very different behavior can be expected if M is not large and the query example depends on the training data. For example, if the query example is orthogonal to the subspace spanned by the xi s, the prediction will be zero, as was observed with transformer architectures by Garg et al. (2022). Covariate shifts. In contrast to task and query shifts, covariate shifts cannot be fully tolerated in the transformer. This can be easily seen due to the identity (13): when Dtrain x = Dtest x , then the approximation in (18) does not hold as 1 M PM i=1 xix i will not cancel Γ 1 when M and N are large. For instance, if we consider test prompts where the covariates are scaled by a constant c = 1, then byquery x queryΛ 1 1 M w x queryΛ 1c2Λw = c2x queryw = x queryw. This failure mode of the trained transformer with linear self-attention was also observed in the trained transformer architectures by Garg et al. (2022). This suggests that although the predictions of the transformer may look similar to those of ordinary least squares in some settings, the algorithm implemented by the transformer is not the same since ordinary least squares is robust to scaling of the features by a constant. It may seem surprising that a transformer trained on linear regression tasks fails in settings where ordinary least squares performs well. However, both the linear self-attention transformer we consider and the transformers considered by Garg et al. (2022) were trained on instances of linear regression when the covariate distribution Dx over the features was fixed across instances. This leads to the natural question of what happens if the transformers instead are trained on prompts where the covariate distribution varies across instances, which we explore in the following section. 4.3 Transformers trained on prompts with random covariate distributions In this section, we will consider a variant of training on in-context examples (in the sense of Definition 1) where the distibution Dx is itself sampled randomly from a distribution, and training prompts are of the form (x1, h(x1), . . . , x N, h(x N), xquery) where xi, xquery i.i.d. Dx and h DH. More formally, we can generalize Definition 1 as follows. Trained Transformers Learn Linear Models In-Context Definition 7 (In-context training with random covariate distributions) Let be a distribution over distributions Dx defined on an input space X, H YX a set of functions X Y, and DH a distribution over functions in H. Let ℓ: Y Y R be a loss function. Let S = n N{(x1, y1, . . . , xn, yn) : xi X, yi Y} be the set of finite-length sequences of (x, y) pairs and let FΘ = {fθ : S X Y, θ Θ} be a class of functions parameterized by some set Θ. We say that a model f : S X Y is trained on in-context examples of functions in H under loss ℓw.r.t. DH and distribution over covariate distributions if f = fθ where θ Θ satisfies θ argminθ ΘEP=(x1,h(x1),...,x N,h(x N),xquery) [ℓ(fθ(P), h(xquery))] , (19) where Dx , xi, xquery i.i.d. Dx and h DH. We recover the previous definition of training on in-context examples by taking to be concentrated on a singleton, supp( ) = {Dx}. The natural question is then, if a model f is trained on in-context examples from a function class H w.r.t. DH and a distribution over covariate distributions, and if one then samples some covariate distribution Dx , does f in-context learn H w.r.t. (DH, Dx) for that Dx (cf. Definition 2)? Since Dx is random, we can hope that this may hold in expectation or with high probability over the sampling of the covariate distribution. In the remainder of this section, we will explore this question for transformers with a linear self-attention layer trained by gradient flow on the population loss. We shall again consider the case where the covariates have Gaussian marginals, xi N(0, Λ), but we shall now assume that within each prompt we first sample a random covariance matrix Λ. For simplicity, we will restrict our attention to the case where Λ is diagonal. More formally, we shall assume training prompts are sampled as follows. For each independent task indexed by τ [B], we first sample wτ N(0, Id). Then, for each task τ and coordinate i [d], we sample λτ,i independently such that the distribution of each λτ,i is fixed and has finite third moments and is strictly positive almost surely. We then form a diagonal matrix Λτ = diag(λτ,1, . . . , λτ,d). Thus the diagonal entries of Λτ are independent but could have different distributions, and Λτ is identically distributed for τ = 1, . . . , B. Then, conditional on Λτ, we sample independent and identically distributed xτ,1, . . . , xτ,N, xτ,query N(0, Λτ). A training prompt is then given by Pτ = (xτ,1, wτ, xτ,1 , . . . , xτ,N, wτ, xτ,N , xτ,query) Notice that here, xτ,i, xτ,query are conditionally independent given the covariance matrix Λτ, but not independent in general. We consider the same token embedding matrix as (4) and linear self-attention network, which forms the prediction byquery,τ as in (6). The empirical risk is the same as before (see (7)), and as in (8), we then take B and consider the gradient flow on the population loss. The population loss now includes an expectation over the distribution of the covariance matrices in addition to the task weight wτ and covariate distributions, and is given by 2Ewτ,Λτ,xτ,1, ,xτ,N,xτ,query (byτ,query wτ, xτ,query )2 . (20) Zhang, Frei, Bartlett In the main result for this section, we show that gradient flow with a suitable initialization converges to a global minimum, and we characterize the limiting solution. The proof will be deferred to Appendix C. Theorem 8 (Global convergence with random covariance) Consider gradient flow of the linear self-attention network f LSA defined in (3) over the population loss (20), where Λτ are diagonal with independent diagonal entries which are strictly positive a.s. and have finite third moments. Suppose the initialization satisfies Assumption 3, EΛτΘ F = 0, with initialization scale σ > 0 satisfying σ2 < 2 EΛτΘ 2 F d h E Γτ op Λτ 2 F i. (21) Then gradient flow converges to a global minimum of the population loss (20). Moreover, W PV and W KQ converge to W PV and W KQ respectively, where W KQ = EΓτΛ2 τ 1 E Λ2 τ 1 EΓτΛ2 τ 1 EΛ2 τ 0d W PV = EΓτΛ2 τ 1 E Λ2 τ where Γτ = N+1 N tr(Λτ)Id Rd d and the expectations above are over the distribution of Λτ. From this result, we can see why the trained transformer fails in the random covariance case. Suppose we have a new prompt corresponding to a weight matrix w Rd and covariance matrix Λnew, sampled from the same distribution as the covariance matrices for training prompts, so that conditionally on Λnew we have xi, xquery i.i.d. N(0, Λnew). The ground-truth labels are given by yi = w, xi , i [M] and yquery = w, xquery . At convergence, the prediction by the trained transformer on the new task will be i=1 xix i + 1 M xqueryx query 1 M i=1 x i yi 1 M EΓτΛ2 τ 1 EΛ2 τ 0d = x query EΛ2 τ EΓτΛ2 τ 1 x query EΛ2 τ EΓτΛ2 τ 1 Λneww almost surely when M . (23) The last line comes from the strong law of large numbers. Thus, in order for the prediction on the query example to be close to the ground-truth x queryw, we need EΛ2 τ EΓτΛ2 τ 1 Λnew Trained Transformers Learn Linear Models In-Context to be close to the identity. When Λτ Λnew is deterministic, this indeed is the case as we know from Theorem 5. However, this clearly does not hold in general when Λτ is random. To make things concrete, let us assume for simplicity that M, N so that Γτ Λτ and the identity (23) holds (conditionally on Λnew). Then, taking expectation over Λnew in (23), we obtain E [ byquery| xquery, w] x query EΛ2 τ EΛ3 τ 1 [EΛτ] w. If we consider the case λτ,i i.i.d. Exponential(1), so that E[Λτ] = Id, E[Λ2 τ] = 2Id, and E[Λ3 τ] = 6Id, we get 3 w, xquery . This shows that for transformers with a single linear self-attention layer, training on incontext examples with random covariate distributions does not allow for in-context learning of a hypothesis class with varying covariate distributions. Experiments with large, nonlinear transformers. We have shown that even when trained on prompts with random covariance matrices, transformers with a single linear selfattention layer fail to in-context learn linear models with random covariance matrices. We now investigate the behavior of more complex transformer architectures that are trained on in-context examples of linear models, both in the fixed-covariance case and in the randomcovariance case. We examine the performance of transformers with a GPT2 architecture (Radford et al., 2019) that are trained on linear regression tasks with mean-zero Gaussian features with either a fixed covariance matrix or random covariance matrices. For the fixed covariance case, the covariance matrix is fixed to the identity matrix across prompts. For the random covariance case, covariates are drawn from x N(0, cΛ) where Λ is diagonal with λi i.i.d. Exponential(1) and c > 0 is a scaling factor. We set c = 1 during training and vary this value at test time. The transformer is trained using the procedure of Garg et al. (2022) (see Appendix E for more details). We consider linear models in d = 20 dimensions and we train on prompt lengths of N = 40, 70, 100 with either fixed or random covariance matrices. The performance of these trained models, when tested on new data with fixed covariance or random covariance matrices (c = 1, 4, 9), is represented in six curves in Figure 1. Using the calculation (23), we can compare the prediction error for the linear self-attention networks in the M , N limit (the black dash line) to those of GPT2 architectures. We additionally compare these models to the ordinary least-squares solution which is optimal for this task. From the figure, we can see that the GPT2 model trained on fixed covariance succeeds in the random covariance setting if the variance is not too large, which shows that the larger nonlinear model is able to generalize better than the model with a single linear self-attention layer. However, when the variance is large (c = 4, 9 for the bottom two figures), the GPT2 model trained with fixed covariance is unsuccessful. When trained on random covariance, the model performs better for test prompts from higher-variance random covariance matrices, but still fails to match least squares when the scaling is largest (c = 9). Zhang, Frei, Bartlett 0 20 40 60 80 100 in-context examples squared error Test on Fixed Covariance 0 20 40 60 80 100 in-context examples squared error Test on Random Covariance, Scale = 1.0 0 20 40 60 80 100 in-context examples squared error Test on Random Covariance, Scale = 4.0 0 20 40 60 80 100 in-context examples squared error Test on Random Covariance, Scale = 9.0 Zero Estimator LSA Limit fixedcov_N40 fixedcov_N70 fixedcov_N100 randomcov_N40 randomcov_N70 randomcov_N100 Least Squares Figure 1: Normalized prediction error for transformers with GPT2 architectures as a function of the number of in-context test examples M when trained on in-context examples of linear models in d = 20 dimensions. Colored lines correspond to different training context lengths (N {40, 70, 100}) and different training procedures (either a fixed identity covariance matrix or random diagonal covariance matrices with each diagonal element sampled i.i.d. from the standard exponential distribution). The four figures correspond to evaluating on either fixed covariance or random covariance matrices of different scales. The gray dashed line shows the prediction error of zero estimator and the black dashed line the prediction error of LSA model when M, N . The GPT2 models achieve smaller error when they are trained on random covariance matrices with larger contexts, but their prediction error spikes when evaluated on contexts larger than those they were trained on. Furthermore, we notice some surprising behaviors when the test prompt length exceeds the training prompt length (i.e., M > N): there is an evident spike in prediction error, regardless of whether training and testing were performed on fixed or random covariance, and the spike appears to decrease when evaluated on prompts with higher variance. Although we are unsure of why the spike should decrease with higher-variance prompts, the failure of large language models to generalize to larger contexts than they were trained on is a wellknown problem (Dai et al., 2019; Anil et al., 2022). In our setting, we conjecture that this spike in error comes from the absolute positional encodings in the GPT2 architecture. The Trained Transformers Learn Linear Models In-Context positional encodings are randomly-initialized and are learnable parameters but the encoding for position i is only updated if the transformer encounters a prompt which has a context of length i. Thus, when evaluating on prompts of length M > N, the model is relying upon random positional encodings for M N samples. We note that a concurrent work has explored the performance of transformers with GPT2 architectures for in-context learning of linear models and found that removing positional encoders improves performance when evaluating on larger contexts (Ahuja et al., 2023). We leave further investigation of this behavior for future work. 5. Proof ideas In this section, we briefly outline the proof sketch of Theorem 4. The full proof of this theorem is left for Appendix A. 5.1 Equivalence to a quadratic optimization problem We recall each task τ corresponds to a weight vector wτ N(0, Id). The prompt inputs for this task are xτ,j i.i.d. N(0, Λ), which are also independent of wτ. The corresponding labels are yτ,j = wτ, xτ,j . For each task τ, we can form the prompt into a token matrix Eτ R(d+1) (N+1) as in (4), with the right-bottom entry being zero. The first key step in our proof is to recognize that the prediction byquery(Eτ; θ) in the linear self-attention model can be written as the output of a quadratic function u Hτu for some matrix Hτ depending on the token embedding matrix Eτ and for some vector u depending on θ = (W KQ, W PV ). This is shown in the following lemma, the proof of which is provided in Appendix A.1. Lemma 9 Let Eτ R(d+1) (N+1) be an embedding matrix corresponding to a prompt of length N and weight wτ. Then the prediction byquery(Eτ; θ) for the query covariate can be written as the output of a quadratic function, byquery(Eτ; θ) = u Hτu, where the matrix Hτ is defined as, 2Xτ EτE τ N R(d+1)2 (d+1)2, Xτ = 0d d xτ,query (xτ,query) 0 R(d+1) (d+1) u = Vec(U) R(d+1)2, U = R(d+1) (d+1), where U11 = W KQ 11 Rd d, u12 = w PV 21 Rd 1, u21 = w KQ 21 Rd 1, u 1 = w PV 22 R correspond to particular components of W PV and W KQ, defined in (5). Zhang, Frei, Bartlett This implies that we can write the original loss function (7) as u Hτu w τ xτ,query 2 . (25) Thus, our problem is reduced to understanding the dynamics of an optimization algorithm defined in terms of a quadratic function. We also note that this quadratic optimization problem is an instance of a rank-one matrix factorization problem, a problem well-studied in the deep learning theory literature (Gunasekar et al., 2017; Arora et al., 2019; Li et al., 2018; Chi et al., 2019; Belabbas, 2020; Li et al., 2020; Jin et al., 2023; Soltanolkotabi et al., 2023). Note, however, this quadratic function is non-convex. To see this, we will show that Hτ has negative eigenvalues. By standard properties of the Kronecker product, the eigenvalues of Hτ = 1 2Xτ EτE τ N are the products of the eigenvalues of 1 2Xτ and the eigenvalues of EτE τ N . Since EτE τ is symmetric and positive semi-definite, all of its eigenvalues are nonnegative. Since EτE τ is nonzero almost surely, it thus has at least one strictly positive eigenvalue. Thus, if Xτ has any negative eigenvalues, Hτ does as well. The characteristic polynomial of Xτ is given by, det(µI Xτ) = det µId xτ,query x τ,query µ = µd 1 µ2 xτ,query 2 2 . Therefore, we know almost surely, Xτ has one negative eigenvalue. Thus Hτ has at least d + 1 negative eigenvalues, and hence the quadratic form u Hτu is non-convex. 5.2 Dynamical system of gradient flow We now describe the dynamical system for the coordinates of u above. We prove the following lemma in Appendix A.2. Lemma 10 Let u = Vec (U) := Vec as in Lemma 9. Consider gradient 2E u Hτu w τ xτ,query 2 (26) with respect to u starting from an initial value satisfying Assumption 3. Then the dynamics of U follows d dt U11(t) = u2 1ΓΛU11Λ + u 1Λ2 d dtu 1(t) = tr h u 1ΓΛU11Λ(U11) Λ2(U11) i , (27) and u12(t) = 0d, u21(t) = 0d for all t 0, where Γ = 1 + 1 N tr(Λ)Id Rd d. Trained Transformers Learn Linear Models In-Context We see that the dynamics are governed by a complex system of d2+1 coupled differential equations. Moreover, basic calculus (for details, see Lemma 15) shows that these dynamics are the same as those of gradient flow on the following objective function: ℓ: Rd d R R, ℓ(U11, u 1) = tr 1 2u2 1ΓΛU11Λ(U11) u 1Λ2(U11) . (28) Actually, the loss function ℓis simply the loss function L in (26) plus some constants that do not depend on the parameter u. Therefore our problem is reduced to studying the dynamics of gradient flow on the above objective function. Our next key observation is that the set of global minima for ℓsatisfies the condition u 1U11 = Γ 1. Thus, if we can establish global convergence of gradient flow over the above objective function ℓ, then we have that u 1(t)U11(t) Γ 1 N Λ 1. Lemma 11 For any global minimum of ℓ, we have u 1U11 = Γ 1. (29) Putting this together with Lemma 10, we see that at those global minima of the population objective satisfying U11 = (cΓ) 1, u 1 = c and u12 = u21 = 0d, the transformer s predictions for a new linear regression task prompt are given by byquery(E; θ) = 1 i=1 yix i Γ 1xquery = w 1 M Γ 1xquery w xquery. Thus, the only remaining task is to show global convergence when gradient flow has an initialization satisfying Assumption 3. 5.3 PL inequality and global convergence We now show that although the optimization problem is non-convex, a Polyak Lojasiewicz (PL) inequality holds, which implies that gradient flow converges to a global minimum. Moreover, we can exactly calculate the limiting value of U11 and u 1. Lemma 12 Suppose the initialization of gradient flow satisfies Assumption 3 with initialization scale satisfying σ2 < 2 d Γ op for Γ = (1 + 1 N )Λ + tr(Λ) N Id. If we define d Λ 2 op tr (Γ 1Λ 1) tr (Λ 1) ΛΘ 2 F h 2 dσ2 Γ op i > 0, (30) then gradient flow on ℓwith respect to U11 and u 1 satisfies, for any t 0, ℓ(U11(t), u 1(t)) 2 µ ℓ(U11(t), u 1(t)) min U11 Rd d,u 1 R ℓ(U11, u 1) . Zhang, Frei, Bartlett Moreover, gradient flow converges to the global minimum of ℓ, and U11 and u 1 satisfy lim t u 1(t) = Γ 1 1 2 F and lim t U11(t) = Γ 1 1 With these observations, proving Theorem 4 becomes a direct application of Lemma 9, 10, 11, and Lemma 12. It then only requires translating U11 and u 1 back to the original parameterization using W PV and W KQ. 6. Conclusion and future work In this work, we investigated the dynamics of in-context learning of transformers with a single linear self-attention layer under gradient flow on the population loss. In particular, we analyzed the dynamics of these transformers when trained on prompts consisting of random instances of noiseless linear models over anisotropic Gaussian marginals. We showed that despite non-convexity, gradient flow from a suitable random initialization converges to a global minimum of the population objective. We characterized the prediction error of the trained transformer when given a new prompt that consists of a training dataset where the responses are a nonlinear function of the inputs. We showed how the trained transformer is naturally robust to shifts in the task and query distributions but is brittle to distribution shifts between the covariates seen during training and the covariates seen at test time, matching the empirical observations on trained transformer models of Garg et al. (2022). There are a number of natural directions for future research. First, our results hold for gradient flow on the population loss with a particular class of random initialization schemes. It is a natural question if similar results would hold for stochastic gradient descent with finite step sizes and for more general initializations. Further, we restricted our attention to transformers with a single linear self-attention layer. Although this model class is rich enough to allow for in-context learning of linear predictors, we are particularly interested in understanding the dynamics of in-context learning in nonlinear and deep transformers. Finally, the framework of in-context learning introduced in prior work was restricted to the setting where the marginal distribution over the covariates (Dx) was fixed across prompts. This allows for guarantees akin to distribution-specific PAC learning, where the trained transformer is able to achieve small prediction error when given a test prompt consisting of linear regression data when the marginals over the covariates are fixed. However, other learning algorithms (such as ordinary least squares) are able to achieve small prediction error for prompts corresponding to well-specified linear regression tasks for very general classes of distributions over the covariates. As we showed in Section 4.3, when transformers with a single linear self-attention layer are trained on prompts where the covariate distributions are themselves sampled from a distribution, they do not succeed on test prompts with covariate distributions sampled from the same distribution. By contrast, we demonstrated with experiments that larger, nonlinear transformer architectures appear to be more successful in this setting but are still sub-optimal. Developing a better understanding of the dynamics of in-context learning when the covariate distribution varies across prompts is an intriguing direction for future research. Trained Transformers Learn Linear Models In-Context Acknowledgments We gratefully acknowledge the support of the NSF and the Simons Foundation for the Collaboration on the Theoretical Foundations of Deep Learning through awards DMS2031883 and #814639, and of the NSF through grant DMS-2023505. Zhang, Frei, Bartlett 1 Introduction 2 2 Additional Related Work 3 3 Preliminaries 5 3.1 In-context learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5 3.2 Linear self-attention networks . . . . . . . . . . . . . . . . . . . . . . . . . . 6 3.3 Training procedure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8 4 Main results 9 4.1 Convergence of gradient flow and prediction error for new tasks . . . . . . . 10 4.2 Behavior of trained transformer under distribution shifts . . . . . . . . . . . 13 4.3 Transformers trained on prompts with random covariate distributions . . . 14 5 Proof ideas 19 5.1 Equivalence to a quadratic optimization problem . . . . . . . . . . . . . . . 19 5.2 Dynamical system of gradient flow . . . . . . . . . . . . . . . . . . . . . . . 20 5.3 PL inequality and global convergence . . . . . . . . . . . . . . . . . . . . . . 21 6 Conclusion and future work 22 A Proof of Theorem 4 25 A.1 Proof of Lemma 9 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25 A.2 Proof of Lemma 10 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26 A.3 Proof of Lemma 11 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32 A.4 Proof of Lemma 12 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34 B Proof of Theorem 5 39 C Proof of Theorem 8 41 C.1 Dynamical system . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42 C.2 Loss function and global minima . . . . . . . . . . . . . . . . . . . . . . . . 43 C.3 PL Inequality and global convergence . . . . . . . . . . . . . . . . . . . . . 44 D Technical lemmas 49 E Experiment details 51 Trained Transformers Learn Linear Models In-Context Appendix A. Proof of Theorem 4 In this section, we prove Lemma 9, Lemma 10, Lemma 11 and Lemma 12. Theorem 4 is a natural corollary of these four lemmas when we translate u 1 and U11 back to W PV and W KQ. A.1 Proof of Lemma 9 For the reader s convenience, we restate the lemma below. Lemma 13 Let Eτ R(d+1) (N+1) be an embedding matrix corresponding to a prompt of length N and weight wτ. Then the prediction byquery(Eτ; θ) for the query covariate can be written as the output of a quadratic function, byquery(Eτ; θ) = u Hτu, where the matrix Hτ is defined as, 2Xτ EτE τ N R(d+1)2 (d+1)2, Xτ = 0d d xτ,query (xτ,query) 0 R(d+1) (d+1) u = Vec(U) R(d+1)2, U = R(d+1) (d+1), where U11 = W KQ 11 Rd d, u12 = w PV 21 Rd 1, u21 = w KQ 21 Rd 1, u 1 = w PV 22 R correspond to particular components of W PV and W KQ, defined in (5). Proof First, we decompose WPV and WKQ in the way above. From the definition, we know byτ,query is the right-bottom entry of f LSA(Eτ), which is byτ,query = (u12) u 1 EτE τ N We denote ui Rd+1 as the i-th column of U11 (u21) and xi τ,query as the i-th entry of xτ,query for i [d]. Then, we have i=1 xi τ,query (u12) u 1 EτE τ N i=1 tr ui (u12) u 1 xi τ,query (u12) u 1 x τ,query EτE τ N Zhang, Frei, Bartlett 0d(d+1) d(d+1) xτ,query EτE τ N x τ,query EτE τ N 0(d+1) (d+1) 2 tr uu Xτ EτE τ N = D Hτ, uu E . Here, we use some algebraic facts about matrix vectorization, Kronecker product and trace. For reference, we refer to (Petersen and Pedersen, 2008). A.2 Proof of Lemma 10 For the reader s convenience, we restate the lemma below. Lemma 14 Let u = Vec (U) := Vec as in Lemma 9. Consider gradient 2E u Hτu w τ xτ,query 2 (26) with respect to u starting from an initial value satisfying Assumption 3. Then the dynamics of U follows d dt U11(t) = u2 1ΓΛU11Λ + u 1Λ2 d dtu 1(t) = tr h u 1ΓΛU11Λ(U11) Λ2(U11) i , (27) and u12(t) = 0d, u21(t) = 0d for all t 0, where Γ = 1 + 1 N tr(Λ)Id Rd d. Proof From the definition of L in (26) and the dynamics of gradient flow, we calculate the derivatives of u. Here, we use the chain rule and some facts about matrix derivatives. See Lemma 29 for reference. dt = E Hτ, uu Hτ u + E w τ xτ,query Hτ u. (31) Step One: Calculate the Second Term We first calculate the second term. From the definition of Hτ, we have E h w τ xτ,query Hτ i = 1 i=1 E xi τ,query Xτ wi τ EτE τ N Trained Transformers Learn Linear Models In-Context For ease of notation, we denote i=1 xτ,ix τ,i. (32) Then, from the definition of EτE τ N , we know N xτ,query x τ,query bΛτwτ wτ bΛτ w τ bΛτwτ Since wτ N(0, Id) is independent of all prompt inputs and query input, we have i=1 E xi τ,query Xτ wi τ N xτ,query x τ,query 0 0 0 i=1 E E xi τ,query Xτ wi τ N xτ,query x τ,query 0 0 0 " xi τ,query Xτ E wi τ | xτ,query xτ,query x τ,query 0 0 0 Therefore, we have E h w τ xτ,query Hτ i = 1 xi τ,query Xτ w τ bΛτ w τ bΛτwτ. Since Xτ only depends on xτ,query by definition, and xτ,query is independent of wτ and xτ,i, i = 1, 2, ..., N, we have E h w τ xτ,query Hτ i = 1 E xi τ,query Xτ E w τ bΛτ w τ bΛτwτ. E(wi τ)Λ ΛE(wi τwτ) E(wi τw τ )Λ E wi τw τ Λwτ where Λi denotes Λ:i. Here, the second line comes from the fact that EbΛτ = Λ, and that wτ is independent of all prompt input and query input. The last line comes from the fact that wτ N(0, Id). Therefore, simple computation shows that E h w τ xτ,query Hτ i u = 1 0d(d+1) d(d+1) A A 0(d+1) (d+1) Zhang, Frei, Bartlett where A Rd(d+1) (d+1) and Vj R(d+1) (d+1) are defined by 0d d Pd i=1 ΛijΛi Step Two: Calculate the First Term Next, we compute the first term in (31), namely D := 2E Hτ, uu Hτu . For simplicity, we denote Zτ := 1 N EτE τ . Using the definition of Hτ in (24) and Lemma 29, we have D = 2E Hτ, uu Hτu (definition) 2E h tr Xτ Zτ Vec (U) Vec (U) (Xτ Zτ) Vec (U) i (definition of Hτ in (24) and u = Vec(U)) 2E h tr Vec (ZτUXτ) Vec (U) Vec (ZτUXτ) i (Vec(AXB) = (B A) Vec(X) in Lemma 29) 2E h Vec (U) Vec (ZτUXτ) Vec (ZτUXτ) i (property of trace operator) (ZτUXτ)ij Uij Vec (ZτUXτ) Step Three: u12 and u21 Vanish We first prove that if u12 = u21 = 0d, then d dtu12 = 0d and d dtu21 = 0d. If this is true, then these two blocks will be zero all the time since we assume they are zero at initial time in Assumption 3. We denote Ak: and A:k as the k-th row and k-th column of matrix A, respectively. Under the assumption that u12 = u21 = 0d, we first compute bΛτwτu 1x τ,query bΛτ + 1 N xτ,query x τ,query U11xτ,query w τ bΛτ wτu 1x τ,query w τ bΛτ U11xτ,query Written in an entry-wise manner, it will be (ZτUXτ)kl = k: wτu 1xl τ,query k, l [d] bΛτ + 1 N xτ,query x τ,query k: U11xτ,query k [d], l = d + 1 w τ bΛτ wτu 1xl τ,query l [d], k = d + 1 w τ bΛτ U11xτ,query k = l = d + 1 Trained Transformers Learn Linear Models In-Context We use Dij to denote the (i, j)-th entry of the (d + 1) (d + 1) matrix D such that Vec( D) = D. Now we fix a k [d], then (ZτUXτ)ij Uij (ZτUXτ)k,d+1 (ZτUXτ)ij Uij (ZτUXτ)k,d+1 2E h (ZτUXτ)d+1,d+1 u 1 (ZτUXτ)k,d+1 i , (36) since Ui,d+1 = Ud+1,i = 0 for any i [d]. For the first term in the right hand side of last equation, we fix i, j [d] and have E (ZτUXτ)ij Uij (ZτUXτ)k,d+1 i: wτu 1xj τ,query bΛτ + 1 N xτ,query x τ,query k: U11xτ,query since wτ is independent with all prompt input and query input, namely all xτ,i for i [query], and wτ is mean zero. Similarly, for the second term of (36), we have E (ZτUXτ)d+1,d+1 u 1 (ZτUXτ)k,d+1 =E u 1w τ bΛτ U11xτ,query bΛτ + 1 N xτ,query xτ,query k: U11xτ,query since E w τ = 0 and wτ is independent of all xτ,i for i [query]. Therefore, we have Dk,d+1 = 0 for k [d]. Similar calculation shows that Dd+1,k = 0 for k [d]. For k [d], to calculate the derivative of Uk,d+1, it suffices to further calculate the inner product of the d(d + 1) + k th row of E w τ xτ,query Hτ and u. From (33), we know this is j=1 Λ k Λj Ud+1,j = 0 given that u12 = u21 = 0d. Therefore, we conclude that the derivative of Uk,d+1 will vanish given u12 = u21 = 0d. Similarly, we conclude the same result for Ud+1,k for k [d]. Therefore, we know u12 = 0d and u21 = 0d for all time t 0. Step Four: Dynamics of U11 Next, we calculate the derivatives of U11 given u12 = u21 = 0d. For a fixed pair of k, l [d], we have (ZτUXτ)ij Uij (ZτUXτ)kl 2E h (ZτUXτ)d+1,d+1 u 1 (ZτUXτ)kl i . Zhang, Frei, Bartlett For fixed i, j [d], we have E h (ZτUXτ)ij Uij (ZτUXτ)kl i = Uiju2 1E h bΛτ i: wτxj τ,queryxl τ,queryw τ bΛτ = Uiju2 1E h xj τ,queryxl τ,query i E h bΛτ = Uiju2 1Λτ,jl E h bΛτ Therefore, we sum over i, j [d] to get (ZτUXτ)ij Uij (ZτUXτ)kl For the last term, we have 1 2E h (ZτUXτ)d+1,d+1 u 1 (ZτUXτ)kl i = 1 So we have Dkl = u2 1E bΛτ Additionally, we have 2 h E w τ xτ,query Hτ u i (l 1)(d+1)+k = 0d(d+1) d(d+1) A A 0(d+1) (d+1) (l 1)(d+1)+k (definition) = 0(d+1) d(d+1) Vl + V l (definition of A in (34)) = Λ k Λlu 1. (definition of Vi in (34)) Therefore, we have that for k, l [d], the dynamics of Ukl is d dt Ukl = u2 1E bΛτ bΛτ U11Λl + u 1Λ k Λl, which implies d dt U11 = u2 1E bΛτ 2 U11Λ + u 1Λ2. From the definition of bΛτ (equation (32)), the independence and Gaussianity of xτ,i and Lemma 30, we compute E bΛτ 2 = E i=1 xτ,ix τ,i (definition (32)) Trained Transformers Learn Linear Models In-Context h E xτ,1x τ,1 i2 + 1 N E xτ,1x τ,1xτ,1x τ,1 (independence between prompt input) N tr(Λ)Λ. (Lemma 30) We define Γ := N + 1 N tr(Λ)Id. (37) Then, from (31), we know the dynamics of U11 is d dt U11 = u2 1ΓΛU11Λ + u 1Λ2. (38) Step Five: Dynamics of u 1 Finally, we compute the dynamics of u 1. We have Dd+1,d+1 = 1 (ZτUXτ)ij Uij (ZτUXτ)d+1,d+1 2E h (ZτUXτ)d+1,d+1 u 1 (ZτUXτ)d+1,d+1 i . (39) For the first term above, we have (ZτUXτ)ij Uij (ZτUXτ)d+1,d+1 i,j=1 Uij E h bΛτ i: wτw τ bΛτ U11xτ,queryxj τ,query i (from (35)) i,j=1 Uij E h bΛτ i: bΛτ U11xτ,queryxj τ,query i (independence and distribution of wτ) i,j=1 Uij E h bΛτ i: bΛτ U11Λj i (independence between prompt covariates) i,j=1 Λj Uij bΛτ = u 1E tr Λ(U11) bΛτ 2 U11 =u 1 tr E bΛτ 2 U11Λ(U11) . For the second term in (39), we have E h (ZτUXτ)d+1,d+1 u 1 (ZτUXτ)d+1,d+1 i = u 1E h w τ bΛτ U11xτ,queryx τ,query(U11) bΛτ wτ i (from (35)) = u 1E tr h wτw τ bΛτ U11xτ,queryx τ,query(U11) bΛτ i Zhang, Frei, Bartlett = u 1E tr h bΛτ U11Λ(U11) bΛτ i = u 1 tr E bΛτ 2 U11Λ(U11) . Therefore, we know Dd+1,d+1 = u 1 tr E bΛτ 2 U11Λ(U11) . Additionally, we have 2 h E w τ xτ,query Hτ u i 0d(d+1) d(d+1) A A 0(d+1) (d+1) (d+1)2 (from (33)) = V1 + V 1 ... Vd + V d 0(d+1) (d+1) (definition of A in (34)) i,j=1 Λ i Λj Uji = tr Λ(U11) Λ . Then, from (31), we have the dynamics of u 1 is d dtu 1 = tr h u 1ΓΛU11Λ(U11) Λ2(U11) i . (40) A.3 Proof of Lemma 11 Lemma 11 gives the form of global minima of an equivalent loss function. First, we prove that gradient flow on L defined in (8) from the initial values satisfying Assumption 3 is equivalent to gradient flow on another loss function ℓdefined below. Then, we derive an expression for the global minima of this loss function. First, from the dynamics of gradient flow, we can actually recover the loss function up to a constant. We have the following lemma. Lemma 15 (Loss Function) Consider gradient flow over L in (26) with respect to u starting from an initial value satisfying Assumption 3. This is equivalent to doing gradient flow with respect to U11 and u 1 on the loss function ℓ(U11, u 1) = tr 1 2u2 1ΓΛU11Λ(U11) u 1Λ2(U11) . (41) Proof The proof is simply by taking gradient of the loss function in (41). For techniques in matrix derivatives, see Lemma 29. We take the gradient of ℓon U11 to obtain 2u2 1Λ Γ U11Λ + 1 2u2 1ΓΛU11Λ u 1Λ2 = u2 1ΓΛU11Λ u 1Λ2, Trained Transformers Learn Linear Models In-Context since Γ and Λ are commutable. We take derivatives w.r.t. u 1 to get ℓ u 1 = tr h u 1ΓΛU11Λ(U11) Λ2(U11) i . Combining this with Lemma 10, we have d dt U11(t) = ℓ U11 , d dtu 1(t) = ℓ We remark that actually this is the loss function L up to some constant. This loss function ℓcan be negative. But we can still compute its global minima as follows. Corollary 16 (Minimum of Loss Function) The loss function ℓin Lemma 15 satisfies min U11 Rd d,u 1 R ℓ(U11, u 1) = 1 ℓ(U11, u 1) min U11 Rd d,u 1 R ℓ(U11, u 1) = 1 Γ 1 2 u 1Λ 1 2 U11Λ 1 2 ΛΓ 1 2 Proof First, we claim that ℓ(U11, u 1) = 1 2 tr Γ u 1Λ 1 2 U11Λ 1 2 ΛΓ 1 u 1Λ 1 2 U11Λ 1 2 ΛΓ 1 1 2 tr Λ2Γ 1 . To calculate this, we just need to expand the terms in the brackets and notice that Γ and Λ commute: tr Γ u 1Λ 1 2 U11Λ 1 2 ΛΓ 1 u 1Λ 1 2 U11Λ 1 2 ΛΓ 1 tr Λ2Γ 1 (i) = tr h Γ u2 1Λ 1 2 U11Λ(U11) Λ1/2 u 1ΛΓ 1Λ 1 2 U11Λ 1 2 u 1Λ 1 2 U11Λ 3 2 Γ 1 + Γ 2Λ2 i = tr h Γ u2 1Λ 1 2 U11Λ(U11) Λ1/2 u 1ΛΓ 1Λ 1 2 U11Λ 1 2 u 1Λ 1 2 U11Λ 3 2 Γ 1 i = u2 1 tr h ΓΛ 1 2 U11Λ(U11) Λ 1 2 i u 1 tr h ΓΛΓ 1Λ 1 2 U11Λ 1 2 ΓΛ 1 2 U11Λ 3 2 Γ 1i (ii) = u2 1 tr h ΓΛU11Λ(U11) i 2u 1 tr h Λ2U11Λ 1 2 i = 2 ℓ(U11, u 1) . Equations (i) and (ii) use that Γ and Λ commute. Zhang, Frei, Bartlett Since Γ 0 and u 1Λ 1 2 U11Λ 1 2 ΛΓ 1 u 1Λ 1 2 U11Λ 1 2 ΛΓ 1 0, we know from Lemma 32 that 1 2 tr Γ u 1Λ 1 2 U11Λ 1 2 ΛΓ 1 u 1Λ 1 2 U11Λ 1 2 ΛΓ 1 0, which implies ℓ(U11, u 1) 1 2 tr Λ2Γ 1 . Equality holds when U11 = Γ 1, u 1 = 1, so the minimum of ℓis 1 2 tr Λ2Γ 1 . The expression for ℓ(U11, u 1) min ℓ(U11, u 1) comes from the fact that tr(A A) = A 2 F for any matrix A. Lemma 11 is an immediate consequence of Corollary16, since the loss will keep the same when we replace (U11, u 1) by (c U11, c 1u 1) for any non-zero constant c. A.4 Proof of Lemma 12 In this section, we prove that the dynamical system in Lemma 10 satisfies a PL inequality. Then, the PL inequality naturally leads to the global convergence of this dynamical system. First, we prove a simple lemma, which says the parameters in the LSA model will keep balanced in the whole trajectory. From the proof of this lemma, we can understand why we assume a balanced parameter at the initial time. Lemma 17 (Balanced Parameters) Consider gradient flow over L in (26) with respect to u starting from an initial value satisfying Assumption 3. For any t 0, it holds that u2 1 = tr h U11(U11) i . (42) Proof From Lemma 10, we multiply the first equation in (27) by (U11) from the right to get d dt U11(t) (U11(t)) = u2 1ΓΛU11Λ(U11) + u 1Λ2(U11) . Also we multiply the second equation in Lemma 10 by u 1 to obtain d dtu 1(t) u 1(t) = tr h u2 1ΓΛU11Λ(U11) + u 1Λ2(U11) i . Therefore, we have dt U11(t) (U11(t)) = d dtu 1(t) u 1(t). Taking the transpose of the equation above and adding to itself gives d dt tr h U11(t)(U11(t)) i = d dt u 1(t)2 . Trained Transformers Learn Linear Models In-Context Notice that from Assumption 3, we know that at t = 0, u 1(0)2 = σ2 = σ2 tr h ΘΘ ΘΘ i = tr h U11(0)(U11(0)) i . So for any time t 0, the equation holds. In order to prove the PL inequality, we first prove an important property which says the trajectories of u 1(t) stay away from saddle point at origin. First, we prove that u 1(t) will stay positive along the whole trajectory. Lemma 18 Consider gradient flow over L in (26) with respect to u starting from an initial value satisfying Assumption 3. If the initial scale satisfies d Γ op , (43) then, for any t 0, it holds that u 1 > 0. Proof From Lemma 15, we are actually doing gradient flow on the loss ℓ. The loss function is non-increasing, because We notice that when u 1 = 0, ℓ= 0. Therefore, as long as ℓ(U11(0), u 1(0)) < 0, then for any time, u 1 will be non-zero. Further, since u 1(0) > 0 and the trajectory of u 1(t) must be continuous, we know u 1(t) > 0 for any t 0. Then, it suffices to prove when 0 < σ < q 2 d Γ op , it holds that ℓ(U11(0), u 1(0)) < 0. From Assumption 3, we can calculate the loss function at the initial time: ℓ(U11(0), u 1(0)) = σ4 2 tr h ΓΛΘΘ ΛΘΘ i σ2 tr h Λ2ΘΘ i . From the property of trace, we know tr h Λ2ΘΘ i = tr h ΛΘΘ Λ i = ΛΘ 2 F . From Von-Neumann s trace inequality (Lemma 31) and the fact that ΘΘ F = 1, we know tr h ΓΛΘΘ ΛΘΘ i d ΛΘΘ ΛΘΘ F Γ op d ΛΘ 2 F ΘΘ F Γ op d ΛΘ 2 F Γ op . Zhang, Frei, Bartlett Therefore, we have ℓ(U11(0), u 1(0)) 2 ΛΘ 2 F Γ op σ2 ΛΘ 2 F dσ2 Γ op 2 i . From Assumption 3, we know ΛΘ F = 0. From (37), we know Γ op > 0. Therefore, when we have ℓ(U11(0), u 1(0)) < 0. From the lemma above, we can actually further prove that the u 1(t) can be lower bounded by a positive constant for any t 0. This will be a critical property to prove the PL inequality. We have the following lemma. Lemma 19 Consider gradient flow over L in (26) with respect to u starting from an initial value satisfying Assumption 3 with initial scale 0 < σ < q 2 d Γ op . For any t 0, it holds d Λ 2 op ΛΘ 2 F h 2 dσ2 Γ op i > 0. (44) Proof We prove by contradiction. Suppose the claim does not hold. From Lemma 17, we know u2 1 = tr U11(U11) = U11 2 F . From Lemma 18, we know u 1 = U11 F . Recall the definition of loss function: ℓ(U11, u 1) = tr 1 2u2 1ΓΛU11Λ(U11) u 1Λ2(U11) . Since Γ 0, Λ 0, and they commute, we know from Lemma 32 that ΓΛ 0. Again, since U11Λ(U11) = U11Λ 1 2 U11Λ 1 2 0, from Lemma 32 we have tr 1 2u2 1ΓΛU11Λ(U11) 0. So ℓ(U11, u 1) tr h u 1Λ2(U11) i . From Von-Neumann s trace inequality, we know for any t 0, tr h u 1Λ2(U11) i du 1 Λ2 op U11 F = du2 1 Λ 2 op . Therefore, under our assumption that the claim does not hold, we have ℓ(U11, u 1) du2 1 Λ 2 op > σ2 2 ΛΘ 2 F h 2 dσ2 Γ op i ℓ(U11(0), u 1(0)). Trained Transformers Learn Linear Models In-Context Here, the last inequality comes from the proof of Lemma 18. This contradicts the nonincreasing property of the loss function in gradient flow. Finally, let s prove the PL inequality and further, the global convergence of gradent flow on the loss function ℓ. We recall the stated lemma from the main text. Lemma 20 Suppose the initialization of gradient flow satisfies Assumption 3 with initialization scale satisfying σ2 < 2 d Γ op for Γ = (1 + 1 N )Λ + tr(Λ) N Id. If we define d Λ 2 op tr (Γ 1Λ 1) tr (Λ 1) ΛΘ 2 F h 2 dσ2 Γ op i > 0, (30) then gradient flow on ℓwith respect to U11 and u 1 satisfies, for any t 0, ℓ(U11(t), u 1(t)) 2 µ ℓ(U11(t), u 1(t)) min U11 Rd d,u 1 R ℓ(U11, u 1) . Moreover, gradient flow converges to the global minimum of ℓ, and U11 and u 1 satisfy lim t u 1(t) = Γ 1 1 2 F and lim t U11(t) = Γ 1 1 Proof From the definition and Lemma 19, we have ℓ(U11, u 1) 2 2 F = u2 1ΓΛU11Λ u 1Λ2 2 F = u2 1 ΓΛ 1 2 u 1Λ 1 2 U11Λ 1 2 ΛΓ 1 Λ 1 2 2 d Λ 2 op ΛΘ 2 F h 2 dσ2 Γ op i ΓΛ 1 2 u 1Λ 1 2 U11Λ 1 2 ΛΓ 1 Λ 1 2 2 To see why the second line is true, recall that u 1 R and Γ and Λ commute. The last line comes from the lower bound of u 1 in Lemma 19. From Corollary 16, we know ℓ min U11 Rd d,u 1 R ℓ(U11, u 1) = 1 2 tr Γ u 1Λ 1 2 U11Λ 1 2 ΛΓ 1 u 1Λ 1 2 U11Λ 1 2 ΛΓ 1 Γ 1 2 u 1Λ 1 2 U11Λ 1 2 ΛΓ 1 2 Therefore, we know that ℓ min U11 Rd d,u 1 R ℓ(U11, u 1) ΓΛ 1 2 u 1Λ 1 2 U11Λ 1 2 ΛΓ 1 Λ 1 2 2 Zhang, Frei, Bartlett ΓΛ 1 2 u 1Λ 1 2 U11Λ 1 2 ΛΓ 1 Λ 1 2 2 F tr Γ 1Λ 1 tr Λ 1 (46) We compare (45) and (46) to obtain that in order to make the PL condition hold, one needs to let d Λ 2 op tr (Γ 1Λ 1) tr (Λ 1) ΛΘ 2 F h 2 dσ2 Γ op i > 0. Once we set this µ, we get the PL inequality. The µ is positive due to the assumption for σ in the lemma. From the dynamics of gradient flow and the PL condition, we know ℓ min U11 Rd d,u 1 R ℓ(U11, u 1) = µ ℓ min U11 Rd d,u 1 R ℓ(U11, u 1) . Therefore, we have when t , 0 ℓ min U11 Rd d,u 1 R ℓ(U11, u 1) exp ( µt) ℓ(U11(0), u 1(0)) min U11 Rd d,u 1 R ℓ(U11, u 1) 0, which implies ℓ min U11 Rd d,u 1 R ℓ(U11, u 1) = 0. From Corollary 16, we know this is Γ 1 2 u 1Λ 1 2 U11Λ 1 2 ΛΓ 1 2 Since Γ and Λ are non-singular and positive definite, and they commute, we know u 1U11 Γ 1 2 F Γ 1 Γ 1 2 u 1Λ 1 2 U11Λ 1 2 ΛΓ 1 2 This implies u 1U11 Γ 1 0d d entry-wise. Since u 1 = U11 F , we know u2 1 = u 1U11 F Γ 1 F . Therefore, we know lim t u 1(t) = Γ 1 1 2 F and lim t U11(t) = Γ 1 1 Trained Transformers Learn Linear Models In-Context Appendix B. Proof of Theorem 5 In this section, we prove Theorem 5, which characterizes the excess risk of the prediction of a trained LSA layer with respect to the risk of best linear predictor, on a new task which is possibly non-linear. First, we restate the theorem. Theorem 5 Let D be a distribution over (x, y) Rd R, whose marginal distribution on x is Dx = N(0, Λ). Assume ED[y], ED[xy], ED[y2xx ] exist and are finite. Assume the test prompt is of the form P = (x1, y1, . . . , x M, y M, xquery), where (xi, yi), (xquery, yquery) i.i.d. D. Let f LSA be the LSA model with parameters W PV and W KQ in (11), and byquery is the prediction for xquery given the prompt. If we define a := Λ 1E(x,y) D [xy] , Σ := E(x,y) D h xy E (xy) xy E (xy) i , (15) then, for Γ = Λ + 1 N tr(Λ)Id. we have, E (byquery yquery)2 = min w Rd E ( w, xquery yquery)2 | {z } Error of best linear predictor M tr ΣΓ 2Λ + 1 h a 2 Γ 2Λ3 + 2 tr(Λ) a 2 Γ 2Λ2 + tr(Λ)2 a 2 Γ 2Λ i , where the expectation is over (xi, yi), (xquery, yquery) i.i.d. D. Proof Unless otherwise specified, we use E to denote the expectation over (xi, yi) and (xquery, yquery) i.i.d. D. Since when (x, y) D, we assume E[x], E[y], E[xy], E[xx ], E[y2xx ] exist, we know that E ( w, xquery yquery)2 exists for each w Rd. We denote a := arg min w Rd E ( w, xquery yquery)2 as the weight of the best linear approximator. Actually, if we denote the function inside the minimum above as R(w), we can write it as R(w) = w Λw 2E yquery x query w + Ey2 query. Since the Hessian matrix 2 w w R(w) is Λ, which is positive definitive, we know that this function is strictly convex and hence, the global minimum can be achieved at the unique first-order stationary point. This is a = Λ 1E (yquery xquery) . (47) We also define a similar vector for ease of computation: b = Γ 1E (yquery xquery) . (48) Zhang, Frei, Bartlett Therefore, we can decompose the risk as E (byquery yquery)2 = E ( a, xquery yquery)2 | {z } I + E (byquery b, xquery )2 | {z } II + E ( b, xquery a, xquery )2 | {z } III + 2E (byquery b, xquery ) ( a, xquery yquery) | {z } IV + 2E (byquery b, xquery ) ( b, xquery a, xquery ) | {z } V + 2E ( b, xquery a, xquery ) ( a, xquery yquery) | {z } VI The term I is the first term on the right hand side of (16). So it suffices to calculate II to VI. First, from the tower property of conditional expectation, we have V = 2E E (byquery b, xquery ) ( b, xquery a, xquery ) xquery = 2E E byquery b, xquery xquery ( b, xquery a, xquery ) = 0, E byquery b, xquery xquery i=1 yiΓ 1xi b xquery = 0. Similarly, for IV, we have IV = 2E (byquery b, xquery ) ( a, xquery yquery) = 2E E (byquery b, xquery ) ( a, xquery yquery) xquery, yquery = 2E E byquery b, xquery xquery, yquery ( a, xquery yquery) For VI, we have VI = 2E tr h (b a) ( a, xquery yquery) x query i = 2 tr h (b a)a Λ i 2 tr h (b a)E yqueryx query i = 0, where the last line comes from the definition of a. Therefore, all cross terms vanish and it suffices to consider II and III. Trained Transformers Learn Linear Models In-Context From the definition II is equal to i=1 yixi E (yquery xquery) Γ 1xqueryx queryΓ 1 1 M i=1 yixi E (yquery xquery) i=1 yixi E (yquery xquery) i=1 yixi E (yquery xquery) (property of trace and the fact that Γ and Λ commute) i,j=1 E tr n (yixi E (yquery xquery)) (yjxj E (yquery xquery)) Γ 2Λ o M E tr n (y1x1 E (yquery xquery)) (y1x1 E (yquery xquery)) Γ 2Λ o (all cross terms vanish due to the independence of xi) M tr ΣΓ 2Λ . The last line comes from the definition of Σ. For III, we have III = E(b a) xqueryx query(b a) = a Λ(Γ 1 Λ 1)Λ(Γ 1 Λ 1)Λa = tr h I ΓΛ 1 2 Γ 2Λ3aa i (property of trace and the fact that Γ and Λ commute) = 1 N2 tr h Id + tr(Λ)Λ 1 2 Γ 2Λ3aa i h tr(Γ 2Λ3aa ) + 2 tr(Λ) tr(Γ 2Λ2aa ) + tr(Λ)2 tr(Γ 2Λaa ) i . Combining all terms above, we conclude. Appendix C. Proof of Theorem 8 The proof of Theorem 8 is very similar to that of Theorem 4. The first step is to explicitly write out the dynamical system. In order to do so, we notice that the Lemma 9 does not depend on the training data and data-generaing distribution and hence, it still holds in the case of a random covariance matrix. Therefore, we know when we input the embedding matrix Eτ to the linear self-attention layer with parameter θ = (W KQ, W PV ), the prediction will be byquery(Eτ; θ) = u Hτu, where the matrix Hτ is defined as, 2Xτ EτE τ N R(d+1)2 (d+1)2, Xτ = 0d d xτ,query (xτ,query) 0 R(d+1) (d+1) Zhang, Frei, Bartlett u = Vec(U) R(d+1)2, U = R(d+1) (d+1), where U11 = W KQ 11 Rd d, u12 = w PV 21 Rd 1, u21 = w KQ 21 Rd 1, u 1 = w PV 22 R correspond to particular components of W PV and W KQ, defined in (5). C.1 Dynamical system The next lemma gives the dynamical system when the covariance matrices in the prompts are i.i.d. sampled from some distribution. Notice that in the lemma below, we do not assume Λτ are almost surely diagonal. The case when the covariance matrices are diagonal can be viewed as a special case of the following lemma. Lemma 21 Consider gradient flow on (20) with respect to u starting from an initial value that satisfies Assumption 3. We assume the covariance matrices Λτ are sampled from some distribution with finite third moment and Λτ are positive definite almost surely. We denote u = Vec (U) := Vec N tr(Λτ)Id Rd d. Then the dynamics of U follows d dt U11(t) = u2 1E [ΓτΛτU11Λτ] + u 1E Λ2 τ d dtu 1(t) = u 1 tr E h ΓτΛτU11Λτ(U11) i + tr E Λ2 τ (U11) , (49) and u12(t) = 0d, u21(t) = 0d for all t 0. Proof This lemma is a natural corollary of Lemma 10. Notice that Lemma 10 holds for any fixed positive definite Λτ. So when Λτ is random, if we condition on Λτ, the dynamical system will be d dt U11(t) = u2 1 [ΓτΛτU11Λτ] + u 1 Λ2 τ d dtu 1(t) = u 1 tr h ΓτΛτU11Λτ(U11) i + tr Λ2 τ (U11) , (50) and u12(t) = 0d, u21(t) = 0d for all t 0. Then, we conclude by simply taking expectation over Λτ. The lemma above gives the dynamical system with general random covariance matrix. When Λτ are diagonal almost surely, we can actually simplify the dynamical system above. In this case, we have the following corollary. Trained Transformers Learn Linear Models In-Context Corollary 22 Under the assumptions of Lemma 21, we further assume the covariance matrix Λτ to be diagonal almost surely. We denote uij(t) R as the (i, j)-th entry of U11(t), and further denote N λ3 τ,i + 1 ξi = E λ2 τ,i , N λ2 τ,iλτ,j + 1 for i, j [d], where the expectation is over the distribution of Λτ. Then, the dynamical system (49) is equivalent to d dtuii(t) = γiu2 1uii + ξiu 1 i [d], d dtuij(t) = ζiju2 1uij i = j [d], d dtu 1(t) = γiu 1u2 ii X i =j ζiju 1u2 ij + i=1 [ξiuii] . Proof This is directly obtained by rewriting the equation for each entry of U11 and recalling the assumption that Λτ (and hence Γτ) is diagonal almost surely. C.2 Loss function and global minima As in the proof of Theorem 4, we can actually recover the loss function in the random covariance case, up to a constant. Lemma 23 The differential equations in (52) are equivalent to gradient flow on the loss function ℓrdm(U11, u 1) = E tr 1 2u2 1ΓτΛτU11Λτ(U11) u 1Λ2 τ(U11) γiu2 1u2 ii + 1 i =j ζiju2 1u2 ij i=1 [ξiuiiu 1] (53) with respect to uij i, j [d] and u 1, from an initial value that satisfies Assumption 3. Proof This can be verified by simply taking gradient of ℓrdm to show that d dtuii = ℓrdm uii i [d], d dtuij = ℓrdm uij i = j [d], d dtu 1 = ℓrdm Zhang, Frei, Bartlett Next, we solve for the minimum of ℓrdm and give the expression for all global minima. Lemma 24 Let ℓrdm be the loss function in (53). We denote min ℓrdm := min U11 Rd d,u 1 R ℓrdm (U11, u 1) . Then, we have min ℓrdm = 1 ξ2 i γi (54) ℓrdm(U11, u 1) min ℓrdm = 1 i =j ζiju2 1u2 ij. (55) Moreover, denoting uij as the (i, j)-entry of U11, all global minima of ℓrdm satisfy u 1 uij = I(i = j) ξi Proof From the definition of ℓrdm, we have i =j ζiju2 1u2 ij 1 The equation holds when uij = 0 for i = j [d] and u 1uii = ξi γi for each i [d]. This can be achieved by simply letting u 1 = 1 and uii = ξi γi for i [d]. Of course, when we replace (u 1, uii) with (cu 1, c 1uii) for any constant c = 0, we can also achieve this global minimum. C.3 PL Inequality and global convergence Finally, to end the proof, we prove a Polyak Lojasiewicz Inequality on the loss function ℓrdm, and then prove global convergence. Before that, let s first prove the balanced condition of parameters will hold during the whole trajectory. Lemma 25 (Balanced condition) Under the assumptions of Lemma 21, for any t 0, it holds that u2 1 = tr h U11(U11) i . (57) Proof The proof is similar to the proof of Lemma 17. From Lemma 10, we multiply the first equation in (49) by (U11) from the right to get d dt U11(t) (U11) = u2 1E h ΓτΛτU11Λτ(U11) i + u 1E h Λ2 τ(U11) i . Trained Transformers Learn Linear Models In-Context Also we multiply the second equation in Lemma 49 by u 1 to obtain d dtu 1(t) u 1(t) = u2 1 tr E h ΓτΛτU11Λτ(U11) i + u 1 tr E Λ2 τ (U11) , Therefore, we have dt U11(t) (U11(t)) = d dtu 1(t) u 1(t). Taking the transpose of the equation above and adding to itself gives d dt tr h U11(t)(U11(t)) i = d dt u 1(t)2 . Notice that from Assumption 3, we know that u 1(0)2 = σ2 = σ2 tr h ΘΘ ΘΘ i = tr h U11(0)(U11(0)) i . So for any time t 0, the equation holds. Next, similar to the proof of Theorem 4, we prove that, as long as the initial scale is small enough, u 1 will be positive along the whole trajectory and can be lower bounded by a positive constant, which implies that the trajectories will be away from the saddle point at the origin. Lemma 26 We do gradient flow on ℓrdm with respect to ui,j ( i, j [d]) and u 1. Suppose the initialization satisfies Assumption 3 with initial scale v u u t 2 EΛτΘ 2 F d h E Γτ op Λτ 2 F i, (58) then for any t 0, it holds that u 1(t) > 0. (59) Proof From the dynamics of gradient flow, we know the loss function ℓrdm is non-increasing: Since we assume U11(0) = ΘΘ , we know the loss function at t = 0 is ℓrdm(U11(0), u 1(0)) = E tr σ4 2 ΓτΛτΘΘ ΛτΘΘ σ2Λ2 τΘΘ . From the property of trace, we know E tr h σ2Λ2 τΘΘ i = σ2 EΛτΘ 2 F . Zhang, Frei, Bartlett From Von-Neumann s trace inequality and the assumption that ΘΘ F = 1, we know 2 ΓτΛτΘΘ ΛτΘΘ d 2 E Γτ op ΛτΘΘ ΛτΘΘ F h E Γτ op Λτ 2 F i = σ4 h E Γτ op Λτ 2 F i . From the assumptions on Θ and Λτ we know EΛτΘ = 0d d and E Γτ op Λτ 2 F > 0. Therefore, comparing the two displays above, we know when (58) holds, we must have ℓrdm(0) < 0. So from the non-increasing property of the loss function, we know ℓrdm(t) < 0 for any time t 0. Notice that when u 1 = 0, the loss function is also zero, which suggests that u 1(t) = 0 for any time t 0. Since u 1(0) > 0 and the trajectory of u 1 must be continuous, we know that it stays positive at all times. Lemma 27 We do gradient flow on ℓrdm with respect to ui,j ( i, j [d]) and u 1. Suppose the initialization satisfies Assumption 3 and the initial scale satisfies (58). Then, for any t 0, it holds that h 2 EΛτΘ 2 F dσ2 h E Γτ op Λτ 2 F ii > 0. (60) Proof From the dynamics of gradient flow, we know ℓrdm is non-increasing (see the proof of Lemma 26). Recall the definition of the loss function: ℓrdm(U11, u 1) = E tr 1 2u2 1ΓτΛτU11Λτ(U11) u 1Λ2 τ(U11) . Since Λτ commutes with Γτ and they are both positive definite almost surely, we know that ΓτΛτ 0d d almost surely from Lemma 29. Again, since U11Λτ(U11) 0d d almost surely, from Lemma 29 we have tr 1 2u2 1ΓτΛτU11Λτ(U11) 0 almost surely. Therefore, we have ℓrdm(U11, u 1) E tr h u 1Λ2 τ(U11) i = tr h u 1 EΛ2 τ (U11) i . From Von Neumann s trace inequality (Lemma 31) and the fact that u 1(t) > 0 for any t 0 (Lemma 26), we know ℓrdm(U11(t), u 1(t)) du 1 EΛ2 τ op U11 F . From Lemma 25, we know u2 1 = tr(U11(U11) ) = U11 2 F . Since u 1(t) > 0 for any time, we know actually u 1(t) = U11(t) F . So we have ℓrdm(U11(t), u 1(t)) du 1(t)2 EΛ2 τ op . Trained Transformers Learn Linear Models In-Context From the proof of Lemma 26, we know ℓrdm(U11(t), u 1(t)) ℓrdm(U11(0), u 1(0)) σ4 h E Γτ op Λτ 2 F i σ2 EΛτΘ 2 F . Combine the two preceding displays above, we have h 2 EΛτΘ 2 F dσ2 h E Γτ op Λτ 2 F ii > 0. The last inequality comes from Lemma 26. Finally, we prove the PL Inequality, which naturally leads to the global convergence. Lemma 28 We do gradient flow on ℓrdm with respect to ui,j ( i, j [d]) and u 1. Suppose the initialization satisfies Assumption 3 and the initial scale satisfies (58). If we denote η = min {γi, i [d]; ζij, i = j [d]} h 2 EΛτΘ 2 F dσ2 h E Γτ op Λτ 2 F ii > 0, (61) then for any t 0, it holds that ℓrdm(U11, u 1) 2 2 := 2 ν (ℓrdm min ℓrdm) . (62) Additionally, ℓrdm converges to the global minimal value, uij and u 1 converge to the following limits, lim t uij(t) = I(i = j) γi i [d], lim t u 1(t) = Translating back to the original parameterization, we have this is equivalent to lim t W KQ(t) = EΓτΛ2 τ 1 E Λ2 τ 1 F EΓτΛ2 τ 1 E Λ2 τ 0d lim t W PV (t) = 0 d EΓτΛ2 τ 1 E Λ2 τ where Γτ = N+1 N tr(Λτ)Id Rd d and E is over Λτ. Zhang, Frei, Bartlett Proof First, we prove the PL Inequality. From Lemma 24, we know ℓrdm(U11, u 1) min ℓrdm = 1 i =j ζiju2 1u2 ij, where ξi, ζij, γi are defined in (51). Meanwhile, we calculate the square norm of the gradient of ℓrdm: ℓrdm(U11, u 1) 2 2 := i=1 γ2 i u2 1 i =j ζ2 iju4 1u2 ij. Comparing the two displays above, we know that in order to ensure that ℓrdm 2 2 ν (ℓrdm min ℓrdm) , it suffices to make γiu 1(t)2 ν ζiju 1(t)2 ν 2 i = j [d]. We define η := min {γi, ζij, i = j [d]} , then it is sufficient to make From Lemma 27, we know that we can actually lower bound u 1 from below by a positive constant. Then, the inequality holds if we take h 2 EΛτΘ 2 F dσ2 h E Γτ op Λτ 2 F ii > 0. Therefore, as long as we take ν as above, a PL inequality holds for ℓrdm. With an abuse of notation, let us write ℓrdm(t) = ℓrdm(U11(t), u 1(t)). Then, from the dynamics of gradient flow and the PL Inequality ((62)), we know d dt [ℓrdm(t) min ℓrdm] = ℓrdm(t) 2 2 ν (ℓrdm(t) min ℓrdm) , which by Gr onwall s inequality implies 0 ℓrdm(t) min ℓrdm exp( νt) [ℓrdm(0) min ℓrdm] 0 when t . From Lemma 24, we know i =j ζiju2 1u2 ij 0 when t . Trained Transformers Learn Linear Models In-Context This implies uiju 1 0 i = j [d]. (64) We take square of uii(t)u 1(t) and uij(t)u 1(t), then sum over all i, j [d]. Then, we get u2 1 Pd i,j=1 u2 ij Pd i=1 ξ2 i γ2 i . From Lemma 25, we know for any t 0, u 1(t)2 = tr U11(U11) = Pd i,j=1 u2 ij. So we have u 1(t)4 = u2 1 i,j=1 u2 ij ξ2 i γ2 i , which implies when t . Combining (64) and (65), we conclude uij(t) 0 i = j [d], uii(t) Appendix D. Technical lemmas Lemma 29 (Petersen and Pedersen, 2008) We denote A, B, X as matrices and x as vectors. Then, we have x = B + B x. Vec(AXB) = B A Vec(X). tr A B = Vec(A) Vec(B). X tr XBX = XB + XB. X tr AX = A. X tr AXBX C = A C XB + CAXB. Lemma 30 If X is Gaussian random vector of d dimension, mean zero and covariance matrix Λ, and A Rd d is a fixed matrix. Then E h XX AXX i = Λ A + A Λ + tr(AΛ)Λ. Zhang, Frei, Bartlett Proof We denote X = (X1, ..., Xd) . Then, XX AXX = X(X AX)X = i,j=1 Aij Xi Xj So we know (XX AXX )k,l = Pd i,j=1 Aij Xi Xj Xk Xl. From Isserlis Theorem in probability theory (Theorem 1.1 in Michalowicz et al. (2009), originally proposed in Wick (1950)), we know for any i, j, k, l [d], it holds that E Xi Xj Xk Xl = ΛijΛkl + ΛikΛjl + ΛilΛjk. Then, we have for any fixed k, l [d], E(XX AXX )k,l = i,j=1 AijΛijΛkl + AijΛikΛjl + AijΛilΛjk = tr(AΛ)Λkl + Λ k (A + A )Λl. Therefore, we know E(XX AXX ) = Λ A + A Λ + tr(AΛ)Λ. Lemma 31 (Von-Neumann s Trace Inequality) Let U, V Rd n with d n. We have i=1 σi(U)σi(V ) U op where σ1(X) σ2(X) σd(X) are the ordered singular values of X Rd n. Lemma 32 ((Meenakshi and Rajian, 1999)) For any two positive semi-definitive matrices A, B Rd d, we have AB 0 if and only if A and B commute. Trained Transformers Learn Linear Models In-Context Appendix E. Experiment details In this section, we provide more details for the experiment in Figure 1. Our experimental setup is based on the codebase provided by Garg et al. (2022), with a modification that allows for the possibility that the covariate distribution changes across prompts. We use the standard GPT2 architecture with embedding size 256, 12 layers and 8 heads (Radford et al., 2018) as implemented by Hugging Face (Wolf et al., 2020). For the GPT2 models, we use the embedding method proposed by Garg et al. (2022), where instead of concatenating x and y into a single token, they are treated as separate tokens. It is also worth noting that the training objective function for the GPT2 model is different than those we consider for the linear self-attention network: for the GPT2 model, the objective function is the average over the full length of the context sequence (predictions for each xi using (xk, yk)k