# trained_transformers_learn_linear_models_incontext__0a2c0bfe.pdf

Journal of Machine Learning Research 25 (2024) 1-55 Submitted 8/23; Published 1/24

Trained Transformers Learn Linear Models In-Context

Ruiqi Zhang rqzhang@berkeley.edu Department of Statistics University of California, Berkeley 367 Evans Hall, Berkeley, CA 94720-3860, USA

Spencer Frei sfrei@ucdavis.edu Department of Statistics University of California, Davis 4118 Mathematical Sciences Building 399 Crocker Ave., Davis, CA 95616, USA

Peter L. Bartlett peter@berkeley.edu Department of Statistics and Department of Electrical Engineering and Computer Sciences University of California, Berkeley 367 Evans Hall, Berkeley, CA 94720-3860, USA Google Deep Mind 1600 Amphitheatre Parkway Mountain View, CA 94040, USA

Editor: Daniel Hsu

Attention-based neural networks such as transformers have demonstrated a remarkable ability to exhibit in-context learning (ICL): Given a short prompt sequence of tokens from an unseen task, they can formulate relevant per-token and next-token predictions without any parameter updates. By embedding a sequence of labeled training data and unlabeled test data as a prompt, this allows for transformers to behave like supervised learning algorithms. Indeed, recent work has shown that when training transformer architectures over random instances of linear regression problems, these models predictions mimic those of ordinary least squares.

Towards understanding the mechanisms underlying this phenomenon, we investigate the dynamics of ICL in transformers with a single linear self-attention layer trained by gradient ﬂow on linear regression tasks. We show that despite non-convexity, gradient ﬂow with a suitable random initialization ﬁnds a global minimum of the objective function. At this global minimum, when given a test prompt of labeled examples from a new prediction task, the transformer achieves prediction error competitive with the best linear predictor over the test prompt distribution. We additionally characterize the robustness of the trained transformer to a variety of distribution shifts and show that although a number of shifts are tolerated, shifts in the covariate distribution of the prompts are not. Motivated by this, we consider a generalized ICL setting where the covariate distributions can vary across prompts. We show that although gradient ﬂow succeeds at ﬁnding a global minimum in this setting, the trained transformer is still brittle under mild covariate shifts. We complement this ﬁnding with experiments on large, nonlinear transformer architectures which we show are more robust under covariate shifts.

c 2024 Ruiqi Zhang, Spencer Frei and Peter L. Bartlett.

License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at http://jmlr.org/papers/v25/23-1042.html.

Zhang, Frei, Bartlett

Keywords: in-context learning, transformers, neural networks, self-attention, generalization

1. Introduction

Transformer-based neural networks have quickly become the default machine learning model for problems in natural language processing, forming the basis of chatbots like Chat GPT (Open AI, 2023), and are increasingly popular in computer vision (Dosovitskiy et al., 2021). These models can take as input sequences of tokens and return relevant next-token predictions. When trained on suﬃciently large and diverse datasets, these models are often able to perform in-context learning (ICL): when given a short sequence of input-output pairs (called a prompt) from a particular task as input, the model can formulate predictions on test examples without having to make any updates to the parameters in the model. Recently, Garg et al. (2022) initiated the investigation of ICL from the perspective of learning particular function classes. At a high-level, this refers to when the model has access to instances of prompts of the form (x1, h(x1), . . . , x N, h(x N), xquery) where xi, xquery are sampled i.i.d. from a distribution Dx and h is sampled independently from a distribution over functions in a function class H. The transformer succeeds at in-context learning if when given a new prompt (x 1, h (x 1), . . . , x N, h (x N), x query) corresponding to an independently sampled h it is able to formulate a prediction for x query that is close to h (x query) given a suﬃciently large number of examples N. The authors showed that when transformer models are trained on prompts corresponding to instances of training data from a particular function class (e.g., linear models, neural networks, or decision trees), they succeed at incontext learning, and moreover the behavior of the trained transformers can mimic those of familiar learning algorithms like ordinary least squares. Following this, a number of follow-up works provided constructions of transformerbased neural network architectures which are capable of achieving small prediction error for query examples when the prompt takes the form (x1, w, x1 , . . . , x N, w, x N , xquery)

where xi, xquery, w i.i.d. N(0, Id) (von Oswald et al., 2022; Aky urek et al., 2022). However, this leaves open the question of how it is that gradient-based optimization algorithms over transformer architectures produce models which are capable of in-context learning.1

In this work, we investigate the learning dynamics of gradient ﬂow in a simpliﬁed transformer architecture when the training prompts consists of random instances of linear regression datasets. Our main contributions are as follows.

We establish that for a class of transformers with a single layer and with a linear selfattention module (LSAs), gradient ﬂow on the population loss with a suitable random initialization converges to a global minimum of the population objective, despite the non-convexity of the underlying objective function.

We characterize the learning algorithm that is encoded by the transformer at convergence, as well as the prediction error achieved when the model is given a test prompt corresponding to a new (and possibly nonlinear) prediction task.

1. We note a concurrent work also explores the optimization question we consider here (Ahn et al., 2023); we shall provide a more detailed comparison to this work in Section 2.

Trained Transformers Learn Linear Models In-Context

We use this to conclude that transformers trained by gradient ﬂow indeed in-context learn the class of linear models. Moreover, we characterize the robustness of the trained transformer to a variety of distribution shifts. We show that although a number of shifts can be tolerated, shifts in the covariate distribution of the features xi cannot.

Motivated by this failure under covariate shift, we consider a generalized setting of incontext learning where the covariate distribution can vary across prompts. We provide global convergence guarantees for LSAs trained by gradient ﬂow in this setting and show that even when trained on a variety of covariate distributions, LSAs still fail under covariate shift.

We then empirically investigate the behavior of large, nonlinear transformers when trained on linear regression prompts. We ﬁnd that these more complex models are able to generalize better under covariate shift, especially when trained on prompts with varying covariate distributions.

2. Additional Related Work

The literature on transformers and non-convex optimization in machine learning is vast. In this section, we will focus on those works most closely related to theoretical understanding of in-context learning of function classes. As mentioned previously, Garg et al. (2022) empirically investigated the ability for transformer architectures to in-context learn a variety of function classes. They showed that when trained on random instances of linear regression, the models predictions are very similar to those of ordinary least squares. Additionally, they showed that transformers can in-context learn two-layer Re LU networks and decision trees, showing that by training on diﬀerently-structured data, the transformers learn to implement distinct learning algorithms. A number of works further investigated the types of algorithms implemented by transformers trained on in-context examples of linear models (Ahuja et al., 2023; Ahuja and Lopez-Paz, 2023). Aky urek et al. (2022) and von Oswald et al. (2022) examined the behavior of transformers when trained on random instances of linear regression, as we do in this work. They considered the setting of isotropic Gaussian data with isotropic Gaussian weight vectors, and showed that the trained transformer s predictions mimic those of a single step of gradient descent. They also provided a construction of transformers which implement this single step of gradient descent. By contrast, we explicitly show that gradient ﬂow provably converges to transformers which learn linear models in-context. Moreover, our analysis holds when the covariates are anisotropic Gaussians, for which a single step of vanilla gradient descent is unable to achieve small prediction error.2

Let us brieﬂy mention a number of other works on understanding in-context learning in transformers and other sequence-based models. Han et al. (2023) suggests that Bayesian inference on prompts can be asymptotically interpreted as kernel regression. Dai et al. (2022)

2. To see this, suppose (xi, yi) are i.i.d. with x N(0, Λ) and y = w, x . A single step of gradient descent under the squared loss from a zero initialization yields the predictor x 7 x 1

n Pn i=1 yixi = x 1

n Pn i=1 xix i w x Λw. Clearly, this can diﬀer from x w when Λ = Id.

Zhang, Frei, Bartlett

interprets ICL as implicit ﬁne-tuning, viewing large language models as meta-optimizers performing gradient-based optimization. Xie et al. (2021) regards ICL as implicit Bayesian inference, with transformers learning a shared latent concept between prompts and test data, and they prove the ICL property when the training distribution is a mixture of HMMs. Similarly, Wang et al. (2023) perceives ICL as a Bayesian selection process, implicitly inferring information pertinent to the designated tasks. Li et al. (2023a) explores the functional resemblance between a single layer of self-attention and gradient descent on a softmax regression problem, oﬀering upper bounds on their diﬀerence. Min et al. (2022) notes that the alteration of label parts in prompts does not drastically impair the ICL ability. They contend that ICL is invoked when prompts reveal information about the label space, input distribution, and sequence structure.

Another collection of works have sought to understand transformers from an approximation theoretic perspective. Yun et al. (2019, 2020) established that transformers can universally approximate any sequence-to-sequence function under some assumptions. Investigations by Edelman et al. (2022); Likhosherstov et al. (2021) indicate that a single-layer self-attention can learn sparse functions of the input sequence, where sample complexity and hidden size are only logarithmic relative to the sequence length. Further studies by P erez et al. (2019); Dehghani et al. (2019); Bhattamishra et al. (2020) indicate that the vanilla transformer and its variants exhibit Turing completeness. Liu et al. (2023) showed that transformers can approximate ﬁnite-state automata with few layers. Bai et al. (2023) showed that transformers can implement a variety of statistical machine learning algorithms as well as model selection procedures. Abernethy et al. (2023) showed that a pretrained transformer can be used to deﬁne a transformer that segments a prompt into examples and labels and learns to solve a sparse retrieval task. Zhang et al. (2023) interpreted in-context learning via a Bayesian model averaging process.

A handful of recent works have developed provable guarantees for transformers trained with gradient-based optimization. Jelassi et al. (2022) analyzed the dynamics of gradient descent in vision transformers for data with spatial structure. Li et al. (2023c) demonstrated that a single-layer transformer trained by a gradient method could learn a topic model, treating learning semantic structure as detecting co-occurrence between words and theoretically analyzing the two-stage dynamics during the training process.

Finally, we note a concurrent work by Ahn et al. (2023) on the optimization landscape of single layer transformers with linear self-attention layers. They show that there exist global minima of the population objective of the transformer that can achieve small prediction error with anisotropic Gaussian data, and they characterize some critical points of deep linear self-attention networks. In this work, we show that despite nonconvexity, gradient ﬂow with a suitable random initialization converges to a global minimum that achieves small prediction error for anistropic Gaussian data. We also characterize the prediction error when test prompts come from a new (possibly nonlinear) task, when there is distribution shift, and when transformers are trained on prompts with possibly diﬀerent covariate distributions across prompts.

Trained Transformers Learn Linear Models In-Context

3. Preliminaries

Notation We ﬁrst describe the notation we use in the paper. We write [n] = {1, 2, ..., n}. We use to denote the Kronecker product, and Vec the vectorization operator in columnwise order. For example, Vec 1 2 3 4

= (1, 3, 2, 4) . We write the inner product of two

matrices A, B Rm n as A, B = tr(AB ). We use 0n and 0m n to denote the zero vector and zero matrix of size n and m n, respectively. For a general matrix A, Ak: and A:k denote the k-th row and k-th column, respectively. We denote the matrix operator norm and Frobenius norm as op and F . We use Id to denote the d-dimensional identity matrix and sometimes we also use I when the dimension is clear from the context. For a positive semi-deﬁnite matrix A, we write x 2 A := x Ax. Unless otherwise deﬁned, we use lower case letters for scalars and vectors, and use upper case letters for matrices.

3.1 In-context learning

We begin by describing a framework for in-context learning of function classes, as initiated by Garg et al. (2022). In-context learning refers to the behavior of models that operate on sequences, called prompts, of input-output pairs (x1, y1, . . . , x N, y N, xquery), where yi = h(xi) for some (unknown) function h and examples xi and query xquery. The goal for an incontext learner is to use the prompt to form a prediction by(xquery) for the query such that by(xquery) h(xquery). From this high-level description, one can see that at a surface level, the behavior of in-context learning is no diﬀerent than that of a standard learning algorithm: the learner takes as input a training dataset and returns predictions on test examples. For instance, one can view ordinary least squares as an in-context learner for linear models. However, the rather unique feature of in-context learners is that these learning algorithms can be the solutions to stochastic optimization problems deﬁned over a distribution of prompts. We formalize this notion in the following deﬁnition.

Deﬁnition 1 (Trained on in-context examples) Let Dx be a distribution over an input space X, H YX a set of functions X Y, and DH a distribution over functions in H. Let ℓ: Y Y R be a loss function. Let S = n N{(x1, y1, . . . , xn, yn) : xi X, yi Y} be the set of ﬁnite-length sequences of (x, y) pairs and let

FΘ = {fθ : S X Y, θ Θ}

be a class of functions parameterized by θ in some set Θ. For N > 0, we say that a model f : S X Y is trained on in-context examples of functions in H under loss ℓw.r.t. (DH, Dx) if f = fθ where θ Θ satisﬁes

θ argminθ ΘEP=(x1,h(x1),...,x N,h(x N),xquery) [ℓ(fθ(P), h(xquery))] , (1)

where xi, xquery i.i.d. Dx and h DH are independent. We call N the length of the prompts seen during training.

As mentioned above, this deﬁnition naturally leads to a method for learning a learning algorithm from data: Sample independent prompts by sampling a random function h DH

Zhang, Frei, Bartlett

and feature vectors xi, xquery i.i.d. Dx, and then minimize the objective function appearing in (1) using stochastic gradient descent or other stochastic optimization algorithms. This procedure returns a model that is learned from in-context examples and can form predictions for test (query) examples given a sequence of training data. This leads to the following natural deﬁnition that quantiﬁes how well such a model performs on in-context examples corresponding to a particular hypothesis class.

Deﬁnition 2 (In-context learning of a hypothesis class) Let Dx be a distribution on an input space X, H YX a class of functions X Y, and DH a distribution on functions in H. Let ℓ: Y Y R be a loss function. Let S = n N{(x1, y1, . . . , xn, yn) : xi X, yi Y} be the set of ﬁnite-length sequences of (x, y) pairs. We say that a model f : S X Y deﬁned on prompts of the form P = (x1, h(x1), . . . , x M, h(x M), xquery) in-context learns a hypothesis class H under loss ℓwith respect to (DH, Dx) up to error η R if there exists a function MDH,Dx(ε) : (0, 1) N such that for every ε (0, 1), and for every prompt P of length M MDH,Dx(ε),

EP=(x1,h(x1),...,x M,h(x M),xquery)

ℓ f(P), h (xquery) η + ε, (2)

where the expectation is over the randomness in xi, xquery i.i.d. Dx and h DH.

The additive error term η in Deﬁnition 2 above allows for the possibility that the model does not achieve arbitrarily small error. This error could come from using a model which is not complex enough to learn functions in H or from considering a non-realizable setting where it is not possible to achieve arbitrarily small error. With these two deﬁnitions in hand, we can formulate the following questions: suppose a function class FΘ is given and DH corresponds to random instances of hypotheses in a hypothesis class H. Can a model from FΘ that is trained on in-context examples of functions in H w.r.t. (DH, Dx) in-context learn the hypothesis class H w.r.t. (DH, Dx) with small prediction error? Do standard gradient-based optimization algorithms suﬃce for training the model from in-context examples? How long must the contexts be during training and at test time to achieve small prediction error? In the remaining sections, we shall answer these questions for the case of one-layer transformers with linear self-attention modules when the hypothesis class is linear models, the loss of interest is the squared loss, and the marginals are (possibly anisotropic) Gaussian marginals.

3.2 Linear self-attention networks

Before describing the particular transformer models we analyze in this work, we ﬁrst recall the deﬁnition of the softmax-based single-head self-attention module (Vaswani et al., 2017). Let E Rde d N be an embedding matrix formed using a prompt (x1, y1, . . . , x N, y N, xquery) of length N. The user has the freedom to determine how this embedding matrix is formed from the prompt. One natural way to form E is to stack (xi, yi) Rd+1 as the ﬁrst N columns of E and to let the ﬁnal column be (xquery, 0) ; if xi Rd, yi R, we would then have de = d + 1 and d N = N + 1. Let W K, W Q Rdk de and W V Rdv de be the key, query, and value weight matrices, W P Rde dv the projection matrix, and ρ > 0

Trained Transformers Learn Linear Models In-Context

a normalization factor. The softmax self-attention module takes as input an embedding matrix E of width d N and outputs a matrix of the same size,

f Attn(E; W K, W Q, W V , W P ) = E + W P W V E softmax (W KE) W QE

where softmax is applied column-wise and, given a vector input of v, the i-th entry of softmax(v) is given by exp(vi)/ P s exp(vs). The d N d N matrix appearing inside the softmax is referred to as the self-attention matrix. Note that f Attn can take as its input a sequence of arbitrary length. In this work, we consider a simpliﬁed version of the single-layer self-attention module, which is more amenable to theoretical analysis and yet is still capable of in-context learning linear models. In particular, we consider a single-layer linear self-attention (LSA) model, which is a modiﬁed version of f Attn where we remove the softmax nonlinearity, merge the projection and value matrices into a single matrix W PV Rde de, and merge the query and key matrices into a single matrix W KQ Rde de. We concatenate these matrices into θ = (W KQ, W PV ) and denote

f LSA(E; θ) = E + W PV E E W KQE

We note that recent theoretical works on understanding transformers looked at identical models (von Oswald et al., 2022; Li et al., 2023b; Ahn et al., 2023). It is noteworthy that recent empirical work has shown that state-of-the-art trained vision transformers with standard softmax-based attention modules are such that (W K) W Q and W P W V are nearly multiples of the identity matrix (Trockman and Kolter, 2023), which can be represented under the parameterization we consider. The user has the ﬂexibility to determine the method for constructing the embedding matrix from a prompt P = (x1, y1, . . . , x N, y N, xquery). In this work, for a prompt of length N, we shall use the following embedding, which stacks (xi, yi) Rd+1 into the ﬁrst N columns with (xquery, 0) Rd+1 as the last column:

E = E(P) = x1 x2 x N xquery y1 y2 y N 0

R(d+1) (N+1). (4)

We take the normalization factor ρ to be the width of embedding matrix E minus one, i.e., ρ = d N 1, since each element in E E is a inner product of two vectors of length d N. Under the above token embedding, we take ρ = N. We note that there are alternative ways to form the embedding matrix with this data, e.g. by padding all inputs and labels into vectors of equal length and arranging them into a matrix (Aky urek et al., 2022), or by stacking columns that are linear transformations of the concatenation (xi, yi) (Garg et al., 2022), although the dynamics of in-context learning will diﬀer under alternative parameterizations. The network s prediction for the token xquery will be the bottom-right entry of matrix output by f LSA, namely,

byquery = byquery(E; θ) = [f LSA(E; θ)](d+1),(N+1).

Here and after, we may occasionally suppress dependence on θ and write byquery(E; θ) as byquery. Since the prediction takes only the right-bottom entry of the token matrix output

Zhang, Frei, Bartlett

by the LSA layer, actually only part of W PV and W KQ aﬀect the prediction. To see how, let us denote

W PV 11 w PV 12

(w PV 21 ) w PV 22

R(d+1) (d+1), W KQ =

W KQ 11 w KQ 12

(w KQ 21 ) w KQ 22

R(d+1) (d+1),

(5) where W PV 11 Rd d; w PV 12 , w PV 21 Rd; w PV 22 R; and W KQ 11 Rd d; w KQ 12 , w KQ 21 Rd; w KQ 22 R. Then, the prediction byquery is

byquery = (w PV 21 ) w PV 22

xquery, (6)

since only the last row of W PV and the ﬁrst d columns of W KQ aﬀects the prediction, which means we can simply take all other entries zero in the following sections.

3.3 Training procedure

In this work, we will consider the task of in-context learning linear predictors. We will assume training prompts are sampled as follows. Let Λ be a positive deﬁnite covariance matrix. Each training prompt, indexed by τ N, takes the form

Pτ = (xτ,1, hτ(xτ1), . . . , xτ,N, hτ(xτ,N), xτ,query),

where task weights wτ i.i.d. N(0, Id), inputs xτ,i, xτ,query i.i.d. N(0, Λ), and labels hτ(x) = wτ, x . Each prompt corresponds to an embedding matrix Eτ, formed using the transformation (4):

Eτ := xτ,1 xτ,2 xτ,N xτ,query wτ, xτ,1 wτ, xτ,2 wτ, xτ,N 0

R(d+1) (N+1).

We denote the prediction of the LSA model on the query label in the task τ as byτ,query, which is the bottom-right element of f LSA(Eτ), where f LSA is the linear self-attention model deﬁned in (3). The empirical risk over B independent prompts is deﬁned as

byτ,query wτ, xτ,query 2 . (7)

We shall consider the behavior of gradient ﬂow-trained networks over the population loss induced by the limit of inﬁnite training tasks/prompts B :

L(θ) = lim B b L(θ) = 1

2Ewτ,xτ,1, ,xτ,N,xτ,query (byτ,query wτ, xτ,query )2 (8)

Above, the expectation is taken w.r.t. the covariates {xτ,i}N i=1 {xquery} in the prompt

and the weight vector wτ, i.e. over xτ,i, xquery i.i.d. N(0, Λ) and wτ N(0, Id). Gradient

Trained Transformers Learn Linear Models In-Context

ﬂow captures the behavior of gradient descent with inﬁnitesimal step size and has dynamics given by the following diﬀerential equation:

d dtθ = L(θ). (9)

We will consider gradient ﬂow with an initialization that satisﬁes the following.

Assumption 3 (Initialization) Let σ > 0 be a parameter, and let Θ Rd d be any matrix satisfying ΘΘ F = 1 and ΘΛ = 0d d. We assume

W PV (0) = σ 0d d 0d 0 d 1

, W KQ(0) = σ ΘΘ 0d 0 d 0

This initialization is satisﬁed for a particular class of random initialization schemes: if M has i.i.d. entries from a continuous distribution, then by setting ΘΘ = MM / MM F , the assumption is satisﬁed almost surely. The reason we use this particular initialization scheme will be made more clear in Section 5 when we describe the proof, but at a highlevel this is due to the fact that the predictions (6) can be viewed as the output of a two-layer linear network, and initializations satisfying Assumption 3 allow for the layers to be balanced throughout the gradient ﬂow trajectory. Random initializations that induce this balancedness condition have been utilized in a number of theoretical works on deep linear networks (Du et al., 2018; Arora et al., 2018, 2019; Azulay et al., 2021). We leave the question of convergence under alternative random initialization schemes for future work.

4. Main results

In this section, we present the main results of this paper. First, in Section 4.1, we prove the gradient ﬂow on the population loss will converge to a speciﬁc global optimum. We characterize the prediction error of the trained transformer at this global minimum when given a prompt from a new prediction task. Our characterization allows for the possibility that this new prompt comes from a nonlinear prediction task. We then instantiate our results for well-speciﬁed linear regression prompts and characterize the number of samples needed to achieve small prediction error, showing that transformers can in-context learn linear models when trained on in-context examples of linear models. Next, in Section 4.2, we analyze the behavior of the trained transformer under a variety of distribution shifts. We show the transformer is robust to a number of distribution shifts, including task shift (when the labels in the prompt are not deterministic linear functions of their input) and query shift (when the query example xquery has a possibly diﬀerent distribution than the test prompt). On the other hand, we show that the transformer suﬀers from covariate distribution shifts, i.e. when the training prompt covariate distribution diﬀers from the test prompt covariate distribution. Finally, motivated by the failure of the trained transformer under covariate distribution shift, we consider in Section 4.3 the setting of training on in-context examples with varying covariate distributions across prompts. We prove that transformers with a single linear selfattention layer trained by gradient ﬂow converge to a global minimum of the population objective, but that the trained transformer still fails to perform well on new prompts. We complement our proof in the linear self-attention case with experiments on large, nonlinear transformer architectures which we show are more robust under covariate shifts.

Zhang, Frei, Bartlett

4.1 Convergence of gradient ﬂow and prediction error for new tasks

First, we prove that under suitable initialization, gradient ﬂow will converge to a global optimum.

Theorem 4 (Convergence and limits) Consider gradient ﬂow of a linear self-attention network f LSA deﬁned in (3) over the population loss (8). Suppose the initialization satisﬁes Assumption 3 with initialization scale σ > 0 satisfying σ2 Γ op

d < 2 where we have deﬁned

N tr(Λ)Id Rd d.

Then gradient ﬂow converges to a global minimum of the population loss (8). Moreover, W PV and W KQ converge to W PV and W KQ respectively, where

W KQ = tr Γ 2 1

, W PV = tr Γ 2 1

The full proof of this theorem appears in Appendix A. We note that if we restrict our setting to Λ = Id, then the limiting solution described found by gradient ﬂow is quite similar to the construction of von Oswald et al. (2022). Since the prediction of the transformer is the same if we multiply W PV by a constant c = 0 and simultaneously multiply W KQ by c 1, the only diﬀerence (up to scaling) is that the top-left entry of their W KQ matrix is Id rather than the (1 + (d + 1)/N) 1Id that we ﬁnd for the case Λ = Id. Next, we would like to characterize the prediction error of the trained network described above when the network is given a new prompt. Let us consider a prompt of the form (x1, w, x1 , . . . , x M, w, x M , xquery) where w Rd and xi, xquery i.i.d. N(0, Λ). A simple calculation shows that the prediction byquery at the global optimum with parameters W KQ and W PV is given by

byquery = 0 d 1

i=1 xix i + 1

M xqueryx query 1 M

i=1 xix i w

i=1 w xix i 1 M

i=1 w xix i w

= x queryΓ 1 1 M

When the length of prompts seen during training N is large, Γ 1 Λ 1, and when the test prompt length M is large, 1 M PM i=1 xix i Λ, so that byquery x queryw. Thus, for suﬃciently large prompt lengths, the trained transformer indeed in-context learns the class of linear predictors. In fact, we can generalize the above calculation for test prompts which could take a signiﬁcantly diﬀerent form than the training prompts. Consider prompts that are of the form (x1, y1, . . . , xn, yn, xquery) where, for some joint distribution D over (x, y) pairs with marginal

Trained Transformers Learn Linear Models In-Context

distribution x N(0, Λ), we have (xi, yi) i.i.d. D and xquery N(0, Λ) independently. Note that this allows for a label yi to be a nonlinear function of the input xi. The prediction of the trained transformer for this prompt is then

byquery = 0 d 1

1 M PM i=1 xix i + 1

M xqueryx query 1 M PM i=1 xiyi

1 M PM i=1 x i yi 1 M PM i=1 y2 i

= x queryΓ 1 1 M

Just as before, when N is large we have Γ 1 Λ 1, and so when M is large as well this implies

byquery x queryΛ 1E(x,y) D[yx] = x query

argmin w Rd E(x,y) D[(y w, x )2]

This suggests that trained transformers in-context learn the best linear predictor over a distribution when the test prompt consists of i.i.d. samples from a joint distribution over feature-response pairs. In the following theorem, we formalize the above and characterize the prediction error when prompts take this form.

Theorem 5 Let D be a distribution over (x, y) Rd R, whose marginal distribution on x is Dx = N(0, Λ). Assume ED[y], ED[xy], ED[y2xx ] exist and are ﬁnite. Assume the test

prompt is of the form P = (x1, y1, . . . , x M, y M, xquery), where (xi, yi), (xquery, yquery) i.i.d. D. Let f LSA be the LSA model with parameters W PV and W KQ in (11), and byquery is the prediction for xquery given the prompt. If we deﬁne

a := Λ 1E(x,y) D [xy] , Σ := E(x,y) D h xy E (xy) xy E (xy) i , (15)

then, for Γ = Λ + 1

N tr(Λ)Id. we have,

E (byquery yquery)2 = min w Rd E ( w, xquery yquery)2

| {z } Error of best linear predictor

M tr ΣΓ 2Λ + 1

h a 2 Γ 2Λ3 + 2 tr(Λ) a 2 Γ 2Λ2 + tr(Λ)2 a 2 Γ 2Λ i ,

where the expectation is over (xi, yi), (xquery, yquery) i.i.d. D.

The full proof is deferred to Appendix B. Let us now make a few remarks on the above theorem before considering particular instances of D where we may provide more explicit bounds on the prediction error. First, this theorem shows that, provided the length of prompts seen during training (N) and the length of the test prompt (M) is large enough, a transformer trained by

Zhang, Frei, Bartlett

gradient ﬂow from in-context examples achieves prediction error competitive with the best linear model. Next, our bound shows that the length of prompts seen during training and the length of prompts seen at test-time have diﬀerent eﬀects on the expected prediction error: ignoring dimension and covariance-dependent factors, the prediction error is at most O(1/M + 1/N2), decreasing more rapidly as a function of the training prompt length N compared to the test prompt length M. Let us now consider when D corresponds to noiseless linear models, so that for some w Rd, we have (x, y) = (x, w, x ), in which case the prediction of the trained transformer is given by (12). Moreover, a simple calculation shows that the Σ from Theorem 5 takes the form Σ = w 2 ΛΛ + Λww Λ. Hence Theorem 5 implies the prediction error for the prompt P = (x1, w, x1 , . . . , x M, w, x M , xquery) is

Ex1,...,x M,xquery (byquery w, xquery )2

n w 2 Γ 2Λ3 + tr(Γ 2Λ2) w 2 Λ o

n w 2 Γ 2Λ3 + 2 w 2 Γ 2Λ2 tr(Λ) + w 2 Γ 2Λ tr(Λ)2o

M w 2 Λ + 1

h w 2 Λ + 2 w 2 2 tr(Λ) + w 2 Λ 1 tr(Λ)2i .

The inequality above uses that Γ Λ. Finally, if we assume that w N(0, Id) and denote κ as the condition number of Λ, then by taking expectations over w we get the following:

Ex1,...,x M,xquery,w (byquery w, xquery )2

(d + 1) tr(Λ)

N2 tr(Λ) + 2d tr(Λ) + tr(Λ 1) tr(Λ)2

(d + 1) tr(Λ)

M + (1 + 2d + d2κ) tr(Λ)

From the upper bound above, we can see the rate w.r.t M and N are still at most O(1/M) and O(1/N2) respectively. Moreover, the generalization risk also scales with dimension d, tr(Λ) and the condition number κ. This suggests that for in-context examples involving covariates of greater variance, or a more ill-conditioned covariance matrix, the generalization risk will be higher for the same lengths of training and testing prompts. Putting the above together with Theorem 5, Deﬁnition 1 and Deﬁnition 2, we get the following corollary.

Corollary 6 The transformer f LSA trained on length-N prompts of in-context examples of functions in {x 7 w, x } w.r.t. w N(0, Id) and Dx = N(0, Λ) by gradient ﬂow on the population loss (8) for initializations satisfying Assumption 3 converges to the model f LSA( ; W KQ , W PV ). This model takes a prompt P = (x1, y1, . . . , x M, y M, xquery) and returns a prediction byquery for xquery given by

byquery = [f LSA(P; W KQ , W PV )]d+1,M+1 = x query

N Λ + tr(Λ)

This model in-context learns the class of linear models {x 7 w, x } with respect to w N(0, Id) and Dx = N(0, Λ) up to error η := (1+2d+d2κ) tr(Λ)/N2 (where κ is the condition

Trained Transformers Learn Linear Models In-Context

number of Λ): provided M (d + 1) tr(Λ)ε 1, the model achieves prediction error at most η + ε.

It is worth emphasizing that the transformer f LSA( ; W KQ , W PV ) only learns the function class up to error η = O(1/N2) in the sense of Deﬁnition 2. In particular, training on ﬁnite-length prompts leads to prediction error bounded away from zero.

4.2 Behavior of trained transformer under distribution shifts

Using the identity (13), it is straightforward to characterize the behavior of the trained transformer under a variety of distribution shifts. In this section, we shall examine a number of shifts that were ﬁrst explored empirically for transformer architectures by Garg et al. (2022). Although their experiments were for transformers trained by gradient descent, we ﬁnd that (in the case of linear models) many of the behaviors of the trained transformers under distribution shift are identical to those predicted by our theoretical characterizations of the performance of transformers with a single linear self-attention layer trained by gradient ﬂow on the population. Following Garg et al. (2022), for prompts of the form (x1, h(x1), . . . , x N, h(x N), xquery),

let us assume for training prompts that xi, xquery i.i.d. Dtrain x and h Dtrain H , while for test

prompts xi i.i.d. Dtest x , xquery Dtest query, and h Dtest H . We will consider the following distinct categories of shifts:

Task shifts: Dtrain H = Dtest H .

Query shifts: Dtest query = Dtest x .

Covariate shifts: Dtrain x = Dtest x .

In the following, we shall ﬁx Dtrain x = N(0, Λ) and vary the other distributions. Recall from (13) that the prediction for a test prompt (x1, y1, . . . , x N, y N, xquery) is given by (for N large),

byquery = x queryΓ 1 1 M

x queryΛ 1 1 M

Task shifts. These shifts are tolerated easily by the trained transformer. As Theorem 5 shows, the trained transformer is competitive with the best linear model provided the prompt length during training and at test time is large enough. In particular, even if the prompt is such that the labels yi are not given by w, xi for some w N(0, Id), the trained transformer will compute a prediction which has error competitive with the best linear model that ﬁts the test prompt. For example, consider a prompt corresponding to a noisy linear model, so that the prompt consists of a sequence of (xi, yi) pairs where yi = w, xi + εi for some arbitrary vector w Rd and independent sub-Gaussian noise εi. Then from (17), the prediction of the transformer on query examples is

byquery x queryΛ 1 1 M

= x queryΛ 1 1 M

w+x queryΛ 1 1 M

Zhang, Frei, Bartlett

Since εi is mean zero and independent of xi, this is approximately x queryw when M is large. And note that this calculation holds for an arbitrary vector w, not just those which are sampled from an isotropic Gaussian or those with a particular norm. This behavior coincides with that of the trained transformers observed by Garg et al. (2022).

Query shifts. Continuing from (17), since yi = w, xi ,

byquery x queryΛ 1 1 M

From this we see that whether query shifts can be tolerated hinges upon the distribution of the xi s. Since Dtrain x = Dtest x , if M is large then

byquery x queryΛ 1Λw = x queryw. (18)

Thus, very general shifts in the query distribution can be tolerated. On the other hand, very diﬀerent behavior can be expected if M is not large and the query example depends on the training data. For example, if the query example is orthogonal to the subspace spanned by the xi s, the prediction will be zero, as was observed with transformer architectures by Garg et al. (2022).

Covariate shifts. In contrast to task and query shifts, covariate shifts cannot be fully tolerated in the transformer. This can be easily seen due to the identity (13): when Dtrain x = Dtest x , then the approximation in (18) does not hold as 1 M PM i=1 xix i will not cancel Γ 1

when M and N are large. For instance, if we consider test prompts where the covariates are scaled by a constant c = 1, then

byquery x queryΛ 1 1 M

w x queryΛ 1c2Λw = c2x queryw = x queryw.

This failure mode of the trained transformer with linear self-attention was also observed in the trained transformer architectures by Garg et al. (2022). This suggests that although the predictions of the transformer may look similar to those of ordinary least squares in some settings, the algorithm implemented by the transformer is not the same since ordinary least squares is robust to scaling of the features by a constant. It may seem surprising that a transformer trained on linear regression tasks fails in settings where ordinary least squares performs well. However, both the linear self-attention transformer we consider and the transformers considered by Garg et al. (2022) were trained on instances of linear regression when the covariate distribution Dx over the features was ﬁxed across instances. This leads to the natural question of what happens if the transformers instead are trained on prompts where the covariate distribution varies across instances, which we explore in the following section.

4.3 Transformers trained on prompts with random covariate distributions

In this section, we will consider a variant of training on in-context examples (in the sense of Deﬁnition 1) where the distibution Dx is itself sampled randomly from a distribution, and

training prompts are of the form (x1, h(x1), . . . , x N, h(x N), xquery) where xi, xquery i.i.d. Dx and h DH. More formally, we can generalize Deﬁnition 1 as follows.

Trained Transformers Learn Linear Models In-Context

Deﬁnition 7 (In-context training with random covariate distributions) Let be a distribution over distributions Dx deﬁned on an input space X, H YX a set of functions X Y, and DH a distribution over functions in H. Let ℓ: Y Y R be a loss function. Let S = n N{(x1, y1, . . . , xn, yn) : xi X, yi Y} be the set of ﬁnite-length sequences of (x, y) pairs and let FΘ = {fθ : S X Y, θ Θ}

be a class of functions parameterized by some set Θ. We say that a model f : S X Y is trained on in-context examples of functions in H under loss ℓw.r.t. DH and distribution over covariate distributions if f = fθ where θ Θ satisﬁes

θ argminθ ΘEP=(x1,h(x1),...,x N,h(x N),xquery) [ℓ(fθ(P), h(xquery))] , (19)

where Dx , xi, xquery i.i.d. Dx and h DH.

We recover the previous deﬁnition of training on in-context examples by taking to be concentrated on a singleton, supp( ) = {Dx}. The natural question is then, if a model f is trained on in-context examples from a function class H w.r.t. DH and a distribution over covariate distributions, and if one then samples some covariate distribution Dx , does f in-context learn H w.r.t. (DH, Dx) for that Dx (cf. Deﬁnition 2)? Since Dx is random, we can hope that this may hold in expectation or with high probability over the sampling of the covariate distribution. In the remainder of this section, we will explore this question for transformers with a linear self-attention layer trained by gradient ﬂow on the population loss. We shall again consider the case where the covariates have Gaussian marginals, xi N(0, Λ), but we shall now assume that within each prompt we ﬁrst sample a random covariance matrix Λ. For simplicity, we will restrict our attention to the case where Λ is diagonal. More formally, we shall assume training prompts are sampled as follows. For each independent task indexed by τ [B], we ﬁrst sample wτ N(0, Id). Then, for each task τ and coordinate i [d], we sample λτ,i independently such that the distribution of each λτ,i is ﬁxed and has ﬁnite third moments and is strictly positive almost surely. We then form a diagonal matrix

Λτ = diag(λτ,1, . . . , λτ,d).

Thus the diagonal entries of Λτ are independent but could have diﬀerent distributions, and Λτ is identically distributed for τ = 1, . . . , B. Then, conditional on Λτ, we sample independent and identically distributed xτ,1, . . . , xτ,N, xτ,query N(0, Λτ). A training prompt is then given by Pτ = (xτ,1, wτ, xτ,1 , . . . , xτ,N, wτ, xτ,N , xτ,query) Notice that here, xτ,i, xτ,query are conditionally independent given the covariance matrix Λτ, but not independent in general. We consider the same token embedding matrix as (4) and linear self-attention network, which forms the prediction byquery,τ as in (6). The empirical risk is the same as before (see (7)), and as in (8), we then take B and consider the gradient ﬂow on the population loss. The population loss now includes an expectation over the distribution of the covariance matrices in addition to the task weight wτ and covariate distributions, and is given by

2Ewτ,Λτ,xτ,1, ,xτ,N,xτ,query (byτ,query wτ, xτ,query )2 . (20)

Zhang, Frei, Bartlett

In the main result for this section, we show that gradient ﬂow with a suitable initialization converges to a global minimum, and we characterize the limiting solution. The proof will be deferred to Appendix C.

Theorem 8 (Global convergence with random covariance) Consider gradient ﬂow of the linear self-attention network f LSA deﬁned in (3) over the population loss (20), where Λτ are diagonal with independent diagonal entries which are strictly positive a.s. and have ﬁnite third moments. Suppose the initialization satisﬁes Assumption 3, EΛτΘ F = 0, with initialization scale σ > 0 satisfying

σ2 < 2 EΛτΘ 2 F

d h E Γτ op Λτ 2 F i. (21)

Then gradient ﬂow converges to a global minimum of the population loss (20). Moreover, W PV and W KQ converge to W PV and W KQ respectively, where

W KQ = EΓτΛ2 τ 1 E Λ2 τ 1

EΓτΛ2 τ 1 EΛ2 τ 0d

W PV = EΓτΛ2 τ 1 E Λ2 τ

where Γτ = N+1

N tr(Λτ)Id Rd d and the expectations above are over the distribution of Λτ.

From this result, we can see why the trained transformer fails in the random covariance case. Suppose we have a new prompt corresponding to a weight matrix w Rd

and covariance matrix Λnew, sampled from the same distribution as the covariance matrices for training prompts, so that conditionally on Λnew we have xi, xquery i.i.d. N(0, Λnew). The ground-truth labels are given by yi = w, xi , i [M] and yquery = w, xquery . At convergence, the prediction by the trained transformer on the new task will be

i=1 xix i + 1

M xqueryx query 1 M

i=1 x i yi 1 M

EΓτΛ2 τ 1 EΛ2 τ 0d

= x query EΛ2 τ EΓτΛ2 τ 1

x query EΛ2 τ EΓτΛ2 τ 1 Λneww almost surely when M . (23)

The last line comes from the strong law of large numbers. Thus, in order for the prediction on the query example to be close to the ground-truth x queryw, we need EΛ2 τ EΓτΛ2 τ 1 Λnew

Trained Transformers Learn Linear Models In-Context

to be close to the identity. When Λτ Λnew is deterministic, this indeed is the case as we know from Theorem 5. However, this clearly does not hold in general when Λτ is random. To make things concrete, let us assume for simplicity that M, N so that Γτ Λτ and the identity (23) holds (conditionally on Λnew). Then, taking expectation over Λnew in (23), we obtain

E [ byquery| xquery, w] x query EΛ2 τ EΛ3 τ 1 [EΛτ] w.

If we consider the case λτ,i i.i.d. Exponential(1), so that E[Λτ] = Id, E[Λ2 τ] = 2Id, and E[Λ3 τ] = 6Id, we get

3 w, xquery .

This shows that for transformers with a single linear self-attention layer, training on incontext examples with random covariate distributions does not allow for in-context learning of a hypothesis class with varying covariate distributions.

Experiments with large, nonlinear transformers. We have shown that even when trained on prompts with random covariance matrices, transformers with a single linear selfattention layer fail to in-context learn linear models with random covariance matrices. We now investigate the behavior of more complex transformer architectures that are trained on in-context examples of linear models, both in the ﬁxed-covariance case and in the randomcovariance case. We examine the performance of transformers with a GPT2 architecture (Radford et al., 2019) that are trained on linear regression tasks with mean-zero Gaussian features with either a ﬁxed covariance matrix or random covariance matrices. For the ﬁxed covariance case, the covariance matrix is ﬁxed to the identity matrix across prompts. For the random covariance case, covariates are drawn from x N(0, cΛ) where Λ is diagonal with λi i.i.d. Exponential(1) and c > 0 is a scaling factor. We set c = 1 during training and vary this value at test time. The transformer is trained using the procedure of Garg et al. (2022) (see Appendix E for more details). We consider linear models in d = 20 dimensions and we train on prompt lengths of N = 40, 70, 100 with either ﬁxed or random covariance matrices. The performance of these trained models, when tested on new data with ﬁxed covariance or random covariance matrices (c = 1, 4, 9), is represented in six curves in Figure 1. Using the calculation (23), we can compare the prediction error for the linear self-attention networks in the M , N limit (the black dash line) to those of GPT2 architectures. We additionally compare these models to the ordinary least-squares solution which is optimal for this task. From the ﬁgure, we can see that the GPT2 model trained on ﬁxed covariance succeeds in the random covariance setting if the variance is not too large, which shows that the larger nonlinear model is able to generalize better than the model with a single linear self-attention layer. However, when the variance is large (c = 4, 9 for the bottom two ﬁgures), the GPT2 model trained with ﬁxed covariance is unsuccessful. When trained on random covariance, the model performs better for test prompts from higher-variance random covariance matrices, but still fails to match least squares when the scaling is largest (c = 9).

Zhang, Frei, Bartlett

0 20 40 60 80 100 in-context examples

squared error

Test on Fixed Covariance

0 20 40 60 80 100 in-context examples

squared error

Test on Random Covariance, Scale = 1.0

0 20 40 60 80 100 in-context examples

squared error

Test on Random Covariance, Scale = 4.0

0 20 40 60 80 100 in-context examples

squared error

Test on Random Covariance, Scale = 9.0

Zero Estimator LSA Limit fixedcov_N40 fixedcov_N70 fixedcov_N100 randomcov_N40 randomcov_N70 randomcov_N100 Least Squares

Figure 1: Normalized prediction error for transformers with GPT2 architectures as a function of the number of in-context test examples M when trained on in-context examples of linear models in d = 20 dimensions. Colored lines correspond to diﬀerent training context lengths (N {40, 70, 100}) and diﬀerent training procedures (either a ﬁxed identity covariance matrix or random diagonal covariance matrices with each diagonal element sampled i.i.d. from the standard exponential distribution). The four ﬁgures correspond to evaluating on either ﬁxed covariance or random covariance matrices of diﬀerent scales. The gray dashed line shows the prediction error of zero estimator and the black dashed line the prediction error of LSA model when M, N . The GPT2 models achieve smaller error when they are trained on random covariance matrices with larger contexts, but their prediction error spikes when evaluated on contexts larger than those they were trained on.

Furthermore, we notice some surprising behaviors when the test prompt length exceeds the training prompt length (i.e., M > N): there is an evident spike in prediction error, regardless of whether training and testing were performed on ﬁxed or random covariance, and the spike appears to decrease when evaluated on prompts with higher variance. Although we are unsure of why the spike should decrease with higher-variance prompts, the failure of large language models to generalize to larger contexts than they were trained on is a wellknown problem (Dai et al., 2019; Anil et al., 2022). In our setting, we conjecture that this spike in error comes from the absolute positional encodings in the GPT2 architecture. The

Trained Transformers Learn Linear Models In-Context

positional encodings are randomly-initialized and are learnable parameters but the encoding for position i is only updated if the transformer encounters a prompt which has a context of length i. Thus, when evaluating on prompts of length M > N, the model is relying upon random positional encodings for M N samples. We note that a concurrent work has explored the performance of transformers with GPT2 architectures for in-context learning of linear models and found that removing positional encoders improves performance when evaluating on larger contexts (Ahuja et al., 2023). We leave further investigation of this behavior for future work.

5. Proof ideas

In this section, we brieﬂy outline the proof sketch of Theorem 4. The full proof of this theorem is left for Appendix A.

5.1 Equivalence to a quadratic optimization problem

We recall each task τ corresponds to a weight vector wτ N(0, Id). The prompt inputs

for this task are xτ,j i.i.d. N(0, Λ), which are also independent of wτ. The corresponding labels are yτ,j = wτ, xτ,j . For each task τ, we can form the prompt into a token matrix Eτ R(d+1) (N+1) as in (4), with the right-bottom entry being zero. The ﬁrst key step in our proof is to recognize that the prediction byquery(Eτ; θ) in the linear self-attention model can be written as the output of a quadratic function u Hτu for some matrix Hτ depending on the token embedding matrix Eτ and for some vector u depending on θ = (W KQ, W PV ). This is shown in the following lemma, the proof of which is provided in Appendix A.1.

Lemma 9 Let Eτ R(d+1) (N+1) be an embedding matrix corresponding to a prompt of length N and weight wτ. Then the prediction byquery(Eτ; θ) for the query covariate can be written as the output of a quadratic function,

byquery(Eτ; θ) = u Hτu,

where the matrix Hτ is deﬁned as,

2Xτ EτE τ N

R(d+1)2 (d+1)2, Xτ =

0d d xτ,query

(xτ,query) 0

R(d+1) (d+1)

u = Vec(U) R(d+1)2, U =

R(d+1) (d+1),

where U11 = W KQ 11 Rd d, u12 = w PV 21 Rd 1, u21 = w KQ 21 Rd 1, u 1 = w PV 22 R correspond to particular components of W PV and W KQ, deﬁned in (5).

Zhang, Frei, Bartlett

This implies that we can write the original loss function (7) as

u Hτu w τ xτ,query 2 . (25)

Thus, our problem is reduced to understanding the dynamics of an optimization algorithm deﬁned in terms of a quadratic function. We also note that this quadratic optimization problem is an instance of a rank-one matrix factorization problem, a problem well-studied in the deep learning theory literature (Gunasekar et al., 2017; Arora et al., 2019; Li et al., 2018; Chi et al., 2019; Belabbas, 2020; Li et al., 2020; Jin et al., 2023; Soltanolkotabi et al., 2023). Note, however, this quadratic function is non-convex. To see this, we will show that Hτ has negative eigenvalues. By standard properties of the Kronecker product, the eigenvalues of Hτ = 1

2Xτ EτE τ N are the products of the eigenvalues of 1

2Xτ and the eigenvalues

of EτE τ N . Since EτE τ is symmetric and positive semi-deﬁnite, all of its eigenvalues are nonnegative. Since EτE τ is nonzero almost surely, it thus has at least one strictly positive eigenvalue. Thus, if Xτ has any negative eigenvalues, Hτ does as well. The characteristic polynomial of Xτ is given by,

det(µI Xτ) = det

µId xτ,query

x τ,query µ

= µd 1 µ2 xτ,query 2 2 .

Therefore, we know almost surely, Xτ has one negative eigenvalue. Thus Hτ has at least d + 1 negative eigenvalues, and hence the quadratic form u Hτu is non-convex.

5.2 Dynamical system of gradient ﬂow

We now describe the dynamical system for the coordinates of u above. We prove the following lemma in Appendix A.2.

Lemma 10 Let u = Vec (U) := Vec

as in Lemma 9. Consider gradient

2E u Hτu w τ xτ,query 2 (26)

with respect to u starting from an initial value satisfying Assumption 3. Then the dynamics of U follows d dt U11(t) = u2 1ΓΛU11Λ + u 1Λ2

d dtu 1(t) = tr h u 1ΓΛU11Λ(U11) Λ2(U11) i , (27)

and u12(t) = 0d, u21(t) = 0d for all t 0, where Γ = 1 + 1

N tr(Λ)Id Rd d.

Trained Transformers Learn Linear Models In-Context

We see that the dynamics are governed by a complex system of d2+1 coupled diﬀerential equations. Moreover, basic calculus (for details, see Lemma 15) shows that these dynamics are the same as those of gradient ﬂow on the following objective function:

ℓ: Rd d R R, ℓ(U11, u 1) = tr 1

2u2 1ΓΛU11Λ(U11) u 1Λ2(U11) . (28)

Actually, the loss function ℓis simply the loss function L in (26) plus some constants that do not depend on the parameter u. Therefore our problem is reduced to studying the dynamics of gradient ﬂow on the above objective function. Our next key observation is that the set of global minima for ℓsatisﬁes the condition u 1U11 = Γ 1. Thus, if we can establish global convergence of gradient ﬂow over the above objective function ℓ, then we have that u 1(t)U11(t) Γ 1 N Λ 1.

Lemma 11 For any global minimum of ℓ, we have

u 1U11 = Γ 1. (29)

Putting this together with Lemma 10, we see that at those global minima of the population objective satisfying U11 = (cΓ) 1, u 1 = c and u12 = u21 = 0d, the transformer s predictions for a new linear regression task prompt are given by

byquery(E; θ) = 1

i=1 yix i Γ 1xquery = w 1 M

Γ 1xquery w xquery.

Thus, the only remaining task is to show global convergence when gradient ﬂow has an initialization satisfying Assumption 3.

5.3 PL inequality and global convergence

We now show that although the optimization problem is non-convex, a Polyak Lojasiewicz (PL) inequality holds, which implies that gradient ﬂow converges to a global minimum. Moreover, we can exactly calculate the limiting value of U11 and u 1.

Lemma 12 Suppose the initialization of gradient ﬂow satisﬁes Assumption 3 with initialization scale satisfying σ2 < 2

d Γ op for Γ = (1 + 1

N )Λ + tr(Λ)

N Id. If we deﬁne

d Λ 2 op tr (Γ 1Λ 1) tr (Λ 1) ΛΘ 2 F h 2

dσ2 Γ op i > 0, (30)

then gradient ﬂow on ℓwith respect to U11 and u 1 satisﬁes, for any t 0,

ℓ(U11(t), u 1(t)) 2

µ ℓ(U11(t), u 1(t)) min U11 Rd d,u 1 R ℓ(U11, u 1) .

Zhang, Frei, Bartlett

Moreover, gradient ﬂow converges to the global minimum of ℓ, and U11 and u 1 satisfy

lim t u 1(t) = Γ 1 1 2 F and lim t U11(t) = Γ 1 1

With these observations, proving Theorem 4 becomes a direct application of Lemma 9, 10, 11, and Lemma 12. It then only requires translating U11 and u 1 back to the original parameterization using W PV and W KQ.

6. Conclusion and future work

In this work, we investigated the dynamics of in-context learning of transformers with a single linear self-attention layer under gradient ﬂow on the population loss. In particular, we analyzed the dynamics of these transformers when trained on prompts consisting of random instances of noiseless linear models over anisotropic Gaussian marginals. We showed that despite non-convexity, gradient ﬂow from a suitable random initialization converges to a global minimum of the population objective. We characterized the prediction error of the trained transformer when given a new prompt that consists of a training dataset where the responses are a nonlinear function of the inputs. We showed how the trained transformer is naturally robust to shifts in the task and query distributions but is brittle to distribution shifts between the covariates seen during training and the covariates seen at test time, matching the empirical observations on trained transformer models of Garg et al. (2022). There are a number of natural directions for future research. First, our results hold for gradient ﬂow on the population loss with a particular class of random initialization schemes. It is a natural question if similar results would hold for stochastic gradient descent with ﬁnite step sizes and for more general initializations. Further, we restricted our attention to transformers with a single linear self-attention layer. Although this model class is rich enough to allow for in-context learning of linear predictors, we are particularly interested in understanding the dynamics of in-context learning in nonlinear and deep transformers. Finally, the framework of in-context learning introduced in prior work was restricted to the setting where the marginal distribution over the covariates (Dx) was ﬁxed across prompts. This allows for guarantees akin to distribution-speciﬁc PAC learning, where the trained transformer is able to achieve small prediction error when given a test prompt consisting of linear regression data when the marginals over the covariates are ﬁxed. However, other learning algorithms (such as ordinary least squares) are able to achieve small prediction error for prompts corresponding to well-speciﬁed linear regression tasks for very general classes of distributions over the covariates. As we showed in Section 4.3, when transformers with a single linear self-attention layer are trained on prompts where the covariate distributions are themselves sampled from a distribution, they do not succeed on test prompts with covariate distributions sampled from the same distribution. By contrast, we demonstrated with experiments that larger, nonlinear transformer architectures appear to be more successful in this setting but are still sub-optimal. Developing a better understanding of the dynamics of in-context learning when the covariate distribution varies across prompts is an intriguing direction for future research.

Trained Transformers Learn Linear Models In-Context

Acknowledgments

We gratefully acknowledge the support of the NSF and the Simons Foundation for the Collaboration on the Theoretical Foundations of Deep Learning through awards DMS2031883 and #814639, and of the NSF through grant DMS-2023505.

Zhang, Frei, Bartlett

1 Introduction 2

2 Additional Related Work 3

3 Preliminaries 5 3.1 In-context learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5 3.2 Linear self-attention networks . . . . . . . . . . . . . . . . . . . . . . . . . . 6 3.3 Training procedure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8

4 Main results 9 4.1 Convergence of gradient ﬂow and prediction error for new tasks . . . . . . . 10 4.2 Behavior of trained transformer under distribution shifts . . . . . . . . . . . 13 4.3 Transformers trained on prompts with random covariate distributions . . . 14

5 Proof ideas 19 5.1 Equivalence to a quadratic optimization problem . . . . . . . . . . . . . . . 19 5.2 Dynamical system of gradient ﬂow . . . . . . . . . . . . . . . . . . . . . . . 20 5.3 PL inequality and global convergence . . . . . . . . . . . . . . . . . . . . . . 21

6 Conclusion and future work 22

A Proof of Theorem 4 25 A.1 Proof of Lemma 9 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25 A.2 Proof of Lemma 10 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26 A.3 Proof of Lemma 11 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32 A.4 Proof of Lemma 12 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34

B Proof of Theorem 5 39

C Proof of Theorem 8 41 C.1 Dynamical system . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42 C.2 Loss function and global minima . . . . . . . . . . . . . . . . . . . . . . . . 43 C.3 PL Inequality and global convergence . . . . . . . . . . . . . . . . . . . . . 44

D Technical lemmas 49

E Experiment details 51

Trained Transformers Learn Linear Models In-Context

Appendix A. Proof of Theorem 4

In this section, we prove Lemma 9, Lemma 10, Lemma 11 and Lemma 12. Theorem 4 is a natural corollary of these four lemmas when we translate u 1 and U11 back to W PV and W KQ.

A.1 Proof of Lemma 9

For the reader s convenience, we restate the lemma below.

Lemma 13 Let Eτ R(d+1) (N+1) be an embedding matrix corresponding to a prompt of length N and weight wτ. Then the prediction byquery(Eτ; θ) for the query covariate can be written as the output of a quadratic function,

byquery(Eτ; θ) = u Hτu,

where the matrix Hτ is deﬁned as,

2Xτ EτE τ N

R(d+1)2 (d+1)2, Xτ =

0d d xτ,query

(xτ,query) 0

R(d+1) (d+1)

u = Vec(U) R(d+1)2, U =

R(d+1) (d+1),

where U11 = W KQ 11 Rd d, u12 = w PV 21 Rd 1, u21 = w KQ 21 Rd 1, u 1 = w PV 22 R correspond to particular components of W PV and W KQ, deﬁned in (5).

Proof First, we decompose WPV and WKQ in the way above. From the deﬁnition, we know byτ,query is the right-bottom entry of f LSA(Eτ), which is

byτ,query = (u12) u 1 EτE τ N

We denote ui Rd+1 as the i-th column of U11 (u21) and xi τ,query as the i-th entry of xτ,query for i [d]. Then, we have

i=1 xi τ,query (u12) u 1 EτE τ N

i=1 tr ui (u12) u 1 xi τ,query

(u12) u 1 x τ,query EτE τ N

Zhang, Frei, Bartlett

0d(d+1) d(d+1) xτ,query EτE τ N

x τ,query EτE τ N 0(d+1) (d+1)

2 tr uu Xτ EτE τ N

= D Hτ, uu E .

Here, we use some algebraic facts about matrix vectorization, Kronecker product and trace. For reference, we refer to (Petersen and Pedersen, 2008).

A.2 Proof of Lemma 10

For the reader s convenience, we restate the lemma below.

Lemma 14 Let u = Vec (U) := Vec

as in Lemma 9. Consider gradient

2E u Hτu w τ xτ,query 2 (26)

with respect to u starting from an initial value satisfying Assumption 3. Then the dynamics of U follows d dt U11(t) = u2 1ΓΛU11Λ + u 1Λ2

d dtu 1(t) = tr h u 1ΓΛU11Λ(U11) Λ2(U11) i , (27)

and u12(t) = 0d, u21(t) = 0d for all t 0, where Γ = 1 + 1

N tr(Λ)Id Rd d.

Proof From the deﬁnition of L in (26) and the dynamics of gradient ﬂow, we calculate the derivatives of u. Here, we use the chain rule and some facts about matrix derivatives. See Lemma 29 for reference.

dt = E Hτ, uu Hτ u + E w τ xτ,query Hτ u. (31)

Step One: Calculate the Second Term We ﬁrst calculate the second term. From the deﬁnition of Hτ, we have

E h w τ xτ,query Hτ i = 1

i=1 E xi τ,query Xτ wi τ EτE τ N

Trained Transformers Learn Linear Models In-Context

For ease of notation, we denote

i=1 xτ,ix τ,i. (32)

Then, from the deﬁnition of EτE τ N , we know

N xτ,query x τ,query bΛτwτ

wτ bΛτ w τ bΛτwτ

Since wτ N(0, Id) is independent of all prompt inputs and query input, we have

i=1 E xi τ,query Xτ wi τ N

xτ,query x τ,query 0 0 0

i=1 E E xi τ,query Xτ wi τ N

xτ,query x τ,query 0 0 0

" xi τ,query Xτ

E wi τ | xτ,query

xτ,query x τ,query 0 0 0

Therefore, we have

E h w τ xτ,query Hτ i = 1

xi τ,query Xτ

w τ bΛτ w τ bΛτwτ.

Since Xτ only depends on xτ,query by deﬁnition, and xτ,query is independent of wτ and xτ,i, i = 1, 2, ..., N, we have

E h w τ xτ,query Hτ i = 1

E xi τ,query Xτ E

w τ bΛτ w τ bΛτwτ.

E(wi τ)Λ ΛE(wi τwτ)

E(wi τw τ )Λ E wi τw τ Λwτ

where Λi denotes Λ:i. Here, the second line comes from the fact that EbΛτ = Λ, and that wτ is independent of all prompt input and query input. The last line comes from the fact that wτ N(0, Id). Therefore, simple computation shows that

E h w τ xτ,query Hτ i u = 1

0d(d+1) d(d+1) A

A 0(d+1) (d+1)

Zhang, Frei, Bartlett

where A Rd(d+1) (d+1) and Vj R(d+1) (d+1) are deﬁned by

0d d Pd i=1 ΛijΛi

Step Two: Calculate the First Term Next, we compute the ﬁrst term in (31), namely

D := 2E Hτ, uu Hτu .

For simplicity, we denote Zτ := 1

N EτE τ . Using the deﬁnition of Hτ in (24) and Lemma 29, we have

D = 2E Hτ, uu Hτu (deﬁnition)

2E h tr Xτ Zτ Vec (U) Vec (U) (Xτ Zτ) Vec (U) i

(deﬁnition of Hτ in (24) and u = Vec(U))

2E h tr Vec (ZτUXτ) Vec (U) Vec (ZτUXτ) i

(Vec(AXB) = (B A) Vec(X) in Lemma 29)

2E h Vec (U) Vec (ZτUXτ) Vec (ZτUXτ) i (property of trace operator)

(ZτUXτ)ij Uij Vec (ZτUXτ)

Step Three: u12 and u21 Vanish We ﬁrst prove that if u12 = u21 = 0d, then d

dtu12 = 0d and d dtu21 = 0d. If this is true, then these two blocks will be zero all the time since we assume they are zero at initial time in Assumption 3. We denote Ak: and A:k as the k-th row and k-th column of matrix A, respectively. Under the assumption that u12 = u21 = 0d, we ﬁrst compute

bΛτwτu 1x τ,query bΛτ + 1

N xτ,query x τ,query U11xτ,query

w τ bΛτ wτu 1x τ,query w τ bΛτ U11xτ,query

Written in an entry-wise manner, it will be

(ZτUXτ)kl =

k: wτu 1xl τ,query k, l [d] bΛτ + 1

N xτ,query x τ,query

k: U11xτ,query k [d], l = d + 1

w τ bΛτ wτu 1xl τ,query l [d], k = d + 1

w τ bΛτ U11xτ,query k = l = d + 1

Trained Transformers Learn Linear Models In-Context

We use Dij to denote the (i, j)-th entry of the (d + 1) (d + 1) matrix D such that Vec( D) = D. Now we ﬁx a k [d], then

(ZτUXτ)ij Uij (ZτUXτ)k,d+1

(ZτUXτ)ij Uij (ZτUXτ)k,d+1

2E h (ZτUXτ)d+1,d+1 u 1 (ZτUXτ)k,d+1 i , (36)

since Ui,d+1 = Ud+1,i = 0 for any i [d]. For the ﬁrst term in the right hand side of last equation, we ﬁx i, j [d] and have

E (ZτUXτ)ij Uij (ZτUXτ)k,d+1

i: wτu 1xj τ,query bΛτ + 1

N xτ,query x τ,query

k: U11xτ,query

since wτ is independent with all prompt input and query input, namely all xτ,i for i [query], and wτ is mean zero. Similarly, for the second term of (36), we have

E (ZτUXτ)d+1,d+1 u 1 (ZτUXτ)k,d+1

=E u 1w τ bΛτ U11xτ,query bΛτ + 1

N xτ,query xτ,query

k: U11xτ,query

since E w τ = 0 and wτ is independent of all xτ,i for i [query]. Therefore, we have Dk,d+1 = 0 for k [d]. Similar calculation shows that Dd+1,k = 0 for k [d].

For k [d], to calculate the derivative of Uk,d+1, it suﬃces to further calculate the inner product of the d(d + 1) + k th row of E w τ xτ,query Hτ and u. From (33), we know this is

j=1 Λ k Λj Ud+1,j = 0

given that u12 = u21 = 0d. Therefore, we conclude that the derivative of Uk,d+1 will vanish given u12 = u21 = 0d. Similarly, we conclude the same result for Ud+1,k for k [d]. Therefore, we know u12 = 0d and u21 = 0d for all time t 0.

Step Four: Dynamics of U11 Next, we calculate the derivatives of U11 given u12 = u21 = 0d. For a ﬁxed pair of k, l [d], we have

(ZτUXτ)ij Uij (ZτUXτ)kl

2E h (ZτUXτ)d+1,d+1 u 1 (ZτUXτ)kl i .

Zhang, Frei, Bartlett

For ﬁxed i, j [d], we have

E h (ZτUXτ)ij Uij (ZτUXτ)kl i = Uiju2 1E h bΛτ

i: wτxj τ,queryxl τ,queryw τ bΛτ

= Uiju2 1E h xj τ,queryxl τ,query i E h bΛτ

= Uiju2 1Λτ,jl E h bΛτ

Therefore, we sum over i, j [d] to get

(ZτUXτ)ij Uij (ZτUXτ)kl

For the last term, we have

1 2E h (ZτUXτ)d+1,d+1 u 1 (ZτUXτ)kl i = 1

So we have Dkl = u2 1E bΛτ

Additionally, we have

2 h E w τ xτ,query Hτ u i

(l 1)(d+1)+k =

0d(d+1) d(d+1) A

A 0(d+1) (d+1)

(l 1)(d+1)+k (deﬁnition)

= 0(d+1) d(d+1) Vl + V l

(deﬁnition of A in (34))

= Λ k Λlu 1. (deﬁnition of Vi in (34))

Therefore, we have that for k, l [d], the dynamics of Ukl is

d dt Ukl = u2 1E bΛτ

bΛτ U11Λl + u 1Λ k Λl,

which implies d dt U11 = u2 1E bΛτ 2 U11Λ + u 1Λ2.

From the deﬁnition of bΛτ (equation (32)), the independence and Gaussianity of xτ,i and Lemma 30, we compute

E bΛτ 2 = E

i=1 xτ,ix τ,i

(deﬁnition (32))

Trained Transformers Learn Linear Models In-Context

h E xτ,1x τ,1 i2 + 1

N E xτ,1x τ,1xτ,1x τ,1

(independence between prompt input)

N tr(Λ)Λ. (Lemma 30)

We deﬁne Γ := N + 1

N tr(Λ)Id. (37)

Then, from (31), we know the dynamics of U11 is

d dt U11 = u2 1ΓΛU11Λ + u 1Λ2. (38)

Step Five: Dynamics of u 1 Finally, we compute the dynamics of u 1. We have

Dd+1,d+1 = 1

(ZτUXτ)ij Uij (ZτUXτ)d+1,d+1

2E h (ZτUXτ)d+1,d+1 u 1 (ZτUXτ)d+1,d+1 i . (39)

For the ﬁrst term above, we have

(ZτUXτ)ij Uij (ZτUXτ)d+1,d+1

i,j=1 Uij E h bΛτ

i: wτw τ bΛτ U11xτ,queryxj τ,query i (from (35))

i,j=1 Uij E h bΛτ

i: bΛτ U11xτ,queryxj τ,query i (independence and distribution of wτ)

i,j=1 Uij E h bΛτ

i: bΛτ U11Λj i (independence between prompt covariates)

i,j=1 Λj Uij bΛτ

= u 1E tr Λ(U11) bΛτ 2 U11

=u 1 tr E bΛτ 2 U11Λ(U11) .

For the second term in (39), we have

E h (ZτUXτ)d+1,d+1 u 1 (ZτUXτ)d+1,d+1 i

= u 1E h w τ bΛτ U11xτ,queryx τ,query(U11) bΛτ wτ i (from (35))

= u 1E tr h wτw τ bΛτ U11xτ,queryx τ,query(U11) bΛτ i

Zhang, Frei, Bartlett

= u 1E tr h bΛτ U11Λ(U11) bΛτ i

= u 1 tr E bΛτ 2 U11Λ(U11) .

Therefore, we know

Dd+1,d+1 = u 1 tr E bΛτ 2 U11Λ(U11) .

Additionally, we have

2 h E w τ xτ,query Hτ u i

0d(d+1) d(d+1) A

A 0(d+1) (d+1)

(d+1)2 (from (33))

= V1 + V 1 ... Vd + V d 0(d+1) (d+1)

(deﬁnition of A in (34))

i,j=1 Λ i Λj Uji = tr Λ(U11) Λ .

Then, from (31), we have the dynamics of u 1 is

d dtu 1 = tr h u 1ΓΛU11Λ(U11) Λ2(U11) i . (40)

A.3 Proof of Lemma 11

Lemma 11 gives the form of global minima of an equivalent loss function. First, we prove that gradient ﬂow on L deﬁned in (8) from the initial values satisfying Assumption 3 is equivalent to gradient ﬂow on another loss function ℓdeﬁned below. Then, we derive an expression for the global minima of this loss function. First, from the dynamics of gradient ﬂow, we can actually recover the loss function up to a constant. We have the following lemma.

Lemma 15 (Loss Function) Consider gradient ﬂow over L in (26) with respect to u starting from an initial value satisfying Assumption 3. This is equivalent to doing gradient ﬂow with respect to U11 and u 1 on the loss function

ℓ(U11, u 1) = tr 1

2u2 1ΓΛU11Λ(U11) u 1Λ2(U11) . (41)

Proof The proof is simply by taking gradient of the loss function in (41). For techniques in matrix derivatives, see Lemma 29. We take the gradient of ℓon U11 to obtain

2u2 1Λ Γ U11Λ + 1

2u2 1ΓΛU11Λ u 1Λ2 = u2 1ΓΛU11Λ u 1Λ2,

Trained Transformers Learn Linear Models In-Context

since Γ and Λ are commutable. We take derivatives w.r.t. u 1 to get

ℓ u 1 = tr h u 1ΓΛU11Λ(U11) Λ2(U11) i .

Combining this with Lemma 10, we have

d dt U11(t) = ℓ

U11 , d dtu 1(t) = ℓ

We remark that actually this is the loss function L up to some constant. This loss function ℓcan be negative. But we can still compute its global minima as follows.

Corollary 16 (Minimum of Loss Function) The loss function ℓin Lemma 15 satisﬁes

min U11 Rd d,u 1 R ℓ(U11, u 1) = 1

ℓ(U11, u 1) min U11 Rd d,u 1 R ℓ(U11, u 1) = 1

Γ 1 2 u 1Λ 1 2 U11Λ 1 2 ΛΓ 1 2

Proof First, we claim that

ℓ(U11, u 1) = 1

2 tr Γ u 1Λ 1 2 U11Λ 1 2 ΛΓ 1 u 1Λ 1 2 U11Λ 1 2 ΛΓ 1 1

2 tr Λ2Γ 1 .

To calculate this, we just need to expand the terms in the brackets and notice that Γ and Λ commute:

tr Γ u 1Λ 1 2 U11Λ 1 2 ΛΓ 1 u 1Λ 1 2 U11Λ 1 2 ΛΓ 1 tr Λ2Γ 1

(i) = tr h Γ u2 1Λ 1 2 U11Λ(U11) Λ1/2 u 1ΛΓ 1Λ 1 2 U11Λ 1 2 u 1Λ 1 2 U11Λ 3 2 Γ 1 + Γ 2Λ2 i

= tr h Γ u2 1Λ 1 2 U11Λ(U11) Λ1/2 u 1ΛΓ 1Λ 1 2 U11Λ 1 2 u 1Λ 1 2 U11Λ 3 2 Γ 1 i

= u2 1 tr h ΓΛ 1 2 U11Λ(U11) Λ 1 2 i u 1 tr h ΓΛΓ 1Λ 1 2 U11Λ 1 2 ΓΛ 1 2 U11Λ 3 2 Γ 1i

(ii) = u2 1 tr h ΓΛU11Λ(U11) i 2u 1 tr h Λ2U11Λ 1 2 i

= 2 ℓ(U11, u 1) .

Equations (i) and (ii) use that Γ and Λ commute.

Zhang, Frei, Bartlett

Since Γ 0 and u 1Λ 1 2 U11Λ 1 2 ΛΓ 1 u 1Λ 1 2 U11Λ 1 2 ΛΓ 1 0, we know from Lemma 32 that

1 2 tr Γ u 1Λ 1 2 U11Λ 1 2 ΛΓ 1 u 1Λ 1 2 U11Λ 1 2 ΛΓ 1 0,

which implies ℓ(U11, u 1) 1

2 tr Λ2Γ 1 .

Equality holds when U11 = Γ 1, u 1 = 1,

so the minimum of ℓis 1

2 tr Λ2Γ 1 . The expression for ℓ(U11, u 1) min ℓ(U11, u 1) comes from the fact that tr(A A) = A 2 F for any matrix A.

Lemma 11 is an immediate consequence of Corollary16, since the loss will keep the same when we replace (U11, u 1) by (c U11, c 1u 1) for any non-zero constant c.

A.4 Proof of Lemma 12

In this section, we prove that the dynamical system in Lemma 10 satisﬁes a PL inequality. Then, the PL inequality naturally leads to the global convergence of this dynamical system. First, we prove a simple lemma, which says the parameters in the LSA model will keep balanced in the whole trajectory. From the proof of this lemma, we can understand why we assume a balanced parameter at the initial time.

Lemma 17 (Balanced Parameters) Consider gradient ﬂow over L in (26) with respect to u starting from an initial value satisfying Assumption 3. For any t 0, it holds that

u2 1 = tr h U11(U11) i . (42)

Proof From Lemma 10, we multiply the ﬁrst equation in (27) by (U11) from the right to get d

dt U11(t) (U11(t)) = u2 1ΓΛU11Λ(U11) + u 1Λ2(U11) .

Also we multiply the second equation in Lemma 10 by u 1 to obtain d

dtu 1(t) u 1(t) = tr h u2 1ΓΛU11Λ(U11) + u 1Λ2(U11) i .

Therefore, we have

dt U11(t) (U11(t)) = d

dtu 1(t) u 1(t).

Taking the transpose of the equation above and adding to itself gives

d dt tr h U11(t)(U11(t)) i = d

dt u 1(t)2 .

Trained Transformers Learn Linear Models In-Context

Notice that from Assumption 3, we know that at t = 0,

u 1(0)2 = σ2 = σ2 tr h ΘΘ ΘΘ i = tr h U11(0)(U11(0)) i .

So for any time t 0, the equation holds.

In order to prove the PL inequality, we ﬁrst prove an important property which says the trajectories of u 1(t) stay away from saddle point at origin. First, we prove that u 1(t) will stay positive along the whole trajectory.

Lemma 18 Consider gradient ﬂow over L in (26) with respect to u starting from an initial value satisfying Assumption 3. If the initial scale satisﬁes

d Γ op , (43)

then, for any t 0, it holds that u 1 > 0.

Proof From Lemma 15, we are actually doing gradient ﬂow on the loss ℓ. The loss function is non-increasing, because

We notice that when u 1 = 0, ℓ= 0. Therefore, as long as ℓ(U11(0), u 1(0)) < 0, then for any time, u 1 will be non-zero. Further, since u 1(0) > 0 and the trajectory of u 1(t) must be continuous, we know u 1(t) > 0 for any t 0. Then, it suﬃces to prove when 0 < σ < q 2

d Γ op , it holds that ℓ(U11(0), u 1(0)) < 0.

From Assumption 3, we can calculate the loss function at the initial time:

ℓ(U11(0), u 1(0)) = σ4

2 tr h ΓΛΘΘ ΛΘΘ i σ2 tr h Λ2ΘΘ i .

From the property of trace, we know

tr h Λ2ΘΘ i = tr h ΛΘΘ Λ i = ΛΘ 2 F .

From Von-Neumann s trace inequality (Lemma 31) and the fact that ΘΘ F = 1, we know

tr h ΓΛΘΘ ΛΘΘ i

d ΛΘΘ ΛΘΘ F Γ op

d ΛΘ 2 F ΘΘ F Γ op

d ΛΘ 2 F Γ op .

Zhang, Frei, Bartlett

Therefore, we have

ℓ(U11(0), u 1(0))

2 ΛΘ 2 F Γ op σ2 ΛΘ 2 F

dσ2 Γ op 2 i .

From Assumption 3, we know ΛΘ F = 0. From (37), we know Γ op > 0. Therefore, when

we have ℓ(U11(0), u 1(0)) < 0.

From the lemma above, we can actually further prove that the u 1(t) can be lower bounded by a positive constant for any t 0. This will be a critical property to prove the PL inequality. We have the following lemma.

Lemma 19 Consider gradient ﬂow over L in (26) with respect to u starting from an initial value satisfying Assumption 3 with initial scale 0 < σ < q 2

d Γ op . For any t 0, it holds

d Λ 2 op ΛΘ 2 F h 2

dσ2 Γ op i > 0. (44)

Proof We prove by contradiction. Suppose the claim does not hold. From Lemma 17, we know u2 1 = tr U11(U11) = U11 2 F . From Lemma 18, we know u 1 = U11 F . Recall the deﬁnition of loss function:

ℓ(U11, u 1) = tr 1

2u2 1ΓΛU11Λ(U11) u 1Λ2(U11) .

Since Γ 0, Λ 0, and they commute, we know from Lemma 32 that ΓΛ 0. Again, since

U11Λ(U11) = U11Λ 1 2 U11Λ 1 2 0, from Lemma 32 we have tr 1

2u2 1ΓΛU11Λ(U11) 0. So ℓ(U11, u 1) tr h u 1Λ2(U11) i .

From Von-Neumann s trace inequality, we know for any t 0,

tr h u 1Λ2(U11) i

du 1 Λ2 op U11 F =

du2 1 Λ 2 op .

Therefore, under our assumption that the claim does not hold, we have

ℓ(U11, u 1)

du2 1 Λ 2 op > σ2

2 ΛΘ 2 F h 2

dσ2 Γ op i ℓ(U11(0), u 1(0)).

Trained Transformers Learn Linear Models In-Context

Here, the last inequality comes from the proof of Lemma 18. This contradicts the nonincreasing property of the loss function in gradient ﬂow.

Finally, let s prove the PL inequality and further, the global convergence of gradent ﬂow on the loss function ℓ. We recall the stated lemma from the main text.

Lemma 20 Suppose the initialization of gradient ﬂow satisﬁes Assumption 3 with initialization scale satisfying σ2 < 2

d Γ op for Γ = (1 + 1

N )Λ + tr(Λ)

N Id. If we deﬁne

d Λ 2 op tr (Γ 1Λ 1) tr (Λ 1) ΛΘ 2 F h 2

dσ2 Γ op i > 0, (30)

then gradient ﬂow on ℓwith respect to U11 and u 1 satisﬁes, for any t 0,

ℓ(U11(t), u 1(t)) 2

µ ℓ(U11(t), u 1(t)) min U11 Rd d,u 1 R ℓ(U11, u 1) .

Moreover, gradient ﬂow converges to the global minimum of ℓ, and U11 and u 1 satisfy

lim t u 1(t) = Γ 1 1 2 F and lim t U11(t) = Γ 1 1

Proof From the deﬁnition and Lemma 19, we have

ℓ(U11, u 1) 2 2

F = u2 1ΓΛU11Λ u 1Λ2 2 F

= u2 1 ΓΛ 1 2 u 1Λ 1 2 U11Λ 1 2 ΛΓ 1 Λ 1 2 2

d Λ 2 op ΛΘ 2 F h 2

dσ2 Γ op i ΓΛ 1 2 u 1Λ 1 2 U11Λ 1 2 ΛΓ 1 Λ 1 2 2

To see why the second line is true, recall that u 1 R and Γ and Λ commute. The last line comes from the lower bound of u 1 in Lemma 19. From Corollary 16, we know

ℓ min U11 Rd d,u 1 R ℓ(U11, u 1) = 1

2 tr Γ u 1Λ 1 2 U11Λ 1 2 ΛΓ 1 u 1Λ 1 2 U11Λ 1 2 ΛΓ 1

Γ 1 2 u 1Λ 1 2 U11Λ 1 2 ΛΓ 1 2

Therefore, we know that

ℓ min U11 Rd d,u 1 R ℓ(U11, u 1)

ΓΛ 1 2 u 1Λ 1 2 U11Λ 1 2 ΛΓ 1 Λ 1 2 2

Zhang, Frei, Bartlett

ΓΛ 1 2 u 1Λ 1 2 U11Λ 1 2 ΛΓ 1 Λ 1 2 2

F tr Γ 1Λ 1 tr Λ 1 (46)

We compare (45) and (46) to obtain that in order to make the PL condition hold, one needs to let

d Λ 2 op tr (Γ 1Λ 1) tr (Λ 1) ΛΘ 2 F h 2

dσ2 Γ op i > 0.

Once we set this µ, we get the PL inequality. The µ is positive due to the assumption for σ in the lemma. From the dynamics of gradient ﬂow and the PL condition, we know

ℓ min U11 Rd d,u 1 R ℓ(U11, u 1) =

µ ℓ min U11 Rd d,u 1 R ℓ(U11, u 1) .

Therefore, we have when t ,

0 ℓ min U11 Rd d,u 1 R ℓ(U11, u 1)

exp ( µt) ℓ(U11(0), u 1(0)) min U11 Rd d,u 1 R ℓ(U11, u 1) 0,

which implies

ℓ min U11 Rd d,u 1 R ℓ(U11, u 1) = 0.

From Corollary 16, we know this is Γ 1 2 u 1Λ 1 2 U11Λ 1 2 ΛΓ 1 2

Since Γ and Λ are non-singular and positive deﬁnite, and they commute, we know

u 1U11 Γ 1 2 F Γ 1

Γ 1 2 u 1Λ 1 2 U11Λ 1 2 ΛΓ 1 2

This implies u 1U11 Γ 1 0d d entry-wise. Since u 1 = U11 F , we know

u2 1 = u 1U11 F Γ 1 F .

Therefore, we know

lim t u 1(t) = Γ 1 1 2 F and lim t U11(t) = Γ 1 1

Trained Transformers Learn Linear Models In-Context

Appendix B. Proof of Theorem 5

In this section, we prove Theorem 5, which characterizes the excess risk of the prediction of a trained LSA layer with respect to the risk of best linear predictor, on a new task which is possibly non-linear. First, we restate the theorem.

Theorem 5 Let D be a distribution over (x, y) Rd R, whose marginal distribution on x is Dx = N(0, Λ). Assume ED[y], ED[xy], ED[y2xx ] exist and are ﬁnite. Assume the test

prompt is of the form P = (x1, y1, . . . , x M, y M, xquery), where (xi, yi), (xquery, yquery) i.i.d. D. Let f LSA be the LSA model with parameters W PV and W KQ in (11), and byquery is the prediction for xquery given the prompt. If we deﬁne

a := Λ 1E(x,y) D [xy] , Σ := E(x,y) D h xy E (xy) xy E (xy) i , (15)

then, for Γ = Λ + 1

N tr(Λ)Id. we have,

E (byquery yquery)2 = min w Rd E ( w, xquery yquery)2

| {z } Error of best linear predictor

M tr ΣΓ 2Λ + 1

h a 2 Γ 2Λ3 + 2 tr(Λ) a 2 Γ 2Λ2 + tr(Λ)2 a 2 Γ 2Λ i ,

where the expectation is over (xi, yi), (xquery, yquery) i.i.d. D.

Proof Unless otherwise speciﬁed, we use E to denote the expectation over (xi, yi) and

(xquery, yquery) i.i.d. D. Since when (x, y) D, we assume E[x], E[y], E[xy], E[xx ], E[y2xx ] exist, we know that E ( w, xquery yquery)2 exists for each w Rd. We denote

a := arg min w Rd E ( w, xquery yquery)2

as the weight of the best linear approximator. Actually, if we denote the function inside the minimum above as R(w), we can write it as

R(w) = w Λw 2E yquery x query w + Ey2 query.

Since the Hessian matrix 2 w w R(w) is Λ, which is positive deﬁnitive, we know that this function is strictly convex and hence, the global minimum can be achieved at the unique ﬁrst-order stationary point. This is

a = Λ 1E (yquery xquery) . (47)

We also deﬁne a similar vector for ease of computation:

b = Γ 1E (yquery xquery) . (48)

Zhang, Frei, Bartlett

Therefore, we can decompose the risk as

E (byquery yquery)2 = E ( a, xquery yquery)2 | {z } I

+ E (byquery b, xquery )2 | {z } II + E ( b, xquery a, xquery )2 | {z } III

+ 2E (byquery b, xquery ) ( a, xquery yquery) | {z } IV + 2E (byquery b, xquery ) ( b, xquery a, xquery ) | {z } V + 2E ( b, xquery a, xquery ) ( a, xquery yquery) | {z } VI

The term I is the ﬁrst term on the right hand side of (16). So it suﬃces to calculate II to VI.

First, from the tower property of conditional expectation, we have

V = 2E E (byquery b, xquery ) ( b, xquery a, xquery ) xquery

= 2E E byquery b, xquery xquery

( b, xquery a, xquery ) = 0,

E byquery b, xquery xquery

i=1 yiΓ 1xi b

xquery = 0.

Similarly, for IV, we have

IV = 2E (byquery b, xquery ) ( a, xquery yquery)

= 2E E (byquery b, xquery ) ( a, xquery yquery) xquery, yquery

= 2E E byquery b, xquery xquery, yquery

( a, xquery yquery)

For VI, we have

VI = 2E tr h (b a) ( a, xquery yquery) x query i

= 2 tr h (b a)a Λ i 2 tr h (b a)E yqueryx query i = 0,

where the last line comes from the deﬁnition of a. Therefore, all cross terms vanish and it suﬃces to consider II and III.

Trained Transformers Learn Linear Models In-Context

From the deﬁnition II is equal to

i=1 yixi E (yquery xquery)

Γ 1xqueryx queryΓ 1 1 M

i=1 yixi E (yquery xquery)

i=1 yixi E (yquery xquery)

i=1 yixi E (yquery xquery)

(property of trace and the fact that Γ and Λ commute)

i,j=1 E tr n (yixi E (yquery xquery)) (yjxj E (yquery xquery)) Γ 2Λ o

M E tr n (y1x1 E (yquery xquery)) (y1x1 E (yquery xquery)) Γ 2Λ o

(all cross terms vanish due to the independence of xi)

M tr ΣΓ 2Λ .

The last line comes from the deﬁnition of Σ.

For III, we have

III = E(b a) xqueryx query(b a) = a Λ(Γ 1 Λ 1)Λ(Γ 1 Λ 1)Λa

= tr h I ΓΛ 1 2 Γ 2Λ3aa i

(property of trace and the fact that Γ and Λ commute)

= 1 N2 tr h Id + tr(Λ)Λ 1 2 Γ 2Λ3aa i

h tr(Γ 2Λ3aa ) + 2 tr(Λ) tr(Γ 2Λ2aa ) + tr(Λ)2 tr(Γ 2Λaa ) i .

Combining all terms above, we conclude.

Appendix C. Proof of Theorem 8

The proof of Theorem 8 is very similar to that of Theorem 4. The ﬁrst step is to explicitly write out the dynamical system. In order to do so, we notice that the Lemma 9 does not depend on the training data and data-generaing distribution and hence, it still holds in the case of a random covariance matrix. Therefore, we know when we input the embedding matrix Eτ to the linear self-attention layer with parameter θ = (W KQ, W PV ), the prediction will be byquery(Eτ; θ) = u Hτu,

where the matrix Hτ is deﬁned as,

2Xτ EτE τ N

R(d+1)2 (d+1)2, Xτ =

0d d xτ,query

(xτ,query) 0

R(d+1) (d+1)

Zhang, Frei, Bartlett

u = Vec(U) R(d+1)2, U =

R(d+1) (d+1),

where U11 = W KQ 11 Rd d, u12 = w PV 21 Rd 1, u21 = w KQ 21 Rd 1, u 1 = w PV 22 R correspond to particular components of W PV and W KQ, deﬁned in (5).

C.1 Dynamical system

The next lemma gives the dynamical system when the covariance matrices in the prompts are i.i.d. sampled from some distribution. Notice that in the lemma below, we do not assume Λτ are almost surely diagonal. The case when the covariance matrices are diagonal can be viewed as a special case of the following lemma.

Lemma 21 Consider gradient ﬂow on (20) with respect to u starting from an initial value that satisﬁes Assumption 3. We assume the covariance matrices Λτ are sampled from some distribution with ﬁnite third moment and Λτ are positive deﬁnite almost surely. We denote

u = Vec (U) := Vec

N tr(Λτ)Id Rd d.

Then the dynamics of U follows

d dt U11(t) = u2 1E [ΓτΛτU11Λτ] + u 1E Λ2 τ

d dtu 1(t) = u 1 tr E h ΓτΛτU11Λτ(U11) i + tr E Λ2 τ (U11) , (49)

and u12(t) = 0d, u21(t) = 0d for all t 0.

Proof This lemma is a natural corollary of Lemma 10. Notice that Lemma 10 holds for any ﬁxed positive deﬁnite Λτ. So when Λτ is random, if we condition on Λτ, the dynamical system will be

d dt U11(t) = u2 1 [ΓτΛτU11Λτ] + u 1 Λ2 τ

d dtu 1(t) = u 1 tr h ΓτΛτU11Λτ(U11) i + tr Λ2 τ (U11) , (50)

and u12(t) = 0d, u21(t) = 0d for all t 0. Then, we conclude by simply taking expectation over Λτ.

The lemma above gives the dynamical system with general random covariance matrix. When Λτ are diagonal almost surely, we can actually simplify the dynamical system above. In this case, we have the following corollary.

Trained Transformers Learn Linear Models In-Context

Corollary 22 Under the assumptions of Lemma 21, we further assume the covariance matrix Λτ to be diagonal almost surely. We denote uij(t) R as the (i, j)-th entry of U11(t), and further denote

N λ3 τ,i + 1

ξi = E λ2 τ,i ,

N λ2 τ,iλτ,j + 1

for i, j [d], where the expectation is over the distribution of Λτ. Then, the dynamical system (49) is equivalent to

d dtuii(t) = γiu2 1uii + ξiu 1 i [d],

d dtuij(t) = ζiju2 1uij i = j [d],

d dtu 1(t) =

γiu 1u2 ii X

i =j ζiju 1u2 ij +

i=1 [ξiuii] .

Proof This is directly obtained by rewriting the equation for each entry of U11 and recalling the assumption that Λτ (and hence Γτ) is diagonal almost surely.

C.2 Loss function and global minima

As in the proof of Theorem 4, we can actually recover the loss function in the random covariance case, up to a constant.

Lemma 23 The diﬀerential equations in (52) are equivalent to gradient ﬂow on the loss function

ℓrdm(U11, u 1) = E tr 1

2u2 1ΓτΛτU11Λτ(U11) u 1Λ2 τ(U11)

γiu2 1u2 ii + 1

i =j ζiju2 1u2 ij

i=1 [ξiuiiu 1] (53)

with respect to uij i, j [d] and u 1, from an initial value that satisﬁes Assumption 3.

Proof This can be veriﬁed by simply taking gradient of ℓrdm to show that

d dtuii = ℓrdm

uii i [d], d dtuij = ℓrdm

uij i = j [d], d dtu 1 = ℓrdm

Zhang, Frei, Bartlett

Next, we solve for the minimum of ℓrdm and give the expression for all global minima.

Lemma 24 Let ℓrdm be the loss function in (53). We denote

min ℓrdm := min U11 Rd d,u 1 R ℓrdm (U11, u 1) .

Then, we have

min ℓrdm = 1

ξ2 i γi (54)

ℓrdm(U11, u 1) min ℓrdm = 1

i =j ζiju2 1u2 ij. (55)

Moreover, denoting uij as the (i, j)-entry of U11, all global minima of ℓrdm satisfy

u 1 uij = I(i = j) ξi

Proof From the deﬁnition of ℓrdm, we have

i =j ζiju2 1u2 ij 1

The equation holds when uij = 0 for i = j [d] and u 1uii = ξi

γi for each i [d]. This

can be achieved by simply letting u 1 = 1 and uii = ξi

γi for i [d]. Of course, when we replace (u 1, uii) with (cu 1, c 1uii) for any constant c = 0, we can also achieve this global minimum.

C.3 PL Inequality and global convergence

Finally, to end the proof, we prove a Polyak Lojasiewicz Inequality on the loss function ℓrdm, and then prove global convergence. Before that, let s ﬁrst prove the balanced condition of parameters will hold during the whole trajectory.

Lemma 25 (Balanced condition) Under the assumptions of Lemma 21, for any t 0, it holds that u2 1 = tr h U11(U11) i . (57)

Proof The proof is similar to the proof of Lemma 17. From Lemma 10, we multiply the ﬁrst equation in (49) by (U11) from the right to get d

dt U11(t) (U11) = u2 1E h ΓτΛτU11Λτ(U11) i + u 1E h Λ2 τ(U11) i .

Trained Transformers Learn Linear Models In-Context

Also we multiply the second equation in Lemma 49 by u 1 to obtain d

dtu 1(t) u 1(t) = u2 1 tr E h ΓτΛτU11Λτ(U11) i + u 1 tr E Λ2 τ (U11) ,

Therefore, we have

dt U11(t) (U11(t)) = d

dtu 1(t) u 1(t).

Taking the transpose of the equation above and adding to itself gives

d dt tr h U11(t)(U11(t)) i = d

dt u 1(t)2 .

Notice that from Assumption 3, we know that

u 1(0)2 = σ2 = σ2 tr h ΘΘ ΘΘ i = tr h U11(0)(U11(0)) i .

So for any time t 0, the equation holds.

Next, similar to the proof of Theorem 4, we prove that, as long as the initial scale is small enough, u 1 will be positive along the whole trajectory and can be lower bounded by a positive constant, which implies that the trajectories will be away from the saddle point at the origin.

Lemma 26 We do gradient ﬂow on ℓrdm with respect to ui,j ( i, j [d]) and u 1. Suppose the initialization satisﬁes Assumption 3 with initial scale

v u u t 2 EΛτΘ 2 F

d h E Γτ op Λτ 2 F i, (58)

then for any t 0, it holds that u 1(t) > 0. (59)

Proof From the dynamics of gradient ﬂow, we know the loss function ℓrdm is non-increasing:

Since we assume U11(0) = ΘΘ , we know the loss function at t = 0 is

ℓrdm(U11(0), u 1(0)) = E tr σ4

2 ΓτΛτΘΘ ΛτΘΘ σ2Λ2 τΘΘ .

From the property of trace, we know

E tr h σ2Λ2 τΘΘ i = σ2 EΛτΘ 2 F .

Zhang, Frei, Bartlett

From Von-Neumann s trace inequality and the assumption that ΘΘ F = 1, we know

2 ΓτΛτΘΘ ΛτΘΘ

d 2 E Γτ op ΛτΘΘ ΛτΘΘ F

h E Γτ op Λτ 2 F i = σ4

h E Γτ op Λτ 2 F i .

From the assumptions on Θ and Λτ we know EΛτΘ = 0d d and E Γτ op Λτ 2 F > 0. Therefore, comparing the two displays above, we know when (58) holds, we must have ℓrdm(0) < 0. So from the non-increasing property of the loss function, we know ℓrdm(t) < 0 for any time t 0. Notice that when u 1 = 0, the loss function is also zero, which suggests that u 1(t) = 0 for any time t 0. Since u 1(0) > 0 and the trajectory of u 1 must be continuous, we know that it stays positive at all times.

Lemma 27 We do gradient ﬂow on ℓrdm with respect to ui,j ( i, j [d]) and u 1. Suppose the initialization satisﬁes Assumption 3 and the initial scale satisﬁes (58). Then, for any t 0, it holds that

h 2 EΛτΘ 2 F

dσ2 h E Γτ op Λτ 2 F ii > 0. (60)

Proof From the dynamics of gradient ﬂow, we know ℓrdm is non-increasing (see the proof of Lemma 26). Recall the deﬁnition of the loss function:

ℓrdm(U11, u 1) = E tr 1

2u2 1ΓτΛτU11Λτ(U11) u 1Λ2 τ(U11) .

Since Λτ commutes with Γτ and they are both positive deﬁnite almost surely, we know that ΓτΛτ 0d d almost surely from Lemma 29. Again, since U11Λτ(U11) 0d d almost surely, from Lemma 29 we have tr 1

2u2 1ΓτΛτU11Λτ(U11) 0 almost surely. Therefore, we have

ℓrdm(U11, u 1) E tr h u 1Λ2 τ(U11) i = tr h u 1 EΛ2 τ (U11) i .

From Von Neumann s trace inequality (Lemma 31) and the fact that u 1(t) > 0 for any t 0 (Lemma 26), we know ℓrdm(U11(t), u 1(t))

du 1 EΛ2 τ op U11 F . From Lemma

25, we know u2 1 = tr(U11(U11) ) = U11 2 F . Since u 1(t) > 0 for any time, we know actually u 1(t) = U11(t) F . So we have

ℓrdm(U11(t), u 1(t))

du 1(t)2 EΛ2 τ op .

Trained Transformers Learn Linear Models In-Context

From the proof of Lemma 26, we know

ℓrdm(U11(t), u 1(t)) ℓrdm(U11(0), u 1(0)) σ4

h E Γτ op Λτ 2 F i σ2 EΛτΘ 2 F .

Combine the two preceding displays above, we have

h 2 EΛτΘ 2 F

dσ2 h E Γτ op Λτ 2 F ii > 0.

The last inequality comes from Lemma 26.

Finally, we prove the PL Inequality, which naturally leads to the global convergence.

Lemma 28 We do gradient ﬂow on ℓrdm with respect to ui,j ( i, j [d]) and u 1. Suppose the initialization satisﬁes Assumption 3 and the initial scale satisﬁes (58). If we denote

η = min {γi, i [d]; ζij, i = j [d]}

h 2 EΛτΘ 2 F

dσ2 h E Γτ op Λτ 2 F ii > 0, (61)

then for any t 0, it holds that

ℓrdm(U11, u 1) 2 2 :=

2 ν (ℓrdm min ℓrdm) . (62)

Additionally, ℓrdm converges to the global minimal value, uij and u 1 converge to the following limits,

lim t uij(t) = I(i = j)

γi i [d], lim t u 1(t) =

Translating back to the original parameterization, we have this is equivalent to

lim t W KQ(t) =

EΓτΛ2 τ 1 E Λ2 τ 1

F EΓτΛ2 τ 1 E Λ2 τ 0d

lim t W PV (t) =

0 d EΓτΛ2 τ 1 E Λ2 τ

where Γτ = N+1

N tr(Λτ)Id Rd d and E is over Λτ.

Zhang, Frei, Bartlett

Proof First, we prove the PL Inequality. From Lemma 24, we know

ℓrdm(U11, u 1) min ℓrdm = 1

i =j ζiju2 1u2 ij,

where ξi, ζij, γi are deﬁned in (51). Meanwhile, we calculate the square norm of the gradient of ℓrdm:

ℓrdm(U11, u 1) 2 2 :=

i=1 γ2 i u2 1

i =j ζ2 iju4 1u2 ij.

Comparing the two displays above, we know that in order to ensure that ℓrdm 2 2 ν (ℓrdm min ℓrdm) , it suﬃces to make

γiu 1(t)2 ν

ζiju 1(t)2 ν

2 i = j [d].

We deﬁne η := min {γi, ζij, i = j [d]} , then it is suﬃcient to make

From Lemma 27, we know that we can actually lower bound u 1 from below by a positive constant. Then, the inequality holds if we take

h 2 EΛτΘ 2 F

dσ2 h E Γτ op Λτ 2 F ii > 0.

Therefore, as long as we take ν as above, a PL inequality holds for ℓrdm. With an abuse of notation, let us write ℓrdm(t) = ℓrdm(U11(t), u 1(t)). Then, from the dynamics of gradient ﬂow and the PL Inequality ((62)), we know

d dt [ℓrdm(t) min ℓrdm] = ℓrdm(t) 2 2 ν (ℓrdm(t) min ℓrdm) ,

which by Gr onwall s inequality implies

0 ℓrdm(t) min ℓrdm exp( νt) [ℓrdm(0) min ℓrdm] 0

when t . From Lemma 24, we know

i =j ζiju2 1u2 ij 0 when t .

Trained Transformers Learn Linear Models In-Context

This implies

uiju 1 0 i = j [d]. (64)

We take square of uii(t)u 1(t) and uij(t)u 1(t), then sum over all i, j [d]. Then, we

get u2 1 Pd i,j=1 u2 ij Pd i=1 ξ2 i γ2 i . From Lemma 25, we know for any t 0, u 1(t)2 =

tr U11(U11) = Pd i,j=1 u2 ij. So we have

u 1(t)4 = u2 1

i,j=1 u2 ij

ξ2 i γ2 i ,

which implies

when t . Combining (64) and (65), we conclude

uij(t) 0 i = j [d], uii(t)

Appendix D. Technical lemmas

Lemma 29 (Petersen and Pedersen, 2008) We denote A, B, X as matrices and x as vectors. Then, we have

x = B + B x.

Vec(AXB) = B A Vec(X).

tr A B = Vec(A) Vec(B).

X tr XBX = XB + XB.

X tr AX = A.

X tr AXBX C = A C XB + CAXB.

Lemma 30 If X is Gaussian random vector of d dimension, mean zero and covariance matrix Λ, and A Rd d is a ﬁxed matrix. Then

E h XX AXX i = Λ A + A Λ + tr(AΛ)Λ.

Zhang, Frei, Bartlett

Proof We denote X = (X1, ..., Xd) . Then,

XX AXX = X(X AX)X =

i,j=1 Aij Xi Xj

So we know (XX AXX )k,l = Pd i,j=1 Aij Xi Xj Xk Xl. From Isserlis Theorem in probability theory (Theorem 1.1 in Michalowicz et al. (2009), originally proposed in Wick (1950)), we know for any i, j, k, l [d], it holds that

E Xi Xj Xk Xl = ΛijΛkl + ΛikΛjl + ΛilΛjk.

Then, we have for any ﬁxed k, l [d],

E(XX AXX )k,l =

i,j=1 AijΛijΛkl + AijΛikΛjl + AijΛilΛjk

= tr(AΛ)Λkl + Λ k (A + A )Λl.

Therefore, we know

E(XX AXX ) = Λ A + A Λ + tr(AΛ)Λ.

Lemma 31 (Von-Neumann s Trace Inequality) Let U, V Rd n with d n. We have

i=1 σi(U)σi(V ) U op

where σ1(X) σ2(X) σd(X) are the ordered singular values of X Rd n.

Lemma 32 ((Meenakshi and Rajian, 1999)) For any two positive semi-deﬁnitive matrices A, B Rd d, we have

AB 0 if and only if A and B commute.

Trained Transformers Learn Linear Models In-Context

Appendix E. Experiment details

In this section, we provide more details for the experiment in Figure 1. Our experimental setup is based on the codebase provided by Garg et al. (2022), with a modiﬁcation that allows for the possibility that the covariate distribution changes across prompts. We use the standard GPT2 architecture with embedding size 256, 12 layers and 8 heads (Radford et al., 2018) as implemented by Hugging Face (Wolf et al., 2020). For the GPT2 models, we use the embedding method proposed by Garg et al. (2022), where instead of concatenating x and y into a single token, they are treated as separate tokens. It is also worth noting that the training objective function for the GPT2 model is diﬀerent than those we consider for the linear self-attention network: for the GPT2 model, the objective function is the average over the full length of the context sequence (predictions for each xi using (xk, yk)k<i), while in our setting the objective function is only for the ﬁnal query point. However, in the ﬁgure, for both GPT2 and the linear self-attention model the error plotted corresponds to the error for predicting the ﬁnal query point. In all experiments, covariates are sampled from a mean-zero Gaussian in d = 20 dimensions with either ﬁxed or random covariance matrix. For the ﬁxed covariance case, we ﬁx the covariance matrix to be identity; for the random case, the covariance matrices are restricted to be diagonal and all diagonal entries are i.i.d. sampled from the standard exponential distribution. The linear weights in all tasks are i.i.d. sampled from standard Gaussian distribution and also independently from all covariates. We trained the model for 500000 steps using Adam (Kingma and Ba, 2014) with a batch size of 64 and learning rate of 0.0001. We use the same curriculum strategy of Garg et al. (2022) for acceleration. For testing the trained model, we used ordinary least squares as a baseline which is optimal for noiseless linear regression tasks. For prompts at test time, covariates are sampled i.i.d. from a mean-zero Gaussian distribution. For the ﬁxed-covariance evaluation, the covariance is the identity matrix. In the random-covariance evaluation, the covariance is a random diagonal matrix with diagonal entries sampled from the standard exponential distribution, multiplied by a scaling coeﬃcient c {1, 4, 9}, i.e. for each task τ, the covariance matrix in the ranxdom case is

Λτ = c diag (λτ,1, ..., λτ,d)

where λτ,i i.i.d. Exponential(1) for any τ and i [d]. The plots in Figure 1 show the error averaged over 642 prompts, where we sample 64 covariance matrices for each curve and 64 prompts for each covariance matrix. We compute 90% conﬁdence intervals over 1000 bootstrap trials for each test.

Jacob Abernethy, Alekh Agarwal, Teodor V. Marinov, and Manfred K. Warmuth. A mechanism for sample-eﬃcient in-context learning for sparse retrieval tasks. Preprint, ar Xiv:2305.17040, 2023.

Kwangjun Ahn, Xiang Cheng, Hadi Daneshmand, and Suvrit Sra. Transformers learn to implement preconditioned gradient descent for in-context learning. Preprint, ar Xiv:2306.00297, 2023.

Zhang, Frei, Bartlett

Kabir Ahuja, Madhur Panwar, and Navin Goyal. In-context learning through the Bayesian prism. ar Xiv preprint ar Xiv:2306.04891, 2023.

Kartik Ahuja and David Lopez-Paz. A closer look at in-context learning under distribution shifts. Preprint, ar Xiv:2305.16704, 2023.

Ekin Aky urek, Dale Schuurmans, Jacob Andreas, Tengyu Ma, and Denny Zhou. What learning algorithm is in-context learning? Investigations with linear models. ar Xiv preprint ar Xiv:2211.15661, 2022.

Cem Anil, Yuhuai Wu, Anders Johan Andreassen, Aitor Lewkowycz, Vedant Misra, Vinay Venkatesh Ramasesh, Ambrose Slone, Guy Gur-Ari, Ethan Dyer, and Behnam Neyshabur. Exploring length generalization in large language models. In Advances in Neural Information Processing Systems (Neur IPS), 2022.

Sanjeev Arora, Nadav Cohen, and Elad Hazan. On the optimization of deep networks: Implicit acceleration by overparameterization. In International Conference on Machine Learning, pages 244 253, 2018.

Sanjeev Arora, Nadav Cohen, Wei Hu, and Yuping Luo. Implicit regularization in deep matrix factorization. Advances in Neural Information Processing Systems, 32, 2019.

Shahar Azulay, Edward Moroshko, Mor Shpigel Nacson, Blake E Woodworth, Nathan Srebro, Amir Globerson, and Daniel Soudry. On the implicit bias of initialization shape: Beyond inﬁnitesimal mirror descent. In International Conference on Machine Learning, pages 468 477, 2021.

Yu Bai, Fan Chen, Huan Wang, Caiming Xiong, and Song Mei. Transformers as statisticians: Provable in-context learning with in-context algorithm selection. Preprint, ar Xiv:2306.04637, 2023.

Mohamed Ali Belabbas. On implicit regularization: Morse functions and applications to matrix factorization. ar Xiv preprint ar Xiv:2001.04264, 2020.

Satwik Bhattamishra, Arkil Patel, and Navin Goyal. On the computational power of transformers and its implications in sequence modeling. ar Xiv preprint ar Xiv:2006.09286, 2020.

Yuejie Chi, Yue M Lu, and Yuxin Chen. Nonconvex optimization meets low-rank matrix factorization: An overview. IEEE Transactions on Signal Processing, 67(20):5239 5269, 2019.

Damai Dai, Yutao Sun, Li Dong, Yaru Hao, Zhifang Sui, and Furu Wei. Why can GPT learn in-context? Language models secretly perform gradient descent as meta optimizers. ar Xiv preprint ar Xiv:2212.10559, 2022.

Zihang Dai, Zhilin Yang, Yiming Yang, Jaime Carbonell, Quoc V. Le, and Ruslan Salakhutdinov. Transformer-XL: Attentive language models beyond a ﬁxed-length context. In Association for Computational Linguistics (ACL), 2019.

Trained Transformers Learn Linear Models In-Context

Mostafa Dehghani, Stephan Gouws, Oriol Vinyals, Jakob Uszkoreit, and Lukasz Kaiser. Universal transformers. ar Xiv preprint ar Xiv:1807.03819, 2019.

Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. In International Conference on Learning Representations (ICLR), 2021.

Simon S Du, Wei Hu, and Jason D Lee. Algorithmic regularization in learning deep homogeneous models: Layers are automatically balanced. Advances in Neural Information Processing Systems, 31, 2018.

Benjamin L Edelman, Surbhi Goel, Sham Kakade, and Cyril Zhang. Inductive biases and variable creation in self-attention mechanisms. In International Conference on Machine Learning, 2022.

Shivam Garg, Dimitris Tsipras, Percy Liang, and Gregory Valiant. What can transformers learn in-context? A case study of simple function classes. ar Xiv preprint ar Xiv:2208.01066, 2022.

Suriya Gunasekar, Blake E Woodworth, Srinadh Bhojanapalli, Behnam Neyshabur, and Nati Srebro. Implicit regularization in matrix factorization. Advances in Neural Information Processing Systems, 30, 2017.

Chi Han, Ziqi Wang, Han Zhao, and Heng Ji. In-context learning of large language models explained as kernel regression, 2023.

Samy Jelassi, Michael Sander, and Yuanzhi Li. Vision transformers provably learn spatial structure. Advances in Neural Information Processing Systems, 35:37822 37836, 2022.

Jikai Jin, Zhiyuan Li, Kaifeng Lyu, Simon S Du, and Jason D Lee. Understanding incremental learning of gradient descent: A ﬁne-grained analysis of matrix sensing. ar Xiv preprint ar Xiv:2301.11500, 2023.

Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. ar Xiv preprint ar Xiv:1412.6980, 2014.

Shuai Li, Zhao Song, Yu Xia, Tong Yu, and Tianyi Zhou. The closeness of in-context learning and weight shifting for softmax regression. ar Xiv preprint ar Xiv:2304.13276, 2023a.

Yingcong Li, M Emrullah Ildiz, Dimitris Papailiopoulos, and Samet Oymak. Transformers as algorithms: Generalization and stability in in-context learning. ar Xiv preprint ar Xiv:2301.07067, 2023b.

Yuanzhi Li, Tengyu Ma, and Hongyang Zhang. Algorithmic regularization in overparameterized matrix sensing and neural networks with quadratic activations. In Conference On Learning Theory, pages 2 47, 2018.

Zhang, Frei, Bartlett

Yuchen Li, Yuanzhi Li, and Andrej Risteski. How do transformers learn topic structure: Towards a mechanistic understanding. ar Xiv preprint ar Xiv:2303.04245, 2023c.

Zhiyuan Li, Yuping Luo, and Kaifeng Lyu. Towards resolving the implicit bias of gradient descent for matrix factorization: Greedy low-rank learning. ar Xiv preprint ar Xiv:2012.09839, 2020.

Valerii Likhosherstov, Krzysztof Choromanski, and Adrian Weller. On the expressive power of self-attention matrices. ar Xiv preprint ar Xiv:2106.03764, 2021.

Bingbin Liu, Jordan T. Ash, Surbhi Goel, Akshay Krishnamurthy, and Cyril Zhang. Transformers learn shortcuts to automata. In International Conference on Learning Representations (ICLR), 2023.

AR Meenakshi and C Rajian. On a product of positive semideﬁnite matrices. Linear algebra and its applications, 295(1-3):3 6, 1999.

JV Michalowicz, JM Nichols, F Bucholtz, and CC Olson. An Isserlis theorem for mixed Gaussian variables: Application to the auto-bispectral density. Journal of Statistical Physics, 136:89 102, 2009.

Sewon Min, Xinxi Lyu, Ari Holtzman, Mikel Artetxe, Mike Lewis, Hannaneh Hajishirzi, and Luke Zettlemoyer. Rethinking the role of demonstrations: What makes in-context learning work? ar Xiv preprint ar Xiv:2202.12837, 2022.

Open AI. Gpt-4 technical report. ar Xiv preprint ar Xiv:2303.08774, 2023.

Jorge P erez, Javier Marinkovi c, and Pablo Barcel o. On the Turing completeness of modern neural network architectures. ar Xiv preprint ar Xiv:1901.03429, 2019.

Kaare Brandt Petersen and Michael Syskind Pedersen. The matrix cookbook. Technical University of Denmark, 7(15):510, 2008.

Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. Improving language understanding by generative pre-training. Technical report, Open AI, 2018. URL https://cdn.openai.com/research-covers/language-unsupervised/ language_understanding_paper.pdf.

Alec Radford, Jeﬀrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. Language models are unsupervised multitask learners. Technical report, Open AI, 2019. URL https://cdn.openai.com/better-language-models/language_models_ are_unsupervised_multitask_learners.pdf.

Mahdi Soltanolkotabi, Dominik St oger, and Changzhi Xie. Implicit balancing and regularization: Generalization and convergence guarantees for overparameterized asymmetric matrix sensing. ar Xiv preprint ar Xiv:2303.14244, 2023.

Asher Trockman and J Zico Kolter. Mimetic initialization of self-attention layers. ar Xiv preprint ar Xiv:2305.09828, 2023.

Trained Transformers Learn Linear Models In-Context

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in Neural Information Processing Systems, 30, 2017.

Johannes von Oswald, Eyvind Niklasson, Ettore Randazzo, Jo ao Sacramento, Alexander Mordvintsev, Andrey Zhmoginov, and Max Vladymyrov. Transformers learn in-context by gradient descent. ar Xiv preprint ar Xiv:2212.07677, 2022.

Xinyi Wang, Wanrong Zhu, and William Yang Wang. Large language models are implicitly topic models: Explaining and ﬁnding good demonstrations for in-context learning. ar Xiv preprint ar Xiv:2301.11916, 2023.

Gian-Carlo Wick. The evaluation of the collision matrix. Physical review, 80(2):268, 1950.

Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, R emi Louf, and Morgan Funtowicz. Transformers: State-of-the-art natural language processing. In Proceedings of the 2020 conference on empirical methods in natural language processing: system demonstrations, pages 38 45, 2020.

Sang Michael Xie, Aditi Raghunathan, Percy Liang, and Tengyu Ma. An explanation of in-context learning as implicit Bayesian inference. ar Xiv preprint ar Xiv:2111.02080, 2021.

Chulhee Yun, Srinadh Bhojanapalli, Ankit Singh Rawat, Sashank J Reddi, and Sanjiv Kumar. Are transformers universal approximators of sequence-to-sequence functions? ar Xiv preprint ar Xiv:1912.10077, 2019.

Chulhee Yun, Yin-Wen Chang, Srinadh Bhojanapalli, Ankit Singh Rawat, Sashank Reddi, and Sanjiv Kumar. O(n) connections are expressive enough: Universal approximability of sparse transformers. Advances in Neural Information Processing Systems, 33:13783 13794, 2020.

Yufeng Zhang, Fengzhuo Zhang, Zhuoran Yang, and Zhaoran Wang. What and how does incontext learning learn? Bayesian model averaging, parameterization, and generalization. Preprint, ar Xiv:2305.19420, 2023.