In-context Convergence of Transformers

Yu Huang¹, Yuan Cheng², Yingbin Liang³

¹Department of Statistics and Data Science, Wharton School, University of Pennsylvania, Philadelphia, PA, USA. ²National University of Singapore, Singapore. ³Department of Electrical and Computer Engineering, The Ohio State University, Columbus, OH, USA. Correspondence to: Yingbin Liang.

Proceedings of the 41st International Conference on Machine Learning, Vienna, Austria. PMLR 235, 2024. Copyright 2024 by the author(s).

Abstract

Transformers have recently revolutionized many domains in modern machine learning, and one salient discovery is their remarkable in-context learning capability, where models can solve an unseen task by utilizing task-specific prompts without further parameter fine-tuning. This has also inspired recent theoretical studies aiming to understand the in-context learning mechanism of transformers, which, however, focused only on linear transformers. In this work, we take the first step toward studying the learning dynamics of a one-layer transformer with softmax attention trained via gradient descent in order to in-context learn linear function classes. We consider a structured data model, where each token is randomly sampled from a set of feature vectors in either a balanced or an imbalanced fashion. For data with balanced features, we establish a finite-time convergence guarantee with near-zero prediction error by navigating our analysis over two phases of the training dynamics of the attention map. More notably, for data with imbalanced features, we show that the learning dynamics take a stage-wise convergence process, where the transformer first converges to a near-zero prediction error for the query tokens of dominant features, and later converges to a near-zero error for query tokens of under-represented features, via one and four training phases, respectively. Our proof features new techniques for analyzing the competing strengths of two types of attention weights, the change of which determines the different training phases.

1. Introduction

Transformers (Vaswani et al., 2017) have emerged as the foundational architectures in various domains, including natural language processing (Devlin et al., 2018; OpenAI, 2023), computer vision (Dosovitskiy et al., 2020; He et al., 2022), reinforcement learning (Chen et al., 2021; Janner et al., 2021), and so on. Recently, large language models (LLMs) based on transformers have exhibited remarkable in-context learning capabilities, where the model can solve a new task solely through inference based on prompts of the task without further fine-tuning (Brown et al., 2020). Such striking abilities have inspired a recent line of research to understand the underlying mechanisms of in-context learning from various aspects (Garg et al., 2022; Min et al., 2022; Wei et al., 2023; Von Oswald et al., 2023; Xie et al., 2021). Among these studies, the pioneering work of Garg et al. (2022) empirically studied in-context learning via an interpretable framework, highlighting the capacity of transformers to acquire in-context knowledge of linear and some more complex function classes. Specifically, they showed that a model trained in-context over a function class $\mathcal{F}$ can accurately predict the function value $f(x_{\rm query})$ of a new query token $x_{\rm query}$ for most $f \in \mathcal{F}$ by using a prompt sequence that includes in-context input-label pairs along with the query token, $(x_1, f(x_1), \ldots, x_N, f(x_N), x_{\rm query})$.
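To make this prompt format concrete, the following minimal sketch (our own illustration, not code from any of the cited works) generates one such prompt for the linear function class; the choice of a standard Gaussian input distribution $D_X$ and all default values here are assumptions for illustration only, whereas the structured, feature-based data model used in this paper is introduced in Section 2.1.

```python
import numpy as np

def sample_prompt(d=16, N=40, rng=np.random.default_rng(0)):
    """Sample one prompt (x_1, f(x_1), ..., x_N, f(x_N), x_query) for a random
    linear function f(x) = <w, x>; inputs are assumed standard Gaussian here."""
    w = rng.standard_normal(d)            # task weight with zero mean and identity covariance
    xs = rng.standard_normal((N, d))      # in-context inputs x_1, ..., x_N
    ys = xs @ w                           # labels y_i = f(x_i) = <w, x_i>
    x_query = rng.standard_normal(d)      # query token
    return xs, ys, x_query, float(w @ x_query)  # prompt pieces and the target f(x_query)

xs, ys, x_query, target = sample_prompt()
print(xs.shape, ys.shape, x_query.shape, target)
```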
Built on this theoretically amenable setting, many follow-up works explored theoretical properties of in-context learning of transformers from different perspectives such as expressive power (Aky urek et al., 2022; Giannou et al., 2023), generalization (Li et al., 2023b), internal mechanisms (Von Oswald et al., 2023; Bai et al., 2023), etc. Specially, a few recent studies (Zhang et al., 2023a; Mahankali et al., 2023; Ahn et al., 2023) made interesting progress towards understanding the training dynamics of transformers for incontext learning1. However, those studies focused only on linear transformers, and does not capture the crucial role of the softmax mapping, which lies in the core design of transformers to be advantageous over other network architectures. Therefore, the following fundamental problem still remains largely open: How do softmax-based transformers trained via gradient descent learn in-context? This paper takes the first step toward addressing this problem by investigating the learning dynamics of a single-layer 1More detailed discussions for related work can be found in Appendix A. In-context Convergence of Transformers transformer with softmax attention trained by gradient descent (GD) for in-context learning. We focus on the setting with training prompts generated from linear regression models as in Garg et al. (2022), and with structured input data, where each token is randomly selected from a set of feature vectors {vk}K k=1 with probability {pk}K k=1, respectively. We then train the transformer over the squared loss of prediction error using GD. We study the training dynamics under both balanced and imbalanced feature distributions, and characterize the in-context learning ability for both settings. We highlight our contributions as follows. Our Contributions. We first establish the convergence guarantee for the setting with balanced features, where pk = Θ( 1 K ) for each k [K], and characterize the training evolution of the attention map into a two-phase dynamic process. In phase I, for each k [K], the parameters of the self-attention module undergo fast growth, aligning the query token featuring vk with input tokens featuring vk rapidly disregarding other feature directions. In phase II, the loss of prediction error converges to a near-minimum value. We then prove the convergence for the setting with imbalanced features, where one feature dominates, say v1 with p1 = Θ(1), while others are under-represented with pk = Θ( 1 K ) for k > 1, which serves as a remarkable showcase of the in-context learning capabilities of transformers. We show that the learning dynamics takes a stage-wise convergence process. Initially, the transformer quickly attains near-zero prediction error for query tokens of dominant features, and then converges to near-zero prediction error for query tokens of under-represented features, irrespective of their infrequent occurrence. Our analysis hinges on a novel proof technique that characterizes the softmax attention dynamics via the interplay between two types of bilinear attention weights: weight of query token and its target feature and weight of query token and off-target features . Which weight plays a dominant role in the attention dynamics can change over the learning process, resulting in different training phases. Our analysis tools may be of independent interest and hold the potential to study various other problems involving transformers. Notations. We let [K] := {1, 2, . . . , K}. 
We use capital letters for matrices (e.g., A), and lowercase letters for vectors and scalars (e.g., a). For a matrix A, we use Ai to represent the i-th column of A and Ai:j to indicate a collection of columns spanning from i to j. We use 1{ } to denote the indicator function. We use O(K), Ω(K), and Θ(K) to omit universal constants concerning the variable K. We use poly(K) and polylog(K) to denote large constant-degree polynomials of K and log(K), respectively. Given h(x) 0 and g(x) > 0, we denote h(x) = Ω(g(x)) if there exists some constant C1 > 0 and a1, s.t. |h(x)| C1g(x) for all x a1; h(x) = O(g(x)) if there exist some constant C2 > 0 and a2, s.t. |h(x)| C2g(x) for all x a2; h(x) = Θ(g(x)) if there exists some constant C3, C4 > 0 and a3, s.t. C3g(x) |h(x)| C4g(x) for all x a3. 2. Problem Setup In this section, we present our problem formulations, including the in-context learning framework, one-layer transformer architecture, and the training settings we consider. 2.1. In-Context Learning Framework We adopt the well-established in-context learning framework in Garg et al. (2022). The objective is to enable the training of models capable of in-context learning within a specified function class F, where the functions and input data are sampled respectively by the distributions DF and DX . Specifically, the process is initiated by generating random training prompts as follows. We first sample a random function f from the class according to the distribution DF. We then create a set of random inputs x1, . . . , x N and query xquery , all drawn independently by DX . Finally, we compute the value of function f on these inputs to construct the prompt P = (x1, y1, . . . , x N, y N, xquery), where yi = f(xi). The goal for an in-context learner is to use the prompt to form a prediction by (xquery) for the query such that by (xquery) f (xquery). Task Distribution. In this work, our focus is on the task of linear functions defined as F = f : X R | f(x) = w, x with w Rd, X Rd , which is widely adopted in recent studies for in-context learning (Ahn et al., 2023; Zhang et al., 2023a; Mahankali et al., 2023). For each prompt, the task-specific weight w is independently drawn from a task distribution DΩwith zero mean and identity covariance matrix Id d. Data Distribution DX . We consider a set of distinct features {vk Rd, k = 1, . . . , K}, where all features are orthonormal vectors. Each data point x is sampled from the feature set with the probability pk for sampling vk, where pk (0, 1) for k [K] and P k [K] pk = 1. Such a data model has been widely employed in the theoretical studies of deep learning, including ensemble methods (Allen-Zhu & Li, 2020), multi-modal learning (Huang et al., 2022), vision transformers (Li et al., 2023a), etc. 2.2. One-Layer Transformer Architecture To present the one-layer transformer model we consider, we first introduce the self-attention mechanism (Bahdanau In-context Convergence of Transformers et al., 2014; Vaswani et al., 2017) for the transformer model. Definition 2.1 (Self-Attention (SA) Mechanism). A selfattention layer (Bahdanau et al., 2014; Vaswani et al., 2017) in the single-head case with width de consists of the following components: a key matrix W Key Rde de, a query matrix W Q Rde de, and a value matrix W V Rde de. 
Given a prompt $P$ of length $N$, let $E \in \mathbb{R}^{d_e \times d_N}$ be an embedding matrix of the prompt $P$. The self-attention mechanism outputs
$$F_{SA}\big(E; W^{Key}, W^{Q}, W^{V}\big) = W^{V} E \cdot \mathrm{softmax}\big((W^{Key} E)^{\top} W^{Q} E\big), \qquad (1)$$
where the $\mathrm{softmax}(\cdot)$ function is applied column-wise, i.e., for a vector input $z$, the $i$-th entry of $\mathrm{softmax}(z)$ is given by $e^{z_i}/\sum_{j} e^{z_j}$.

Embeddings. For in-context learning, given a prompt $P = (x_1, y_1, \ldots, x_N, y_N, x_{\rm query})$, a natural token embedding is to stack $x_i \in \mathbb{R}^d$ and $y_i$ into the first $N$ columns. The final column consists of $x_{\rm query} \in \mathbb{R}^d$ and $0$. Formally,
$$E(P) = \begin{pmatrix} x_1 & x_2 & \cdots & x_N & x_{\rm query} \\ y_1 & y_2 & \cdots & y_N & 0 \end{pmatrix} \in \mathbb{R}^{(d+1)\times(N+1)}.$$
Therefore, $d_N = N+1$ and $d_e = d+1$ in the above embedding. Let us further denote the first $d$ rows of $E$ as $E^x(P) \in \mathbb{R}^{d\times(N+1)}$ and the last row of $E$ as $E^y(P) \in \mathbb{R}^{1\times(N+1)}$. Then we write $E(P) = \{E^x(P), E^y(P)\}$. We omit the dependency on $P$ for $E(P)$, $E^x(P)$, and $E^y(P)$ when there is no ambiguity.

We next instantiate additional operations and certain parameter settings based on the general SA mechanism (1) for our one-layer transformer model, to mitigate unnecessary complications in the theoretical analysis while keeping the most critical component of the SA mechanism.

Masking. Let $M(\cdot)$ denote the masking operation, which masks (removes) the last column of the input matrix. In other words, for a given matrix $A \in \mathbb{R}^{(d+1)\times(N+1)}$, $M(A)$ yields $A_{1:N} \in \mathbb{R}^{(d+1)\times N}$. We first mask the embedding matrix $E$ before its multiplication with the key matrix $W^{Key}$ and the value matrix $W^{V}$, which results in $W^{Key} M(E)$ and $W^{V} M(E)$, in order to prevent the query token from attending to itself. This approach has been commonly taken in previous works (Tian et al., 2023; Mahankali et al., 2023; Von Oswald et al., 2023; Kitaev et al., 2020).

Reparameterization. We consolidate the query and key matrices into one matrix denoted as $W^{KQ} \in \mathbb{R}^{(d+1)\times(d+1)}$, as often taken in recent theoretical frameworks (Zhang et al., 2023a; Jelassi et al., 2022; Tian et al., 2023). Furthermore, we consider $W^{V}$ and $W^{KQ}$ in the following specific forms:
$$W^{V} = \begin{pmatrix} 0_{d\times d} & 0_d \\ 0_d^{\top} & \nu \end{pmatrix}, \qquad W^{KQ} = \begin{pmatrix} Q & 0_d \\ 0_d^{\top} & 0 \end{pmatrix},$$
where $\nu \in \mathbb{R}$ and $Q \in \mathbb{R}^{d\times d}$. The above structures of $W^{V}$ and $W^{KQ}$ are inspired by the recent study (Zhang et al., 2023a), which showed that such structured matrices achieve the global optimum in the linear SA model. Furthermore, we set $\nu = 1$ (where $\nu$ is the only parameter in $W^{V}$) and do not update it during training. The reason is twofold: 1) this aligns with the common practice in theoretical studies of deep learning, where the last linear layer is often kept fixed to focus on the analysis of hidden layers, and our objective remains highly nonconvex and challenging even with a fixed $\nu$; and 2) the form of the global optimum outlined in recent work (Zhang et al., 2023a) suggests that for linear SA, the optimal solution for $\nu$ serves as a scaling factor that normalizes the output of linear attention, whereas in our case the output of softmax attention is already inherently normalized.

Remark 2.2 (Nearly no loss of optimality). Despite the specific form of $\{W^{V}, W^{KQ}\}$, the minimum of the loss function $L^* = \Theta(e^{-\mathrm{poly}(K)})$ (as shown in Theorem 3.2) implies that such a specific form incurs at most an error of $\Theta(e^{-\mathrm{poly}(K)})$, which vanishes exponentially with $K$, compared to the minimum loss over the general parameter space $\{W^{V}, W^{Key}, W^{Q}\}$. Therefore, for our nonlinear softmax SA, such a specific parameterization does not lose optimality.
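As a concrete illustration of the embedding, masking, and reparameterization just described, here is a minimal NumPy sketch (our own, not the authors' code) of the resulting one-layer attention module with $\nu = 1$; the helper names and the numerical values in the usage example are ours and purely illustrative.

```python
import numpy as np

def embed(xs, ys, x_query):
    """Stack the prompt into E(P) of shape (d+1, N+1):
    columns 1..N hold (x_i, y_i), the last column holds (x_query, 0)."""
    top = np.column_stack([xs.T, x_query])   # (d, N+1)
    bottom = np.append(ys, 0.0)[None, :]     # (1, N+1)
    return np.vstack([top, bottom])

def masked_softmax_attention(E, Q):
    """One-layer softmax attention with the structured W^V (nu = 1) and
    W^KQ = [[Q, 0], [0, 0]]; the last (query) column is masked out of keys/values."""
    d = Q.shape[0]
    Ex, Ey = E[:d, :], E[d, :]               # E^x and E^y
    logits = Ex[:, :-1].T @ Q @ Ex           # M(E^x)^T Q E^x, shape (N, N+1)
    attn = np.exp(logits - logits.max(axis=0, keepdims=True))
    attn /= attn.sum(axis=0, keepdims=True)  # column-wise softmax over the N unmasked tokens
    out = Ey[:-1] @ attn                     # M(E^y) softmax(...), shape (N+1,)
    return out[-1]                           # prediction for the query = last entry

# Usage: with Q = 0 (the initialization used later), attention is uniform
# and the prediction is simply the average of the labels y_1, ..., y_N.
d, N = 4, 8
rng = np.random.default_rng(1)
xs, w = rng.standard_normal((N, d)), rng.standard_normal(d)
E = embed(xs, xs @ w, rng.standard_normal(d))
print(masked_softmax_attention(E, np.zeros((d, d))))
```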
With the aforementioned masking operations and reparameterization, the overall transformer model consisting of a single SA layer can be recast in the parameterization $\theta = \{1, Q\}$ as follows:
$$F_{SA}(E; \theta) = M(E^y)\,\mathrm{softmax}\big(M(E^x)^{\top} Q E^x\big). \qquad (3)$$
Such a reparameterization separates the label $E^y$ from the softmax operator while maintaining simultaneous processing of both the input $E^x$ and the label $E^y$ information. The prediction for the token $x_{\rm query}$ is the last entry of $F_{SA}$:
$$\hat y_{\rm query} = \hat y_{\rm query}(E; \theta) = [F_{SA}(E; \theta)]_{(N+1)}.$$
Henceforth, we may omit the reference to $E$ and $\theta$, and use $\hat y_{\rm query}$ when it is not ambiguous.

2.3. Training Settings

Loss Function. To train the transformer model $F_{SA}$ over linear regression tasks, we minimize the following squared loss of the prediction error, which has also been adopted by Zhang et al. (2023a); Ahn et al. (2023):
$$L(\theta) = \frac{1}{2}\,\mathbb{E}\Big[\big(\hat y_{\rm query} - \langle w, x_{\rm query}\rangle\big)^2\Big], \qquad (4)$$
where the expectation is taken with respect to the prompt $P$, including the input and query tokens $\{x_i\}_{i=1}^N \cup \{x_{\rm query}\}$ and the weight vector $w$.

Training Algorithm. The above learning objective in eq. (4) is minimized via GD with the learning rate $\eta$. At $t = 0$, we initialize $Q^{(0)}$ as the zero matrix $0_{d\times d}$. The parameter is updated as follows: $\theta^{(t+1)} = \theta^{(t)} - \eta\,\nabla_{\theta} L(\theta^{(t)})$.

3. Main Results

In this section, we characterize the convergence of in-context learning by GD for the settings with balanced and imbalanced features, respectively. To measure the degree to which the query token $x_{\rm query}$ attends to a specific input token and to a certain class of features, we define the notions of attention scores.

Definition 3.1 (Attention Score). Given a prompt $P = (x_1, y_1, \ldots, x_N, y_N, x_{\rm query})$ and its corresponding embedding $E$, where $\{x_i \in \mathbb{R}^d\}_{i=1}^N$ and $x_{\rm query}$ are drawn independently from $D_X$, at time $t$ and for $F_{SA}$ with parameter $\theta^{(t)}$, we define the attention scores as follows.
1. Given $i \in [N]$, the attention score for the $i$-th token $x_i$ is
$$\mathrm{attn}_i(\theta^{(t)}; E) := \Big[\mathrm{softmax}\big(M(E^x)^{\top} Q^{(t)} E^x_{N+1}\big)\Big]_i = \frac{e^{(E^x_i)^{\top} Q^{(t)} E^x_{N+1}}}{\sum_{j\in[N]} e^{(E^x_j)^{\top} Q^{(t)} E^x_{N+1}}}.$$
2. For $k \in [K]$, denote $V_k(P) \subseteq [N]$ as the index set of input tokens such that $x_i = v_k$ for $i \in V_k(P)$. Then the attention score for the $k$-th feature is given by $\mathrm{Attn}_k(\theta^{(t)}; E) := \sum_{i\in V_k(P)} \mathrm{attn}_i(\theta^{(t)}; E)$.

For simplicity, we write $\mathrm{attn}_i(\theta^{(t)}; E)$ and $\mathrm{Attn}_k(\theta^{(t)}; E)$ as $\mathrm{attn}^{(t)}_i$ and $\mathrm{Attn}^{(t)}_k$, respectively, and denote $V_k(P)$ as $V_k$. We also rewrite the prediction output at time $t$ as follows:
$$\hat y^{(t)}_{\rm query} = \sum_{i\in[N]} \mathrm{attn}^{(t)}_i\, y_i = \sum_{k\in[K]} \mathrm{Attn}^{(t)}_k\, \langle w, v_k\rangle. \qquad (5)$$

3.1. In-Context Learning with Balanced Features

In this subsection, we study in-context learning with balanced features, where the probabilities of sampling all $K$ features are of the same order, i.e., $p_k = \Theta(\frac{1}{K})$ for each $k \in [K]$. In such a setting, each feature appears equally likely in the prompt, ensuring their equal recognition. The following theorem characterizes the convergence of GD.

Theorem 3.2 (In-context Learning with Balanced Features). Suppose $p_k = \Theta(\frac{1}{K})$ for $k \in [K]$. For any $0 < \epsilon < 1$, suppose $N \geq \mathrm{poly}(K)$ and $\mathrm{polylog}(K) \geq \log(\frac{1}{\epsilon})$. We apply GD to train the loss function given in eq. (4). Then with at most $T = O\big(\frac{\log(K)K^2}{\eta} + \frac{K\log(K\epsilon^{-1})}{\epsilon\eta}\big)$ iterations, we have:
1. The loss converges: $L(\theta^{(T)}) - L^* \leq \epsilon$, where $L^* = \Theta(e^{-\mathrm{poly}(K)})$ is the global minimum of eq. (4).
2. The attention score concentrates: if $x_{\rm query} = v_k$, then with probability at least $1 - e^{-\Omega(\mathrm{poly}(K))}$², the one-layer transformer pays nearly all attention to input tokens featuring $v_k$, i.e., $(1 - \mathrm{Attn}^{(T)}_k)^2 \leq O(\epsilon)$.
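To connect the training setup with something executable, the sketch below runs plain GD on a Monte-Carlo estimate of the loss in eq. (4) under the balanced feature model and then checks the attention-score concentration predicted by Theorem 3.2. This is a rough sketch, not the paper's procedure: the population expectation is replaced by a fresh mini-batch at every step, and all constants ($d=8$, $K=4$, $N=64$, batch size, $\eta$, step count) are our own illustrative choices rather than values from the theory.

```python
import torch

torch.manual_seed(0)
d, K, N, B, eta, steps = 8, 4, 64, 256, 2.0, 500
V = torch.eye(d)[:K]                       # orthonormal features v_1, ..., v_K (standard basis here)
p = torch.full((K,), 1.0 / K)              # balanced case: p_k = 1/K
Q = torch.zeros(d, d, requires_grad=True)  # Q^(0) = 0

def batch_loss(Q):
    """Monte-Carlo estimate of the squared loss (4) over B random prompts."""
    idx = torch.multinomial(p, B * (N + 1), replacement=True).view(B, N + 1)
    X, w = V[idx], torch.randn(B, d)                   # tokens (last column = query) and task weights
    y = torch.einsum('bnd,bd->bn', X[:, :N], w)        # labels y_i = <w, x_i>
    logits = torch.einsum('bnd,de,be->bn', X[:, :N], Q, X[:, N])  # x_i^T Q x_query, query masked out
    attn = torch.softmax(logits, dim=1)                # attn_i as in Definition 3.1
    y_hat = (attn * y).sum(dim=1)                      # prediction, eq. (5)
    return 0.5 * ((y_hat - torch.einsum('bd,bd->b', w, X[:, N])) ** 2).mean()

for t in range(steps):
    (g,) = torch.autograd.grad(batch_loss(Q), Q)
    with torch.no_grad():
        Q -= eta * g                                   # one GD step on Q

with torch.no_grad():
    idx = torch.multinomial(p, B * (N + 1), replacement=True).view(B, N + 1)
    X = V[idx]
    attn = torch.softmax(torch.einsum('bnd,de,be->bn', X[:, :N], Q, X[:, N]), dim=1)
    attn_on_query_feature = (attn * (idx[:, :N] == idx[:, N:]).float()).sum(dim=1)
    # The loss should have dropped substantially and most attention mass
    # should fall on tokens sharing the query's feature.
    print(float(batch_loss(Q)), float(attn_on_query_feature.mean()))
```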
Theorem 3.2 shows that training a one-layer transformer with softmax attention can converge to the minimum of the objective loss in the reparameterized space via GD, with polynomial time efficiency with respect to $K$ and $\frac{1}{\epsilon}$. The learning dynamics for such a case with balanced features exhibit a two-phase behavior. (i) The first term of $T$ captures the duration of phase I, where the network actively aligns the query token (suppose $x_{\rm query} = v_k$) with those tokens featuring $v_k$ itself, thus substantially increasing $\mathrm{Attn}^{(t)}_k$ to a constant level. (ii) The second term captures the duration of phase II, where the loss converges to a near-zero prediction error.

²The randomness originates from the first $N$ input tokens in the test prompt.

In-context Learning Ability. For the obtained model with $\theta^{(T)}$, let us evaluate a test prompt associated with a linear task $w$, which might not be drawn from the support of $D_{\Omega}$ (i.e., $w$ may not be present in the training process), but has its data drawn from $D_X$. Suppose the query token is $x_{\rm query} = v_k$. Following the attention score concentration principle in Theorem 3.2, eq. (5) yields that with high probability the query prediction $\hat y^{(T)}_{\rm query}$ is given by
$$\hat y^{(T)}_{\rm query} = \mathrm{Attn}^{(T)}_k \langle w, v_k\rangle + \sum_{m\neq k} \mathrm{Attn}^{(T)}_m \langle w, v_m\rangle \approx \langle w, v_k\rangle.$$
This implies that the in-context learned model can still well approximate the test prompt even if the task model $w$ does not lie in the support of the training task distribution $D_{\Omega}$ and was unseen during training. This showcases the remarkable in-context learning capability of trained transformers. We also highlight that the in-context learning mechanism characterized by our theorems has been verified by the empirical findings in many recent works trained with transformers at the GPT-2 (Radford et al., 2019) scale. For instance, Yadlowsky et al. (2023) showed that the in-context learning ability of transformers may be closely tied to the coverage of their pre-training data mixtures. This indeed aligns with our attention concentration principle, which demonstrates that the transformer can perform in-context learning by correctly capturing and identifying different types of features in the training data.

3.2. In-Context Learning with Imbalanced Features

In real-world datasets, skewed distributions are common, where a few classes or features dominate the data while others are under-represented. It is typically difficult to train models to perform well on features that have limited representation in such datasets (Cui et al., 2019; Chou et al., 2020). In this subsection, we investigate the setting with imbalanced features, where the dominant feature $v_1$ is sampled with probability $p_1 = \Theta(1)$, and all other features are sampled with $p_k = \Theta(\frac{1}{K})$ for $2 \leq k \leq K$. We will show that, somewhat remarkably, in-context learning is less sensitive to imbalanced features and can achieve a near-zero error even when the query token takes an under-represented feature. To investigate the performance in the imbalanced case, we focus on the following prediction error for each feature $v_k$:
$$L_k(\theta) = \frac{1}{2}\,\mathbb{E}\Big[\big(\hat y_{\rm query} - \langle w, x_{\rm query}\rangle\big)^2 \,\Big|\, x_{\rm query} = v_k\Big]. \qquad (6)$$
The following theorem identifies the convergence of GD.

Theorem 3.3 (In-context Learning with Imbalanced Features). Suppose $p_1 = \Theta(1)$ and $p_k = \Theta(\frac{1}{K})$ for $2 \leq k \leq K$. For any $0 < \epsilon < 1$, suppose $N \geq \mathrm{poly}(K)$ and $\mathrm{polylog}(K) \geq \log(\frac{1}{\epsilon})$. We apply GD to train the loss function given in eq. (4). Then the following results hold.
1.
The prediction error for the dominant feature converges: for $v_1$, with at most $T_1 = O\big(\frac{\log(\epsilon^{-\frac{1}{2}})}{\eta\epsilon}\big)$ GD iterations, $L_1(\theta^{(T_1)}) \leq L_1^* + \epsilon$, where $L_1^* = \Theta(e^{-\mathrm{poly}(K)})$ is the global minimum of eq. (6) for $k = 1$;
2. The prediction error for the under-represented features converges: for $v_k$ with $2 \leq k \leq K$, with at most $T_k = O\big(\frac{\log(K)K^2}{\eta} + \frac{K\log(K\epsilon^{-1})}{\epsilon\eta}\big)$ GD iterations, $L_k(\theta^{(T_k)}) \leq L_k^* + \epsilon$, where $L_k^* = \Theta(e^{-\mathrm{poly}(K)})$ is the global minimum of eq. (6);
3. The attention score concentrates: for each $k \in [K]$, if the query token is $v_k$, then after $T_k$ iterations, with probability at least $1 - e^{-\Omega(\mathrm{poly}(K))}$, the one-layer transformer pays nearly all attention to input tokens featuring $v_k$: $(1 - \mathrm{Attn}^{(T_k)}_k)^2 \leq O(\epsilon)$.

Theorem 3.3 shows that the GD dynamics of the in-context training exhibit stage-wise convergence. The trained transformer rapidly (within $T_1$) converges to a model that achieves a near-zero prediction error $L_1$ for the dominant feature, and then takes a much longer time (up to $T_k \gg T_1$) to converge to a model that attains a near-zero prediction error $L_k$ for the under-represented features. Our analysis captures the later learning dynamics associated with the under-represented features via a four-phase behavior, as further described in the subsequent section. Despite the longer convergence time it takes, in-context learning still achieves the same accurate prediction for under-represented features as that for the dominant feature.

4. Overview of Training Phases

In this section, we explain our key ideas for analyzing the in-context learning capabilities of transformers. We will focus on characterizing the training process of the setting with imbalanced features for under-represented features in Section 4.2, which comprehensively exhibits four phases. Other scenarios take only one or two of those phases, which we briefly describe in Appendix C. The complete proofs of all the results are provided in the appendix.

4.1. Bilinear Attention Weights

We first provide the general training dynamics for the bilinear attention weights (defined in Definition 4.1 below), which are useful for analyzing all learning phases. These quantities are the key elements in the attention scores $\mathrm{attn}^{(t)}_i$ for $1 \leq i \leq N$, which play an important role in determining the prediction $\hat y^{(t)}_{\rm query}$. Hence, our analysis mainly tracks the training dynamics of those bilinear attention weights.

Definition 4.1 (Bilinear Attention Weights). Given $k, n \in [K]$ with $k \neq n$, for $t \geq 0$, we define the bilinear attention weights as follows:
$$A^{(t)}_k := v_k^{\top} Q^{(t)} v_k, \qquad B^{(t)}_{k,n} := v_n^{\top} Q^{(t)} v_k.$$
By our initialization, we have $A^{(0)}_k = B^{(0)}_{k,n} = 0$.

To further interpret these weights, suppose the query token corresponds to the feature $v_k$. Then $e^{A^{(t)}_k}$ serves as the (unnormalized) weight for an input token featuring $v_k$, while $e^{B^{(t)}_{k,n}}$ captures the weight for an input token featuring a different vector $v_n$ with $n \neq k$. Having a larger $A^{(t)}_k$ compared to the other $B^{(t)}_{k,n}$ indicates a better capture of the target feature $v_k$. As shown in eq. (5), this condition implies higher attention towards input tokens featuring $v_k$, resulting in $\hat y^{(t)}_{\rm query} \approx \sum_{i\in V_k} \mathrm{attn}^{(t)}_i y_i \approx \langle w, v_k\rangle$, where the prediction well approximates the ground truth. The following lemma provides the GD updates of the bilinear attention weights $A^{(t)}_k$ and $B^{(t)}_{k,n}$.

Lemma 4.2. Let $t \geq 0$.
For $k, n \in [K]$ with $k \neq n$, $A^{(t)}_k$ and $B^{(t)}_{k,n}$ satisfy:
$$A^{(t+1)}_k = A^{(t)}_k + \eta\,\alpha^{(t)}_k, \qquad B^{(t+1)}_{k,n} = B^{(t)}_{k,n} + \eta\,\beta^{(t)}_{k,n},$$
where
$$\alpha^{(t)}_k = \mathbb{E}\Big[\mathbf{1}\{x_{\rm query} = v_k\}\,\mathrm{Attn}^{(t)}_k\Big(\textstyle\sum_{m\neq k}\big(\mathrm{Attn}^{(t)}_m\big)^2 + \big(1-\mathrm{Attn}^{(t)}_k\big)^2\Big)\Big],$$
$$\beta^{(t)}_{k,n} = \mathbb{E}\Big[\mathbf{1}\{x_{\rm query} = v_k\}\,\mathrm{Attn}^{(t)}_n\Big(\textstyle\sum_{m\neq k}\big(\mathrm{Attn}^{(t)}_m\big)^2 - \mathrm{Attn}^{(t)}_n - \mathrm{Attn}^{(t)}_k\big(1-\mathrm{Attn}^{(t)}_k\big)\Big)\Big].$$

Lemma 4.2 shows that $A^{(t)}_k$ is monotonically increasing at any time since $\alpha^{(t)}_k \geq 0$, whereas the monotonicity does not always hold for $B^{(t)}_{k,n}$. Therefore, we need to analyze whether $B^{(t)}_{k,n}$ decreases and determine its rate of change compared to $A^{(t)}_k$. Such a comparison between $B^{(t)}_{k,n}$ and $A^{(t)}_k$ determines which bilinear weight plays a dominant role in the attention dynamics, and the change of the leading weight over the learning process results in different training phases.

4.2. Learning Process for Under-represented Features

We consider the setting with imbalanced features and focus on the under-represented features. Given a prompt $P = (x_1, y_1, \ldots, x_N, y_N, x_{\rm query})$, denote $P_{\rm input}$ as the collection of input tokens, i.e., $\{x_i\}_{i=1}^N$. Recall that $|V_k|$ is the number of input tokens featuring $v_k$. Based on our data generation setup, we can show that for imbalanced data, with high probability, $P_{\rm input}$ belongs to
$$\mathcal{E}_{\rm imbal} := \Big\{P_{\rm input} : |V_1| = \Theta(N),\ |V_k| = \Theta\big(\tfrac{N}{K}\big)\ \text{for}\ k \geq 2\Big\}.$$
In the following, we focus on the event that $P_{\rm input} \in \mathcal{E}_{\rm imbal}$ unless otherwise specified. We next characterize the learning process for under-represented features $v_k$ with $k > 1$ by four phases. An illustration of these four phases is provided in Figure 1.

4.2.1. PHASE I: DECREASE OF DOMINANT FEATURE

Consider the query token featuring $v_k$ for some $k > 1$. At $t = 0$, $A^{(0)}_k = B^{(0)}_{k,n} = 0$, and hence $\mathrm{attn}^{(0)}_i = \frac{1}{N}$ for $i \in [N]$, which implies that the transformer attends equally to each input token. However, due to the imbalanced occurrence of features in $\mathcal{E}_{\rm imbal}$, the number of tokens featuring $v_1$ is much larger than the others. Hence, $\mathrm{Attn}^{(0)}_1 = \frac{|V_1|}{N} \geq \Omega(1)$ while $\mathrm{Attn}^{(0)}_m = \Theta(\frac{1}{K})$ for $m > 1$. Therefore, by Lemma 4.2, we obtain $\beta^{(0)}_{k,1} \leq -\Omega(\frac{1}{K})$, whereas $\alpha^{(0)}_k$ and $|\beta^{(0)}_{k,n}|$ for $n \neq k, 1$ are at most $\Theta(\frac{1}{K^2})$. Therefore, $B^{(t)}_{k,1}$ enjoys a much larger decreasing rate initially. It can be shown that the decrease of $B^{(t)}_{k,1}$ will dominate for a certain time period that defines phase I. The following lemma summarizes our main result in this phase.

Lemma 4.3 (Informal). Under the same conditions as Theorem 3.3, given $k > 1$, there exists $T_{1,k} = O\big(\frac{\log(K)K^{1.98}}{\eta}\big)$, such that for all $0 \leq t \leq T_{1,k}$,
$$\beta^{(t)}_{k,1} \leq -\Omega\Big(\frac{1}{K^{1.98}}\Big), \qquad \alpha^{(t)}_k = \Theta\Big(\frac{1}{K^{2}}\Big), \qquad |\beta^{(t)}_{k,n}| \leq O\Big(\frac{\alpha^{(t)}_k + |\beta^{(t)}_{k,1}|}{K}\Big)\ \text{for all}\ n \neq k, 1.$$
At time $t = T_{1,k}+1$, $B^{(T_{1,k}+1)}_{k,1} \leq -0.49\log(K)$, while $A^{(T_{1,k}+1)}_k$ and $B^{(T_{1,k}+1)}_{k,n}$ for $n \neq k, 1$ remain close to zero.

During phase I, $B^{(t)}_{k,1}$ significantly decreases, leading to a reduction in $\mathrm{Attn}^{(t)}_1$, whereas the other $\mathrm{Attn}^{(t)}_n$ with $n > 1$ remain at the level of $\Theta(\frac{1}{K})$. By the end of this phase, $(\mathrm{Attn}^{(t)}_1)^2$ drops to $O(\frac{1}{K^{0.98}})$, resulting in a decrease in $|\beta^{(t)}_{k,1}|$ as it approaches $\alpha^{(t)}_k$. Phase II then begins.

4.2.2. PHASE II: SWITCHING OF LEADING INFLUENCE

Soon after entering this phase, the dominant role of $B^{(t)}_{k,1}$ diminishes as $|\beta^{(t)}_{k,1}|$ reaches the same order of magnitude as $\alpha^{(t)}_k$. The following result captures the shift of the leading influence, where the growth of $A^{(t)}_k$ takes dominance.

Lemma 4.4 (Informal).
Under the same conditions as Theorem 3.3, given k > 1, there exists T2,k = T1,k + O( log(K)K2 η ), such that at iteration t = T2,k + 1, we have A(T2,k+1) k 0.5 log(K), B(T2,k+1) k,1 [ 0.51 log(K), 0.49 log(K)], and B(T2,k+1) k,n for n = k, 1 remain close to zero. Lemma 4.4 shows that by the end of phase II, A(t) k matches the magnitude of B(t) k,1, and during phase II B(t) k,1 changes only slightly from the end of phase I. This suggests that, at certain moments in this phase, A(t) k significantly increases and its growth becomes the dominant factor. We next provide some insights into the reasons behind this transition. Once B(t) k,1 decreases to 0.5 log(K), we observe that |β(t) k,1| α(t) k = Θ( 1 K2 ). After this point, it becomes challenging for B(t) k,1 to decrease significantly compared to the increase in A(t) k . To illustrate, let us suppose a minimal decrease of B(t) k,1 by an amount of 0.01 log(K). This would yield that Attn(t) 1 O( 1 K0.501 ) and β(t) k,1 O( 1 K2.01 ), while Attn(t) k Ω( 1 K ) and α(t) k Ω( 1 K2 ), establishing a situation where α(t) k β(t) k,1. Such a discrepancy leads to the switching of the dominant effect. 4.2.3. PHASE III: GROWTH OF TARGET FEATURE After a transition phase, we observe that A(t) k enjoys a larger gradient α(t) k Θ( 1 K1.5 ) compared to |β(t) k,1| O( 1 K1.98 ) and |β(t) k,n| O( 1 K3 ) with n = k, 1. This gap between α(t) k and β(t) k,n remains over the period, and the gradient α(t) k continues to grow, driving the rapid growth of A(t) k with B(t) k,n being relatively unchanged. The following lemma summarizes our main results in this phase. Lemma 4.5 (Informal). Under the same conditions as The- In-context Convergence of Transformers (a) Decrease of Dominant Feature (b) Switching of Leading Influence (c) Growth of Target Feature (d) Convergence Figure 1. Overview of the dynamics of attention scores and bilinear attention weights for under-represented features. Assume the query token is vk with 2 k K. The top row depicts the trend of the attention score Attn(t) m for each feature vm, where a darker color corresponds to a higher score. The bottom row shows the interplay and leading effect among bilinear attention weights A(t) k , B(t) k,1, and B(t) k,n (where n = 1, k) in different training phases. (a) Phase I: B(t) k,1 significantly decreases and the attention on tokens with the dominant feature v1 is suppressed (Section 4.2.1); (b) Phase II: With the suppression of Attn(t) 1 , the decreasing rate for B(t) k,1 drops and the growth of A(t) k becomes the leading influence (Section 4.2.2); (c) Phase III: A(t) k rapidly grows and Attn(t) k reaches Ω(1) (Section 4.2.3); (d) Phase IV: Attn(t) k nearly grows to 1 and the prediction error converges to a global minimum (Section 4.2.4). orem 3.3, given k > 1, there exists T3,k = O( log(K)K1.5 η ), such that for all T2,k < t T3,k α(t) k Ω 1 K1.5 , β(t) k,1 O α(t) k K0.48 , Ω 1 K2.01 , |β(t) k,n| O α(t) k +|β(t) k,1| K with n = k, 1. At time t = T3,k + 1, we have A(T3,k+1) k log(K). Lemma 4.5 follows because the continuous growth of α(t) k is mainly driven by Attn(t) k , where 1 Attn(t) k remains at the constant order. However, as A(t) k reaches log(K), Attn(t) k is above Ω(1), necessitating a more detailed analysis to control α(t) k , which starts the final phase. 4.2.4. PHASE IV: CONVERGENCE After learning the target feature vk at a certain level, the prediction error converges. 
We characterize this in the following lemma, where we establish a connection between α(t) k and the prediction error via analyzing the change of 1 Attn(t) k that diminishes during this phase. Lemma 4.6 (Informal). Under the same conditions as Theorem 3.3, given 0 < ϵ < 1, for each k > 1, there exists T4,k = T3,k + O( K log(Kϵ 1 2 ) ηϵ ), such that for all T3,k < t T4,k α(t) k Ω( ϵ K ), β(t) k,n [ O( α(t) k K0.49 ), 0], β(t) k,n [ O(α(t) k K ), 0] with n = k, 1. At time t = T4,k + 1, we have Lk(θ(T4,k+1)) L k < ϵ and (1 Attn(t) k )2 O(ϵ), if xquery = vk and Pinput E imbal. The convergence result for k > 1 stated in Theorem 3.3 directly follows by choosing T k = T4,k + 1. (a) Imbalanced Case (b) Balanced Case Figure 2. The prediction error for each feature. 5. Experiments In this section, we conduct experiments to demonstrate that our theoretical results are consistent with the actual dynamics during the in-context training of transformers. Detailed experimental settings are deferred to Appendix B. Task and Data Generations. We follow the task and data distributions introduced in Section 2.1. For each task, we sample the task weight w from N(0, Id d). Each data point is drawn from the given feature set {vk Rd, k = 1, , K} with probability pk for sampling vk, where all features are orthonormal vectors, and pk (0, 1) satisfies PK k=1 pk = 1. The prompt consists of N random inputs {xi}N i=1 with their task values given by {yi}N i=1 = {w xi}N i=1, and a query xquery. We consider the setting with In-context Convergence of Transformers (a) Imbalanced Case (b) Balanced Case Figure 3. The attention heatmap during the training. For each heatmap, the i-th row represents the average attention scores of the query token attending to each feature when xquery = vi. d = 16, N = 60, and K = 3. We consider the following two types of data distributions: Balanced case: pi = 1 3 for i [3]; Imbalanced case: v1 is the dominant feature with p1 = 0.8 and {v2, v3} are under-represented with p2 = p3 = 0.1. Stage-Wise Convergence. In Figure 2, we plot the evolution of the prediction error for each feature throughout the training process. For the imbalanced case (Figure 2a), the transformer quickly converges to a model with nearly vanishing prediction error L1 for the dominant feature. However, the errors L2 and L3 for under-represented features initially fluctuate and then converge to zero after a considerably longer period. This behavior verifies the stage-wise convergence process characterized in our Theorem 3.3.On the other hand, in the balanced scenario (Figure 2b), the prediction errors for all features decrease in a similar manner throughout the training, which validates our theory on convergence in the balanced case in Theorem 3.2. Attention Score Concentration. In Figure 3, we present the dynamic evolution of attention scores throughout the training process for both balanced and imbalanced scenarios. For each k [3], and when xquery = vk, it is observed that Attnk progressively increases to be close to 1 while other Attnk diminishes at the end of the training. These results support the principle of attention score concentration as elaborated in Theorems 3.2 and 3.3, and demonstrate that the attention is allocated towards those tokens with the same feature as the query token. Multi-Phase Transition during Training Process. Figure 3 also demonstrates the multi-phase convergence process of under-represented features, which verifies those learning phases we characterize in our proof of convergence in Section 4. 
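As a rough, self-contained way to observe this multi-phase behavior outside the full experimental pipeline of Appendix B (which trains a transformer block with a feedforward layer using Adam), the sketch below applies plain GD to the simplified one-layer attention model of Section 2 under the imbalanced distribution above and logs the bilinear attention weights of Definition 4.1. The constants $d=16$, $N=60$, $K=3$, and $p=(0.8, 0.1, 0.1)$ mirror Section 5, while the batch size, learning rate, and step count are our own illustrative choices.

```python
import torch

torch.manual_seed(0)
d, K, N, B, eta, steps = 16, 3, 60, 512, 2.0, 600
V = torch.eye(d)[:K]                        # orthonormal features v_1, v_2, v_3
p = torch.tensor([0.8, 0.1, 0.1])           # imbalanced case: v_1 dominant
Q = torch.zeros(d, d, requires_grad=True)   # Q^(0) = 0

def batch_loss(Q):
    """Fresh mini-batch estimate of the loss (4) under the imbalanced data model."""
    idx = torch.multinomial(p, B * (N + 1), replacement=True).view(B, N + 1)
    X, w = V[idx], torch.randn(B, d)
    y = torch.einsum('bnd,bd->bn', X[:, :N], w)
    attn = torch.softmax(torch.einsum('bnd,de,be->bn', X[:, :N], Q, X[:, N]), dim=1)
    y_hat = (attn * y).sum(dim=1)
    return 0.5 * ((y_hat - torch.einsum('bd,bd->b', w, X[:, N])) ** 2).mean()

for t in range(steps + 1):
    if t % 100 == 0:
        with torch.no_grad():
            A2 = V[1] @ Q @ V[1]            # A_2     = v_2^T Q v_2 (target weight for query v_2)
            B21 = V[0] @ Q @ V[1]           # B_{2,1} = v_1^T Q v_2 (off-target, dominant feature)
            B23 = V[2] @ Q @ V[1]           # B_{2,3} = v_3^T Q v_2 (off-target, under-represented)
        print(t, round(float(A2), 2), round(float(B21), 2), round(float(B23), 2))
    (g,) = torch.autograd.grad(batch_loss(Q), Q)
    with torch.no_grad():
        Q -= eta * g
```

Under this setup, one would expect $B^{(t)}_{2,1}$ to decrease first while $A^{(t)}_2$ stays near zero, and only later for $A^{(t)}_2$ to grow and take over, mirroring phases I to III of Section 4.2; the exact step counts at which these transitions occur depend on the illustrative constants above.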
We elaborate on this by taking the case with xquery = v2 as an example. In Figure 3a and focusing on the row of xquery = v2, from epoch 10 to 100, Attn1 decreases and Attn3 increases, which suggests that the decrease in B2,1 is the main factor in phase I. If the increase in A2 was the driving factor, we would expect a decrease in all off-diagonal attention scores including Attn3 similarly to Figure 3b, which contradicts our observation. Then moving to epoch 150, the simultaneous increase in Attn2 and decreases in Attn1 and Attn3 indicate a shift of dominance effect, with the rise of A2 becomes the main factor (phases II and III). Finally, the concentration of attention scores at epoch 400 corresponds to the last phase of convergence. 6. Discussions Practical Insights. One direct practical implication follows from our stage-wise convergence characterization for the imbalanced setting, which implies that employing an early stopping strategy for in-context (pre-)training could be advantageous when the goal is to identify and leverage dominant features quickly. Further, our insights into attention score concentration can provide useful guidance for dealing with non-stationarity in real-world applications. For example, in scenarios with task shifts, the (pre-)trained model would exhibit considerable robustness due to the in-context learning capability, allowing the model to continue to per- In-context Convergence of Transformers form well. On the other hand, data shifts such as covariate shifts or other more complex shifts would necessitate further training of the model. Future Directions. Our analysis focuses on an orthonormal feature model for analytical clarity, so that our characterization of the convergence and the dynamics of the attention scores will not be over-complicated by non-essential aspects, e.g., additional non-dominant terms that need to be bounded in gradient calculations. Nevertheless, our analysis can be extended to a more general setting, where the features are drawn from a subspace with K features serving as basis vectors. For such a setting, we need to further characterize how correlation among features affects attention coefficients, which we leave as future work. Furthermore, it is also important to generalize our analysis to nonlinear target functions and consider more complicated network architectures. 7. Conclusions In this work, we investigated the training dynamics of a onelayer transformer with softmax attention trained by GD for in-context learning. We analyzed two settings respectively with balanced and imbalanced features, and proved the guaranteed convergence to a vanishing in-context prediction error by detailing the evolution of attention dynamics for both settings. Interestingly, we characterized a four-phase behavior for the imbalanced settings that sheds light on the intricate attention dynamics between dominant and target under-represented features during training. We also provide empirical results to back up our theoretical characterization. To our knowledge, this is the first work that rigorously analyzed the softmax attention dynamics for in-context learning. Our approach features novel ideas for phase decomposition based on the changes of the dominant role between two types of bilinear attention weights in the learning process, and has the potential to facilitate further theoretical understanding of how transformers perform in other algorithms and learning paradigms. Acknowledgements The work was supported in part by the U.S. 
National Science Foundation under the grants CCF-1900145 and DMS2134145. Impact Statement This paper presents work whose goal is to advance the field of Machine Learning. There are many potential societal consequences of our work, none of which we feel must be specifically highlighted here Ahn, K., Cheng, X., Daneshmand, H., and Sra, S. Transformers learn to implement preconditioned gradient descent for in-context learning. ar Xiv preprint ar Xiv:2306.00297, 2023. Ahuja, K., Panwar, M., and Goyal, N. In-context learning through the bayesian prism. ar Xiv preprint ar Xiv:2306.04891, 2023. Aky urek, E., Schuurmans, D., Andreas, J., Ma, T., and Zhou, D. What learning algorithm is in-context learning? investigations with linear models. ar Xiv preprint ar Xiv:2211.15661, 2022. Allen-Zhu, Z. and Li, Y. Towards understanding ensemble, knowledge distillation and self-distillation in deep learning. ar Xiv preprint ar Xiv:2012.09816, 2020. Bahdanau, D., Cho, K., and Bengio, Y. Neural machine translation by jointly learning to align and translate. ar Xiv preprint ar Xiv:1409.0473, 2014. Bai, Y., Chen, F., Wang, H., Xiong, C., and Mei, S. Transformers as statisticians: Provable in-context learning with in-context algorithm selection. ar Xiv preprint ar Xiv:2306.04637, 2023. Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al. Language models are few-shot learners. Advances in neural information processing systems, 33: 1877 1901, 2020. Chen, L., Lu, K., Rajeswaran, A., Lee, K., Grover, A., Laskin, M., Abbeel, P., Srinivas, A., and Mordatch, I. Decision transformer: Reinforcement learning via sequence modeling. Advances in neural information processing systems, 34:15084 15097, 2021. Chou, H.-P., Chang, S.-C., Pan, J.-Y., Wei, W., and Juan, D.- C. Remix: rebalanced mixup. In Computer Vision ECCV 2020 Workshops: Glasgow, UK, August 23 28, 2020, Proceedings, Part VI 16, pp. 95 110. Springer, 2020. Cui, Y., Jia, M., Lin, T.-Y., Song, Y., and Belongie, S. Classbalanced loss based on effective number of samples. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 9268 9277, 2019. Dai, D., Sun, Y., Dong, L., Hao, Y., Ma, S., Sui, Z., and Wei, F. Why can gpt learn in-context? language models implicitly perform gradient descent as meta-optimizers. In ICLR 2023 Workshop on Mathematical and Empirical Understanding of Foundation Models, 2023. In-context Convergence of Transformers Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. Bert: Pre-training of deep bidirectional transformers for language understanding. ar Xiv preprint ar Xiv:1810.04805, 2018. Devroye, L. The equivalence of weak, strong and complete convergence in l1 for kernel density estimates. The Annals of Statistics, pp. 896 904, 1983. Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al. An image is worth 16x16 words: Transformers for image recognition at scale. ar Xiv preprint ar Xiv:2010.11929, 2020. Garg, S., Tsipras, D., Liang, P. S., and Valiant, G. What can transformers learn in-context? a case study of simple function classes. Advances in Neural Information Processing Systems, 35:30583 30598, 2022. Giannou, A., Rajput, S., Sohn, J.-y., Lee, K., Lee, J. D., and Papailiopoulos, D. Looped transformers as programmable computers. ar Xiv preprint ar Xiv:2301.13196, 2023. Han, C., Wang, Z., Zhao, H., and Ji, H. 
In-context learning of large language models explained as kernel regression. ar Xiv preprint ar Xiv:2305.12766, 2023. He, K., Chen, X., Xie, S., Li, Y., Doll ar, P., and Girshick, R. Masked autoencoders are scalable vision learners. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 16000 16009, 2022. Huang, Y., Lin, J., Zhou, C., Yang, H., and Huang, L. Modality competition: What makes joint training of multi-modal network fail in deep learning?(provably). In International Conference on Machine Learning, pp. 9226 9259. PMLR, 2022. Huang, Y., Wen, Z., Chi, Y., and Liang, Y. Transformers provably learn feature-position correlations in masked image modeling. ar Xiv preprint ar Xiv:2403.02233, 2024. Janner, M., Li, Q., and Levine, S. Offline reinforcement learning as one big sequence modeling problem. Advances in neural information processing systems, 34: 1273 1286, 2021. Jelassi, S., Sander, M., and Li, Y. Vision transformers provably learn spatial structure. Advances in Neural Information Processing Systems, 35:37822 37836, 2022. Jiang, H. A latent space theory for emergent abilities in large language models. ar Xiv preprint ar Xiv:2304.09960, 2023. Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization. ar Xiv preprint ar Xiv:1412.6980, 2014. Kitaev, N., Kaiser, Ł., and Levskaya, A. Reformer: The efficient transformer. ar Xiv preprint ar Xiv:2001.04451, 2020. Li, H., Wang, M., Liu, S., and Chen, P.-Y. A theoretical understanding of shallow vision transformers: Learning, generalization, and sample complexity. ar Xiv preprint ar Xiv:2302.06015, 2023a. Li, Y., Ildiz, M. E., Papailiopoulos, D., and Oymak, S. Transformers as algorithms: Generalization and implicit model selection in in-context learning. ar Xiv preprint ar Xiv:2301.07067, 2023b. Mahankali, A., Hashimoto, T. B., and Ma, T. One step of gradient descent is provably the optimal in-context learner with one layer of linear self-attention. ar Xiv preprint ar Xiv:2307.03576, 2023. Min, S., Lyu, X., Holtzman, A., Artetxe, M., Lewis, M., Hajishirzi, H., and Zettlemoyer, L. Rethinking the role of demonstrations: What makes in-context learning work? ar Xiv preprint ar Xiv:2202.12837, 2022. Open AI. Gpt-4 technical report, 2023. Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., Sutskever, I., et al. Language models are unsupervised multitask learners. Open AI blog, 1(8):9, 2019. Tarzanagh, D. A., Li, Y., Thrampoulidis, C., and Oymak, S. Transformers as support vector machines. ar Xiv preprint ar Xiv:2308.16898, 2023. Tian, Y., Wang, Y., Chen, B., and Du, S. Scan and snap: Understanding training dynamics and token composition in 1-layer transformer. ar Xiv preprint ar Xiv:2305.16380, 2023. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., and Polosukhin, I. Attention is all you need. Advances in neural information processing systems, 30, 2017. Von Oswald, J., Niklasson, E., Randazzo, E., Sacramento, J., Mordvintsev, A., Zhmoginov, A., and Vladymyrov, M. Transformers learn in-context by gradient descent. In International Conference on Machine Learning, pp. 35151 35174. PMLR, 2023. Wang, X., Zhu, W., and Wang, W. Y. Large language models are implicitly topic models: Explaining and finding good demonstrations for in-context learning. ar Xiv preprint ar Xiv:2301.11916, 2023. Wei, J., Wei, J., Tay, Y., Tran, D., Webson, A., Lu, Y., Chen, X., Liu, H., Huang, D., Zhou, D., et al. Larger language models do in-context learning differently. 
ar Xiv preprint ar Xiv:2303.03846, 2023. In-context Convergence of Transformers Wies, N., Levine, Y., and Shashua, A. The learnability of in-context learning. ar Xiv preprint ar Xiv:2303.07895, 2023. Xie, S. M., Raghunathan, A., Liang, P., and Ma, T. An explanation of in-context learning as implicit bayesian inference. ar Xiv preprint ar Xiv:2111.02080, 2021. Yadlowsky, S., Doshi, L., and Tripuraneni, N. Pretraining data mixtures enable narrow model selection capabilities in transformer models. ar Xiv preprint ar Xiv:2311.00871, 2023. Zhang, R., Frei, S., and Bartlett, P. L. Trained transformers learn linear models in-context. ar Xiv preprint ar Xiv:2306.09927, 2023a. Zhang, Y., Zhang, F., Yang, Z., and Wang, Z. What and how does in-context learning learn? bayesian model averaging, parameterization, and generalization. ar Xiv preprint ar Xiv:2305.19420, 2023b. In-context Convergence of Transformers A. Additional Related Work In-Context Learning. Recent studies explored theoretical properties of transformers for in-context learning from various perspectives. Focusing on expressive capacity, Aky urek et al. (2022) studied linear regression tasks and showed that trained in-context learners can represent GD of ridge regression and exact least-squares regression. Giannou et al. (2023) proved the existence of a looped transformer that can emulate in-context learning algorithms. Von Oswald et al. (2023); Dai et al. (2023) also showed that transformer trained in-context implements the GD. Bai et al. (2023) further provided comprehensive results of transformers including the expressive power, in-context prediction power, and sample complexity of pre-training, and then constructed two general mechanisms for algorithm selection. Li et al. (2023b) analyzed the generalization error of trained in-context learning transformers. Another line of work considered in-context learning from a different perspective within the Bayesian framework (Xie et al., 2021; Zhang et al., 2023b; Wang et al., 2023; Jiang, 2023; Han et al., 2023; Wies et al., 2023; Ahuja et al., 2023). Closely related to our work is the line of research by Zhang et al. (2023a); Mahankali et al. (2023); Ahn et al. (2023), which investigated the training dynamics of in-context learning. Specifically, Mahankali et al. (2023) considered linear regression tasks and showed that the one-layer transformer that minimizes the pre-training loss implements one step of GD. Zhang et al. (2023a) investigated in-context learning of transformers with a single linear self-attention layer trained by gradient flow on linear regression tasks, and showed that gradient flow finds a global minimum. Ahn et al. (2023) investigated the landscape of the loss function for linear transformers trained over random instances of linear regression. However, all those works considered only transformers with linear self-attention layers and do not capture the crucial role of the softmax mapping, which lies in the core design of transformers to be advantageous over other network architectures. Our work focuses on nonlinear transformers with softmax attention and characterizes their training dynamics for in-context learning. Training Dynamics of Transformers. Jelassi et al. (2022) proposed a simplified Vision Transformers (Vi T) model in which the attention matrix solely depends on the positional embeddings and showed that the trained model by GD can learn spatial structure. Li et al. 
(2023a) studied the training of shallow Vi T for a classification task and characterized the sample complexity to achieve a desirable generalization performance. However, their analysis relied on a good initialization near the target pattern, which may not be feasible in practice. Tian et al. (2023) analyzed the SGD training dynamics for a one-layer transformer with one self-attention plus one decoder layer and showed how the self-attention layer combines input tokens during the training, but this work did not provide the convergence guarantee for SGD. Tarzanagh et al. (2023) established an equivalence between the optimization geometry of self-attention and a hard-margin SVM problem that separates optimal input tokens from non-optimal tokens using linear constraints on the outer-products of token pairs. While the mathematical setup of these problems is different from in-context learning, some of our analysis techniques may be useful for studying the training dynamics of these problems. Recently, Huang et al. (2024) studied the training dynamics of transformers trained with the masked image modeling method under a self-supervised learning framework. B. Experimental Settings In this section, we present additional details for experiments in Section 5. Transformer Architecture. We consider a simplified transformer network. The model consists of one block with a single-head self-attention layer, followed by a feedforward neural network, which incorporates layer normalization and Re LU activation, and finally concludes with a linear layer for output processing. Training Setup. We collect M = 300 randomly generated prompts and then train the model based on the empirical version of the training objective Equation (4) for 400 epochs using Adam (Kingma & Ba, 2014) with full batch and the learning rate of 0.002. Notice that Adam is a preferred choice for its stability in training transformers, which is also consistent with recent studies (Garg et al., 2022; Zhang et al., 2023a) to tackle the in-context learning ability of transformers over linear function classes. Evaluations. We focus on two performance metrics. 1). Prediction error: As defined in Equation (6), the prediction error Lk measures the loss conditioned on the event that the query token is vk. We evaluate Lk by averaging the squared loss on the prompts whose query token is vk. 2). Attention score: We also evaluate the attention score Attnk for each feature, where Attnk is defined in Definition 3.1 as the average attention score for the k-th feature over the prompts with query token featuring vk. In-context Convergence of Transformers C. Overview of Training Phases of Other Settings We next describe the training dynamics of other settings, which take the phases similar to those discussed in Section 4.2. Imbalanced Setting for the Dominant Feature. For the dominant feature v1 in the imbalanced setting, since the overall attention Attn(0) 1 to the target feature already reaches Ω(1) due to the abundance of tokens featuring v1 in E imbal, the training directly enters the convergence stage, as summarized in the following lemma. Lemma C.1 (Informal). Under the same conditions as Theorem 3.3, given k > 1, there exists T1 = O( log(ϵ 1 2 ) ηϵ ), such that for all t T1 α(t) 1 Ω(ϵ), β(t) 1,n [ O( α(t) n K ), 0] with n > 1. Further L1(θ(T1+1)) L 1 < ϵ, and (1 Attn(T1+1) 1 )2 O(ϵ) if xquery = v1 and Pinput E imbal. Balanced Scenarios. 
Similarly to imbalanced settings, we can show that for balanced data, with high probability, Pinput belongs to E bal := Pinput : |Vk| = Θ( N K ) for all k [K] . At initialization, the transformer uniformly assigns attention to each token, i.e., attn(0) i = 1 N for i [N]. Unlike the imbalanced case, here, due to Pinput E bal, we have that Attn(0) m = Θ( 1 K ) for m [K], indicating nearly equal attention to each feature. Consequently, as Lemma 4.2, we observe a significantly larger gradient in A(t) k at the outset, with α(0) k Θ( 1 K2 ), compared to |β(0) k,n| Θ( 1 K3 ) for n = k. This behavior mirrors the observations from phase III for under-represented features, allowing us to directly generalize the analysis. D. Preliminary Development for Main Proofs In this section, we will introduce warm-up gradient computations and probabilistic lemmas that establish essential properties of the data and the loss function, which are pivotal for the technical proofs in the upcoming sections. Towards the conclusion of this section, we will also provide a summary of the key notations introduced in both the main content and these preliminary sections. These notations will be frequently adopted in our subsequent analyses. D.1. Gradient Computations We first calculate the gradient with respect to Q (note that we do not update the parameter ν during the training). We omit the superscript (t) and write L(θ) as L here for simplicity. Lemma D.1. The gradient of the loss function with respect to Q is given by (byquery w, xquery ) X i,j [N] attni attnj(Ex i Ex j )Ex N+1 yi Proof. We obtain: QL = E[(byquery w, xquery ) byquery (byquery w, xquery ) X Denote Qj,k as the entry in j-th row and k-th column of Q, and define f : Rd d RN as f(Q) = e Ex 1 QEx N+1, , e Ex N QEx N+1 , and g : RN R as g(y) = yi P j [N] yj . By the chain rule, we have Qj,k = Tr ( g(y) y=f(Q)) f(Q) n =i e Ex i QEx N+1 P n [N] e Ex n QEx N+1 2 e Ex n QEx N+1(Ex n)j(Ex N+1)k In-context Convergence of Transformers P n [N] e Ex n QEx N+1 e Ex i QEx N+1 P n [N] e Ex n QEx N+1 2 e Ex i QEx N+1(Ex i )j(Ex N+1)k (Ex i )j(Ex N+1)k n=1 attnn(Ex n)n = j(Ex N+1)k n=1 attnn ((Ex i )j (Ex n)j) (Ex N+1)k Then we reorganize these derivatives into a matrix, and have Q = attni X j [N] attnj(Ex i Ex j )Ex N+1 . Substituting the above equation into Equation (7), we have (byquery wτ, xquery ) X i,j [N] attni attnj(Ex i Ex j )Ex N+1 yi Recall that the quantities Ak and Bk,n are defined in Definition 4.1. These quantities are associated with the attention weights for each token, and they play a crucial role in our analysis of learning dynamics. We will restate their definitions here for clarity. Definition D.2. For k, n [K] and n = k, define the following quantities for t 0: A(t) k := v k Q(t)vk α(t) k = v k QL(Q(t))vk B(t) k,n := v n Q(t)vk β(t) k,n = v n QL(Q(t))vk By GD update, we have A(t+1) k := A(t) k + ηα(t) k B(t+1) k,n := B(t) k,n + ηβ(t) k,n Moreover, by our initialization of Q(0) = 0d d, we have A(0) k = B(0) k,n = 0 for all k, n [K] with n = k. Next, we apply the expression in Lemma D.1 to compute the gradient projected onto the feature directions, i.e., α(t) k and β(t) k,n. Lemma D.3. For k, k [K], where k = k , we have 1{xquery = vk} Attn(t) k m =k Attn(t) m 2 + (1 Attn(t) k )2 β(t) k,k = E 1{xquery = vk} Attn(t) k m =k Attn(t) m 2 Attn(t) k Attn(t) k (1 Attn(t) k ) Proof. For any k, k [K], apply the previous gradient expression in Lemma D.1, and note that only when Ex N+1 = xquery = vk, we have Ex N+1 vk = 0. 
Thus, we obtain In-context Convergence of Transformers 1{xquery = vk} (byquery w, xquery ) X i,j [N] attni attnj yiv k (Ex i Ex j ) 1{xquery = vk} (byquery w, xquery ) X j Vn attni attnj yiv k (vm vn) 1{xquery = vk} (byquery w, xquery ) X j Vn attni attnj yiv k (vk vn) 1{xquery = vk} (byquery w, xquery ) X j Vk attni attnj yiv k (vm vk ) 1{xquery = vk} (byquery w, xquery ) Attnk w, vk X n [K] Attnn 1{xquery = vk} (byquery w, xquery ) Attnk X m [K] Attnm w, vm 1{xquery = vk} (byquery w, xquery ) Attnk X m [K] Attnm w, vk vm Note that byquery = X i [N] attni yi = X m [K] Attnm w, vm . Thus when xquery = vk, we have byquery w, xquery = X m [K] Attnm w, vk vm . Substituting this into the above equation, we have 1{xquery = vk} Attnk n [K] Attnn w, vk vn m [K] Attnm w, vk vm 1{xquery = vk} Attnk m [K] Attnm Attnn w, vk vn w, vk vm 1{xquery = vk} Attnk m [K] Attnm Attnn(vk vn) ww (vk vm) = E [1{xquery = vk} Attnk m [K] Attnm Attnn(vk vn) E[ww | Pinput {xquery}](vk vm) 1{xquery = vk} Attnk m [K] Attnm Attnn(vk vn) (vk vm) 1{xquery = vk} Attnk n [K] Attnn vn) (vk X m [K] Attnm vm) In-context Convergence of Transformers When k = k, we obtain αk = v k QLvk = E 1{xquery = vk} Attnk vk X n Attnn vn 2 # 1{xquery = vk} Attnk (1 Attnk)2 + X m =k Attn2 m When k = k, we have βk,k = v k QLvk 1{xquery = vk} Attnk m =k,k Attn2 m Attnk(1 Attnk) Attnk (1 Attnk ) 1{xquery = vk} Attnk m =k Attn2 m Attnk(1 Attnk) Attnk D.2. Useful Probabilistic Lemmas for Prompt Recall that given a prompt P = (x1, y1, . . . , x N, y N, xquery), we denote Pinput as the collection of input tokens, i.e., {xi}N i=1. It is worth noting that, based on our data distribution, the occurrence count of the k-th feature in the first N input tokens from Pinput, denoted as |Vk|, follows a multinomial distribution. Leveraging the concentration property inherent to multinomial distributions, we can identify a high-probability event to which Pinput belongs. This event constitutes the crux of our subsequent analysis. We first introduce the following tail bound for multinomial distributions. Lemma D.4 (Tail Bound of Multinomial Distribution (Devroye, 1983)). Let (X1, , XK) be a multinomial (N, p1, , p K) random vector. For all ε (0, 1) and all K satisfying K/N ε2/20, we have i=1 |Xi E (Xi)| > Nε 3 exp Nε2/25 . Now we present our characterization of a high-probability event for Pinput. Lemma D.5 (High-probability Event for Balanced Data). Suppose that pk = Θ 1 K for any k [K] and K3 N. For some constant cbal q E bal := Pinput : |Vk| pk N cbal N K , pk N + cbal N for k [K] . Then , we have P(Pinput E bal) 1 3 exp c2 bal N 25K2 Let us denote Lbal k = pk K cbal and U bal k = pk K + cbal. Note that Lbal k , U bal k are at the order of the constant level since K . Then for any Pinput belonging to E bal, |Vk| [ Lbal k N K , U bal k N K ). Note that we can properly choose cbal to guarantee Lbal k > 0 for k [K]. In-context Convergence of Transformers Proof. Denote |Vk| = Xk. Then (X1, , XK) multinomial (N, p1, , p K). Noting that c2 bal 20K2 K N by our choice of cbal, and then letting ϵ = cbal K , we have ϵ2/20 K N . By multinomial tail bound in Lemma D.4, we obtain i=1 |Xi E (Xi)| > cbal N K 3 exp c2 bal N 25K2 Then, since E (Xi) = pi N, we have |Xi pi N| > cbal N i=1 |Xi E (Xi)| > cbal N K 3 exp c2 bal N 25K2 Lemma D.6 (High-probability Event for Imbalanced Data). Suppose that p1 = Θ(1), pk = Θ 1 K for 2 k K, and K3 N. 
Then for some constant cim q N , there exist constants U im k > Lim k > 0 for any k [K], such that letting E imbal := Pinput : |V1| [Lim 1 N, U im 1 N] and |Vk| Lim k N K , U im k N for 2 k K , P(Pinput E imbal) 1 3 exp c2 im N 25K2 Proof. Similarly to the proof for Lemma D.5, we have |Xi pi N| > cim N 3 exp c2 im N 25K2 For k > 1, let us denote Lim k = pk K cim and U im k = pk K + cim. Since pk = Θ 1 K , we can easily conclude that Lim k , U im k for k > 1 are constant level. Furthermore, for k = 1, let Lim 1 = p1 0.01cim and U im 1 = p1 + 0.01cim. Since p1 is at the order of the Θ(1), we have p1N cim N K , p1N + cim N = h (p1 cim K )N, (p1 + cim K )p1N i Lim 1 N, U im 1 N for sufficiently large K. D.3. Properties of Loss Function and Prediction Error Recall the population loss we consider is given by: 2E h (by query w, x query )2i . (8) In this part, we will present several important lemmas for such a training objective. We first introduce the following lemma, which connects the loss form with the attention score when the query token takes a certain feature. Lemma D.7 (Loss Calculation). The population loss L(θ) can be decomposed into the following form: 1{xquery = vk} m =k Attn2 m +(1 Attnk)2 In-context Convergence of Transformers Proof. Following the calculations similar to those in Lemma D.3, we have k=1 E h 1{xquery = vk} (byquery w, xquery )2i 1{xquery = vk} n [K] Attnn w, vk vn m [K] Attnm w, vk vm 1{xquery = vk} vk X n [K] Attnn vn 2 1{xquery = vk} (1 Attnk)2 + X m =k Attn2 m D.3.1. LOSS CHARACTERIZATION FOR THE BALANCED CASE We first introduce some additional crucial notations for the loss objectives. Notations for the balanced case. L = min θ L(θ) = min θ 1 2E h (by query w, x query )2i , (9) k=1 P (xquery = vk |Vk| = 0) . (10) L denotes the minimum value of the population loss in Equation (8) by minimizing over θ in the form of {1, Q}, and Llow represents the sum of unavoidable errors for each k [K], given that the query token is the k-th feature but has not been seen in the first N training samples. We will show that Llow serves as a lower bound for L , and demonstrate that the network trained with GD will attain nearly zero error compared to Llow. Our convergence will be established by the suboptimality gap with respect to Llow, which necessarily implies the convergence to L . (It also implies L Llow is small.) We further introduce the following quantities to facilitate our analysis of the loss function. where Lk(θ) = 1 2E h 1{xquery = vk} (byquery w, xquery )2i . P (xquery = vk |Vk| = 0) , e Lk(θ) = 1 2E h 1{xquery = vk Pinput E bal} (byquery w, xquery )2i . Lemma D.8. For L and Llow defined in Equation (9) and Equation (10), respectively, we have Llow L and they are both at the order of Θ(e poly(K)) for the balanced data. Proof. We first prove Llow L : L = min θ 1 2 k=1 E h 1{xquery = vk} (byquery w, xquery )2i k=1 E h 1{xquery = vk |Vk| = 0} (byquery w, xquery )2i In-context Convergence of Transformers = min θ 1 2 1{xquery = vk |Vk| = 0} m =k Attn2 m +(1 Attnk)2 Notice that when the query token is the k-th feature but has not been seen in the first N training samples, Attnk = 0. Moreover, P m =k Attn2 m 1 K 1by Cauchy Schwarz inequality. Thus k=1 E [1{xquery = vk |Vk| = 0}] = Llow. Furthermore, since xquery and Pinput are independently sampled, Llow = K Θ 1 N = Θ e poly(K) . where the last equality follows because N K3, and hence (1 Θ 1 K )N = Θ e poly(K) . We next only need to show L = O(e poly(K)). 
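As an illustrative aside, the decomposition in Lemma D.7 can also be observed empirically: averaging the squared prediction error over random prompts, tasks w, and query features should match averaging the purely attention-based quantity sum_{m != k} Attn_m^2 + (1 - Attn_k)^2. The minimal sketch below assumes orthonormal features and w ~ N(0, I_d); the dimensions, the matrix Q, and the number of trials are illustrative placeholders.

```python
import numpy as np

rng = np.random.default_rng(2)
d = K = 5
N, trials = 60, 50_000
V = np.eye(d)                          # orthonormal features
p = np.full(K, 1.0 / K)                # balanced token/query distribution
Q = 0.5 * rng.standard_normal((d, d))

direct, decomposed = [], []
for _ in range(trials):
    tokens = rng.choice(K, size=N, p=p)
    k = rng.choice(K, p=p)             # feature carried by the query token
    x, xq = V[tokens], V[k]
    logits = x @ Q @ xq
    attn = np.exp(logits - logits.max()); attn /= attn.sum()
    Attn = np.bincount(tokens, weights=attn, minlength=K)

    w = rng.standard_normal(d)
    yhat = attn @ (x @ w)              # in-context prediction for this prompt
    direct.append(0.5 * (yhat - w @ xq) ** 2)
    off = (Attn ** 2).sum() - Attn[k] ** 2
    decomposed.append(0.5 * (off + (1 - Attn[k]) ** 2))

print(np.mean(direct), np.mean(decomposed))   # two estimates of L(theta); they should match
```

Returning to the proof of Lemma D.8: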
We have 1{xquery = vk |Vk| > 0} m =k Attn2 m +(1 Attnk)2 1{xquery = vk |Vk| = 0} m =k Attn2 m +1 Consider Q = σId d. If xquery = vk |Vk| > 0 holds, we have X m =k Attn2 m +(1 Attnk)2 (1 Attnk) max m =k Attnm +(1 Attnk)2 2(1 Attnk)2 = 2 N |Vk| N |Vk| + |Vk|eσ 2 2 N N + eσ Taking σ = poly(N), then we have L O(e poly(N)) + O(e poly(K)) = O(e poly(K)). Lemma D.9. For the balanced data, given k [K], for any θ, we have e Lk(θ) Lk(θ) Llow k e Lk(θ) + 3pk exp c2 bal N 25K2 Proof. We proceed the derivation as follows. Lk(θ) e Lk(θ) = 1 2E h 1{xquery = vk Pinput E bal c} (byquery w, xquery )2i 1{xquery = vk Pinput E bal c} m =k Attn2 m +(1 Attnk)2 2 2P (xquery = vk Pinput E bal c) In-context Convergence of Transformers (b) pk 3 exp c2 bal N 25K2 = 3pk exp c2 bal N 25K2 where (a) follows from the fact that m =k Attn2 m +(1 Attnk)2 (1 Attnk) max m =k Attnm +(1 Attnk)2 2, and (b) holds by Lemma D.5. On the other hand, Lk(θ) e Lk(θ) 1 1{xquery = vk |Vk| = 0} m =k Attn2 m +(1 Attnk)2 2 K K 1E [1{xquery = vk |Vk| = 0}] = Llow k . Consequently, for each k [K], e Lk(θ) closely tracks the deviation between Lk(θ) and Llow k , which is what we will primarily focus on bounding in the subsequent analysis. D.3.2. LOSS CHARACTERIZATION FOR THE IMBALANCED CASE Notations for the imbalanced case. In the imbalanced case, we are interested in the prediction error for the query corresponding to each given feature k [K]. Thus we consider the following conditional prediction error for each k [K]: 2E (byquery w, xquery )2 xquery = vk Similarly, we define the minimum and the unavoidable values for such conditional prediction error: L k = min θ 1 2E (byquery w, xquery )2 xquery = vk P (|Vk| = 0) , (13) e Lk(θ) = 1 2E 1{Pinput E imbal} (byquery w, xquery )2 xquery = vk Lemma D.10. Given k [K], for L k and Llow k defined in Equation (12) and Equation (13), respectively, we have Llow k L k and they are both at the order of Θ(e poly(K)) for the imbalanced data. Proof. The analysis is similar as Lemma D.8, we only show Llow k = Θ(e poly(K)). P (|Vk| = 0) = Θ(1)(1 pk)N. For k = 1, (1 p1)N = Θ(exp( N)) = Θ e poly(K) . For k > 1, since N K3, then (1 pk)N = (1 Θ 1 K )N = Θ e poly(K) , which completes the proof. In-context Convergence of Transformers Table 1. Summary of Notations Notations Descriptions attn(t) i , Attn(t) k The attention scores for the i-token and k-th feature, where i [N] and k [K]. A(t) k , B(t) k,n The bilinear attention weights when xquery = vk: A(t) k = ev k Q(t)vk, B(t) k,n = ev n Q(t)vk for n = k. α(t) k , β(t) k,n The gradient updates respectively for A(t) k and B(t) k,n. Pinput The input tokens in the prompt, i.e., {xi}N i=1. E bal, E imbal The high-probability events that Pinput belongs to respectively for the balanced and imbalanced data. L , Llow The minimum value and lower bound on the population loss L(θ) (8). Lk(θ), e Lk(θ), Llow k The loss functions on the event {xquery = vk}, {xquery = vk} {Pinput E bal}, and the lower bound on Lk. L k, Llow k (Imbalanced) The minimum value and lower bound of prediction error conditioned on xquery = vk, i.e., Lk(θ) (11). e Lk(θ) (Imbalanced) The conditional prediction error on the event {Pinput E imbal}. Lemma D.11. For the imbalanced data, given k [K], for any θ, we have e Lk(θ) Lk(θ) Llow k e Lk(θ) + 3 exp c2 im N 25K2 Proof. The proof of the first inequality is similar to that for Lemma D.9. We next show the second inequality. 
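As an illustrative aside, Lemmas D.9 through D.11 control the losses by conditioning on the high-probability events E_bal and E_imbal and on the query feature appearing at least once in the prompt. The small simulation below checks that, once N >= K^3, both the complements of these events and the event {|V_k| = 0} are rare, which is what makes the unavoidable error Llow exponentially small. The constant c and the probability vectors are illustrative stand-ins, and a single tolerance c*N/K is used for every coordinate as a simplification of the event definitions.

```python
import numpy as np

rng = np.random.default_rng(3)
K, trials = 10, 20_000
N = K ** 3                                     # the regime N >= K^3 used throughout
c = 0.5                                        # illustrative stand-in for c_bal / c_im

def rare_event_rates(p):
    counts = rng.multinomial(N, p, size=trials)               # (|V_1|, ..., |V_K|) per prompt
    in_event = np.all(np.abs(counts - N * p) <= c * N / K, axis=1)
    some_feature_missing = np.any(counts == 0, axis=1)
    return 1 - in_event.mean(), some_feature_missing.mean()

p_bal = np.full(K, 1.0 / K)                                   # balanced: p_k = Theta(1/K)
p_imb = np.r_[0.5, np.full(K - 1, 0.5 / (K - 1))]             # imbalanced: p_1 = Theta(1)
print(rare_event_rates(p_bal))     # both rates are tiny: the event holds w.h.p., no feature missing
print(rare_event_rates(p_imb))
```

Returning to the proof of Lemma D.11: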
Lk(θ) Llow k e Lk(θ) + 1 2E h 1{Pinput E imbal c} (byquery w, xquery )2 | xquery = vk i = e Lk(θ) + 1 1{Pinput E imbal c} m =k Attn2 m +(1 Attnk)2 | xquery = vk e Lk(θ) + P (Pinput E imbal c) e Lk(θ) + 3 exp c2 im N 25K2 D.4. Notations and Parameters In Table 1, we summarize the notations introduced throughout the main content and in the preliminary section. Throughout all the proofs in our paper, we consider N = poly(K) K3, and K is sufficiently large. In-context Convergence of Transformers E. Analysis for the Balanced Case (Proof of Theorem 3.2) In this section, we present the analysis for the balanced case, we first discuss the outline of our proof. E.1. Roadmap of the Proof We will analyze the convergence of the training process via two phases of dynamics. At the beginning of each phase, we will establish an induction hypothesis, which we expect to remain valid throughout that phase. Subsequently, we will analyze the dynamics under such a hypothesis within the phase, aiming to provide proof of the hypothesis by the end of the phase. The main idea of the proof lies in analyzing the GD dynamics of A(t) k and B(t) k,n. From Definition D.2 and Lemma D.3, we have A(t+1) k = A(t) k + ηα(t) k , B(t+1) k,n = B(t) k,n + ηβ(t) k,n, 1{xquery = vk} Attn(t) k m =k Attn(t) m 2 + (1 Attn(t) k )2 β(t) k,n = E 1{xquery = vk} Attn(t) n m =k Attn(t) m 2 Attn(t) n Attn(t) k (1 Attn(t) k ) We divide the learning process of any feature k in the balanced case into the following two phases. Phase I (t [0, T1,k], Appendix E.2): At initialization, A(t) k keeps growing at a rate at least of η K2 , while B(t) k,n oscillates with a smaller rate of η K3 . Therefore, the increase in A(t) k will dominate the learning dynamics during phase I. Phase II (t (T1,k, T ϵ 2,k], Appendices E.3 and E.4): After rapid growth of self-attention module parameters in phase I, the query token featuring vk is aligned with these input tokens also featuring vk effectively and disregards other features. Then the process proceeds to the convergence phase, where A(t) k monotonically increases and B(t) k,n monotonically decreases, which finally contributes to the convergence of the loss. Based on the variation rates of A(t) k and B(t) k,n, the convergence phase further has two sub-stages as follows. Stage I (t (T1,k, e T ϵ 2,k], Appendix E.3): the increase of A(t) k is as fast as Ω( ϵ K ) while the decrease of B(t) k,n is slow, and the gap A(t) k maxm =k B(t) k,m stays within O(log( K Stage II (t ( e T ϵ 2,k, T ϵ 2,k], Appendix E.4): the increase of A(t) k and the decrease of B(t) k,n both are relatively steady and the attention nearly focuses on the target feature, leading to the convergence of the loss. We finally combine all results in the above two phases to prove the convergence of the training process given in Theorem 3.2 (Appendix E.5). E.2. Phase I: Growth of Target Feature In this section, we shall study the initial phase of learning the relationship between the query token and its corresponding feature. Firstly, we present the induction hypothesis in this phase. For the k-th feature vk, we define the Phase I as all iterations 0 t T1,k, where T1,k max n t : A(t) k log(K) o . We state the following induction hypothesis, which will hold throughout Phase I. This hypothesis is ultimately proved in Appendix E.2.3. Induction Hypothesis E.1. For each 0 t T1,k, the following holds: a. A(t) k is monotonically increasing and A(t) k [0, log(K)]; b. |B(t) k,n| = O( A(t) k K ) for any n = k. In-context Convergence of Transformers E.2.1. 
TECHNICAL LEMMAS We first introduce several useful technical lemmas. Lemma E.1. Suppose Induction Hypothesis E.1 holds at iteration 0 t Tk,1. If xquery = vk and Pinput E bal, the following holds 1. Attn(t) k = Ω 1 2. 1 Attn(t) k Ω(1). Proof. Since xquery = vk, then we have Attn(t) k = |Vk|evk Q(t)vk P j [N] e Ex j Q(t)vk = |Vk| exp(A(t) k ) P m =k |Vm| exp(B(t) k,m) + |Vk| exp(A(t) k ) |Vk| exp(B(t) k,m A(t) k ) + 1 . By Induction Hypothesis E.1, e (log(K)+O( log(K) K )) exp(B(t) k,m A(t) k ) e O( log(K) Attn(t) k 1 e O( log(K) |Vk| 1) + 1 1 e O( log(K) K )(K/Lbal k 1) + 1 = Ω 1 where the second inequality follows because Pinput E bal. On the other hand, Attn(t) k 1 e (log(K)+O( log(K) |Vk| 1) + 1 1 e 1( 1 U bal k 1 Considering U bal = Θ(1), we have 1 Attn(t) k ( 1 U bal k 1 ( 1 U bal k 1 K ) + e Ω(1). Lemma E.2. Suppose Induction Hypothesis E.1 holds at iteration 0 t T1,k. If xquery = vk and Pinput E bal, for n = k, the following holds Attn(t) n = Θ 1 Attn(t) k K Proof. To show the first equality, since xquery = vk, we have Attn(t) n = |Vn|evn Q(t)vk P j [N] e Ex j Q(t)vk = |Vn| exp(B(t) k,n) P m =k |Vm| exp(B(t) k,m) + |Vk| exp(A(t) k ) . In-context Convergence of Transformers By Induction Hypothesis E.1, e O( log(K) K ) exp(B(t) k,m B(t) k,n) e O( log(K) K ). Combining with the fact that |Vm| |Vn| = Θ(1) when Pinput E bal, we have Attn(t) n 1 Attn(t) k = |Vn| exp(B(t) k,n) P m =k |Vm| exp(B(t) k,m) = 1 P |Vn| exp(B(t) k,m B(t) k,n) = Θ 1 Combining with the Lemma E.1, we immediately have Attn(t) n = Θ 1 E.2.2. CONTROLLING GRADIENT UPDATES IN PHASE I Lemma E.3. Given any fixed k [K], if Induction Hypothesis E.1 holds at iteration 0 t T1,k, then α(t) k 0 and satisfies Proof. By the gradient expression in Lemma D.3, 1{xquery = vk} Attn(t) k m =k Attn(t) m 2 + (1 Attn(t) k )2 1{xquery = vk Pinput E bal} Attn(t) k m =k Attn(t) m 2 + (1 Attn(t) k )2 1{xquery = vk Pinput E bal c} Attn(t) k m =k Attn(t) m 2 + (1 Attn(t) k )2 (a) pk P(Pinput E bal) m =k Attn(t) m 2 + (1 Attn(t) k )2 {xquery = vk} {Pinput E bal} pk P(Pinput E bal) E Attn(t) k (1 Attn(t) k )2 {xquery = vk} {Pinput E bal} (14) where (a) follows from the fact that xquery is independent with Pinput and the second term is non-negative, (b) follows from Lemma D.5, Lemma E.1 and the fact that pk = Θ 1 K in the balanced case and N K3. Lemma E.4. Given any fixed k [K], if Induction Hypothesis E.1 holds at iteration 0 t T1,k, then for any n = k, β(t) k,n satisfies |β(t) k,n| O Proof. By the gradient expression in Lemma D.3, we have 1{xquery = vk} Attn(t) n m =k Attn(t) m 2 β(t) k,n E h 1{xquery = vk} Attn(t) n Attn(t) n + Attn(t) k (1 Attn(t) k ) i . (16) In-context Convergence of Transformers For Equation (15), we further derive 1{xquery = vk Pinput E bal} Attn(t) n m =k Attn(t) m 2 1{xquery = vk Pinput E bal c} Attn(t) n m =k Attn(t) m 2 (a) pk P(Pinput E bal) E Attn(t) n max m =k Attn(t) m {xquery = vk} {Pinput E bal} + pk P(Pinput E bal c) (b) pk E Attn(t) n max m =k Attn(t) m {xquery = vk} {Pinput E bal} + 3pk exp c2 bal N 25K2 where (a) follows from the fact that xquery is independent with Pinput, Attn(t) n 1 and P m =k Attn(t) m 2 maxm =k Attn(t) m P m =k Attn(t) m maxm =k Attn(t) m , (b) follows from Lemma D.5, and (c) follows from Lemma E.2 and the fact that pk = Θ 1 K and N K3. 
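As an illustrative aside, Lemma E.1 is at its core a statement about the aggregated softmax weight as a function of A_k and the B_{k,m}. The toy computation below, with equal relative counts |V_m| as on the balanced event and the off-target weights frozen at zero (they remain O(A_k/K) throughout Phase I), shows Attn_k moving from Theta(1/K) at initialization to Omega(1) once A_k reaches log(K), the threshold that defines the end of Phase I. The value of K is an arbitrary illustrative choice.

```python
import numpy as np

K = 50
counts = np.full(K, 1.0)          # relative sizes |V_m|; equal on the balanced event
B = np.zeros(K)                   # off-target weights stay O(A_k / K), so ~0 in Phase I

def attn_k(A, k=0):
    scores = counts * np.exp(B)
    scores[k] = counts[k] * np.exp(A)
    return scores[k] / scores.sum()

for A in [0.0, 0.5 * np.log(K), np.log(K)]:
    print(A, attn_k(A))           # ~1/K at A = 0, and Omega(1) once A = log(K)
```

Returning to the proof of Lemma E.4: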
For Equation (16), similarly to the derivation above, we have pk E Attn(t) n Attn(t) n + Attn(t) k (1 Attn(t) k ) {xquery = vk} E bal + 2pk P(Pinput E bal c) (a) = 2pk P(Pinput E bal c) + pk P(Pinput E bal) Θ(1 Attn(t) k K ) Θ(1 Attn(t) k K ) + Attn(t) k (1 Attn(t) k ) ! {xquery = vk} E bal (b) pk P(Pinput E bal)E O(Attn(t) k (1 Attn(t) k )2 K ) {xquery = vk} E bal + 6pk exp c2 bal N 25K2 α(t) k K + 1 K exp c2 bal N 25K2 where (a) follows from Lemma E.2 and (b) follows from Lemma E.1 and Lemma D.5, and (c) follows from Equation (14). From Lemma E.3 and the choice of N K3, we have 6 exp c2 bal N 25K2 Thus, combining Equations (17) to (19), we have |β(t) n,k| max O(α(t) k K ), O 1 E.2.3. END OF PHASE I Lemma E.5. Given any fixed k [K], Induction Hypothesis E.1 holds for all iterations 0 t T1,k, where T1,k is at most O( log(K)K2 η ), and at iteration t = T1,k + 1, we have In-context Convergence of Transformers a. A(T1,k+1) k log(K); b. Attn(T1,k+1) k = Ω(1) if xquery = vk and Pinput E bal. Proof. If Induction Hypothesis E.1 holds, the existence of T1,k = O( log(K)K2 η ) directly follows from Lemma E.3. We next prove Induction Hypothesis E.1. It is easy to verify Induction Hypothesis E.1 holds at t = 0. Now we suppose Induction Hypothesis E.1 holds for all iterations t 1, and prove it holds at t. By Lemma E.3, we have α(t 1) k 0. Thus A(t) k = A(t 1) k + ηα(t 1) k 0. Moreover, by the definition of T1,k, we immediately obtain A(t) k log(K). By Lemma E.4, we have |β(t 1) k,n | O α(t 1) k |B(t) k,n| |B(t 1) k,n | + ηO The first statement follows the definition of T1,k. Moreover, Attn(T1,k+1) k = Ω(1) can be derived from Lemma E.6 in the subsequent section. E.3. Phase II: Convergence: Stage I After rapid growth of self-attention module parameters in phase I, the query token featuring vk is aligned with these input tokens also featuring vk effectively and disregards other features. Then the process proceeds to the convergence phase, where A(t) k monotonically increases and B(t) k,n monotonically decreases, which finally contributes to the convergence of the loss. Based on the variation rates of A(t) k and B(t) k,n, the convergence phase further has two sub-stages as follows. Given any 0 < ϵ < 1, for k [K], define e T ϵ 2,k := max t > T1,k : A(t) k max m =k B(t) k,m log Induction Hypothesis E.2. For T1,k < t e T ϵ 2,k, suppose polylog(K) log( 1 ϵ ), and the following holds a. A(t) k is monotonically increasing and A(t) k [log(K), O(log(K/ϵ))]; b. B(t) k,n is monotonically decreasing and |B(t) k,n| = O( A(t) k K ) for any n = k. E.3.1. TECHNICAL LEMMAS We first introduce several useful technical lemmas. Lemma E.6. Suppose Induction Hypothesis E.2 holds at iteration T1,k < t e T ϵ 2,k. If xquery = vk and Pinput E bal, the following holds 1. Attn(t) k = Ω(1); 2. (1 Attn(t) k )2 Ω(ϵ) = Ω(exp ( polylog(K))). In-context Convergence of Transformers Proof. Since xquery = vk, we have Attn(t) k = |Vk| exp(A(t) k ) P m =k |Vm| exp(B(t) k,m) + |Vk| exp(A(t) k ) |Vk| exp(B(t) k,m A(t) k ) + 1 . By Induction Hypothesis E.2, we obtain exp(B(t) k,m A(t) k ) e O( log(K/ϵ) K ) log(K) e O( log(K)+polylog(K) K ) log(K) O 1 Attn(t) k 1 O 1 |Vk| 1) + 1 1 O( 1 K ) + 1 Ω(1). 
On the other hand, by the definition of e T ϵ 2,k, we have 1 Attn(t) k = |Vk| exp(B(t) k,m A(t) k ) P |Vk| exp(B(t) k,m A(t) k ) + 1 exp(minm =k B(t) k,m A(t) k )( N exp(minm =k B(t) k,m A(t) k )( N |Vk| 1) + 1 exp(minm =k B(t) k,m A(t) k )( K exp(minm =k B(t) k,m A(t) k )( K U bal k 1) + 1 = exp(maxm =k B(t) k,m A(t) k B(t) k )( K exp(maxm =k B(t) k,m A(t) k B(t) k )( K U bal k 1) + 1 Lbal k 1) 1(ϵ 1 2 1) 1 e O( polylog(K) Lbal k 1) 1(ϵ 1 2 1) 1e O( polylog(K) U bal k 1) + 1 where B(t) k = maxm =k B(t) k,m minm =k B(t) k,m = O( A(t) k K ), and the first and second inequalities follow from the fact that x 1+x monotonically increases w.r.t. x 0, and the third inequality follows from the definition of e T ϵ 2,k and Induction Hypothesis E.2. Lemma E.7. Suppose Induction Hypothesis E.2 holds at iteration T1,k < t e T ϵ 2,k. If xquery = vk and Pinput E bal, for n = k, then the following holds Attn(t) n = Θ 1 Attn(t) k K Proof. By definition, Attn(t) n = |Vn| exp(B(t) k,n) P m =k |Vm| exp(B(t) k,m) + |Vk| exp(A(t) k ) . By Induction Hypothesis E.2, we have e O( log(K) log(ϵ) K ) exp(B(t) k,m B(t) k,n) e O( log(K) log(ϵ) In-context Convergence of Transformers Further combining with the fact that log(ϵ) polylog(K), we have Attn(t) n 1 Attn(t) k = |Vn| exp(B(t) k,n) P m =k |Vm| exp(B(t) k,m) = 1 P |Vn| exp(B(t) k,m B(t) k,n) = Θ 1 E.3.2. CONTROLLING GRADIENT UPDATES IN STAGE I OF PHASE II Lemma E.8. At each iteration T1,k < t e T ϵ 2,k, if Induction Hypothesis E.2 holds, then α(t) k 0 and satisfies Proof. The analysis is similar to that for Lemma E.3, but we need to be more careful about the lower bound of 1 Attn(t) k . By gradient expression in Lemma D.3, we obtain 1{xquery = vk} Attn(t) k m =k Attn(t) m 2 + (1 Attn(t) k )2 pk P(Pinput E bal)E m =k Attn(t) m 2 + (1 Attn(t) k )2 | {xquery = vk} E bal pk P(Pinput E bal)E h Attn(t) k (1 Attn(t) k )2 | {xquery = vk} E bal i where the last inequality follows from Lemmas D.5 and E.6 and the fact that pk = Θ 1 K in the balanced case. Lemma E.9. At each iteration T1,k < t e T ϵ 2,k, if Induction Hypothesis E.2 holds, then given k [K], for any n = k, β(t) k,n satisfies β(t) k,n 0. Proof. Note that conditioned on the event {xquery = vk} {Pinput E bal}, by Lemmas E.6 and E.7, we have Attn(t) k = Ω(1), maxm =k Attnm = O 1 K , and thus m =k Attn(t)2 m Attn(t) n Attn(t) k (1 Attn(t) k ) max m =k Attn(t) m X m =k Attn(t) m Attn(t) k (1 Attn(t) k ) = (1 Attn(t) k )(Attn(t) k max m =k Attn(t) m ) Ω(1 Attn(t) k ). (20) Therefore, by combining with Lemma D.3, we obtain 1{xquery = vk E bal} Attn(t) n m =k Attn(t) m 2 Attn(t) n Attn(t) k (1 Attn(t) k ) 1{xquery = vk E bal c} Attn(t) n m =k Attn(t) m 2 In-context Convergence of Transformers (a) pk P(Pinput E bal) E Ω((1 Attn(t) k )2 K ) | {xquery = vk} E bal + pk P(E bal c) (b) pk Ω( ϵ K ) + 3pk exp c2 bal N 25K2 where (a) follows from Equation (20) and Lemma E.7, (b) follows from Lemmas D.5 and E.6, and the last inequality holds since ϵ K exp( polylog(K)) K exp c2 bal N 25K2 Moreover, following the analysis similar to that for Lemma E.4, we have β(t) k,n pk E Attn(t) n Attn(t) n + Attn(t) k (1 Attn(t) k ) {xquery = vk} E bal + pk P(E bal c) Θ(1 Attn(t) k K ) O Attn(t) k (1 Attn(t) k ) {xquery = vk} E bal + 6pk exp c2 bal N 25K2 O(Attn(t) k (1 Attn(t) k )2 K ) {xquery = vk} E bal + 6pk exp c2 bal N 25K2 O(α(t) k K ). E.3.3. END OF STAGE I OF PHASE II Lemma E.10. Given k [K], and 0 < ϵ < 1, suppose polylog(K) log( 1 ϵ ). 
Then Induction Hypothesis E.2 holds for at least all T1,k < t e T ϵ 2,k = T1,k + O K log(Kϵ 1 , and at iteration t = e T ϵ 2,k + 1, we have A ( e T ϵ 2,k+1) k Ω log( K Proof. We first prove the existence of e T ϵ 2,k. Recall that e T ϵ 2,k := max t > T1,k : A(t) k max m =k B(t) k,m log K Lbal k 1 (3 ϵ ) 1 2 1 . When t (T1,k, e T ϵ 2,k], consider A(t+1) k max m =k B(t+1) k,m A(t) k max m =k B(t) k,m )α(t) k = Ω ηϵ where the inequality follows from Lemma E.9 and the last equation follows from Lemma E.8. Therefore, at most e T ϵ 2,k T1,k = O( K log ( K Lbal k 1)(( 3 ηϵ ) = O(K log(Kϵ 1 iterations are needed before A(t) k maxm =k B(t) k,m exceeds log K Lbal k 1 ( 3 ϵ ) 1 2 1 . In-context Convergence of Transformers It is easy to verify Induction Hypothesis E.2 holds at t = T1,k + 1. Now we suppose Induction Hypothesis E.2 holds for all iterations in [T1,k + 1, t 1], and prove it holds at t. By Lemma E.8, we have α(t 1) k 0. Thus A(t) k A(t 1) k log(K). By Lemma E.9, we have O α(t 1) k β(t 1) k,n 0. |B(t) k,n| |B(t 1) k,n | + ηO Moreover, by the definition of e T ϵ 2,k, for any T1,k < t e T ϵ 2,k we immediately have A(t) k A(t) k max m =k B(t) k,m log Therefore, A(t) k O(log( K ϵ )) for any T1,k < t e T ϵ 2,k. At iteration t = e T ϵ 2,k + 1, we have A ( e T ϵ 2,k+1) k maxm =k B ( e T ϵ 2,k+1) k,m > log ( K Lbal k 1)( 3 2 1) . Thus A ( e T ϵ 2,k+1) k When {xquery = vk} {Pinput E bal}, we obtain 1 Attn ( e T ϵ 2,k+1) k = |Vk| exp(B(t) k,m A(t) k ) P m =k |Vm| |Vk| exp(B(t) k,m A(t) k ) + 1 exp(maxm =k B(t) k,m A(t) k )( N exp(maxm =k B(t) k,m A(t) k )( N |Vk| 1) + 1 exp(maxm =k B(t) k,m A(t) k )( K exp(maxm =k B(t) k,m A(t) k )( K Lbal k 1) + 1 Lbal k 1)(( 3 ϵ ) 1 2 1) 1 ( K Lbal k 1) ( K Lbal k 1)(( 3 ϵ ) 1 2 1) 1 ( K Lbal k 1) + 1 = (ϵ/3) 1 2 , where the first inequality follows from the fact that x 1+x monotonically increases w.r.t. x 0. E.4. Phase II: Convergence: Stage II Given k [K], define T ϵ 2,k := e T ϵ 2,k + O In-context Convergence of Transformers Induction Hypothesis E.3. Suppose polylog(K) log( 1 ϵ ) for t ( e T ϵ 2,k, T ϵ 2,k]. The following holds: a. A(t) k is monotonically increasing but cannot exceed O(log(K/ϵ)); b. B(t) k,m is monotonically decreasing and |B(t) k,m| = O( A(t) k K ) for any m = k. E.4.1. TECHNICAL LEMMAS We first introduce several useful technical lemmas. Lemma E.11. Suppose Induction Hypothesis E.3 holds at iteration t ( e T ϵ 2,k, T ϵ 2,k]. If xquery = vk and Pinput E bal, the following holds 1. Attn(t) k = Ω(1); 2. (1 Attn(t) k )2 [Ω(exp( polylog(K))), ϵ]. Proof. Since xquery = vk, we have Attn(t) k = |Vk| exp(A(t) k ) P m =k |Vm| exp(B(t) k,m) + |Vk| exp(A(t) k ) |Vk| exp(B(t) k,m A(t) k ) + 1 . By Induction Hypothesis E.3, exp(B(t) k,m A(t) k ) e O( log(K/ϵ) K ) log(K) e O( log(K)+polylog(K) K ) log(K) O 1 Attn(t) k 1 O 1 |Vk| 1) + 1 1 O( 1 K ) + 1 Ω(1). We first upper-bound 1 Attn(t) k as 1 Attn(t) k = |Vk| exp(B(t) k,m A(t) k ) P |Vk| exp(B(t) k,m A(t) k ) + 1 exp(maxm =k B(t) k,m A(t) k )( N exp(maxm =k B(t) k,m A(t) k )( N |Vk| 1) + 1 (a) exp(maxm =k B ( e T ϵ 2,k+1) k,m A ( e T ϵ 2,k+1) k )( N exp(maxm =k B ( e T ϵ 2,k+1) k,m A ( e T ϵ 2,k+1) k )( N |Vk| 1) + 1 where (a) holds since maxm =k B(t) k,m A(t) k is non-increasing by Induction Hypothesis E.3, and (b) follows from the definition of e T ϵ 2,k. 
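As an illustrative aside, the two-phase picture established in this section can be visualized by iterating the population updates for A_k and the (by symmetry, common) off-target weight B_{k,n} under the idealization that every prompt contains exactly N/K tokens of each feature. This is only a heuristic reduction of the stochastic objective, with an arbitrary step size and horizon, but it reproduces the qualitative behavior proved in Lemmas E.3 through E.16: A_k grows quickly at first, B_{k,n} drifts down slowly, and (1 - Attn_k)^2 is eventually driven to near zero.

```python
import numpy as np

K, eta, steps = 20, 50.0, 200_000
pk = 1.0 / K                           # balanced feature probabilities
A, B = 0.0, 0.0                        # A_k and the common off-target weight B_{k,n}
trace = []

for t in range(steps):
    attn_k = np.exp(A) / (np.exp(A) + (K - 1) * np.exp(B))    # |V_m| = N/K idealization
    attn_m = (1.0 - attn_k) / (K - 1)                         # each off-target feature
    alpha = pk * attn_k * ((K - 1) * attn_m ** 2 + (1 - attn_k) ** 2)
    beta = pk * attn_m * ((K - 1) * attn_m ** 2 - attn_m - attn_k * (1 - attn_k))
    A, B = A + eta * alpha, B + eta * beta
    trace.append((A, B, (1 - attn_k) ** 2))

for t in [0, 1_000, 10_000, steps - 1]:
    print(t, trace[t])                 # A_k rises to ~log(K) and beyond; B_k,n stays small
```

Returning to the proof of Lemma E.11: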
Then we lower-bound 1 Attn(t) k following the analysis similar to that for Lemma E.6: 1 Attn(t) k = |Vk| exp(B(t) k,m A(t) k ) P |Vk| exp(B(t) k,m A(t) k ) + 1 In-context Convergence of Transformers exp(minm =k B(t) k,m A(t) k )( N exp(minm =k B(t) k,m A(t) k )( N |Vk| 1) + 1 exp(minm =k B(t) k,m A(t) k )( K exp(minm =k B(t) k,m A(t) k )( K U bal k 1) + 1 1 e O(log(K/ϵ)) ( K 1 e O(log(K/ϵ)) ( K U bal k 1) + 1 1 e O(polylog(K)) ( K 1 e O(polylog(K)) ( K U bal k 1) + 1 Ω(exp( polylog(K))), where the first three inequalities follow from the fact that x 1+x monotonically increases w.r.t. x 0 and A(t) k O(log(K/ϵ)). Lemma E.12. Suppose Induction Hypothesis E.3 holds at iteration t ( e T ϵ 2,k, T ϵ 2,k]. If xquery = vk and Pinput E bal, for n = k, the following holds Attn(t) n = Θ 1 Attn(t) k K Proof. By definition, Attn(t) n = |Vn| exp(B(t) k,n) P m =k |Vm| exp(B(t) k,m) + |Vk| exp(A(t) k ) . By Induction Hypothesis E.3, e O( log(K) log(ϵ) K ) exp(B(t) k,m B(t) k,n) e O( log(K) log(ϵ) Combining with the fact that log(ϵ) polylog(K), we obtain Attn(t) n 1 Attn(t) k = |Vn| exp(B(t) k,n) P m =k |Vm| exp(B(t) k,m) = 1 P |Vn| exp(B(t) k,m B(t) k,n) = Θ 1 E.4.2. CONTROLLING GRADIENT UPDATES IN STAGE II OF PHASE II Lemma E.13. At each iteration t ( e T ϵ 2,k, T ϵ 2,k]. If Induction Hypothesis E.3 holds for t, then α(t) k 0 and satisfies Proof. By the gradient expression in Lemma D.3, 1{xquery = vk} Attn(t) k m =k Attn(t) m 2 + (1 Attn(t) k )2 m =k Attn(t) m 2 + (1 Attn(t) k )2 | {xquery = vk} E bal In-context Convergence of Transformers + 6pk exp c2 bal N 25K2 pk O(ϵ) + 6pk exp c2 bal N 25K2 where the second inequality follows from Lemmas E.11 and E.12, and the last inequality follows from the fact that pk = Θ 1 K and ϵ = Ω(exp( polylog(K))) 6 exp c2 bal N 25K2 . Lemma E.14. At each iteration t ( e T ϵ 2,k, T ϵ 2,k], if Induction Hypothesis E.3 holds for t, for any n = k, β(t) k,n satisfies β(t) k,n 0. Proof. Note that conditioned on the event {xquery = vk} {Pinput E bal}, Attn(t) k = Ω(1), and maxm =k Attn(t) m = O( ϵ 1 2 K ). Thus, m =k Attn(t) m 2 Attn(t) n Attn(t) k (1 Attn(t) k ) max m =k Attn(t) m X m =k Attn(t) m Attn(t) k (1 Attn(t) k ) = (1 Attn(t) k )(Attn(t) k max m =k Attn(t) m ) Ω(1 Attn(t) k ) Ω(exp( polylog(K))) . Therefore, by the gradient expression in Lemma D.3 and the fact that N K3, β(t) k,n 6 exp c2 bal N 25K2 Ω(exp( polylog(K))) < 0. Moreover, following the analysis similar to that for Lemma E.9, we have β(t) k,n pk E h Attn(t) n Attn(t) n + Attn(t) k (1 Attn(t) k ) | {xquery = vk} E bal i + pk P(E bal c) Θ(1 Attn(t) k K ) O Attn(t) k (1 Attn(t) k ) | {xquery = vk} E # + 6pk exp c2 bal N 25K2 O(Attn(t) k (1 Attn(t) k )2 K ) | {xquery = vk} E # + 6pk exp c2 bal N 25K2 where the last inequality follows from the gradient expression of α(t) k in Lemma D.3 and because α(t) k 6 exp c2 bal N 25K2 . E.4.3. CONTROLLING LOSS IN STAGE II OF PHASE II Lemma E.15. Given k [K], and 0 < ϵ < 1, suppose polylog(K) log( 1 ϵ ). At each iteration t ( e T ϵ 2,k, T ϵ 2,k], if Induction Hypothesis E.3 holds for t, then we have e Lk(θ(t)) < pkϵ In-context Convergence of Transformers Proof. By the gradient expression in Lemma D.3, we have e Lk(θ(t)) = 1 2E h 1{xquery = vk Pinput E bal} (byquery w, xquery )2i 1{xquery = vk Pinput E bal} m =k Attn(t) m 2 + (1 Attn(t) k )2 2pk P (Pinput E bal) E O 1 + 1 1 Attn(t) k 2 xquery = vk Pinput E bal 2pk 1 + O 1 where the first inequality follows from Lemma E.12, and the second inequality follows from Lemma E.11. E.4.4. END OF STAGE II OF PHASE II Lemma E.16. 
Given k [K], and 0 < ϵ < 1, suppose polylog(K) log( 1 ϵ ). Then Induction Hypothesis E.3 holds for all e T ϵ 2,k < t T ϵ 2,k = e T ϵ 2,k + O Proof. It is easy to verify Induction Hypothesis E.3 holds at t = e T ϵ 2,k + 1. Now we suppose Induction Hypothesis E.3 holds for all iterations e T ϵ 2,k t 1, and prove it holds at t. For the first claim, we can upper-bound the update of A(t) k by Lemma E.13 as follows: A(t) k A(t 1) k + η O( ϵ A ( e T ϵ 2,k+1) k + η(t e T ϵ 2,k 1) O( ϵ O(log(K/ϵ)) + ηO( K log Kϵ 1 = O(log(K/ϵ)). The second claim follows from Lemma E.14 and the analysis similar to that for Lemma E.10. E.5. Proof of Theorem 3.2 for Balanced Case Theorem E.17 (Restatement of Theorem 3.2 for balanced features). Suppose pk = Θ 1 K for each k [K]. For any 0 < ϵ < 1, suppose N poly(K) and polylog(K) log( 1 ϵ ). We apply GD to train the loss function given in Equation (4). Then with at most T = O( log(K)K2 η + K log Kϵ 1 ϵη ) iterations, we have 1. The loss converges: L(θ(T )) L ϵ, where L = Θ(e poly(K)) is the global minimum of the population loss in Equation (4). 2. Attention score concentrates: if xquery = vk, with probability at least 1 e Ω(poly(K))3, the one-layer transformer nearly pays all attention to input tokens featuring vk, i.e., (1 Attn(T ) k )2 O(ϵ). 3The randomness originates from the first N input tokens in the test prompt. In-context Convergence of Transformers Proof. Denote T = maxk [K] e T ϵ 2,k + 1 = O( log(K)K2 η + K log Kϵ 1 ϵη ). Thus for any k, at iteration T , it is in stage II of the convergence phase, i.e., T ( e T ϵ 2,k, T ϵ 2,k]. Then by Lemmas E.15 and E.16, for any k [K], we obtain: e Lk(θ(T )) 2pkϵ L(θ(T )) Llow = k=1 (Lk(θ(T )) Llow k ) e Lk(θ(T )) + 3pk exp c2 bal N 25K2 3 + 3 exp c2 bal N 25K2 3 + 3 exp c2 bal N 25K2 where the first inequality follows from Lemma D.9. Finally, by Lemma D.8, L(θ(T )) L L(θ(T )) Llow ϵ. In-context Convergence of Transformers F. Analysis for the Imbalanced Case: Under-represented Features (Proof of Theorem 3.3 Part I) In this section, we present the analysis of the prediction error when the query token features an under-represented feature vk with k > 1 in the imbalanced case. We first discuss the outline of our proof. F.1. Roadmap of the Proof We will analyze the convergence of the training process via four phases of dynamics. At the beginning of each phase, we will establish an induction hypothesis, which we expect to remain valid throughout that phase. Subsequently, we will analyze the dynamics under such a hypothesis within the phase, aiming to provide proof of the hypothesis by the end of the phase. The main idea of the proof lies in analyzing the GD dynamics of A(t) k and B(t) k,n. From Definition D.2 and Lemma D.3, we have A(t+1) k = A(t) k + ηα(t) k , B(t+1) k,n = B(t) k,n + ηβ(t) k,n, 1{xquery = vk} Attn(t) k m =k Attn(t) m 2 + (1 Attn(t) k )2 β(t) k,n = E 1{xquery = vk} Attn(t) n m =k Attn(t) m 2 Attn(t) n Attn(t) k (1 Attn(t) k ) We divide the learning process of the under-represented feature vk with k > 1 into the following four phases. Phase I (t [0, T1,k], Appendix F.2): At initialization, B(t) k,1 enjoys a much larger reduction rate, i.e., βk,1 < 0 and |βk,1| is large. Therefore, the decrease of B(t) k,1 will dominate the dynamics during phase I. Phase II (t (T1,k, T2,k], Appendix F.3): At time T2,k + 1, the decrease of B(t) k,1 becomes slower, and the same happens to |β(t) k,1|. Their decreasing rate drops to be closer to the increasing rate of α(t) k . This marks the beginning of phase II. 
Shortly after entering this phase, the previous dominance of reduction of B(t) k,1 diminishes, as |β(t) k,1| approaches a comparable order of the magnitude to α(t) k . At this point, there is a shift in the leading influence, with the growth of A(t) k taking over. Phase III (t (T2,k, T3,k], Appendix F.4): Following the transitional phase, α(t) k grows from the value of Θ( 1 K1.5 ), whereas |β(t) k,1| and |β(t) k,n| for n = k, 1 stay at much lower values ( O( 1 K1.98 ) and O 1 K3 respectively). This consistent gap in magnitude between α(t) k and β(t) k,n leads to the continuously rapid growth of A(t) k , while B(t) k,n remains relatively unchanged. Phase IV (t (T3,k, T ϵ 4,k], Appendix F.5): At t = T3,k + 1, we achieve the desired attention structures for query tokens featuring the under-represented feature vk. Then we establish a connection between α(t) k and the prediction error via analyzing the change of 1 Attn(t) k that diminishes, leading to the subsequent proof of convergence. We finally combine all results in the above four phases to prove the main Theorem 3.3 for underrepresented features (Appendix F.6). F.2. Phase I: Decrease of Dominant Feature In this section, we will delve into the initial phase of learning dynamics, aiming at mitigating the high occurrence bias of the dominant feature v1. Specifically, for k > 1, Bk,1 will undergo significant decrease during this phase. Let us begin by defining phase I. For the k-th feature vk with k > 1, we define phase I as all iterations t T1,k, where T1,k max n t : B(t) k,1 0.49 log(K) o . In-context Convergence of Transformers We state the following induction hypothesis, which will hold throughout phase I: Induction Hypothesis F.1. Given k > 1, for each 0 t T1,k, the following holds: a. A(t) k is monotonically increasing and A(t) k [0, O( log(K) b. B(t) k,1 is monotonically decreasing and B(t) k,1 [ 0.49 log(K), 0]; c. |B(t) k,n| = O( A(t) k B(t) k,1 K ) and B(t) k,n > B(t) k,1 for any n = k, 1. F.2.1. TECHNICAL LEMMAS We first introduce several technical lemmas that will be used for the proof of Induction Hypothesis F.1. Lemma F.1. If Induction Hypothesis F.1 holds at iteration 0 t T1,k, for the prompt satisfying xquery = vk and Pinput E imbal, the following holds 1. Attn(t) k = Θ 1 2. Attn(t) 1 = Ω 1 K0.49 ; 3. 1 Attn(t) 1 Attn(t) k Ω(1). Proof. Since xquery = vk, and |Vk| > 0 for Pinput E imbal, we have Attn(t) k = |Vk|evk Q(t)vk P j [N] e Ex j Q(t)vk = |Vk| exp(A(t) k ) P m =k |Vm| exp(B(t) k,m) + |Vk| exp(A(t) k ) |Vk| exp(B(t) k,m A(t) k ) + 1 . By Induction Hypothesis F.1, we have for m = 1, k, e O( log(K) K0.02 ) exp(B(t) k,m A(t) k ) e O( log(K) for m = 1, e( 0.49 log(K) O( log(K) K0.02 )) exp(B(t) k,1 A(t) k ) e0. Combining with the fact that P |Vk| = Θ(K) for Pinput E imbal, we have Attn(t) k Ω 1 On the other hand, since N |V1| |Vk| is still Θ(K), we have Attn(t) k 1 e O( log(K) K0.02 )( N |V1| |Vk| 1) + e( 0.49 log(K) O( log(K) K0.02 )) |V1| |Vk| + 1 O 1 By similar analysis, we have Attn(t) 1 = |V1| exp(B(t) k,1) P m =k |Vm| exp(B(t) k,m) + |Vk| exp(A(t) k ) m =1,k |Vm| |Vk| exp(B(t) k,m B(t) k,1) + |Vk| |V1| exp(A(t) k B(t) k,1) + 1 . By Induction Hypothesis F.1, In-context Convergence of Transformers for m = 1, k, we have e0 exp(B(t) k,m B(t) k,1) e0.49 log(K)+O( log(K) e0 exp(A(t) k B(t) k,1) e0.49 log(K)+O( log(K) Attn(t) 1 1 e0.49 log(K)+O( log(K) |V1| 1) + 1 Ω 1 K0.49 where the last inequality holds since N |V1| = Θ(1) for Pinput E imbal. For the last statement, 1 Attn(t) 1 e0( N |V1| 1) + 1 Ω(1). 
Combining with the fact that Attn(t) k = Θ 1 K , we have 1 Attn(t) k Attn(t) 1 Ω(1). Lemma F.2. If Induction Hypothesis F.1 holds at iteration 0 t T1,k, for the prompt satisfying xquery = vk and Pinput E imbal, the following holds Attn(t) n = O 1 Attn(t) k Attn(t) 1 K Proof. Since xquery = vk, we have Attn(t) n = |Vn|evn Q(t)vk P j [N] e Ex j Q(t)vk = |Vn| exp(B(t) k,n) P m =k |Vm| exp(B(t) k,m) + |Vk| exp(A(t) k ) . By Induction Hypothesis F.1, for m, n = 1, e O( log(K) K ) exp(B(t) k,m B(t) k,n) e O( log(K) Combining with the fact that |Vm| |Vn| = Θ(1) for Pinput E imbal, we have Attn(t) n 1 Attn(t) k Attn(t) 1 = |Vn| exp(B(t) k,n) P m =1,k |Vm| exp(B(t) k,m) m =k,1 |Vm| |Vn| exp(B(t) k,m B(t) k,n) In-context Convergence of Transformers F.2.2. CONTROLLING GRADIENT UPDATES IN PHASE I Lemma F.3. Given k > 1, if Induction Hypothesis F.1 holds at iteration 0 t T1,k, then α(t) k 0 and satisfies α(t) k = Θ 1 Proof. By the gradient expression in Lemma D.3, we have 1{xquery = vk} Attn(t) k m =k Attn(t) m 2 + (1 Attn(t) k )2 1{xquery = vk E imbal} Attn(t) k m =k Attn(t) m 2 + (1 Attn(t) k )2 1{xquery = vk E imbal c} Attn(t) k m =k Attn(t) m 2 + (1 Attn(t) k )2 (a) pk P(Pinput E imbal)E m =k Attn(t) m 2 + (1 Attn(t) k )2 {xquery = vk} E imbal + 2pk P(Pinput E imbal c) + Attn(t) 1 2 + (1 Attn(t) k )2 ! {xquery = vk} E imbal + 2pk P(Pinput E imbal c) where (a) follows from the fact that xquery and Pinput are independently sampled, and Attn(t) k (P m =k Attn(t) m 2 + (1 Attn(t) k )2) is upper-bounded by 2 on the event {Pinput E imbal c}, (b) follows by applying Lemma F.2 to Attn(t) m for m = 1, k, and (c) follows from Lemma F.1, our choice of pk, Lemma D.6, and the evident bound: 3 exp c2 im N 25K2 Similarly, we can show that α(t) k Ω 1 Lemma F.4. Given k > 1, if Induction Hypothesis F.1 holds at iteration 0 t Tk,1, then β(t) k,1 < 0 satisfies |β(t) k,1| Ω 1 K1.98 Proof. We first derive X m =k Attn(t) m 2 Attn(t) 1 Attn(t) k (1 Attn(t) k ) m =1,k Attn(t) m 2 Attn(t) 1 (1 Attn(t) 1 ) Attn(t) k (1 Attn(t) k ) max m =1,k Attn(t) m (1 Attn(t) 1 Attn(t) k ) Attn(t) 1 (1 Attn(t) 1 ) Attn(t) k (1 Attn(t) k ) (1 Attn(t) k Attn(t) 1 )(Attn(t) 1 + Attn(t) k max m =1,k Attn(t) m ). (21) In-context Convergence of Transformers Therefore, by the gradient expression in Lemma D.3, we have 1{xquery = vk Pinput E imbal} Attn(t) 1 m =k Attn(t) m 2 Attn(t) 1 Attn(t) k (1 Attn(t) k ) 1{xquery = vk Pinput E imbal c} Attn(t) 1 m =k Attn(t) m 2 (a) pk P(Pinput E imbal c) + pk P(Pinput E imbal) Ω((Attn(t) 1 + Attn(t) k maxm =1,k Attn(t) m ) K0.49 ) {xquery = vk} E imbal pk Ω 1 K0.98 + 3pk exp c2 im N 25K2 = Ω 1 K1.98 where (a) follows from Equation (21) and Lemma F.1, and the last equality holds since 1 K0.98 exp c2 im N 25K2 Lemma F.5. If Induction Hypothesis F.1 holds at iteration 0 t Tk,1, for any n = 1, k, β(t) k,n satisfies |β(t) k,n| O α(t) k β(t) k,1 K Proof. 
By the gradient expression in Lemma D.3, we have 1{xquery = vk} Attn(t) n m =k Attn(t) m 2 β(t) k,n E h 1{xquery = vk} Attn(t) n Attn(t) n + Attn(t) k (1 Attn(t) k ) i (23) We further upper-bound Equation (22) as, 1{xquery = vk Pinput E imbal} Attn(t) n m =k Attn(t) m 2 1{xquery = vk Pinput E imbal c} Attn(t) n m =k Attn(t) m 2 pk P(Pinput E imbal) E Attn(t) n Attn(t) 1 | {xquery = vk} E imbal + pk P(Pinput E imbal c) + 3pk exp c2 im N 25K2 In-context Convergence of Transformers where (a) follows from the following two observations from Lemma F.2: |β(t) k,1| pk P(Pinput E imbal) E Ω Attn(t) 1 2 {xquery = vk} E imbal and Attn(t) n O 1 To further upper-bound Equation (23), we have pk E Attn(t) n Attn(t) n + Attn(t) k (1 Attn(t) k ) {xquery = vk} E imbal + pk P(E imbal c) (a) pk P(Pinput E imbal c) + pk P(E imbal)E 1 Attn(t) k K 1 Attn(t) k K + Attn(t) k (1 Attn(t) k ) ! {xquery = vk} E imbal pk P(E imbal)E O(Attn(t) k (1 Attn(t) k )2 K ) {xquery = vk} E imbal + 3pk exp c2 im N 25K2 where (a) follows from Lemma F.2, and the last inequality follows from the analysis in the proof of Lemma F.3, and from the fact that 3 exp c2 im N 25K2 Thus, we obtain |β(t) k,n| O α(t) k β(t) k,1 K F.2.3. END OF PHASE I Lemma F.6. Given k 2, Induction Hypothesis F.1 holds for all t T1,k = O( log(K)K1.98 η ), and at iteration t = T1,k + 1, we have a. B(T1,k+1) k,1 0.49 log(K); b. Attn1 = O 1 K0.49 if xquery = vk and Pinput E imbal. Proof. The existence of T1,k = O( log(K)K1.98 η ) directly follows from Lemma F.3. It is easy to verify that Induction Hypothesis F.1 holds at t = 0. Now we suppose Induction Hypothesis F.1 holds for all iterations t 1, and prove it holds at t. By Lemma F.3, we have α(t 1) k 0. Thus A(t) k = A(t 1) k + ηα(t 1) k 0. Moreover, combining Lemmas F.3 and F.4, we obtain A(t) k A(0) k O( |B(t) k,1 B(0) k,1| K0.02 ) which further implies A(t) k O(log(K)/K0.02). For m = 1, k, by Lemma F.5, we have |B(t) k,m| O( A(t) k A(0) k + |B(t) k,1 B(0) k,1| K ) O(log(K)/K). The proof for the second statement is deferred to the next phase (Lemma F.7). In-context Convergence of Transformers F.3. Phase II: Switching of Leading Influence During phase I, B(t) k,1 significantly decreases, resulting in a decrease in Attn(t) 1 , while other Attn(t) n with n > 1 remain approximately at the order of Θ 1 K . By the end of phase I, (Attn(t) 1 )2 decreases to O( 1 K0.98 ), leading to a decrease in |β(t) k,1| as it approaches towards α(t) k . At this point, phase II begins. Shortly after entering this phase, the prior dominant role of the decrease of B(t) k,1 in learning dynamics diminishes as |β(t) k,1| reaches the same order of magnitude as α(t) k . For k > 1, define T2,k max{t > T1,k : A(t) k B(t) k,1 1.01 log(K)}. We next state the following induction hypothesis which holds during phase II. Induction Hypothesis F.2. For T1,k < t T2,k, the following holds a. A(t) k is monotonically increasing and A(t) k [0, 0.52 log(K)]; b. B(t) k,1 is monotonically decreasing and B(t) k,1 [ 0.51 log(K), 0.49 log(K)]; c. |B(t) k,n| = O( A(t) k +|B(t) k,1| K ) for any n = 1, k. F.3.1. TECHNICAL LEMMAS We first introduce several technical lemmas that will be used for the proof of Induction Hypothesis F.2. Lemma F.7. Suppose Induction Hypothesis F.2 holds at iteration T1,k < t T2,k. If xquery = vk and Pinput E imbal, the following holds 1. Attn(t) k [Ω 1 K , O( 1 K0.48 )]; 2. Attn(t) 1 [Ω 1 K0.51 , O 1 K0.49 ]; 3. 1 Attn(t) 1 Attn(t) k Ω(1). Proof. 
Since xquery = vk, and |Vk| > 0 for Pinput E imbal, we have Attn(t) k = |Vk|evk Q(t)vk P j [N] e Ex j Q(t)vk = |Vk| exp(A(t) k ) P m =k |Vm| exp(B(t) k,m) + |Vk| exp(A(t) k ) |Vk| exp(B(t) k,m A(t) k ) + 1 . By Induction Hypothesis F.2, for m = 1, k, we have e O( log(K) K ) 0.52 log(K) exp(B(t) k,m A(t) k ) e O( log(K) for m = 1, e 1.01 log(K) exp(B(t) k,1 A(t) k ) e0. Combining with the fact that P |Vk| = Θ(K) for Pinput E imbal, we obtain Attn(t) k Ω 1 Moreover, since N |V1| |Vk| is still at the order of Θ(K), we have Attn(t) k 1 e O( log(K) K ) 0.52 log(K)( N |V1| |Vk| 1) + e 1.01 log(K) |V1| |Vk| + 1 O 1 K0.48 In-context Convergence of Transformers We next analyze Attn(t) 1 as Attn(t) 1 = |V1| exp(B(t) k,1) P m =k |Vm| exp(B(t) k,m) + |Vk| exp(A(t) k ) m =1,k |Vm| |Vk| exp(B(t) k,m B(t) k,1) + |Vk| |V1| exp(A(t) k B(t) k,1) + 1 . By Induction Hypothesis F.2, for m = 1, k, we have e0.49 log(K) O( log(K) K ) exp(B(t) k,m B(t) k,1) e0.51 log(K)+O( log(K) for m = 1, e0.49 log(K) exp(A(t) k B(t) k,1) e1.01 log(K). Thus, we obtain Attn(t) 1 1 e0.51 log(K)+O( log(K) K )( N |Vk| |V1| 1) + e1.01 log(K)+O( log(K) |V1| + 1 Ω 1 K0.51 On the other hand, 1 Attn(t) 1 e0.49 log(K) O( log(K) e0.49 log(K) O( log(K) |V1| 1) + 1 Ω(1). Thus, we obtain 1 Attn(t) k Attn(t) 1 Ω(1). Lemma F.8. Suppose Induction Hypothesis F.2 holds at iteration T1,k < t T2,k. If xquery = vk and Pinput E imbal, for n = 1, k, the following holds Attn(t) n = O 1 Attn(t) k Attn(t) 1 K Proof. Since xquery = vk, we have Attn(t) n = |Vn|evn Q(t)vk P j [N] e Ex j Q(t)vk = |Vn| exp(B(t) k,n) P m =k |Vm| exp(B(t) k,m) + |Vk| exp(A(t) k ) . By Induction Hypothesis F.2, for m, n = 1, e O( log(K) K ) exp(B(t) k,m B(t) k,n) e O( log(K) Combining with the fact that |Vm| |Vn| = Θ(1) for Pinput E imbal, we obtain Attn(t) n 1 Attn(t) k Attn(t) 1 = |Vn| exp(B(t) k,n) P m =1,k |Vm| exp(B(t) k,m) In-context Convergence of Transformers m =k,1 |Vm| |Vn| exp(B(t) k,m B(t) k,n) F.3.2. CONTROLLING GRADIENT UPDATES IN PHASE II Lemma F.9. Given k > 1, if Induction Hypothesis F.2 holds at iteration T1,k < t T2,k, then α(t) k 0 and satisfies Proof. By the gradient expression in Lemma D.3, we have 1{xquery = vk} Attn(t) k m =k Attn(t) m 2 + (1 Attn(t) k )2 1{xquery = vk E imbal} Attn(t) k m =k Attn(t) m 2 + (1 Attn(t) k )2 1{xquery = vk E imbal c} Attn(t) k m =k Attn(t) m 2 + (1 Attn(t) k )2 pk P(P E imbal)E m =k Attn(t) m 2 + (1 Attn(t) k )2 {xquery = vk} E imbal where the last inequality follows from Lemma D.6, Lemma F.7 and our choice of pk. Lemma F.10. Given k > 1, if Induction Hypothesis F.2 holds at iteration Tk,1 t Tk,2, then β(t) k,1 < 0 and satisfies |β(t) k,1| Ω 1 K2.02 , O 1 K1.97 Proof. Following the computations similar to those in Lemma F.4, we have X m =k Attn(t) m 2 Attn(t) 1 Attn(t) k (1 Attn(t) k ) (1 Attn(t) k Attn(t) 1 )(Attn(t) 1 + Attn(t) k max m =1,k Attn(t) m ). 
1{xquery = vk E imbal} Attn(t) 1 m =k Attn(t) m 2 Attn(t) 1 Attn(t) k (1 Attn(t) k ) 1{xquery = vk E imbal c} Attn(t) 1 m =k Attn(t) m 2 In-context Convergence of Transformers (a) pk P(Pinput E imbal) E Ω((Attn(t) 1 + Attn(t) k maxm =1,k Attn(t) m ) K0.51 ) | {xquery = vk} E imbal + pk P(Pinput E imbal c) (b) pk Ω 1 K1.02 + 3pk exp c2 im N 25K2 = Ω 1 K2.02 where (a) follows from Lemma F.7, (b) follows from Lemma F.7 and Lemma D.6, and the last inequality holds since 1 K1.02 exp c2 im N 25K2 β(t) k,1 E h 1{xquery = vk E imbal} Attn(t) 1 Attn(t) 1 + Attn(t) k (1 Attn(t) k ) i + E h 1{xquery = vk E imbal c} Attn(t) 1 Attn(t) 1 + Attn(t) k (1 Attn(t) k ) i (a) pk P(Pinput E imbal) E h Attn(t) 1 O(Attn(t) 1 + Attn(t) k ) | {xquery = vk} E imbal i + 2pk P(Pinput E imbal c) (b) pk O 1 K0.97 + 6pk exp c2 im N 25K2 where (a) follows because Attn(t) 1 + Attn(t) k (1 Attn(t) k ) is upper-bounded by 2 on the event {Pinput E imbal c}, and (b) follows from Lemma F.7. Lemma F.11. If Induction Hypothesis F.2 holds at iteration T1,k < t T2,k, for any n = 1, k, β(t) k,n satisfies |β(t) k,n| O α(t) k β(t) k,1 K Proof. By the gradient expression in Lemma D.3, we have 1{xquery = vk} Attn(t) n m =k Attn(t) m 2 β(t) k,n E h 1{xquery = vk} Attn(t) n Attn(t) n + Attn(t) k (1 Attn(t) k ) i . (25) To further bound Equation (24), we have 1{xquery = vk Pinput E imbal} Attn(t) n m =k Attn(t) m 2 1{xquery = vk Pinput E imbal c} Attn(t) n m =k Attn(t) m 2 In-context Convergence of Transformers pk P(Pinput E imbal) E Attn(t) n Attn(t) 1 {xquery = vk} E imbal + pk P(Pinput E imbal c) + 3pk exp c2 im N 25K2 To further bound Equation (25), we have pk E h Attn(t) n Attn(t) n + Attn(t) k (1 Attn(t) k ) | {xquery = vk} E imbal i + 2pk P(E imbal c) =2pk P(E imbal c) + pk P(E imbal)E 1 Attn(t) k K 1 Attn(t) k K + Attn(t) k (1 Attn(t) k ) ! {xquery = vk} E imbal pk P(E imbal)E Attn(t) k (1 Attn(t) k )2 ! {xquery = vk} E imbal + 6pk exp c2 im N 25K2 O(α(t) k K ). Following from the analysis in Lemma F.9, we have 6 exp c2 im N 25K2 Thus, we obtain |β(t) k,n| O α(t) k β(t) k,1 K F.3.3. END OF PHASE II Lemma F.12. Given k 2, Induction Hypothesis F.2 holds for all T1,k < t T2,k = T1,k + O( log(K)K2 η ), and at iteration t = T2,k + 1, we have a. A(T2,k+1) k 0.5 log(K); b. B(T2,k+1) k 0.51 log(K). Proof. The existence of T2,k = T1,k + O( log(K)K2 η ) directly follows from Lemmas F.9 and F.10. It is easy to verify that Induction Hypothesis F.2 holds at T1,k + 1. Now we suppose Induction Hypothesis F.2 holds for all iterations t 1, and prove that it holds at t. For m = 1, k, by Lemma F.11, we have |B(t) k,m| |B(T1,k+1) k,m | + O( A(T2,k) k A(T1,k+1) k + |B(T2,k) k,1 B(T1,k+1) k,1 | K ) O(log(K)/K). In-context Convergence of Transformers Now suppose A(T2,k+1) k < 0.5 log(K), then B(T2,k+1) k,1 < 0.51 log(K). Denote the first time that B(t) k,1 reaches 0.501 log(K) as e T. Note that e T < T (t) 2,k since β(t) k,1, the change of B(t) k,1, satisfies |β(t) k,1| log(K). Then for t e T, if xquery = vk and Pinput E imbal, the following holds: 1. Attn(t) k [Ω 1 K , O( 1 K0.5 )]; 2. Attn(t) 1 O( 1 K0.501 ). 
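As an illustrative aside, the same kind of heuristic reduction illustrates the stage-wise behavior for an under-represented feature that Appendices F.2 through F.5 establish rigorously. Keeping three scalars, A_k, B_{k,1}, and a common B_{k,n} for n != 1, k, and assuming each prompt realizes the expected token counts, the iteration below shows B_{k,1} decreasing first (Phase I), the growth of A_k then taking over (Phases II and III), and finally (1 - Attn_k)^2 shrinking toward zero (Phase IV). The step size, horizon, K, and p_1 are illustrative choices, not the constants of the analysis.

```python
import numpy as np

K, p1, eta, steps = 20, 0.5, 100.0, 200_000
pk = (1 - p1) / (K - 1)                    # under-represented feature probability
c1, ck = p1, (1 - p1) / (K - 1)            # expected token fractions |V_1|/N, |V_k|/N
A, B1, B = 0.0, 0.0, 0.0                   # A_k, B_{k,1}, common B_{k,n} for n != 1, k
trace = []

for t in range(steps):
    scores = np.array([ck * np.exp(A), c1 * np.exp(B1), (K - 2) * ck * np.exp(B)])
    attn_k, attn_1, rest = scores / scores.sum()
    attn_n = rest / (K - 2)                                  # each other off-target feature
    off = attn_1 ** 2 + (K - 2) * attn_n ** 2                # sum_{m != k} Attn_m^2
    alpha = pk * attn_k * (off + (1 - attn_k) ** 2)
    beta1 = pk * attn_1 * (off - attn_1 - attn_k * (1 - attn_k))
    betan = pk * attn_n * (off - attn_n - attn_k * (1 - attn_k))
    A, B1, B = A + eta * alpha, B1 + eta * beta1, B + eta * betan
    trace.append((round(A, 3), round(B1, 3), round(B, 4), (1 - attn_k) ** 2))

for t in [0, 1_000, 20_000, steps - 1]:
    print(t, trace[t])    # B_{k,1} drops first; A_k then takes off; (1 - Attn_k)^2 -> 0
```

Returning to the proof of Lemma F.12: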
Therefore, following the analysis similar to those for Lemma F.10, we have |β(t) k,1| E h 1{xquery = vk E imbal} Attn(t) 1 Attn(t) 1 + Attn(t) k (1 Attn(t) k ) i + E h 1{xquery = vk E imbal c} Attn(t) 1 Attn(t) 1 + Attn(t) k (1 Attn(t) k ) i pk P(P E imbal) E h Attn(t) 1 O(Attn(t) 1 + Attn(t) k ) | {xquery = vk} E imbal i + 2pk P(E imbal c) pk O 1 K1.002 α(t) k K0.501 + 6pk exp c2 im N 25K2 α(t) k K0.002 where the last inequality follows from Lemma F.9. Since |B(T2,k+1) k,1 B( e T ) k,1 | Ω(log(K)), we have A(T2,k+1) k |B(T2,k+1) k,1 B( e T ) k,1 | Ω(K0.002) + A( e T ) k Ω(K0.002 log(K)), which contradicts the assumption that A(T2,k+1) k < 0.5 log(K). Therefore, A(T2,k+1) k 0.5 log(K). Noting that once B(t) k,1 drops below 0.501 log(K), it will change much smaller compared to the increase of A(t) k . Thus, B(T2,k+1) k,1 0.51 log(K). F.4. Phase III: Growth of Target Feature After the transition phase, A(t) k will experience a larger gradient, with the growth of A(t) k becoming the dominant effect in this phase. For the k-th feature vk, we define phase III as all iterations T2,k < t T3,k, where T3,k max n t > T2,k : A(t) k log(K) o . We state the following induction hypothesis, which will hold throughout phase III. Induction Hypothesis F.3. For each T2,k < t T3,k, the following holds: a. A(t) k is monotonically increasing and A(t) k [0.5 log(K), log(K)]; b. B(t) k,1 is monotonically decreasing and B(t) k,1 [ 0.51 log(K) O( log(K) K0.48 ), 0.49 log(K)]; c. |B(t) k,n| = O( A(t) k +|B(t) k,1| K ) for any n = 1, k. F.4.1. TECHNICAL LEMMAS We first introduce several useful technical lemmas. Lemma F.13. Suppose Induction Hypothesis F.3 holds at iteration Tk,2 < t Tk,3. If xquery = vk and Pinput E imbal, then the following holds In-context Convergence of Transformers 1. Attn(t) k = Ω 1 K0.5 ; 2. Attn(t) 1 [Ω 1 K0.51 , O 1 K0.49 ]; 3. 1 Attn(t) k Ω(1). Proof. Since xquery = vk, and |Vk| > 0 for Pinput E imbal, we have Attn(t) k = |Vk|evk Q(t)vk P j [N] e Ex j Q(t)vk = |Vk| exp(A(t) k ) P m =k |Vm| exp(B(t) k,m) + |Vk| exp(A(t) k ) = 1 P m =k |Vm| |Vk| exp(B(t) k,m A(t) k ) + 1 . By Induction Hypothesis F.3, we have for m = 1, e (log(K)+O( log(K) K )) exp(B(t) k,m A(t) k ) e O( log(K) K ) 0.5 log(K); e (1.51 log(K)+O( log(K) K )) exp(B(t) k,1 A(t) k ) e 1.01 log(K). Attn(t) k 1 e O( log(K) K ) 0.5 log(K)( N |V1| |Vk| 1) + e 1.01 log(K) |V1| |Vk| + 1 Ω 1 K0.5 where the second inequality follows from the fact that Pinput E imbal. On the other hand, Attn(t) k 1 e (log(K)+O( log(K) K ))( N |V1| |Vk| 1) + 1 1 e 1( 1 1 Attn(t) k e (log(K)+O( log(K) K ))( N |V1| |Vk| 1) + e 1.01 log(K) |V1| e (log(K)+O( log(K) K ))( N |V1| |Vk| 1) + e 1.01 log(K) |V1| |Vk| + 1 Ω(1). We next analyze Attn(t) 1 as follows. Attn(t) 1 = |V1| exp(B(t) k,1) P m =k |Vm| exp(B(t) k,m) + |Vk| exp(A(t) k ) m =1,k |Vm| |Vk| exp(B(t) k,m B(t) k,1) + |Vk| |V1| exp(A(t) k B(t) k,1) + 1 . By Induction Hypothesis F.3, for m = 1, k, we have e0.49 log(K) O( log(K) K ) exp(B(t) k,m B(t) k,1) e0.51 log(K)+O( log(K) for m = 1, e1.01 log(K) exp(A(t) k B(t) k,1) e1.51 log(K)+O( log(K) In-context Convergence of Transformers Attn(t) 1 1 e0.49 log(K) O( log(K) K )( N |Vk| |V1| 1) + e1.01 log(K) |Vk| |V1| + 1 O 1 K0.49 Attn(t) 1 1 |V1| 1) + e1.51 log(K)+O( log(K) |V1| + 1 Ω 1 K0.51 Lemma F.14. Suppose Induction Hypothesis F.3 holds at iteration T2,k < t T3,k. If xquery = vk and Pinput E imbal, for n = 1, k, then the following holds Attn(t) n = O 1 Attn(t) k Attn(t) 1 K Proof. 
By Induction Hypothesis F.3, we have e O( log(K) K ) exp(B(t) k,m B(t) k,n) e O( log(K) Combining with the fact that |Vm| |Vn| = Θ(1) when Pinput E imbal, we have Attn(t) n 1 Attn(t) k Attn(t) 1 = |Vn| exp(B(t) k,n) P m =1,k |Vm| exp(B(t) k,m) = 1 P m =1,k |Vm| |Vn| exp(B(t) k,m B(t) k,n) O 1 F.4.2. CONTROLLING GRADIENT UPDATES IN PHASE III Lemma F.15. At each iteration T2,k < t T3,k, if Induction Hypothesis F.3 holds, then α(t) k 0 and satisfies α(t) k Ω 1 K1.5 Proof. By the gradient expression in Lemma D.3, we have 1{xquery = vk} Attn(t) k m =k Attn(t) m 2 + (1 Attn(t) k )2 1{xquery = vk E imbal} Attn(t) k m =k Attn(t) m 2 + (1 Attn(t) k )2 1{xquery = vk E imbal c} Attn(t) k m =k Attn(t) m 2 + (1 Attn(t) k )2 pk P(Pinput E imbal)E m =k Attn(t) m 2 + (1 Attn(t) k )2 {xquery = vk} E imbal pk P(Pinput E imbal)E h Attn(t) k (1 Attn(t) k )2 | {xquery = vk} E imbal i where the last inequality follows from Lemma D.6, Lemma F.13 and our choice of pk. In-context Convergence of Transformers Lemma F.16. Given k > 1, if Induction Hypothesis F.3 holds at iteration Tk,2 t Tk,3, then β(t) k,1 < 0 satisfies α(t) k K0.48 Proof. Following the computations similar to those for Lemma F.4, we have X m =k Attn(t) m 2 Attn(t) 1 Attn(t) k (1 Attn(t) k ) (1 Attn(t) k Attn(t) 1 )(Attn(t) 1 + Attn(t) k max m =1,k Attn(t) m ). 1{xquery = vk E imbal} Attn(t) 1 m =k Attn(t) m 2 Attn(t) 1 Attn(t) k (1 Attn(t) k ) 1{xquery = vk E imbal c} Attn(t) 1 m =k Attn(t) m 2 (a) pk P(Pinput E imbal c) + pk P(Pinput E imbal) Ω((Attn(t) 1 + Attn(t) k maxm =1,k Attn(t) m ) K0.51 ) {xquery = vk} E imbal (b) pk Ω 1 K1.01 + 3pk exp c2 im N 25K2 = Ω 1 K2.01 where both (a) and (b) follow from Lemma F.13, and the last inequality holds since 1 K1.01 exp c2 im N 25K2 Moreover, we have β(t) k,1 E h 1{xquery = v1 Pinput E imbal} Attn(t) 1 (Attn1 + Attnk(1 Attnk)) i + E h 1{xquery = vk Pinput E imbal c} Attn(t) 1 (Attn1 + Attnk(1 Attnk)) i pk P(Pinput E imbal) E [Attn1 O(Attn1 + Attnk) | {xquery = vk} E imbal] + 2pk P(Pinput E imbal c) pk O 1 K0.98 α(t) k K0.49 + 6pk exp c2 im N 25K2 α(t) k K0.48 where the last inequality follows from Lemma F.15. Lemma F.17. If Induction Hypothesis F.3 holds at iteration T2,k < t T3,k, then for any n = 1, k, β(t) k,n satisfies |β(t) k,n| O α(t) k β(t) k,1 K In-context Convergence of Transformers Proof. By the gradient computation in Lemma D.3, we have 1{xquery = vk} Attn(t) n m =k Attn(t) m 2 β(t) k,n E h 1{xquery = vk} Attn(t) n Attn(t) n + Attn(t) k (1 Attn(t) k ) i . (27) We further bound Equation (26) as 1{xquery = vk Pinput E imbal} Attn(t) n m =k Attn(t) m 2 1{xquery = vk Pinput E imbal c} Attn(t) n m =k Attn(t) m 2 pk P(Pinput E imbal) E Attn(t) n Attn(t) 1 {xquery = vk} Pinput E imbal + pk P(Pinput E imbal c) + 6pk exp c2 im N 25K2 We then further bound Equation (27) as β(t) k,n pk E h Attn(t) n Attn(t) n + Attn(t) k (1 Attn(t) k ) | {xquery = vk} Pinput E imbal i + pk P(Pinput E imbal c) =pk P(Pinput E imbal c) + pk P(Pinput E imbal)E 1 Attn(t) k K 1 Attn(t) k K + Attn(t) k (1 Attn(t) k ) ! {xquery = vk} E imbal pk P(E imbal)E Attn(t) k (1 Attn(t) k )2 ! {xquery = vk} E imbal + 6pk exp c2 im N 25K2 Following from the analysis in Lemma F.15, we have α(t) k Ω 1 K1.5 Thus, we obtain |β(t) k,n| O α(t) k β(t) k,1 K In-context Convergence of Transformers F.4.3. END OF PHASE III Lemma F.18. Given k > 1, Induction Hypothesis F.3 holds for all T2,k < t T3,k = T2,k + O( log(K)K1.5 η ), and at iteration t = T3,k + 1, we have a. A(T3,k+1) k log(K); b. Attnk = Ω(1) if xquery = vk and Pinput E imbal. Proof. 
The existence of T3,k = T2,k + O( log(K)K1.5 η ) directly follows from Lemma F.15. It is easy to verify Induction Hypothesis F.3 holds at t = T2,k + 1. Now we suppose Induction Hypothesis F.3 holds for all iterations t 1, and prove it holds at t. By Lemma F.15, we have α(t 1) k 0. Thus A(t) k = A(t 1) k + ηα(t 1) k 0.5 log(K). Morover, by Lemma F.16, we have |B(t) k,1 B(T2,k+1) k,1 | O( A(t) k A (T2,k+1) k K0.48 ) which immediately implies that B(t) k,1 O(log(K)/K0.48) 0.51 log(K). For m = 1, k, by Lemma F.17, we have |B(t) k,m| O( A(t) k A(T2,k+1) k + |B(t) k,1 B(T2,k+1) k,1 | K ) O(log(K)/K). The proof for the second statement is deferred to the next phase (Lemma F.19). F.5. Phase IV: Convergence At t = T3,k + 1, the desired attention structure for the query token associated with feature vk has already been achieved. In this final phase, we establish that these structures, including each under-represented feature, indeed represent the solutions toward which the algorithm converges. Given any 0 < ϵ < 1, for k 2, define T ϵ 4,k max t > T3,k : A(t) k log e(1 Lim 1 )K + U im 1 K0.51 Induction Hypothesis F.4. For T3,k < t T ϵ 4,k, suppose polylog(K) log( 1 ϵ ). Then the following holds. a. A(t) k is monotonically increasing and A(t) k [log(K), O(log(K/ϵ))]; b. B(t) k,1 is monotonically decreasing and B(t) k,1 0.51 log(K) O log(K) , 0.49 log(K) c. B(t) k,n is monotonically decreasing and |B(t) k,n| = O( log(K/ϵ) K ) for any n = 1, k. F.5.1. TECHNICAL LEMMAS We first introduce several useful technical lemmas. Lemma F.19. Suppose Induction Hypothesis F.4 holds at iteration T3,k < t T ϵ 4,k. If xquery = vk and Pinput E imbal, then the following holds. 1. Attn(t) k = Ω(1); 2. (1 Attn(t) k )2 Ω(ϵ) = Ω(exp ( polylog(K))). In-context Convergence of Transformers Proof. Since xquery = vk, we have Attn(t) k = |Vk| exp(A(t) k ) P m =k |Vm| exp(B(t) k,m) + |Vk| exp(A(t) k ) |Vk| exp(B(t) k,m A(t) k ) + 1 . By Induction Hypothesis F.4, we have for m = 1, k: exp(B(t) k,m A(t) k ) e O( log(K/ϵ) K ) log(K) e O( log(K)+polylog(K) K ) log(K) O 1 for m = 1, exp(B(t) k,1 A(t) k ) O( 1 K1.49 ). Attn(t) k 1 |Vk| 1) + O( 1 K1.49 ) |V1| |Vk| + 1 Ω(1). On the other hand, we have 1 Attn(t) k = |Vk| exp(B(t) k,m A(t) k ) P |Vk| exp(B(t) k,m A(t) k ) + 1 (a) exp(minm =1,k B(t) k,m A(t) k )( N |V1| |Vk| 1) + exp(B(t) k,1 A(t) k ) |V1| |Vk| exp(minm =1,k B(t) k,m A(t) k )( N |V1| |Vk| 1) + exp(B(t) k,1 A(t) k ) |V1| exp(minm =k B(t) k,m A(t) k )( (1 U im 1 )K U im k 1) + exp(B(t) k,1 A(t) k ) Lim 1 U im k exp(minm =k B(t) k,m A(t) k )( (1 U im 1 )K U im k 1) + exp(B(t) k,1 A(t) k ) Lim 1 U im k + 1 = ( (1 U im 1 )K U im k 1 + Lim 1 K0.49 U im k ) exp( A(t) k ) ( (1 U im 1 )K U im k 1 + Lim 1 K0.49 U im k ) exp( A(t) k ) + 1 where (a) follows from the fact that x 1+x increases w.r.t. x > 0. Lemma F.20. Suppose Induction Hypothesis F.4 holds at iteration T3,k < t T ϵ 4,k. If xquery = vk and Pinput E imbal, then the following holds. 1. Attn(t) n = Θ 1 Attn(t) k K for n = 1, k; 2. Attn(t) 1 Ω( 1 Attn(t) k K0.51 ), O 1 Attn(t) k K0.49 Proof. We first have Attn(t) n 1 Attn(t) k = |Vn| exp(B(t) k,n) P m =k |Vm| exp(B(t) k,m) . If n = 1, by Induction Hypothesis F.4, we have In-context Convergence of Transformers for m = 1, k, e O( log(K) log(ϵ) K ) exp(B(t) k,m B(t) k,n) e O( log(K) log(ϵ) for m = 1, e 0.51 log(K) O( log(K/ϵ) K ) exp(B(t) k,1 B(t) k,n) 0. Note that when Pinput E imbal, we have |Vm| |Vn| = Θ(1), and |V1| |Vn| = Θ(K). 
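As an illustrative aside, the attention bounds in Lemmas F.19 and F.20 can be made concrete by plugging representative weight levels into the aggregated softmax with the expected token fractions. The snippet below (all constants illustrative) evaluates Attn_k, Attn_1, and (1 - Attn_k)^2 at the levels reached by the end of Phase III, where Attn_k = Omega(1), and at the larger A_k reached by the end of Phase IV, where the error drops to O(epsilon), consistent with Lemma F.24.

```python
import numpy as np

K, p1, eps = 100, 0.5, 1e-3
c1, ck = p1, (1 - p1) / (K - 1)          # expected fractions of dominant / other tokens

def attn_profile(A, B1, B=0.0):
    scores = np.array([ck * np.exp(A), c1 * np.exp(B1), (K - 2) * ck * np.exp(B)])
    attn_k, attn_1, rest = scores / scores.sum()
    return attn_k, attn_1, (1 - attn_k) ** 2

# end of Phase III: A_k ~ log(K), B_{k,1} ~ -0.51 log(K)  (Induction Hypotheses F.3/F.4)
print(attn_profile(np.log(K), -0.51 * np.log(K)))
# end of Phase IV: A_k ~ log(K / sqrt(eps)) pushes (1 - Attn_k)^2 down to O(eps)  (Lemma F.24)
print(attn_profile(np.log(3 * K / np.sqrt(eps)), -0.51 * np.log(K)))
```

Returning to the proof of Lemma F.20: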
Lemma F.20. Suppose Induction Hypothesis F.4 holds at iteration $T_{3,k}<t\le T^{\epsilon}_{4,k}$. If $x_{\mathrm{query}}=v_k$ and $P_{\mathrm{input}}\in E_{\mathrm{imbal}}$, then the following holds.
1. $\mathrm{Attn}^{(t)}_n=\Theta\Big(\frac{1-\mathrm{Attn}^{(t)}_k}{K}\Big)$ for $n\neq 1,k$;
2. $\mathrm{Attn}^{(t)}_1\in\Big[\Omega\Big(\frac{1-\mathrm{Attn}^{(t)}_k}{K^{0.51}}\Big),\,O\Big(\frac{1-\mathrm{Attn}^{(t)}_k}{K^{0.49}}\Big)\Big]$.

Proof. We first have
\[
\frac{\mathrm{Attn}^{(t)}_n}{1-\mathrm{Attn}^{(t)}_k}=\frac{|\mathcal V_n|\exp\big(B^{(t)}_{k,n}\big)}{\sum_{m\neq k}|\mathcal V_m|\exp\big(B^{(t)}_{k,m}\big)}.
\]
If $n\neq 1$, by Induction Hypothesis F.4, we have, for $m\neq 1,k$,
\[
e^{-O(\frac{\log(K/\epsilon)}{K})}\le\exp\big(B^{(t)}_{k,m}-B^{(t)}_{k,n}\big)\le e^{O(\frac{\log(K/\epsilon)}{K})},
\]
and, for $m=1$,
\[
e^{-0.51\log(K)-O(\frac{\log(K/\epsilon)}{K})}\le\exp\big(B^{(t)}_{k,1}-B^{(t)}_{k,n}\big)\le e^{-0.49\log(K)+O(\frac{\log(K/\epsilon)}{K})}.
\]
Note that when $P_{\mathrm{input}}\in E_{\mathrm{imbal}}$, we have $\frac{|\mathcal V_m|}{|\mathcal V_n|}=\Theta(1)$ and $\frac{|\mathcal V_1|}{|\mathcal V_n|}=\Theta(K)$. Then, combining with the fact that $-\log(\epsilon)\le\mathrm{polylog}(K)$, we obtain
\[
\frac{\mathrm{Attn}^{(t)}_n}{1-\mathrm{Attn}^{(t)}_k}
=\frac{|\mathcal V_n|\exp\big(B^{(t)}_{k,n}\big)}{\sum_{m\neq k}|\mathcal V_m|\exp\big(B^{(t)}_{k,m}\big)}
=\frac{1}{\sum_{m\neq k}\frac{|\mathcal V_m|}{|\mathcal V_n|}\exp\big(B^{(t)}_{k,m}-B^{(t)}_{k,n}\big)}
=\Theta\Big(\frac1K\Big).
\]
For $n=1$, by Induction Hypothesis F.4, we have, for $m\neq 1$,
\[
e^{0.49\log(K)-O(\frac{\log(K/\epsilon)}{K})}\le\exp\big(B^{(t)}_{k,m}-B^{(t)}_{k,1}\big)\le e^{0.51\log(K)+O(\frac{\log(K/\epsilon)}{K})}.
\]
Combining with the fact that $\frac{|\mathcal V_m|}{|\mathcal V_1|}=\Theta\big(\frac1K\big)$ when $P_{\mathrm{input}}\in E_{\mathrm{imbal}}$, and $-\log(\epsilon)\le\mathrm{polylog}(K)$, we have
\[
\frac{\mathrm{Attn}^{(t)}_1}{1-\mathrm{Attn}^{(t)}_k}
=\frac{1}{\sum_{m\neq k}\frac{|\mathcal V_m|}{|\mathcal V_1|}\exp\big(B^{(t)}_{k,m}-B^{(t)}_{k,1}\big)}
\le\frac{1}{\Theta(1)\,e^{0.49\log(K)-O(\frac{\log(K/\epsilon)}{K})}+1}=O\Big(\frac{1}{K^{0.49}}\Big),
\]
\[
\frac{\mathrm{Attn}^{(t)}_1}{1-\mathrm{Attn}^{(t)}_k}
\ge\frac{1}{\Theta(1)\,e^{0.51\log(K)+O(\frac{\log(K/\epsilon)}{K})}+1}=\Omega\Big(\frac{1}{K^{0.51}}\Big).
\]

F.5.2. CONTROLLING GRADIENT UPDATES IN PHASE IV

Lemma F.21. At each iteration $T_{3,k}<t\le T^{\epsilon}_{4,k}$, if Induction Hypothesis F.4 holds, then $\alpha^{(t)}_k\ge 0$ and satisfies $\alpha^{(t)}_k\ge\Omega\big(\frac{\epsilon}{K}\big)$.

Proof. The analysis is similar to that for Lemma F.15, but we need to be more careful about the lower bound of $1-\mathrm{Attn}^{(t)}_k$. By the gradient expression, we have
\begin{align*}
\alpha^{(t)}_k
&=\mathbb E\Big[\mathbb 1\{x_{\mathrm{query}}=v_k\}\,\mathrm{Attn}^{(t)}_k\Big(\textstyle\sum_{m\neq k}\big(\mathrm{Attn}^{(t)}_m\big)^2+\big(1-\mathrm{Attn}^{(t)}_k\big)^2\Big)\Big]\\
&\ge p_k\,\mathbb P(P_{\mathrm{input}}\in E_{\mathrm{imbal}})\,\mathbb E\Big[\mathrm{Attn}^{(t)}_k\Big(\textstyle\sum_{m\neq k}\big(\mathrm{Attn}^{(t)}_m\big)^2+\big(1-\mathrm{Attn}^{(t)}_k\big)^2\Big)\,\Big|\,\{x_{\mathrm{query}}=v_k\}\cap E_{\mathrm{imbal}}\Big]\\
&\ge p_k\,\mathbb P(P_{\mathrm{input}}\in E_{\mathrm{imbal}})\,\mathbb E\Big[\mathrm{Attn}^{(t)}_k\big(1-\mathrm{Attn}^{(t)}_k\big)^2\,\Big|\,\{x_{\mathrm{query}}=v_k\}\cap E_{\mathrm{imbal}}\Big]
\ \ge\ \Omega\Big(\frac{\epsilon}{K}\Big),
\end{align*}
where the last inequality follows from Lemma F.19 and our choice of $p_k$.

Lemma F.22. At each iteration $T_{3,k}<t\le T^{\epsilon}_{4,k}$, if Induction Hypothesis F.4 holds, then given $k\ge 2$, $\beta^{(t)}_{k,1}$ satisfies
\[
-O\Big(\frac{\alpha^{(t)}_k}{K^{0.49}}\Big)\le\beta^{(t)}_{k,1}\le 0.
\]

Proof. Following computations similar to those for Lemma F.4, we have
\[
\mathrm{Attn}^{(t)}_1+\mathrm{Attn}^{(t)}_k\big(1-\mathrm{Attn}^{(t)}_k\big)-\sum_{m\neq k}\big(\mathrm{Attn}^{(t)}_m\big)^2
\ge\big(1-\mathrm{Attn}^{(t)}_k-\mathrm{Attn}^{(t)}_1\big)\Big(\mathrm{Attn}^{(t)}_1+\mathrm{Attn}^{(t)}_k-\max_{m\neq 1,k}\mathrm{Attn}^{(t)}_m\Big).
\]
Hence
\begin{align*}
\beta^{(t)}_{k,1}
&\le-\mathbb E\Big[\mathbb 1\{x_{\mathrm{query}}=v_k,\,P_{\mathrm{input}}\in E_{\mathrm{imbal}}\}\,\mathrm{Attn}^{(t)}_1\Big(\mathrm{Attn}^{(t)}_1+\mathrm{Attn}^{(t)}_k\big(1-\mathrm{Attn}^{(t)}_k\big)-\textstyle\sum_{m\neq k}\big(\mathrm{Attn}^{(t)}_m\big)^2\Big)\Big]
+\mathbb E\Big[\mathbb 1\{x_{\mathrm{query}}=v_k,\,P_{\mathrm{input}}\in E_{\mathrm{imbal}}^c\}\,\mathrm{Attn}^{(t)}_1\Big|\textstyle\sum_{m\neq k}\big(\mathrm{Attn}^{(t)}_m\big)^2-\mathrm{Attn}^{(t)}_1-\mathrm{Attn}^{(t)}_k\big(1-\mathrm{Attn}^{(t)}_k\big)\Big|\Big]\\
&\overset{(a)}{\le}-p_k\,\mathbb P(P_{\mathrm{input}}\in E_{\mathrm{imbal}})\,\mathbb E\Big[\Omega\Big(\tfrac{(1-\mathrm{Attn}^{(t)}_k)^2}{K^{0.51}}\Big)\,\Big|\,\{x_{\mathrm{query}}=v_k\}\cap E_{\mathrm{imbal}}\Big]+p_k\,\mathbb P(E_{\mathrm{imbal}}^c)\\
&\le-p_k\,\Omega\Big(\frac{\epsilon}{K^{0.51}}\Big)+3p_k\exp\Big(-\frac{c_{\mathrm{im}}^2N}{25K^2}\Big)\le 0,
\end{align*}
where (a) follows from Lemma F.20, and the last inequality holds since $\frac{\epsilon}{K^{0.51}}\ge\frac{\exp(-\mathrm{polylog}(K))}{K^{0.51}}\ge\exp\Big(-\frac{c_{\mathrm{im}}^2N}{25K^2}\Big)$. Moreover, we have
\begin{align*}
\beta^{(t)}_{k,1}
&\ge-\mathbb E\Big[\mathbb 1\{x_{\mathrm{query}}=v_k,\,P_{\mathrm{input}}\in E_{\mathrm{imbal}}\}\,\mathrm{Attn}^{(t)}_1\big(\mathrm{Attn}^{(t)}_1+\mathrm{Attn}^{(t)}_k(1-\mathrm{Attn}^{(t)}_k)\big)\Big]
-\mathbb E\Big[\mathbb 1\{x_{\mathrm{query}}=v_k,\,P_{\mathrm{input}}\in E_{\mathrm{imbal}}^c\}\,\mathrm{Attn}^{(t)}_1\big(\mathrm{Attn}^{(t)}_1+\mathrm{Attn}^{(t)}_k(1-\mathrm{Attn}^{(t)}_k)\big)\Big]\\
&\ge-p_k\,\mathbb P(P_{\mathrm{input}}\in E_{\mathrm{imbal}})\,\mathbb E\Big[O\big(\mathrm{Attn}^{(t)}_1(1-\mathrm{Attn}^{(t)}_k)\big)\,\Big|\,\{x_{\mathrm{query}}=v_k\}\cap E_{\mathrm{imbal}}\Big]-2p_k\,\mathbb P(E_{\mathrm{imbal}}^c)\\
&\ge-\frac{\alpha^{(t)}_k}{K^{0.49}}-6p_k\exp\Big(-\frac{c_{\mathrm{im}}^2N}{25K^2}\Big)\ \ge\ -O\Big(\frac{\alpha^{(t)}_k}{K^{0.49}}\Big),
\end{align*}
where the last two inequalities use Lemma F.20 and the analysis in Lemma F.21.

Lemma F.23. At each iteration $T_{3,k}<t\le T^{\epsilon}_{4,k}$, if Induction Hypothesis F.4 holds, then given $k\ge 2$, for any $n\neq 1,k$, $\beta^{(t)}_{k,n}$ satisfies $-O\big(\frac{\alpha^{(t)}_k}{K}\big)\le\beta^{(t)}_{k,n}\le 0$.

Proof. Note that conditioned on the event $\{x_{\mathrm{query}}=v_k\}\cap E_{\mathrm{imbal}}$, by Lemmas F.19 and F.20, we have $\mathrm{Attn}^{(t)}_k=\Omega(1)$ and $\max_{m\neq k}\mathrm{Attn}^{(t)}_m=O\big(\frac{1}{K^{0.49}}\big)$. Thus, we obtain
\[
\mathrm{Attn}^{(t)}_n+\mathrm{Attn}^{(t)}_k\big(1-\mathrm{Attn}^{(t)}_k\big)-\sum_{m\neq k}\big(\mathrm{Attn}^{(t)}_m\big)^2
\ge\mathrm{Attn}^{(t)}_k\big(1-\mathrm{Attn}^{(t)}_k\big)-\max_{m\neq k}\mathrm{Attn}^{(t)}_m\sum_{m\neq k}\mathrm{Attn}^{(t)}_m
=\big(1-\mathrm{Attn}^{(t)}_k\big)\Big(\mathrm{Attn}^{(t)}_k-\max_{m\neq k}\mathrm{Attn}^{(t)}_m\Big)
\ge\Omega\big(1-\mathrm{Attn}^{(t)}_k\big).\tag{28}
\]
Hence
\begin{align*}
\beta^{(t)}_{k,n}
&\le-\mathbb E\Big[\mathbb 1\{x_{\mathrm{query}}=v_k,\,P_{\mathrm{input}}\in E_{\mathrm{imbal}}\}\,\mathrm{Attn}^{(t)}_n\Big(\mathrm{Attn}^{(t)}_n+\mathrm{Attn}^{(t)}_k\big(1-\mathrm{Attn}^{(t)}_k\big)-\textstyle\sum_{m\neq k}\big(\mathrm{Attn}^{(t)}_m\big)^2\Big)\Big]
+\mathbb E\Big[\mathbb 1\{x_{\mathrm{query}}=v_k,\,P_{\mathrm{input}}\in E_{\mathrm{imbal}}^c\}\,\mathrm{Attn}^{(t)}_n\Big|\textstyle\sum_{m\neq k}\big(\mathrm{Attn}^{(t)}_m\big)^2-\mathrm{Attn}^{(t)}_n-\mathrm{Attn}^{(t)}_k\big(1-\mathrm{Attn}^{(t)}_k\big)\Big|\Big]\\
&\le-p_k\,\mathbb P(P_{\mathrm{input}}\in E_{\mathrm{imbal}})\,\mathbb E\Big[\Omega\Big(\tfrac{(1-\mathrm{Attn}^{(t)}_k)^2}{K}\Big)\,\Big|\,\{x_{\mathrm{query}}=v_k\}\cap E_{\mathrm{imbal}}\Big]+p_k\,\mathbb P(P_{\mathrm{input}}\in E_{\mathrm{imbal}}^c)\\
&\le-p_k\,\Omega\Big(\frac{\epsilon}{K}\Big)+3p_k\exp\Big(-\frac{c_{\mathrm{im}}^2N}{25K^2}\Big)\le 0,
\end{align*}
where the last inequality holds since $\frac{\epsilon}{K}\ge\frac{\exp(-\mathrm{polylog}(K))}{K}\ge\exp\Big(-\frac{c_{\mathrm{im}}^2N}{25K^2}\Big)$. Moreover, we have
\begin{align*}
\beta^{(t)}_{k,n}
&\ge-p_k\,\mathbb E\Big[\mathrm{Attn}^{(t)}_n\big(\mathrm{Attn}^{(t)}_n+\mathrm{Attn}^{(t)}_k(1-\mathrm{Attn}^{(t)}_k)\big)\,\Big|\,\{x_{\mathrm{query}}=v_k\}\cap E_{\mathrm{imbal}}\Big]-2p_k\,\mathbb P(E_{\mathrm{imbal}}^c)\\
&\ge-p_k\,\mathbb E\Big[\Theta\Big(\tfrac{1-\mathrm{Attn}^{(t)}_k}{K}\Big)\,O\big(\mathrm{Attn}^{(t)}_k(1-\mathrm{Attn}^{(t)}_k)\big)\,\Big|\,\{x_{\mathrm{query}}=v_k\}\cap E_{\mathrm{imbal}}\Big]-6p_k\exp\Big(-\frac{c_{\mathrm{im}}^2N}{25K^2}\Big)\\
&\ge-p_k\,\mathbb E\Big[O\Big(\tfrac{\mathrm{Attn}^{(t)}_k(1-\mathrm{Attn}^{(t)}_k)^2}{K}\Big)\,\Big|\,\{x_{\mathrm{query}}=v_k\}\cap E_{\mathrm{imbal}}\Big]-6p_k\exp\Big(-\frac{c_{\mathrm{im}}^2N}{25K^2}\Big)
\ \ge\ -O\Big(\frac{\alpha^{(t)}_k}{K}\Big),
\end{align*}
where the second inequality uses Lemma F.20 and the last follows from the analysis in Lemma F.21.
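Before closing the phase, it is worth recording how the per-step lower bound of Lemma F.21 yields the phase length used in Lemma F.24 below; this is a brief accounting under the same update rule $A^{(t+1)}_k=A^{(t)}_k+\eta\alpha^{(t)}_k$, not an extra assumption:
\[
A^{(T^{\epsilon}_{4,k})}_k-A^{(T_{3,k}+1)}_k=\eta\sum_{t=T_{3,k}+1}^{T^{\epsilon}_{4,k}-1}\alpha^{(t)}_k\ \ge\ \big(T^{\epsilon}_{4,k}-T_{3,k}-1\big)\,\eta\,\Omega\Big(\frac{\epsilon}{K}\Big),
\]
while Induction Hypothesis F.4a keeps $A^{(t)}_k\le O(\log(K/\epsilon))$ throughout the phase, so
\[
T^{\epsilon}_{4,k}-T_{3,k}\le O\Big(\frac{K\log(K/\epsilon)}{\eta\,\epsilon}\Big),
\]
which is consistent, up to constants, with the bound $T^{\epsilon}_{4,k}=T_{3,k}+O\big(\frac{K\log(K\epsilon^{-1/2})}{\eta\epsilon}\big)$ stated in Lemma F.24.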
F.5.3. END OF PHASE IV

Lemma F.24. Given $k>1$ and $0<\epsilon<1$, suppose $\mathrm{polylog}(K)\ge\log(\frac1\epsilon)$. Then Induction Hypothesis F.4 holds for all $T_{3,k}<t\le T^{\epsilon}_{4,k}=T_{3,k}+O\big(\frac{K\log(K\epsilon^{-\frac12})}{\eta\epsilon}\big)$, and at iteration $t=T^{\epsilon}_{4,k}+1$, we have
1. $\widetilde L_k\big(\theta^{(T^{\epsilon}_{4,k}+1)}\big)<\epsilon$;
2. If $x_{\mathrm{query}}=v_k$ and $P_{\mathrm{input}}\in E_{\mathrm{imbal}}$, we have $\big(1-\mathrm{Attn}^{(T^{\epsilon}_{4,k}+1)}_k\big)^2\le O(\epsilon)$.

Proof. The existence of $T^{\epsilon}_{4,k}=T_{3,k}+O\big(\frac{K\log(K\epsilon^{-1/2})}{\eta\epsilon}\big)$ directly follows from Lemma F.21. It is easy to verify that Induction Hypothesis F.4 holds at $t=T_{3,k}+1$. Now we suppose that Induction Hypothesis F.4 holds for all iterations up to $t-1$, and prove that it holds at $t$. By Lemma F.21, we have $\alpha^{(t-1)}_k\ge0$. Thus $A^{(t)}_k=A^{(t-1)}_k+\eta\alpha^{(t-1)}_k\ge\log(K)$. Moreover, by Lemma F.22, we have
\[
\big|B^{(t)}_{k,1}-B^{(T_{3,k}+1)}_{k,1}\big|\le O\Big(\frac{A^{(t)}_k-A^{(T_{3,k}+1)}_k}{K^{0.49}}\Big),
\]
which immediately implies $B^{(t)}_{k,1}\ge B^{(T_{3,k}+1)}_{k,1}-O\big(A^{(t)}_k/K^{0.49}\big)\ge-0.51\log(K)-O\big(\log(K)/K^{0.48}\big)$. For $m\neq1,k$, by Lemma F.23, we have
\[
\big|B^{(t)}_{k,m}-B^{(T_{3,k}+1)}_{k,m}\big|\le O\Big(\frac{A^{(t)}_k-A^{(T_{3,k}+1)}_k}{K}\Big)\le O\big(\log(K/\epsilon)/K\big).
\]
Thus $|B^{(t)}_{k,m}|\le O\big(\log(K/\epsilon)/K\big)+O\big(\log(K)/K\big)=O\big(\log(K/\epsilon)/K\big)$.

At iteration $t=T^{\epsilon}_{4,k}+1$, we have
\[
A^{(t)}_k>\log\bigg(\Big(\frac{e(1-L^{\mathrm{im}}_1)K+U^{\mathrm{im}}_1K^{0.51}}{L^{\mathrm{im}}_k}-e\Big)\Big(\sqrt{3/\epsilon}-1\Big)\bigg).
\]
Thus, when $\{x_{\mathrm{query}}=v_k\}\cap\{P_{\mathrm{input}}\in E_{\mathrm{imbal}}\}$ holds, we obtain
\begin{align*}
1-\mathrm{Attn}^{(t)}_k
&=\frac{\sum_{m\neq k}\frac{|\mathcal V_m|}{|\mathcal V_k|}\exp\big(B^{(t)}_{k,m}-A^{(t)}_k\big)}{\sum_{m\neq k}\frac{|\mathcal V_m|}{|\mathcal V_k|}\exp\big(B^{(t)}_{k,m}-A^{(t)}_k\big)+1}\\
&\le\frac{\exp\big(\max_{m\neq1,k}B^{(t)}_{k,m}-A^{(t)}_k\big)\big(\frac{N-|\mathcal V_1|}{|\mathcal V_k|}-1\big)+\exp\big(B^{(t)}_{k,1}-A^{(t)}_k\big)\frac{|\mathcal V_1|}{|\mathcal V_k|}}{\exp\big(\max_{m\neq1,k}B^{(t)}_{k,m}-A^{(t)}_k\big)\big(\frac{N-|\mathcal V_1|}{|\mathcal V_k|}-1\big)+\exp\big(B^{(t)}_{k,1}-A^{(t)}_k\big)\frac{|\mathcal V_1|}{|\mathcal V_k|}+1}\\
&\le\frac{\exp\big(1-A^{(t)}_k\big)\big(\frac{(1-L^{\mathrm{im}}_1)K}{L^{\mathrm{im}}_k}-1\big)+\exp\big(-0.49\log(K)-A^{(t)}_k\big)\frac{U^{\mathrm{im}}_1K}{L^{\mathrm{im}}_k}}{\exp\big(1-A^{(t)}_k\big)\big(\frac{(1-L^{\mathrm{im}}_1)K}{L^{\mathrm{im}}_k}-1\big)+\exp\big(-0.49\log(K)-A^{(t)}_k\big)\frac{U^{\mathrm{im}}_1K}{L^{\mathrm{im}}_k}+1}\\
&=\frac{\big(\frac{e(1-L^{\mathrm{im}}_1)K+U^{\mathrm{im}}_1K^{0.51}}{L^{\mathrm{im}}_k}-e\big)\exp\big(-A^{(t)}_k\big)}{\big(\frac{e(1-L^{\mathrm{im}}_1)K+U^{\mathrm{im}}_1K^{0.51}}{L^{\mathrm{im}}_k}-e\big)\exp\big(-A^{(t)}_k\big)+1}
\ \le\ \frac{\big(\sqrt{3/\epsilon}-1\big)^{-1}}{\big(\sqrt{3/\epsilon}-1\big)^{-1}+1}=(\epsilon/3)^{\frac12}.
\end{align*}
We further derive
\begin{align*}
\widetilde L_k\big(\theta^{(t)}\big)
&=\frac12\,\mathbb E\bigg[\mathbb 1\{P_{\mathrm{input}}\in E_{\mathrm{imbal}}\}\Big(\textstyle\sum_{m\neq k}\big(\mathrm{Attn}^{(t)}_m\big)^2+\big(1-\mathrm{Attn}^{(t)}_k\big)^2\Big)\,\bigg|\,x_{\mathrm{query}}=v_k\bigg]\\
&\le\frac12\,\mathbb P\big(P_{\mathrm{input}}\in E_{\mathrm{imbal}}\big)\,\mathbb E\bigg[\Big(O\Big(\frac{1}{K^{0.49}}\Big)+1\Big)\big(1-\mathrm{Attn}^{(t)}_k\big)^2\,\bigg|\,x_{\mathrm{query}}=v_k,\;P_{\mathrm{input}}\in E_{\mathrm{imbal}}\bigg]
\le\frac12\Big(1+O\Big(\frac{1}{K^{0.49}}\Big)\Big)\frac{\epsilon}{3}<\epsilon.
\end{align*}

F.6. Proof of Theorem 3.3 for Under-represented Features

Theorem F.25 (Restatement of Theorem 3.3 for Under-represented Features). Suppose $p_1=\Theta(1)$ and $p_k=\Theta\big(\frac1K\big)$ for $2\le k\le K$. For any $0<\epsilon<1$, suppose $N\ge\mathrm{poly}(K)$ and $\mathrm{polylog}(K)\ge\log(\frac1\epsilon)$. We apply GD to train the loss function given in Equation (4). Then the following results hold.
1. The prediction error for under-represented features converges: for $v_k$ with $2\le k\le K$, with at most $T_k=O\big(\frac{\log(K)K^2}{\epsilon\eta}\big)$ GD iterations, $L_k\big(\theta^{(T_k)}\big)\le L^*_k+\epsilon$, where $L^*_k=\Theta\big(e^{-\mathrm{poly}(K)}\big)$ is the global minimum of Equation (6);
2. Attention score concentrates: for each $2\le k\le K$, if the query token is $v_k$, then after $T_k$ iterations, with probability at least $1-e^{-\Omega(\mathrm{poly}(K))}$, the one-layer transformer nearly pays all attention to input tokens featuring $v_k$: $\big(1-\mathrm{Attn}^{(T_k)}_k\big)^2\le O(\epsilon)$.

Proof. The first statement is obtained by letting $T_k=T^{\epsilon}_{4,k}+1$ and combining Lemma F.24, Lemma D.10 and Lemma D.11, which lead to
\[
L_k\big(\theta^{(T_k)}\big)-L^*_k\le L_k\big(\theta^{(T_k)}\big)-L^{\mathrm{low}}_k\le\widetilde L_k\big(\theta^{(T_k)}\big)+3\exp\Big(-\frac{c_{\mathrm{im}}^2N}{25K^2}\Big)\le\epsilon.
\]
The second statement directly follows from Lemma F.24.

G. Analysis for Imbalanced Case: Dominant Feature (Proof of Theorem 3.3 Part II)

In this section, we delve into the analysis of the prediction error when the query token features the dominant feature $v_1$. The training dynamics for the dominant feature $v_1$ are relatively straightforward, comprising only a single phase. Note that at the beginning $t=0$, we already have the following lemma.

Lemma G.1. If $x_{\mathrm{query}}=v_1$ and $P_{\mathrm{input}}\in E_{\mathrm{imbal}}$, then at $t=0$, we have $\mathrm{Attn}^{(0)}_1=\Omega(1)$ and $\mathrm{Attn}^{(0)}_k=O\big(\frac1K\big)$ for $k>1$.

Thus, the learning process directly enters the convergence phase.
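For intuition, Lemma G.1 can be read off from the same softmax expression used throughout this appendix, assuming the attention logits start at zero (as the base cases of the induction hypotheses, with $A^{(0)}_1$ and $B^{(0)}_{1,m}$ starting from $0$, suggest); the following is an informal sketch of that computation rather than a restatement of the formal proof:
\[
\mathrm{Attn}^{(0)}_m=\frac{|\mathcal V_m|\,e^{0}}{\sum_{m'}|\mathcal V_{m'}|\,e^{0}}=\frac{|\mathcal V_m|}{N},
\qquad\text{so on } E_{\mathrm{imbal}}:\quad
\mathrm{Attn}^{(0)}_1=\frac{|\mathcal V_1|}{N}=\Omega(1),
\qquad
\mathrm{Attn}^{(0)}_k=\frac{|\mathcal V_k|}{N}=O\Big(\frac1K\Big)\ \ (k>1),
\]
since $|\mathcal V_1|=\Theta(N)$ and $|\mathcal V_k|=\Theta(N/K)$ on $E_{\mathrm{imbal}}$ when $p_1=\Theta(1)$ and $p_k=\Theta(1/K)$.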
The convergence phase is defined as follows. Given any $0<\epsilon<1$, define
\[
T^{\epsilon}_{1,*}:=\max\bigg\{t\ge0:\;A^{(t)}_1-\max_{m\neq1}B^{(t)}_{1,m}\le\log\Big(\Big(\frac{1}{L^{\mathrm{im}}_1}-1\Big)\Big(\sqrt{2/\epsilon}-1\Big)\Big)\bigg\}.
\]

Induction Hypothesis G.1. For $0\le t\le T^{\epsilon}_{1,*}$, suppose $\mathrm{polylog}(K)\ge\log(\frac1\epsilon)$. Then the following holds.
a. $A^{(t)}_1$ is monotonically increasing and $A^{(t)}_1\in[0,\,O(\log(1/\epsilon))]$;
b. $B^{(t)}_{1,n}$ is monotonically decreasing and $-O\big(\frac{A^{(t)}_1}{K}\big)\le B^{(t)}_{1,n}\le0$ for any $n\neq1$.

G.1. Technical Lemmas

We first introduce several useful technical lemmas.

Lemma G.2. Suppose Induction Hypothesis G.1 holds at iteration $0<t\le T^{\epsilon}_{1,*}$. If $x_{\mathrm{query}}=v_1$ and $P_{\mathrm{input}}\in E_{\mathrm{imbal}}$, then the following holds.
1. $\mathrm{Attn}^{(t)}_1=\Omega(1)$;
2. $\big(1-\mathrm{Attn}^{(t)}_1\big)^2\ge\Omega(\epsilon)=\Omega\big(\exp(-\mathrm{polylog}(K))\big)$.

Proof. Since $x_{\mathrm{query}}=v_1$, we have
\[
\mathrm{Attn}^{(t)}_1=\frac{|\mathcal V_1|\exp\big(A^{(t)}_1\big)}{\sum_{m\neq1}|\mathcal V_m|\exp\big(B^{(t)}_{1,m}\big)+|\mathcal V_1|\exp\big(A^{(t)}_1\big)}
=\frac{1}{\sum_{m\neq1}\frac{|\mathcal V_m|}{|\mathcal V_1|}\exp\big(B^{(t)}_{1,m}-A^{(t)}_1\big)+1}.
\]
By Induction Hypothesis G.1, we have
\[
\mathrm{Attn}^{(t)}_1\ge\frac{1}{\sum_{m\neq1}\frac{|\mathcal V_m|}{|\mathcal V_1|}\exp\big(B^{(0)}_{1,m}-A^{(0)}_1\big)+1}\ge\frac{1}{\big(\frac{N}{L^{\mathrm{im}}_1N}-1\big)+1}\ge\Omega(1).
\]
On the other hand, by the definition of $T^{\epsilon}_{1,*}$, we have
\begin{align*}
1-\mathrm{Attn}^{(t)}_1
&=\frac{\sum_{m\neq1}\frac{|\mathcal V_m|}{|\mathcal V_1|}\exp\big(B^{(t)}_{1,m}-A^{(t)}_1\big)}{\sum_{m\neq1}\frac{|\mathcal V_m|}{|\mathcal V_1|}\exp\big(B^{(t)}_{1,m}-A^{(t)}_1\big)+1}
\overset{(a)}{\ge}\frac{\exp\big(\min_{m\neq1}B^{(t)}_{1,m}-A^{(t)}_1\big)\big(\frac{N}{|\mathcal V_1|}-1\big)}{\exp\big(\min_{m\neq1}B^{(t)}_{1,m}-A^{(t)}_1\big)\big(\frac{N}{|\mathcal V_1|}-1\big)+1}
\ge\frac{\exp\big(\min_{m\neq1}B^{(t)}_{1,m}-A^{(t)}_1\big)\big(\frac{1}{U^{\mathrm{im}}_1}-1\big)}{\exp\big(\min_{m\neq1}B^{(t)}_{1,m}-A^{(t)}_1\big)\big(\frac{1}{U^{\mathrm{im}}_1}-1\big)+1}\\
&=\frac{\exp\big(\max_{m\neq1}B^{(t)}_{1,m}-A^{(t)}_1-\Delta^{(t)}_1\big)\big(\frac{1}{U^{\mathrm{im}}_1}-1\big)}{\exp\big(\max_{m\neq1}B^{(t)}_{1,m}-A^{(t)}_1-\Delta^{(t)}_1\big)\big(\frac{1}{U^{\mathrm{im}}_1}-1\big)+1}
\ge\frac{\Big(\big(\frac{1}{L^{\mathrm{im}}_1}-1\big)\big(\sqrt{2/\epsilon}-1\big)\Big)^{-1}\big(\frac{1}{U^{\mathrm{im}}_1}-1\big)e^{-O(\frac{\mathrm{polylog}(K)}{K})}}{\Big(\big(\frac{1}{L^{\mathrm{im}}_1}-1\big)\big(\sqrt{2/\epsilon}-1\big)\Big)^{-1}\big(\frac{1}{U^{\mathrm{im}}_1}-1\big)e^{-O(\frac{\mathrm{polylog}(K)}{K})}+1}
\ \ge\ \Omega(\sqrt\epsilon),
\end{align*}
where $\Delta^{(t)}_1:=\max_{m\neq1}B^{(t)}_{1,m}-\min_{m\neq1}B^{(t)}_{1,m}=O\big(\frac{A^{(t)}_1}{K}\big)$ and (a) follows from the fact that $\frac{x}{1+x}$ is increasing in $x>0$.

Lemma G.3. Suppose Induction Hypothesis G.1 holds at iteration $0\le t\le T^{\epsilon}_{1,*}$. If $x_{\mathrm{query}}=v_1$ and $P_{\mathrm{input}}\in E_{\mathrm{imbal}}$, then for $n\neq1$, the following holds:
\[
\mathrm{Attn}^{(t)}_n=\Theta\Big(\frac{1-\mathrm{Attn}^{(t)}_1}{K}\Big).
\]

Proof. We first have
\[
\frac{\mathrm{Attn}^{(t)}_n}{1-\mathrm{Attn}^{(t)}_1}=\frac{|\mathcal V_n|\exp\big(B^{(t)}_{1,n}\big)}{\sum_{m\neq1}|\mathcal V_m|\exp\big(B^{(t)}_{1,m}\big)}.
\]
By Induction Hypothesis G.1, we have
\[
e^{-O(\frac{\log(1/\epsilon)}{K})}\le\exp\big(B^{(t)}_{1,m}-B^{(t)}_{1,n}\big)\le e^{O(\frac{\log(1/\epsilon)}{K})}.
\]
Combining with the fact that $|\mathcal V_m|/|\mathcal V_n|=\Theta(1)$ when $P_{\mathrm{input}}\in E_{\mathrm{imbal}}$ and $-\log(\epsilon)\le\mathrm{polylog}(K)$, we obtain
\[
\frac{\mathrm{Attn}^{(t)}_n}{1-\mathrm{Attn}^{(t)}_1}
=\frac{1}{\sum_{m\neq1}\frac{|\mathcal V_m|}{|\mathcal V_n|}\exp\big(B^{(t)}_{1,m}-B^{(t)}_{1,n}\big)}
=\Theta\Big(\frac1K\Big).
\]

G.2. Controlling Gradient Updates

Lemma G.4. At each iteration $0\le t\le T^{\epsilon}_{1,*}$, if Induction Hypothesis G.1 holds, then $\alpha^{(t)}_1\ge0$ and satisfies $\alpha^{(t)}_1\ge\Omega(\epsilon)$.

Proof. By the gradient expression, we have
\begin{align*}
\alpha^{(t)}_1
&=\mathbb E\Big[\mathbb 1\{x_{\mathrm{query}}=v_1\}\,\mathrm{Attn}^{(t)}_1\Big(\textstyle\sum_{m\neq1}\big(\mathrm{Attn}^{(t)}_m\big)^2+\big(1-\mathrm{Attn}^{(t)}_1\big)^2\Big)\Big]\\
&\ge p_1\,\mathbb P(P_{\mathrm{input}}\in E_{\mathrm{imbal}})\,\mathbb E\Big[\mathrm{Attn}^{(t)}_1\Big(\textstyle\sum_{m\neq1}\big(\mathrm{Attn}^{(t)}_m\big)^2+\big(1-\mathrm{Attn}^{(t)}_1\big)^2\Big)\,\Big|\,\{x_{\mathrm{query}}=v_1\}\cap E_{\mathrm{imbal}}\Big]\\
&\ge p_1\,\mathbb P(P_{\mathrm{input}}\in E_{\mathrm{imbal}})\,\mathbb E\Big[\mathrm{Attn}^{(t)}_1\big(1-\mathrm{Attn}^{(t)}_1\big)^2\,\Big|\,\{x_{\mathrm{query}}=v_1\}\cap E_{\mathrm{imbal}}\Big]\ \ge\ \Omega(\epsilon),
\end{align*}
where the last inequality follows from Lemma G.2 and our choice of $p_1$.

Lemma G.5. At each iteration $0\le t\le T^{\epsilon}_{1,*}$, if Induction Hypothesis G.1 holds, then for any $n\neq1$, $\beta^{(t)}_{1,n}$ satisfies $-O\big(\frac{\alpha^{(t)}_1}{K}\big)\le\beta^{(t)}_{1,n}\le0$.

Proof. Note that conditioned on the event $\{x_{\mathrm{query}}=v_1\}\cap E_{\mathrm{imbal}}$, by Lemmas G.2 and G.3, we have $\mathrm{Attn}^{(t)}_1=\Omega(1)$ and $\max_{m\neq1}\mathrm{Attn}^{(t)}_m=O\big(\frac1K\big)$. Thus, we further obtain
\[
\mathrm{Attn}^{(t)}_n+\mathrm{Attn}^{(t)}_1\big(1-\mathrm{Attn}^{(t)}_1\big)-\sum_{m\neq1}\big(\mathrm{Attn}^{(t)}_m\big)^2
\ge\mathrm{Attn}^{(t)}_1\big(1-\mathrm{Attn}^{(t)}_1\big)-\max_{m\neq1}\mathrm{Attn}^{(t)}_m\sum_{m\neq1}\mathrm{Attn}^{(t)}_m
=\big(1-\mathrm{Attn}^{(t)}_1\big)\Big(\mathrm{Attn}^{(t)}_1-\max_{m\neq1}\mathrm{Attn}^{(t)}_m\Big)
\ge\Omega\big(1-\mathrm{Attn}^{(t)}_1\big).\tag{29}
\]
Hence
\begin{align*}
\beta^{(t)}_{1,n}
&\le-\mathbb E\Big[\mathbb 1\{x_{\mathrm{query}}=v_1,\,P_{\mathrm{input}}\in E_{\mathrm{imbal}}\}\,\mathrm{Attn}^{(t)}_n\Big(\mathrm{Attn}^{(t)}_n+\mathrm{Attn}^{(t)}_1\big(1-\mathrm{Attn}^{(t)}_1\big)-\textstyle\sum_{m\neq1}\big(\mathrm{Attn}^{(t)}_m\big)^2\Big)\Big]
+\mathbb E\Big[\mathbb 1\{x_{\mathrm{query}}=v_1,\,P_{\mathrm{input}}\in E_{\mathrm{imbal}}^c\}\,\mathrm{Attn}^{(t)}_n\Big|\textstyle\sum_{m\neq1}\big(\mathrm{Attn}^{(t)}_m\big)^2-\mathrm{Attn}^{(t)}_n-\mathrm{Attn}^{(t)}_1\big(1-\mathrm{Attn}^{(t)}_1\big)\Big|\Big]\\
&\overset{(a)}{\le}-p_1\,\mathbb P(P_{\mathrm{input}}\in E_{\mathrm{imbal}})\,\mathbb E\Big[\Omega\Big(\tfrac{(1-\mathrm{Attn}^{(t)}_1)^2}{K}\Big)\,\Big|\,\{x_{\mathrm{query}}=v_1\}\cap E_{\mathrm{imbal}}\Big]+p_1\,\mathbb P(E_{\mathrm{imbal}}^c)\\
&\overset{(b)}{\le}-p_1\,\Omega\Big(\frac{\epsilon}{K}\Big)+3p_1\exp\Big(-\frac{c_{\mathrm{im}}^2N}{25K^2}\Big)\le0,
\end{align*}
where (a) follows from Equation (29) and Lemma G.3, (b) follows from Lemma G.2, and the last inequality holds since $\frac{\epsilon}{K}\ge\frac{\exp(-\mathrm{polylog}(K))}{K}\ge\exp\Big(-\frac{c_{\mathrm{im}}^2N}{25K^2}\Big)$.
Moreover, we have
\begin{align*}
\beta^{(t)}_{1,n}
&\ge-p_1\,\mathbb E\Big[\mathrm{Attn}^{(t)}_n\big(\mathrm{Attn}^{(t)}_n+\mathrm{Attn}^{(t)}_1(1-\mathrm{Attn}^{(t)}_1)\big)\,\Big|\,\{x_{\mathrm{query}}=v_1\}\cap E_{\mathrm{imbal}}\Big]-2p_1\,\mathbb P(E_{\mathrm{imbal}}^c)\\
&\ge-p_1\,\mathbb E\Big[\Theta\Big(\tfrac{1-\mathrm{Attn}^{(t)}_1}{K}\Big)\,O\big(\mathrm{Attn}^{(t)}_1(1-\mathrm{Attn}^{(t)}_1)\big)\,\Big|\,\{x_{\mathrm{query}}=v_1\}\cap E_{\mathrm{imbal}}\Big]-6p_1\exp\Big(-\frac{c_{\mathrm{im}}^2N}{25K^2}\Big)\\
&\ge-p_1\,\mathbb E\Big[O\Big(\tfrac{\mathrm{Attn}^{(t)}_1(1-\mathrm{Attn}^{(t)}_1)^2}{K}\Big)\,\Big|\,\{x_{\mathrm{query}}=v_1\}\cap E_{\mathrm{imbal}}\Big]-6p_1\exp\Big(-\frac{c_{\mathrm{im}}^2N}{25K^2}\Big)
\ \ge\ -O\Big(\frac{\alpha^{(t)}_1}{K}\Big),
\end{align*}
where the second inequality uses Lemma G.3 and the last follows from the analysis in Lemma G.4.

G.3. End of the Phase

Lemma G.6. Given $0<\epsilon<\frac12$, suppose $\mathrm{polylog}(K)\ge\log(\frac1\epsilon)$. Then Induction Hypothesis G.1 holds for all $0\le t\le T^{\epsilon}_{1,*}=O\big(\frac{\log(\epsilon^{-\frac12})}{\eta\epsilon}\big)$, and at iteration $t=T^{\epsilon}_{1,*}+1$, we have
1. $\widetilde L_1\big(\theta^{(T^{\epsilon}_{1,*}+1)}\big)<\epsilon/2$;
2. If $x_{\mathrm{query}}=v_1$ and $P_{\mathrm{input}}\in E_{\mathrm{imbal}}$, we have $\big(1-\mathrm{Attn}^{(T^{\epsilon}_{1,*}+1)}_1\big)^2\le O(\epsilon)$.

Proof. We first prove the existence of $T^{\epsilon}_{1,*}$. Recall that
\[
T^{\epsilon}_{1,*}=\max\bigg\{t\ge0:\;A^{(t)}_1-\max_{m\neq1}B^{(t)}_{1,m}\le\log\Big(\Big(\frac{1}{L^{\mathrm{im}}_1}-1\Big)\Big(\sqrt{2/\epsilon}-1\Big)\Big)\bigg\}.
\]
When $t\in[0,T^{\epsilon}_{1,*}]$, we can simply lower bound the update of $A^{(t)}_1-\max_{m\neq1}B^{(t)}_{1,m}$ as
\[
A^{(t+1)}_1-\max_{m\neq1}B^{(t+1)}_{1,m}\ \ge\ A^{(t+1)}_1-\max_{m\neq1}B^{(t)}_{1,m}\ \ge\ A^{(t)}_1-\max_{m\neq1}B^{(t)}_{1,m}+\Omega(\eta\epsilon).
\]
Therefore, at most
\[
T^{\epsilon}_{1,*}=O\Bigg(\frac{\log\Big(\big(\frac{1}{L^{\mathrm{im}}_1}-1\big)\big(\sqrt{2/\epsilon}-1\big)\Big)}{\eta\epsilon}\Bigg)=O\Bigg(\frac{\log\big(\epsilon^{-\frac12}\big)}{\eta\epsilon}\Bigg)
\]
iterations are needed before $A^{(t)}_1-\max_{m\neq1}B^{(t)}_{1,m}$ exceeds $\log\Big(\big(\frac{1}{L^{\mathrm{im}}_1}-1\big)\big(\sqrt{2/\epsilon}-1\big)\Big)$.

It is easy to verify that Induction Hypothesis G.1 holds at $t=0$. Now we suppose that Induction Hypothesis G.1 holds for all iterations up to $t-1$, and prove that it holds at $t$. By Lemma G.4, we have $\alpha^{(t-1)}_1\ge0$, and thus $A^{(t)}_1\ge A^{(t-1)}_1\ge0$. By Lemma G.5, we have $-O\big(\frac{\alpha^{(t-1)}_1}{K}\big)\le\beta^{(t-1)}_{1,n}\le0$, and hence
\[
B^{(t)}_{1,n}\ge B^{(t-1)}_{1,n}-\eta\,O\Big(\frac{\alpha^{(t-1)}_1}{K}\Big)\ge-O\Big(\frac{A^{(t)}_1}{K}\Big),\qquad B^{(t)}_{1,n}\le B^{(t-1)}_{1,n}\le0.
\]
Moreover, by the definition of $T^{\epsilon}_{1,*}$, for any $t\le T^{\epsilon}_{1,*}$, we immediately have
\[
A^{(t)}_1\le A^{(t)}_1-\max_{m\neq1}B^{(t)}_{1,m}\le\log\Big(\Big(\frac{1}{L^{\mathrm{im}}_1}-1\Big)\Big(\sqrt{2/\epsilon}-1\Big)\Big).
\]
Therefore, $A^{(t)}_1\le O\big(\log\big(\frac1\epsilon\big)\big)$.

At iteration $t=T^{\epsilon}_{1,*}+1$, we have $A^{(t)}_1-\max_{m\neq1}B^{(t)}_{1,m}>\log\Big(\big(\frac{1}{L^{\mathrm{im}}_1}-1\big)\big(\sqrt{2/\epsilon}-1\big)\Big)$. Thus, when $\{x_{\mathrm{query}}=v_1\}\cap\{P_{\mathrm{input}}\in E_{\mathrm{imbal}}\}$ holds, we obtain
\begin{align*}
1-\mathrm{Attn}^{(t)}_1
&=\frac{\sum_{m\neq1}\frac{|\mathcal V_m|}{|\mathcal V_1|}\exp\big(B^{(t)}_{1,m}-A^{(t)}_1\big)}{\sum_{m\neq1}\frac{|\mathcal V_m|}{|\mathcal V_1|}\exp\big(B^{(t)}_{1,m}-A^{(t)}_1\big)+1}
\le\frac{\exp\big(\max_{m\neq1}B^{(t)}_{1,m}-A^{(t)}_1\big)\big(\frac{N}{|\mathcal V_1|}-1\big)}{\exp\big(\max_{m\neq1}B^{(t)}_{1,m}-A^{(t)}_1\big)\big(\frac{N}{|\mathcal V_1|}-1\big)+1}\\
&\le\frac{\exp\big(\max_{m\neq1}B^{(t)}_{1,m}-A^{(t)}_1\big)\big(\frac{1}{L^{\mathrm{im}}_1}-1\big)}{\exp\big(\max_{m\neq1}B^{(t)}_{1,m}-A^{(t)}_1\big)\big(\frac{1}{L^{\mathrm{im}}_1}-1\big)+1}
\le\frac{\Big(\big(\frac{1}{L^{\mathrm{im}}_1}-1\big)\big(\sqrt{2/\epsilon}-1\big)\Big)^{-1}\big(\frac{1}{L^{\mathrm{im}}_1}-1\big)}{\Big(\big(\frac{1}{L^{\mathrm{im}}_1}-1\big)\big(\sqrt{2/\epsilon}-1\big)\Big)^{-1}\big(\frac{1}{L^{\mathrm{im}}_1}-1\big)+1}=(\epsilon/2)^{\frac12}.
\end{align*}
Similarly to the derivation in Lemma F.24, we have
\begin{align*}
\widetilde L_1\big(\theta^{(t)}\big)
&=\frac12\,\mathbb E\bigg[\mathbb 1\{P_{\mathrm{input}}\in E_{\mathrm{imbal}}\}\Big(\textstyle\sum_{m\neq1}\big(\mathrm{Attn}^{(t)}_m\big)^2+\big(1-\mathrm{Attn}^{(t)}_1\big)^2\Big)\,\bigg|\,x_{\mathrm{query}}=v_1\bigg]\\
&\le\frac12\,\mathbb P\big(P_{\mathrm{input}}\in E_{\mathrm{imbal}}\big)\,\mathbb E\bigg[\Big(O\Big(\frac1K\Big)+1\Big)\big(1-\mathrm{Attn}^{(t)}_1\big)^2\,\bigg|\,x_{\mathrm{query}}=v_1,\;P_{\mathrm{input}}\in E_{\mathrm{imbal}}\bigg]
\le\frac12\Big(1+O\Big(\frac1K\Big)\Big)\frac{\epsilon}{2}<\frac{\epsilon}{2}.
\end{align*}

G.4. Proof of Theorem 3.3 for Dominant Feature

Theorem G.7 (Restatement of Theorem 3.3 for Dominant Feature). Suppose $p_1=\Theta(1)$ and $p_k=\Theta\big(\frac1K\big)$ for $2\le k\le K$. For any $0<\epsilon<1$, suppose $N\ge\mathrm{poly}(K)$ and $\mathrm{polylog}(K)\ge\log(\frac1\epsilon)$. We apply GD to train the loss function given in Equation (4). Then the following results hold.
1. The prediction error for the dominant feature converges: for $v_1$, with at most $T_1=O\big(\frac{\log(\epsilon^{-\frac12})}{\eta\epsilon}\big)$ GD iterations, $L_1\big(\theta^{(T_1)}\big)\le L^*_1+\epsilon$, where $L^*_1=\Theta\big(e^{-\mathrm{poly}(K)}\big)$ is the global minimum of Equation (6);
2. Attention score concentrates: for $k=1$, if the query token is $v_k$, then after $T_k$ iterations, with probability at least $1-e^{-\Omega(\mathrm{poly}(K))}$, the one-layer transformer nearly pays all attention to input tokens featuring $v_k$: $\big(1-\mathrm{Attn}^{(T_k)}_k\big)^2\le O(\epsilon)$.

Proof. The first statement is obtained by letting $T_1=T^{\epsilon}_{1,*}+1$ and combining Lemma G.6, Lemma D.10 and Lemma D.11, which lead to
\[
L_1\big(\theta^{(T_1)}\big)-L^*_1\le L_1\big(\theta^{(T_1)}\big)-L^{\mathrm{low}}_1\le\widetilde L_1\big(\theta^{(T_1)}\big)+3\exp\Big(-\frac{c_{\mathrm{im}}^2N}{25K^2}\Big)\le\epsilon.
\]
The second statement directly follows from Lemma G.6.
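As a purely illustrative companion to Theorems F.25 and G.7, the following self-contained numpy sketch trains a simplified one-layer softmax-attention predictor by GD on imbalanced prompts. The parameterization (a single $K\times K$ logit matrix W, orthonormal features, and all constants below) is our own simplification for exposition, not the paper's exact model or experiment; it only lets one compare, over training, the prediction error for queries carrying the dominant feature against that for under-represented ones.

```python
import numpy as np

rng = np.random.default_rng(0)

# --- illustrative setup (all constants are placeholders, not the paper's choices) ---
K, N = 8, 256                 # number of features, prompt length
p = np.full(K, 0.5 / (K - 1)) # under-represented features
p[0] = 0.5                    # v_1 plays the dominant feature

def sample_prompt(batch):
    """Sample token features, labels y_i = <w_task, x_i>, and a query per prompt."""
    feats = rng.choice(K, size=(batch, N), p=p)           # feature index of each token
    q = rng.choice(K, size=batch, p=p)                    # feature index of the query
    w = rng.standard_normal((batch, K))                   # fresh linear task per prompt
    y = np.take_along_axis(w, feats, axis=1)              # y_i = w[f_i] since x_i = e_{f_i}
    y_q = np.take_along_axis(w, q[:, None], axis=1)[:, 0]
    return feats, q, y, y_q

def forward(W, feats, q, y):
    """y_hat = sum_i softmax_i(x_i^T W x_q) * y_i  (one-layer softmax attention)."""
    logits = W[feats, q[:, None]]                         # x_i^T W x_q = W[f_i, f_q]
    logits -= logits.max(axis=1, keepdims=True)
    s = np.exp(logits)
    s /= s.sum(axis=1, keepdims=True)
    return (s * y).sum(axis=1), s

W = np.zeros((K, K))                                      # zero initialization of the logits
eta, batch, steps = 0.5, 512, 2000
for t in range(steps + 1):
    feats, q, y, y_q = sample_prompt(batch)
    y_hat, s = forward(W, feats, q, y)
    # d/d(logit_i) of (y_hat - y_q)^2 is 2 (y_hat - y_q) s_i (y_i - y_hat)
    g_logit = 2.0 * (y_hat - y_q)[:, None] * s * (y - y_hat[:, None])
    grad = np.zeros_like(W)
    np.add.at(grad, (feats, np.broadcast_to(q[:, None], feats.shape)), g_logit)
    W -= eta * grad / batch
    if t % 200 == 0:
        err = (y_hat - y_q) ** 2
        dom = q == 0
        print(f"step {t:5d}  dominant-query err {err[dom].mean():.4f}  "
              f"under-represented err {err[~dom].mean():.4f}")
```

Because every token sharing the query's feature carries exactly the query's label in this toy setup, driving the diagonal logits W[k, k] upward is what reduces the error, mirroring the growth of $A^{(t)}_k$ in the analysis; the per-query error traces printed above give a rough, informal picture of the stage-wise behavior established formally in the theorems.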