# Dual Operating Modes of In-Context Learning

Ziqian Lin¹ Kangwook Lee²

Abstract

In-context learning (ICL) exhibits dual operating modes: *task learning*, i.e., acquiring a new skill from in-context samples, and *task retrieval*, i.e., locating and activating a relevant pretrained skill. Recent theoretical work proposes various mathematical models to analyze ICL, but none of them fully explains this duality. In this work, we analyze a generalized probabilistic model for pretraining data, obtaining a quantitative understanding of the two operating modes of ICL. Leveraging our analysis, we provide the first explanation of an unexplained phenomenon observed with real-world large language models (LLMs): under some settings, the ICL risk initially increases and then decreases with more in-context examples. Our analysis offers a plausible explanation for this "early ascent" phenomenon: a limited number of in-context samples may lead to the retrieval of an incorrect skill, thereby increasing the risk, which eventually diminishes as task learning takes effect with more in-context samples. We also analyze ICL with biased labels, e.g., zero-shot ICL, where in-context examples are assigned random labels, and predict the bounded efficacy of such approaches. We corroborate our analysis and predictions with extensive experiments with Transformers and LLMs. The code is available at: https://github.com/UW-Madison-Lee-Lab/Dual_Operating_Modes_of_ICL.

¹Department of Computer Science, University of Wisconsin-Madison, Madison, Wisconsin, USA. ²Department of Electrical & Computer Engineering, University of Wisconsin-Madison, Madison, Wisconsin, USA. Correspondence to: Kangwook Lee.

Proceedings of the 41st International Conference on Machine Learning, Vienna, Austria. PMLR 235, 2024. Copyright 2024 by the author(s).

1. Introduction

Large language models (LLMs) exhibit a significant improvement in predictive performance when provided with in-context examples (Brown et al., 2020). This emergent ability of LLMs, known as in-context learning (ICL), operates in two distinct modes: task learning and task retrieval (Pan et al., 2023). Large language models exemplify this duality. They can learn unseen functions from in-context examples, demonstrating the learning mode (Brown et al., 2020; Razeghi et al., 2022; Garg et al., 2022). Concurrently, LLMs can also retrieve and utilize a pretrained skill. Clear evidence of the task retrieval mode is presented by Min et al. (2022), where the authors show that ICL performance remains largely unaffected even when in-context examples are annotated with random labels. This suggests that LLMs simply retrieve a pretrained skill rather than learn it from in-context examples.

The dual nature of ICL can be explained as follows. An LLM is a next-token predictor pretrained on a large pretraining set consisting of diverse data from diverse domains/tasks. To predict the next token optimally in such a scenario, the model must first learn the task prior from the pretraining data and then implicitly perform Bayesian inference at test time (Xie et al., 2022; Raventos et al., 2023). Optimal prediction on multitask pretraining data requires adherence to the learned prior (over the tasks present in the pretraining data) and making predictions based on the posterior.
The ability to learn and apply this prior during test-time inference enables task retrieval: if in-context examples align closely with a task encountered during pretraining, the model can swiftly adjust its posterior and predict without learning a new skill. Simultaneously, the model can learn a novel or uncommon skill given sufficiently many in-context samples and a non-zero prior probability for that skill. Although the link between pretraining and ICL's dual modes is conceptually straightforward, formally establishing this connection is an unresolved challenge. Motivated by this, our work seeks to address the following questions: How do we rigorously explain the dual operating modes of ICL? Can we characterize the conditions under which the retrieval mode dominates, and vice versa?

A New Model for Pretraining Data. To find answers to these questions, we first propose a new probabilistic model for pretraining data by assuming the pretraining data has a latent clustered structure. In particular, we consider in-context learning of linear functions, following recent work (Garg et al., 2022; Akyürek et al., 2023; Li et al., 2023; von Oswald et al., 2023; Raventos et al., 2023; Wu et al., 2024).

[Figure 1: A summary of our contributions. We propose a probabilistic model for pretraining data and in-context examples. Analyzing this model yields a quantitative understanding of the dual operating modes of ICL: task group re-weighting retrieves a function with few in-context examples (task retrieval), while many in-context examples yield a learned function (task learning). The analysis explains two real-world phenomena observed with LLMs.]

A next-token prediction model is prompted with (1) a sequence of (x, y) pairs drawn from a common linear function and (2) one test input $x_{\text{test}}$. An ideal model capable of in-context learning of linear models should internally fit a linear function (say $y = \hat{w}^\top x$) using the in-context examples and then generate the predicted label $y_{\text{test}} = \hat{w}^\top x_{\text{test}}$ as the next token. Recent work (Raventos et al., 2023; Wu et al., 2024) shows that such in-context learning is feasible by training a next-token prediction model on a large pretraining dataset consisting of sequences of labeled samples drawn from diverse linear functions.

We extend the existing model for pretraining data (Raventos et al., 2023) by introducing multiple task groups and task-dependent input distributions. When one generates pretraining data, one must specify a probability distribution over linear functions (equivalently, over the linear coefficient w). While most prior work assumes that w is drawn from a single Gaussian distribution, we model it as drawn from a Gaussian mixture, where each Gaussian component models a task group. This better reflects real-world data, which exhibits a clustered structure (Xie et al., 2022). Furthermore, we allow each mixture component to have its own distribution for the input x. The left-most panel of Fig. 1 gives a simple visualization of our model: the blue task group is modeled as the distribution of linear functions with positive coefficients (w ≈ 1) and input distribution centered around E[x] = +1, while the red lines represent the other task group, linear functions with negative coefficients (w ≈ −1) and input distribution centered at E[x] = −1. See Sec. 3 for more details.
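To make this concrete, the following minimal sketch (our illustration, not the paper's codebase; all parameter values are assumptions) samples pretraining sequences from the two task groups just described: one group with w near +1 and inputs centered at +1, the other with w near −1 and inputs centered at −1.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_sequence(K=16, sigma_w=0.1, sigma_x=0.3, sigma_y=0.1):
    """Sample one pretraining sequence [x1, y1, ..., xK, yK] (1-D toy, two task groups)."""
    # Pick a task group: blue (w near +1, x near +1) or red (w near -1, x near -1).
    center = rng.choice([+1.0, -1.0])
    w = rng.normal(center, sigma_w)         # task function, drawn around the group center
    x = rng.normal(center, sigma_x, size=K) # inputs with group-specific mean E[x] = center
    y = w * x + rng.normal(0, sigma_y, K)   # noisy linear labels
    return np.stack([x, y], axis=1).ravel() # interleave into (x1, y1, x2, y2, ...)

print(sample_sequence()[:6])  # first three (x, y) pairs of one sequence
```

A model pretrained on many such sequences must infer, at test time, which group the in-context pairs came from (retrieval) and, given enough pairs, the specific w within that group (learning).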
Analysis. With our new model for pretraining data, we analyze the optimal pretrained model under the squared loss, i.e., the MMSE estimator of the label given the input and in-context examples. Here, the pretraining distribution (over linear functions) is the prior, and in-context examples are the observations. Leveraging the fact that the Gaussian mixture is a conjugate prior to the Gaussian likelihood, we obtain a closed-form expression for the posterior distribution. By fully quantifying the posterior distribution of w in the form of a Gaussian mixture, we characterize how in-context examples update each component's posterior mean and posterior mixture probability. We call updates of mixture probabilities *task group (component) re-weighting* and updates of component means *task group (component) shifting*; see the central panel of Fig. 1 for a visualization. By analyzing these two effects, we obtain a quantitative understanding of how the two operating modes emerge. In particular, we show that, under some mild assumptions, task group re-weighting is the dominant factor when few in-context samples are provided, yielding the task retrieval mode. With many in-context samples, task group shifting takes over, yielding the task learning mode.

Explanation of Two Real-World Phenomena. To demonstrate the practical value of these insights, we leverage our analysis to explain and predict two phenomena observed with LLMs in practice. The *early ascent* phenomenon refers to the observation that, under certain conditions, the ICL risk initially increases and then decreases as more in-context examples are introduced (Brown et al., 2020; Xie et al., 2022); see the right-most panel of Fig. 1. Based on our analysis, we offer a plausible explanation: a limited number of in-context samples may lead to the retrieval of an incorrect skill, thereby increasing the risk, which eventually diminishes as task learning takes effect with more in-context samples.

*Bounded efficacy* of biased-label ICL is predicted by our model. ICL performs well even with in-context examples annotated with biased labels (Lyu et al., 2023; Min et al., 2022). Our model provides a rigorous justification of this approach: if in-context examples with biased labels carry sufficient information for retrieving a correct pretrained task, then the approach works. At the same time, our analysis suggests that the operating mode of ICL transitions from task retrieval to task learning as in-context examples accumulate. Once the learning mode takes over, the test risk of such methods starts increasing, as the pretrained model starts fitting the biased labels; see the right-most panel of Fig. 1. This bounded efficacy has not been reported in the literature (Min et al., 2022; Pan et al., 2023); we found that this was due to the small number of examples tested. With more in-context samples, we observe the predicted bounded efficacy phenomenon with real-world LLMs such as Mistral 7B (Jiang et al., 2023), Mixtral 8×7B (Jiang et al., 2024), Llama 2 (Touvron et al., 2023), and GPT-4 (OpenAI, 2023).

2. Related Work

Dual Operating Modes of ICL. Pan et al. (2023) empirically disentangle the two operating modes of ICL: task recognition, which we refer to as task retrieval, and task learning.
To illustrate, in the context of sentence sentiment classification with ICL, Pan et al. (2023) explore three labeling schemes for in-context examples: (i) correct semantic labels, (ii) correct but abstract labels ("0" and "1"), and (iii) random semantic labels ("positive" or "negative"). Pan et al. (2023) claim that ICL is in the task recognition mode when the model is provided with randomly labeled in-context data, and observe that its efficacy does not correlate with model size or the quantity of demonstrations. In fact, we will later show via our analysis that an increasing number of demonstrations will eventually decrease ICL accuracy in this regime. Conversely, ICL with correct but abstract labels, classified as task learning, shows performance that improves with model size and in-context example count. ICL with correct labels yields the highest accuracy, since both task recognition and task learning benefit it.

Explaining ICL via Bayesian Inference. Xie et al. (2022) use a Hidden Markov Model (HMM) (Ghahramani & Jordan, 1995; Rabiner, 1989) to model the pretraining data: each sequence in the pretraining data is generated by an HMM whose parameters are randomly drawn from a particular distribution. During pretraining, a next-token prediction model is trained to predict tokens in pretraining sequences, which requires inferring the latent HMM parameters. While this model accurately reflects real-world pretraining data characteristics such as long-range dependencies, the absence of a closed-form solution for optimal prediction makes a detailed analysis of ICL infeasible. On the other hand, Garg et al. (2022) and Raventos et al. (2023) consider the setting where a next-token prediction model is pretrained on token sequences consisting of (x, y) pairs in the form (x1, y1, x2, y2, ...). The pretraining objective is to predict only the tokens at odd positions, i.e., to predict y but not x. Garg et al. (2022) empirically evaluate the Transformer architecture (Vaswani et al., 2017), while Raventos et al. (2023) propose a probabilistic model that generates sequences according to noisy linear regression: $y_i = \langle x_i, w \rangle + \epsilon_i$, where w is the coefficient shared within a sequence and $\epsilon_i$ is noise. While this linear regression model facilitates tractable analysis and elucidates certain aspects of the dual operating modes of ICL, it falls short of modeling the clustered structure of natural language. Han et al. (2023) show that ICL asymptotically approaches kernel regression as the number of in-context samples increases. Jeon et al. (2024) introduce information-theoretic tools to show that the ICL risk decays in both the number and the sequence lengths of in-context examples. In contrast, our proposed model allows for tractable analysis and captures the clustered structure of pretraining data.

Explaining ICL via Gradient Descent. Garg et al. (2022) hint that the pretrained Transformer might implicitly execute gradient descent during ICL. Akyürek et al. (2023), von Oswald et al. (2023), and Dai et al. (2023) expand this notion by theoretically showing that one attention layer can be constructed to perform a gradient descent step exactly, and by empirically finding similarities between in-context inference and gradient descent. Further, Ahn et al. (2023), Mahankali et al. (2024), and Zhang et al. (2023) study the training process of Transformers. Ahn et al. (2023) and Mahankali et al.
(2024) theoretically show that, under certain conditions, Transformers with one or more attention layers trained to minimize the pretraining loss on noisy linear regression tasks implement a gradient descent algorithm. Zhang et al. (2023) show that a single linear self-attention layer trained by gradient flow with suitable random initialization finds a global minimum of the objective, at which the Transformer's ICL achieves prediction error competitive with the best linear predictor.

Others. Wu et al. (2024) study the sample complexity required for pretraining a linear attention model and present a statistical bound. In our work, we do not consider a particular model architecture or the statistical aspects of pretraining: we assume a pretrained model is optimally trained on infinitely large pretraining data, as in previous work (Xie et al., 2022; Raventos et al., 2023; Han et al., 2023). Giannou et al. (2023) show that a looped Transformer can emulate arbitrary algorithms such as SGD. Bai et al. (2023) show that Transformers can perform in-context algorithm selection, i.e., adaptively selecting among different ICL algorithms such as gradient descent, least squares, or ridge regression. Li et al. (2023) study generalization bounds for ICL with Transformers.

3. Pretraining and Data Generative Model

A next-token predictor is a sequential prediction model that predicts the next token given an initial token sequence. Consider pretraining this model on sequences consisting of (x, y) pairs¹ in the form (x1, y1, x2, y2, ...), with the model trained to predict only the y values, thereby skipping the prediction of x. Here, we assume odd-numbered tokens represent d-dimensional real-valued vectors and even-numbered tokens represent scalars. During inference, the model receives a sequence of 2k + 1 tokens: the first 2k tokens are k labeled samples $(x_i, y_i)$, $i \in \{1, \ldots, k\}$, and the last token is an unlabeled $x_{k+1}$. Ideally, the model should predict the correct next token, $y_{k+1}$.

3.1. Data Generative Model

In the pretraining phase, we assume the next-token predictor is pretrained on diverse tasks, each representing a continuous joint distribution of (x, y). Before presenting the exact pretraining data generative model proposed in this paper, we first describe the general data generation process. A task is defined by a joint distribution $D_{x,y}$, which specifies the likelihood of obtaining a sample (x, y) from this task. Each task is sampled from the task prior $D_{prior}$; that is, $D_{prior}$ is a distribution over distributions. The pretraining data comprises numerous sequences, each containing K labeled samples drawn i.i.d. from a distribution $D_{x,y}$. We formally describe our pretraining data generative model in Assumption 1.

Assumption 1 (Pretraining Data Generative Model). Given an integer K > 0 and a pretraining task prior $D_{prior}$, we generate a sequence $S_K$ as follows:
(a) Sample a task from the task prior: $D_{x,y} \sim D_{prior}$;
(b) Sample K labeled samples from the chosen task: $\forall i \in \{1, 2, \ldots, K\}$, $(x_i, y_i) \sim D_{x,y}$;
(c) Define the sequence $S_K = [x_1, y_1, \ldots, x_K, y_K]$.

¹It is more rigorous to represent the vector x as multiple tokens. However, viewing it as a high-dimensional token simplifies our notation while not affecting our analysis. Thus, with a slight abuse of notation, we will treat both $x_i$ and $y_i$ as tokens for simplicity.
In the sequence, the first 2k elements of $S_K$ are denoted by $S_k$, and the first 2k + 1 elements by $S_k \oplus x_{k+1}$; e.g., $S_0 = [\,]$ and $S_1 \oplus x_2 = [x_1, y_1, x_2]$.

3.2. Bayes-Optimal Next-Token Predictor

Define the pretraining objective as
$$L(F) = \mathbb{E}_{S_K}\Big[\tfrac{1}{K}\textstyle\sum_{k=0}^{K-1}\big(F(S_k \oplus x_{k+1}) - y_{k+1}\big)^2\Big],$$
where F is a next-token predictor and $S_K$ is generated from $D_{prior}$ following Assumption 1. In other words, for each sequence, we pretrain F to predict each label y based on the preceding samples, measuring risk with the squared loss. By linearity of expectation,
$$L(F) = \tfrac{1}{K}\textstyle\sum_{k=0}^{K-1}\mathbb{E}_{S_K}\big[(F(S_k \oplus x_{k+1}) - y_{k+1})^2\big].$$
A variable-input-length next-token predictor F can be viewed as K fixed-input-length next-token predictors $F_0, \ldots, F_{K-1}$, where $F_k$ takes a sequence of exactly 2k + 1 tokens as input. Thus, assuming sufficient expressiveness of F, the optimization problem $F^\star = \arg\min_F L(F)$ decomposes into K separate optimization problems, for $k \in \{0, \ldots, K-1\}$:
$$F_k^\star = \arg\min_{F_k} \mathbb{E}_{S_K}\big[(F_k(S_k \oplus x_{k+1}) - y_{k+1})^2\big].$$
Each solution $F_k^\star$ is an MMSE estimator (Van Trees, 2004, page 63). Thus, the prediction $F^\star(S_k \oplus x_{k+1}) = F_k^\star(S_k \oplus x_{k+1})$ satisfies
$$F^\star(S_k \oplus x_{k+1}) = \mathbb{E}_{S_K}[y_{k+1} \mid S_k \oplus x_{k+1}] = \mathbb{E}_{D_{x,y}}\Big[\mathbb{E}_{y_{k+1}}[y_{k+1} \mid D_{x,y}, S_k \oplus x_{k+1}]\,\Big|\, S_k \oplus x_{k+1}\Big] = \mathbb{E}_{D_{x,y}}\Big[\mathbb{E}_{y_{k+1}}[y_{k+1} \mid D_{x,y}, x_{k+1}]\,\Big|\, S_k \oplus x_{k+1}\Big]. \quad (1)$$
Thus, $F^\star(S_k \oplus x_{k+1})$ is the expectation, over the task posterior given observation $S_k \oplus x_{k+1}$, of $\mathbb{E}_{y_{k+1}}[y_{k+1} \mid D_{x,y}, x_{k+1}]$. We show that a pretrained Transformer can empirically approximate Bayesian inference in Appendix D.

3.3. Gaussian/Linear Assumptions on the Pretraining Data Generative Model

Let us now impose further assumptions on $D_{prior}$ and $D_{x,y}$ in Assumption 1 so that the posterior is tractable, extending beyond the scope of Raventos et al. (2023), whose data generative model makes each task a noisy linear regression task, draws the function w of each task from the same Gaussian distribution, and lets different tasks share the same x distribution. In contrast, our model posits that task functions are drawn from a Gaussian mixture distribution and that tasks have different x distributions, as illustrated in Fig. 2. We formalize this in Assumption 2.

[Figure 2: Pretraining data model of Raventos et al. (2023) and ours. (a) In the model of Raventos et al. (2023), all tasks in a single task group share one distribution over underlying functions and one probability density for x. (b) In our pretraining data model with two task groups, each group has its own distribution over underlying functions and its own x distribution.]

Assumption 2 (Gaussian/Linear Assumptions for the Pretraining Data Generative Model).
(a) Task: $(\mu, w) \sim D_{prior}$ with $P(\mu, w) = \sum_{m=1}^{M} \pi_m P(\mu, w \mid T_m)$, where $T_m$ is the mth mixture component² of the Gaussian mixture, i.e., $P(\mu, w \mid T_m) = \mathcal{N}(\mu \mid \mu_m, \sigma_\mu^2 I)\,\mathcal{N}(w \mid w_m, \sigma_w^2 I)$, and $\pi_m$ is the mixture weight with $\sum_{m=1}^{M} \pi_m = 1$ and $0 < \pi_m < 1$. Here $(\mu_m, w_m)$ is the center of mixture component $T_m$, and all components share the same covariance, controlled by $\sigma_\mu$ and $\sigma_w$;
(b) Input: $x \sim D_x(\mu)$ with $P(x \mid \mu) = \mathcal{N}(x \mid \mu, \sigma_x^2 I)$;
(c) Label: $y \mid x \sim D_{y|x}(w)$ with $P(y \mid x, w) = \mathcal{N}(y \mid \langle w, x \rangle, \sigma_y^2)$;
(d) $\|\mu_m\| = \|w_m\| = 1$, $\forall m \in [M]$;
(e) $\exists r > 1$ such that $\forall \alpha, \beta \in [M]$, $1/r \le \pi_\alpha/\pi_\beta \le r$;
(f) $x, \mu, \mu_m, w, w_m \in \mathbb{R}^d$, $I \in \mathbb{R}^{d \times d}$.

²The concept "mixture component" is derived from Gaussian mixture models in the statistical literature and is analogous to the term "Task Group" depicted in the left-most panel of Fig. 1.
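The sketch below (our illustration, not the authors' code; the specific parameter values are assumptions) implements Assumption 2: it draws a task (µ, w) from the Gaussian mixture prior and then draws labeled samples from that task.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_task(centers_mu, centers_w, pi, sigma_mu=0.25, sigma_w=0.25):
    """Assumption 2(a): draw (mu, w) from the Gaussian mixture task prior."""
    m = rng.choice(len(pi), p=pi)                # pick mixture component T_m
    mu = rng.normal(centers_mu[m], sigma_mu)     # mu ~ N(mu_m, sigma_mu^2 I)
    w = rng.normal(centers_w[m], sigma_w)        # w  ~ N(w_m,  sigma_w^2 I)
    return mu, w

def sample_pairs(mu, w, K=8, sigma_x=1.0, sigma_y=1.0):
    """Assumptions 2(b)-(c): x ~ N(mu, sigma_x^2 I), y ~ N(<w, x>, sigma_y^2)."""
    d = len(mu)
    xs = mu + sigma_x * rng.standard_normal((K, d))
    ys = xs @ w + sigma_y * rng.standard_normal(K)
    return xs, ys

# Example: M = 2 unit-norm centers in R^3 (Assumption 2(d)), equal weights.
centers = np.array([[0.0, 0.0, 1.0], [1.0, 0.0, 0.0]])
mu, w = sample_task(centers, centers, pi=[0.5, 0.5])
xs, ys = sample_pairs(mu, w)
```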
Remark 3.1. Based on Assumptions 2(b) and 2(c), the probability of observing a sample (x, y) within a task (µ, w) is the noisy linear regression likelihood. Assumption 2(a) indicates that the pretraining dataset of an LLM consists of M different task groups. Assumption 2(b) posits that tasks have x distributions with varying means but a shared covariance matrix. Assumption 2(c) models tasks as noisy linear regressions with the same label noise scale. Assumption 2(e) posits comparable mixture weights π across task groups.

4. Inference and Dual Operating Modes

Sec. 3.2 showed that performing ICL with the optimally pretrained next-token predictor is equivalent to computing the posterior mean of the label. In Sec. 4.1, we give the generation process of in-context examples. In Sec. 4.2, under Assumption 2 and treating $S_k \oplus x_{k+1}$ as the observation, we derive a closed-form expression for the task posterior $D_{post}$ and identify two factors in the transition from prior to posterior: component shifting and component re-weighting. In Sec. 4.3, we derive a closed-form expression for the ICL prediction $F^\star(S_k \oplus x_{k+1})$. Sec. 4.4 presents numerical computations under the tetrahedron setting illustrated in Fig. 3(a), demonstrating the effects of component shifting and re-weighting. Finally, Sec. 4.5 defines the dual operating modes in terms of component shifting and re-weighting.

4.1. In-Context Task and In-Context Function

We introduce Assumption 3 for the in-context task and the in-context function underlying the in-context examples:

Assumption 3 (Gaussian/Linear Assumptions for In-Context Examples).
(a) The input sequence $S_k \oplus x_{k+1}$ of ICL satisfies: $\forall i$, $x_i \sim \mathcal{N}(\mu^*, \tau_x^2 I)$ and $y_i = \langle x_i, w^* \rangle$;
(b) $\|\mu^*\| = \|w^*\| = 1$.

Assumption 3(a) states that each in-context example $(x_i, y_i)$ is drawn from the in-context task $(\mu^*, w^*)$, with $w^*$ the specific in-context function and the labels free of noise.

4.2. Closed-Form Expression of the Posterior

The following lemma gives the closed-form expression of the posterior $D_{post}$ given any $S_k \oplus x_{k+1}$:

[Figure 3: Numerical experiments. (a) The tetrahedron setting: an illustration of the in-context task and the prior centers; for m ∈ {1, 2, 3, 4} we set $\mu_m = w_m$. (b) CR, CS, and risks under the tetrahedron setting for $\delta_\mu = \delta_w \in \{1/81, 1/9, 1, 9, 81\}$, as the number of in-context examples k grows from 0 to 127. The first two rows show the effects of CS and CR; the third row shows how far the in-context predicted function $\hat{w}$ is from the target function $w^*$; the fourth row shows the ICL risk.]

Lemma 4.1 (Conjugate Distributions with Noisy Linear Regression Likelihood). Under Assumption 2, the posterior probability of a task (µ, w) given observation $S_k \oplus x_{k+1}$ is
$$P(\mu, w \mid S_k \oplus x_{k+1}) = \sum_{m=1}^{M} \tilde{\pi}_m\, P(\mu, w \mid \tilde{T}_m) = \sum_{m=1}^{M} \tilde{\pi}_m\, \mathcal{N}(\mu \mid \tilde{\mu}_m, \tilde{\sigma}_\mu^2 I)\, \mathcal{N}(w \mid \tilde{w}_m, \tilde{\sigma}_w^2 I).$$
Here, the mixture component $T_m$ of the prior is mapped to the mixture component $\tilde{T}_m$ of the posterior, with mixture weight $\tilde{\pi}_m$ and component center $(\tilde{\mu}_m, \tilde{w}_m)$:
$$\tilde{\pi}_m = \pi_m C_1 c^\mu_m c^w_m,$$
$$c^\mu_m = \exp\left(-\frac{\|\mu_m\|^2 - \|\mu_m + (k+1)\delta_\mu \bar{\mu}\|^2_{(I + (k+1)\delta_\mu \Sigma_\mu)^{-1}}}{2\sigma_\mu^2}\right), \quad c^w_m = \exp\left(-\frac{\|w_m\|^2 - \|w_m + k\delta_w \bar{w}\|^2_{(I + k\delta_w \Sigma_w)^{-1}}}{2\sigma_w^2}\right),$$
$$\tilde{\mu}_m = (I + (k+1)\delta_\mu \Sigma_\mu)^{-1}(\mu_m + (k+1)\delta_\mu \bar{\mu}), \quad \tilde{w}_m = (I + k\delta_w \Sigma_w)^{-1}(w_m + k\delta_w \bar{w}),$$
$$\tilde{\sigma}_\mu^2 = \sigma_\mu^2 (I + (k+1)\delta_\mu \Sigma_\mu)^{-1}, \quad \tilde{\sigma}_w^2 = \sigma_w^2 (I + k\delta_w \Sigma_w)^{-1},$$
where $C_1$ is a normalizing constant ensuring $\sum_m \tilde{\pi}_m = 1$, $\delta_\mu = \sigma_\mu^2/\sigma_x^2$, $\delta_w = \sigma_w^2/\sigma_y^2$, $\Sigma_\mu = I$, $\bar{\mu} = \frac{1}{k+1}\sum_{i=1}^{k+1} x_i$, $\Sigma_w = \frac{1}{k}\sum_{i=1}^{k} x_i x_i^\top$, and $\bar{w} = \frac{1}{k}\sum_{i=1}^{k} x_i y_i$. See Appendix G for the proof.

Remark 4.2. The Gaussian mixture is known to be a conjugate prior to the Gaussian likelihood. The conjugate distributions in this lemma extend the Gaussian-mixture conjugacy by substituting the Gaussian likelihood with the noisy linear regression likelihood of Remark 3.1.

Lemma 4.1 states that the task posterior remains a Gaussian mixture, with its mixture components shifted and re-weighted relative to the task prior. Therefore, understanding the impact of in-context examples on the posterior requires understanding how they affect two factors:

Component Shifting (CS). The component center shifts from $(\mu_m, w_m)$ to $(\tilde{\mu}_m, \tilde{w}_m)$.

Component Re-weighting (CR). The component weight is re-weighted from $\pi_m$ to $\tilde{\pi}_m$.

Remark 4.3. The term "component" comes from the literature on Gaussian mixtures and is an alternative to "Task Group" as shown in Fig. 2. "Component Shifting" and "Component Re-weighting" can thus be read as "Task Group Shifting" and "Task Group Re-weighting". We abbreviate "mixture component center" to "center" when there is no ambiguity.

Leveraging Assumption 3, we collect mathematical analyses of CS and CR in Appendix H. The analysis explores the impact of the pretraining task noises and the number of in-context examples on $\tilde{\mu}_m$, $\tilde{w}_m$, and $\tilde{\pi}_m$, and examines their convergence as k approaches infinity.

4.3. Closed-Form Expression of the ICL Prediction

With Assumption 2 and Lemma 4.1, we have the following corollary for the prediction $F^\star(S_k \oplus x_{k+1})$:

Corollary 4.4. Let $\hat{w} = \sum_{m=1}^{M} \tilde{\pi}_m \tilde{w}_m$. Under the pretraining data generative model of Assumption 1 with Assumption 2, if the pretrained model $F^\star$ minimizes the pretraining risk, then its prediction on any sequence $S_k \oplus x_{k+1}$ is
$$F^\star(S_k \oplus x_{k+1}) = \Big\langle x_{k+1}, \sum_{m=1}^{M} \tilde{\pi}_m \tilde{w}_m \Big\rangle = \langle x_{k+1}, \hat{w} \rangle.$$

Proof. Applying Assumption 1 to Eq. 1, $F^\star(S_k \oplus x_{k+1}) = \mathbb{E}_{(\mu,w)\sim D_{prior}}[\langle x_{k+1}, w\rangle \mid S_k \oplus x_{k+1}]$. Using Lemma 4.1, this reduces to $\sum_{m=1}^{M} \tilde{\pi}_m \mathbb{E}_{(\mu,w)\sim \tilde{T}_m}[\langle x_{k+1}, w\rangle]$. By linearity of expectation and of the inner product, the prediction simplifies to $\langle x_{k+1}, \sum_{m=1}^{M} \tilde{\pi}_m \tilde{w}_m\rangle = \langle x_{k+1}, \hat{w}\rangle$. ∎

Thus, the prediction is a convex combination of the predictions made by the centers of the shifted and re-weighted mixture components of the posterior. We are interested in how $\pi_m$ and $w_m$ change to $\tilde{\pi}_m$ and $\tilde{w}_m$ as k increases, and how the properties of the pretraining prior affect these changes; the sketch below makes the update concrete.
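The following NumPy sketch is our own rendering of Lemma 4.1 and Corollary 4.4 (variable names are ours; weights are handled in log space for numerical stability). It computes the shifted centers $\tilde{w}_m$, the re-weighted mixture weights $\tilde{\pi}_m$, and the resulting prediction.

```python
import numpy as np

def icl_posterior_predict(xs, ys, x_next, pi, mus, ws,
                          sigma_mu, sigma_w, sigma_x, sigma_y):
    """Lemma 4.1 (CS + CR) and the Corollary 4.4 prediction under Assumption 2 (sketch)."""
    k, d = xs.shape
    I = np.eye(d)
    delta_mu = sigma_mu**2 / sigma_x**2
    delta_w = sigma_w**2 / sigma_y**2
    mu_bar = np.vstack([xs, x_next]).mean(axis=0)     # (1/(k+1)) sum_{i<=k+1} x_i
    Sigma_w = xs.T @ xs / k                           # (1/k) sum x_i x_i^T
    w_bar = xs.T @ ys / k                             # (1/k) sum x_i y_i
    A_mu = np.linalg.inv(I + (k + 1) * delta_mu * I)  # Sigma_mu = I
    A_w = np.linalg.inv(I + k * delta_w * Sigma_w)
    log_pi, w_tilde = [], []
    for pi_m, mu_m, w_m in zip(pi, mus, ws):
        v_mu = mu_m + (k + 1) * delta_mu * mu_bar
        v_w = w_m + k * delta_w * w_bar
        w_tilde.append(A_w @ v_w)                     # component shifting
        # Component re-weighting: log of pi_m * c_m^mu * c_m^w (C_1 via normalization).
        log_c_mu = -(mu_m @ mu_m - v_mu @ A_mu @ v_mu) / (2 * sigma_mu**2)
        log_c_w = -(w_m @ w_m - v_w @ A_w @ v_w) / (2 * sigma_w**2)
        log_pi.append(np.log(pi_m) + log_c_mu + log_c_w)
    log_pi = np.array(log_pi)
    pi_tilde = np.exp(log_pi - log_pi.max())
    pi_tilde /= pi_tilde.sum()
    w_hat = sum(p * w for p, w in zip(pi_tilde, w_tilde))  # posterior-mean function
    return pi_tilde, float(x_next @ w_hat)                 # Corollary 4.4 prediction
```

Sweeping k with this function should reproduce the qualitative behavior of Fig. 3(b): for small k the weights $\tilde{\pi}_m$ move first (retrieval), and for large k the centers $\tilde{w}_m$ converge toward $w^*$ (learning).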
4.4. Prior Task Noises, CS, CR, and the ICL Prediction

We numerically compute how $\tilde{\pi}_m$, $\tilde{w}_m$, and the prediction $F^\star(S_k \oplus x_{k+1})$ evolve as k increases under different prior task noise conditions. The computation uses the tetrahedron setting with four prior mixture components, illustrated in Fig. 3(a); see Appendix B.1 for details. Fig. 3(b) shows the results. The first row shows the CS effect, i.e., the impact of increasing k on $\tilde{w}_m$. The second row shows the CR effect, i.e., the impact of increasing k on $\tilde{\pi}_m$. The third and fourth rows depict how increasing k influences the risk of learning the function $w^*$. We observe that with low task noises and small k, the CR effect initially prevails, significantly boosting the mixture weight of component 1 over the others. As k increases further, the CS effect aligns all component centers with $(\mu^*, w^*)$.

4.5. Dual Operating Modes

The task retrieval mode describes the regime where component re-weighting outweighs component shifting, so that the prediction is primarily shaped by the interplay between the pretraining prior and the in-context examples. The first column of Fig. 3(b) illustrates this: the re-weighting of $\tilde{\pi}_m$ is more pronounced than the shifting of $\tilde{w}_m$, i.e., CR plays the pivotal role in altering the prediction. In contrast, the task learning mode refers to the regime where component shifting dominates component re-weighting, so that the prediction depends almost entirely on the in-context examples and neglects the pretraining prior.

5. Early Ascent

We now explain the early ascent phenomenon by analyzing a fine-grained risk bound for ICL. (See Theorem C.1 in Appendix C for the coarser bound.)

5.1. Fine-Grained Upper Bound

The fine-grained upper bound for the ICL risk is as follows:

Theorem 5.1 (Fine-Grained Upper Bound for ICL Risk). Consider a next-token predictor attaining the optimal pretraining risk. As $k \to \infty$, the ICL risk is upper bounded by
$$\mathbb{E}_{S_k \oplus x_{k+1}}[L^*_k] \le \sum_{m=1}^{M} \|w_m - w^*\|^2\, \mathbb{E}_{S_k \oplus x_{k+1}}\big[\tilde{\pi}_m \|x_{k+1}\|^2 \lambda_1(A)^2\big],$$
where $L^*_k = (F(S_k \oplus x_{k+1}) - y^*_{k+1})^2 = (F(S_k \oplus x_{k+1}) - \langle x_{k+1}, w^*\rangle)^2$, $\|w_m - w^*\|$ is the distance between the in-context function $w^*$ and the function $w_m$ of center m, $\tilde{\pi}_m$ is the posterior mixture weight, and $A = (I + \delta_w \sum_{i=1}^{k} x_i x_i^\top)^{-1}$.

See Appendix L and Eq. 15 for proof details. In Appendix L.1, we further refine the bound for the case where the in-context $x_i$ span only a subspace of $\mathbb{R}^d$, in which $\lambda_1(A) = 1$ identically. In-context examples affect the upper bound through the two factors $\tilde{\pi}_m$ and $\lambda_1(A)$, corresponding to CR and CS introduced in Sec. 4.2. Ignoring the CR effect and considering only CS, the fine-grained upper bound degrades to the coarse bound of Theorem C.1 in Appendix C.

5.2. The Effect of the Dual Operating Modes on ICL Risk

We numerically compute the ICL risk under varied settings to explore the effect of the dual operating modes on the risk (Fig. 4). When the pretraining task noises are low, i.e., $\delta_\mu$ and $\delta_w$ are small, the task retrieval mode occurs with a small number of in-context examples, and the upper bound depends on how close $(\mu^*, w^*)$ is to a prior center. Specifically, the task prior accelerates ICL when the in-context task is close to a prior center, as the task retrieval mode quickly retrieves the task of the nearest prior center.

5.3. Early Ascent with a Biased x Distribution

The task retrieval mode, however, may not always benefit ICL. A curious phenomenon was observed by Brown et al. (2020) and Xie et al. (2022): as the number of in-context samples increases, the performance of ICL first decreases and then increases. Brown et al. (2020) report that GPT-3 on LAMBADA shows a lower one-shot accuracy (72.5%) than zero-shot accuracy (76.2%), while the few-shot accuracy (86.4%) is higher than the zero-shot accuracy. Xie et al. (2022) replicate this phenomenon on their synthetic dataset and explain it by noting that the few-shot setting introduces a distracting prompt structure, which can initially lower accuracy.
To obtain some insights, we present a simple scenario where x misleads the prediction of an LLM. Consider the following one-shot prompt for English-to-Korean translation:

What is the color of apple? 사과의 색깔은 무엇인가?³
What is the color of banana?

The correct completion is 바나나의 색깔은 무엇인가?⁴ However, GPT-3.5 generates 바나나의 색깔은 노란색입니다, which means "The color of bananas is yellow." This shows that a pretrained LLM can retrieve an incorrect skill (question answering, in this example) by observing misleading input (x).

³"What is the color of apple?" in Korean.
⁴"What is the color of banana?" in Korean.

[Figure 4: Distance to the closest prior center vs. ICL risk, $\mathbb{E}[(F - y^*_{k+1})^2]$, for $\delta_\mu = \delta_w \in \{1/81, 1/9, 1\}$. We compute the ICL risks of three target tasks, colored red (farthest), green (medium), and blue (closest), under the tetrahedron setting illustrated in the left-most panel. The red target task has the longest distance to its closest prior center, and the blue target task has the shortest. The target task is easier to learn when the distance to the closest prior center is smaller.]

[Figure 5: The early ascent phenomenon. (a) Risk and $\tilde{\pi}_m$ as k increases for d ∈ {1, 3, 8}. (b) Expectation of $\hat{w}$ as k increases for d ∈ {2, 3}. Fig. 8(a) and Fig. 8(b) show that the task retrieval mode is dominant up to k = 32, and component 1's mixture weight increases ($\mathbb{E}[\hat{w}]$ approaches $w_1$). Since this component is farther from the target than the other, the risk starts increasing. At larger k values, the risk starts decreasing ($\mathbb{E}[\hat{w}]$ approaches $w_2$) via task learning. See Appendix B.3 for setting details.]

We further examine the early ascent phenomenon under linear regression with varied levels of label noise in Appendix I.1, and under non-linear regression and discrete token prediction in Appendix I.2. Based on our analysis, we show that the early ascent phenomenon provably occurs under a certain assumption (Appendix J.1). We also reproduce early ascent in Fig. 8(a), where the upper bound and the risk initially increase because the misleading task (of center 1) is retrieved first; Fig. 8(b) further shows the locations of the retrieved functions relative to the functions of the prior centers. Finally, we give the formal theorem on the early ascent phenomenon:

Theorem 5.2 (Early Ascent). Assume
$$\alpha = \arg\min_m \left( \frac{\|\mu_m - \mu^*\|^2}{2\sigma_x^2} + \frac{\langle w_m - w^*, \mu^* \rangle^2 + d\tau_x^2 \|w_m - w^*\|^2}{2\sigma_y^2} \right)$$
is the most misleading task, and that task α satisfies
$$\mathbb{E}_{x_1}\big[(F^\star(x_1) - \langle w^*, x_1\rangle)^2\big] < \mathbb{E}_{x_1}\big[\langle x_1, w_\alpha - w^*\rangle^2\big].$$
Then, when $\delta_\mu$ and $\delta_w$ are small enough, $\exists k \ge 1$ such that
$$\mathbb{E}_{x_1}\big[(F^\star(x_1) - \langle w^*, x_1\rangle)^2\big] < \mathbb{E}_{S_k \oplus x_{k+1}}\big[(F^\star(S_k \oplus x_{k+1}) - \langle w^*, x_{k+1}\rangle)^2\big],$$
where $\mathbb{E}_{x_1}[\langle x_1, w_\alpha - w^*\rangle^2]$ equals the risk when the prediction fully depends on the misleading task function $w_\alpha$ of prior center α. See Appendix J.2 for proof details.

Theorem 5.2 shows that if the misleading task α has a higher risk than the zero-shot risk, then, when $\delta_\mu$ and $\delta_w$ are small enough, the early ascent phenomenon occurs; a small numerical illustration follows.
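As a quick check, the sketch below (ours; the prior configuration mirrors the d = 2 case of Table 2 in Appendix B.3, and the trial count is an arbitrary choice) estimates the ICL risk as a function of k when µ* is close to a misleading center but w* matches a different one. The estimated risk should rise before it falls.

```python
# Reuses icl_posterior_predict from the sketch in Sec. 4.3.
import numpy as np

rng = np.random.default_rng(1)
mus = np.array([[+1., +1.], [-1., -1.], [+1., -1.]])   # centers 1-3 (Table 2, d = 2)
ws = np.array([[-1., -1.], [+1., +1.], [-1., +1.]])
pi = [1/3, 1/3, 1/3]
mu_star, w_star = np.array([+1., +1.]), np.array([+1., +1.])
tau_x = 1.0
sig = dict(sigma_mu=0.05, sigma_w=0.05, sigma_x=1.0, sigma_y=2.0)

for k in [1, 4, 16, 64, 256, 1024]:
    risks = []
    for _ in range(200):                               # Monte Carlo trials
        xs = mu_star + tau_x * rng.standard_normal((k, 2))
        ys = xs @ w_star                               # noiseless in-context labels
        x_next = mu_star + tau_x * rng.standard_normal(2)
        _, pred = icl_posterior_predict(xs, ys, x_next, pi, mus, ws, **sig)
        risks.append((pred - x_next @ w_star) ** 2)
    print(k, np.mean(risks))                           # risk rises, then falls
```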
6. Bounded Efficacy of Biased-Label ICL

We further predict the bounded efficacy phenomenon by examining the risk bound of ICL with biased labels. The assumption for ICL with biased labels is as follows:

Assumption 4 (ICL with Biased Labels). The function $w^*$ of ICL with biased labels differs from the target function $w_\alpha$, i.e., $w^* \ne w_\alpha$, where $w_\alpha$ is the function of a pretraining task prior center. The in-context task is closer to the prior center α than to all other prior centers: $\forall \beta \ne \alpha$,
$$\|\mu_\beta - \mu^*\|^2 - \|\mu_\alpha - \mu^*\|^2 \ge d_\mu^2, \quad \|w_\beta - w^*\|^2 - \|w_\alpha - w^*\|^2 \ge d_w^2,$$
$$\tau_x^2 \|w_\beta - w^*\|^2 - (1 + \tau_x^2)\|w_\alpha - w^*\|^2 \ge \tau_x^2 u_w^2.$$

Assumption 4 says that, to retrieve $w_\alpha$ associated with prior center α, the in-context task is selected for its proximity to center α, ensuring it is closer to center α than to any other center.

6.1. Upper Bound for ICL Risk with Biased Labels

The following theorem gives an upper bound for the risk of ICL with biased labels used to retrieve a task:

Theorem 6.1 (Upper Bound for ICL Risk with Biased Labels). Consider a next-token predictor attaining the optimal pretraining risk. As $k \to \infty$, the ICL risk with biased labels is upper bounded by
$$\mathbb{E}_{S_k}[L^\alpha_k] < \|w_\alpha - w^*\|^2 (1 + d\tau_x^2),$$
where $L^\alpha_k = (F(S_k \oplus x_{k+1}) - y^\alpha_{k+1})^2 = (F(S_k \oplus x_{k+1}) - \langle x_{k+1}, w_\alpha\rangle)^2$. When $\delta_\mu$ and $\delta_w$ are sufficiently small, there exists a particular interval of k over which
$$\mathbb{E}_{S_k}[L^\alpha_k] < \|w_\alpha - w^*\|^2 (1 + d\tau_x^2)\min\{1,\, 4k^2\delta_w^2(1 + \tau_x^2)^2\} + C_1 e^{-C_2 k\, d_\mu^2/(8\sigma_x^2)} + C_3 e^{-C_4 k\, u_w^2\tau_x^2/(8\sigma_y^2)}.$$
As k increases, the second and third terms dominate and decay exponentially while k is small, whereas the first term dominates and increases once k is large. $C_1$, $C_2$, $C_3$, and $C_4$ are constants depending on the prior setting, $\tau_x$, and $(\mu^*, w^*)$. See Appendix M for proof details.

Table 1: Bounded efficacy in GPT-4. Error rate measured with respect to addition (+) and "biased +". The bounded efficacy phenomenon: the error rate with respect to addition goes down until k = 2 but increases afterward. Experiment details are in Appendix E.1.

| k | 0 | 1 | 2 | 4 | 8 | 16 |
|---|---|---|---|---|---|---|
| + | 75.0% | 36.2% | 33.9% | 49.3% | 79.3% | 85.1% |
| Biased + | 100.0% | 98.3% | 95.9% | 60.5% | 24.4% | 16.8% |

6.2. Bounded Efficacy of Biased-Label ICL in GPT-4

This section shows that the bounded efficacy phenomenon exists in GPT-4 (Table 1). With the task "biased +" as the in-context task corresponding to $w^*$, as the number of in-context examples increases, ICL first retrieves the skill addition (+), corresponding to $w_\alpha$, which has a strong pretraining prior. Later, it learns the "biased +" task, leading to the bounded efficacy phenomenon; the sketch below shows how such prompts can be constructed.
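A minimal sketch of how such an experiment can be set up (our illustration; the exact prompt format, system message, and evaluation protocol of Appendix E.1 are not reproduced here). The in-context task is "biased +", i.e., c = a + b + 1, and each model completion is scored against both the biased rule and true addition.

```python
import random

def biased_plus_prompt(k, rng=random.Random(0)):
    """Build a k-shot prompt for the 'biased +' task: c = a + b + 1 (sketch)."""
    lines = []
    for _ in range(k):
        a, b = rng.randint(10, 99), rng.randint(10, 99)
        lines.append(f"{a} + {b} = {a + b + 1}")      # biased in-context labels
    a, b = rng.randint(10, 99), rng.randint(10, 99)
    lines.append(f"{a} + {b} =")                      # query
    return "\n".join(lines), a + b, a + b + 1         # prompt, addition target, biased target

prompt, add_target, biased_target = biased_plus_prompt(k=4)
# Send `prompt` to an LLM; comparing its answer against each target yields the
# two error rates of Table 1 (retrieval of + vs. learning of biased +).
```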
6.3. Bounded Efficacy for Zero-Shot ICL

We further introduce Lemma 6.2, a variation of Theorem 6.1, to explain zero-shot ICL, an ICL algorithm capable of functioning with random labels (Lyu et al., 2023).

[Figure 6: Bounded efficacy. Classification error of Mixtral 8×7B and Llama 2 70B vs. the number of in-context examples k, with true labels and with random labels. The error rates of ICL with random labels start increasing at large k. See Appendix F for more experimental results.]

Lemma 6.2 ((Informal) Upper Bound for Zero-Shot ICL). Assume a next-token predictor attains the optimal pretraining risk. Then the risk of ICL with random labels (which provide no information) exhibits the bounded efficacy phenomenon.

See Appendix N for proof details. Lemma 6.2 says that, as the number of in-context examples increases, the loss curve of zero-shot ICL with random labels exhibits the bounded efficacy phenomenon. This conflicts with the observation of Min et al. (2022) that ICL with random labels performs very similarly to ICL with true labels for 1 to 32 in-context examples. We believe this observation is due to the small number of in-context examples. Thus, we extend the experiment of Min et al. (2022) beyond 32 in-context examples. Because LLM context lengths constrain the maximum number of in-context examples, we choose different LLMs than Min et al. (2022), with larger context lengths. Fig. 6 highlights the bounded efficacy phenomenon in the error curve for random labels: compared with true labels, the error rate of ICL with random labels increases at a much smaller k, clearly exhibiting the bounded efficacy phenomenon we predicted.

7. Conclusion

In this paper, we introduced a probabilistic model for understanding the dual operating modes of in-context learning: task learning and task retrieval. Our analysis allowed us to explain the previously observed early ascent phenomenon in real-world ICL applications and to predict a new bounded efficacy phenomenon of biased-label ICL. We validated our findings and predictions via experiments involving large language models. Our work lays the groundwork for future research on further exploring and improving ICL. We conclude with the limitations of our current framework: (i) the gap between our assumed pretraining linear regression tasks and the complex, non-linear, categorical, real-world pretraining tasks of LLMs; (ii) the labels of in-context samples are assumed to be noiseless.

Acknowledgements

This work was supported by NSF Award DMS-2023239, NSF CAREER Award CCF-2339978, an Amazon Research Award, and a grant from Furiosa AI. We would like to express our sincere gratitude to Kartik Sreenivasan for his invaluable discussions for this research; his insights and expertise have been instrumental in shaping this study. Additionally, we sincerely thank Andrew Geng for his contributions to coding the initial experimental setup; his skills and dedication were pivotal in the early stages of our research.

Impact Statement

This paper presents work whose goal is to advance the field of Machine Learning. There are many potential societal consequences of our work, none of which we feel must be specifically highlighted here.

References

Ahn, K., Cheng, X., Daneshmand, H., and Sra, S. Transformers learn to implement preconditioned gradient descent for in-context learning. In Advances in Neural Information Processing Systems (NeurIPS), 2023.

Akyürek, E., Schuurmans, D., Andreas, J., Ma, T., and Zhou, D. What learning algorithm is in-context learning? Investigations with linear models. In International Conference on Learning Representations (ICLR), 2023.

Bai, Y., Chen, F., Wang, H., Xiong, C., and Mei, S. Transformers as statisticians: Provable in-context learning with in-context algorithm selection. In Advances in Neural Information Processing Systems (NeurIPS), 2023.

Barbieri, F., Camacho-Collados, J., Anke, L. E., and Neves, L.
TweetEval: Unified benchmark and comparative evaluation for tweet classification. In Findings of the Association for Computational Linguistics: EMNLP, 2020.

Boucheron, S., Lugosi, G., and Massart, P. Concentration Inequalities: A Nonasymptotic Theory of Independence. Oxford University Press, 2013.

Brown, T. B., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., Agarwal, S., Herbert-Voss, A., Krueger, G., Henighan, T., Child, R., Ramesh, A., Ziegler, D. M., Wu, J., Winter, C., Hesse, C., Chen, M., Sigler, E., Litwin, M., Gray, S., Chess, B., Clark, J., Berner, C., McCandlish, S., Radford, A., Sutskever, I., and Amodei, D. Language models are few-shot learners. In Advances in Neural Information Processing Systems (NeurIPS), 2020.

Dagan, I., Glickman, O., and Magnini, B. The PASCAL recognising textual entailment challenge. In PASCAL Machine Learning Challenges Workshop (MLCW), 2005.

Dai, D., Sun, Y., Dong, L., Hao, Y., Ma, S., Sui, Z., and Wei, F. Why can GPT learn in-context? Language models secretly perform gradient descent as meta-optimizers. In Findings of the Association for Computational Linguistics (ACL), 2023.

Dolan, W. B. and Brockett, C. Automatically constructing a corpus of sentential paraphrases. In International Workshop on Paraphrasing (IWP@IJCNLP), 2005.

Garg, S., Tsipras, D., Liang, P. S., and Valiant, G. What can Transformers learn in-context? A case study of simple function classes. In Advances in Neural Information Processing Systems (NeurIPS), 2022.

Ghahramani, Z. and Jordan, M. Factorial hidden Markov models. In Advances in Neural Information Processing Systems (NeurIPS), 1995.

Giannou, A., Rajput, S., Sohn, J.-y., Lee, K., Lee, J. D., and Papailiopoulos, D. Looped Transformers as programmable computers. In International Conference on Machine Learning (ICML), 2023.

Han, C., Wang, Z., Zhao, H., and Ji, H. In-context learning of large language models explained as kernel regression. arXiv preprint arXiv:2305.12766, 2023.

Jeon, H. J., Lee, J. D., Lei, Q., and Van Roy, B. An information-theoretic analysis of in-context learning. arXiv preprint arXiv:2401.15530, 2024.

Jiang, A. Q., Sablayrolles, A., Mensch, A., Bamford, C., Chaplot, D. S., Casas, D. d. l., Bressand, F., Lengyel, G., Lample, G., Saulnier, L., et al. Mistral 7B. arXiv preprint arXiv:2310.06825, 2023.

Jiang, A. Q., Sablayrolles, A., Roux, A., Mensch, A., Savary, B., Bamford, C., Chaplot, D. S., Casas, D. d. l., Hanna, E. B., Bressand, F., et al. Mixtral of experts. arXiv preprint arXiv:2401.04088, 2024.

Li, Y., Ildiz, M. E., Papailiopoulos, D., and Oymak, S. Transformers as algorithms: Generalization and stability in in-context learning. In International Conference on Machine Learning (ICML), 2023.

Loshchilov, I. and Hutter, F. Decoupled weight decay regularization. In International Conference on Learning Representations (ICLR), 2019.

Lyu, X., Min, S., Beltagy, I., Zettlemoyer, L., and Hajishirzi, H. Z-ICL: Zero-shot in-context learning with pseudo-demonstrations. In Annual Meeting of the Association for Computational Linguistics (ACL), 2023.

Mahankali, A., Hashimoto, T. B., and Ma, T. One step of gradient descent is provably the optimal in-context learner with one layer of linear self-attention. In International Conference on Learning Representations (ICLR), 2024.

Marelli, M., Menini, S., Baroni, M., Bentivogli, L., Bernardi, R., and Zamparelli, R.
A SICK cure for the evaluation of compositional distributional semantic models. In International Conference on Language Resources and Evaluation (LREC), 2014.

Min, S., Lyu, X., Holtzman, A., Artetxe, M., Lewis, M., Hajishirzi, H., and Zettlemoyer, L. Rethinking the role of demonstrations: What makes in-context learning work? In Empirical Methods in Natural Language Processing (EMNLP), 2022.

OpenAI. GPT-4 technical report, 2023.

Pan, J., Gao, T., Chen, H., and Chen, D. What in-context learning "learns" in-context: Disentangling task recognition and task learning. In Findings of the Association for Computational Linguistics (ACL), 2023.

Rabiner, L. R. A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE, 1989.

Raventos, A., Paul, M., Chen, F., and Ganguli, S. The effects of pretraining task diversity on in-context learning of ridge regression. In ICLR Workshop on Mathematical and Empirical Understanding of Foundation Models (ME-FoMo), 2023.

Razeghi, Y., IV, R. L. L., Gardner, M., and Singh, S. Impact of pretraining term frequencies on few-shot numerical reasoning. In Findings of the Association for Computational Linguistics: EMNLP, 2022.

Sheng, E. and Uthus, D. Investigating societal biases in a poetry composition system. In Workshop on Gender Bias in Natural Language Processing, 2020.

Touvron, H., Martin, L., Stone, K., Albert, P., Almahairi, A., Babaei, Y., Bashlykov, N., Batra, S., Bhargava, P., Bhosale, S., et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023.

Tsigler, A. and Bartlett, P. L. Benign overfitting in ridge regression. Journal of Machine Learning Research (JMLR), 2023.

Van Trees, H. L. Detection, Estimation, and Modulation Theory, Part I: Detection, Estimation, and Linear Modulation Theory. John Wiley & Sons, 2004.

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., and Polosukhin, I. Attention is all you need. In Advances in Neural Information Processing Systems (NeurIPS), 2017.

Vershynin, R. High-Dimensional Probability: An Introduction with Applications in Data Science, volume 47. Cambridge University Press, 2018.

von Oswald, J., Niklasson, E., Randazzo, E., Sacramento, J., Mordvintsev, A., Zhmoginov, A., and Vladymyrov, M. Transformers learn in-context by gradient descent. In International Conference on Machine Learning (ICML), 2023.

Wu, J., Zou, D., Chen, Z., Braverman, V., Gu, Q., and Bartlett, P. L. How many pretraining tasks are needed for in-context learning of linear regression? In International Conference on Learning Representations (ICLR), 2024.

Xie, S. M., Raghunathan, A., Liang, P., and Ma, T. An explanation of in-context learning as implicit Bayesian inference. In International Conference on Learning Representations (ICLR), 2022.

Zhang, R., Frei, S., and Bartlett, P. L. Trained Transformers learn linear models in-context. In Workshop on Robustness of Few-shot and Zero-shot Learning in Large Foundation Models (R0-FoMo), 2023.

A. Notations

This section collects all notations used in the main paper.

Notations introduced in Sec. 3:

- F: a next-token predictor.
- $\hat{F}$: a pretrained next-token predictor.
- $F^\star$: a Bayes-optimal next-token predictor, attaining the minimum pretraining risk.
- $F_k$: a next-token predictor for k in-context examples.
- $F_k^\star$: a Bayes-optimal next-token predictor for k in-context examples.
- x and y: input and label of a task, e.g., x and y of a linear regression task $y = \langle x, w\rangle$.
- k: the number of in-context examples.
- K: the maximum number of examples in a sequence.
- $S_k$: a sequence of k in-context examples, $[x_1, y_1, \ldots, x_k, y_k]$.
- $S_K$: a sequence of K in-context examples, $[x_1, y_1, \ldots, x_K, y_K]$.
- $S_k \oplus x_{k+1}$: $[x_1, y_1, \ldots, x_k, y_k, x_{k+1}]$, a sequence of k in-context examples appended with $x_{k+1}$.
- µ and w: the parameters that jointly specify a task; µ specifies the distribution of x, and w specifies the function mapping x to y.
- $D_{prior}$ and $D_{\mu,w}$: $D_{prior} = D_{\mu,w}$; the task prior distribution in which each task is specified by parameters (µ, w). Also called the pretraining prior, pretraining task prior, pretraining (task) prior distribution, or simply the prior.
- $D_x(\mu)$: the conditional distribution of x given µ for the task (µ, w).
- $D_{x,y}(\mu, w)$: the joint distribution of (x, y) for the task (µ, w).
- $D_{y|x}(w)$: the distribution of y conditioned on the input x and the parameter w of the task (µ, w).
- P(µ, w): the probability of task (µ, w) under the task prior $D_{prior}$.
- P(x|µ): the probability of x under $D_x(\mu)$.
- P(y|x, w): the probability of y under $D_{y|x}(w)$.
- L(F): the risk of F on samples generated from the pretraining data generative model of Assumption 1.
- M: the number of mixture components in the Gaussian mixture prior.
- N(x; µ, Σ): the density of x under the multivariate normal distribution with mean µ and covariance matrix Σ.
- m, α, and β: indices of mixture components in the Gaussian mixture prior.
- $T_m$: the mth mixture component of the Gaussian mixture prior.
- $\pi_m$: the mixture weight of the mth mixture component of the Gaussian mixture prior.
- $\mu_m$ and $w_m$: $(\mu_m, w_m)$ is the center of the mth mixture component.
- $\mu^*$ and $w^*$: $(\mu^*, w^*)$ is the in-context task; in-context examples are drawn from this task without label noise.
- $\sigma_\mu$ and $\sigma_w$: the task noises, i.e., the noise scales of µ and w.
- $\sigma_x$ and $\sigma_y$: the sample noises, i.e., the noise scales of x and y of pretraining samples.
- $\tau_x$: the sample noise, i.e., the noise scale of x of in-context examples.
- d: the dimension of x.
- r: the maximum ratio between the mixture weights of two mixture components.

Notations introduced in Sec. 4:

- $D_{post}$: the posterior distribution obtained from the pretraining prior $D_{prior}$ after observing $S_k \oplus x_{k+1}$.
- $\|\cdot\|$: the L2 norm; for any vector x, $\|x\|^2 = x^\top x$.
- $\|x\|^2_A$: for any vector x and matrix A, $\|x\|^2_A = x^\top A x$.
- $P(\mu, w \mid S_k \oplus x_{k+1})$: the probability of task (µ, w) under the posterior after observing $S_k \oplus x_{k+1}$.
- $\tilde{T}_m$: the mth mixture component of the Gaussian mixture posterior.
- $\tilde{\pi}_m$: the mixture weight of the mth mixture component of the Gaussian mixture posterior.
- $\tilde{\mu}_m$ and $\tilde{w}_m$: $(\tilde{\mu}_m, \tilde{w}_m)$ is the center of the mth mixture component of the Gaussian mixture posterior.
- $P(\mu, w \mid \tilde{T}_m)$: the probability of task (µ, w) under the mth mixture component of the posterior.
- $\delta_\mu$ and $\delta_w$: the ratios of squared task noises to squared sample noises; $\delta_\mu = \sigma_\mu^2/\sigma_x^2$ and $\delta_w = \sigma_w^2/\sigma_y^2$.
- $\Sigma_\mu$: $\Sigma_\mu = I$.
- $\Sigma_w$: $\Sigma_w = \frac{1}{k}\sum_{i=1}^{k} x_i x_i^\top$.
- $\bar{\mu}$: $\bar{\mu} = \frac{1}{k+1}\sum_{i=1}^{k+1} x_i$.
- $\bar{w}$: $\bar{w} = \frac{1}{k}\sum_{i=1}^{k} x_i y_i$.
- $\hat{w}$: the mean of w under the task posterior, i.e., the function predicted by the Bayes-optimal next-token predictor: $F^\star(S_k \oplus x_{k+1}) = \langle x_{k+1}, \hat{w}\rangle = \langle x_{k+1}, \sum_{m=1}^{M} \tilde{\pi}_m \tilde{w}_m\rangle$.
- $c^\mu_m$ and $c^w_m$: parts of the re-weighting coefficient in Component Re-weighting.
- $\Psi_\mu(\alpha, \beta)$ and $\Psi_w(\alpha, \beta)$: functions used to analyze the phenomenon of Component Re-weighting.
- $r(\alpha, \beta)$: the ratio of the mixture weight $\tilde{\pi}_\alpha$ of $\tilde{T}_\alpha$ to the mixture weight $\tilde{\pi}_\beta$ of $\tilde{T}_\beta$.
- $\lambda_d(A)$: the dth largest eigenvalue of matrix A.
In this paper $A \in \mathbb{R}^{d \times d}$, so $\lambda_d(A)$ is the smallest eigenvalue of A.
- $\lambda_1(A)$: the largest eigenvalue of matrix A.
- $y^*_{k+1}$: the label under the function $w^*$; $y^*_{k+1} = \langle x_{k+1}, w^*\rangle$.

Notations introduced in Sec. 5:

- $L^*_k$: the L2 loss of ICL when learning the function $w^*$; $L^*_k = (F(S_k \oplus x_{k+1}) - y^*_{k+1})^2 = (F(S_k \oplus x_{k+1}) - \langle x_{k+1}, w^*\rangle)^2$.

Notations introduced in Sec. 6:

- $d_\mu^2$: $\forall \beta \ne \alpha$, $\|\mu_\beta - \mu^*\|^2 - \|\mu_\alpha - \mu^*\|^2 \ge d_\mu^2$; the µ-margin of any other $\mu_\beta$ over $\mu_\alpha$.
- $d_w^2$: $\forall \beta \ne \alpha$, $\|w_\beta - w^*\|^2 - \|w_\alpha - w^*\|^2 \ge d_w^2$; the w-margin of any other $w_\beta$ over $w_\alpha$.
- $u_w^2$: $\forall \beta \ne \alpha$, $\tau_x^2\|w_\beta - w^*\|^2 - (1+\tau_x^2)\|w_\alpha - w^*\|^2 \ge \tau_x^2 u_w^2$; the weighted w-margin of any other $w_\beta$ over $w_\alpha$.
- $y^\alpha_{k+1}$: the label under the retrieved function $w_\alpha$; $y^\alpha_{k+1} = \langle x_{k+1}, w_\alpha\rangle$.
- $L^\alpha_k$: the L2 loss of ICL when retrieving the function $w_\alpha$ of pretraining prior center α; $L^\alpha_k = (F(S_k \oplus x_{k+1}) - y^\alpha_{k+1})^2 = (F(S_k \oplus x_{k+1}) - \langle x_{k+1}, w_\alpha\rangle)^2$.

B. Prior Examples

This section outlines the configurations of the prior settings used in our numerical computations and preliminary Transformer experiments, focusing on the geometric arrangement of the centers in the priors. Specifically, we detail configurations where the centers form 3-dimensional regular polyhedra in Sec. B.1, extend to configurations in d-dimensional spaces in Sec. B.2, and discuss a setup specific to the early ascent phenomenon in Sec. B.3.

[Figure 7: Visualization of the tetrahedron setting, showing the pretraining prior centers $\mu_\beta$, $w_\beta$ and the in-context task $(\mu^*, w^*)$. For β ∈ {1, 2, 3, 4}, $(\mu_\beta, w_\beta)$ is a mixture component center of the prior. $(\mu_\alpha, w_\alpha)$ for α = 1 is the center of the target task for ICL with biased labels, while $(\mu^*, w^*)$ is the in-context task. The dotted purple lines highlight the distance of 1 from the origin (0, 0, 0) to any point denoted by µ or w.]

B.1. Regular Polyhedra

Viewing the centers of the mixture components of the pretraining prior as points forming the vertices of various shapes, we examine the 3-dimensional regular polyhedra: the tetrahedron (4 vertices/centers), octahedron (6), hexahedron (8), icosahedron (12), and dodecahedron (20), listed in order of increasing density of centers on the sphere. The configuration of a regular polyhedron with M centers follows the parameters of Assumption 2, as detailed below:

- Dimension d = 3; the number of mixture components equals M;
- The centers of the mixture components form a regular polyhedron with M vertices;
- All mixture weights are equal, $\pi_m = 1/M$, and $\mu_m = w_m$ for all m ∈ [M];
- For the noises of x and y, $\sigma_x = \sigma_y = 1$ and $\tau_x = 1$;
- For the noises of µ and w, $\sigma_\mu = \sigma_w = 0.25$ unless specified otherwise;
- For the in-context task, $\mu^* = \frac{2\mu_1 + \mu_2}{\|2\mu_1 + \mu_2\|}$ and $w^* = \frac{2w_1 + w_2}{\|2w_1 + w_2\|}$ unless specified otherwise, where $\mu_2$ is one of the centers closest to $\mu_1$.

We mainly use the tetrahedron setting in the paper, so we further visualize it and record its parameters. The 3D visualization of the mixture component centers of the prior and the in-context task is shown in Fig. 7. The parameters are as follows:

- Dimension d = 3; number of mixture components M = 4;
- The centers of the components form a tetrahedron, as shown in Fig. 7:
$\mu_1 = w_1 = [0, 0, 1]^\top$, $\mu_2 = w_2 = [\tfrac{2\sqrt{2}}{3}, 0, -\tfrac{1}{3}]^\top$, $\mu_3 = w_3 = [-\tfrac{\sqrt{2}}{3}, \sqrt{\tfrac{2}{3}}, -\tfrac{1}{3}]^\top$, and $\mu_4 = w_4 = [-\tfrac{\sqrt{2}}{3}, -\sqrt{\tfrac{2}{3}}, -\tfrac{1}{3}]^\top$;
- All mixture weights are equal, $\pi_m = 1/4$, and $\mu_m = w_m$ for all m ∈ {1, 2, 3, 4};
- For the noises of x and y, $\sigma_x = \sigma_y = 1$ and $\tau_x = 1$;
- For the noises of µ and w, $\sigma_\mu = \sigma_w = 0.25$ unless specified otherwise;
- For the in-context task, $\mu^* = \frac{2\mu_1 + \mu_2 + 0.2\mu_3}{\|2\mu_1 + \mu_2 + 0.2\mu_3\|}$ and $w^* = \frac{2w_1 + w_2 + 0.2w_3}{\|2w_1 + w_2 + 0.2w_3\|}$. We slightly shift the in-context task $(\mu^*, w^*)$ towards $(\mu_3, w_3)$ for visualization purposes, so that m = 3 and m = 4 produce slightly different curves.

B.2. d-Dimensional Examples

We consider d-dimensional examples with d centers for d ∈ {2, 4, 8, 16, 32}. A d-dimensional example with d centers is parameterized as follows:

- The dimension equals d; the number of mixture components M = d;
- For all m ∈ [M], $\mu_m = e_m$, the mth vector of the standard basis of $\mathbb{R}^d$: $\mu_{m,i} = 1$ if i = m and $\mu_{m,i} = 0$ otherwise;
- All mixture weights are equal, $\pi_m = 1/d$, and $\mu_m = w_m$ for all m ∈ [M];
- For the noises of x and y, $\sigma_x = \sigma_y = 1$ and $\tau_x = 1$;
- For the noises of µ and w, $\sigma_\mu = \sigma_w = 0.25$;
- For the in-context task, $\mu^* = \frac{2\mu_1 + \mu_2}{\|2\mu_1 + \mu_2\|}$ and $w^* = \frac{2w_1 + w_2}{\|2w_1 + w_2\|}$.

B.3. Early Ascent Examples

Table 2 outlines the prior configurations used to produce the early ascent phenomenon, where the in-context task is designed to have an x distribution close to that of a misleading task. The full results are shown in Fig. 8.

Table 2: Prior settings for early ascent. The pretraining task prior comprises two components for dimension one and three components for dimension two or more. ICL aims to predict following the in-context function $w^*$, which equals prior center 2's function $w_2$ ($w^* = w_2$). The in-context task has an x distribution closer to the task of prior center 1 but an x → y mapping closer to that of prior center 2. For all cases, $\sigma_\mu = \sigma_w = 0.05$, $\sigma_x = \tau_x = 1$, and $\sigma_y = 2$. See Fig. 8(b) for a visualization of the prior centers for dimension d ∈ {1, 2, 3}.

| Case | Component / Task | Mixture Weight | µ | w |
|---|---|---|---|---|
| d = 1 | Component 1 | 1/2 | [+1] | [−1] |
| | Component 2 | 1/2 | [−1] | [+1] |
| | Component 3 | / | / | / |
| | In-context task | / | [+1] | [+1] |
| d = 2 | Component 1 | 1/3 | [+1, +1] | [−1, −1] |
| | Component 2 | 1/3 | [−1, −1] | [+1, +1] |
| | Component 3 | 1/3 | [+1, −1] | [−1, +1] |
| | In-context task | / | [+1, +1] | [+1, +1] |
| d ≥ 3 | Component 1 | 1/3 | [+1, +1, …, +1] | [−1, −1, …, −1] |
| | Component 2 | 1/3 | [−1, −1, …, −1] | [+1, +1, …, +1] |
| | Component 3 | 1/3 | [+1, −1, …, −1] | [−1, +1, …, +1] |
| | In-context task | / | [+1, +1, …, +1] | [+1, +1, …, +1] |

A small helper constructing these priors is sketched below.
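For convenience, a helper (ours, not from the paper's codebase) that materializes the Table 2 priors for any dimension; note that in every case the table sets $w_m = -\mu_m$.

```python
import numpy as np

def early_ascent_prior(d):
    """Build (pi, mus, ws, mu_star, w_star) following Table 2."""
    if d == 1:
        mus = np.array([[+1.0], [-1.0]])                # two components
    else:
        c3 = np.ones(d); c3[1:] = -1.0                  # [+1, -1, ..., -1]
        mus = np.stack([np.ones(d), -np.ones(d), c3])   # three components
    ws = -mus                                           # w_m = -mu_m in every case
    pi = [1.0 / len(mus)] * len(mus)
    return pi, mus, ws, np.ones(d), np.ones(d)          # in-context: mu* = w* = [+1]^d
```

Feeding these priors into the Monte Carlo loop from Sec. 5.3 should reproduce the curves of Fig. 8(a).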
20 24 28 212 216 Number of In-Context Examples (k) Traj of E[f w] with Increasing k w1 of Center 1 (Misleading) w2 of Center 2 (Target) 0.5 0.0 0.5 Value of First Dimension of w Value of Second Dimension of w w1 of Center 1 (Misleading) w2 of Center 2 (Target) w3 of Center 3 Traj of E[f w] with Increasing k 1.0 0.5 0.0 0.5 1.0 1.0 0.5 0.0 0.5 1.0 w1 of Center 1 (Misleading) w2 of Center 2 (Target) w3 of Center 3 Traj of E[f w] with Increasing k (b) The trajectory of the expectation of w with increasing k under d equal to 1, 2 and 3. Figure 8: The early ascent phenomenon. Fig. 8(a) displays the trends of expected losses, upper bounds, and mixture weights, while Fig. 8(b) presents the trend of the expectation of w. We can see that the task retrieval mode is dominant up to k = 32, and component 1 s mixture weight increases (E[ w] approaches w1). Since this misleading component 1 is far from the target component 2, the risk starts increasing. At larger k values, the risk starts decreasing (E[ w] approaches w2) via task learning. C. Coarse Upper Bound for ICL Risk The following theorem shows a coarse upper bound of the ICL risk parallel to Theorem 5.1: Theorem C.1 (Coarse Upper Bound for ICL Risk). Consider a next-token predictor attaining the optimal pretraining risk. As k , the ICL risk is upper bounded by: ESk xk+1[L k] <4(1 + dτ 2 x) τ 4xδw2k2 + O(kδ 5 where L k = (F(Sk xk+1) y k+1)2 = (F(Sk xk+1) xk+1, w )2 and δ is an arbitrarily small positive constant. See Appendix L for proof details. The upper bound decreases as the square of the inverse of k. Notice there is no noise for y labels of in-context examples under our setting, which leads to a faster decay rate than standard 1/k for ridge regression (Tsigler & Bartlett, 2023). Dual Operating Modes of In-Context Learning The notations δw and k are colored for easier observation. 0 2 4 6 81012 δµ = δw = 1/81 0 2 4 6 81012 δµ = δw = 1/9 0 2 4 6 81012 δµ = δw = 1 0 2 4 6 81012 δµ = δw = 9 0 2 4 6 81012 δµ = δw = 81 E[(R y k+1)2] E[(F y k+1)2] Number of In-Context Examples (k) Figure 9: In-context learning vs ridge regression. R indicates the prediction by ridge regression, F indicates the prediction by ICL with a Bayes-optimal next-token predictor, and y k+1 = xk+1, w . Let the k samples draw from a task (µ , w ), which is drawn from the pretraining prior distribution. The dimension d of x equals 6. We observe that ICL performs better than ridge regression when k is small, and ridge regression performs better than ICL when k d. Especially, when the task prior distribution has high task variance (big δµ and δw values), ICL and ridge regression have very similar performance. We further compare the risk ESk xk+1[L k] and the risk under ridge regression with L2 regularization parameter equal to 10 6, where the same k samples without label noises are used as in-context examples for ICL and training samples for ridge regression. Fig. 9 shows the experiment results. Under certain settings for the task prior Dµ,w, when the task prior has low task variances, ICL performs better than ridge regression with a fixed regularization parameter under small k. D. Transformer Performance in Approximating Bayesian Inference We examine if a Transformer network pretrained on samples generated from our pretraining data generative model matches the performance of Bayesian inference. We consider three factors of the task prior in our experiment: prior task noises, number of components, and feature dimension. 
For scalar y, we transform it to a d-dimensional vector [y, 0, . . . , 0]. Thus, Sk xk+1 forms a (2k + 1) d matrix, comprising xk+1 and k pairs of (xi, yi). Experiment Setting. We conduct experiments based on the module GPT2Model from the package Transformers supported by Hugging Face5. We use a 10-layer, 8-head Transformer decoder with 1024-dimensional feedforward layers, and the input dimension is set to d, equal to the dimension of x. We train the model over three epochs, each consisting of 10,000 batches, with every batch containing 256 samples. We use Adam W (Loshchilov & Hutter, 2019) as the optimizer with weight decay as 0.00001 and set the learning rate to 0.00001. Experiment Results. Fig. 10, 11, and 12 show the experimental results, where ˆF denotes the prediction of the Transformer network, F denotes the prediction of Bayesian inference, and y k+1 = xk+1, w is the label of learning the in-context function. In Fig. 10, we consider the tetrahedron setting (see Apendix B.1 for setting details) under varied task noises (δµ = δw {1/256, 1/64, 1/16, 1/4, 1}). In Fig. 11, we consider settings of regular shapes (see Appendix B.1 for setting details) with different numbers of vertices/components (M {4, 6, 8, 12, 20}). In Fig. 12, we consider settings with varied dimensions (see Appendix B.2 for setting details, d {2, 4, 8, 16, 32}). We observe that the trained Transformer network can approximate the Bayes-optimal predictor under varied settings, and the larger the number of dimensions and the number of mixture components, the harder it is for the Transformer network to approximate Bayesian prediction. E. Additional Information for Bounded Efficacy in GPT-4 E.1. Experimental Setting Table 3 introduces the experiment setting of GPT-4, including the system message, the prompt, the in-context task, the biased + task, and the addition (+) task. Designating the biased + task as the in-context task, i.e., ci = ai + bi + 1, we measure the performances on two goals, including learning the biased + task and retrieving the addition (+) task. 5https://huggingface.co/ Dual Operating Modes of In-Context Learning δµ = δw = 1 256 δµ = δw = 1 64 δµ = δw = 1 16 δµ = δw =1 4 ( ˆF y k+1)2 (F y k+1)2 δµ = δw =1 1 0 10 20 30 10 2 0 10 20 30 0 10 20 30 0 10 20 30 0 10 20 30 Number of In-Context Examples (k) Figure 10: Prior task noises. The figure shows the experiment results under varied noise levels. δµ and δw indicate the noise levels of the pretraining task prior. F indicates the prediction of Bayesian inference while ˆF indicates the prediction of the trained Transformer network. The results show that the trained Transformer network s performance can approach the performance of Bayesian inference. M = 20 M = 12 M = 8 M = 6 ( ˆF y k+1)2 (F y k+1)2 0 10 20 30 10 2 0 10 20 30 0 10 20 30 0 10 20 30 0 10 20 30 Number of In-Context Examples (k) Figure 11: Number of components. The figure shows the experiment results under varied component densities. M indicates the number of mixture components corresponding to different 3D regular polyhedrons described in Appendix B.1, and δµ = δw = 1 16. F indicates the prediction of Bayesian inference while ˆF indicates the prediction of the trained Transformer network. The higher the component density is, the harder it is for the Transformer network to approach Bayesian inference. 
Dual Operating Modes of In-Context Learning d = 32 d = 16 d = 8 d = 4 ( ˆF y k+1)2 (F y k+1)2 0 10 20 30 10 2 0 10 20 30 0 10 20 30 0 10 20 30 0 10 20 30 Number of In-Context Examples (k) Figure 12: Feature dimension. The figure shows the experiment results under varied dimensions. d indicates the dimension and the number of mixture components (see Appendix B.2 for setting details), and δµ = δw = 1 16. F indicates the prediction of Bayesian inference while ˆF indicates the prediction of the trained Transformer network. The higher the feature dimension is, the harder it is for the Transformer network to approach Bayesian inference. Table 3: Experiment setting to reveal the bounded efficacy phenomenon of biased-label ICL in GPT-4. Setting Desciption System Message You are a mathematician. Consider the following math problem and follow the exact instruction. You are given examples. Each example has two integers as input and one integer as output. Please provide an answer for the last problems in the math exercise: a1(?)b1=c1 ... ak(?)bk=c2 ak+1(?)bk+1= Provide your answer directly. In-Context Task ai and bi are uniformly sampled from [10, 99], and ci = ai + bi + 1. Goal of Learning the biased + Task with True Labels Aiming to learn the biased + task, a(?)b=(a+b+1), with in-context examples following the same biased + task, a(?)b=(a+b+1). Goal of Retrieving the addition (+) Task with Biased Labels Aiming to retrieve the addition (+) task, a(?)b=(a+b). However, the in-context examples are provided with a slightly different task biased + , a(?)b=(a+b+1). E.2. Additional Results This section collects four pairs of prompts and predictions for k = 0, 2, 8 in Tables 4, 5, and 6. The results show that ICL with biased labels will initially retrieve a commonsense pretraining task due to task retrieval, and finally learn the in-context task because of task learning. Dual Operating Modes of In-Context Learning Table 4: Zero in-context example, k = 0. Prediction is colored red if it is correct for task retrieval (a(?)b = (a + b)), and colored blue if it is correct for task learning (a(?)b = (a + b + 1)). ... denotes the hidden part of the prompt. Please refer to Table 3 for the whole prompt. Prompt ... 51(?)36= ... ... 27(?)15= ... ... 76(?)82= ... ... 55(?)15= ... Without knowing the operation or rule that connects the two input integers to the output integer in the examples, it s impossible to provide a correct answer. Please provide the examples or the rule. Sorry, but your questionis not clear. Could you please provide more information about the operation between the two numbers? Your question seems to be missing some information. Could you please provide the examples you mentioned? They are necessary to understand the relationship between the two input integers and the output integer. Table 5: Two in-context examples, k = 2. Prediction is colored red if it is correct for task retrieval (a(?)b = (a + b)), and colored blue if it is correct for task learning (a(?)b = (a + b + 1)). ... denotes the hidden part of the prompt. Please refer to Table 3 for the whole prompt. ... 73(?)80=154 59(?)22=82 54(?)97= ... ... 48(?)73=122 78(?)80=159 21(?)33= ... ... 21(?)28=50 69(?)29=99 47(?)10= ... ... 94(?)43=138 98(?)70=169 96(?)41= ... Results 151 54 57 187 Table 6: Eight in-context examples, k = 8. Prediction is colored red if it is correct for task retrieval (a(?)b = (a + b)), and colored blue if it is correct for task learning (a(?)b = (a + b + 1)). ... denotes the hidden part of the prompt. 
Please refer to Table 3 for the whole prompt. ... 37(?)70=108 41(?)18=60 19(?)12=32 82(?)67=150 42(?)13=56 26(?)41=68 80(?)39=120 58(?)23=82 40(?)90= ... ... 60(?)76=137 69(?)26=96 72(?)85=158 39(?)10=50 50(?)47=98 19(?)63=83 45(?)95=141 69(?)41=111 81(?)36= ... ... 66(?)40=107 46(?)81=128 63(?)31=95 41(?)24=66 70(?)43=114 89(?)84=174 76(?)82=159 46(?)28=75 49(?)46= ... ... 68(?)88=157 34(?)18=53 70(?)70=141 13(?)35=49 52(?)50=103 72(?)32=105 98(?)82=181 55(?)51=107 50(?)31= ... Results 130 118 96 82 F. Bounded Efficacy in Zero-shot ICL This section introduces the experiment setting of Fig. 6. We start by introducing the experiment results in Fig. 13 copied and pasted from the work of Min et al. (2022). While our theory shows the bounded efficacy phenomenon for ICL with non-informative labels (Lemma 6.2), Fig. 13 seems to imply a conflict phenomenon. Thus, we further extend the number of Dual Operating Modes of In-Context Learning Figure 13: Ablations on varying numbers of examples in the demonstrations (k). Models that are the best under 13B in each task category (Channel Meta ICL and Direct GPT-J, respectively) are used. 0 20 21 22 23 24 25 26 27 Classification Error 0 20 21 22 23 24 25 26 27 Llama 2 13B 0 20 21 22 23 24 25 26 27 Number of In-Context Examples (k) Mixtral 8x7B 0 20 21 22 23 24 25 26 27 Llama 2 70B 0 20 21 22 23 24 25 26 27 w/ True Labels w/ Random Labels Figure 14: As k increases, the classification error curve of ICL with random labels exhibits the bounded efficacy phenomenon. The curve with true labels further confirms that this phenomenon is not due to models tending to perform worse on long sequences. in-context examples in Fig. 13 left. The classification task adopts five datasets including (i) glue-mrpc (Dolan & Brockett, 2005), (ii) glue-rte (Dagan et al., 2005), (iii) tweet eval-hate (Barbieri et al., 2020), (iv) sick (Marelli et al., 2014), and (v) poem-sentiment (Sheng & Uthus, 2020). We use the Git Hub code6 released by Min et al. (2022) to generate the same data and evaluate LLMs with a larger context length capacity aiming at a larger number of in-context examples. We selected Mistral 7B (32768), Mixtral 8 7B (32768), Llama2 13B (4096), Llama2 70B (4096), and GPT-4 (8192) for our experiments, with the integers in parentheses indicating the maximum context length for each model. We perform inference on large models with 8 H100 with the package vllm7. G. The Derivation of Posterior This section provides detailed derivations for Lemma 4.1. We begin by showing the posterior is potentially still a Gaussian mixture in Sec. G.1. Then, in Sec. G.2, we show how Eq. 2 is proportion to Eq. 3, which is precisely a Gaussian mixture. G.1. Prior to Posterior We start by showing the posterior is potentially still a Gaussian mixture. For fixed Sk xk+1: P(µ, w|Sk xk+1) P(µ, w|Sk xk+1)P(Sk xk+1) = P(µ, w, Sk xk+1) = P(µ, w)P(Sk xk+1|µ, w) m=1 πm P(µ, w|Tm) P(Sk xk+1|µ, w) 6https://github.com/Alrope123/rethinking-demonstrations 7https://docs.vllm.ai/en/latest/ Dual Operating Modes of In-Context Learning m=1 πm P(µ, w|Tm)P(Sk xk+1|µ, w) (2) m=1 πm P(µ, w| e Tm). (3) We give the derivation from Eq. 2 to Eq. 3 in the next section. G.2. Closed-form Solution from Eq. 2 to Eq. 3 We analyze each component (indicated by a specific m) in Eq. 2. 
Given fixed Sk xk+1, for all m [M] and all (µ, w), we have: log(P(µ, w|Tm)P(Sk xk+1|µ, w)) 2σ2µ wm w 2 2σ2w Pk+1 i=1 µ xi 2 2σ2x Pk i=1 x i w yi 2 + log (2π) d/2 + log (2π) d/2 + (k + 1) log (2π) d/2 + k log (2π) 1/2 (Let C3 = log (2π) d/2 + log (2π) d/2 + (k + 1) log (2π) d/2 + k log (2π) 1/2 = C3 µm µ 2 2σ2µ wm w 2 2σ2w Pk+1 i=1 µ xi 2 2σ2x Pk i=1 x i w yi 2 = C3 ( µm µ 2 2σ2µ + Pk+1 i=1 µ xi 2 2σ2x ) ( wm w 2 2σ2w + Pk i=1 x i w yi 2 (Let δµ = σ2 µ σ2x and δw = σ2 w σ2y .) = C3 1 2σ2µ ( µm 2 2µ mµ + µ 2) + δµ (k + 1) µ 2 2µ k+1 X ( wm 2 2w mw + w 2) + δw i=1 w xix i w 2w k X = C3 1 2σ2µ µm 2 + (1 + (k + 1)δµ) µ 2 2µ µm + δµ wm 2 + w I + δw w 2w wm + δw (Let C4 = C3 δµ i=1 xi 2 δw i=1 y2 i .) = C4 1 2σ2µ µm 2 + (1 + (k + 1)δµ) µ 2 2µ µm + δµ wm 2 + w I + δw w 2w wm + δw (Let Σµ = I and Σw = Pk i=1 xix i k .) = C4 1 2σ2µ µm 2 + µ 2 I+(k+1)δµ Σµ 2µ µm + δµ Dual Operating Modes of In-Context Learning wm 2 + w 2 I+kδw Σw 2w wm + δw i=1 xi and w = Pk i=1 xiyi = C4 1 2σ2µ ( µm 2 + µ 2 I+(k+1)δµ Σµ 2µ (µm + (k + 1)δµ µ)) 1 2σ2w ( wm 2 + w 2 I+kδw Σw 2w (wm + kδw w)) (Let µ = (k + 1)δµ and w = kδw.) = C4 1 2σ2µ ( µm 2 + µ 2 I+ µ Σµ 2µ (µm + µ µ)) 1 2σ2w ( wm 2 + w 2 I+ w Σw 2w (wm + w w)) = C4 µm 2+ µ 2 I+ µ Σµ 2µ (µm+ µ µ)+ µm+ µ µ 2 (I+ µ Σµ) 1 µm+ µ µ 2 (I+ µ Σµ) 1 /2σ2 µ wm 2+ w 2 I+ w Σw 2w (wm+ w w)+ wm+ w w 2 (I+ w Σw) 1 wm+ w w 2 (I+ w Σw) 1 /2σ2 w = C4 1 2σ2µ µm 2 µm + µ µ 2 (I+ µ Σµ) 1 + µ (I + µ Σµ) 1(µm + µ µ) 2 I+ µ Σµ wm 2 wm + w w 2 (I+ w Σw) 1 + w (I + w Σw) 1(wm + w w) 2 I+ w Σw Notice C4 is independent to m, µ, and w, thus we have: P(µ, w|Tm)P(Sk xk+1|µ, w) µm 2 µm + µ µ 2 (I+ µ Σµ) 1 + µ (I + µ Σµ) 1(µm + µ µ) 2 I+ µ Σµ wm 2 wm + w w 2 (I+ w Σw) 1 + w (I + w Σw) 1(wm + w w) 2 I+ w Σw µm 2 µm + (k + 1)δµ µ 2 (I+(k+1)δµ Σµ) 1 2σ2µ | {z } cµ m wm 2 wm + kδw w 2 (I+kδw Σw) 1 2σ2w | {z } cw m N(µ|(I + (k + 1)δµ Σµ) 1(µm + (k + 1)δµ µ), σ2 µ(I + (k + 1)δµ Σµ) 1) N(w|(I + kδw Σw) 1(wm + kδw w), σ2 w(I + kδw Σw) 1). By defining P(µ, w| e T) = N(µ|(I + (k + 1)δµ Σµ) 1(µm + (k + 1)δµ µ), σ2 µ(I + (k + 1)δµ Σµ) 1) N(w|(I + kδw Σw) 1(wm + kδw w), σ2 w(I + kδw Σw) 1) and πm = πmcµ mcw m. We have: πm P(µ, w|Tm)P(Sk xk+1|µ, w) πm P(µ, w| e Tm). m=1 πm P(µ, w|Tm)P(Sk xk+1|µ, w) m=1 πm P(µ, w| e Tm). 
Dual Operating Modes of In-Context Learning 0 1 3 7 15 31 63 0 1 3 7 15 31 63 0 1 3 7 15 31 63 0 1 3 7 15 31 63 0 1 3 7 15 31 63 0 1 3 7 15 31 63 0 1 3 7 15 31 63 0 1 3 7 15 31 63 0 1 3 7 15 31 63 0 1 3 7 15 31 63 0 1 3 7 15 31 63 0.0 1.0 δµ = 1/9, δw = 1/9 π1 π2 π3 π4 0 1 3 7 15 31 63 0.0 1.0 δµ = 1/9, δw = 1/3 0 1 3 7 15 31 63 0.0 1.0 δµ = 1/9, δw = 1 0 1 3 7 15 31 63 0.0 1.0 δµ = 1/9, δw = 3 0 1 3 7 15 31 63 0.0 1.0 δµ = 1/9, δw = 9 0 1 3 7 15 31 63 0.0 1.0 δµ = 1/3, δw = 1/9 0 1 3 7 15 31 63 0.0 1.0 δµ = 1/3, δw = 1/3 0 1 3 7 15 31 63 0.0 1.0 δµ = 1/3, δw = 1 0 1 3 7 15 31 63 0.0 1.0 δµ = 1/3, δw = 3 0 1 3 7 15 31 63 0.0 1.0 δµ = 1/3, δw = 9 0 1 3 7 15 31 63 0.0 1.0 δµ = 1, δw = 1/9 0 1 3 7 15 31 63 0.0 1.0 δµ = 1, δw = 1/3 0 1 3 7 15 31 63 0.0 1.0 δµ = 1, δw = 1 0 1 3 7 15 31 63 0.0 1.0 δµ = 1, δw = 3 0 1 3 7 15 31 63 0.0 1.0 δµ = 1, δw = 9 0 1 3 7 15 31 63 0.0 1.0 δµ = 3, δw = 1/9 0 1 3 7 15 31 63 0.0 1.0 δµ = 3, δw = 1/3 0 1 3 7 15 31 63 0.0 1.0 δµ = 3, δw = 1 0 1 3 7 15 31 63 0.0 1.0 δµ = 3, δw = 3 0 1 3 7 15 31 63 0.0 1.0 δµ = 3, δw = 9 0 1 3 7 15 31 63 0.0 1.0 δµ = 9, δw = 1/9 0 1 3 7 15 31 63 0.0 1.0 δµ = 9, δw = 1/3 0 1 3 7 15 31 63 0.0 1.0 δµ = 9, δw = 1 0 1 3 7 15 31 63 0.0 1.0 δµ = 9, δw = 3 0 1 3 7 15 31 63 0.0 1.0 δµ = 9, δw = 9 Number of In-Context Examples (k) Figure 15: Numerical analysis on component re-weighting. The trends of Ψµ, Ψw, and πm for CR with increasing k under varying task noise parameters. H. Detailed Analysis of Component Shifting and Re-weighting H.1. Analysis of Component Re-weighting This section analyzes the CR effect on πβ as k increases. We focus on whether πα of e Tα surpasses πβ of any other e Tβ with β = α, where α is the index of the closest prior center to the in-context task as described in Assumption 3. We assess this via the ratio r(α, β) of πα to πβ: r(α, β) = πα πβ = παC0cµ αcw α πβC0cµ β cw β = πα πβ exp(Ψµ(α, β) + Ψw(α, β)), (4) where we define two functions Ψµ(α, β) = log(cµ α/cµ β ) and Ψw(α, β) = log(cw α /cw β ) to facilitate the analyses of how r(α, β) changes with increasing k. Analysis of Ψµ(α, β). We further simplify the function Ψµ(α, β) as follows: Ψµ(α, β) = ( i=1 µβ xi 2 i=1 µα xi 2)/(2σ2 x(1 + (k + 1)δµ)). (5) (See Appendix H.3.1 for derivation.) Since xi N(µ , τ 2 x I), choosing µ closer to µα tends to make Ψµ(α, β) positive and increase faster with increasing k. However, as k approaches infinity, Ψµ(α, β) stabilizes rather than increasing infinitely, Dual Operating Modes of In-Context Learning i.e., limk Ψµ(α, β) = ( µβ µ 2 µα µ 2)/(2σ2 µ). The leftmost column of Fig. 15 shows the numerical computation of Ψµ(α, β) with varied task noises under the tetrahedron setting (see Appendix B.1 for setting details). The smaller the value of δµ (= σ2 µ σ2x ) is, the easier for Ψµ(α, β) to increase as k increases. Meanwhile, we also have: lim σµ 0 Ψµ(α, β) = ( i=1 µβ xi 2 i=1 µα xi 2)/(2σ2 x) (6) Analysis of Ψw(α, β). We further simplify the function Ψw(α, β) as follows: Ψw(α, β) = ( wβ w 2 I (I+kδw Σw) 1 wα w 2 I (I+kδw Σw) 1)/(2σ2 w). (7) (See Appendix H.3.2 for derivation.) Since kδw Σw (= δw Pk i=1 xix i , see definition of Σw in Lemma 4.1) is semipositive definite, thus choosing w closer to wα tends to make Ψw(α, β) positive and increase faster as k increases. However, as k approaches infinity, limk kδw Σw = limk kδw Pk i=1 xix i k = kδw(µ µ +τ 2 x I). Thus, limk I (I + kδw Σw) 1 = I and Ψw(α, β) stabilizes rather than increasing infinitely, i.e., limk Ψw(α, β) = ( wβ w 2 wα w 2)/(2σ2 w). The topmost row of Fig. 
15 shows the numerical computation of Ψw(α, β) with varied task noises under the tetrahedron setting (see Appendix B.1 for setting details). The smaller the value of δw (= σ2 w σ2y ) is, the easier for Ψw(α, β) to increase as k increases. However, one should note that wβ w 2 wα w 2 does not necessarily imply wβ w 2 I (I+kδw Σw) 1 wα w 2 I (I+kδw Σw) 1. Meanwhile, we also have: lim σw 0 Ψw(α, β) = ( wβ w 2 kδw Σw wα w 2 kδw Σw)/(2σ2 w) = ( µβ xi 2 k Σw µα xi 2 k Σw)/(2σ2 y) i=1 yβ i y i 2 i=1 yα i y i 2)/(2σ2 y), (8) where yβ i = xi, wβ , yα i = xi, wα , and y i = xi, w . Therefore, combine Eqs. 6 and 8 and we have: lim σµ,σw 0 Ψµ(α, β) + Ψw(α, β) = µβ xk+1 2 µα xk+1 2 i=1 ( µβ xi 2 µα xi 2 2σ2x + yβ i y i 2 yα i y i 2 Numerical Computations of Component Re-weighting. We have seen how noises σµ and σw of the task prior affect the values of Ψµ and Ψw with increasing k. We further show the numerical computation of πβ in the center of Fig. 15. The figure shows that the smaller δµ and δw are, the larger Ψµ(α, β) and Ψw(α, β) will be with increasing k, and the easier for the mixture component e Tα to dominates in the posterior with an increasing number of in-context examples. H.2. Analysis of Component Shifting The Component Shifting effect in Lemma 4.1 involves shifting the variables µm and wm: µm = (I + (k + 1)δµ Σµ) 1(µm + (k + 1)δµ µ), (10) wm = (I + kδw Σw) 1(wm + kδw w). (11) The following analyses examine these two variables with increasing k. Dual Operating Modes of In-Context Learning 0 1 3 7 15 31 63 0 0 1 3 7 15 31 63 0 0 1 3 7 15 31 63 0 0 1 3 7 15 31 63 0 0 1 3 7 15 31 63 0 2 δµ = 9 µ1 µ 0 1 3 7 15 31 63 0 0 1 3 7 15 31 63 0 0 1 3 7 15 31 63 0 0 1 3 7 15 31 63 0 0 1 3 7 15 31 63 0 2 δw = 9 w1 w Number of In-Context Examples (k) Figure 16: Numerical computations of µm µ , wm w for Component Shifting (CS). Analysis of µm. We provide the derivation of µm in Eq. 10 (see Appendix H.4.1 for details): µm = (µm + kδµ µ)/(1 + (k + 1)δµ). (12) Thus, when k increases, µm moves close to the value of k and limk µm = µ . We also show the numerical computation of the distance between shifted µm and µ in the first row of Fig. 16. Analysis of wm. We provide the derivation of wm in Eq. 11 (see Appendix H.4.2 for details): wm = (I + kδw Σw) 1(wm w ) + w . (13) Notice when k , kδw Σw = kδw Pk i=1 xix i k kδw(τ 2 x I+w w ), thus λd(kδw Σw) , λ1((I+kδw Σw) 1) 0, limk (I + kδw Σw) 1(wm w ) limk λ1((I + kδw Σw) 1) wm w = 0 and limk wm = w , where λd(A) indicates the minimum eigenvalue of A. We also show the numerical computed distance between wm and w in the second row of Fig. 16. H.3. Derivation Collection of Ψµ(α, β) and Ψw(α, β) This section collects derivations for Ψµ(α, β) and Ψw(α, β). The derivation of Ψµ(α, β) is collected in Sec H.3.1 and the derivation of Ψw(α, β) is collected in Sec H.3.2. H.3.1. DERIVATION OF Ψµ(α, β) This section collects the derivation of Ψµ(α, β) in Eq. 5 of Sec. 
H.1: = log(cµ α/cµ β ) exp µβ 2 µβ+(k+1)δµ µ 2 (I+(k+1)δµ Σµ) 1 2σ2µ exp µα 2 µα+(k+1)δµ µ 2 (I+(k+1)δµ Σµ) 1 2σ2µ = (1 + (k + 1)δµ) µβ 2 µβ + δµ Pk+1 i=1 xi 2 2σ2µ(1 + (k + 1)δµ) (1 + (k + 1)δµ) µα 2 µα + δµ Pk+1 i=1 xi 2 2σ2µ(1 + (k + 1)δµ) = µβ + δµ Pk+1 i=1 xi 2 2σ2µ(1 + (k + 1)δµ) µα + δµ Pk+1 i=1 xi 2 2σ2µ(1 + (k + 1)δµ) = µβ 2 2µ β (δµ Pk+1 i=1 xi) δµ Pk+1 i=1 xi 2 2σ2µ(1 + (k + 1)δµ) µα 2 2µ α (δµ Pk+1 i=1 xi) δµ Pk+1 i=1 xi 2 2σ2µ(1 + (k + 1)δµ) Dual Operating Modes of In-Context Learning = (k + 1)δµ µβ 2 2µ β (δµ Pk+1 i=1 xi) + δµ Pk+1 i=1 xi 2 2σ2µ(1 + (k + 1)δµ) (k + 1)δµ µα 2 2µ α (δµ Pk+1 i=1 xi) + δµ Pk+1 i=1 xi 2 2σ2µ(1 + (k + 1)δµ) = Pk+1 i=1 δµ µβ xi 2 2σ2µ(1 + (k + 1)δµ) Pk+1 i=1 δµ µα xi 2 2σ2µ(1 + (k + 1)δµ) = Pk+1 i=1 µβ xi 2 Pk+1 i=1 µα xi 2 2σ2x(1 + (k + 1)δµ) . H.3.2. DERIVATION OF Ψw(α, β) This section collects the derivation of Ψw(α, β) in Eq. 7 of Sec. H.1: = log(cw α /cw β ) exp wα 2 wα+kδw w 2 (I+kδw Σw) 1 2σ2w exp wβ 2 wβ+kδw w 2 (I+kδw Σw) 1 2σ2w = wβ 2 wβ + kδw w 2 (I+kδw Σw) 1 2σ2w wα 2 wα + kδw w 2 (I+kδw Σw) 1 2σ2w (Note kδw w = δw i=1 xiyi = δw i=1 xix i w = kδw Σww .) = wβ 2 wβ + kδw Σww 2 (I+kδw Σw) 1 2σ2w wα wα + kδw Σww 2 (I+kδw Σw) 1 2σ2w = wβ 2 (wβ w ) + (I + kδw Σw)w 2 (I+kδw Σw) 1 2σ2w wα 2 (wα w ) + (I + kδw Σw)w 2 (I+kδw Σw) 1 2σ2w = wβ 2 wβ w 2 (I+kδw Σw) 1 2(wβ w ) w 2σ2w wα 2 wα w 2 (I+kδw Σw) 1 2(wα w ) w = wβ w 2 wβ w 2 (I+kδw Σw) 1 2σ2w wα w 2 wα w 2 (I+kδw Σw) 1 2σ2w = wβ w 2 I (I+kδw Σw) 1 wα w 2 I (I+kδw Σw) 1 2σ2w . H.4. Derivation Collection of µm and wm This section collects derivations for µm and wm. The derivation of µm is collected in Appendix H.4.1, and the derivation of wm is collected in Appendix H.4.2. H.4.1. DERIVATION OF µm This section collects the derivation of µm in Eq. 12 of Sec. H.1: µm = (I + (k + 1)δµ Σµ) 1(µm + (k + 1)δµ µ) = (I + (k + 1)δµI) 1(µm + δµ = µm + δµ Pk+1 i=1 xi 1 + (k + 1)δµ . Dual Operating Modes of In-Context Learning H.4.2. DERIVATION OF wm This section collects the derivation of wm in Eq. 13 of Sec. H.1: wm = (I + kδw Σw) 1(wm + kδw w) (Recall kδw w = δw i=1 xiyi = δw i=1 xix i w = kδw Σww .) = (I + kδw Σw) 1(wm + kδw Σww ) = (I + kδw Σw) 1(wm w + (I + kδw Σw)w ) = (I + kδw Σw) 1(wm w ) + w . (14) I. Additional Experiments for Early Ascent I.1. Early Ascent and Bounded Efficacy under Noisy Labels We further examine phenomena of early ascent and bounded efficacy with noisy labels under varied noise levels. The results show that these two phenomena are robust to label noises to some extend. d = 1 d = 3 Risk of ICL under Label Noise τy =0.0 Risk Upper Bound of ICL 0 23 27 211 215 0 Mixture Weight 0 23 27 211 215 Numbser of In-Context Examples (k) Mixture Weight of Component 1 (Misleading) Mixture Weight of Component 2 (Target) Mixture Weight of Component 3 0 23 27 211 215 (a) ICL risk under label noise level τy = 0.0. d = 1 d = 3 Risk of ICL under Label Noise τy =0.01 0 23 27 211 215 0 Mixture Weight 0 23 27 211 215 Numbser of In-Context Examples (k) Mixture Weight of Component 1 (Misleading) Mixture Weight of Component 2 (Target) Mixture Weight of Component 3 0 23 27 211 215 (b) ICL risk under label noise level τy = 0.01. d = 1 d = 3 Risk of ICL under Label Noise τy =0.1 0 23 27 211 215 0 Mixture Weight 0 23 27 211 215 Numbser of In-Context Examples (k) Mixture Weight of Component 1 (Misleading) Mixture Weight of Component 2 (Target) Mixture Weight of Component 3 0 23 27 211 215 (c) ICL risk under label noise level τy = 0.1. 
d = 1 d = 3 Risk of ICL under Label Noise τy =1.0 0 23 27 211 215 0 Mixture Weight 0 23 27 211 215 Numbser of In-Context Examples (k) Mixture Weight of Component 1 (Misleading) Mixture Weight of Component 2 (Target) Mixture Weight of Component 3 0 23 27 211 215 (d) ICL risk under label noise level τy = 1.0. Figure 17: Early ascent under varied label noises. Results show that the early ascent phenomenon maintains for noise level τy [0, 1.0]. Label noise level σy = 1.0 is used for pretraining. I.2. Early Ascent under Non-Linear Regression and Discrete Token Prediction This section uses Fig. 19 to show the existence of the early ascent phenomenon on non-linear regression and discrete token prediction with our designed distributions of pretraining and in-context samples. Fig. 19(a) shows that the early ascent Dual Operating Modes of In-Context Learning δµ = δw = 1 256 δµ = δw = 1 64 δµ = δw = 1 16 (F yα k+1)2 δµ = δw =1 4 δµ = δw =1 1 0 10 20 30 10 3 10 1 0 10 20 30 0 10 20 30 0 10 20 30 0 10 20 30 Number of In-Context Examples (k) Figure 18: Bounded efficacy under varied label noises. Results show that the bounded efficacy phenomenon maintains for noise level τy [0, 0.1]. Label noise level σy = 1.0 is used for pretraining. phenomenon exists when a 2-layer neural network with Tanh Activation function serves as the non-linear function, and Fig. 19(b) shows that the early ascent phenomenon exists when the dataset consists of sequences of tokens with discrete values rather than sequences of vectors with continuous values. For the details of experiments including our designed distributions of pretraining and in-context samples, please refer to Sec. I.2.1 for the experiment with non-linear regression and Sec. I.2.2 for the experiment with discrete token prediction. 0 1 3 7 15 31 Number of In-Context Examples (k) Mean Squared Error Early Ascent (Non-linear Regression) E[( ˆF y k+1)2] (a) Experiment under non-linear regressions. 0 1 3 7 15 31 Number of In-Context Examples (k) 0.0 Early Ascent (Discrete Token Prediction) E[1[F =y k+1]] E[1[ ˆF =y k+1]] (b) Experiment under discrete token prediction. Figure 19: ˆF indicates the prediction by a pretrained Transformer model and F indicates the prediction by numerical computation following a Bayes optimal predictor. While we cannot derive the optimal predictor under non-linear regression, we can derive the optimal predictor under discrete token prediction. I.2.1. EXPERIMENT DESIGN FOR NON-LINEAR REGRESSION The following assumption shows the data generation model to generate a non-linear sequence [x1, y1, . . . , x K, y K], where xi is a vector and yi is a scalar. The non-linear function mapping x to y is highlighted in red in the assumption. Dual Operating Modes of In-Context Learning Assumption 5 (Pretraining Data Generative Model for Non-linear Regression). (a) sample a task from the task distribution: (µ, W , v) Dprior, P(µ, W , v) = PM m=1 πm P(µ, W , v|Tm), where Tm is the mth mixture component, i.e., P(µ, W , v|Tm) = N(µ; µm, σ2 µI) 1 q (2π)d2σd2 W exp( W Wm 2 F 2 ) N(v; vm, σ2 v I), and πm is the mixture weight. 
N(x; µ, Σ) denotes the probability of x in the multivariate normal distribution with mean µ and covariance matrix Σ, F indicates the Frobenius norm, PM m=1 πm = 1, 0 < πm < 1, (µm, wm) is the center of the mixture component Tm, and all components share the same covariance matrix controlled by σµ, σW , and σv; (b) input variable distribution: within a sequence, i [K], xi Dx(µ), P(x|µ) = N(x|µ, σ2 x I); (c) label distribution: within a sequence, i [K], yi|xi Dy|xi(W , v), P(yi|xi, W , v) = N(yi| tanh(W xi), v , σ2 y), where tanh() is a Tanh Activation function; (d) x, µ, µm, v, vm Rd, and W , Wm Rd d. For experimental setting of Fig. 19(a), we set d = 2, σµ = 1, σW = σv = 0.5, σx = σy = 1, M = 2, π1 = 0.1, π2 = 0.9, µ1 = [1, 0] , µ2 = [0, 1] , W1 = 1 0 0 0 , W2 = 0 0 0 1 , and v1 = [1, 0] , v2 = [0, 1] . In-context samples follows task (µ , W , v ), where µ = µ1, W = W2, v = v2, and σy = 1. Notice that although we add label noise to in-context samples, when evaluating the prediction, we calculate error/loss based on the clean label. I.2.2. EXPERIMENT DESIGN FOR DISCRETE TOKEN PREDICTION The following assumption shows the data generation model to generate a non-linear sequence [x1, y1, . . . , x K, y K], where xi and yi are both integers (discrete tokens). Assumption 6 (Pretraining Data Generative Model for Discrete Token Prediction). (a) sample a task from the task distribution: (µ, w) Dprior, µ [M], w [M], P(µ, w) = PM m=1 πm P(µ, w|Tm), where Tm is the mth mixture component, i.e., P(µ, w|Tm) = 1[w=wm]((1 (M 1)σµ)1[µ=µm] + σµ1[µ =µm]), and πm is the mixture weight. (b) input variable distribution: within a sequence, i [K], xi Dx(µ), P(xi|µ) = (1 (M 1)σx)1[x=µ] + σx1[x =µ]; (c) label distribution: within a sequence, i [K], yi|xi Dy|xi(w), P(yi|xi, w) = (1 (M 1)σy)1[yi=xi+w mod M] + σy1[yi =xi+w mod M]. For experimental setting of Fig. 19(b), we set M = 6,π1 = 0.04, π3 = 0.481, π5 = 0.479, π2 = π4 = π6 = 0, σµ = 0.05, σx = 0.04, σy = 0.13, µ1 = w1 = 1, µ3 = w3 = 3, µ5 = w5 = 5. In-context samples follows task (µ , w ), where µ = µ1, w = w3, and σy = 0.13. Notice that although we add label noise to in-context samples, when evaluating the prediction, we calculate error/loss based on the clean label. J. Mathematical Derivation for Early Ascent We show that the early ascent phenomenon occurs under a specific setting in Sec. J.1. Then, we give formal theory with proof to show when early ascent happens in Sec. J.2. J.1. A Specific Setting of Early Ascent To have a cleaner mathematical understanding of this phenomenon, this section uses the setting of d = 1, the first row, in Table 2 to show the mathematical logic. (Some parameter settings are described in Table 2 s caption.) Following Theorem 5.1, the upper bound of ICL risk is as follows: ESk xk+1[L k] β=1 wβ w 2ESk xk+1[ πβ xk+1 2λ1(A)2] = w1 w 2ESk xk+1[ π1 xk+1 2λ1(A)2] + w2 w 2ESk xk+1[ π2 xk+1 2λ1(A)2] (Notice w2 = w , w1 w 2 = 22 = 4.) = 4ESk xk+1[ π1 xk+1 2λ1(A)2] (Notice π1 + π2 = 1.) Dual Operating Modes of In-Context Learning = 4ESk xk+1 π1 π1 + π2 xk+1 2λ1(A)2 π2 = r(1, 2) as Eq. 4.) = 4ESk xk+1 r(1, 2) 1 + r(1, 2) xk+1 2λ1(A)2 . Noticing δµ = 0.052 12 and δw = 0.052 22 are very small, when k is small, we have kδw 0 and λ1(A) = (I + δw Pk i=1 xix i ) 1 I, thus ESk xk+1 h r(1,2) 1+r(1,2) xk+1 2λ1(A)2i ESk xk+1 h r(1,2) 1+r(1,2) xk+1 2i and a larger r(1, 2) means a larger upper bound. In the following, we will examine whether the increase of k leads to the increase of r(1, 2). Following Eq. 
4: r(1, 2) = 1/2 1/2 exp(Ψµ(1, 2) + Ψw(1, 2)) = exp(Ψµ(1, 2) + Ψw(1, 2)). We first analyze Ψµ(1, 2), following Eq. 5: E[Ψµ(1, 2)] = E "Pk+1 i=1 µ2 xi 2 Pk+1 i=1 µ1 xi 2 2σ2x(1 + (k + 1)δµ) (Since δµ 0, thus when k is small, we have:) "Pk+1 i=1 µ2 xi 2 Pk+1 i=1 µ1 xi 2 2σ2x E µ2 x1 2 µ1 x1 2 2σ2x (E[ µ2 x1 2] E[ µ1 x1 2]) 2σ2x (E[ µ2 µ 2] + τ 2 x) (E[ µ1 µ 2] + τ 2 x) (µ is the same as µ1, but different from µ2.) 2σ2x (E[ µ2 µ 2] 0) = 2(k + 1). We then analyze Ψw(1, 2), following Eq. 7: E[Ψw(1, 2)] = E w1 w 2 I (I+kδw Σw) 1 2σ2w (Since δw 0, thus when k is small, we have:) E (w1 w ) kδw Σw(w1 w ) (Notice the feature dimension d = 1, Σw = Pk i=1 xi 2 " w1 w 2kδw Pk i=1 xi 2 Dual Operating Modes of In-Context Learning " 2 Pk i=1 xi 2 σ2y ( µ 2 + τ 2 x) 22 (1 + 1) = k. 4 2 0 2 4 k f(k) =exp(k + 2)/(1+exp(k + 2)) Figure 20: Illustration of the function exp(k+2)/(1+exp(k+2)) Therefore, when k is small, r(1, 2) = Ψµ(1, 2) + Ψw(1, 2) exp(k + 2), and the upper bound is approximately equal to: exp(k + 2) 1 + exp(k + 2) xk+1 2 , which increases as the number of in-context examples increases. J.2. Theorem of Early Ascent Theorem 5.2 (Early Ascent). Assume Ex1 h (F (x1) w , x1 )2i < Ex1 x1, wα w 2 , where α = 2σ2x + (wm w ) µ 2+dτ 2 x wm w 2 2σ2y . Then, when δµ and δw are small enough, we have the early as- cent phenomenon on the risk: k 1 s.t. Ex1 h (F (x1) w , x1 )2i < ESk xk+1 h (F (Sk xk+1) w , xk+1 )2i . Proof. We examine the following case, when σµ and σw are small enough, and k is also big enough to retrieve a task, i.e., making a center dominate: lim k lim (σµ,σw) (0,0) ESk xk+1 h (F (Sk xk+1) w , xk+1 )2i = lim k lim (σµ,σw) (0,0) ESk xk+1 m=1 πm A(wm w ), xk+1 = lim k lim (σµ,σw) (0,0) ESk xk+1 m=1 πm(wm w ), xk+1 = lim k lim (σµ,σw) (0,0) ESk xk+1 *PM m=1 πm exp(Ψµ(m, 1) + Ψw(m, 1))(wm w ) PM m=1 πm exp(Ψµ(m, 1) + Ψw(m, 1)) , xk+1 (Following Eq. 9, we have lim (σµ,σw) (0,0) Ψµ(m, 1) + Ψw(m, 1) = µm xk+1 2 µ1 xk+1 2 µm xi 2 µ1 xi 2 2σ2x + ym i y i 2 y1 i y i 2 = lim k ESk xk+1 *PM m=1 πm exp µm xk+1 2 2σ2x + Pk i=1( µm xi 2 2σ2x + ym i y i 2 2σ2y ) (wm w ) PM m=1 πm exp µm xk+1 2 2σ2x + Pk i=1( µm xi 2 2σ2x + ym i y i 2 2σ2y ) , xk+1 = ESk xk+1[ wα w , xk+1 2] = Ex1[ wα w , x1 2], where α = arg min m 2σ2 x + (wm w ) µ 2+dτ 2 x wm w 2 Dual Operating Modes of In-Context Learning K. Proof Tools This section introduces the inequalities used in our proofs for Theorems 5.1 (finegrained upper bound for ICL risk), 6.1 (upper bound for ICL with biased labels), C.1 (coarse upper bound for ICL risk) and Lemma 6.2 ((informal) upper bound for zero-shot ICL): K.1. Gaussian Tail Bound If Zi N(0, 1), then for t > 0 we have: K.2. Chi-squared Tail Bound If X χ(k), i.e., X = Pk i=1 Z2 i where Zi N(0, 1) then (Boucheron et al., 2013): k 1 > 2 t1 + 2t1 exp kt2 1 , exp kt2 1 . As a looser but symmetric bound, for any t > 0, we have: k 1 > t exp kt2 k 1 < t exp kt2 K.3. Norm Tail Bound If ϵi N(0, τ 2 x I), ϵi Rd, I Rd d, then for t > 0 we have: where indicates the L2 norm. Pk i=1 ϵi,j Pk i=1 ϵi,j τx (Notice ϵi,j N(0, τ 2 x) and let Zj = Pk i=1 ϵi,j τx k N(0, 1).) Pd i=1 Z2 i d . Dual Operating Modes of In-Context Learning Therefore, by applying Appendix K.2 we have: Pd i=1 Z2 i d > τ 2 xd k (1 + t) K.4. Eigenvalue Concentration Bound Lemma K.1. 
If i, xi N(µ, τ 2 x I), µ = 1, A = Pk i=1 xix i k , and ϵi = xi µ, we have t > 0: L λd(A) λ1(A) U and > 1 3 exp kt2 where L = τ 2 x(1 t 2 γ)2 2τxγ 1 + t, U = 1 + τ 2 x(1 + t 2 + γ)2 + 2τxγ 1 + t, λi(A) is the ith biggest eigenvalue of the matrix A and γ = q We begin with decomposing A to three components A = Pk i=1 xix i k = Pk i=1(µ+ϵi)(µ+ϵi) Pk i=1 ϵiϵ i k + Pk i=1(µϵ i +ϵiµ ) k , then consider the eigenvalue bound of each of them. For the first component µµ , we have: 0 λd(µµ ) < λ1(µµ ) 1. Then, we analyze the second component Pk i=1 ϵiϵ i k . Following Vershynin (2018, Theorem 4.6.1, p. 97), we have for any d k > s > 0: Pk i=1 ϵiϵ i k Pk i=1 ϵiϵ i k > 1 2 exp ks2 Finally, we examine the third component Pk i=1(µϵ i +ϵiµ ) k . We have for all a = 1: a Pk i=1(µϵ i + ϵiµ ) k a = 2 a Pk i=1 ϵi (Notice by Norm Tail Bound in Appendix K.3, we have P a Pk i=1(µϵ i + ϵiµ ) k a 2 > 1 exp kt2 d k (1 + t) λd Pk i=1(µϵ i + ϵiµ ) k Pk i=1(µϵ i + ϵiµ ) k d k (1 + t) > 1 exp kt2 d k, s = t/2, and summarize three components by union bound, we have: 1 + t λd(A) λ1(A) 1 + τ 2 x 2 + γ 2 + 2τxγ > 1 3 exp kt2 As a summary, we have: L λd(A) λ1(A) U and > 1 3 exp kt2 Dual Operating Modes of In-Context Learning where γ = q d k, L = τ 2 x(1 t 2 γ)2 2τxγ 1 + t, U = 1 + τ 2 x 1 + t 2 + γ 2 + 2τxγ 1 + t, and λi(A) is the ith biggest eigenvalue of the matrix A. L. ICL to Learn the In-Context Function This section introduces the proof of Theorem C.1 (coarse upper bound for ICL risk) and Theorem 5.1 (finegrained upper bound for ICL risk). The upper bound of Theorem 5.1 is derived at Eq. 15. Proof. Assuming we are using in-context examples following Assumption 3, i.e., xi N(µ , τ 2 x I), yi = xi, w , µ = w = 1, and we aim to have the prediction of Sk xk+1 to be xk+1, w , i.e., to learn the function (w ) of the in-context task (µ , w ). Let L k indicate the squared loss (F (Sk xk+1) xk+1, w )2, where F (Sk xk+1) is the prediction of Sk xk+1 by the Bayes-optimal next-token predictor F under Assumption 6 for pretraining data generation. We derive the upper bound of the expected squared loss as follows: ESk xk+1[L k] = ESk xk+1 h (F (Sk xk+1) w , xk+1 )2i (By Corollary 4.4.) m=1 πm wm, xk+1 w , xk+1 2# m=1 πm( wm w ), xk+1 (See Eq. 14 for the derivation of wm.) m=1 πm((I + kδw Σw) 1(wm w ) + w w ), xk+1 (Let A = (I + kδw Σw) 1, and notice A is symmetric positive definite.) m=1 πm A(wm w ), xk+1 β=1 πβa2 β, since E[a]2 E[a2].) m=1 πm A(wm w ), xk+1 2 m=1 ESk xk+1 πm((wm w ) Axk+1)2 m=1 ESk xk+1 πm wm w 2λ1(A)2 xk+1 2 m=1 wm w 2ESk xk+1 πm xk+1 2λ1(A)2 (15) m=1 πm xk+1 2λ1(A)2 = 4ESk xk+1 xk+1 2λ1(A)2 (Notice A is a random matrix only depends on x1, x2, . . . , xk, but not xk+1.) = 4Exk+1 xk+1 2 ESk λ2 1(A) = 4(1 + dτ 2 x)ESk λ2 1(A) . We further simplify ESk λ2 1(A) using Lemma K.1: ESk xk+1[L k] Dual Operating Modes of In-Context Learning 4(1 + dτ 2 x)ESk λ2 1(A) 4(1 + dτ 2 x)ESk Pk i=1 xix i k ) (By applying Lemma K.1 to Pk i=1 xix i k .) 4(1 + dτ 2 x)ESk " 1 1 + kδw L 4(1 + dτ 2 x) 1 1 + kδw(τ 2x(1 t 2 γ)2 2τxγ 1 + t) + 3 exp kt2 Let t = kδ 1 2 , where 1 2 > δ > 0 and δ is arbitrary small. We have: ESk xk+1[L k] < 4(1 + dτ 2 x) τ 4xδ2wk2 + O(kδ 5 We further validate our analysis with numerical computations in Fig. 21, including the trend of πm for m [M], Pk i=1 xix i k for j [d], λj I + δw Pk i=1 xix i for j [d], 1/ w w , 1/E[F (Sk xk+1) y k+1], and 1/E[(F (Sk xk+1) y k+1)2] as k increases. L.1. Case When In-context Input Variable Spans in Subspace In this section, we refine Eq. 
15 for the finegrained bound in Theorem 5.1. Specifically, we refine the following inequality for case when in-context input variable xi only spans in the subspace of Rd, resulting in λ1(A) = 1 constantly as mentioend in Theorem 5.1: m=1 ESk xk+1 πm((wm w ) Axk+1)2 m=1 ESk xk+1 πm wm w 2λ1(A)2 xk+1 2 , where A = (I + Pk i=1 xix i ) 1 is derived in Lemma 4.1. Violating Assumption 3(a), in this section we consider the case that xi N(µ, diag(1, . . . , 1 | {z } d , 0, . . . , 0)), where µ = [p, 0, . . . , 0 | {z } d 1 , q, 0, . . . , 0] . (If µ does not follows the format [p, 0, . . . , 0 | {z } d 1 , q, 0, . . . , 0] , we can always rotate the coordinates so µ has this format.) Therefore, we have matrix A (after rotation) with the following format: " Id d + Pk i=1 xi,1:d x i,1:d 0d (d d ) 0(d d ) d I(d d ) (d d ) , if q = 0 " I(d +1) (d +1) + Pk i=1 xi,1:(d +1)x i,1:(d +1) 0(d +1) (d d 1) 0(d d 1) (d +1) I(d d 1) (d d 1) where xi,1:d = [xi,1, xi,2, . . . , xi,d ] , Ia a indicates an identity matrix with shape a by a, and 0a b indicates a zero matrix with shape a by b. Finally, we can revise the upper bound for the case when xi only spans in a subspace of Rd using the new format of A as follows: Dual Operating Modes of In-Context Learning 1.0 δµ = δw = 1/9 1.0 δµ = δw = 1 1.0 δµ = δw = 9 π1 π2 π3 π4 20 λ0 of δw 2000 λ0 of I + δw P xkx k λ1 of I + δw P xkx k λ2 of I + δw P xkx k 1/E[ F y k+1 ] 0 21 42 63 84 105126 0 0 21 42 63 84 105126 0 0 21 42 63 84 105126 0 1/E[(F y k+1)2] Number of In-Context Examples (k) Figure 21: The numerical computation of the task learning. The second and third rows show the eigenvalues of the matrices Pk i=1 xix i k and I + δw Pk i=1 xix i . The fourth row shows the distance between the predicted w and w has a reciprocal decreasing rate with respect to k. The fifth and sixth rows indicate the expected squared loss follows a quadratic decreasing rate with respect to k. When q = 0, we have: XM m=1 ESk xk+1 πm((wm w ) Axk+1)2 m=1 ESk xk+1 h πm((wm w ) 1:d A1:d ,1:d xk+1,1:d + (wm w ) (d +1):d I(d d ) (d d )xk+1,(d +1):d)2i m=1 ESk xk+1 πm( (wm w )1:d 2λ1(A1:d ,1:d )2 xk+1,1:d 2 + (wm w )(d +1):d 2 xk+1,(d +1):d 2) , (Notice xk+1,(d +1):d 2 = 0) m=1 ESk xk+1 πm (wm w )1:d 2λ1(A1:d ,1:d )2 xk+1,1:d 2 , When q > 0, we skip the analysis since the analysis for q > 0 is the same as the analysis for q = 0. The only difference is that d for q > 0 is one bigger than d for q = 0. Dual Operating Modes of In-Context Learning 𝔼!! #!"#[ℒ$ 𝑃𝐂𝔼[ &'% (𝜋& *𝑤& 𝑤%, 𝑥$() 𝑃(𝐂)𝔼[(𝜋% *𝑤% 𝑤%, 𝑥$() , (𝜋& *𝑤& 𝑤%, 𝑥$() 16𝑟𝑀 1 𝐶$+- exp 𝑑. * exp 𝑢/* 𝜏#*𝑘 ||𝑤% 𝑤 ||* 1 + 𝑑𝜏# * min{1, 4𝑘*𝛿/ 𝑃𝐂𝔼 &'% (𝜋&||𝑥$() ||* 𝐂||𝑤 𝑤%||* 𝔼!! #!"#[ℒ$ 𝑃𝐂𝔼[ &'% (𝜋& *𝑤& 𝑤%, 𝑥$() 𝑃(𝐂)𝔼[(𝜋% *𝑤% 𝑤%, 𝑥$() *|𝐂] , (𝜋& *𝑤& 𝑤%, 𝑥$() * exp 𝑑.* + 4𝜏# 𝑑𝑘3) 2𝜎.* exp 𝑑/* 2𝜎/* + 𝑂𝑘3* ||𝑤% 𝑤 ||* 1 + 𝑑𝜏# Bounded Efficacy Asymptotic Bound Figure 22: Proof roadmap of ICL with biased labels, Theorem. 6.1. M. ICL with Biased Labels to Retrieve A Task This section details the proof of Theorem 6.1, with Fig.22 serving as a visual guide. The non-asymptotic bound for the bounded efficacy phenomenon and the asymptotic bound share the same foundational elements in the proof. However, they are different in handling the components marked in pink. Fig. 22 is thus provided to offer a clearer understanding of its overall framework and assist readers in navigating through the proof. In the following sections, Sec. M.1 introduces the non-asymptotic bound revealing the bounded efficacy phenomenon, and Sec. M.2 introduces the asymptotic bound. M.1. 
Non-Asymptotic Bound for the Bounded Efficacy Phenomenon This section proves the non-asymptotic bound in Theorem 6.1: Consider a next-token predictor attaining the optimal pretraining risk. When δµ and δw are sufficiently small, there exists a particular interval (refer to Sec.M.1.5 for the interval) for k such that ICL risk with biased labels is upper bounded by: ESk[Lα k] < C3 exp d2 µ 8σ2x + u2 wτ 2 x 8σ2y + 48(1 + dτ 2 x) exp + wα w 2(1 + dτ 2 x) min{1, 4k2δw 2(1 + τ 2 x)2}. where Lα k = (F(Sk xk+1) yα k+1)2 = (F(Sk xk+1) xk+1, wα )2 C3 is a constant depending on the prior setting, τx, and (µ , w ). With small k, the first and second terms dominate and exponential decay. With large k, the third term dominates and increases. Thus, the upper bound reveals a bounded efficacy phenomenon. Proof. Assuming we are using in-context examples following Assumptions 3 and 4, i.e., xi N(µ , τ 2 x I), yi = xi, w , µ = w = 1, and we aim to retrieve the function wα of the prior center (µα, wα) which is close to the in-context task. Let Lα k indicate the squared risk (F (Sk xk+1) xk+1, wα )2, where F (Sk xk+1) is the prediction of Sk xk+1 by the Bayes-optimal next-token predictor F . In order to have an upper bound on the risk, we consider xi N(µ , τ 2 x I) in two cases: (1) C: L < λd Pk i=1 xix i k λ1 Pk i=1 xix i k < U and γ(1 + t) (see Lemma K.1 for t, γ, L and U) and (2) C: at least one of the previous inequalities does not hold. Following Lemma K.1, the probability of C is bounded by: P( C) 3 exp( kt2 Dual Operating Modes of In-Context Learning We start our upper bound analysis on the expected squared risk by splitting the risk into three parts: ESk xk+1[Lα k] = ESk xk+1 (F (Sk xk+1) wα, xk+1 )2 (By Corollary 4.4.) β=1 πβ wβ, xk+1 wα, xk+1 2# β=1 πβ = 1.) β=1 πβ ( wβ, xk+1 wα, xk+1 ) 2# β=1 πβa2 β, since E[a]2 E[a2].) β=1 πβ( wβ, xk+1 wα, xk+1 )2 β=1 πβ wβ wα, xk+1 2 = P(C)ESk xk+1 β=1 πβ wβ wα, xk+1 2 C + P( C)ESk xk+1 β=1 πβ wβ wα, xk+1 2 C = P(C)ESk xk+1 h X β =α πβ wβ wα, xk+1 2 C i (Part A) + P(C)ESk xk+1[ πα wα wα, xk+1 2|C] (Part B) + P( C)ESk xk+1 β=1 πβ wβ wα, xk+1 2 C . (Part C) We will analyze three parts one by one in the following three sections respectively. M.1.1. BOUNDED EFFICACY - PART A Proof. We firstly analyze the term P(C)ESk xk+1[P β =α πβ wβ wα, xk+1 2|C], Part A: P(C)ESk xk+1 h X β =α πβ wβ wα, xk+1 2 C i < P(C)ESk xk+1 h X β =α πβ wβ wα 2 xk+1 2 C i (See Eq. 14 for the derivation of wβ.) = P(C)ESk xk+1 h X β =α πβ (I + kδw Σw) 1(wβ w ) + w wα 2 xk+1 2 C i (Let A = (I + kδw Σw) 1, and λ1(A) is the largest eigenvalue of matrix A.) = P(C)ESk xk+1 h X β =α πβ A(wβ w ) + w wα 2 xk+1 2 C i P(C)ESk xk+1 h X β =α πβ( A(wβ w ) + w wα )2 xk+1 2 C i (Notice wβ w 2.) P(C)ESk xk+1 h X β =α πβ xk+1 2(2λ1(A) + w wα )2 C i (Notice A = (I + kδw Σw) 1 and conditioned on C we have L < λd( Σw) < λ1( Σw) < U.) Dual Operating Modes of In-Context Learning P(C)ESk xk+1 h X β =α πβ xk+1 2 C i 2 1 + kδw L + w wα 2 (Notice w wα 2.) 16P(C)ESk xk+1 β =α πβ πα xk+1 2 C . (By applying Eqs. 4, 5, 7, and Assumption 2(e) on πβ < 16P(C)ESk xk+1 β =α r exp Pk+1 i=1 µβ xi 2 + Pk+1 i=1 µα xi 2 2σ2x(1 + (k + 1)δµ) exp wβ w 2 I (I+kδw Σw) 1 + wα w 2 I (I+kδw Σw) 1 2σ2w (In the first exponential term, by splitting Xk+1 i=1 and i = k + 1 :) < 16P(C)ESk xk+1 β =α r exp Pk i=1 µβ xi 2 + Pk i=1 µα xi 2 2σ2x(1 + (k + 1)δµ) | {z } Part A-1 exp wβ w 2 I (I+kδw Σw) 1 + wα w 2 I (I+kδw Σw) 1 2σ2w | {z } Part A-2 exp µβ xk+1 2 + µα xk+1 2 2σ2x(1 + (k + 1)δµ) | {z } Part A-3 (Note that x1, . . . 
, xk are dependent on C but xk+1 is not. Thus, we split them for further analysis.) In the following, we separately analyze the three terms, Part A-1, Part A-2, and Part A-3. The high-level idea is that, as k increases, due to the concentration of Part A-1 and Part A-2, they can be upper bounded by a function of k. Then, regarding Part A-1 and Part A-2 as constant values (their upper bounds), the expectation of Part A-3 can be upper bounded. Part A-1. We first deal with Part A-1. When conditioned on case C, we have: Pk i=1( µβ xi 2 + µα xi 2) 1 + (k + 1)δµ (Let xi = µ + ϵi) = k µα µ 2 µβ µ 2 + Pk i=1 2 µβ µα,ϵi k 1 + (k + 1)δµ = k µα µ 2 µβ µ 2 + D 2(µβ µα), 1 + (k + 1)δµ k µα µ 2 µβ µ 2 + 2 µβ µα 1 + (k + 1)δµ (Recall we have β [M], µβ µα 2, and in case C we have: < k µα µ 2 µβ µ 2 + 4τxγ 1 + t 1 + (k + 1)δµ . Dual Operating Modes of In-Context Learning Let t = k 1 4 . Recall in Assumption 4, we have β = α, µβ µ 2 µα µ 2 d2 µ. If δµ 1 s.t. Iµ = {k|(k + 1)δµ 1 and d2 µ 2 > 4τxγ p 4 } = , then when k Iµ we have: k µα µ 2 µβ µ 2 + 4τxγ 1 + t 1 + (k + 1)δµ < k µα µ 2 µβ µ 2 + d2 µ 2 2 = k d2 µ 4 . Part A-2. We then deal with Part A-2. When conditioned on case C, we have: wβ w 2 I (I+kδw Σw) 1 + wα w 2 I (I+kδw Σw) 1 (λ1(A) and λd(A) indicate the largest and smallest eigenvalues of the matrix A Rd d.) < wβ w 2λd(I (I + kδw Σw) 1) + wα w 2λ1(I (I + kδw Σw) 1) (Recall in case C we have: L < λd( Σw) < λ1( Σw) < U.) < wβ w 2 1 1 1 + kδw L + wα w 2 1 1 1 + kδw U = wβ w 2 kδw L 1 + kδw L + wα w 2 kδw U 1 + kδw U < wβ w 2 kδw L 1 + kδwτ 2x + wα w 2 kδw U 1 + kδwτ 2x Let t = k 1 4 . If δw 1 s.t. Iw = {k|kδwτ 2 x 1 and L wβ w 2 U wα w 2 > τ 2 xu2 w 2 } = , (note limk L wβ w 2 U wα w 2 = τ 2 x wβ w 2 (1 + τ 2 x) wα w 2 τ 2 xu2 w) then when k Iw, we have: wβ w 2 kδw L 1 + kδwτ 2x + wα w 2 kδw U 1 + kδwτ 2x < τ 2 xu2 w 2 kδw 1 + kδwτ 2x < kδw τ 2 xu2 w 4 . Part A-3. We finally deal with Part A-3. Part A-3 is independent to case C, and we have: P(C)ESk xk+1 exp µβ xk+1 2 + µα xk+1 2 2σ2x(1 + (k + 1)δµ) exp µβ xk+1 2 + µα xk+1 2 2σ2x(1 + (k + 1)δµ) (Let xk+1 = µ + ϵ.) exp µβ µ ϵ 2 + µα µ ϵ 2 2σ2x(1 + (k + 1)δµ) exp µβ µ 2 + µα µ 2 + 2(µβ µα), ϵ 2σ2x(1 + (k + 1)δµ) (Let µβ µ 2 + µα µ 2 = D, 2σ2 x(1 + (k + 1)δµ) = E, b = 2(µβ µα).) exp D + b ϵ (Notice xk+1 2 = µ + ϵ 2 2 µ 2 + 2 ϵ 2.) exp D + b ϵ (2 µ 2 + 2 ϵ 2) (Notice µ + ϵ 2 = 1.) exp D + b ϵ exp D + b ϵ exp τ 2 x b 2 exp D + b ϵ Dual Operating Modes of In-Context Learning exp τ 2 x b 2 1 + τ 2 x b 2 exp τ 2 x b 2 + (d 1)τ 2 x exp τ 2 x b 2 d + τ 2 x b 2 exp τ 2 x b 2 Summary of Part A. Thus, summarizing Part A-1, Part A-2, and Part A-3, we have: P(C)ESk xk+1 h X β =α πβ wβ wα, xk+1 2 C i < 16P(C)ESk xk+1 β =α r exp Pk i=1 µβ xi 2 + Pk i=1 µα xi 2 2σ2x(1 + (k + 1)δµ) | {z } Part A-1 exp wβ w 2 I (I+kδw Σw) 1 + wα w 2 I (I+kδw Σw) 1 2σ2w | {z } Part A-2 exp µβ xk+1 2 + µα xk+1 2 2σ2x(1 + (k + 1)δµ) | {z } Part A-3 < 16r(M 1)Ck=0 exp exp u2 wτ 2 xk 8σ2y = 16r(M 1)Ck=0 exp k( d2 µ 8σ2x + u2 wτ 2 x 8σ2y ) M.1.2. BOUNDED EFFICACY - PART B Proof. We then deal with the second term P(C)ESk xk+1[ πα wβ wα, xk+1 2|C], Part B: P(C)ESk xk+1[ πα wα wα, xk+1 2|C] P(C)ESk xk+1[ πα wα wα 2 xk+1 2|C] (See Eq. 14 for the derivation of wα.) = P(C)ESk xk+1[ πα (I + kδw Σw) 1(wα w ) + w wα 2 xk+1 2|C] = P(C)ESk xk+1[ πα (I (I + kδw Σw) 1)(w wα) 2 xk+1 2|C] (Let λ1(A) be the maximal eigenvalue of the matrix A.) wα w 2P(C)ESk xk+1[ παλ2 1(I (I + kδw Σw) 1) xk+1 2|C] (Recall that conditioned on C we have L < λd( Σw) < λ1( Σw) < U.) 
< wα w 2P(C)ESk xk+1 1 1 1 + kδw U = wα w 2P(C)ESk xk+1[ πα xk+1 2|C] 1 1 1 + kδw U < wα w 2Exk+1 xk+1 2 1 1 1 + kδw U = wα w 2(1 + dτ 2 x) 1 1 1 + kδw U Dual Operating Modes of In-Context Learning = wα w 2(1 + dτ 2 x) kδw U 1 + kδw U Let t = k 1 4 . if δw 1 s.t. IU = {k|U < 2(1 + τ 2 x)} = , then when k IU we have: wα w 2(1 + dτ 2 x) kδw U 1 + kδw U 2 < wα w 2(1 + dτ 2 x) min{1, 4k2δ2 w(1 + τ 2 x)2}. M.1.3. BOUNDED EFFICACY - PART C Proof. Finally, for the third term P( C)ESK[PM β=1 πβ wβ wα, xk+1 2| C], Part C: P( C)ESk xk+1 β=1 πβ wβ wα, xk+1 2 C P( C)ESk xk+1 β=1 πβ wβ wα 2 xk+1 2 C (See Eq. 14 for the derivation of wβ.) = P( C)ESk xk+1 β=1 πβ (I + kδw Σw) 1(wβ w ) + w wα 2 xk+1 2 C < P( C)ESk xk+1 β=1 πβ(2 (I + kδw Σw) 1(wβ w ) 2 + 2 w wα 2) xk+1 2 C < P( C)ESk xk+1 β=1 πβ 2 wβ w 2λ2 1 (I + kδw Σw) 1 + 2 w wα 2 xk+1 2 C < P( C)ESk xk+1 β=1 πβ(2 4 1 + 2 4) xk+1 2 C = 16P( C)ESk xk+1 β=1 πβ xk+1 2 C < 16P( C)Exk+1[ xk+1 2| C] (Notice C is defined on {x1, . . . , xk}) < 16P( C)Exk+1[ xk+1 2] < 16(1 + dτ 2 x)P( C) (Let t = k 1 < 48(1 + dτ 2 x) exp M.1.4. BOUNDED EFFICACY - SUMMARY Proof. Summarizing Part A, Part B, and Part C, we have: ESk xk+1[Lα k] < 16r(M 1)Ck=0 exp exp u2 wτ 2 xk 8σ2y + wα w 2(1 + dτ 2 x) min{1, 4k2δ2 w(1 + τ 2 x)2} + 48(1 + dτ 2 x) exp Dual Operating Modes of In-Context Learning d2 µ 8σ2x + u2 wτ 2 x 8σ2y + 48(1 + dτ 2 x) exp + wα w 2(1 + dτ 2 x) min{1, 4k2δ2 w(1 + τ 2 x)2}. M.1.5. THE PARTICULAR INTERVAL The particular interval for the non-asymptotic bound is the union of Iµ, Iw, and IU: δµ 1, 1 δwτ 2x } 4 ) < d2 µ 2 L wβ w 2 U wα w 2 > τ 2 xu2 w/2 U < 2(1 + τ 2 x). M.2. Asymptotic Bound This section proves the non-asymptotic bound in Theorem 6.1: Consider a next-token predictor attaining the optimal pretraining risk. As k , ICL risk with biased labels is upper bounded by: ESk[Lα k] < wα w 2(1 + dτ 2 x) + C1 k exp C2k 1 2 + O(k 2), where Lα k = (F(Sk xk+1) yα k+1)2 = (F(Sk xk+1) xk+1, wα )2, and C1 and C2 are constants depending on the prior setting, τx, and (µ , w ). The proof of the asymptotic bound is heavily overlapped with the proof of the non-asymptotic bound. We will hide the overlapped derivations with (. . .) . Proof. Assuming we are using in-context examples following Assumptions 3 and 4, i.e., xi N(µ , τ 2 x I), yi = xi, w , µ = w = 1, and we aim to retrieve the function wα of the prior center (µα, wα) which is close to the in-context task. Let Lα k indicate the squared risk (F (Sk xk+1) xk+1, wα )2, where F (Sk xk+1) is the prediction of Sk xk+1 by the Bayes-optimal next-token predictor F . In order to have an upper bound on the risk, we consider xi N(µ , τ 2 x I) in two cases: (1) C: L < λd Pk i=1 xix i k λ1 Pk i=1 xix i k < U and γ(1 + t) (see Lemma K.1 for t, γ, L and U) and (2) C: at least one of the previous inequalities does not hold. Following Lemma K.1, the probability of C is bounded by: P( C) 3 exp( kt2 We start our upper bound analysis on the expected squared risk by splitting the risk into three parts: ESk xk+1[Lα k] = P(C)ESk xk+1 h X β =α πβ wβ wα, xk+1 2 C i (Part A ) + P(C)ESk xk+1[ πα wα wα, xk+1 2|C] (Part B ) + P( C)ESk xk+1 β=1 πβ wβ wα, xk+1 2 C . (Part C ) We will analyze three parts one by one in the following three sections respectively. Dual Operating Modes of In-Context Learning M.2.1. ASYMPTOTIC BOUND - PART A Proof. We firstly analyze the term P(C)ESk xk+1[P β =α πβ wβ wα, xk+1 2|C], Part A : P(C)ESk xk+1 h X β =α πβ wβ wα, xk+1 2 C i < P(C)ESk xk+1 h X β =α πβ xk+1 2 C i 2 1 + kδw L + w wα 2 (Notice w wα 2.) 
P(C)ESk xk+1 β =α πβ πα xk+1 2 C 4 (1 + kδw L)2 + 8 1 + kδw L + P(C)ESk xk+1 h X β =α πβ xk+1 2 C i w wα 2. (17) Line 17 will be merged with Part B and analyzed in Sec. M.2.2. The current section will analyze the line 16. We start by analyzing the term P(C)ESk xk+1 h P β =α πβ πα xk+1 2 C i . By Eqs. 4, 5, 7, and Assumption 2(e) on πβ πα , we have: P(C)ESk xk+1 β =α πβ πα xk+1 2 C < P(C)ESk xk+1 β =α r exp Pk i=1 µβ xi 2 + Pk i=1 µα xi 2 2σ2x(1 + (k + 1)δµ) | {z } Part A -1 exp wβ w 2 I (I+kδw Σw) 1 + wα w 2 I (I+kδw Σw) 1 2σ2w | {z } Part A -2 exp µβ xk+1 2 + µα xk+1 2 2σ2x(1 + (k + 1)δµ) | {z } Part A -3 (Note that x1, . . . , xk are dependent on C but xk+1 is not. Thus, we break them for further analysis.) In the following, we separately analyze the three terms, Part A -1, Part A -2, and Part A -3. The high-level idea is that, as k increases, due to the concentration of Part A -1 and Part A -2, they can be upper bounded by a function of k. Then, regarding Part A -1 and Part A -2 as constant values (their upper bounds), the expectation of Part A -3 can be upper bounded. Part A -1. We first deal with Part A-1. When conditioned on case C, we have: Pk i=1( µβ xi 2 + µα xi 2) 1 + (k + 1)δµ (. . .) < k µα µ 2 µβ µ 2 + 4τxγ 1 + t 1 + (k + 1)δµ . With Assumption 4, we have d2 µ µβ µ 2 µα µ 2. With Lemma K.1, we have γ = q d k. Let t = kδ 1 2, we have: k µα µ 2 µβ µ 2 + 4τxγ 1 + t 1 + (k + 1)δµ = d2 µ δµ + 4τx 2 + O(k 1). Dual Operating Modes of In-Context Learning Part A -2. We then deal with Part A -2. When conditioned on case C, we have: wβ w 2 I (I+kδw Σw) 1 + wα w 2 I (I+kδw Σw) 1 < wβ w 2 1 1 1 + kδw L + wα w 2 1 1 1 + kδw U = ( wβ w 2 wα w 2) + wβ w 2 1 + kδw L wα w 2 With Assumption 4, we have d2 w wβ w 2 wα w 2. Lemma K.1 gives the definitions of L and U. Let t = kδ 1 2 and 0 < δ < 1 2, we have: = d2 w + wβ w 2 kδwτ 2x wα w 2 kδw(1 + τ 2x) < d2 w + wβ w 2 kδwτ 2x + O(k 2) < d2 w + 4 δwτ 2x k 1 + O(k 2). Part A -3. We finally deal with Part A -3. Part A -3 is independent to case C, and we have: P(C)ESk xk+1 exp µβ xk+1 2 + µα xk+1 2 2σ2x(1 + (k + 1)δµ) Summary of Part A . Thus, summarizing Part A -1, Part A -2, and Part A -3, we have: P(C)ESk xk+1 β =α πβ πα xk+1 2 C 4 (1 + kδw L)2 + 8 1 + kδw L < P(C)ESk xk+1 β =α r exp Pk i=1 µβ xi 2 + Pk i=1 µα xi 2 2σ2x(1 + (k + 1)δµ) | {z } Part A -1 exp wβ w 2 I (I+kδw Σw) 1 + wα w 2 I (I+kδw Σw) 1 2σ2w | {z } Part A -2 exp µβ xk+1 2 + µα xk+1 2 2σ2x(1 + (k + 1)δµ) | {z } Part A -3 4 (1 + kδw L)2 + 8 1 + kδw L (Notice lim k L = lim k τ 2 x 1 + t = τ 2 x.) d2 µ δµ + 4τx d2 w + 4 δwτ 2 x k 1 + O(k 2) 8 kδwτ 2x + O(k 2) = r(M 1)Ck=0 exp 2 + O(k 1) 2σ2µ d2 w + 4 δwτ 2 x k 1 + O(k 2) ! 8 kδwτ 2x + O(k 2) Dual Operating Modes of In-Context Learning = 8r(M 1)Ck=0 kδwτ 2x exp 2 + O(k 1) 2σ2µ d2 w + 4 δwτ 2 x k 1 + O(k 2) = 8r(M 1)Ck=0 kδwτ 2x exp exp d2 w 2σ2w M.2.2. ASYMPTOTIC BOUND - PART B Proof. We then deal with the second term P(C)ESk xk+1[ πα wβ wα, xk+1 2|C], Part B : P(C)ESk xk+1[ πα wα wα, xk+1 2|C] < wα w 2P(C)ESk xk+1[ πα xk+1 2|C] 1 1 1 + kδw U We add the line 17 in Sec. M.2.1 back: P(C)ESk xk+1[ πα( wα wα, xk+1 )2|C] + P(C)ESk xk+1 h X β =α πβ xk+1 2 C i w wα 2 | {z } line 17 in Sec. M.2.1 < wα w 2P(C)ESk xk+1[ πα xk+1 2|C] 1 1 1 + kδw U + P(C)ESk xk+1 h X β =α πβ xk+1 2 C i w wα 2 wα w 2P(C)ESk xk+1[ πα xk+1 2|C] + wα w 2P(C)ESk xk+1 h X β =α πβ xk+1 2 C i β=1 πβ = 1) = wα w 2P(C)ESk xk+1[ xk+1 2|C] < wα w 2Exk+1 xk+1 2 = wα w 2(1 + dτ 2 x) M.2.3. ASYMPTOTIC BOUND - PART C Proof. 
M.2.3. ASYMPTOTIC BOUND - PART C′

Proof. Finally, for the third term $P(\bar C)\,E_{S_k\oplus x_{k+1}}[\sum_{\beta=1}^M\tilde\pi_\beta\langle\tilde w_\beta - w_\alpha, x_{k+1}\rangle^2 \mid \bar C]$, Part C′:

$$P(\bar C)\,E_{S_k\oplus x_{k+1}}\Big[\sum_{\beta=1}^M\tilde\pi_\beta\langle\tilde w_\beta - w_\alpha, x_{k+1}\rangle^2 \,\Big|\, \bar C\Big] \ (\dots) \ < 16(1+d\tau_x^2)\,P(\bar C)$$

(let $t = k^{\delta-1/2}$ with $0 < \delta < 1/2$)

$$< 48(1+d\tau_x^2)\exp\big(-k^{2\delta}/c_1\big).$$

M.2.4. ASYMPTOTIC BOUND - SUMMARY

Proof. Summarizing Part A′, Part B′, and Part C′, we have:

$$E_{S_k\oplus x_{k+1}}[L_k^\alpha] < \frac{8r(M-1)C_{k=0}}{k\delta_w\tau_x^2}\exp\Big(-\frac{d_\mu^2}{2\sigma_\mu^2}\Big)\exp\Big(-\frac{d_w^2}{2\sigma_w^2}\Big)\exp\big(O(k^{-1/2})\big) + \|w_\alpha - w^*\|^2(1+d\tau_x^2) + 48(1+d\tau_x^2)\exp\big(-k^{2\delta}/c_1\big)$$

$$= \|w_\alpha - w^*\|^2(1+d\tau_x^2) + \frac{C_1}{k}\exp\big(C_2 k^{-1/2}\big) + O(k^{-2}).$$

N. Proof of Lemma 6.2

In this section, we prove Lemma 6.2. We first state the full version of the lemma:

Lemma 6.2 (Upper Bound for Zero-Shot ICL). Assume a next-token predictor attains the optimal pretraining risk, and Assumption 6 has only two components $\alpha$ and $\beta$, with centers $(\mu_\alpha, w_\alpha) = (-\mu_\beta, -w_\beta)$. When performing ICL with $x_i \sim N(\mu^*, \tau_x^2 I)$, assume $\|\mu^*\| = 1$ and $y_i = 0$, i.e., $y_i$ expresses the same preference for prior component $\alpha$ as for $\beta$. When $\delta_\mu$ and $\delta_w$ are sufficiently small, there is a particular interval of $k$ over which the ICL risk is upper bounded by:

$$E_{S_k}[L_k^\alpha] < C_4\exp\Big(-\frac{d_\mu^2 k}{8\sigma_x^2}\Big) + 12(1+d\tau_x^2)\exp\big(-\sqrt{k}/c_1\big) + (1+d\tau_x^2)\min\{1,\ 4k^2\delta_w^2(1+\tau_x^2)^2\},$$

where $L_k^\alpha = (F(S_k\oplus x_{k+1}) - y_{k+1}^\alpha)^2 = (F(S_k\oplus x_{k+1}) - \langle x_{k+1}, w_\alpha\rangle)^2$, and $C_4$ is a constant depending on the prior, $\tau_x$, and $(\mu^*, w^*)$. When $k$ is small, the first and second terms dominate and decay exponentially; when $k$ is large, the third term dominates and increases.

Proof. The proof techniques are very similar to those used in Sec. M.1. Assume the in-context examples follow $x_i \sim N(\mu^*, \tau_x^2 I)$ with $\|\mu^*\| = 1$ and $y_i = 0$, i.e., $w^* = 0$, and we aim to retrieve the function $w_\alpha$ of the prior center $(\mu_\alpha, w_\alpha)$ closest to the in-context task. Let $L_k^\alpha$ denote the squared loss $(F^*(S_k\oplus x_{k+1}) - \langle x_{k+1}, w_\alpha\rangle)^2$, where $F^*(S_k\oplus x_{k+1})$ is the prediction on $S_k\oplus x_{k+1}$ by the Bayes-optimal next-token predictor $F^*$.

To upper bound the loss, we consider $x_i \sim N(\mu^*, \tau_x^2 I)$ in two cases: (1) $C$: $L < \lambda_d\big(\frac{1}{k}\sum_{i=1}^k x_i x_i^\top\big) \le \lambda_1\big(\frac{1}{k}\sum_{i=1}^k x_i x_i^\top\big) < U$ and $\big\|\frac{1}{k}\sum_{i=1}^k x_i - \mu^*\big\| \le \tau_x\gamma\sqrt{1+t}$ (see Lemma K.1 for $t$, $\gamma$, $L$, and $U$); and (2) $\bar C$: at least one of the previous inequalities does not hold. Following Lemma K.1, the probability of $\bar C$ is bounded by $P(\bar C) \le 3\exp(-kt^2/c_1)$.

Similar to Sec. M.1, we split the expected squared loss into three parts:

$$E_{S_k\oplus x_{k+1}}[L_k^\alpha] \le \underbrace{P(C)\,E_{S_k\oplus x_{k+1}}\big[\tilde\pi_\beta\langle\tilde w_\beta - w_\alpha, x_{k+1}\rangle^2 \mid C\big]}_{\text{Part A}''} + \underbrace{P(C)\,E_{S_k\oplus x_{k+1}}\big[\tilde\pi_\alpha\langle\tilde w_\alpha - w_\alpha, x_{k+1}\rangle^2 \mid C\big]}_{\text{Part B}''} + \underbrace{P(\bar C)\,E_{S_k\oplus x_{k+1}}\Big[\sum_{\kappa\in\{\alpha,\beta\}}\tilde\pi_\kappa\langle\tilde w_\kappa - w_\alpha, x_{k+1}\rangle^2 \,\Big|\, \bar C\Big]}_{\text{Part C}''}.$$

We analyze the three parts one by one in the following sections.
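Part B″ below reuses the shrinkage-factor step of Sec. M.1.2: for any $U < 2(1+\tau_x^2)$, the factor $\big(\frac{k\delta_w U}{1+k\delta_w U}\big)^2$ is at most $\min\{1,\ 4k^2\delta_w^2(1+\tau_x^2)^2\}$, since $\frac{x}{1+x} \le \min\{1, x\}$. A small grid check of this elementary inequality (all parameter values arbitrary):

```python
import numpy as np

# Grid check: for U < 2 (1 + tau_x^2),
# (k dw U / (1 + k dw U))^2 <= min{1, 4 k^2 dw^2 (1 + tau_x^2)^2}.
tau_x, dw = 0.5, 1e-3
for k in [1, 10, 100, 1000, 10_000]:
    for U in np.linspace(0.01, 2 * (1 + tau_x**2) - 1e-9, 50):
        lhs = (k * dw * U / (1 + k * dw * U)) ** 2
        rhs = min(1.0, 4 * k**2 * dw**2 * (1 + tau_x**2) ** 2)
        assert lhs <= rhs + 1e-12
print("shrinkage-factor bound holds on the grid")
```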
N.1. Proof of Lemma 6.2: Part A″

Proof. We first analyze the term $P(C)E_{S_k\oplus x_{k+1}}[\tilde\pi_\beta\langle\tilde w_\beta - w_\alpha, x_{k+1}\rangle^2 \mid C]$, Part A″. Similar to Sec. M.1, we have:

$$P(C)\,E_{S_k\oplus x_{k+1}}\big[\tilde\pi_\beta\langle\tilde w_\beta - w_\alpha, x_{k+1}\rangle^2 \mid C\big] < P(C)\,E_{S_k\oplus x_{k+1}}\Big[\frac{\tilde\pi_\beta}{\tilde\pi_\alpha}\|x_{k+1}\|^2 \,\Big|\, C\Big]\Big(\frac{2}{1+k\delta_w L} + \|w^* - w_\alpha\|\Big)^2$$

$$< P(C)\,E_{S_k\oplus x_{k+1}}\Big[r\exp\Big(\frac{-\sum_{i=1}^k\|\mu_\beta - x_i\|^2 + \sum_{i=1}^k\|\mu_\alpha - x_i\|^2}{2\sigma_x^2(1+(k+1)\delta_\mu)}\Big)\exp\Big(\frac{-\|w_\beta - w^*\|^2_{I-(I+k\delta_w\Sigma_w)^{-1}} + \|w_\alpha - w^*\|^2_{I-(I+k\delta_w\Sigma_w)^{-1}}}{2\sigma_w^2}\Big)\exp\Big(\frac{-\|\mu_\beta - x_{k+1}\|^2 + \|\mu_\alpha - x_{k+1}\|^2}{2\sigma_x^2(1+(k+1)\delta_\mu)}\Big)\|x_{k+1}\|^2 \,\Big|\, C\Big]\Big(\frac{2}{1+k\delta_w L} + \|w^* - w_\alpha\|\Big)^2$$

(notice $w^* = 0$ and $\|w_\beta\| = \|w_\alpha\|$, so the middle exponential equals 1, and $\|w^* - w_\alpha\| = 1$)

$$= r\,P(C)\,E_{S_k\oplus x_{k+1}}\Big[\exp\Big(\frac{-\sum_{i=1}^k\|\mu_\beta - x_i\|^2 + \sum_{i=1}^k\|\mu_\alpha - x_i\|^2}{2\sigma_x^2(1+(k+1)\delta_\mu)}\Big)\exp\Big(\frac{-\|\mu_\beta - x_{k+1}\|^2 + \|\mu_\alpha - x_{k+1}\|^2}{2\sigma_x^2(1+(k+1)\delta_\mu)}\Big)\|x_{k+1}\|^2 \,\Big|\, C\Big]\Big(\frac{2}{1+k\delta_w L} + 1\Big)^2$$

$$\le 9r\,P(C)\,E_{S_k\oplus x_{k+1}}\Big[\underbrace{\exp\Big(\frac{-\sum_{i=1}^k\|\mu_\beta - x_i\|^2 + \sum_{i=1}^k\|\mu_\alpha - x_i\|^2}{2\sigma_x^2(1+(k+1)\delta_\mu)}\Big)}_{\text{Part A}''\text{-1}}\ \underbrace{\exp\Big(\frac{-\|\mu_\beta - x_{k+1}\|^2 + \|\mu_\alpha - x_{k+1}\|^2}{2\sigma_x^2(1+(k+1)\delta_\mu)}\Big)\|x_{k+1}\|^2}_{\text{Part A}''\text{-3}} \,\Big|\, C\Big].$$

Same as Sec. M.1.1, when conditioned on case $C$, for Part A″-1 we have:

$$\frac{\sum_{i=1}^k\big(-\|\mu_\beta - x_i\|^2 + \|\mu_\alpha - x_i\|^2\big)}{1+(k+1)\delta_\mu} < \frac{k\big(\|\mu_\alpha - \mu^*\|^2 - \|\mu_\beta - \mu^*\|^2 + 4\tau_x\gamma\sqrt{1+t}\big)}{1+(k+1)\delta_\mu}.$$

Let $t = k^{-1/4}$. Recall that by Assumption 4, for $\beta \neq \alpha$ we have $\|\mu_\beta - \mu^*\|^2 - \|\mu_\alpha - \mu^*\|^2 \ge d_\mu^2$. If $\delta_\mu \ll 1$ such that $I_\mu = \{k \mid (k+1)\delta_\mu \le 1 \text{ and } d_\mu^2/2 > 4\tau_x\gamma\sqrt{1+t}\} \neq \emptyset$, then when $k \in I_\mu$ we have:

$$\frac{k\big(\|\mu_\alpha - \mu^*\|^2 - \|\mu_\beta - \mu^*\|^2 + 4\tau_x\gamma\sqrt{1+t}\big)}{1+(k+1)\delta_\mu} < -\frac{d_\mu^2 k}{4}.$$

Same as Sec. M.1.1, when conditioned on case $C$, for Part A″-3 we have:

$$P(C)\,E_{S_k\oplus x_{k+1}}\Big[\exp\Big(\frac{-\|\mu_\beta - x_{k+1}\|^2 + \|\mu_\alpha - x_{k+1}\|^2}{2\sigma_x^2(1+(k+1)\delta_\mu)}\Big)\|x_{k+1}\|^2 \,\Big|\, C\Big] = C_{k=0}.$$

As a summary of the above analysis, we have:

$$P(C)\,E_{S_k\oplus x_{k+1}}\big[\tilde\pi_\beta\langle\tilde w_\beta - w_\alpha, x_{k+1}\rangle^2 \mid C\big] < 9r\,C_{k=0}\exp\Big(-\frac{d_\mu^2 k}{8\sigma_x^2}\Big).$$

N.2. Proof of Lemma 6.2: Part B″

Proof. We then deal with the second term $P(C)E_{S_k\oplus x_{k+1}}[\tilde\pi_\alpha\langle\tilde w_\alpha - w_\alpha, x_{k+1}\rangle^2 \mid C]$, Part B″. The analysis is exactly the same as Sec. M.1.2, and we have:

$$P(C)\,E_{S_k\oplus x_{k+1}}\big[\tilde\pi_\alpha\langle\tilde w_\alpha - w_\alpha, x_{k+1}\rangle^2 \mid C\big] < \|w_\alpha - w^*\|^2(1+d\tau_x^2)\Big(\frac{k\delta_w U}{1+k\delta_w U}\Big)^2.$$

Let $t = k^{-1/4}$. If $\delta_w \ll 1$ such that $I_U = \{k \mid U < 2(1+\tau_x^2)\} \neq \emptyset$, then when $k \in I_U$ we have:

$$\|w_\alpha - w^*\|^2(1+d\tau_x^2)\Big(\frac{k\delta_w U}{1+k\delta_w U}\Big)^2 < \|w_\alpha - w^*\|^2(1+d\tau_x^2)\min\{1,\ 4k^2\delta_w^2(1+\tau_x^2)^2\}.$$

N.3. Proof of Lemma 6.2: Part C″

Proof. Finally, for the third term $P(\bar C)E_{S_k\oplus x_{k+1}}[\sum_{\kappa\in\{\alpha,\beta\}}\tilde\pi_\kappa\langle\tilde w_\kappa - w_\alpha, x_{k+1}\rangle^2 \mid \bar C]$, Part C″. Similar to Sec. M.1.3, we have:

$$P(\bar C)\,E_{S_k\oplus x_{k+1}}\Big[\sum_{\kappa\in\{\alpha,\beta\}}\tilde\pi_\kappa\langle\tilde w_\kappa - w_\alpha, x_{k+1}\rangle^2 \,\Big|\, \bar C\Big] < P(\bar C)\,E_{S_k\oplus x_{k+1}}\Big[\sum_{\kappa\in\{\alpha,\beta\}}\tilde\pi_\kappa\Big(2\big\|(I+k\delta_w\Sigma_w)^{-1}(w_\kappa - w^*)\big\|^2 + 2\|w^* - w_\alpha\|^2\Big)\|x_{k+1}\|^2 \,\Big|\, \bar C\Big]$$

(recall $w^* = 0$, so $\|w_\kappa - w^*\|^2 = \|w^* - w_\alpha\|^2 = 1$)

$$< P(\bar C)\,E_{S_k\oplus x_{k+1}}\Big[\sum_{\kappa\in\{\alpha,\beta\}}\tilde\pi_\kappa(2\cdot 1\cdot 1 + 2\cdot 1)\|x_{k+1}\|^2 \,\Big|\, \bar C\Big] = 4\,P(\bar C)\,E_{S_k\oplus x_{k+1}}\Big[\sum_{\kappa\in\{\alpha,\beta\}}\tilde\pi_\kappa\|x_{k+1}\|^2 \,\Big|\, \bar C\Big]$$

$$\le 4\,P(\bar C)\,E_{x_{k+1}}\big[\|x_{k+1}\|^2 \mid \bar C\big] \quad (\text{notice } \bar C \text{ is defined on } \{x_1,\dots,x_k\})$$

$$= 4\,P(\bar C)\,E_{x_{k+1}}\big[\|x_{k+1}\|^2\big] = 4(1+d\tau_x^2)\,P(\bar C)$$

(let $t = k^{-1/4}$)

$$< 12(1+d\tau_x^2)\exp\big(-\sqrt{k}/c_1\big).$$

N.4. Proof of Lemma 6.2: Summary

Proof. Summarizing Part A″, Part B″, and Part C″, we have:

$$E_{S_k\oplus x_{k+1}}[L_k^\alpha] < 9r\,C_{k=0}\exp\Big(-\frac{d_\mu^2 k}{8\sigma_x^2}\Big) + \|w_\alpha - w^*\|^2(1+d\tau_x^2)\min\{1,\ 4k^2\delta_w^2(1+\tau_x^2)^2\} + 12(1+d\tau_x^2)\exp\big(-\sqrt{k}/c_1\big)$$

$$= 9r\,C_{k=0}\exp\Big(-\frac{d_\mu^2 k}{8\sigma_x^2}\Big) + 12(1+d\tau_x^2)\exp\big(-\sqrt{k}/c_1\big) + (1+d\tau_x^2)\min\{1,\ 4k^2\delta_w^2(1+\tau_x^2)^2\},$$

where the last step uses $w^* = 0$ and $\|w_\alpha\| = 1$, so that $\|w_\alpha - w^*\|^2 = 1$.

N.5. The Particular Interval

The particular interval for the risk bound revealing bounded efficacy is the intersection of $I_\mu$ and $I_U$ (with $t = k^{-1/4}$):

$$I_\mu = \big\{k \mid (k+1)\delta_\mu \le 1 \text{ and } 4\tau_x\gamma\sqrt{1+t} < d_\mu^2/2\big\}, \qquad I_U = \big\{k \mid U < 2(1+\tau_x^2)\big\}.$$
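To see the bounded-efficacy shape that Lemma 6.2 predicts, one can tabulate the three terms of the bound against $k$. In the minimal sketch below, $C_4$ and the two exponential rates $a$ and $b$ are illustrative placeholders (the true constants depend on the prior setting); only the qualitative shape matters.

```python
import numpy as np

# Sketch of the Lemma 6.2 bound's shape. C4, a, b, and dw are arbitrary
# placeholders; the true constants depend on the prior setting.
d, tau_x, dw = 8, 0.1, 1e-3
C4, a, b = 5.0, 0.5, 0.2
cap = 1 + d * tau_x**2
for k in [1, 4, 16, 64, 256, 512, 1024]:
    t1 = C4 * np.exp(-a * k)                                   # Part A'': wrong-component weight
    t2 = 12 * cap * np.exp(-b * np.sqrt(k))                    # Part C'': concentration failure
    t3 = cap * min(1.0, 4 * k**2 * dw**2 * (1 + tau_x**2)**2)  # Part B'': zero-label bias
    print(f"k={k:4d}  bound ~ {t1 + t2 + t3:.3f}")
# The exponential terms decay first, so the bound drops; once the min{.}
# term takes over, the bound rises again and saturates near 1 + d*tau_x^2:
# with random (zero) labels, only a limited number of in-context examples helps.
```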
O. Toy Example for Component Shifting and Component Re-weighting

We study how in-context examples affect the prediction of ICL by a pretrained Bayes-optimal next-token predictor, and how the pretraining distribution shapes this effect. Assume the next-token predictor $f$ is pretrained on a data distribution to yield the risk minimizer $f^*$, and the pretrained $f^*$ is then used to predict the next token $y$ following a token $x$. Instead of direct inference via $f^*(x)$, we consider inference with $k$ additional in-context examples $\{x_i\}_{i=1}^k$ via the format $f^*([x_1, \dots, x_k, x])$. We aim to theoretically examine the effect of the in-context examples $\{x_i\}_{i=1}^k$ on the prediction $f^*([x_1, \dots, x_k, x])$. While the formal problem setting involves more verbose mathematics, this demo section illustrates the basic phenomenon underlying our work.

The demo subsections are organized as follows. We first introduce the problem setting in Sec. O.1. We then connect ICL with Bayesian inference in Sec. O.2. Next, we introduce the assumptions on the pretraining dataset in Sec. O.3. Finally, we derive a closed-form posterior and introduce two phenomena, Component Shifting and Component Re-weighting, in Sec. O.4.

O.1. Toy Example: Pretraining Data Generative Model

ICL involves two important components: the pretraining dataset, and a next-token predictor supporting inputs of varied lengths. We assume the next-token predictor $f: \bigcup_{k\in\{0,\dots,K-1\}}\mathbb{R}^{k\times 1} \to \mathbb{R}^{1\times 1}$ can fit the pretraining distribution exactly given enough data and expressivity. To generate a training sample, we first sample a task $\mu$ from the underlying task distribution $D_\mu$, and then generate the tokens of the sequence from a distribution $D_x(\mu)$ conditioned on the task $\mu$. The sample generation process is described as follows:

Assumption 7 (Demo: Pretraining Data Generative Model). Given a task prior distribution $D_\mu$ and a conditional token sampler $D_x(\mu)$ conditioned on the task $\mu$, the process of generating a sequence $S_K = [x_1, x_2, \dots, x_K]$ of length $K$ is as follows:
(a) Sample a task $\mu$ from the task prior: $\mu \sim D_\mu$; the probability of $\mu$ is denoted by $P(\mu)$;
(b) Sample $K$ tokens, each denoted by $x_i$, from the chosen task: for $i \in \{1, 2, \dots, K\}$, $x_i \sim D_x(\mu)$; the probability of $x_i = x$ is denoted by $P(x\mid\mu)$;
(c) Define the sequences $S_k$: for the full (capital) length $K$, $S_K = [x_1, \dots, x_K]$; and for lowercase $k$, the sequence of the first $k$ demonstrations of $S_K$ is denoted by $S_k = [x_1, \dots, x_k]$, e.g., $S_2 = [x_1, x_2]$.

The generation process relates to real-world scenarios in two ways: (i) regarding sampling step 7(a), an LM is pretrained on varied tasks; (ii) regarding sampling step 7(b), when a person/agent produces text for one task, the generated text can be noisy. For instance, given a task such as describing a football game, a person has multiple ways to describe it.

O.2. Toy Example: Bayes-Optimal Next-Token Predictor

Now we consider training $f(\cdot)$ on samples $S_K$ generated via the generation process of Assumption 7:

$$\mathcal{L}(f) = E_{S_K}\Big[\sum_{k=0}^{K-1}\big(f(S_k) - x_{k+1}\big)^2\Big] = E_{\mu\sim D_\mu}\Big[E_{x_i\sim D_x(\mu),\ i\in\{1,\dots,K\}}\Big[\sum_{k=0}^{K-1}\big(f(S_k) - x_{k+1}\big)^2 \,\Big|\, \mu\Big]\Big].$$

$f$ can be viewed as $K$ separate models $f_0, \dots, f_{K-1}$, where $f_k$ takes a sequence of $k$ tokens as input. Therefore, when the model $f$ has enough expressivity, the optimization problem $f^* = \operatorname{argmin}_f \mathcal{L}(f)$ can be regarded as $K$ separate optimization problems:

$$f_k^* = \operatorname*{argmin}_{f_k}\ E_{S_K}\big[(f_k(S_k) - x_{k+1})^2\big], \quad \forall k \in \{0, \dots, K-1\}.$$

[Figure 23: two density plots. Left panel: the prior probability density function, a Gaussian mixture $0.45\,N(-3,1) + 0.53\,N(0,1) + 0.02\,N(4,1)$. Right panel: the posterior probability density function given the observations, a Gaussian mixture $0.00\,N(2.25,0.5) + 0.06\,N(3.0,0.5) + 0.94\,N(4.0,0.5)$.]

Figure 23: The left part of the figure shows that the next-token predictor is pretrained on the task prior distribution of Assumption 8, so without in-context examples its prediction is based on the prior. The right part shows that with in-context examples, the prediction is based on the posterior, treating the in-context examples as observed samples.

Thus, the solution $f_k^*$ for each $k$ is a minimum mean square error (MMSE) estimator (Van Trees, 2004, page 63), and the prediction $f^*(S_k)$ satisfies:

$$f^*(S_k) = E_{S_K}[x_{k+1}\mid S_k] = E_{\mu\sim D_\mu}\big[E_{x_i\sim D_x(\mu),\ i\in\{1,\dots,K\}}[x_{k+1}\mid \mu, S_k]\,\big|\, S_k\big] = E_{\mu\sim D_\mu}\big[E_{x_{k+1}\sim D_x(\mu)}[x_{k+1}\mid\mu]\,\big|\, S_k\big]. \quad (18)$$

The prediction $f^*(S_k)$ is the expectation of $E_{x_{k+1}\sim D_x(\mu)}[x_{k+1}\mid\mu]$ under the task posterior given the observed $S_k$.
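As a concrete instance of Eq. 18, the sketch below evaluates the Bayes-optimal prediction for a discretized version of the toy prior: with finitely many task centers, $f^*(S_k)$ is simply the posterior-weighted average of the per-task means $E[x\mid\mu] = \mu$. The task centers and weights are borrowed from the prior shown in Fig. 23; the discretization (point masses instead of Gaussian components over $\mu$) and the helper name f_star are our simplifications.

```python
import numpy as np

# Bayes-optimal next-token prediction (Eq. 18) for a discrete task prior.
# Centers/weights mimic the prior of Fig. 23; the discretization is ours.
mus = np.array([-3.0, 0.0, 4.0])      # task centers mu
prior = np.array([0.45, 0.53, 0.02])  # prior task weights
tau = 1.0                             # token noise: x ~ N(mu, tau^2)

def f_star(s_k):
    # Posterior over tasks given the observed tokens, then posterior mean of mu.
    ll = np.array([-0.5 * np.sum((s_k - m) ** 2) / tau**2 for m in mus])
    post = prior * np.exp(ll - ll.max())
    post /= post.sum()
    return post @ mus                 # E[mu | S_k] = prediction of x_{k+1}

print(f_star(np.array([])))           # no context: prior mean = -1.27
print(f_star(np.array([3.4, 4.2])))   # two tokens near 4 retrieve the rare task
```

Even two in-context tokens near 4 move the prediction from the prior mean (-1.27) to roughly 4: the rare component is retrieved once the evidence outweighs its small prior weight.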
O.3. Toy Example: Gaussian Assumptions on the Pretraining Data Generative Model

In Sec. O.2, we connected ICL with Bayesian inference, and Eq. 18 shows that the prediction $f^*(S_k)$ depends on the posterior. We are interested in how the in-context examples affect the prediction and the posterior. We make the following assumptions on the pretraining dataset to obtain a closed-form expression of the posterior, facilitating further analysis:

Assumption 8 (Demo: Gaussian Assumptions on the Generative Model for Pretraining Data).
(a) Task distribution: $\mu \sim D_\mu$ with $P(\mu) = \sum_{m=1}^M \pi_m P(\mu\mid T_m)$, where $T_m$ is the $m$-th mixture component of the Gaussian mixture, i.e., $P(\mu\mid T_m) = N(\mu\mid\mu_m, \sigma^2)$, and $\pi_m$ is the corresponding mixture weight with $\sum_{m=1}^M \pi_m = 1$ and $0 < \pi_m < 1$. Here $\mu_m$ is the center of the mixture component $T_m$, and all components share the same variance $\sigma^2$;
(b) Token distribution: $x \sim D_x(\mu)$ with $P(x\mid\mu) = N(x\mid\mu, \tau^2)$.

O.4. Toy Example: Posterior Analysis

With Assumption 8, we derive the closed-form expression of the posterior as follows:

$$P(\mu\mid S_k) \propto \sum_{m=1}^M \tilde\pi_m\, N(\mu\mid\tilde\mu_m, \tilde\sigma^2), \quad (19)$$

where, with $\bar x_k = \frac{1}{k}\sum_{i=1}^k x_i$,

$$\tilde\pi_m = \pi_m\exp\Big(-\frac{k(\mu_m - \bar x_k)^2}{2(\tau^2 + k\sigma^2)}\Big), \qquad \tilde\mu_m = \frac{\tau^2\mu_m + \sigma^2\sum_{i=1}^k x_i}{\tau^2 + k\sigma^2}, \qquad \tilde\sigma^2 = \frac{\tau^2\sigma^2}{\tau^2 + k\sigma^2}.$$

See Sec. O.5 for proof details. From Eq. 19, we observe two effects when comparing the posterior with the prior in Assumption 8: (i) Component Shifting: after observing $S_k = [x_1, x_2, \dots, x_k]$, the center of each mixture component shifts to $\frac{\tau^2\mu_m + \sigma^2\sum_{i=1}^k x_i}{\tau^2 + k\sigma^2}$; (ii) Component Re-weighting: the mixture weight $\pi_m$ of each mixture component is re-weighted by multiplying $\exp\big(-\frac{k(\mu_m - \bar x_k)^2}{2(\tau^2 + k\sigma^2)}\big)$ (and the re-weighted mixture weights are then normalized so that they sum to 1). Fig. 23 illustrates the phenomena of Component Shifting and Component Re-weighting upon observing in-context examples.

O.5. Proof of Posterior Derivation in Toy Example

In this section, we give a detailed derivation of the posterior in Eq. 19 of Sec. O.4:

$$P(\mu\mid S_k) \propto P(\mu, S_k) = P(S_k\mid\mu)P(\mu) = \Big(\prod_{i=1}^k P(x_i\mid\mu)\Big)P(\mu) = \sum_{m=1}^M \pi_m N(\mu\mid\mu_m, \sigma^2)\Big(\prod_{i=1}^k N(x_i\mid\mu, \tau^2)\Big).$$

We then show that $N(\mu\mid\mu_m, \sigma^2)\big(\prod_{i=1}^k N(x_i\mid\mu, \tau^2)\big)$ is proportional to a Gaussian density:

$$\log\Big[N(\mu\mid\mu_m, \sigma^2)\prod_{i=1}^k N(x_i\mid\mu, \tau^2)\Big]$$

(let $C_{10}$ denote the sum of the log normalization constants, and abbreviate $\sum_{i=1}^k$ as $\sum$ for simplicity)

$$= C_{10} - \frac{(\mu - \mu_m)^2}{2\sigma^2} - \sum\frac{(x_i - \mu)^2}{2\tau^2} = C_{10} - \frac{1}{2\tau^2\sigma^2}\Big(\tau^2(\mu - \mu_m)^2 + \sigma^2\sum(x_i - \mu)^2\Big)$$

$$= C_{10} - \frac{1}{2\tau^2\sigma^2}\Big(\mu^2(\tau^2 + k\sigma^2) - 2\mu\big(\tau^2\mu_m + \sigma^2\sum x_i\big) + \tau^2\mu_m^2 + \sigma^2\sum x_i^2\Big)$$

$$= C_{10} - \frac{\tau^2 + k\sigma^2}{2\tau^2\sigma^2}\Big(\mu - \frac{\tau^2\mu_m + \sigma^2\sum x_i}{\tau^2 + k\sigma^2}\Big)^2 - \frac{(\tau^2\mu_m^2 + \sigma^2\sum x_i^2)(\tau^2 + k\sigma^2) - (\tau^2\mu_m + \sigma^2\sum x_i)^2}{2\tau^2\sigma^2(\tau^2 + k\sigma^2)}$$

$$= C_{10} - \frac{\tau^2 + k\sigma^2}{2\tau^2\sigma^2}\Big(\mu - \frac{\tau^2\mu_m + \sigma^2\sum x_i}{\tau^2 + k\sigma^2}\Big)^2 - \frac{k\sigma^2\tau^2\mu_m^2 + \sigma^2\sum x_i^2(\tau^2 + k\sigma^2) - 2\mu_m\tau^2\sigma^2\sum x_i - (\sigma^2\sum x_i)^2}{2\tau^2\sigma^2(\tau^2 + k\sigma^2)}$$

(let $C_{11} = C_{10} - \frac{\sigma^2\sum x_i^2(\tau^2 + k\sigma^2) - (\sigma^2\sum x_i)^2 - \tau^2\sigma^2(\sum x_i)^2/k}{2\tau^2\sigma^2(\tau^2 + k\sigma^2)}$, which collects the terms independent of $m$)

$$= C_{11} - \frac{\tau^2 + k\sigma^2}{2\tau^2\sigma^2}\Big(\mu - \frac{\tau^2\mu_m + \sigma^2\sum x_i}{\tau^2 + k\sigma^2}\Big)^2 - \frac{k\sigma^2\tau^2\mu_m^2 - 2\mu_m\tau^2\sigma^2\sum x_i + \tau^2\sigma^2(\sum x_i)^2/k}{2\tau^2\sigma^2(\tau^2 + k\sigma^2)}$$

$$= C_{11} - \frac{\Big(\mu - \frac{\tau^2\mu_m + \sigma^2\sum_{i=1}^k x_i}{\tau^2 + k\sigma^2}\Big)^2}{2\,\frac{\tau^2\sigma^2}{\tau^2 + k\sigma^2}} - \frac{k(\mu_m - \bar x_k)^2}{2(\tau^2 + k\sigma^2)}.$$

Notice that $C_{11}$ is independent of $m \in [M]$ and of $\mu$. Therefore, we have:

$$\pi_m N(\mu\mid\mu_m, \sigma^2)\Big(\prod_{i=1}^k N(x_i\mid\mu, \tau^2)\Big) \propto \tilde\pi_m N(\mu\mid\tilde\mu_m, \tilde\sigma^2),$$

where $\tilde\pi_m = \pi_m\exp\big(-\frac{k(\mu_m - \bar x_k)^2}{2(\tau^2 + k\sigma^2)}\big)$, $\tilde\mu_m = \frac{\tau^2\mu_m + \sigma^2\sum_{i=1}^k x_i}{\tau^2 + k\sigma^2}$, and $\tilde\sigma^2 = \frac{\tau^2\sigma^2}{\tau^2 + k\sigma^2}$. Thus:

$$\sum_{m=1}^M \pi_m N(\mu\mid\mu_m, \sigma^2)\Big(\prod_{i=1}^k N(x_i\mid\mu, \tau^2)\Big) \propto \sum_{m=1}^M \tilde\pi_m N(\mu\mid\tilde\mu_m, \tilde\sigma^2).$$
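The update of Eq. 19 is straightforward to implement. A minimal sketch (the helper name posterior_update is ours): run on the prior of Fig. 23 with three observations whose mean is 4, it reproduces the posterior shown there, i.e., weights of roughly (0.00, 0.06, 0.94), shifted centers (2.25, 3.0, 4.0), and standard deviation 0.5.

```python
import numpy as np

# Component Shifting and Component Re-weighting per Eq. 19.
# The helper name `posterior_update` is ours; the formulas are Eq. 19's.
def posterior_update(pi, mu, sigma2, tau2, xs):
    k, s = len(xs), float(np.sum(xs))
    mu_new = (tau2 * mu + sigma2 * s) / (tau2 + k * sigma2)     # shifting
    sigma2_new = tau2 * sigma2 / (tau2 + k * sigma2)
    xbar = s / k
    w = pi * np.exp(-k * (mu - xbar) ** 2 / (2 * (tau2 + k * sigma2)))  # re-weighting
    return w / w.sum(), mu_new, sigma2_new

pi = np.array([0.45, 0.53, 0.02])     # prior weights of Fig. 23
mu = np.array([-3.0, 0.0, 4.0])       # prior centers of Fig. 23
weights, centers, var = posterior_update(pi, mu, sigma2=1.0, tau2=1.0,
                                         xs=np.array([3.5, 4.0, 4.5]))
print(weights.round(2), centers, np.sqrt(var))  # [0. 0.06 0.94] [2.25 3. 4.] 0.5
```

A few observations near the rare component's center are enough to re-weight the posterior almost entirely onto it, which is exactly the task-retrieval behavior illustrated in Fig. 23.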