Published as a conference paper at ICLR 2025

RETHINKING INVARIANCE IN IN-CONTEXT LEARNING

Lizhe Fang1, Yifei Wang2, Khashayar Gatmiry2, Lei Fang3, Yisen Wang1,4
1 State Key Lab of General Artificial Intelligence, School of Intelligence Science and Technology, Peking University
2 MIT CSAIL
3 School of Economics, Peking University
4 Institute for Artificial Intelligence, Peking University

In-Context Learning (ICL) has emerged as a pivotal capability of auto-regressive large language models, yet it is hindered by a notable sensitivity to the ordering of context examples, even though these examples are mutually independent. To address this issue, recent studies have introduced several variant algorithms of ICL that achieve permutation invariance. However, many of these do not exhibit performance comparable to the standard auto-regressive ICL algorithm. In this work, we identify two crucial elements in the design of an invariant ICL algorithm: information non-leakage and context interdependence, which are not simultaneously achieved by any of the existing methods. These investigations lead us to the proposed Invariant ICL (InvICL), a methodology designed to achieve invariance in ICL while ensuring these two properties. Empirically, our findings reveal that InvICL surpasses previous models, both invariant and non-invariant, on most benchmark datasets, showcasing superior generalization capabilities across varying input lengths. Code is available at https://github.com/PKU-ML/InvICL.

1 INTRODUCTION

In-Context Learning (ICL) has been shown to be a key emergent property of large language models (LLMs) (Brown et al., 2020). By utilizing a sequence of examples as the context, LLMs can be adapted quickly and accurately to new tasks without parameter tuning (Wang et al., 2024; Kossen et al., 2024; Wang et al., 2025). Despite its impressive potential, ICL exhibits a crucially unusual behavior: sensitivity to the order of context examples (Lu et al., 2022; Zhao et al., 2021; Xie et al., 2021; Agrawal et al., 2022). Although context examples are independent, the order in which they are presented can dramatically influence ICL predictions, with accuracy varying from about 90% to 50% on the SST-2 dataset (Lu et al., 2022).

It is easy to note that the auto-regressive (AR) nature of LLMs is the root of order sensitivity. AR-LLMs often utilize a so-called causal mask in the attention module, which breaks the permutation invariance property of the de facto Transformer architecture.1 As the context examples are intrinsically equivalent under different permutations, a model that respects this data symmetry tends to enhance both learning and generalization (Sokolić et al., 2016; Bietti et al., 2021; Tahmasebi & Jegelka, 2023). Therefore, recent works have proposed several variant algorithms of ICL that achieve invariance by modifying the Transformer architecture (e.g., Prefix ICL (Raffel et al., 2020), PCW (Ratner et al., 2022), and BatchICL (Zhang et al., 2024)). However, they often perform worse than non-invariant counterparts like AR ICL, as we extensively observe in practice (Figure 1). We note that although desirable, the invariance property alone is insufficient for good ICL performance (e.g., a model with constant output f(·) = c is invariant yet provides no useful information).
Therefore, to ensure the performance of ICL, we need to satisfy the following two properties while making ICL invariant: 1) Information Non-leakage: it prevents the query from accessing its answer, thereby avoiding shortcuts and enabling dense learning signals for ICL by allowing the prediction of every context example in the input. 2) Context Interdependence: each context example interacts with all preceding examples, so that as the sequence lengthens, more information is provided, thereby enhancing prediction accuracy. However, existing methods more or less compromise these properties when making ICL invariant (Table 1), resulting in the lack of a well-performing invariant ICL method.

Equal Contribution. Corresponding Author: Yisen Wang (yisen.wang@pku.edu.cn).
1 Besides, sequential positional encodings (PEs) of the prompt also introduce order sensitivity.

Figure 1: Performance of existing ICL algorithms under the settings of Zhang et al. (2024), including auto-regressive (AR) ICL, Prefix ICL (Raffel et al., 2020), BatchICL (Zhang et al., 2024), and PCW (Ratner et al., 2022). Task prompts are removed for fair comparison.

Motivated by the analysis above, we design an effective Invariant In-context Learning (InvICL) algorithm that maintains these essential properties, ensuring both invariance and high performance. InvICL addresses the issue of order sensitivity (invariance), not only avoiding information leakage but also enhancing context interdependence beyond what is achievable with AR-LLMs. To facilitate practical implementation, we also develop a fully parallel version of InvICL, capable of obtaining all Leave-One-Out (LOO) embeddings and predictions in a single forward pass using a novel LOO-type attention mask. Empirically, InvICL outperforms existing invariant ICL versions, and even surpasses AR ICL (non-invariant) on most tasks of both synthetic and real-world datasets. We summarize our contributions as follows:

- We undertake a comprehensive exploration into designing invariant ICL algorithms, highlighting the importance of preserving information non-leakage and context interdependence.
- We propose InvICL, which synergizes the goals of invariant ICL algorithms by utilizing leave-one-out embeddings to achieve invariant predictions and information non-leakage while maximizing context interdependence.
- Empirically, InvICL indeed achieves superior performance across a range of tasks on both synthetic and real-world datasets.

Table 1: Comparisons of different ICL types (details in Section 2) on permutation invariance, information non-leakage, context interdependence, and performance.

| ICL Type | Invariance | Non-leakage | Interdependence | Performance |
|---|---|---|---|---|
| Auto-regressive | ✗ | ✓ | partial | A (baseline) |
| Prefix | ✓ | ✗ | ✓ (full attn.) | A |
| Bag-of-Examples | ✓ | ✓ | ✗ | A |
| InvICL (ours) | ✓ | ✓ | ✓ | A/A+ |

2 PRELIMINARIES

Consider a classification task with a few i.i.d. training examples $\mathcal{D} = \{\bar{x}_i := (x_i, y_i)\}_{i=1}^{n}$, where $x_i$ denotes the input and $y_i$ denotes the classification target. An ICL algorithm $f$ takes these training examples (a.k.a. context examples) together as input and then predicts the label of a new test example $x_t$. A general formulation of $f$ is
$$[\hat{y}_1, \dots, \hat{y}_n, \hat{y}_t] = f(x_1, y_1, \dots, x_n, y_n, x_t), \tag{1}$$
where $\hat{y}_i$ denotes the label prediction for $x_i$. Note that the predictions for the context examples, $\{\hat{y}_i\}_{i=1}^{n}$, are optional, but they are generally available for AR-LLMs.
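To make the input format of Eq. (1) concrete, below is a minimal sketch (not from the released code; the serialization template is an illustrative assumption echoing the example format shown in Figure 5) of how n labeled context examples and a test input are packed into a single prompt for an auto-regressive LM.

```python
# Illustrative sketch of the ICL input format in Eq. (1): n labeled context
# examples followed by a test input x_t, serialized into one prompt.
# Function and template names here are assumptions, not the authors' code.

def build_icl_prompt(context, x_t):
    """context: list of (x_i, y_i) pairs; x_t: test input (all strings)."""
    parts = [f"Input: {x}, answer: {y}." for x, y in context]
    parts.append(f"Input: {x_t}, answer:")  # the model completes y_t here
    return " ".join(parts)

examples = [("Aa", "x"), ("Bb", "y")]       # toy (x_i, y_i) pairs
print(build_icl_prompt(examples, "Cc"))
# -> "Input: Aa, answer: x. Input: Bb, answer: y. Input: Cc, answer:"
```

An AR-LLM completes the final answer slot to produce $\hat{y}_t$; when run over the whole sequence, it can also produce the optional intermediate predictions $\hat{y}_i$.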
A popular model choice for ICL is the Transformer (Vaswani et al., 2017), where the self-attention layer is the elementary module. Let $H = (h_1, \dots, h_n)$ denote the input hidden states of a self-attention layer; it outputs
$$H \leftarrow H + A H W_v P, \quad \text{where } A = \mathrm{softmax}\!\left(H W_q (H W_k)^\top + M\right). \tag{2}$$
Here $W_q, W_k, W_v, P$ denote the query, key, value, and projection matrices, respectively, and $M \in \{0, -\infty\}^{n \times n}$ is an attention mask. For a standard (or full) self-attention layer, $M$ is a zero matrix, while a causal self-attention layer utilizes the following causal mask:
$$M = \begin{pmatrix} 0 & -\infty & \cdots & -\infty \\ 0 & 0 & \cdots & -\infty \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & 0 \end{pmatrix}. \tag{3}$$
As a result, the softmax attention $A$ only has nonzero weights in its lower-triangular terms. Notably, the form of Eq. (2) can be generalized to other attention types, as will be discussed later.

Figure 2: The attention masks of the four types of ICL: (a) Auto-regressive ICL, (b) Prefix ICL, (c) Bag-of-Example ICL, and (d) Invariant ICL (leave-one-out BoE pre-encoding for the $n$ context examples, followed by BoE encoding for the test example).

Revisiting existing Transformer-based ICL algorithms, they can be categorized into three families depending on their aggregation scheme over the context examples: 1) Auto-regressive ICL, 2) Prefix ICL, and 3) Bag-of-Example ICL.

Auto-regressive ICL (AR ICL). A naive way to perform ICL is to adopt the original auto-regressive Transformer (Radford et al., 2018), which admits the following aggregation rule:
$$h_{\bar{x}_k} \leftarrow \mathrm{aggr}\!\left(\{(h_{x_i}, h_{y_i})\}_{i=1}^{k-1},\, h_{x_k}\right), \quad k \in [n+1], \tag{4}$$
where $h_{x_i}, h_{y_i}, h_k$ denote the encodings of $x_i$, $y_i$, and $\bar{x}_k = (x_k, y_k)$, respectively. Here we let $\bar{x}_{n+1} := x_t$ for notational simplicity. Therefore, every example $h_k$ only attends to the previous ones $h_{\le k} = \{h_1, \dots, h_k\}$, which introduces a sequential order to the input examples. As earlier examples have a smaller context, later examples in the sequence enjoy higher accuracy, as shown in Liu et al. (2022); Wu et al. (2022). Figure 2(a) illustrates the implementation by applying a causal mask $M$, which is exactly the form in Eq. (3).

Prefix ICL. To fully utilize the information of every context example, the causal mask is discarded in Prefix LM (Raffel et al., 2020). Therefore, it aggregates over all context examples as
$$h_{x_k} \leftarrow \mathrm{aggr}\!\left(\{(h_{x_i}, h_{y_i})\}_{i=1}^{n}\right), \quad k \in [n]; \tag{5a}$$
$$h_{x_t} \leftarrow \mathrm{aggr}\!\left(\{(h_{x_i}, h_{y_i})\}_{i=1}^{n},\, h_{x_t}\right). \tag{5b}$$
Figure 2(b) illustrates the implementation by modifying the attention mask $M$ in Eq. (2), where it utilizes full attention among the context examples $\{\bar{x}_i\}_{i=1}^{n}$ and causal attention on the test example $x_t$.

Bag-of-Example ICL (BoE ICL). In addition to the two conventional designs above, there is a newer variant of ICL. Methods like PCW (Ratner et al., 2022), SAICL (Cai et al., 2023), and BatchICL (Zhang et al., 2024) encode each context example $(x_i, y_i)$ independently (without considering other context examples), similar to the bag-of-words representation. Its aggregation rules can be formulated as
$$[h_{x_k}, h_{y_k}] \leftarrow \mathrm{aggr}\!\left(\{(h_{x_k}, h_{y_k})\}\right), \quad k \in [n]; \tag{6a}$$
$$h_{x_t} \leftarrow \mathrm{aggr}\!\left(\{(h_{x_i}, h_{y_i})\}_{i=1}^{n},\, h_{x_t}\right). \tag{6b}$$
Figure 2(c) illustrates an implementation (PCW (Ratner et al., 2022)) by modifying the attention mask $M$. It restricts attention to occur only within each context example $\bar{x}_i$, $i \in [n]$, preventing cross-attention between them, while retaining attention between the test example $x_t$ and the context examples.
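To illustrate the three masking schemes above, here is a minimal sketch (an illustrative assumption, not the authors' implementation) of the example-level additive masks $M$ of Eq. (2), treating each context example as one attention unit and the last unit as the test example $x_t$. Rows are queries and columns are keys; a $-\infty$ entry removes the corresponding attention weight after the softmax.

```python
# Example-level additive attention masks for the three ICL families of Figure 2.
import numpy as np

NEG_INF = float("-inf")

def causal_mask(n):                      # (a) Auto-regressive ICL, Eq. (3)
    M = np.full((n + 1, n + 1), NEG_INF)
    M[np.tril_indices(n + 1)] = 0.0      # attend to self and earlier units only
    return M

def prefix_mask(n):                      # (b) Prefix ICL, Eq. (5)
    M = np.zeros((n + 1, n + 1))         # full attention among context examples
    M[:n, n] = NEG_INF                   # context examples cannot see the test example
    return M

def boe_mask(n):                         # (c) Bag-of-Example ICL, Eq. (6)
    M = np.full((n + 1, n + 1), NEG_INF)
    np.fill_diagonal(M, 0.0)             # each example only attends to itself
    M[n, :] = 0.0                        # the test example attends to everything
    return M

print(causal_mask(3), prefix_mask(3), boe_mask(3), sep="\n\n")
```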
3 THE PROPOSED INVARIANT IN-CONTEXT LEARNING (INVICL)

We begin by formalizing the desiderata of invariant ICL (Section 3.1) and exploring how to meet all these desiderata (Section 3.2). Next, we introduce how to implement our proposed InvICL method in practice (Section 3.3).

3.1 INVARIANT ICL AND ITS DESIDERATA

We begin with a formal characterization of three important desiderata in invariant in-context learning.

1) Invariance. In an ICL task, we have the prior knowledge of data symmetry that the $n$ context examples $\bar{x}_i$ are independently and identically distributed (i.i.d.). We define an ICL algorithm that preserves this symmetry property as an invariant ICL algorithm:

Definition 3.1. An ICL algorithm $f$ is said to be (permutation) invariant if its last prediction $f_t$ satisfies $f_t(\bar{x}_1, \dots, \bar{x}_n, x_t) = f_t(\bar{x}_{i_1}, \dots, \bar{x}_{i_n}, x_t)$ for any $(i_1, \dots, i_n) \in S_n$, a permutation of $[n] = \{1, 2, \dots, n\}$.

2) Information Non-leakage. During training, AR-LLMs learn to dynamically predict each intermediate context example $\bar{x}_i$ based on its previous tokens

AR ICL > Prefix ICL ≈ NoPE > BoE ICL in terms of performance. This indicates the strong length generalization capability of InvICL. On the one hand, this result confirms the conventional conclusion that a model that respects the data symmetry enjoys better generalization capability. On the other hand, it highlights that preventing information leakage and maintaining context interdependence are crucial for an invariant ICL algorithm. We further conduct experiments on out-of-distribution tasks and other function settings in Appendix B.1, and present the trend of the loss as it changes with the training epochs in Appendix B.4. Both experiments demonstrate that InvICL's out-of-distribution in-context performance consistently outperforms AR ICL. Additionally, in Appendix B.5, we conduct linear probing experiments to further demonstrate how the architecture of InvICL impacts the model's internal representations.

4.2 REAL-WORLD DATASETS

In this part, we conduct experiments to evaluate the capacity of InvICL on real-world datasets. Since ICL tasks are generally different from the pretraining task, and some ICL methods introduce new masking schemes for aggregation (significantly different from the masking in the pretrained model), a short finetuning of the pretrained model on the ICL tasks using these new ICL methods is necessary to fully utilize the pretrained model's capacity for ICL (Min et al., 2022b; Wei et al., 2021; Iyer et al., 2022; Cai et al., 2023). Here, we follow MetaICL (Min et al., 2022b) to do the short finetuning and evaluation. As in MetaICL, we utilize 142 tasks including text classification, question answering (QA), natural language inference (NLI), and paraphrase detection. For each training iteration, we first sample a task $T_i$ from the $C$ meta-training tasks, and then sample $k+1$ training examples $(x_1, y_1), \dots, (x_{k+1}, y_{k+1})$ from $T_i$. Given the model parameters $\theta$, the training objective is to maximize the prediction accuracy of $y_{k+1}$ under the ICL format, i.e., $\min_\theta \mathcal{L}_{CE}(\hat{y}_{k+1}, y_{k+1})$, where $\mathcal{L}_{CE}$ is the cross-entropy loss and $\hat{y}_{k+1}$ is the in-context prediction defined in Eq. (1). We evaluate the meta-trained models on the 7 settings of MetaICL. For each setting, we test two cases: 1) all target tasks; 2) target tasks in domains unseen during training (OOD generalization). More details are in Appendix A.4.

Baselines. Following MetaICL, we use GPT-2 Large (762M) (Radford et al., 2019) as the base model, and also include GPT-Neo 2.7B (Black et al., 2021) and Pythia-2.8B (Biderman et al., 2023) (Appendix B.2). For non-invariant methods, we select AR ICL (Radford et al., 2019) and NoPE2 (Kazemnejad et al., 2024).
For invariant methods, we select Prefix ICL (Raffel et al., 2020) and three types of BoE ICL (Appendix A.2): PCW (Ratner et al., 2022), SAICL (Cai et al., 2023), and BatchICL (Zhang et al., 2024). We adopt 8 context examples for training and evaluation.

Results. As shown in Table 2, compared to non-invariant methods, InvICL outperforms in 4 out of 7 tasks in the "All target tasks" setting and in all 7 tasks in the "Target tasks in unseen domains" setting. This indicates that permutation invariance is indeed a crucial property for ICL algorithms: it incorporates the inductive bias of symmetry into the model architecture, resulting in a notable improvement in performance, especially when generalizing to OOD tasks.

2 Although NoPE alone is invariant, it still utilizes an auto-regressive LLM, which breaks the invariance.

Table 2: The in-context learning performance with language models based on GPT-2 Large. We changed the causal mask and positional encoding to implement different types of ICL models. The models are finetuned under the framework of MetaICL (Min et al., 2022b).

All target tasks:

| Method | HR→LR | Class→Class | non-Class→Class | QA→QA | non-QA→QA | non-NLI→NLI | non-Para→Para | Avg. |
|---|---|---|---|---|---|---|---|---|
| Non-invariant: AR ICL (Radford et al., 2018) | 43.4±0.76 | 43.4±1.36 | 40.2±1.64 | 44.0±0.22 | 37.9±0.42 | 50.3±0.84 | 34.1±1.78 | 41.9±1.15 |
| Non-invariant: NoPE (Kazemnejad et al., 2024) | 41.7±0.47 | 30.0±0.82 | 40.3±0.99 | 44.5±0.11 | 36.6±0.05 | 26.8±0.68 | 38.8±1.49 | 37.0±0.81 |
| Invariant: PCW (BoE) (Ratner et al., 2022) | 39.7±1.30 | 37.7±0.51 | 35.2±0.37 | 40.8±0.12 | 37.7±0.30 | 40.7±1.32 | 35.1±1.65 | 38.1±0.98 |
| Invariant: SAICL (BoE) (Cai et al., 2023) | 43.4±0.45 | 43.2±0.74 | 37.5±0.74 | 45.1±0.15 | 37.6±0.15 | 49.8±2.01 | 33.3±1.44 | 41.4±1.03 |
| Invariant: BatchICL (BoE) (Zhang et al., 2024) | 31.7±0.21 | 25.4±0.30 | 27.1±0.22 | 32.2±0.12 | 34.4±0.26 | 28.9±0.48 | 35.3±0.97 | 30.7±0.45 |
| Invariant: Prefix ICL (Raffel et al., 2020) | 40.3±0.89 | 39.6±0.73 | 35.1±0.54 | 43.6±0.12 | 36.8±0.33 | 45.4±1.65 | 34.9±2.03 | 39.4±1.11 |
| Invariant: InvICL (ours) | 45.1±1.31 | 42.9±0.86 | 39.4±0.44 | 45.3±0.15 | 38.3±0.27 | 51.6±0.85 | 34.7±1.36 | 42.4±0.87 |

Target tasks in unseen domains:

| Method | HR→LR | Class→Class | non-Class→Class | QA→QA | non-QA→QA | non-NLI→NLI | non-Para→Para | Avg. |
|---|---|---|---|---|---|---|---|---|
| Non-invariant: AR ICL (Radford et al., 2018) | 31.5±2.98 | 35.7±0.50 | 28.1±1.65 | 56.5±0.89 | 39.2±1.78 | 80.3±1.80 | 34.1±0.00 | 43.6±1.65 |
| Non-invariant: NoPE (Kazemnejad et al., 2024) | 32.9±1.32 | 23.4±0.39 | 26.9±1.44 | 63.6±0.78 | 38.2±0.34 | 33.2±0.26 | 32.6±0.16 | 35.8±0.83 |
| Invariant: PCW (BoE) (Ratner et al., 2022) | 35.6±2.54 | 31.3±0.29 | 26.9±1.59 | 65.3±1.16 | 33.7±1.21 | 66.7±1.60 | 34.4±0.31 | 42.0±1.44 |
| Invariant: SAICL (BoE) (Cai et al., 2023) | 30.7±1.67 | 29.7±1.98 | 26.4±1.01 | 56.2±0.50 | 41.5±1.60 | 64.3±2.21 | 37.1±1.89 | 40.8±1.65 |
| Invariant: BatchICL (BoE) (Zhang et al., 2024) | 24.2±0.21 | 22.3±0.15 | 23.0±0.11 | 31.9±1.20 | 29.4±0.54 | 37.8±0.78 | 36.8±1.02 | 29.3±0.70 |
| Invariant: Prefix ICL (Raffel et al., 2020) | 31.0±2.43 | 33.0±1.53 | 29.6±2.20 | 63.8±0.47 | 36.4±1.29 | 52.6±2.54 | 34.0±0.23 | 40.1±1.75 |
| Invariant: InvICL (ours) | 44.4±2.17 | 35.8±2.01 | 29.0±1.99 | 67.6±0.22 | 42.6±1.53 | 81.8±0.65 | 37.5±2.30 | 48.4±1.72 |

Figure 4: The length generalization behavior of InvICL and AR ICL on the HR→LR setting (x-axis: number of demonstrations, from 1 to 16). The models are meta-trained on sequences with 8 context examples.

Table 3: The inference time of different models.

| Method | Inference time (ms) |
|---|---|
| AR ICL | 21.9 |
| PCW (BoE ICL) | 21.7 |
| Prefix ICL | 22.0 |
| InvICL | 22.0 |

Compared to invariant methods, InvICL outperforms in 5 out of 7 tasks in the "All target tasks" setting and in 6 out of 7 tasks in the "Target tasks in unseen domains" setting. Despite being permutation invariant, these baselines exhibit poor performance (none of them surpasses AR ICL on average).
This highlights the crucial properties of information non-leakage and context interdependence implemented by InvICL.

Length Generalization. The ability to generalize to different input lengths is a crucial property of language models. In the context of ICL, the ability to adapt to varying numbers of context examples can be perceived as a dimension of this length generalization capability. However, in the main experiments, the number of context examples remains consistent throughout both the training and evaluation phases. Hence, we vary the number of context examples, as illustrated in Figure 4. We observe that InvICL is much more robust than AR ICL when the length of the test data differs from that of the training data, indicating its strong capability for length generalization.

Computational Cost. In Section 3.3, we claim that our parallel implementation of InvICL has the same computational complexity order as full self-attention and AR self-attention. In Table 3, we empirically verify this by evaluating the inference time of different ICL models, showing that InvICL enjoys roughly the same inference speed as other models. Besides, a question worth considering is the memory cost of InvICL, since it duplicates the input sequence. We find that when the input size of the GPT-2 Large model increases from 512 to 1024, the GPU memory overhead increases by 14% (from 4.2 GB to 4.8 GB). We consider this acceptable given the clear improvements in performance.

Ablation Study. In Table 4, we conduct an ablation study to demonstrate the effect of the two components of InvICL: the invariant mask and the symmetric positional encoding. The experiments show that both components are important for invariant ICL. Additionally, in Appendix B.3, we demonstrate that the effectiveness of InvICL is not due to its doubled input.

Table 4: Ablation study of the invariant mask and symmetric positional encodings (PE) on ICL performance and order sensitivity.

| Method | HR→LR (↑) | Sensitivity (↓) |
|---|---|---|
| AR ICL | 43.4 (+1.5) | 0.25 (+0.05) |
| + Sym PE | 38.4 (−5.0) | 0.30 (+0.05) |
| + Inv Mask | 44.8 (+1.4) | 0.10 (−0.15) |
| + Both (InvICL) | 45.1 (+1.7) | 0.00 (−0.25) |

Permutation Invariance. In Table 4, we also demonstrate the permutation invariance of InvICL. Following Chen et al. (2022), we measure the order sensitivity as the frequency with which the prediction changes under a random permutation. We observe that both the invariant mask and PE are important for achieving invariance, and lower sensitivity indicates better performance.

5 DISCUSSION

The Mechanism behind InvICL's Strong Length Generalization Ability. We consider that the mechanism primarily stems from InvICL achieving invariance. As mentioned in the introduction, previous studies have found that respecting data symmetry in models helps improve generalization. For example, Sokolić et al. (2016) demonstrated that when the input data exhibits invariance under certain transformations (such as rotation or translation), an invariant classifier can achieve lower generalization error than a regular classifier. Bietti et al. (2021); Tahmasebi & Jegelka (2023) concluded that encoding invariances into the model improves the effective number of samples, thereby enhancing generalization. These theoretical results could help explain why InvICL demonstrates stronger length generalization ability.

Theoretical Complexity of InvICL.
Suppose there are n context examples and 1 test example (considering the examples as attention units), and let M {0, }(n+1) (n+1) be the attention mask defined in Figure 2(d). The complexity of Inv ICL is determined by the number of 0 elements in M. The attention computation for Inv ICL includes: 1) Independent self-encoding of the first-time input (corresponding to M[:n, :n]), which requires n self-attention calculations; 2) LOO pre-encoding (corresponding to M[n: 2n, :2n]), which requires n2 calculations; 3) Aggregation to the test example (corresponding to M[2n+1, n: 2n+1]), which requires n+1 calculations. In total, there are n2+2n+1 attention calculations, which is of the same order as Prefix ICL (n2 + 1) and twice that of AR ICL (n2/2 + 3n/2 + 1). The ICL Training Objective. In the synthetic experiments, we utilize the ICL objective to train the Transformers, which does not align with how LLMs are pre-trained. However, our paper focuses on improving the ICL capability of LLMs, rather than investigating the reasons behind the emergence of ICL ability. Therefore, we train the model using the ICL objective to demonstrate that Inv ICL can achieve stronger ICL capability compared to traditional AR ICL. This is also aligned with the objective we use in the real-world experiments. Theoretical Understanding Inv ICL from an Optimization Perspective. Previous studies have established the duality between ICL and the gradient descent algorithm, demonstrating that under specific parameterizations, ICL can implicitly implement gradient descent. In Appendix C, we build upon this line of research and prove that Inv ICL, under the same parameterizations, can also approximately perform gradient descent, thereby highlighting the theoretical potential of Inv ICL. 6 CONCLUSION In this paper, by distilling the advantages of auto-regressive language models, we identified two additional desiderata for invariant ICL: information non-leakage and context interdependence. Since existing invariant ICL algorithms cannot achieve these desiderata simultaneously, we proposed a novel invariant ICL scheme called Invariant In-context Learning (Inv ICL), which accomplishes these goals concurrently. We also proposed an efficient parallel implementation of Inv ICL. Empirically, we show that Inv ICL outperforms both invariant and non-invariant ICL methods on most tasks, and demonstrates good length generalization abilities. These results sparked the unique advantages of the principled design of invariant ICL. Published as a conference paper at ICLR 2025 ACKNOWLEDGEMENT Yisen Wang was supported by National Key R&D Program of China (2022ZD0160300), National Natural Science Foundation of China (92370129, 62376010), and Beijing Nova Program (20230484344, 20240484642). Yifei Wang was supported in part by the NSF AI Institute TILOS, and an Alexander von Humboldt Professorship. Sweta Agrawal, Chunting Zhou, Mike Lewis, Luke Zettlemoyer, and Marjan Ghazvininejad. Incontext examples selection for machine translation. ar Xiv preprint ar Xiv:2212.02437, 2022. Kwangjun Ahn, Xiang Cheng, Hadi Daneshmand, and Suvrit Sra. Transformers learn to implement preconditioned gradient descent for in-context learning. ar Xiv preprint ar Xiv: 2306.00297, 2023. Ekin Akyürek, Dale Schuurmans, Jacob Andreas, Tengyu Ma, and Denny Zhou. What learning algorithm is in-context learning? investigations with linear models. In ICLR, 2022. Yu Bai, Fan Chen, Huan Wang, Caiming Xiong, and Song Mei. 
Transformers as statisticians: Provable in-context learning with in-context algorithm selection. ar Xiv preprint ar Xiv:2306.04637, 2023. Stella Biderman, Hailey Schoelkopf, Quentin Gregory Anthony, Herbie Bradley, Kyle O Brien, Eric Hallahan, Mohammad Aflah Khan, Shivanshu Purohit, USVSN Sai Prashanth, Edward Raff, et al. Pythia: A suite for analyzing large language models across training and scaling. In ICML, 2023. Alberto Bietti, Luca Venturi, and Joan Bruna. On the sample complexity of learning under geometric stability. In Neur IPS, 2021. Sid Black, Gao Leo, Phil Wang, Connor Leahy, and Stella Biderman. GPT-Neo: Large Scale Autoregressive Language Modeling with Mesh-Tensorflow, March 2021. URL https://doi. org/10.5281/zenodo.5297715. Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. In Neur IPS, 2020. Tianle Cai, Kaixuan Huang, Jason D Lee, and Mengdi Wang. Scaling in-context demonstrations with structured attention. In ICML 2023 Workshop on Efficient Systems for Foundation Models, 2023. Yanda Chen, Chen Zhao, Zhou Yu, Kathleen Mc Keown, and He He. On the relation between sensitivity and accuracy in in-context learning. ar Xiv preprint ar Xiv:2209.07661, 2022. Yongqiang Chen, Binghui Xie, Kaiwen Zhou, Bo Han, Yatao Bian, and James Cheng. Positional information matters for invariant in-context learning: A case study of simple function classes. ar Xiv preprint ar Xiv:2311.18194, 2023. Thomas H Cormen, Charles E Leiserson, Ronald L Rivest, and Clifford Stein. Introduction to algorithms. MIT press, 2022. Damai Dai, Yutao Sun, Li Dong, Yaru Hao, Zhifang Sui, and Furu Wei. Why can gpt learn incontext? language models secretly perform gradient descent as meta optimizers. ar Xiv preprint ar Xiv:2212.10559, 2022. Nan Ding, Tomer Levinboim, Jialin Wu, Sebastian Goodman, and Radu Soricut. Causallm is not optimal for in-context learning. ar Xiv preprint ar Xiv:2308.06912, 2023. Deqing Fu, Tianqi Chen, Robin Jia, and Vatsal Sharan. Transformers learn higher-order optimization methods for in-context learning: A study with linear models. In Neur IPS 2023 Workshop on Mathematics of Modern Machine Learning, 2023. Shivam Garg, Dimitris Tsipras, Percy S Liang, and Gregory Valiant. What can transformers learn in-context? a case study of simple function classes. In Neur IPS, 2022. Published as a conference paper at ICLR 2025 Srinivasan Iyer, Xi Victoria Lin, Ramakanth Pasunuru, Todor Mihaylov, Daniel Simig, Ping Yu, Kurt Shuster, Tianlu Wang, Qing Liu, Punit Singh Koura, et al. Opt-iml: Scaling language model instruction meta learning through the lens of generalization. ar Xiv preprint ar Xiv:2212.12017, 2022. Amirhossein Kazemnejad, Inkit Padhi, Karthikeyan Natesan Ramamurthy, Payel Das, and Siva Reddy. The impact of positional encoding on length generalization in transformers. In Neur IPS, 2024. Jannik Kossen, Yarin Gal, and Tom Rainforth. In-context learning learns label relationships but is not conventional learning. In ICLR, 2024. Jiachang Liu, Dinghan Shen, Yizhe Zhang, William B Dolan, Lawrence Carin, and Weizhu Chen. What makes good in-context examples for gpt-3? In Proceedings of Deep Learning Inside Out (Dee LIO 2022): The 3rd Workshop on Knowledge Extraction and Integration for Deep Learning Architectures, pp. 100 114, 2022. Yao Lu, Max Bartolo, Alastair Moore, Sebastian Riedel, and Pontus Stenetorp. 
Fantastically ordered prompts and where to find them: Overcoming few-shot prompt order sensitivity. In ACL, 2022. Sewon Min, Mike Lewis, Hannaneh Hajishirzi, and Luke Zettlemoyer. Noisy channel language model prompting for few-shot text classification. In ACL, 2022a. Sewon Min, Mike Lewis, Luke Zettlemoyer, and Hannaneh Hajishirzi. Metaicl: Learning to learn in context. In NAACL, 2022b. Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. Improving language understanding by generative pre-training. 2018. Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. Language models are unsupervised multitask learners. 2019. Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. The Journal of Machine Learning Research, 21(1):5485 5551, 2020. Nir Ratner, Yoav Levine, Yonatan Belinkov, Ori Ram, Omri Abend, Ehud Karpas, Amnon Shashua, Kevin Leyton-Brown, and Yoav Shoham. Parallel context windows improve in-context learning of large language models. ar Xiv preprint ar Xiv:2212.10947, 2022. Lingfeng Shen, Aayush Mishra, and Daniel Khashabi. Do pretrained transformers really learn in-context by gradient descent? ar Xiv preprint ar Xiv:2310.08540, 2023. Jure Sokoli c, R. Giryes, G. Sapiro, and M. Rodrigues. Generalization error of invariant classifiers. In AISTATS, 2016. Behrooz Tahmasebi and Stefanie Jegelka. The exact sample complexity gain from invariances for kernel regression. In Neur IPS, 2023. Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Neur IPS, 2017. Johannes Von Oswald, Eyvind Niklasson, Ettore Randazzo, João Sacramento, Alexander Mordvintsev, Andrey Zhmoginov, and Max Vladymyrov. Transformers learn in-context by gradient descent. In ICML, 2023. Johannes von Oswald, Eyvind Niklasson, Maximilian Schlegel, Seijin Kobayashi, Nicolas Zucchet, Nino Scherrer, Nolan Miller, Mark Sandler, Max Vladymyrov, Razvan Pascanu, et al. Uncovering mesa-optimization algorithms in transformers. ar Xiv preprint ar Xiv:2309.05858, 2023. Qixun Wang, Yifei Wang, Yisen Wang, and Xianghua Ying. Can in-context learning really generalize to out-of-distribution tasks? In ICLR, 2025. Yifei Wang, Yuyang Wu, Zeming Wei, Stefanie Jegelka, and Yisen Wang. A theoretical understanding of self-correction through in-context alignment. In Neur IPS, 2024. Jason Wei, Maarten Bosma, Vincent Zhao, Kelvin Guu, Adams Wei Yu, Brian Lester, Nan Du, Andrew M Dai, and Quoc V Le. Finetuned language models are zero-shot learners. In ICLR, 2021. Published as a conference paper at ICLR 2025 Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. In Neur IPS, 2022. Zhiyong Wu, Yaoxiang Wang, Jiacheng Ye, and Lingpeng Kong. Self-adaptive in-context learning. ar Xiv preprint ar Xiv:2212.10375, 2022. Yanzheng Xiang, Hanqi Yan, Lin Gui, and Yulan He. Addressing order sensitivity of in-context demonstration examples in causal language models. ar Xiv preprint ar Xiv:2402.15637, 2024. Sang Michael Xie, Aditi Raghunathan, Percy Liang, and Tengyu Ma. An explanation of in-context learning as implicit bayesian inference. In ICLR, 2021. Kaiyi Zhang, Ang Lv, Yuhan Chen, Hansen Ha, Tao Xu, and Rui Yan. 
Batch-ICL: Effective, efficient, and order-agnostic in-context learning. arXiv preprint arXiv:2401.06469, 2024.

Zihao Zhao, Eric Wallace, Shi Feng, Dan Klein, and Sameer Singh. Calibrate before use: Improving few-shot performance of language models. In ICML, 2021.

Xun Zheng, Bryon Aragam, Pradeep Ravikumar, and E. Xing. DAGs with NO TEARS: Continuous optimization for structure learning. In NeurIPS, 2018.

A IMPLEMENTATION DETAILS

A.1 SYMMETRIC POSITIONAL ENCODING

In this paper, we mainly focus on the absolute positional encoding used in the GPT family. As shown in Figure 5, we adopt an independent position encoding scheme that treats each example as an independent sequence, following the design in Ratner et al. (2022). For each context example $\bar{x}_i$, we always allocate its positional encodings as if it started from the first position. Denote the maximal sequence length among the $\bar{x}_i$ as $\ell_{\max}$. For the test example $x_t$, we assign its positional encodings starting from the index $\ell_{\max}$. This implementation is applicable to BoE ICL, Prefix ICL, and InvICL.

Figure 5: The symmetric positional encoding applied in our work: (a) symmetric PE for the standard input; (b) symmetric PE for the duplicated input of InvICL. $p_i$ refers to the learned absolute positional embedding that is added to the token embeddings at position $i$. Figure (a) shows the positional encoding under the standard ICL input sequence. For the duplicated input of InvICL, we apply the same positional encoding to the original and the repeated examples, as shown in Figure (b).
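The following is a small sketch (illustrative only, not the paper's code) of how the symmetric positional indices described above can be assigned: every context example restarts at position 0, the test example starts at the maximal context-example length, and, for InvICL's duplicated input, the repeated copy reuses the same indices.

```python
# Symmetric positional-index assignment of Appendix A.1 (illustrative sketch).

def symmetric_position_ids(context_lens, test_len, duplicate=False):
    """context_lens: token counts of each context example; test_len: tokens of x_t."""
    l_max = max(context_lens)
    ids = []
    for l in context_lens:
        ids.extend(range(l))                    # each example restarts at position 0
    if duplicate:                               # InvICL repeats the context examples
        for l in context_lens:
            ids.extend(range(l))                # same ids for the repeated copy
    ids.extend(range(l_max, l_max + test_len))  # the test example starts at l_max
    return ids

print(symmetric_position_ids([5, 7, 6], 4))
print(symmetric_position_ids([5, 7, 6], 4, duplicate=True))
```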
A.2 BAG-OF-EXAMPLES ICL

We introduce the implementation details of the three BoE ICL methods mentioned in the main text: PCW (Ratner et al., 2022), SAICL (Cai et al., 2023), and BatchICL (Zhang et al., 2024).

PCW (Parallel Context Window). PCW is a work originally aimed at extending the acceptable input length of language models. Let $N$ be the maximal length of a language model, and $n > N$ be the input length. PCW divides the input into context windows of length $N$ and separately feeds them into the LM. Finally, it utilizes a bag-of-window method (similar to Figure 2(c), where each block in the mask refers to a context window) to generate the predictions. We note that by considering each context example as a window, PCW can implement the Bag-of-Examples ICL algorithm.

SAICL (Structured Attention for ICL). SAICL is a method proposed to improve the inference efficiency and order sensitivity of in-context learning. Similar to PCW, it also encodes the context examples independently, while remaining aware of the test example. The original method is based on T5 (Raffel et al., 2020), a language model with an encoder-decoder architecture. We transfer its design to the GPT family by directly taking its attention mask and using the symmetric PE proposed above.

BatchICL. Instead of conducting $N$-shot encoding for all context examples, BatchICL utilizes $N$ separate 1-shot encodings for each context example. It then aggregates the intermediate hidden states of the respective last token, which is subsequently incorporated into the forward computation of a zero-shot query to generate the final prediction. We basically follow all the settings introduced in the original paper. As for the layer from which to extract the aggregated vector, we simply take the 15th layer, since the authors found that any intermediate or later layer is a fair choice.

A.3 SETTING OF THE EXPERIMENTS ON LINEAR REGRESSION TASKS

Let $\mathcal{G} = \{g : \mathcal{X} \subseteq \mathbb{R}^d \to \mathbb{R},\ g(x) = w^\top x + b\}$ denote the linear function class. Let $\mathcal{D}_\mathcal{G}$ be a distribution on $\mathcal{G}$, and $\mathcal{D}_\mathcal{X}$ be a distribution on $\mathcal{X}$. In the training phase, we iteratively sample a random function $g \in \mathcal{G}$ from $\mathcal{D}_\mathcal{G}$, and sample i.i.d. $x_1, \dots, x_{k+1}$ from $\mathcal{D}_\mathcal{X}$. Then, we produce a prompt in the ICL manner, $P = (x_1, g(x_1), \dots, x_k, g(x_k), x_{k+1})$, and train a model $\theta$ to output $[\hat{g}(x_1), \dots, \hat{g}(x_k), \hat{g}(x_{k+1})] = f_\theta(P)$ (as in Eq. (1)), where $\hat{g}(x_i)$ is the prediction for $g(x_i)$. The training objective is
$$\min_\theta\ \mathbb{E}_{\mathcal{D}_\mathcal{G}, \mathcal{D}_\mathcal{X}} \sum_{i=1}^{k+1} \ell(\hat{g}(x_i), g(x_i)),$$
where $\ell$ is the MSE loss. In the experiments in Section 4.1, we set $d = 20$, $k = 40$, $\mathcal{D}_\mathcal{X} = N(0, I_d)$, and $\mathcal{D}_\mathcal{G}: w \sim N(0, I_d),\ b = 0$. The architecture selection follows Garg et al. (2022), where a 12-layer GPT-like Transformer decoder is utilized. We implement the four model types by using the corresponding attention masks and the symmetric PE.

A.4 IMPLEMENTATION DETAILS OF EXPERIMENTS ON REAL-WORLD DATA

Evaluation. Following MetaICL (Min et al., 2022b), we consider 7 evaluation settings: 1) HR→LR, which means training on high-resource data and testing on low-resource data; 2) X→X (X = {Classification, QA}), which means training and testing on the same task type, but with no overlap in tasks; 3) non-X→X (X = {Classification, QA, NLI, Paraphrase}), which means training and testing on different task types. The last settings are the most challenging, requiring strong generalization capacities of language models (Min et al., 2022b). For each setting, we make evaluations both on all target tasks and on a subset that contains target tasks in domains unseen among the source tasks, e.g., medical, financial, and climate. This setting further challenges the out-of-distribution generalization capability of the models.

Truncation. Since MetaICL (Min et al., 2022b) truncates a training sequence when it exceeds the maximum input length of the LM, and the ICL prompt sequence is duplicated in our implementation of InvICL, the training sequences differ between InvICL and the other methods because of different truncation rates. As shown in Table 5, there is a significant gap in dataset size between the standard input and the duplicated input under this truncation setting. To make the comparison fair, we apply the same truncation rate of InvICL to the normal training sequences so that all the methods share the same training set. Additionally, we reduce the number of context examples in the training phase from 16 to 8 to keep the truncation rate of InvICL at the same level as standard ICL.

Table 5: Ratio of the remaining data between different input types under the truncation setting of MetaICL (Min et al., 2022b). Here the number of context examples is set to 8.

| Input type | HR→LR | Class→Class | non-Class→Class | QA→QA | non-QA→QA | non-NLI→NLI | non-Para→Para |
|---|---|---|---|---|---|---|---|
| Standard | 70% | 90% | 71% | 59% | 80% | 85% | 85% |
| Duplicated | 53% | 79% | 55% | 40% | 62% | 75% | 71% |

Direct & Channel.
Besides the standard ICL paradigm, Meta ICL (Min et al., 2022b) adopts a new inference paradigm called noisy channel ( Channel ) (Min et al., 2022a) and achieves a better experimental performance. Contrary to the standard ICL paradigm (also called Direct in (Min et al., 2022b)) that takes (x1, y1, ..., xn, yn, xt) as input, the Channel paradigm takes (y1, x1, ..., yn, xn, yt) and try to generate xt. Note that in order to generate the prediction, Channel ICL needs to perform n forward passes conditioned on each of the n labels yt and regard the label with minimum perplexity as the prediction. This will, on the one hand, increase the computational complexity and, on the other hand, reduce its applicability to the generative tasks where the label space is large, e.g., Question Answering. Therefore, we adopt the Direct setting in our experiments, i.e., the standard ICL paradigm. B ADDITIONAL EXPERIMENTAL RESULTS B.1 SYNTHETIC EXPERIMENTS ON OTHER SETUPS In this section, we conduct additional synthetic experiments on more functions and out-of-distribution setups, to further showcase the generalization capability of Inv ICL. Other function settings. We consider two other function settings proposed by (Garg et al., 2022) sparse linear regression and decision tree, to illustrate the ability of Inv ICL to learn algorithms to solve other tasks. Results are given in Figure 6. Published as a conference paper at ICLR 2025 1. Sparse linear regression. In this task, a random linear function y = w x + b is sampled to be predicted, yet the efficient has only 3 non-zero coordinates out of 20 dimensions. Although it is also a linear regression task, its optimal algorithm is no longer least squares but Lasso, which involves solving the least squares objective with an l1-norm regularizer for the weight vector. This demands the in-context learners to learn an algorithm different from that in linear regression to solve this task. Following the experimental settings in our paper, we test the performance of AR ICL and Inv ICL which are trained with 200k epochs. We can still observe the consistent results of our paper that Inv ICL possesses fast convergence (Inv ICL converges while AR ICL does not). 2. Decision tree. We follow the setting in (Garg et al., 2022), where the class of depth 4 decision trees with 20-dimensional inputs is considered. We evaluate the performance of AR ICL and Inv ICL that are trained with 200k epochs. We find that although AR ICL performs better than Inv ICL for short inputs, as the length of the input sequence increases, Inv ICL gradually outperforms AR ICL, indicating the strong extrapolation ability of Inv ICL. (a) Sparse linear regression (b) Desicion tree Figure 6: ICL performance on sparse linear regression and decision tree. Out-of-distribution Setups. We consider three out-of-distribution setups proposed by (Garg et al., 2022; Chen et al., 2023), to showcase the generalization capability of Inv ICL to out-of-distribution (OOD) tasks. We consider a distribution shift between the training and test datasets. The training data remain consistent with Section A.3. However, for the test data, we apply the following modification: 1. Add random noise to the labels by altering b = 0 to b N(0, 1). 2. Scale the data sampling by altering DX = N(0, Id) to DX = N(0, 32Id). 3. Sample the data xi from a random 10-dimensional subspace (out of 20 dimensions). In Figure 7, we report the testing MSE loss with the models trained for respectively 50k and 200k epochs. 
We omit Prefix ICL and BoE ICL due to their poor performance. We find that InvICL retains the advantages mentioned earlier, i.e., fast convergence and strong extrapolation ability, indicating its strong capacity on OOD tasks.

B.2 REAL-WORLD EXPERIMENTS BASED ON GPT-NEO AND PYTHIA

We also conduct experiments with models based on GPT-Neo 2.7B (Black et al., 2021) and Pythia-2.8B (Biderman et al., 2023), with other hyper-parameters unchanged, as shown in Tables 6 and 7. The results are similar to those demonstrated in the main text: InvICL outperforms the baseline in most of the tasks and performs especially well in the OOD settings. This indicates the applicability of InvICL to different base models. Besides, we note that the three LLMs (GPT-2, GPT-Neo, and Pythia) studied in our work utilize three different kinds of PE: trainable PE, ALiBi, and rotary PE, respectively. Therefore, our design of symmetric PE is applicable to a wide range of PEs.

Figure 7: ICL performance on OOD tasks (squared error vs. number of in-context examples; curves: InvICL, AR ICL, and Least Squares). Panels: (a) random noise, 50k epochs; (b) random noise, 200k epochs; (c) scaling, 50k epochs; (d) scaling, 200k epochs; (e) half subspace, 50k epochs; (f) half subspace, 200k epochs. The training dataset remains consistent with Section 4.1, but we change the distribution of the test dataset. Random noise: changing the distribution of the linear bias from $b = 0$ to $b \sim N(0, 1)$. Scaling: changing the sampling distribution of $x_i$ from $\mathcal{D}_\mathcal{X} = N(0, I_d)$ to $\mathcal{D}_\mathcal{X} = N(0, 3^2 I_d)$. Half subspace: sampling the data $x_i$ from a random 10-dimensional subspace (out of 20 dimensions).

Table 6: The in-context learning performance on GPT-Neo 2.7B.

| Method | HR→LR | Class→Class | non-Class→Class | QA→QA | non-QA→QA | non-NLI→NLI | non-Para→Para | Avg. |
|---|---|---|---|---|---|---|---|---|
| All target tasks: Auto-regressive ICL | 45.8 | 41.2 | 40.1 | 46.4 | 36.8 | 45.2 | 33.1 | 41.2 |
| All target tasks: InvICL (ours) | 46.1 | 40.2 | 40.2 | 48.6 | 35.8 | 44.7 | 33.7 | 41.3 |
| Unseen domains: Auto-regressive ICL | 39.1 | 33.1 | 31.8 | 66.5 | 34.7 | 56.7 | 33.1 | 42.1 |
| Unseen domains: InvICL (ours) | 39.6 | 33.9 | 32.7 | 68.1 | 31.4 | 56.9 | 36.0 | 42.7 |

B.3 ABLATION STUDY FOR INVICL

In this section, we conduct experiments to test the baselines (AR ICL, PCW, Prefix ICL) using the same duplicated data as InvICL. As shown in Table 8, InvICL still outperforms the baselines when they are given the same doubled input as InvICL.

Table 7: The in-context learning performance on Pythia-2.8B.

| Method | HR→LR | Class→Class | non-Class→Class | QA→QA | non-QA→QA | non-NLI→NLI | non-Para→Para | Avg. |
|---|---|---|---|---|---|---|---|---|
| All target tasks: Auto-regressive ICL | 31.3 | 22.3 | 27.8 | 33.4 | 33.7 | 29.7 | 37.6 | 30.8 |
| All target tasks: InvICL (ours) | 31.5 | 26.3 | 28.5 | 33.0 | 35.6 | 28.0 | 40.2 | 31.9 |
| Unseen domains: Auto-regressive ICL | 20.8 | 21.0 | 21.0 | 43.5 | 39.7 | 33.5 | 34.2 | 30.5 |
| Unseen domains: InvICL (ours) | 20.9 | 24.2 | 21.1 | 44.6 | 43.7 | 33.5 | 38.6 | 32.4 |

Table 8: Ablation study of using the doubled input for the baseline methods. We report results on HR→LR. InvICL still outperforms the baselines.

| Method | Doubled input | Original input |
|---|---|---|
| AR ICL | 43.8 | 43.4 |
| PCW (BoE ICL) | 40.6 | 39.7 |
| Prefix ICL | 41.7 | 40.3 |
| InvICL | 45.1 | – |

B.4 DETAILED RESULTS FOR SYNTHETIC EXPERIMENTS

In this section, we provide detailed results for the synthetic experiments in Section 4.1.
In Figure 8, we demonstrate the error curves of AR ICL and InvICL at different training epochs. In Figure 9, we present the error at different training epochs when the number of context examples is 100. Both experiments demonstrate that InvICL's OOD in-context performance (length > 40) consistently outperforms AR ICL across all epochs. Specifically, as shown in Figure 9, in the early stages of training, the error of InvICL decreases rapidly, while the error of AR ICL only shows a significant reduction after approximately 100k epochs. Furthermore, after 200k epochs, the error of InvICL stabilizes, whereas the error of AR ICL increases.

Figure 8: Intermediate results for InvICL and AR ICL on the linear regression setting (squared error vs. number of in-context examples); panels: (a) InvICL, (b) AR ICL. The line colors represent models trained for different numbers of epochs (from 50k to 500k).

Figure 9: The squared error at different training epochs (x-axis: training epoch in thousands; curves: InvICL and AR ICL). We set the number of context examples to 100.

B.5 LINEAR PROBING EXPERIMENTS

In this section, we conduct a linear probing experiment based on the synthetic setting, to further explore how the architecture of InvICL impacts the model's internal representations. For a model pre-trained on the synthetic linear regression dataset, we freeze the model parameters and train a single linear layer on the hidden states of the 3rd, 6th, 9th, and 12th layers, respectively. As shown in Figure 10, the linear probing error of InvICL is consistent with, and close to, the error curve of the pre-trained model across all tested layers. In contrast, for AR ICL, only the error curve of layer 12 converges to that of the pre-trained model. This indicates that InvICL encodes task features into the model much faster than AR ICL. We believe this is closely related to its context interdependence property, which allows it to utilize richer context-example information for encoding.

C THEORETICAL UNDERSTANDING OF INVICL FROM AN OPTIMIZATION PERSPECTIVE

In this section, we further characterize the advantages of InvICL from an optimization perspective.

InvICL Can Approximately Implement Gradient Descent. Consider a linear regression task with instances $(X, y)$, where $X$ consists of row vectors $x_i^\top \in \mathbb{R}^d$, and $y$ consists of labels $y_i \in \mathbb{R}$, $i \in [n]$. The goal is to find the optimal weight vector $w$ that minimizes the LSE loss $L(w) = \|Xw - y\|^2$. A standard gradient descent (GD) algorithm updates the weights iteratively as follows:
$$w_\ell = w_{\ell-1} - \eta X^\top (X w_{\ell-1} - y), \tag{11}$$
where $\ell$ stands for the iteration step, and $\eta$ is the step size. Consider the ICL-style model input, formulated as $Z = (z_1, \dots, z_n, z_1, \dots, z_n, z_t)$, where $z_j = (x_j^\top, y_j)^\top$, $j \in [n]$, are the context examples, and $z_t = (x_t^\top, 0)^\top$ is an arbitrary test example. Here we duplicate the input as required by InvICL and expect the model to output $(x_t^\top, x_t^\top w)^\top$ at the last token. The theorem below illustrates the evolution of the prediction through the Transformer layers of InvICL.

Theorem C.1. With the attention weight matrices configured as in Von Oswald et al. (2023), i.e.,
$$W_k = W_q = \begin{pmatrix} I_{d \times d} & 0 \\ 0 & 0 \end{pmatrix}, \quad W_v = \begin{pmatrix} 0_{d \times d} & 0 \\ w_0^\top & -1 \end{pmatrix}, \quad P = \eta I, \tag{12}$$
InvICL implements the following weight updating rule: at the $\ell$-th layer of the Transformer, the last token outputs $z_t^{(\ell)} = (x_t^\top, x_t^\top w_\ell)^\top$ with
$$w_\ell = w_{\ell-1} - \eta X^\top (X w_{\ell-1} - y) + \eta^2 \Delta w_{\ell-1}. \tag{13}$$
Here, $\Delta w_\ell = X^\top \big(X X^\top - \mathrm{diag}(X X^\top)\big)(X w_\ell - y)$.
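Below is a small numerical sketch (our own illustration under the stated parameterization, not the paper's code) comparing the plain GD update of Eq. (11) with the InvICL-style update of Eq. (13) on a random least-squares problem; the step-size choice is an assumption, kept well within the convergence condition discussed next.

```python
# Compare Eq. (11) (standard GD) with Eq. (13) (InvICL-style update) numerically.
import numpy as np

rng = np.random.default_rng(0)
n, d = 40, 20
X = rng.normal(size=(n, d))
y = X @ rng.normal(size=d)                     # noiseless targets y = X w*

eta = 0.1 / np.linalg.eigvalsh(X.T @ X).max()  # small step size (assumption), below 1/lambda_max

def gd_step(w):                                # Eq. (11)
    return w - eta * X.T @ (X @ w - y)

def invicl_step(w):                            # Eq. (13)
    r = X @ w - y
    K = X @ X.T
    delta = X.T @ ((K - np.diag(np.diag(K))) @ r)   # Delta w_{l-1}
    return w - eta * X.T @ r + eta**2 * delta

w_gd = np.zeros(d)
w_inv = np.zeros(d)
for _ in range(500):
    w_gd, w_inv = gd_step(w_gd), invicl_step(w_inv)

# With this small eta, the eta^2 correction is roughly an order of magnitude
# smaller than the gradient term, so both iterates decrease the loss and stay similar.
print(np.linalg.norm(X @ w_gd - y) ** 2, np.linalg.norm(X @ w_inv - y) ** 2)
print(np.linalg.norm(w_gd - w_inv))
```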
Theorem C.1 shows that under specific parametrization, the weight updating rule implemented by Inv ICL (Eq. (13)) is very similar to that of standard GD (Eq. (11)), differing only by a second-order term. For gradient descent to converge, the step size η should be at most the inverse of the largest eigenvalue of XX . Under this condition, the term η2 wℓ 1 is dominated by ηX (Xwℓ 1 y), ensuring that Inv ICL has the potential to closely approximates the standard GD algorithm in this linear regression task. Discussion to Other ICL Methods. We provide a comprehensive comparison of all the ICL methods considered in this paper from the optimization perspective: under the parametrization as in Eq. (12), 1) AR ICL emulates the online GD algorithm (with a constant learning rate) (Ding et al., 2023), which is not guaranteed to converge; 2) Prefix ICL implicitly implements the standard GD algorithm under a specific set of parameters for attention (Von Oswald et al., 2023; Ding et al., 2023); and 3) Bo E ICL Published as a conference paper at ICLR 2025 0 50 100 150 200 250 in-context examples squared error Layer 3 Layer 6 Layer 9 Layer 12 Pretrain (a) Inv ICL. 0 50 100 150 200 250 in-context examples squared error Layer 3 Layer 6 Layer 9 Layer 12 Pretrain (b) AR ICL. Figure 10: The linear probing results on Inv ICL and AR ICL. can only update the weight vector of the test (last) example (not the context examples) without context interdependence. This leads to a constant gradient computed at the initial point, causing it to fail to converge (detailed discussion is in Appendix D.1). Compared with these ICL algorithms, Inv ICL has several distinct advantages: 1) Inv ICL surpasses AR ICL in terms of convergence to optimal solutions; 2) Similar to Prefix ICL, Inv ICL approximately implements the standard GD algorithm while avoiding information leakage; and 3) Unlike Bo E ICL, Inv ICL effectively incorporates context interdependence, allowing it to emulate a more efficient GD algorithm. These advantages underscore the theoretical superiority of Inv ICL, which synergizes information non-leakage and context interdependence within an invariant ICL framework. Practicality of Theorem C.1. Theorem C.1 is an existence proof which illustrate that the Transformers have the potential to implement complex optimization mechanisms like gradient descent. In fact, The actual weight may not be strictly follow its parametrization. However, empirical studies including Von Oswald et al. (2023); von Oswald et al. (2023), have shown that pre-trained Transformers exhibit behaviors akin to gradient descent in certain scenarios, thereby providing empirical evidence for the theory. D.1 PROOF OF THEOREM C.1 Proof. We mainly adopt the setting of (Von Oswald et al., 2023) and (Ding et al., 2023). Let Z = (z1, ..., z2n, z2n+1) R(d+1) (2n+1) be the duplicated input of Inv ICL, where zj = xj yj Rd, yj R, and zi = zn+i for i [n]. Consider the linear self-attention layer in the scheme of Inv ICL. Given the query, key, value matrix Wq, Wk, Wv R(d+1) (d+1) and the projection matrix P R(d+1) (d+1), the updating rule of the layer is as follows: zj zj + PWvzj(z j W k Wqzj), zn+j zn+j + PWv X i [n]\{j} zi(z i W k Wqzn+j), z2n+1 z2n+1 + PWv i=1 zn+i(z n+i W k Wqz2n+1), where j [n]. Following the setting of (Von Oswald et al., 2023) and (Ding et al., 2023), we let Wk = Wq = Id d 0 0 0 , Wv = 0d d 0 w(0) 1 , P = ηI. 
(15) Published as a conference paper at ICLR 2025 Now, we hope to see what kind of iterative algorithm can naturally be implemented by Inv ICL. Before that, we first give the L2 2 loss after doing one step of gradient descent X(w ηX (Xw y)) y 2 = Xw y ηXX (Xw y) 2 = (I ηX X)(Xw y) 2. To compare Inv ICL with the conventional attention heads for ICL linear regression, here we investigate the convergence properties of the leave-one-out scheme in Eq. (17) viewed as an optimization algorithm for solving the regression problem, and compare it to that of gradient descent. It turns out that if we use the same weighting strategy as (Von Oswald et al., 2023) but with Inv ICL, then we obtain a similar iterative algorithm for in-context linear regression according to which the last row of Z evolves, but the update rule transforms into wℓ= wℓ 1 ηX (Xwℓ 1 y ), (17) where y = y ηXX (Xwℓ 1 y) + η[x i xi(x i wℓ 1 yi)]n i=1 (18) is the label updated by the leave-one-out scheme. This equation is obtained by first perform a gradient descent step w.r.t. the whole dataset with gradient update ηX (Xw y) and then minus the term w.r.t the i-th data point xi(x i w yi). Expanding Eq. (17), we get that the global update becomes wℓ= wℓ 1 ηX (Xwℓ 1 y ) = wℓ 1 ηX (Xwℓ 1 y + ηXX (Xwℓ 1 y) η[x i xi(x i wℓ 1 yi)]n i=1) = wℓ 1 ηX (Xwℓ 1 y) + η2X XX (Xwℓ 1 y) η2X Diag(XX )(Xwℓ 1 y). This delivers Eq. (13). Remark. In Bo E ICL, since the context examples cannot interact with each other, the GD algorithm implemented by it can only update the weight vector w of the test (last) example, but not the context examples. Particularly, this means the gradient update process is wℓ= wℓ 1 g(w0, {xi, yi}), where g is the update function of Bo E ICL. This means that the gradients are always computed at the initial point of the algorithm, thus the algorithm cannot converge. D.2 PROOF OF PROPOSITION 3.4 Proof. We will first demonstrate that the attention score matrix A needs to adhere to a specific form when constrained by the attention mask M, in order to guarantee the permutation equivariance of the embeddings of the context examples. Subsequently, we will establish that this requirement is equivalent to the permutation invariance of the ICL prediction with respect to the context examples. Lemma D.1. Given an input matrix H = (h1, ..., hn) Rn d and its attention score matrix A Rn n defined in Eq. (7). Denote SA(H) = AHWv P be the self-attention operation, where A is defined in Eq. (7). Then, SA(H) is permutation equivariant to {hi} iff the attention mask M is equal to 0 0 ... ... ... ... 0 0 0 0 0 ... ... ... ... 0 0 Proof. Denote T Rn n be a permutation matrix on the row vectors of H. This implies that T {0, 1}n n and 1 n T = 1 n , T1n = 1n. Then the permutation equivariant condition can be Published as a conference paper at ICLR 2025 stated as T SA(H) = SA(TH). Since SA(H) = softmax HWq(HWk) + M HWv P, the condition can be expanded as T softmax HWq(HWk) + M HWv P = softmax THWq W k H T + M THWv P. (20) It can be easily verified that 1) the permutation and softmax operations are commutative, and 2) T is orthogonal. Therefore, the above equation can be transformed to softmax THWq(HWk) + TM HWv P = softmax THWq W k H + MT HWv P. (21) This is equivalent to TMT 1 = M (22) for arbitrary permutation matrix T. Next, we will discuss what kind of matrix M satisfies this condition. For notation simplicity, we denote T(i, j) as the permutation performed only between the i-th and j-th index. Assume Mi,i = c1. Taking T = T(i, j), from Eq. 
(22) we have Mj,j = c1. By iterating over j, we have Mk,k = c1 for every k [n]. Assume Mi,j = c2, i = j. Taking T = T(i, k), k = j, from Eq. (22) we have Mk,j = c2; taking T = T(j, k), k = i, we have Mi,k = c2. Hence, by iterative applying permutations in this way, we can conclude that Mk,l = c2 for every k = l. In conclusion, M = c1In + c2(1n n In). Since the elements of an attention mask can only take the value of either 0 or , M can only be one of the three forms demonstrated in Lemma D.1 (an all attention mask is meaningless). Now we prove the equivalence between the desired permutation invariance property and the equivariance property discussed in Lemma D.1. As the permutation invariance property involves the ICL prediction, which relies on the test embedding ht, it is necessary to incorporate it into the discussion. We denote the full input matrix of ICL as H = (h1, ..., hn, ht) R(n+1) d, and the corresponding matrices in the self-attention process as A, M. Lemma D.2. Let the output embeddings of the Transformer be H = (h 1, ..., h n, h t). Then, h t is invariant to the permutation of (h1, ..., hn) iff (h 1, ..., h n) is equivariant to the permutation of (h1, ..., hn). Proof. First, we need to extend existing results to the circumstance of the full input H. Consider the attention mask M R(n+1) (n+1) of the full input, whose n n submatrix at the upper-left is equal to M, i.e., M1:n,1:n = M. From the condition in the Proposition we have that Mn+1,: = 0 n+1. Besides, it is evident that Proposition 3.5 also satisfies for M, we have M1:n,n+1 = 1 n . Other variables can be naturally extended. In Lemma D.1, we have proved that the equivariance property is equivalent to the attention mask M being one of three specific forms. Now we prove the contrapositive statement of Lemma D.2. If (h 1, ..., h n) is not equivariant to the permutation of (h1, ..., hn), by Lemma D.1, the mask on context examples M must satisfy either 1) i = j, Mii = Mjj, or 2) i = j, k = l, Mij = Mkl. We separately demonstrate that these properties will break the property of permutation invariance. For the following circumstances, we uniformly let Wq = Wk = Wv = P = In+1. Denote the embedding of hi after k self-attention layer as h(k) i . Then, the embeddings are updated as h(k+1) i = X j=1,...,n,t [s(h(k) i , h(k) j ) + Mij]h(k) j , (23) where s( , ) is the similarity function calculated by their inner product and softmax normalization, which is defined in 2. Published as a conference paper at ICLR 2025 i = j, Mii = Mjj. Without loss of generality, since the elements of M only take the value of either 0 or infty, we let M11 = 0, M22 = . Then we construct the input matrix as h1 = e1, h2 = e2, hi = 0(i > 2), ht = 0, where ei denotes the i-th unit vector (i [d]). Since M22 = M2,n+1 = , following Eq. (23), we find that h(1) 2 = c1e1. And since M11 = 0, we have h(1) 1 = c2e1 + c3e1. Now we permute the first and second examples, i.e., h1 = e2, h2 = e1. Although we find that the first update of the test embedding remains unchanged since Eq. (23) is permutation invariant for hk t , the second update differs. Since we have h(1) 2 = c1e2 and h(1) 1 = c3e1+c2e1, the aggregation h(2) i changes. Therefore, the property of permutation invariance is broken. i = j, k = l, Mij = Mkl. Without loss of generality, let Mij = 0, Mkl = . We construct hi = e1, hk = e2, h =i,k = 0. Then, we have h(1) j = c1e1 + c2e2, and h(1) l = c3e1. Similar to the above process, we can prove that ht is not permutation invariant w.r.t. 
the index exchange (i, j) (k, l). In conclusion, any attention mask M that violates Lemma D.1 will break the property of permutation invariance. Thus Lemma D.2 is proved. Finally, by combining Lemmas D.1 and D.2, we can deliver Proposition 3.4. D.3 PROOF OF PROPOSITION 3.5 Proof. Consider the case that G has no self-loops. Since G is a digraph with no cycles, it is a directed acyclic graph (DAG). According to the graph theory (Cormen et al., 2022), DAG can be topologically ordered, which means in this ordering, any vertex is not reachable from later vertices in the graph. Therefore, if we reorder the vertices in this way, we have Aij = 0 for i j, which infers that A is strictly lower diagonal. Since the original graph allows self-loop, which corresponds to the diagonal elements, the adjacency matrix is lower triangular. This is equivalent to that the attention mask on context examples M is lower triangular. E RELATED WORK The order-sensitivity of ICL. The phenomenon that ICL is sensitive to the permutation of context examples has been observed in several works. (Zhao et al., 2021) and (Lu et al., 2022) used GPT-3 to perform in-context learning on classification tasks such as SST-2 and observe a high variance w.r.t. the permutation of the context examples. Besides, (Xie et al., 2021) and (Agrawal et al., 2022) found the same phenomenon on a generated synthetic dataset and machine learning tasks, respectively. Additionally, (Chen et al., 2022) empirically showed that the order-sensitivity is negatively correlated to the performance of ICL. To address this issue, (Zhao et al., 2021) proposed a calibration module to make the output distribution consistent with prior knowledge. (Lu et al., 2022) filtered out the best prompt ordering by investigating their calibration on a generated set. (Xiang et al., 2024) utilizes contrastive learning to align representations of in-context examples across different positions, resulting in the alleviation of order sensitivity. Besides, there are works that focuses on implementing the concept of permutation invariance from an architectural perspective. For example, SAICL (Cai et al., 2023) proposed a structured attention mechanism that achieves permutation invariance. However, their work is based on improving the ICL performance of T5 (Raffel et al., 2020), a language model based on an encoder-decoder architecture, which did not address the order-sensitivity issue of auto-regressive LMs. Batch ICL (Zhang et al., 2024) is the work that is most relevant to us. Instead of conducting N-shot encoding for all context examples, it utilizes N separate 1-shot encodings for each context example. It then aggregates the intermediate hidden states of the respective last token, which is subsequently incorporated into the forward computation of a zero-shot query to generate the final prediction. In this way, the model achieves permutation invariance since the encoding of the context examples are independent. The connection between ICL and Gradient Descent. Early stage formal theoretical investigation of the linear regression in-context learners includes (Akyürek et al., 2022; Von Oswald et al., 2023). They first showed that Transformers learn in context via gradient descent, where one layer performs Published as a conference paper at ICLR 2025 one gradient update. In subsequent work, (von Oswald et al., 2023) further argued that Transformers are strongly biased towards learning to implement gradient-based optimization routines. 
(Ahn et al., 2023) showed Transformers can learn to implement preconditioned Gradient Descent, where the pre-conditioner can adapt to the data. (Bai et al., 2023) provided detailed constructions for how Transformers can implement a range of learning algorithms via gradient descent. (Dai et al., 2022) conducted experiments on NLP tasks and concluded that Transformer-based language models performing ICL behave similarly to models fine-tuned via gradient descent; however, concurrent work argued that real-world LLMs do not perform ICL via gradient descent (Shen et al., 2023). (Fu et al., 2023) argued that Transformers actually learn to perform in-context learning by implementing a higher-order optimization method, not gradient descent. Predictions made by different Transformer layers match iterations of higher-order optimization methods better than they match iterations of gradient descent.