# Understanding In-Context Learning from Repetitions

Published as a conference paper at ICLR 2024

Jianhao Yan¹,² Jin Xu⁴ Chiyu Song¹,² Chenming Wu⁵ Yafu Li¹,² Yue Zhang²,³
¹Zhejiang University ²School of Engineering, Westlake University ³Institute of Advanced Technology, Westlake Institute for Advanced Study ⁴Tsinghua University ⁵Baidu Research
elliottyan37@gmail.com

ABSTRACT

This paper explores the elusive mechanism underpinning in-context learning in Large Language Models (LLMs). Our work provides a novel perspective by examining in-context learning through the lens of surface repetitions. We quantitatively investigate the role of surface features in text generation and empirically establish the existence of token co-occurrence reinforcement, a principle that strengthens the relationship between two tokens based on their contextual co-occurrences. Furthermore, we find that similar reinforcements lie behind the pretraining corpus, revealing that this effect arises from the LLM's efforts to maximize likelihood. By investigating the dual impacts of these features, our research illuminates the internal workings of in-context learning and expounds on the reasons for its failures. This paper provides an essential contribution to the understanding of in-context learning and its potential limitations, offering a fresh perspective on this exciting capability.

1 INTRODUCTION

The impressive ability of Large Language Models (LLMs; Touvron et al. 2023a; Chowdhery et al. 2022; OpenAI 2023) to perform in-context learning (ICL) is a standout characteristic. This behavior mirrors human learning and reasoning from analogy (Winston, 1980), enabling LLMs to rapidly adapt to a range of downstream tasks. Without being explicitly pretrained to learn from demonstrations, LLMs can predict responses to unseen test queries from a few demonstrations and without any instruction given (Brown et al., 2020; Zhang et al., 2022; Chowdhery et al., 2022). An example of in-context learning can be found in Figure 1(a), where a pretrained LLaMA model is given demonstrations for a binary classification task and learns to make correct predictions.

Despite its success in applications, the working mechanism of in-context learning is still an open question. Existing work has investigated input-label mappings (Min et al., 2022; Yoo et al., 2022; Wei et al., 2023) and demonstration construction (An et al., 2023; Lu et al., 2022; Liu et al., 2022) as underlying factors for ICL. However, little research has focused on the correlation between ICL and textual features. Intuitively, the behavior of ICL depends on the context and can be fragile to its variations. As Figure 1(b) shows, the same LLaMA model makes the incorrect prediction "True" given the input "Circulation revenue has decreased by 5% in Finland.", which is likely because of the repeated pattern "Answer: -> True" in the demonstrations. From the same perspective, the success case in Figure 1(a) can be attributed to learning desired patterns such as "Answer: -> True|False" from the demonstrations. Such patterns are apparently used as features in the model's autoregressive inference process. We take a feature-centric view to understand ICL, analyzing the key patterns in the input context that correlate with ICL behavior.
The patterns discussed above can be viewed as generalizations of the repetition patterns (Holtzman et al., 2019; Fu et al., 2020) and self-reinforced patterns (Xu et al., 2022) that have been discussed in the literature. The self-reinforcement effect describes a phenomenon in which the model tends to perpetuate the generation of sentences that have frequently appeared in its context. These effects are regarded as harmful to text generation, and previous work has put effort into mitigating them. However, they can offer a unique view of ICL behavior from the angle of text generation.

Figure 1: (a) A correct prediction of in-context learning; (b) an incorrect prediction of in-context learning. We showcase correct and incorrect predictions of in-context learning with LLaMA-65B. The task is to identify whether the given sentence presents a positive sentiment. We annotate the token-reinforced connections learned from the demonstrations (e.g., "Answer: -> True | False", "Circulation revenue -> True"). In both cases, the LLM learns connections from the demonstrations and makes decisions based on these connections. In the successful case, the model learns reliable connections, and several of these connections together realize the function of sentiment analysis. With repetitive demonstrations, in contrast, the model gets stuck on spurious connections and misses the key information "decreased", leading to a wrong prediction.

We quantitatively investigate in-context learning from the perspective of surface patterns, illustrating the inherent correlation among surface patterns, self-reinforcement, and ICL. First, we study the role of self-reinforced patterns as surface features that guide text generation. We empirically establish the existence of token co-occurrence reinforcement, a primary principle in learning surface-level patterns, where the connection between any two tokens is reinforced with the number of their contextual co-occurrences. We further delve into the reasons and inner workings behind token reinforcement, showing it to be an inevitable result of the model's efforts to maximize the likelihood of the training corpus. Given the existence of and reasons behind such patterns, we scrutinize the beneficial and detrimental effects of these surface patterns on in-context learning. On the one hand, experiments on MMLU and GSM8K show that the reinforcement helps constrain the output space and format outputs to follow the demonstrations, for instance outputting "Let's think step by step.".
On the other hand, experiments with non-informative connections and reordered answers on MMLU demonstrate that intentionally constructed connections make LLMs lean towards specific answers, revealing the risk of unintended, spurious connections. Our work not only reveals, to some extent, the intrinsic workings of in-context learning, providing a perspective not previously analyzed in the literature, but also explains the underlying reasons for its failures.¹

¹ https://github.com/ElliottYan/understand-icl-from-repetition

Figure 2: Left: An example of the self-reinforcement effect. We choose a normal sentence ("Answer is A"), repeat it several times, and present the probability of the token "A"; the model used is LLaMA-7B. Right: Sentence-level self-reinforcement across LLMs. We plot the average sentence probability before repetition (TP0) and after 10 repeats (TP10) for all sizes of OPT and LLaMA (colors from light to dark) on Wikitext-103, BookCorpus, and random sequences. All sizes of LLaMA and OPT models demonstrate strong sentence-level self-reinforcement effects.

Our main contributions can be summarized as follows:

- We propose a novel perspective for understanding ICL through repetitive text generation.
- We perform systematic analyses and empirically establish the existence of token reinforcement across various LLMs, along with the reasons behind it.
- We show that token reinforcement constrains the output space and enables desired patterns for ICL, but is also responsible for spurious connections and possible failures of ICL.

2 ICL AND REPETITIONS

We first illustrate the background. In ICL, given a desired task $f$, we feed an LLM with $K$ input-output pairs $\{(x_k, y_k),\ k \in [1, K]\}$, where $y_k = f(x_k)$. Each input-output pair is formatted as a demonstration $d_k = (F_I, x_k, F_O, y_k)$, where $F_I$ and $F_O$ denote formatting tokens for inputs and outputs, e.g., "Input:" and "Answer:", and $x_k$ and $y_k$ can consist of several tokens. A pretrained language model $\mathcal{M}$ predicts the output $y$ conditioned on the concatenation of the demonstrations and the test query $x$:

$$P_{\mathrm{ICL}}(y \mid x, k) := \mathcal{M}\big(y \mid (F_I, x_1, F_O, y_1, \ldots, F_I, x_k, F_O, y_k, F_I, x, F_O)\big). \tag{1}$$

To understand the influence of surface repetitions as in Figure 1(b), we first show a case of sentence repetition and plot the probability of the sentence as the number of repetitions in the preceding context increases (Figure 2). When we manually repeat the sentence "Answer is A", the probability of generating "A" after "Answer is" is boosted from 0.03 to almost 1.0. The right part of Figure 2 presents our quantitative study with two families of large language models, namely OPT (125M, 350M, 1.3B, 2.7B, 6.7B, 13B, 30B; Zhang et al. 2022) and LLaMA (7B, 13B, 30B, 65B; Touvron et al. 2023a). We plot the average sentence probability after 10 repeats for LLMs of varying sizes, arranged from light to dark colors. We use 1,000 sentences from each of three different datasets: Wikitext-103 (Merity et al., 2016), BookCorpus (Zhu et al., 2015), and sequences of random words. More experimental details can be found in Appendix B. With 10 repeats, the probability of generating the sentence is significantly increased across all tested LLMs: LLMs amplify the occurrence of previously presented sentences, even sequences of random tokens.²

² Xu et al. (2022) reported that sentences with a low initial probability, such as sentences composed of random tokens, have a smaller self-reinforcement effect. In our experiments, even sentences of random tokens (which initially have a near-zero probability) are reinforced to a probability nearing one. The difference may come from different model sizes (150M vs. up to 65B) and pretraining corpus sizes (hundreds of millions of tokens vs. trillions of tokens).
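To make this probing setup concrete, the following is a minimal sketch of how the left panel of Figure 2 can be reproduced with Hugging Face Transformers; it is not the authors' released code, and the checkpoint name, the prompt joining, and the single-token treatment of " A" are illustrative assumptions.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "huggyllama/llama-7b"  # assumption: any causal LM checkpoint works here
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16, device_map="auto")
model.eval()

def next_token_prob(prefix: str, target: str) -> float:
    """Probability the model assigns to (the last sub-token of) `target`
    as the immediate next token after `prefix`."""
    ids = tok(prefix, return_tensors="pt").input_ids.to(model.device)
    target_id = tok(target, add_special_tokens=False).input_ids[-1]
    with torch.no_grad():
        logits = model(ids).logits[0, -1]  # next-token distribution at the last position
    return torch.softmax(logits.float(), dim=-1)[target_id].item()

sentence = "Answer is A."
for n in range(0, 11):
    context = " ".join([sentence] * n)
    prefix = (context + " " if context else "") + "Answer is"
    print(f"repeats={n:2d}  P('A' | 'Answer is') = {next_token_prob(prefix, ' A'):.4f}")
```

Under this setup, the probability of " A" typically rises sharply as the number of repeats grows, mirroring the trend reported in Figure 2.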
The above observations are related to the study of the self-reinforcement effect (Xu et al., 2022) in the literature. Formally, the conditional probability modeled here is $P_{\mathrm{REP}}(w) := \mathcal{M}(w \mid [s_1; s_2; \ldots; s_{n-1}; w_1 \cdots w_{i-1}])$, where $s$ is the repeating sentence, $n$ denotes the number of occurrences, and $w_i$ is the $i$-th token of the sentence $s$. Previous research identifies the self-reinforcement effect: the average probability of generating a sentence $s$ of length $L_s$ that has occurred $N$ times in the context, $\mathrm{TP}_N = \frac{1}{L_s}\sum_i P_{\mathrm{REP}}(w \mid w = w_i; n = N)$, increases almost monotonically with $N$. While the generation of repetitive sentences above can be understood as the influence of a single surface feature, we investigate more sophisticated surface patterns, which can be causes of the behaviors of in-context learning.

3 SELF-REINFORCED SURFACE FEATURES FOR IN-CONTEXT LEARNING

Following the previous definition of a demonstration, we set $s = [F_I; x_1; F_O; y_1]$ and have

$$P_{\mathrm{REP}}(w \mid n = K) = \mathcal{M}\big(y \mid (\overbrace{F_I, x_1, F_O, y_1, \ldots, F_I, x_1, F_O, y_1}^{K \text{ times}}, F_I, x_1, F_O)\big),$$
$$P_{\mathrm{ICL}}(y \mid x, k = K) = \mathcal{M}\big(y \mid (F_I, x_1, F_O, y_1, \ldots, F_I, x_K, F_O, y_K, F_I, x, F_O)\big).$$

Comparing $P_{\mathrm{REP}}(w)$ to $P_{\mathrm{ICL}}(y)$, we find: (1) $F_I$ and $F_O$ are repeated across demonstrations in both cases; (2) in repetitive generation, $x_1$ and $y_1$ are repeated, whereas in ICL, $x$ and $y$ change. To investigate the correlation between surface patterns and the resulting answer $y$, we gradually expand self-reinforcement patterns toward in-context learning. We achieve this by introducing random perturbations to each demonstration, imitating the role of a changing $x$ while leaving certain components, such as $F_O$ and $y$, unchanged. The experiments in this section are conducted on the dataset of randomly generated sentences from Section 2 and with the four LLaMA models. The results on Wikitext-103 and BookCorpus, as well as results with OPT and various other LLMs, can be found in Appendix D. For each experiment, we repeat the pattern 20 times and report the mean and standard deviation of the probabilities of the kept tokens.
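As a concrete illustration of the two prompt constructions being compared, the sketch below builds both; the formatting template ("Input: ... // Answer: ...") and the whitespace joining are illustrative assumptions rather than the paper's exact prompt.

```python
def rep_prompt(x1: str, y1: str, k: int) -> str:
    """P_REP setting: repeat the same (x1, y1) demonstration k times, then query x1 again."""
    demo = f"Input: {x1} // Answer: {y1}"
    return " ".join([demo] * k + [f"Input: {x1} // Answer:"])

def icl_prompt(pairs: list[tuple[str, str]], x: str) -> str:
    """P_ICL setting: k distinct (x_k, y_k) demonstrations followed by the test query x."""
    demos = [f"Input: {xk} // Answer: {yk}" for xk, yk in pairs]
    return " ".join(demos + [f"Input: {x} // Answer:"])

# Example usage:
# rep_prompt("Generous and subversive artworks.", "True", k=3)
# icl_prompt([("Generous and subversive artworks.", "True"),
#             ("Panostaja did not disclose the purchase price.", "False")],
#            "Circulation revenue has decreased by 5% in Finland.")
```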
3.1 ANY TWO TOKENS CAN CREATE A TOKEN REINFORCED LOOP

We assume that tokens are the base unit of reinforcement. By assessing the self-reinforcement effect among tokens, we can understand the self-reinforcement effect from first principles. Formally, given a sentence $s = (w_1, \ldots, w_{L_s})$ from a corpus $D$, we construct a binary mask sequence $\hat{m} = (\hat{m}_1, \hat{m}_2, \ldots, \hat{m}_{L_s})$ and define a replacement operation

$$R(w, \hat{m}) = \begin{cases} w_r, & \text{if } \hat{m} = 0, \\ w, & \text{if } \hat{m} = 1, \end{cases}$$

which replaces $w$ with a randomly sampled token $w_r$ from the vocabulary if $\hat{m} = 0$ and keeps it unchanged if $\hat{m} = 1$. Note that $w_r$ is independently sampled for each sentence and each position. For the mask sequence $\hat{m}$, we randomly sample the positions of the 0s and 1s and control the number of kept tokens. In this way, a sentence $s$ is transformed into $\hat{s}_n = (R(w_1, \hat{m}_1), \ldots, R(w_{L_s}, \hat{m}_{L_s}))$, where $\sum_{l \in [1, L_s]} \hat{m}_l = L_t$. We then report the average token probability $\widehat{\mathrm{TP}}_N$ over the kept tokens, as in the previous section. For example, suppose we have a sentence $s$ = (Apple, there, is, red), $\hat{m} = (1, 0, 1, 1)$, and the target token $w$ = red. The demonstrations in this section then look like "Apple there is red // Apple logo is red // Apple juice is red".

Figure 3: Token co-occurrence reinforcement (panels A-D keep 1-4 tokens unchanged; x-axis: number of repetitions; y-axis: average kept-token probability TP; models: LLaMA-7B/13B/30B/65B). Even if only one token repeats in context, the self-reinforcement loop is triggered. "..X..Y.." denotes that two tokens are kept unchanged. The mean and variance are computed over 1,000 randomly generated samples.

Figure 4: Successive and distant reinforcement (panels A-D: token distance 0-3; same axes and models as Figure 3). The self-reinforcement effect is strongest when the two tokens are successive, i.e., distance = 0. Otherwise, the reinforcement is smaller and appears insensitive to the distance. "..X.Y.." denotes that the distance between the two tokens is 1.

Number of Tokens. As depicted in Figure 3, even a single token shared across demonstrations elicits self-reinforcement. The scenario with two tokens kept unchanged is particularly interesting (Figure 3(B)), as it represents the basic case in which one token triggers the generation of another. We find that the connection between any two tokens is reinforced, and the probability increases monotonically with the number of their contextual co-occurrences. We refer to this basic effect as token co-occurrence reinforcement. When we increase the number of preserved tokens from 2 to 4, we observe a strengthening of the reinforcement effect, because each earlier token forms a reinforced connection with all the later ones.

Distance In-Between. We further examine the role of distance in token reinforcement, confining the analysis to two tokens. Figure 4 distinctly differentiates between successive tokens (distance = 0) and distant tokens (distance >= 1), which we term successive reinforcement and distant reinforcement, respectively. Successive reinforcement significantly elevates the probability, from 0 to 0.4, with only a few repetitions. Conversely, distant reinforcement provides a moderate boost, from 0 to 0.2, and appears to be indifferent to the distance between the tokens. Across all experiments, we observe a marked increase in reinforcement as model size scales, especially for distant token reinforcement. This consistent escalation suggests that larger LLMs are more capable of following complex patterns in in-context demonstrations, which is consistent with the results of Wei et al. (2023).
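The following sketch is our paraphrase of the construction above, not the released code: it builds the perturbed repeats and measures the average probability of the kept tokens. It assumes `model` is a Hugging Face causal LM (as in the earlier sketch) and that the prepended context is non-empty, matching the random prefix used in the experiments.

```python
import random
import torch

def perturbed_repeats(token_ids, keep_positions, n_repeats, vocab_size, special_ids):
    """Each repeat keeps the tokens at `keep_positions` and resamples every other
    position independently and uniformly from the non-special vocabulary."""
    candidates = [i for i in range(vocab_size) if i not in special_ids]
    repeats = []
    for _ in range(n_repeats):
        repeats.append([
            t if pos in keep_positions else random.choice(candidates)
            for pos, t in enumerate(token_ids)
        ])
    return repeats

def kept_token_prob(model, context_ids, sentence_ids, keep_positions):
    """Average probability assigned to the kept tokens of `sentence_ids`,
    given `context_ids` (e.g., the concatenated earlier repeats)."""
    ids = torch.tensor([context_ids + sentence_ids], device=model.device)
    with torch.no_grad():
        logits = model(ids).logits[0]
    probs = torch.softmax(logits.float(), dim=-1)
    offset = len(context_ids)
    # logits at position p predict the token at position p + 1
    vals = [probs[offset + pos - 1, sentence_ids[pos]].item() for pos in sorted(keep_positions)]
    return sum(vals) / len(vals)
```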
We provide more supporting evidence of token reinforcement in Appendix D. The link we found between any two tokens forms the foundation of sentence-level self-reinforcement: each token is strengthened by the ones before it. In ICL, common elements in the demonstrations, such as pattern words, form connections with label words like "A, B, C, D".

3.2 TOKEN REINFORCEMENT IN THE PRE-TRAINING CORPUS

We find that token reinforcement can be a result of the model's effort to maximize the likelihood of the training data.³

³ https://huggingface.co/datasets/olm/olm-wikipedia-20221220

Figure 5: The probabilities of the next occurrence of a token after several occurrences have been observed in context.

Given a pre-training corpus $D_{\mathrm{train}}$, the language modeling task over a sequence of tokens $\{w_1, w_2, \ldots, w_T\}$ of length $T$ is defined as maximizing $\sum_{i=1}^{T} \log P(w_i \mid w_{<i})$. Under this objective, consider a set of demonstrations in which the connection A->D is reinforced twice, b->D once, and C->D three times: all three reinforcements reach a consensus and cooperate in predicting D. However, this ideal scenario does not always hold. Consider another example, [A,B,C,D ; A,b,C,E ; a,B,C,F ; A,b,C,(?)]. This time, A->D is reinforced once, b->E once, a->F once, and so on; these reinforcements conflict and compete with one another. Thus, instead of the traditional view of ICL as an input-to-output mapping, we view ICL from a feature angle, as a combination of token connections, some of which are highly reinforced and some of which are not. The next section shows how these reinforcements play a crucial role in in-context learning.
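As a toy illustration of this feature view (our own construction, with a demonstration set chosen to be consistent with the counts described above), the sketch below tallies, for each (context token, answer) pair, how many demonstrations contain both, making the consensus and conflict cases explicit.

```python
from collections import Counter

def count_reinforcements(demos):
    """Count, for each (context token, answer token) pair, the number of
    demonstrations in which the two co-occur."""
    counts = Counter()
    for *context, answer in demos:
        for token in context:
            counts[(token, answer)] += 1
    return counts

# Consensus: all reinforced connections point to the same answer.
consensus = [["A", "B", "C", "D"], ["a", "B", "C", "D"], ["A", "b", "C", "D"]]
# Conflict: the connections pull toward different answers (D, E, and F).
conflict = [["A", "B", "C", "D"], ["A", "b", "C", "E"], ["a", "B", "C", "F"]]

print(count_reinforcements(consensus))  # ('A','D') -> 2, ('b','D') -> 1, ('C','D') -> 3
print(count_reinforcements(conflict))   # ('A','D'), ('A','E'), ('a','F') each appear once
```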
4 THE EFFECTS OF SURFACE PATTERNS ON IN-CONTEXT LEARNING

We quantitatively study how self-reinforced surface patterns lead to both beneficial functions and detrimental effects in ICL. The experiments in this section are conducted on MMLU (Hendrycks et al., 2021) and GSM8K (Cobbe et al., 2021). Due to limited computing resources, we randomly sample 20 examples from each of the 57 MMLU tasks, resulting in a collection of 1,140 test samples. Demonstrations are drawn independently for each test sample. All experiments are conducted across three random seeds. For further experimental details, see Appendix B.

4.1 BENEFICIAL EFFECTS

Constraining Output Space. An important advantage brought by the reinforcement effect is that it helps constrain the output space: with several demonstrations, the connections between the formatting tokens ("Input:" and "Answer:") and the label words in each demonstration ("A", "B", "C", "D") are reinforced through distant and successive reinforcement, respectively. In this way, the LLM learns to predict one of "A, B, C, D" as the final answer, instead of continuation sequences such as "Oh, an interesting question. I hope I know the answer.". We verify this advantage on the MMLU dataset, which is widely used to evaluate the language understanding of real-world large language models.

To isolate the effect of self-reinforcement, we construct masked demonstrations for the analyses. An example of how we mask demonstrations is shown in the left part of Figure 6. In particular, a demonstration in the MMLU dataset can be divided into five parts: question content, option names (e.g., "A."), option contents (e.g., "114", "27.35"), answer indicator (e.g., "Answer:"), and final answer (e.g., "D"). Based on token reinforcement, we hypothesize that the option names, i.e., "A.", "B.", reinforce outputting "A, B, C, D" via distant reinforcement, while the answer indicator, i.e., "Answer:", reinforces outputting within the label space via successive reinforcement. To validate this hypothesis, we first mask all the questions and option contents in all demonstrations and keep the formatting words, the final answer, and the test query unchanged. Then, we further ablate the option names and answer indicators by replacing them with semantically equivalent substitutes. More experimental details can be found in Appendix B.

We use LLaMA-65B and plot the probability of choosing "A, B, C, D" as the predicted answer. The results are shown on the right of Figure 6. We find that masking the question and option contents, typically regarded as the most informative parts, does not affect the ability to direct the label space; the option names and answer indicators, repeated several times across demonstrations, do the job. Our conclusion is further solidified by ablating the option names and answer indicators individually: we see a large decrease when replacing both, validating our hypothesis.

Figure 7: Ability to follow patterns from demonstrations (x-axis: number of demonstrations; y-axis: probability of the CoT pattern "Let's think step by step."; curves: ICL baseline, + Mask [Question], + Mask [CoT Answer], + Mask //).

Learning to Follow Patterns. Another distinctive feature of in-context learning is the ability to follow the patterns of the demonstrations. This is exemplified by techniques such as few-shot chain-of-thought (CoT) prompting (Wei et al., 2022), frequently employed in LLM reasoning tasks such as GSM8K (Cobbe et al., 2021). Here, we illustrate how the reinforced features of Section 3.1 affect the pattern following of ICL by showing how LLMs follow chain-of-thought demonstrations on the GSM8K grade-school math dataset. Each demonstration follows the form "Question: [Question] // Let's think step by step. // [CoT Answer]". We examine how models learn to produce the CoT pattern, i.e., "Let's think step by step.", and we further discuss the connection between surface patterns and [CoT Answer] in Appendix E.1.

Based on the findings in the previous sections, we hypothesize that the common parts of the demonstrations teach the LLM to generate the CoT pattern. More specifically, "Question:" builds a distant reinforcement, and the newline token "//" builds a successive reinforcement with the CoT pattern. We progressively mask out each part of each demonstration with random tokens to ablate their influence. The probability of the CoT pattern is shown in Figure 7. After masking out "//", the probability gains obtained from demonstrations almost vanish, verifying our hypothesis about successive reinforcement. Another interesting finding is that masking [Question] also reduces the probability of generating the CoT pattern, indicating that the [Question] part builds, to some extent, a connection with the CoT pattern. Since the questions in the demonstrations are different but come from the same group of grade-school math problems, there may be common patterns among them.
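The masking procedure above can be sketched as follows; this is our illustration rather than the paper's code, the mask names are hypothetical, only the separator immediately before the CoT pattern is replaced, and matching length by word count approximates the paper's token-level matching with the model tokenizer.

```python
import random

COT_PATTERN = "Let's think step by step."

def build_gsm8k_demo(question, cot_answer, rng, vocab, mask=frozenset()):
    """Build one demonstration of the form
    "Question: [Question] // Let's think step by step. // [CoT Answer]",
    replacing the parts named in `mask` ("question", "separator", "cot_answer")
    with random vocabulary tokens of roughly the same length."""
    def noise_like(text):
        return " ".join(rng.choice(vocab) for _ in text.split())
    q = noise_like(question) if "question" in mask else question
    sep = noise_like("//") if "separator" in mask else "//"
    ans = noise_like(cot_answer) if "cot_answer" in mask else cot_answer
    return f"Question: {q} {sep} {COT_PATTERN} // {ans}"

rng = random.Random(0)
vocab = ["lorem", "ipsum", "dolor", "sit", "amet"]  # stand-in vocabulary for illustration
print(build_gsm8k_demo("Janet's ducks lay 16 eggs per day. ...",
                       "Janet sells 16 - 3 - 4 = 9 eggs ...",
                       rng, vocab, mask={"question"}))
```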
4.2 DETRIMENTAL EFFECTS

Token reinforcement is not always helpful. In this section, we explore the detrimental consequences that arise from it. As shown in Section 3.1, distant and successive reinforcement are activated by even two random tokens. This can lead to spurious patterns across demonstrations, which might be completely unforeseen by end users. We illustrate this point with two experiments in which we manually construct spurious patterns in the MMLU dataset.

Table 1: Top: effect of non-informative connections (NC); the accuracy of D increases at the cost of A, B, and C. Bottom: effect of reordered answers (RA); with more reordered demonstrations, the outputs lean more toward D. The delta values in parentheses denote the change compared with the zero-shot scenario. "Avg. [A,B,C]" denotes the average accuracy over samples whose gold answer is A, B, or C; "D" denotes the accuracy over samples whose gold answer is D. Marked entries are significantly different from the corresponding ICL baseline (p < 0.05).

Non-informative Connections

| # Demos | 0 | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|---|
| Avg. [A,B,C] | 63.39 | 63.74 (+0.35) | 64.56 (+1.17) | 64.17 (+0.78) | 64.44 (+1.05) | 64.37 (+0.97) |
| Avg. [A,B,C] w/ NC | 63.04 | 61.21 (-1.83) | 62.57 (-0.47) | 63.27 (+0.23) | 63.00 (-0.04) | 63.47 (+0.43) |
| D | 52.28 | 59.65 (+7.37) | 60.00 (+7.72) | 59.30 (+7.02) | 59.88 (+7.60) | 59.53 (+7.25) |
| D w/ NC | 49.47 | 63.63 (+14.15) | 64.09 (+14.62) | 63.51 (+14.04) | 62.57 (+13.10) | 61.64 (+12.16) |

Reordered Answers

| # Demos | 0 | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|---|
| Avg. [A,B,C] | 63.39 | 63.74 (+0.35) | 64.56 (+1.17) | 64.17 (+0.78) | 64.44 (+1.05) | 64.37 (+0.97) |
| Avg. [A,B,C] w/ 50% RA | 63.39 | 64.56 (+1.17) | 64.44 (+1.05) | 64.05 (+0.66) | 63.94 (+0.55) | 63.86 (+0.47) |
| Avg. [A,B,C] w/ 75% RA | 63.39 | 64.52 (+1.13) | 62.65 (-0.74) | 62.53 (-0.86) | 61.95 (-1.44) | 62.34 (-1.05) |
| Avg. [A,B,C] w/ 100% RA | 63.39 | 64.80 (+1.40) | 62.03 (-1.36) | 61.17 (-2.22) | 59.10 (-4.29) | 58.05 (-5.34) |
| D | 52.28 | 59.65 (+7.37) | 60.00 (+7.72) | 59.30 (+7.02) | 59.88 (+7.60) | 59.53 (+7.25) |
| D w/ 50% RA | 52.28 | 59.30 (+7.02) | 60.82 (+8.54) | 61.40 (+9.12) | 61.75 (+9.47) | 61.64 (+9.36) |
| D w/ 75% RA | 52.28 | 59.06 (+6.78) | 62.81 (+10.53) | 65.50 (+13.22) | 67.02 (+14.74) | 66.90 (+14.62) |
| D w/ 100% RA | 52.28 | 58.71 (+6.43) | 66.20 (+13.92) | 71.35 (+19.06) | 75.67 (+23.39) | 77.19 (+24.91) |

Non-informative Connections. Our first approach adds connections between a phrase and a certain choice. We append a reasonable but non-informative phrase, such as "Please kindly provide your answer." or "Looking forward to your choice.", right before the template "Answer:" each time the question's gold answer is "D". At test time, we also append the same phrase and check whether the outputs are steered toward "D". By doing so, we construct a distant reinforcement loop from the non-informative phrase to the answer "D". We ensure the test set is balanced, with equal numbers of questions having gold answers "A, B, C, D", and we report the accuracies in the top of Table 1. We first see a gap even without demonstrations, where adding non-informative phrases lowers the accuracy of choice D; we further discuss the selection bias of different choices (Zheng et al., 2023) in the Appendix. We then see that the non-informative connection overcomes the selection bias and significantly elevates the accuracy of choice D by a noticeable margin, at the cost of the accuracy of A, B, and C. These results show the potential risk of manually injecting spurious connections and directing in-context learning toward unintended outcomes.

Answer Indicator Connections. In our second experiment, we show that reinforcing the connection between "Answer:" and a certain choice, e.g., D, steers the outputs. To this end, we randomly replace r percent of the demonstrated answers with "D". Simultaneously, we exchange the option contents of the original gold answer and D to keep the demonstrations valid. In this way, we gradually reinforce the connection between "Answer:" and "D" via successive reinforcement. The results are presented in the bottom of Table 1. Our baseline corresponds to r = 25%, where the demonstrations are balanced across the four options. With an increased ratio of answer D in the demonstrations, the accuracy of D improves substantially, from 0.52 to 0.78, while the accuracy of A, B, and C decreases from 0.63 to 0.58.
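A sketch of the two constructions above is given below; it is our paraphrase, the phrase constant and function names are illustrative, and reproducing the full setting additionally requires applying the reordering to a randomly chosen r% of the demonstrations.

```python
NON_INFORMATIVE_PHRASE = "Please kindly provide your answer."

def add_non_informative_connection(demo_text: str, gold_letter: str) -> str:
    """Insert the non-informative phrase right before "Answer:" whenever the
    demonstration's gold answer is "D", building a distant phrase -> "D" connection."""
    if gold_letter == "D":
        return demo_text.replace("Answer:", NON_INFORMATIVE_PHRASE + "\nAnswer:", 1)
    return demo_text

def reorder_answer_to_d(option_contents: list, gold_idx: int):
    """Swap the gold option's content with option D's content and relabel the gold
    answer as "D", reinforcing the successive "Answer:" -> "D" connection."""
    contents = list(option_contents)
    contents[3], contents[gold_idx] = contents[gold_idx], contents[3]
    return contents, "D"
```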
Our findings corroborate those of An et al. (2023), who show how diversity affects the performance of in-context learning, and demonstrate how unbalanced demonstrations give certain outputs an unfair advantage.

Discussion. (1) Reinforcements can be the underlying reason for ICL, but they also cause its vulnerability. (2) Our observations provide guidance on how to build demonstrations that maximize ICL effectiveness: demonstrations should be balanced across all possible output labels and kept concise, without introducing unnecessary reinforcement.

5 RELATED WORK

Explaining In-Context Learning. A range of contributions has deepened our understanding of in-context learning (ICL). Chan et al. (2022) and Xie et al. (2022) explore the emergence of ICL from the perspectives of training data and Bayesian inference, respectively. Implicit learning over demonstrations is further highlighted by Garg et al. (2023) and Li et al. (2023), and Hahn & Goyal (2023) theoretically show that ICL performance can be characterized by a complexity measure under which repetition structures are represented by a small PCFG tree. The similarity between gradient descent learners and in-context learners is demonstrated by Akyürek et al. (2023) and von Oswald et al. (2023), while Dai et al. (2023) explain language models as meta-optimizers and liken ICL to implicit finetuning. Our work differs from this line of research by taking a novel perspective via repetitions and reinforced features, and our findings could potentially explain the mechanism by which LLMs achieve implicit learning; for instance, the step-by-step reinforcement across demonstrations intuitively resembles the gradient descent process of ICL described in previous work. Olsson et al. (2022) introduce induction heads that copy patterns and provide evidence for their relationship with ICL. In contrast, our work investigates the LLM as a whole (Anderson, 1972), studies more sophisticated patterns, views ICL as a combination of reinforcements, and scrutinizes both the benefits and drawbacks of reinforcement.

Analyzing In-Context Learning. Several studies have analyzed the properties of ICL. Min et al. (2022) and Yoo et al. (2022) identify and discuss key factors that influence ICL capability, such as input-label mappings, and Wei et al. (2023) propose that learning input-label mappings is an emergent ability. Factors such as structural similarity, diversity, and simplicity (An et al., 2023), as well as ordering and embedding distribution (Lu et al., 2022; Liu et al., 2022), have also been investigated. Pan et al. (2023) partition ICL ability into task recognition and task learning, observing different phenomena at different model sizes. Lastly, Si et al. (2023) unveil the presence of inductive biases in ICL by designing underspecified demonstrations. Our findings corroborate multiple previous analyses of in-context learning. For example, the scaling behavior of distant reinforcement echoes the findings of Wei et al. (2023) and Pan et al. (2023) on varying model sizes, and the importance of demonstration ordering and diversity in An et al. (2023) and Lu et al. (2022) can be explained as avoiding spurious connections.

Repetitive Generation and Self-Reinforcement Effect. Repetition is a notorious issue in neural text generation, affecting tasks such as open-ended and directed generation (Holtzman et al., 2019; Welleck et al., 2019; Lin et al., 2021; See et al., 2017; Liu & Lapata, 2019).
Maximization-based decoding strategies lead to bland, consecutive repetitions at the word, phrase, and sentence levels (Holtzman et al., 2019; Welleck et al., 2019; Li et al., 2016; Karpathy & Fei-Fei, 2015; Guan et al., 2021). Despite advancements in large-scale pre-training with the Transformer architecture (Vaswani et al., 2017; Radford et al., 2019; Lewis et al., 2020), unexpected sentence-level repetitions persist (Radford et al., 2019; Brown et al., 2020; Fu et al., 2020). The repetition issue is puzzling given the lack of repetitive sentences in the training data. A series of studies investigate the cause, from both theoretical (Fu et al., 2020) and empirical (Holtzman et al., 2019) perspectives. Recently, Xu et al. (2022) proposed the self-reinforcement effect, which suggests a repetitive loop when combined with maximization decoding. Our study extends this effect to large language models and token-level reinforcement, explains its causes, and connects the notable ability of in-context learning to this notorious issue.

6 CONCLUSION

We have taken a novel feature-centric approach to understanding in-context learning by exploring its relationship with repetitive generation. We have identified a key mechanism, the token reinforcement loop, whereby any two tokens can form a strong connection through multiple co-occurrences. We delve into the reasons and inner workings of token reinforcement, demonstrating it to be an inevitable result of maximizing likelihood. Based on our findings, we view in-context learning as a combination of token reinforcements of different strengths rather than as an input-label mapping. Furthermore, we conduct experiments demonstrating that token reinforcement plays a crucial role in shaping the output space and following patterns in in-context learning. We also illustrate through various studies how token reinforcement leads to spurious connections, highlighting the role of in-context learning as a double-edged sword, where well-informed demonstrations can maximize the ICL effect.

ACKNOWLEDGEMENT

This publication has emanated from research conducted with the financial support of both the National Natural Science Foundation of China Key Program under Grant Number 62336006 and the "Pioneer and Leading Goose" R&D Program of Zhejiang under Grant Number 2022SDXHDX0003.

REFERENCES

Ekin Akyürek, Dale Schuurmans, Jacob Andreas, Tengyu Ma, and Denny Zhou. What learning algorithm is in-context learning? Investigations with linear models, 2023.

Shengnan An, Zeqi Lin, Qiang Fu, Bei Chen, Nanning Zheng, Jian-Guang Lou, and Dongmei Zhang. How do in-context examples affect compositional generalization?, 2023.

Philip W Anderson. More is different: Broken symmetry and the nature of the hierarchical structure of science. Science, 177(4047):393-396, 1972.

Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in Neural Information Processing Systems, 33:1877-1901, 2020.

Stephanie C. Y. Chan, Adam Santoro, Andrew K. Lampinen, Jane X. Wang, Aaditya Singh, Pierre H. Richemond, Jay McClelland, and Felix Hill. Data distributional properties drive emergent in-context learning in transformers, 2022.
Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, et al. PaLM: Scaling language modeling with pathways. arXiv preprint arXiv:2204.02311, 2022.

Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, et al. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168, 2021.

Damai Dai, Yutao Sun, Li Dong, Yaru Hao, Shuming Ma, Zhifang Sui, and Furu Wei. Why can GPT learn in-context? Language models implicitly perform gradient descent as meta-optimizers, 2023.

Zihao Fu, Wai Lam, Anthony Man-Cho So, and Bei Shi. A theoretical analysis of the repetition problem in text generation. arXiv preprint arXiv:2012.14660, 2020.

Shivam Garg, Dimitris Tsipras, Percy Liang, and Gregory Valiant. What can transformers learn in-context? A case study of simple function classes, 2023.

Jian Guan, Xiaoxi Mao, Changjie Fan, Zitao Liu, Wenbiao Ding, and Minlie Huang. Long text generation by modeling sentence-level and discourse-level coherence. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pp. 6379-6393, 2021.

Michael Hahn and Navin Goyal. A theory of emergent in-context learning as implicit structure induction, 2023.

Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding. In International Conference on Learning Representations, 2021. URL https://openreview.net/forum?id=d7KBjmI3GmQ.

hiyouga. LLaMA Factory. https://github.com/hiyouga/LLaMA-Factory, 2023.

Ari Holtzman, Jan Buys, Li Du, Maxwell Forbes, and Yejin Choi. The curious case of neural text degeneration. In International Conference on Learning Representations, 2019.

Albert Q. Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, Lélio Renard Lavaud, Marie-Anne Lachaux, Pierre Stock, Teven Le Scao, Thibaut Lavril, Thomas Wang, Timothée Lacroix, and William El Sayed. Mistral 7B, 2023.

Andrej Karpathy and Li Fei-Fei. Deep visual-semantic alignments for generating image descriptions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3128-3137, 2015.

Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Veselin Stoyanov, and Luke Zettlemoyer. BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 7871-7880, 2020.

Jiwei Li, Will Monroe, and Dan Jurafsky. A simple, fast diverse decoding algorithm for neural generation. arXiv preprint arXiv:1611.08562, 2016.

Yingcong Li, M. Emrullah Ildiz, Dimitris Papailiopoulos, and Samet Oymak. Transformers as algorithms: Generalization and stability in in-context learning, 2023.

Xiang Lin, Simeng Han, and Shafiq Joty. Straight to the gradient: Learning to use novel tokens for neural text generation. In International Conference on Machine Learning, pp. 6642-6653. PMLR, 2021.
Jiachang Liu, Dinghan Shen, Yizhe Zhang, Bill Dolan, Lawrence Carin, and Weizhu Chen. What makes good in-context examples for GPT-3? In Proceedings of Deep Learning Inside Out (DeeLIO 2022): The 3rd Workshop on Knowledge Extraction and Integration for Deep Learning Architectures, pp. 100-114, Dublin, Ireland and Online, May 2022. Association for Computational Linguistics. doi: 10.18653/v1/2022.deelio-1.10. URL https://aclanthology.org/2022.deelio-1.10.

Yang Liu and Mirella Lapata. Text summarization with pretrained encoders. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pp. 3730-3740, 2019.

Yao Lu, Max Bartolo, Alastair Moore, Sebastian Riedel, and Pontus Stenetorp. Fantastically ordered prompts and where to find them: Overcoming few-shot prompt order sensitivity. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 8086-8098, Dublin, Ireland, May 2022. Association for Computational Linguistics. doi: 10.18653/v1/2022.acl-long.556. URL https://aclanthology.org/2022.acl-long.556.

Stephen Merity, Caiming Xiong, James Bradbury, and Richard Socher. Pointer sentinel mixture models. In International Conference on Learning Representations, 2016.

Sewon Min, Xinxi Lyu, Ari Holtzman, Mikel Artetxe, Mike Lewis, Hannaneh Hajishirzi, and Luke Zettlemoyer. Rethinking the role of demonstrations: What makes in-context learning work?, 2022.

Catherine Olsson, Nelson Elhage, Neel Nanda, Nicholas Joseph, Nova DasSarma, Tom Henighan, Ben Mann, Amanda Askell, Yuntao Bai, Anna Chen, et al. In-context learning and induction heads. arXiv preprint arXiv:2209.11895, 2022.

OpenAI. GPT-4 technical report, 2023.

Jane Pan, Tianyu Gao, Howard Chen, and Danqi Chen. What in-context learning "learns" in-context: Disentangling task recognition and task learning, 2023.

Alec Radford, Jeff Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. Language models are unsupervised multitask learners. 2019.

Abigail See, Peter J Liu, and Christopher D Manning. Get to the point: Summarization with pointer-generator networks. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 1073-1083, 2017.

Chenglei Si, Dan Friedman, Nitish Joshi, Shi Feng, Danqi Chen, and He He. Measuring inductive biases of in-context learning with underspecified demonstrations, 2023.

Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023a.
Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, Dan Bikel, Lukas Blecher, Cristian Canton Ferrer, Moya Chen, Guillem Cucurull, David Esiobu, Jude Fernandes, Jeremy Fu, Wenyin Fu, Brian Fuller, Cynthia Gao, Vedanuj Goswami, Naman Goyal, Anthony Hartshorn, Saghar Hosseini, Rui Hou, Hakan Inan, Marcin Kardas, Viktor Kerkez, Madian Khabsa, Isabel Kloumann, Artem Korenev, Punit Singh Koura, Marie-Anne Lachaux, Thibaut Lavril, Jenya Lee, Diana Liskovich, Yinghai Lu, Yuning Mao, Xavier Martinet, Todor Mihaylov, Pushkar Mishra, Igor Molybog, Yixin Nie, Andrew Poulton, Jeremy Reizenstein, Rashi Rungta, Kalyan Saladi, Alan Schelten, Ruan Silva, Eric Michael Smith, Ranjan Subramanian, Xiaoqing Ellen Tan, Binh Tang, Ross Taylor, Adina Williams, Jian Xiang Kuan, Puxin Xu, Zheng Yan, Iliyan Zarov, Yuchen Zhang, Angela Fan, Melanie Kambadur, Sharan Narang, Aurelien Rodriguez, Robert Stojnic, Sergey Edunov, and Thomas Scialom. Llama 2: Open foundation and fine-tuned chat models, 2023b.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in Neural Information Processing Systems, 30, 2017.

Johannes von Oswald, Eyvind Niklasson, Ettore Randazzo, João Sacramento, Alexander Mordvintsev, Andrey Zhmoginov, and Max Vladymyrov. Transformers learn in-context by gradient descent, 2023.

Ben Wang and Aran Komatsuzaki. GPT-J-6B: A 6 billion parameter autoregressive language model. https://github.com/kingoflolz/mesh-transformer-jax, May 2021.

Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems, 35:24824-24837, 2022.

Jerry Wei, Jason Wei, Yi Tay, Dustin Tran, Albert Webson, Yifeng Lu, Xinyun Chen, Hanxiao Liu, Da Huang, Denny Zhou, and Tengyu Ma. Larger language models do in-context learning differently, 2023.

Sean Welleck, Ilia Kulikov, Stephen Roller, Emily Dinan, Kyunghyun Cho, and Jason Weston. Neural text generation with unlikelihood training. In International Conference on Learning Representations, 2019.

Patrick H Winston. Learning and reasoning by analogy. Communications of the ACM, 23(12):689-703, 1980.

Sang Michael Xie, Aditi Raghunathan, Percy Liang, and Tengyu Ma. An explanation of in-context learning as implicit Bayesian inference. In International Conference on Learning Representations, 2022. URL https://openreview.net/forum?id=RdJVFCHjUMI.

Jin Xu, Xiaojiang Liu, Jianhao Yan, Deng Cai, Huayang Li, and Jian Li. Learning to break the loop: Analyzing and mitigating repetitions for neural text generation, 2022.

Kang Min Yoo, Junyeob Kim, Hyuhng Joon Kim, Hyunsoo Cho, Hwiyeol Jo, Sang-Woo Lee, Sang-goo Lee, and Taeuk Kim. Ground-truth labels matter: A deeper look into input-label demonstrations, 2022.

Susan Zhang, Stephen Roller, Naman Goyal, Mikel Artetxe, Moya Chen, Shuohui Chen, Christopher Dewan, Mona Diab, Xian Li, Xi Victoria Lin, et al. OPT: Open pre-trained transformer language models. arXiv preprint arXiv:2205.01068, 2022.

Chujie Zheng, Hao Zhou, Fandong Meng, Jie Zhou, and Minlie Huang. On large language models' selection bias in multi-choice questions. arXiv preprint arXiv:2309.03882, 2023.
Yukun Zhu, Ryan Kiros, Rich Zemel, Ruslan Salakhutdinov, Raquel Urtasun, Antonio Torralba, and Sanja Fidler. Aligning books and movies: Towards story-like visual explanations by watching movies and reading books. In The IEEE International Conference on Computer Vision (ICCV), December 2015.

Table of Contents

A Limitations
B Experimental Details
  B.1 Dataset Descriptions
  B.2 Experimental Details
C Further Investigation of Reasons and Inner-Workings for Token Co-occurrence Reinforcement
D Supporting Experiments of Token Co-occurrence Reinforcement
  D.1 Phrase Co-occurrence Reinforcement
  D.2 Token Reinforcement of OPT Models
  D.3 Token Reinforcement of Other LLMs
  D.4 Token Reinforcement on Other Datasets
  D.5 Semantic Relationships of Token Reinforcement
  D.6 Improved Ratio of Token Reinforcement
E Other Experiments on Detrimental Effects
  E.1 Generation of CoT Answer
  E.2 Selection Bias of LLMs

A LIMITATIONS

While this study provides insightful findings in the field of in-context learning, some limitations should be noted. First, the experiments in this work restrict themselves to repeating surface patterns; more sophisticated patterns are not discussed. Second, our work mainly focuses on revealing token co-occurrence reinforcement and understanding its influence on in-context learning. Its detrimental influence on in-context learning suggests that resolving spurious connections would benefit either chain-of-thought or in-context learning.

B EXPERIMENTAL DETAILS

B.1 DATASET DESCRIPTIONS

In this paper, we mainly use the following five datasets; we introduce each of them and describe our preprocessing individually. We present cases from each dataset to illustrate their characteristics in Table 2.

Wikitext-103. The Wikitext-103 dataset, introduced by Merity et al. (2016)⁴, is a language modeling dataset that contains a collection of over 100 million tokens extracted from the set of verified Good and Featured articles on Wikipedia. We use a randomly sampled collection of 1,000 sentences provided by Xu et al. (2022)⁵. The provided version is pre-tokenized into words, and we use the Moses detokenizer⁶ to restore the untokenized version for compatibility with the Transformers tokenizer.

⁴ https://blog.salesforceairesearch.com/the-wikitext-long-term-dependency-language-modeling
⁵ https://github.com/Jxu-Thu/DITTO/blob/main/data/wiki_sentences.txt
⁶ https://github.com/moses-smt/mosesdecoder/blob/master/scripts/tokenizer/detokenizer.perl
Table 2: Datasets and cases.

| Dataset | Number of Cases | Example |
|---|---|---|
| Wikitext-103 | 1000 | The Bintulu Government Secondary School was built in 1964. |
| BookCorpus | 1000 | A little, he admitted. |
| Random | 1000 | Green Ou incarcer hijab ura na Edmonton regardless iken Mayor |
| GSM8K | - | Question: Janet's ducks lay 16 eggs per day. She eats three for breakfast every morning and bakes muffins for her friends every day with four. She sells the remainder at the farmers' market daily for $2 per fresh duck egg. How much in dollars does she make every day at the farmers' market? Let's think step by step. Janet sells 16 - 3 - 4 = <<16-3-4=9>>9 duck eggs a day. She makes 9 * 2 = $<<9*2=18>>18 every day at the farmers' market. ### 18 |
| MMLU | - | As more lamps are connected in parallel in a circuit, the current in the power source A. increases B. decreases C. remains the same D. Not enough information to say Answer: A |

Table 3: Semantically equivalent substitutes.

| Category | Original | Substitutes |
|---|---|---|
| Option Names | A.; B.; C.; D. | I.; II.; III.; IV. / E.; F.; G.; H. / 1.; 2.; 3.; 4. / (a).; (b).; (c).; (d). |
| Answer Indicator | Answer: | Solution; Reply; Response; Result; Choice |

BookCorpus. BookCorpus (Zhu et al., 2015) was originally introduced to align books to their movie releases in order to provide rich descriptive explanations of visual content. It contains 74M sentences from various sources of books; we randomly sample 1,000 sentences and also detokenize them with Moses.

Randomly Generated Sentences. Here, we follow Xu et al. (2022) and construct 1,000 randomly generated sentences. For each sentence, we first sample a random length between 5 and 10 tokens and then sample the tokens uniformly from the whole vocabulary, so that all symbols are equally likely to appear, except the special tokens.

GSM8K. GSM8K (Cobbe et al., 2021) is a dataset of high-quality grade school math problems created by human problem writers. The dataset contains 7,500 training problems and 1,000 test problems, each requiring between 2 and 8 steps to solve. It is a frequently used benchmark for evaluating the reasoning ability of large language models (Touvron et al., 2023a; Chowdhery et al., 2022). To analyze the chain-of-thought (Wei et al., 2022) behavior of LLMs, we add "Let's think step by step." after the question and right before the answer.

MMLU. The MMLU dataset (Hendrycks et al., 2021) is another commonly used benchmark for evaluating the knowledge of large language models. It covers 57 subjects across STEM, the humanities, the social sciences, and more. It ranges in difficulty from an elementary level to an advanced professional level, and it tests both world knowledge and problem-solving ability. We randomly sample 20 test cases from each task, resulting in 1,140 test queries in total. For the experiments on detrimental effects (Section 4.2), we uniformly redistribute the answers and options to isolate the selection bias of LLMs.

B.2 EXPERIMENTAL DETAILS

Random Substitutes. Throughout the paper, we adopt random substitutes to isolate the effects of tokens, specific formats, and other components. The random substitutions are conducted in the following manner. To avoid the effect of differing sequence lengths, we first tokenize the original sentence or demonstration with the Transformers tokenizer. Then, we replace the part to be substituted with random tokens from the corresponding vocabulary. When sampling substitutes, we exclude the special tokens to ensure that all random tokens are valid.
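The substitution protocol above can be sketched as follows; this is a minimal illustration rather than the released code, and it assumes the span to be replaced is given as a string and `tokenizer` is a Hugging Face tokenizer.

```python
import random

def random_substitute(tokenizer, text: str, span: str) -> str:
    """Replace `span` inside `text` with random, non-special vocabulary tokens of the
    same tokenized length, so the overall token count is preserved."""
    span_ids = tokenizer(span, add_special_tokens=False).input_ids
    special = set(tokenizer.all_special_ids)
    valid_ids = [i for i in range(tokenizer.vocab_size) if i not in special]
    random_ids = [random.choice(valid_ids) for _ in span_ids]
    return text.replace(span, tokenizer.decode(random_ids), 1)
```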
Self-Reinforcement. For each experiment, we report results averaged across three runs with different seeds. We mainly conduct our experiments on two model families, LLaMA and OPT. For LLaMA, we use four models sized 7B, 13B, 30B, and 65B; for OPT, we use seven models with sizes 125M, 350M, 1.3B, 2.7B, 6.7B, 13B, and 30B. Each sentence is repeated 20 times in our experiments. Following Xu et al. (2022), we randomly concatenate a prefix before all repeated sentences.

In-Context Learning. In the MMLU experiments, we replace option names and answer indicators to study the importance of token reinforcement in directing the output space. Specifically, for each replacement, we choose semantically equivalent substitutes from a pool and randomly replace the original option name or answer indicator with the chosen substitute. In this way, we break the token reinforcement from the demonstrations. The pool of replacements is given in Table 3.

Significance Test. We conduct significance tests using paired t-tests: we randomly split the test set into 5 folds and compute significance over the accuracies on these 5 subsets. In Table 1, we compute the significance levels for [A, B, C] and D separately.

C FURTHER INVESTIGATION OF REASONS AND INNER-WORKINGS FOR TOKEN CO-OCCURRENCE REINFORCEMENT

In Section 3.2, we investigate the reason for token reinforcement, namely that it is inherently embedded within the pretraining corpus. Here, we further investigate token reinforcement with two more experiments, from the perspectives of the learning process and the attention mechanism, respectively.

Figure 8: Pretraining losses when masking or not masking the repetitive features (x-axis: training steps; curves: Full Context, Mask Random, Mask Repeat).

Leveraging repetitive features helps maximize likelihood. Our next experiment shows that the utilization of repetitive features is a natural choice for LLMs, as it helps in the pre-training stage. To illustrate this, we focus on a conditional language modeling task that predicts tokens based on a given prompt. Formally, we model the probability $P(w_{T_p:T} \mid w_{<T_p})$