# In-Context Language Learning: Architectures and Algorithms

Ekin Akyürek¹, Bailin Wang¹, Yoon Kim¹, Jacob Andreas¹

¹MIT CSAIL. Correspondence to: Ekin Akyürek.

*Proceedings of the 41st International Conference on Machine Learning, Vienna, Austria. PMLR 235, 2024. Copyright 2024 by the author(s).*

## Abstract

Some neural language models (LMs) exhibit a remarkable capacity for in-context learning (ICL): they can fit predictors to datasets provided as input. While the mechanisms underlying ICL are well studied in the context of synthetic problems like in-context linear regression, there is still some divergence between these model problems and the real ICL exhibited by LMs trained on large text corpora. In this paper, we study ICL through the lens of a new family of model problems we term in-context language learning (ICLL). In ICLL, LMs are presented with a set of strings from a formal language, and must generate additional strings from the same language. We focus on in-context learning of regular languages generated by random finite automata. We evaluate a diverse set of neural sequence models on regular ICLL tasks. We first show that Transformers significantly outperform neural sequence models with recurrent or convolutional representations on ICLL tasks. Next, we provide evidence that they do so by computing in-context n-gram statistics using specialized attention heads. Finally, we show that hard-wiring these heads into neural models improves performance not just on synthetic ICLL but also on natural language modeling, reducing the perplexity of 340M-parameter Transformers by up to 1.14 points (6.7%) on the SlimPajama dataset. Our results highlight the usefulness of in-context formal language learning as a tool for understanding ICL in models of natural text.

## 1. Introduction

One of the most striking features of modern neural language models is their capacity for in-context learning (ICL): the ability to infer a conditional or unconditional distribution over natural language strings simply by performing next-token prediction following a sequence of examples from the distribution of interest. ICL is a crucial tool for steering large pre-trained language models (LMs), and a growing body of work aims to understand when and how these LMs perform ICL. Because of the complexity of large-scale LMs trained on natural text (and the lack of public information about many LMs' training data), almost all work on understanding ICL has focused on smaller LMs trained on simple model problems like in-context linear regression (Garg et al., 2022), character classification (Chan et al., 2022), and associative recall (Fu et al., 2023). Despite their simplicity, these model problems have played a key role in identifying properties (and limitations) of ICL in current LMs. However, there remains a significant gap between these model problems and the capabilities exhibited by large-scale LMs. In particular, most model problems require relatively simple forms of learning: computing a fixed function of the entire training set (Akyürek et al., 2023; von Oswald et al., 2023a;b), or retrieving a single example relevant to the current input (Fu et al., 2023). In contrast, natural LMs exhibit richer and much more varied forms of ICL, in some cases producing structured generative models of text or code from a handful of inputs (Shin & Van Durme, 2022; Drozdov et al., 2023). How can we systematically study these more complex forms of ICL?
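To make the ICLL setup described above concrete before it is formalized below, here is a minimal, hypothetical sketch of how one might build an ICLL prompt from a random finite automaton. The function names, alphabet, out-degree, and sampling convention (uniform random walks of bounded length) are illustrative assumptions, not the paper's exact data-generation procedure.

```python
import random

def random_automaton(n_states=4, alphabet="abc", out_degree=2, rng=None):
    """Randomly wire an automaton: each state gets `out_degree` outgoing
    edges, each labelled with a random symbol and pointing to a random state.
    The language is the set of label sequences of walks from state 0."""
    rng = rng or random.Random(0)
    return {
        s: [(rng.choice(alphabet), rng.randrange(n_states))
            for _ in range(out_degree)]
        for s in range(n_states)
    }

def sample_string(automaton, max_len=8, rng=None):
    """Generate one string by a uniform random walk over the automaton,
    emitting the label of each edge taken."""
    rng = rng or random.Random()
    state, out = 0, []
    for _ in range(rng.randint(1, max_len)):
        symbol, state = rng.choice(automaton[state])
        out.append(symbol)
    return "".join(out)

def make_icll_prompt(n_examples=10, sep=" | "):
    """Concatenate sampled strings into one prompt; an in-context learner
    should continue with further strings from the same (hidden) language."""
    rng = random.Random(42)
    automaton = random_automaton(rng=rng)
    strings = [sample_string(automaton, rng=rng) for _ in range(n_examples)]
    return sep.join(strings) + sep

print(make_icll_prompt())
```

Under this kind of setup, each prompt is drawn from a different random automaton, and the learner must assign high probability to continuations consistent with the hidden language rather than memorize any single language during training.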
In this paper, we introduce a new family of model ICL problems that we collectively term in-context language learning (ICLL). In ICLL, LMs are prompted with a finite collection of strings from an unknown formal language, and must infer the distribution over strings corresponding to the full language (Figure 1). ICLL exercises essential features of ICL in natural models: it involves structured outputs, probabilistic predictions, and algorithmic reasoning about input data. In this paper, we present a focused study of ICLL in regular languages, the class of formal languages generated by finite automata. We begin by providing general background about neural sequence models, ICL, and formal languages in Section 2, then define the ICLL task in Section 3. Next, we explore three questions about in-context language learning in neural sequence models:¹

- **Q1: Which model classes can learn to perform ICLL accurately? (Section 4)** We find that Transformers significantly outperform recurrent and convolutional LMs at in-context language learning, even when these different architectures perform comparably on other problems. Models with efficient convolutional parameterizations perform especially poorly on ICLL tasks.

- **Q2: What algorithmic solutions do successful in-context language learners implement? (Section 5)** Transformer predictions on ICLL with regular languages are well approximated by smoothed n-gram models. Transformers develop "n-gram heads": higher-order variants of induction heads previously described in LMs (Olsson et al., 2022). Compared to other model architectures, Transformers better encode in-context n-gram counts in their hidden representations.

- **Q3: Can we improve neural models using our understanding of how they perform ICLL? (Section 6)** Hard-wiring Transformers, RNNs, and convolutional models with n-gram heads improves their performance on ICLL. These heads are not just useful for ICLL: when equipped with n-gram heads, neural sequence models of all classes exhibit perplexity improvements of up to 6.7% on natural language modeling tasks.

Our results highlight the usefulness of ICLL as a model problem, not only as a tool for research on ICL but also as a source of insight about architectural features that can improve language modeling in the real world. Many aspects of ICLL, even with regular languages, remain to be understood (e.g., learning dynamics and out-of-distribution generalization). Beyond these, future work might study ICLL in more expressive languages (e.g., context-free or context-sensitive languages), offering a path toward an understanding of even more complex behaviors in natural LMs.

¹Code & data are released at github.com/berlino/seq_icl.

## 2. Background

### 2.1. Neural sequence modeling

Much of modern machine learning for natural language processing is concerned with building general-purpose tools for sequence prediction, in which we wish to place a distribution over strings $x$. Very often this is done via a product of conditional distributions over tokens: $p(x) = \prod_i p(x_i \mid x_{<i})$.
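As a toy, self-contained illustration of this factorization (and a preview of the smoothed in-context n-gram predictors discussed in Section 5), the sketch below scores a token sequence by summing conditional log-probabilities. The add-alpha-smoothed bigram estimator and all names here are illustrative assumptions rather than the models studied in the paper.

```python
import math
from collections import Counter, defaultdict

def sequence_log_prob(tokens, cond_prob):
    """Autoregressive factorization: log p(x) = sum_i log p(x_i | x_{<i})."""
    return sum(math.log(cond_prob(tokens[:i], tokens[i]))
               for i in range(len(tokens)))

def smoothed_bigram(context_tokens, vocab, alpha=1.0):
    """Estimate p(next token | previous token) from in-context bigram counts
    with add-alpha smoothing (a toy stand-in for a learned conditional model)."""
    counts = defaultdict(Counter)
    for prev, nxt in zip(context_tokens, context_tokens[1:]):
        counts[prev][nxt] += 1

    def cond_prob(prefix, token):
        prev = prefix[-1] if prefix else "<s>"
        c = counts[prev]
        return (c[token] + alpha) / (sum(c.values()) + alpha * len(vocab))

    return cond_prob

# Score a candidate continuation under bigram statistics gathered in context.
context = list("abab|abba|abab|")
vocab = sorted(set(context))
cond = smoothed_bigram(context, vocab)
print(sequence_log_prob(list("ab|"), cond))
```

The same factorization underlies all of the neural sequence models compared in this paper; they differ only in how the conditional $p(x_i \mid x_{<i})$ is parameterized.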