# language_generation_in_the_limit__352eed46.pdf Language Generation in the Limit Jon Kleinberg Departments of Computer Science and Information Sciene Cornell University Ithaca NY Sendhil Mullainathan Booth School of Business University of Chicago Chicago IL Although current large language models are complex, the most basic specifications of the underlying language generation problem itself are simple to state: given a finite set of training samples from an unknown language, produce valid new strings from the language that don t already appear in the training data. Here we ask what we can conclude about language generation using only this specification, without further assumptions. In particular, suppose that an adversary enumerates the strings of an unknown target language L that is known only to come from one of a possibly infinite list of candidates. A computational agent is trying to learn to generate from this language; we say that the agent generates from L in the limit if after some finite point in the enumeration of L, the agent is able to produce new elements that come exclusively from L and that have not yet been presented by the adversary. Our main result is that there is an agent that is able to generate in the limit for every countable list of candidate languages. This contrasts dramatically with negative results due to Gold and Angluin in a well-studied model of language learning where the goal is to identify an unknown language from samples; the difference between these results suggests that identifying a language is a fundamentally different problem than generating from it. 1 Introduction The recent advances in large language models (LLMs) have been remarkable, sparking active lines of theoretical work into their performance. These investigations implicitly revolve around two fundamental questions: how do we formally reason about the effectiveness of LLMs; and within such a framework, what are the core ideas at a mathematical level that enable their performance? Answers to these questions must begin by formalizing the specification for what a generative algorithm for language should be doing. Here, we propose starting from a very basic, assumption-free, statement for such a specification: there is an unknown target language L, over time the algorithm sees a sequence of strings from L, and eventually we would like the algorithm to generate new strings from L that it has not seen before.1 Viewed this way, it is also clear why it seems so remarkable for LLMs to be doing well at such a problem. The fully general statement of the problem feels unsolvable: if we know nothing about the unknown target language L, then how can a generative algorithm reliably produce valid strings from L that it hasn t seen before? 1We will formalize these concepts more precisely below, but for now we can think of a language as simply any set of strings over a fixed alphabet; for example, the strings of the language could be the set of all grammatical sentences (or all well-formed expressions) according to a given grammar. 38th Conference on Neural Information Processing Systems (Neur IPS 2024). Language Learning in the Limit. In fact, there is a well-established formalism that allows us to phrase this point precisely: the classical model of language learning in the limit, formulated by Mark Gold in 1967 and fully characterized by Dana Angluin in 1980 [6, 2]. In this model, there is an unknown language K that is known only to be produced by one of a list of candidate representations R1, R2, R3, . . ., where Li is the language produced by representation Ri. We can think of this list of representations as the set of all possible context-free grammars, or the set of all possible finite automata, or the set of all Turing machines with a fixed space bound, or any other generative model that produces strings; in fact, the formal result is much more general than this, in that it is sufficient to suppose that the unknown language K simply comes from a countable list of candidate languages L1, L2, L3, . . ., and we can dispense with explicit representations altogether.2 In the Gold-Angluin model, an adversary enumerates the strings of K one by one, and the algorithm is required after each new string to guess a language Li from the list such that Li = K. If there is some finite step t after which the algorithm s guess is always correct, then we say the algorithm has identified K in the limit. Gold proved that this is impossible in general, even for simple language families such as the regular languages (i.e. those produced by finite automata), and Angluin characterized precisely those families for which it is possible, further establishing how limited they are [1, 2]. Note, crucially, that in the Gold-Angluin model, the adversary enumerates strings in K, but does not provide examples of strings that do not belong to K, nor does it allow the algorithm to ask questions about a string s membership in K; their point with this formalism was to focus on cases where an agent tries inferring a language purely from seeing a sufficient number of examples of strings that belong to the language. (In Section A.1, we provide a self-contained proof of the negative result for identification in the limit; while not strictly necessary for our results, we find it provides useful background and context for the problem.) Our Results: Language Generation in the Limit. These negative results of Gold and Angluin feel intuitive how should we be able to identify a language from a finite sample when we are allowed to make essentially no assumptions about its structure? Because of this intuition, both via the Gold-Angluin model and for more informal reasons as well, the focus in language generation has naturally turned to distributional assumptions; one posits that large language models are so effective because they are able to exploit distributional probabilities of language, and from a finite set of samples they are able to estimate conditional probabilities of strings with increasing accuracy. In this way, the question moves from adversaries to probability distributions, and one seeks explanations for the effectiveness of LLMs through underlying probabilistic models. In this paper, we offer a sharply different view: we show that in the Gold-Angluin model of adversarially produced examples, language generation is always possible. We will provide full details on the result and its proof beginning in the next section, but the key point at a high level is that even in an adversarial model with an unknown language K, language generation is a fundamentally different task than language identification: where identification asks an algorithm to eventually name a language Li = K after seeing a large enough finite sample S from K, generation instead asks an algorithm to eventually output strings in K S after seeing a large enough S from K. Our main result is that this difference in specifications leads to dramatic differences in what is possible in the limit; whereas the Gold-Angluin results establish that identification in the limit is impossible except in highly constrained cases, we show that generation in the limit is possible for every countable list of candidate languages. General Connections to Language Modeling. Our approach emphasizing theoretical properties of language generation and worst-case guarantees, in the style of the Gold-Angluin model, is a source of limitations but also a source of generality; it is therefore important to discuss how we draw stylized insights about language modeling from our theoretical formalism. Clearly, methods to design large language models in practice make extensive use of the empirical distributional properties of language, as they should. Our results don t question this design methodology; when there are genuine empirical regularities in the training data, there is no reason not to use them. Rather, our results argue that if we are looking for the essential reasons why language generation is tractable, we do not fundamentally require any empirical regularities, or indeed any probabilistic assumptions at all; there is instead a formal sense in which language generation unlike broader learning tasks such as language 2In this more general view, we will assume that the family of languages is presented simply via a black box that for a string w and an index i can answer the question, Is w Li? identification is possible even against an adversary presenting positive training examples in a worst-case fashion. In some essential way, the generation problem is therefore different from these other learning tasks in crucial ways that more detailed formalisms may potentially obscure. Despite the generality of the model, producing the generation algorithm that proves our main theorem makes use of subtle structural properties of the given list of candidate languages. Again, we defer any detailed description to the subsequent sections, but the idea at a high level is to maintain a sequence of provisional languages among the candidate languages that are consistent with the finite sample S from K seen so far, and to continually refine this sequence of provisional languages as the adversary adds more strings to the sample S. Since the Gold-Angluin result says that the algorithm can never be sure which is the true language K, there is a sense in which this refinement process essentially needs to continue indefinitely, and in general it leads the algorithm to generate from provisional languages that may be increasingly thin subsets of K. This does not cause trouble for the specification of language generation, since it is acceptable to produce any unseen string from K, but it does mean that while the algorithm is able to eventually produce an infinite sequence of unseen strings from K, in general it might do so from a narrow part of K. This property of the solution in the presence of an adversary suggests interesting connections to the problem of generation in practice as well. In particular, any method for generation has to deal with an underlying validity problem producing valid outputs and an underlying breadth problem producing outputs that represent the full range of valid outputs in some reasonable way. The breadth problem is notoriously difficult, and it manifests itself in numerous ways in the methodology of machine learning and generative models. The approach that proves our main result helps illustrate the tension between validity and breadth even in settings with worst-case assumptions rather than probabilistic ones, and this tension shows up in both the early phases of our algorithm s execution and the later phases. In the early phases, before the algorithm has refined its provisional language sufficiently, it is generating too broadly and producing strings that are not part of the target language K an analogue at a high level of a kind of hallucination in which the generated strings belong to some consistent candidate language, but not to the actual target language [7, 8, 11]. In the later phases, on the other hand, the algorithm continuously shrinks its range of possible outputs so as to ensure that they will be contained within K sacrificing validity for breadth in a manner analogous to the issues that arise in the problem of mode collapse for generative models [3, 4]. Our model therefore suggests interesting questions about the fundamental trade-offs that may exist between validity and breadth even in settings without an underlying probabilistic model. 2 Formal Model and Results We now provide a formal description of the model and the statement of our results. To begin with, we have a countable list of candidate languages C = {L1, L2, L3, . . .}, where each Li is a subset of some countable set U. All we assume about the list of languages is that it is specified through a black box that can answer questions of the form Is w Li? for any string w U and language Li C. (If the reader finds it helpful for concreteness, they can consider the results that follow in the context of a specific list of languages C, such as the set of all context-free languages or the set of all regular languages; but everything we say applies to general collections of languages.) We will allow the collection C to contain repetitions, in that we may have Li = Lj for different indices i and j. We will assume that all the languages Li are infinite; while the original Gold-Angluin framework did not require this, it becomes important in specifying the generation problem: if we require an algorithm to output unseen strings forever, then this is not possible from a finite language, where the algorithm would eventually run out of new strings to generate. An adversary and an algorithm now play the following game. The adversary chooses a language K from C without revealing it to the algorithm, and it begins enumerating the strings of K one by one over a sequence of steps t = 1, 2, 3, . . .. The adversary can repeat strings in its enumeration, but crucially, for every string w K, there must be at least one time step t in which w appears. Let St be the set of strings that the adversary has enumerated in steps 1 through t. Identification and Generation. In this framework, we can now specify both the Gold-Angluin problem of identification and the contrasting problem of generation that we study in this paper. Identification (from [6, 2]: In each step, the algorithm observes St and must output an index i (its guess for the true language K). The algorithm identifies K in the limit if there is some t such that for all steps t t , the algorithm s guess in step t is an index i for which Li = K. Generation (from the present paper): In each step, the algorithm observes St and must output a string at (its guess for an unseen string in K). The algorithm generates from K in the limit if there is some t such that for all steps t t , the algorithm s guess at belongs to K St. A key point in both problem formulations is that the algorithm is not told if its guesses are correct. We know from the Gold-Angluin results that there is no algorithm that can achieve identification in the limit for an arbitrary countable collection C of languages (or even for specific countable collections, like the set of all regular languages or the set of all context-free languages). Our main result is a dramatically different answer for language generation; it is possible for every countable collection: (2.1) There is an algorithm with the property that for any countable collection of languages C = {L1, L2, L3, . . .}, and any enumeration of one of these languages K, the algorithm generates from K in the limit. A Result for Finite Collections. We prove a second result as well, focusing on the variant of the problem in which the collection of languages C is finite. In this case, it follows from Angluin s characterization that every finite collection C allows for identification in the limit. Given this, what more might we ask for? A natural question is whether there is a uniform bound on the number of samples needed to ensure that the algorithm can correctly identify the true language K: for any finite collection C, is there a bound t(C) and an algorithm with the property that after seeing any t(C) distinct strings from K, the algorithm is guaranteed to report K as its guess for the true language? It is easy to see that for the Gold-Angluin model of learning, this is not possible. For example, supopse that C is the collection consisting of two languages L1 and L2: L1 consists of all possible strings, and L2 consists of all strings of even length. Suppose there were a bound t(C) and an algorithm that was guaranteed to guess correctly after seeing t(C) distinct samples. Then an adversary could present t(C) distinct strings of even length, and then ask the algorithm to guess whether the true language is L1 or L2: if the algorithm guesses L2 at this point, then the adversary could announce that the answer is L1, and conversely if the algorithm guesses L1. This does not prevent the algorithm from learning the true language in the limit (since the adversary must eventually output an odd-length string if the true answer is L1); but there is no fixed bound t(C) by which this can be guaranteed. However, for the problem of generation with a finite collection of candidate languages, we can provide this much stronger type of uniform bound, via an algorithm that generates correctly after seeing a finite sample whose size t(C) is specified at the outset. In fact, we can achieve more: after seeing this finite sample, the algorithm can generate an infinite sequence of unseen elements from the true language. (2.2) There is an algorithm with the property that for any finite collection of languages C, there is a number t(C), such that for any language K in C, and any sequence S of at least t(C) distinct elements from K, the algorithm can produce an infinite sequence of distinct strings from K S. In subsequent work building on the present paper, Raman and Tewari [10] have recently proved further results about the type of uniform generation that we consider in (2.2), when it is possible to put an a priori bound t(C) on the number of distinct strings an algorithm needs to see from K before it can be guaranteed to begin generating unseen strings from K. They also consider variants of this definition, situating their analysis in the framework of classical learning theory. Extensions and Generalizations. Following these two main results in our basic model of generation, we provide (in Section 7) the following generalization of the model. Specifically, a familiar issue from language generation applications is the role of the prompt: a user provides an input string p, and a generation algorithm is then supposed to produce a continuation string c to come after the prompt, so that the concatenation of p and c is a valid utterance. We offer an extension that incorporates the notion of prompting, while maintaining the basic structure of the model, and we show how to formulate and prove a generalization of our first main result (2.1) in a setting where at each time step the adversary is allowed to specify a prompt that must be completed. 3 An Approach to Generation that Doesn t Work The following section is important, because it describes an approach that is arguably the most natural non-trivial attempt at achieving language generation in the limit. It seems at first seems to solve the problem directly, but in fact it fails for a deep reason that is important to understand, since it motivates the more involved solution that follows. The strategy is to move through the list of languages C = {L1, L2, L3, . . .} in order, treating each language Li as a hypothesis for K until the sample St proves otherwise. That is, we start with L1, and we generate strings from L1 St until we encounter (if ever) a step t in which St L1. At this point we know that L1 cannot be the true language K, and so we continue the process from L2. The nice idea that underpins this strategy is that the true language K is equal to Lz for some index z. (Since C can contain repetitions, K might appear several times, but we can take Lz as the first appearance.) So if our process were to reach Lz at some step t , it would never move on from Lz, and so we would be generating from K St for all t t . Unfortunately, there is a deep problem with this approach: there may be a language Li C with the property that Li comes before Lz and Li properly contains Lz (that is, i < z, and Lz Li). In this case, our procedure would stop at the first such Li forever, since it would never encounter a string in St that didn t belong to Li. And when it generated from Li St, there is no guarantee that it would choose strings from Lz. This problem is not easily avoided, since if this approach worked as written, it would also solve identification in the limit, which we know is impossible. So we need to extract some of the useful ideas from this failed approach in particular, the trick that K appears at some finite point in the list C, as the language Lz but add important further ideas as well. Specifically, if the algorithm is maintaining hypotheses for the true language K over time, it can provably never know whether its current hypothesis is correct; instead, it must be always moving further down the collection of languages, potentially considering languages that are not K, but in such a way that it is eventually always generating from K St. This is what our proof beginning in the next section will have to accomplish. 4 Generation in the Limit via a Function We prove our main result in two parts. We first give a method for language generation in the limit that is not concerned with the computational power required by the agent performing the generation. Thus, rather than an algorithm to generate the string, we ask whether we can construct a function f C based on the given language collection that maps a finite set of strings to a new string; this function takes the strings St seen so far and outputs a string f C(St) intended to be in K St. We will prove the following: (4.1) For every countable collection of languages C, there is a function f C from finite subsets of U to elements of U, such that for every enumeration of a language K C, there is a t such that for all t t , we have f C(St) K St. Note that while this positive result is not concerned with the computational power required to evaluate f C, it already contains the core contrast with language identification, which remains impossible even if we simply ask for a function f C. In the next section, we will then go on to prove (2.1) by using an algorithm that only performs standard computational steps and membership queries of the form w Li? Minimal and critical languages. As before, we will suppose z is an index such that Lz = K. We say that a language Li is consistent with the sample at step t if St Li. An important idea, which is implicit in our discussion of the failed approach at the end of Section 2, is that if Li Lj are both consistent with St, then it is safer for an algorithm to generate from Li than from Lj: if w Lj St then we must also have w Li St. This suggests that it would be useful to find consistent languages that are minimal with respect to inclusion: we would say that L is minimal if L C is consistent with St, and also L L for every L C that is consistent with St. Unfortunately, this is too much to ask for, since there exist instances of the problem for which there might not be any languages that are minimal with respect to inclusion. (In a finite collection of language there would need to be a minimal language, but it is easy to construct infinite collections without one.) Therefore, we define a related concept that only involves examining the inclusion of a given language with respect to a finite set of other languages. Specifically, we look for languages Ln that are consistent with St in a given step t, such that Ln is a subset of every consistent language that precedes it in the indexing of C. We will say that such a language is critical at step t. To define this formally, we first let Cn denote the finite collection of languages {L1, L2, . . . , Ln}. We now have the following definition. (4.2) A language Ln is critical at step t if Ln is consistent with St, and for every language Li Cn that is consistent with St, we have Ln Li. Properties of critical languages. At any given step t, there is at least one language consistent with St, since the language Lz = K is always consistent with St. It follows that there is also at least one critical language at any step t: for any t, the consistent language Li with the lowest index i must be critical at step t, as it is the only consistent language in Ci. Note that there can be choices of t for which the language Lz = K is not critical at step t. But a crucial fact is that Lz will eventually become critical at some step t and remain critical forever after that. For reasons of space, we provide complete proofs for this and all subsequent results in the Appendix, in this case in Section A.2. (4.3) There exists a time step t+ such that for all t t+, the language Lz is critical at step t. There can be multiple critical languages at a given step t; for example, if on the step t+ in (4.3) the first consistent language Li is not equal to Lz, then both Li and Lz will be critical at step t+. Despite the potential multiplicity of critical languages, the collection of all critical languages at step t has a useful nested structure: Li and Lj are both critical at step t, with i < j, then since Lj is contained in all consistent languages that precede it, in particular it is contained in Li. We therefore have: (4.4) Let i < j, and suppose that Li and Lj are both critical at step t. Then Lj Li. A function for generation in the limit. At a given step t, suppose that the critical languages are Ln1, Ln2, Ln3, . . . where n1 < n2 < n3 < . (This list of critical languages might be finite or infinite.) Then (4.4) tells us that this sequence is nested by inclusion: Ln1 Ln2 Ln3 . By (4.3) we know that the language Lz will eventually appear on this nested list from some step t+ onward, but even then we do not know which index ni it corresponds to at any given step t t+. Indeed, to recall a point from earlier, the Gold-Angluin results for learning in the limit tell us that we can never know for sure which index corresponds to Lz. But we now arrive at the crucial point, which is that beyond some finite index, all the critical languages are subsets of Lz, so it is safe to generate from any of them. Given this, we are prepared to construct our function f C. (4.5) f C(St) is defined as follows. We first identify all languages in Ct that are critical at step t. (If no such languages exist which can only happen if none of them are consistent with St we define f C(St) arbitrarily.) Among these critical languages, let Lnt be the one with the largest index nt t. We define f C(St) to be the lowest-indexed element of Lnt St. Finally, to establish our initial claim (4.1), it is sufficient to prove the following (in Section A.2): (4.6) For any language Lz C and any enumeration of Lz, there is a t such that for all t t , we have f C(St) Lz St. As a final note, we observe that the current formulation of f C allows it to generate the same string more than once, provided that this string is in K St. However, it is not hard to modify f C so that it generates a different string each time, essentially just by defining it so that it generates the lowest-indexed element that it hasn t already generated. The computational power required to produce f C. Our plan was to construct f C without worrying about the computational power required to do so (and recalling that for comparison, in the corresponding problem of identification in the limit, no function f C achieving identification could exist regardless of the computational power required to produce it). Now that we ve constructed an appropritate f C, we can ask what was in fact required computationally. In addition to standard computational steps and membership queries of the form w Li? , the definition of f C(St) requires that we identify the critical languages in Ct. From the definition, we can do this provided we can answer a finite number of subset queries of the form Li Lj? . So an algorithm augmented with the power to perform such subset queries can perform generation in the limit. In the next section, we will show how to remove the necessity for subset queries, so that generation in the limit can be performed by an algorithm using only standard computational steps and membership queries. 5 Generation in the Limit via an Algorithm We now prove (2.1) by giving an algorithm that generates in the limit for any countable collection of languages C, using only standard computational steps and membership queries of the form w Li? The set of possible strings U can be written as U = {u1, u2, u3, . . .}, and for simplicity we will also use the language of the positive integers to describe U, treating ui as the number i. In an enumeration of the true language Lz = K, let the sequence of strings that are enumerated step by step be denoted w1, w2, w3, . . .. Extending definitions to finite subsets of languages Li. One important idea in designing the algorithm is to work with finite subsets of the languages in C, gradually expanding the size of these subsets. Thus, for a language Li C and a number m, we will use Li[m] to denote the finite set Li {1, 2, 3, . . . , m}. We extend definition (4.2) of critical languages from the previous section to handle finite subsets. (5.1) Let t and m be positive integers. A language Ln is (t, m)-critical if Ln is consistent with St, and for every language Li Cn such that Li is consistent with St, we have Ln[m] Li[m]. Since Ln Li implies that Ln[m] Li[m] for any m 1, we have the following analogue of (4.3). (5.2) There exists a time step t+ such that for all t t+ and all m 1, the language Lz is (t, m)-critical. The analogue of (4.4) also still holds with this definition, using the same proof. (5.3) Let i < j and suppose that Li and Lj are both (t, m)-critical. Then Lj[m] Li[m]. Finally, there is a basic monotonicity property of (t, m)-criticality that is useful to write down explicitly; we give the proof in Appendix A.3. (5.4) Suppose that Ln is (t, m)-critical, and m < m. Then Ln is (t, m )-critical. 5.1 An algorithm for generation in the limit We now describe an algorithm for generation in the limit. As before, St = {w1, w2, . . . , wt} is the subset of K enumerated through step t, treating the wi as integers. We will consider the languages in Ct in step t, and maintain an auxiliary variable mt, roughly corresponding to how large a prefix Li[m] we consider from each language Li Ct. At the start of step t, we set mt = max(mt 1, wt); note that by induction this implies mt maxt j(s). In particular, stage s + 1 first enumerates (s + 1) and j(s) + 1, and then it begins enumerating j(s) + 2, j(s) + 3, . . . in increasing order. At some point during this process, the algorithm must output P (s+1),1 as its guess for K, since we can continue in this way to produce a full enumeration of P (s+1),1, at which point the true language is K = P (s+1),1. Once the algorithm outputs P (s+1),1 as its guess during stage s + 1, we end stage s + 1, defining j(s + 1) to be largest integer we ve enumerated up to that point, and we begin stage s + 2. In this way, the successive stages extend the interval I[ s, j(s)] unboundedly in both directions. We are therefore enumerating Q0,1 = U, and so in fact K = Q0,1. But we have also produced an unbounded sequence of steps t0 < t1 < t2 < such that the algorithm guesses P j,1 at step tj. Thus, there is no time step t for which the algorithm outputs the (correct) guess Q0,1 at every t t . As a final note, we observe that for this simple collection of languages, the problem of generation in the limit is straightforward: once the algorithm has seen two elements i and j from K, with i < j, it knows from the arithmetic progression structure that by setting d = j i, it can safely output j + d, j + 2d, j + 3d, . . ., and all of these will belong to K. The generation task for this language family is thus particularly simple (for others it is much more subtle, as in the example from Appendix A.6), but this contrast with the negative result for identification provides some initial intuition about the differences between the two problems that we will need to exploit in a generation algorithm. A.2 Proofs from Section 4 In the next three subsections of the Appendix, we provide proofs of the results stated in the main text. Proof of (4.3). Let K be the indices i < z for which Lz Li. For each i K, let vi be an element of Lz Li. Let ti be the step in which vi first appears in the enumeration of Lz, and let t+ = maxi K ti. Now, suppose by way of contradiction that for some t t+, the language Lz is not critical at step t. In this case, there must be some Li Cz such that Li is consistent with St and Lz Li. But we know that vi St and vi Li, contradicting the consistency of Li with St. Proof of (4.6). In the given enumeration of Lz, (4.3) tells us that there will come a step t+ such that for all t t+, the language Lz is critical at step t. Let t = max(z, t+). In every step t t , our construction of f C will include Lz among its critical languages in Ct. Therefore, the highestindexed critical language Lnt Ct satisfies nt z, and so by (4.4) we have Lnt Lz. Since f C(St) Lnt St, we have f C(St) Lz St as required. A.3 Proofs from Section 5 Proof of (5.4). Since Ln is (t, m)-critical, we know it is consistent with St, and that Ln[m] Li[m] for all languages Li Cn such that Li is consistent with St. Now, if Li is a language in Cn that is consistent with St, then since Ln[m] Li[m] and m < m, we have Ln[m ] Li[m ]. It follows that Ln[m ] Li[m ] for all languages Li Cn such that Li is consistent with St, and so Ln is (t, m )-critical. Proof of (5.5). We identify each iteration with the value of m after the initial increment of the iteration; so the iterations begin at mt + 1 and continue upward from there. Suppose by way of contradiction that the algorithm performs an infinite sequence of iterations. Let us call an iteration m disruptive if nt(m) = nt(m 1). Since nt(m 1) is the maximum index of a (t, m 1)-critical language, and since our monotonicity property (5.4) implies that Lnt(m) is also (t, m 1)-critical, it follows that nt(m) < nt(m 1). Since nt(m) starts at a value bounded by t and decreases by at least one with every disruptive iteration, there can be at most t 1 disruptive iterations. The sequence of iterations must therefore contain a last disruptive iteration m . For all iterations m > m , the language Lnt(m) does not change. Since the language is infinite, we must eventually reach an iteration m > m for which um Lnt(m), and the algorithm will stop and output um at this point. Proof of (5.7). In the given enumeration of Lz, (5.2) tells us that there is a t+ such that for all t t+ and all m 1, the language Lz is (t, m)-critical. Let t = max(z, t+). In every step t t , by (5.6) there is an mt such that the algorithm generates a string from Lnt[mt], where Lnt is the (t, mt)-critical language with maximum index in Ct. In each such step t, Lz is a (t, mt)-critical language in Ct, and so nt z. From (5.3), it follows that Lnt[mt] Lz[mt] Lz. Since the algorithm s output comes from Lnt[mt] St, it follows that it comes from Lz St as well. A.4 Proofs from Section 6 Proof of (2.2). We begin with some additional definitions. For any subset A of the indices {1, 2, . . . , n}, let D(A) be the intersection of the languages whose indices are in A; in other words, D(A) = \ i A Li. For any sequence S of strings from a language in C, let I(S) be the set of indices of the languages in C that contain S; that is, I(S) = {i : S Li}. We observe that the closure operator can be written in terms of this notation, in that S = D(I(S)). If D = D({1, 2, . . . , n}) is infinite, then the algorithm can generate arbitrary strings from D as its output without seeing any sample of strings at all; since D Li for every language Li C, in particular D Lz for the true language Lz, and this satisfies the requirements of (2.2). For the rest of the proof, we therefore suppose D({1, 2, . . . , n}) is finite. Let F be the collection of all sets of indices A {1, 2, . . . , n} with D(A) finite. Finally, let m = max A F |D(A)|; since |F| 2n, we observe that m is the maximum of a finite set of positive integers, and hence a positive integer. We now define t(C) = m + 1 and claim that this choice of t(C) satisfies the required guarantee of (2.2). Indeed, consider the true language Lz C and any sequence St of t(C) distinct elements from Lz. Recall that I(St) denotes the set of indices of all languages in C that contain St. We have St St = D(I(St)). If D(I(St)) were finite, then by the definition of m , the cardinality of D(I(St)) would be at most m . But this would contradict the fact that D(I(St)) contains St, which has cardinality m + 1. Therefore St = D(I(St)) is infinite, and it is a subset of the true language Lz. To conclude the proof, we therefore need only show that there is an algorithm that can enumerate all of St St using only membership queries. To do this, the algorithm begins by querying whether each wi St belongs to each Lj C. From this, it can determine the set I(St) of indices of languages that contain St. Now, it enumerates every string ui U in ascending order, skipping the elements of St. For each such string ui, it queries whether ui Lj for each j I(St), and it outputs ui if it belongs to each of these languages. In this way, the algorithm enumerates the infinite set St St Lz St after seeing a sample of t(C) strings in Lz. A.5 Proofs from Section 7 We now describe how to prove our result (7.1) on prompted generation with robust prompts. The proof is a direct adaptation of the proof of (2.1) from Section 5; as we will see, the structure of critical languages built up there is sufficiently strong that not much more is needed to handle the prompted version of the problem with robust prompts. As in Section 5, we will work with a specific enumeration u1, u2, u3, . . . of all strings in U, and work with finite subsets of the languages Li, defined via the notation Li[m] = Li {u1, u2, . . . , um}. The algorithm for prompted generation will closely follow the algorithm from Section 5, in that in every step t, it will increment a counter m and maintain knowledge of the maximum index nt(m) of a (t, m)-critical language from Ct. Maintaining knowledge of nt(m) does not require knowledge of the prompts, and so this part of the algorithm is the same as before. What changes is the stopping condition for the algorithm in step t: rather than continue increasing m until any valid output is found that is, until um Lnt(m) the algorithm must increase m potentially even further, until it finds a string um for which pt is a prefix of um, and um Lnt(m) St. However, since pt is a robust prompt, the algorithm is guaranteed to eventually find such a string, and so we can be sure that its iterations in step t will terminate. If we let mt be the value of m at the end of step t, then once t is large enough, we know that Lnt(mt)[mt] Lz[mt], where Lz = K is the true language, and so the string um that it outputs has pt as a prefix and belongs to K St as required. The discussion above provides the entire set of modifications to the algorithm; for completeness we now describe these in more detail, together with a proof of correctness. First, the facts (5.2) through (5.4) still hold in the prompted case, since they are structural properties of the language that are not affected by the adversary s use of prompts. The algorithm for generating an output string uses an iteration in step t for which parts (i), (ii), and (iii) of each iteration are the same as in Section 5. Step (iv) of each iteration is replaced by (iv ) If there is any string ui for i m such that ui has pt as a prefix and ui Lnt(m) St, then choose the minimum i with this property; output the string ui and define mt = m. If there is no such ui, then continue to the next iteration. Now, the proof of termination works as before, by establishing that there are only finitely many disruptive iterations in which the identity of nt(m) changes; this part does not depend on the structure of prompts but only on the definition of a (t, m)-critical language, and so it uses (5.2) through (5.4) exactly as before. After the last disruptive iteration, either there is a string ui Lnt(m) St with i m for which pt is a prefix, or else the algorithm will eventually reach one, since the prompt pt is robust. It declares this um to be its output string. We therefore have (A.1) In step t, if at least one language in Ct is consistent with St, then there is an mt and an nt such that the algorithm terminates with a string at for which pt is a prefix of at and at Lnt(mt)[mt] St, where Lnt(mt) is the (t, mt)-critical language with maximum index in Ct. Finally, we establish the basic correctness property of the algorithm, from which (7.1) follows directly. (A.2) For any language Lz C and any enumeration of Lz with robust prompts p1, p2, p3, . . ., there is a t such that for all t t , the algorithm generates a string at for which pt is a prefix of at and at Lz St. Proof. In the given enumeration of Lz, (5.2) tells us that there is a t+ such that for all t t+ and all m 1, the language Lz is (t, m)-critical. Let t = max(z, t+). In every step t t , by (A.1) there is an mt such that the algorithm generates a string at such that pt is a prefix of at, and at Lnt(mt)[mt] St, where Lnt(mt) is the (t, mt)-critical language with maximum index in Ct. In each such step t, Lz is a (t, mt)-critical language in Ct, and so nt(mt) z. From (5.3), it follows that Lnt(mt)[mt] Lz[mt] Lz. Since at Lnt(mt)[mt] St, it follows that at Lz St as well. A.6 An Example Where Closure is Not Helpful In Section 6, we introduced the notion of closure: For a sequence of strings St from a language in C, the closure of St in C, denoted St , is the intersection of all languages in C that are consistent with St. We observed in Section 6 that in any step t where St contains an element not in St, the algorithm can always safely output an element in St St and be sure that it is outputting a new element in the true language K This is simply because for every consistent language Li C, we have St St Li St by the definition of the closure operation, and in particular this holds for the true language K. This strategy is key to the proof in Section 6, when we were dealing with finite collections C, and it would have been an effective strategy in the case of infinite collections C as well provided we could be sure that St St was always non-empty. Unfortunately, there are instances of the problem with infinite collections C for which St St is empty for arbitrarily large t, and therefore the closure provides us with no new strings to generate. To see how this can happen in an example, we go back to the collection of arithmetic progresions over the integers as the ground set U, and we consider a more complicated collection C that builds on this. In particular, we let Pa,b as before be the arithmetic progression consisting of all integers of the form {a + bi : i = 0, 1, 2, . . .}; and now, for any finite set of integers V , we define the language L(a, b, V ) = Pa,b V . Our collection C consists of every L(a, b, V ) for an arbitrary integer a, an arbitrary positive integer b, and an arbitrary finite set of integers V . (Think of L(a, b, V ) as a copy of the arithmetic progression Pa,b that has been obscured by an arbitrarily large finite set V so that its structure is harder to discern.) Now, suppose the adversary is enumerating a language K = L(a, b, V ) C, and consider the set St of samples after t steps. We claim that St = St. Intuitively, this is because St might have come completely from the finite set V that is part of L(a, b, V ); more formally, there cannot be any element j St St because L(j + 1, 1, St) is consistent with St, and it does not contains j. This example illustrates the sense in which we mean that closure by itself is insufficient to provide a strategy for generation in the limit when the collection of languages C is infinite: in this instance, the algorithm has to repeatedly output elements outside the closure of St and yet somehow eventually generate from K. We know that the algorithm that proves our main result (2.1) is able to do this, but it is also interesting to ask if there is a simpler way to directly describe a method for generation in the limit for this particular example. In fact, there is a direct solution to this example, as follows: in any step t, the algorithm finds the two largest elements i < j in St, and with b = j i, it outputs j + b. The point is that if the true language K is L(a, b, V ), there will come a time when the adversary has enumerated every element of V , and within a finite number of steps after this, the two largest elements of St will have to come from Pa,b. From this step onward, the algorithm s strategy of outputting j + b is guaranteed to produce an element in K St. This particular strategy of course seems highly specific to the particular example we are discussing here, but at a very high level it does contain reflections of some of the ideas we saw in the general solution from Sections 4 and 5. Neur IPS Paper Checklist Question: Do the main claims made in the abstract and introduction accurately reflect the paper s contributions and scope? Answer: [Yes] Justification: The paper s main contributions are to formulate a model of language generation in the limit, by analogy with the Gold-Angluin model of language identification in the limit, and to prove a set of theoretical results in this model. These contributions are described in the abstract and introduction. Guidelines: The answer NA means that the abstract and introduction do not include the claims made in the paper. The abstract and/or introduction should clearly state the claims made, including the contributions made in the paper and important assumptions and limitations. A No or NA answer to this question will not be perceived well by the reviewers. The claims made should match theoretical and experimental results, and reflect how much the results can be expected to generalize to other settings. It is fine to include aspirational goals as motivation as long as it is clear that these goals are not attained by the paper. 2. Limitations Question: Does the paper discuss the limitations of the work performed by the authors? Answer: [Yes] Justification: The paper s main limitation, similar to the classical Gold-Angluin model, is that it is a theoretical model based on worst-case guarantees, and as such does not try to capture the role of distributional assumptions in the design of language models. This is both a source of limitations but also (as a mitigation of these limitations) a source of insights distinct from traditional probabilistic models. These issues are discussed in the paper s subsection entited General Connections to Language Modeling. Guidelines: The answer NA means that the paper has no limitation while the answer No means that the paper has limitations, but those are not discussed in the paper. The authors are encouraged to create a separate "Limitations" section in their paper. The paper should point out any strong assumptions and how robust the results are to violations of these assumptions (e.g., independence assumptions, noiseless settings, model well-specification, asymptotic approximations only holding locally). The authors should reflect on how these assumptions might be violated in practice and what the implications would be. The authors should reflect on the scope of the claims made, e.g., if the approach was only tested on a few datasets or with a few runs. In general, empirical results often depend on implicit assumptions, which should be articulated. The authors should reflect on the factors that influence the performance of the approach. For example, a facial recognition algorithm may perform poorly when image resolution is low or images are taken in low lighting. Or a speech-to-text system might not be used reliably to provide closed captions for online lectures because it fails to handle technical jargon. The authors should discuss the computational efficiency of the proposed algorithms and how they scale with dataset size. If applicable, the authors should discuss possible limitations of their approach to address problems of privacy and fairness. While the authors might fear that complete honesty about limitations might be used by reviewers as grounds for rejection, a worse outcome might be that reviewers discover limitations that aren t acknowledged in the paper. The authors should use their best judgment and recognize that individual actions in favor of transparency play an important role in developing norms that preserve the integrity of the community. Reviewers will be specifically instructed to not penalize honesty concerning limitations. 3. Theory Assumptions and Proofs Question: For each theoretical result, does the paper provide the full set of assumptions and a complete (and correct) proof? Answer: [Yes] Justification: The model and its underlying assumptions are fully described, and all proofs are provided in the Appendix. Guidelines: The answer NA means that the paper does not include theoretical results. All the theorems, formulas, and proofs in the paper should be numbered and crossreferenced. All assumptions should be clearly stated or referenced in the statement of any theorems. The proofs can either appear in the main paper or the supplemental material, but if they appear in the supplemental material, the authors are encouraged to provide a short proof sketch to provide intuition. Inversely, any informal proof provided in the core of the paper should be complemented by formal proofs provided in appendix or supplemental material. Theorems and Lemmas that the proof relies upon should be properly referenced. 4. Experimental Result Reproducibility Question: Does the paper fully disclose all the information needed to reproduce the main experimental results of the paper to the extent that it affects the main claims and/or conclusions of the paper (regardless of whether the code and data are provided or not)? Answer: [NA] Justification: The paper consists entirely of theoretical results. Due to the emphasis on guarantees against an adversary in the limit, there is not a natural opening for demonstrating the results experimentally (as there similarly isn t in the work that introduced the Gold Angluin model). Guidelines: The answer NA means that the paper does not include experiments. If the paper includes experiments, a No answer to this question will not be perceived well by the reviewers: Making the paper reproducible is important, regardless of whether the code and data are provided or not. If the contribution is a dataset and/or model, the authors should describe the steps taken to make their results reproducible or verifiable. Depending on the contribution, reproducibility can be accomplished in various ways. For example, if the contribution is a novel architecture, describing the architecture fully might suffice, or if the contribution is a specific model and empirical evaluation, it may be necessary to either make it possible for others to replicate the model with the same dataset, or provide access to the model. In general. releasing code and data is often one good way to accomplish this, but reproducibility can also be provided via detailed instructions for how to replicate the results, access to a hosted model (e.g., in the case of a large language model), releasing of a model checkpoint, or other means that are appropriate to the research performed. While Neur IPS does not require releasing code, the conference does require all submissions to provide some reasonable avenue for reproducibility, which may depend on the nature of the contribution. For example (a) If the contribution is primarily a new algorithm, the paper should make it clear how to reproduce that algorithm. (b) If the contribution is primarily a new model architecture, the paper should describe the architecture clearly and fully. (c) If the contribution is a new model (e.g., a large language model), then there should either be a way to access this model for reproducing the results or a way to reproduce the model (e.g., with an open-source dataset or instructions for how to construct the dataset). (d) We recognize that reproducibility may be tricky in some cases, in which case authors are welcome to describe the particular way they provide for reproducibility. In the case of closed-source models, it may be that access to the model is limited in some way (e.g., to registered users), but it should be possible for other researchers to have some path to reproducing or verifying the results. 5. Open access to data and code Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material? Answer: [NA] Justification: The results of the paper do not make use of data or code. Guidelines: The answer NA means that paper does not include experiments requiring code. Please see the Neur IPS code and data submission guidelines (https://nips.cc/ public/guides/Code Submission Policy) for more details. While we encourage the release of code and data, we understand that this might not be possible, so No is an acceptable answer. Papers cannot be rejected simply for not including code, unless this is central to the contribution (e.g., for a new open-source benchmark). The instructions should contain the exact command and environment needed to run to reproduce the results. See the Neur IPS code and data submission guidelines (https: //nips.cc/public/guides/Code Submission Policy) for more details. The authors should provide instructions on data access and preparation, including how to access the raw data, preprocessed data, intermediate data, and generated data, etc. The authors should provide scripts to reproduce all experimental results for the new proposed method and baselines. If only a subset of experiments are reproducible, they should state which ones are omitted from the script and why. At submission time, to preserve anonymity, the authors should release anonymized versions (if applicable). Providing as much information as possible in supplemental material (appended to the paper) is recommended, but including URLs to data and code is permitted. 6. Experimental Setting/Details Question: Does the paper specify all the training and test details (e.g., data splits, hyperparameters, how they were chosen, type of optimizer, etc.) necessary to understand the results? Answer: [NA] Justification: As noted above, the paper does not include computational experiments. Guidelines: The answer NA means that the paper does not include experiments. The experimental setting should be presented in the core of the paper to a level of detail that is necessary to appreciate the results and make sense of them. The full details can be provided either with the code, in appendix, or as supplemental material. 7. Experiment Statistical Significance Question: Does the paper report error bars suitably and correctly defined or other appropriate information about the statistical significance of the experiments? Answer: [NA] Justification: As noted above, the paper does not include computational experiments. Guidelines: The answer NA means that the paper does not include experiments. The authors should answer "Yes" if the results are accompanied by error bars, confidence intervals, or statistical significance tests, at least for the experiments that support the main claims of the paper. The factors of variability that the error bars are capturing should be clearly stated (for example, train/test split, initialization, random drawing of some parameter, or overall run with given experimental conditions). The method for calculating the error bars should be explained (closed form formula, call to a library function, bootstrap, etc.) The assumptions made should be given (e.g., Normally distributed errors). It should be clear whether the error bar is the standard deviation or the standard error of the mean. It is OK to report 1-sigma error bars, but one should state it. The authors should preferably report a 2-sigma error bar than state that they have a 96% CI, if the hypothesis of Normality of errors is not verified. For asymmetric distributions, the authors should be careful not to show in tables or figures symmetric error bars that would yield results that are out of range (e.g. negative error rates). If error bars are reported in tables or plots, The authors should explain in the text how they were calculated and reference the corresponding figures or tables in the text. 8. Experiments Compute Resources Question: For each experiment, does the paper provide sufficient information on the computer resources (type of compute workers, memory, time of execution) needed to reproduce the experiments? Answer: [NA] Justification: As noted above, the paper does not include computational experiments. Guidelines: The answer NA means that the paper does not include experiments. The paper should indicate the type of compute workers CPU or GPU, internal cluster, or cloud provider, including relevant memory and storage. The paper should provide the amount of compute required for each of the individual experimental runs as well as estimate the total compute. The paper should disclose whether the full research project required more compute than the experiments reported in the paper (e.g., preliminary or failed experiments that didn t make it into the paper). 9. Code Of Ethics Question: Does the research conducted in the paper conform, in every respect, with the Neur IPS Code of Ethics https://neurips.cc/public/Ethics Guidelines? Answer: [Yes] Justification: The Neur IPS Code of Ethics was reviewed in the course of writing the paper, and has been fully adhered to. Guidelines: The answer NA means that the authors have not reviewed the Neur IPS Code of Ethics. If the authors answer No, they should explain the special circumstances that require a deviation from the Code of Ethics. The authors should make sure to preserve anonymity (e.g., if there is a special consideration due to laws or regulations in their jurisdiction). 10. Broader Impacts Question: Does the paper discuss both potential positive societal impacts and negative societal impacts of the work performed? Answer: [Yes] Justification: The paper studies theoretical questions motivated by the dramatic recent advances in the performance of large language models (LLMs). As such, it connects to the broader and rapidly growing discussion about the potentially disruptive societal implications of LLMs and their widespread use, although it does so only through the formulation of theoretical models. One point in the discussion of LLM broader impacts that our theoretical results may address most directly is concerns over the homogenization of text style the ways in which language modeling may lead to decreased variety in the text that is produced and we discuss this in the section entitled Concluding Remarks. Guidelines: The answer NA means that there is no societal impact of the work performed. If the authors answer NA or No, they should explain why their work has no societal impact or why the paper does not address societal impact. Examples of negative societal impacts include potential malicious or unintended uses (e.g., disinformation, generating fake profiles, surveillance), fairness considerations (e.g., deployment of technologies that could make decisions that unfairly impact specific groups), privacy considerations, and security considerations. The conference expects that many papers will be foundational research and not tied to particular applications, let alone deployments. However, if there is a direct path to any negative applications, the authors should point it out. For example, it is legitimate to point out that an improvement in the quality of generative models could be used to generate deepfakes for disinformation. On the other hand, it is not needed to point out that a generic algorithm for optimizing neural networks could enable people to train models that generate Deepfakes faster. The authors should consider possible harms that could arise when the technology is being used as intended and functioning correctly, harms that could arise when the technology is being used as intended but gives incorrect results, and harms following from (intentional or unintentional) misuse of the technology. If there are negative societal impacts, the authors could also discuss possible mitigation strategies (e.g., gated release of models, providing defenses in addition to attacks, mechanisms for monitoring misuse, mechanisms to monitor how a system learns from feedback over time, improving the efficiency and accessibility of ML). 11. Safeguards Question: Does the paper describe safeguards that have been put in place for responsible release of data or models that have a high risk for misuse (e.g., pretrained language models, image generators, or scraped datasets)? Answer: [NA] Justification: The paper does not develop data or implementations that could be released. Guidelines: The answer NA means that the paper poses no such risks. Released models that have a high risk for misuse or dual-use should be released with necessary safeguards to allow for controlled use of the model, for example by requiring that users adhere to usage guidelines or restrictions to access the model or implementing safety filters. Datasets that have been scraped from the Internet could pose safety risks. The authors should describe how they avoided releasing unsafe images. We recognize that providing effective safeguards is challenging, and many papers do not require this, but we encourage authors to take this into account and make a best faith effort. 12. Licenses for existing assets Question: Are the creators or original owners of assets (e.g., code, data, models), used in the paper, properly credited and are the license and terms of use explicitly mentioned and properly respected? Answer: [NA] Justification: The paper does not use existing computational assets; it does cite the theoretical models that serve as the underlying motivation for the work. Guidelines: The answer NA means that the paper does not use existing assets. The authors should cite the original paper that produced the code package or dataset. The authors should state which version of the asset is used and, if possible, include a URL. The name of the license (e.g., CC-BY 4.0) should be included for each asset. For scraped data from a particular source (e.g., website), the copyright and terms of service of that source should be provided. If assets are released, the license, copyright information, and terms of use in the package should be provided. For popular datasets, paperswithcode.com/datasets has curated licenses for some datasets. Their licensing guide can help determine the license of a dataset. For existing datasets that are re-packaged, both the original license and the license of the derived asset (if it has changed) should be provided. If this information is not available online, the authors are encouraged to reach out to the asset s creators. 13. New Assets Question: Are new assets introduced in the paper well documented and is the documentation provided alongside the assets? Answer: [NA] Justification: The paper does not release new assets; it does provide complete self-contained descriptions of the theoretical models that it introduces. Guidelines: The answer NA means that the paper does not release new assets. Researchers should communicate the details of the dataset/code/model as part of their submissions via structured templates. This includes details about training, license, limitations, etc. The paper should discuss whether and how consent was obtained from people whose asset is used. At submission time, remember to anonymize your assets (if applicable). You can either create an anonymized URL or include an anonymized zip file. 14. Crowdsourcing and Research with Human Subjects Question: For crowdsourcing experiments and research with human subjects, does the paper include the full text of instructions given to participants and screenshots, if applicable, as well as details about compensation (if any)? Answer: [NA] Justification: The paper does not make use of crowdsourcing or any human-subject research. Guidelines: The answer NA means that the paper does not involve crowdsourcing nor research with human subjects. Including this information in the supplemental material is fine, but if the main contribution of the paper involves human subjects, then as much detail as possible should be included in the main paper. According to the Neur IPS Code of Ethics, workers involved in data collection, curation, or other labor should be paid at least the minimum wage in the country of the data collector. 15. Institutional Review Board (IRB) Approvals or Equivalent for Research with Human Subjects Question: Does the paper describe potential risks incurred by study participants, whether such risks were disclosed to the subjects, and whether Institutional Review Board (IRB) approvals (or an equivalent approval/review based on the requirements of your country or institution) were obtained? Answer: [NA] Justification: As noted above, the paper does not involve any research with human subjects. Guidelines: The answer NA means that the paper does not involve crowdsourcing nor research with human subjects. Depending on the country in which research is conducted, IRB approval (or equivalent) may be required for any human subjects research. If you obtained IRB approval, you should clearly state this in the paper. We recognize that the procedures for this may vary significantly between institutions and locations, and we expect authors to adhere to the Neur IPS Code of Ethics and the guidelines for their institution. For initial submissions, do not include any information that would break anonymity (if applicable), such as the institution conducting the review.