# The Pitfalls of Next-Token Prediction

Gregor Bachmann*¹, Vaishnavh Nagarajan*²

*Equal contribution. ¹ETH Zürich, Switzerland. ²Google Research, US. Correspondence to: Gregor Bachmann, Vaishnavh Nagarajan. Proceedings of the 41st International Conference on Machine Learning, Vienna, Austria. PMLR 235, 2024.

**Abstract.** Can a mere next-token predictor faithfully model human intelligence? We crystallize this intuitive concern, which is fragmented in the literature. As a starting point, we argue that the two often-conflated phases of next-token prediction, autoregressive inference and teacher-forced training, must be treated distinctly. The popular criticism that errors can compound during autoregressive inference crucially assumes that teacher-forcing has learned an accurate next-token predictor. This assumption sidesteps a more deep-rooted problem we expose: in certain classes of tasks, teacher-forcing can simply fail to learn an accurate next-token predictor in the first place. We describe a general mechanism of how teacher-forcing can fail, and design a minimal planning task where both the Transformer and the Mamba architecture empirically fail in that manner; remarkably, this happens despite the task being straightforward to learn. We provide preliminary evidence that this failure can be resolved when training to predict multiple tokens in advance. We hope this finding can ground future debates and inspire explorations beyond the next-token prediction paradigm. We make our code available at https://github.com/gregorbachmann/Next-Token-Failures.

## 1. Introduction

Long after its inception in the seminal work of Shannon (1948; 1951), next-token prediction has made its way into becoming a core part of the modern language model. But despite its long list of achievements, there is a small but growing belief that a next-token predicting model is merely an impressive improv artist that cannot truly model human thought. Humans, when navigating the world, meticulously imagine, curate and backtrack plans in their heads before executing them. Such strategies are unfortunately not explicitly built into the backbone of the present-day language model. This criticism has been floating around as an informal viewpoint (LeCun, 2024; Bubeck et al., 2023). Our paper is aimed at crystallizing this intuitive criticism of next-token prediction, and developing the core arguments of this debate.

Let us start by making more precise what it means to say that human-generated language, or problem-solving, does not follow next-token prediction. When formalizing this, we hit an immediate roadblock: can't every sequence generation task be performed autoregressively? Put differently, an optimist would say, every distribution over a sequence of tokens can be captured by an appropriately sophisticated next-token predictor simulating the chain rule of probability, i.e., $P(r_1, r_2, \ldots) = \prod_i P(r_i \mid r_1, \ldots, r_{i-1})$. Thus, the autoregressivity in our systems is not antithetical to learning human language after all. Although this argument is compelling, a pessimist would worry that, realistically, even with minor imperfections in the next-token predictor, accuracy may break down spectacularly for long sequences (Kääriäinen, 2006; Ross & Bagnell, 2010; LeCun, 2024; Dziri et al., 2024).
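To make the optimist's chain-rule argument concrete, the sketch below (our illustration, not code from the paper) shows how autoregressive inference simulates $\prod_i P(r_i \mid r_1, \ldots, r_{i-1})$; the `next_token_probs` interface is a hypothetical stand-in for a trained next-token predictor. If the per-step distribution were exact, this loop could in principle realize any sequence distribution, but each iteration also feeds the model's own previous outputs back in, which is precisely where the pessimist's compounding-error worry enters.

```python
import random

def sample_autoregressively(next_token_probs, problem, max_len, eos="<eos>"):
    """Sample a response token by token via the chain rule.

    `next_token_probs(problem, response_so_far)` is a hypothetical model
    interface returning a dict {token: probability} for the next token;
    every sampled token is fed back as input for the following step.
    """
    response = []
    for _ in range(max_len):
        probs = next_token_probs(problem, response)           # P(r_i | p, r_<i)
        tokens, weights = zip(*probs.items())
        token = random.choices(tokens, weights=weights)[0]    # one chance to err per step
        if token == eos:
            break
        response.append(token)
    return response
```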
Say, even if each next-token error probability is as little as 0.01, the probability of encountering an erroneous token compounds exponentially along the way and, by the end of 200 tokens, blows up to about 0.86 (since $1 - (1 - 0.01)^{200} \approx 0.866$). This is a simple and powerful observation. Yet, it does not completely capture the intuition that next-token predictors may be poor planners. Crucially, this argument does not carefully distinguish between the two types of next-token prediction: inference-time autoregression (where the model consumes its own previous outputs as inputs) and training-time teacher-forcing (Williams & Zipser, 1989) (where the model is taught to predict token-by-token, consuming all previous ground-truth tokens as inputs). Framed this way, the compounding of errors only pinpoints a superficial failure to execute a plan during inference. It leaves open the possibility that we may have still learned a near-perfect next-token predictor; perhaps, with an appropriate post-hoc wrapper that verifies and backtracks, we can elicit the right plan without compounding errors.

Drawing this distinction allows us to articulate a much more concerning possibility: is it safe to assume that next-token-based learning (teacher-forcing) always learns an accurate next-token predictor? We identify that this is not always the case. Consider a task where we expect the model to witness a problem statement $p = (p_1, p_2, \ldots)$ and produce the ground-truth response tokens $(r_1, r_2, \ldots)$. Teacher-forcing trains the model to produce each token $r_i$ by not only providing the problem statement $p$ but also revealing part of the ground truth, $r_1, \ldots, r_{i-1}$. Depending on the task, we first argue that this can induce shortcuts that use the revealed prefix of the ground-truth answer to spuriously fit future answer tokens. We call this the Clever Hans cheat.[^1] Next, while the later tokens ($r_i$ for large $i$) become easy to fit via the Clever Hans cheat, the earlier answer tokens (say, $r_1$, $r_2$, etc.) become harder to learn. This is because they no longer come with any supervision about the full answer: part of the supervision is lost to the Clever Hans cheat. We argue that these two flaws would together arise in *lookahead tasks*: tasks that require implicitly planning a later token in advance of an earlier token. In such tasks, teacher-forcing would result in a highly inaccurate next-token predictor that would fail to generalize to unseen problems $p$, even those sampled in-distribution.

Empirically, we demonstrate that the above mechanism leads to complete in-distribution failure in a path-finding setup on a graph, which we propose as a minimal lookahead task. We design our setup to be so demonstrably straightforward to solve that the failure of any model is remarkable. Yet, we observe failure for both the Transformer (Vaswani et al., 2017) and the Mamba architecture, a structured state space model (Gu & Dao, 2023). We also find that a form of teacherless training that predicts multiple future tokens (Monea et al., 2023) is (in some settings) able to circumvent this failure. Thus, we pinpoint a precise and easy-to-learn scenario where, rather than properties criticized in existing literature such as convolution, recurrence or autoregressive inference (see §6), it is next-token prediction during training that is at fault. We hope that these findings inspire and set future debates around next-token prediction on solid ground.
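To fix ideas, the sketch below contrasts the two modes discussed above, along with the teacherless multi-token variant. This is our own schematic in PyTorch-style code, not the paper's implementation; the `model` interface (token ids in, next-token logits out), the tensor shapes, and the `dummy_id` placeholder token are assumptions made for illustration.

```python
import torch
import torch.nn.functional as F

# Schematic sketch (ours, not the paper's code). `model` is any decoder-only
# network mapping token ids (B, T) -> logits (B, T, V); `p` holds the problem
# tokens and `r` the ground-truth response tokens, both as LongTensors.

def teacher_forced_loss(model, p, r):
    """Training: the model predicts r_i while being fed the ground-truth r_<i."""
    inputs = torch.cat([p, r[:, :-1]], dim=1)           # ground truth revealed as input
    logits = model(inputs)[:, p.size(1) - 1:, :]        # positions predicting r_1..r_L
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)), r.reshape(-1))

@torch.no_grad()
def autoregressive_inference(model, p, steps):
    """Inference: the model consumes its own outputs, so errors can compound."""
    seq = p.clone()
    for _ in range(steps):
        next_logits = model(seq)[:, -1, :]
        next_token = next_logits.argmax(dim=-1, keepdim=True)   # greedy decoding
        seq = torch.cat([seq, next_token], dim=1)
    return seq[:, p.size(1):]

def teacherless_loss(model, p, r, dummy_id):
    """Teacherless variant (in the spirit of Monea et al., 2023): the revealed
    ground-truth prefix is replaced by uninformative dummy tokens, so all
    response tokens must be predicted from the problem alone and the model
    cannot lean on the revealed answer (the Clever Hans cheat)."""
    dummies = torch.full_like(r[:, :-1], dummy_id)
    inputs = torch.cat([p, dummies], dim=1)
    logits = model(inputs)[:, p.size(1) - 1:, :]
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)), r.reshape(-1))
```

The essential difference lies in what the model is fed while predicting $r_i$: the ground-truth prefix under teacher-forcing, its own previous outputs during autoregressive inference, and only uninformative placeholders under the teacherless objective.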
In particular, we believe that the failure of the next-token prediction objective on our straightforward task casts a shadow over its promise on more complex tasks (such as, say, learning to write stories). We also hope that this minimal example of failure, and the positive results on teacherless training, can motivate alternative paradigms of training.

[^1]: Clever Hans (Pfungst & Rahn, 1911) was a famous show horse that could seemingly solve simple arithmetic tasks by repeatedly tapping with his hoof until he reached the correct count. It turns out, however, that Clever Hans did not really solve the problems, but merely stopped tapping upon detecting certain (involuntary) cues from his coach; Clever Hans's answers were wrong when the coach was absent.

We summarize our contributions below.

1. We consolidate existing critiques against next-token prediction and crystallize new core points of contention (§6 and §3, §4).
2. We identify that the next-token prediction debate must not conflate autoregressive inference with teacher-forcing. The two lead to vastly different failures (§3, §B).
3. We conceptually argue that in lookahead tasks, next-token prediction during training (i.e., teacher-forcing) can give rise to problematic learning mechanisms that are detrimental even to in-distribution performance (§4).
4. We design a minimal lookahead task (§4.1). We empirically demonstrate the failure of teacher-forcing for the Transformer and Mamba architectures, despite the task being easy to learn (§5).
5. We identify that a teacherless form of training that predicts multiple future tokens at once, proposed in Monea et al. (2023) for orthogonal inference-time efficiency goals, shows promise in circumventing these training-time failures in some settings (§5, Eq. 4). This further demonstrates the limits of next-token prediction.

## 2. The Two Modes of Next-Token Prediction

Consider a set of tokens $\mathcal{V}$. Let $\mathcal{D}$ be a ground-truth distribution over sequences that consist of a prefix $p$ and a response $r$, denoted as $s = (p, r)$, where $p = (p_1, p_2, \ldots) \in \mathcal{V}^{L_{\mathrm{pref}}}$ and $r = (r_1, r_2, \ldots) \in \mathcal{V}^{L_{\mathrm{resp}}}$. We assume sequences of fixed length merely for simplicity. For any sequence $s$, let $s$