On the Generalization Ability of Next-Token-Prediction Pretraining

Zhihao Li 1, Xue Jiang 2 3, Liyuan Liu 1, Xuelin Zhang 1, Hong Chen 1 4, Feng Zheng 2

Abstract

Large language models (LLMs) have demonstrated remarkable potential in handling natural language processing (NLP) tasks and beyond. LLMs can usually be categorized as transformer decoder-only models (DOMs), which use Next-Token-Prediction (NTP) as their pretraining methodology. Despite their tremendous empirical successes, a theoretical understanding of how NTP pre-training affects the model's generalization behavior is lacking. To fill this gap, we establish a fine-grained generalization analysis for NTP pre-training based on Rademacher complexity, in which the dependence between tokens is also addressed. Technically, a novel decomposition of Rademacher complexity is developed to study DOMs from the perspectives of the representation learner and the token predictor, respectively. Furthermore, upper bounds on the covering number are established for multi-layer and multi-head transformer-decoder models under the Frobenius norm, which theoretically pioneers the incorporation of the mask matrix within the self-attention mechanism. Our results reveal that the generalization ability of NTP pre-training is affected quantitatively by the number of token sequences N, the maximum length of a sequence m, and the number of parameters in the transformer model Θ. Additionally, experiments on public datasets verify our theoretical findings. Our code is available at https://github.com/Lizeihao/MININTP.

1 College of Informatics, Huazhong Agricultural University; 2 Department of Computer Science and Engineering, Southern University of Science and Technology; 3 Department of Computer Science, Hong Kong Baptist University; 4 Engineering Research Center of Intelligent Technology for Agriculture, Ministry of Education, China. Correspondence to: Hong Chen. Proceedings of the 42nd International Conference on Machine Learning, Vancouver, Canada. PMLR 267, 2025. Copyright 2025 by the author(s).

1. Introduction

Large Language Models (LLMs) have emerged as powerful generative models for solving sequence-to-sequence (seq2seq) tasks (Ott et al., 2019); they have not only achieved tremendous progress on various NLP tasks (Malach, 2023) but also realized remarkable performance in other domains (Li et al., 2024). Strikingly, several existing LLMs, such as GPT-3 (Brown et al., 2020), OPT (Zhang et al., 2022), BLOOM (Workshop et al., 2023), Llama (Touvron et al., 2023), DeepSeek (Liu et al., 2024a) and Qwen (Yang et al., 2025), share two common characteristics: (i) they employ a decoder-only architecture based on masked self-attention (Vaswani et al., 2017); (ii) they adopt the unsupervised pre-training method of Next-Token-Prediction (NTP) (see Figure 1 (a)), which predicts the next token from all previous context tokens at each step (Qi et al., 2020). The predominant expense in training a large language model is typically incurred during the pre-training phase (Zhao et al., 2024). Consequently, it is important to examine DOMs-based NTP pre-training.

Recently, there have been increasing efforts to evaluate DOMs-based NTP pre-training empirically. Shlegeris et al. (2022) found, through two distinct experiments, that language models are consistently better than humans at NTP tasks.
Malach (2023) demonstrated that, when trained on Chain-of-Thought data, even a linear next-token predictor can possess high fitting ability. Bachmann & Nagarajan (2024) designed a minimal planning task and demonstrated that NTP pre-training cannot accurately predict the first position in some tasks. Li et al. (2024) utilized a single self-attention layer to explore the mechanics of NTP. While these works justify the use of NTP pre-training in the corresponding regimes, they do not provide a rigorous analysis of the training mechanism, especially from the perspective of generalization theory. This motivates a natural question:

Can we establish a generalization analysis of NTP pre-training that explains the effects of model parameters?

This paper answers the above question positively. DOMs (see Figure 1 (b)) usually consist of two components (Zhao et al., 2024): multiple layers of transformer-decoder blocks, shortened as the Representation-Learner (R-L), and a task-specific processor, denoted the Token-Predictor (T-P).

[Figure 1. (a): How NTP works with a decoder-only model (DOM): for every input token $t_j$ ($0 \le j < m$), we obtain an output token $\hat{t}_{j+1}$ whose label is $t_{j+1}$; the dashed lines represent the context used. (b): The architecture of the DOM, which is consistent with GPT-3 (Brown et al., 2020). (c): The architecture of the transformer-decoder block (Vaswani et al., 2017): Masked-Multi-Head-Attention, Add & Layer-Norm, Feed-Forward, Add & Layer-Norm.]

Similar to Zhang et al. (2024a); Deng et al. (2024), we abstract DOMs into a token-based composite function made up of two separate functions for the R-L and the T-P, respectively. We account for the dependence between tokens and use φ-mixing to delineate the inter-token dependencies, a commonly used tool in non-i.i.d. settings (Masuda, 2007; Mohri & Rostamizadeh, 2010; McDonald et al., 2015; Wong et al., 2020). We then propose a theoretical framework for NTP from the perspective of statistical learning. To bound the excess risk of NTP, we introduce the concept of Rademacher complexity for composite function classes and propose a decomposition law, as stated in Proposition 4.8. We then establish two distinct bounds on the generalization capability of NTP, depending on whether Proposition 4.8 is applied. To further assess the impact of the model parameters of DOMs on NTP, we provide a refined estimate of the covering number for multi-layer, multi-head transformer-decoder blocks. We are the first to consider the mask matrix in self-attention, which is crucial for NTP. We then use the covering number of DOMs to obtain the corresponding Rademacher complexity upper bound by utilizing the theory of Bartlett et al. (2017); Lin & Zhang (2019), and establish generalization bounds for DOMs-based NTP pre-training. Our results primarily involve three key parameters: the number of token sequences N, the maximum sequence length m, and the number of transformer model parameters Θ. Our generalization bound can be expressed as $O(\sqrt{\Theta/(Nm)}) + O(\sqrt{1/m})$, where $O(\sqrt{\Theta/(Nm)})$ signifies the generalization capability across token sequences and $O(\sqrt{1/m})$ denotes the generalization capability among individual tokens. Our bounds remain valid even under modifications to the structure of the transformer-decoder block.
Our main contributions are summarized as follows:

A novel Rademacher complexity decomposition method: We consider the dependence between tokens and provide a theoretical framework for NTP pre-training (Section 3). On this basis, we establish Rademacher complexity upper bounds on the excess risk via a novel Rademacher complexity decomposition method (Section 4.1), which shows that the generalization performance of NTP pre-training is related to both sequences and tokens.

A refined covering number for multi-layer, multi-head transformer-decoder models: We establish bounds on the covering number of the function space induced by a multi-layer, multi-head transformer-decoder model based on masked self-attention (Section 4.2). Unlike previous works, our theoretical results are the first to consider the mask matrix in self-attention under the metric induced by the Frobenius norm.

A generalization bound for DOMs-based NTP pre-training: We use the Rademacher complexity upper bound and the covering number to establish the generalization theory of DOMs-based NTP pre-training (Section 4.3). The theoretical results imply that the generalization bound mainly depends on the number of token sequences N, the maximum length of the token sequence m, and the number of model parameters Θ. Data experiments in Section 5 verify our theoretical findings.

2. Related Work

Next-Token-Prediction (NTP). Beyond its prominence in NLP (Moon et al., 2021; De Souza Pereira Moreira et al., 2021), NTP has found applications in diverse domains, including object recognition (Yue et al., 2024), sensorimotor trajectory prediction (Radosavovic et al., 2024), autonomous driving (Wu et al., 2024; Jia et al., 2024), and code-related tasks (Izadi et al., 2022; Kim et al., 2021; Qi et al., 2024).

Table 1. Generalization bounds for transformer-based models in different pre-training scenarios. (N: the number of token sequences; m: the maximum sequence length; T: the number of prompts; L: the number of transformer layers; d: the model dimension of the transformer; Θ ≈ 12Ld²: the number of transformer model parameters; C: a constant greater than 1.)

| Ref. | Scenario | Technique | Bound |
|---|---|---|---|
| Edelman et al. (2022) | seq2seq Pretraining | Rademacher Complexity | $O\big(\sqrt{C^{O(L)}\ln(Nmd)}\big)$ |
| Li et al. (2023) | ICL Pretraining | Stability | $O\big(C\ln N\big)$ |
| Zhang et al. (2023) | ICL Pretraining | Operator Approximation | $O\big(L^2 d^2\ln(1+NTC)\big)$ |
| Deng et al. (2024) | MAE Pretraining | Rademacher Complexity | $O\big(L(mC)^{O(L)}\ln d\big)$ |
| Ours | NTP Pretraining | Rademacher Complexity | $O\big(\sqrt{\Theta\ln(1+NmC)}\big)$ |

While vision applications remain less explored, Kilian et al. (2024) demonstrated that NTP excels in prompt adherence and throughput efficiency for image synthesis, though diffusion models achieve superior image quality and lower latency. Extensions of standard NTP include multi-token prediction (Qi et al., 2020; Gloeckle et al., 2024) and Diffusion Forcing (Chen et al., 2024), a hybrid training paradigm combining NTP with diffusion for sequence generation. Theoretical insights reveal NTP's foundational capabilities: Malach (2023) proved that autoregressive NTP can approximate complex functions using a novel length-complexity measure. Thrampoulidis (2024) identified an implicit bias toward structured solutions in gradient-based NTP optimization at low training loss. Flemings et al. (2024) proposed PMixED, a differentially private protocol for LLM-based NTP. Madden et al. (2024) established memory capacity bounds for decoder-only transformers in NTP. Li et al.
(2024) showed that self-attention learns token-retrieval automata via token-priority graphs.

Generalization theory for pre-training and transformer-based models. Generalization characterizations of pre-training have been stated for many learning paradigms, such as curriculum learning (Zhou et al., 2022), transfer learning (Tripuraneni et al., 2020; Xu & Tewari, 2021; Lotfi et al., 2022), and reinforcement learning (RL) (Ye et al., 2023; Lin et al., 2023). Moreover, Zhang et al. (2024a) constructed a generalization theory for supervised pre-training and fine-tuning to explore the trade-off between intra-class and inter-class diversity in pre-training datasets. Deng et al. (2024) developed a generalization bound for the unsupervised pre-training of Masked Autoencoders (MAE); their results are mainly based on the covering number theory for transformer-encoder models established by Edelman et al. (2022). For the generalization of transformer-based models, Deora et al. (2023) derived generalization bounds for single-layer multi-head-attention models based on the stability of SGD (Lei & Ying, 2020; Zhang et al., 2024b). Recently, theoretical understandings have been provided for the generalization ability of In-context Learning (ICL). ICL pre-training has been investigated theoretically from the aspects of multi-task learning (Li et al., 2023) and Markov processes (Zhang et al., 2023), respectively. Notably, Lotfi et al. (2023) derived the first non-vacuous generalization bounds for pre-trained LLMs. Later, Lotfi et al. (2024) introduced a martingale-based bound that captures token-level dependencies. Table 1 highlights the key differences of our theoretical result by comparing it with the most related progress in (Edelman et al., 2022; Li et al., 2023; Zhang et al., 2023; Deng et al., 2024).

Notation. Throughout the paper, we denote the set $\{1, \dots, n\}$ by $[n]$, and for a matrix $W$ we write $\|W\|_{\ell_\infty} := \max_{i,j} |W_{i,j}|$.

3. Preliminaries

This section introduces the framework of NTP pre-training and defines the architecture of DOMs.

3.1. Problem Setting

Consider a set of tokens $\mathcal{T}$ whose vocabulary size is $n_v = |\mathcal{T}|$. We are given a pre-training dataset $D = \{X_i\}_{i=1}^N \subseteq \mathcal{X}$, where $X_i$ denotes the $i$-th token sequence and $\mathcal{X}$ is an instance domain such as sentences. We assume there exists an unknown distribution $\mathcal{D}$ such that $\{X_i\}_{i=1}^N \sim \mathcal{D}$ and all sequences are independent of each other. Each sequence $X_i$ is composed of $m$ tokens $\{t^i_1, \dots, t^i_m\} \subseteq \mathcal{T}$, where $t^i_j \in \mathbb{R}^{n_v}$ denotes the $j$-th token of the $i$-th sequence, and $m$ denotes the maximum input length of a language model $LM$. Note that the token sequences considered here have all undergone a series of preprocessing operations, such as cropping, masking, and patching, so all sequences have the same length. We denote $T^i_j = \{t^i_0, t^i_1, \dots, t^i_{j-1}\}$ as the context of $t^i_j$, where $t^i_0 = t_0 \in \mathcal{T}$ denotes the given begin sign <|im_start|> for all sequences. Note that $T^i_0$ is the empty context for $t^i_0$.

Next-Token-Prediction. For NTP, we abstract the model $LM: \mathcal{X} \times \mathcal{T} \to \mathcal{T}$ as an algorithm that maps the context $T_{j-1}$ to a function $LM(T_{j-1}, \cdot)$; a similar idea can be found in Li et al. (2023). Feeding the token $t_{j-1}$ into this function yields a response $\hat{t}_j = LM(T_{j-1}, t_{j-1})$, and we hope that $\hat{t}_j$ is as close to $t_j$ as possible. More details about next-token prediction can be found in Figure 1.
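To make the NTP objective concrete before it is formalized in Equation (1) below, the following is a minimal sketch of the per-sequence loss, assuming a PyTorch-style decoder-only model `lm` that maps a batch of token ids to next-token logits (the names `lm` and `bos_id` are illustrative and are not taken from the released code):

```python
import torch
import torch.nn.functional as F

def ntp_loss(lm, token_ids, bos_id=0):
    """Average next-token cross-entropy over one batch of sequences.

    token_ids: (N, m) integer tensor, row i holding X_i = [t_1, ..., t_m].
    lm: callable mapping a (N, m) id tensor to (N, m, n_v) logits, where the
        logit at position j may only depend on tokens at positions <= j.
    """
    N, m = token_ids.shape
    bos = torch.full((N, 1), bos_id, dtype=token_ids.dtype)
    inputs = torch.cat([bos, token_ids[:, :-1]], dim=1)  # Z = [t_0, ..., t_{m-1}]
    logits = lm(inputs)                                   # (N, m, n_v)
    # the prediction at position j is scored against the label t_j, for every j in [m]
    return F.cross_entropy(logits.reshape(N * m, -1), token_ids.reshape(N * m))
```

Since all sequences share the same length m, this mean over all N*m positions coincides with first averaging over tokens within a sequence and then over the N sequences, which is exactly the empirical risk minimized during pre-training.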
Note that the model $LM$ belongs to the class of decoder-only models (DOMs), which are usually composed of a Representation-Learner (R-L) $h \in \mathcal{H} \subseteq \{\mathcal{T} \to \mathcal{I}\}$ and a Token-Predictor (T-P) $g \in \mathcal{G} \subseteq \{\mathcal{I} \to \mathcal{T}\}$, where $\mathcal{I}$ denotes a hidden representation space. We can represent the model $LM$ via
$$LM(T_{j-1}, t_{j-1}) = g\big(h(T_{j-1}, t_{j-1})\big).$$
Denote $z^i_j = (T^i_j, t^i_j)$ as the $j$-th training sample of the $i$-th sequence, where $T^i_j = \{T^i_{j-1}, t^i_{j-1}\}$. Then the empirical risk of NTP on the $i$-th sequence can be defined as
$$\hat{L}_{X_i}(g \circ h) := \frac{1}{m} \sum_{j \in [m]} \ell\big(g \circ h(z^i_j), z^i_j\big), \qquad (1)$$
where $\ell(g \circ h(\cdot), \cdot): \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ represents the pre-training loss function, usually the cross-entropy loss, and $\ell(g \circ h(z^i_j), z^i_j) := \ell\big(g \circ h(T^i_{j-1}, t^i_{j-1}), t^i_j\big)$ denotes the loss on sample $z^i_j$. Pre-training based on NTP trains on each token sequence in the dataset $D$ according to formula (1), thereby obtaining the optimal R-L and T-P. The objective function based on empirical risk minimization (ERM) is
$$\min_{g \in \mathcal{G}, h \in \mathcal{H}} \hat{L}_D(g \circ h) := \frac{1}{N} \sum_{i \in [N]} \hat{L}_{X_i}(g \circ h). \qquad (2)$$
Let $L_{\phi_i}(g \circ h) = \mathbb{E}[\hat{L}_{X_i}(g \circ h)]$ denote the population risk of $\hat{L}_{X_i}(g \circ h)$, and let $L_D(g \circ h) = \mathbb{E}[\hat{L}_D(g \circ h)] = \frac{1}{N} \sum_{i \in [N]} L_{\phi_i}(g \circ h)$ be the population risk of $\hat{L}_D(g \circ h)$. Then the excess risk of the NTP pre-training task can be represented as
$$\mathcal{E}_D(\hat{g}, \hat{h}) := L_D(\hat{g} \circ \hat{h}) - \min_{g \in \mathcal{G}, h \in \mathcal{H}} L_D(g \circ h), \qquad (3)$$
where $\hat{g} \in \mathcal{G}$ and $\hat{h} \in \mathcal{H}$ denote the T-P and R-L learned by solving (2), respectively.

3.2. Decoder-only Models

For a given token sequence $X = [t_1, \dots, t_m] \in \mathbb{R}^{m \times n_v}$, we denote $Z = [t_0, t_1, \dots, t_{m-1}] \in \mathbb{R}^{m \times n_v}$ as the input matrix, which contains all the context information. We consider an $L$-layer, $H$-head decoder-only transformer model as the R-L, which mainly consists of one Embedding layer and $L$ transformer-decoder blocks (see Figure 1 (b)). Throughout the paper we use $d$ to denote the model dimension, $d_k = d/H$ the attention dimension, and $d_f = 4d$ the feed-forward dimension.

Embedding. Token vectors are one-hot vectors, which are in discrete form. We first convert the discrete vectors into continuous vectors through the Embedding operation
$$Z^0 = \mathrm{Embedding}(Z) := ZW_e + W_p,$$
where $Z^0$ denotes the embedded token sequence of $Z$, $W_e \in \mathbb{R}^{n_v \times d}$ denotes the token-embedding matrix, and $W_p \in \mathbb{R}^{m \times d}$ denotes the position-embedding matrix.

Transformer-decoder block. Let $\Pi_{\mathrm{norm}}$ denote the Layer-normalization operator and $\sigma$ the non-linear activation function ReLU. Denote $\mathrm{TF}_{\mathcal{W}^l}$ as the $l$-th transformer-decoder block with parameter set $\mathcal{W}^l = \big(W^l_{F_1}, W^l_{F_2}, \{W^l_{O_h}, W^l_{Q_h}, W^l_{K_h}, W^l_{V_h}\}_{h=1}^{H}\big)$, where $W^l_{F_1} \in \mathbb{R}^{d \times d_f}$, $W^l_{F_2} \in \mathbb{R}^{d_f \times d}$, $W^l_{O_h} \in \mathbb{R}^{d \times d}$, $W^l_{Q_h} \in \mathbb{R}^{d \times d_k}$, $W^l_{K_h} \in \mathbb{R}^{d \times d_k}$, $W^l_{V_h} \in \mathbb{R}^{d \times d}$, and let $Z^l = \mathrm{TF}_{\mathcal{W}^l}(Z^{l-1})$ ($l \ge 1$) denote the output of the $l$-th block, which can be formulated as
$$\mathrm{TF}_{\mathcal{W}^l}(Z^{l-1}) = \Pi_{\mathrm{norm}}\Big(\sigma\big(Y^l W^l_{F_1}\big) W^l_{F_2} + Y^l\Big), \qquad Y^l = \Pi_{\mathrm{norm}}\Big(\sum_{h \in [H]} A^l_h W^l_{O_h} + Z^{l-1}\Big). \qquad (4)$$
Here $A^l_h$ denotes the masked self-attention of the $h$-th head.

Masked self-attention. Denote $\mathrm{softmax}$ as the row-wise softmax operator and let $Q^l_h = Z^{l-1} W^l_{Q_h}$, $K^l_h = Z^{l-1} W^l_{K_h}$, $V^l_h = Z^{l-1} W^l_{V_h}$. Then
$$A^l_h = \mathrm{softmax}\left(\frac{Q^l_h (K^l_h)^\top + M}{\sqrt{d_k}}\right) V^l_h, \qquad (5)$$
where $M \in \mathbb{R}^{m \times m}$ is a mask matrix defined as
$$M_{i,j} = \begin{cases} 0, & j \le i, \\ -\infty, & j > i. \end{cases}$$
Then the R-L can be mathematically formulated as
$$h(Z) := \mathrm{TF}_{\mathcal{W}^L}\big(\cdots \mathrm{TF}_{\mathcal{W}^1}(\mathrm{Embedding}(Z))\big). \qquad (6)$$
The T-P is composed of a linear projection and a softmax:
$$g(h(Z)) = \mathrm{softmax}\big(h(Z) W_P\big), \qquad W_P \in \mathbb{R}^{d \times n_v}. \qquad (7)$$
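As a concrete illustration of Equation (5) and the mask matrix $M$, here is a minimal sketch of a single masked self-attention head, assuming a PyTorch-style implementation (the shapes and names are ours for illustration; the released code may differ):

```python
import torch

def masked_self_attention(Z, W_Q, W_K, W_V):
    """One masked self-attention head A(Z) in the spirit of Eq. (5).

    Z: (m, d) input representations; W_Q, W_K: (d, d_k); W_V: (d, d).
    """
    m = Z.shape[0]
    d_k = W_Q.shape[1]
    Q, K, V = Z @ W_Q, Z @ W_K, Z @ W_V
    # Mask matrix M: 0 on and below the diagonal, -inf strictly above it,
    # so position i can only attend to positions j <= i (causal masking).
    M = torch.triu(torch.full((m, m), float("-inf")), diagonal=1)
    scores = (Q @ K.T + M) / d_k ** 0.5
    return torch.softmax(scores, dim=-1) @ V  # row-wise softmax, output (m, d)

# Toy usage with random weights: m = 8 positions, d = 16, H = 4 heads => d_k = 4.
m, d, d_k = 8, 16, 4
Z = torch.randn(m, d)
out = masked_self_attention(Z, torch.randn(d, d_k), torch.randn(d, d_k), torch.randn(d, d))
print(out.shape)  # torch.Size([8, 16])
```

Stacking $H$ such heads, projecting each with $W_{O_h}$, and adding the residuals and layer normalizations recovers the block in Equation (4).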
4. Main Results

Note that the tokens within one sequence are dependent, which is usually called a non-i.i.d. process. We introduce the concept of φ-mixing processes to characterize the dependency relationship between tokens.

Definition 4.1. Let $\mathbf{T} = \{t_j\}_{j=-\infty}^{\infty}$ be a stationary process (Hirschfeld, 1935). $\mathbf{T}$ is said to be exponentially φ-mixing (Dobrushin, 1956) if there exist constants $\varphi_0 > 0$, $\varphi_1 > 0$ and $r > 0$ such that the φ-mixing coefficient satisfies
$$\varphi(k) := \sup_{n} \sup_{A \in \sigma_{n+k}^{\infty},\, B \in \sigma_{-\infty}^{n}} \big|\Pr[A \mid B] - \Pr[A]\big| \le \varphi_0 \exp(-\varphi_1 k^r), \qquad \forall k \in \mathbb{N},$$
where $\sigma_i^j$ denotes the σ-algebra generated by the random variables $t_i, \dots, t_j$.

Based on the above definition, we now make the following assumption on the dataset $D$.

Assumption 4.2. Assume that $X_i = \{t^i_1, \dots, t^i_m\}$ is generated by a φ-mixing distribution $\phi_i$ for all $i$, and there exists an unknown distribution $\mathcal{U}$ such that $U = \{\phi_i\}_{i=1}^{N} \sim \mathcal{U}$.

Remark 4.3. The above assumption is widely adopted in the study of non-i.i.d. processes, e.g., Ralaivola et al. (2010); Heinrich & Pawlas (2013); Vankadara et al. (2022); Liu et al. (2025b). In Definition 4.1, $\lim_{k \to +\infty} \varphi(k) = 0$ means that A and B become independent as k increases. When A and B represent two different sentences, the farther apart A and B are (for instance, coming from two different articles), the smaller the correlation between A and B will be. Therefore, Assumption 4.2 is reasonable.

Assumption 4.4. There exists a constant $B_\ell \in \mathbb{R}_+$ satisfying $|\ell(\hat{t}, t)| \le B_\ell$ for any $\hat{t}, t \in \mathcal{T}$, and $\ell$ is $G_\ell$-Lipschitz w.r.t. $\hat{t}$.

Assumption 4.4 is commonly used in learning theory (Bartlett & Mendelson, 2002; Shalev-Shwartz & Ben-David, 2014; Liu et al., 2022; Deng et al., 2024; Liu et al., 2025a).

Definition 4.5 (Discrepancy measure). Given the set of distributions $U = \{\phi_i\}_{i=1}^{N}$, we define its discrepancy as
$$\mathrm{disc}(U) := \sup_{k \in [N]} \frac{1}{N} \sum_{i \in [N]} \|\phi_i - \phi_k\|_{TV},$$
where $\|\phi_i - \phi_k\|_{TV} = \sup_{t \in \mathcal{T}} |\phi_i(t) - \phi_k(t)|$ denotes the total variation distance between the two distributions. As shown in (Kuznetsov & Mohri, 2020; Wang et al., 2022a), the closer the distributions in the set $U$ are, the smaller $\mathrm{disc}(U)$ will be. In particular, $\mathrm{disc}(U) = 0$ when $\phi_1 = \cdots = \phi_N$.

4.1. Rademacher Complexity Upper Bounds

To bound the excess risk defined in Equation (3), we introduce a standard measure of the complexity of a function class, the Rademacher complexity (Mohri & Rostamizadeh, 2008).

Definition 4.6 (Rademacher complexity). Given a sample set $S = \{z_1, \dots, z_n\} \subseteq \mathcal{Z}$ and a function class $\mathcal{F}: \mathcal{Z} \to \mathbb{R}$, the empirical Rademacher complexity of $\mathcal{F}$ is defined as
$$\hat{\mathfrak{R}}_S(\mathcal{F}) := \mathbb{E}_{\varepsilon}\Big[\sup_{f \in \mathcal{F}} \frac{1}{n} \sum_{i \in [n]} \varepsilon_i f(z_i)\Big], \qquad (8)$$
where $\varepsilon = \{\varepsilon_1, \dots, \varepsilon_n\}$ are i.i.d. Rademacher random variables satisfying $P(\varepsilon_i = 1) = P(\varepsilon_i = -1) = 0.5$ for all $i \in [n]$.

Due to the dependence of tokens within a sequence and the independence of distinct sequences, it is essential to establish two separate measures of Rademacher complexity. Setting $S = D$ and $\mathcal{F} = \ell \circ \mathcal{G} \circ \mathcal{H}$ in Equation (8), we define the empirical Rademacher complexity of the composite function class $\ell \circ \mathcal{G} \circ \mathcal{H}$ over $D$ as
$$\hat{\mathfrak{R}}_D(\ell \circ \mathcal{G} \circ \mathcal{H}) := \mathbb{E}_{\varepsilon}\Big[\sup_{g \in \mathcal{G}, h \in \mathcal{H}} \frac{1}{N} \sum_{i \in [N]} \varepsilon_i \hat{L}_{X_i}(g \circ h)\Big].$$
For ease of representation, we denote $\ell^i_j = \ell(g \circ h(z^i_j), z^i_j)$. Following the definition of Rademacher complexity for multi-task learning in Wang et al. (2022b), we can also consider all token sequences and define the following multi-sequence Rademacher complexity:
$$\mathfrak{R}_D(\ell \circ \mathcal{G} \circ \mathcal{H}) := \mathbb{E}_{\varepsilon}\Big[\sup_{g \in \mathcal{G}, h \in \mathcal{H}} \frac{1}{Nm} \sum_{i \in [N]} \sum_{j \in [m]} \varepsilon_{ij}\, \ell^i_j\Big].$$

Remark 4.7. The Rademacher complexity defined here closely resembles the representation-induced Rademacher complexity delineated in Deng et al. (2024). However, a notable distinction exists: the innermost function in their composite function is fixed, whereas in our definition it ranges over the entire function class $\mathcal{H}$. We primarily focus on the performance of the R-L because it applies to various downstream scenarios after pre-training, whereas the T-P changes as downstream tasks evolve.
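Definition 4.6 can be made numerically concrete. Below is a toy Monte Carlo sketch (our own illustration, not part of the paper's analysis) that approximates the empirical Rademacher complexity when the supremum over $\mathcal{F}$ is replaced by a maximum over a finite set of candidate functions evaluated on the sample:

```python
import numpy as np

def empirical_rademacher(values, n_draws=2000, seed=0):
    """Monte Carlo estimate of E_eps[ sup_f (1/n) sum_i eps_i f(z_i) ].

    values: (n_functions, n_samples) array with values[f, i] = f(z_i);
    the sup over the class is approximated by a max over these candidates.
    """
    rng = np.random.default_rng(seed)
    n = values.shape[1]
    total = 0.0
    for _ in range(n_draws):
        eps = rng.choice([-1.0, 1.0], size=n)   # i.i.d. Rademacher signs
        total += np.max(values @ eps) / n
    return total / n_draws

# Two candidate functions evaluated on five samples.
vals = np.array([[0.2, -0.1, 0.4, 0.0, 0.3],
                 [0.1, 0.1, -0.2, 0.5, -0.4]])
print(empirical_rademacher(vals))
```

The same recipe, applied with the sequence-level losses $\hat{L}_{X_i}(g \circ h)$ or the token-level losses $\ell^i_j$ in the role of $f(z_i)$, mirrors the two complexities $\hat{\mathfrak{R}}_D$ and $\mathfrak{R}_D$ defined above.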
Consequently, we propose decomposing the Rademacher complexity of $\mathcal{G} \circ \mathcal{H}$ into the complexities of the individual function classes $\mathcal{G}$ and $\mathcal{H}$. This allows a more precise analysis of the influence exerted by $\mathcal{H}$, as follows:

Proposition 4.8 (Rademacher complexity decomposition). Let $\mathcal{F}: \mathcal{Z} \to \mathbb{R}$ be a composite function class satisfying $\mathcal{F} = \ell \circ \mathcal{G} \circ \mathcal{H}$, where $\ell$ is a loss function and $\mathcal{H}$, $\mathcal{G}$ are function classes. Given a sample set $S = \{z_1, \dots, z_n\} \subseteq \mathcal{Z}$, if every $g \in \mathcal{G}$ is $G_g$-Lipschitz w.r.t. $h \in \mathcal{H}$ and $\ell$ is $G_\ell$-Lipschitz w.r.t. $g \circ h \in \mathcal{G} \circ \mathcal{H}$, then
$$\hat{\mathfrak{R}}_S(\ell \circ \mathcal{G} \circ \mathcal{H}) \le G_\ell G_g\, \hat{\mathfrak{R}}_S(\mathcal{H}) + G_\ell\, \hat{\mathfrak{R}}_S(\mathcal{G} \circ \hat{h}),$$
where $\hat{h}$ is any given function in $\mathcal{H}$.

Theorem 4.9. Given a pre-training dataset $D$ containing $N$ token sequences $\{X_i\}_{i=1}^{N} \subseteq \mathcal{X}$ satisfying the distribution conditions in Assumption 4.2, denote $\hat{g}$ and $\hat{h}$ as the optimal T-P and R-L obtained by solving Equation (2), respectively. Then, under Assumption 4.4, for some $\varphi_0 > 0$, $\varphi_1 > 0$ and $r > 0$, the following holds with probability at least $1 - \delta$:
$$\mathcal{E}_D(\hat{g}, \hat{h}) \le \underbrace{6\,\mathfrak{R}_D(\ell \circ \mathcal{G} \circ \mathcal{H}) + B_\ell \sqrt{\frac{\ln(2/\delta)}{2N}}}_{\mathrm{I}} + \underbrace{B_\ell \sqrt{\frac{\bar{m}\,\ln(2/\delta)}{2m}}}_{\mathrm{II}} + 4 B_\ell\, \mathrm{disc}(U),$$
where $\bar{m} \le 1 + 2\sum_{k=1}^{m} \varphi(k)$ and $\varphi(k) \le \varphi_0 \exp(-\varphi_1 k^r)$ for all $k \in [m]$. The proofs of Proposition 4.8 and Theorem 4.9 are provided in Appendix A and Appendix B, respectively.

Remark 4.10. Item I represents the generalization error of NTP pre-training on the dataset $D$, reflecting the model's ability to generalize to unseen token sequences, i.e., its generalization capability within the sequence space. Item II denotes the average generalization capability on individual token sequences $X_i$, indicating the model's local generalization ability within the token space. The last item, $\mathrm{disc}(U)$ (see Definition 4.5), reflects the influence of the quality of the pre-training dataset. Since $\varphi_0$, $\varphi_1$, and $r$ are all positive, the positive series $\sum_{k=1}^{\infty} \varphi(k)$ converges. Therefore, there exists a constant $C_{\varphi_0, \varphi_1, r} > 0$ such that $\bar{m} \le 2 C_{\varphi_0, \varphi_1, r}$. For simplicity, we will use the constant $C_{\varphi, r}$ to denote this upper bound of $\bar{m}/2$ in the subsequent analysis.

The following corollary is derived by combining Theorem 4.9 and Proposition 4.8.

Corollary 4.11. Under the same assumptions as Theorem 4.9, if $g$ is $G_g$-Lipschitz w.r.t. $h$ for any $g \in \mathcal{G}$, $h \in \mathcal{H}$, then there exists a constant $C_{\varphi, r} > 0$ such that the following inequality holds with probability at least $1 - \delta$:
$$\mathcal{E}_D(\hat{g}, \hat{h}) \le \underbrace{6 G_\ell G_g\, \mathfrak{R}_D(\mathcal{H})}_{(\mathrm{I})} + \underbrace{6 G_\ell\, \mathfrak{R}_D(\mathcal{G} \circ \hat{h})}_{(\mathrm{II})} + B_\ell \sqrt{\frac{\ln(2/\delta)}{2N}} + B_\ell \sqrt{\frac{C_{\varphi, r}\,\ln(2/\delta)}{m}} + 4 B_\ell\, \mathrm{disc}(U).$$

Remark 4.12. Item (I) is exclusively associated with the complexity of the R-L, while item (II) depends solely on the complexity of the T-P. These two items operate independently, allowing separate analysis of the effects of the R-L and the T-P on generalization performance. This independence also simplifies replacing the T-P, since only item (II) needs to be redefined.

4.2. Capacity of Transformer-decoder Models

To investigate the effect of the parameters of the transformer-decoder model (i.e., the R-L) on the generalization performance of NTP, we use the covering number to quantify the Rademacher complexity of the R-L. We begin with a general definition of the covering number.

Definition 4.13 (ϵ-cover and covering number). Denote $(U, \|\cdot\|)$ as a metric space and $V \subseteq U$. For any $\epsilon > 0$, $V$ is called an ϵ-cover of $U$ if for any $u \in U$ there exists $v \in V$ such that $\|u - v\| \le \epsilon$. The covering number of $(U, \|\cdot\|)$ is the cardinality of the smallest ϵ-cover, defined by
$$\mathcal{N}(U, \epsilon, \|\cdot\|) := \min\{|V| : V \text{ is an ϵ-cover of } U\}.$$
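As a toy illustration of Definition 4.13 (our own example; it plays no role in the proofs), a greedy construction yields an ϵ-cover of a finite point set and hence an upper bound on its covering number:

```python
import numpy as np

def greedy_epsilon_cover(points, eps):
    """Greedily pick centers so that every point lies within eps of some center.

    points: (n, d) array viewed as a subset U of (R^d, ||.||_2). The returned
    set V is an eps-cover of U, so |V| upper-bounds N(U, eps, ||.||_2) among
    covers whose centers are restricted to U itself.
    """
    cover = []
    for p in points:
        if all(np.linalg.norm(p - c) > eps for c in cover):
            cover.append(p)
    return np.array(cover)

rng = np.random.default_rng(0)
U = rng.uniform(size=(500, 2))                # 500 points in the unit square
print(len(greedy_epsilon_cover(U, eps=0.1)))  # size of the constructed 0.1-cover
```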
Assumption 4.14. (i) $\Pi_{\mathrm{norm}}$ is $G_\pi$-Lipschitz with respect to the $\ell_2$-norm, i.e., for all $t_1, t_2 \in \mathbb{R}^d$, $\|\Pi_{\mathrm{norm}}(t_1) - \Pi_{\mathrm{norm}}(t_2)\|_{\ell_2} \le G_\pi \|t_1 - t_2\|_{\ell_2}$. (ii) For all $l \in [L]$ and $h \in [H]$, there exists a constant $C_l$ such that $\big\| Q^l_h (K^l_h)^\top / \sqrt{d_k} \big\|_{\ell_\infty} \le C_l$. (iii) For all $l \in [L]$ and $\mathcal{W}^l \in \mathbb{W}^l$, there exists a constant $B_l$ satisfying $\|\mathcal{W}^l\|_F \le B_l$.

The second condition in Assumption 4.14 is reasonable due to the presence of the scaling factor $\sqrt{d_k}$ in the self-attention mechanism. The first and third conditions have been used previously in the analysis of transformer covering numbers (Edelman et al., 2022; Deng et al., 2024). Since the learnable parameters of the transformer model are all fully connected layer parameters, we introduce the following lemma of Lin & Zhang (2019):

Lemma 4.15. Let $X \in \mathbb{R}^{n \times d_{in}}$ be a given input matrix with bounded Frobenius norm, and let $W \in \mathbb{R}^{d_{in} \times d_{out}}$ be such that $\|W\|_F \le a$. Then
$$\ln \mathcal{N}\big(\{XW : \|W\|_F \le a\}, \epsilon, \|\cdot\|_F\big) \le d_{in} d_{out} \ln\left(1 + \frac{2a\|X\|_F}{\epsilon}\right).$$

Lemma 4.15 emphasizes the impact of model parameters on the covering number, aligning with our research objectives. Based on this lemma, we provide an upper bound on the logarithmic covering number of masked self-attention as follows.

Lemma 4.16 (Simplification of Lemma C.10). Given an input sequence $S = \{Z_1, \dots, Z_N\} \subseteq \mathbb{R}^{m \times d}$, denote $Z_{[N]} = [Z_1, \dots, Z_N] \in \mathbb{R}^{Nm \times d}$ as the concatenated data matrix. Consider the masked self-attention head $A(\cdot)$ (ignoring the layer and head indices) defined in Equation (5), whose corresponding function class is
$$\mathcal{H}^A_S := \big\{Z \mapsto A(Z) : \|W_Q\|_F, \|W_K\|_F, \|W_V\|_F \le B\big\}.$$
Then
$$\ln \mathcal{N}\big(\mathcal{H}^A_S, \epsilon, \|\cdot\|_F\big) \le d\, d_k \ln\left(1 + \frac{B^3\, \|Z\|_{F,\infty}\, \|Z_{[N]}\|_F^2\, \sqrt{N \ln m}}{d_k\, \epsilon\, \|Z_{[N]}\|_F}\right),$$
where $\|Z\|_{F,\infty} := \max_{i \in [N]} \|Z_i\|_F$.

Based on Lemma 4.16, we can further obtain an upper bound on the logarithmic covering number of a single transformer-decoder block.

Proposition 4.17. For the transformer-decoder block $\mathrm{TF}_{\mathcal{W}}$ (ignoring the layer indices) defined in Equation (4), the corresponding function class is
$$\mathcal{H}^{TF}_S := \big\{Z \mapsto \mathrm{TF}_{\mathcal{W}}(Z) : \|\mathcal{W}\|_F \le B, \mathcal{W} \in \mathbb{W}\big\}.$$
Then the following covering number bound holds:
$$\ln \mathcal{N}\big(\mathcal{H}^{TF}_S, \epsilon, \|\cdot\|_F\big) \le 4 d^2 (H + 3) \ln\left(1 + \frac{\omega}{\epsilon}\right), \quad \text{where } \omega = G_\pi^2 B^2 (B^2 + 1)\big(e^{C} B^2 H \sqrt{N \ln m} + 1\big)\, \|Z_{[N]}\|_F.$$

The following two lemmas concern the Lipschitz continuity of the transformer-decoder model.

Lemma 4.18. For a single transformer-decoder block $\mathrm{TF}_{\mathcal{W}}(\cdot)$ parameterized by $\mathcal{W}$ (ignoring the layer indices), let $Z, \hat{Z} \in \mathbb{R}^{m \times d}$ be any input matrices. Then
$$\big\|\mathrm{TF}_{\mathcal{W}}(Z) - \mathrm{TF}_{\mathcal{W}}(\hat{Z})\big\|_F \le G_\pi^2 (B^2 + 1)\big(e^{C} B^2 H \sqrt{md} + 1\big)\, \|Z - \hat{Z}\|_F.$$

Lemma 4.19. Let $Z^l \in \mathbb{R}^{m \times d}$ be the output matrix of the $l$-th decoder block. Then
$$\|Z^l\|_F \le \prod_{j \in [l]} G_\pi^2 (B_j^2 + 1)\big(e^{C_j} B_j^2 H \sqrt{\ln m} + 1\big)\, \|Z^0\|_F.$$

Based on Proposition 4.17 and Lemmas 4.18 and 4.19 (proved in Appendix C.3), we obtain the following upper bound on the logarithmic covering number of the R-L.

Theorem 4.20. Let $D = \{X_i\}_{i=1}^N$ be a dataset containing $N$ token sequences, let $Z_{[N]} = [Z_1, \dots, Z_N] \in \mathbb{R}^{Nm \times n_v}$ be the input matrix generated from $D$, and denote $Z^0_{[N]} \in \mathbb{R}^{Nm \times d}$ as the embedded matrix. The function class of the R-L defined in Equation (6) is
$$\mathcal{H} := \big\{Z \mapsto h(Z) : \|\mathcal{W}^l\|_F \le B_l, \mathcal{W}^l \in \mathbb{W}^l, \forall l \in [L]\big\}.$$
Then, under Assumption 4.14,
$$\ln \mathcal{N}\big(\mathcal{H}, \epsilon, \|\cdot\|_F\big) \le \frac{\Theta}{L} \sum_{l=1}^{L} \ln\left(1 + \frac{L B_l^2\, s_L\, \|Z^0_{[N]}\|_F}{\epsilon}\right),$$
where $\Theta \approx 12 L d^2$ is the number of model parameters and $s_L := \prod_{l \in [L]} G_\pi^2 (B_l^2 + 1)\big(e^{C_l} B_l^2 H \sqrt{\ln m} + 1\big)$.

Remark 4.21. As demonstrated in Bartlett et al. (2017), the logarithm of the covering number of $\mathcal{H}$ under the infinity norm is bounded above by that under the Frobenius norm, i.e., $\ln \mathcal{N}(\mathcal{H}, \epsilon, \|\cdot\|_{\ell_\infty}) \le \ln \mathcal{N}(\mathcal{H}, \epsilon, \|\cdot\|_F)$. However, our approach bounds the Frobenius-norm covering number with a smaller order of $O(L)$, compared to the orders $C^{O(L)}$ (where $C > 1$) reported in Edelman et al. (2022); Deng et al. (2024).
This indicates that our method offers a significant advantage over previous studies.

4.3. Generalization Bounds for DOMs

Inspired by Bartlett et al. (2017), we use the covering number to deduce an upper bound on the Rademacher complexity. Then the excess risk of NTP pre-training can be bounded by integrating Corollary 4.11 and Theorem 4.20.

Theorem 4.22. Let $Z_{[N]} \in \mathbb{R}^{Nm \times n_v}$ be the input sequences generated from the dataset $D$. Denote $\hat{g}$ and $\hat{h}$ as the optimal T-P and R-L learned from Equation (2), respectively. Then, under Assumptions 4.2, 4.4 and 4.14, there exists a constant $C_{\varphi, r} > 0$ such that the following inequality holds with probability at least $1 - \delta$:
$$\mathcal{E}_D(\hat{g}, \hat{h}) \le O\left(\sqrt{\frac{\Theta\, \tau_1}{Nm}}\right) + O\left(\sqrt{\frac{C_{\varphi, r}\, \ln(2/\delta)}{m}}\right) + 4 B_\ell\, \mathrm{disc}(U),$$
where $\Theta$ is the number of model parameters and $\tau_1 = \ln(1 + \rho_L s_L)$, with $\rho_L = \sum_{l=1}^{L} B_l^2$ and the constant $s_L$ defined in Theorem 4.20.

Remark 4.23. We focus on three parameters: the number of token sequences $N$, the maximum length of the token sequence $m$, and the number of model parameters $\Theta$. Our bound is $O(\sqrt{\Theta/(Nm)}) + O(\sqrt{C_{\varphi, r}/m})$, where $O(\sqrt{\Theta/(Nm)})$ reflects the generalization ability between token sequences and $O(\sqrt{C_{\varphi, r}/m})$ reflects the generalization capacity among tokens within a sequence. Here, $C_{\varphi, r}$ is a constant related to the φ-mixing coefficient, indicating the distribution quality of a single token sequence, while $\mathrm{disc}(U)$ reflects the overall distribution quality of all token sequences. Our bound thus captures the impact of both dataset quality and individual sample quality on generalization performance. Unlike previous works (see Table 1), we show that effective generalization requires both a larger $N$ and a larger $m$, enabling the model to generalize across both the sequence space and the token space. Additionally, as $\Theta$ increases, the total number of tokens $Nm$ should also increase to achieve better generalization.

[Figure 2. Experiments on the MiniMind and DAMO_NLP datasets: train and test loss versus training step (0-700) for (a) miniNTP-1 to miniNTP-4, (b) miniNTP-4 to miniNTP-6, and (c) miniNTP-4 and miniNTP-7 to miniNTP-9.]

Remark 4.24. It should be noted that our bounds remain valid even when modifications are made to the structure of the transformer-decoder block, for instance, replacing post-layer normalization with pre-layer normalization or substituting the ReLU activation function with SiLU. This is because such modifications only affect the logarithmic terms $\rho_L$ and $s_L$.

5. Experiments

To validate the theoretical contribution of this paper, specifically Theorem 4.22, we performed a set of NTP pre-training experiments with DOMs. These experiments were designed to systematically examine the influence of model parameters and sample size on generalization performance.

Model Architecture. Our architecture largely follows the GPT-2 framework, with two key modifications: (1) the adoption of the SiLU activation function and (2) RMSNorm normalization (Zhang & Sennrich, 2019). As demonstrated in Remark 4.24, these adjustments do not compromise the validity of our theoretical framework.

Datasets. For pretraining, we employ the MiniMind dataset¹, while our test set consists of 8,192 samples (with a maximum sequence length of m = 512) carefully selected from the DAMO_NLP dataset². Both datasets belong to the category of Chinese text generation datasets. Due to the consistent pretraining corpus, we adopt the same tokenizer as MiniMind³, preserving a vocabulary size of $n_v = 6400$.

¹ https://www.modelscope.cn/datasets/gongjy/minimind_dataset
² https://www.modelscope.cn/datasets/DAMO_NLP/lcsts_test_set
³ https://github.com/jingyaogong/minimind
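All training and test sequences are preprocessed to a common maximum length m (Section 3.1). The following is a minimal sketch of how such fixed-length NTP examples can be prepared from tokenizer output; the names (`pad_id`, `bos_id`) are hypothetical and this is not the actual MiniMind pipeline:

```python
def prepare_sequence(token_ids, m, pad_id, bos_id):
    """Crop or pad a tokenized text to exactly m tokens, prepending the begin sign t_0.

    Returns the model input [t_0, t_1, ..., t_{m-1}] and labels [t_1, ..., t_m].
    """
    ids = token_ids[:m]
    ids = ids + [pad_id] * (m - len(ids))
    inputs = [bos_id] + ids[:-1]
    labels = ids
    return inputs, labels

inputs, labels = prepare_sequence(list(range(5)), m=8, pad_id=0, bos_id=1)
print(inputs)  # [1, 0, 1, 2, 3, 4, 0, 0]
print(labels)  # [0, 1, 2, 3, 4, 0, 0, 0]
```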
Training Protocol. Our training methodology follows the approach outlined in MiniMind. To optimize efficiency, we employ FlashAttention (Dao et al., 2022) for accelerated attention computation and conduct distributed training on 8 NVIDIA A800-80GB GPUs using DeepSpeed ZeRO-2 (Rajbhandari et al., 2020). For optimization, we use the AdamW optimizer (Loshchilov & Hutter, 2017) combined with a cosine learning-rate scheduler that includes a 20-step warm-up phase at the start of training. Full specifications for the model architectures, dataset preprocessing, and training configurations are detailed in Table 3.

Table 3. Model architectures, training data specifications, hyperparameter configurations, and test PPL (m = 512).

| Model | Θ | L | H | d | m | N% | Batch Size | Learning Rate | PPL |
|---|---|---|---|---|---|---|---|---|---|
| miniNTP-1 | 0.029B | 8 | 8 | 512 | 64 | 100 | 0.5M | 5.0e-4 | 316024.25 |
| miniNTP-2 | 0.029B | 8 | 8 | 512 | 128 | 100 | 0.5M | 5.0e-4 | 130613.71 |
| miniNTP-3 | 0.029B | 8 | 8 | 512 | 256 | 100 | 0.5M | 5.0e-4 | 24343.04 |
| miniNTP-4 | 0.029B | 8 | 8 | 512 | 512 | 100 | 0.5M | 5.0e-4 | 1.49 |
| miniNTP-5 | 0.029B | 8 | 8 | 512 | 512 | 50 | 0.5M | 5.0e-4 | 3.17 |
| miniNTP-6 | 0.029B | 8 | 8 | 512 | 512 | 75 | 0.5M | 5.0e-4 | 1.95 |
| miniNTP-7 | 0.002B | 6 | 4 | 128 | 512 | 100 | 0.5M | 1.0e-3 | 5.76 |
| miniNTP-8 | 0.09B | 12 | 12 | 768 | 512 | 100 | 0.5M | 6.0e-4 | 1.13 |
| miniNTP-9 | 0.31B | 24 | 16 | 1024 | 512 | 100 | 0.5M | 3.0e-4 | 1.05 |

5.2. Main Results

Maximum sequence length m. As demonstrated in Figure 2(a), we performed experiments with varying maximum sequence lengths m (64, 128, 256, 512) in the training dataset while holding all other parameters constant. Notably, the test dataset retained a fixed maximum sequence length of 512 across all evaluations. While models trained on shorter sequences converge more rapidly, they exhibit limited generalization capability when applied to longer sequences. This observation accounts for the monotonically increasing test loss observed for miniNTP-1, 2, and 3 as the training sequence length decreases. As shown in Table 2, models trained on shorter sequences deliver strong performance on test cases with similarly short sequences. Importantly, models trained on longer sequences maintain robust performance even when evaluated on shorter sequences, highlighting their adaptability.

Table 2. Test sample perplexity (PPL) across models under variable maximum sequence lengths (m).

| Model | m = 64 | m = 128 | m = 256 | m = 512 |
|---|---|---|---|---|
| miniNTP-1 | 1.10 | 69.35 | 1578.14 | 316024.25 |
| miniNTP-2 | 1.24 | 1.31 | 224.25 | 130613.71 |
| miniNTP-3 | 1.03 | 1.05 | 1.08 | 24343.04 |
| miniNTP-4 | 1.03 | 1.16 | 1.24 | 1.49 |

The number of sequences N. For models with identical architectural configurations, we demonstrate an enhancement in generalization performance as the number of training sequences increases. As illustrated in Figure 2(b), while holding the model parameters, training hyperparameters, and maximum sequence length (m = 512) constant, we evaluated performance on 50%, 75%, and 100% subsets of the complete pretraining dataset. The experimental results reveal that diminishing the training data volume not only compromises model generalization but also significantly impairs convergence and training stability.
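Tables 2 and 3 report test perplexity (PPL). Under the standard convention (our assumption; the paper does not spell out the evaluation formula), PPL is the exponential of the average next-token cross-entropy in nats:

```python
import math

def perplexity(total_nll_nats, num_predicted_tokens):
    """PPL = exp(average next-token negative log-likelihood)."""
    return math.exp(total_nll_nats / num_predicted_tokens)

# e.g., an average loss of 0.4 nats per predicted token gives PPL of about 1.49
print(perplexity(total_nll_nats=400.0, num_predicted_tokens=1000))
```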
Model size Θ. Our analysis in Figure 2(c) evaluates how the model parameter size (Θ) affects generalization performance under a fixed training sequence count (N) and maximum length (m), with configurations adapted from Biderman et al. (2023). Early in training (step < 100), smaller models converge faster, but prolonged training shows that larger models achieve superior convergence rates and lower final losses. As predicted by Theorem 4.20, this divergence arises from capacity limits: smaller models saturate earlier, while larger ones continue learning from additional tokens.

[Figure 3. Model generalization ability radar chart comparing miniNTP-1 through miniNTP-9; the axes include final train loss, final test loss, average generalization gap, and minimum test loss.]

Figure 3 provides a comprehensive visualization comparing the nine models across five distinct evaluation dimensions, with radial distance from the center indicating performance quality in each dimension. The analysis reveals that extending both the maximum sequence length (m) and the number of training sequences (N) improves model generalization in the NTP pretraining task. Notably, increasing m produces more substantial performance gains than expanding N, though this comes with increased training complexity. Additionally, while larger model parameter counts facilitate learning richer token representations and raise the model's capacity ceiling, this architectural expansion should be matched with a corresponding increase in the total token quantity to mitigate overfitting risks.

6. Conclusion

This paper presents a generalization error analysis for next-token-prediction pre-training, a widely used paradigm in large language models. Our theoretical results enhance the understanding of how model parameters influence generalization ability. We find that generalization depends on the number of token sequences, the maximum sequence length, and the number of parameters in the transformer model. Empirical evaluations confirm our theoretical findings through data experiments.

7. Future Work

In the rapidly advancing field of large language models, theoretical foundations remain underdeveloped. While our study addresses part of this research gap, we identify several promising avenues for future work. First, although our φ-mixing data modeling approach is theoretically valid, its empirical verification in practical applications requires further investigation. Beyond mixing processes, developing language-specific data distributions could provide deeper insights into how linguistic properties affect model behavior. Second, this work primarily uses Rademacher complexity for theoretical analysis; other frameworks, such as stability-based (Liu et al., 2024b; Zhang et al., 2024b) or information-theoretic (Lu & Van Roy, 2019; Livni, 2023) methods, are viable alternatives. Finally, given the anticipated evolution toward unified multimodal architectures, extending this research to incorporate diverse data modalities represents a crucial direction for future exploration.

Acknowledgments

This work is supported by the National Natural Science Foundation of China (NSFC) (Nos. 62376104 and 12426512) and the Open Research Fund of the Engineering Research Center of Intelligent Technology for Agriculture, Ministry of Education (No. ERCITA-KF002).

Impact Statement

This paper presents work whose goal is to advance the field of Machine Learning.
There are many potential societal consequences of our work, none which we feel must be specifically highlighted here. Ba, J. L., Kiros, J. R., and Hinton, G. E. Layer normalization. ar Xiv preprint ar Xiv: 1607.06450, 2016. Bachmann, G. and Nagarajan, V. The pitfalls of next-token prediction. ar Xiv preprint ar Xiv: 2403.06963, 2024. Bartlett, P. L. and Mendelson, S. Rademacher and gaussian complexities: Risk bounds and structural results. Journal of Machine Learning Research, 3(11):463 482, 2002. Bartlett, P. L., Foster, D. J., and Telgarsky, M. J. Spectrallynormalized margin bounds for neural networks. In Advances in Neural Information Processing Systems, 2017. Biderman, S., Schoelkopf, H., Anthony, Q. G., Bradley, H., O Brien, K., Hallahan, E., Khan, M. A., Purohit, S., Prashanth, U. S., Raff, E., et al. Pythia: A suite for analyzing large language models across training and scaling. In International Conference on Machine Learning, pp. 2397 2430. PMLR, 2023. Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al. Language models are few-shot learners. In Advances in Neural Information Processing Systems, 2020. Chen, B., Monso, D. M., Du, Y., Simchowitz, M., Tedrake, R., and Sitzmann, V. Diffusion forcing: Next-token prediction meets full-sequence diffusion. ar Xiv preprint ar Xiv: 2407.01392, 2024. Dao, T., Fu, D., Ermon, S., Rudra, A., and R e, C. Flashattention: Fast and memory-efficient exact attention with io-awareness. Advances in neural information processing systems, 35:16344 16359, 2022. De Souza Pereira Moreira, G., Rabhi, S., Lee, J. M., Ak, R., and Oldridge, E. Transformers4rec: Bridging the gap between nlp and sequential/session-based recommendation. In ACM Recommender Systems, 2021. Deng, Y., Hong, J., Zhou, J., and Mahdavi, M. On the generalization ability of unsupervised pretraining. In Artificial Intelligence and Statistics, 2024. Deora, P., Ghaderi, R., Taheri, H., and Thrampoulidis, C. On the optimization and generalization of multi-head attention. ar Xiv preprint ar Xiv: 2310.12680, 2023. Dobrushin, R. L. Central limit theorem for nonstationary markov chains.i. Theory of Probability & Its Applications, 1(1):72 88, 1956. Edelman, B. L., Goel, S., Kakade, S., and Zhang, C. Inductive biases and variable creation in self-attention mechanisms. In International Conference on Machine Learning, 2022. Flemings, J., Razaviyayn, M., and Annavaram, M. Differentially private next-token prediction of large language models. ar Xiv preprint ar Xiv: 2403.15638, 2024. Gloeckle, F., Idrissi, B. Y., Rozi ere, B., Lopez-Paz, D., and Synnaeve, G. Better & faster large language models via multi-token prediction. ar Xiv preprint ar Xiv: 2404.19737, 2024. Heinrich, L. and Pawlas, Z. Absolute regularity and brillinger-mixing of stationary point processes. Lithuanian Mathematical Journal, 53(3):293 310, 2013. Hirschfeld, H. O. A connection between correlation and contingency. In Mathematical Proceedings of the Cambridge Philosophical Society, volume 31, pp. 520 524, 1935. Izadi, M., Gismondi, R., and Gousios, G. Codefill: Multitoken code completion by jointly learning from structure and naming sequences. In International Conference on Software Engineering, pp. 401 412, 2022. Jia, X., Shi, S., Chen, Z., Jiang, L., Liao, W., He, T., and Yan, J. Amp: Autoregressive motion prediction revisited with next token prediction for autonomous driving. ar Xiv preprint ar Xiv: 2403.13331, 2024. 
Kaplan, J., Mc Candlish, S., Henighan, T., Brown, T. B., Chess, B., Child, R., Gray, S., Radford, A., Wu, J., and Amodei, D. Scaling laws for neural language models. ar Xiv preprint ar Xiv: 2001.08361, 2020. Kilian, M., Jampani, V., and Zettlemoyer, L. Computational tradeoffs in image synthesis: Diffusion, maskedtoken, and next-token prediction. ar Xiv preprint ar Xiv: 2405.13218, 2024. Kim, S., Zhao, J., Tian, Y., and Chandra, S. Code prediction by feeding trees to transformers. In International Conference on Software Engineering, 2021. On the Generalization Ability of Next-Token-Prediction Pretraining Kuznetsov, V. and Mohri, M. Discrepancy-based theory and algorithms for forecasting non-stationary time series. Annals of Mathematics and Artificial Intelligence, 88(4): 367 399, 2020. Lei, Y. and Ying, Y. Fine-grained analysis of stability and generalization for stochastic gradient descent. In International Conference on Machine Learning, pp. 5809 5819. PMLR, 2020. Levin, D. and Peres, Y. Markov chains and mixing times, volume 107. American Mathematical Soc., 2017. Li, Y., Ildiz, M. E., Papailiopoulos, D., and Oymak, S. Transformers as algorithms: Generalization and stability in in-context learning. In International Conference on Machine Learning, pp. 19565 19594, 2023. Li, Y., Huang, Y., Ildiz, M. E., Rawat, A. S., and Oymak, S. Mechanics of next token prediction with self-attention. In Artificial Intelligence and Statistics, 2024. Lin, L., Bai, Y., and Mei, S. Transformers as decision makers: Provable in-context reinforcement learning via supervised pretraining. ar Xiv preprint ar Xiv: 2310.08566, 2023. Lin, S. and Zhang, J. Generalization bounds for convolutional neural networks. ar Xiv preprint ar Xiv: 1910.01487, 2019. Liu, A., Feng, B., Xue, B., Wang, B., Wu, B., Lu, C., Zhao, C., Deng, C., Zhang, C., Ruan, C., et al. Deepseekv3 technical report. ar Xiv preprint ar Xiv:2412.19437, 2024a. Liu, L., Song, B., Pan, Z., Yang, C., Xiao, C., and Li, W. Gradient learning under tilted empirical risk minimization. Entropy, 24(7):956, 2022. Liu, L., Chen, H., Xiao, C., and Li, W. The consistency analysis of gradient learning under independent covariate shift. Neurocomputing, 635:129883, 2025a. Liu, L., Chen, Y., Li, W., Wang, Y., Gu, B., Zheng, F., and Chen, H. Generalization bounds of deep neural networks with τ -mixing samples. IEEE Transactions on Neural Networks and Learning Systems, pp. 1 15, 2025b. doi: 10.1109/TNNLS.2025.3526235. Liu, X., Zhang, H., Gu, B., and Chen, H. General stability analysis for zeroth-order optimization algorithms. In The Twelfth International Conference on Learning Representations, 2024b. Livni, R. Information theoretic lower bounds for information theoretic upper bounds. Advances in Neural Information Processing Systems, 36:37716 37727, 2023. Loshchilov, I. and Hutter, F. Decoupled weight decay regularization. ar Xiv preprint ar Xiv:1711.05101, 2017. Lotfi, S., Finzi, M., Kapoor, S., Potapczynski, A., Goldblum, M., and Wilson, A. G. Pac-bayes compression bounds so tight that they can explain generalization. In Advances in Neural Information Processing Systems, 2022. Lotfi, S., Finzi, M., Kuang, Y., Rudner, T. G., Goldblum, M., and Wilson, A. G. Non-vacuous generalization bounds for large language models. ar Xiv preprint ar Xiv: 2312.17173, 2023. Lotfi, S., Kuang, Y., Finzi, M., Amos, B., Goldblum, M., and Wilson, A. G. Unlocking tokens as data points for generalization bounds on larger language models. 
Advances in Neural Information Processing Systems, 37: 9229 9256, 2024. Lu, X. and Van Roy, B. Information-theoretic confidence bounds for reinforcement learning. Advances in Neural Information Processing Systems, 32, 2019. Madden, L., Fox, C., and Thrampoulidis, C. Upper and lower memory capacity bounds of transformers for nexttoken prediction. ar Xiv preprint ar Xiv: 2405.13718, 2024. Malach, E. Auto-regressive next-token predictors are universal learners. ar Xiv preprint ar Xiv: 2309.06979, 2023. Masuda, H. Ergodicity and exponential β-mixing bounds for multidimensional diffusions with jumps. Stochastic processes and their applications, 117(1):35 56, 2007. Mc Donald, D. J., Shalizi, C. R., and Schervish, M. Estimating beta-mixing coefficients via histograms. Electronic Journal of Statistics, 9(2), 2015. Mohri, M. and Rostamizadeh, A. Rademacher complexity bounds for non-iid processes. In Advances in Neural Information Processing Systems, 2008. Mohri, M. and Rostamizadeh, A. Stability bounds for stationary φ-mixing and β-mixing processes. Journal of Machine Learning Research, 11(2), 2010. Moon, J., Park, G., and Jeong, J. Pop-on: Prediction of process using one-way language model based on nlp approach. Applied Sciences, 11(2):864, 2021. Ott, M., Edunov, S., Baevski, A., Fan, A., Gross, S., Ng, N., Grangier, D., and Auli, M. fairseq: A fast, extensible toolkit for sequence modeling. In North American Chapter of the Association for Computational Linguistics, 2019. On the Generalization Ability of Next-Token-Prediction Pretraining Qi, M., Huang, Y., Yao, Y., Wang, M., Gu, B., and Sundaresan, N. Is next token prediction sufficient for gpt? exploration on code logic comprehension. ar Xiv preprint ar Xiv: 2404.08885, 2024. Qi, W., Yan, Y., Gong, Y., Liu, D., Duan, N., Chen, J., Zhang, R., and Zhou, M. Prophetnet: Predicting future n-gram for sequence-to-sequence pre-training. ar Xiv preprint ar Xiv: 2001.04063, 2020. Radosavovic, I., Zhang, B., Shi, B., Rajasegaran, J., Kamat, S., Darrell, T., Sreenath, K., and Malik, J. Humanoid locomotion as next token prediction. ar Xiv preprint ar Xiv: 2402.19469, 2024. Rajbhandari, S., Rasley, J., Ruwase, O., and He, Y. Zero: Memory optimizations toward training trillion parameter models. In SC20: International Conference for High Performance Computing, Networking, Storage and Analysis, pp. 1 16. IEEE, 2020. Ralaivola, L., Szafranski, M., and Stempfel, G. Chromatic pac-bayes bounds for non-iid data: Applications to ranking and stationary β-mixing processes. The Journal of Machine Learning Research, 11:1927 1956, 2010. Shalev-Shwartz, S. and Ben-David, S. Understanding machine learning: From theory to algorithms. Cambridge university press, 2014. Shlegeris, B., Roger, F., Chan, L., and Mc Lean, E. Language models are better than humans at next-token prediction. ar Xiv preprint ar Xiv: 2212.11281, 2022. Thrampoulidis, C. Implicit bias of next-token prediction. ar Xiv preprint ar Xiv: 2402.18551, 2024. Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozi ere, B., Goyal, N., Hambro, E., Azhar, F., et al. Llama: Open and efficient foundation language models. ar Xiv preprint ar Xiv: 2302.13971, 2023. Tripuraneni, N., Jordan, M., and Jin, C. On the theory of transfer learning: The importance of task diversity. In Advances in Neural Information Processing Systems, 2020. Vankadara, L. C., Faller, P. M., Hardt, M., Minorics, L., Ghoshdastidar, D., and Janzing, D. Causal forecasting: generalization bounds for autoregressive models. 
In Uncertainty in Artificial Intelligence, pp. 2002 2012, 2022. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., and Polosukhin, I. Attention is all you need. In Advances in Neural Information Processing Systems, 2017. Wang, R., Walters, R., and Yu, R. Data augmentation vs. equivariant networks: A theory of generalization on dynamics forecasting. ar Xiv preprint ar Xiv: 2206.09450, 2022a. Wang, R., Walters, R., and Yu, R. Meta-learning dynamics forecasting using task inference. In Advances in Neural Information Processing Systems, 2022b. Wong, K. C., Li, Z., and Tewari, A. Lasso guarantees for βmixing heavy-tailed time series. The Annals of Statistics, 48(2):1124 1142, 2020. Workshop, B., Scao, T. L., Fan, A., Akiki, C., Pavlick, E., Ili c, S., Hesslow, D., Castagn e, R., Luccioni, A. S., Yvon, F., Gall e, M., Tow, J., and M., A. Bloom: A 176b-parameter open-access multilingual language model. ar Xiv preprint ar Xiv: 2211.05100, 2023. Wu, W., Feng, X., Gao, Z., and Kan, Y. Smart: Scalable multi-agent real-time simulation via next-token prediction. ar Xiv preprint ar Xiv: 2405.15677, 2024. Xu, Z. and Tewari, A. Representation learning beyond linear prediction functions. In Advances in Neural Information Processing Systems, 2021. Yang, A., Li, A., Yang, B., Zhang, B., Hui, B., Zheng, B., Yu, B., Gao, C., Huang, C., Lv, C., et al. Qwen3 technical report. ar Xiv preprint ar Xiv:2505.09388, 2025. Ye, H., Chen, X., Wang, L., and Du, S. S. On the power of pre-training for generalization in rl: provable benefits and hardness. In International Conference on Machine Learning, 2023. Yue, K., Chen, B.-C., Geiping, J., Li, H., Goldstein, T., and Lim, S.-N. Object recognition as next token prediction. In Conference on Computer Vision and Pattern, 2024. Zhang, B. and Sennrich, R. Root mean square layer normalization. Advances in Neural Information Processing Systems, 32, 2019. Zhang, J., Wang, B., Hu, Z., Koh, P. W. W., and Ratner, A. J. On the trade-off of intra-/inter-class diversity for supervised pre-training. In Advances in Neural Information Processing Systems, 2024a. Zhang, S., Roller, S., Goyal, N., Artetxe, M., Chen, M., Chen, S., Dewan, C., Diab, M., Li, X., Lin, X. V., et al. Opt: Open pre-trained transformer language models. ar Xiv preprint ar Xiv: 2205.01068, 2022. Zhang, X., Chen, H., Gu, B., Gong, T., and Zheng, F. Fine-grained analysis of stability and generalization for stochastic bilevel optimization. In Proceedings of the Thirty-Third International Joint Conference on Artificial Intelligence, pp. 5508 5516, 2024b. On the Generalization Ability of Next-Token-Prediction Pretraining Zhang, Y., Zhang, F., Yang, Z., and Wang, Z. What and how does in-context learning learn? bayesian model averaging, parameterization, and generalization. ar Xiv preprint ar Xiv: 2305.19420, 2023. Zhao, W. X., Zhou, K., Li, J., Tang, T., Wang, X., Hou, Y., Min, Y., Zhang, B., Zhang, J., Dong, Z., et al. A survey of large language models. ar Xiv preprint ar Xiv: 2303.18223, 2024. Zhou, D., Zheng, L., Fu, D., Han, J., and He, J. Mentorgnn: Deriving curriculum for pre-training gnns. In Conference on Information and Knowledge Management, 2022. On the Generalization Ability of Next-Token-Prediction Pretraining Summary & Technical Route This paper conducts a comprehensive theoretical analysis by focusing on the pre-training methodology of Large Language Models (LLMs) known as Next-Token-Prediction (NTP). 
It categorizes LLMs as transformer decoder-only models (DOMs) and delves into the empirical successes of NTP despite a lack of theoretical understanding. This work also establishes a theoretical framework to analyze the generalization behavior of NTP pre-training. We introduce a novel decomposition of Rademacher complexity to study the representation-learner and token-predictor components of DOMs. The paper also addresses the dependence between tokens using φ-mixing, a tool commonly used in non-independent scenarios, to delineate inter-token dependencies. This approach allows for a fine-grained analysis of the generalization ability of NTP pre-training, considering the model s structure and the nature of the training data. The technical route of the paper involves developing a theoretical framework for NTP from a statistical learning perspective. This work proposes a decomposition law for Rademacher complexity to bound the excess risk of NTP and establish different bounds on the generalization capability. We refine the estimation of the covering number for multi-layer multi-head transformer-decoder models, pioneering the incorporation of the mask matrix within the self-attention mechanism under the Frobenius norm. This paper uses the covering number to derive the corresponding Rademacher complexity upper bound, extending the theory of (Bartlett et al., 2017) and (Lin & Zhang, 2019) to establish fine-grained generalization bounds for DOMs-based NTP pre-training. The results are expressed in terms of key parameters that affect the generalization ability, providing a clear and quantifiable understanding of how NTP pre-training behaves in practice. Outline of the Appendix The appendix is mainly structured as follows, Section A: Proof of the Proposition 4.8. Section B: Proof of the Theorem 4.9. Section C: Capacity of DOMs. Section C.1: Introduction to the model architecture. Section C.2: Restatement to some useful lemmas. Section C.3: Proof of the Proposition C.11. Section C.4: Proof of the Theorem 4.20. Section D: Proof of the Theorem 4.22. On the Generalization Ability of Next-Token-Prediction Pretraining A. Proof of the Proposition 4.8 ˆRS(ℓ G H) = Eε i [n] εiℓ g h (zi) sup g G,h H i [n] εi g h (zi) sup g G,h H i [n] εi g h (zi) g ˆh (zi) i [n] εi g ˆh (zi) (b) GℓGg Eε i [n] εi h (zi) ˆh (zi) + GℓˆRS(G ˆh) i [n] εi h (zi) i [n] εi ˆh (zi) + GℓˆRS(G ˆh) (c) = GℓGg ˆRS(H) + GℓˆRS(G ˆh), where ˆh is a any given function in H. Here (a) is by Ledoux-Talagrand contraction inequality, (b) uses the Lipschitz conditions of g, and (c) uses the property that the Rademacher random variables ε are i.i.d. with zero mean. B. Proof of Theorem 4.9 Firstly, we give two necessary assumptions which have been mentioned before. Assumption B.1. Assume that Xi = zi 1, , zi m Rm nv is generated by a φ-mixing distribution ϕi, and there exists an unknown distribution U such that U = {ϕi}N i=1 U. Assumption B.2. Assume there exists a constant Bℓ R+ satisfying |ℓ(ˆt, t)| Bℓfor any ˆt, t T , and ℓis Gℓ-Lipschitz w.r.t. ˆt. Then, we introduce some related lemmas which will be used in our proof. Since {Xi}N i=1 are independent of each other, { ˆLXi (g h)}N i=1 are also independent of each other, so the generalization error based on the dataset D can be bounded by the following common theorem. Lemma B.3 (Shalev-Shwartz & Ben-David (2014)). Given a dataset D = {Xi}N i=1 i.i.d. 
D, if loss function ℓ: T T [ Bℓ, Bℓ], Bℓ R+, then with probability at least 1 δ, the following inequality holds for any h H and g G: LD(g h) ˆLD(g h) + 2 ˆRD(ℓ G H) + 4Bℓ For the non-independent case, Mohri & Rostamizadeh (2010) gave a Rademacher complexity bound under the φ-mxing distribution. Therefore, under Assumption B.1, we can define the generalization error based on a single token sequence Xi, mainly using the following theorem: Lemma B.4 (Mohri & Rostamizadeh (2010)). Given a token sequence Xi = [ti 1, , ti m] and loss function ℓ: T T [ Bℓ, Bℓ], Bℓ R+, if {ti j}m j=1 follow a φ-mxing distribution ϕi, then for some φ0 > 0, φ1 > 0 and r > 0, with probability at least 1 δ, the following inequality holds for any h H and g G: Lϕi(g h) ˆLXi(g h) Bℓ where m 1 + 2 Pm k=1 φ(k), and φ(k) φ0 exp ( φ1kr) for all k [m]. On the Generalization Ability of Next-Token-Prediction Pretraining Lemma B.5. For the two Rademacher complexity over the dataset D mentioned before, we have the following inequality: ˆRD (ℓ G H) 3 RD (ℓ G H) . Proof. Let ε = {ε i}N i=1, ε = {ε j }m j=1, ε = {{εij}m j=1}N i=1 be three i.i.d. Rademacher random variable collections, and ε , ε are independent of each other. For ease of representation, we denote ℓi j = ℓ g h(zi j), zi j . Then, we have: ˆRD (ℓ G H) = Eε sup g G,h H j [m] ℓ g h(zi j), zi j sup g G,h H ε iℓi j ε j ℓi j + ε j ℓi j sup g G,h H ε i ε j ℓi j + ε j ℓi j sup g G,h H ε i ε j ℓi j sup g G,h H j [m] ε j ℓi j | {z } (II) For part (I), we denote ˆε = {{ˆεij}m j=1}N i=1, where ˆεij = 1 2 ε i ε j . It s easy to get ˆε are i.i.d. random variables, and the distribution is: 1/4, ˆεij = 1 1/2, ˆεij = 0 1/4, ˆεij = 1 . Then, by the independence of ε and ε , we have: sup g G,h H ε i ε j ℓi j sup g G,h H j [m] ˆεijℓi j sup g G,h H j [m] εijℓi j For part (II), we denote ε = {{ εij}m j=1}N i=1, where εij = ε iε j . It s easy to get ε are i.i.d. Rademacher random variables. We have: sup g G,h H j [m] ε j ℓi j sup g G,h H j [m] ε iε j ℓi j sup g G,h H j [m] εijℓi j sup g G,h H j [m] εijℓi j On the Generalization Ability of Next-Token-Prediction Pretraining Therefore we can get: ˆRD (ℓ G H) (I) + (II) 3 RD (ℓ G H). Lemma B.6 (Levin & Peres (2017)). Given two probability measures T1 and T2 over instance space X, the following equality holds: T1 T2 TV = 1 z X |T1(z) T2(z)|. We then give some necessary symbol descriptions as follows: g , h = arg min g G,h H LD (g h) , (9) k = arg max i [N] Lϕi (g h ) . (10) In the above symbols, g and h in (9) are the Token-Predictor and Representation-Learner that minimize the expected risk, exactly the best g and h that we hope to learn through (2). k represents the subscript that maximizes the expected risk based on distribution Tk (k [N]) when using g and h , therefore Tk represents the worst distribution in U = {Tk}N k=1. Based on the above lemmas and notations, we begin the proof of Theorem 1. Proof. 
We first perform an error decomposition on the excess risk defined in (3): ED(ˆg, ˆh) = LD(ˆg ˆh) LD (g h ) = LD(ˆg ˆh) ˆLD(ˆg ˆh) + ˆLD(ˆg ˆh) LD (g h ) LD(ˆg ˆh) ˆLD(ˆg ˆh) | {z } I + ˆLD g ˆh LD (g h ) | {z } II Bounding I : According to Lemma B.3, we can get I = LD(ˆg ˆh) ˆLD(ˆg ˆh) 2 ˆRD (ℓ G H) + 4Bℓ Bounding II : II = ˆLD g ˆh LD (g h ) ˆLxi g ˆh LT i (g h ) ˆLxi g ˆh Lϕi g ˆh + Lϕi g ˆh Lϕi (g h ) ˆLxi g ˆh Lϕi g ˆh Lϕi g ˆh Lϕi (g h ) Bounding III : According to Lemma B.4, we can get ˆLXi g ˆh Lϕi g ˆh On the Generalization Ability of Next-Token-Prediction Pretraining Bounding IV : Lϕi g ˆh Lϕi (g h ) i [N] (LTk (g h ) Lϕi (g h )) + 1 Lϕi g ˆh LTk g ˆh z X |Tk(z) ϕi(z)| ℓ(g h (z), z) z X |ϕi(z) Tk(z)| ℓ g ˆh(z), z ! z X |ϕi(z) Tk(z)| i [N] ϕi Tk TV 4Bℓdisc(U), where (i) is by Lemma B.6. Combining the above processes and by Lemma B.5, Theorem 1 is obtained. C. Capacity of DOMs C.1. Model architecture In this section, we describe the architecture and function class of the decoder-only transformer model in detail. Given a pre-training dataset D = {Xi}N i=1 Rm nv containing N token sequences, where m represents the maximum word vector length and nv represents the vocabulary size. We can get N input matrixes {Zi}N i=1 Rm nv.We first introduce two normalization operations that will be used. For a given matrix A Rn m, we denote softmax ( ) as the row-wise softmax operator, which can be defined as: softmax (A)i,j = exp(Ai,j) P j [m] exp(Ai,j ). (11) Let Πnorm ( ) denote the Layer-norm operator (Ba et al., 2016), which can be defined as: Πnorm (A)i,j = Ai,j µ m P j m Ai,j δ = q j [m] (Ai,j µ)2 . (12) We consider a L-layer and H-head decoder-only transformer model as our Representation-Learner h( ), which mainly consists of one Embedding-layer and L layer transformer-decoder-block. We use d to denote the model dimension, dk = d/H denotes the attention dimension, and df = 4d denotes the feed-forward dimension throughout the paper. Given a token sequence Z Rm nv as input matrix, the lth layer s output is: ( Embedding (Z) , l = 0 TFWl Zl 1 , l [L] , (13) here Embedding (Z) = ZWe + Wp, where We Rnv d denotes the token-embeding matrix, Wp Rm d denotes the position-embeding matrix. Note that matrices We and Wp are learnable, but we can also directly use the pre-trained We and calculate Wp using sine and cosine functions, the specific calculation method can be found in (V aswaniet al., 2017). To simplify our analysis, we choose the latter. TFWl ( ) denotes the l-th layer transformer-decoder-block with ( Wl F1, Wl F2, Wl Oh H h=1 , Wl Qh H h=1 , Wl Kh H h=1 , Wl Vh H h=1 Rd df , Rdf d, Rd d, Rd dk, Rd dk, Rd d On the Generalization Ability of Next-Token-Prediction Pretraining as parameters, which can be defined as: TFWl Zl 1 = Πnorm FFN Yl , Yl = Πnorm MHA Zl 1 , (15) where FFN ( ) denotes the Feed-Forward Neural Network with Residual Connections: FFN Yl = σ Yl Wl F1 Wl F2 + Yl, (16) where σ ( ) denotes the activation function, and we use Re LU throughout the paper. MHA ( ) denotes the Masked-Mutil Head-Attention with Residual Connections: MHA Zl 1 = X h [H] Al h Zl 1 Wl Oh + Zl 1, (17) here Al h ( ) denotes the Self-Attention head: Al h Zl 1 = softmax Ql h Kl h + M dk where Ql h = Zl 1Wl Qh, Kl h = Zl 1Wl Kh, Vl h = Zl 1Wl Vh denote Q, K, V matrix respectively, and 0 0 0 ... ... ... ... ... 0 0 0 0 is a given mask matrix. Therefore the Representation-Learner can be defined as h(X) = ZL. And the hypothesis class of h(X) is defined as: ( X 7 TFWL (TFWL 1 . . . 
TFW1 (Embedding (X))) : Wl F1 F , Wl F2 F , Wl Oh F , Wl Qh F , Wl Kh F , Wl Vh F Bl, l [L], h [H] The Token-Predictor has many options, there we use a simple linear projection layer and softmax: g (h (X)) = softmax(ZLWP ), (21) where WP Rd nv. Assumption C.1. Assume that Πnorm is Gπ-Lipschitz with the ℓ2-norm, i.e., t1, t2 Rd, Πnorm (t1) Πnorm (t2) ℓ2 Gπ t1 t2 ℓ2. l [L] and h [H], there exists constants Cl such that Ql h(Kl h) / p l [L], Wl Wl, there exists constants Bl satisfying On the Generalization Ability of Next-Token-Prediction Pretraining C.2. Useful Lemmas Here, we use the covering number to bound Rademacher complexity, so we first introduce the definition of the covering number and provide some useful lemmas and propositions. Definition C.2 (ϵ-cover and covering numbe). Denote (U, ) as a normed space and V U. For any ϵ > 0, V is called an ϵ-cover of U if for any u U, there exists v V such that u v ϵ. The covering number of the normed space (U, ) is the cardinality of the smallest ϵ-cover, which is defined by N(U, ϵ, ) := min{|V | : V is an ϵ-cover of U}. Lemma C.3 (Lemma 9 of Lin & Zhang (2019)). Let W := {w : w Rd, w 2 a}, then for any ϵ > 0, we have ln N (W, ϵ, 2) d ln 1 + 2a Lemma C.4 (Lemma 10 of Lin & Zhang (2019)). Let X Rn din be a given input matrix with bounded F-norm, and W Rdin dout satisfying W F a, then ln N ({XW : W F a} , ϵ, F ) dindout ln 1 + 2a X F Proof. Let ˆ W be the ϵ-cover of {W : W F a} such that W ˆ W F ϵ, then XW X ˆ W F X F W ˆ W F ϵ X F . This means that any ϵ-cover of {W : W F a} is also an ϵ X F -cover for {XW : W F a}, we have ln N ({XW : W F a} , ϵ, F ) ln N {W : W F a} , ϵ X F , F We denote W Rdindout as the one dimensional vector which is obtained by reshaping W. Then by Lemma C.3, we have ln N {W : W F a} , ϵ X F , F ln N W : W 2 a , ϵ X F , F dindout ln 1 + 2a X F Lemma C.5 (Extension of (Bartlett et al., 2017)). Let F be a real-valued function class taking values in [0, c], and assume that 0 F. Then the empirical Rademacher complexity of F can be bounded as: ˆRS(F) inf α>0 ln N F|S, ϵ, 2 dϵ where S represents the dataset containing n samples. Lemma C.6. Assume that f(x) is a continouous function on [a, b] satisfying f(x) > 0, and g(x) is a continouous concave(downward) function on the range of f(x). Then we have: a g(f(x))dx g Proof. We divide the interval [a, b] into n eaqual parts, let xi = a + i n(b a)(i = 0, 1, 2, , n), then i = xi xi 1 = b a n (i = 1, 2, , n). Since g(x) is a concave function, we can use Jensen s inequality to get: i=1 g(f(xi)) i = 1 ng(f(xi)) g i=1 f(xi) i Combining the integrability of continuous functions and the definition of integral, let n in the above formula, we can get the result. On the Generalization Ability of Next-Token-Prediction Pretraining Lemma C.7. Let Πnorm ( ) be the layer normlization operator defined in (12), for any matrix X Rm d, we have Πnorm (X) F Πnorm (X) F = X2 i,j ξ2 + 1 j [d] X2 i,j X2 i,j 1 d P j [d] X2 i,j Lemma C.8. Let softmax ( ) be the row-wise softmax function defined as (11), for any matrix X Rm m obeying X ℓ C, we have: softmax (X) F e C and softmax (X + M) F e C where M is the given mask matrix defined in (19). Proof. Denote X = (xij)m m and softmax (X) = (yij)m m, we have: yij = exij P j [m] exij e C Then we can get: softmax (X) F = s X j [m] y2 ij j [m] yij e2C here we uses P j [m] yij = 1, i [m]. When adding the mask matrix M, we have: x11 x21 x22 ... ... ... ... ... xm1 xm2 xm3 xmm , softmax (X + M) = y 11 0 0 0 y 21 y 22 0 0 ... ... ... ... ... 
y m1 y m2 y m3 y mm j [k] e xkj , j k 0 , j > k . Similarly we can get: softmax (X + M) F j [k] y kj e2C k = e C v u u t Here we use the fact: (1 + 1 m ln m) γ, where γ 0.577218 called Euler constant. Lemma C.9. The softmax is Gs Lipschitz in the ℓ2 norm, and Gs 4 3 9 , which means for any x, y Rm, we have: softmax(x) softmax(y) ℓ2 4 3 9 x y ℓ2. On the Generalization Ability of Next-Token-Prediction Pretraining Proof. Let ej denote the jth element of softmax (x), the Jacobian satisfies: J(x) F = diag(softmax(x)) softmax(x) softmax(x) F j [m] (ei (I [i = j] ej))2 i [m] 4e2 i (1 ei)2 i [m] 4ei (e3 i 2e2 i + ei) where the last inequality uses the fact: x3 2x2 + x 4 9, x [0, 1]. Denote z = y x, according to the definition of derivative, we have: lim δ 0 softmax(x + δz) softmax(x) Integrating along δ = 0 to 1 under J(x)z ℓ2 4 3 9 z ℓ2 can obtain the result. C.3. Proof of Proposition 4.17 Lemma C.10 (Covering number of masked self-attention head). Given an input sequence S = {Z1, . . . , ZN} Rm d, denote Z[N] = [Z1, . . . , ZN] RNm d as the concatenated data matrix. Consider the Self-Attention head A ( ) (ignore the layer and head indices) defined in Equation (18), the corresponding function class can be defined as: A (Z1) ... A (ZN) : A (Zi) = softmax Zi WQ (Zi WK) + M dk : WQ F , WK F , WV F B then we can get the following covering number bound: ln N HA S , ϵ, F d2 ln N ln m Z[N] F + 2ddk ln 1 + 8Gs B3 Z F Z[N] 2 F dkϵ Proof. Step 1: Denote CV as a ϵV -cover of set HV S := {V = Z[N]WV : WV F B}, then by Lemma B.6 we have: ln N HV S , ϵV , F d2 ln 1 + 2B Z[N] F Step 2: Let CQ to be a ϵQ-cover of set HQ Z := {Q = ZWQ : WQ F B}, and CK to be a ϵK-cover of set HK Z := {K = ZWK : WK F B}. We can use CQ and CK to construct a set as following: S1 0 ... 0 SN : Si = softmax Zi WQ (Zi WK) + M dk : WQ CQ, WK CK On the Generalization Ability of Next-Token-Prediction Pretraining We consider the following set: S1 0 ... 0 SN : Si = softmax Zi WQ (Zi WK) + M dk : WQ F , WK F B then for ϵ > 0 and any S HS S, there exists ˆS CS such that: S1 0 ... 0 SN ˆS1 0 ... 0 ˆSN v u u u u t Zi WQ (Zi WK) + M dk Zi ˆ WQ Zi ˆ WK + M dk Zi WQ (Zi WK) Zi ˆ WQ Zi ˆ WK Zi WQ Zi ˆ WQ (Zi WK) F + Zi ˆ WQ (Zi WK) Zi ˆ WK F Gs B dk (ϵQ + ϵK) s X i [N] Zi 2 F Gs B dk (ϵQ + ϵK) Z[N] F Evoking Lemma C.9 we have (i), and by setting ϵQ = ϵK = dk 2Gs B Z[N] F ϵ can get (ii). Therefore CS is a cover of HS S, then by Lemma C.4 we have: ln N HS S, ϵ, F ln |CS| ln |CQ| + ln |CK| 2ddk ln 1 + 4Gs B2 Z F Z[N] F dkϵ where Z F = maxi [N] Zi F . Step 3: For every given ˆV CV , we can construct the set HS S ˆV := n S ˆV : S HS S o and CS ˆV := n S ˆV : S CS o , we denote C CS ˆV, ϵS, F as a ϵS-cover of CS ˆV. Then for any S ˆV HS S ˆV, we can find ˆS ˆV C CS ˆV, ϵS, F such that: S ˆV ˆS ˆV F S ˆS F ˆV F ϵS Z[N] F B. We can Choose ϵS = ϵA Z[N] F B to get that C CS ˆV, ϵS, F is a ϵA-cover of HS S ˆV which can be denoted as C HS S ˆV, ϵA, F . Then we have: sup ˆV CV ln C HS S ˆV, ϵA, F sup ˆV CV ln C CS ˆV, ϵS, F 2ddk ln 1 + 4Gs B3 Z F Z[N] 2 F dkϵA Then we construct a set CA: CA = [ ˆV CV C HS S ˆV, ϵA, F , On the Generalization Ability of Next-Token-Prediction Pretraining which is easy to get: ln |CA| ln |CV | + sup ˆV CV ln C HS S ˆV, ϵA, F d2 ln 1 + 2B Z[N] F + 2ddk ln 1 + 4Gs B3 Z F Z[N] 2 F dkϵA Step 4: Next we will proof that CA covers HA S . For any A HA S , we can find ˆA CA such that: A ˆA F = SV ˆS ˆV F SV S ˆV F + S ˆV ˆS ˆV F S F V ˆV F + ϵA i [N] Si 2 F ϵV + ϵA N ln me CϵV + ϵA, where the last inequality uses Lemma C.8. 
Then by setting ϵV = ϵ 2e C N ln m, ϵA = ϵ 2, we can get: ln N HA S , ϵ, F ln |CA| N ln m Z[N] F + 2ddk ln 1 + 8Gs B3 Z F Z[N] 2 F dkϵ Proposition C.11 (Covering number of transformer-decoder-block). Given an input sequence S = {Z1, . . . , ZN} Rm d, denote Z[N] = [Z1, . . . , ZN] RNm d as the concatenated data matrix. Consider the transformer-decoder-block TF ( )(ignore the layer indices) defined in Equation (15), the corresponding function class can be defined as: Z 7 Πnorm (σ (YWF1) WF2 + Y) , Y = Πnorm h H Ah (Z) WOh + Z Ah (Z) = softmax ZWQh (ZWKh) + M dk WF1 F , WF2 F , WOh F , WQh F , WKh F , WVh F B, h [H] then we can get the following covering number bound: ln N HT F S , ϵ, F 4d2(H + 3) ln 1 + G2 πB2(B2 + 1) e CB2H N ln m + 1 Z[N] F Proof. Step 1: We first consider the function class of muti-head-attention MHA ( ), which can be defined as: HMA S := MA = X h H Ah WOh + Z[N] : Ah HAh S , WOh F B, h [H] , Ah (Z1) ... Ah (ZN) : Ah (Zi) = softmax Zi WQh (Zi WKh) + M dk : WQh F , WKh F , WVh F B On the Generalization Ability of Next-Token-Prediction Pretraining For any h [H], denote CAh as a ϵA-cover of HAh S , we select one element ˆAh CAh to construct the following set: ˆAh WOh = n ˆAh WOh : WOh F B o , let C ˆAh WOh, ϵd, F ϵd-covers ˆAh WOh, we have: ln C ˆAh WOh, ϵd, F sup ˆAh CAh ln N ˆAh WOh, ϵd, F sup ˆAh CAh d2 ln 1 + 2B ˆAh F N ln m Z[N] F Now we can construct the ϵd-cover of set n Ah WOh : Ah HAh S , WOh F B o as : C ˆAh WOh, ϵd, F . Combined with Lemma C.10, it is easy to get the following covering number bound: ln N HMA S , ϵd, F H ln |Cheadh| ln |CAh| + sup ˆAh CAh ln C ˆAh WOh, ϵd, F N ln m Z[N] F + 2ddk H ln 1 + 8Gs B3 Z F Z[N] 2 F dkϵA N ln m Z[N] F Step 2: Now, we consider the Feed-Forward Neural Network FFN ( ) defined in (16), which consists of two fully-connected layers. The hypothesis class of fully-connected layer 1 can be defined as: HF1 S = n Y[N]WF1 : Y[N] = Πnorm (MA) , MA HMA S , WF1 F B o . Denote CM as a ϵd-cover of HMA S , we select one element ˆ MA CM to construct the following set: ˆ MA WF1 = n Πnorm ( ˆ MA)WF1 : WF1 F B o , let C ˆ MA WF1, ϵF 1, F ϵF 1-covers ˆ MA WF1, we have: ln C ˆ MA WF1, ϵF 1, F sup ˆ MA CM ln N ˆ MA WF1, ϵF 1, F sup ˆ MA CM ddf ln 1 + 2B Πnorm ( ˆ MA) F ϵF 1 sup ˆAh CAh ddf ln 1 + 2BGπ H ˆAh F B + Z[N] F 1 + 2BGπ e CB2H N ln m + 1 Z[N] F On the Generalization Ability of Next-Token-Prediction Pretraining Now we can construct the ϵF 1-cover of HF1 S as: ˆ MA CM C ˆ MA WF1, ϵF 1, F . We have the following covering number bound: ln |CF1| ln |CM| + sup ˆ MA CM ln C ˆ MA WF1, ϵF 1, F N ln m Z[N] F + 2ddk H ln 1 + 8Gs B3 Z F Z[N] 2 F dkϵA N ln m Z[N] F 1 + 2BGπ e CB2H N ln m + 1 Z[N] F Step 3: Next, we consider the hypothesis class of fully-connected layer 2: HF2 S = n σ F[N] WF2 + Y[N] : F[N] HF1 S , Y[N] = Πnorm (MA) , MA HMA S , WF2 F B o . For every element ˆF[N] CF1 and ˆ MA CM, we construct the following set: ˆF[N] WF1 ˆ MA = n σ ˆF[N] WF2 + Πnorm ( ˆ MA) : WF2 F B o , let C ˆF[N] WF1 ˆ MA, ϵF 2, F ϵF 2-covers ˆF[N] WF1 ˆ MA, we have: ln C ˆF[N] WF1 ˆ MA, ϵF 2, F sup ˆF[N] CF1, ˆ MA CM ln N ˆF[N] WF1 ˆ MA, ϵF 2, F = sup ˆF[N] CF1 ln N ˆF[N] WF1, ϵF 2, F sup ˆF[N] CF1 ddf ln 1 + 2B ˆF[N] F sup ˆ MA CM ddf ln 1 + 2B2 Πnorm ( ˆ MA) F ϵF 2 sup ˆAh CAh ddf ln 1 + 2B2Gπ H ˆAh F B + Z[N] F 1 + 2B2Gπ e CB2H N ln m + 1 Z[N] F Now we can construct the ϵF 2-cover of HF2 S as: ˆF[N] CF1, ˆ MA CM C ˆF[N] WF1 ˆ MA, ϵF 2, F . 
On the Generalization Ability of Next-Token-Prediction Pretraining We have the following covering number bound: ln |CF2| ln |CF1| + ln |CM| + sup ˆF[N] CF1, ˆ MA CM ln C ˆF[N] WF1 ˆ MA, ϵF 2, F N ln m Z[N] F + 4ddk H ln 1 + 8Gs B3 Z F Z[N] 2 F dkϵA N ln m Z[N] F 1 + 2BGπ e CB2H N ln m + 1 Z[N] F 1 + 2B2Gπ e CB2H N ln m + 1 Z[N] F Step 4: To get the covering number of HT F S , we construct the following set: CT = {Πnorm (F2) : F2 CF2} , which is an ϵ-cover of HT F S that can be verified. For any Z[N] HT F S , we can find a ˆZ[N] CT such that: Z[N] ˆZ[N] F = Πnorm σ Y[N]WF1 WF2 + Y[N] Πnorm σ ˆY[N] ˆ WF1 ˆ WF2 + ˆY[N] F Gπ σ Y[N]WF1 WF2 σ Y[N]WF1 ˆ WF2 F + Gπ σ Y[N]WF1 ˆ WF2 σ ˆY[N] ˆ WF1 ˆ WF2 F + Gπ Y[N] ˆY[N] F Gπ ϵF 2 + B Y[N]WF1 ˆY[N] ˆ WF1 F + Y[N] ˆY[N] F Gπ ϵF 2 + B Y[N]WF1 Y[N] ˆ WF1 F + B Y[N] ˆ WF1 ˆY[N] ˆ WF1 F + Y[N] ˆY[N] F Gπ ϵF 2 + BϵF 1 + (B2 + 1) Y[N] ˆY[N] F For Y[N] ˆY[N] F we have: Y[N] ˆY[N] F = h H Ah WOh + Z[N] h H ˆAh ˆ WOh + Z[N] Ah WOh ˆAh ˆ WOh F Ah WOh Ah ˆ WOh F + Ah ˆ WOh ˆAh ˆ WOh F Gπ (Hϵd + BHϵA) . In summary, we have: Z[N] ˆZ[N] F Gπ ϵF 2 + BϵF 1 + Gπ(B2 + 1)(Hϵd + BHϵA) = GπϵF 2 + GπBϵF 1 + G2 π(B2 + 1)Hϵd + G2 π(B3 + B)HϵA. We can conclude that CT ϵ covers HT F S By setting ϵF 2 = ϵ 4Gπ , ϵF 1 = ϵ 4GπB , ϵd = ϵ 4G2π(B2 + 1)H , ϵA = ϵ 4G2π(B3 + B)H . On the Generalization Ability of Next-Token-Prediction Pretraining From the definition of CT , we have: ln |CT | ln |CF2| 1 + 8Gs G2 π(B4 + B2)H N ln m Z[N] F ϵ + 4ddk H ln 1 + 32e CG2 π(B6 + B4)H Z F Z[N] 2 F dkϵ 1 + 8G2 πB2 e CB2H N ln m + 1 Z[N] F (4d2H + 4ddk H + 2ddf) ln 1 + G2 πB2(B2 + 1) e CB2H N ln m + 1 Z[N] F Combining dk = d/H, df = 4d, we have: ln N HT F S , ϵ, F ln |CT | 4d2(H + 3) ln 1 + G2 πB2(B2 + 1) e CB2H N ln m + 1 Z[N] F Lemma C.12. Let Zl [N] RNm d be the lth layer s concatenated output matrix of our transformer model defined in Equation (13), we have: Zl [N] F Y j [l] G2 π(B2 j + 1) e Cj B2 j H N ln m + 1 Z0 [N] F . Furthermore, let Zl Rm d as the output matrix, we have: j [l] G2 π(B2 j + 1) e Cj B2 j H ln m + 1 Z0 F . Proof. Zl [N] F = Πnorm σ Yl [N]Wl F1 Wl F2 + Yl [N] F (i) Gπ σ Yl [N]Wl F1 Wl F2 + Yl [N] F Gπ(B2 l + 1) h [H] Al h Zl 1 [N] Wl Oh + Zl 1 [N] G2 π(B2 l + 1) Sl h Zl 1 [N] Wl Vh F + Zl 1 [N] F (ii) G2 π(B2 l + 1) e Cl B2 l H N ln m + 1 Zl 1 [N] F j [l] G2 π(B2 j + 1) e Cj B2 j H N ln m + 1 Z0 [N] F . Here (i) uses Assumption C.1, and (ii) uses Lemma C.8. Similarly, we can get the second result by setting N = 1. On the Generalization Ability of Next-Token-Prediction Pretraining Lemma C.13. For a single transformer-decoder-block TFW ( ) parameterized by W (ignore the layer indices), let Z[N] RNm d as the concatenated input matrix, we have: TFW Z[N] TFW ˆZ[N] F G2 π(B2 + 1) e CB2H Nmd + 1 Z[N] ˆZ[N] F . Furthermore, let Z Rm d as the input matrix, we have: TFW (Z) TFW ˆZ F G2 π(B2 + 1) e CB2Hmd + 1 Z ˆZ F . TFW Z[N] TFW ˆZ[N] F Gπ σ Y[N]WF1 WF2 + Y[N] σ ˆY[N]WF1 WF2 ˆY[N] F G2 π(B2 + 1) h [H] Ah Z[N] WOh + Z[N] X h [H] Ah ˆZ[N] WOh ˆZ[N] G2 π(B2 + 1) Ah Z[N] Ah ˆZ[N] F + Z[N] ˆZ[N] F For any h [H], we have: Ah Z[N] Ah ˆZ[N] F = Sh Z[N]WVh ˆSh ˆZ[N]WVh F B Sh Z[N] ˆSh Z[N] F + ˆSh Z[N] ˆSh ˆZ[N] F Nmd Sh ˆSh F + e C N ln m Z[N] ˆZ[N] F For Sh ˆSh F , we have: v u u u u t Zi WQh (Zi WKh) + M dk ˆZi WQh ˆZi WKh + M dk Zi WQh (Zi WKh) ˆZi WQh ˆZi WKh Zi WQh ˆZi WQh (Zi WKh) F + ˆZi WQh (Zi WKh) ˆZi WKh F Zi F + ˆZi F 2 Zi ˆZi 2 Z[N] ˆZ[N] F . 
On the Generalization Ability of Next-Token-Prediction Pretraining Combining the above results, we can get: TFW Z[N] TFW ˆZ[N] F G2 π(B2 + 1) N ln m + 2Gs B4H ! Z[N] ˆZ[N] F G2 π(B2 + 1) e CB2H Nmd + 1 Z[N] ˆZ[N] F . C.4. Proof of Theorem 4.20 Proof. Step 1: We firstly define the class of lth layer s output as: X 7 TFWl (TFWl 1 . . . TFW1 (Embedding (X))) : Wj F1 F , Wj F2 F , Wj Oh F Bj, j [l], h [H] For l = 1, we consider the set: TF1 Z0 [N] := n TFW1 Z0 [N] : W1 W1o , where Z0 [N] = Embedding X[N] represents the embedded token sequences, and Wl defined in (14). We denote C1 = C TF1 Z0 [N] , ϵ1, F as the ϵ1-cover of TF1 Z0 [N] . Then for 1 < l + 1 L, let Cl be a cover of Hl, we select one element ˆZl [N] Cl to construct the ϵl+1-cover of following set: TFl+1 ˆZl [N] := n TFWl+1 ˆZl [N] : Wl+1 Wl+1o , here we denote C TFWl+1 ˆZl [N] , ϵl+1, F as the cover. Then by Lemma C.12 and Proposition C.11, we have: ln C TFWl+1 ˆZl [N] , ϵl+1, F 4d2(H + 3) ln 1 + G2 πB2 l+1(B2 l+1 + 1) e Cl+1B2 l+1H Nmd + 1 ˆZl [N] F 4d2(H + 3) ln 1 + B2 l+1sl+1 Z0 [N] F ϵl+1 := ln Nl+1, where sl+1 := Q j [l+1] G2 π(B2 j + 1) e Cj B2 j H Step 2: Now, we can cunstruct the cover of Hl+1 as: ˆZl [N] Cl C TFWl+1 ˆZl [N] , ϵl+1, F . Then we can get: ˆZl [N] Cl C TFWl+1 ˆZl [N] , ϵl+1, F |Cl|Nl+1 1 + B2 j sj Z0 [N] F ϵj On the Generalization Ability of Next-Token-Prediction Pretraining Step 3: Next we go to verify that CL is a cover of HL. By Lemma C.13, for any ZL [N] HL we can find a ˆZL [N] CL such that: ZL [N] ˆZL [N] F = TFWL ZL [N] TF ˆ WL ˆZL [N] F TFWL ZL [N] TFWL ˆZL [N] F + TFWL ˆZL [N] TF ˆ WL ˆZL [N] F G2 π(B2 L + 1) e Cl B2 LH Nmd + 1 ZL 1 [N] ˆZL 1 [N] F + ϵL j =l+1 G2 π(B2 j + 1) e Cj B2 j H Nmd + 1 ϵl. We set ϵL = ϵ L, and for 1 l L 1, we choose ϵl = L QL j =l+1 G2 π(B2 j + 1) e Cj B2 j H Nmd + 1 1 ϵ. We denote sl+1 L := QL j =l+1 G2 π(B2 j + 1) e Cj B2 j H Nmd + 1 , then we can get the covering number of HL as following: ln N HL, ϵ, F ln |CL| 1 + B2 l sl Z0 [N] F ϵl 1 + B2 l sl(sl+1 L)L Z0 [N] F ϵ = 4d2(H + 3) 1 + B2 l Ls L Z0 [N] F ϵ Furthermore, let X Rm nv as the input matrix, and Z0 = Embedding (X), by scaling N to 1, we have: ln N H|X, ϵ, F 4d2(H + 3) l=1 ln 1 + B2 l Ls L Z0 F where s L := Q l [L] G2 π(B2 l + 1) e Cl B2 l Hmd + 1 . D. Proof of Theorem 4.22 Proof. Step 1: We first bound the Rademacher complexity RD(H). For Zi = ti 0, , ti m 1 Rm nv, note that we do not input each token one by one in order, but input the entire sequence at once, and get a representation matrix h(Zi) Rm d. Due to the existence of mask M, each token can only use the information before the current node. We can naturally abstract the above process into using the jth token ti j to perform m queries in sequence. Therefore RD(H) can be defined as: RD(H) := Eε j [m] εij h ti j 1 F By Lemma C.7 we know: for any i [N], j [m], we have h ti j 1 F d. Therefore, according to Lemma C.5, we have: RD(H) inf α>0 ln N H|D, ϵ, 2 dϵ On the Generalization Ability of Next-Token-Prediction Pretraining We denote λ = 4d2(H + 3) and ρL = PL l=1 B2 l , then by Theorem 4.20, we can get: ln N H|D, ϵ, 2 dϵ 1 + B2 l Ls L Z0 [N] F ϵ 1 + ρLs L Z0 [N] F ϵ 1 + ρLs L Z0 [N] F ϵ 1 + ρLs L Z0 [N] F ln 1 + ρLs L Z0 [N] F ln Here (a) uses Jensen s inequality, (b) uses Lemma C.6. 
By setting α = 1 Nmd, we have: RD(H) inf α>0 1 + ρLs L Z0 [N] F ln 1 + ρLs L Z0 [N] F ln(Nmd) ln (1 + ρLs L) 12Ld3(H + 3) ln (1 + ρLs L) where (i) uses the fact that Z0 [N] F is of Step 2: Next, we consider the Rademacher complexity RD(G ˆh): RD(G ˆh) = Eε j [m] εij softmax ˆh ti j 1 WP F We consider the set G ˆh := n softmax ˆZL [N]WP : WP F R o , where ˆZL [N] CL. By Lemma C.4, C.7 and C.9 ,it is easy to have: ln N G ˆh, ϵ, F dnv ln 1 + 2RGs ˆZL [N] F ϵ Then, by Lemma C.5, we can get: dnv ln(1 + 2RGs) On the Generalization Ability of Next-Token-Prediction Pretraining Step 3: For any k [m], let Tk Rk nv be the input token sequence. Then for any h, ˆh H and g G, we have: g (h (Tk)) g ˆh (Tk) F = h (Tk) WP ˆh (Tk) WP F R h (Tk) ˆh (Tk) F . Therefore Gg = R, then by Corollary 4.11, we have: ED(ˆg, ˆh) 6GℓGg RD(H) | {z } (I) + 6Gℓ RD(G ˆh) | {z } (II) δ 2m + 4Bℓdisc(U) 12Ld3(H + 3) ln (1 + ρLs L) dnv ln(1 + 2RGs) δ 2m + 4Bℓdisc(U) ln (1 + ρLs L) dnv Nm + Bℓ δ 2m + 4Bℓdisc(U). Note that, the number of model parameters is L(10d2 + 2Hd2). Since d is usually much larger than H, we follow the approach of Kaplan et al. (2020) which use Θ 12Ld2 approximation to represent the number of model parameters.
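To make the objects appearing in the bounds above concrete, the following is a minimal NumPy sketch of the L-layer, H-head decoder-only Representation-Learner of Section C.1 (Eqs. (11)-(19)), of the linear-plus-softmax Token-Predictor of Eq. (21), and of the parameter count L(10d^2 + 2Hd^2), which the remark above approximates by Θ ≈ 12Ld^2 following Kaplan et al. (2020). This is an illustrative reimplementation under simplifying assumptions, not the paper's released code: weights are random, layer normalization omits learned affine parameters, and all function and variable names are ours.

import numpy as np

def softmax_rows(A):
    # Row-wise softmax operator of Eq. (11), with the usual max-shift for numerical stability.
    A = A - A.max(axis=-1, keepdims=True)
    E = np.exp(A)
    return E / E.sum(axis=-1, keepdims=True)

def layer_norm(A, eps=1e-6):
    # Row-wise normalization Pi_norm of Eq. (12) (learned affine parameters omitted).
    mu = A.mean(axis=-1, keepdims=True)
    sigma = A.std(axis=-1, keepdims=True)
    return (A - mu) / (sigma + eps)

def causal_mask(m):
    # Mask matrix M of Eq. (19): zero on and below the diagonal, -inf above it,
    # so each position attends only to itself and earlier tokens.
    return np.triu(np.full((m, m), -np.inf), k=1)

def attention_head(Z, WQ, WK, WV, dk):
    # One masked self-attention head A_h of Eq. (18).
    Q, K, V = Z @ WQ, Z @ WK, Z @ WV
    S = softmax_rows((Q @ K.T + causal_mask(Z.shape[0])) / np.sqrt(dk))
    return S @ V

def decoder_block(Z, p, H, dk):
    # One transformer-decoder block TF_W, Eqs. (15)-(17): masked multi-head attention
    # with residual, layer norm, ReLU feed-forward with residual, layer norm.
    mha = Z.copy()
    for h in range(H):
        Ah = attention_head(Z, p["WQ"][h], p["WK"][h], p["WV"][h], dk)
        mha = mha + Ah @ p["WO"][h]
    Y = layer_norm(mha)
    F = np.maximum(Y @ p["WF1"], 0.0) @ p["WF2"] + Y
    return layer_norm(F)

def block_params(L, d, H):
    # Per block: W_F1, W_F2 contribute 8d^2; each head has W_Q, W_K in R^{d x d/H}
    # and W_V, W_O in R^{d x d}, contributing 2d^2 + 2Hd^2 over all heads.
    # Total L(10d^2 + 2Hd^2), approximated as 12Ld^2 since d >> H.
    return L * (10 * d**2 + 2 * H * d**2)

# Tiny forward pass: m tokens, vocabulary size n_v, width d, H heads, L blocks.
rng = np.random.default_rng(0)
m, nv, d, H, L = 8, 50, 16, 4, 2
dk, df = d // H, 4 * d
X = np.eye(nv)[rng.integers(0, nv, size=m)]      # one-hot token sequence in R^{m x n_v}
We, Wp = rng.normal(0, 0.1, (nv, d)), rng.normal(0, 0.1, (m, d))
Z = X @ We + Wp                                  # Embedding(X) = X W_e + W_p, Eq. (13)
for _ in range(L):
    p = {"WQ": [rng.normal(0, 0.1, (d, dk)) for _ in range(H)],
         "WK": [rng.normal(0, 0.1, (d, dk)) for _ in range(H)],
         "WV": [rng.normal(0, 0.1, (d, d)) for _ in range(H)],
         "WO": [rng.normal(0, 0.1, (d, d)) for _ in range(H)],
         "WF1": rng.normal(0, 0.1, (d, df)),
         "WF2": rng.normal(0, 0.1, (df, d))}
    Z = decoder_block(Z, p, H, dk)
WP = rng.normal(0, 0.1, (d, nv))
probs = softmax_rows(Z @ WP)                     # Token-Predictor g(h(X)), Eq. (21)
print(probs.shape)                               # (m, nv): row j is the predicted distribution of token j+1
print(block_params(L, d, H), "block parameters; 12Ld^2 =", 12 * L * d**2)

Running the sketch returns an m x n_v matrix of per-position next-token distributions, the quantity entering the NTP loss, and prints the exact block parameter count next to the 12Ld^2 approximation used in the final remark.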