Journal of Machine Learning Research 26 (2025) 1-66 Submitted 6/24; Revised 12/24; Published 2/25
Scaling Data-Constrained Language Models
Niklas Muennighoff n.muennighoff@gmail.com
Hugging Face
Alexander M. Rush arush@cornell.edu
Hugging Face
Boaz Barak boaz@seas.harvard.edu
Harvard University
Teven Le Scao teven.lescao@gmail.com
Hugging Face
Aleksandra Piktus ola.piktus@gmail.com
Hugging Face
Nouamane Tazi nouamane@huggingface.co
Hugging Face
Sampo Pyysalo sampo.pyysalo@gmail.com
University of Turku
Thomas Wolf thomas@huggingface.co
Hugging Face
Colin Raffel craffel@gmail.com
Hugging Face
Editor: Fei Sha
Abstract

The current trend of scaling language models involves increasing both parameter count and training data set size. Extrapolating this trend suggests that training data set size may soon be limited by the amount of text data available on the internet. Motivated by this limit, we investigate scaling language models in data-constrained regimes. Specifically, we run a large set of experiments varying the extent of data repetition and compute budget, ranging up to 900 billion training tokens and 9 billion parameter models. We find that with constrained data for a fixed compute budget, training with up to 4 epochs of repeated data yields negligible changes to loss compared to having unique data. However, with more repetition, the value of adding compute eventually decays to zero. We propose and empirically validate a scaling law for compute optimality that accounts for the decreasing value of repeated tokens and excess parameters. Finally, we experiment with approaches mitigating data scarcity, including augmenting the training data set with code data or removing commonly used filters. Models and data sets from our 400 training runs are freely available at https://github.com/huggingface/datablations.

Keywords: large language models, scaling laws, data-constrained, data engineering
©2025 Niklas Muennighoff and Alexander M. Rush and Boaz Barak and Teven Le Scao and Aleksandra Piktus and Nouamane Tazi and Sampo Pyysalo and Thomas Wolf and Colin Raffel.
License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at http://jmlr.org/papers/v26/24-1000.html.
[Figure 1 panels. Left ("Return on compute when repeating"): final test loss vs. tokens (epochs); up to 4 epochs, repeating is almost as good as new data; returns diminish rapidly with more repetition; at 40 epochs, repeating is worthless. Right ("Allocating compute when repeating"): models trained with annotated final losses 2.376 and 2.359; IsoFLOP regime and efficient frontier assuming repeated data is worth the same as new data vs. the efficient frontier predicted by our data-constrained scaling laws.]
Figure 1: Return and Allocation when repeating data. Left: Loss of LLMs (4.2B parameters) scaled on repeated data decays predictably (§6). Right: To maximize performance when repeating, our data-constrained scaling laws and empirical data suggest training smaller models for more epochs, in contrast to what assuming that Chinchilla scaling laws (Hoffmann et al., 2022) hold for repeated data would predict (§5).
1. Introduction
Recent work on compute-optimal language models (Hoffmann et al., 2022) shows that many previously trained large language models (LLMs, which we define as having more than one billion parameters) could have attained better performance for a given compute budget by training a smaller model on more data. Notably, the 70-billion parameter Chinchilla model (Hoffmann et al., 2022) outperforms the 280-billion parameter Gopher model (Rae et al., 2021) while using a similar compute budget by being trained on four times more data.

Extrapolating these laws for compute allocation (hereafter "Chinchilla scaling laws") to a 530 billion parameter model, such as the under-trained MT-NLG model (Smith et al., 2022), would require training on a massive 11 trillion tokens, corresponding to more than 30 terabytes of text data. For most languages, available data is several orders of magnitude smaller, meaning that LLMs in those languages are already data-constrained. Villalobos et al. (2022) estimate that even high-quality English language data will be exhausted by the year 2024 given the Chinchilla scaling laws and the trend of training ever-larger models. This motivates the question (Villalobos et al., 2022; nostalgebraist, 2022): what should we do when we run out of data?

In this work we investigate scaling large language models in a data-constrained regime, and whether training an LLM with multiple epochs of repeated data impacts scaling. Using multiple epochs is, of course, standard in machine learning generally; however, most prior
large language models have been trained for a single epoch (Komatsuzaki, 2019; Brown et al., 2020), and some work explicitly advocates against reusing data (Hernandez et al., 2022). An exception is the recent Galactica models (Taylor et al., 2022), which were trained for 4.25 epochs and exhibit continually decreasing validation loss and improving downstream performance throughout training. However, the Galactica experiments do not compare this setup to an alternative non-data-constrained model trained for one epoch on unique data. Without this comparison, it is difficult to quantify the trade-off between additional compute versus additional data collection.

Our main focus is to quantify the impact of multiple epochs in LLM training such that practitioners can decide how to allocate compute when scaling models. Toward this end, we assembled a battery of empirical training runs of varying data and compute constraints. Specifically, we train more than 400 models ranging from 10 million to 9 billion parameters for up to 1500 epochs and record final test loss. We use these results to fit a new data-constrained scaling law that generalizes the Chinchilla scaling law (Hoffmann et al., 2022) to the repeated data regime and yields a better prediction of loss in this setting.

Figure 1 summarizes our main results targeting the value of repeated data (Return) and the optimal allocation of resources in that regime (Allocation). We find that, while models trained for a single epoch consistently have the best validation loss per compute, differences tend to be insignificant among models trained for up to 4 epochs and do not lead to differences in downstream task performance. Additional epochs continue to be beneficial, but returns eventually diminish to zero. We find that, in the data-constrained regime, allocating new compute to both more parameters and epochs is necessary, and that epochs should be scaled slightly faster.
These findings suggest a simple way to continue scaling total training compute budgets further into the future than previously anticipated limits. Finally, given the challenges imposed by data constraints, we consider methods complementary to repeating for improving downstream accuracy without adding new natural language data. Experiments consider incorporating code tokens and relaxing data filtering.

For code, English LLMs, such as PaLM (Chowdhery et al., 2022) or Gopher (Rae et al., 2021), are trained on a small amount of code data alongside natural language data, though no benchmarking was reported to justify that decision. We investigate training LLMs on a mix of language data and Python data at 10 different mixing rates and find that mixing in code is able to provide a 2× increase in effective tokens even when evaluating only natural language tasks. For filtering, we revisit perplexity and deduplication filtering strategies on both noisy and clean data sets and find that data filtering is primarily effective for noisy data sets.
2. Background
Predicting the scaling behavior of large models is critical when deciding on training resources. Specifically, two questions are of interest: (Allocation) What is the optimal balance of resources? (Return) What is the expected value of additional resources? For scaling LLMs, the resource is compute (measured in FLOPs), and it can be allocated to training a larger model or training for more steps.¹ The metric used to quantify progress is the model's loss on held-out data, i.e. the ability to predict the underlying data as measured in the model's cross-entropy (Alabdulmohsin et al., 2022; Hoffmann et al., 2022). We aim to minimize the loss (L) subject to a compute resource constraint (C) via optimal allocation to N and D as:

1. In this work we use the approximation of Kaplan et al. (2020) for the compute cost: FLOPs(N, D) ≈ 6ND, where N denotes the number of model parameters and D denotes the number of tokens processed.
$$\underset{N,D}{\operatorname{argmin}}\; L(N, D) \quad \text{s.t.} \quad \mathrm{FLOPs}(N, D) = C \tag{1}$$
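As a rough sketch, the compute approximation from the footnote can be expressed in a few lines of Python (the helper name is ours, not from the paper's released code):

```python
# Sketch of the compute approximation from footnote 1:
# FLOPs(N, D) ≈ 6 * N * D (Kaplan et al., 2020).
def training_flops(n_params: float, n_tokens: float) -> float:
    """Approximate training compute in FLOPs for N parameters and D tokens."""
    return 6.0 * n_params * n_tokens

# Example: an 8.7B-parameter model trained on 178B tokens lands at
# roughly 9.3e21 FLOPs, one of the budgets used later in the paper.
print(training_flops(8.7e9, 178e9))  # ≈ 9.3e21
```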
Currently, there are established best practices for scaling LLMs. Return follows a power law: loss scales as a power law with the amount of compute used for training (Henighan et al., 2020; Kaplan et al., 2020; Bahri et al., 2021; Ghorbani et al., 2021; Bansal et al., 2022; Hernandez et al., 2021). Allocation is balanced: resources are divided roughly equally between scaling of parameters and data (Hoffmann et al., 2022). These scaling laws were established empirically by training LLMs and carefully extrapolating behavior. Chinchilla (Hoffmann et al., 2022) uses three methods for making scaling predictions:
(Fixed Parameters) Train with a fixed model size but on varying amounts of data.
(Fixed FLOPs) Train with fixed computation while parameters and training tokens vary.
(Parametric Fit) Derive and fit a formula for the loss.
For the parametric fit, the loss (L) is a function of parameters (N) and training tokens (D):
$$L(N, D) = \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}} + E \tag{2}$$
where {A, α, B, β, E} are learned variables fit using the training runs from the first two approaches (Hoffmann et al., 2022). Using these learned variables, they propose calculating the optimal allocation of compute (C) to N and D as follows:
$$N_{opt}(C) = G\left(\frac{C}{6}\right)^{a}, \qquad D_{opt}(C) = G^{-1}\left(\frac{C}{6}\right)^{b} \tag{3}$$

where

$$G = \left(\frac{\alpha A}{\beta B}\right)^{\frac{1}{\alpha+\beta}}, \qquad a = \frac{\beta}{\alpha+\beta}, \qquad b = \frac{\alpha}{\alpha+\beta}$$
These methods lead to the conclusion that α ≈ β, and hence N and D should be scaled proportionally for compute-optimal training. As loss can be an imperfect proxy for performance on natural language tasks (Xia et al., 2022; Shin et al., 2022; Tay et al., 2021), they also validate their conclusions on various downstream tasks.
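To make the allocation rules concrete, the following sketch evaluates the parametric loss and the closed-form optimum. The coefficient values are the commonly reported Hoffmann et al. (2022) estimates and are illustrative only; refits in later work differ slightly.

```python
import math

# Sketch of the Chinchilla parametric loss (Equation 2) and the closed-form
# compute-optimal allocation (Equation 3). Coefficients are illustrative.
A, B, E = 406.4, 410.7, 1.69
alpha, beta = 0.34, 0.28

def chinchilla_loss(N: float, D: float) -> float:
    return E + A / N**alpha + B / D**beta

def optimal_allocation(C: float) -> tuple:
    G = (alpha * A / (beta * B)) ** (1.0 / (alpha + beta))
    a = beta / (alpha + beta)
    b = alpha / (alpha + beta)
    N_opt = G * (C / 6.0) ** a
    D_opt = (1.0 / G) * (C / 6.0) ** b
    return N_opt, D_opt

# Since a + b = 1, the allocation satisfies the constraint FLOPs = 6*N*D = C.
N_opt, D_opt = optimal_allocation(1e21)
assert math.isclose(6.0 * N_opt * D_opt, 1e21, rel_tol=1e-9)
```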
3. Method: Data-Constrained Scaling Laws
We are interested in scaling behavior in the data-constrained regime. Specifically, given a limited amount of unique data, what is the best Allocation of, and Return for, computational resources? Prior work (Kaplan et al., 2020; Hoffmann et al., 2022) assumes that the necessary data to support scaling is unlimited. Our aim is therefore to introduce a modified version of Equation 2 that accounts for data constraints and fit the terms in the modified scaling law to data from a large body of experiments.
The primary method we consider is repeating data, i.e. allocating FLOPs to multiple epochs on the same data. Given a budget of unique data D_C, we split the Chinchilla total data term D into two parts: the number of unique tokens used, U_D, and the number of repetitions, R_D (i.e. epochs − 1). Given total training tokens D and data budget D_C, these terms are simply computed as U_D = min{D_C, D} and R_D = (D/U_D) − 1. When training for a single epoch, as done in prior scaling studies, R_D = 0. We are thus interested in minimizing Equation 1 with the additional constraint of a data budget D_C:
$$\underset{N,D}{\operatorname{argmin}}\; L(N, D) \quad \text{s.t.} \quad \mathrm{FLOPs}(N, D) = C,\; U_D \leq D_C \tag{4}$$
Symmetrically, for mathematical convenience, we split the parameter term N into two parts: the base number of parameters needed to optimally fit the unique tokens, U_N, and the number of times to repeat this initial allocation, R_N. We compute U_N by first rearranging Equation 3 to find the optimal compute budget for the unique tokens used (U_D). We input this value into the N_opt formula of Equation 3 to get U_N = min{N_opt, N}. U_N thus corresponds to the compute-optimal number of parameters for U_D, or less if N < N_opt. Once we have U_N, we compute the repeat value as R_N = (N/U_N) − 1.

To empirically explore the scaling behavior in a data-limited setting, we train LLMs under these constraints. We consider three different experimental protocols in this work:
(Fixed Unique Data) In §5 we fix the data constraint D_C and train models varying epochs and parameters. These experiments target Allocation, specifically the trade-off of D and N.
(Fixed FLOPs) In §6 we fix the computation available and vary D_C (and thus also U_D and U_N). These experiments target Return, i.e. how well repeating scales compared to having more unique data.
(Parametric Fit) We fit a formula introduced in §3.1 on all our training runs and evaluate its predictive capability throughout §5 and §6.
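As a minimal sketch (the helper names are ours), the decomposition into unique and repeated parts can be written as:

```python
# Decompose total tokens D and parameters N into "unique" and "repeat" parts:
# U_D = min(D_C, D) unique tokens, R_D = D/U_D - 1 repetitions (epochs - 1);
# symmetrically U_N, R_N for parameters given a compute-optimal size.
def split_data(D: float, D_C: float) -> tuple:
    U_D = min(D_C, D)
    R_D = D / U_D - 1.0
    return U_D, R_D

def split_params(N: float, N_opt_for_U_D: float) -> tuple:
    # N_opt_for_U_D is the compute-optimal parameter count for U_D tokens,
    # e.g. obtained from Chinchilla-style coefficients (Equation 3).
    U_N = min(N_opt_for_U_D, N)
    R_N = N / U_N - 1.0
    return U_N, R_N

# Training 400B total tokens on a 100B-token budget means 4 epochs: R_D = 3.
U_D, R_D = split_data(400e9, 100e9)
assert (U_D, R_D) == (100e9, 3.0)
```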
Before discussing experimental results we describe the parametric assumptions.
3.1 Parametric Fit
To extrapolate scaling curves, it is necessary to incorporate repetition into the Chinchilla formula (Equation 2). We generalize Equation 2 by replacing D and N with terms corresponding to the effective data (D′) and the effective model parameters (N′):
$$L(N, D) = \frac{A}{N'^{\alpha}} + \frac{B}{D'^{\beta}} + E \tag{5}$$
Intuitively, D′ should be smaller than or equal to D, where D is the total number of processed tokens, since repeated tokens provide less useful information to the model than new ones. We use an exponential decay formulation, in which the value of a data token decays by roughly a factor of (1 − 1/R_D^*) per repetition, where R_D^* is a learned constant. After some derivations and approximations (see Appendix B), this boils down to
$$D' = U_D + U_D R_D^*\left(1 - e^{-R_D/R_D^*}\right) \tag{6}$$
Note that for R_D = 0 (no repetitions), D′ = U_D = D. For R_D ≪ R_D^*, we have e^{-R_D/R_D^*} ≈ 1 − R_D/R_D^*, and so

$$D' \approx U_D + U_D R_D^*\left(1 - 1 + \frac{R_D}{R_D^*}\right) = U_D(1 + R_D) = D,$$

and hence in this case repeated data is worth almost the same as fresh data. (This is also consistent with the predictions of the deep bootstrap framework (Nakkiran et al., 2021b).) As R_D grows, the value of repeated tokens tends to zero, and the effective data D′ becomes much smaller than D. The formula implies that no matter how many times we repeat the data, we will not get a better loss than could be obtained with a single epoch on U_D + U_D R_D^* fresh tokens.

Just as processing repeated tokens yields diminishing returns, both intuitively and empirically, models whose size vastly outstrips the available data also offer diminishing returns per parameter. Hence we use a symmetric formula for the number of effective parameters, where again R_N^* is learned:
$$N' = U_N + U_N R_N^*\left(1 - e^{-R_N/R_N^*}\right)$$
The learned constants R_D^* and R_N^* roughly correspond to the "half-life" of repeated data and excess parameters. For example, at R_D = R_D^*, the number of effective tokens D′ is U_D + U_D R_D(1 − e^{-1}), which means that the U_D R_D repeated tokens are worth on average a 1 − 1/e fraction of fresh ones. Using a methodology similar to Hoffmann et al. (2022), R_N^* and R_D^* can be fit on empirical measurements, which yields data-driven estimates. See Appendix B for more details on the derivations and the fitting procedure.
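Putting the effective-data formulation together, a hedged sketch of the data-constrained loss follows; the constants are placeholders for fitted values, and the function names are ours:

```python
import math

# Sketch of the data-constrained loss: plug the effective data D' and
# effective parameters N' into the Chinchilla-style form of Equation 5.
def effective(U: float, R: float, R_star: float) -> float:
    # U + U * R_star * (1 - exp(-R / R_star)): repeated units decay in value,
    # saturating at U * (1 + R_star) no matter how many repetitions are used.
    if R_star == 0.0:
        return U
    return U + U * R_star * (1.0 - math.exp(-R / R_star))

def data_constrained_loss(U_N, R_N, U_D, R_D,
                          A, B, E, alpha, beta, Rn_star, Rd_star):
    N_eff = effective(U_N, R_N, Rn_star)
    D_eff = effective(U_D, R_D, Rd_star)
    return E + A / N_eff**alpha + B / D_eff**beta

# At R_D = 0 (a single epoch), D' = U_D, recovering the Chinchilla equation.
assert effective(100e6, 0.0, 5.0) == 100e6
```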
4. Experimental Setup
Figure 2: Data set setup. We ensure runs using less data (more epochs) always use a subset of the data used in runs with more data (fewer epochs).
For all experiments, we train transformer language models with the GPT-2 architecture and tokenizer (Radford et al., 2019). Models have up to 8.7 billion parameters and are trained for up to 900 billion total tokens. Following Hoffmann et al. (2022), we use cosine learning rate schedules that decay 10× over the course of training for each model (different schedules led to different estimates in Kaplan et al. (2020)). Unlike Kaplan et al. (2020), we do not use early stopping, so as to also explore the extent of overfitting when repeating. Other hyperparameters are based on prior work (Rae et al., 2021; Hoffmann et al., 2022) and detailed in Appendix I.

Models are trained on subsets of C4 (Raffel et al., 2020). The data constraints are carefully defined to ensure maximal overlap, as shown in Figure 2. Unlike Hernandez et al. (2022), we always repeat the entire available data rather than subsets of it. Data is shuffled after each epoch. As repeating data can result in extreme
Figure 3: IsoLoss contours for 100 million unique tokens. Left: 93 models trained with varying parameters and epochs on a fixed data set. Contours show an interpolation of results with the same final test loss. Right: Comparison with the loss predictions from our proposed scaling laws for the same budget of 100 million unique tokens and the predicted efficient frontier. The diminishing returns from training on repeated data can be seen in the increasing distance between the contour curves.
overfitting (see §6.1), we report loss on a held-out test set unless otherwise specified (see Appendix D). This contrasts with the training loss used in Hoffmann et al. (2022), but should not alter our findings, as the held-out data stems from the same underlying data set.
5. Results: Resource Allocation for Data-Constrained Scaling
Our first experimental setting considers scaling in a setting where all models share the same data constraint. For these experiments, the unique training data budget D_C is fixed at either 100M, 400M, or 1.5B tokens. For each data budget, we train a set of language models with increasing amounts of compute that is allocated to either more parameters or more epochs on the unique training data. Figure 3 (left) shows the main results for scaling with 100M unique tokens² and Figure 4 for 400M and 1.5B tokens. For 100M tokens, the corresponding one-epoch compute-optimal model according to the scaling laws of Hoffmann et al. (2022) has U_N of approximately 7M parameters (see Appendix C for the scaling coefficients we use). Results show that more than a 50%
2. Although small, this is, for example, the order of magnitude of a realistic data constraint reflecting data available after filtering the OSCAR data set (Ortiz Suárez et al., 2019) for Basque, Punjabi, or Slovenian.
Figure 4: Empirical IsoLoss curves for 400 million and 1.5 billion unique tokens. 34 models trained on 400 million unique tokens and 37 models trained on 1.5 billion unique tokens, with varying parameters and epochs.
reduction in loss can be attained by training for several epochs (R_D > 0) and increasing model size beyond what would be compute-optimal for 100M tokens (R_N > 0). We find the best loss at around 20–60× more parameters and epochs, which corresponds to spending around 7000× more FLOPs. These results suggest that one-epoch models significantly under-utilize their training data, and more signal can be extracted by repeating data and adding parameters at the cost of sub-optimal compute utilization.

Figure 3 (right) shows the predicted contours created by fitting our data-constrained scaling laws on 182 training runs. In the single-epoch case (R_D = 0) with near compute-optimal parameters (R_N = 0), our scaling equation (§3.1) reduces to the Chinchilla equation. In this case, both formulas predict the optimal allocation of compute to parameters and data to be the same, resulting in overlapping efficient frontiers.

As data is repeated for more than a single epoch, our fit predicts that excess parameters decay faster in value than repeated data (R_N^* < R_D^*). As a result, the data-constrained efficient frontier suggests allocating most additional compute to more epochs rather than more parameters. This contrasts with the Chinchilla scaling laws (Hoffmann et al., 2022), which suggest scaling both equally. However, note that they do not repeat the entire training data, and their parametric fit explicitly relies on the assumption that models are trained for a single epoch only. Thus, there is no guarantee that their scaling predictions hold for repeated data.

For all three data budgets, our results suggest that Allocation is optimized by scaling epochs faster than parameters. We confirm this at scale by training the data-constrained compute-optimal model for 9.3 × 10²¹ FLOPs and 25 billion unique tokens as suggested
by our efficient frontier. Despite having 27% fewer parameters, this model achieves better loss and downstream performance than the model suggested by the Chinchilla scaling laws (Figure 1 (right) and Table 8). Similarly, the 120 billion parameter Galactica model trained on repeated data should have been significantly smaller according to data-constrained scaling laws (§7). An additional benefit of using a smaller model is cheaper inference, though adding parameters can make it easier to parallelize training across GPUs.

Adding parameters and epochs causes the loss to decrease and eventually increase again, suggesting that too much compute can hurt performance. Results from Kaplan et al. (2020) also show that loss can increase when too many parameters are used, even with early stopping. However, we expect that appropriate regularization (such as simply removing all excess parameters, as an extreme case) could prevent this behavior. Thus, our formula presented in §3 and its predicted IsoLoss contours in Figure 3 do not model the possibility that excess epochs or parameters could hurt performance.
5.1 Double Descent
Prior work has reported double descent phenomena when repeating data, where the loss initially increases and then decreases again as the model is trained for more epochs (Nakkiran et al., 2021a; Hernandez et al., 2022). In Figure 5, we plot the loss curves of several models trained for varying epochs on 100 million tokens. We find double descent phenomena, with the loss of all models increasing at around 200 epochs before decreasing again. These samples can also be seen in Figure 3. This contributes additional noise to the fitting of our functions in Appendix B, as our functional form assumes loss to be monotonically decreasing as epochs increase. We thus remove most such examples from the fitting.
5.2 Repeating on Heavily Deduplicated Data
To investigate whether Figure 3 depends on the inherent amount of duplicates in the selected 100 million tokens, we train several models on a deduplicated version of C4 (see Appendix G). We plot the performance of the models trained on the deduplicated C4 versus the regular C4 in Figure 6. All models are evaluated on the same validation data set from the regular C4. Regardless of deduplication, we find 59 epochs to be optimal and the overall trend to be very similar. Together with our results on OSCAR (§6.2), this suggests that our findings generalize to different data sets with different inherent amounts of duplicates.
5.3 Do Excess Parameters Hurt, Plateau or Help?
Figures 3 and 4 suggest that excess parameters (or epochs) can harm performance. We hypothesize that this is due to suboptimal hyperparameters and could be prevented with better regularization. Thus, we expect that with optimal regularization hyperparameters, excess parameters would never hurt; performance would merely plateau, as in extreme cases regularization could simply take the form of removing the excess parameters. One approach to selecting optimal hyperparameters is µP (Yang et al., 2021). We compare excessively large models trained with a data constraint of D_C = 100 million tokens in Figure 7 across µP, our default hyperparameters (Appendix I), and scaling law predictions. Surprisingly, µP leads to even higher test loss than our default hyperparameters. Nevertheless, we find that also with µP
[Figure 5 axes: final test loss relative to starting point vs. epochs (15 to 9000), for models with 14 million, 44 million, and 83 million parameters.]
Figure 5: Double descent. Each dot is a model trained on 100 million unique tokens. Loss initially increases at 200 epochs and then decreases again; this is known as epochwise double descent (Nakkiran et al., 2021a).
Figure 6: Optimal loss on deduplicated data. 146 million parameter models trained on 100 million unique tokens that are either directly from C4 or undergo additional deduplication. Each dot is a single model. While deduplication results in a higher test loss, the optimal number of epochs remains the same whether or not deduplication is performed as in Figure 3.
[Figure 7 legend: final test loss vs. parameters (250M to 2B); empirical loss with parameter selection according to µP vs. standard parameter selection, and predicted loss from scaling laws: data-constrained (this work) and a variant allowing excess parameters to hurt via alpha-beta decay.]
Figure 7: Empirical and predicted losses of LLMs trained on 100 million tokens for a single epoch. Excess parameters empirically hurt performance, but this may be due to a lack of regularization. Thus, our scaling formula predicts loss to plateau, while Chinchilla predicts loss to improve. By decaying the exponent α (and β) instead, one can allow excess parameters to hurt.
excessive parameters hurt: the models with more than 2 billion parameters have significantly higher validation loss after training than the models with 200 million to 1 billion parameters when trained on only 100 million tokens. However, µP only covers hyperparameters such as the learning rate, not explicit regularization hyperparameters like dropout rates, which we hypothesize would prevent this behavior. Thus, our proposed scaling equations predict loss to plateau, as seen in the straight line. As the compute-optimal parameter count for 100 million tokens is around 7 million, all depicted models have a significant amount of excess parameters, and data-constrained scaling laws predict their losses to all be the same (R_N^* ≪ R_N). Meanwhile, the default Chinchilla scaling law (Hoffmann et al., 2022) predicts loss to continue decreasing as parameters are added, which is in stark contrast to the empirical data.

If one wants to incorporate excess parameters hurting performance into the scaling law equations, one could consider (a) modifying the exponential decay formulation introduced in Appendix B such that instead of the value of repeated data decaying to 0 it decays to a large negative value, or (b) decaying the exponents α and β in Equation 8 instead of D and N. Decaying the exponents to 0 has the effect of more repetitions eventually hurting performance, as lim_{α→0} D^α = 1 and the same for β. Thus, initially, as D and N increase, loss decreases, but ultimately the decay of α and β pushes the terms N^α and D^β back to 1, resulting in loss increasing. Specifically, approach (b) could take the form of:
$$L(N, D, R_N, R_D) = E + \frac{A}{N^{\alpha \max\left(0,\, 1 - R_N/R_N^*\right)}} + \frac{B}{D^{\beta \max\left(0,\, 1 - R_D/R_D^*\right)}} \tag{7}$$
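A hedged sketch of this alpha-beta decay variant (constants are placeholders for fitted values; the function name is ours):

```python
# Sketch of the alpha-beta decay variant (Equation 7): the exponents alpha
# and beta themselves decay with repetitions R_N and R_D, so sufficiently
# many excess parameters or epochs eventually increase the loss.
def alpha_beta_decay_loss(N, D, R_N, R_D,
                          A, B, E, alpha, beta, Rn_star, Rd_star):
    a_eff = alpha * max(0.0, 1.0 - R_N / Rn_star)
    b_eff = beta * max(0.0, 1.0 - R_D / Rd_star)
    # As a_eff -> 0, N**a_eff -> 1, so the parameter term climbs back to A,
    # which is what lets excess parameters hurt in this formulation.
    return E + A / N**a_eff + B / D**b_eff
```

With R_N = R_D = 0 the exponents are unchanged and the formula reduces to the Chinchilla form.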
Figure 8: IsoLoss contours for 100 million unique tokens, with contours predicted by parametric decay of alpha and beta. The same models as in Figure 3, with the contour predictions made by the alpha-beta decay formulation introduced in §5.3.
Like the equations in Appendix B, this formulation also reduces to the Chinchilla scaling laws in the base case of R_D = 0 or R_N = 0. As the exponents decrease with more repetitions, adding parameters or epochs becomes less beneficial. Eventually, the decay in α or β causes loss to increase again, as it pushes N^α or D^β back down to 1.

We fit this formula using the same approach outlined in Appendix B, but including samples where excess parameters or epochs hurt (296 total samples). We use a grid of initializations given by R_N^* ∈ {0, 2000, ..., 100000} and R_D^* ∈ {0, 2000, ..., 100000}. This results in R_D^* = 26530.611 and R_N^* = 2040.8163. R_N^* is significantly lower, resulting in excess parameters hurting faster than excess epochs, which is in line with the empirical data from Figure 3.

We visualize Figure 3 with the predictions from this alpha-beta decay formulation in Figure 8. Excess parameters eventually hurt, resulting in circle-shaped contours. Due to the very high R_D^*, the area where epochs start to hurt is outside the boundaries of Figure 8. While the predicted optimal allocation (efficient frontier) is similar to Figure 3, the predicted return from repeated data differs significantly. The alpha-beta decay formulation incorrectly predicts returns to diminish significantly more slowly, as seen by the longer efficient frontier and the smaller distance between contours early on as compared to Figure 3. Beyond its potentially useful properties, we do not have a rigorous mathematical justification for this alpha-beta decay formulation, which could be the cause of the incorrect return predictions.

Ultimately, we settle on our exponential decay formulation from Appendix B, which does not allow excess parameters or epochs to hurt, as preventing such behavior is trivial by stopping
| FLOP budget (C) | Parameters (N) | Training tokens (D) | Data budgets (D_C) |
|---|---|---|---|
| 9.3 × 10²⁰ | 2.8B | 55B | {55, 28, 18, 14, 11, 9, 4, 1.25}B |
| 2.1 × 10²¹ | 4.2B | 84B | {84, 42, 28, 21, 17, 12, 6, 1.9}B |
| 9.3 × 10²¹ | 8.7B | 178B | {178, 88, 58, 44, 35, 25, 13, 4}B |
Figure 9: Validation loss for different data constraints (IsoFLOP). Each curve represents the same number of FLOPs spent on an equal-sized model. Colors represent different numbers of epochs due to repeating because of data constraints. Parameters and training tokens are set to match the single-epoch compute-optimal configurations for the given FLOPs. Models trained on data that is repeated for multiple epochs have consistently worse loss and diverge if too many epochs are used. Only the loss curves for the 8.7B runs are smoothed, with an exponential moving average and weight of 0.85.
training (in the case of epochs hurting) or removing excess parameters (in the case of model parameters hurting). Further, accurately predicting how much loss increases in the limit is not very useful, as in practice one would want to stop training when loss is expected to plateau anyway.
6. Results: Resource Return for Data-Constrained Scaling
Next, consider the question of Return on scaling. To quantify this value, we run experiments with three FLOP budgets across eight respective data budgets to compare return on FLOPs. Figure 9 shows the configurations and validation curves for models trained on the same number of total tokens. Conforming to intuition and prior work on deduplication (Lee et al., 2021), repeated data is worth less, thus models trained on less unique data (and, correspondingly, more epochs) have consistently higher loss. However, the loss difference for
[Figure: left panel plots final test loss against the fraction of unique training tokens (100%, 50%, 25%, 14%, 10%) for the 8.7B/178B, 4.2B/84B, and 2.8B/55B configurations (empirical loss, fixed training length); right panel extrapolates loss over total training tokens from 10B to 100T for the three model sizes, with markers at 8, 16, 32, and 64 epochs (predicted loss, variable training length). Curves show the loss of trained models, the loss assuming training stops when all unique data is exhausted, the loss assuming repeated data is worth the same as new data, and the loss predicted by our data-constrained scaling laws. Annotation: repeating for 4 epochs is almost as good as new data.]
Figure 10: Empirical and extrapolated loss with constrained data. Left: Loss as a function of repeated tokens for three different training budgets, each with a fixed number of parameters. Loss curves predicted by our data-constrained scaling laws are shifted to exactly match the loss at 100% unique data. Return on FLOPs decays with repeated data in a regular pattern. Right: Extrapolating the proposed data-constrained scaling law shows that repetition is benign at small epoch counts, but loss stops improving at large ones.
a few epochs is negligible. For example, the N = 8.7 billion parameter model trained for four epochs (DC = 44 billion unique tokens) finishes training with only 0.5% higher validation loss than the single-epoch model (DC = 178 billion unique tokens). In Figure 10 (left), we compare the final test loss of each model to predictions from our parametric fit. The data-constrained scaling laws accurately capture the decay in the value of repeated data, as seen by the proximity of empirical results (dots) and parametric fit (lines). We note, however, that the fit significantly underestimates the final test loss of failing models whose loss increases midway through training, such as models trained for 44 epochs (not depicted). In Figure 10 (right), we extrapolate the three budgets by further scaling compute while keeping the data constraints (DC) at 55B, 84B, and 178B tokens, respectively. The parameter R*_D introduced in Section 3 represents roughly the half-life of epochs: specifically, the point where repeated tokens have lost 1/e of their value. Through our fitting in Appendix B, we found R*_D ≈ 15, corresponding to 15 repetitions (or 16 epochs). Graphically, this can be seen in the stark diminishing returns near the 16-epoch marker and the flattening out soon after. Overall, the Return when repeating data is relatively good: meaningful gains can be made for up to around 16 epochs (R*_D), beyond which returns diminish extremely fast.
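The half-life interpretation of R*_D can be made concrete with the effective-data form of Equation 14. A minimal sketch (the function name is ours) using R*_D = 15 from our fit:

```python
import math

def effective_data(U, R_D, R_star=15.0):
    """Equation 14: D' = U + U * R*_D * (1 - exp(-R_D / R*_D)),
    where U is the number of unique tokens and R_D the number of
    repetitions (epochs - 1)."""
    return U + U * R_star * (1.0 - math.exp(-R_D / R_star))

# The marginal value of one more epoch, in units of unique data, shrinks
# sharply once repetitions approach R*_D = 15 (roughly 16 epochs):
for epochs in (2, 4, 16, 64):
    gain = effective_data(1.0, epochs - 1) - effective_data(1.0, epochs - 2)
    print(f"epoch {epochs:>2}: marginal value = {gain:.3f} unique tokens")
```

The printed gains fall from nearly a full token's worth at the second epoch toward zero well past the 16-epoch mark, matching the flattening visible in Figure 10 (right).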
[Figure: three panels of training loss over training tokens for 2.8B parameters trained for 55B tokens, 4.2B parameters trained for 84B tokens, and 8.7B parameters trained for 178B tokens; curve colors denote 1, 2, 3, 4, 5, 7, 14, and 44 epochs.]
Figure 11: Training loss smoothed with an exponential moving average (weight 0.999). Models trained on fewer unique tokens (more epochs) have better training loss as they overfit.
6.1 Training Loss
Hoffmann et al. (2022) use training loss as their core metric. However, when repeating data for multiple epochs, training loss is a poor metric, as models overfit to the limited data available, as shown in Figure 11. Thus, we use loss on a held-out test set as our key performance metric.
6.2 Scaling Curves on the OSCAR Corpus
To ensure our findings are not data set-dependent, we train models with the same configurations from Figure 9 on the OSCAR corpus (Ortiz Suárez et al., 2020). OSCAR is considered noisier than C4 (Raffel et al., 2020) due to its less stringent filtering. Figures 12 and 13 depict the validation and training loss of these models. We find the trend to be the same as for models trained on C4: while models with fewer repeats have better loss, differences for a few repeats are insignificant.
6.3 Validation Loss by Epoch
Taylor et al. (2022) decided to early-stop pre-training of the Galactica models due to a small increase in validation loss at the start of the fifth epoch. In Figure 14 we plot the validation loss curves of our iso FLOP models as a function of epochs. We do find small increases in validation loss when models enter a new epoch. For example, upon entering the third and fourth epoch, the 7-epoch 8.7 billion parameter OSCAR model shows loss spikes. However, these are temporary and loss continues to go down smoothly thereafter. Thus, we hypothesize
[Figure: three panels of validation loss over training tokens for 2.8B parameters trained for 55B tokens, 4.2B parameters trained for 84B tokens, and 8.7B parameters trained for 178B tokens on OSCAR; curve colors denote 1, 2, 3, 4, 5, 7, 14, and 44 epochs.]
Figure 12: Validation loss during training for models trained on OSCAR. Models trained on tokens that are repeated for multiple epochs have consistently worse loss.
[Figure: three panels of training loss over training tokens for 2.8B parameters trained for 55B tokens, 4.2B parameters trained for 84B tokens, and 8.7B parameters trained for 178B tokens on OSCAR; curve colors denote 1, 2, 3, 4, 5, 7, 14, and 44 epochs.]
Figure 13: Training loss for models trained on OSCAR, smoothed with an exponential moving average (weight 0.999). Models trained on fewer unique tokens (more epochs) have better training loss as they overfit.
[Figure: validation loss plotted against epochs. Panels: (a) 2.8B parameters trained on C4, (b) 4.2B parameters trained on C4, (c) 8.7B parameters trained on C4, (d) 2.8B parameters trained on OSCAR, (e) 4.2B parameters trained on OSCAR, (f) 8.7B parameters trained on OSCAR; the top row shows C4 validation loss and the bottom row OSCAR validation loss, with curve colors denoting the number of epochs.]
Figure 14: Validation loss during training visualized by epochs. Loss progresses smoothly throughout training. There are temporary spikes for 8.7 billion parameter models, commonly at the start of a new epoch.
[Figure: efficient frontiers over parameters and tokens within the regime of equal compute (IsoFLOP), showing the Galactica models (120B and 30B parameters; the 120B model trained on 450B tokens, i.e. 4.25 epochs) against the frontier assuming repeated data is worth the same as new data and the frontier predicted by our data-constrained scaling laws. Annotation: the optimal allocation uses roughly 3× fewer parameters and 3× more epochs, i.e. 1.35T tokens (12.75 epochs).]
Figure 15: Optimal compute allocation for Galactica. Shown are the efficient frontier assuming repeated data is worth the same as new data (Chinchilla scaling laws) and the data-constrained efficient frontier assuming a unique token budget of 106 billion tokens, as for the Galactica models (Taylor et al., 2022). According to our proposed data-constrained scaling laws, the 120 billion parameter Galactica model should have been significantly smaller and trained for more epochs.
that the Galactica models could have attained better performance by continuing pre-training beyond the loss spike experienced at the beginning of the fifth epoch.
7. Case Study: Galactica
The Galactica models (Taylor et al., 2022) are the only publicly known LLMs that explicitly trained for a significant number of epochs prior to this work. They trained their models on 106 billion unique tokens for 4.25 epochs. Our findings on Return from repeated data agree with their conclusion that multiple epochs are beneficial; however, we find that even more epochs can be beneficial, and a small spike in validation loss does not justify stopping training (Section 6.3). Meanwhile, our findings on Allocation significantly deviate from Galactica. Figure 15 visualizes the Galactica models with our predicted efficient frontier in the same
[Figure: left, a schematic of strategies for filling a fixed data budget: repeating the available data, filling missing data with code, and deduplicating or perplexity-filtering a larger pool before repeating; right, average performance on 19 tasks (%) against the available data budget (100%, 50%, 25%, 10%) for the strategies repeating data, filling missing data with Python code, perplexity-filter then repeat, and deduplicate then repeat.]
Figure 16: Strategies for data-constrained settings and their downstream performance. Left: Schematic showing alternative data use strategies of code filling and filtering. Right: N = 4.2 billion parameter models trained for a total of D = 84 billion tokens with varying budgets DC. For repeating and filling with code, five models with different seeds are trained for each dot and the standard deviation is visualized as the shaded area.
style as Figure 1. The creators of Galactica decided to train a 120 billion parameter model on 450 billion tokens, a significant overallocation to parameters even in Chinchilla terms (black efficient frontier). This decision was likely driven by the intuition that, because repeated data is worth less, one should spend more compute on parameters. However, our empirical data contradicts this: parameters learning from repeated data are worth even less than the repeated data itself, so one should overallocate to epochs, not parameters. Our data-constrained scaling laws thus predict that a better model could have been trained by allocating significantly more FLOPs to epochs rather than parameters for the largest Galactica model with 120 billion parameters. Specifically, 40 billion parameters trained for 1.35 trillion tokens (12.75 epochs) would have been optimal according to the data-constrained scaling laws. Note that these scaling laws were fitted on C4, which is not the data set used to pre-train Galactica. The Galactica models are pre-trained on a predominantly scientific data set, which includes code data among other sources. Results from Hoffmann et al. (2022) show that the scaling coefficients differ when training on C4 as compared to GitHub code; however, the overall allocation trend is the same. Thus, while we expect a smaller model trained for more epochs to be better than the 120 billion parameter model, the optimal allocation is unlikely to be exactly 40 billion parameters and 1.35 trillion tokens. While this is a lot of tokens, we point to follow-up work that has successfully repeated pre-training data in the trillion-token regime based on our results (Lozhkov et al., 2024).
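The allocation argument can be sketched numerically with the data-constrained law of Equation 15. The constants below are illustrative stand-ins, not our fitted values: A, B, E, α, β follow Hoffmann et al. (2022), R*_D = 15 matches the fit described in Section 6, and R*_N is an assumed round number. The function names are ours.

```python
import math

# Illustrative stand-in constants (NOT the paper's fitted C4 values).
A, B, E = 406.4, 410.7, 1.69
ALPHA, BETA = 0.34, 0.28
R_STAR_D, R_STAR_N = 15.0, 5.0

def _effective(unique, repeats, r_star):
    # Shared decay form U + U * R* * (1 - exp(-R / R*)) from Equation 14.
    return unique * (1.0 + r_star * (1.0 - math.exp(-repeats / r_star)))

def data_constrained_loss(N, D, U):
    """Equation 15 for N parameters, D total tokens, U unique tokens."""
    G = (ALPHA * A / (BETA * B)) ** (1.0 / (ALPHA + BETA))
    U_N = min(((U * G) ** (BETA / ALPHA)) * G, N)  # "unique" parameters for U
    R_N = max(N / U_N - 1.0, 0.0)                  # excess-parameter repeats
    R_D = max(D / U - 1.0, 0.0)                    # data repeats (epochs - 1)
    return (A / _effective(U_N, R_N, R_STAR_N) ** ALPHA
            + B / _effective(U, R_D, R_STAR_D) ** BETA + E)

# Galactica's budget: C = 6 * N * D FLOPs with U = 106B unique tokens.
C, U = 6 * 120e9 * 450e9, 106e9
for N in (120e9, 60e9, 40e9, 20e9):
    D = C / (6 * N)
    print(f"N = {N / 1e9:5.0f}B, D = {D / 1e12:4.2f}T, "
          f"predicted loss = {data_constrained_loss(N, D, U):.3f}")
```

Even under these stand-in constants, the sweep reproduces the qualitative conclusion: at Galactica's compute budget, a smaller model trained for more epochs attains a lower predicted loss than 120 billion parameters trained for 450 billion tokens.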
8. Results: Complementary Strategies for Obtaining Additional Data
While repeating data is effective, it has diminishing returns. We therefore consider strategies for scaling D targeting improved downstream performance as opposed to directly minimizing loss.
Figure 16 (left) illustrates the strategies: (a) Code augmentation: We use Python code from The Stack (Kocetkov et al., 2022) to make up for missing natural language data. The combined data set consisting of code and natural language samples is shuffled randomly. (b) Adapting filtering: We investigate the performance impact of deduplication and perplexity filtering, two common filtering steps that can severely limit available data. Removing such filtering steps can free up additional training data.

For these experiments, we set a maximum data budget (DC) of 84 billion tokens. For repetition and code filling, only a subset of DC is available and the rest is compensated for by repeating data or adding code. For both filtering methods, we start out with approximately twice the budget (178 billion tokens), as it is easier to gather noisy data and filter it than it is to gather clean data for training. For perplexity filtering, we select the top 25% of samples with the lowest perplexity according to a language model trained on Wikipedia. This results in 44 billion tokens that are repeated for close to two epochs to reach the full data budget. For deduplication filtering, all samples with a 100-char overlap are removed, resulting in 21 billion tokens that are repeated for four epochs during training. See Appendix G for more details on the filtering procedures.

When comparing across data strategies, loss ceases to be a good evaluation metric as the models are trained on different data distributions. We thus evaluate models on 19 natural language tasks with zero to five in-context few-shot exemplars (Brown et al., 2020), producing 114 scores per model. As our evaluation tasks cover different metrics and random baselines, we re-scale all scores to be in the same range to better reflect performance ranges before averaging. Details on the evaluation data sets are in Appendix D. In Figure 16 (right) we compare the downstream performance of all strategies.
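The re-scaling of scores across tasks with different metrics and random baselines can be sketched as a min–max normalization from each task's chance level to its maximum score (a generic sketch; our exact scheme is described in Appendix D, and the function name is ours):

```python
def rescale(score, random_baseline, max_score=1.0):
    """Map a raw task score onto [0, 1], where 0 is the random baseline and
    1 the maximum attainable score, so tasks with different metrics and
    chance levels can be averaged fairly."""
    return (score - random_baseline) / (max_score - random_baseline)

# Two hypothetical tasks with different chance levels:
scores = [
    (0.625, 0.25),  # 4-way multiple choice: chance accuracy = 0.25
    (0.70, 0.50),   # binary classification: chance accuracy = 0.50
]
average = sum(rescale(s, b) for s, b in scores) / len(scores)
print(average)
```

Without the rescaling, the binary task's higher raw score would dominate the average even though it is closer to chance.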
For repeating data, differences in downstream performance are insignificant for up to around 4 epochs (25% budget) and then start dropping, which aligns with our results on test loss in Section 6. This also confirms related work showing that perplexity and downstream performance generally exhibit similar trends (Gadre et al., 2024; Isik et al., 2024). Filling up to 50% of data with code (42 billion tokens) also shows no deterioration. Beyond that, performance decreases quickly on natural language tasks. However, adding more code data may benefit non-natural-language tasks, which are not considered in the benchmarking. Two of the tasks benchmarked, WebNLG (Castro Ferreira et al., 2020; Gehrmann et al., 2021), a generation task, and bAbI (Weston et al., 2015; Liang et al., 2022), a reasoning task, see jumps in performance as soon as code is added, possibly because code enables models to learn long-range state-tracking capabilities beneficial for these tasks.

Of the filtering approaches, we find perplexity filtering to be effective, while deduplication does not help. Prior work found that deduplication was able to improve perplexity (Lee et al., 2021); however, it did not evaluate on downstream tasks. Deduplication may have value not captured in our benchmark, such as reducing memorization (Kandpal et al., 2022; Hernandez et al., 2022; Carlini et al., 2022; Biderman et al., 2023). We also investigate filtering on a different, noisier data set in Appendix H, where we find it to be more effective. Overall, in a data-constrained regime, we recommend reserving filtering for noisy data sets and using both code augmentation and repeating to increase data tokens. For example, first doubling the available data by adding code and then repeating the new data set for four epochs results in 8× more training tokens that are expected to be just as good as having had 8× more unique data from the start.
[Figure: two panels over training tokens (5B to 85B) showing C4 validation loss (left) and Python validation loss (right) for models trained on 10% to 90% Python code, the rest C4.]
Figure 17: Validation loss of models trained on a mix of natural language (C4) and Python data.
[Figure: validation loss (left) and training loss (right) over training tokens (5B to 85B) for the strategies deduplicate then repeat (4 epochs), perplexity-filter then repeat (2 epochs), repeat (4 epochs), and repeat (2 epochs).]
Figure 18: Validation and training loss of models trained with different data strategies. Training loss is smoothed with an exponential moving average (weight 0.999). Downstream performance of the models is shown in Figure 16.
8.1 Loss Curves for Complementary Strategies
To compare the complementary data strategies, we have used downstream performance on the natural language tasks detailed in Appendix D instead of loss. This is because validation loss gives an unfair advantage to models trained on a larger fraction of data from the same distribution.
For example, when making up for missing natural language data with code, models trained on more code will have better validation loss on code data while having worse loss on natural language data, as seen in Figure 17: the model pre-trained on 90% Python code and 10% C4 has the highest C4 validation loss but the lowest Python validation loss.

Models trained on deduplicated or perplexity-filtered data have higher validation loss because the held-out validation data has not gone through the same filtering steps. Its distribution thus more closely resembles the training data of models trained on unfiltered data, resulting in worse validation loss for the two filtering strategies in Figure 18 (left). Meanwhile, for training loss in Figure 18 (right), the model trained on perplexity-filtered data has the lowest loss: its training data has been filtered to the top 25% of examples with the lowest perplexity (Appendix G), so high-loss examples have been explicitly removed, resulting in low training loss. The model trained on deduplicated data has the highest validation and training loss. This is because commonly repeated sequences have been filtered out of its training data. Thus, when encountering these common sequences in the unfiltered validation set, its loss is comparatively high, as other models have likely simply memorized them. Similarly, fewer repeated sequences during training result in higher training loss, as unseen sequences are harder to predict.
9. Related Work
9.1 Large Language Models
Scaling up transformer language models (Vaswani et al., 2017) across parameter count and training data has been shown to result in continuous performance gains (Chowdhery et al., 2022). Starting with the 1.4 billion parameter GPT-2 model (Radford et al., 2019), a variety of scaled-up language models have been trained, commonly referred to as large language models (LLMs). They can be grouped into dense models (Brown et al., 2020; Khrushchev et al., 2022; Lieber et al., 2021; Rae et al., 2021; Chung et al., 2022; Black et al., 2022; Zhang et al., 2022; Thoppilan et al., 2022; Su et al., 2022; Taylor et al., 2022; Zeng et al., 2022; Scao et al., 2022a; Li et al., 2023; Luukkonen et al., 2023) and sparse models (Fedus et al., 2021; Zeng et al., 2021; Du et al., 2022; Zoph et al., 2022) depending on whether each forward pass makes use of all parameters. These models are generally pre-trained to predict the next token in a sequence, which makes them applicable to various language tasks directly after pre-training (Brown et al., 2020; Wei et al., 2022; Kojima et al., 2022; Muennighoff, 2022; Srivastava et al., 2022) by reformulating said NLP tasks as context continuation tasks (see McCann et al. (2018) for an earlier proposal on this topic). We focus on the most common scenario, where a dense transformer model is trained to do next-token prediction on a large corpus and evaluated directly after pre-training using held-out loss or zero- to few-shot prompting.
9.2 Scaling Laws
Prior work has estimated an optimal allocation of compute for the training of LLMs. Kaplan et al. (2020) suggested that a 10× increase in compute should be allocated as a 5.5× increase in model size and a 1.8× increase in training tokens. This first scaling law has led to the creation
of very large models trained on relatively little data, such as the 530 billion parameter MT-NLG model trained on 270 billion tokens (Smith et al., 2022). More recent work (Hoffmann et al., 2022), however, showed that model size and training data should rather be scaled in equal proportions. These findings called for a renewed focus on the scaling of pre-training data rather than scaling model size via complex parallelization strategies (Shoeybi et al., 2019; Rasley et al., 2020; Bian et al., 2021; Narayanan et al., 2021). Up-sampling is often employed when pre-training data is partly limited, such as data from a high-quality domain like Wikipedia or text in a rare language for training multilingual LLMs (Lin et al., 2021; Orlanski et al., 2023). Hernandez et al. (2022) study up-sampling of data subsets and find that repeating only 0.1% of training data 100 times significantly degrades performance. In contrast, our work focuses on repeating the entire pre-training corpus for multiple epochs rather than up-sampling parts of it.
9.3 Alternative Data Strategies
Large pre-training data sets are commonly filtered to remove undesired samples or reduce noise (Sorscher et al., 2022). Perplexity-based filtering, whereby a trained model is used to filter out samples with high perplexity, has been found beneficial to reduce noise in web-crawled data sets (Wenzek et al., 2019). Mixing of data is employed for the pre-training data of multilingual LLMs, where text data from different languages is combined (Conneau et al., 2019; Xue et al., 2020; Soltan et al., 2022; Muennighoff et al., 2022). However, both for code and natural language models, mixing different (programming) languages has been reported to underperform monolingual models (Nijkamp et al., 2022; Virtanen et al., 2019). Some work has investigated mixing code and natural language data for prediction tasks, such as summarizing code snippets (Iyer et al., 2016) or predicting function names (Allamanis et al., 2015). Several pre-training data sets for LLMs include low amounts of code data (Gao et al., 2020; Rae et al., 2021; Scao et al., 2022a). However, these past works generally do not provide any ablation on the drawbacks of including code or the benefits for natural language task performance. We perform a detailed benchmarking of mixing Python and natural language in LLM pre-training at 10 different mixing rates.
10. Limitations and Future Work
10.1 Repeating Fractions of the Data
In this work we focus on repeating the entire unique data set for several epochs. Alternatively, one could repeat only a fraction of the data set, for example repeating 10% of the data set for 10 epochs while repeating the rest for only a single epoch, as done by Hernandez et al. (2022). To predict loss in that scenario, one may need to adapt our scaling laws with an additional parameter to account for the fraction that is repeated, and possibly a parameter that captures at what point in training the data is repeated. Repeating earlier in training, when most model weights are still randomly initialized, is likely to cause less damage than repeating later in training. Adapting our parametric fit to make concrete scaling predictions for such scenarios is an exciting future research direction.
10.2 Sensitivity to Hyperparameters
The returns from additional epochs may heavily depend on hyperparameters such as learning rate, dropout, or the optimizer choice. It is likely that increasing the learning rate, for example, would lead to diminishing returns from additional epochs kicking in earlier. In this work, we have fixed most hyperparameters to commonly used values for the training of LLMs and leave such explorations to future work.
10.3 Other Data sets
The optimal data strategy is dependent on the data set at hand, and we cannot give universally applicable filtering recommendations. By looking into C4 and OSCAR, we have covered two of the most commonly used English text data sets. Our findings on both data sets were overall in agreement with each other. We have highlighted some of the differences, such as deduplication being more effective on OSCAR due to it being noisier than C4. Further, we have focused on large-scale pre-training data sets. There is a lot of research on the optimal fine-tuning data set and methodology for LLMs (Sanh et al., 2022; Longpre et al., 2023a; Yong et al., 2022; Ouyang et al., 2022; Wei et al., 2021; Min et al., 2021; Wang et al., 2022; Zhou et al., 2023; Wang et al., 2023; Gupta et al., 2023; Xu et al., 2023; Muennighoff et al., 2023; Longpre et al., 2023b). More investigations of resolving data constraints when fine-tuning LLMs may be of interest for future work.
10.4 Other Modalities or Architectures
Our work focuses on text data sets and uses the GPT transformer architecture (Radford et al., 2019). Prior work has experimented with many variations to the GPT or transformer architecture (Dehghani et al., 2018; Tay et al., 2022a; Scao et al., 2022b), as well as scaling laws for non-text data sets (Aghajanyan et al., 2023). Overall, variations of the GPT or transformer architecture have proven very robust and generalizable to other domains (Huang et al., 2018; Chen et al., 2020; Muennighoff, 2020; Madani et al., 2020; Tay et al., 2022a; Dehghani et al., 2023). Nonetheless, it may be of interest for future work to test the applicability of our findings in this work to different data modalities or model architectures.
10.5 Other Strategies
There are numerous strategies to solve data constraints not covered in this work that are worth exploring. Like we have shown for Python, future research may consider to what extent augmenting with data in one natural language (e.g. Chinese) improves performance in another language (e.g. English) and what the best language to choose is (Lin et al., 2019; Xia et al., 2021). Similarly, while we have looked at deduplication and perplexity filtering, other filtering strategies, such as popularity-based filters (Allal et al., 2023; Zhao et al., 2023) and toxicity filters (Gehman et al., 2020; Henderson et al., 2022; Longpre et al., 2023c; Prabhumoye et al., 2023; Penedo et al., 2023), are worth exploring.
11. Conclusion
This work studies data-constrained scaling, focusing on the optimal use of computational resources when unique data is limited. We propose an extension to the Chinchilla scaling laws that takes into account the decay in value of repeated data, and we fit this function using a large set of controlled experiments. We find that, despite recommendations of earlier work, training large language models for multiple epochs by repeating data is beneficial and that scaling laws continue to hold in the multi-epoch regime, albeit with diminishing returns. We also consider complementary approaches to continue scaling models, and find that code gives the ability to scale data an additional 2×. We believe that our findings will enable further scaling of language models to unlock new capabilities with current data. However, our work also indicates that there are limits on the scaling horizon. In addition to collecting additional data, researchers should explore using current data in a more effective manner.
Acknowledgments
This work was co-funded by the European Union under grant agreement No 101070350. The authors wish to acknowledge CSC IT Center for Science, Finland, for generous computational resources on the LUMI supercomputer.3 We are thankful for the immense support from teams at LUMI and AMD, especially Samuel Antao. Hugging Face provided storage and additional compute instances. This work was supported by a Simons Investigator Fellowship, NSF grant DMS-2134157, DARPA grant W911NF2010021, and DOE grant DE-SC0022199. We are grateful to Harm de Vries, Woojeong Kim, Mengzhou Xia and the Eleuther AI community for exceptional feedback. We thank Loubna Ben Allal for help with the Python data and Big Code members for insightful discussions on scaling laws. We thank Thomas Wang, Helen Ngo and Turku NLP members for support on early experiments.
3. https://www.lumi-supercomputer.eu/
Appendix A. Contributions
Niklas Muennighoff led experiments, analysis, writing, and the overall project. He implemented, trained and evaluated all models. Alexander M. Rush contributed to framing, results analysis, and paper writing. Boaz Barak contributed to formal and experimental analysis as well as paper writing. Teven Le Scao provided guidance, led data choices and preprocessing, and contributed to framing and writing. Aleksandra Piktus created perplexity and deduplication data sets and contributed to writing. Nouamane Tazi contributed to enabling high-performance training on AMD hardware. Sampo Pyysalo contributed to enabling high-performance training and early repetition experiments. Thomas Wolf provided guidance on experimental design and contributed to paper writing. Colin Raffel provided guidance on experimental design and contributed to paper writing.
Appendix B. Derivation of Data-Constrained Scaling Laws
Let N be the number of model parameters, D the number of training tokens, and U the number of "unique" training tokens, i.e. the size of the data set that is to be trained on for one or more epochs. Chinchilla (Hoffmann et al., 2022) only deals with non-repeated tokens, thus D = U, and we can write their formula ("Approach 3") as:

L(N, U) = A/N^α + B/U^β + E    (8)
where E represents the irreducible loss and A, B, α, and β are learned parameters. We now want to generalize this expression to multiple epochs where tokens are repeated. We repeat the data R_D times, where R_D = 0 corresponds to the base case of a single epoch. We let D′ be the "effective data size": the number of unique tokens needed to attain the same value as training on U unique tokens for R_D repeats. Hence, if R_D = 0, the effective data is the same as the total data processed. Intuitively, each time a sample is repeated, it is worth less, as the model has already learned some of its information. Assume that each time a model trains on a token, it learns a 1 − δ fraction of the information in it, for some constant 0 ≤ δ ≤ 1. (Thus, if δ = 0 repeated tokens are as good as new ones, and if δ = 1 repeated tokens are worth nothing.) In other words, we expect the decrease in value of each repetition to be proportional to the value of the prior repetition, which is equivalent to exponential decay. As we would like to sum up the value of all repetitions, we temporarily assume an integral number of repeats and express the effective data as a geometric series:

D′ = U + (1 − δ)U + (1 − δ)²U + ⋯ + (1 − δ)^(R_D) U    (9)
We know that the sum S of a geometric series with common ratio r is:

S = a(1 − r^n) / (1 − r)    (10)

where a is the first term and n the number of terms in the series. With r = (1 − δ) and a = (1 − δ)U:
D′ = U + ∑_{k=1}^{R_D} (1 − δ)^k U = U + (1 − δ)U (1 − (1 − δ)^(R_D)) / δ    (11)
Note that Equation 11 can also be used with a non-integer number of repetitions. We could directly use Equation 11 as our effective data and learn δ, but for convenience and interpretability, we redefine it in terms of the number of epochs beyond which repeating does not help. Note that as more data is repeated, the right-hand side tends to U + (1 − δ)U/δ, since lim_{R_D→∞} (1 − (1 − δ)^(R_D)) = 1. Let R*_D = (1 − δ)/δ, hence D′ plateaus at U + R*_D·U as R_D goes to infinity. If we assume δ to be small, 1 − δ tends to one and we can approximate 1/R*_D = δ/(1 − δ) ≈ δ. Next, recall the Taylor series expansion of e^x:
e^x = 1 + x + x²/2! + x³/3! + ⋯ ≈ 1 + x    (12)
If x is small, later terms become increasingly small, thus e^x ≈ 1 + x. As we have assumed δ to be small, let x = −δ, which yields

(1 + x) = (1 − δ) ≈ e^(−δ) ≈ e^(−1/R*_D)    (13)
Now inserting (1 − δ)/δ = R*_D and (1 − δ)^(R_D) = e^(−R_D/R*_D) into Equation 11, we get our final equation representing the effective data:

D′ = U + U·R*_D (1 − e^(−R_D/R*_D))    (14)
where U and R_D are given, while R*_D is a learned constant. If no repeats are done, the second term is zero and the expression simplifies to the single-epoch scaling law from Equation 8. While R_D ≪ R*_D, the second term is approximately U·R_D, and for R_D ≫ R*_D it plateaus at U·R*_D. Hence R*_D corresponds to the number of times we can repeat tokens before seeing sharply diminishing returns.

Let us consider a concrete example to show that Equation 14 is a very good approximation of Equation 11 and to make the equations more intuitive. Suppose repeated data retains 75% of its value (δ = 0.25) and we train on a single token or data unit (U = 1) for five epochs, i.e. we repeat it four times (R_D = 4). In that case, Equation 11 yields D′ = U + (1 − δ)U (1 − (1 − δ)^(R_D)) / δ = 1 + (0.75/0.25)(1 − 0.75⁴) ≈ 3.05. Thus, despite training for 5 total units (4 of which are repetitions), we only get the value equivalent to 3.05 units. As we have defined R*_D = (1 − δ)/δ, the corresponding R*_D value is 3. Setting R*_D = 3 in Equation 14 yields D′ = U + U·R*_D (1 − e^(−R_D/R*_D)) = 1 + 3(1 − e^(−4/3)) ≈ 3.21. Due to our approximations, the results are not identical: 3.21 is slightly higher than 3.05. However, note that the data term is additionally raised to the power β = 0.353 (see Equation 8; Appendix C), so the actual difference, calculated as (3.21^0.353)/(3.05^0.353) − 1, is a mere 1.8% despite this relatively large δ of 0.25. Equation 14 has the benefit that we can interpret R*_D as the number of repetitions beyond which repeating yields sharply diminishing returns and flattens out soon after. Consider R_D = 100; then D′ = 1 + 3(1 − e^(−100/3)) ≈ 3.99. No matter how many repeats are done, the effective data will never exceed 4, i.e. it plateaus at U + R*_D·U as R_D tends to infinity.
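The worked example above can be checked numerically; a minimal sketch comparing the exact geometric-series value (Equation 11) against the exponential approximation (Equation 14), with function names of our choosing:

```python
import math

def exact_effective(U, delta, R_D):
    # Equation 11: geometric-series value of U unique tokens repeated R_D times.
    return U + (1 - delta) * U * (1 - (1 - delta) ** R_D) / delta

def approx_effective(U, R_D, r_star):
    # Equation 14: exponential approximation with R*_D = (1 - delta) / delta.
    return U + U * r_star * (1 - math.exp(-R_D / r_star))

delta = 0.25                                   # repeated data keeps 75% of its value
exact = exact_effective(1.0, delta, 4)         # five epochs = four repetitions
approx = approx_effective(1.0, 4, (1 - delta) / delta)
beta = 0.353
gap = (approx ** beta) / (exact ** beta) - 1   # difference after the beta power
print(exact, approx, gap)
```

This reproduces the numbers in the text: roughly 3.05 versus 3.21 effective data units, shrinking to about a 1.8% difference once raised to the power β.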
Similarly, we consider repeating parameters. Symmetric to seeing the same data, excess parameters learn the same features and, in the extreme, add no value. For the Chinchilla equation (Equation 8), increasing parameters from 1 billion to 10 billion yields the same absolute decrease in loss regardless of whether the data set is a single token or 1 billion tokens. However, intuition and our data (§5.3) suggest that in the first case, adding parameters should not decrease loss at all, as the additional 9 billion parameters cannot possibly learn anything from the single token that the first 1 billion parameters have not already learned. Thus, to allow excess parameters to decay to adding nothing, we also replace N with a symmetric version of Equation 14, yielding our final equation:
$$L(U_N, U_D, R_N, R_D) = \frac{A}{\left(U_N + U_N R_N^* (1 - e^{-R_N/R_N^*})\right)^{\alpha}} + \frac{B}{\left(U_D + U_D R_D^* (1 - e^{-R_D/R_D^*})\right)^{\beta}} + E \quad (15)$$

We define $U_N$ as the number of "unique" parameters that provide an optimal fit for $U_D$. Additional parameters decay with a symmetric version of the expression for repeated data. $R_N$ is the number of times the "unique" parameters are repeated, i.e. $R_N = \max\{(N/U_N) - 1, 0\}$.
If $R_N^* = \infty$, additional parameters do not decay at all and $U_N + U_N R_N^* (1 - e^{-R_N/R_N^*})$ reduces to $N$. We compute $U_N$ from $U_D$ by setting $D_{opt} = U_D$ and rearranging Equation 3 to map from $D_{opt}$ to $N_{opt}$. $U_N$ is then $\min\{N_{opt}, N\}$. This is equivalent to the following:

$$U_N = \min\left\{(U_D\,G)^{\beta/\alpha}\,G,\ N\right\}, \quad \text{where } G = \left(\frac{\alpha A}{\beta B}\right)^{\frac{1}{\alpha+\beta}} \quad (16)$$
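With the C4 coefficients from Appendix C ($\alpha = \beta \approx 0.3527$, $A = 521$, $B = 1488$), the mapping from unique data to "unique" parameters can be sketched as follows; the helper name is ours:

```python
def unique_params(U_D, N, alpha=0.3527, beta=0.3527, A=521.0, B=1488.0):
    """U_N = min{(U_D * G)^(beta/alpha) * G, N},
    with G = (alpha * A / (beta * B))^(1 / (alpha + beta))."""
    G = (alpha * A / (beta * B)) ** (1.0 / (alpha + beta))
    return min((U_D * G) ** (beta / alpha) * G, N)

# With alpha = beta this simplifies to U_N = min{G^2 * U_D, N} ~ min{0.051 * U_D, N}.
U_N = unique_params(1e9, 1e12)  # ~5.1e7 "unique" parameters for 1e9 unique tokens
```

Note that for $\alpha = \beta$ the expression collapses to $U_N = \min\{G^2 U_D, N\}$, which is where the constant $0.051$ in Equation 18 comes from.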
Equation 15 is a generalization of Equation 8: it provides the same estimates for optimal model and data size in the single-epoch case, but allows for decay in the value of parameters and tokens, thus generalizing to training for multiple epochs and with excess parameters. It can thus be used as a direct replacement for Equation 8. If $R_N^*$ and $R_D^*$ are unknown, one can simply set them to infinity by default, which makes Equation 15 completely equivalent to Equation 8. To learn the parameters $R_N^*$ and $R_D^*$, we largely follow the approach of Hoffmann et al. (2022). We fix $a$, $b$, $e$, $\alpha$, $\beta$ to the values learned on C4 in Appendix C and minimize:
$$\min_{R_N^*, R_D^*} \sum_{\text{Run } i} \text{Huber}_{\delta}\Big(\text{LSE}\big(a - \alpha \log\big(U_N^i + U_N^i R_N^* (1 - e^{-R_N^i/R_N^*})\big),\ b - \beta \log\big(U_D^i + U_D^i R_D^* (1 - e^{-R_D^i/R_D^*})\big),\ e\big) - \log L^i\Big) \quad (17)$$
Table 1: Comparison of different versions of our parametric fit. All versions are fitted on the same 182 samples. We report the fitting loss and the R² (coefficient of determination) of the predicted loss compared to the actual loss. No decay corresponds to assuming Chinchilla holds for repeated data without modification. For Equation 11, we use the same equation for D and N, renaming the δ to $R_D^*$ and $R_N^*$.

| Parametric Fit | $R_D^*$ | $R_N^*$ | Loss (↓) | R² (↑) |
|---|---|---|---|---|
| No decay | - | - | - | 0.1430 |
| Equation 15 but only decay N | - | 713.0015 | 0.0241 | 0.1671 |
| Equation 15 but only decay D | 2.9157 | - | 0.0169 | 0.7395 |
| Equation 15 | 15.3878 | 5.3097 | 0.0158 | 0.7810 |
| Equation 11 for both N and D | 0.0104 | 0.3676 | 0.0155 | 0.8062 |
| Equation 19 for both N and D | 0.0105 | 0.3676 | 0.0155 | 0.8061 |

We use the L-BFGS algorithm to find local minima of the objective above, started on a grid of initializations given by $R_N^* \in \{0, 4, \ldots, 20\}$ and $R_D^* \in \{0, 4, \ldots, 20\}$. We fit on 182 samples with parameters varying from 7 million up to 9 billion and epochs ranging from 1 to 500. We removed the outliers referenced in §5.3 from our fitting, as our formulas do not allow for excess parameters or excess epochs to negatively impact performance. We assume excess parameters or epochs only cause performance to plateau but never to worsen. However, it is difficult to identify all samples where excess parameters or epochs hurt: for some data budgets we only train a single model, so we do not know whether the loss of that model is already in the range where it starts to increase again. Further, there are samples where loss initially increases and then decreases as a function of epochs (double descent, see §5.1), which further contributes to noise in the fitting. Nevertheless, we are able to get a fairly stable fit resulting in $R_N^* = 5.309743$ and $R_D^* = 15.387756$. Since $R_D^* > R_N^*$, excess parameters decay faster. Hence, the data-constrained efficient frontiers in Figures 1 and 3 suggest scaling compute allocated to epochs faster than compute allocated to parameters. This value of $R_D^*$ yields $\delta \approx 6 \times 10^{-2}$ ($0.19$ for $R_N^*$), which respects the assumption that $\delta$ is small. Inserting these learned parameters and the parameters from Appendix C, and simplifying Equation 16, yields the precise formulation we use to predict loss ($L$) given unique tokens ($U_D$), parameter repetitions ($R_N$) and data repetitions ($R_D$):
$$L(U_D, R_N, R_D) = \frac{521}{\left(U_N + 5.3\,U_N(1 - e^{-R_N/5.3})\right)^{0.35}} + \frac{1488}{\left(U_D + 15.4\,U_D(1 - e^{-R_D/15.4})\right)^{0.35}} + 1.87, \quad \text{where } U_N = U_D \cdot 0.051 \quad (18)$$

We experiment with different versions of our formula and display the learned values in Table 1. No decay, or decaying only D or N in Equation 15, leads to worse loss and R² than Equation 15. Thus, it is important to decay both the value of excess parameters and the value of data repetitions. We also consider an explicit exponential decay where $D' = \sum_{k=0}^{R_D} U e^{-k/R_D^*}$; hence from Equation 10 it follows:
$$D' = U\,\frac{1 - \left(e^{-1/R_D^*}\right)^{R_D+1}}{1 - e^{-1/R_D^*}} \quad (19)$$
This explicit decay, Equation 11, and Equation 15 all yield similar results with R² around 0.80. Equation 15 fits the data slightly worse than Equation 11, likely due to our approximations. Nevertheless, we use Equation 15 throughout as it has fewer terms, and we find it easier to interpret.

Figure 19: A cartoon of how the compute-optimal tradeoff deviates from Chinchilla as we increase the number of epochs. Initially the model size and tokens processed grow proportionally ($R_N = R_D$), but since $R_N^* < R_D^*$, at some point adding parameters offers worse returns than increasing the number of tokens processed, and hence we deviate from the Chinchilla curve. (Annotations in the figure: at the deviation point a parameter is worth $1 - \delta$ as much as a token for the loss; compute cost is the same across the dashed blue lines; in the proportional regime, a multiplicative factor in parameters and tokens is worth the same for the loss.)
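As a sanity check, the fitting objective of Equation 17 can be written out directly. The sketch below (ours) evaluates it on losses generated from the fitted constants of Equation 18, so the reported optimum ($R_N^* \approx 5.3$, $R_D^* \approx 15.4$) should score at least as well as perturbed values; the run data are synthetic, not the paper's 182 samples, and the actual fit used L-BFGS rather than direct evaluation:

```python
import math

A, B, E, alpha, beta = 521.0, 1488.0, 1.87, 0.353, 0.353

def decayed(U, R, R_star):
    """Effective unique quantity after R repetitions (Equation 14 form)."""
    return U + U * R_star * (1 - math.exp(-R / R_star))

def predicted_loss(U_N, U_D, R_N, R_D, rn_star, rd_star):
    return (A / decayed(U_N, R_N, rn_star) ** alpha
            + B / decayed(U_D, R_D, rd_star) ** beta + E)

# Synthetic "runs" (U_N, U_D, R_N, R_D) with losses taken from the fitted constants.
runs = [(0.051 * U_D, U_D, R, R) for U_D in (1e9, 1e10) for R in (0, 3, 10, 40)]
observed = [predicted_loss(*r, 5.3, 15.4) for r in runs]

def objective(rn_star, rd_star, delta=1e-3):
    """Sum of Huber losses on log-space residuals, as in Equation 17."""
    total = 0.0
    for (U_N, U_D, R_N, R_D), L in zip(runs, observed):
        resid = math.log(predicted_loss(U_N, U_D, R_N, R_D, rn_star, rd_star)) - math.log(L)
        a = abs(resid)
        total += 0.5 * a * a if a <= delta else delta * (a - 0.5 * delta)
    return total
```

Because the synthetic losses come from the fitted constants, the objective is exactly zero at $(5.3, 15.4)$ and strictly positive at perturbed values.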
B.1 Analytical properties of compute-optimal point
In our case, consider the setting of a fixed compute budget $C$ and a fixed budget of unique tokens $U_D$, implying a corresponding number of unique parameters $U_N$. Let $R_D$ denote the number of times we repeat data (we assume that we are in the multi-epoch regime and hence $R_D > 0$).

Write $U_D = c \cdot U_N$ (for Chinchilla, $c \approx 20$). When $R_D \ll R_D^*$ and $R_N \ll R_N^*$, our scaling agrees with Chinchilla, and so the point $(U_N, U_D)$, corresponding to $R_D = R_N = 0$, is on the optimal compute curve. Increasing $R_D$ by $\epsilon$ corresponds to increasing the number of tokens by $\epsilon U_D = \epsilon c U_N$, while increasing $R_N$ by $\epsilon$ corresponds to increasing the number of parameters by $\epsilon U_N$. For small positive $R_D, R_N$, our curve agrees with Chinchilla, and so we need to increase $R_N$ and $R_D$ by the same amount to maintain proportionality. Hence, up to some value $r > 0$, the optimal compute curve corresponds to $R_N = R_D = r$. Our curve differs from Chinchilla when $r$ gets close to either $R_N^*$ or $R_D^*$; at that point, we start to see sharply diminishing returns.

In our setting, $R_D^* > R_N^*$, which means that we reach the point $r \approx R_N^*$ first. At this point, each added parameter is worth less (specifically, worth $e^{-r/R_N^*}$) than an added data point, despite having equal computational cost. Hence processing more tokens is more effective than increasing the number of parameters, and we expect the optimal compute curve to break away from proportionality. This is indeed what we see.
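This crossover can be made concrete. In the sketch below (ours, illustrative only), the marginal value of one more repetition at $R_N = R_D = r$ is $e^{-r/R^*}$, and we locate the point where a repeated parameter is worth half a repeated token; the halving threshold is our choice, not the paper's:

```python
import math

R_N_STAR, R_D_STAR = 5.31, 15.39  # learned decay constants from the fit

def marginal(r, r_star):
    """d/dR [R* (1 - exp(-R/R*))] at R = r, i.e. exp(-r/r*)."""
    return math.exp(-r / r_star)

# Find the first r where a marginal parameter repetition is worth less than
# half of a marginal token repetition (illustrative threshold).
r = 0.0
while marginal(r, R_N_STAR) > 0.5 * marginal(r, R_D_STAR):
    r += 0.01
```

Well before $r$ reaches $R_D^*$, parameters have already lost half their relative marginal value, which is why the compute-optimal frontier bends toward more epochs rather than more parameters.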
Appendix C. C4 Scaling Coefficients
While Hoffmann et al. (2022) have shown that the equal scaling of model parameters and training tokens holds across different training data sets, the precise ratios vary considerably across data sets and approaches. For example, given the Gopher (Rae et al., 2021) compute budget of $5.76 \times 10^{23}$ FLOPs, their parametric loss function fitted on MassiveWeb predicts an optimal allocation of 40 billion parameters. Meanwhile, if the training data set is C4 (Raffel et al., 2020), their IsoFLOP approach predicts 73 billion parameters to be optimal, almost twice as much. However, for C4, which is our training data set, they do not provide the coefficients necessary to compute loss with their parametric loss function. Based on their IsoFLOP training runs on C4, they only provide the information that, for C4, compute ($C$) allocated to data ($D$) and parameters ($N$) should be scaled exactly equally for optimality, i.e. $a = b = 0.5$ in the relationships $N_{opt} \propto C^a$ and $D_{opt} \propto C^b$. This corresponds to $\alpha = \beta$ in the parametric loss function (Equation 2). Thus, we use this information together with the methodology and C4 data points from Hoffmann et al. (2022) to fit the parametric loss function. We tie the parameters $\alpha$ and $\beta$ to be equal and optimize
$$\min_{a, b, e, \alpha, \beta} \sum_{\text{Run } i} \text{Huber}_{\delta}\Big(\text{LSE}\big(a - \alpha \log N_i,\ b - \beta \log D_i,\ e\big) - \log L_i\Big) \quad (20)$$
where LSE is the log-sum-exp operator, $N_i$, $D_i$ and $L_i$ are the model size, data set size and loss of the $i$th run, and $\delta = 10^{-3}$. We fit on 54 samples on a grid of initializations given by $\alpha \in \{0, 0.5, \ldots, 2\}$, $\beta \in \{0, 0.5, \ldots, 2\}$, $e \in \{-1, -0.5, \ldots, 1\}$, $a \in \{0, 5, \ldots, 25\}$, and $b \in \{0, 5, \ldots, 25\}$. Our fit results in $a = 6.255414$, $b = 7.3049974$, $e = 0.6254804$, $\alpha = \beta = 0.3526596$. Exponentiating $a$, $b$ and $e$ to get $A$, $B$ and $E$ and inserting all learned coefficients into Equation 2 then allows us to compute loss ($L$) as a function of parameters and data:
$$L(N, D) = 1.87 + \frac{521}{N^{0.353}} + \frac{1488}{D^{0.353}} \quad (21)$$
To verify the accuracy of our fit, we benchmark its predictions against those of the IsoFLOP C4 curves in Hoffmann et al. (2022). Following Hoffmann et al. (2022), we can compute the optimal number of parameters $N_{opt}$ and tokens $D_{opt}$ for our fit using:
$$N_{opt}(C) = G\left(\frac{C}{6}\right)^{a}, \quad D_{opt}(C) = G^{-1}\left(\frac{C}{6}\right)^{b}, \quad \text{where } G = \left(\frac{\alpha A}{\beta B}\right)^{\frac{1}{\alpha+\beta}},\ a = \frac{\beta}{\alpha+\beta},\ b = \frac{\alpha}{\alpha+\beta}$$
Given the Gopher compute budget of $C = 5.76 \times 10^{23}$, our fitted parameters predict an optimal allocation of $N_{opt} = 70.0$ billion parameters and $D_{opt} = 1.37$ trillion tokens. This is very close to the 73 billion parameters and 1.3 trillion tokens predicted by the IsoFLOP curves on C4 from Hoffmann et al. (2022), and thus we consider it a good fit. We use these fitted parameters rather than the MassiveWeb parameters for all computations involving Chinchilla scaling laws.
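This check can be reproduced in a few lines from the fitted coefficients alone (a sketch, using the values reported above):

```python
# Optimal allocation for the Gopher budget from the fitted C4 coefficients.
alpha = beta = 0.3526596
A, B = 521.0, 1488.0

G = (alpha * A / (beta * B)) ** (1.0 / (alpha + beta))
a = beta / (alpha + beta)   # = 0.5 since alpha = beta
b = alpha / (alpha + beta)  # = 0.5

C = 5.76e23  # Gopher FLOP budget; C ~ 6 * N * D
N_opt = G * (C / 6) ** a        # ~7.0e10 parameters
D_opt = (1 / G) * (C / 6) ** b  # ~1.37e12 tokens
```

Since $\alpha = \beta$ forces $a = b = 0.5$, parameters and tokens both scale as the square root of compute, matching the IsoFLOP finding for C4.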
| FLOP budget | Parameters | Evaluation Interval | Evaluation Tokens |
|---|---|---|---|
| 9.3 × 10^20 | 2.8B | 100 | 105 million |
| 2.1 × 10^21 | 4.2B | 1000 | 105 million |
| 9.3 × 10^21 | 8.7B | 1000 | 2.1 million |

Table 2: Setup for computing validation loss during training. At every evaluation interval, loss is computed on the listed number of evaluation tokens from the validation set. The evaluation tokens vary with the interval, i.e. the evaluation tokens at 100 steps are not the same as at 200 steps. However, the tokens do not vary across data budgets for the same FLOP budget (Figure 9). For example, $N = 2.8$ billion parameter models with $D_C = 55$ billion tokens are evaluated on the same data as models with $D_C = 28$ billion tokens at each evaluation interval.
Appendix D. Evaluation Details
D.1 Loss Evaluation
For all models trained on C4, the final test loss is computed on the same 210 million tokens from the C4 validation set after training. For held-out evaluation during training, such as in Figure 9, the configurations are displayed in Table 2. The small number of evaluation tokens for the 8.7 billion parameter models likely contributes to their loss spikes seen in Figure 9. Thus, we smooth the validation loss curves of the 8.7 billion parameter models with exponential moving average smoothing and a weight of 0.85. For training on OSCAR, the configurations are the same; however, the validation split used is a held-out part of the OSCAR training split, as there is no official validation split for OSCAR. All training loss curves for C4 and OSCAR models are smoothed with exponential moving average smoothing and a weight of 0.999.
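The smoothing used for the curves can be sketched as a standard exponential moving average; the exact initialization the paper used is not specified, so this TensorBoard-style variant (seeding with the first value) is an assumption:

```python
def ema_smooth(values, weight=0.85):
    """Exponential moving average:
    smoothed[t] = weight * smoothed[t-1] + (1 - weight) * values[t].

    weight=0.85 corresponds to the 8.7B validation curves,
    weight=0.999 to the training loss curves."""
    smoothed, last = [], values[0]
    for v in values:
        last = weight * last + (1 - weight) * v
        smoothed.append(last)
    return smoothed
```

A higher weight averages over a longer history, which is why the training curves (weight 0.999) use far heavier smoothing than the spiky 8.7B validation curves (weight 0.85).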
D.2 Downstream Evaluation
We provide statistics of all downstream evaluation data sets in Table 3. We use the evaluation-harness frameworks from BigScience and EleutherAI (Gao et al., 2021) to evaluate models on 19 evaluation data sets. For each data set, a maximum of 3000 samples are evaluated with 0, 1, 2, 3, 4 and 5 few-shots (Brown et al., 2020) to produce six scores, which are then averaged. We normalize scores to range from the random baseline of each task to 1 and report them as percentages. For example, if random guessing produces 50% accuracy and the maximum accuracy possible is 100%, then a raw accuracy of 55% is normalized to 10%, and a raw accuracy of 45% is normalized to -10%, since it is worse than random. This is done to give all tasks the same weight; otherwise, average performance would heavily depend on generative tasks, where the random baselines are 0. Prompts are sourced from GPT-3 (Brown et al., 2020) and PromptSource (Bach et al., 2022) and detailed in Appendix J. We note that our evaluation is by no means comprehensive and a larger benchmarking akin to Srivastava et al. (2022) would be helpful. However, by training five seeds for most models benchmarked, always averaging 0-5 few-shots, and ensuring maximum data overlap for repeated data (§4), we significantly reduce uncertainty.
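The normalization described above is a simple rescaling; a minimal sketch (the function name is ours):

```python
def normalize(raw_acc, baseline, maximum=100.0):
    """Rescale a raw percentage score so that the random baseline maps to 0%
    and the maximum score maps to 100%. Scores below baseline go negative."""
    return 100.0 * (raw_acc - baseline) / (maximum - baseline)

# With a 50% random baseline: 55% raw -> 10%, 45% raw -> -10%.
```

For generative tasks the baseline is 0, so raw and normalized scores coincide there.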
| Data set | Split(s) | Samples | Baseline | URL |
|---|---|---|---|---|
| ANLI (Nie et al., 2020) | dev_r1,2,3 | 3000 | 33.3 | hf.co/datasets/anli |
| ARC-Easy (Clark et al., 2018) | test | 1172 | 25.0 | hf.co/datasets/ai2_arc |
| ARC-Challenge (Clark et al., 2018) | test | 2376 | 25.0 | hf.co/datasets/ai2_arc |
| BoolQ (Clark et al., 2019) | validation | 3270 | 50.0 | hf.co/datasets/boolq |
| CB (De Marneffe et al., 2019) | validation | 56 | 33.3 | hf.co/datasets/super_glue |
| COPA (Roemmele et al., 2011) | validation | 100 | 50.0 | hf.co/datasets/super_glue |
| HellaSwag (Zellers et al., 2019) | test | 10003 | 25.0 | hf.co/datasets/hellaswag |
| PiQA (Bisk et al., 2020) | validation | 1838 | 50.0 | hf.co/datasets/piqa |
| RTE (Dagan et al., 2006; Wang et al., 2019) | validation | 277 | 50.0 | hf.co/datasets/super_glue |
| SciQ (Welbl et al., 2017) | test | 1000 | 25.0 | hf.co/datasets/sciq |
| StoryCloze 2016 (Mostafazadeh et al., 2017) | test | 1871 | 25.0 | hf.co/datasets/story_cloze |
| WinoGrande XL (Sakaguchi et al., 2021) | test | 1267 | 50.0 | hf.co/datasets/winogrande |
| E2E NLG (Dušek et al., 2020) | test | 4693 | 0.0 | hf.co/datasets/e2e_nlg_cleaned |
| XSUM (Narayan et al., 2018; Gehrmann et al., 2021) | test | 11334 | 0.0 | hf.co/datasets/GEM/xsum |
| WebNLG EN (Castro Ferreira et al., 2020; Gehrmann et al., 2021) | test | 5150 | 0.0 | hf.co/datasets/GEM/web_nlg |
| WikiLingua EN (Ladhak et al., 2020; Gehrmann et al., 2021) | sampled_test | 3000 | 0.0 | hf.co/datasets/GEM/wiki_lingua |
| bAbI (Weston et al., 2015) | test | 19000 | 0.0 | hf.co/datasets/Muennighoff/babi |

Table 3: Downstream evaluation data sets. We evaluate on 19 data sets: the first 14 are evaluated using accuracy (ANLI counted as three), the next 4 using ROUGE-2 F-measure (Lin, 2004), and bAbI using exact match.
Appendix E. Downstream Repetition Results
In Tables 4-9 we report downstream results of all models trained on C4 (Raffel et al., 2020) and OSCAR (Ortiz Suárez et al., 2020) according to the configurations in Figure 9. All scores are from the final checkpoints at the end of training. OSCAR is a noisier data set than C4 due to less filtering; thus models trained on C4 generally perform better. Notably, models trained on C4 completely fail on bAbI (Weston et al., 2015), while OSCAR models are able to perform better than random. This is likely due to code data being present in OSCAR, which enables state-tracking capabilities like those of the code-augmented models in §8. For C4, the creators strictly removed all data that resembles code (Raffel et al., 2020). There are no significant differences between models trained for a single epoch and models trained for up to 4 epochs. Even models trained for more epochs (and thus on less unique data) have similar performance.
| Data Budget | 55B | 28B | 18B | 14B | 11B | 9B | 4B | 1.25B |
|---|---|---|---|---|---|---|---|---|
| Epochs | 1 | 2 | 3 | 4 | 5 | 7 | 14 | 44 |
| ANLI R1 | 0.4±1.6 | 0.7±0.8 | 0.3±0.5 | -0.3±1.8 | 0.4±1.8 | 0.4±0.7 | 0.0±0.9 | -0.6±0.6 |
| ANLI R2 | 0.9±0.4 | 1.4±0.8 | 0.8±0.8 | 1.1±0.7 | 0.5±0.7 | 0.6±1.0 | 1.1±1.1 | 2.7±1.6 |
| ANLI R3 | 1.7±0.5 | 1.2±0.4 | 0.4±0.5 | 1.9±0.7 | 0.6±1.0 | 0.8±0.8 | 1.7±0.7 | 0.7±1.7 |
| ARC-Challenge | 1.6±1.0 | 0.9±0.5 | 1.2±0.6 | 1.1±0.6 | 1.1±1.2 | 1.3±0.5 | 0.3±0.6 | -2.9±1.0 |
| ARC-Easy | 44.5±0.5 | 44.9±0.4 | 44.7±0.7 | 44.3±0.4 | 44.0±0.5 | 44.2±0.9 | 41.4±0.2 | 28.9±0.7 |
| BoolQ | 18.8±3.4 | 16.2±5.2 | 16.1±2.7 | 19.7±1.8 | 15.0±3.8 | 16.9±3.2 | 13.1±4.9 | -2.1±4.7 |
| CB | 20.0±4.7 | 17.4±6.4 | 14.6±5.1 | 17.5±4.2 | 12.3±12.2 | 14.4±7.5 | 21.6±8.4 | 21.3±5.6 |
| COPA | 49.7±3.5 | 50.3±3.4 | 49.9±2.3 | 50.1±2.5 | 50.9±1.2 | 48.1±2.4 | 43.5±3.1 | 33.3±1.9 |
| HellaSwag | 24.7±0.3 | 24.6±0.2 | 24.3±0.1 | 24.3±0.0 | 24.3±0.3 | 24.1±0.1 | 22.8±0.2 | 16.7±0.4 |
| PiQA | 47.9±0.6 | 47.6±0.8 | 47.3±0.3 | 47.6±0.6 | 47.6±0.7 | 47.0±0.2 | 45.6±0.5 | 37.0±0.4 |
| RTE | 5.1±4.0 | 2.5±4.5 | 8.4±2.6 | 6.0±2.5 | 5.1±1.6 | 2.3±3.9 | 7.8±2.5 | 2.6±4.3 |
| SciQ | 83.2±0.6 | 82.5±0.6 | 82.7±1.1 | 81.9±0.6 | 81.9±0.8 | 81.6±0.9 | 78.5±1.1 | 59.3±1.6 |
| StoryCloze 2016 | 58.7±0.2 | 58.7±0.5 | 58.5±0.3 | 58.3±0.3 | 58.5±0.6 | 58.4±0.3 | 56.7±0.5 | 52.0±0.6 |
| WinoGrande XL | 11.6±0.8 | 10.8±1.1 | 10.9±1.3 | 10.6±0.5 | 11.1±0.9 | 10.6±0.9 | 6.4±1.3 | 2.9±1.3 |
| E2E NLG | 17.0±1.4 | 17.7±0.5 | 17.0±1.2 | 16.9±1.1 | 15.1±2.3 | 13.3±2.2 | 14.9±0.9 | 9.8±0.9 |
| XSUM | 2.4±0.1 | 2.4±0.1 | 2.5±0.1 | 2.3±0.2 | 2.4±0.1 | 2.4±0.1 | 2.1±0.1 | 1.6±0.1 |
| WebNLG EN | 5.3±0.1 | 5.5±0.2 | 5.4±0.1 | 5.4±0.1 | 5.1±0.1 | 5.4±0.2 | 5.1±0.3 | 2.9±0.2 |
| WikiLingua EN | 3.0±0.1 | 3.1±0.1 | 2.9±0.1 | 2.9±0.3 | 2.9±0.2 | 2.9±0.1 | 2.6±0.1 | 2.0±0.2 |
| bAbI | 0.0±0.0 | 0.0±0.0 | 0.0±0.0 | 0.0±0.0 | 0.0±0.0 | 0.0±0.0 | 0.0±0.0 | 0.0±0.0 |
| Average | 20.9±0.4 | 20.4±0.3 | 20.4±0.2 | 20.6±0.2 | 19.9±0.9 | 19.7±0.2 | 19.2±0.5 | 14.1±0.4 |
Table 4: Results for 2.8B parameter models trained on repeated data on C4 for 55B total tokens. Scores are normalized averages of 0-5 few-shots and reported as percentages. We report mean/std. err. across five different models, each trained with a different random seed.
| Data Budget | 55B | 28B | 18B | 14B | 11B | 9B | 4B | 1.25B |
|---|---|---|---|---|---|---|---|---|
| Epochs | 1 | 2 | 3 | 4 | 5 | 7 | 14 | 44 |
| ANLI R1 | -0.3±0.5 | -0.6±1.3 | 0.2±0.6 | 0.3±1.1 | 0.2±1.2 | -0.1±1.1 | -0.1±0.5 | -0.7±1.3 |
| ANLI R2 | 1.0±1.0 | 1.1±0.3 | 1.7±0.8 | 2.3±0.8 | 1.4±1.0 | 0.8±0.7 | 1.0±0.7 | 2.3±0.7 |
| ANLI R3 | 0.4±0.8 | 0.5±0.5 | -0.2±0.8 | -0.1±1.0 | 1.1±0.6 | 0.7±0.4 | -0.2±0.9 | 0.5±1.2 |
| ARC-Challenge | -1.4±0.8 | -0.6±0.8 | -1.7±0.1 | -1.6±0.7 | -1.6±0.6 | -1.4±0.5 | -1.9±0.8 | -5.0±1.1 |
| ARC-Easy | 39.7±0.3 | 39.6±0.8 | 39.5±0.6 | 39.3±0.5 | 38.7±0.6 | 38.7±0.4 | 36.9±0.4 | 25.4±0.7 |
| BoolQ | 12.8±4.4 | 7.8±3.8 | 7.9±3.8 | 3.3±5.4 | 0.2±3.0 | 2.3±6.1 | -2.1±2.4 | 7.4±6.1 |
| CB | 19.7±5.1 | 15.4±7.3 | 13.2±5.1 | 12.6±2.6 | 21.7±3.6 | 15.4±3.7 | 16.2±5.2 | 9.7±5.7 |
| COPA | 42.7±2.2 | 39.5±2.2 | 40.9±2.0 | 41.5±2.1 | 38.5±2.4 | 40.4±2.4 | 38.6±2.6 | 28.5±3.1 |
| HellaSwag | 16.3±0.1 | 16.3±0.2 | 16.3±0.2 | 16.1±0.2 | 16.0±0.1 | 15.9±0.2 | 15.0±0.2 | 11.7±0.1 |
| PiQA | 41.2±0.7 | 41.4±0.5 | 40.3±0.4 | 40.6±0.5 | 40.3±0.9 | 39.8±0.6 | 38.8±1.1 | 31.0±0.4 |
| RTE | 3.9±1.1 | 2.1±1.6 | 2.3±3.3 | 1.6±3.0 | 0.5±2.1 | 2.9±2.5 | 0.9±3.4 | -3.2±2.7 |
| SciQ | 83.2±0.6 | 82.4±0.6 | 82.1±0.9 | 82.6±0.7 | 81.5±0.9 | 80.5±0.6 | 76.5±1.3 | 57.7±1.8 |
| StoryCloze 2016 | 52.8±0.3 | 52.9±0.4 | 52.6±0.3 | 53.0±0.4 | 52.3±0.4 | 52.4±0.4 | 51.8±0.7 | 47.9±0.5 |
| WinoGrande XL | 5.8±0.9 | 4.4±1.4 | 4.5±0.3 | 4.2±1.3 | 4.5±0.6 | 4.1±0.7 | 1.7±1.2 | 0.8±1.3 |
| E2E NLG | 20.3±0.3 | 19.9±0.5 | 19.9±0.7 | 20.9±0.9 | 19.7±0.7 | 20.4±0.6 | 19.1±0.8 | 14.2±0.7 |
| XSUM | 3.0±0.1 | 2.9±0.0 | 2.9±0.3 | 2.9±0.2 | 2.9±0.1 | 2.8±0.3 | 2.6±0.2 | 1.8±0.1 |
| WebNLG EN | 8.8±0.4 | 8.3±0.6 | 8.5±0.3 | 8.4±0.6 | 8.1±0.2 | 8.2±0.2 | 7.2±0.3 | 3.3±0.3 |
| WikiLingua EN | 2.9±0.1 | 3.1±0.2 | 3.1±0.1 | 3.0±0.1 | 3.1±0.1 | 3.2±0.3 | 2.7±0.2 | 1.7±0.2 |
| bAbI | 15.5±1.0 | 15.7±1.1 | 15.3±0.8 | 15.1±1.5 | 15.9±1.1 | 16.2±0.9 | 14.3±0.6 | 6.6±0.6 |
| Average | 19.4±0.5 | 18.5±0.2 | 18.4±0.4 | 18.2±0.4 | 18.2±0.4 | 18.1±0.4 | 16.8±0.5 | 12.7±0.7 |
Table 5: Results for 2.8B parameter models trained on repeated data on OSCAR for 55B total tokens. Scores are normalized averages of 0-5 few-shots and reported as percentages. We report mean/std. err. across five different models, each trained with a different random seed.
| Unique Tokens | 84B | 42B | 28B | 21B | 17B | 12B | 6B | 1.9B |
|---|---|---|---|---|---|---|---|---|
| Epochs | 1 | 2 | 3 | 4 | 5 | 7 | 14 | 44 |
| ANLI R1 | -1.0±0.3 | -0.7±1.1 | -0.7±1.0 | -0.4±1.1 | 0.4±0.8 | 0.5±1.1 | 0.1±0.9 | 0.2±0.9 |
| ANLI R2 | 0.8±0.5 | 0.8±0.8 | 0.0±1.4 | 0.5±0.7 | 0.5±0.9 | 0.3±1.0 | 0.7±0.7 | 2.5±1.0 |
| ANLI R3 | 1.1±0.7 | 0.8±0.9 | 0.3±0.8 | 1.4±1.1 | 1.3±0.9 | 2.3±0.2 | 1.3±0.2 | 1.6±1.2 |
| ARC-Challenge | 5.3±0.6 | 5.1±0.9 | 5.2±2.0 | 6.0±0.8 | 4.7±0.8 | 3.1±0.4 | 2.9±1.0 | -1.3±1.0 |
| ARC-Easy | 49.2±0.9 | 50.4±1.2 | 47.4±4.5 | 49.4±0.7 | 48.7±1.5 | 44.9±0.7 | 45.0±1.2 | 31.9±0.9 |
| BoolQ | 18.2±4.0 | 19.6±5.1 | 22.1±1.0 | 20.4±3.6 | 18.4±6.0 | 18.4±3.9 | 18.9±2.6 | -3.3±7.1 |
| CB | 12.0±7.2 | 8.5±9.2 | 7.9±10.4 | 19.6±7.3 | 17.8±7.3 | 15.1±5.8 | 17.5±3.5 | 19.5±6.6 |
| COPA | 59.1±5.4 | 57.7±3.5 | 56.7±2.0 | 55.5±2.4 | 56.8±1.8 | 58.9±1.7 | 48.7±3.3 | 34.9±3.4 |
| HellaSwag | 27.8±4.8 | 30.2±0.5 | 29.8±0.9 | 29.9±0.7 | 28.5±1.1 | 29.0±0.5 | 27.0±1.2 | 19.7±0.5 |
| PiQA | 50.6±0.5 | 50.8±0.5 | 48.6±3.4 | 50.9±0.7 | 50.3±1.3 | 49.5±0.4 | 47.6±1.2 | 39.5±1.3 |
| RTE | 5.6±3.1 | 2.6±3.9 | 7.2±2.7 | 7.0±3.2 | 8.8±5.3 | 9.3±3.6 | 3.0±4.3 | 2.6±4.2 |
| SciQ | 84.6±3.9 | 86.1±1.3 | 84.4±3.7 | 85.9±0.7 | 86.2±0.8 | 79.0±0.7 | 81.1±1.4 | 65.3±1.1 |
| StoryCloze 2016 | 61.1±3.7 | 62.6±0.2 | 61.9±2.2 | 62.6±0.4 | 61.8±0.8 | 61.5±0.8 | 60.1±0.7 | 53.9±0.5 |
| WinoGrande XL | 17.0±2.6 | 17.8±1.4 | 16.5±1.8 | 17.1±1.8 | 14.9±1.5 | 15.9±1.2 | 11.8±1.5 | 3.9±0.8 |
| E2E NLG | 18.2±1.2 | 18.8±0.8 | 17.8±1.5 | 16.0±2.2 | 15.9±2.5 | 13.8±1.3 | 15.7±0.9 | 11.2±1.4 |
| XSUM | 2.9±0.2 | 3.0±0.2 | 2.8±0.3 | 2.9±0.2 | 2.9±0.2 | 1.0±0.4 | 2.4±0.1 | 1.8±0.1 |
| WebNLG EN | 4.8±2.0 | 5.7±0.2 | 5.4±0.3 | 5.6±0.2 | 5.4±0.5 | 5.5±0.1 | 5.4±0.2 | 3.4±0.3 |
| WikiLingua EN | 3.3±0.5 | 3.6±0.1 | 3.4±0.1 | 3.4±0.1 | 3.3±0.1 | 1.4±0.6 | 3.0±0.1 | 2.2±0.1 |
| bAbI | 0.0±0.0 | 0.0±0.0 | 0.0±0.0 | 0.0±0.0 | 0.0±0.0 | 0.0±0.0 | 0.0±0.0 | 0.0±0.0 |
| Average | 22.1±1.7 | 22.3±0.9 | 21.9±1.2 | 22.8±0.5 | 22.5±0.6 | 21.6±0.5 | 20.6±0.6 | 15.2±1.0 |
Table 6: Results for 4.2B parameter models trained on repeated data on C4 for 84B total tokens. Scores are normalized averages of 0-5 few-shots and reported as percentages. We report mean/std. err. across five different models, each trained with a different random seed.
| Unique Tokens | 84B | 42B | 28B | 21B | 17B | 12B | 6B | 1.9B |
|---|---|---|---|---|---|---|---|---|
| Epochs | 1 | 2 | 3 | 4 | 5 | 7 | 14 | 44 |
| ANLI R1 | -0.9±0.5 | -0.8±1.1 | -0.9±1.4 | -0.4±0.4 | -0.1±1.2 | 0.3±1.1 | -0.5±0.8 | 1.1±1.3 |
| ANLI R2 | 0.7±0.9 | 0.7±1.1 | 1.3±1.0 | 1.5±1.1 | 1.7±1.3 | 0.9±0.8 | 0.9±1.0 | 1.7±1.3 |
| ANLI R3 | 0.4±0.6 | 0.6±0.4 | 0.7±0.3 | 0.4±0.8 | 0.7±1.2 | 0.6±1.2 | 0.7±0.5 | 0.8±1.2 |
| ARC-Challenge | 1.3±0.5 | 1.8±0.5 | 1.6±0.7 | 2.4±1.1 | 1.6±0.7 | 2.0±0.7 | 1.6±0.5 | -2.1±0.5 |
| ARC-Easy | 45.5±0.8 | 45.1±1.2 | 44.8±0.9 | 44.8±0.6 | 45.0±1.0 | 43.9±0.7 | 40.7±0.7 | 28.0±0.9 |
| BoolQ | 14.5±1.9 | 15.1±4.6 | 10.8±5.1 | 12.5±1.9 | 6.7±4.0 | 10.1±4.2 | -0.0±6.9 | -4.3±7.2 |
| CB | 21.3±2.3 | 19.2±3.8 | 12.9±6.4 | 16.9±3.4 | 15.1±9.4 | 17.8±3.6 | 15.0±8.1 | 11.2±4.1 |
| COPA | 43.1±3.0 | 42.5±3.7 | 44.4±1.1 | 43.0±3.4 | 41.8±2.3 | 44.6±2.7 | 40.3±3.0 | 34.9±4.9 |
| HellaSwag | 21.1±0.2 | 21.0±0.2 | 20.9±0.1 | 20.7±0.2 | 20.5±0.3 | 20.3±0.1 | 19.3±0.1 | 14.5±0.2 |
| PiQA | 45.3±0.9 | 44.8±0.7 | 44.8±0.9 | 44.4±0.6 | 44.3±0.6 | 43.9±0.5 | 42.2±0.9 | 34.0±0.8 |
| RTE | 4.2±2.8 | 1.5±2.4 | -1.1±3.9 | -2.5±3.9 | 5.3±1.8 | 4.4±1.9 | 1.6±2.2 | -1.0±2.4 |
| SciQ | 86.6±0.7 | 86.5±0.5 | 86.0±0.2 | 86.3±1.0 | 85.4±0.8 | 84.7±0.4 | 82.0±1.4 | 62.9±2.5 |
| StoryCloze 2016 | 56.5±0.6 | 56.8±0.6 | 56.5±0.7 | 55.8±0.3 | 55.9±0.2 | 56.0±0.3 | 54.5±0.7 | 49.3±0.2 |
| WinoGrande XL | 9.7±1.4 | 9.0±1.8 | 9.5±0.7 | 8.9±1.0 | 7.8±1.2 | 7.4±1.4 | 6.8±1.4 | 2.1±1.0 |
| E2E NLG | 21.4±1.3 | 21.9±0.4 | 21.2±1.0 | 21.8±0.6 | 21.0±0.9 | 20.5±0.7 | 20.9±1.0 | 16.0±0.6 |
| XSUM | 3.6±0.2 | 3.5±0.2 | 3.5±0.2 | 3.5±0.2 | 3.5±0.3 | 3.2±0.5 | 3.0±0.2 | 1.9±0.1 |
| WebNLG EN | 9.9±0.4 | 9.7±0.8 | 9.3±0.6 | 9.7±0.5 | 9.3±0.7 | 9.4±0.3 | 8.9±0.5 | 3.8±0.4 |
| WikiLingua EN | 3.9±0.1 | 3.8±0.2 | 3.6±0.3 | 3.7±0.2 | 3.6±0.2 | 3.7±0.1 | 3.3±0.2 | 2.1±0.2 |
| bAbI | 15.0±7.5 | 19.0±1.2 | 18.8±1.4 | 18.5±1.4 | 19.2±0.6 | 18.1±1.4 | 14.5±1.5 | 9.6±1.7 |
| Average | 21.2±0.2 | 21.1±0.4 | 20.4±0.3 | 20.6±0.5 | 20.4±0.5 | 20.6±0.2 | 18.7±0.5 | 14.0±0.6 |
Table 7: Results for 4.2B parameter models trained on repeated data on OSCAR for 84B total tokens. Scores are normalized averages of 0-5 few-shots and reported as percentages. We report mean/std. err. across five different models, each trained with a different random seed.
| Parameters | 8.7B | 8.7B | 8.7B | 8.7B | 8.7B | 8.7B | 8.7B | 8.7B | 6.3B |
|---|---|---|---|---|---|---|---|---|---|
| Unique Tokens | 178B | 88B | 58B | 44B | 35B | 25B | 13B | 4B | 25B |
| Epochs | 1 | 2 | 3 | 4 | 5 | 7 | 14 | 44 | 9.7 |
| ANLI R1 | -0.9 | -1.2 | -4.2 | 0.7 | -1.3 | 0.1 | 1.2 | 2.1 | -0.9 |
| ANLI R2 | -0.4 | -1.2 | -0.2 | 0.2 | -0.4 | -0.1 | 0.4 | 2.2 | 1.0 |
| ANLI R3 | 0.7 | 0.5 | 0.7 | 1.8 | 0.4 | 1.6 | 2.0 | 4.0 | 2.6 |
| ARC-Challenge | 12.2 | 11.9 | 10.5 | 12.2 | 10.6 | 11.8 | 8.3 | 2.2 | 12.7 |
| ARC-Easy | 58.5 | 58.0 | 56.9 | 57.4 | 56.7 | 58.5 | 52.9 | 37.4 | 57.2 |
| BoolQ | 26.1 | 31.8 | 31.3 | 30.3 | 28.8 | 28.5 | 27.9 | 4.1 | 30.6 |
| CB | 7.6 | 12.9 | -15.2 | 17.9 | 14.3 | -22.8 | -12.1 | 17.4 | 6.2 |
| COPA | 68.0 | 64.7 | 62.3 | 66.3 | 63.3 | 70.0 | 57.0 | 45.0 | 66.0 |
| HellaSwag | 37.8 | 37.8 | 37.3 | 37.4 | 37.1 | 37.5 | 36.1 | 27.5 | 38.1 |
| PiQA | 55.9 | 55.6 | 54.7 | 56.5 | 55.8 | 53.9 | 52.4 | 45.7 | 54.3 |
| RTE | 14.1 | 11.4 | 11.0 | 8.7 | 15.9 | -2.6 | -1.8 | -3.2 | 7.7 |
| SciQ | 90.4 | 91.1 | 90.7 | 90.0 | 89.8 | 89.8 | 87.9 | 72.9 | 90.3 |
| StoryCloze 2016 | 68.3 | 67.3 | 67.2 | 67.6 | 67.8 | 66.8 | 66.2 | 58.9 | 68.4 |
| WinoGrande XL | 26.3 | 27.7 | 26.5 | 29.0 | 26.1 | 23.5 | 18.1 | 10.0 | 27.0 |
| E2E NLG | 20.5 | 17.9 | 18.7 | 20.0 | 17.2 | 17.7 | 17.4 | 11.2 | 16.9 |
| XSUM | 3.6 | 3.3 | 3.8 | 3.8 | 3.5 | 3.0 | 3.3 | 2.0 | 3.8 |
| WebNLG EN | 5.3 | 5.8 | 5.9 | 5.6 | 5.8 | 5.2 | 5.7 | 4.9 | 5.3 |
| WikiLingua EN | 4.1 | 4.2 | 4.2 | 4.1 | 4.2 | 4.0 | 3.5 | 2.7 | 4.0 |
| bAbI | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| Average | 26.2 | 26.3 | 24.3 | 26.8 | 26.1 | 23.5 | 22.4 | 18.3 | 25.9 |
Table 8: Results for 8.7B parameter models trained on repeated data on C4 for 178B total tokens and a data-constrained compute-optimal 6.3B model. Scores are normalized averages of 0-5 few-shots and reported as percentages. The two models with 25 billion unique tokens are the ones depicted in Figure 1 (right). The data-constrained compute-optimal variant (6.3 billion parameters) performs better by using fewer parameters and repeating more data.
| Unique Tokens | 178B | 88B | 58B | 44B | 35B | 25B | 13B | 4B |
|---|---|---|---|---|---|---|---|---|
| Epochs | 1 | 2 | 3 | 4 | 5 | 7 | 14 | 44 |
| ANLI R1 | -1.3 | -2.3 | -0.5 | -1.8 | 0.1 | -0.3 | 2.6 | -0.4 |
| ANLI R2 | 0.8 | 3.2 | -0.2 | -1.3 | 1.0 | 0.2 | 1.5 | 0.5 |
| ANLI R3 | 1.1 | 1.2 | 1.3 | 0.9 | 2.8 | -0.4 | 1.1 | -0.1 |
| ARC-Challenge | 6.9 | 6.7 | 6.9 | 3.8 | 6.6 | 4.8 | 4.0 | -0.9 |
| ARC-Easy | 50.2 | 51.6 | 51.2 | 51.0 | 51.9 | 50.8 | 47.0 | 33.0 |
| BoolQ | 18.4 | 11.7 | 19.4 | 22.4 | 17.5 | 20.8 | 7.6 | 4.1 |
| CB | 11.2 | 13.4 | 16.1 | 19.6 | 21.4 | 25.0 | 9.8 | 20.1 |
| COPA | 46.7 | 53.0 | 52.0 | 53.7 | 51.0 | 53.3 | 48.7 | 41.7 |
| HellaSwag | 27.4 | 27.2 | 26.8 | 26.8 | 27.3 | 26.7 | 25.5 | 19.6 |
| PiQA | 49.2 | 49.3 | 50.1 | 48.7 | 48.1 | 47.2 | 45.6 | 37.0 |
| RTE | -0.5 | 1.1 | 0.2 | 1.2 | 10.2 | 3.2 | -3.0 | -7.8 |
| SciQ | 88.1 | 88.0 | 88.4 | 87.9 | 87.9 | 87.4 | 86.3 | 64.6 |
| StoryCloze 2016 | 61.6 | 61.1 | 60.2 | 60.6 | 61.3 | 59.0 | 58.8 | 52.7 |
| WinoGrande XL | 17.6 | 16.3 | 15.4 | 13.7 | 13.9 | 12.8 | 10.8 | -0.6 |
| E2E NLG | 23.3 | 24.2 | 22.2 | 22.9 | 23.1 | 22.1 | 22.9 | 16.8 |
| XSUM | 4.2 | 3.8 | 3.9 | 3.8 | 4.3 | 4.0 | 3.2 | 2.4 |
| WebNLG EN | 9.9 | 10.1 | 10.0 | 10.5 | 9.5 | 9.9 | 10.7 | 5.2 |
| WikiLingua EN | 4.3 | 4.0 | 4.1 | 3.7 | 4.3 | 4.2 | 4.0 | 2.7 |
| bAbI | 20.4 | 20.6 | 21.7 | 21.4 | 21.1 | 21.3 | 19.4 | 10.7 |
| Average | 23.1 | 23.4 | 23.6 | 23.7 | 24.4 | 23.8 | 21.4 | 15.9 |
Table 9: Results for 8.7B parameter models trained on repeated data on OSCAR for 178B total tokens. Scores are normalized averages of 0-5 few-shots and reported as percentages.
Appendix F. Detailed Code Augmentation Results
We report tabular results for replacing part of C4 or OSCAR with code for 4.2 billion parameter and 2.8 billion parameter models in Tables 10-11. We find that training on up to 50% Python data maintains performance on all natural language tasks while enabling huge performance gains on state-tracking (bAbI) for C4. For OSCAR, gains are less clear, which is likely due to OSCAR already containing code (Ortiz Suárez et al., 2020), while code data was explicitly filtered out of C4 (Raffel et al., 2020).
% of Python pre-training data (remainder is C4):

| Data set (↓) | 0 | 10 | 20 | 30 | 40 | 50 | 60 | 70 | 80 | 90 |
|---|---|---|---|---|---|---|---|---|---|---|
| ANLI R1 | -1.0±0.3 | -0.7±0.6 | 0.5±0.6 | -0.2±1.1 | -1.5±0.7 | -0.7±1.1 | -1.1±0.9 | -1.2±1.1 | -1.4±0.7 | -1.4±0.6 |
| ANLI R2 | 0.8±0.5 | 0.4±0.8 | 0.3±1.0 | 0.6±0.5 | 0.5±0.6 | 0.3±0.6 | 0.8±0.7 | 0.4±0.5 | 0.1±1.2 | 1.0±0.3 |
| ANLI R3 | 1.1±0.7 | 0.6±0.5 | 0.5±0.6 | 0.2±0.4 | 0.3±0.5 | 0.2±0.5 | 0.3±0.2 | -0.1±0.3 | -0.0±0.2 | -0.1±0.2 |
| ARC-Challenge | 5.3±0.6 | 6.4±1.0 | 5.2±2.4 | 4.3±1.4 | 5.2±0.8 | 5.2±0.5 | 2.6±0.4 | 1.7±0.5 | -0.4±0.4 | -3.0±0.4 |
| ARC-Easy | 49.2±0.9 | 52.4±1.1 | 49.6±3.7 | 48.1±4.1 | 50.1±1.0 | 49.7±0.3 | 48.0±0.5 | 45.6±0.5 | 43.3±0.4 | 37.7±0.7 |
| BoolQ | 18.2±4.0 | 10.5±12.0 | 16.3±5.2 | 17.8±3.3 | 13.4±3.4 | 14.8±2.1 | 12.5±8.9 | 12.1±6.6 | 7.2±6.7 | 10.7±7.3 |
| CB | 12.0±7.2 | 20.3±2.7 | 14.4±7.1 | 16.5±1.2 | 22.3±3.1 | 22.1±4.8 | 19.4±4.8 | 23.8±3.3 | 23.8±4.1 | 23.4±2.4 |
| COPA | 59.1±5.4 | 56.4±4.9 | 46.7±8.7 | 50.2±3.7 | 52.7±2.5 | 50.1±4.5 | 46.5±1.9 | 43.1±4.2 | 39.2±3.7 | 35.9±4.9 |
| HellaSwag | 27.8±4.8 | 29.4±0.4 | 25.7±4.8 | 27.0±1.7 | 26.3±2.4 | 26.3±0.6 | 25.0±0.1 | 22.6±0.1 | 19.5±0.2 | 14.7±0.1 |
| PiQA | 50.6±0.5 | 50.8±0.6 | 48.6±3.0 | 48.2±2.8 | 48.7±0.7 | 48.4±1.0 | 47.1±0.7 | 45.6±0.3 | 43.4±0.9 | 39.0±0.8 |
| RTE | 5.6±3.1 | 7.3±3.4 | 4.4±4.7 | 6.1±2.6 | 9.1±4.0 | 8.1±5.9 | 7.7±5.3 | 4.0±2.1 | 6.2±2.1 | 4.6±2.5 |
| SciQ | 84.6±3.9 | 87.1±0.2 | 84.6±4.8 | 86.9±1.2 | 86.9±1.2 | 87.9±0.9 | 87.6±0.6 | 87.0±0.2 | 86.0±0.2 | 84.5±0.6 |
| StoryCloze 2016 | 61.1±3.7 | 62.0±0.6 | 59.0±4.8 | 60.8±1.5 | 59.9±1.9 | 60.0±0.7 | 59.0±0.4 | 57.2±0.5 | 54.9±0.4 | 51.0±0.3 |
| WinoGrande XL | 17.0±2.6 | 17.4±2.1 | 14.9±4.4 | 15.2±2.0 | 15.7±1.2 | 14.2±1.0 | 13.5±1.3 | 10.7±1.3 | 9.1±0.6 | 5.3±1.3 |
| E2E NLG | 18.2±1.2 | 21.8±1.6 | 15.9±8.6 | 23.3±0.6 | 21.5±3.8 | 23.9±0.6 | 23.7±0.6 | 23.7±0.5 | 24.3±0.7 | 24.0±0.9 |
| XSUM | 2.9±0.2 | 3.2±0.5 | 3.4±0.3 | 3.3±0.3 | 3.6±0.6 | 3.4±0.2 | 3.5±0.2 | 2.9±0.3 | 2.8±0.4 | 2.7±0.2 |
| WebNLG EN | 4.8±2.0 | 9.5±0.7 | 10.2±1.1 | 10.5±0.7 | 10.4±0.8 | 10.4±0.6 | 9.9±0.4 | 10.0±0.5 | 9.3±0.6 | 9.2±0.2 |
| WikiLingua EN | 3.3±0.5 | 4.0±0.1 | 4.0±0.2 | 4.2±0.1 | 4.3±0.3 | 4.2±0.2 | 4.4±0.3 | 4.1±0.2 | 3.9±0.2 | 3.6±0.3 |
| bAbI | 0.0±0.0 | 12.5±6.7 | 13.8±7.2 | 15.8±8.2 | 17.4±9.2 | 23.2±1.2 | 23.4±2.0 | 24.3±1.4 | 23.2±1.0 | 24.6±1.8 |
| Average | 22.1±1.7 | 23.7±0.7 | 22.0±3.0 | 23.1±1.1 | 23.5±1.0 | 23.8±0.5 | 22.8±1.0 | 22.0±0.6 | 20.8±0.5 | 19.3±0.3 |
| Average (no bAbI) | 23.4±1.8 | 24.4±0.7 | 22.5±2.8 | 23.5±0.8 | 23.9±0.9 | 23.8±0.5 | 22.8±1.0 | 21.8±0.6 | 20.6±0.6 | 19.1±0.4 |
Table 10: Results for code-augmentation for 4.2B parameter models. Models trained on a mix of natural language (C4) and Python (The Stack). Scores are normalized averages of 0-5 few-shots and reported as percentages. We report mean/std. err. across five different models, each trained with a different random seed.
% of Python pre-training data (rest is C4 or OSCAR, respectively); standard errors are only available for the 0% columns:

| Data set (↓) | C4 0 | C4 10 | C4 20 | C4 30 | C4 40 | C4 50 | OSCAR 0 | OSCAR 10 | OSCAR 20 | OSCAR 30 | OSCAR 40 | OSCAR 50 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ANLI R1 | 0.4±1.6 | -1.5 | -0.9 | -1.0 | -0.7 | -2.4 | -0.3±0.5 | 0.0 | -0.6 | -1.6 | -2.4 | -1.7 |
| ANLI R2 | 0.9±0.4 | 0.7 | 0.0 | 0.1 | -0.1 | 0.1 | 1.0±1.0 | 1.2 | -0.1 | -0.0 | 0.0 | 0.8 |
| ANLI R3 | 1.7±0.5 | 0.6 | -0.7 | -0.2 | 0.4 | 0.0 | 0.4±0.8 | -0.4 | -0.2 | -1.7 | -0.8 | -0.5 |
| ARC-Challenge | 1.6±1.0 | 4.2 | 1.7 | 1.5 | 0.2 | -0.2 | -1.4±0.8 | -0.7 | -1.4 | -3.4 | -2.3 | -3.1 |
| ARC-Easy | 44.5±0.5 | 46.4 | 46.5 | 45.4 | 43.6 | 42.7 | 39.7±0.3 | 39.8 | 38.7 | 39.1 | 37.3 | 37.6 |
| BoolQ | 18.8±3.4 | 15.7 | 19.0 | 13.4 | 16.0 | 4.4 | 12.8±4.4 | 3.3 | 12.5 | 10.6 | 5.8 | 8.5 |
| CB | 20.0±4.7 | 22.8 | 10.7 | 20.5 | 17.4 | 15.2 | 19.7±5.1 | 14.7 | 15.6 | 19.6 | 22.8 | 17.0 |
| COPA | 49.7±3.5 | 46.3 | 49.3 | 46.3 | 42.7 | 40.0 | 42.7±2.2 | 42.7 | 41.0 | 42.7 | 35.7 | 38.0 |
| HellaSwag | 24.7±0.3 | 24.1 | 23.3 | 22.3 | 21.9 | 20.9 | 16.3±0.1 | 15.7 | 15.9 | 15.5 | 15.1 | 13.7 |
| PiQA | 47.9±0.6 | 46.9 | 47.7 | 45.1 | 46.2 | 45.5 | 41.2±0.7 | 41.6 | 39.9 | 40.5 | 38.8 | 38.6 |
| RTE | 5.1±4.0 | 8.8 | 7.7 | 5.1 | 7.8 | 10.8 | 3.9±1.1 | 2.2 | 4.3 | 1.1 | 3.7 | -1.7 |
| SciQ | 83.2±0.6 | 83.3 | 85.3 | 84.8 | 83.2 | 83.7 | 83.2±0.6 | 82.4 | 83.5 | 82.8 | 83.3 | 83.2 |
| StoryCloze 2016 | 58.7±0.2 | 59.3 | 57.9 | 56.9 | 56.5 | 56.0 | 52.8±0.3 | 52.0 | 52.2 | 52.0 | 51.8 | 50.9 |
| WinoGrande XL | 11.6±0.8 | 13.0 | 10.7 | 9.3 | 8.2 | 9.6 | 5.8±0.9 | 3.2 | 5.6 | 5.8 | 4.6 | 3.9 |
| E2E NLG | 17.0±1.4 | 19.8 | 21.1 | 20.2 | 22.1 | 21.0 | 20.3±0.3 | 21.9 | 20.7 | 20.5 | 20.7 | 21.1 |
| XSUM | 2.4±0.1 | 2.7 | 2.0 | 2.2 | 2.0 | 2.3 | 3.0±0.1 | 2.8 | 3.1 | 3.4 | 3.1 | 2.9 |
| WebNLG EN | 5.3±0.1 | 9.1 | 8.0 | 8.5 | 8.5 | 9.1 | 8.8±0.4 | 8.7 | 9.6 | 9.1 | 8.7 | 9.4 |
| WikiLingua EN | 3.0±0.1 | 3.2 | 3.2 | 3.6 | 3.3 | 3.7 | 2.9±0.1 | 3.3 | 3.6 | 3.5 | 3.4 | 3.5 |
| bAbI | 0.0±0.0 | 4.6 | 14.2 | 14.2 | 14.8 | 15.1 | 15.5±1.0 | 16.6 | 17.2 | 17.2 | 17.7 | 15.9 |
| Average | 20.9±0.4 | 21.6 | 21.4 | 21.0 | 20.7 | 19.9 | 19.4±0.5 | 18.5 | 19.0 | 18.8 | 18.3 | 17.8 |
| Average (without bAbI) | 22.0±0.5 | 22.5 | 21.8 | 21.3 | 21.1 | 20.1 | 19.6±0.5 | 18.6 | 19.1 | 18.9 | 18.3 | 17.9 |
Table 11: Results for code-augmentation for 2.8B parameter models. Models trained on a mix of natural language (C4 or OSCAR) and Python (The Stack). Scores are normalized averages of 0-5 few-shots and reported as percentages. We report mean/std. err. across five different models, each trained with a different random seed.
Appendix G. Filtering Procedure
G.1 Perplexity filtering
We follow the approach of Laurençon et al. (2022) to perform perplexity filtering and reuse their artifacts: a SentencePiece tokenizer (Kudo and Richardson, 2018) and a KenLM 5-gram language model (Heafield, 2011) trained on Wikipedia introductions and available to download from their repository.4 We compute the model's perplexity on all OSCAR and C4 samples and only select samples that fall within a certain percentile threshold. For example, to select the top 25%, we only select samples with perplexity lower than the 25th percentile. Figure 20 provides a visual representation of the perplexity distributions of the respective data sets, highlighting the relevant percentile thresholds.
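The percentile-threshold selection can be sketched as follows; the function name is ours, and in the real pipeline the perplexities come from the KenLM model rather than being given:

```python
def filter_by_perplexity(samples_with_ppl, percentile=25):
    """Keep samples whose perplexity is at or below the given percentile.

    samples_with_ppl: list of (sample, perplexity) pairs.
    Uses a simple nearest-rank percentile; the exact interpolation
    convention used in the original pipeline is an assumption.
    """
    ppls = sorted(p for _, p in samples_with_ppl)
    cutoff = ppls[max(0, int(len(ppls) * percentile / 100) - 1)]
    return [s for s, p in samples_with_ppl if p <= cutoff]

data = [(f"doc{i}", float(i)) for i in range(1, 101)]
kept = filter_by_perplexity(data, percentile=25)  # keeps the 25 lowest-perplexity docs
```

Lower perplexity under the Wikipedia-trained model is treated as a proxy for cleaner, more natural text.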
G.2 Deduplication
We perform deduplication leveraging the suffix-array-based approach proposed by Lee et al. (2021). We remove any document with at least a 100-character span overlapping with any other document in the corpus. We deduplicate the full C4 data set. In the case of OSCAR, the memory requirements of the procedure make deduplicating the full data set infeasible. Instead, we select a 25% subset of the full OSCAR and build a suffix array for this subset. We experiment with leveraging the 25% OSCAR suffix array in two ways. First, we deduplicate the selected subset itself. This is very strict and preserves less than 5% of the full OSCAR. Second, we use the 25% suffix array to deduplicate the full OSCAR, i.e., we remove any document that has at least a 100-character span overlapping with the 25% subset we selected. This is more permissive and allows us to preserve 31% of the original data set. We refer to the latter as expanded in Table 12; it is used for training the 4.2 billion parameter model in Table 14, while the smaller deduplicated version of OSCAR is used for the 2.8 billion parameter model.
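The span-overlap criterion can be sketched as follows. This is a simplified hash-set approximation for illustration only; the actual procedure uses the suffix-array implementation of Lee et al. (2021), and the demo shortens the span length from 100 to 10 characters so the example fits toy documents:

```python
def has_shared_span(doc, span_index, span_len=100):
    """True if any `span_len`-character window of `doc` already occurs
    in `span_index` (a set of windows from previously kept documents)."""
    return any(doc[i:i + span_len] in span_index
               for i in range(len(doc) - span_len + 1))

def deduplicate(docs, span_len=100):
    """Greedy one-pass dedup: keep a document only if it shares no
    `span_len`-character span with any document kept so far.
    Storing raw windows in a set is memory-hungry; the paper's suffix
    array (or a rolling hash) is what makes this scale."""
    kept, index = [], set()
    for doc in docs:
        if not has_shared_span(doc, index, span_len):
            kept.append(doc)
            index.update(doc[i:i + span_len]
                         for i in range(len(doc) - span_len + 1))
    return kept

# Tiny demo with a 10-character span threshold instead of 100.
docs = ["the quick brown fox jumps",
        "quick brown fox runs home",   # shares "quick brow" with doc 1
        "completely different text"]
kept = deduplicate(docs, span_len=10)
```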
G.3 ROOTS filter
In addition, we benchmark with the filtering procedure from the ROOTS corpus (Laurençon et al., 2022). It applies the following set of filters:
- Discarding documents with too few words
- Discarding documents with overly repeated character- and word-n-grams
- Discarding documents with too many special characters
- Discarding documents with too few grammatical function words (e.g., "of", "and")
- Discarding documents with too many flagged words
- Discarding documents with a low fastText language identification score
- Perplexity filtering
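A sketch of three of these heuristics (minimum word count, special-character fraction, function-word presence). The thresholds and word list below are illustrative placeholders, not the actual ROOTS settings:

```python
import re

# Illustrative subset of English function words (not the ROOTS list).
FUNCTION_WORDS = {"the", "of", "and", "a", "to", "in", "is", "that"}

def passes_filters(text, min_words=10, max_special_frac=0.3,
                   min_function_words=2):
    """Apply three ROOTS-style heuristics; all thresholds are made up."""
    words = re.findall(r"[a-zA-Z']+", text.lower())
    if len(words) < min_words:                        # too few words
        return False
    special = sum(1 for c in text
                  if not (c.isalnum() or c.isspace()))
    if special / max(len(text), 1) > max_special_frac:  # too many special chars
        return False
    if sum(w in FUNCTION_WORDS for w in words) < min_function_words:
        return False                                  # too few function words
    return True

good = "The orbit of the comet was computed from a long series of observations."
bad = "$$$ ### !!! @@@ %%% ^^^ &&& *** ((( )))"
```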
4. https://github.com/bigscience-workshop/data-preparation/tree/main/preprocessing/training/01b_oscar_cleaning_and_filtering
Muennighoff, Rush, Barak, Scao, Piktus, Tazi, Pyysalo, Wolf, Raffel
Figure 20: Perplexity histograms for the respective data sets, with the relevant percentile thresholds (25th, 50th, 75th) marked. For demonstration purposes, we use 100,000 random samples of each data set.
| Base data set | Filter | Tokens after filtering |
|---|---|---|
| C4 | Deduplication | 21 billion |
| C4 | Perplexity Top 25% | 44 billion |
| C4 | Perplexity Top 50% | 89 billion |
| C4 | Perplexity 25-75% | 89 billion |
| OSCAR | Deduplication | 9 billion |
| OSCAR | Deduplication-expanded | 94 billion |
| OSCAR | Perplexity Top 25% | 80 billion |
| OSCAR | ROOTS | 99 billion |
Table 12: Sizes of filtered data sets.
| Task | C4 2.8B All | C4 2.8B 25% | C4 2.8B 50% | C4 4.2B All | C4 4.2B 25% | C4 4.2B 50% | C4 4.2B 25-75% | OSCAR 2.8B All | OSCAR 2.8B 25% | OSCAR 4.2B All | OSCAR 4.2B 25% |
|---|---|---|---|---|---|---|---|---|---|---|---|
| ANLI R1 | 0.4 ± 1.6 | -0.1 | 0.9 | -0.5 ± 1.4 | -0.0 | -0.7 | -0.8 | -0.3 ± 0.5 | -0.4 | -0.4 ± 1.2 | -2.2 |
| ANLI R2 | 0.9 ± 0.4 | -0.2 | -0.7 | 0.0 ± 1.3 | -0.4 | -0.0 | 1.1 | 1.0 ± 1.0 | 1.7 | 1.0 ± 0.9 | 0.7 |
| ANLI R3 | 1.7 ± 0.5 | 0.5 | 1.4 | 0.7 ± 0.5 | 0.7 | 2.9 | 0.4 | 0.4 ± 0.8 | 1.7 | 1.2 ± 0.5 | 2.1 |
| ARC-Challenge | 1.6 ± 1.0 | 3.3 | 2.9 | 4.2 ± 1.6 | 10.2 | 9.3 | 7.9 | -1.4 ± 0.8 | 3.3 | 1.8 ± 0.8 | 6.3 |
| ARC-Easy | 44.5 ± 0.5 | 47.3 | 47.7 | 48.1 ± 4.8 | 55.8 | 53.7 | 51.0 | 39.7 ± 0.3 | 46.8 | 45.7 ± 0.6 | 51.8 |
| BoolQ | 18.8 ± 3.4 | 17.1 | 17.7 | 22.4 ± 3.3 | 27.7 | 23.5 | 24.5 | 12.8 ± 4.4 | 11.8 | 12.4 ± 5.9 | 22.2 |
| CB | 20.0 ± 4.7 | 16.1 | 13.8 | 9.3 ± 16.6 | 24.6 | 22.3 | 12.5 | 19.7 ± 5.1 | 17.0 | 23.9 ± 3.8 | 20.1 |
| COPA | 49.7 ± 3.5 | 55.7 | 56.0 | 55.3 ± 3.8 | 60.7 | 66.0 | 61.0 | 42.7 ± 2.2 | 44.0 | 41.1 ± 3.0 | 49.3 |
| HellaSwag | 24.7 ± 0.3 | 24.7 | 26.0 | 29.4 ± 1.3 | 30.7 | 32.7 | 33.1 | 16.3 ± 0.1 | 19.0 | 21.0 ± 0.2 | 23.3 |
| PiQA | 47.9 ± 0.6 | 43.4 | 45.8 | 48.8 ± 3.8 | 47.9 | 52.2 | 52.1 | 41.2 ± 0.7 | 38.3 | 45.0 ± 0.6 | 44.4 |
| RTE | 5.1 ± 4.0 | 5.7 | 7.3 | 6.9 ± 3.1 | 11.9 | 2.2 | 10.3 | 3.9 ± 1.1 | -1.2 | 2.2 ± 4.3 | 7.0 |
| SciQ | 83.2 ± 0.6 | 82.4 | 82.8 | 86.3 ± 1.1 | 88.6 | 87.4 | 88.4 | 83.2 ± 0.6 | 84.0 | 86.3 ± 0.6 | 86.5 |
| StoryCloze 2016 | 58.7 ± 0.2 | 61.1 | 61.2 | 62.8 ± 0.5 | 65.5 | 65.6 | 65.1 | 52.8 ± 0.3 | 57.9 | 57.2 ± 0.6 | 60.2 |
| WinoGrande XL | 11.6 ± 0.8 | 15.3 | 14.3 | 18.7 ± 1.0 | 24.9 | 22.3 | 18.7 | 5.8 ± 0.9 | 9.7 | 10.1 ± 1.0 | 14.8 |
| E2E NLG | 17.0 ± 1.4 | 16.1 | 16.8 | 17.9 ± 0.7 | 18.8 | 17.8 | 19.2 | 20.3 ± 0.3 | 19.5 | 21.6 ± 0.7 | 22.6 |
| XSUM | 2.4 ± 0.1 | 2.6 | 3.0 | 3.0 ± 0.3 | 3.9 | 3.2 | 3.0 | 3.0 ± 0.1 | 3.2 | 3.7 ± 0.2 | 2.7 |
| WebNLG EN | 5.3 ± 0.1 | 4.8 | 5.1 | 5.6 ± 0.3 | 5.4 | 5.7 | 5.2 | 8.8 ± 0.4 | 6.9 | 9.3 ± 0.5 | 10.6 |
| WikiLingua EN | 3.0 ± 0.1 | 3.2 | 3.3 | 3.6 ± 0.2 | 3.4 | 3.5 | 3.4 | 2.9 ± 0.1 | 3.4 | 4.0 ± 0.1 | 3.8 |
| bAbI | 0.0 ± 0.0 | 0.0 | 0.0 | 0.0 ± 0.0 | 0.0 | 0.0 | 0.0 | 15.5 ± 1.0 | 14.5 | 19.3 ± 1.0 | 17.2 |
| Average | 20.9 ± 0.4 | 21.0 | 21.3 | 22.2 ± 1.4 | 25.3 | 24.7 | 24.0 | 19.4 ± 0.5 | 20.1 | 21.4 ± 0.5 | 23.3 |
Table 13: Results for perplexity filtering. The training data is perplexity-filtered according to the given percentile, e.g., 25% corresponds to training on the top 25% of examples with the lowest perplexity. The resulting data set sizes are in Table 12. The data is repeated until it matches 55B tokens for 2.8B-parameter models and 84B tokens for 4.2B-parameter models. Scores are normalized averages over 0-5 few-shot settings and reported as percentages. For unfiltered models we report mean/std. err. across five different models, each trained with a different random seed.
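Combining the filtered-set sizes in Table 12 with the token budgets stated in the caption (55B tokens for 2.8B-parameter models, 84B for 4.2B), the implied number of epochs over each filtered set can be computed directly:

```python
# Filtered data set sizes in billions of tokens (from Table 12).
sizes = {
    "C4 dedup": 21, "C4 ppl top 25%": 44,
    "C4 ppl top 50%": 89, "C4 ppl 25-75%": 89,
    "OSCAR dedup": 9, "OSCAR dedup-expanded": 94,
    "OSCAR ppl top 25%": 80, "OSCAR ROOTS": 99,
}

def epochs(budget_billion, size_billion):
    """Passes needed to repeat a filtered set up to the token budget."""
    return budget_billion / size_billion

# 2.8B-parameter models train on 55B tokens, 4.2B-parameter models on 84B.
epochs_2b8 = {name: round(epochs(55, s), 1) for name, s in sizes.items()}
epochs_4b2 = {name: round(epochs(84, s), 1) for name, s in sizes.items()}
```

For example, strictly deduplicated OSCAR (9B tokens) must be repeated for roughly six epochs to fill the 55B-token budget, whereas the ROOTS-filtered OSCAR (99B tokens) is never fully traversed even at the 84B budget.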
Appendix H. Detailed Filtering Results
In Table 13, we report detailed perplexity-filtering results on C4 and OSCAR. For C4, perplexity filtering is only effective at 4.2B parameters. Meanwhile, for OSCAR, which is noisier than C4, perplexity filtering seems effective at both 2.8B and 4.2B parameters. Table 14 contains deduplication results and results for the ROOTS filter. Deduplication does not improve downstream performance for C4, while being effective for OSCAR, which has significantly more noise. Applying the ROOTS filter to OSCAR is not better than unfiltered OSCAR on our benchmark, but it might have other beneficial effects, such as reducing obscenity, templated messages, or repetition, depending on the final use case.
| Task | C4 2.8B All | C4 2.8B Dedup. | C4 4.2B All | C4 4.2B Dedup. | OSCAR 2.8B All | OSCAR 2.8B Dedup. | OSCAR 2.8B ROOTS | OSCAR 4.2B All | OSCAR 4.2B Dedup.-exp. | OSCAR 4.2B ROOTS |
|---|---|---|---|---|---|---|---|---|---|---|
| ANLI R1 | 0.4 ± 1.6 | -0.2 | -0.5 ± 1.4 | -0.8 | -0.3 ± 0.5 | -2.1 | -1.7 | -0.4 ± 1.2 | -1.8 | 1.2 |
| ANLI R2 | 0.9 ± 0.4 | 1.1 | 0.0 ± 1.3 | -0.1 | 1.0 ± 1.0 | 2.0 | 0.7 | 1.0 ± 0.9 | -0.5 | -0.3 |
| ANLI R3 | 1.7 ± 0.5 | 1.8 | 0.7 ± 0.5 | 0.4 | 0.4 ± 0.8 | 0.4 | 0.2 | 1.2 ± 0.5 | 0.8 | -0.3 |
| ARC-Challenge | 1.6 ± 1.0 | 0.6 | 4.2 ± 1.6 | 3.9 | -1.4 ± 0.8 | 2.6 | -0.9 | 1.8 ± 0.8 | 6.8 | 0.6 |
| ARC-Easy | 44.5 ± 0.5 | 43.0 | 48.1 ± 4.8 | 46.8 | 39.7 ± 0.3 | 44.6 | 42.3 | 45.7 ± 0.6 | 51.0 | 47.1 |
| BoolQ | 18.8 ± 3.4 | 1.5 | 22.4 ± 3.3 | 2.2 | 12.8 ± 4.4 | 3.4 | 13.4 | 12.4 ± 5.9 | 13.0 | 7.0 |
| CB | 20.0 ± 4.7 | 0.4 | 9.3 ± 16.6 | 0.9 | 19.7 ± 5.1 | 25.4 | 14.3 | 23.9 ± 3.8 | 25.0 | 28.1 |
| COPA | 49.7 ± 3.5 | 57.0 | 55.3 ± 3.8 | 60.0 | 42.7 ± 2.2 | 47.3 | 37.7 | 41.1 ± 3.0 | 55.3 | 43.0 |
| HellaSwag | 24.7 ± 0.3 | 25.1 | 29.4 ± 1.3 | 30.7 | 16.3 ± 0.1 | 22.8 | 17.6 | 21.0 ± 0.2 | 26.3 | 22.4 |
| PiQA | 47.9 ± 0.6 | 49.1 | 48.8 ± 3.8 | 53.4 | 41.2 ± 0.7 | 45.1 | 41.9 | 45.0 ± 0.6 | 48.5 | 46.3 |
| RTE | 5.1 ± 4.0 | 3.2 | 6.9 ± 3.1 | 0.1 | 3.9 ± 1.1 | 6.1 | 5.8 | 2.2 ± 4.3 | 1.1 | 8.9 |
| SciQ | 83.2 ± 0.6 | 80.4 | 86.3 ± 1.1 | 82.2 | 83.2 ± 0.6 | 82.6 | 83.1 | 86.3 ± 0.6 | 88.5 | 86.4 |
| StoryCloze 2016 | 58.7 ± 0.2 | 61.8 | 62.8 ± 0.5 | 65.2 | 52.8 ± 0.3 | 58.1 | 54.3 | 57.2 ± 0.6 | 61.6 | 58.6 |
| WinoGrande XL | 11.6 ± 0.8 | 13.3 | 18.7 ± 1.0 | 19.7 | 5.8 ± 0.9 | 12.7 | 5.6 | 10.1 ± 1.0 | 16.2 | 11.0 |
| E2E NLG | 17.0 ± 1.4 | 15.6 | 17.9 ± 0.7 | 14.2 | 20.3 ± 0.3 | 20.5 | 20.5 | 21.6 ± 0.7 | 2.4 | 22.6 |
| XSUM | 2.4 ± 0.1 | 2.1 | 3.0 ± 0.3 | 2.5 | 3.0 ± 0.1 | 3.2 | 3.1 | 3.7 ± 0.2 | 4.6 | 3.8 |
| WebNLG EN | 5.3 ± 0.1 | 4.3 | 5.6 ± 0.3 | 4.4 | 8.8 ± 0.4 | 7.4 | 7.4 | 9.3 ± 0.5 | 9.7 | 9.4 |
| WikiLingua EN | 3.0 ± 0.1 | 3.2 | 3.6 ± 0.2 | 3.2 | 2.9 ± 0.1 | 3.0 | 3.1 | 4.0 ± 0.1 | 4.3 | 4.0 |
| bAbI | 0.0 ± 0.0 | 0.0 | 0.0 ± 0.0 | 0.0 | 15.5 ± 1.0 | 17.2 | 14.3 | 19.3 ± 1.0 | 21.1 | 18.0 |
| Average | 20.9 ± 0.4 | 19.1 | 22.2 ± 1.4 | 20.5 | 19.4 ± 0.5 | 21.2 | 19.1 | 21.4 ± 0.5 | 22.8 | 22.0 |
Table 14: Results for filtering with deduplication and the ROOTS filter. The resulting data set sizes are in Table 12. The data is repeated until it matches 55B tokens for 2.8B-parameter models and 84B tokens for 4.2B-parameter models. Scores are normalized averages over 0-5 few-shot settings and reported as percentages. For unfiltered models we report mean/std. err. across five different models, each trained with a different random seed.
Appendix I. Hyperparameters and Setup
We compute the final parameter count as

P = 12lh² + 13lh + (V + s)h,

where P is the final parameter count, l is the number of layers, h is the hidden dimension, V = 50257 the vocabulary size and s = 2048 the sequence length. We find the parameter counts reported in Chinchilla (Hoffmann et al., 2022) to be significantly different from our calculations, especially at larger scales. We report both in Table 15, but we use our parameter estimates everywhere in this work. Further, we have corrected the number of heads of the 3,530 and 4,084 million parameter models from Hoffmann et al. (2022) to obey the relationship d_model = kv_size × n_heads.

To train our models, we have forked the Megatron-DeepSpeed (Rasley et al., 2020; Smith et al., 2022) framework and adapted it for ROCm to enable training on AMD GPUs. We have made our training code publicly available at https://github.com/TurkuNLP/Megatron-DeepSpeed. Models are trained using data, tensor and pipeline parallelism on up to 256 AMD Instinct MI250X GPUs distributed across up to 64 nodes on the LUMI supercomputer located in Finland. As of June 2023, LUMI is the largest supercomputer in Europe and ranks third worldwide with a performance of around 310 PFLOPs.5 We trained models in parallel using up to 2,200 nodes at a single point in time (equivalent to around 8,800 GPUs, 17,600 GCDs, or 86% of all GPUs on LUMI). We have used a total of around 3 million GPU hours.

The cluster is powered 100% by renewable energy (hydroelectricity) and its waste heat is used for heating the nearby city, reducing the city's carbon emissions by up to 20%. Thanks to the low temperatures in Finland, relatively little cooling is required for the cluster, further reducing its environmental impact. As of June 2023, it ranks as the seventh greenest supercomputer.6
5. https://www.top500.org/lists/top500/2023/06/
6. https://www.top500.org/lists/green500/2023/06/
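As a sanity check, the "This work" parameter counts in Table 15 can be reproduced to the nearest million from l, h, V, and s alone, using the decomposition 12lh² (attention and feed-forward weights), 13lh (biases and layer norms), and (V + s)h (token and position embeddings). This exact bookkeeping is a reconstruction verified against a handful of table entries, not the released implementation:

```python
V = 50257   # vocabulary size
S = 2048    # sequence length

def param_count(n_layers, d_model):
    """Transformer parameter count: 12*l*h^2 weights per block,
    13*l*h biases/layer norms, (V+S)*h token+position embeddings.
    Reconstructed to match Table 15; the exact terms are an assumption."""
    l, h = n_layers, d_model
    return 12 * l * h**2 + 13 * l * h + (V + S) * h

def param_millions(n_layers, d_model):
    return round(param_count(n_layers, d_model) / 1e6)
```

Spot checks against the table: the (l=8, h=512) model gives 52M, (l=42, h=4096) gives 8672M, and (l=34, h=2560) gives 2809M, matching the "This work" column.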
| Parameters, this work (M) | Parameters, Chinchilla (M) | d_model | ffw_size | kv_size | n_heads | n_layers |
|---|---|---|---|---|---|---|
| 7 | - | 128 | 512 | 32 | 4 | 3 |
| 14 | - | 224 | 896 | 32 | 7 | 4 |
| 20 | - | 288 | 1152 | 32 | 7 | 5 |
| 38 | - | 448 | 1792 | 32 | 7 | 6 |
| 52 | 44 | 512 | 2048 | 64 | 8 | 8 |
| 66 | 57 | 576 | 2304 | 64 | 9 | 9 |
| 83 | 74 | 640 | 2560 | 64 | 10 | 10 |
| 97 | 90 | 640 | 2560 | 64 | 10 | 13 |
| 112 | 106 | 640 | 2560 | 64 | 10 | 16 |
| 125 | 117 | 768 | 3072 | 64 | 12 | 12 |
| 146 | 140 | 768 | 3072 | 64 | 12 | 15 |
| 168 | 163 | 768 | 3072 | 64 | 12 | 18 |
| 182 | 175 | 896 | 3584 | 64 | 14 | 14 |
| 201 | 196 | 896 | 3584 | 64 | 14 | 16 |
| 220 | 217 | 896 | 3584 | 64 | 14 | 18 |
| 255 | 251 | 1024 | 4096 | 64 | 16 | 16 |
| 280 | 278 | 1024 | 4096 | 64 | 16 | 18 |
| 305 | 306 | 1024 | 4096 | 64 | 16 | 20 |
| 421 | 425 | 1280 | 5120 | 128 | 10 | 18 |
| 480 | 489 | 1280 | 5120 | 128 | 10 | 21 |
| 502 | 509 | 1408 | 5632 | 128 | 11 | 18 |
| 539 | 552 | 1280 | 5120 | 128 | 10 | 24 |
| 574 | 587 | 1408 | 5632 | 128 | 11 | 21 |
| 619 | 632 | 1536 | 6144 | 128 | 12 | 19 |
| 645 | 664 | 1408 | 5632 | 128 | 11 | 24 |
| 704 | 724 | 1536 | 6144 | 128 | 12 | 22 |
| 789 | 816 | 1536 | 6144 | 128 | 12 | 25 |
| 865 | 893 | 1792 | 7168 | 128 | 14 | 20 |
| 981 | 1018 | 1792 | 7168 | 128 | 14 | 23 |
| 1096 | 1143 | 1792 | 7168 | 128 | 14 | 26 |
| 1215 | 1266 | 2048 | 8192 | 128 | 16 | 22 |
| 1364 | 1424 | 2176 | 8704 | 128 | 17 | 22 |
| 1366 | 1429 | 2048 | 8192 | 128 | 16 | 25 |
| 1517 | 1593 | 2048 | 8192 | 128 | 16 | 28 |
| 1535 | 1609 | 2176 | 8704 | 128 | 17 | 25 |
| 1650 | 1731 | 2304 | 9216 | 128 | 18 | 24 |
| 1706 | 1794 | 2176 | 8704 | 128 | 17 | 28 |
| 1905 | 2007 | 2304 | 9216 | 128 | 18 | 28 |
| 2160 | 2283 | 2304 | 9216 | 128 | 18 | 32 |
| 2179 | 2298 | 2560 | 10240 | 128 | 20 | 26 |
| 2494 | 2639 | 2560 | 10240 | 128 | 20 | 30 |
| 2809 | 2980 | 2560 | 10240 | 128 | 20 | 34 |
| 3090 | - | 2688 | 10752 | 128 | 22 | 34 |
| 3263 | 3530 | 2688 | 10752 | 128 | 21 | 36 |
| 3574 | 3802 | 2816 | 11264 | 128 | 22 | 36 |
| 3900 | 4084 | 2944 | 11776 | 128 | 23 | 36 |
| 4239 | 4516 | 3072 | 12288 | 128 | 24 | 36 |
| 6355 | 6796 | 3584 | 14336 | 128 | 28 | 40 |
| 8672 | 9293 | 4096 | 16384 | 128 | 32 | 42 |
| 10912 | 11452 | 4352 | 17408 | 128 | 32 | 47 |
| 11455 | 12295 | 4608 | 18432 | 128 | 36 | 44 |
| 12220 | 12569 | 4608 | 18432 | 128 | 32 | 47 |
| 13601 | 13735 | 4864 | 19456 | 128 | 32 | 47 |
| 14917 | 14940 | 4992 | 19968 | 128 | 32 | 49 |
| 15056 | 16183 | 5120 | 20480 | 128 | 40 | 47 |
Table 15: Model architectures. We list the architectures of all models trained as part of this work. Many of the models shown were trained multiple times, on different amounts of unique data and for varying numbers of epochs.
Appendix J. Prompts and Samples
The following figures illustrate the prompts with samples from each evaluation data set. Prompts stem from PromptSource (Bach et al., 2022) or GPT-3 (Brown et al., 2020). All data in this section comes from the ground-truth data sets; no model generations are shown.
Context Edmond (or Edmund) Halley, FRS (pronounced ; 8 November [O.S. 29 October] 1656 – 25 January 1742 [O.S. 14 January 1741]) was an English astronomer, geophysicist, mathematician, meteorologist, and physicist who is best known for computing the orbit of Halley's Comet. He was the second Astronomer Royal in Britain, succeeding John Flamsteed. Question: Edmond Halley was born outside of the United Kingdom. True, False, or Neither? Answer:
Correct Answer Neither Incorrect Answer True Incorrect Answer False
Figure 21: Formatted data set example from ANLI R1 evaluated using accuracy as described in D.
Context The 1970 Swedish Open was a combined men's and women's tennis tournament played on outdoor clay courts held in Båstad, Sweden and was part of the Grand Prix circuit of the 1970 Tour. It was the 23rd edition of the tournament and was held from 2 July through 12 July 1970. Dick Crealy and Peaches Bartkowicz won the singles titles. Question: Dick Crealy and Peaches Bartkowicz beat eachother in the 1970 Swedish Open. True, False, or Neither? Answer:
Correct Answer False Incorrect Answer True Incorrect Answer Neither
Figure 22: Formatted data set example from ANLI R2 evaluated using accuracy as described in D.
Context Tokyo - Food group Nestle is seeking to lure Japanese holiday shoppers with a taste for fine snacking with a gold-wrapped Kit Kat chocolate bar. The single finger Kit Kat is wrapped in a thin layer of gold leaf. Only 500 of the bars go on sale from Dec. 29 with a price tag of around 2,016 yen ($16). The Kit Kat chocolate bar made its debut in Japan in 1973 and since then a variety of flavors from green tea to wasabi have been produced. Question: Japanese like kit kat. True, False, or Neither? Answer:
Correct Answer True Incorrect Answer False Incorrect Answer Neither
Figure 23: Formatted data set example from ANLI R3 evaluated using accuracy as described in D.
Context An astronomer observes that a planet rotates faster after a meteorite impact. Which is the most likely effect of this increase in rotation?
Correct Answer Planetary days will become shorter. Incorrect Answer Planetary years will become longer. Incorrect Answer Planetary gravity will become stronger.
Figure 24: Formatted data set example from ARC-Challenge evaluated using accuracy as described in D.
Context To express the distance between the Milky Way galaxy and other galaxies, the most appropriate unit of measurement is the
Correct Answer light-year. Incorrect Answer meter. Incorrect Answer kilometer. Incorrect Answer astronomical unit.
Figure 25: Formatted data set example from ARC-Easy evaluated using accuracy as described in D.
Context Radio wave Radio waves are a type of electromagnetic radiation with wavelengths in the electromagnetic spectrum longer than infrared light. Radio waves have frequencies as high as 300 gigahertz (GHz) to as low as 30 hertz (Hz). At 300 GHz, the corresponding wavelength is 1 mm, and at 30 Hz is 10,000 km. Like all other electromagnetic waves, radio waves travel at the speed of light. They are generated by electric charges undergoing acceleration, such as time varying electric currents. Naturally occurring radio waves are emitted by lightning and astronomical objects. Question: do radio waves travel at the speed of light? Answer:
Correct Answer yes Incorrect Answer no
Figure 26: Formatted data set example from BoolQ evaluated using accuracy as described in D.
Context A: Okay. So Frank, what, uh, type of, uh, budget do you or your family have? B: Well, uh I don't know that we really have a budget. Question: he and his family really have a budget. True, False or Neither? Answer:
Correct Answer False Incorrect Answer True Incorrect Answer Neither
Figure 27: Formatted data set example from CB evaluated using accuracy as described in D.
Context The computer was expensive to fix therefore
Correct Answer I bought a new one. Incorrect Answer I got it repaired.
Figure 28: Formatted data set example from COPA evaluated using accuracy as described in D.
Context Canoeing: Two women in a child are shown in a canoe while a man pulls the canoe while standing in the water, with other individuals visible in the background. The child and a different man
Correct Answer sit in a canoe while the man paddles. Incorrect Answer are then shown paddling down a river in a boat while a woman talks. Incorrect Answer are driving the canoe, they go down the river flowing side to side. Incorrect Answer walking go down the rapids, while the man in his helicopter almost falls and goes out of canoehood.
Figure 29: Formatted data set example from HellaSwag evaluated using accuracy as described in D.
Context Question: How to sleep in proper posture? Answer:
Correct Answer Sleep straight with a pillow under your head. Incorrect Answer Sleep straight with a pillow over your head.
Figure 30: Formatted data set example from PiQA evaluated using accuracy as described in D.
Context As spacecraft commander for Apollo XI, the first manned lunar landing mission, Armstrong was the first man to walk on the Moon. "That's one small step for a man, one giant leap for mankind." With these historic words, man's dream of the ages was fulfilled. Question: Neil Armstrong was the first man who landed on the Moon. True or False? Answer:
Correct Answer True. Incorrect Answer False.
Figure 31: Formatted data set example from RTE evaluated using accuracy as described in D.
Context The electromagnetic spectrum encompasses a very wide range of wavelengths and frequencies. Visible light is only a very small portion of the spectrum with wavelengths from 400-700 nm. Question: With wavelengths from 400-700 nm, what kind of light represents only a very small portion of the spectrum? Answer:
Correct Answer visible light. Incorrect Answer ultraviolet light. Incorrect Answer invisible light. Incorrect Answer sunlight.
Figure 32: Formatted data set example from SciQ evaluated using accuracy as described in D.
Context Bob went to the gas station to fill up his car. His tank was completely empty and so was his wallet. The cashier offered to pay for his gas if he came back later to pay. Bob felt grateful as he drove home. Answer:
Correct Answer Bob believed that there were good people in the world. Incorrect Answer Bob contemplated how unfriendly the world was.
Figure 33: Formatted data set example from StoryCloze evaluated using accuracy as described in D.
Correct Context Johnny likes fruits more than vegetables in his new keto diet because the fruits: Incorrect Context Johnny likes fruits more than vegetables in his new keto diet because the vegetables:
Target Completion are saccharine.
Figure 34: Formatted data set example from WinoGrande evaluated using accuracy as described in D.
Context Given the following data about a restaurant: name : The Wrestlers eatType : pub food : Japanese priceRange : cheap area : riverside near : Raja Indian Cuisine
Generate some text about this restaurant.
Target The Wrestlers offers Japanese food and pub with cheap price near Raja Indian Cuisine in riverside.
Figure 35: Formatted data set example from E2E NLG evaluated using ROUGE as described in D.
Context Article: The artificial intelligence system - LipNet - watches video of a person speaking and matches the text to the movement of their mouths with 93% accuracy, the researchers said. Automating the process could help millions, they suggested. But experts said the system needed to be tested in real-life situations. Lip-reading is a notoriously tricky business with professionals only able to decipher what someone is saying up to 60% of the time. "Machine lip-readers have enormous potential, with applications in improved hearing aids, silent dictation in public spaces, covert conversations, speech recognition in noisy environments, biometric identification and silent-movie processing," wrote the researchers. They said that the AI system was provided with whole sentences so that it could teach itself which letter corresponded to which lip movement. To train the AI, the team - from Oxford University's AI lab - fed it nearly 29,000 videos, labelled with the correct text. Each video was three seconds long and followed a similar grammatical pattern. While human testers given similar videos had an error rate of 47.7%, the AI had one of just 6.6%. The fact that the AI learned from specialist training videos led some on Twitter to criticise the research. Writing in Open Review, Neil Lawrence pointed out that the videos had "limited vocabulary and a single syntax grammar". "While it's promising to perform well on this data, it's not really groundbreaking. While the model may be able to read my lips better than a human, it can only do so when I say a meaningless list of words from a highly constrained vocabulary in a specific order," he writes. The project was partially funded by Google's artificial intelligence firm DeepMind.
Target Scientists at Oxford University have developed a machine that can lip-read better than humans.
Figure 36: Formatted data set example from XSUM evaluated using ROUGE as described in D.
Context I will verbalize an abstract representation of a sentence in natural language. To do so, I will first show the representation and then the natural language. The text needs to include all of the information in the representation. Brandon_Carter | alma Mater | University_of_Cambridge, University_of_Cambridge | chancellor | David_Sainsbury,_Baron_Sainsbury_of_Turville, Brandon_Carter | birth Place | England, University_of_Cambridge | vice Chancellor | Leszek_Borysiewicz
Target The University of Cambridge is the alma mater of Brandon Carter, who was born in England. David Sainsbury, also known as the Baron Sainsbury of Turville, and Leszek Borysiewicz are respectively the chancellor and vice chancellor of the University of Cambridge.
Figure 37: Formatted data set example from WebNLG evaluated using ROUGE as described in D.
Context Attributes are placed within the tag itself, making additional alterations to the element content between the start and end tag. They never stand alone. They are written in the format name="value", where name is the name of the attribute (for instance "color"), and value describes this specific instance (for instance "red"). You've actually seen attributes before, if you followed the tutorial in the basic HTML section. tags use the src attribute, anchors use the name attribute, and links use the href attribute. See how those all follow the ___="___" format? Making a table, or chart, requires several different tags. Play with these tags, or learn about HTML tables in more detail. Start with table tags around the entire table:
| Column 1: Month | Column 2: Money Saved |
|---|---|
| January | $100 |