Journal of Machine Learning Research 26 (2025) 1-66 Submitted 6/24; Revised 12/24; Published 2/25
Scaling Data-Constrained Language Models
Niklas Muennighoff n.muennighoff@gmail.com
Hugging Face
Alexander M. Rush arush@cornell.edu
Hugging Face
Boaz Barak boaz@seas.harvard.edu
Harvard University
Teven Le Scao teven.lescao@gmail.com
Hugging Face
Aleksandra Piktus ola.piktus@gmail.com
Hugging Face
Nouamane Tazi nouamane@huggingface.co
Hugging Face
Sampo Pyysalo sampo.pyysalo@gmail.com
University of Turku
Thomas Wolf thomas@huggingface.co
Hugging Face
Colin Raffel craffel@gmail.com
Hugging Face
Editor: Fei Sha
Abstract

The current trend of scaling language models involves increasing both parameter count and training data set size. Extrapolating this trend suggests that training data set size may soon be limited by the amount of text data available on the internet. Motivated by this limit, we investigate scaling language models in data-constrained regimes. Specifically, we run a large set of experiments varying the extent of data repetition and compute budget, ranging up to 900 billion training tokens and 9 billion parameter models. We find that with constrained data for a fixed compute budget, training with up to 4 epochs of repeated data yields negligible changes to loss compared to having unique data. However, with more repetition, the value of adding compute eventually decays to zero. We propose and empirically validate a scaling law for compute optimality that accounts for the decreasing value of repeated tokens and excess parameters. Finally, we experiment with approaches mitigating data scarcity, including augmenting the training data set with code data or removing commonly used filters. Models and data sets from our 400 training runs are freely available at https://github.com/huggingface/datablations.

Keywords: large language models, scaling laws, data-constrained, data engineering
©2025 Niklas Muennighoff and Alexander M. Rush and Boaz Barak and Teven Le Scao and Aleksandra Piktus and Nouamane Tazi and Sampo Pyysalo and Thomas Wolf and Colin Raffel.
License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at http://jmlr.org/papers/v26/24-1000.html.
[Figure 1 panels. Left ("Return on compute when repeating"): final test loss vs. tokens (epochs); up to 4 epochs, repeating is almost as good as new data; returns diminish rapidly with more repetition; at 40 epochs, repeating is worthless. Right ("Allocating compute when repeating"): models trained with annotated final losses 2.376 and 2.359; IsoFLOP regime and efficient frontier assuming repeated data is worth the same as new data vs. the efficient frontier predicted by our data-constrained scaling laws.]
Figure 1: Return and Allocation when repeating data. Left: Loss of LLMs (4.2B parameters) scaled on repeated data decays predictably (§6). Right: To maximize performance when repeating, our data-constrained scaling laws and empirical data suggest training smaller models for more epochs, in contrast to what assuming that Chinchilla scaling laws (Hoffmann et al., 2022) hold for repeated data would predict (§5).
1. Introduction
Recent work on compute-optimal language models (Hoffmann et al., 2022) shows that many previously trained large language models (LLMs, which we define as having more than one billion parameters) could have attained better performance for a given compute budget by training a smaller model on more data. Notably, the 70-billion parameter Chinchilla model (Hoffmann et al., 2022) outperforms the 280-billion parameter Gopher model (Rae et al., 2021) while using a similar compute budget by being trained on four times more data.

Extrapolating these laws for compute allocation (hereafter "Chinchilla scaling laws") to a 530 billion parameter model, such as the under-trained MT-NLG model (Smith et al., 2022), would require training on a massive 11 trillion tokens, corresponding to more than 30 terabytes of text data. For most languages, available data is several orders of magnitude smaller, meaning that LLMs in those languages are already data-constrained. Villalobos et al. (2022) estimate that even high-quality English language data will be exhausted by the year 2024 given the Chinchilla scaling laws and the trend of training ever-larger models. This motivates the question (Villalobos et al., 2022; nostalgebraist, 2022): what should we do when we run out of data?

In this work we investigate scaling large language models in a data-constrained regime, and whether training an LLM with multiple epochs of repeated data impacts scaling. Using multiple epochs is, of course, standard in machine learning generally; however, most prior
large language models have been trained for a single epoch (Komatsuzaki, 2019; Brown et al., 2020), and some work explicitly advocates against reusing data (Hernandez et al., 2022). An exception is the recent Galactica models (Taylor et al., 2022), which were trained for 4.25 epochs and exhibit continually decreasing validation loss and improving downstream performance throughout training. However, the Galactica experiments do not compare this setup to an alternative non-data-constrained model trained for one epoch on unique data. Without this comparison, it is difficult to quantify the trade-off between additional compute versus additional data collection.

Our main focus is to quantify the impact of multiple epochs in LLM training such that practitioners can decide how to allocate compute when scaling models. Toward this end, we assembled a battery of empirical training runs of varying data and compute constraints. Specifically, we train more than 400 models ranging from 10 million to 9 billion parameters for up to 1500 epochs and record final test loss. We use these results to fit a new data-constrained scaling law that generalizes the Chinchilla scaling law (Hoffmann et al., 2022) to the repeated data regime and yields a better prediction of loss in this setting.

Figure 1 summarizes our main results targeting the value of repeated data (Return) and the optimal allocation of resources in that regime (Allocation). We find that, while models trained for a single epoch consistently have the best validation loss per compute, differences tend to be insignificant among models trained for up to 4 epochs and do not lead to differences in downstream task performance. Additional epochs continue to be beneficial, but returns eventually diminish to zero. We find that, in the data-constrained regime, allocating new compute to both more parameters and epochs is necessary, and that epochs should be scaled slightly faster.
These findings suggest a simple way to continue scaling total training compute budgets further into the future than previously anticipated limits. Finally, given the challenges imposed by data constraints, we consider methods complementary to repeating for improving downstream accuracy without adding new natural language data. Experiments consider incorporating code tokens and relaxing data filtering.

For code, English LLMs, such as PaLM (Chowdhery et al., 2022) or Gopher (Rae et al., 2021), are trained on a small amount of code data alongside natural language data, though no benchmarking was reported to justify that decision. We investigate training LLMs on a mix of language data and Python data at 10 different mixing rates and find that mixing in code is able to provide a 2× increase in effective tokens even when evaluating only natural language tasks. For filtering, we revisit perplexity and deduplication filtering strategies on both noisy and clean data sets and find that data filtering is primarily effective for noisy data sets.
2. Background
Predicting the scaling behavior of large models is critical when deciding on training resources. Specifically, two questions are of interest: (Allocation) What is the optimal balance of resources? (Return) What is the expected value of additional resources? For scaling LLMs, the resource is compute (measured in FLOPs), and it can be allocated to training a larger model or training for more steps.¹ The metric used to quantify progress is the model's loss on held-out data, i.e. the ability to predict the underlying data as measured in the model's cross-entropy (Alabdulmohsin et al., 2022; Hoffmann et al., 2022). We aim to minimize the loss (L) subject to a compute resource constraint (C) via optimal allocation to N and D as:

1. In this work we use the approximation of Kaplan et al. (2020) for the compute cost: FLOPs(N, D) ≈ 6ND, where N denotes the number of model parameters and D denotes the number of tokens processed.
$$\underset{N,D}{\operatorname{argmin}}\; L(N, D) \quad \text{s.t.} \quad \mathrm{FLOPs}(N, D) = C \tag{1}$$
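As a rough sketch, the compute approximation from the footnote can be expressed in a few lines of Python (the helper name is ours, not from the paper's released code):

```python
# Sketch of the compute approximation from footnote 1:
# FLOPs(N, D) ≈ 6 * N * D (Kaplan et al., 2020).
def training_flops(n_params: float, n_tokens: float) -> float:
    """Approximate training compute in FLOPs for N parameters and D tokens."""
    return 6.0 * n_params * n_tokens

# Example: an 8.7B-parameter model trained on 178B tokens lands at
# roughly 9.3e21 FLOPs, one of the budgets used later in the paper.
print(training_flops(8.7e9, 178e9))  # ≈ 9.3e21
```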
Currently, there are established best practices for scaling LLMs. Return follows a power law: loss scales as a power law with the amount of compute used for training (Henighan et al., 2020; Kaplan et al., 2020; Bahri et al., 2021; Ghorbani et al., 2021; Bansal et al., 2022; Hernandez et al., 2021). Allocation is balanced: resources are divided roughly equally between scaling of parameters and data (Hoffmann et al., 2022). These scaling laws were established empirically by training LLMs and carefully extrapolating behavior. Chinchilla (Hoffmann et al., 2022) uses three methods for making scaling predictions:
(Fixed Parameters) Train with a fixed model size but on varying amounts of data.
(Fixed FLOPs) Train with fixed computation while parameters and training tokens vary.
(Parametric Fit) Derive and fit a formula for the loss.
For the parametric fit, the loss (L) is a function of parameters (N) and training tokens (D):
$$L(N, D) = \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}} + E \tag{2}$$
where {A, α, B, β, E} are learned variables fit using the training runs from the first two approaches (Hoffmann et al., 2022). Using these learned variables, they propose calculating the optimal allocation of compute (C) to N and D as follows:
$$N_{opt}(C) = G\left(\frac{C}{6}\right)^{a}, \qquad D_{opt}(C) = G^{-1}\left(\frac{C}{6}\right)^{b} \tag{3}$$

where

$$G = \left(\frac{\alpha A}{\beta B}\right)^{\frac{1}{\alpha+\beta}}, \qquad a = \frac{\beta}{\alpha+\beta}, \qquad b = \frac{\alpha}{\alpha+\beta}$$
These methods lead to the conclusion that α ≈ β, and hence N and D should be scaled proportionally for compute-optimal training. As loss can be an imperfect proxy for performance on natural language tasks (Xia et al., 2022; Shin et al., 2022; Tay et al., 2021), they also validate their conclusions on various downstream tasks.
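To make the allocation rules concrete, the following sketch evaluates the parametric loss and the closed-form optimum. The coefficient values are the commonly reported Hoffmann et al. (2022) estimates and are illustrative only; refits in later work differ slightly.

```python
import math

# Sketch of the Chinchilla parametric loss (Equation 2) and the closed-form
# compute-optimal allocation (Equation 3). Coefficients are illustrative.
A, B, E = 406.4, 410.7, 1.69
alpha, beta = 0.34, 0.28

def chinchilla_loss(N: float, D: float) -> float:
    return E + A / N**alpha + B / D**beta

def optimal_allocation(C: float) -> tuple:
    G = (alpha * A / (beta * B)) ** (1.0 / (alpha + beta))
    a = beta / (alpha + beta)
    b = alpha / (alpha + beta)
    N_opt = G * (C / 6.0) ** a
    D_opt = (1.0 / G) * (C / 6.0) ** b
    return N_opt, D_opt

# Since a + b = 1, the allocation satisfies the constraint FLOPs = 6*N*D = C.
N_opt, D_opt = optimal_allocation(1e21)
assert math.isclose(6.0 * N_opt * D_opt, 1e21, rel_tol=1e-9)
```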
3. Method: Data-Constrained Scaling Laws
We are interested in scaling behavior in the data-constrained regime. Specifically, given a limited amount of unique data, what is the best Allocation of, and Return for, computational resources? Prior work (Kaplan et al., 2020; Hoffmann et al., 2022) assumes that the necessary data to support scaling is unlimited. Our aim is therefore to introduce a modified version of Equation 2 that accounts for data constraints and fit the terms in the modified scaling law to data from a large body of experiments.
The primary method we consider is repeating data, i.e. allocating FLOPs to multiple epochs on the same data. Given a budget of unique data D_C, we split the Chinchilla total data term D into two parts: the number of unique tokens used, U_D, and the number of repetitions, R_D (i.e. epochs − 1). Given total training tokens D and data budget D_C, these terms are simply computed as U_D = min{D_C, D} and R_D = (D/U_D) − 1. When training for a single epoch, as done in prior scaling studies, R_D = 0. We are thus interested in minimizing Equation 1 with the additional constraint of a data budget D_C:
$$\underset{N,D}{\operatorname{argmin}}\; L(N, D) \quad \text{s.t.} \quad \mathrm{FLOPs}(N, D) = C,\; U_D \leq D_C \tag{4}$$
Symmetrically, for mathematical convenience, we split the parameter term N into two parts: the base number of parameters needed to optimally fit the unique tokens, U_N, and the number of times to repeat this initial allocation, R_N. We compute U_N by first rearranging Equation 3 to find the optimal compute budget for the unique tokens used (U_D). We input this value into the N_opt formula of Equation 3 to get U_N = min{N_opt, N}. U_N thus corresponds to the compute-optimal number of parameters for U_D, or less if N < N_opt. Once we have U_N, we compute the repeat value as R_N = (N/U_N) − 1.

To empirically explore the scaling behavior in a data-limited setting, we train LLMs under these constraints. We consider three different experimental protocols in this work:
(Fixed Unique Data) In §5 we fix the data constraint D_C and train models varying epochs and parameters. These experiments target Allocation, specifically the trade-off of D and N.
(Fixed FLOPs) In §6 we fix the computation available and vary D_C (and thus also U_D and U_N). These experiments target Return, i.e. how well repeating scales compared to having more unique data.
(Parametric Fit) We fit a formula introduced in §3.1 on all our training runs and evaluate its predictive capability throughout §5 and §6.
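As a minimal sketch (the helper names are ours), the decomposition into unique and repeated parts can be written as:

```python
# Decompose total tokens D and parameters N into "unique" and "repeat" parts:
# U_D = min(D_C, D) unique tokens, R_D = D/U_D - 1 repetitions (epochs - 1);
# symmetrically U_N, R_N for parameters given a compute-optimal size.
def split_data(D: float, D_C: float) -> tuple:
    U_D = min(D_C, D)
    R_D = D / U_D - 1.0
    return U_D, R_D

def split_params(N: float, N_opt_for_U_D: float) -> tuple:
    # N_opt_for_U_D is the compute-optimal parameter count for U_D tokens,
    # e.g. obtained from Chinchilla-style coefficients (Equation 3).
    U_N = min(N_opt_for_U_D, N)
    R_N = N / U_N - 1.0
    return U_N, R_N

# Training 400B total tokens on a 100B-token budget means 4 epochs: R_D = 3.
U_D, R_D = split_data(400e9, 100e9)
assert (U_D, R_D) == (100e9, 3.0)
```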
Before discussing experimental results we describe the parametric assumptions.
3.1 Parametric Fit
To extrapolate scaling curves, it is necessary to incorporate repetition into the Chinchilla formula (Equation 2). We generalize Equation 2 by replacing D and N with terms corresponding to the effective data (D′) and the effective model parameters (N′):
$$L(N, D) = \frac{A}{N'^{\alpha}} + \frac{B}{D'^{\beta}} + E \tag{5}$$
Intuitively, D′ should be smaller than or equal to D, where D is the total number of processed tokens, since repeated tokens provide less useful information to the model than new ones. We use an exponential decay formulation, in which the value of a data token decays by roughly a factor of (1 − 1/R_D^*) per repetition, where R_D^* is a learned constant. After some derivations and approximations (see Appendix B), this boils down to
$$D' = U_D + U_D R_D^*\left(1 - e^{-R_D/R_D^*}\right) \tag{6}$$
Note that for R_D = 0 (no repetitions), D′ = U_D = D. For R_D ≪ R_D^*, we have e^{-R_D/R_D^*} ≈ 1 − R_D/R_D^*, and so

$$D' \approx U_D + U_D R_D^*\left(1 - 1 + \frac{R_D}{R_D^*}\right) = U_D(1 + R_D) = D,$$

and hence in this case repeated data is worth almost the same as fresh data. (This is also consistent with the predictions of the deep bootstrap framework (Nakkiran et al., 2021b).) As R_D grows, the value of repeated tokens tends to zero, and the effective data D′ becomes much smaller than D. The formula implies that no matter how many times we repeat the data, we will not get a better loss than could be obtained with a single epoch on U_D + U_D R_D^* fresh tokens.

Just as processing repeated tokens yields diminishing returns, both intuitively and empirically, models whose size vastly outstrips the available data also offer diminishing returns per parameter. Hence we use a symmetric formula for the number of effective parameters, where again R_N^* is learned:
$$N' = U_N + U_N R_N^*\left(1 - e^{-R_N/R_N^*}\right)$$
The learned constants R_D^* and R_N^* roughly correspond to the "half-life" of repeated data and excess parameters. For example, at R_D = R_D^*, the number of effective tokens D′ is U_D + U_D R_D(1 − e^{-1}), which means that the U_D R_D repeated tokens are worth on average a 1 − 1/e fraction of fresh ones. Using a methodology similar to Hoffmann et al. (2022), R_N^* and R_D^* can be fit on empirical measurements, which yields data-driven estimates. See Appendix B for more details on the derivations and the fitting procedure.
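Putting the effective-data formulation together, a hedged sketch of the data-constrained loss follows; the constants are placeholders for fitted values, and the function names are ours:

```python
import math

# Sketch of the data-constrained loss: plug the effective data D' and
# effective parameters N' into the Chinchilla-style form of Equation 5.
def effective(U: float, R: float, R_star: float) -> float:
    # U + U * R_star * (1 - exp(-R / R_star)): repeated units decay in value,
    # saturating at U * (1 + R_star) no matter how many repetitions are used.
    if R_star == 0.0:
        return U
    return U + U * R_star * (1.0 - math.exp(-R / R_star))

def data_constrained_loss(U_N, R_N, U_D, R_D,
                          A, B, E, alpha, beta, Rn_star, Rd_star):
    N_eff = effective(U_N, R_N, Rn_star)
    D_eff = effective(U_D, R_D, Rd_star)
    return E + A / N_eff**alpha + B / D_eff**beta

# At R_D = 0 (a single epoch), D' = U_D, recovering the Chinchilla equation.
assert effective(100e6, 0.0, 5.0) == 100e6
```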
4. Experimental Setup
Figure 2: Data set setup. We ensure runs using less data (more epochs) always use a subset of the data used in runs with more data (fewer epochs).
For all experiments, we train transformer language models with the GPT-2 architecture and tokenizer (Radford et al., 2019). Models have up to 8.7 billion parameters and are trained for up to 900 billion total tokens. Following Hoffmann et al. (2022), we use cosine learning rate schedules that decay 10× over the course of training for each model (different schedules led to different estimates in Kaplan et al. (2020)). Unlike Kaplan et al. (2020), we do not use early stopping, so as to also explore the extent of overfitting when repeating. Other hyperparameters are based on prior work (Rae et al., 2021; Hoffmann et al., 2022) and detailed in Appendix I.

Models are trained on subsets of C4 (Raffel et al., 2020). The data constraints are carefully defined to ensure maximal overlap, as shown in Figure 2. Unlike Hernandez et al. (2022), we always repeat the entire available data rather than subsets of it. Data is shuffled after each epoch. As repeating data can result in extreme
Figure 3: IsoLoss contours for 100 million unique tokens. Left: 93 models trained with varying parameters and epochs on a fixed data set. Contours show an interpolation of results with the same final test loss. Right: Comparison with the loss predictions from our proposed scaling laws for the same budget of 100 million unique tokens and the predicted efficient frontier. The diminishing returns from training on repeated data can be seen in the increasing distance between the contour curves.
overfitting (see §6.1), we report loss on a held-out test set unless otherwise specified (see Appendix D). This contrasts with the training loss used in Hoffmann et al. (2022), but should not alter our findings, as the held-out data stems from the same underlying data set.
5. Results: Resource Allocation for Data-Constrained Scaling
Our first experimental setting considers scaling in a setting where all models share the same data constraint. For these experiments, the unique training data budget D_C is fixed at either 100M, 400M, or 1.5B tokens. For each data budget, we train a set of language models with increasing amounts of compute that is allocated to either more parameters or more epochs on the unique training data. Figure 3 (left) shows the main results for scaling with 100M unique tokens² and Figure 4 for 400M and 1.5B tokens. For 100M tokens, the corresponding one-epoch compute-optimal model according to the scaling laws of Hoffmann et al. (2022) has U_N of approximately 7M parameters (see Appendix C for the scaling coefficients we use). Results show that more than a 50%
2. Although small, this is, for example, the order of magnitude of a realistic data constraint reflecting data available after filtering the OSCAR data set (Ortiz Suárez et al., 2019) for Basque, Punjabi, or Slovenian.
Figure 4: Empirical IsoLoss curves for 400 million and 1.5 billion unique tokens. 34 models trained on 400 million unique tokens and 37 models trained on 1.5 billion unique tokens, with varying parameters and epochs.
reduction in loss can be attained by training for several epochs (R_D > 0) and increasing model size beyond what would be compute-optimal for 100M tokens (R_N > 0). We find the best loss at around 20–60× more parameters and epochs, which corresponds to spending around 7000× more FLOPs. These results suggest that one-epoch models significantly under-utilize their training data, and more signal can be extracted by repeating data and adding parameters at the cost of sub-optimal compute utilization.

Figure 3 (right) shows the predicted contours created by fitting our data-constrained scaling laws on 182 training runs. In the single-epoch case (R_D = 0) with near compute-optimal parameters (R_N = 0), our scaling equation (§3.1) reduces to the Chinchilla equation. In this case, both formulas predict the optimal allocation of compute to parameters and data to be the same, resulting in overlapping efficient frontiers.

As data is repeated for more than a single epoch, our fit predicts that excess parameters decay faster in value than repeated data (R_N^* < R_D^*). As a result, the data-constrained efficient frontier suggests allocating most additional compute to more epochs rather than more parameters. This contrasts with the Chinchilla scaling laws (Hoffmann et al., 2022), which suggest scaling both equally. However, note that they do not repeat the entire training data, and their parametric fit explicitly relies on the assumption that models are trained for a single epoch only. Thus, there is no guarantee that their scaling predictions hold for repeated data.

For all three data budgets, our results suggest that Allocation is optimized by scaling epochs faster than parameters. We confirm this at scale by training the data-constrained compute-optimal model for 9.3 × 10²¹ FLOPs and 25 billion unique tokens as suggested
by our efficient frontier. Despite having 27% fewer parameters, this model achieves better loss and downstream performance than the model suggested by the Chinchilla scaling laws (Figure 1 (right) and Table 8). Similarly, the 120 billion parameter Galactica model trained on repeated data should have been significantly smaller according to data-constrained scaling laws (§7). An additional benefit of using a smaller model is cheaper inference, though adding parameters can make it easier to parallelize training across GPUs.

Adding parameters and epochs causes the loss to decrease and eventually increase again, suggesting that too much compute can hurt performance. Results from Kaplan et al. (2020) also show that loss can increase when too many parameters are used, even with early stopping. However, we expect that appropriate regularization (such as simply removing all excess parameters, as an extreme case) could prevent this behavior. Thus, our formula presented in §3 and its predicted IsoLoss contours in Figure 3 do not model the possibility that excess epochs or parameters could hurt performance.
5.1 Double Descent
Prior work has reported double descent phenomena when repeating data, where the loss initially increases and then decreases again as the model is trained for more epochs (Nakkiran et al., 2021a; Hernandez et al., 2022). In Figure 5, we plot the loss curves of several models trained for varying epochs on 100 million tokens. We find double descent phenomena, with the loss of all models increasing at around 200 epochs before decreasing again. These samples can also be seen in Figure 3. This contributes additional noise to the fitting of our functions in Appendix B, as our functional form assumes loss to be monotonically decreasing as epochs increase. We thus remove most such examples from the fitting.
5.2 Repeating on Heavily Deduplicated Data
To investigate whether Figure 3 depends on the inherent amount of duplicates in the selected 100 million tokens, we train several models on a deduplicated version of C4 (see Appendix G). We plot the performance of the models trained on the deduplicated C4 versus the regular C4 in Figure 6. All models are evaluated on the same validation data set from the regular C4. Regardless of deduplication, we find 59 epochs to be optimal and the overall trend to be very similar. Together with our results on OSCAR (§6.2), this suggests that our findings generalize to different data sets with different inherent amounts of duplicates.
5.3 Do Excess Parameters Hurt, Plateau or Help?
Figures 3 and 4 suggest that excess parameters (or epochs) can harm performance. We hypothesize that this is due to suboptimal hyperparameters and could be prevented with better regularization. Thus, we expect that with optimal regularization hyperparameters, excess parameters would never hurt; performance would merely plateau, as in extreme cases regularization could simply take the form of removing the excess parameters. One approach to selecting optimal hyperparameters is µP (Yang et al., 2021). We compare excessively large models trained with a data constraint of D_C = 100 million tokens in Figure 7 across µP, our default hyperparameters (Appendix I), and scaling law predictions. Surprisingly, µP leads to even higher test loss than our default hyperparameters. Nevertheless, we find that also with µP
[Figure 5 axes: final test loss relative to starting point vs. epochs (15 to 9000), for models with 14 million, 44 million, and 83 million parameters.]
Figure 5: Double descent. Each dot is a model trained on 100 million unique tokens. Loss initially increases at 200 epochs and then decreases again; this is known as epochwise double descent (Nakkiran et al., 2021a).
Figure 6: Optimal loss on deduplicated data. 146 million parameter models trained on 100 million unique tokens that are either directly from C4 or undergo additional deduplication. Each dot is a single model. While deduplication results in a higher test loss, the optimal number of epochs remains the same whether or not deduplication is performed as in Figure 3.
[Figure 7 legend: final test loss vs. parameters (250M to 2B); empirical loss with parameter selection according to µP vs. standard parameter selection, and predicted loss from scaling laws: data-constrained (this work) and a variant allowing excess parameters to hurt via alpha-beta decay.]
Figure 7: Empirical and predicted losses of LLMs trained on 100 million tokens for a single epoch. Excess parameters empirically hurt performance, but this may be due to a lack of regularization. Thus, our scaling formula predicts loss to plateau, while Chinchilla predicts loss to improve. By decaying the exponent α (and β) instead, one can allow excess parameters to hurt.
excessive parameters hurt: the models with more than 2 billion parameters have significantly higher validation loss after training than the models with 200 million to 1 billion parameters when trained on only 100 million tokens. However, µP only covers hyperparameters such as the learning rate, not explicit regularization hyperparameters like dropout rates, which we hypothesize would prevent this behavior. Thus, our proposed scaling equations predict loss to plateau, as seen in the straight line. As the compute-optimal parameter count for 100 million tokens is around 7 million, all depicted models have a significant amount of excess parameters, and data-constrained scaling laws predict their losses to all be the same (R_N^* ≪ R_N). Meanwhile, the default Chinchilla scaling law (Hoffmann et al., 2022) predicts loss to continue decreasing as parameters are added, which is in stark contrast to the empirical data.

If one wants to incorporate excess parameters hurting performance into the scaling law equations, one could consider (a) modifying the exponential decay formulation introduced in Appendix B such that instead of the value of repeated data decaying to 0 it decays to a large negative value, or (b) decaying the exponents α and β in Equation 8 instead of D and N. Decaying the exponents to 0 has the effect of more repetitions eventually hurting performance, as lim_{α→0} D^α = 1 and the same for β. Thus, initially, as D and N increase, loss decreases, but ultimately the decay of α and β pushes the terms N^α and D^β back to 1, resulting in loss increasing. Specifically, approach (b) could take the form of:
$$L(N, D, R_N, R_D) = E + \frac{A}{N^{\alpha \max\left(0,\, 1 - R_N/R_N^*\right)}} + \frac{B}{D^{\beta \max\left(0,\, 1 - R_D/R_D^*\right)}} \tag{7}$$
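A hedged sketch of this alpha-beta decay variant (constants are placeholders for fitted values; the function name is ours):

```python
# Sketch of the alpha-beta decay variant (Equation 7): the exponents alpha
# and beta themselves decay with repetitions R_N and R_D, so sufficiently
# many excess parameters or epochs eventually increase the loss.
def alpha_beta_decay_loss(N, D, R_N, R_D,
                          A, B, E, alpha, beta, Rn_star, Rd_star):
    a_eff = alpha * max(0.0, 1.0 - R_N / Rn_star)
    b_eff = beta * max(0.0, 1.0 - R_D / Rd_star)
    # As a_eff -> 0, N**a_eff -> 1, so the parameter term climbs back to A,
    # which is what lets excess parameters hurt in this formulation.
    return E + A / N**a_eff + B / D**b_eff
```

With R_N = R_D = 0 the exponents are unchanged and the formula reduces to the Chinchilla form.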
Figure 8: IsoLoss contours for 100 million unique tokens, with contours predicted by parametric decay of alpha and beta. The same models as in Figure 3, with the contour predictions made by the alpha-beta decay formulation introduced in §5.3.
Like the equations in Appendix B, this formulation also reduces to the Chinchilla scaling laws in the base case of R_D = 0 or R_N = 0. As the exponents decrease with more repetitions, adding parameters or epochs becomes less beneficial. Eventually, the decay in α or β causes loss to increase again, as it pushes N^α or D^β back down to 1.

We fit this formula using the same approach outlined in Appendix B, but including samples where excess parameters or epochs hurt (296 total samples). We use a grid of initializations given by R_N^* ∈ {0, 2000, ..., 100000} and R_D^* ∈ {0, 2000, ..., 100000}. This results in R_D^* = 26530.611 and R_N^* = 2040.8163. R_N^* is significantly lower, resulting in excess parameters hurting faster than excess epochs, which is in line with the empirical data from Figure 3.

We visualize Figure 3 with the predictions from this alpha-beta decay formulation in Figure 8. Excess parameters eventually hurt, resulting in circle-shaped contours. Due to the very high R_D^*, the area where epochs start to hurt is outside the boundaries of Figure 8. While the predicted optimal allocation (efficient frontier) is similar to Figure 3, the predicted return from repeated data differs significantly. The alpha-beta decay formulation incorrectly predicts returns to diminish significantly more slowly, as seen by the longer efficient frontier and the smaller distance between contours early on as compared to Figure 3. Beyond its potentially useful properties, we do not have a rigorous mathematical justification for this alpha-beta decay formulation, which could be the cause of the incorrect return predictions.

Ultimately, we settle on our exponential decay formulation from Appendix B, which does not allow excess parameters or epochs to hurt, as preventing such behavior is trivial by stopping
| FLOP budget (C) | Parameters (N) | Training tokens (D) | Data budgets (D_C) |
|---|---|---|---|
| 9.3 × 10²⁰ | 2.8B | 55B | {55, 28, 18, 14, 11, 9, 4, 1.25}B |
| 2.1 × 10²¹ | 4.2B | 84B | {84, 42, 28, 21, 17, 12, 6, 1.9}B |
| 9.3 × 10²¹ | 8.7B | 178B | {178, 88, 58, 44, 35, 25, 13, 4}B |
Figure 9: Validation loss for different data constraints (IsoFLOP). Each curve represents the same number of FLOPs spent on an equal-sized model. Colors represent different numbers of epochs due to repeating because of data constraints. Parameters and training tokens are set to match the single-epoch compute-optimal configurations for the given FLOPs. Models trained on data that is repeated for multiple epochs have consistently worse loss and diverge if too many epochs are used. Only the loss curves for the 8.7B runs are smoothed, with an exponential moving average and weight of 0.85.
training (in the case of epochs hurting) or removing excess parameters (in the case of model parameters hurting). Further, accurately predicting how much loss increases in the limit is not very useful, as in practice one would want to stop training when loss is expected to plateau anyway.
6. Results: Resource Return for Data-Constrained Scaling
Next, consider the question of Return on scaling. To quantify this value, we run experiments with three FLOP budgets across eight respective data budgets to compare return on FLOPs. Figure 9 shows the configurations and validation curves for models trained on the same number of total tokens. Conforming to intuition and prior work on deduplication (Lee et al., 2021), repeated data is worth less, thus models trained on less unique data (and, correspondingly, more epochs) have consistently higher loss. However, the loss difference for
[Figure: left panel plots final test loss against the fraction of unique training tokens (100%, 50%, 25%, 14%, 10%) for the 8.7B/178B, 4.2B/84B, and 2.8B/55B configurations (empirical loss, fixed training length); right panel extrapolates loss over total training tokens from 10B to 100T for the three model sizes, with markers at 8, 16, 32, and 64 epochs (predicted loss, variable training length). Curves show the loss of trained models, the loss assuming training stops when all unique data is exhausted, the loss assuming repeated data is worth the same as new data, and the loss predicted by our data-constrained scaling laws. Annotation: repeating for 4 epochs is almost as good as new data.]
Figure 10: Empirical and extrapolated loss with constrained data. Left: Loss as a function of repeated tokens for three different training budgets, each with a fixed number of parameters. Loss curves predicted by our data-constrained scaling laws are shifted to exactly match the loss at 100% unique data. Return on FLOPs decays with repeated data in a regular pattern. Right: Extrapolating the proposed data-constrained scaling law shows that repetition is benign at small epoch counts, but loss stops improving at large ones.
a few epochs is negligible. For example, the N = 8.7 billion parameter model trained for four epochs (DC = 44 billion unique tokens) finishes training with only 0.5% higher validation loss than the single-epoch model (DC = 178 billion unique tokens). In Figure 10 (left), we compare the final test loss of each model to predictions from our parametric fit. The data-constrained scaling laws accurately capture the decay in the value of repeated data, as seen by the proximity of empirical results (dots) and parametric fit (lines). We note, however, that the fit significantly underestimates the final test loss of failing models whose loss increases midway through training, such as models trained for 44 epochs (not depicted). In Figure 10 (right), we extrapolate the three budgets by further scaling compute while keeping the data constraints (DC) at 55B, 84B, and 178B tokens, respectively. The parameter R*_D introduced in Section 3 represents roughly the half-life of epochs: specifically, the point where repeated tokens have lost 1/e of their value. Through our fitting in Appendix B, we found R*_D ≈ 15, corresponding to 15 repetitions (or 16 epochs). Graphically, this can be seen in the stark diminishing returns near the 16-epoch marker and the flattening out soon after. Overall, the Return when repeating data is relatively good: meaningful gains can be made for up to around 16 epochs (R*_D), beyond which returns diminish extremely fast.
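The half-life interpretation of R*_D can be made concrete with the effective-data form of Equation 14. A minimal sketch (the function name is ours) using R*_D = 15 from our fit:

```python
import math

def effective_data(U, R_D, R_star=15.0):
    """Equation 14: D' = U + U * R*_D * (1 - exp(-R_D / R*_D)),
    where U is the number of unique tokens and R_D the number of
    repetitions (epochs - 1)."""
    return U + U * R_star * (1.0 - math.exp(-R_D / R_star))

# The marginal value of one more epoch, in units of unique data, shrinks
# sharply once repetitions approach R*_D = 15 (roughly 16 epochs):
for epochs in (2, 4, 16, 64):
    gain = effective_data(1.0, epochs - 1) - effective_data(1.0, epochs - 2)
    print(f"epoch {epochs:>2}: marginal value = {gain:.3f} unique tokens")
```

The printed gains fall from nearly a full token's worth at the second epoch toward zero well past the 16-epoch mark, matching the flattening visible in Figure 10 (right).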
[Figure: three panels of training loss over training tokens for 2.8B parameters trained for 55B tokens, 4.2B parameters trained for 84B tokens, and 8.7B parameters trained for 178B tokens; curve colors denote 1, 2, 3, 4, 5, 7, 14, and 44 epochs.]
Figure 11: Training loss smoothed with an exponential moving average (weight 0.999). Models trained on fewer unique tokens (more epochs) have better training loss as they overfit.
6.1 Training Loss
Hoffmann et al. (2022) use training loss as their core metric. However, when repeating data for multiple epochs, training loss is a poor metric, as models overfit to the limited data available, as shown in Figure 11. Thus, we use loss on a held-out test set as our key performance metric.
6.2 Scaling Curves on the OSCAR Corpus
To ensure our findings are not data set-dependent, we train models with the same configurations from Figure 9 on the OSCAR corpus (Ortiz Suárez et al., 2020). OSCAR is considered noisier than C4 (Raffel et al., 2020) due to its less stringent filtering. Figures 12 and 13 depict the validation and training loss of these models. We find the trend to be the same as for models trained on C4: while models with fewer repeats have better loss, differences for a few repeats are insignificant.
6.3 Validation Loss by Epoch
Taylor et al. (2022) decided to early-stop pre-training of the Galactica models due to a small increase in validation loss at the start of the fifth epoch. In Figure 14 we plot the validation loss curves of our iso FLOP models as a function of epochs. We do find small increases in validation loss when models enter a new epoch. For example, upon entering the third and fourth epoch, the 7-epoch 8.7 billion parameter OSCAR model shows loss spikes. However, these are temporary and loss continues to go down smoothly thereafter. Thus, we hypothesize
[Figure: three panels of validation loss over training tokens for 2.8B parameters trained for 55B tokens, 4.2B parameters trained for 84B tokens, and 8.7B parameters trained for 178B tokens on OSCAR; curve colors denote 1, 2, 3, 4, 5, 7, 14, and 44 epochs.]
Figure 12: Validation loss during training for models trained on OSCAR. Models trained on tokens that are repeated for multiple epochs have consistently worse loss.
[Figure: three panels of training loss over training tokens for 2.8B parameters trained for 55B tokens, 4.2B parameters trained for 84B tokens, and 8.7B parameters trained for 178B tokens on OSCAR; curve colors denote 1, 2, 3, 4, 5, 7, 14, and 44 epochs.]
Figure 13: Training loss for models trained on OSCAR, smoothed with an exponential moving average (weight 0.999). Models trained on fewer unique tokens (more epochs) have better training loss as they overfit.
[Figure: validation loss plotted against epochs. Panels: (a) 2.8B parameters trained on C4, (b) 4.2B parameters trained on C4, (c) 8.7B parameters trained on C4, (d) 2.8B parameters trained on OSCAR, (e) 4.2B parameters trained on OSCAR, (f) 8.7B parameters trained on OSCAR; the top row shows C4 validation loss and the bottom row OSCAR validation loss, with curve colors denoting the number of epochs.]
Figure 14: Validation loss during training visualized by epochs. Loss progresses smoothly throughout training. There are temporary spikes for 8.7 billion parameter models, commonly at the start of a new epoch.
[Figure: efficient frontiers over parameters and tokens within the regime of equal compute (IsoFLOP), showing the Galactica models (120B and 30B parameters; the 120B model trained on 450B tokens, i.e. 4.25 epochs) against the frontier assuming repeated data is worth the same as new data and the frontier predicted by our data-constrained scaling laws. Annotation: the optimal allocation uses roughly 3× fewer parameters and 3× more epochs, i.e. 1.35T tokens (12.75 epochs).]
Figure 15: Optimal compute allocation for Galactica. Shown are the efficient frontier assuming repeated data is worth the same as new data (Chinchilla scaling laws) and the data-constrained efficient frontier assuming a unique token budget of 106 billion tokens, as for the Galactica models (Taylor et al., 2022). According to our proposed data-constrained scaling laws, the 120 billion parameter Galactica model should have been significantly smaller and trained for more epochs.
that the Galactica models could have attained better performance by continuing pre-training beyond the loss spike experienced at the beginning of the fifth epoch.
7. Case Study: Galactica
The Galactica models (Taylor et al., 2022) are the only publicly known LLMs that explicitly trained for a significant number of epochs prior to this work. They trained their models on 106 billion unique tokens for 4.25 epochs. Our findings on Return from repeated data agree with their conclusion that multiple epochs are beneficial; however, we find that even more epochs can be beneficial, and a small spike in validation loss does not justify stopping training (Section 6.3). Meanwhile, our findings on Allocation significantly deviate from Galactica. Figure 15 visualizes the Galactica models with our predicted efficient frontier in the same
[Figure: left, a schematic of strategies for filling a fixed data budget: repeating the available data, filling missing data with code, and deduplicating or perplexity-filtering a larger pool before repeating; right, average performance on 19 tasks (%) against the available data budget (100%, 50%, 25%, 10%) for the strategies repeating data, filling missing data with Python code, perplexity-filter then repeat, and deduplicate then repeat.]
Figure 16: Strategies for data-constrained settings and their downstream performance. Left: Schematic showing alternative data use strategies of code filling and filtering. Right: N = 4.2 billion parameter models trained for a total of D = 84 billion tokens with varying budgets DC. For repeating and filling with code, five models with different seeds are trained for each dot and the standard deviation is visualized as the shaded area.
style as Figure 1. The creators of Galactica decided to train a 120 billion parameter model on 450 billion tokens, a significant overallocation to parameters even in Chinchilla terms (black efficient frontier). This decision was likely driven by the intuition that, because repeated data is worth less, one should spend more compute on parameters. However, our empirical data contradicts this: parameters learning from repeated data are worth even less than the repeated data itself, so one should overallocate to epochs, not parameters. Our data-constrained scaling laws thus predict that a better model could have been trained by allocating significantly more FLOPs to epochs rather than parameters for the largest Galactica model with 120 billion parameters. Specifically, 40 billion parameters trained for 1.35 trillion tokens (12.75 epochs) would have been optimal according to the data-constrained scaling laws. Note that these scaling laws were fitted on C4, which is not the data set used to pre-train Galactica. The Galactica models are pre-trained on a predominantly scientific data set, which includes code data among other sources. Results from Hoffmann et al. (2022) show that the scaling coefficients differ when training on C4 as compared to GitHub code; however, the overall allocation trend is the same. Thus, while we expect a smaller model trained for more epochs to be better than the 120 billion parameter model, the optimal allocation is unlikely to be exactly 40 billion parameters and 1.35 trillion tokens. While this is a lot of tokens, we point to follow-up work that has successfully repeated pre-training data in the trillion-token regime based on our results (Lozhkov et al., 2024).
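The allocation argument can be sketched numerically with the data-constrained law of Equation 15. The constants below are illustrative stand-ins, not our fitted values: A, B, E, α, β follow Hoffmann et al. (2022), R*_D = 15 matches the fit described in Section 6, and R*_N is an assumed round number. The function names are ours.

```python
import math

# Illustrative stand-in constants (NOT the paper's fitted C4 values).
A, B, E = 406.4, 410.7, 1.69
ALPHA, BETA = 0.34, 0.28
R_STAR_D, R_STAR_N = 15.0, 5.0

def _effective(unique, repeats, r_star):
    # Shared decay form U + U * R* * (1 - exp(-R / R*)) from Equation 14.
    return unique * (1.0 + r_star * (1.0 - math.exp(-repeats / r_star)))

def data_constrained_loss(N, D, U):
    """Equation 15 for N parameters, D total tokens, U unique tokens."""
    G = (ALPHA * A / (BETA * B)) ** (1.0 / (ALPHA + BETA))
    U_N = min(((U * G) ** (BETA / ALPHA)) * G, N)  # "unique" parameters for U
    R_N = max(N / U_N - 1.0, 0.0)                  # excess-parameter repeats
    R_D = max(D / U - 1.0, 0.0)                    # data repeats (epochs - 1)
    return (A / _effective(U_N, R_N, R_STAR_N) ** ALPHA
            + B / _effective(U, R_D, R_STAR_D) ** BETA + E)

# Galactica's budget: C = 6 * N * D FLOPs with U = 106B unique tokens.
C, U = 6 * 120e9 * 450e9, 106e9
for N in (120e9, 60e9, 40e9, 20e9):
    D = C / (6 * N)
    print(f"N = {N / 1e9:5.0f}B, D = {D / 1e12:4.2f}T, "
          f"predicted loss = {data_constrained_loss(N, D, U):.3f}")
```

Even under these stand-in constants, the sweep reproduces the qualitative conclusion: at Galactica's compute budget, a smaller model trained for more epochs attains a lower predicted loss than 120 billion parameters trained for 450 billion tokens.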
8. Results: Complementary Strategies for Obtaining Additional Data
While repeating data is effective, it has diminishing returns. We therefore consider strategies for scaling D targeting improved downstream performance as opposed to directly minimizing loss.
Figure 16 (left) illustrates the strategies: (a) Code augmentation: We use Python code from The Stack (Kocetkov et al., 2022) to make up for missing natural language data. The combined data set consisting of code and natural language samples is shuffled randomly. (b) Adapting filtering: We investigate the performance impact of deduplication and perplexity filtering, two common filtering steps that can severely limit available data. Removing such filtering steps can free up additional training data.

For these experiments, we set a maximum data budget (DC) of 84 billion tokens. For repetition and code filling, only a subset of DC is available and the rest is compensated for by repeating data or adding code. For both filtering methods, we start out with approximately twice the budget (178 billion tokens), as it is easier to gather noisy data and filter it than it is to gather clean data for training. For perplexity filtering, we select the top 25% of samples with the lowest perplexity according to a language model trained on Wikipedia. This results in 44 billion tokens that are repeated for close to two epochs to reach the full data budget. For deduplication filtering, all samples with a 100-char overlap are removed, resulting in 21 billion tokens that are repeated for four epochs during training. See Appendix G for more details on the filtering procedures.

When comparing across data strategies, loss ceases to be a good evaluation metric as the models are trained on different data distributions. We thus evaluate models on 19 natural language tasks with zero to five in-context few-shot exemplars (Brown et al., 2020), producing 114 scores per model. As our evaluation tasks cover different metrics and random baselines, we re-scale all scores to be in the same range to better reflect performance ranges before averaging. Details on the evaluation data sets are in Appendix D. In Figure 16 (right) we compare the downstream performance of all strategies.
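The re-scaling of scores across tasks with different metrics and random baselines can be sketched as a min–max normalization from each task's chance level to its maximum score (a generic sketch; our exact scheme is described in Appendix D, and the function name is ours):

```python
def rescale(score, random_baseline, max_score=1.0):
    """Map a raw task score onto [0, 1], where 0 is the random baseline and
    1 the maximum attainable score, so tasks with different metrics and
    chance levels can be averaged fairly."""
    return (score - random_baseline) / (max_score - random_baseline)

# Two hypothetical tasks with different chance levels:
scores = [
    (0.625, 0.25),  # 4-way multiple choice: chance accuracy = 0.25
    (0.70, 0.50),   # binary classification: chance accuracy = 0.50
]
average = sum(rescale(s, b) for s, b in scores) / len(scores)
print(average)
```

Without the rescaling, the binary task's higher raw score would dominate the average even though it is closer to chance.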
For repeating data, differences in downstream performance are insignificant for up to around 4 epochs (25% budget) and then start dropping, which aligns with our results on test loss in Section 6. This also confirms related work showing that perplexity and downstream performance generally exhibit similar trends (Gadre et al., 2024; Isik et al., 2024). Filling up to 50% of data with code (42 billion tokens) also shows no deterioration. Beyond that, performance decreases quickly on natural language tasks. However, adding more code data may benefit non-natural-language tasks, which are not considered in the benchmarking. Two of the tasks benchmarked, WebNLG (Castro Ferreira et al., 2020; Gehrmann et al., 2021), a generation task, and bAbI (Weston et al., 2015; Liang et al., 2022), a reasoning task, see jumps in performance as soon as code is added, possibly because code enables models to learn long-range state-tracking capabilities beneficial for these tasks.

Of the filtering approaches, we find perplexity filtering to be effective, while deduplication does not help. Prior work found that deduplication was able to improve perplexity (Lee et al., 2021); however, it did not evaluate on downstream tasks. Deduplication may have value not captured in our benchmark, such as reducing memorization (Kandpal et al., 2022; Hernandez et al., 2022; Carlini et al., 2022; Biderman et al., 2023). We also investigate filtering on a different, noisier data set in Appendix H, where we find it to be more effective. Overall, in a data-constrained regime, we recommend reserving filtering for noisy data sets and using both code augmentation and repeating to increase data tokens. For example, first doubling the available data by adding code and then repeating the new data set for four epochs results in 8× more training tokens that are expected to be just as good as having had 8× more unique data from the start.
[Figure: two panels over training tokens (5B to 85B) showing C4 validation loss (left) and Python validation loss (right) for models trained on 10% to 90% Python code, the rest C4.]
Figure 17: Validation loss of models trained on a mix of natural language (C4) and Python data.
[Figure: validation loss (left) and training loss (right) over training tokens (5B to 85B) for the strategies deduplicate then repeat (4 epochs), perplexity-filter then repeat (2 epochs), repeat (4 epochs), and repeat (2 epochs).]
Figure 18: Validation and training loss of models trained with different data strategies. Training loss is smoothed with an exponential moving average (weight 0.999). Downstream performance of the models is shown in Figure 16.
8.1 Loss Curves for Complementary Strategies
To compare the complementary data strategies, we have used downstream performance on the natural language tasks detailed in Appendix D instead of loss. This is because validation loss gives an unfair advantage to models trained on a larger fraction of data from the same distribution.
For example, when making up for missing natural language data with code, models trained on more code will have better validation loss on code data while having worse loss on natural language data, as seen in Figure 17: the model pre-trained on 90% Python code and 10% C4 has the highest C4 validation loss but the lowest Python validation loss.

Models trained on deduplicated or perplexity-filtered data have higher validation loss because the held-out validation data has not gone through the same filtering steps. Its distribution thus more closely resembles the training data of models trained on unfiltered data, resulting in worse validation loss for the two filtering strategies in Figure 18 (left). Meanwhile, for training loss in Figure 18 (right), the model trained on perplexity-filtered data has the lowest loss: its training data has been filtered to the top 25% of examples with the lowest perplexity (Appendix G), so high-loss examples have been explicitly removed, resulting in low training loss. The model trained on deduplicated data has the highest validation and training loss. This is because commonly repeated sequences have been filtered out of its training data. Thus, when encountering these common sequences in the unfiltered validation set, its loss is comparatively high, as other models have likely simply memorized them. Similarly, fewer repeated sequences during training result in higher training loss, as unseen sequences are harder to predict.
9. Related Work
9.1 Large Language Models
Scaling up transformer language models (Vaswani et al., 2017) across parameter count and training data has been shown to result in continuous performance gains (Chowdhery et al., 2022). Starting with the 1.4 billion parameter GPT-2 model (Radford et al., 2019), a variety of scaled-up language models have been trained, commonly referred to as large language models (LLMs). They can be grouped into dense models (Brown et al., 2020; Khrushchev et al., 2022; Lieber et al., 2021; Rae et al., 2021; Chung et al., 2022; Black et al., 2022; Zhang et al., 2022; Thoppilan et al., 2022; Su et al., 2022; Taylor et al., 2022; Zeng et al., 2022; Scao et al., 2022a; Li et al., 2023; Luukkonen et al., 2023) and sparse models (Fedus et al., 2021; Zeng et al., 2021; Du et al., 2022; Zoph et al., 2022) depending on whether each forward pass makes use of all parameters. These models are generally pre-trained to predict the next token in a sequence, which makes them applicable to various language tasks directly after pre-training (Brown et al., 2020; Wei et al., 2022; Kojima et al., 2022; Muennighoff, 2022; Srivastava et al., 2022) by reformulating said NLP tasks as context continuation tasks (see McCann et al. (2018) for an earlier proposal on this topic). We focus on the most common scenario, where a dense transformer model is trained to do next-token prediction on a large corpus and evaluated directly after pre-training using held-out loss or zero- to few-shot prompting.
9.2 Scaling Laws
Prior work has estimated an optimal allocation of compute for the training of LLMs. Kaplan et al. (2020) suggested that a 10× increase in compute should be allocated as a 5.5× increase in model size and a 1.8× increase in training tokens. This first scaling law has led to the creation
of very large models trained on relatively little data, such as the 530 billion parameter MT-NLG model trained on 270 billion tokens (Smith et al., 2022). More recent work (Hoffmann et al., 2022), however, showed that model size and training data should rather be scaled in equal proportions. These findings called for a renewed focus on the scaling of pre-training data rather than scaling model size via complex parallelization strategies (Shoeybi et al., 2019; Rasley et al., 2020; Bian et al., 2021; Narayanan et al., 2021). Up-sampling is often employed when pre-training data is partly limited, such as data from a high-quality domain like Wikipedia or text in a rare language for training multilingual LLMs (Lin et al., 2021; Orlanski et al., 2023). Hernandez et al. (2022) study up-sampling of data subsets and find that repeating only 0.1% of training data 100 times significantly degrades performance. In contrast, our work focuses on repeating the entire pre-training corpus for multiple epochs rather than up-sampling parts of it.
9.3 Alternative Data Strategies
Large pre-training data sets are commonly filtered to remove undesired samples or reduce noise (Sorscher et al., 2022). Perplexity-based filtering, whereby a trained model is used to filter out samples with high perplexity, has been found beneficial to reduce noise in web-crawled data sets (Wenzek et al., 2019). Mixing of data is employed for the pre-training data of multilingual LLMs, where text data from different languages is combined (Conneau et al., 2019; Xue et al., 2020; Soltan et al., 2022; Muennighoff et al., 2022). However, both for code and natural language models, mixing different (programming) languages has been reported to underperform monolingual models (Nijkamp et al., 2022; Virtanen et al., 2019). Some work has investigated mixing code and natural language data for prediction tasks, such as summarizing code snippets (Iyer et al., 2016) or predicting function names (Allamanis et al., 2015). Several pre-training data sets for LLMs include low amounts of code data (Gao et al., 2020; Rae et al., 2021; Scao et al., 2022a). However, these past works generally do not provide any ablation on the drawbacks of including code or the benefits for natural language task performance. We perform a detailed benchmarking of mixing Python and natural language in LLM pre-training at 10 different mixing rates.
10. Limitations and Future Work
10.1 Repeating Fractions of the Data
In this work we focus on repeating the entire unique data set for several epochs. Alternatively, one could repeat only a fraction of the data set, for example repeating 10% of the data set for 10 epochs while repeating the rest for only a single epoch, as done by Hernandez et al. (2022). To predict loss in that scenario, one may need to adapt our scaling laws with an additional parameter to account for the fraction that is repeated, and possibly a parameter that captures at what point in training the data is repeated. Repeating earlier in training, when most model weights are still randomly initialized, is likely to cause less damage than repeating later in training. Adapting our parametric fit to make concrete scaling predictions for such scenarios is an exciting future research direction.
10.2 Sensitivity to Hyperparameters
The returns from additional epochs may heavily depend on hyperparameters such as learning rate, dropout, or the optimizer choice. It is likely that increasing the learning rate, for example, would lead to diminishing returns from additional epochs kicking in earlier. In this work, we have fixed most hyperparameters to commonly used values for the training of LLMs and leave such explorations to future work.
10.3 Other Data sets
The optimal data strategy is dependent on the data set at hand, and we cannot give universally applicable filtering recommendations. By looking into C4 and OSCAR, we have covered two of the most commonly used English text data sets. Our findings on both data sets were overall in agreement with each other. We have highlighted some of the differences, such as deduplication being more effective on OSCAR due to it being noisier than C4. Further, we have focused on large-scale pre-training data sets. There is a lot of research on the optimal fine-tuning data set and methodology for LLMs (Sanh et al., 2022; Longpre et al., 2023a; Yong et al., 2022; Ouyang et al., 2022; Wei et al., 2021; Min et al., 2021; Wang et al., 2022; Zhou et al., 2023; Wang et al., 2023; Gupta et al., 2023; Xu et al., 2023; Muennighoff et al., 2023; Longpre et al., 2023b). More investigations of resolving data constraints when fine-tuning LLMs may be of interest for future work.
10.4 Other Modalities or Architectures
Our work focuses on text data sets and uses the GPT transformer architecture (Radford et al., 2019). Prior work has experimented with many variations to the GPT or transformer architecture (Dehghani et al., 2018; Tay et al., 2022a; Scao et al., 2022b), as well as scaling laws for non-text data sets (Aghajanyan et al., 2023). Overall, variations of the GPT or transformer architecture have proven very robust and generalizable to other domains (Huang et al., 2018; Chen et al., 2020; Muennighoff, 2020; Madani et al., 2020; Tay et al., 2022a; Dehghani et al., 2023). Nonetheless, it may be of interest for future work to test the applicability of our findings in this work to different data modalities or model architectures.
10.5 Other Strategies
There are numerous strategies to solve data constraints not covered in this work that are worth exploring. Like we have shown for Python, future research may consider to what extent augmenting with data in one natural language (e.g. Chinese) improves performance in another language (e.g. English) and what the best language to choose is (Lin et al., 2019; Xia et al., 2021). Similarly, while we have looked at deduplication and perplexity filtering, other filtering strategies, such as popularity-based filters (Allal et al., 2023; Zhao et al., 2023) and toxicity filters (Gehman et al., 2020; Henderson et al., 2022; Longpre et al., 2023c; Prabhumoye et al., 2023; Penedo et al., 2023), are worth exploring.
11. Conclusion
This work studies data-constrained scaling, focusing on the optimal use of computational resources when unique data is limited. We propose an extension to the Chinchilla scaling laws that takes into account the decay in value of repeated data, and we fit this function using a large set of controlled experiments. We find that, despite recommendations of earlier work, training large language models for multiple epochs by repeating data is beneficial and that scaling laws continue to hold in the multi-epoch regime, albeit with diminishing returns. We also consider complementary approaches to continue scaling models, and find that code gives the ability to scale data an additional 2×. We believe that our findings will enable further scaling of language models to unlock new capabilities with current data. However, our work also indicates that there are limits on the scaling horizon. In addition to collecting additional data, researchers should explore using current data in a more effective manner.
Acknowledgments
This work was co-funded by the European Union under grant agreement No 101070350. The authors wish to acknowledge CSC IT Center for Science, Finland, for generous computational resources on the LUMI supercomputer.3 We are thankful for the immense support from teams at LUMI and AMD, especially Samuel Antao. Hugging Face provided storage and additional compute instances. This work was supported by a Simons Investigator Fellowship, NSF grant DMS-2134157, DARPA grant W911NF2010021, and DOE grant DE-SC0022199. We are grateful to Harm de Vries, Woojeong Kim, Mengzhou Xia and the Eleuther AI community for exceptional feedback. We thank Loubna Ben Allal for help with the Python data and Big Code members for insightful discussions on scaling laws. We thank Thomas Wang, Helen Ngo and Turku NLP members for support on early experiments.
3. https://www.lumi-supercomputer.eu/
Appendix A. Contributions
Niklas Muennighoff led experiments, analysis, writing, and the overall project. He implemented, trained and evaluated all models. Alexander M. Rush contributed to framing, results analysis, and paper writing. Boaz Barak contributed to formal and experimental analysis as well as paper writing. Teven Le Scao provided guidance, led data choices and preprocessing, and contributed to framing and writing. Aleksandra Piktus created perplexity and deduplication data sets and contributed to writing. Nouamane Tazi contributed to enabling high-performance training on AMD hardware. Sampo Pyysalo contributed to enabling high-performance training and early repetition experiments. Thomas Wolf provided guidance on experimental design and contributed to paper writing. Colin Raffel provided guidance on experimental design and contributed to paper writing.
Appendix B. Derivation of Data-Constrained Scaling Laws
Let N be the number of model parameters, D the number of training tokens, and U the number of "unique" training tokens, i.e. the size of the data set that is to be trained on for one or more epochs. Chinchilla (Hoffmann et al., 2022) only deals with non-repeated tokens, thus D = U, and we can write their formula ("Approach 3") as:

L(N, U) = A/N^α + B/U^β + E    (8)
where E represents the irreducible loss and A, B, α, and β are learned parameters. We now want to generalize this expression to multiple epochs where tokens are repeated. We repeat the data R_D times, where R_D = 0 corresponds to the base case of a single epoch. We let D′ be the "effective data size": the number of unique tokens needed to attain the same value as training on U unique tokens for R_D repeats. Hence, if R_D = 0, the effective data is the same as the total data processed. Intuitively, each time a sample is repeated, it is worth less, as the model has already learned some of its information. Assume that each time a model trains on a token, it learns a 1 − δ fraction of the information in it, for some constant 0 ≤ δ ≤ 1. (Thus, if δ = 0 repeated tokens are as good as new ones, and if δ = 1 repeated tokens are worth nothing.) In other words, we expect the decrease in value of each repetition to be proportional to the value of the prior repetition, which is equivalent to exponential decay. As we would like to sum up the value of all repetitions, we temporarily assume an integral number of repeats and express the effective data as a geometric series:

D′ = U + (1 − δ)U + (1 − δ)²U + ⋯ + (1 − δ)^(R_D) U    (9)
We know that the sum S of a geometric series with common ratio r is:

S = a(1 − r^n) / (1 − r)    (10)

where a is the first term and n the number of terms in the series. With r = (1 − δ) and a = (1 − δ)U:
D′ = U + ∑_{k=1}^{R_D} (1 − δ)^k U = U + (1 − δ)U (1 − (1 − δ)^(R_D)) / δ    (11)
Note that Equation 11 can also be used with a non-integer number of repetitions. We could directly use Equation 11 as our effective data and learn δ, but for convenience and interpretability, we redefine it in terms of the number of epochs beyond which repeating does not help. Note that as more data is repeated, the right-hand side tends to U + (1 − δ)U/δ, since lim_{R_D→∞} (1 − (1 − δ)^(R_D)) = 1. Let R*_D = (1 − δ)/δ, hence D′ plateaus at U + R*_D·U as R_D goes to infinity. If we assume δ to be small, 1 − δ tends to one and we can approximate 1/R*_D = δ/(1 − δ) ≈ δ. Next, recall the Taylor series expansion of e^x:
e^x = 1 + x + x²/2! + x³/3! + ⋯ ≈ 1 + x    (12)
If x is small, later terms become increasingly small, thus e^x ≈ 1 + x. As we have assumed δ to be small, let x = −δ, which yields

(1 + x) = (1 − δ) ≈ e^(−δ) ≈ e^(−1/R*_D)    (13)
Now inserting (1 − δ)/δ = R*_D and (1 − δ)^(R_D) = e^(−R_D/R*_D) into Equation 11, we get our final equation representing the effective data:

D′ = U + U·R*_D (1 − e^(−R_D/R*_D))    (14)
where U and R_D are given, while R*_D is a learned constant. If no repeats are done, the second term is zero and the expression simplifies to the single-epoch scaling law from Equation 8. While R_D ≪ R*_D, the second term is approximately U·R_D, and for R_D ≫ R*_D it plateaus at U·R*_D. Hence R*_D corresponds to the number of times we can repeat tokens before seeing sharply diminishing returns.

Let us consider a concrete example to show that Equation 14 is a very good approximation of Equation 11 and to make the equations more intuitive. Suppose repeated data retains 75% of its value (δ = 0.25) and we train on a single token or data unit (U = 1) for five epochs, i.e. we repeat it four times (R_D = 4). In that case, Equation 11 yields D′ = U + (1 − δ)U (1 − (1 − δ)^(R_D)) / δ = 1 + (0.75/0.25)(1 − 0.75⁴) ≈ 3.05. Thus, despite training for 5 total units (4 of which are repetitions), we only get the value equivalent to 3.05 units. As we have defined R*_D = (1 − δ)/δ, the corresponding R*_D value is 3. Setting R*_D = 3 in Equation 14 yields D′ = U + U·R*_D (1 − e^(−R_D/R*_D)) = 1 + 3(1 − e^(−4/3)) ≈ 3.21. Due to our approximations, the results are not identical: 3.21 is slightly higher than 3.05. However, note that the data term is additionally raised to the power β = 0.353 (see Equation 8; Appendix C), so the actual difference, calculated as (3.21^0.353)/(3.05^0.353) − 1, is a mere 1.8% despite this relatively large δ of 0.25. Equation 14 has the benefit that we can interpret R*_D as the number of repetitions beyond which repeating yields sharply diminishing returns and flattens out soon after. Consider R_D = 100; then D′ = 1 + 3(1 − e^(−100/3)) ≈ 3.99. No matter how many repeats are done, the effective data will never exceed 4, i.e. it plateaus at U + R*_D·U as R_D tends to infinity.
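The worked example above can be checked numerically; a minimal sketch comparing the exact geometric-series value (Equation 11) against the exponential approximation (Equation 14), with function names of our choosing:

```python
import math

def exact_effective(U, delta, R_D):
    # Equation 11: geometric-series value of U unique tokens repeated R_D times.
    return U + (1 - delta) * U * (1 - (1 - delta) ** R_D) / delta

def approx_effective(U, R_D, r_star):
    # Equation 14: exponential approximation with R*_D = (1 - delta) / delta.
    return U + U * r_star * (1 - math.exp(-R_D / r_star))

delta = 0.25                                   # repeated data keeps 75% of its value
exact = exact_effective(1.0, delta, 4)         # five epochs = four repetitions
approx = approx_effective(1.0, 4, (1 - delta) / delta)
beta = 0.353
gap = (approx ** beta) / (exact ** beta) - 1   # difference after the beta power
print(exact, approx, gap)
```

This reproduces the numbers in the text: roughly 3.05 versus 3.21 effective data units, shrinking to about a 1.8% difference once raised to the power β.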
Similarly, we consider repeating parameters. Symmetric to seeing the same data, excess parameters learn the same features and, in the extreme, add no value. For the Chinchilla equation (Equation 8), increasing parameters from 1 billion to 10 billion yields the same absolute decrease in loss regardless of whether the data set is a single token or 1 billion tokens. However, intuition and our data (§5.3) suggest that in the first case, adding parameters should not decrease loss at all, as the additional 9 billion parameters cannot possibly learn anything from the single token that the first 1 billion parameters have not already learned. Thus, to allow excess parameters to decay to adding nothing, we also replace N with a symmetric version of Equation 14, yielding our final equation:
$$L(U_N, U_D, R_N, R_D) = \frac{A}{\left(U_N + U_N R_N^* (1 - e^{-R_N/R_N^*})\right)^{\alpha}} + \frac{B}{\left(U_D + U_D R_D^* (1 - e^{-R_D/R_D^*})\right)^{\beta}} + E \quad (15)$$

We define $U_N$ as the number of "unique" parameters that provide an optimal fit for $U_D$. Additional parameters decay with a symmetric version of the expression for repeated data. $R_N$ is the number of times the "unique" parameters are repeated, i.e. $R_N = \max\{(N/U_N) - 1, 0\}$.
If $R_N^* = \infty$, additional parameters do not decay at all and $U_N + U_N R_N^* (1 - e^{-R_N/R_N^*})$ reduces to $N$. We compute $U_N$ from $U_D$ by setting $D_{opt} = U_D$ and rearranging Equation 3 to map from $D_{opt}$ to $N_{opt}$. $U_N$ is then $\min\{N_{opt}, N\}$. This is equivalent to the following:

$$U_N = \min\left\{(U_D\,G)^{\beta/\alpha}\,G,\ N\right\}, \quad \text{where } G = \left(\frac{\alpha A}{\beta B}\right)^{\frac{1}{\alpha+\beta}} \quad (16)$$
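With the C4 coefficients from Appendix C ($\alpha = \beta \approx 0.3527$, $A = 521$, $B = 1488$), the mapping from unique data to "unique" parameters can be sketched as follows; the helper name is ours:

```python
def unique_params(U_D, N, alpha=0.3527, beta=0.3527, A=521.0, B=1488.0):
    """U_N = min{(U_D * G)^(beta/alpha) * G, N},
    with G = (alpha * A / (beta * B))^(1 / (alpha + beta))."""
    G = (alpha * A / (beta * B)) ** (1.0 / (alpha + beta))
    return min((U_D * G) ** (beta / alpha) * G, N)

# With alpha = beta this simplifies to U_N = min{G^2 * U_D, N} ~ min{0.051 * U_D, N}.
U_N = unique_params(1e9, 1e12)  # ~5.1e7 "unique" parameters for 1e9 unique tokens
```

Note that for $\alpha = \beta$ the expression collapses to $U_N = \min\{G^2 U_D, N\}$, which is where the constant $0.051$ in Equation 18 comes from.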
Equation 15 is a generalization of Equation 8: it provides the same estimates for optimal model and data size in the single-epoch case, but allows for decay in the value of parameters and tokens, thus generalizing to training for multiple epochs and with excess parameters. It can thus be used as a direct replacement for Equation 8. If $R_N^*$ and $R_D^*$ are unknown, one can simply set them to infinity by default, which makes Equation 15 completely equivalent to Equation 8. To learn the parameters $R_N^*$ and $R_D^*$, we largely follow the approach of Hoffmann et al. (2022). We fix $a$, $b$, $e$, $\alpha$, $\beta$ to the values learned on C4 in Appendix C and minimize:
$$\min_{R_N^*, R_D^*} \sum_{\text{Run } i} \text{Huber}_{\delta}\Big(\text{LSE}\big(a - \alpha \log\big(U_N^i + U_N^i R_N^* (1 - e^{-R_N^i/R_N^*})\big),\ b - \beta \log\big(U_D^i + U_D^i R_D^* (1 - e^{-R_D^i/R_D^*})\big),\ e\big) - \log L^i\Big) \quad (17)$$
Table 1: Comparison of different versions of our parametric fit. All versions are fitted on the same 182 samples. We report the fitting loss and the R² (coefficient of determination) of the predicted loss compared to the actual loss. No decay corresponds to assuming Chinchilla holds for repeated data without modification. For Equation 11, we use the same equation for D and N, renaming the δ to $R_D^*$ and $R_N^*$.

| Parametric Fit | $R_D^*$ | $R_N^*$ | Loss (↓) | R² (↑) |
|---|---|---|---|---|
| No decay | - | - | - | 0.1430 |
| Equation 15 but only decay N | - | 713.0015 | 0.0241 | 0.1671 |
| Equation 15 but only decay D | 2.9157 | - | 0.0169 | 0.7395 |
| Equation 15 | 15.3878 | 5.3097 | 0.0158 | 0.7810 |
| Equation 11 for both N and D | 0.0104 | 0.3676 | 0.0155 | 0.8062 |
| Equation 19 for both N and D | 0.0105 | 0.3676 | 0.0155 | 0.8061 |

We use the L-BFGS algorithm to find local minima of the objective above, started on a grid of initializations given by $R_N^* \in \{0, 4, \ldots, 20\}$ and $R_D^* \in \{0, 4, \ldots, 20\}$. We fit on 182 samples with parameters varying from 7 million up to 9 billion and epochs ranging from 1 to 500. We removed the outliers referenced in §5.3 from our fitting, as our formulas do not allow for excess parameters or excess epochs to negatively impact performance. We assume excess parameters or epochs only cause performance to plateau but never to worsen. However, it is difficult to identify all samples where excess parameters or epochs hurt: for some data budgets we only train a single model, so we do not know whether the loss of that model is already in the range where it starts to increase again. Further, there are samples where loss initially increases and then decreases as a function of epochs (double descent, see §5.1), which further contributes to noise in the fitting. Nevertheless, we are able to get a fairly stable fit resulting in $R_N^* = 5.309743$ and $R_D^* = 15.387756$. Since $R_D^* > R_N^*$, excess parameters decay faster. Hence, the data-constrained efficient frontiers in Figures 1 and 3 suggest scaling compute allocated to epochs faster than compute allocated to parameters. This value of $R_D^*$ yields $\delta \approx 6 \times 10^{-2}$ ($0.19$ for $R_N^*$), which respects the assumption that $\delta$ is small. Inserting these learned parameters and the parameters from Appendix C, and simplifying Equation 16, yields the precise formulation we use to predict loss ($L$) given unique tokens ($U_D$), parameter repetitions ($R_N$) and data repetitions ($R_D$):
$$L(U_D, R_N, R_D) = \frac{521}{\left(U_N + 5.3\,U_N(1 - e^{-R_N/5.3})\right)^{0.35}} + \frac{1488}{\left(U_D + 15.4\,U_D(1 - e^{-R_D/15.4})\right)^{0.35}} + 1.87, \quad \text{where } U_N = U_D \cdot 0.051 \quad (18)$$

We experiment with different versions of our formula and display the learned values in Table 1. No decay, or decaying only D or N in Equation 15, leads to worse loss and R² than Equation 15. Thus, it is important to decay both the value of excess parameters and the value of data repetitions. We also consider an explicit exponential decay where $D' = \sum_{k=0}^{R_D} U e^{-k/R_D^*}$; hence from Equation 10 it follows:
$$D' = U\,\frac{1 - \left(e^{-1/R_D^*}\right)^{R_D+1}}{1 - e^{-1/R_D^*}} \quad (19)$$
This explicit decay, Equation 11, and Equation 15 all yield similar results with R² around 0.80. Equation 15 fits the data slightly worse than Equation 11, likely due to our approximations. Nevertheless, we use Equation 15 throughout as it has fewer terms, and we find it easier to interpret.

Figure 19: A cartoon of how the compute-optimal tradeoff deviates from Chinchilla as we increase the number of epochs. Initially the model size and tokens processed grow proportionally ($R_N = R_D$), but since $R_N^* < R_D^*$, at some point adding parameters offers worse returns than increasing the number of tokens processed, and hence we deviate from the Chinchilla curve. (Annotations in the figure: at the deviation point a parameter is worth $1 - \delta$ as much as a token for the loss; compute cost is the same across the dashed blue lines; in the proportional regime, a multiplicative factor in parameters and tokens is worth the same for the loss.)
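As a sanity check, the fitting objective of Equation 17 can be written out directly. The sketch below (ours) evaluates it on losses generated from the fitted constants of Equation 18, so the reported optimum ($R_N^* \approx 5.3$, $R_D^* \approx 15.4$) should score at least as well as perturbed values; the run data are synthetic, not the paper's 182 samples, and the actual fit used L-BFGS rather than direct evaluation:

```python
import math

A, B, E, alpha, beta = 521.0, 1488.0, 1.87, 0.353, 0.353

def decayed(U, R, R_star):
    """Effective unique quantity after R repetitions (Equation 14 form)."""
    return U + U * R_star * (1 - math.exp(-R / R_star))

def predicted_loss(U_N, U_D, R_N, R_D, rn_star, rd_star):
    return (A / decayed(U_N, R_N, rn_star) ** alpha
            + B / decayed(U_D, R_D, rd_star) ** beta + E)

# Synthetic "runs" (U_N, U_D, R_N, R_D) with losses taken from the fitted constants.
runs = [(0.051 * U_D, U_D, R, R) for U_D in (1e9, 1e10) for R in (0, 3, 10, 40)]
observed = [predicted_loss(*r, 5.3, 15.4) for r in runs]

def objective(rn_star, rd_star, delta=1e-3):
    """Sum of Huber losses on log-space residuals, as in Equation 17."""
    total = 0.0
    for (U_N, U_D, R_N, R_D), L in zip(runs, observed):
        resid = math.log(predicted_loss(U_N, U_D, R_N, R_D, rn_star, rd_star)) - math.log(L)
        a = abs(resid)
        total += 0.5 * a * a if a <= delta else delta * (a - 0.5 * delta)
    return total
```

Because the synthetic losses come from the fitted constants, the objective is exactly zero at $(5.3, 15.4)$ and strictly positive at perturbed values.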
B.1 Analytical properties of compute-optimal point
In our case, consider the setting of a fixed compute budget $C$ and a fixed budget of unique tokens $U_D$, implying a corresponding number of unique parameters $U_N$. Let $R_D$ denote the number of times we repeat data (we assume that we are in the multi-epoch regime and hence $R_D > 0$).

Write $U_D = c \cdot U_N$ (for Chinchilla, $c \approx 20$). When $R_D \ll R_D^*$ and $R_N \ll R_N^*$, our scaling agrees with Chinchilla, and so the point $(U_N, U_D)$, corresponding to $R_D = R_N = 0$, is on the optimal compute curve. Increasing $R_D$ by $\epsilon$ corresponds to increasing the number of tokens by $\epsilon U_D = \epsilon c U_N$, while increasing $R_N$ by $\epsilon$ corresponds to increasing the number of parameters by $\epsilon U_N$. For small positive $R_D, R_N$, our curve agrees with Chinchilla, and so we need to increase $R_N$ and $R_D$ by the same amount to maintain proportionality. Hence, up to some value $r > 0$, the optimal compute curve corresponds to $R_N = R_D = r$. Our curve differs from Chinchilla when $r$ gets close to either $R_N^*$ or $R_D^*$; at that point, we start to see sharply diminishing returns.

In our setting, $R_D^* > R_N^*$, which means that we reach the point $r \approx R_N^*$ first. At this point, each added parameter is worth less (specifically, worth $e^{-r/R_N^*}$) than an added data point, despite having equal computational cost. Hence processing more tokens is more effective than increasing the number of parameters, and we expect the optimal compute curve to break away from proportionality. This is indeed what we see.
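This crossover can be made concrete. In the sketch below (ours, illustrative only), the marginal value of one more repetition at $R_N = R_D = r$ is $e^{-r/R^*}$, and we locate the point where a repeated parameter is worth half a repeated token; the halving threshold is our choice, not the paper's:

```python
import math

R_N_STAR, R_D_STAR = 5.31, 15.39  # learned decay constants from the fit

def marginal(r, r_star):
    """d/dR [R* (1 - exp(-R/R*))] at R = r, i.e. exp(-r/r*)."""
    return math.exp(-r / r_star)

# Find the first r where a marginal parameter repetition is worth less than
# half of a marginal token repetition (illustrative threshold).
r = 0.0
while marginal(r, R_N_STAR) > 0.5 * marginal(r, R_D_STAR):
    r += 0.01
```

Well before $r$ reaches $R_D^*$, parameters have already lost half their relative marginal value, which is why the compute-optimal frontier bends toward more epochs rather than more parameters.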
Appendix C. C4 Scaling Coefficients
While Hoffmann et al. (2022) have shown that the equal scaling of model parameters and training tokens holds across different training data sets, the precise ratios vary considerably across data sets and approaches. For example, given the Gopher (Rae et al., 2021) compute budget of $5.76 \times 10^{23}$ FLOPs, their parametric loss function fitted on MassiveWeb predicts an optimal allocation of 40 billion parameters. Meanwhile, if the training data set is C4 (Raffel et al., 2020), their IsoFLOP approach predicts 73 billion parameters to be optimal, almost twice as much. However, for C4, which is our training data set, they do not provide the coefficients necessary to compute loss with their parametric loss function. Based on their IsoFLOP training runs on C4, they only provide the information that, for C4, compute ($C$) allocated to data ($D$) and parameters ($N$) should be scaled exactly equally for optimality, i.e. $a = b = 0.5$ in the relationships $N_{opt} \propto C^a$ and $D_{opt} \propto C^b$. This corresponds to $\alpha = \beta$ in the parametric loss function (Equation 2). Thus, we use this information together with the methodology and C4 data points from Hoffmann et al. (2022) to fit the parametric loss function. We tie the parameters $\alpha$ and $\beta$ to be equal and optimize
$$\min_{a, b, e, \alpha, \beta} \sum_{\text{Run } i} \text{Huber}_{\delta}\Big(\text{LSE}\big(a - \alpha \log N_i,\ b - \beta \log D_i,\ e\big) - \log L_i\Big) \quad (20)$$
where LSE is the log-sum-exp operator, $N_i$, $D_i$ and $L_i$ are the model size, data set size and loss of the $i$th run, and $\delta = 10^{-3}$. We fit on 54 samples on a grid of initializations given by $\alpha \in \{0, 0.5, \ldots, 2\}$, $\beta \in \{0, 0.5, \ldots, 2\}$, $e \in \{-1, -0.5, \ldots, 1\}$, $a \in \{0, 5, \ldots, 25\}$, and $b \in \{0, 5, \ldots, 25\}$. Our fit results in $a = 6.255414$, $b = 7.3049974$, $e = 0.6254804$, $\alpha = \beta = 0.3526596$. Exponentiating $a$, $b$ and $e$ to get $A$, $B$ and $E$ and inserting all learned coefficients into Equation 2 then allows us to compute loss ($L$) as a function of parameters and data:
$$L(N, D) = 1.87 + \frac{521}{N^{0.353}} + \frac{1488}{D^{0.353}} \quad (21)$$
To verify the accuracy of our fit, we benchmark its predictions against those of the IsoFLOP C4 curves in Hoffmann et al. (2022). Following Hoffmann et al. (2022), we can compute the optimal number of parameters $N_{opt}$ and tokens $D_{opt}$ for our fit using:
$$N_{opt}(C) = G\left(\frac{C}{6}\right)^{a}, \quad D_{opt}(C) = G^{-1}\left(\frac{C}{6}\right)^{b}, \quad \text{where } G = \left(\frac{\alpha A}{\beta B}\right)^{\frac{1}{\alpha+\beta}},\ a = \frac{\beta}{\alpha+\beta},\ b = \frac{\alpha}{\alpha+\beta}$$
Given the Gopher compute budget of $C = 5.76 \times 10^{23}$, our fitted parameters predict an optimal allocation of $N_{opt} = 70.0$ billion parameters and $D_{opt} = 1.37$ trillion tokens. This is very close to the 73 billion parameters and 1.3 trillion tokens predicted by the IsoFLOP curves on C4 from Hoffmann et al. (2022), and thus we consider it a good fit. We use these fitted parameters rather than the MassiveWeb parameters for all computations involving Chinchilla scaling laws.
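This check can be reproduced in a few lines from the fitted coefficients alone (a sketch, using the values reported above):

```python
# Optimal allocation for the Gopher budget from the fitted C4 coefficients.
alpha = beta = 0.3526596
A, B = 521.0, 1488.0

G = (alpha * A / (beta * B)) ** (1.0 / (alpha + beta))
a = beta / (alpha + beta)   # = 0.5 since alpha = beta
b = alpha / (alpha + beta)  # = 0.5

C = 5.76e23  # Gopher FLOP budget; C ~ 6 * N * D
N_opt = G * (C / 6) ** a        # ~7.0e10 parameters
D_opt = (1 / G) * (C / 6) ** b  # ~1.37e12 tokens
```

Since $\alpha = \beta$ forces $a = b = 0.5$, parameters and tokens both scale as the square root of compute, matching the IsoFLOP finding for C4.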
| FLOP budget | Parameters | Evaluation Interval | Evaluation Tokens |
|---|---|---|---|
| 9.3 × 10^20 | 2.8B | 100 | 105 million |
| 2.1 × 10^21 | 4.2B | 1000 | 105 million |
| 9.3 × 10^21 | 8.7B | 1000 | 2.1 million |

Table 2: Setup for computing validation loss during training. At every evaluation interval, loss is computed on the listed number of evaluation tokens from the validation set. The evaluation tokens vary with the interval, i.e. the evaluation tokens at 100 steps are not the same as at 200 steps. However, the tokens do not vary across data budgets for the same FLOP budget (Figure 9). For example, $N = 2.8$ billion parameter models with $D_C = 55$ billion tokens are evaluated on the same data as models with $D_C = 28$ billion tokens at each evaluation interval.
Appendix D. Evaluation Details
D.1 Loss Evaluation
For all models trained on C4, the final test loss is computed on the same 210 million tokens from the C4 validation set after training. For held-out evaluation during training, such as in Figure 9, the configurations are displayed in Table 2. The small number of evaluation tokens for the 8.7 billion parameter models likely contributes to their loss spikes seen in Figure 9. Thus, we smooth the validation loss curves of the 8.7 billion parameter models with exponential moving average smoothing and a weight of 0.85. For training on OSCAR, the configurations are the same; however, the validation split used is a held-out part of the OSCAR training split, as there is no official validation split for OSCAR. All training loss curves for C4 and OSCAR models are smoothed with exponential moving average smoothing and a weight of 0.999.
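The smoothing used for the curves can be sketched as a standard exponential moving average; the exact initialization the paper used is not specified, so this TensorBoard-style variant (seeding with the first value) is an assumption:

```python
def ema_smooth(values, weight=0.85):
    """Exponential moving average:
    smoothed[t] = weight * smoothed[t-1] + (1 - weight) * values[t].

    weight=0.85 corresponds to the 8.7B validation curves,
    weight=0.999 to the training loss curves."""
    smoothed, last = [], values[0]
    for v in values:
        last = weight * last + (1 - weight) * v
        smoothed.append(last)
    return smoothed
```

A higher weight averages over a longer history, which is why the training curves (weight 0.999) use far heavier smoothing than the spiky 8.7B validation curves (weight 0.85).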
D.2 Downstream Evaluation
We provide statistics of all downstream evaluation data sets in Table 3. We use the evaluation-harness frameworks from BigScience and EleutherAI (Gao et al., 2021) to evaluate models on 19 evaluation data sets. For each data set, a maximum of 3000 samples are evaluated with 0, 1, 2, 3, 4 and 5 few-shots (Brown et al., 2020) to produce six scores, which are then averaged. We normalize scores to range from the random baseline of each task to 1 and report them as percentages. For example, if random guessing produces 50% accuracy and the maximum accuracy possible is 100%, then a raw accuracy of 55% is normalized to 10%, and a raw accuracy of 45% is normalized to -10%, since it is worse than random. This is done to give all tasks the same weight; otherwise, average performance would heavily depend on generative tasks, where the random baselines are 0. Prompts are sourced from GPT-3 (Brown et al., 2020) and PromptSource (Bach et al., 2022) and detailed in Appendix J. We note that our evaluation is by no means comprehensive and a larger benchmarking akin to Srivastava et al. (2022) would be helpful. However, by training five seeds for most models benchmarked, always averaging 0-5 few-shots, and ensuring maximum data overlap for repeated data (§4), we significantly reduce uncertainty.
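The normalization described above is a simple rescaling; a minimal sketch (the function name is ours):

```python
def normalize(raw_acc, baseline, maximum=100.0):
    """Rescale a raw percentage score so that the random baseline maps to 0%
    and the maximum score maps to 100%. Scores below baseline go negative."""
    return 100.0 * (raw_acc - baseline) / (maximum - baseline)

# With a 50% random baseline: 55% raw -> 10%, 45% raw -> -10%.
```

For generative tasks the baseline is 0, so raw and normalized scores coincide there.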
| Data set | Split(s) | Samples | Baseline | URL |
|---|---|---|---|---|
| ANLI (Nie et al., 2020) | dev_r1,2,3 | 3000 | 33.3 | hf.co/datasets/anli |
| ARC-Easy (Clark et al., 2018) | test | 1172 | 25.0 | hf.co/datasets/ai2_arc |
| ARC-Challenge (Clark et al., 2018) | test | 2376 | 25.0 | hf.co/datasets/ai2_arc |
| BoolQ (Clark et al., 2019) | validation | 3270 | 50.0 | hf.co/datasets/boolq |
| CB (De Marneffe et al., 2019) | validation | 56 | 33.3 | hf.co/datasets/super_glue |
| COPA (Roemmele et al., 2011) | validation | 100 | 50.0 | hf.co/datasets/super_glue |
| HellaSwag (Zellers et al., 2019) | test | 10003 | 25.0 | hf.co/datasets/hellaswag |
| PiQA (Bisk et al., 2020) | validation | 1838 | 50.0 | hf.co/datasets/piqa |
| RTE (Dagan et al., 2006; Wang et al., 2019) | validation | 277 | 50.0 | hf.co/datasets/super_glue |
| SciQ (Welbl et al., 2017) | test | 1000 | 25.0 | hf.co/datasets/sciq |
| StoryCloze 2016 (Mostafazadeh et al., 2017) | test | 1871 | 25.0 | hf.co/datasets/story_cloze |
| WinoGrande XL (Sakaguchi et al., 2021) | test | 1267 | 50.0 | hf.co/datasets/winogrande |
| E2E NLG (Dušek et al., 2020) | test | 4693 | 0.0 | hf.co/datasets/e2e_nlg_cleaned |
| XSUM (Narayan et al., 2018; Gehrmann et al., 2021) | test | 11334 | 0.0 | hf.co/datasets/GEM/xsum |
| WebNLG EN (Castro Ferreira et al., 2020; Gehrmann et al., 2021) | test | 5150 | 0.0 | hf.co/datasets/GEM/web_nlg |
| WikiLingua EN (Ladhak et al., 2020; Gehrmann et al., 2021) | sampled_test | 3000 | 0.0 | hf.co/datasets/GEM/wiki_lingua |
| bAbI (Weston et al., 2015) | test | 19000 | 0.0 | hf.co/datasets/Muennighoff/babi |

Table 3: Downstream evaluation data sets. We evaluate on 19 data sets: the first 14 are evaluated using accuracy (ANLI counted as three), the next 4 using ROUGE-2 F-measure (Lin, 2004), and bAbI using exact match.
Appendix E. Downstream Repetition Results
In Tables 4-9 we report downstream results of all models trained on C4 (Raffel et al., 2020) and OSCAR (Ortiz Suárez et al., 2020) according to the configurations in Figure 9. All scores are from the final checkpoints at the end of training. OSCAR is a noisier data set than C4 due to less filtering; thus models trained on C4 generally perform better. Notably, models trained on C4 completely fail on bAbI (Weston et al., 2015), while OSCAR models are able to perform better than random. This is likely due to code data being present in OSCAR, which enables state-tracking capabilities like those of the code-augmented models in §8. For C4, the creators strictly removed all data that resembles code (Raffel et al., 2020). There are no significant differences between models trained for a single epoch and models trained for up to 4 epochs. Even models trained for more epochs (and thus on less unique data) have similar performance.
| Data Budget | 55B | 28B | 18B | 14B | 11B | 9B | 4B | 1.25B |
|---|---|---|---|---|---|---|---|---|
| Epochs | 1 | 2 | 3 | 4 | 5 | 7 | 14 | 44 |
| ANLI R1 | 0.4±1.6 | 0.7±0.8 | 0.3±0.5 | -0.3±1.8 | 0.4±1.8 | 0.4±0.7 | 0.0±0.9 | -0.6±0.6 |
| ANLI R2 | 0.9±0.4 | 1.4±0.8 | 0.8±0.8 | 1.1±0.7 | 0.5±0.7 | 0.6±1.0 | 1.1±1.1 | 2.7±1.6 |
| ANLI R3 | 1.7±0.5 | 1.2±0.4 | 0.4±0.5 | 1.9±0.7 | 0.6±1.0 | 0.8±0.8 | 1.7±0.7 | 0.7±1.7 |
| ARC-Challenge | 1.6±1.0 | 0.9±0.5 | 1.2±0.6 | 1.1±0.6 | 1.1±1.2 | 1.3±0.5 | 0.3±0.6 | -2.9±1.0 |
| ARC-Easy | 44.5±0.5 | 44.9±0.4 | 44.7±0.7 | 44.3±0.4 | 44.0±0.5 | 44.2±0.9 | 41.4±0.2 | 28.9±0.7 |
| BoolQ | 18.8±3.4 | 16.2±5.2 | 16.1±2.7 | 19.7±1.8 | 15.0±3.8 | 16.9±3.2 | 13.1±4.9 | -2.1±4.7 |
| CB | 20.0±4.7 | 17.4±6.4 | 14.6±5.1 | 17.5±4.2 | 12.3±12.2 | 14.4±7.5 | 21.6±8.4 | 21.3±5.6 |
| COPA | 49.7±3.5 | 50.3±3.4 | 49.9±2.3 | 50.1±2.5 | 50.9±1.2 | 48.1±2.4 | 43.5±3.1 | 33.3±1.9 |
| HellaSwag | 24.7±0.3 | 24.6±0.2 | 24.3±0.1 | 24.3±0.0 | 24.3±0.3 | 24.1±0.1 | 22.8±0.2 | 16.7±0.4 |
| PiQA | 47.9±0.6 | 47.6±0.8 | 47.3±0.3 | 47.6±0.6 | 47.6±0.7 | 47.0±0.2 | 45.6±0.5 | 37.0±0.4 |
| RTE | 5.1±4.0 | 2.5±4.5 | 8.4±2.6 | 6.0±2.5 | 5.1±1.6 | 2.3±3.9 | 7.8±2.5 | 2.6±4.3 |
| SciQ | 83.2±0.6 | 82.5±0.6 | 82.7±1.1 | 81.9±0.6 | 81.9±0.8 | 81.6±0.9 | 78.5±1.1 | 59.3±1.6 |
| StoryCloze 2016 | 58.7±0.2 | 58.7±0.5 | 58.5±0.3 | 58.3±0.3 | 58.5±0.6 | 58.4±0.3 | 56.7±0.5 | 52.0±0.6 |
| WinoGrande XL | 11.6±0.8 | 10.8±1.1 | 10.9±1.3 | 10.6±0.5 | 11.1±0.9 | 10.6±0.9 | 6.4±1.3 | 2.9±1.3 |
| E2E NLG | 17.0±1.4 | 17.7±0.5 | 17.0±1.2 | 16.9±1.1 | 15.1±2.3 | 13.3±2.2 | 14.9±0.9 | 9.8±0.9 |
| XSUM | 2.4±0.1 | 2.4±0.1 | 2.5±0.1 | 2.3±0.2 | 2.4±0.1 | 2.4±0.1 | 2.1±0.1 | 1.6±0.1 |
| WebNLG EN | 5.3±0.1 | 5.5±0.2 | 5.4±0.1 | 5.4±0.1 | 5.1±0.1 | 5.4±0.2 | 5.1±0.3 | 2.9±0.2 |
| WikiLingua EN | 3.0±0.1 | 3.1±0.1 | 2.9±0.1 | 2.9±0.3 | 2.9±0.2 | 2.9±0.1 | 2.6±0.1 | 2.0±0.2 |
| bAbI | 0.0±0.0 | 0.0±0.0 | 0.0±0.0 | 0.0±0.0 | 0.0±0.0 | 0.0±0.0 | 0.0±0.0 | 0.0±0.0 |
| Average | 20.9±0.4 | 20.4±0.3 | 20.4±0.2 | 20.6±0.2 | 19.9±0.9 | 19.7±0.2 | 19.2±0.5 | 14.1±0.4 |
Table 4: Results for 2.8B parameter models trained on repeated data on C4 for 55B total tokens. Scores are normalized averages of 0-5 few-shots and reported as percentages. We report mean/std. err. across five different models, each trained with a different random seed.
| Data Budget | 55B | 28B | 18B | 14B | 11B | 9B | 4B | 1.25B |
|---|---|---|---|---|---|---|---|---|
| Epochs | 1 | 2 | 3 | 4 | 5 | 7 | 14 | 44 |
| ANLI R1 | -0.3±0.5 | -0.6±1.3 | 0.2±0.6 | 0.3±1.1 | 0.2±1.2 | -0.1±1.1 | -0.1±0.5 | -0.7±1.3 |
| ANLI R2 | 1.0±1.0 | 1.1±0.3 | 1.7±0.8 | 2.3±0.8 | 1.4±1.0 | 0.8±0.7 | 1.0±0.7 | 2.3±0.7 |
| ANLI R3 | 0.4±0.8 | 0.5±0.5 | -0.2±0.8 | -0.1±1.0 | 1.1±0.6 | 0.7±0.4 | -0.2±0.9 | 0.5±1.2 |
| ARC-Challenge | -1.4±0.8 | -0.6±0.8 | -1.7±0.1 | -1.6±0.7 | -1.6±0.6 | -1.4±0.5 | -1.9±0.8 | -5.0±1.1 |
| ARC-Easy | 39.7±0.3 | 39.6±0.8 | 39.5±0.6 | 39.3±0.5 | 38.7±0.6 | 38.7±0.4 | 36.9±0.4 | 25.4±0.7 |
| BoolQ | 12.8±4.4 | 7.8±3.8 | 7.9±3.8 | 3.3±5.4 | 0.2±3.0 | 2.3±6.1 | -2.1±2.4 | 7.4±6.1 |
| CB | 19.7±5.1 | 15.4±7.3 | 13.2±5.1 | 12.6±2.6 | 21.7±3.6 | 15.4±3.7 | 16.2±5.2 | 9.7±5.7 |
| COPA | 42.7±2.2 | 39.5±2.2 | 40.9±2.0 | 41.5±2.1 | 38.5±2.4 | 40.4±2.4 | 38.6±2.6 | 28.5±3.1 |
| HellaSwag | 16.3±0.1 | 16.3±0.2 | 16.3±0.2 | 16.1±0.2 | 16.0±0.1 | 15.9±0.2 | 15.0±0.2 | 11.7±0.1 |
| PiQA | 41.2±0.7 | 41.4±0.5 | 40.3±0.4 | 40.6±0.5 | 40.3±0.9 | 39.8±0.6 | 38.8±1.1 | 31.0±0.4 |
| RTE | 3.9±1.1 | 2.1±1.6 | 2.3±3.3 | 1.6±3.0 | 0.5±2.1 | 2.9±2.5 | 0.9±3.4 | -3.2±2.7 |
| SciQ | 83.2±0.6 | 82.4±0.6 | 82.1±0.9 | 82.6±0.7 | 81.5±0.9 | 80.5±0.6 | 76.5±1.3 | 57.7±1.8 |
| StoryCloze 2016 | 52.8±0.3 | 52.9±0.4 | 52.6±0.3 | 53.0±0.4 | 52.3±0.4 | 52.4±0.4 | 51.8±0.7 | 47.9±0.5 |
| WinoGrande XL | 5.8±0.9 | 4.4±1.4 | 4.5±0.3 | 4.2±1.3 | 4.5±0.6 | 4.1±0.7 | 1.7±1.2 | 0.8±1.3 |
| E2E NLG | 20.3±0.3 | 19.9±0.5 | 19.9±0.7 | 20.9±0.9 | 19.7±0.7 | 20.4±0.6 | 19.1±0.8 | 14.2±0.7 |
| XSUM | 3.0±0.1 | 2.9±0.0 | 2.9±0.3 | 2.9±0.2 | 2.9±0.1 | 2.8±0.3 | 2.6±0.2 | 1.8±0.1 |
| WebNLG EN | 8.8±0.4 | 8.3±0.6 | 8.5±0.3 | 8.4±0.6 | 8.1±0.2 | 8.2±0.2 | 7.2±0.3 | 3.3±0.3 |
| WikiLingua EN | 2.9±0.1 | 3.1±0.2 | 3.1±0.1 | 3.0±0.1 | 3.1±0.1 | 3.2±0.3 | 2.7±0.2 | 1.7±0.2 |
| bAbI | 15.5±1.0 | 15.7±1.1 | 15.3±0.8 | 15.1±1.5 | 15.9±1.1 | 16.2±0.9 | 14.3±0.6 | 6.6±0.6 |
| Average | 19.4±0.5 | 18.5±0.2 | 18.4±0.4 | 18.2±0.4 | 18.2±0.4 | 18.1±0.4 | 16.8±0.5 | 12.7±0.7 |
Table 5: Results for 2.8B parameter models trained on repeated data on OSCAR for 55B total tokens. Scores are normalized averages of 0-5 few-shots and reported as percentages. We report mean/std. err. across five different models, each trained with a different random seed.
| Unique Tokens | 84B | 42B | 28B | 21B | 17B | 12B | 6B | 1.9B |
|---|---|---|---|---|---|---|---|---|
| Epochs | 1 | 2 | 3 | 4 | 5 | 7 | 14 | 44 |
| ANLI R1 | -1.0±0.3 | -0.7±1.1 | -0.7±1.0 | -0.4±1.1 | 0.4±0.8 | 0.5±1.1 | 0.1±0.9 | 0.2±0.9 |
| ANLI R2 | 0.8±0.5 | 0.8±0.8 | 0.0±1.4 | 0.5±0.7 | 0.5±0.9 | 0.3±1.0 | 0.7±0.7 | 2.5±1.0 |
| ANLI R3 | 1.1±0.7 | 0.8±0.9 | 0.3±0.8 | 1.4±1.1 | 1.3±0.9 | 2.3±0.2 | 1.3±0.2 | 1.6±1.2 |
| ARC-Challenge | 5.3±0.6 | 5.1±0.9 | 5.2±2.0 | 6.0±0.8 | 4.7±0.8 | 3.1±0.4 | 2.9±1.0 | -1.3±1.0 |
| ARC-Easy | 49.2±0.9 | 50.4±1.2 | 47.4±4.5 | 49.4±0.7 | 48.7±1.5 | 44.9±0.7 | 45.0±1.2 | 31.9±0.9 |
| BoolQ | 18.2±4.0 | 19.6±5.1 | 22.1±1.0 | 20.4±3.6 | 18.4±6.0 | 18.4±3.9 | 18.9±2.6 | -3.3±7.1 |
| CB | 12.0±7.2 | 8.5±9.2 | 7.9±10.4 | 19.6±7.3 | 17.8±7.3 | 15.1±5.8 | 17.5±3.5 | 19.5±6.6 |
| COPA | 59.1±5.4 | 57.7±3.5 | 56.7±2.0 | 55.5±2.4 | 56.8±1.8 | 58.9±1.7 | 48.7±3.3 | 34.9±3.4 |
| HellaSwag | 27.8±4.8 | 30.2±0.5 | 29.8±0.9 | 29.9±0.7 | 28.5±1.1 | 29.0±0.5 | 27.0±1.2 | 19.7±0.5 |
| PiQA | 50.6±0.5 | 50.8±0.5 | 48.6±3.4 | 50.9±0.7 | 50.3±1.3 | 49.5±0.4 | 47.6±1.2 | 39.5±1.3 |
| RTE | 5.6±3.1 | 2.6±3.9 | 7.2±2.7 | 7.0±3.2 | 8.8±5.3 | 9.3±3.6 | 3.0±4.3 | 2.6±4.2 |
| SciQ | 84.6±3.9 | 86.1±1.3 | 84.4±3.7 | 85.9±0.7 | 86.2±0.8 | 79.0±0.7 | 81.1±1.4 | 65.3±1.1 |
| StoryCloze 2016 | 61.1±3.7 | 62.6±0.2 | 61.9±2.2 | 62.6±0.4 | 61.8±0.8 | 61.5±0.8 | 60.1±0.7 | 53.9±0.5 |
| WinoGrande XL | 17.0±2.6 | 17.8±1.4 | 16.5±1.8 | 17.1±1.8 | 14.9±1.5 | 15.9±1.2 | 11.8±1.5 | 3.9±0.8 |
| E2E NLG | 18.2±1.2 | 18.8±0.8 | 17.8±1.5 | 16.0±2.2 | 15.9±2.5 | 13.8±1.3 | 15.7±0.9 | 11.2±1.4 |
| XSUM | 2.9±0.2 | 3.0±0.2 | 2.8±0.3 | 2.9±0.2 | 2.9±0.2 | 1.0±0.4 | 2.4±0.1 | 1.8±0.1 |
| WebNLG EN | 4.8±2.0 | 5.7±0.2 | 5.4±0.3 | 5.6±0.2 | 5.4±0.5 | 5.5±0.1 | 5.4±0.2 | 3.4±0.3 |
| WikiLingua EN | 3.3±0.5 | 3.6±0.1 | 3.4±0.1 | 3.4±0.1 | 3.3±0.1 | 1.4±0.6 | 3.0±0.1 | 2.2±0.1 |
| bAbI | 0.0±0.0 | 0.0±0.0 | 0.0±0.0 | 0.0±0.0 | 0.0±0.0 | 0.0±0.0 | 0.0±0.0 | 0.0±0.0 |
| Average | 22.1±1.7 | 22.3±0.9 | 21.9±1.2 | 22.8±0.5 | 22.5±0.6 | 21.6±0.5 | 20.6±0.6 | 15.2±1.0 |
Table 6: Results for 4.2B parameter models trained on repeated data on C4 for 84B total tokens. Scores are normalized averages of 0-5 few-shots and reported as percentages. We report mean/std. err. across five different models, each trained with a different random seed.
| Unique Tokens | 84B | 42B | 28B | 21B | 17B | 12B | 6B | 1.9B |
|---|---|---|---|---|---|---|---|---|
| Epochs | 1 | 2 | 3 | 4 | 5 | 7 | 14 | 44 |
| ANLI R1 | -0.9±0.5 | -0.8±1.1 | -0.9±1.4 | -0.4±0.4 | -0.1±1.2 | 0.3±1.1 | -0.5±0.8 | 1.1±1.3 |
| ANLI R2 | 0.7±0.9 | 0.7±1.1 | 1.3±1.0 | 1.5±1.1 | 1.7±1.3 | 0.9±0.8 | 0.9±1.0 | 1.7±1.3 |
| ANLI R3 | 0.4±0.6 | 0.6±0.4 | 0.7±0.3 | 0.4±0.8 | 0.7±1.2 | 0.6±1.2 | 0.7±0.5 | 0.8±1.2 |
| ARC-Challenge | 1.3±0.5 | 1.8±0.5 | 1.6±0.7 | 2.4±1.1 | 1.6±0.7 | 2.0±0.7 | 1.6±0.5 | -2.1±0.5 |
| ARC-Easy | 45.5±0.8 | 45.1±1.2 | 44.8±0.9 | 44.8±0.6 | 45.0±1.0 | 43.9±0.7 | 40.7±0.7 | 28.0±0.9 |
| BoolQ | 14.5±1.9 | 15.1±4.6 | 10.8±5.1 | 12.5±1.9 | 6.7±4.0 | 10.1±4.2 | -0.0±6.9 | -4.3±7.2 |
| CB | 21.3±2.3 | 19.2±3.8 | 12.9±6.4 | 16.9±3.4 | 15.1±9.4 | 17.8±3.6 | 15.0±8.1 | 11.2±4.1 |
| COPA | 43.1±3.0 | 42.5±3.7 | 44.4±1.1 | 43.0±3.4 | 41.8±2.3 | 44.6±2.7 | 40.3±3.0 | 34.9±4.9 |
| HellaSwag | 21.1±0.2 | 21.0±0.2 | 20.9±0.1 | 20.7±0.2 | 20.5±0.3 | 20.3±0.1 | 19.3±0.1 | 14.5±0.2 |
| PiQA | 45.3±0.9 | 44.8±0.7 | 44.8±0.9 | 44.4±0.6 | 44.3±0.6 | 43.9±0.5 | 42.2±0.9 | 34.0±0.8 |
| RTE | 4.2±2.8 | 1.5±2.4 | -1.1±3.9 | -2.5±3.9 | 5.3±1.8 | 4.4±1.9 | 1.6±2.2 | -1.0±2.4 |
| SciQ | 86.6±0.7 | 86.5±0.5 | 86.0±0.2 | 86.3±1.0 | 85.4±0.8 | 84.7±0.4 | 82.0±1.4 | 62.9±2.5 |
| StoryCloze 2016 | 56.5±0.6 | 56.8±0.6 | 56.5±0.7 | 55.8±0.3 | 55.9±0.2 | 56.0±0.3 | 54.5±0.7 | 49.3±0.2 |
| WinoGrande XL | 9.7±1.4 | 9.0±1.8 | 9.5±0.7 | 8.9±1.0 | 7.8±1.2 | 7.4±1.4 | 6.8±1.4 | 2.1±1.0 |
| E2E NLG | 21.4±1.3 | 21.9±0.4 | 21.2±1.0 | 21.8±0.6 | 21.0±0.9 | 20.5±0.7 | 20.9±1.0 | 16.0±0.6 |
| XSUM | 3.6±0.2 | 3.5±0.2 | 3.5±0.2 | 3.5±0.2 | 3.5±0.3 | 3.2±0.5 | 3.0±0.2 | 1.9±0.1 |
| WebNLG EN | 9.9±0.4 | 9.7±0.8 | 9.3±0.6 | 9.7±0.5 | 9.3±0.7 | 9.4±0.3 | 8.9±0.5 | 3.8±0.4 |
| WikiLingua EN | 3.9±0.1 | 3.8±0.2 | 3.6±0.3 | 3.7±0.2 | 3.6±0.2 | 3.7±0.1 | 3.3±0.2 | 2.1±0.2 |
| bAbI | 15.0±7.5 | 19.0±1.2 | 18.8±1.4 | 18.5±1.4 | 19.2±0.6 | 18.1±1.4 | 14.5±1.5 | 9.6±1.7 |
| Average | 21.2±0.2 | 21.1±0.4 | 20.4±0.3 | 20.6±0.5 | 20.4±0.5 | 20.6±0.2 | 18.7±0.5 | 14.0±0.6 |
Table 7: Results for 4.2B parameter models trained on repeated data on OSCAR for 84B total tokens. Scores are normalized averages of 0-5 few-shots and reported as percentages. We report mean/std. err. across five different models, each trained with a different random seed.
| Parameters | 8.7B | 8.7B | 8.7B | 8.7B | 8.7B | 8.7B | 8.7B | 8.7B | 6.3B |
|---|---|---|---|---|---|---|---|---|---|
| Unique Tokens | 178B | 88B | 58B | 44B | 35B | 25B | 13B | 4B | 25B |
| Epochs | 1 | 2 | 3 | 4 | 5 | 7 | 14 | 44 | 9.7 |
| ANLI R1 | -0.9 | -1.2 | -4.2 | 0.7 | -1.3 | 0.1 | 1.2 | 2.1 | -0.9 |
| ANLI R2 | -0.4 | -1.2 | -0.2 | 0.2 | -0.4 | -0.1 | 0.4 | 2.2 | 1.0 |
| ANLI R3 | 0.7 | 0.5 | 0.7 | 1.8 | 0.4 | 1.6 | 2.0 | 4.0 | 2.6 |
| ARC-Challenge | 12.2 | 11.9 | 10.5 | 12.2 | 10.6 | 11.8 | 8.3 | 2.2 | 12.7 |
| ARC-Easy | 58.5 | 58.0 | 56.9 | 57.4 | 56.7 | 58.5 | 52.9 | 37.4 | 57.2 |
| BoolQ | 26.1 | 31.8 | 31.3 | 30.3 | 28.8 | 28.5 | 27.9 | 4.1 | 30.6 |
| CB | 7.6 | 12.9 | -15.2 | 17.9 | 14.3 | -22.8 | -12.1 | 17.4 | 6.2 |
| COPA | 68.0 | 64.7 | 62.3 | 66.3 | 63.3 | 70.0 | 57.0 | 45.0 | 66.0 |
| HellaSwag | 37.8 | 37.8 | 37.3 | 37.4 | 37.1 | 37.5 | 36.1 | 27.5 | 38.1 |
| PiQA | 55.9 | 55.6 | 54.7 | 56.5 | 55.8 | 53.9 | 52.4 | 45.7 | 54.3 |
| RTE | 14.1 | 11.4 | 11.0 | 8.7 | 15.9 | -2.6 | -1.8 | -3.2 | 7.7 |
| SciQ | 90.4 | 91.1 | 90.7 | 90.0 | 89.8 | 89.8 | 87.9 | 72.9 | 90.3 |
| StoryCloze 2016 | 68.3 | 67.3 | 67.2 | 67.6 | 67.8 | 66.8 | 66.2 | 58.9 | 68.4 |
| WinoGrande XL | 26.3 | 27.7 | 26.5 | 29.0 | 26.1 | 23.5 | 18.1 | 10.0 | 27.0 |
| E2E NLG | 20.5 | 17.9 | 18.7 | 20.0 | 17.2 | 17.7 | 17.4 | 11.2 | 16.9 |
| XSUM | 3.6 | 3.3 | 3.8 | 3.8 | 3.5 | 3.0 | 3.3 | 2.0 | 3.8 |
| WebNLG EN | 5.3 | 5.8 | 5.9 | 5.6 | 5.8 | 5.2 | 5.7 | 4.9 | 5.3 |
| WikiLingua EN | 4.1 | 4.2 | 4.2 | 4.1 | 4.2 | 4.0 | 3.5 | 2.7 | 4.0 |
| bAbI | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| Average | 26.2 | 26.3 | 24.3 | 26.8 | 26.1 | 23.5 | 22.4 | 18.3 | 25.9 |
Table 8: Results for 8.7B parameter models trained on repeated data on C4 for 178B total tokens and a data-constrained compute-optimal 6.3B model. Scores are normalized averages of 0-5 few-shots and reported as percentages. The two models with 25 billion unique tokens are the ones depicted in Figure 1 (right). The data-constrained compute-optimal variant (6.3 billion parameters) performs better by using fewer parameters and repeating more data.
| Unique Tokens | 178B | 88B | 58B | 44B | 35B | 25B | 13B | 4B |
|---|---|---|---|---|---|---|---|---|
| Epochs | 1 | 2 | 3 | 4 | 5 | 7 | 14 | 44 |
| ANLI R1 | -1.3 | -2.3 | -0.5 | -1.8 | 0.1 | -0.3 | 2.6 | -0.4 |
| ANLI R2 | 0.8 | 3.2 | -0.2 | -1.3 | 1.0 | 0.2 | 1.5 | 0.5 |
| ANLI R3 | 1.1 | 1.2 | 1.3 | 0.9 | 2.8 | -0.4 | 1.1 | -0.1 |
| ARC-Challenge | 6.9 | 6.7 | 6.9 | 3.8 | 6.6 | 4.8 | 4.0 | -0.9 |
| ARC-Easy | 50.2 | 51.6 | 51.2 | 51.0 | 51.9 | 50.8 | 47.0 | 33.0 |
| BoolQ | 18.4 | 11.7 | 19.4 | 22.4 | 17.5 | 20.8 | 7.6 | 4.1 |
| CB | 11.2 | 13.4 | 16.1 | 19.6 | 21.4 | 25.0 | 9.8 | 20.1 |
| COPA | 46.7 | 53.0 | 52.0 | 53.7 | 51.0 | 53.3 | 48.7 | 41.7 |
| HellaSwag | 27.4 | 27.2 | 26.8 | 26.8 | 27.3 | 26.7 | 25.5 | 19.6 |
| PiQA | 49.2 | 49.3 | 50.1 | 48.7 | 48.1 | 47.2 | 45.6 | 37.0 |
| RTE | -0.5 | 1.1 | 0.2 | 1.2 | 10.2 | 3.2 | -3.0 | -7.8 |
| SciQ | 88.1 | 88.0 | 88.4 | 87.9 | 87.9 | 87.4 | 86.3 | 64.6 |
| StoryCloze 2016 | 61.6 | 61.1 | 60.2 | 60.6 | 61.3 | 59.0 | 58.8 | 52.7 |
| WinoGrande XL | 17.6 | 16.3 | 15.4 | 13.7 | 13.9 | 12.8 | 10.8 | -0.6 |
| E2E NLG | 23.3 | 24.2 | 22.2 | 22.9 | 23.1 | 22.1 | 22.9 | 16.8 |
| XSUM | 4.2 | 3.8 | 3.9 | 3.8 | 4.3 | 4.0 | 3.2 | 2.4 |
| WebNLG EN | 9.9 | 10.1 | 10.0 | 10.5 | 9.5 | 9.9 | 10.7 | 5.2 |
| WikiLingua EN | 4.3 | 4.0 | 4.1 | 3.7 | 4.3 | 4.2 | 4.0 | 2.7 |
| bAbI | 20.4 | 20.6 | 21.7 | 21.4 | 21.1 | 21.3 | 19.4 | 10.7 |
| Average | 23.1 | 23.4 | 23.6 | 23.7 | 24.4 | 23.8 | 21.4 | 15.9 |
Table 9: Results for 8.7B parameter models trained on repeated data on OSCAR for 178B total tokens. Scores are normalized averages of 0-5 few-shots and reported as percentages.
Appendix F. Detailed Code Augmentation Results
We report tabular results for replacing part of C4 or OSCAR with code for 4.2 billion parameter and 2.8 billion parameter models in Tables 10-11. We find that training on up to 50% Python data maintains performance on all natural language tasks while enabling huge performance gains on state-tracking (bAbI) for C4. For OSCAR, gains are less clear, which is likely due to OSCAR already containing code (Ortiz Suárez et al., 2020), while code data was explicitly filtered out of C4 (Raffel et al., 2020).
% of Python pre-training data (remainder is C4):

| Data set (↓) | 0 | 10 | 20 | 30 | 40 | 50 | 60 | 70 | 80 | 90 |
|---|---|---|---|---|---|---|---|---|---|---|
| ANLI R1 | -1.0±0.3 | -0.7±0.6 | 0.5±0.6 | -0.2±1.1 | -1.5±0.7 | -0.7±1.1 | -1.1±0.9 | -1.2±1.1 | -1.4±0.7 | -1.4±0.6 |
| ANLI R2 | 0.8±0.5 | 0.4±0.8 | 0.3±1.0 | 0.6±0.5 | 0.5±0.6 | 0.3±0.6 | 0.8±0.7 | 0.4±0.5 | 0.1±1.2 | 1.0±0.3 |
| ANLI R3 | 1.1±0.7 | 0.6±0.5 | 0.5±0.6 | 0.2±0.4 | 0.3±0.5 | 0.2±0.5 | 0.3±0.2 | -0.1±0.3 | -0.0±0.2 | -0.1±0.2 |
| ARC-Challenge | 5.3±0.6 | 6.4±1.0 | 5.2±2.4 | 4.3±1.4 | 5.2±0.8 | 5.2±0.5 | 2.6±0.4 | 1.7±0.5 | -0.4±0.4 | -3.0±0.4 |
| ARC-Easy | 49.2±0.9 | 52.4±1.1 | 49.6±3.7 | 48.1±4.1 | 50.1±1.0 | 49.7±0.3 | 48.0±0.5 | 45.6±0.5 | 43.3±0.4 | 37.7±0.7 |
| BoolQ | 18.2±4.0 | 10.5±12.0 | 16.3±5.2 | 17.8±3.3 | 13.4±3.4 | 14.8±2.1 | 12.5±8.9 | 12.1±6.6 | 7.2±6.7 | 10.7±7.3 |
| CB | 12.0±7.2 | 20.3±2.7 | 14.4±7.1 | 16.5±1.2 | 22.3±3.1 | 22.1±4.8 | 19.4±4.8 | 23.8±3.3 | 23.8±4.1 | 23.4±2.4 |
| COPA | 59.1±5.4 | 56.4±4.9 | 46.7±8.7 | 50.2±3.7 | 52.7±2.5 | 50.1±4.5 | 46.5±1.9 | 43.1±4.2 | 39.2±3.7 | 35.9±4.9 |
| HellaSwag | 27.8±4.8 | 29.4±0.4 | 25.7±4.8 | 27.0±1.7 | 26.3±2.4 | 26.3±0.6 | 25.0±0.1 | 22.6±0.1 | 19.5±0.2 | 14.7±0.1 |
| PiQA | 50.6±0.5 | 50.8±0.6 | 48.6±3.0 | 48.2±2.8 | 48.7±0.7 | 48.4±1.0 | 47.1±0.7 | 45.6±0.3 | 43.4±0.9 | 39.0±0.8 |
| RTE | 5.6±3.1 | 7.3±3.4 | 4.4±4.7 | 6.1±2.6 | 9.1±4.0 | 8.1±5.9 | 7.7±5.3 | 4.0±2.1 | 6.2±2.1 | 4.6±2.5 |
| SciQ | 84.6±3.9 | 87.1±0.2 | 84.6±4.8 | 86.9±1.2 | 86.9±1.2 | 87.9±0.9 | 87.6±0.6 | 87.0±0.2 | 86.0±0.2 | 84.5±0.6 |
| StoryCloze 2016 | 61.1±3.7 | 62.0±0.6 | 59.0±4.8 | 60.8±1.5 | 59.9±1.9 | 60.0±0.7 | 59.0±0.4 | 57.2±0.5 | 54.9±0.4 | 51.0±0.3 |
| WinoGrande XL | 17.0±2.6 | 17.4±2.1 | 14.9±4.4 | 15.2±2.0 | 15.7±1.2 | 14.2±1.0 | 13.5±1.3 | 10.7±1.3 | 9.1±0.6 | 5.3±1.3 |
| E2E NLG | 18.2±1.2 | 21.8±1.6 | 15.9±8.6 | 23.3±0.6 | 21.5±3.8 | 23.9±0.6 | 23.7±0.6 | 23.7±0.5 | 24.3±0.7 | 24.0±0.9 |
| XSUM | 2.9±0.2 | 3.2±0.5 | 3.4±0.3 | 3.3±0.3 | 3.6±0.6 | 3.4±0.2 | 3.5±0.2 | 2.9±0.3 | 2.8±0.4 | 2.7±0.2 |
| WebNLG EN | 4.8±2.0 | 9.5±0.7 | 10.2±1.1 | 10.5±0.7 | 10.4±0.8 | 10.4±0.6 | 9.9±0.4 | 10.0±0.5 | 9.3±0.6 | 9.2±0.2 |
| WikiLingua EN | 3.3±0.5 | 4.0±0.1 | 4.0±0.2 | 4.2±0.1 | 4.3±0.3 | 4.2±0.2 | 4.4±0.3 | 4.1±0.2 | 3.9±0.2 | 3.6±0.3 |
| bAbI | 0.0±0.0 | 12.5±6.7 | 13.8±7.2 | 15.8±8.2 | 17.4±9.2 | 23.2±1.2 | 23.4±2.0 | 24.3±1.4 | 23.2±1.0 | 24.6±1.8 |
| Average | 22.1±1.7 | 23.7±0.7 | 22.0±3.0 | 23.1±1.1 | 23.5±1.0 | 23.8±0.5 | 22.8±1.0 | 22.0±0.6 | 20.8±0.5 | 19.3±0.3 |
| Average (no bAbI) | 23.4±1.8 | 24.4±0.7 | 22.5±2.8 | 23.5±0.8 | 23.9±0.9 | 23.8±0.5 | 22.8±1.0 | 21.8±0.6 | 20.6±0.6 | 19.1±0.4 |
Table 10: Results for code-augmentation for 4.2B parameter models. Models trained on a mix of natural language (C4) and Python (The Stack). Scores are normalized averages of 0-5 few-shots and reported as percentages. We report mean/std. err. across five different models, each trained with a different random seed.
% of Python pre-training data (rest is C4 or OSCAR, respectively); standard errors are only available for the 0% columns:

| Data set (↓) | C4 0 | C4 10 | C4 20 | C4 30 | C4 40 | C4 50 | OSCAR 0 | OSCAR 10 | OSCAR 20 | OSCAR 30 | OSCAR 40 | OSCAR 50 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ANLI R1 | 0.4±1.6 | -1.5 | -0.9 | -1.0 | -0.7 | -2.4 | -0.3±0.5 | 0.0 | -0.6 | -1.6 | -2.4 | -1.7 |
| ANLI R2 | 0.9±0.4 | 0.7 | 0.0 | 0.1 | -0.1 | 0.1 | 1.0±1.0 | 1.2 | -0.1 | -0.0 | 0.0 | 0.8 |
| ANLI R3 | 1.7±0.5 | 0.6 | -0.7 | -0.2 | 0.4 | 0.0 | 0.4±0.8 | -0.4 | -0.2 | -1.7 | -0.8 | -0.5 |
| ARC-Challenge | 1.6±1.0 | 4.2 | 1.7 | 1.5 | 0.2 | -0.2 | -1.4±0.8 | -0.7 | -1.4 | -3.4 | -2.3 | -3.1 |
| ARC-Easy | 44.5±0.5 | 46.4 | 46.5 | 45.4 | 43.6 | 42.7 | 39.7±0.3 | 39.8 | 38.7 | 39.1 | 37.3 | 37.6 |
| BoolQ | 18.8±3.4 | 15.7 | 19.0 | 13.4 | 16.0 | 4.4 | 12.8±4.4 | 3.3 | 12.5 | 10.6 | 5.8 | 8.5 |
| CB | 20.0±4.7 | 22.8 | 10.7 | 20.5 | 17.4 | 15.2 | 19.7±5.1 | 14.7 | 15.6 | 19.6 | 22.8 | 17.0 |
| COPA | 49.7±3.5 | 46.3 | 49.3 | 46.3 | 42.7 | 40.0 | 42.7±2.2 | 42.7 | 41.0 | 42.7 | 35.7 | 38.0 |
| HellaSwag | 24.7±0.3 | 24.1 | 23.3 | 22.3 | 21.9 | 20.9 | 16.3±0.1 | 15.7 | 15.9 | 15.5 | 15.1 | 13.7 |
| PiQA | 47.9±0.6 | 46.9 | 47.7 | 45.1 | 46.2 | 45.5 | 41.2±0.7 | 41.6 | 39.9 | 40.5 | 38.8 | 38.6 |
| RTE | 5.1±4.0 | 8.8 | 7.7 | 5.1 | 7.8 | 10.8 | 3.9±1.1 | 2.2 | 4.3 | 1.1 | 3.7 | -1.7 |
| SciQ | 83.2±0.6 | 83.3 | 85.3 | 84.8 | 83.2 | 83.7 | 83.2±0.6 | 82.4 | 83.5 | 82.8 | 83.3 | 83.2 |
| StoryCloze 2016 | 58.7±0.2 | 59.3 | 57.9 | 56.9 | 56.5 | 56.0 | 52.8±0.3 | 52.0 | 52.2 | 52.0 | 51.8 | 50.9 |
| WinoGrande XL | 11.6±0.8 | 13.0 | 10.7 | 9.3 | 8.2 | 9.6 | 5.8±0.9 | 3.2 | 5.6 | 5.8 | 4.6 | 3.9 |
| E2E NLG | 17.0±1.4 | 19.8 | 21.1 | 20.2 | 22.1 | 21.0 | 20.3±0.3 | 21.9 | 20.7 | 20.5 | 20.7 | 21.1 |
| XSUM | 2.4±0.1 | 2.7 | 2.0 | 2.2 | 2.0 | 2.3 | 3.0±0.1 | 2.8 | 3.1 | 3.4 | 3.1 | 2.9 |
| WebNLG EN | 5.3±0.1 | 9.1 | 8.0 | 8.5 | 8.5 | 9.1 | 8.8±0.4 | 8.7 | 9.6 | 9.1 | 8.7 | 9.4 |
| WikiLingua EN | 3.0±0.1 | 3.2 | 3.2 | 3.6 | 3.3 | 3.7 | 2.9±0.1 | 3.3 | 3.6 | 3.5 | 3.4 | 3.5 |
| bAbI | 0.0±0.0 | 4.6 | 14.2 | 14.2 | 14.8 | 15.1 | 15.5±1.0 | 16.6 | 17.2 | 17.2 | 17.7 | 15.9 |
| Average | 20.9±0.4 | 21.6 | 21.4 | 21.0 | 20.7 | 19.9 | 19.4±0.5 | 18.5 | 19.0 | 18.8 | 18.3 | 17.8 |
| Average (without bAbI) | 22.0±0.5 | 22.5 | 21.8 | 21.3 | 21.1 | 20.1 | 19.6±0.5 | 18.6 | 19.1 | 18.9 | 18.3 | 17.9 |
Table 11: Results for code-augmentation for 2.8B parameter models. Models trained on a mix of natural language (C4 or OSCAR) and Python (The Stack). Scores are normalized averages of 0-5 few-shots and reported as percentages. We report mean/std. err. across five different models, each trained with a different random seed.
Appendix G. Filtering Procedure
G.1 Perplexity filtering
We follow the approach of Laurençon et al. (2022) to perform perplexity filtering and reuse their artifacts: a SentencePiece tokenizer (Kudo and Richardson, 2018) and a KenLM 5-gram language model (Heafield, 2011) trained on Wikipedia introductions and available to download from their repository.4 We compute the model's perplexity on all OSCAR and C4 samples and only select samples that fall within a certain percentile threshold. For example, to select the top 25%, we only select samples with perplexity lower than the 25th percentile. Figure 20 provides a visual representation of the perplexity distributions of the respective data sets, highlighting the relevant percentile thresholds.
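The percentile-threshold selection can be sketched as follows; the function name is ours, and in the real pipeline the perplexities come from the KenLM model rather than being given:

```python
def filter_by_perplexity(samples_with_ppl, percentile=25):
    """Keep samples whose perplexity is at or below the given percentile.

    samples_with_ppl: list of (sample, perplexity) pairs.
    Uses a simple nearest-rank percentile; the exact interpolation
    convention used in the original pipeline is an assumption.
    """
    ppls = sorted(p for _, p in samples_with_ppl)
    cutoff = ppls[max(0, int(len(ppls) * percentile / 100) - 1)]
    return [s for s, p in samples_with_ppl if p <= cutoff]

data = [(f"doc{i}", float(i)) for i in range(1, 101)]
kept = filter_by_perplexity(data, percentile=25)  # keeps the 25 lowest-perplexity docs
```

Lower perplexity under the Wikipedia-trained model is treated as a proxy for cleaner, more natural text.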
G.2 Deduplication
We perform deduplication leveraging the suffix-array-based approach proposed by Lee et al. (2021). We remove any document with at least a 100-character span overlapping with any other document in the corpus. We deduplicate the full C4 data set. In the case of OSCAR, the memory requirements of the procedure make deduplicating the full data set infeasible. Instead, we select a 25% subset of the full OSCAR and build a suffix array for this subset. We experiment with leveraging the 25% OSCAR suffix array in two ways. First, we deduplicate the selected subset itself. This is very strict and preserves less than 5% of the full OSCAR. Second, we use the 25% suffix array to deduplicate the full OSCAR, i.e., we remove any document that has at least a 100-character span overlapping with the 25% subset we selected. This is more permissive and allows us to preserve 31% of the original data set. We refer to the latter as expanded in Table 12; it is used for training the 4.2 billion parameter model in Table 14, while the smaller deduplicated version of OSCAR is used for the 2.8 billion parameter model.
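The span-overlap criterion can be sketched as follows. This is a simplified hash-set approximation for illustration only; the actual procedure uses the suffix-array implementation of Lee et al. (2021), and the demo shortens the span length from 100 to 10 characters so the example fits toy documents:

```python
def has_shared_span(doc, span_index, span_len=100):
    """True if any `span_len`-character window of `doc` already occurs
    in `span_index` (a set of windows from previously kept documents)."""
    return any(doc[i:i + span_len] in span_index
               for i in range(len(doc) - span_len + 1))

def deduplicate(docs, span_len=100):
    """Greedy one-pass dedup: keep a document only if it shares no
    `span_len`-character span with any document kept so far.
    Storing raw windows in a set is memory-hungry; the paper's suffix
    array (or a rolling hash) is what makes this scale."""
    kept, index = [], set()
    for doc in docs:
        if not has_shared_span(doc, index, span_len):
            kept.append(doc)
            index.update(doc[i:i + span_len]
                         for i in range(len(doc) - span_len + 1))
    return kept

# Tiny demo with a 10-character span threshold instead of 100.
docs = ["the quick brown fox jumps",
        "quick brown fox runs home",   # shares "quick brow" with doc 1
        "completely different text"]
kept = deduplicate(docs, span_len=10)
```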
G.3 ROOTS filter
In addition, we benchmark with the filtering procedure from the ROOTS corpus (Laurençon et al., 2022). It applies the following set of filters:
- Discarding documents with too few words
- Discarding documents with overly repeated character- and word-n-grams
- Discarding documents with too many special characters
- Discarding documents with too few grammatical function words (e.g., "of", "and")
- Discarding documents with too many flagged words
- Discarding documents with a low fastText language identification score
- Perplexity filtering
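A sketch of three of these heuristics (minimum word count, special-character fraction, function-word presence). The thresholds and word list below are illustrative placeholders, not the actual ROOTS settings:

```python
import re

# Illustrative subset of English function words (not the ROOTS list).
FUNCTION_WORDS = {"the", "of", "and", "a", "to", "in", "is", "that"}

def passes_filters(text, min_words=10, max_special_frac=0.3,
                   min_function_words=2):
    """Apply three ROOTS-style heuristics; all thresholds are made up."""
    words = re.findall(r"[a-zA-Z']+", text.lower())
    if len(words) < min_words:                        # too few words
        return False
    special = sum(1 for c in text
                  if not (c.isalnum() or c.isspace()))
    if special / max(len(text), 1) > max_special_frac:  # too many special chars
        return False
    if sum(w in FUNCTION_WORDS for w in words) < min_function_words:
        return False                                  # too few function words
    return True

good = "The orbit of the comet was computed from a long series of observations."
bad = "$$$ ### !!! @@@ %%% ^^^ &&& *** ((( )))"
```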
4. https://github.com/bigscience-workshop/data-preparation/tree/main/preprocessing/training/01b_oscar_cleaning_and_filtering
Muennighoff, Rush, Barak, Scao, Piktus, Tazi, Pyysalo, Wolf, Raffel
Figure 20: Perplexity histograms for the respective data sets, with the relevant percentile thresholds (25th, 50th, 75th) marked. For demonstration purposes, we use 100,000 random samples of each data set.
| Base data set | Filter | Tokens after filtering |
|---|---|---|
| C4 | Deduplication | 21 billion |
| C4 | Perplexity Top 25% | 44 billion |
| C4 | Perplexity Top 50% | 89 billion |
| C4 | Perplexity 25-75% | 89 billion |
| OSCAR | Deduplication | 9 billion |
| OSCAR | Deduplication-expanded | 94 billion |
| OSCAR | Perplexity Top 25% | 80 billion |
| OSCAR | ROOTS | 99 billion |
Table 12: Sizes of filtered data sets.
| Task | C4 2.8B All | C4 2.8B 25% | C4 2.8B 50% | C4 4.2B All | C4 4.2B 25% | C4 4.2B 50% | C4 4.2B 25-75% | OSCAR 2.8B All | OSCAR 2.8B 25% | OSCAR 4.2B All | OSCAR 4.2B 25% |
|---|---|---|---|---|---|---|---|---|---|---|---|
| ANLI R1 | 0.4 ± 1.6 | -0.1 | 0.9 | -0.5 ± 1.4 | -0.0 | -0.7 | -0.8 | -0.3 ± 0.5 | -0.4 | -0.4 ± 1.2 | -2.2 |
| ANLI R2 | 0.9 ± 0.4 | -0.2 | -0.7 | 0.0 ± 1.3 | -0.4 | -0.0 | 1.1 | 1.0 ± 1.0 | 1.7 | 1.0 ± 0.9 | 0.7 |
| ANLI R3 | 1.7 ± 0.5 | 0.5 | 1.4 | 0.7 ± 0.5 | 0.7 | 2.9 | 0.4 | 0.4 ± 0.8 | 1.7 | 1.2 ± 0.5 | 2.1 |
| ARC-Challenge | 1.6 ± 1.0 | 3.3 | 2.9 | 4.2 ± 1.6 | 10.2 | 9.3 | 7.9 | -1.4 ± 0.8 | 3.3 | 1.8 ± 0.8 | 6.3 |
| ARC-Easy | 44.5 ± 0.5 | 47.3 | 47.7 | 48.1 ± 4.8 | 55.8 | 53.7 | 51.0 | 39.7 ± 0.3 | 46.8 | 45.7 ± 0.6 | 51.8 |
| BoolQ | 18.8 ± 3.4 | 17.1 | 17.7 | 22.4 ± 3.3 | 27.7 | 23.5 | 24.5 | 12.8 ± 4.4 | 11.8 | 12.4 ± 5.9 | 22.2 |
| CB | 20.0 ± 4.7 | 16.1 | 13.8 | 9.3 ± 16.6 | 24.6 | 22.3 | 12.5 | 19.7 ± 5.1 | 17.0 | 23.9 ± 3.8 | 20.1 |
| COPA | 49.7 ± 3.5 | 55.7 | 56.0 | 55.3 ± 3.8 | 60.7 | 66.0 | 61.0 | 42.7 ± 2.2 | 44.0 | 41.1 ± 3.0 | 49.3 |
| HellaSwag | 24.7 ± 0.3 | 24.7 | 26.0 | 29.4 ± 1.3 | 30.7 | 32.7 | 33.1 | 16.3 ± 0.1 | 19.0 | 21.0 ± 0.2 | 23.3 |
| PiQA | 47.9 ± 0.6 | 43.4 | 45.8 | 48.8 ± 3.8 | 47.9 | 52.2 | 52.1 | 41.2 ± 0.7 | 38.3 | 45.0 ± 0.6 | 44.4 |
| RTE | 5.1 ± 4.0 | 5.7 | 7.3 | 6.9 ± 3.1 | 11.9 | 2.2 | 10.3 | 3.9 ± 1.1 | -1.2 | 2.2 ± 4.3 | 7.0 |
| SciQ | 83.2 ± 0.6 | 82.4 | 82.8 | 86.3 ± 1.1 | 88.6 | 87.4 | 88.4 | 83.2 ± 0.6 | 84.0 | 86.3 ± 0.6 | 86.5 |
| StoryCloze 2016 | 58.7 ± 0.2 | 61.1 | 61.2 | 62.8 ± 0.5 | 65.5 | 65.6 | 65.1 | 52.8 ± 0.3 | 57.9 | 57.2 ± 0.6 | 60.2 |
| WinoGrande XL | 11.6 ± 0.8 | 15.3 | 14.3 | 18.7 ± 1.0 | 24.9 | 22.3 | 18.7 | 5.8 ± 0.9 | 9.7 | 10.1 ± 1.0 | 14.8 |
| E2E NLG | 17.0 ± 1.4 | 16.1 | 16.8 | 17.9 ± 0.7 | 18.8 | 17.8 | 19.2 | 20.3 ± 0.3 | 19.5 | 21.6 ± 0.7 | 22.6 |
| XSUM | 2.4 ± 0.1 | 2.6 | 3.0 | 3.0 ± 0.3 | 3.9 | 3.2 | 3.0 | 3.0 ± 0.1 | 3.2 | 3.7 ± 0.2 | 2.7 |
| WebNLG EN | 5.3 ± 0.1 | 4.8 | 5.1 | 5.6 ± 0.3 | 5.4 | 5.7 | 5.2 | 8.8 ± 0.4 | 6.9 | 9.3 ± 0.5 | 10.6 |
| WikiLingua EN | 3.0 ± 0.1 | 3.2 | 3.3 | 3.6 ± 0.2 | 3.4 | 3.5 | 3.4 | 2.9 ± 0.1 | 3.4 | 4.0 ± 0.1 | 3.8 |
| bAbI | 0.0 ± 0.0 | 0.0 | 0.0 | 0.0 ± 0.0 | 0.0 | 0.0 | 0.0 | 15.5 ± 1.0 | 14.5 | 19.3 ± 1.0 | 17.2 |
| Average | 20.9 ± 0.4 | 21.0 | 21.3 | 22.2 ± 1.4 | 25.3 | 24.7 | 24.0 | 19.4 ± 0.5 | 20.1 | 21.4 ± 0.5 | 23.3 |
Table 13: Results for perplexity filtering. The training data is perplexity-filtered according to the given percentile, e.g., 25% corresponds to training on the top 25% of examples with the lowest perplexity. The resulting data set sizes are in Table 12. The data is repeated until it matches 55B tokens for 2.8B-parameter models and 84B tokens for 4.2B-parameter models. Scores are normalized averages over 0-5 few-shot settings and reported as percentages. For unfiltered models we report mean/std. err. across five different models, each trained with a different random seed.
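Combining the filtered-set sizes in Table 12 with the token budgets stated in the caption (55B tokens for 2.8B-parameter models, 84B for 4.2B), the implied number of epochs over each filtered set can be computed directly:

```python
# Filtered data set sizes in billions of tokens (from Table 12).
sizes = {
    "C4 dedup": 21, "C4 ppl top 25%": 44,
    "C4 ppl top 50%": 89, "C4 ppl 25-75%": 89,
    "OSCAR dedup": 9, "OSCAR dedup-expanded": 94,
    "OSCAR ppl top 25%": 80, "OSCAR ROOTS": 99,
}

def epochs(budget_billion, size_billion):
    """Passes needed to repeat a filtered set up to the token budget."""
    return budget_billion / size_billion

# 2.8B-parameter models train on 55B tokens, 4.2B-parameter models on 84B.
epochs_2b8 = {name: round(epochs(55, s), 1) for name, s in sizes.items()}
epochs_4b2 = {name: round(epochs(84, s), 1) for name, s in sizes.items()}
```

For example, strictly deduplicated OSCAR (9B tokens) must be repeated for roughly six epochs to fill the 55B-token budget, whereas the ROOTS-filtered OSCAR (99B tokens) is never fully traversed even at the 84B budget.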
Appendix H. Detailed Filtering Results
In Table 13, we report detailed perplexity-filtering results on C4 and OSCAR. For C4, perplexity filtering is only effective at 4.2B parameters. Meanwhile, for OSCAR, which is noisier than C4, perplexity filtering seems effective at both 2.8B and 4.2B parameters. Table 14 contains deduplication results and results for the ROOTS filter. Deduplication does not improve downstream performance for C4, while being effective for OSCAR, which has significantly more noise. Applying the ROOTS filter to OSCAR is not better than unfiltered OSCAR on our benchmark, but it might have other beneficial effects, such as reducing obscenity, templated messages, or repetition, depending on the final use case.
| Task | C4 2.8B All | C4 2.8B Dedup. | C4 4.2B All | C4 4.2B Dedup. | OSCAR 2.8B All | OSCAR 2.8B Dedup. | OSCAR 2.8B ROOTS | OSCAR 4.2B All | OSCAR 4.2B Dedup.-exp. | OSCAR 4.2B ROOTS |
|---|---|---|---|---|---|---|---|---|---|---|
| ANLI R1 | 0.4 ± 1.6 | -0.2 | -0.5 ± 1.4 | -0.8 | -0.3 ± 0.5 | -2.1 | -1.7 | -0.4 ± 1.2 | -1.8 | 1.2 |
| ANLI R2 | 0.9 ± 0.4 | 1.1 | 0.0 ± 1.3 | -0.1 | 1.0 ± 1.0 | 2.0 | 0.7 | 1.0 ± 0.9 | -0.5 | -0.3 |
| ANLI R3 | 1.7 ± 0.5 | 1.8 | 0.7 ± 0.5 | 0.4 | 0.4 ± 0.8 | 0.4 | 0.2 | 1.2 ± 0.5 | 0.8 | -0.3 |
| ARC-Challenge | 1.6 ± 1.0 | 0.6 | 4.2 ± 1.6 | 3.9 | -1.4 ± 0.8 | 2.6 | -0.9 | 1.8 ± 0.8 | 6.8 | 0.6 |
| ARC-Easy | 44.5 ± 0.5 | 43.0 | 48.1 ± 4.8 | 46.8 | 39.7 ± 0.3 | 44.6 | 42.3 | 45.7 ± 0.6 | 51.0 | 47.1 |
| BoolQ | 18.8 ± 3.4 | 1.5 | 22.4 ± 3.3 | 2.2 | 12.8 ± 4.4 | 3.4 | 13.4 | 12.4 ± 5.9 | 13.0 | 7.0 |
| CB | 20.0 ± 4.7 | 0.4 | 9.3 ± 16.6 | 0.9 | 19.7 ± 5.1 | 25.4 | 14.3 | 23.9 ± 3.8 | 25.0 | 28.1 |
| COPA | 49.7 ± 3.5 | 57.0 | 55.3 ± 3.8 | 60.0 | 42.7 ± 2.2 | 47.3 | 37.7 | 41.1 ± 3.0 | 55.3 | 43.0 |
| HellaSwag | 24.7 ± 0.3 | 25.1 | 29.4 ± 1.3 | 30.7 | 16.3 ± 0.1 | 22.8 | 17.6 | 21.0 ± 0.2 | 26.3 | 22.4 |
| PiQA | 47.9 ± 0.6 | 49.1 | 48.8 ± 3.8 | 53.4 | 41.2 ± 0.7 | 45.1 | 41.9 | 45.0 ± 0.6 | 48.5 | 46.3 |
| RTE | 5.1 ± 4.0 | 3.2 | 6.9 ± 3.1 | 0.1 | 3.9 ± 1.1 | 6.1 | 5.8 | 2.2 ± 4.3 | 1.1 | 8.9 |
| SciQ | 83.2 ± 0.6 | 80.4 | 86.3 ± 1.1 | 82.2 | 83.2 ± 0.6 | 82.6 | 83.1 | 86.3 ± 0.6 | 88.5 | 86.4 |
| StoryCloze 2016 | 58.7 ± 0.2 | 61.8 | 62.8 ± 0.5 | 65.2 | 52.8 ± 0.3 | 58.1 | 54.3 | 57.2 ± 0.6 | 61.6 | 58.6 |
| WinoGrande XL | 11.6 ± 0.8 | 13.3 | 18.7 ± 1.0 | 19.7 | 5.8 ± 0.9 | 12.7 | 5.6 | 10.1 ± 1.0 | 16.2 | 11.0 |
| E2E NLG | 17.0 ± 1.4 | 15.6 | 17.9 ± 0.7 | 14.2 | 20.3 ± 0.3 | 20.5 | 20.5 | 21.6 ± 0.7 | 2.4 | 22.6 |
| XSUM | 2.4 ± 0.1 | 2.1 | 3.0 ± 0.3 | 2.5 | 3.0 ± 0.1 | 3.2 | 3.1 | 3.7 ± 0.2 | 4.6 | 3.8 |
| WebNLG EN | 5.3 ± 0.1 | 4.3 | 5.6 ± 0.3 | 4.4 | 8.8 ± 0.4 | 7.4 | 7.4 | 9.3 ± 0.5 | 9.7 | 9.4 |
| WikiLingua EN | 3.0 ± 0.1 | 3.2 | 3.6 ± 0.2 | 3.2 | 2.9 ± 0.1 | 3.0 | 3.1 | 4.0 ± 0.1 | 4.3 | 4.0 |
| bAbI | 0.0 ± 0.0 | 0.0 | 0.0 ± 0.0 | 0.0 | 15.5 ± 1.0 | 17.2 | 14.3 | 19.3 ± 1.0 | 21.1 | 18.0 |
| Average | 20.9 ± 0.4 | 19.1 | 22.2 ± 1.4 | 20.5 | 19.4 ± 0.5 | 21.2 | 19.1 | 21.4 ± 0.5 | 22.8 | 22.0 |
Table 14: Results for filtering with deduplication and the ROOTS filter. The resulting data set sizes are in Table 12. The data is repeated until it matches 55B tokens for 2.8B-parameter models and 84B tokens for 4.2B-parameter models. Scores are normalized averages over 0-5 few-shot settings and reported as percentages. For unfiltered models we report mean/std. err. across five different models, each trained with a different random seed.
Appendix I. Hyperparameters and Setup
We compute the final parameter count as

P = 12lh² + 13lh + (V + s)h,

where P is the final parameter count, l is the number of layers, h is the hidden dimension, V = 50257 the vocabulary size and s = 2048 the sequence length. We find the parameter counts reported in Chinchilla (Hoffmann et al., 2022) to be significantly different from our calculations, especially at larger scales. We report both in Table 15, but we use our parameter estimates everywhere in this work. Further, we have corrected the number of heads of the 3,530 and 4,084 million parameter models from Hoffmann et al. (2022) to obey the relationship d_model = kv_size × n_heads.

To train our models, we have forked the Megatron-DeepSpeed (Rasley et al., 2020; Smith et al., 2022) framework and adapted it for ROCm to enable training on AMD GPUs. We have made our training code publicly available at https://github.com/TurkuNLP/Megatron-DeepSpeed. Models are trained using data, tensor and pipeline parallelism on up to 256 AMD Instinct MI250X GPUs distributed across up to 64 nodes on the LUMI supercomputer located in Finland. As of June 2023, LUMI is the largest supercomputer in Europe and ranks third worldwide with a performance of around 310 PFLOPs.5 We trained models in parallel using up to 2,200 nodes at a single point in time (equivalent to around 8,800 GPUs, 17,600 GCDs, or 86% of all GPUs on LUMI). We have used a total of around 3 million GPU hours.

The cluster is powered 100% by renewable energy (hydroelectricity) and its waste heat is used for heating the nearby city, reducing the city's carbon emissions by up to 20%. Thanks to the low temperatures in Finland, relatively little cooling is required for the cluster, further reducing its environmental impact. As of June 2023, it ranks as the seventh greenest supercomputer.6
5. https://www.top500.org/lists/top500/2023/06/
6. https://www.top500.org/lists/green500/2023/06/
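As a sanity check, the "This work" parameter counts in Table 15 can be reproduced to the nearest million from l, h, V, and s alone, using the decomposition 12lh² (attention and feed-forward weights), 13lh (biases and layer norms), and (V + s)h (token and position embeddings). This exact bookkeeping is a reconstruction verified against a handful of table entries, not the released implementation:

```python
V = 50257   # vocabulary size
S = 2048    # sequence length

def param_count(n_layers, d_model):
    """Transformer parameter count: 12*l*h^2 weights per block,
    13*l*h biases/layer norms, (V+S)*h token+position embeddings.
    Reconstructed to match Table 15; the exact terms are an assumption."""
    l, h = n_layers, d_model
    return 12 * l * h**2 + 13 * l * h + (V + S) * h

def param_millions(n_layers, d_model):
    return round(param_count(n_layers, d_model) / 1e6)
```

Spot checks against the table: the (l=8, h=512) model gives 52M, (l=42, h=4096) gives 8672M, and (l=34, h=2560) gives 2809M, matching the "This work" column.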
| Parameters, this work (M) | Parameters, Chinchilla (M) | d_model | ffw_size | kv_size | n_heads | n_layers |
|---|---|---|---|---|---|---|
| 7 | - | 128 | 512 | 32 | 4 | 3 |
| 14 | - | 224 | 896 | 32 | 7 | 4 |
| 20 | - | 288 | 1152 | 32 | 7 | 5 |
| 38 | - | 448 | 1792 | 32 | 7 | 6 |
| 52 | 44 | 512 | 2048 | 64 | 8 | 8 |
| 66 | 57 | 576 | 2304 | 64 | 9 | 9 |
| 83 | 74 | 640 | 2560 | 64 | 10 | 10 |
| 97 | 90 | 640 | 2560 | 64 | 10 | 13 |
| 112 | 106 | 640 | 2560 | 64 | 10 | 16 |
| 125 | 117 | 768 | 3072 | 64 | 12 | 12 |
| 146 | 140 | 768 | 3072 | 64 | 12 | 15 |
| 168 | 163 | 768 | 3072 | 64 | 12 | 18 |
| 182 | 175 | 896 | 3584 | 64 | 14 | 14 |
| 201 | 196 | 896 | 3584 | 64 | 14 | 16 |
| 220 | 217 | 896 | 3584 | 64 | 14 | 18 |
| 255 | 251 | 1024 | 4096 | 64 | 16 | 16 |
| 280 | 278 | 1024 | 4096 | 64 | 16 | 18 |
| 305 | 306 | 1024 | 4096 | 64 | 16 | 20 |
| 421 | 425 | 1280 | 5120 | 128 | 10 | 18 |
| 480 | 489 | 1280 | 5120 | 128 | 10 | 21 |
| 502 | 509 | 1408 | 5632 | 128 | 11 | 18 |
| 539 | 552 | 1280 | 5120 | 128 | 10 | 24 |
| 574 | 587 | 1408 | 5632 | 128 | 11 | 21 |
| 619 | 632 | 1536 | 6144 | 128 | 12 | 19 |
| 645 | 664 | 1408 | 5632 | 128 | 11 | 24 |
| 704 | 724 | 1536 | 6144 | 128 | 12 | 22 |
| 789 | 816 | 1536 | 6144 | 128 | 12 | 25 |
| 865 | 893 | 1792 | 7168 | 128 | 14 | 20 |
| 981 | 1018 | 1792 | 7168 | 128 | 14 | 23 |
| 1096 | 1143 | 1792 | 7168 | 128 | 14 | 26 |
| 1215 | 1266 | 2048 | 8192 | 128 | 16 | 22 |
| 1364 | 1424 | 2176 | 8704 | 128 | 17 | 22 |
| 1366 | 1429 | 2048 | 8192 | 128 | 16 | 25 |
| 1517 | 1593 | 2048 | 8192 | 128 | 16 | 28 |
| 1535 | 1609 | 2176 | 8704 | 128 | 17 | 25 |
| 1650 | 1731 | 2304 | 9216 | 128 | 18 | 24 |
| 1706 | 1794 | 2176 | 8704 | 128 | 17 | 28 |
| 1905 | 2007 | 2304 | 9216 | 128 | 18 | 28 |
| 2160 | 2283 | 2304 | 9216 | 128 | 18 | 32 |
| 2179 | 2298 | 2560 | 10240 | 128 | 20 | 26 |
| 2494 | 2639 | 2560 | 10240 | 128 | 20 | 30 |
| 2809 | 2980 | 2560 | 10240 | 128 | 20 | 34 |
| 3090 | - | 2688 | 10752 | 128 | 22 | 34 |
| 3263 | 3530 | 2688 | 10752 | 128 | 21 | 36 |
| 3574 | 3802 | 2816 | 11264 | 128 | 22 | 36 |
| 3900 | 4084 | 2944 | 11776 | 128 | 23 | 36 |
| 4239 | 4516 | 3072 | 12288 | 128 | 24 | 36 |
| 6355 | 6796 | 3584 | 14336 | 128 | 28 | 40 |
| 8672 | 9293 | 4096 | 16384 | 128 | 32 | 42 |
| 10912 | 11452 | 4352 | 17408 | 128 | 32 | 47 |
| 11455 | 12295 | 4608 | 18432 | 128 | 36 | 44 |
| 12220 | 12569 | 4608 | 18432 | 128 | 32 | 47 |
| 13601 | 13735 | 4864 | 19456 | 128 | 32 | 47 |
| 14917 | 14940 | 4992 | 19968 | 128 | 32 | 49 |
| 15056 | 16183 | 5120 | 20480 | 128 | 40 | 47 |
Table 15: Model architectures. We list the architectures of all models trained as part of this work. Many of the models shown were trained multiple times, on different amounts of unique data and for varying numbers of epochs.
Appendix J. Prompts and Samples
The following figures illustrate the prompts with samples from each evaluation data set. Prompts stem from PromptSource (Bach et al., 2022) or GPT-3 (Brown et al., 2020). All data in this section comes from the ground-truth data sets; no model generations are shown.
Context Edmond (or Edmund) Halley, FRS (pronounced ; 8 November [O.S. 29 October] 1656 – 25 January 1742 [O.S. 14 January 1741]) was an English astronomer, geophysicist, mathematician, meteorologist, and physicist who is best known for computing the orbit of Halley's Comet. He was the second Astronomer Royal in Britain, succeeding John Flamsteed. Question: Edmond Halley was born outside of the United Kingdom. True, False, or Neither? Answer:
Correct Answer Neither Incorrect Answer True Incorrect Answer False
Figure 21: Formatted data set example from ANLI R1 evaluated using accuracy as described in D.
Context The 1970 Swedish Open was a combined men's and women's tennis tournament played on outdoor clay courts held in Båstad, Sweden and was part of the Grand Prix circuit of the 1970 Tour. It was the 23rd edition of the tournament and was held from 2 July through 12 July 1970. Dick Crealy and Peaches Bartkowicz won the singles titles. Question: Dick Crealy and Peaches Bartkowicz beat eachother in the 1970 Swedish Open. True, False, or Neither? Answer:
Correct Answer False Incorrect Answer True Incorrect Answer Neither
Figure 22: Formatted data set example from ANLI R2 evaluated using accuracy as described in D.
Context Tokyo - Food group Nestle is seeking to lure Japanese holiday shoppers with a taste for fine snacking with a gold-wrapped Kit Kat chocolate bar. The single finger Kit Kat is wrapped in a thin layer of gold leaf. Only 500 of the bars go on sale from Dec. 29 with a price tag of around 2,016 yen ($16). The Kit Kat chocolate bar made its debut in Japan in 1973 and since then a variety of flavors from green tea to wasabi have been produced. Question: Japanese like kit kat. True, False, or Neither? Answer:
Correct Answer True Incorrect Answer False Incorrect Answer Neither
Figure 23: Formatted data set example from ANLI R3 evaluated using accuracy as described in D.
Context An astronomer observes that a planet rotates faster after a meteorite impact. Which is the most likely effect of this increase in rotation?
Correct Answer Planetary days will become shorter. Incorrect Answer Planetary years will become longer. Incorrect Answer Planetary gravity will become stronger.
Figure 24: Formatted data set example from ARC-Challenge evaluated using accuracy as described in D.
Context To express the distance between the Milky Way galaxy and other galaxies, the most appropriate unit of measurement is the
Correct Answer light-year. Incorrect Answer meter. Incorrect Answer kilometer. Incorrect Answer astronomical unit.
Figure 25: Formatted data set example from ARC-Easy evaluated using accuracy as described in D.
Context Radio wave Radio waves are a type of electromagnetic radiation with wavelengths in the electromagnetic spectrum longer than infrared light. Radio waves have frequencies as high as 300 gigahertz (GHz) to as low as 30 hertz (Hz). At 300 GHz, the corresponding wavelength is 1 mm, and at 30 Hz is 10,000 km. Like all other electromagnetic waves, radio waves travel at the speed of light. They are generated by electric charges undergoing acceleration, such as time varying electric currents. Naturally occurring radio waves are emitted by lightning and astronomical objects. Question: do radio waves travel at the speed of light? Answer:
Correct Answer yes Incorrect Answer no
Figure 26: Formatted data set example from BoolQ evaluated using accuracy as described in D.
Context A: Okay. So Frank, what, uh, type of, uh, budget do you or your family have? B: Well, uh I don't know that we really have a budget. Question: he and his family really have a budget. True, False or Neither? Answer:
Correct Answer False Incorrect Answer True Incorrect Answer Neither
Figure 27: Formatted data set example from CB evaluated using accuracy as described in D.
Context The computer was expensive to fix therefore
Correct Answer I bought a new one. Incorrect Answer I got it repaired.
Figure 28: Formatted data set example from COPA evaluated using accuracy as described in D.
Context Canoeing: Two women in a child are shown in a canoe while a man pulls the canoe while standing in the water, with other individuals visible in the background. The child and a different man
Correct Answer sit in a canoe while the man paddles. Incorrect Answer are then shown paddling down a river in a boat while a woman talks. Incorrect Answer are driving the canoe, they go down the river flowing side to side. Incorrect Answer walking go down the rapids, while the man in his helicopter almost falls and goes out of canoehood.
Figure 29: Formatted data set example from HellaSwag evaluated using accuracy as described in D.
Context Question: How to sleep in proper posture? Answer:
Correct Answer Sleep straight with a pillow under your head. Incorrect Answer Sleep straight with a pillow over your head.
Figure 30: Formatted data set example from PiQA evaluated using accuracy as described in D.
Context As spacecraft commander for Apollo XI, the first manned lunar landing mission, Armstrong was the first man to walk on the Moon. "That's one small step for a man, one giant leap for mankind." With these historic words, man's dream of the ages was fulfilled. Question: Neil Armstrong was the first man who landed on the Moon. True or False? Answer:
Correct Answer True. Incorrect Answer False.
Figure 31: Formatted data set example from RTE evaluated using accuracy as described in D.
Context The electromagnetic spectrum encompasses a very wide range of wavelengths and frequencies. Visible light is only a very small portion of the spectrum with wavelengths from 400-700 nm. Question: With wavelengths from 400-700 nm, what kind of light represents only a very small portion of the spectrum? Answer:
Correct Answer visible light. Incorrect Answer ultraviolet light. Incorrect Answer invisible light. Incorrect Answer sunlight.
Figure 32: Formatted data set example from SciQ evaluated using accuracy as described in D.
Context Bob went to the gas station to fill up his car. His tank was completely empty and so was his wallet. The cashier offered to pay for his gas if he came back later to pay. Bob felt grateful as he drove home. Answer:
Correct Answer Bob believed that there were good people in the world. Incorrect Answer Bob contemplated how unfriendly the world was.
Figure 33: Formatted data set example from StoryCloze evaluated using accuracy as described in D.
Correct Context Johnny likes fruits more than vegetables in his new keto diet because the fruits: Incorrect Context Johnny likes fruits more than vegetables in his new keto diet because the vegetables:
Target Completion are saccharine.
Figure 34: Formatted data set example from WinoGrande evaluated using accuracy as described in D.
Context Given the following data about a restaurant: name : The Wrestlers eatType : pub food : Japanese priceRange : cheap area : riverside near : Raja Indian Cuisine
Generate some text about this restaurant.
Target The Wrestlers offers Japanese food and pub with cheap price near Raja Indian Cuisine in riverside.
Figure 35: Formatted data set example from E2E NLG evaluated using ROUGE as described in D.
Context Article: The artificial intelligence system - LipNet - watches video of a person speaking and matches the text to the movement of their mouths with 93% accuracy, the researchers said. Automating the process could help millions, they suggested. But experts said the system needed to be tested in real-life situations. Lip-reading is a notoriously tricky business with professionals only able to decipher what someone is saying up to 60% of the time. "Machine lip-readers have enormous potential, with applications in improved hearing aids, silent dictation in public spaces, covert conversations, speech recognition in noisy environments, biometric identification and silent-movie processing," wrote the researchers. They said that the AI system was provided with whole sentences so that it could teach itself which letter corresponded to which lip movement. To train the AI, the team - from Oxford University's AI lab - fed it nearly 29,000 videos, labelled with the correct text. Each video was three seconds long and followed a similar grammatical pattern. While human testers given similar videos had an error rate of 47.7%, the AI had one of just 6.6%. The fact that the AI learned from specialist training videos led some on Twitter to criticise the research. Writing in Open Review, Neil Lawrence pointed out that the videos had "limited vocabulary and a single syntax grammar". "While it's promising to perform well on this data, it's not really groundbreaking. While the model may be able to read my lips better than a human, it can only do so when I say a meaningless list of words from a highly constrained vocabulary in a specific order," he writes. The project was partially funded by Google's artificial intelligence firm DeepMind.
Target Scientists at Oxford University have developed a machine that can lip-read better than humans.
Figure 36: Formatted data set example from XSUM evaluated using ROUGE as described in D.
Context I will verbalize an abstract representation of a sentence in natural language. To do so, I will first show the representation and then the natural language. The text needs to include all of the information in the representation. Brandon_Carter | alma Mater | University_of_Cambridge, University_of_Cambridge | chancellor | David_Sainsbury,_Baron_Sainsbury_of_Turville, Brandon_Carter | birth Place | England, University_of_Cambridge | vice Chancellor | Leszek_Borysiewicz
Target The University of Cambridge is the alma mater of Brandon Carter, who was born in England. David Sainsbury, also known as the Baron Sainsbury of Turville, and Leszek Borysiewicz are respectively the chancellor and vice chancellor of the University of Cambridge.
Figure 37: Formatted data set example from WebNLG evaluated using ROUGE as described in D.
Context Attributes are placed within the tag itself, making additional alterations to the element content between the start and end tag. They never stand alone. They are written in the format name="value", where name is the name of the attribute (for instance "color"), and value describes this specific instance (for instance "red"). You've actually seen attributes before, if you followed the tutorial in the basic HTML section. tags use the src attribute, anchors use the name attribute, and links use the href attribute. See how those all follow the ___="___" format? Making a table, or chart, requires several different tags. Play with these tags, or learn about HTML tables in more detail. Start with table tags around the entire table:
| Column 1: Month | Column 2: Money Saved |
|---|---|
| January | $100 |