Overtrained Language Models Are Harder to Fine-Tune

Jacob Mitchell Springer 1  Sachin Goyal 1  Kaiyue Wen 2  Tanishq Kumar 3  Xiang Yue 1  Sadhika Malladi 4  Graham Neubig 1  Aditi Raghunathan 1

1Carnegie Mellon University  2Stanford University  3Harvard University  4Princeton University. Correspondence to: Jacob Mitchell Springer.

Proceedings of the 42nd International Conference on Machine Learning, Vancouver, Canada. PMLR 267, 2025. Copyright 2025 by the author(s).

Abstract

Large language models are pre-trained on ever-growing token budgets under the assumption that better pre-training performance translates to improved downstream models. In this work, we challenge this assumption and show that extended pre-training can make models harder to fine-tune, leading to degraded final performance. We term this phenomenon catastrophic overtraining. For example, the instruction-tuned OLMo-1B model pre-trained on 3T tokens performs over 2% worse on multiple standard LLM benchmarks than its 2.3T-token counterpart. Through controlled experiments and theoretical analysis, we show that catastrophic overtraining arises from a systematic increase in the broad sensitivity of pre-trained parameters to modifications, including but not limited to fine-tuning. Our findings call for a critical reassessment of pre-training design that considers the downstream adaptability of the model.

1. Introduction

Language models have achieved widespread success following a two-stage paradigm: (1) pre-training on a vast corpus of uncurated data, followed by (2) post-training on high-quality task-specific data, often to confer targeted abilities such as instruction-following, multi-modality, or reasoning. Under the maxim "more data is better," there have been massive investments in scaling both pre-training and post-training. Hoffmann et al. (2022) proposed a compute-optimal ratio of roughly 20 tokens per model parameter, yet recent models have far exceeded this. For example, Llama-2-7B (Touvron et al., 2023) was trained on 1.8T tokens, roughly 13× the recommended ratio, and Llama-3-8B scaled this further to 15T tokens. This trend is driven by consistent gains in zero-shot performance (Gadre et al., 2024; Sardana et al., 2024), with few exceptions where scaling up is not helpful (Wei et al., 2022; McKenzie et al., 2022a;b; 2023).

Figure 1. Language models with extensive pre-training can exhibit catastrophic overtraining, where the performance of post-trained models degrades as the pre-training stage is extended. We report the average performance on five common LLM benchmarks (including ARC-Easy, ARC-Challenge, PIQA, and HellaSwag) for OLMo-1B intermediate checkpoints before and after instruction fine-tuning (Stage 1: pre-train; Stage 2: post-train via instruction tuning), with additional results in Section 2. We argue that catastrophic overtraining arises as a result of a progressive increase throughout pre-training of model sensitivity to parameter transformations, leading to greater forgetting of the capabilities acquired during pre-training after fine-tuning (Section 3). Overall, our results challenge the notion that scaling pre-training is strictly beneficial.

In this paper, we demonstrate that the widely adopted strategy of scaling up language model pre-training does not universally translate to better performance after post-training.
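For concreteness, the short script below works out the token-to-parameter ratios quoted above against the roughly 20:1 heuristic of Hoffmann et al. (2022). This is our own illustrative sketch; the parameter counts are the nominal model sizes.

```python
# Back-of-the-envelope check of the token-to-parameter ratios quoted above,
# relative to the ~20 tokens/parameter heuristic of Hoffmann et al. (2022).
CHINCHILLA_RATIO = 20  # tokens per parameter

models = {
    "Llama-2-7B": (7e9, 1.8e12),   # (parameters, pre-training tokens)
    "Llama-3-8B": (8e9, 15e12),
    "OLMo-1B":    (1e9, 3e12),
}

for name, (params, tokens) in models.items():
    ratio = tokens / params
    print(f"{name}: {ratio:.0f} tokens/param "
          f"({ratio / CHINCHILLA_RATIO:.1f}x compute-optimal)")
```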
Through both theory and experiments, we uncover a phenomenon we term catastrophic overtraining, where longer pre-training harms final model performance after instruction tuning or other forms of post-training (Figure 1). Catastrophic overtraining is not an isolated curiosity; rather, it emerges consistently across a range of models and tasks. As shown in Section 2, extensive empirical evaluations demonstrate the prevalence of this phenomenon in existing models.

Figure 2. Extending pre-training can degrade performance after fine-tuning on Anthropic-HH (left: OLMo-1B-Anthropic-HH, instruction fine-tuned) and LLaVA (right: OLMo-1B-LLaVA, multimodal fine-tuned). Panels plot, against pre-training tokens, the ID task (AlpacaEval for instruction tuning; VLM score for multimodal tuning) and OOD benchmarks (ARC-Challenge, ARC-Easy, HellaSwag, Winogrande). We consider fine-tuning of various intermediate checkpoints from OLMo-1B pre-training. While the base model performance (before fine-tuning) improves with the pre-training token budget (black dashed curve), the performance after fine-tuning drops as we pre-train on more tokens. In the instruction-tuning setting (left), we observe degradation on the ID task (green), AlpacaEval, as well as on OOD benchmarks (blue): ARC, PIQA, and HellaSwag. In the multimodal tuning setting, we observe degradation with overtraining on PIQA, and a larger gap between the fine-tuned and base model for ARC, HellaSwag, and Winogrande. We report the average over three independent fine-tuning runs, with error bars. Refer to Appendix F for additional models (OLMo-2-7B, LLM360-Amber) and instruction-tuning datasets (extended results for Anthropic-HH, TULU).

For instance, we show that the OLMo-1B model (Groeneveld et al., 2024a), pre-trained on 3T tokens and post-trained on Anthropic-HH (Bai et al., 2022), performs 3% worse on AlpacaEval (Li et al., 2023b) and 2% worse on ARC (Clark et al., 2018) compared to an intermediate checkpoint trained on just 2.3T tokens (Figure 2).

To understand why catastrophic overtraining occurs, we turn to carefully controlled experiments (Section 3). We find that modifying the parameters of a pre-trained model leads to forgetting of previously acquired capabilities, where the extent of this forgetting depends on the magnitude of the parameter modifications. However, another key factor influencing forgetting is what we term progressive sensitivity: for modifications of equal magnitude, models that have undergone longer pre-training exhibit greater forgetting (Figure 4). Catastrophic overtraining arises when this increased forgetting due to post-training modifications overtakes the improvement during pre-training. While constraining the magnitude of the parameter modifications that arise from post-training can mitigate this degradation, it can also limit the pre-trained model's capacity to adapt and learn. This reveals an inherent trade-off that shapes the feasibility of preventing catastrophic overtraining in practice (Figure 7).

Finally, we present a theoretical analysis of a linear transfer learning setting in Section 4 that admits a precise characterization of catastrophic overtraining and progressive sensitivity. We study how incremental feature learning leads to progressive sensitivity and inevitable catastrophic overtraining.
Regularization during fine-tuning can delay the onset, albeit at the cost of downstream performance.

2. Extended pre-training can hurt post-training

We study the effect of extended pre-training on two common post-training setups: instruction tuning for instruction-following capability, and multimodal fine-tuning (visual instruction tuning) with LLaVA (Liu et al., 2023a).

2.1. Experimental setup

To analyze the effect of overtraining, we experiment on three language models with open-sourced intermediate checkpoints: OLMo-1B (Groeneveld et al., 2024a), OLMo-2-7B (OLMo et al., 2024), and LLM360-Amber-7B (Liu et al., 2023b). For each model, we perform post-training on intermediate checkpoints. We investigate instruction tuning with two datasets: Anthropic-HH (Bai et al., 2022) and TULU (Wang et al., 2023), and we perform multimodal fine-tuning with the LLaVA visual instruction tuning framework (Liu et al., 2023a). We train each intermediate checkpoint on each dataset.

We evaluate model performance along two key dimensions: the ID performance, evaluated on the fine-tuning task of interest (e.g., instruction following), and the OOD performance, computed on a suite of ten common LLM evaluation benchmarks covering reasoning, QA, commonsense, and knowledge extraction. For each checkpoint, we tune the learning rate and select the model with the best ID performance, a protocol sketched in code after this section. We refer the reader to Appendix C for further information on the pre-trained models, the specification of the fine-tuning process, and details of evaluation.

2.2. Results

Figure 2 compares the performance of various OLMo-1B models, trained to different pre-training budgets (x-axis).

Extended pre-training always improves base models. In line with past work, we find that extended pre-training yields a monotonic improvement in the base models. The performance keeps improving on all the downstream tasks we evaluate (dashed line in Figure 2).

Extended pre-training can hurt post-trained counterparts. While the base model improves, we find a surprising degradation when the base models are post-trained. Specifically, after fine-tuning on the Anthropic-HH dataset for instruction following, a base model pre-trained on 3T tokens shows up to a 3% lower response rate (AlpacaEval score) than one pre-trained on just 2.3T tokens (23% fewer tokens). We see a similar drop on various OOD tasks such as reasoning and question answering, as evaluated on benchmarks such as ARC-Easy, ARC-Challenge, HellaSwag, and PIQA. Overall, after instruction tuning, models pre-trained on 3T tokens underperform compared to those pre-trained on 2.3T tokens, dropping to the level of models pre-trained with just 1.5T tokens (50% fewer tokens).

For multimodal fine-tuning, we see that extended pre-training translates to continuous improvements in the VLM score. However, models pre-trained on more tokens show greater forgetting and larger drops in performance across the various OOD benchmarks. On some datasets, such as PIQA, the drop is so severe that extended pre-training actively hurts performance after post-training (Figure 2, right). We present evaluations of additional pre-trained models on different fine-tuning setups in Appendix F.

Overall, while extended pre-training always improves the pre-training performance, these gains do not always translate to post-training. There are several settings where extended pre-training actively hurts post-training performance.
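The evaluation protocol of Section 2.1 can be summarized in a few lines. This is a minimal sketch, not the paper's actual pipeline: the helpers load_checkpoint, finetune, evaluate_id, and evaluate_ood, the learning-rate grid, and the token budgets are all hypothetical placeholders.

```python
# Sketch of the protocol in Section 2.1: fine-tune every intermediate
# checkpoint at several learning rates, keep the run with the best ID score,
# then report ID and OOD performance for that run. All helpers here
# (load_checkpoint, finetune, evaluate_id, evaluate_ood) are hypothetical.
LEARNING_RATES = [1e-6, 4e-6, 1e-5, 4e-5]  # illustrative grid

def evaluate_checkpoint(token_budget, dataset):
    base = load_checkpoint("OLMo-1B", tokens=token_budget)
    runs = [finetune(base, dataset, lr=lr) for lr in LEARNING_RATES]
    best = max(runs, key=evaluate_id)          # tune LR on ID performance
    return evaluate_id(best), evaluate_ood(best)

for tokens in [0.5e12, 1.0e12, 1.5e12, 2.3e12, 3.0e12]:
    id_score, ood_score = evaluate_checkpoint(tokens, "anthropic-hh")
    print(f"{tokens / 1e12:.1f}T tokens: ID={id_score:.3f}, OOD={ood_score:.3f}")
```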
3. Catastrophic overtraining

In Section 2, we made the surprising observation that extended pre-training can hurt post-training. In this section, we dig deeper into this phenomenon to understand why and when expending more compute by pre-training on more tokens can counterintuitively degrade performance. We begin by defining the phenomenon, which we call catastrophic overtraining.

Catastrophic overtraining is the phenomenon where extending pre-training beyond a certain token budget results in a decrease in the model's performance after subsequent modifications. We call the token budget at which performance first begins to degrade the inflection point. Catastrophic overtraining can refer to a decrease in the pre-training performance or in the performance on other downstream tasks as pre-training is extended. Note that this performance drop can manifest differently across various downstream evaluation tasks, even for the same model. In Section 2, we saw catastrophic overtraining when post-training OLMo-1B for instruction tuning or multimodal fine-tuning and evaluating on standard benchmarks.

In the rest of this paper, we aim to answer two central questions:

1. When and why does catastrophic overtraining occur?
2. What factors influence the inflection point?

To address these questions, we systematically study and build an intuitive picture of the effect of overtraining in the presence of Gaussian perturbations (Section 3.2) and then expand to fine-tuning in a controlled but real-world setup (Section 3.3).

3.1. Catastrophic overtraining in a controlled setup

We documented several instances of catastrophic overtraining in real-world scenarios. To gain a deeper understanding and explore more extreme degrees of overtraining, we investigate a simpler, controlled setup described below. Note that our real-world experiments used publicly available checkpoints from a single training run, which meant that each pre-training budget corresponded to a different final learning rate due to the annealing schedule. In this section, we remove that confounding factor.

Pre-training setup. We pre-train models from scratch with sizes ranging from 15M to 90M parameters, spanning token budgets from 4B to 128B, on C4 web data (Raffel et al., 2019). We train with a cosine annealing schedule that anneals the learning rate of every model to zero. In the main paper, we present results from the 30M model; see Appendix G for results with 15M and 90M parameter models.

Modifications to the pre-trained model. We fine-tune the pre-trained models above. We fine-tune each model on various classification and language modeling datasets spanning QA, sentiment analysis, math, and code. Details on the datasets and hyperparameter choices are provided in Appendix D. We also consider a simple modification of adding Gaussian perturbations to the pre-trained weights as a warm-up in Section 3.2.

Figure 3. Progressive sensitivity to Gaussian perturbations (left): extending pre-training progressively increases the degree to which a Gaussian parameter perturbation degrades perplexity. Catastrophic overtraining (right): eventually, this leads to overall worse pre-training perplexity. We perturb OLMo-30M models trained on various pre-training token budgets with Gaussian noise scaled by the factor γ, ranging from 0.0025 (minimum) to 0.04 (maximum). The left plot shows the difference in perplexity between the perturbed and unperturbed models, while the right plot shows the absolute perplexity of the perturbed models.
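A minimal sketch of the controlled pre-training sweep described above. The helper pretrain is hypothetical, and the intermediate token budgets are assumed to double (the paper only states the 4B-128B range). The key design point is that each run gets its own cosine schedule annealed fully to zero, so the final learning rate is no longer confounded with the token budget.

```python
# Sketch of the controlled sweep in Section 3.1 (hypothetical `pretrain`
# helper; intermediate budgets are assumed): every (size, budget) pair is a
# separate run whose cosine schedule anneals to zero, removing the final-LR
# confounder that affects intermediate checkpoints of one long run.
import itertools

MODEL_SIZES = ["15M", "30M", "90M"]
TOKEN_BUDGETS = [4e9, 8e9, 16e9, 32e9, 64e9, 128e9]  # assumed doubling grid

for size, budget in itertools.product(MODEL_SIZES, TOKEN_BUDGETS):
    pretrain(size=size, tokens=budget, data="c4",
             lr_schedule="cosine", final_lr=0.0)  # anneal to zero per run
```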
Our intuitive picture views post-training as some modification to the pre-trained model, which was trained on large amounts of broad data. Such modifications are aimed at improving some targeted performance (such as VLM score). However, as argued in Kumar et al. (2022), such modifications can inadvertently distort the pre-trained knowledge, leading to degraded performance on out-of-distribution or unrelated tasks.

Downstream evaluation. While we evaluate real-world benchmarks in Section 2, we focus here on measuring the C4 perplexity of the modified downstream model as an indicator of how well the original pre-trained knowledge is preserved. A degradation in C4 perplexity may signal a loss of this knowledge, potentially resulting in out-of-distribution performance degradation (due to forgetting or distortion). We also measure ID performance as perplexity on a held-out set from the same distribution as the fine-tuning data. We use perplexity rather than accuracy because it is a smoother and less noisy metric, and can often offer a better measure of model quality than accuracy for small models (Schaeffer et al., 2023; 2024). Although our analysis centers on pre-training perplexity, we acknowledge that other factors may also contribute to downstream performance losses, a topic we leave for future work.

3.2. Warmup: Gaussian perturbations

We take base models pre-trained to various token budgets and add Gaussian noise of the following form. Let $\theta \in \mathbb{R}^d$ denote the base model weights; we form

$$\theta' = \theta + \epsilon \quad \text{where} \quad \epsilon \sim \mathcal{N}(0, \gamma^2 \Sigma), \tag{1}$$

where $\Sigma$ is the covariance matrix of the initialization distribution of the parameters (prior to pre-training) and $\gamma$ controls the magnitude of the perturbation.

First, we plot the change in C4 perplexity due to Gaussian noise, i.e., the difference between the C4 perplexity of $\theta'$ and $\theta$, in Figure 3 (left). We observe an interesting trend as we track the change in perplexity between the base model and the perturbed model as a function of the number of pre-training tokens:

Progressive sensitivity to noise: For a fixed magnitude of perturbation, the change in perplexity between the base model and the perturbed model increases monotonically with the number of pre-training tokens.

Simultaneously, we plot the absolute C4 perplexity of the base model (Figure 3, right, dashed line). We observe that the base model's perplexity decreases with the number of pre-training tokens. In this setting, catastrophic overtraining arises from the interaction between the progressive sensitivity to noise and the monotonic improvement of the base model as pre-training progresses. Early in training, the base model improves faster than the rate at which sensitivity increases, leading to a net decrease in perplexity after Gaussian parameter perturbations. Beyond a certain point, the rate at which sensitivity increases surpasses the rate at which the base model improves, leading to an increase in perplexity after the perturbation. This results in a U-shaped trend of the C4 perplexity after perturbation (Figure 3, right).

Tracking the inflection point. In Figure 3, larger perturbations are associated with a larger and more quickly increasing degradation of the pre-training loss. Thus, the point at which the degradation from sensitivity surpasses the improvement in the base model is reached sooner for larger perturbations, leading to an earlier inflection point.
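A minimal PyTorch sketch of the perturbation in Eq. (1). How $\Sigma$ is materialized is our assumption: we approximate it as a diagonal covariance using each tensor's initialization standard deviation, passed in as the hypothetical init_stds mapping.

```python
import torch

def perturb_(model, init_stds, gamma):
    """Apply Eq. (1) in place: theta' = theta + eps, eps ~ N(0, gamma^2 Sigma).
    Sigma is approximated here as diagonal, given by each tensor's std at
    initialization (init_stds: parameter name -> scalar std). This diagonal
    approximation is our assumption, not spelled out in the paper."""
    with torch.no_grad():
        for name, p in model.named_parameters():
            p.add_(gamma * init_stds[name] * torch.randn_like(p))

# Usage sketch (`model`, `init_stds`, `c4_perplexity` are placeholders):
# ppl_base = c4_perplexity(model)
# perturb_(model, init_stds, gamma=0.01)
# print(c4_perplexity(model) - ppl_base)  # gap grows with pre-training tokens
```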
Intuitive picture. Pre-training on more tokens improves the base model (as expected) but also makes the base model more sensitive to noise. Progressive sensitivity leads to catastrophic overtraining as the increase in perplexity due to noise eventually overwhelms improvements in the model. For large-magnitude perturbations, this degradation sets in at a lower token budget, while for smaller magnitudes of perturbation, catastrophic overtraining may not be observed until a large token budget.

Figure 4. Progressive sensitivity of fine-tuning: extending pre-training progressively increases the degree to which fine-tuning degrades perplexity. OLMo-30M models trained on various pre-training token budgets are fine-tuned on downstream tasks using fixed hyperparameters: math (GSM8k), code (StarCoder-Python), and QA (SIQA). Lines connect models sharing hyperparameters, differing only in pre-training tokens. Learning rates range from 4e-06 to the dataset-specific maximum (η_max). We report the difference in perplexity between the fine-tuned and pre-trained models, as a function of the number of pre-training tokens.

3.3. Fine-tuning pre-trained models

In the previous section, we studied how catastrophic overtraining arises when adding noise to pre-trained models. While noise can be seen as a canonical modification, it differs from fine-tuning, which can involve more structured updates to the model. However, we see in this section that the intuitive story above also holds when we fine-tune models on the real-world language datasets described above.

3.3.1. FINE-TUNING WITH A FIXED LEARNING RATE

First, analogous to how we fix the magnitude of the Gaussian perturbation (γ), we need to regularize the fine-tuning in some way to ensure a consistent degree of change across the pre-trained checkpoints. Fixing the learning rate is a simple and effective way to do so. While we do not provide a formal justification, we discuss our reasoning in Appendix D.

For each learning rate, we plot the change in C4 perplexity from the pre-trained model to the fine-tuned model in Figure 4. In this plot, we track how the degradation in C4 perplexity evolves with the number of pre-training tokens. First, larger learning rates distort the model more and thus exhibit a greater increase in perplexity. Second, we observe a trend over pre-training tokens analogous to the behavior seen with Gaussian noise, but this time for fine-tuning:

Progressive sensitivity when fine-tuning: For a fixed learning rate, the change in perplexity increases monotonically with the number of pre-training tokens.

At the inflection point, where the rate at which sensitivity increases surpasses the rate at which the base model improves, we observe catastrophic overtraining. This results in a U-shaped trend of the C4 perplexity after fine-tuning (Figure 5, top).

Tracking the inflection point for fine-tuning. Analogous to the Gaussian setting, since the rate of increase of degradation is accelerated for larger learning rates, models trained with larger learning rates exhibit an inflection point at lower token budgets, and the degradation is more pronounced.
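The fixed-learning-rate probe of this subsection can be sketched as follows (hypothetical helpers as before; token budgets assumed). Holding the learning rate fixed keeps the degree of change comparable across checkpoints, so growth of the perplexity gap with the token budget is the signature of progressive sensitivity.

```python
# Sketch of the Section 3.3.1 probe (hypothetical helpers: load_checkpoint,
# finetune, c4_perplexity; token budgets assumed). With the learning rate
# held fixed, an increasing delta across budgets indicates progressive
# sensitivity; an eventual rise in absolute perplexity is the inflection.
TOKEN_BUDGETS = [4e9, 8e9, 16e9, 32e9, 64e9, 128e9]

for lr in [4e-6, 1e-5, 1e-4]:
    for budget in TOKEN_BUDGETS:
        base = load_checkpoint("olmo-30m", tokens=budget)
        tuned = finetune(base, dataset="gsm8k", lr=lr)
        delta = c4_perplexity(tuned) - c4_perplexity(base)
        print(f"lr={lr:g}, tokens={budget:.0e}: delta C4 ppl = {delta:+.3f}")
```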
ID perplexity. While smaller learning rates generally result in less degradation of the C4 perplexity, the ID perplexity of the fine-tuned models shows a different trend: larger learning rates, up to a point, result in a lower ID perplexity, though they sometimes also exhibit a U-shaped trend in ID perplexity (Figure 5, bottom). This implies that tuning the learning rate can sometimes mitigate degradation only at the cost of fine-tuning performance. We explore in Section 3.3.2 when tuning the learning rate to minimize the ID perplexity can mitigate the degradation of C4 perplexity that arises as pre-training is extended, and when it cannot.

Intuitive picture. The intuition from the Gaussian perturbation setting carries over to fine-tuning with a fixed learning rate. Pre-training on more tokens will improve the quality of the base model and at the same time make the model degrade more when fine-tuned. Beyond a certain point, pre-training on additional tokens will degrade the resulting fine-tuned model's C4 perplexity, and often the ID perplexity of the fine-tuning task.

3.3.2. BALANCING FINE-TUNING GAINS WITH DEGRADATION

In Section 3.3, we showed that for a fixed learning rate, the sensitivity of pre-trained models increases with the number of pre-training tokens, leading to catastrophic overtraining. In practice, however, the learning rate is tuned on a validation set from the in-domain (ID) task. This tuning process may yield different optimal learning rates across pre-trained checkpoints, which can potentially mitigate catastrophic overtraining. The degradation depends on both the learning rate and the sensitivity. So if a model pre-trained on more tokens admits a smaller learning rate when fine-tuning while still achieving good ID performance, it can compensate for the increase in sensitivity. However, this smaller rate also restricts the extent of the necessary parameter updates, and might be insufficient to achieve good ID performance. This presents an interesting trade-off that we investigate empirically. We tune the learning rate to maximize fine-tuning ID performance. We track the optimal value as a function of the pre-training token budget, and plot the ID performance and pre-training perplexity corresponding to this optimal learning rate in Figure 6.

Figure 5. Catastrophic overtraining for fine-tuning with fixed hyperparameters: extending pre-training can lead to an overall increase in the C4 perplexity (top) and ID perplexity (fine-tuning task; bottom) when fine-tuning with fixed hyperparameters. OLMo-30M models pre-trained with varying token budgets are fine-tuned on downstream tasks using fixed hyperparameters: math (GSM8k), code (StarCoder-Python), QA (SIQA), and classification (MR, RTE, TREC). Lines connect models sharing hyperparameters, differing only in pre-training tokens. Learning rates range from 4e-06 to the dataset-specific maximum (η_max). At sufficiently large learning rates (lighter colors), we observe performance degradation in both ID and pre-training metrics beyond certain pre-training budgets. (See Appendices D and G for ablations.)
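A sketch of the tuning protocol of Section 3.3.2 (again with hypothetical helpers): for each checkpoint we pick the learning rate that minimizes ID perplexity and then record the C4 perplexity at that rate. How the selected rate evolves with the token budget determines which of the scenarios in Figure 7 applies.

```python
# Sketch of per-checkpoint LR tuning (hypothetical helpers: load_checkpoint,
# finetune, id_perplexity, c4_perplexity). The best LR is chosen on the ID
# metric; C4 perplexity at that LR is then the out-of-domain readout.
def tune_and_measure(budget, dataset, lrs):
    base = load_checkpoint("olmo-30m", tokens=budget)
    runs = {lr: finetune(base, dataset, lr=lr) for lr in lrs}
    best_lr = min(runs, key=lambda lr: id_perplexity(runs[lr]))  # lower is better
    return best_lr, id_perplexity(runs[best_lr]), c4_perplexity(runs[best_lr])

for budget in [4e9, 16e9, 64e9, 128e9]:
    lr, id_ppl, c4_ppl = tune_and_measure(budget, "gsm8k",
                                          [4e-6, 1e-5, 1e-4, 1e-3])
    print(f"{budget:.0e} tokens: best lr={lr:g}, "
          f"ID ppl={id_ppl:.2f}, C4 ppl={c4_ppl:.2f}")
```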
Our findings indicate that the emergence of catastrophic overtraining depends on how the optimal learning rate evolves. We conceptualize this trade-off between ID performance and pre-training perplexity degradation into three scenarios, illustrated in Figure 7:

1. Constant optimal learning rate: A constant optimal learning rate across token budgets leads to degradation in both ID and out-of-domain (OOD) performance for large pre-training budgets T (Figure 7, left).

2. Slowly decreasing optimal learning rate: A slowly decreasing optimal learning rate may improve ID performance while OOD performance degrades (Figure 7, center).

3. Quickly decreasing optimal learning rate: A quickly decreasing optimal learning rate enables improvements in both ID and OOD performance as the pre-training budget increases (Figure 7, right).

Using a non-optimal learning rate to mitigate degradation. In cases where catastrophic overtraining emerges when fine-tuning with the optimal learning rate, using a non-optimal learning rate can sometimes mitigate the degradation or delay the inflection point. For example, in both cases where tuning leads to eventual degradation of the OOD loss in Figure 7, choosing to train with the smallest learning rate would delay the inflection point. However, this would also result in lower ID performance.

Summary. Overall, our experiments reveal that progressive sensitivity manifests under two types of modifications: unstructured Gaussian noise and structured fine-tuning, leading us to conjecture that progressive sensitivity is a universal phenomenon. With a fixed perturbation magnitude or learning rate, this sensitivity causes catastrophic overtraining. In practice, tuning the learning rate introduces a trade-off: its evolution determines whether extended pre-training results in performance degradation.

4. A theoretical perspective on overtraining

Catastrophic overtraining contradicts the common belief that more pre-training improves model quality. To investigate this, we analyze catastrophic overtraining for two-layer linear networks, focusing on identifying the inflection point (Definition 4.2), the point after which more pre-training harms final performance on the pre-training task. We first examine catastrophic overtraining via Gaussian perturbations, paralleling our empirical results (Section 3.2), and then demonstrate progressive sensitivity during extended pre-training in a canonical fine-tuning scenario (Theorem 4.6). Next, we formalize how restricting the magnitude of the updates can alleviate performance degradation, using regularization rather than reduced learning rates as in earlier experiments (Section 3.3.2). Without regularization, catastrophic overtraining inevitably emerges (Theorem 4.7).

Figure 6. Catastrophic overtraining after hyperparameter tuning: extending pre-training can lead to eventual degradation of the C4 perplexity (top) and ID perplexity (fine-tuning task; bottom), even after hyperparameter tuning. OLMo-30M models pre-trained with varying token budgets are fine-tuned on downstream tasks: math (GSM8k), code (StarCoder-Python), QA (SIQA), and classification (MR, RTE, TREC). Lower is better. We tune the learning rate to optimize ID performance.
ID perplexity degrades with extensive overtraining (RTE, TREC); C4 perplexity degrades on GSM8k, StarCoder-Python, MR, and RTE. Results are averaged over three fine-tuning runs. (Additional ablations in Appendices D and G.)

While regularization can prevent this phenomenon, it can also impair fine-tuning performance by limiting adaptation (Theorem 4.7).

4.1. Pre-training setting

We adopt the two-layer linear regression setting proposed by Saxe et al. (2018) as a case where pre-training performance improves monotonically with training time via incremental feature learning. Precisely, we consider a regression problem where the data is generated by a full-rank linear map $y = A_{\mathrm{pre}} x$ for $x, y \in \mathbb{R}^d$, with $A_{\mathrm{pre}} \in \mathbb{R}^{d \times d}$, and where we sample $x \sim \mathcal{N}(0, I)$. Denote the SVD of $A_{\mathrm{pre}}$ as $U \Sigma_{\mathrm{pre}} V^\top$, with the diagonal elements of $\Sigma_{\mathrm{pre}}$ strictly positive and monotonically decreasing. We call these singular values the pre-training features, and denote them $\sigma^{\mathrm{pre}}_1 > \cdots > \sigma^{\mathrm{pre}}_d$. Let $\Sigma_{\mathrm{pre}:i}$ be a diagonal matrix whose first $i$ singular values equal those of $\Sigma_{\mathrm{pre}}$ and whose remaining entries are set to 0.

We learn a two-layer network $\theta = W_1 W_2$ with $W_1, W_2 \in \mathbb{R}^{d \times d}$ that minimizes the mean squared error $L_{\mathrm{pre}}$ on the population of Gaussian inputs:

$$L_{\mathrm{pre}}(\theta(t)) = \| W_1(t) W_2(t) - A_{\mathrm{pre}} \|_F^2.$$

We initialize $W_1$ and $W_2$ with small values and train using gradient flow. Prior work has established that, as training proceeds in this setting, the model $\theta$ incrementally learns the spectrum of $A_{\mathrm{pre}}$ (Saxe et al., 2018; Gidel et al., 2019).

Theorem 4.1 (Informal statement of Saxe et al. (2018); Gidel et al. (2019)). There exists a sequence of timesteps $t_1 < \ldots < t_i < \ldots < t_d$ such that at timestep $t_i$, $\theta(t_i) \approx U \Sigma_{\mathrm{pre}:i} V^\top$.

This theorem implies that $\Sigma(t) = U^\top \theta(t) V$ is approximately diagonal, and the vector of its diagonal entries $\sigma(t)$ tracks which pre-training features have been learned by time $t$. In the ideal case, which we use in the main paper for brevity, we expect the first $n$ elements of $\sigma(t_n)$ to be $\sigma^{\mathrm{pre}}_1, \ldots, \sigma^{\mathrm{pre}}_n$ and the remaining elements to be zero.¹ Therefore, studying the evolution of $\sigma$ over time and its impact on the fine-tuning procedure allows us to characterize how elongating the pre-training period affects the pre-training and downstream performance of the final model. We generally study progressive sensitivity and catastrophic overtraining by characterizing the model at time steps $t_1, \ldots, t_d$. We focus on studying the inflection point, the time at which catastrophic overtraining with respect to the pre-training loss emerges.

Definition 4.2 (Inflection point). Fix a post-training modification $\mathcal{A}$ to the model. The inflection point with respect to the pre-training loss is defined as the smallest $r$ such that $L_{\mathrm{pre}}(\mathcal{A}(\theta(t_r))) < L_{\mathrm{pre}}(\mathcal{A}(\theta(t_{r+1})))$.

In the following two sections, we study the inflection point for two different post-training modifications: Gaussian parameter perturbations and fine-tuning on a canonical family of tasks.

4.2. Gaussian perturbation setting

As a warm-up, we set $\mathcal{A}$ to be an isotropic Gaussian parameter perturbation, mirroring Section 3.2. Formally, let

$$\mathcal{A}(\theta(t_n)) = \tilde{\theta}(t_n) = (W_1(t_n) + Z_1)(W_2(t_n) + Z_2), \quad Z_1, Z_2 \sim \mathcal{N}(0, \gamma^2 I_{d^2 \times d^2}),$$

and let $\tilde{L}_{\mathrm{pre}}(t_n) = \mathbb{E}\big[ L_{\mathrm{pre}}(\tilde{\theta}(t_n)) \big]$. We characterize how the perturbed model's pre-training loss $\tilde{L}_{\mathrm{pre}}(t_n)$ evolves as pre-training is extended.

¹Appendix A contains the case when these coordinates are small but not exactly zero.
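The incremental feature learning of Theorem 4.1 is easy to reproduce numerically. The snippet below is our own illustration (dimensions and spectrum made up): gradient descent on the factorized objective from a small initialization picks up the singular values of $A_{\mathrm{pre}}$ roughly one at a time, largest first.

```python
import numpy as np

# Small simulation of the Section 4.1 dynamics (our own illustration, not the
# paper's code): gradient descent on ||W1 W2 - A_pre||_F^2 from a tiny
# initialization learns the singular values of A_pre roughly sequentially.
rng = np.random.default_rng(0)
d = 4
U, _ = np.linalg.qr(rng.normal(size=(d, d)))
V, _ = np.linalg.qr(rng.normal(size=(d, d)))
A_pre = U @ np.diag([3.0, 2.0, 1.0, 0.5]) @ V.T

tau = 8.0
W1 = np.exp(-tau) * np.eye(d)  # small initialization, as in the theory
W2 = np.exp(-tau) * np.eye(d)
eta = 1e-2

for step in range(1, 2001):
    R = W1 @ W2 - A_pre  # residual
    W1, W2 = W1 - eta * 2 * R @ W2.T, W2 - eta * 2 * W1.T @ R
    if step in (100, 200, 400, 800, 1600):
        sigma = np.linalg.svd(W1 @ W2, compute_uv=False)
        print(step, np.round(sigma, 3))  # features appear largest-first
```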
Figure 7. Schematic illustrating how the scaling of the optimal learning rate can affect model evaluations as a function of the pre-training tokens T. Panels show, left to right, a tuned LR that is constant with T, decreases slowly with T, and decreases quickly with T. The dashed lines indicate the hypothetical performance of fixed learning rates (large, medium, small), while solid lines indicate the performance when using the learning rate that optimizes the ID performance, tuned on a downstream validation set. (Left) When the optimal learning rate is constant, we expect to observe degradation of both ID and OOD performance. (Center) When the optimal learning rate decreases slowly with T, we may observe degradation of only the OOD performance. (Right) When the optimal learning rate decreases quickly, we will not observe degradation on either metric of performance.

Proposition 4.3 (Informal version of Lemma A.4). Let $t_1, \ldots, t_d$ be defined as in Theorem 4.1. Then,

$$\tilde{L}_{\mathrm{pre}}(t_n) - \tilde{L}_{\mathrm{pre}}(t_{n-1}) \geq \big( 2 d \gamma^2 - \sigma^{\mathrm{pre}}_n \big)\, \sigma^{\mathrm{pre}}_n. \tag{2}$$

The formal proof in Appendix A demonstrates that elongating pre-training, by introducing a newly non-zero feature $\sigma_n$, introduces a new dimension along which the perturbation degrades the loss. The above proposition allows us to characterize the inflection point (Definition 4.2) in the Gaussian perturbation setting as the smallest $n$ such that $2 d \gamma^2 > \sigma^{\mathrm{pre}}_n$. As such, smaller or more quickly decaying features will induce a smaller inflection point.

To establish catastrophic overtraining, we now illustrate that degradation proceeds monotonically beyond the inflection point. That is, elongating the training budget beyond the inflection point will increasingly degrade the pre-training performance of the model.

Theorem 4.4 (Informal version of Theorem A.3). For some $\gamma > 0$, there exists an inflection point $r \in [1, d)$ such that $\tilde{L}_{\mathrm{pre}}(t_n)$ increases monotonically for $n \geq r$.

Our results establish the inevitability of catastrophic overtraining with respect to the pre-training loss when the post-training modification consists of randomly perturbing the model parameters. In the next section, we study progressive sensitivity and catastrophic overtraining when fine-tuning on a family of canonical downstream tasks.

4.3. Fine-tuning

We now consider the case where the fine-tuning algorithm $\mathcal{A}$ corresponds to learning another linear feature map with a shared structure. We define the fine-tuning task as learning $y = A_{\mathrm{ft}} x$, where $A_{\mathrm{ft}} = U \Sigma_{\mathrm{ft}} V^\top$. Sharing $U$ and $V$ with $A_{\mathrm{pre}}$ permits transfer learning to occur, even though the spectrum of $A_{\mathrm{ft}}$ is not the same as that of $A_{\mathrm{pre}}$. We define the fine-tuning features $\sigma^{\mathrm{ft}}_1 > \cdots > \sigma^{\mathrm{ft}}_d$ to be the singular values of $A_{\mathrm{ft}}$.

Let $\mathcal{A}(\theta(t)) = \theta(t; k)$ denote a model pre-trained for time $t$ and then fine-tuned with a small but finite learning rate $\eta$ and a large batch size for $k \in [0, K]$ steps. The fine-tuning loss is similar to the pre-training loss with the new task $A_{\mathrm{ft}}$, but we introduce a regularization term to limit the deviation from the pre-trained initialization. This regularization term is a standard design in the meta-learning literature (Chua et al., 2021; Denevi et al., 2018):

$$L_{\mathrm{ft}}(\theta(t; k); \lambda) = \| \theta(t; k) - A_{\mathrm{ft}} \|_F^2 + \lambda\, \| \theta(t; k) - \theta(t) \|_F^2. \tag{3}$$

Analogous to the pre-training setting, our analysis proceeds by tracking the vector of diagonal elements $\sigma^{\mathrm{ft}}(t; k)$ of $\Sigma(t; k) = U^\top \theta(t; k) V$.
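To make this setting concrete, the following sketch (our own construction, with made-up spectra) runs gradient descent on the regularized objective of Eq. (3) from idealized checkpoints that have learned the first n pre-training features. It illustrates both progressive sensitivity (the pre-training loss degrades more as n grows) and the trade-off of Theorem 4.7 (larger λ shrinks the degradation but leaves a larger fine-tuning loss).

```python
import numpy as np

# Our own numerical sketch of Eq. (3): fine-tune idealized checkpoints theta_n
# with the proximal term lam * ||theta - theta_n(0)||_F^2 and record how much
# the pre-training loss degrades as n grows, for several lambda values.
rng = np.random.default_rng(0)
d = 6
sig_pre = np.array([3.0, 2.0, 1.2, 0.7, 0.4, 0.2])
sig_ft = np.ones(d)  # flat spectrum: misaligned with the decaying tail of sig_pre
U, _ = np.linalg.qr(rng.normal(size=(d, d)))
V, _ = np.linalg.qr(rng.normal(size=(d, d)))
A_pre = U @ np.diag(sig_pre) @ V.T
A_ft = U @ np.diag(sig_ft) @ V.T

def finetune(n, lam, eta=5e-3, K=2000):
    s = np.concatenate([sig_pre[:n], np.zeros(d - n)])  # first n features learned
    W1, W2 = U @ np.diag(np.sqrt(s)), np.diag(np.sqrt(s)) @ V.T
    theta0 = W1 @ W2
    for _ in range(K):
        theta = W1 @ W2
        R = (theta - A_ft) + lam * (theta - theta0)  # gradient of Eq. (3) in theta
        W1, W2 = W1 - eta * 2 * R @ W2.T, W2 - eta * 2 * W1.T @ R
    theta = W1 @ W2
    delta_pre = (np.linalg.norm(theta - A_pre, "fro") ** 2
                 - np.linalg.norm(theta0 - A_pre, "fro") ** 2)
    return delta_pre, np.linalg.norm(theta - A_ft, "fro") ** 2

for lam in [0.0, 0.5, 2.0]:
    results = [finetune(n, lam) for n in range(1, d + 1)]
    print(f"lambda={lam}: delta L_pre by n:",
          np.round([r[0] for r in results], 2),
          "| final L_ft at n=d:", round(results[-1][1], 2))
```

Under these assumptions the printed delta L_pre grows monotonically with n (progressive sensitivity), while increasing lambda flattens that growth at the price of a visibly larger residual fine-tuning loss.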
We define $\Delta_{\mathrm{pre}}(t_n) = L_{\mathrm{pre}}(\theta(t_n; K)) - L_{\mathrm{pre}}(\theta(t_n; 0))$ as the change in the pre-training performance over the course of fine-tuning, and we characterize how $\Delta_{\mathrm{pre}}(t_n)$ changes as the pre-training time $t_n$ increases. In particular, if $\Delta_{\mathrm{pre}}(t_n)$ is monotonically increasing, then we can conclude that progressive sensitivity is present.

To begin, we formalize the misalignment between the pre-training and downstream tasks in terms of their features.

Definition 4.5. The pre-training task $A_{\mathrm{pre}}$ and the fine-tuning task $A_{\mathrm{ft}}$ are $(\alpha, r)$-misaligned when $\sigma^{\mathrm{ft}}_i > \alpha \sigma^{\mathrm{pre}}_i$ for all $i > r$.

Our first result establishes that our setting exhibits progressive sensitivity when the fine-tuning task is different from the pre-training one.

Theorem 4.6 (Progressive sensitivity; informal version of Theorem A.24). Assume that $A_{\mathrm{pre}}$ and $A_{\mathrm{ft}}$ are $(\alpha, 1)$-misaligned with $\alpha > 1$. Then, $\Delta_{\mathrm{pre}}(t_n) \geq 0$ and $\Delta_{\mathrm{pre}}(t_n)$ is monotonically increasing with the number of learned pre-training features $n$.

Having established the prevalence of progressive sensitivity, we now turn our attention to understanding how and when we observe catastrophic overtraining with respect to the pre-training loss. We first show that when regularization is not present and the downstream task is sufficiently distinct from the pre-training task, elongating pre-training will degrade pre-training performance. Furthermore, we demonstrate that regularization can delay the inflection point at which pre-training performance starts to degrade (Definition 4.2), albeit at a cost to the downstream performance.

Theorem 4.7 (Catastrophic overtraining; informal version of Theorem A.25). The following are true with high probability:

1. Catastrophic overtraining is inevitable without regularization. Let $\lambda = 0$. There exists an $\alpha_0 > 0$ such that if $A_{\mathrm{pre}}$ and $A_{\mathrm{ft}}$ are $(\alpha, r)$-misaligned for $\alpha > \alpha_0$, then the pre-training loss after fine-tuning $L_{\mathrm{pre}}(\theta(t_n; K))$ monotonically increases for $n \geq r$.

2. Regularization can delay the degradation of pre-training performance at the cost of downstream performance. For any $n$, the inflection point $r(\lambda)$ and the unregularized fine-tuning loss $\| \theta^n(K) - A_{\mathrm{ft}} \|_F^2$ increase monotonically with $\lambda$.

Our results in this section demonstrate that progressive sensitivity and catastrophic overtraining can arise in the relatively simple setting of training linear networks, which learn task-related features incrementally. We characterize the inflection point (Definition 4.2) under various post-training modifications, including applying Gaussian perturbations and fine-tuning on a canonical task. Our main results demonstrate that elongating the pre-training period will inevitably result in progressive sensitivity and catastrophic overtraining, and although appropriate regularization can delay the onset of these phenomena, this may come at the cost of downstream task performance (Theorems 4.4, 4.6 and 4.7).

5. Related Work

Loss of plasticity. The idea that more training can degrade a model's adaptability to new tasks, termed loss of plasticity, has been primarily studied in small-model continual learning (Ash & Adams, 2020; Dohare et al., 2021) and reinforcement learning (Kumar et al., 2020; Lyle et al., 2022; 2023; Ma et al., 2023; Abbas et al., 2023). Loss of plasticity has been attributed to loss curvature (Lyle et al., 2023; Lewandowski et al., 2023), increased weight norm (Nikishin et al., 2022), feature rank (Kumar et al., 2020; Gulcehre et al., 2022), and feature inactivity (Lyle et al., 2022; Dohare et al., 2021).
Multiple remedies have been proposed, including architecture changes (Lyle et al., 2023), parameter resets (Nikishin et al., 2024; D'Oro et al., 2022), and regularization (Kumar et al., 2023; Ash & Adams, 2020).

Catastrophic forgetting. The phenomenon of catastrophic forgetting, where neural networks trained sequentially on tasks tend to forget prior tasks, has also been well documented in the literature (Kirkpatrick et al., 2017; French, 1999; Goodfellow et al., 2013; Kemker et al., 2018; Kotha et al., 2023). There have been several proposed mitigation strategies; for example, Ahn et al. (2019), Hou et al. (2018), and Chaudhry et al. (2019a) propose using regularization to mitigate catastrophic forgetting. Other fixes include generative replay of examples from previous tasks (Shin et al., 2017) or maintaining a memory buffer of previous tasks (Chaudhry et al., 2019b; de Masson d'Autume et al., 2019). In this work, we show that catastrophic forgetting can become more severe with overtraining.

Scaling laws for optimal pre-training. In our work, we argue that training for fewer tokens can be beneficial for downstream performance after fine-tuning. Related to our work, Isik et al. (2024) propose scaling laws for certain downstream translation tasks after fine-tuning, but do not observe degradation with overtraining. In addition, optimal token budgets have been identified for fixed compute (Kaplan et al., 2020; Hoffmann et al., 2022) and extended to various contexts (Hernandez et al., 2021; Cherti et al., 2023; Muennighoff et al., 2023; Goyal et al., 2024; Liu et al., 2025; Bhagia et al., 2024). Existing laws sometimes inaccurately predict downstream performance (Diaz & Madaio, 2024), and U-shaped scaling trends have been observed (Caballero et al., 2022; Wei et al., 2022; McKenzie et al., 2022a). Practitioners often overtrain small models beyond the compute-optimal token budget to reduce inference cost (Sardana et al., 2024; Gadre et al., 2024).

Transfer learning theory. Our theoretical analysis uses classical deep linear transfer learning setups (Gidel et al., 2019; Saxe et al., 2018). Such setups have been applied to knowledge acquisition (Wei et al., 2024; Arora et al., 2018), downstream feature benefits (Saunshi et al., 2021; Wei et al., 2021; Shachaf et al., 2021; Chua et al., 2021; Wu et al., 2020; Tripuraneni et al., 2020), and fine-tuning-induced performance degradation (Kumar et al., 2022). We discuss additional related work in Appendix B.

6. Discussion

In this work, we uncovered a surprising trend: contrary to common belief, longer pre-training does not always lead to better post-trained models. We have shown that this is a consequence of a broader underlying phenomenon where models become more sensitive to perturbations as they are pre-trained on more tokens. Our theoretical analysis implies that this degradation of adaptability is especially catastrophic when the pre-training and fine-tuning tasks are misaligned; in such a case, catastrophic overtraining may be inevitable, even if the fine-tuning process is regularized.

Acknowledgments

This material is based upon work supported by the National Science Foundation Graduate Research Fellowship under Grant No. DGE2140739. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.
We gratefully acknowledge support from Apple, NSF, and the AI2050 program at Schmidt Sciences (Grant #G2264481). Xiang Yue was supported in part by a Carnegie Bosch Institute Fellowship. The authors would like to thank the following individuals for their helpful feedback and discussions: Christina Baek, Tianyu Gao, Gaurav Ghosal, Suhas Kotha, Vaishnavh Nagarajan, Chen Wu, and Ziqian Zhong.

Impact Statement

This paper presents work whose goal is to advance the field of Machine Learning. There are many potential societal consequences of our work, none of which we feel must be specifically highlighted here.

References

Abbas, Z., Zhao, R., Modayil, J., White, A., and Machado, M. C. Loss of plasticity in continual deep reinforcement learning. In Conference on Lifelong Learning Agents, pp. 620-636. PMLR, 2023.

Ahn, H., Cha, S., Lee, D., and Moon, T. Uncertainty-based continual learning with adaptive regularization, 2019. URL https://arxiv.org/abs/1905.11614.

Arora, S., Li, Y., Liang, Y., Ma, T., and Risteski, A. Linear algebraic structure of word senses, with applications to polysemy, 2018. URL https://arxiv.org/abs/1601.03764.

Ash, J. and Adams, R. P. On warm-starting neural network training. Advances in Neural Information Processing Systems, 33:3884-3894, 2020.

Bai, Y., Jones, A., Ndousse, K., Askell, A., Chen, A., DasSarma, N., Drain, D., Fort, S., Ganguli, D., Henighan, T., Joseph, N., Kadavath, S., Kernion, J., Conerly, T., El-Showk, S., Elhage, N., Hatfield-Dodds, Z., Hernandez, D., Hume, T., Johnston, S., Kravec, S., Lovitt, L., Nanda, N., Olsson, C., Amodei, D., Brown, T., Clark, J., McCandlish, S., Olah, C., Mann, B., and Kaplan, J. Training a helpful and harmless assistant with reinforcement learning from human feedback, 2022. URL https://arxiv.org/abs/2204.05862.

Bhagia, A., Liu, J., Wettig, A., Heineman, D., Tafjord, O., Jha, A. H., Soldaini, L., Smith, N. A., Groeneveld, D., Koh, P. W., et al. Establishing task scaling laws via compute-efficient model ladders. arXiv preprint arXiv:2412.04403, 2024.

Bisk, Y., Zellers, R., Bras, R. L., Gao, J., and Choi, Y. PIQA: Reasoning about physical commonsense in natural language. In Thirty-Fourth AAAI Conference on Artificial Intelligence, 2020.

Caballero, E., Gupta, K., Rish, I., and Krueger, D. Broken neural scaling laws. arXiv preprint arXiv:2210.14891, 2022.

Chaudhry, A., Ranzato, M., Rohrbach, M., and Elhoseiny, M. Efficient lifelong learning with A-GEM, 2019a. URL https://arxiv.org/abs/1812.00420.

Chaudhry, A., Rohrbach, M., Elhoseiny, M., Ajanthan, T., Dokania, P. K., Torr, P. H. S., and Ranzato, M. On tiny episodic memories in continual learning, 2019b. URL https://arxiv.org/abs/1902.10486.

Cherti, M., Beaumont, R., Wightman, R., Wortsman, M., Ilharco, G., Gordon, C., Schuhmann, C., Schmidt, L., and Jitsev, J. Reproducible scaling laws for contrastive language-image learning. In 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2818-2829. IEEE, June 2023. doi: 10.1109/cvpr52729.2023.00276. URL http://dx.doi.org/10.1109/CVPR52729.2023.00276.

Chua, K., Lei, Q., and Lee, J. D. How fine-tuning allows for effective meta-learning, 2021. URL https://arxiv.org/abs/2105.02221.

Clark, C., Lee, K., Chang, M.-W., Kwiatkowski, T., Collins, M., and Toutanova, K. BoolQ: Exploring the surprising difficulty of natural yes/no questions. In Proceedings of NAACL-HLT 2019, 2019.

Clark, P., Cowhey, I., Etzioni, O., Khot, T., Sabharwal, A., Schoenick, C., and Tafjord, O. Think you have solved question answering?
Try ARC, the AI2 reasoning challenge. arXiv:1803.05457v1, 2018.

Cobbe, K., Kosaraju, V., Bavarian, M., Chen, M., Jun, H., Kaiser, L., Plappert, M., Tworek, J., Hilton, J., Nakano, R., et al. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168, 2021.

Cohen, J. M., Kaur, S., Li, Y., Kolter, J. Z., and Talwalkar, A. Gradient descent on neural networks typically occurs at the edge of stability. arXiv preprint arXiv:2103.00065, 2021.

Conneau, A. and Kiela, D. SentEval: An evaluation toolkit for universal sentence representations. arXiv preprint arXiv:1803.05449, 2018.

Dagan, I., Glickman, O., and Magnini, B. The PASCAL recognising textual entailment challenge. In Machine Learning Challenges Workshop, pp. 177-190. Springer, 2005.

de Masson d'Autume, C., Ruder, S., Kong, L., and Yogatama, D. Episodic memory in lifelong language learning, 2019. URL https://arxiv.org/abs/1906.01076.

Denevi, G., Ciliberto, C., Stamos, D., and Pontil, M. Learning to learn around a common mean. In Bengio, S., Wallach, H., Larochelle, H., Grauman, K., Cesa-Bianchi, N., and Garnett, R. (eds.), Advances in Neural Information Processing Systems, volume 31. Curran Associates, Inc., 2018. URL https://proceedings.neurips.cc/paper_files/paper/2018/file/b9a25e422ba96f7572089a00b838c3f8-Paper.pdf.

Diaz, F. and Madaio, M. Scaling laws do not scale. In Proceedings of the AAAI/ACM Conference on AI, Ethics, and Society, volume 7, pp. 341-357, 2024.

Dohare, S., Sutton, R. S., and Mahmood, A. R. Continual backprop: Stochastic gradient descent with persistent randomness. arXiv preprint arXiv:2108.06325, 2021.

D'Oro, P., Schwarzer, M., Nikishin, E., Bacon, P.-L., Bellemare, M. G., and Courville, A. Sample-efficient reinforcement learning by breaking the replay ratio barrier. In Deep Reinforcement Learning Workshop NeurIPS 2022, 2022.

French, R. M. Catastrophic forgetting in connectionist networks. Trends in Cognitive Sciences, 3(4):128-135, 1999.

Fu, C., Chen, P., Shen, Y., Qin, Y., Zhang, M., Lin, X., Yang, J., Zheng, X., Li, K., Sun, X., Wu, Y., and Ji, R. MME: A comprehensive evaluation benchmark for multimodal large language models, 2024. URL https://arxiv.org/abs/2306.13394.

Gadre, S. Y., Smyrnis, G., Shankar, V., Gururangan, S., Wortsman, M., Shao, R., Mercat, J., Fang, A., Li, J., Keh, S., et al. Language models scale reliably with over-training and on downstream tasks. arXiv preprint arXiv:2403.08540, 2024.

Gidel, G., Bach, F., and Lacoste-Julien, S. Implicit regularization of discrete gradient dynamics in linear neural networks, 2019. URL https://arxiv.org/abs/1904.13262.

Goodfellow, I. J., Mirza, M., Xiao, D., Courville, A., and Bengio, Y. An empirical investigation of catastrophic forgetting in gradient-based neural networks. arXiv preprint arXiv:1312.6211, 2013.

Goyal, S., Maini, P., Lipton, Z. C., Raghunathan, A., and Kolter, J. Z. Scaling laws for data filtering: data curation cannot be compute agnostic, 2024. URL https://arxiv.org/abs/2404.07177.

Grattafiori, A. et al. The Llama 3 herd of models, 2024. URL https://arxiv.org/abs/2407.21783.

Groeneveld, D., Beltagy, I., Walsh, P., Bhagia, A., Kinney, R., Tafjord, O., Jha, A. H., Ivison, H., Magnusson, I., Wang, Y., Arora, S., Atkinson, D., Authur, R., Chandu, K., Cohan, A., Dumas, J., Elazar, Y., Gu, Y., Hessel, J., Khot, T., Merrill, W., Morrison, J., Muennighoff, N., Naik, A., Nam, C., Peters, M.
E., Pyatkin, V., Ravichander, A., Schwenk, D., Shah, S., Smith, W., Strubell, E., Subramani, N., Wortsman, M., Dasigi, P., Lambert, N., Richardson, K., Zettlemoyer, L., Dodge, J., Lo, K., Soldaini, L., Smith, N. A., and Hajishirzi, H. OLMo: Accelerating the science of language models. Preprint, 2024a.

Groeneveld, D., Beltagy, I., Walsh, P., Bhagia, A., Kinney, R., Tafjord, O., Jha, A. H., Ivison, H., Magnusson, I., Wang, Y., et al. OLMo: Accelerating the science of language models. arXiv preprint arXiv:2402.00838, 2024b.

Gulcehre, C., Srinivasan, S., Sygnowski, J., Ostrovski, G., Farajtabar, M., Hoffman, M., Pascanu, R., and Doucet, A. An empirical study of implicit regularization in deep offline RL. arXiv preprint arXiv:2207.02099, 2022.

Hernandez, D., Kaplan, J., Henighan, T., and McCandlish, S. Scaling laws for transfer, 2021. URL https://arxiv.org/abs/2102.01293.

Hoffmann, J., Borgeaud, S., Mensch, A., Buchatskaya, E., Cai, T., Rutherford, E., Casas, D. d. L., Hendricks, L. A., Welbl, J., Clark, A., et al. Training compute-optimal large language models. arXiv preprint arXiv:2203.15556, 2022.

Hou, S., Pan, X., Loy, C. C., Wang, Z., and Lin, D. Lifelong learning via progressive distillation and retrospection. In Proceedings of the European Conference on Computer Vision (ECCV), September 2018.

Hudson, D. A. and Manning, C. D. GQA: A new dataset for real-world visual reasoning and compositional question answering. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 6700-6709, 2019.

Isik, B., Ponomareva, N., Hazimeh, H., Paparas, D., Vassilvitskii, S., and Koyejo, S. Scaling laws for downstream task performance of large language models. arXiv preprint arXiv:2402.04177, 2024.

Kaplan, J., McCandlish, S., Henighan, T., Brown, T. B., Chess, B., Child, R., Gray, S., Radford, A., Wu, J., and Amodei, D. Scaling laws for neural language models, 2020. URL https://arxiv.org/abs/2001.08361.

Kembhavi, A., Salvato, M., Kolve, E., Seo, M., Hajishirzi, H., and Farhadi, A. A diagram is worth a dozen images, 2016.

Kemker, R., McClure, M., Abitino, A., Hayes, T., and Kanan, C. Measuring catastrophic forgetting in neural networks. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 32, 2018.

Kirkpatrick, J., Pascanu, R., Rabinowitz, N., Veness, J., Desjardins, G., Rusu, A. A., Milan, K., Quan, J., Ramalho, T., Grabska-Barwinska, A., Hassabis, D., Clopath, C., Kumaran, D., and Hadsell, R. Overcoming catastrophic forgetting in neural networks. Proceedings of the National Academy of Sciences, 114(13):3521-3526, March 2017. ISSN 1091-6490. doi: 10.1073/pnas.1611835114. URL http://dx.doi.org/10.1073/pnas.1611835114.

Kotha, S., Springer, J. M., and Raghunathan, A. Understanding catastrophic forgetting in language models via implicit inference. arXiv preprint arXiv:2309.10105, 2023.

Kumar, A., Agarwal, R., Ghosh, D., and Levine, S. Implicit under-parameterization inhibits data-efficient deep reinforcement learning. arXiv preprint arXiv:2010.14498, 2020.

Kumar, A., Raghunathan, A., Jones, R., Ma, T., and Liang, P. Fine-tuning can distort pretrained features and underperform out-of-distribution, 2022. URL https://arxiv.org/abs/2202.10054.

Kumar, S., Marklund, H., and Van Roy, B. Maintaining plasticity via regenerative regularization. arXiv preprint arXiv:2308.11958, 2023.

Lewandowski, A., Tanaka, H., Schuurmans, D., and Machado, M. C.
Directions of curvature as an explanation for loss of plasticity. arXiv preprint arXiv:2312.00246, 2023.

Li, R., Allal, L. B., Zi, Y., Muennighoff, N., Kocetkov, D., Mou, C., Marone, M., Akiki, C., Li, J., Chim, J., et al. StarCoder: may the source be with you! arXiv preprint arXiv:2305.06161, 2023a.

Li, X., Zhang, T., Dubois, Y., Taori, R., Gulrajani, I., Guestrin, C., Liang, P., and Hashimoto, T. B. AlpacaEval: An automatic evaluator of instruction-following models. https://github.com/tatsu-lab/alpaca_eval, 5 2023b.

Li, Y., Du, Y., Zhou, K., Wang, J., Zhao, W. X., and Wen, J.-R. Evaluating object hallucination in large vision-language models, 2023c. URL https://arxiv.org/abs/2305.10355.

Liu, E., Bertsch, A., Sutawika, L., Tjuatja, L., Fernandes, P., Marinov, L., Chen, M., Singhal, S., Lawrence, C., Raghunathan, A., et al. Not-just-scaling laws: Towards a better understanding of the downstream impact of language model design decisions. arXiv preprint arXiv:2503.03862, 2025.

Liu, H., Xie, S. M., Li, Z., and Ma, T. Same pre-training loss, better downstream: Implicit bias matters for language models, 2022. URL https://arxiv.org/abs/2210.14199.

Liu, H., Li, C., Wu, Q., and Lee, Y. J. Visual instruction tuning, 2023a.

Liu, Z., Qiao, A., Neiswanger, W., Wang, H., Tan, B., Tao, T., Li, J., Wang, Y., Sun, S., Pangarkar, O., Fan, R., Gu, Y., Miller, V., Zhuang, Y., He, G., Li, H., Koto, F., Tang, L., Ranjan, N., Shen, Z., Ren, X., Iriondo, R., Mu, C., Hu, Z., Schulze, M., Nakov, P., Baldwin, T., and Xing, E. P. LLM360: Towards fully transparent open-source LLMs, 2023b.

Lyle, C., Rowland, M., and Dabney, W. Understanding and preventing capacity loss in reinforcement learning. arXiv preprint arXiv:2204.09560, 2022.

Lyle, C., Zheng, Z., Nikishin, E., Pires, B. A., Pascanu, R., and Dabney, W. Understanding plasticity in neural networks. In International Conference on Machine Learning, pp. 23190-23211. PMLR, 2023.

Ma, G., Li, L., Zhang, S., Liu, Z., Wang, Z., Chen, Y., Shen, L., Wang, X., and Tao, D. Revisiting plasticity in visual reinforcement learning: Data, modules and training stages. arXiv preprint arXiv:2310.07418, 2023.

Maggie, Culliton, P., and Chen, W. Tweet sentiment extraction. https://kaggle.com/competitions/tweet-sentiment-extraction, 2020. Kaggle.

McKenzie, I., Lyzhov, A., Parrish, A., Prabhu, A., Mueller, A., Kim, N., Bowman, S., and Perez, E. The inverse scaling prize, 2022a. URL https://github.com/inverse-scaling/prize.

McKenzie, I., Lyzhov, A., Parrish, A., Prabhu, A., Mueller, A., Kim, N., Bowman, S., and Perez, E. Inverse scaling prize: First round winners, 2022b. URL https://irmckenzie.co.uk/round1.

McKenzie, I., Lyzhov, A., Parrish, A., Prabhu, A., Mueller, A., Kim, N., Bowman, S., and Perez, E. Inverse scaling prize: Second round winners, 2023. URL https://irmckenzie.co.uk/round2.

Muennighoff, N., Rush, A. M., Barak, B., Scao, T. L., Piktus, A., Tazi, N., Pyysalo, S., Wolf, T., and Raffel, C. Scaling data-constrained language models, 2023. URL https://arxiv.org/abs/2305.16264.

Nikishin, E., Schwarzer, M., D'Oro, P., Bacon, P.-L., and Courville, A. The primacy bias in deep reinforcement learning. In International Conference on Machine Learning, pp. 16828-16847. PMLR, 2022.

Nikishin, E., Oh, J., Ostrovski, G., Lyle, C., Pascanu, R., Dabney, W., and Barreto, A. Deep reinforcement learning with plasticity injection. Advances in Neural Information Processing Systems, 36, 2024.
OLMo, T., Walsh, P., Soldaini, L., Groeneveld, D., Lo, K., Arora, S., Bhagia, A., Gu, Y., Huang, S., Jordan, M., Lambert, N., Schwenk, D., Tafjord, O., Anderson, T., Atkinson, D., Brahman, F., Clark, C., Dasigi, P., Dziri, N., Guerquin, M., Ivison, H., Koh, P. W., Liu, J., Malik, S., Merrill, W., Miranda, L. J. V., Morrison, J., Murray, T., Nam, C., Pyatkin, V., Rangapur, A., Schmitz, M., Skjonsberg, S., Wadden, D., Wilhelm, C., Wilson, M., Zettlemoyer, L., Farhadi, A., Smith, N. A., and Hajishirzi, H. 2 OLMo 2 Furious, 2024. URL https://arxiv.org/abs/2501.00656.

Pang, B. and Lee, L. A sentimental education: Sentiment analysis using subjectivity summarization based on minimum cuts. arXiv preprint cs/0409058, 2004.

Raffel, C., Shazeer, N., Roberts, A., Lee, K., Narang, S., Matena, M., Zhou, Y., Li, W., and Liu, P. J. Exploring the limits of transfer learning with a unified text-to-text transformer. arXiv e-prints, 2019.

Sakaguchi, K., Bras, R. L., Bhagavatula, C., and Choi, Y. WinoGrande: An adversarial Winograd schema challenge at scale. Communications of the ACM, 64(9):99-106, 2021.

Sap, M., Rashkin, H., Chen, D., Le Bras, R., and Choi, Y. SocialIQA: Commonsense reasoning about social interactions. arXiv preprint arXiv:1904.09728, 2019.

Sardana, N., Portes, J., Doubov, S., and Frankle, J. Beyond Chinchilla-optimal: Accounting for inference in language model scaling laws, 2024. URL https://arxiv.org/abs/2401.00448.

Saunshi, N., Malladi, S., and Arora, S. A mathematical exploration of why language models help solve downstream tasks. In International Conference on Learning Representations, 2021. URL https://openreview.net/forum?id=vVjIW3sEc1s.

Saxe, A. M., McClelland, J. L., and Ganguli, S. A mathematical theory of semantic development in deep neural networks. CoRR, abs/1810.10531, 2018. URL http://arxiv.org/abs/1810.10531.

Schaeffer, R., Miranda, B., and Koyejo, S. Are emergent abilities of large language models a mirage? Advances in Neural Information Processing Systems, 36:55565-55581, 2023.

Schaeffer, R., Schoelkopf, H., Miranda, B., Mukobi, G., Madan, V., Ibrahim, A., Bradley, H., Biderman, S., and Koyejo, S. Why has predicting downstream capabilities of frontier AI models with scale remained elusive? arXiv preprint arXiv:2406.04391, 2024.

Shachaf, G., Brutzkus, A., and Globerson, A. A theoretical analysis of fine-tuning with linear teachers, 2021. URL https://arxiv.org/abs/2107.01641.

Shin, H., Lee, J. K., Kim, J., and Kim, J. Continual learning with deep generative replay, 2017. URL https://arxiv.org/abs/1705.08690.

Singh, A., Natarjan, V., Shah, M., Jiang, Y., Chen, X., Parikh, D., and Rohrbach, M. Towards VQA models that can read. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 8317-8326, 2019.

Tay, Y., Dehghani, M., Rao, J., Fedus, W., Abnar, S., Chung, H. W., Narang, S., Yogatama, D., Vaswani, A., and Metzler, D. Scale efficiently: Insights from pre-training and fine-tuning transformers, 2022. URL https://arxiv.org/abs/2109.10686.

Touvron, H., Martin, L., Stone, K., Albert, P., Almahairi, A., Babaei, Y., Bashlykov, N., Batra, S., Bhargava, P., Bhosale, S., et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023.

Tripuraneni, N., Jordan, M. I., and Jin, C. On the theory of transfer learning: The importance of task diversity, 2020. URL https://arxiv.org/abs/2006.11650.

Vershynin, R. High-dimensional probability: An introduction with applications in data science, volume 47.
Cambridge University Press, 2018.

Voorhees, E. M. and Tice, D. M. Building a question answering test collection. In Proceedings of the 23rd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 200–207, 2000.

Wang, Y., Ivison, H., Dasigi, P., Hessel, J., Khot, T., Chandu, K., Wadden, D., MacMillan, K., Smith, N. A., Beltagy, I., et al. How far can camels go? Exploring the state of instruction tuning on open resources. Advances in Neural Information Processing Systems, 36:74764–74786, 2023.

Wei, C., Xie, S. M., and Ma, T. Why do pretrained language models help in downstream tasks? An analysis of head and prompt tuning. Advances in Neural Information Processing Systems, 34, 2021.

Wei, J., Kim, N., Tay, Y., and Le, Q. V. Inverse scaling can become U-shaped. arXiv preprint arXiv:2211.02011, 2022.

Wei, S., Malladi, S., Arora, S., and Sanyal, A. Provable unlearning in topic modeling and downstream tasks. arXiv preprint arXiv:2411.12600, 2024.

Wu, S., Zhang, H. R., and Ré, C. Understanding and improving information transfer in multi-task learning, 2020. URL https://arxiv.org/abs/2005.00944.

Yang, G., Hu, E. J., Babuschkin, I., Sidor, S., Liu, X., Farhi, D., Ryder, N., Pachocki, J., Chen, W., and Gao, J. Tensor Programs V: Tuning large neural networks via zero-shot hyperparameter transfer. arXiv preprint arXiv:2203.03466, 2022.

Zhang, Y., Backurs, A., Bubeck, S., Eldan, R., Gunasekar, S., and Wagner, T. Unveiling transformers with LEGO: a synthetic reasoning task, 2023. URL https://arxiv.org/abs/2206.04301.

A. Omitted Proofs from Section 4

A.1. Formal Definitions and Assumptions

We provide formal definitions and assumptions underlying the theoretical analysis in Section 4. Throughout the text, we use a constant $\delta$ to express a small probability.

Model Architecture. The model consists of a two-layer linear network parameterized by $\theta = W_1 W_2$, where $W_1, W_2 \in \mathbb{R}^{d \times d}$. The network maps an input $x \in \mathbb{R}^d$ to the output $y = W_1 W_2 x \in \mathbb{R}^d$.

Pretraining Task. The pretraining data follows $y = A^{\mathrm{pre}} x$, where $A^{\mathrm{pre}} \in \mathbb{R}^{d \times d}$ is a matrix with singular value decomposition (SVD) $A^{\mathrm{pre}} = U \Sigma^{\mathrm{pre}} V^\top$. Here, $U, V \in \mathbb{R}^{d \times d}$ are orthogonal matrices, and $\Sigma^{\mathrm{pre}} \in \mathbb{R}^{d \times d}$ is diagonal with positive entries $\{\Sigma^{\mathrm{pre}}_i\}_{i=1}^d$ arranged in decreasing order. Inputs $x \sim \mathcal{N}(0, I_d)$ are standard Gaussian.

Pretraining Process. The model is trained via gradient flow on the population loss:

$$L_{\mathrm{pre}}(\theta) = \mathbb{E}_x \|\theta x - A^{\mathrm{pre}} x\|_2^2 = \|\theta - A^{\mathrm{pre}}\|_F^2, \quad (4)$$

with parameters initialized as $W_1(0) = W_2(0) = \exp(-\tau) I$ for a large $\tau > 0$. The gradient flow dynamics follow:

$$\dot{W}_1(t) = -2\,(\theta(t) - A^{\mathrm{pre}})\,W_2(t)^\top \quad (5)$$
$$\dot{W}_2(t) = -2\,W_1(t)^\top (\theta(t) - A^{\mathrm{pre}}) \quad (6)$$

where $\theta(t) = W_1(t) W_2(t)$. This setup is inherited from Gidel et al. (2019), where the authors consider a more general setup with a rank-$R$ matrix $A^{\mathrm{pre}}$ and show that the gradient flow dynamics converge to the optimal rank-$r$ approximation of $A^{\mathrm{pre}}$ sequentially for $r = 1, \ldots, R$.

Theorem A.1 (Theorem 1 of Gidel et al. (2019)). Suppose $A^{\mathrm{pre}}$ has rank $R$. There exist $t_1, \ldots, t_R$ and a constant $C > 0$ depending on $A^{\mathrm{pre}}$ such that, for $\theta(t)$ following Equations (5) and (6),

$$\|W_1(t_i) - U (\Sigma^{\mathrm{pre},i})^{1/2}\|_F \le \exp(-C\tau); \qquad \|W_2(t_i) - (\Sigma^{\mathrm{pre},i})^{1/2} V^\top\|_F \le \exp(-C\tau),$$

where $\Sigma^{\mathrm{pre},i}$ shares its first $i$ diagonal elements with $\Sigma^{\mathrm{pre}}$ and has the remaining diagonal elements equal to $0$.

Finetuning Task. The finetuning task follows $y = A^{\mathrm{ft}} x$, where $A^{\mathrm{ft}} = U \Sigma^{\mathrm{ft}} V^\top$ shares the singular vectors of $A^{\mathrm{pre}}$ but has a different spectrum $\Sigma^{\mathrm{ft}}$. The input distribution remains $x \sim \mathcal{N}(0, I_d)$.
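To make the pretraining dynamics concrete, the following is a minimal numerical sketch (not the paper's code; the dimensions and spectrum are illustrative assumptions) that discretizes the gradient flow in Equations (5) and (6) with a small step size. With a small initialization, the singular values of $\theta(t)$ are picked up one at a time, largest first, as described by Theorem A.1.

```python
import numpy as np

# Minimal sketch of the pretraining gradient flow (Equations 5-6), discretized
# with a small step size. All dimensions and spectra are illustrative.
rng = np.random.default_rng(0)
d = 6
U, _ = np.linalg.qr(rng.normal(size=(d, d)))          # orthogonal U
V, _ = np.linalg.qr(rng.normal(size=(d, d)))          # orthogonal V
sigma_pre = np.array([5.0, 4.0, 3.0, 2.0, 1.0, 0.5])  # decreasing spectrum
A_pre = U @ np.diag(sigma_pre) @ V.T

tau = 8.0
W1 = np.exp(-tau) * np.eye(d)   # small initialization W1(0) = W2(0) = exp(-tau) I
W2 = np.exp(-tau) * np.eye(d)

dt = 1e-3
for step in range(12001):
    G = W1 @ W2 - A_pre
    W1, W2 = W1 - dt * 2 * G @ W2.T, W2 - dt * 2 * W1.T @ G
    if step % 1500 == 0:
        # Directions are learned sequentially, largest singular value first,
        # matching Theorem A.1 (Gidel et al., 2019).
        print(step, np.round(np.linalg.svd(W1 @ W2, compute_uv=False), 2))
```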
Finetuning Process. Starting from $\theta^n(0) = \theta(t_n)$ in Theorem A.1, the model is fine-tuned using gradient descent with learning rate $\eta$, batch size $m$, and $K$ iterations. We call $\theta^n(0)$ the real initialization, and denote the following initialization $\tilde{\theta}^n(0)$ as the ideal initialization:

$$\tilde{W}^n_1(0) = U (\Sigma^{\mathrm{pre},n})^{1/2} \quad (7)$$
$$\tilde{W}^n_2(0) = (\Sigma^{\mathrm{pre},n})^{1/2} V^\top \quad (8)$$

The population loss is:

$$L_{\mathrm{ft}}(\theta) = \mathbb{E}_x \|\theta x - A^{\mathrm{ft}} x\|_2^2 + \lambda \|\theta - \theta^n(0)\|_F^2 = \|\theta - A^{\mathrm{ft}}\|_F^2 + \lambda \|\theta - \theta^n(0)\|_F^2 \quad (9)$$

We estimate $L_{\mathrm{ft}}$ using a batch of samples $B_k$ of size $m$ at every step:

$$L_{\mathrm{ft}}(\theta; B_k) = \frac{1}{m} \sum_{x \in B_k} \|\theta x - A^{\mathrm{ft}} x\|_2^2 + \lambda \|\theta - \theta^n(0)\|_F^2.$$

Denote the covariance of $x$ in batch $B_k$ as $\Sigma^{(x)}_k = \frac{1}{m} \sum_{x \in B_k} x x^\top$. Then

$$L_{\mathrm{ft}}(\theta; B_k) = \mathrm{Tr}\big((\theta - A^{\mathrm{ft}})^\top (\theta - A^{\mathrm{ft}})\, \Sigma^{(x)}_k\big) + \lambda \|\theta - \theta^n(0)\|_F^2.$$

The parameter update rule at step $k$ is:

$$W^n_1(k+1) = W^n_1(k) - 2\eta\,(\theta^n(k) - A^{\mathrm{ft}})\,\Sigma^{(x)}_k (W^n_2(k))^\top - 2\eta\lambda\,(\theta^n(k) - \theta^n(0))(W^n_2(k))^\top \quad (10)$$
$$W^n_2(k+1) = W^n_2(k) - 2\eta\,(W^n_1(k))^\top \Sigma^{(x)}_k\,(\theta^n(k) - A^{\mathrm{ft}}) - 2\eta\lambda\,(W^n_1(k))^\top (\theta^n(k) - \theta^n(0)) \quad (11)$$

where $\theta^n(k) = W^n_1(k) W^n_2(k)$. We denote the final finetuned loss as $L_{\mathrm{ft}}(n) = L_{\mathrm{ft}}(\theta^n(K))$, and we use $\Gamma$ to denote the common upper bound on $\Sigma^{\mathrm{pre}}$ and $\Sigma^{\mathrm{ft}}$:

$$\Gamma = \max\Big\{\Sigma^{\mathrm{pre}}_{1,1},\ \max_{i \le d} \Sigma^{\mathrm{ft}}_{i,i}\Big\}. \quad (12)$$

A.2. Formal Statement and Proof of Theorem 4.4

In this section, we consider perturbations of the weights with isotropic Gaussian noise. For a parameter $\theta = W_1 W_2$, we consider perturbations of the form $(W_1 + \alpha)(W_2 + \beta)$, where $\alpha, \beta \in \mathbb{R}^{d \times d}$ are independent isotropic Gaussian noise matrices with $\alpha_{ij}, \beta_{ij} \sim \mathcal{N}(0, \gamma^2)$ for some $\gamma > 0$. We define the perturbed pretraining loss as

$$\bar{L}_{\mathrm{pre}}(\theta) = \mathbb{E}_{\alpha, \beta \sim \mathcal{N}(0, \gamma^2)}\big[\|(W_1 + \alpha)(W_2 + \beta) - A^{\mathrm{pre}}\|_F^2\big]. \quad (13)$$

Under this definition, assuming the pretraining initialization is sufficiently small, the loss under a Gaussian perturbation is monotonically increasing.

Assumption A.2 (Small Pretraining Initialization). $\tau$ satisfies, for the constant $C$ in Theorem A.1,

$$\exp(-C\tau) \le \min\Big\{\frac{\Sigma^{\mathrm{pre}}_{1,1}}{2},\ \frac{1}{4},\ \frac{(\Sigma^{\mathrm{pre}}_{d,d})^2}{16 d\, \Sigma^{\mathrm{pre}}_{1,1}\,(2\Sigma^{\mathrm{pre}}_{1,1} + \gamma^2)}\Big\}.$$

Theorem A.3. Under Assumption A.2, if $\gamma^2 > \Sigma^{\mathrm{pre}}_{d,d}/d$, there exists some $s \in \mathbb{N}$ with $s < d$ such that for all $n > s$, the loss under a Gaussian perturbation $\bar{L}_{\mathrm{pre}}(\theta^n(0))$ is monotonically increasing in $n$.

Proof. Choose $s$ as the minimum number satisfying $\gamma^2 > \Sigma^{\mathrm{pre}}_{s,s}/d$; then $s \le d - 1$, and for $n > s$, by Lemma A.4,

$$\bar{L}_{\mathrm{pre}}(\tilde{\theta}^n(0)) - \bar{L}_{\mathrm{pre}}(\tilde{\theta}^{n-1}(0)) > (\Sigma^{\mathrm{pre}}_{n,n})^2.$$

By Lemma A.6,

$$\big|\bar{L}_{\mathrm{pre}}(\theta^n(0)) - \bar{L}_{\mathrm{pre}}(\tilde{\theta}^n(0))\big| \le (\Sigma^{\mathrm{pre}}_{n,n})^2/2, \qquad \big|\bar{L}_{\mathrm{pre}}(\tilde{\theta}^{n-1}(0)) - \bar{L}_{\mathrm{pre}}(\theta^{n-1}(0))\big| \le (\Sigma^{\mathrm{pre}}_{n,n})^2/2.$$

Combining the above, $\bar{L}_{\mathrm{pre}}(\theta^n(0)) - \bar{L}_{\mathrm{pre}}(\theta^{n-1}(0)) > 0$. The proof is complete.

Lemma A.4. The following inequality holds for any $n > 1$:

$$\bar{L}_{\mathrm{pre}}(\tilde{\theta}^n(0)) - \bar{L}_{\mathrm{pre}}(\tilde{\theta}^{n-1}(0)) \ge (2d\gamma^2 - \Sigma^{\mathrm{pre}}_{n,n})\,\Sigma^{\mathrm{pre}}_{n,n}. \quad (14)$$

Proof. We first expand the loss:

$$\begin{aligned}
\bar{L}_{\mathrm{pre}}(\tilde{\theta}^n(0)) &= \mathbb{E}\big[\|(\tilde{W}^n_1 + \alpha)(\tilde{W}^n_2 + \beta) - A^{\mathrm{pre}}\|_F^2\big] \\
&= \mathbb{E}\big[\|(U(\Sigma^{\mathrm{pre},n})^{1/2} + \alpha)((\Sigma^{\mathrm{pre},n})^{1/2} V^\top + \beta) - U\Sigma^{\mathrm{pre}} V^\top\|_F^2\big] \\
&= \mathbb{E}\big[\|((\Sigma^{\mathrm{pre},n})^{1/2} + \alpha')((\Sigma^{\mathrm{pre},n})^{1/2} + \beta') - \Sigma^{\mathrm{pre}}\|_F^2\big] \\
&= \mathbb{E}\big[\|\Sigma^{\mathrm{pre},n} + \alpha'(\Sigma^{\mathrm{pre},n})^{1/2} + (\Sigma^{\mathrm{pre},n})^{1/2}\beta' + \alpha'\beta' - \Sigma^{\mathrm{pre}}\|_F^2\big] \\
&= \|\Sigma^{\mathrm{pre},n} - \Sigma^{\mathrm{pre}}\|_F^2 + \mathbb{E}\big[\|\alpha'(\Sigma^{\mathrm{pre},n})^{1/2}\|_F^2\big] + \mathbb{E}\big[\|(\Sigma^{\mathrm{pre},n})^{1/2}\beta'\|_F^2\big] + \mathbb{E}\big[\|\alpha'\beta'\|_F^2\big], \quad (15)
\end{aligned}$$

where $\alpha' = U^\top \alpha$ and $\beta' = \beta V$: the third equality arises from the isotropy of the Gaussian noise (so $\alpha'$ and $\beta'$ have the same distribution as $\alpha$ and $\beta$), and the final equality comes from the independence and zero mean of the noise distributions.

Lemma A.5. For a Gaussian noise matrix $\alpha \in \mathbb{R}^{d \times d}$ whose entries each have variance $\gamma^2$, and a fixed matrix $M$, it holds that $\mathbb{E}[\|\alpha M\|_F^2] = d\gamma^2 \|M\|_F^2$.

Proof. It holds that $\mathbb{E}[\|\alpha M\|_F^2] = \mathbb{E}[\mathrm{Tr}(\alpha M M^\top \alpha^\top)] = \mathrm{Tr}\big(M M^\top\, \mathbb{E}[\alpha^\top \alpha]\big) = d\gamma^2 \|M\|_F^2$. This completes the proof.

By Lemma A.5 and Equation (15),

$$\bar{L}_{\mathrm{pre}}(\tilde{\theta}^n) = L_{\mathrm{pre}}(\tilde{\theta}^n) + 2d\gamma^2 \|(\Sigma^{\mathrm{pre},n})^{1/2}\|_F^2 + \mathbb{E}\big[\|\alpha\beta\|_F^2\big].$$

Taking the difference with $\bar{L}_{\mathrm{pre}}(\tilde{\theta}^{n-1})$:

$$\bar{L}_{\mathrm{pre}}(\tilde{\theta}^n) - \bar{L}_{\mathrm{pre}}(\tilde{\theta}^{n-1}) = 2d\gamma^2 \Sigma^{\mathrm{pre}}_{n,n} - (\Sigma^{\mathrm{pre}}_{n,n})^2.$$
We then proceed to bound the difference between the perturbed loss of the ideal initialization and the perturbed loss of the real initialization when the pretraining initialization is sufficiently small.

Lemma A.6. Under Assumption A.2, for any $n > 0$, it holds that

$$\big|\bar{L}_{\mathrm{pre}}(\theta^n(0)) - \bar{L}_{\mathrm{pre}}(\tilde{\theta}^n(0))\big| \le (\Sigma^{\mathrm{pre}}_{d,d})^2/2.$$

Proof. By the definition of $\bar{L}_{\mathrm{pre}}$,

$$\bar{L}_{\mathrm{pre}}(\theta) = \mathbb{E}_{\alpha,\beta \sim \mathcal{N}(0,\gamma^2)}\big[\|(W_1 + \alpha)(W_2 + \beta) - A^{\mathrm{pre}}\|_F^2\big] = \|W_1 W_2 - A^{\mathrm{pre}}\|_F^2 + \mathbb{E}[\|\alpha\beta\|_F^2] + \mathbb{E}[\|W_1 \beta\|_F^2] + \mathbb{E}[\|\alpha W_2\|_F^2].$$

By Lemma A.5,

$$\bar{L}_{\mathrm{pre}}(\theta) = \|W_1 W_2 - A^{\mathrm{pre}}\|_F^2 + \mathbb{E}[\|\alpha\beta\|_F^2] + d\gamma^2\big(\|W_1\|_F^2 + \|W_2\|_F^2\big).$$

Taking the difference between $\bar{L}_{\mathrm{pre}}(\theta^n(0))$ and $\bar{L}_{\mathrm{pre}}(\tilde{\theta}^n(0))$:

$$\big|\bar{L}_{\mathrm{pre}}(\theta^n(0)) - \bar{L}_{\mathrm{pre}}(\tilde{\theta}^n(0))\big| \le \big|\|W_1 W_2 - A^{\mathrm{pre}}\|_F^2 - \|\tilde{W}_1 \tilde{W}_2 - A^{\mathrm{pre}}\|_F^2\big| + d\gamma^2\big|\|W_1\|_F^2 - \|\tilde{W}_1\|_F^2\big| + d\gamma^2\big|\|W_2\|_F^2 - \|\tilde{W}_2\|_F^2\big|.$$

By Theorem A.1, $\|W_1 - \tilde{W}_1\|_F \le \exp(-C\tau)$ and $\|W_2 - \tilde{W}_2\|_F \le \exp(-C\tau)$, where $\exp(-C\tau) \le \min\{\Sigma^{\mathrm{pre}}_{1,1}/2,\ 1/4\}$. Therefore,

$$\big|\|W_1\|_F^2 - \|\tilde{W}_1\|_F^2\big| \le 2\big|\mathrm{Tr}\big((\tilde{W}_1)^\top (W_1 - \tilde{W}_1)\big)\big| + \|W_1 - \tilde{W}_1\|_F^2 \le 2\exp(-C\tau)\,\Sigma^{\mathrm{pre}}_{1,1} + \exp(-2C\tau) \le 4\exp(-C\tau)\,\Sigma^{\mathrm{pre}}_{1,1},$$

and similarly,

$$\big|\|W_2\|_F^2 - \|\tilde{W}_2\|_F^2\big| \le 2\big|\mathrm{Tr}\big((\tilde{W}_2)^\top (W_2 - \tilde{W}_2)\big)\big| + \|W_2 - \tilde{W}_2\|_F^2 \le 4\exp(-C\tau)\,\Sigma^{\mathrm{pre}}_{1,1}.$$

Finally,

$$\big|\|W_1 W_2 - A^{\mathrm{pre}}\|_F^2 - \|\tilde{W}_1 \tilde{W}_2 - A^{\mathrm{pre}}\|_F^2\big| \le \|W_1 W_2 - \tilde{W}_1 \tilde{W}_2\|_F\,\|W_1 W_2 + \tilde{W}_1 \tilde{W}_2 - 2A^{\mathrm{pre}}\|_F,$$

where the first factor satisfies

$$\|W_1 W_2 - \tilde{W}_1 \tilde{W}_2\|_F \le \|W_1 - \tilde{W}_1\|_F\|\tilde{W}_2\|_F + \|\tilde{W}_1\|_F\|W_2 - \tilde{W}_2\|_F + \|W_1 - \tilde{W}_1\|_F\|W_2 - \tilde{W}_2\|_F \le 2\sqrt{d\,\Sigma^{\mathrm{pre}}_{1,1}}\,\exp(-C\tau) + \exp(-2C\tau) \le 4\sqrt{d\,\Sigma^{\mathrm{pre}}_{1,1}}\,\exp(-C\tau),$$

and the second factor satisfies

$$\|W_1 W_2 + \tilde{W}_1 \tilde{W}_2 - 2A^{\mathrm{pre}}\|_F \le \|W_1 W_2 - \tilde{W}_1 \tilde{W}_2\|_F + 2\|\tilde{W}_1 \tilde{W}_2 - A^{\mathrm{pre}}\|_F \le 4\sqrt{d\,\Sigma^{\mathrm{pre}}_{1,1}}\,\exp(-C\tau) + 2\sqrt{d}\,\Sigma^{\mathrm{pre}}_{1,1} \le 4\sqrt{d}\,\Sigma^{\mathrm{pre}}_{1,1}.$$

Combining the above,

$$\big|\|W_1 W_2 - A^{\mathrm{pre}}\|_F^2 - \|\tilde{W}_1 \tilde{W}_2 - A^{\mathrm{pre}}\|_F^2\big| \le 16 d\,\big(\Sigma^{\mathrm{pre}}_{1,1}\big)^2 \exp(-C\tau).$$

Combining all of the above, we have

$$\big|\bar{L}_{\mathrm{pre}}(\theta^n(0)) - \bar{L}_{\mathrm{pre}}(\tilde{\theta}^n(0))\big| \le \exp(-C\tau)\cdot 8 d\,\Sigma^{\mathrm{pre}}_{1,1}\big(2\Sigma^{\mathrm{pre}}_{1,1} + \gamma^2\big) \le (\Sigma^{\mathrm{pre}}_{d,d})^2/2,$$

where the final inequality follows from Assumption A.2.

A.3. Dynamic Analysis of the Finetuning Process

Before proceeding to the main result on finetuning, we first analyze the dynamics of the finetuning process in this section. We introduce two auxiliary dynamics to help track its evolution. The first auxiliary dynamic, $\tilde{\theta}^n(t)$, is the ideal initialization dynamic: it starts from the ideal initialization $\tilde{\theta}^n(0)$ in Equations (7) and (8) and follows the same update rule, Equations (10) and (11), with the same data order as the finetuning process. The second auxiliary dynamic, $\hat{\theta}^n(t)$, is the ideal initialization dynamic with infinite batch size: it starts from the ideal initialization $\tilde{\theta}^n(0)$ in Equations (7) and (8) and follows the update rule in Equations (16) and (17), which corresponds to the case where the batch size is infinite and $\Sigma^{(x)}_k$ converges to the identity matrix:

$$\hat{W}^n_1(k+1) = \hat{W}^n_1(k) - 2\eta\,(\hat{\theta}^n(k) - A^{\mathrm{ft}})(\hat{W}^n_2(k))^\top - 2\eta\lambda\,(\hat{\theta}^n(k) - \tilde{\theta}^n(0))(\hat{W}^n_2(k))^\top \quad (16)$$
$$\hat{W}^n_2(k+1) = \hat{W}^n_2(k) - 2\eta\,(\hat{W}^n_1(k))^\top(\hat{\theta}^n(k) - A^{\mathrm{ft}}) - 2\eta\lambda\,(\hat{W}^n_1(k))^\top(\hat{\theta}^n(k) - \tilde{\theta}^n(0)) \quad (17)$$

We will show the following results about these three dynamics:
1. Lemma A.7 provides an analytical expression for the ideal initialization dynamic with infinite batch size.
2. Lemma A.17 shows that the ideal initialization dynamic with finite batch size is close to the ideal initialization dynamic with infinite batch size, with an error bound depending on the batch size.
3. Lemma A.19 shows that the real initialization dynamic is close to the ideal initialization dynamic, with an error bound depending on the scale of the pretraining initialization (which controls the distance between the real and ideal initializations by Theorem A.1).
4. We conclude our analysis by stating our assumption for the main result of the paper (Assumption A.21) and showing that the finetuning process tracks the ideal initialization dynamic with infinite batch size closely and eventually approximately converges to the minimum (Lemmas A.22 and A.23).
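As a concrete companion to these dynamics, the following is a minimal numerical sketch (illustrative values only, not the paper's code) of the ideal infinite-batch update in Equations (16) and (17), restricted to the diagonal coordinates. It anticipates Lemma A.11 below: every learned coordinate converges to the weighted combination $(\Sigma^{\mathrm{ft}}_{i,i} + \lambda \Sigma^{\mathrm{pre}}_{i,i})/(1+\lambda)$, while unlearned coordinates stay at zero.

```python
import numpy as np

# Sketch of the ideal infinite-batch finetuning dynamic (Equations 16-17) on
# the diagonal; sig_pre, sig_ft, eta, lam, K, n are illustrative assumptions.
d, eta, lam, K, n = 4, 1e-2, 0.5, 20000, 3
sig_pre = np.array([4.0, 3.0, 2.0, 1.0])   # pretraining spectrum
sig_ft  = np.array([1.0, 2.5, 3.5, 0.5])   # finetuning spectrum

# Ideal initialization (Equations 7-8): only the first n directions learned.
w1 = np.sqrt(np.where(np.arange(d) < n, sig_pre, 0.0))
w2 = w1.copy()
theta0 = w1 * w2

for _ in range(K):
    theta = w1 * w2
    g = (theta - sig_ft) + lam * (theta - theta0)   # regularized residual
    w1, w2 = w1 - 2 * eta * g * w2, w2 - 2 * eta * g * w1

# Lemma A.11: learned coordinates approach (sig_ft + lam*sig_pre)/(1 + lam);
# the unlearned coordinate (i >= n) is a fixed point at zero.
print(np.round(w1 * w2, 3))
print(np.round(np.where(np.arange(d) < n, (sig_ft + lam * sig_pre) / (1 + lam), 0.0), 3))
```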
Throughout this subsection, we call $W_1$ and $W_2$ well-conditioned if $\|W_1\|_{\mathrm{op}} \le 2\sqrt{\Gamma}$ and $\|W_2\|_{\mathrm{op}} \le 2\sqrt{\Gamma}$.

A.3.1. ANALYTICAL EXPRESSION FOR THE IDEAL INITIALIZATION DYNAMIC WITH INFINITE BATCH SIZE

We introduce the following function to better track the evolution of the weights in the ideal initialization dynamic with infinite batch size:

$$f(x; \eta, \lambda, \sigma, \sigma_0) = x + 2\eta x(\sigma^2 - x^2) + 2\eta\lambda x(\sigma_0^2 - x^2). \quad (18)$$

Lemma A.7. For the ideal initialization dynamic with infinite batch size in Equations (16) and (17), we have

$$\hat{W}^n_1(k) = U(\Sigma^n(k))^{1/2}, \qquad \hat{W}^n_2(k) = (\Sigma^n(k))^{1/2} V^\top,$$
$$(\Sigma^n(k))^{1/2}_{i,i} = \mathbb{1}(i \le n)\, f^{(k)}\big((\Sigma^{\mathrm{pre}}_{i,i})^{1/2};\ \eta,\ \lambda,\ (\Sigma^{\mathrm{ft}}_{i,i})^{1/2},\ (\Sigma^{\mathrm{pre}}_{i,i})^{1/2}\big).$$

Proof. Consider $\Sigma^n_1(k) = U^\top \hat{W}^n_1(k)$ and $\Sigma^n_2(k) = \hat{W}^n_2(k) V$. We then have

$$\Sigma^n_1(k+1) = \Sigma^n_1(k) - 2\eta\,\big(\Sigma^n_1(k)\Sigma^n_2(k) - \Sigma^{\mathrm{ft}}\big)\Sigma^n_2(k)^\top - 2\eta\lambda\,\big(\Sigma^n_1(k)\Sigma^n_2(k) - \Sigma^n_1(0)\Sigma^n_2(0)\big)\Sigma^n_2(k)^\top,$$
$$\Sigma^n_2(k+1) = \Sigma^n_2(k) - 2\eta\,\Sigma^n_1(k)^\top\big(\Sigma^n_1(k)\Sigma^n_2(k) - \Sigma^{\mathrm{ft}}\big) - 2\eta\lambda\,\Sigma^n_1(k)^\top\big(\Sigma^n_1(k)\Sigma^n_2(k) - \Sigma^n_1(0)\Sigma^n_2(0)\big).$$

By induction, we can prove that $\Sigma^n_1(k) = \Sigma^n_2(k)$ are diagonal for all $k$; the claim then follows from the definition of $f$.

This suggests that $\hat{W}^n_1(k)$ and $\hat{W}^n_2(k)$ are always well bounded in terms of $\Gamma$.

Assumption A.8. The learning rate $\eta$ and regularization parameter $\lambda$ are upper bounded: $4\eta(\lambda + 2)\Gamma < 1$.

Lemma A.9. Under Assumption A.8, for the ideal initialization dynamic with infinite batch size in Equations (16) and (17), we have $\|\hat{W}^n_1(k)\|_{\mathrm{op}} \le \sqrt{\Gamma}$ and $\|\hat{W}^n_2(k)\|_{\mathrm{op}} \le \sqrt{\Gamma}$ (in particular, they are well-conditioned), with $\Gamma$ being the upper bound of $\Sigma^{\mathrm{pre}}$ and $\Sigma^{\mathrm{ft}}$ as defined in Equation (12).

Proof. This is a direct consequence of Lemmas A.7 and A.28.

Next, we show that $(U^\top \hat{\theta}^n(K) V)_{i,i}$ converges to a weighted combination of $\Sigma^{\mathrm{pre}}_{i,i}$ and $\Sigma^{\mathrm{ft}}_{i,i}$ after finitely many steps $K$.

Assumption A.10 (Large Enough but Finite Steps). The number of steps satisfies $K \ge \frac{1}{\eta \min_i \{\Sigma^{\mathrm{pre}}_{i,i}, \Sigma^{\mathrm{ft}}_{i,i}\}} \log \frac{100\Gamma}{\epsilon}$ for some constant $\epsilon > 0$.

Lemma A.11. Under Assumptions A.8 and A.10, for the ideal initialization dynamic with infinite batch size in Equations (16) and (17), we have that for any $i \le n$,

$$\Big|\big(U^\top \hat{\theta}^n(K) V\big)_{i,i} - \frac{\Sigma^{\mathrm{ft}}_{i,i} + \lambda \Sigma^{\mathrm{pre}}_{i,i}}{1 + \lambda}\Big| \le \epsilon.$$

Proof. By Lemmas A.7 and A.28, we have

$$\Big|\big(\hat{W}^n_1(K)\big)_{i,i} - \Big(\frac{\Sigma^{\mathrm{ft}}_{i,i} + \lambda \Sigma^{\mathrm{pre}}_{i,i}}{1+\lambda}\Big)^{1/2}\Big| \le \big(1 - 2\eta \min\{\Sigma^{\mathrm{pre}}_{i,i}, \Sigma^{\mathrm{ft}}_{i,i}\}\big)^K\, \Big|(\Sigma^{\mathrm{pre}}_{i,i})^{1/2} - \Big(\frac{\Sigma^{\mathrm{ft}}_{i,i} + \lambda \Sigma^{\mathrm{pre}}_{i,i}}{1+\lambda}\Big)^{1/2}\Big|.$$

This suggests that once

$$K \ge \frac{1}{2\eta \min\{\Sigma^{\mathrm{pre}}_{i,i}, \Sigma^{\mathrm{ft}}_{i,i}\}} \log \frac{100\Gamma^{1/2}\,\big|(\Sigma^{\mathrm{pre}}_{i,i})^{1/2} - (\Sigma^{\mathrm{ft}}_{i,i})^{1/2}\big|}{\epsilon},$$

it follows that the left-hand side above is at most $\frac{\epsilon}{100\Gamma^{1/2}}$. Similarly, we have the analogous bound for $(\hat{W}^n_2(K))_{i,i}$. Combining the two bounds completes the proof.

A.3.2. CORRESPONDENCE BETWEEN THE IDEAL INITIALIZATION DYNAMIC WITH INFINITE BATCH SIZE AND FINITE BATCH SIZE

We then proceed to bound the difference between the ideal initialization dynamic with infinite batch size and the ideal initialization dynamic with finite batch size.

Lemma A.12 (4.7.3 of Vershynin (2018)). For a fixed $k$, there exists a constant $C_1$ such that, with probability $1 - \delta$, when the batch size satisfies $m \ge d + \log(1/\delta)$,

$$\|\Sigma^{(x)}_k - I_d\|_{\mathrm{op}} \le C_1 \sqrt{\frac{d + \log(1/\delta)}{m}}.$$

Assumption A.13 (Large Batch Size). For the constant $C_1$ defined in Lemma A.12 and $\epsilon > 0$, the batch size satisfies $m \ge C_1^2\big(d + \log(10K/\delta)\big)/\epsilon^2$.
Lemma A.14. Under Assumption A.13, for the ideal initialization dynamic with infinite batch size in Equations (16) and (17), we have that for all $k \le K$, $\|\Sigma^{(x)}_k - I_d\|_{\mathrm{op}} \le \epsilon$ with probability $1 - \delta$.

Proof. This is a direct consequence of Lemma A.12 and Assumption A.13, together with a union bound over the $K$ steps.

Lemma A.15. When the event defined in Lemma A.14 happens, for any $k \le K$ and the same well-conditioned parameters $\theta(k)$ and $\theta(0)$, if applying the update rule in Equations (16) and (17) yields $\hat{\theta}(k+1)$ and applying the update rule in Equations (10) and (11) yields $\theta(k+1)$, then the difference between $\hat{\theta}(k+1)$ and $\theta(k+1)$ is bounded by

$$\|\hat{W}_1(k+1) - W_1(k+1)\|_{\mathrm{op}} \le 32\eta\epsilon\Gamma^{3/2}, \qquad \|\hat{W}_2(k+1) - W_2(k+1)\|_{\mathrm{op}} \le 32\eta\epsilon\Gamma^{3/2}.$$

Proof. Taking the difference between the two update rules, we have

$$\|\hat{W}_1(k+1) - W_1(k+1)\|_{\mathrm{op}} = 2\eta\,\big\|(\theta(k) - A^{\mathrm{ft}})\big(\Sigma^{(x)}_k - I_d\big)(W_2(k))^\top\big\|_{\mathrm{op}} \le 2\eta\,\|\theta(k) - A^{\mathrm{ft}}\|_{\mathrm{op}}\,\big\|\Sigma^{(x)}_k - I_d\big\|_{\mathrm{op}}\,\|W_2(k)\|_{\mathrm{op}} \le 2\eta \cdot 8\Gamma \cdot \epsilon \cdot 2\sqrt{\Gamma} = 32\eta\epsilon\Gamma^{3/2},$$

using that well-conditioned parameters satisfy $\|\theta(k) - A^{\mathrm{ft}}\|_{\mathrm{op}} \le 4\Gamma + \Gamma \le 8\Gamma$. Similarly, we obtain the bound for $\|\hat{W}_2(k+1) - W_2(k+1)\|_{\mathrm{op}}$.

Lemma A.16. When the event defined in Lemma A.14 happens, for the dynamic in Equations (16) and (17), consider two different well-conditioned parameters $\theta(k)$ and $\theta'(k)$ with the same initialization $\theta(0)$, and denote $\epsilon_k = \max\{\|W_1(k) - W'_1(k)\|_{\mathrm{op}},\ \|W_2(k) - W'_2(k)\|_{\mathrm{op}}\}$. Then $\epsilon_{k+1} \le (1 + 16\eta\Gamma)\epsilon_k$.

Proof. Define $A^{\mathrm{target}} = \lambda\,\theta(0) + A^{\mathrm{ft}}$, so that the update applied to $W_1$ can be written as $-2\eta\big((1+\lambda)\theta(k) - A^{\mathrm{target}}\big)W_2(k)^\top$. Given the update rule, we have

$$W_1(k+1) - W'_1(k+1) = \underbrace{(W_1(k) - W'_1(k))}_{\text{previous error}} - 2\eta\Big[\big((1+\lambda)\theta(k) - A^{\mathrm{target}}\big)W_2(k)^\top - \big((1+\lambda)\theta'(k) - A^{\mathrm{target}}\big)W'_2(k)^\top\Big].$$

We only need to properly bound the second term:

$$\big\|\big((1+\lambda)\theta(k) - A^{\mathrm{target}}\big)W_2(k)^\top - \big((1+\lambda)\theta'(k) - A^{\mathrm{target}}\big)W'_2(k)^\top\big\|_{\mathrm{op}} \le (1+\lambda)\|\theta(k) - \theta'(k)\|_{\mathrm{op}}\|W_2(k)\|_{\mathrm{op}} + \big\|(1+\lambda)\theta'(k) - A^{\mathrm{target}}\big\|_{\mathrm{op}}\|W_2(k) - W'_2(k)\|_{\mathrm{op}}.$$

The difference between $\theta(k)$ and $\theta'(k)$ is bounded by

$$\|\theta(k) - \theta'(k)\|_{\mathrm{op}} \le \|W_1(k) - W'_1(k)\|_{\mathrm{op}}\|W_2(k)\|_{\mathrm{op}} + \|W'_1(k)\|_{\mathrm{op}}\|W_2(k) - W'_2(k)\|_{\mathrm{op}} \le 4\sqrt{\Gamma}\,\epsilon_k.$$

Therefore, absorbing the $(1+\lambda)$ factors via Assumption A.8, the second term is bounded by $8\Gamma\epsilon_k$, and we conclude that $\epsilon_{k+1} \le (1 + 16\eta\Gamma)\epsilon_k$. This concludes the proof.

Lemma A.17. When the event defined in Lemma A.14 happens for $\epsilon < \frac{1}{4(1+16\eta\Gamma)^K}$, define the error between the ideal initialization dynamic with infinite batch size and the ideal initialization dynamic with finite batch size as $\varepsilon_k = \max\{\|\hat{W}_1(k) - W_1(k)\|_{\mathrm{op}},\ \|\hat{W}_2(k) - W_2(k)\|_{\mathrm{op}}\}$. Then we have

$$\varepsilon_k \le 2(1 + 16\eta\Gamma)^k \epsilon\,\Gamma^{1/2} < \Gamma^{1/2}/2.$$

Proof. From Lemma A.9, $\hat{\theta}$ is well-conditioned; if $\theta$ is also well-conditioned, combining Lemmas A.15 and A.16 gives

$$\varepsilon_{k+1} \le (1 + 16\eta\Gamma)\varepsilon_k + 32\eta\epsilon\Gamma^{3/2}.$$

We can then inductively prove that, for $k \in [0, K]$,

$$\varepsilon_k \le \big((1 + 16\eta\Gamma)^k - 1\big)\cdot 2\epsilon\,\Gamma^{1/2}.$$

Given that $\epsilon < \frac{1}{4(1+16\eta\Gamma)^K}$, we have $\varepsilon_K < \Gamma^{1/2}/2$, which in particular keeps every iterate well-conditioned. This concludes the proof.

A.3.3. ERROR INCURRED BY THE DIFFERENT INITIALIZATION

Finally, we show that the real initialization dynamic is close to the ideal initialization dynamic, with an error bound depending on the scale of the pretraining initialization (which controls the distance between the real initialization and the ideal initialization by Theorem A.1).

Lemma A.18. When the event defined in Lemma A.14 happens for $\epsilon < \frac{1}{4(1+16\eta\Gamma)^K}$, for the finite-batch update rule in Equations (10) and (11), consider two different well-conditioned parameters $\theta(k)$ and $\theta'(k)$ with the same initialization $\theta(0)$, and denote $\epsilon_k = \max\{\|W_1(k) - W'_1(k)\|_{\mathrm{op}},\ \|W_2(k) - W'_2(k)\|_{\mathrm{op}}\}$. Then $\epsilon_{k+1} \le (1 + 32\eta\Gamma)\epsilon_k$.

Proof. The proof is similar to that of Lemma A.16 and is omitted here.

Lemma A.19. When the event defined in Lemma A.14 happens for $\epsilon < \frac{1}{4(1+32\eta\Gamma)^K}$, consider two finetuning processes, where $\theta^n(t)$ starts from the real initialization $\theta^n(0)$ in Theorem A.1 and $\tilde{\theta}^n(t)$ starts from the ideal initialization $\tilde{\theta}^n(0)$ in Equations (7) and (8).
Then the two processes remain close to each other for all $k \le K$:

$$\|W^n_1(k) - \tilde{W}^n_1(k)\|_{\mathrm{op}} \le (1 + 32\eta\Gamma)^k \exp(-C\tau), \qquad \|W^n_2(k) - \tilde{W}^n_2(k)\|_{\mathrm{op}} \le (1 + 32\eta\Gamma)^k \exp(-C\tau).$$

Proof. Define $\varepsilon_k = \max\{\|W^n_1(k) - \tilde{W}^n_1(k)\|_F,\ \|W^n_2(k) - \tilde{W}^n_2(k)\|_F\}$. By Lemma A.17, $\tilde{\theta}$ is well-conditioned; if $\theta$ is also well-conditioned, combining Lemma A.18 we have $\varepsilon_{k+1} \le (1 + 32\eta\Gamma)\varepsilon_k$. Since $\varepsilon_0 \le \exp(-C\tau)$ by Theorem A.1, this suggests that $\varepsilon_k \le (1 + 32\eta\Gamma)^k \exp(-C\tau)$. This concludes the proof.

A.3.4. COMBINING THE TWO APPROXIMATIONS

Lemma A.20. Under Assumptions A.8 and A.13, for $\epsilon < \frac{1}{4(1+16\eta\Gamma)^K}$, with probability $1 - \delta$, both $W^n_1(k)$ and $W^n_2(k)$ are well-conditioned and

$$\|W^n_1(k) - \hat{W}^n_1(k)\|_{\mathrm{op}} \le (1 + 32\eta\Gamma)^k \exp(-C\tau) + 2(1 + 16\eta\Gamma)^k \Gamma^{1/2}\epsilon,$$
$$\|W^n_2(k) - \hat{W}^n_2(k)\|_{\mathrm{op}} \le (1 + 32\eta\Gamma)^k \exp(-C\tau) + 2(1 + 16\eta\Gamma)^k \Gamma^{1/2}\epsilon.$$

Proof. This is a direct consequence of Lemmas A.14, A.17, and A.19.

Given this lemma, we now present our main assumption and the corresponding bounds under this assumption.

Technical Assumptions. We make the following technical assumptions to simplify the analysis.

Assumption A.21. To control the regularity of training, for an arbitrary constant $\lambda_0$ and for

$$\epsilon < \frac{1}{4000 d}\cdot \frac{\min_{n \le d}\{|\Sigma^{\mathrm{pre}}_{n,n} - \Sigma^{\mathrm{ft}}_{n,n}|^2\}}{(\lambda_0 + 1)^2 \Gamma^2},$$

we assume:
1. Finite regularization force: $0 \le \lambda < \lambda_0$.
2. (Assumption A.8) The finetuning learning rate is bounded: $4\eta(\lambda_0 + 2)\Gamma < 1$.
3. (Assumption A.10) The finite number of steps satisfies $K \ge \frac{1}{\eta \min_i\{\Sigma^{\mathrm{pre}}_{i,i}, \Sigma^{\mathrm{ft}}_{i,i}\}} \log\frac{100\Gamma}{\epsilon}$.
4. (Assumption A.13) The batch size $m$ is large enough: $m \ge C_1^2\big(d + \log(10 d K/\delta)\big)\,\epsilon^{-2}\,(1 + 32\eta\Gamma)^{2K}$ for $C_1$ defined in Lemma A.12.
5. The initialization error is small enough: $\exp(-C\tau) \le \Gamma^{1/2}\epsilon/(1 + 32\eta\Gamma)^K$ for $C$ defined in Theorem A.1.

We first show the important lemma that the distance between the real dynamic and the ideal infinite-batch dynamic is bounded under Assumption A.21.

Lemma A.22. Under Assumption A.21, with probability $1 - \delta$, we have that for every $n \le d$ and $k \le K$,

$$\|\theta^n(k) - \hat{\theta}^n(k)\|_F \le \frac{\min_{i \le n}\{|\Sigma^{\mathrm{pre}}_{i,i} - \Sigma^{\mathrm{ft}}_{i,i}|^2\}}{1000(\lambda_0 + 1)^2 \Gamma}.$$

Proof. This is a consequence of Lemma A.20. However, to go from the operator norm bounds on $W^n_1(k)$ and $W^n_2(k)$ to the Frobenius norm bound on $\theta^n(k)$, we need the following two inequalities. The first provides an operator norm bound on the difference between $\theta^n(k)$ and $\hat{\theta}^n(k)$:

$$\|\theta^n(k) - \hat{\theta}^n(k)\|_{\mathrm{op}} \le \|W^n_1(k) - \hat{W}^n_1(k)\|_{\mathrm{op}}\,\|\hat{W}^n_2(k)\|_{\mathrm{op}} + \|W^n_2(k) - \hat{W}^n_2(k)\|_{\mathrm{op}}\,\|W^n_1(k)\|_{\mathrm{op}} \le 4\Gamma^{1/2}\big(\|W^n_1(k) - \hat{W}^n_1(k)\|_{\mathrm{op}} + \|W^n_2(k) - \hat{W}^n_2(k)\|_{\mathrm{op}}\big).$$

The second uses this operator norm bound to bound the Frobenius norm of the difference:

$$\|\theta^n(k) - \hat{\theta}^n(k)\|_F \le \sqrt{d}\,\|\theta^n(k) - \hat{\theta}^n(k)\|_{\mathrm{op}}.$$

Combining these two inequalities with Assumption A.21, we get the desired result.

We can continue to show that the finetuning process approximately converges to the minimum.

Lemma A.23. Under Assumption A.21, with probability $1 - \delta$, we have that for every $n \le d$,

$$\Big\|U^\top \theta^n(K) V - \frac{\Sigma^{\mathrm{ft}}_{:n,:n} + \lambda \Sigma^{\mathrm{pre}}_{:n,:n}}{1 + \lambda}\Big\|_F \le \frac{\min_{i \le n}\{|\Sigma^{\mathrm{pre}}_{i,i} - \Sigma^{\mathrm{ft}}_{i,i}|^2\}}{500(\lambda_0 + 1)^2 \Gamma}.$$

Proof. This is a consequence of Lemmas A.11 and A.22.

A.4. Formal Statement and Proof of Theorem 4.6

Theorem A.24. Under Assumption A.21, with probability $1 - \delta$, the pretraining loss degradation $\Delta_{\mathrm{pre}}(n) = L_{\mathrm{pre}}(\theta^n(K)) - L_{\mathrm{pre}}(\theta^n(0))$ satisfies $\Delta_{\mathrm{pre}}(n) \ge 0$, and $\Delta_{\mathrm{pre}}(n)$ does not decrease with $n$.

Proof. We first provide a tight bound for $\Delta_{\mathrm{pre}}(n)$. By Lemma A.22, we have

$$\|\theta^n(0) - \hat{\theta}^n(0)\|_F \le \frac{\min_{i \le d}\{|\Sigma^{\mathrm{pre}}_{i,i} - \Sigma^{\mathrm{ft}}_{i,i}|^2\}}{1000(\lambda_0 + 1)^2 \Gamma},$$

and by Lemma A.23, we have

$$\Big\|U^\top \theta^n(K) V - \frac{\Sigma^{\mathrm{ft}}_{:n,:n} + \lambda \Sigma^{\mathrm{pre}}_{:n,:n}}{1 + \lambda}\Big\|_F \le \frac{\min_{i \le n}\{|\Sigma^{\mathrm{pre}}_{i,i} - \Sigma^{\mathrm{ft}}_{i,i}|^2\}}{500(\lambda_0 + 1)^2 \Gamma}.$$
This suggests that

$$\big|L_{\mathrm{pre}}(\theta^n(0)) - L_{\mathrm{pre}}(\hat{\theta}^n(0))\big| = \big|\|\theta^n(0) - A^{\mathrm{pre}}\|_F^2 - \|\hat{\theta}^n(0) - A^{\mathrm{pre}}\|_F^2\big| \le \|\theta^n(0) - \hat{\theta}^n(0)\|_F\,\|\theta^n(0) + \hat{\theta}^n(0) - 2A^{\mathrm{pre}}\|_F \le 32\Gamma\,\|\theta^n(0) - \hat{\theta}^n(0)\|_F \le \frac{\min_{i \le d}\{|\Sigma^{\mathrm{pre}}_{i,i} - \Sigma^{\mathrm{ft}}_{i,i}|^2\}}{10(\lambda_0 + 1)^2}.$$

Similarly, we have

$$\Big|L_{\mathrm{pre}}(\theta^n(K)) - L_{\mathrm{pre}}\Big(U\,\frac{\Sigma^{\mathrm{ft}}_{:n,:n} + \lambda\Sigma^{\mathrm{pre}}_{:n,:n}}{1+\lambda}\,V^\top\Big)\Big| \le \frac{\min_{i \le n}\{|\Sigma^{\mathrm{pre}}_{i,i} - \Sigma^{\mathrm{ft}}_{i,i}|^2\}}{5(\lambda_0 + 1)^2}.$$

Combining these two inequalities, we have

$$\Big|\Delta_{\mathrm{pre}}(n) - \Big(L_{\mathrm{pre}}\Big(U\,\frac{\Sigma^{\mathrm{ft}}_{:n,:n} + \lambda\Sigma^{\mathrm{pre}}_{:n,:n}}{1+\lambda}\,V^\top\Big) - L_{\mathrm{pre}}\big(U\Sigma^{\mathrm{pre}}_{:n,:n} V^\top\big)\Big)\Big| \le \frac{3\min_{i \le n}\{|\Sigma^{\mathrm{pre}}_{i,i} - \Sigma^{\mathrm{ft}}_{i,i}|^2\}}{10(\lambda_0 + 1)^2}.$$

Meanwhile, we have

$$L_{\mathrm{pre}}\Big(U\,\frac{\Sigma^{\mathrm{ft}}_{:n,:n} + \lambda\Sigma^{\mathrm{pre}}_{:n,:n}}{1+\lambda}\,V^\top\Big) - L_{\mathrm{pre}}\big(U\Sigma^{\mathrm{pre}}_{:n,:n} V^\top\big) = \sum_{i \le n}\Big(\frac{\Sigma^{\mathrm{ft}}_{i,i} + \lambda\Sigma^{\mathrm{pre}}_{i,i}}{1+\lambda} - \Sigma^{\mathrm{pre}}_{i,i}\Big)^2 = \sum_{i \le n}\frac{\big(\Sigma^{\mathrm{ft}}_{i,i} - \Sigma^{\mathrm{pre}}_{i,i}\big)^2}{(1+\lambda)^2}.$$

Therefore, if we additionally define $\Delta_{\mathrm{pre}}(0) = 0$, we have for $1 \le n \le d$:

$$\Delta_{\mathrm{pre}}(n) - \Delta_{\mathrm{pre}}(n-1) \ge \frac{\big(\Sigma^{\mathrm{pre}}_{n,n} - \Sigma^{\mathrm{ft}}_{n,n}\big)^2}{(1+\lambda)^2} - \frac{3\min_{i \le n}\{|\Sigma^{\mathrm{pre}}_{i,i} - \Sigma^{\mathrm{ft}}_{i,i}|^2\}}{5(\lambda_0 + 1)^2} > 0.$$

This completes the proof.

A.5. Formal Statement and Proof of Theorem 4.7

Theorem A.25.
1. Under Assumption A.21, when $\lambda = 0$, with probability $1 - \delta$: if $A^{\mathrm{pre}}$ and $A^{\mathrm{ft}}$ are $(4, r)$-misaligned, then $L_{\mathrm{pre}}(\theta^n(K)) - L_{\mathrm{pre}}(\theta^{n-1}(K)) > 0$ for $n \ge r$.
2. Define the inflection point $r_\lambda$ as the smallest value of $r$ for which the pre-training loss $L_{\mathrm{pre}}(n)$ increases monotonically for every $n > r$. Assume that regularization strengths $\lambda_1 > \lambda_2 > 0$ yield iterates $\theta_1$ and $\theta_2$. If Assumption A.21 holds for

$$\epsilon < \frac{1}{4000 d}\cdot\frac{\min_{n \le d}\{|\Sigma^{\mathrm{pre}}_{n,n} - \Sigma^{\mathrm{ft}}_{n,n}|^2\}}{\Gamma^2}\,\min\Big\{\frac{1}{(1+\lambda_2)^2} - \frac{1}{(1+\lambda_1)^2},\ \frac{\lambda_1^2}{(1+\lambda_1)^2} - \frac{\lambda_2^2}{(1+\lambda_2)^2}\Big\},$$

then with probability $1 - \delta$, we have $r_{\lambda_1} \ge r_{\lambda_2}$, and the unregularized finetuning loss satisfies $\|\theta^n_1(K) - A^{\mathrm{ft}}\|_F^2 > \|\theta^n_2(K) - A^{\mathrm{ft}}\|_F^2$ for every $n$.

Proof. This is the combination of Lemmas A.26 and A.27.

Lemma A.26. Under Assumption A.21, if $\Sigma^{\mathrm{ft}}_{n,n} > 4\Sigma^{\mathrm{pre}}_{n,n}$ and $\lambda = 0$, then $L_{\mathrm{pre}}(\theta^n(K)) - L_{\mathrm{pre}}(\theta^{n-1}(K)) > 0$.

Proof. With the same argument as in Theorem A.24, we have

$$\big|L_{\mathrm{pre}}(\theta^n(K)) - L_{\mathrm{pre}}\big(U\Sigma^{\mathrm{ft}}_{:n,:n} V^\top\big)\big| \le \frac{\min_{i \le n}\{|\Sigma^{\mathrm{pre}}_{i,i} - \Sigma^{\mathrm{ft}}_{i,i}|^2\}}{5},$$

and

$$L_{\mathrm{pre}}\big(U\Sigma^{\mathrm{ft}}_{:n,:n} V^\top\big) - L_{\mathrm{pre}}\big(U\Sigma^{\mathrm{ft}}_{:n-1,:n-1} V^\top\big) = \big(\Sigma^{\mathrm{ft}}_{n,n} - \Sigma^{\mathrm{pre}}_{n,n}\big)^2 - \big(\Sigma^{\mathrm{pre}}_{n,n}\big)^2.$$

We further have $\Sigma^{\mathrm{ft}}_{n,n} - \Sigma^{\mathrm{pre}}_{n,n} > 2\Sigma^{\mathrm{pre}}_{n,n}$. Therefore,

$$L_{\mathrm{pre}}(\theta^n(K)) - L_{\mathrm{pre}}(\theta^{n-1}(K)) \ge \big(\Sigma^{\mathrm{ft}}_{n,n} - \Sigma^{\mathrm{pre}}_{n,n}\big)^2 - \big(\Sigma^{\mathrm{pre}}_{n,n}\big)^2 - \frac{2\big(\Sigma^{\mathrm{ft}}_{n,n} - \Sigma^{\mathrm{pre}}_{n,n}\big)^2}{5} \ge \frac{3}{5}\big(2\Sigma^{\mathrm{pre}}_{n,n}\big)^2 - \big(\Sigma^{\mathrm{pre}}_{n,n}\big)^2 > 0.$$

This completes the proof.

Lemma A.27. Assume that regularization strengths $\lambda_1 > \lambda_2 > 0$ yield iterates $\theta_1$ and $\theta_2$. If Assumption A.21 holds for

$$\epsilon < \frac{1}{4000 d}\cdot\frac{\min_{n \le d}\{|\Sigma^{\mathrm{pre}}_{n,n} - \Sigma^{\mathrm{ft}}_{n,n}|^2\}}{\Gamma^2}\,\min\Big\{\frac{1}{(1+\lambda_2)^2} - \frac{1}{(1+\lambda_1)^2},\ \frac{\lambda_1^2}{(1+\lambda_1)^2} - \frac{\lambda_2^2}{(1+\lambda_2)^2}\Big\},$$

then with probability $1 - \delta$, we have $r_{\lambda_1} \ge r_{\lambda_2}$, and the unregularized finetuning loss satisfies $\|\theta^n_1(K) - A^{\mathrm{ft}}\|_F^2 > \|\theta^n_2(K) - A^{\mathrm{ft}}\|_F^2$ for every $n$.

Proof. Following a similar proof to Lemma A.23, we have, with probability $1 - \delta$,

$$\Big\|\theta^n_1(K) - U\,\frac{\Sigma^{\mathrm{ft}}_{:n,:n} + \lambda_1\Sigma^{\mathrm{pre}}_{:n,:n}}{1+\lambda_1}\,V^\top\Big\|_F \le \frac{\min_{i \le n}\{|\Sigma^{\mathrm{pre}}_{i,i} - \Sigma^{\mathrm{ft}}_{i,i}|^2\}}{500\,\Gamma}\Big(\frac{1}{(1+\lambda_2)^2} - \frac{1}{(1+\lambda_1)^2}\Big),$$
$$\Big\|\theta^n_2(K) - U\,\frac{\Sigma^{\mathrm{ft}}_{:n,:n} + \lambda_2\Sigma^{\mathrm{pre}}_{:n,:n}}{1+\lambda_2}\,V^\top\Big\|_F \le \frac{\min_{i \le n}\{|\Sigma^{\mathrm{pre}}_{i,i} - \Sigma^{\mathrm{ft}}_{i,i}|^2\}}{500\,\Gamma}\Big(\frac{1}{(1+\lambda_2)^2} - \frac{1}{(1+\lambda_1)^2}\Big).$$

This then implies that

$$\Big|\|\theta^n_1(K) - A^{\mathrm{pre}}\|_F^2 - \Big\|\frac{\Sigma^{\mathrm{ft}}_{:n,:n} + \lambda_1\Sigma^{\mathrm{pre}}_{:n,:n}}{1+\lambda_1} - \Sigma^{\mathrm{pre}}\Big\|_F^2\Big| \le \frac{\min_{i \le n}\{|\Sigma^{\mathrm{pre}}_{i,i} - \Sigma^{\mathrm{ft}}_{i,i}|^2\}}{50}\Big(\frac{1}{(1+\lambda_2)^2} - \frac{1}{(1+\lambda_1)^2}\Big),$$

and a similar bound holds for $\theta^n_2(K)$. Combining these two inequalities, we have

$$\Big(\|\theta^n_2(K) - A^{\mathrm{pre}}\|_F^2 - \|\theta^{n-1}_2(K) - A^{\mathrm{pre}}\|_F^2\Big) - \Big(\|\theta^n_1(K) - A^{\mathrm{pre}}\|_F^2 - \|\theta^{n-1}_1(K) - A^{\mathrm{pre}}\|_F^2\Big)$$
$$\ge \Big(\frac{\Sigma^{\mathrm{ft}}_{n,n} + \lambda_2\Sigma^{\mathrm{pre}}_{n,n}}{1+\lambda_2} - \Sigma^{\mathrm{pre}}_{n,n}\Big)^2 - \Big(\frac{\Sigma^{\mathrm{ft}}_{n,n} + \lambda_1\Sigma^{\mathrm{pre}}_{n,n}}{1+\lambda_1} - \Sigma^{\mathrm{pre}}_{n,n}\Big)^2 - \frac{\min_{i \le n}\{|\Sigma^{\mathrm{pre}}_{i,i} - \Sigma^{\mathrm{ft}}_{i,i}|^2\}}{25}\Big(\frac{1}{(1+\lambda_2)^2} - \frac{1}{(1+\lambda_1)^2}\Big)$$
$$= \Big(\frac{1}{(1+\lambda_2)^2} - \frac{1}{(1+\lambda_1)^2}\Big)\big(\Sigma^{\mathrm{pre}}_{n,n} - \Sigma^{\mathrm{ft}}_{n,n}\big)^2 - \frac{\min_{i \le n}\{|\Sigma^{\mathrm{pre}}_{i,i} - \Sigma^{\mathrm{ft}}_{i,i}|^2\}}{25}\Big(\frac{1}{(1+\lambda_2)^2} - \frac{1}{(1+\lambda_1)^2}\Big) \ge 0.$$

This suggests that $\|\theta^n_2(K) - A^{\mathrm{pre}}\|_F^2 > \|\theta^{n-1}_2(K) - A^{\mathrm{pre}}\|_F^2$ whenever $\|\theta^n_1(K) - A^{\mathrm{pre}}\|_F^2 > \|\theta^{n-1}_1(K) - A^{\mathrm{pre}}\|_F^2$, showing that $r_{\lambda_1} \ge r_{\lambda_2}$. Using a similar argument with the second term of the $\min$ in the bound on $\epsilon$, we can show that the unregularized finetuning loss satisfies $\|\theta^n_1(K) - A^{\mathrm{ft}}\|_F^2 > \|\theta^n_2(K) - A^{\mathrm{ft}}\|_F^2$ for every $n$.

A.6. Technical Lemmas

In this section, we first prove some technical lemmas on the function $f$ defined in Equation (18). Recall that $f$ is defined as

$$f(x; \eta, \lambda, \sigma, \sigma_0) = x + 2\eta x(\sigma^2 - x^2) + 2\eta\lambda x(\sigma_0^2 - x^2).$$

Lemma A.28. For any $\sigma > 0$ and $k \in \mathbb{N}$, when $4\eta(\lambda + 2)\max\{\sigma^2, \sigma_0^2\} < 1$, define $\sigma^* = \sqrt{\frac{\sigma^2 + \lambda\sigma_0^2}{1+\lambda}}$. Then $f^{(k)}(\sigma_0; \eta, \lambda, \sigma, \sigma_0) \in [\min\{\sigma^*, \sigma_0\}, \max\{\sigma^*, \sigma_0\}]$, and

$$\big|f^{(k)}(\sigma_0; \eta, \lambda, \sigma, \sigma_0) - \sigma^*\big| \le \big(1 - 2\eta\min\{\sigma^2, \sigma_0^2\}\big)^k\,|\sigma_0 - \sigma^*|.$$

Proof. Let $g(x; \sigma, \sigma_0, \lambda) = x(x^2 - \sigma^2) + \lambda x(x^2 - \sigma_0^2)$. Then $g(\sigma^*; \sigma, \sigma_0, \lambda) = 0$, and $f(x; \eta, \lambda, \sigma, \sigma_0) = x - 2\eta\, g(x; \sigma, \sigma_0, \lambda)$. Note that $g$ factors as

$$g(x; \sigma, \sigma_0, \lambda) = (1+\lambda)\,x\,\big(x^2 - (\sigma^*)^2\big) = (1+\lambda)\,x\,(x - \sigma^*)(x + \sigma^*).$$

For any $x \in [\min\{\sigma^*, \sigma_0\}, \max\{\sigma^*, \sigma_0\}]$,

$$f(x; \eta, \lambda, \sigma, \sigma_0) - \sigma^* = x - \sigma^* - 2\eta\, g(x; \sigma, \sigma_0, \lambda) + 2\eta\, g(\sigma^*; \sigma, \sigma_0, \lambda) = (x - \sigma^*)\big(1 - 2\eta(1+\lambda)\,x(x + \sigma^*)\big).$$

When $x \in [\min\{\sigma^*, \sigma_0\}, \max\{\sigma^*, \sigma_0\}]$, on the one hand $(1+\lambda)\,x(x + \sigma^*) \ge \min\{\sigma^2, \sigma_0^2\}$, and on the other hand $(1+\lambda)\,x(x + \sigma^*) \le 2(1+\lambda)\max\{\sigma^2, \sigma_0^2\}$, which by the hypothesis implies $1 - 2\eta(1+\lambda)\,x(x + \sigma^*) > 0$. Therefore,

$$\big|f(x; \eta, \lambda, \sigma, \sigma_0) - \sigma^*\big| \le |x - \sigma^*|\,\big(1 - 2\eta\min\{\sigma^2, \sigma_0^2\}\big),$$

and $f(x; \eta, \lambda, \sigma, \sigma_0) - \sigma^*$ has the same sign as $x - \sigma^*$, so the iterates remain in the interval. This concludes the proof.
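To see what Theorems A.24 and A.25 predict concretely, the following is a small numeric illustration (the spectra are illustrative assumptions, not quantities from the paper) that evaluates the closed-form fixed point of Lemma A.23: learned coordinate $i$ moves from $\Sigma^{\mathrm{pre}}_{i,i}$ to $(\Sigma^{\mathrm{ft}}_{i,i} + \lambda\Sigma^{\mathrm{pre}}_{i,i})/(1+\lambda)$, so the pretraining degradation $\Delta_{\mathrm{pre}}(n)$ grows with the number of learned directions $n$, and raising $\lambda$ trades that degradation against the residual finetuning loss.

```python
import numpy as np

# Numeric illustration of Theorems A.24/A.25 at the idealized fixed point of
# Lemma A.23 (spectra are illustrative assumptions). Per learned direction i:
#   pretraining degradation:  (sig_ft[i] - sig_pre[i])**2 / (1 + lam)**2
#   residual finetuning loss: lam**2 * (sig_ft[i] - sig_pre[i])**2 / (1 + lam)**2
# Unlearned directions (i >= n) still contribute sig_ft[i]**2 to the
# finetuning loss.
sig_pre = np.array([5.0, 4.0, 3.0, 2.0, 1.0])
sig_ft  = np.array([2.0, 1.5, 4.0, 0.5, 3.0])
gap2 = (sig_ft - sig_pre) ** 2

for lam in (0.0, 0.5, 2.0):
    delta_pre = np.cumsum(gap2) / (1 + lam) ** 2        # Delta_pre(n), n = 1..d
    ft_loss = lam**2 * np.cumsum(gap2) / (1 + lam) ** 2 \
        + (np.sum(sig_ft**2) - np.cumsum(sig_ft**2))
    print(f"lam={lam}: delta_pre={np.round(delta_pre, 2)}")
    print(f"          ft_loss  ={np.round(ft_loss, 2)}")
# delta_pre(n) is nonnegative and grows with n for every lam (Theorem A.24):
# more learned directions means more forgetting. Increasing lam shrinks
# delta_pre but inflates the residual finetuning loss (Theorem A.25).
```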
B. Extended Related Work

Here we present an expanded and extended discussion of the related work.

Loss of plasticity. The idea that more training can be harmful to performance has been studied before in other continual learning settings. Named loss of plasticity, this phenomenon refers to the degradation of a model's ability to adapt to new tasks. It has mainly been studied in the context of training small models on small datasets (Ash & Adams, 2020; Dohare et al., 2021) or reinforcement learning (Kumar et al., 2020; Lyle et al., 2022; 2023; Ma et al., 2023; Abbas et al., 2023). Loss of plasticity has been attributed to loss curvature (Lyle et al., 2023; Lewandowski et al., 2023), increased weight norm (Nikishin et al., 2022), feature rank (Kumar et al., 2020; Gulcehre et al., 2022), and feature inactivity (Lyle et al., 2022; Dohare et al., 2021). Multiple remedies have been proposed, including changes to the neural network architecture (Lyle et al., 2023), resetting model parameters (Nikishin et al., 2024; D'Oro et al., 2022), and regularization (Kumar et al., 2023; Ash & Adams, 2020). While prior work focused on reinforcement learning or small-scale, synthetic setups, our work considers the large-scale autoregressive language modeling setting. Unlike prior work, where pre-training is often harmful for the downstream fine-tuning task, we show that overtraining on generic web data can also degrade fine-tuning performance despite being expected to help. Additionally, we highlight an increased sensitivity to degradation of the pre-training loss that arises with overtraining, an aspect largely overlooked in the literature.

Catastrophic forgetting. The phenomenon of catastrophic forgetting, where neural networks trained sequentially on tasks tend to forget prior tasks, has also been well documented in the literature (Kirkpatrick et al., 2017; French, 1999; Goodfellow et al., 2013; Kemker et al., 2018; Kotha et al., 2023). Several mitigation strategies have been proposed: for example, Ahn et al. (2019), Hou et al. (2018), and Chaudhry et al. (2019a) propose using regularization to mitigate catastrophic forgetting. Other fixes include generative replay of examples from previous tasks (Shin et al., 2017) or maintaining a memory buffer of previous tasks (Chaudhry et al., 2019b; de Masson d'Autume et al., 2019). In this work, we show that catastrophic forgetting can become more severe with overtraining.
Relationship between pre-training loss and downstream performance. In our work, we argue that the degradation of the pre-training loss and of the downstream loss may be related. Several works have studied the relationship between the pre-training loss of language models and their downstream performance. Liu et al. (2022) analyze the effect of pre-training beyond convergence and suggest that overtrained models exhibit better transfer to downstream tasks. Our work considers web-scale pre-training, which rarely converges in practice, so these findings do not contradict ours. Similarly, Tay et al. (2022) and Zhang et al. (2023) highlight the effect of architecture on downstream generalization given the same pre-training loss.

Scaling laws for optimal pre-training. In our work, we argue that training for fewer tokens can be beneficial for downstream performance after fine-tuning. Related to our work, Isik et al. (2024) propose scaling laws for certain downstream translation tasks after fine-tuning, but do not observe degradation with overtraining. The optimal pre-training token budget has also been studied in other contexts. Notably, Kaplan et al. (2020) and Hoffmann et al. (2022) demonstrate that, given a fixed compute budget, there exists an optimal token budget for each model size. Subsequent works have extended scaling laws to broader contexts, including transfer learning, contrastive training, training under data constraints, and predicting performance from factors other than pre-training tokens (Hernandez et al., 2021; Cherti et al., 2023; Muennighoff et al., 2023; Goyal et al., 2024; Liu et al., 2025; Bhagia et al., 2024). However, scaling laws are not always optimal for predicting performance: Diaz & Madaio (2024) argue that existing scaling laws do not always predict downstream performance accurately, and multiple works have observed U-shaped trends in performance as models scale (Caballero et al., 2022; Wei et al., 2022; McKenzie et al., 2022a). To reduce inference cost, practitioners have turned to developing capable small models, which often requires overtraining beyond the compute-optimal token budget. In fact, Sardana et al. (2024) show that the pre-training loss continues to decrease when training for up to 10,000 tokens per parameter. Gadre et al. (2024) validate similar observations and propose scaling laws to predict model performance in this overtraining regime.

Transfer learning theory. Finally, our theoretical analysis of catastrophic overtraining adopts a classical transfer learning setup based on deep linear networks (Gidel et al., 2019; Saxe et al., 2018). Wei et al. (2024) and Arora et al. (2018) use this setup to study how models learn and store knowledge. Another group of studies explains how transfer learning can improve performance after pre-training (Saunshi et al., 2021; Wei et al., 2021; Shachaf et al., 2021). Chua et al. (2021), Wu et al. (2020), and Tripuraneni et al. (2020) specifically adopt a similar deep linear network setting to study feature learning during pre-training and how the learned features can benefit downstream tasks. Kumar et al. (2022) explore how fine-tuning can lead to degradation of out-of-distribution performance.

C. Experimental Details from Section 2: Large Model Experiments

In this section, we present all of the omitted experimental details from Section 2 that are necessary for replication.

C.1. Pre-trained models.
For our pre-trained models, we use checkpoints from three base models: OLMo-1B (Groeneveld et al., 2024b), OLMo-2-7B (OLMo et al., 2024), and LLM360-Amber (Liu et al., 2023b). We choose checkpoints that have been released on each model's Hugging Face page, as given in Table 1.

Model              Hugging Face ID          Revision                        Step   Token Budget
OLMo-1B            allenai/OLMo-1B-hf       step10000-tokens41B             10k    0.04T
                                            step117850-tokens494B           118k   0.5T
                                            step358000-tokens1501B          358k   1.5T
                                            step447000-tokens1874B          447k   1.9T
                                            step561250-tokens2353B          561k   2.4T
                                            step738000-tokens3094B          738k   3.1T
OLMo-2-7B          allenai/OLMo-2-1124-7B   stage1-step19000-tokens80B      19k    0.08T
                                            stage1-step120000-tokens504B    120k   0.5T
                                            stage1-step441000-tokens1850B   441k   1.9T
                                            stage1-step584000-tokens2450B   584k   2.5T
                                            stage1-step727000-tokens3050B   727k   3.1T
                                            stage1-step928646-tokens3896B   929k   3.9T
LLM360-Amber (7B)  LLM360/Amber             ckpt 040                        40     0.12T
                                            ckpt 102                        102    0.31T
                                            ckpt 244                        244    0.75T
                                            ckpt 306                        306    0.94T
                                            ckpt 358                        358    1.1T
                                            ckpt 410                        410    1.3T

Table 1. Pre-trained models used in our experiments in Section 2.
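As a usage note, the intermediate checkpoints in Table 1 can be loaded directly from the Hugging Face Hub by passing the revision string from the table. A minimal sketch (standard transformers usage, not code released with the paper):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load an intermediate OLMo-1B checkpoint by its revision from Table 1.
model = AutoModelForCausalLM.from_pretrained(
    "allenai/OLMo-1B-hf",
    revision="step447000-tokens1874B",  # the 1.9T-token checkpoint
)
tokenizer = AutoTokenizer.from_pretrained("allenai/OLMo-1B-hf")
```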
C.2. Fine-tuning setup.

We fine-tune with two common post-training paradigms: instruction tuning and multimodal tuning. For instruction tuning, we use the following datasets.

Anthropic-HH (Bai et al., 2022). While Anthropic-HH is typically a dataset designed for preference tuning (the dataset includes both a chosen and a rejected response for each instruction), it can also be used as a standard instruction tuning dataset by treating the chosen response as the target. Anthropic-HH contains 180k instructions and responses.

TULU (Wang et al., 2023). We use version 1.0 of the TULU SFT mixture, which contains 490k instructions and responses. However, for compute efficiency, we only use a randomly selected 200k subset.

LLaVA (Liu et al., 2023a). We use the LLaVA visual instruction tuning framework to train multimodal models. The LLaVA framework involves two stages: first, fine-tuning an adapter between a vision model and a pre-trained language model, and then fine-tuning the entire model to follow instructions in the presence of images.

For instruction tuning, we use the standard SFT training algorithm with the hyperparameters shown in Table 2. In this table, we also present the hyperparameters we use with the LLaVA framework, using the defaults for all non-specified hyperparameters.

Dataset                                Batch size   Learning rates                             LR schedule   Warmup steps   Optimizer   Weight decay
Anthropic-HH                           256          1e-6, 5e-6, 1e-5, 5e-5, 8e-5, 1e-4, 2e-4   Cosine        20             AdamW       0
Alpaca                                 256          1e-6, 5e-6, 1e-5, 5e-5, 8e-5, 1e-4, 2e-4   Cosine        20             AdamW       0
TULU                                   256          1e-6, 5e-6, 1e-5, 5e-5, 8e-5, 1e-4, 2e-4   Cosine        20             AdamW       0
Visual (LLaVA) Stage 1 (Projector)     256          1e-3                                       Cosine        50             AdamW       0
Visual (LLaVA) Stage 2 (Inst. tuning)  256          8e-6, 1e-5, 2e-5, 4e-5, 1e-4               Cosine        40             AdamW       0

Table 2. Hyperparameters used for instruction tuning and LLaVA.

C.3. Evaluations

We evaluate the fine-tuned models in two settings: downstream evaluation tasks that are representative of the goal of fine-tuning, and generalist evaluation tasks that are representative of the model's overall language understanding and inference capabilities. For downstream evaluations, we use the following datasets.

AlpacaEval (Li et al., 2023b). To evaluate the downstream performance of instruction-tuned models, we use AlpacaEval, a benchmark for evaluating the quality of a model's response to an instruction. The AlpacaEval benchmark contains 20k instructions and measures the win rate of the fine-tuned model against a reference model. By default, AlpacaEval reports the win rate against GPT-4 responses. However, we evaluate models that are weak by comparison to GPT-4; if we compared against GPT-4, the win rate would be so low that it would be difficult to see differences between models. Thus, we compare against a weaker model. In particular, for each of our models, we use a reference model of the same architecture that was also fine-tuned on the same dataset; more specifically, we use the model trained with seed 0 and learning rate 1e-5. This means that AlpacaEval scores are not comparable across different graphs, as the reference generations differ for each model and dataset; additionally, the AlpacaEval score of the model trained with seed 0 and learning rate 1e-5 is 50% by definition. Overall, we adopt these choices to ensure that the reference generations are comparable to each model's output. We use LLaMA-3-70B-Instruct (Grattafiori et al., 2024) as an evaluator to determine the win rate.

VLM Score. To evaluate the downstream performance of our LLaVA models, we use an average of the following five standard vision-language benchmarks: MME (Fu et al., 2024), GQA (Hudson & Manning, 2019), AI2D (Kembhavi et al., 2016), POPE (Li et al., 2023c), and TextVQA (Singh et al., 2019). We report the average as the "VLM Score".

Generalist evaluations. To evaluate each language model's generalist capabilities, we consider a suite of ten commonly used LLM evaluation benchmarks that assess performance beyond the fine-tuning task, covering reasoning (ARC-Challenge and ARC-Easy (Clark et al., 2018)), commonsense (PIQA (Bisk et al., 2020), Winogrande (Sakaguchi et al., 2021)), natural language inference (BoolQ (Clark et al., 2019), COPA, SCIQ), and sentence completion (HellaSwag). For all of our evaluations, we report 5-shot performance.

D. Experimental Details from Section 3: Controlled Experiments

In this section, we provide additional experimental details for the controlled experiments presented in Section 3.

D.1. Pre-training and fine-tuning setup.

For our controlled experiments, we pre-train models using the OLMo codebase (Groeneveld et al., 2024b). We use µP parameterization for all of our experiments (Yang et al., 2022).

Pre-training. We train three model classes, OLMo-15M, OLMo-30M, and OLMo-90M, with 15M, 30M, and 90M non-embedding parameters, respectively, using the hyperparameters shown in Table 3. For each model, we train for token budgets in {4B, 8B, 16B, 32B, 64B, 128B} using the pre-tokenized high-quality C4 web data distributed by OLMo (OLMo et al., 2024). We train with 8x A100 GPUs.

Hyperparameter            OLMo-15M          OLMo-30M          OLMo-90M
Layers                    3                 6                 9
Heads                     3                 6                 9
Number of unique tokens   50304             50304             50304
Hidden dimension          192               384               576
Inner MLP dimension       768               1536              2304
Max context length        1024              1024              1024
Activation type           SwiGLU            SwiGLU            SwiGLU
Attention dropout         0.1               0.1               0.1
Residual dropout          0.1               0.1               0.1
Embedding dropout         0.1               0.1               0.1
Optimizer                 AdamW             AdamW             AdamW
Learning rate             0.0003            0.0003            0.0003
Beta1                     0.9               0.9               0.9
Beta2                     0.95              0.95              0.95
LR scheduler              Cosine            Cosine            Cosine
Warmup steps              10% of training   10% of training   10% of training
Weight decay              0.1               0.1               0.1
Batch size                256               256               256

Table 3. Pre-training hyperparameters used in our controlled experiments.
For each model, we anneal the learning rate to zero over the course of training, at the rate specified by the cosine learning rate scheduler.

Fine-tuning. For each of our controlled experiments, we fine-tune the pre-trained models on a series of downstream tasks of two types: classification and language modeling. These ten datasets are, for classification: SUBJ (Pang & Lee, 2004), BoolQ (Clark et al., 2019), MR (Conneau & Kiela, 2018), CR (Conneau & Kiela, 2018), RTE (Dagan et al., 2005), TREC (Voorhees & Tice, 2000), English Tweet sentiment (Maggie et al., 2020), and SIQA (Sap et al., 2019); and for language modeling: GSM8k (Cobbe et al., 2021) and Starcoder-Python (Li et al., 2023a). For Starcoder-Python, we use a 5k-example subset. To avoid confusion, note that although GSM8k is often evaluated as a math reasoning benchmark, we treat it as a language modeling task to evaluate how well the models can learn math-style text. We use the fine-tuning hyperparameters shown in Table 4.

Hyperparameter   Values
Learning rate    4e-6, 8e-6, 1e-5, 2e-5, 4e-5, 5e-5, 6e-5, 7e-5, 8e-5, 9e-5, 1e-4, 1.1e-4, 1.2e-4, 1.4e-4, 1.6e-4, 1.8e-4, 2e-4, 2.4e-4, 4e-4, 5e-4, 6e-4, 8e-4, 1e-3, 2e-3, 3e-3, 4e-3, 6e-3
Batch size       32, 64*, 256
LR scheduler     Cosine*, Constant
Optimizer        AdamW
Weight decay     0.0
Warmup steps     10% of training
Epochs           4

Table 4. Fine-tuning hyperparameters used in our controlled experiments. We tune over all specified learning rates. For the other hyperparameters, when multiple values are specified, the asterisk (*) indicates the default, which is used unless a different value is stated. We perform early stopping over the number of epochs.

Evaluation. For tuning, we use a held-out validation set from each dataset, but report scores on a separate held-out test set. To compute the perplexity for classification tasks, we compute a score for each class by measuring the length-normalized likelihood of the class, and then report the perplexity over the classes. For generative tasks, we use the standard language modeling loss. As a measure of generalist capability, we report the perplexity on a held-out set of C4 web data.
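For concreteness, the following is a minimal sketch of the class-scoring procedure just described. The `model` and `tokenizer` handles, the prompt format, and the tokenization alignment are illustrative assumptions, not the paper's released evaluation code.

```python
import torch
import torch.nn.functional as F

# Sketch of length-normalized class scoring for classification perplexity.
@torch.no_grad()
def class_perplexity(model, tokenizer, prompt, class_names, true_idx):
    scores = []
    prompt_len = tokenizer(prompt, return_tensors="pt").input_ids.shape[1]
    for name in class_names:
        ids = tokenizer(prompt + " " + name, return_tensors="pt").input_ids
        logits = model(ids).logits                    # (1, L, vocab)
        logp = F.log_softmax(logits[:, :-1], dim=-1)  # predicts tokens 1..L-1
        tok_logp = logp.gather(-1, ids[:, 1:, None]).squeeze(-1)
        cont = tok_logp[:, prompt_len - 1:]           # class-name tokens only
        scores.append(cont.mean().item())             # length-normalized score
    # Normalize the per-class scores, then report perplexity over the classes.
    class_logp = torch.log_softmax(torch.tensor(scores), dim=0)
    return torch.exp(-class_logp[true_idx])
```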
Using learning rate as a proxy for a fixed perturbation size. We report the distance between the pre-trained and fine-tuned model as a function of the learning rate for different token budgets in Figure 8. Recall, from Section 3, that we specified that the learning rate is an approximate proxy for the size of the perturbation applied to the model. We observe that the distance between the pre-trained and fine-tuned model is not exactly, but approximately, directly proportional to the learning rate and independent of the amount of pre-training. D.2. Gaussian perturbations. In this subsection, we outline the details concerning Gaussian perturbations applied during our experiments. In particular, we perturb each parameter by a random value sampled from a mean-zero Gaussian distribution and evaluate the degradation of pre-training perplexity in Section 3. Using an isotropic Gaussian perturbation, i.e., perturbing each parameter by the same amount, would discount differences in parameter magnitude across different layers. To account for this, we choose to scale the perturbation to each layer to be approximately proportional to the magnitude of the parameter in that layer however, we want the magnitude to be constant for different pre-training token budgets. Thus, we choose to normalize the magnitude of each perturbation to the same magnitude as the layer at initialization prior to pre-training. E. Connection Between Progressive Sensitivity and Sharpness In this section, we discuss the connection between our progressive sensitivity conjecture and the phenomenon known as progressive sharpening (Cohen et al., 2021) in greater detail. Progressive sharpening. This phenomenon refers to the empirical observation that over training with a fixed learning rate, the spectral norm 2L(θ) 2 of the Hessian of the loss function L at the parameters θ increases over time, at least early in training. In the case of of (full batch) gradient descent with a fixed learning rate η, 2L(θ) 2 specifically increases until it reaches 2/η, which is discussed in detail in Cohen et al. (2021). In addition to the spectral norm, other norms of the Hessian, such as the trace norm, also exhibit a similar behavior. Relationship between progressive sensitivity and progressive sharpening when loss is quadratic. As it turns out, progressive sensitivity and progressive sharpening are closely related specifically in the quadratic setting. In particular, consider a quadratic loss function L(θ) = 1 2θ Hθ + g θ + c, where θ Rd, H Rd d is a symmetric matrix, g Rd, and c R. We will look specifically at the sensitivity to a Gaussian perturbation (θ, λ) = E [L(θ + λε) L(θ)], where ε N(0, I) is a unit Gaussian vector. Proposition E.1. The sensitivity of L to a Gaussian perturbation is given by (θ, λ) = 1 Overtrained Language Models Are Harder to Fine-Tune 0.00 0.02 0.04 0.06 0.08 0.10 Perturbation strength ( ) Pre-training perplexity 32B 64B 128B 32B hessian approx 64B hessian approx 128B hessian approx 0.00 0.05 0.10 0.15 0.20 0.25 0.30 Perturbation strength ( ) Pre-training perplexity 32B 64B 128B 32B hessian approx 64B hessian approx 128B hessian approx Figure 9. Hessian approximation of the pre-training loss under a single interpolated Gaussian parameter perturbation. We randomly draw a Gaussian perturbation ε, and then compute the loss L(θ + λε), where λ is the scaling factor, for many different λ (extremely close to zero on the left, and with a wider range on the right). 
This proposition establishes that the sensitivity under a Gaussian perturbation is exactly determined by the Hessian when the loss function is quadratic. This connection holds, in general, whenever the loss function is well approximated by its second-order Taylor expansion, such as when $\lambda$ is small. In this regime, progressive sharpening and progressive sensitivity are closely related.

Progressive sharpness is not sufficient to explain degradation when $\lambda$ is large. We plot the empirical loss of three different OLMo-30M models (trained on 32B, 64B, and 128B tokens) under a Gaussian perturbation with perturbation strength $\lambda$, together with the second-order Taylor approximation, in Figure 9. In particular, we draw the perturbation $\varepsilon$ from the distribution described in Appendix D.2. We observe that while the loss is well approximated by the Hessian when $\lambda$ is small (left), the approximation breaks down when $\lambda$ is large (right), and the actual loss is substantially higher than the quadratic approximation.

Figure 9. Hessian approximation of the pre-training loss under a single interpolated Gaussian parameter perturbation. We randomly draw a Gaussian perturbation $\varepsilon$ and compute the loss $L(\theta + \lambda\varepsilon)$ for many different scaling factors $\lambda$ (extremely close to zero on the left, and over a wider range on the right), for models trained on 32B, 64B, and 128B tokens. We then compute the Hessian and use it to render the quadratic approximation of the loss.

Progressive sharpness is not a sufficient explanation for fine-tuning sensitivity. Similar to the Gaussian case, we consider the loss of three OLMo-30M models as they are interpolated between the base model and the model fine-tuned on ag_news in Figure 10. In this example, a perturbation strength of $\lambda = 0$ corresponds to the base model, while a perturbation strength of $\lambda = 1$ corresponds to the fine-tuned model. As in the Gaussian case, we observe that the loss is not well approximated by the Hessian when $\lambda$ is large, and the actual loss is substantially higher than the quadratic approximation (right).

Figure 10. Hessian approximation of the pre-training loss under an interpolated fine-tuning perturbation. We fine-tune each model on ag_news, yielding a fine-tuning perturbation $\varepsilon$, and then compute the loss $L(\theta + \lambda\varepsilon)$ for many different scaling factors $\lambda$ (extremely close to zero on the left, and over a wider range on the right), for models trained on 32B, 64B, and 128B tokens. We then compute the Hessian and use it to render the quadratic approximation of the loss.

Progressive sensitivity as a generalization of progressive sharpness. Our results highlight that, in addition to progressive sharpness, which specifically refers to a progressive increase in the eigenvalues of the Hessian of the loss function with training, there is a more global phenomenon where the loss becomes even more sensitive to perturbations than the quadratic approximation predicts.

F. Omitted Figures from Section 2: Large Model Experiments

In this section, we provide the omitted figures from Section 2 that show the results of the extended experiments with large models. Table 5 lists the table of contents for the omitted figures.
Dataset (Variant)                   OLMo-1B    OLMo-2-7B  LLM360-7B
Anthropic-HH (tuned learning rate)  Figure 11  Figure 13  Figure 15
Anthropic-HH (all learning rates)   Figure 12  Figure 14  Figure 16
TULU (tuned learning rate)          Figure 17  Figure 19  Figure 21
TULU (all learning rates)           Figure 18  Figure 20  Figure 22
VLM (tuned learning rate)           Figure 23  Figure 25  Figure 27
VLM (all learning rates)            Figure 24  Figure 26  Figure 28

Table 5. Figure references for each dataset (Anthropic-HH, TULU, VLM) and model (OLMo-1B, OLMo-2-7B, LLM360-7B), separated by learning rate tuning variant.

Figure 11. Evaluation of OLMo-1B post-trained on Anthropic-HH as a function of the number of pre-training tokens, with tuned learning rates. We report the scores on eight different datasets: AlpacaEval is the main evaluation of interest (corresponding to downstream performance), and the other datasets are considered out-of-distribution (corresponding to generalist performance). We use the intermediate checkpoints from Table 1 for the evaluation. We tune the learning rate for each checkpoint to maximize the main evaluation (AlpacaEval). This figure is analogous to Figure 2.

Figure 12. Evaluation of OLMo-1B post-trained on Anthropic-HH as a function of the number of pre-training tokens, for all learning rates. We report the scores on eight different datasets: AlpacaEval is the main evaluation of interest (corresponding to downstream performance), and the other datasets are considered out-of-distribution (corresponding to generalist performance). We use the intermediate checkpoints from Table 1 for the evaluation. We also compare to the base model (dashed line). This figure is similar to Figure 11, except we plot every learning rate, with each line representing a fixed learning rate.

Figure 13. Evaluation of OLMo-2-7B post-trained on Anthropic-HH as a function of the number of pre-training tokens, with tuned learning rates; same format as Figure 11.

Figure 14. Evaluation of OLMo-2-7B post-trained on Anthropic-HH as a function of the number of pre-training tokens, for all learning rates; same format as Figure 12.
Figure 15. Evaluation of LLM360-7B post-trained on Anthropic-HH as a function of the number of pre-training tokens, with tuned learning rates; same format as Figure 11.

Figure 16. Evaluation of LLM360-7B post-trained on Anthropic-HH as a function of the number of pre-training tokens, for all learning rates; same format as Figure 12.

Figure 17. Evaluation of OLMo-1B post-trained on TULU as a function of the number of pre-training tokens, with tuned learning rates; same format as Figure 11.

Figure 18. Evaluation of OLMo-1B post-trained on TULU as a function of the number of pre-training tokens, for all learning rates; same format as Figure 12.
Figure 19. Evaluation of OLMo-2-7B post-trained on TULU as a function of the number of pre-training tokens, with tuned learning rates; same format as Figure 11.

Figure 20. Evaluation of OLMo-2-7B post-trained on TULU as a function of the number of pre-training tokens, for all learning rates; same format as Figure 12.

Figure 21. Evaluation of LLM360-7B post-trained on TULU as a function of the number of pre-training tokens, with tuned learning rates; same format as Figure 11.

Figure 22. Evaluation of LLM360-7B post-trained on TULU as a function of the number of pre-training tokens, for all learning rates; same format as Figure 12.
Figure 23. Evaluation of OLMo-1B post-trained on VLM as a function of the number of pre-training tokens, with tuned learning rates. We report the scores on eight different datasets: VLM Score is considered the main evaluation of interest (corresponding to downstream performance), and the other datasets are considered out-of-distribution (corresponding to generalist performance). We use the intermediate checkpoints from Table 1 for the evaluation and tune the learning rate for each checkpoint to maximize the main evaluation (VLM Score). This figure is analogous to Figure 2.

Figure 24. Evaluation of OLMo-1B post-trained on VLM as a function of the number of pre-training tokens, for all learning rates. Same evaluation setup as Figure 23; we also compare to the base model (dashed line). This figure is similar to Figure 23, except that we plot every learning rate, with each line representing a fixed learning rate.

Figure 25. Evaluation of OLMo-2-7B post-trained on VLM as a function of the number of pre-training tokens, with tuned learning rates. Same evaluation setup as Figure 23; we tune the learning rate for each checkpoint to maximize the main evaluation (VLM Score). This figure is analogous to Figure 2.

Figure 26. Evaluation of OLMo-2-7B post-trained on VLM as a function of the number of pre-training tokens, for all learning rates. Same evaluation setup as Figure 23; we also compare to the base model (dashed line). This figure is similar to Figure 25, except that we plot every learning rate, with each line representing a fixed learning rate.
Figure 27. Evaluation of LLM360-7B post-trained on VLM as a function of the number of pre-training tokens, with tuned learning rates. Same evaluation setup as Figure 23; we tune the learning rate for each checkpoint to maximize the main evaluation (VLM Score). This figure is analogous to Figure 2.

Figure 28. Evaluation of LLM360-7B post-trained on VLM as a function of the number of pre-training tokens, for all learning rates. Same evaluation setup as Figure 23; we also compare to the base model (dashed line). This figure is similar to Figure 27, except that we plot every learning rate, with each line representing a fixed learning rate.

G. Omitted Figures from Section 3: Controlled Experiments

In this section, we provide the figures omitted from Section 3 that show the results of the extended controlled experiments.

G.1. Sensitivity

Figure 29. Sensitivity of fine-tuned models with fixed learning rate in our controlled setup. This figure is analogous to Figure 5 from the main paper, but plots the difference in perplexity between the fine-tuned model and the base model for OLMo-30M. It illustrates that sensitivity increases progressively throughout training.

To supplement Figure 6 from the main paper, Figure 29 plots the sensitivity of fine-tuned models with fixed learning rate in our controlled setup as a function of the number of pre-training tokens. Across all datasets, sensitivity progressively increases throughout training. Since this figure is sufficiently similar to Figure 6, we omit the corresponding sensitivity figures for the other settings we consider.
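Sensitivity, as plotted in Figure 29, is the gap in pre-training (C4) perplexity between a fine-tuned model and its base checkpoint. A minimal sketch of that measurement is below, assuming a causal LM with a Hugging-Face-style forward pass that returns `.logits`; the batching and data handling are simplified assumptions, not the paper's evaluation code.

```python
import math

import torch
import torch.nn.functional as F


@torch.no_grad()
def perplexity(model, batches):
    """Token-level perplexity over a list of input_ids LongTensor batches.

    Illustrative sketch: assumes an HF-style causal LM whose forward pass
    returns an object with a `.logits` field of shape (batch, seq, vocab).
    """
    total_nll, total_tokens = 0.0, 0
    for input_ids in batches:
        logits = model(input_ids).logits
        # Next-token prediction: shift logits and targets by one position.
        nll = F.cross_entropy(
            logits[:, :-1, :].flatten(0, 1),
            input_ids[:, 1:].flatten(),
            reduction="sum",
        )
        total_nll += nll.item()
        total_tokens += input_ids[:, 1:].numel()
    return math.exp(total_nll / total_tokens)


def sensitivity(base_model, finetuned_model, c4_batches):
    # The quantity plotted in Figure 29: how much fine-tuning degrades
    # perplexity on the pre-training (C4) distribution. `c4_batches`
    # should be a list, since it is iterated twice.
    return perplexity(finetuned_model, c4_batches) - perplexity(base_model, c4_batches)
```

Progressive sensitivity then corresponds to this gap growing as the base checkpoint is taken from later and later in pre-training, holding the fine-tuning learning rate fixed.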
G.2. Extended fine-tuning experiments.

We now present the extended fine-tuning experiments, ablating the batch size, the learning rate scheduler, and the model size. Table 6 provides a reference to the figures that show the results for each setting. Throughout, pre-training perplexity is measured on C4 web data; the fine-tuning tasks shown in the panels include subj, boolq, mr, cr, rte, tweet_sentiment_en, and starcoder-python-5k. In the fixed-learning-rate figures, each connected line reflects a series of models fine-tuned with a fixed learning rate, ranging from the minimum (4e-6) to the maximum (max_lr), with the base model shown for reference.

Setting                            Pre-training ppl  Fine-tuning ppl  Tuned pre-training ppl  Tuned fine-tuning ppl  Optimal LR
Batch size: 256                    Figure 30         Figure 31        Figure 32               Figure 33              Figure 34
Batch size: 32                     Figure 35         Figure 36        Figure 37               Figure 38              Figure 39
LR schedule: constant              Figure 40         Figure 41        Figure 42               Figure 43              Figure 44
LR schedule: constant with warmup  Figure 45         Figure 46        Figure 47               Figure 48              Figure 49
OLMo-15M                           Figure 50         Figure 51        Figure 52               Figure 53              Figure 54
OLMo-30M (extended)                Figure 55         Figure 56        Figure 57               Figure 58              Figure 59
OLMo-90M                           Figure 60         Figure 61        Figure 62               Figure 63              Figure 64

Table 6. Table of contents for the extended experimental settings, referencing the figures that show the results of the extended controlled experiments.

Figure 30. Pre-training perplexity after fine-tuning as a function of the pre-training budget, using the configuration specified in Table 3 but with batch size 256 for the OLMo-30M model. Each connected line reflects a series of models trained with fixed hyperparameters.

Figure 31. Fine-tuning perplexity after fine-tuning as a function of the pre-training budget, using the configuration specified in Table 3 but with batch size 256 for the OLMo-30M model. Each connected line reflects a series of models trained with fixed hyperparameters.

Figure 32. Pre-training perplexity after fine-tuning as a function of the pre-training budget, with the learning rate tuned to optimize fine-tuning performance, using the configuration specified in Table 3 but with batch size 256 for the OLMo-30M model. Similar to the untuned version but with the fine-tuning-optimal learning rate.

Figure 33. Fine-tuning perplexity after fine-tuning as a function of the pre-training budget, with the learning rate tuned to optimize fine-tuning performance, using the configuration specified in Table 3 but with batch size 256 for the OLMo-30M model. Similar to the untuned version but with the fine-tuning-optimal learning rate.
Figure 34. The optimal learning rate for the best fine-tuning performance as a function of the pre-training budget, using the configuration specified in Table 3 but with batch size 256 for the OLMo-30M model. The learning rates shown correspond to those chosen in Figures 32 and 33.

Figure 35. Pre-training perplexity after fine-tuning as a function of the pre-training budget, using the configuration specified in Table 3 but with batch size 32 for the OLMo-30M model. Each connected line reflects a series of models trained with fixed hyperparameters.

Figure 36. Fine-tuning perplexity after fine-tuning as a function of the pre-training budget, using the configuration specified in Table 3 but with batch size 32 for the OLMo-30M model. Each connected line reflects a series of models trained with fixed hyperparameters.

Figure 37. Pre-training perplexity after fine-tuning as a function of the pre-training budget, with the learning rate tuned to optimize fine-tuning performance, using the configuration specified in Table 3 but with batch size 32 for the OLMo-30M model. Similar to the untuned version but with the fine-tuning-optimal learning rate.

Figure 38. Fine-tuning perplexity after fine-tuning as a function of the pre-training budget, with the learning rate tuned to optimize fine-tuning performance, using the configuration specified in Table 3 but with batch size 32 for the OLMo-30M model. Similar to the untuned version but with the fine-tuning-optimal learning rate.

Figure 39. The optimal learning rate for the best fine-tuning performance as a function of the pre-training budget, using the configuration specified in Table 3 but with batch size 32 for the OLMo-30M model. The learning rates shown correspond to those chosen in Figures 37 and 38.
Figure 40. Pre-training perplexity after fine-tuning as a function of the pre-training budget, using a constant learning rate scheduler (instead of cosine) with the configuration specified in Table 3 for the OLMo-30M model. Each connected line reflects a series of models trained with fixed hyperparameters.

Figure 41. Fine-tuning perplexity after fine-tuning as a function of the pre-training budget, using a constant learning rate scheduler (instead of cosine) with the configuration specified in Table 3 for the OLMo-30M model. Each connected line reflects a series of models trained with fixed hyperparameters.

Figure 42. Pre-training perplexity after fine-tuning as a function of the pre-training budget, with the learning rate tuned to optimize fine-tuning performance, using a constant learning rate scheduler for the OLMo-30M model. Similar to the untuned version but with the fine-tuning-optimal learning rate.

Figure 43. Fine-tuning perplexity after fine-tuning as a function of the pre-training budget, with the learning rate tuned to optimize fine-tuning performance, using a constant learning rate scheduler for the OLMo-30M model. Similar to the untuned version but with the fine-tuning-optimal learning rate.

Figure 44. The optimal learning rate for the best fine-tuning performance as a function of the pre-training budget, using a constant learning rate scheduler for the OLMo-30M model. The learning rates shown correspond to those chosen in Figures 42 and 43.

Figure 45. Pre-training perplexity after fine-tuning as a function of the pre-training budget, using a constant learning rate scheduler with warmup and the configuration specified in Table 3 for the OLMo-30M model. Each connected line reflects a series of models trained with fixed hyperparameters.
Figure 46. Fine-tuning perplexity after fine-tuning as a function of the pre-training budget, using a constant learning rate scheduler with warmup and the configuration specified in Table 3 for the OLMo-30M model. Each connected line reflects a series of models trained with fixed hyperparameters.

Figure 47. Pre-training perplexity after fine-tuning as a function of the pre-training budget, with the learning rate tuned to optimize fine-tuning performance, using a constant learning rate scheduler with warmup for the OLMo-30M model. Similar to the untuned version but with the fine-tuning-optimal learning rate.

Figure 48. Fine-tuning perplexity after fine-tuning as a function of the pre-training budget, with the learning rate tuned to optimize fine-tuning performance, using a constant learning rate scheduler with warmup for the OLMo-30M model. Similar to the untuned version but with the fine-tuning-optimal learning rate.

Figure 49. The optimal learning rate for the best fine-tuning performance as a function of the pre-training budget, using a constant learning rate scheduler with warmup for the OLMo-30M model. The learning rates shown correspond to those chosen in Figures 47 and 48.

Figure 50. Pre-training perplexity after fine-tuning as a function of the pre-training budget, using the configuration specified in Table 3 for OLMo-15M. Each connected line reflects a series of models trained with fixed hyperparameters. Analogous to Figure 5 (top) from the main paper.
Figure 51. Fine-tuning perplexity after fine-tuning as a function of the pre-training budget, using the configuration specified in Table 3 for OLMo-15M. Each connected line reflects a series of models trained with fixed hyperparameters. Analogous to Figure 5 (bottom) from the main paper.

Figure 52. Pre-training perplexity after fine-tuning as a function of the pre-training budget, with the learning rate tuned to optimize fine-tuning performance, using the configuration specified in Table 3 for OLMo-15M. Similar to the untuned version but with the fine-tuning-optimal learning rate; analogous to Figure 6 (bottom) from the main paper.

Figure 53. Fine-tuning perplexity after fine-tuning as a function of the pre-training budget, with the learning rate tuned to optimize fine-tuning performance, using the configuration specified in Table 3 for OLMo-15M. Similar to the untuned version but with the fine-tuning-optimal learning rate; analogous to Figure 6 (top) from the main paper.

Figure 54. The optimal learning rate for the best fine-tuning performance as a function of the pre-training budget, using the configuration specified in Table 3 for OLMo-15M. The learning rates shown correspond to those chosen in Figures 52 and 53.

Figure 55. Pre-training perplexity after fine-tuning as a function of the pre-training budget, using the configuration specified in Table 3 for OLMo-30M. Each connected line reflects a series of models trained with fixed hyperparameters. Extended version of Figure 5 (top) from the main paper.
Figure 56. Fine-tuning perplexity after fine-tuning as a function of the pre-training budget, using the configuration specified in Table 3 for OLMo-30M. Each connected line reflects a series of models trained with fixed hyperparameters. Extended version of Figure 5 (bottom) from the main paper.

Figure 57. Pre-training perplexity after fine-tuning as a function of the pre-training budget, with the learning rate tuned to optimize fine-tuning performance, using the configuration specified in Table 3 for OLMo-30M. Similar to the untuned version but with the fine-tuning-optimal learning rate; this figure is analogous to, and extends, Figure 6 (bottom) from the main paper.

Figure 58. Fine-tuning perplexity after fine-tuning as a function of the pre-training budget, with the learning rate tuned to optimize fine-tuning performance, using the configuration specified in Table 3 for OLMo-30M. Similar to the untuned version but with the fine-tuning-optimal learning rate; this figure is analogous to, and extends, Figure 6 (top) from the main paper.

Figure 59. The optimal learning rate for the best fine-tuning performance as a function of the pre-training budget, using the configuration specified in Table 3 for OLMo-30M. The learning rates shown correspond to those chosen in Figures 57 and 58.

Figure 60. Pre-training perplexity after fine-tuning as a function of the pre-training budget, using the configuration specified in Table 3 for OLMo-90M. Each connected line reflects a series of models trained with fixed hyperparameters. Analogous to Figure 5 (top) from the main paper.
Figure 61. Fine-tuning perplexity after fine-tuning as a function of the pre-training budget, using the configuration specified in Table 3 for OLMo-90M. Each connected line reflects a series of models trained with fixed hyperparameters. Analogous to Figure 5 (bottom) from the main paper.

Figure 62. Pre-training perplexity after fine-tuning as a function of the pre-training budget, with the learning rate tuned to optimize fine-tuning performance, using the configuration specified in Table 3 for OLMo-90M. Similar to the untuned version but with the fine-tuning-optimal learning rate; analogous to Figure 6 (bottom) from the main paper.

Figure 63. Fine-tuning perplexity after fine-tuning as a function of the pre-training budget, with the learning rate tuned to optimize fine-tuning performance, using the configuration specified in Table 3 for OLMo-90M. Similar to the untuned version but with the fine-tuning-optimal learning rate; analogous to Figure 6 (top) from the main paper.

Figure 64. The optimal learning rate for the best fine-tuning performance as a function of the pre-training budget, using the configuration specified in Table 3 for OLMo-90M. The learning rates shown correspond to those chosen in Figures 62 and 63.

Figure 65. Pre-training perplexity of models with parameters perturbed by Gaussian noise, as a function of the number of pre-training tokens. We report the C4 web-data perplexity of models in which each parameter is perturbed by Gaussian noise scaled by the factor λ (color); the λ values shown range from 0.033 to 0.8, depending on the model. This figure is an extension of Figure 3 to additional models: OLMo-15M, OLMo-90M, OLMo-1B, OLMo-2-7B, and LLM360-Amber (7B).

G.3. Extended Gaussian perturbations experiments.

Here, we present extended experiments with Gaussian perturbations on additional models: OLMo-15M, OLMo-90M, OLMo-1B, OLMo-2-7B, and LLM360-Amber (7B). We perturb each parameter by Gaussian noise scaled by the factor λ.
Figure 65 shows the pre-training perplexity of models with parameters perturbed by Gaussian noise as a function of the number of pre-training tokens. Refer to Appendix D for more details on the experimental setup.
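For concreteness, a minimal sketch of this perturbation is below, assuming the scale λ multiplies unit Gaussian noise added directly to each parameter; the exact scaling convention used in the paper is specified in Appendix D, so treat this as illustrative rather than the actual experimental code.

```python
import torch


@torch.no_grad()
def gaussian_perturb(model: torch.nn.Module, lam: float) -> None:
    """Perturb every parameter in place: theta <- theta + lam * eps, eps ~ N(0, I).

    Minimal sketch of the Gaussian-perturbation experiment. The absolute
    scaling used here (lam times unit noise) is our assumption; any
    per-parameter or relative scaling would follow the setup in Appendix D.
    """
    for p in model.parameters():
        p.add_(lam * torch.randn_like(p))
```

A measurement like Figure 65 would then, for each intermediate checkpoint and each λ in the legend, perturb a fresh copy of the model and evaluate its C4 perplexity, so that the growth of the perturbed-model gap over pre-training reflects progressive sensitivity.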