A Dynamical Model of Neural Scaling Laws

Blake Bordelon 1 2   Alexander Atanasov 3 2   Cengiz Pehlevan 1 2

Abstract

On a variety of tasks, the performance of neural networks predictably improves with training time, dataset size and model size across many orders of magnitude. This phenomenon is known as a neural scaling law. Of fundamental importance is the compute-optimal scaling law, which reports the performance as a function of units of compute when choosing model sizes optimally. We analyze a random feature model trained with gradient descent as a solvable model of network training and generalization. This reproduces many observations about neural scaling laws. First, our model makes a prediction about why the scaling of performance with training time and with model size have different power law exponents. Consequently, the theory predicts an asymmetric compute-optimal scaling rule where the number of training steps is increased faster than model parameters, consistent with recent empirical observations. Second, it has been observed that early in training, networks converge to their infinite-width dynamics at a rate $1/\text{width}$ but at late time exhibit a rate $\text{width}^{-c}$, where $c$ depends on the structure of the architecture and task. We show that our model exhibits this behavior. Lastly, our theory shows how the gap between training and test loss can gradually build up over time due to repeated reuse of data.

1. Introduction

Large scale language and vision models have been shown to achieve better performance as the number of parameters and number of training steps are increased. Moreover, the scaling of various loss metrics (such as cross-entropy or MSE test loss) has been empirically observed to exhibit remarkably regular, often power law behavior across several orders of magnitude (Hestness et al., 2017; Kaplan et al., 2020). These findings are termed neural scaling laws.

1 SEAS, Harvard University  2 Kempner Institute, Harvard University  3 Department of Physics, Harvard University. Correspondence to: Cengiz Pehlevan.

Proceedings of the 41st International Conference on Machine Learning, Vienna, Austria. PMLR 235, 2024. Copyright 2024 by the author(s).

Neural scaling laws play a central role in modern deep learning practice, and have substantial implications for the optimal trade-off between model size and training time (Hoffmann et al., 2022), as well as architecture selection (Alabdulmohsin et al., 2023). Understanding the origin of such scaling laws, as well as their exponents, has the potential to offer insight into better architectures, the design of better datasets (Sorscher et al., 2022), and the failure modes and limitations of deep learning systems. Yet, many questions about neural scaling laws remain open.

In this paper, we introduce and analyze a solvable model which captures many important aspects of neural scaling laws. In particular, we are interested in understanding the following empirically observed phenomena:

Test Loss Scales as a Power Law in Training Time, Model Size, and Compute. In many domains of deep learning, the test loss of a model with $N$ trainable parameters trained for $t$ iterations has been found to scale as $L(t, N) \approx L_0 + a_t t^{-r_t} + a_N N^{-r_N}$ (Kaplan et al., 2020; Hoffmann et al., 2022). These scaling law exponents $r_t, r_N$ generally depend on the dataset and architecture. We demonstrate scaling laws on simple vision and language tasks in Figure 1.
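As a concrete, purely illustrative example of fitting this functional form, the sketch below generates a hypothetical grid of $(t, N)$ loss measurements and fits $L_0 + a_t t^{-r_t} + a_N N^{-r_N}$ with SciPy. The synthetic exponents, the helper name `scaling_law`, and all hyperparameters are assumptions made for the illustration, not values from the paper.

```python
import numpy as np
from scipy.optimize import curve_fit

def scaling_law(tN, L0, a_t, r_t, a_N, r_N):
    """Broken power law L(t, N) = L0 + a_t * t**(-r_t) + a_N * N**(-r_N)."""
    t, N = tN
    return L0 + a_t * t ** (-r_t) + a_N * N ** (-r_N)

# Hypothetical measurements of test loss on a grid of (training steps, model size).
t_grid, N_grid = np.meshgrid(np.logspace(2, 5, 12), 2.0 ** np.arange(4, 11))
t, N = t_grid.ravel(), N_grid.ravel()
rng = np.random.default_rng(0)
loss = 0.1 + 2.0 * t ** (-0.4) + 1.5 * N ** (-0.8) + 0.005 * rng.standard_normal(t.size)

# Fit all five parameters; the bounds keep amplitudes and exponents non-negative.
popt, _ = curve_fit(scaling_law, (t, N), loss,
                    p0=[0.1, 1.0, 0.5, 1.0, 0.5], bounds=(0.0, np.inf))
L0, a_t, r_t, a_N, r_N = popt
print(f"L0={L0:.3f}  r_t={r_t:.2f}  r_N={r_N:.2f}")
```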
The compute is proportional to the number of steps of gradient descent times the model size, $C \propto Nt$. Setting $N$ and $t$ optimally gives that the test loss scales as a power law in $C$. This is the compute-optimal scaling law.

Compute-Optimal Training Time and Model Size Scaling Exponents Are Different. A discrepancy in the exponents $r_t$ and $r_N$ is usually observed to some degree, depending on the data distribution and architecture (Hoffmann et al., 2022; Bachmann et al., 2024). The gap between exponents would lead to asymmetric compute-optimal scaling of parameters. For compute budget $C$, model size should scale as $N \sim C^{c_1}$ and training time as $t \sim C^{c_2}$ with $c_2 > c_1$. This difference in exponents led to a change in the scaling rule for large language models, generating large performance gains.

Larger Models Train Faster. Provided feature learning is held constant across model scales (i.e. adopting mean-field or µP scaling), wider networks tend to train faster (Yang et al., 2021) (Figure 1). If training proceeds in an online/one-pass setting where datapoints are not repeated, then the wider models will also obtain lower test loss at an equal number of iterations. This observation has been found to hold both in overparameterized and underparameterized regimes (Bordelon & Pehlevan, 2023; Vyas et al., 2023).

[Figure 1 panels: (a) One-Pass CIFAR-5M Test Dynamics, legend fit $t^{-0.38} + N^{-0.81}$, widths $N = 16$ to $1024$; (b) CIFAR-5M Compute Scaling; (c) Wikitext-100M Train and Test, legend fit $t^{-0.16} + N^{-0.76}$, widths $N = 128$ to $4096$; (d) Wikitext-5M Train and Test.]
Figure 1. Train and test losses (cross-entropy) as a function of training time $t$ and width $N$. For models trained online, we do not make a distinction between training and test error because each new batch is drawn fresh and would have the same loss in expectation as an independent test set. (a) The test loss of a residual CNN on CIFAR-5M is well described by a fit of the form $L \approx t^{-\alpha} + N^{-\beta}$ in the online training regime. (b) The compute-optimal strategy requires scaling up both model size and training time simultaneously. (c) Transformer training on Wikitext with 100M tokens before data repetition. Model performance is monotonic in width $N$. (d) Wikitext with 5M subsampled tokens. Larger width $N$ is not always better, as wider models can overfit.

Models Accumulate Finite-Dataset and Finite-Width Corrections. Early training can be well described by the learning curves for stochastic gradient descent without reuse of samples (termed the online/ideal limiting dynamics); however, over time the effect of reusing data accumulates and leads to worse test performance (Nakkiran et al., 2021b; Mignacco et al., 2020; Ghosh et al., 2022; Muennighoff et al., 2023). Similarly, the gaps in model performance across various model sizes also grow with training time (Yang et al., 2021; Vyas et al., 2023). Figure 1 (d) shows overfitting and a reversal of the wider-is-better phenomenon due to data reuse.

Scaling Exponents are Task-Dependent at Late Training Time, but not at Early Time. Prior works (Dyer & Gur-Ari, 2020; Atanasov et al., 2023; Roberts et al., 2022; Bordelon & Pehlevan, 2023) predict early-time finite-width loss corrections that go as $1/\text{width}$ near the infinite-width limit in either lazy or feature-learning regimes.
Bahri et al. (2021) provide experiments demonstrating the $1/\text{width}$ convergence. However, finite-width models trained for a long time exhibit non-trivial exponents with respect to model width (Kaplan et al., 2020; Vyas et al., 2023). See Figure 1 for examples of nontrivial scalings at late time on CIFAR-5M and Wikitext.

Ensembling is Not the Same as Going Wider. Near the limit of infinite width, finite models can be thought of as noisy approximations of the infinite-width model, with noise that can be eliminated through ensembling (Dyer & Gur-Ari, 2020; Geiger et al., 2020; Atanasov et al., 2023). However, recent experiments (Vyas et al., 2023) indicate that ensembling is not enough to match the performance of larger models.

These phenomena are not unique to deep networks, but can be observed in linear models, or linearized neural networks operating in the lazy/kernel regime. Though this regime does not capture feature learning, it has the benefit of analytical tractability. In this paper, we focus on such linearized models to gain insight into the dynamics of training. To explain these phenomena, we develop a mathematically tractable model of neural scaling laws which allows one to simultaneously vary time, model size, and dataset size. Our contributions are as follows:

1. We analyze the learning dynamics of a structured and randomly projected linear model trained with gradient flow, discrete-time SGD, and momentum. In an asymptotic limit of the model, we obtain a dynamical mean field theory (DMFT) description of the learning curve in terms of correlation functions, which measure the cross-time correlation of training and test errors, and response functions, which measure the sensitivity of the dynamics to small perturbations.
2. We solve for the response functions exactly in the Fourier domain. This solution reveals faster training for larger models. The low-frequency range of these functions allows us to extract the long-time limit of the loss.
3. We show that the model and data corrections to the dynamics accumulate over time. At early time, each of these corrections has a universal scaling, consistent with prior works (Bahri et al., 2021).
4. For power-law structured features, we show that the model exhibits power law scaling of the test loss with time, model size and dataset size. While the data and model exponents are the same, the time and model exponents are different in general. We show that this gives rise to an asymmetric compute-optimal scaling strategy where training time increases faster than model size.
5. Our theory explains why ensembling is not compute optimal, as it gives less benefit to performance than an increase in model size.
6. We observe in Section 5.1 that feature learning networks can obtain better power law scalings, leading to a better compute-optimal frontier. We empirically study this phenomenon in Appendix L.

1.1. Related Works

The learning curves for linear models with structured (non-isotropic) covariates, including infinite-width kernel regression, have been computed using tools from statistical physics and random matrix theory (Bordelon et al., 2020; Spigler et al., 2020; Canatar et al., 2021; Simon et al., 2021; Bahri et al., 2021; Hastie et al., 2022). Mei & Montanari (2022) analyzed a linear model with random projections of isotropic covariates. There, they study the limiting effects of width and dataset size, and observe model-wise and sample-wise double descent.
In Adlam & Pennington (2020a), a related model is used to study the finite-width neural tangent kernel (NTK) (Jacot et al., 2018) of a given network. Further, d'Ascoli et al. (2020) and Adlam & Pennington (2020b) extend this analysis to understand the different sources of variance in the predictions of random feature models and the effect of ensembling and bagging on the test loss. Other works have extended this to models where an additional untrained projection is applied to the structured covariates (Loureiro et al., 2021; 2022; Zavatone-Veth et al., 2022; Atanasov et al., 2023; Maloney et al., 2022; Zavatone-Veth & Pehlevan, 2023; Ruben & Pehlevan, 2023; Simon et al., 2023). Within this literature, which considered fully trained models, the works of Bordelon et al. (2020) and Spigler et al. (2020) derived power-law decay rates for power-law features, which were termed resolution-limited by Bahri et al. (2021) and recovered by Maloney et al. (2022). However, we also study the dependence on training time. The $t \to \infty$ limit of our DMFT equations recovers the final losses computed in these prior works. While these prior works find that the scaling exponents for model size and dataset size are the same, we find that the test loss scales with a different exponent with training time, leading to a different (model- and task-dependent) compute-optimal scaling strategy.

DMFT methods have been used to analyze the test loss dynamics for general linear and spiked tensor models trained with high-dimensional random data (Mannelli et al., 2019; Mignacco et al., 2020; Mignacco & Urbani, 2022) and deep network dynamics with random initialization (Bordelon & Pehlevan, 2022b; Bordelon et al., 2023). High-dimensional limits of SGD have been analyzed with Volterra integral equations in the offline case (Paquette et al., 2021) or with recursive matrix equations in the online case (Varre et al., 2021; Bordelon & Pehlevan, 2022a). Random matrix approaches have also been used to study test loss dynamics in linear regression with isotropic covariates by Advani et al. (2020) and for random feature models by Bodin & Macris (2021). In this work, we consider averaging over both the disorder in the sampled dataset and the random projection of the features simultaneously using DMFT.

Other models and hypotheses for scaling laws instead rely on a discrete collection of subtasks or skills which are learned as compute grows (Caballero et al., 2022; Arora & Goyal, 2023; Michaud et al., 2023). Our theory instead focuses on spectral components of a data distribution.

2. Setup of the Model

We consider a teacher-student setting, where data sampled from a generative teacher model is used to train a student random feature model. The teacher and student models mismatch in a particular way that will be described below. This mismatch is the key ingredient that leads to most of the phenomena that we will discuss.

Teacher Model. Take $x \in \mathbb{R}^D$ to be drawn from a distribution $p(x)$ with a target function $y(x)$ expressible in terms of base features $\boldsymbol{\psi}(x) \in \mathbb{R}^M$ up to noise:

$$y(x) = \frac{1}{\sqrt{M}}\, \mathbf{w}^{*\top} \boldsymbol{\psi}(x) + \sigma\, \epsilon(x). \quad (1)$$

Here $\boldsymbol{\psi}(x)$ play the role of the infinite-width NTK eigenfunctions, which form a complete basis for square-integrable functions $L^2[p]$. The $\epsilon(x)$ function describes a component of $y$ which is uncorrelated with $\boldsymbol{\psi}(x)$. We work in the eigenbasis of features as in Bordelon et al. (2020), so that the covariance is given by

$$\left\langle \psi_k(x)\, \psi_\ell(x) \right\rangle_{x \sim p(x)} = \delta_{k\ell}\, \lambda_k. \quad (2)$$
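As a minimal numerical illustration of Equations (1)-(2) (our own sketch; the Gaussian feature distribution and the power-law spectrum are choices borrowed from Section 4, not requirements of the definition), one can sample base features with the prescribed diagonal covariance and verify it empirically:

```python
import numpy as np

rng = np.random.default_rng(0)
M, n_samples, sigma = 500, 20000, 0.1
lam = np.arange(1, M + 1) ** (-1.25)        # lambda_k ~ k^{-b}
w_star = rng.standard_normal(M)             # teacher weights

# Gaussian base features with <psi_k psi_l> = delta_{kl} * lambda_k, Eq. (2)
Psi = rng.standard_normal((n_samples, M)) * np.sqrt(lam)
emp_cov = Psi.T @ Psi / n_samples
print("max off-diagonal entry:", np.abs(emp_cov - np.diag(np.diag(emp_cov))).max())
print("diagonal matches lambda:", np.allclose(np.diag(emp_cov), lam, atol=0.05))

# Noisy targets, Eq. (1)
y = Psi @ w_star / np.sqrt(M) + sigma * rng.standard_normal(n_samples)
```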
The power-law structure in the $\lambda_k$ and $w^*_k$ entries will lead to power-law scalings for the test loss and related quantities.

Student Model. Our student model is motivated by a scenario where a randomly initialized finite-width network is trained in the linearized or lazy regime (Chizat et al., 2019; Jacot et al., 2018). Such training can be described through learning linear combinations of the finite-width NTK features. These features will span a lower-dimensional subspace of the space of square-integrable functions, and relate to infinite-width NTK features in a complicated way. To model this key aspect, the student model uses a projection of the $\boldsymbol{\psi}(x)$ features, $A\boldsymbol{\psi}(x)$, where $A \in \mathbb{R}^{N \times M}$. These projected features represent the finite-width (i.e. empirical) NTK eigenfunctions. This is motivated by the fact that the finite-width kernel's features can be linearly expanded in the basis of infinite-width features, because the infinite-width kernel eigenfunctions are complete. Our learned function then has the form

$$f(x) = \frac{1}{\sqrt{NM}}\, \mathbf{w}^{\top} A\, \boldsymbol{\psi}(x). \quad (3)$$

Here, we will interpret $N$ as the model size, with the $N \to \infty$ limit recovering the original kernel. Similar models were studied in Maloney et al. (2022) and Atanasov et al. (2023). We will focus on the setting where the elements of $A$ are drawn i.i.d. from a distribution of mean zero and variance one. See Appendix B for details on the technical assumptions. The motivations for this choice are (1) tractability and (2) it satisfies the constraint that as $N \to \infty$ the student's kernel approaches the infinite-width kernel defined by $\boldsymbol{\psi}$. In more realistic settings, such as when projecting the eigenfunctions of an infinite-width NTK to a finite-width NTK, the form of the $A$ matrix is generally not known.

Training. The model is trained on a random dataset $\mathcal{D} = \{x^\mu, y^\mu\}_{\mu=1}^{P}$ of size $P$ with gradient flow on the MSE loss,

$$\frac{d}{dt}\mathbf{w}(t) \propto \frac{1}{P}\sum_{\mu=1}^{P}\left(y(x^\mu) - f(x^\mu, t)\right) A\, \boldsymbol{\psi}(x^\mu). \quad (4)$$

We explore extensions (momentum, discrete time, one-pass SGD) in Appendix K. We track the test and train loss

$$L(t) = \mathbb{E}_x\left[(f(x,t) - y(x))^2\right], \qquad \hat{L}(t) = \frac{1}{P}\sum_{\mu=1}^{P}\left(f^\mu(t) - y^\mu\right)^2. \quad (5)$$

In small-size systems, these losses depend on the precise realization of the data $\mathcal{D}$ and matrix $A$. These two quantities can be viewed as the disorder. For large systems, these losses approach a well-defined limit independent of the specific realization of $\mathcal{D}, A$. We will use this fact in the next section when analyzing the model.

3. DMFT for Scaling Laws

We next describe a theoretical approach for characterizing the learning curves for this model. The full details of this approach are given in Appendices A and B. We derive a mean field theory for $M, N, P$ large. We analyze both (1) the proportional regime where $N/M = \nu$ and $P/M = \alpha$ are held fixed as $M, N, P \to \infty$, and (2) the non-proportional regime where $M \to \infty$ first and $N, P \gg 1$. The theories derived in these limits are structurally identical (App. G).¹

Let $\Psi \in \mathbb{R}^{P \times M}$ with $\Psi_{\mu k} = \psi_k(x^\mu)$, and define $\Lambda_{ij} = \lambda_i \delta_{ij}$. The discrepancy between the target weights and the model's effective weights is

$$\mathbf{v}^0(t) = \mathbf{w}^* - \frac{1}{\sqrt{N}} A^\top \mathbf{w}(t). \quad (6)$$

The test loss is then given by $L(t) = \frac{1}{M}\sum_k \lambda_k\, v^0_k(t)^2$. The $\mathbf{v}^0$ vector has the following dynamics:

$$\frac{d}{dt}\mathbf{v}^0(t) = -\frac{1}{N} A^\top A\, \frac{1}{P} \Psi^\top \Psi\, \mathbf{v}^0(t). \quad (7)$$

Already, we can see that generalization can be limited if $A^\top A$ or $\Psi^\top \Psi$ are low rank, as the dynamics will be frozen in the nullspace of $\frac{1}{P}\Psi^\top \Psi$. Using DMFT, we characterize this limit by tracking $\mathbf{v}^0$ together with the following random vectors:

$$\mathbf{v}^1(t) = \frac{1}{\sqrt{M}} \Psi\, \mathbf{v}^0(t), \quad \mathbf{v}^2(t) = \frac{1}{\alpha \sqrt{M}} \Psi^\top \mathbf{v}^1(t), \quad \mathbf{v}^3(t) = \frac{1}{\sqrt{M}} A\, \mathbf{v}^2(t), \quad \mathbf{v}^4(t) = \frac{1}{\nu \sqrt{M}} A^\top \mathbf{v}^3(t). \quad (8)$$

¹ While the proportional limit is exact, the finite-size $N, P$ theory will also contain fluctuations across realizations of disorder. When relevant, we show these in experiments by plotting standard deviations over draws of data and projection matrices $A$. This variance decays as $O(1/P + 1/N)$.
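Before introducing the DMFT order parameters, it may help to see the model simulated directly. The sketch below (ours, not code from the paper) draws power-law Gaussian features, a random projection $A$, and runs discrete-time gradient descent on the MSE loss, tracking the train loss and the test loss $L(t) = \frac{1}{M}\sum_k \lambda_k v^0_k(t)^2 + \sigma^2$. The step size, its $\sqrt{M/N}$ rescaling (chosen so that the $\mathbf{v}^0$ update matches the scale of Equation (7)), and all sizes are assumptions for the demo.

```python
import numpy as np

rng = np.random.default_rng(0)
M, N, P = 1000, 256, 512
a, b, sigma = 1.5, 1.25, 0.0
k = np.arange(1, M + 1)
lam = k ** (-b)
w_star = rng.standard_normal(M) * np.sqrt(k ** (-a) / lam)   # (w*_k)^2 lam_k ~ k^{-a}

Psi = rng.standard_normal((P, M)) * np.sqrt(lam)             # Psi_{mu k} ~ N(0, lam_k)
A = rng.standard_normal((N, M))                              # random projection
y = Psi @ w_star / np.sqrt(M) + sigma * rng.standard_normal(P)

w = np.zeros(N)
eta, steps = 0.02, 5000
train_hist, test_hist = [], []
for t in range(steps):
    resid = y - Psi @ (A.T @ w) / np.sqrt(N * M)             # train residuals
    if t % 50 == 0:
        v0 = w_star - A.T @ w / np.sqrt(N)                   # Eq. (6)
        train_hist.append(np.mean(resid ** 2))
        test_hist.append(np.sum(lam * v0 ** 2) / M + sigma ** 2)
    # gradient step; sqrt(M/N) rescaling so v0 follows dv0/dt = -(1/NP) A^T A Psi^T Psi v0
    w += (eta / P) * np.sqrt(M / N) * (A @ (Psi.T @ resid))

print(f"train {train_hist[-1]:.5f}  test {test_hist[-1]:.5f}")
```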
The key summary statistics (also called order parameters) are the correlation functions

$$C_0(t,s) = \frac{1}{M}\, \mathbf{v}^0(t)^\top \Lambda\, \mathbf{v}^0(s), \quad C_1(t,s) = \frac{1}{P}\, \mathbf{v}^1(t)^\top \mathbf{v}^1(s), \quad C_2(t,s) = \frac{1}{M}\, \mathbf{v}^2(t)^\top \mathbf{v}^2(s), \quad C_3(t,s) = \frac{1}{N}\, \mathbf{v}^3(t)^\top \mathbf{v}^3(s),$$

as well as the response functions

$$R_1(t,s) = \frac{1}{P}\, \mathrm{Tr}\left\langle \frac{\delta \mathbf{v}^1(t)}{\delta \mathbf{v}^1(s)} \right\rangle, \quad R_{2,4}(t,s) = \frac{1}{M}\, \mathrm{Tr}\left\langle \frac{\delta \mathbf{v}^2(t)}{\delta \mathbf{v}^4(s)} \right\rangle, \quad R_3(t,s) = \frac{1}{N}\, \mathrm{Tr}\left\langle \frac{\delta \mathbf{v}^3(t)}{\delta \mathbf{v}^3(s)} \right\rangle, \quad R_{0,2}(t,s) = \frac{1}{M}\, \mathrm{Tr}\, \Lambda \left\langle \frac{\delta \mathbf{v}^0(t)}{\delta \mathbf{v}^2(s)} \right\rangle.$$

Here $\delta v^i(t)/\delta v^j(s)$ is the response of $v^i(t)$ to a kick in the dynamics of $v^j$ at time $s$. See Appendix B.2.1 for details. The test loss $L$ and train loss $\hat{L}$ are related to the time-time diagonal of $C_0(t,s)$ and $C_1(t,s)$ respectively,

$$L(t) = C_0(t,t) + \sigma^2, \qquad \hat{L}(t) = C_1(t,t). \quad (9)$$

These collective quantities concentrate over random draws of the disorder (Sompolinsky & Zippelius, 1981). We show that these correlation and response functions satisfy a closed set of integro-differential equations, which depend on $\alpha, \nu$, and which we provide in Appendix A.2. Further, we show in Appendix A.3 that the response functions possess a time-translation invariance property $R(t,s) = R(t-s)$. This enables exact analysis in the Fourier domain, $R(\tau) = \int \frac{d\omega}{2\pi}\, e^{i\omega\tau} R(\omega)$. These response functions can then be used to solve for the correlation functions $\{C_0, C_1, C_2, C_3\}$.

To understand the convergence of the learned function $f$ along each eigenfunction of the kernel, we introduce the transfer function² for mode $k$, $H_k(t) \equiv \langle v^0_k(t)\rangle / w^*_k$. Our key result is that the Fourier transform of $H_k$ can be simply expressed in terms of the Fourier transforms of $R_1, R_3$:

$$H_k(\omega) = \frac{1}{i\omega + \lambda_k R_1(\omega) R_3(\omega)}. \quad (10)$$

These functions satisfy the self-consistent equations

$$R_1(\omega) = 1 - \frac{1}{\alpha M}\sum_k \frac{\lambda_k R_1(\omega) R_3(\omega)}{i\omega + \lambda_k R_1(\omega) R_3(\omega)}, \qquad R_3(\omega) = 1 - \frac{1}{\nu M}\sum_k \frac{\lambda_k R_1(\omega) R_3(\omega)}{i\omega + \lambda_k R_1(\omega) R_3(\omega)}. \quad (11)$$

From these solved response functions $R_1, R_3$, we can compute the correlation functions' two-variable Fourier transforms $C(\omega, \omega')$, which satisfy independent equations for each pair of $\omega, \omega'$. Information about the early dynamics can be extracted from high frequencies $\omega \gg 1$, while information about the late-time limit of the system can be extracted from $\omega, \omega' \to 0$ (App. C, D). For example, the final test loss is

$$\lim_{t \to \infty} L(t, \alpha, \nu) = \lim_{\omega, \omega' \to 0} (i\omega)(i\omega')\, C_0(\omega, \omega'). \quad (12)$$

The full temporal trajectory can be obtained with an inverse Fourier transform of $C_0(\omega, \omega')$. See Appendix A.4.

² These are dynamical analogues of the mode errors in (Bordelon et al., 2020; Canatar et al., 2021) or learnabilities in (Simon et al., 2021).

4. Results

Our results hold for any $\lambda_k$ and $w^*_k$, and we provide some simple analytically solvable examples in Appendix I. However, based on empirical observations of NTK spectral decompositions on realistic datasets (Bordelon & Pehlevan, 2022a; Spigler et al., 2020; Bahri et al., 2021; Maloney et al., 2022), here we focus on the case of power-law features. In this setting, eigenvalues and target coefficients decay as a power law in the index $k$,

$$(w^*_k)^2 \lambda_k \sim k^{-a}, \qquad \lambda_k \sim k^{-b}. \quad (13)$$

We will refer to $a$ as the task-power exponent and $b$ as the spectral decay exponent.³ See Figure 7 (a)-(b) for an example with a Residual CNN on CIFAR-5M.

³ These power-law decay rates are also known as source and capacity conditions in the kernel literature (Caponnetto & Vito, 2005; Cui et al., 2021).
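As an illustration of how Equations (10)-(11) can be used in practice, the sketch below solves the self-consistent equations at a single frequency by damped fixed-point iteration and then evaluates the transfer functions. This is our own illustration: the precise normalization of the sum over modes follows the form written above and should be treated as an assumption, and naive iteration is not guaranteed to converge for every $(\alpha, \nu, \omega)$, particularly as $\omega \to 0$.

```python
import numpy as np

def solve_responses(omega, lam, alpha, nu, n_iter=2000, damp=0.7):
    """Damped fixed-point iteration for R1(omega), R3(omega) in Eq. (11)."""
    R1, R3 = 1.0 + 0j, 1.0 + 0j
    M = lam.size
    for _ in range(n_iter):
        x = lam * R1 * R3
        S = np.sum(x / (1j * omega + x)) / M   # (1/M) sum_k lam_k R1 R3 / (i w + lam_k R1 R3)
        R1 = damp * R1 + (1 - damp) * (1.0 - S / alpha)
        R3 = damp * R3 + (1 - damp) * (1.0 - S / nu)
    return R1, R3

M, N, P = 1000, 400, 600
alpha, nu = P / M, N / M
lam = np.arange(1, M + 1) ** (-1.25)

omega = 0.01
R1, R3 = solve_responses(omega, lam, alpha, nu)
Hk = 1.0 / (1j * omega + lam * R1 * R3)        # transfer functions, Eq. (10)
print("R1 =", R1, " R3 =", R3)
print("|H_1| =", abs(Hk[0]), " |H_100| =", abs(Hk[99]))
```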
Test loss power laws. For power-law features, the test loss will generally be bottlenecked by either training time $t$ (steps of gradient descent), the size of the training set $P$, or the size of the model $N$. We can derive bottleneck scalings from our exact expressions for $L(t, P, N)$ (Appendix J):⁴

$$L \sim \begin{cases} t^{-(a-1)/b}, & P, N \to \infty & \text{(Time)} \\ P^{-\min\{a-1,\, 2b\}}, & t, N \to \infty & \text{(Data)} \\ N^{-\min\{a-1,\, 2b\}}, & t, P \to \infty & \text{(Model)} \end{cases} \quad (14)$$

A consequence of this is an asymmetry in exponent between the model and data bottlenecks compared to the time bottleneck. We verify this asymmetry in Figure 2.

⁴ The alternative scaling exponents $L \sim N^{-2b}, P^{-2b}$ occur for very easy tasks which satisfy $a > 2b + 1$, but this condition is rarely satisfied in natural data (Appendix J).

[Figure 2 panels: (a) $P = 1000$ test loss dynamics; (b) early-time model convergence, $L(t,N) - L(t,\infty)$ at $t = 10, 20, 50, 100$ against $N^{-1}$; (c) late-time model bottleneck (final loss vs. $N$); (d) $N = 1000$ test loss dynamics; (e) early-time data convergence against $P^{-1}$; (f) late-time data bottleneck (final loss vs. $P$).]
Figure 2. Verification of the various bottleneck scalings for power-law features with $a = 1.5$ and $b = 1.25$. Dashed black lines are DMFT solutions while colors are simulations with standard deviation highlighted. (a) The loss dynamics at large $\alpha$ will be bottlenecked by either time or finite $\nu$. (b) Early in training, the loss converges to its limit as $N^{-1}$ (App. D). (c) At long times, the model's asymptotic loss scales as $N^{-(a-1)}$ (App. C). (d)-(f) The same results but with $N$ and $P$ switched. The model exhibits $1/P$ corrections at early time and power-law data bottleneck scalings at late time.

Bottlenecks as Rank Constraints. All three of the bottleneck scalings arise due to rank constraints in the effective dynamics. Heuristically, finite training time or the subsampling of data/features leads to an approximate projection of the target function onto the top $k^*(t, P, N)$ eigenspace of the infinite-width kernel. The components of the target function in the null space of this projection are not learned. This leads to an approximate test loss of the form

$$L \approx \sum_{k > k^*} (w^*_k)^2 \lambda_k \sim (k^*)^{-(a-1)}. \quad (15)$$

For the model and data bottlenecks we have that $k^* \sim N$ and $k^* \sim P$ respectively (App. J). On the other hand, $k^*$ for the time bottleneck also depends on the structure of the features through the exponent $b$. This is because the $k$-th eigenfeature is learned at a timescale $\tau_k \sim k^b$. At time $t$, we have learned the first $k^* \sim t^{1/b}$ modes, and the variance in the remaining modes gives $L \sim t^{-(a-1)/b}$. In the limit of $t \to \infty$, our data and model bottleneck scalings agree with the resolution- and variance-limited scalings studied in Bahri et al. (2021), as well as prior works on kernels and random feature models (Bordelon et al., 2020; Maloney et al., 2022).

Connection to Online Learning with SGD. Many modern deep learning models are trained in an online learning setting, where each step of training uses a fresh batch of data to estimate the gradient of the population loss and batches are not reused over multiple steps. Our theoretical methods can also handle this regime. At each step $t$, a fresh minibatch of $B$ examples is used to estimate the gradient. In discrete time with learning rate $\eta$, this leads to the following DMFT description of $v^0_k(t)$:

$$v^0_k(t+1) = v^0_k(t) - \eta\, u^4_k(t) - \eta \sum_{s \le t} R_3(t,s)\left[u^2_k(s) + \lambda_k v^0_k(s)\right], \quad (16)$$

where $u^4_k(t), u^2_k(t)$ are zero-mean Gaussian variables with known covariance (see Appendix K.3). The response function $R_3(t,s)$ satisfies a discrete-time analog of Equation (11).
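For intuition, the sketch below simulates one-pass SGD on the random feature model directly (drawing a fresh batch at every step) rather than implementing the DMFT recursion of Equation (16); the batch size, learning rate, and rescaling convention are assumptions carried over from the earlier offline sketch.

```python
import numpy as np

rng = np.random.default_rng(0)
M, N = 1000, 256
a, b = 1.5, 1.25
k = np.arange(1, M + 1)
lam = k ** (-b)
w_star = rng.standard_normal(M) * np.sqrt(k ** (-a) / lam)
A = rng.standard_normal((N, M))

def fresh_batch(B):
    """Each SGD step sees a brand-new batch: no data reuse, hence no data bottleneck."""
    Psi = rng.standard_normal((B, M)) * np.sqrt(lam)
    return Psi, Psi @ w_star / np.sqrt(M)

w = np.zeros(N)
eta, B, steps = 0.02, 32, 4000
test_hist = []
for t in range(steps):
    Psi, y = fresh_batch(B)
    resid = y - Psi @ (A.T @ w) / np.sqrt(N * M)
    w += (eta / B) * np.sqrt(M / N) * (A @ (Psi.T @ resid))   # same rescaling as the offline sketch
    if t % 50 == 0:
        v0 = w_star - A.T @ w / np.sqrt(N)
        test_hist.append(np.sum(lam * v0 ** 2) / M)

print("test loss after one-pass SGD:", test_hist[-1])
```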
The most important observation about this regime is that there is no longer a data bottleneck. Rather, the bias component of the test error can only be limited by either training time or model size. The finite batch size $B$ introduces SGD noise, which adds an additional variance component to the test loss. We illustrate these results in Figure 3. The $N \to \infty$ limit recovers the results of Bordelon & Pehlevan (2022a), which study online SGD without averaging over a random projection. The continuous-time limit of the above expressions, obtained by evaluating the theory for small $\eta$, exactly matches the $P \to \infty$ limit of our gradient flow theory presented in the previous section. We will therefore use this limiting behavior to analyze compute-optimal tradeoffs of model size and training time.

[Figure 3 panels: (a) SGD with batch size $B = 32$ and varying $N$; (b) SGD with $N = 256$ and varying batch size $B$.]
Figure 3. Our DMFT can also capture online SGD learning, including the effect of batch size fluctuations on the loss and the finite-$N$ bottleneck. (a) Power-law features trained with SGD and a fixed random projection still generate asymptotes which depend on $N$. (b) The batch size $B$ impacts the loss through additional variance in the dynamics but does not lead to an asymptotic plateau.

[Figure 4 panels: (a) $(a,b) = (2,1)$, $L^*(C) \sim C^{-0.5}$; (b) $(a,b) = (2,1.5)$, $L^*(C) \sim C^{-0.4}$; (c) $(a,b) = (2,2)$, $L^*(C) \sim C^{-0.33}$; theory curves $C^{-(a-1)/(b+1)}$.]
Figure 4. Compute-optimal scaling in our model is determined by a tradeoff of the time and model-size bottlenecks. Solid colored lines are simulations with power-law features and the dashed black line is the theoretical prediction of compute-optimal scaling. Each color represents a different model size with $N \in [2^5, 2^{10}]$. The Pareto frontier is defined as the minimum value of $L$ at each compute $C$ over all possible choices of model size $N$. Although the final losses do not depend on the spectral decay rate $b$ but only on the task-power exponent $a$, the compute-optimal scaling does depend on $b$.

Asymmetric Compute-Optimal Scaling Strategy. We now consider the regime where the total amount of data does not limit performance, but rather training is bottlenecked by time or model size. This could arise in the offline model with very large $P$, or in one-pass SGD with sufficiently small learning rate or sufficiently large batch size (App. K.3). In either case, time and model size should scale differently with compute budget $C = Nt$. With $m = \min\{a-1,\, 2b\}$,

$$t^* \propto C^{\frac{bm}{a-1+bm}}, \qquad N^* \propto C^{\frac{a-1}{a-1+bm}} \;\;\Longrightarrow\;\; L^*(C) \propto C^{-\frac{(a-1)m}{a-1+bm}}. \quad (17)$$

For the regime of interest where $m = \min\{a-1, 2b\} = a-1$, this gives $L^*(C) \propto C^{-\frac{a-1}{1+b}}$. We obtain the above scaling by approximating the loss as a sum of the three terms in Equation (14) and a constant, as in Hoffmann et al. (2022); see Appendix N. This analysis suggests that for features that have rapid decay in their eigenspectrum, it is preferable to allocate greater resources toward training time rather than model size as the compute budget increases. This is consistent with the small asymmetries observed in Hoffmann et al. (2022) for language models and the larger asymmetries in MLPs on vision from Bachmann et al. (2024). In the limit as $b \to 1$, the training time and parameter count should be scaled linearly together. We verify this scaling rule and its $b$-dependence in Figure 4.
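The tradeoff in Equation (17) can be checked with a few lines of arithmetic: minimize the two-term loss model $L(t, N) \approx t^{-(a-1)/b} + N^{-m}$ subject to $C = Nt$ and compare against the predicted exponents. This is a sketch of the approximation described above; the unit coefficients on both terms are an arbitrary choice.

```python
import numpy as np

a, b = 2.0, 1.5
m = min(a - 1.0, 2.0 * b)

def loss(t, N):
    return t ** (-(a - 1.0) / b) + N ** (-m)    # two-term approximation of Eq. (14)

C_grid = np.logspace(3, 8, 6)
N_grid = np.logspace(0.5, 7, 4000)
for C in C_grid:
    t_grid = C / N_grid
    i = np.argmin(loss(t_grid, N_grid))         # brute-force Pareto point at this compute
    print(f"C={C:.1e}  N*={N_grid[i]:.2e}  t*={t_grid[i]:.2e}  L*={loss(t_grid[i], N_grid[i]):.3e}")

# Predicted exponents from Eq. (17): N* ~ C^{(a-1)/(a-1+bm)}, t* ~ C^{bm/(a-1+bm)},
# L* ~ C^{-(a-1)m/(a-1+bm)}; for (a, b) = (2, 1.5) this gives L* ~ C^{-0.4}.
print("predicted exponents (N*, t*, L*):",
      (a - 1) / (a - 1 + b * m), b * m / (a - 1 + b * m), (a - 1) * m / (a - 1 + b * m))
```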
Wider is Better Requires Sufficient Data. Larger models are not always better in terms of test loss for all time $t$, as we showed in Figure 1 (d), especially if the dataset is limited. In Figure 5, we illustrate that larger $N$ can improve convergence to a data-bottlenecked loss for power-law features. However, the loss may still be non-monotonic in training time, motivating regularization or early stopping.

[Figure 5 panels: (a) Test loss ($P = 128$); (b) Train loss ($P = 128$); (c) Train-test gaps ($N = 512$).]
Figure 5. In a data-limited regime, wider networks train faster but cannot indefinitely improve generalization by making $N$ larger. (a) Test loss for power-law features with $a = 1.5$ and $b = 1.25$ with $P = 128$ and varying $N$. In this regime, there are diminishing returns to making the model size larger. (b) For $N < P$, the model is underparameterized and cannot achieve zero train loss. For $N > P$, the train loss will eventually decay at an exponential rate which depends on $N$, despite the test loss saturating. (c) The train and test losses gradually separate at a rate which depends on $P$.

Gradual Buildup of Overfitting Effects. The gap between train and test losses can be expressed exactly in terms of the DMFT order parameters:

$$L(t) - \hat{L}(t) = 2\int_0^t dt'\, R_{0,2}(t,t')\, C_1(t,t') - \int_0^t\!\!\int_0^t dt'\, ds'\, R_{0,2}(t,t')\, R_{0,2}(t,s')\, C_1(t',s'). \quad (18)$$

We derive this relation in Appendix E. At early time, this gap goes as $O(1/P)$ (App. D, E). At late time, however, it picks up a nontrivial task-dependent scaling with $P$, as we show in Figure 2 (e)-(f) and App. C. In Figure 5 (c) we show this gradual accumulation of finite-data effects on the train-test loss gap. For larger datasets $P$, it takes longer training time to begin overfitting (App. E).

Ensembling is Not Always Compute Optimal. Ensembling a set of models means averaging their predictions over the same dataset but with different initialization seeds. This reduces test loss by reducing the variance of the model output $f$ due to initialization. This improvement can be predicted from an extension of our DMFT (App. H). Analogously, bagging over $B$ datasets reduces variance due to sampling of data. One might imagine that ensembling many finite-sized models would allow one to approach the performance of an infinite-sized model ($N \to \infty$). If this were possible, the compute-optimal strategy could involve a tradeoff between ensemble count and model size. However, recent experiments show that there is a limited benefit from ensembling on large datasets when compared to increasing model size (Vyas et al., 2023). We illustrate this in Figure 6 (a). Our theory can explain these observations, as it predicts the effect of ensembling $E$ times on the learning dynamics, as we show in App. H. The main reason to prefer increasing $N$ rather than increasing $E$ is that larger $N$ has lower bias in the dynamics, whereas ensembling only reduces variance. The bias of the model has the form

$$\mathcal{B}(t, N, P) = \sum_k \lambda_k (w^*_k)^2\, H_k(t, N, P)^2, \quad (19)$$

which depends on the transfer functions $H_k$ that we illustrate for power-law features in Figure 6 (b). Since the $H_k(t)$ depend on $N, P$, we see that ensembling/bagging cannot recover the learning curve of the $N, P \to \infty$ system, since the bias is limited by finite $N, P$.
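A minimal numerical check of this bias-versus-variance argument is sketched below (our own illustration, not the experiment of Figure 6): it compares an ensemble of $E$ narrow students with a single wide student of the same total parameter count, but at the end of training (ridgeless least squares) rather than along the dynamics, and with sizes kept well below $P$ to stay away from the interpolation threshold.

```python
import numpy as np

rng = np.random.default_rng(1)
M, P, P_test = 1000, 512, 4000
a, b = 1.5, 1.25
k = np.arange(1, M + 1)
lam = k ** (-b)
w_star = rng.standard_normal(M) * np.sqrt(k ** (-a) / lam)

def sample(n):
    Psi = rng.standard_normal((n, M)) * np.sqrt(lam)
    return Psi, Psi @ w_star / np.sqrt(M)

Psi_tr, y_tr = sample(P)
Psi_te, y_te = sample(P_test)

def fit_predict(N):
    """One student of width N, trained to convergence (least-squares solution)."""
    A = rng.standard_normal((N, M))
    feats_tr = Psi_tr @ A.T / np.sqrt(N * M)
    feats_te = Psi_te @ A.T / np.sqrt(N * M)
    w = np.linalg.lstsq(feats_tr, y_tr, rcond=None)[0]
    return feats_te @ w

N, E = 32, 4
ensemble = np.mean([fit_predict(N) for _ in range(E)], axis=0)   # E small models, averaged
wide = fit_predict(N * E)                                        # one model, same parameter count
print("ensemble of E=4 width-32 models:", np.mean((ensemble - y_te) ** 2))
print("single width-128 model:        ", np.mean((wide - y_te) ** 2))
```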
5. Tests on Realistic Networks

We now move beyond synthetic power-law datasets and consider realistic image datasets and architectures. We take the CIFAR-5M dataset introduced in Nakkiran et al. (2021a) and consider the task of classifying animate vs. inanimate objects. We plot the spectra of the finite-width NTK at initialization across different widths for a Wide ResNet (Zagoruyko & Komodakis, 2016) in Figure 7 (a). Here the width parameter corresponds to the number of channels in the hidden layers. Following Canatar et al. (2021), we define $C(k)$ as the fraction of the task captured by the top $k$ kernel eigenmodes:

$$C(k) = \frac{\sum_{i \le k} \lambda_i (w^*_i)^2}{\sum_i \lambda_i (w^*_i)^2}. \quad (20)$$

Then $1 - C(k)$ is the portion of the task left unexplained. We plot this for the initial NTKs across widths in Figure 7 (b). We extract the spectral decay exponent $b$ and the task-power exponent $a$ from these two curves. Together, these give the learning scaling laws of the linearized neural network model on this dataset. We plot the compute-optimal scaling laws of these linearized models in Figure 7 (c). We also plot the predicted scaling law $C^{-(a-1)/(1+b)}$ in blue and find excellent agreement.

[Figure 6 panels: (a) $E$ vs. $N$ allocation of compute, with $N \in \{64, 256, 1024\}$ and $E \in \{1, 4\}$; (b) Transfer functions $H_k(t)$ for $N = 128$, modes $k \in \{1, 10\}$, and varying $P$.]
Figure 6. Ensembling $E$ models of size $N$ improves performance by reducing initialization variance by a factor of $E$ (see Appendix H.2). (a) However, at fixed compute $C = NEt$, increasing the model size $N$ is preferable, since the bias is also reduced. (b) The transfer functions $H_k(t)$ computed from the DMFT determine the error as $E \to \infty$. They depend on $N, P$ and saturate at long times, while the $N, P \to \infty$ curves decay exponentially.

[Figure 7 panels: (a) NTK spectra for varying widths $N = 8$ to $256$; (b) task-power decay; (c) lazy vs. feature-learning compute scaling.]
Figure 7. Our theory predicts time and compute scalings for linearized networks on realistic datasets. (a) The initial NTK spectra and (b) task-power distributions for ResNets of width $N$ on CIFAR-5M are well described by power laws $\lambda_k \sim k^{-2.0}$ and $1 - C(k) \sim k^{-0.15}$ for large $k$. (c) The predicted compute-optimal scaling for the ResNet obeys $L^*(C) \sim C^{-0.05}$. However, for networks outside of the kernel regime (dashed lines), feature learning can substantially alter the observed scaling laws and improve the loss curves as a function of compute.

5.1. The Role of Feature Learning

We also compare these scalings to those of the compute-optimal learning curves for feature-learning networks. We train several networks with different widths and initialization seeds for 64 epochs through the dataset. We observe substantially different compute-optimal scaling exponents in the dotted curves of Figure 7 (c). This means that although our random feature model does capture the correct linearized scaling trends, which have all of the qualities observed in realistic scaling laws, more is needed to capture the acceleration of scaling induced by feature learning. Further analyses of the after-kernels of feature-learning networks are performed in Appendix L. We see that the kernels continue to evolve substantially throughout training. This indicates that a full explanation of the compute-optimal scaling exponents will require something resembling a mechanistic theory of kernel evolution (Long, 2021; Fort et al., 2020; Atanasov et al., 2022; Bordelon & Pehlevan, 2022b).

6. Conclusion

We have presented a model that recovers a wide variety of phenomena observed in more realistic deep learning settings.
Our theory includes not just model size and dataset size as parameters but also explicitly treats the temporal dynamics of training. We observe different scaling exponents for performance in terms of model size and number of time steps. Future work to incorporate kernel evolution into this model could further shed insight into the improved scaling laws in the feature-learning regime. Overall, our results provide a theoretical interpretation of compute-optimal scaling as a competition between the training dynamics of the infinite width/infinite data limit and finite model-size bottleneck. A Dynamical Model of Neural Scaling Laws Impact Statement This paper presents work whose goal is to advance the field of Machine Learning. There are many potential societal consequences of our work, none which we feel must be specifically highlighted here. Acknowledgements We are grateful to Yasaman Bahri, Stefano Mannelli, Francesca Mignacco, Jascha Sohl-Dickstein, and Nikhil Vyas for useful conversations. We thank Clarissa Lauditi and Jacob Zavatone-Veth for comments on the manuscript. B.B. is supported by a Google Ph D Fellowship. A.A. is supported by a Fannie and John Hertz Fellowship. C.P. is supported by NSF grant DMS-2134157, NSF CAREER Award IIS-2239780, and a Sloan Research Fellowship. This work has been made possible in part by a gift from the Chan Zuckerberg Initiative Foundation to establish the Kempner Institute for the Study of Natural and Artificial Intelligence. Adlam, B. and Pennington, J. The neural tangent kernel in high dimensions: Triple descent and a multi-scale theory of generalization. In International Conference on Machine Learning, pp. 74 84. PMLR, 2020a. Adlam, B. and Pennington, J. Understanding double descent requires a fine-grained bias-variance decomposition. Advances in neural information processing systems, 33: 11022 11032, 2020b. Advani, M. S., Saxe, A. M., and Sompolinsky, H. Highdimensional dynamics of generalization error in neural networks. Neural Networks, 132:428 446, 2020. Alabdulmohsin, I., Zhai, X., Kolesnikov, A., and Beyer, L. Getting vit in shape: Scaling laws for compute-optimal model design. ar Xiv preprint ar Xiv:2305.13035, 2023. Arora, S. and Goyal, A. A theory for emergence of complex skills in language models. ar Xiv preprint ar Xiv:2307.15936, 2023. Atanasov, A., Bordelon, B., and Pehlevan, C. Neural networks as kernel learners: The silent alignment effect. In International Conference on Learning Representations, 2022. URL https://openreview.net/forum? id=1Nvflq Adoom. Atanasov, A., Bordelon, B., Sainathan, S., and Pehlevan, C. The onset of variance-limited behavior for networks in the lazy and rich regimes. In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum? id=JLINx POVTh7. Bachmann, G., Anagnostidis, S., and Hofmann, T. Scaling mlps: A tale of inductive bias. Advances in Neural Information Processing Systems, 36, 2024. Bahri, Y., Dyer, E., Kaplan, J., Lee, J., and Sharma, U. Explaining neural scaling laws. ar Xiv preprint ar Xiv:2102.06701, 2021. Bodin, A. and Macris, N. Model, sample, and epoch-wise descents: exact solution of gradient flow in the random feature model. Advances in Neural Information Processing Systems, 34:21605 21617, 2021. Bordelon, B. and Pehlevan, C. Learning curves for SGD on structured features. In International Conference on Learning Representations, 2022a. URL https: //openreview.net/forum?id=WPI2vbk Al3Q. Bordelon, B. and Pehlevan, C. 
Self-consistent dynamical field theory of kernel evolution in wide neural networks. ar Xiv preprint ar Xiv:2205.09653, 2022b. Bordelon, B. and Pehlevan, C. Dynamics of finite width kernel and prediction fluctuations in mean field neural networks. ar Xiv preprint ar Xiv:2304.03408, 2023. Bordelon, B., Canatar, A., and Pehlevan, C. Spectrum dependent learning curves in kernel regression and wide neural networks. In International Conference on Machine Learning, pp. 1024 1034. PMLR, 2020. Bordelon, B., Noci, L., Li, M. B., Hanin, B., and Pehlevan, C. Depthwise hyperparameter transfer in residual networks: Dynamics and scaling limit, 2023. Caballero, E., Gupta, K., Rish, I., and Krueger, D. Broken neural scaling laws. ar Xiv preprint ar Xiv:2210.14891, 2022. Canatar, A., Bordelon, B., and Pehlevan, C. Spectral bias and task-model alignment explain generalization in kernel regression and infinitely wide neural networks. Nature communications, 12(1):2914, 2021. Caponnetto, A. and Vito, E. D. Fast rates for regularized least-squares algorithm. 2005. Cheng, C. and Montanari, A. Dimension free ridge regression. ar Xiv preprint ar Xiv:2210.08571, 2022. Chizat, L., Oyallon, E., and Bach, F. On lazy training in differentiable programming. Advances in neural information processing systems, 32, 2019. Cortes, C., Mohri, M., and Rostamizadeh, A. Algorithms for learning kernels based on centered alignment. The Journal of Machine Learning Research, 13(1):795 828, 2012. A Dynamical Model of Neural Scaling Laws Crisanti, A. and Sompolinsky, H. Path integral approach to random neural networks. Physical Review E, 98(6): 062120, 2018. Cui, H., Loureiro, B., Krzakala, F., and Zdeborov a, L. Generalization error rates in kernel regression: The crossover from the noiseless to noisy regime. Advances in Neural Information Processing Systems, 34:10131 10143, 2021. Dyer, E. and Gur-Ari, G. Asymptotics of wide networks from feynman diagrams. In International Conference on Learning Representations, 2020. d Ascoli, S., Refinetti, M., Biroli, G., and Krzakala, F. Double trouble in double descent: Bias and variance (s) in the lazy regime. In International Conference on Machine Learning, pp. 2280 2290. PMLR, 2020. Fort, S., Dziugaite, G. K., Paul, M., Kharaghani, S., Roy, D. M., and Ganguli, S. Deep learning versus kernel learning: an empirical study of loss landscape geometry and the time evolution of the neural tangent kernel. In Larochelle, H., Ranzato, M., Hadsell, R., Balcan, M., and Lin, H. (eds.), Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, Neur IPS 2020, December 6-12, 2020, virtual, 2020. Geiger, M., Jacot, A., Spigler, S., Gabriel, F., Sagun, L., d Ascoli, S., Biroli, G., Hongler, C., and Wyart, M. Scaling description of generalization with number of parameters in deep learning. Journal of Statistical Mechanics: Theory and Experiment, 2020(2):023401, 2020. Gerbelot, C., Troiani, E., Mignacco, F., Krzakala, F., and Zdeborova, L. Rigorous dynamical mean field theory for stochastic gradient descent methods. ar Xiv preprint ar Xiv:2210.06591, 2022. Ghosh, N., Mei, S., and Yu, B. The three stages of learning dynamics in high-dimensional kernel methods. In International Conference on Learning Representations, 2022. URL https://openreview.net/forum? id=EQm AP4F859. Hastie, T., Montanari, A., Rosset, S., and Tibshirani, R. J. Surprises in high-dimensional ridgeless least squares interpolation. Annals of statistics, 50(2):949, 2022. Helias, M. 
and Dahmen, D. Statistical field theory for neural networks, volume 970. Springer, 2020. Hestness, J., Narang, S., Ardalani, N., Diamos, G., Jun, H., Kianinejad, H., Patwary, M. M. A., Yang, Y., and Zhou, Y. Deep learning scaling is predictable, empirically. ar Xiv preprint ar Xiv:1712.00409, 2017. Hoffmann, J., Borgeaud, S., Mensch, A., Buchatskaya, E., Cai, T., Rutherford, E., Casas, D. d. L., Hendricks, L. A., Welbl, J., Clark, A., et al. Training compute-optimal large language models. ar Xiv preprint ar Xiv:2203.15556, 2022. Jacot, A., Gabriel, F., and Hongler, C. Neural tangent kernel: Convergence and generalization in neural networks. Advances in neural information processing systems, 31, 2018. Kaplan, J., Mc Candlish, S., Henighan, T., Brown, T. B., Chess, B., Child, R., Gray, S., Radford, A., Wu, J., and Amodei, D. Scaling laws for neural language models. ar Xiv preprint ar Xiv:2001.08361, 2020. Long, P. M. Properties of the after kernel. ar Xiv preprint ar Xiv:2105.10585, 2021. Loureiro, B., Gerbelot, C., Cui, H., Goldt, S., Krzakala, F., Mezard, M., and Zdeborov a, L. Learning curves of generic features maps for realistic datasets with a teacherstudent model. Advances in Neural Information Processing Systems, 34:18137 18151, 2021. Loureiro, B., Gerbelot, C., Refinetti, M., Sicuro, G., and Krzakala, F. Fluctuations, bias, variance & ensemble of learners: Exact asymptotics for convex losses in highdimension. In International Conference on Machine Learning, pp. 14283 14314. PMLR, 2022. Maloney, A., Roberts, D. A., and Sully, J. A solvable model of neural scaling laws. ar Xiv preprint ar Xiv:2210.16859, 2022. Mannelli, S. S., Krzakala, F., Urbani, P., and Zdeborova, L. Passed & spurious: Descent algorithms and local minima in spiked matrix-tensor models. In international conference on machine learning, pp. 4333 4342. PMLR, 2019. Martin, P. C., Siggia, E., and Rose, H. Statistical dynamics of classical systems. Physical Review A, 8(1):423, 1973. Mei, S. and Montanari, A. The generalization error of random features regression: Precise asymptotics and the double descent curve. Communications on Pure and Applied Mathematics, 75(4):667 766, 2022. Michaud, E. J., Liu, Z., Girit, U., and Tegmark, M. The quantization model of neural scaling. ar Xiv preprint ar Xiv:2303.13506, 2023. Mignacco, F. and Urbani, P. The effective noise of stochastic gradient descent. Journal of Statistical Mechanics: Theory and Experiment, 2022(8):083405, 2022. A Dynamical Model of Neural Scaling Laws Mignacco, F., Krzakala, F., Urbani, P., and Zdeborov a, L. Dynamical mean-field theory for stochastic gradient descent in gaussian mixture classification. Advances in Neural Information Processing Systems, 33:9540 9550, 2020. Muennighoff, N., Rush, A. M., Barak, B., Scao, T. L., Piktus, A., Tazi, N., Pyysalo, S., Wolf, T., and Raffel, C. Scaling data-constrained language models. ar Xiv preprint ar Xiv:2305.16264, 2023. Nakkiran, P., Neyshabur, B., and Sedghi, H. The deep bootstrap framework: Good online learners are good offline generalizers. In International Conference on Learning Representations, 2021a. Nakkiran, P., Neyshabur, B., and Sedghi, H. The deep bootstrap framework: Good online learners are good offline generalizers. In International Conference on Learning Representations, 2021b. URL https:// openreview.net/forum?id=guetr IHLFGI. Paquette, C., Lee, K., Pedregosa, F., and Paquette, E. Sgd in the large: Average-case analysis, asymptotics, and stepsize criticality. In Conference on Learning Theory, pp. 
3548 3626. PMLR, 2021. Roberts, D. A., Yaida, S., and Hanin, B. The principles of deep learning theory. Cambridge University Press Cambridge, MA, USA, 2022. Ruben, B. S. and Pehlevan, C. Learning curves for noisy heterogeneous feature-subsampled ridge ensembles. Ar Xiv, 2023. Simon, J. B., Dickens, M., Karkada, D., and De Weese, M. R. The eigenlearning framework: A conservation law perspective on kernel regression and wide neural networks. ar Xiv preprint ar Xiv:2110.03922, 2021. Simon, J. B., Karkada, D., Ghosh, N., and Belkin, M. More is better in modern machine learning: when infinite overparameterization is optimal and overfitting is obligatory. ar Xiv preprint ar Xiv:2311.14646, 2023. Sompolinsky, H. and Zippelius, A. Dynamic theory of the spin-glass phase. Physical Review Letters, 47(5):359, 1981. Sorscher, B., Geirhos, R., Shekhar, S., Ganguli, S., and Morcos, A. Beyond neural scaling laws: beating power law scaling via data pruning. Advances in Neural Information Processing Systems, 35:19523 19536, 2022. Spigler, S., Geiger, M., and Wyart, M. Asymptotic learning curves of kernel methods: empirical data versus teacher student paradigm. Journal of Statistical Mechanics: Theory and Experiment, 2020(12):124001, 2020. Varre, A. V., Pillaud-Vivien, L., and Flammarion, N. Last iterate convergence of sgd for least-squares in the interpolation regime. Advances in Neural Information Processing Systems, 34:21581 21591, 2021. Vyas, N., Bansal, Y., and Nakkiran, P. Limitations of the ntk for understanding generalization in deep learning. ar Xiv preprint ar Xiv:2206.10012, 2022. Vyas, N., Atanasov, A., Bordelon, B., Morwani, D., Sainathan, S., and Pehlevan, C. Feature-learning networks are consistent across widths at realistic scales. ar Xiv preprint ar Xiv:2305.18411, 2023. Yang, G., Hu, E. J., Babuschkin, I., Sidor, S., Liu, X., Farhi, D., Ryder, N., Pachocki, J., Chen, W., and Gao, J. Tuning large neural networks via zero-shot hyperparameter transfer. In Beygelzimer, A., Dauphin, Y., Liang, P., and Vaughan, J. W. (eds.), Advances in Neural Information Processing Systems, 2021. URL https: //openreview.net/forum?id=Bx6q Ku BM2AD. Zagoruyko, S. and Komodakis, N. Wide residual networks. ar Xiv preprint ar Xiv:1605.07146, 2016. Zavatone-Veth, J. A. and Pehlevan, C. Learning curves for deep structured gaussian feature models, 2023. Zavatone-Veth, J. A., Tong, W. L., and Pehlevan, C. Contrasting random and learned features in deep bayesian linear regression. Physical Review E, 105(6):064118, 2022. A Dynamical Model of Neural Scaling Laws A. Derivation of Dynamical Model of Scaling Laws We investigate the simplest possible model which can exhibit task-dependent time, model size and finite data bottlenecks. We therefore choose to study a linear model with projected features M Aψ(x) , y(x) = 1 M w ψ(x). (21) The weights w are updated with gradient descent on a random training dataset which has (possibly) noise corrupted target values yµ = y(xµ) + σϵµ. This leads to the following gradient flow dynamics µ=1 (yµ fµ)Aψµ = 1 We introduce the variable v0 = w 1 N A w to represent the residual error of the learned weight vector. This residual error has the following dynamics: P Ψ Ψ v0(t) + σ α M Ψ ϵ . (23) The entries of each matrix are treated as random with Ψµ k N(0, λk) and Ajk N(0, 1). To study the dynamical evolution of the test error L(t) = 1 M v0(t) Λv0(t) + σ2, we introduce the sequence of vectors M Ψv0(t) + σϵ , v2(t) = 1 α M Av2(t) , v4(t) = 1 ν M A v3(t). 
(24) The train and test losses can be computed from the v0 and v1 fields µ=1 v1 µ(t)2 , L(t) = 1 k=1 λkv0 k(t)2 + σ2. (25) In the next section, we derive a statistical description of the dynamics in an appropriate asymptotic limit using dynamical mean field theory methods. A.1. DMFT Equations for the Asymptotic Limit Standard field theoretic arguments such as the cavity or path integral methods can be used to compute the effective statistical description of the dynamics in the limit of large M, N, P with fixed ratios α = P/M and ν = N M (see Appendix B). This computation gives us the following statistical description of the dynamics. v1(t) = u1(t) + 1 Z ds R0,2(t, s)v1(s) + σϵ , u1(t) GP(0, C0) , ϵ N(0, 1), v2 k(t) = u2 k(t) + λk Z ds R1(t, s)v0 k(s) , u2 k(t) GP 0, 1 v3(t) = u3(t) + 1 Z ds R2,4(t, s)v3(s) , u3(t) GP (0, C2) , v4 k(t) = u4 k(t) + Z ds R3(t, s)v2 k(s) , u4 k(t) GP 0, 1 tv0 k(t) = v4 k(t). A Dynamical Model of Neural Scaling Laws The correlation and response functions obey C0(t, s) = 1 k λk v0 k(t)v0 k(s) , C1(t, s) = v1(t)v1(s) , C2(t, s) = 1 v2 k(t)v2 k(s) R0,2(t, s) = 1 δv0 k(t) δu2 k(s) , R2,4(t, s) = 1 δv2 k(t) δu4 k(s) R1(t, s) = δv1(t) , R3(t, s) = δv3(t) These equations are exact in the joint proportional limit for any value of α, ν. A.2. Closing the Equations for the Order Parameters Though we expressed the dynamics in terms of random fields, we stress in this section that all of the dynamics for the correlation and response functions close in terms of integro-differential equations. To shorten the expression, we will provide the expression for β = 0, but momentum can easily be added back by making the substitution t t + β 2 t . First, our closed integral equations for the response functions are R0,2,k(t, s) = Z dt Θ(t t )R3(t , s) λk Z dt dt dt Θ(t t )R3(t , t )R1(t , t )R0,2,k(t , s) R1(t, s) = δ(t s) + 1 Z dt R0,2(t, t )R1(t , s) R2,4,k(t, s) = λk Z dt dt R1(t, t )Θ(t t ) λk Z dt dt dt R1(t, t )Θ(t t )R3(t , t )R2,4,k(t , s) R3(t, s) = δ(t s) + 1 Z dt R2,4(t, t )R3(t , s) R0,2(t, s) = 1 k λk R0,2,k(t, s) , R2,4(t, s) = 1 k R2,4,k(t, s) (27) We note that these equations imply causality in all of the response functions since R(t, s) = 0 for t < s. Once these equations are solved for the response functions, we can determine the correlation functions, which satisfy 2 ts C0,k(t, s) = λk Z dt dt R3(t, t )R1(t , t ) s C0,k(t , s) Z ds R3(s, s )R1(s , s ) t C0,k(t, s ) Z dt dt ds ds R3(t, t )R1(t , t )R3(s, s )R1(s , s )C0,k(t , s ) (w k)2δ(t)δ(s) 1 ν C3(t, s) 1 Z dt ds R3(t, t )R3(s, s )C1(t , s ) C1(t, s) = Z dt R1(t, t )R1(s, s )C0(t , s ) C2,k(t, s) = λk Z dt dt R1(t, t )R3(t , t )C2,k(t , s) λk Z ds ds R1(s, s )R3(s , s )C2(t, s ) Z dt dt ds ds R1(t, t )R3(t , t )R1(s, s )R3(s , s )C2,k(t , s ) C3(t, s) = Z dt ds R3(t, t )R3(s, s )C2(t , s ) (28) Solving these closed equations provide the complete statistical characterization of the limit. The test and train losses are given by the time-time diagonal of C0(t, t), C1(t, t). A Dynamical Model of Neural Scaling Laws A.3. Time-translation Invariant (TTI) Solution to Response Functions From the structure of the above equations, the response functions are time-translation invariant (TTI) since they are only functionals of TTI δ(t s) Dirac-Delta function and Θ(t s) Heaviside step-function. As a consequence, we write each of our response functions in terms of their Fourier transforms R(t, s) = R(t s) = Z dω 2π eiω(t s)R(ω). 
(29) Using the fact that δ(τ) = Z dω 2π eiωτ , Θ(τ) = lim ϵ 0+ e ϵτΘ(τ) = lim ϵ 0+ ϵ + iω (30) We will keep track of the regulator ϵ and consider ϵ 0+ at the end of the computation. The resulting DMFT equations for the response functions have the following form in Fourier space R1(ω) = 1 + 1 αR2,4(ω)R1(ω) R3(ω) = 1 + 1 ν R2,4(ω)R3(ω) R0,2(ω) = 1 λk ϵ + iω + λk R1(ω)R3(ω)R3(ω) R2,4(ω) = 1 λk ϵ + iω + λk R1(ω)R3(ω)R1(ω) (31) where ϵ 0 will be taken after. Combining these equations, we arrive at the simple set of coupled equations R1(ω) = 1 1 λk R3(ω)R1(ω) ϵ + iω + λk R1(ω)R3(ω) R3(ω) = 1 1 λk R1(ω)R3(ω) ϵ + iω + λk R1(ω)R3(ω) (32) After solving these equations for all ω, we can invert the dynamics of v0 k(t) to obtain its Fourier transform v0 k(ω) = Hk(ω) w k u4 k(ω) λk R3(ω)u2 k(ω) Hk(ω) 1 ϵ + iω + λk R1(ω)R3(ω) (33) where we defined the transfer functions Hk(ω). From this equation, we can compute v0 k(t) through inverse Fouriertransformation and then compute the correlation function to calculate the test error. An interesting observation is that the response functions R1(ω), R3(ω) alter the pole structure in the transfer function, generating ν, α dependent timescales of convergence. A.4. Fourier Representations for Correlation Functions While the response functions are TTI, the correlation functions transparently are not (if the time-time diagonal C0(t, t) did not evolve, then the loss L(t) wouldn t change!). We therefore define the need to define the double Fourier transform C(ω, ω ) for each correlation function C(t, s) C(ω, ω ) = Z dtds e iωt iω s C(t, s) , C(t, s) = Z dω 2π eiωt+iω s C(ω, ω ) (34) A Dynamical Model of Neural Scaling Laws Assuming that all response functions and transfer functions Hk have been solved for, the correlation functions satisfy the closed set of linear equations. C0(ω, ω ) = 1 k λk Hk(ω)Hk(ω ) (w k)2 + 1 ν C3(ω, ω ) + 1 αλk R3(ω)R3(ω )C1(ω, ω ) C1(ω, ω ) = R1(ω)R1(ω )C0(ω, ω ) C2(ω, ω ) = 1 k λk Hk(ω)Hk(ω ) 1 α(iω)(iω )C1(ω, ω ) + λk R1(ω)R1(ω ) (w k)2 + 1 ν C3(ω, ω ) C3(ω, ω ) = R3(ω)R3(ω )C2(ω, ω ) (35) These equations can be efficiently solved for all pairs of ω, ω after the response functions have been identified. Then one can take an inverse Fourier transform in both indices. B. Field Theoretic Derivation of DMFT Equations In this section, we derive the field theoretic description of our model. We will derive this using both the Martin-Siggia-Rose (MSR) path integral method (Martin et al., 1973) and the dynamical cavity method. For a recent review of these topics in the context of neural networks, see (Helias & Dahmen, 2020). B.1. Statistical Assumptions for DMFT The DMFT that we derive in the next few sections requires some assumptions on the structure of A and Ψ. To carry out the classic MSR path integral computation, we assume that the entries of both matrices are Gaussian with mean zero and covariance Aij Akl = δikδjl , ΨµkΨνl = δµνδklλk. (36) These are sufficient conditions for the DMFT description to hold and we will take them as our primary assumptions. However, we note that these restrictions are not strictly necessary and can be relaxed. In general, a more flexible cavity derivation in Appendix B.3 shows that independent entries from any well behaved distribution which admits a central limit theorem for sums of independent draws would also have the same DMFT description of the proportional limit. 
Prior works on DMFT of M-estimators with random data have demonstrated universality for any data matrix Ψ with a covariance that has bounded spectral norm (Gerbelot et al., 2022). B.2. Path Integral Derivation With the MSR formalism, we evaluate the moment generating functional for the field v0(t): Z[{j(t)}] = Z Dv0(t) δ v0(t) + 1 NP A AΨ Ψv0(t) exp Z dt j(t) v0(t) Note that at zero source, we have the important identity that Z[0] = 1. (38) We insert a Dirac delta functions to enforce the definitions of each of the fields {v1, v2, v3, v4} as in equation 8. Z[{j(t)}] = Z D[v0, . . . , v4, ˆv1 . . . ˆv4] δ( v0 + v4) exp Z dt j(t) v0(t) exp i Z dt ˆv1(t) v1(t) 1 M Ψv0(t) + ˆv2(t) v2(t) 1 α exp i Z dt ˆv3(t) v3(t) 1 M Av2(t) + ˆv4(t) v4(t) 1 ν A Dynamical Model of Neural Scaling Laws At this stage we can add sources j for each ˆvi variable, yielding a Z[j(t), j(t)]. Interpreting each source as modification of the respective evolution equation, we see that this modified moment-generating function remains equal to unity at any value of j, Z[0, j(t)] = 1. As a consequence, all correlation functions consisting only of ˆvi variables vanish. See (Crisanti & Sompolinsky, 2018) for further details and a worked example. We now average over the sources of disorder. We assume that the entries of A are i.i.d. with mean zero and variance 1. In the proportional limit, we can replace the entries of A as a draw from a Gaussian N(0, 1) by appealing to Gaussian equivalence. We furhther justify this in the cavity derivation in the next section. This allows us to evaluate the averages over the matrix A. exp i M Tr A Z dt ˆv3(t)v2(t) + ν 1v3(t)ˆv4(t) Z dtds[ˆv3(t) ˆv3(s) 1 M v2(t) v2(s) | {z } C2(t,s) +ν 1ˆv4(t) ˆv4(s) 1 N v3(t) v3(s) | {z } C3(t,s) N ˆv3(t) v3(s) | {z } i R3(s,t) v2(t) ˆv4(s) Similarly, we can calculate the averages over the data, which enters via the design matrices Ψ. Again in this proportional limit we can invoke Gaussian equivalence on Ψ to have it take the form Ψ ΦΛ1/2 where Φ has entries drawn from a unit normal. Taking the average then gives us exp i M Tr Ψ Z dt ˆv1(t)v0(t) + α 1v1(t)ˆv2(t) Z dtds[ˆv1(t) ˆv1(s) 1 M v0(t) Λv0(s) | {z } C0(t,s) +α 1ˆv2(t) Λˆv2(s) 1 P v1(t) v1(s) | {z } C1(t,s) P ˆv1(t) v1(s) | {z } i R1(s,t) v0(t) Λ ˆv2(s) We now insert delta functions for following bracketed terms: C0, C1, C2, C3 and R1, R3 using the following identity (e.g. for C0 at times s, t): 1 = Z d C0(s, t)d ˆC0(s, t) 4πi M 1 exp 1 2M Z dtds ˆC0(t, s) C0(t, s) 1 M v0(t) Λv0(s) . (42) Here the ˆCi, ˆRi integrals are taken over the imaginary axis. This yields a moment generating function (here we ll take j = 0): Z = Z D[C1, ˆC1, . . . ] exp h MS[C0, C1, C2, C3, R1, R3, ˆC0, ˆC1, ˆC2, ˆC3, ˆR1, ˆR3] i . (43) The constraint that Z = 1 means that S = 0 at the saddle point. S here is given by: S[. . . ] = 1 Z dtds h ˆC0(t, s)C0(t, s) + α ˆC1(t, s)C1(t, s) + ˆC2(t, s)C2(t, s) + ν ˆC3(t, s)C3(t, s) i + Z dtds h R1(t, s) ˆR1(s, t) R3(t, s) ˆR3(s, t) i + α log Z1 + ν log Z3 + 1 k log Z0,2,4;k. A Dynamical Model of Neural Scaling Laws We have chosen to take ˆRi(s, t) to have a different sign and s, t ordering convention than the ˆCi to simplify our notation later on. We have also used that Equations (40), (41) factorize over their respective indices, so each Z is a partition function over a single index. 
The individual Zi are given by: Z1 = Z D[v1, ˆv1] exp i Z dtds δ(t s) α 1 ˆR1(s, t) v1(t)ˆv1(s) Z dtds(ˆv1(t)ˆv1(s)C0(t, s) + v1(t)v1(s) ˆC1(t, s)) , (45) Z3 = Z D[v3, ˆv3] exp i Z dtds δ(t s) ν 1 ˆR3(s, t) v3(t)ˆv3(s) Z dtds(ˆv3(t)ˆv3(s)C2(t, s) + v3(t)v3(s) ˆC3(t, s)) , (46) Z0,2,4;k = Z D[v0,2,4, ˆv0,2,4] exp 1 Z dtds α 1λkˆv2 k(t)ˆv2 k(s)C1(t, s) + ν 1ˆv4 k(t)ˆv4 k(s)C3(t, s) Z dtds λkv0 k(t)v0 k(s) ˆC0(t, s) + v2 k(t)v2 k(s) ˆC2(t, s) + v4 k(t)v4 k(s) ˆC4(t, s) exp i Z dtds R3(t, s)v2 k(s)ˆv4 k(t) + λk R1(t, s)v0 k(s)ˆv2 k(t) . In the large M limit we evaluate this integral via saddle point. The saddle point equations give: C0(t, s) = 1 k λk v0 k(t)v0 k(s) Ci(t, s) = vℓ(t)vℓ(s) , ℓ= {1, 2, 3, 4} R1(t, s) = i v1(t)ˆv1(s) R3(t, s) = i v3(t)ˆv3(s) ˆR1(t, s) = i 1 k λkˆv2 k(t)v0 k(s) R0,2(t, s) ˆR3(t, s) = i 1 k λkˆv4 k(t)v2 k(s) R2,4(t, s). Here denotes an average taken with respect to the statistical ensemble given by the corresponding partition function Zi. Lastly, the saddle point equations for the ˆCi(t, s) variables are all quadratic functions of the variables {ˆv0, ˆv1, ˆv2, ˆv3} which vanish under the average defined by Z (Helias & Dahmen, 2020). Following the discussion below Equation 39, we take ˆCi(t, s) = 0, which will enforce ˆvi(t)ˆvi(s) = 0 and lead to the correct dynamical equations. To evaluate the remaining, we can integrate out the ˆvi variables. First let us look at Z1. Using the Hubbard-Stratonovich trick we can write the action in terms linear in ˆv1. This gives Z1 = Z D[v1, ˆv1, u1] exp i Z dtds ˆv1(t) h δ(t s)(v1(s) u(s)) α 1 ˆR1(t, s)v1(s) i Z dtds u(t)u(s)C 1 0 (t, s) + v1(t)v1(s) ˆC1(t, s)) (49) We now replace ˆC1 by its saddle point value of 0 and ˆR1 by R0,2. Integrating over ˆv gives a delta function: v1(t) = u1(t) + 1 Z ds R0,2(t, s)v1(s), u1(t) GP(0, C0). (50) A Dynamical Model of Neural Scaling Laws Analogously for v3 we get v3(t) = u3(t) + 1 Z ds R2,4(t, s)v3(s), u3(t) GP(0, C2) (51) For Z0,2,4;k after replacing ˆC0, ˆC2, ˆC4 with their saddle point values we get: Z0,2,4;k = exp 1 Z dtds α 1λkˆv2 k(t)ˆv2 k(s)C1(t, s) + ν 1ˆv4 k(t)ˆv4 k(s)C3(t, s) exp Z dtds i R3(t, s)v2 k(s)ˆv4 k(t) + iλk R1(t, s)v0 k(s)ˆv2 k(t) (52) Using the same Hubbard-Stratonovich trick on ˆv2 k gives: v2 k(t) = u2 k(t) + λk Z ds R1(t, s)v0 k(s), u2 k(t) GP 0, 1 On ˆv4 k we similarly get: v4 k(t) = u4 k(t) + Z ds R3(t, s) v2 k(s), u4 k(t) GP 0, 1 Lastly, the equations of motion for v0 k in terms of v4 k are known: tv0 k(t) = v4 k(t). (55) One can easily add momentum by replacing tv0 k(t) with (β 2 t + t)v0 k(t) without changing anything else about the derivation. B.2.1. INTERPRETATION OF THE RESPONSE FUNCTIONS Following (Crisanti & Sompolinsky, 2018; Helias & Dahmen, 2020), we can understand the ˆva(t)vb(s) correlators by adding in the single-site moment generating function (e.g. Equation (49)) a source jb(s) that couples to ˆvb at time s. As in the discussion below equation 39, differentiating va(t) with respect to that source corresponds to its response to a kick in the dynamics of vb at time s. We denote this by: Ri,j(t, s) = δvi(t) B.3. Cavity Derivation The cavity derivation relies on Taylor expanding the dynamics upon the addition of a new sample or feature. We will work through each cavity step one at a time by considering the influence of a single new base feature, new sample, and new projected feature. In each step, the goal is to compute the marginal statistics of the added variables. This requires tracking the linear response to all other variables in the system. 
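Before walking through the cavity steps, note that the single-site equations derived above (e.g. Equation (50)) are linear Volterra equations driven by Gaussian sources; on a discrete time grid each one reduces to a triangular linear solve. A minimal sketch for the v1 equation, with placeholder kernels (the true C0 and R_{0,2} are fixed by the self-consistency, not by these choices):

```python
import numpy as np

rng = np.random.default_rng(1)
T, dt, alpha = 200, 0.05, 0.5
t = dt * np.arange(T)

# Placeholder kernels (illustrative only): a stationary PSD covariance for the source
# and a causal response kernel. In the DMFT these are determined self-consistently.
C0 = np.exp(-np.abs(t[:, None] - t[None, :]))                            # C0(t, s)
R02 = -np.exp(-(t[:, None] - t[None, :])) * (t[:, None] > t[None, :])    # R_{0,2}(t, s), causal

# Draw u1 ~ GP(0, C0) on the grid.
L = np.linalg.cholesky(C0 + 1e-10 * np.eye(T))
u1 = L @ rng.standard_normal(T)

# Discretize v1(t) = u1(t) + (1/alpha) \int_0^t ds R_{0,2}(t, s) v1(s)
# as (I - (dt/alpha) R) v1 = u1, a lower-triangular linear system.
v1 = np.linalg.solve(np.eye(T) - (dt / alpha) * R02, u1)
print(v1[:5])
```

One such solve per Gaussian sample is enough to estimate any single-time statistic of the field.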
Adding a Base Feature Upon addition of a base feature with eigenvalue λ0 so that there are M + 1 instead of M features {v0 k, v2 k, v4 k} for k {0, 1, ..., M}, we have a perturbation to both v1 µ(t) and v3 n(t). Denote the perturbed versions of the dynamics upon addition of the M + 1st feature as v1 µ(t) and v3 n(t). At large M we can use linear-response theory to relate the dynamics at M + 1 features to the dynamics of the original M feature system v1 µ(t) v1 µ(t) + 1 Z ds v1 µ(t) v1ν(s)ψν 0v0 0(s) v3 n(t) v1 µ(t) + 1 Z ds v3 n(t) v3m(s)Am0 v2 0(s) (57) A Dynamical Model of Neural Scaling Laws The next order corrections have a subleading influence on the dynamics. Now, inserting these perturbed dynamics into the dynamics for the new (M + 1)st set of variables {v2 0(t), v4 0(t)}. For v2 0(t), we have v2 0(t) 1 α µ=1 ψµ 0 v1 µ(t) + 1 αM Z ds ψµ 0 v1 µ(t) v1ν(s)ψν 0v0 0(s) (58) There are now two key steps in simplifying the above expression in the proportional limit: 1. By the fact that the v1 µ(t) dynamics are statistically independent of the new feature ψµ 0 , we can invoke a central limit theorem for the first term which is mean zero and variance O(1). 2. Similarly, we can invoke a law of large numbers for the second term, which has O(1) mean and variance on the order of O(M 1). Therefore in the asymptotic limit it can be safely approximated by its mean. We note in passing that neither of these steps require the ψµ 0 variables to be Gaussian. Thus we obtain the following asymptotic statistical description of the v2 0(t) random variable v2 0(t) u2 0(t) + Z ds R1(t, s)v0 0(s) u2 0(t) GP(0, α 1λ0C1) , R1(t, s) 1 * v1 µ(t) v1µ(s) Following an identical argument for v4 0(t) we have v4 0(t) 1 ν n=1 An0v3 n(t) + 1 νM Z ds An0 v3 n(t) v3m(s)Am0v2 0(s) u3 0(t) + Z ds R3(t, s)v2 0(s) u3 0(t) GP(0, ν 1C3) , R3(t, s) = 1 v3 n(t) v3n(s) Adding a Sample Next, we can consider the influence of adding a new data point. We will aim to characterize a P + 1 data point system in terms of the dynamics when P points are present. Upon the addition of a new data point ψ0 the field v0 k(t) will be perturbed to v0 k(t). Again invoking linear response theory, we can expand the perturbed value around the P-sample dynamics v0 k(t) v0 k(t) + 1 α Z ds v0 k(t) v2 ℓ(s)ψ0 ℓv1 0(s) (61) Now, computing the dynamics of the new random variable v1 0(t) k=1 ψ0 kv0 k(t) + 1 αM kℓ ψ0 k v0 k(t) v2 ℓ(s) ψ0 ℓv1 0(s) u1 0(t) + 1 Z ds R0,2(t, s)v1 0(s) , R0,2(t, s) = 1 v0 k(t) v2 k(s) Adding a Projected Feature Now, we finally consider the effect of introducing a single new projected feature so that instead of N we now have N + 1 projected features. This causes a perturbation to {v2 k(t)} which we v2 k(t) v2 k(t) + 1 ν Z ds v2 k(t) v4 ℓ(s)A0ℓv3 0(s) (63) A Dynamical Model of Neural Scaling Laws Now, we compute the dynamics for the added variable v3 0(t) k=1 A0kv2 k(t) + 1 νM Z ds A0k v2 k(t) v4 ℓ(s)A0ℓv3 0(s) u3 0(t) + 1 Z ds R2,4(t, s)v3 0(s) u3 0(t) GP(0, C2) , R2,4(t, s) = 1 v2 k(t) v4 k(s) Putting it all together Now, using the information gained in the previous sections, we can combine all of the dynamics for each field into a closed set of stochastic processes. This recovers the DMFT equations of Appendix A.2. C. Final Losses (the t Limit of DMFT) In this section we work out exact expressions for the large time limit of DMFT. 
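These expressions lean on the final value theorem stated below. As a quick numerical sanity check of that identity on a toy relaxation function (the constants here are arbitrary and only for illustration):

```python
import numpy as np

# Toy transfer function H(tau) = A + B * exp(-lam * tau), so lim_{tau -> inf} H = A.
A, B, lam = 0.3, 0.7, 2.0

# Its Fourier transform with the causal regulator eps -> 0+:
#   H(omega) = A / (eps + i*omega) + B / (eps + i*omega + lam).
def iw_H(omega, eps=0.0):
    iw = eps + 1j * omega
    return 1j * omega * (A / iw + B / (iw + lam))

for omega in [1e-1, 1e-3, 1e-6]:
    print(omega, iw_H(omega))   # approaches A = 0.3 as omega -> 0
```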
By comparing with prior computations of the mean-field statics of this problem computed in (Atanasov et al., 2023; Zavatone-Veth & Pehlevan, 2023; Ruben & Pehlevan, 2023; Maloney et al., 2022; Simon et al., 2021), we show that the large time and large M limits commute, specifically that lim M,N,P limt L(M, N, P, t) = limt lim M,N,P L(M, N, P, t). We invoke the final value theorem and use the response functions as before. Final Value Theorem We note that for functions which vanish at t = , that lim ω 0 iω H(ω) = lim ω 0 τ e iωτ H(τ) = lim ω 0 τ H(τ) e iωτ = lim τ H(τ) (65) where we invoked integration by parts and used the assumption that limτ H(τ) = 0, a condition that is satisfied for the correlation and response functions in our theory. We can therefore use the identity limτ H(τ) = limω 0 iωH(ω) to extract the final values of our order parameters. lim τ Hk(τ) = lim ω 0 iωHk(ω) = lim ω 0 1 1 + λk(iω) 1R1(ω)R3(ω). (66) We also need to invoke a similar relationship for the final values of the correlation functions lim t,s C(t, s) = lim ω,ω 0(iω)(iω ) C(ω, ω ) C(ω, ω ) = Z dt Z ds e i(ωt+ω s)C(t, s) (67) where C is the two-variable Fourier transform. The final value of the test loss is limt L(t) = limt,s C0(t, s). C.1. General Case (Finite ν, α) Before working out the solution to the response functions, we note that the following condition is always satisfied ν(1 R3(ω)) = α(1 R1(ω)). (68) For ν = α, this equation implies that R1 = R3. For ν = α, we can have either R1 0 or R3 0 but not both. We consider each of these cases below. Over-parameterized Case ν > α: In this case, the response function R1 O(iω) as ω 0 and R3 1 α ν as ω 0. We thus define r lim ω 0(iω) 1R1(ω)R3(ω) (69) A Dynamical Model of Neural Scaling Laws Using the equation which defines R1, we find that the variable r satisfies the following relationship at ω 0 λkr 1 + λkr (70) After solving this implicit equation, we can find the limiting value of iωHk(ω) as H k = lim τ Hk(τ) = lim ω 0 iωHk(ω) = 1 1 + λkr (71) Next, we can work out the scaling of the correlation functions in the limit of low frequency. We define the following limiting quantities based on a scaling analysis performed on our correlation functions for small ω C 0 lim t,s C0(t, s) = lim ω,ω 0(iω)(iω )C0(ω, ω ) 0 dt ds C1(t , s ) = lim ω,ω 0 C1(ω, ω ) 0 dt ds C2(t , s ) = lim ω,ω 0 C2(ω, ω ) 0 dt ds C3(t , s ) = lim ω,ω 0 C3(ω, ω ) These limiting quantities satisfy the closed set of linear equations k λk(H k )2 (w k)2 + 1 k λk(H k )2C 1 + r2 k λ2 k(H k )2[(w k)2 + ν 1C 3 ] These equations can be solved for {C 0 , C 1 , C 2 , C 3 }. Simplifying the expressions to a two-variable system, we find k λk(H k )2 (w k)2 + 1 k λk(H k )2C 0 + r2 k λ2 k(H k )2 (w k)2 + 1 This expression recovers the ridgeless limit of the replica results of (Atanasov et al., 2023; Zavatone-Veth & Pehlevan, 2023) and the random matrix analysis of (Simon et al., 2023). Under-parameterized Case ν < α: Following the same procedure, we note that for ν < α that R3 O(iω) and R1 1 ν α. We thus find the following equation for r = limω (iω) 1R1(ω)R3(ω). λkr λkr + 1 (74) where as before H k = 1 1+λkr. The analogous scaling argument for small ω gives us the following set of well-defined limiting quantities C 0 lim t,s C0(t, s) = lim ω,ω 0(iω)(iω )C(ω, ω ) C 1 lim t,s C1(t, s) = lim ω,ω 0(iω)(iω )C1(ω, ω ) C 2 lim t,s C2(t, s) = lim ω,ω 0(iω)(iω )C2(ω, ω ) 0 dtds C3(t, s) = lim ω,ω 0 C3(ω, ω ). 
A Dynamical Model of Neural Scaling Laws where these limiting correlation values satisfy k λk(H k )2 (w k)2 + 1 k λk(H k )2C 1 + 1 k λ2 k(H k )2 (w k)2 + 1 This is again a closed linear system of equations for the variables {C 0 , C 1 , C 2 , C 3 }. In the next section, we recover the result for kernel regression where ν and the learning curve for infinite data α with respect to model size ν. C.2. Learning Curves for Kernel Regression ν, t In the t and ν limit we recover the learning curve for kernel regression with eigenvalues λk. To match the notation of (Canatar et al., 2021), we define lim ω 0(iω) 1R1(ω) ακ 1 (77) which generates the following self-consistent equation for κ λk λkα + κ. (78) Plugging this into the expression for the loss, we find (iω)(iω )C0(ω, ω ) 1 (κ + λkα)2 [(w k)2 + α 1λk C1(ω, ω )] C1(ω, ω ) = (iω)(iω )α2κ 2C0(ω, ω ) (79) Letting C limω,ω 0(iω)(iω )C(s, s ), we have (κ + λkα)2 (w k)2 + C α M λ2 k (λkα + κ)2 k λk(w k)2 κ2 (κ + λkα)2 , γ = α λ2 k (λkα + κ)2 (80) The variable κ decreases from 1 M P k λk, 0 as α [0, 1]. For α > 1 we have κ = 0. The quantity 1 1 γ comes from overfitting due to variance from the randomly sampled dataset. D. Early Time Dynamics (High-Frequency Range) In this section, we explore the early time dynamical effects of this model. Similar to how the late time dynamical effects could be measured by examining the low frequency ω 1 part of the response and correlation functions, in this section, we analyze the high frequency components ω 1. We start by noting the following expansions valid near ω R1(ω) 1 1 α(iω) R3(ω) 1 1 ν(iω) + O(ω 2) (81) A Dynamical Model of Neural Scaling Laws We let c = 1 M P k λk. These can be plugged into the transfer function for mode k Hk(ω) 1 iω + λk c(α 1 + ν 1)(iω) 1 1 iω + λk 1 + cλk(α 1 + ν 1) iω(iω + λk) + O ω 2 (82) Performing an inverse Fourier transform, we find the following early time asymptotics Hk(t) e λkt + cλk(α 1 + ν 1) Z dω iω(iω + λk)2 = e λkt cλk(α 1 + ν 1) iω(iω + λk) = e λkt cλk(α 1 + ν 1) = e λkt + c(α 1 + ν 1) 1 e λkt λkt e λkt (83) We see from this expression that the early time corrections always scale as 1/α or 1/ν and that these corrections build up over time. We also note that in this picture, Hk(t) is minimized in the limit of large model and large data α, ν (limited data and limited model size strictly harm performance). A similar expansion can be performed for all of the correlation functions C(ω, ω ) with ω, ω 1 which also give leading corrections which scale as 1/α and 1/ν. E. Buildup of Overfitting Effects In this section, we derive a formula for the gap between test loss L(t) and train loss ˆL(t). We start from the following formula v1(t) = u1(t) + 1 0 ds R0,2(t, s)v1(s) (84) Moving the v1(t) term to the other side, and using the fact that u1(t)u1(s) = C1(t, s), we find the following relationship between train and test loss L(t) = u1(t)u1(t) = v1(t)v1(t) 2 0 dt R0,2(t, t ) v1(t)v1(t ) 0 ds R0,2(t, t )R0,2(t, s ) v1(t )v1(s ) 0 dt R0,2(t, t )C1(t, t ) + 1 0 ds R0,2(t, t )R0,2(t, s )C1(t , s ). (85) To get a sense of these expressions at early and late timescales, we investigate the Fourier transforms at high ω 1 and low ω 1 frequencies respectively. E.1. High Frequency Range / Early Time The relationship between Fourier transforms at high frequencies ω 1 is C0(ω, ω ) = 1 R1(ω)R1(ω ) C1(ω, ω ) C1(ω, ω ) + c α(iω )C1(ω, ω ) + c α(iω )C1(ω, ω ) + O((iω) 2 + (iω ) 2) where c = 1 M P k λk. 
Taking a Fourier transform back to real time gives us the following early time differential equation for the test-loss train loss gap 2 ts[C0(t, s) C1(t, s)] = 2 ts C1(t, s) + c α( t + s)C1(t, s). (87) The above equation should hold for early times. We note that C0(t, t) C1(t, t) = L(t) ˆL(t) exactly recovers the test-train gap. A Dynamical Model of Neural Scaling Laws E.2. Low Frequency Range/Late Time At late time/low frequency, as we showed in Appendix C, the behavior of the C1 correlation function depends on whether the model is over-parameterized or under-parameterized. In the overparameterized case, the asymptotic train loss is zero while the asymptotic test loss is nonzero. In the underparameterized case, we have a limiting value for both the test and train loss which can be computed from the expressions in Appendix C. F. Timescale/Eigenvalue Density Interpretation We can use an alternative interpretation of the Fourier transforms derived in previous sections to obtain the timescale density for the dynamics. Since this is a linear model defined by an effective matrix d P Ψ Ψ v0, this is equivalent to computing the eigenvalue density. We start by expanding the transfer function for mode k in the basis of exponentials 0 du ρk(u) e ut. (88) We allow for Dirac-delta masses at u = 0 which correspond to the constant (unlearnable) components. Next, we note that the Fourier transform has the form dt e iωt Hk(t) = Z 0 du ρk(u) Z dt e (u+iω)t = Z iω + u. (89) We can recover the density ρk(s) by using the Sokhotski Plemelj theorem 1 πIm 1 iϵ+u s = δ(u s) which gives us ρk(u) = lim ϵ 0 1 π Im Hk(iu ϵ). (90) This allows us to interpret the spread of timescales from the random sampling of data and the random projection A. In the limit of α, ν we have ρk(s) = δ(s λk) but for finite α, ν the density spreads out. We visualize these densities for power law features in Figure 8. 10 2 10 1 100 101 u k = 1 k = 2 k = 3 k = 4 k = 5 k = 6 k = 7 k = 8 k = 9 k = 10 (a) N = 50, P = 100 k = 1 k = 2 k = 3 k = 4 k = 5 k = 6 k = 7 k = 8 k = 9 k = 10 (b) N = 10000, P = 10000 Figure 8. Timescale (eigenvalue) densities for each transfer function Hk(τ) with power law features with b = 1.2. For limited N, P there is a significant spread of timescales for each mode. For N, P the density converges to a Dirac mass at u = λk. F.1. Recovering the Marchenko-Pastur Law from DMFT Response Functions To further illustrate the validity of this perspective, we show that it is possible to recover known random matrix theory results using this technique. To illustrate this, we study the case where λk = 1 and take ν . In this case, we have the coupled equations H(ω) = 1 iω + R1(ω) , R1(ω) = 1 1 αR1(ω)H(ω) (91) A Dynamical Model of Neural Scaling Laws Combining these equations gives the single equation H(ω) = 1 iω + α α+H(ω) , = iωH(ω)2 + (αiω + α 1)H(ω) α = 0 h (αiω + α 1) + p (αiω + α 1)2 + 4iωα i (92) Now, evaluating this expression at iω = s iϵ gives H(is ϵ) = 1 2(s + iϵ) h ( αs iαϵ + α 1) + p ( αs αiϵ + α 1)2 4(s + iϵ)α i (93) The radical has an imaginary solution in the ϵ 0 limit provided that s [s , s+] , s = 1 1 α In this interval [s , s+], the density ρ(s) = limϵ 0 1 πIm H(is ϵ) has the form (s s )(s s+) 2πs , s [s , s+] (95) which is precisely the bulk of the Marchenko-Pastur law. G. Non-Proportional (Dimension-Free) Limit We can imagine a situation where the original features are already infinite dimensional (M is taken first). 
This would correspond more naturally to the connection between infinite dimensional RKHS s induced by neural networks at infinite width (Bordelon et al., 2020; Canatar et al., 2021; Cheng & Montanari, 2022). Further, we will assume a trace class kernel K(x, x ) = ψ(x) ψ(x ) for the base features ψ which diagonalizes over the data distribution p(x) as Z K(x, x )ϕk(x )p(x )dx = λkϕk(x) , k=1 λk < . (96) As before, we are concerned with the test and train losses k=1 λkv0 k(t)2 , ˆL(t) = 1 µ=1 v1 µ(t)2. (97) The appropriate scaling of our four fields of interest in this setting are v1(t) = Ψv0(t) , v2(t) = 1 v3(t) = Av2(t) , v4(t) = 1 N A v3(t). (98) Following the cavity argument given in the previous section, we can approximate the the correlation and response functions as concentrating to arrive at the following field description of the training dynamics tv0 k(t) = v4 k(t) v1(t) = u1(t) + 1 Z ds R0,2(t, s)v1(s) , u1(t) GP(0, C0) v2 k(t) = u2 k(t) + λk Z ds R1(t, s)v0 k(s) , u2 k(t) GP 0, 1 v3(t) = u3(t) + 1 Z ds R2,4(t, s)v3(s) , u3(t) GP(0, C2) v4 k(t) = u4 k(t) + Z ds R3(t, s)v2 k(s) , u4 k(t) N 0, 1 A Dynamical Model of Neural Scaling Laws which are exactly the same equations as in the proportional limit except with the substitution ν N and α P. The correlation and response functions have the form k=1 λk v0 k(t)v0 k(s) , C1(t, s) = v1(t)v1(s) k=1 v2 k(t)v2 k(s) , C3(t, s) = v3(t)v3(s) R0,2(t, s) = δv0 k(t) δu2 k(s) , R1(t, s) = δv1(t) R2,4(t, s) = δv2 k(t) δu4 k(s) , R3(t, s) = δv3(t) which will all be O(1) under this scaling. H. Effect of Ensembling and Bagging on Dynamics H.1. What Does/Doesn t Concentrate in the DMFT Limit? To help gain insight into bias and variance decompositions, we first provide a short primer on which entities concentrate over random draws of matrices A and Ψ. For any distinct randomly sampled system, the following objects will always be the same in the asymptotic limit 1. The response functions {R0,2(t, s), R1(t, s), R2,4(t, s), R3(t, s)} 2. The correlation functions {Ci(t, s)}i {1,2,3,4}. 3. The train and test loss dynamics While the above quantities behave as concentrating or self-averaging random variables, many important quantities are not the same across different realizations of {A, Ψ}. For example, 1. The (random) entries of the vectors {v0(t), v1(t), v2(t), v3(t), v4(t)}. 2. The Gaussian sources {u1(t), u2(t), u3(t), u4(t)} which appear in the large system size limit. In particular, the first implies that the model outputs f(x) will generally depend on random variations across datasets or model initializations. This means that we can consider drawing multiple realizations of, for example, projection matrices {Ae}E e=1 and then training E separate models using each of them. Averaging these vectors gives us e=1 v0 e(t) (101) This operation will intuitively average out noise from the random projection matrices Ae and in the limit of infinite ensembling E will completely eliminate it. H.2. Definition of Bias and Variance We adopt the language of the fine-grained bias-variance decomposition in (Adlam & Pennington, 2020b). There, a given learned function generally depends on both the dataset D and initialization seed θ0. We write this as f D,θ0. The role of random initialization is played by the A matrix in our setting. 
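A small simulation illustrating the self-averaging statements of Appendix H.1 (my own sketch, under the same normalization assumptions as before): the test loss barely varies across independent draws of {A, Ψ}, while individual coordinates of v0(t) do not concentrate.

```python
import numpy as np

def run(seed, M=400, N=120, P=240, t=20.0):
    rng = np.random.default_rng(seed)
    lam = np.arange(1, M + 1) ** -1.5
    wbar = np.ones(M)                             # fixed target shared across seeds
    Psi = rng.standard_normal((P, M)) * np.sqrt(lam)
    A = rng.standard_normal((N, M))
    B = (A.T @ A) @ (Psi.T @ Psi) / (N * P)       # assumed generator: d/dt v0 = -B v0
    evals, V = np.linalg.eig(B)
    c = np.linalg.solve(V, wbar.astype(complex))
    v0 = (V @ (np.exp(-evals * t) * c)).real
    return np.mean(lam * v0**2), v0[0]            # (test loss up to a constant, one coordinate)

results = np.array([run(s) for s in range(8)])
losses, coords = results[:, 0], results[:, 1]
print("loss  rel. std:", losses.std() / losses.mean())            # small: the loss self-averages
print("v0[0] rel. std:", coords.std() / np.abs(coords).mean())    # noticeably larger: entries do not
```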
For a given function, its variance over datasets and its variance over initializations are respectively given by Var Df ED(f D,θ0 ED[f D,θ0])2 (102) Varθ0f Eθ0(f D,θ0 Eθ0[f D,θ0])2 (103) Here ED[f D,θ0] and Eθ0[f D,θ0] can be viewed as infinitely bagged or infinitely ensembled predictors respectively. The bias of a function over datasets or initializations is given by the test error of ED[f D,θ0], Eθ0[f D,θ0] respectively. The irreducible bias is given by ED,θ0[f D,θ0]. A Dynamical Model of Neural Scaling Laws H.3. Derivation In this section, we consider the effect of ensembling over E random initial conditions and bagging over B random datasets. We let v0 e,b(t) represent the weight discrepancy for model e on dataset b. Here e runs from 1 to E and b runs from 1 to B. The (e, b)th vector has dynamics: d dtv0 e,b(t) = 1 v0 e,b(t). (104) Ensembling and bagging would correspond to averaging these v0s over these EB systems v0(t) = 1 EB b=1 v0 e,b(t). (105) The key vectors to track for this computation are v1 e,b(t) = 1 M Ψbv0 e,b(t) + σϵb , v2 e,b(t) = 1 α M Ψ b v1 e,b(t) v3 e,b(t) = 1 M Aev2 e,b(t) , v4 e,b(t) = 1 ν M A e v3 e,b(t). (106) We can further show that the v0 e,b and v0 e ,b have response functions that decouple across e, b. Intuitively, giving the dynamical system e, b a kick should not alter the trajectory of the separate e , b dynamical system, even if they share disorder {Ψ, A}. The DMFT description of the proportional limit yields the following integral equations for the v fields: tv0 e,b,k(t) = v4 e,b,k(t) v1 e,b(t) = u1 e,b(t) + 1 Z ds R0,2(t, s)v1 e,b(s) v2 e,b,k(t) = u2 e,b,k(t) + λk Z ds R1(t, s)v0 e,b,k(s) v3 e,b(t) = u3 e,b(t) + 1 Z ds R2,4(t, s)v1 e,b(s) v4 e,b,k(t) = u4 e,b,k(t) + Z ds R3(t, s)v2 e,b,k(s). (107) Here, the response functions R are to be computed within a single system. In what follows, we will use to denote averages over the disorder, and explicitly write out any averages over the ensemble members and datasets. The Gaussian variables in the DMFT have the following covariance u1 e,b(t)u1 e ,b (s) = δb,b C0 e,e (t, s) u2 e,b,k(t)u2 e ,b ,k(s) = δb,b λk α C1,e,e (t, s) u3 e,b(t)u3 e ,b (s) = δe,e C2,b,b (t, s) u4 e,b,k(t)u4 e ,b ,k(s) = δe,e 1 ν C3,b,b (t, s) (108) The covariances above C0,e,e , C1,e,e , C2,b,b , C3,b,b allow for different ensemble or dataset index but not both. We will use C0, C1, C2, C3 etc to represent the correlation functions within a single system. For instance, C0 e,e (t, s) = k λk D v0 e,b(t)v0 e ,b(s) E while C0 = 1 M P k λk D v0 e,b(t)v0 e,b(s) E . The correlation function of interest is thus C0,k,e,e (ω, ω ) = Hk(ω)Hk(ω ) (w k)2 + 1 ν δe,e C3(ω, ω ) + λk α C1,e,e (ω, ω ) C1,e,e (ω, ω ) = R1(ω)R1(ω )C0 e,e (ω, ω ) C2,b,b ,k(ω, ω ) = (iω)(iω )Hk(ω)Hk(ω )λk α δb,b C1(ω, ω ) + λ2 k R1(ω)R1(ω ) (w k)2 + 1 ν C3,b,b (ω, ω ) C3,b,b (ω, ω ) = R3(ω)R3(ω)C2,b,b (ω, ω ) (109) A Dynamical Model of Neural Scaling Laws We can combine the first two equations and the second two equations to identify the structure of the cross-ensemble and cross-dataset (across-system) correlations in terms of the marginal (within-system) correlation statistics C0,e,e (ω, ω ) = 1 1 γ0(ω, ω ) 1 M k λk Hk(ω)Hk(ω ) (w k)2 + 1 ν δe,e C3(ω, ω ) γ0(ω, ω ) = 1 αM k λ2 k Hk(ω)Hk(ω )R1(ω)R1(ω ) C2,b,b (ω, ω ) = 1 1 γ2(ω, ω ) 1 M R1(ω)R1(ω )λk(w k)2 + 1 α(iω)(iω )Hk(ω)Hk(ω )δb,b C1(ω, ω ) γ2(ω, ω ) = 1 ν R3(ω)R3(ω) 1 k λ2 k (110) These equations give the necessary cross-ensemble and cross-dataset correlations. 
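As a direct-simulation counterpart of this computation (a sketch under the same normalization assumptions as before; a single dataset and E independent projections A_e), averaging the members' v0 vectors before evaluating the loss already exhibits the variance reduction derived below:

```python
import numpy as np

rng = np.random.default_rng(3)
M, N, P, E, t = 400, 80, 200, 8, 30.0
lam = np.arange(1, M + 1) ** -1.5
wbar = rng.standard_normal(M)
Psi = rng.standard_normal((P, M)) * np.sqrt(lam)   # one shared dataset (B = 1)

def solve_v0(A):
    Bmat = (A.T @ A) @ (Psi.T @ Psi) / (N * P)     # assumed gradient-flow generator
    evals, V = np.linalg.eig(Bmat)
    c = np.linalg.solve(V, wbar.astype(complex))
    return (V @ (np.exp(-evals * t) * c)).real

members = [solve_v0(rng.standard_normal((N, M))) for _ in range(E)]
loss = lambda v0: np.mean(lam * v0**2)             # test loss up to an overall constant

print("mean single-model loss:", np.mean([loss(v) for v in members]))
print("ensembled (E=%d) loss :" % E, loss(np.mean(members, axis=0)))   # smaller
```

Because the loss is a convex quadratic in v0, the ensembled loss is never larger than the average single-model loss; the DMFT computation below quantifies how much of the gap (the projection variance) is removed as E grows.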
Now we can consider the effect of ensembling and bagging on the dynamics. To do so, consider the Fourier transform of the bagged-ensembled error v0 k(t) = 1 EB P eb v0 k,e,b(t), which has the Fourier transform v0 k(ω) = Hk(ω) u4 e,b,k(ω) + R3(ω)u2 e,b,k(ω) Computing the correlation function for this bagged-ensembled field random variable, we find v0 k(ω) v0 k(ω ) = Hk(ω)Hk(ω ) (w k)2 + 1 νE2B2 X e,e ,b,b δe,e C3,b,b (ω, ω ) + λk αE2B2 X ee bb δb,b C1,e,e (ω, ω ) = Hk(ω)Hk(ω ) (w k)2 + 1 νEB2 X b,b C3,b,b (ω, ω ) + λk αE2B ee C1,e,e (ω, ω ) = Hk(ω)Hk(ω )(w k)2 νE Hk(ω)Hk(ω )R3(ω)R3(ω )R1(ω)R1(ω ) 1 γ2(ω, ω ) ℓ λ2 ℓ(w ℓ)2 # + 1 ναEB Hk(ω)Hk(ω )C1(ω, ω )(iω)(iω ) (1 γ2(ω, ω )) 1 M ℓ λℓHℓ(ω)Hℓ(ω ) αB Hk(ω)Hk(ω )R1(ω)R1(ω ) 1 γ0(ω, ω ) 1 M ℓ λℓHℓ(ω)Hℓ(ω )(w ℓ)2 + λk ανEB Hk(ω)Hk(ω )C3(ω, ω ) 1 γ0(ω, ω ) 1 M ℓ λℓHℓ(ω)Hℓ(ω ) (112) The first term is the irreducible bias for mode k which is the loss for mode k when the learned function is averaged over all possible datasets and all possible projections. We see that the second term scales as 1 νE which will persist even if Bα . Similarly, there is a term that is order 1 αB which will persist even if νE . Lastly, there are two terms which depend on both B, E. This is similar to the variance that is explained by the interaction of the dataset and the random projection (Adlam & Pennington, 2020b). The test loss is then a Fourier transform of the above function k λk v0 k(t)2 . (113) If E, B , then we obtain the stated irreducible bias of the main paper lim E,B L(t) = 1 k λk(w k)2Hk(t)2. (114) This is the error of the mean output function over all possible datasets and random projections of a certain size. A Dynamical Model of Neural Scaling Laws 0.0 0.5 1.0 1.5 2.0 0.0 E = 1 E = 5 E = 10 E = 25 E = 50 (a) ν = 0.25 0.0 0.5 1.0 1.5 2.0 0.0 E = 1 E = 5 E = 10 E = 25 E = 50 (b) ν = 0.75 Figure 9. The infinite time limit of the loss when ensembling with isotropic features λk = 1 recovers prior results on ensembling and double descent (d Ascoli et al., 2020; Adlam & Pennington, 2020b). There is an overfitting peak (double descent) at α = ν. In the overparameterized regime where α < ν, the infinite ensembled model matches the performance of the ν limit. This is because the bias is limited by dataset size rather than model size. In the underparameterized regime α > ν, the infinite ensembled model does not achieve the loss of the infinite model due to a bias limited by ν. H.4. Ensembling is Not Always Compute Optimal For a compute budget C = NEt, we find that ensembling does not provide as much benefit as increasing the size of the model. From the results in the last section, we note that ensembling reduces the variance. For this section, we consider the P limit. We let B(N, t) represent the bias and V(N, t) represent the variance within a single ensemble. The loss at fixed compute then takes the form L(ν, C, t) = B(ν, t) + 1 νE V(ν, t). (115) For any ν which satisfies the condition that ν B(ν, t) 0 , ν V(ν, t) 0 (116) we have that ensembling is strictly dominated by increasing ν. I. White Bandlimited Model To gain intuition for the model, we can first analyze the case where λk = 1, which has a simpler DMFT description since each of the M features are statistically identical. We illustrate the dependence of the loss on model size ν and training time t for α < 1 in Figure 10. 
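Stepping back briefly to the compute-accounting argument of Appendix H.4, here is a toy numerical illustration using the decomposition L = B(ν, t) + V(ν, t)/(νE) from Equation (115). The functional forms of B and V below are hypothetical placeholders chosen only to be monotonically decreasing in ν; they are not the model's actual bias and variance.

```python
import numpy as np

# Hypothetical, monotonically decreasing bias/variance profiles (placeholders).
Bfun = lambda nu: nu ** -0.8
Vfun = lambda nu: 2.0 * nu ** -0.3

def loss(nu, E):
    return Bfun(nu) + Vfun(nu) / (nu * E)   # Eq. (115) with these placeholder B, V

# Fix training time t and total compute C = nu * E * t, so nu * E is fixed.
for nuE in [4.0, 16.0]:
    for E in [1, 2, 4]:
        nu = nuE / E
        print(f"nu*E = {nuE:5.1f}  E = {E}  nu = {nu:5.2f}  loss = {loss(nu, E):.4f}")
# At each compute level, E = 1 (all compute spent on model size) gives the lowest loss,
# consistent with the domination condition d B/d nu <= 0 and d V/d nu <= 0.
```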
We note that the loss can be non-monotonic in ν at late training times, but that monotonicity is maintained for optimal early stopping, similar to results on optimal regularization in linear models (Advani et al., 2020) and random feature models (Mei & Montanari, 2022; Simon et al., 2023).

I.1. Derivation

In the case of all λk = 1 we have the following definitions

R1(ω) = 1 − (1/α) R1(ω)R3(ω) / (iω + R1(ω)R3(ω)) ,
R3(ω) = 1 − (1/ν) R1(ω)R3(ω) / (iω + R1(ω)R3(ω)) .

Figure 10. The white bandlimited model (λk = 1) with α = 0.8 and varying model size ν with no explicit noise σ = 0 exhibits double descent at late time. Optimal early stopping, like optimal regularization, recovers monotonic scaling with ν. [Panel: loss as a function of ν at training times t = 2, 5, 10, 25, 50, 75, 100, 250, 500, together with the optimally early-stopped loss.]

Writing R1 = 1 + (ν/α)(R3 − 1) allows us to solve for R3 exactly:

R3 [iω + R3(1 + (ν/α)(R3 − 1))] = iω + (1 − ν⁻¹) R3(1 + (ν/α)(R3 − 1)). (118)

This is a cubic equation that can be solved for R3 as a function of ω. In the limit of α → ∞ this simplifies to:

R3 iω + R3² = iω + (1 − ν⁻¹) R3 , so that R3 = ½ [(1 − ν⁻¹ − iω) + √((1 − ν⁻¹ − iω)² + 4iω)]. (119)

I.2. Timescale Corrections in the Small ν Regime

By expanding the above in the limit of small ν we get that R3 goes as

R3 ≈ iω / (ν⁻¹ − 1 + iω) , ν → 0. (120)

From this approximate response function, we find that the transfer function takes the form

H(τ) = ∫ dω/(2π) e^{iωτ} / (iω + iω/(ν⁻¹ − 1 + iω)) = ∫ dω/(2π) (ν⁻¹ − 1 + iω) e^{iωτ} / (iω [ν⁻¹ + iω]) = (1 − ν) + ν e^{−τ/ν}, (121)

where in the last line we used the residue theorem. We note that in this perturbative approximation, this transfer function is always greater than the transfer function at ν → ∞, which is e^{−τ}. Thus finite ν leads to higher bias in this regime. We define bias and variance precisely in Appendix H.2.

I.3. Timescale Corrections in the Fully Expressive Regime ν > 1

For ν ≫ 1, we can approximate R3(ω) ≈ 1 − ν⁻¹(1 + iω)⁻¹, so we have

H(τ) = ∫ dω/(2π) e^{iωτ} / (iω + 1 − ν⁻¹(1 + iω)⁻¹) = ∫ dω/(2π) e^{iωτ} (1 + iω) / [(iω + 1 − ν^{−1/2})(iω + 1 + ν^{−1/2})]
= ½ e^{−τ(1 + 1/√ν)} + ½ e^{−τ(1 − 1/√ν)} = e^{−τ} cosh(τ/√ν) , (122)

where we used the residue theorem after closing the contour in the upper half-plane. In Figure 11, we show that this perturbative approximation does capture a slowdown in the dynamics for large but finite ν.

Figure 11. Slower timescales in the ν > 1 regime for white bandlimited features. [Panel: dynamics over time for ν = 2, 5, 10, 50, 100, 200, compared with e^{−2t} cosh(t/√ν)².]

J. Power-Law Bottleneck Scalings

In this section we calculate the scaling of the loss with the various limiting resources (time, model size, and data) when using power law features. Since the power-law features give a trace class kernel (i.e. Σ_{k=1}^∞ λk < ∞), we use the non-proportional limit formalism in Appendix G, which gives an expression for L(t, N, P) with M already considered infinite. The resulting expressions are not a formal proportional thermodynamic limit, and finite N, P corrections exist in the form of fluctuations from one random realization of the system to another. However, these corrections decay rapidly enough at finite N, P for this mean field theory to be accurate and descriptive in realistic systems (Bordelon et al., 2020; Simon et al., 2023; Cheng & Montanari, 2022). We plot this variability of random finite size experiments as highlighted standard deviations in the main text figures.

J.1. Time Bottleneck

The time bottleneck is defined as the limiting dynamics in the absence of any model or data finite size effects. To eliminate those effects, we simply study the α, ν → ∞ limit L∞(t) = lim_{P,N→∞} L(t, P, N).
(123) In this limit, the response functions simplify to R1(ω)R3(ω) 1 so that Hk(ω) = 1 iω + λk = Hk(τ) = e λkτΘ(τ). (124) Further, in this limit, we have that C0(t, s) = 1 M P k λk Hk(t)Hk(s)(w k)2 since all the variance terms (which depend on ν 1, α 1) drop out. Thus we have the following loss at time t, k λk(w k)2e 2λkt Z 1 dkk a exp 2k bt t (a 1)/b. (125) where the final scaling with time can be obtained through either change of variables or steepest descent methods (Bordelon & Pehlevan, 2022a). A Dynamical Model of Neural Scaling Laws J.2. Model Bottleneck In this section we take α, t . This leaves us with the following equation for r limω 0(iω) 1R3(ω). λkr λkr + 1 Z dk kb/r + 1 r1/b = r N b. (126) Now, the large time limit of the transfer functions Hk(τ) can be obtained from the final-value theorem lim t Hk(τ) = lim ω 0 iω iω + λkriω = 1 1 + λkr. (127) Now, integrating over the eigenvalue density to get the total loss (and disregarding prefactors) (1 + k br)2 1 dk k2b a + Z = 1 2b + 1 a[N (a 1) N 2b] + 1 a 1N (a 1) N min{a 1,2b} (128) For difficult tasks where a 1 < 2b, we thus expect a powerlaw scaling of the form L N (a 1) in this regime. J.3. Data Bottleneck In this section we take ν, t . This leaves us with the following equation for r limω 0(iω) 1R1(ω). λkr λkr + 1 Z dk kb/r + 1 r1/b = r P b. (129) Now, the large time limit of the transfer functions Hk(τ) can again be obtained from the final-value theorem lim t Hk(τ) = lim ω 0 iω iω + λkriω = 1 1 + λkr (130) Now, integrating over the eigenvalue density to get the total loss gives (1 + k br)2 1 dk k2b a + Z = 1 2b + 1 a[P (a 1) P 2b] + 1 a 1P (a 1) P min{a 1,2b} (131) For difficult tasks with a < 2b + 1, the loss will therefore scale as P (a 1) in this data-bottleneck regime. K. Optimization Extensions K.1. Discrete Time In this section, we point out that DMFT can also completely describe discrete time training as well. In this section we consider discrete time gradient descent with learning rate η v0(t + 1) = v0(t) ηv4(t) v4(t) = 1 ν M A v3(t) , v3(t) = 1 v2(t) = 1 α M Ψ v1(t) , v1(t) = 1 M Ψv0(t) (132) A Dynamical Model of Neural Scaling Laws Following either the MSR or cavity derivation, we obtain an analgous set of limiting DMFT equations defined for integer times t Z, v0 k(t + 1) = v0 k(t) ηv4 k(t) + δ(t + 1)w k v1(t) = u1(t) + α 1 X s R0,2(t, s)v1(s) v2 k(t) = u2 k(t) + X s R1(t, s)v0 k(s) v3(t) = u3(t) + X s R2,4(t, s)v3(s) v4 k(t) = u4 k(t) + X s R3(t, s)v2 k(s) (133) The delta function in this context is defined as ( 1 t = 1 0 else (134) ensures that the initial condition v0 k(0) = w k is satisfied. These iteration equations can be closed for the response functions and correlation functions and solved over T T matrices. Alternatively, we can also solve this problem in an analogous frequency space. Analogous to the Fourier transform method, the equations in discrete time can be closed in terms of the Z-transform t= z tv(t) (135) Applying this transform gives us the following expression for the v0 k fields. v0 k(z) = zw k ηu4 k(z) ηR3(z)u2 k(z) z 1 + ηλk R1(z)R3(z) Hk(z) zw k ηu4 k(z) ηR3(z)u2 k(z) (136) Similar to the Fourier case, the final losses can be extracted as the z 1 limit of these objects. K.2. Momentum As mentioned in appendix B, it is straightforward to extend the DMFT treatment beyond just gradient descent dynamics to include a momentum term with momentum β. We first consider this replacement in continuous time. This requires applying the following replacement: tv0 k(t) = v4 k(t) (β 2 t + t)v0 k(t) = v4 k(t). 
(137) This slightly modifies the expressions for the response functions. For example, in Fourier space the response functions become: R0,2(ω) = 1 λk ϵ + iω + β(iω)2 + λk R1(ω)R3(ω)R3(ω) R2,4(ω) = 1 λk ϵ + iω + β(iω)2 + λk R1(ω)R3(ω)R1(ω). (138) In discrete time, momentum updates can be expressed as v0(t + 1) = v0(t) ηb(t) b(t) = v4(t) + µb(t 1) (139) A Dynamical Model of Neural Scaling Laws where b(t) is the filtered version of the loss gradient (the v4(t) field) with momentum coefficient µ and η is the learning rate. The dependence on the b(t) field can be eliminated by turning this into a second order difference equation v0(t + 1) v0(t) µ v0(t) v0(t 1) = ηv4(t). (140) Again, the final result can be expressed in terms of the Z-transformed transfer functions Hk(z) which have the form Hk(z) = 1 z 1 µ + µz 1 + ηR1(z)R3(z). (141) K.3. One Pass SGD In this section we derive online SGD with projected features. At each step a random batch of B samples are collected (independent of previous samples), giving a matrix Ψ(t) RB M of sampled features. The update at step t is v0(t + 1) = v0(t) + η 1 B Ψ(t) Ψ(t) v0(t). (142) The DMFT limit gives the following statistical description of the fields, which decouple over time for the v1(t), v2 k(t) but remain coupled across time for v3(t), v4 k(t) v1(t) = u1(t) , u1(t) N(0, C0(t, t)δ(t s)), v2 k(t) = u2 k(t) + λkv0 k(t) , u2 k(t) N 0, 1 B λk C1(t, t)δ(t s) v3(t) = u3(t) + 1 s R2,4(t, s)v3(s) , u3(t) N(0, C2(t, s)) v4 k(t) = u4 k(t) + X s R3(t, s)v2 k(s) , u4 k(t) N 0, 1 v0 k(t + 1) = v0 k(t) ηv4 k(t). This system cannot exhibit overfitting effects as we have the statistical equivalence between the covariance of v1 and the test loss: ˆL(t) = v1(t)2 = u1(t)2 = C0(t, t) = L(t) (144) We note that this is very different than the case where data is reused at every step, which led to a growing gap between train and test loss as we derive in Appendix E. We visualize some example results for one-pass SGD with power law features in Figure 3. While we see that the same scaling laws with t and N hold, the dependence on batchsize B is much weaker: the model never reaches an asymptote that scales with B but rather experiences SGD noise that scales with η/B for learning rate η. We summarize the key similarities and differences between the one-pass SGD and multi-pass batch GD settings 1. If the learning rate is small and a continuous time limit of the dynamics is taken, then the SGD dynamics will agree with the P limit of our full batch gradient flow theory. This is a setting where finite data and SGD noise are negligible. 2. If learning rate is non-negligible and batch size is finite, then SGD noise cannot be neglected and the SGD dynamics will be different than full pass GD. The SGD dynamics will be described by a discrete time DMFT given above. 3. In general, the multi-pass version of the theory can have a train loss and test loss gap while the SGD theory never has a gap between training and test loss. 4. The SGD test loss can be limited by t, N, but the effect of finite batch size is basically some additive variance in the model outputs. Finite dataset size in the full batch GD can lead to a bottleneck scaling law ( like L P (a 1)). L. Kernel Analysis of Feature Learning Networks In Section 5.1, we observed that feature learning networks can achieve better loss and compute-optimal scaling. In such settings, it may be useful to observe the after kernel, namely the NTK at the end of training. 
This object can often shed insight into the structure of the learned network function (Atanasov et al., 2022; Long, 2021) and its generalization. In some cases, it has been observed that the final kernel stabilizes during the course of training (Fort et al., 2020), potentially allowing one to deduce scaling laws from the spectrum and task-model alignment of this after-kernel, though other papers have observed contrary results (Vyas et al., 2022). Motivated by this, we study the NTKs of the finite-width networks trained for 64 epochs with the animate-inanimate CIFAR-5m discrimination task.

Figure 12. a) The observed power law spectrum on a held out test set of the after-kernel for a width N = 128 ResNet trained on CIFAR-5m. Early on in training, the spectrum flattens quite rapidly. At later times, the spectral decay remains relatively constant. b) The fraction of the task unexplained, as defined in Equation 20. Throughout training, the top eigenmode of the after-kernel explains more and more of the task. c) The test loss of the network. We see that the observed scaling of this quantity is faster than that predicted from analyzing the after-kernel. d) The kernel-target alignment of the after-kernel improves throughout training time. The error bars here denote different ensemble members. Their relatively small size implies that the kernel trajectory is relatively deterministic over different initialization seeds. e) The norm of the after-kernel throughout training is relatively constant for this task. [Panels show checkpoints from t = 0 to t = 4e+06 and widths N = 4 through N = 128; reference power laws k^{−2.0} (panel a) and k^{−0.15} (panel b) are indicated.]

We observe in Figure 12 a) that the spectrum becomes flatter, with a decay exponent close to 1.4, down from 2.0 for the initial kernel. The fraction of the task power unexplained is also observed to have a lower exponent in Figure 12 b); however, there is also a low-rank spike indicative of the kernel aligning to this discrimination task. From these scalings we can obtain the a and b exponents and get a prediction for the scaling of the test loss. We plot this in grey in Figure 12 c). The observed scaling (in black) is much better than that predicted by the after-kernel. This is an indication that the after-kernel continues evolving in this task, improving the scaling exponent of the test loss. The kernel-target alignment (Cortes et al., 2012), as measured by

A = yᵀ K y / (yᵀ y ‖K‖_F), (145)

is plotted in Figure 12 d). Here y is the vector of target labels on a held-out test set, and K is the Gram matrix of the after-kernel on this test set. We indeed observe a consistent increase in this quantity across time. This gives an indication that understanding the evolution of the after-kernel will be useful.

M. Numerical Recipes

M.1.
Iteration of DMFT Equations on Time Time matrices The simplest way to solve the DMFT equations is to iterate them from a reasonable initial condition (Mignacco et al., 2020; Bordelon & Pehlevan, 2022b). We solve in discrete time for T T matrices {R0,2, R1, R2,4, R3, C0, C1, C2, C3} which have entries [R]t,s = R(t, s), [C]t,s = C(t, s), etc. We let Θ(t, s) = ηΘ(t s) where η is the learning rate. 1. Solve for the response functions by updating the closed equations as matrices by iterating the equations. R0,2,k Θ 1 + λk R3R1 1 R3, k λk R0,2,k, R1 I α 1R0,2 1 , R2,4,k λk [I + λk R1ΘR3] 1 R1Θ , R2,4 = 1 R3 I ν 1R2,4 1 . 2. Once these response functions have converged, we can iterate the equations for the correlation functions C0,k [I + λkΘR3R1] 1 (w k)211 + Θ ν 1C3 + λk Θ [I + λkΘR3R1] 1 , k=1 λk C0,k, C1 R1C0R 1 , C2,k [I + λk R1ΘR3] 1 λk (w k)2λ2 k11 + λ2 k ν ΘC3Θ R 1 [I + λk R1ΘR3] 1 , After iterating these equations, one has the discrete time solution to the DMFT order parameters and any other observable can then be calculated. M.2. Fourier Transform Method To accurately compute the Fourier transforms in the model/data bottleneck regime (α < 1 or ν < 1) we have that R1(ω)R3(ω) iωr as ω 0 so we must resort to analyzing the principal part and the delta-function contribution to the integral. Construct a shifted and non-divergent version of the function H(ω). H(ω) = H(ω) + 1 ϵ + iω(1 + r) H(ω) = 1 ϵ + iω + R1(ω)R3(ω) 1 ϵ + iω(1 + r) = iωr R1(ω)R3(ω) (ϵ + iω + R1(ω)R3(ω))(ϵ + iω(1 + r)), (148) where r = limω 0 1 iωR1(ω)R3(ω). We see that rather than diverging like H(ω), this function H(ω) vanishes as ω 0. We therefore numerically perform Fourier integral against H(ω) and then add the singular component which can be computed separately. H(τ) = Z dω 2π eiωτ H(ω) + Z dω ϵ + iω(1 + r) = Z dω 2π eiωτ H(ω) + 1 1 + r (149) where we used the fact that 1 ϵ + iω(1 + r) = π 1 + rδ(ω) i 1 + r P(ω 1) , ϵ 0 (150) A Dynamical Model of Neural Scaling Laws The Dirac mass is trivial to integrate over giving 1 2(1+r). Lastly, we must perform an integral of the type i 1 + r P Z ω = 1 π(1 + r) 0 dω sin(ωτ) ω = 1 2(1 + r) (151) Adding these two terms together, our transfer function has the form H(τ) = 1 1 + r + Z dω 2π eiωτ H(ω). (152) The last integral can be performed numerically, giving a more stable result. N. Compute Optimal Scaling from Sum of Power-Laws We suppose that the loss scales as (neglecting irrelevant prefactors) L = t rt + N r N + P r P + L (153) Our goal is to minimize the above expression subject to the constraint that compute C = Nt is fixed. Since C is fixed we can reduce this to a one-dimensional optimization problem min N C rt N rt + N r N (154) The optimality condition NL = 0 is rt C rt N rt 1 r NN r N 1 = 0 rt rt+r N = t C rt rt+r N (155) From this last expression one can evaluate the loss at the optimum L (C) C rtr N rt+r N . (156)
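As a short numerical check of this compute-optimal exponent (the exponent values below are placeholders), minimizing the two-term loss over N at fixed C = Nt and fitting power laws recovers N* ∝ C^{r_t/(r_t+r_N)}, which follows from the optimality condition above, and L*(C) ∝ C^{−r_t r_N/(r_t+r_N)} as in Equation (156):

```python
import numpy as np

rt, rN = 0.4, 0.8           # placeholder exponents
C = np.logspace(4, 10, 13)  # compute budgets

Nstar, Lstar = [], []
for c in C:
    N = np.logspace(0, np.log10(c), 4000)     # candidate model sizes; t = C / N
    L = (c / N) ** -rt + N ** -rN             # L(t, N) = t^{-rt} + N^{-rN} at fixed compute
    i = np.argmin(L)
    Nstar.append(N[i]); Lstar.append(L[i])

fit_N = np.polyfit(np.log(C), np.log(Nstar), 1)[0]
fit_L = np.polyfit(np.log(C), np.log(Lstar), 1)[0]
print("N* exponent:", fit_N, " theory:", rt / (rt + rN))        # ~ 0.333
print("L* exponent:", fit_L, " theory:", -rt * rN / (rt + rN))  # ~ -0.267
```

With r_t < r_N, the optimal training time grows faster with compute than the optimal model size, consistent with the asymmetric compute-optimal scaling discussed in the main text.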