# strong_model_collapse__a694e296.pdf

Published as a conference paper at ICLR 2025

STRONG MODEL COLLAPSE

Elvis Dohmatob1,2,3, Yunzhen Feng4, , Arjun Subramonian5, , Julia Kempe1,4

1Meta FAIR 2Concordia University 3Mila 4NYU 5UCLA Work done while interning at Meta. Correspondence to elvis.dohmatob@concordia.ca

Within the scaling laws paradigm, which underpins the training of large neural networks like Chat GPT and Llama, we consider a supervised regression setting and establish a strong form of the model collapse phenomenon, a critical performance degradation due to synthetic data in the training corpus. Our results show that even the smallest fraction of synthetic data (e.g., as little as 1 per 1000) can still lead to model collapse: larger and larger training sets do not enhance performance. We further investigate whether increasing model size, an approach aligned with current trends in training large language models, exacerbates or mitigates model collapse. In a simplified regime where neural networks are approximated via random projections of tunable size, we both theoretically and empirically show that larger models can amplify model collapse. Interestingly, our theory also indicates that, beyond the interpolation threshold (which can be extremely high for very large datasets), larger models may mitigate the collapse, although they do not entirely prevent it. Our theoretical findings are empirically verified through experiments on language models and neural networks for images.

1 INTRODUCTION

The term Model Collapse refers to a critical degradation in the performance of AI models, particularly when a significant portion of their training data consists of synthetic data generated by other models. As detailed in Shumailov et al. (2023), this phenomenon arises as the model gradually overfits to patterns found in synthetic data, which may not fully represent the richness or variability of real-world data. Over successive training cycles, this feedback loop results in the model reinforcing errors, biases, or oversimplifications from the synthetic data. Consequently, the model s ability to generalize to real-world data is compromised, as it increasingly relies on the distorted distribution provided by prior AI generations rather than learning accurate representations of the real world.

This phenomenon was observed empirically (Hataya et al., 2023; Mart ınez et al., 2023a;b; Bohacek & Farid, 2023; Briesch et al., 2023; Guo et al., 2023) and described theoretically (Alemohammad et al., 2023; Bertrand et al., 2023; Dohmatob et al., 2024a;b). The connection to the breakdown of neural scaling laws (Kaplan et al., 2020) has been pointed out and analyzed in Dohmatob et al. (2024b): as data becomes more synthetic, larger training sets do not enhance performance.

The issue is especially concerning in large-scale AI systems like Chat GPT and Llama (Touvron et al., 2023; Dubey & et al., 2024), which rely heavily on vast amounts of training data to maintain their performance. If synthetic data is used in training these models, even in small quantities, the model can start producing gibberish or nonsensical outputs, contains misinformation, or reflect stereotypes. This is because the model effectively starts to amplify its own mistakes (Shumailov et al., 2024). This feedback loop results in a gradual loss of model fidelity, reducing its ability to generalize or adapt to new, unseen test environments.

1.1 MAIN CONTRIBUTIONS

In this work, we establish a series of results which shed more light on model collapse, bringing the phenomenon closer to a solid theoretical foundation. We consider the following important questions:

(Q1) Is model collapse inevitable or can it be fixed by strategically mixing synthetic and real data? (Q2) Are larger models more prone to model collapse than smaller ones?

Published as a conference paper at ICLR 2025

Our theoretical analysis focuses on the solvable setting of linear regression with and without random projections, with the latter serving as an approximation of neural networks by means of random feature maps (Maloney et al., 2022; Bach, 2023). Also, in accordance with the current neural scaling laws paradigm (Kaplan et al., 2020; Hoffmann et al., 2022) whichs underlies the training of LLMs, where models and dataset sizes become larger over time, we focus on the setup where the total amount of data (synthetic + real data) used for training grows arbitrarily.

Let us summarize our main findings. Result #1: Strong Model Collapse. First, we establish a robust negative result which shows that model collapse generally persists even on a mixture of synthetic and real training data, as long as the fraction of training data which is synthetic does not vanish (cf. Sections 3.1 and 3.2). By synthetic data, we mean any training data from a distribution which deviates from the distribution of real data, i.e. data on which the test performance is evaluated. Thus, model collapse cannot generally be mitigated by simple adjustments such as data weighting (Jain et al., 2024). On the other hand, we show (Section 5) that sophisticated iterative mixing strategies like Ferbach et al. (2024) can mitigate model collapse. However, apart from additional computational overhead, such a strategy requires access to a supply of clean/real data whose size grows at least at same rate as the synthetic data.

0.2 0.4 0.6 Test error (train on real data only)

Test error (train on real + synth.)

synthetic dataset size n2 = 50

0.2 0.4 0.6 Test error (train on real data only)

synthetic dataset size n2 = 200

0.2 0.4 0.6 Test error (train on real data only)

synthetic dataset size n2 = 400

Figure 1: Pareto diagram: Understanding the role of model size in model collapse. We compare the test error (on the real / true data distribution), for a random projections model (Equation (5) of Section 2.2) when training is done on a mix of synthetic and real data (y-axis), versus real data only (x-axis); in both cases, the total amount of training data is fixed to n = 500. On the scatter plots, square points correspond to very highquality synthetic data (i.e from a distribution which is close to the true data distribution), diamonds correspond to high-quality synthetic data, triangles correspond to low-quality, while stars correspond to very low-quality synthetic data. The black lines correspond to the Pareto frontiers for each level of quality of the synthetic data; the higher the frontier above the diagonal in the given setting, the more serious is the model collapse. The colorbar is the log of the parametrization rate ψ = m/n, where m captures the size of the model.

Result #2: Model Size and Model Collapse. In Section 3.2, we disentangle the effect of a model s size on its ability to cope with model collapse. We show that in general, bigger models will suffer more from model collapse as soon as the deviation between the distribution of the synthetic data and real data is significant. Crucially, our theory also predicts that past the interpolation threshold point, this tendency can be reversed: large models become more robust to model collapse. Put together, these results predict the existence of a double-descent curve regarding the model collapse phenomenon. This is illustrated in Figures 1 and 2. Thus, the model collapse profile depends critically on design choices like model size.

Experimental Validation. Our theoretical results are empirically confirmed with experiments in:

Toy settings, including random projections model on Gaussian data, and shallow networks fully trained on the MNIST dataset (Deng, 2012). Refer to the end of Section 3.2 and Appendix A.2. Realistic setting of GPT-2 models trained on Babi Stories (Zhang et al., 2024a), a reproduction of Tiny Stories (Eldan & Li, 2023) using the Mixtral-8x7B open language model (Jiang et al., 2024)). Refer to Section 4.

Approach. From a technical standpoint, our theoretical analysis focuses on regression problems in the classical linear setting introduced in Dohmatob et al. (2024a) for studying model collapse, and also the setting of neural networks in a simplified regime which can be approximated by random projections (Maloney et al., 2022; Bach, 2023). We employ the tools of operator-valued free probability theory (OVFPT) (Mingo & Speicher, 2017) to obtain a new bias-variance decomposition Etest B + V + ζ, of the test error evaluated on the real / true data distribution, of a model trained on a mixture of real and synthetic data. The extra term ζ then induces model collapse.

Published as a conference paper at ICLR 2025

0 250 500 750 1000 Network width m

0.0 0.1 0.5 1.0

0 250 500 750 1000 Network width m

0 250 500 750 1000 Network width m

0 250 500 750 1000 Network width m

Test error Etest

Figure 2: Illustration of our new bias-variance decomposition Etest B + V + ζ for neural networks in the simplified random projections regime (cf. Section 3.2), trained on a mixture of real and synthetic data. The sum B + V corresponds to the classical bias variance decomposition in this setup when all the training data is real. The extra term ζ is responsible for model collapse when training is done on a mixture of real and synthetic data. The scalar c2 characterizes the quality of the synthetic data (cf. Definition 1), via its mismatch with the real data distribution. The vertical line corresponds to the interpolation threshold m = n, where m is the model size and n is the total sample size. Notice the well-known double-descent curve in the bias.

1.2 RELATED WORK

The theoretical study of model collapse in the setting of high-dimensional supervised-learning with linear regression and kernel ridge regression was initiated in Dohmatob et al. (2024a). This work derives analytical formulas that quantitatively describe iterative retraining on synthetic data in both under-parameterized and over-parameterized regimes, considering both lowand high-dimensional asymptotics. It places itself within an important body of works studying kernel ridge regression (on clean data), which serves as an effective proxy for neural networks in various regimes, for instance in the infinite-width limit (Neal, 1996; Williams, 1996; Jacot et al., 2018; Lee et al., 2018) or in the lazy regime of training (Chizat et al., 2019) and are a testbed to study interesting phenomena observed in deep learning. For instance, (Rahimi & Recht, 2008; Rudi & Rosasco, 2017; Maloney et al., 2022) study scaling laws for regression in the random feature model and (Bach, 2023) analyses double descent in this setting. Scaling laws have been shown for kernel models under the Gaussian design, e.g. in Caponnetto & de Vito (2007); Spigler et al. (2020); Cui et al. (2022) for regression and (Cui et al., 2023) for classification.

Very few theoretical works tackle the analysis of models trained on mixtures of original (real / clean) and synthetic data. Bertrand et al. (2023) analyze the training process at the distribution level and provide stability results under a locality assumption in parameter space. Seddik et al. (2024) analyze the mixing of discrete original and synthetic data, and provide upper bounds on the amount of synthetic data that can be included to avoid model collapse. Let us also mention the recent works (Jain et al., 2024; Ferbach et al., 2024) which are potential methods for mitigating model collapse. Jain et al. (2024) analyze linear regression on isotropic Gaussian data for mixtures of clean and synthetic data by minizing a strategically weighted sum of losses (one term for each data source, real and synthetic), while Ferbach et al. (2024) can be seen as a multi-step version thereof where at each stage, the synthetic data generator is distilled by interpolating with real data. These methods are analyzed in Section 5, where we outline their shortcomings regarding model collapse.

Finally, a few works go beyond the mixing scenario and analyze how to curate or filter synthetic data to avoid model collapse (Feng et al., 2024; Zhang et al., 2024b; Alemohammad et al., 2024; Gillman et al., 2024; Yang et al., 2025; Askari-Hemmat et al., 2025), but a rigorous study of their effectiveness is still lacking.

2 THEORETICAL SETUP

2.1 DATA DISTRIBUTIONS

Consider an iid sample from D1 = {(xi, yi) | 1 i n1} of size n1 from the true data distribution P1 and an independent iid sample D2 = {(xi, yi) | n1 + 1 i n} of size n2 from another data distribution P2 (which we shall hereafter call the synthetic data distribution), where n := n1 + n2 is the total amount of training data. Here, Pk = PΣk,w k,σ2 k is the distribution on Rd R given by

(Features) x N(0, Σk),

(Labels) y = x w k + ϵ, with ϵ N(0, σ2 k) independent of x. (1)

Each Σk is a d d positive-definite covariance matrix which captures the intrinsic variations of the input feature vector x. The σk s control the level of label noise in each distribution.

Published as a conference paper at ICLR 2025

Structure of the Label Shift. For conciseness, we will assume the following priors on the w k s

True labelling function: w 1 N(0, Γ), Mismatch between real and synthetic: δ := w 2 w 1 N(0, ), independent of w 1,

for some d d positive-semidefinite matrices Γ and . Remark 1. To ease the presentation of our results, we shall assume that the matrices Σ1, Σ2, Γ, and are diagonal matrices, and therefore commute. Furthermore, except otherwise explicitly stated, we shall assume equal covariance matrices, and take Σ1 = Σ2 = Σ as in Dohmatob et al. (2024a).

The matrix Γ captures the structure of the ground-truth labelling function in the real / test distribution P1. Together with the label-noise levels σ2 1 and σ2 2, the matrix = cov(w 2 w 1) captures the covariance structure of the disparity between the true data distribution P1 and the synthetic data distribution P2 regarding the conditional distribution p(y|x); the marginal distribution of x stays the same under P1 and P2 due the assumption Σ1 = Σ2 = Σ. For example, the self-consuming-loops setup of Dohmatob et al. (2024a) corresponds to taking proportional to the precision matrix of the input features Σ 1. Thus, the size of the fluctuations of each component δj of the difference w 2 w 1 is inversely proportional to the standard deviation of the corresponding feature. Another important setup is the case where the fluctuations are isotropic, i.e taking Id.

Quality of Synthetic Data. Due to the a priori general structure of , the label corresponding to an input x will be different for both distributions, even in the absence of label-noise. On average, the L2-norm of this difference is Ew 1,w 2Ex N(0,Σ) [(x w 1 x w 2)2] = tr Σ . We therefore define

Definition 1. The quality of synthetic data is defined as c2( ) = (1/d) tr Σ , which captures the disparity between the synthetic data distribution P2 and the real data distribution P1 (small values of c2( ) are better). For example, if = c2Σ 1 as in Dohmatob et al. (2024a), then c2( ) = c2.

2.2 MODELS AND PERFORMANCE MEASURE

Given this training data, the goal of a learner is to construct an estimator bw. This can be seen as a linear model from x 7 x bw. Evaluated on the real / true data distribution P1 (which coincides with the distribution from which the real component D1 of the training dataset D is drawn), the test error of a model bf : Rd R is defined by

Etest( bf) = EDEx N(0,Σ1)[( bf(x) x w 1)2]. (2)

This will be our main object of study, for different models bf. The outermost expectation ED is to quench the randomness in the training dataset D used to train the model.

We consider two families of analytically tractable models: (1) classical linear models obtained via penalized regression in the input space, and (2) models obtained via penalized regression in a feature space given by random projections. The latter allows us to study the role of model size in model collapse, by varying the output dimension of the random projection mapping. This output dimension m controls the size of a neural network in a simplified regime (Maloney et al., 2022; Bach, 2023).

(1) Classical Linear Model. We start with a setup motivated by Dohmatob et al. (2024a). We are interested in the penalized linear model (ridge) bf CL : x 7 x bw with parameter vector bw given by

bw = arg min w Rd 1 n

i=1 (x i w yi)2 + λ w 2, (3)

trained on the total dataset D = D1 D2. Of course, the unregularized limit λ 0+ corresponds to ordinary least-squares (OLS). We shall work in the following so-called proportionate scaling limit

(Proportionate Scaling Limit for Classical Linear Model) For fixed ϕ (0, ), p2 (0, 1),

d, n, n1, n2 , n2/n p2, n1/n p1 = 1 p2, d/n ϕ. (4)

The extreme cases p1 0+ and p2 0+ correspond to training on only synthetic (resp. real) data. In particular, p1 0+ corresponds to the setting considered in Dohmatob et al. (2024a). Note that in the isotropic setting where Σ Id, ϕ controls the speed of learning on clean data. Indeed, for

Published as a conference paper at ICLR 2025

small ϕ, the scaling law in this case is known (Hastie et al., 2022) to be Etest σ2 1ϕ + O(ϕ2). As we shall see (Corollary 1), this scaling law gets deformed in the presence of synthetic data in the training dataset, leading to model collapse.

(2) Random Projections Model. We consider neural networks in a simplified regime which can be approximated via random projections (Maloney et al., 2022; Bach, 2023), i.e f(x) = x Sv. Here, S is a d m random matrix with iid entries from N(0, 1/d); it maps an input-vector x Rd to a random feature vector z = Φ(x) := S x Rm. Only the read-out weights v Rm are learned, by fitting on the dataset D. Consider the model bf RP : x 7 Φ(x) bv, where bv is given by

bv = arg min v Rm 1 n

i=1 (v Φ(xi) yi)2 + λ v 2. (5)

Note that such a simplified neural network model has been proposed in the literature as a theoretical testbed for studying intriguing properties of neural networks, like scaling laws (Maloney et al., 2022) and double-descent (Bach, 2023). Also see Section 1.2. It can be shown that the extreme case m/n reduces to the classical linear model.

We shall work in the following asymptotic regime:

(Proportionate Scaling Limit for Random Projections Model)

d, m, n, n1, n2 , n1/n p1, n2/n p2, d/n ϕ, m/d γ, m/n ψ, (6)

for some constants ϕ, γ, ψ (0, ) and p1, p2 (0, 1), with p1 + p2 = 1 and ψ = ϕγ.

Note that the ratio ψ/ϕ md captures the size of the network, though the number of trainable parameters (the read-out layer) is m γd.

3 A NEW BIAS-VARIANCE DECOMPOSITION AND THE EMERGENCE OF STRONG MODEL COLLAPSE

3.1 CLASSICAL LINEAR MODELS

We begin with an analysis of the test error Etest( bf CL) for the classical linear model defined in (3) trained on a mixture of synthetic and true / real data, but evaluated on test data from the true data distribution only. We will establish a new bias-variance decomposition with an additional term which quantitatively reveals the emergence of model collapse (Shumailov et al., 2023; 2024).

Let us first recall some standard notations. For any t 0 and integer k, let dfk(t; Σ) be the order-k degrees of freedom of Σ at t, defined by dfk(t; Σ) := tr Σk(Σ + t Id) k. Define

u = u(n, λ; Σ) := df2(κ; Σ)/n 1 df2(κ; Σ)/n, (7)

where κ = κ(n, λ; Σ) is the unique positive solution to the fixed-point equation

κ λ = κ df1(κ; Σ)/n. (8)

The following result (proved in the appendix, alongside all other theoretical results in this work) will be exploited in the sequel to show that the use of synthetic data in model training can lead to catastrophic effects regarding test error. Theorem 1. Define σ2 := p1σ2 1 + p2σ2 2 and let κ, u 0 be as previously constructed. In the proportionate scaling limit (4), the test error w.r.t the true data distribution P1, of the classical linear model bf CL defined in (3) is given by Etest( bf CL) E + ζ, with

E = B + V, V = σ2 df2(κ; Σ)/n 1 df2(κ; Σ)/n, B = κ2 tr ΓΣ(Σ + κId) 2

1 df2(κ; Σ)/n ,

ζ = p2 2 (1 + p1u) tr Σ3(Σ + κId) 2 + p2u tr Σ(p1Σ + κId)2(Σ + κId) 2.

Published as a conference paper at ICLR 2025

Note that for = 0 (i.e w 2 = w 1), which corresponds to assuming that the real data and the surrogate data have the same distribution, the above theorem gives Etest( bf CL) E B+V which is the classical bias-variance decomposition (Hastie et al., 2022; Richards et al., 2021) for ridge regression on n samples from the distribution PΣ,w 1,σ2. The extra term ζ appearing in Theorem 1 is responsible for model collapse! In Appendix D.2, we show how Theorem 1 recovers the main results of Dohmatob et al. (2024a) for special choices of the displacement matrix .

Strong Model Collapse. In particular, in the scaling laws regime where ϕ 0+, it holds that ζ p2 2 tr . In this case, if tr remains bounded away from zero, then so is ζ unless p2 0+, i.e we discard all synthetic data from the training dataset. This is strong model collapse. It hints that model collapse as exposed by Shumailov et al. (2023; 2024); Hataya et al. (2023); Mart ınez et al. (2023a;b); Bohacek & Farid (2023); Briesch et al. (2023); Guo et al. (2023) cannot be fixed by naively mixing synthetic and real data during training. We show in Section 3.2 that this observation continues to hold in the setting of random projections model bf RP defined in (5). Finally, in Section 5 we study what happens when the synthetic data and real data are strategically mixed during training.

Proving Theorem 1. It turns out that the analysis of the classical linear model s test error Etest( bf CL) in Theorem 1 amounts to the analysis of the trace of rational functions of sums of random matrices. Although the limiting spectral density of sums of random matrices is a classical computation using subordination techniques (Marˇcenko & Pastur, 1967; Kargin, 2015), this is not enough for the full analysis; a more involved analysis is required. For example, some of the quantities we must analyze are of the following form (where Mj := X j Xj/n, M := M1 + M2;

A and B deterministic matrices): r(3) j (A, B) := E tr AMj(M + λId) 1B(M + λId) 1Mj. The

difficulty will be even greater in the setting of random projections bf RP because it leads to more complicated terms. To the rescue, in Appendix E we shall employ operator-valued free probability theory (OVFPT) to compute the exact high-dimensional limits of quantities like the definition of r(3) j (A, B) above . The tools of OVFPT have been used in the recent machine learning theory literature to obtain precise asymptotics for the test error of neural networks (trained on real data only) in various linearized settings (Adlam & Pennington, 2020; Tripuraneni et al., 2021; Lee et al., 2023). The idea is to construct a block matrix Q each of whose blocks is a constant or is proportional X1, X 1 , X2, or X 2 , and one of the blocks Q 1[i, j] of Q 1 is equal to the original matrix W = AMj(M + λId) 1B(M + λId) 1Mj. Such a Q is referred to as a linear pencil for W. Because of the structure of Q, OVPT allows us to compute the limiting value of the expectation of the traces of the square blocks of Q 1, and we ultimately extract r(3) j (A, B) lim E tr Q 1[i, j].

Example: The Isotropic Case. To help unpack Theorem 1, consider the following concrete setup

Σ = Id, Γ = (r2/d)Id, = (c2/d)Id,

for some constants r, c > 0. The constant c2 captures how close the distribution of the synthetic data P2 is to the distribution of the real data P1; thus it captures the quality of the synthetic data. This gives u ϕ/((1 + κ)2 ϕ), where κ > 0 uniquely satisfies the fixed-point equation κ λ = κϕ/(1 + κ); this is a quadratic equation that characterizes the well-known Marchenko-Pastur law (Marˇcenko & Pastur, 1967). The quantities appearing in the formulae presented in Theorem 1 then take the following simple forms: V = σ2ϕ/((1 + κ)2 ϕ) and B = κ2r2/((1 + κ)2 ϕ), and

ζ = p2(1 + p1u) + (p1 + κ)2u p2c2/(1 + κ)2.

In particular, in the unregularized limit λ 0+ corresponding to OLS, we get κ (ϕ 1)+. To further make things concrete, consider the under-parametrized case where ϕ (0, 1) in the proportionate scaling regime (4). The over-parametrized case ϕ (1, ) is treated in Appendix D.1. We deduce the following corollary to Theorem 1. Corollary 1. Suppose ϕ (0, 1). Then, in the limit (4) and λ 0+, the test error with respect to the true data distribution P1, of the classical linear model bf CL defined in (3) is given by

Etest( bf CL) σ2ϕ/(1 ϕ) + p2 2 + p2p1ϕ/(1 ϕ) c2.

Moreover, for fixed c > 0 and small ϕ (0, 1), it holds that Etest( bf CL) σ2d/n + p2 2c2 + O(ϕ2). In particular, if c2 = Ω(1), i.e bounded away from zero (corresponding to low-quality synthetic data), then Etest( bf CL) = Ω(p2 2c2): the scaling law plateaus unless p2 0+, i.e unless all but a vanishing proportion of synthetic data is discarded from the training dataset.

Published as a conference paper at ICLR 2025

Figure 3: Strong model collapse in the classical linear model (empirical confirmation of Corollary 1). The training dataset comprises of n = n1 + n2 samples from a mixture of n2 = p2n synthetic samples and n1 = n n2 real samples. The real samples are from the same distribution as the real / true samples of the training dataset, while the synthetic samples are from a distribution with the same covariance structure and label noise level σ = 1, but an incorrect labelling function (epistemic error). The quality of the synthetic data is controlled by the scalar c, with c 0 corresponding to synthetic data of perfect quality (higher values correspond to lower quality synthetic data). Solid curves correspond to experiments, and broken curves correspond to our theoretical predictions of Corollary 1; notice the perfect match. We see that even a small amount of low-quality synthetic data is enough to cause model collapse, whereby the test error of the model deviates from a perfect diagonal (ideal scaling law, corresponding to p2 = 0, i.e training on real data only).

The result is empirically illustrated in Figure 3, where we see that even a small amount of low-quality synthetic data is enough to cause model collapse, whereby the test error of the model deviates from a perfect diagonal (ideal scaling law, corresponding to p2 = 0, i.e training on real data only).

Remark 2. Corollary 1 can be extended to the the non-isotropic case, but the statement is much longer and is thus omitted here.

3.2 RANDOM PROJECTIONS MODEL

We now turn to the more challenging setting of the random projections model bf RP given in (5). As mentioned before (cf. Section 2.2), such models are considered in our work as a simplification of the high-dimensional dynamics of actual neural networks, which still allows us to capture the effect of model size in the model collapse phenomenon.

We will need the following scalars which capture the high-dimensional statistics of the model bf RP .

Definition 2. Let (e, τ, u, ω) be the unique positive solution to the following fixed-point equations

1/e = 1 + ψτ tr ΣK 1, 1/τ = 1 + tr K0K 1, with K0 := eΣ, K := γτK0 + λId, (9)

u = ψe2 tr Σ(γτ 2L + ωId)K 2, ω = τ 2 tr (γωK2 0 + λ2L )K 2, with L := (1 + u)Σ. (10)

Here, tr A := tr A/d is the normalized trace. Also define θ := λ/(γτe) > 0, ω := ω/(γτ 2) > 0.

As usual, we denote σ2 := p1σ2 1 + p2σ2 2. Also, for any p [0, 1] define a d d positive definite matrix T(θ; p) := pΣ+θId, and T(θ) := T(θ; p)|p=1 = Σ+θId. The following result is a nontrivial extension of Theorem 1 to the case of random projections.

Theorem 2. In the proportionate scaling limit (6), the test error w.r.t the true data distribution P1, of the random projections model bf RP defined in (5) is given by Etest( bf RP ) E + ζ, with

E B + V, where B = (1 + u)θ2 tr ΓΣT(θ) 2 + ω tr ΓΣ2T(θ) 2,

V = tr Σ2T(θ) 2 + (ω θu) tr ΣT(θ) 2 σ2/e ,

ζ = p2(p2 + p1u) tr Σ3T(θ) 2 + p2(ω + 2p1uθ) tr Σ2T(θ) 2 + p2uθ2 tr ΣT(θ) 2.

We now explore a few important consequences of Theorem 2.

A Double-Descent Curve. The bias-variance decomposition presented in Theorem 2 is empirically illustrated in Figures 2 and 4 for the Gaussian setting (1) (see Appendix B.1 and A.1 for details on the experimental setup and additional results in this setting). Notice the perfect match with experiments. The shape of the bias curve in Figure 2 (leftmost plot) is reminiscent of the wellknown double-descent (Bach, 2023) in the unregularized setting λ 0+. The divergence at the interpolation threshold m = n (i.e. ψ = 1) is because the bias term B, the variance term V , and the extra term ζ (responsible for model collapse) all diverge to infinity at this point.

Published as a conference paper at ICLR 2025

Strong Model Collapse. Observe that the first term in the expression for ζ given in Theorem 2 is lower-bounded by p2 2 tr Σ3(Σ + θId) 2, which scales linearly with the square of the proportion p2 n2/n of synthetic data in the training dataset D. However, unless p2 0+, i.e unless the proportion p2 of synthetic data in the training dataset vanishes, the performance of the model eventually plateaus above the baseline E (corresponding to the setting where all training data is real, i.e no synthetic data in training dataset). This is strong model collapse.

0 250 500 750 1000 Network width m

Test error Etest

0.0 0.1 0.5 1.0

Figure 4: Impact of model size (network width m) on model collapse. As usual, solid curves correspond to experimental results (5 runs), while broken curves correspond to predictions of our theory (here, Corollary 4). Error bars correspond to 5 independent runs. Also see Figures 2 and 7.

Since the factor tr Σ3(Σ + θId) 2 only depends on the design choices of the model (via the scalar θ defined previously), we expect that different design choices (e.g., model size) will lead to different model collapse profiles.

Are Larger Models More Prone or Less Prone to Model Collapse? Figure 1 shows the results of a small experiment to investigate this. The input dimension is d = 600, and the covariance matrix is identity Id (isotropic features). The total number of training examples is fixed to n = 500. The matrix is taken to be of the form = (c2/d)Σ 1 (similar results are observed for different covariance matrices) for different values of c2 as follows: c2 = 0 (synthetic data of very high quality), represented with square markers; c2 = 0.1 (high quality synthetic data), represented with diamonds; c2 = 0.5 (low quality), represented by triangles; and c2 = 1 (very low-quality synthetic data), represented by stars. As indicated on the figure, the leftmost plot corresponds to the regime where there is much fewer synthetic than real samples (n2 = 50 synthetic samples versus n1 = 450 real samples). Here, for both very high-quality and high-quality (squares and diamonds), the optimal tradeoff is struck by larger models (i.e, larger values of ψ). For lower-quality data (triangles and stars), the frontier shifts upwards and from left to right; intermediately sized models become optimal for coping with model collapse.

In the middle plot, size of the synthetic dataset is comparable to the size of the real dataset (n2 = 200 versus n1 = 300). For high-quality synthetic data, larger models are still better than smaller models. However, for this setting, the frontier shifts upwards and from left to right, and the optimal model size is intermediate. For the rightmost plot, the size of the synthetic dataset is considerably larger than the real dataset (n2 = 400 versus n1 = 100). The results are similar to the case n2 = 200 except that the Pareto frontiers are higher over the diagonal (indicating more serious model collapse). In all cases, very small models are never optimal: they are not good even in the classical sense when training is done only on real data, and the presence of synthetic data only makes this worse.

Special Cases of Theorem 2 In the limit p2 0+ (i.e., no synthetic data; all the training data is real), ζ 0 in Theorem 2, and we recover the main result of Bach (2023) as a special case, namely Etest( bf RP ) B +V , with B and V as given in the theorem. Note that even in this special case, our result is more general since it covers the entire regularization path while the formulae in Bach (2023) are only for the unregularized case λ 0+. On the other hand, Theorem 2 is a generalization of Theorem 1, as can be seen by taking ψ . Refer to Appendix G.2 for details.

4 EXPERIMENTAL RESULTS

Our theoretical framework is developed within the context of high-dimensional linear regression and random projections models using Gaussian data. Our first departure from the confines of our theory are experiments with two-layer neural networks trained on the MNIST dataset (Deng, 2012) both in the random feature model (with Re LU activations) and with fully trained networks. These are presented in Appendix A.2. We find that the general trends observed in our asymptotic theory still hold: (1) there is significant model collapse, which only diminishes as the fraction of synthetic data approaches 0; (2) larger models exhibit a more severe model collapse (Figures 8 and 9).

We now provide evidence that our theory is applicable to large-scale problems, particularly in the context of language modeling with GPT-2 models. The Babi Stories dataset (Zhang et al., 2024a), a reproduction of Tiny Stories (Eldan & Li, 2023) using the Mixtral-8x7B open language model (Jiang et al., 2024) enables us to study language modeling with relatively small models in a compute-

Published as a conference paper at ICLR 2025

109 6 108 2 109 3 109 4 109

Number of Tokens

0.0 0.001 0.005 0.01 0.02 0.05 0.1 0.2 0.5 0.9 1.0 109 1010 1011 Number of Tokens

Number of Layers

Figure 5: Results on Babi Stories with GPT-2 models. Synthetic Babi Stories is generated with a trained GPT2-small with the same set of prompts. (Left) Impact of the proportion of synthetic data p2 on model collapse in a language model with 12 layers. (Right) Impact of model size (number of layers) on model collapse. Here the model is trained on synthetic data only (i.e p2 = 1). The loss is evaluated on the Tiny Stories test set.

efficient and environmentally friendly way. It comprises stories generated by prompting large models to create narratives in simple language that a three-year-old child could understand, effectively simplifying the underlying language model.

Setup. We train a GPT-2-small model with 124 million parameters on the Babi Stories dataset as the generator. Using the same story prompts, which include the story beginning and character names, the generator creates our synthetic dataset. We then mix this synthetic dataset with the original Babi Stories, train, and evaluate model perplexity on a validation set derived from the original Babi Stories. Detailed information on optimization and model configurations is provided in Appendix B.3.

Impact of Synthetic Data Proportion. We investigate the effect of varying the synthetic data proportion (p2) on the model s scaling in Figure 5 (left). Here, the x-axis represents the number of tokens used to train the model. In this experiment, the synthetic data is of high quality, as evidenced by the low training loss and coherent text generations, corresponding to the small c2 (cf. Definition 1) case in our illustrative Figure 1. Consequently, even moderate amounts of synthetic data delay the progression of the scaling laws, and we expect this to eventually lead to plateaus or at least very bad bad (i.e small) exponents in the final scaling laws as predicted in Dohmatob et al. (2024b) in the special case of training on synthetic data only.

Impact of Model Size. We next examine the impact of model size on training with synthetic data. In addition to the GPT-2-small model (12 layers), we introduce two larger models: one with 18 layers (166 million parameters) and another with 24 layers (204 million parameters). The embedding dimension and number of attention heads remain constant across all models. We generate a synthetic dataset 10 times larger than the original one to support the scaling of tokens. As shown in Figure 5 (right), larger (deeper) models maintain a lower test loss until the dataset size increases likely exceeding the interpolation threshold at which point smaller models begin to exhibit lower loss and reduced overfitting. This aligns with the predictions of Theorem 2 (also refer to Figure 1, 2, and the discussion just after the theorem), which suggest that larger models tend to amplify model collapse beyond the interpolation threshold. In Figure 5, we observe this amplification when the number of tokens exceeds 3 1010. Conversely, the theory predicts that over-parameterized models help mitigate collapse, a trend we observe when the number of tokens is below 1 1010, leading to improved performance of larger models.

5 CAN STRATEGIC DATA MIXING SCHEMES PREVENT MODEL COLLAPSE?

Having established the occurrence of strong model collapse both theoretically and empirically, we now explore strategies to mitigate it and leverage synthetic data under stronger assumptions. We begin by assuming clear information about the data source and consider the following strategic iterative mixing, inspired by Ferbach et al. (2024). In this approach, a model is fitted on a mixture of synthetic and real data. In the next iteration, the labels for the synthetic data are replaced with the labels predicted by the previous iteration of the process, and so on.

For concreteness, take Σ1 = Σ2 = Σ = Id for the covariance matrices, and = (c2/d)Σ 1 = (c2/d)Id. In this setup, the proposal of Ferbach et al. (2024) then becomes: At iteration t + 1, we mix n2 = p2n samples of synthetic data from a source having quality parameter c2 = c2 t, with n1 = n n2 samples of real data to construct a penalized linear model bw(t+1) according to (3). This trained model generates the synthetic data with c2 = c2 t+1.

Published as a conference paper at ICLR 2025

Total dataset size n

Mixing Iterative Single

Total dataset size n

Total dataset size n

Figure 6: Iterative vs Single-Step Mixing. Solid lines represent the experimental results (5 runs), while dashed lines correspond to the theoretical predictions of Corollary 2. The iterative mixing is repeated 5 times, with p1 = p2 = 0.5. Clean refers to the scaling when using solely the n1 = p1n real data in the dataset.

Thus, the idea is to iteratively enhance the quality of the synthetic data through bootstrapping.

Corollary 2. For large t, it holds in the limit (4) and then ϕ, λ 0+ that Etest( bf (t) CL) E/(1 p2 2) + Θ(p2t 2 ), where E σ2d/n, σ2 := p1σ2 1 + p2σ2 2.

Let us explore some important consequences of Corollary 2.

Iterative Mixing Recovers Scaling Laws but Might Not be Feasible in Practice. If the practitioner can curate a sufficient amount of data from the original distribution, the training dataset will include a non-vanishing proportion of real data, ensuring that p1 remains bounded away from zero. By comparing E with p2t 2 , we observe that iterative mixing over t iterations, where t is of the order of log(n/d), results in a scaling law proportional to E, as empirically confirmed in Figure 6. However, this comes at the cost of significant bootstrapping, a large volume of real data, and the need to clearly distinguish between real and synthetic data across iterations conditions that are all too computationally expensive and challenging to implement in practice.

Iterative Mixing Relies Mostly on Real Data. In Figure 6, we compare the scaling of iterative mixing with the scaling obtained using only the p1n real data portion from the same training set ( Clean ). While the scaling rate remains consistent, iterative mixing consistently underperforms compared to using real data alone. This suggests that iterative mixing may primarily neutralize the synthetic data and heavily rely on the real data to recover the scaling. Even when the original synthetic data is of high quality (i.e., when c0 is small, rightmost plot of FIgure 6), the iterative method fails to effectively leverage the synthetic data, resulting in worse performance than single mixing. Thus, although iterative mixing recovers the same scaling rate, the model still collapses to some degree, and no significant performance improvement is observed.

Iterative Mixing with Little Real Data is Bad. If we consider the setting where we only have limited real data or where there is faster accumulation of synthetic data, which corresponds to p2 1 (the real data in the training set is diminishing), then it holds that for any t 1, Etest( bw(t)) c2 0 + t E. This is an increasing function of t, meaning that there is still catastrophic model collapse.

6 DISCUSSION

Our work systematically characterizes the effects of training models on mixtures of real and synthetic data, showing that model collapse is a robust phenomenon that persists even with small fractions of synthetic data, in the asymptotic regime. By introducing new mathematical tools, we extend prior work to analyze more complex mixing settings and models (random projections), broadening the scope of theoretically tractable problems. Experiments confirm our theoretical predictions across large language models (LLMs) and also fully-trained feed-forward neural networks.

Going beyond the prevalent neural scaling laws paradigm (Kaplan et al., 2020; Hoffmann et al., 2022) which is at the basis of the current trend in training LLMs, this study emphasizes the importance of preserving and labeling real data, either by curating it or avoiding unintended synthetic data in training, reflecting a shift as AI-generated data becomes prevalent. Our work also delineates the impact of model size on the model collapse profile. Future work will explore the effect of other model design choices like activation functions, depth, and optimization hyper-parameters like learning rate and momentum. To this end, we can leverage Gaussian equivalents (Goldt et al., 2022) to extend our theory to wide, fully-trained networks in the neural tangent kernel (Jacot et al., 2018) and lazy (Chizat et al., 2019) regimes, using operator-valued free probability theory (Mingo & Speicher, 2017), like we have done in our analysis.

Published as a conference paper at ICLR 2025

7 ACKNOWLEDGEMENTS

YF and JK acknowledge support through the NSF NRT training grant award 1922658. Part of this work was done while YF and JK where visiting the Centre Sciences de Donnees (CSD) at the Ecole Normale Superieure in Paris, France, and YF and JK wish to thank the CSD for their hospitality. YF would like to thank Jianyu Zhang for his help with the experiments involving GPT-2 on Babi Stories.

Ben Adlam and Jeffrey Pennington. The neural tangent kernel in high dimensions: Triple descent and a multi-scale theory of generalization. In International Conference on Machine Learning, pp. 74 84. PMLR, 2020.

Sina Alemohammad, Josue Casco-Rodriguez, Lorenzo Luzi, Ahmed Imtiaz Humayun, Hossein Babaei, Daniel Le Jeune, Ali Siahkoohi, and Richard G. Baraniuk. Self-consuming generative models go mad. ar Xiv preprint arxiv:2307.01850, 2023.

Sina Alemohammad, Ahmed Imtiaz Humayun, Shruti Agarwal, John Collomosse, and Richard Baraniuk. Self-improving diffusion models with synthetic data, 2024. URL https://arxiv. org/abs/2408.16333.

Reyhane Askari-Hemmat, Mohammad Pezeshki, Elvis Dohmatob, Florian Bordes, Pietro Astolfi, Melissa Hall, Jakob Verbeek, Michal Drozdzal, and Adriana Romero-Soriano. Improving the scaling laws of synthetic data with deliberate practice. ar Xiv preprint ar Xiv:2502.15588, 2025.

Francis Bach. High-dimensional analysis of double descent for linear regression with random projections. 2023.

Quentin Bertrand, Avishek Joey Bose, Alexandre Duplessis, Marco Jiralerspong, and Gauthier Gidel. On the stability of iterative retraining of generative models on their own data. ar Xiv preprint arxiv:2310.00429, 2023.

Matyas Bohacek and Hany Farid. Nepotistically trained generative-ai models collapse, 2023.

Martin Briesch, Dominik Sobania, and Franz Rothlauf. Large language models suffer from their own output: An analysis of the self-consuming training loop, 2023.

Andrea Caponnetto and Ernesto de Vito. Optimal rates for the regularized least-squares algorithm. Foundations of Computational Mathematics, 7:331 368, 2007.

Lenaic Chizat, Edouard Oyallon, and Francis Bach. On lazy training in differentiable programming. Advances in neural information processing systems, 32, 2019.

Hugo Cui, Bruno Loureiro, Florent Krzakala, and Lenka Zdeborov a. Generalization error rates in kernel regression: the crossover from the noiseless to noisy regime. Journal of Statistical Mechanics: Theory and Experiment, 2022(11):114004, nov 2022.

Hugo Cui, Bruno Loureiro, Florent Krzakala, and Lenka Zdeborov a. Error scaling laws for kernel classification under source and capacity conditions. Machine Learning: Science and Technology, 4(3):035033, August 2023. ISSN 2632-2153. doi: 10.1088/2632-2153/acf041.

Li Deng. The mnist database of handwritten digit images for machine learning research [best of the web]. IEEE signal processing magazine, 29(6):141 142, 2012.

Elvis Dohmatob, Yunzhen Feng, and Julia Kempe. Model collapse demystified: The case of regression. In Advances in Neural Information Processing Systems, volume 37. Curran Associates, Inc., 2024a.

Elvis Dohmatob, Yunzhen Feng, Pu Yang, Franc ois Charton, and Julia Kempe. A tale of tails: Model collapse as a change of scaling laws. In Forty-first International Conference on Machine Learning, 2024b. URL https://openreview.net/forum?id=KVvku47sh W.

Abhimanyu Dubey and et al. The llama 3 herd of models. ar Xiv preprint ar Xiv:2407.21783, 2024.

Published as a conference paper at ICLR 2025

Ronen Eldan and Yuanzhi Li. Tinystories: How small can language models be and still speak coherent english? ar Xiv preprint ar Xiv:2305.07759, 2023.

Yunzhen Feng, Elvis Dohmatob, Pu Yang, Francois Charton, and Julia Kempe. Beyond model collapse: Scaling up with synthesized data requires verification, 2024. URL https://arxiv. org/abs/2406.07515.

Damien Ferbach, Quentin Bertrand, Avishek Joey Bose, and Gauthier Gidel. Self-Consuming Generative Models with Curated Data Provably Optimize Human Preferences. Ar Xiv, 2024.

Damien Ferbach, Quentin Bertrand, Avishek Joey Bose, and Gauthier Gidel. Self-consuming generative models with curated data provably optimize human preferences, 2024. URL https: //arxiv.org/abs/2407.09499.

Nate Gillman, Michael Freeman, Daksh Aggarwal, HSU Chia-Hong, Calvin Luo, Yonglong Tian, and Chen Sun. Self-correcting self-consuming loops for generative model training. In Forty-first International Conference on Machine Learning, 2024.

Sebastian Goldt, Bruno Loureiro, Galen Reeves, Florent Krzakala, Marc M ezard, and Lenka Zdeborov a. The gaussian equivalence of generative models for learning with shallow neural networks. In Mathematical and Scientific Machine Learning, pp. 426 471. PMLR, 2022.

Yanzhu Guo, Guokan Shang, Michalis Vazirgiannis, and Chlo e Clavel. The curious decline of linguistic diversity: Training language models on synthetic text, 2023.

Trevor Hastie, Andrea Montanari, Saharon Rosset, and Ryan J. Tibshirani. Surprises in highdimensional ridgeless least squares interpolation. The Annals of Statistics, 50(2), 2022.

Ryuichiro Hataya, Han Bao, and Hiromi Arai. Will large-scale generative models corrupt future datasets? In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pp. 20555 20565, October 2023.

Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, Tom Hennigan, Eric Noland, Katie Millican, George van den Driessche, Bogdan Damoc, Aurelia Guy, Simon Osindero, Karen Simonyan, Erich Elsen, Jack W. Rae, Oriol Vinyals, and Laurent Sifre. Training compute-optimal large language models, 2022.

Arthur Jacot, Franck Gabriel, and Clement Hongler. Neural tangent kernel: Convergence and generalization in neural networks. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa Bianchi, and R. Garnett (eds.), Advances in Neural Information Processing Systems, volume 31. Curran Associates, Inc., 2018.

Ayush Jain, Andrea Montanari, and Eren Sasoglu. Scaling laws for learning with real and surrogate data, 2024. URL https://arxiv.org/abs/2402.04376.

Albert Q Jiang, Alexandre Sablayrolles, Antoine Roux, Arthur Mensch, Blanche Savary, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Emma Bou Hanna, Florian Bressand, et al. Mixtral of experts. ar Xiv preprint ar Xiv:2401.04088, 2024.

Jared Kaplan, Sam Mc Candlish, Tom Henighan, Tom B. Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models. ar Xiv preprint ar Xiv:2001.08361, 2020.

V. Kargin. Subordination for the sum of two random matrices. The Annals of Probability, 43(4), 2015.

Donghwan Lee, Behrad Moniri, Xinmeng Huang, Edgar Dobriban, and Hamed Hassani. Demystifying disagreement-on-the-line in high dimensions. In International Conference on Machine Learning, pp. 19053 19093. PMLR, 2023.

Jaehoon Lee, Yasaman Bahri, Roman Novak, Samuel S. Schoenholz, Jeffrey Pennington, and Jascha Sohl-Dickstein. Deep neural networks as gaussian processes. In 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, April 30 - May 3, 2018, Conference Track Proceedings. Open Review.net, 2018.

Published as a conference paper at ICLR 2025

Alexander Maloney, Daniel A. Roberts, and James Sully. A solvable model of neural scaling laws, 2022.

Gonzalo Mart ınez, Lauren Watson, Pedro Reviriego, Jos e Alberto Hern andez, Marc Juarez, and Rik Sarkar. Combining generative artificial intelligence (ai) and the internet: Heading towards evolution or degradation? ar Xiv preprint arxiv: 2303.01255, 2023a.

Gonzalo Mart ınez, Lauren Watson, Pedro Reviriego, Jos e Alberto Hern andez, Marc Juarez, and Rik Sarkar. Towards understanding the interplay of generative artificial intelligence and the internet. ar Xiv preprint arxiv: 2306.06130, 2023b.

V.A. Marˇcenko and Leonid Pastur. Distribution of eigenvalues for some sets of random matrices. Math USSR Sb, 1:457 483, 01 1967.

James A. Mingo and Roland Speicher. Free Probability and Random Matrices, volume 35 of Fields Institute Monographs. Springer, 2017.

Radford M. Neal. Priors for infinite networks. In Bayesian Learning for Neural Networks, pp. 29 53. Springer, New York, 1996.

Ali Rahimi and Benjamin Recht. Weighted sums of random kitchen sinks: Replacing minimization with randomization in learning. In Advances in Neural Information Processing Systems. Curran Associates, Inc., 2008.

Dominic Richards, Jaouad Mourtada, and Lorenzo Rosasco. Asymptotics of ridge(less) regression under general source condition. In Proceedings of The 24th International Conference on Artificial Intelligence and Statistics, volume 130 of Proceedings of Machine Learning Research. PMLR, 2021.

Alessandro Rudi and Lorenzo Rosasco. Generalization properties of learning with random features. In Advances in Neural Information Processing Systems. Curran Associates Inc., 2017. ISBN 9781510860964.

Mohamed El Amine Seddik, Suei-Wen Chen, Soufiane Hayou, Pierre Youssef, and Merouane Debbah. How bad is training on synthetic data? a statistical analysis of language model collapse. ar Xiv preprint ar Xiv:2404.05090, 2024.

H. Shimodaira. Improving predictive inference under covariate shift by weighting the loglikelihood function. Journal of Statistical Planning and Inference, 90(2):227 244, 2000.

Ilia Shumailov, Zakhar Shumaylov, Yiren Zhao, Yarin Gal, Nicolas Papernot, and Ross Anderson. The curse of recursion: Training on generated data makes models forget. ar Xiv preprint arxiv:2305.17493, 2023.

Ilia Shumailov, Zakhar Shumaylov, Yiren Zhao, Yarin Gal, Nicolas Papernot, and Ross Anderson. Ai models collapse when trained on recursively generated data. Nature, 631, 2024.

Stefano Spigler, Mario Geiger, and Matthieu Wyart. Asymptotic learning curves of kernel methods: empirical data versus teacher student paradigm. Journal of Statistical Mechanics: Theory and Experiment, 2020(12):124001, December 2020. ISSN 1742-5468. doi: 10.1088/1742-5468/ abc61d.

Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. ar Xiv preprint ar Xiv:2307.09288, 2023.

Nilesh Tripuraneni, Ben Adlam, and Jeffrey Pennington. Covariate shift in high-dimensional random feature regression. ar Xiv preprint ar Xiv:2111.08234, 2021.

Robin Vogel, Mastane Achab, St ephan Cl emenc on, and Charles Tillier. Weighted empirical risk minimization: Sample selection bias correction based on importance sampling. Ar Xiv, 2021.

Christopher Williams. Computing with infinite networks. In M.C. Mozer, M. Jordan, and T. Petsche (eds.), Advances in Neural Information Processing Systems, volume 9. MIT Press, 1996.

Published as a conference paper at ICLR 2025

Pu Yang, Yunzhen Feng, Ziyuan Chen, Yuhang Wu, and Zhuoyuan Li. Spend wisely: Maximizing post-training gains in iterative synthetic data boostrapping. ar Xiv preprint ar Xiv:2501.18962, 2025.

Jianyu Zhang, Niklas Nolte, Ranajoy Sadhukhan, Beidi Chen, and L eon Bottou. Memory mosaics. ar Xiv preprint ar Xiv:2405.06394, 2024a.

Jinghui Zhang, Dandan Qiao, Mochen Yang, and Qiang Wei. Regurgitative training: The value of real data in training large language models. ar Xiv preprint ar Xiv:2407.12835, 2024b.

Published as a conference paper at ICLR 2025

Table of Contents

A Further Experimental Results 16 A.1 Additional Results for the toy setting of multivariate gaussians . . . . . . . . . 16 A.2 Experimental results for Neural Networks on MNIST . . . . . . . . . . . . . . . 16 A.3 General Picture for Neural Network on MNIST . . . . . . . . . . . . . . . . . . 17

B Experimental Details 18 B.1 Toy Setting: Random Projections Model . . . . . . . . . . . . . . . . . . . . . 18 B.2 Two-layer neural networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18 B.3 Language Modeling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18

C Static / Single-Step Data Mixing 19

D Some Omitted Theoretical Results and Comments 20 D.1 Classical Linear Model in Over-Parametrized Isotropic Setting . . . . . . . . . . 20 D.2 Connections to Classical Model Collapse in Regression . . . . . . . . . . . . . 20

E Deterministic Equivalents 21 E.1 Classical Linear Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21 E.2 Random Projections . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22 E.3 Proof of Proposition 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22

Computing r(1) j . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22

Computing r(4). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23 Computing r(3). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25 Computing r(2). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25 E.4 Proof of Proposition 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25

F Proof of Theorem 1 and Corollaries 26 F.1 Proof of Theorem 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26 F.2 Proof of Corollary 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27 F.3 Proof of Corollary 3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27 F.4 Proof of Corollary 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28 F.5 Proof of Corollary 4 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28

G Proof of Proposition 2 and Theorem 2 29 G.1 Proof of Proposition 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29 G.2 Recovering Theorem 1 from Theorem 2 . . . . . . . . . . . . . . . . . . . . . . 31 G.3 Proof of Theorem 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31

H Phase-Diagram for Random Projections Model 31 H.1 The General Regularized Case . . . . . . . . . . . . . . . . . . . . . . . . . . . 31 H.2 Unregularized Limit . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32

I Raw Bias-Variance Decomposition 33 I.1 Classical Linear Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33 I.2 Random Projections Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34

Published as a conference paper at ICLR 2025

A FURTHER EXPERIMENTAL RESULTS

A.1 ADDITIONAL RESULTS FOR THE TOY SETTING OF MULTIVARIATE GAUSSIANS

Figure 7 provides additional plots for various data quality parameters c2 showing model collapse as a function of model size in the toy setting of multivariate Gaussians with random projections (experimental details in Section B.1).

0 200 400 600 800 1000 Network width m

p2 0.02 0.1 0.2

0.4 0.6 0.8

0 200 400 600 800 1000 Network width m

0 200 400 600 800 1000 Network width m

0 200 400 600 800 1000 Network width m

Figure 7: Impact of model size (network width m) on model collapse. Same setting as for Figure 4, but with quality parameter c2 (smaller is better) as shown on top of each plot and proportion of synthetic data p2 as in the legend (Figure 4 showed the reverse).

A.2 EXPERIMENTAL RESULTS FOR NEURAL NETWORKS ON MNIST

103 104 Number of Samples

Hidden Dimension

100 400 1000 2000 4000 10000

Figure 8: Fully trained two-layer network on MNIST data. Impact of model size (hidden dimension, aka network width) on model collapse. Here, the model is trained solely on synthetic data (i.e p2 1).

Setup. For two-layer neural networks, we consider two scenarios: (1) learning with a random projection model as in Section 3.2, where the first layer of the network is fixed randomly, and only the second layer is trained, and (2) learning with a fully trainable neural network. The first setting directly corresponds to our theoretical results from Section 3.2, but with Re LU activation functions. In the case of fully trained neural networks in the second setting, our theory does not apply directly. However, we hypothesize that the general trends observed in our asymptotic theory will still hold: (1) there will be a significant model collapse, which only diminishes as the fraction of synthetic data approaches 0; (2) larger models will exhibit a more severe model collapse.

To align with the theoretical setting, we employ a (multivariate) regression approach where labels are converted to one-hot vectors and the model is trained using mean squared error. The synthetic labels were generated by another two-layer network, with Gaussian label noise (standard deviation of 0.1) added. A validation set is used to select the best checkpoint, and evaluation is conducted on the test set using the clean labels. Further details of the training are provided in Appendix B.2.

Results. Figure 9 presents the results for both random feature models (left) and fully trained neural networks (right). In these experiments, we mixed synthetic and original data in the training set with varying coefficients, p1. As the proportion of synthetic data, p2, increases, the scaling laws slow down and eventually plateau. We observe a strong model collapse: only when p2 approaches 0 does the collapse subside. The results are consistent across both cases, validating our theoretical predictions and demonstrating the applicability of our insights to more complex scenarios.

We also investigated how model size, specifically the hidden dimension of fully trained neural networks, affects model collapse. As shown in Figure 8, models with varying hidden dimensions were trained exclusively on the synthetic dataset with p2 = 1. For training sets ranging from 10,000 to 50,000 samples, our results indicate that larger models are more susceptible to model collapse under the same validation and evaluation protocols. Notably, all these models remain in the interpolation regime, aligning with our theoretical predictions.

Published as a conference paper at ICLR 2025

102 103 104 Number of Samples

102 103 104

Number of Samples

The proportion of synthetic data, p2 0 0.001 0.001 0.005 0.01 0.02 0.05 0.1 0.2 0.5 0.9

Figure 9: Model collapse as a function of the proportion of synthetic data. We use the MNIST dataset with regression loss. Error bars correspond to 5 runs. Left, Random feature model with hidden dimension 100,000. Right, Two-layer neural network of width (i.e hidden dim.) m = 2000.

102 103 104

Hidden Dimension

Synthetic size n2 = 2000

102 103 104

Hidden Dimension

Synthetic size n2 = 10000

102 103 104

Hidden Dimension

Synthetic size n2 = 40000

Quality of Synthetic Data (Generator Performance) 0.021 0.057 0.095 0.148

Figure 10: Understanding the role of model size in model collapse under varying qualities of synthetic data and dataset sizes. The quality of the synthetic data is evaluated using the MSE loss on the test set. The model is trained solely on synthetic data (p2 1).

A.3 GENERAL PICTURE FOR NEURAL NETWORK ON MNIST

To provide a comprehensive understanding of how the quality of synthetic data, the quantity of synthetic data, and the network size impact performance, we conducted a large-scale experiment varying these factors, as shown in Figures 10 and 11. Figure 10 uses the MSE loss as an indicator of synthetic data quality, while Figure 11 uses accuracy as the indicator. To simplify the analysis, we focus on pure synthetic data (p2 = 1).

The synthetic data with the highest quality already achieves accuracy close to the optimal. As the quality decreases, we observe that the shape of the curve begins to resemble a double descent curve, similar to the changes in the Pareto frontiers shown in Figure 1. With different combinations of the number of synthetic data n2 and hidden dimension d, the figure captures various segments of the double descent curve depicted in Figure 4. When n2 is small (as seen in the left subplots), it corresponds to a large parameterization rate ψ, placing it in the second descent region of the double descent curve. Conversely, when n2 is large (as shown in the right subplots), it captures the up-anddown behavior characteristic of the double descent phenomenon.

Published as a conference paper at ICLR 2025

102 103 104

Hidden Dimension

Synthetic size n2 = 2000

102 103 104

Hidden Dimension

Synthetic size n2 = 10000

102 103 104

Hidden Dimension

Synthetic size n2 = 40000

Quality of Synthetic Data (Generator Performance)

97.9 92.9 85.8 74.6

Figure 11: Understanding the role of model size in model collapse under varying qualities of synthetic data and dataset sizes. The quality of the synthetic data is evaluated using the accuracy on the test set. The model is trained solely on synthetic data (p2 1).

B EXPERIMENTAL DETAILS

B.1 TOY SETTING: RANDOM PROJECTIONS MODEL

Setup. As a sanity check to empirical confirm our analytical predictions from Theorem 2, we consider a setting with multivariate Gaussian data (1). The feature covariance matrix Σ is constructed to have power-law eigenvalues λj = C/j, where C is such that tr Σ = λ1 + . . . + λd = 1. The ground-truth labelling weights w 1 of the real data distribution P1 sampled from N(0, (1/d)Id), while the ground-truth weights w 2 for the synthtic data distribution are sampled from N(w 1, ) with = (c2/d)Σ 1 for different values of c2 ranging from {0, 0.1, 0.5, 1} which controls for the quality of the synthetic data. We run a small experiment with label noise levels σ1 = σ2 = 0.1, input-dimension d = 600, number of real samples n1 = 300, and synthetic samples n2 = 200, for a total of n = n1 + n2 = 500 samples. We fit a random projection model bf RP according to (5) and for different values of the width parameter m (to control the size the of the model), and report the results in Figures 4 and 7. The regularization parameter λ is set to a very small value (10 8). We also consider a variation of this experiment with different values of the synthetic dataset size n2 and report the results in Figure 1.

B.2 TWO-LAYER NEURAL NETWORKS

The two-layer neural networks are trained using stochastic gradient descent (SGD) with a batch size of 128 and a learning rate of 0.1. The models are trained for 400 epochs to fully converge. We employ a held-out validation set from the training set to select the best checkpoint to evaluate.

B.3 LANGUAGE MODELING

The generation process for the Babi Stories dataset is detailed in the Git Hub repository of Zhang et al. (2024a). The dataset comprises a training set of 2,200,000 stories and a validation set of 22,000 stories, created by prompting the Mistral-8x7B model. Each prompt includes a description of the generation task, character names, specific words to be used, and the beginning of a story. The dataset stores the beginnings of these stories along with their generated continuations.

In our experiments, we trained a GPT-2-small model on this dataset to generate synthetic data. The model was trained using next-token prediction, utilizing the beginnings and continuations of stories to have good story generation quality. To maintain consistency with the original prompt distribution, we used all the prompts that were initially employed to generate Babi Stories. During story generation, we applied a temperature setting of 1 with top-p decoding where p = 1. After generation, we filtered out stories of poor quality, such as those containing unwanted symbols, following the same

Published as a conference paper at ICLR 2025

procedure as in Zhang et al. (2024a). The filtered beginnings and synthetic continuations were then collected to form the synthetic dataset.

We used a GPT-2 model with an embedding dimension of d = 768, 12 attention heads, and a context length of 512 tokens, which typically encompasses one to three stories. During training, we applied a learning rate of 5 10 3, a dropout rate of 0.05, L2 weight decay of 0.1, and a warm-up phase of 2,000 iterations.

C STATIC / SINGLE-STEP DATA MIXING

For the purposes of studying scaling laws for learning on mixtures of real and surrogate (e.g synthetic data), the setting considered in Jain et al. (2024) consists in the following optimization problem:

bw = arg min w Rn 1 α

(xi,yi) D1 (x i w yi)2 + α

(xi,yi) D2 (x i w yi)2 + λ w 2. (12)

This is an instance of weighted weighted empirical risk minimization (Shimodaira, 2000; Vogel et al., 2021) where the weight the sample weight πi is constant across each group: πi = (1 α)n/n1 (1 α)/p1 for real samples real samples vs πi = αn/n2 α/p2 for synthetic samples. Thus α (0, 1) is a mixing coefficient for the two the two source of data; in particular α 0 corresponds to only using real data (which corresponds to group 1) for training, while α 1 corresponds to only using surrogate data (group 2). Formula (12) replaces the formula for the weights vector bw of the classical linear model bf CL (3).

For conciseness, as in Section 5 we focus on the isotropic case considered in Section 3.1 where the feature covariance matrices are Σ1 = Σ2 = Id and the shift matrix := cov(w 1 w 2) has the form (c2/d)Id for some scalar c > 0. Further, let us consider the regime where d/n 0 . In the language of our paper, one should think of this as corresponding to the proportionate scaling regime given in (4), and then letting ϕ 0+ (extremely under-parametrized regime). We have the following result.

(a) n1 = 1000, d = 500.

(b) n1 = 10000, d = 100, so that ϕ = d/n 100/10200 < 0.01 (small). Corollary 3 correctly predicts that the optimal strategy mixing coefficient is α 0, i.e to discard surrogate data altogether.

Figure 12: Failure of naive real+surrogate data mixing to solve model collapse. For this experiment, we use different several different values for the size of the real data n1 and the synthetic data n2 . Solid curves correspond to experiments while broken curves correspond to our theoretical prediction give in Corollary 3. Error-bars correspond to independent runs.

Corollary 3. Consider the proportionate scaling limit (4). For small ϕ = d/n, it holds that

Etest( bf CL) p2 2α2 tr + ((1 α)p1σ2 1 + αp2σ2 2)ϕ + O(ϕ2). (13)

Published as a conference paper at ICLR 2025

The formula given in (13) represents a U-shaped function of α, minimized when α = α , with

α = clip[0,1]

1 p1σ2 1 p2σ2 2 2 tr ϕ . (14)

It should be clear that if tr = Ω(1) and σ1, σ2 = O(1), then α 0; this corresponds to only using real data for training! In contrast any fixed value α [0, 1), leads a positive lower-bound on test error Etest( bf CL) tr ; this is effectively model collapse. The situation is empirically confirmed in Figure 12.

D SOME OMITTED THEORETICAL RESULTS AND COMMENTS

D.1 CLASSICAL LINEAR MODEL IN OVER-PARAMETRIZED ISOTROPIC SETTING

We now complement the analysis presented at the end of Section 3.1 with an analysis for the case ϕ (1, ). Plugging into Theorem 1 gives κ ϕ 1 and u 1/(1 ϕ/ϕ2) = ϕ/(ϕ 1) in the limit λ 0+. We obtain the following corollary.

Corollary 4. For ϕ (1, ), in the limit (4) and λ 0+, it holds that Etest E + ζ, with

E = V + B, B = r2(1 1

ϕ 1, ζ = p2 c2

ϕ 1 + (ϕ p2)2 , (15)

Moreover, for large ϕ (1, ), it holds that Etest( bf CL) E (1 2/ϕ) p2c2 + O(1/ϕ2).

Thus, for any fixed c > 0, strong model collapse occurs: the RHS vanishes only if p2 0+, i.e only if we discard all but a vanishing proportion of synthetic data from the training dataset. This is strong model collapse. Combining with Corollary 1, we conclude that (at least in the isotropic setting, strong model collapse occurs both the under-parametrized and over-parametrized settings.

D.2 CONNECTIONS TO CLASSICAL MODEL COLLAPSE IN REGRESSION

In the setting of classical model collapse (Shumailov et al., 2023; 2024; Alemohammad et al., 2023; Dohmatob et al., 2024b;a), we have w 2 = w 1+PN ℓ=1 X ℓEℓ,where N is the number of iterations (i.e self-loops) in the synthetic data-generation process. Let nℓbe the number of samples available for training at stage ℓwith training data (Xℓ, Xℓw 2 +Eℓ) Rn d Rn, where the noise vectors Eℓare independent with iid components from N(0, σ2 ℓ). In the proportionate scaling regime n1, . . . , n N with d/nℓ ϕℓ (0, ) for all ℓ, the situation is equivalent to taking

ℓ σ2 ℓ E (X ℓ) X ℓ X

σ2 ℓ nℓ df2(κℓ; Σ)Σ(Σ + κℓId) 2, with κℓ= κ(nℓ, 0; Σ).

In particular, if maxℓϕℓ 1 (so that there is enough samples to perfectly fit the training data at each stage of the iterative process), and for simplicity we set σℓ= σ0 for all ℓ, then the above expression simplifies to P

ℓσ2 ℓ/(nℓ d) Σ 1. More generally, consider the generic setting where (c2/d)Σ 1, for any c > 0, so that the previous setting corresponds to c2 = P

ℓσ2 ℓϕℓ/(1 ϕℓ).

In the particular case where p1 0+, i.e only synthetic data is available for training. Theorem 1 then gives

d df2 +uκ2 tr(Σ + κId) 2 = η2 df2 1 + κ2 df2 n df2 tr(Σ + κId) 2 .

In particular, taking c2 = P ℓσ2 ℓϕℓ/(1 ϕℓ) gives

ζ 1 + κ2 df2 n df2 tr(Σ + κId) 2 df2

σ2 ℓϕℓ 1 ϕℓ . (16)

This is recovers the main result of Dohmatob et al. (2024a).

Published as a conference paper at ICLR 2025

E DETERMINISTIC EQUIVALENTS

Let Xj (resp. Yj) be the design matrix (resp. response vector) corresponding to dataset Dj. Thus, the design matrix X1 Rn1 d for the real dataset has rows given by xi for i [n1] and Y1 Rn1 with components yi for i [n1], with X2 Rn2 d and Y2 Rn2 defined analogously for the synthetic dataset. Let X Rn d(resp. Y Rn) be the design matrix (resp. response vector) corresponding to the total dataset. We temporarily drop the condition Σ1 = Σ2 = Σ, and instead consider generally different covariance matrices Σ1 and Σ2 for the marginal distribution of the features x under the real data distribution P1 and the synthetic data distribution P2.

E.1 CLASSICAL LINEAR MODEL

Note that the weights bw of the model bf CL given in (3) can be written explicitly as bw = RX Y , where R := (X X + nλId) 1 = (X 1 X1 + X 2 X2 + nλId) 1, a random matrix. Its test error Etest( bf CL) writes Etest( bf CL) = EX,Y [( bf CL(x) x w 1)2] = EX,Y bw w 1 2 Σ1. In Proposition 4, we shall show that the RHS in the above can be decomposed into a sum of simply random quantities of the form r(k) j (A) and r(k) j (A, B) that we now describe and analyze.

Let A and B be d d positive-definite matrices with well-behaved spectral (this will be made precise latter) and let λ > 0. In analyzing the bias-variance decomposition of the test error, we are ultimately led to consider the following quantities

r(1) j (A) := E tr AMj(M + λId) 1, (17)

r(2)(A, B) := E tr A(M + λId) 1B(M + λId) 1, (18)

r(3) j (A, B) := E tr AMj(M + λId) 1B(M + λId) 1Mj, (19)

r(4) j (A, B) := E tr AMj(M + λId) 1B(M + λId) 1, (20)

where we recall that M := M1 + M2 and Mj := X j Xj/n.

Let (e1, e2) be the unique negative solution to the following pair of fixed-point equations

e1 = 1 1 + ϕ tr Σ1K 1 , e2 = 1 1 + ϕ tr Σ2K 1 , with K := p1e1Σ1 + p2e2Σ2 + λId. (21)

Also, define (u1, u2) to be the unique positive solution to the pair of fixed-point equations

u1 = ϕe2 1 tr Σ1L K 2, u2 = ϕe2 2 tr Σ2L K 2, with L := p1u1Σ1 + p2u2Σ2 + λB. (22)

Consider the following deterministic matrices

Cj := pje2 j(B + pj uj Σj )Σ1 + u1(pj ej Σj + λId)2,

Dj := ej B λuj Id + pj (ejuj ej uj)Σj , (23)

where 1 := 2 and 2 = 1.

The following will be crucial for proving Theorem 1 and its corollaries.

Proposition 1. In the proportionate scaling limit (4), it holds for j = 1, 2 that

r(1) j (A) pjej tr AΣj K 1, (24)

r(2)(A, B) tr AL K 2, (25)

r(3) j (A, B) pj tr AΣj Cj K 2, (26)

r(4) j (A, B) pj tr AΣj Dj K 2. (27)

Published as a conference paper at ICLR 2025

E.2 RANDOM PROJECTIONS

For d d deterministic matrices A and B, define the following quenched quantities

r(1) j (A) := E tr ASRS Mj, r(3) j (A, B) := E tr AMj SR SBSRS Mj,

r(4) j (A, B) := E tr AMj SR SBSRS , r(5)(A, B) := E tr AM1SR SBSRS M2,

where we recall that R := (S MS +λIm) 1, M := M1 +M2, Mj := X j Xj/n. These quantities will be useful because we may write

j σ2 j 1 n E tr Mj SRS Σk SRS =

σ2 j n r(4) j (Id, Σk),

Bk = tr ΓΣk + E tr ΓMSRS ΣSRS M 2 tr ΓΣk SRS M + tr M2SRS Σk SRS M2

= tr ΓΣk + 2r(5)(Γ, Σk) + r(3) 1 (Γ, Σk) + r(3) 2 (Γ, Σk) 2r(1) 1 (ΓΣk) 2r(1) 2 (ΓΣk) + r(3) 2 ( , Σk). Each term in the above decomposition can now be analyzed via operator-valued free-probability theory. The following proposition will be heavily exploited in the prove of Theorem 2. Proposition 2. In the proportionate scaling limit (6), it holds that

r(1) j (A) pjγτej tr AΣK 1, r(3) j (A, Σ) pj tr AΣCj K 2,

r(4) j (A, Σ) pjγ tr AΣDK 2, r(5)(A, Σ) p1p2γ tr AΣ2EK 2,

where the constants e1 and e2 and the matrices C1, C2, D, and E are as in Theorem 2.

E.3 PROOF OF PROPOSITION 1

WLOG, we only consider the case j = 1, and suppress this subscript henceforth from all the r(k) j s.

Computing r(1) j . We only do j = 1 as j = 2 is completely analogous. One can obtain a minimal 9 9 linear pencil Q for the random matrix R = AM1(M + λId) 1 such that Q is a 9 9 block matrix (not shown here1) and R = Q 1[1, 5]/λ (using zero-based indexing). It follows that in the asymptotic limit, one has

r(1)/d = E tr R G1,5, (30)

where G = (id E tr )[Q 1] M9(C) is the matrix containing the limiting expected values of the normalized traces of the blocks of each of the 9 9 = 81 blocks of Q 1 (we define the trace of each rectangular as zero). Using classical operator-valued free probability theory (OVFPT) (Mingo & Speicher, 2017), we have the following fixed-point equations which define G1,5 implicitly

G1,5 = p1G3,3 tr AΣ1(λId + p1G3,3Σ1 + p2G7,7Σ2) 1, (31)

G3,3 = λ λ ϕG4,2 , (32)

G7,7 = λ λ ϕG8,6 , (33)

G4,2 = λ tr Σ1(λId + p1G3,3Σ1 + p2G7,7Σ2) 1, (34)

G8,6 = λ tr Σ2(λId + p1G3,3Σ1 + p2G7,7Σ2) 1. (35) We deduce that G3,3 = e1, G7,7 = e2, and

r(1)/d = G1,5 = p1e1 tr AΣ1(λId + p1e1Σ1 + p2e2Σ2) 1, where (e1, e2) is the unique pair of nonnegative solutions to the system of equations

e1 = 1 1 + ϕ tr Σ1(λId + p1e1Σ1 + p2e2Σ2) 1 , (36)

e2 = 1 1 + ϕ tr Σ2(λId + p1e1Σ1 + p2e2Σ2) 1 . (37)

1All the linear pencils in this work are very big and are omitted for brevity.

Published as a conference paper at ICLR 2025

Putting things together gives

r(1) d G1,5 = p1e1 tr AΣ1(p1e1Σ1 + p2e2Σ2 + λId) 1 = p1 tr AΣ1K 1.

In particular, in the limit p2 0+ (i.e single data source), the first equation becomes

1 λ/κ1 = 1 η1λ = ϕ1η1 tr Σ1(Id + p1η1Σ1) 1

= ϕ1 tr Σ1(κ1Id + Σ1) 1,

or equivalently,

κ1 λ κ1 df1(κ1; Σ1)

Furthermore, r(1) is now given by

r(1) e1 tr AΣ1(e1Σ1 + λId) 1 = tr AΣ1(Σ1 + κ1Id) 1. (39)

Computing r(4). Here, the minimal linear pencil for the random matrix involved R = AM1(M + λId) 1B(M + λId) 1 is a 16 16 block matrix Q such that R = Q 1[1, 9]/λ. Thus, r(4)/d G1,16/λ, where G = (id E tr )[Q 1] M16(C).

First consider the special case p2 0+ (i.e n2 is negligible compared to n1). The fixed-point equations defining G1,9 are given by

G1,9 = λ tr AΣ1(G3,3B + G3,11Id)(λId + G3,3Σ1) 1(λId + G11,11Σ1) 1, (40)

G3,3 = λ λ ϕG4,2 , (41)

G11,11 = λ λ ϕG12,10 , (42)

G3,11 = λϕG4,10 (λ ϕG4,2)(λ ϕG12,10) = ϕG3,3G11,11G4,10

G12,10 = λ tr Σ1(λId + G11,11Σ1) 1, (44)

G4,10 = λ tr Σ1(λB G3,11Σ1)(λId + G3,3Σ1) 1(λId + G11,11Σ1) 1, (45)

G4,2 = λ tr Σ1(λId + G3,3Σ1) 1. (46)

Observe the equations for G3,11 and G4,10 further give G3,11 = v, where v solves the equation

v = ϕG3,3G11,11 tr Σ1(vΣ1 + λB)(λId + G3,3Σ1) 1(λId + G11,11Σ1) 1. (47)

Now, let e be the unique non-negative solution to the equation

e = 1 1 + ϕ tr Σ1(λId + eΣ1) 1 . (48)

It is easy to see that we must have G3,3 = G11,11 = e and

r(4)/d = G1,9

λ = tr AΣ1(e B v Id)(λId + Σ1) 2

= e 1 tr ABΣ1(κId + Σ1) 2 ve 2 tr AΣ1(κId + Σ1) 2

λ tr ABΣ1(κId + Σ1) 2 vκ2

λ2 tr AΣ1(κId + Σ1) 2,

where κ := λ/e. Furthermore, v defined earlier now satisfies

v = ϕe2 tr Σ1(vΣ1 + λB)(λId + eΣ1) 2

= ϕ tr Σ1(vΣ1 + λB)(κId + Σ1) 1.

Solving for v gives

v = ϕλ tr BΣ1(κId + Σ1) 2

1 ϕ tr Σ2 1(κId + eΣ1) 2 λ tr BΣ1(κId + Σ1) 2

Published as a conference paper at ICLR 2025

In particular, if B = Σ1 and A = Id, then

v = λ df2(κ) n df2(κ),

and so we must have

r(4)/d = G1,9

λ tr Σ2 1(κId + Σ1) 2 vκ2

λ2 tr Σ1(κId + Σ1) 2

λ 1 d df2(κ) κ2

λ 1 d tr Σ1(κId + Σ1) 2 df2(κ) n df2(κ)

λ 1 d df2(κ) κ

λ 1 d (df1(κ) df2(κ)) df2(κ) n df2(κ)

λ 1 d(n df1(κ)) df2(κ) n df2(κ)

d df2(κ) n df2(κ) 1

ϕ df2(κ) n df2(κ),

where, in the last 2 steps we have made use of the following identities which follow from the definition of κ

κ λ κ df1(κ)

κ tr Σ1(κId + Σ1) 2 = df1(κ) df2(κ).

We deduce that the variance term in the bias-variance decomposition of the test error is given by

V ar = σ2 1

nr(4) σ2 df2(κ) n df2(κ) = σ2u = σ2 df2(κ)/n 1 df2(κ)/n. (51)

Let us now compute the limiting value of r(4) for any values of the proportions p1, p2 (0, 1) with p1 + p2 = 1. The fixed-point equations defining G1,9 now become

G1,9 = p1 tr AΣ1S(λId + p1G2,2Σ1 + p2G6,6Σ2) 2, (52) with S := λ(G2,2B + G2,10Id) + p2(e2G2,10 e1G6,13)Σ2, (53) G2,2 = e1, (54) G6,6 = e2, (55)

G2,10 = G3,11 = ϕe2 1G4,10

G6,13 = G7,14 = ϕe2 2G8,13

G8,13 = λ tr Σ2(λB p1G3,11Σ1 p2G7,14Σ2)(λId + p1G3,3Σ1 + p2G7,7Σ2) 2, (58)

G4,10 = λ tr Σ1(λB p1G3,11Σ1 p2G7,14Σ2)(λId + p1G3,3Σ1 + p2G7,7Σ2) 2, (59)

where e1 0 and e2 0 solve the following system of equations

e1 = 1 1 + ϕ tr Σ1(λId + p1e1Σ1 + p2e2Σ2) 2 , (60)

e2 = 1 1 + ϕ tr Σ2(λId + p1e1Σ1 + p2e2Σ2) 2 . (61)

Furthermore, we deduce that G6,13 = v2 and G2,10 = v1, where v1 and v2 solve the equations

v1 = ϕe2 1 tr Σ1(p1v1Σ1 + p2v2Σ2 + λB)(λId + p1e1Σ1 + p2e2Σ2) 2, (62)

v2 = ϕe2 2 tr Σ2(p1v1Σ1 + p2v2Σ2 + λB)(λId + p1e1Σ1 + p2e2Σ2) 2. (63)

Putting things together gives the formula for r(4) proposed in Proposition 1.

In particular, taking p2 0 (i.e p1 1) recovers the formula as a special case.

Published as a conference paper at ICLR 2025

Computing r(3). A minimal linear pencil for the corresponding random matrix R = AM1(M + λId) 1B(M + λId) 1M1 is a 17 17 block matrix Q such that R = Q 1[1, 16]. This gives

r(3)/d G1,16,

where G = (id E tr )[Q 1] M17(C). The fixed-point eqautions that determine G1,16 are

G1,16 = p1 tr AΣ1S(λId + p1e1Σ1 + p2e2Σ2) 2

with S := p1e2 1(λB p2G6,13Σ2)Σ1 G2,10(λId + p2e2Σ2)2, G7,14 = G6,13 = v2, G3,11 = G2,10 = v1.

We deduce the formula given in Proposition 1. In particular, taking the limit p2 0 (i.e p1 1) gives

e S e2 1BΣ1 + λv1Id = e2 1BΣ1 + λ2u1Id,

v1 = ϕe2 1 tr Σ1(v1Σ1 + λB)(e1Σ1 + λId) 2 = ϕ tr Σ(v1Σ1 + λB)(Σ + κ1Id) 2, i.e

λ = ϕ tr BΣ1(Σ1 + κId) 2

1 ϕ tr Σ2 1(Σ1 + κ1Id) 2 tr BΣ1(Σ1 + κId) 2

n df(1) 2 (κ1) . (64)

Finally, recalling that κ1 = λ/e1 by construction, we get

r(3) d G1,16 = e2 1 tr ABΣ2 1(e1Σ1 + λId) 2 + λ2u1 tr AΣ1(e1Σ1 + λId) 2

= tr ABΣ2 1(Σ1 + κ1Id) 2 + λ2u1

e2 1 tr AΣ1(Σ1 + κ1Id) 2

tr ABΣ2 1(Σ1 + κ1Id) 2 + κ2 1 tr AΣ1(Σ1 + κ1Id) 2 tr BΣ1(Σ1 + κId) 2

n df(1) 2 (κ1) .

Computing r(2). A pencil for the relevant matrix R = λ2A(M +λId) 1B(M +λId) 1 has minimal linear pencil Q of size 15 15, where R = Q 1[1, 8]. We deduce that r(2)/d = E tr R/λ2 = G1,8/λ2, where G = (id E tr )Q 1 M15(C). The fixed-point equations defining G1,8 are given by

G1,8 = λ tr AS(p1G2,2Σ1 + p2G5,5Σ2 + λId) 2, (65) with S = λB p1G2,9Σ1 p2G5,12Σ2, (66) G2,2 = e1, (67) G5,5 = e2, (68)

G2,9 = G3,10 = ϕe2 1G4,9

G5,12 = G6,13 = ϕe2 2G7,12

G4,9 = λ tr Σ1(λB p1G3,10Σ1 p2G6,13Σ2)(p1G2,2Σ1 + p2G5,5Σ2 + λId) 2, (71)

G7,12 = λ tr Σ2(λB p1G3,10Σ1 p2G6,13Σ2)(p1G2,2Σ1 + p2G5,5Σ2 + λId) 2. (72)

Based on previous calculations, we deduce that G2,9 = v1 and G5,12 = v2, and so

r(2) d G1,8

λ tr A(p1v1Σ1 + p2v2Σ2 + λB)(p1e1Σ1 + p2e2Σ2 + λId) 2 = tr Ae LK 2,

as claimed. This completes the proof of Proposition 1.

E.4 PROOF OF PROPOSITION 2

In Section G.1 we will establish a more general result which implies Proposition 2 as a special case.

Published as a conference paper at ICLR 2025

F PROOF OF THEOREM 1 AND COROLLARIES

Let us note that the results in Bach (2023) were obtained in a two-stage approach, where random matrix theory is applied on the raw (unquenched test error bw w1 2 Σ with the projection matrix treated like a deterministic matrix, and then RMT is done one more on the resulting expressions but now treating S as random. The case general case p2 (0, 1) is much more difficult; the key technical difficulty can be pinned down to the problem of obtaining analytic deterministic equivalents for the trace of the and derivatives of the resolvent of a sum of random matrices. To circumvent this, we employ the tools of operator-valued free probability theory.

F.1 PROOF OF THEOREM 1

From Proposition 4 and 1 applied with Σ1 = Σ2 = Σ, we know that

Etest( bf CL) = V1 + B1, with

j=1 pjσ2 j 1 n tr Σk Dj,k K 2 =

j=1 pjσ2 j κ λ 1

n tr Σ(Σ κu Id)(Σ + κId) 2

n tr Σ(Σ κu Id)(Σ + κId) 2,

B1 = p2 tr Σ2C2,1K 2 + λ2 tr ΓL 1K 2

= p2 tr Σ p2(1 + p1u)Σ2 + u(p1Σ + κId)2 (Σ + κId) 2 + κ2(u + 1) tr ΓΣ(Σ + κId) 2.

Now, for the V1 term, first observe that

tr Σ(Σ κu Id)(Σ + κId) 2 = tr Σ(Σ κ df2 n df2 Id)(Σ + κId) 2

= df2 df2 n df2 κ tr Σ(Σ + κId) 2

= df2 df2 n df2 (df1 df2)

= df2 n df2 (n df1).

We deduce that

V1 = σ2 (1 df1 /n)κ

λ df2 n df2 = σ2 df2 n df2 =: V,

where we have used the identity κ λ = κ df1 /n, which defines κ.

We now handle the B1 term. First observe that u + 1 = n/(n df2), and so one computes

κ2(u + 1) tr ΓΣ(Σ + κId) 2 = κ2 n n df2 tr ΓΣ(Σ + λId) 2 =: B,

which is the classical formula for the bias term.

To finalize, observe that

tr ΣC2,1K 2 = tr Σ p2(1 + p1u)Σ2 + u(p1Σ + κId)2 (Σ + κId) 2

= p2(1 + p1u) tr Σ3(Σ + κId) 2 + u tr Σ(p1Σ + κId)2(Σ + κId) 2 =: ζ,

which concludes the proof.

Published as a conference paper at ICLR 2025

F.2 PROOF OF COROLLARY 1

Indeed, here we have κ 0 and u ϕ/(1 ϕ) in the limit λ 0+. Theorem 1 then gives

Etest( bf CL) V + B + ζ, where V = σ2ϕ

1 ϕ, B = 0,

1 ϕ p2(1 ϕ + p1ϕ) + p2 1ϕ = p2c2

1 ϕ(p2(1 p2ϕ) + p2 1ϕ)

1 ϕ(p2 + (p1 p2)ϕ) = p2 2c2 + p2p1c2ϕ

For small ϕ, this further gives Etest( bf CL) σ2ϕ/(1 ϕ) + p2 2c2 + O(ϕ2) σ2d/n + p2 2c2 + O(ϕ2).

F.3 PROOF OF COROLLARY 3

The setup can be seen as a special instance of the setup considered in the proof of Proposition 1 (cf. Appendix F.1), since it corresponds to taking Σ1 = (1 α)Σ/p1, and Σ2 = αΣ/p2. We must have

1 e1 = 1 + ϕ tr Σ1K 1 = 1 + (1 α)ϕ/p1 (1 α)e1 + αe2 + λ, (73)

1 e2 = 1 + ϕ tr Σ2K 1 = 1 + αϕ/p2 (1 α)e1 + αe2 + λ. (74)

At least for λ = 0 and 0 < ϕ < 1, these equations can be solved explicitly to get e1, e1 0 but the resulting formulae are rather complicated, and therefore are omitted altogether. In any case, heorem 1 correctly predicts the test error, as can be seen in Figure 12.

A particular case where things are solvable to give simple expressions, is when ϕ 0+. In this limit, it is easy to see that e1 = e2 = 1 and u1 = u2 = 0. This gives

K = Σ + λId, (75)

L = Σ, (76) C1 = (1 α)Σ, (77) C2 = αΣ, (78) Dk = Σ, (79)

λ2r(2)(A, Σ) λ2 tr AΣ(Σ + λId) 2 = λ tr AΣ(Σ + λ) 1 tr AΣ2(Σ + λId) 2 , (80)

r(3) 1 (A, Σ) p1 tr AΣ1C1K 2 = p1 1α2 tr AΣ2(Σ + λId) 2, (81)

r(3) 2 (A, Σ) p2 tr AΣ2C2K 2 = p2 2(1 α)2 tr AΣ2(Σ + λId) 2, (82)

r(4) 1 (A, Σ) p1 tr AΣ1D1K 2 = p1α tr AΣ2(Σ + λId) 2, (83)

r(4) 2 (A, Σ) p2 tr AΣ2D2K 2 = p2(1 α) tr AΣ2(Σ + λId) 2. (84)

We deduce that

j=1 pj σ2 j n r(4) j (Id, Σ) = (1 α)p1σ2 1 + αp2σ2 2 df2(λ; Σ)

B1 = r(3) 2 ( , Σ) + λ2r(2)(Γ, Σ) λ 0+ p2 2(1 α)2 tr . (86)

Putting things together then gives

Etest( bf CL) B1 + V1 p2 2(1 α)2 tr + (αp1σ2 1 + (1 α)p2σ2 2) d

as claimed.

Published as a conference paper at ICLR 2025

F.4 PROOF OF COROLLARY 2

Applying the first part of Corollary 1 recursively gives for any iteration t 1,

Etest( bf (t) CL) c2 t E + p2 2c2 t 1 . . . p2t 2 c2 0 + 1 p2t 2 1 p2 2

E, with E := σ2ϕ

Iterating the above gives

c2 t+1 σ2ϕt

1 ϕt + p2 2c2 t, ϕt = d/Nt, Nt = n, c2 0 = c2. (87)

Setting E := σ2ϕ/(1 ϕ) σ2d/n, we get

Etest( bf (t+1) CL ) c2 t+1 p2 2c2 t + σ2ϕt

p2 2c2 t 1 + σ2ϕt 1

p2(1+1) 2 c2 t 1 + p2 2 σ2ϕt 1 1 ϕt 1 + σ2ϕt

p2(t+1) 2 c2 0 + X

σ2ϕj 1 ϕj p2(t j) 2

= p2(t+1) 2 c2 + E X

0 j t p2j 2

= p2(t+1) 2 c2 + 1 p2(t+1) 2 1 p2 2

In particular, we if p2 is bounded away from 1 (i.e if p1 := 1 p2 = Ω(1)), we get

Etest( bf (t) CL) 1 1 p2 2

E + p2t 2 c2,

for large t. The first part is just a constant multiple of the scaling law we would have with training on a dataset comprising of n units of clean data.

On the other hand, we have

lim p2 1 Etest( bf (t) CL) c2 + t E.

This is an increasing function of t, lower-bounded by c2 + E. We recover the classical picture, in which model collapse prevails (depending on the size of c2, as per Corollary 1).

F.5 PROOF OF COROLLARY 4

From Theorem 1 and the observation that κ ϕ 1 and u 1/(1 ϕ/ϕ2) = ϕ/(ϕ 1) in the limit λ 0+, we have Etest( bw) E + ζ, with

E = V + B, B = r2 (ϕ 1)2

ϕ2 1 1 1/ϕ = r2 (1 1/ϕ) , V = σ2

p2(1 + p1 ϕ 1) + (p1 + ϕ 1)2 ,

and the first part of the result follows after some algebra.

The second part then follows from expanding the above around ϕ = .

Published as a conference paper at ICLR 2025

G PROOF OF PROPOSITION 2 AND THEOREM 2

G.1 PROOF OF PROPOSITION 2

We state and proof a more general result without the requirement Σ1 = Σ2 = Σ.

Let (e1, e2, τ) be the unique nonnegative solution to the following fixed-point equations

e1 = 1 1 + ψτ tr Σ1K 1 , (88)

e2 = 1 1 + ψτ tr Σ2K 1 , (89)

τ = 1 1 + tr K0K 1 , (90)

with K0 := p1e1Σ1 + p2e2Σ2, K := γτK0 + λId. (91) Also, let (v1, v2, ω) to be the unique nonnegative solution to the following fixed-point equations v1 = ψe2 1 tr Σ1(γτ 2L + λωId)K 2, (92)

v2 = ψe2 2 tr Σ2(γτ 2L + λωId)K 2, (93)

ω = τ 2 tr (γK2 0 + λL)K 2, (94) with L := p1v1Σ1 + p2v2Σ2 + λB. (95) Finally, define d d matrices C1, C2, D1, D2, E by C1 := γp1e2 1 γτ 2(B + p2u2Σ2) + ωId Σ1 + u1(γτp2e2Σ2 + λId)2, (96)

C2 := γp2e2 2 γτ 2(B + p1u1Σ1) + ωId Σ2 + u2(γτp1e1Σ1 + λId)2, (97)

D1 := τ 2e1B + (e1ω τv1)Id + γτ 2p2(e1u2 e2u1)Σ2, (98)

D2 := τ 2e2B + (e2ω τv2)Id + γτ 2p1(e2u1 e1u2)Σ1, (99)

E := γ(γτ 2B + ωId), (100) Proposition 3. In the proportionate scaling limit (6), it holds that

r(1) j (A) γτpjej tr AΣj K 1, (101)

r(3) j (A, B) γpj AΣj Cj K 2, (102)

r(4) j (A, B) γpj tr AΣj Dj K 2, (103)

r(5)(A, B) tr AEK 2. (104)

Observe that if we force τ = γ = 1 and ω = 0, then we recover the corresponding formulae given in Proposition 1. On the other hand, taking Σ1 = Σ2 = Σ gives Proposition 2.

Proof. WLOG, we only consider the cases where j = 1.

Computing r(1) 1 . There is a 11 11 minimal linear pencil Q such that ASRS M1 = Q 1[1, 10] (zero-based indexing). We deduce that

r(1) 1 := E tr ASRS M1 d G1,10, (105) where G := (id E tr )Q 1 C11 11. Moreover, G1,10 is given by the following fixed-point equations G1,10 = p1γG2,2G5,5 tr AΣ1K 1, (106) with K := γG2,2L + λId, L := p1G5,5Σ1 + p2G8,8Σ2, (107)

G5,5 = 1 1 + ϕγG2,2 tr Σ1K 1 = 1 1 + ψG2,2 tr Σ1K 1 , (108)

G8,8 = 1 1 + ϕγG2,2 tr Σ2K 1 = 1 1 + ψG2,2 tr Σ2K 1 , (109)

G2,2 = 1 1 + tr LK 1 , (110)

Published as a conference paper at ICLR 2025

Then, one deduces that tr ASRS M1 d G1,10 = p1e1τγ tr AΣ1K 1. (111)

Computing r(4) 1 . Here, the pencil Q is 20 20 and AM1SRS SRS = Q 1[1, 13]/λ. We deduce that r(4) 1 := E tr AM1SRS BSRS d G1,13/λ, (112)

where G := (id E tr )Q 1 C20 20. Moreover, G1,13 is given by the following fixed-point equations G1,13 = p1γ tr AΣ1TK 2, where (113)

T := λ(τ 2e1B + (e1G6,12 + τG3,15)Id) + p2γτ 2(e2G3,15 e1G9,18)Σ2, (114) G12,12 = G6,6 = τ, (115)

G3,15 = ϕe2 1G4,14

G4,14 = λγ tr Σ1 γτ 2(p1G3,15Σ1 + p2G9,18Σ2) λ(γτ 2B + G6,12Id) K 2, (117)

G9,18 = ϕe2 2G10,17

G10,17 = λγ tr Σ2 γτ 2(p1G3,15Σ1 + p2G9,18Σ2) λ(γτ 2B + G6,12Id) K 2, (119)

G6,12 = τ 2G7,11, (120)

G7,11 = tr (γK2 0 + λ(λB p1G3,15Σ1 p2G9,18Σ2))K 2, (121) We deduce that G3,15 = v1, G9,18 = v2, and G6,12 = ω, where v1, v2, ω 0 solve the following fixed-point equations v1 = ϕγe2 1 tr Σ1 γτ 2(p1v1Σ1 + p2v2Σ2) + λ(γτ 2B + ωId) K 2

= ψe2 1 tr Σ1(γτ 2L + λωId)K 2,

v2 = ϕγe2 2 tr Σ2 γτ 2(p1v1Σ1 + p2v2Σ2) + λ(γτ 2B + ωId) K 2

= ψe2 2 tr Σ2(γτ 2L + λωId)K 2,

ω = τ 2 tr (γK2 0 + λ(λB + p1v1Σ1 + p2v2Σ2))K 2 = τ 2 tr (γK2 0 + λL)K 2, with L := p1v1Σ1 + p1v2Σ2 + λB. Putting everything together then gives

r(4) j d G1,13

λ = p1γ tr AΣ1 e TK 2, where

e T := T/λ = τ 2e1B + (e1ω τv1)Id + p2γτ 2(e1u2 e2u1)Σ2 =: D1.

Computing r(3) 1 . The matrix of interest AM1SRS BSRS M1 admits a minimal linear pencil Q of size 21 21, such that the formal equals to Q 1[1, 20]. It follows that

r(3) 1 := E tr AM1SRS BSRS M1 d G1,20, (122)

where G := (id E tr )Q 1 C21 21. The fixed-point equations defining G1,20 are

G1,20 = p1 tr AΣ1(T/λ)K 2, where

T := p1G2 3,3γ(γτ 2(λB p2G9,18Σ2) + λG6,12Id)Σ1 G3,15(γτp2G9,9Σ2 + λId)2,

G3,3 = e1, G9,9 = e2, G6,12 = ω, G3,15 = v1, G9,18 = v2. Putting things together gives

r(3) 1 d G1,20 = tr AΣ1 e TK 2,

where e T := T/λ = γp1e2 1 γτ 2(B + p2u2Σ2) + ωId Σ1 + u1(γτp2e2Σ2 + λId)2 =: C1, which completes the proof.

Published as a conference paper at ICLR 2025

G.2 RECOVERING THEOREM 1 FROM THEOREM 2

Indeed, we have

ω 0, θ κ, u ϕI2,2(κ) 1 ϕI2,2(κ) = df2(κ)/n 1 df2(κ)/n,

for any regularization strength λ > 0, where κ is as defined in equation (8). Refer to Lemma 1. Plugging these limits into the formulae provided in Theorem 2 then recovers Theorem 1.

G.3 PROOF OF THEOREM 2

This follows directly from Proposition 2 and the computations in Section I.2.

H PHASE-DIAGRAM FOR RANDOM PROJECTIONS MODEL

H.1 THE GENERAL REGULARIZED CASE

Lemma 1. The scalars u and ω which appear in Theorem 2, and described in Definition 2, solve the following pair of linear equations

u = ϕI2,2(θ)(1 + u) + ϕI1,2(θ)ω ,

γω = I2,2(θ)ω + θ2I1,2(θ)(1 + u). (123)

Furthermore, the solutions can be explicitly represented as

u = ϕz γ ϕz I2,2(θ), ω = θ2I2,2(θ) γ ϕz I2,2(θ), (124)

where z = I2,2(θ)(γ I2,2(θ)) + θ2I1,2(θ)2.

In particular, in the limit γ , it holds that

θ κ, ω 0, u ϕI2,2(κ) 1 ϕI2,2(κ) df2(κ)/n 1 df2(κ)/n, (125)

where κ > 0 is as defined in (8).

Proof. The equations defining these are

u = ψe2 tr Σ(γτ 2L + ωId)K 2, (126)

ω = τ 2 tr (γωK2 0 + λ2L )K 2, (127)

where K0 = eΣ, K = γτK0 + λId, and L := uΣ + B. Further, since B = Σ, we have L = (1 + u)Σ. Now, we can rewrite the previous equations like so

u = ψe2 tr Σ(γτ 2(1 + u)Σ + ωId)K 2 = ϕγ2τ 2e2(1 + u) tr Σ2K 2 + ϕγe2ω tr ΣK 2,

ω = τ 2 tr (γωe2Σ2 + λ2(1 + u)Σ)K 2 = γτ 2e2ω tr Σ2K 2 + λ2τ 2(1 + u) tr ΣK 2.

This can be equivalently written as

u = ϕ(1 + u)γ2τ 2e2 tr Σ2K 2 + ϕω γ2τ 2e2 tr ΣK 2, (128)

γω = ω γ2τ 2e2 tr Σ2K 2 + (1 + u)λ2 tr ΣK 2. (129)

Now, observe that

τ 2e2 tr Σ2K 2 = tr Σ2(Σ + θId) 2/γ2 = I2,2(θ)/γ2, (130)

τ 2e2 tr ΣK 2 = tr Σ(Σ + θId) 2/γ2 = I1,2(θ)/γ2, (131)

λ2 tr ΣK 2 = θ2 tr Σ(Σ + θId) 2 = θ2I1,2(θ), (132)

e2 tr ΣK 2 = tr Σ(Σ + θId) 2/(γτ)2 = I1,2(θ)/(γτ)2, (133)

τ 2 tr ΣK 2 = tr Σ(Σ + θId) 2/(γe)2 = I1,2(θ)/(γe)2, (134)

Published as a conference paper at ICLR 2025

where we have used the definition θ = λ/(γτe). Thus, u and ω have limiting values u and ω respectively, which solve the system of linear equations

u = ψγ γ 2I2,2(θ)(1 + u) + ψγ γ 2I1,2ω = ϕI2,2(θ)(1 + u) + ϕI1,2(θ)ω ,

γω = I2,2(θ)ω + θ2I1,2(θ)(1 + u) = I2,2(θ)ω + θ2I1,2(θ)(1 + u),

where we have used the identity ϕγ = ψ. These correspond exactly to the equations given in the lemma. This proves the first part.

For the second part, indeed, τ = 1 η0/γ 1 in the limit γ , and so θ λ/(γe) which verifies the equation

θ λ + λψ tr Σ(γeΣ + λ) 1 = λ + ϕ λ

γe tr Σ(Σ + λ

γe Id) 1 λ + θ tr Σ(Σ + θId) 1/n,

i.e θ λ + θ df1(θ)/n and θ > 0. By comparing with the equation κ λ = κ df1(κ)/n satisfied by κ > 0 in (8), we conclude θ κ.

Now, the equations (123) become ω = 0, and u = ϕI2,2(κ)(1 + u), i.e

u = ϕI2,2(κ) 1 ϕI2,2(κ) df2(κ)/n 1 df2(κ)/n,

as claimed.

H.2 UNREGULARIZED LIMIT

Define the following auxiiliary quantities

θ := λ γτe, χ := λ

where τ, e, u, and ω are as previously defined in Section 3.2.

Lemma 2. In the limit λ 0+, we have the following analytic formulae

χ χ0 = (1 ψ)+ γθ0, (136) κ κ0 = (ψ 1)+ θ0/ϕ, (137) τ τ0 = 1 η0/γ, (138) e e0 = 1 ϕη0. (139)

Proof. From equations (9) and the constraint Σ1 = Σ2 = Σ, we know that e1 = e2 = e, where e and τ are unique positive solutions to a pair of fixed point equations. Observe that K0 = eΣ and K = γτK0 + λId = γτe (Σ + θId). Defining η := I1,1(θ), one can then rewrite the equations defining e and τ as follows

e = λ + ψτλ tr ΣK 1 = λ + ψτλ

γτe tr Σ(Σ + θId) 1 = λ + ϕηe , (140)

τ = λ + λ tr K0K 1 = λ + λe

γτe tr Σ(Σ + θId) 1 = λ + (η/γ)τ . (141)

We deduce that

e = λ 1 ϕη , τ = λ 1 η/γ , τ e = λγθ. (142)

In particular, the above means that η min(γ, 1/ϕ). The last part of equations (142) can be rewritten as follows

λ (1 ϕη)(1 η/γ) = γθ, i.e ϕη2 (ϕγ + 1)η + γ λ

θ = 0. (143)

Published as a conference paper at ICLR 2025

This is a quadratic equation for η as a function of λ and θ, with roots

η = ϕγ + 1 p

(ϕγ + 1)2 4(ϕγ (ϕ/θ)λ)

2ϕ = ψ + 1 p

(ψ + 1)2 4(ψ ϕ/θ )

Now, for small λ > 0 and ψ = 1, we can do a Taylor expansion to get

η ψ + 1 |ψ 1|

2ϕ 1 θ|ψ 1|λ + O(λ2).

More explicitly,

η+ O(λ2) + 1/ϕ + λ/((1 ψ)θ), if ψ < 1, γ + λ/((ψ 1)θ), if ψ > 1.

η O(λ2) + γ λ/((1 ψ)θ), if ψ < 1, 1/ϕ λ/((ψ 1)θ), if ψ > 1,

Because η min(1, 1/ϕ, γ), we must have the expansion

η O(λ2) + γ λ/((1 ψ)θ), if ψ < 1, 1/ϕ + λ/((ψ 1)θ), if ψ > 1,

= η0 1 (1 ψ)θ0 λ + O(λ2),

provided θ0 > 0, i.e η0 = 1. in this regime, we obtain

τ = λ 1 η/γ λ/(1 1 + λ/((1 ψ)γθ0)) = (1 ψ)γθ0, if ψ 1, λ/(1 1/ψ + o(1)) 0, if ψ > 1,

e = λ 1 ϕη λ/(1 ψ + o(1)) 0, if ψ 1, λ/(1 1 + λϕ/((ψ 1)θ0) (ψ 1)θ0/ϕ, if ψ > 1,

τ = 1 η/γ 1 η0/γ = (1 1/ψ)+, e = 1 ϕη 1 ϕη0 = (1 ψ)+.

On the other hand, if θ0 = 0 (which only happens if ψ < 1 and γ > 1 OR ψ 1 and ϕ 1), it is easy to see from (142) that we must have τ 0, e 0, τ 1 1/γ, e 1 ϕ 0.

Next, let s compute the limiting values of u and ω := ω/τ 2.

I RAW BIAS-VARIANCE DECOMPOSITION

I.1 CLASSICAL LINEAR MODEL

Proposition 4. Evaluated on the distribution Pk = PΣk,σ2 k,w k, the test error of model bf CL defined in (3) is given by

Etest( bf CL) = Bk + Vk, (146)

σ2 j n r(4) j (Id, Σk), (147)

( r(3) 2 ( , Σ1) + λ2r(2)(Γ, Σ1), if k = 1, r(3) 1 ( , Σ2) + λ2r(2)(Γ + , Σ2) + 2λr(4) 1 ( , Σ2), if k = 2. (148)

Published as a conference paper at ICLR 2025

Proof. Indeed, one computes

ED bw w k 2 Σk = EX1,Y1,X2,Y2Ex N(0,Σk)[(x bw x w k)2]

= EX1,Y1,X2,Y2 bw w k 2 Σk = EX1,Y1,X2,Y2 (M + λId) 1X Y/n w k 2 Σk = EX1,Y1,X2,Y2 (M + λId) 1X (X1w 1 + E1, X2w 2 + E2)/n w k 2 Σk = EX1,Y1,X2,Y2 (M + λId) 1(M1w 1 + M2w 2) w k 2 Σk + V1 + V2 = Bk + Vk,1 + Vk,2.

Bk := E (M + λId) 1(Mkw k + M kw k) w k 2 Σk, (149)

Vk,j := σ2 j n E tr Mj(M + λId) 1Σk(M + λId) 1 = σ2 j n r(4) j (Id, Σk). (150)

It remains to analyze the bias term Bk. To this end, observe that

(M + λId) 1Mk = Id (M + λId) 1(M k + λId) = Id (M + λM) 1M k λ(M + λId) 1.

Denoting M 1 = M2, M 2 = M1, w 1 = w 2, w 2 = w 1, and δk = ( 1)kδ, where δ := w 2 w 1, we deduce that

(M + λId) 1Mkw k + (M + λId) 1M kw k w k = (M + λId) 1M kw k (M + λId) 1M kw k λ(M + λId) 1w k = (M + λId) 1M kδk λ(M + λId) 1w k.

Since w 1 and δ1 = δ := w 2 w 1are independent with distributions N(0, Γ) and N(0, ) respectively, we deduce that

B1 = (M + λId) 1M2δ λ(M + λId) 1w 1 2 Σ1 = tr M2(M + λId) 1Σ1(M + λId) 1M2 + λ2 tr Γ1(M + λId) 1Σ1(M + λId) 1

= r(3) 2 ( , Σ1) + λ2r(2)(Γ, Σ1).

On the other hand, we have B2 = B2,1 + B2,2, where

B2 = (M + λId) 1M1δ λ(M + λId) 1w 2 2 Σ2 = (M + λId) 1M1δ λ(M + λId) 1(w 1 + δ) 2 Σ2 = (M + λId) 1 (M1 + λId) δ λ(M + λId) 1w 1 2 Σ2 = tr (M1 + λId)(M + λId) 1Σ2(M + λId) 1(M1 + λId) + λ2 tr Γ(M + λId) 1Σ2(M + λId) 1

= tr M1(M + λId) 1Σ2(M + λId) 1M1 + λ2 tr (M + λId) 1Σ2(M + λId) 1

+ 2λ tr M1(M + λId) 1Σ2(M + λId) 1 + λ2 tr Γ(M + λId) 1Σ2(M + λId) 1

= r(3) 1 ( , Σ2) + λ2r(2)(Γ + , Σ2) + 2λr(4) 1 ( , Σ2).

This completes the proof.

I.2 RANDOM PROJECTIONS MODEL

We now expand the test error Etest( bf RP ) of the random projections model bf RP defined in (5). For convenience, we recall the definition of the model here. Let S be a d m random matrix with iid entries from N(0, 1/d). The model bf RP is defined by bf RP (x) := Φ(x) bv, where Φ(x) := S x Rm defines a random feature map, and bv Rm is given by

arg min v Rm L(w) = X

Φ(Xk)v Yk 2 2 n + λ v 2 2. (151)

Published as a conference paper at ICLR 2025

Note that the gradient L(v) of the regularized loss L is given by

k S X k (Xk Sv Yk)/n + λv = X

k S Mk Sv X

k S X k Yk/n + η

k S X k Yk/n,

where H := S MS + λIm Rm m, with M := M1 + M2 and Mk := X k Xk/n. Thus, setting R := H 1, we may write

bv = RS (X 1 Y1 + X 2 Y2)/n = RS (M1w1 + M2w2) + RS X 1 E1/n + RS X 2 E2/n.

Now, one deduces the bias-variance decomposition

Etest( bf RP ) = EDEx N(0,Σk)[( bf RP (x) x w 1)2] = EX1,E1,X2,E2 Sbv wk 2 Σk = Bk + Vk,

where Vk := Vk,1 + Vk,2, with Vk,j := σ2 j n EX1,X2 tr S Mj SRS Σk SRS ,

Bk := EX1,X2 SRS (M1w1 + M2w2) wk 2 Σk.

The variance terms Vk,j can be directly handled via FPT computations. We now look at the bias term Bk. We first treat the case k = 1. One has

E SRS (M1w1 + M2w2) w1 2 Σ = E (SRS (M1 + M2) Id)w1 + SRS M2δ 2 Σ = E (SRS M Id)w1 2 Σ + E SRS M2δ 2 Σ = E tr Γ(SRS M Id)Σ(MSRS Id) + E tr M2SRS ΣSRS M2 = tr ΓΣ + tr ΓSRS MΣMSRS 2E tr ΓΣSRS M + E tr M2SRS ΣSRS M2 = tr ΓΣ + E tr ΣMSRS ΓSRS M 2E tr ΓΣSRS M | {z } classical term (B) + E tr M2SRS ΣSRS M2 | {z } extra term (ζ)

where we recall that R := (S MS + λIm) 1 and M := M1 + M2 with Mk = X k Xk.

For the purposes of FPT computations, it might help to observe that Mk = λΣ1/2 k Z k ZkΣ1/2 k , where Zk := XkΣ1/2 k /(nλ) is an nk d random matrix with iid entries from N(0.1/(nλ)). Thus,

Mk = λM k, (152)

M k = Σ1/2 k Z k ZkΣ1/2 k , (153)

M = λM, (154)

M = M 1 + M 2 = Σ1/2 1 Z 1 Z1Σ1/2 1 + Σ1/2 2 Z 2 Z2Σ1/2 2 ), (155)

R = R/λ, (156)

R = (S MS + Im) 1 = S Σ1/2 1 Z 1 Z1Σ1/2 1 S + S Σ1/2 2 Z 2 Z2Σ1/2 2 S + Im 1 . (157)

We need minimal linear pencils for the random matrices

AM 1SRS BSRS , (158)

AMSRS BSRS M (159)

ASRS M, (160)

AM 2SRS BSRS M 2, (161)

in terms of the set of free variables {A, B, Σ1/2 1 , Σ1/2 2 , S, Z1, Z2, S , Z 1 , Z 2 }. Observe that

tr AMSRS BSRS M

= tr AM1SRS BSRS M1 + tr AM2SRS BSRS M2 + 2 tr AMSRS BSRS M,

tr ASRS M = tr ASRS M1 + tr ASRS M2.

Published as a conference paper at ICLR 2025

For our business, it is therefore sufficient to only compute (minimal) linear pencils for

ASRS M 1, (162) AM 1SRS BSRS , (163) AM 1SRS BSRS M 1, (164) AM 1SRS BSRS M 2, (165)

where M k := Σ1/2 k Z k ZkΣ1/2 k , R := S MS + Im 1 , M := M 1 + M 2.

Observe that without the S matrix (i.e taking m = d and S = Id), the four matrix expressions above reduce to what we had in the classical case.