# Model Collapse Demystified: The Case of Regression

Elvis Dohmatob (FAIR, Meta), Yunzhen Feng (Center for Data Science, New York University), Julia Kempe (Center for Data Science and Courant Institute of Mathematical Sciences, New York University). Correspondence to dohmatob@meta.com.

The era of proliferation of large language and image generation models begs the question of what happens if models are trained on the synthesized outputs of other models. The phenomenon of "model collapse" refers to the situation whereby a model trained recursively on data generated from previous generations of itself degrades over time, until it eventually becomes completely useless, i.e. the model collapses. In this work, we investigate this phenomenon in the context of high-dimensional regression with Gaussian data, considering both low- and high-dimensional asymptotics. We derive analytical formulas that quantitatively describe this phenomenon in both the under-parameterized and over-parameterized regimes. We show how the test error increases linearly in the number of model iterations, in terms of all problem hyperparameters (covariance spectrum, regularization, label noise level, dataset size), and further isolate how model collapse affects both the bias and variance terms in our setup. We show that even in the noise-free case, catastrophic (exponentially fast) model collapse can happen in the over-parametrized regime. In the special case of polynomially decaying spectral and source conditions, we obtain modified scaling laws which exhibit new crossover phenomena from fast to slow rates. We also propose a simple strategy based on adaptive regularization to mitigate model collapse. Our theoretical results are validated with experiments.

1 Introduction

Model collapse describes the situation where the performance of large language models (LLMs) or large image generators degrades as more and more AI-generated data becomes present in their training dataset [44]. Indeed, in the early stages of the generative AI evolution (e.g. the ChatGPT-xyz series of models), there is emerging evidence suggesting that retraining a generative AI model on its own outputs can lead to various anomalies in the model's later outputs. This phenomenon has been particularly observed in LLMs, where retraining on their generated content introduces irreparable defects, resulting in what is known as "model collapse": the production of nonsensical or gibberish output [44, 8]. Though several recent works demonstrate facets of this phenomenon empirically in various settings [23, 32, 33, 8, 9, 21], a theoretical understanding is still missing.

In this work, we initiate a theoretical study of model collapse in the setting of high-dimensional supervised learning with linear regression. This is equivalent to kernel regression (we present the linear regression setting for ease; the extension to the kernel setting is straightforward), which serves as an effective proxy for neural networks in various regimes, for instance in the infinite-width limit [37, 49, 25, 28] or in the lazy regime of training [12]. [11] characterize the power-law generalization error of regularized least-squares kernel algorithms, assuming a power-decay spectrum of the kernel (capacity) and of the coefficients of the target function (source).

38th Conference on Neural Information Processing Systems (NeurIPS 2024).
Figure 1: Demystifying model collapse. Refer to Appendix D for details on the experimental setup. (a) Isotropic covariance spectrum, $\Sigma = I_d$. We show the evolution of the test error for different sample sizes $T$, different levels of ridge regularization $\lambda$, and training data from different generations $n$ of fake data. The setup is: input dimension $d = 300$, sample size for the fake data generator $T_0 = 600$, noise levels $\sigma = 0.1$ and $\sigma_0 = 0.2$. The left plot is for $T = 1000$ and different values of $\lambda$. Notice the U-shape of the curves for large values of $n$, indicating the existence of a sweet spot (optimal regularization parameter). The right plot is for $\lambda = 10^{-3}$ and different values of $T$. The broken lines correspond to the theoretical result established in Theorem 4.1. (b) Power-law covariance spectrum; refer to Eqn. (23). The setup is: $d = 300$, $T_0 = 600$, $\sigma = \sigma_0 = 1$, $\Sigma = \mathrm{diag}(\lambda_1, \ldots, \lambda_d)$, where $\lambda_k \asymp k^{-2}$. The left plot corresponds to $T = 10{,}000$ and the right plot to adaptive regularization $\lambda = T^{-\ell_{crit}}$, i.e. $\lambda = \lambda(T)$ as proposed in [14]. See Section D for details. The broken curves are as predicted by our Theorem 5.1. Though $\ell = \ell_{crit}$ is optimal in the classical case, it is not in the setup of model collapse; in fact, here the test error diverges with sample size $T$. Our theory proposes a corrected value of this exponent which gracefully adapts to synthesized data. See Figure 4 (Appendix) for results on MNIST [16].

Source and capacity power-decay conditions capture properties of the data and the model that give rise to power-law scaling of the test error in terms of dataset size and model capacity, as empirically observed e.g. in [26, 24]. More recently, scaling laws have been shown for kernel models under the Gaussian design, e.g. in [46, 13, 14] for regression and [15] for classification. [39, 43, 31] study scaling laws for regression in the random feature model.

Summary of Main Contributions. Following the rich tradition in the prior works outlined above, we study the Gaussian design where the input $x$ is sampled from a multivariate zero-mean Gaussian $N(0, \Sigma)$ and labels $y$ are determined by a linear ground-truth function with independent label noise $\epsilon$, i.e. $y = x^\top w_0 + \epsilon$ (we present the linear regression setting for ease; the generalization to the kernel setting is straightforward). At each generation step, an approximation to $w_0$ is learned from the data and used to generate new, fake / synthetic labels for the next generation. Note that the machine learner has no control over the fake data generation process: it only sees data from a stage $n$ of this process, which is then used to fit a downstream predictor. Our main findings can be summarized as follows:

(1) Exact Characterization of Test Error under Iterative Retraining on Synthesized Data. In Section 4 (Theorem 4.3), we obtain analytic formulae for the test error under the influence of training data with fake / synthesized labels. For $n$-fold iteration of data generation, this formula reads
$$E_{test} = E_{test}^{clean} + \overline{\mathrm{Bias}} + n\,\sigma_0^2\,\rho(\lambda, T, T_0, \sigma, \Sigma), \qquad (1)$$
where $E_{test}^{clean}$ is the usual test error of the model trained on clean data (not AI-generated) and $\sigma_0^2$ is the label noise level in the clean data distribution. The non-negative term $\rho$ precisely highlights the effects of all the relevant problem parameters: the feature covariance matrix $\Sigma$, sample size $T$, original data size $T_0$, label noise level in the fake data distribution $\sigma^2$, and regularization $\lambda$.
The non-negative term $\overline{\mathrm{Bias}}$ is an increase in bias brought about by the iterative synthetic data generation process. This term disappears in the under-parametrized regime (Corollary 4.4), i.e. if each stage in the process was fitted on sufficiently many samples $T_0$ compared to the input dimension $d$ (i.e. if $T_0 \gtrsim d$). In the over-parametrized case where $T_0 < d$, this term is either a constant (Theorem 4.5) or an increasing function of $n$ (Theorem 4.6), depending on whether the design matrix stays the same or is resampled across different generations. Notably, even in the case of noiseless labels (when $\sigma_0 = 0$), the downstream model converges to a Gaussian process around zero exponentially fast with the number of iterations $n$, leading to "catastrophic" model collapse. A direct consequence of (1) is that, as the number of generations $n$ becomes large, the effect of re-synthesizing will make learning impossible. We note that the multiplicative degradation in scaling with the number of generations $n$ is completely analogous to what has been shown in [18] for infinite memory models and their variants and empirically observed there. See the illustration in Figures 1a and 2.

Figure 2: Model collapse in the case of a noiseless over-parametrized synthetic data generator. (a) Identical intermediate design matrices $X_n = X_0$ for all $n$. (b) Independent intermediate design matrices. Here $d = 300$, the sample sizes for the different versions of the fake data generator are equal, i.e. $T_n = T_0 = d/2$ for all $n$, and the noise levels are $\sigma_0 = 0$ and $\sigma = 0.1$. Everything else is as in the setting of Figure 1a. Broken lines correspond to the theoretical estimates given in Theorem 4.3. As predicted by our theory, the test error of the model fitted on synthetic data ($n \geq 1$) increases relative to the baseline $n = 0$, corresponding to training on clean data. The model collapse here, even in the absence of noise ($\sigma_0 = 0$), is due to the fact that the synthetic data generator does not have access to enough data to capture the true labelling function. (a) Importantly, and in accordance with our theory, the amount of model collapse in the case $X_n \equiv X_0$ is due to an increase in the bias term of the test error of the model and does not depend on the number of generations $n$ as long as $n \geq 1$. (b) In contrast, for the case where the $X_n$'s are independent, the increase in the bias term grows with $n$, leading to "catastrophic" model collapse (Theorem 4.6). Refer to Appendix D for the experimental setup.

(2) Modified Scaling Laws. Turning to the special case of power-law spectra of the covariance matrix $\Sigma$, which allows to derive test-error scaling laws [11, 46, 14, 29], we obtain in Section 5 (see Theorem 5.1) precise new scaling laws of the test error that quantitatively highlight the negative effect of training on synthetically generated data. Further exploiting our analytic estimates, we obtain (Corollary 5.2) the optimal ridge regularization parameter $\lambda$ as a function of all the problem parameters (sample size, spectral exponents, strength of the fake data generator, etc.).
This new regularization parameter corresponds to a correction of the value proposed in the classical theory on clean data [14], and highlights a novel crossover phenomenon: for an appropriate tuning of the regularization parameter, the effect of training on fake data is a degradation of the fast error rate in the noiseless regime [14, 11] to a much slower error rate which depends on the amount of true data on which the fake data generator was trained in the first place. On the other hand, a choice of regularization which is optimal for the classical setting (training on real data) might lead to catastrophic failure: the test error diverges. See Figure 1b for an illustration.

2 Related Work

Current LLMs [17, 30, 10, 47], including GPT-4 [1], were trained on predominantly human-generated text; similarly, diffusion models like DALL-E [40], Stable Diffusion [42], and Midjourney [35] are trained on web-scale image datasets. Their training corpora already potentially exhaust all the available clean data on the internet. A growing amount of synthetic data generated with these increasingly popular models starts to populate the web, often indistinguishable from "real" data. Recent works call attention to the potential dramatic deterioration in the resulting models, an effect referred to as "model collapse" [44]. Empirical evidence of model collapse has been reported across various domains [23, 32, 33, 8, 9, 21]. Some theoretical studies [44, 7, 2, 18] have begun exploring this phenomenon. [44] attribute collapse to finite sampling bias and function approximation errors in the (single) Gaussian case, but only provide lower bounds without detailed analytic expressions. [7] analyze the training process at the distribution level using both clean and synthetic data and provide stability results. However, these results do not account for finite samples and are only valid locally in parameter space, making them more relevant to fine-tuning rather than training from scratch. [2] examine self-consuming loops in the Gaussian case by assuming a sampling bias that reduces data variance with each generation, a (martingale) assumption that we do not require. These studies lack a comprehensive theoretical framework to quantify model collapse and its impact on scaling laws. Our work addresses these gaps by providing an analytic theory that captures how model collapse emerges from training on synthetic data, providing a deeper understanding that goes beyond merely identifying the collapse. A concurrent study by [18] demonstrates that model collapse in foundation models can be attributed to a breakdown in scaling laws [26, 24], where increasing the sample size eventually fails to improve model performance. This finding, theoretically shown for discrete data in variants of the infinite memory model, complements our analytical results on how synthetic data alters the rate of scaling laws, as discussed in Section 5.

3 Theoretical Setup

We now present a setup which is simple enough to be analytically tractable, but rich enough to exhibit a wide range of regimes and to illustrate a range of new phenomena that emerge with model collapse.

Data Distribution and Synthesized Data. Consider the distribution $P_{\Sigma, w_0, \sigma^2}$ on $\mathbb{R}^d \times \mathbb{R}$ given by
$$\text{(Input)} \quad x \sim N(0, \Sigma), \qquad \text{(Noise)} \quad \epsilon \sim N(0, \sigma^2) \text{ indep. of } x, \qquad \text{(Output/Label)} \quad y = x^\top w_0 + \epsilon. \qquad (2)$$
The positive integer $d$ is the input dimension, the vector $w_0 \in \mathbb{R}^d$ defines the ground-truth labelling function $x \mapsto x^\top w_0$, the matrix $\Sigma \in \mathbb{R}^{d \times d}$ is the covariance structure of the inputs, and the scalar $\sigma^2$ is the level of label noise. Here, we consider the linear case for clarity; we describe the extension to the kernel setting in Appendix C.
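For concreteness, the following minimal Python (numpy) sketch shows how one can sample from $P_{\Sigma, w_0, \sigma^2}$ in (2) and evaluate the excess test error (3); the function names, the toy covariance, and all sizes are illustrative choices of ours, not part of the formal setup.

```python
import numpy as np

def sample_data(T, w0, Sigma_sqrt, sigma, rng):
    """Draw T pairs (x, y) from the model (2): x ~ N(0, Sigma), y = x^T w0 + eps."""
    d = w0.shape[0]
    X = rng.standard_normal((T, d)) @ Sigma_sqrt.T  # rows are N(0, Sigma) when Sigma = S S^T
    y = X @ w0 + sigma * rng.standard_normal(T)
    return X, y

def test_error(w_hat, w0, Sigma):
    """Excess test error (3): ||w_hat - w0||_Sigma^2 (the noise floor sigma^2 is subtracted off)."""
    delta = w_hat - w0
    return float(delta @ Sigma @ delta)

rng = np.random.default_rng(0)
d = 50
Sigma = np.diag(1.0 / np.arange(1, d + 1))  # a toy anisotropic covariance spectrum
Sigma_sqrt = np.sqrt(Sigma)                 # valid square root because Sigma is diagonal here
w0 = rng.standard_normal(d) / np.sqrt(d)
X, y = sample_data(200, w0, Sigma_sqrt, sigma=0.1, rng=rng)
```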
Thus, in classical linear regression, given a sample $(X, Y) = \{(x_1, y_1), \ldots, (x_T, y_T)\}$ of size $T$ from $P_{\Sigma, w_0, \sigma^2}$, one seeks a linear model $\hat{w} \in \mathbb{R}^d$ with small test error
$$E_{test}(\hat{w}) := \mathbb{E}_{x, y}\big[(x^\top \hat{w} - y)^2\big] - \sigma^2 = \|\hat{w} - w_0\|_\Sigma^2, \qquad (3)$$
where $(x, y) \sim P_{\Sigma, w_0, \sigma^2}$ is a random clean test point. In our setup for studying model collapse, the training data $(X, Y)$ is sampled from an iterative loop where each generation of the model serves as the labeller for the data of the next generation. This process is described below.

Structure of the Synthesized / Fake Data Generator. Consider a sequence of data distributions
$$P_{\Sigma, w_0, \sigma_0^2} \to P_{\Sigma, \hat{w}_1, \sigma_1^2} \to \ldots \to P_{\Sigma, \hat{w}_n, \sigma_n^2} \to \ldots, \qquad (4)$$
where the $\hat{w}_n$'s are defined recursively by
$$\hat{w}_0 = w_0, \quad \text{and} \quad \hat{w}_n = \mathrm{Fit}(X_{n-1}, Y_{n-1}) \text{ for } n \geq 1, \qquad (5)$$
where $Y_n := X_n \hat{w}_n + E_n$ and $\mathrm{Fit}(A, B) = \mathrm{OLS}(A, B) := A^\dagger B$ is ordinary least-squares (OLS). The design matrices $(X_n)_{n \geq 0}$ are of shapes $T_n \times d$, each with iid rows from $N(0, \Sigma)$. The sequence of noise vectors $(E_n)_{n \geq 0}$ forms an independent collection, which is independent of the $(X_n)_{n \geq 0}$; each $E_n \in \mathbb{R}^{T_n}$ has iid components $\epsilon_{n,i}$ from $N(0, \sigma_n^2)$. Refer to Figure 3.

Figure 3: Illustration of the theoretical framework (fake / synthesized data generator: model generation, data sampling, model training). The process begins with the original model $\hat{w}_0 = w_0$ and the original dataset $(X_0, Y_0)$. The $n$ synthetic data generators $\hat{w}_1$ to $\hat{w}_n$ are iteratively fit on data labelled by the previous model with label noise $\sigma_0$, using $T_0$ samples each. We evaluate the test error (with respect to the ground-truth labels from $w_0$) of $\hat{w}^{pred}_n$, trained on $(X, Y) := (X_n, Y_n)$ using $T$ samples with label noise $\sigma$ and a regularization coefficient $\lambda$.

Thus, in summary, each $\hat{w}_n$ results from fitting a model on a dataset of size $T_{n-1}$ from $P_{\Sigma, \hat{w}_{n-1}, \sigma_{n-1}^2}$, for every generation index $n \geq 1$.

The Downstream Model: Ridge Regression. For a number of iterations $n \geq 0$, noise levels $\sigma_0$ and $\sigma$, dataset sizes $T_0$ and $T$, and regularization parameter $\lambda \geq 0$, let $\hat{w}^{pred}_n = \hat{w}^{pred}_{n, T_0, \sigma_0^2, T, \sigma, \lambda} \in \mathbb{R}^d$ be the ridge predictor constructed from an iid sample $\{(x_1, y_1), \ldots, (x_T, y_T)\}$ of size $T$ from the $n$-fold fake data distribution $P_{\Sigma, \hat{w}_n, \sigma_n^2}$, where for ease of presentation of our results we will assume that
$$T_{n-1} = \ldots = T_1 = T_0, \quad T_n = T \quad \text{and} \quad \sigma_{n-1} = \ldots = \sigma_1 = \sigma_0, \quad \sigma_n = \sigma. \qquad (6)$$
For an $n$-fold fake data generator $P_{\Sigma, \hat{w}_n, \sigma_n^2}$, we denote by $X := X_n \in \mathbb{R}^{T \times d}$ the design matrix with iid rows from $N(0, \Sigma)$, by $E := E_n \in \mathbb{R}^T$ the stage-$n$ label-noise vector with iid components from $N(0, \sigma_n^2)$, and by $Y := Y_n = X \hat{w}_n + E \in \mathbb{R}^T$ the labels generated by $P_{\Sigma, \hat{w}_n, \sigma_n^2}$. Let $\hat{\Sigma} := X^\top X / T \in \mathbb{R}^{d \times d}$ denote the sample covariance matrix and $R = R(\lambda) := (\hat{\Sigma} + \lambda I_d)^{-1}$ its resolvent, so that
$$\hat{w}^{pred}_n = R X^\top Y / T \text{ for } \lambda > 0, \quad \text{and} \quad \hat{w}^{pred}_n = X^\dagger Y \text{ for } \lambda = 0. \qquad (7)$$
We are interested in the dynamics of the test error $E_{test}(\hat{w}^{pred}_n)$ of this linear model. Importantly, the evaluation of the model is performed on the true data distribution $P_{\Sigma, w_0, \sigma_0^2}$, even though the model is trained on the fake data distribution $P_{\Sigma, \hat{w}_n, \sigma^2}$. Note that for $n = 0$, $E_{test}^{clean} := E_{test}(\hat{w}^{pred}_n)$ corresponds to the usual test error when the downstream model is trained on clean data. Importantly, the downstream model has no control over this process: it will only see training data from a given version $P_{\Sigma, \hat{w}_n, \sigma_n^2}$, but evaluation will be on the true distribution $P_{\Sigma, w_0, \sigma_0^2}$.
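The synthetic-data loop (4)-(7) is straightforward to simulate. Below is a hedged numpy sketch of the generator and the downstream ridge fit; sample sizes, the random-number generator, and all function names are our own choices and are only meant to illustrate the process.

```python
import numpy as np

def ols(X, y):
    """Fit(A, B) = OLS(A, B) := A^+ B, as in Eq. (5)."""
    return np.linalg.pinv(X) @ y

def ridge(X, y, lam):
    """Downstream ridge predictor (7): R X^T Y / T with R = (Sigma_hat + lam I)^{-1}.
    (For lam == 0 one would instead use the pseudo-inverse, as in (7).)"""
    T, d = X.shape
    Sigma_hat = X.T @ X / T
    return np.linalg.solve(Sigma_hat + lam * np.eye(d), X.T @ y / T)

def fake_data_generator(w0, Sigma_sqrt, n, T0, sigma0, rng):
    """Iterate (4)-(5): each generation is an OLS fit on T0 points labelled by the previous model."""
    w = w0.copy()
    d = w0.shape[0]
    for _ in range(n):
        X = rng.standard_normal((T0, d)) @ Sigma_sqrt.T
        y = X @ w + sigma0 * rng.standard_normal(T0)
        w = ols(X, y)        # w_hat_{m+1} = Fit(X_m, Y_m)
    return w                 # the stage-n labelling vector w_hat_n

def downstream_ridge(w_n, Sigma_sqrt, T, sigma, lam, rng):
    """Fit the downstream ridge model (7) on a fresh sample of size T labelled by w_hat_n."""
    d = w_n.shape[0]
    X = rng.standard_normal((T, d)) @ Sigma_sqrt.T
    y = X @ w_n + sigma * rng.standard_normal(T)
    return ridge(X, y, lam)
```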
The mental picture is as follows: each generation $\hat{w}_n$ can be seen as a proxy for a specific version of ChatGPT, for example. The sample size $T_0$ used to create the fake labelling functions $\hat{w}_n$ is a proxy for the strength of the fake data generator thus constructed. Other works which have considered model collapse under such a self-looping training process include [44, 2, 7, 18].

4 Exact Test Error Characterization

In this section we establish generic analytic formulae for the test error of the downstream model $\hat{w}^{pred}_n$ of (7) trained on $n$-fold fake data generation as outlined in Section 3. The fully general technical key Theorem F.1 detailing formula (1), with a trace expression for $\rho$ (as well as proofs), is given in Appendix F; consult part F.1 for an exposition. Notations are standard (summarized in Appendix E).

4.1 Warm-up: Ordinary Least Squares on Isotropic Data

For a start, let us first consider the case of unregularized regression, where $\lambda = 0$ in Equation (7).

Theorem 4.1. For an $n$-fold fake data generation process with $T_0 \geq d + 2$ samples, the test error for the linear predictor $\hat{w}^{pred}_n$ in Equation (7) learned on $T \geq d + 2$ samples, with $\lambda = 0$, is given by
$$E_{test}(\hat{w}^{pred}_n) \simeq \frac{\sigma^2 \phi}{1 - \phi} + \frac{n \sigma_0^2 \phi_0}{1 - \phi_0}, \quad \text{with } \phi = d/T, \ \phi_0 = d/T_0, \qquad (8)$$
where the notation $f(T) \simeq g(T)$ means $f(T)/g(T) \to 1$ for large $T$.

The first term $E_{test}(\hat{w}^{pred}_0) \simeq \sigma^2 \phi / (1 - \phi)$ in the above decomposition corresponds to the usual error when the downstream model is fitted on clean data (see [22], for example). The additional term $n \sigma_0^2 \phi_0 / (1 - \phi_0)$, proportional to the number of generations $n$, is responsible for model collapse.

Model collapse versus more training data. Note that the linear degeneration in test error highlighted by Equation (8) is a direct consequence of using the same dataset size $T_0$ across the fake data generator. Of course, if the underlying synthetic generating process has access to a larger data budget across generations, this decay can be significantly alleviated. For instance, if fake data increases gradually with the number of generations $m \geq 2$ as $T_m = (m \log^2 m) T_0$ (and, to simplify, $\sigma = \sigma_0$), a trivial extension of Theorem 4.1 yields
$$E_{test}(\hat{w}^{pred}_n) \lesssim \Big(1 + \frac{1}{2 \log^2 2} + \frac{1}{3 \log^2 3} + \ldots\Big) E_{test}(\hat{w}^{pred}_0) \asymp E_{test}(\hat{w}^{pred}_0),$$
which will keep collapse at bay at the expense of largely increased training data ([44] also has a similar formula). This does not avoid model collapse; rather, it trades additional data generation and training effort against deterioration from generations of fake data. Thus, while for clean data increasing the dataset size $n$-fold leads to better scaling, with synthetic data we forfeit this improvement. Also, note that we do not assume access to samples from any of the intermediate generation steps $\hat{w}_0, \ldots, \hat{w}_{n-1}$; we only train the downstream model $\hat{w}^{pred}_n$ on data from the last step $\hat{w}_n$.

Model Collapse as Change of Scaling Laws. In the low-dimensional regime (fixed $d$), Theorem 4.1 already predicts a change of scaling law from $\sigma^2 T^{-1}$ to $\sigma^2 T^{-1} + n \sigma_0^2 T_0^{-1}$. Thus, as the sample size $T$ is scaled up, the test error eventually plateaus at the value $n \sigma_0^2 T_0^{-1}$ and does not vanish. This phenomenon, also established in [18] in the context of large language models, is clearly visible in Figure 1a. In the rest of this section, and also in Section 5, we shall establish an analogous picture for high-dimensional regimes ($d \to \infty$).
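As a quick sanity check of Theorem 4.1, the following Monte Carlo sketch compares the empirical test error of the ridgeless predictor with the right-hand side of (8). All sizes below are toy values of ours, chosen only so that the script runs quickly.

```python
import numpy as np

rng = np.random.default_rng(0)
d, T0, T = 50, 200, 400          # toy sizes with T0, T >= d + 2
sigma0, sigma = 0.2, 0.1
w0 = rng.standard_normal(d) / np.sqrt(d)

def ols(X, y):
    return np.linalg.pinv(X) @ y

for n in [0, 1, 2, 4]:
    errs = []
    for _ in range(200):                       # Monte Carlo over data draws
        w = w0.copy()
        for _ in range(n):                     # n-fold fake-data generation (OLS, T0 samples)
            X = rng.standard_normal((T0, d))
            w = ols(X, X @ w + sigma0 * rng.standard_normal(T0))
        X = rng.standard_normal((T, d))        # downstream fit on T samples, lambda = 0
        w_pred = ols(X, X @ w + sigma * rng.standard_normal(T))
        errs.append(np.sum((w_pred - w0) ** 2))  # ||.||_Sigma^2 with Sigma = I_d
    phi, phi0 = d / T, d / T0
    theory = sigma**2 * phi / (1 - phi) + n * sigma0**2 * phi0 / (1 - phi0)
    print(f"n={n}: empirical={np.mean(errs):.4f}  theory={theory:.4f}")
```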
Mitigation via Regularization. Note that the test error of the null predictor $w_{null} = 0$ is $E_{test}(w_{null}) = \|w_0\|_\Sigma^2$, and so
$$\frac{E_{test}(\hat{w}^{pred}_n)}{E_{test}(w_{null})} \simeq \frac{1}{\mathrm{SNR}} \cdot \frac{\phi}{1 - \phi} + \frac{n}{\mathrm{SNR}_0} \cdot \frac{\phi_0}{1 - \phi_0},$$
where $\mathrm{SNR} := \|w_0\|_\Sigma^2 / \sigma^2$ and $\mathrm{SNR}_0 := \|w_0\|_\Sigma^2 / \sigma_0^2$. We deduce that if $n \gtrsim \mathrm{SNR}_0\,(1/\phi_0 - 1)$, then the learned model is already much worse than the null predictor! This suggests that a possible strategy for mitigating the negative effects of learning on AI-generated data is regularization, as empirically illustrated in Figures 1a, 1b, 2, and also in Figure 4 of Appendix D. Furthermore, in Section 5 we shall establish that the optimal regularization parameter established in [14], in the case of polynomially decreasing spectra (a regime which is relevant to wide neural networks), must be modified in the presence of synthetic training data in order to prevent the generalization error from diverging to infinity (i.e. catastrophic failure).

4.2 High-Dimensional Regimes

In order to analyze the trace term $\rho$ appearing in Equation (1) (and spelled out in (32) in Appendix F.1), we need some tools from RMT, and ultimately obtain analytic formulae for $E_{test}(\hat{w}^{pred}_n)$ in Theorem 4.3. Such tools have been used extensively to analyze anisotropic ridge regression [41, 22, 4].

Random Matrix Equivalents. For any sample size $T \geq 1$ and $\lambda \geq 0$, define $\kappa(\lambda, T)$ implicitly by
$$\kappa(\lambda, T) - \lambda = \kappa(\lambda, T)\,\mathrm{df}_1(\kappa(\lambda, T)) / T, \qquad (9)$$
where, for any $\lambda \geq 0$ and $m \in \mathbb{N}$, the $m$th-order "degree of freedom" of the covariance matrix $\Sigma$ is given by $\mathrm{df}_m(\lambda) = \mathrm{df}_m(\lambda; \Sigma) := \mathrm{tr}\,\Sigma^m (\Sigma + \lambda I_d)^{-m}$. The effect of ridge regularization at level $\lambda \geq 0$ is to improve the conditioning of the empirical covariance matrix $\hat{\Sigma}$; what the $\kappa$-function does is translate this into regularization on $\Sigma$ at level $\kappa(\lambda, T)$, so as to control the capacity of the former, i.e. the "effective dimension" of the underlying problem. Quantitatively, there is an equivalence of the form $\mathrm{df}_1(\lambda; \hat{\Sigma}) \simeq \mathrm{df}_1(\kappa(\lambda, T); \Sigma)$. Roughly speaking, RMT is the business of formalizing such relationships and derivatives (w.r.t. $\lambda$) thereof. A standard reference on the subject is [5].

Example: Isotropic Data. As an illustration, note that $\mathrm{df}_m(\lambda) = d / (1 + \lambda)^m$ (polynomial decay) in the isotropic case where $\Sigma = I_d$. Consequently, we have $\kappa(\lambda, T) - \lambda = \phi\,\kappa(\lambda, T) / (1 + \kappa(\lambda, T))$, with $\phi := d/T$. In this case, it is easy to obtain the following well-known formula for $\kappa = \kappa(\lambda, T)$:
$$\kappa = \frac{\lambda + \bar{\phi} + \sqrt{(\lambda + \bar{\phi})^2 + 4\lambda}}{2}, \quad \text{with } \bar{\phi} := \phi - 1, \qquad (10)$$
which is reminiscent of the celebrated Marchenko-Pastur law [34].

Asymptotic Regime. We shall work in the following so-called proportionate asymptotic scaling regime, which is standard in analyses based on random matrix theory (RMT):
$$T, d \to \infty, \quad d/T \to \phi, \quad \|\Sigma\|_{op}, \|\Sigma^{-1}\|_{op} = O(1). \qquad (11)$$
Later, in Section 5, when we consider power-law spectra, this scaling will be extended to account for the more realistic case where $d$ and $T$ are of the same order on a log scale, i.e.
$$T, d \to \infty, \quad d^{1/C} \lesssim T \lesssim d^C, \quad \|\Sigma\|_{op}, \|\Sigma^{-1}\|_{op} = O(1), \qquad (12)$$
for some absolute constant $C \geq 1$. Such non-proportionate settings are covered by the theory developed in [27, 48]. For clarity of presentation, even in this more general regime of Equation (12), we will still continue to write $\phi_0 := d/T_0$ and $\phi := d/T$.
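Equation (9) defines $\kappa(\lambda, T)$ only implicitly, but it is easy to solve numerically. The sketch below uses a plain fixed-point iteration and checks the result against the closed form (10) in the isotropic case; the iteration count and initialisation are heuristic choices of ours, not a prescription from the paper.

```python
import numpy as np

def kappa(lam, T, eigs, iters=500):
    """Solve Eq. (9), kappa - lam = kappa * df_1(kappa) / T, by fixed-point iteration.
    `eigs` are the eigenvalues of Sigma; this is a simple sketch, not an optimized solver."""
    k = lam + np.sum(eigs) / T          # crude initialisation
    for _ in range(iters):
        df1 = np.sum(eigs / (eigs + k))
        k = lam + k * df1 / T
    return k

# Sanity check against the closed form (10) for isotropic Sigma = I_d.
d, T, lam = 300, 1000, 1e-2
phi = d / T
k_numeric = kappa(lam, T, np.ones(d))
phi_bar = phi - 1
k_closed = (lam + phi_bar + np.sqrt((lam + phi_bar) ** 2 + 4 * lam)) / 2
print(k_numeric, k_closed)   # the two values should agree closely
```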
Bias-Variance Decomposition. With everything now in place, let us recall for later use the classical bias-variance decomposition for ridge regression (for example, see [41, 22, 4]):

Proposition 4.2. In the RMT limit (12), the test error of a ridge predictor $\hat{w}(\lambda)$ based on $T$ iid samples from the true data distribution $P_{\Sigma, w_0, \sigma^2}$ is given by
$$E_{test}(\hat{w}(\lambda)) = \mathbb{E}\,\|\hat{w}(\lambda) - w_0\|_\Sigma^2 \simeq \mathrm{Bias} + \mathrm{Var}, \qquad (13)$$
with
$$\mathrm{Bias} \simeq \frac{\kappa^2\, w_0^\top \Sigma (\Sigma + \kappa I)^{-2} w_0}{1 - \mathrm{df}_2(\kappa)/T}, \qquad \mathrm{Var} \simeq \frac{\sigma^2\, \mathrm{df}_2(\kappa)}{T} \cdot \frac{1}{1 - \mathrm{df}_2(\kappa)/T}, \qquad (14)$$
where $\kappa = \kappa(\lambda, T)$ is as given in Equation (9).

4.3 Analytic Formula for Test Error

The following result gives the test error for the downstream ridge predictor $\hat{w}^{pred}_n$ defined in Equation (7), in the context of fake training data, and will be heavily exploited later to obtain precise estimates in different regimes. Define generic $\mathrm{Var}$ and $\mathrm{Bias}$ by
$$\mathrm{Var} = \mathbb{E}\,\|R X^\top E / T\|_\Sigma^2 = \sigma^2\, \mathbb{E}\,\frac{1}{T} \mathrm{tr}\,\Sigma R^2 \hat{\Sigma}, \qquad \mathrm{Bias} = \mathbb{E}\,\|\hat{\Sigma} R w_0 - w_0\|_\Sigma^2,$$
and note that $E_{test}^{clean} := \mathrm{Bias} + \mathrm{Var}$ for standard ridge regression fitted on clean data from the true data distribution $P_{\Sigma, w_0, \sigma^2}$ (e.g., see Hastie et al. [22]). Let $Q_{n-1} = P_{n-1} P_{n-2} \cdots P_0$, where $P_m$ is the orthogonal projection onto the subspace of $\mathbb{R}^d$ spanned by the rows of $X_m$, and define
$$\overline{\mathrm{Bias}} := \mathbb{E}\,\|\hat{\Sigma} R (Q_{n-1} w_0 - w_0)\|_\Sigma^2 \geq 0. \qquad (15)$$

Theorem 4.3. For an $n$-fold fake data generation process, the test error of a ridge predictor $\hat{w}^{pred}_n$ based on a sample of size $T$ with regularization parameter $\lambda$ is given in the RMT limit (12) by
$$E_{test}(\hat{w}^{pred}_n) \simeq \widetilde{\mathrm{Bias}} + \mathrm{Var} + n \sigma_0^2 \rho, \qquad (16)$$
where $\rho$ is as given in Theorem F.1, and $\widetilde{\mathrm{Bias}}$ satisfies $\widetilde{\mathrm{Bias}} \simeq \mathrm{Bias} + \overline{\mathrm{Bias}} \geq \mathrm{Bias}$ (with equality if $T_0 \gtrsim d$), with $\overline{\mathrm{Bias}}$ as given in (15). Furthermore, if one of the following conditions holds,
$$T_0 \gtrsim d \quad \text{OR} \quad X_n = X_0 \text{ for all } n \geq 1, \qquad (17)$$
then we have the following explicit formula for $\rho$:
$$\rho = \frac{\mathrm{tr}\,\Sigma^4 (\Sigma + \kappa_0 I)^{-2} (\Sigma + \kappa I)^{-2}}{T_0 - \mathrm{df}_2(\kappa_0)} + \frac{\kappa^2\, \mathrm{tr}\,\Sigma^2 (\Sigma + \kappa_0 I)^{-2} (\Sigma + \kappa I)^{-2}}{T_0 - \mathrm{df}_2(\kappa_0)} \cdot \frac{\mathrm{df}_2(\kappa)}{T - \mathrm{df}_2(\kappa)}, \qquad (18)$$
where $\kappa = \kappa(\lambda, T)$ and $\kappa_0 := \kappa(0, T_0)$ are as given in Equation (9).

Instructively, the term $\overline{\mathrm{Bias}}$ measures how far the synthetic data generation process is biased away from the ground-truth model $w_0$. This term disappears if the generator was fitted on sufficiently many samples (i.e. if $T_0 \gtrsim d$). More quantitatively, when $T_0 < d$ and $X_n = X_0$, it is easy to see that $\overline{\mathrm{Bias}} \lesssim \mathbb{E}\,[\|\Sigma^{1/2} \hat{\Sigma} R\|_{op}^2]\, \mathrm{Bias}_0$, where $\mathrm{Bias}_0 := \mathbb{E}\,\|P_0 w_0 - w_0\|_2^2$ measures the inability, due to lack of enough data, of the first generation ($n = 1$) to reliably estimate $w_0$ even in the absence of noise ($\sigma_0 = 0$) in the data-generating process. This gap propagates over to higher generations of the process. The situation is illustrated in Figure 2. In the case where $T_0 < d$ and the $X_n$'s are independent, we shall see in Section 4.5 that this increase in bias actually grows with $n$, even in the case of fake data generation without label noise (i.e. $\sigma_0 = 0$).

4.4 Model Collapse in the Case of an Under-Parametrized Fake Data Generator

We now consider the scenario of under-parameterization, where $T_0 \gtrsim d$, indicating that the number of data points exceeds the number of dimensions. This condition typically results in a unique solution for the regression. In this case, $P_0 = I_d$ a.s., leading to $\widetilde{\mathrm{Bias}} = \mathrm{Bias}$ (given as in formula (14)) and $\kappa_0 = 0$ in (18), and so Theorem 4.3 gives
$$\rho = \frac{\mathrm{df}_2(\kappa)}{T_0 - d} + \frac{\kappa^2\, \mathrm{tr}\,(\Sigma + \kappa I)^{-2}}{T_0 - d} \cdot \frac{\mathrm{df}_2(\kappa)}{T - \mathrm{df}_2(\kappa)}. \qquad (19)$$
We have the following corollary to Theorem 4.3.

Corollary 4.4. Consider the setting of Theorems 4.3 and F.1. If additionally $T_0 \gtrsim d$, then it holds in the RMT limit (12) that $E_{test}(\hat{w}^{pred}_n) \simeq \mathrm{Bias} + \mathrm{Var} + n \sigma_0^2 \rho$, where $\mathrm{Bias}$ and $\mathrm{Var}$ are as given in formula (14), and $\rho$ is as given in Equation (19). Moreover, in the special case of isotropic features, it holds that
$$\mathrm{Bias} + \mathrm{Var} \simeq \frac{\kappa^2 \|w_0\|_2^2 + \sigma^2 \phi}{(1 + \kappa)^2 - \phi}, \qquad \rho \simeq \frac{\phi_0}{1 - \phi_0} \left[\frac{1}{(1 + \kappa)^2} + \frac{1}{(1 + \kappa)^2} \cdot \frac{\phi \kappa^2}{(1 + \kappa)^2 - \phi}\right],$$
with $\phi := d/T$, $\phi_0 := d/T_0$, and $\kappa = \kappa(\lambda, T)$ as in Equation (10).
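The formulas of Corollary 4.4 can be evaluated numerically for any spectrum. The following sketch computes the theoretical prediction $\mathrm{Bias} + \mathrm{Var} + n\sigma_0^2\rho$ with $\rho$ from Eq. (19); the small fixed-point solver for $\kappa$ is reproduced for self-containedness, and the toy spectrum and parameters are our own choices.

```python
import numpy as np

def kappa(lam, T, eigs, iters=500):
    """Fixed-point solution of Eq. (9)."""
    k = lam + np.sum(eigs) / T
    for _ in range(iters):
        k = lam + k * np.sum(eigs / (eigs + k)) / T
    return k

def collapsed_test_error(n, lam, T, T0, sigma, sigma0, eigs, w0_coefs):
    """Theoretical E_test of Corollary 4.4 (under-parametrized generator, T0 > d):
    Bias + Var from Eq. (14) plus n * sigma0^2 * rho with rho from Eq. (19).
    `w0_coefs` are the coordinates of w0 in the eigenbasis of Sigma."""
    d = eigs.size
    k = kappa(lam, T, eigs)
    df2 = np.sum(eigs**2 / (eigs + k) ** 2)
    bias = k**2 * np.sum(eigs * w0_coefs**2 / (eigs + k) ** 2) / (1 - df2 / T)
    var = sigma**2 * (df2 / T) / (1 - df2 / T)
    rho = df2 / (T0 - d) + (k**2 * np.sum(1.0 / (eigs + k) ** 2) / (T0 - d)) * df2 / (T - df2)
    return bias + var + n * sigma0**2 * rho

d, T0, T = 300, 600, 1000
eigs = 1.0 / np.arange(1, d + 1) ** 2           # power-law spectrum, beta = 2
c = np.random.default_rng(0).standard_normal(d) / np.sqrt(d)
for n in range(4):
    print(n, collapsed_test_error(n, 1e-3, T, T0, 0.1, 0.2, eigs, c))
```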
Note that Theorem 4.1 is a special case of the above result, corresponding to $\lambda = 0$ and $\phi < 1$. A result like Corollary 4.4 gives us the needed analytical handle for understanding $n$-fold model collapse in terms of all problem hyper-parameters (covariance spectrum, regularization, label-noise level, etc.).

4.5 Model Collapse in the Absence of Label Noise

We now consider the over-parametrized regime, where the different iterations of the synthetic data generator (refer to the illustration in Figure 3) are fitted on insufficient data. For simplicity of exposition, we restrict our presentation to the isotropic covariance $\Sigma = I_d$. Since we will be focusing on the possible increase $\overline{\mathrm{Bias}}$ above the bias (defined in Equation (14)) due to $n \geq 1$ generations, as predicted by Theorem 4.3, we further restrict ourselves to the noiseless regime where the fake data-generating process has no label noise, i.e. $\sigma_0 = 0$. Thanks to Lemma F.4, we know that the generation-$n$ fake labelling vector $\hat{w}_n$ (defined in Eqn. (5)) is given explicitly as a series of projections
$$\hat{w}_n = Q_{n-1} w_0 = P_{n-1} P_{n-2} \cdots P_0 w_0. \qquad (20)$$
Further, for simplicity we will assume $T = T_n > d$, i.e. the downstream model has access to enough data. We shall focus on two important special cases.

The Dependent Case. We first consider the case where $T_m = T_0 < d$ and $X_m = X_0$ for all $m \leq n - 1$. It is clear that Equation (20) reduces to $\hat{w}_n = P_0 w_0$, with $\mathrm{rank}\, P_0 = T_0 < d$.

Theorem 4.5. In the limit $\lambda \to 0^+$ and $d, T_0 \to \infty$ with $d/T_0 \to \phi_0 > 1$, it holds that
$$\|\hat{w}_n\|^2 \simeq \|w_0\|^2 / \phi_0, \qquad \overline{\mathrm{Bias}} \simeq \|w_0\|^2 (1 - 1/\phi_0). \qquad (21)$$

We see that in this setting, the increase in bias $\overline{\mathrm{Bias}} \simeq (1 - 1/\phi_0)\|w_0\|^2$ brought about by synthetic data is a positive constant which does not grow with the number of generations $n \geq 1$. This increase in bias (i.e. compared to training on clean data) is due to the fact that, with probability 1, the random subspace of $\mathbb{R}^d$ spanned by $X_0$ does not contain the ground-truth model $w_0$. The expression is nothing but an RMT estimate of $\|P_0 w_0 - w_0\|^2$, i.e. the squared norm of the projection of $w_0$ onto the orthogonal complement of this subspace. The result is illustrated in Figure 2(a).

The Independent Case. For our second example, we remove the assumption that $T_m = T_0$ and $X_m = X_0$ for all $m \leq n - 1$ considered in the previous case (Theorem 4.5). We instead assume that (A) the $X_m$'s are independent, and (B) we are in the following high-dimensional limit:
$$\lambda \to 0^+, \quad d, T_1, \ldots, T_{n-1} \to \infty, \quad d/T_m \to \phi_m, \quad \text{for some } \phi_1, \ldots, \phi_{n-1} > 0. \qquad (22)$$
Define $\eta := \prod_{m=0}^{n-1} \eta_m \in (0, 1]$, where $\eta_m := \min(1/\phi_m, 1)$. We have the following theorem.

Theorem 4.6. In the limit (22), it holds that $\|\hat{w}_n\|^2 \simeq \|w_0\|^2 \eta$ and $\overline{\mathrm{Bias}} \simeq \|w_0\|^2 (1 - \eta)$. In particular, if $n \to \infty$ with infinitely many $\phi_m > 1$, then $\hat{w}_n \to 0$ and $\overline{\mathrm{Bias}} \to \|w_0\|^2$.

The theorem predicts that a sequence of over-parametrized fake data generators $(\hat{w}_n)_n$ collapses to zero (and thus effectively escapes from the ground-truth model $w_0$). Consequently, the downstream model $\hat{w}^{pred}_n$ converges to a Gaussian process around zero, instead of the true model $w_0$, leading to an increase in the bias term of the test error! For example, if $\phi_m = \phi_0 > 1$ for all $m$, then Theorem 4.6 predicts that $\overline{\mathrm{Bias}} \simeq (1 - \phi_0^{-n})\|w_0\|^2$, which grows exponentially fast towards $\|w_0\|^2$, the test error of the null predictor. This compounding effect is due to the fact that in (20), each projection $P_m$ spins the fake data labelling vector $\hat{w}_n$ further away from the ground truth $w_0$. The result is illustrated in Figure 2(b). Comparing the dependent case and the independent case, Theorem 4.6 shows that the increase in bias is proportional to $1 - \eta_0 \eta_1 \cdots \eta_{n-1}$, which is typically much larger than $1 - \eta_0$, the increase in the dependent case. Sampling different design matrices results in a more pronounced model collapse.
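Theorem 4.6 has a very concrete interpretation: in the noiseless over-parametrized case, $\hat{w}_n$ is a product of projections onto independent random row spaces, and its squared norm shrinks by a factor of roughly $\min(1/\phi_m, 1)$ per generation. The toy simulation below illustrates this; the dimension and the values of $\phi_m$ are our own choices.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 400
w0 = rng.standard_normal(d)
phis = [2.0, 2.0, 2.0]                   # over-parametrized generations: phi_m = d / T_m > 1

w = w0.copy()
for phi_m in phis:                       # noiseless case: w_hat_n = P_{n-1} ... P_0 w_0, Eq. (20)
    T_m = int(d / phi_m)
    X = rng.standard_normal((T_m, d))    # independent isotropic design at each generation
    V = np.linalg.svd(X, full_matrices=False)[2]   # orthonormal basis of the row space of X
    w = V.T @ (V @ w)                    # apply the orthogonal projection P_m onto the row space

eta = np.prod([min(1.0 / p, 1.0) for p in phis])
print("empirical ||w_hat_n||^2 / ||w_0||^2:", np.sum(w ** 2) / np.sum(w0 ** 2))
print("predicted eta (Theorem 4.6):", eta)
```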
5 The Case of Heavy Tails (Power Law)

Neural scaling laws [26, 24] relate a model's test error to the sample size, model size, and computational resources, and are critical tools for practitioners in strategically allocating resources during the design and implementation of large language models. Previous theoretical works [11, 41, 14] have examined scaling laws in our tractable setting of linear regression with Gaussian design in the context of a power-law covariance spectrum. We now explore how synthetic data alters these scaling laws in this setting.

Let the spectral decomposition of the covariance matrix $\Sigma$ be $\Sigma = \lambda_1 v_1 v_1^\top + \ldots + \lambda_d v_d v_d^\top$, where $\lambda_1 \geq \ldots \geq \lambda_d \geq 0$ are the eigenvalues and $v_1, \ldots, v_d \in \mathbb{R}^d$ are the eigenvectors. For any feature index $j \in [d]$, define a coefficient $c_j := w_0^\top v_j$, i.e. the projection of $w_0$ along the $j$th eigenvector of $\Sigma$. We shall work under the following well-studied spectral conditions:
$$\text{(Capacity Condition)} \quad \lambda_j \asymp j^{-\beta} \text{ for all } j \in [d], \qquad \text{(Source Condition)} \quad \|\Sigma^{1/2 - r} w_0\| = O(1), \qquad (23)$$
where $\beta > 1$ and $r > 0$. The parameter $r$ measures the amount of dispersion of $w_0$ relative to the spectrum of $\Sigma$; a large value of $r$ means $w_0$ is concentrated only along a few important eigen-directions (i.e. the learning problem is easy). For later convenience, define $\delta$, $\underline{r}$, and $c$ by
$$\delta := 1 + \beta(2r - 1) \in \mathbb{R}, \qquad \underline{r} := \min(r, 1) \in (0, 1], \qquad c := 2\beta\underline{r} / (2\beta\underline{r} + 1) \in (0, 1). \qquad (24)$$
As noted in [14], the source condition in (23) is satisfied if $c_j \asymp j^{-\delta/2}$ for all $j \in [d]$. Consider an adaptive ridge regularization strength of the form
$$\lambda = \lambda(T) \asymp T^{-\ell}, \qquad (25)$$
for fixed $\ell \geq 0$. The case $\ell = 0$ corresponds to non-adaptive regularization; otherwise, the level of regularization decays polynomially with the sample size $T$. Define
$$\ell_{crit} := \beta / (2\beta\underline{r} + 1). \qquad (26)$$
In [14], KRR under normal circumstances (corresponding to $n = 0$, i.e. no fake data) was considered, and it was shown that this value for the regularization exponent in (25) is minimax-optimal for the test error in the noisy regime ($\sigma > 0$), namely $E_{test}(\hat{w}^{pred}_0) \asymp T^{-c}$. This represents a crossover from the noiseless regime, where it was shown that the test error scales like $E_{test}(\hat{w}^{pred}_0) \asymp T^{-2\beta\underline{r}}$, a much faster rate. In the context of training on fake data, which is the object of this manuscript, we shall establish new scaling laws which paint a drastically different picture.

A "Collapsed" Scaling Law. The following result shows that model collapse is a modification of the usual scaling laws, induced by fake data. All proofs of this section can be found in Appendix H. Here, for simplicity of presentation, we restrict to the case $T_0 \geq d + 2$ to make the results easier to present. This condition can be removed as in Theorem 4.3.

Theorem 5.1. Consider $n$-fold fake data generation with sample size $T_0 \geq d + 2$. For a ridge predictor $\hat{w}^{pred}_n$ given in Equation (7), based on a fake data sample of size $T$, with regularization parameter $\lambda = \lambda(T)$ tuned adaptively as in Equation (25) with exponent $\ell \in [0, \beta)$, the test error satisfies the following scaling law in the RMT limit (12):
$$E_{test}(\hat{w}^{pred}_n) \asymp \max\big(\sigma^2,\, T^{1 - 2\underline{r}\ell - \ell/\beta}\big)\, T^{-(1 - \ell/\beta)} + \frac{n \sigma_0^2}{1 - \phi_0}\, \max\big(T/T_0,\, \phi_0\big)\, T^{-(1 - \ell/\beta)}. \qquad (27)$$

We now provide an instructive interpretation of Theorem 5.1 and outline the effect of regularization.
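To make the two competing terms in (27) easy to inspect, the following small helper evaluates the right-hand side of the scaling law up to constants; the function name, its argument list, and the example parameters are ours and are meant only for illustration.

```python
import numpy as np

def collapsed_scaling_law(T, n, T0, phi0, sigma, sigma0, beta, r, ell):
    """Order-of-magnitude evaluation of the right-hand side of Eq. (27);
    constants are dropped, only the powers of T are tracked."""
    r_bar = min(r, 1.0)
    clean = max(sigma**2, T ** (1 - 2 * r_bar * ell - ell / beta)) * T ** (-(1 - ell / beta))
    fake = (n * sigma0**2 / (1 - phi0)) * max(T / T0, phi0) * T ** (-(1 - ell / beta))
    return clean + fake

# Noiseless example: the fake-data term eventually dominates and dictates the rate.
beta, r = 2.0, 0.375
ell_crit = beta / (2 * beta * min(r, 1.0) + 1)
for T in [10**3, 10**4, 10**5, 10**6]:
    print(T, collapsed_scaling_law(T, n=1, T0=600, phi0=0.5, sigma=0.0,
                                   sigma0=1.0, beta=beta, r=r, ell=ell_crit))
```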
The Noiseless Regime. First consider the case $\sigma = 0$ (or, equivalently, $\sigma$ exponentially small in $T$) with $\phi_0 \in (0, 1)$ fixed, and consider a number of generations $n$ such that $n\sigma_0^2 \asymp T^a$, where $0 \leq a \leq 1 - \ell/\beta \leq 1$. Note that $a = 0$ corresponds to a constant number of generations. Also take $T_0 \asymp T^b$, for some constant $b \in (0, \infty)$. According to Theorem 5.1, if we want to balance out the model-collapsing negative effect of training on fake data, we should choose $\ell$ so as to balance the second term $n (T/T_0) T^{-(1 - \ell/\beta)} \asymp T^{-(b - \ell/\beta - a)}$ and the first term $T^{-2\underline{r}\ell}$. We have the following:

Corollary 5.2. In the setting of Theorem 5.1 with $T_0 \asymp T^b$ and $n\sigma_0^2 \asymp T^a$, the optimal exponent of the ridge regularization parameter in Equation (25) is $\ell = \ell^\star$, where
$$\ell^\star = \min\big((b - a)\,\ell_{crit},\, \beta\big), \qquad (28)$$
and $\ell_{crit}$ is as in Eqn. (26), with corresponding optimal test error $\inf_{\ell \geq 0} E_{test}(\hat{w}^{pred}_n) \asymp T^{-(b - a)c}$.

Observe that $(b - a)c < 2\beta\underline{r}$ holds when $n = O(1)$, $r \leq 1$, and $b \leq a + 1$, which corresponds to the condition $T \gtrsim T_0$. The above result therefore represents a crossover from the fast rate $E_{test}(\hat{w}^{pred}_0) \asymp T^{-2\beta\underline{r}}$ in the case of training on clean data [14] to a much slower rate $E_{test}(\hat{w}^{pred}_n) \asymp T^{-(b - a)c}$, attained by the adaptive regularization $\lambda \asymp T^{-\ell^\star}$, which is optimal in this setting. Furthermore, if in this setting we still use $\lambda \asymp T^{-\ell_{crit}}$ as proposed in [14] for the clean-data setting, Corollary 5.2 predicts that $E_{test}(\hat{w}^{pred}_n) \asymp T^{-(b - \ell_{crit}/\beta - a)} = T^{-(c + b - a - 1)}$, which diverges to infinity if $b < a + 1 - c$. This is a catastrophic form of model collapse, and is empirically illustrated in Figures 1b and 4.

The Noisy Regime. This discussion can be found in Appendix G.

Remark. In all the analyses above, we quantitatively demonstrate how model collapse manifests as a change in scaling laws within a setting commonly used to understand scaling behavior in current foundation models [46, 13, 14]. Our results indicate that, in the presence of synthetic data, scaling laws with respect to dataset size slow down (i.e., exhibit smaller exponents), meaning a much larger sample size is needed to achieve the same reduction in test error as with real data. Furthermore, the optimal scaling law with synthetic data requires different regularization; the optimal settings for real data could lead to catastrophic model collapse. Related findings are reported in Dohmatob et al. [18] in the setting of discrete data for infinite memory models and their variants.

6 Experiments

To further support our theoretical findings, we conduct experiments using kernel ridge regression on the MNIST dataset [16], as detailed in Appendix D.2. Our experiments validate the theoretical predictions for both RBF and polynomial kernels, demonstrating the parallels between linear regression and kernel regression, and highlighting the relevance of our theory to more complex settings. We also explore the behavior of real neural networks by training two-layer networks in two different settings: fixing the first layer or training both layers (see Appendix D.3). Consistent with our theoretical insights, we observe a linear pattern of model collapse when the first layer is fixed. However, a more severe, nearly quadratic model collapse is observed when both layers are trained, with our theory providing a lower bound for this behavior. These results reinforce the ability of our theory to capture the dynamics of model collapse across varying complexities. Full experimental details and results are provided in Appendix D.
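For the kernel experiments of Section 6 (detailed in Appendix D.2), one generation of the synthetic-labelling loop can be sketched as follows with kernel ridge regression on toy inputs standing in for MNIST images; the kernel bandwidth, sample sizes, and helper names below are assumptions of ours, not the exact experimental configuration.

```python
import numpy as np

def rbf_kernel(A, B, gamma):
    """K(a, b) = exp(-gamma * ||a - b||^2)."""
    sq = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-gamma * sq)

def krr_fit(X, y, lam, gamma):
    """Kernel ridge regression via the representer formula of Appendix C:
    c_hat = (G + lam * T * I)^{-1} y, with G the Gram matrix of the training inputs."""
    T = X.shape[0]
    return np.linalg.solve(rbf_kernel(X, X, gamma) + lam * T * np.eye(T), y)

def krr_predict(X_train, c_hat, X_new, gamma):
    """Prediction f_hat(x) = K(X, x)^T c_hat."""
    return rbf_kernel(X_new, X_train, gamma) @ c_hat

# One step of the synthetic-labelling loop on toy inputs.
rng = np.random.default_rng(0)
gamma = 0.05                                        # toy bandwidth; the MNIST experiment uses 1e-4
X0 = rng.standard_normal((600, 20))                 # T0 "real" inputs
y0 = np.sin(X0[:, 0]) + rng.standard_normal(600)    # noisy real labels (sigma_0 = 1 here)
c0 = krr_fit(X0, y0, lam=1e-3, gamma=gamma)
X1 = rng.standard_normal((1000, 20))                # fresh inputs relabelled by the generation-0 fit
y1 = krr_predict(X0, c0, X1, gamma) + rng.standard_normal(1000)
c1 = krr_fit(X1, y1, lam=1e-3, gamma=gamma)         # downstream model trained on synthetic labels
```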
7 Concluding Remarks

As we navigate the "synthetic data age", our findings signal a departure from traditional test error rates (e.g. neural scaling laws), introducing novel challenges and phenomena with the integration of synthetic data from preceding AI models into training sets. Our work provides a solid analytical handle for demystifying the model collapse phenomenon as a modification of the usual scaling laws caused by fake / synthesized training data. On the practical side, our analysis reveals that AI-generated data alters the optimal regularization for downstream models and changes the scaling laws. Drawing from the insight that regularization mirrors early stopping [3], our study suggests that models trained on mixed real and AI-generated data may initially improve but later decline in performance (model collapse), necessitating early detection of this inflection point. To preserve model quality when scaling laws are altered, it is essential to employ data filtering and watermarking techniques to distinguish real data from synthetic content. Recent studies have also explored methods for data selection [19] and correction [20]. These observations prompt a re-evaluation of current training approaches and underscore the complexity of model optimization in the era of synthetic data.

Acknowledgments

YF and JK are supported by the National Science Foundation under NSF Award 1922658. Part of this work was done while JK and YF were hosted by the Centre Sciences de Données at the École Normale Supérieure (ENS) in 2023/24, whose hospitality they gratefully acknowledge. This work was partially supported through the NYU IT High Performance Computing (HPC) resources, services, and staff expertise.

References

[1] Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. GPT-4 technical report. arXiv preprint arXiv:2303.08774, 2023.

[2] Sina Alemohammad, Josue Casco-Rodriguez, Lorenzo Luzi, Ahmed Imtiaz Humayun, Hossein Babaei, Daniel LeJeune, Ali Siahkoohi, and Richard G. Baraniuk. Self-consuming generative models go MAD. arXiv preprint arXiv:2307.01850, 2023.

[3] Alnur Ali, J. Zico Kolter, and Ryan J. Tibshirani. A continuous-time view of early stopping for least squares regression. In Kamalika Chaudhuri and Masashi Sugiyama, editors, Proceedings of the Twenty-Second International Conference on Artificial Intelligence and Statistics, volume 89 of Proceedings of Machine Learning Research, pages 1370-1378. PMLR, 16-18 Apr 2019.

[4] Francis Bach. High-dimensional analysis of double descent for linear regression with random projections. 2023.

[5] Zhidong Bai and Jack W. Silverstein. Spectral Analysis of Large Dimensional Random Matrices. Springer Series in Statistics. Springer, New York, 2nd edition, 2010. ISBN 9781441906601.

[6] Raphaël Berthier, Francis R. Bach, and Pierre Gaillard. Tight nonparametric convergence rates for stochastic gradient descent under the noiseless linear model. CoRR, abs/2006.08212, 2020. URL https://arxiv.org/abs/2006.08212.

[7] Quentin Bertrand, Avishek Joey Bose, Alexandre Duplessis, Marco Jiralerspong, and Gauthier Gidel. On the stability of iterative retraining of generative models on their own data. arXiv preprint arXiv:2310.00429, 2023.

[8] Matyas Bohacek and Hany Farid. Nepotistically trained generative-AI models collapse, 2023.

[9] Martin Briesch, Dominik Sobania, and Franz Rothlauf. Large language models suffer from their own output: An analysis of the self-consuming training loop, 2023.
[10] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D. Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel Ziegler, Jeffrey Wu, Clemens Winter, Chris Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. Language models are few-shot learners. In H. Larochelle, M. Ranzato, R. Hadsell, M.F. Balcan, and H. Lin, editors, Advances in Neural Information Processing Systems, volume 33, pages 1877-1901. Curran Associates, Inc., 2020.

[11] Andrea Caponnetto and Ernesto de Vito. Optimal rates for the regularized least-squares algorithm. Foundations of Computational Mathematics, 7:331-368, 2007.

[12] Lenaic Chizat, Edouard Oyallon, and Francis Bach. On lazy training in differentiable programming. Advances in Neural Information Processing Systems, 32, 2019.

[13] Hugo Cui, Bruno Loureiro, Florent Krzakala, and Lenka Zdeborová. Generalization error rates in kernel regression: The crossover from the noiseless to noisy regime. In A. Beygelzimer, Y. Dauphin, P. Liang, and J. Wortman Vaughan, editors, Advances in Neural Information Processing Systems, 2021.

[14] Hugo Cui, Bruno Loureiro, Florent Krzakala, and Lenka Zdeborová. Generalization error rates in kernel regression: the crossover from the noiseless to noisy regime. Journal of Statistical Mechanics: Theory and Experiment, 2022(11):114004, November 2022.

[15] Hugo Cui, Bruno Loureiro, Florent Krzakala, and Lenka Zdeborová. Error scaling laws for kernel classification under source and capacity conditions. Machine Learning: Science and Technology, 4(3):035033, August 2023. ISSN 2632-2153.

[16] Li Deng. The MNIST database of handwritten digit images for machine learning research. IEEE Signal Processing Magazine, 29(6):141-142, 2012.

[17] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.

[18] Elvis Dohmatob, Yunzhen Feng, Pu Yang, François Charton, and Julia Kempe. A tale of tails: Model collapse as a change of scaling laws. In Forty-first International Conference on Machine Learning, 2024. URL https://openreview.net/forum?id=KVvku47shW.

[19] Yunzhen Feng, Elvis Dohmatob, Pu Yang, Francois Charton, and Julia Kempe. Beyond model collapse: Scaling up with synthesized data requires reinforcement. arXiv preprint arXiv:2406.07515, 2024.

[20] Nate Gillman, Michael Freeman, Daksh Aggarwal, Chia-Hong Hsu, Calvin Luo, Yonglong Tian, and Chen Sun. Self-correcting self-consuming loops for generative model training. In Forty-first International Conference on Machine Learning, 2024.

[21] Yanzhu Guo, Guokan Shang, Michalis Vazirgiannis, and Chloé Clavel. The curious decline of linguistic diversity: Training language models on synthetic text, 2023.

[22] Trevor Hastie, Andrea Montanari, Saharon Rosset, and Ryan J. Tibshirani. Surprises in high-dimensional ridgeless least squares interpolation. The Annals of Statistics, 50(2), 2022.

[23] Ryuichiro Hataya, Han Bao, and Hiromi Arai. Will large-scale generative models corrupt future datasets? In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 20555-20565, October 2023.
[24] Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, Tom Hennigan, Eric Noland, Katie Millican, George van den Driessche, Bogdan Damoc, Aurelia Guy, Simon Osindero, Karen Simonyan, Erich Elsen, Jack W. Rae, Oriol Vinyals, and Laurent Sifre. Training compute-optimal large language models. arXiv preprint arXiv:2203.15556, 2022.

[25] Arthur Jacot, Franck Gabriel, and Clement Hongler. Neural tangent kernel: Convergence and generalization in neural networks. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 31. Curran Associates, Inc., 2018.

[26] Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B. Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361, 2020.

[27] Antti Knowles and Jun Yin. Anisotropic local laws for random matrices. Probability Theory and Related Fields, 169(1):257-352, 2017.

[28] Jaehoon Lee, Yasaman Bahri, Roman Novak, Samuel S. Schoenholz, Jeffrey Pennington, and Jascha Sohl-Dickstein. Deep neural networks as Gaussian processes. In 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, April 30 - May 3, 2018, Conference Track Proceedings. OpenReview.net, 2018.

[29] Tengyuan Liang and Alexander Rakhlin. Just interpolate: Kernel "ridgeless" regression can generalize. The Annals of Statistics, 48(3), 2020.

[30] Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692, 2019.

[31] Alexander Maloney, Daniel A. Roberts, and James Sully. A solvable model of neural scaling laws, 2022.

[32] Gonzalo Martínez, Lauren Watson, Pedro Reviriego, José Alberto Hernández, Marc Juarez, and Rik Sarkar. Combining generative artificial intelligence (AI) and the internet: Heading towards evolution or degradation? arXiv preprint arXiv:2303.01255, 2023.

[33] Gonzalo Martínez, Lauren Watson, Pedro Reviriego, José Alberto Hernández, Marc Juarez, and Rik Sarkar. Towards understanding the interplay of generative artificial intelligence and the internet. arXiv preprint arXiv:2306.06130, 2023.

[34] V. A. Marčenko and Leonid Pastur. Distribution of eigenvalues for some sets of random matrices. Math. USSR Sb., 1:457-483, 1967.

[35] Midjourney. Midjourney AI, 2023. URL https://www.midjourney.com/.

[36] Hossein Mobahi, Mehrdad Farajtabar, and Peter Bartlett. Self-distillation amplifies regularization in Hilbert space. In Advances in Neural Information Processing Systems, volume 33, pages 3351-3361. Curran Associates, Inc., 2020.

[37] Radford M. Neal. Priors for infinite networks. In Bayesian Learning for Neural Networks, pages 29-53. Springer, New York, 1996.

[38] Loucas Pillaud-Vivien, Alessandro Rudi, and Francis R. Bach. Statistical optimality of stochastic gradient descent on hard learning problems through multiple passes. In Samy Bengio, Hanna M. Wallach, Hugo Larochelle, Kristen Grauman, Nicolò Cesa-Bianchi, and Roman Garnett, editors, Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems 2018, pages 8125-8135, 2018.

[39] Ali Rahimi and Benjamin Recht.
Weighted sums of random kitchen sinks: Replacing minimization with randomization in learning. In Advances in Neural Information Processing Systems. Curran Associates, Inc., 2008.

[40] Aditya Ramesh, Mikhail Pavlov, Gabriel Goh, Scott Gray, Chelsea Voss, Alec Radford, Mark Chen, and Ilya Sutskever. Zero-shot text-to-image generation. In Marina Meila and Tong Zhang, editors, Proceedings of the 38th International Conference on Machine Learning, volume 139 of Proceedings of Machine Learning Research, pages 8821-8831. PMLR, 18-24 Jul 2021.

[41] Dominic Richards, Jaouad Mourtada, and Lorenzo Rosasco. Asymptotics of ridge(less) regression under general source condition. In Proceedings of The 24th International Conference on Artificial Intelligence and Statistics, volume 130 of Proceedings of Machine Learning Research. PMLR, 2021.

[42] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 10684-10695, June 2022.

[43] Alessandro Rudi and Lorenzo Rosasco. Generalization properties of learning with random features. In Advances in Neural Information Processing Systems. Curran Associates Inc., 2017. ISBN 9781510860964.

[44] Ilia Shumailov, Zakhar Shumaylov, Yiren Zhao, Yarin Gal, Nicolas Papernot, and Ross Anderson. The curse of recursion: Training on generated data makes models forget. arXiv preprint arXiv:2305.17493, 2023.

[45] James B. Simon, Madeline Dickens, and Michael Robert DeWeese. Neural tangent kernel eigenvalues accurately predict generalization. 2021.

[46] Stefano Spigler, Mario Geiger, and Matthieu Wyart. Asymptotic learning curves of kernel methods: empirical data versus teacher-student paradigm. Journal of Statistical Mechanics: Theory and Experiment, 2020(12), 2020.

[47] Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023.

[48] Alexander Wei, Wei Hu, and Jacob Steinhardt. More than a toy: Random matrix models predict how real-world neural representations generalize. In Proceedings of the 39th International Conference on Machine Learning, volume 162 of Proceedings of Machine Learning Research. PMLR, 2022.

[49] Christopher Williams. Computing with infinite networks. In M.C. Mozer, M. Jordan, and T. Petsche, editors, Advances in Neural Information Processing Systems, volume 9. MIT Press, 1996.

Appendix / Supplementary Material for "Model Collapse Demystified: The Case of Regression"

Contents:
A Differences to Self-Distillation
B Related Work on Kernel Ridge Regression with Gaussian Design
C Extension to Kernel Methods
D Details of Experiments
  D.1 Simulated Data
  D.2 Real Data: Kernel Ridge Regression on MNIST
  D.3 Neural Networks on MNIST
E Notations
F Exact Characterization of Test Error Under Model Collapse
  F.1 A General Formula for Test Error
  F.2 Proof of Theorem 4.1 (Ridgeless Regression)
  F.3 Proof of Theorem F.1 (Ridge Regression + General Covariance)
  F.4 Proof of Theorem 4.3
  F.5 Proof of Corollary 4.4
  F.6 A Note on Proposition 4.2
  F.7 Proof of Theorem 4.5 and Theorem 4.6 (Model Collapse in the Absence of Label Noise)
G The Noisy Regime for Power Law Spectra
H Proof of Results for Power-Law Covariance Spectrum
  H.1 Proof of Theorem 5.1
  H.2 Representation of Clean Test Error
I Auxiliary Lemmas

A Differences to Self-Distillation

An important point we wish to make is that the fake data generation process that we analyse should not be confused with self-distillation as formulated in Mobahi et al. [36], for example. Our setting is inspired by the model collapse phenomenon, where increasingly vast amounts of synthetic data generated by users are posted online and will necessarily enter the training set of the next foundation model. In this case, we do not have ground-truth labels, nor is the generation of synthetic data controlled by us; it is controlled by other users. Therefore, we adopt the setting of solely synthetic labels with added noise. Specifically, in our setup, at generation $n > 0$ we do not have access to the true labels $Y_0 = f_0(X) + \text{noise}$ for the training samples $X$, but rather to some $\hat{Y}_n = \hat{f}_n(X) + \text{noise}$, where $\hat{f}_n$ is an unknown function which synthesizes fake labels iteratively; the integer $n$ is the number of iterations. In our work, we make the structural assumption that $\hat{f}_n$ is obtained by iterative / successive regressions starting from a true dataset $D_0 = (X_0, Y_0)$. We do not have any control over the creation of these labels, which is reflected by the noise injected at each stage.

In the self-distillation setting, the data generation process actually helps the performance of the downstream model. The model has access to training labels from the true data distribution $Y$, but decides to fit a model on this data and then use its outputs as the new labels $Y_n := F_n(X, Y)$, iterating this process possibly over several steps. Thus, self-distillation has control over the data generating process, which is carefully optimized for the next-stage training. Specifically, [36] study self-distillation in the same Gaussian regression model underlying our analysis, but in each distillation generation are able to tune the regularization parameter for downstream performance as a function of the original data labels (with the data being the same at each generation). In the setting of model collapse, there is no control over the data generation process, since it constitutes synthesized data which typically comes from the wider web.

Self-distillation for linear regression would amount to a very special instance of our analysis where (1) $X_0 = X_1 = \ldots = X_{n-1} = X_n = X$ and (2) $\sigma_0 = \ldots = \sigma_{n-1} = 0$. That is, there is exactly one design matrix which is used in the data generation process and in the downstream estimator, and also no additional source of label noise is present at the end of each generation. In the general setup considered in our work, (1) is not imposed: we typically assume that $X_0, X_1, \ldots, X_{n-1}, X_n$, with $X_n = X$, are all independent random matrices. An exception is the paragraph "The Dependent Case" of Section 4.5, where we assume $X_m = X_0$ for all $m \leq n - 1$, independent of $X_n = X$.
That setup (considered for the purpose of showing that model collapse can still occur in the absence of label noise) also assumes $\sigma_m = 0$ for all $m$; the analytic picture which emerges (Theorem 4.5) is already drastically different from what one would get from self-distillation (corresponding to the additional assumption that $X = X_0$).

B Related Work on Kernel Ridge Regression with Gaussian Design

This model has been studied by a vast body of work. For example, Richards et al. [41], Hastie et al. [22], and Bach [4] analyze the classical bias-variance decomposition of the test error for ridge regression in the high-dimensional setting where dataset size and dimension diverge proportionately, using tools from Random Matrix Theory (RMT). In Section 4 we significantly extend this type of analysis to training on iteratively generated synthetic data. This model is also particularly attractive because it allows one to analyze an important trade-off: the relative decay of the eigenvalues of the kernel (capacity) and of the coefficients of the target function in feature space (source). Sizeable effort has been dedicated to characterizing the decay rate of the test error as a function of these two relative decays (a.k.a. power laws) [11, 38, 6, 41, 46, 14, 15]. In Section 5 we extend these efforts, in particular building on the works of Cui et al. [13, 14], which give a full characterization of all regimes and test error decays that can be observed at the interplay of noise and regularization, characterizing a crossover transition of rates in the noisy setting. Our work uncovers fascinating new effects as a result of iterative training on synthetic data.

C Extension to Kernel Methods

Though we present our results in the case of linear regression in $\mathbb{R}^d$ for clarity, they can be rewritten in equivalent form in the kernel setting. Indeed, as in [11, 45, 14, 29], it suffices to replace $x$ with a feature map induced by a kernel $K$, namely $\psi(x) := K_x \in \mathcal{H}_K$. Here, $\mathcal{H}_K$ is the reproducing kernel Hilbert space (RKHS) induced by $K$. In the data distribution (2), we must now replace the Gaussian marginal distribution condition $x \sim N(0, \Sigma)$ with $\psi(x) \sim N(0, \Sigma)$. The ground-truth labelling function in (2) is now just a general function $f_0 \in L^2$. The predictor (7) is then given by (Representer Theorem)
$$\hat{f}^{pred}_n(x) := K(X, x)^\top \hat{c}_n, \quad \text{with} \quad \hat{c}_n = (G + \lambda T I)^{-1} Y \in \mathbb{R}^T,$$
where $K(X, x) := (K(x_1, x), \ldots, K(x_T, x))^\top$ and $G = K(X, X) \in \mathbb{R}^{T \times T}$ is the Gram matrix.

D Details of Experiments

We perform the following experiments on both simulated and real data to empirically validate our theoretical results.

D.1 Simulated Data

We consider ordinary / linear ridge regression in $\mathbb{R}^d$, for $d = 300$, and different structures for the covariance matrix $\Sigma$ of the inputs: isotropic (i.e. $\Sigma = I_d$) and power-law (23), with $(\beta, r) = (2, 0.375)$. For each value of $n$ (the generation index), the fake data generator is constructed according to the process described in (4). Then, for different values of $T$ (between 1 and 1,000,000), a sample of size $T$ is drawn from this fake data generator and a downstream ridge model (7) is fitted. The test set consists of 100,000 clean pairs $(x, y)$ from the true data distribution $P_{\Sigma, w_0, \sigma^2}$. This experiment is repeated 10 times to generate error bars. The results for the isotropic setting are shown in Figure 1a and the results for the power-law setting are shown in Figure 1b. Figure 2 shows the over-parametrized setting.
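As a companion to the description above, the sketch below constructs a power-law problem instance satisfying the capacity and source conditions (23) (using the choice $c_j \asymp j^{-\delta/2}$ mentioned after Eq. (24)) together with the adaptive regularization (25); the exact constants and normalizations are ours, not the exact configuration used in our simulations.

```python
import numpy as np

def power_law_problem(d, beta, r):
    """Construct (Sigma, w0) satisfying (23): lambda_j = j^{-beta} and
    coefficients c_j = j^{-delta/2} with delta = 1 + beta * (2r - 1)."""
    j = np.arange(1, d + 1)
    eigs = j ** (-float(beta))
    delta = 1 + beta * (2 * r - 1)
    w0 = j ** (-delta / 2)          # expressed directly in the (diagonal) eigenbasis
    return np.diag(eigs), w0

d, beta, r = 300, 2.0, 0.375
Sigma, w0 = power_law_problem(d, beta, r)
ell_crit = beta / (2 * beta * min(r, 1.0) + 1)
T = 10_000
lam = T ** (-ell_crit)              # adaptive regularization, Eq. (25), as in Figure 1b
print(ell_crit, lam)
```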
D.2 Real Data: Kernel Ridge Regression on MNIST

As in Cui et al. [14], Wei et al. [48], we consider a distribution on MNIST [16], a popular dataset in the ML community. The classification dataset contains 60,000 training and 10,000 test data points (handwritten digits), with labels from 0 to 9 inclusive. As in Cui et al. [14], we convert the labels into real numbers (i.e., a regression problem) as follows: $y = \text{label} \bmod 2 + \text{noise}$, where the variance of the noise is $\sigma^2 = 1$ (for simplicity, we also set $\sigma_0^2 = 1$). The test set consists of 10,000 pairs $(x, y)$, with the labels $y$ constructed as described in the previous sentence. The fake data used for training is generated as in the previous experiment, but via kernel ridge regression (instead of least squares) with the RBF kernel (bandwidth $= 10^{-4}$) and the polynomial kernel (degree $= 5$, bandwidth $= 10^{-3}$). Note that it was empirically shown in Cui et al. [14] that these datasets verify (23) with $(\beta, r) \approx (1.65, 0.097)$ in the case of the aforementioned RBF kernel, and $(\beta, r) \approx (1.2, 0.15)$ in the case of the polynomial kernel. Then, for different values of $T$ (between 1 and 1000), a sample of size $T$ is drawn from this fake data generator and a downstream kernel ridge model is fitted. Each of these experiments is repeated 10 times to generate error bars (due to different realizations of the label noise). The results are shown in Figure 4.

D.3 Neural Networks on MNIST

We now further examine model collapse in two-layer neural networks on the MNIST dataset, beyond the linear setting and Gaussian data. We consider two scenarios: learning with a random features (RF) model, where the first layer is fixed randomly and only the second layer is trained, and learning with a fully trainable neural network. For the two-layer network with the first layer fixed, our theory predicts a linear increase in test error as a function of the number of iterations $n$. This is because such models fall in the linearized regime: finite-width random feature models can be approximated by kernel regression [25, 45]. For fully-trained neural networks, our theory does not directly apply. However, we anticipate that the general trends uncovered by our asymptotic theory will hold; for example, more parameters are expected to lead to greater model collapse, as shown in Theorem 4.1. Specifically, the models were trained using stochastic gradient descent (SGD) with a batch size of 128 and a learning rate of 0.1. We employed a regression setting where labels were converted to one-hot vectors, and the model was trained with the mean squared error loss for 200 epochs, until convergence. When generating the synthetic data, Gaussian label noise with a standard deviation of 0.1 is added. The test error is always evaluated on the test set using clean labels.
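To make the kernel-ridge pipeline of Section D.2 concrete, here is a short illustrative sketch (ours; not the exact experimental code) of one generation of fake labelling with an RBF kernel followed by a downstream kernel ridge fit with adaptive regularization $\lambda = T^{-\ell}$. We use scikit-learn's `KernelRidge`, whose `alpha` parameter corresponds to $\lambda T$ in the notation of (7); treating `gamma` as the "bandwidth" parameter, as well as the helper names and the value of $\ell$, are our own assumptions.

```python
import numpy as np
from sklearn.kernel_ridge import KernelRidge

def synthesize_labels(X_real, y_real, X_new, gamma=1e-4, lam=1e-3, sigma0=1.0, rng=None):
    """One generation of the label faker: fit kernel ridge regression on the real data,
    then relabel the new inputs with the fitted model plus fresh Gaussian noise."""
    if rng is None:
        rng = np.random.default_rng(0)
    faker = KernelRidge(alpha=lam * len(X_real), kernel="rbf", gamma=gamma)
    faker.fit(X_real, y_real)
    return faker.predict(X_new) + sigma0 * rng.standard_normal(len(X_new))

def downstream_error(X_train, y_fake, X_test, y_test, ell=0.25, gamma=1e-4):
    """Downstream kernel ridge regression with adaptive regularization lambda = T^{-ell}."""
    T = len(X_train)
    model = KernelRidge(alpha=T ** (1.0 - ell), kernel="rbf", gamma=gamma)  # alpha = lambda * T
    model.fit(X_train, y_fake)
    return np.mean((model.predict(X_test) - y_test) ** 2)

# Example usage (assuming MNIST arrays X0, y0 of size T0, fresh inputs X, and a clean test set):
#   y_fake = synthesize_labels(X0, y0, X)                 # generation n = 1
#   mse = downstream_error(X, y_fake, X_test, y_test)     # downstream model trained on fake labels
```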
(a) RBF kernel (bandwidth $= 10^{-4}$). (b) Polynomial kernel (degree $= 5$, bandwidth $= 10^{-3}$).

Figure 4: Model collapse in kernel ridge regression (power-law covariance spectrum) on MNIST. Here, we use adaptive regularization $\lambda = T^{-\ell}$ for different values of the exponent $\ell \ge 0$ (see Section D for the full experimental setup). Top row: RBF kernel. Bottom row: polynomial kernel. In each plot, we show test error curves as a function of sample size $T$, for different generations ($n$) of fake data. The broken vertical line corresponds to $T = T_0$, where $T_0$ is the number of samples (from the true data distribution) which was used to train the label faker. The value of the regularization exponent $\ell = \ell^*$ (broken curves) is the optimal value in the presence of iterative data relabelling, while $\ell = \ell_{\mathrm{crit}}$ (solid curves) corresponds to the optimal value without iterative relabelling (i.e., $n = 0$) proposed in Cui et al. [14] (see (26)). Specifically, we take $\ell^* = (b - a)\,\ell_{\mathrm{crit}} = b\,\ell_{\mathrm{crit}}$, where $b = \log T_0 / \log T$ (so that $T_0 = T^b$), as proposed in Theorem 5.1, formula (28). Notice how the effect of fake data makes the test error non-decreasing in the sample size $T$. This is effectively a collapse of the learned model.

The results for RF models of width (i.e., number of hidden units) $k = 20{,}000$ are presented in Figure 5. We observe that, with the exception of the first two generations, the degradation in MSE loss generally follows a linear trend, which is consistent with the predictions of our theory. Next, we consider the scenario of training the entire neural network. By varying the width $k$, we adjust the number of parameters, to further explore the theoretical predictions on how the number of parameters influences model collapse.

Observations. From Figure 6, we can observe the following. More parameters (wider neural networks, i.e., larger $k$) lead to increased model collapse. This observation is consistent with our results proved in the linear regime (Theorem 4.1). For linear models, the number of parameters is proportional to $d$ (the input dimension), whereas in two-layer neural networks, the number of parameters is of order $kd$ (i.e., proportional to the width $k$). Moreover, the dependence of model collapse on the number of iterations $n$ is linear for small values of $n$ ($n \le 4$ in our experiments), and becomes superlinear (possibly quadratic) for larger values of $n$ ($n \ge 4$). Recall that $n = 0$ corresponds to training on clean data from the true data distribution. Thus, model collapse in neural networks appears to be possibly even more severe than in linear regression.

Figure 5: Performance of the RF model on MNIST, with a one-hidden-layer NN (width $k = 20{,}000$): mean squared error as a function of the number of generations. The standard deviation is calculated over 10 seeds.

Figure 6: Performance of two-layer neural networks on MNIST with varying hidden dimensions, for sample sizes 100, 200, 800 and 4000: mean squared error as a function of the number of iterations.

E Notations

The set of integers from 1 through $d$ is denoted $[d]$. Given a variable $z$ (which can be the input dimension $d$, the sample size $T$, etc.), the notation $f(z) \lesssim g(z)$ means that $f(z) \le C g(z)$ for sufficiently large $z$ and an absolute constant $C$, while $f(z) \asymp g(z)$ means $f(z) \lesssim g(z) \lesssim f(z)$. Further, $f(z) \simeq g(z)$ means $f(z) = (1 + o(1))\,g(z)$, where $o(1)$ stands for a quantity which tends to zero in the limit $z \to \infty$. We denote by $A^\dagger$ the Moore-Penrose pseudo-inverse of any matrix $A$, and by $\|A\|_{\mathrm{op}}$ its operator norm, while the trace of a square matrix $A$ is denoted $\operatorname{tr} A$. Finally, $\|u\|_\Sigma := \sqrt{u^\top \Sigma u}$ is the Mahalanobis norm induced by a positive-definite matrix $\Sigma$.

F Exact Characterization of Test Error Under Model Collapse

F.1 A General Formula for Test Error

We now consider the case of a general ridge penalty $\lambda > 0$, and drop the requirements $T \ge d + 2$ and $T_0 \ge d + 2$. Recall the definitions of $X$, $Y$, $E$ and of the random matrices $R$ and $\hat\Sigma$ appearing in (7). For later reference, define
$$\mathrm{Bias} := \mathbb{E}\,\|\hat\Sigma R w_0 - w_0\|_\Sigma^2, \qquad (29)$$
$$\mathrm{Var} := \mathbb{E}\,\|R X^\top E / T\|_\Sigma^2 = \frac{\sigma^2}{T}\,\mathbb{E}\operatorname{tr}\big(\Sigma R^2 \hat\Sigma\big). \qquad (30)$$
These are respectively the bias and variance terms in the classical bias-variance decomposition
$$E^{\mathrm{clean}}_{\mathrm{test}} := \mathrm{Bias} + \mathrm{Var}, \qquad (31)$$
for standard ridge regression fitted on clean data from the true data distribution $P_{\Sigma, w_0, \sigma^2}$ (e.g., see Hastie et al. [22]).

Theorem F.1. For an $n$-fold fake data generation process, the test error of a ridge predictor $\hat w^{\mathrm{pred}}_n$ based on a sample of size $T \ge 1$ with regularization parameter $\lambda$ is given by
$$E_{\mathrm{test}}(\hat w^{\mathrm{pred}}_n) = \widetilde{\mathrm{Bias}} + \mathrm{Var} + n\,\sigma_0^2\,\rho, \qquad (32)$$
$$\widetilde{\mathrm{Bias}} = \mathbb{E}\,\|\hat\Sigma R Q_{n-1} w_0 - w_0\|_\Sigma^2, \qquad \rho = \frac{1}{n}\sum_{m=0}^{n-1}\mathbb{E}\operatorname{tr}\big(C_{n-1,m}\,\hat\Sigma R \Sigma R \hat\Sigma\big),$$
where $\mathrm{Var}$ is as given in (30) and $C_{k,m} := \bar Q_{k,m}\bar Q_{k,m}^\top$ for $\bar Q_{k,m} := Q_{k,m} X_m^\dagger$, $Q_{k,m} := P_k P_{k-1}\cdots P_m$, $Q_k := Q_{k,0} = P_k P_{k-1}\cdots P_0$, with $P_m = X_m^\dagger X_m$ being the orthogonal projection matrix onto the subspace of $\mathbb{R}^d$ spanned by the rows of $X_m$. In particular, if $T_0 \ge d + 2$ (under-parametrized data-generator), then $\widetilde{\mathrm{Bias}} = \mathrm{Bias}$ as in (29), and
$$E_{\mathrm{test}}(\hat w^{\mathrm{pred}}_n) = E^{\mathrm{clean}}_{\mathrm{test}} + n\,\sigma_0^2\,\rho, \qquad \rho = \frac{1}{T_0 - d - 1}\,\mathbb{E}\operatorname{tr}\big(\Sigma^{-1}\hat\Sigma R \Sigma \hat\Sigma R\big).$$

In the second part of the theorem, the term $E^{\mathrm{clean}}_{\mathrm{test}}$ (introduced earlier in (31)) corresponds to the usual test error when the downstream model is trained on real (not fake) data, for which well-known formulae exist in a variety of scenarios (see Proposition 4.2).

Remark F.2. We show in Theorem 4.3 that $\widetilde{\mathrm{Bias}} \simeq \mathrm{Bias} + \overline{\mathrm{Bias}}$, where $\overline{\mathrm{Bias}} \ge 0$ in the appropriate asymptotic limit, with equality if $T_0 \ge d + 2$ (the under-parametrized regime). Thus, apart from the variance term, an over-parametrized ($T_0 < d + 2$) synthetic data-generator also harms the bias term of the test error of downstream models. In contrast, an under-parametrized synthetic data-generator ($T_0 \ge d + 2$) only harms the variance. The increase in bias suffered in the over-parametrized regime is precisely quantified in Section 4.5, and shown to be an increasing function of the number of generations $n$.

The test error decomposition in Theorem F.1 is thus of the promised form (1). The additional term means that there is a competition between the usual test error $E^{\mathrm{clean}}_{\mathrm{test}}$ and the term induced by the fake labelling process. Understanding the interaction of these two terms is key to demystifying the origins of model collapse.

Low-Dimensional Limit. Observe that if $d$ is fixed and $T \to \infty$, then the empirical covariance matrix $\hat\Sigma$ converges (e.g., weakly, w.r.t. operator norm) to its population version $\Sigma$, and so for $T_0 \ge d + 2$ we have
$$\rho \to \frac{\operatorname{tr}\Sigma^2(\Sigma + \lambda I_d)^{-2}}{T_0 - d - 1} = \frac{\mathrm{df}_2(\lambda)}{T_0 - d - 1},$$
where, for any $\lambda \ge 0$ and $m \in \mathbb{N}^*$, $\mathrm{df}_m(\lambda)$ is the $m$th order "degree of freedom" of the covariance matrix $\Sigma$, given by
$$\mathrm{df}_m(\lambda) = \mathrm{df}_m(\lambda; \Sigma) := \operatorname{tr}\Sigma^m(\Sigma + \lambda I_d)^{-m}.$$
Note that $\mathrm{df}_m(\lambda) \le d$ always. In the high-dimensional setting (where $d$ can grow beyond $T_0$), the precise analysis of $\rho$ will be carried out via random matrix theory (RMT).

F.2 Proof of Theorem 4.1 (Ridgeless Regression)

The proof is by induction on the number of generations $n$ of fake data. For $n = 0$, we have
$$E_{\mathrm{test}}(\hat w^{\mathrm{pred}}_0) = \mathbb{E}\,\|\hat w^{\mathrm{pred}}_0 - w_0\|_\Sigma^2 = \mathbb{E}\,\|\hat w^{\mathrm{pred}}_0 - \hat w_0\|_2^2 = \mathbb{E}\,\|(X_0^\top X_0)^{-1}X_0^\top E_0\|_2^2 = \sigma^2\,\mathbb{E}\operatorname{tr}(X_0^\top X_0)^{-1} = \frac{\sigma^2 d}{T - d - 1} \simeq \frac{\sigma^2\phi}{1 - \phi},$$
where $\phi := d/T \in (0, 1)$ and the last step has made use of Lemma F.3 below. This is a well-known result for the test error of linear regression in the under-parametrized regime, without any AI pollution (fake / synthesized training data).

Analogously, for $n = 1$ one computes the test error after the first generation of fake data as follows:
$$E_{\mathrm{test}}(\hat w^{\mathrm{pred}}_1) = \mathbb{E}\,\|\hat w^{\mathrm{pred}}_1 - w_0\|_\Sigma^2 = \mathbb{E}\,\|\hat w^{\mathrm{pred}}_1 - \hat w_0\|_2^2 = \mathbb{E}\,\|\hat w^{\mathrm{pred}}_1 - \hat w_1 + \hat w_1 - \hat w_0\|_2^2$$
$$= \mathbb{E}\,\|(X_1^\top X_1)^{-1}X_1^\top E_1 + \hat w^{\mathrm{pred}}_0 - \hat w_0\|_2^2 = \mathbb{E}\,\|w_0 - \hat w^{\mathrm{pred}}_0\|_2^2 + \mathbb{E}\,\|(X_1^\top X_1)^{-1}X_1^\top E_1\|_2^2$$
$$= E_{\mathrm{test}}(\hat w^{\mathrm{pred}}_0) + \frac{\sigma_1^2 d}{T_1 - d - 1} = E_{\mathrm{test}}(\hat w^{\mathrm{pred}}_0) + \frac{\sigma_0^2 d}{T_0 - d - 1} \simeq \frac{\sigma^2\phi}{1 - \phi} + \frac{\sigma_0^2\phi_0}{1 - \phi_0},$$
where $\phi_0 = d/T_0 \in (0, 1)$. Continuing the induction on $n$, we obtain the result.

Lemma F.3. Let $X_0$ be a $T_0 \times d$ random matrix with iid rows from $N(0, \Sigma)$. If $T_0 \ge d + 2$, then the empirical covariance matrix $\hat\Sigma_0 := X_0^\top X_0 / T_0$ is invertible a.s. and
$$\mathbb{E}\,[\hat\Sigma_0^{-1}] = \frac{T_0}{T_0 - d - 1}\,\Sigma^{-1}.$$
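Before moving on to the general ridge case, the following short Monte Carlo sketch (our own sanity check, not part of the paper's experiments) estimates the ridgeless test error after $n$ generations under the isotropic assumptions above and compares it with the prediction $\frac{\sigma^2 d}{T - d - 1} + \frac{n\,\sigma_0^2 d}{T_0 - d - 1}$ obtained from the induction; the constants and function names are ours.

```python
import numpy as np

def ridgeless_test_error(n, d=50, T=200, T0=150, sigma=0.1, sigma0=0.2, trials=200, seed=0):
    """Monte Carlo estimate of the ridgeless test error after n fake-data generations
    (isotropic covariance), together with the closed-form prediction above."""
    rng = np.random.default_rng(seed)
    errs = []
    for _ in range(trials):
        w0 = rng.standard_normal(d) / np.sqrt(d)      # any fixed ground truth works
        w_hat = w0.copy()
        for _ in range(n):                            # n generations of fake labelling
            Xm = rng.standard_normal((T0, d))
            Ym = Xm @ w_hat + sigma0 * rng.standard_normal(T0)
            w_hat = np.linalg.lstsq(Xm, Ym, rcond=None)[0]
        X = rng.standard_normal((T, d))               # downstream training data, fake labels
        Y = X @ w_hat + sigma * rng.standard_normal(T)
        w_pred = np.linalg.lstsq(X, Y, rcond=None)[0]
        errs.append(np.sum((w_pred - w0) ** 2))
    predicted = sigma**2 * d / (T - d - 1) + n * sigma0**2 * d / (T0 - d - 1)
    return np.mean(errs), predicted

for n in range(4):
    emp, theo = ridgeless_test_error(n)
    print(f"n={n}: empirical {emp:.4f} vs predicted {theo:.4f}")
```

The empirical and predicted values agree closely, and the test error grows linearly in $n$, as the theorem asserts.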
F.3 Proof of Theorem F.1 (Ridge Regression + General Covariance)

F.3.1 Representation of $\hat w_n$ and $\hat w^{\mathrm{pred}}_n$

We first obtain explicit formulae for the labelling vectors $\hat w_n$ used in the fake-data generation process (5). For any integer $m \ge 0$, define $P_m = X_m^\dagger X_m$, the orthogonal projection matrix onto the subspace of $\mathbb{R}^d$ spanned by the rows of $X_m$. Observe from (5) that
$$\hat w_n = X_{n-1}^\dagger Y_{n-1} = X_{n-1}^\dagger(X_{n-1}\hat w_{n-1} + E_{n-1}) = P_{n-1}\hat w_{n-1} + X_{n-1}^\dagger E_{n-1}$$
$$= P_{n-1}X_{n-2}^\dagger(X_{n-2}\hat w_{n-2} + E_{n-2}) + X_{n-1}^\dagger E_{n-1} = P_{n-1}P_{n-2}\hat w_{n-2} + P_{n-1}X_{n-2}^\dagger E_{n-2} + X_{n-1}^\dagger E_{n-1}$$
$$\vdots$$
$$= P_{n-1}P_{n-2}\cdots P_0\, w_0 + \sum_{m=0}^{n-1} P_{n-1}P_{n-2}\cdots P_m\, X_m^\dagger E_m.$$
We get the following result.

Lemma F.4. For any $n \ge 0$, the following formula holds:
$$\hat w_n = \begin{cases} w_0, & \text{if } n = 0,\\ Q_{n-1}w_0 + \sum_{m=0}^{n-1}\bar Q_{n-1,m}E_m, & \text{if } n \ge 1,\end{cases} \qquad (36)$$
where $\bar Q_{k,m} := Q_{k,m}X_m^\dagger$, $Q_{k,m} := P_k P_{k-1}\cdots P_m$ and $Q_k := Q_{k,0} = P_k P_{k-1}\cdots P_0$. Moreover, $\hat w_n \in \operatorname{Im}P_{n-1}$ as soon as $n \ge 1$. In particular, under the simplifying condition (17), it holds that
$$\hat w_n = \begin{cases} w_0, & \text{if } n = 0,\\ P_0 w_0 + X_0^\dagger\bar E_{n-1} \in \operatorname{Im}P_0, & \text{if } n \ge 1,\end{cases} \qquad (37)$$
where $\bar E_{n-1} := \sum_{m=0}^{n-1}E_m$ is a random vector of length $T_0$, with iid entries from $N(0, n\sigma_0^2)$, independent of $X_0$. Moreover, $\hat w_n \in \operatorname{Im}P_0$ as soon as $n \ge 1$.

Note that the second part of the result uses the elementary linear-algebraic fact that $P_m X_m^\dagger = X_m^\dagger$. In the special case where $T_0 \ge d$, we have $P_0 = I$ a.s., and so $\hat w_n = w_0 + X_0^\dagger\bar E_{n-1}$. Otherwise, even in the absence of generator noise ($\sigma_0 = 0$), the fake data labeller $\hat w_n = P_0 w_0$ drifts away from the truth $w_0$, into a subspace of $\mathbb{R}^d$ spanned by the rows of $X_0$.

Next, let us obtain a decomposition for the downstream predictor $\hat w^{\mathrm{pred}}_n$ defined in (7). As usual, let $\hat\Sigma := X^\top X/T$ be the empirical covariance matrix with resolvent $R = (\hat\Sigma + \lambda I)^{-1}$, and observe that the downstream model writes
$$\hat w^{\mathrm{pred}}_n = R X^\top Y_n(X)/T = R X^\top(X\hat w_n + E)/T = R X^\top\Big(XQ_{n-1}w_0 + X\sum_{m=0}^{n-1}\bar Q_{n-1,m}E_m + E\Big)/T$$
$$= R\hat\Sigma Q_{n-1}w_0 + R X^\top E/T + R\hat\Sigma\sum_{m=0}^{n-1}\bar Q_{n-1,m}E_m. \qquad (38)$$

F.3.2 Proof of Theorem F.1

Using the decomposition (38) for the downstream model $\hat w^{\mathrm{pred}}_n$, we deduce that
$$E_{\mathrm{test}}(\hat w^{\mathrm{pred}}_n) = \mathbb{E}\,\|\hat w^{\mathrm{pred}}_n - w_0\|_\Sigma^2 = \mathbb{E}\,\Big\|R\hat\Sigma Q_{n-1}w_0 - w_0 + R X^\top E/T + R\hat\Sigma\sum_{m=0}^{n-1}\bar Q_{n-1,m}E_m\Big\|_\Sigma^2$$
$$= \mathbb{E}\,\|R\hat\Sigma Q_{n-1}w_0 - w_0\|_\Sigma^2 + \mathbb{E}\,\|R X^\top E/T\|_\Sigma^2 + \mathbb{E}\,\Big\|R\hat\Sigma\sum_{m=0}^{n-1}\bar Q_{n-1,m}E_m\Big\|_\Sigma^2 = \widetilde{\mathrm{Bias}} + \mathrm{Var} + n\sigma_0^2\rho,$$
where $\hat\Sigma := X^\top X/T$, and $\widetilde{\mathrm{Bias}}$, $\mathrm{Var}$ and $\rho$ are as given in the theorem. In the second step, we have used the independence of $X$, $X_0$, $E$ and $\bar E_{n-1}$, and the fact that $E$ and $\bar E_{n-1}$ are centered Gaussian random vectors with iid components of variances $\sigma^2$ and $n\sigma_0^2$, respectively.

F.4 Proof of Theorem 4.3

Analysis of the Bias-like Term. An exact analysis of the $\widetilde{\mathrm{Bias}}$ term appearing in Theorems F.1 and 4.3 is presumably a treacherous enterprise, given its dependency on $X$ (via $R$ and $\hat\Sigma$) and $X_0$ (via $P_0$). In place of such an analysis, we shall settle for the following result, which gives an instructive lower bound.

Proposition F.5. In the RMT limit (12), it holds that
$$\lim\widetilde{\mathrm{Bias}} - \lim\mathrm{Bias} \ge \lim\mathbb{E}\,\|R\hat\Sigma P_0 w_0 - R\hat\Sigma w_0\|_\Sigma^2 \ge 0.$$
Thus, training on fake / synthesized data increases the bias term of the downstream model's test error!

Proof. Letting $A := R\hat\Sigma$, one computes
$$\widetilde{\mathrm{Bias}} - \mathrm{Bias} = \|AP_0w_0 - w_0\|_\Sigma^2 - \|Aw_0 - w_0\|_\Sigma^2 = \|AP_0w_0 - Aw_0 + Aw_0 - w_0\|_\Sigma^2 - \|Aw_0 - w_0\|_\Sigma^2$$
$$= \|AP_0w_0 - Aw_0\|_\Sigma^2 + 2\,w_0^\top(P_0 - I)A^\top\Sigma(A - I)w_0 = \|AP_0w_0 - Aw_0\|_\Sigma^2 + 2\,w_0^\top(I - P_0)A^\top\Sigma(I - A)w_0.$$
It then suffices to observe that, in the RMT limit (12), it holds that $\lim\mathbb{E}\,w_0^\top(I - P_0)A^\top\Sigma(I - A)w_0 \ge 0$, as can be seen from repeated application of Propositions 1 and 2 of Bach [4].

Analysis of the $\rho$ Term. Define the $d \times d$ random psd matrix $H := \hat\Sigma R\Sigma\hat\Sigma R$. Under the simplifying assumption (17), the matrices $\bar Q_{k,m}$ defined in the theorem all equal $\bar Q_{0,0} = X_0^\dagger$, and so all $n$ terms of the sum defining the $\rho$-term in (32) are equal. It follows that
$$\rho = \mathbb{E}\,[\operatorname{tr}X_0^\dagger(X_0^\dagger)^\top H] = \mathbb{E}_H\,\mathbb{E}\,[\operatorname{tr}X_0^\dagger(X_0^\dagger)^\top H \mid H]. \qquad (40)$$
Now, one computes the conditional expectation as follows:
$$\mathbb{E}\,[\operatorname{tr}X_0^\dagger(X_0^\dagger)^\top H \mid H] = \mathbb{E}\,[\operatorname{tr}X_0^\top(X_0X_0^\top)^{-2}X_0 H \mid H] = -\lim_{\lambda_0\to 0^+}\frac{1}{T_0}\,\frac{\partial}{\partial\lambda_0}\,\mathbb{E}\,[\operatorname{tr}X_0^\top(X_0X_0^\top + \lambda_0 T_0 I)^{-1}X_0 H \mid H].$$
Furthermore, defining $A := \Sigma^{1/2}H\Sigma^{1/2}$ and $Z_0 := X_0\Sigma^{-1/2}$, we have
$$\operatorname{tr}X_0^\top(X_0X_0^\top + \lambda_0 T_0 I)^{-1}X_0 H = \operatorname{tr}\Sigma^{1/2}Z_0^\top(Z_0\Sigma Z_0^\top + \lambda_0 T_0 I)^{-1}Z_0\Sigma^{1/2}H = \operatorname{tr}A Z_0^\top(Z_0\Sigma Z_0^\top + \lambda_0 T_0 I)^{-1}Z_0.$$
We deduce from Proposition 2 of Bach [4] that
$$\mathbb{E}\,[\operatorname{tr}X_0^\top(X_0X_0^\top + \lambda_0 T_0 I)^{-1}X_0 H \mid H] \simeq \operatorname{tr}A(\Sigma + \kappa(\lambda_0, T_0)I)^{-1} = \operatorname{tr}H(\Sigma + \kappa(\lambda_0, T_0)I)^{-1}\Sigma.$$
Differentiating w.r.t. $\lambda_0$ and letting this parameter tend to zero from above gives
$$\mathbb{E}\,[\operatorname{tr}X_0^\dagger(X_0^\dagger)^\top H \mid H] \simeq -\frac{1}{T_0}\lim_{\lambda_0\to 0^+}\frac{\partial\kappa(\lambda_0, T_0)}{\partial\lambda_0}\cdot\frac{\partial}{\partial t}\operatorname{tr}H(\Sigma + tI)^{-1}\Sigma\Big|_{t = \kappa(\lambda_0, T_0)} = \frac{\operatorname{tr}H(\Sigma + \kappa_0 I)^{-2}\Sigma}{T_0 - \mathrm{df}_2(\kappa_0)},$$
where $\kappa_0 := \kappa(0, T_0)$, and we have made use of Lemma I.2. Combining with (40) and then applying Proposition 1 of Bach [4] to compute $\mathbb{E}_H\operatorname{tr}H(\Sigma + \kappa_0 I)^{-2}\Sigma = \mathbb{E}_X\operatorname{tr}\hat\Sigma R\Sigma\hat\Sigma R(\Sigma + \kappa_0 I)^{-2}\Sigma$ gives the following result.

Proposition F.6. In the RMT limit (12), it holds for any $\lambda > 0$ that
$$\rho \simeq \frac{\operatorname{tr}\Sigma^4(\Sigma + \kappa_0 I)^{-2}(\Sigma + \kappa I)^{-2}}{T_0 - \mathrm{df}_2(\kappa_0)} + \frac{\kappa^2\operatorname{tr}\Sigma^2(\Sigma + \kappa_0 I)^{-2}(\Sigma + \kappa I)^{-2}}{T_0 - \mathrm{df}_2(\kappa_0)}\cdot\frac{\mathrm{df}_2(\kappa)}{T - \mathrm{df}_2(\kappa)}, \qquad (41)$$
where $\kappa_0 := \kappa(0, T_0)$ and $\kappa := \kappa(\lambda, T)$. In particular, if $T_0 \ge d$ (so that $\kappa_0 = 0$), then
$$\rho \simeq \frac{\mathrm{df}_2(\kappa)}{T_0 - d}\left(1 + \frac{\kappa^2\operatorname{tr}(\Sigma + \kappa I)^{-2}}{T - \mathrm{df}_2(\kappa)}\right).$$
This result completes the proof of Theorem 4.3.

F.5 Proof of Corollary 4.4

For the first part, we know from Theorem F.1 that
$$E_{\mathrm{test}}(\hat w^{\mathrm{pred}}_n) = E_{\mathrm{test}}(\hat w^{\mathrm{pred}}_0) + n\sigma_0^2\rho, \quad\text{with} \qquad (43)$$
$$\rho := \frac{\mathbb{E}\operatorname{tr}\Sigma^{-1}\hat\Sigma(\hat\Sigma + \lambda I)^{-1}\Sigma(\hat\Sigma + \lambda I)^{-1}\hat\Sigma}{T_0 - d}. \qquad (44)$$
The $E_{\mathrm{test}}(\hat w^{\mathrm{pred}}_0)$ term is taken care of by Proposition 4.2, since it corresponds to the generalization error on clean training data. For the $\rho$ term, we use Proposition 1 of Bach [4] with $A = \Sigma^{-1}$ and $B = \Sigma$ to get
$$\rho \simeq \frac{\operatorname{tr}(\Sigma + \kappa I)^{-2}\Sigma^2}{T_0 - d} + \frac{\kappa^2\operatorname{tr}(\Sigma + \kappa I)^{-2}}{T_0 - d}\cdot\frac{\mathrm{df}_2(\kappa)}{T - \mathrm{df}_2(\kappa)} = \frac{\mathrm{df}_2(\kappa)}{T_0 - d} + \frac{\kappa^2\operatorname{tr}(\Sigma + \kappa I)^{-2}}{T_0 - d}\cdot\frac{\mathrm{df}_2(\kappa)}{T - \mathrm{df}_2(\kappa)},$$
which proves the first part of the result. For the second part, note that $\mathrm{df}_2(\kappa) = d/(1 + \kappa)^2$ when $\Sigma = I$ and (10) holds, and so
$$\Big(\tfrac{1}{\phi_0} - 1\Big)\rho \simeq \frac{1}{(1 + \kappa)^2} + \frac{\kappa^2}{(1 + \kappa)^4}\cdot\frac{\phi}{1 - \phi/(1 + \kappa)^2} = \frac{1}{(1 + \kappa)^2} + \frac{1}{(1 + \kappa)^2}\cdot\frac{\phi\kappa^2}{(1 + \kappa)^2 - \phi},$$
and the result follows.

F.6 A Note on Proposition 4.2

As mentioned in the main text, the result is classical (Richards et al. [41], Hastie et al. [22], Bach [4]). Only the second part needs a comment, which we now provide: it follows from the first part. Indeed, $w_0^\top\Sigma(\Sigma + \kappa I)^{-2}w_0 = \|w_0\|_2^2/(1 + \kappa)^2$ and $\mathrm{df}_2(\kappa) = d/(1 + \kappa)^2$, and so we deduce from the first part that
$$\mathrm{Var} \simeq \sigma^2\phi\,\frac{1}{(1 + \kappa)^2}\cdot\frac{1}{1 - \phi/(1 + \kappa)^2} = \frac{\sigma^2\phi}{(1 + \kappa)^2 - \phi}, \qquad \mathrm{Bias} \simeq \kappa^2\|w_0\|_2^2\,\frac{1}{(1 + \kappa)^2}\cdot\frac{1}{1 - \phi/(1 + \kappa)^2} = \frac{\kappa^2\|w_0\|_2^2}{(1 + \kappa)^2 - \phi},$$
from which the result follows.

We now need to estimate $\delta^\top H\delta$ for a deterministic psd matrix $H$, where $\delta := Q_{n-1}w_0 - w_0$. Observe that
$$\delta^\top H\delta = (Q_{n-1}w_0 - w_0)^\top H(Q_{n-1}w_0 - w_0) = w_0^\top Q_{n-1}^\top H Q_{n-1}w_0 - 2\,w_0^\top Q_{n-1}^\top H w_0 + w_0^\top H w_0.$$
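All of the deterministic equivalents in this appendix are expressed through the effective regularization $\kappa = \kappa(\lambda, T)$ and the degrees of freedom $\mathrm{df}_m$. As a purely illustrative aid (our own sketch, not code from the paper), the snippet below solves the self-consistent identity $\kappa - \lambda = \kappa\,\mathrm{df}_1(\kappa)/T$ used in the proof of Lemma I.2 by fixed-point iteration for an arbitrary spectrum, and evaluates $\mathrm{df}_1$ and $\mathrm{df}_2$; the starting point, iteration count and power-law example are our own choices.

```python
import numpy as np

def df(m, kappa, eigs):
    """m-th degree of freedom: df_m(kappa) = tr Sigma^m (Sigma + kappa I)^{-m}."""
    return np.sum((eigs / (eigs + kappa)) ** m)

def effective_regularization(lam, T, eigs, n_iter=500):
    """Solve kappa - lam = kappa * df_1(kappa) / T by fixed-point iteration.
    The iteration map has derivative df_2(kappa)/T < 1 whenever df_2 < T,
    so it converges in the regimes considered here (an assumption we rely on)."""
    kappa = lam + np.sum(eigs) / T        # any starting point >= lam works
    for _ in range(n_iter):
        kappa = lam + kappa * df(1, kappa, eigs) / T
    return kappa

# Example: power-law spectrum lambda_k = k^{-beta} with beta = 2, as in Eqn (23).
eigs = np.arange(1, 301, dtype=float) ** (-2.0)
kappa = effective_regularization(lam=1e-3, T=1000, eigs=eigs)
print(f"kappa = {kappa:.5f}, df_1 = {df(1, kappa, eigs):.2f}, df_2 = {df(2, kappa, eigs):.2f}")
```

Plugging the resulting $\kappa$ and $\mathrm{df}_2(\kappa)$ into the formulas of Proposition F.6 or Corollary 4.4 gives numerical values of $\rho$ that can be checked against simulation.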
F.7 Proof of Theorem 4.5 and Theorem 4.6 (Model Collapse in the Absence of Label Noise)

We first prove Theorem 4.6. Note that since we are in the isotropic case, the term $\mathrm{Bias}$ defined in (15) is now given by $\mathrm{Bias} := \mathbb{E}\,\|\hat\Sigma R(Q_{n-1}w_0 - w_0)\|^2$, where $Q_{n-1} := P_{n-1}P_{n-2}\cdots P_0$. Moreover, since $T > d$ and $\lambda = 0$ by assumption, we have $\hat\Sigma R = I_d$, and so we further have $\mathrm{Bias} = \mathbb{E}\,\|Q_{n-1}w_0 - w_0\|^2$. Now, one computes
$$\mathbb{E}\,\|Q_{n-1}w_0 - w_0\|^2 = \|w_0\|^2 - 2\,\mathbb{E}\,[w_0^\top Q_{n-1}w_0] + \mathbb{E}\,[w_0^\top Q_{n-1}^\top Q_{n-1}w_0] = \|w_0\|^2 - \mathbb{E}\,[w_0^\top Q_{n-1}w_0]$$
$$\simeq \|w_0\|^2 - w_0^\top\Big(\prod_{m=0}^{n-1}(I + \kappa_m I)^{-1}\Big)w_0 = \|w_0\|^2 - \|w_0\|^2\prod_{m=0}^{n-1}\min(1/\phi_m, 1),$$
where in the second step we have used the fact that $Q_{n-1}^\top Q_{n-1} = Q_{n-1}$ because the $P_m$'s are projections; in the third step we have used Lemma F.7 with $\Sigma = I$ and $u = v = w_0$; and in the last step we have used the fact that $\kappa_m := \kappa(0, T_m) = \max(\phi_m - 1, 0) = \max(\phi_m, 1) - 1$. This completes the proof of Theorem 4.6. The proof of Theorem 4.5 is completely analogous, with $Q_{n-1}$ replaced by $Q_0$.

Lemma F.7. Let $X_0, \dots, X_{n-1}$ be independent random matrices of shapes $T_m \times d$ for $m = 0, \dots, n-1$, with rows iid from $N(0, \Sigma)$, and let $Q_{n-1} := P_{n-1}P_{n-2}\cdots P_0$, where $P_m = X_m^\dagger X_m$ is the orthogonal projection onto the subspace of $\mathbb{R}^d$ spanned by the rows of $X_m$. Then, in the limit $d, T_0, \dots, T_{n-1} \to \infty$ such that $d/T_0 \to \phi_0 \in (0, \infty), \dots, d/T_{n-1} \to \phi_{n-1} \in (0, \infty)$, with $\|\Sigma\|_{\mathrm{op}}, \|\Sigma^{-1}\|_{\mathrm{op}} = O(1)$, it holds for deterministic $L_2$-bounded sequences of vectors $u$ and $v$ that
$$u^\top Q_{n-1}v \simeq u^\top\Big(\prod_{m=0}^{n-1}\Sigma(\Sigma + \kappa_m I)^{-1}\Big)v,$$
where $\kappa_m = \kappa(0, T_m)$ is as defined in (9).

Proof. The proof is by induction on $n \ge 1$. For $n = 1$, we have $Q_{n-1} = Q_0 = P_0$. Thus,
$$u^\top Q_0 v = u^\top P_0 v = \lim_{t\to 0^+}u^\top X_0^\top(X_0X_0^\top + tI)^{-1}X_0 v \simeq \lim_{t\to 0^+}u^\top\Sigma(\Sigma + \kappa(t, T_0)I)^{-1}v = u^\top\Sigma(\Sigma + \kappa_0 I)^{-1}v,$$
where $\kappa_0 := \kappa(0, T_0)$ and we have used Proposition 2 of Bach [4] in the second step. Now, suppose the claim holds for $n$, and let us prove that it holds for $n + 1$. Indeed,
$$u^\top Q_n v = u^\top P_n Q_{n-1}v \simeq u^\top P_n\Big(\prod_{m=0}^{n-1}\Sigma(\Sigma + \kappa_m I)^{-1}\Big)v \simeq u^\top\Big(\prod_{m=0}^{n}\Sigma(\Sigma + \kappa_m I)^{-1}\Big)v,$$
where the second step is an application of the induction hypothesis with $P_n u$ in place of $u$, and the last step applies the base case once more.

The following lemma can be used to compute $\|Q_{n-1}w_0 - w_0\|_\Sigma^2$ in the case of an anisotropic $\Sigma$.

Lemma F.8. Under the hypotheses of Lemma F.7, it holds that
$$u^\top\Sigma Q_{n-1}v \simeq u^\top\Sigma\Big(\prod_{m=0}^{n-1}\Sigma(\Sigma + \kappa_m I)^{-1}\Big)v, \qquad u^\top Q_{n-1}^\top\Sigma Q_{n-1}v \simeq u^\top\Sigma\Big(\prod_{m=0}^{n-1}A_m\Big)v,$$
with
$$A_m := (\Sigma + \kappa_m I)^{-2}\Big(\Sigma^2 + \kappa_m^2\,\frac{\mathrm{df}_2(\kappa_m)}{T_m - \mathrm{df}_2(\kappa_m)}\,I\Big), \qquad (51)$$
where $\kappa_m := \kappa(0, T_m)$ is as defined in (9).

Proof. The first formula follows directly from Lemma F.7, with $u$ replaced by $\Sigma u$. For the second formula, we can write
$$u^\top Q_{n-1}^\top M Q_{n-1}v = u^\top P_0 P_1\cdots P_{n-2}P_{n-1}MP_{n-1}P_{n-2}\cdots P_1 P_0 v = \tilde u_{n-1}^\top P_{n-1}MP_{n-1}\tilde v_{n-1},$$
where $\tilde u_{n-1} := P_{n-2}\cdots P_0 u$ and $\tilde v_{n-1} := P_{n-2}\cdots P_0 v$. So we really only need to prove the result for $n = 1$; the general case then follows by induction and multiplicativity. Indeed, defining $A := \Sigma^{1/2}uv^\top\Sigma^{1/2}$, $B := \Sigma^{1/2}M\Sigma^{1/2}$ and $Z_0 := X_0\Sigma^{-1/2}$, we have
$$u^\top P_0 M P_0 v = \lim_{t\to 0^+}u^\top X_0^\top(X_0X_0^\top + tI)^{-1}X_0 M X_0^\top(X_0X_0^\top + tI)^{-1}X_0 v = \lim_{t\to 0^+}\operatorname{tr}A Z_0^\top(Z_0\Sigma Z_0^\top + tI)^{-1}Z_0 B Z_0^\top(Z_0\Sigma Z_0^\top + tI)^{-1}Z_0$$
$$\simeq \operatorname{tr}A(\Sigma + \kappa_0 I)^{-1}B(\Sigma + \kappa_0 I)^{-1} + \kappa_0^2\,\frac{\operatorname{tr}A(\Sigma + \kappa_0 I)^{-2}\operatorname{tr}B(\Sigma + \kappa_0 I)^{-2}}{T_0 - \mathrm{df}_2(\kappa_0)}$$
$$= u^\top(\Sigma + \kappa_0 I)^{-1}\Sigma M\Sigma(\Sigma + \kappa_0 I)^{-1}v + \kappa_0^2\,u^\top\Sigma(\Sigma + \kappa_0 I)^{-2}v\;\frac{\operatorname{tr}M\Sigma(\Sigma + \kappa_0 I)^{-2}}{T_0 - \mathrm{df}_2(\kappa_0)} = u^\top\Sigma A_0 v \quad\text{for }M = \Sigma,$$
where the second step is an application of Proposition 2 of Bach [4].

G The Noisy Regime for Power Law Spectra

Here we discuss the consequences of Theorem 5.1 for the noisy regime. Now fix $\sigma^2 = \Theta(1)$ and $\phi_0 \in (0, 1)$. In this regime, Theorem 5.1 predicts that consistency (i.e., $E_{\mathrm{test}}(\hat w^{\mathrm{pred}}_n) \to 0$ as $T \to \infty$) is only possible if $\ell \le \ell^*$. First consider values of $\ell$ for which the clean variance $\sigma^2 T^{-(1 - \ell/\beta)}$ is less than the clean bias $T^{-2r\ell}$ in (27), i.e., $0 \le \ell \le \ell_{\mathrm{crit}}$. We get
$$E_{\mathrm{test}}(\hat w^{\mathrm{pred}}_n) \asymp T^{-2\ell r} + T^{-(b - a - \ell/\beta)},$$
which is minimized by taking $\ell = \min(\ell^*, \ell_{\mathrm{crit}})$.
For other values of $\ell$, the variance dominates, giving
$$E_{\mathrm{test}}(\hat w^{\mathrm{pred}}_n) \asymp T^{-(1 - \ell/\beta)} + T^{-(b - a - \ell/\beta)} \asymp T^{-(\gamma - \ell/\beta)}, \quad\text{where }\gamma := \min(1, b - a).$$
This is minimized by taking $\ell = \ell_{\mathrm{crit}}$, leading to
$$E_{\mathrm{test}}(\hat w^{\mathrm{pred}}_n) \asymp T^{-(\gamma - 1/(2\beta r + 1))}.$$
This tends to zero with $T$ only if $b > a + 1/(2\beta r + 1)$.

H Proof of Results for Power-Law Covariance Spectrum

H.1 Proof of Theorem 5.1

From Theorem F.1, we need to analyze the quantity
$$\rho \simeq \frac{\mathrm{df}_2(\kappa(\lambda))}{T_0 - d} + \frac{\kappa(\lambda)^2\operatorname{tr}(\Sigma + \kappa(\lambda)I_d)^{-2}}{T_0 - d}\cdot\frac{\mathrm{df}_2(\kappa(\lambda))}{T - \mathrm{df}_2(\kappa(\lambda))}. \qquad (52)$$
Now, for small $\lambda$, $\kappa := \kappa(\lambda)$ is small, and one can compute
$$\mathrm{df}_m(\kappa) = \sum_i\frac{\lambda_i^m}{(\lambda_i + \kappa)^m} = \kappa^{-m}\sum_i\frac{\lambda_i^m}{(1 + \kappa^{-1}\lambda_i)^m} \asymp \kappa^{-m}\,\kappa^{m - 1/\beta} = \kappa^{-1/\beta},$$
where we have used Lemma I.1 with $D = \kappa^{-1}$ and $n = m$ in the last step. On the other hand, we can use some of the results of Appendix A (Section 3) of [14] to do the following. It can be shown (see the aforementioned paper) that: if $\ell > \beta$, then $\kappa \asymp T^{-\beta}$, and so $\mathrm{df}_m(\kappa) \asymp T$ for all $m \ge 1$; if $\ell < \beta$, then $\kappa \asymp \lambda \asymp T^{-\ell}$, and so $\mathrm{df}_m(\kappa) \asymp T^{\ell/\beta} = o(T)$ for all $m \ge 1$.

For $\ell < \beta$, plugging this into (52) gives
$$\rho \asymp \frac{T^{\ell/\beta}}{T_0 - d} + \frac{d}{T_0 - d}\cdot\frac{T^{\ell/\beta}}{T - T^{\ell/\beta}} \asymp T_0^{-1}T^{\ell/\beta} + \frac{\phi_0}{1 - \phi_0}\,T^{-(1 - \ell/\beta)} \asymp \frac{1}{1 - \phi_0}\max(T/T_0, \phi_0)\,T^{-(1 - \ell/\beta)},$$
where $\phi_0 := d/T_0$. Combining our Theorem F.1 with (57), we get the claimed result.

H.2 Representation of Clean Test Error

We make a small digression to present the following curiosity: with a slight leap of faith, the main results of [14] can be obtained in a few lines from the tools developed in [4], namely via Proposition 4.2. This is significant because the computations in [14] were done via methods of statistical physics (the replica trick), while [4] is based on RMT. Indeed, for the regularization parameter $\lambda \asymp T^{-\ell}$ given in (25), we have $\kappa = \kappa(\lambda) \asymp \lambda$. Thus
$$\kappa \asymp T^{-\ell}, \qquad \mathrm{df}_2(\kappa) \asymp \kappa^{-1/\beta} \asymp T^{\ell/\beta}. \qquad (53)$$
Now, since $\lambda_i \asymp i^{-\beta}$ (capacity condition) and $(w_0^\top v_i)^2 = c_i^2 \asymp i^{-\delta}$ (source condition), we deduce
$$\kappa^2\,w_0^\top\Sigma(\Sigma + \kappa I)^{-2}w_0 = \kappa^2\sum_i\frac{c_i^2\lambda_i}{(\lambda_i + \kappa)^2} \asymp \kappa^2\sum_i\frac{\lambda_i^{1 + \delta/\beta}}{(\lambda_i + \kappa)^2} \asymp \kappa^\gamma \asymp T^{-\ell\gamma}, \qquad (54)$$
where $\gamma = \min(2, 1 + \delta/\beta - 1/\beta) = \min(2, 2r) = 2\underline r$, with $\underline r := \min(r, 1)$. The exponent is as claimed because $\delta = 1 + \beta(2r - 1)$, and so $\delta/\beta = 1/\beta + 2r - 1$ by construction. The estimation of the last sum in (54) is thanks to Lemma I.1, applied with $D = \kappa^{-1}$, $n = 1 + \delta/\beta$ and $m = 2$. Therefore, invoking Proposition 4.2 gives
$$\mathrm{Bias} \asymp \frac{\kappa^2\,w_0^\top\Sigma(\Sigma + \kappa I)^{-2}w_0}{1 - \mathrm{df}_2(\kappa)/T} \asymp \frac{T^{-\ell\gamma}}{1 - T^{-(1 - \ell/\beta)}} \asymp T^{-2\ell\underline r}, \qquad (55)$$
$$\mathrm{Var} \asymp \sigma^2\,\frac{\mathrm{df}_2(\kappa)}{T}\cdot\frac{1}{1 - \mathrm{df}_2(\kappa)/T} \asymp \sigma^2\,\frac{T^{\ell/\beta}}{T}\cdot\frac{1}{1 - o(1)} \asymp \sigma^2\,T^{-(1 - \ell/\beta)}. \qquad (56)$$
We deduce the scaling law
$$E_{\mathrm{test}} \asymp \mathrm{Bias} + \mathrm{Var} \asymp T^{-2\ell\underline r} + \sigma^2 T^{-(1 - \ell/\beta)} \asymp \max(\sigma^2, T^{1 - 2\ell\underline r - \ell/\beta})\,T^{-(1 - \ell/\beta)}, \qquad (57)$$
which is precisely the main result of [14].

Low-Noise Regime. In the low-noise regime where $\sigma^2 = O(T^{-2\beta\underline r})$, one may take $\ell = \beta$; the variance is then much smaller than the bias, and one has the fast rate
$$E_{\mathrm{test}} \asymp T^{-2\beta\underline r}. \qquad (58)$$

High-Noise Regime. Now consider the case where $\sigma^2 = \Theta(1)$. Setting $2\ell\underline r = 1 - \ell/\beta$ to balance the bias and variance gives $\ell = \ell_{\mathrm{crit}}$, where
$$\ell_{\mathrm{crit}} := \frac{\beta}{2\beta\underline r + 1} \in (0, \beta). \qquad (59)$$
With this value of the exponent $\ell$, we get the error rate
$$E_{\mathrm{test}} \asymp T^{-2\ell_{\mathrm{crit}}\underline r} = T^{-c}, \quad\text{with }c := \frac{2\beta\underline r}{2\beta\underline r + 1}, \qquad (60)$$
which is precisely the main result of [14], and is known to be minimax optimal (de Vito [11], etc.)!
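To make the crossover concrete, here is a tiny computation (ours, for illustration only) of $\ell_{\mathrm{crit}}$ from (59) and of the resulting high-noise rate exponent $c$ from (60) for the $(\beta, r)$ pairs used in our experiments; the numbers follow directly from the two formulas above.

```python
# Compute l_crit = beta / (2*beta*r + 1) and the high-noise rate exponent
# c = 2*beta*r / (2*beta*r + 1) from Eqns (59)-(60), with r truncated at 1.
for name, beta, r in [("simulated power-law", 2.0, 0.375),
                      ("MNIST, RBF kernel", 1.65, 0.097),
                      ("MNIST, polynomial kernel", 1.2, 0.15)]:
    r_ = min(r, 1.0)
    l_crit = beta / (2 * beta * r_ + 1)
    c = 2 * beta * r_ / (2 * beta * r_ + 1)
    print(f"{name}: l_crit = {l_crit:.3f}, clean rate T^(-{c:.3f})")
```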
I Auxiliary Lemmas

Lemma I.1. Let the sequence $(\lambda_k)_{k\ge 1}$ of positive numbers be such that $\lambda_k \asymp k^{-\beta}$ for some constant $\beta > 0$, and let $m, n \ge 0$ with $n\beta > 1$. Then, for $D \gg 1$, it holds that
$$\sum_k\frac{\lambda_k^n}{(1 + D\lambda_k)^m} \asymp D^{-c}\cdot\begin{cases}\log D, & \text{if }m = n - 1/\beta,\\ 1, & \text{else},\end{cases} \qquad (61)$$
where $c := \min(m, n - 1/\beta) \ge 0$.

Proof. First observe that
$$\frac{\lambda_k^n}{(1 + D\lambda_k)^m} \asymp \lambda_k^n\min(1, (D\lambda_k)^{-m}) = \begin{cases}\lambda_k^n = k^{-n\beta}, & \text{if }D\lambda_k < 1,\ \text{i.e., if }k > D^{1/\beta},\\ D^{-m}\lambda_k^{-(m - n)} = D^{-m}k^{(m - n)\beta}, & \text{else}.\end{cases}$$
We deduce that
$$\sum_k\frac{\lambda_k^n}{(1 + D\lambda_k)^m} \asymp D^{-m}\sum_{1\le k\le D^{1/\beta}}k^{(m - n)\beta} + \sum_{k > D^{1/\beta}}k^{-n\beta}. \qquad (62)$$
By comparing with the corresponding integral, one can write the first sum in (62) as
$$D^{-m}\sum_{1\le k\le D^{1/\beta}}k^{(m - n)\beta} \asymp D^{-m}\int_1^{D^{1/\beta}}u^{(m - n)\beta}\,\mathrm{d}u \asymp D^{-m}\cdot\begin{cases}(D^{1/\beta})^{1 + (m - n)\beta}, & \text{if }n - 1/\beta < m,\\ \log D, & \text{if }m = n - 1/\beta,\\ 1, & \text{else},\end{cases} = \begin{cases}D^{-(n - 1/\beta)}, & \text{if }n - 1/\beta < m,\\ D^{-m}\log D, & \text{if }m = n - 1/\beta,\\ D^{-m}, & \text{else},\end{cases}$$
$$= D^{-c}\cdot\begin{cases}\log D, & \text{if }m = n - 1/\beta,\\ 1, & \text{else},\end{cases}$$
where $c \ge 0$ is as given in the lemma. Analogously, one can write the second sum in (62) as
$$\sum_{k > D^{1/\beta}}k^{-n\beta} \asymp \int_{D^{1/\beta}}^\infty u^{-n\beta}\,\mathrm{d}u \asymp (D^{1/\beta})^{1 - n\beta} = D^{-(n - 1/\beta)},$$
and the result follows upon putting things together.

Lemma I.2. For $\kappa = \kappa(\lambda, T)$ defined as in (9), it holds that
$$\frac{\partial\kappa}{\partial\lambda} = \frac{1}{1 - \mathrm{df}_2(\kappa)/T} \ge 1. \qquad (63)$$
The formula given in the above lemma is useful because it can be combined with the identities
$$\mathrm{Bias} \simeq \kappa^2\,w_0^\top\Sigma(\Sigma + \kappa I)^{-2}w_0\cdot\frac{\partial\kappa}{\partial\lambda}, \qquad (64)$$
$$\mathrm{Var} \simeq \sigma^2\,\frac{\mathrm{df}_2(\kappa)}{T}\cdot\frac{\partial\kappa}{\partial\lambda}. \qquad (65)$$
The RHS of (64) is usually referred to as the omniscient risk (Hastie et al. [22], Cui et al. [13], Wei et al. [48]).

Proof of Lemma I.2. By definition of $\kappa$, we know that
$$\kappa - \lambda = \kappa\,\mathrm{df}_1(\kappa)/T = \kappa\operatorname{tr}\Sigma(\Sigma + \kappa I)^{-1}/T.$$
Differentiating this identity w.r.t. $\lambda$ gives
$$\frac{\partial\kappa}{\partial\lambda} - 1 = \frac{\partial\kappa}{\partial\lambda}\,\big(\operatorname{tr}\Sigma(\Sigma + \kappa I)^{-1} - \kappa\operatorname{tr}\Sigma(\Sigma + \kappa I)^{-2}\big)/T = \frac{\partial\kappa}{\partial\lambda}\operatorname{tr}\Sigma^2(\Sigma + \kappa I)^{-2}/T = \frac{\partial\kappa}{\partial\lambda}\,\mathrm{df}_2(\kappa)/T,$$
and the result follows upon rearranging. Note that we have used the identity $I - \kappa(\Sigma + \kappa I)^{-1} = \Sigma(\Sigma + \kappa I)^{-1}$ to rewrite $\Sigma(\Sigma + \kappa I)^{-1} - \kappa\Sigma(\Sigma + \kappa I)^{-2} = \Sigma^2(\Sigma + \kappa I)^{-2}$.

NeurIPS Paper Checklist

1. Claims
Question: Do the main claims made in the abstract and introduction accurately reflect the paper's contributions and scope?
Answer: [Yes]
Justification: We explicitly describe the setting, as well as the conclusions of our theory in various regimes.
Guidelines: The answer NA means that the abstract and introduction do not include the claims made in the paper. The abstract and/or introduction should clearly state the claims made, including the contributions made in the paper and important assumptions and limitations. A No or NA answer to this question will not be perceived well by the reviewers. The claims made should match theoretical and experimental results, and reflect how much the results can be expected to generalize to other settings. It is fine to include aspirational goals as motivation as long as it is clear that these goals are not attained by the paper.

2. Limitations
Question: Does the paper discuss the limitations of the work performed by the authors?
Answer: [Yes]
Justification: We clearly state that our theory applies to the case of kernel ridge regression. All conclusions for other models are by analogy only, as is always done when this type of model is studied.
Guidelines: The answer NA means that the paper has no limitations, while the answer No means that the paper has limitations, but those are not discussed in the paper. The authors are encouraged to create a separate "Limitations" section in their paper. The paper should point out any strong assumptions and how robust the results are to violations of these assumptions (e.g., independence assumptions, noiseless settings, model well-specification, asymptotic approximations only holding locally). The authors should reflect on how these assumptions might be violated in practice and what the implications would be. The authors should reflect on the scope of the claims made, e.g., if the approach was only tested on a few datasets or with a few runs. In general, empirical results often depend on implicit assumptions, which should be articulated. The authors should reflect on the factors that influence the performance of the approach. For example, a facial recognition algorithm may perform poorly when image resolution is low or images are taken in low lighting.
Or a speech-to-text system might not be used reliably to provide closed captions for online lectures because it fails to handle technical jargon. The authors should discuss the computational efficiency of the proposed algorithms and how they scale with dataset size. If applicable, the authors should discuss possible limitations of their approach to address problems of privacy and fairness. While the authors might fear that complete honesty about limitations might be used by reviewers as grounds for rejection, a worse outcome might be that reviewers discover limitations that aren't acknowledged in the paper. The authors should use their best judgment and recognize that individual actions in favor of transparency play an important role in developing norms that preserve the integrity of the community. Reviewers will be specifically instructed to not penalize honesty concerning limitations.

3. Theory Assumptions and Proofs
Question: For each theoretical result, does the paper provide the full set of assumptions and a complete (and correct) proof?
Answer: [Yes]
Justification: All assumptions are clearly stated in the main text, and all proofs can be found in the Appendix.
Guidelines: The answer NA means that the paper does not include theoretical results. All the theorems, formulas, and proofs in the paper should be numbered and cross-referenced. All assumptions should be clearly stated or referenced in the statement of any theorems. The proofs can either appear in the main paper or the supplemental material, but if they appear in the supplemental material, the authors are encouraged to provide a short proof sketch to provide intuition. Inversely, any informal proof provided in the core of the paper should be complemented by formal proofs provided in appendix or supplemental material. Theorems and Lemmas that the proof relies upon should be properly referenced.

4. Experimental Result Reproducibility
Question: Does the paper fully disclose all the information needed to reproduce the main experimental results of the paper to the extent that it affects the main claims and/or conclusions of the paper (regardless of whether the code and data are provided or not)?
Answer: [Yes]
Justification: We clearly give all experimental details, including parameters. These are sufficient to reproduce all our experiments.
Guidelines: The answer NA means that the paper does not include experiments. If the paper includes experiments, a No answer to this question will not be perceived well by the reviewers: Making the paper reproducible is important, regardless of whether the code and data are provided or not. If the contribution is a dataset and/or model, the authors should describe the steps taken to make their results reproducible or verifiable. Depending on the contribution, reproducibility can be accomplished in various ways. For example, if the contribution is a novel architecture, describing the architecture fully might suffice, or if the contribution is a specific model and empirical evaluation, it may be necessary to either make it possible for others to replicate the model with the same dataset, or provide access to the model. In general, releasing code and data is often one good way to accomplish this, but reproducibility can also be provided via detailed instructions for how to replicate the results, access to a hosted model (e.g., in the case of a large language model), releasing of a model checkpoint, or other means that are appropriate to the research performed.
While NeurIPS does not require releasing code, the conference does require all submissions to provide some reasonable avenue for reproducibility, which may depend on the nature of the contribution. For example: (a) If the contribution is primarily a new algorithm, the paper should make it clear how to reproduce that algorithm. (b) If the contribution is primarily a new model architecture, the paper should describe the architecture clearly and fully. (c) If the contribution is a new model (e.g., a large language model), then there should either be a way to access this model for reproducing the results or a way to reproduce the model (e.g., with an open-source dataset or instructions for how to construct the dataset). (d) We recognize that reproducibility may be tricky in some cases, in which case authors are welcome to describe the particular way they provide for reproducibility. In the case of closed-source models, it may be that access to the model is limited in some way (e.g., to registered users), but it should be possible for other researchers to have some path to reproducing or verifying the results.

5. Open access to data and code
Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material?
Answer: [No]
Justification: We only use one publicly available dataset, MNIST, and no idiosyncratic model. Thus, we provide neither dataset nor code, as the dataset is publicly available, and the experiments are easy to reproduce from their description.
Guidelines: The answer NA means that the paper does not include experiments requiring code. Please see the NeurIPS code and data submission guidelines (https://nips.cc/public/guides/CodeSubmissionPolicy) for more details. While we encourage the release of code and data, we understand that this might not be possible, so No is an acceptable answer. Papers cannot be rejected simply for not including code, unless this is central to the contribution (e.g., for a new open-source benchmark). The instructions should contain the exact command and environment needed to run to reproduce the results. See the NeurIPS code and data submission guidelines (https://nips.cc/public/guides/CodeSubmissionPolicy) for more details. The authors should provide instructions on data access and preparation, including how to access the raw data, preprocessed data, intermediate data, and generated data, etc. The authors should provide scripts to reproduce all experimental results for the new proposed method and baselines. If only a subset of experiments are reproducible, they should state which ones are omitted from the script and why. At submission time, to preserve anonymity, the authors should release anonymized versions (if applicable). Providing as much information as possible in supplemental material (appended to the paper) is recommended, but including URLs to data and code is permitted.

6. Experimental Setting/Details
Question: Does the paper specify all the training and test details (e.g., data splits, hyperparameters, how they were chosen, type of optimizer, etc.) necessary to understand the results?
Answer: [Yes]
Justification: We precisely describe all our (simple) linear or kernel methods in sufficient detail. They can easily be reproduced from these descriptions.
Guidelines: The answer NA means that the paper does not include experiments.
The experimental setting should be presented in the core of the paper to a level of detail that is necessary to appreciate the results and make sense of them. The full details can be provided either with the code, in appendix, or as supplemental material.

7. Experiment Statistical Significance
Question: Does the paper report error bars suitably and correctly defined or other appropriate information about the statistical significance of the experiments?
Answer: [Yes]
Justification: Error bars are given and we describe the number of trials over which they are obtained. We believe these are chosen suitably.
Guidelines: The answer NA means that the paper does not include experiments. The authors should answer "Yes" if the results are accompanied by error bars, confidence intervals, or statistical significance tests, at least for the experiments that support the main claims of the paper. The factors of variability that the error bars are capturing should be clearly stated (for example, train/test split, initialization, random drawing of some parameter, or overall run with given experimental conditions). The method for calculating the error bars should be explained (closed form formula, call to a library function, bootstrap, etc.). The assumptions made should be given (e.g., Normally distributed errors). It should be clear whether the error bar is the standard deviation or the standard error of the mean. It is OK to report 1-sigma error bars, but one should state it. The authors should preferably report a 2-sigma error bar rather than state that they have a 96% CI, if the hypothesis of Normality of errors is not verified. For asymmetric distributions, the authors should be careful not to show in tables or figures symmetric error bars that would yield results that are out of range (e.g., negative error rates). If error bars are reported in tables or plots, the authors should explain in the text how they were calculated and reference the corresponding figures or tables in the text.

8. Experiments Compute Resources
Question: For each experiment, does the paper provide sufficient information on the computer resources (type of compute workers, memory, time of execution) needed to reproduce the experiments?
Answer: [Yes]
Justification: Compared to most contributions, our experiments, which accompany our theory, do not require any significant compute whatsoever.
Guidelines: The answer NA means that the paper does not include experiments. The paper should indicate the type of compute workers (CPU or GPU), internal cluster, or cloud provider, including relevant memory and storage. The paper should provide the amount of compute required for each of the individual experimental runs as well as estimate the total compute. The paper should disclose whether the full research project required more compute than the experiments reported in the paper (e.g., preliminary or failed experiments that didn't make it into the paper).

9. Code Of Ethics
Question: Does the research conducted in the paper conform, in every respect, with the NeurIPS Code of Ethics (https://neurips.cc/public/EthicsGuidelines)?
Answer: [Yes]
Justification: Our research does not involve human participants, and never uses private data or models.
Guidelines: The answer NA means that the authors have not reviewed the NeurIPS Code of Ethics. If the authors answer No, they should explain the special circumstances that require a deviation from the Code of Ethics.
The authors should make sure to preserve anonymity (e.g., if there is a special consideration due to laws or regulations in their jurisdiction). 10. Broader Impacts Question: Does the paper discuss both potential positive societal impacts and negative societal impacts of the work performed? Answer: [Yes] Justification: Our paper provides toy theory that underlines the point that retraining on data generated from currently available models might lead to a vicious loop of model collapse, which might have societal impact. We call for attention to this matter. Guidelines: The answer NA means that there is no societal impact of the work performed. If the authors answer NA or No, they should explain why their work has no societal impact or why the paper does not address societal impact. Examples of negative societal impacts include potential malicious or unintended uses (e.g., disinformation, generating fake profiles, surveillance), fairness considerations (e.g., deployment of technologies that could make decisions that unfairly impact specific groups), privacy considerations, and security considerations. The conference expects that many papers will be foundational research and not tied to particular applications, let alone deployments. However, if there is a direct path to any negative applications, the authors should point it out. For example, it is legitimate to point out that an improvement in the quality of generative models could be used to generate deepfakes for disinformation. On the other hand, it is not needed to point out that a generic algorithm for optimizing neural networks could enable people to train models that generate Deepfakes faster. The authors should consider possible harms that could arise when the technology is being used as intended and functioning correctly, harms that could arise when the technology is being used as intended but gives incorrect results, and harms following from (intentional or unintentional) misuse of the technology. If there are negative societal impacts, the authors could also discuss possible mitigation strategies (e.g., gated release of models, providing defenses in addition to attacks, mechanisms for monitoring misuse, mechanisms to monitor how a system learns from feedback over time, improving the efficiency and accessibility of ML). 11. Safeguards Question: Does the paper describe safeguards that have been put in place for responsible release of data or models that have a high risk for misuse (e.g., pretrained language models, image generators, or scraped datasets)? Answer: [NA] Justification: We do not release any models or new data. Guidelines: The answer NA means that the paper poses no such risks. Released models that have a high risk for misuse or dual-use should be released with necessary safeguards to allow for controlled use of the model, for example by requiring that users adhere to usage guidelines or restrictions to access the model or implementing safety filters. Datasets that have been scraped from the Internet could pose safety risks. The authors should describe how they avoided releasing unsafe images. We recognize that providing effective safeguards is challenging, and many papers do not require this, but we encourage authors to take this into account and make a best faith effort. 12. Licenses for existing assets Question: Are the creators or original owners of assets (e.g., code, data, models), used in the paper, properly credited and are the license and terms of use explicitly mentioned and properly respected? 
Answer: [Yes]
Justification: The only public dataset we use is MNIST [16], which we properly credit. The MNIST dataset is made available under the terms of the Creative Commons Attribution Share Alike 3.0 license.
Guidelines: The answer NA means that the paper does not use existing assets. The authors should cite the original paper that produced the code package or dataset. The authors should state which version of the asset is used and, if possible, include a URL. The name of the license (e.g., CC-BY 4.0) should be included for each asset. For scraped data from a particular source (e.g., website), the copyright and terms of service of that source should be provided. If assets are released, the license, copyright information, and terms of use in the package should be provided. For popular datasets, paperswithcode.com/datasets has curated licenses for some datasets. Their licensing guide can help determine the license of a dataset. For existing datasets that are re-packaged, both the original license and the license of the derived asset (if it has changed) should be provided. If this information is not available online, the authors are encouraged to reach out to the asset's creators.

13. New Assets
Question: Are new assets introduced in the paper well documented and is the documentation provided alongside the assets?
Answer: [NA]
Justification: We do not release any new assets.
Guidelines: The answer NA means that the paper does not release new assets. Researchers should communicate the details of the dataset/code/model as part of their submissions via structured templates. This includes details about training, license, limitations, etc. The paper should discuss whether and how consent was obtained from people whose asset is used. At submission time, remember to anonymize your assets (if applicable). You can either create an anonymized URL or include an anonymized zip file.

14. Crowdsourcing and Research with Human Subjects
Question: For crowdsourcing experiments and research with human subjects, does the paper include the full text of instructions given to participants and screenshots, if applicable, as well as details about compensation (if any)?
Answer: [NA]
Justification: Our work does not involve crowdsourcing or research with human subjects.
Guidelines: The answer NA means that the paper does not involve crowdsourcing nor research with human subjects. Including this information in the supplemental material is fine, but if the main contribution of the paper involves human subjects, then as much detail as possible should be included in the main paper. According to the NeurIPS Code of Ethics, workers involved in data collection, curation, or other labor should be paid at least the minimum wage in the country of the data collector.

15. Institutional Review Board (IRB) Approvals or Equivalent for Research with Human Subjects
Question: Does the paper describe potential risks incurred by study participants, whether such risks were disclosed to the subjects, and whether Institutional Review Board (IRB) approvals (or an equivalent approval/review based on the requirements of your country or institution) were obtained?
Answer: [NA]
Justification: Our work does not involve crowdsourcing or research with human subjects.
Guidelines: The answer NA means that the paper does not involve crowdsourcing nor research with human subjects. Depending on the country in which research is conducted, IRB approval (or equivalent) may be required for any human subjects research.
If you obtained IRB approval, you should clearly state this in the paper. We recognize that the procedures for this may vary significantly between institutions and locations, and we expect authors to adhere to the NeurIPS Code of Ethics and the guidelines for their institution. For initial submissions, do not include any information that would break anonymity (if applicable), such as the institution conducting the review.