Published as a conference paper at ICLR 2024

SELF-CONSUMING GENERATIVE MODELS GO MAD

Sina Alemohammad, Josue Casco-Rodriguez, Lorenzo Luzi, Ahmed Imtiaz Humayun, Hossein Babaei, Daniel LeJeune, Ali Siahkoohi, Richard G. Baraniuk
Department of ECE, Rice University; Department of Statistics, Stanford University; Department of CMOR, Rice University

Seismic advances in generative AI algorithms for imagery, text, and other data types have led to the temptation to use AI-synthesized data to train next-generation models. Repeating this process creates an autophagous ("self-consuming") loop whose properties are poorly understood. We conduct a thorough analytical and empirical analysis, using state-of-the-art generative image models, of three families of autophagous loops that differ in how fixed or fresh real training data is available through the generations of training and in whether the samples from previous-generation models have been biased to trade off data quality versus diversity. Our primary conclusion across all scenarios is that without enough fresh real data in each generation of an autophagous loop, future generative models are doomed to have their quality (precision) or diversity (recall) progressively decrease. We term this condition Model Autophagy Disorder (MAD), by analogy to mad cow disease, and show that appreciable MADness arises in just a few generations.

Figure 1 (generations t = 1, 3, 5, 7, 9): Training generative artificial intelligence (AI) models on synthetic data progressively amplifies artifacts. As AI-synthesized data proliferates in standard datasets and on the Internet, future AI models will train on both real and synthetic data, forming autophagous ("self-consuming") loops. We highlight a potential unintended consequence of autophagous training. We trained a sequence of StyleGAN2 (Karras et al., 2019a) models wherein the model at generation t ≥ 2 trains only on data synthesized by the model at generation t − 1. This forms a fully synthetic loop (Figure 3) without sampling bias (λ = 1). Note how the cross-hatched artifacts (possibly an architectural fingerprint (Karras et al., 2021)) are progressively amplified at each generation. Appendix D has more samples.

1 INTRODUCTION

Synthetic data[1] from generative artificial intelligence (AI) models like Stable Diffusion (Rombach et al., 2022) and ChatGPT (OpenAI, 2023) is rapidly proliferating on the Internet. Indeed, there will soon be much more synthetic data than real data on the Internet.

Equal contribution.
[1] By synthetic we mean AI-synthesized data, as opposed to data synthesized via physics-based simulations.

Figure 2: Today's large-scale image training datasets contain AI-generated data. Datasets like LAION-5B (Schuhmann et al., 2022), which trains popular models like Stable Diffusion (Rombach et al., 2022), contain AI-synthesized images. Here are LAION-5B (haveibeentrained.com) samples containing data synthesized by (left to right) AICAN (Elgammal et al., 2017), Pix2Pix (Isola et al., 2017), StyleGAN (Karras et al., 2019a), and DALL-E (Ramesh et al., 2021). Generative models using LAION-5B thus close an autophagous ("self-consuming") loop (see Figure 3) that can progressively amplify artifacts (Figure 1) and lower quality (precision) and diversity (recall).
Since the training datasets for generative AI models tend to be sourced from the Internet, today's AI models are unwittingly being trained on increasing amounts of AI-synthesized data. Figure 2 confirms that the popular LAION-5B dataset (Schuhmann et al., 2022), used to train state-of-the-art models like Stable Diffusion, contains synthetic images from several earlier generations of generative models. AI is generating formerly human-sourced data, like reviews (Gault, 2023), websites (Cantor, 2023), and data annotations (Veselovsky et al., 2023), often with no indication of its origin (Christian, 2023). As the use of generative models continues to grow rapidly, this situation will only accelerate.

Moreover, throwing caution to the wind, AI-synthesized data is increasingly used by choice for training because it is convenient, especially in data-scarce applications like medicine (Pinaya et al., 2022) and geophysics (Deng et al., 2022), and because it can protect privacy (Luzi et al., 2024; Klemp et al., 2023) in sensitive-data applications like medicine (Packhäuser et al., 2022; DuMont Schütte et al., 2021). Most importantly, as deep learning models become increasingly enormous (Azizi et al., 2023; Burg et al., 2023), we are simply running out of real data on which to train them (Economist, 2023a;b; Villalobos et al., 2022).

The witting or unwitting use of synthetic data to train generative models departs from standard AI training practice in one important respect: repeating this process for generation after generation of models forms an autophagous ("self-consuming") loop (Figure 3). Different autophagous loop variations arise depending on how existing real and synthetic data are combined into future training sets. Additional variations arise depending on the model sampling biases used to trade off perceptual quality (fidelity or coherence) versus diversity (variety or heterogeneity).[2]

The potential ramifications of autophagous loops on the properties and performance of generative models are poorly understood. In one direction, autophagy might progressively amplify the biases and artifacts present in any generative model as fingerprints. In Figure 1, autophagy progressively amplifies cross-hatching artifacts (reminiscent of aliasing (Karras et al., 2021)) in subsequent generations of StyleGAN2 models. In another direction, autophagous loops featuring generative models tuned to produce high-quality syntheses at the expense of diversity (such as Karras et al. (2019a); Ho and Salimans (2021)) might progressively dilute the diversity of the data on the Internet.[3] By analogy to mad cow disease (Nathanson et al., 1997), we term these and other symptoms of autophagy Model Autophagy Disorder (MAD).

Contributions. We conduct a careful theoretical and empirical study of AI autophagy with generative image models. The concepts developed herein apply to any data type, including text. We also unify the results of contemporaneous work. Our three key contributions establish that, without enough fresh real data each generation, future generative models are doomed to go MAD. Moreover, we demonstrate that appreciable MADness can occur in only a handful of generations.

1. Realistic models for autophagous loops. We propose three families of self-consuming loops that realistically model how real and synthetic data are used to train generative models (recall Figure 3):

[2] We quantify quality and diversity via precision and recall, respectively (Kynkäänniemi et al., 2019).
[3] Similar to diversity exposure in recommender systems, where maximizing click rates discourages exposure to diverse ideas (Stroud, 2011; Dylko et al., 2017; Beam, 2014; Bakshy et al., 2015; O'Callaghan et al., 2015).

Figure 3: Recursively training generative models on synthetic data from other models produces an autophagous ("self-consuming") loop. (The figure depicts a training dataset, assembled from fixed real data, fresh real data, and/or synthetic data, that feeds the generative model(s) whose samples feed back into future training datasets.) In this paper, we study three autophagous loop variants (defined in Section 2): the fully synthetic loop (only synthetic data), the synthetic augmentation loop (synthetic + fixed real data), and the fresh data loop (synthetic + fresh real data). Each generation samples with a bias λ that trades off sample quality versus diversity.

The fully synthetic loop (Section 3), where the training dataset for each generation's model consists solely of synthetic data sampled from previous generations' models, such as by training a generative model on its own outputs (followfox.ai, 2023). We show that either the quality (precision) or the diversity (recall) of the generative models decreases over generations.

The synthetic augmentation loop (Section 4), where each generation's training data includes syntheses from previous generations and a fixed set of real data, such as by training on real and self-generated data (Huang et al., 2022). We show that fixed real training data only delays the inevitable degradation of the quality or diversity of the generative models over generations.

The fresh data loop (Section 5), where each generation's training data includes syntheses from previous generations plus some fresh real data, which models the common practice of scraping the Internet for training data, a process that finds both real and synthetic data (recall Figure 2). We are the first to propose and study this flavor of autophagy. We show that, with enough fresh real data, model convergence is independent of initialization, with quality and diversity that do not degrade over generations.

2. Sampling bias plays a key rôle in autophagous loops. Practitioners often favor high-quality syntheses, whether through curation or automatic quality-diversity tradeoffs (Ho and Salimans, 2021; Karras et al., 2020). We show that, without these sampling biases, MADness degrades quality and diversity, while with them, quality can be maintained but diversity degrades even faster.

3. Autophagous loop behaviors hold across a wide range of generative models and datasets, including Gaussian, Gaussian mixture, diffusion (DDPM, Ho et al., 2020), StyleGAN2 (Karras et al., 2020), Wasserstein GAN (WGAN, Gulrajani et al., 2017a), and Normalizing Flow (Kobyzev et al., 2020) models trained on datasets like FFHQ (Karras et al., 2019b) and MNIST (Deng, 2012).

Related work. We define a cohesive autophagy framework, supported empirically by state-of-the-art models, that unifies and significantly extends contemporaneous results that consider fragmented aspects of MADness. Shumailov et al. (2023) study autophagous loops without sampling bias and show that MADness ensues from variational autoencoders and Gaussian mixture models in fully synthetic loops and from language models in synthetic augmentation loops. However, the absence of quality-diversity tradeoffs (sampling biases) in their models limits the applicability of their findings to real-world scenarios. Furthermore, in each generation they only fine-tune their language models, while we train our models from scratch. Martínez et al.
(2023a) also consider unbiased synthetic augmentation loops, but only show qualitative evidence of MADness on a small dataset. Martínez et al. (2023b) focus only on fully synthetic loops and report that sampling bias can prevent degradation of image quality in small datasets. Finally, Huang et al. (2022); Hataya et al. (2022) and others have considered synthetic data augmentation, but not in the context of autophagous loops.

2 SELF-CONSUMING GENERATIVE MODELS

Consider a sequence of generative models (G_t)_{t∈ℕ}, where each model approximates a reference probability distribution P_r. At each generation t ∈ ℕ, the model G_t trains from scratch on the dataset D^t = (D_r^t, D_s^t) containing both n_r^t real samples D_r^t from P_r and n_s^t synthetic samples D_s^t from trained generative model(s). The first-generation model G_1 trains only on real data: n_s^1 = 0, D_s^1 = ∅.

Definition. An autophagous generative process is a sequence of distributions (G_t)_{t∈ℕ} where each generative model G_t is trained on data that includes samples from previous models (G_τ)_{τ=1}^{t−1}.

Definition. Let dist(·, ·) denote a distance metric on distributions. A MAD generative process is a sequence of distributions (G_t)_{t∈ℕ} such that E[dist(G_t, P_r)] grows with t.

Claim. Under mild conditions, an autophagous generative process is a MAD generative process.

Two critical aspects affect whether a sequence of generative models goes MAD: the balance of real and synthetic training data and how the generative models synthesize data. We study three realistic autophagous mechanisms, each of which includes synthetic data and potentially real data in a feedback loop (recall Section 1 and Figure 3); a schematic sketch of how each variant assembles its training data appears after the definition of sampling bias in Section 2.1:

The fully synthetic loop: Each model G_t for t ≥ 2 trains exclusively on synthetic data sampled from models (G_τ)_{τ=1}^{t−1} from previous generations, i.e., D^t = D_s^t.

The synthetic augmentation loop: Each model G_t for t ≥ 2 trains on a dataset D^t = (D_r, D_s^t): a fixed set of real data D_r from P_r, plus synthetic data D_s^t from previous generations' models.

The fresh data loop: Each model G_t for t ≥ 2 trains on a dataset D^t = (D_r^t, D_s^t): a fresh set of real data D_r^t drawn from P_r, plus synthetic data D_s^t from previous generations' models.

Metrics for MADness. Throughout this paper we measure the distance between the synthetic data and the real data (reference distribution) using the Fréchet inception distance (FID) (Heusel et al., 2017),[4] the quality of the synthetic data using precision, and the diversity of the synthetic data using recall (Kynkäänniemi et al., 2019). See Appendix A.4 for more details.

2.1 BIASED SAMPLING IN AUTOPHAGOUS LOOPS

While the above three autophagous loops realistically mimic real-world generative model training scenarios that involve synthetic data, it is also critical to consider how each generation's synthetic data is produced in practice. In particular, most syntheses are to some degree biased to maximize perceptual quality, whether through manual curation ("cherry-picking") or common techniques that automatically boost quality and sacrifice diversity by sampling closer to the modes of the generative model's synthetic distribution (OpenAI, 2023; Ho and Salimans, 2021; Karras et al., 2020; Brock et al., 2019; Humayun et al., 2022). We refer to this common practice as sampling bias.
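To make the three loop variants concrete, the following Python sketch shows how one generation's training set could be assembled. The functions `train_model` and `sample` are hypothetical placeholders for any of the model families we study, `real_pool` is assumed to be a NumPy array of real examples, and the dataset handling is purely illustrative rather than the code used in our experiments.

```python
import numpy as np

def autophagous_generation(loop, real_pool, prev_model, n_r, n_s, lam,
                           train_model, sample, rng=np.random.default_rng(0)):
    """One generation t >= 2 of an autophagous loop (Section 2).
    `train_model(data)` and `sample(model, n, lam)` are placeholders for any
    generative model family; `lam` is the sampling bias (lam = 1 is unbiased)."""
    synthetic = sample(prev_model, n_s, lam)           # D_s^t (here: previous model only)
    if loop == "fully_synthetic":                      # D^t = D_s^t
        data = synthetic
    elif loop == "synthetic_augmentation":             # D^t = (D_r, D_s^t), fixed real data
        data = np.concatenate([real_pool[:n_r], synthetic])
    elif loop == "fresh":                              # D^t = (D_r^t, D_s^t), fresh real data
        fresh = rng.choice(real_pool, size=n_r, replace=False)
        data = np.concatenate([fresh, synthetic])
    else:
        raise ValueError(f"unknown loop type: {loop}")
    return train_model(data)
```

Iterating this step for t = 2, 3, ... (with generation 1 trained on real data only) reproduces the three feedback structures of Figure 3.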
We employ a number of generative models in our experiments below; each has a unique controllable parameter that increases sample quality. We unify these parameters in the universal sampling bias parameter λ ∈ [0, 1], where λ = 1 corresponds to unbiased sampling and λ = 0 corresponds to sampling from the modes of the generative distribution G_t with zero variance. The exact interpretation of λ differs across models, but in general synthetic sample quality increases and diversity decreases as λ is decreased from 1. Below we provide specific definitions of λ for the various generative models we consider in this paper:

Gaussian models: To implement biased sampling from an estimated distribution N(µ, Σ), we sample from N(µ, λΣ). As λ decreases, we draw samples closer to the mean µ.

Generative adversarial networks: In our StyleGAN2 experiments, we decrease the truncation parameter Ψ ∈ [0, 1] to increase sample quality (Karras et al., 2020). Thus, λ = Ψ.

Diffusion models: For DDPMs, we use a classifier-free diffusion guidance factor w (with 10% conditioning dropout) (Ho and Salimans, 2021) and define λ = 1/(1 + w).

3 THE FULLY SYNTHETIC LOOP: TRAINING EXCLUSIVELY ON SYNTHETIC DATA LEADS TO MADNESS

First, we analyze the fully synthetic loop, where each model trains on synthetic data from previous generations. We focus on the inter-generational propagation of non-idealities resulting from estimation errors and sampling biases, and we characterize the convergence of the autophagous loop. The fully synthetic loop's simplicity primarily reflects niche examples like training generative models on their own high-quality outputs (followfox.ai, 2023). Nevertheless, this loop represents a worst-case scenario that provides insights into the more practical autophagous loops discussed in subsequent sections. Our analysis and experiments support our main conclusion for the fully synthetic loop: either the quality (precision) or the diversity (recall) of the synthetic data deteriorates over generations.

[4] We calculate MNIST FIDs via LeNet (Lecun et al., 1998) features instead of Inception features.

Figure 4 (fully synthetic loops: FFHQ-StyleGAN2 with λ = 1 and λ = 0.7, MNIST-DDPM with λ = 1 and λ = 0.5; FID, precision, and recall plotted over generations 1-5): Training generative models in a fully synthetic loop reduces both the quality and diversity of synthetic data, depending on sampling bias. We plot the FID, precision (quality), and recall (diversity) of the synthetic FFHQ and MNIST images from a fully synthetic loop with unbiased (λ = 1) and biased (λ < 1) StyleGAN2 and DDPM models. See Figure 1 for StyleGAN2 samples demonstrating that the fully synthetic loop amplifies sample artifacts. In all cases, FID increases and diversity decreases. However, sampling bias can salvage quality (at the expense of diversity).

Gaussian fully synthetic loop: random walks and variance collapse. We first show that these loops have a martingale nature that causes MADness. Consider a reference distribution P_r = N(µ_0, Σ_0), where µ_0 ∈ ℝ^d and Σ_0 ∈ ℝ^{d×d}, and a Gaussian generative process G_t = N(µ_t, Σ_t). At each time t ∈ ℕ, we sample n_s vectors from G_{t−1} with bias λ ≤ 1; i.e., we draw x_t^1, ..., x_t^{n_s} i.i.d. from N(µ_{t−1}, λΣ_{t−1}). From these vectors we construct the unbiased parameters of the next model G_t:

    µ_t = (1/n_s) · Σ_{i=1}^{n_s} x_t^i,    Σ_t = (1/(n_s − 1)) · Σ_{i=1}^{n_s} (x_t^i − µ_t)(x_t^i − µ_t)^⊤.    (1)

It is straightforward to see that µ_t and Σ_t are (super)martingale processes (Williams, 1991) that take random walks; the simulation sketch below illustrates this behavior.
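The martingale behavior is easy to reproduce numerically. Below is a minimal NumPy sketch of the Gaussian fully synthetic loop of Equation (1); the dimension, sample count, and number of generations are illustrative choices, not the settings used in our experiments.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_s, T, lam = 2, 1000, 200, 1.0          # illustrative settings
mu, Sigma = np.zeros(d), np.eye(d)          # generation-1 model fit to the reference N(0, I)

for t in range(2, T + 1):
    # Biased sampling from the previous model: x ~ N(mu_{t-1}, lam * Sigma_{t-1})
    x = rng.multivariate_normal(mu, lam * Sigma, size=n_s)
    # Refit with the unbiased estimators of Equation (1)
    mu = x.mean(axis=0)
    Sigma = np.cov(x, rowvar=False)          # uses the 1/(n_s - 1) normalization

print("drift of the mean  ||mu_T - mu_0||:", np.linalg.norm(mu))
print("variance collapse   tr(Sigma_T)   :", np.trace(Sigma))   # tends toward 0 as T grows
```

Even with λ = 1, tr(Σ_t) drifts toward zero while µ_t wanders away from µ_0; choosing λ < 1 collapses the variance faster and halts the random walk sooner.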
For Σ_t, we also have the following result, which is proved in Appendix B.

Proposition. For the random process defined in Equation (1), for any λ ≤ 1, we have Σ_t → 0 almost surely.

That is, when we repeatedly fit a distribution to data sampled from that distribution, we should not only expect some modal drift because of the random walk in µ_t (reduction in quality) but also, inevitably, a collapse of the variance Σ_t (vanishing of diversity). The key takeaway is that these effects, the random walk and the variance collapse, are solely due to the estimation error of fitting the model parameters using random data. Importantly, this result holds true even when there is no sampling bias (λ = 1).

The magnitudes of the steps of the random walk in µ_t are determined by two main factors: the number of samples n_s and the covariance Σ_t. Unsurprisingly, the larger n_s, the smaller the steps of the random walk, since there will be less estimation error. This will also slow the convergence of Σ_t to 0. Meanwhile, Σ_t can be controlled using a sampling bias factor λ < 1. The smaller the choice of λ, the more rapidly Σ_t converges to 0, stopping the random walk of µ_t (as illustrated in Figure 15). Thus, the sampling bias factor λ provides a trade-off that preserves quality at the expense of diversity. Shumailov et al. (2023) recently showed that the expected Wasserstein-2 distributional distance E[dist(G_t, P_r)] increases in this process, supporting our conclusion that G_t is a MAD generative process.

We now empirically study the fully synthetic loop using FFHQ-trained StyleGAN2 and MNIST-trained DDPM models; see Appendix A.1 for the experimental details.

Unbiased sampling degrades synthetic data quality and diversity. Figure 4 plots the FID, precision, and recall for FFHQ-StyleGAN2 and MNIST-DDPM models in fully synthetic loops with (λ < 1) and without (λ = 1) sampling bias. In the latter case, the synthetic data distributions undergo random walks that deviate from the reference distribution because each generation's training data is finite. Consequently, the models go MAD: FID increases, while precision and recall steadily decrease.

Figure 5 (generations t = 1, 3, 5): Training generative models on biased synthetic data in a fully synthetic loop progressively loses diversity. We repeat the experiment from Figures 1 and 4 but with sampling bias λ = 0.7. The randomly selected syntheses clearly lose diversity. Appendix E displays additional samples.

Figure 6 (generations t = 3 and t = 10, for λ = 1 and λ = 0.8): In the fully synthetic loop, unbiased sampling loses quality, while biased sampling loses diversity. 2D UMAP projections of 784-dimensional MNIST-DDPM samples from fully synthetic loops without (left, λ = 1) and with (right, λ < 1) sampling bias. Without sampling bias, the synthetic digits become so unrealistic (low-quality) that they are easily distinguishable from real digits, while with sampling bias, the digits remain realistic but progressively lose diversity. See Appendix F for the synthesized samples.

Biased sampling can boost synthetic data quality, but at the expense of diversity. As for the biased FFHQ-StyleGAN2 and MNIST-DDPM models (λ = 0.7 and 0.5) in fully synthetic loops
Synthetic mode behavior depends on the sampling bias. To visualize MAD generative processes, in Figure 6 we reduced the dimensionality of the real and synthetic MNIST-DDPM fully synthetic loop samples from Figure 4 via Uniform Manifold Approximation and Projection (UMAP) (Mc Innes et al., 2018). With unbiased sampling, the ten modes of the synthetic distribution (one for each digit) progressively drift away from the real distribution modes, despite originating from a conditional model, and eventually merge into one large cluster. By generation t = 10, the synthetic digits are illegible (Figure 24 in Appendix F). In sharp contrast to the unbiased case, UMAP reveals that biased sampling successfully keeps syntheses on the real data manifold (high precision), but contracts the synthetic support around a single set of ten digits (zero recall). Appendix C confirms these trends for Gaussian mixtures, WGANs, and Normalizing Flows. 4 THE SYNTHETIC AUGMENTATION LOOP: FIXED REAL TRAINING DATA CAN DELAY BUT NOT PREVENT MADNESS While analysis of the fully synthetic loop is straightforward, practitioners will use real data when it is available. We now explore the synthetic augmentation loop, where a fixed real dataset is augmented with autophagous synthetic data. Synthetic data augmentation can improve classification (Luzi et al., 2024; Burg et al., 2023), but the impact of autophagous data augmentation is unclear does increasing training data volume enhance synthesis, even if the added samples stray from reality? We find that, in the synthetic augmentation loop, fixed real training data only delays the inevitable degradation of the quality or diversity of synthetic data over generations. Published as a conference paper at ICLR 2024 Generations Generations Generations FFHQ-Style GAN2, λ = 1: Fully synthetic loop Synthetic augmentation loop Figure 7: Training generative models in a synthetic augmentation loop without sampling bias reduces synthetic quality and diversity, albeit more slowly than in the fully synthetic loop. We plot the FID, precision (quality), and recall (diversity) of FFHQ-Style GAN2 syntheses from a synthetic augmentation loop, wherein generative models train on both synthetic and real data, and a fully synthetic loop from Figure 4 for comparison. Both loops have no sampling bias (λ = 1). Qualitative examples (Appendix G) show the same artifacts as in Figure 1, albeit less prominently. Generations Generations Generations MNIST-DDPM in a synthetic augmentation loop: λ = 1 λ = 0.8 λ = 0.66 λ = 0.5 Figure 8: When incorporating real data in the synthetic augmentation loop, sampling bias still affects MADness. We plot the FID, precision (quality), and recall (diversity) of MNIST-DDPM images synthesized in synthetic augmentation loops with different sampling biases λ. All three metrics exhibit the same, albeit less pronounced, behavior as in the biased fully synthetic loops depicted in Figure 4. Keeping the original real dataset in the synthetic augmentation loop only slows MADness. Figure 7 shows that keeping the full FFHQ dataset in a Style GAN25 synthetic augmentation loop still produces the same symptoms (albeit more slowly) as the fully synthetic loop: the distance from the real dataset (FID) increases, while the quality (precision) and diversity (recall) of synthetic samples still decrease without sampling bias. (See Appendix A.2 for the experimental details.) In fact, in Appendix G we see the same artifacts as in Figure 1 and Appendix D. 
Additionally, sampling bias λ impacts MNIST-DDPM synthetic augmentation loops (Figure 8) in the same way it impacts fully synthetic loops: FID still increases, but λ < 1 can increase quality (precision) in exchange for diversity (recall). Additional synthetic augmentation loop experiments can be found in Appendix H.

5 THE FRESH DATA LOOP: FRESH REAL DATA CAN PREVENT MADNESS

Our most elaborate autophagous loop model obtains training data from two sources: unseen (fresh) real data and synthetic data from previously trained models. A clear instance of this is the LAION-5B dataset (Schuhmann et al., 2022), which contains both real and AI-synthesized images from the Internet (Figure 2). We seek to understand how generative models evolve in the fresh data loop, which alters the synthetic augmentation loop by incorporating fresh (instead of fixed) real samples at each iteration. We imagine that a fraction p ∈ (0, 1) of a corpus of data (e.g., the Internet) is real, and the remainder 1 − p is synthetic. Independently sampling n^t data points from this corpus yields n_r^t = p·n^t real and n_s^t = (1 − p)·n^t synthetic data points to train the t-th generation model.

[5] Unique to our StyleGAN2 synthetic augmentation loop, we linearly grow a pool of synthetic data to assess whether access to all previous generations' synthetic data could help future generations learn (see Appendix A.2).

Figure 9 (left: Gaussian fresh data loop with n_ini ∈ {100, 1000} and λ ∈ {1, 0.8}; right: MNIST-DDPM fresh data loop with n_ini ∈ {2000, 3000} and λ = 1; Wasserstein distance and FID over generations): In a fresh data loop, generative models converge to a state independent of the initial generative model. We plot the Wasserstein distance (WD) and FID of two fresh data loop models: a Gaussian with n_r = 100, n_s = 900 (left) and an MNIST-DDPM with n_r = n_s = 2000 (right). We simulate the former with both unbiased and biased sampling. Across all models, we see that the asymptotic WD and FID are independent of the initial real sample count n_ini and the initial WD or FID.

Figure 10: In a fresh data loop, the admissible amount of synthetic data does not increase with the amount of real data. As the real data count n_r increases, the synthetic data count n_s for which n_e ≥ n_r (green area) converges. Synthetic data is only likely to be helpful for small n_r.

The fresh data loop reveals two intriguing phenomena. First, asymptotic performance converges independently of initial performance, depending only on the ratio of real-to-synthetic training data. Second, limited amounts of synthetic data can actually improve performance in the fresh data loop, since synthetic data propagates information from previously seen real data and thus increases the effective dataset size, but too much synthetic data can still cause MADness. Overall, our fresh data loop analyses and experiments establish that, with enough fresh real data, the quality and diversity of synthetic data do not degrade over generations.

Initial models will eventually be forgotten in the fresh data loop. First we show that the initial model does not affect the behavior of the fresh data loop. We train the initial model on n_ini real samples and subsequent models with n_r new real and n_s synthetic (with bias λ) samples from the previous model; see Appendix A.3 for the details. Interestingly, for both the Gaussian and MNIST-DDPM models, the Wasserstein distance and FID converge independently of n_ini after a few iterations (Figure 9).
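The convergence in Figure 9's Gaussian panel can be sketched directly in NumPy: the closed-form Wasserstein-2 distance between a fitted Gaussian and the reference N(0, I) stands in for the plotted WD curves. The settings below mirror the spirit of that experiment (d = 100, n_r = 100, n_s = 900) but are otherwise illustrative.

```python
import numpy as np

def w2_squared_to_standard_normal(mu, Sigma):
    """Squared Wasserstein-2 distance between N(mu, Sigma) and the reference N(0, I)."""
    eigvals = np.clip(np.linalg.eigvalsh(Sigma), 0, None)
    return float(mu @ mu + eigvals.sum() + len(mu) - 2 * np.sqrt(eigvals).sum())

def fresh_data_loop(n_ini, n_r, n_s, lam, T, d=100, seed=0):
    rng = np.random.default_rng(seed)
    x = rng.standard_normal((n_ini, d))                 # generation 1: real data only
    mu, Sigma = x.mean(axis=0), np.cov(x, rowvar=False)
    history = [w2_squared_to_standard_normal(mu, Sigma)]
    for t in range(2, T + 1):
        real = rng.standard_normal((n_r, d))            # fresh real data from P_r = N(0, I)
        synth = rng.multivariate_normal(mu, lam * Sigma, size=n_s, check_valid="ignore")
        x = np.vstack([real, synth])
        mu, Sigma = x.mean(axis=0), np.cov(x, rowvar=False)
        history.append(w2_squared_to_standard_normal(mu, Sigma))
    return history

# Different initial dataset sizes settle near the same asymptotic distance.
for n_ini in (100, 1000):
    print(n_ini, fresh_data_loop(n_ini, n_r=100, n_s=900, lam=1.0, T=15)[-1])
```

Both initializations should approach roughly the same limiting distance within a handful of generations, consistent with Figure 9.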
In other words, we observe that models in a fresh data loop converge to a limit point that depends on n_r, n_s, and λ, but not on the initial model G_1 or its dataset size n_ini:

    lim_{t→∞} E[dist(G_t, P_r)] =: WD(n_r, n_s, λ).    (2)

For autophagy, this brings some hope: with fresh real data at each generation, E[dist(G_t, P_r)] does not necessarily increase with t. In other words, a fresh data loop does not necessarily go MAD.

The fresh data loop exhibits a phase transition. Since fresh real data can mitigate MADness, one might suspect that synthetic data can provoke fresh data loop MADness. However, the truth is that modest amounts of synthetic data in fresh data loops can actually boost performance; only when the amount of synthetic data exceeds a critical threshold do models suffer. We formalize this observation through Monte-Carlo simulation of the fresh data loop limit point (Equation (2)) in Gaussian models. For comparison, we compute the effective sample size n_e that an alternative model, trained from scratch on only real data, would need to reach the same asymptotic performance:

    Find n_e such that E[dist(G(n_e), P_r)] = WD(n_r, n_s, λ).    (3)

That is, n_e captures the asymptotic sample efficiency of the fresh data loop. Figure 10 depicts how the ratio n_e/n_r changes with n_r, n_s, and λ. When n_e/n_r ≥ 1, we say the amount of synthetic data n_s is admissible because it effectively increases the number of real samples. For n_e/n_r < 1, synthetic data effectively reduces the number of real samples.

First, we confirm that, given n_r and λ < 1, there exists a phase transition in n_s. If n_s exceeds the admissibility threshold, then the effective sample size n_e drops below the fresh sample size n_r, meaning that synthetic data does not asymptotically improve performance. However, the synthetic-to-real ratio n_s/n_r needed to achieve n_e/n_r ≥ 1 is not constant. In fact, in Figure 10 we see that the admissible amount of synthetic data n_s (such that n_e/n_r ≥ 1) can be quite high for small values of n_r, but as n_r grows, the admissible ratio of synthetic-to-real data n_s/n_r shrinks.

Second, we find that the admissible threshold value for n_s depends strongly on the sampling bias λ. Perhaps surprisingly, stronger bias (smaller λ) actually reduces the number of synthetic samples that can be used without harming performance. Taking the limit λ → 1 for unbiased sampling appears to ensure that the effective number of samples is always increased (n_e/n_r is always greater than 1). Whether this limiting behavior extends beyond Gaussian models is an open question. As we discussed in Section 2.1, it is unlikely that practical generative models synthesize without bias, and so it is better to draw conclusions from the λ < 1 case. See Appendix I for additional fresh data loop experiments.

6 DISCUSSION

Our theoretical and empirical analyses have enabled us to extrapolate what might happen as generative models become ubiquitous and train future models in autophagous (self-consuming) loops. Using state-of-the-art generative image models and datasets, we have studied three families of autophagous loops and identified the key rôle of sampling bias. Some ramifications are clear: without enough fresh real data, future generative models are doomed to Model Autophagy Disorder (MAD), progressively losing quality (precision) or diversity (recall) and amplifying generative artifacts. One doomsday scenario is that, if left uncontrolled, MAD could poison the entire Internet's data quality and diversity.
After all, our autophagous loops went appreciably MAD after just 5 generations (Figure 1). It seems inevitable that AI autophagy's unintended consequences could arise in the near future.

Practitioners who deliberately use synthetic training data should heed our warning. For those in truly data-scarce applications, our results suggest how much real data can prevent MADness. For example, future training of a medical image generator on inter-institutional anonymous syntheses (DuMont Schütte et al., 2021) should ensure that all synthetic images are artifact-free and diverse (see Section 3), and that real (preferably new) data is maximally present in training (see Sections 4 and 5).

Practitioners who unknowingly train on synthetic data could try controlling the ratio of real-to-synthetic training data by identifying and rejecting synthetic data. Some identifiers find telltale patterns of AI synthesis (Guarnera et al., 2020; Mitchell et al., 2023; Tang et al., 2023). Others make synthetic data steganographically identifiable via watermarking (Kirchenbauer et al., 2023a;b; Zhao et al., 2023; Peng et al., 2023; Wen et al., 2023; Fernandez et al., 2023; Fei et al., 2022). However, watermarking deliberately introduces hidden artifacts that could be uncontrollably or harmfully amplified by autophagy. In the fresh data loop, modest amounts of synthetic data can boost performance (n_e/n_r > 1 in Figure 10). Future research could develop autophagy-aware watermarking that helps identify synthetic data while avoiding the amplification of its own artifacts.

Future research directions include combining our three prototypical autophagous loops into more complex loops, examining how MADness affects downstream tasks (e.g., classification), and studying autophagy in models for other data types. We have focused here on imagery, but autophagy and MADness can occur in any data type. For example, autophagous language models (Huang et al., 2022; Wang et al., 2022; Taori et al., 2023) can also go MAD, losing quality (coherence or correctness) or diversity (variety). Shumailov et al. (2023) have reached similar conclusions, but there is much work to do in this vein.

ACKNOWLEDGEMENTS

Thanks to H. Javadi, B. Mason, and S. Sonkar for their insights. This work was supported by NSF grants CCF-1911094, IIS-1838177, and IIS-1730574; ONR grants N00014-18-12571, N00014-20-12534, and MURI N00014-20-1-2787; AFOSR grant FA9550-22-1-0060; DOE grant DE-SC0020345; and a Vannevar Bush Faculty Fellowship, ONR grant N00014-18-1-2047. DL was supported by ARO grant 2003514594.

REFERENCES

Tero Karras, Samuli Laine, and Timo Aila. A style-based generator architecture for generative adversarial networks. In CVPR, 2019a.
Tero Karras, Miika Aittala, Samuli Laine, Erik Härkönen, Janne Hellsten, Jaakko Lehtinen, and Timo Aila. Alias-free generative adversarial networks. In NeurIPS, 2021.
Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In CVPR, 2022.
OpenAI. GPT-4 technical report. arXiv preprint arXiv:2303.08774, 2023.
Christoph Schuhmann et al. LAION-5B: An open large-scale dataset for training next generation image-text models. In NeurIPS Datasets and Benchmarks Track, 2022.
Matthew Gault. AI spam is already flooding the internet and it has an obvious tell. VICE, April 2023.
Matthew Cantor. Nearly 50 news websites are 'AI-generated', a study says. Would I be able to tell? The Guardian, May 2023.
Veniamin Veselovsky, Manoel Horta Ribeiro, and Robert West. Artificial artificial artificial intelligence: Crowd workers widely use large language models for text production tasks. arXiv preprint arXiv:2306.07899, 2023.
Jon Christian. CNET secretly used AI on articles that didn't disclose that fact, staff say. Futurism, January 2023.
Walter H. L. Pinaya, Petru-Daniel Tudosiu, Jessica Dafflon, Pedro F. Da Costa, Virginia Fernandez, Parashkev Nachev, Sebastien Ourselin, and M. Jorge Cardoso. Brain imaging generation with latent diffusion models. In Deep Generative Models. Springer Nature, 2022.
Chengyuan Deng, Shihang Feng, Hanchen Wang, Xitong Zhang, Peng Jin, Yinan Feng, Qili Zeng, Yinpeng Chen, and Youzuo Lin. OpenFWI: Large-scale multi-structural benchmark datasets for full waveform inversion. In NeurIPS, 2022.
Lorenzo Luzi, Paul M Mayer, Josue Casco-Rodriguez, Ali Siahkoohi, and Richard G. Baraniuk. Boomerang: Local sampling on image manifolds using diffusion models. Transactions on Machine Learning Research, 2024.
Marvin Klemp, Kevin Rösch, Royden Wagner, Jannik Quehl, and Martin Lauer. LDFA: Latent diffusion face anonymization for self-driving applications. arXiv preprint arXiv:2302.08931, 2023.
Kai Packhäuser, Lukas Folle, Florian Thamm, and Andreas Maier. Generation of anonymous chest radiographs using latent diffusion models for training thoracic abnormality classification systems. arXiv preprint arXiv:2211.01323, 2022.
August DuMont Schütte, Jürgen Hetzel, Sergios Gatidis, Tobias Hepp, Benedikt Dietz, Stefan Bauer, and Patrick Schwab. Overcoming barriers to data sharing with medical image generation: a comprehensive evaluation. NPJ Digital Medicine, 2021.
Shekoofeh Azizi, Simon Kornblith, Chitwan Saharia, Mohammad Norouzi, and David J Fleet. Synthetic data from diffusion models improves imagenet classification. arXiv preprint arXiv:2304.08466, 2023.
Max F Burg, Florian Wenzel, Dominik Zietlow, Max Horn, Osama Makansi, Francesco Locatello, and Chris Russell. A data augmentation perspective on diffusion models and retrieval. arXiv preprint arXiv:2304.10253, 2023.
The Economist. The bigger-is-better approach to AI is running out of road. The Economist, June 2023a.
The Economist. Large, creative AI models will transform lives and labour markets. The Economist, April 2023b.
Pablo Villalobos, Jaime Sevilla, Lennart Heim, Tamay Besiroglu, Marius Hobbhahn, and Anson Ho. Will we run out of data? An analysis of the limits of scaling datasets in machine learning. arXiv preprint arXiv:2211.04325, 2022.
Ahmed Elgammal, Bingchen Liu, Mohamed Elhoseiny, and Marian Mazzone. CAN: Creative adversarial networks, generating "art" by learning about styles and deviating from style norms. arXiv preprint arXiv:1706.07068, 2017.
Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A Efros. Image-to-image translation with conditional adversarial networks. In CVPR, 2017.
Aditya Ramesh, Mikhail Pavlov, Gabriel Goh, Scott Gray, Chelsea Voss, Alec Radford, Mark Chen, and Ilya Sutskever. Zero-shot text-to-image generation. In ICML, 2021.
Tuomas Kynkäänniemi, Tero Karras, Samuli Laine, Jaakko Lehtinen, and Timo Aila. Improved precision and recall metric for assessing generative models. NeurIPS, 2019.
Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance. In NeurIPS 2021 Workshop on Deep Generative Models and Downstream Applications, 2021.
Natalie Jomini Stroud. Niche News: The Politics of News Choice. Oxford University Press, 2011.
Ivan Dylko, Igor Dolgov, William Hoffman, Nicholas Eckhart, Maria Molina, and Omar Aaziz. The dark side of technology: An experimental investigation of the influence of customizability technology on online political selective exposure. Computers in Human Behavior, 2017.
Michael A Beam. Automating the news: How personalized news recommender system design choices impact news reception. Communication Research, 2014.
Eytan Bakshy, Solomon Messing, and Lada A Adamic. Exposure to ideologically diverse news and opinion on Facebook. Science, 2015.
Derek O'Callaghan, Derek Greene, Maura Conway, Joe Carthy, and Pádraig Cunningham. Down the (white) rabbit hole: The extreme right and online recommender systems. Social Science Computer Review, 2015.
Neal Nathanson, John Wilesmith, and Christian Griot. Bovine Spongiform Encephalopathy (BSE): Causes and Consequences of a Common Source Epidemic. American Journal of Epidemiology, 145(11):959-969, 06 1997. ISSN 0002-9262.
followfox.ai. The power of synthetic data: Infinite loop to improve fine-tuning results with stable diffusion models, February 2023. URL https://followfoxai.substack.com/p/the-power-of-synthetic-data-infinite.
Jiaxin Huang, Shixiang Shane Gu, Le Hou, Yuexin Wu, Xuezhi Wang, Hongkun Yu, and Jiawei Han. Large language models can self-improve. arXiv preprint arXiv:2210.11610, 2022.
Tero Karras, Samuli Laine, Miika Aittala, Janne Hellsten, Jaakko Lehtinen, and Timo Aila. Analyzing and improving the image quality of StyleGAN. In CVPR, 2020.
Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. In NeurIPS, 2020.
Ishaan Gulrajani, Faruk Ahmed, Martin Arjovsky, Vincent Dumoulin, and Aaron C Courville. Improved training of Wasserstein GANs. NeurIPS, 2017a.
Ivan Kobyzev, Simon JD Prince, and Marcus A Brubaker. Normalizing flows: An introduction and review of current methods. IEEE Transactions on Pattern Analysis and Machine Intelligence, 43(11), 2020.
Tero Karras, Samuli Laine, and Timo Aila. A style-based generator architecture for generative adversarial networks. In CVPR, 2019b.
Li Deng. The MNIST database of handwritten digit images for machine learning research [best of the web]. IEEE Signal Processing Magazine, 29(6), 2012.
Ilia Shumailov, Zakhar Shumaylov, Yiren Zhao, Yarin Gal, Nicolas Papernot, and Ross Anderson. The curse of recursion: Training on generated data makes models forget. arXiv preprint arXiv:2305.17493, 2023.
Gonzalo Martínez, Lauren Watson, Pedro Reviriego, José Alberto Hernández, Marc Juarez, and Rik Sarkar. Combining generative artificial intelligence (AI) and the Internet: Heading towards evolution or degradation? arXiv preprint arXiv:2303.01255, 2023a.
Gonzalo Martínez, Lauren Watson, Pedro Reviriego, José Alberto Hernández, Marc Juarez, and Rik Sarkar. Towards understanding the interplay of generative artificial intelligence and the Internet. arXiv preprint arXiv:2306.06130, 2023b.
Ryuichiro Hataya, Han Bao, and Hiromi Arai. Will large-scale generative models corrupt future datasets? arXiv preprint arXiv:2211.08095, 2022.
Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. In NeurIPS, 2017.
Y. Lecun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 1998.
Andrew Brock, Jeff Donahue, and Karen Simonyan. Large scale GAN training for high fidelity natural image synthesis. In ICLR, 2019.
Ahmed Imtiaz Humayun, Randall Balestriero, and Richard Baraniuk. Polarity sampling: Quality and diversity control of pre-trained generative networks via singular values. In CVPR, 2022.
David Williams. Probability With Martingales. Cambridge University Press, 1991.
Leland McInnes, John Healy, Nathaniel Saul, and Lukas Grossberger. UMAP: Uniform manifold approximation and projection. The Journal of Open Source Software, 3(29):861, 2018.
Luca Guarnera, Oliver Giudice, and Sebastiano Battiato. Deepfake detection by analyzing convolutional traces. In CVPR Workshops, 2020.
Eric Mitchell, Yoonho Lee, Alexander Khazatsky, Christopher D Manning, and Chelsea Finn. DetectGPT: Zero-shot machine-generated text detection using probability curvature. In Proceedings of the 40th International Conference on Machine Learning, volume 202, pages 24950-24962, 7 2023.
Ruixiang Tang, Yu-Neng Chuang, and Xia Hu. The science of detecting LLM-generated texts. arXiv preprint arXiv:2303.07205, 2023.
John Kirchenbauer, Jonas Geiping, Yuxin Wen, Manli Shu, Khalid Saifullah, Kezhi Kong, Kasun Fernando, Aniruddha Saha, Micah Goldblum, and Tom Goldstein. On the reliability of watermarks for large language models. arXiv preprint arXiv:2306.04634, 2023a.
John Kirchenbauer, Jonas Geiping, Yuxin Wen, Jonathan Katz, Ian Miers, and Tom Goldstein. A watermark for large language models. arXiv preprint arXiv:2301.10226, 2023b.
Yunqing Zhao, Tianyu Pang, Chao Du, Xiao Yang, Ngai-Man Cheung, and Min Lin. A recipe for watermarking diffusion models. arXiv preprint arXiv:2303.10137, 2023.
Sen Peng, Yufei Chen, Cong Wang, and Xiaohua Jia. Protecting the intellectual property of diffusion models by the watermark diffusion process. arXiv preprint arXiv:2306.03436, 2023.
Yuxin Wen, John Kirchenbauer, Jonas Geiping, and Tom Goldstein. Tree-ring watermarks: Fingerprints for diffusion images that are invisible and robust. arXiv preprint arXiv:2305.20030, 2023.
Pierre Fernandez, Guillaume Couairon, Hervé Jégou, Matthijs Douze, and Teddy Furon. The stable signature: Rooting watermarks in latent diffusion models. arXiv preprint arXiv:2303.15435, 2023.
Jianwei Fei, Zhihua Xia, Benedetta Tondi, and Mauro Barni. Supervised GAN watermarking for intellectual property protection. In Workshop on Information Forensics and Security (WIFS), 2022.
Yizhong Wang, Yeganeh Kordi, Swaroop Mishra, Alisa Liu, Noah A Smith, Daniel Khashabi, and Hannaneh Hajishirzi. Self-instruct: Aligning language model with self generated instructions. arXiv preprint arXiv:2212.10560, 2022.
Rohan Taori, Ishaan Gulrajani, Tianyi Zhang, Yann Dubois, Xuechen Li, Carlos Guestrin, Percy Liang, and Tatsunori B Hashimoto. Alpaca: A strong, replicable instruction-following model, 2023.
Leonid V Kantorovich. Mathematical methods of organizing and planning production. Management Science, 6(4), 1960.
Tong Che, Ruixiang Zhang, Jascha Sohl-Dickstein, Hugo Larochelle, Liam Paull, Yuan Cao, and Yoshua Bengio. Your GAN is secretly an energy-based model and you should use discriminator driven latent sampling. In NeurIPS, 2020.
Laurent Dinh, Jascha Sohl-Dickstein, and Samy Bengio. Density estimation using RealNVP. In International Conference on Learning Representations, ICLR, 2016.
Durk P Kingma and Prafulla Dhariwal. Glow: Generative flow with invertible 1x1 convolutions. In Advances in Neural Information Processing Systems, 2018.
Filippo Pagani, Martin Wiegand, and Saralees Nadarajah. An n-dimensional Rosenbrock distribution for Markov chain Monte Carlo testing. Scandinavian Journal of Statistics, 49(2), 2022.
Ishaan Gulrajani, Faruk Ahmed, Martin Arjovsky, Vincent Dumoulin, and Aaron C Courville. Improved training of Wasserstein GANs. In Advances in Neural Information Processing Systems, 2017b.
Rafael Orozco, Philipp Witte, Mathias Louboutin, Ali Siahkoohi, Gabrio Rizzuti, Bas Peters, and Felix J. Herrmann. InvertibleNetworks.jl: A Julia package for scalable normalizing flows. arXiv preprint arXiv:2312.13480, 2023.

A EXPERIMENT SETUPS

Here are detailed descriptions of our experiments.

A.1 THE FULLY SYNTHETIC LOOP

We empirically study the fully synthetic loop using two representative deep generative models and two practical training datasets. After training an initial model G_1 with a fully real dataset containing n_r^1 samples from the (unknown) reference distribution, subsequent models (G_t)_{t≥2} are trained using n_s^t synthetic samples from the immediately preceding model G_{t−1}, where each synthetic sample is produced with sampling bias λ. Our primary experiments are organized as follows:

Generative adversarial network: We use an unconditional StyleGAN2 model (Karras et al., 2020) and initially train it on n_r^1 = 70k samples from the FFHQ dataset (Karras et al., 2019b). We downsized the FFHQ images to 128 × 128 (using Lanczos PyTorch anti-aliasing filtering as in Karras et al. (2020)) to reduce the computational cost. We set n_s^t = 70k for t ≥ 2.

Diffusion model: We use a conditional DDPM (Ho et al., 2020) with T = 500 diffusion time steps and initially train it on n_r^1 = 60k real samples from the MNIST dataset. We set n_s^t = 60k for t ≥ 2. To calculate FIDs, we use the features extracted by a LeNet (Lecun et al., 1998) rather than an Inception network, because numerical digits are not exactly natural images. For consistency, we continue to use the term FID in this case.

A.2 THE SYNTHETIC AUGMENTATION LOOP

We simulate the synthetic augmentation loop using the same deep generative models and experimental conditions as in Appendix A.1. Recall that we first train an initial model G_1 with a fully real dataset of n_r^1 samples. All subsequent models (G_t)_{t≥2} are trained using n_s^t synthetic samples from the previous model(s) and all of the original n_r^1 samples used to train G_1. Note that each synthetic sample is always produced with sampling bias λ. Our experiments are organized as follows:

Generative adversarial network: We use an unconditional StyleGAN2 architecture (Karras et al., 2020) trained on the FFHQ 128 × 128 dataset (Karras et al., 2019b). Like the StyleGAN experiment in Appendix A.1, at each generation t ≥ 2 we sample 70k images with no sampling bias (λ = 1) from the immediately preceding model G_{t−1}. However, now the synthetic dataset D_s^t includes samples from all the previous models (G_τ)_{τ=1}^{t−1}, producing a synthetic data pool of size n_s^t = (t − 1) · 70k that grows linearly with t. The real FFHQ dataset is always present at every generation: D_r^1 = D_r^t and n_r^1 = n_r^t = 70k for every generation t.

Diffusion model: We use a conditional MNIST-DDPM (Ho et al., 2020) with T = 500 diffusion time steps. In this experiment the synthetic dataset D_s^t is only sampled from the previous generation G_{t−1} with sampling bias λ, and n_r^1 = n_s^t = 60k for all t ≥ 2.
The original real MNIST dataset is also available at every generation: D_r^1 = D_r^t and n_r^1 = n_r^t = 60k for all t.

A.3 THE FRESH DATA LOOP

As in the previous autophagous loop variants, we assume that the initial model is trained solely on real samples, with the number of real samples denoted here as n_r^1 = n_ini. In subsequent generations (i.e., for t ≥ 2) the generative models are trained with a fixed number of real samples, denoted n_r^t = n_r, and a fixed number of synthetic samples, denoted n_s^t = n_s. In the fresh data loop, the dataset D_r^t is independently sampled from the reference probability distribution P_r, while the dataset D_s^t is sampled exclusively from the previous generation G_{t−1}, with a sampling bias λ. We simulate the fresh data loop using different values for n_ini, n_r, n_s, and λ. The Gaussian example enables examination of the fresh data loop in greater detail, especially in the asymptotic regime. Meanwhile, our MNIST-DDPM example demonstrates the impact of the fresh data loop on a more realistic dataset and model.

Gaussian model: We consider a normal reference distribution P_r = N(0_d, I_d) with dimension d = 100. For modeling the Gaussian distribution, we utilize an unbiased moment estimation approach, as described in Equation (1).

Diffusion model: We use a conditional DDPM (Ho et al., 2020) with T = 500 diffusion time steps. We consider the MNIST dataset as our reference distribution.

A.4 METRICS FOR MADNESS

Ascertaining whether an autophagous loop has gone MAD or not (recall the definition of a MAD generative process in Section 2) requires that we measure how far the synthesized data distribution G_t has drifted from the true data distribution P_r over the generations t. We use the notion of the Wasserstein distance, as implemented by the Fréchet Inception Distance (FID), for this purpose. We will also find the standard concepts of precision and recall useful for making rigorous the notions of quality and diversity, respectively.

Wasserstein distance, or earth mover's or optimal transport distance (Kantorovich, 1960), measures the minimum work required to move the probability mass of one distribution to another. Computing the Wasserstein distance between two datasets (e.g., real and synthetic images) is prohibitively expensive. As such, standard practice employs the FID (Heusel et al., 2017) as an approximation, which calculates the Wasserstein-2 distance between Inception feature distributions of real and synthetic images. For our MNIST experiments we calculate FIDs using the features from a LeNet (Lecun et al., 1998) rather than an Inception network, because numerical digits are not exactly natural images.

Precision quantifies the portion of synthesized samples that are deemed high quality or visually appealing. We use precision as an indicator of sample quality. We compute precision by calculating the fraction of synthetic samples that are closer to a real data example than to their k-th nearest neighbor (Kynkäänniemi et al., 2019). We use the default k = 5 in all experiments.

Recall estimates the fraction of samples in a reference distribution that are inside the support of the distribution learned by a generative model. High recall scores suggest that the generative model captures a large portion of diverse samples from the reference distribution. We compute recall in a manner similar to precision (Kynkäänniemi et al., 2019). Given a set of synthetic samples from the generative model, we calculate the fraction of real data samples that are closer to any synthetic sample than its k-th nearest neighbor. In Appendix C.1 we demonstrate how recall captures synthetic diversity in an autophagous loop more accurately than variance.
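For reference, a self-contained NumPy sketch of these k-nearest-neighbor manifold estimates is given below; it operates on feature arrays (e.g., Inception or LeNet features) and uses a brute-force distance computation for clarity rather than the batched implementation of Kynkäänniemi et al. (2019).

```python
import numpy as np

def knn_radii(feats, k=5):
    """Distance from each point to its k-th nearest neighbor within `feats` (self excluded)."""
    d = np.linalg.norm(feats[:, None, :] - feats[None, :, :], axis=-1)
    d.sort(axis=1)
    return d[:, k]   # column 0 is the self-distance (0), so index k is the k-th neighbor

def precision_recall(real, synth, k=5):
    """Sketch of k-NN manifold precision/recall on (n, feature_dim) arrays."""
    cross = np.linalg.norm(synth[:, None, :] - real[None, :, :], axis=-1)  # (n_synth, n_real)
    # Precision: fraction of synthetic samples falling inside some real sample's k-NN ball.
    precision = np.mean((cross <= knn_radii(real, k)[None, :]).any(axis=1))
    # Recall: fraction of real samples falling inside some synthetic sample's k-NN ball.
    recall = np.mean((cross <= knn_radii(synth, k)[:, None]).any(axis=0))
    return precision, recall
```

The brute-force pairwise distances are quadratic in the number of samples, so this sketch is meant for small feature sets; the qualitative behavior (precision tracking quality, recall tracking diversity) is the same as in our experiments.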
B PROOF OF SYNTHETIC GAUSSIAN MARTINGALE VARIANCE COLLAPSE

We now prove that, for the process described in Equation (1), Σ_t → 0 almost surely.

Proof. First write x_t^i = √λ Σ_{t−1}^{1/2} z_t^i + µ_{t−1} for z_t^i ∼ N(0_d, I_d). Then consider the process tr[Σ_t], which is a lower-bounded supermartingale:

    tr[Σ_t] = λ tr[ Σ_{t−1}^{1/2} ( (1/(N − 1)) Σ_{i=1}^{N} (z_t^i − µ̄_t^z)(z_t^i − µ̄_t^z)^⊤ ) Σ_{t−1}^{1/2} ],    where µ̄_t^z = (1/N) Σ_{i=1}^{N} z_t^i.

By Doob's martingale convergence theorem (Williams, 1991, Ch. 11), there exists a random variable w such that tr[Σ_t] → w almost surely, and we now show that we must have w = 0. Without loss of generality, we can assume that Σ_{t−1} is diagonal, in which case it becomes clear that tr[Σ_t] is a generalized χ² random variable, being a linear combination of d independent χ² random variables with N − 1 degrees of freedom, mixed with weights λ · diag(Σ_{t−1}). Therefore, we can write tr[Σ_t] = λ y_t tr[Σ_{t−1}], where y_t is a generalized χ² random variable with the same degrees of freedom but with mixing weights diag(Σ_{t−1})/tr[Σ_{t−1}], and E[y_t | Σ_{t−1}] = 1. This implies that at least one mixing weight is greater than 1/d for each t, which means that for any 0 < ε < 1, there exists c > 0 such that Pr(|y_t − 1| > ε) > c. Now consider the case λ = 1. Since |y_t − 1| > ε infinitely often with probability one, the only w that can satisfy lim_{t→∞} tr[Σ_0] ∏_{s=1}^{t} y_s = w is w = 0. For general λ ≤ 1, tr[Σ_t] is simply the product of the process for λ = 1 and the deterministic sequence λ^t ≤ 1, and so the product must also converge to zero almost surely. Finally, since tr[Σ_t] → 0 almost surely, we also must have Σ_t → 0 almost surely, where convergence is defined with respect to any matrix norm.

C ADDITIONAL EXPERIMENTS FOR THE FULLY SYNTHETIC LOOP

Here we present additional experiments for the fully synthetic loop.

C.1 RECALL VERSUS VARIANCE: GMMS IN AN UNBIASED FULLY SYNTHETIC LOOP

We also trained 2D GMMs in an unbiased fully synthetic loop using the same 25-mode distribution as Che et al. (2020). In Figure 11 we see that the fully synthetic loop gradually reduces the number of modes covered by the synthetic distribution. Various metrics could measure this loss in diversity, so in Figure 12 we explore how well each metric reflects the dynamics of the fully synthetic loop, finding that recall is best equipped to measure diversity in multimodal datasets.

Figure 11 (estimated distributions at t = 1, 200, and 2000): The fully synthetic loop gradually causes mode collapse. Estimated GMM (Che et al., 2020) distributions after 1, 200, and 2k iterations of an unbiased fully synthetic loop. Notice that the modes are lost asymptotically.

Figure 12 (variance, average modal variance, and recall over 2000 generations): Recall is the most suitable commonly accepted metric for diversity in an autophagous loop. For GMMs in a fully synthetic loop (Figure 11), there are three primary potential metrics of diversity: variance,[7] average modal variance (the average variance of each mode), and recall (Kynkäänniemi et al., 2019). We observe that the overall variance (left) does not reflect the loss of modes that we see in Figure 11 as smoothly as recall (right) and average modal variance (middle).
Recall is therefore a suitable choice for measuring diversity in multimodal datasets and, unlike average modal variance, is compatible with distributions where the number of modes is not tractable (e.g., natural images).

C.2 WASSERSTEIN GANS IN AN UNBIASED FULLY SYNTHETIC LOOP

In this experiment we trained Wasserstein GANs (with gradient penalty) (Gulrajani et al., 2017a) on the MNIST dataset in a fully synthetic loop for 100 generations. As shown in Figure 13, the FID monotonically increases, while quality (precision) and diversity (recall) monotonically decrease.

[7] For multidimensional datasets, we calculate variance as the trace of the covariance.

Figure 13: The negative effects of the fully synthetic loop are monotonic and inescapable. The FID (left), quality (precision, middle), and diversity (recall, right) of synthetic MNIST images produced by MNIST Wasserstein GANs.

C.3 ADDITIONAL MNIST-DDPM FULLY SYNTHETIC LOOP RESULTS

In Figure 4 we showcased the results of training MNIST-DDPMs in a fully synthetic loop with various sampling bias factors λ. In Figure 14 we show the results (FID, precision, and recall) for more generations t and different sampling biases λ.

Figure 14 (DDPMs in an MNIST fully synthetic loop with λ = 1, 0.8, 0.66, and 0.5): In a fully synthetic loop, sampling bias λ < 1 can mitigate distributional drift and losses in quality, but only at the cost of rapid losses in diversity. The FID (left), quality (precision, middle), and diversity (recall, right) of synthetic images from an MNIST-DDPM fully synthetic loop.

C.4 NORMALIZING FLOW FULLY SYNTHETIC LOOP

We implemented the fully synthetic loop using normalizing flows (Dinh et al., 2016; Kingma and Dhariwal, 2018) for generative modeling of the two-dimensional Rosenbrock reference distribution (Pagani et al., 2022) in order to visualize the outcome of this particular scenario in a controlled setting. Normalizing flows are unique in that they enable exact evaluation of the likelihood of the estimated distribution due to their invertibility (Dinh et al., 2016). This leads to a relatively straightforward training procedure compared to GANs, which often require careful balancing between the generator and discriminator networks to avoid mode collapse (Gulrajani et al., 2017b). Therefore, by using a low-dimensional reference distribution, this setup allows us to demonstrate the fully synthetic loop while eliminating potential training imperfections. We implemented this example using the InvertibleNetworks.jl (Orozco et al., 2023) package for normalizing flows.

According to the fully synthetic loop setup, we start with a training dataset of 10^4 samples from the 2D Rosenbrock distribution with the density function P_r(x_1, x_2) ∝ exp(−x_1²/2 − (x_2 − x_1²)²) (Pagani et al., 2022), which is plotted on the left-hand side of Figure 15. The subsequent generations of normalizing flow models are trained using synthetic data generated by the previous pre-trained normalizing flow for 16 generations, both with and without sampling bias. We employ the GLOW normalizing flow architecture (Kingma and Dhariwal, 2018) with eight coupling layers and a hidden dimension of 64. The training is carried out for 20 epochs with a batch size of 256 for each generation, ensuring convergence as determined by monitoring the model's likelihood over a validation set.
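As a minimal illustration of how the latent-space sampling bias described next can be applied to a trained flow, consider the following sketch; `flow_inverse` is a hypothetical callable standing in for the trained GLOW model's latent-to-data map, and the routine simply draws latents with shrunken variance.

```python
import numpy as np

def biased_flow_samples(flow_inverse, n, lam, d=2, rng=np.random.default_rng(0)):
    """Sketch of biased sampling for a trained normalizing flow: draw latents from
    N(0, lam * I) instead of N(0, I) and push them through the flow's inverse map.
    `flow_inverse` is a hypothetical placeholder for the trained model."""
    z = rng.standard_normal((n, d)) * np.sqrt(lam)   # latent samples with reduced variance
    return flow_inverse(z)
```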
Figure 15 summarizes the results of this fully synthetic loop setup. To incorporate sampling bias, we sample the model's latent space from N(0_d, λI_d), where d = 2. As shown, regardless of the presence of sampling bias, the resulting distribution after 16 generations loses the tails of the reference distribution, indicating a loss of diversity. This phenomenon becomes more pronounced when sampling bias is present (λ < 1).

Figure 15: Normalizing flows are not immune to the fully synthetic loop. The fully synthetic loop implemented with a normalizing flow (Dinh et al., 2016) applied to the 2D Rosenbrock distribution (Pagani et al., 2022); panels show the ground truth, generation t = 1, and generation t = 16. Sampling with or without bias still loses the tails of the distribution (i.e., diversity); using λ < 1 accelerates this loss of diversity.

D FFHQ UNBIASED FULLY SYNTHETIC LOOP IMAGES

We show additional randomly chosen synthetic samples produced by the same FFHQ-StyleGAN2 unbiased fully synthetic loop as in Figure 1.

Figure 16: Generation t = 1 of a fully synthetic loop with bias λ = 1, i.e., synthetic samples from the first model G1.

Figure 17: Generation t = 3 of a fully synthetic loop with bias λ = 1.

Figure 18: Generation t = 5 of a fully synthetic loop with bias λ = 1.

Figure 19: Generation t = 7 of a fully synthetic loop with bias λ = 1.

Figure 20: Generation t = 9 of a fully synthetic loop with bias λ = 1.

E FFHQ BIASED FULLY SYNTHETIC LOOP IMAGES

We show additional randomly chosen synthetic samples produced by the same FFHQ-StyleGAN2 biased (λ = 0.7) fully synthetic loop as in Figure 5.

Figure 21: Generation t = 1 of a fully synthetic loop with bias λ = 0.7.

Figure 22: Generation t = 3 of a fully synthetic loop with bias λ = 0.7.

Figure 23: Generation t = 5 of a fully synthetic loop with bias λ = 0.7.

F MNIST-DDPM FULLY SYNTHETIC LOOP IMAGES

Here we show randomly chosen samples from each of the first 20 generations of an MNIST-DDPM in a fully synthetic loop for different sampling biases.

Figure 24: Without sampling bias, synthetic data modes drift from real modes and merge together. Randomly selected synthetic MNIST images from each generation (t = 1 to 20) without sampling bias (λ = 1). See Figure 6 for more details.

Figure 25: With sampling bias, synthetic data modes drift and contract around just a few high-quality data points. Randomly selected synthetic MNIST images from each generation (t = 1 to 20) with sampling bias λ = 0.8. See Figure 6 for more details.

G FFHQ UNBIASED SYNTHETIC AUGMENTATION LOOP IMAGES

Figure 26: Generation t = 3 of a synthetic augmentation loop with bias λ = 1. See Figure 16 for the samples from t = 1 (in any autophagous loop, the first model G1 always trains on purely real data; see Section 2).
Figure 27: Generation t = 6 of a synthetic augmentation loop with bias λ = 1.

H ADDITIONAL RESULTS FOR THE SYNTHETIC AUGMENTATION LOOP

H.1 THE MNIST-DDPM SYNTHETIC AUGMENTATION LOOP

In this section we repeat the experiment shown in Figure 8 and described in Appendix A.2, but using all synthetic data from previous generations: at each generation t we use $n_s^t = (t-1) \cdot 60\mathrm{k}$ synthetic samples drawn from $(G_\tau)_{\tau=1}^{t-1}$, combined with the initial real data. The results are shown in Figure 28, where we see the same trend as in Figure 8. Figure 29 compares using synthetic samples from all previous generations with using only samples from the previous generation: drawing synthetic data from all previous generations merely slows the degradation of the models relative to using data from the previous generation alone.

Figure 28: Using samples from all previous generative models in a synthetic augmentation loop does not stop MADness. We show the FID, precision (quality), and recall (diversity) of MNIST-DDPM images synthesized in synthetic augmentation loops with sampling biases λ = 1, 0.8, 0.66, and 0.5, plotted against the generation number. All three metrics exhibit the same behavior as when we sample only from the previous generation (Figure 8).

Figure 29: Using samples from all previous generative models in a synthetic augmentation loop slows model degradation. We show the FID of MNIST-DDPM images synthesized in synthetic augmentation loops with different sampling biases λ, using synthetic data drawn either from all previous generations $(G_\tau)_{\tau=1}^{t-1}$ or from the last generation $G_{t-1}$ only.

H.2 THE GAUSSIAN SYNTHETIC AUGMENTATION LOOP

In this section we replicate the Gaussian experiment of Section 5 for synthetic augmentation loops. In particular, we sample $n_r$ real data points from the reference distribution $\mathcal{P}_r = \mathcal{N}(0_d, I_d)$ with dimension d = 100 to train the first model G1. For each subsequent generation, we sample $n_s$ synthetic data points from the model $G_{t-1}$ with sampling bias λ and combine them with the same $n_r$ real samples used to train G1. We report $n_e/n_r$, with the effective sample size $n_e$ defined in Equation 3. The results are shown in Figures 30 and 31. We observe that for any $n_r > d$ (with d = 100), the presence of synthetic samples progressively reduces the effective number of samples. When the problem is ill-posed, i.e., $n_r < d$, $n_e$ can exceed $n_r$ if some sampling bias λ is present. However, in our experiments we observe that $n_e$ never surpasses d for any value of λ or $n_s$, so the estimation problem remains ill-posed ($n_e < d$).

Figure 30: In a synthetic augmentation loop, we always see a decrease in $n_e$ relative to $n_r$, except for very small values of $n_r < d = 100$, where the distribution estimation problem is ill-posed.

Figure 31: In a synthetic augmentation loop, we always see a decrease in $n_e$ relative to $n_r$. Smaller values of λ result in faster decay of $n_e$ as $n_s$ increases.

I ADDITIONAL RESULTS FOR THE FRESH DATA LOOP

Here we provide three additional Gaussian experiments investigating the convergence of the fresh data loop.
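As a concrete reference for the experiments below, here is a minimal sketch of the Gaussian fresh data loop as we simulate it: the first model is fit to real data only, and each later generation is fit to $n_r$ fresh real samples from $\mathcal{N}(0_d, I_d)$ combined with $n_s$ samples drawn from the previous fit, whose covariance is scaled by the sampling bias λ. The code is illustrative rather than the exact experimental script, and it does not reproduce the effective sample size $n_e$ of Equation 3; the error metric plotted in Figure 34 is sketched separately after that figure.

```python
import numpy as np

rng = np.random.default_rng(0)

def fit_gaussian(x):
    """Maximum-likelihood Gaussian fit (mean and covariance) of the dataset x."""
    return x.mean(axis=0), np.cov(x, rowvar=False)

def fresh_data_loop(d=100, n_r=1_000, n_s=10_000, lam=1.0, generations=20):
    """G1 is fit to real data only; each later generation is fit to n_r fresh real
    samples plus n_s samples from the previous fit with covariance scaled by lam."""
    real = rng.multivariate_normal(np.zeros(d), np.eye(d), size=n_r)
    mu, cov = fit_gaussian(real)                                        # generation 1
    fits = [(mu, cov)]
    for t in range(2, generations + 1):
        real = rng.multivariate_normal(np.zeros(d), np.eye(d), size=n_r)  # fresh real data
        synth = rng.multivariate_normal(mu, lam * cov, size=n_s)          # biased synthetic data
        mu, cov = fit_gaussian(np.vstack([real, synth]))
        fits.append((mu, cov))
    return fits
```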
Experiment 1: In Figure 10 we showed how convergence of the Gaussian fresh data loop depends on $n_s$ and $n_r$ for a few values of λ. Here we instead depict how convergence depends on $n_s$ and λ for a few values of $n_r$.

Figure 32: In a fresh data loop, sampling bias reduces the admissible synthetic sample size. For strong sampling bias (small λ), the maximum synthetic data count $n_s$ for which $n_e \ge n_r$ (green area) decreases.

Experiment 2: In Section 5 we assumed that we sample only from the previous generation $G_{t-1}$ when creating the synthetic dataset $D_s^t$. In this experiment we instead sample randomly from the K previous models $(G_\tau)_{\tau=t-K}^{t-1}$. Here $n_r = 10^3$, $n_s = 10^4$, and λ = 1. Figure 33 shows how $n_e/n_r$ varies with K. Increasing the memory K when sampling from previous generations can boost performance; however, the improvement in $n_e$ is sublinear in K.

Figure 33: The effective sample size $n_e$ divided by the real sample size $n_r$ for different numbers of accessed previous generations K.

Experiment 3: Here we assume that we sample from an environment where p percent of the data is real and the rest is synthetic data from the previous generation $G_{t-1}$ with sampling bias λ. We vary the total dataset size $n = |D^t|$, with $n_r = p\,n$ and $n_s = (1-p)\,n$. Figure 34 shows the Wasserstein distance for different p and λ. Let us first examine the dynamics of the Gaussian fresh data loop without sampling bias (λ = 1). We observe in Figure 34 (left) that the Wasserstein distance (WD) decreases with the dataset size n. However, the presence of synthetic data (p < 100%) slows the rate at which the WD decreases and increases the overall WD at each generation of the fresh data loop. This means that with synthetic data present on the Internet, the progress of generative models will slow. In the presence of sampling bias (λ < 1, Figure 34 right), we see that even for values of λ close to 1, the Wasserstein distance follows a sub-linear trend, meaning that eventually the rate of progress of generative models will effectively stop, no matter how much (realistically) the total dataset size is increased.

Figure 34: The Wasserstein distance (WD) versus the total dataset size n (from $10^3$ to $10^5$), for different values of p (left: p = 3%, 1%, 0.3%) and sampling bias λ (right: λ = 1, 0.97, 0.95, 0.93).
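For reference, when the reference distribution and the fitted model are both Gaussian, the 2-Wasserstein distance reported in curves like Figure 34 has the closed (Bures) form $W_2^2 = \|\mu_1 - \mu_2\|^2 + \mathrm{tr}\!\left(\Sigma_1 + \Sigma_2 - 2\left(\Sigma_2^{1/2}\Sigma_1\Sigma_2^{1/2}\right)^{1/2}\right)$. The sketch below (assuming SciPy's matrix square root) illustrates how such a curve can be produced from the fits returned by the fresh-data-loop sketch above; it is not the authors' exact evaluation code, and it assumes the reported WD is the 2-Wasserstein distance.

```python
import numpy as np
from scipy.linalg import sqrtm

def gaussian_w2(mu1, cov1, mu2, cov2):
    """2-Wasserstein (Bures) distance between N(mu1, cov1) and N(mu2, cov2)."""
    s2 = np.real(sqrtm(cov2))
    cross = np.real(sqrtm(s2 @ cov1 @ s2))
    w2_sq = np.sum((mu1 - mu2) ** 2) + np.trace(cov1 + cov2 - 2.0 * cross)
    return np.sqrt(max(w2_sq, 0.0))   # guard against tiny negative round-off

# Example: distance of each generation's fit to the reference N(0, I).
# fits = fresh_data_loop(...)        # from the sketch in Appendix I above
# d = fits[0][0].shape[0]
# wd = [gaussian_w2(mu, cov, np.zeros(d), np.eye(d)) for mu, cov in fits]
```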