
Self-Consuming Generative Models with Adversarially Curated Data

Xiukun Wei 1 Xueru Zhang 1

Recent advances in generative models have made it increasingly difficult to distinguish real data from model-generated synthetic data. Using synthetic data for successive training of future model generations creates self-consuming loops, which may lead to model collapse or training instability. Furthermore, synthetic data is often subject to human feedback and curated by users based on their preferences. Ferbach et al. (2024) recently showed that when data is curated according to user preferences, the self-consuming retraining loop drives the model to converge toward a distribution that optimizes those preferences. However, in practice, data curation is often noisy or adversarially manipulated. For example, competing platforms may recruit malicious users to adversarially curate data and disrupt rival models. In this paper, we study how generative models evolve under self-consuming retraining loops with noisy and adversarially curated data. We theoretically analyze the impact of such noisy data curation on generative models and identify conditions for the robustness of the retraining process. Building on this analysis, we design attack algorithms for competitive adversarial scenarios, where a platform with a limited budget employs malicious users to misalign a rival's model from actual user preferences. Experiments on both synthetic and real-world datasets demonstrate the effectiveness of the proposed algorithms.

1. Introduction

The latest generative models can produce highly realistic texts (OpenAI, 2024), images (Diffusion, 2025), audio (AI, 2025), and videos (ML, 2025). As synthetic data proliferates on the internet, it is inevitably used for training future generations of models, creating a self-consuming training loop. A line of research has focused on examining the impact of this self-consuming training loop on generative models' outputs, both theoretically (Bertrand et al., 2024; Taori & Hashimoto, 2023) and empirically (Gerstgrasser et al., 2024; Alemohammad et al., 2024a; Shumailov et al., 2024). These studies demonstrate that such a self-consuming training loop may lead to model collapse (Gerstgrasser et al., 2024; Alemohammad et al., 2024a; Shumailov et al., 2024), training instability (Bertrand et al., 2024), and possibly bias amplification (Taori & Hashimoto, 2023; Wyllie et al., 2024; Xie & Zhang, 2024). Several solutions have also been proposed to mitigate these issues, such as integration of real data (Bertrand et al., 2024), leveraging cumulative datasets (Gerstgrasser et al., 2024), and employing Self-Improving Diffusion Models with Synthetic Data (SIMS) (Alemohammad et al., 2024b).

$^1$Department of Computer Science and Engineering, The Ohio State University, Columbus, Ohio, USA. Correspondence to: Xueru Zhang <zhang.12807@osu.edu>.

Proceedings of the 42nd International Conference on Machine Learning, Vancouver, Canada. PMLR 267, 2025. Copyright 2025 by the author(s).

In contrast to these works, a recent study (Ferbach et al., 2024) explores a more practical scenario in which synthetic data is curated by human users. To improve safety, user trust, and the relevance and quality of generated outputs, modern generative models are increasingly trained with human participation and feedback. For example, platforms such as JourneyDB (Pan et al., 2023) and Pika Labs (Labs, 2025) provide multiple variations of outputs for users to choose from, with only selected outputs being upscaled and used to train next-generation models. As shown in Ferbach et al. (2024), when synthetic data is curated based on a reward model representing user preferences, the generative models trained iteratively on this curated data tend to converge to an output distribution that maximizes the expected reward.

However, in practice, data curated from users are likely to be noisy, biased, or even maliciously manipulated. Consider a scenario where multiple platforms compete for the same target user population with similar preference distributions (e.g., ChatGPT (OpenAI, 2025) and Claude (Anthropic, 2025), Stable Diffusion (Diffusion, 2025) and Midjourney (Midjourney, 2025)). To compete for market share, platforms may leverage their collected datasets, which contain rich information about actual user preferences, to design attack algorithms targeting their competitors. For example, as shown in Fig. 1, a platform may employ malicious users with limited budgets to deliberately select outputs on a rival's platform that significantly deviate from genuine user preferences, creating a dataset on the competing platform that no longer reflects true user preferences. Over time, this adversarially curated data degrades the competitor's ability to train models that align with user preferences, ultimately reducing their capacity to produce content that attracts users.

[Figure 1 diagram. Steps of the curation loop: 1. User picks prompts; 2. Samples are drawn from the current model; 3. Samples are upscaled according to the ranking; 4. Only the upscaled samples are in the dataset; 5. The model is trained on the dataset. Steps 4 and 5 are shown for both the Adversarial Platform and the Target Platform, linked by the label "Malicious curation using attack algorithms with limited budgets".]

Figure 1. An example of adversarial data curation in a competitive setting: the adversarial platform and the target platform serve the same population of users. The adversarial platform can access the data generated by the target platform and obtain real user preferences through its own preference collection mechanisms. Using the attack algorithm, the adversarial platform employs malicious users to adversarially curate data on the target platform, preventing its model from aligning with genuine user preferences.

This paper takes a first step in examining the impact of adversarially curated data on the iterative retraining of generative models. We provide a theoretical analysis to understand how the output distribution of generative models evolves under adversarially curated data and assess the robustness of the self-consuming training loop to such manipulations. Our results demonstrate that under specific conditions, the self-consuming generative model trained from such adversarially curated data remains robust, in the sense that it still converges to an output distribution that optimizes user preferences. However, we also identify conditions (on the fraction of adversarial data and its associated reward function) under which this robustness guarantee fails to hold.

Building on this theoretical understanding, we design attack algorithms for adversarial data curation aimed at disrupting the alignment of self-consuming generative models with user preferences. Specifically, we consider a competitive scenario in which platforms compete for market share by employing malicious users to curate adversarial data on a rival's platform. Given a dataset reflecting genuine user preferences (collected from the platform's own users), our attack algorithms strategically flip a limited number of preference labels. The modified dataset then guides malicious users in curating adversarial data on the rival platform. Over time, the generative models on the targeted platform, when iteratively trained on such adversarially curated data, fail to align with genuine user preferences, reducing their ability to remain appealing to users. To the best of our knowledge, this is the first attack algorithm for deviating self-consuming generative models from user preferences. The most related work is Wu et al. (2025), which investigates the vulnerability of reward model learning to preference poisoning. However, unlike our work, which focuses on a self-consuming training loop where the attacker aims to gradually misalign the target model with human preferences, Wu et al. (2025) considers a static setting where the attacker aims to flip preference labels to promote or demote specific target outcomes under the learned reward model. In Appendix A, we discuss more related works.

Our contributions are summarized as follows:

In Section 3, we theoretically analyze the long-term performance of self-consuming generative models under adversarial data curation, considering both pure synthetic data and mixtures of synthetic and real data.

In Lemma 3.3, we prove that the convergence of the generative model toward maximizing user-expected rewards is governed by the covariance of adversarially curated synthetic data. This result establishes conditions where convergence remains robust and conditions where adversarial data curation hinders model alignment with human preferences.

In Section 4, we model a competitive situation where an adversarial platform aims to disrupt a competitor's model alignment through adversarial data curation, and propose gradient-based and heuristic attack algorithms.

In Section 5, we validate the theorems and proposed algorithms through experiments on synthetic and real datasets.

2. Problem Formulation

Consider a platform that iteratively trains generative models from data curated by users. Denote $p_{\text{data}} \in \mathcal{P}(\mathbb{R}^d)$ as the real data distribution and, for $t \in \mathbb{N}$, let $p_t \in \mathcal{P}(\mathbb{R}^d)$ be the data distribution of the generative model at the $t$-th round of the iterative retraining loop. Throughout the paper, we use lowercase letters $p$ to denote densities and uppercase letters $P$ to indicate the associated probabilities.

Adversarially curated data. At every round $t \in \mathbb{N}$, the platform presents synthetic data $x_1, \dots, x_K$ randomly sampled from the current model $p_t$ to users, and users select their preferred outputs, which are upscaled and added to the dataset for training the next generation of models. Following Ferbach et al. (2024), we adopt the generalized Bradley-Terry model (Bradley & Terry, 1952) to model the user's choice. Specifically, let $r(x)$ be an underlying reward function that captures the user's preference for one sample $x_i$ over another $x_j$. Then the probability that a sample $\hat{x} \in \{x_1, \dots, x_K\}$ is curated by the user is:

$$P(\hat{x} = x_k \mid x_1, \dots, x_K) = \frac{e^{r(x_k)}}{\sum_{j=1}^{K} e^{r(x_j)}}, \quad k \in [K] \tag{1}$$

Because curated data in practice may be noisy and adversarially manipulated, we use a mixture model to account for such adversarial behavior. Specifically, assume another competing platform employs malicious users to curate noisy data deviating from actual user preference. Let $\phi_t \in (0, 1)$ be the fraction of malicious users at round $t$ and $\tilde{r}_t(x)$ their underlying reward function. Then the probability that $\hat{x}$ is curated by a mixture of malicious and benign users is:

$$P(\hat{x} = x_k \mid x_1, \dots, x_K) = (1 - \phi_t)\,\frac{e^{r(x_k)}}{\sum_{j=1}^{K} e^{r(x_j)}} + \phi_t\,\frac{e^{\tilde{r}_t(x_k)}}{\sum_{j=1}^{K} e^{\tilde{r}_t(x_j)}}, \quad k \in [K] \tag{2}$$

We use $\hat{x} \sim \mathcal{BT}(x_1, \dots, x_K; \phi_t)$ to denote that $\hat{x}$ is sampled from $x_1, \dots, x_K$ according to probability (2).
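The mixture in Eq. (2) can be illustrated with a small sampler. This is only a sketch: the function names (`curation_probs`, `curate`), the toy quadratic rewards, and the choice of NumPy are assumptions made for illustration and do not come from the paper.

```python
import numpy as np

def softmax(v):
    """Numerically stable softmax, i.e. the Bradley-Terry choice probabilities of Eq. (1)."""
    v = np.asarray(v, dtype=float)
    e = np.exp(v - v.max())
    return e / e.sum()

def curation_probs(r_vals, r_adv_vals, phi):
    """Eq. (2): mix benign and malicious choice probabilities with weight phi."""
    return (1.0 - phi) * softmax(r_vals) + phi * softmax(r_adv_vals)

def curate(samples, reward, adv_reward, phi, rng):
    """Draw one curated sample x_hat from K candidates according to Eq. (2)."""
    r_vals = np.array([reward(x) for x in samples])
    a_vals = np.array([adv_reward(x) for x in samples])
    probs = curation_probs(r_vals, a_vals, phi)
    k = rng.choice(len(samples), p=probs)
    return samples[k], probs

def benign_reward(x):
    # Hypothetical benign preference: samples near the origin are preferred.
    return -float(np.sum(x ** 2))

def malicious_reward(x):
    # Hypothetical adversarial preference: the opposite ordering.
    return float(np.sum(x ** 2))

rng = np.random.default_rng(0)
candidates = rng.normal(size=(4, 2))  # K = 4 candidates in R^2
chosen, probs = curate(candidates, benign_reward, malicious_reward, phi=0.2, rng=rng)
```

Setting `phi=0.0` recovers the benign model of Eq. (1), which gives a quick sanity check on any implementation.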

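Self-consuming training loop. Given adversarially curated data, the platform updates its model at round $t+1$ according to the following, either solely on the distribution of curated synthetic data ($\lambda \to \infty$) or on a mixture of synthetic and real data ($\lambda < \infty$):

$$p_{t+1} = \arg\max_{p \in \mathcal{P}} \; \frac{1}{1+\lambda}\,\mathbb{E}_{x \sim p_{\text{data}}}\left[\log p(x)\right] + \frac{\lambda}{1+\lambda}\,\mathbb{E}_{\substack{x_1, \dots, x_K \sim p_t \\ \hat{x} \sim \mathcal{BT}(x_1, \dots, x_K; \phi_t)}}\left[\log p(\hat{x})\right] \tag{3}$$

where $\lambda \in [0, \infty)$ controls the fraction of real data and $\mathcal{P}$ is the set of achievable distributions with the platform's model.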

Objectives. A recent study (Ferbach et al., 2024) focused on a special case without malicious users ($\phi_t = 0$). They showed that if data are curated based on the benign users' reward function $r(x)$ and the generative model is updated solely using curated synthetic data, the self-consuming training loop can result in $p_t$ converging to a distribution that optimizes user preferences, i.e., as $t \to \infty$, $\mathbb{E}_{x \sim p_t}[e^{r(x)}]$ converges to the maximum reward, and its variance $\mathrm{Var}_{x \sim p_t}[e^{r(x)}]$ vanishes. However, in the presence of malicious users, it remains unclear how the model evolves and whether $p_t$ can align with actual user preferences in the long run. This paper explores this problem: we first examine the long-term impact of adversarially curated data on self-consuming models (Section 3) and then design attack algorithms that platforms can use to disrupt competitors' model training processes and misalign their models from user preferences (Section 4).

3. Impact of adversarially curated data

Next, we examine the evolution of self-consuming generative models with adversarially curated data. We begin with the case of purely synthetic data ($\lambda \to \infty$) and then generalize to a mixture of synthetic and real data ($\lambda < \infty$). All proofs can be found in Appendix B.

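Iterative retraining only on curated synthetic data. Without real data, the iterative retraining process reduces to:

$$p_{t+1} = \arg\max_{p \in \mathcal{P}} \; \mathbb{E}_{\substack{x_1, \dots, x_K \sim p_t \\ \hat{x} \sim \mathcal{BT}(x_1, \dots, x_K; \phi_t)}}\left[\log p(\hat{x})\right] \tag{4}$$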

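Lemma 3.1. Consider the asymptotic case where the number of samples users select from satisfies $K \to \infty$. Suppose $\mathbb{E}_{x \sim p_t}[e^{r(x)}] < \infty$ and $p_{t+1}$ follows Eq. (4). Then, we have

$$p_{t+1}(x) \;\longrightarrow\; p_t(x)\left[(1 - \phi_t)\,\frac{e^{r(x)}}{\mathbb{E}_{z \sim p_t}\left[e^{r(z)}\right]} + \phi_t\,\frac{e^{\tilde{r}_t(x)}}{\mathbb{E}_{z \sim p_t}\left[e^{\tilde{r}_t(z)}\right]}\right]$$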
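To build intuition for the update in Lemma 3.1, the limiting map can be iterated on a small discrete support. The sketch below is illustrative only: the four-point support, the reward values, and the helper names (`limit_update`, `run`) are assumptions, and it treats the limiting expression as the exact update.

```python
import numpy as np

def limit_update(p, r, r_adv, phi):
    """One step of the limiting update: reweight p by the mixture of tilts."""
    w, w_adv = np.exp(r), np.exp(r_adv)
    factor = (1.0 - phi) * w / np.dot(p, w) + phi * w_adv / np.dot(p, w_adv)
    return p * factor

def expected_exp_reward(p, r):
    """E_p[e^{r(x)}] on a discrete support."""
    return float(np.dot(p, np.exp(r)))

# Hypothetical 4-point support with increasing benign reward.
r = np.array([0.0, 0.5, 1.0, 1.5])
p0 = np.full(4, 0.25)

def run(r_adv, phi, steps=20):
    """Iterate the update and record E_{p_t}[e^{r(x)}] after every round."""
    p = p0.copy()
    history = [expected_exp_reward(p, r)]
    for _ in range(steps):
        p = limit_update(p, r, r_adv, phi)
        history.append(expected_exp_reward(p, r))
    return p, history

# Benign curation only (phi = 0): mass moves toward the highest-reward point.
p_benign, h_benign = run(r_adv=r, phi=0.0)

# Majority-malicious curation with an anti-correlated reward (assumed r_adv = -r).
p_adv, h_adv = run(r_adv=-r, phi=0.6)
```

Comparing `h_benign` and `h_adv` shows how the sign of the correlation between the two rewards changes the direction of the first update in this toy setting.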

Lemma 3.1 characterizes the relation between $p_{t+1}$ and $p_t$ in a self-consuming loop with adversarially curated data. Next, we analyze the evolution of $\mathbb{E}_{p_t}[e^{r(x)}]$, which quantifies the expected reward users experience from interacting with the generative model. We first introduce some technical assumptions similar to Ferbach et al. (2024).

Assumption 3.2. There exist finite constants $r_{t,\min}, r_{t,\max}, \tilde{r}_{t,\min}, \tilde{r}_{t,\max} \in \mathbb{R}$ such that, $p_t$-almost surely and for $x \sim p_{\text{data}}$, $r_{t,\min} = \inf_x r(x)$, $r_{t,\max} = \sup_x r(x)$, $\tilde{r}_{t,\min} = \inf_x \tilde{r}_t(x)$, $\tilde{r}_{t,\max} = \sup_x \tilde{r}_t(x)$.

In most realistic scenarios, this assumption holds because user preferences typically have finite support or are bounded in a probabilistic sense. Under this assumption, Lemma 3.3 below examines the impact of adversarially curated data on $\mathbb{E}_{p_{t+1}}[e^{r(x)}]$ and presents its upper and lower bounds.

Lemma 3.3. Let $p_{t+1}$ be the distribution induced from a discrete choice model in Eq. (4). Suppose Assumption 3.2 holds. Then the following holds:

$$\mathbb{E}_{p_{t+1}}\left[e^{r(x)}\right] \ge \mathbb{E}_{p_t}\left[e^{r(x)}\right] + \frac{g_{\mathrm{Var}}}{e^{r_{t,\max}}} + \frac{g_{\mathrm{Cov}}}{e^{\tilde{r}_{t,\min}}}$$

$$\mathbb{E}_{p_{t+1}}\left[e^{r(x)}\right] \le \mathbb{E}_{p_t}\left[e^{r(x)}\right] + \frac{g_{\mathrm{Var}}}{e^{r_{t,\min}}} + \frac{g_{\mathrm{Cov}}}{e^{\tilde{r}_{t,\max}}}$$

where

$$g_{\mathrm{Var}} := \frac{(1 - \phi_t)(K - 1)}{K}\,\mathrm{Var}_{p_t}\left[e^{r(x)}\right], \qquad g_{\mathrm{Cov}} := \frac{\phi_t (K - 1)}{K}\,\mathrm{Cov}_{p_t}\left[e^{r(x)}, e^{\tilde{r}_t(x)}\right]$$

According to Lemma 3.3, when $e^{r(x)}$ and $e^{\tilde{r}_t(x)}$ are positively correlated, i.e., $\mathrm{Cov}_{p_t}[e^{r(x)}, e^{\tilde{r}_t(x)}] \ge 0$, the expected reward increases, $\mathbb{E}_{p_{t+1}}[e^{r(x)}] \ge \mathbb{E}_{p_t}[e^{r(x)}]$, allowing the generative model to align with user preferences despite adversarially curated data. In this case, the expected reward converges to the maximum, highlighting the model's inherent robustness against noise and adversarial attacks. However, when $e^{r(x)}$ and $e^{\tilde{r}_t(x)}$ are negatively correlated, i.e., $\mathrm{Cov}_{p_t}[e^{r(x)}, e^{\tilde{r}_t(x)}] < 0$, convergence is no longer guaranteed. Instead, the expected reward may oscillate and deviate from the maximum value. This shows the model's potential vulnerability in scenarios where adversarially curated data introduces negative correlations with the user reward function. The rationale for the upper bound is discussed in Appendix B.3.
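To see where a variance term and a covariance term can come from, it helps to work out the one-step change directly in the $K \to \infty$ limit of Lemma 3.1. This is a sketch under that limit only: the finite-$K$ factor $(K-1)/K$ and the bounded-reward denominators of Lemma 3.3 are not rederived here, and the shorthand $w = e^{r(x)}$, $\tilde{w} = e^{\tilde{r}_t(x)}$ is introduced just for this note.

$$
\begin{aligned}
\mathbb{E}_{p_{t+1}}[w]
&= \mathbb{E}_{p_t}\!\left[w\left((1-\phi_t)\frac{w}{\mathbb{E}_{p_t}[w]} + \phi_t\frac{\tilde{w}}{\mathbb{E}_{p_t}[\tilde{w}]}\right)\right] \\
&= (1-\phi_t)\,\frac{\mathbb{E}_{p_t}[w^2]}{\mathbb{E}_{p_t}[w]} + \phi_t\,\frac{\mathbb{E}_{p_t}[w\tilde{w}]}{\mathbb{E}_{p_t}[\tilde{w}]} \\
&= \mathbb{E}_{p_t}[w] + (1-\phi_t)\,\frac{\mathrm{Var}_{p_t}[w]}{\mathbb{E}_{p_t}[w]} + \phi_t\,\frac{\mathrm{Cov}_{p_t}[w,\tilde{w}]}{\mathbb{E}_{p_t}[\tilde{w}]}
\end{aligned}
$$

The last step uses $\mathbb{E}[w^2] = \mathrm{Var}[w] + \mathbb{E}[w]^2$ and $\mathbb{E}[w\tilde{w}] = \mathrm{Cov}[w,\tilde{w}] + \mathbb{E}[w]\,\mathbb{E}[\tilde{w}]$. The variance correction is always non-negative, whereas the covariance correction carries the sign of $\mathrm{Cov}_{p_t}[w, \tilde{w}]$, so a sufficiently negative covariance weighted by a large enough $\phi_t$ can make the expected reward decrease in a given round.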

Iterative retraining on mixed real and synthetic data. Prior works such as Ferbach et al. (2024); Bertrand et al. (2024); Gerstgrasser et al. (2024); Alemohammad et al. (2024a) explored the role of real data in self-consuming generative models. Without data curation, Bertrand et al. (2024); Gerstgrasser et al. (2024); Alemohammad et al. (2024a) showed that retraining models with a mix of real and synthetic data helps stabilize the algorithm and prevents $p_t$ from deviating too much from $p_{\text{data}}$. When synthetic data is curated by users based on their preferences, $p_{\text{data}}$ is no longer a fixed point of the retraining loop, as different reward values can occur with positive probability; however, $\mathbb{E}_{p_t}[e^{r(x)}]$ still increases compared to $\mathbb{E}_{p_{\text{data}}}[e^{r(x)}]$ (Ferbach et al., 2024). Next, we study whether incorporating real data can help defend against adversarially curated data during iterative model retraining.

Lemma 3.4. Let $p_{t+1}$ be defined as in Eq. (3), with $p_0 = p_{\text{data}}$ and $\phi_t = \phi$ for all $t$. The following holds for all $t$:

Ept+1 h er(x)i Epdata h er(x)i (5)

+ ϕ (1 + λ)

where $\mathrm{Cov}_{\min} = \min_{i \in [t]} \mathrm{Cov}_{p_i}\left[e^{r(x)}, e^{\tilde{r}_i(x)}\right]$. As $t \to \infty$,

Ept+1 h er(x)i Epdata h er(x)i + ϕ (λ + 1) Covmin (6)

Lemma 3.4 implies that when $\mathrm{Cov}_{\min} > 0$, i.e., the adversarial reward values are positively correlated with the true rewards, the model can align with genuine user preferences and $\mathbb{E}_{p_t}[e^{r(x)}] > \mathbb{E}_{p_{\text{data}}}[e^{r(x)}]$ still holds. However, when $\mathrm{Cov}_{\min} < 0$, this is no longer the case, suggesting that simply adding real data is not sufficient to defend against adversarial curation.

4. Attack algorithms

As shown in Lemma 3.3, adversarially curated data may result in $\mathbb{E}_{x \sim p_t}[e^{r(x)}]$ deviating from the maximum reward in the long run. Next, we consider a competitive scenario illustrated in Fig. 1, where an adversarial platform employs malicious users to curate data on a target platform to compete for market share. Our goal is to design attack algorithms for the adversarial platform that guide malicious users to act in the most effective way to disrupt the target platform's model.

Learning reward model from pairwise comparisons. Unlike Ferbach et al. (2024), which assumes the user reward model $r(x)$ is known, we consider a more realistic setting in which the adversarial platform does not have access to $r(x)$ but must learn it from a dataset indicating user preferences. Here, we consider learning from pairwise comparisons (Wu et al., 2025; Liu et al., 2024; Zhou et al., 2025) as detailed below.

Let $D = \{(x_i, z_i, o_i)\}_{i=1}^{n}$ be the dataset the adversarial platform collects from benign users, where $x_i, z_i \in \mathbb{R}^d$ are the $i$-th pair of data samples acquired from the target platform and $o_i \in \{0, 0.5, 1\}$ indicates the user's preference among them$^1$. Specifically, $o_i = 0$ if $x_i$ is preferred to $z_i$ ($x_i \succ z_i$), $o_i = 1$ if $z_i$ is preferred to $x_i$ ($x_i \prec z_i$), and $o_i = 0.5$ if $x_i$ and $z_i$ are equally preferred. In this paper, we assume adversarial and target platforms face users with identically distributed preferences, so that the adversarial platform can learn $r(x)$ from $D$.

Let $R_\theta$ be a parametric reward model of $r(x)$ learned from preference data $D$, where $\theta \in \Theta$ is the parameter. A typical method is Maximum Likelihood Estimation (MLE), which minimizes the following loss function:

$$L(D; \theta) = -\sum_{i} \Big[(1 - o_i)\log \Pr\{x_i \succ z_i \mid R_\theta\} + o_i \log \Pr\{z_i \succ x_i \mid R_\theta\}\Big] \tag{7}$$

where the preference label $o_i$ is generated according to

$$o_i \sim \Pr\{z_i \succ x_i \mid r\} = \frac{e^{r(z_i)}}{e^{r(x_i)} + e^{r(z_i)}} \tag{8}$$
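As a concrete instance of Eqs. (7) and (8), the sketch below fits a reward model to pairwise labels by gradient descent on the negative log-likelihood. The linear form $R_\theta(x) = \theta^\top x$, the synthetic data, the step size, and the function names are all assumptions chosen to keep the example small; the paper's reward models are described in its Appendix C.1.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def bt_nll(theta, X, Z, o):
    """Negative log-likelihood of Eq. (7) for a linear reward R_theta(x) = theta @ x."""
    p_z = sigmoid((Z - X) @ theta)  # Pr{z preferred to x | R_theta}, as in Eq. (8)
    eps = 1e-12                     # guards against log(0)
    return -np.sum((1.0 - o) * np.log(1.0 - p_z + eps) + o * np.log(p_z + eps))

def fit_reward(X, Z, o, lr=0.5, steps=2000):
    """Plain gradient descent on the NLL averaged over pairs."""
    U = Z - X
    theta = np.zeros(X.shape[1])
    for _ in range(steps):
        grad = U.T @ (sigmoid(U @ theta) - o) / len(o)
        theta -= lr * grad
    return theta

# Synthetic preference data generated from a hypothetical true reward.
rng = np.random.default_rng(0)
theta_true = np.array([1.0, -2.0])
X = rng.normal(size=(2000, 2))
Z = rng.normal(size=(2000, 2))
o = (rng.random(2000) < sigmoid((Z - X) @ theta_true)).astype(float)

theta_hat = fit_reward(X, Z, o)
```

With soft labels such as $o_i = 0.5$ the same loss applies unchanged, since Eq. (7) is linear in $o_i$.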

Objective and constraint of the adversarial platform. To compete against the target platform, the adversarial platform employs malicious users to adversarially curate data on the target platform, aiming to cause the generative models iteratively trained on this curated data to deviate from user preferences. The behavior of malicious users can be formalized as flipping the preference label $o_i$ of data pairs. Specifically, given $D = \{(x_i, z_i, o_i)\}_{i=1}^{n}$, malicious users can flip preference $o_i$ to $o_i + \delta_i$, resulting in an adversarial dataset $\tilde{D}(\delta) = \{(x_i, z_i, o_i + \delta_i)\}_{i=1}^{n}$, where $\delta_i \in \{-1, 0, 1\}$ represents the label perturbation and $\delta := \{\delta_i\}_{i=1}^{n}$. In practice, there is a possibility that malicious users fail to curate data adversarially on the target platform, or that the curated data is not selected for training the future model. To account for this, we impose a constraint on the total number of label flips, $\sum_{i=1}^{n} |\delta_i| \le \kappa n$, where $\kappa \in (0, 1)$ represents the success rate of perturbing the data on the target platform.

$^1$To obtain this dataset, the adversarial platform can first acquire $K$ data samples $\{x_1, \dots, x_K\}$ from the target platform and then present them to its users to select. Suppose $K = 3$ and the user selects $x_2$; then the data pairs $\{(x_1, x_2, 1), (x_2, x_3, 0)\}$ are added to $D$.

The goal of the adversarial platform is to find $\delta$ such that the resulting perturbed dataset $\tilde{D}(\delta)$, when mixed with benign users' preference data (via malicious users' adversarial data curation), disrupts the target platform's alignment with user preferences. Formally, let $p_{t+1}$ be the self-consuming model of the target platform trained from such adversarially curated data. The goal is to reduce the expected reward $\mathbb{E}_{x \sim p_t}[e^{r(x)}]$ such that it deviates from the maximum reward in the long run, i.e.,

$$\mathbb{E}_{p_{t+1}}\left[e^{R_\theta(x)}\right] < \mathbb{E}_{p_t}\left[e^{R_\theta(x)}\right] \tag{9}$$

where $R_\theta$ is the parametric reward model learned from benign users. Based on Lemma 3.3, we formulate the following optimization for the adversarial platform:

$$\min_{\delta} \; J(\delta) := \mathrm{Cov}_{p_t}\left[e^{R_\theta(x)}, e^{\tilde{R}_{\tilde{\theta}}(x)}\right] + \alpha\,\mathrm{dist}\left(R_\theta, \tilde{R}_{\tilde{\theta}}\right) \quad \text{s.t.} \quad \tilde{\theta} \in \arg\min_{\theta} L\left(\tilde{D}(\delta), \theta\right) \tag{10}$$

where $\tilde{R}_{\tilde{\theta}}$ is the parametric reward model learned from the perturbed preference data $\tilde{D}(\delta)$, which may belong to a different function family than $R_\theta$. To achieve the objective in Eq. (9), we aim to make $\mathrm{Cov}_{p_t}[e^{R_\theta(x)}, e^{\tilde{R}_{\tilde{\theta}}(x)}]$ as negative as possible. Meanwhile, to prevent the adversarial behavior from being easily detected as anomalous, we impose a penalty on the difference between $\tilde{R}_{\tilde{\theta}}$ and $R_\theta$, quantified by $\mathrm{dist}(R_\theta, \tilde{R}_{\tilde{\theta}})$, which can be defined as $\mathbb{E}_{p_t}[d(R_\theta(x), \tilde{R}_{\tilde{\theta}}(x))]$ for some distance metric $d$ (e.g., the $\ell_p$ norm).

Dynamic attack during iterative training. Since the target platform dynamically updates its model over time, the adversarial platform must repeatedly solve optimization (10) as $p_t$ evolves. In practice, the adversarial platform can periodically interact with the target platform to acquire data pairs $x_i^{(t)}, z_i^{(t)} \sim p_t$ and collect user preference $o_i^{(t)}$ from its own customers. The dataset $D^{(t)} = \{(x_i^{(t)}, z_i^{(t)}, o_i^{(t)})\}_{i=1}^{n}$ can first be used to fine-tune the benign reward model $R_\theta$ and then to solve for $\delta^{(t)}$ in optimization (10). The resulting $\delta^{(t)}$ can then guide the malicious users in curating data on the target platform at round $t$. Such adversarially curated data is subsequently used by the target platform to retrain its model $p_{t+1}$. Over time, this iterative interaction may cause the target platform to deviate significantly from user preferences.

Challenges in solving optimization (10). The solution to (10) is difficult to find because it is a bi-level optimization problem. Moreover, the variables to be optimized are a subset of preference comparison labels, which involves solving a combinatorial optimization problem over a discrete space. Next, we tackle these challenges and introduce two methods for finding approximate solutions to optimization (10).

4.1. Gradient-based methods

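To tackle the discrete decision space, we relax the action space $\delta_i \in \{-1, 0, 1\}$ to the interval $\delta_i \in [-1, 1]$. If reward models are differentiable with respect to $\tilde{\theta}$, then we can compute the gradient of the objective function in (10) as follows:

$$\nabla_{\delta} J(\delta) = \nabla_{\delta}\left(\mathrm{Cov}_{p_t}\left[e^{R_\theta(x)}, e^{\tilde{R}_{\tilde{\theta}}(x)}\right] + \alpha\,\mathrm{dist}\left(R_\theta, \tilde{R}_{\tilde{\theta}}\right)\right) = \nabla_{\tilde{\theta}}\left(\mathrm{Cov}_{p_t}\left[e^{R_\theta(x)}, e^{\tilde{R}_{\tilde{\theta}}(x)}\right] + \alpha\,\mathrm{dist}\left(R_\theta, \tilde{R}_{\tilde{\theta}}\right)\right)\frac{d\tilde{\theta}}{d\delta}$$

Recall that $\tilde{R}_{\tilde{\theta}}$ is the reward model trained from the perturbed dataset $\tilde{D}(\delta)$, so the parameter $\tilde{\theta}$ is a function of $\delta$. Similar to Wu et al. (2025), we can leverage the implicit function theorem (Mei & Zhu, 2015; Koh & Liang, 2017) to compute the implicit derivative $\frac{d\tilde{\theta}}{d\delta}$:

$$\frac{d\tilde{\theta}}{d\delta} = -\left[H_{\tilde{\theta}} L\right]^{-1}\frac{d\,\nabla_{\tilde{\theta}} L}{d\delta}$$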

where $H_{\tilde{\theta}} L$ is the Hessian of $L$ with respect to $\tilde{\theta}$. Given the gradient, we can then apply projected gradient descent to find the optimal $\delta^*$. Since the original action space is $\delta_i \in \{-1, 0, 1\}$ and the total number of flips is constrained by $\sum_{i=1}^{n} |\delta_i| \le \kappa n$, we select the top $\kappa n$ entries of $\delta^*$ based on magnitude $|\delta_i^*|$ and only flip the labels $o_i$ associated with them. The complete procedure is shown in Algorithm 1.
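The implicit-derivative step can be made concrete for a small model. The sketch below assumes a linear reward $\tilde{R}_{\tilde{\theta}}(x) = \tilde{\theta}^\top x$, an added L2 penalty `reg` (an assumption, used so the Hessian is invertible), and $\alpha = 0$ so that only the covariance term is differentiated. All names (`fit`, `implicit_jacobian`, `S`) are hypothetical. For this loss, $\nabla_{\tilde{\theta}} L = \sum_i (\sigma(\tilde{\theta}^\top u_i) - o_i - \delta_i)\,u_i + \text{reg}\cdot\tilde{\theta}$ with $u_i = z_i - x_i$, so the cross derivative with respect to $\delta$ is $-U^\top$.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def hessian(U, theta, reg):
    """Hessian of the regularised pairwise logistic loss w.r.t. theta."""
    s = sigmoid(U @ theta)
    return (U * (s * (1.0 - s))[:, None]).T @ U + reg * np.eye(U.shape[1])

def fit(U, labels, reg=1.0, iters=50):
    """Newton's method for theta_tilde; U[i] = z_i - x_i, labels = o + delta."""
    theta = np.zeros(U.shape[1])
    for _ in range(iters):
        grad = U.T @ (sigmoid(U @ theta) - labels) + reg * theta
        theta = theta - np.linalg.solve(hessian(U, theta, reg), grad)
    return theta

def implicit_jacobian(U, theta, reg=1.0):
    """d theta / d delta = -H^{-1} d(grad_theta L)/d(delta); shape (d, n)."""
    cross = -U.T  # grad_theta L depends on delta only through -(o_i + delta_i) * u_i
    return -np.linalg.solve(hessian(U, theta, reg), cross)

rng = np.random.default_rng(0)
n, d = 40, 2
X, Z = rng.normal(size=(n, d)), rng.normal(size=(n, d))
U = Z - X
o = (rng.random(n) < sigmoid(U @ np.array([0.5, -0.5]))).astype(float)
delta = np.zeros(n)

theta_benign = fit(U, o)             # R_theta, trained on clean labels
theta_tilde = fit(U, o + delta)      # perturbed model (equal to benign at delta = 0)
J = implicit_jacobian(U, theta_tilde)

# Chain rule for the covariance term, estimated on stand-in samples S from p_t.
S = rng.normal(size=(200, d))
a = np.exp(S @ theta_benign)
b = np.exp(S @ theta_tilde)
grad_theta_cov = S.T @ ((a - a.mean()) * b) / len(S)
grad_delta = J.T @ grad_theta_cov    # one entry per preference label
```

A finite-difference check (refit after nudging a single `delta[j]`) is a cheap way to validate the Jacobian before plugging it into a projected gradient loop.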

The efficiency of this algorithm is primarily influenced by two key factors: batch size and model structure. Batch size plays a critical role in the accuracy of covariance estimation, which is essential for aligning the computed covariance with the true data distribution. While larger batch sizes generally yield more precise covariance estimates and reduce the discrepancy between theoretical and practical outcomes, they also incur greater memory requirements and computational overhead. Meanwhile, the structure of the initialized model directly influences the size of the implicit Hessian matrix, which emerges during gradient computations involving second-order derivatives. These calculations are computationally expensive. Although approximate methods can be used to estimate the Hessian matrix and reduce the computational burden, models with complex architectures, such as deep neural networks with many layers and parameters, substantially increase both the computational time and memory requirements for these calculations.

Algorithm 1 Gradient-based attack

Input: Benign preference data $D$, parameter $\kappa$
Train the reward model $R_\theta$ of benign users on $D$;
Randomly initialize $\delta$, ensuring $(o_i + \delta_i) \in [-1, 1]$;
for $m = 1$ to $M$ do
  Create a perturbed dataset $\tilde{D}(\delta)$;
  Train a new reward model $\tilde{R}_{\tilde{\theta}}$ on the dataset $\tilde{D}(\delta)$;
  Compute the gradient $\nabla_{\delta} J(\delta)$;
  $\delta \leftarrow \delta - \eta \nabla_{\delta} J(\delta)$;
  Clip $\delta$ such that $(o_i + \delta_i) \in [-1, 1]$, $\forall i \in [n]$;
end for
Select the top $\kappa n$ indices based on $|\delta_i|$;
Flip the preference label of the corresponding data pair;
Output: Label perturbations $\delta$, reward model $\tilde{R}_{\tilde{\theta}}$

4.2. Heuristic methods

The high computational costs associated with gradient-based calculations for complex models present a significant challenge. To mitigate this issue, we propose leveraging heuristic methods as an alternative. These methods eliminate the need for explicit gradient computation, providing a more computationally efficient approach, as detailed below.

Reward-based heuristic method. Instead of directly optimizing perturbations $\delta$ to minimize $\mathrm{Cov}_{p_t}[e^{R_\theta(x)}, e^{\tilde{R}_{\tilde{\theta}}(x)}] + \alpha\,\mathrm{dist}(R_\theta, \tilde{R}_{\tilde{\theta}})$, we adopt a heuristic approach by flipping the preference label $o_i$ based on the rewards of data samples. Specifically, given preference data $D = \{(x_i, z_i, o_i)\}_{i=1}^{n}$, we first learn the reward model $R_\theta$ from $D$. The idea is to identify $\kappa n$ data pairs $(x_i, z_i)$ based on their rewards $R_\theta(x_i), R_\theta(z_i)$ such that flipping their preference label $o_i$ has the greatest impact on the underlying reward model.

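We propose two methods for finding such data pairs: (i) finding $(x_i, z_i)$ based on the dissimilarity between $R_\theta(x_i)$ and $R_\theta(z_i)$; (ii) finding $(x_i, z_i)$ based on the maximum of $|R_\theta(x_i)|$ and $|R_\theta(z_i)|$. Specifically, define $f : \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}_+$ as

$$f(x, z) := |R_\theta(x) - R_\theta(z)| \quad \text{or} \quad \max\{|R_\theta(x)|, |R_\theta(z)|\}.$$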

We select the $\kappa n$ data pairs $(x_i, z_i)$ with the highest $f(x_i, z_i)$ values to flip their preference label. Intuitively, a larger $|R_\theta(x_i) - R_\theta(z_i)|$ indicates greater differences in user preferences for the data pair $(x_i, z_i)$, making the preference flip more impactful. Similarly, a larger $\max\{|R_\theta(x_i)|, |R_\theta(z_i)|\}$ suggests that the pair includes a sample that is either highly favored or strongly disliked by the user, making the preference flip most effective.
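The selection rule above is straightforward to sketch. In the example below, the reward values are made-up numbers, the function names are hypothetical, and a "flip" is taken to mean $o_i \mapsto 1 - o_i$ (so ties at 0.5 are left unchanged), which is one reading of $\delta_i \in \{-1, 0, 1\}$ rather than something the text spells out.

```python
import numpy as np

def select_flips(rx, rz, kappa, rule="gap"):
    """Return indices of the floor(kappa * n) pairs with the largest score f(x_i, z_i)."""
    rx, rz = np.asarray(rx, dtype=float), np.asarray(rz, dtype=float)
    if rule == "gap":      # f = |R(x) - R(z)|
        score = np.abs(rx - rz)
    elif rule == "max":    # f = max(|R(x)|, |R(z)|)
        score = np.maximum(np.abs(rx), np.abs(rz))
    else:
        raise ValueError(f"unknown rule: {rule}")
    budget = int(np.floor(kappa * len(score)))
    return np.argsort(-score, kind="stable")[:budget]

def flip_labels(o, idx):
    """Flip the selected preference labels: 0 <-> 1, with 0.5 left as is."""
    flipped = np.array(o, dtype=float)
    flipped[idx] = 1.0 - flipped[idx]
    return flipped

# Hypothetical learned rewards R_theta(x_i), R_theta(z_i) and labels o_i.
rx = [0.1, 2.0, -1.0, 0.3, 0.0]
rz = [0.2, -1.0, -0.9, 0.3, 3.0]
o = [1.0, 0.0, 1.0, 0.5, 1.0]

idx = select_flips(rx, rz, kappa=0.4, rule="gap")
o_adv = flip_labels(o, idx)
```

Because no reward model is retrained and no gradient is computed, the cost is a single pass over the learned rewards plus a sort.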

Multi-objective heuristic method. Since optimization problem (10) simultaneously considers two competing objectives, we also propose a heuristic method that finds the Pareto front (Ngatchou et al., 2005) for multi-objective optimization. Specifically, a solution is considered Pareto optimal if no other solution exists that can improve one objective without degrading at least one other objective; the set of all Pareto optimal solutions forms the Pareto front.

Our method begins by generating $\kappa n$ random perturbations $\delta$ to flip data $D$, resulting in $\tilde{D}(\delta)$. For each perturbation, we record the empirical performances of $\mathrm{Cov}_{p_t}[e^{R_\theta(x)}, e^{\tilde{R}_{\tilde{\theta}}(x)}]$ and $\mathrm{dist}(R_\theta, \tilde{R}_{\tilde{\theta}})$. Using a Pareto optimization algorithm, such as NSGA-II (Deb et al., 2002), we compute the Pareto front of non-dominated solutions, which represents the best trade-offs between objectives. Finally, we select the optimal solution from the Pareto front based on a specific prioritized objective, and apply the corresponding flip.
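The notion of a non-dominated set can be illustrated with a brute-force filter over recorded (covariance, distance) pairs. This is not NSGA-II; it is a simple quadratic-time check that is adequate for a small pool of candidates. The candidate values, the distance threshold, and the function names are assumptions made for the example.

```python
import numpy as np

def pareto_front(points):
    """Indices of non-dominated rows when every column is to be minimised."""
    pts = np.asarray(points, dtype=float)
    keep = []
    for i, p in enumerate(pts):
        # Row q dominates p if q <= p in every objective and q < p in at least one.
        dominated = np.any(np.all(pts <= p, axis=1) & np.any(pts < p, axis=1))
        if not dominated:
            keep.append(i)
    return keep

def pick_prioritised(points, front, max_dist):
    """Among front members within a distance cap, take the most negative covariance."""
    feasible = [i for i in front if points[i][1] <= max_dist]
    return min(feasible, key=lambda i: points[i][0])

# Hypothetical (covariance, dist) measurements for five random perturbations.
candidates = [(-0.5, 0.9), (-0.3, 0.2), (-0.1, 0.1), (-0.2, 0.5), (-0.5, 1.0)]

front = pareto_front(candidates)
best = pick_prioritised(candidates, front, max_dist=0.5)
```

Prioritising covariance subject to a cap on the distance is only one possible way to choose from the front; other priorities would select a different member.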

5. Experiments

In this section, we conduct experiments on both synthetic and real data to validate our theorems and proposed algorithms. We first evaluate the evolution of the generative model $p_t$ and user reward $\mathbb{E}_{p_t}[r(x)]$ under the self-consuming training loop (Section 5.1). Then, we demonstrate the effectiveness of the proposed attack algorithms (Section 5.2).

Datasets. Similar to Ferbach et al. (2024), we conduct experiments on three datasets:

1. Synthetic Gaussian: A dataset following an 8-mode Gaussian mixture model; the details are in Appendix C.2.

2. CIFAR-10 (Krizhevsky, 2009): It contains 60,000 images from 10 classes {airplane := 0, automobile := 1, bird := 2, cat := 3, deer := 4, dog := 5, frog := 6, horse := 7, ship := 8, truck := 9}.

3. CIFAR-100 (Krizhevsky, 2009): It contains 100 classes, each with 600 images. Class labels from 0 to 99 are assigned according to the alphabetical order of class names, i.e., {aquatic mammals-beaver := 0, $\dots$, vehicles 2-tractor := 99}.

We present the results for CIFAR-10 and CIFAR-100 below, while the results for the Gaussian data are provided in Appendix C.2.

Reward functions and user preference labels. Using the Gaussian, CIFAR-10, and CIFAR-100 datasets, we can construct a user preference dataset $D = \{(x_i, z_i, o_i)\}_{i=1}^{n}$ using a reward function $r(x)$. Specifically, given a data pair $(x_i, z_i)$ sampled from the Gaussian, CIFAR-10, or CIFAR-100 datasets, we generate the corresponding preference label $o_i \sim \Pr\{z_i \succ x_i \mid r\}$ based on Eq. (8). The reward function $r(x)$ for each dataset is defined as follows:

[Figure 2 plots: three panels titled "Benign Curation", "Adversarial Curation (Gradient)", and "Adversarial Curation (Random)"; horizontal axis ticks 0 to 10; legend: airplane, automobile, bird, cat, deer, dog, frog, horse, ship, truck.]

Figure 2. The proportion of each class generated by the self-consuming model retrained with different data curation methods on CIFAR-10: benign curation based on actual user preferences (left), adversarial curation using the proposed gradient-based attack algorithm (middle), and adversarial curation via a random attack (right). The results show that the proposed gradient-based attacks are the most effective in deviating the model from user preferences.

[Figure 3 plots: three panels titled "Benign Curation", "Adversarial Curation (Gradient)", and "Avg. Rewards" (curves: Benign Curation, Malicious (20%)); horizontal axis ticks 0 to 10; class-group legend: #0-#9, #10-#19, #20-#29, #30-#39, #40-#49, #50-#59, #60-#69, #70-#79, #80-#89, #90-#99.]

Figure 3. The proportion of each group of ten classes generated by the self-consuming model retrained with different data curation methods on CIFAR-100: benign curation based on actual user preferences (left) and adversarial curation using the proposed gradient-based attack algorithm (middle). The right panel shows the empirical estimate of $\mathbb{E}_{p_t}[r(x)]$ from samples generated by the model over iterations.

1. Synthetic Gaussian: $r(x) := \gamma \max\{0, \|x - \mu\| - \tau\}$, where $\|x - \mu\|$ is the distance from one Gaussian center $\mu$, $\tau$ is the minimum clipped radius, and $\gamma$ controls the scaling of the reward (details are in Appendix C.2). Intuitively, this function captures user preferences by assigning higher rewards to samples farther from the Gaussian center within a threshold.

2. CIFAR-10: First identify the label $I(x) \in \{0, \dots, 9\}$ of image $x$ by a pretrained VGG11 (Simonyan & Zisserman, 2015) classifier with 92.79% test accuracy. Suppose users prefer classes with smaller indices and define $r(x) := 10 - I(x)$. It reflects the user's preference ordering by assigning higher rewards to images classified closer to the most preferred class (class 0) in the hierarchy.

3. CIFAR-100: Similar to CIFAR-10, the label $I(x) \in \{0, \dots, 99\}$ of image $x$ is first identified by a pretrained ResNet56 (He et al., 2016) classifier with 72.63% test accuracy. Suppose users prefer classes with smaller indices and define $r(x) := 100 - I(x)$.
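The reward definitions and the label sampling of Eq. (8) can be sketched in a few lines. The snippet follows the formulas as written in the text above; the function names, the example values of $\mu$, $\tau$, and $\gamma$, and the use of an already-computed class index in place of a classifier call are assumptions made for illustration.

```python
import numpy as np

def gaussian_reward(x, mu, tau, gamma):
    """r(x) = gamma * max(0, ||x - mu|| - tau), as stated for the synthetic dataset."""
    dist = float(np.linalg.norm(np.asarray(x, dtype=float) - np.asarray(mu, dtype=float)))
    return gamma * max(0.0, dist - tau)

def class_reward(label, num_classes):
    """r(x) = num_classes - I(x); `label` stands in for the classifier output I(x)."""
    return num_classes - label

def sample_preference(r_x, r_z, rng):
    """Draw o_i per Eq. (8): returns 1.0 (z preferred) with probability e^{r_z} / (e^{r_x} + e^{r_z})."""
    p_z = 1.0 / (1.0 + np.exp(r_x - r_z))
    return 1.0 if rng.random() < p_z else 0.0

rng = np.random.default_rng(0)

# CIFAR-10 style example: class 0 (airplane) versus class 9 (truck).
r_airplane = class_reward(0, num_classes=10)
r_truck = class_reward(9, num_classes=10)
label = sample_preference(r_airplane, r_truck, rng)  # usually 0.0: x (airplane) preferred

# Synthetic example with hypothetical parameters.
r_far = gaussian_reward([3.0, 4.0], mu=[0.0, 0.0], tau=1.0, gamma=2.0)
r_near = gaussian_reward([0.5, 0.0], mu=[0.0, 0.0], tau=1.0, gamma=2.0)
```

In an actual pipeline, `label` in `class_reward` would come from the pretrained classifier's prediction for each generated image.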

Generative model and reward model. Following Ferbach et al. (2024), we iteratively retrain a denoising diffusion probabilistic model (DDPM) (Ho et al., 2020), a generative framework known for its ability to model complex data distributions through a reversible diffusion process. In addition to the generative model, the target and adversarial platforms leverage user preference data $D = \{(x_i, z_i, o_i)\}_{i=1}^{n}$ to train a reward model $R_\theta$. The adversarial platform also trains $\tilde{R}_{\tilde{\theta}}$ using perturbed preference data $\tilde{D}(\delta)$. The reward models $R_\theta$ and $\tilde{R}_{\tilde{\theta}}$ may or may not have the same architecture; we discuss details in Appendix C.1.

Iterative retraining process. At each iteration, the generative model produces 50,000 data samples, of which 25,000 are selected (after curation by the reward model) for retraining the next-generation model. In Appendix C, we provide details on the process of (adversarial) data curation and their interaction with the generative model training. The complete process is presented in Algorithm 2.
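One way to picture the curation step is the following sketch. It assumes pairwise (K = 2) Bradley-Terry selection, so that 50,000 generated samples form 25,000 pairs and one sample per pair is kept. The exact curation procedure is specified in Appendix C, so the pairing scheme, the name `curate_pairwise`, and the toy class-label "samples" are assumptions made for illustration.

```python
import math
import random

def curate_pairwise(samples, reward_fn, rng):
    """Keep one sample from each consecutive pair, chosen with Bradley-Terry probability."""
    kept = []
    for a, b in zip(samples[0::2], samples[1::2]):
        # P(a is kept) = e^{R(a)} / (e^{R(a)} + e^{R(b)}) = sigmoid(R(a) - R(b))
        p_a = 1.0 / (1.0 + math.exp(reward_fn(b) - reward_fn(a)))
        kept.append(a if rng.random() < p_a else b)
    return kept

rng = random.Random(0)
# Toy stand-in for generated images: each sample is just its class label in {0, ..., 9}.
generated = [rng.randrange(10) for _ in range(50_000)]
curated = curate_pairwise(generated, lambda c: 10 - c, rng)
print(len(curated))
print(sum(generated) / len(generated), sum(curated) / len(curated))
```

Under the benign reward $10 - I(x)$, the mean class index of the curated set drops below that of the generated set, which is the shift that retraining then amplifies.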

5.1. Evolution of model under adversarial data curation

Figure 4. Empirical estimate of $\mathbb{E}_{p_t}[r(x)]$ from model-generated samples over iterations on CIFAR-10 (horizontal axis: iterations 0 to 10; vertical axis: average reward; curves: benign curation, gradient-based attack, random attack): it increases (resp. decreases) over iterations under benign (resp. adversarial) data curation.

First, we examine the impact of adversarial data curation on the self-consuming generative model $p_t$ and the respective reward $\mathbb{E}_{p_t}[r(x)]$. Specifically, we employ both the gradient-based attack (Algorithm 1) and the random attack (i.e., flipping preference labels uniformly at random) to target 20% of data pairs.
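The random-attack baseline can be sketched as follows. This is a minimal illustration, assuming preference data are triples $(x, y, o)$ with $o \in \{0, 0.5, 1\}$ and that "flipping" means replacing $o$ by $1 - o$. The function name `random_flip_attack` and the toy data are hypothetical.

```python
import random

def random_flip_attack(pairs, kappa, rng):
    """Flip the preference label o -> 1 - o on a uniformly random fraction kappa of pairs."""
    n_flip = int(round(kappa * len(pairs)))
    flip_idx = set(rng.sample(range(len(pairs)), n_flip))
    return [(x, y, 1 - o) if i in flip_idx else (x, y, o)
            for i, (x, y, o) in enumerate(pairs)]

# Toy preference data: 100 pairs, all labelled o = 1.
pairs = [(i, i + 1, 1) for i in range(100)]
attacked = random_flip_attack(pairs, kappa=0.2, rng=random.Random(0))
print(sum(1 for before, after in zip(pairs, attacked) if before[2] != after[2]))
```

Note that ties ($o = 0.5$) are unchanged by this flip, so the effective number of altered labels can be smaller than the targeted 20% when ties are present.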

Fig. 2 shows how the proportion of each generated class evolves during the iterative retraining process on CIFAR-10 under three types of data curation: benign curation, adversarial curation via gradient-based attacks, and adversarial curation via random attacks. Without adversarially curated data (left), the generative model gradually aligns with human preferences, producing an increasing number of samples from class 0 (airplane), the most preferred class in CIFAR-10. This observation is consistent with Ferbach et al. (2024). Under random attacks, while the model does not align as well with user preferences as in benign curation, it still generates a reasonable proportion of samples from the most favored classes, 0 (airplane) and 1 (automobile). In contrast, with our proposed attack algorithm, the model becomes highly misaligned with user preferences, generating samples predominantly from the least favored classes: 7 (horse), 8 (ship), and 9 (truck).

Fig. 4 shows the empirical estimate of $\mathbb{E}_{p_t}[r(x)]$ for generated samples over iterations on CIFAR-10, with observations consistent with Lemma 3.3. Under benign curation, the expected reward steadily increases as the model aligns with user preferences. In contrast, adversarial curation using the gradient-based attack continuously lowers the reward, indicating a significant deviation from the optimal reward distribution. Random attacks initially lead to a slight increase in rewards, suggesting some robustness. However, as attacks progress, the average reward fluctuates due to the inherent randomness of this method. Indeed, the outcomes for random attacks vary significantly across different runs of experiments, and we provide additional examples in Appendix C.1.

Table 1. Effectiveness of one-round attacks under different methods on the same CIFAR-10 dataset. The results are empirical estimates of $\mathbb{E}_{p_1}[r(x)]$ for generated samples at t = 1; a method with a lower reward is more effective.

METHOD         AVG. R
BENIGN         6.3606
GRADIENT #1    5.6460
GRADIENT #2    5.5959
PARETO         5.4740
R-BASED #1     5.5982
R-BASED #2     5.5612

We extend our experiments to CIFAR-100, which contains a larger number of classes. Fig. 3 illustrates the different behaviors under benign curation and adversarial curation using the gradient-based attack. With benign curation, the model progressively generates more samples from user-preferred classes, reflecting an increasing alignment with user preferences over time. In contrast, under adversarial curation, the model generates more samples from less preferred classes (i.e., classes 60 to 99). The average rewards shown on the right further highlight this difference: benign curation leads to a steady increase in expected rewards, whereas adversarial curation causes a continuous decline, indicating growing misalignment between the model and user preferences.

We also conducted iterative model retraining on a mixture of adversarially curated synthetic and real data on CIFAR-10, where adversarial data generated through gradient-based attacks is combined with real data at varying proportions. Fig. 5 shows the class proportions of the generated samples over iterations. The results demonstrate that incorporating real data helps align the distribution of generated samples with the real data distribution. However, it does not steer the model toward the user-preferred distribution. This suggests that adding a limited amount of real data is insufficient and fails to defend against adversarially curated data.
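The mixing step can be illustrated with a short sketch. It assumes the ratio "real : adversarially curated" fixes the composition of a training set of a given total size and that items are drawn without replacement. The function `mix_datasets` and the toy tagged items are hypothetical stand-ins for the actual image sets.

```python
import random

def mix_datasets(real, curated, real_parts, curated_parts, total, rng):
    """Draw `total` training items with real:curated = real_parts:curated_parts."""
    n_real = round(total * real_parts / (real_parts + curated_parts))
    n_curated = total - n_real
    return rng.sample(real, n_real) + rng.sample(curated, n_curated)

real = [("real", i) for i in range(1000)]
curated = [("curated", i) for i in range(1000)]
rng = random.Random(0)
# The three ratios used in the experiment: 1:1, 4:1, and 1:4.
for real_parts, curated_parts in [(1, 1), (4, 1), (1, 4)]:
    mixed = mix_datasets(real, curated, real_parts, curated_parts, 500, rng)
    print(real_parts, curated_parts, sum(1 for tag, _ in mixed if tag == "real"))
```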

To examine model quality, we present synthetic images generated by self-consuming models under both adversarial and benign data curation in Fig. 7 (Appendix C.1). The results show that the image quality does not significantly degrade during the iterative retraining process.

5.2. Effectiveness of attack methods

To evaluate the effectiveness of various attack algorithms, we applied different attack methods to the same CIFAR-10 dataset, flipping 20% of the data pairs. Table 1 summarizes the empirical estimates of $\mathbb{E}_{p_1}[r(x)]$ for generated samples at t = 1. Comparisons include benign curation (BENIGN); gradient-based attacks in which the target platform and the adversarial platform use different (GRADIENT #1) or identical (GRADIENT #2) reward models; reward-based heuristic methods with $f(x, z) = |R_\theta(x) - R_\theta(z)|$ (R-BASED #1) or $f(x, z) = \max\{|R_\theta(x)|, |R_\theta(z)|\}$ (R-BASED #2); and a multi-objective heuristic method (PARETO), where we select the solution from the Pareto front with the lowest sum of the two metrics. We also show the class proportions of generated samples over iterations in Fig. 8 (Appendix C.1).
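The selection rule used for PARETO (pick the point on the Pareto front with the lowest sum of the two metrics) can be sketched as below. The sketch assumes both metrics are to be minimized and uses made-up candidate values. It does not implement the search that produces the candidates, and the names `pareto_front` and `pick_min_sum` are hypothetical.

```python
def pareto_front(points):
    """Return the points not dominated by any other point (both objectives minimized)."""
    front = []
    for p in points:
        dominated = any(
            q[0] <= p[0] and q[1] <= p[1] and (q[0] < p[0] or q[1] < p[1])
            for q in points
        )
        if not dominated:
            front.append(p)
    return front

def pick_min_sum(points):
    """Among Pareto-optimal points, choose the one with the lowest sum of the two metrics."""
    return min(pareto_front(points), key=lambda p: p[0] + p[1])

# Hypothetical (metric_1, metric_2) values for five candidate solutions.
candidates = [(1.0, 5.0), (2.0, 2.0), (4.0, 1.0), (3.0, 3.0), (5.0, 5.0)]
print(pareto_front(candidates))
print(pick_min_sum(candidates))
```

The quadratic-time dominance check is adequate for a small candidate set; an evolutionary method such as NSGA-II (Deb et al., 2002) would typically be used to generate the candidates themselves.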

Figure 5. The proportion of each class (airplane, automobile, bird, cat, deer, dog, frog, horse, ship, truck) generated over iterations 0 to 10 by the self-consuming model retrained with a mix of adversarially curated synthetic and real CIFAR-10 data, for real-data to adversarial-curation ratios of 1:1, 4:1, and 1:4. It shows that adding real data only helps the model align with the real data distribution $p_{\text{data}}$ but does not defend against adversarial data curation.

Overall, the results demonstrate the effectiveness of all attack methods in this experimental scenario. Despite employing different reward model architectures, gradient-based attack methods consistently maintained high efficacy. In contrast, the relatively weaker performance of heuristic methods is reflected in their higher average rewards. As shown in Fig. 8, this is primarily due to heuristic methods failing to account for all classes comprehensively, resulting in limited effects on certain classes. The multi-objective heuristic method performs best by exploring a broader range of potential solutions, yielding the lowest average reward. However, this performance comes at the cost of significantly higher computational time and resource requirements.

It is important to note that the effectiveness of these attack methods may depend heavily on the specific context, and no single method can be considered optimal. Future research could focus on developing solutions that are more general and universally applicable.

6. Discussion

In this section, we discuss the robustness of existing defense strategies and analyze the practical challenges of defending against our proposed attack. We also discuss key assumptions made in our analysis and identify limitations that may affect applicability in real-world scenarios.

Defense. A common strategy proposed in prior work to stabilize the self-consuming retraining loop of generative models is to regularly inject real data during training (Alemohammad et al., 2024a; Bertrand et al., 2024). However, as shown in the experiments in Fig. 5, adding real data only partially mitigates adversarial effects by driving the model closer to the true data distribution pdata. It does not effectively prevent model misalignment under targeted adversarial curation attacks.

Although outlier detection may help identify and filter adversarially curated samples, it can inadvertently remove genuine preferences. When users are heterogeneous and come from multiple groups, removing genuine preferences from minority groups may introduce biases. Additionally, our attack algorithm already accounts for such defense mechanisms: when formulating the attack in Eq. (10), we impose a penalty term $\mathrm{dist}(R_\theta, \tilde{R}_{\tilde{\theta}})$ to prevent the adversarial behavior from being easily detected as anomalous.

Limitations. Our theoretical analysis assumes that each model update converges to the global optimum of the training objective. While our experiments validate the effectiveness of the proposed attack under this assumption, such convergence may not always hold in practice due to optimization noise, local minima, or limited training budgets.

Additionally, our experiments rely on a known success ratio κ to control the adversarial curation. In real-world scenarios, however, the attacker may not have direct access to or precise knowledge of the effective success rate, which could affect the practical impact of the attack.

7. Conclusion

This paper examines the evolution of self-consuming generative models under adversarially curated data. We theoretically analyze the impact of adversarial data curation on these models and identify conditions for the (in)stability of the iterative retraining process. Building on theoretical insights, we develop attack algorithms that effectively disrupt model training and prevent alignment with user preferences. The findings highlight the potential vulnerability of self-consuming generative models to adversarial data curation, suggesting that developing effective defense mechanisms could be a promising direction for future work.

Acknowledgement

This material is based upon work supported by the U.S. National Science Foundation under awards IIS-2202699 and IIS-2416895, by the OSU President's Research Excellence Accelerator Grant, and by grants from the Ohio State University's Translational Data Analytics Institute and College of Engineering Strategic Research Initiative.


Impact Statement

This paper advances the field of Machine Learning by examining the long-term impact of the self-consuming retraining loop on generative models, as well as the role of (adversarial) data curation in this process. There are many potential societal consequences of our work, none of which we feel must be specifically highlighted here.

References

AI, R. Resemble AI - AI voice generation and cloning platform. https://www.resemble.ai/, 2025. Accessed: 2025-01-10.

Alemohammad, S., Casco-Rodriguez, J., Luzi, L., Humayun, A. I., Babaei, H., LeJeune, D., Siahkoohi, A., and Baraniuk, R. Self-consuming generative models go MAD. In The Twelfth International Conference on Learning Representations, 2024a. URL https://openreview.net/forum?id=ShjMHfmPs0.

Alemohammad, S., Humayun, A. I., Agarwal, S., Collomosse, J., and Baraniuk, R. Self-improving diffusion models with synthetic data, 2024b. URL https://arxiv.org/abs/2408.16333.

Anthropic. Claude - A next-generation AI assistant by Anthropic. https://www.anthropic.com/index/claude-introduction, 2025. Accessed: 2025-01-10.

Baumgärtner, T., Gao, Y., Alon, D., and Metzler, D. Best-of-venom: Attacking RLHF by injecting poisoned preference data. In First Conference on Language Modeling, 2024. URL https://openreview.net/forum?id=v74mJURD1L.

Bertrand, Q., Bose, J., Duplessis, A., Jiralerspong, M., and Gidel, G. On the stability of iterative retraining of generative models on their own data. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=JORAfH2xFd.

Biggio, B., Nelson, B., and Laskov, P. Support vector machines under adversarial label noise. In Proceedings of the Asian Conference on Machine Learning, volume 20, pp. 97–112, South Garden Hotels and Resorts, Taoyuan, Taiwan, 2011. PMLR. URL https://proceedings.mlr.press/v20/biggio11.html.

Bradley, R. A. and Terry, M. E. Rank analysis of incomplete block designs: I. The method of paired comparisons. Biometrika, 39:324–345, 1952.

Carlini, N., Jagielski, M., Choquette-Choo, C. A., Paleka, D., Pearce, W., Anderson, H., Terzis, A., Thomas, K., and Tramèr, F. Poisoning web-scale training datasets is practical. In 2024 IEEE Symposium on Security and Privacy (SP), pp. 407–425, 2024. doi: 10.1109/SP54263.2024.00179.

Chen, T., Hirota, Y., Otani, M., Garcia, N., and Nakashima, Y. Would deep generative models amplify bias in future models? In 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 10833–10843, 2024. doi: 10.1109/CVPR52733.2024.01030.

Deb, K., Pratap, A., Agarwal, S., and Meyarivan, T. A fast and elitist multiobjective genetic algorithm: NSGA-II. IEEE Transactions on Evolutionary Computation, 6(2):182–197, 2002. doi: 10.1109/4235.996017.

Diffusion, S. Stable Diffusion - open-source AI for creating images from text. https://stabledifffusion.com/, 2025. Accessed: 2025-01-10.

Ferbach, D., Bertrand, Q., Bose, A. J., and Gidel, G. Self-consuming generative models with curated data provably optimize human preferences. In Advances in Neural Information Processing Systems, volume 37, pp. 102531–102567. Curran Associates, Inc., 2024.

Gerstgrasser, M., Schaeffer, R., Dey, A., Rafailov, R., Korbak, T., Sleight, H., Agrawal, R., Hughes, J., Pai, D. B., Gromov, A., Roberts, D., Yang, D., Donoho, D. L., and Koyejo, S. Is model collapse inevitable? Breaking the curse of recursion by accumulating real and synthetic data. In First Conference on Language Modeling, 2024. URL https://openreview.net/forum?id=5B2K4LRgmz.

Gillman, N., Freeman, M., Aggarwal, D., Hsu, C.-H., Luo, C., Tian, Y., and Sun, C. Self-correcting self-consuming loops for generative model training. In Scaling Self-Improving Foundation Models without Human Supervision, 2025. URL https://openreview.net/forum?id=O1B95aIlFn.

He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 770–778, 2016. doi: 10.1109/CVPR.2016.90.

Ho, J., Jain, A., and Abbeel, P. Denoising diffusion probabilistic models. In Advances in Neural Information Processing Systems, volume 33, pp. 6840–6851. Curran Associates, Inc., 2020.

Jiang, S., Kadhe, S., Zhou, Y., Cai, L., and Baracaldo, N. Forcing generative models to degenerate ones: The power of data poisoning attacks. In NeurIPS 2023 Workshop on Backdoors in Deep Learning - The Good, the Bad, and the Ugly, 2024. URL https://openreview.net/forum?id=8R4z3XZt5J.

Koh, P. W. and Liang, P. Understanding black-box predictions via influence functions. In Proceedings of the 34th International Conference on Machine Learning, volume 70, pp. 1885–1894. PMLR, 2017. URL https://proceedings.mlr.press/v70/koh17a.html.

Krizhevsky, A. Learning multiple layers of features from tiny images. 2009. URL https://www.cs.toronto.edu/~kriz/cifar.html.

Labs, P. Pika Labs - AI-generated videos from text. https://www.pikalabs.com/, 2025. Accessed: 2025-01-10.

Liu, C., Li, B., Vorobeychik, Y., and Oprea, A. Robust linear regression against training data poisoning. In Proceedings of the 10th ACM Workshop on Artificial Intelligence and Security, AISec '17, pp. 91–102, New York, NY, USA, 2017. Association for Computing Machinery. doi: 10.1145/3128572.3140447.

Liu, Y., Zhou, H., Guo, Z., Shareghi, E., Vulić, I., Korhonen, A., and Collier, N. Aligning with human judgement: The role of pairwise preference in large language model evaluators. In First Conference on Language Modeling, 2024. URL https://openreview.net/forum?id=9gdZI7c6yr.

Mei, S. and Zhu, X. Using machine teaching to identify optimal training-set attacks on machine learners. In Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence, AAAI '15, pp. 2871–2877. AAAI Press, 2015.

Midjourney. Midjourney - AI-generated art platform. https://www.midjourney.com/, 2025. Accessed: 2025-01-10.

ML, R. Runway ML - AI tools for creators. https://runwayml.com/, 2025. Accessed: 2025-01-10.

Ngatchou, P., Zarei, A., and El-Sharkawi, A. Pareto multi objective optimization. In Proceedings of the 13th International Conference on Intelligent Systems Application to Power Systems, pp. 84–91, 2005. doi: 10.1109/ISAP.2005.1599245.

OpenAI. GPT-4 technical report, 2024. URL https://arxiv.org/abs/2303.08774.

OpenAI. ChatGPT - Conversational AI by OpenAI. https://openai.com/chatgpt, 2025. Accessed: 2025-01-10.

Pan, J., Sun, K., Ge, Y., Li, H., Duan, H., Wu, X., Zhang, R., Zhou, A., Qin, Z., Wang, Y., Dai, J., Qiao, Y., and Li, H. JourneyDB: A benchmark for generative image understanding, 2023.

Paudice, A., Muñoz-González, L., and Lupu, E. C. Label sanitization against label flipping poisoning attacks. In ECML PKDD 2018 Workshops, pp. 5–15. Springer International Publishing, 2019.

Shumailov, I., Shumaylov, Z., Zhao, Y., Papernot, N., Anderson, R., and Gal, Y. AI models collapse when trained on recursively generated data. Nature, 631(8022):755–759, Jul 2024. URL https://doi.org/10.1038/s41586-024-07566-y.

Simonyan, K. and Zisserman, A. Very deep convolutional networks for large-scale image recognition. In The 3rd International Conference on Learning Representations, 2015. URL http://arxiv.org/abs/1409.1556.

Suya, F., Zhang, X., Tian, Y., and Evans, D. What distributions are robust to indiscriminate poisoning attacks for linear learners? In Advances in Neural Information Processing Systems, volume 36, pp. 34942–34980. Curran Associates, Inc., 2023.

Taori, R. and Hashimoto, T. Data feedback loops: Model-driven amplification of dataset biases. In Proceedings of the 40th International Conference on Machine Learning, volume 202, pp. 33883–33920. PMLR, 2023. URL https://proceedings.mlr.press/v202/taori23a.html.

Vorobeychik, Y. and Kantarcioglu, M. Data Poisoning Attacks, pp. 77–98. Springer International Publishing, 2018. URL https://doi.org/10.1007/978-3-031-01580-9_6.

Wu, J., Wang, J., Xiao, C., Wang, C., Zhang, N., and Vorobeychik, Y. Preference poisoning attacks on reward model learning. In 2025 IEEE Symposium on Security and Privacy (SP), pp. 1622–1640, 2025. URL https://doi.ieeecomputersociety.org/10.1109/SP61157.2025.00094.

Wyllie, S., Shumailov, I., and Papernot, N. Fairness feedback loops: Training on synthetic data amplifies bias. In Proceedings of the 2024 ACM Conference on Fairness, Accountability, and Transparency, FAccT '24, pp. 2113–2147. Association for Computing Machinery, 2024. URL https://doi.org/10.1145/3630106.3659029.

Xie, T. and Zhang, X. Automating data annotation under strategic human agents: Risks and potential solutions. In The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024. URL https://openreview.net/forum?id=2UJLv3KPGO.

Zeng, Y., Pan, M., Jahagirdar, H., Jin, M., Lyu, L., and Jia, R. Meta-Sift: How to sift out a clean subset in the presence of data poisoning? In 32nd USENIX Security Symposium, pp. 1667–1684. USENIX Association, 2023.

Zhang, H., Li, Y., Ding, B., and Gao, J. Practical data poisoning attack against next-item recommendation. Proceedings of The Web Conference 2020, 2020. URL https://api.semanticscholar.org/CorpusID:215416122.

Zhou, E., Zheng, G., Wang, B., Xi, Z., Dou, S., Bao, R., Shen, W., Xiong, L., Fan, J., Mou, Y., Zheng, R., Gui, T., Zhang, Q., and Huang, X. RMB: Comprehensively benchmarking reward models in LLM alignment. In The Thirteenth International Conference on Learning Representations, 2025. URL https://openreview.net/forum?id=kmgrlG9TR0.


A. Related work

A.1. Self-consuming generative models

Generative models have demonstrated remarkable success in synthesizing high-quality data. Recent research has focused on the challenges faced by generative models trained iteratively on their own synthetic data. Despite slight differences in the definition of collapse, Alemohammad et al. (2024a), Shumailov et al. (2024), and Gerstgrasser et al. (2024) all confirm that the model collapses when trained exclusively on synthetic datasets, leading to a degradation in quality or diversity over successive generations. Bertrand et al. (2024) and Taori & Hashimoto (2023) theorize that this collapse arises from the nature of the iterative training process. Another potential consequence of this iterative training is bias amplification, where the generative model amplifies specific features while neglecting other equally important data characteristics (Chen et al., 2024; Taori & Hashimoto, 2023; Wyllie et al., 2024).

There are two main solutions to this problem. One effective approach to stabilizing generative models is to introduce real data into the training process at each iteration. Alemohammad et al. (2024a) provided empirical evidence that injecting fresh real data mitigates model collapse. Bertrand et al. (2024) further substantiated this observation with theoretical proofs, demonstrating that maintaining a sufficient proportion of real data ensures the stability of iterative training. Gerstgrasser et al. (2024) leverages cumulative data, where previously generated samples are stored and reused alongside new real data to stabilize the performance of the generative model. This approach aligns with practical data accumulation strategies. An alternative strategy is to introduce corrective mechanisms to prevent model collapse. Alemohammad et al. (2024b) proposed a self-improvement framework for generative diffusion models (SIMS), which mitigates model collapse by employing a negative guidance mechanism during the generation process. Gillman et al. (2025) introduced a self-correcting generative model training loop, where synthetic data undergoes transformation through expert-informed correction functions before being reintroduced into the training set. This method significantly enhances stability by ensuring that synthetic samples retain high fidelity and do not reinforce existing biases.

Recent studies have also examined the role of human feedback in iterative retraining. Ferbach et al. (2024) explored how user-curated synthetic samples can implicitly optimize a model's reward function, aligning generative outputs with human preferences. While preference alignment improves the user experience, it can also introduce systematic biases that may be exploited by adversaries. This potential issue forms the basis for the exploration in our work.

A.2. Data poisoning attack

A data poisoning attack disrupts model learning by modifying the training data (Vorobeychik & Kantarcioglu, 2018). Its two major forms are the injection of poisoned data and label flipping.

One common data poisoning technique involves adding maliciously produced samples to the training dataset. In generative models, Jiang et al. (2024) demonstrate that poisoning even a small fraction of the training data can significantly alter the model's output, such as in text summarization or text completion tasks. Carlini et al. (2024) show how adversaries can inject poisoned examples into datasets at minimal cost by exploiting web-based data collection mechanisms, making this a practical threat rather than a merely theoretical concern. Furthermore, Baumgärtner et al. (2024) explored the vulnerability of large language model (LLM) fine-tuning to poisoning attacks. In this scenario, an adversary can skillfully modify a small portion of the training data, injecting undetectable biases into the generated text, potentially leading to misinformation or biased outputs.

The label flipping attack modifies the underlying true labels assigned to a subset of training samples. This attack is particularly effective in classification and regression tasks (Biggio et al., 2011; Liu et al., 2017; Paudice et al., 2019; Suya et al., 2023), where incorrect labels can mislead the model's learning process. In recommendation systems, Zhang et al. (2020) proposed an adversarial reinforcement learning approach that strategically flips labels to manipulate item rankings. Similarly, in Reinforcement Learning from Human Feedback (RLHF) training, preference poisoning attacks introduce incorrect label flips in reward datasets, causing reinforcement learning models to produce biased responses (Wu et al., 2025). Several studies have explored defenses against label flipping attacks. Traditional methods include outlier detection techniques (Zeng et al., 2023), which identify and discard suspicious training samples. Paudice et al. (2019) introduced a robust filtering approach that leverages bilevel optimization to identify and remove mislabeled samples from poisoned datasets, providing an effective solution against label flipping attacks.


Unlike previous research that focuses on flipping labels to promote or demote a specific target, our approach emphasizes perturbing the entire alignment process. Notably, our attacker does not require access to data collection pipelines or backend systems; instead, they can act entirely through public feedback mechanisms, such as voting or ranking systems. This makes our approach more practical and harder to detect, as it unfolds gradually over time without direct data manipulation. Whereas traditional poisoning attacks often aim to induce outright model failure, our objective is more subtle: gradually misaligning the model from genuine user preferences over time in a self-consuming training loop. In competitive settings, such gradual misalignment can be highly damaging while remaining difficult to trace.

B.1. Proof of Lemma 3.1

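Lemma B.1. Let $p_{t+1}$ be defined as in Equation (4). If we assume that $\mathbb{E}_{z \sim p_t}[e^{r(z)}] < \infty$, then we have for all $x \in \mathbb{R}^d$,

$$p_{t+1}(x) \xrightarrow[K \to \infty]{} p_t(x)\left[(1-\phi_t)\frac{e^{r(x)}}{\mathbb{E}_{z \sim p_t}\left[e^{r(z)}\right]} + \phi_t\frac{e^{\tilde{r}_t(x)}}{\mathbb{E}_{z \sim p_t}\left[e^{\tilde{r}_t(z)}\right]}\right].$$

Proof. Consider the limit $K \to \infty$. By minimization of the cross-entropy, we know that for any distribution $q$, $\arg\max_p \mathbb{E}_{x \sim q}[\log p(x)] = q$.

So, sample $K$ samples from $p_t$ independently and identically distributed, then sample $i_K$ with the following probability:

$$P(i_K = i \mid x_1, \dots, x_K) = \phi_t\frac{e^{\tilde{r}_t(x_i)}}{\sum_{j=1}^{K} e^{\tilde{r}_t(x_j)}} + (1-\phi_t)\frac{e^{r(x_i)}}{\sum_{j=1}^{K} e^{r(x_j)}}.$$

Noting that the events $\{i_K = i\}_{i=1}^{K}$ are disjoint, the resulting density can be written:

$$\begin{aligned}
p_{t+1}(x) &= \sum_{i=1}^{K}\int_{y_j,\, j \neq i} p_t(y_1, \dots, y_{i-1}, x, y_{i+1}, \dots, y_K)\, P(i_K = i \mid x, y_j, j \neq i) \prod_{j \neq i} dy_j \\
&= K\int_{y_1, \dots, y_{K-1}} p_t(y_1, \dots, y_{K-1}, x)\, P(i_K = K \mid y_1, \dots, y_{K-1}, x)\, dy_1 \cdots dy_{K-1} \\
&= p_t(x)\,K(1-\phi_t)\int \frac{e^{r(x)}}{e^{r(x)} + \sum_{i=1}^{K-1} e^{r(y_i)}}\, p_t(y_1)\cdots p_t(y_{K-1})\, dy_1 \cdots dy_{K-1} \\
&\quad + p_t(x)\,K\phi_t\int \frac{e^{\tilde{r}_t(x)}}{e^{\tilde{r}_t(x)} + \sum_{i=1}^{K-1} e^{\tilde{r}_t(y_i)}}\, p_t(y_1)\cdots p_t(y_{K-1})\, dy_1 \cdots dy_{K-1} \\
&= p_t(x)\left[(1-\phi_t)H^K_{p_t}(x) + \phi_t\tilde{H}^K_{p_t}(x)\right],
\end{aligned}$$

where

$$H^K_{p_t}(x) = \int \frac{e^{r(x)}}{\frac{e^{r(x)}}{K} + \frac{1}{K}\sum_{i=1}^{K-1} e^{r(y_i)}}\, p_t(y_1)\cdots p_t(y_{K-1})\, dy_1 \cdots dy_{K-1},$$

$$\tilde{H}^K_{p_t}(x) = \int \frac{e^{\tilde{r}_t(x)}}{\frac{e^{\tilde{r}_t(x)}}{K} + \frac{1}{K}\sum_{i=1}^{K-1} e^{\tilde{r}_t(y_i)}}\, p_t(y_1)\cdots p_t(y_{K-1})\, dy_1 \cdots dy_{K-1}.$$

As $K \to \infty$,

$$H^K_{p_t}(x) \xrightarrow[K \to \infty]{} \frac{e^{r(x)}}{\mathbb{E}_{z \sim p_t}\left[e^{r(z)}\right]}, \qquad \tilde{H}^K_{p_t}(x) \xrightarrow[K \to \infty]{} \frac{e^{\tilde{r}_t(x)}}{\mathbb{E}_{z \sim p_t}\left[e^{\tilde{r}_t(z)}\right]}.$$

Therefore,

$$p_{t+1}(x) \xrightarrow[K \to \infty]{} p_t(x)\left[(1-\phi_t)\frac{e^{r(x)}}{\mathbb{E}_{z \sim p_t}\left[e^{r(z)}\right]} + \phi_t\frac{e^{\tilde{r}_t(x)}}{\mathbb{E}_{z \sim p_t}\left[e^{\tilde{r}_t(z)}\right]}\right].$$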
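The limit in Lemma B.1 can be checked numerically on a small discrete example. The sketch below is an illustration under stated assumptions: a three-point support, hypothetical reward values for $r$ and $\tilde{r}_t$, and a Monte Carlo estimate of the $K$-wise selection density. The names `limit_density` and `kwise_density` are not from the paper.

```python
import math
import random

def limit_density(p, r, r_adv, phi):
    """Right-hand side of the lemma: p(x) * [(1-phi) e^r / E[e^r] + phi e^r~ / E[e^r~]]."""
    z = sum(pi * math.exp(ri) for pi, ri in zip(p, r))
    z_adv = sum(pi * math.exp(ai) for pi, ai in zip(p, r_adv))
    return [pi * ((1 - phi) * math.exp(ri) / z + phi * math.exp(ai) / z_adv)
            for pi, ri, ai in zip(p, r, r_adv)]

def kwise_density(p, r, r_adv, phi, K, trials, rng):
    """Monte Carlo estimate of p_{t+1} under K-wise selection with the mixed choice rule."""
    est = [0.0] * len(p)
    for _ in range(trials):
        draws = rng.choices(range(len(p)), weights=p, k=K)
        s = sum(math.exp(r[d]) for d in draws)
        s_adv = sum(math.exp(r_adv[d]) for d in draws)
        for d in draws:
            # Probability that this particular draw is the selected one.
            est[d] += (1 - phi) * math.exp(r[d]) / s + phi * math.exp(r_adv[d]) / s_adv
    return [e / trials for e in est]

p = [0.5, 0.3, 0.2]          # current model p_t on a 3-point support
r = [0.0, 1.0, 2.0]          # true reward (hypothetical values)
r_adv = [2.0, 1.0, 0.0]      # adversarial reward (hypothetical values)
estimate = kwise_density(p, r, r_adv, phi=0.3, K=200, trials=300, rng=random.Random(0))
target = limit_density(p, r, r_adv, phi=0.3)
print([round(v, 3) for v in estimate])
print([round(v, 3) for v in target])
```

For moderately large $K$ the two printed vectors should agree closely, with the gap shrinking as $K$ grows.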

B.2. Additional lemma: the reward expectation is not increasing

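Without assuming that the reward is bounded, we can show using Jensen's inequality that:

Lemma B.2. When performing $K$-wise filtering, for all $t \geq 0$, the expected reward satisfies

$$\mathbb{E}_{p_{t+1}}\left[e^{r(x)}\right] \geq \mathbb{E}_{x \sim p_t}\left[e^{r(x)}\right] + \phi_t\,\mathrm{Cov}_{x \sim p_t}\left[e^{r(x)}, e^{\tilde{r}_t(x)}\right]. \tag{13}$$

Proof. Let $Z = \frac{K-1}{K} \cdot \frac{\sum_{i=1}^{K-1} e^{r(z_i)}}{K-1}$ with $z_1, \dots, z_{K-1} \sim p_t$, and define $\tilde{Z}$ analogously with $\tilde{r}_t$ in place of $r$. By Jensen's inequality, for any $x$ (the map $z \mapsto \frac{a}{b+z}$ is convex):

$$H^K_{p_t}(x) = \mathbb{E}_Z\left[\frac{e^{r(x)}}{\frac{e^{r(x)}}{K} + Z}\right] \geq \frac{e^{r(x)}}{\frac{e^{r(x)}}{K} + \mathbb{E}[Z]} = \frac{e^{r(x)}}{\frac{e^{r(x)}}{K} + \frac{K-1}{K}\mathbb{E}_{p_t}\left[e^{r(x)}\right]},$$

$$\tilde{H}^K_{p_t}(x) = \mathbb{E}_{\tilde{Z}}\left[\frac{e^{\tilde{r}_t(x)}}{\frac{e^{\tilde{r}_t(x)}}{K} + \tilde{Z}}\right] \geq \frac{e^{\tilde{r}_t(x)}}{\frac{e^{\tilde{r}_t(x)}}{K} + \mathbb{E}[\tilde{Z}]} = \frac{e^{\tilde{r}_t(x)}}{\frac{e^{\tilde{r}_t(x)}}{K} + \frac{K-1}{K}\mathbb{E}_{p_t}\left[e^{\tilde{r}_t(x)}\right]}.$$

Then, using $\mathbb{E}_{x \sim p_t}[f(x)] = \int p_t(x) f(x)\,dx$,

$$\begin{aligned}
\mathbb{E}_{p_{t+1}}\left[e^{r(x)}\right] &= \int e^{r(x)} p_t(x)\left[(1-\phi_t)H^K_{p_t}(x) + \phi_t\tilde{H}^K_{p_t}(x)\right] dx \\
&= (1-\phi_t)\int e^{r(x)} p_t(x) H^K_{p_t}(x)\,dx + \phi_t\int e^{r(x)} p_t(x)\tilde{H}^K_{p_t}(x)\,dx \\
&\geq (1-\phi_t)\int p_t(x)\frac{e^{2r(x)}}{\frac{e^{r(x)}}{K} + \frac{K-1}{K}\mathbb{E}_{z \sim p_t}\left[e^{r(z)}\right]}\,dx + \phi_t\int p_t(x)\frac{e^{r(x)}e^{\tilde{r}_t(x)}}{\frac{e^{\tilde{r}_t(x)}}{K} + \frac{K-1}{K}\mathbb{E}_{\tilde{z} \sim p_t}\left[e^{\tilde{r}_t(\tilde{z})}\right]}\,dx \\
&= (1-\phi_t)\,\mathbb{E}_{x \sim p_t}\left[\frac{e^{2r(x)}}{\frac{e^{r(x)}}{K} + \frac{K-1}{K}\mathbb{E}_{z \sim p_t}\left[e^{r(z)}\right]}\right] + \phi_t\,\mathbb{E}_{x \sim p_t}\left[\frac{e^{r(x)}e^{\tilde{r}_t(x)}}{\frac{e^{\tilde{r}_t(x)}}{K} + \frac{K-1}{K}\mathbb{E}_{\tilde{z} \sim p_t}\left[e^{\tilde{r}_t(\tilde{z})}\right]}\right] \\
&\geq (1-\phi_t)\frac{\mathbb{E}_{x \sim p_t}\left[e^{r(x)}\right]^2}{\frac{\mathbb{E}_{x \sim p_t}\left[e^{r(x)}\right]}{K} + \frac{K-1}{K}\mathbb{E}_{z \sim p_t}\left[e^{r(z)}\right]} + \phi_t\frac{\mathbb{E}_{x \sim p_t}\left[e^{r(x)}e^{\tilde{r}_t(x)}\right]}{\frac{\mathbb{E}_{x \sim p_t}\left[e^{\tilde{r}_t(x)}\right]}{K} + \frac{K-1}{K}\mathbb{E}_{z \sim p_t}\left[e^{\tilde{r}_t(z)}\right]} \quad \text{(Jensen's inequality)} \\
&= (1-\phi_t)\,\mathbb{E}_{x \sim p_t}\left[e^{r(x)}\right] + \phi_t\frac{\mathbb{E}_{x \sim p_t}\left[e^{r(x)}e^{\tilde{r}_t(x)}\right]}{\mathbb{E}_{x \sim p_t}\left[e^{\tilde{r}_t(x)}\right]} \\
&= (1-\phi_t)\,\mathbb{E}_{x \sim p_t}\left[e^{r(x)}\right] + \phi_t\,\mathrm{Cov}_{x \sim p_t}\left[e^{r(x)}, e^{\tilde{r}_t(x)}\right] + \phi_t\,\mathbb{E}_{x \sim p_t}\left[e^{r(x)}\right] \\
&= \mathbb{E}_{x \sim p_t}\left[e^{r(x)}\right] + \phi_t\,\mathrm{Cov}_{x \sim p_t}\left[e^{r(x)}, e^{\tilde{r}_t(x)}\right].
\end{aligned}$$

Hence,

$$\mathbb{E}_{p_{t+1}}\left[e^{r(x)}\right] \geq \mathbb{E}_{x \sim p_t}\left[e^{r(x)}\right] + \phi_t\,\mathrm{Cov}_{x \sim p_t}\left[e^{r(x)}, e^{\tilde{r}_t(x)}\right].$$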


B.3. Proof of Lemma 3.3

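Lemma B.3. Let $p_{t+1}$ be the distribution induced from a discrete choice model in Eq. (4). Suppose Assumption 3.2 holds. Then the expected reward satisfies

$$\mathbb{E}_{p_{t+1}}\left[e^{r(x)}\right] \geq \mathbb{E}_{p_t}\left[e^{r(x)}\right] + \frac{(1-\phi_t)(K-1)}{K}\cdot\frac{\mathrm{Var}_{p_t}\left[e^{r(x)}\right]}{e^{r_{t,\max}}} + \frac{\phi_t(K-1)}{K}\cdot\frac{\mathrm{Cov}_{p_t}\left[e^{r(x)}, e^{\tilde{r}_t(x)}\right]}{e^{\tilde{r}_{t,\min}}},$$

$$\mathbb{E}_{p_{t+1}}\left[e^{r(x)}\right] \leq \mathbb{E}_{p_t}\left[e^{r(x)}\right] + \frac{(1-\phi_t)(K-1)}{K}\cdot\frac{\mathrm{Var}_{p_t}\left[e^{r(x)}\right]}{e^{r_{t,\min}}} + \frac{\phi_t(K-1)}{K}\cdot\frac{\mathrm{Cov}_{p_t}\left[e^{r(x)}, e^{\tilde{r}_t(x)}\right]}{e^{\tilde{r}_{t,\max}}}.$$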

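Proof.

$$\begin{aligned}
K\,\mathbb{E}_{p_{t+1}}\left[e^{r(x)}\right] &= K\int_{x_1, \dots, x_K}\left[(1-\phi_t)\frac{\sum_{k=1}^{K} e^{2r(x_k)}}{\sum_{k=1}^{K} e^{r(x_k)}} + \phi_t\frac{\sum_{k=1}^{K} e^{\tilde{r}_t(x_k)}e^{r(x_k)}}{\sum_{k=1}^{K} e^{\tilde{r}_t(x_k)}}\right]\prod_{k=1}^{K} p_t(x_k)\,dx_k \\
&= (1-\phi_t)\int_{x_1, \dots, x_K} K\,\frac{\sum_{k=1}^{K} e^{2r(x_k)}}{\sum_{k=1}^{K} e^{r(x_k)}}\prod_{k=1}^{K} p_t(x_k)\,dx_k + \phi_t\int_{x_1, \dots, x_K} K\,\frac{\sum_{k=1}^{K} e^{\tilde{r}_t(x_k)}e^{r(x_k)}}{\sum_{k=1}^{K} e^{\tilde{r}_t(x_k)}}\prod_{k=1}^{K} p_t(x_k)\,dx_k \\
&= (1-\phi_t)\int\sum_{j=1}^{K}\left[e^{r(x_j)}\frac{\sum_{k=1}^{K} e^{r(x_k)}}{\sum_{k=1}^{K} e^{r(x_k)}} + e^{r(x_j)}\frac{(K-1)e^{r(x_j)} - \sum_{i \neq j} e^{r(x_i)}}{\sum_{k=1}^{K} e^{r(x_k)}}\right]\prod_{k=1}^{K} p_t(x_k)\,dx_k \\
&\quad + \phi_t\int\sum_{j=1}^{K}\left[e^{r(x_j)}\frac{\sum_{k=1}^{K} e^{\tilde{r}_t(x_k)}}{\sum_{k=1}^{K} e^{\tilde{r}_t(x_k)}} + e^{r(x_j)}\frac{(K-1)e^{\tilde{r}_t(x_j)} - \sum_{i \neq j} e^{\tilde{r}_t(x_i)}}{\sum_{k=1}^{K} e^{\tilde{r}_t(x_k)}}\right]\prod_{k=1}^{K} p_t(x_k)\,dx_k \\
&= (1-\phi_t)\left[K\,\mathbb{E}_{p_t}\left[e^{r(x)}\right] + \int\frac{\sum_{i<j}\left(e^{r(x_i)} - e^{r(x_j)}\right)^2}{\sum_{k=1}^{K} e^{r(x_k)}}\prod_{k=1}^{K} p_t(x_k)\,dx_k\right] \\
&\quad + \phi_t\left[K\,\mathbb{E}_{p_t}\left[e^{r(x)}\right] + \int\frac{\sum_{i<j}\left(e^{r(x_i)} - e^{r(x_j)}\right)\left(e^{\tilde{r}_t(x_i)} - e^{\tilde{r}_t(x_j)}\right)}{\sum_{k=1}^{K} e^{\tilde{r}_t(x_k)}}\prod_{k=1}^{K} p_t(x_k)\,dx_k\right],
\end{aligned}$$

where we denote

$$A = \int\frac{\sum_{i<j}\left(e^{r(x_i)} - e^{r(x_j)}\right)^2}{\sum_{k=1}^{K} e^{r(x_k)}}\prod_{k=1}^{K} p_t(x_k)\,dx_k, \qquad B = \int\frac{\sum_{i<j}\left(e^{r(x_i)} - e^{r(x_j)}\right)\left(e^{\tilde{r}_t(x_i)} - e^{\tilde{r}_t(x_j)}\right)}{\sum_{k=1}^{K} e^{\tilde{r}_t(x_k)}}\prod_{k=1}^{K} p_t(x_k)\,dx_k.$$

Because $\mathrm{Var}_{p_t}\left[e^{r(x)}\right] \geq 0$, we have $A \geq \sum_{i<j}\frac{2\,\mathrm{Var}_{p_t}\left[e^{r(x)}\right]}{K e^{r_{t,\max}}}$.

When $\mathrm{Cov}_{p_t}\left[e^{r(x)}, e^{\tilde{r}_t(x)}\right] > 0$, $B \geq \sum_{i<j}\frac{2\,\mathrm{Cov}_{p_t}\left[e^{r(x)}, e^{\tilde{r}_t(x)}\right]}{K e^{\tilde{r}_{t,\max}}} \geq 0$; when $\mathrm{Cov}_{p_t}\left[e^{r(x)}, e^{\tilde{r}_t(x)}\right] < 0$, $B \geq \sum_{i<j}\frac{2\,\mathrm{Cov}_{p_t}\left[e^{r(x)}, e^{\tilde{r}_t(x)}\right]}{K e^{\tilde{r}_{t,\min}}} < 0$, so $B \geq \sum_{i<j}\frac{2\,\mathrm{Cov}_{p_t}\left[e^{r(x)}, e^{\tilde{r}_t(x)}\right]}{K e^{\tilde{r}_{t,\min}}}$.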

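Then, we have:

$$\begin{aligned}
K\,\mathbb{E}_{p_{t+1}}\left[e^{r(x)}\right] &\geq (1-\phi_t)\left[K\,\mathbb{E}_{p_t}\left[e^{r(x)}\right] + \sum_{i<j}\frac{2\,\mathrm{Var}_{p_t}\left[e^{r(x)}\right]}{K e^{r_{t,\max}}}\right] + \phi_t\left[K\,\mathbb{E}_{p_t}\left[e^{r(x)}\right] + \sum_{i<j}\frac{2\,\mathrm{Cov}_{p_t}\left[e^{r(x)}, e^{\tilde{r}_t(x)}\right]}{K e^{\tilde{r}_{t,\min}}}\right] \\
&= (1-\phi_t)\left[K\,\mathbb{E}_{p_t}\left[e^{r(x)}\right] + \frac{K(K-1)}{2}\cdot\frac{2\,\mathrm{Var}_{p_t}\left[e^{r(x)}\right]}{K e^{r_{t,\max}}}\right] + \phi_t\left[K\,\mathbb{E}_{p_t}\left[e^{r(x)}\right] + \frac{K(K-1)}{2}\cdot\frac{2\,\mathrm{Cov}_{p_t}\left[e^{r(x)}, e^{\tilde{r}_t(x)}\right]}{K e^{\tilde{r}_{t,\min}}}\right] \\
&= K\,\mathbb{E}_{p_t}\left[e^{r(x)}\right] + (1-\phi_t)(K-1)\frac{\mathrm{Var}_{p_t}\left[e^{r(x)}\right]}{e^{r_{t,\max}}} + \phi_t(K-1)\frac{\mathrm{Cov}_{p_t}\left[e^{r(x)}, e^{\tilde{r}_t(x)}\right]}{e^{\tilde{r}_{t,\min}}},
\end{aligned}$$

which means:

$$\mathbb{E}_{p_{t+1}}\left[e^{r(x)}\right] \geq \mathbb{E}_{p_t}\left[e^{r(x)}\right] + \frac{(1-\phi_t)(K-1)}{K}\cdot\frac{\mathrm{Var}_{p_t}\left[e^{r(x)}\right]}{e^{r_{t,\max}}} + \frac{\phi_t(K-1)}{K}\cdot\frac{\mathrm{Cov}_{p_t}\left[e^{r(x)}, e^{\tilde{r}_t(x)}\right]}{e^{\tilde{r}_{t,\min}}}.$$

Similarly, we have $A \leq \sum_{i<j}\frac{2\,\mathrm{Var}_{p_t}\left[e^{r(x)}\right]}{K e^{r_{t,\min}}}$ and $B \leq \sum_{i<j}\frac{2\,\mathrm{Cov}_{p_t}\left[e^{r(x)}, e^{\tilde{r}_t(x)}\right]}{K e^{\tilde{r}_{t,\max}}}$. So:

$$\mathbb{E}_{p_{t+1}}\left[e^{r(x)}\right] \leq \mathbb{E}_{p_t}\left[e^{r(x)}\right] + \frac{(1-\phi_t)(K-1)}{K}\cdot\frac{\mathrm{Var}_{p_t}\left[e^{r(x)}\right]}{e^{r_{t,\min}}} + \frac{\phi_t(K-1)}{K}\cdot\frac{\mathrm{Cov}_{p_t}\left[e^{r(x)}, e^{\tilde{r}_t(x)}\right]}{e^{\tilde{r}_{t,\max}}}.$$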

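We also verify that the upper bound is meaningful, that is, the upper bound is always greater than 0.

Proof. Suppose $r_{t,\min} \leq r(x) \leq r_{t,\max}$ and $\tilde{r}_{t,\min} \leq \tilde{r}_t(x) \leq \tilde{r}_{t,\max}$.

Then, we have $e^{r_{t,\min}} \leq \mathbb{E}_{p_t}\left[e^{r(x)}\right] \leq e^{r_{t,\max}}$ and $e^{2r_{t,\min}} \leq \mathbb{E}_{p_t}\left[e^{2r(x)}\right] \leq e^{2r_{t,\max}}$.

Because $0 \leq \mathrm{Var}_{p_t}\left[e^{r(x)}\right] \leq \mathbb{E}_{p_t}\left[e^{2r(x)}\right]$, we have $0 \leq \mathrm{Var}_{p_t}\left[e^{r(x)}\right] \leq e^{2r_{t,\max}}$.

Because $\mathrm{Cov}_{p_t}\left[e^{r(x)}, e^{\tilde{r}_t(x)}\right] = \mathbb{E}_{p_t}\left[e^{r(x)}e^{\tilde{r}_t(x)}\right] - \mathbb{E}_{p_t}\left[e^{r(x)}\right]\mathbb{E}_{p_t}\left[e^{\tilde{r}_t(x)}\right]$, we have

$$e^{r_{t,\min}+\tilde{r}_{t,\min}} - e^{r_{t,\max}+\tilde{r}_{t,\max}} \leq \mathrm{Cov}_{p_t}\left[e^{r(x)}, e^{\tilde{r}_t(x)}\right] \leq e^{r_{t,\max}+\tilde{r}_{t,\max}} - e^{r_{t,\min}+\tilde{r}_{t,\min}}.$$

Therefore,

$$\begin{aligned}
&\mathbb{E}_{p_t}\left[e^{r(x)}\right] + \frac{(1-\phi_t)(K-1)}{K}\cdot\frac{\mathrm{Var}_{p_t}\left[e^{r(x)}\right]}{e^{r_{t,\min}}} + \frac{\phi_t(K-1)}{K}\cdot\frac{\mathrm{Cov}_{p_t}\left[e^{r(x)}, e^{\tilde{r}_t(x)}\right]}{e^{\tilde{r}_{t,\max}}} \\
&\quad \geq e^{r_{t,\min}} + \frac{(1-\phi_t)(K-1)}{K}\cdot\frac{0}{e^{r_{t,\min}}} + \frac{\phi_t(K-1)}{K}\cdot\frac{e^{r_{t,\min}+\tilde{r}_{t,\min}} - e^{r_{t,\max}+\tilde{r}_{t,\max}}}{e^{\tilde{r}_{t,\max}}} \\
&\quad = e^{r_{t,\min}} + \frac{\phi_t(K-1)}{K}\cdot\frac{e^{r_{t,\min}+\tilde{r}_{t,\min}} - e^{r_{t,\max}+\tilde{r}_{t,\max}}}{e^{\tilde{r}_{t,\max}}} \\
&\quad = e^{r_{t,\min}} + \frac{\phi_t(K-1)}{K}\left(e^{r_{t,\min}+\tilde{r}_{t,\min}-\tilde{r}_{t,\max}} - e^{r_{t,\max}+\tilde{r}_{t,\max}-\tilde{r}_{t,\max}}\right) \\
&\quad = e^{r_{t,\min}} + \frac{\phi_t(K-1)}{K}\left(e^{r_{t,\min}+\tilde{r}_{t,\min}-\tilde{r}_{t,\max}} - e^{r_{t,\max}}\right) \\
&\quad = e^{r_{t,\min}} + e^{r_{t,\min}}\frac{\phi_t(K-1)}{K}\left(e^{\tilde{r}_{t,\min}-\tilde{r}_{t,\max}} - e^{r_{t,\max}-r_{t,\min}}\right).
\end{aligned}$$

When $\mathrm{Var}_{p_t}\left[e^{r(x)}\right] = 0$, $e^{r_{t,\min}} = e^{r_{t,\max}}$, so

$$\mathbb{E}_{p_t}\left[e^{r(x)}\right] + \frac{(1-\phi_t)(K-1)}{K}\cdot\frac{\mathrm{Var}_{p_t}\left[e^{r(x)}\right]}{e^{r_{t,\min}}} + \frac{\phi_t(K-1)}{K}\cdot\frac{\mathrm{Cov}_{p_t}\left[e^{r(x)}, e^{\tilde{r}_t(x)}\right]}{e^{\tilde{r}_{t,\max}}} \geq e^{r_{t,\min}} + e^{r_{t,\min}}\frac{\phi_t(K-1)}{K}\left(e^{\tilde{r}_{t,\min}-\tilde{r}_{t,\max}} - 1\right).$$

Because $0 < e^{\tilde{r}_{t,\min}-\tilde{r}_{t,\max}} \leq 1$, we have $-1 < e^{\tilde{r}_{t,\min}-\tilde{r}_{t,\max}} - 1 \leq 0$.

Because $0 \leq \frac{\phi_t(K-1)}{K} \leq 1$, then $-e^{r_{t,\min}} < e^{r_{t,\min}}\left(e^{\tilde{r}_{t,\min}-\tilde{r}_{t,\max}} - 1\right) \leq 0$, and hence

$$e^{r_{t,\min}} + e^{r_{t,\min}}\frac{\phi_t(K-1)}{K}\left(e^{\tilde{r}_{t,\min}-\tilde{r}_{t,\max}} - 1\right) > 0,$$

which means

$$\mathbb{E}_{p_t}\left[e^{r(x)}\right] + \frac{(1-\phi_t)(K-1)}{K}\cdot\frac{\mathrm{Var}_{p_t}\left[e^{r(x)}\right]}{e^{r_{t,\min}}} + \frac{\phi_t(K-1)}{K}\cdot\frac{\mathrm{Cov}_{p_t}\left[e^{r(x)}, e^{\tilde{r}_t(x)}\right]}{e^{\tilde{r}_{t,\max}}} > 0.$$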

B.4. Proof of Lemma 3.4

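Lemma B.4. Let $p_{t+1}$ be defined as in Eq. (3), and suppose $\mathrm{Cov}_{\min} = \min_{i \in \{0,1,\ldots,t\}} \mathrm{Cov}_{p_i}[e^{r(x)}, e^{\tilde r_i(x)}]$ and $\phi_t = \phi_{t-1} = \cdots = \phi_1 = \phi$. With $p_0 = p_{\mathrm{data}}$, for $t > 1$:

$$\mathbb{E}_{p_{t+1}}\big[e^{r(x)}\big] \ge \mathbb{E}_{p_{\mathrm{data}}}\big[e^{r(x)}\big] + \phi\,(1+\lambda)\Big[\frac{\lambda}{1+\lambda} - \Big(\frac{\lambda}{1+\lambda}\Big)^{t+1}\Big]\mathrm{Cov}_{\min} \tag{15}$$

When $t \to \infty$,

$$\mathbb{E}_{p_{t+1}}\big[e^{r(x)}\big] \ge \mathbb{E}_{p_{\mathrm{data}}}\big[e^{r(x)}\big] + \phi\,(\lambda+1)\,\frac{\lambda}{1+\lambda}\,\mathrm{Cov}_{\min} = \mathbb{E}_{p_{\mathrm{data}}}\big[e^{r(x)}\big] + \phi\,\lambda\,\mathrm{Cov}_{\min} \tag{16}$$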

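Proof. According to Lemma B.2,

$$\mathbb{E}_{p_1}\big[e^{r(x)}\big] \ge \mathbb{E}_{p_0}\big[e^{r(x)}\big] + \phi\,\mathrm{Cov}_{p_0}\big[e^{r(x)}, e^{\tilde r_0(x)}\big].$$

Then

$$\mathbb{E}_{p_2}\big[e^{r(x)}\big] \ge \frac{1}{1+\lambda}\,\mathbb{E}_{p_{\mathrm{data}}}\big[e^{r(x)}\big] + \frac{\lambda}{1+\lambda}\Big(\mathbb{E}_{p_0}\big[e^{r(x)}\big] + \phi\,\mathrm{Cov}_{p_0}\big[e^{r(x)}, e^{\tilde r_0(x)}\big]\Big).$$

With $p_0 = p_{\mathrm{data}}$,

$$\mathbb{E}_{p_2}\big[e^{r(x)}\big] \ge \mathbb{E}_{p_{\mathrm{data}}}\big[e^{r(x)}\big] + \frac{\lambda}{1+\lambda}\,\phi\,\mathrm{Cov}_{p_0}\big[e^{r(x)}, e^{\tilde r_0(x)}\big].$$

Use recursion:

$$\mathbb{E}_{p_{t+1}}\big[e^{r(x)}\big] \ge \mathbb{E}_{p_{\mathrm{data}}}\big[e^{r(x)}\big] + \phi\Big[\frac{\lambda}{1+\lambda}\,\mathrm{Cov}_{p_t}\big[e^{r(x)}, e^{\tilde r_t(x)}\big] + \Big(\frac{\lambda}{1+\lambda}\Big)^{2}\mathrm{Cov}_{p_{t-1}}\big[e^{r(x)}, e^{\tilde r_{t-1}(x)}\big] + \cdots + \Big(\frac{\lambda}{1+\lambda}\Big)^{t}\mathrm{Cov}_{p_0}\big[e^{r(x)}, e^{\tilde r_0(x)}\big]\Big].$$

Let

$$\mathrm{Cov}_{\min} = \mathrm{Cov}_{\min}\big[e^{r(x)}, e^{\tilde r_m(x)}\big] = \min\Big\{\mathrm{Cov}_{p_t}\big[e^{r(x)}, e^{\tilde r_t(x)}\big],\ \mathrm{Cov}_{p_{t-1}}\big[e^{r(x)}, e^{\tilde r_{t-1}(x)}\big],\ \ldots,\ \mathrm{Cov}_{p_0}\big[e^{r(x)}, e^{\tilde r_0(x)}\big]\Big\}.$$

Then

$$\begin{aligned}
\mathbb{E}_{p_{t+1}}\big[e^{r(x)}\big]
&\ge \mathbb{E}_{p_{\mathrm{data}}}\big[e^{r(x)}\big] + \phi\Big[\frac{\lambda}{1+\lambda}\,\mathrm{Cov}_{\min}\big[e^{r(x)}, e^{\tilde r_m(x)}\big] + \Big(\frac{\lambda}{1+\lambda}\Big)^{2}\mathrm{Cov}_{\min}\big[e^{r(x)}, e^{\tilde r_m(x)}\big] + \cdots + \Big(\frac{\lambda}{1+\lambda}\Big)^{t}\mathrm{Cov}_{\min}\big[e^{r(x)}, e^{\tilde r_m(x)}\big]\Big] \\
&= \mathbb{E}_{p_{\mathrm{data}}}\big[e^{r(x)}\big] + \phi\,(1+\lambda)\Big[\frac{\lambda}{1+\lambda} - \Big(\frac{\lambda}{1+\lambda}\Big)^{t+1}\Big]\mathrm{Cov}_{\min}.
\end{aligned}$$

When $t \to \infty$,

$$\mathbb{E}_{p_{t+1}}\big[e^{r(x)}\big] \ge \mathbb{E}_{p_{\mathrm{data}}}\big[e^{r(x)}\big] + \phi\,(\lambda+1)\,\frac{\lambda}{1+\lambda}\,\mathrm{Cov}_{\min} = \mathbb{E}_{p_{\mathrm{data}}}\big[e^{r(x)}\big] + \phi\,\lambda\,\mathrm{Cov}_{\min}.$$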
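The last step of the proof sums the geometric weights $(\lambda/(1+\lambda))^k$ for $k = 1, \ldots, t$. A small numeric check of that sum is sketched below, assuming exactly those weights; the helper names are illustrative and not from the paper.

```python
def geometric_weight_sum(lam, t):
    """Direct sum of (lam / (1 + lam))**k for k = 1..t."""
    a = lam / (1.0 + lam)
    return sum(a**k for k in range(1, t + 1))

def closed_form(lam, t):
    """Closed form (1 + lam) * (a - a**(t + 1)) with a = lam / (1 + lam)."""
    a = lam / (1.0 + lam)
    return (1.0 + lam) * (a - a ** (t + 1))

for t in (1, 5, 50):
    print(t, geometric_weight_sum(0.5, t), closed_form(0.5, t))
```

For a fixed $\lambda$ the two functions agree for every $t$, and the value approaches $\lambda$ as $t$ grows.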


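Algorithm 2 Iterative retraining with adversarially curated data

Input: Real data $D_{\mathrm{data}} = \{d_i\}_{i=1}^{N}$, user reward function $r$, learning procedure of generative model $G$, learning procedure of reward model $R$, attack algorithms $A$
Param: Rate of attack $\kappa$, proportion of data $\beta$, proportion of filtered data $\lambda$

$p_0 = G(D_{\mathrm{data}})$
for $t = 1$ to $T$ do
    $D_{\mathrm{gen}} = \{\tilde d_i\}_{i=1}^{N}$, where $\tilde d_1, \ldots, \tilde d_N \sim p_{t-1}$
    $D = \{(x_i, y_i, o_i)\}_{i=1}^{n}$, where $x_i, y_i \sim p_{t-1}$ and $o_i = 1$ if $r(x_i) < r(y_i)$, $o_i = 0.5$ if $r(x_i) = r(y_i)$, $o_i = 0$ if $r(x_i) > r(y_i)$
    $\tilde D = A(D, \kappa)$
    $R_\theta = R(D)$, $\tilde R_{\tilde\theta} = R(\tilde D)$
    $D_t = \mathrm{Curate}(D_{\mathrm{gen}}, R_\theta / \tilde R_{\tilde\theta}, \beta)$ {Using $R_\theta$ or $\tilde R_{\tilde\theta}$ to curate $\beta N$ data from the generated sample set $D_{\mathrm{gen}}$}
    $p_t = G(D_{\mathrm{data}} \cup \lambda D_t)$
end for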

C. Additional experiments

Algorithm 2 describes iterative retraining on a mixture of real data and adversarially curated synthetic data.
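The loop in Algorithm 2 can be illustrated with a toy one-dimensional sketch. Everything concrete here is an assumption made for illustration and is not the paper's implementation: the generative learner $G$ is a Gaussian fit, the reward learner $R$ is a linear Bradley-Terry model trained by gradient ascent, the attack $A$ randomly flips a $\kappa$ fraction of preference labels (a stand-in for the gradient-based and heuristic attacks), curation keeps the top $\beta$ fraction by learned reward, and $\lambda$ is read as the share of $D_t$ mixed with the real data.

```python
import numpy as np

def fit_gaussian(data):
    """Stand-in for G: fit a 1-D Gaussian and return (mean, std)."""
    return float(np.mean(data)), float(np.std(data))

def sample(model, n, rng):
    mean, std = model
    return rng.normal(mean, std, size=n)

def make_pairs(model, n_pairs, reward, rng):
    """D = {(x_i, y_i, o_i)}: o_i = 1 if r(x_i) < r(y_i), 0.5 on ties, else 0."""
    x, y = sample(model, n_pairs, rng), sample(model, n_pairs, rng)
    rx, ry = reward(x), reward(y)
    o = np.where(rx < ry, 1.0, np.where(rx == ry, 0.5, 0.0))
    return x, y, o

def flip_attack(pairs, kappa, rng):
    """Stand-in for A(D, kappa): flip labels of a random kappa fraction of pairs."""
    x, y, o = pairs
    o = o.copy()
    flipped = rng.choice(len(o), size=int(kappa * len(o)), replace=False)
    o[flipped] = 1.0 - o[flipped]
    return x, y, o

def fit_reward(pairs, steps=200, lr=0.5):
    """Stand-in for R: linear Bradley-Terry reward R_theta(x) = theta * x."""
    x, y, o = pairs
    theta = 0.0
    for _ in range(steps):
        p_y = 1.0 / (1.0 + np.exp(-theta * (y - x)))   # P(y preferred over x)
        theta += lr * np.mean((o - p_y) * (y - x))      # log-likelihood ascent
    return theta

def curate(generated, theta, beta):
    """Keep the beta fraction of generated samples with the highest learned reward."""
    keep = int(beta * len(generated))
    order = np.argsort(theta * generated)[::-1]
    return generated[order[:keep]]

def retrain_loop(real, reward, kappa, beta=0.5, lam=1.0, T=5, n_pairs=500, seed=0):
    rng = np.random.default_rng(seed)
    model = fit_gaussian(real)                            # p_0 = G(D_data)
    for _ in range(T):
        generated = sample(model, len(real), rng)         # D_gen ~ p_{t-1}
        pairs = make_pairs(model, n_pairs, reward, rng)   # pairwise dataset D
        theta = fit_reward(flip_attack(pairs, kappa, rng))  # kappa = 0 is benign
        curated = curate(generated, theta, beta)          # D_t
        used = curated[: int(lam * len(curated))]         # lambda share of D_t (assumed)
        model = fit_gaussian(np.concatenate([real, used]))  # p_t from D_data and D_t
    return model

real_data = np.random.default_rng(1).normal(0.0, 1.0, size=2000)
print("benign  :", retrain_loop(real_data, lambda x: x, kappa=0.0))
print("attacked:", retrain_loop(real_data, lambda x: x, kappa=1.0))
```

With the user reward $r(x) = x$, benign curation moves the fitted mean toward larger values, while flipping every label moves it the other way; the real data keeps both drifts bounded.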

C.1. Additional experiments on the CIFAR-10 dataset

In this section, we show some additional experiments on the CIFAR-10 dataset.

Settings. The settings in this section are the same as those described in Section 5. The two reward models (the target platform's and the adversarial platform's) may share the same architecture or differ. If they share the same architecture, both use a pretrained VGG11 as the feature extractor followed by a linear layer with 10 neurons. If different architectures are used, one of them employs a pretrained VGG11 as the feature extractor, followed by three linear layers with 128, 64, and 10 neurons, respectively.

Two additional random-attack experiments. As mentioned in Section 5.1, the results of random attacks vary from run to run. Fig. 6 shows the proportion of each class and the average reward over iterations for two other independent runs. These results further highlight the variability of random attacks. RANDOM #1 shows a case where the average reward continues to increase in the later stages of retraining, indicating partial alignment with user preferences. RANDOM #2 shows a gradual decrease in the average reward throughout the process, reflecting a more persistent misalignment.

Additional results of adversarial data curation experiments. As mentioned in Section 5.1, Fig. 7 shows the samples generated during the retraining process under different curation settings. The images are not distorted to the point of being unrecognizable, but the diversity of the generated samples changes noticeably. In the benign curation setting, the model generates a large proportion of images from user-preferred classes (airplane and automobile). In contrast, the samples under malicious curation are dominated by less favored classes (horse and ship), indicating a significant misalignment with intended user preferences.

(Figure 6 plots. Class legend: airplane, automobile, bird, cat, deer, dog, frog, horse, ship, truck. Horizontal axis: iterations 0 to 10. Right panel vertical axis: Avg. Rewards, with curves for Benign Curation, Random #1, and Random #2.)

Figure 6. Iterative retraining of the self-consuming model on the CIFAR-10 dataset in two independent experiments with randomized adversarial curation (20% random): (1) Left: proportion of each class for the two independent runs. (2) Right: average reward for the two independent runs and for benign curation.

(Figure 7 sample grid. Columns: Iteration 1, Iteration 4, Iteration 7, Iteration 10. Row labels include Malicious (20%).)

Figure 7. Samples generated by the self-consuming model with different curation on the CIFAR-10 dataset: (1) Top: benign curation, filtered by r(x). (2) Middle: adversarial curation with 20% malicious data injected by the gradient algorithm. (3) Bottom: mixed dataset created by combining real data with adversarially curated synthetic data (1:1).

Additional results of the attack algorithm experiments. As mentioned in Section 5.2, we recorded the proportion of each class in this experiment; the results are shown in Fig. 8. The comparison includes benign curation (BENIGN), gradient-based attacks in which the target platform and the adversarial platform use different reward models (GRADIENT #1) or identical reward models (GRADIENT #2), reward-based heuristic methods with $f(x, z) = |R_\theta(x) - R_\theta(z)|$ (HEURISTIC #1) or $f(x, z) = \max\{|R_\theta(x)|, |R_\theta(z)|\}$ (HEURISTIC #2), and a multi-objective heuristic method (PARETO).

Interestingly, although the heuristic methods yield a lower average reward, they produce a significantly higher proportion of automobile images (the user's preferred class) after curation. This may be because the heuristics are not sufficiently global.

(Figure 8 bar charts. Horizontal axis: class indices 0 to 9. Panel titles include Gradient #1, Gradient #2, and Heuristic #1.)

Figure 8. The proportion of each class at t = 1 under different attack algorithms on the same CIFAR-10 dataset.


(Figure 9 sample grid. Columns: Iteration 0 to Iteration 5. Row labels include Malicious (20%) and Malicious (50%).)

Figure 9. Samples generated by the self-consuming model with different curation on the synthetic Gaussian dataset: (1) Top: benign curation, filtered by r(x). (2) Middle: adversarial curation with 20% malicious data injected by the gradient algorithm. (3) Bottom: adversarial curation with 50% malicious data injected by the gradient algorithm (severe attack).

(Figure 10 sample grid. Columns: Iteration 0 to Iteration 5. Row labels include Malicious (50%).)

Figure 10. Samples generated by the self-consuming model on different mixed Gaussian datasets: (1) Top: adversarially curated synthetic dataset with 50% malicious data injected by the gradient algorithm (severe attack). (2) Second to bottom: mixed datasets created by combining real data with adversarially curated synthetic data in different proportions (real data : adversarially curated synthetic data).


C.2. Experiments on Gaussian datasets

This section shows the setting and results of the experiments on the synthetic dataset.

Dataset. The synthetic dataset we generated is a two-dimensional dataset following an 8-mode Gaussian mixture model. Specifically, we define eight mode centers that are uniformly distributed on a circle of radius 2. The coordinates of these centers are given by:

$$\mu_t = 2\Big(\cos\frac{t\pi}{4},\ \sin\frac{t\pi}{4}\Big), \quad t = 0, 1, 2, \ldots, 7. \tag{17}$$

Each data point is generated independently by choosing one of the 8 mode centers uniformly at random and adding isotropic Gaussian noise with mean 0 and standard deviation 0.02:

$$x = \mu_t + \epsilon, \quad \epsilon \sim \mathcal{N}(0, 0.02^2 I_2). \tag{18}$$
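Eqs. (17) and (18) translate directly into a sampler. The following is a minimal NumPy sketch; the function name, the seed handling, and the choice to also return the mode indices are illustrative assumptions rather than details from the paper.

```python
import numpy as np

def sample_ring_mixture(n, radius=2.0, n_modes=8, std=0.02, seed=0):
    """Draw n points from the 8-mode Gaussian mixture of Eqs. (17) and (18)."""
    rng = np.random.default_rng(seed)
    modes = rng.integers(0, n_modes, size=n)      # uniform choice of mode index t
    angles = 2.0 * np.pi * modes / n_modes        # equals t * pi / 4 when n_modes = 8
    centers = radius * np.stack([np.cos(angles), np.sin(angles)], axis=1)
    noise = rng.normal(0.0, std, size=(n, 2))     # isotropic noise, std 0.02
    return centers + noise, modes

points, modes = sample_ring_mixture(10_000)
print(points.shape, np.bincount(modes, minlength=8))
```

Every sampled point lies within a few hundredths of the radius-2 circle, and the eight modes receive roughly equal counts.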

Settings. For the reward $r(x) := -\gamma \max\{0, \|x - \mu\| - \tau\}$, we set $\mu = (2, 0)$ (the first center), $\tau = 3$, and $\gamma = 10$. The reward model consists of two fully connected linear layers, with 2 neurons in the first layer and 64 neurons in the second layer. In each iteration, the generative model produces 10,000 random samples, from which 5,000 samples are filtered for the next round of retraining. The samples generated in each iteration are plotted in two-dimensional coordinates.
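The reward and the filtering step can be sketched as follows. This is an illustration under assumptions: the reward is taken as $-\gamma \max\{0, \|x - \mu\| - \tau\}$ with the stated constants, and filtering is approximated by keeping the highest-reward samples, which may differ from the curation rule actually used in the experiments. The function names are hypothetical.

```python
import numpy as np

def reward(x, mu=(2.0, 0.0), tau=3.0, gamma=10.0):
    """r(x) = -gamma * max(0, ||x - mu|| - tau); zero within distance tau of mu."""
    dist = np.linalg.norm(np.asarray(x, dtype=float) - np.asarray(mu), axis=-1)
    return -gamma * np.maximum(0.0, dist - tau)

def keep_top(samples, n_keep):
    """Assumed filter: keep the n_keep samples with the highest reward."""
    order = np.argsort(-reward(samples), kind="stable")
    return samples[order[:n_keep]]

samples = np.array([[2.0, 0.0], [-2.0, 0.0], [0.0, 2.0], [-5.0, 0.0]])
print(reward(samples))        # penalties grow with distance beyond tau = 3
print(keep_top(samples, 2))   # the two samples with zero penalty
```

With $\tau = 3$, the centers within distance 3 of $(2, 0)$ incur no penalty, so only the far side of the ring is pushed down by this reward.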

Adversarial curation. We explored the long-term performance on purely synthetic data with adversarial curation. Fig. 9 shows the results for three curation settings: benign curation, and adversarial curation in which the gradient-based attack algorithm targets 20% or 50% of the pairwise comparison data, respectively.

Mixed data. We also explored the long-term performance of adversarial curation on mixed data. The results are shown in Fig. 10. Under severe attacks, adding a large amount of real data can align the model with the real data distribution, but not with the user preference distribution.