Constrained Diffusion Models via Dual Training

Shervin Khalafi, Dongsheng Ding, Alejandro Ribeiro
{shervink, dongshed, aribeiro}@seas.upenn.edu
University of Pennsylvania

38th Conference on Neural Information Processing Systems (NeurIPS 2024).

Abstract

Diffusion models have attained prominence for their ability to synthesize a probability distribution for a given dataset via a diffusion process, enabling the generation of new data points with high fidelity. However, diffusion processes are prone to generating samples that reflect biases in a training dataset. To address this issue, we develop constrained diffusion models by imposing diffusion constraints based on desired distributions that are informed by requirements. Specifically, we cast the training of diffusion models under requirements as a constrained distribution optimization problem that aims to reduce the distribution difference between original and generated data while obeying constraints on the distribution of generated data. We show that our constrained diffusion models generate new data from a mixture data distribution that achieves the optimal trade-off between objective and constraints. To train constrained diffusion models, we develop a dual training algorithm and characterize the optimality of the trained constrained diffusion model. We empirically demonstrate the effectiveness of our constrained models in two constrained generation tasks: (i) we consider a dataset with one or more underrepresented classes, where we train the model with constraints to ensure fair sampling from all classes during inference; (ii) we fine-tune a pre-trained diffusion model to sample from a new dataset while avoiding overfitting.

1 Introduction

Diffusion models have become a driving force of modern generative modeling, achieving groundbreaking performance in tasks ranging from image/video/audio generation [64, 6, 43] to molecular design for drug discovery [75, 74]. Diffusion models learn diffusion processes that produce a probability distribution for a given dataset (e.g., images) from which we can generate new data (e.g., classic models [66, 34, 68, 78]). As diffusion models are used to generate data with societal impacts, e.g., art generation and content creation for media, they must comply with requirements from specific domains, e.g., social fairness in image generation [54, 57], aesthetic properties of images [12, 13], bioactivity in molecule generation [39], and more [40, 81, 8, 19, 79, 14].

Classic diffusion models [66, 34, 68] have been extended to generate data under different requirements through either first-principle methods or fine-tuning. In image generation with fairness, for instance, first-principle methods directly mitigate biases towards social/ethical identities by revising the training loss functions per biased/unbiased data [49, 41], while fine-tuning methods align biased diffusion models with desired data distributions by optimizing the associated metrics [24, 72, 71, 70]. Although these methods allow us to equip diffusion models with specific requirements, they are often designed for particular generation tasks and do not provide transparency on how these requirements are satisfied.

Since diffusion models are trained by minimizing the loss functions of diffusion processes, it is natural to incorporate requirements into diffusion models by imposing constraints on these optimization problems.
Therefore, it is imperative to develop diffusion models under constraints by generalizing constrained learning methods and theory (e.g., [9, 10, 22, 36]) to diffusion models.

In this work, we formulate the training of diffusion models under requirements as a constrained distribution optimization problem in which the objective function measures a training loss induced by the original data distribution, and the constraints require other training losses, induced by some desirable data distributions, to be small. This constrained formulation can be instantiated for several critical requirements. For instance, to promote fairness for underrepresented groups, the constraints can encode the closeness of the model to the distributions of underrepresented data. Compared with the typical importance re-weighting method [41], our constrained formulation provides an optimal trade-off between matching the given data distribution and following the reference distributions. Our constrained formulation is not limited to constraints defined by other desirable data distributions; it captures more general requirements. For instance, when adapting a pretrained diffusion model to new data, the constraints require the model to stay close to the pretrained model so as not to degrade the original generation ability.

Specifically, our main contribution is three-fold.

(i) We propose and analyze constrained diffusion models from the perspective of constrained optimization in an infinite-dimensional distribution space, where the constraints require the KL divergences between the model and our desired data distributions to be under some thresholds. We exploit strong duality in convex optimization to show that our constrained diffusion models generate new data from a mixture data distribution that achieves the optimal trade-off between objective and constraints.

(ii) To train constrained diffusion models, we introduce parametrized constrained diffusion models and develop a Lagrangian-based dual training algorithm. We exploit the relation between the unparametrized and parametrized problems to show that constrained diffusion models generate new data from the optimal mixture distribution, up to some optimization/parametrization errors.

(iii) We empirically demonstrate the merit of our constrained diffusion models on the two aforementioned requirements. In the fair generation task, we show that our constrained model samples more often from the minority classes than unconstrained models, leading to fair sampling across all classes. In the adaptation task, we show that our fine-tuned constrained model learns to generate new data without significantly degrading the original generation ability, in contrast to the unconstrained model, which tends to overfit to the new data.

Related work. As a first-principle method, our constrained diffusion approach is most relevant to diffusion models that incorporate requirements in distribution space [20, 35, 49, 21, 60, 28], rather than those applied in sample space [37, 53, 29, 27, 26, 17, 48, 25]. In comparison with conditional diffusion models that restrict generation through conditional information [20, 35, 1], our constrained diffusion models impose distribution constraints within the constrained optimization framework. Compared to compositional generation [49, 21, 60] and fair diffusion [28], our work provides a constrained learning approach that balances different distribution models using Lagrange multipliers, which is different from equal weights [49, 21], a hyperparameter [60], or fair guidance [28].
Our constrained approach is also relevant to the importance re-weighting method for diffusion models [41] and GANs [16], which reduces bias in a dataset by re-weighting it with a pre-trained debiased density ratio. In contrast, we design our diffusion models to mitigate the bias in a dataset by imposing distribution constraints, without pre-training. In addition to being distinct from existing methods, we provide a systematic study of our constrained diffusion models, covering duality analysis, dual-based algorithm design, and convergence analysis, which is absent in the diffusion model literature.

Our work is also pertinent to the recently surging fine-tuning and alignment methods that aim to improve pre-trained diffusion models by optimizing their downstream performance, e.g., aesthetic scores of images. In reward-guided fine-tuning methods, such as supervised learning [45, 80, 76], control-based feedback learning [77, 65, 71, 70, 18], and reinforcement learning [23, 32, 24, 72, 5, 82], reward functions have to be pre-trained from an authentic evaluation dataset, and the trade-off for reward functions in pre-trained diffusion models is often regulated heuristically. In contrast, our constrained approach directly minimizes the gap between a fine-tuned model and a high-quality dataset with the desired properties, while ensuring that the generated outputs remain close to those of the pre-trained model. Compared to other generative models under requirements (e.g., VAEs [61], GANs [16]) and classical sampling methods (e.g., Langevin dynamics and Stein variational gradient descent [50]), our work is different because we focus on diffusion-based generative models.

2 Preliminaries

We overview diffusion models from the perspective of variational inference [55] by presenting forward/backward processes in Section 2.1 and the evidence lower bound in Section 2.2.

2.1 Forward and backward processes

The forward process begins with a true data sample $x_0 \in \mathbb{R}^d$ and generates latent variables $\{x_t\}_{t=1}^T$ from $t = 1$ to $T$ by adding white noise recursively. The joint distribution of the latent variables results from conditional probabilities $q(x_{1:T} \mid x_0) = \prod_{t=1}^{T} q(x_t \mid x_{t-1})$, where the distribution of the latent variable $x_t$ conditioned on the previous latent $x_{t-1}$ is the Gaussian
$$q(x_t \mid x_{t-1}) = \mathcal{N}\big(x_t;\ \mu_t,\ \sigma_q^2(t) I\big), \quad \text{where } \mu_t = \sqrt{\alpha_t}\, x_{t-1} \text{ and } \sigma_q^2(t) = 1 - \alpha_t.$$
Since the forward process is a linear Gaussian model with pre-selected mean and variance, it is solely determined by the data distribution $q(x_0)$. We often refer to $q(x_0)$ as a forward process.

The backward process begins with the latent $x_T$ sampled from the standard Gaussian $p(x_T) = \mathcal{N}(x_T; 0, I)$, and decodes latent variables from $t = T$ to $t = 0$ with the joint distribution
$$p(x_{0:T}) = p(x_T) \prod_{t=1}^{T} p(x_{t-1} \mid x_t). \tag{1}$$
Here, $p$ is our distribution model, which can be used to generate new samples. We denote by $\mathcal{P}$ the set of all joint distributions over $x_{0:T}$ in the form of (1), where $p(x_{t-1} \mid x_t)$ is a conditional Gaussian with a fixed variance (see Appendix A). Throughout the paper, we work in the convergent regime of the backward process (e.g., [15, 11, 2]). Without loss of generality, we adopt the convergent regime in [47] by taking the scheduling parameters $\alpha_1 = 1 - 1/T^{c_0}$ and $\alpha_t = 1 - c_T \min\{(1 - \alpha_1)(1 + c_T)^t, 1\}$ for $t > 1$, and the variance $\sigma_p^2(t) = 1/\alpha_t - 1$, where $c_T := c_1 \log(T)/T$ and $c_0, c_1$ are some constants. Hence, $\bar{\alpha}_T \approx 0$ implies $q(x_T) \approx \mathcal{N}(x_T; 0, I)$, and thus $q(x_{0:T}) \in \mathcal{P}$. Also, $\alpha_1 \approx 1$ implies $q(x_1) \approx q(x_0)$. Our results readily generalize to other diffusion processes (e.g., [46, 38, 3]).
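This forward process admits the closed-form Gaussian marginal $q(x_t \mid x_0) = \mathcal{N}(\sqrt{\bar{\alpha}_t}\, x_0, (1 - \bar{\alpha}_t) I)$ with $\bar{\alpha}_t := \prod_{s \le t} \alpha_s$. The following minimal NumPy sketch simulates it under the schedule above; the constants `c0`, `c1` and all function names are illustrative assumptions, not the paper's implementation.

```python
# A minimal sketch of the forward process, assuming the schedule of [47]
# with illustrative constants c0, c1.
import numpy as np

def make_schedule(T, c0=2.0, c1=2.0):
    """Return per-step alphas for t = 1, ..., T (index 0 unused)."""
    cT = c1 * np.log(T) / T
    alpha = np.ones(T + 1)
    alpha[1] = 1.0 - 1.0 / T**c0
    for t in range(2, T + 1):
        alpha[t] = 1.0 - cT * min((1 - alpha[1]) * (1 + cT) ** t, 1.0)
    return alpha

def forward_sample(x0, t, alpha, rng):
    """Draw x_t ~ q(x_t | x_0) via the closed-form marginal:
    x_t = sqrt(abar_t) * x0 + sqrt(1 - abar_t) * eps."""
    abar_t = np.prod(alpha[1 : t + 1])
    eps = rng.standard_normal(x0.shape)
    return np.sqrt(abar_t) * x0 + np.sqrt(1.0 - abar_t) * eps, eps

rng = np.random.default_rng(0)
alpha = make_schedule(T=1000)
x0 = rng.standard_normal(16)       # a toy data sample in R^16
xT, _ = forward_sample(x0, 1000, alpha, rng)
print(np.prod(alpha[1:]))          # abar_T is near 0, so x_T is near-standard Gaussian
```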
2.2 The evidence lower bound (ELBO)

Denote the KL divergence of distribution $q$ from distribution $p$ by $D_{\mathrm{KL}}(q \,\|\, p) := \mathbb{E}_{x \sim q(x)}\left[\log\left(q(x)/p(x)\right)\right]$. Generative diffusion modeling aims to generate samples whose distribution is close to that of an observed dataset of samples. Formally, we express this objective as maximizing the log-likelihood of an observation generated by the diffusion model: $\text{maximize}_{p \in \mathcal{P}}\ \mathbb{E}_{q(x_0)}[\log p(x_0)]$, where
$$\mathbb{E}_{q(x_0)}[\log p(x_0)] = \mathcal{E}(p; q) + \mathbb{E}_{q(x_0)}\left[D_{\mathrm{KL}}\left(q(x_{1:T} \mid x_0) \,\|\, p(x_{1:T} \mid x_0)\right)\right] \tag{2}$$
and $\mathcal{E}(p; q) := \mathbb{E}_{q(x_0)}\mathbb{E}_{q(x_{1:T} \mid x_0)}\left[\log \frac{p(x_{0:T})}{q(x_{1:T} \mid x_0)}\right]$ is known as the ELBO in variational inference [7, 55]. Alternatively, we aim to minimize the KL divergence between the forward and backward processes,
$$D_{\mathrm{KL}}\left(q(x_{0:T}) \,\|\, p(x_{0:T})\right) = -\mathcal{E}(p; q) + \mathbb{E}_{q(x_0)}\left[\log q(x_0)\right]. \tag{3}$$
Thus, we connect the log-likelihood maximization to the KL divergence minimization via the ELBO.

Lemma 1 (Equivalent formulations). The ELBO maximization and the KL divergence minimization are equivalent over the distribution space $\mathcal{P}$, and the unique solution of these two problems is a solution of the log-likelihood maximization problem, i.e.,
$$\underset{p \in \mathcal{P}}{\text{maximize}}\ \mathcal{E}(p; q) \;\equiv\; \underset{p \in \mathcal{P}}{\text{minimize}}\ D_{\mathrm{KL}}\left(q(x_{0:T}) \,\|\, p(x_{0:T})\right) \;\equiv\; \underset{p \in \mathcal{P}}{\text{maximize}}\ \mathbb{E}_{q(x_0)}[\log p(x_0)].$$

See Appendix B.1 for the proof. Lemma 1 states that improving the ELBO increases the likelihood of a backward process that generates the data, together with the fit of the backward process to the forward process. Hence, finding the best backward process amounts to optimizing any one of three equivalent objectives. In practice, the negative ELBO serves as a loss function, approximated by
$$\sum_{t=2}^{T} \mathbb{E}_{q(x_0)}\mathbb{E}_{q(x_t \mid x_0)}\left[D_{\mathrm{KL}}\left(q(x_{t-1} \mid x_t, x_0) \,\|\, p(x_{t-1} \mid x_t)\right)\right]. \tag{4}$$
With the variance schedule described in Section 2.1, this approximation is known to be almost exact (see Appendix A and also [42]), which is our focal setting. Using standard diffusion derivations [55], the ELBO maximization can be shown to equal a quadratic loss minimization,
$$\underset{\hat{s} \in \mathcal{S}}{\text{minimize}}\quad \mathbb{E}_{x_0, t, x_t}\left[\left\|\hat{s}(x_t, t) - \nabla \log q(x_t)\right\|^2\right] \tag{5}$$
where $\mathbb{E}_{x_0, t, x_t}$ is an expectation over the data distribution $q(x_0)$, a discrete distribution $p_\omega(t)$ over $t$ from $2$ to $T$, and the forward process $q(x_t \mid x_0)$ at time $t$ given the data sample $x_0$; see Appendix A for details. The minimization is done to find a function $\hat{s} \in \mathcal{S}$ that best predicts the gradient of the log-density of the forward process, $\nabla \log q(x_t)$, commonly called the (Stein) score function, where $\mathcal{S}$ is a set of valid score functions mapping from $\mathbb{R}^d \times \mathbb{N}$ to $\mathbb{R}^d$. In practice, however, we parametrize the estimator $\hat{s}(x_t, t)$ as $\hat{s}_\theta(x_t, t)$ with parameter $\theta$, which gives our focal objective of generative modeling: $\text{minimize}_{\theta \in \Theta}\ \mathbb{E}_{x_0, t, x_t}\left[\|\hat{s}_\theta(x_t, t) - \nabla \log q(x_t)\|^2\right]$. The parametrized form of $p(x_{t-1} \mid x_t)$ associated with $\hat{s}_\theta(x_t, t)$ is denoted by $p_\theta(x_{t-1} \mid x_t)$, and the backward process has a parametrized joint distribution $p_\theta(x_{0:T})$. We remark that the prediction problem (5) can also be formulated as data or noise prediction instead [55], with our results directly transferable to these formulations.
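In implementations, the marginal score $\nabla \log q(x_t)$ is unavailable, so training regresses onto the conditional score $\nabla \log q(x_t \mid x_0) = -\epsilon / \sqrt{1 - \bar{\alpha}_t}$, which has the same minimizer. The sketch below is a hedged PyTorch illustration of one stochastic evaluation of objective (5); `score_net` is a placeholder network, and $t$ is drawn uniformly for simplicity rather than from $p_\omega(t)$.

```python
# A sketch of the score-matching loss (5), assuming a placeholder network
# score_net(x_t, t) and uniform sampling of t instead of p_omega(t).
import torch

def score_matching_loss(score_net, x0, abar):
    """One stochastic estimate of E_{x0,t,xt} || s_theta(x_t,t) - grad log q(x_t|x_0) ||^2,
    where abar[t] = prod_{s<=t} alpha_s for t = 0, ..., T."""
    T = abar.shape[0] - 1
    t = torch.randint(2, T + 1, (x0.shape[0],))
    abar_t = abar[t].view(-1, 1)
    eps = torch.randn_like(x0)
    xt = abar_t.sqrt() * x0 + (1 - abar_t).sqrt() * eps   # x_t ~ q(x_t | x_0)
    target = -eps / (1 - abar_t).sqrt()                   # grad log q(x_t | x_0)
    return ((score_net(xt, t) - target) ** 2).sum(dim=1).mean()
```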
3 Variational constrained diffusion models

We introduce constrained diffusion models by considering the unparametrized set of joint distributions in Section 3.1, and we illustrate the constraints via two examples in Section 3.2.

3.1 KL divergence-constrained diffusion model: unparametrized case

The standard diffusion model specifies a single data distribution, denoted by $q$ in Lemma 1. To account for other generation requirements, we introduce $m$ additional data distributions $\{q_i\}_{i=1}^m$ that represent $m$ desired properties of the generated data. To incorporate the new properties into the diffusion model, we formulate an unparametrized KL divergence-constrained optimization problem,
$$\begin{aligned} \underset{p \in \mathcal{P}}{\text{minimize}}\quad & D_{\mathrm{KL}}\left(q(x_{0:T}) \,\|\, p(x_{0:T})\right) \\ \text{subject to}\quad & D_{\mathrm{KL}}\left(q_i(x_{0:T}) \,\|\, p(x_{0:T})\right) \le b_i \quad \text{for } i = 1, \ldots, m. \end{aligned} \tag{U-KL}$$
Let an optimal solution to Problem (U-KL) be $p^\star$. Then the optimal value of the objective function is $F^\star := D_{\mathrm{KL}}(q \,\|\, p^\star)$. Problem (U-KL) aims to find a model $p$ that generates data from the original distribution $q$ while staying close to the $m$ distributions $\{q_i\}_{i=1}^m$ that encode our desired properties, e.g., unbiasedness towards minorities. Let the Lagrangian for Problem (U-KL) be
$$L(p, \lambda) = D_{\mathrm{KL}}\left(q(x_{0:T}) \,\|\, p(x_{0:T})\right) + \sum_{i=1}^{m} \lambda_i \left(D_{\mathrm{KL}}\left(q_i(x_{0:T}) \,\|\, p(x_{0:T})\right) - b_i\right) \tag{6}$$
for $\lambda \ge 0$. The dual function is $g(\lambda) := \min_{p \in \mathcal{P}} L(p, \lambda)$, which is always concave. To make Problem (U-KL) meaningful, we assume the constraints are strictly satisfied by some model.

Assumption 1 (Strict feasibility). There exist a model $p \in \mathcal{P}$ and $\zeta > 0$ such that $D_{\mathrm{KL}}(q_i(x_{0:T}) \,\|\, p(x_{0:T})) \le b_i - \zeta$ for all $i = 1, \ldots, m$.

Let an optimal dual variable of Problem (U-KL) be $\lambda^\star \in \arg\max_{\lambda \ge 0} g(\lambda)$, and the optimal value of the dual function be $D^\star := g(\lambda^\star)$. From weak duality, the duality gap is non-negative, i.e., $F^\star - D^\star \ge 0$. Moreover, due to the convexity of the KL divergence, Problem (U-KL) is a convex optimization problem, and thus it satisfies strong duality; see Appendix B.2 for the proof.

Lemma 2 (Strong duality). Let Assumption 1 hold. Then, Problem (U-KL) has zero duality gap, i.e., $F^\star = D^\star$. Moreover, $(p^\star, \lambda^\star)$ is an optimal primal-dual pair of Problem (U-KL).

Let the mixture data distribution be $q^{(\lambda)}_{\mathrm{mix}} := \left(q + \sum_{i=1}^{m} \lambda_i q_i\right) / (1 + \|\lambda\|_1)$ for $\lambda \ge 0$. We denote by $q^{(\lambda)}_{\mathrm{mix}}(x_{0:T})$ the joint distribution of the forward process with data distribution $q^{(\lambda)}_{\mathrm{mix}}$. Leveraging strong duality, we show in Theorem 1 that an optimal model can be obtained by solving an equivalent unconstrained problem; the proof is deferred to Appendix B.3.

Theorem 1 (Optimal constrained model). Let Assumption 1 hold. Then, Problem (U-KL) equals
$$\underset{p \in \mathcal{P}}{\text{minimize}}\quad D_{\mathrm{KL}}\left(q^{(\lambda^\star)}_{\mathrm{mix}}(x_{0:T}) \,\|\, p(x_{0:T})\right) \tag{U-MIX}$$
where $q^{(\lambda^\star)}_{\mathrm{mix}}(x_{0:T})$ is the joint distribution of the forward process at an optimal dual variable $\lambda^\star$.

Theorem 1 states that the KL divergence-constrained problem reduces to an unconstrained KL divergence minimization problem. We note that the KL divergence is zero if and only if the two probability distributions match. Hence, $q^{(\lambda^\star)}_{\mathrm{mix}}(x_{0:T})$ is the optimal solution to Problem (U-MIX).

Corollary 1. Let Assumption 1 hold. Then, the solution of Problem (U-MIX), i.e., $p^\star(x_{0:T}) = q^{(\lambda^\star)}_{\mathrm{mix}}(x_{0:T})$, is the solution of Problem (U-KL).

Let $\bar{b}_i := b_i - \mathbb{E}_{q_i(x_0)}[\log q_i(x_0)]$. Application of Equality (3) to Problem (U-KL) yields an ELBO-based constrained optimization problem,
$$\begin{aligned} \underset{p \in \mathcal{P}}{\text{minimize}}\quad & -\mathcal{E}(p; q) \\ \text{subject to}\quad & -\mathcal{E}(p; q_i) \le \bar{b}_i \quad \text{for } i = 1, \ldots, m. \end{aligned} \tag{U-ELBO}$$
Recalling the model representation in Section 2.2, we can characterize each joint distribution $p \in \mathcal{P}$ by a function $\hat{s} \in \mathcal{S}$. Moreover, the ELBO reduces to the denoising matching term, which has the simplified quadratic form given in Section 2.2. With this reformulation in mind, we cast Problem (U-ELBO) into a convex optimization problem over the function space $\mathcal{S}$,
$$\begin{aligned} \underset{\hat{s} \in \mathcal{S}}{\text{minimize}}\quad & \mathbb{E}_{q(x_0), t, x_t}\left[\left\|\hat{s}(x_t, t) - \nabla \log q(x_t)\right\|^2\right] \\ \text{subject to}\quad & \mathbb{E}_{q_i(x_0), t, x_t}\left[\left\|\hat{s}(x_t, t) - \nabla \log q_i(x_t)\right\|^2\right] \le \tilde{b}_i \quad \text{for } i = 1, \ldots, m \end{aligned} \tag{U-LOSS}$$
where $\tilde{b}_i := (\bar{b}_i - v)/\omega$.
Here, $v$ is a constant shift due to the variance mismatch term and $\omega$ is a scaling constant; see Appendix A. We note that scaling or shifting the objective and constraints by constants does not alter the solution of a constrained optimization problem. Thus, the key difference between Problems (U-KL) and (U-LOSS) is the optimization variable ($p$ and $\hat{s}$, respectively). Let the Lagrangian $L_s(\hat{s}, \lambda)$ for Problem (U-LOSS) be
$$L_s(\hat{s}, \lambda) = \mathbb{E}_{q(x_0), t, x_t}\left[\left\|\hat{s}(x_t, t) - \nabla \log q(x_t)\right\|^2\right] + \sum_{i=1}^{m} \lambda_i \left(\mathbb{E}_{q_i(x_0), t, x_t}\left[\left\|\hat{s}(x_t, t) - \nabla \log q_i(x_t)\right\|^2\right] - \tilde{b}_i\right).$$
Let the associated dual function be $g_s(\lambda) := \min_{\hat{s} \in \mathcal{S}} L_s(\hat{s}, \lambda)$ for $\lambda \ge 0$. The dual functions $g(\lambda)$ and $g_s(\lambda)$ have the same maximizer $\lambda^\star$, and the partial minimizer $\hat{s}^\star = \arg\min_{\hat{s} \in \mathcal{S}} L_s(\hat{s}, \lambda^\star)$ is the solution to Problem (U-LOSS). Hence, an optimal primal-dual pair $(\hat{s}^\star, \lambda^\star)$ of Problem (U-LOSS) gives an optimal primal-dual pair $(p^\star, \lambda^\star)$ of Problem (U-KL), where $p^\star$ is the joint distribution of the backward process induced by $\hat{s}^\star$. By this dual property, we take a dual perspective to train constrained diffusion models: we maximize the dual function $g_s(\lambda)$ to obtain the optimal dual variable $\lambda^\star$, and then recover the solution $\hat{s}^\star$ by minimizing the Lagrangian $L_s(\hat{s}, \lambda^\star)$.

3.2 Examples of KL divergence constraints

To illustrate our KL divergence constraints, we provide two exemplary generation tasks.

(i) Fairness to underrepresented classes. We consider a fair image generation task in which some classes are underrepresented in the available training dataset. An example is the CelebA dataset [51], which contains pictures of celebrity faces, with those labeled as male being underrepresented (42% male vs. 58% female). To promote representation of the underrepresented classes, we can pose this as an instance of Problem (U-KL), where each $q_i$ denotes the distribution of an underrepresented subset, or minority class, of $q$.

(ii) Adapting a pretrained model to new data. Given a pretrained diffusion model over some original dataset that is no longer accessible, we aim to fine-tune the pretrained model to generate data from a new data distribution. Similarly, we can pose this as an instance of Problem (U-KL), where $q_1$ denotes the distribution of the new data and $q$ is the distribution of samples generated by the pre-trained model.

Let $h_i := -\mathbb{E}_{q_i(x_0)}[\log q_i(x_0)]$ be the differential entropy of the data distribution $q_i$. We relate the KL divergence constraints to the optimal dual variables through entropy in Theorem 2.

Theorem 2. Let Assumption 1 hold, and let the supports of the data distributions $q$ and $\{q_i\}_{i=1}^m$ be disjoint. Then, the optimal dual variables $\lambda^\star$ of Problem (U-ELBO) are given by
$$\frac{\lambda_i^\star}{1 + (\lambda^\star)^\top \mathbf{1}} = e^{h_i - \bar{b}_i} \quad \text{for } i = 1, \ldots, m.$$

See Appendix B.4 for the proof. Theorem 2 characterizes the mixture weights in the target distribution $q^{(\lambda^\star)}_{\mathrm{mix}}$: (i) the tighter a constraint is (i.e., the smaller the threshold $b_i$), the more the model will sample from the associated distribution; (ii) for the same constraint thresholds, the model will sample more often from the associated distributions that have higher entropy $h_i$. Assumption 1 can be relaxed to a feasibility condition for Problem (U-KL); see Lemma 6 in Appendix B.4.
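To make Theorem 2 concrete, the toy computation below (with hypothetical entropies and thresholds) inverts the formula to recover the optimal dual variables and the resulting mixture weights; it also checks the feasibility criterion of Lemma 6 in Appendix B.4.

```python
# A toy numerical illustration of Theorem 2; the entropies and thresholds
# are hypothetical values chosen only to satisfy the feasibility criterion.
import numpy as np

h = np.array([2.0, 1.5])       # differential entropies h_1, h_2
bbar = np.array([3.0, 3.0])    # shifted thresholds bbar_1, bbar_2
r = np.exp(h - bbar)           # r_i = lambda_i* / (1 + sum_j lambda_j*)
assert r.sum() < 1.0           # feasibility criterion of Lemma 6

lam = r / (1.0 - r.sum())      # solve r_i = lam_i / (1 + sum(lam)) for lam
w = np.concatenate(([1.0], lam)) / (1.0 + lam.sum())  # weights of (q, q_1, q_2) in q_mix
print(lam)   # optimal dual variables
print(w)     # tighter threshold / higher entropy -> larger weight on that q_i
```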
4 Parametrization and dual training algorithm

Having introduced unparametrized models, we move to the parametrization of constrained diffusion models in Section 4.1, provide an optimality analysis of a Lagrangian-based dual method in Section 4.2, and present a practical dual training algorithm in Section 4.3.

4.1 KL divergence-constrained diffusion model: parametrized case

With the parametrized model $p_\theta$ for $\theta \in \Theta$, we present a parametrized constrained problem,
$$\begin{aligned} \underset{\theta \in \Theta}{\text{minimize}}\quad & D_{\mathrm{KL}}\left(q(x_{0:T}) \,\|\, p_\theta(x_{0:T})\right) \\ \text{subject to}\quad & D_{\mathrm{KL}}\left(q_i(x_{0:T}) \,\|\, p_\theta(x_{0:T})\right) \le b_i \quad \text{for } i = 1, \ldots, m. \end{aligned} \tag{P-KL}$$
Let the Lagrangian for Problem (P-KL) be $L(\theta, \lambda) := L(p_\theta, \lambda)$, with the associated dual function $\tilde{g}(\lambda) := \min_{\theta \in \Theta} L(\theta, \lambda)$. Let an optimal solution to Problem (P-KL) be $\theta^\star$. We denote $\tilde{p}^\star := p_{\theta^\star}$, and the optimal objective by $\tilde{F}^\star := D_{\mathrm{KL}}(q \,\|\, \tilde{p}^\star)$. Let an optimal dual variable be $\tilde{\lambda}^\star \in \arg\max_{\lambda \ge 0} \tilde{g}(\lambda)$ and the optimal value of the dual function be $\tilde{D}^\star := \tilde{g}(\tilde{\lambda}^\star)$.

Problem (P-KL) is non-convex in parameter space, and strong duality no longer holds. Thus, the unparametrized results in Section 3.1 do not directly apply to Problem (P-KL). For instance, it is invalid to find an optimal solution via an unconstrained problem as in Theorem 1; i.e., $p^\star(\tilde{\lambda}^\star) \in \arg\min_{\theta \in \Theta} L(\theta, \tilde{\lambda}^\star)$ does not equal $\tilde{p}^\star$. The effect of parametrization has to be characterized. However, regardless of parametrization, weak duality always holds, i.e., $\tilde{F}^\star - \tilde{D}^\star \ge 0$.

To quantify the optimality of $p^\star(\tilde{\lambda}^\star)$ (its closeness to $q^\star_{\mathrm{mix}}$), we study a practical representation of the model $p_\theta$ as a parametrized function $\hat{s}_\theta \in \mathcal{S}_\theta$, where $\mathcal{S}_\theta$ is the set of all parametrized score functions. Problem (U-LOSS) then takes the parametrized form
$$\begin{aligned} \underset{\theta \in \Theta}{\text{minimize}}\quad & \mathbb{E}_{q(x_0), t, x_t}\left[\left\|\hat{s}_\theta(x_t, t) - \nabla \log q(x_t)\right\|^2\right] \\ \text{subject to}\quad & \mathbb{E}_{q_i(x_0), t, x_t}\left[\left\|\hat{s}_\theta(x_t, t) - \nabla \log q_i(x_t)\right\|^2\right] \le \tilde{b}_i \quad \text{for } i = 1, \ldots, m \end{aligned} \tag{P-LOSS}$$
where $\tilde{b}_i := (\bar{b}_i - v)/\omega$. We note that Problem (P-LOSS) is equivalent to Problem (P-KL). Thus, we let the Lagrangian of Problem (P-LOSS) be $L_s(\theta, \lambda) := L_s(\hat{s}_\theta, \lambda)$, and the dual function $\tilde{g}_s(\lambda) := \min_{\theta \in \Theta} L_s(\theta, \lambda)$. Since $\tilde{g}(\lambda)$ and $\tilde{g}_s(\lambda)$ have the same maximizer $\tilde{\lambda}^\star$, we have $\tilde{\theta}^\star \in \arg\min_{\theta \in \Theta} L_s(\theta, \tilde{\lambda}^\star)$, which naturally yields the dual training method in Algorithm 1. Denote $\tilde{s}^\star := \hat{s}_{\tilde{\theta}^\star}$; the $\tilde{s}^\star$-induced diffusion model is $p^\star(\tilde{\lambda}^\star)$.

Algorithm 1 works as a dual ascent method with two natural steps: (i) find a diffusion model for a fixed dual variable $\lambda(h)$; and (ii) update the dual variable using the (sub)gradient of the Lagrangian $L_s(\hat{s}_{\theta(h)}, \lambda)$. Algorithm 1 is known to converge to $\tilde{\lambda}^\star$ since the dual function $\tilde{g}_s(\lambda)$ is concave. However, such convergence in the dual domain does not provide an optimality guarantee on the primal solution $p^\star(\tilde{\lambda}^\star)$, due to the non-convexity in parameter space. We next exploit the optimization properties of unparametrized diffusion models in Section 3.1 to characterize the optimality of the dual training algorithm.

Algorithm 1 Constrained Diffusion Models via Dual Training
1: Input: total diffusion steps $T$, diffusion parameters $\alpha_t$, total iterations $H$, stepsize $\eta$.
2: Initialize: $\lambda(1) = 0$.
3: for $h = 1, \ldots, H$ do
4:   Compute the model $\hat{s}_{\theta(h)} \in \arg\min_{\theta \in \Theta} L_s(\hat{s}_\theta, \lambda(h))$.
5:   Update the dual variable, for all $i$,
$$\lambda_i(h+1) = \left[\lambda_i(h) + \eta\left(\mathbb{E}_{x_0 \sim q_i, t, x_t}\left[\|\hat{s}_{\theta(h)}(x_t, t) - \nabla \log q_i(x_t)\|^2\right] - \tilde{b}_i\right)\right]_+.$$
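The following PyTorch sketch mirrors Algorithm 1 under simplifying assumptions: the inner argmin in line 4 is approximated by a fixed number of gradient steps (anticipating the practical variant in Section 4.3), the regression target is the conditional score as in Section 2.2, and all names (`loss_on`, the data loaders) are hypothetical placeholders rather than the paper's code.

```python
# A minimal sketch of Algorithm 1 (dual ascent), assuming iterator-style
# data loaders for q and each q_i and a placeholder score network.
import torch

def loss_on(score_net, batch, abar):
    """Stochastic estimate of E_{x0,t,xt} ||s_theta(x_t,t) - grad log q(x_t|x_0)||^2."""
    t = torch.randint(2, abar.shape[0], (batch.shape[0],))
    abar_t = abar[t].view(-1, 1)
    eps = torch.randn_like(batch)
    xt = abar_t.sqrt() * batch + (1 - abar_t).sqrt() * eps
    target = -eps / (1 - abar_t).sqrt()
    return ((score_net(xt, t) - target) ** 2).sum(dim=1).mean()

def dual_training(score_net, q_loader, qi_loaders, b_tilde, abar,
                  H=100, eta=0.05, inner_steps=50, lr=1e-4):
    lam = torch.zeros(len(qi_loaders))                  # lambda(1) = 0
    opt = torch.optim.Adam(score_net.parameters(), lr=lr)
    for h in range(H):
        # (i) approximately minimize the Lagrangian L_s(theta, lambda(h))
        for _ in range(inner_steps):
            loss = loss_on(score_net, next(q_loader), abar)
            for i, loader in enumerate(qi_loaders):
                loss = loss + lam[i] * loss_on(score_net, next(loader), abar)
            opt.zero_grad(); loss.backward(); opt.step()
        # (ii) dual ascent on the constraint slacks, projected onto lambda >= 0
        with torch.no_grad():
            for i, loader in enumerate(qi_loaders):
                slack = loss_on(score_net, next(loader), abar) - b_tilde[i]
                lam[i] = torch.clamp(lam[i] + eta * slack, min=0.0)
    return score_net, lam
```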
4.2 Optimality analysis of the dual training algorithm

We analyze the optimality of $p^\star(\tilde{\lambda}^\star)$ as measured by its distance to $q^\star_{\mathrm{mix}}$, i.e., $\mathrm{TV}(q^\star_{\mathrm{mix}}, p^\star(\tilde{\lambda}^\star))$, where we denote the total variation distance between two probability distributions $p$ and $q$ by $\mathrm{TV}(q, p) := \frac{1}{2}\int |p(x) - q(x)|\, dx$. We first exploit the convergence analysis of diffusion models, and then characterize the additional error induced by parametrization. Let us begin with the difference between $p^\star(\tilde{\lambda}^\star)$ and $q^{(\tilde{\lambda}^\star)}_{\mathrm{mix}}$ at $\tilde{\lambda}^\star$. Denote a partial minimizer of the Lagrangian by $p^\star(\lambda) \in \arg\min_{\theta \in \Theta} L_s(\theta, \lambda)$ for $\lambda \ge 0$.

Noting that $\min_{\theta \in \Theta} L_s(\theta, \lambda)$ is an unconstrained diffusion problem, we can quantify the difference between $p^\star(\lambda)$ and $q^{(\lambda)}_{\mathrm{mix}}$ for any $\lambda \ge 0$ using the convergence theory of diffusion models. To do so, we assume boundedness of samples from the mixed data distribution $q^{(\lambda)}_{\mathrm{mix}}$ for $\lambda \ge 0$, and a small score estimation error.

Assumption 2 (Boundedness of data). The data samples generated from $q^{(\lambda)}_{\mathrm{mix}}$ are bounded, i.e., $\mathbb{P}\left(\|x_0\| \le T^c \mid x_0 \sim q^{(\lambda)}_{\mathrm{mix}}\right) = 1$ for any $\lambda \ge 0$ and some large constant $c > 0$.

Assumption 3 (Boundedness of score estimation error). The score estimator $\hat{s}_\theta(x_t, t)$ estimates the score of the forward process of $q^{(\lambda)}_{\mathrm{mix}}$ with bounded score matching error $\varepsilon_{\mathrm{score}}$,
$$\mathbb{E}_{q^{(\lambda)}_{\mathrm{mix}}(x_0), t, x_t}\left[\left\|\hat{s}_\theta(x_t, t) - \nabla \log q^{(\lambda)}_{\mathrm{mix}}(x_t)\right\|^2\right] \le \varepsilon_{\mathrm{score}}^2 \quad \text{for any } \lambda \ge 0,$$
where $\mathbb{E}_{q^{(\lambda)}_{\mathrm{mix}}(x_0), t, x_t}$ is an expectation over the mixed data distribution $q^{(\lambda)}_{\mathrm{mix}}(x_0)$, a uniform distribution over $t$ from $2$ to $T$, and the forward process $q(x_t \mid x_0)$ given the data sample $x_0$.

Since data samples are bounded in practice, Assumption 2 is mild. Assumption 3 is the typical score matching error, which is near zero if the function class $\mathcal{S}_\theta$ is sufficiently rich. With Assumptions 2 and 3, we bound the TV distance between $q^{(\lambda)}_{\mathrm{mix}}$ and $p^\star(\lambda)$ using the convergence theory of diffusion models from [47]; see Appendix B.5 for the proof.

Lemma 3 (Convergence of diffusion model). Let Assumptions 2 and 3 hold. Then, the TV distance from $p^\star(\lambda)$ to $q^{(\lambda)}_{\mathrm{mix}}$ is bounded by
$$\mathrm{TV}\left(q^{(\lambda)}_{\mathrm{mix}}, p^\star(\lambda)\right) \le \sqrt{\tfrac{1}{2} D_{\mathrm{KL}}\left(q^{(\lambda)}_{\mathrm{mix}} \,\|\, p^\star(\lambda)\right)} \lesssim \frac{d^2 \log^3 T}{\sqrt{T}} + \sqrt{d}\, \log^2 T\, \varepsilon_{\mathrm{score}}. \tag{7}$$

Lemma 3 states that the TV distance between $q^{(\lambda)}_{\mathrm{mix}}$ and $p^\star(\lambda)$ decays to zero at a sublinear rate $O(1/\sqrt{T})$, up to a score matching error $O(\varepsilon_{\mathrm{score}})$. When the diffusion time $T$ is large, the TV distance between $q^{(\lambda)}_{\mathrm{mix}}$ and $p^\star(\lambda)$ is dominated by the score matching error. Substituting $\lambda = \tilde{\lambda}^\star$ into (7) yields an upper bound on $\mathrm{TV}(q^{(\tilde{\lambda}^\star)}_{\mathrm{mix}}, p^\star(\tilde{\lambda}^\star))$, which is the second term of the inequality
$$\mathrm{TV}\left(q^\star_{\mathrm{mix}}, p^\star(\tilde{\lambda}^\star)\right) \le \mathrm{TV}\left(q^\star_{\mathrm{mix}}, q^{(\tilde{\lambda}^\star)}_{\mathrm{mix}}\right) + \mathrm{TV}\left(q^{(\tilde{\lambda}^\star)}_{\mathrm{mix}}, p^\star(\tilde{\lambda}^\star)\right). \tag{8}$$
Next, we quantify the gap between $\tilde{\lambda}^\star$ and $\lambda^\star$, which lets us bound the first term on the right-hand side of (8) and complete the optimality analysis. To analyze the parametrized optimal dual variable $\tilde{\lambda}^\star$, we introduce the richness of the parametrized class $\mathcal{S}_\theta$ and the redundancy of the constraints at $\hat{s}^\star$ below.

Assumption 4 (Richness of parametrization). For any function $\hat{s} \in \mathcal{S}$, there exists a parameter $\theta \in \Theta$ such that $\|\hat{s}_\theta - \hat{s}\|_{L^2} \le \nu$, where the $L^2$ norm is with respect to the forward process.

Assumption 5 (Redundancy of constraints). There exists $\sigma > 0$ such that
$$\left\|\sum_{i=1}^{m} \lambda_i\, \partial_{\hat{s}}\, \mathbb{E}_{q_i(x_0), t, x_t}\left[\left\|\hat{s}^\star(x_t, t) - \nabla \log q_i(x_t)\right\|^2\right]\right\| \ge \sigma \|\lambda\| \quad \text{for all } \lambda \ge 0,$$
where $\partial_{\hat{s}}$ is the Fréchet derivative with respect to the function $\hat{s}$ and $\hat{s}^\star$ is a solution to Problem (U-LOSS).

Assumption 4 is mild, since the gap is small for expressive neural networks [56, 31]. Assumption 5 captures the linear independence of the constraints, as is similarly used in optimization [4]. Due to Assumption 2, we set the function class $\mathcal{S}$ to be bounded: $\|\hat{s}\|_{L^2} \le T^c =: R$. Using Problem (U-LOSS), we prove that the unparametrized dual function $g_s(\lambda)$ is differentiable and strongly concave over $\mathcal{H}$ with parameter $\mu$, where $\mathcal{H} := \{\gamma \lambda^\star + (1 - \gamma) \tilde{\lambda}^\star,\ \gamma \in [0, 1]\}$ and $\mu := \sigma / \left(1 + \max\left\{\|\lambda^\star\|_1, \|\tilde{\lambda}^\star\|_1\right\}\right)^2$, which leads to Lemma 4; see Appendix B.6 for the proofs.

Lemma 4. Let Assumptions 4 and 5 hold. Then, $\|\lambda^\star - \tilde{\lambda}^\star\|_2 \le \frac{8R}{\mu}\left(1 + \|\lambda^\star\|_1\right)\sqrt{\nu}$.

Since $\mathrm{TV}(q^\star_{\mathrm{mix}}, q^{(\tilde{\lambda}^\star)}_{\mathrm{mix}})$ is bounded by $\|\lambda^\star - \tilde{\lambda}^\star\|_1$ (see Appendix B.6), applying Lemma 3 and Lemma 4 to (8) leads to Theorem 3; see Appendix B.7 for the proof.
Theorem 3 (Optimality of constrained diffusion model). Let Assumptions 1-5 hold. Then, the total variation distance between $p^\star(\tilde{\lambda}^\star)$ and $q^\star_{\mathrm{mix}}$ is upper bounded by
$$\mathrm{TV}\left(q^\star_{\mathrm{mix}}, p^\star(\tilde{\lambda}^\star)\right) \lesssim \frac{d^2 \log^3 T}{\sqrt{T}} + \frac{\sqrt{m}\, R}{\mu}\left(1 + \|\lambda^\star\|_1\right)\sqrt{\nu} + \sqrt{d}\, \log^2 T\, \varepsilon_{\mathrm{score}}.$$

Theorem 3 states that the TV distance between $p^\star(\tilde{\lambda}^\star)$ and $q^\star_{\mathrm{mix}}$ decays to zero at a sublinear rate $O(1/\sqrt{T})$, up to a parametrization gap $O(\sqrt{\nu})$ and a prediction error $O(\varepsilon_{\mathrm{score}})$. When the parametrization is rich enough, the parametrization gap $O(\sqrt{\nu})$ and the prediction error $O(\varepsilon_{\mathrm{score}})$ are nearly zero. In this case, if the diffusion time $T$ is large, then $p^\star(\tilde{\lambda}^\star)$ is close to $q^\star_{\mathrm{mix}}$ in TV distance, which recovers the ideal optimal constrained model of the unparametrized case in Section 3.1.

4.3 Practical dual training algorithm

Having established the optimality of our dual training method, we further turn Algorithm 1 into a practical algorithm. First, we relax the computation of the diffusion model $\hat{s}_{\theta(h)}$ in line 4 of Algorithm 1 to be approximate:
$$L_s(\hat{s}_{\theta(h)}, \lambda(h)) \le \min_{\theta \in \Theta} L_s(\hat{s}_\theta, \lambda(h)) + \varepsilon_{\mathrm{approx}}^2,$$
where $\varepsilon_{\mathrm{approx}}^2$ is the approximation error of training a diffusion model given $\lambda(h)$. Second, we replace the gradient in line 5 of Algorithm 1 by a stochastic gradient $\hat{\mathbb{E}}_{x_0 \sim q_i, t, x_t}\left[\|\hat{s}_\theta(x_t, t) - \nabla \log q_i(x_t)\|^2\right]$, which turns Algorithm 1 into a stochastic algorithm, where $\hat{\mathbb{E}}_{x_0 \sim q_i, t, x_t}$ is an unbiased estimate of $\mathbb{E}_{x_0 \sim q_i, t, x_t}$.

To analyze this approximate and stochastic variant of Algorithm 1, it is useful to introduce the maximum parametrized dual function value in the history up to step $h$, $\tilde{g}_{\mathrm{best}}(h) := \max_{h' \le h} \tilde{g}_s(\lambda(h'))$, and an upper bound $S^2$ on the second-order moment of the stochastic gradient,
$$\sum_{i=1}^{m} \mathbb{E}\left[\left(\hat{\mathbb{E}}_{x_0 \sim q_i, t, x_t}\left[\|\hat{s}_{\theta(h)}(x_t, t) - \nabla \log q_i(x_t)\|^2\right] - \tilde{b}_i\right)^2 \,\middle|\, \lambda(h)\right] \le S^2.$$
Denote the dual variable that achieves $\tilde{g}_{\mathrm{best}}(h)$ by $\tilde{\lambda}_{\mathrm{best}}$. To bound the TV distance between $p^\star(\tilde{\lambda}_{\mathrm{best}})$ and $q^\star_{\mathrm{mix}}$, we check the TV distance between $q^{(\tilde{\lambda}_{\mathrm{best}})}_{\mathrm{mix}}$ and $p^\star(\tilde{\lambda}_{\mathrm{best}})$ using Lemma 3. The rest is to analyze the convergence of $\tilde{\lambda}_{\mathrm{best}}$ to $\lambda^\star$ via an application of martingale convergence. We defer the proofs to Appendix B.8 and present the optimality of $p^\star(\tilde{\lambda}_{\mathrm{best}})$ in Theorem 4.

Theorem 4 (Optimality of approximate constrained diffusion model). Let Assumptions 1-5 hold. Then, the total variation distance between $p^\star(\tilde{\lambda}_{\mathrm{best}})$ and $q^\star_{\mathrm{mix}}$ is upper bounded by
$$\mathrm{TV}\left(q^\star_{\mathrm{mix}}, p^\star(\tilde{\lambda}_{\mathrm{best}})\right) \lesssim \frac{d^2 \log^3 T}{\sqrt{T}} + \frac{R}{\mu}\left(1 + \|\tilde{\lambda}_{\mathrm{best}}\|_1\right)\sqrt{\nu} + \sqrt{d}\, \log^2 T\, \varepsilon_{\mathrm{score}} + \frac{1}{\mu}\sqrt{\varepsilon_{\mathrm{approx}}^2 + \eta S^2}.$$

Theorem 4 states that the TV distance between $p^\star(\tilde{\lambda}_{\mathrm{best}})$ and $q^\star_{\mathrm{mix}}$ decays to zero at a sublinear rate $O(1/\sqrt{T})$, up to a parametrization gap $O(\sqrt{\nu})$, a score matching error $O(\varepsilon_{\mathrm{score}})$, an approximation error $O(\varepsilon_{\mathrm{approx}})$, and a stepsize term $O(\sqrt{\eta})$. When the parametrization is rich enough, the parametrization gap $O(\sqrt{\nu})$ and the score matching error $O(\varepsilon_{\mathrm{score}})$ are near zero. Thus, if the diffusion time $T$ is large, the closeness of $p^\star(\tilde{\lambda}_{\mathrm{best}})$ to $q^\star_{\mathrm{mix}}$ in TV distance is governed by the approximation error and the stepsize.
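Note that the constraints only change training; generation from the trained constrained model uses the standard backward process. A hedged sketch, using the backward mean $\hat{\mu}(x_t) = \left(x_t + (1 - \alpha_t)\, \hat{s}_\theta(x_t, t)\right)/\sqrt{\alpha_t}$ from Appendix A and the variance $\sigma_p^2(t) = 1/\alpha_t - 1$ from Section 2.1:

```python
# A minimal sketch of ancestral sampling from the trained constrained model,
# assuming `alpha` is a torch tensor holding alpha[1..T] and `score_net` is
# the trained score network.
import torch

@torch.no_grad()
def sample(score_net, alpha, d, n=16):
    T = alpha.shape[0] - 1
    x = torch.randn(n, d)                          # x_T ~ N(0, I)
    for t in range(T, 0, -1):
        a_t = alpha[t]
        mean = (x + (1 - a_t) * score_net(x, torch.full((n,), t))) / a_t.sqrt()
        noise = torch.randn_like(x) if t > 1 else torch.zeros_like(x)
        x = mean + (1 / a_t - 1).sqrt() * noise    # sigma_p(t) * z
    return x                                       # approximate samples from q_mix
```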
5 Computational experiments

We demonstrate the effectiveness of constrained diffusion models trained by our dual training algorithm in the two constrained settings of Section 3.2; see Appendix C for experimental details.

Fairness to underrepresented classes. We train constrained diffusion models on three datasets: MNIST digits [44], CelebA faces [51], and ImageNet [63]. For MNIST and ImageNet, we create a dataset for the distribution $q$ in (P-LOSS) by taking a subset of the dataset with an equal number of samples from each class; we then make some classes underrepresented by removing a portion of their samples. For each distribution $q_i$, we use samples from the associated underrepresented class. For CelebA, our approach is similar to MNIST, except we do not remove any samples, owing to the class imbalance already present in the dataset (58% female vs. 42% male). For ImageNet, we use a subset of ten classes (Tench fish, English springer dog, Cassette player, Chain saw, Church, French horn, Garbage truck, Gas pump, Golf ball, Parachute); since the images are of high resolution, we employ the latent diffusion scheme [62] by imposing the distribution constraints in latent space.

Figures 1-3 show that our constrained model samples more often from the underrepresented classes (MNIST: 4, 5, 7; CelebA: male; ImageNet: Cassette player, French horn, and Golf ball), leading to more uniform sampling over all classes. This reflects our theoretical insights on promoting fairness for minority classes (see Section 3.2). Quantitatively, we observe notably lower FID scores when training on the same dataset with constraints (see Appendix C for further discussion of FID scores). Furthermore, our ImageNet experiment shows that our approach extends to state-of-the-art diffusion models in latent space.

Figure 1: Generation performance comparison of constrained and unconstrained models trained on MNIST with three minority classes: 4, 5, 7. (Left) Frequencies of the ten digits generated by an unconstrained model and by our constrained model; (Middle) generated digits from the unconstrained model (FID 15.9); (Right) generated digits from our constrained model (FID 13.4).

Figure 2: Generation performance comparison of constrained and unconstrained models trained on CelebA with the male minority class. (Left) Frequencies of the two genders generated by an unconstrained model and by our constrained model; (Middle) generated faces from the unconstrained model (FID 19.6); (Right) generated faces from our constrained model (FID 11.6).

Figure 3: Generation performance comparison of constrained and unconstrained models trained on ImageNet with minority classes Cassette player (2), French horn (5), and Golf ball (8). (Left) Frequencies of the ten classes generated by an unconstrained model and by our constrained model; (Middle) generated images from the unconstrained model (FID 36.0); (Right) generated images from our constrained model (FID 27.3).

Adapting a pretrained model to new data. Given a pretrained diffusion model over some original dataset $\mathcal{D}_{\mathrm{pretrain}}$, we fine-tune the pretrained model to generate data that resemble $\mathcal{D}_{\mathrm{new}}$. To cast this problem into (P-KL), we let the data distribution be that of $\mathcal{D}_{\mathrm{new}}$, i.e., $q(x_{0:T}) = q_{\mathrm{new}}(x_{0:T})$, and the constrained distribution be the pre-trained model, i.e., $q_1(x_{0:T}) = p_{\theta_{\mathrm{pre}}}(x_{0:T})$. In our experiments, we pretrain a diffusion model on a subset of MNIST digits excluding one class (the digit 9), and fine-tune this model using samples of the excluded digit. Figure 4 shows that our constrained fine-tuned model samples from the new class as well as all previous classes, whereas the model fine-tuned without the constraint quickly overfits to the new dataset (see Appendix C for details). Our constrained model also generates samples of considerably higher quality than the unconstrained model.

Figure 4: Fine-tuning performance comparison of constrained and unconstrained models trained on MNIST. (Left) Frequencies of the ten digits generated by a pre-trained model without digit 9 and by our fine-tuned constrained model; (Middle) generated digits from the unconstrained model (FID 45.9); (Right) generated digits from our constrained model (FID 25.2).

6 Conclusion

We have presented a constrained optimization framework for training diffusion models under distribution constraints. We have developed a Lagrangian-based dual algorithm to train such constrained diffusion models.
Our theoretical analysis shows that our constrained diffusion model generates new data from an optimal mixture data distribution that satisfies the constraints, and we have demonstrated the effectiveness of our distribution constraints in reducing bias across three widely used datasets. This work directly stimulates several research directions: (i) extend our distribution constraints to other domain constraints, e.g., mirror diffusion [48]; (ii) incorporate conditional generation, e.g., text-to-image generation [28, 65], into our constrained diffusion models; (iii) conduct experiments with text-to-image datasets to identify and address biases; (iv) improve the convergence theory using more advanced diffusion processes.

Acknowledgments

We thank the reviewers and program chairs for providing helpful comments.

References

[1] A. Bansal, H.-M. Chu, A. Schwarzschild, S. Sengupta, M. Goldblum, J. Geiping, and T. Goldstein. Universal guidance for diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 843-852, 2023.
[2] J. Benton, V. De Bortoli, A. Doucet, and G. Deligiannidis. Linear convergence bounds for diffusion models via stochastic localization. In Proceedings of the International Conference on Learning Representations, 2024.
[3] J. Benton, V. De Bortoli, A. Doucet, and G. Deligiannidis. Nearly d-linear convergence bounds for diffusion models via stochastic localization. In Proceedings of the International Conference on Learning Representations, 2024.
[4] D. P. Bertsekas. Nonlinear Programming. Athena Scientific, 2016.
[5] K. Black, M. Janner, Y. Du, I. Kostrikov, and S. Levine. Training diffusion models with reinforcement learning. In Proceedings of the International Conference on Learning Representations, 2024.
[6] A. Blattmann, R. Rombach, H. Ling, T. Dockhorn, S. W. Kim, S. Fidler, and K. Kreis. Align your latents: High-resolution video synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22563-22575, 2023.
[7] D. M. Blei, A. Kucukelbir, and J. D. McAuliffe. Variational inference: A review for statisticians. Journal of the American Statistical Association, 112(518):859-877, 2017.
[8] H. Cao, C. Tan, Z. Gao, Y. Xu, G. Chen, P.-A. Heng, and S. Z. Li. A survey on generative diffusion models. IEEE Transactions on Knowledge and Data Engineering, 2024.
[9] L. Chamon and A. Ribeiro. Probably approximately correct constrained learning. In Proceedings of the Advances in Neural Information Processing Systems, volume 33, pages 16722-16735, 2020.
[10] L. F. Chamon, S. Paternain, M. Calvo-Fullana, and A. Ribeiro. Constrained learning with non-convex losses. IEEE Transactions on Information Theory, 69(3):1739-1760, 2022.
[11] H. Chen, H. Lee, and J. Lu. Improved analysis of score-based generative modeling: User-friendly bounds under minimal smoothness assumptions. In Proceedings of the International Conference on Machine Learning, pages 4735-4763, 2023.
[12] J. Chen, Z. Shao, X. Zheng, K. Zhang, and J. Yin.
Integrating aesthetics and efficiency: AI-driven diffusion models for visually pleasing interior design generation. Scientific Reports, 14(1):3496, 2024.
[13] J. Chen, R. Zhang, Y. Zhou, and C. Chen. Towards aligned layout generation via diffusion model with aesthetic constraints. arXiv preprint arXiv:2402.04754, 2024.
[14] M. Chen, S. Mei, J. Fan, and M. Wang. An overview of diffusion models: Applications, guided generation, statistical rates and optimization. arXiv preprint arXiv:2404.07771, 2024.
[15] S. Chen, S. Chewi, J. Li, Y. Li, A. Salim, and A. R. Zhang. Sampling is as easy as learning the score: theory for diffusion models with minimal data assumptions. In Proceedings of the International Conference on Learning Representations, 2023.
[16] K. Choi, A. Grover, T. Singh, R. Shu, and S. Ermon. Fair generative modeling via weak supervision. In Proceedings of the International Conference on Machine Learning, pages 1887-1898, 2020.
[17] J. K. Christopher, S. Baek, and F. Fioretto. Projected generative diffusion models for constraint satisfaction. arXiv preprint arXiv:2402.03559, 2024.
[18] K. Clark, P. Vicol, K. Swersky, and D. J. Fleet. Directly fine-tuning diffusion models on differentiable rewards. In Proceedings of the International Conference on Learning Representations, 2024.
[19] F.-A. Croitoru, V. Hondru, R. T. Ionescu, and M. Shah. Diffusion models in vision: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2023.
[20] P. Dhariwal and A. Nichol. Diffusion models beat GANs on image synthesis. In Proceedings of the Advances in Neural Information Processing Systems, volume 34, pages 8780-8794, 2021.
[21] Y. Du, C. Durkan, R. Strudel, J. B. Tenenbaum, S. Dieleman, R. Fergus, J. Sohl-Dickstein, A. Doucet, and W. S. Grathwohl. Reduce, reuse, recycle: Compositional generation with energy-based diffusion models and MCMC. In Proceedings of the International Conference on Machine Learning, pages 8489-8510, 2023.
[22] J. Elenter, L. F. Chamon, and A. Ribeiro. Near-optimal solutions of constrained learning problems. In Proceedings of the International Conference on Learning Representations, 2024.
[23] Y. Fan and K. Lee. Optimizing DDPM sampling with shortcut fine-tuning. In Proceedings of the International Conference on Machine Learning, pages 9623-9639, 2023.
[24] Y. Fan, O. Watkins, Y. Du, H. Liu, M. Ryu, C. Boutilier, P. Abbeel, M. Ghavamzadeh, K. Lee, and K. Lee. Reinforcement learning for fine-tuning text-to-image diffusion models. In Proceedings of the Advances in Neural Information Processing Systems, volume 36, 2023.
[25] B. T. Feng, R. Baptista, and K. L. Bouman. Neural approximate mirror maps for constrained diffusion models. arXiv preprint arXiv:2406.12816, 2024.
[26] N. Fishman, L. Klarner, V. De Bortoli, E. Mathieu, and M. J. Hutchinson. Diffusion models for constrained domains. Transactions on Machine Learning Research, 2024.
[27] N. Fishman, L. Klarner, E. Mathieu, M. Hutchinson, and V. De Bortoli. Metropolis sampling for constrained diffusion models. In Proceedings of the Advances in Neural Information Processing Systems, volume 36, 2024.
[28] F. Friedrich, M. Brack, L. Struppek, D. Hintersdorf, P. Schramowski, S. Luccioni, and K. Kersting. Fair diffusion: Instructing text-to-image generation models on fairness. arXiv preprint arXiv:2302.10893, 2023.
[29] G. Giannone, A. Srivastava, O. Winther, and F. Ahmed. Aligning optimization trajectories with diffusion models for constrained design generation.
In Proceedings of the Advances in Neural Information Processing Systems, volume 36, 2023.
[30] S. Gugger, L. Debut, T. Wolf, P. Schmid, Z. Mueller, S. Mangrulkar, M. Sun, and B. Bossan. Accelerate: Training and inference at scale made simple, efficient and adaptable. https://github.com/huggingface/accelerate, 2022.
[31] Y. Han, M. Razaviyayn, and R. Xu. Neural network-based score estimation in diffusion models: Optimization and generalization. arXiv preprint arXiv:2401.15604, 2024.
[32] Y. Hao, Z. Chi, L. Dong, and F. Wei. Optimizing prompts for text-to-image generation. In Proceedings of the Advances in Neural Information Processing Systems, volume 36, 2023.
[33] M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, and S. Hochreiter. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. In Proceedings of the Advances in Neural Information Processing Systems, volume 30, 2017.
[34] J. Ho, A. Jain, and P. Abbeel. Denoising diffusion probabilistic models. In Proceedings of the Advances in Neural Information Processing Systems, volume 33, pages 6840-6851, 2020.
[35] J. Ho and T. Salimans. Classifier-free diffusion guidance. In NeurIPS 2021 Workshop on Deep Generative Models and Downstream Applications, 2021.
[36] I. Hounie, A. Ribeiro, and L. F. Chamon. Resilient constrained learning. In Proceedings of the Advances in Neural Information Processing Systems, volume 36, 2023.
[37] C.-W. Huang, M. Aghajohari, J. Bose, P. Panangaden, and A. C. Courville. Riemannian diffusion models. In Proceedings of the Advances in Neural Information Processing Systems, volume 35, pages 2750-2761, 2022.
[38] D. Z. Huang, J. Huang, and Z. Lin. Convergence analysis of probability flow ODE for score-based generative models. arXiv preprint arXiv:2404.09730, 2024.
[39] L. Huang, T. Xu, Y. Yu, P. Zhao, K.-C. Wong, and H. Zhang. A dual diffusion model enables 3D binding bioactive molecule generation and lead optimization given target pockets. bioRxiv, 2023.
[40] A. Kazerouni, E. K. Aghdam, M. Heidari, R. Azad, M. Fayyaz, I. Hacihaliloglu, and D. Merhof. Diffusion models in medical imaging: A comprehensive survey. Medical Image Analysis, page 102846, 2023.
[41] Y. Kim, B. Na, M. Park, J. Jang, D. Kim, W. Kang, and I.-C. Moon. Training unbiased diffusion models from biased dataset. In Proceedings of the International Conference on Learning Representations, 2024.
[42] D. Kingma and R. Gao. Understanding diffusion objectives as the ELBO with simple data augmentation. In Proceedings of the Advances in Neural Information Processing Systems, volume 36, 2023.
[43] Z. Kong, W. Ping, J. Huang, K. Zhao, and B. Catanzaro. DiffWave: A versatile diffusion model for audio synthesis. In Proceedings of the International Conference on Learning Representations, 2020.
[44] Y. LeCun, C. Cortes, and C. J. Burges. The MNIST database of handwritten digits. http://yann.lecun.com/exdb/mnist/.
[45] K. Lee, H. Liu, M. Ryu, O. Watkins, Y. Du, C. Boutilier, P. Abbeel, M. Ghavamzadeh, and S. S. Gu. Aligning text-to-image models using human feedback. arXiv preprint arXiv:2302.12192, 2023.
[46] G. Li, Y. Huang, T. Efimov, Y. Wei, Y. Chi, and Y. Chen. Accelerating convergence of score-based diffusion models, provably. arXiv preprint arXiv:2403.03852, 2024.
[47] G. Li, Y. Wei, Y. Chen, and Y. Chi. Towards faster non-asymptotic convergence for diffusion-based generative models. In Proceedings of the International Conference on Learning Representations, 2024.
[48] G.-H. Liu, T. Chen, E. Theodorou, and M. Tao. Mirror diffusion models for constrained and watermarked generation. In Proceedings of the Advances in Neural Information Processing Systems, volume 36, 2023.
[49] N. Liu, S. Li, Y. Du, A. Torralba, and J. B. Tenenbaum. Compositional visual generation with composable diffusion models. In Proceedings of the European Conference on Computer Vision, pages 423-439, 2022.
[50] X. Liu, X. Tong, and Q. Liu. Sampling with trustworthy constraints: A variational gradient framework. In Proceedings of the Advances in Neural Information Processing Systems, pages 23557-23568, 2021.
[51] Z. Liu, P. Luo, X. Wang, and X. Tang. Large-scale CelebFaces Attributes (CelebA) dataset. https://mmlab.ie.cuhk.edu.hk/projects/CelebA.html.
[52] I. Loshchilov and F. Hutter. Decoupled weight decay regularization, 2019.
[53] A. Lou and S. Ermon. Reflected diffusion models. In Proceedings of the International Conference on Machine Learning, pages 22675-22701, 2023.
[54] A. S. Luccioni, C. Akiki, M. Mitchell, and Y. Jernite. Stable bias: Analyzing societal representations in diffusion models. arXiv preprint arXiv:2303.11408, 2023.
[55] C. Luo. Understanding diffusion models: A unified perspective. arXiv preprint arXiv:2208.11970, 2022.
[56] S. Mei and Y. Wu. Deep networks as denoising algorithms: Sample-efficient learning of diffusion models in high-dimensional graphical models. arXiv preprint arXiv:2309.11420, 2023.
[57] R. Naik and B. Nushi. Social biases through the text-to-image generation lens. In Proceedings of the AAAI/ACM Conference on AI, Ethics, and Society, pages 786-808, 2023.
[58] G. Parmar, R. Zhang, and J.-Y. Zhu. On aliased resizing and surprising subtleties in GAN evaluation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11410-11420, 2022.
[59] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, A. Desmaison, A. Köpf, E. Yang, Z. DeVito, M. Raison, A. Tejani, S. Chilamkurthy, B. Steiner, L. Fang, J. Bai, and S. Chintala. PyTorch: An imperative style, high-performance deep learning library, 2019.
[60] T. Power, R. Soltani-Zarrin, S. Iba, and D. Berenson. Sampling constrained trajectories using composable diffusion models. In Proceedings of the IROS 2023 Workshop on Differentiable Probabilistic Robotics: Emerging Perspectives on Robot Learning, 2023.
[61] D. J. Rezende and F. Viola. Generalized ELBO with constrained optimization, GECO. In Workshop on Bayesian Deep Learning, NeurIPS, 2018.
[62] R. Rombach, A. Blattmann, D. Lorenz, P. Esser, and B. Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10684-10695, 2022.
[63] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei. ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV), 115(3):211-252, 2015.
[64] C. Saharia, W. Chan, S. Saxena, L. Li, J. Whang, E. L. Denton, K. Ghasemipour, R. Gontijo Lopes, B. Karagol Ayan, T. Salimans, et al. Photorealistic text-to-image diffusion models with deep language understanding. In Proceedings of the Advances in Neural Information Processing Systems, volume 35, pages 36479-36494, 2022.
[65] X. Shen, C. Du, T. Pang, M. Lin, Y. Wong, and M. Kankanhalli. Finetuning text-to-image diffusion models for fairness.
In Proceedings of the International Conference on Learning Representations, 2024.
[66] J. Sohl-Dickstein, E. Weiss, N. Maheswaranathan, and S. Ganguli. Deep unsupervised learning using nonequilibrium thermodynamics. In Proceedings of the International Conference on Machine Learning, pages 2256-2265, 2015.
[67] V. Solo and X. Kong. Adaptive Signal Processing Algorithms: Stability and Performance. Prentice-Hall, Inc., 1994.
[68] J. Song, C. Meng, and S. Ermon. Denoising diffusion implicit models. In Proceedings of the International Conference on Learning Representations, 2021.
[69] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1-9, 2015.
[70] W. Tang. Fine-tuning of diffusion models via stochastic control: entropy regularization and beyond. arXiv preprint arXiv:2403.06279, 2024.
[71] M. Uehara, Y. Zhao, K. Black, E. Hajiramezanali, G. Scalia, N. L. Diamant, A. M. Tseng, T. Biancalani, and S. Levine. Fine-tuning of continuous-time diffusion models as entropy-regularized control. arXiv preprint arXiv:2402.15194, 2024.
[72] M. Uehara, Y. Zhao, E. Hajiramezanali, G. Scalia, G. Eraslan, A. Lal, S. Levine, and T. Biancalani. Bridging model-based optimization and generative modeling via conservative fine-tuning of diffusion models. arXiv preprint arXiv:2405.19673, 2024.
[73] P. von Platen, S. Patil, A. Lozhkov, P. Cuenca, N. Lambert, K. Rasul, M. Davaadorj, D. Nair, S. Paul, W. Berman, Y. Xu, S. Liu, and T. Wolf. Diffusers: State-of-the-art diffusion models. https://github.com/huggingface/diffusers, 2022.
[74] J. L. Watson, D. Juergens, N. R. Bennett, B. L. Trippe, J. Yim, H. E. Eisenach, W. Ahern, A. J. Borst, R. J. Ragotte, L. F. Milles, et al. De novo design of protein structure and function with RFdiffusion. Nature, 620(7976):1089-1100, 2023.
[75] T. Weiss, E. Mayo Yanes, S. Chakraborty, L. Cosmo, A. M. Bronstein, and R. Gershoni-Poranne. Guided diffusion for inverse molecular design. Nature Computational Science, 3(10):873-882, 2023.
[76] X. Wu, K. Sun, F. Zhu, R. Zhao, and H. Li. Human preference score: Better aligning text-to-image models with human preference. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 2096-2105, 2023.
[77] J. Xu, X. Liu, Y. Wu, Y. Tong, Q. Li, M. Ding, J. Tang, and Y. Dong. ImageReward: Learning and evaluating human preferences for text-to-image generation. In Proceedings of the Advances in Neural Information Processing Systems, volume 36, 2023.
[78] L. Yang, Z. Zhang, Y. Song, S. Hong, R. Xu, Y. Zhao, W. Zhang, B. Cui, and M.-H. Yang. Diffusion models: A comprehensive survey of methods and applications. ACM Computing Surveys, 56(4), 2023.
[79] L. Yang, Z. Zhang, Y. Song, S. Hong, R. Xu, Y. Zhao, W. Zhang, B. Cui, and M.-H. Yang. Diffusion models: A comprehensive survey of methods and applications. ACM Computing Surveys, 56(4):1-39, 2023.
[80] H. Yuan, K. Huang, C. Ni, M. Chen, and M. Wang. Reward-directed conditional diffusion: Provable distribution estimation and reward improvement. In Proceedings of the Advances in Neural Information Processing Systems, volume 36, 2023.
[81] C. Zhang, C. Zhang, S. Zheng, M. Zhang, M. Qamar, S.-H. Bae, and I. S. Kweon. Audio diffusion model for speech synthesis: A survey on text to speech and speech enhancement in generative AI. arXiv preprint arXiv:2303.13336, 2023.
[82] Z. Zhang, S. Zhang, Y. Zhan, Y. Luo, Y. Wen, and D. Tao. Confronting reward overoptimization for diffusion models: A perspective of inductive and primacy biases. In Proceedings of the International Conference on Machine Learning, 2024.
Supplementary Materials for "Constrained Diffusion Models via Dual Training"

A Details on ELBO

Recall the evidence lower bound (ELBO), $\mathcal{E}(p; q) := \mathbb{E}_{q(x_0)}\mathbb{E}_{q(x_{1:T} \mid x_0)}\left[\log \frac{p(x_{0:T})}{q(x_{1:T} \mid x_0)}\right]$. We can utilize conditionals to expand it into
$$\mathcal{E}(p; q) = \underbrace{\mathbb{E}_{q(x_0)}\mathbb{E}_{q(x_1 \mid x_0)}\left[\log p(x_0 \mid x_1)\right]}_{\text{reconstruction likelihood}} - \underbrace{\mathbb{E}_{q(x_0)}\left[D_{\mathrm{KL}}\left(q(x_T \mid x_0) \,\|\, p(x_T)\right)\right]}_{\text{final latent mismatch}} - \underbrace{\sum_{t=2}^{T} \mathbb{E}_{q(x_0)}\mathbb{E}_{q(x_t \mid x_0)}\left[D_{\mathrm{KL}}\left(q(x_{t-1} \mid x_t, x_0) \,\|\, p(x_{t-1} \mid x_t)\right)\right]}_{\text{denoising matching term}}$$
where the first term is the reconstruction likelihood of the original data given the first latent $x_1$, the second term is the mismatch between the final latent distribution and the Gaussian prior, and the last summation measures the mismatch between the denoising transitions of the forward and backward processes. With the variance schedule described in Section 2.1, the reconstruction likelihood and the final latent mismatch are known to be negligible, and thus the approximation in (4) is almost exact, which is the focal setting of this paper.

We next focus on one summand of the denoising matching term, $\mathbb{E}_{q(x_0)}\mathbb{E}_{q(x_t \mid x_0)}\left[D_{\mathrm{KL}}\left(q(x_{t-1} \mid x_t, x_0) \,\|\, p(x_{t-1} \mid x_t)\right)\right]$. Application of the reparametrization trick leads to $x_t = \sqrt{\alpha_t}\, x_{t-1} + \sqrt{1 - \alpha_t}\, \epsilon_{t-1}$, where $\epsilon_{t-1} \sim \mathcal{N}(0, I)$ is a white noise sample. Using Bayes' rule, we can express $q(x_{t-1} \mid x_t, x_0)$ as the Gaussian distribution $\mathcal{N}(x_{t-1}; \mu_q(x_t), \sigma_q^2(t) I)$ with mean
$$\mu_q(x_t) = \frac{1}{\sqrt{\alpha_t}}\, x_t + \frac{1 - \alpha_t}{\sqrt{\alpha_t}}\, \nabla \log q(x_t)$$
and variance $\sigma_q^2(t) = \frac{(1 - \alpha_t)(1 - \bar{\alpha}_{t-1})}{1 - \bar{\alpha}_t}$. To stay as close as possible to the ground-truth backward conditional $q(x_{t-1} \mid x_t, x_0)$, we take $p(x_{t-1} \mid x_t)$ to be of the same form, except replacing $\nabla \log q(x_t)$ by $\hat{s}(x_t, t)$ and $\sigma_q^2(t)$ by $\sigma_p^2(t)$: $\mathcal{N}(x_{t-1}; \hat{\mu}(x_t), \sigma_p^2(t) I)$, where $\hat{\mu}(x_t) = \frac{1}{\sqrt{\alpha_t}}\, x_t + \frac{1 - \alpha_t}{\sqrt{\alpha_t}}\, \hat{s}(x_t, t)$. Thus, by the formula for the KL divergence between two Gaussians,
$$D_{\mathrm{KL}}\left(q(x_{t-1} \mid x_t, x_0) \,\|\, p(x_{t-1} \mid x_t)\right) = \underbrace{\frac{1}{2}\left(d \log \frac{\sigma_p^2(t)}{\sigma_q^2(t)} - d + \frac{d\, \sigma_q^2(t)}{\sigma_p^2(t)}\right)}_{\text{variance mismatch}} + \underbrace{\frac{1}{2 \sigma_p^2(t)} \frac{(1 - \alpha_t)^2}{\alpha_t} \left\|\hat{s}(x_t, t) - \nabla \log q(x_t)\right\|^2}_{\text{prediction loss}}.$$
Since $\sigma_p^2(t)$ and $\sigma_q^2(t)$ are constants, the variance mismatch term is irrelevant to optimization. Denote $\omega_t := \frac{(1 - \alpha_t)^2}{2 \sigma_p^2(t)\, \alpha_t}$ and $\omega := \sum_{t=2}^{T} \omega_t$. We can define a discrete distribution over the set $\{2, \ldots, T\}$ as $p_\omega(t) := \omega_t / \omega$. Also denote $v := \sum_{t=2}^{T} \frac{1}{2}\left(d \log \frac{\sigma_p^2(t)}{\sigma_q^2(t)} - d + \frac{d\, \sigma_q^2(t)}{\sigma_p^2(t)}\right)$. Hence, the ELBO maximization, $\text{maximize}_p\ \mathcal{E}(p; q)$, is equivalent to the quadratic loss minimization
$$\underset{\hat{s}}{\text{minimize}}\quad v + \omega\, \mathbb{E}_{x_0, t, x_t}\left[\left\|\hat{s}(x_t, t) - \nabla \log q(x_t)\right\|^2\right]$$
where $\mathbb{E}_{x_0, t, x_t}$ is an expectation over the data distribution $q(x_0)$, the discrete distribution $p_\omega(t)$ over $t$ from $2$ to $T$, and the forward process $q(x_t \mid x_0)$ given the data sample $x_0$. Since shifting or scaling an objective function by a constant does not change the solution of an optimization problem, we omit the constants $v$ and $\omega$ for brevity, and only emphasize them when necessary. Hence, the ELBO maximization equals the quadratic loss minimization,
$$\underset{\hat{s}}{\text{minimize}}\quad \mathbb{E}_{x_0, t, x_t}\left[\left\|\hat{s}(x_t, t) - \nabla \log q(x_t)\right\|^2\right]$$
up to some scaling and shifting constants, where $\mathbb{E}_{x_0, t, x_t}$ is the expectation over the data distribution $q(x_0)$, the discrete distribution $p_\omega(t)$ over $t$ from $2$ to $T$, and the forward process $q(x_t \mid x_0)$ given the data sample $x_0$.
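Given the formulas just derived, the constants $\omega_t$, $p_\omega(t)$, and $v$ are directly computable from a schedule; the following small NumPy sketch does so, with helper names that are our own rather than the paper's.

```python
# A sketch computing omega_t, p_omega(t), and v from a schedule alpha[1..T],
# assuming the sigma_q^2 and sigma_p^2 formulas above with abar_t = prod alpha_s.
import numpy as np

def elbo_constants(alpha, d):
    T = len(alpha) - 1
    abar = np.cumprod(alpha[1:])                     # abar[t-1] corresponds to abar_t
    t = np.arange(2, T + 1)
    sq2 = (1 - alpha[t]) * (1 - abar[t - 2]) / (1 - abar[t - 1])   # sigma_q^2(t)
    sp2 = 1.0 / alpha[t] - 1.0                                     # sigma_p^2(t)
    omega_t = (1 - alpha[t]) ** 2 / (2 * sp2 * alpha[t])
    p_omega = omega_t / omega_t.sum()                # discrete distribution over {2,...,T}
    v = 0.5 * (d * np.log(sp2 / sq2) - d + d * sq2 / sp2).sum()    # constant shift
    return omega_t, p_omega, v
```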
In practice, however, we have to parametrize the estimator $\hat{s}(x_t, t)$ as $\hat{s}_\theta(x_t, t)$ with parameter $\theta \in \Theta$,
$$\underset{\theta \in \Theta}{\text{minimize}}\quad \mathbb{E}_{x_0, t, x_t}\left[\left\|\hat{s}_\theta(x_t, t) - \nabla \log q(x_t)\right\|^2\right]$$
which is our focal objective of generative modeling. The parametrized representation of $p(x_{t-1} \mid x_t)$ associated with $\hat{s}_\theta(x_t, t)$ is denoted by $p_\theta(x_{t-1} \mid x_t)$, and the backward process has the parametrized joint distribution $p_\theta(x_{0:T})$. We remark that the above prediction problem can be reformulated as data or noise prediction [55], with our results directly transferable to these formulations.

B Proofs

We provide proofs of all lemmas and theorems in the main paper.

B.1 Proof of Lemma 1

Proof. The ELBO maximization has the same optimal solution as the KL divergence minimization because of Equality (3). This directly proves the equivalence of the first two problems. Next, we relate these two problems to the likelihood maximization problem. We note that the KL divergence is non-negative and is zero if and only if the two distributions are the same. Since $q(x_{0:T}) \in \mathcal{P}$ for large $T$, the solution of the ELBO maximization and the KL divergence minimization is given by $p^\star = q$. For the KL divergence minimization, the optimal value is zero. For the optimal value of the ELBO maximization, it follows from (3) that
$$\mathcal{E}(p^\star; q) = \mathbb{E}_{q(x_0)}[\log q(x_0)] - D_{\mathrm{KL}}\left(q(x_{0:T}) \,\|\, p^\star(x_{0:T})\right) = \mathbb{E}_{q(x_0)}[\log q(x_0)]. \tag{10}$$
It is clear that the likelihood maximization problem $\text{maximize}_p\ \mathbb{E}_{q(x_0)}[\log p(x_0)]$ is equivalent to $\text{minimize}_p\ D_{\mathrm{KL}}(q(x_0) \,\|\, p(x_0))$. Therefore, any distribution $p^\star(x_{0:T})$ whose marginal satisfies $p^\star(x_0) = q(x_0)$ is a solution of the likelihood maximization problem. This includes the solution of the KL divergence minimization and the ELBO maximization, which is $p^\star = q$. Therefore,
$$\underset{p \in \mathcal{P}}{\text{maximize}}\ \mathcal{E}(p; q) \;\equiv\; \underset{p \in \mathcal{P}}{\text{maximize}}\ \mathbb{E}_{q(x_0)}[\log p(x_0)] \tag{11}$$
which concludes the proof.

B.2 Proof of Lemma 2

Proof. It is straightforward to check the zero duality gap in convex optimization; see, e.g., [4, Proposition 5.3.2]. Furthermore, for a convex optimization problem, an optimal dual variable $\lambda^\star$ that maximizes the dual function is a geometric multiplier. Hence, $(p^\star, \lambda^\star)$ is an optimal primal-dual pair of the convex optimization problem.

B.3 Proof of Theorem 1

Proof. From the strong duality in Lemma 2, it is known from [4, Proposition 5.1.4] that $\lambda^\star$ is also a geometric multiplier. Thus, Problem (U-KL) reduces to an unconstrained problem,
$$\underset{p \in \mathcal{P}}{\text{minimize}}\quad L(p, \lambda^\star) \tag{12}$$
where the objective function results from plugging an optimal dual variable $\lambda^\star$ into the Lagrangian $L(p, \lambda)$. By the definition of the Lagrangian,
$$L(p, \lambda) = D_{\mathrm{KL}}\left(q(x_{0:T}) \,\|\, p(x_{0:T})\right) + \sum_{i=1}^{m} \lambda_i \left(D_{\mathrm{KL}}\left(q_i(x_{0:T}) \,\|\, p(x_{0:T})\right) - b_i\right) = -\mathcal{E}(p; q) - \sum_{i=1}^{m} \lambda_i\, \mathcal{E}(p; q_i) + \mathbb{E}_{q(x_0)}\left[\log q(x_0)\right] + \sum_{i=1}^{m} \lambda_i \left(\mathbb{E}_{q_i(x_0)}\left[\log q_i(x_0)\right] - b_i\right).$$
By taking $\lambda = \lambda^\star$ and dropping the terms that do not depend on $p$, Problem (12) is equivalent to
$$\underset{p \in \mathcal{P}}{\text{maximize}}\quad \mathcal{E}(p; q) + \sum_{i=1}^{m} \lambda_i^\star\, \mathcal{E}(p; q_i). \tag{13}$$
From the definition of the ELBO, we know that
$$\mathcal{E}(p; q) + \sum_{i=1}^{m} \lambda_i^\star\, \mathcal{E}(p; q_i) = \mathbb{E}_{q(x_0)}\mathbb{E}_{q(x_{1:T} \mid x_0)}\left[\log \frac{p(x_{0:T})}{q(x_{1:T} \mid x_0)}\right] + \sum_{i=1}^{m} \lambda_i^\star\, \mathbb{E}_{q_i(x_0)}\mathbb{E}_{q(x_{1:T} \mid x_0)}\left[\log \frac{p(x_{0:T})}{q(x_{1:T} \mid x_0)}\right]$$
where we use the fact that the forward processes have the same conditional distribution given any initial data sample. Normalization of the initial data distributions leads to $q^{(\lambda^\star)}_{\mathrm{mix}}$. Thus, Problem (13) is equivalent to
$$\underset{p \in \mathcal{P}}{\text{maximize}}\quad \mathcal{E}\left(p;\, q^{(\lambda^\star)}_{\mathrm{mix}}\right)$$
which, together with Lemma 1, completes the proof.

B.4 Proof of Theorem 2 and a feasibility criterion

We start with the proof of Theorem 2.
B.4 Proof of Theorem 2 and Feasibility Criterion

We start with the proof of Theorem 2.

Proof. Similar to the proof of Theorem 1, we begin with the Lagrangian of Problem (U-ELBO),
$$L(p,\lambda) = -\mathcal{E}(p;q) - \sum_{i=1}^{m}\lambda_i\left(\mathcal{E}(p;q_i) + b_i\right)$$
$$= -\sum_{i=0}^{m}\bar{\lambda}_i\,\mathcal{E}(p;q_i) - \bar{\lambda}^\top\bar{b} \tag{14}$$
$$= -(\bar{\lambda}^\top\mathbf{1})\sum_{i=0}^{m}\frac{\bar{\lambda}_i}{\bar{\lambda}^\top\mathbf{1}}\,\mathbb{E}_{q_i(x_0)}\mathbb{E}_{q(x_{1:T}\mid x_0)}\left[\log\frac{p(x_{0:T})}{q(x_{1:T}\mid x_0)}\right] - \bar{\lambda}^\top\bar{b}$$
$$= -(\bar{\lambda}^\top\mathbf{1})\,\mathbb{E}_{q^{(\lambda)}(x_0)}\mathbb{E}_{q(x_{1:T}\mid x_0)}\left[\log\frac{p(x_{0:T})}{q(x_{1:T}\mid x_0)}\right] - \bar{\lambda}^\top\bar{b}$$
$$= -(\bar{\lambda}^\top\mathbf{1})\,\mathcal{E}(p;q^{(\lambda)}) - \bar{\lambda}^\top\bar{b}$$
where from (14) onwards we use the notation $\bar{\lambda}_0 = 1$, $b_0 = 0$, $\bar{\lambda} = [\bar{\lambda}_0,\lambda_1,\ldots,\lambda_m]^\top$, $\bar{b} = [b_0,b_1,\ldots,b_m]^\top$, $q^{(\lambda)} := \sum_{i=0}^{m}(\bar{\lambda}_i/\bar{\lambda}^\top\mathbf{1})\,q_i$, and use $q_0$ to represent $q$; this notation is used in the rest of the proof. To formulate the dual problem, we check the minimum of the Lagrangian,
$$g(\lambda) := \underset{p\in\mathcal{P}}{\text{minimize}}\ L(p,\lambda) = \underset{p\in\mathcal{P}}{\text{minimize}}\ \left(-(\bar{\lambda}^\top\mathbf{1})\,\mathcal{E}(p;q^{(\lambda)}) - \bar{\lambda}^\top\bar{b}\right) = -\bar{\lambda}^\top\bar{b} + (\bar{\lambda}^\top\mathbf{1})\,\underset{p\in\mathcal{P}}{\text{minimize}}\ \left(-\mathcal{E}(p;q^{(\lambda)})\right) \tag{15}$$
where the only term that depends on $p$ is the ELBO. Recall that
$$D_{\mathrm{KL}}(q(x_{0:T})\,\|\,p(x_{0:T})) = -\mathcal{E}(p;q) + \mathbb{E}_{q(x_0)}[\log q(x_0)].$$
Since the minimum value of the KL divergence is zero (attained when $p = q$), the minimum of $-\mathcal{E}(p;q)$ is likewise attained when $p = q$. Thus, it is straightforward that this minimum equals the entropy of the distribution $q$, denoted by $h(q) := -\mathbb{E}_{q(x_0)}[\log q(x_0)]$. With this in mind, from (15) we have $g(\lambda) = -\bar{\lambda}^\top\bar{b} + (\bar{\lambda}^\top\mathbf{1})\,h(q^{(\lambda)})$. Thus, the dual problem reads
$$\underset{\lambda\geq 0}{\text{maximize}}\quad g(\lambda) := -\bar{\lambda}^\top\bar{b} + (\bar{\lambda}^\top\mathbf{1})\,h(q^{(\lambda)}).$$
We first reformulate the entropy of the mixture distribution $q^{(\lambda)}$,
$$h(q^{(\lambda)}) = -\mathbb{E}_{q^{(\lambda)}(x_0)}\left[\log q^{(\lambda)}(x_0)\right] = -\sum_{i=0}^{m}\int \frac{\bar{\lambda}_i}{\bar{\lambda}^\top\mathbf{1}}\,q_i(x_0)\,\log\left(\sum_{j=0}^{m}\frac{\bar{\lambda}_j}{\bar{\lambda}^\top\mathbf{1}}\,q_j(x_0)\right)dx_0 \tag{16}$$
$$= -\sum_{i=0}^{m}\int \frac{\bar{\lambda}_i}{\bar{\lambda}^\top\mathbf{1}}\,q_i(x_0)\,\log\left(\frac{\bar{\lambda}_i}{\bar{\lambda}^\top\mathbf{1}}\,q_i(x_0)\right)dx_0 \tag{17}$$
$$= \sum_{i=0}^{m}\frac{\bar{\lambda}_i}{\bar{\lambda}^\top\mathbf{1}}\underbrace{\left(-\int q_i(x_0)\log q_i(x_0)\,dx_0\right)}_{:=\,h_i} - \sum_{i=0}^{m}\frac{\bar{\lambda}_i}{\bar{\lambda}^\top\mathbf{1}}\log\bar{\lambda}_i + \log(\bar{\lambda}^\top\mathbf{1})$$
where going from (16) to (17) we utilize the assumption that the distributions $\{q_i\}_{i=0}^{m}$ have disjoint supports; see Remark 1 on when this is the case. Now we can compute the gradient of the dual function over $\lambda_i$, $i=1,\ldots,m$,
$$\frac{\partial}{\partial\lambda_i}\left(-\bar{\lambda}^\top\bar{b} + \sum_{j=0}^{m}\bar{\lambda}_j h_j - \sum_{j=0}^{m}\bar{\lambda}_j\log\bar{\lambda}_j + (\bar{\lambda}^\top\mathbf{1})\log(\bar{\lambda}^\top\mathbf{1})\right) = h_i - b_i - \log\frac{\lambda_i}{\bar{\lambda}^\top\mathbf{1}}.$$
Setting the gradient to zero allows us to find the optimal dual variables $\lambda^\star$,
$$h_i - b_i - \log\frac{\bar{\lambda}_i^\star}{(\bar{\lambda}^\star)^\top\mathbf{1}} = 0 \quad\text{for } i=1,\ldots,m.$$
Hence,
$$\frac{\bar{\lambda}_i^\star}{(\bar{\lambda}^\star)^\top\mathbf{1}} = e^{h_i - b_i} \quad\text{for } i=1,\ldots,m. \tag{18}$$
We clarify that in (18), $\bar{\lambda}^\star = [\bar{\lambda}_0^\star,\ldots,\bar{\lambda}_m^\star]^\top$ with its first element being $\bar{\lambda}_0^\star = 1$. Finally, if we return to the notation $\lambda^\star = [\lambda_1^\star,\ldots,\lambda_m^\star]^\top$, then
$$\frac{\lambda_i^\star}{1 + (\lambda^\star)^\top\mathbf{1}} = e^{h_i - b_i} \quad\text{for } i=1,\ldots,m$$
which completes the proof.

Remark 1. We remark on the assumption of the distributions $\{q_i\}_{i=0}^{m}$ having disjoint supports. In the setting of adapting a model to new data in Section 3.2, this is a reasonable assumption: since we often fine-tune a pre-trained diffusion model on new data not seen in the original pre-training dataset, the new data distribution and the pre-training data distribution have mostly disjoint supports. In the minority class setting in Section 3.2, the constraint distributions $\{q_i\}_{i=1}^{m}$ and the objective distribution $q_0$ usually are not disjoint. However, since the distributions $\{q_i\}_{i=1}^{m}$ are often restrictions of $q_0$ to subsets of its support, i.e., the minority classes, the derivation of the optimal dual variables is similar to the one provided in this section, so we omit the repeated details. Extending these results to cases where the distributions are neither disjoint nor restrictions of the objective distribution is challenging and is left to future work.
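The fixed point (18) can be solved in closed form: writing $r_i := e^{h_i-b_i}$ and $s := \sum_{j=1}^{m} r_j$, summing (18) over $i$ gives $\sum_j\lambda_j^\star = s\,(1+\sum_j\lambda_j^\star)$, hence $\lambda_i^\star = r_i/(1-s)$, which is finite and positive precisely when $s < 1$ (the criterion of Lemmas 5 and 6 below). A small numeric sketch with made-up entropies and thresholds:

```python
# Closed-form optimal dual variables from (18); h and b are illustrative values.
import numpy as np

h = np.array([0.5, 1.2])    # differential entropies h_i (toy numbers)
b = np.array([1.8, 2.0])    # constraint thresholds b_i
r = np.exp(h - b)
assert r.sum() < 1.0, "infeasible: sum of exp(h_i - b_i) must be < 1"
lam_star = r / (1.0 - r.sum())
# sanity check of (18): lambda*_i / (1 + ||lambda*||_1) equals exp(h_i - b_i)
print(lam_star, np.allclose(lam_star / (1.0 + lam_star.sum()), r))
```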
To prove a feasibility criterion, we first show that the dual function is finite in Lemma 5.

Lemma 5 (Boundedness of the optimal dual function). Let the differential entropy $h_i$ of each distribution $q_i$ be finite. Then, the optimal value of the dual function $D^\star := \max_{\lambda\geq 0} g(\lambda)$ is finite if and only if
$$\sum_{i=1}^{m} e^{h_i - b_i} < 1.$$

Proof. ($\Leftarrow$) From $\sum_{i=1}^{m} e^{h_i-b_i} < 1$, the optimal dual variable $\lambda^\star$ given by (18) is finite. Thus, the optimal value of the dual function becomes
$$g(\lambda^\star) = -(\bar{\lambda}^\star)^\top\bar{b} + \sum_{i=0}^{m}\bar{\lambda}_i^\star h_i - \sum_{i=0}^{m}\bar{\lambda}_i^\star\log\bar{\lambda}_i^\star + \left((\bar{\lambda}^\star)^\top\mathbf{1}\right)\log\left((\bar{\lambda}^\star)^\top\mathbf{1}\right)$$
where each $\bar{\lambda}_i^\star = e^{h_i-b_i}\,(\bar{\lambda}^\star)^\top\mathbf{1} > 0$ is finite and the entropies $\{h_i\}_{i=1}^{m}$ are all finite. Therefore, $D^\star$ is finite.

($\Rightarrow$) We prove it by contradiction. Assume $\sum_{i=1}^{m} e^{h_i-b_i} = e^{\delta} \geq 1$ for some $\delta\geq 0$. Then for any $\lambda\geq 0$, there exists a direction in which $g(\lambda)$ increases, i.e.,
$$\frac{\partial g}{\partial\lambda_i} = h_i - b_i - \log\frac{\lambda_i}{\bar{\lambda}^\top\mathbf{1}} > \delta \quad\text{for some } i. \tag{19}$$
To see (19) by contradiction, we check that
$$h_i - b_i - \log\frac{\lambda_i}{\bar{\lambda}^\top\mathbf{1}} \leq \delta \ \text{ for } i=1,\ldots,m \;\implies\; e^{h_i-b_i-\delta} \leq \frac{\lambda_i}{\bar{\lambda}^\top\mathbf{1}} \ \text{ for } i=1,\ldots,m$$
$$\implies\; \left(\sum_{i=1}^{m} e^{h_i-b_i}\right)e^{-\delta} \leq \sum_{i=1}^{m}\frac{\lambda_i}{\bar{\lambda}^\top\mathbf{1}} = \frac{\sum_{i=1}^{m}\lambda_i}{1+\sum_{i=1}^{m}\lambda_i} < 1 \;\implies\; \sum_{i=1}^{m} e^{h_i-b_i} < e^{\delta}$$
which contradicts our assumption that $\sum_{i=1}^{m} e^{h_i-b_i} = e^{\delta}$. By contradiction, we have (19). Furthermore, (19) implies that $g(\lambda)$ is unbounded above, which contradicts the finiteness of $D^\star$. Therefore, we must have $\sum_{i=1}^{m} e^{h_i-b_i} < 1$.

Lemma 6 (Feasibility criterion). Let the differential entropy $h_i$ of each distribution $q_i$ be finite. Suppose that there exists a feasible solution to Problem (U-ELBO) such that its objective function is bounded from below. Then, Problem (U-ELBO) is feasible if and only if
$$\sum_{i=1}^{m} e^{h_i - b_i} < 1.$$

Proof. ($\Rightarrow$) Since the primal problem is feasible, the optimal objective value $F^\star$ is bounded from below and is attained at a feasible point. For the sake of contradiction, assume $\sum_{i=1}^{m} e^{h_i-b_i} \geq 1$, which is equivalent to $D^\star = \infty$ according to Lemma 5. However, this violates weak duality, i.e., $D^\star \leq F^\star$. By contradiction, we must have $\sum_{i=1}^{m} e^{h_i-b_i} < 1$.

($\Leftarrow$) We consider a set $A$,
$$A := \left\{(u_1,\ldots,u_m,t)\;\middle|\; -\mathcal{E}(p;q_i) - b_i \leq u_i \text{ for } i=1,\ldots,m \text{ and } -\mathcal{E}(p;q_0) \leq t \text{ for some } p\in\mathcal{P}\right\}.$$
The set $A$ is convex since it is built from the epigraphs of $m+1$ convex functions. We also introduce another convex set $B$,
$$B := \{(0,\ldots,0,t)\mid t\in\mathbb{R}\}.$$
We utilize proof by contradiction. Assume that the primal problem is infeasible. Then there does not exist any $p\in\mathcal{P}$ such that $-\mathcal{E}(p;q_i) - b_i \leq 0$ for all $i=1,\ldots,m$. Hence, $A$ and $B$ are two disjoint convex sets. From the separating hyperplane theorem, there exists a hyperplane that separates them, i.e., there exist $v\in\mathbb{R}^{m+1}$ and $c\in\mathbb{R}$ such that
$$x^\top v \geq c \quad\text{for all } x\in A \tag{20}$$
$$y^\top v \leq c \quad\text{for all } y\in B. \tag{21}$$
Let $v = [\lambda_1,\ldots,\lambda_m,\gamma]^\top$. Then (21) reduces to $y^\top v = \gamma t \leq c$ for all $(0,\ldots,0,t)\in B$, which forces $\gamma = 0$, because $\gamma t \leq c$ cannot hold for all $t\in\mathbb{R}$ otherwise. Note that $\gamma = 0$ means the separating hyperplane is vertical, i.e., parallel to the $t$-axis. Furthermore, from (20) we can write $x^\top v = \lambda^\top u \geq c$ for all $(u,t)\in A$, which forces $\lambda\geq 0$: the set of values that each $u_i$ can take in $A$ is unbounded above, so if some $\lambda_i$ were negative, $\lambda^\top u$ could be made arbitrarily small. Now, we consider
$$g(\lambda) = \inf_{(u,t)\in A}\left(t + \lambda^\top u\right) \quad\text{and}\quad g(\alpha\lambda) = \inf_{(u,t)\in A}\left(t + \alpha\,\lambda^\top u\right) \ \text{ for } \alpha\in\mathbb{R}_+.$$
Since $t$ is bounded from below over $A$ and $\lambda^\top u \geq c$ for all $(u,t)\in A$, with $c > 0$ by the separation of $A$ from $B$, letting $\alpha\to\infty$ yields
$$\lim_{\alpha\to\infty} g(\alpha\lambda) = \lim_{\alpha\to\infty}\inf_{(u,t)\in A}\left(t + \alpha\,\lambda^\top u\right) = \infty$$
which shows that $D^\star = \infty$. This contradicts $D^\star$ being finite ($D^\star$ is finite due to $\sum_{i=1}^{m} e^{h_i-b_i} < 1$ and Lemma 5). Because of the contradiction, the primal problem has to be feasible.

B.5 Proof of Lemma 3

Proof. The proof is an application of the convergence theory of DDPM [47, Theorem 3]. We next check all assumptions of [47, Theorem 3]. It is easy to see that we can cast $q_{\mathrm{mix}}^{(\lambda)}$ as the target distribution of a diffusion model. By the definition,
$$L(\theta,\lambda) = D_{\mathrm{KL}}(q(x_{0:T})\,\|\,p_\theta(x_{0:T})) + \sum_{i=1}^{m}\lambda_i\left(D_{\mathrm{KL}}(q_i(x_{0:T})\,\|\,p_\theta(x_{0:T})) - b_i\right)$$
$$= -\mathcal{E}(p_\theta;q) + \mathbb{E}_{q(x_0)}[\log q(x_0)] + \sum_{i=1}^{m}\lambda_i\left(-\mathcal{E}(p_\theta;q_i) + \mathbb{E}_{q_i(x_0)}[\log q_i(x_0)] - b_i\right).$$
Thus, the partial minimization of $L(\theta,\lambda)$ over $\theta$ is equivalent to a weighted ELBO minimization,
$$\underset{\theta\in\Theta}{\text{minimize}}\ -\mathcal{E}(p_\theta;q) - \sum_{i=1}^{m}\lambda_i\,\mathcal{E}(p_\theta;q_i) \quad\text{or, equivalently,}\quad \underset{\theta\in\Theta}{\text{minimize}}\ -\mathcal{E}\big(p_\theta;\,q_{\mathrm{mix}}^{(\lambda)}\big) \tag{22}$$
where we normalize the weighted ELBO objective by introducing the mixed data distribution $q_{\mathrm{mix}}^{(\lambda)}$. We note that $p^\star(\lambda)$ is also a minimizer of Problem (22). On the other hand, using Problem (P-LOSS), we can rewrite Problem (22) as
$$\underset{\theta\in\Theta}{\text{minimize}}\quad \mathbb{E}_{q_{\mathrm{mix}}^{(\lambda)}(x_0),\,t,\,x_t}\left[\left\|\hat{s}_\theta(x_t,t) - \nabla\log q(x_t)\right\|^2\right]$$
which is equivalent to the score matching objective in DDPM. Therefore, the score matching assumption in [47, Assumption 1] is satisfied with the error bound $\varepsilon_{\mathrm{score}}^2$ from Assumption 3. In view of Assumption 2, and using appropriate stepsizes and variances, all assumptions in [47, Theorem 3] are satisfied. Therefore, application of [47, Theorem 3] completes the proof.

B.6 Proof of Lemma 4

Lemma 7. The TV distance between two mixture data distributions $q_{\mathrm{mix}}^{(\lambda)}$, $q_{\mathrm{mix}}^{(\lambda')}$ is bounded by
$$\mathrm{TV}\left(q_{\mathrm{mix}}^{(\lambda)},\,q_{\mathrm{mix}}^{(\lambda')}\right) \leq \|\lambda - \lambda'\|_1.$$

Proof. By the definition,
$$\mathrm{TV}\left(q_{\mathrm{mix}}^{(\lambda)},\,q_{\mathrm{mix}}^{(\lambda')}\right) = \frac{1}{2}\left\|\frac{q + \sum_{i=1}^{m}\lambda_i q_i}{1+\|\lambda\|_1} - \frac{q + \sum_{i=1}^{m}\lambda_i' q_i}{1+\|\lambda'\|_1}\right\|_1$$
$$= \frac{1}{2}\,\frac{\left\|(1+\|\lambda'\|_1)\left(q + \sum_{i=1}^{m}\lambda_i q_i\right) - (1+\|\lambda\|_1)\left(q + \sum_{i=1}^{m}\lambda_i' q_i\right)\right\|_1}{(1+\|\lambda\|_1)(1+\|\lambda'\|_1)}$$
$$= \frac{1}{2}\,\frac{\left\|(1+\|\lambda\|_1)\sum_{i=1}^{m}(\lambda_i - \lambda_i')\,q_i + \left(\|\lambda'\|_1 - \|\lambda\|_1\right)\left(q + \sum_{i=1}^{m}\lambda_i q_i\right)\right\|_1}{(1+\|\lambda\|_1)(1+\|\lambda'\|_1)}$$
$$\leq \frac{1}{2}\,\frac{\sum_{i=1}^{m}|\lambda_i - \lambda_i'| + \left|\|\lambda'\|_1 - \|\lambda\|_1\right|}{1+\|\lambda'\|_1} \leq \sum_{i=1}^{m}|\lambda_i - \lambda_i'|$$
where the first inequality uses the triangle inequality together with $\|q_i\|_1 = 1$ and $\|q + \sum_i\lambda_i q_i\|_1 = 1+\|\lambda\|_1$, and the last inequality uses $\left|\|\lambda'\|_1 - \|\lambda\|_1\right| \leq \|\lambda-\lambda'\|_1$ and $1+\|\lambda'\|_1 \geq 1$.

We recall the Lagrangians for Problems (U-LOSS) and (P-LOSS),
$$L_s(\hat{s},\lambda) = \mathbb{E}_{q(x_0),t,x_t}\left[\|\hat{s}(x_t,t) - \nabla\log q(x_t)\|^2\right] + \sum_{i=1}^{m}\lambda_i\left(\mathbb{E}_{q_i(x_0),t,x_t}\left[\|\hat{s}(x_t,t) - \nabla\log q(x_t)\|^2\right] - \tilde{b}_i\right)$$
and its parametrized counterpart $\tilde{L}_s(\theta,\lambda) := L_s(\hat{s}_\theta,\lambda)$, together with their associated dual functions,
$$g_s(\lambda) = \underset{\hat{s}\in\mathcal{S}}{\text{minimize}}\ L_s(\hat{s},\lambda) \quad\text{and}\quad \tilde{g}_s(\lambda) = \underset{\theta\in\Theta}{\text{minimize}}\ L_s(\hat{s}_\theta,\lambda).$$
For brevity, we use the shorthand $\mathbb{E}_q$ and $\mathbb{E}_{q_i}$ for $\mathbb{E}_{q(x_0),t,x_t}$ and $\mathbb{E}_{q_i(x_0),t,x_t}$, respectively.

Lemma 8 (Parametrization gap). Let Assumption 4 hold. Then, $0 \leq \tilde{g}_s(\lambda) - g_s(\lambda) \leq 4R(1+\|\lambda\|_1)\nu$ for any $\lambda\geq 0$.

Proof. Let the partial minimizer of $L_s(\hat{s},\lambda)$ over $\hat{s}$ be $\hat{s}^\star(\lambda) := \mathrm{argmin}_{\hat{s}\in\mathcal{S}}\,L_s(\hat{s},\lambda)$ for any $\lambda\geq 0$. By Assumption 4, for any $\lambda\geq 0$ there exists $\tilde{\theta}\in\Theta$ such that $\|\hat{s}^\star(\lambda) - \hat{s}_{\tilde{\theta}}\|_{L^2} \leq \nu$. Thus,
$$L_s(\hat{s}_{\tilde{\theta}},\lambda) - L_s(\hat{s}^\star(\lambda),\lambda) = \mathbb{E}_q\left[\|\hat{s}_{\tilde{\theta}}(x_t,t) - \nabla\log q(x_t)\|^2\right] - \mathbb{E}_q\left[\|\hat{s}^\star(\lambda)(x_t,t) - \nabla\log q(x_t)\|^2\right]$$
$$\qquad + \sum_{i=1}^{m}\lambda_i\left(\mathbb{E}_{q_i}\left[\|\hat{s}_{\tilde{\theta}}(x_t,t) - \nabla\log q(x_t)\|^2\right] - \mathbb{E}_{q_i}\left[\|\hat{s}^\star(\lambda)(x_t,t) - \nabla\log q(x_t)\|^2\right]\right)$$
$$\leq 4R\,\mathbb{E}_q\left\|\hat{s}_{\tilde{\theta}}(x_t,t) - \hat{s}^\star(\lambda)(x_t,t)\right\| + 4R\sum_{i=1}^{m}\lambda_i\,\mathbb{E}_{q_i}\left\|\hat{s}_{\tilde{\theta}}(x_t,t) - \hat{s}^\star(\lambda)(x_t,t)\right\| \leq 4R\nu + 4R\|\lambda\|_1\nu$$
where the first inequality is due to the quadratic function being locally Lipschitz continuous with parameter $4R$, and the second inequality is due to $\|\hat{s}^\star(\lambda) - \hat{s}_{\tilde{\theta}}\|_{L^2}\leq\nu$ from Assumption 4. With $\theta^\star(\lambda)\in\mathrm{argmin}_{\theta\in\Theta}\,L_s(\hat{s}_\theta,\lambda)$, we have $L_s(\hat{s}_{\theta^\star(\lambda)},\lambda) \leq L_s(\hat{s}_{\tilde{\theta}},\lambda)$. Hence,
$$0 \leq L_s(\hat{s}_{\theta^\star(\lambda)},\lambda) - L_s(\hat{s}^\star(\lambda),\lambda) \leq L_s(\hat{s}_{\tilde{\theta}},\lambda) - L_s(\hat{s}^\star(\lambda),\lambda) \leq 4R(1+\|\lambda\|_1)\nu$$
which gives our desired result by the definition of the dual functions.

Lemma 9 (Differentiability). The dual function $g_s(\lambda)$ is differentiable with gradient $\nabla_\lambda L_s(\hat{s}^\star(\lambda),\lambda)$.

Proof. For any $\lambda\geq 0$, the Lagrangian $L_s(\hat{s},\lambda)$ is strongly convex in the function $\hat{s}\in\mathcal{S}$. Since $\mathcal{S}$ is convex and compact, the partial minimizer $\hat{s}^\star(\lambda)$ is unique. By Danskin's theorem [4], $g_s(\lambda)$ is differentiable and its gradient is the gradient of $L_s(\hat{s},\lambda)$ over $\lambda$ evaluated at $\hat{s} = \hat{s}^\star(\lambda)$.

Lemma 10 (Concavity). The dual function $g_s(\lambda)$ is $\mu$-strongly concave in $\lambda\in H$, where
$$\mu := \frac{2\sigma^2}{\left(1 + \max_{\lambda\in H}\|\lambda\|_1\right)^2}.$$

Proof. For any $\lambda_1,\lambda_2\in H$, we denote $\hat{s}_1^\star := \hat{s}^\star(\lambda_1)$ and $\hat{s}_2^\star := \hat{s}^\star(\lambda_2)$, which are the unique partial minimizers of the Lagrangians $L_s(\hat{s},\lambda_1)$ and $L_s(\hat{s},\lambda_2)$. Denote $\ell_0(\hat{s}) := \mathbb{E}_q\left[\|\hat{s}(x_t,t) - \nabla\log q(x_t)\|^2\right]$, $\ell_i(\hat{s}) := \mathbb{E}_{q_i}\left[\|\hat{s}(x_t,t) - \nabla\log q(x_t)\|^2\right]$ for $i = 1,\ldots,m$, and $\ell(\hat{s}) := [\ell_1(\hat{s}),\ldots,\ell_m(\hat{s})]^\top$.
By the convexity of $\ell_i$,
$$\ell_i(\hat{s}_1^\star) \geq \ell_i(\hat{s}_2^\star) + \langle\nabla_{\hat{s}}\ell_i(\hat{s}_2^\star),\,\hat{s}_1^\star - \hat{s}_2^\star\rangle$$
$$\ell_i(\hat{s}_2^\star) \geq \ell_i(\hat{s}_1^\star) + \langle\nabla_{\hat{s}}\ell_i(\hat{s}_1^\star),\,\hat{s}_2^\star - \hat{s}_1^\star\rangle.$$
If we multiply the first inequality above by $\lambda_{2,i}\geq 0$ and the second by $\lambda_{1,i}\geq 0$, and add them up over $i$ from both sides, then
$$\langle\ell(\hat{s}_1^\star) - \ell(\hat{s}_2^\star),\,\lambda_2 - \lambda_1\rangle \geq \langle\lambda_1^\top\nabla_{\hat{s}}\ell(\hat{s}_1^\star) - \lambda_2^\top\nabla_{\hat{s}}\ell(\hat{s}_2^\star),\,\hat{s}_2^\star - \hat{s}_1^\star\rangle.$$
We apply Lemma 9 to the left-hand side of the inequality above,
$$-\left(\nabla g_s(\lambda_2) - \nabla g_s(\lambda_1)\right)^\top(\lambda_2 - \lambda_1) \geq \langle\lambda_1^\top\nabla_{\hat{s}}\ell(\hat{s}_1^\star) - \lambda_2^\top\nabla_{\hat{s}}\ell(\hat{s}_2^\star),\,\hat{s}_2^\star - \hat{s}_1^\star\rangle. \tag{23}$$
On the other hand, by the optimality of $\hat{s}_1^\star$ and $\hat{s}_2^\star$,
$$\nabla_{\hat{s}}\ell_0(\hat{s}_1^\star) + \lambda_1^\top\nabla_{\hat{s}}\ell(\hat{s}_1^\star) = 0 \tag{24a}$$
$$\nabla_{\hat{s}}\ell_0(\hat{s}_2^\star) + \lambda_2^\top\nabla_{\hat{s}}\ell(\hat{s}_2^\star) = 0 \tag{24b}$$
which allows us to simplify the right-hand side of (23) and obtain
$$-\left(\nabla g_s(\lambda_2) - \nabla g_s(\lambda_1)\right)^\top(\lambda_2 - \lambda_1) \geq \langle\nabla_{\hat{s}}\ell_0(\hat{s}_2^\star) - \nabla_{\hat{s}}\ell_0(\hat{s}_1^\star),\,\hat{s}_2^\star - \hat{s}_1^\star\rangle \geq 2\left\|\hat{s}_1^\star - \hat{s}_2^\star\right\|_{L^2}^2 \tag{25}$$
where the last inequality results from the strong convexity of quadratic functionals. By the smoothness of quadratic functionals with parameter 1,
$$\|\hat{s}_1^\star - \hat{s}_2^\star\|_{L^2} \geq \left\|\nabla_{\hat{s}}\ell_0(\hat{s}_1^\star) - \nabla_{\hat{s}}\ell_0(\hat{s}_2^\star)\right\|_{L^2} = \left\|(\lambda_2 - \lambda_1)^\top\nabla_{\hat{s}}\ell(\hat{s}_2^\star) - \lambda_1^\top\left(\nabla_{\hat{s}}\ell(\hat{s}_1^\star) - \nabla_{\hat{s}}\ell(\hat{s}_2^\star)\right)\right\|_{L^2}$$
$$\geq \left\|(\lambda_2 - \lambda_1)^\top\nabla_{\hat{s}}\ell(\hat{s}_2^\star)\right\|_{L^2} - \left\|\lambda_1^\top\left(\nabla_{\hat{s}}\ell(\hat{s}_1^\star) - \nabla_{\hat{s}}\ell(\hat{s}_2^\star)\right)\right\|_{L^2}$$
where the equality is due to the optimality conditions (24) and the last inequality is due to the reverse triangle inequality. By Assumption 5, $\left\|(\lambda_2-\lambda_1)^\top\nabla_{\hat{s}}\ell(\hat{s}_2^\star)\right\|_{L^2} \geq \sigma\|\lambda_2 - \lambda_1\|$. We also notice that
$$\left\|\lambda_1^\top\left(\nabla_{\hat{s}}\ell(\hat{s}_1^\star) - \nabla_{\hat{s}}\ell(\hat{s}_2^\star)\right)\right\|_{L^2} \leq \sum_{i=1}^{m}\lambda_{1,i}\left\|\nabla_{\hat{s}}\ell_i(\hat{s}_1^\star) - \nabla_{\hat{s}}\ell_i(\hat{s}_2^\star)\right\|_{L^2} \leq \sum_{i=1}^{m}\lambda_{1,i}\,\|\hat{s}_1^\star - \hat{s}_2^\star\|_{L^2}$$
where the first inequality is due to the triangle inequality and the second is due to the smoothness of quadratic functionals. Hence,
$$\|\hat{s}_1^\star - \hat{s}_2^\star\|_{L^2} \geq \sigma\|\lambda_2 - \lambda_1\| - \|\lambda_1\|_1\,\|\hat{s}_1^\star - \hat{s}_2^\star\|_{L^2} \quad\text{or, equivalently,}\quad \|\hat{s}_1^\star - \hat{s}_2^\star\|_{L^2} \geq \frac{\sigma}{1+\|\lambda_1\|_1}\,\|\lambda_2 - \lambda_1\|.$$
Therefore, (25) becomes
$$-\left(\nabla g_s(\lambda_2) - \nabla g_s(\lambda_1)\right)^\top(\lambda_2 - \lambda_1) \geq \frac{2\sigma^2}{(1+\|\lambda_1\|_1)^2}\,\|\lambda_2 - \lambda_1\|^2$$
which completes the proof by choosing the smallest modulus over $\lambda_1\in H$.

We are now ready to prove Lemma 4.

Proof. By Lemmas 9 and 10, for any $\lambda\in H$,
$$g_s(\lambda) \leq g_s(\lambda^\star) + \nabla g_s(\lambda^\star)^\top(\lambda - \lambda^\star) - \frac{\mu}{2}\|\lambda - \lambda^\star\|^2.$$
Thus, if we choose $\lambda = \tilde{\lambda}^\star$, then
$$g_s(\tilde{\lambda}^\star) \leq g_s(\lambda^\star) + \sum_{i=1}^{m}(\tilde{\lambda}_i^\star - \lambda_i^\star)\left(\mathbb{E}_{q_i}\left[\|\hat{s}^\star(\lambda^\star)(x_t,t) - \nabla\log q(x_t)\|^2\right] - \tilde{b}_i\right) - \frac{\mu}{2}\|\tilde{\lambda}^\star - \lambda^\star\|^2.$$
Optimality of $(\hat{s}^\star(\lambda^\star),\lambda^\star)$ leads to the complementary slackness,
$$\sum_{i=1}^{m}\lambda_i^\star\left(\mathbb{E}_{q_i}\left[\|\hat{s}^\star(\lambda^\star)(x_t,t) - \nabla\log q(x_t)\|^2\right] - \tilde{b}_i\right) = 0$$
and the feasibility, $\mathbb{E}_{q_i}\left[\|\hat{s}^\star(\lambda^\star)(x_t,t) - \nabla\log q(x_t)\|^2\right] \leq \tilde{b}_i$. Therefore,
$$g_s(\tilde{\lambda}^\star) \leq g_s(\lambda^\star) - \frac{\mu}{2}\|\tilde{\lambda}^\star - \lambda^\star\|^2.$$
According to Lemma 8, $\tilde{g}_s(\tilde{\lambda}^\star) - 4R(1+\|\tilde{\lambda}^\star\|_1)\nu \leq g_s(\tilde{\lambda}^\star)$. Hence,
$$\tilde{g}_s(\tilde{\lambda}^\star) - 4R\left(1+\|\tilde{\lambda}^\star\|_1\right)\nu \leq g_s(\lambda^\star) - \frac{\mu}{2}\|\tilde{\lambda}^\star - \lambda^\star\|^2.$$
Thus,
$$\|\tilde{\lambda}^\star - \lambda^\star\|^2 \leq \frac{2}{\mu}\left(g_s(\lambda^\star) - \tilde{g}_s(\tilde{\lambda}^\star)\right) + \frac{8}{\mu}R\left(1+\|\tilde{\lambda}^\star\|_1\right)\nu \leq \frac{8}{\mu}R\left(1+\|\tilde{\lambda}^\star\|_1\right)\nu$$
where the last inequality is due to $g_s(\lambda) \leq \tilde{g}_s(\lambda)$ for any $\lambda\geq 0$ and the optimality of $\tilde{\lambda}^\star$, i.e., $g_s(\lambda^\star) \leq \tilde{g}_s(\lambda^\star) \leq \tilde{g}_s(\tilde{\lambda}^\star)$.

B.7 Proof of Theorem 3

Proof. By the triangle inequality for the TV distance,
$$\mathrm{TV}\left(q_{\mathrm{mix}}^{(\lambda^\star)},\,p^\star(\tilde{\lambda}^\star)\right) \leq \mathrm{TV}\left(q_{\mathrm{mix}}^{(\lambda^\star)},\,q_{\mathrm{mix}}^{(\tilde{\lambda}^\star)}\right) + \mathrm{TV}\left(q_{\mathrm{mix}}^{(\tilde{\lambda}^\star)},\,p^\star(\tilde{\lambda}^\star)\right)$$
$$\leq \|\tilde{\lambda}^\star - \lambda^\star\|_1 + \frac{d^2\log^3 T}{T} + d\log^2 T\,\varepsilon_{\mathrm{score}} \leq \sqrt{\frac{8mR\left(1+\|\tilde{\lambda}^\star\|_1\right)\nu}{\mu}} + \frac{d^2\log^3 T}{T} + d\log^2 T\,\varepsilon_{\mathrm{score}}$$
where the second inequality is due to Lemma 7 and Lemma 3, and the last inequality is due to $\|\lambda\|_1 \leq \sqrt{m}\,\|\lambda\|$ and Lemma 4.

B.8 Proof of Theorem 4

Lemma 11. For the stochastic variant of Algorithm 1 in Section 4.3, we have
$$\mathbb{E}\left[\|\lambda(h+1) - \tilde{\lambda}^\star\|^2\,\middle|\,\lambda(h)\right] \leq \|\lambda(h) - \tilde{\lambda}^\star\|^2 + \eta^2 S^2 - 2\eta\left(\tilde{D}^\star - \tilde{g}_s(\lambda(h)) - \varepsilon_{\mathrm{approx}}^2\right)$$
where $\tilde{D}^\star := \max_{\lambda\geq 0}\tilde{g}_s(\lambda)$.

Proof. For brevity, we let the stochastic gradient be $\hat{f}(h) = [\hat{f}_1(h),\ldots,\hat{f}_m(h)]^\top$ with
$$\hat{f}_i(h) := \hat{\mathbb{E}}_{x_0\sim q_i,\,t,\,x_t}\left[\|\hat{s}_{\theta(h)}(x_t,t) - \nabla\log q(x_t)\|^2\right] - \tilde{b}_i.$$
By the definition of $\lambda(h+1)$,
$$\|\lambda(h+1) - \tilde{\lambda}^\star\|^2 = \left\|\left[\lambda(h) + \eta\hat{f}(h)\right]_+ - \tilde{\lambda}^\star\right\|^2 \leq \left\|\lambda(h) - \tilde{\lambda}^\star + \eta\hat{f}(h)\right\|^2 = \|\lambda(h) - \tilde{\lambda}^\star\|^2 + \eta^2\|\hat{f}(h)\|^2 + 2\eta\,\hat{f}(h)^\top(\lambda(h) - \tilde{\lambda}^\star)$$
where the inequality is due to the non-expansiveness of the projection. Application of the conditional expectation to both sides of the inequality above yields
$$\mathbb{E}\left[\|\lambda(h+1) - \tilde{\lambda}^\star\|^2\,\middle|\,\lambda(h)\right] \leq \|\lambda(h) - \tilde{\lambda}^\star\|^2 + \eta^2\,\mathbb{E}\left[\|\hat{f}(h)\|^2\,\middle|\,\lambda(h)\right] + 2\eta\,\mathbb{E}\left[\hat{f}(h)\,\middle|\,\lambda(h)\right]^\top(\lambda(h) - \tilde{\lambda}^\star)$$
which gives our desired result when we use the bound $\mathbb{E}[\|\hat{f}(h)\|^2\mid\lambda(h)]\leq S^2$ and the fact that $\mathbb{E}[\hat{f}(h)\mid\lambda(h)]$ is an approximate ascent direction of the dual function $\tilde{g}_s$,
$$\mathbb{E}\left[\hat{f}(h)\,\middle|\,\lambda(h)\right]^\top(\lambda(h) - \tilde{\lambda}^\star) \leq \varepsilon_{\mathrm{approx}}^2 - \left(\tilde{D}^\star - \tilde{g}_s(\lambda(h))\right).$$
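For reference, the dual step analyzed in Lemma 11 is simply a projected stochastic gradient ascent step on the dual function. A minimal sketch with made-up loss estimates and thresholds:

```python
# Projected stochastic dual ascent step: lambda(h+1) = [lambda(h) + eta*f_hat]_+.
import numpy as np

def dual_ascent_step(lam, eta, constraint_losses, b_tilde):
    f_hat = constraint_losses - b_tilde      # stochastic gradient f_hat(h)
    return np.maximum(lam + eta * f_hat, 0)  # projection onto lambda >= 0

lam = dual_ascent_step(np.zeros(2), eta=0.1,
                       constraint_losses=np.array([1.4, 0.7]),
                       b_tilde=np.array([1.0, 1.0]))
print(lam)  # [0.04, 0.0]: only the violated constraint's multiplier grows
```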
Lemma 12. In the stochastic variant of Algorithm 1 in Section 4.3, the maximum parametrized dual function value in history up to step $h$, $\tilde{g}_{\mathrm{best}}(h) := \max_{h'\leq h}\tilde{g}_s(\lambda(h'))$, satisfies
$$\lim_{h\to\infty}\tilde{g}_{\mathrm{best}}(h) \geq \tilde{D}^\star - \frac{\eta S^2}{2} - \varepsilon_{\mathrm{approx}}^2 \quad\text{almost surely.}$$

Proof. The proof is based on the supermartingale convergence theorem [67, Theorem E7.4]. We introduce two processes,
$$\alpha(h) := \|\lambda(h) - \tilde{\lambda}^\star\|^2\;\mathbb{1}\left(\tilde{D}^\star - \tilde{g}_{\mathrm{best}}(h) > \frac{\eta S^2}{2} + \varepsilon_{\mathrm{approx}}^2\right)$$
$$\beta(h) := \left(2\eta\left(\tilde{D}^\star - \tilde{g}_s(\lambda(h)) - \varepsilon_{\mathrm{approx}}^2\right) - \eta^2 S^2\right)\mathbb{1}\left(\tilde{D}^\star - \tilde{g}_{\mathrm{best}}(h) > \frac{\eta S^2}{2} + \varepsilon_{\mathrm{approx}}^2\right)$$
where $\alpha(h)$ measures the gap between $\lambda(h)$ and $\tilde{\lambda}^\star$ when the optimality gap $\tilde{D}^\star - \tilde{g}_{\mathrm{best}}(h)$ is above the threshold, and $\beta(h)$ measures the gap between $\tilde{D}^\star$ and $\tilde{g}_s(\lambda(h))$ (up to some optimization errors) when the optimality gap $\tilde{D}^\star - \tilde{g}_{\mathrm{best}}(h)$ is above the threshold. Clearly, $\alpha(h)$ is non-negative. Also, $\beta(h)$ is non-negative due to
$$\tilde{D}^\star - \tilde{g}_{\mathrm{best}}(h) > \frac{\eta S^2}{2} + \varepsilon_{\mathrm{approx}}^2 \;\implies\; \tilde{D}^\star - \tilde{g}_s(\lambda(h)) \geq \tilde{D}^\star - \tilde{g}_{\mathrm{best}}(h) > \frac{\eta S^2}{2} + \varepsilon_{\mathrm{approx}}^2.$$
Let $\mathcal{F}_h$ be the $\sigma$-algebra generated by the sequences $\alpha(h')$, $\beta(h')$, and $\lambda(h')$ for $h'\leq h$. Thus, $\{\mathcal{F}_h\}_{h\geq 1}$ is a natural filtration. We notice that $\alpha(h+1)$ and $\beta(h+1)$ are determined by $\lambda(h)$ in each step. Hence,
$$\mathbb{E}[\alpha(h+1)\mid\mathcal{F}_h] = \mathbb{E}[\alpha(h+1)\mid\lambda(h)] = \mathbb{E}[\alpha(h+1)\mid\lambda(h),\alpha(h)=0]\,\Pr(\alpha(h)=0) + \mathbb{E}[\alpha(h+1)\mid\lambda(h),\alpha(h)>0]\,\Pr(\alpha(h)>0).$$
We first show that
$$\mathbb{E}[\alpha(h+1)\mid\mathcal{F}_h] \leq \alpha(h) - \beta(h). \tag{26}$$
A simple case is when $\alpha(h) = 0$, for which $\mathbb{E}[\alpha(h+1)\mid\mathcal{F}_h] = \mathbb{E}[\alpha(h+1)\mid\lambda(h),\alpha(h)=0]$. There are two situations for $\alpha(h) = 0$. First, if $\tilde{D}^\star - \tilde{g}_{\mathrm{best}}(h) \leq \eta S^2/2 + \varepsilon_{\mathrm{approx}}^2$, then $\alpha(h) = \beta(h) = 0$. Due to $\tilde{g}_{\mathrm{best}}(h+1) \geq \tilde{g}_{\mathrm{best}}(h)$, we also have $\tilde{D}^\star - \tilde{g}_{\mathrm{best}}(h+1) \leq \eta S^2/2 + \varepsilon_{\mathrm{approx}}^2$. Thus, $\alpha(h+1) = 0$ and (26) holds. Second, if $\lambda(h) = \tilde{\lambda}^\star$ but $\tilde{D}^\star - \tilde{g}_{\mathrm{best}}(h) > \eta S^2/2 + \varepsilon_{\mathrm{approx}}^2$, then $\tilde{D}^\star = \tilde{g}_s(\lambda(h))$. Hence, $\beta(h) < 0$, which contradicts the non-negativity of $\beta(h)$. Therefore, $\tilde{D}^\star - \tilde{g}_{\mathrm{best}}(h) \leq \eta S^2/2 + \varepsilon_{\mathrm{approx}}^2$ has to hold, which is the first situation.

We next show (26) when $\alpha(h) > 0$:
$$\mathbb{E}[\alpha(h+1)\mid\mathcal{F}_h] = \mathbb{E}[\alpha(h+1)\mid\lambda(h),\alpha(h)>0] \leq \mathbb{E}\left[\|\lambda(h+1) - \tilde{\lambda}^\star\|^2\,\middle|\,\lambda(h),\alpha(h)>0\right]$$
$$\leq \|\lambda(h) - \tilde{\lambda}^\star\|^2 + \eta^2 S^2 - 2\eta\left(\tilde{D}^\star - \tilde{g}_s(\lambda(h)) - \varepsilon_{\mathrm{approx}}^2\right) = \alpha(h) - \beta(h)$$
where the second inequality is due to Lemma 11, and the last equality holds because $\tilde{D}^\star - \tilde{g}_{\mathrm{best}}(h) > \eta S^2/2 + \varepsilon_{\mathrm{approx}}^2$ makes the indicators in $\alpha(h)$ and $\beta(h)$ active. Therefore, (26) holds.

Finally, application of the supermartingale convergence theorem [67, Theorem E7.4] to the processes $\alpha(h)$ and $\beta(h)$ for $h\geq 1$ concludes that $\beta(h)$ is almost surely summable, and thus
$$\liminf_{h\to\infty}\beta(h) = 0 \quad\text{almost surely.}$$
This means that either $\liminf_{h\to\infty}\, 2\eta\left(\tilde{D}^\star - \tilde{g}_s(\lambda(h)) - \varepsilon_{\mathrm{approx}}^2\right) - \eta^2 S^2 \leq 0$ or eventually $\tilde{D}^\star - \tilde{g}_{\mathrm{best}}(h) \leq \eta S^2/2 + \varepsilon_{\mathrm{approx}}^2$, which concludes our desired result.

Denote by $h_{\mathrm{best}}$ the step at which $\tilde{g}_{\mathrm{best}}(h)$ achieves $\tilde{D}^\star - \eta S^2/2 - \varepsilon_{\mathrm{approx}}^2$, and by $\lambda_{\mathrm{best}} = \lambda(h_{\mathrm{best}})$ the associated dual variable.

Lemma 13. For the stochastic variant of Algorithm 1 in Section 4.3, we have
$$\|\lambda_{\mathrm{best}} - \lambda^\star\|^2 \leq \frac{2}{\mu}\left(\frac{\eta S^2}{2} + \varepsilon_{\mathrm{approx}}^2 + 4R\left(1+\|\lambda_{\mathrm{best}}\|_1\right)\nu\right).$$

Proof. We denote the segment between $\lambda_{\mathrm{best}}$ and $\lambda^\star$ by $\mathcal{B}$. By Lemma 10, the dual function $g_s(\lambda)$ is strongly concave on $\mathcal{B}$ with parameter $\mu$. Thus,
$$\|\lambda_{\mathrm{best}} - \lambda^\star\|^2 \leq \frac{2}{\mu}\left(g_s(\lambda^\star) - g_s(\lambda_{\mathrm{best}})\right) \leq \frac{2}{\mu}\left(g_s(\lambda^\star) - \tilde{g}_s(\lambda_{\mathrm{best}}) + 4R\left(1+\|\lambda_{\mathrm{best}}\|_1\right)\nu\right) \leq \frac{2}{\mu}\left(\frac{\eta S^2}{2} + \varepsilon_{\mathrm{approx}}^2 + 4R\left(1+\|\lambda_{\mathrm{best}}\|_1\right)\nu\right)$$
where the second inequality is due to Lemma 8, and the third inequality is due to Lemma 12 together with $\tilde{g}_s(\lambda_{\mathrm{best}}) = \tilde{g}_{\mathrm{best}}(h_{\mathrm{best}})$ and $\tilde{D}^\star \geq g_s(\lambda^\star)$.

Proof of Theorem 4. By the triangle inequality for the TV distance,
$$\mathrm{TV}\left(q_{\mathrm{mix}}^{(\lambda^\star)},\,p^\star(\lambda_{\mathrm{best}})\right) \leq \mathrm{TV}\left(q_{\mathrm{mix}}^{(\lambda^\star)},\,q_{\mathrm{mix}}^{(\lambda_{\mathrm{best}})}\right) + \mathrm{TV}\left(q_{\mathrm{mix}}^{(\lambda_{\mathrm{best}})},\,p^\star(\lambda_{\mathrm{best}})\right)$$
$$\leq \|\lambda^\star - \lambda_{\mathrm{best}}\|_1 + \frac{d^2\log^3 T}{T} + d\log^2 T\,\varepsilon_{\mathrm{score}} \leq \sqrt{\frac{2m}{\mu}\left(\frac{\eta S^2}{2} + \varepsilon_{\mathrm{approx}}^2 + 4R\left(1+\|\lambda_{\mathrm{best}}\|_1\right)\nu\right)} + \frac{d^2\log^3 T}{T} + d\log^2 T\,\varepsilon_{\mathrm{score}}$$
where the second inequality is due to Lemma 7 and Lemma 3, and the last inequality is due to $\|\lambda\|_1 \leq \sqrt{m}\,\|\lambda\|$ and Lemma 13.

C Experimental details

We provide implementation details of our computational experiments in Section 5.
The source code is available at https://github.com/shervinkhal/Constrained_Diffusion_Dual_Training.

Algorithm details: We train our constrained diffusion models by replacing the exact primal minimization step in Algorithm 1 with N steps of gradient descent on the Lagrangian as a loss function. Without loss of generality, we adopt the noise prediction formulation of diffusion rather than the score matching formulation used in our theory; since the two formulations are equivalent, this has no bearing on our main results. Algorithm 2 depicts our practical implementation of Algorithm 1.

Algorithm 2 Practical Implementation of Algorithm 1
1: Input: total diffusion steps $T$, diffusion parameters $\alpha_t$, total dual iterations $H$, number of primal descent steps per dual update $N$, dual step size $\eta_d$, primal step size $\eta_p$, initial model parameters $\theta(0)$.
2: Initialize: $\lambda(1) = 0$.
3: for $h = 1, \ldots, H$ do
4:   $\theta_1 = \theta(h-1)$.
5:   for $n = 1, \ldots, N$ do
6:     $\theta_{n+1} = \theta_n - \eta_p\,\nabla_\theta\Big(\hat{\mathbb{E}}_{x_0\sim q,\,t,\,x_t}\big[\|\hat{\epsilon}_{\theta_n}(x_t,t) - \epsilon_0\|^2\big] + \sum_{i=1}^{m}\lambda_i(h)\,\hat{\mathbb{E}}_{x_0\sim q_i,\,t,\,x_t}\big[\|\hat{\epsilon}_{\theta_n}(x_t,t) - \epsilon_0\|^2\big]\Big)$
7:   end for
8:   $\theta(h) = \theta_{N+1}$.
9:   Update the dual variables $\lambda_i(h+1) = \big[\lambda_i(h) + \eta_d\big(\hat{\mathbb{E}}_{x_0\sim q_i,\,t,\,x_t}\big[\|\hat{\epsilon}_{\theta(h)}(x_t,t) - \epsilon_0\|^2\big] - \tilde{b}_i\big)\big]_+$ for all $i$.
10: end for

In Algorithm 2, the unbiased estimate of the noise prediction loss is evaluated via
$$\hat{\mathbb{E}}_{x_0\sim q,\,t,\,x_t}\left[\|\hat{\epsilon}_\theta(x_t,t) - \epsilon_0\|^2\right] = \frac{1}{B}\sum_{i=1}^{B}\left\|\hat{\epsilon}_\theta\big(x_t^{(i)}, t^{(i)}\big) - \epsilon_0^{(i)}\right\|^2$$
where $\{x_0^{(i)}\}_{i=1}^{B}$ is a randomly chosen batch of samples from $q$, $\{t^{(i)}\}_{i=1}^{B}$ are time steps sampled uniformly from $\{2,\ldots,T\}$, and $x_t^{(i)}$ is the noisy version of $x_0^{(i)}$ at time step $t^{(i)}$: the noise $\epsilon_0$ is sampled from the standard Gaussian, and $x_t$ is derived from $x_t = \sqrt{\bar{\alpha}_t}\,x_0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon_0$.

We remark on an important implementation detail in the fine-tuning experiment. In the fine-tuning constraints, we have to evaluate the KL divergence
$$D_{\mathrm{KL}}\left(p_{\theta_{\mathrm{pre}}}(x_{0:T})\,\|\,p_\theta(x_{0:T})\right) \leq b_i \tag{27}$$
where $p_{\theta_{\mathrm{pre}}}(x_{0:T})$ is the joint distribution of the samples and latents generated by the backward process of the pre-trained model. The KL divergence constraint in (27) further reduces to
$$D_{\mathrm{KL}}\left(p_{\theta_{\mathrm{pre}}}(x_{0:T})\,\|\,p_\theta(x_{0:T})\right) = \mathbb{E}_{x_t\sim p_{\theta_{\mathrm{pre}}},\,t}\left[\|\hat{\epsilon}_{\theta_{\mathrm{pre}}}(x_t,t) - \hat{\epsilon}_\theta(x_t,t)\|^2\right] + \mathrm{constant}. \tag{28}$$
To estimate the expectation in (28), we need to sample latents $x_t$ from the backward distribution $p_{\theta_{\mathrm{pre}}}(x_{0:T})$. In practice, this is computationally inefficient, since it requires running inference each time one wants to sample a latent. We therefore implement this by sampling $x_t$ as random Gaussian noise instead. This still ensures that the predictions of the new model $p_\theta$ do not differ too much from those of the pre-trained model $p_{\theta_{\mathrm{pre}}}$, while making sampling batches much faster.
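Putting the pieces above together, below is a condensed PyTorch sketch of the training loop of Algorithm 2. The data loaders and the helper `noisy_batch` (which draws $t$ and $\epsilon_0$ and forms $x_t$ as described above) are assumptions for illustration; this is a sketch of the updates, not our exact implementation.

```python
# Dual training loop (Algorithm 2), assuming infinite iterators q_loader /
# qi_loaders over the main and constraint datasets, and an assumed helper
# noisy_batch that returns (x_t, t, eps_0) for a clean batch x_0.
import torch

def train_constrained(model, q_loader, qi_loaders, b_tilde, H, N, eta_p, eta_d):
    lam = torch.zeros(len(qi_loaders))
    opt = torch.optim.SGD(model.parameters(), lr=eta_p)
    for h in range(H):
        for n in range(N):                       # primal: N descent steps on the Lagrangian
            xt, t, eps = noisy_batch(next(q_loader))
            loss = ((model(xt, t) - eps) ** 2).mean()
            for i, qi in enumerate(qi_loaders):
                xt_i, t_i, eps_i = noisy_batch(next(qi))
                loss = loss + lam[i] * ((model(xt_i, t_i) - eps_i) ** 2).mean()
            opt.zero_grad(); loss.backward(); opt.step()
        with torch.no_grad():                    # dual: projected ascent step
            for i, qi in enumerate(qi_loaders):
                xt_i, t_i, eps_i = noisy_batch(next(qi))
                ci = ((model(xt_i, t_i) - eps_i) ** 2).mean() - b_tilde[i]
                lam[i] = torch.clamp(lam[i] + eta_d * ci, min=0.0)
    return model, lam
```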
Resilient constrained learning. The choice of the constraint thresholds $\{b_i\}_{i=1}^{m}$ has a noticeable effect on the training of constrained diffusion models. To avoid an exhaustive hyperparameter tuning process, in the minority class experiment we use the resilient constrained learning technique [36] to adjust the thresholds $\{b_i\}_{i=1}^{m}$ during training. In essence, resilient constrained learning adds a constraint relaxation cost to the loss and relaxes the thresholds by updating them through gradient descent each time we update the dual variable. It can further be shown theoretically that an equivalent formulation of resilience is achieved by adding a quadratic regularizer of the dual variable to the loss objective and setting the constraint thresholds to zero, i.e., $\tilde{b}_i = 0$. This is the approach we use in our experiments, since it has fewer hyperparameters. We note that the only difference between Algorithm 2 and Algorithm 3 is the additional term in the dual variable update step (line 9 of Algorithm 3); a sketch of this update follows the listing.

Algorithm 3 Resilient Constrained Diffusion Models via Dual Training
1: Input: total diffusion steps $T$, diffusion parameters $\alpha_t$, total dual iterations $H$, number of primal descent steps per dual update $N$, dual step size $\eta_d$, primal step size $\eta_p$, constraint step size $\eta_c$, initial model parameters $\theta(0)$, constraint relaxation cost $\gamma$.
2: Initialize: $\lambda(1) = 0$.
3: for $h = 1, \ldots, H$ do
4:   $\theta_1 = \theta(h-1)$.
5:   for $n = 1, \ldots, N$ do
6:     $\theta_{n+1} = \theta_n - \eta_p\,\nabla_\theta\Big(\hat{\mathbb{E}}_{x_0\sim q,\,t,\,x_t}\big[\|\hat{\epsilon}_{\theta_n}(x_t,t) - \epsilon_0\|^2\big] + \sum_{i=1}^{m}\lambda_i(h)\,\hat{\mathbb{E}}_{x_0\sim q_i,\,t,\,x_t}\big[\|\hat{\epsilon}_{\theta_n}(x_t,t) - \epsilon_0\|^2\big]\Big)$
7:   end for
8:   $\theta(h) = \theta_{N+1}$.
9:   Update the dual variables $\lambda_i(h+1) = \big[\lambda_i(h) + \eta_d\big(\hat{\mathbb{E}}_{x_0\sim q_i,\,t,\,x_t}\big[\|\hat{\epsilon}_{\theta(h)}(x_t,t) - \epsilon_0\|^2\big] - \tilde{b}_i(h) - 2\gamma\lambda_i(h)\big)\big]_+$ for all $i$.
10: end for
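A minimal sketch of the resilient dual update in line 9 of Algorithm 3, under the convention that the dual step size $\eta_d$ also multiplies the regularization term $-2\gamma\lambda_i$ coming from the quadratic penalty:

```python
# Resilient dual step: ascent on g(lam) - gamma*||lam||^2, projected onto lam >= 0.
# constraint_loss is the Monte Carlo estimate from the previous sketch; with the
# quadratic-regularizer formulation the thresholds b_tilde are set to zero.
import torch

def resilient_dual_step(lam, constraint_loss, eta_d, gamma, b_tilde=0.0):
    return torch.clamp(lam + eta_d * (constraint_loss - b_tilde - 2.0 * gamma * lam),
                       min=0.0)
```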
Model architecture. We use a time-conditioned U-Net model, as is common in image diffusion tasks, for all three datasets. The time conditioning is done by adding a positional embedding of the time step to the input image. The parameters of the model are summarized in Table 1. The fifth downsampling block and the corresponding upsampling block are ResNet blocks with spatial self-attention.

Table 1: Parameters of the U-Net model used as the noise predictor
# ResNet layers per U-Net block           2
# ResNet down/upsampling blocks           6
# Output channels for U-Net blocks        (128, 128, 256, 256, 512, 512)

Hyperparameters. We summarize the important hyperparameters of our experiments in Table 2. In the unconstrained models that we train for comparison, we use the same hyperparameters as in the constrained version, disregarding the parameters related to the dual and relaxation updates. For models trained on Image-Net, when training the constrained models, we initialized to the parameters of the unconstrained model to shorten training times.

Table 2: Hyperparameter values used in the main experiments. MC denotes minority class experiments and FT denotes fine-tuning experiments. "-" denotes that resilience was not used for the experiment.
                                   MNIST MC   Celeb-A MC   Image-Net MC   MNIST FT
# training epochs (N x H)          250        1000         2000           500
# primal steps per dual step (N)   2          2            2              2
Primal batch size                  128        256          128            256
Dual batch size                    64         128          64             256
Primal learning rate               0.0001     0.0001       5e-5           5e-6
Dual learning rate                 0.1        0.1          0.05           1e5
Resilience relaxation cost         0.09       0.005        0.025          -
Main dataset size                  31000      12500        2000           200
Constraint dataset(s) size         5000       500          64             -

Hyperparameter sensitivity. We remark on the sensitivity of the dual training algorithm to the number of dual iterations, primal/dual batch sizes, and primal/dual learning rates.

Number of dual iterations: In our implementation this shows up as the number of primal gradient descent steps per dual update, N. Experimentally, we have observed that as long as N is greater than 1, the results are not sensitive to this value. Additionally, the dual updates add a negligible computational overhead; hence, updating the dual variables nearly as often as the model parameters does not reduce training efficiency.

Primal/dual batch sizes: We include results of training a constrained model on an unbalanced subset of MNIST using different primal/dual batch sizes (see Table 3). The results suggest that for the minority class experiments, when the ratio between primal and dual batch sizes is larger, the model performs better (lower FID and more evenly distributed samples). This is in line with the heuristic used in the experiments included in the paper, where we chose the batch sizes such that the ratio of primal to dual batch size is close to the size ratio of the entire dataset to the constraint datasets (which are much smaller). However, for the fine-tuning task, the batch sizes did not seem to affect the final result as much.

Table 3: Constrained model trained on MNIST with different primal/dual batch sizes
(Primal batch size, Dual batch size)   (64, 16)   (64, 64)   (128, 16)   (128, 64)
FID score                              16.6       20.6       16.7        20.24

Primal/dual learning rates: For the primal learning rate, we followed the best practice used to train standard diffusion models. For the dual learning rate $\eta$, we refer to Theorem 8 in the paper, which shows a smaller error bound for smaller $\eta$ at the cost of slower convergence. In practice, as long as $\eta \leq 1$, we observed that the model reliably converges to similar results.

Efficiency of constrained diffusion: We note that the complexity of sampling from our constrained diffusion model does not increase with the number of constraints, as the trained model functions like a standard diffusion model at generation time. Importantly, training our constrained diffusion model has efficiency comparable to training standard diffusion models, as detailed next. The additional computational cost of our dual-based training (Algorithm 1) arises from: (i) updating the dual variables; (ii) updating the diffusion model in the primal update.

Cost of updating the dual variables: Our dual-based training has as many dual variables as constraints, so the cost of the dual update is linear in the number of constraints. To update each dual variable, we can directly use the ELBO loss over the batches sampled from each constraint dataset (already computed for the Lagrangian). Therefore, the cost of updating the dual variables is negligible.

Cost of updating the diffusion model in the primal update: The primal update trains a standard diffusion model based on the Lagrangian with updated dual variables. In our experiments, this primal training often requires as few as 2-3 updates per dual update. Thus, when training our constrained model, we can train for the same number of epochs as an unconstrained model but update the dual variables after every few primal steps. As a result, training our constrained diffusion model is almost as efficient as training standard unconstrained models. The only efficiency concern we encountered is that batches need to be sampled from every constraint dataset at each step to estimate the Lagrangian. This introduces a small GPU memory overhead that increases with additional constraints. However, this is somewhat mitigated by the fact that constraint datasets are often much smaller than the original dataset, allowing us to choose a smaller batch size for the constraint datasets without degrading performance (see the discussion on batch sizes above).

FID scores. As a quantitative means of evaluating our constrained diffusion models, we use the FID (Fréchet Inception Distance) as a metric to gauge the quality of the samples generated by diffusion models. The FID score was first introduced in [33] in the form
$$d^2_{\mathrm{FID}}\left((m, C), (m_w, C_w)\right) = \|m - m_w\|_2^2 + \mathrm{Tr}\left(C + C_w - 2\,(C C_w)^{1/2}\right) \tag{29}$$
where $m$ and $C$ represent the mean and covariance, respectively, of the distribution of features of the data samples, extracted by an Inception model [69]. Similarly, $m_w$ and $C_w$ represent the mean and covariance of the reference distribution to which we compute the distance.
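For completeness, below is a small sketch of computing (29) from feature means and covariances; in our experiments the clean-fid library [58] performs the feature extraction and this computation end to end. The toy features here stand in for Inception activations, and the real part is taken because `sqrtm` can return a complex matrix due to numerical error.

```python
# FID between two Gaussians fitted to feature sets, following (29).
import numpy as np
from scipy.linalg import sqrtm

def fid(m, C, m_w, C_w):
    covmean = sqrtm(C @ C_w)
    if np.iscomplexobj(covmean):
        covmean = covmean.real
    return float(np.sum((m - m_w) ** 2) + np.trace(C + C_w - 2.0 * covmean))

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 8))            # toy "generated" features
Y = rng.normal(loc=0.1, size=(500, 8))   # toy "reference" features
print(fid(X.mean(0), np.cov(X.T), Y.mean(0), np.cov(Y.T)))
```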
In our experiments, we compute the FID scores by generating 15000 samples from the diffusion model being evaluated and comparing them to a balanced version of the original dataset. In the experiment with MNIST, this is the actual dataset. In the experiments with Celeb-A and Image-Net, since there is an imbalance in the original dataset, we use as reference a balanced subset of each with an equal number of samples from each class. We use the clean-fid library [58] for standard computation of the FID scores.

We note that our FID scores are somewhat larger than typical baselines in the literature. This is expected as a consequence of our experimental setup. We train both the unconstrained and constrained models on a biased subset of the dataset in which some classes have significantly fewer samples than the rest. We then compute the FID scores for these models with respect to the actual dataset, which is unbiased (i.e., every class has the same number of samples). These FID scores approximate how close the learned distribution of a model trained on biased data is to the underlying unbiased distribution. This setup contrasts with existing results in the literature, where the FID is computed with respect to unbiased data and the models are also trained on unbiased data; it is therefore expected that such models achieve better FID scores than constrained or unconstrained models trained on biased data. Our purpose in reporting the FIDs was not to compare them to existing results (such a comparison would be uninformative) but to demonstrate that, when trained on biased data, the constrained model achieves better FID scores than the unconstrained model.

Compute resources. We run all experiments on two NVIDIA RTX 3090-Ti GPUs in parallel, using 16 gigabytes of GPU memory per GPU. For the experiments with the MNIST and Celeb-A datasets, training each model took between 2 and 3 hours. This increased to 7-8 hours for the latent diffusion models trained for the Image-Net experiments.

Assets and libraries. We use the PyTorch [59] and Diffusers [73] Python libraries for training our constrained diffusion models, and Adam with decoupled weight decay [52] as the optimizer. The accelerate library [30] is used for parallelization of the training processes across multiple GPUs. For the classifiers used in evaluating the generated samples of the models, we use the following pretrained models accessible on the Huggingface model database: a Vision-Transformer-based classifier for MNIST digits with 99.5% validation accuracy (https://huggingface.co/farleyknight-org-username/vit-base-mnist), and a classifier for images of male/female faces with 98.6% validation accuracy (https://huggingface.co/cledoux42/Gender_New_v002). For classifying the Image-Net data, a zero-shot classifier based on a CLIP model (https://huggingface.co/openai/clip-vit-base-patch32) was used. The autoencoder for the latent diffusion model was the Stable Diffusion VAE with KL regularization found at https://huggingface.co/stabilityai/sd-vae-ft-mse.

NeurIPS Paper Checklist

1. Claims
Question: Do the main claims made in the abstract and introduction accurately reflect the paper's contributions and scope?
Answer: [Yes]
Justification: We summarize our constrained diffusion models in Section 3, optimality guarantees of unparametrized constrained diffusion models in Section 3.1, optimality guarantees of parametrized constrained diffusion models and training algorithms in Section 4, and experimental results in Section 5.

2. Limitations
Question: Does the paper discuss the limitations of the work performed by the authors?
Answer: [Yes]
Justification: We mention two potential limitations of this work as several future directions in Section 6.

3. Theory Assumptions and Proofs
Question: For each theoretical result, does the paper provide the full set of assumptions and a complete (and correct) proof?
Answer: [Yes]
Justification: We provide all assumptions, lemmas, and theorems in Sections 3 and 4, and provide proof details in Appendix B.

4. Experimental Result Reproducibility
Question: Does the paper fully disclose all the information needed to reproduce the main experimental results of the paper to the extent that it affects the main claims and/or conclusions of the paper (regardless of whether the code and data are provided or not)?
Answer: [Yes]
Justification: We summarize our experimental results in Section 5 and provide implementation details of the experiments in Appendix C, together with additional experimental results.

5. Open access to data and code
Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material?
Answer: [Yes]
Justification: A link to the code for replicating our main experiments is provided in Appendix C. The datasets are open-source machine learning datasets that are accessible online.

6. Experimental Setting/Details
Question: Does the paper specify all the training and test details (e.g., data splits, hyperparameters, how they were chosen, type of optimizer, etc.) necessary to understand the results?
Answer: [Yes]
Justification: We provide experimental details in Appendix C, including the hyperparameters used for each experiment. Our training details can be found in the code in the supplemental material.

7. Experiment Statistical Significance
Question: Does the paper report error bars suitably and correctly defined or other appropriate information about the statistical significance of the experiments?
Answer: [NA]
Justification: While the histogram plots that showcase our main results do not include error bars, we do provide Fréchet distance metrics (a measure of statistical distance between sample distributions) that support our results.

8. Experiments Compute Resources
Question: For each experiment, does the paper provide sufficient information on the computer resources (type of compute workers, memory, time of execution) needed to reproduce the experiments?
Answer: [Yes]
Justification: We include compute details in Appendix C.

9. Code Of Ethics
Question: Does the research conducted in the paper conform, in every respect, with the NeurIPS Code of Ethics https://neurips.cc/public/EthicsGuidelines?
Answer: [Yes]
Justification: We have reviewed the NeurIPS Code of Ethics and fully comply with it in the preparation of this paper.

10. Broader Impacts
Question: Does the paper discuss both potential positive societal impacts and negative societal impacts of the work performed?
Answer: [Yes]
Justification: The potential impacts of the work are discussed in Section 1.

11. Safeguards
Question: Does the paper describe safeguards that have been put in place for responsible release of data or models that have a high risk for misuse (e.g., pretrained language models, image generators, or scraped datasets)?
Answer: [NA]
Justification: We release no models, and the supplemental code for training image generation models trains on the MNIST and Celeb-A datasets, neither of which has potential for misuse.

12. Licenses for existing assets
Question: Are the creators or original owners of assets (e.g., code, data, models), used in the paper, properly credited and are the license and terms of use explicitly mentioned and properly respected?
Answer: [Yes]
Justification: The libraries and assets used have been noted and credited in Appendix C.

13. New Assets
Question: Are new assets introduced in the paper well documented and is the documentation provided alongside the assets?
Answer: [NA]
Justification: [NA]

14. Crowdsourcing and Research with Human Subjects
Question: For crowdsourcing experiments and research with human subjects, does the paper include the full text of instructions given to participants and screenshots, if applicable, as well as details about compensation (if any)?
Answer: [NA]
Justification: [NA]

15. Institutional Review Board (IRB) Approvals or Equivalent for Research with Human Subjects
Question: Does the paper describe potential risks incurred by study participants, whether such risks were disclosed to the subjects, and whether Institutional Review Board (IRB) approvals (or an equivalent approval/review based on the requirements of your country or institution) were obtained?
Answer: [NA]
Justification: [NA]