
Published as a conference paper at ICLR 2025

COIND: ENABLING LOGICAL COMPOSITIONS IN DIFFUSION MODELS

Sachit Gaudi Gautam Sreekumar Vishnu Naresh Boddeti Michigan State University {gaudisac,sreekum1,vishnu}@msu.edu

How can we learn generative models to sample data with arbitrary logical compositions of statistically independent attributes? The prevailing solution is to sample from distributions expressed as a composition of the attributes' conditional marginal distributions under the assumption that they are statistically independent. This paper shows that standard conditional diffusion models violate this assumption, even when all attribute compositions are observed during training. This violation is significantly more severe when only a subset of the compositions is observed. We propose COIND to address this problem. It explicitly enforces statistical independence between the conditional marginal distributions by minimizing Fisher's divergence between the joint and marginal distributions. The theoretical advantages of COIND are reflected in both qualitative and quantitative experiments, demonstrating a significantly more faithful and controlled generation of samples for arbitrary logical compositions of attributes. The benefit is more pronounced for scenarios that current solutions relying on the assumption of conditionally independent marginals struggle with, namely, logical compositions involving the NOT operation and when only a subset of compositions is observed during training. Our code is available at https://github.com/sachit3022/compositional-generation/

1 INTRODUCTION

Figure 1: Generative Modeling of Logical Compositions. Panels: (a) Uniform, (b) Non-uniform, (c) Partial support, each plotted against Digit (C1); (d) Generated samples from Composed GLIDE and COIND under uniform, non-uniform, and partial support, including the composition $4 \land \neg(\text{Pink} \lor \text{Green})$. (a-c) Consider the task of generating MNIST samples for any logical composition of digits and colors by learning on observational data of different supports. (d) Standard diffusion models fail to generate data with arbitrary logical compositions of attributes. We generate data from simple unseen compositions (row 2), and more complex logical compositions (rows 3, 4) through COIND, even under non-uniform and partial support.

Many applications of generative models, including image editing (Kim et al., 2022; Brooks et al., 2023), desire explicit and independent control over statistically independent attributes. For example, in face generation, one might want to control the amount of hair, smile, etc., independently. This challenge relates to the broader task of logical compositionality in generative models, where the goal is to combine attributes according to logical relations. Consider the illustrative task in Fig. 1 of generating realistic samples of colored handwritten digits with explicit and independent control over the composition of color and digit. For example, "generate an image of digit 4 while excluding the colors green and pink". This composition can be logically expressed as "$4 \land \neg[\text{Green} \lor \text{Pink}]$", where $\land$, $\lor$, and $\neg$ represent the three primitive logical operators AND, OR, and NOT, respectively.

Existing solutions (Liu et al., 2022; Du et al., 2020; Nie et al., 2021) realize this goal by mapping the logical expressions into a probability distribution involving the conditional marginal distributions p(image | digit = 4), p(image | color = Green), and p(image | color = Pink), and sampling from it. These marginal distributions are obtained either by learning separate energy-based models for each compositional attribute (Du et al., 2020; Nie et al., 2021) or by factorizing the attributes' learned joint distribution (Liu et al., 2022). Both approaches, however, are predicated on the critical assumption that the conditional marginal distributions are statistically independent of each other.

Employing the approaches mentioned above, for instance Liu et al. (2022), to our illustrative example, we observe that when the conditional diffusion model is learned on data with non-uniform (Fig. 1b) or partial (Fig. 1c) support of the compositional attributes, the models fail to generate realistic samples (columns 3 and 5 of row 2 in Fig. 1d) or generate realistic samples with logically inaccurate compositions (columns 3 and 5 of rows 3 and 4 in Fig. 1d). This is true even for simple unseen logical compositions of attributes (AND in row 2 of Fig. 1d) or for complex logical compositions (rows 3 and 4 of Fig. 1d involving a NOT operation). Such failure under partial support was also observed by Du et al. (2020). Surprisingly, note that even when all compositions of the attributes are observed, the model fails to generate realistic samples (column 1 of row 2 in Fig. 1d).

These observations naturally raise the following research questions that this paper seeks to answer:

(RQ1) Why do standard classifier-free conditional diffusion models fail to generate data with arbitrary logical compositions of attributes? We hypothesize that violating the assumption that the conditional marginal distributions are statistically independent of each other will result in poor image quality, diminished control over the generated image attributes, and, ultimately, failure to adhere to the desired logical composition. We verify and confirm our hypothesis through a case study in § 3.

(RQ2) How can we explicitly enable conditional diffusion models to generate data with arbitrary logical compositions of attributes? We adopt the principle of independent causal mechanisms (Peters et al., 2017) to express the conditional data likelihood in terms of the constituent conditional marginal distributions to ensure that the model does not learn non-existent statistical dependencies.

Summary of contributions.

1. In Section 3, we show that conditional diffusion models trained to maximize the likelihood of the observed data do not learn independent conditional marginal distributions, even when all compositions of the attributes are uniformly (Fig. 1a) observed. Furthermore, this problem is exacerbated in more practical scenarios where we learn from non-uniform (Fig. 1b) or partial (Fig. 1c) support of the compositional attributes. Instead, the models learn nonexistent statistical dependencies induced by unknown confounding factors.

2. Through causal modeling, we derive a training objective, COIND, comprising the standard score-matching loss and a conditional independence violation loss required to enforce the COnditional INdependence relations necessary for enabling logical compositions in conditional Diffusion models.

3. Strong inductive biases, in the form of the conditional independence relations in COIND, enable arbitrary logical compositionality in conditional diffusion models with fine-grained control over conditioned attributes and diversity for unconditioned attributes. COIND achieves these goals while being monolithic and is scalable with the number of attributes.

2 LOGICAL COMPOSITIONALITY IN DIFFUSION MODELS

We study the problem of generating data with attributes that satisfy a given logical relation between them. We consider the case where the attributes are statistically independent of each other. However, not all attribute compositions may be observed during training. To study this problem, we first model the underlying data-generation process using a suitable causal model that relates data and their independently varying attributes.

Notations. We use bold lowercase and uppercase characters to denote vectors (e.g., a) and matrices (e.g., A), respectively. Random variables are denoted by uppercase Latin characters (e.g., X). The distribution of a random variable X is denoted as p(X), or as pθ(X) if the distribution is parameterized by a vector θ. We adopt non-standard terminology where "marginals" denote the conditionals p(X | Ci) rather than the integrated marginals p(Ci), emphasizing their functional role as modular components in our compositional framework. Correspondingly, "joint" refers to p(X | C). We acknowledge this deliberate departure from probabilistic conventions, which is due to a lack of better terminology.


Figure 2: (a) True underlying causal model: C1, C2, . . . , Cn vary freely and independently in the underlying causal graph. (b) Causal model during training: the attributes become dependent due to unknown and unobserved confounding factors.

Data Generation Process. The data generation process consists of observed data $X$ (e.g., images) and its attribute variables $C_1, C_2, \dots, C_n$ (e.g., color, digit, etc.). To have explicit control over these attributes during generation, they should vary independently of each other. In this work, we limit our study to only those causal graphs in which the attributes are not causally related and can hence vary independently, as shown in Fig. 2a. Each $C_i$ assumes values from a set $\mathcal{C}_i$, and their Cartesian product $\mathcal{C} = \mathcal{C}_1 \times \dots \times \mathcal{C}_n$ is referred to as the attribute space. Each attribute $C_i$ generates its own observed component $X_{C_i} = f_{C_i}(C_i)$, which together with unobserved exogenous variables $U_X$ forms the composite observed data $X = f(X_{C_1}, \dots, X_{C_n}, U_X)$ (see Fig. 2a). We do not restrict $f$ much except that it should not obfuscate individual observed components in $X$ (Wiedemer et al., 2024). A simple example of $f$ is the concatenation function. We also assume that all $f_{C_i}$ are invertible and therefore it is possible to estimate $C_1, \dots, C_n$ from $X$. These assumptions together ensure that $C_1, \dots, C_n$ are mutually independent given $X$ despite being seemingly d-connected.

Problem Statement. When the training data is sampled according to the causal graph in Fig. 2a, all attribute compositions are equally likely to be observed. We refer to this scenario as uniform support (illustrated in Fig. 1a). However, real-world datasets often deviate from this independence due to unobserved confounders such as sample selection bias (Storkey, 2008), inducing an attribute shift. As shown in Fig. 2b, this shift modifies the causal structure during training through unobserved confounding relationships, resulting in non-uniform support (Fig. 1b) where attribute compositions exhibit unequal occurrence probabilities. In extreme cases, this dependence could lead to the training samples consisting of only a subset of all attribute compositions (Fig. 1c), i.e., $\mathcal{C}_{\text{train}} \subset \mathcal{C}$. We refer to this scenario as partial support. We aim to learn conditional diffusion models under these scenarios to generate samples with attributes that satisfy a given logical compositional relation between them.
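The three support scenarios can be made concrete with a small sketch. The 10 x 10 digit-by-color grid mirrors the Colored MNIST example, but the specific probabilities and the diagonal band used for partial support are illustrative assumptions, not the sampling scheme used in the experiments.

```python
import numpy as np

def uniform_support(n=10):
    """Every (digit, color) pair is equally likely."""
    return np.full((n, n), 1.0 / (n * n))

def non_uniform_support(n=10, on_diag=0.5):
    """All pairs appear, but pairs with digit == color receive extra mass (assumed split)."""
    p = np.full((n, n), (1.0 - on_diag) / (n * n - n))
    np.fill_diagonal(p, on_diag / n)
    return p

def partial_support(n=10, width=2):
    """Only pairs inside a diagonal band are observed (assumed band shape)."""
    digits, colors = np.meshgrid(np.arange(n), np.arange(n), indexing="ij")
    mask = ((colors - digits) % n) < width
    return mask / mask.sum()
```

In this construction every digit and every color still appears in the partial-support grid, even though most of their combinations have zero probability.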

The attribute space in our problem statement has the following properties. (1) The attribute space observed during training, $\mathcal{C}_{\text{train}}$, covers $\mathcal{C}$ in the following sense:

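Definition 1 (Support Cover). Let $\mathcal{C} = \mathcal{C}_1 \times \dots \times \mathcal{C}_n$ be the Cartesian product of $n$ finite sets $\mathcal{C}_1, \dots, \mathcal{C}_n$. Consider a subset $\mathcal{C}_{\text{train}} \subset \mathcal{C}$, where $|\mathcal{C}_{\text{train}}| = m$. Let $\mathcal{C}_{\text{train}} = \{(c_{1j}, \dots, c_{nj}) : c_{ij} \in \mathcal{C}_i, 1 \le i \le n, 1 \le j \le m\}$ and $\tilde{\mathcal{C}}_i = \{c_{ij} : 1 \le j \le m\}$ for $1 \le i \le n$. The Cartesian product of these sets is $\tilde{\mathcal{C}}_{\text{train}} = \tilde{\mathcal{C}}_1 \times \dots \times \tilde{\mathcal{C}}_n$. We say $\mathcal{C}_{\text{train}}$ covers $\mathcal{C}$ iff $\mathcal{C} = \tilde{\mathcal{C}}_{\text{train}}$.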

Informally, this assumption implies that every possible value that $C_i$ can assume is present in the training set, and open-set attribute compositions do not fall under this definition. For instance, in the Colored MNIST example in Fig. 1, we are not interested in generating a digit with an unobserved 11th color. (2) For every ordered tuple $c \in \mathcal{C}_{\text{train}}$, there is another $c' \in \mathcal{C}_{\text{train}}$ such that $c$ and $c'$ differ on only one attribute. Similar assumptions were discussed by Wiedemer et al. (2024).
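Both properties can be checked mechanically on a set of training tuples. The sketch below is a minimal illustration; the function names and the toy attribute values are hypothetical.

```python
def covers(train, attribute_sets):
    """Property (1), support cover: every value of every attribute
    appears in at least one training tuple."""
    projected = [set(c[i] for c in train) for i in range(len(attribute_sets))]
    return all(proj == set(full) for proj, full in zip(projected, attribute_sets))

def has_single_attribute_neighbors(train):
    """Property (2): each training tuple has another training tuple
    that differs from it in exactly one attribute."""
    train = list(train)

    def differs_in_one(a, b):
        return sum(x != y for x, y in zip(a, b)) == 1

    return all(any(differs_in_one(c, other) for other in train) for c in train)

# Toy attribute space: 3 digits x 3 colors.
digits, colors = [0, 1, 2], ["r", "g", "b"]
band = {(0, "r"), (0, "g"), (1, "g"), (1, "b"), (2, "b"), (2, "r")}
diagonal_only = {(0, "r"), (1, "g"), (2, "b")}
```

Here `band` satisfies both properties, whereas `diagonal_only` covers the attribute space but violates property (2), since no two of its tuples share an attribute value.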

Preliminaries on Score-based Models. In this work, we train conditional score-based models (Song et al., 2021b) using classifier-free guidance (Ho & Salimans, 2022) to generate data corresponding to a given logical attribute composition. Score-based models learn the score of the observed data distributions $p_{\text{train}}(X)$ and $p_{\text{train}}(X \mid C)$ through score matching (Hyvärinen, 2005). Once the score of a distribution is learned, samples can be generated using Langevin dynamics. For logical attribute compositional generation, the given attribute composition is decomposed in terms of two primitive logical compositions: (1) the AND operation (e.g., $C_1 = c_1 \land C_2 = c_2$ generates data where attributes $C_1$ and $C_2$ take values $c_1$ and $c_2$, respectively), and (2) the NOT operation (e.g., $\neg(C_1 = c_1)$ generates data where the attribute $C_1$ takes any value except $c_1$). Liu et al. (2022) proposed the following modifications during sampling to enable AND and NOT logical operations between the attributes, assuming that the diffusion model learns the conditional independence relations from the underlying data-generation process, i.e., $p(C_1, \dots, C_n \mid X) = \prod_{i=1}^{n} p(C_i \mid X)$.

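Logical AND ($\land$) operation: Since $p_\theta(C_1 \land C_2 \mid X) = p_\theta(C_1 \mid X)\, p_\theta(C_2 \mid X)$, samples are generated for the logical composition $C_1 \land C_2$ by sampling from the following score:

$$\nabla_X \log p_\theta(X \mid C_1 \land C_2) = \nabla_X \log p_\theta(X \mid C_1) + \nabla_X \log p_\theta(X \mid C_2) - \nabla_X \log p_\theta(X) \tag{1}$$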


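Logical NOT ($\neg$) operation: Following the approximation $p_\theta(\neg C_2 \mid X) \propto \frac{1}{p_\theta(C_2 \mid X)}$, the score to sample data for the logical composition $C_1 \land \neg C_2$ can be expressed as

$$\nabla_X \log p_\theta(X \mid C_1 \land \neg C_2) = \nabla_X \log p_\theta(X) + \nabla_X \log p_\theta(X \mid C_1) - \nabla_X \log p_\theta(X \mid C_2) \tag{2}$$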

Precise Control: To achieve precise control over attribute composition, the hyperparameter $\gamma$ is used to modulate the relative intensity of attribute $C_2$ with respect to $C_1$. We sample using the score $\nabla_X \log p_\theta(X \mid C_1 \land C_2)$, expressed as

$$\nabla_X \log p_\theta(X \mid C_1) + \gamma \nabla_X \log p_\theta(X \mid C_2) - \gamma \nabla_X \log p_\theta(X) \tag{3}$$

Setting $\gamma = 1$ recovers Eq. (1).
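The AND, NOT, and weighted compositions reduce to simple arithmetic on score vectors. The following is a minimal sketch, assuming the three scores (conditional on C1, conditional on C2, and unconditional) have already been evaluated at the same sample and noise level; the function names are illustrative.

```python
import numpy as np

def and_score(s_c1, s_c2, s_uncond, gamma=1.0):
    """Score for C1 AND C2, Eq. (3); gamma = 1 gives Eq. (1)."""
    return s_c1 + gamma * (s_c2 - s_uncond)

def and_not_score(s_c1, s_c2, s_uncond):
    """Score for C1 AND (NOT C2), Eq. (2)."""
    return s_uncond + s_c1 - s_c2

# Toy score vectors standing in for network outputs.
s_uncond = np.array([0.0, 0.0])
s_c1 = np.array([1.0, 0.0])
s_c2 = np.array([0.0, 2.0])

print(and_score(s_c1, s_c2, s_uncond))             # [1. 2.]
print(and_score(s_c1, s_c2, s_uncond, gamma=2.0))  # [1. 4.]
print(and_not_score(s_c1, s_c2, s_uncond))         # [ 1. -2.]
```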

Logical OR ($\lor$) operation: From the rules of Boolean algebra, the $C_1 \lor C_2$ operation can be expressed in terms of $\land$ and $\neg$ as $\neg(\neg C_1 \land \neg C_2)$. Following the approximation for $\neg$ from above, it follows that $p(\neg(\neg C_1 \land \neg C_2)) \propto p(C_1)\, p(C_2)$.

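For example, to generate colored handwritten digits with the "$4 \land \neg[\text{Green} \lor \text{Pink}]$" logical composition, the score of the logical composition can be decomposed into its constituent logical primitive operations and further in terms of the score of marginals, which can be obtained from the trained diffusion models. Therefore, $\nabla_X \log p_\theta(X \mid 4 \land \neg[G \lor P])$ is given by:

$$\begin{aligned}
\nabla_X \log p_\theta(X \mid 4 \land \neg[G \lor P]) &= \nabla_X \log p_\theta(X \mid C_1 = 4 \land \neg(C_2 = G)) + \nabla_X \log p_\theta(X \mid C_1 = 4 \land \neg(C_2 = P)) - \nabla_X \log p_\theta(X) \\
&= 2 \nabla_X \log p_\theta(X \mid C_1 = 4) - \nabla_X \log p_\theta(X \mid C_2 = G) - \nabla_X \log p_\theta(X \mid C_2 = P) + \nabla_X \log p_\theta(X)
\end{aligned}$$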
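The two lines of this decomposition can be checked numerically: composing the AND and NOT primitives step by step gives the same vector as the closed form in terms of marginal scores. The random vectors below are stand-ins for network outputs and are purely illustrative.

```python
import numpy as np

def and_score(s_a, s_b, s_uncond):
    """Eq. (1): score for A AND B."""
    return s_a + s_b - s_uncond

def and_not_score(s_keep, s_exclude, s_uncond):
    """Eq. (2): score for keep AND (NOT exclude)."""
    return s_uncond + s_keep - s_exclude

rng = np.random.default_rng(0)
# Stand-ins for the unconditional score and the scores conditioned on
# digit = 4, color = Green, and color = Pink.
s_x, s_4, s_g, s_p = rng.normal(size=(4, 8))

# (4 AND NOT G) AND (4 AND NOT P), composed from the primitives.
step_by_step = and_score(
    and_not_score(s_4, s_g, s_x),
    and_not_score(s_4, s_p, s_x),
    s_x,
)
# Closed form in terms of the marginal scores.
closed_form = 2 * s_4 - s_g - s_p + s_x

assert np.allclose(step_by_step, closed_form)
```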

Note that the scores to sample from these primitive logical compositions involve conditional marginal likelihood terms $p_\theta(X \mid C_i)$. Therefore, to perform logical composition, it is critical to accurately learn the conditional marginals of the attributes.

Evaluation. We evaluate the distributions learned by the model based on their accuracy in generating images with attributes that align with the desired compositions for a logical relation. For example, to evaluate the AND ($\land$) composition, consider sampling an arbitrary digit and color, represented as $C = (4, \text{Cyan})$. We generate images $\hat{X}$ by sampling from Eq. (1), and subsequently infer attributes, $(\hat{c}_1, \hat{c}_2) = (\phi_{C_1}(\hat{X}), \phi_{C_2}(\hat{X}))$. We then verify if $(\hat{c}_1, \hat{c}_2) \in \{4\} \times \{\text{Cyan}\}$, and this process is averaged over all combinations in $\mathcal{C}$ to obtain CS.

Conformity Score (CS). To formally define CS, consider a logical relation $R$, defined as a Boolean function over the attribute space $\mathcal{C}$, such that $R : \mathcal{C} \to \{0, 1\}$. This induces a constrained attribute space given by $\mathcal{R}_{\mathcal{C}} = \{(c_1, \dots, c_n) \mid R(c_1, \dots, c_n) = 1\} \subseteq \mathcal{C}$. The CS is defined as:

$$\mathrm{CS}(R, \theta) := \mathbb{E}_{C \sim p(C)}\, \mathbb{E}_{U \sim p(U)} \left[ \mathbb{1}_{\mathcal{R}_{\mathcal{C}}} \left( \left( \phi_{C_i}(g_\theta(R(C), U)) \right)_{i=1}^{n} \right) \right] \tag{4}$$

where $R(C)$ can represent various logical operations such as $\land$, $\lor$, and $\neg$ on the attribute space $\mathcal{C}$. Here, $g_\theta(R(C), U)$ denotes a generative model parameterized by $\theta$, which samples according to the logical relations specified above. The variable $U$ represents exogenous noise in the diffusion model. The functions $\phi_{C_i}$ are attribute-specific classifiers that infer attributes from the generated images. The term $\mathbb{1}_{\mathcal{R}_{\mathcal{C}}}$ is an indicator function that equals 1 if the inferred attributes $(\phi_{C_i}(g_\theta(R(C), U)))_{i=1}^{n} \in \mathcal{R}_{\mathcal{C}}$, and 0 otherwise. Further details regarding the Conformity Score can be found in App. D.6.
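For a single logical relation, the inner expectation of the CS is the fraction of generated samples whose inferred attributes satisfy the relation. The sketch below assumes the attribute classifiers have already been applied, so `predictions` is a hypothetical list of inferred (digit, color) tuples; averaging over sampled compositions C is left to the caller.

```python
def conformity_score(relation, inferred_attributes):
    """Fraction of samples whose inferred attribute tuple satisfies the relation R."""
    hits = [1 if relation(*attrs) else 0 for attrs in inferred_attributes]
    return sum(hits) / len(hits)

def four_and_not_green_or_pink(digit, color):
    """R for the composition 4 AND NOT (Green OR Pink)."""
    return digit == 4 and color not in {"green", "pink"}

# Hypothetical classifier outputs for four generated images.
predictions = [(4, "cyan"), (4, "green"), (7, "cyan"), (4, "red")]
score = conformity_score(four_and_not_green_or_pink, predictions)
print(score)  # 0.5
```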

3 WHY DO CONDITIONAL DIFFUSION MODELS FAIL TO GENERATE DATA WITH ARBITRARY LOGICAL COMPOSITIONS OF ATTRIBUTES?

To address (RQ1), we use the task of generating synthetic images from the Colored MNIST dataset for any given combination of color and digit, as introduced in § 1. To study the effect of data support, we consider the three training distributions of attribute compositions defined in § 2: (1) uniform support, where every ordered pair in $\mathcal{C}$ has an equal chance of being observed (Fig. 1a), (2) non-uniform support, where every ordered pair in $\mathcal{C}$ appears but with unequal probabilities (Fig. 1b), and (3) partial support, where only a subset of ordered pairs, $\mathcal{C}_{\text{train}} \subset \mathcal{C}$, is observed (Fig. 1c).

For each support, we train a diffusion model and evaluate the conditional joint, $p_\theta(X \mid C)$, and marginal, $p_\theta(X \mid C_i)$, distributions. During inference, the images are separately sampled from the joint distribution, using the score $\nabla_X \log p_\theta(X \mid C)$, and from the product of the learned marginals as shown in Eq. (1). We refer to the former method as joint sampling and the latter as marginal sampling. To measure the accuracy of the attributes in the generated image with respect to the desired attributes, we use the conformity score (CS) defined in § 2. Tab. 1 compares the joint and the marginal distributions learned by models trained under various training scenarios. We draw the following conclusions.

| Support | Conformity Score (Joint) | Conformity Score (Marginal) | JSD |
|---|---|---|---|
| Uniform | 99.98 | 98.15 | 0.16 |
| Non-uniform | 99.98 | 86.10 | 0.30 |
| Partial | 33.14 | 7.40 | 2.75 |

Table 1: Conformity Scores and Jensen-Shannon divergence for samples generated from joint and marginal distributions learned by models under various support settings for the Colored MNIST dataset.

Diffusion models struggle to generate unseen attribute compositions. From the conformity scores of images sampled from the joint distribution, we conclude that while the models trained with uniform and non-uniform support generate images with accurate attribute compositions, those trained with partial support struggle to generate images for unseen attribute compositions. The standard training objective of diffusion models is to maximize the likelihood of conditional generation, so for every observed attribute composition, the model accurately learns $p_{\text{train}}(X \mid C)$, i.e., $p_\theta(X \mid C) \approx p_{\text{train}}(X \mid C)$. However, with partial support, the model does not observe samples for every attribute composition from $p_{\text{train}}(X \mid C)$. Therefore, the model does not accurately learn the density of the unobserved support region.

Diffusion models violate underlying Conditional Independence relations. Although the diffusion model is trained on all marginals $p(X \mid C_i)$, per the support cover assumption, sampling from the marginals performs worse than sampling from the joint distribution. This further drop in conformity score when sampling from the product of marginals (Eq. (1)) for the models trained under non-uniform and partial support settings is due to the disparity between the joint distribution and the product of marginals, which points to the violation of independence relations from the underlying data-generation process in the learned model. Refer to App. B.1 for a detailed proof. To further strengthen the claim, we measure this violation as the disparity between the conditional joint distribution $p_\theta(C \mid X)$ and the product of conditional marginal distributions $\prod_{i}^{n} p_\theta(C_i \mid X)$ learned by the guidance term in a model using Jensen-Shannon divergence (JSD):

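$$\mathrm{JSD} = \mathbb{E}_{C, X \sim p_{\text{data}}} \left[ D_{\mathrm{JS}} \left( p_\theta(C \mid X) \,\Big\|\, \prod_{i}^{n} p_\theta(C_i \mid X) \right) \right] \tag{5}$$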

where $D_{\mathrm{JS}}$ is the Jensen-Shannon divergence and, following Li et al. (2023), $p_\theta$ is obtained by evaluating the implicit classifier learned by the diffusion model. More details can be found in App. D.7.
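For two attributes and a single sample, the divergence inside the expectation can be sketched as follows. This is an illustration only: it assumes the joint posterior and the two marginal posteriors are already available as probability tables, and it does not reproduce how they are extracted from the diffusion model's implicit classifier.

```python
import numpy as np

def kl(p, q, eps=1e-12):
    """KL(p || q) in nats for discrete distributions of the same shape."""
    p = np.asarray(p, dtype=float).ravel()
    q = np.asarray(q, dtype=float).ravel()
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / (q[mask] + eps))))

def jsd_joint_vs_product(joint, p_c1, p_c2):
    """Jensen-Shannon divergence between joint[i, j] = p(C1=i, C2=j | x)
    and the outer product of p(C1 | x) and p(C2 | x)."""
    joint = np.asarray(joint, dtype=float)
    product = np.outer(p_c1, p_c2)
    mixture = 0.5 * (joint + product)
    return 0.5 * kl(joint, mixture) + 0.5 * kl(product, mixture)

marginal = np.array([0.5, 0.5])
independent = np.outer(marginal, marginal)        # joint factorizes exactly
dependent = np.array([[0.5, 0.0], [0.0, 0.5]])    # attributes perfectly tied

jsd_independent = jsd_joint_vs_product(independent, marginal, marginal)
jsd_dependent = jsd_joint_vs_product(dependent, marginal, marginal)
```

The divergence is (numerically) zero when the joint factorizes and strictly positive when the attributes are tied together.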

A positive JSD value suggests that the model fails to adhere to the independence relations present in the underlying causal model. Our findings (Tab. 1) indicate that as the training distribution of attribute compositions diverges from the true underlying distribution, where attributes vary independently, the trained models increasingly violate independence relations, as reflected by the JSD. These findings demonstrate that diffusion models lack an inherent compositional bias and instead propagate the dependencies present in their training data.

Training objective of the diffusion models is not suitable for logical compositionality. The objective of the diffusion models trained with classifier-free guidance is to maximize the conditional likelihood of the power set of attributes. However, due to confounding induced by the training support (Fig. 2b), the attributes become dependent during training, i.e., $p_{\text{train}}(C_1, \dots, C_n) \neq \prod_{i=1}^{n} p_{\text{train}}(C_i)$. As a result, the conditional distribution of marginals does not match its true underlying distribution, i.e., $p_\theta(X \mid C_i) \approx p_{\text{train}}(X \mid C_i) \neq p_{\text{data}}(X \mid C_i)$. Refer to App. B.2 for a formal proof. Therefore, any method that relies on training on these incorrect marginals (Nie et al., 2021) or relies on conditional independence (Liu et al., 2022) is bound to fail. Moreover, even when realistic samples of unseen compositions are successfully generated, it is by accident rather than design.

Failure of Logical Compositionality

Standard conditional diffusion models trained with classifier-free guidance struggle to generate data with arbitrary logical compositions of attributes because they violate the independence relations inherent in the causal data-generation process.

Based on these observations, we propose COIND to train diffusion models that explicitly enforce the conditional independence dictated by the underlying causal data-generation process to encourage the model to learn accurate marginal distributions of the attributes.


4 COIND: ENFORCING CONDITIONALLY INDEPENDENT MARGINAL TO ENABLE LOGICAL COMPOSITIONALITY

In this section, we propose COIND to answer (RQ2) posed in § 1: How can we explicitly enable conditional diffusion models to generate data with arbitrary logical compositions of attributes?

In the previous section, we observed that diffusion models do not obey the underlying causal relations and learn incorrect attribute marginals, and hence struggle to demonstrate logical compositionality, as we showed in Fig. 1. To remedy this, COIND uses a training objective that explicitly enforces the causal factorization to ensure that the trained diffusion models obey the underlying causal relations. From the causal graph in Fig. 2a, along with the assumption of $C_1 \perp\!\!\!\perp \dots \perp\!\!\!\perp C_n \mid X$ mentioned in § 2, we have $p(X \mid C) = \frac{p(X)}{p(C)} \prod_{i}^{n} \frac{p(X \mid C_i)\, p(C_i)}{p(X)}$. Note that the invariant $p(X \mid C)$ is now expressed as the product of marginals employed for sampling. Therefore, training the diffusion model by maximizing this conditional likelihood is naturally more suited for learning accurate marginals for the attributes. We minimize the distance between the true conditional likelihood and the learned conditional likelihood as,

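$$\mathcal{L}_{\text{comp}} = W_2 \left( p(X \mid C),\; \frac{p_\theta(X)}{p_\theta(C)} \prod_{i}^{n} \frac{p_\theta(X \mid C_i)\, p_\theta(C_i)}{p_\theta(X)} \right) \tag{6}$$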

where W2 is 2-Wasserstein distance. Applying the triangle inequality to Eq. (6) we have,

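$$\mathcal{L}_{\text{comp}} \le \underbrace{W_2 \left( p(X \mid C),\; p_\theta(X \mid C) \right)}_{\text{Distribution matching}} + \underbrace{W_2 \left( p_\theta(X \mid C),\; \frac{p_\theta(X)}{p_\theta(C)} \prod_{i}^{n} \frac{p_\theta(X \mid C_i)\, p_\theta(C_i)}{p_\theta(X)} \right)}_{\text{Conditional Independence}} \tag{7}$$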

Kwon et al. (2022) showed that the Wasserstein distance between $p_0(X)$ and $q_0(X)$ is upper bounded by the square root of the score-matching objective:

$$W_2 \left( p_0(X), q_0(X) \right) \le K \sqrt{ \mathbb{E}_{p_0(X)} \left[ \lVert \nabla_X \log p_0(X) - \nabla_X \log q_0(X) \rVert_2^2 \right] }$$

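Distribution matching: Following this result, the first term in Eq. (7) is upper bounded by the standard score-matching objective of diffusion models (Song et al., 2021b),

$$\mathcal{L}_{\text{score}} = \mathbb{E}_{p(X, C)} \left\lVert \nabla_X \log p_\theta(X \mid C) - \nabla_X \log p(X \mid C) \right\rVert_2^2 \tag{8}$$

Conditional Independence: Similarly, the second term in Eq. (7) is upper bounded by score matching between the joint and the product of marginals,

$$\mathcal{L}_{\text{CI}} = \mathbb{E} \left\lVert \nabla_X \log p_\theta(X \mid C) - \nabla_X \log p_\theta(X) - \sum_{i} \left[ \nabla_X \log p_\theta(X \mid C_i) - \nabla_X \log p_\theta(X) \right] \right\rVert_2^2 \tag{9}$$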

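Substituting Eq. (8) and Eq. (9) in Eq. (7) results in our final learning objective

$$\mathcal{L}_{\text{comp}} \le K_1 \sqrt{\mathcal{L}_{\text{score}}} + K_2 \sqrt{\mathcal{L}_{\text{CI}}} \tag{10}$$

where $K_1, K_2$ are positive constants, i.e., the conditional independence objective $\mathcal{L}_{\text{CI}}$ is incorporated alongside the existing score-matching loss $\mathcal{L}_{\text{score}}$.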

$\mathcal{L}_{\text{CI}}$ is the Fisher divergence between the joint and the product of marginals. From the properties of Fisher's divergence (Sánchez-Moreno et al., 2012), $\mathcal{L}_{\text{CI}} = 0$ iff $p_\theta(X \mid C) = \frac{p_\theta(X)}{p_\theta(C)} \prod_{i}^{n} \frac{p_\theta(X \mid C_i)\, p_\theta(C_i)}{p_\theta(X)}$. A detailed derivation of the upper bound can be found in App. B.3.
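For reference, the factorization and the residual inside the conditional independence loss both follow from Bayes' rule under the assumed independence of the attributes given X. The steps below restate that reasoning; the second line uses only the fact that p(C) and p(C_i) do not depend on X.

```latex
\begin{aligned}
p(X \mid C)
  &= \frac{p(X)\, p(C \mid X)}{p(C)}
   = \frac{p(X)}{p(C)} \prod_{i=1}^{n} p(C_i \mid X)
   = \frac{p(X)}{p(C)} \prod_{i=1}^{n} \frac{p(X \mid C_i)\, p(C_i)}{p(X)}, \\
\nabla_X \log p(X \mid C)
  &= \nabla_X \log p(X)
   + \sum_{i=1}^{n} \left[ \nabla_X \log p(X \mid C_i) - \nabla_X \log p(X) \right].
\end{aligned}
```

Moving every term of the second line to one side gives exactly the quantity whose squared norm is penalized in Eq. (9).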

Practical Implementation. A computational burden presented by $\mathcal{L}_{\text{CI}}$ in Eq. (9) is that the required number of model evaluations increases linearly with the number of attributes. To mitigate this burden, we approximate the mutual conditional independence with pairwise conditional independence (Hammond & Sun, 2006). Thus, the modified $\mathcal{L}_{\text{CI}}$ becomes

$$\mathcal{L}_{\text{CI}} = \mathbb{E}_{p(X, C)}\, \mathbb{E}_{j,k} \left\lVert \nabla_X \log p_\theta(X \mid C_j, C_k) - \nabla_X \log p_\theta(X \mid C_j) - \nabla_X \log p_\theta(X \mid C_k) + \nabla_X \log p_\theta(X) \right\rVert_2^2$$

The weighted sum of the squares of the terms in Eq. (10) has shown stability. Therefore, COIND's training objective is

$$\mathcal{L}_{\text{final}} = \mathcal{L}_{\text{score}} + \lambda \mathcal{L}_{\text{CI}} \tag{11}$$

where $\lambda$ is the hyper-parameter that controls the strength of conditional independence. The reduction to the practical version of the upper bound (Eq. (10)) is discussed extensively in App. C. For guidance on selecting hyper-parameters in a principled manner, please refer to App. C.3. Finally, our proposed approach can be implemented with just a few lines of code, as outlined in Algorithm 1.
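Algorithm 1 is not reproduced in this section, so the following is only a minimal NumPy sketch of how the two loss terms combine, not the authors' implementation. It assumes the four scores for a sampled attribute pair (j, k) have already been computed for a batch, for example by masking attributes with a null token as in classifier-free guidance; all function and argument names are hypothetical.

```python
import numpy as np

def score_matching_loss(pred_score, target_score):
    """Batch estimate of Eq. (8): mean squared L2 error between scores."""
    return float(np.mean(np.sum((pred_score - target_score) ** 2, axis=-1)))

def pairwise_ci_loss(s_jk, s_j, s_k, s_uncond):
    """Batch estimate of the pairwise L_CI:
    || s(X|Cj,Ck) - s(X|Cj) - s(X|Ck) + s(X) ||^2 averaged over the batch."""
    residual = s_jk - s_j - s_k + s_uncond
    return float(np.mean(np.sum(residual ** 2, axis=-1)))

def coind_loss(pred_score, target_score, s_jk, s_j, s_k, s_uncond, lam=1.0):
    """Eq. (11): L_score + lambda * L_CI."""
    return (score_matching_loss(pred_score, target_score)
            + lam * pairwise_ci_loss(s_jk, s_j, s_k, s_uncond))

# Toy batch of shape (batch, dim) in which the scores factorize exactly,
# so the conditional independence penalty vanishes.
rng = np.random.default_rng(0)
s_j, s_k, s_uncond = rng.normal(size=(3, 4, 5))
s_jk_independent = s_j + s_k - s_uncond
ci_zero = pairwise_ci_loss(s_jk_independent, s_j, s_k, s_uncond)
ci_positive = pairwise_ci_loss(s_jk_independent + 1.0, s_j, s_k, s_uncond)
```

With a constant offset of 1 added to each of the 5 score dimensions, the penalty equals 5, which shows that the term measures the squared norm of the departure from the factorized score.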


5 DOES COIND IMPROVE THE LOGICAL COMPOSITIONALITY?

COIND encourages diffusion models to learn conditionally independent marginals of attributes, and thereby improve their logical compositionality capabilities. In this section, we design experiments to evaluate COIND on two questions: (1) Does COIND effectively train diffusion models that obey the underlying causal model? and (2) Does COIND improve the logical compositionality of these models? We measure the JSD of the trained models to answer the first question. To answer the second question, we use two primitive logical compositional tasks: (a) $\land$ (AND) composition and (b) $\neg$ (NOT) composition. In each case, the generative model is provided with a logical relation between the attributes, and the task is to generate images with attributes that satisfy this logical relation. A more detailed description of task construction can be found in App. D.2. Beyond improved logical compositionality, we ask: Does learning conditionally independent marginals lead to greater diversity in uncontrolled attributes and enhanced controllability of attributes?

Datasets. We use the following image datasets with labeled attributes for our experiments: (1) the Colored MNIST dataset described in § 1, where the attributes of interest are digit and color; (2) the Shapes3d dataset (Kim & Mnih, 2018), containing images of 3D objects in various environments, where each image is labeled with six attributes of interest; and (3) CelebA, with gender and smile attributes, to demonstrate the effectiveness of COIND on real-world datasets. Refer to App. D.5.

Figure 3: Orthogonal partial support

Observed training distributions. We evaluate COIND on four scenarios where we observe different distributions of attribute compositions during training: (1) uniform support, (2) non-uniform support, and (3) diagonal partial support, as defined in § 2, and (4) orthogonal partial support, which includes only the attribute compositions along the axes originating from a corner of the hypercube $\mathcal{C}$, following Wiedemer et al. (2024) (Fig. 3). For Colored MNIST experiments, we evaluate with uniform, non-uniform, and diagonal partial support. For Shapes3d experiments, we evaluate with uniform and orthogonal partial support, following the compositional setup in Schott et al. (2020). We evaluate CelebA on orthogonal partial support, where all compositions except unseen male smiling celebrities are observed.

Baselines. LACE (Nie et al., 2021) and Composed GLIDE (Liu et al., 2022) are our primary baselines. LACE trains distinct energy-based models (EBMs) for each attribute and combines them following the compositional logic described in § 2 during sampling. A similar approach was proposed by Du et al. (2020). However, in our experimental evaluation for LACE, we train distinct score-based models instead of EBMs. In contrast, Composed GLIDE samples from score-based models by factorizing the joint distribution into marginals, assuming these models had implicitly learned conditionally independent marginals of attributes. Additional details about the baselines are deferred to App. D.3.

Metrics. We assess how accurately the models have captured the underlying data generation process using the JSD, defined in § 3. To measure the accuracy of the attributes in the generated image w.r.t. the input logical composition, we use the conformity score (CS) from § 2. As a reminder, CS measures the accuracy with which the model adds the desired attributes to the generated image using attribute-specific classifiers. In addition to the conformity score, since the Shapes3d dataset contains unique ground truth images corresponding to the input logical relation, we directly compare generated samples with reference images at the pixel level using the variance-weighted coefficient of determination, $R^2$. Additionally, for CelebA, we measure FID (Seitzer, 2020). For uniform and non-uniform support, we evaluate generations for input logical relations corresponding to attribute compositions that span the attribute space $\mathcal{C}$. In other cases, we evaluate the models' ability to generate input logical relations corresponding to the unseen compositional support, i.e., $\mathcal{C} \setminus \mathcal{C}_{\text{train}}$.
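As an illustration of the pixel-level metric, a variance-weighted coefficient of determination can be computed by treating each pixel as one output and weighting its per-pixel R² by that pixel's variance across the reference images. The sketch below assumes flattened images and is not taken from the paper's evaluation code.

```python
import numpy as np

def variance_weighted_r2(reference, generated):
    """Variance-weighted R^2 for arrays of shape (n_samples, n_pixels).

    Weighting each pixel's R^2 by its variance is equivalent to
    1 - (total residual sum of squares) / (total sum of squares)."""
    reference = np.asarray(reference, dtype=float)
    generated = np.asarray(generated, dtype=float)
    ss_res = np.sum((reference - generated) ** 2, axis=0)
    ss_tot = np.sum((reference - reference.mean(axis=0)) ** 2, axis=0)
    return 1.0 - ss_res.sum() / ss_tot.sum()

reference = np.array([[0.0, 1.0], [2.0, 3.0], [4.0, 8.0]])
perfect = variance_weighted_r2(reference, reference)
mean_only = variance_weighted_r2(
    reference, np.tile(reference.mean(axis=0), (3, 1))
)
```

A perfect reconstruction scores 1, and always predicting the per-pixel mean scores 0.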

Learning Independent Marginals Enables Logical Compositionality

Fig. 4a compares COIND with baselines on $\land$ and $\neg$ composition tasks. The $\neg$Color task generates images with the negation applied on the color attribute, while $\neg$Digit applies the negation to the digit attribute. From these results, we make the following observations:

Conditional diffusion models do not learn accurate marginals even when all attribute compositions are observed during training with equal probability. This is evident from the positive JSD of the methods trained with uniform support. Furthermore, the conformity score (CS) is lower when JSD is higher. This observation has significant ramifications for compositional generative models.

| Support | Method | JSD | $\land$ (CS) | $\neg$Color (CS) | $\neg$Digit (CS) |
|---|---|---|---|---|---|
| Uniform | LACE | - | 96.40 | 92.56 | 83.67 |
| Uniform | Composed GLIDE | 0.16 | 98.15 | 99.30 | 81.64 |
| Uniform | COIND (λ = 0.2) | 0.14 | 99.73 | 99.32 | 84.94 |
| Uniform | COIND (λ = 1.0) | 0.10 | 99.99 | 99.33 | 89.60 |
| Non-uniform | LACE | - | 82.61 | 65.16 | 69.51 |
| Non-uniform | Composed GLIDE | 0.30 | 86.10 | 81.61 | 70.44 |
| Non-uniform | COIND (λ = 1.0) | 0.15 | 99.95 | 92.41 | 84.98 |
| Partial | LACE | - | 10.85 | 9.03 | 28.24 |
| Partial | Composed GLIDE | 2.75 | 7.40 | 5.09 | 33.86 |
| Partial | COIND (λ = 1.0) | 1.17 | 52.38 | 53.28 | 52.59 |

(a) Results on Colored MNIST Dataset

(b) JSD vs CS (plot; JSD axis on a log scale)

Figure 4: Results on Colored MNIST dataset. (a) We compare JSD and CS of COIND against baselines trained under various settings and on different compositional tasks. (b) Plotting CS against JSD in the log scale of the models trained under different settings reveals a negative correlation.

This result contradicts the intuitive expectation that uniformly observing the whole compositional support during training is sufficient to generate arbitrary logical compositions of attributes. And, it suggests that even in this ideal yet impractical case, the current objectives for training diffusion models are insufficient for controllable and accurate closed-set, let alone open-set, compositional generation. As such, we conjecture that scaling the datasets without inductive biases (conditional independence of marginals in this case) is insufficient for arbitrary logical compositional generation.

Even methods like LACE that train separate models for each attribute fail on composition tasks. This suggests that softer inductive biases, such as learning separate marginals for each attribute without paying heed to the desired independence relations, are insufficient for logical compositionality.

In the more practical scenarios of non-uniform and partial support, JSD increases with non-uniform support and worsens further with partial support due to incorrect marginals, as discussed in § 3. This result suggests that current state-of-the-art models learned on finite datasets likely operate in the non-uniform or partial support scenario and thus may fail to generate accurate and realistic data for arbitrary logical compositions of attributes.

Logical AND (∧) and NOT (¬) compositionality deteriorates with increasing dependence between the marginals. The negative correlation between JSD and CS was noted in § 3 and can be observed in Fig. 4b, which shows JSD-vs-CS for compositions across different methods, and under different settings for observed support. This negative correlation strongly suggests that violation of conditional independence plays a major role in the diminished logical compositionality demonstrated by standard diffusion models.

By enforcing conditional independence between the attributes during training, COIND achieves lower JSD and improves both ∧ and ¬ compositionality in non-uniform and partial support. Even when trained on non-uniform support, COIND matches the compositionality achieved with uniform support in terms of conformity score. Under the partial support setting, COIND achieves a 2–10 fold improvement over the baselines on ∧ and ¬ compositions. These results demonstrate that enforcing conditional independence between the marginals is vital for enabling arbitrary logical compositions in conditional diffusion models.

Figure 5: Images generated by COIND for the logical composition digit = 4 under the non-uniform scenario are significantly more diverse than those of the baselines. H is the Shannon entropy. Panels: (a) LACE, H = 1.82; (b) Composed GLIDE, H = 1.71; (c) COIND, H = 2.63.

COIND generates diverse samples. It is desirable that any attribute not part of the logical composition for generation assumes diverse values in the generated samples to avoid harmful generated content including stereotypes (Dehdashtian et al., 2025) and biases (Luccioni et al., 2024). In Fig. 5, we observe that although COIND does not explicitly optimize for diversity, the samples generated by COIND for the logical relation digit = 4 are significantly more diverse compared to the baselines. We quantitatively measure the diversity of these images using the Shannon entropy H of the color attributes in the generated images. Higher Shannon entropy indicates more diversity. Entropy is maximum for a uniform distribution with H(uniform) = log2(10) = 3.32, since there are 10 colors. We observe that H(COIND) = 2.63, while H(LACE) = 1.82 and H(Composed GLIDE) = 1.71. Although COIND does not explicitly seek diversity, breaking the dependence induced by unknown confounders leads to diversity in attributes.
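The entropy computation described above can be sketched as follows. This is a minimal illustration, not the authors' evaluation code: the color labels are hypothetical stand-ins for the predictions of a color classifier on generated images, and `shannon_entropy_bits` is a name introduced here.

```python
import math
from collections import Counter


def shannon_entropy_bits(labels):
    """Shannon entropy (in bits) of the empirical distribution of `labels`."""
    counts = Counter(labels)
    total = sum(counts.values())
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


# Hypothetical predicted color indices (0..9) for 1000 generated images.
uniform_colors = list(range(10)) * 100               # every color equally often
skewed_colors = [0] * 700 + [1] * 200 + [2] * 100    # dominated by one color

h_uniform = shannon_entropy_bits(uniform_colors)
h_skewed = shannon_entropy_bits(skewed_colors)
print(f"uniform: {h_uniform:.2f} bits, skewed: {h_skewed:.2f} bits")
```

A uniform spread over the 10 colors attains the maximum log2(10), while a distribution concentrated on a few colors scores lower, which is the sense in which a higher H indicates more diversity.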

| Support | Method | JSD | ∧ Composition R² | ∧ Composition CS | ¬ Composition R² | ¬ Composition CS |
|---|---|---|---|---|---|---|
| Uniform | LACE | - | 0.97 | 91.19 | 0.85 | 50.00 |
| Uniform | Composed GLIDE | 0.302 | 0.94 | 83.75 | 0.91 | 48.43 |
| Uniform | COIND (λ = 1.0) | 0.215 | 0.98 | 95.31 | 0.92 | 55.46 |
| Orthogonal | LACE | - | 0.88 | 62.07 | 0.70 | 30.10 |
| Orthogonal | Composed GLIDE | 0.503 | 0.86 | 51.56 | 0.61 | 34.63 |
| Orthogonal | COIND (λ = 1.0) | 0.287 | 0.97 | 91.10 | 0.92 | 53.90 |

(a) Quantitative Results on Shapes3D Dataset

(b) Visual comparison of samples (columns: Expected, Composed GLIDE, LACE, COIND; rows: compositions under uniform and partial support)

Figure 6: Results on Shapes3d dataset. (a) We compare JSD, R², and CS of COIND against the baselines trained with uniform and partial support on the Shapes3d dataset for ∧ and ¬ composition tasks. (b) Samples generated by COIND match the expected image in all cases.

COIND is scalable with attributes. We use the Shapes3d dataset to evaluate the scalability of COIND w.r.t. the number of attributes. As a reminder, every image in the Shapes3d dataset is labeled with six attributes of interest. For the negation composition task, the ¬ operator is applied to the shape attribute such that the attribute composition satisfying this logical relation is unique. Detailed descriptions of the composition tasks are provided in App. D.2. Fig. 6a compares COIND against the baselines for the uniform and orthogonal partial support scenarios. COIND leads to a significant decrease in JSD and, consequently, a significant increase in the conformity score. When trained with orthogonal support, the performance (CS) of both LACE and Composed GLIDE suffers significantly while COIND matches its performance when trained on uniform support. In conclusion, COIND affords superior logical compositionality from a single monolithic model in a sample-efficient manner even as the number of attributes increases.

COIND generates unseen compositions of real-world face images

| Method | JSD | "smiling male" CS | "smiling male" FID | "smiling" ∧ "male" CS | "smiling" ∧ "male" FID |
|---|---|---|---|---|---|
| LACE | - | - | - | 24.20 | 80.40 |
| Composed GLIDE | 2.44 | 2.51 | 61.21 | 10.55 | 95.41 |
| COIND (λ = 100) | 1.82 | 8.63 | 43.97 | 8.79 | 43.76 |

Table 2: Results on CelebA dataset. COIND outperforms the baselines on both CS and FID across various compositionality tasks.

We evaluate the ability of COIND to generate unseen "smiling male celebrities". The diffusion model is trained on all compositions of the CelebA dataset (Liu et al., 2015) except gender = "male" and smiling = "true". This is equivalent to the orthogonal support scenario shown in Fig. 3. During inference, the model is asked to generate images with the unseen attribute combination gender = "male" and smiling = "true" through both joint sampling and ∧ composition.

Tab. 2 compares COIND against baselines in terms of CS and FID. (1) COIND outperforms the baseline by > 4 in joint. (2) COIND generates realistic faces, closer to smiling male celebrities in the held out set, as measured by FID and displayed in Fig. 7c(γ = 1). In App. E.3, we show that COIND extends to Text-to-Image models by fine-tuning Stable Diffusion (Rombach et al., 2022).

COIND provides fine-grained control over attributes. In addition to merely generating samples with conditioned attributes, COIND can also control the amount of attributes in the sample. For example, in the task of generating face images of smiling male celebrities, we may wish to adjust the amount of smiling without affecting gender-specific attributes. To achieve this, we sample from Eq. (3), where γ controls the strength of smile. Fig. 7 shows the result of increasing γ to increase the amount of smiling in the generated image. The subjects in the face images generated by COIND smile more as γ increases without any changes to any gender-specific attribute. For instance, the images for γ = 1 show a soft smile while the subjects in the images for γ = 6 show teeth. However, those generated by baselines contain gender-specific attributes such as long hair and earrings. These distinctions are quantified in Figs. 7a and 7b. Refer to App. E.2 for more analysis.

Figure 7: Effect of γ on FID and CS. Varying the amount of smile in a generated image through γ does not affect the FID of COIND. However, the smiles in the generated images become more apparent, leading to easier detection by the smile classifier and improved CS. Panels: (a) FID with γ; (b) CS with γ; (c) samples with γ for COIND, Composed GLIDE, and LACE, where the amount of smile increases as γ increases.

6 RELATED WORK

Our work concerns compositional generalization in generative models, where the goal is to generate data with unseen attribute compositions expressed through logical relations between attributes. One class of approaches achieves logical compositionality by combining distinct models trained for each attribute (Du et al., 2020; Liu et al., 2021; Nie et al., 2021; Du et al., 2023). Besides being expensive and scaling linearly with the number of attributes, these models fail under practical partial support scenarios. In contrast, we are interested in monolithic diffusion models that learn logical compositionality. Liu et al. (2022) studied logical compositionality broadly without differentiating between attribute supports and proposed methods to represent logical compositions in terms of marginal probabilities obtained through factorization of the joint distribution. However, these factorized sampling methods fail since the underlying generative model learns inaccurate marginals. In comparison, COIND is trained to obey the independence relations from the underlying causal graph. Cho et al. (2024) also note that diffusion models lack the conditional independence needed for controllability and address this with a hyperparameter during sampling. We argue that, even with disentangled features, learning accurate marginals tackles the root cause more effectively than such post-hoc adjustments. Encouragingly, Okawa et al. (2023) show that compositional abilities emerge multiplicatively, and Liang et al. (2024) highlight factorization in diffusion models, suggesting they naturally exhibit compositional capabilities. However, these studies focus on generating from the joint distribution, a special case of logical compositionality, and are limited to binary attributes. Our work extends these ideas to more general compositions. Lastly, Wiedemer et al. (2024) study compositional generalization for supervised learning and provide sufficient conditions for compositionality. Our empirical observations in generative models are consistent with their theoretical results, suggesting that their findings could perhaps be extended to conditional diffusion models.

7 CONCLUSION

Conditional diffusion models struggle to generate data for arbitrary attribute compositions, even when all attribute compositions are observed during training. Existing methods represent logical relations in terms of the learned marginal distributions, assuming that the diffusion model learns the underlying conditional independence relations. We showed that this assumption does not hold in practice and worsens when only a subset of these attribute compositions are observed during training. To mitigate this problem, we proposed COIND to train diffusion models by maximizing conditional data likelihood in terms of the marginal distributions that are obtained from the underlying causal graph. Our causal modeling provides COIND a natural advantage in logical compositionality by ensuring it learns accurate marginals. Our experiments on synthetic and real image datasets highlight the theoretical benefits of COIND. Unlike existing methods, COIND is monolithic, easy to implement, and demonstrates superior logical compositionality. COIND shows that adequate inductive biases such as conditional independence between marginals are necessary for effective logical compositionality. Refer to Apps. F and G for more discussions, analysis and limitations of COIND.


Acknowledgements: This work was supported by the Office of Naval Research (award #N0001423-1-2417). Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of ONR. We also thank Mashrur Morshed for his insights on improving diffusion model training and for sharing the bare-bones code, Lan Wang for providing the code for fine-tuning Stable Diffusion, and Ramin Akbari for his assistance with the proofs presented in Apps. B and C.

REFERENCES

Tim Brooks, Aleksander Holynski, and Alexei A Efros. InstructPix2Pix: Learning to Follow Image Editing Instructions. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023.

Wonwoong Cho, Hareesh Ravi, Midhun Harikumar, Vinh Khuc, Krishna Kumar Singh, Jingwan Lu, David I. Inouye, and Ajinkya Kale. Enhanced controllability of diffusion models via feature disentanglement and realism-enhanced sampling methods. In European Conference on Computer Vision, 2024.

Sepehr Dehdashtian, Gautam Sreekumar, and Vishnu Naresh Boddeti. OASIS Uncovers: High Quality T2I Models, Same Old Stereotypes. In International Conference on Learning Representations, 2025.

Prafulla Dhariwal and Alexander Nichol. Diffusion Models Beat GANs on Image Synthesis. In Advances in Neural Information Processing Systems, 2021.

Yilun Du and Leslie Kaelbling. Position: Compositional Generative Modeling: A Single Model is Not All You Need. In International Conference on Machine Learning, 2024.

Yilun Du, Shuang Li, and Igor Mordatch. Compositional Visual Generation with Energy Based Models. In Advances in Neural Information Processing Systems, 2020.

Yilun Du, Conor Durkan, Robin Strudel, Joshua B Tenenbaum, Sander Dieleman, Rob Fergus, Jascha Sohl-Dickstein, Arnaud Doucet, and Will Sussman Grathwohl. Reduce, Reuse, Recycle: Compositional Generation with Energy-Based Diffusion Models and MCMC. In International Conference on Machine Learning, 2023.

Peter J Hammond and Yeneng Sun. The essential equivalence of pairwise and mutual conditional independence. Probability Theory and Related Fields, 135(3):415–427, 2006.

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep Residual Learning for Image Recognition. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2016.

Jonathan Ho and Tim Salimans. Classifier-Free Diffusion Guidance, 2022. URL https://arxiv.org/abs/2207.12598.

Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising Diffusion Probabilistic Models. In Advances in Neural Information Processing Systems, 2020.

Aapo Hyvärinen. Estimation of Non-Normalized Statistical Models by Score Matching. Journal of Machine Learning Research, 6:695–709, 2005.

Gwanghyun Kim, Taesung Kwon, and Jong Chul Ye. DiffusionCLIP: Text-Guided Diffusion Models for Robust Image Manipulation. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022.

Hyunjik Kim and Andriy Mnih. Disentangling by Factorising. In International Conference on Machine Learning, 2018.

Dohyun Kwon, Ying Fan, and Kangwook Lee. Score-based Generative Modeling Secretly Minimizes the Wasserstein Distance. In Advances in Neural Information Processing Systems, 2022.

Tuomas Kynkäänniemi, Miika Aittala, Tero Karras, Samuli Laine, Timo Aila, and Jaakko Lehtinen. Applying Guidance in a Limited Interval Improves Sample and Distribution Quality in Diffusion Models. In Advances in Neural Information Processing Systems, 2024.


Alexander C. Li, Mihir Prabhudesai, Shivam Duggal, Ellis Brown, and Deepak Pathak. Your Diffusion Model is Secretly a Zero-Shot Classifier. In IEEE/CVF International Conference on Computer Vision, 2023.

Qiyao Liang, Ziming Liu, Mitchell Ostrow, and Ila R Fiete. How Diffusion Models Learn to Factorize and Compose. In Advances in Neural Information Processing Systems, 2024.

Yaron Lipman, Marton Havasi, Peter Holderrieth, Neta Shaul, Matt Le, Brian Karrer, Ricky T. Q. Chen, David Lopez-Paz, Heli Ben-Hamu, and Itai Gat. Flow Matching Guide and Code, 2024. URL https://arxiv.org/abs/2412.06264.

Nan Liu, Shuang Li, Yilun Du, Josh Tenenbaum, and Antonio Torralba. Learning to Compose Visual Relations. In Advances in Neural Information Processing Systems, 2021.

Nan Liu, Shuang Li, Yilun Du, Antonio Torralba, and Joshua B Tenenbaum. Compositional Visual Generation with Composable Diffusion Models. In European Conference on Computer Vision, 2022.

Ziwei Liu, Ping Luo, Xiaogang Wang, and Xiaoou Tang. Deep Learning Face Attributes in the Wild. In IEEE/CVF International Conference on Computer Vision, 2015.

Sasha Luccioni, Christopher Akiki, Margaret Mitchell, and Yacine Jernite. Stable Bias: Evaluating Societal Representations in Diffusion Models. In Advances in Neural Information Processing Systems, 2024.

Weili Nie, Arash Vahdat, and Anima Anandkumar. Controllable and Compositional Generation with Latent-Space Energy-Based Models, 2021. URL https://arxiv.org/abs/2110.10873.

Maya Okawa, Ekdeep S Lubana, Robert Dick, and Hidenori Tanaka. Compositional Abilities Emerge Multiplicatively: Exploring Diffusion Models on a Synthetic Task. In Advances in Neural Information Processing Systems, 2023.

Jonas Peters, Dominik Janzing, and Bernhard Schölkopf. Elements of Causal Inference: Foundations and Learning Algorithms. The MIT Press, 2017.

Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High Resolution Image Synthesis With Latent Diffusion Models. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022.

Pablo Sánchez-Moreno, Alejandro Zarzo, and Jesús S Dehesa. Jensen divergence based on Fisher's information. Journal of Physics A: Mathematical and Theoretical, 45(12):125305, 2012.

Lukas Schott, Julius Von Kügelgen, Frederik Träuble, Peter Vincent Gehler, Chris Russell, Matthias Bethge, Bernhard Schölkopf, Francesco Locatello, and Wieland Brendel. Visual Representation Learning Does Not Generalize Strongly Within the Same Domain. In International Conference on Learning Representations, 2020.

Maximilian Seitzer. pytorch-fid: FID Score for PyTorch. https://github.com/mseitzer/pytorch-fid, August 2020. Version 0.3.0.

Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising Diffusion Implicit Models. In International Conference on Learning Representations, 2021a.

Yang Song and Stefano Ermon. Generative Modeling by Estimating Gradients of the Data Distribution. In Advances in Neural Information Processing Systems, 2019.

Yang Song, Jascha Sohl-Dickstein, Diederik P Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score-Based Generative Modeling through Stochastic Differential Equations. In International Conference on Learning Representations, 2021b.

Amos Storkey. When Training and Test Sets Are Different: Characterizing Learning Transfer. In Dataset Shift in Machine Learning. The MIT Press, 2008.


Max Welling and Yee W Teh. Bayesian Learning via Stochastic Gradient Langevin Dynamics. In International Conference on Machine Learning, 2011.

Thaddäus Wiedemer, Prasanna Mayilvahanan, Matthias Bethge, and Wieland Brendel. Compositional generalization from first principles. In Advances in Neural Information Processing Systems, 2024.


Table of Contents

A Preliminaries of Score-based Models

B Proofs for Claims
B.1 Proof for the case study in § 3
B.2 Standard diffusion model objective is not suitable for logical compositionality
B.3 Step-by-step derivation of COIND in § 4

C Practical Considerations
C.1 Scalability of $\mathcal{L}_{CI}$
C.2 Simplification of Theoretical Loss
C.3 Choice of Hyperparameter λ

D Experiment Details
D.1 COIND Algorithm
D.2 Details of Logical Compositionality Task
D.3 Training details, Architecture, and Sampling
D.4 Analytical Forms of Support Settings
D.5 Datasets
D.6 Conformity Score (CS)
D.7 Computing JSD
D.8 Measuring Diversity in Attributes

E COIND for Face Image Generation
E.1 COIND can successfully generate real-world face images
E.2 COIND provides fine-grained control over attributes
E.3 Finetuning T2I models with COIND improves logical compositionality

F Discussion on COIND
F.1 Connection to compositional generation from first principles
F.2 2D Gaussian: Closed-form Analysis of COIND
F.3 Extension to Gaussian source flow models
F.4 Compositional vs Monolithic models
F.5 Limitations

G Additional Results and Discussion on COIND
G.1 Learning under non-uniform $p(C_i)$
G.2 Failure examples of COIND
G.3 Conformity score for each attribute combination
G.4 COIND also improves conditional generation
G.5 COIND can interpolate between discrete attributes


A PRELIMINARIES OF SCORE-BASED MODELS

Score-based models. Score-based models (Song et al., 2021b) learn the score of the observed data distribution, $p_{\text{train}}(X)$, through score matching (Hyvärinen, 2005). The score function $s_\theta(x) = \nabla_x \log p_\theta(x)$ is learned by a neural network parameterized by $\theta$.

$$\mathcal{L}_{\text{score}} = \mathbb{E}_{x \sim p_{\text{train}}} \left[ \left\| s_\theta(x) - \nabla_x \log p_{\text{train}}(x) \right\|_2^2 \right] \tag{12}$$

During inference, sampling is performed using Langevin dynamics:

$$x_t = x_{t-1} + \frac{\eta}{2} \nabla_x \log p_\theta(x_{t-1}) + \sqrt{\eta}\, \epsilon_t, \qquad \epsilon_t \sim \mathcal{N}(0, 1) \tag{13}$$

where $\eta > 0$ is the step size. As $\eta \to 0$ and $T \to \infty$, the samples $x_t$ converge to $p_\theta(X)$ under certain regularity conditions (Welling & Teh, 2011).
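The Langevin update of Eq. (13) can be illustrated on a toy problem. The sketch below assumes a one-dimensional Gaussian target whose score is known in closed form, standing in for a learned $s_\theta$; the function names, step size, and number of steps are illustrative choices, not values from the paper.

```python
import math
import random


def gaussian_score(x, mu=2.0, sigma=0.5):
    """Score d/dx log N(x; mu, sigma^2); a toy stand-in for a learned s_theta."""
    return -(x - mu) / sigma ** 2


def langevin_sample(score, x0, eta, n_steps, rng):
    """Iterate Eq. (13): x_t = x_{t-1} + (eta / 2) * score(x_{t-1}) + sqrt(eta) * eps_t."""
    x = x0
    for _ in range(n_steps):
        x = x + 0.5 * eta * score(x) + math.sqrt(eta) * rng.gauss(0.0, 1.0)
    return x


rng = random.Random(0)
samples = [langevin_sample(gaussian_score, 0.0, 0.01, 400, rng) for _ in range(1000)]
sample_mean = sum(samples) / len(samples)
sample_var = sum((s - sample_mean) ** 2 for s in samples) / len(samples)
print(f"mean = {sample_mean:.3f}, variance = {sample_var:.3f}")
```

With a small step size the empirical mean and variance should land close to the target's 2.0 and 0.25; a finite step size leaves a small discretization bias, consistent with the convergence statement holding only as $\eta \to 0$.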

Diffusion models. Song & Ermon (2019) proposed a scalable variant that involves adding noise to the data. Ho et al. (2020) showed its equivalence to diffusion models. Diffusion models are trained by adding noise to the image $x$ according to a noise schedule, and then a neural network $\epsilon_\theta$ is used to predict the noise from the noisy image $x_t$. The training objective of diffusion models is given by:

$$\mathcal{L}_{\text{score}} = \mathbb{E}_{x \sim p_{\text{train}}} \mathbb{E}_{t \sim [0, T]} \left\| \epsilon - \epsilon_\theta(x_t, t) \right\|^2 \tag{14}$$

Here, the perturbed data $x_t$ is expressed as $x_t = \sqrt{\bar{\alpha}_t}\, x + \sqrt{1 - \bar{\alpha}_t}\, \epsilon$, where $\bar{\alpha}_t = \prod_{i=1}^{t} \alpha_i$ for a pre-specified noise schedule $\alpha_t$. The score can be obtained using

$$s_\theta(x_t, t) \approx -\frac{\epsilon_\theta(x_t, t)}{\sqrt{1 - \bar{\alpha}_t}} \tag{15}$$
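The forward perturbation and the noise-to-score conversion of Eq. (15) can be checked numerically on a scalar. In this sketch the linear beta schedule with T = 1000 is an assumption (a common choice, not stated in the text), and the true noise is used in place of a network prediction.

```python
import math
import random


def alpha_bar(betas, t):
    """Cumulative product prod_{i=1..t} alpha_i with alpha_i = 1 - beta_i."""
    prod = 1.0
    for beta in betas[:t]:
        prod *= 1.0 - beta
    return prod


def perturb(x, eps, a_bar):
    """Forward noising: x_t = sqrt(a_bar) * x + sqrt(1 - a_bar) * eps."""
    return math.sqrt(a_bar) * x + math.sqrt(1.0 - a_bar) * eps


def score_from_eps(eps_pred, a_bar):
    """Eq. (15): convert a noise prediction into a score estimate."""
    return -eps_pred / math.sqrt(1.0 - a_bar)


# Assumed linear beta schedule with T = 1000 steps (illustrative only).
T = 1000
betas = [1e-4 + (0.02 - 1e-4) * i / (T - 1) for i in range(T)]

rng = random.Random(0)
x, eps, t = 1.5, rng.gauss(0.0, 1.0), 500
a_bar = alpha_bar(betas, t)
x_t = perturb(x, eps, a_bar)

# With a perfect noise prediction, Eq. (15) equals the score of
# q(x_t | x) = N(sqrt(a_bar) * x, 1 - a_bar) evaluated at x_t.
exact_score = -(x_t - math.sqrt(a_bar) * x) / (1.0 - a_bar)
approx_score = score_from_eps(eps, a_bar)
print(exact_score, approx_score)
```

The two printed values agree up to floating-point error, which is why a noise-prediction network can be used wherever a score is required.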

Langevin dynamics can be used with $s_\theta(x_t, t)$ to generate samples from $p(X)$. The conditional score (Dhariwal & Nichol, 2021) is used to obtain samples from the conditional distribution $p_\theta(X \mid C)$ as:

$$\nabla_{X_t} \log p(X_t \mid C) = \underbrace{\nabla_{X_t} \log p_\theta(X_t)}_{\text{Unconditional score}} + \gamma\, \underbrace{\nabla_{X_t} \log p_\theta(C \mid X_t)}_{\text{noisy classifier}}$$

where $\gamma$ is the classifier strength. Instead of training a separate noisy classifier, Ho & Salimans (2022) extended this to conditional generation by training $\nabla_{X_t} \log p_\theta(X_t \mid C) = s_\theta(X_t, t, C)$. Sampling can then be performed using the following equation:

$$\nabla_{X_t} \log p(X_t \mid C) = (1 - \gamma)\, \nabla_{X_t} \log p_\theta(X_t) + \gamma\, \nabla_{X_t} \log p_\theta(X_t \mid C) \tag{16}$$

However, sampling needs access to unconditional scores as well. Instead of modelling $\nabla_{X_t} \log p_\theta(X_t)$ and $\nabla_{X_t} \log p_\theta(X_t \mid C)$ as two different models, Ho & Salimans (2022) amortize the cost by training a conditional model $s_\theta(x_t, t, c)$ jointly with the unconditional model, which is obtained by setting $c = \varnothing$.

In the general case of classifier-free guidance, a single model can be effectively trained to accommodate all subsets of attribute distributions. During the training phase, each attribute $c_i$ is randomly set to $\varnothing$ with probability $p_{\text{uncond}}$. This approach ensures that the model learns to match all possible subsets of attribute distributions. Essentially, through this formulation, we use the same network to model all the possible subsets of conditional probability.

Once trained, the model can generate samples conditioned on specific attributes, such as $c_i$ and $c_j$, by setting all other conditions to $\varnothing$. The conditional score is then computed as $\nabla_{X_t} \log p_\theta(X_t \mid c_i, c_j) = s_\theta(x_t, c^{i,j})$, where $c^{i,j}$ represents the condition vector with all values other than $i$ and $j$ set to $\varnothing$. This method allows for flexible and efficient sampling across various attribute combinations.
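The masking scheme and the guided score of Eq. (16) can be sketched as below. This is a toy illustration under stated assumptions: `NULL` plays the role of the null condition, `toy_score` is a hypothetical closed-form stand-in for $s_\theta$, and the helper names are introduced here rather than taken from the paper's code.

```python
import random

NULL = None  # stands for the null condition used to drop an attribute


def drop_attributes(c, p_uncond, rng):
    """Training-time masking: independently replace each attribute with NULL."""
    return tuple(NULL if rng.random() < p_uncond else ci for ci in c)


def keep_only(c, indices):
    """Inference-time masking: keep attributes at `indices`, set the rest to NULL."""
    return tuple(ci if i in indices else NULL for i, ci in enumerate(c))


def guided_score(score_fn, x, c, gamma):
    """Eq. (16): (1 - gamma) * unconditional score + gamma * conditional score."""
    s_uncond = score_fn(x, tuple(NULL for _ in c))
    s_cond = score_fn(x, c)
    return (1.0 - gamma) * s_uncond + gamma * s_cond


def toy_score(x, c):
    """Hypothetical s_theta: unit-variance Gaussian with mean = sum of active attributes."""
    mean = sum(ci for ci in c if ci is not NULL)
    return -(x - mean)


rng = random.Random(0)
c = (1.0, -2.0, 0.5)
masked = drop_attributes(c, p_uncond=0.3, rng=rng)  # one random training-time mask
pair = keep_only(c, {0, 2})                          # condition on attributes 0 and 2 only
print(masked, pair, guided_score(toy_score, 0.0, pair, gamma=2.0))
```

Setting `gamma = 0` recovers the unconditional score and `gamma = 1` the conditional one; values above 1 extrapolate away from the unconditional score.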

Estimating Guidance. Once the diffusion model is trained, we investigate the implicit classifier, $p_\theta(C \mid X)$, learned by the model. This will give us insights into the learning process of the diffusion models. Li et al. (2023) show how to compute $p_\theta(C_i = c_i \mid X = x)$; we borrow Eqs. (5) and (6) from Li et al. (2023):

$$p_\theta(C_i = c_i \mid x) = \frac{p(c_i)\, p_\theta(x \mid c_i)}{\sum_k p(c_k)\, p_\theta(x \mid c_k)}$$

$$p_\theta(C_i = c_i \mid x) = \frac{\exp\{-\mathbb{E}_{t,\epsilon}[\|\epsilon - \epsilon_\theta(x_t, t, c_i)\|^2]\}}{\mathbb{E}_{C_i}[\exp\{-\mathbb{E}_{t,\epsilon}[\|\epsilon - \epsilon_\theta(x_t, t, c_i)\|^2]\}]} \tag{17}$$

Likewise, we can extend it to the joint distribution by

$$p_\theta(C_i = c_i, C_j = c_j \mid x) = \frac{\exp\{-\mathbb{E}_{t,\epsilon}[\|\epsilon - \epsilon_\theta(x_t, t, c^{i,j})\|^2]\}}{\mathbb{E}_{C_i, C_j}[\exp\{-\mathbb{E}_{t,\epsilon}[\|\epsilon - \epsilon_\theta(x_t, t, c^{i,j})\|^2]\}]} \tag{18}$$

Practical Implementation. Li et al. (2023) describe several approximations for computing $\mathbb{E}_{t,\epsilon}$. However, we use a different approximation inspired by Kynkäänniemi et al. (2024): we sample 5 time-steps in [300, 600] instead of spreading the time-steps over [0, T].
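One way to turn per-class denoising errors into an implicit-classifier posterior is sketched below. It follows the Bayes form above with an optional prior, so with a uniform prior it reduces to a softmax over negative errors. The error values are made up for illustration, and `posterior_from_errors` is a hypothetical helper, not the authors' implementation.

```python
import math


def posterior_from_errors(errors, prior=None):
    """Posterior over classes from expected denoising errors.

    errors[k] is assumed to be a Monte Carlo estimate of
    E_{t,eps} ||eps - eps_theta(x_t, t, c_k)||^2 for class k.
    """
    k = len(errors)
    prior = prior or [1.0 / k] * k
    # Subtract the smallest error before exponentiating for numerical stability.
    e_min = min(errors)
    weights = [p * math.exp(-(e - e_min)) for p, e in zip(prior, errors)]
    z = sum(weights)
    return [w / z for w in weights]


# Hypothetical errors for three classes, e.g. averaged over 5 time-steps in [300, 600].
errors = [0.82, 0.35, 0.91]
posterior = posterior_from_errors(errors)
print([round(p, 3) for p in posterior])
```

The class whose conditioning yields the smallest denoising error receives the largest posterior mass.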

B PROOFS FOR CLAIMS

In this section, we detail the mathematical derivations for the case study from § 3 in App. B.1, relate the origin of the conditional independence violation to the unsuitable loss function of vanilla diffusion models in App. B.2, and then derive the final loss function of COIND in App. B.3.

B.1 PROOF FOR THE CASE STUDY IN § 3

In this section, we prove that failure of compositionality in diffusion models is due to the violation of conditional independence.

We start from the following conditional independence relation:

$$p(C \mid X) = \prod_i p(C_i \mid X) \tag{CI relation}$$

This CI relation is used by several works (Liu et al., 2022; Nie et al., 2021), including ours, to derive the expression for the joint distribution p(X | C) in terms of the marginals p(X | Ci) for logical compositionality. As a reminder, logical compositionality is preferred over simple conditional generation as it (1) provides fine-grained control over the attributes, (2) facilitates NOT relations on attributes, and (3) is more interpretable. The joint likelihood is written in terms of the marginals using the CI relation and the causal factorization as,

$$p(X \mid C) = \frac{p(X)}{p(C)} \prod_i \frac{p(X \mid C_i)\, p(C_i)}{p(X)} \tag{JM relation}$$

Note that the CI relation is crucial for the JM relation to hold. We sample from the joint likelihood using the score of the LHS of the JM relation, referred to as joint sampling in § 3. Similarly, we sample using the score of the RHS of the JM relation, referred to as marginal sampling in § 3. If the learned generative model satisfies the JM relation, then there should not be any difference in the CS between joint sampling and marginal sampling. However, in Tab. 1, we see a drop in CS, implying that the JM relation is not satisfied in the learned model.
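The dependence of the JM relation on the CI relation can be verified numerically on a small discrete model. The sketch below builds a hypothetical joint over X in {0, 1, 2} and two binary attributes so that the CI relation holds by construction, then compares both sides of the JM relation; all probability values are made up for illustration.

```python
from itertools import product

# Hypothetical discrete model: X in {0, 1, 2}, two binary attributes C1 and C2.
p_x = [0.5, 0.3, 0.2]
p_c1_given_x = [[0.9, 0.1], [0.4, 0.6], [0.2, 0.8]]  # p(C1 | X = x)
p_c2_given_x = [[0.7, 0.3], [0.5, 0.5], [0.1, 0.9]]  # p(C2 | X = x)

# Joint built so that p(C | X) = p(C1 | X) p(C2 | X) holds by construction.
joint = {
    (x, c1, c2): p_x[x] * p_c1_given_x[x][c1] * p_c2_given_x[x][c2]
    for x, c1, c2 in product(range(3), range(2), range(2))
}
NAMES = ("x", "c1", "c2")


def prob(**fixed):
    """Probability of the event given by keyword arguments x, c1, c2."""
    return sum(
        p for key, p in joint.items()
        if all(key[NAMES.index(k)] == v for k, v in fixed.items())
    )


def lhs(x, c1, c2):
    """p(X | C1, C2) computed directly from the joint."""
    return prob(x=x, c1=c1, c2=c2) / prob(c1=c1, c2=c2)


def rhs(x, c1, c2):
    """JM relation: p(X) / p(C) * prod_i [p(X | Ci) p(Ci) / p(X)]."""
    px = prob(x=x)
    term1 = prob(x=x, c1=c1) / px  # p(X | C1) p(C1) / p(X)
    term2 = prob(x=x, c2=c2) / px  # p(X | C2) p(C2) / p(X)
    return px / prob(c1=c1, c2=c2) * term1 * term2


max_gap = max(abs(lhs(*key) - rhs(*key)) for key in joint)
print(max_gap)
```

Replacing the joint with one in which C1 and C2 are not conditionally independent given X would make the two sides differ, which is the failure mode probed by comparing joint and marginal sampling.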

The JM relation must hold in the learned generative model if the CI relation is true in the learned generative model. Therefore, we check whether the CI relation holds in the generative model by measuring the JSD between the LHS and RHS of the CI relation, as shown in Eq. (5) in the main paper. The results in Tab. 1 confirm that the CI relation does not hold in the learned model. This is a significant finding since existing works (Liu et al., 2022; Nie et al., 2021) assume without verification that the model satisfies the CI relation, leading to a severe performance drop when the training support is non-uniform or partial.
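The basic quantity behind this check, the Jensen-Shannon divergence between a joint over attributes and the product of its marginals, can be sketched for two binary attributes as follows. This illustrates the divergence only; the estimator actually used in the paper is the one in Eq. (5) and App. D.7, and the distributions below are made up.

```python
import math


def kl(p, q):
    """KL divergence (in nats) between two discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def jsd(p, q):
    """Jensen-Shannon divergence between two discrete distributions."""
    m = [0.5 * (pi + qi) for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def product_of_marginals(joint, n1, n2):
    """`joint` is a flat list over (c1, c2) in row-major order."""
    m1 = [sum(joint[i * n2 + j] for j in range(n2)) for i in range(n1)]
    m2 = [sum(joint[i * n2 + j] for i in range(n1)) for j in range(n2)]
    return [m1[i] * m2[j] for i in range(n1) for j in range(n2)]


# Hypothetical classifier outputs p(C1, C2 | x) for a single image x.
independent = [0.06, 0.14, 0.24, 0.56]  # equals [0.2, 0.8] x [0.3, 0.7]
dependent = [0.45, 0.05, 0.05, 0.45]    # attributes strongly coupled

jsd_independent = jsd(independent, product_of_marginals(independent, 2, 2))
jsd_dependent = jsd(dependent, product_of_marginals(dependent, 2, 2))
print(jsd_independent, jsd_dependent)
```

A value near zero indicates that the CI relation approximately holds for that sample, while a clearly positive value signals a violation.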

The CI relation is violated in the learned model because the standard training objective is not suitable for compositionality, as it does not account for the incorrect ptrain(X | Ci). The proof is detailed in the next section App. B.2. Therefore, we proposed COIND to ensure the JM relation was satisfied by explicitly learning the marginal likelihood according to the causal factorization.

B.2 STANDARD DIFFUSION MODEL OBJECTIVE IS NOT SUITABLE FOR LOGICAL COMPOSITIONALITY

This section proves that the violation of conditional independence in diffusion models is due to learning incorrect marginals, $p_{\text{train}}(X \mid C_i)$, under $C_i \not\perp C_j$. We leverage the causal invariance property: $p_{\text{train}}(X \mid C) = p_{\text{true}}(X \mid C)$, where $p_{\text{train}}$ is the training distribution and $p_{\text{true}}$ is the true underlying distribution.

Consider the training objective of the score-based models in classifier free formulation Eq. (12). For the classifier-free guidance, a single model sθ(x, C) is effectively trained to match the score of all subsets of attribute distributions. Therefore, the effective formulation for classifier-free guidance can be written as,

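$$\mathcal{L}_{\text{score}} = \mathbb{E}_{x \sim p_{\text{train}}} \mathbb{E}_{S} \left[ \left\| \nabla_x \log p_\theta(x \mid c_S) - \nabla_x \log p_{\text{train}}(x \mid c_S) \right\|_2^2 \right] \tag{19}$$

where $S$ ranges over the power set of attributes.

From the properties of Fisher divergence, $\mathcal{L}_{\text{score}} = 0$ iff $p_\theta(X \mid c_S) = p_{\text{train}}(X \mid c_S)$, $\forall S$. In the case of marginals $p_\theta(X \mid C_i)$, i.e. $S = \{C_i\}$ for some $1 \le i \le n$,

$$
\begin{aligned}
p_\theta(X \mid C_i) &= p_{\text{train}}(X \mid C_i) \\
&= \sum_{C_{-i}} p_{\text{train}}(X \mid C_i, C_{-i})\, p_{\text{train}}(C_{-i} \mid C_i) \\
&= \sum_{C_{-i}} p_{\text{true}}(X \mid C_i, C_{-i})\, p_{\text{train}}(C_{-i} \mid C_i) \\
&\neq \sum_{C_{-i}} p_{\text{true}}(X \mid C_i, C_{-i})\, p_{\text{true}}(C_{-i}) = p_{\text{true}}(X \mid C_i) \\
\implies p_\theta(X \mid C_i) &\neq p_{\text{true}}(X \mid C_i)
\end{aligned} \tag{20}
$$

where $C_{-i} = \prod_{j=1, j \neq i}^{n} C_j$, which is every attribute except $C_i$. Therefore, the objective of the score-based models is to maximize the likelihood of the marginals of the training data and not the true marginal distribution, which is different from the training distribution when $C_i \not\perp C_j$.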
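The argument of Eq. (20) can be made concrete with a two-attribute example. In the sketch below the mechanism p(X | C1, C2) is shared between the training and true distributions (causal invariance), while only the attribute distribution changes; all numbers are hypothetical.

```python
from itertools import product

# Invariant mechanism p(X = 1 | C1, C2), shared by the train and true distributions.
p_x1_given_c = {(0, 0): 0.1, (0, 1): 0.6, (1, 0): 0.4, (1, 1): 0.9}

# True attribute distribution: C1 and C2 independent and uniform.
p_true_c = {c: 0.25 for c in product(range(2), repeat=2)}
# Training attribute distribution: C1 and C2 strongly correlated.
p_train_c = {(0, 0): 0.45, (0, 1): 0.05, (1, 0): 0.05, (1, 1): 0.45}


def marginal_x1_given_c1(p_c, c1):
    """p(X = 1 | C1 = c1) = sum_{c2} p(X = 1 | c1, c2) * p(c2 | c1)."""
    p_c1 = sum(p_c[(c1, c2)] for c2 in range(2))
    return sum(p_x1_given_c[(c1, c2)] * p_c[(c1, c2)] / p_c1 for c2 in range(2))


true_marginal = marginal_x1_given_c1(p_true_c, 0)
train_marginal = marginal_x1_given_c1(p_train_c, 0)
print(true_marginal, train_marginal)
```

Even though the conditional p(X | C1, C2) is identical in both cases, the marginal p(X | C1) that a model fit to the training data would learn differs from the true one as soon as the attributes are dependent in the training set.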

B.3 STEP-BY-STEP DERIVATION OF COIND IN § 4

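The objective is to train the model by explicitly modeling the joint likelihood following the causal factorization from Eq. (JM relation). The minimization for this objective can be written as

$$\mathcal{L}_{\text{comp}} = W_2\left( p(X \mid C),\ \frac{p_\theta(X)}{p_\theta(C)} \prod_i \frac{p_\theta(X \mid C_i)\, p_\theta(C_i)}{p_\theta(X)} \right) \tag{21}$$

where $W_2$ is the 2-Wasserstein distance. Applying the triangle inequality to Eq. (21), we have

$$\mathcal{L}_{\text{comp}} \le \underbrace{W_2\left( p(X \mid C),\ p_\theta(X \mid C) \right)}_{\text{Distribution matching}} + \underbrace{W_2\left( p_\theta(X \mid C),\ \frac{p_\theta(X)}{p_\theta(C)} \prod_i \frac{p_\theta(X \mid C_i)\, p_\theta(C_i)}{p_\theta(X)} \right)}_{\text{Conditional Independence}} \tag{22}$$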

Kwon et al. (2022) showed that, under some conditions, the Wasserstein distance between $p_0(X)$ and $q_0(X)$ is upper bounded by the square root of the score-matching objective. Rewriting Equation 16 from Kwon et al. (2022):

$$W_2\left( p_0(X), q_0(X) \right) \le K \sqrt{ \mathbb{E}_{p_0(X)} \left[ \left\| \nabla_X \log p_0(X) - \nabla_X \log q_0(X) \right\|_2^2 \right] } \tag{23}$$

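Distribution matching. Applying Eq. (23) to the first term in Eq. (22), with $p_0$ replaced by $p(X \mid C)$ and $q_0$ by $p_\theta(X \mid C)$, results in

$$W_2\left( p(X \mid C), p_\theta(X \mid C) \right) \le K_1 \sqrt{ \mathbb{E}_{p(X \mid C)} \left[ \left\| \nabla_X \log p(X \mid C) - \nabla_X \log p_\theta(X \mid C) \right\|_2^2 \right] } = K_1 \sqrt{\mathcal{L}_{\text{score}}} \tag{24}$$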


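Conditional Independence. Applying Eq. (23) to the second term in Eq. (22), with $p_0$ replaced by $p_\theta(X \mid C)$ and $q_0(X)$ by $\frac{p_\theta(X)}{p_\theta(C)} \prod_i^n \frac{p_\theta(X \mid C_i)\, p_\theta(C_i)}{p_\theta(X)}$, results in

$$W_2\left( p_\theta(X \mid C),\ \frac{p_\theta(X)}{p_\theta(C)} \prod_i \frac{p_\theta(X \mid C_i)\, p_\theta(C_i)}{p_\theta(X)} \right) \le K_2 \sqrt{ \mathbb{E} \left\| \nabla_X \log p_\theta(X \mid C) - \nabla_X \log \left[ \frac{p_\theta(X)}{p_\theta(C)} \prod_i \frac{p_\theta(X \mid C_i)\, p_\theta(C_i)}{p_\theta(X)} \right] \right\|_2^2 }$$

Further simplifying and incorporating $\nabla_X \log p_\theta(C_i) = 0$ and $\nabla_X \log p_\theta(C) = 0$ results in

$$W_2\left( p_\theta(X \mid C),\ \frac{p_\theta(X)}{p_\theta(C)} \prod_i \frac{p_\theta(X \mid C_i)\, p_\theta(C_i)}{p_\theta(X)} \right) \le K_2 \sqrt{ \underbrace{ \mathbb{E} \left\| \nabla_X \log p_\theta(X \mid C) - \nabla_X \log p_\theta(X) - \sum_i \left[ \nabla_X \log p_\theta(X \mid C_i) - \nabla_X \log p_\theta(X) \right] \right\|_2^2 }_{\mathcal{L}_{CI}} } \tag{25}$$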

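Substituting Eq. (24) and Eq. (25) in Eq. (22) results in our final learning objective

$$\mathcal{L}_{\text{comp}} \le K_1 \sqrt{\mathcal{L}_{\text{score}}} + K_2 \sqrt{\mathcal{L}_{CI}} \tag{26}$$

where $K_1, K_2$ are positive constants, i.e., the conditional independence objective $\mathcal{L}_{CI}$ is incorporated alongside the existing score-matching loss $\mathcal{L}_{\text{score}}$.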

Note that Eq. (25) is the Fisher divergence between the joint $p_\theta(X \mid C)$ and the causal factorization $\frac{p_\theta(X)}{p_\theta(C)} \prod_i \frac{p_\theta(X \mid C_i)\, p_\theta(C_i)}{p_\theta(X)}$ from Eq. (JM relation). From the properties of Fisher divergence (Sánchez-Moreno et al., 2012), $\mathcal{L}_{CI} = 0$ iff $p_\theta(X \mid C) = \frac{p_\theta(X)}{p_\theta(C)} \prod_i^n \frac{p_\theta(X \mid C_i)\, p_\theta(C_i)}{p_\theta(X)}$, which further implies $\prod_i p_\theta(C_i \mid X) = p_{\text{train}}(C \mid X)$.

When $\mathcal{L}_{\text{comp}} = 0$: $p_\theta(X \mid C) = p_{\text{train}}(X \mid C) = p(X \mid C)$, and $\prod_i p_\theta(C_i \mid X) = p_{\text{train}}(C \mid X)$. This implies that the learned marginals obey the causal independence relations from the data-generation process, leading to more accurate marginals.

C PRACTICAL CONSIDERATIONS

To facilitate scalability and numerical stability for optimization, we introduce two approximations to the upper bound of our objective function Eq. (10).

C.1 SCALABILITY OF $\mathcal{L}_{CI}$

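A key computational challenge posed by Eq. (9) is that the number of model evaluations grows linearly with the number of attributes. Eq. (9) is derived from the conditional independence formulation as follows:

$$p_\theta(C \mid X) = \prod_i p_\theta(C_i \mid X). \tag{27}$$

By applying Bayes' theorem to all terms, we obtain

$$\frac{p_\theta(X \mid C)\, p_\theta(C)}{p_\theta(X)} = \prod_i \frac{p_\theta(X \mid C_i)\, p_\theta(C_i)}{p_\theta(X)} \tag{28}$$

Note that this formulation is equal to the causal factorization. From this, by applying the logarithm and differentiating w.r.t. $X$ (the terms $\log p_\theta(C)$ and $\log p_\theta(C_i)$ do not depend on $X$), we derive the score formulation.

$$\nabla_X \log p_\theta(X \mid C) - \nabla_X \log p_\theta(X) = \sum_i \left[ \nabla_X \log p_\theta(X \mid C_i) - \nabla_X \log p_\theta(X) \right] \tag{29}$$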


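The squared L2 norm of the difference between the LHS and RHS of Eq. (29) forms our $L_{\text{CI}}$ objective:

$$L_{\text{CI}} = \Big\|\nabla_X \log p_\theta(X \mid C) - \nabla_X \log p_\theta(X) - \sum_i \big[\nabla_X \log p_\theta(X \mid C_i) - \nabla_X \log p_\theta(X)\big]\Big\|_2^2$$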

Due to the $\sum_i$ in the equation, the number of model evaluations grows linearly with the number of attributes ($n$). This $O(n)$ computational complexity hinders the approach's applicability at scale. To address this, we leverage the result of Hammond & Sun (2006), which shows that conditional independence is equivalent to pairwise independence under large $n$, to reduce the complexity to $O(1)$ in expectation. This allows for a significant improvement in scalability while maintaining computational efficiency. Using this result, we modify Eq. (27) to:

$$p_\theta(C_i, C_j \mid X) = p_\theta(C_i \mid X)\, p_\theta(C_j \mid X) \quad \forall i, j$$

Accordingly, we can simplify the loss function for conditional independence as follows:

$$L_{\text{CI}} = \mathbb{E}_{p(X, C)}\, \mathbb{E}_{j,k} \big\|\nabla_X \big[\log p_\theta(X \mid C_j, C_k) - \log p_\theta(X \mid C_j) - \log p_\theta(X \mid C_k) + \log p_\theta(X)\big]\big\|_2^2. \tag{31}$$

In score-based models, which are typically neural networks, the final objective is given as:

$$L_{\text{CI}} = \mathbb{E}_{p(X, C)}\, \mathbb{E}_{j,k} \big\|s_\theta(X, C_j, C_k) - s_\theta(X, C_j) - s_\theta(X, C_k) + s_\theta(X, \varnothing)\big\|_2^2 \tag{32}$$

where $s_\theta(\cdot) := \nabla_X \log p_\theta(\cdot)$ is the score of the distribution modeled by the neural network. We leverage classifier-free guidance to train the conditional score $s_\theta(X, C_i)$ by setting $C_k = \varnothing$ for all $k \neq i$, and likewise for $s_\theta(X, C_i, C_j)$, we set $C_k = \varnothing$ for all $k \notin \{i, j\}$.
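The pairwise residual inside Eq. (32) can be checked on a case where conditional independence holds by construction. The following is a minimal sketch, not the authors' implementation: it assumes a hypothetical toy model in which $X \in \mathbb{R}^2$, the attributes $C_1, C_2 \in \{-1, +1\}$ are uniform and independent, and $X \mid C \sim \mathcal{N}((C_1, C_2), \sigma^2 I)$, so every score is available in closed form. The names `SIGMA2`, `score`, and `ci_residual_sq` are illustrative.

```python
import math

SIGMA2 = 0.25  # assumed noise variance of the toy model (not from the paper)

def mix_score(x):
    """Score of the 1-D mixture 0.5*N(-1, SIGMA2) + 0.5*N(+1, SIGMA2)."""
    return (math.tanh(x / SIGMA2) - x) / SIGMA2

def gauss_score(x, mu):
    """Score of the 1-D Gaussian N(mu, SIGMA2)."""
    return (mu - x) / SIGMA2

def score(x, c1=None, c2=None):
    """Closed-form s(X, C1, C2); None plays the role of the null token."""
    s1 = gauss_score(x[0], c1) if c1 is not None else mix_score(x[0])
    s2 = gauss_score(x[1], c2) if c2 is not None else mix_score(x[1])
    return (s1, s2)

def ci_residual_sq(x, c1, c2):
    """|| s(X,C1,C2) - s(X,C1) - s(X,C2) + s(X,null) ||_2^2 at a single point."""
    joint, m1 = score(x, c1, c2), score(x, c1, None)
    m2, un = score(x, None, c2), score(x)
    return sum((j - a - b + u) ** 2 for j, a, b, u in zip(joint, m1, m2, un))

residual = ci_residual_sq((0.3, -1.2), c1=+1, c2=-1)
```

Because the two coordinates of this toy model are generated independently, the residual vanishes up to floating-point error; a learned score network would instead be penalized by whatever residual it leaves.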

C.2 SIMPLIFICATION OF THEORETICAL LOSS

In Eq. (10), we showed that the 2-Wasserstein distance between the true joint distribution $p(X \mid C)$ and the causal factorization in terms of the marginals $p(X \mid C_i)$ is upper bounded by the weighted sum of the square roots of $L_{\text{score}}$ and $L_{\text{CI}}$, as $L_{\text{comp}} \le K_1 \sqrt{L_{\text{score}}} + K_2 \sqrt{L_{\text{CI}}}$. In practice, however, we minimized a simple weighted sum of $L_{\text{score}}$ and $L_{\text{CI}}$, given by $L_{\text{final}} = L_{\text{score}} + \lambda L_{\text{CI}}$ as shown in Eq. (11), instead of Eq. (10). We used Eq. (11) to avoid the instability caused by larger gradient magnitudes (due to the square root). Eq. (11) also provided the following practical advantages: (1) the simplicity of the loss function made hyperparameter tuning easier, and (2) the similarity of Eq. (11) to the loss functions of pre-trained diffusion models allowed us to reuse existing hyperparameter settings from these models. We did not observe any significant difference in conclusions between the models trained on Eq. (10) and Eq. (11), as shown in Tabs. 3 and 4. Both approaches significantly outperformed the baselines.
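The gradient-magnitude issue attributed to the square root can be seen from the chain rule. The short derivation below is an illustration added for clarity, not part of the original analysis:

```latex
\nabla_\theta \sqrt{L_{\text{CI}}}
  = \frac{1}{2\sqrt{L_{\text{CI}}}} \, \nabla_\theta L_{\text{CI}}
```

The prefactor $1 / (2\sqrt{L_{\text{CI}}})$ grows without bound as $L_{\text{CI}} \to 0$, whereas the gradient of $\lambda L_{\text{CI}}$ in Eq. (11) carries only the constant factor $\lambda$.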

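| Support | Method | JSD | ∧ (CS) | ¬ Color (CS) | ¬ Digit (CS) |
|---|---|---|---|---|---|
| Uniform | LACE | - | 96.40 | 92.56 | 83.67 |
| Uniform | Composed GLIDE | 0.16 | 98.15 | 99.30 | 81.64 |
| Uniform | Theoretical COIND Eq. (10) | 0.12 | 98.44 | 100.00 | 81.25 |
| Uniform | COIND (λ = 0.2) | 0.14 | 99.73 | 99.32 | 84.94 |
| Uniform | COIND (λ = 1.0) | 0.10 | 99.99 | 99.33 | 89.60 |
| Non-uniform | LACE | - | 82.61 | 65.16 | 69.51 |
| Non-uniform | Composed GLIDE | 0.30 | 86.10 | 81.61 | 70.44 |
| Non-uniform | Theoretical COIND Eq. (10) | 0.17 | 96.88 | 93.75 | 72.66 |
| Non-uniform | COIND (λ = 1.0) | 0.15 | 99.95 | 92.41 | 84.98 |
| Partial | LACE | - | 10.85 | 9.03 | 28.24 |
| Partial | Composed GLIDE | 2.75 | 7.40 | 5.09 | 33.86 |
| Partial | Theoretical COIND Eq. (10) | 1.11 | 23.44 | 64.84 | 53.12 |
| Partial | COIND (λ = 1.0) | 1.17 | 52.38 | 53.28 | 52.59 |

Table 3: Results on Colored MNIST when directly minimizing the upper bound in Eq. (10) (K1 = 1, K2 = 0.1).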


| Support | Method | JSD | ∧ Composition R² | ∧ Composition CS | ¬ Composition R² | ¬ Composition CS |
|---|---|---|---|---|---|---|
| Uniform | LACE | - | 0.97 | 91.19 | 0.85 | 50.00 |
| Uniform | Composed GLIDE | 0.302 | 0.94 | 83.75 | 0.91 | 48.43 |
| Uniform | Theoretical COIND Eq. (10) | 0.270 | 0.98 | 92.19 | 0.92 | 64.06 |
| Uniform | COIND (λ = 1.0) | 0.215 | 0.98 | 95.31 | 0.92 | 55.46 |
| Partial | LACE | - | 0.88 | 62.07 | 0.70 | 30.10 |
| Partial | Composed GLIDE | 0.503 | 0.86 | 51.56 | 0.61 | 34.63 |
| Partial | Theoretical COIND Eq. (10) | 0.450 | 0.93 | 78.13 | 0.88 | 51.56 |
| Partial | COIND (λ = 1.0) | 0.287 | 0.97 | 91.10 | 0.92 | 53.90 |

Table 4: Results on Shapes3D with the objective of directly minimizing the upper bound Eq. (10) (K1 = 1, K2 = 0.1).

C.3 CHOICE OF HYPERPARAMETER λ

Effect of λ on the Learned Conditional Independence. COIND enforces conditional independence between the marginals of the attributes learned by the model by minimizing $L_{\text{CI}}$ defined in Eq. (32). Here, we investigate the effect of $L_{\text{CI}}$ on the effectiveness of logical compositionality by varying its strength through λ in Eq. (11). Figure 8 plots JSD and CS (∧) as functions of λ for models trained on the Colored MNIST dataset under the diagonal partial support setting.

Figure 8: Effect of λ on logical compositionality under diagonal partial support on the Colored MNIST dataset.

When λ = 0, training relies solely on the score-matching loss, resulting in higher conditional dependence between the $C_i \mid X$. As λ increases, CS improves, since ensuring conditional independence between the marginals also encourages more accurate learning of the true marginals. However, when λ takes large values, the model learns truly independent conditional distributions $C_i \mid X$ but effectively ignores the input compositions and generates samples based solely on the prior distribution $p_\theta(X)$. As a result, CS drops.

The value for the hyperparameter λ is chosen such that the gradients from the score-matching objective $L_{\text{score}}$ and the conditional independence objective $L_{\text{CI}}$ are balanced in magnitude. One way to choose λ is by training a vanilla diffusion model and setting $\lambda = \frac{L_{\text{score}}}{L_{\text{CI}}}$. We used two values for λ in our experiments and noticed that they gave similar results, indicating that the approach was stable for various values of λ.
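The ratio heuristic above can be sketched as follows. This is a minimal illustration: the function name `balance_lambda` and the choice to average both losses over logged batches of a vanilla diffusion model are assumptions, since the text only specifies the ratio itself.

```python
def balance_lambda(score_losses, ci_losses):
    """Return lambda = mean(L_score) / mean(L_CI) from logged per-batch losses.

    Both lists are assumed to come from a vanilla diffusion model (lambda = 0)
    on which L_CI is only measured, not optimized.
    """
    mean_score = sum(score_losses) / len(score_losses)
    mean_ci = sum(ci_losses) / len(ci_losses)
    return mean_score / mean_ci

lam = balance_lambda([0.2, 0.4], [0.1, 0.2])
```

With this choice, the terms $L_{\text{score}}$ and $\lambda L_{\text{CI}}$ start at the same scale, which is one simple proxy for the balanced gradient magnitudes described above.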

D EXPERIMENT DETAILS

In this section, we outline the high-level design choices of our approach. We provide full implementation details in our publicly available code and checkpoints at https://github.com/sachit3022/compositional-generation/.

D.1 COIND ALGORITHM

To compute pairwise independence in a scalable fashion, we randomly select two attributes, $i$ and $j$, for a sample in the batch and enforce independence between them. Since the score in Eq. (15) is given by $-\frac{\epsilon_\theta(x_t, t)}{\sqrt{1 - \alpha_t}}$, the final equation for enforcing $L_{\text{CI}}$ is:

$$L_{\text{CI}} = \frac{1}{1 - \alpha_t} \big\|\epsilon_\theta(x_t, t, c^{i}) + \epsilon_\theta(x_t, t, c^{j}) - \epsilon_\theta(x_t, t, c^{i,j}) - \epsilon_\theta(x_t, t, c^{\varnothing})\big\|_2^2$$

We follow Ho et al. (2020) and weight the term by $1 - \alpha_t$. This results in an algorithm for COIND that requires modifying only a few lines of Ho & Salimans (2022), highlighted below.


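Algorithm 1 COIND Training

1: repeat
2:   $(c, x_0) \sim p_{\text{train}}(c, x)$
3:   $c_k \leftarrow \varnothing$ with probability $p_{\text{uncond}}$, $\forall k \in [0, N]$   ▷ Set the element at index $k$, i.e., $c_k$, to $\varnothing$ with probability $p_{\text{uncond}}$
4:   $i \sim \text{Uniform}(\{0, \ldots, N\})$, $j \sim \text{Uniform}(\{0, \ldots, N\} \setminus \{i\})$   ▷ Select two random attribute indices
5:   $t \sim \text{Uniform}(\{1, \ldots, T\})$
6:   $\epsilon \sim \mathcal{N}(0, I)$
7:   $x_t = \sqrt{\alpha_t}\, x_0 + \sqrt{1 - \alpha_t}\, \epsilon$
8:   $c^{i}, c^{j}, c^{i,j}, c^{\varnothing} \leftarrow c$
9:   $c^{i} \leftarrow \{c_k = \varnothing \mid k \neq i\}$, $c^{j} \leftarrow \{c_k = \varnothing \mid k \neq j\}$, $c^{i,j} \leftarrow \{c_k = \varnothing \mid k \notin \{i, j\}\}$, $c^{\varnothing} \leftarrow \varnothing$
10:  $L_{\text{CI}} = \|\epsilon_\theta(x_t, t, c^{i}) + \epsilon_\theta(x_t, t, c^{j}) - \epsilon_\theta(x_t, t, c^{i,j}) - \epsilon_\theta(x_t, t, c^{\varnothing})\|_2^2$
11:  Take gradient descent step on $\nabla_\theta \big[\|\epsilon - \epsilon_\theta(x_t, t, c)\|^2 + \lambda L_{\text{CI}}\big]$
12: until converged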

Practical Implementation. In our experiments, we used $p_{\text{uncond}} = 0.2$. For Shapes3D, enforcing $C_i \perp C_{-i} \mid X$ for all $i$, instead of $C_i \perp C_j \mid X$ for all $i, j$, led to slightly better results.
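The loss computation in lines 3 to 11 of Algorithm 1 can be sketched for a single sample in plain Python. This is a hedged illustration rather than the released implementation: `eps_model` stands for any noise-prediction function, `additive_eps` is a hypothetical toy model whose output is additive in the attributes, `alpha_t` is passed in directly instead of sampling $t$ from a schedule, and the gradient step itself is omitted.

```python
import math
import random

NULL = None  # stands in for the null token used for dropped attributes

def keep_only(c, keep):
    """Copy of c with every attribute whose index is outside `keep` set to NULL."""
    return [ck if k in keep else NULL for k, ck in enumerate(c)]

def coind_losses(eps_model, x0, c, alpha_t, t, rng, p_uncond=0.2):
    """Single-sample version of lines 3-11 of Algorithm 1; returns (L_score, L_CI)."""
    c = [NULL if rng.random() < p_uncond else ck for ck in c]      # line 3
    i, j = rng.sample(range(len(c)), 2)                            # line 4
    eps = [rng.gauss(0.0, 1.0) for _ in x0]                        # line 6
    xt = [math.sqrt(alpha_t) * a + math.sqrt(1.0 - alpha_t) * e
          for a, e in zip(x0, eps)]                                # line 7
    e_i = eps_model(xt, t, keep_only(c, {i}))                      # lines 8-9
    e_j = eps_model(xt, t, keep_only(c, {j}))
    e_ij = eps_model(xt, t, keep_only(c, {i, j}))
    e_null = eps_model(xt, t, keep_only(c, set()))
    l_ci = sum((a + b - ab - n) ** 2
               for a, b, ab, n in zip(e_i, e_j, e_ij, e_null))     # line 10
    pred = eps_model(xt, t, c)
    l_score = sum((e - p) ** 2 for e, p in zip(eps, pred))         # first term of line 11
    return l_score, l_ci

def additive_eps(xt, t, c):
    """Hypothetical toy model: the attribute contributions simply add up."""
    shift = sum(0.1 * ck for ck in c if ck is not NULL)
    return [x + shift for x in xt]

l_score, l_ci = coind_losses(additive_eps, [0.5, -0.5], [1, 2, 3],
                             alpha_t=0.7, t=10, rng=random.Random(0))
```

The training objective for the step would then be `l_score + lam * l_ci` for a chosen weight `lam`. For the additive toy model the four predictions cancel, so `l_ci` is zero up to rounding, which is the behavior the penalty encourages in a learned network.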

D.2 DETAILS OF LOGICAL COMPOSITIONALITY TASK

We designed the following tasks to evaluate two primitive logical compositions: (1) ∧ (AND) composition and (2) ¬ (NOT) composition.

AND Composition. To evaluate the ∧ composition, we apply the ∧ operation over all the attributes to generate the respective image. Consider an image from the Shapes3D dataset (see Fig. 9). The image is generated by some function, $f$, with the input $c = [6, 8, 4, 6, 2, 11]$. This image can be queried using the logical expression $C_1 = 6 \wedge \ldots \wedge C_6 = 11$. We follow Eq. (1) to sample from the above logical composition. To reiterate, for the ∧ composition task on Shapes3D, the sampling equation is given by:

$$\nabla_X \log p_\theta(X \mid C_1 = 6 \wedge \ldots \wedge C_6 = 11) = \nabla_X \log p_\theta(X) + \sum_i \big[\nabla_X \log p_\theta(X \mid C_i) - \nabla_X \log p_\theta(X)\big] \tag{33}$$

Similarly, to evaluate the ∧ composition for the Colored MNIST dataset, we perform the ∧ operation over digit $C_1$ and color $C_2$.
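Eq. (33) translates directly into a function over score vectors. The sketch below is illustrative only: `and_score` is a hypothetical helper, and the scores are represented as plain lists of floats instead of network outputs.

```python
def and_score(cond_scores, uncond_score):
    """Eq. (33): grad log p(X) + sum_i [grad log p(X | C_i) - grad log p(X)]."""
    out = list(uncond_score)
    for s in cond_scores:
        out = [o + (si - u) for o, si, u in zip(out, s, uncond_score)]
    return out

# Two conditional scores and one unconditional score in a 2-D toy setting.
composed = and_score([[1.0, 0.0], [0.0, 2.0]], [0.5, 0.5])
```

Each attribute contributes only its deviation from the unconditional score, so adding more attributes never adds extra copies of $\nabla_X \log p_\theta(X)$.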

Figure 9: Image from Shapes3D with attributes c = [6, 8, 4, 6, 2, 11]

NOT Composition. To evaluate the ¬ composition, the image is queried as an AND on all the attributes except the object attribute, which is queried by its negation. For example, consider the same image from Fig. 9, where the object sphere ($C_5 = 2$) can be expressed as $C_5 = \neg[0 \vee 1 \vee 3]$, because the object class can only take four possible values. Therefore, the same image can be described as $C_1 = 6 \wedge \ldots \wedge C_5 = \neg[0 \vee 1 \vee 3] \wedge \ldots \wedge C_6 = 11$. The only possible generation that meets these criteria is the image displayed in Fig. 9, as expected.

The query for a test image with attributes $C_1, C_2, C_3, C_4, C_5, C_6$ can be written as $C_1 = 6 \wedge C_2 = 8 \wedge C_3 = 4 \wedge C_4 = 6 \wedge C_5 = \neg[0 \vee 1 \vee 3] \wedge C_6 = 11$. Following Eq. (2), the sampling equation is written as follows:

$$\begin{aligned} &\nabla_X \log p_\theta(X \mid C_1 = 6) + \nabla_X \log p_\theta(X \mid C_2 = 8) + \nabla_X \log p_\theta(X \mid C_3 = 4) + \nabla_X \log p_\theta(X \mid C_4 = 6) + \nabla_X \log p_\theta(X \mid C_6 = 11) \\ &\quad - \nabla_X \log p_\theta(X \mid C_5 = 0) - \nabla_X \log p_\theta(X \mid C_5 = 1) - \nabla_X \log p_\theta(X \mid C_5 = 3) - \nabla_X \log p_\theta(X) \end{aligned}$$
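The coefficients in the expanded sampling equation can be verified by bookkeeping. The sketch below assumes that each conjoined attribute contributes $+[\nabla_X \log p_\theta(X \mid C) - \nabla_X \log p_\theta(X)]$ and each negated value contributes the same bracket with a minus sign, on top of one unconditional score; this reading is inferred from the expanded equation above, and `composition_coefficients` is a hypothetical helper.

```python
from collections import Counter

def composition_coefficients(positives, negatives):
    """Coefficient of every score term for AND-ed `positives` and negated `negatives`."""
    coef = Counter({"uncond": 1})
    for term in positives:      # + [s(term) - s(uncond)]
        coef[term] += 1
        coef["uncond"] -= 1
    for term in negatives:      # - [s(term) - s(uncond)]
        coef[term] -= 1
        coef["uncond"] += 1
    return dict(coef)

coef = composition_coefficients(
    positives=["C1=6", "C2=8", "C3=4", "C4=6", "C6=11"],
    negatives=["C5=0", "C5=1", "C5=3"],
)
```

Five positive terms and three negated terms leave a net coefficient of $1 - 5 + 3 = -1$ on the unconditional score, matching the single $-\nabla_X \log p_\theta(X)$ at the end of the equation.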

Similarly, for Colored MNIST, we perform two kinds of negation operations: one on digit and another on color. In Section 2, we have shown negation on color, $4 \wedge \neg[\text{Green} \vee \text{Pink}]$, along with its sampling equation. A similar logic can be followed for negation on digit; an example of negation on digit is $\neg[3 \vee 4] \wedge \text{Pink}$.

For ∧ and ¬, evaluations are strictly restricted to unseen compositions under orthogonal partial support for Shapes3D and under diagonal partial support for Colored MNIST. This approach allows us to explore how effectively the model handles logical operations through unseen image generation. Additionally, we evaluate compositions observed during training with less frequency under non-uniform support.

D.3 TRAINING DETAILS, ARCHITECTURE, AND SAMPLING

Training Composed GLIDE & COIND We train the diffusion model using the DDPM noise scheduler. The model architecture and hyperparameters used for all experiments are detailed in Tab. 5.

Training LACE. The LACE method involves training multiple energy-based models, one for each attribute, and sampling according to logical compositional equations. However, we use score-based models instead. We follow the architecture outlined in Tab. 5 for each attribute to train multiple score-based models. For Colored MNIST, which has two attributes, we create two models, one for each attribute, using the same architecture as other methods, effectively doubling the model size. Similarly, for Shapes3D with six attributes, we develop six models. We reduce the Block Out Channels for each attribute model to fit these into memory while keeping all other hyperparameters consistent. Since we train a single model per attribute, we do not match the joint distribution, preventing us from evaluating it and measuring the JSD.

Sampling To generate samples for a given logical composition, we sample from equations from App. D.2 using DDIM (Song et al., 2021a) with 100 steps.

| Hyperparameter | Colored MNIST: COIND & Composed GLIDE | Colored MNIST: LACE | Shapes3D: COIND & Composed GLIDE | Shapes3D: LACE |
|---|---|---|---|---|
| Optimizer | AdamW | AdamW | AdamW | AdamW |
| Learning Rate | $2.0 \times 10^{-4}$ | $2.0 \times 10^{-4}$ | $2.0 \times 10^{-4}$ | $2.0 \times 10^{-4}$ |
| Num Training Steps | 50000 | 100000 | 100000 | 100000 |
| Train Noise Scheduler | DDPM | DDPM | DDPM | DDPM |
| Train Noise Schedule | Linear | Linear | Linear | Linear |
| Train Noise Steps | 1000 | 1000 | 1000 | 1000 |
| Sampling Noise Schedule | DDIM | DDIM | DDIM | DDIM |
| Sampling Steps | 150 | 150 | 100 | 100 |
| Model | U-Net | U-Net | U-Net | U-Net |
| Layers per block | 2 | 2 | 2 | 2 |
| Beta Schedule | Linear | Linear | Linear | Linear |
| Sample Size | 28x3x3 | 28x3x3 | 64x3x3 | 64x3x3 |
| Block Out Channels | [56,112,168] | [56,112,168] | [56,112,168,224] | [56,112,168] |
| Dropout Rate | 0.1 | 0.1 | 0.1 | 0.1 |
| Attention Head Dimension | 8 | 8 | 8 | 8 |
| Norm Num Groups | 8 | 8 | 8 | 8 |
| Number of Parameters | 8.2M | 8.2M × 2 | 17.2M | 8.2M × 6 |

Table 5: Hyperparameters for Colored MNIST and Shapes3D used by COIND, Composed GLIDE, and LACE

CelebA. To generate CelebA images, we scale the image size to 128 × 128. We use the latent encoder of Stable Diffusion 3 (SD3) to encode the images to a latent space and perform diffusion in the latent space. The architecture is similar to that used for Colored MNIST and Shapes3D, except that the Block Out Channels are scaled as [224, 448, 672, 896]. We use a learning rate of $1.0 \times 10^{-4}$ and train the model for 500,000 steps on one A6000 GPU.

FID Measure. To evaluate both the generation quality and how well the generated samples align with the natural distribution of "smiling male celebrities", we use the FID metric (Seitzer, 2020). Notably, we calculate the FID score specifically on the subset of "smiling male celebrities", as our primary objective is to assess the model's ability to generate these unseen compositions. We generate 10000 samples to evaluate FID.


T2I: Finetuning SDv1.5. We finetune SDv1.5 with the data constructed from CelebA, where the labels are converted to text. For example, a label of (male=1, smiling=1) is converted to "a photo of a smiling male celebrity."

D.4 ANALYTICAL FORMS OF SUPPORT SETTINGS

Below are the analytical expressions for the densities under the various support settings that we considered in the paper. Let $n_i$ be the number of categories for the attribute $C_i$. For the non-uniform and diagonal partial support settings, we assume that $n_i = n_j = n$, $\forall i, j$, $i \neq j$.

Uniform setting: $p(C_i = c_1) = \frac{1}{n_i}$ and $p(C_i = c_1, C_j = c_2) = p(C_i = c_1)\, p(C_j = c_2) = \frac{1}{n_i n_j}$.

Orthogonal support setting:

$$p(C_i = c_1, C_j = c_2) = \begin{cases} \frac{1}{n_i + n_j - 1}, & c_1 = 0 \text{ or } c_2 = 0 \\ 0, & \text{otherwise} \end{cases}$$

Non-uniform setting:

$$p(C_i = c_1, C_j = c_2) = \begin{cases} a, & c_2 \le c_1 \le c_2 + 1 \\ b, & \text{otherwise} \end{cases} \quad \text{where } b < \frac{1}{n^2} < a$$

Diagonal partial support setting:

$$p(C_i = c_1, C_j = c_2) = \begin{cases} \frac{1}{2n - 1}, & c_2 \le c_1 \le c_2 + 1 \\ 0, & \text{otherwise} \end{cases}$$
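The support settings can be tabulated as joint probability tables to confirm that each one is normalized. This is a minimal sketch; the function names are hypothetical, and for the non-uniform setting the value of `b` is solved from normalization given a chosen `a`, which is an assumption since the text leaves both constants free.

```python
def uniform_support(n_i, n_j):
    """Every composition has probability 1 / (n_i * n_j)."""
    return [[1.0 / (n_i * n_j)] * n_j for _ in range(n_i)]

def orthogonal_support(n_i, n_j):
    """Mass only on compositions where c1 == 0 or c2 == 0."""
    p = 1.0 / (n_i + n_j - 1)
    return [[p if (c1 == 0 or c2 == 0) else 0.0 for c2 in range(n_j)]
            for c1 in range(n_i)]

def diagonal_support(n):
    """Mass only on the band c2 <= c1 <= c2 + 1."""
    p = 1.0 / (2 * n - 1)
    return [[p if c2 <= c1 <= c2 + 1 else 0.0 for c2 in range(n)]
            for c1 in range(n)]

def non_uniform_support(n, a):
    """Probability `a` on the band, `b` elsewhere; `b` follows from normalization."""
    band = 2 * n - 1
    b = (1.0 - band * a) / (n * n - band)
    return [[a if c2 <= c1 <= c2 + 1 else b for c2 in range(n)]
            for c1 in range(n)]

def total_mass(pmf):
    """Sum of all entries of a joint probability table."""
    return sum(sum(row) for row in pmf)
```

For example, `total_mass(diagonal_support(10))` sums 19 cells of mass 1/19, since the band contains $n$ diagonal cells and $n - 1$ cells just below the diagonal.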

D.5 DATASETS

Colored MNIST Dataset. In Section 1, we introduced the Colored MNIST dataset. Here, we detail the dataset generation process. We selected 10 visually distinct colors¹, so that the color attribute takes values $C_2 \in [0, 9]$. The dataset is constructed by coloring the grayscale images from MNIST by converting them into three channels and applying one of the ten colors to non-zero grayscale values.

The training data is composed of three types of support:

Uniform Support: A digit and a color are randomly selected to create an image.

Diagonal Partial Support: A digit $d$ is selected, and during training, it is only assigned one of two colors, $C_2 \in \{d, d + 1\}$, except for 9, which only takes one color. This creates a dataset where compositions observed during training are along the diagonal of the $C$ space, meaning each digit is seen only with its corresponding colors.

Non-uniform Support: All compositions are observed, but combining a digit and its corresponding colors occurs with a higher probability (0.5). The remaining color space is distributed evenly among other colors, resulting in approximately a 0.25 probability for each corresponding color and a 0.0625 probability for each remaining color.

Shapes3D. Full support for Shapes3D consists of all samples from the dataset. For orthogonal support, we use the composition split of Shapes3D as described by Schott et al., whose code is publicly available².

CelebA. CelebA consists of 40 attributes, from which we select the "smiling" and "male" attributes. We train generative models on all combinations of these attributes except (smiling=1, male=1), resulting in an orthogonal partial support.

D.6 CONFORMITY SCORE (CS)

In Section 2, we described the Conformity Score (CS) to quantify the accuracy of the generation per the prompt. To measure the CS, we train a single ResNet-18 (He et al., 2016) classifier with multiple classification heads, one corresponding to each attribute, on the full support. This classifier estimates the attributes in the generated image, $x$, and extracts these attributes as $\phi(x) = [\hat{c}_1, \ldots, \hat{c}_n]$. These attributes are matched against the input prompt that generated the image to obtain accuracy.

¹ https://mokole.com/palette.html
² https://github.com/bethgelab/InDomainGeneralizationBenchmark

To explain further, if the prompt is to generate "$4 \wedge \neg[\text{Green} \vee \text{Pink}]$", the generated sample will have a CS of 1 if $\hat{c}_1 = 4$ and $\hat{c}_2 \notin \{\text{Green}, \text{Pink}\}$. We average this across all the prompts in the test set, which determines the CS for a given task.
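The averaging described above can be sketched as follows. This is an illustrative example only: `conformity_score` is a hypothetical helper, each prompt is represented as a predicate over the predicted attributes, and the predictions are made-up values standing in for the classifier outputs $\phi(x)$.

```python
def conformity_score(predictions, prompts):
    """Fraction of samples whose predicted attributes satisfy their prompt."""
    hits = sum(1 for pred, satisfied in zip(predictions, prompts) if satisfied(pred))
    return hits / len(predictions)

def four_not_green_or_pink(pred):
    """Prompt '4 AND NOT [Green OR Pink]' as a predicate over (digit, color)."""
    digit, color = pred
    return digit == 4 and color not in {"Green", "Pink"}

# Hypothetical classifier outputs for four generated samples.
preds = [(4, "Blue"), (4, "Pink"), (7, "Red"), (4, "Red")]
cs = conformity_score(preds, [four_not_green_or_pink] * len(preds))
```

Here two of the four samples satisfy the prompt, so the score is 0.5 (the tables in this paper report CS as a percentage).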

The effectiveness of the classifier in predicting the attributes is reported in Figure 10.

(a) Colored MNIST Dataset

| Feature | Attributes | Possible Values | Accuracy |
|---|---|---|---|
| C1 | Digit | 0-9 | 98.93 |
| C2 | Color | 10 values | 100 |

(b) CelebA Dataset

| Feature | Attributes | Possible Values | Accuracy |
|---|---|---|---|
| C1 | Gender | {0,1} | 98.2 |
| C2 | Smile | {0,1} | 92.1 |

(c) Shapes3D Dataset

| Feature | Attributes | Possible Values | Accuracy |
|---|---|---|---|
| C1 | floor hue | 10 values in [0, 1] | 100 |
| C2 | wall hue | 10 values in [0, 1] | 100 |
| C3 | object hue | 10 values in [0, 1] | 100 |
| C4 | scale | 8 values in [0, 1] | 100 |
| C5 | shape | 4 values in [0-3] | 100 |
| C6 | orientation | 15 values in [-30, 30] | 100 |

Figure 10: Independent attributes, their possible values, and the classifier accuracy in estimating them for different datasets

D.7 COMPUTING JSD

We are interested in understanding the causal structure learned by diffusion models. Specifically, we aim to determine whether the learned model captures the conditional independence between attributes, allowing them to vary independently. This raises the question: Do diffusion models learn the conditional independence between attributes? The conditional independence is defined by:

pθ(Ci, Cj | X) = pθ(Ci | X)pθ(Cj | X) (34)

We aim to measure the violation of this equality using the Jensen-Shannon divergence (JSD) to quantify the divergence between two probability distributions:

JSD = Epdata [DJS (pθ(C | X) || pθ(Ci | X)pθ(Cj | X))] (35)

The joint distribution, pθ(Ci, Cj | X), and the marginal distributions, pθ(Ci | X) and pθ(Cj | X), are evaluated at all possible values that Ci and Cj can take to obtain the probability mass function (pmf). The probability for each value is calculated using Eq. (18) for the joint distribution and Eq. (17) for the marginals.

Practical Implementation. For a diffusion model with multiple attributes, the violation of conditional mutual independence should be calculated using all subset distributions. However, we focus on pairwise independence. We further approximate this in our experiments by computing the JSD between the first two attributes, C1 and C2. We have observed that computing the JSD between any other attribute pair does not change our conclusions.
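For a single input, the quantity inside the expectation of Eq. (35) can be sketched as below. This is an illustration under stated assumptions: the pmfs are passed in as plain lists (in the paper they come from Eqs. (17) and (18)), natural logarithms are used, and the helper names are hypothetical. Averaging `pairwise_jsd` over data samples would approximate the expectation.

```python
import math

def kl(p, q):
    """KL divergence between two pmfs given as equal-length lists (in nats)."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def jsd(p, q):
    """Jensen-Shannon divergence between two pmfs."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def pairwise_jsd(joint, p_i, p_j):
    """JSD between a joint pmf p(Ci, Cj | x) and the product p(Ci | x) p(Cj | x)."""
    flat_joint = [v for row in joint for v in row]
    flat_prod = [a * b for a in p_i for b in p_j]
    return jsd(flat_joint, flat_prod)

# Joint that factorizes exactly versus one that is perfectly correlated.
independent = pairwise_jsd([[0.08, 0.12], [0.32, 0.48]], [0.2, 0.8], [0.4, 0.6])
dependent = pairwise_jsd([[0.5, 0.0], [0.0, 0.5]], [0.5, 0.5], [0.5, 0.5])
```

The factorized joint gives a divergence of zero up to rounding, while the perfectly correlated joint gives a clearly positive value.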

D.8 MEASURING DIVERSITY IN ATTRIBUTES

To achieve explicit control over certain attributes during the generation process, these attributes must vary independently. Therefore, an ideal generative model must be able to produce samples where all except the controlled attributes take diverse values. This diversity can be measured by the entropy of the uncontrolled attributes in the generated samples, where higher entropy suggests greater diversity. Therefore, when the underlying attribute distribution is uniform, accurate generation of the controlled attributes together with diverse uncontrolled attributes indicates that the model has successfully learned that the attributes are independent. In contrast, for non-uniform distributions such as the Gaussian example discussed in App. G.1, a simple diversity argument no longer applies, and minimum KL divergence between the model and the true distribution becomes the appropriate measure. Under a uniform attribute assumption, however, minimizing the KL divergence essentially reduces to maximizing entropy.

For example, consider the generation of colored MNIST digits. In this case, controllability means that the model has learned that digit and color attributes are independent. When prompted to generate a specific digit (controlled attribute), the model should generate this digit in all possible colors (uncontrolled attribute) with equal likelihood, implying maximum entropy for the color attribute and diverse generation. We measure this entropy by generating samples $x_i \sim p_\theta(X \mid c_1 = 4)$ and passing them through a near-perfect classifier to obtain the color predictions $p(\hat{C}_2) = p(\phi_2(x_i))$. The diversity is then quantified as: $H = -\mathbb{E}_{\hat{c}_2 \sim p(\hat{C}_2)}\big[\log_2 p(\hat{c}_2)\big]$
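The diversity measure can be estimated from the empirical distribution of predicted labels. The sketch below is illustrative: `attribute_entropy` is a hypothetical helper, and the label lists stand in for classifier predictions on generated samples.

```python
import math
from collections import Counter

def attribute_entropy(predicted_labels):
    """Empirical entropy (in bits) of the predicted uncontrolled attribute."""
    counts = Counter(predicted_labels)
    total = len(predicted_labels)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

# Ten colors predicted equally often versus a single color every time.
diverse = attribute_entropy(list(range(10)) * 5)
collapsed = attribute_entropy([3] * 20)
```

With ten equally frequent colors the entropy reaches its maximum of $\log_2 10$ bits, whereas a model that always produces the same color scores zero.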

Ensuring diversity through explicit control has applications in bias detection and mitigation in generative models. For example, a biased model may generate images of predominantly male doctors when asked to generate images of "doctors". Ensuring diversity in uncontrolled attributes like gender or race can limit such biases.

E COIND FOR FACE IMAGE GENERATION

In §5, we demonstrated that COIND outperforms baseline methods on the unseen logical compositionality task using synthetic datasets. In App. E.1, we showcase the success of COIND in generating face images from the CelebA dataset (Liu et al., 2015), where COIND demonstrates superior control over attributes compared to the baseline. COIND also allows us to adjust the strength of various attributes and thus provides more fine-grained control over the compositional attributes, as shown in App. E.2. Finally, in App. E.3, we extend COIND to text-to-image (T2I) models widely used in practice to generate face images by providing the desired attributes as logical expressions of text prompts.

Problem Setup. We choose the CelebA dataset to evaluate COIND's ability to generate real-world images. We choose the binary attributes "smiling" and "gender" as the attributes we wish to control. During training, all combinations of these attributes except gender = "male" and smiling = "true" are observed, similar to the orthogonal support shown in Fig. 3. During inference, the model is tasked to generate images with the attribute combination gender = "male" and smiling = "true", which was not observed during training.

Metrics. Similar to the experiments on the synthetic image datasets in §5, we assess the accuracy of the generation w.r.t. the desired input attribute combination using CS (conformity score). We also measure the violation of the learned conditional independence using JSD. In addition to CS, we compute FID (Fréchet inception distance) between the generated images and the real samples in the CelebA dataset where gender = "male" and smiling = "true". A lower FID implies that the distribution of generated samples is closer to the real distribution of the images in the validation dataset.

E.1 COIND CAN SUCCESSFULLY GENERATE REAL-WORLD FACE IMAGES

Tab. 2 shows the quantitative results of COIND and Composed GLIDE trained from scratch in the tasks of joint sampling and ∧ composition. Similar to our observations from previous experiments, COIND achieves better CS in both tasks by learning accurate marginals, as demonstrated by lower JSD. When sampled from the joint likelihood, COIND achieves a nearly 4× improvement in CS over the baseline.

E.2 COIND PROVIDES FINE-GRAINED CONTROL OVER ATTRIBUTES

[Figure 11: grid of generated face images for γ = 0, 1, 2, 6 and methods COIND, Composed GLIDE, LACE. The amount of smile increases as γ increases.]

Figure 11: By adjusting γ, COIND allows us to vary the amount of smile in the generated images. However, Composed GLIDE associates the smile attribute with the gender attribute due to their association in the training data. Hence, the images generated by Composed GLIDE contain gender-specific attributes such as long hair and earrings.

So far, we studied the capabilities of COIND to dictate the presence and absence of attributes in the task of controllable image generation. However, there are applications where we desire fine-grained control over the attributes. Specifically, we may want to control the amount of each attribute in the generated sample. We can mathematically formulate this task by revisiting the formulation of logical expressions of attributes in terms of the score functions of marginal likelihood. As an example, the ∧ operation can be written as,

$$\nabla_X \log p_\theta(X \mid C_1 \wedge C_2) = \nabla_X \log p_\theta(X \mid C_1) + \nabla_X \log p_\theta(X \mid C_2) - \nabla_X \log p_\theta(X)$$

Here, to adjust the amount of attribute added to the generated sample, we can weight the score functions using some scalar γ, as follows,

$$\nabla_X \log p_\theta(X \mid C_1) + \gamma \nabla_X \log p_\theta(X \mid C_2) - \gamma \nabla_X \log p_\theta(X) \tag{36}$$

where γ controls the amount of the $C_2$ attribute.
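Eq. (36) can be sketched as a one-line combination of score vectors. The helper `weighted_and_score` is hypothetical, and the scores are toy lists of floats standing in for network outputs.

```python
def weighted_and_score(s_c1, s_c2, s_uncond, gamma):
    """Eq. (36): s(X | C1) + gamma * [s(X | C2) - s(X)]."""
    return [a + gamma * (b - u) for a, b, u in zip(s_c1, s_c2, s_uncond)]

s_c1, s_c2, s_uncond = [1.0, 0.0], [0.0, 2.0], [0.5, 0.5]
only_c1 = weighted_and_score(s_c1, s_c2, s_uncond, gamma=0.0)
plain_and = weighted_and_score(s_c1, s_c2, s_uncond, gamma=1.0)
stronger = weighted_and_score(s_c1, s_c2, s_uncond, gamma=2.0)
```

Setting γ = 0 recovers sampling conditioned on $C_1$ alone, γ = 1 recovers the ∧ composition, and larger values push further along the direction contributed by $C_2$.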

[Figure 12 panels: (a) Variation of FID with γ; (b) Variation of CS with γ. Methods shown: LACE, Composed GLIDE, COIND.]

Figure 12: Effect of γ on FID and CS: Varying the amount of smile in a generated image through γ does not affect the FID of COIND. However, the smiles in the generated images become more apparent, leading to easier detection by the smile classifier and improved CS.

Fig. 11 shows the effect of increasing γ to adjust the amount of smiling in the generated image. Ideally, we expect increasing γ to increase the amount of smiling without affecting the gender attribute. When γ = 0 (top row), both COIND and Composed GLIDE generate images of men who are not smiling. As γ increases, we notice that the samples generated by COIND show an increase in the amount of smiling, going from a short smile to a wider smile to one where teeth are visible. Note that the training dataset did not include any images of smiling men or fine-grained annotations for the amount of smiling in each image. This conclusion is strengthened by Fig. 12b that shows an increase in CS when γ increases. CS increases when it is easier for the smile classifier to detect the smile. COIND provides this fine-grained control over the smiling attribute without any effect on the realism of the images, as shown by the minimal changes in FID in Fig. 12a.


In contrast, the images generated by Composed GLIDE show an increase in the amount of smiling while adding gender-specific attributes such as long hair and makeup. We conclude that, by strictly enforcing a conditional independence loss between the attributes, COIND provides fine-grained control over the attributes, allowing us to adjust the intensity of the attribute in the image without additional training. As shown in Tab. 2, COIND outperforms the baselines for generating unseen compositions. Tuning γ further improves the generation.

E.3 FINETUNING T2I MODELS WITH COIND IMPROVES LOGICAL COMPOSITIONALITY

[Figure 13 panels: columns are grouped by the prompts "smiling male", "smiling AND male", and "smiling NOT female".]

Figure 13: Samples generated after fine-tuning SDv1.5 on CelebA. The first row shows images generated by SDv1.5 fine-tuned on CelebA (Composed GLIDE), while the second row shows images generated by SDv1.5 fine-tuned with COIND. Columns indicate samples generated from the respective prompts indicated above.

We proposed COIND to improve control over the attributes in an image through logical expressions of these attributes. Since larger pre-trained diffusion models such as Stable Diffusion (Rombach et al., 2022) have become more accessible, we seek to incorporate the benefits of COIND in these models. This section shows that text-to-image (T2I) models can be fine-tuned to generate images using logical expressions of text prompts. Specifically, we use Stable Diffusion v1.5 (SDv1.5) to generate face images from the CelebA dataset where smiling and gender attributes can be controlled. We consider both joint and marginal sampling, similar to our case study in §3. For joint sampling, we provide SDv1.5 with the prompt "Photo of a smiling male celebrity". In the marginal sampling, we provide the values for smiling and gender attributes using separate prompts, "Photo of a smiling celebrity" ∧ "Photo of a male celebrity". Then, we sample from the marginal likelihoods resulting from these prompts following Eq. (1). To evaluate ¬ capabilities, we use the prompts "Photo of a smiling celebrity" ∧ ¬ "Photo of a female celebrity" and follow Eq. (2).

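| Support | Method | JSD | Joint CS | Joint FID | ∧ Composition CS | ∧ Composition FID | ¬ Composition CS | ¬ Composition FID |
|---|---|---|---|---|---|---|---|---|
| Orthogonal | Composed GLIDE | 0.57 | 56.57 | 58.31 | 14.19 | 73.53 | 11.02 | 115.95 |
| Orthogonal | COIND (λ = 1.0) | 0.37 | 58.57 | 58.19 | 49.15 | 61.16 | 18.80 | 86.31 |

Table 6: Results on SDv1.5 fine-tuning. COIND outperforms the baseline on all the metrics.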

1. In Tab. 6, COIND improves performance across all metrics, achieving 3.46× and 2× improvements in CS over Composed GLIDE in the ∧ and ¬ composition tasks, respectively. The images generated by COIND also have better FID than those from the baseline.

2. Visual inspection of the generated samples for the same random seed provides insights into how Composed GLIDE and COIND perceive the prompts. Images in columns 1, 3, and 5 of Fig. 13 were generated with the same random seed. Similarly, those in columns 2 and 4 share the random seed. We note the following observations:


• Both Composed GLIDE and COIND generated images with the desired attributes when sampled from the joint likelihood using "photo of a smiling male celebrity". The images generated by these models from the same random seed were also visually similar. This shows that both models can aptly set attributes in the generated images and have identical stochastic profiles, leading to unspecified attributes that assume similar values.

• When the attributes were passed as the expression "smiling ∧ male", COIND generated images that were visually similar to those with matching random seeds generated from joint sampling. This implies that COIND learned accurate marginals that help it to correctly model the joint likelihood.

• When tasked with generating images for "smiling ∧ male", Composed GLIDE generated images of smiling persons with gender-specific attributes such as thinner eyebrows, commonly seen in photos of female celebrities. These gender-specific features increase when the task is to generate images of "smiling ∧ ¬female". In contrast, COIND generates images of smiling celebrities while adding attributes such as a beard.

Thus, we conclude that COIND offers better control over the desired attributes without affecting correlated attributes.

F DISCUSSION ON COIND

F.1 CONNECTION TO COMPOSITIONAL GENERATION FROM FIRST PRINCIPLES

Compositional generation from first principles. Wiedemer et al. (2024) have shown that restricting the function to a certain compositional form will perform better than a single large model. In this section, we show that, by enforcing conditional independence, we restrict the function to encourage compositionality.

Let $c_1, c_2, \ldots, c_n$ be independent components such that $c_1, c_2, \ldots, c_n \in \mathbb{R}$. Consider an injective function $f : \mathbb{R}^n \to \mathbb{R}^d$ defined by $f(c) = x$. If the components $c$ are conditionally independent given $x$, the cumulative distribution functions $F$ must satisfy the following constraint:

$$F_{C_i, C_j, \ldots, C_n \mid X = x}(c_i, c_j, \ldots, c_n) = \prod_i F_{C_i \mid X = x}(c_i) \tag{37}$$

$$F^{-1}_{C_i, C_j, \ldots, C_n \mid X = x}(x) = \inf\{c_i, c_j, \ldots, c_n \mid F(c_i, c_j, \ldots, c_n) \ge x\},$$

where $F^{-1}_{C_i, C_j, \ldots, C_n \mid X = x}$ is a generalized inverse distribution function.

$$\begin{aligned} f(c_i, c_j, \ldots, c_n) &= \big(f \circ F^{-1}_{C_i, C_j, \ldots, C_n \mid X = x}\big)\Big(\prod_i F_{C_i \mid X = x}(c_i)\Big) \\ &= \big(f \circ F^{-1}_{C_i, C_j, \ldots, C_n \mid X = x} \circ \exp\big)\Big(\sum_i \log F_{C_i \mid X = x}(c_i)\Big) \end{aligned}$$

Therefore, we are restricting $f$ to take a certain functional form. However, it is difficult to show that the data-generating process, $f$, meets the rank condition on the Jacobian for the sufficient support assumption (Wiedemer et al., 2024), which is also the limitation discussed in their approach. Therefore, we cannot provide guarantees. However, this section provides a functional perspective of COIND.

F.2 2D GAUSSIAN: CLOSED-FORM ANALYSIS OF COIND

In this section, we derive closed-form expressions for the score functions underlying our method and demonstrate how COIND leverages conditional independence constraints to generate the true data distribution.

Data Generation Process We consider data generated from two independent attributes, C1 and C2, which are binary variables taking values in { 1, +1}. The observed variable X is defined as:

X = f(C1) + f(C2), (38)

Published as a conference paper at ICLR 2025

[Figure 14 panels: (a) True underlying data distribution; (b) Training data, orthogonal support; (c) Conditional distribution learned by the vanilla diffusion objective; (d) Conditional distribution learned by COIND]

Figure 14: COIND respects the underlying independence conditions, thereby generating the true data distribution (d).

where $f(c) = c + \sigma\epsilon$, $\epsilon \sim \mathcal{N}(0, I)$.

Thus, $f(C_1)$ produces a Gaussian mixture along the x-axis with means at $-1$ and $+1$, and similarly $f(C_2)$ produces a mixture along the y-axis with means at $-1$ and $+1$ (see the blue plots on the axes of Figure 14a). The combination yields a two-dimensional Gaussian mixture (Figure 14a).

Training Setup and Orthogonal Support. For training, we assume orthogonal support, where only the following combinations of $(C_1, C_2)$ are observed: $\{(-1, -1), (-1, +1), (+1, -1)\}$. The model is then tasked with generating samples from the unseen composition $(+1, +1)$. Recall that our assumptions (see Section 2) are satisfied: $C_1$ and $C_2$ independently generate $X$, and all possible values for each attribute are observed at least once during training.
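As a minimal sketch of this setup, the snippet below draws training data under orthogonal support. The noise scale `SIGMA`, the uniform choice among the three observed combinations, and all function names are assumptions made for illustration; the text does not specify them.

```python
import random

SIGMA = 0.3  # assumed noise scale; the text does not state a value

# Orthogonal support: the composition (+1, +1) is never observed in training.
TRAIN_SUPPORT = [(-1, -1), (-1, +1), (+1, -1)]

def sample_x(c1, c2, rng, sigma=SIGMA):
    """Draw X where C1 sets the mean along the x-axis and C2 along the y-axis."""
    return (c1 + sigma * rng.gauss(0.0, 1.0), c2 + sigma * rng.gauss(0.0, 1.0))

def sample_training_set(n, seed=0):
    """Return n pairs (x, (c1, c2)) with labels drawn uniformly from TRAIN_SUPPORT."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        c1, c2 = rng.choice(TRAIN_SUPPORT)
        data.append((sample_x(c1, c2, rng), (c1, c2)))
    return data

train = sample_training_set(1000)
seen_c1 = {c[0] for _, c in train}  # values of C1 that appear in training
seen_c2 = {c[1] for _, c in train}  # values of C2 that appear in training
```

Every value of each attribute appears in `train`, yet the pair (+1, +1) does not, which is the situation the analysis in this section studies.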

Score Function Decomposition. Let $s_{+1,+1}(x)$ denote the score corresponding to $p(x \mid C_1 = +1, C_2 = +1)$, and let $s_{+1,\varnothing}(x)$ denote the score of the marginal $p(x \mid C_1 = +1)$ (with a similar definition for $s_{\varnothing,+1}(x)$). Leveraging Eq. (1), $s_{+1,+1}(x)$ is decomposed as follows:

$$s_{+1,+1}(x) = s_{+1,\varnothing}(x) + s_{\varnothing,+1}(x) - s_{\varnothing,\varnothing}(x), \tag{39}$$

where $s_{\varnothing,\varnothing}(x)$ is the score of the training data distribution, not of the full data distribution.

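For example, $(+1, -1)$ is the only observed combination with $C_1 = +1$, and $(-1, +1)$ is the only one with $C_2 = +1$. The marginal scores $s_{+1,\varnothing}$ and $s_{\varnothing,+1}$ are therefore scores of single Gaussians and can be written in closed form as

$$s_{+1,\varnothing}(x) = \frac{\mu_{+1,-1} - x}{\sigma^2}, \qquad s_{\varnothing,+1}(x) = \frac{\mu_{-1,+1} - x}{\sigma^2}. \tag{40}$$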

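In contrast, $s_{\varnothing,\varnothing}(x)$ is the score of the mixture over the three training components:

$$s_{\varnothing,\varnothing}(x) = \frac{\sum_i \mathcal{N}(x; \mu_i, \sigma^2 I)\, \dfrac{\mu_i - x}{\sigma^2}}{\sum_i \mathcal{N}(x; \mu_i, \sigma^2 I)}. \tag{41}$$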

However, when using Langevin dynamics for sampling (see Eq. (13)), the naive combination in Eq. (39) produces an incorrect conditional distribution (Figure 14c). Specifically, the generated distribution shows a spurious red blob between the $(+1, -1)$ and $(-1, +1)$ modes rather than a proper Gaussian centered at $(+1, +1)$. This shows that diffusion models interpolate between the observed modes instead of following the underlying conditional independence and generalizing to unseen modes.
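To make the failure concrete, the following sketch evaluates the naive composition of Eq. (39) using closed-form Gaussian and mixture scores. The value of `SIGMA`, the equal mixture weights, and all function names are assumptions introduced for illustration, not the paper's implementation.

```python
import math

SIGMA = 0.5  # assumed noise scale
# Means of the three observed training components.
MU = {(-1, -1): (-1.0, -1.0), (-1, +1): (-1.0, 1.0), (+1, -1): (1.0, -1.0)}

def gaussian_score(x, mu, sigma=SIGMA):
    """Score of N(mu, sigma^2 I) at x: (mu - x) / sigma^2."""
    return tuple((m - xi) / sigma**2 for m, xi in zip(mu, x))

def mixture_score(x, mus, sigma=SIGMA):
    """Score of an equally weighted Gaussian mixture (responsibility-weighted)."""
    logw = [-sum((xi - m) ** 2 for xi, m in zip(x, mu)) / (2 * sigma**2) for mu in mus]
    top = max(logw)  # subtract the max for numerical stability
    w = [math.exp(l - top) for l in logw]
    z = sum(w)
    scores = [gaussian_score(x, mu, sigma) for mu in mus]
    return tuple(sum(wi * s[d] for wi, s in zip(w, scores)) / z for d in range(len(x)))

def naive_composed_score(x):
    """Marginal(C1=+1) + marginal(C2=+1) - unconditional, all from training data."""
    s_c1 = gaussian_score(x, MU[(+1, -1)])  # only (+1, -1) has C1 = +1
    s_c2 = gaussian_score(x, MU[(-1, +1)])  # only (-1, +1) has C2 = +1
    s_un = mixture_score(x, list(MU.values()))
    return tuple(a + b - c for a, b, c in zip(s_c1, s_c2, s_un))

at_origin = naive_composed_score((0.0, 0.0))
at_target = naive_composed_score((1.0, 1.0))
```

Under these assumptions, both components of `at_origin` are positive while both components of `at_target` are negative. By symmetry and continuity, the composed field vanishes on the diagonal strictly between the origin and $(+1, +1)$, which is consistent with the spurious mode in Figure 14c.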

Correcting with Conditional Independence Constraints. Instead of modeling $s_{+1,\varnothing}(x)$ directly, COIND learns the joint scores for the three observed combinations:

$$s_{-1,-1}(x), \quad s_{+1,-1}(x), \quad s_{-1,+1}(x).$$


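These are then combined under the assumption of pairwise conditional independence to infer the score for the unseen composition:

$$\begin{aligned}
s_{-1,-1}(x) &= s_{-1,\varnothing}(x) + s_{\varnothing,-1}(x) - s_{\varnothing,\varnothing}(x), \\
s_{+1,-1}(x) &= s_{+1,\varnothing}(x) + s_{\varnothing,-1}(x) - s_{\varnothing,\varnothing}(x), \\
s_{-1,+1}(x) &= s_{-1,\varnothing}(x) + s_{\varnothing,+1}(x) - s_{\varnothing,\varnothing}(x),
\end{aligned} \tag{42}$$

which leads to the following expression for the unseen $(+1, +1)$ composition:

$$\begin{aligned}
s_{+1,+1}(x) &= s_{+1,\varnothing}(x) + s_{\varnothing,+1}(x) - s_{\varnothing,\varnothing}(x) \\
&= s_{+1,-1}(x) + s_{-1,+1}(x) - s_{-1,-1}(x) \\
&= \frac{[\mu_{+1,-1} + \mu_{-1,+1} - \mu_{-1,-1}] - x}{\sigma^2}.
\end{aligned} \tag{43}$$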
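The last equality can be checked numerically. The sketch below assumes each observed joint score is the score of an isotropic Gaussian with a shared, assumed `SIGMA`; the function names are hypothetical.

```python
SIGMA = 0.5  # assumed noise scale shared by all components

def gaussian_score(x, mu, sigma=SIGMA):
    """Score of N(mu, sigma^2 I) at x: (mu - x) / sigma^2."""
    return tuple((m - xi) / sigma**2 for m, xi in zip(mu, x))

def coind_composed_score(x):
    """s_{+1,-1}(x) + s_{-1,+1}(x) - s_{-1,-1}(x) from the observed joint scores."""
    s_a = gaussian_score(x, (1.0, -1.0))
    s_b = gaussian_score(x, (-1.0, 1.0))
    s_c = gaussian_score(x, (-1.0, -1.0))
    return tuple(a + b - c for a, b, c in zip(s_a, s_b, s_c))

def target_score(x):
    """Score of the unseen component, a Gaussian centered at (+1, +1)."""
    return gaussian_score(x, (1.0, 1.0))

points = [(0.0, 0.0), (1.0, 1.0), (-2.0, 0.7), (0.3, -1.5)]
max_gap = max(
    abs(c - t)
    for p in points
    for c, t in zip(coind_composed_score(p), target_score(p))
)
```

Because $\mu_{+1,-1} + \mu_{-1,+1} - \mu_{-1,-1} = (+1, +1)$, the composed score coincides with the score of a Gaussian centered at the unseen mode, so `max_gap` is zero up to floating-point error.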

The derivation above shows that COIND effectively enforces conditional independence constraints to generate the unseen data distribution. This analysis underscores the necessity of incorporating conditional independence constraints into diffusion models to faithfully reproduce the target distribution, particularly when extrapolating to unseen compositions.

F.3 EXTENSION TO GAUSSIAN SOURCE FLOW MODELS

Diffusion models can be viewed as a specific case of flow-based models where (1) the source distribution is Gaussian, and (2) the forward process follows a predetermined noise schedule (Lipman et al., 2024). Can we reformulate COIND in terms of velocity rather than score, thereby generalizing it to accommodate arbitrary source distributions and schedules? When the source distribution is Gaussian, score and velocity are related by an affine transformation, as detailed in Tab. 1 of Lipman et al. (2024):

$$s^t_\theta(x, C_1, C_2) = a_t x + b_t u^t_\theta(x, C_1, C_2) \tag{44}$$

Substituting $s^t_\theta(\cdot)$ into Eq. (32):

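$$\begin{aligned}
\mathcal{L}_{CI} &= \mathbb{E}_{p(X,C),\, t \sim U[0,1]} \mathbb{E}_{j,k} \left\| s^t_\theta(x, C_j, C_k) - s^t_\theta(x, C_j) - s^t_\theta(x, C_k) + s^t_\theta(x) \right\|_2^2 \\
&= \mathbb{E}_{p(X,C),\, t \sim U[0,1]} \mathbb{E}_{j,k}\, b_t^2 \left\| u^t_\theta(x, C_j, C_k) - u^t_\theta(x, C_j) - u^t_\theta(x, C_k) + u^t_\theta(x) \right\|_2^2
\end{aligned}$$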

However, $b_t^2$ only acts as a weighting over the time step $t$, so we can ignore it:

$$\mathcal{L}_{CI} = \mathbb{E}_{p(X,C),\, t \sim U[0,1]} \mathbb{E}_{j,k} \left\| u^t_\theta(x, C_j, C_k) - u^t_\theta(x, C_j) - u^t_\theta(x, C_k) + u^t_\theta(x) \right\|_2^2 \tag{45}$$

Therefore, when the source distribution is Gaussian, the constraint on the score translates directly into the velocity constraint given in Eq. (45) for any noise schedule.
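The cancellation of the $a_t x$ term can be illustrated with a short numerical sketch. The velocity vectors and the coefficients `a_t`, `b_t` below are arbitrary placeholder numbers standing in for network outputs and schedule values; all names are hypothetical.

```python
def ci_residual(f_joint, f_j, f_k, f_uncond):
    """Residual f(x, Cj, Ck) - f(x, Cj) - f(x, Ck) + f(x), elementwise."""
    return [a - b - c + d for a, b, c, d in zip(f_joint, f_j, f_k, f_uncond)]

def score_from_velocity(x, u, a_t, b_t):
    """Affine map of Eq. (44): s = a_t * x + b_t * u."""
    return [a_t * xi + b_t * ui for xi, ui in zip(x, u)]

def sq_norm(v):
    return sum(vi * vi for vi in v)

# Placeholder values (not from the paper).
x = [0.3, -1.2]
a_t, b_t = -1.7, 0.6
u_joint, u_j, u_k, u_uncond = [0.5, 0.1], [0.2, -0.4], [0.1, 0.3], [-0.3, 0.0]

res_u = ci_residual(u_joint, u_j, u_k, u_uncond)
res_s = ci_residual(
    *(score_from_velocity(x, u, a_t, b_t) for u in (u_joint, u_j, u_k, u_uncond))
)
loss_score = sq_norm(res_s)           # integrand of the score-based loss
loss_velocity = b_t**2 * sq_norm(res_u)  # b_t^2 times the velocity residual
```

The two quantities agree because the four $a_t x$ terms enter the residual with signs $+, -, -, +$ and cancel, leaving only the factor $b_t$ on the velocity residual.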

F.4 COMPOSITIONAL VS MONOLITHIC MODELS

Our findings echo prior observations (Du & Kaelbling, 2024) that composite models consisting of separate diffusion models trained on individual factors (e.g., LACE) demonstrate better compositionality under partial support than sampling from factorized distributions learned by monolithic models (e.g., Composed GLIDE). However, we found that monolithic models can be significantly improved by enforcing the conditional independence constraints necessary for enabling logical compositionality. For instance, compared to LACE, COIND achieved a 2.4× better CS on Colored MNIST with diagonal partial support and a 1.4× improvement on Shapes3D with orthogonal partial support.

F.5 LIMITATIONS

This paper considered compositions of a closed set of attributes. As such, COIND requires predefined attributes and access to data labeled with the corresponding attributes. Moreover, COIND must be enforced during training, which requires retraining the model whenever the attribute space changes to include additional values. Instead, state-of-the-art generative models seek to operate without pre-defined attributes or labeled data and generate open-set compositions. Despite the seemingly restricted setting of our work, our findings provide valuable insights into a critical limitation of current generative models, namely their failure to generalize for unseen compositions, by identifying the source of this limitation and proposing an effective solution to mitigate it.


G ADDITIONAL RESULTS AND DISCUSSION ON COIND

G.1 LEARNING UNDER NON-UNIFORM p(Ci)

[Figure 15a: Gaussian support; x-axis: Digit (C1)]

| Method | JSD | ∧ (CS) | ¬ Color (CS) | ¬ Digit (CS) |
|---|---|---|---|---|
| LACE | - | 89.22 | 58.59 | 57.81 |
| Composed GLIDE | 0.27 | 91.74 | 88.91 | 78.39 |
| COIND (λ = 1.0) | 0.16 | 99.61 | 98.51 | 83.03 |

(b) Quantitative results for Gaussian support

[Figure 15c: p(C2 | C1 = 4) vs pθ(C2 | C1 = 4); x-axis: C2]

Figure 15: Results on Gaussian support: When the independent attributes have non-uniform categorical distributions, the joint distribution of attribute combinations is not uniform. Even in this case, COIND learns pθ(Ci | Cj) accurately.

In our experiments, we considered the uniform support setting as an example where the attribute variables are independent of each other in the training data, i.e., $C_1 \perp\!\!\!\perp C_2 \mid X$ during training. However, uniform support is not the only scenario that can arise from independent attribute variables. In this section, we show that COIND can learn accurate marginals irrespective of the distribution of $C_i$.

We designed an experiment using the Colored MNIST images where the attributes C1 and C2 assume values from a non-uniform categorical distribution that resembles a discrete Gaussian distribution. The resulting joint distribution of the attributes, which we refer to as Gaussian support, is illustrated in Fig. 15a. We trained COIND and the baselines on this dataset and evaluated them on the ∧ and ¬ compositionality tasks. Apart from comparing the CS of the baselines and COIND on these tasks, we also evaluate whether COIND accurately learns p(Ci) by comparing the learned pθ(Ci | Cj) against the true p(Ci | Cj). Intuitively, this verifies whether COIND generates images whose uncontrolled attributes match their distribution in the training dataset.

Fig. 15b quantitatively compares COIND against Composed GLIDE on CS in both the ∧ and ¬ compositionality tasks. As in our previous experiments, COIND outperforms Composed GLIDE w.r.t. CS in all tasks. In Fig. 15c, we verify whether COIND has learned pθ(C2 | C1) accurately by comparing it against the true distribution p(C2 | C1). Here, pθ(C2 | C1 = c) = pϕ(C2 | X) pθ(X | C1 = c) is obtained as the histogram density of the attributes that appear in the generated images when C1 = c. We observe that the learned distribution pθ(C2 | C1 = 4) is close to the true distribution, forming a bell shape.
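The histogram-based estimate and its comparison against the true distribution could be computed as in the sketch below. It assumes that attribute labels for generated images have already been predicted by a classifier, and it uses the natural logarithm for the Jensen-Shannon divergence; both choices and all function names are assumptions, since the text does not specify them.

```python
import math
from collections import Counter

def histogram_density(labels, num_values):
    """Empirical distribution of predicted attribute values in {0, ..., num_values - 1}."""
    counts = Counter(labels)
    n = len(labels)
    return [counts.get(v, 0) / n for v in range(num_values)]

def kl(p, q):
    """KL divergence in nats; terms with p_i = 0 contribute zero."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def jsd(p, q):
    """Jensen-Shannon divergence between two discrete distributions."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Toy usage: hypothetical classifier predictions of C2 for images generated with C1 fixed.
predicted_c2 = [0, 0, 1, 2]
learned = histogram_density(predicted_c2, num_values=3)
true_dist = [0.5, 0.25, 0.25]
divergence = jsd(learned, true_dist)
```

A divergence near zero indicates that the uncontrolled attribute follows its training distribution; with the natural logarithm the value is bounded above by $\log 2$.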

G.2 FAILURE EXAMPLES OF COIND

Here, we examine some samples generated by COIND where it failed to include the desired attributes. We show these failure cases from each dataset, i.e., the Colored MNIST, Shapes3d, and CelebA datasets. Samples from the Colored MNIST and Shapes3d datasets are taken from the partial support setting, while those from the CelebA dataset are taken from the orthogonal support setting. Fig. 16a shows some failure samples from the Colored MNIST dataset. The images in the first row contain digits with colors leaking from the nearby seen attribute combination. Those in the second row have wrong attributes due to the approximation in the probabilistic formulation in Eq. (2). Some images, like those in the third row, are unrealistic, although they may contain the desired attributes. We observe similar failures in the Shapes3d samples shown in Fig. 16b, where COIND deviates from the desired compositions (column 1). Some failed samples from the CelebA dataset are shown in Fig. 16c. The samples correspond to the task of smile ∧ male. In the top image, it is hard to distinguish whether the subject is smiling or laughing. In some samples, we observed only a weak or soft smile. This could be because a smile is difficult to control due to its limited spatial presence in an image.

G.3 CONFORMITY SCORE FOR EACH ATTRIBUTE COMBINATION


[Figure 16 panels: (a) Failure samples from Colored MNIST dataset (similar color; wrong attribute compositions; unrealistic); (b) Failure samples from Shapes3d dataset (single attribute mismatch; attribute mismatch composition; unrealistic); (c) Failure samples from CelebA dataset (smile or laugh; hard to quantify)]

Figure 16: Some samples generated by COIND where it could not enforce the desired attributes.

[Figure 17 heatmap of CS (%); axes: Digit (0 to 9) and Color (0 to 9). Cell values as they appear in the figure, row by row:]

100 100  99  90  74  98  98  98  82  42
100 100  70  89  97  93  93  79  48  47
  7 100 100  98  50  16   2   2  59   9
  8  82 100 100  39   4   8   7  89  22
 77 100  99 100 100  99 100  97  80  97
 96  78  94  95 100  99  97  98  49  39
 53  52  90  90  79 100 100  92  84 100
  2  61  58  32  50  89 100  99  21   4
  0  50   0  17   1  54  85 100  99  39
 37  15  92  89  36  27  53  77  99  99

Figure 17: Heatmap showing CS for each attribute combination in the compositionality task in Colored MNIST generation with partial support (row 10 in Fig. 4a)

In all our experiments, we report CS as the primary metric to evaluate if the generative model produced images with accurate attributes. However, CS is the average accuracy across all unseen attribute combinations. Not all attribute combinations may be generated with equal accuracy.

For instance, Fig. 17 shows the CS for each attribute combination in the compositionality task in Colored MNIST image generation with partial support setting (row 10 in Fig. 4a). As a reminder, COIND achieved 52.38% CS on unseen attribute combinations in this task.

We can see that COIND can successfully generate all seen attribute combinations that appear on the diagonal. Some unseen attribute combinations achieve > 90% CS, while others have nearly 0% CS. We do not observe the model struggling to generate images with any specific attribute or digit, although some colors have a generally lower CS than others. For example, colors 2 and 3 have zero CS with more digits than others. On the other hand, colors 4, 5, and 6 have high CS with all digits. We hypothesize that this disparity in CS could depend on the nature of attributes and the similarity between the values they can take.
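A per-combination breakdown like the one in Fig. 17 could be computed as sketched below. This assumes that a generated sample counts as correct only when the predicted attribute tuple matches the requested tuple exactly; that criterion, the toy data, and the function names are assumptions for illustration.

```python
from collections import defaultdict

def per_combination_cs(targets, predictions):
    """Accuracy (in %) of predicted attribute tuples, grouped by requested tuple."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for target, predicted in zip(targets, predictions):
        totals[target] += 1
        hits[target] += int(target == predicted)
    return {combo: 100.0 * hits[combo] / totals[combo] for combo in totals}

def mean_cs(cs_by_combo, combos):
    """Average CS over a chosen set of combinations, e.g. the unseen ones."""
    return sum(cs_by_combo[c] for c in combos) / len(combos)

# Toy usage with hypothetical (digit, color) requests and classifier predictions.
targets = [(4, 2), (4, 2), (7, 5), (7, 5)]
predictions = [(4, 2), (4, 3), (7, 5), (7, 5)]
cs = per_combination_cs(targets, predictions)
average = mean_cs(cs, [(4, 2), (7, 5)])
```

Reporting the dictionary alongside its mean makes visible the case described above, where a moderate average hides combinations with nearly 0% CS.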

G.4 COIND ALSO IMPROVES CONDITIONAL GENERATION

| Support | Configuration | CS |
|---|---|---|
| Uniform | Vanilla | 99.98 |
| Uniform | COIND (λ = 1) | 100 |
| Non-uniform | Vanilla | 99.98 |
| Non-uniform | COIND (λ = 1) | 99.98 |
| Diagonal partial | Vanilla | 33.14 |
| Diagonal partial | COIND (λ = 0.5) | 68.82 |

(a) Colored MNIST

| Support | Configuration | R2 | CS |
|---|---|---|---|
| Uniform | Vanilla | 0.99 | 100 |
| Uniform | COIND (λ = 1) | 0.99 | 100 |
| Orthogonal partial | Vanilla | 0.97 | 95.88 |
| Orthogonal partial | COIND (λ = 1) | 0.99 | 99.57 |

(b) Shapes3D

Table 7: Overall performance metrics for conditional generation


Given an ordered n-tuple of attribute values that was not observed during training, can COIND generate images corresponding to this tuple by sampling from the joint distribution Pθ(X | C)? To answer this question, we train COIND and the baselines on the Colored MNIST and Shapes3d datasets. Tab. 7 shows the results. As expected, the vanilla model, under full support, generates samples corresponding to the joint distribution. However, as demonstrated in § 3, models trained on partial support fail to generate samples for unseen attribute compositions. In addition to improving logical compositionality, explicitly enforcing conditional independence also improves conditional generation, producing better results than vanilla diffusion models on partial support for both the Colored MNIST and Shapes3D datasets.

G.5 COIND CAN INTERPOLATE BETWEEN DISCRETE ATTRIBUTES

[Figure 18 panels: Observed (26°), Interpolated (28°), Observed (30°)]

Figure 18: Although COIND was only trained to generate images with orientations 26° and 30°, it successfully generated a sample with 28° orientation.

In some cases, it may be necessary to have control over continuous-valued attributes such as height or thickness. However, the datasets with continuous annotations may not be available to train such models. Or we may be interested in using a pre-trained model that was trained to generate images with discrete attributes. In such cases, can we generate samples where attributes take arbitrary values that do not belong to the set of training annotations? We show that COIND can interpolate between the discrete values of an attribute on which it was originally trained and thus essentially produce images with continuous-valued attributes.

As mentioned in the main paper, we trained COIND to generate images from the Shapes3d dataset using the labels provided by Kim & Mnih (2018). The labels provided for the orientation attribute were discrete, although orientation itself is continuous.

In Fig. 18, we highlight the images generated by COIND where the subject has orientations 26° and 30°. We interpolate linearly between the observed discrete values and generate the samples shown in the second column of Fig. 18. By carefully observing the variation in the gap between the corner of the cube and the corner of the room, we notice that COIND generated an image where the orientation of the cube is midway between those of 26° and 30°. This demonstrates that COIND offers a promising direction where training on datasets with discrete annotations is sufficient to generate samples with continuous-valued attributes.
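One possible way to realize such a linear interpolation is to blend the conditioning vectors of the two neighboring discrete values. The text does not state which quantity is interpolated, so treating it as a conditioning embedding is an assumption, as are the function names and the toy embeddings below.

```python
def interpolation_weight(value, lo, hi):
    """Relative position of `value` between the two observed values lo and hi."""
    return (value - lo) / (hi - lo)

def lerp(e_lo, e_hi, alpha):
    """Linear interpolation (1 - alpha) * e_lo + alpha * e_hi, elementwise."""
    return [(1 - alpha) * a + alpha * b for a, b in zip(e_lo, e_hi)]

# Toy usage: hypothetical 2-d embeddings for the observed orientations 26 and 30.
emb_26 = [0.0, 2.0]
emb_30 = [4.0, 6.0]
alpha = interpolation_weight(28, 26, 30)
emb_28 = lerp(emb_26, emb_30, alpha)  # conditioning vector for the unseen 28
```

The interpolated vector would then be passed to the model in place of a learned embedding for a discrete value.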