# Mechanisms of Projective Composition of Diffusion Models

Arwen Bradley*, Preetum Nakkiran*, David Berthelot, James Thornton, Joshua M. Susskind
(*Equal contribution. Apple, Cupertino, CA, USA. Correspondence to: Arwen Bradley, Preetum Nakkiran.)

Proceedings of the 42nd International Conference on Machine Learning, Vancouver, Canada. PMLR 267, 2025. Copyright 2025 by the author(s).

## Abstract

We study the theoretical foundations of composition in diffusion models, with a particular focus on out-of-distribution extrapolation and length-generalization. Prior work has shown that composing distributions via linear score combination can achieve promising results, including length-generalization in some cases (Du et al., 2023; Liu et al., 2022). However, our theoretical understanding of how and why such compositions work remains incomplete. In fact, it is not even entirely clear what it means for composition to "work". This paper starts to address these fundamental gaps. We begin by precisely defining one possible desired result of composition, which we call projective composition. Then, we investigate: (1) when linear score combinations provably achieve projective composition, (2) whether reverse-diffusion sampling can generate the desired composition, and (3) the conditions under which composition fails. We connect our theoretical analysis to prior empirical observations where composition has either worked or failed, for reasons that were unclear at the time. Finally, we propose a simple heuristic to help predict the success or failure of new compositions.

## 1. Introduction

The possibility of composing different concepts represented by pretrained models has been of both theoretical and practical interest for some time (Jacobs et al., 1991; Hinton, 2002; Du & Kaelbling, 2024), with diverse applications including image and video synthesis (Du et al., 2023; 2020; Liu et al., 2022; 2021; Nie et al., 2021; Yang et al., 2023a; Wang et al., 2024), planning (Ajay et al., 2024; Janner et al., 2022), constraint satisfaction (Yang et al., 2023c), parameter-efficient training (Hu et al., 2022; Ilharco et al., 2022), and many others (Wu et al., 2024; Su et al., 2024; Urain et al., 2023; Zhang et al., 2025).

Figure 1: Composing diffusion models via score combination. Given two diffusion models, it is sometimes possible to sample in a way that composes content from one model (e.g. your dog) with the style of another model (e.g. oil paintings). We aim to theoretically understand this empirical behavior. Figure generated via score composition with SDXL finetuned on the author's dog; details in Appendix C.

Figure 2: Length-generalization, another capability of composition enabled by our framework. Diffusion models trained to generate a single object conditioned on location (left) can be composed at inference-time to generate images of multiple objects at specified locations (right). Notably, such images are strictly out-of-distribution for the individual models being composed. (Additional samples in Figure 9.)

One central goal in this field is to build novel compositions at inference time using only the outputs of pretrained models (either entirely separate models, or different conditionings of a single model), to create generations that are potentially more complex than any model could produce individually. As a concrete example to keep in mind, suppose we have two diffusion models, one trained on your personal photos of your dog and another trained on a collection of oil paintings, and we want to somehow combine these to generate oil paintings of your dog. Note that in order to achieve this goal, compositions must be able to generate images that are out-of-distribution (OOD) with respect to each of the individual models, since, for example, there was no oil painting of your dog in either model's training set.

Prior empirical work has shown that this ambitious vision is at least partially achievable in practice. However, the theoretical foundations of how and why composition works in practice, as well as its limitations, are still incomplete. The goal of this work is to advance our theoretical understanding of composition: we will take a specific family of methods used for composing diffusion models, and we will analyze conditions under which this method provably generates the correct composition. Specifically, are there sufficient properties of the distributions we are composing that can guarantee that composition will work "correctly"? And what does correctness even mean, formally?

We focus our study on composing diffusion models by linearly combining their scores, a method introduced by Du et al. (2023); Liu et al. (2022) (though many other interesting constructions are possible; see Section 2). Concretely, suppose we have three separate diffusion models, one for the distribution of dog images $p_{\mathrm{dog}}$, another for oil paintings $p_{\mathrm{oil}}$, and another unconditional model for generic images $p_u$. Then, we can use the individual score estimates $\nabla_x \log p(x)$ given by the models to construct a composite score:

$$\nabla_x \log \hat{p}(x) := \nabla_x \log p_{\mathrm{dog}}(x) + \nabla_x \log p_{\mathrm{oil}}(x) - \nabla_x \log p_u(x). \tag{1}$$

This implicitly defines a distribution which we will call a "product composition": $\hat{p}(x) \propto p_{\mathrm{dog}}(x)\,p_{\mathrm{oil}}(x)/p_u(x)$. Finally, we can try to sample from $\hat{p}$ by using these scores with a generic score-based sampler, or even reverse-diffusion.
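To ground this, here is a minimal sketch of what Equation (1) looks like in code, together with a bare-bones reverse-diffusion (Euler-Maruyama) loop that consumes the composed score. It is an illustrative sketch under simplifying assumptions (a shared variance-exploding noise schedule, and placeholder `score_dog`, `score_oil`, `score_uncond` functions standing in for trained models); it is not the exact samplers used in the works cited above.

```python
import torch

# Placeholder score models: each maps (x, sigma) -> estimate of grad_x log p_sigma(x).
# In practice these would be trained diffusion models; here they are Gaussian toys
# so that the sketch is self-contained and runnable.
def make_gaussian_score(mu):
    def score(x, sigma):
        return -(x - mu) / (1.0 + sigma**2)   # score of N(mu, (1 + sigma^2) I)
    return score

dim = 8
score_dog = make_gaussian_score(torch.ones(dim))
score_oil = make_gaussian_score(-torch.ones(dim))
score_uncond = make_gaussian_score(torch.zeros(dim))

def composed_score(x, sigma):
    """Linear score combination of Eq. (1): dog + oil - unconditional."""
    return score_dog(x, sigma) + score_oil(x, sigma) - score_uncond(x, sigma)

def reverse_diffusion(n_steps=100, sigma_max=10.0):
    """Bare-bones Euler-Maruyama discretization of the reverse SDE (VE schedule)."""
    sigmas = torch.linspace(sigma_max, 0.0, n_steps + 1)
    x = sigma_max * torch.randn(dim)
    for i in range(n_steps):
        sigma, sigma_next = sigmas[i], sigmas[i + 1]
        dt = sigma**2 - sigma_next**2          # decrease in noise variance this step
        x = x + dt * composed_score(x, sigma) + torch.sqrt(dt) * torch.randn(dim)
    return x

sample = reverse_diffusion()
print(sample)  # for this toy: roughly centered at mu_dog + mu_oil - mu_uncond = 0
```

In this Gaussian toy the composed score happens to be the exact score of the noised composition at every noise level, so the sampler succeeds; whether that holds for real distributions is precisely the question studied in the rest of the paper.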
This method of composition often achieves good results in practice, yielding e.g. oil paintings of dogs, but it is unclear why it works theoretically. We are particularly interested in the OOD generalization capabilities of this style of composition. By this we mean the compositional method's ability to generate OOD with respect to each of the individual models being composed, which may be possible even if none of the individual models are themselves capable of OOD generation. A specific desideratum is length-generalization, understood as the ability to compose arbitrarily many concepts. For example, consider the CLEVR (Johnson et al., 2017) setting shown in Figure 2. Given conditional models trained on images each containing a single object and conditioned on its location, we want to generate images containing $k > 1$ objects composed in the same scene.

How could such length-generalizing composition be possible? Here is one illustrative toy example: consider the following construction, inspired by but slightly different from Du et al. (2023); Liu et al. (2022). Suppose $p_b$ is a distribution of empty background images, and each $p_i$ a distribution of images with a single object at location $i$, on an otherwise empty background. Assume all locations we wish to compose are non-overlapping. Then, performing reverse-diffusion sampling using the following score composition will work, meaning it will produce images with $k$ objects at appropriate locations:

$$\nabla_x \log p^t_b(x) + \sum_{i=1}^{k} \big(\underbrace{\nabla_x \log p^t_i(x) - \nabla_x \log p^t_b(x)}_{\text{score delta } \delta_i \in \mathbb{R}^n}\big). \tag{2}$$

Above, the notation $p^t_i$ denotes the distribution $p_i$ after time $t$ in the forward diffusion process (see Appendix E). Intuitively this works because during the reverse-diffusion process, the update performed by model $i$ modifies only pixels in the vicinity of location $i$, and otherwise leaves them identical to the background. Thus the different models do not interact, and the sampler acts as if each model individually pastes an object onto an empty background. Formally, sampling works because the score delta vectors $\delta_i$ are mutually orthogonal, and in fact have disjoint supports. Notably, we can sample from this composition with a standard diffusion sampler, in contrast to Du et al. (2023)'s observations that more sophisticated samplers are necessary. This construction would not be guaranteed to work, however, if the background $p_b$ was chosen to be the unconditional distribution $p_u$ (as in Equation 1), a common choice in many prior works (Du et al., 2023; Liu et al., 2022).
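To make the disjoint-support mechanism behind Equation (2) concrete, here is a small numerical sketch of ours (Gaussians standing in for image distributions; none of the names or values below come from the paper's experiments). Each "object" distribution shifts a disjoint block of coordinates of the background, so the score deltas $\delta_i$ have disjoint supports and the composed score simply pastes both shifts onto the background.

```python
import numpy as np

# Toy stand-in for Eq. (2): background and per-"object" Gaussians in R^4.
# p_b = N(0, I); p_1 shifts coordinates {0,1}; p_2 shifts coordinates {2,3}.
mu_b = np.zeros(4)
mu_1 = np.array([2.0, 2.0, 0.0, 0.0])   # "object" at location 1
mu_2 = np.array([0.0, 0.0, -3.0, 1.0])  # "object" at location 2

def score_gaussian(x, mu):
    """Score of N(mu, I): grad_x log p(x) = -(x - mu)."""
    return -(x - mu)

x = np.random.randn(4)
delta_1 = score_gaussian(x, mu_1) - score_gaussian(x, mu_b)  # = mu_1 - mu_b
delta_2 = score_gaussian(x, mu_2) - score_gaussian(x, mu_b)  # = mu_2 - mu_b

# Disjoint supports: delta_1 lives on coords {0,1}, delta_2 on coords {2,3}.
print(delta_1)             # [2. 2. 0. 0.]
print(delta_2)             # [0. 0. -3. 1.]
print(delta_1 @ delta_2)   # 0.0 -> orthogonal, as discussed above

# Composed score of Eq. (2): background score plus the two deltas.
composed = score_gaussian(x, mu_b) + delta_1 + delta_2
# For these Gaussians this equals the score of N(mu_1 + mu_2 - mu_b, I):
# both "objects" are pasted onto the background without interacting.
assert np.allclose(composed, score_gaussian(x, mu_1 + mu_2 - mu_b))
```

For these Gaussians the composition is exact; Section 5 gives conditions under which the same mechanism provably works for general distributions and for the noised distributions $p^t_i$ encountered during reverse diffusion.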
The remainder of this paper is devoted to trying to generalize this example as far as possible, and to understand both its explanatory power and its limitations. It turns out the core mechanism can be generalized surprisingly far, and does not depend on orthogonality as strongly as the above example may suggest. We will encounter some subtle aspects along the way, starting from formal definitions of what it means for composition to succeed: a definition that can capture both composing objects (as in Figure 2) and composing other attributes (such as style + content, in Figure 1).

### 1.1. Contributions and Organization

In this work we introduce a theoretical framework to help understand the empirical success of certain methods of composing diffusion models, with emphasis on understanding how compositions can sometimes length-generalize. We start by discussing the limitations of several prior definitions of composition in Section 3. In Section 4 we offer a formal definition of what we want composition "to do", given precise information about which aspects we want to compose, which we call Projective Composition (Definition 4.1). (Note that there are many other valid notions of composition; we are merely formalizing one particular goal.) Then, we study how projective composition can be achieved. In Section 5 we introduce a formal criterion called Factorized Conditionals (Definition 5.2), a type of independence condition across both distributions and coordinates. We prove that when this criterion holds, projective composition can be achieved by linearly combining scores (as in Equation 2), and can be sampled via standard reverse-diffusion samplers. In Section 6 we show that parts of this result extend much further, applying even in nonlinear feature spaces; but interestingly, even when projective composition is achievable, it may be difficult to sample. We find that in many important cases existing constructions approximately satisfy our conditions, but the theory also helps characterize and explain certain limitations. Finally, in Section 7 we discuss how our results can help explain existing experimental results in the literature where composition worked or failed, for reasons that were unclear at the time. We also suggest a simple practical heuristic to help predict whether new sets of concepts will compose correctly.

## 2. Related Work

Single vs. Multiple Model Composition. First, we distinguish the kind of composition we study in this paper from approaches that rely on a single model but with OOD conditioners; for example, passing OOD text prompts to text-to-image models (Nichol et al., 2021; Podell et al., 2023), or works like Okawa et al. (2024); Park et al. (2024). In contrast, we study compositions which recombine the outputs of multiple separate models at inference time, where each model only sees in-distribution conditionings. Among compositions involving multiple models, many different variants have been explored. Some are inspired by logical operators like AND and OR, which are typically implemented as the product $p_0(x)p_1(x)$ and the sum $p_0(x) + p_1(x)$ (Du et al., 2023; Du & Kaelbling, 2024; Liu et al., 2022). Some composition methods are based on diffusion models, while others use energy / density approximations (Du et al., 2020; 2023; Liu et al., 2021; Thornton et al., 2025; Skreta et al., 2024). In this work, we focus specifically on product-style compositions implemented with diffusion models via a linear combination of scores, as in Du et al. (2023); Liu et al. (2022). Our goal is not to propose a new method of composition but to improve the theoretical understanding of existing methods.

Learning and Generalization. In this work we focus only on mathematical aspects of composition, and we do not consider any learning-theoretic aspects such as inductive bias or sample complexity. Our work is thus complementary to Kamb & Ganguli (2024), which studies how a type of compositional generalization can arise from inductive bias in the learning procedure. Additional related works are discussed in Appendix A.

## 3. Prior Definitions of Composition

In this section we describe why two popular mathematical definitions of composition are insufficient for our purposes: the simple product definition, and the Bayes composition. Specifically, neither of these notions can describe the outcome of the CLEVR length-generalization experiment from Figure 2.
Our observations here will thus motivate us to propose a new definition of composition in the following section. As a running example, we will consider a subset of the CLEVR experiment from Figure 2. Suppose we are trying to compose two distributions $p_1, p_2$ of images each containing a single object in an otherwise empty scene, where the object is in the lower-left corner under $p_1$, and the upper-right corner under $p_2$. We would like the composed distribution $\hat p$ to place objects in at least the lower-left and upper-right, simultaneously.

### 3.1. The Simple Product

The simple product is perhaps the most familiar type of composition: given two distributions $p_1$ and $p_2$ over $\mathbb{R}^n$, the simple product is defined as $\hat p(x) \propto p_1(x)\,p_2(x)$. (The geometric mean $\sqrt{p_1(x)p_2(x)}$ is also often used; our discussion applies equally to this as well.) The simple product can represent some interesting types of composition, but it has a key limitation: the composed distribution can never be truly out-of-distribution w.r.t. $p_1$ or $p_2$, since $\hat p(x) = 0$ whenever $p_1(x) = 0$ or $p_2(x) = 0$. This presents a problem for our CLEVR experiment. Using the simple product definition, we must have $\hat p(x) = 0$ for any image $x$ with two objects, since neither $p_1$ nor $p_2$ was supported on images with two objects. Therefore, the simple product definition cannot represent our desired composition.

### 3.2. The Bayes Composition

Another candidate definition for composition, which we will call the "Bayes composition", was introduced and studied by Du et al. (2023); Liu et al. (2022). The Bayes composition is theoretically justified when the desired composed distribution is formally a conditional distribution of the model's training distribution. However, it is not formally capable of generating truly out-of-distribution samples, as our example below will illustrate.

Let us attempt to apply the Bayes composition methodology to our CLEVR example. We interpret our two distributions $p_1, p_2$ as conditional distributions, conditioned on an object appearing in the lower-left or upper-right, respectively. Thus we write $p(x \mid c_1) \equiv p_1(x)$, where $c_1$ is the event that an object appears in the lower-left of image $x$, and $c_2$ the event that an object appears in the upper-right. Now, since we want both objects simultaneously, we define the composition as $\hat p(x) := p(x \mid c_1, c_2)$. Because the two events $c_1$ and $c_2$ are conditionally independent given $x$ (since they are deterministic functions of $x$), we can compute $\hat p$ in terms of the individual conditionals:

$$\hat p(x) := p(x \mid c_1, c_2) \propto p(x \mid c_1)\, p(x \mid c_2) / p(x). \tag{3}$$

Equivalently, in terms of scores: $\nabla_x \log \hat p(x) := \nabla_x \log p(x \mid c_1) + \nabla_x \log p(x \mid c_2) - \nabla_x \log p(x)$. Line (3) thus serves as our definition of the Bayes composition $\hat p$, in terms of the conditional distributions $p(x \mid c_1)$ and $p(x \mid c_2)$, and the unconditional $p(x)$.

The definition of composition above seems natural: we want both objects to appear simultaneously, so let us simply condition on both these events. However, there is an obvious error in the conclusion: $\hat p(x)$ must be 0 whenever $p(x \mid c_1)$ or $p(x \mid c_2)$ is zero (by Line 3). Since neither conditional distribution has support on images with two objects, the composition $\hat p$ cannot contain images of two objects either.
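The support obstruction is easy to check numerically. Below is a tiny illustrative example of ours (not from the paper): two distributions over four discrete "images", indexed by whether an object is present in the lower-left and/or the upper-right. The product (the numerator of Line 3) assigns zero mass to every state, so it cannot even be normalized.

```python
import numpy as np

# States index (object in lower-left, object in upper-right):
# 0 -> (0,0), 1 -> (1,0), 2 -> (0,1), 3 -> (1,1).
# p1: always exactly one object, in the lower-left.
# p2: always exactly one object, in the upper-right.
p1 = np.array([0.0, 1.0, 0.0, 0.0])
p2 = np.array([0.0, 0.0, 1.0, 0.0])

product = p1 * p2          # simple product (or Bayes numerator)
print(product)             # [0. 0. 0. 0.]
print(product.sum())       # 0.0 -> cannot normalize; no such distribution exists

# The state we actually want, (1,1) with both objects, lies outside the
# support of both p1 and p2, so no product-style composition of these two
# distributions alone can ever place mass there.
```

The definition proposed in Section 4 sidesteps this obstruction by constraining only the marginals of $\hat p$ under suitable projections, rather than forcing $\hat p$ to live inside the supports of $p_1$ and $p_2$.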
Where did this go wrong? The issue is: $p(x \mid c_1, c_2)$ is not well-defined in our case. We intuitively imagine some unconditional distribution $p(x)$ which allows both objects simultaneously, but no such distribution has been defined, or encountered by the models during training. Thus, the definition of $\hat p$ in Line (3) does not actually correspond to our intuitive notion of conditioning on both objects at once. More generally, this example illustrates how the Bayes composition cannot produce truly out-of-distribution samples with respect to the distributions being composed. (Although Du et al. (2023) use the Bayes composition to achieve a kind of length-generalization, our discussion shows that the Bayes justification does not explain the experimental results.)

Figure 3: Attempted composition of 3 objects. (a) Composition succeeds for single-object distributions using an empty background (as in Figure 2). (b) Bayes composition fails for single-object distributions. (c) Bayes composition succeeds for 1-5 object distributions. (Additional samples in Figure 10; further length-generalization explored in Figure 11; quantitative analysis in Table 1.)

Figure 3(b) shows that the Bayes composition does not always work experimentally either: for diffusion models trained in our CLEVR setting of Figure 2, the Bayes composition of three locations typically fails to produce three objects. The difficulties discussed lead us to propose a precise definition of what we actually want composition to do in this case.

Figure 4: Distribution $\hat p$ is a projective composition of $p_1$ and $p_2$ w.r.t. projection functions $(\Pi_1, \Pi_2)$, because $\hat p$ has the same marginals as $p_1$ when both are post-processed by $\Pi_1$, and analogously for $p_2$.

## 4. Our Proposal: Projective Composition

We now present our formal definition of what it means to correctly compose distributions. Our main insight here is that a realistic definition of composition should not purely be a function of the distributions $\{p_1, p_2, \ldots\}$, in the way the simple product $\hat p(x) = p_1(x)p_2(x)$ is purely a function of $p_1, p_2$. We must also somehow specify which aspects of each distribution we care about preserving in the composition. For example, informally, we may want a composition that mimics the style of $p_1$ and the content of $p_2$. Our definition below of projective composition allows us this flexibility.

Roughly speaking, our definition requires specifying a feature extractor $\Pi_i : \mathbb{R}^n \to \mathbb{R}^k$ associated with every distribution $p_i$. These functions can be arbitrary, but we usually imagine them as projections in some feature space (we use the term "projection" informally here, to convey intuition; these functions $\Pi_i$ are not necessarily coordinate projections, although this is an important special case; see Section 5). For example, $\Pi_1(x)$ may be a transform of $x$ which extracts only its style, and $\Pi_2(x)$ a transform which extracts only its content. Then, a projective composition is any distribution $\hat p$ which looks like distribution $p_i$ when both are viewed through $\Pi_i$ (see Figure 4). Formally:

**Definition 4.1** (Projective Composition). Given a collection of distributions $\{p_i\}$ along with associated projection functions $\{\Pi_i : \mathbb{R}^n \to \mathbb{R}^k\}$, we call a distribution $\hat p$ a projective composition if

$$\forall i : \quad \Pi_{i\,\sharp}\, \hat p = \Pi_{i\,\sharp}\, p_i, \tag{4}$$

where $\sharp$ denotes the push-forward of a probability measure. That is, when $\hat p$ is projected by each $\Pi_i$, it yields marginals identical to those of $p_i$.

There are a few aspects of this definition worth emphasizing, which are conceptually different from many prior notions of composition. First, our definition above does not construct a composed distribution; it merely specifies what properties the composition must have. For a given set of $\{(p_i, \Pi_i)\}$, there may be many possible distributions $\hat p$ which are projective compositions; or in other cases, a projective composition may not even exist.
Separately, the definition of projective composition does not posit any sort of true underlying distribution, nor does it require that the distributions $p_i$ are conditionals of an underlying joint distribution. In particular, projective compositions can be truly out of distribution with respect to the $p_i$: $\hat p$ can be supported on samples $x$ where none of the $p_i$ are supported.

Examples. We have already discussed the style+content composition of Figure 1 as an instance of projective composition. Another even simpler example to keep in mind is the following coordinate-projection case. Suppose we take $\Pi_i : \mathbb{R}^n \to \mathbb{R}$ to be the projection onto the $i$-th coordinate. Then, a projective composition of distributions $\{p_i\}$ with these associated functions $\{\Pi_i\}$ means: a distribution where the first coordinate is marginally distributed identically to the first coordinate of $p_1$, the second coordinate is marginally distributed as under $p_2$, and so on. (Note, we do not require any independence between coordinates.) This notion of composition would be meaningful if, for example, we are already working in some disentangled feature space, where the first coordinate controls the style of the image, the second coordinate controls the texture, and so on. The CLEVR length-generalization example from Figure 2 can also be described as a projective composition in almost an identical way, by letting $\Pi_i : \mathbb{R}^n \to \mathbb{R}^k$ be a restriction onto the set of pixels neighboring location $i$. We describe this explicitly later in Section 5.3.

Figure 5: Composing yellow objects with objects of other colors. Yellow objects successfully compose with blue, cyan, and magenta objects, but not with brown, gray, green, or red objects. Per the histograms (left), in RGB colorspace yellow has R, G distributed like the background (gray), while B has a distinct distribution peaked closer to zero. Taking $M_{\text{yellow}} := \{B\}$, Theorem 5.3 predicts that standard diffusion can sample from compositions of yellow with any color where the B channel is distributed like the background: namely blue, cyan, and magenta, per the histograms. (Other colors may theoretically compose per Theorem 6.1, but be difficult to sample.) (Additional samples in Figure 12.)

## 5. Simple Construction of Projective Compositions

It is not clear a priori that projective compositional distributions satisfying Definition 4.1 ever exist, much less that there is any straightforward way to sample from them. To explore this, we first restrict attention to perhaps the simplest setting, where the projection functions $\{\Pi_i\}$ are just coordinate restrictions. This setting is meant to generalize the intuition we had in the CLEVR example of Figure 2, where different objects were composed in disjoint regions of the image. We first define the construction of the composed distribution, and then establish its theoretical properties.

### 5.1. Defining the Construction

Formally, suppose we have a set of distributions $(p_1, p_2, \ldots, p_k)$ that we wish to compose; in our running CLEVR example, each $p_i$ is the distribution of images with a single object at position $i$. Suppose also we have some reference distribution $p_b$, which can be arbitrary, but should be thought of as a common background to the $p_i$'s. Then, one popular way to construct a composed distribution is via the composition operator defined below. (A special case of this construction is used in Du et al. (2023), for example.)

**Definition 5.1** (Composition Operator).
Define the composition operator $C$ acting on an arbitrary set of distributions $(p_b, p_1, p_2, \ldots)$ by

$$C[\vec p\,](x) := C[p_b, p_1, p_2, \ldots](x) := \frac{1}{Z}\, p_b(x) \prod_i \frac{p_i(x)}{p_b(x)}, \tag{5}$$

where $Z$ is the appropriate normalization constant. We name $C[\vec p\,]$ the composed distribution, and the score of $C[\vec p\,]$ the compositional score:

$$\nabla_x \log C[\vec p\,](x) = \nabla_x \log p_b(x) + \sum_i \big(\nabla_x \log p_i(x) - \nabla_x \log p_b(x)\big). \tag{6}$$

Notice that if $p_b$ is taken to be the unconditional distribution, then this is exactly the Bayes composition.

### 5.2. When does the Composition Operator Work?

We can always apply the composition operator to any set of distributions, but when does this actually yield a correct composition (according to Definition 4.1)? One special case is when each distribution $p_i$ is active on a different, non-overlapping set of coordinates. We formalize this property below as Factorized Conditionals (Definition 5.2). The idea is that each distribution $p_i$ must have a particular set of mask coordinates $M_i \subseteq [n]$ which it samples in a characteristic way, while independently sampling all other coordinates from a common background distribution. If a set of distributions $(p_b, p_1, p_2, \ldots)$ has this Factorized Conditional structure, then the composition operator will produce a projective composition (as we will prove below).

**Definition 5.2** (Factorized Conditionals). We say a set of distributions $(p_b, p_1, p_2, \ldots, p_k)$ over $\mathbb{R}^n$ are Factorized Conditionals if there exists a partition of the coordinates $[n]$ into disjoint subsets $M_b, M_1, \ldots, M_k$ such that:

1. $(x|_{M_i},\, x|_{M_i^c})$ are independent under $p_i$.
2. $(x|_{M_b}, x|_{M_1}, x|_{M_2}, \ldots, x|_{M_k})$ are mutually independent under $p_b$.
3. $p_i(x|_{M_i^c}) = p_b(x|_{M_i^c})$.

Equivalently, if we have:

$$p_i(x) = p_i(x|_{M_i})\, p_b(x|_{M_i^c}), \quad \text{and} \quad p_b(x) = p_b(x|_{M_b}) \prod_{i \in [k]} p_b(x|_{M_i}). \tag{7}$$

Equation (7) means that each $p_i$ can be sampled by first sampling $x \sim p_b$, and then overwriting the coordinates of $M_i$ according to some other distribution (which can be specific to distribution $i$). For instance, the experiment of Figure 2 intuitively satisfies this property, since each of the conditional distributions could essentially be sampled by first sampling an empty background image ($p_b$), then pasting a random object in the appropriate location (corresponding to pixels $M_i$).

If a set of distributions obeys this Factorized Conditional structure, then we can prove that the composition operator $C$ yields a correct projective composition, and reverse-diffusion correctly samples from it. Below, let $N_t$ denote the noise operator of the diffusion process at time $t$. (Our results are agnostic to the specific diffusion noise schedule and scaling used.)

**Theorem 5.3** (Correctness of Composition). Suppose a set of distributions $(p_b, p_1, p_2, \ldots, p_k)$ satisfies Definition 5.2, with corresponding masks $\{M_i\}_i$. Consider running the reverse-diffusion SDE using the following compositional scores at each time $t$:

$$s_t(x_t) := \nabla_x \log C[p^t_b, p^t_1, p^t_2, \ldots](x_t), \tag{8}$$

where $p^t_i := N_t[p_i]$ are the noisy distributions. Then, the distribution of the generated sample $x_0$ at time $t = 0$ is:

$$\hat p(x) := p_b(x|_{M_b}) \prod_i p_i(x|_{M_i}). \tag{9}$$

In particular, $\hat p(x|_{M_i}) = p_i(x|_{M_i})$ for all $i$, and so $\hat p$ is a projective composition with respect to the projections $\{\Pi_i(x) := x|_{M_i}\}_i$, per Definition 4.1.

Unpacking this, Line 9 says that the final generated distribution $\hat p(x)$ can be sampled by first sampling the coordinates $M_b$ according to $p_b$ (marginally), then independently sampling coordinates $M_i$ according to $p_i$ (marginally) for each $i$. Similarly, by assumption, $p_i(x)$ can be sampled by first sampling the coordinates $M_i$ in some specific way, and then independently sampling the remaining coordinates according to $p_b$. Therefore Theorem 5.3 says that $\hat p(x)$ samples the coordinates $M_i$ exactly as they would be sampled by $p_i$, for each $i$ we wish to compose.
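As a small sanity check of Equation (9) at $t = 0$, the snippet below (a toy sketch of ours on a discrete state space, not the paper's CLEVR experiments) builds Factorized Conditionals by hand, applies the composition operator of Definition 5.1, and verifies that the result has the predicted product form, with the marginal on each $M_i$ matching $p_i$.

```python
import numpy as np

rng = np.random.default_rng(0)

# Discrete toy: x = (x1, x2), each coordinate taking 5 values.
# Background p_b is a product distribution over the two coordinates.
qb1 = rng.dirichlet(np.ones(5))   # background marginal on coordinate 1
qb2 = rng.dirichlet(np.ones(5))   # background marginal on coordinate 2
p_b = np.outer(qb1, qb2)

# Factorized Conditionals (Definition 5.2) with M_1 = {1}, M_2 = {2}, M_b empty:
# p_1 resamples coordinate 1 from r1 and keeps coordinate 2 ~ background;
# p_2 does the same for coordinate 2.
r1 = rng.dirichlet(np.ones(5))
r2 = rng.dirichlet(np.ones(5))
p_1 = np.outer(r1, qb2)
p_2 = np.outer(qb1, r2)

# Composition operator (Definition 5.1): C[p](x) proportional to p_b(x) * prod_i p_i(x)/p_b(x).
comp = p_b * (p_1 / p_b) * (p_2 / p_b)
comp /= comp.sum()

# Equation (9) predicts the product form r1(x1) * r2(x2) ...
assert np.allclose(comp, np.outer(r1, r2))
# ... and in particular the marginal on each M_i matches p_i's marginal.
assert np.allclose(comp.sum(axis=1), p_1.sum(axis=1))  # coordinate 1 vs p_1
assert np.allclose(comp.sum(axis=0), p_2.sum(axis=0))  # coordinate 2 vs p_2
```

The sampling part of the theorem, i.e. that reverse diffusion driven by the scores (8) actually produces this distribution, rests on the commutation argument in the proof sketch below.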
*Proof (sketch).* Since $\vec p$ satisfies Definition 5.2, we have

$$C[\vec p\,](x) \propto p_b(x) \prod_i \frac{p_i(x)}{p_b(x)} = p_b(x) \prod_i \frac{p_b(x|_{M_i^c})\, p_i(x|_{M_i})}{p_b(x|_{M_i^c})\, p_b(x|_{M_i})} = p_b(x) \prod_i \frac{p_i(x|_{M_i})}{p_b(x|_{M_i})} = p_b(x|_{M_b}) \prod_i p_i(x|_{M_i}) =: \hat p(x).$$

The sampling guarantee follows from the commutativity of composition with the diffusion noising process, i.e. $C[\vec p^{\,t}] = N_t[C[\vec p\,]]$. The complete proof is in Appendix H.

**Remark 5.4.** In fact, Theorem 5.3 still holds under any orthogonal transformation of the variables, because the diffusion noise process commutes with orthogonal transforms. We formalize this as Lemma J.1.

**Remark 5.5.** Compositionality is often thought of in terms of orthogonality between scores. Definition 5.2 implies orthogonality between the score differences that appear in the composed score (6), $\nabla_x \log p^t_i(x_t) - \nabla_x \log p^t_b(x_t)$, but the former condition is strictly stronger (c.f. Appendix F).

**Remark 5.6.** Notice that the composition operator $C$ can be applied to a set of Factorized Conditional distributions without knowing the coordinate partition $\{M_i\}$. That is, we can compose distributions and compute scores without knowing a priori exactly how these distributions are supposed to compose (i.e. which coordinates $p_i$ is active on).

### 5.3. Example: Factorized Conditionals in CLEVR

Let us explicitly describe how our definition of Factorized Conditionals captures the CLEVR setting of Figure 2. Recall, the background distribution $p_b$ over $n$ pixels is images of an empty scene with no objects. For each $i \in \{1, 2, 3, 4\}$, define the set $M_i \subseteq [n]$ as the set of pixel indices surrounding location $i$. ($M_i$ should be thought of as a mask that masks out objects at location $i$.) Let $M_b := (\cup_i M_i)^c$ be the remaining pixels in the image. Then, we claim the distributions $(p_b, p_1, p_2, p_3, p_4)$ form approximately Factorized Conditionals, with corresponding coordinate partition $\{M_i\}$. This is essentially because each distribution $p_i$ matches the background $p_b$ on all pixels except those surrounding location $i$. (The conditions of Definition 5.2 do not exactly hold in the experiment of Figure 2: e.g., there is still some dependence between the masks $M_i$, since objects can cast shadows or even occlude each other. Empirically, these deviations have greater impact when composing many objects, as seen in Figure 11(a).) See Appendix B.2 for more details.

### 5.4. Discussion

Importance of Background. Our derivations highlight the crucial role of the background distribution $p_b$ for the composition operator (Definition 5.1). While prior works have taken $p_b$ to be an unconditional distribution and the $p_i$'s its associated conditionals, our results suggest this may not be the optimal choice: in particular, it may not satisfy a Factorized Conditional structure (Definition 5.2). Figure 3(a,b) shows a CLEVR experiment where we attempt to compose three diffusion models, each trained to generate a single object conditioned on location (as in Figure 2), using two different backgrounds. In (a) we choose an empty background, which (together with the conditionals) approximately satisfies Definition 5.2 and thus yields a projective composition.
In (b), we form a standard Bayes composition by using the unconditional distribution (i.e., a single object in an arbitrary location) as the background, which does not satisfy Definition 5.2 in this case. The former succeeds experimentally while the latter fails.

Bayes composition may be approximately projective. However, Bayes composition can also often succeed. For example, the CLEVR dataset of Du et al. (2023); Liu et al. (2022) contains images with multiple objects and the location of one randomly-chosen object. We replicate this experiment in Figure 3(c) and verify that the Bayes composition succeeds (and may even work better for composing many objects than single-object-empty-background, as in Figure 11). The unconditional together with the conditionals can approximately act as Factorized Conditionals in cluttered settings like this one. The intuition is that if the conditional distributions each contain one specific object plus many independently sampled random objects ("clutter"), then the unconditional distribution almost looks like independently sampled random objects, which together with the conditionals would satisfy Definition 5.2 (further discussion in Appendices B.2 and G). This helps to explain the length-generalization observed in Liu et al. (2022).

## 6. Projective Composition in Feature Space

So far we have focused on the setting where the projection functions $\Pi_i$ are simply projections onto coordinate subsets $M_i$ in the native space (e.g. pixel space). This covers simple examples like Figure 2 but does not include more realistic situations such as Figure 1, where the properties to be composed are more abstract. For example, a property like "oil painting" does not correspond to projection onto a specific subset of pixels in an image. However, we may hope that there exists some conceptual feature space in which "oil painting" does correspond to a particular subset of variables. In this section, we extend our results to the case where the composition occurs in some conceptual feature space, and each distribution to be composed corresponds to some particular subset of features.

Our main result is a featurized analogue of Theorem 5.3: if there exists any invertible transform $A$ mapping into a feature space where Definition 5.2 holds, then the composition operator (Definition 5.1) yields a projective composition in this feature space. See Figure 14 in Appendix I.

**Theorem 6.1** (Feature-space Composition). Given distributions $\vec p := (p_b, p_1, p_2, \ldots, p_k)$, suppose there exists a $C^1$ diffeomorphism $A : \mathbb{R}^n \to \mathbb{R}^n$ (that is, $A$ and $A^{-1}$ should be differentiable) such that $(A_\sharp p_b, A_\sharp p_1, \ldots, A_\sharp p_k)$ satisfy Definition 5.2, with corresponding partition $\{M_i\}$ of $[n]$. Then, the composition $\hat p := C[\vec p\,]$ satisfies:

$$A_\sharp \hat p(z) \;\propto\; (A_\sharp p_b)(z|_{M_b}) \prod_{i=1}^{k} (A_\sharp p_i)(z|_{M_i}). \tag{10}$$

Therefore, $\hat p$ is a projective composition of $\vec p$ w.r.t. projection functions $\{\Pi_i(x) := A(x)|_{M_i}\}$.

This theorem is remarkable because it means we can compose distributions $(p_b, p_1, p_2, \ldots)$ in the base space, and this composition will work correctly in the feature space automatically (Equation 10), without us ever needing to compute or even know the feature transform $A$ explicitly. Theorem 6.1 may a priori seem too strong to be true, since it somehow holds for all feature spaces $A$ simultaneously. The key observation underlying Theorem 6.1 is that the composition operator $C$ behaves well under reparameterization.

**Lemma 6.2** (Reparameterization Equivariance). The composition operator of Definition 5.1 is reparameterization-equivariant. That is, for all diffeomorphisms $A : \mathbb{R}^n \to \mathbb{R}^n$ and all tuples of distributions $\vec p = (p_b, p_1, p_2, \ldots, p_k)$,

$$C[A_\sharp \vec p\,] = A_\sharp C[\vec p\,]. \tag{11}$$
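Equation (11) is easy to verify concretely in the Gaussian case, where both the composition operator and the pushforward under a linear (hence diffeomorphic) map have closed forms. The sketch below is ours and only checks this special case; it is not a substitute for the general proof in Appendix I.

```python
import numpy as np

rng = np.random.default_rng(1)

def compose_gaussians(params):
    """Composition operator C of Definition 5.1, restricted to Gaussians.
    params = [(mu_b, Sig_b), (mu_1, Sig_1), ...]; returns (mu, Sig) of C[p].
    Uses natural parameters: Lam = Sig^-1, eta = Lam @ mu."""
    (mu_b, Sig_b), rest = params[0], params[1:]
    Lam_b = np.linalg.inv(Sig_b)
    eta_b = Lam_b @ mu_b
    Lam, eta = Lam_b.copy(), eta_b.copy()
    for mu_i, Sig_i in rest:
        Lam_i = np.linalg.inv(Sig_i)
        Lam += Lam_i - Lam_b
        eta += Lam_i @ mu_i - eta_b
    Sig = np.linalg.inv(Lam)          # assumes the composition is a proper Gaussian
    return Sig @ eta, Sig

def pushforward(A, mu, Sig):
    """A_# N(mu, Sig) = N(A mu, A Sig A^T) for an invertible linear map A."""
    return A @ mu, A @ Sig @ A.T

n = 3
A = rng.standard_normal((n, n)) + n * np.eye(n)   # a generic invertible linear map
mus = [rng.standard_normal(n) for _ in range(4)]  # background + 3 components
params = [(mu, np.eye(n)) for mu in mus]

# Compose, then push forward ...
mu1, Sig1 = pushforward(A, *compose_gaussians(params))
# ... versus push forward each component, then compose.
mu2, Sig2 = compose_gaussians([pushforward(A, mu, Sig) for mu, Sig in params])

assert np.allclose(mu1, mu2) and np.allclose(Sig1, Sig2)   # Equation (11)
```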
This lemma is potentially of independent interest: equivariance distinguishes the composition operator from many other common operators (e.g. the simple product). Lemma 6.2 and Theorem 6.1 are proved in Appendix I.

### 6.1. Sampling from Compositions

The feature-space Theorem 6.1 is weaker than Theorem 5.3 in one important way: it does not provide a sampling algorithm. That is, Theorem 6.1 guarantees that $\hat p := C[\vec p\,]$ is a projective composition, but does not guarantee that reverse-diffusion is a valid sampling method. Part of this is inherent: in the feature-space setting, the diffusion noise operator $N_t$ no longer commutes with the composition operator $C$, so scores of the noisy composed distribution $N_t[C[\vec p\,]]$ cannot be computed from scores of the noisy base distributions $N_t[\vec p\,]$. Nevertheless, one may hope to sample from the distribution $\hat p$ using other samplers besides diffusion, such as annealed Langevin dynamics, Sequential Monte Carlo (Thornton et al., 2025), or Predictor-Corrector methods (Song et al., 2020).

We find that the situation is surprisingly subtle: composition $C$ produces distributions which are in some cases easy to sample (e.g. with diffusion), yet in other cases apparently hard to sample. For example, in the setting of Figure 5, our Theorem 6.1 implies that all pairs of colors should compose equally well at time $t = 0$, since there exist diffeomorphisms (indeed, linear transforms) between different colors. However, as we saw, the diffusion sampler fails to sample from compositions of non-orthogonal colors, and empirically, even more sophisticated samplers such as Predictor-Correctors also fail in this setting.

At first glance, it may seem odd that composed distributions are so hard to sample, when their constituent distributions are relatively easy to sample. One possible reason, formalized below, is that the composition operator has an extremely poor Lipschitz constant, so it is possible for a set of distributions $\vec p$ to vary smoothly (e.g. diffusing over time) while their composition $C[\vec p\,]$ changes abruptly. We formalize this in Lemma 6.3 (further discussion and proof in Appendix K).

**Lemma 6.3** (Composition Non-Smoothness). For a set of distributions $\{p_b, p_1, p_2, \ldots, p_k\}$ and noise scale $t := \sigma$, write $p^t_i := N_t[p_i]$ for the noisy distributions and $q^t := C[\vec p^{\,t}]$ for the composed distribution at time $t$. Then, for any choice of $\tau > 0$, there exist distributions $\{p_b, p_1, \ldots, p_k\}$ over $\mathbb{R}^n$ such that:

1. For all $i$, the annealing path of $p_i$ is $O(1)$-Lipschitz: $\forall t, t' : \; W_2(p^t_i, p^{t'}_i) \le O(1)\,|t - t'|$.
2. The annealing path of $q$ has Lipschitz constant at least $\Omega(\tau^{-1})$: $\exists t, t' : \; W_2(q^t, q^{t'}) \ge \Omega(\tau^{-1})\,|t - t'|$.

Intuitively, this means that, even if projective composition is possible at $t = 0$, reverse diffusion (or indeed any annealing method) may not be able to correctly sample from it.
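For readers who want to experiment with the alternative samplers mentioned above, here is a minimal sketch of annealed Langevin dynamics driven by a composed score. It is our toy illustration (Gaussian scores stand in for learned models; the step sizes and noise levels are arbitrary choices), not the paper's method; on hard compositions of the kind described by Lemma 6.3, annealing-based samplers like this one can still fail.

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy scores at noise level sigma for N(mu, I): the noised distribution is
# N(mu, (1 + sigma^2) I), so grad_x log p_sigma(x) = -(x - mu) / (1 + sigma^2).
def noisy_gaussian_score(x, mu, sigma):
    return -(x - mu) / (1.0 + sigma**2)

mu_b = np.zeros(2)
mu_components = [np.array([3.0, 0.0]), np.array([0.0, -2.0])]

def composed_score(x, sigma):
    """Score built as in Eq. (6) from the noisy component scores.
    (In general this only approximates the score of the noised composition,
    which is exactly the commutation caveat discussed above.)"""
    s = noisy_gaussian_score(x, mu_b, sigma)
    for mu_i in mu_components:
        s += noisy_gaussian_score(x, mu_i, sigma) - noisy_gaussian_score(x, mu_b, sigma)
    return s

def annealed_langevin(n_steps_per_level=50, step=0.05, sigmas=(3.0, 1.0, 0.3, 0.0)):
    x = rng.standard_normal(2) * sigmas[0]
    for sigma in sigmas:                      # anneal from high noise down to zero
        for _ in range(n_steps_per_level):
            noise = rng.standard_normal(2)
            x = x + step * composed_score(x, sigma) + np.sqrt(2 * step) * noise
    return x

samples = np.stack([annealed_langevin() for _ in range(500)])
print(samples.mean(axis=0))   # for this toy, roughly mu_1 + mu_2 - mu_b = [3, -2]
```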
## 7. Practical implications

We have presented a mathematical theory of composition. Although this theoretical model is a simplification of reality (we do not claim its assumptions hold exactly in practice), we believe the spirit of our results carries over to practical settings, and can help both understand empirical observations from prior work and make predictions about the success or failure of new compositions.

### 7.1. Connections with prior work

Independence Assumptions and Disentangled Features. Our theory relies on a type of independence between distributions, related to orthogonality between scores, which we formalize as Factorized Conditionals. While such conditional structure typically does not exist in pixel space, it is plausible that a factorized structure exists in an appropriate feature space, as permitted by our theory (Section 6). In particular, a feature space and distribution with perfectly disentangled features (Chen et al., 2018; Kim & Mnih, 2018; Yang et al., 2023b; Locatello et al., 2019) would satisfy our assumptions. Conversely, if distributions are not appropriately disentangled, our theory predicts that linear score combinations will fail to compose correctly. This effect is well-known; see Figure 6 for an example; similar failure cases are highlighted in Liu et al. (2022) as well (such as "A bird" failing to compose with "A flower"). Regarding successful cases, style and content compositions consistently work well in practice, and style and content are often taken to be disentangled features (e.g. Karras et al. (2019); Kotovenko et al. (2019); Gatys et al. (2016); Zhu et al. (2017)).

Figure 6: Composing Entangled Concepts. The left image composes the text-conditions "photo of a dog" with "photo of a horse", which both control the subject of the image, and produces unexpected results. In contrast, the right image composes "photo of a dog" with "photo, with red hat", which intuitively correspond to disentangled features. Both samples are from SDXL using score-composition with an unconditional background; details in Appendix C.

Text conditioning with location information. Conditioning on location is a straightforward way to achieve factorized conditionals (provided the objects in different locations are approximately independent), since the required disjointness already holds in pixel space. Many successful text-to-image compositions in Liu et al. (2022) use location information in the prompts, either explicitly (e.g. "A blue bird on a tree" + "A red car behind the tree") or implicitly ("A horse" + "A yellow flower field", since horses are typically in the foreground and fields in the background).

Unconditional Backgrounds. Most prior works on diffusion composition use the Bayes composition, with substantial practical success. As discussed in Section 5.4, Bayes composition may be approximately projective in cluttered settings, helping to explain its practical success in text-to-image settings, where images often contain many different possible objects and concepts.

### 7.2. A practical heuristic

We now discuss a simple heuristic to help predict whether Factorized Conditionals (FC) holds for a particular set of concepts. Although it is not sufficient to guarantee projective composition, it is easy to apply in practice to get an initial hint about whether composition is likely to work. We start with a lemma showing that FC concepts satisfy a simple (necessary, but not sufficient) orthogonality condition. The short proof is in Appendix F.

**Lemma 7.1.** Let $(p_b, p_1, \ldots, p_k)$ be Factorized Conditionals. Let $\mu_i = \mathbb{E}_{p_i}[x]$ for $i = 1, \ldots, k$ and $\mu_b = \mathbb{E}_{p_b}[x]$ denote the mean vectors. Then, for any $i \neq j$:

$$(\mu_i - \mu_b)^\top (\mu_j - \mu_b) = 0. \tag{12}$$

Thus, orthogonality between the mean-difference vectors $\{\mu_i - \mu_b\}$ is a necessary (but not sufficient) condition for Factorized Conditionals. This lemma makes precise the common intuition that some type of approximate concept-space orthogonality is required for successful composition in practice, such as LoRA task arithmetic (Zhang et al., 2023a; Ilharco et al., 2022).

We can apply the lemma to help predict whether a new composition may be successful. Importantly, since we often do not expect FC to hold in pixel space, we need to propose some feature space in which FC might hold, and verify Lemma 7.1 there. We choose the CLIP (Radford et al., 2021) feature space as the proposal since it is simple to use and is known to provide a reasonably disentangled representation (though other feature spaces could also be used). To test Lemma 7.1 in CLIP space, the procedure is: for each concept distribution $p_i$, collect several representative images, compute their CLIP embeddings, and average them to obtain the mean $\mu_i$. Similarly, estimate the background mean $\mu_b$ using arbitrary images (representing the unconditional distribution). Then check whether Equation (12) approximately holds.
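A minimal sketch of this procedure using the Hugging Face CLIP implementation is shown below. It is illustrative only: the checkpoint name, image filenames, and concepts are our placeholder assumptions, and the paper's exact setup is described in Appendix D.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def mean_image_embedding(paths):
    """Average CLIP image embedding over a few representative images."""
    images = [Image.open(p).convert("RGB") for p in paths]
    inputs = processor(images=images, return_tensors="pt")
    with torch.no_grad():
        feats = model.get_image_features(**inputs)
    return feats.mean(dim=0)

def centered_cosine(mu_i, mu_j, mu_b):
    """Cosine similarity between mean-difference vectors (Lemma 7.1 / Eq. 12)."""
    di, dj = mu_i - mu_b, mu_j - mu_b
    return torch.dot(di, dj) / (di.norm() * dj.norm())

# Placeholder file lists standing in for representative images of each concept.
mu_b = mean_image_embedding(["background_01.jpg", "background_02.jpg"])
mu_dog = mean_image_embedding(["dog_01.jpg", "dog_02.jpg"])
mu_oil = mean_image_embedding(["oil_painting_01.jpg", "oil_painting_02.jpg"])
mu_horse = mean_image_embedding(["horse_01.jpg", "horse_02.jpg"])

# Near-zero similarity hints that the pair may compose; large similarity
# (as expected for dog vs. horse) hints that it may not.
print(float(centered_cosine(mu_dog, mu_oil, mu_b)))
print(float(centered_cosine(mu_dog, mu_horse, mu_b)))
```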
CLIP's goal of text-image alignment suggests an even easier heuristic-for-the-heuristic: simply using the text embedding for each concept. That is, we could approximate $\mu_i$ as the CLIP text embedding of a text description of concept $p_i$. Although CLIP has been shown to suffer from a modality gap (text and image embeddings are not perfectly aligned) (Liang et al., 2022), the orthogonality structure we care about, related to angles between centered concepts, may still be fairly well preserved. In Figure 7 we show a preliminary experiment with both the image and text heuristics.

Figure 7: Cosine similarity ($\frac{u^\top v}{\|u\|\,\|v\|}$) between mean-difference vectors $\mu_i - \mu_b$ and $\mu_j - \mu_b$, for each pair of concepts $i, j$, with $\mu_i$ estimated as either the average CLIP image embedding over several images representative of concept $i$ (left) or the CLIP text embedding for a single text description of concept $i$ (right). Details in Appendix D.

Note that the groups of concepts {"dog", "horse", "cat"}, {"watercolor", "oil-painting"}, {"hat", "sunglasses"} have high intra- and low inter-group similarity. This suggests that concepts from different groups may successfully compose (such as "dog" + "hat" or "dog" + "oil-painting"), while concepts from the same group may not (such as "dog" + "horse"), consistent with the examples in Figures 1 and 6.

## 8. Conclusion

In this work, we have developed a theory of one possible mechanism of composition in diffusion models. We study how composition can be defined, and sufficient conditions for it to be achieved. Our theory can help understand a range of diverse compositional phenomena in both synthetic and practical settings, and we hope it will inspire further work on the foundations of composition.

## Acknowledgements

We thank Miguel Angel Bautista Martin, Etai Littwin, Jason Ramapuram, and Luca Zappella for helpful discussions and feedback throughout this work, and Preetum's dog Papaya for his contributions to Figure 1.

## Impact Statement

This paper presents work whose goal is to advance the field of Machine Learning. There are many potential societal consequences of our work, none of which we feel must be specifically highlighted here.

## References

Ajay, A., Han, S., Du, Y., Li, S., Gupta, A., Jaakkola, T., Tenenbaum, J., Kaelbling, L., Srivastava, A., and Agrawal, P. Compositional foundation models for hierarchical planning. Advances in Neural Information Processing Systems, 36, 2024.

Chen, R. T., Li, X., Grosse, R. B., and Duvenaud, D. K. Isolating sources of disentanglement in variational autoencoders. Advances in Neural Information Processing Systems, 31, 2018.
Delon, J. and Desolneux, A. A Wasserstein-type distance in the space of Gaussian mixture models. SIAM Journal on Imaging Sciences, 13(2):936-970, 2020.

Du, Y. and Kaelbling, L. P. Position: Compositional generative modeling: A single model is not all you need. In Forty-first International Conference on Machine Learning, 2024.

Du, Y., Li, S., and Mordatch, I. Compositional visual generation and inference with energy based models. arXiv preprint arXiv:2004.06030, 2020.

Du, Y., Durkan, C., Strudel, R., Tenenbaum, J. B., Dieleman, S., Fergus, R., Sohl-Dickstein, J., Doucet, A., and Grathwohl, W. S. Reduce, reuse, recycle: Compositional generation with energy-based diffusion models and MCMC. In International Conference on Machine Learning, pp. 8489-8510. PMLR, 2023.

Gatys, L. A., Ecker, A. S., and Bethge, M. Image style transfer using convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2414-2423, 2016.

Gregor, K., Danihelka, I., Graves, A., Rezende, D., and Wierstra, D. DRAW: A recurrent neural network for image generation. In International Conference on Machine Learning, pp. 1462-1471. PMLR, 2015.

Hinton, G. E. Training products of experts by minimizing contrastive divergence. Neural Computation, 14(8):1771-1800, 2002.

Ho, J. and Salimans, T. Classifier-free diffusion guidance. arXiv preprint arXiv:2207.12598, 2022.

Ho, J., Jain, A., and Abbeel, P. Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems, 33:6840-6851, 2020.

Hu, E. J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L., Chen, W., et al. LoRA: Low-rank adaptation of large language models. ICLR, 1(2):3, 2022.

Ilharco, G., Ribeiro, M. T., Wortsman, M., Gururangan, S., Schmidt, L., Hajishirzi, H., and Farhadi, A. Editing models with task arithmetic. arXiv preprint arXiv:2212.04089, 2022.

Jacobs, R. A., Jordan, M. I., Nowlan, S. J., and Hinton, G. E. Adaptive mixtures of local experts. Neural Computation, 3(1):79-87, 1991.

Janner, M., Du, Y., Tenenbaum, J. B., and Levine, S. Planning with diffusion for flexible behavior synthesis. arXiv preprint arXiv:2205.09991, 2022.

Johnson, J., Hariharan, B., Van Der Maaten, L., Fei-Fei, L., Lawrence Zitnick, C., and Girshick, R. CLEVR: A diagnostic dataset for compositional language and elementary visual reasoning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2901-2910, 2017.

Kamb, M. and Ganguli, S. An analytic theory of creativity in convolutional diffusion models. arXiv preprint arXiv:2412.20292, 2024.

Karras, T., Laine, S., and Aila, T. A style-based generator architecture for generative adversarial networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 4401-4410, 2019.

Karras, T., Aittala, M., Lehtinen, J., Hellsten, J., Aila, T., and Laine, S. Analyzing and improving the training dynamics of diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 24174-24184, 2024.

Kim, H. and Mnih, A. Disentangling by factorising. In International Conference on Machine Learning, pp. 2649-2658. PMLR, 2018.

Kotovenko, D., Sanakoyeu, A., Lang, S., and Ommer, B. Content and style disentanglement for artistic style transfer. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4422-4431, 2019.
Liang, W., Zhang, Y., Kwon, Y., Yeung, S., and Zou, J. Mind the gap: Understanding the modality gap in multi-modal contrastive representation learning. In NeurIPS, 2022. URL https://openreview.net/forum?id=S7Evzt9uit3.

Liu, N., Li, S., Du, Y., Tenenbaum, J., and Torralba, A. Learning to compose visual relations. Advances in Neural Information Processing Systems, 34:23166-23178, 2021.

Liu, N., Li, S., Du, Y., Torralba, A., and Tenenbaum, J. B. Compositional visual generation with composable diffusion models. In European Conference on Computer Vision, pp. 423-439. Springer, 2022.

Liu, Y., Zhang, Y., Jaakkola, T., and Chang, S. Correcting diffusion generation through resampling. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 8713-8723, 2024.

Locatello, F., Bauer, S., Lucic, M., Raetsch, G., Gelly, S., Schölkopf, B., and Bachem, O. Challenging common assumptions in the unsupervised learning of disentangled representations. In International Conference on Machine Learning, pp. 4114-4124. PMLR, 2019.

McAllister, D., Tancik, M., Song, J., and Kanazawa, A. Decentralized diffusion models. arXiv preprint arXiv:2501.05450, 2025.

Nichol, A., Dhariwal, P., Ramesh, A., Shyam, P., Mishkin, P., McGrew, B., Sutskever, I., and Chen, M. GLIDE: Towards photorealistic image generation and editing with text-guided diffusion models. arXiv preprint arXiv:2112.10741, 2021.

Nie, W., Vahdat, A., and Anandkumar, A. Controllable and compositional generation with latent-space energy-based models. Advances in Neural Information Processing Systems, 34:13497-13510, 2021.

Okawa, M., Lubana, E. S., Dick, R., and Tanaka, H. Compositional abilities emerge multiplicatively: Exploring diffusion models on a synthetic task. Advances in Neural Information Processing Systems, 36, 2024.

Parisi, G. Correlation functions and computer simulations. Nuclear Physics B, 180(3):378-384, 1981.

Park, C. F., Okawa, M., Lee, A., Lubana, E. S., and Tanaka, H. Emergence of hidden capabilities: Exploring learning dynamics in concept space. Advances in Neural Information Processing Systems, 37:84698-84729, 2024.

Podell, D., English, Z., Lacey, K., Blattmann, A., Dockhorn, T., Müller, J., Penna, J., and Rombach, R. SDXL: Improving latent diffusion models for high-resolution image synthesis. arXiv preprint arXiv:2307.01952, 2023.

Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pp. 8748-8763. PMLR, 2021.

Robert, C. P., Casella, G., and Casella, G. Monte Carlo Statistical Methods, volume 2. Springer, 1999.

Roberts, G. O. and Tweedie, R. L. Exponential convergence of Langevin distributions and their discrete approximations, 1996.

Rossky, P. J., Doll, J. D., and Friedman, H. L. Brownian dynamics as smart Monte Carlo simulation. The Journal of Chemical Physics, 69(10):4628-4633, 1978.

Skreta, M., Atanackovic, L., Bose, A. J., Tong, A., and Neklyudov, K. The superposition of diffusion models using the Itô density estimator. arXiv preprint arXiv:2412.17762, 2024.

Sohl-Dickstein, J., Weiss, E., Maheswaranathan, N., and Ganguli, S. Deep unsupervised learning using nonequilibrium thermodynamics. In International Conference on Machine Learning, pp. 2256-2265. PMLR, 2015.

Song, J., Meng, C., and Ermon, S. Denoising diffusion implicit models. In International Conference on Learning Representations, 2021. URL https://openreview.net/forum?id=St1giarCHLP.
Song, Y. and Ermon, S. Generative modeling by estimating gradients of the data distribution. Advances in Neural Information Processing Systems, 32, 2019.

Song, Y., Sohl-Dickstein, J., Kingma, D. P., Kumar, A., Ermon, S., and Poole, B. Score-based generative modeling through stochastic differential equations. arXiv preprint arXiv:2011.13456, 2020. URL https://arxiv.org/pdf/2011.13456.pdf.

Stracke, N., Baumann, S. A., Susskind, J., Bautista, M. A., and Ommer, B. Ctrloralter: Conditional LoRAdapter for efficient 0-shot control and altering of T2I models. In European Conference on Computer Vision, pp. 87-103. Springer, 2024.

Su, J., Liu, N., Wang, Y., Tenenbaum, J. B., and Du, Y. Compositional image decomposition with diffusion models. arXiv preprint arXiv:2406.19298, 2024.

Thornton, J., Béthune, L., Zhang, R., Bradley, A., Nakkiran, P., and Zhai, S. Controlled generation with distilled diffusion energy models and sequential Monte Carlo. In The 28th International Conference on Artificial Intelligence and Statistics, 2025.

Urain, J., Li, A., Liu, P., D'Eramo, C., and Peters, J. Composable energy policies for reactive motion generation and reinforcement learning. The International Journal of Robotics Research, 42(10):827-858, 2023.

Wang, Y., Liu, L., and Dauwels, J. Slot-VAE: Object-centric scene generation with slot attention. In International Conference on Machine Learning, pp. 36020-36035. PMLR, 2023.

Wang, Z., Gui, L., Negrea, J., and Veitch, V. Concept algebra for (score-based) text-controlled generative models. Advances in Neural Information Processing Systems, 36, 2024.

Wiedemer, T., Mayilvahanan, P., Bethge, M., and Brendel, W. Compositional generalization from first principles. Advances in Neural Information Processing Systems, 36, 2024.

Wu, T., Maruyama, T., Wei, L., Zhang, T., Du, Y., Iaccarino, G., and Leskovec, J. Compositional generative inverse design. arXiv preprint arXiv:2401.13171, 2024.

Yang, M., Du, Y., Dai, B., Schuurmans, D., Tenenbaum, J. B., and Abbeel, P. Probabilistic adaptation of text-to-video models. arXiv preprint arXiv:2306.01872, 2023a.

Yang, T., Wang, Y., Lv, Y., and Zheng, N. DisDiff: Unsupervised disentanglement of diffusion probabilistic models. arXiv preprint arXiv:2301.13721, 2023b.

Yang, Z., Mao, J., Du, Y., Wu, J., Tenenbaum, J. B., Lozano-Pérez, T., and Kaelbling, L. P. Compositional diffusion-based continuous constraint solvers. arXiv preprint arXiv:2309.00966, 2023c.

Zhang, J., Liu, J., He, J., et al. Composing parameter-efficient modules with arithmetic operation. Advances in Neural Information Processing Systems, 36:12589-12610, 2023a.

Zhang, L., Rao, A., and Agrawala, M. Adding conditional control to text-to-image diffusion models. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 3836-3847, 2023b.

Zhang, L., Rao, A., and Agrawala, M. Scaling in-the-wild training for diffusion-based illumination harmonization and editing by imposing consistent light transport. In The Thirteenth International Conference on Learning Representations, 2025.

Zhu, J.-Y., Park, T., Isola, P., and Efros, A. A. Unpaired image-to-image translation using cycle-consistent adversarial networks. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2223-2232, 2017.
## A. Additional Related Works

Structured compositional generative models. Structured generative models leverage architectural inductive biases in an encoder-decoder framework, such as recurrent attention mechanisms (Gregor et al., 2015) or slot attention (Wang et al., 2023). These models decompose scenes into background and parts-based representations in an unsupervised manner guided by modeling priors. While these approaches can flexibly generate scenes with single or multiple objects, they are not explicitly controllable, and they require specific model pre-training on datasets containing compositions of interest.

Controllable generation. Composition at inference time is one potential mechanism for exerting control over the generation process. Another way to modify compositions of style and/or content attributes is through spatially conditioning a pre-trained diffusion model on a structural attribute (e.g., pose or depth) as in Zhang et al. (2023b), or on multiple attributes of style and/or content as in Stracke et al. (2024). Another option is control through resampling, as in Liu et al. (2024). These methods are complementary to the single- or multiple-model conditioning mechanisms based on score composition that we study in the current work.

Single model conditioning. We distinguish the kind of composition we study in this paper from approaches that rely on a single model but use OOD conditioners to achieve novel combinations of concepts never seen together during training; for example, passing OOD text prompts to text-to-image models (Nichol et al., 2021; Podell et al., 2023), or works like Okawa et al. (2024); Park et al. (2024) where a single model conditions simultaneously on multiple attributes like shape and color, with some combinations held out during training. In contrast, the compositions we study recombine the outputs of multiple separate models at inference time. Though less powerful, this can still be surprisingly effective, and it is more amenable to theoretical study since it disentangles the potential role of conditional embeddings.

Multiple model composition. Among compositions involving multiple separate models, many different variants have been explored with different goals and applications. Some definitions of composition are inspired by logical operators like AND and OR, usually taken to mean that the composed distribution should have high probability under all of the conditional distributions to be composed, or under at least one of them, respectively. Given two conditional probabilities $p_0(x), p_1(x)$, AND is typically implemented as the product $p_0(x)p_1(x)$ and OR as the sum $p_0(x) + p_1(x)$ (though these only loosely correspond to the logical operators, and other implementations are also possible). Some composition methods are based on diffusion models and use the learned scores (mainly for product compositions); others use energy-based models, which allow for OR-inspired sum compositions as well as more sophisticated samplers, in particular sampling at $t = 0$ (Du et al., 2020; 2023; Liu et al., 2021); and still others work directly with the densities (Skreta et al., 2024), enabling an even greater variety of compositions, including a different style of AND, taken to mean $p_0(x) = p_1(x)$. McAllister et al. (2025) explore another type of OR composition. Wiedemer et al. (2024) take a different approach, adding up in pixel space the final rendered images generated by separate diffusion models, as part of a study on the generalization of data-generating processes.
Task-arithmetic (Zhang et al., 2023a; Ilharco et al., 2022), often using LoRAs (Hu et al., 2022), is a kind of composition in weight space that has had significant practical impact.

Product compositions. In this work, we focus specifically on product compositions (broadly defined to allow for a background distribution, i.e. compositions of the form $\hat{p}(x) = p_b(x) \prod_i \frac{p_i(x)}{p_b(x)}$) implemented with diffusion models, which allows the composition to be implemented via a linear combination of scores as in Du et al. (2023); Liu et al. (2022). Our goal is not to propose a wholly new method of composition but rather to improve theoretical understanding of existing methods.

Learning and Generalization. Recently, Kamb & Ganguli (2024) demonstrated how a type of compositional generalization arises from inductive bias in the learning procedure (equivariance and locality). Their findings are relevant to our broader motivation, but complementary to the focus of this work. Specifically, we focus only on mathematical aspects of defining and sampling from compositional distributions, and we do not consider any learning-theoretic aspects such as inductive bias or sample complexity. This allows us to study the behavior of compositional sampling methods even assuming perfect knowledge of the underlying distributions.

B. CLEVR Experimental Details

All of our CLEVR experiments use raw conditional diffusion scores, without applying any guidance/CFG (Ho & Salimans, 2022). Details below.

B.1. Dataset, models, and training details

B.1.1. CLEVR DATASET

We used the CLEVR (Johnson et al., 2017) dataset generation procedure (https://github.com/facebookresearch/clevr-dataset-gen) to generate datasets customized to the needs of the present work. All default objects, shapes, sizes, and colors were kept unchanged. Images were generated at their original resolution of 320x240 and down-sampled to 128x128 to facilitate experimentation and reduce GPU requirements. The datasets we generated from this procedure include:
- A background dataset (0 objects) with 50,000 samples.
- A single-object dataset with 1,550,000 samples.
- A dataset having 1 to 5 objects, with 500,000 samples per object count, for a total of 2,500,000 samples.

Our experiments cover two different conditioning setups. In Figures 2, 9, 11, we condition on the 2D location of the object (or the location of one randomly-chosen object, for multi-object distributions). In Figures 5, 12, we condition on the color of the object. In all experiments we condition on only a single attribute (either location or color) at a time, with all other attributes sampled randomly and not conditioned on.

B.1.2. MODEL ARCHITECTURE

We used our own PyTorch re-implementation of the EDM2 (Karras et al., 2024) U-Net architecture. Our re-implementation is functionally equivalent, and differs only in optimizations introduced to save memory and GPU cycles. We used the smallest model architecture, i.e. edm2-img64-xs from https://github.com/NVlabs/edm2. This model has a base channel width of 128, resulting in a total of 124M trainable weights. Two versions of this model were used:
- An unmodified version for background and class-conditioned experiments.
- A modified version for (x, y) conditioning, in which we simply replaced the Fourier embeddings for the class with concatenated Fourier embeddings for x and y (see the sketch below).
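The following is a minimal PyTorch sketch (our own illustration, not the EDM2 code) of what "concatenated Fourier embeddings for x and y" could look like; the function names, frequencies, and dimensions are assumptions made for this example.

```python
# Sketch: replacing a class embedding with concatenated Fourier features of (x, y).
# Names and hyperparameters here are illustrative assumptions, not the paper's code.
import torch

def fourier_embed(v: torch.Tensor, num_freqs: int = 32) -> torch.Tensor:
    """Map scalars v in [0, 1] (shape [B]) to [B, 2*num_freqs] Fourier features."""
    freqs = 2.0 ** torch.arange(num_freqs, device=v.device, dtype=v.dtype)  # [F]
    angles = 2 * torch.pi * v[:, None] * freqs[None, :]                     # [B, F]
    return torch.cat([torch.sin(angles), torch.cos(angles)], dim=-1)        # [B, 2F]

def location_embedding(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    """Concatenate Fourier embeddings of x and y; this tensor would be fed to the
    U-Net wherever the class embedding was used before."""
    return torch.cat([fourier_embed(x), fourier_embed(y)], dim=-1)          # [B, 4F]

# Example: a batch of 4 object locations, normalized to [0, 1].
emb = location_embedding(torch.rand(4), torch.rand(4))
print(emb.shape)  # torch.Size([4, 128])
```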
B.1.3. TRAINING AND INFERENCE

In all experiments, the model is trained with a batch size of 2048 over 128·2^20 samples, by looping over the dataset as often as needed to reach that number. In practice, training takes around 16 hours to complete on 32 A100 GPUs. We used almost the same training procedure as in EDM2 (Karras et al., 2024), which is basically a standard training loop with gradient accumulation. The only difference is that we do weight renormalization after the weights are updated, rather than before as the original authors did. For simplicity, we did not use post-hoc EMA to obtain the final weights used in inference; instead we took the average of the weights over the last 4096 training updates. The denoising procedure for inference is exactly the same as in EDM2 (Karras et al., 2024), i.e. 65 model calls using a 32-step Heun sampler.

B.2. Factorized Conditionals in CLEVR

B.2.1. SINGLE-OBJECT DISTRIBUTIONS WITH EMPTY BACKGROUND

Let us explicitly describe how our definition of Factorized Conditionals captures the CLEVR setting of Figure 2. Recall, the background distribution pb over n pixels consists of images of an empty scene with no objects. For each i ∈ {1, 2, 3, 4}, define the set Mi ⊆ [n] as the set of pixel indices surrounding location i. Each Mi should be thought of as a mask that masks out objects at location i. Then, let Mb := (∪i Mi)^c be the remaining pixels in the image, excluding all the masks. Now we claim the distributions (pb, p1, p2, p3, p4) are approximately Factorized Conditionals, with corresponding coordinate partition (Mb, M1, M2, M3, M4). We can confirm each criterion in Definition 5.2 individually:
1. In each distribution pi, the pixels inside the mask Mi are approximately independent of the pixels outside the mask, since the outside pixels always describe an empty scene.
2. In the background pb, the masks {Mi} specify approximately mutually-independent sets of pixels, since all pixels are roughly constant.
3. The distributions pi and pb approximately agree on all pixels outside mask Mi, since both describe an empty scene outside this mask.
Thus, the set of distributions approximately forms Factorized Conditionals. However, the conditions of Definition 5.2 do not hold exactly, since objects can cast shadows on each other and may even occlude each other. Empirically, this can significantly affect the results when composing many objects, as explored in Figure 11(a).

B.2.2. CLUTTERED DISTRIBUTIONS WITH UNCONDITIONAL BACKGROUND

Figure 8: Samples from an unconditional model trained on images containing 1-5 objects. The sampled images sometimes contain 6 objects (circled in orange).

Next, we discuss the setting of Figure 3(c), which is a Bayes composition based on an unconditional distribution where each scene contains 1-5 objects (with the number of objects sampled uniformly). The locations and all other attributes of the objects are sampled independently. The conditions label the location of one randomly-chosen object. Just as in the previous case, for each i ∈ {1, 2, 3, 4}, we define the set Mi ⊆ [n] as the set of pixel indices surrounding location i, and let Mb := (∪i Mi)^c be the remaining pixels in the image, excluding all the masks. Again, we claim that the distributions (pb, p1, p2, p3, p4) are approximately Factorized Conditionals, with corresponding coordinate partition (Mb, M1, M2, M3, M4). We examine the criteria in Definition 5.2:
1. In each distribution pi, the pixels inside the mask Mi are approximately independent of the pixels outside the mask, since the outside pixels approximately describe a distribution containing 0-4 objects, and the locations and other attributes of all objects are independent.
2. In the unconditional background distribution pb, we argue that in practice the masks {Mi} are approximately mutually independent. By assumption, the locations and other attributes of all shapes are independent, and the masks Mi are chosen in this experiment to minimize interaction/overlap. The main difficulty is the restriction to 1-5 total objects, which we discuss below.
3. The distributions pi and pb approximately agree on all pixels outside mask Mi, since pi restricted to Mi^c contains 0-4 objects, while pb restricted to Mi^c contains 0-5 objects (since one object could be hidden in Mi^c).

There are, however, two important caveats to the points above. First, overlap or other interactions (shadows, etc.) between objects can clearly violate all three criteria. In our experiment, this is mitigated by the fact that the masks Mi are chosen to minimize interaction/overlap (though interactions start to occur as we compose more objects, leading to some image degradation). Second, since the number of objects is sampled uniformly from 1-5, the presence of one object affects the probability that another will be present. Thus, the masks {Mi} are not perfectly independent under the background distribution, nor do pi and pb perfectly agree on Mi^c. Ideally, each pi would place an object in mask Mi and independently follow pb on Mi^c, and pb would be such that the probability that an object appears in mask Mi is independently Bernoulli (c.f. Appendix G.2). In particular, this would imply that the distribution of the total number of objects is Binomial (which allows the total object count to range from zero to the total number of locations, and places specific probabilities on each object count), and this clearly differs from the uniform distribution over 1-5 objects. However, a few factors mitigate this discrepancy:
- A Binomial with sufficiently small probability of success places very little probability on large k. For example, under Binomial(9, 0.3), P(k = 0, ..., 5) ≈ 0.04, 0.156, 0.27, 0.27, 0.17, 0.07 and P(k > 5) ≈ 0.026 (see the quick check after this subsection).
- Empirically, the learned unconditional distribution does not actually enforce k ≤ 5; we sometimes see samples with k = 6, for example, as seen in Figure 8.
Intuitively, the training distribution is close to the independent-Bernoulli model, and the learned distribution seems to be even closer. With these considerations in mind, we see that the set of distributions approximately, though imperfectly, forms Factorized Conditionals. One advantage of this setting compared to the single-object setting is that the models can learn how multiple objects should interact and even overlap correctly, potentially making it easier to compose nearby locations. We explore the length-generalization of this composition empirically in Figure 11(c) (note, however, that only compositions of more than 5 objects are actually OOD w.r.t. the distributions pi in this case).
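The Binomial(9, 0.3) figures quoted above are easy to verify; a quick check with scipy (our own, values rounded):

```python
# Numerical check of the Binomial(9, 0.3) probabilities quoted above.
from scipy.stats import binom

n, q = 9, 0.3
pmf = [binom.pmf(k, n, q) for k in range(6)]
print([round(p, 3) for p in pmf])          # [0.04, 0.156, 0.267, 0.267, 0.172, 0.074]
print(round(1 - binom.cdf(5, n, q), 3))    # P(k > 5) ~= 0.025
```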
B.3. Additional CLEVR samples

In this section we provide additional non-cherrypicked samples of the experiments shown in the main text.

Table 1: Quantitative analysis of different methods of composition of location-conditioned CLEVR distributions. We generated 100 samples using each composition method, and manually counted (to avoid any potential error from using a classifier) the objects in correct locations (i.e. locations corresponding to the conditioners of the distributions being composed) in each generated image. The table shows the histogram of these manual counts; that is, each column lists the number of images that contained the given number of objects (0-6) in correct locations. N denotes the number of distributions being composed (hence the expected number of objects); we test N = 1 through N = 6. "Single-object empty" composes single-object distributions with an empty background. "Single-object Bayes" composes single-object distributions with an unconditional background. "Bayes-cluttered" composes 1-5-object distributions (with the location label assigned to a single randomly-chosen object) with an unconditional background.

Single-object empty:  N=1: 100 | N=2: 100 | N=3: 1, 99 | N=4: 2, 98 | N=5: 2, 98 | N=6: 3, 97
Single-object Bayes:  N=1: 100 | N=2: 10, 67, 32 | N=3: 36, 62, 2 | N=4: 77, 23 | N=5: 66, 32, 2 | N=6: 34, 6, 3
Bayes-cluttered:      N=1: 100 | N=2: 100 | N=3: 100 | N=4: 100 | N=5: 2, 98 | N=6: 2, 98

Figure 9: Additional non-cherrypicked samples for the CLEVR experiment of Figure 2.

Figure 10: Additional non-cherrypicked samples for the CLEVR experiment of Figure 3.

Figure 11: Attempted compositional length-generalization up to 9 objects, in the setting of Figure 3 (but with more closely spaced objects). We attempt to compose, via linear score combination, the distributions p1 through p9 shown at far left, where each pi is conditioned on a specific object location as described below. Settings (A) and (C) approximately satisfy the conditions of our theory of projective composition, and thus are expected to length-generalize at least somewhat; setting (B) does not even approximately satisfy our conditions and indeed fails to length-generalize. Experiment (A): the distributions pi each contain a single object at a fixed location, and the background pb is empty. In this case any successful composition of more than one object represents length-generalization. We find that composition succeeds up to several objects, but then degrades as the number of objects increases (see Section 5.3 for details). Experiment (B): the distributions pi are identical to (A), but the background pb is chosen as the unconditional distribution (i.e. a single object at a random location); this is the Bayes composition (Section 3). This composition fails entirely; remarkably, trying to compose many objects often produces no objects! Experiment (C): each distribution pi contains an object at a fixed location i, plus 0-4 other objects (sampled uniformly) at random locations; see samples at far left. The background distribution pb is a distribution of 1-5 objects (sampled uniformly) at random locations. In this case length-generalization means composition of more than 5 objects. This composition can length-generalize, but artifacts appear for large numbers of objects. See Appendix B.2 for a full discussion.

Figure 12: Additional non-cherrypicked samples for the CLEVR experiment of Figure 5. The top-left grid shows conditional samples for each color. The top-right grid shows compositions of red-colored objects (p6) with objects of other colors (8 samples of each), which only succeeds for cyan-colored objects.
The bottom grid shows compositions of yellow-colored objects (p7) with objects of other colors (16 samples of each); these are additional samples of the exact experiment shown in Figure 5.

C. SDXL experimental details

C.1. Figure 1

The two models composed are:
1. An SDXL model (Podell et al., 2023) fine-tuned on 30 personal photos of the author's dog (Papaya).
2. SDXL-base-1.0 (Podell et al., 2023) conditioned on the prompt "an oil painting in the style of van gogh".
The background score distribution is the unconditional background (i.e. SDXL conditioned on the empty prompt). We use the DDPM sampler (Ho et al., 2020) with 30 steps, using the composed score, and a CFG guidance weight of 2 (Ho & Salimans, 2022). Note that using guidance weight 1 (i.e. no guidance) also performs reasonably in this case, but is lower quality.

C.2. Figure 6

Left: The two score models composed are:
1. SDXL-base-1.0 (Podell et al., 2023) conditioned on the prompt "photo of a dog".
2. SDXL-base-1.0 (Podell et al., 2023) conditioned on the prompt "photo of a horse".
The background score distribution is the unconditional background (i.e. SDXL conditioned on the empty prompt). For improved sample quality, we use a Predictor-Corrector method (Song et al., 2020) with the DDPM predictor and the Langevin-dynamics corrector, both operating on the composed score. We use 100 predictor denoising steps, and 3 Langevin iterations per step. We do not use any guidance/CFG.
Right: Identical setting as above, using the prompts:
1. "photo of a dog"
2. "photo, with red hat"
Note that the DDPM sampler also performed reasonably in this setting, but Predictor-Corrector methods improved quality.

D. CLIP experiment details

In the CLIP experiment of Figure 7, we used the following text prompts for each concept: "dog": "a photograph of a dog", "horse": "a photograph of a horse", "cat": "a photograph of a cat", "watercolor": "a watercolor painting", "oil-painting": "an oil painting", "hat": "wearing a hat", "sunglasses": "wearing sunglasses", "uncond": "".

We automatically collected 10 images for each concept and then manually filtered them to ensure that they were actually representative of each concept, leaving us with 10 for "dog", 2 for "horse", 9 for "cat", 3 for "watercolor", 4 for "oil-painting", 6 for "hat", 5 for "sunglasses", and 10 for "uncond" (the latter were arbitrary images retrieved with no specific keyword). For each concept i, we ran the image experiment by computing the CLIP embedding of each representative image and averaging them to estimate the mean µi. Similarly, we estimated the background mean µb using the arbitrary images (representing the unconditional distribution). For the text experiment we simply estimated µi, µb as the CLIP embedding of the single representative text prompt (or empty prompt). In either case we then computed the cosine similarity between the mean-difference vectors,
$$\frac{(\mu_i - \mu_b)^T (\mu_j - \mu_b)}{\|\mu_i - \mu_b\|\,\|\mu_j - \mu_b\|},$$
to assess whether the condition of Lemma 7.1, i.e. $(\mu_i - \mu_b)^T (\mu_j - \mu_b) \approx 0$, approximately holds.
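As an illustration of this heuristic, here is a small sketch (our own, not the paper's code) that estimates the mean-difference cosine similarities from CLIP text embeddings using the Hugging Face transformers library; the specific checkpoint and prompt subset are assumptions made for this example.

```python
# Sketch of the text-prompt version of the orthogonality heuristic: cosine similarity
# between CLIP mean-difference vectors. Checkpoint choice is an assumption.
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

prompts = {
    "dog": "a photograph of a dog",
    "oil-painting": "an oil painting",
    "hat": "wearing a hat",
    "uncond": "",
}

with torch.no_grad():
    inputs = processor(text=list(prompts.values()), return_tensors="pt", padding=True)
    emb = model.get_text_features(**inputs)          # [num_prompts, d]

mu = dict(zip(prompts.keys(), emb))
mu_b = mu["uncond"]

def cos_delta(i: str, j: str) -> float:
    """Cosine similarity between (mu_i - mu_b) and (mu_j - mu_b)."""
    di, dj = mu[i] - mu_b, mu[j] - mu_b
    return float(torch.dot(di, dj) / (di.norm() * dj.norm()))

# Values near 0 suggest the concepts may compose projectively (Lemma 7.1 heuristic).
print("dog vs oil-painting:", cos_delta("dog", "oil-painting"))
print("dog vs hat:         ", cos_delta("dog", "hat"))
```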
E. Reverse Diffusion and other Samplers

E.1. Diffusion Samplers

DDPM (Ho et al., 2020) and DDIM (Song et al., 2021) are standard reverse-diffusion samplers (Sohl-Dickstein et al., 2015; Song & Ermon, 2019) that correspond to discretizations of a reverse-SDE and reverse-ODE, respectively (so we will sometimes refer to the reverse-SDE as DDPM and the reverse-ODE as DDIM for short). The forward process, reverse-SDE, and equivalent reverse-ODE (Song et al., 2020) for the variance-preserving (VP) (Ho et al., 2020) conditional diffusion are
$$\text{Forward SDE:}\quad dx = -\tfrac{1}{2}\beta_t x\, dt + \sqrt{\beta_t}\, dw,$$
$$\text{DDPM SDE:}\quad dx = \left[-\tfrac{1}{2}\beta_t x - \beta_t \nabla_x \log p_t(x|c)\right] dt + \sqrt{\beta_t}\, d\bar w,$$
$$\text{DDIM ODE:}\quad dx = \left[-\tfrac{1}{2}\beta_t x - \tfrac{1}{2}\beta_t \nabla_x \log p_t(x|c)\right] dt. \qquad (15)$$

E.2. Langevin Dynamics

Langevin dynamics (LD) (Rossky et al., 1978; Parisi, 1981) is an MCMC method for sampling from a desired distribution ρ. It is given by the following SDE (Robert et al., 1999):
$$dx = \tfrac{\varepsilon^2}{2} \nabla_x \log \rho(x)\, dt + \varepsilon\, dw, \qquad (16)$$
which converges (under some assumptions) to ρ(x) (Roberts & Tweedie, 1996). That is, letting ρs(x) denote the solution of LD at time s, we have lim_{s→∞} ρs(x) = ρ(x).

F. Factorized Conditionals vs. Orthogonality

Lemma 7.1 states that Factorized Conditionals (Definition 5.2) imply orthogonality between mean differences (providing a necessary-but-not-sufficient condition to check for FC). The proof is straightforward:

Proof. (Lemma 7.1) By FC, $p_i(x|_{M_i^c}) = p_b(x|_{M_i^c})$. Letting $\mu_i = \mathbb{E}_{p_i}[x]$ for $i = 1, \ldots, k$ and $\mu_b = \mathbb{E}_{p_b}[x]$, this gives $(\mu_i)_{M_i^c} = (\mu_b)_{M_i^c}$, hence $\mathrm{Support}(\mu_i - \mu_b) \subseteq M_i$, and therefore $(\mu_i - \mu_b)^T (\mu_j - \mu_b) = 0$, since $M_i \cap M_j = \emptyset$.

Similarly, Definition 5.2 also implies orthogonality between the score differences (recall that the score is related to the conditional mean as $\nabla_x \log p^t(x_t) = \tfrac{1}{\sigma_t^2}\, \mathbb{E}_p[x - x_t \mid x_t]$, so this is closely related to Lemma 7.1). To see this:
$$v_i^t(x) := \nabla_x \log p_i^t(x_t) - \nabla_x \log p_b^t(x_t) = \nabla_x \log \frac{p_i^t(x)}{p_b^t(x)} = \nabla_x \log \frac{p_i^t(x|_{M_i})\, p_b^t(x|_{M_i^c})}{p_b^t(x|_{M_i})\, p_b^t(x|_{M_i^c})} = \nabla_x \log \frac{p_i^t(x|_{M_i})}{p_b^t(x|_{M_i})},$$
so $v_i^t(x)[k] = 0$ for $k \notin M_i$, and hence $v_i^t(x)^T v_j^t(x) = 0$ for $i \neq j$, since $M_i \cap M_j = \emptyset$, where in the second-to-last step we used the fact that the gradient of a function depending only on a subset of variables has zero entries in the coordinates outside that subset. In fact, the same argument implies that $\{v_i^t(x) : x \in \mathbb{R}^n\} \subseteq M_i$; in other words, $\{v_i^t(x) : x \in \mathbb{R}^n\}$ and $\{v_j^t(x) : x \in \mathbb{R}^n\}$ occupy mutually-orthogonal subspaces. But even this latter condition does not imply the stronger condition of Definition 5.2. To find an equivalent definition in terms of scores we must also capture the independence of the subsets under pb. Specifically:
$$p_i^t(x) = p_i^t(x|_{M_i})\, p_b^t(x|_{M_i^c}), \qquad p_b^t(x) = p_b^t(x|_{\bar M}) \prod_i p_b^t(x|_{M_i}),$$
$$\nabla_x \log p_i^t(x) = \nabla_x \log p_i^t(x|_{M_i}) + \nabla_x \log p_b^t(x|_{M_i^c}), \qquad \nabla_x \log p_b^t(x) = \nabla_x \log p_b^t(x|_{\bar M}) + \sum_i \nabla_x \log p_b^t(x|_{M_i}),$$
$$\nabla_x \log p_i^t(x) - \nabla_x \log p_b^t(x) = \nabla_x \log \frac{p_i^t(x|_{M_i})}{p_b^t(x|_{M_i})}.$$
So a definition equivalent to Definition 5.2 in terms of scores could be:

Definition F.1. The distributions (pb, p1, p2, . . .) form factorized conditionals if the score-deltas $v_i^t := \nabla_x \log p_i^t(x) - \nabla_x \log p_b^t(x)$ satisfy $\{v_i^t(x) : x \in \mathbb{R}^n\} \subseteq M_i$, where the $M_i$ are mutually-orthogonal subsets, and furthermore the score of the background distribution decomposes over these subsets as $\nabla_x \log p_b^t(x) = \nabla_x \log p_b^t(x|_{\bar M}) + \sum_i \nabla_x \log p_b^t(x|_{M_i})$.

(Note: this is actually equivalent to a slightly more general version of Definition 5.2 that allows for orthogonal transformations, which is the most general assumption under which diffusion sampling generates a projective composition, per Lemmas 6.1 and J.1.)
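To make the orthogonality statement concrete, here is a small self-contained check (our own toy construction, not from the paper) for Gaussian factorized conditionals over disjoint coordinate sets; the mean-difference and score-delta vectors land in disjoint coordinate blocks and are therefore orthogonal, as claimed:

```python
# Toy check of the orthogonality implied by factorized conditionals: Gaussian factors
# that agree with the background outside their own coordinate block have
# mean-differences / score-deltas supported on disjoint blocks.
import numpy as np

rng = np.random.default_rng(0)
n = 6
M1, M2 = [0, 1], [2, 3]             # disjoint coordinate sets for the two factors

mu_b = rng.normal(size=n)           # background N(mu_b, I)
mu_1, mu_2 = mu_b.copy(), mu_b.copy()
mu_1[M1] += rng.normal(size=2)      # p_1 differs from p_b only on M1
mu_2[M2] += rng.normal(size=2)      # p_2 differs from p_b only on M2

def score(mu, x):                   # score of N(mu, I) at x
    return mu - x

x = rng.normal(size=n)              # arbitrary evaluation point
v1 = score(mu_1, x) - score(mu_b, x)   # = mu_1 - mu_b, supported on M1
v2 = score(mu_2, x) - score(mu_b, x)   # = mu_2 - mu_b, supported on M2

print(np.nonzero(v1)[0], np.nonzero(v2)[0])   # [0 1] [2 3]
print(float(v1 @ v2))                         # 0.0: orthogonal score-deltas
```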
G. Connections with the Bayes composition

G.1. The Bayes composition and length-generalization

We give a counterexample for which the Bayes composition fails to length-generalize, while composition using an empty background succeeds. The example corresponds to the experiment shown in Figure 13 (left). Suppose we have conditional distributions pi that set a single index i to one and all other indices to zero, a zero background distribution pb, and an unconditional distribution formed from the conditionals by assuming p(c = i) is uniform. That is:
$$p_i^t(x_t) = N(x_t; e_i, \sigma_t^2 I) \propto \exp\!\left(-\frac{\|x_t - e_i\|^2}{2\sigma_t^2}\right), \quad p_b^t(x_t) = N(x_t; 0, \sigma_t^2 I) \propto \exp\!\left(-\frac{\|x_t\|^2}{2\sigma_t^2}\right), \quad p_u^t(x_t) = \frac{1}{n}\sum_{i=1}^{n} p_i^t(x_t). \qquad (17)$$

Figure 13: Bayes composition vs. projective composition. All experiments use exact scores, which is possible since the diffusion-noised distributions are Gaussian mixtures. (Left) Distributions follow (17): each conditional pi activates index i only, the unconditional pu averages over the pi, and the background pb is all zeros. We attempt to compose the conditions p0, p2, p4, p6 and hope to obtain the result [1, 0, 1, 0, 1, 0]. This requires length-generalization, since each of the conditionals pi contains only a single 1. The composition using the empty background pb (top) achieves this goal, while the Bayes composition using the unconditional pu (bottom) does not. Note that [pb, p1, p2, . . .] satisfies Definition 5.2 while [pu, p1, p2, . . .] does not. (Right) Distributions follow (18), where each conditional pi activates index i on an independently cluttered background. In this case the unconditional is similar to the cluttered background. Again we attempt to compose p0, p2, p4, p6, and in this case we find that the composition using pu works similarly well to the one using pb.

Suppose we want to compose all n distributions pi, that is, we want to activate all indices. It is enough to consider xt of the special form xt = (α, . . . , α), since there is no reason to favor any condition over any other. Making this restriction,
$$x_t = (\alpha, \ldots, \alpha) \;\Rightarrow\; p_i^t(x_t) \propto \exp\!\left(-\frac{(n-1)\alpha^2 + (1-\alpha)^2}{2\sigma_t^2}\right) = \exp\!\left(-\frac{n\alpha^2 - 2\alpha + 1}{2\sigma_t^2}\right),$$
$$p_u^t(x_t) \propto \exp\!\left(-\frac{n\alpha^2 - 2\alpha + 1}{2\sigma_t^2}\right), \qquad p_b^t(x_t) \propto \exp\!\left(-\frac{n\alpha^2}{2\sigma_t^2}\right).$$
Let us find the value of α that maximizes the probability under the Bayes composition of all conditions. Noting that $p_i^t(x_t)/p_u^t(x_t) = 1$ along the diagonal,
$$x_t = (\alpha, \ldots, \alpha) \;\Rightarrow\; p_u^t(x_t)\prod_{i=1}^{n}\frac{p_i^t(x_t)}{p_u^t(x_t)} = p_u^t(x_t) \propto \exp\!\left(-\frac{n\alpha^2 - 2\alpha + 1}{2\sigma_t^2}\right) = \exp\!\left(-\frac{n(\alpha - \tfrac{1}{n})^2}{2\sigma_t^2} + \text{const}\right),$$
so the optimum is α = 1/n. That is, under the Bayes composition the most likely configuration places value 1/n at each index we wished to activate, rather than the desired value 1. On the other hand, if we instead use pb in the linear score combination and optimize, we find that:
$$x_t = (\alpha, \ldots, \alpha) \;\Rightarrow\; \frac{p_i^t(x_t)}{p_b^t(x_t)} \propto \exp\!\left(-\frac{1 - 2\alpha}{2\sigma_t^2}\right),$$
$$p_b^t(x_t)\prod_{i=1}^{n}\frac{p_i^t(x_t)}{p_b^t(x_t)} \propto \exp\!\left(-\frac{n\alpha^2}{2\sigma_t^2}\right)\exp\!\left(-\frac{n(1 - 2\alpha)}{2\sigma_t^2}\right) = \exp\!\left(-\frac{n(\alpha^2 - 2\alpha + 1)}{2\sigma_t^2}\right) = \exp\!\left(-\frac{n(\alpha - 1)^2}{2\sigma_t^2}\right),$$
so the optimum is α = 1. That is, the most likely configuration places the desired value 1 at each index we wished to activate, achieving projective composition and, in particular, length-generalizing correctly. (A short numerical check is given after this subsection.)
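The two optima can be checked numerically by evaluating the composed log-densities along the diagonal x = (α, . . . , α); a small sketch (our own, with arbitrary n and σ):

```python
# Numerical check of the diagonal analysis above: the Bayes composition peaks near
# alpha = 1/n while the empty-background composition peaks at alpha = 1.
import numpy as np

n, sigma = 8, 0.5
alphas = np.linspace(-0.5, 1.5, 4001)

def log_gauss(x, mu, sigma):
    return -0.5 * np.sum((x - mu) ** 2, axis=-1) / sigma**2

e = np.eye(n)
x = alphas[:, None] * np.ones((1, n))            # points (alpha, ..., alpha)

log_pi = np.stack([log_gauss(x, e[i], sigma) for i in range(n)], axis=0)  # [n, A]
log_pb = log_gauss(x, np.zeros(n), sigma)
log_pu = np.logaddexp.reduce(log_pi, axis=0) - np.log(n)

log_bayes = log_pu + np.sum(log_pi - log_pu, axis=0)     # p_u * prod_i p_i / p_u
log_empty = log_pb + np.sum(log_pi - log_pb, axis=0)     # p_b * prod_i p_i / p_b

print("Bayes optimum alpha ~", alphas[np.argmax(log_bayes)])            # ~ 1/n = 0.125
print("Empty-background optimum alpha ~", alphas[np.argmax(log_empty)]) # ~ 1.0
```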
G.2. Cluttered Distributions

In certain cluttered settings, the Bayes composition may be approximately projective. We explore this in the following simplified setting, corresponding to the experiment in Figure 13 (right). Suppose that x is binary-valued, Mi = {i} for all i, the xi are independently Bernoulli with parameter q under the background, and the projected conditional distribution pi(x|i) just guarantees that xi = 1:
$$p_b(x|_i) \sim \mathrm{Bern}_q \text{ i.i.d. over } i, \qquad p_i(x|_i) = \mathbf{1}_{x|_i = 1}. \qquad (18)$$
The distributions (pb, p1, p2, . . .) then clearly satisfy Definition 5.2 and hence guarantee projective composition. In this case, the unconditional distribution used in the Bayes composition is similar to the background distribution when the number of conditions is large. Intuitively, each conditional looks very similar to the Bernoulli background except for a single index that is guaranteed to equal 1, and the unconditional distribution is just a weighted sum of conditionals. Therefore, we expect the Bayes composition to be approximately projective. More precisely, we will show that the unconditional distribution converges to the background in the limit as n → ∞, where n is both the data dimension and the number of conditions, in the following sense:
$$\mathbb{E}_{x \sim p_b}\!\left[\left(\frac{p_u(x)}{p_b(x)} - 1\right)^{2}\right] \to 0.$$
We define the conditional and background distributions by:
$$x \in \mathbb{R}^n, \quad M_i = \{i\}, \quad p_b(x|_i) \sim \mathrm{Bern}_q \text{ i.i.d. for } i = 1, \ldots, n, \quad p_i(x|_i) = \mathbf{1}_{x|_i = 1} \text{ for all } i = 1, \ldots, n$$
$$\Rightarrow\; p_b(x) = q^{\mathrm{nnz}(x)}(1-q)^{n - \mathrm{nnz}(x)}, \qquad p_i(x) = \mathbf{1}_{x|_i=1}\, p_b(x|_{i^c}) = \mathbf{1}_{x|_i=1}\, q^{\mathrm{nnz}(x|_{i^c})}(1-q)^{n-1-\mathrm{nnz}(x|_{i^c})}.$$
We construct the unconditional distribution by assuming uniform probability over all labels: $p_u(x) := \frac{1}{n}\sum_i p_i(x)$. The number of nonzeros (nnz) under all of these distributions follows a Binomial distribution:
$$x \sim p_b \;\Rightarrow\; p_b(\mathrm{nnz}(x) = k) = \mathrm{Binom}(k; n, q),$$
$$x \sim p_i \;\Rightarrow\; p_i(\mathrm{nnz}(x) = k) = p_b(\mathrm{nnz}(x|_{i^c}) = k-1) = \mathrm{Binom}(k-1; n-1, q) \text{ if } k > 0, \text{ else } 0,$$
$$x \sim p_u \;\Rightarrow\; p_u(\mathrm{nnz}(x) = k) = \frac{1}{n}\sum_i p_i(\mathrm{nnz}(x) = k) = \mathrm{Binom}(k-1; n-1, q) \text{ if } k > 0, \text{ else } 0.$$
The basic intuition is that for large k and n, $\mathrm{Binom}(k; n, q)$ and $\mathrm{Binom}(k-1; n-1, q)$ are similar. More precisely, we can calculate:
$$\mathbb{E}_{x \sim p_b}\!\left[\left(\frac{p_u(x)}{p_b(x)} - 1\right)^{2}\right] = \mathbb{E}_{k \sim \mathrm{Binom}(n,q)}\!\left[\left(\frac{k}{nq} - 1\right)^{2}\right], \quad \text{since } \frac{\mathrm{Binom}(k-1;\, n-1,\, q)}{\mathrm{Binom}(k;\, n,\, q)} = \frac{k}{nq},$$
$$= \frac{1}{(nq)^2}\,\mathrm{Var}(k), \ \ k \sim \mathrm{Binom}(n, q), \ \ = \frac{1}{(nq)^2}\, nq(1-q) = \frac{1-q}{nq} \to 0 \ \text{as } n \to \infty.$$

H. Proof of Theorem 5.3

Proof. (Theorem 5.3) For any set of distributions $\bar q = (q_b, q_1, q_2, \ldots)$ satisfying Definition 5.2, we have
$$C[\bar q](x) := q_b(x)\prod_i \frac{q_i(x)}{q_b(x)} = q_b(x)\prod_i \frac{q_b(x|_{M_i^c})\, q_i(x|_{M_i})}{q_b(x|_{M_i^c})\, q_b(x|_{M_i})} = q_b(x)\prod_i \frac{q_i(x|_{M_i})}{q_b(x|_{M_i})} = q_b(x|_{M_b})\prod_i q_i(x|_{M_i}) \qquad (19)$$
(where we used (7) in the second equality). Since (pb, p1, p2, . . .) satisfy Definition 5.2 by assumption, applying (19) gives
$$C[\bar p](x) = p_b(x|_{M_b})\prod_i p_i(x|_{M_i}) := \hat p(x),$$
so the composition at t = 0 is projective, as desired. Now, to show that reverse-diffusion sampling with the compositional scores generates $C[\bar p]$, we need to show that $C[\bar p^t] = N_t[C[\bar p]]$, where $p^t := N_t[p]$ denotes the t-noisy version of distribution p under the forward diffusion process. First, notice that if $\bar p$ satisfies Definition 5.2, then $\bar p^t$ does as well. This is because the diffusion process adds Gaussian noise independently to each coordinate, and thus preserves independence between sets of coordinates. Therefore by (19), we have $C[\bar p^t](x) = p_b^t(x|_{M_b})\prod_i p_i^t(x|_{M_i})$. Now we apply the same argument (that diffusion preserves independent sets of coordinates) once again, to see that $C[\bar p^t] = N_t[C[\bar p]]$, as desired.

I. Parameterization-Independent Compositions and Proof of Lemma 6.1

The proof of Lemma 6.1 relies on a general fact about parametrization-independence of certain operators, which we develop here. Suppose we have an operator that takes as input two probability distributions (p, q) over the same space X and outputs a distribution over X; that is, $F : \Delta(X) \times \Delta(X) \to \Delta(X)$. We can think of such operators as performing some kind of composition of p, q. Certain operators are independent of parameterization, meaning that for any reparameterization of the base space A : X → Y, we have $F(p, q) = A^{-1}_{\sharp}\big(F(A_{\sharp} p, A_{\sharp} q)\big)$, or equivalently $F(A_{\sharp} p, A_{\sharp} q) = A_{\sharp} F(p, q)$, where $\sharp$ denotes the pushforward: $(A_{\sharp} p)(z) := \frac{1}{|\nabla A|}\, p(A^{-1}(z))$.
This means that reparameterization commutes with the operator: it does not matter whether we first reparameterize and then compose, or first compose and then reparameterize. A few examples:
1. The pointwise geometric mean, $F(p, q)(x) := \sqrt{p(x)\,q(x)}$, is independent of reparameterization.
2. Squaring a distribution, $F(p, q)(x) := p(x)^2$, is NOT independent of reparameterization.
3. The CFG composition (Ho & Salimans, 2022), $F(p, q)(x) := p(x)^{\gamma} q(x)^{1-\gamma}$, is independent of reparameterization.
We can analogously define parametrization-independence for operators on more than two distributions. Notably, given a tuple of distributions $\bar p = (p_b, p_1, p_2, \ldots, p_k)$, our composition operator C of Definition 5.1, $C[\bar p] \propto p_b(x)\prod_i \frac{p_i(x)}{p_b(x)}$, is independent of parameterization.

Lemma I.1 (Parametrization-independence of 1-homogeneous operators). If an operator F is 1-homogeneous, i.e. $F(tp, tq, \ldots) = t\, F(p, q, \ldots)$, and operates pointwise, then it is independent of parametrization.

Proof.
$$F(A_{\sharp} p, A_{\sharp} q, \ldots)(z) = F\big((A_{\sharp} p)(z), (A_{\sharp} q)(z), \ldots\big) \quad \text{(pointwise)}$$
$$= F\!\left(\tfrac{1}{|\nabla A|}\, p(A^{-1}(z)), \tfrac{1}{|\nabla A|}\, q(A^{-1}(z)), \ldots\right) = \tfrac{1}{|\nabla A|}\, F\big(p(A^{-1}(z)), q(A^{-1}(z)), \ldots\big) \quad \text{(1-homogeneity)}$$
$$= \big(A_{\sharp} F(p, q, \ldots)\big)(z).$$

Corollary I.2 (Parametrization-invariance of composition). The composition operator C given by Definition 5.1 is independent of parametrization.

Proof. The composition operator given by Definition 5.1 is 1-homogeneous:
$$C(tp_b, tp_1, tp_2, \ldots)(x) = tp_b(x)\prod_i \frac{tp_i(x)}{tp_b(x)} = tp_b(x)\prod_i \frac{p_i(x)}{p_b(x)} = t\, C(p_b, p_1, p_2, \ldots)(x),$$
and so the result follows from Lemma I.1. Alternatively, a direct proof is:
$$C(p_b, p_1, p_2, \ldots)(x) := p_b(x)\prod_i \frac{p_i(x)}{p_b(x)},$$
$$C(A_{\sharp} p_b, A_{\sharp} p_1, A_{\sharp} p_2, \ldots)(z) = (A_{\sharp} p_b)(z)\prod_i \frac{(A_{\sharp} p_i)(z)}{(A_{\sharp} p_b)(z)} = \frac{1}{|\nabla A|}\, p_b(A^{-1}(z))\prod_i \frac{p_i(A^{-1}(z))}{p_b(A^{-1}(z))} = \big(A_{\sharp}\, C(p_b, p_1, p_2, \ldots)\big)(z).$$

Theorem 6.1 follows from Corollary I.2:

Proof. (Theorem 6.1) Let $(q_b, q_1, q_2, \ldots, q_k) := (A_{\sharp} p_b, A_{\sharp} p_1, \ldots, A_{\sharp} p_k)$, for which Definition 5.2 holds by assumption. Applying an intermediate result from the proof of Theorem 5.3 gives:
$$C[\bar q](z) := q_b(z)\prod_i \frac{q_i(z)}{q_b(z)} = q_b(z|_{\bar M})\prod_i q_i(z|_{M_i}).$$
By Corollary I.2, C is independent of parametrization, hence $A_{\sharp}\hat p := A_{\sharp}(C[\bar p]) = C[A_{\sharp}\bar p] := C[\bar q]$.

Figure 14: Projective Composition in Feature Space. A commutative diagram illustrating Theorem 6.1: performing composition in pixel space is equivalent to encoding into some feature space (A), composing there, and decoding back to pixel space (A^{-1}). This holds for all feature spaces subject to smoothness conditions. Thus, if there exists some feature space in which distributions p1, p2 projectively compose (e.g. due to orthogonality, as illustrated here), then we can achieve this same composition by simply operating in pixel space, without even needing to know the feature space.

J. Composition under Orthogonal Transformation

Lemma J.1 (Orthogonal transform enables diffusion sampling). If the assumptions of Lemma 6.1 hold for A(x) = Ax, where A is an orthogonal matrix, then running a reverse-diffusion sampler with scores $s^t = \nabla_x \log C[\bar p^t]$ generates the composed distribution $\hat p = C[\bar p]$ satisfying (10).

Figure 15 shows a synthetic experiment illustrating the sampling guarantees of Lemma J.1, in contrast to the lack of guarantees in the non-orthogonal case. The proof relies on the fact that diffusion noising commutes with orthogonal transformations, i.e. $A_{\sharp} N_t[q] = N_t[A_{\sharp} q]$ if A is orthogonal, since standard Gaussians are invariant under orthogonal transformation. (A small numerical illustration of this commutation is given below.)
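The commutation is easy to check in distribution for Gaussians, since $A_{\sharp} N(\mu, \Sigma) = N(A\mu, A\Sigma A^T)$; a small numpy sketch (our own) comparing an orthogonal and a non-orthogonal transform:

```python
# Check: noising then transforming vs. transforming then noising, for q = N(mu, C).
# Noising adds sigma^2 I; transforming by A maps N(m, S) to N(Am, A S A^T).
# The two orders agree iff A (sigma^2 I) A^T = sigma^2 I, i.e. iff A is orthogonal.
import numpy as np

rng = np.random.default_rng(0)
C, sigma2 = np.eye(3), 0.5

Q, _ = np.linalg.qr(rng.normal(size=(3, 3)))   # orthogonal matrix
B = np.array([[1.0, -0.7, 0.0],
              [0.0,  1.0, 0.0],
              [0.0,  0.0, 1.0]])               # non-orthogonal (shear)

def covariances(A):
    noise_then_transform = A @ (C + sigma2 * np.eye(3)) @ A.T
    transform_then_noise = A @ C @ A.T + sigma2 * np.eye(3)
    return noise_then_transform, transform_then_noise

for name, A in [("orthogonal Q", Q), ("shear B", B)]:
    c1, c2 = covariances(A)
    print(name, "covariances match:", np.allclose(c1, c2))
# orthogonal Q -> True; shear B -> False
```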
Proof. (Lemma J.1) By assumption, $(A_{\sharp} p_b, A_{\sharp} p_1, \ldots, A_{\sharp} p_k)$ satisfy Definition 5.2, where A(z) = Az with A an orthogonal matrix. By Lemma 6.1, $\hat p = C[\bar p]$ satisfies (10). To show that reverse-diffusion sampling with scores $s^t = \nabla_x \log C[\bar p^t]$ generates the composed distribution $C[\bar p]$, we need to show that composition commutes with the forward diffusion process, i.e. $C[\bar p^t] = N_t[C[\bar p]]$. Theorem 5.3 immediately gives us $C[N_t[A_{\sharp}\bar p]] = N_t[C[A_{\sharp}\bar p]]$. Now we have to be careful with commuting operators. We know that composition is independent of parametrization, i.e. $A_{\sharp} C[\bar p] = C[A_{\sharp}\bar p]$. Diffusion noising $N_t$ commutes with orthogonal transformation, i.e. $A_{\sharp} N_t[q] = N_t[A_{\sharp} q]$ if A is orthogonal, because a standard Gaussian multiplied by an orthogonal matrix Q remains a standard Gaussian: $\eta \sim N(0, I) \Rightarrow Q\eta \sim N(0, QQ^T) = N(0, I)$ (this is false for non-orthogonal transforms, however). Therefore, in the orthogonal case, we can rewrite $A_{\sharp} C[N_t[\bar p]] = A_{\sharp} N_t[C[\bar p]]$, which implies the desired result since A is invertible.

K. Proof and further discussion of Lemma 6.3

K.1. Benefits of sampling at t = 0

Interestingly, Du et al. (2023) have observed that sophisticated samplers like Hamiltonian Monte Carlo (HMC), which require energy-based formulations, often outperform standard diffusion sampling for compositional quality. Lemmas 6.1 and 6.3 help explain why this may be the case. In particular, HMC (or any variant of Langevin dynamics) can enable sampling from p0 at time t = 0, even when the path pt used for annealing does not necessarily represent a valid forward diffusion process starting from p0 (as Du et al. (2023) note, $C[\bar p^t]$ may not be). Lemma 6.1 gives us hope that approximately-projective composition may often be possible at t = 0, since it allows any invertible transform A to map into a factorized feature space (which need not be explicitly constructed). However, that does not mean that we can actually sample from this projection at time t = 0. As Lemma 6.3 shows, $C[\bar p^t]$ is not necessarily a valid diffusion path unless A is orthogonal, so standard diffusion sampling may not work. This is consistent with Du et al. (2023)'s observation that non-diffusion samplers that allow sampling at t = 0 may be necessary. Interestingly, Lemma 6.3 further cautions that sometimes $C[\bar p^t]$ may not even be an effective annealing path for any kind of sampler (which is consistent with our own experiments, but not reported by other works to our knowledge). A minimal illustrative sketch of Langevin sampling on a composed score at a fixed noise level is given below.

Figure 15: Synthetic composition experiment illustrating the sampling guarantees of Lemma J.1, in contrast to the lack of guarantees in the non-orthogonal case. We compare a coordinate-aligned case (top; satisfies Definition 5.2 in the native space), an orthogonal-transform case (middle; satisfies the assumptions of Lemma J.1), and a non-orthogonal-transform case (bottom; satisfies the assumptions of Theorem 6.1 but not of Lemma J.1). In the first two cases the correct composition can be sampled using either diffusion (DDIM) or Langevin dynamics (LD) at t = 0, while in the final case DDIM sampling is unsuccessful although LD at t = 0 still works. The distributions are 4-dimensional and we show 8 samples (rows) for each. We show samples from the individual conditional distributions p0, p1 using DDIM; samples from the desired exact composition C[pb, p0, p1] at t = 0 (obtained by sampling from $A_{\sharp} C[\bar p]$ with DDIM and transforming by A^{-1}); samples from the composition C[pb, p0, p1] using DDIM with exact scores; and samples from the composition C[pb, p0, p1] using Langevin dynamics (LD) with exact scores at time t = 0 in the diffusion schedule (σmin = 0.02). The noiseless distributions p0 and p1 are each 4-dimensional 2-cluster Gaussian mixtures with means as noted in the figure, equal weights, and standard deviation τ = 0.02. For example, in the non-orthogonal-transform case, p0 has means [1, 0, 0, 0] and [0, 0, 1, 0], and p1 has means [1, 1, 0, 0] and [0, 0, 1, 1] (which can be transformed to satisfy Definition 5.2 via a non-orthogonal linear transform).
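As a concrete toy illustration (our own, not the paper's experiment) of Langevin dynamics (Eq. 16) run directly on a linearly combined score $s_b + \sum_i (s_i - s_b)$, consider two unit-variance Gaussian factors with a standard-normal background in 2-D; the product composition is then $N(m_0 + m_1, I)$, and the chains converge near that mean.

```python
# Langevin dynamics on a composed score for two Gaussian factors (toy example).
import numpy as np

rng = np.random.default_rng(0)
m0, m1 = np.array([2.0, 0.0]), np.array([0.0, 3.0])   # means of the two factors

def composed_score(x):
    s_b = -x               # score of the background N(0, I)
    s_0 = m0 - x           # score of N(m0, I)
    s_1 = m1 - x           # score of N(m1, I)
    return s_b + (s_0 - s_b) + (s_1 - s_b)   # = m0 + m1 - x

eps, n_steps, n_chains = 0.1, 2000, 512
x = rng.normal(size=(n_chains, 2))            # arbitrary initialization
for _ in range(n_steps):
    x = x + 0.5 * eps**2 * composed_score(x) + eps * rng.normal(size=x.shape)

# The product composition of these Gaussians is N(m0 + m1, I) = N([2, 3], I),
# so the chain means should be close to [2, 3].
print(x.mean(axis=0))
```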
K.2. Proof of Lemma 6.3

We will prove Lemma 6.3 using a counterexample, inspired by an experiment, shown in Figure 17 (left), where non-orthogonal conditions fail to compose projectively.

Figure 16: (Left) A visualization of the intuition behind the proof of Lemma 6.3, under a 2D projection. (Right) An experiment where the colors red, green, and blue all compose projectively, while the colors red and yellow do not. We trained a U-Net on images each containing a single square in one of 4 locations (selected randomly) and a certain color, conditioned on the color. We then generate composed distributions by running DDIM on the composed scores. The desired result of composing red and blue is an image containing a red and a blue square, both at randomly-chosen locations (so we occasionally get a purple square when the locations overlap). When we try to compose red and yellow, we only ever obtain a single yellow square. Note that in pixel space, the colors are represented as red (1, 0, 0), green (0, 1, 0), blue (0, 0, 1), yellow (1, 1, 0), so that red, green, and blue are all mutually orthogonal and are expected to work by Theorem 5.3, while red and yellow are not orthogonal, and fail as allowed by Lemma 6.3. In fact, this experiment is closely related to the counterexample used to prove Lemma 6.3.

Figure 17: Composition experiments for the setting in the proof of Lemma 6.3. The left pane shows 8 samples (rows) of each distribution in the native 4-dimensional representation; the right pane shows 1000 samples under the 2D projection used in Figure 16. We show samples from the individual conditional distributions p0, p1 using DDIM; samples from the desired exact composition C[pb, p0, p1] at t = 0 (obtained by sampling from $A_{\sharp} C[\bar p]$ with DDIM and transforming by A^{-1}); and samples from the composition C[pb, p0, p1] using DDIM with exact scores. We take τ = 0.02, and set σmin = 0.02 in the diffusion schedule. In the top row we take a = 1 ("very non-orthogonal") as in the proof, and compare this to a = 0.3 ("mildly non-orthogonal") in the bottom row. With a = 1, as in the proof, we see that DDIM barely samples two of the clusters. With a = 0.3, DDIM still slightly undersamples the hard clusters, but the effect is much less pronounced.

The basic idea for the counterexample is that we are given a distribution p(x) with two conditions, c = 0, 1, such that at t = 0,
$$p_0(x) \propto \tfrac{1}{2}\delta_{e_0}(x) + \tfrac{1}{2}\delta_{e_2}(x), \qquad p_1(x) \propto \tfrac{1}{2}\delta_{a e_0 + e_1}(x) + \tfrac{1}{2}\delta_{a e_2 + e_3}(x),$$
for some $0 < a \le 1$, so the conditional distributions do not satisfy the independence assumption of Definition 5.2. However, there exists a (linear, but not orthogonal) A such that the distribution of z = Ax is axis-aligned,
$$(A_{\sharp} p_0)(z) \propto \tfrac{1}{2}\delta_{e_0}(z) + \tfrac{1}{2}\delta_{e_2}(z), \qquad (A_{\sharp} p_1)(z) \propto \tfrac{1}{2}\delta_{e_1}(z) + \tfrac{1}{2}\delta_{e_3}(z),$$
and thus does satisfy Definition 5.2 at t = 0, which guarantees correct composition of p at t = 0 under Lemma 6.1. The correct composition should sample uniformly from $\{(1+a)e_0 + e_1,\; e_0 + a e_2 + e_3,\; a e_0 + e_2 + e_1,\; (1+a)e_2 + e_3\}$.
What goes wrong is that as soon as we add Gaussian noise to the distribution p(x) at time t > 0 of the diffusion forward process, the relationship z = Ax breaks and so we are no longer guaranteed correct composition of pt(x). In fact, the distribution is still a GMM but places nearly all its weight on only two of the four clusters, namely: {(1 + a)e0 + e1, (1 + a)e2 + e3}. Intuitively, let us focus on the mode ae0 + e1 of p1 and consider how it interacts with the two modes e0, e2 of p0, at some time t > 0 when we have isotropic Gaussians centered at each mode. Since ae0 + e1 is further away from e2 (distance a2 + 2) than it is from e0 (distance a2 2a + 2), it is much less likely under N(e2, σt) than N(e0, σt), leading to a lower weight. This intuition is shown graphically in a 2D projection in Figure 16 (left). For the detailed proof, we actually want to ensure that p has full support even at t = 0 so we add a little bit of noise to it, but choose the covariance such that z = Ax still holds at t = 0. We begin by defining the distributions we will use for the counterexample. Definition K.1. For any choice of τ > 0, define the following counterexample distributions: p0 0(x) = 1 2N(x; e0, τ 2(AT A) 1) + 1 2N(x; e2, τ 2(AT A) 1) p0 1(x) = 1 2N(x; ae0 + e1, τ 2(AT A) 1) + 1 2N(x; ae2 + e3, τ 2(AT A) 1) p0 b(x) = N(x; 0, τ 2(AT A) 1), where A := 1 a 0 0 0 1 0 0 0 0 1 a 0 0 0 1 so that in the transformed space (A p)(z) := p(A 1z), z = Ax (A p0 b)(z) = N(z; 0, τ 2) (A p0 0)(z) = 1 2N(z; e0, τ 2) + 1 2N(z; e2, τ 2) (A p0 1)(z) = 1 2N(z; e1, τ 2) + 1 2N(z; e3, τ 2). The noised versions at time t > 0 are pt i(xt|x0) := N(xt; x0, σ2 t ) pt 0(x) = 1 2N(xt; e0, σ2 t I + τ 2(AT A) 1) + 1 2N(x; e2, σ2 t I + τ 2(AT A) 1) pt 1(x) = 1 2N(xt; ae0 + e1, σ2 t I + τ 2(AT A) 1) + 1 2N(x; ae2 + e3, σ2 t I + τ 2(AT A) 1). Next, we state some intermediate results we will need for the proof. Proposition K.2. A composition of two Gaussians with identical covariance (using a Gaussian background with zero mean and the same covariance) is a scaled Gaussian with the following parameters N(x; µ1; Σ)N(x; µ2, Σ) N(x; 0; Σ) = CN(x; µ1 + µ2, Σ), where C = exp(µT 1 Σ 1µ2). Mechanisms of Projective Composition N(x; µ1; Σ)N(x; µ2, Σ) N(x; 0; Σ) = (2π) n 2 (x µ1)T Σ 1(x µ1)e 1 2 (x µ2)T Σ 1(x µ2) 2(x µ1)T Σ 1(x µ1) 1 2(x µ2)T Σ 1(x µ2) + 1 2x T Σ 1x + x T Σ 1(µ1 + µ2) 1 2µT 1 Σ 1µ1 1 2µT 2 Σ 1µ2 2(x µ1 µ2)T Σ 1(x µ1 µ2) = CN(x; µ1 + µ2, Σ) 2µT 1 Σ 1µ1 1 2µT 2 Σ 1µ2 + 1 2(µ1 + µ2)T Σ 1(µ1 + µ2) = exp(µT 1 Σ 1µ2) Proposition K.3. With (pb, p1, p2) from Definition K.1, defining ˆpt(x) := C[pt b, pt 0, pt 1], we have that ˆp0(x), ˆpt(x), and Nt[ˆp0](x) are all Gaussian mixtures (GMs) with identical means: µ = {(1 + a)e0 + e1, e0 + ae2 + e3, ae0 + e2 + e1, (1 + a)e2 + e3}, and the following weights and covariances: ˆp0(x) : weights: w0 := 1 4[1, 1, 1, 1], covariance: Σ0 := τ 2(AT A) 1 ˆpt(x) : weights: wt := [1 2 ε, ε, ε, 1 2S( ξ), ξ := aσ2 t (a2 + 2)σ2 t τ 2 + σ4 t + τ 4 where S(z) := 1 e z + 1 (logistic function), covariance: Σt := σ2 t I + τ 2(AT A) 1 Nt[ˆp0](x) : weights: w0 := 1 4[1, 1, 1, 1], covariance: Σt := σ2 t I + τ 2(AT A) 1 Proof. We apply Proposition K.2 to the distributions of Definition K.1 to analyze ˆp0(x), ˆpt(x), and Nt[ˆp0](x). Proposition K.2 gives that all three distributions are Gaussian mixtures with identical means µ, and variances Σ0, Σt, and Σt, respectively. The weights for ˆp0(x) and Nt[ˆp0](x) are uniform (w0 = 1 4[1, 1, 1, 1]). We just need to calculate the weights for ˆpt(x). 
Mechanisms of Projective Composition First we compute the covariance and inverse covariance: Σt := σ2 t I + τ 2(AT A) 1 = σ2 t I + τ 2 1 + a2 a 0 0 a 1 0 0 0 0 1 + a2 a 0 0 a 1 Σt 1 = 1 (a2 + 2)σ2 t τ 2 + σ4 t + τ 4 σ2 t + τ 2 aτ 2 0 0 aτ 2 (a2 + 1)τ 2 + σ2 t 0 0 0 0 σ2 t + τ 2 aτ 2 0 0 aτ 2 (a2 + 1)τ 2 + σ2 t After some algebra (namely, computing C = exp(µT 1 Σt 1µ2) for each cluster), we find that ˆpt(x) : wt [exp(ξ), 1, 1, exp(ξ)], ξ := aσ2 t (a2 + 2)σ2 t τ 2 + σ4 t + τ 4 2 ε, ε, ε, 1 2 ε], ε := 1 2S( ξ), where S(z) := 1 e z + 1 (logistic function). The intuition for the proof of Lemma 6.3 will be that when ε is small, clusters (1,2) have much lower weight than clusters (0,3) in the GM. In that case, we can lower-bound the W2 distance by noting that since wt has almost no mass on the two of the clusters, we will need to move a little less than 1/4 probability over to those clusters. For example we need to move 1/4 probability onto cluster e0 +ae2 +e3 from either (1+a)e0 +e1 (L2 distance between means is 2a + 2) or (1+a)e2 +e3 (L2 distance 2). So overall we will have to move a bit less that 1/2 probability at least 2 distance. We restate the following results from Delon & Desolneux (2020), which we will need to help bound the W2 distance. Theorem K.4. (Mixture Wasserstein distance (Delon & Desolneux, 2020)) MW2(q0, q1) := inf γ Π(q0,q1) GMM2d( ) Z y0 y1 2dγ(y0, y1), MW 2 2 (q0, q1) = min c Π(w0,w1) k,l ck,l W 2 2 (qk 0, ql 1) (Delon Prop. 4), W2(q0, q1) MW2(q0, q1) W2(q0, q1) + 2 X k=1 wk i Tr(Σk i ) (Delon Prop. 6), where Π(q0, q1) denotes the set of all joint distributions with marginals q0 and q1, and GMMd( ) := K 0GMMd(K) denotes the set of all finite GMMs. We will also need one more standard fact about the W2 distance between Gaussians. Proposition K.5. (W2 distance between Gaussians; standard) W 2 2 (N(µx, Σx), N(µy, Σy)) = µx µy 2 2 + Tr(Σx + Σy 2(Σ 1 2x ) 1 2 ) µx µy 2 2. Proposition K.6. If p is a Gaussian mixture distribution of the form k=1 wi N(µk, Ck), x Rn then p is (n K) 1 2 -Lipschitz w.r.t Wasserstein 2-distance: W2(pt , pt) (n K) 1 2 |σt σt |, (that is, O(1)-Lipschitz, where O only hides constants depending on ambient dimension and number-of-clusters). Mechanisms of Projective Composition k=1 wi N(µk, Ck) k=1 wi N(µk, Ck + σ2 t I) W 2 2 (N(µx, Σx), N(µy, Σy)) := µx µy 2 2 + Tr(Σx + Σy 2(Σ 1 2x ) 1 2 ) := µx µy 2 2 + Σ 1 2y 2 F if Σx, Σy commute = W 2 2 (pt [k], pt[k]) = (Ck + σ2 t I) 1 2 (Ck + σ2 t I) 1 2 2 F = (Λ + σ2 t I) 1 2 (Λ + σ2 t I) 1 2 2 F , where Ck = UΛU T is eigendecomposition (σt σt )I 2 F , (by concavity of square root and Λ 0) = n(σt σt )2 W 2 2 (pt , pt) MW 2 2 (pt , pt) := min c Π(w,w) k,l ck,l W 2 2 (pt [k], pt[l]) k W 2 2 (pt [k], pt[k]), (since c = I Π(w, w)) n K(σt σt )2 = W2(pt , pt) (n K) 1 2 |σt σt |. Thus p is (n K) 1 2 -Lipschitz w.r.t. W2 distance. Proposition K.7. With (pb, p1, p2) from Definition K.1 and ˆpt(x) := C[pt b, pt 0, pt 1], the W2-distance between ˆpt and ˆp0 is bounded as follows (1 4ε) 1 2 4(τ 2(4 + 2a2) + 2σ2 t ) W2(ˆp0, ˆpt) (1 4ε)(1 + a) + 2(τ 2(4 + 2a2) + 2σ2 t ) 1 Proof. 
Using Proposition K.5, MW 2 2 (ˆp0, ˆpt) = min c Π(w0,wt) k,l ck,l W 2 2 (ˆp0[k], ˆpt[l]) = min c Π(w0,wt) k,l ck,l µk µl 2 2 + Tr( Σ0 + Σt 2( Σ 1 2 t ) 1 2 ) c(0 or 3),(1 or 2) = 1/4 ε 2 µ(0 or 3) µ(1 or 2) 2 2 2(1 + a) 0 Tr( Σ0 + Σt 2( Σ 1 2 t ) 1 2 ) Tr( Σ0 + Σt) = 2τ 2(4 + 2a2) + 4σ2 t = 1 4ε MW 2 2 (Nt[ˆp0], ˆpt) (1 4ε)(1 + a) + 2τ 2(4 + 2a2) + 4σ2 t = (1 4ε) 1 2 MW2(Nt[ˆp0], ˆpt) (1 4ε)(1 + a) + 2τ 2(4 + 2a2) + 4σ2 t 1 Above, we noted that any c Π(w0, wt) has to move at least 1 4 ε probability each away from indices 1 or 2 and onto indices either 0 or 3, and for any of these moves we can bound the squared L2 distance the mass must move between 2 and 2(1 + a). We also used simple bounds on the trace term. Mechanisms of Projective Composition We can use Theorem K.4 to bound the W2 distance by the MW2 distance: Using Theorem K.4 (Delon & Desolneux, 2020), W2(ˆp0, ˆpt) MW2(ˆp0, ˆpt) (1 4ε)(1 + a) + 2τ 2(4 + 2a2) + 4σ2 t 1 W2(ˆp0, ˆpt) MW2(ˆp0, ˆpt) 2 X k (w0[k]Tr( Σ0) + wt[k]Tr( Σt)) MW2(ˆp0, ˆpt) 2(Tr( Σ0) + Tr( Σt)) (1 4ε) 1 2 2(4σ2 t + 2τ 2(4 + 2a2)). Proposition K.8. With (pb, p1, p2) from Definition K.1 and ˆpt(x) := C[pt b, pt 0, pt 1], the W2-distance between ˆpt and Nt[ˆp0] is bounded as follows (1 4ε) 1 2 4(τ 2(4 + 2a2) + 4σ2 t ) W2(Nt[ˆp0], ˆpt) (1 4ε) 1 2 (1 + a) 1 2 . Proof. Using Proposition K.5, MW 2 2 (Nt[ˆp0], ˆpt) = min c Π(w0,wt) k,l ck,l W 2 2 (Nt[ˆp0][k], ˆpt[l]) = min c Π(w0,wt) k,l ck,l µk µl 2 2 c(0 or 3) (1 or 2) = 1/4 ε 2 µ(0 or 3) µ(1 or 2) 2 2 2(1 + a) = 1 4ε MW 2 2 (Nt[ˆp0], ˆpt) (1 4ε)(1 + a) = (1 4ε) 1 2 MW2(Nt[ˆp0], ˆpt) (1 4ε) 1 2 (1 + a) 1 2 . Using Theorem K.4 (Delon & Desolneux, 2020), W2(Nt[ˆp0], ˆpt) MW2(Nt[ˆp0], ˆpt) (1 4ε) 1 2 (1 + a) 1 2 W2(Nt[ˆp0], ˆpt) MW2(Nt[ˆp0], ˆpt) 2 X k (w0[k] + wt[k])Tr( Σt)) MW2(Nt[ˆp0], ˆpt) 4Tr( Σt) (1 4ε) 1 2 4(4σ2 t + τ 2(4 + 2a2)). Now we have all the pieces to prove Lemma 6.3. Proof. (Lemma 6.3) We will show that the distributions (pb, p1, p2) of Definition K.1 satisfy Lemma 6.3. We make the choices a = 1 and σt := t for simplicity, and define ˆpt(x) := C[pt b, pt 0, pt 1]. Lemma 6.1 applied to the distributions of Definition K.1 implies that at time t = 0, ˆp0(x) := ˆC[ p] := p0 0(x)p0 1(x) p0 b(x) = p0 0(x|(0,2))p0 1(x|(1,3)). Mechanisms of Projective Composition That is, we have a projective composition at time t = 0. For Part 1 of Lemma 6.3, we note that the pi are Gaussian mixtures, therefore Proposition K.6 gives that each pi is O(1)-Lipschitz w.r.t Wasserstein 2-distance: i : W2(pt i, pt i ) For Part 2, we need to bound the Lipschitz constant of ˆpt := ˆC[ pt]. Proposition K.7 gives: (1 4ε) 1 2 4(τ 2(4 + 2a2) + 2σ2 t ) W2(ˆp0, ˆpt) (1 4ε)(1 + a) + 2(τ 2(4 + 2a2) + 2σ2 t ) 1 where ε := 1 2S( ξ), ξ := aσ2 t (a2 + 2)σ2 t τ 2 + σ4 t + τ 4 , with S(z) := 1 e z + 1. Plugging in a = 1, (1 4ε) 1 2 24τ 2 8σ2 t W2(ˆp0, ˆpt) 2(1 4ε) + 12τ 2 + 4σ2 t 1 where ε := 1 2S( ξ), ξ := σ2 t 3σ2 t τ 2 + σ4 t + τ 4 . We will to show that t, t : 1 2τ 1|t t | W2(qt, qt ) 2τ 1|t t |. 
After some algebra, we find that for any fixed τ, the minimum and maximum of ε are σt 0 = ξ 0 = ε 1 σt = τ = ξ = ξ (τ) := 1 5τ 2 = ε = 1 2S ( ξ (τ)) (min) σt = τ 0 = ξ (τ) = ε 0 (min) Thus, taking σt = τ (the minimizer of ε) and choosing any τ 2 < 1 66 (somewhat arbitrarily, but small enough), gives σt = τ = (1 4ε) 1 2 32τ 2 W2(ˆp0, ˆpt) 2(1 4ε) + 32τ 2 1 2 , where ε := 1 66 = ε 10 6 1 4τ 2 = (1 4ε) 1 2 1 τ 2 = 0.5 1 33τ 2 W2(ˆp0, ˆpt) 2 + 32τ 2 1 Thus, for any choice of τ 2 < 1 66, if we take σt := t = τ and t = 0, we have as desired that 0.5τ 1|t| 0.5 W2(ˆp0, ˆpt) 2 2τ 1|t|, that is, t, t : W2(ˆpt , ˆpt) = Θ(τ 1|t t |). We can also prove another lemma using the same counterexample as Lemma 6.3: Lemma K.9. Let qt denote the composed distribution at time t: qt := C[ pt], and Nt be the Gaussian-noising operator. There exist distributions {pb, p1, . . . pk} over Rn and a value of t such that Nt[q0] (the ideal diffusion path to q0) differs from qt (the path actually followed) by at least Ω(1): t : W2(Nt[q0], qt) 1 Mechanisms of Projective Composition Proof. We will show that the distributions of Definition K.1 satisfy Lemma K.9. We make the choices a = 1 and σt := t for simplicity. By Proposition K.8, (1 4ε) 1 2 4(τ 2(4 + 2a2) + 4σ2 t ) W2(Nt[ˆp0], ˆpt) (1 4ε) 1 2 (1 + a) 1 2 where ε := 1 2S( ξ), ξ := aσ2 t (a2 + 2)σ2 t τ 2 + σ4 t + τ 4 , with S(z) := 1 e z + 1. Taking a = 1: (1 4ε) 1 2 24τ 2 16σ2 t W2(Nt[ˆp0], ˆpt) 2(1 4ε) 1 2 where ε := 1 2S( ξ), ξ := σ2 t 3σ2 t τ 2 + σ4 t + τ 4 . For any fixed τ, the minimum and maximum of ε are σt 0 = ξ 0 = ε 1 σt = τ = ξ = ξ (τ) := 1 5τ 2 = ε = 1 2S ( ξ (τ)) (min) σt = τ 0 = ξ (τ) = ε 0 (min) First we want to show that t : W2(Nt[q0], qt) 1 Taking σt = τ (the minimizer of ε) and choosing any τ 2 < 1 82 (somewhat arbitrarily, but small enough), gives σt = τ = (1 4ε) 1 2 40τ 2 W2(Nt[ˆp0], ˆpt) 2(1 4ε) 1 2 , where ε := 1 82 = ε 10 7 1 4τ 2 = (1 4ε) 1 2 1 τ 2 = 0.5 1 41τ 2 W2(ˆp0, ˆpt) Thus, fixing a τ 2 < 1 82 and taking σt := t = τ, we have as desired that W2(Nt[ˆp0], ˆpt) 0.5. The proof is now complete, but we can make one more interesting observation. The bound above was obtained by choosing a small value of τ, but the diffusion path (specifically, for distributions of the form of Definition K.1) is much less problematic for larger τ: t : W2(Nt[ˆp0], ˆpt) O(τ 1). That is, even for our counterexample distributions, diffusion can still approximately work to sample from the composition ˆp0, if τ is large enough. To see this, we note that t : W2(Nt[ˆp0], ˆpt) 2(1 4ε) 1 2 where ε := 1 2S( ξ), ξ := σ2 t 3σ2 t τ 2 + σ4 t + τ 4 2S ( ξ (τ)) , ξ (τ) := 1 5τ 2 for any fixed τ = W2(Nt[ˆp0], ˆpt) 2(1 2S ( ξ (τ))) 1 2