# Compositional Sculpting of Iterative Generative Processes

Timur Garipov¹, Sebastiaan De Peuter², Ge Yang¹,⁴, Vikas Garg²,⁡, Samuel Kaski²,³, Tommi Jaakkola¹

ΒΉMIT CSAIL Β²Aalto University Β³University of Manchester ⁴Institute for Artificial Intelligence and Fundamental Interactions ⁡YaiYai Ltd

Correspondence to Timur Garipov (timur@csail.mit.edu). 37th Conference on Neural Information Processing Systems (NeurIPS 2023).

High training costs of generative models and the need to fine-tune them for specific tasks have created a strong interest in model reuse and composition. A key challenge in composing iterative generative processes, such as GFlowNets and diffusion models, is that to realize the desired target distribution, all steps of the generative process need to be coordinated and satisfy delicate balance conditions. In this work, we propose Compositional Sculpting: a general approach for defining compositions of iterative generative processes. We then introduce a method for sampling from these compositions built on classifier guidance. We showcase ways to accomplish compositional sculpting in both GFlowNets and diffusion models. We highlight two binary operations, the harmonic mean ($p_1 \otimes p_2$) and the contrast ($p_1 \mathbin{◑} p_2$) between pairs of distributions, and the generalization of these operations to multiple component distributions. We offer empirical results on image and molecular generation tasks. Project codebase: https://github.com/timgaripov/compositional-sculpting.

## 1 Introduction

Large-scale general-purpose pre-training of machine learning models has produced impressive results in computer vision [1–3], image generation [4–6], natural language processing [7–11], robotics [12–14], and basic sciences [15]. By distilling vast amounts of data, such models can produce powerful inferences that lead to emergent capabilities beyond the specified training objective [16]. However, generic pre-trained models are often insufficient for specialized tasks in engineering and basic sciences. Field adaptation via techniques such as explicit fine-tuning on bespoke datasets [17], human feedback [18], or cleverly designed prompts [19, 20] is therefore often required.

Alternatively, the capabilities of pre-trained models can be utilized and extended via model composition. Compositional generation [21–28] views a complex target distribution in terms of simpler pre-trained building blocks which can be mixed and matched into a tailored solution to a specialized task. Given a set of base models capturing different properties of the data, composition provides a way to fuse these models into a single composite model with capacity greater than any individual base model. In this way it allows one to specify distributions over examples that exhibit multiple desired properties simultaneously [22]. The need to construct complex distributions adhering to multiple constraints arises in numerous practical multi-objective design problems such as molecule generation [29–31]. In this context, compositional modeling provides mechanisms for controlling the resulting distribution and exploring different trade-offs between the objectives and constraints.

Prior work on generative model composition [21, 22, 24] has developed operations for piecing together Energy-Based Models (EBMs) via algebraic manipulations of their energy functions. For example, consider two distributions $p_1(x) \propto \exp(-E_1(x))$ and $p_2(x) \propto \exp(-E_2(x))$ induced by energy functions $E_1$ and $E_2$.
Figure 1: Composition operators. (a, b) Base distributions $p_1$ and $p_2$. (c) Harmonic mean of $p_1$ and $p_2$. (d) Contrast of $p_1$ with $p_2$. (e) Reverse contrast $p_1 \mathbin{◐} p_2$. Lines show the contours of the PDF level sets.

Their product $p_{\text{prod}}(x) \propto p_1(x)\,p_2(x) \propto \exp(-(E_1(x) + E_2(x)))$ and negation $p_{\text{neg}}(x) \propto p_1(x)/(p_2(x))^{\gamma} \propto \exp(-(E_1(x) - \gamma E_2(x)))$, $\gamma > 0$, correspond to operations on the underlying energy functions. The product assigns high likelihood to points $x$ that have high likelihood under both base distributions, but assigns low likelihood to points that have close-to-zero likelihood under one (or both). The negation distribution assigns high likelihood to points that are likely under $p_1$ but unlikely under $p_2$, and assigns low likelihood to points that are likely under $p_2$ but unlikely under $p_1$.

Iterative generative processes, including diffusion models [5, 32–34] and GFlowNets [35, 36], progressively refine coarse objects into cleaner ones over multiple steps. Realizing effective compositions of these models is complicated by the fact that simple alterations in their generation processes result in non-trivial changes in the distributions of the final objects. Unlike for EBMs, products and negations of diffusion models cannot be realized through simple algebraic operations on their score functions. Du et al. [28] show that the sum of the score functions is not equal to the score of the diffused product distribution, and develop a method that corrects sum-of-scores sampling via additional MCMC steps nested under each step of the diffusion time loop.

Jain et al. [31] develop Multi-Objective GFlowNets (MOGFNs), an extension of GFlowNets for multi-objective optimization tasks. While a vanilla GFlowNet captures a distribution induced by a single reward (objective) function, $p(x) \propto R(x)$ (see Section 2 for details), a MOGFN aims to learn a single conditional model that can realize distributions corresponding to various combinations (e.g. convex combinations) of multiple reward functions. Though a single MOGFN realizes a spectrum of compositions of base reward functions, the approach assumes access to the base rewards at training time. Moreover, MOGFNs require the set of possible composition operations to be specified at training time. In this work, we address post hoc composition of pre-trained GFlowNets (or diffusion models) and provide a way to create compositions that need not be specified in advance.

In this work, we introduce Compositional Sculpting, a general approach for the composition of pre-trained models. We highlight two special examples of binary operations: the harmonic mean ($p_1 \otimes p_2$) and the contrast ($p_1 \mathbin{◑} p_2$). More general compositions are obtained as conditional distributions in a probabilistic model constructed on top of pre-trained base models. We show that these operations can be realized via classifier guidance. We provide results of empirical verification of our method on molecular generation (with GFlowNets) and image generation (with diffusion models).

## 2 Background

**Generative flow networks (GFlowNets).** GFlowNets [35, 36] are an approach for generating structured objects (e.g. graphs) from a discrete space $\mathcal{X}$. Given a reward function $R(x) \geq 0$, a GFlowNet seeks to sample from $p(x) = R(x)/Z$, where $Z = \sum_{x \in \mathcal{X}} R(x)$, i.e. the model assigns larger probabilities to high-reward objects. Starting at a fixed initial state $s_0$, objects $x$ are generated through a sequence of changes corresponding to a trajectory of incomplete states $\tau = (s_0 \to s_1 \to \dots \to s_{n-1} \to x)$. The structure of possible trajectories is described by a DAG $(\mathcal{S}, \mathcal{E})$, where $\mathcal{S}$ is a set of states (both complete and incomplete) and $\mathcal{E}$ is the set of directed edges (actions) $s \to s'$. The set of complete objects (terminal states) $\mathcal{X}$ is a subset of $\mathcal{S}$. The generation process starts at $s_0$ and follows a parameterized stochastic forward policy $P_F(s'|s; \theta)$, which for each state $s$ specifies a probability distribution over all possible successor states $s'$ with $(s \to s') \in \mathcal{E}$. The process terminates once a terminal state is reached.
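To make the sequential generation process concrete, below is a minimal NumPy sketch of a rollout in a grid domain of the kind used in the experiments of Section 6. The uniform forward policy and the small grid size here are illustrative placeholders, not a trained $P_F$:

```python
import numpy as np

rng = np.random.default_rng(0)
H = 8  # grid size (illustrative; the experiments in Section 6 use a 32x32 grid)

def forward_policy(state):
    """Toy stand-in for a learned forward policy P_F(s'|s): returns the
    actions allowed by the grid DAG and a distribution over them."""
    r, c = state
    actions = ["terminate"]
    if c < H - 1:
        actions.append("right")
    if r < H - 1:
        actions.append("down")
    probs = np.ones(len(actions)) / len(actions)  # uniform, for illustration
    return actions, probs

def sample_trajectory():
    """Roll out s_0 -> s_1 -> ... -> x by following the forward policy
    until the 'terminate' action marks the current state as terminal."""
    state, traj = (0, 0), [(0, 0)]
    while True:
        actions, probs = forward_policy(state)
        a = rng.choice(actions, p=probs)
        if a == "terminate":
            return traj, state  # full trajectory tau and terminal object x
        r, c = state
        state = (r, c + 1) if a == "right" else (r + 1, c)
        traj.append(state)

tau, x = sample_trajectory()
print("trajectory:", tau, "-> terminal state x =", x)
```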
**Diffusion models.** Diffusion models [5, 32–34, 37] are a family of generative models developed for continuous domains. Given an empirical (data) distribution $p(x) = \frac{1}{n} \sum_i \delta_{x_i}(x)$ in $\mathcal{X} = \mathbb{R}^d$, diffusion models seek to approximate $p(x)$ via a learned generative process. A diffusion process is a noising process that gradually destroys the original clean data $x$. Viewed as a stochastic differential equation (SDE) [34], it is a time-indexed collection of random variables $\{x_t\}_{t=0}^{T}$ in $\mathcal{X} = \mathbb{R}^d$ which interpolates between the data distribution $p_0(x) = p(x)$ at $t = 0$ and a prior distribution $p_T(x)$ at $t = T$. The evolution of $x_t$ is described by the forward SDE $dx_t = f_t(x_t)\,dt + g_t\,dw_t$ with drift coefficient $f_t : \mathbb{R}^d \to \mathbb{R}^d$ and diffusion coefficient $g_t \in \mathbb{R}$. Here, $w_t$ is the standard Wiener process. Crucially, the coefficients $f_t, g_t$ are generally chosen such that the prior $p_T$ and the transition probabilities $p_{st}(x_t|x_s)$, $0 \leq s < t \leq T$, have a closed form (see [34]). Song et al. [34] invoke a result from the theory of stochastic processes [38] which gives an expression for the reverse-time process, or backward SDE: $dx_t = [f_t(x_t) - g_t^2 \nabla_x \log p_t(x_t)]\,dt + g_t\,d\bar{w}_t$, where $\bar{w}_t$ is the standard Wiener process in reversed time. This SDE involves the known coefficients $f_t, g_t$ and the unknown score function $\nabla_x \log p_t(\cdot)$ of the marginal distribution $p_t(\cdot)$ at time $t$. A score network $s_t(x; \theta)$ (a deep neural network with parameters $\theta$) is trained to approximate $\nabla_x \log p_t(x)$. Once $s_t(\cdot\,; \theta)$ is trained, sampling reduces to numerical integration of the backward SDE.

**Classifier guidance in diffusion models.** Classifier guidance [32, 39] is a technique for controllable generation in diffusion models. Suppose that each example $x_0$ is accompanied by a discrete class label $y$. The goal is to sample from the conditional distribution $p_0(x_0|y)$. Bayes' rule, $p_t(x_t|y) \propto p_t(x_t)\,p_t(y|x_t)$, implies the score-function decomposition $\nabla_{x_t} \log p_t(x_t|y) = \nabla_{x_t} \log p_t(x_t) + \nabla_{x_t} \log p_t(y|x_t)$, where the first term is already approximated by a pre-trained unconditional diffusion model and the second term can be derived from a time-dependent classifier $p_t(y|x_t)$. Therefore, the stated goal can be achieved by first training the classifier $p_t(y|x_t)$ on noisy samples $x_t$ from the intermediate steps of the process, and then plugging the expression for the conditional score into the backward SDE sampling process [34].
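As a sanity check of the score decomposition above, the following sketch compares the guided score $\nabla_x \log p_t(x) + \nabla_x \log p_t(y|x)$ against the exact conditional score in a two-class Gaussian toy model at a fixed noise level. All densities here are analytic stand-ins for the learned score network and the time-dependent classifier:

```python
import numpy as np

# Two-class toy at a fixed noise level t: p_t(x|y=i) = N(mu_i, sigma^2),
# p_t(y=i) = 1/2. Everything below is available in closed form.
mu, sigma = np.array([-2.0, 2.0]), 1.0

def gauss(x, m):  # N(m, sigma^2) density
    return np.exp(-(x - m) ** 2 / (2 * sigma**2)) / np.sqrt(2 * np.pi * sigma**2)

def score_marginal(x):  # grad_x log p_t(x) for the mixture 0.5 N(mu_0) + 0.5 N(mu_1)
    p = 0.5 * gauss(x, mu[0]) + 0.5 * gauss(x, mu[1])
    dp = 0.5 * gauss(x, mu[0]) * (mu[0] - x) / sigma**2 \
       + 0.5 * gauss(x, mu[1]) * (mu[1] - x) / sigma**2
    return dp / p

def grad_log_classifier(x, y):  # grad_x log p_t(y|x); classifier from Bayes' rule
    eps = 1e-4
    def logq(z):
        return np.log(gauss(z, mu[y]) / (gauss(z, mu[0]) + gauss(z, mu[1])))
    return (logq(x + eps) - logq(x - eps)) / (2 * eps)  # numeric grad for brevity

x = 0.7
guided = score_marginal(x) + grad_log_classifier(x, y=1)
exact = (mu[1] - x) / sigma**2  # grad_x log p_t(x|y=1) for the Gaussian class
print(guided, exact)  # the two agree up to finite-difference error
```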
## 3 Related Work

**Generative model composition.** In Section 1 we reviewed prior work on energy-based composition operations and Multi-Objective GFlowNets. Learning mixtures of Generative Adversarial Networks has been addressed in [40], where the mixture components are learned simultaneously, and in [41], where the components are learned one by one in an adaptive boosting fashion. Algorithms for additive and multiplicative boosting of generative models have been developed in [42]. This work focuses on the composition of pre-trained models. Assuming that each pre-trained model represents the distribution of examples demonstrating certain concepts (e.g. molecular properties), the composition of models is equivalent to concept composition (e.g. "property A AND property B"). The inverse problem is known as unsupervised concept discovery, where the goal is to automatically discover composable concepts from data. Unsupervised concept discovery and concept composition methods have been proposed for energy-based models [25] and text-to-image diffusion models [43].

**Controllable generation.** Generative model composition is a form of post-training control of the generation process, an established area of research. A simple approach to control is training a conditional generative model $p(x|c)$ on pairs $(x, c)$ of objects $x$ and conditioning information $c$. Annotations $c$ can be class labels [39], text prompts [4, 6, 44], semantic maps, or images [4]. Different from our work, this assumes that generation control operations are specified at training time. Dhariwal and Nichol [39] apply classifier guidance [32] on top of (conditional) diffusion models to improve their fidelity. Ho and Salimans [45] develop classifier-free guidance by combining conditional and unconditional score functions. In ControlNet [17], an additional network is trained to enable a pre-trained diffusion model to incorporate previously unavailable conditioning information. Meng et al. [46] and Couairon et al. [47] develop semantic image editing methods which first partially noise and then denoise an image to generate an edited version, possibly conditioned on a segmentation mask [47]. Similar to conditional diffusion models, conditional GFlowNets have been used to condition generation on reward exponents [36] or combinations of multiple predefined reward functions [31]. Note that the methods developed in this work can be combined with conditional diffusion models and GFlowNets: $p(x|c_1), \dots, p(x|c_m)$ can act as base generative models to be composed.

**Compositional generalization.** The notion of compositionality has a broad spectrum of interpretations across a variety of disciplines including linguistics, cognitive science, and philosophy. Hupkes et al. [48] collect a list of aspects of compositionality from linguistic and philosophical theories and design practical tests for neural language models. Conwell and Ullman [49] empirically examine the relational understanding of DALL-E 2 [50], a text-guided image generation model, and point out limitations in the model's ability to capture relations such as "in", "on", "hanging over", etc. In this work, we focus on a narrow but well-defined type of composition where we seek to algebraically compose probability densities in a controllable fashion, such that we can emphasize or de-emphasize regions in the data space where specific base distributions have high density.

**Connections between GFlowNets and diffusion models.** Our method is applicable to compositions of both GFlowNets and diffusion models. This is due to deep connections between these two model families. GFlowNets were initially developed for generating discrete (structured) data [36], and diffusion models were initially developed for continuous data [5, 32]. Lahlou et al. [51] develop an extension of GFlowNets for DAGs with continuous state-action spaces. Zhang et al. [52] point out unifying connections between GFlowNets and other generative model families, including diffusion models.
In this work, we articulate another aspect of the relation between GFlowNets and diffusion models: in Section 5.2 we derive expressions for mixture GFlowNet policies and classifier-guided GFlowNet policies analogous to those derived for diffusion models in [32, 39, 53, 54].

## 4 Compositional Sculpting of Generative Models

Suppose we have access to a number of pre-trained generative models $\{p_i(x)\}_{i=1}^{m}$ over a common domain $\mathcal{X}$. We may wish to compose these distributions such that we can, say, draw samples that are likely to arise from both $p_1(x)$ and $p_2(x)$, or that are likely to arise from $p_1(x)$ but not from $p_2(x)$. In other words, we wish to specify a distribution that we can shape to emphasize and de-emphasize specific base models.

### 4.1 Binary Composition Operations

Let us first focus on composing two base models. We could specify the composition as a weighted sum $p(x) = \sum_{i=1}^{2} \omega_i p_i(x)$ with weights $\omega_1, \omega_2 \geq 0$ summing to one. The weights determine the prevalence of each base model in the composition, but beyond that our control is limited. We cannot emphasize regions where $p_1$ and $p_2$ both have high density, or de-emphasize regions where $p_2$ has high density.

An alternative is to use conditioning to shape a prior $\tilde{p}(x)$ based on the base models. When we condition $x$ on some observation $y_1$, the resulting posterior takes the form $\tilde{p}(x|y_1) \propto \tilde{p}(y_1|x)\,\tilde{p}(x)$. Points $x$ that match $y_1$ according to $\tilde{p}(y_1|x)$ will have increased density, and the density of points that do not match it decreases. Intuitively, by defining $y_1 \in \{1, 2\}$ as the event that $x$ was generated by a specific base model, we can shape a prior $\tilde{p}(x)$ based on the densities of the base models. To this end we define a uniform prior over $y_k$ and define the conditional density $\tilde{p}(x|y_1 = i)$ to represent the fact that $x$ was generated from $p_i(x)$. This gives us the following model:

$$\tilde{p}(x|y_1{=}i) = p_i(x), \quad \tilde{p}(y_1{=}i) = \tfrac{1}{2}, \quad \tilde{p}(x) = \tilde{p}(x|y_1{=}1)\,\tilde{p}(y_1{=}1) + \tilde{p}(x|y_1{=}2)\,\tilde{p}(y_1{=}2). \quad (1)$$

Under this model, the prior $\tilde{p}(x)$ is a uniform mixture of the base models. The likelihood of $y_1$,

$$\tilde{p}(y_1{=}1|x) = 1 - \tilde{p}(y_1{=}2|x) = \frac{p_1(x)}{p_1(x) + p_2(x)}, \quad (2)$$

implied by this model tells us how likely it is that $x$ was generated by $p_1(x)$ rather than $p_2(x)$. In fact, it corresponds to the output of an optimal classifier trained to tell $p_1(x)$ and $p_2(x)$ apart.

Our goal is to realize compositions which generate samples likely to arise from both $p_1(x)$ and $p_2(x)$, or from $p_1(x)$ but not $p_2(x)$. Thus we introduce a second observation $y_2 \in \{1, 2\}$ such that $y_1$ and $y_2$ are independent and identically distributed given $x$. The resulting model and inferred posterior are:

$$\tilde{p}(x, y_1, y_2) = \tilde{p}(x) \prod_{k=1}^{2} \tilde{p}(y_k|x), \quad \tilde{p}(x) = \tfrac{1}{2} p_1(x) + \tfrac{1}{2} p_2(x), \quad \tilde{p}(y_k{=}i|x) = \frac{p_i(x)}{p_1(x) + p_2(x)}, \quad (3)$$

$$\tilde{p}(x|y_1{=}i, y_2{=}j) \propto \tilde{p}(x)\,\tilde{p}(y_1{=}i|x)\,\tilde{p}(y_2{=}j|x) \propto \frac{p_i(x)\,p_j(x)}{p_1(x) + p_2(x)}. \quad (4)$$

The above posterior shows clearly how conditioning on observations $y_1{=}i$, $y_2{=}j$ has shaped the prior mixture to accentuate regions of the posterior where the observed base models $i, j$ have high density. Conditioning on observations $y_1{=}1$ and $y_2{=}2$, or equivalently $y_1{=}2$, $y_2{=}1$, results in the posterior

$$(p_1 \otimes p_2)(x) = \tilde{p}(x|y_1{=}1, y_2{=}2) \propto \frac{p_1(x)\,p_2(x)}{p_1(x) + p_2(x)}. \quad (5)$$

We refer to this posterior as the "harmonic mean of $p_1$ and $p_2$", and denote it by the binary operation $p_1 \otimes p_2$. Its value is high only at points that have high likelihood under both $p_1(x)$ and $p_2(x)$ at the same time (Figure 1(c)).
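Since the harmonic mean (and the contrast, defined next) are expressed directly in terms of the base densities, they can be evaluated in closed form whenever those densities are available. The sketch below, a toy 1D example with Gaussian stand-ins for $p_1$ and $p_2$, evaluates eqs. (2), (5), and (6) on a grid:

```python
import numpy as np

# Toy check of eqs. (2), (5), (6): two Gaussian base densities on a 1D grid,
# the optimal classifier p(y1=1|x), and the harmonic-mean / contrast compositions.
x = np.linspace(-6, 6, 1001)

def normal(m, s):
    d = np.exp(-(x - m) ** 2 / (2 * s**2))
    return d / np.trapz(d, x)  # normalize on the grid

p1, p2 = normal(-1.0, 1.0), normal(1.5, 1.2)

post_y1 = p1 / (p1 + p2)           # eq. (2): p(y1=1|x)
hm = p1 * p2 / (p1 + p2)           # eq. (5): harmonic mean, unnormalized
contrast = p1**2 / (p1 + p2)       # eq. (6): contrast, unnormalized
hm /= np.trapz(hm, x)
contrast /= np.trapz(contrast, x)

# The harmonic mean peaks where both densities are high; the contrast
# shifts the mass of p1 away from the region where p2 is high.
print("argmax p1:", x[p1.argmax()], " argmax p2:", x[p2.argmax()])
print("argmax harmonic mean:", x[hm.argmax()], " argmax contrast:", x[contrast.argmax()])
```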
Thus, the harmonic mean is an alternative to the product operation for EBMs. The harmonic mean is commutative ($p_1 \otimes p_2 = p_2 \otimes p_1$) and is undefined when $p_1$ and $p_2$ have disjoint supports, since then the RHS of (5) is zero everywhere. Conditioning on observations $y_1{=}1$ and $y_2{=}1$ results in the posterior

$$(p_1 \mathbin{◑} p_2)(x) = \tilde{p}(x|y_1{=}1, y_2{=}1) \propto \frac{(p_1(x))^2}{p_1(x) + p_2(x)}. \quad (6)$$

We refer to this binary operation, providing an alternative to the negation operation in EBMs, as the "contrast of $p_1$ and $p_2$", and denote it by $p_1 \mathbin{◑} p_2$. The ratio (6) is high when $p_1(x)$ is high and $p_2(x)$ is low (Figure 1(d)). The contrast is not commutative ($p_1 \mathbin{◑} p_2 \neq p_2 \mathbin{◑} p_1$ unless $p_1 = p_2$). We denote the reverse contrast by $p_1 \mathbin{◐} p_2 = p_2 \mathbin{◑} p_1$. Appendix C provides a detailed comparison between the contrast and negation operations, and between the harmonic mean and product operations.

**Controlling the individual contributions of $p_1$ and $p_2$ to the composition.** In order to provide more control over the extent of the individual contributions of $p_1$ and $p_2$ to the composition, we modify model (3). Specifically, we introduce an interpolation parameter $\alpha$ and change the likelihood of observation $y_2$ in (3):

$$\tilde{p}(y_2{=}i|x; \alpha) = \frac{(\alpha p_1(x))^{[i=1]}\,((1-\alpha) p_2(x))^{[i=2]}}{\alpha p_1(x) + (1-\alpha) p_2(x)},$$

where $\alpha \in (0, 1)$ and $[\cdot]$ denotes the indicator function. Conditional distributions in this model give the harmonic interpolationΒ² and the parameterized contrast:

$$(p_1 \otimes_{(1-\alpha)} p_2)(x) \propto \frac{p_1(x)\,p_2(x)}{\alpha p_1(x) + (1-\alpha) p_2(x)}, \qquad (p_1 \mathbin{◑}_{(1-\alpha)} p_2)(x) \propto \frac{(p_1(x))^2}{\alpha p_1(x) + (1-\alpha) p_2(x)}. \quad (7)$$

Β²The harmonic interpolation approaches $p_1$ as $\alpha \to 0$ and $p_2$ as $\alpha \to 1$.

**Operation chaining.** As the operations we have introduced result in proper distributions, we can create new $N$-ary operations by chaining binary (and $N$-ary) operations together. For instance, chaining binary harmonic means gives the harmonic mean of three distributions:

$$((p_1 \otimes p_2) \otimes p_3)(x) = (p_1 \otimes (p_2 \otimes p_3))(x) \propto \frac{p_1(x)\,p_2(x)\,p_3(x)}{p_1(x)\,p_2(x) + p_1(x)\,p_3(x) + p_2(x)\,p_3(x)}. \quad (8)$$

### 4.2 Compositional Sculpting: General Approach

The above approach for realizing compositions of two base models can be generalized to compositions of $m$ base models $p_1(x), \dots, p_m(x)$ controlled by $n$ observations. Though operator chaining can also realize compositions of $m$ base models, our generalized method allows us to specify compositions more flexibly and results in different compositions. We introduce an augmented probabilistic model $\tilde{p}(x, y_1, \dots, y_n)$ as a joint distribution over the original objects $x \in \mathcal{X}$ and $n$ observation variables $y_1, \dots, y_n \in \mathcal{Y}$, where $\mathcal{Y} = \{1, \dots, m\}$. By defining appropriate conditionals $\tilde{p}(y_k|x)$ we can controllably shape a prior $\tilde{p}(x)$ into a posterior $\tilde{p}(x|y_1, \dots, y_n)$.

As in the binary case, we propose to use a uniformly-weighted mixture of the base models, $\tilde{p}(x) = \frac{1}{m} \sum_{i=1}^{m} p_i(x)$. The support of this mixture is the union of the supports of the base models: $\bigcup_{i=1}^{m} \operatorname{supp}\{p_i(x)\} = \operatorname{supp}\{\tilde{p}(x)\}$. This is essential, as the prior can only be shaped in places where it has non-zero density. As before, we define the conditionals $\tilde{p}(y_k{=}i|x)$ to correspond to the observation that $x$ was generated by base model $i$. The resulting full model is

$$\tilde{p}(x, y_1, \dots, y_n) = \tilde{p}(x) \prod_{k=1}^{n} \tilde{p}(y_k|x), \quad \tilde{p}(x) = \frac{1}{m} \sum_{i=1}^{m} p_i(x), \quad \tilde{p}(y_k{=}i) = \frac{1}{m}, \quad \tilde{p}(y_k{=}i|x) = \frac{p_i(x)}{\sum_{j=1}^{m} p_j(x)}. \quad (9)$$

Note that under this model the mixture can be represented as $\tilde{p}(x) = \sum_{y_k=1}^{m} \tilde{p}(x|y_k)\,\tilde{p}(y_k)$ for any $k$.
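The posterior this model induces is derived next; as a preview, the following sketch implements the model (9) on a small discrete space and sculpts the uniform mixture with an arbitrary list of observations. The base densities are made-up toy vectors, and model indices are 0-based here rather than 1-based as in the text:

```python
import numpy as np

# Sketch of the model in eq. (9): given observations y_1 = i_1, ..., y_n = i_n,
# the (unnormalized) posterior is p~(x) * prod_k p(y_k = i_k | x), with
# p(y = i | x) = p_i(x) / sum_j p_j(x).
def sculpt(base_densities, observations):
    P = np.stack(base_densities)             # shape (m, |X|); rows sum to 1
    prior = P.mean(axis=0)                   # uniform mixture p~(x)
    lik = P / P.sum(axis=0, keepdims=True)   # p(y = i | x), shape (m, |X|)
    post = prior.copy()
    for i in observations:                   # multiply in each observation y_k = i
        post *= lik[i]
    return post / post.sum()

p1 = np.array([0.50, 0.30, 0.15, 0.05])
p2 = np.array([0.05, 0.30, 0.50, 0.15])
p3 = np.array([0.10, 0.10, 0.30, 0.50])

print(sculpt([p1, p2, p3], [0, 1]))      # emphasize the overlap of p1 and p2
print(sculpt([p1, p2, p3], [0, 0, 0]))   # repeated observations sharpen toward p1-but-not-others
```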
The inferred posterior over $x$ for this model is

$$\tilde{p}(x|y_1{=}i_1, \dots, y_n{=}i_n) \propto \tilde{p}(x)\,\tilde{p}(y_1{=}i_1, \dots, y_n{=}i_n|x) = \tilde{p}(x) \prod_{k=1}^{n} \tilde{p}(y_k{=}i_k|x), \quad (10)$$

and the same posterior can be formed by introducing the observations one at a time:

$$\tilde{p}(x|y_1, \dots, y_k) \propto \tilde{p}(x|y_1, \dots, y_{k-1})\,\tilde{p}(y_k|x), \quad k = 1, \dots, n. \quad (11)$$

The posterior $\tilde{p}(x|y_1{=}i_1, \dots, y_n{=}i_n)$ is a composition of the distributions $\{p_i(x)\}_{i=1}^{m}$ that can be adjusted by choosing values for $y_1, \dots, y_n$. By adding or omitting an observation $y_k = i$ we can sculpt the posterior to our liking, emphasizing or de-emphasizing regions of $\mathcal{X}$ where $p_i$ has high density. The observations can be introduced with multiplicities (e.g., $y_1 = 1$, $y_2 = 1$, $y_3 = 2$) to further strengthen the effect. Moreover, one can choose to introduce all observations simultaneously as in (10) or sequentially as in (11). As we show below (Section 5.1 for GFlowNets; Appendix A.2 for diffusion models), the composition (10) can be realized by a sampling policy that can be expressed as a function of the pre-trained (base) sampling policies.

**Special instances and general formulation.** The general approach outlined in this section is not limited to the choices we made to construct the model in equation (9): $\tilde{p}(x)$ does not have to be a uniformly weighted mixture of the base distributions, $y_1, \dots, y_n$ do not have to be independent and identically distributed given $x$, and different choices of the likelihood $\tilde{p}(y{=}i|x)$ are possible. For instance, in the model for the parameterized operations (7) the likelihoods of observations $\tilde{p}(y_1|x)$ and $\tilde{p}(y_2|x)$ differ.

## 5 Compositional Sculpting of Iterative Generative Processes

In this section, we show how to apply the model above to compose GFlowNets, and how one can use classifier guidance to sample from the composition. The analogous method for diffusion model composition is described in Appendix A.2.

### 5.1 Composition of GFlowNets

Besides a sample $x$ from $p_i(x)$, a GFlowNet also generates a trajectory $\tau$ which ends in the state $x$. Thus, we extend the model $\tilde{p}(x, y_1, \dots, y_n)$ described above and introduce $\tau$ as a variable with conditional distribution $\tilde{p}(\tau|y_k{=}i) = \prod_{t=0}^{|\tau|-1} p_{i,F}(s_{t+1}|s_t)$, where $p_{i,F}$ is the forward policy of the GFlowNet that samples from $p_i$.

Our approach for sampling from the composition is conceptually simple. Given $m$ base GFlowNets that sample from $p_1, \dots, p_m$ respectively, we start by defining the prior $\tilde{p}(x)$ as the uniform mixture of these GFlowNets. Proposition 5.1 shows that this mixture can be realized by a policy constructed from the forward policies of the base GFlowNets. We then apply classifier guidance to this mixture to sample from the composition. Proposition 5.2 shows that classifier guidance results in a new policy which can be constructed directly from the GFlowNet being guided.

**Proposition 5.1 (GFlowNet mixture policy).** Suppose distributions $p_1(x), \dots, p_m(x)$ are realized by GFlowNets with forward policies $p_{1,F}(\cdot|\cdot), \dots, p_{m,F}(\cdot|\cdot)$. Then, the mixture distribution $p_M(x) = \sum_{i=1}^{m} \omega_i p_i(x)$ with $\omega_1, \dots, \omega_m \geq 0$ and $\sum_{i=1}^{m} \omega_i = 1$ is realized by the GFlowNet forward policy

$$p_{M,F}(s'|s) = \sum_{i=1}^{m} p(y{=}i|s)\,p_{i,F}(s'|s), \quad (12)$$

where $y$ is a random variable such that the joint distribution of a GFlowNet trajectory $\tau$ and $y$ is given by $p(\tau, y{=}i) = \omega_i p_i(\tau)$ for $i \in \{1, \dots, m\}$.
**Proposition 5.2 (GFlowNet classifier guidance).** Consider a joint distribution $p(x, y)$ over a discrete space such that the marginal $p(x)$ is realized by a GFlowNet with forward policy $p_F(\cdot|\cdot)$. Assume that the joint distribution of $x$, $y$, and GFlowNet trajectories $\tau = (s_0 \to \dots \to s_n = x)$ decomposes as $p(\tau, x, y) = p(\tau, x)\,p(y|x)$, i.e. $y$ is independent of the intermediate states $\{s_i\}_{i=0}^{n-1}$ in $\tau$ given $x$. Then,

1. For all non-terminal nodes $s$, the probabilities $p(y|s)$ satisfy

$$p(y|s) = \sum_{s' : (s \to s') \in \mathcal{E}} p_F(s'|s)\,p(y|s'). \quad (13)$$

2. The conditional distribution $p(x|y)$ is realized by the classifier-guided policy

$$p_F(s'|s, y) = p_F(s'|s)\,\frac{p(y|s')}{p(y|s)}. \quad (14)$$

Note that (13) ensures that $\sum_{s' : (s \to s') \in \mathcal{E}} p_F(s'|s, y) = 1$.

Proposition 5.1 is analogous to results on mixtures of diffusion models (Theorem 1 of Peluchetti [53], Theorem 1 of Lipman et al. [54]). Proposition 5.2 is analogous to classifier guidance for diffusion models [32, 39]. To the best of our knowledge, our work is the first to derive both results for GFlowNets. Theorem 5.3 summarizes our approach. The propositions and the theorem are proved in Appendix D.

**Theorem 5.3.** Suppose distributions $p_1(x), \dots, p_m(x)$ are realized by GFlowNets with forward policies $p_{1,F}(\cdot|\cdot), \dots, p_{m,F}(\cdot|\cdot)$ respectively. Let $y_1, \dots, y_n$ be random variables defined by (9). Then, the conditional $\tilde{p}(x|y_1, \dots, y_n)$ is realized by the forward policy

$$\tilde{p}_F(s'|s, y_1, \dots, y_n) = \frac{\tilde{p}(y_1, \dots, y_n|s')}{\tilde{p}(y_1, \dots, y_n|s)} \sum_{i=1}^{m} p_{i,F}(s'|s)\,\tilde{p}(y{=}i|s). \quad (15)$$

Note that the result of conditioning on observations $y_1, \dots, y_n$ is just another GFlowNet policy. Therefore, to condition on more observations, one can apply classifier guidance repeatedly.
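The next subsection describes how the classifier probabilities appearing in (15) are learned; given such a classifier, a guided step is a simple reweighting of the mixture policy. The sketch below evaluates eqs. (12)–(15) at a single state with made-up policy and classifier values standing in for trained models:

```python
import numpy as np

# One guided step, eqs. (12)-(15), at a state s with three possible successors.
# The base policies and classifier outputs are made-up numbers; in practice a
# trained classifier Q_phi supplies p(y=i|s) and p(y_1,...,y_n | s').
base_policies = np.array([[0.7, 0.2, 0.1],    # p_{1,F}(s'|s)
                          [0.1, 0.3, 0.6]])   # p_{2,F}(s'|s)
p_y_given_s = np.array([0.5, 0.5])            # p(y=i|s)
p_obs_given_succ = np.array([0.10, 0.30, 0.20])  # p(y1=i1,...,yn=in | s')

mixture = p_y_given_s @ base_policies         # eq. (12): mixture forward policy
guided = mixture * p_obs_given_succ           # numerator of eq. (15)
guided /= guided.sum()                        # by eq. (13), the sum equals p(obs|s)
print("mixture policy:", mixture)
print("guided policy :", guided)
```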
### 5.2 Classifier Training (GFlowNets)

The evaluation of policy (15) requires knowledge of the probabilities $\tilde{p}(y_1, \dots, y_n|s)$ and $\tilde{p}(y|s)$. These probabilities can be estimated by a classifier fitted to trajectories sampled from the base GFlowNets. Let $Q_\phi(y_1, \dots, y_n|s)$ be a classifier with parameters $\phi$ that we wish to train to approximate the ground-truth conditional $\tilde{p}(y_1, \dots, y_n|s)$. Under model (9), $y_1, \dots, y_n$ are dependent given a non-terminal state $s \in \mathcal{S} \setminus \mathcal{X}$, but are independent given a terminal state $x \in \mathcal{X}$. This motivates separate treatment of terminal and non-terminal states.

**Learning the terminal state classifier.** For a terminal state $x$, the variables $y_1, \dots, y_n$ are independent, hence we can use the factorization $Q_\phi(y_1, \dots, y_n|x) = \prod_{k=1}^{n} Q_\phi(y_k|x)$. Moreover, all distributions on the r.h.s. must be the same, i.e. it is enough to learn just $Q_\phi(y_1|x)$. This marginal classifier can be learned by minimizing the cross-entropy loss

$$\mathcal{L}_T(\phi) = \mathbb{E}_{(\hat{x}, \hat{y}_1) \sim \tilde{p}(y_1)\,\tilde{p}(x|y_1)} \left[ -\log Q_\phi(y_1{=}\hat{y}_1 \,|\, x{=}\hat{x}) \right]. \quad (16)$$

Here $\hat{y}_1$ is sampled from $\tilde{p}(y_1)$, which is uniform under our choice of $\tilde{p}(x)$. Then, $\hat{x}\,|\,(y_1{=}\hat{y}_1)$ is generated from the base GFlowNet $p_{\hat{y}_1}$, since (9) implies that $\tilde{p}(x|y_1{=}\hat{y}_1) = p_{\hat{y}_1}(x)$.

**Learning the non-terminal state classifier.** Given a non-terminal state $s \in \mathcal{S} \setminus \mathcal{X}$, we need to model $y_1, \dots, y_n$ jointly, and training requires sampling tuples $(\hat{s}, \hat{y}_1, \dots, \hat{y}_n)$. Non-terminal states $\hat{s}$ can be generated as intermediate states in trajectories $\hat{\tau} = (s_0 \to s_1 \to \dots \to \hat{x})$. Given a sampled trajectory $\hat{\tau}$ and a set of labels $\hat{y}_1, \dots, \hat{y}_n$, we denote the trajectory cross-entropy, accumulated over the non-terminal states $\hat{s}_t$ of $\hat{\tau}$, by

$$\ell(\hat{\tau}, \hat{y}_1, \dots, \hat{y}_n; \phi) = \sum_{t} \left[ -\log Q_\phi(y_1{=}\hat{y}_1, \dots, y_n{=}\hat{y}_n \,|\, s{=}\hat{s}_t) \right]. \quad (17)$$

Pairs $(\hat{\tau}, \hat{y}_1)$ can be generated in the same way as in the terminal classifier training above: 1) $\hat{y}_1 \sim \tilde{p}(y_1)$; 2) $\hat{\tau} \sim p_{\hat{y}_1}(\tau)$. Sampling $\hat{y}_2, \dots, \hat{y}_n$ given $\hat{x}$ (the terminal state of $\hat{\tau}$) requires access to the values $\tilde{p}(y_k{=}\hat{y}_k \,|\, \hat{x})$, which are not directly available. However, if the terminal classifier is learned as described above, the estimates $w_i(\hat{x}; \phi) = Q_\phi(y_1{=}i \,|\, x{=}\hat{x})$ can be used instead. In this case, the loss and the sampling procedure for the non-terminal classifier rely on the outputs of the terminal classifier. In order to train the two classifiers simultaneously, and avoid instability due to a feedback loop, we employ the target network technique developed in the context of deep Q-learning [55]. We introduce a target network parameter vector $\bar{\phi}$ which is used to produce the estimates $w_i(\hat{x}; \bar{\phi})$ for the non-terminal loss. We update $\bar{\phi}$ as an exponential moving average of the recent iterates of $\phi$. Putting all components together, the training loss for the non-terminal state classifier is

$$\mathcal{L}_N(\phi, \bar{\phi}) = \mathbb{E}_{(\hat{\tau}, \hat{y}_1) \sim \tilde{p}(\tau, y_1)} \left[ \sum_{\hat{y}_2, \dots, \hat{y}_n} \left( \prod_{k=2}^{n} w_{\hat{y}_k}(\hat{x}; \bar{\phi}) \right) \ell(\hat{\tau}, \hat{y}_1, \dots, \hat{y}_n; \phi) \right]. \quad (18)$$

We refer the reader to Appendix D.4 for a more detailed derivation of the loss (18). Algorithm A.1 shows the complete classifier training procedure.
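For concreteness, the following PyTorch sketch mirrors Algorithm A.1 for $m = 2$ base models and $n = 2$ observations. The helper `sample_trajectory(i)`, the state featurization, and the network sizes are assumptions made for illustration; the losses follow eqs. (16)–(18) and the target-network update of [55]:

```python
import copy
import torch
import torch.nn as nn

STATE_DIM, m = 8, 2  # toy sizes; states are assumed featurized as vectors

def mlp(out_dim):  # small classifier body
    return nn.Sequential(nn.Linear(STATE_DIM, 64), nn.ReLU(), nn.Linear(64, out_dim))

q_term = mlp(m)        # Q_phi(y1 | x), terminal states, eq. (16)
q_non = mlp(m * m)     # Q_phi(y1, y2 | s), non-terminal states, eqs. (17)-(18)
q_term_target = copy.deepcopy(q_term)  # target network phi_bar, updated by EMA
opt = torch.optim.Adam(list(q_term.parameters()) + list(q_non.parameters()), lr=1e-3)

def train_step(sample_trajectory, beta=0.995):
    """One step of Algorithm A.1. `sample_trajectory(i)` is assumed to return a
    trajectory from base GFlowNet i as a list of state tensors, terminal last."""
    loss = 0.0
    for y1 in range(m):                       # y1 ~ p(y1) is uniform over models
        traj = sample_trajectory(y1)          # tau ~ p_{y1}(tau)
        x, states = traj[-1], torch.stack(traj[:-1])
        # terminal loss (16): cross-entropy of y1 at the terminal state x
        loss = loss + nn.functional.cross_entropy(q_term(x)[None], torch.tensor([y1]))
        # non-terminal loss (18): marginalize y2 with target-network weights w_i(x)
        with torch.no_grad():
            w = torch.softmax(q_term_target(x), dim=-1)          # w_i(x; phi_bar)
        logq = torch.log_softmax(q_non(states), dim=-1).view(-1, m, m)
        loss = loss - (w * logq[:, y1, :]).sum(-1).mean()
    loss.backward(); opt.step(); opt.zero_grad()
    with torch.no_grad():                     # EMA update: phi_bar <- b*phi_bar + (1-b)*phi
        for p, pt in zip(q_term.parameters(), q_term_target.parameters()):
            pt.mul_(beta).add_(p, alpha=1 - beta)
    return float(loss)
```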
Figure 2: Composed GFlowNets on the 32Γ—32 grid domain. (Top) Operations on two distributions. (Bottom) Operations on three distributions; panels show the conditionals $\tilde{p}(x|y_1{=}1, y_2{=}2)$, $\tilde{p}(x|y_1{=}1, y_2{=}2, y_3{=}3)$, $\tilde{p}(x|y_1{=}2, y_2{=}2)$, and $\tilde{p}(x|y_1{=}2, y_2{=}2, y_3{=}2)$. Cell probabilities are shown with color; darker is higher. Red circles indicate the high-probability regions of $p_1$, $p_2$, $p_3$.

Figure 3: Reward distributions in the molecular generation domain. (a) Base GFlowNets at $\beta = 32$: $p_{\text{SEH}}$ and $p_{\text{SA}}$ are trained with rewards $R_{\text{SEH}}(x)^{32}$ and $R_{\text{SA}}(x)^{32}$. (b) Harmonic mean of $p_{\text{SEH}}$ and $p_{\text{SA}}$ at $\beta = 32$. (c) Contrasts at $\beta = 32$. (d) Base GFlowNets at $\beta = 96$. (e) Harmonic mean at $\beta = 96$. Lines show the contours of the level sets of the kernel density estimates in the $(R_{\text{SEH}}, R_{\text{SA}})$ plane.

## 6 Experiments

**2D distributions via GFlowNets.** We tested our GFlowNet composition method on the 2D grid domain [36], a controlled setting where the ground-truth composite distributions can be evaluated directly. In the 2D grid domain, the states are the cells of an $H \times H$ grid. The starting state is the upper-left cell $s_0 = (0, 0)$. At each state, the allowed actions are: 1) move right; 2) move down; 3) terminate the trajectory at the current position. We first trained GFlowNets $p_i(x) \propto R_i(x)$ with reward functions $R_i(x) > 0$, and then trained classifiers and constructed compositions following Theorem 5.3.

Figure 2 (top row) shows the distributions obtained by composing two pre-trained GFlowNets (top row; left). The harmonic mean $p_1 \otimes p_2$ covers the regions that have high probability under both $p_1$ and $p_2$ and excludes locations where either of the probabilities is low. $p_1 \mathbin{◑} p_2$ resembles $p_1$, but the relative masses of the modes of $p_1$ are modulated by $p_2$: regions with high $p_2$ have lower probability under the contrast. The parameterized contrast $p_1 \mathbin{◑}_{0.95} p_2$ with $\alpha = 0.05$ magnifies the contrasting effect: high $p_2(x)$ implies very low $(p_1 \mathbin{◑}_{0.95} p_2)(x)$.

The bottom row of Figure 2 shows operations on three distributions. The conditional $\tilde{p}(x|y_1{=}1, y_2{=}2)$ is concentrated on the points that have high likelihood under both $p_1$ and $p_2$. Similarly, the value $\tilde{p}(x|y_1{=}1, y_2{=}2, y_3{=}3)$ is high if $x$ is likely to be observed under all three distributions at the same time. The conditionals $\tilde{p}(x|y_1{=}2, y_2{=}2)$ and $\tilde{p}(x|y_1{=}2, y_2{=}2, y_3{=}2)$ highlight the points with high $p_2(x)$ but low $p_1(x)$ and $p_3(x)$. Conditioning on three labels results in a sharper distribution compared to double-conditioning. We provide quantitative results and further details in Appendix F.1. The classifier learning curves are provided in Appendix G.4.

**Molecule generation via GFlowNets.** Next, we evaluated our method for GFlowNet composition on a large and highly structured data space, and assessed the effect that composition operations have on the resulting data distributions in a practical setting. To that end, we conducted experiments with GFlowNets trained for the molecular generation task proposed by Bengio et al. [36]. We trained base GFlowNets with three reward functions $R_{\text{SEH}}(x)$, $R_{\text{SA}}(x)$, and $R_{\text{QED}}(x)$ measuring three distinct molecular properties (details in Appendix F.2). Following Bengio et al. [36], we introduced a parameter $\beta$ which controls the sharpness (temperature) of the target distribution: $p(x) \propto R(x)^{\beta}$; increasing $\beta$ results in a distribution skewed towards high-reward objects. We experimented with $\beta = 32$ and $\beta = 96$ (Figures 3(a), 3(d)). After training base GFlowNets with the respective reward functions, we trained classifiers with Algorithm A.1. In order to evaluate the base models and the composed GFlowNet policies, we generated 5,000 samples with each policy and analyzed the samples. Further details are in Appendix F.2. Classifier learning curves are provided in Appendix G.4. Sample diversity statistics of base GFlowNets at different values of $\beta$ are provided in Appendix G.5.

Figure 3 shows reward distributions of base GFlowNets (trained with rewards $R_{\text{SEH}}(x)^{\beta}$, $R_{\text{SA}}(x)^{\beta}$, $\beta \in \{32, 96\}$) and their compositions. Base GFlowNet distributions are concentrated on examples that score high in their respective rewards. For each model, there is considerable variation in the reward that was not used for training. The harmonic mean operation (Figures 3(b), 3(e)) results in distributions that are concentrated on samples scoring high in both rewards. The contrast operation (Figure 3(c)) has the opposite effect: the distributions are skewed towards examples scoring high in only one of the original rewards. Note that the tails of the contrast distributions retreat from the area covered by the harmonic mean.

Table 1: Reward distributions of composite GFlowNets. In each row, the numbers show the percentage of samples from the respective model that fall into one of the $2^3 = 8$ bins according to rewards. The "low"/"high" categories are decided by thresholding: SEH at 0.5, SA at 0.6, QED at 0.25. Column labels give the (SEH, SA, QED) levels; L = low, H = high. The largest entries in each row are shown in bold.

| Model / observations | LLL | LLH | LHL | LHH | HLL | HLH | HHL | HHH |
|---|---|---|---|---|---|---|---|---|
| $p_{\text{SEH}}$ | 0 | 0 | 0 | 0 | **62** | **9** | **24** | **5** |
| $p_{\text{SA}}$ | 0 | 0 | **73** | **4** | 0 | 0 | **18** | **5** |
| $p_{\text{QED}}$ | 0 | **40** | 0 | **16** | 0 | **21** | 0 | **23** |
| (a) $y = \{\text{SEH}, \text{SA}\}$ | 1 | 0 | 16 | 2 | 6 | 3 | **54** | **18** |
| (b) $y = \{\text{SEH}, \text{QED}\}$ | 0 | 11 | 0 | 4 | 1 | **48** | 4 | **32** |
| (c) $y = \{\text{SA}, \text{QED}\}$ | 0 | 15 | 1 | **41** | 0 | 8 | 2 | **32** |
| (d) $y = \{\text{SEH}, \text{SA}, \text{QED}\}$ | 0 | 7 | 2 | 11 | 2 | 19 | 10 | **49** |
| (e) $y = \{\text{SEH}, \text{SEH}, \text{SEH}\}$ | 0 | 0 | 0 | 0 | **63** | 9 | 24 | 4 |
| (f) $y = \{\text{SA}, \text{SA}, \text{SA}\}$ | 0 | 0 | **74** | 5 | 0 | 0 | 17 | 4 |
| (g) $y = \{\text{QED}, \text{QED}, \text{QED}\}$ | 0 | **40** | 0 | 23 | 0 | 23 | 0 | 14 |

Figure 4: 2D t-SNE embeddings of the three base GFlowNets trained with $R_{\text{SEH}}(x)^{\beta}$, $R_{\text{SA}}(x)^{\beta}$, $R_{\text{QED}}(x)^{\beta}$ at $\beta = 32$ and of their compositions. The t-SNE embeddings are computed from the pairwise earth mover's distances between the distributions. Labels (a)–(g) match the rows of Table 1.

We show reward distribution statistics of the three GFlowNets (trained with the SEH, SA, and QED rewards at $\beta = 32$) and their compositions in Table 1. Each row of the table gives a percentage breakdown of the samples from a given model into one of the $2^3 = 8$ bins according to rewards. For all three base models, the majority of the samples fall into the "high" category according to the respective reward, while the rewards that were not used for training show variation.
Conditioning on two different labels (e.g. $y = \{\text{SEH}, \text{QED}\}$) results in concentration on examples that score high in the two selected rewards, but not necessarily in the reward that was not selected. The conditional $y = \{\text{SEH}, \text{QED}, \text{SA}\}$ shifts the focus to examples that have all three properties.

Figure 4 shows 2D embeddings of the distributions appearing in Table 1. The embeddings were computed with t-SNE based on the pairwise earth mover's distances (details in Appendix F.2; a complete summary of distribution distances is given in Table G.5). The configuration of the embeddings provides insight into the relative positions of the base models and conditionals in distribution space. The points corresponding to the pairwise conditionals lie in between the two base models selected for conditioning. The conditional $y = \{\text{SEH}, \text{SA}, \text{QED}\}$ appears near the centroid of the triangle $(p_{\text{SEH}}, p_{\text{SA}}, p_{\text{QED}})$ and lies close to the pairwise conditionals. The distributions obtained by repeated conditioning on the same label (e.g. $y = \{\text{SEH}, \text{SEH}, \text{SEH}\}$) are spread out to the boundary, lying closer to the respective base distributions and relatively far from the pairwise conditionals.

**Colored MNIST generation via diffusion models.** Finally, we empirically tested our method for the composition of diffusion models on an image generation task. In this experiment, we composed three diffusion models trained to generate MNIST [56] digits $\{0, 1, 2, 3\}$ in two colors: cyan and beige. Each model was trained to generate digits with a specific property: $p_1$ generated cyan digits, $p_2$ generated digits less than 2, and $p_3$ generated even digits. We built the composition iteratively via the factorization $\tilde{p}(x|y_1, y_2, y_3) \propto \tilde{p}(x)\,\tilde{p}(y_1, y_2|x)\,\tilde{p}(y_3|x, y_1, y_2)$. To this end, we first trained a classifier $Q(y_1, y_2|x_t)$ on trajectories sampled from the base models. This allowed us to generate samples from $\tilde{p}(x|y_1, y_2)$. We then trained an additional classifier $Q(y_3|x_t, y_1, y_2)$ on trajectories from the compositions defined by $(y_1, y_2)$, which allows us to sample from $\tilde{p}(x|y_1, y_2, y_3)$. Further details can be found in Appendix F.3.

Figure 5: Composed diffusion models on colored MNIST. Samples from the three pre-trained diffusion models and from various compositions, including the conditionals $\tilde{p}(x|y_1{=}i, y_2{=}j)$ for $i, j \in \{1, 2, 3\}$ and three-observation conditionals such as $\tilde{p}(x|y_1{=}1, y_2{=}2, y_3{=}3)$.

Figure 5 shows samples from the pre-trained models and from selected compositions. The negating effect of not conditioning on observations is clearly visible in the compositions using two variables. For example, $\tilde{p}(x|y_1{=}1, y_2{=}1)$ only generates cyan 3's. Because we do not condition on $p_2$ or $p_3$ there, the composition excludes digits that have high probability under $p_2$ or $p_3$, i.e. those that are less than 2 or even. In $\tilde{p}(x|y_1{=}1, y_2{=}3)$, cyan even digits have high density under both $p_1$ and $p_3$, but because $p_2$ is not conditioned on, the composition excludes digits less than two (i.e. cyan 0's). Finally, $\tilde{p}(x|y_1{=}1, y_2{=}2, y_3{=}3)$ generates only cyan 0's, on which all base models have high density.
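Because $y_1, \dots, y_n$ are conditionally independent given $x$ under model (9), conditioning sequentially, as in the two-stage construction above, yields the same posterior as conditioning on all observations at once. A toy numerical check of this on a discrete space (with made-up base densities):

```python
import numpy as np

# Joint vs. staged conditioning: p(x|y1,y2,y3) ∝ p~(x) p(y1,y2|x) p(y3|x),
# where p(y3|x, y1, y2) = p(y3|x) by conditional independence given x.
P = np.array([[0.6, 0.3, 0.1],
              [0.1, 0.6, 0.3],
              [0.3, 0.1, 0.6]])             # rows: p_1, p_2, p_3 on a 3-point space
prior = P.mean(axis=0)
lik = P / P.sum(axis=0, keepdims=True)      # p(y=i|x)

joint = prior * lik[0] * lik[1] * lik[2]    # condition on y1=1, y2=2, y3=3 at once
staged = prior * lik[0] * lik[1]            # first condition on (y1, y2) ...
staged = staged * lik[2]                    # ... then guide with p(y3|x)
print(joint / joint.sum(), staged / staged.sum())  # identical posteriors
```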
## 7 Conclusion

We introduced Compositional Sculpting, a general approach for composing iterative generative models. Compositions are defined through "observations", which enable us to emphasize or de-emphasize the density of the composition in regions where specific base models have high density. We highlighted two binary compositions, harmonic mean and contrast, which are analogous to the product and negation operations defined on EBMs. A crucial feature of the compositions we have introduced is that we can sample from them directly. By extending classifier guidance we are able to leverage the generative capabilities of the base models to produce samples from the composition. Through empirical experiments, we validated our approach for composing diffusion models and GFlowNets on toy domains, molecular generation, and image generation.

**Broader impact.** We proposed a mathematical framework and methods for the composition of pre-trained generative models. While the primary emphasis of our work is on advancing foundational research on generative modeling methodology and principled sampling techniques, our work inherits the ethical concerns associated with generative models, such as the creation of deepfake content and the dissemination of misinformation, as well as the reproduction of biases present in the datasets used for model training. If not carefully managed, these models can perpetuate societal biases, exacerbating issues of fairness and equity. Our work further contributes to research on the reuse of pre-trained models. This research direction promotes eco-friendly AI development, with the long-term goal of reducing the energy consumption and carbon emissions associated with large-scale generative model training.

**Acknowledgements.** TG and TJ acknowledge support from the Machine Learning for Pharmaceutical Discovery and Synthesis (MLPDS) consortium, the DARPA Accelerated Molecular Discovery program, the NSF Expeditions grant (award 1918839) "Understanding the World Through Code", and the MIT-DSTA collaboration. SK and SDP were supported by the Technology Industries of Finland Centennial Foundation and the Jane and Aatos Erkko Foundation under project "Interactive Artificial Intelligence for Driving R&D", the Academy of Finland (flagship programme: Finnish Center for Artificial Intelligence, FCAI; grants 328400, 345604 and 341763), and the UKRI Turing AI World-Leading Researcher Fellowship, EP/W002973/1. VG acknowledges support from the Academy of Finland (grant decision 342077) for "Human-steered next-generation machine learning for reviving drug design", the Saab-WASP initiative (grant 411025), and the Jane and Aatos Erkko Foundation (grant 7001703) for "Biodesign: Use of artificial intelligence in enzyme design for synthetic biology". GY acknowledges support from the National Science Foundation under Cooperative Agreement PHY-2019786 (The NSF AI Institute for Artificial Intelligence and Fundamental Interactions, http://iaifi.org). We thank Sammie Katt and Pavel Izmailov for helpful discussions and assistance in making the figures. We thank the NeurIPS 2023 anonymous reviewers for helpful feedback on our work.

## References

[1] Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C. Berg, Wan-Yen Lo, Piotr DollΓ‘r, and Ross B. Girshick. Segment anything. arXiv, abs/2304.02643, 2023.

[2] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pages 8748–8763. PMLR, 2021.

[3] Mehdi Cherti, Romain Beaumont, Ross Wightman, Mitchell Wortsman, Gabriel Ilharco, Cade Gordon, Christoph Schuhmann, Ludwig Schmidt, and Jenia Jitsev. Reproducible scaling laws for contrastive language-image learning.
In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2818–2829, 2023.

[4] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and BjΓΆrn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10684–10695, 2022.

[5] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems, 33:6840–6851, 2020.

[6] Aditya Ramesh, Mikhail Pavlov, Gabriel Goh, Scott Gray, Chelsea Voss, Alec Radford, Mark Chen, and Ilya Sutskever. Zero-shot text-to-image generation. In International Conference on Machine Learning, pages 8821–8831. PMLR, 2021.

[7] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv, abs/1810.04805, 2018.

[8] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in Neural Information Processing Systems, 33:1877–1901, 2020.

[9] OpenAI. ChatGPT (Mar 14, 2023 version). Large language model, 2023. URL https://chat.openai.com/chat.

[10] Adam Roberts, Hyung Won Chung, Anselm Levskaya, Gaurav Mishra, James Bradbury, Daniel Andor, Sharan Narang, Brian Lester, Colin Gaffney, Afroz Mohiuddin, Curtis Hawthorne, Aitor Lewkowycz, Alexandru Salcianu, Marc van Zee, Jacob Austin, Sebastian Goodman, Livio Baldini Soares, Haitang Hu, Sasha Tsvyashchenko, Aakanksha Chowdhery, Jasmijn Bastings, Jannis Bulian, Xavier GarcΓ­a, Jianmo Ni, Andrew Chen, Kathleen Kenealy, J. Clark, Stephan Lee, Daniel H Garrette, James Lee-Thorp, Colin Raffel, Noam M. Shazeer, Marvin Ritter, Maarten Bosma, Alexandre Passos, Jeremy B. Maitin-Shepard, Noah Fiedel, Mark Omernick, Brennan Saeta, Ryan Sepassi, Alexander Spiridonov, Joshua Newlan, and Andrea Gesmundo. Scaling up models and data with t5x and seqio. arXiv, abs/2203.17189, 2022.

[11] Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, et al. PaLM: Scaling language modeling with pathways. arXiv, abs/2204.02311, 2022.

[12] Anthony Brohan, Noah Brown, Justice Carbajal, Yevgen Chebotar, Joseph Dabis, Chelsea Finn, Keerthana Gopalakrishnan, Karol Hausman, Alexander Herzog, Jasmine Hsu, Julian Ibarz, Brian Ichter, Alex Irpan, Tomas Jackson, Sally Jesmonth, Nikhil J. Joshi, Ryan C. Julian, Dmitry Kalashnikov, Yuheng Kuang, Isabel Leal, Kuang-Huei Lee, Sergey Levine, Yao Lu, Utsav Malla, Deeksha Manjunath, Igor Mordatch, Ofir Nachum, Carolina Parada, Jodilyn Peralta, Emily Perez, Karl Pertsch, Jornell Quiambao, Kanishka Rao, Michael S. Ryoo, Grecia Salazar, Pannag R. Sanketi, Kevin Sayed, Jaspiar Singh, Sumedh Anand Sontakke, Austin Stone, Clayton Tan, Huong Tran, Vincent Vanhoucke, Steve Vega, Quan Ho Vuong, F. Xia, Ted Xiao, Peng Xu, Sichun Xu, Tianhe Yu, and Brianna Zitkovich. RT-1: Robotics transformer for real-world control at scale. arXiv, abs/2212.06817, 2022.

[13] Danny Driess, F. Xia, Mehdi S. M.
Sajjadi, Corey Lynch, Aakanksha Chowdhery, Brian Ichter, Ayzaan Wahid, Jonathan Tompson, Quan Ho Vuong, Tianhe Yu, Wenlong Huang, Yevgen Chebotar, Pierre Sermanet, Daniel Duckworth, Sergey Levine, Vincent Vanhoucke, Karol Hausman, Marc Toussaint, Klaus Greff, Andy Zeng, Igor Mordatch, and Peter R. Florence. PaLM-E: An embodied multimodal language model. arXiv, abs/2303.03378, 2023.

[14] Michael Ahn, Anthony Brohan, Noah Brown, Yevgen Chebotar, Omar Cortes, Byron David, Chelsea Finn, Chuyuan Fu, Keerthana Gopalakrishnan, Karol Hausman, Alexander Herzog, Daniel Ho, Jasmine Hsu, Julian Ibarz, Brian Ichter, Alex Irpan, Eric Jang, Rosario Jauregui Ruano, Kyle Jeffrey, Sally Jesmonth, Nikhil Jayant Joshi, Ryan C. Julian, Dmitry Kalashnikov, Yuheng Kuang, Kuang-Huei Lee, Sergey Levine, Yao Lu, Linda Luu, Carolina Parada, Peter Pastor, Jornell Quiambao, Kanishka Rao, Jarek Rettinghouse, Diego M Reyes, Pierre Sermanet, Nicolas Sievers, Clayton Tan, Alexander Toshev, Vincent Vanhoucke, Fei Xia, Ted Xiao, Peng Xu, Sichun Xu, Mengyuan Yan, and Andy Zeng. Do as I can, not as I say: Grounding language in robotic affordances. In Karen Liu, Dana Kulic, and Jeff Ichnowski, editors, Proceedings of The 6th Conference on Robot Learning, volume 205 of Proceedings of Machine Learning Research, pages 287–318. PMLR, 2023.

[15] John Jumper, Richard Evans, Alexander Pritzel, Tim Green, Michael Figurnov, Olaf Ronneberger, Kathryn Tunyasuvunakool, Russ Bates, Augustin Ε½Γ­dek, Anna Potapenko, et al. Highly accurate protein structure prediction with AlphaFold. Nature, 596(7873):583–589, 2021.

[16] Rishi Bommasani, Drew A. Hudson, Ehsan Adeli, Russ Altman, Simran Arora, Sydney von Arx, Michael S. Bernstein, Jeannette Bohg, Antoine Bosselut, Emma Brunskill, Erik Brynjolfsson, S. Buch, Dallas Card, Rodrigo Castellon, Niladri S. Chatterji, Annie S. Chen, Kathleen A. Creel, Jared Davis, Dora Demszky, Chris Donahue, Moussa Doumbouya, Esin Durmus, Stefano Ermon, John Etchemendy, Kawin Ethayarajh, Li Fei-Fei, Chelsea Finn, Trevor Gale, Lauren E. Gillespie, Karan Goel, Noah D. Goodman, Shelby Grossman, Neel Guha, Tatsunori Hashimoto, Peter Henderson, John Hewitt, Daniel E. Ho, Jenny Hong, Kyle Hsu, Jing Huang, Thomas F. Icard, Saahil Jain, Dan Jurafsky, Pratyusha Kalluri, Siddharth Karamcheti, Geoff Keeling, Fereshte Khani, O. Khattab, Pang Wei Koh, Mark S. Krass, Ranjay Krishna, Rohith Kuditipudi, Ananya Kumar, Faisal Ladhak, Mina Lee, Tony Lee, Jure Leskovec, Isabelle Levent, Xiang Lisa Li, Xuechen Li, Tengyu Ma, Ali Malik, Christopher D. Manning, Suvir Mirchandani, Eric Mitchell, Zanele Munyikwa, Suraj Nair, Avanika Narayan, Deepak Narayanan, Benjamin Newman, Allen Nie, Juan Carlos Niebles, Hamed Nilforoshan, J. F. Nyarko, Giray Ogut, Laurel J. Orr, Isabel Papadimitriou, Joon Sung Park, Chris Piech, Eva Portelance, Christopher Potts, Aditi Raghunathan, Robert Reich, Hongyu Ren, Frieda Rong, Yusuf H. Roohani, Camilo Ruiz, Jack Ryan, Christopher RΓ©, Dorsa Sadigh, Shiori Sagawa, Keshav Santhanam, Andy Shih, Krishna Parasuram Srinivasan, Alex Tamkin, Rohan Taori, Armin W. Thomas, Florian TramΓ¨r, Rose E. Wang, William Wang, Bohan Wu, Jiajun Wu, Yuhuai Wu, Sang Michael Xie, Michihiro Yasunaga, Jiaxuan You, Matei A. Zaharia, Michael Zhang, Tianyi Zhang, Xikun Zhang, Yuhui Zhang, Lucia Zheng, Kaitlyn Zhou, and Percy Liang. On the opportunities and risks of foundation models. arXiv, abs/2108.07258, 2021.

[17] Lvmin Zhang and Maneesh Agrawala. Adding conditional control to text-to-image diffusion models.
arXiv, abs/2302.05543, 2023.

[18] Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. In Advances in Neural Information Processing Systems, volume 35, pages 27730–27744, 2022.

[19] DΓ­dac SurΓ­s, Sachit Menon, and Carl Vondrick. ViperGPT: Visual inference via Python execution for reasoning. arXiv, abs/2303.08128, 2023.

[20] Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Ed Huai hsin Chi, F. Xia, Quoc Le, and Denny Zhou. Chain of thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems, 35:24824–24837, 2022.

[21] Geoffrey E Hinton. Products of experts. In Ninth International Conference on Artificial Neural Networks, volume 1, 1999.

[22] Geoffrey E Hinton. Training products of experts by minimizing contrastive divergence. Neural Computation, 14(8):1771–1800, 2002.

[23] Ramakrishna Vedantam, Ian Fischer, Jonathan Huang, and Kevin Murphy. Generative models of visually grounded imagination. In International Conference on Learning Representations, 2018. URL https://openreview.net/forum?id=HkCsm6lRb.

[24] Yilun Du, Shuang Li, and Igor Mordatch. Compositional visual generation with energy based models. Advances in Neural Information Processing Systems, 33:6637–6647, 2020.

[25] Yilun Du, Shuang Li, Yash Sharma, Josh Tenenbaum, and Igor Mordatch. Unsupervised learning of compositional energy concepts. Advances in Neural Information Processing Systems, 34:15608–15620, 2021.

[26] Nan Liu, Shuang Li, Yilun Du, Josh Tenenbaum, and Antonio Torralba. Learning to compose visual relations. Advances in Neural Information Processing Systems, 34:23166–23178, 2021.

[27] Nan Liu, Shuang Li, Yilun Du, Antonio Torralba, and Joshua B Tenenbaum. Compositional visual generation with composable diffusion models. In Computer Vision – ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XVII, pages 423–439. Springer, 2022.

[28] Yilun Du, Conor Durkan, Robin Strudel, Joshua B Tenenbaum, Sander Dieleman, Rob Fergus, Jascha Sohl-Dickstein, Arnaud Doucet, and Will Sussman Grathwohl. Reduce, reuse, recycle: Compositional generation with energy-based diffusion models and MCMC. In International Conference on Machine Learning, pages 8489–8510. PMLR, 2023.

[29] Wengong Jin, Regina Barzilay, and Tommi Jaakkola. Multi-objective molecule generation using interpretable substructures. In International Conference on Machine Learning, pages 4849–4859. PMLR, 2020.

[30] Yutong Xie, Chence Shi, Hao Zhou, Yuwei Yang, Weinan Zhang, Yong Yu, and Lei Li. MARS: Markov molecular sampling for multi-objective drug discovery. In International Conference on Learning Representations, 2021. URL https://openreview.net/forum?id=kHSu4ebxFXY.

[31] Moksh Jain, Sharath Chandra Raparthy, Alex Hernandez-Garcia, Jarrid Rector-Brooks, Yoshua Bengio, Santiago Miret, and Emmanuel Bengio. Multi-objective GFlowNets. In International Conference on Machine Learning, pages 14631–14653. PMLR, 2023.

[32] Jascha Sohl-Dickstein, Eric Weiss, Niru Maheswaranathan, and Surya Ganguli. Deep unsupervised learning using nonequilibrium thermodynamics. In International Conference on Machine Learning, pages 2256–2265. PMLR, 2015.

[33] Yang Song and Stefano Ermon. Improved techniques for training score-based generative models. Advances in Neural Information Processing Systems, 33:12438–12448, 2020.
[34] Yang Song, Jascha Sohl-Dickstein, Diederik P Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score-based generative modeling through stochastic differential equations. In International Conference on Learning Representations, 2021. URL https://openreview.net/forum?id=PxTIG12RRHS.

[35] Yoshua Bengio, Salem Lahlou, Tristan Deleu, Edward J. Hu, Mo Tiwari, and Emmanuel Bengio. GFlowNet foundations. Journal of Machine Learning Research, 24(210):1–55, 2023. URL http://jmlr.org/papers/v24/22-0364.html.

[36] Emmanuel Bengio, Moksh Jain, Maksym Korablyov, Doina Precup, and Yoshua Bengio. Flow network based generative models for non-iterative diverse candidate generation. Advances in Neural Information Processing Systems, 34:27381–27394, 2021.

[37] Yang Song and Stefano Ermon. Generative modeling by estimating gradients of the data distribution. Advances in Neural Information Processing Systems, 32, 2019.

[38] Brian D.O. Anderson. Reverse-time diffusion equation models. Stochastic Processes and their Applications, 12(3):313–326, 1982.

[39] Prafulla Dhariwal and Alexander Nichol. Diffusion models beat GANs on image synthesis. Advances in Neural Information Processing Systems, 34:8780–8794, 2021.

[40] Quan Hoang, Tu Dinh Nguyen, Trung Le, and Dinh Phung. MGAN: Training generative adversarial nets with multiple generators. In International Conference on Learning Representations, 2018.

[41] Ilya O Tolstikhin, Sylvain Gelly, Olivier Bousquet, Carl-Johann Simon-Gabriel, and Bernhard SchΓΆlkopf. AdaGAN: Boosting generative models. Advances in Neural Information Processing Systems, 30, 2017.

[42] Aditya Grover and Stefano Ermon. Boosted generative models. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 32, 2018.

[43] Nan Liu, Yilun Du, Shuang Li, Joshua B Tenenbaum, and Antonio Torralba. Unsupervised compositional concepts discovery with text-to-image generative models. arXiv, abs/2306.05357, 2023.

[44] Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily L Denton, Kamyar Ghasemipour, Raphael Gontijo Lopes, Burcu Karagol Ayan, Tim Salimans, et al. Photorealistic text-to-image diffusion models with deep language understanding. Advances in Neural Information Processing Systems, 35:36479–36494, 2022.

[45] Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance. In NeurIPS 2021 Workshop on Deep Generative Models and Downstream Applications, 2021. URL https://openreview.net/forum?id=qw8AKxfYbI.

[46] Chenlin Meng, Yutong He, Yang Song, Jiaming Song, Jiajun Wu, Jun-Yan Zhu, and Stefano Ermon. SDEdit: Guided image synthesis and editing with stochastic differential equations. In International Conference on Learning Representations, 2022. URL https://openreview.net/forum?id=aBsCjcPu_tE.

[47] Guillaume Couairon, Jakob Verbeek, Holger Schwenk, and Matthieu Cord. DiffEdit: Diffusion-based semantic image editing with mask guidance. In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=3lge0p5o-M-.

[48] Dieuwke Hupkes, Verna Dankers, Mathijs Mul, and Elia Bruni. Compositionality decomposed: How do neural networks generalise? Journal of Artificial Intelligence Research, 67:757–795, 2020.

[49] Colin Conwell and Tomer Ullman. Testing relational understanding in text-guided image generation. arXiv, abs/2208.00005, 2022.

[50] Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical text-conditional image generation with CLIP latents. arXiv, abs/2204.06125, 2022.
[51] Salem Lahlou, Tristan Deleu, Pablo Lemos, Dinghuai Zhang, Alexandra Volokhova, Alex HernΓ‘ndez-GarcΓ­a, LΓ©na NΓ©hale Ezzine, Yoshua Bengio, and Nikolay Malkin. A theory of continuous generative flow networks. In International Conference on Machine Learning, pages 18269–18300. PMLR, 2023.

[52] Dinghuai Zhang, Ricky TQ Chen, Nikolay Malkin, and Yoshua Bengio. Unifying generative models with GFlowNets. arXiv, abs/2209.02606, 2022.

[53] Stefano Peluchetti. Non-denoising forward-time diffusions, 2022. URL https://openreview.net/forum?id=oVfIKuhqfC.

[54] Yaron Lipman, Ricky T. Q. Chen, Heli Ben-Hamu, Maximilian Nickel, and Matthew Le. Flow matching for generative modeling. In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=PqvMRDCJT9t.

[55] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015.

[56] Yann LeCun. The MNIST database of handwritten digits. http://yann.lecun.com/exdb/mnist/, 1998.

[57] Damiano Brigo. The general mixture-diffusion SDE and its relationship with an uncertain-volatility option model with volatility-asset decorrelation. arXiv, abs/0812.4052, 2008.

[58] Nikolay Malkin, Moksh Jain, Emmanuel Bengio, Chen Sun, and Yoshua Bengio. Trajectory balance: Improved credit assignment in GFlowNets. In Advances in Neural Information Processing Systems, volume 35, pages 5955–5967, 2022.

[59] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv, abs/1412.6980, 2014.

[60] Justin Gilmer, Samuel S Schoenholz, Patrick F Riley, Oriol Vinyals, and George E Dahl. Neural message passing for quantum chemistry. In International Conference on Machine Learning, pages 1263–1272. PMLR, 2017.

[61] Peter Ertl and Ansgar Schuffenhauer. Estimation of synthetic accessibility score of drug-like molecules based on molecular complexity and fragment contributions. Journal of Cheminformatics, 1:1–11, 2009.

[62] Greg Landrum. RDKit: Open-source cheminformatics, 2010. URL https://www.rdkit.org/.

[63] G Richard Bickerton, Gaia V Paolini, JΓ©rΓ©my Besnard, Sorel Muresan, and Andrew L Hopkins. Quantifying the chemical beauty of drugs. Nature Chemistry, 4(2):90–98, 2012.

[64] Seongjun Yun, Minbyul Jeong, Raehyun Kim, Jaewoo Kang, and Hyunwoo J Kim. Graph transformer networks. Advances in Neural Information Processing Systems, 32, 2019.

[65] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-Net: Convolutional networks for biomedical image segmentation. In Medical Image Computing and Computer-Assisted Intervention – MICCAI 2015: 18th International Conference, Munich, Germany, October 5–9, 2015, Proceedings, Part III 18, pages 234–241. Springer, 2015.

[66] Matthew Tancik, Pratul Srinivasan, Ben Mildenhall, Sara Fridovich-Keil, Nithin Raghavan, Utkarsh Singhal, Ravi Ramamoorthi, Jonathan Barron, and Ren Ng. Fourier features let networks learn high frequency functions in low dimensional domains. Advances in Neural Information Processing Systems, 33:7537–7547, 2020.

[67] Matthew D Zeiler. ADADELTA: An adaptive learning rate method. arXiv, abs/1212.5701, 2012.

[68] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al. PyTorch: An imperative style, high-performance deep learning library.
Advances in neural information processing systems, 32, 2019. A Compositional Sculpting of Iterative Generative Processes (Section 5, continued) A.1 Classifier Training (GFlow Nets): Algorithm Algorithm A.1 Compositional Sculpting: classifier training 1: Initialize πœ™and set πœ™= πœ™ 2: for step = 1, , num_steps do 3: for 𝑖= 1, , π‘šdo 4: Sample πœπ‘– 𝑝𝑖(𝜏) 5: end for 𝑖=1 log π‘„πœ™(𝑦1 =𝑖|π‘₯= π‘₯𝑖) {Terminal state loss (16)} 7: 𝑀𝑖( π‘₯𝑗; πœ™)= π‘„πœ™(π‘¦π‘˜=𝑖|π‘₯= π‘₯𝑗), 𝑖, 𝑗 {1, π‘š} {Terminal probability estimates} {Non-terminal state loss (17)-(18)} 8: 𝑁(πœ™, πœ™) = π‘˜=2 𝑀 π‘¦π‘˜( π‘₯ 𝑦1; πœ™) 𝓁( 𝜏 𝑦1, 𝑦1, 𝑦𝑛; πœ™) 9: (πœ™, πœ™) = 𝑇(πœ™) + 𝛾(step) 𝑁(πœ™, πœ™) 10: Update πœ™using πœ™ (πœ™, πœ™); update πœ™= π›½πœ™+ (1 𝛽)πœ™ 11: end for A.2 Composition of Diffusion Models In this section, we show how the method introduced above can be applied to diffusion models. First, we adapt the model we introduced in (9)-(11) to diffusion models. A diffusion model trained to sample from 𝑝𝑖(π‘₯) generates a trajectory 𝜏= {π‘₯𝑑}𝑇 𝑑=0 over a range of time steps which starts with a randomly sampled state π‘₯𝑇and ends in π‘₯0, where π‘₯0 has distribution 𝑝𝑖,𝑑=0(π‘₯) = 𝑝𝑖(π‘₯). Thus, we must adapt our model to reflect this. We introduce a set of mutually dependent variables π‘₯𝑑for 𝑑 (0, 𝑇] with as conditional distribution the transition kernel of the diffusion model 𝑝𝑖(π‘₯𝑑|π‘₯0). Given π‘šbase diffusion models that sample from 𝑝1, , π‘π‘šrespectively, we define the prior 𝑝(π‘₯) as a mixture of these diffusion models. Proposition A.1 shows that this mixture is a diffusion model that can be constructed directly from the base diffusion models. We then apply classifier guidance to this mixture to sample from the composition. We present an informal version of the proposition below. The required assumptions and the proof are provided in Appendix D.5. Proposition A.1 (Diffusion mixture SDE). Suppose distributions 𝑝1(π‘₯), , π‘π‘š(π‘₯) are realized by diffusion models with forward SDEs 𝑑π‘₯𝑖,𝑑= 𝑓𝑖,𝑑(π‘₯𝑖,𝑑) 𝑑𝑑+ 𝑔𝑖,𝑑𝑑𝑀𝑖,𝑑and score functions 𝑠𝑖,𝑑( ), respectively. Then, the mixture distribution 𝑝M(π‘₯) = π‘š 𝑖=1 πœ”π‘–π‘π‘–(π‘₯) with πœ”1 πœ”π‘š 0 and π‘š 𝑖=1 πœ”π‘–= 1 is realized by a diffusion model with forward SDE 𝑖=1 𝑝(𝑦=𝑖|π‘₯𝑑)𝑓𝑖,𝑑(π‘₯𝑑) 𝑖=1 𝑝(𝑦=𝑖|π‘₯𝑑)𝑔2 𝑖,𝑑 and backward SDE 𝑖=1 𝑝(𝑦=𝑖|π‘₯𝑑) ( 𝑓𝑖,𝑑(π‘₯𝑑) 𝑔2 𝑖,𝑑𝑠𝑖,𝑑(π‘₯𝑑) )] 𝑖=1 𝑝(𝑦=𝑖|π‘₯𝑑)𝑔2 𝑖,𝑑𝑑𝑀𝑑, (20) 𝑝(𝑦=𝑖|π‘₯𝑑) = πœ”π‘–π‘π‘–,𝑑(π‘₯𝑑) π‘š 𝑗=1 πœ”π‘—π‘π‘—,𝑑(π‘₯𝑑). (21) If the base diffusion models have a common forward SDE 𝑑π‘₯𝑖,𝑑= 𝑓𝑑(π‘₯𝑖,𝑑) 𝑑𝑑+ 𝑔𝑑𝑑𝑀𝑖,𝑑, equations (19)-(20) simplify to 𝑑π‘₯𝑑= 𝑓𝑑(π‘₯𝑑)𝑑𝑑+ 𝑔𝑑𝑑𝑀𝑑, 𝑑π‘₯𝑑= 𝑓𝑑(π‘₯𝑑) 𝑔2 𝑑 𝑖=1 𝑝(𝑦=𝑖|π‘₯𝑑)𝑠𝑖,𝑑(π‘₯𝑑) 𝑑𝑑+ 𝑔𝑑𝑑𝑀𝑑. (22) Theorem A.2 summarizes the overall approach. Theorem A.2. Suppose distributions 𝑝1(π‘₯), , π‘π‘š(π‘₯) are realized by diffusion models with forward SDEs 𝑑π‘₯𝑖,𝑑= 𝑓𝑖,𝑑(π‘₯𝑖,𝑑) 𝑑𝑑+ 𝑔𝑖,𝑑𝑑𝑀𝑖,𝑑and score functions 𝑠𝑖,𝑑( ), respectively. Let 𝑦1, 𝑦𝑛be random variables defined by (9). Then, the conditional 𝑝(π‘₯|𝑦1, , 𝑦𝑛) is realized by a classifier-guided diffusion with backward SDE 𝑑π‘₯𝑑= 𝑣𝐢,𝑑(π‘₯𝑑, 𝑦1, , 𝑦𝑛)𝑑𝑑+ 𝑔𝐢,𝑑(π‘₯𝑑)𝑑𝑀𝑑, (23) 𝑣𝐢,𝑑(π‘₯𝑑, 𝑦1, , 𝑦𝑛) = 𝑖=1 𝑝(𝑦=𝑖|π‘₯𝑑) ( 𝑓𝑖,𝑑(π‘₯𝑑) 𝑔2 𝑖,𝑑 ( 𝑠𝑖,𝑑(π‘₯𝑑) + π‘₯𝑑log 𝑝(𝑦1, , 𝑦𝑛|π‘₯𝑑) )) , (24) 𝑖=1 𝑝(𝑦=𝑖|π‘₯𝑑)𝑔2 𝑖,𝑑. (25) The proof of Theorem A.2 is provided in Appendix D.6. 
A.3 Classifier Training (Diffusion Models)

We approximate the inferential distributions in equations (22) and (23) with a time-conditioned classifier $Q_\phi(y_1, \ldots, y_n | x_t)$ with parameters $\phi$. Contrary to GFlowNets, which employed separate terminal and non-terminal state classifiers, here we only need a single time-dependent classifier. The classifier is trained with different objectives on terminal and non-terminal states. The variables $y_1, \ldots, y_n$ are dependent given a state $x_t$ for $t \in (0, T]$, but are independent given the terminal state $x_0$. Thus, when training on terminal states we can exploit this independence. Furthermore, we generally found it beneficial to initially train only on terminal states. The loss for the non-terminal states depends on classifications of the terminal state of the associated trajectories; thus, by minimizing the classification error on terminal states first, we reduce noise in the loss calculated for the non-terminal states later.

For a terminal state $x_0$, the classifier $Q_\phi(y_1, \ldots, y_n | x_0)$ can be factorized as $\prod_{k=1}^{n} Q_\phi(y_k | x_0)$. Hence we can train $Q$ by minimizing the cross-entropy loss

$\mathcal{L}_T(\phi) = \mathbb{E}_{(\bar{x}_0, \bar{y}_1) \sim p(x_0, y_1)} \big[ -\log Q_\phi(y_1 = \bar{y}_1 | x_0 = \bar{x}_0) \big]$. (26)

Samples from $p(x_0, y_1)$ can be generated according to the factorization $p(y_1) \, p(x_0 | y_1)$. First, $\bar{y}_1$ is sampled from $p(y_1)$, which is uniform under our choice of the mixture prior $p(x)$. Then, $\bar{x}_0 | (y_1 = \bar{y}_1)$ is generated from the reverse SDE of the base diffusion model $p_{\bar{y}_1}(x)$. Note that equation (9) implies that all observations have the same conditional distribution given $x$. Thus, $Q_\phi(y_1 | x_0)$ is also a classifier for the observations $y_2, \ldots, y_n$.

For a non-terminal state $x_t$ with $t \in (0, T]$, we must train $Q$ to predict $y_1, \ldots, y_n$ jointly. For a non-terminal state $\bar{x}_t$ and observations $\bar{y}_1, \ldots, \bar{y}_n$, the cross-entropy loss is

$\ell(\bar{x}_t, \bar{y}_1, \ldots, \bar{y}_n; \phi) = -\log Q_\phi(y_1 = \bar{y}_1, \ldots, y_n = \bar{y}_n | x_t = \bar{x}_t)$. (27)

Tuples $(\bar{x}_t, \bar{y}_1, \ldots, \bar{y}_n)$ are obtained as follows: 1) $\bar{y}_1 \sim p(y_1)$; 2) a trajectory $\bar{\tau} = \{\bar{x}_t\}_{t=0}^{T}$ is sampled from the reverse SDE of diffusion model $\bar{y}_1$. At this point, we would ideally sample $\bar{y}_2, \ldots, \bar{y}_n$ given $\bar{x}_0$, but this requires access to $p(y_k = \bar{y}_k | \bar{x}_0)$. Instead, we approximate this with $w_i(\bar{x}_0; \bar{\phi}) = Q_{\bar{\phi}}(y_1 = i | x_0 = \bar{x}_0)$ and marginalize over $\bar{y}_2, \ldots, \bar{y}_n$ to obtain the cross-entropy loss

$\mathcal{L}_N(\phi, \bar{\phi}) = \mathbb{E}_{(\bar{\tau}, \bar{y}_1) \sim p(\tau, y_1)} \Big[ \sum_{\bar{y}_2, \ldots, \bar{y}_n} \Big( \prod_{k=2}^{n} w_{\bar{y}_k}(\bar{x}_0; \bar{\phi}) \Big) \, \ell(\bar{x}_t, \bar{y}_1, \ldots, \bar{y}_n; \phi) \Big]$. (28)
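The following sketch (ours, not the paper's codebase) spells out the two losses (26) and (28) for the simplest case of $m = 2$ base models and $n = 2$ observations. It assumes a time-conditioned network `clf(x, t)` producing a $2 \times 2$ matrix of joint logits for $Q_\phi(y_1, y_2 | x_t)$, a target copy `clf_bar` with parameters $\bar{\phi}$, and reverse-SDE trajectories `traj` given as (t, x_t) pairs ending at the terminal state $x_0$.

```python
import torch
import torch.nn.functional as F

def terminal_loss(clf, x0, y1):
    # Loss (26): cross-entropy of the marginal Q_phi(y1 | x_0).
    joint = clf(x0, torch.zeros(len(x0)))                 # (batch, 2, 2) logits
    marg = torch.logsumexp(joint, dim=2) - torch.logsumexp(
        joint.flatten(1), dim=1, keepdim=True)            # log Q(y1 | x_0)
    return F.nll_loss(marg, y1)

def nonterminal_loss(clf, clf_bar, traj, y1):
    # Loss (28): marginalize y2 with weights w_i(x_0) from the target network.
    t0, x0 = traj[-1]
    with torch.no_grad():
        joint0 = clf_bar(x0, t0)
        w = torch.softmax(torch.logsumexp(joint0, dim=2), dim=1)  # (batch, 2)
    loss = 0.0
    for t, xt in traj[:-1]:                               # non-terminal states
        logq = F.log_softmax(clf(xt, t).flatten(1), dim=1).view(-1, 2, 2)
        # sum_{y2} w_{y2}(x_0) * [-log Q(y1, y2 | x_t)], cf. equation (27).
        nll = -logq[torch.arange(len(xt)), y1, :]         # (batch, 2)
        loss = loss + (w * nll).sum(dim=1).mean()
    return loss / max(len(traj) - 1, 1)
```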
B Classifier Guidance for Parameterized Operations

This section covers the details of classifier guidance and classifier training for the parameterized operations (Section 4.1). The complete probabilistic model for the parameterized operations on two distributions is given by

$p(x, y_1, y_2; \alpha) = p(x) \, p(y_1 | x) \, p(y_2 | x; \alpha), \qquad p(x) = \tfrac{1}{2} p_1(x) + \tfrac{1}{2} p_2(x)$, (29a)

$p(y_1{=}i | x) = \frac{p_i(x)}{p_1(x) + p_2(x)}, \qquad p(y_2{=}i | x; \alpha) = \frac{(\alpha p_1(x))^{[i=1]} \, ((1-\alpha) p_2(x))^{[i=2]}}{\alpha p_1(x) + (1-\alpha) p_2(x)}$. (29b)

While in the probabilistic model (3) all observations $y_i$ are exchangeable, in the parameterized model (29) $y_1$ and $y_2$ are not symmetric. This difference requires changes in the classifier training algorithm for the parameterized operations.

We develop the method for the parameterized operations based on two observations: $y_1$ appears in (29) in the same way as in (3); and the likelihood $p(y_2 | x; \alpha)$ of $y_2$ given $x$ can be expressed as a function of $p(y_1 | x)$ and $\alpha$:

$p(y_2{=}1 | x; \alpha) = \frac{\alpha p_1(x)}{\alpha p_1(x) + (1-\alpha) p_2(x)} = \frac{\alpha \, p(y_1{=}1 | x)}{\alpha \, p(y_1{=}1 | x) + (1-\alpha) \, p(y_1{=}2 | x)}$,

$p(y_2{=}2 | x; \alpha) = \frac{(1-\alpha) p_2(x)}{\alpha p_1(x) + (1-\alpha) p_2(x)} = \frac{(1-\alpha) \, p(y_1{=}2 | x)}{\alpha \, p(y_1{=}1 | x) + (1-\alpha) \, p(y_1{=}2 | x)}$, (30)

where the right-hand expressions follow by dividing the numerators and denominators by $p_1(x) + p_2(x)$.

These two observations combined suggest a training procedure where 1) the terminal state classifier is trained to approximate $p(y_1{=}i | x)$ in the same way as in Section 5.2; 2) the probability estimates $w_i(\bar{x}, \alpha; \bar{\phi}) \approx p(y_2{=}i | x = \bar{x}; \alpha)$ are expressed through the learned terminal state classifier via (30). Below we provide the details of this procedure for the case of GFlowNet composition; a code sketch of (30) and the $\alpha$-sampling scheme follows at the end of this section.

Learning the terminal state classifier. The marginal $y_1$ classifier $Q_\phi(y_1 | x)$ is learned by minimizing the cross-entropy loss

$\mathcal{L}_T(\phi) = \mathbb{E}_{(\bar{x}, \bar{y}_1) \sim p(x, y_1)} \big[ -\log Q_\phi(y_1 = \bar{y}_1 | x = \bar{x}) \big]$. (31)

Then, the joint classifier $Q_\phi(y_1, y_2 | x; \alpha)$ is constructed as

$Q_\phi(y_1, y_2 | x; \alpha) = Q_\phi(y_1 | x) \, Q_\phi(y_2 | x; \alpha)$, (32)

where $Q_\phi(y_2 | x; \alpha)$ is expressed through the marginal $Q_\phi(y_1 | x)$ via (30).

Learning the non-terminal state classifier. The non-terminal state classifier $Q_\phi(y_1, y_2 | s; \alpha)$ models $y_1$ and $y_2$ jointly. Note that $\alpha$ is one of the inputs to the classifier model. Given a sampled trajectory $\bar{\tau}$, labels $\bar{y}_1, \bar{y}_2$, and $\bar{\alpha}$, the total cross-entropy loss of all non-terminal states in $\bar{\tau}$ is

$\ell(\bar{\tau}, \bar{y}_1, \bar{y}_2, \bar{\alpha}; \phi) = \sum_{\bar{s}_t \in \bar{\tau}} \big[ -\log Q_\phi(y_1 = \bar{y}_1, y_2 = \bar{y}_2 | s = \bar{s}_t; \bar{\alpha}) \big]$. (33)

Figure C.1: Compositional sculpting and energy operations applied to 1D Gaussian distributions. (a) Densities of the base 1D Gaussian distributions $p_1(x) = \mathcal{N}(x; -5/4, 1)$ and $p_2(x) = \mathcal{N}(x; 5/4, 1/2)$. (b) Harmonic mean of $p_1$ and $p_2$ (ours) and the product of $p_1$ and $p_2$. (c) Parameterized contrasts of $p_1$ with $p_2$ (ours) at $\alpha \in \{0.001, 0.05, 0.5\}$. (d) Negations $p_1 \,\mathrm{neg}_\gamma\, p_2$ at $\gamma \in \{0.5, 0.25, 0.1\}$. Curves show the PDFs of the distributions.

The pairs $(\bar{\tau}, \bar{y}_1)$ can be generated via a sampling scheme similar to the one used for the terminal state classifier loss above: 1) $\bar{y}_1 \sim p(y_1)$ and 2) $\bar{\tau} \sim p_{\bar{y}_1}(\tau)$. An approximation of the distribution of $y_2$ given $\bar{\tau}$ is constructed using (30):

$w_1(\bar{x}, \alpha; \bar{\phi}) = \frac{\alpha \, Q_{\bar{\phi}}(y_1{=}1 | x = \bar{x})}{\alpha \, Q_{\bar{\phi}}(y_1{=}1 | x = \bar{x}) + (1-\alpha) \, Q_{\bar{\phi}}(y_1{=}2 | x = \bar{x})} \approx p(y_2{=}1 | x = \bar{x}; \alpha)$, (34a)

$w_2(\bar{x}, \alpha; \bar{\phi}) = \frac{(1-\alpha) \, Q_{\bar{\phi}}(y_1{=}2 | x = \bar{x})}{\alpha \, Q_{\bar{\phi}}(y_1{=}1 | x = \bar{x}) + (1-\alpha) \, Q_{\bar{\phi}}(y_1{=}2 | x = \bar{x})} \approx p(y_2{=}2 | x = \bar{x}; \alpha)$. (34b)

Since these expressions involve outputs of the terminal state classifier, which is being trained simultaneously, we again (see Section 5.2) introduce target network parameters $\bar{\phi}$ that are used to compute the probability estimates (34). The training loss for the non-terminal state classifier is

$\mathcal{L}_N(\phi, \bar{\phi}) = \mathbb{E}_{\bar{\alpha} \sim p(\alpha)} \, \mathbb{E}_{(\bar{\tau}, \bar{y}_1) \sim p(\tau, y_1)} \Big[ \sum_{\bar{y}_2 = 1}^{2} w_{\bar{y}_2}(\bar{x}, \bar{\alpha}; \bar{\phi}) \, \ell(\bar{\tau}, \bar{y}_1, \bar{y}_2, \bar{\alpha}; \phi) \Big]$, (35)

where $p(\alpha)$ is a sampling distribution over $\alpha \in (0, 1)$.
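A minimal sketch (ours, not the authors' code) of equations (34) and (36): the $\alpha$-parameterized $y_2$ probabilities derived from a learned marginal $y_1$ classifier, plus the $\alpha$-sampling scheme. `q_y1_probs` is an assumed pair $(Q_{\bar{\phi}}(y_1{=}1|x), Q_{\bar{\phi}}(y_1{=}2|x))$.

```python
import numpy as np

def y2_probs(q_y1_probs, alpha):
    """Equation (34): the weights w_1, w_2 expressed through Q(y1 | x)."""
    q1, q2 = q_y1_probs
    z = alpha * q1 + (1.0 - alpha) * q2
    return np.array([alpha * q1 / z, (1.0 - alpha) * q2 / z])

def sample_alpha(rng, B=3.5):
    """Equation (36): z ~ U[-B, B], alpha = sigmoid(z)."""
    z = rng.uniform(-B, B)
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
alpha = sample_alpha(rng)
print(y2_probs((0.7, 0.3), alpha))  # weights used in the loss (35)
```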
In our experiments, we used the following sampling scheme for $\alpha$:

$\bar{z} \sim U[-B, B], \qquad \bar{\alpha} = \frac{1}{1 + \exp(-\bar{z})}$. (36)

C Analysis of Compositional Sculpting and Energy Operations

The harmonic mean and contrast operations we have introduced are analogous to the product and negation operations for EBMs, respectively. Although the harmonic mean and product operations are quite similar in practice, unlike the negation operation our proposed contrast operation always results in a valid probability distribution. Figure C.1 shows the results of these operations applied to two Gaussian distributions. The harmonic mean and product, shown in panel (b), are both concentrated on points that have high probability under both Gaussians. Figure C.1(c) shows parameterized contrasts of $p_1$ with $p_2$ at different values of $\alpha$, and panel (d) shows negations $p_1 \,\mathrm{neg}_\gamma\, p_2$ at different values of $\gamma$. The effect of negation at $\gamma = 0.1$ resembles the effect of the contrast operation: the density retreats from the high-likelihood region of $p_2$. However, as $\gamma$ increases to $0.5$ the distribution starts to concentrate excessively on the values $x < -3$. This is due to the instability of the division $p_1(x) / (p_2(x))^\gamma$ in regions where $p_2(x) \approx 0$. Proposition C.1 shows that the negation $p_1 \,\mathrm{neg}_\gamma\, p_2$ in many cases results in an improper (non-normalizable) distribution.

Mathematical analysis of operations. Harmonic mean and product are not defined for pairs of distributions $p_1, p_2$ which have disjoint supports. In such cases, attempts at evaluating the expressions for the harmonic mean and the product will lead to impossible probability distributions that have zero probability mass (density) everywhere³. The results of both harmonic mean and product are correctly defined for any pair of distributions $p_1, p_2$ that have non-empty support intersection. Notably, contrast is well-defined for any input distributions, while negation is ill-defined for some input distributions $p_1, p_2$, as formally stated below (see Figure C.1(d) for a concrete example).

Proposition C.1. 1. For any $\alpha \in (0, 1)$ the parameterized contrast operation (7) is well-defined: it gives a proper distribution for any pair of distributions $p_1, p_2$. 2. For any $\gamma \in (0, 1)$ there are infinitely many pairs of distributions $p_1, p_2$ such that the negation

$(p_1 \,\mathrm{neg}_\gamma\, p_2)(x) \propto \exp\big\{ -\big( E_1(x) - \gamma E_2(x) \big) \big\} \propto \frac{p_1(x)}{(p_2(x))^\gamma}$ (37)

results in an improper (non-normalizable) distribution.

Proof. Without loss of generality, we prove the claims of the proposition assuming absolutely continuous distributions $p_1, p_2$ with probability density functions $p_1(\cdot), p_2(\cdot)$.

Claim 1. For any two distributions $p_1, p_2$ we have $p_1(x) \geq 0$, $p_2(x) \geq 0$, $\int p_1(x) \, dx = \int p_2(x) \, dx = 1 < \infty$. Then, the RHS of the expression for the parameterized contrast operation (7) satisfies

$\frac{p_1(x)^2}{\alpha p_1(x) + (1-\alpha) p_2(x)} \leq \frac{p_1(x)^2}{\alpha p_1(x)} = \frac{p_1(x)}{\alpha}, \qquad x \in \mathrm{supp}(p_1) \cup \mathrm{supp}(p_2)$. (38)

For points $x \notin \mathrm{supp}(p_1) \cup \mathrm{supp}(p_2)$, we set $\frac{p_1(x)^2}{\alpha p_1(x) + (1-\alpha) p_2(x)} = 0$, since by construction the composite distributions do not have probability mass outside of the union of the supports of the original distributions. The above implies that

$\int \frac{p_1(x)^2}{\alpha p_1(x) + (1-\alpha) p_2(x)} \, dx \leq \frac{1}{\alpha} \int p_1(x) \, dx = \frac{1}{\alpha} < \infty$. (39)

Therefore, the RHS of the expression for the parameterized contrast operation (7) can be normalized, and the parameterized contrast distribution is well-defined.

Claim 2. For any $\gamma \in (0, 1)$ we provide an infinite collection of distribution pairs $p_1$ and $p_2$ such that the negation $p_1 \,\mathrm{neg}_\gamma\, p_2$ results in a non-normalizable distribution.
For the given $\gamma \in (0, 1)$ we select four numbers $\mu_1 \in \mathbb{R}$, $\mu_2 \in \mathbb{R}$, $\sigma_1 > 0$, $\sigma_2 > 0$ such that

$\gamma \sigma_1^2 \geq \sigma_2^2$. (40)

Consider univariate normal distributions $p_1(x) = \mathcal{N}(x; \mu_1, \sigma_1^2)$, $p_2(x) = \mathcal{N}(x; \mu_2, \sigma_2^2)$ with density functions

$p_i(x) = \mathcal{N}(x; \mu_i, \sigma_i^2) = \frac{1}{\sqrt{2\pi}\,\sigma_i} \exp\Big( -\frac{(x - \mu_i)^2}{2\sigma_i^2} \Big), \qquad i \in \{1, 2\}$. (41)

For such $p_1$ and $p_2$, the RHS of (37) is

$\frac{p_1(x)}{(p_2(x))^\gamma} = \frac{1}{(\sqrt{2\pi})^{1-\gamma}} \frac{\sigma_2^\gamma}{\sigma_1} \exp\Big( x^2 \, \frac{\gamma \sigma_1^2 - \sigma_2^2}{2 \sigma_1^2 \sigma_2^2} + x \Big( \frac{\mu_1}{\sigma_1^2} - \frac{\gamma \mu_2}{\sigma_2^2} \Big) + \frac{\gamma \mu_2^2}{2 \sigma_2^2} - \frac{\mu_1^2}{2 \sigma_1^2} \Big)$. (42)

Condition (40) implies that the quadratic function under the exponent above has a non-negative coefficient for $x^2$. Therefore this function either grows unbounded as $x \to \pm\infty$ (if the coefficients of the quadratic and linear terms are not both zero), or is constant (if the coefficients of the quadratic and linear terms are both zero). In either case, $\int_{\mathbb{R}} \frac{p_1(x)}{(p_2(x))^\gamma} \, dx = \infty$.

³ Informal interpretation: distributions with disjoint supports have empty intersections (think of the intersection-of-sets analogy).
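As a sanity check of claim 2 (ours, not part of the paper), the following snippet numerically integrates the unnormalized negation of two illustrative Gaussians satisfying condition (40) over growing intervals; the running value keeps increasing instead of converging, consistent with the integral being infinite.

```python
import numpy as np
from scipy.integrate import trapezoid
from scipy.stats import norm

# Illustrative parameters satisfying condition (40): gamma * sigma1^2 >= sigma2^2.
gamma, mu1, s1, mu2, s2 = 0.5, -1.25, 1.0, 1.25, np.sqrt(0.5)
assert gamma * s1**2 >= s2**2

def neg_density(x):
    # p1(x) / p2(x)^gamma, computed in log space for numerical stability.
    return np.exp(norm.logpdf(x, mu1, s1) - gamma * norm.logpdf(x, mu2, s2))

for half_width in (5, 10, 20, 40):
    x = np.linspace(-half_width, half_width, 400_001)
    print(half_width, trapezoid(neg_density(x), x))  # grows without bound
```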
D Proofs and Derivations

D.1 Proof of Proposition 5.1

Our goal is to show that the policy (12) induces the mixture distribution $p_M(x) = \sum_{i=1}^{m} \omega_i p_i(x)$.

Preliminaries. In our proof below we use the notion of the probability of observing a state $s$ on a GFlowNet trajectory. Following Bengio et al. [35], we abuse notation and denote this probability by

$p_i(s) \triangleq p_i(\{\tau : s \in \tau\}) = \sum_{\tau \in \mathcal{T}_{s_0, s}} \prod_{t} p_{i,F}(s_t | s_{t-1})$, (43)

where $\mathcal{T}_{s_0, s}$ is the set of all (sub)trajectories starting at $s_0$ and ending at $s$. The probabilities induced by the policy (12) are denoted by $p_M(s)$. Note that $p_i(s)$ and $p_M(s)$ should not be interpreted as probability mass functions over the set of states $\mathcal{S}$. In particular, $p_i(s_0) = p_M(s_0) = 1$, and the sums $\sum_s p_i(s)$, $\sum_s p_M(s)$ are not equal to 1 (unless $\mathcal{S} = \{s_0\}$). However, the functions $p_i(\cdot)$, $p_M(\cdot)$ restricted to the set of terminal states give valid probability distributions over $\mathcal{X}$: $\sum_x p_i(x) = \sum_x p_M(x) = 1$. By definition, $p_i(\cdot)$ and $p_M(\cdot)$ satisfy the recurrent relationships

$p_i(s) = \sum_{s' : (s' \to s)} p_i(s') \, p_{i,F}(s | s'), \qquad p_M(s) = \sum_{s' : (s' \to s)} p_M(s') \, p_{M,F}(s | s')$. (44)

The joint distribution of $y$ and $\tau$ described in the statement of Proposition 5.1 is $p(\tau, y{=}i) = \omega_i p_i(\tau)$. This joint distribution over $y$ and trajectories implies the following expressions for the distributions involving intermediate states $s$:

$p(y{=}i) = \omega_i$, (45)
$p(\tau | y{=}i) = p_i(\tau) = \prod_{t} p_{i,F}(s_t | s_{t-1})$, (46)
$p(\tau) = \sum_{i=1}^{m} p(\tau | y{=}i) \, p(y{=}i) = \sum_{i=1}^{m} \omega_i p_i(\tau)$, (47)
$p(s | y{=}i) = p_i(s)$, (48)
$p(s) = \sum_{i=1}^{m} p(s | y{=}i) \, p(y{=}i) = \sum_{i=1}^{m} \omega_i p_i(s)$. (49)

Proof. Using the notation introduced above, we can formally state our goal. We need to show that $p_M(x)$ induced by $p_{M,F}$ gives the mixture distribution

$p_M(x) = \sum_{i=1}^{m} \omega_i p_i(x)$. (50)

We prove a more general equation for all states $s$,

$p_M(s) = \sum_{i=1}^{m} \omega_i p_i(s)$, (51)

by induction over the DAG $(\mathcal{S}, \mathcal{A})$.

Base case. Consider the initial state $s_0$. By definition, $p_i(s_0) = p_M(s_0) = 1$, which implies

$p_M(s_0) = \sum_{i=1}^{m} \omega_i p_i(s_0)$. (52)

Inductive step. Consider a state $s$ such that (51) holds for all predecessor states $s' : (s' \to s) \in \mathcal{A}$. For such a state we have

$p_M(s) = \sum_{s' : (s' \to s)} p_M(s') \, p_{M,F}(s | s')$  {used (44)} (53)
$= \sum_{s' : (s' \to s)} p_M(s') \sum_{i=1}^{m} p(y{=}i | s') \, p_{i,F}(s | s')$  {used the definition of $p_{M,F}$} (54)
$= \sum_{s' : (s' \to s)} p_M(s') \sum_{i=1}^{m} \frac{p(s' | y{=}i) \, p(y{=}i)}{p(s')} \, p_{i,F}(s | s')$  {used Bayes' theorem} (55)
$= \sum_{s' : (s' \to s)} p_M(s') \sum_{i=1}^{m} \frac{\omega_i p_i(s')}{\sum_{j=1}^{m} \omega_j p_j(s')} \, p_{i,F}(s | s')$  {used (45), (48), (49)} (56)
$= \sum_{s' : (s' \to s)} \sum_{i=1}^{m} \omega_i p_i(s') \, p_{i,F}(s | s')$  {used the induction hypothesis} (57)
$= \sum_{i=1}^{m} \omega_i \sum_{s' : (s' \to s)} p_i(s') \, p_{i,F}(s | s')$  {changed summation order} (58)
$= \sum_{i=1}^{m} \omega_i p_i(s)$,  {used (44)} (59)

which proves (51) for $s$.

D.2 Proof of Proposition 5.2

Claim 1. Our goal is to prove the relationship (13) for all non-terminal states $s$. To prove this relationship, we invoke several important properties of Markovian probability flows on DAGs [35]. By Proposition 16 of Bengio et al. [35], for the given GFlowNet forward policy $p_F(\cdot|\cdot)$ there exists a unique backward policy $p_B(\cdot|\cdot)$ such that the probability of any complete trajectory $\tau = (s_0 \to \cdots \to s_{|\tau|} = x)$ in the DAG $(\mathcal{S}, \mathcal{A})$ can be expressed as

$p(\tau) = p(x) \prod_{t=1}^{|\tau|} p_B(s_{t-1} | s_t)$, (60)

and the probability of observing a state $s$ on a trajectory can be expressed as

$p(s) = \sum_{x} p(x) \sum_{\tau \in \mathcal{T}_{s,x}} \prod_{t} p_B(s_{t-1} | s_t)$, (61)

where $\mathcal{T}_{s,x}$ is the set of all (sub)trajectories starting at $s$ and ending at $x$. Moreover, $p_F(\cdot|\cdot)$ and $p_B(\cdot|\cdot)$ are related through the detailed balance condition [35, Proposition 21]:

$p(s) \, p_F(s' | s) = p(s') \, p_B(s | s'), \qquad (s \to s') \in \mathcal{A}$. (62)

By the statement of Proposition 5.2, in the probabilistic model $p(x, y)$ the marginal distribution $p(x)$ is realized by the GFlowNet forward policy $p_F(\cdot|\cdot)$ and $y$ is independent of the intermediate states $s$. The joint distribution $p(s, y)$ is given by

$p(s, y) = \sum_{x} p(x, y) \sum_{\tau \in \mathcal{T}_{s,x}} \prod_{t} p_B(s_{t-1} | s_t)$ (63)
$= \sum_{x} p(x, y) \sum_{s' : (s \to s')} p_B(s | s') \sum_{\tau \in \mathcal{T}_{s',x}} \prod_{t} p_B(s_{t-1} | s_t)$ (64)
$= \sum_{s' : (s \to s')} p_B(s | s') \sum_{x} p(x, y) \sum_{\tau \in \mathcal{T}_{s',x}} \prod_{t} p_B(s_{t-1} | s_t)$ (65)
$= \sum_{s' : (s \to s')} p_B(s | s') \, p(s', y)$. (66)

Expressing the conditional probability $p(y | s)$ through the joint $p(s, y)$, we obtain

$p(y | s) = \frac{p(s, y)}{p(s)}$  {used the definition of conditional probability} (67)
$= \frac{1}{p(s)} \sum_{s' : (s \to s')} p_B(s | s') \, p(s', y)$  {used (66)} (68)
$= \sum_{s' : (s \to s')} \frac{p_B(s | s') \, p(s')}{p(s)} \, p(y | s')$  {decomposed $p(s', y) = p(s') \, p(y | s')$} (69)
$= \sum_{s' : (s \to s')} p_F(s' | s) \, p(y | s')$,  {used (62)} (70)

which proves (13).

Claim 2. Our goal is to show that the classifier-guided policy (14) induces the conditional distribution $p(x | y)$. We know (Section D.1) that the state probabilities induced by the marginal GFlowNet policy $p_F(\cdot|\cdot)$ satisfy the recurrence

$p(s) = \sum_{s' : (s' \to s)} p(s') \, p_F(s | s')$. (71)

Let $p_y(\cdot)$ denote the state probabilities induced by the classifier-guided policy (14). These probabilities by definition (Section D.1) satisfy the recurrence

$p_y(s) = \sum_{s' : (s' \to s)} p_y(s') \, p_F(s | s', y)$. (72)

We show that

$p_y(s) = p(s | y)$ (73)

by induction over the DAG $(\mathcal{S}, \mathcal{A})$.

Base case. Consider the initial state $s_0$. By definition $p_y(s_0) = 1$. At the same time, $p(s_0 | y) = p(\{\tau : s_0 \in \tau\} | y) = 1$. Therefore $p_y(s_0) = p(s_0 | y)$.

Inductive step. Consider a state $s$ such that (73) holds for all predecessor states $s' : (s' \to s) \in \mathcal{A}$. For such a state we have

$p_y(s) = \sum_{s' : (s' \to s)} p_y(s') \, p_F(s | s', y)$  {used (72)} (74)
$= \sum_{s' : (s' \to s)} p_y(s') \, p_F(s | s') \frac{p(y | s)}{p(y | s')}$  {used (14)} (75)
$= \sum_{s' : (s' \to s)} p(s' | y) \, p_F(s | s') \frac{p(y | s)}{p(y | s')}$  {used the induction hypothesis} (76)
$= \sum_{s' : (s' \to s)} \frac{p(y | s') \, p(s')}{p(y)} \, p_F(s | s') \frac{p(y | s)}{p(y | s')}$  {used Bayes' theorem} (77)
$= \frac{p(y | s)}{p(y)} \sum_{s' : (s' \to s)} p(s') \, p_F(s | s')$  {rearranged terms} (78)
$= \frac{p(y | s) \, p(s)}{p(y)}$  {used (71)} (79)
$= p(s | y)$,  {used Bayes' theorem} (80)

which proves (73) for state $s$.

D.3 Proof of Theorem 5.3

By Proposition 5.1 we have that the policy

$p_{M,F}(s' | s) = \sum_{i=1}^{m} p_{i,F}(s' | s) \, p(y{=}i | s)$ (81)

generates the mixture distribution $p_M(x) = \frac{1}{m} \sum_{i=1}^{m} p_i(x)$. In the probabilistic model $p(x, y_1, \ldots, y_n)$ the marginal distribution $p(x) = p_M(x)$ is realized by the mixture policy $p_{M,F}$. Therefore, $p(x, y_1, \ldots, y_n)$ satisfies the conditions of Proposition 5.2, which states that the conditional distribution $p(x | y_1, \ldots, y_n)$ is realized by the classifier-guided policy

$p_F(s' | s, y_1, \ldots, y_n) = p_{M,F}(s' | s) \, \frac{p(y_1, \ldots, y_n | s')}{p(y_1, \ldots, y_n | s)} = \frac{p(y_1, \ldots, y_n | s')}{p(y_1, \ldots, y_n | s)} \sum_{i=1}^{m} p_{i,F}(s' | s) \, p(y{=}i | s)$. (82)

D.4 Detailed Derivation of the Classifier Training Objective

This section provides a more detailed, step-by-step derivation of the non-terminal state classifier training objective (18).

Step 1. Our goal is to train a classifier $Q(y_1, \ldots, y_n | s)$.
This classifier can be obtained as the optimal solution of

$\min_\phi \; \mathbb{E}_{(\bar{\tau}, \bar{y}_1, \ldots, \bar{y}_n) \sim p(\tau, y_1, \ldots, y_n)} \big[ \ell(\bar{\tau}, \bar{y}_1, \ldots, \bar{y}_n; \phi) \big]$, (83)

where $\ell(\cdot)$ is defined in equation (17). An unbiased estimate of the loss (and its gradient) can be obtained by sampling $(\bar{\tau}, \bar{y}_1, \ldots, \bar{y}_n)$ and evaluating (17) directly. However, sampling tuples $(\tau, y_1, \ldots, y_n)$ is not straightforward. The following steps describe our proposed approach to the estimation of the expectation in (83).

Step 2. The expectation in (83) can be expressed as

$\mathbb{E}_{(\bar{\tau}, \bar{y}_1) \sim p(\tau, y_1)} \Big[ \sum_{\bar{y}_2, \ldots, \bar{y}_n} \Big( \prod_{i=2}^{n} p(y_i = \bar{y}_i | x = \bar{x}) \Big) \, \ell(\bar{\tau}, \bar{y}_1, \ldots, \bar{y}_n; \phi) \Big]$, (84)

where we re-wrote the expectation over $(y_2, \ldots, y_n) | \tau$ as an explicit sum of the form $\mathbb{E}_{q(z)}[g(z)] = \sum_z q(z) g(z)$. The expectation over $(\tau, y_1)$ can be estimated by sampling pairs $(\bar{\tau}, \bar{y}_1)$ as described in the paragraph after equation (17): 1) $\bar{y}_1 \sim p(y_1)$ and 2) $\bar{\tau} \sim p_{\bar{y}_1}(\tau)$. The only missing part is the probabilities $p(y_i = \bar{y}_i | x = \bar{x})$, which are not directly available.

Step 3. Our proposal is to approximate these probabilities as $p(y_1 = j | x = \bar{x}) \approx w_j(\bar{x}; \phi) = Q_\phi(y_1 = j | x = \bar{x})$. The idea here is that the terminal state classifier $Q_\phi(y_1 | x)$, when trained to optimality, produces outputs exactly equal to the probabilities $p(y_1 | x)$, and the more the classifier is trained, the better the approximation of the probabilities becomes.

Step 4. Steps 1-3 give a procedure where the computation of the non-terminal state classification loss requires access to the terminal state classifier. As described in the paragraph preceding equation (18), we propose to train the non-terminal and terminal classifiers simultaneously and introduce target network parameters. The weights $w$ are computed by the target network $Q_{\bar{\phi}}$.

Combining all the steps above, we arrive at objective (18), which we use to estimate the expectation in (83). Note that equation (18) involves summation over $\bar{y}_2, \ldots, \bar{y}_n$ with $m^{n-1}$ terms in the sum. If the values of $n$ and $m$ are small, the sum can be evaluated directly. In general, one could trade off estimation accuracy for improved speed by replacing the summation with Monte Carlo estimation, as sketched below. In this case, the values $\bar{y}_k$ are sampled from the categorical distributions $Q_\phi(y | \bar{x})$. Note that the labels can be sampled in parallel, since the $y_i$ are independent given $x$.
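A minimal sketch (ours, not the authors' code) of the Monte Carlo variant just mentioned: instead of enumerating all $m^{n-1}$ label combinations, the labels $\bar{y}_2, \ldots, \bar{y}_n$ are drawn from the terminal classifier's categorical distribution, and since they are independent given $x$ they can all be sampled at once. `q_terminal` is an assumed callable returning a (batch, m) matrix of probabilities $Q_\phi(y_1 = i | x)$.

```python
import torch

def mc_label_samples(q_terminal, x0, n, num_samples):
    probs = q_terminal(x0)                         # (batch, m)
    dist = torch.distributions.Categorical(probs=probs)
    # y_2, ..., y_n are independent given x, so all labels are drawn in one
    # call: shape (num_samples, n - 1, batch), entries in {0, ..., m-1}.
    # The loss (18) is then estimated by averaging the cross-entropy terms
    # over these sampled label tuples instead of the exact weighted sum.
    return dist.sample((num_samples, n - 1))
```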
D.5 Assumptions and Proof of Proposition A.1

This subsection provides a formal statement of the assumptions and a more detailed formulation of Proposition A.1. The assumptions, the formulation of the result, and the proof below closely follow those of Theorem 1 of Peluchetti [53]. Theorem 1 in [53] generalizes the result of Brigo [57] (Corollary 1.3), which derives the SDE for mixtures of 1D diffusion processes. We found an error in the statement and the proof of Theorem 1 (Appendix A.2 of [53]). The error makes the result of [53] for $D$-dimensional diffusion processes disagree with the result of [57] for 1-dimensional diffusion processes. Here we provide a corrected version of Theorem 1 of [53] in a modified notation and a simplified setting (a mixture of a finite rather than infinite number of diffusion processes). Most of the content is directly adapted from [53].

Notation: for a vector-valued $f : \mathbb{R}^D \to \mathbb{R}^D$, the divergence of $f$ is denoted as $\nabla \cdot (f(x)) = \sum_{d=1}^{D} \frac{\partial}{\partial x_d} f_d(x)$; for a scalar-valued $a : \mathbb{R}^D \to \mathbb{R}$, the divergence of the gradient of $a$ (the Laplace operator) is denoted by $\Delta(a(x)) = \nabla \cdot (\nabla a(x)) = \sum_{d=1}^{D} \frac{\partial^2}{\partial x_d^2} a(x)$.

Assumption 1 (SDE solution). A given $D$-dimensional SDE$(f, g)$:

$dx_t = f_t(x_t) \, dt + g_t \, dw_t$, (85)

with associated initial distribution $p_0(x)$ and integration interval $[0, T]$, admits a unique strong solution on $[0, T]$.

Assumption 2 (SDE density). A given $D$-dimensional SDE$(f, g)$ with associated initial distribution $p_0(x)$ and integration interval $[0, T]$ admits a marginal density on $(0, T)$ with respect to the $D$-dimensional Lebesgue measure that uniquely satisfies the Fokker-Planck (Kolmogorov forward) partial differential equation (PDE):

$\frac{\partial p_t(x)}{\partial t} = -\nabla \cdot (f_t(x) \, p_t(x)) + \frac{1}{2} \Delta(g_t^2 \, p_t(x))$. (86)

Assumption 3 (positivity). For a given stochastic process, all finite-dimensional densities, conditional or not, are strictly positive.

Theorem D.1 (Diffusion mixture representation). Consider the family of $D$-dimensional SDEs on $t \in [0, T]$ indexed by $i \in \{1, \ldots, m\}$,

$dx_{i,t} = f_{i,t}(x_{i,t}) \, dt + g_{i,t} \, dw_{i,t}, \qquad x_{i,0} \sim p_{i,0}$, (87)

where the initial distributions $p_{i,0}$ and the Wiener processes $w_{i,t}$ are all independent. Let $p_{i,t}$, $t \in (0, T)$, denote the marginal density of $x_{i,t}$. For mixing weights $\{\omega_i\}_{i=1}^{m}$, $\omega_i \geq 0$, $\sum_{i=1}^{m} \omega_i = 1$, define the mixture marginal density $p_{M,t}$ for $t \in (0, T)$ and the mixture initial distribution $p_{M,0}$ by

$p_{M,t}(x) = \sum_{i=1}^{m} \omega_i p_{i,t}(x), \qquad p_{M,0}(x) = \sum_{i=1}^{m} \omega_i p_{i,0}(x)$. (88)

Consider the $D$-dimensional SDE on $t \in [0, T]$ defined by

$f_{M,t}(x) = \frac{\sum_{i=1}^{m} \omega_i p_{i,t}(x) f_{i,t}(x)}{p_{M,t}(x)}, \qquad g_{M,t}(x) = \sqrt{\frac{\sum_{i=1}^{m} \omega_i p_{i,t}(x) g_{i,t}^2}{p_{M,t}(x)}}$, (89)

$dx_t = f_{M,t}(x_t) \, dt + g_{M,t}(x_t) \, dw_t, \qquad x_{M,0} \sim p_{M,0}$. (90)

It is assumed that all diffusion processes $x_{i,t}$ and the diffusion process $x_{M,t}$ satisfy the regularity Assumptions 1, 2, and 3. Then the marginal distribution of the diffusion $x_{M,t}$ is $p_{M,t}$.

Proof. For $0 < t < T$ we have

$\frac{\partial p_{M,t}(x)}{\partial t} = \frac{\partial}{\partial t} \sum_{i=1}^{m} \omega_i p_{i,t}(x)$ (91)
$= \sum_{i=1}^{m} \omega_i \frac{\partial p_{i,t}(x)}{\partial t}$ (92)
$= \sum_{i=1}^{m} \omega_i \Big( -\nabla \cdot (f_{i,t}(x) \, p_{i,t}(x)) + \frac{1}{2} \Delta(g_{i,t}^2 \, p_{i,t}(x)) \Big)$ (93)
$= -\nabla \cdot \Big( \frac{\sum_{i=1}^{m} \omega_i p_{i,t}(x) f_{i,t}(x)}{p_{M,t}(x)} \, p_{M,t}(x) \Big) + \frac{1}{2} \Delta\Big( \frac{\sum_{i=1}^{m} \omega_i p_{i,t}(x) g_{i,t}^2}{p_{M,t}(x)} \, p_{M,t}(x) \Big)$ (94)-(95)
$= -\nabla \cdot (f_{M,t}(x) \, p_{M,t}(x)) + \frac{1}{2} \Delta(g_{M,t}^2(x) \, p_{M,t}(x))$. (96)

The second line is an exchange of the order of summation and differentiation, the third line is the application of the Fokker-Planck PDEs for the processes $x_{i,t}$, the fourth line is a rewriting in terms of $p_{M,t}$, and the fifth line is another exchange of the order of summation and differentiation. The result follows by noticing that $p_{M,t}(x)$ satisfies the Fokker-Planck equation of SDE$(f_M, g_M)$.

Proof of Proposition A.1. Below, we show that the result of Proposition A.1 follows from Theorem D.1. First, we rewrite $f_{M,t}(x_t)$ and $g_{M,t}(x_t)$ in (89) in terms of the classifier probabilities (21):

$f_{M,t}(x_t) = \frac{\sum_{i=1}^{m} \omega_i p_{i,t}(x_t) f_{i,t}(x_t)}{p_{M,t}(x_t)} = \sum_{i=1}^{m} p(y{=}i | x_t) f_{i,t}(x_t)$, (97)

$g_{M,t}^2(x_t) = \frac{\sum_{i=1}^{m} \omega_i p_{i,t}(x_t) g_{i,t}^2}{p_{M,t}(x_t)} = \sum_{i=1}^{m} p(y{=}i | x_t) \, g_{i,t}^2$. (98)

With these expressions, we apply the result of Theorem D.1 to the base forward processes $dx_{i,t} = f_{i,t}(x_{i,t}) \, dt + g_{i,t} \, dw_{i,t}$ and obtain the mixture forward process in equation (19). From the forward process, we derive the backward process following Song et al. [34]. Using the result of Anderson [38], the backward process for (19) is given by

$dx_t = \Big[ f_{M,t}(x_t) - \nabla_{x_t}\big(g_{M,t}^2(x_t)\big) - g_{M,t}^2(x_t) \, \nabla_{x_t} \log p_{M,t}(x_t) \Big] dt + g_{M,t}(x_t) \, d\bar{w}_t$. (99)

Note that the term $\nabla_{x_t}(g_{M,t}^2(x_t))$ is due to the fact that the diffusion coefficient $g_{M,t}(x_t)$ in (19) is a function of $x$ (cf. equation (16), Appendix A in [34]).
This term can be transformed as follows:

$\nabla_{x_t}\big(g_{M,t}^2(x_t)\big) = \nabla_{x_t} \sum_{i=1}^{m} p(y{=}i | x_t) \, g_{i,t}^2$ (100)
$= \sum_{i=1}^{m} g_{i,t}^2 \, \nabla_{x_t} p(y{=}i | x_t)$ (101)
$= \sum_{i=1}^{m} p(y{=}i | x_t) \, g_{i,t}^2 \, \nabla_{x_t} \log p(y{=}i | x_t)$ (102)
$= \sum_{i=1}^{m} p(y{=}i | x_t) \, g_{i,t}^2 \, \nabla_{x_t} \big( \log \omega_i + \log p_{i,t}(x_t) - \log p_{M,t}(x_t) \big)$ (103)
$= \sum_{i=1}^{m} p(y{=}i | x_t) \, g_{i,t}^2 \, \big( \nabla_{x_t} \log p_{i,t}(x_t) - \nabla_{x_t} \log p_{M,t}(x_t) \big)$ (104)
$= \sum_{i=1}^{m} p(y{=}i | x_t) \, g_{i,t}^2 \, s_{i,t}(x_t) - g_{M,t}^2(x_t) \, \nabla_{x_t} \log p_{M,t}(x_t)$. (105)

Substituting the last expression in (99), we notice that the term $g_{M,t}^2(x_t) \nabla_{x_t} \log p_{M,t}(x_t)$ cancels out, and, after simple algebraic manipulations, we arrive at (20).

D.6 Proof of Theorem A.2

We prove Theorem A.2 under the assumptions of Section D.5. In Section D.5 we established that the mixture diffusion process has the forward SDE

$dx_t = f_{M,t}(x_t) \, dt + g_{M,t}(x_t) \, dw_t$ (106)

and the backward SDE

$dx_t = \Big[ f_{M,t}(x_t) - \nabla_{x_t}\big(g_{M,t}^2(x_t)\big) - g_{M,t}^2(x_t) \, \nabla_{x_t} \log p_{M,t}(x_t) \Big] dt + g_{M,t}(x_t) \, d\bar{w}_t$. (107)

We apply classifier guidance with the classifier $p(y_1, \ldots, y_n | x_t)$ to the mixture diffusion process, following Song et al. [34] (see equations (48)-(49) in [34]). The backward SDE of the classifier-guided mixture diffusion is

$dx_t = \Big[ f_{M,t}(x_t) - \nabla_{x_t}\big(g_{M,t}^2(x_t)\big) - g_{M,t}^2(x_t) \big( \nabla_{x_t} \log p_{M,t}(x_t) + \nabla_{x_t} \log p(y_1, \ldots, y_n | x_t) \big) \Big] dt$ (108)
$\quad + \; g_{M,t}(x_t) \, d\bar{w}_t$. (109)

Finally, we arrive at (23) by substituting (105) in the above, canceling out the term $g_{M,t}^2(x_t) \nabla_{x_t} \log p_{M,t}(x_t)$, and applying simple algebraic manipulations.

E Implementation Details

E.1 Classifier Guidance in GFlowNets

Classifier guidance in GFlowNets (14) is realized through modification of the base forward policy via multiplication by the ratio of the classifier outputs $p(y | s') / p(y | s)$. The ground-truth (theoretically optimal) non-terminal state classifier $p(y | s)$, by Proposition 5.2, satisfies (13), which ensures that the guided policy (14) is valid, i.e. for any state $s$,

$\sum_{s' : (s \to s')} p_F(s' | s, y) = \sum_{s' : (s \to s')} p_F(s' | s) \frac{p(y | s')}{p(y | s)}$ (110)
$= \frac{\sum_{s' : (s \to s')} p_F(s' | s) \, p(y | s')}{p(y | s)} = 1$, (111)

where the numerator equals $p(y | s)$ by Proposition 5.2. In practice, the ground-truth values of $p(y | s)$ are unavailable. Instead, an approximation $Q_\phi(y | s) \approx p(y | s)$ is learned. Equation (13) might not hold for the learned classifier $Q_\phi$, but we still wish to use $Q_\phi$ for classifier guidance in practice. In order to ensure that the classifier-guided policy is valid in practice even when the approximation $Q_\phi(y | s)$ of the classifier $p(y | s)$ is used, we implement guidance as described below.

First, we express the guided policy (14) in terms of log-probabilities:

$\log p_F(s' | s, y) = \log p_F(s' | s) + \log p(y | s') - \log p(y | s)$. (112)

Parameterizing distributions through log-probabilities is common practice in probabilistic modeling: GFlowNet forward policies [36, 58] and probabilistic classifiers are typically parameterized by deep neural networks that output logits (unnormalized log-probabilities). Second, in the log-probability parameterization, the guided policy (14) can be equivalently expressed as

$p_F(s' | s, y) = \big[ \mathrm{softmax}\big( \log p_F(\cdot | s) + \log p(y | \cdot) - \log p(y | s) \big) \big]_{s'}$ (113)
$= \frac{\exp\big( \log p_F(s' | s) + \log p(y | s') - \log p(y | s) \big)}{\sum_{s'' : (s \to s'')} \exp\big( \log p_F(s'' | s) + \log p(y | s'') - \log p(y | s) \big)}$. (114)

In theory, the softmax operation can be replaced with simple exponentiation, i.e. the numerator in (114) is sufficient on its own, since Proposition 5.2 ensures that the sum in the denominator equals 1. However, using the softmax is beneficial in practice when we substitute the learned classifier $Q_\phi(y | s)$ for the ground-truth classifier $p(y | s)$. Indeed, when $Q_\phi(y | s)$ does not satisfy (13), the softmax operation ensures that the guided policy

$p_F(s' | s, y) = \big[ \mathrm{softmax}\big( \log p_F(\cdot | s) + \log Q_\phi(y | \cdot) - \log Q_\phi(y | s) \big) \big]_{s'}$ (115)

is valid (i.e. the probabilities sum to 1 over $s'$). The fact that the softmax expression is valid in theory ensures that the policy (115) guided by $Q_\phi(y | s)$ approaches the ground-truth policy (guided by $p(y | s)$) as $Q_\phi(y | s)$ approaches $p(y | s)$ throughout training.
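A minimal sketch (ours, not the authors' code) of the softmax-normalized guided policy (115). Here `policy_logits` holds $\log p_F(s' | s)$ for each child $s'$ of the current state, and `clf_logprob_children` / `clf_logprob_state` hold the learned $\log Q_\phi(y | s')$ and $\log Q_\phi(y | s)$; all three names are placeholders for network outputs.

```python
import torch

def guided_policy(policy_logits, clf_logprob_children, clf_logprob_state):
    # log p_F(s'|s) + log Q(y|s') - log Q(y|s); the last term is constant in s'
    # and only shifts the logits, but we keep it to mirror equation (115).
    logits = policy_logits + clf_logprob_children - clf_logprob_state
    return torch.softmax(logits, dim=-1)  # guaranteed to sum to 1 over s'
```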
F Experiment Details

F.1 2D Distributions with GFlowNets

The base GFlowNet forward policies $p_{i,F}(s' | s; \theta)$ were parameterized as MLPs with 2 hidden layers and 256 units in each hidden layer. The cell coordinates of a state $s$ on the 2D $32 \times 32$ grid were one-hot encoded, so the dimensionality of the input was $2 \cdot 32 = 64$. The outputs of the forward policy network were the logits of the softmax distribution over 3 action choices: 1) move down; 2) move right; 3) stop. We trained the base GFlowNets with the trajectory balance loss [58]. We fixed the uniform backward policy in the trajectory balance objective. We used the Adam optimizer [59] with learning rate 0.001 and pre-trained the base models for 20 000 steps with batch size 16 (16 trajectories per batch). The log of the total flow, $\log Z_\theta$, was optimized with Adam with learning rate 0.1. In order to promote exploration in trajectory sampling for the trajectory balance objective, we used a sampling policy which takes random actions (uniformly) with probability 0.05 but otherwise follows the current learned forward policy.

The classifier was parameterized as an MLP with 2 hidden layers and 256 units in each hidden layer. The inputs to the classifier were the one-hot encoded cell coordinates, a terminal state flag ($\{0, 1\}$), and $\log(\alpha / (1 - \alpha))$ (in the case of parameterized operations). The classifier outputs were the logits of the joint label distribution $Q_\phi(y_1, \ldots, y_n | s)$ for non-terminal states $s$ and the logits of the marginal label distribution $Q_\phi(y_1 | x)$ for terminal states $x$. We trained the classifier with the loss described in Section 5.2. We used Adam with learning rate 0.001. We performed 15 000 training steps with batch size 64 (64 trajectories sampled from each of the base models per training step). We updated the target network parameters $\bar{\phi}$ as the exponential moving average (EMA) of $\phi$ with smoothing factor 0.995. We linearly increased the weight $\gamma(\text{step})$ of the non-terminal state loss from 0 to 1 throughout the first 3 000 steps and kept it constant at $\gamma = 1$ afterward. For the $\alpha$-parameterized version of the classifier (Section B), we used the following sampling scheme for $\alpha$: $\bar{z} \sim U[-3.5, 3.5]$, $\bar{\alpha} = \frac{1}{1 + \exp(-\bar{z})}$.

Quantitative evaluation. For each of the composite distributions shown in Figure 2 we evaluated the L1 distance $L_1(p_{\text{method}}, p_{\text{GT}}) = \sum_x |p_{\text{method}}(x) - p_{\text{GT}}(x)|$ between the distribution $p_{\text{method}}$ induced by the classifier-guided policy and the ground-truth composition distribution $p_{\text{GT}}$ computed from the known base model probabilities $p_i$; a sketch of this computation follows below. The evaluation results are presented next.

Figure 2, top row: harmonic mean of $p_1$ and $p_2$: $L_1 = 0.071$; contrast of $p_1$ with $p_2$: $L_1 = 0.086$; parameterized contrast with coefficient 0.95: $L_1 = 0.167$.

Figure 2, bottom row: $p(x | y_1{=}1, y_2{=}2)$: $L_1 = 0.076$; $p(x | y_1{=}1, y_2{=}2, y_3{=}3)$: $L_1 = 0.087$; $p(x | y_1{=}2, y_2{=}2)$: $L_1 = 0.112$; $p(x | y_1{=}2, y_2{=}2, y_3{=}2)$: $L_1 = 0.122$.

Figure G.6 shows the distance between the composition and the ground truth as a function of the number of classifier training steps, as well as the terminal and non-terminal classifier learning curves.
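A small sketch (ours, not from the project codebase) of the quantitative evaluation above, assuming the harmonic-mean composition density is proportional to $p_1 p_2 / (p_1 + p_2)$, the harmonic mean of the two densities; `p1`, `p2`, and `p_method` are probability tables over the $32 \times 32$ terminal grid.

```python
import numpy as np

def harmonic_mean_gt(p1, p2, eps=1e-12):
    # Ground-truth composition: proportional to p1(x) p2(x) / (p1(x) + p2(x)).
    unnorm = p1 * p2 / (p1 + p2 + eps)
    return unnorm / unnorm.sum()

def l1_distance(p_method, p_gt):
    # L1(p_method, p_GT) = sum_x |p_method(x) - p_GT(x)|, as defined above.
    return np.abs(p_method - p_gt).sum()
```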
F.2 Molecule Generation

Domain. In the molecule generation task [36], the objects $x$ are molecular graphs. The non-terminal states $s$ are incomplete molecular graphs. The transitions from a given non-terminal state $s$ are of two types: 1) fragment addition $s \to s'$: a new molecular graph $s'$ is obtained by attaching a new fragment to the molecular graph $s$; 2) stop action $s \to x$: if $s \neq s_0$, the generation process can be terminated at the molecular graph corresponding to the current state (note that the new terminal state $x$ is distinct from $s$, but both states correspond to the same molecular graph).

Rewards. We trained GFlowNets using 3 reward functions: SEH, a reward computed by an MPNN [60] that was trained by Bengio et al. [36] to estimate the binding energy of a molecule to the soluble epoxide hydrolase protein; SA, an estimate of synthetic accessibility [61] computed with tools from the RDKit library [62]; and QED, a quantitative estimate of drug-likeness [63], which is also computed with RDKit. We normalized all reward functions to the range $[0, 1]$. Higher values of SEH, SA, and QED correspond to stronger binding, higher synthetic accessibility, and higher drug-likeness, respectively. Following Bengio et al. [36], we introduced a parameter $\beta$ which controls the sharpness (temperature) of the target distribution: $p(x) \propto R(x)^\beta$; increasing $\beta$ results in a distribution skewed towards high-reward objects. We experimented with two $\beta$ values, 32 and 96 (Figures 3(a), 3(d)).

Reward normalization. We used the following normalization rules for the SEH, SA, and QED rewards in the molecule domain: $\text{SEH} = \text{SEH}_{\text{raw}} / 8$; $\text{SA} = (10 - \text{SA}_{\text{raw}}) / 9$ (the raw SA score ranges from 1 to 10, with lower values indicating easier synthesis); $\text{QED} = \text{QED}_{\text{raw}}$.

Training and evaluation. After training the base GFlowNets with the reward functions described above, we trained classifiers with Algorithm A.1. The classifier was parameterized as a graph neural network based on a graph transformer architecture [64]. Compared to the 2D grid domain (Section 6, "2D distributions via GFlowNets"), we cannot directly evaluate the distributions obtained by our approach. Instead, we analyzed the samples generated by the composed distributions. We sampled 5 000 molecules from each composed distribution obtained with our approach, as well as from the base GFlowNets. We evaluated the sample collections with the two following strategies.

Reward evaluation (Figure 3, Table 1): we analyzed the distributions of rewards across the sample collections. The goal is to see whether the composition of GFlowNets trained for different rewards leads to noticeable changes in the reward distribution.

Distribution distance evaluation (Figure 4, Table G.5): we used the samples to estimate the pairwise distances between the distributions. Specifically, for a given pair of distributions represented by two collections of samples $A = \{x_{A,i}\}_{i=1}^{n}$, $B = \{x_{B,i}\}_{i=1}^{n}$, we computed the earth mover's distance $d(A, B)$ with ground molecule distance given by $d(x, x') = (\max\{s(x, x'), 10^{-3}\})^{-1} - 1$, where $s(x, x') \in [0, 1]$ is the Tanimoto similarity over Morgan fingerprints of the molecules $x$ and $x'$; a sketch of this computation follows below.
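A minimal sketch (ours, not the authors' code) of this distance computation. RDKit [62] supplies the Morgan fingerprints and Tanimoto similarity; delegating the final earth mover's distance over the resulting cost matrix to the POT package (`ot`) is our assumption here, not the paper's stated tooling.

```python
import numpy as np
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem

def ground_distance(smiles_a, smiles_b):
    # d(x, x') = 1 / max(s(x, x'), 1e-3) - 1 with s the Tanimoto similarity.
    fps = [AllChem.GetMorganFingerprintAsBitVect(Chem.MolFromSmiles(s), 2)
           for s in (smiles_a, smiles_b)]
    sim = DataStructs.TanimotoSimilarity(fps[0], fps[1])
    return 1.0 / max(sim, 1e-3) - 1.0

def emd(smiles_A, smiles_B):
    import ot  # Python Optimal Transport (an assumed dependency)
    cost = np.array([[ground_distance(a, b) for b in smiles_B]
                     for a in smiles_A])
    n, m = cost.shape
    return ot.emd2(np.full(n, 1 / n), np.full(m, 1 / m), cost)
```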
Training details and hyperparameters. The base GFlowNet policies were parameterized as graph neural networks with the Graph Transformer architecture [64]. We used 6 transformer layers with an embedding size of 128. The input to the Graph Transformer was the graph of fragments, with node attributes describing fragments and edge attributes describing the attachment points of the edges. The base GFlowNets were trained with the trajectory balance loss. We used the Adam optimizer. For the policy network $p_F(s' | s; \theta)$, we set the initial learning rate to 0.0005 and decayed it exponentially with the factor $2^{-\text{step}/20\,000}$. For the log of the total flow, $\log Z_\theta$, we set the initial learning rate to 0.0005 and decayed it exponentially with the factor $2^{-\text{step}/50\,000}$. We trained the base GFlowNets for 15 000 steps with batch size 64 (64 trajectories per batch). In order to promote exploration in trajectory sampling for the trajectory balance objective, we used a sampling policy which takes random actions (uniformly) with probability 0.1 but otherwise follows the current learned forward policy.

The classifier was parameterized as a graph neural network with the Graph Transformer architecture. We used 4 transformer layers with embedding size 128. The inputs to the classifier were the fragment graph, a terminal state flag ($\{0, 1\}$), and $\log(\alpha / (1 - \alpha))$ (in the case of parameterized operations). The classifier outputs were the logits of the joint label distribution $Q_\phi(y_1, \ldots, y_n | s)$ for non-terminal states $s$ and the logits of the marginal label distribution $Q_\phi(y_1 | x)$ for terminal states $x$. We trained the classifier with the loss described in Section 5.2. We used Adam with learning rate 0.001. We performed 15 000 training steps with batch size 8 (8 trajectories sampled from each of the base models per training step). We updated the target network parameters $\bar{\phi}$ as the exponential moving average (EMA) of $\phi$ with smoothing factor 0.995. We linearly increased the weight $\gamma(\text{step})$ of the non-terminal state loss from 0 to 1 throughout the first 4 000 steps and kept it constant at $\gamma = 1$ afterward. For the $\alpha$-parameterized version of the classifier (Section B), we used the following sampling scheme for $\alpha$: $\bar{z} \sim U[-5.5, 5.5]$, $\bar{\alpha} = \frac{1}{1 + \exp(-\bar{z})}$.

F.3 Colored MNIST Generation via Diffusion Models

The colored MNIST experiment in Section 6 ("Colored MNIST generation via diffusion models") follows the method for composing diffusion models introduced in Appendix A.2. The three base diffusion models were trained on colored MNIST digits generated from the original MNIST dataset. These colored digits were created by mapping MNIST images from their grayscale representation to either the red or green channel, leaving the other channels set to 0. For Figure 5 we post-processed the red and green images generated by the base models and compositions into beige and cyan, respectively, which are more accessible colors for colorblind people.

Models, training details, and hyperparameters. The base diffusion models were defined as VE SDEs [34]. Their score models were U-Net [65] networks consisting of 4 convolutional layers with 64, 128, 256, and 256 channels and 4 matching transposed convolutional layers. Time was encoded using 256-dimensional Gaussian random features [66]. The score model was trained using the Adam optimizer [59] with a learning rate decreasing exponentially from $10^{-2}$ to $10^{-4}$. We performed 200 training steps with batch size 32.

The first classifier $Q(y_1, y_2 | x_t)$ was a convolutional network consisting of 2 convolutional layers with 64 and 96 channels and three hidden layers with 512, 256, and 256 units. This classifier is time-dependent and used 128-dimensional Gaussian random features to embed the time. The output was a $3 \times 3$ matrix encoding the predicted log-probabilities. The classifier was trained on trajectories sampled from the reverse SDE of the base diffusion models using the AdaDelta optimizer [67] with a learning rate of 1.0. We performed 700 training steps with batch size 128.
For the first 100 training steps the classifier was trained only on terminal samples.

The second, conditional classifier $Q(y_3 | y_1, y_2, x_t)$ was a similar convolutional network with 2 convolutional layers with 64 channels and two hidden layers with 256 units. This classifier is conditioned both on time and on $(y_1, y_2)$. The time variable was embedded using 128-dimensional Gaussian random features. The $(y_1, y_2)$ variables were encoded using a one-hot encoding scheme. The output of the classifier was the three predicted log-probabilities for $y_3$. Contrary to the first classifier, this one was not trained on the base diffusion models but rather on samples from the posterior $p(x | y_1, y_2)$. Its loss function was

$\mathcal{L}_c(\phi) = \mathbb{E}_{(\bar{x}_0, \bar{x}_t, \bar{y}_1, \bar{y}_2, t) \sim p(x_0, x_t | y_1, y_2) \, p(y_1) \, p(y_2) \, p(t)} \Big[ -\sum_{\bar{y}_3 = 1}^{3} w_{\bar{y}_3}(\bar{x}_0) \log Q_\phi(\bar{y}_3 | \bar{y}_1, \bar{y}_2, \bar{x}_t) \Big]$, (116)

where $w_{\bar{y}_3}(\bar{x}_0)$ is estimated using the first classifier. The classifier was trained using the AdaDelta optimizer [67] with a learning rate of 0.1. We performed 200 training steps with batch size 128.

Sampling. Sampling from both the individual base models and the composition was done using the Predictor-Corrector sampler [34]. We performed sampling over 500 time steps to generate the samples shown in Figure 5. The samples used to train the classifier were generated using the same method. When sampling from the composition we found that scaling the classifier guidance was generally necessary to achieve high-quality results. Without scaling, the norm of the gradient of the first and second classifiers was too small relative to the gradient predicted by the score function, and hence did not sufficiently steer the mixture towards samples from the posterior. Experimentally, we found that a scaling factor of 10 for the first classifier and a scaling factor of 75 for the second produced high-quality results.
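A minimal sketch (ours, not the authors' code) of the guidance scaling just described: the gradients of the two classifiers' log-probabilities are multiplied by constant factors before being added to the score term of the backward SDE drift. `log_q1` and `log_q2` are placeholders for the two trained classifiers.

```python
import torch

def scaled_guidance(x, t, log_q1, log_q2, y12, y3, scale1=10.0, scale2=75.0):
    # Returns the scaled classifier gradient that is added to the mixture
    # score inside the backward SDE drift (cf. equation (23)).
    x = x.detach().requires_grad_(True)
    g1 = torch.autograd.grad(log_q1(x, t, y12).sum(), x)[0]
    g2 = torch.autograd.grad(log_q2(x, t, y3, y12).sum(), x)[0]
    return scale1 * g1 + scale2 * g2
```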
G Additional Results

G.1 Binary Operations for MNIST Digit Generation via Diffusion Models

Here we present a variant of the colored digit generation experiment from Section 6 ("Colored MNIST generation via diffusion models") using 2 diffusion models. This allows us to better illustrate the harmonic mean and contrast operations on this image domain. In a similar fashion to the experiment in Section 6, we trained two diffusion models to generate colored MNIST digits: $p_1$ was trained to generate red and green 0 digits, and $p_2$ was trained to generate green 0 and 1 digits. As before, we used post-processing to map green to cyan and red to beige.

Implementation details. The diffusion models used in this experiment and their training procedure were exactly the same as in Section F.3. The sampling method used to obtain samples from the base models and their compositions was also the same. We found that scaling the classifier guidance was generally required for high-quality results, and used a scaling factor of 20 in this experiment. The classifier was a convolutional network with 2 convolutional layers consisting of 32 and 64 channels and two hidden layers with 512 and 128 units. The classifier's time input was embedded using 128-dimensional Gaussian random features. The output was a $2 \times 2$ matrix encoding the predicted log-probabilities $Q(y_1, y_2 | x_t)$. The classifier was trained on trajectories, sampled from the reverse SDE of the base diffusion models, using the AdaDelta optimizer [67] with a learning rate of 0.1 and a decay rate of 0.97. We performed 200 training steps with batch size 128. For the first 100 training steps, the classifier was trained only on terminal samples.

Figure G.1: Diffusion model composition on colored MNIST. (a,b) Samples from the base diffusion models. (c-e) Samples from the resulting harmonic mean and contrast compositions.

Results. Figure G.1 shows samples obtained from the two trained diffusion models $p_1, p_2$ and from the harmonic mean and contrast compositions of these models. We observe that the harmonic mean generates only cyan zero digits, because this is the only type of digit on which both $p_1$ and $p_2$ have high density. The contrast of $p_1$ with $p_2$ generates beige zero digits from $p_1$. However, unlike $p_1$, it does not generate cyan zero digits, as $p_2$ has high density there. The story is similar for the reverse contrast, which generates cyan one digits from $p_2$, but not zero digits, due to $p_1$ having high density over those.

G.2 MNIST Subset Generation via Diffusion Models

In this section we report additional results for composing diffusion models on the standard MNIST dataset. We trained two diffusion models: $p_{\{0,\ldots,5\}}$ was trained to generate MNIST digits 0 through 5, and $p_{\{4,\ldots,9\}}$ was trained to generate MNIST digits 4 through 9. The training procedure and models used in this experiment were the same as in Section G.1.

Figure G.2: Diffusion model composition on MNIST. (a,b) Samples from the base diffusion models $p_{\{0,\ldots,5\}}$ and $p_{\{4,\ldots,9\}}$. (c-e) Samples from the resulting harmonic mean and contrast compositions.

Figure G.2 shows samples obtained from the two diffusion models, from the harmonic mean, and from the contrast compositions of these models. We observe that the harmonic mean correctly generates mostly images on which both diffusion models' sampling distributions have high probability, i.e. digits 4 and 5. For the contrasts we see that in both cases the generated digits have high probability under one model but low probability under the other. We observe some errors, namely some 9's being generated by the harmonic mean and some 4's being generated by the contrast of $p_{\{0,\ldots,5\}}$ with $p_{\{4,\ldots,9\}}$. This is likely because 4 and 9 are visually similar, causing the guiding classifier to misclassify them and generate them under the wrong composition.

We also present binary operations between three distributions. In Figures G.3 and G.4, $p_0$ models even digits, $p_1$ models odd digits, and $p_2$ models digits that are divisible by 3. We color the digits $\{0, 6\}$ purple, $\{3, 9\}$ blue, $\{4, 8\}$ orange, and $\{1, 5, 7\}$ beige. In Figure G.3, the harmonic mean of $p_0$ and $p_2$ generates the digits 0 and 6, whereas the contrast of $p_0$ with $p_2$ shows even digits not divisible by 3 ($\{4, 8\}$), and the reverse contrast shows odd digits that are divisible by 3 ($\{3, 9\}$). We observe that the samples from the harmonic mean of $p_0$ and $p_2$ inherit artifacts from the base generator for $p_2$ (the thin digit 0), which shows the impact that the base models have on the composite distribution. In Figure G.4 we present similar results for the odd digits ($\{1, 3, 5, 7, 9\}$). We noticed that the samples from both contrast compositions of $p_1$ and $p_2$ include a small number of 3 digits.

Figure G.3: Composing even digits and multiples of three. (a) Samples from $p_0$ (even digits); (b) samples from $p_2$ ($\{0, 3, 6, 9\}$). (c-e) Samples from the resulting harmonic mean and contrast compositions.

Figure G.4: Composing odd digits and multiples of three. (a) Samples from $p_1$ (odd digits); (b) samples from $p_2$ ($\{0, 3, 6, 9\}$). (c-e) Samples from the resulting harmonic mean and contrast compositions.
G.3 Chaining: Sequential Composition of Multiple Distributions

We present results on chaining binary composition operations sequentially on a custom colored MNIST dataset.

Setup. We start with three base generative models that are trained to generate $p_1$, $p_2$, and $p_3$ in Figure G.5. Specifically, $p_1$ is a uniform distribution over the digits $\{0, 1, 2, 3, 4, 5\}$, $p_2$ is a uniform distribution over the even digits $\{0, 2, 4, 6, 8\}$, and $p_3$ is a uniform distribution over the digits divisible by 3: $\{0, 3, 6, 9\}$. Note that we use a different color for each digit, consistent across $p_1$, $p_2$, $p_3$. Our goal is to produce combinations of chained binary operations involving all three distributions, where two of them are combined first and then, in a second step, combined with the third distribution through either harmonic mean or contrast.

Binary classifier training. Consider, for example, the chained operation in which $p_1$ and $p_2$ are first combined by harmonic mean and the result is then combined with $p_3$. We use the same classifier training procedure for $p_1$ versus $p_2$ as for the composite model (the harmonic mean of $p_1$ and $p_2$) versus $p_3$, except that in the latter case we sample from the composite model as a whole. Our classifier training simultaneously optimizes the terminal classifier and the intermediate state classifier.

Implementation details. We adapted the diffusion model training code for the base distributions from [33]. Our diffusion model used a U-Net backbone with four latent layers on both the contracting and the expansive path. The contracting channel dimensions were [64, 128, 256, 256] with kernel size 3 and strides [1, 2, 2, 2]. The time embedding used a mixture of 128 sine and cosine feature pairs, with a total of 256 dimensions. These features were passed through a sigmoid activation and then expanded using a different linear head for each layer. The activations were then added to the 2D feature maps at each layer channel-wise. We used a fixed learning rate of 0.01 with the Adam optimizer [59]. We adapted the classifier training code from the MNIST example in PyTorch [68]. Our binary classifier has two latent layers with channel dimensions [32, 64], stride 1, and kernel width 3. We use dropout on both layers (25% and 50%, respectively). We train the network for 200 epochs on data sampled online in batches of 256 from each source model, and treat the composite model in the second step the same way. We use the AdaDelta [67] optimizer with the default setting of learning rate 1.0.

Sampling. We generate samples according to Appendix A.2 and multiply the classifier gradient by a guidance scale of 20.

Results. Row 3 of Figure G.5 contains mostly zeros, with a few exceptions. This is in agreement with the harmonic mean being symmetric. In the chaining where $p_2$ and $p_3$ are composed first and then combined with $p_1$, digits that are outside of the intersection still appear, with thin strokes.

Figure G.5: Chaining binary operations. (a-c) Samples from the 3 pre-trained diffusion models $p_1$ (digits $\{0, \ldots, 5\}$), $p_2$ (digits $\{0, 2, 4, 6, 8\}$), and $p_3$ (digits $\{0, 3, 6, 9\}$). (d-l) Samples from binary compositions. Row 3: the harmonic mean of all three. Row 4 and beyond: various ways to chain the operations; parentheses indicate the order of composition.

Figure G.6: Training curves of the classifier $Q_\phi(y_1, y_2 | \cdot)$ in the GFlowNet 2D grid domain.
The experimental setup corresponds to Section 6 ("2D distributions via GFlowNets") and Figure 2 (top row). Left: terminal state loss and non-terminal state loss (as defined in Algorithm A.1) as functions of the number of training steps. Right: $L_1$ distance between the learned distributions (compositions obtained through the classifier-based mixture and guidance) and the ground-truth composition distributions as a function of the number of training steps, for the mixture $\frac{1}{2} p_1 + \frac{1}{2} p_2$, the harmonic mean, the contrast, and the parameterized contrast (coefficient 0.05). Here $L_1(p, q) = \sum_x |p(x) - q(x)|$.

Figure G.7: Training curves of the classifier $Q_\phi(y_1, y_2 | \cdot)$ in the GFlowNet molecule generation domain. The experimental setup corresponds to Section 6 ("Molecule generation via GFlowNets") and Figure 3 (a-c). The curves show the terminal state loss and the non-terminal state loss (as defined in Algorithm A.1) as functions of the number of training steps.

Figure G.8: Training curves of the classifier $Q_\phi(y_1, y_2 | \cdot)$ in the diffusion MNIST image generation domain. The experimental setup corresponds to Section G.1 and Figure G.1. The curves show the terminal state loss and the non-terminal state loss (as defined in equations (26), (28)) as functions of the number of training steps. The non-terminal loss optimization begins after the first 100 training steps (shown by the black dashed line).

G.4 Classifier Learning Curves and Training Time

We empirically evaluated classifier training time and learning curves. The results are shown in Figures G.6, G.7, G.8 and Tables G.1, G.2. Figures G.6, G.7, and G.8 show the cross-entropy loss of the classifier for terminal states (16) and non-terminal states (18) as a function of the number of training steps for the GFlowNet 2D grid domain, the molecular generation domain, and the colored MNIST digits domain, respectively. They show that the loss drops quickly but remains above 0. Figure G.6 further shows the distances between the learned compositions and the ground-truth distributions as a function of the number of training steps of the classifier. For all compositions, as the classifier training progresses, the distance to the ground truth decreases. Compared to the distance at initialization, we observe almost an order of magnitude reduction in distance by the end of training.

The runtime of classifier training is shown in Tables G.1 and G.2. We report the total runtime, as well as separate measurements of the time spent sampling trajectories and training the classifier. The classifier training time is comparable to the base generative model training time. However, most of the classifier training time (more than 70%, or even 90%) was spent on sampling trajectories from the base generative models. Our implementation of the training could be improved in this regard, e.g. by sampling a smaller set of trajectories once and re-using trajectories for training, and by reducing the number of training steps (the loss curves in Figures G.6, G.7, G.8 show that the classification losses plateau quickly).

Table G.1: Summary of base GFlowNet and classifier training time in the molecule generation domain. The experimental setup corresponds to Section 6 ("Molecule generation via GFlowNets") and Figure 3 (a-c).
All models were trained with a single GeForce RTX 2080 Ti GPU.

  Base GFlowNet training steps: 20 000
  Base GFlowNet batch size: 64
  Base GFlowNet training elapsed real time: 6h 47m 11s
  Classifier training steps: 15 000
  Classifier batch size: 8 trajectories per base model (all states used)
  Classifier training total elapsed real time: 9h 2m 19s
  Classifier training data generation time: 6h 35m 58s (73%)

Table G.2: Summary of base diffusion and classifier training time in the MNIST image generation domain. The experimental setup corresponds to Section G.1 and Figure G.1. All models were trained with a single Tesla V100 GPU.

  Base diffusion training steps: 200
  Base diffusion batch size: 32
  Base diffusion training elapsed real time: 10m 6s
  Classifier training steps: 200
  Classifier batch size: 128 trajectories per base model (35 time-steps per trajectory)
  Classifier training total elapsed real time: 30m 12s
  Classifier training data generation time: 29m 22s (97%)

Table G.3: Average pairwise similarity [36] of molecules generated by GFlowNets trained on the SEH, SA, and QED rewards at different values of β. For each combination (reward, β) a GFlowNet was trained with the corresponding reward R(x)^β. Then, 5 000 molecules were generated. The numbers in the table reflect the average pairwise Tanimoto similarity of the top 1 000 molecules (selected according to the target reward function).

            SEH     SA      QED
  β = 1     0.527   0.539   0.480
  β = 4     0.529   0.527   0.464
  β = 10    0.535   0.500   0.438
  β = 16    0.548   0.465   0.422
  β = 32    0.585   0.411   0.398
  β = 96    0.618   0.358   0.404

Table G.4: Number of Tanimoto-separated modes found above the reward threshold. For each combination (reward, β) a GFlowNet was trained with the corresponding reward R(x)^β, and then 5 000 molecules were generated. The cell format is "A / B", where A is the number of Tanimoto-separated modes found above the reward threshold, and B is the total number of generated molecules above the threshold. Analogously to Figure 14 in [36], we consider a new molecule to be the representative of a new mode when it has Tanimoto similarity smaller than 0.7 to the representative molecule of every previously found mode. The reward thresholds (in [0, 1], normalized values) are SEH: 0.875, SA: 0.75, QED: 0.75. Note that the normalized threshold of 0.875 for SEH corresponds to the unnormalized threshold of 7 used in [36].

            SEH           SA            QED
  β = 1     15 / 17       37 / 37       0 / 0
  β = 4     12 / 17       82 / 82       0 / 0
  β = 10    85 / 109      332 / 337     18 / 18
  β = 16    190 / 280     886 / 910     253 / 253
  β = 32    992 / 1821    2859 / 3080   3067 / 3124
  β = 96    1619 / 4609   4268 / 4983   4470 / 4980
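A minimal sketch (ours, not the authors' code) of the greedy mode-counting rule from the caption of Table G.4, using RDKit [62] fingerprints: a molecule above the reward threshold founds a new mode exactly when its Tanimoto similarity to every previously kept representative is below 0.7.

```python
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem

def count_modes(smiles_above_threshold, sim_cutoff=0.7):
    reps = []  # fingerprints of the mode representatives found so far
    for smi in smiles_above_threshold:
        fp = AllChem.GetMorganFingerprintAsBitVect(Chem.MolFromSmiles(smi), 2)
        if all(DataStructs.TanimotoSimilarity(fp, r) < sim_cutoff for r in reps):
            reps.append(fp)  # sufficiently dissimilar: a new mode
    return len(reps)
```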
G.5 Analysis of Sample Diversity of Base GFlowNets in the Molecule Generation Domain

In order to assess the effect of the reward exponent β on mode coverage and sample diversity, we evaluated samples generated from GFlowNets pre-trained with different values of β. The results are in Tables G.3 and G.4. The details of the evaluation and the reported metrics are described in the table captions. As expected, larger reward exponents shift the learned distributions towards high-scoring molecules (the total number of molecules with scores above the threshold increases). For the SA and QED models we do not observe negative effects of large β on sample diversity and mode coverage: the average pairwise similarity of the top 1 000 molecules does not grow as β increases, and the ratio of Tanimoto-separated modes remains high. For the SEH models we observe a gradual increase in the average pairwise similarity of the top 1 000 molecules and a gradual decrease in the ratio of Tanimoto-separated modes. However, the total number of separated modes grows as β increases, which indicates that larger reward exponents do not lead to mode dropping.

G.6 Summary of Pairwise Distribution Distances in the Molecule Generation Domain

Table G.5: Estimated pairwise earth mover's distances between the distributions shown in Table 1.

                  SEH    SA     QED    SEH,SA  SEH,QED  SA,QED  SEH,SA,QED  SEH ×3  SA ×3  QED ×3
  y=SEH           0      4.42   5.77   3.39    4.20     4.88    4.10        2.46    4.44   5.73
  y=SA            4.42   0      5.88   3.26    5.15     4.59    4.20        4.39    2.55   5.89
  y=QED           5.77   5.88   0      5.40    4.02     3.85    4.20        5.80    5.90   3.20
  y=SEH,SA        3.39   3.26   5.40   0       4.25     4.19    3.68        3.39    3.30   5.39
  y=SEH,QED       4.20   5.15   4.02   4.25    0        3.80    3.67        4.22    5.19   4.00
  y=SA,QED        4.88   4.59   3.85   4.19    3.80     0       3.65        4.91    4.59   3.87
  y=SEH,SA,QED    4.10   4.20   4.20   3.68    3.67     3.65    0           4.12    4.23   4.20
  y=SEH ×3        2.46   4.39   5.80   3.39    4.22     4.91    4.12        0       4.43   5.73
  y=SA ×3         4.44   2.55   5.90   3.30    5.19     4.59    4.23        4.43    0      5.90
  y=QED ×3        5.73   5.89   3.20   5.39    4.00     3.87    4.20        5.73    5.90   0