# GFlowNet Foundations

Journal of Machine Learning Research 24 (2023) 1-55. Submitted 4/22; Revised 5/23; Published 6/23.

**Yoshua Bengio** (yoshua.bengio@mila.quebec) — Mila, Université de Montréal, CIFAR, IVADO
**Salem Lahlou** (lahlosal@mila.quebec) — Mila, Université de Montréal
**Tristan Deleu** (deleutri@mila.quebec) — Mila, Université de Montréal
**Edward J. Hu** (edward@edwardjhu.com) — Mila, Université de Montréal, Microsoft
**Mo Tiwari** (motiwari@stanford.edu) — Stanford University
**Emmanuel Bengio** (bengioem@mila.quebec) — Mila, McGill University

*Equal contribution.

**Editor:** David Sontag

©2023 Yoshua Bengio, Salem Lahlou, Tristan Deleu, Edward J. Hu, Mo Tiwari, Emmanuel Bengio. License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at http://jmlr.org/papers/v24/22-0364.html.

**Abstract.** Generative Flow Networks (GFlowNets) have been introduced as a method to sample a diverse set of candidates in an active learning context, with a training objective that makes them approximately sample in proportion to a given reward function. In this paper, we show a number of additional theoretical properties of GFlowNets, including a new local and efficient training objective called detailed balance, by analogy with MCMC. GFlowNets can be used to estimate joint probability distributions and the corresponding marginal distributions where some variables are unspecified and, of particular interest, can represent distributions over composite objects like sets and graphs. GFlowNets amortize the work typically done by computationally expensive MCMC methods in a single but trained generative pass. They could also be used to estimate partition functions and free energies, conditional probabilities of supersets (supergraphs) given a subset (subgraph), as well as marginal distributions over all supersets (supergraphs) of a given set (graph). We introduce variations enabling the estimation of entropy and mutual information, continuous actions, and modular energy functions.

## 1. Introduction

Building upon the introduction of Generative Flow Networks (GFlowNets) by Bengio et al. (2021), we provide here an in-depth formal foundation and expansion of the set of theoretical results, in ways that may be of interest for the active learning scenario of Bengio et al. (2021) but also much more broadly.

### 1.1 What is a GFlowNet?

GFlowNets have properties which make them well-suited to perform amortized probabilistic inference in general, whether for sampling or for marginalizing. Most of the computational work takes place at training time, while run-time sampling or computation of marginalized quantities can be done in a single pass through a sequence of constructive stochastic steps. This makes GFlowNets an interesting alternative to Monte-Carlo Markov chain (MCMC) methods and relates them to amortized variational inference (Malkin et al., 2023). Because sampling of a compositional object $s$ can be achieved through a sequence of stochastic steps, very rich multimodal distributions $P_T(s)$ over such objects can be represented, and the offline training objectives make it possible to explore and discover modes of the distribution of interest. The key property of GFlowNets is that their sampling policy is trained to make the probability $P_T(s)$ of sampling an object $s$ approximately proportional to the value $R(s)$ of a given reward function applied to that object.
We also talk of an energy function $\mathcal{E}(s) = -\log R(s)$, i.e., the reward function is non-negative and corresponds to an unnormalized probability. Whereas one typically trains a generative model from a dataset of positive examples, a GFlowNet is trained to match the given energy or reward function and convert it into a sampler. We view that sampler as a generative policy, because the composite object $s$ is constructed through a sequence of smaller stochastic steps (see Fig. 1), often corresponding to constructively composing different elements of $s$, like the edges of a graph. This conversion of an energy function or unnormalized probability function into a sampler is similar to what MCMC methods achieve, but once trained, GFlowNets generate a sample in one shot instead of generating a long sequence of samples whose distribution only gradually approaches the desired one. GFlowNets thus avoid the lengthy stochastic search in the space of such objects and the associated mode-mixing intractability challenge of MCMC methods (Jasra et al., 2005; Bengio et al., 2013; Pompe et al., 2020). Multiple i.i.d. samples can be obtained from the GFlowNet by calling the sampler multiple times. GFlowNets exchange the intractability of sampling with MCMC for the challenge of amortized training of the generative policy. The latter problem would be equally intractable if the modes of the reward function did not have an inherent (but not necessarily known) structure over which the learner could generalize, i.e., if the learner had almost no chance to correctly guess where to find new modes based on (i.e., by training on) those it had already visited.

The energy function or reward function (exponential of minus the energy) is evaluated only at the end of the sequential construction process for objects $s$, in what we call a terminating state. Every such constructive sequence starts in the single initial state $s_0$ and ends in a terminal state. As illustrated in Figure 2, we can visualize the set of all trajectories starting from $s_0$ and ending in a terminal state. The term *flow* in generative flow networks refers to unnormalized probabilities that can be learned by GFlowNet learning procedures. The flow in an intermediate state $s$ is a weighted sum of the non-negative rewards of the terminating states reachable from $s$. Those weights are such as to avoid double-counting: if we were to inject a fixed flow of liquid in $s_0$ and dispatch that liquid to each child of any state $s$ proportionally to the GFlowNet policy for choosing a child of $s$, we would obtain the flow at each state, and the flow at terminating states would match the reward function at those states. As shown in greater detail here, and for the first time in the first GFlowNet paper (Bengio et al., 2021), this can be achieved with a flow constraint at each state: the sum of incoming flows must match the sum of outgoing flows.

*Figure 1: A diagram of how a GFlowNet iteratively constructs an object. We adopt notation that is common in the reinforcement learning literature: $s_t$ represents the state of the partially constructed object (in this case, a graph) at time $t$, and $a_t$ represents the action taken by the GFlowNet at time $t$ to transition to state $s_{t+1} = T(s_t, a_t)$. In this diagram, the GFlowNet takes a 3-node graph as input and determines an action to take (e.g., "add a new node connected to node 2"). The action, combined with the environment transition function $T(s_t, a_t)$, determines $s_{t+1}$: a four-node graph. This process repeats until an exit action is sampled and the sample is complete.*
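To make the constructive process of Fig. 1 concrete, here is a minimal Python sketch of the sampling loop, with a hypothetical uniform `policy` standing in for a trained forward policy and tuples of tokens standing in for partially built objects (the names `EXIT`, `VOCAB`, and `sample_object` are illustrative, not from the paper):

```python
import random

EXIT = "exit"            # hypothetical terminating ("exit") action
VOCAB = ["a", "b", "c"]  # hypothetical construction actions

def policy(state):
    # Stand-in for a trained forward policy P_F: here, uniform over the
    # actions allowed at `state` (EXIT is only allowed once non-empty).
    actions = VOCAB + [EXIT] if state else list(VOCAB)
    probs = [1.0 / len(actions)] * len(actions)
    return actions, probs

def sample_object(max_len=10):
    state = ()  # the initial state s_0: an empty object
    while len(state) < max_len:
        actions, probs = policy(state)
        action = random.choices(actions, weights=probs)[0]
        if action == EXIT:
            break  # exit action sampled: `state` is the finished object
        state = state + (action,)  # transition s_{t+1} = T(s_t, a_t)
    return state

print(sample_object())
```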
### 1.2 Contributions of this paper

In this paper, an important contribution is the notion of conditional GFlowNet, which enables estimation of intractable sums corresponding to marginalization over many steps of object construction, and can thus be used to compute free energies¹ over different types of joint distributions, perhaps most interestingly over sets and graphs. This marginalization also enables estimation of entropies, conditional entropies, and mutual information. GFlowNets can thus be generalized to estimate multiple flows corresponding to modeling a rich outcome (rather than a scalar reward function).

We refer the reader to Bengio et al. (2021) and Sec. 7 for a discussion of related approaches and differences with common generative models and reinforcement learning (RL) methods. In an RL context, two interesting properties of GFlowNets already noted in that paper are that (1) they can be trained in an offline manner, with trajectories sampled from a distribution different from the one represented by the GFlowNet, and (2) they match the reward function in probability rather than try to find a configuration which maximizes rewards or returns. The latter property is particularly interesting in the context of exploration, to ensure the configurations sampled from the generative policy are both interesting and diverse.

¹ In machine learning, a free energy is the logarithm of an unnormalized marginal probability, a generally intractable sum of exponentiated negative energies.

*Figure 2: Illustration of the structure of a Generative Flow Network (GFlowNet), as a pointed DAG over states $s$, with particles flowing along edges to represent the flow function. Any object sampled by the GFlowNet policy can be obtained by starting from the initial state $s_0$ and then, at each step, choosing a child with probability proportional to the GFlowNet policy's transition probability. This process stops when a terminating action is chosen from a terminating state $s$ (yielding a terminal state $s_f$), at which point a reward $R(s)$ is obtained. The figure shows a tiny GFlowNet and the possible trajectories from $s_0$ to any of the terminal states. It illustrates that, in general, a state can be reached through several trajectories. GFlowNet algorithms learn a policy such that the probability of sampling a terminating state $s$ is proportional to $R(s)$. The learner tries to estimate a flow function $F(s)$ and $F(s \to s')$ over all states (including intermediate states) $s$ and transitions $s \to s'$, with $F(s \to s_f) = R(s)$ at terminating states and $F(s_0)$ being the sum of rewards over all terminating states. A sufficient property to achieve this is that at each state the sum of incoming flows equals the sum of outgoing flows.*

It is also interesting to transform GFlowNets into amortized probabilistic inference machines: if we choose the reward function to be a prior (over some random variable) times a likelihood (how well some data is fit given that choice of random variable value), then the GFlowNet policy learns to sample from the corresponding Bayesian posterior (which is proportional to prior times likelihood). The ability of GFlowNets to generate a diverse set of samples then corresponds to the ability to sample from the modes of the target distribution.
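As a toy illustration of this posterior-sampling view (not code from the paper), the following sketch defines a reward equal to prior times likelihood for a hypothetical Bernoulli model; a GFlowNet trained to match this reward would sample hypotheses in proportion to their unnormalized posterior:

```python
import math

def prior(h):
    # Hypothetical uniform prior over the candidate parameter values.
    return 0.5

def log_likelihood(data, h):
    # Toy Bernoulli likelihood with success probability h.
    return sum(math.log(h if x == 1 else 1.0 - h) for x in data)

def reward(h, data):
    # Reward = prior * likelihood: a GFlowNet matching this reward would
    # sample h proportionally to the unnormalized Bayesian posterior.
    return prior(h) * math.exp(log_likelihood(data, h))

data = [1, 1, 0, 1]
for h in (0.3, 0.7):
    print(h, reward(h, data))
```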
An important source of inspiration for GFlowNets is the way information propagates in temporal-difference RL methods (Sutton and Barto, 2018). Both rely on a principle of coherence for credit assignment which may only be achieved asymptotically, when training converges. While exact gradient calculation may be intractable, because the number of paths in state space to consider is exponentially large, both methods rely on local coherence between different components and on a training objective stating that if all the learned components are locally coherent with each other, then we obtain a system that estimates the quantities of interest globally. Examples include the estimation of expected discounted returns in temporal-difference methods and of probability measures with GFlowNets.

This paper extends the theory of the original GFlowNet construction (Bengio et al., 2021) in several directions, including a new local training objective called detailed balance (by analogy with the detailed balance condition of Monte-Carlo Markov chains), which avoids forming the explicit sums required by the previously proposed flow matching loss, as well as formulations enabling the calculation of marginal probabilities (or free energies) for subsets of variables, and more generally for subsets of larger sets, or subgraphs, and their application to estimating entropy and mutual information. Finally, whereas the basic formulation of GFlowNets assumes a given reward or energy function, this paper considers how the energy function could be jointly learned with the GFlowNet, opening the door to novel energy-based modeling methodologies and a modular structure for both the energy function and the GFlowNet.

### 1.3 GFlowNets in other works

In addition to the theory presented in this paper, Malkin et al. (2023) and Zimmermann et al. (2022) prove some partial equivalences between GFlowNets and hierarchical variational methods, providing yet more theoretical evidence for the efficacy of GFlowNets in learning to sample proportionally to a given reward function. These works also provide evidence for the superiority of GFlowNets in off-policy settings.

GFlowNets have found a wide array of applications due to the associated diversity of generated samples. In contexts where a cheap proxy for the true reward function exists, GFlowNets have been used to propose candidate samples on which to query the proxy before more expensive evaluation under the true reward function. In these settings, the diversity of samples generated by GFlowNets provides robustness to proxy misspecification and a way to incorporate epistemic uncertainty. For example, Zhang et al. (2023) use GFlowNets to produce sample schedules for operations in a computation graph, where evaluating the runtimes of sampled schedules via a proxy is fast but evaluating the same schedules on target hardware is expensive. In active learning problems, Jain et al. (2022, 2023) use GFlowNet sampling as a subroutine inside an active learning loop, as a substitute for Bayesian Optimization or RL-based methods. Jain et al. (2022) apply GFlowNets to search for novel anti-microbial peptides, to discover DNA sequences that have high binding activity with human transcription factors, and to find proteins with high fluorescence. Additionally, Jain et al. (2023) develop preference-conditional GFlowNets, where a preference weight vector is used to scalarize multiple objective functions into a single reward.
The authors apply their techniques to various molecule and DNA sequence generation tasks and find that their methods are able to find different Pareto-optimal samples along the Pareto frontier.

GFlowNets have found applications in several other machine learning problems. For example, Zhang et al. (2022) simultaneously train an energy-based model and a GFlowNet; the energy function is trained with samples from the GFlowNet, which, in turn, uses the energy function to form its reward. Their method results in a generative model for binary vectors in high dimensions, e.g., binarized digits. Deleu et al. (2022) use a GFlowNet for structure learning; the GFlowNet produces samples that approximate the true posterior over causal graphs given a dataset. Their method works on both observational and interventional data, and compares favorably to MCMC- and variational-inference-based methods. Hu et al. (2023) find maximum-likelihood estimates of latent variable models with discrete compositional latents by jointly training a GFlowNet to approximately sample from the generally intractable posterior in the E-step of the expectation-maximization (EM) algorithm.

## 2. Flow Networks and Markovian Flows

### 2.1 Some elements of graph theory

In this section, we recall some basic definitions and properties of graphs, which are the basis of flow networks and GFlowNets.

**Definition 1** A directed graph is a tuple $G = (S, A)$, where $S$ is a finite set of states and $A$ is a subset of $S \times S$ representing directed edges. Elements of $A$ are denoted $s \to s'$ and called edges or transitions. A trajectory in such a graph is a sequence $\tau = (s_1, \dots, s_n)$ of elements of $S$ such that every transition $s_t \to s_{t+1} \in A$ (for $t \in \{1, \dots, n-1\}$), with $n > 1$. We denote $s \in \tau$ to mean that $s$ is in the trajectory $\tau$, i.e., $\exists t \in \{1, \dots, n\}$ with $s_t = s$, and similarly $s \to s' \in \tau$ to mean that $\exists t \in \{1, \dots, n-1\}$ with $s_t = s$ and $s_{t+1} = s'$. For convenience, we also use the notation $\tau = s_1 \to \dots \to s_n$. The length of a trajectory is the number of edges in it (the length of $\tau = (s_1, \dots, s_n)$ is thus $n - 1$). A directed acyclic graph (DAG) is a directed graph in which there is no trajectory $\tau = (s_1, \dots, s_n)$ satisfying $s_n = s_1$.

Given a DAG $G = (S, A)$ and two states $s, s' \in S$, if there exists a trajectory in $G$ starting in $s$ and ending in $s'$, then we write $s < s'$. The binary relation $<$ defines a strict partial order (i.e., it is irreflexive, asymmetric, and transitive). We write $s \leq s'$ if $s < s'$ or $s = s'$. The binary relation $\leq$ is a (non-strict) partial order (i.e., it is reflexive, antisymmetric, and transitive). If there is no order relation between $s$ and $s'$, we write $s \perp s'$.

**Definition 2** Given a DAG $G = (S, A)$, the parent set of a state $s \in S$, which we denote $\mathrm{Par}(s)$, contains all of the direct parents of $s$ in $G$, i.e., $\mathrm{Par}(s) = \{s' \in S : s' \to s \in A\}$; similarly, the child set $\mathrm{Child}(s)$ contains all of the direct children of $s$ in $G$, i.e., $\mathrm{Child}(s) = \{s' \in S : s \to s' \in A\}$.

**Definition 3** Given a DAG $G = (S, A)$, $G$ is called a pointed DAG if there exist two states $s_0, s_f \in S$ that satisfy: $\forall s \in S \setminus \{s_0\} \; s_0 < s$, and $\forall s \in S \setminus \{s_f\} \; s < s_f$. $s_0$ is called the source state or initial state; $s_f$ is called the sink state or final state. Because $<$ is a strict partial order, these two states are unique. A complete trajectory in such a DAG is any trajectory starting in $s_0$ and ending in $s_f$. We denote such a trajectory as $\tau = (s_0, s_1, \dots, s_n, s_{n+1} = s_f)$. We denote by $\mathcal{T}$ the set of all complete trajectories in $G$, and by $\mathcal{T}^{\text{partial}}$ the set of (possibly incomplete) trajectories in $G$.
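The following minimal sketch encodes Definitions 1–3 on a toy pointed DAG (the state names and edges are illustrative, not from the paper); `complete_trajectories` enumerates the set $\mathcal{T}$ by recursion from the source:

```python
# Toy pointed DAG: "s0" is the source state and "sf" the sink state.
EDGES = {("s0", "s1"), ("s0", "s2"), ("s1", "s2"),
         ("s2", "s3"), ("s2", "sf"), ("s3", "sf")}

def parents(s):
    # Par(s) from Definition 2
    return [u for (u, v) in EDGES if v == s]

def children(s):
    # Child(s) from Definition 2
    return [v for (u, v) in EDGES if u == s]

def complete_trajectories(s="s0"):
    # Enumerate T, the set of complete trajectories from s0 to sf (Def. 3).
    if s == "sf":
        return [("sf",)]
    return [(s,) + rest for c in children(s) for rest in complete_trajectories(c)]

for tau in complete_trajectories():
    print(" -> ".join(tau))
```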
A state $s \in S$ is called a terminating state if it is a parent of the sink state, i.e., $s \to s_f \in A$. The transition $s \to s_f$ is called a terminating edge. We denote by:

- $A^{-f} = \{s \to s' \in A : s' \neq s_f\}$, the set of non-terminating edges in $G$,
- $A^{f} = \{s \to s' \in A : s' = s_f\} = A \setminus A^{-f}$, the set of terminating edges in $G$,
- $S^{f} = \{s \in S : s \to s_f \in A^{f}\} = \mathrm{Par}(s_f)$, the set of terminating states in $G$.

*Figure 3: Example of a pointed DAG $G$ illustrating the notions of initial state ($s_0$), final or sink state ($s_f$), and terminating states (in $S^f$), a transition to $s_f$ being called a terminating edge (in $A^f$); all other edges are non-terminating (in $A^{-f}$). A terminating state may have other children different from the sink state (e.g., the terminating state $s_7$).*

In Fig. 3, we visualize the concepts introduced in the previous definitions. Note that the constraint of a single source state and single sink state is only a mathematical convenience, since a bijection exists between general DAGs and those with this constraint (by the addition of a unique source/sink state connected to all the other source/sink states).

**Definition 4** Let $G$ be a pointed DAG with source state $s_0$ and sink state $s_f$. A forward (resp. backward) probability function consistent with $G$ is any non-negative function $\hat{P}_F$ (resp. $\hat{P}_B$) defined on $A$ that satisfies $\forall s \in S \setminus \{s_f\}, \; \sum_{s' \in \mathrm{Child}(s)} \hat{P}_F(s' \mid s) = 1$ (resp. $\forall s' \in S \setminus \{s_0\}, \; \sum_{s \in \mathrm{Par}(s')} \hat{P}_B(s \mid s') = 1$).

With pointed DAGs, consistent forward and backward probability functions, which are probabilities over states, can be used to define probabilities over trajectories, i.e., probability measures on some subsets of $\mathcal{T}^{\text{partial}}$. The following lemma shows how to construct such factorized probability measures.

**Lemma 5** Let $G = (S, A)$ be a pointed DAG, and consider a forward probability function $\hat{P}_F$ and a backward probability function $\hat{P}_B$, both consistent with $G$. For any state $s \in S \setminus \{s_f\}$, we denote by $\mathcal{T}_{s,f} \subseteq \mathcal{T}^{\text{partial}}$ the set of trajectories in $G$ starting in $s$ and ending in $s_f$; and for any state $s' \in S \setminus \{s_0\}$, we denote by $\mathcal{T}_{0,s'} \subseteq \mathcal{T}^{\text{partial}}$ the set of trajectories in $G$ starting in $s_0$ and ending in $s'$. Consider the extensions of $\hat{P}_F$ and $\hat{P}_B$ to $\mathcal{T}^{\text{partial}}$ defined by:

$$\forall \tau = (s_1, \dots, s_n) \in \mathcal{T}^{\text{partial}} \quad \hat{P}_F(\tau) := \prod_{t=1}^{n-1} \hat{P}_F(s_{t+1} \mid s_t) \tag{1}$$

$$\forall \tau = (s_1, \dots, s_n) \in \mathcal{T}^{\text{partial}} \quad \hat{P}_B(\tau) := \prod_{t=1}^{n-1} \hat{P}_B(s_t \mid s_{t+1}) \tag{2}$$

We have the following:

$$\forall s \in S \setminus \{s_f\} \quad \sum_{\tau \in \mathcal{T}_{s,f}} \hat{P}_F(\tau) = 1 \tag{3}$$

$$\forall s' \in S \setminus \{s_0\} \quad \sum_{\tau \in \mathcal{T}_{0,s'}} \hat{P}_B(\tau) = 1 \tag{4}$$

**Proof** For convenience, we will use $\mathcal{T}_{s \to s', s_f}$ to denote the set of trajectories starting with $s \to s'$ and ending in $s_f$, and $\mathcal{T}_{0, s \to s'}$ to denote the set of trajectories starting in $s_0$ and ending with $s \to s'$. This allows us to write:

$$\forall s \neq s_f \quad \mathcal{T}_{s,f} = \bigcup_{s' \in \mathrm{Child}(s)} \mathcal{T}_{s \to s', s_f}, \quad \text{with } \{\mathcal{T}_{s \to s', s_f} : s' \in \mathrm{Child}(s)\} \text{ pairwise disjoint,}$$

$$\forall s' \neq s_0 \quad \mathcal{T}_{0,s'} = \bigcup_{s \in \mathrm{Par}(s')} \mathcal{T}_{0, s \to s'}, \quad \text{with } \{\mathcal{T}_{0, s \to s'} : s \in \mathrm{Par}(s')\} \text{ pairwise disjoint.}$$

Additionally, for any $s \neq s_f$, we denote by $d_{s,f}$ the maximum trajectory length in $\mathcal{T}_{s,f}$; and for any $s' \neq s_0$, we denote by $d_{0,s'}$ the maximum trajectory length in $\mathcal{T}_{0,s'}$. We will prove Eq. (3) by strong induction on $d_{s,f}$, and Eq. (4) by strong induction on $d_{0,s'}$.

Base cases: If $d_{s,f} = 1$ and $d_{0,s'} = 1$, then $\mathcal{T}_{s,f} = \{(s \to s_f)\}$ and $\mathcal{T}_{0,s'} = \{(s_0 \to s')\}$. Hence, $\sum_{\tau \in \mathcal{T}_{s,f}} \hat{P}_F(\tau) = \hat{P}_F(s \to s_f) = \hat{P}_F(s_f \mid s) = 1$, given that $s_f$ is the only child of $s$ (otherwise $d_{s,f}$ cannot be 1), and $\sum_{\tau \in \mathcal{T}_{0,s'}} \hat{P}_B(\tau) = \hat{P}_B(s_0 \mid s') = 1$, given that $s_0$ is the only parent of $s'$ (otherwise $d_{0,s'}$ cannot be 1).

Induction steps: Consider $s \neq s_f$ such that $d_{s,f} > 1$, and $s' \neq s_0$ such that $d_{0,s'} > 1$.
Because of the disjoint unions written above, we have:

$$\sum_{\tau \in \mathcal{T}_{s,f}} \hat{P}_F(\tau) = \sum_{\tilde{s} \in \mathrm{Child}(s)} \sum_{\tau \in \mathcal{T}_{s \to \tilde{s}, s_f}} \hat{P}_F(\tau) = \sum_{\tilde{s} \in \mathrm{Child}(s)} \hat{P}_F(\tilde{s} \mid s) \sum_{\tilde{\tau} \in \mathcal{T}_{\tilde{s},f}} \hat{P}_F(\tilde{\tau}) = \sum_{\tilde{s} \in \mathrm{Child}(s)} \hat{P}_F(\tilde{s} \mid s) = 1,$$

$$\sum_{\tau \in \mathcal{T}_{0,s'}} \hat{P}_B(\tau) = \sum_{\tilde{s} \in \mathrm{Par}(s')} \sum_{\tau \in \mathcal{T}_{0, \tilde{s} \to s'}} \hat{P}_B(\tau) = \sum_{\tilde{s} \in \mathrm{Par}(s')} \hat{P}_B(\tilde{s} \mid s') \sum_{\tilde{\tau} \in \mathcal{T}_{0,\tilde{s}}} \hat{P}_B(\tilde{\tau}) = \sum_{\tilde{s} \in \mathrm{Par}(s')} \hat{P}_B(\tilde{s} \mid s') = 1,$$

where we used the induction hypotheses in the third equality of each line. ∎

### 2.2 Trajectories and Flows

We augment a pointed DAG with a function $F$ called a flow. An analogy which helps picture flows is a stream of particles flowing through a network, where each particle starts at $s_0$ and flows along some trajectory ending in $s_f$. The flow $F(\tau)$ associated with each complete trajectory $\tau$ can be thought of as the number of particles sharing the same path $\tau$.

**Definition 6** Given a pointed DAG, a trajectory flow (or flow) is any non-negative function $F : \mathcal{T} \to \mathbb{R}^+$ defined on the set of complete trajectories $\mathcal{T}$. $F$ induces a measure over the $\sigma$-algebra $\Sigma = 2^{\mathcal{T}}$, the power set of the set of complete trajectories $\mathcal{T}$. In particular, for every subset $A \subseteq \mathcal{T}$, we have

$$F(A) = \sum_{\tau \in A} F(\tau). \tag{5}$$

The pair $(G, F)$ is called a flow network.

This definition ensures that $(\mathcal{T}, 2^{\mathcal{T}}, F)$ is a measure space. We abuse the notation here, using $F$ to denote both a function of complete trajectories and its corresponding measure over $(\mathcal{T}, 2^{\mathcal{T}})$. A special case is when the event $A$ is the singleton $\{\tau\}$, in which case we just write its measure as $F(\tau)$. We also abuse the notation to define the flow through either a particular state $s$ or a particular edge $s \to s'$, in the following way.

**Definition 7** The flow through a state (or state flow) $F : S \to \mathbb{R}^+$ corresponds to the measure of the set of complete trajectories going through a particular state:

$$F(s) := F(\{\tau \in \mathcal{T} : s \in \tau\}) = \sum_{\tau \in \mathcal{T} : s \in \tau} F(\tau). \tag{6}$$

Similarly, the flow through an edge (or edge flow) $F : A \to \mathbb{R}^+$ corresponds to the measure of the set of complete trajectories going through a particular edge:

$$F(s \to s') := F(\{\tau \in \mathcal{T} : s \to s' \in \tau\}) = \sum_{\tau \in \mathcal{T} : s \to s' \in \tau} F(\tau). \tag{7}$$

Note that with this definition, we have $F(s \to s') = 0$ if $s \to s' \notin A$ is not an edge in the pointed DAG (since $F(\emptyset) = 0$). We call the flow of a terminating transition, $F(s \to s_f)$, a terminating flow. The following proposition relates the state flows and the edge flows.

**Proposition 8** Given a flow network $(G, F)$, the state flows and edge flows satisfy:

$$\forall s \in S \setminus \{s_f\} \quad F(s) = \sum_{s' \in \mathrm{Child}(s)} F(s \to s') \tag{8}$$

$$\forall s' \in S \setminus \{s_0\} \quad F(s') = \sum_{s \in \mathrm{Par}(s')} F(s \to s') \tag{9}$$

**Proof** Given $s \neq s_f$, the set of complete trajectories going through $s$ is the (disjoint) union of the sets of trajectories going through $s \to s'$, for all $s' \in \mathrm{Child}(s)$:

$$\{\tau \in \mathcal{T} : s \in \tau\} = \bigcup_{s' \in \mathrm{Child}(s)} \{\tau \in \mathcal{T} : s \to s' \in \tau\}.$$

Therefore, it follows that:

$$F(s) = \sum_{\tau : s \in \tau} F(\tau) = \sum_{s' \in \mathrm{Child}(s)} \sum_{\tau : s \to s' \in \tau} F(\tau) = \sum_{s' \in \mathrm{Child}(s)} F(s \to s').$$

Similarly, Eq. (9) follows by writing the set of complete trajectories going through $s' \neq s_0$ as the (disjoint) union of the sets of trajectories going through $s \to s'$ for all $s \in \mathrm{Par}(s')$. ∎

### 2.3 Flow-Induced Probability Measures

**Definition 9** Given a flow network $(G, F)$, the total flow $Z$ is the measure of the whole set $\mathcal{T}$, corresponding to the sum of the flows of all the complete trajectories:

$$Z := F(\mathcal{T}) = \sum_{\tau \in \mathcal{T}} F(\tau). \tag{10}$$

**Proposition 10** The flow through the initial state and the flow through the final state both equal the total flow $Z$.

**Proof** Since $\forall \tau \in \mathcal{T}$, $s_0, s_f \in \tau$, applying Eq. (6) to $s_0$ and $s_f$ yields

$$F(s_0) = \sum_{\tau \in \mathcal{T}} F(\tau) = Z, \tag{11}$$

$$F(s_f) = \sum_{\tau \in \mathcal{T}} F(\tau) = Z. \tag{12}$$

∎

Intuitively, Prop. 10 justifies the use of the term *flow*, introduced by Bengio et al. (2021), by analogy with a stream of particles flowing from the initial state to the final state.
We use the letter $Z$ in Def. 9 because, as with the partition function in probabilistic models and statistical mechanics, it is a normalizing constant which can turn the measure space $(\mathcal{T}, 2^{\mathcal{T}}, F)$ defined above into a probability space $(\mathcal{T}, 2^{\mathcal{T}}, P)$:

**Definition 11** Given a flow network $(G, F)$, the flow probability is the probability measure $P$ over the measurable space $(\mathcal{T}, 2^{\mathcal{T}})$ associated with $F$:

$$\forall A \subseteq \mathcal{T} \quad P(A) := \frac{F(A)}{F(\mathcal{T})} = \frac{F(A)}{Z}. \tag{13}$$

For two events $A, B \subseteq \mathcal{T}$, the conditional probability $P(A \mid B)$ thus satisfies:

$$P(A \mid B) := \frac{F(A \cap B)}{F(B)}. \tag{14}$$

Similar to the flow $F$, we abuse the notation $P$ to define the probability of going through a state:

$$\forall s \in S \quad P(s) := \frac{F(s)}{Z}, \tag{15}$$

and similarly for the probability of going through an edge. Note that $P(s)$ does not correspond to a distribution over states, in the sense that $\sum_{s \in S} P(s) \neq 1$; in particular, it is easy to see that $P(s_0) = 1$ (in other words, the probability of a trajectory passing through the initial state $s_0$ is 1). Additionally, for a trajectory $\tau \in \mathcal{T}$, we also use the abuse of notation $P(\tau)$ instead of $P(\{\tau\})$ to denote the probability of going through the specific trajectory $\tau$.

**Definition 12** Given a flow network $(G, F)$, the forward transition probability operator $P_F$ is a function on $S \times S$ that is a special case of the conditional probabilities induced by $F$ (Eq. (14)):

$$\forall s \to s' \in A \quad P_F(s' \mid s) := P(s \to s' \mid s) = \frac{F(s \to s')}{F(s)}. \tag{16}$$

Similarly, the backwards transition probability is the operator defined by:

$$\forall s \to s' \in A \quad P_B(s \mid s') := P(s \to s' \mid s') = \frac{F(s \to s')}{F(s')}. \tag{17}$$

Note how $P_F$ and $P_B$ are consistent with $G$ (in the sense of Def. 4), as a consequence of Prop. 8. Because flows define probabilities over states and edges, they can be used to define probability distributions over the terminating states of a graph (denoted by $S^f = \mathrm{Par}(s_f)$), as follows:

**Definition 13** Given a flow network $(G, F)$, the terminating state probability $P_T$ is the probability over terminating states $S^f$ under the flow probability $P$:

$$\forall s \in S^f \quad P_T(s) := P(s \to s_f) = \frac{F(s \to s_f)}{Z}. \tag{18}$$

Contrary to the probability $P(s)$ of going through a state $s$, the terminating state probability $P_T$ is a well-defined distribution over the terminating states $s \in S^f$, in the following sense:

**Proposition 14** The terminating state probability $P_T$ is a well-defined distribution over the terminating states $s \in S^f$, in that $P_T(s) \geq 0$ for all $s \in S^f$ and $\sum_{s \in S^f} P_T(s) = 1$.

**Proof** Since the flow $F(s \to s_f)$ is non-negative, it is easy to see that $P_T(s) \geq 0$. Moreover, using the definition of $S^f = \mathrm{Par}(s_f)$, Prop. 8 (relating the edge flows and the state flows), and Prop. 10 ($F(s_f) = Z$), we have

$$\sum_{s \in S^f} P_T(s) = \frac{1}{Z} \sum_{s \in S^f} F(s \to s_f) = \frac{1}{Z} \sum_{s \in \mathrm{Par}(s_f)} F(s \to s_f) = \frac{F(s_f)}{Z} = 1.$$

∎

The terminating state probability is particularly important in the context of estimating flow networks (see Sec. 3), as it shows that a flow network $(G, F)$ induces a probability distribution over terminating states which is proportional to the terminating flows $F(s \to s_f)$, the normalization constant $Z$ being given by the initial flow $F(s_0)$.
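The definitions of this subsection can be checked by brute-force enumeration on a tiny flow network. The sketch below (with illustrative trajectory-flow values on the four-trajectory toy DAG introduced earlier) computes state flows (Eq. 6), edge flows (Eq. 7), the total flow $Z$ (Eq. 10), and the terminating state probability $P_T$ (Eq. 18):

```python
# Illustrative trajectory flows F(tau) on the toy DAG enumerated above.
FLOWS = {
    ("s0", "s2", "sf"): 1.0,
    ("s0", "s1", "s2", "sf"): 1.0,
    ("s0", "s2", "s3", "sf"): 2.0,
    ("s0", "s1", "s2", "s3", "sf"): 1.0,
}

def state_flow(s):
    # Eq. (6): measure of the complete trajectories going through s
    return sum(f for tau, f in FLOWS.items() if s in tau)

def edge_flow(s, sp):
    # Eq. (7): measure of the complete trajectories going through s -> s'
    return sum(f for tau, f in FLOWS.items()
               if any(tau[i:i + 2] == (s, sp) for i in range(len(tau) - 1)))

Z = sum(FLOWS.values())                            # Eq. (10): total flow
assert state_flow("s0") == state_flow("sf") == Z   # Prop. 10

def P_T(s):
    # Eq. (18): terminating state probability
    return edge_flow(s, "sf") / Z

print(P_T("s2"), P_T("s3"), P_T("s2") + P_T("s3"))  # 0.4 0.6 1.0 (Prop. 14)
```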
### 2.4 Markovian Flows

Defining a flow requires the specification of $|\mathcal{T}|$ non-negative values (one for every trajectory $\tau \in \mathcal{T}$), which is generally exponential in the number of graph edges. Markovian flows, however, have the remarkable property that they can be defined with far fewer numbers, given that trajectory flows factorize according to $G$.

**Definition 15** Let $(G, F)$ be a flow network, with flow probability measure $P$. $F$ is called a Markovian flow (or, equivalently, $(G, F)$ a Markovian flow network) if, for any state $s \neq s_0$, outgoing edge $s \to s'$, and trajectory $\tau = (s_0, s_1, \dots, s_n = s) \in \mathcal{T}^{\text{partial}}$ starting in $s_0$ and ending in $s$:

$$P(s \to s' \mid \tau) = P(s \to s' \mid s) = P_F(s' \mid s). \tag{19}$$

Note that the Markovian property does not hold for all of the flows as defined in the previous sections (e.g., Fig. 4). Intuitively, a flow can be considered non-Markovian if a particle in the flow stream can remember its past history; if not, its future behavior can only depend on its current state, and the flow must be Markovian. In this work, we will primarily be concerned with Markovian flows, though later we will re-introduce a form of memory via state-conditional flows that allow each flow particle to remember parts of its history.

The following proposition shows that Markovian flows have the property that the flows at (or the probabilities of) complete trajectories factorize according to the graph, and that this factorization is a sufficient condition for defining Markovian flows.

**Proposition 16** Let $(G, F)$ be a flow network, and $P$ the corresponding flow probability. The following three statements are equivalent:

1. $F$ is a Markovian flow.
2. There exists a unique probability function $\hat{P}_F$ consistent with $G$ such that for all complete trajectories $\tau = (s_0, \dots, s_{n+1} = s_f)$:
$$P(\tau) = \prod_{t=1}^{n+1} \hat{P}_F(s_t \mid s_{t-1}). \tag{20}$$
Moreover, the probability function $\hat{P}_F$ is exactly the forward transition probability associated with the flow probability $P$: $\hat{P}_F = P_F$.
3. There exists a unique probability function $\hat{P}_B$ consistent with $G$ such that for all complete trajectories $\tau = (s_0, \dots, s_{n+1} = s_f)$:
$$P(\tau) = \prod_{t=1}^{n+1} \hat{P}_B(s_{t-1} \mid s_t). \tag{21}$$
Moreover, the probability function $\hat{P}_B$ is exactly the backwards transition probability associated with the flow probability $P$: $\hat{P}_B = P_B$.

**Proof** Recall from Lemma 5 the notations $\mathcal{T}_{0,s}$, denoting the set of partial trajectories from $s_0$ to $s$, and $\mathcal{T}_{s',f}$, denoting the set of partial trajectories from $s'$ to $s_f$. We will prove the equivalences 1 ⟺ 2 and 1 ⟺ 3.

1 ⟹ 2: Suppose that $F$ is a Markovian flow. Then, using the laws of probability, the Markov property in Eq. (19), and $P(s_0) = 1$, for any complete trajectory $\tau = (s_0, \dots, s_{n+1} = s_f)$:

$$P(\tau) = P(s_0 \to s_1 \to \dots \to s_{n+1}) = P(s_0 \to s_1) \prod_{t=1}^{n} P(s_t \to s_{t+1} \mid s_0 \to \dots \to s_t) = P(s_0 \to s_1) \prod_{t=1}^{n} P(s_t \to s_{t+1} \mid s_t) = P(s_0) P_F(s_1 \mid s_0) \prod_{t=1}^{n} P_F(s_{t+1} \mid s_t) = \prod_{t=1}^{n+1} P_F(s_t \mid s_{t-1}),$$

where the second equality uses the Markov property and the third uses the definition of the forward transition probability $P_F$. $P_F$ thus satisfies Eq. (20) for all complete trajectories.

To show uniqueness of $\hat{P}_F$, assume Eq. (20) is satisfied by some $\hat{P}_F$ for all complete trajectories. By definition of the forward transition probability:

$$P_F(s' \mid s) := P(s \to s' \mid s) = \frac{P(s \to s')}{P(s)}.$$

Any complete trajectory $\tau$ going through a state $s$ can be (uniquely) decomposed into a partial trajectory $\tau' \in \mathcal{T}_{0,s}$ from $s_0$ to $s$, and a partial trajectory $\tau'' \in \mathcal{T}_{s,f}$ from $s$ to $s_f$. Using the definition of $P(s)$, we have:

$$P(s) = \sum_{\tau : s \in \tau} P(\tau) = \sum_{\tau' \in \mathcal{T}_{0,s}} \prod_{(s_t \to s_{t+1}) \in \tau'} \hat{P}_F(s_{t+1} \mid s_t) \underbrace{\sum_{\tau'' \in \mathcal{T}_{s,f}} \prod_{(s_t \to s_{t+1}) \in \tau''} \hat{P}_F(s_{t+1} \mid s_t)}_{=1 \text{ (Lemma 5)}} = \sum_{\tau' \in \mathcal{T}_{0,s}} \prod_{(s_t \to s_{t+1}) \in \tau'} \hat{P}_F(s_{t+1} \mid s_t).$$

Similarly, any complete trajectory going through $s \to s'$ can be (uniquely) decomposed into a partial trajectory $\tau' \in \mathcal{T}_{0,s}$ from $s_0$ to $s$, the edge $s \to s'$, and a partial trajectory $\tau'' \in \mathcal{T}_{s',f}$ from $s'$ to $s_f$. Again, using the definition of $P(s \to s')$:

$$P(s \to s') = \sum_{\tau : (s \to s') \in \tau} P(\tau) = \underbrace{\sum_{\tau' \in \mathcal{T}_{0,s}} \prod_{(s_t \to s_{t+1}) \in \tau'} \hat{P}_F(s_{t+1} \mid s_t)}_{=P(s)} \, \hat{P}_F(s' \mid s) \underbrace{\sum_{\tau'' \in \mathcal{T}_{s',f}} \prod_{(s_t \to s_{t+1}) \in \tau''} \hat{P}_F(s_{t+1} \mid s_t)}_{=1 \text{ (Lemma 5)}} = P(s) \, \hat{P}_F(s' \mid s).$$
Combining the two results above, we get:

$$P_F(s' \mid s) = \frac{P(s \to s')}{P(s)} = \hat{P}_F(s' \mid s).$$

2 ⟹ 1: Suppose that there exists a probability function $\hat{P}_F$ consistent with $G$ such that for all complete trajectories $\tau = (s_0, \dots, s_{n+1} = s_f)$

$$P(\tau) = \prod_{t=1}^{n+1} \hat{P}_F(s_t \mid s_{t-1}).$$

For the same reasons as those used to justify the uniqueness in the 1 ⟹ 2 proof, $\hat{P}_F$ is necessarily equal to the forward transition probability $P_F$ associated with $P$. We now want to show that the flow $F$ associated with $P$ is Markovian, by showing the Markov property of Eq. (19). Let $\tau' \in \mathcal{T}_{0,s}$ be any partial trajectory from $s_0$ to $s$; using the definition of conditional probability:

$$P(s \to s' \mid \tau') = \frac{P(s_0 \to \dots \to s \to s')}{P(s_0 \to \dots \to s)}.$$

Following the same idea as above, we will now rewrite $P(s_0 \to \dots \to s)$ as a sum over complete trajectories that share the same prefix trajectory $\tau'$. Any such complete trajectory $\tau$ can be (uniquely) decomposed into this common prefix $\tau'$ and a partial trajectory $\tau'' \in \mathcal{T}_{s,f}$ from $s$ to $s_f$:

$$P(s_0 \to \dots \to s) = \sum_{\tau : \tau' \subseteq \tau} P(\tau) = \prod_{(s_{t-1} \to s_t) \in \tau'} P_F(s_t \mid s_{t-1}) \underbrace{\sum_{\tau'' \in \mathcal{T}_{s,f}} \prod_{(s_t \to s_{t+1}) \in \tau''} P_F(s_{t+1} \mid s_t)}_{=1 \text{ (Lemma 5)}} = \prod_{(s_{t-1} \to s_t) \in \tau'} P_F(s_t \mid s_{t-1}).$$

Similarly, any complete trajectory that shares the prefix $(s_0, \dots, s, s')$ can be (uniquely) decomposed into this common prefix and a partial trajectory $\tau'' \in \mathcal{T}_{s',f}$ from $s'$ to $s_f$, leading to:

$$P(s_0 \to \dots \to s \to s') = P(s_0 \to \dots \to s) \, P_F(s' \mid s).$$

Combining the two results above, we can conclude that $P$ satisfies the Markov property, and therefore that the flow $F$ is Markovian:

$$P(s \to s' \mid \tau') = \frac{P(s_0 \to \dots \to s \to s')}{P(s_0 \to \dots \to s)} = P_F(s' \mid s) = P(s \to s' \mid s).$$

{1, 2} ⟹ 3: Suppose that $F$ is a Markovian flow. We have shown above that this is equivalent to $P$ decomposing into a product of forward transition probabilities $P_F$. For any complete trajectory $\tau = (s_0, \dots, s_{n+1} = s_f)$:

$$P(\tau) = \prod_{t=1}^{n+1} P_F(s_t \mid s_{t-1}) = \prod_{t=1}^{n+1} \frac{P(s_{t-1} \to s_t)}{P(s_{t-1})} = \frac{P(s_{n+1})}{P(s_0)} \prod_{t=1}^{n+1} \frac{P(s_{t-1} \to s_t)}{P(s_t)} = \prod_{t=1}^{n+1} P_B(s_{t-1} \mid s_t),$$

where the third equality uses the fact that $P(s_0) = P(s_f) = 1$, and the last uses the definition of the backwards transition probability $P_B$. The proof of uniqueness of $\hat{P}_B$ is similar to that of $\hat{P}_F$ in 1 ⟹ 2, and uses:

$$P(s \to s') = \sum_{\tau : (s \to s') \in \tau} P(\tau) = \underbrace{\sum_{\tau' \in \mathcal{T}_{0,s}} \prod_{(s_t \to s_{t+1}) \in \tau'} \hat{P}_B(s_t \mid s_{t+1})}_{=1 \text{ (Lemma 5)}} \, \hat{P}_B(s \mid s') \underbrace{\sum_{\tau'' \in \mathcal{T}_{s',f}} \prod_{(s_t \to s_{t+1}) \in \tau''} \hat{P}_B(s_t \mid s_{t+1})}_{=P(s')} = P(s') \, \hat{P}_B(s \mid s').$$

3 ⟹ 1: Similar to the proof of 2 ⟹ 1, $\hat{P}_B$ is necessarily equal to the backwards transition probability $P_B$ associated with $P$. Additionally, $P_B$ is related to the forward transition probability $P_F$ by $P(s \to s') = P_B(s \mid s') P(s') = P_F(s' \mid s) P(s)$. We can therefore write the decomposition of $P$ in terms of $P_F$ instead of $P_B$. For any complete trajectory $\tau = (s_0, \dots, s_{n+1} = s_f)$:

$$P(\tau) = \prod_{t=1}^{n+1} P_B(s_{t-1} \mid s_t) = \prod_{t=1}^{n+1} \frac{P(s_{t-1})}{P(s_t)} P_F(s_t \mid s_{t-1}) = \frac{P(s_0)}{P(s_{n+1})} \prod_{t=1}^{n+1} P_F(s_t \mid s_{t-1}) = \prod_{t=1}^{n+1} P_F(s_t \mid s_{t-1}),$$

where we used the fact that $P(s_0) = P(s_f) = 1$. Using 2 ⟹ 1, we can conclude that $F$ is a Markovian flow. ∎

The decomposition of Eq. (20) shows how Markovian flows can be used to draw terminating states from the terminating state probability $P_T$ (Eq. (18)). Namely, we have the following result:

**Corollary 17** Let $(G, F)$ be a Markovian flow network, and $P_F$ the corresponding forward transition probability. Consider the procedure starting from $s = s_0$ and iteratively drawing one sample from $P_F(\cdot \mid s)$ until reaching $s_f$. Then the probability of the procedure terminating in a state $s$ is $P_T(s)$.

**Proof** First, note that the procedure terminates with probability 1, given that $G$ is acyclic.
For the procedure to terminate in a state $s$, the trajectory $\tau \in \mathcal{T}$ implicitly constructed during the procedure must contain the edge $s \to s_f$. The probability of the procedure terminating in $s$ is thus:

$$\sum_{\tau \in \mathcal{T} : s \to s_f \in \tau} \; \underbrace{\prod_{(s' \to s'') \in \tau} P_F(s'' \mid s')}_{P(\tau), \text{ according to Eq. (20)}} = P(s \to s_f) = P_T(s).$$

∎

The following proposition shows that, as a consequence of Prop. 16, we obtain three different parametrizations of Markovian flows.

**Proposition 18** Given a pointed DAG $G = (S, A)$, a Markovian flow on $G$ is completely and uniquely specified by any one of the following:

1. the combination of the total flow $\hat{Z}$ and the forward transition probabilities $\hat{P}_F(s' \mid s)$ for all edges $s \to s' \in A$;
2. the combination of the total flow $\hat{Z}$ and the backward transition probabilities $\hat{P}_B(s \mid s')$ for all edges $s \to s' \in A$;
3. the combination of the terminating flows $\hat{F}(s \to s_f)$ for all terminating edges $s \to s_f \in A^f$ and the backwards transition probabilities $\hat{P}_B(s \mid s')$ for all non-terminating edges $s \to s' \in A^{-f}$.

**Proof** In the first two settings, we define a flow function $F : \mathcal{T} \to \mathbb{R}^+$ at a trajectory $\tau = (s_0, s_1, \dots, s_n, s_{n+1} = s_f)$ as:

1. $F(\tau) := \hat{Z} \prod_{t=1}^{n+1} \hat{P}_F(s_t \mid s_{t-1})$,
2. $F(\tau) := \hat{Z} \prod_{t=1}^{n+1} \hat{P}_B(s_{t-1} \mid s_t)$.

We need to prove that this is the only Markovian flow that can be defined, for both settings. The proof for the third setting will follow from that of the second setting.

First setting: First, we need to show that the total flow $Z$ associated with the flow function $F$ (Eq. (10)) matches $\hat{Z}$. This is a consequence of Lemma 5:

$$Z = \sum_{\tau \in \mathcal{T}} F(\tau) = \hat{Z} \underbrace{\sum_{\tau = (s_0, s_1, \dots, s_{n+1} = s_f) \in \mathcal{T}} \; \prod_{t=1}^{n+1} \hat{P}_F(s_t \mid s_{t-1})}_{=1, \text{ according to Lemma 5}} = \hat{Z}.$$

Then, we need to show that the forward transition probability function $P_F$ associated with $F$ (Eq. (16)) matches $\hat{P}_F$, and that the flow $F$ is Markovian. To this end, note that the corresponding flow probability $P$ satisfies Eq. (20). Thus, as a consequence of Prop. 16, $F$ is a Markovian flow, and its forward transition probability function is $\hat{P}_F$. As a last requirement, we need to show that if a Markovian flow $F'$ has partition function $Z' = \hat{Z}$ and forward transition probability function $P'_F = \hat{P}_F$, then it is necessarily equal to $F$. This is a direct consequence of Prop. 16, given that for any $\tau = (s_0, \dots, s_{n+1} = s_f) \in \mathcal{T}$:

$$F'(\tau) = Z' \prod_{t=1}^{n+1} P'_F(s_t \mid s_{t-1}) = \hat{Z} \prod_{t=1}^{n+1} \hat{P}_F(s_t \mid s_{t-1}) = F(\tau).$$

Second setting: First, we show that, as a consequence of Lemma 5, the total flow $Z$ associated with $F$ matches $\hat{Z}$:

$$Z = \sum_{\tau \in \mathcal{T}} F(\tau) = \hat{Z} \underbrace{\sum_{\tau = (s_0, s_1, \dots, s_{n+1} = s_f) \in \mathcal{T}} \; \prod_{t=1}^{n+1} \hat{P}_B(s_{t-1} \mid s_t)}_{=1, \text{ according to Lemma 5}} = \hat{Z}.$$

Second, we note that the flow probability $P$ associated with $F$ satisfies Eq. (21). Thus, as a consequence of Prop. 16, $F$ is a Markovian flow, and its backward transition probability function is $\hat{P}_B$. Finally, if a Markovian flow $F'$ has partition function $Z' = \hat{Z}$ and backward transition probability function $P'_B = \hat{P}_B$, then following Prop. 16, $\forall \tau \in \mathcal{T}$, $F'(\tau) = F(\tau)$.

Third setting: From the terminating flows $\hat{F}(s \to s_f)$ and the backwards transition probabilities $\hat{P}_B(s \mid s')$ for non-terminating edges, we can uniquely define a total flow $\hat{Z} := \sum_{s \in S^f} \hat{F}(s \to s_f)$ and extend $\hat{P}_B$ to all edges as follows:

$$\hat{P}_B(s \mid s') := \begin{cases} \hat{P}_B(s \mid s') & \text{if } s' \neq s_f \\[1ex] \dfrac{\hat{F}(s \to s_f)}{\hat{Z}} & \text{if } s' = s_f. \end{cases}$$

This takes us back to the second setting, for which we have already proven that, with $\hat{Z}$ and $\hat{P}_B$ defined for all edges, a Markovian flow is uniquely defined. ∎
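The sampling procedure of Corollary 17 is straightforward to implement once a forward transition probability is available. Below is a sketch with a hypothetical, hand-specified $P_F$ on the toy DAG used earlier (each row is normalized over the children of the state, as consistency with $G$ requires); the empirical distribution of terminating states approximates $P_T$:

```python
import random
from collections import Counter

# Hypothetical forward transition probabilities on the toy DAG; each row is
# normalized over the children of the state (consistency with G, Def. 4).
P_F = {
    "s0": {"s1": 0.6, "s2": 0.4},
    "s1": {"s2": 1.0},
    "s2": {"s3": 0.6, "sf": 0.4},
    "s3": {"sf": 1.0},
}

def sample_terminating_state():
    # The procedure of Corollary 17: start at s0, follow P_F until s_f.
    s = "s0"
    while True:
        children, probs = zip(*P_F[s].items())
        nxt = random.choices(children, weights=probs)[0]
        if nxt == "sf":
            return s  # the trajectory used the terminating edge s -> sf
        s = nxt

counts = Counter(sample_terminating_state() for _ in range(10000))
print(counts)  # the empirical frequencies approximate P_T over {s2, s3}
```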
### 2.5 Flow Matching Conditions

In Prop. 18, we saw how forward and backward probability functions can be used to uniquely define a Markovian flow. We will show in the next proposition how non-negative functions of states and edges can be used to define a Markovian flow. Such functions cannot be left unconstrained (unlike, e.g., $\hat{P}_F$ and $\hat{Z}$ in Prop. 18), as we have seen in Prop. 8.

**Proposition 19** Let $G = (S, A)$ be a pointed DAG. Consider a non-negative function $\hat{F}$ taking as input either a state $s \in S$ or a transition $s \to s' \in A$. Then $\hat{F}$ corresponds to a flow if and only if the flow matching conditions are satisfied:

$$\forall s' > s_0 \quad \hat{F}(s') = \sum_{s \in \mathrm{Par}(s')} \hat{F}(s \to s'), \qquad \forall s' < s_f \quad \hat{F}(s') = \sum_{s'' \in \mathrm{Child}(s')} \hat{F}(s' \to s''). \tag{22}$$

More specifically, $\hat{F}$ then uniquely defines a Markovian flow $F$ matching $\hat{F}$ on states and transitions:

$$\forall \tau = (s_0, \dots, s_{n+1} = s_f) \in \mathcal{T} \quad F(\tau) = \frac{\prod_{t=1}^{n+1} \hat{F}(s_{t-1} \to s_t)}{\prod_{t=1}^{n} \hat{F}(s_t)}. \tag{23}$$

**Proof** Necessity is a direct consequence of Prop. 8. Let us show sufficiency. Let $\hat{P}_F$ be the forward probability function defined by:

$$\forall s \to s' \in A \quad \hat{P}_F(s' \mid s) := \frac{\hat{F}(s \to s')}{\hat{F}(s)}.$$

$\hat{P}_F$ is consistent with $G$, given that $\hat{F}$ satisfies the flow matching conditions (Eq. (22)). Let $\hat{Z} = \hat{F}(s_0)$. According to Prop. 18, there exists a unique Markovian flow $F$ with forward transition probability function $P_F = \hat{P}_F$ and partition function $Z = \hat{Z}$, and it is such that for any trajectory $\tau = (s_0, \dots, s_{n+1} = s_f) \in \mathcal{T}$:

$$F(\tau) = \hat{Z} \prod_{t=1}^{n+1} \hat{P}_F(s_t \mid s_{t-1}) = \frac{\prod_{t=1}^{n+1} \hat{F}(s_{t-1} \to s_t)}{\prod_{t=1}^{n} \hat{F}(s_t)}. \tag{24}$$

Additionally, similar to the proof of Prop. 16, we can write, for any state $s' \neq s_0$:

$$F(s') = \hat{Z} \sum_{\tau' \in \mathcal{T}_{0,s'}} \prod_{(s_t \to s_{t+1}) \in \tau'} \hat{P}_F(s_{t+1} \mid s_t) = \hat{Z} \, \frac{\hat{F}(s')}{\hat{F}(s_0)} \underbrace{\sum_{\tau' \in \mathcal{T}_{0,s'}} \prod_{(s_t \to s_{t+1}) \in \tau'} \hat{P}_B(s_t \mid s_{t+1})}_{=1, \text{ according to Lemma 5}} = \hat{F}(s'),$$

where $\hat{P}_B(s \mid s') := \frac{\hat{F}(s \to s')}{\hat{F}(s')}$ defines a backward probability function consistent with $G$. And because $\forall s \to s' \in A$, $P_F(s' \mid s) = \hat{P}_F(s' \mid s)$, it follows that $\forall s \to s' \in A$, $F(s \to s') = \hat{F}(s \to s')$.

To show uniqueness, let us consider a Markovian flow $F'$ that matches $\hat{F}$ on states and edges. Following Prop. 16, for any trajectory $\tau = (s_0, \dots, s_{n+1} = s_f) \in \mathcal{T}$:

$$F'(\tau) = Z' \prod_{t=1}^{n+1} P'_F(s_t \mid s_{t-1}) = \frac{\prod_{t=1}^{n+1} \hat{F}(s_{t-1} \to s_t)}{\prod_{t=1}^{n} \hat{F}(s_t)} = F(\tau).$$

∎

Note how Eq. (22) can be used to recursively define the flow in all the states if $Z$ is given along with either the forward or the backwards transition probabilities. Either way, we would start from the flow at one of the extreme states, $s_0$ or $s_f$, and then distribute it recursively through the directed acyclic graph of the flow network, either going forward or going backward. A setting of particular interest, which will be central in Sec. 3, is when we are given all the terminating flows $F(s \to s_f)$ and would like to deduce a state flow function $F(s)$ and a forward transition probability function $P_F(s' \mid s)$ for the rest of the flow network.

Next, we will see how to parametrize Markovian flows using forward and backward probability functions consistent with the DAG. Unlike the condition in Prop. 19, the new condition does not involve a sum over transitions, which could be problematic if each state can have a large number of successors or if the state space is continuous. Interestingly, the resulting condition is analogous to the detailed balance condition of Monte-Carlo Markov chains.

**Definition 20** Given a pointed DAG $G = (S, A)$, a forward transition probability function $\hat{P}_F$ and a backward transition probability function $\hat{P}_B$ consistent with $G$ are said to be compatible if there exists an edge flow function $\hat{F} : A \to \mathbb{R}^+$ such that

$$\forall s \to s' \in A \quad \hat{P}_F(s' \mid s) = \frac{\hat{F}(s \to s')}{\sum_{s'' \in \mathrm{Child}(s)} \hat{F}(s \to s'')}, \qquad \hat{P}_B(s \mid s') = \frac{\hat{F}(s \to s')}{\sum_{s'' \in \mathrm{Par}(s')} \hat{F}(s'' \to s')}. \tag{25}$$
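As a numerical illustration (with made-up edge-flow values), the sketch below checks the flow matching conditions of Eq. (22) on the toy DAG and derives the forward transition probabilities used in the proof of Prop. 19:

```python
# Illustrative edge flows on the toy DAG, chosen to satisfy Eq. (22).
F_EDGE = {("s0", "s1"): 3.0, ("s0", "s2"): 2.0, ("s1", "s2"): 3.0,
          ("s2", "s3"): 3.0, ("s2", "sf"): 2.0, ("s3", "sf"): 3.0}

def in_flow(s):
    return sum(f for (u, v), f in F_EDGE.items() if v == s)

def out_flow(s):
    return sum(f for (u, v), f in F_EDGE.items() if u == s)

# Eq. (22): for every state other than s0 and sf, both sums must agree
# (their common value then defines the state flow F(s)).
for s in ("s1", "s2", "s3"):
    assert abs(in_flow(s) - out_flow(s)) < 1e-9, s

def P_F(sp, s):
    # The forward probability function used in the proof of Prop. 19.
    return F_EDGE[(s, sp)] / out_flow(s)

print(P_F("s1", "s0"), P_F("s2", "s0"))  # 0.6 0.4
```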
**Proposition 21** Let $G = (S, A)$ be a pointed DAG. Consider a non-negative function $\hat{F}$ over states, a forward transition probability function $\hat{P}_F$, and a backwards transition probability function $\hat{P}_B$ consistent with $G$. Then $\hat{F}$, $\hat{P}_F$, and $\hat{P}_B$ jointly correspond to a flow if and only if the detailed balance condition holds:

$$\forall s \to s' \in A \quad \hat{F}(s) \hat{P}_F(s' \mid s) = \hat{F}(s') \hat{P}_B(s \mid s'). \tag{26}$$

More specifically, $\hat{F}$, $\hat{P}_F$, and $\hat{P}_B$ then uniquely define a Markovian flow $F$ matching $\hat{F}$ on states, and with transition probabilities matching $\hat{P}_F$ and $\hat{P}_B$. Furthermore, when this condition is satisfied, the forward and backward transition probability functions $\hat{P}_F$ and $\hat{P}_B$ are compatible.

**Proof** For necessity, consider a flow $F$, with state flow function denoted $F$, and forward and backward transition probabilities $P_F$ and $P_B$. It is clear from the definitions of $P_F$ and $P_B$ (Def. 12) that Eq. (26) holds.

We prove the sufficiency of the condition by first defining the edge flow

$$\forall s \to s' \in A \quad \hat{F}(s \to s') := \hat{F}(s) \hat{P}_F(s' \mid s). \tag{27}$$

We then sum both sides of Eq. (26) over $s \in \mathrm{Par}(s')$, yielding

$$\sum_{s \in \mathrm{Par}(s')} \hat{F}(s) \hat{P}_F(s' \mid s) = \hat{F}(s') \sum_{s \in \mathrm{Par}(s')} \hat{P}_B(s \mid s') = \hat{F}(s'), \tag{28}$$

where we used the fact that $\hat{P}_B$ is a normalized probability distribution. Combining this with Eq. (27), we get

$$\forall s' > s_0 \quad \hat{F}(s') = \sum_{s \in \mathrm{Par}(s')} \hat{F}(s \to s'), \tag{29}$$

which is the first equality of the flow matching conditions (Eq. (22)) of Prop. 19. We can obtain the second equality by first using the normalization of $\hat{P}_F$, and then our definition of the edge flow (Eq. (27)):

$$\forall s' < s_f \quad \hat{F}(s') = \hat{F}(s') \sum_{s'' \in \mathrm{Child}(s')} \hat{P}_F(s'' \mid s') = \sum_{s'' \in \mathrm{Child}(s')} \hat{F}(s') \hat{P}_F(s'' \mid s') = \sum_{s'' \in \mathrm{Child}(s')} \hat{F}(s' \to s''). \tag{30}$$

Following Prop. 19, there exists a unique Markovian flow $F$ with state and edge flows given by $\hat{F}$. Using Eq. (27) and Eq. (26), it follows that $F$ has transition probabilities $\hat{P}_F$ and $\hat{P}_B$, as required. The uniqueness is also a consequence of Eq. (27). This proves sufficiency.

To show that $\hat{P}_F$ and $\hat{P}_B$ are compatible (Def. 20), we first combine Eq. (27) and Eq. (30) (with relabeling of variables) to obtain

$$\forall s \to s' \in A \quad \hat{P}_F(s' \mid s) = \frac{\hat{F}(s \to s')}{\sum_{s'' \in \mathrm{Child}(s)} \hat{F}(s \to s'')};$$

we then isolate $\hat{P}_B$ in Eq. (26), yielding

$$\forall s \to s' \in A \quad \hat{P}_B(s \mid s') = \frac{\hat{F}(s)}{\hat{F}(s')} \hat{P}_F(s' \mid s) = \frac{\hat{F}(s \to s')}{\hat{F}(s')} = \frac{\hat{F}(s \to s')}{\sum_{s'' \in \mathrm{Par}(s')} \hat{F}(s'' \to s')}.$$

We thus get Eq. (25) of Def. 20, as desired. ∎

At first glance, it may seem that when $\hat{P}_B$ is unconstrained, the detailed balance condition can trivially be achieved by setting

$$\forall s \to s' \in A \quad \hat{P}_B(s \mid s') = \hat{P}_F(s' \mid s) \frac{\hat{F}(s)}{\hat{F}(s')}. \tag{31}$$

However, because we also have the constraint $\sum_{s \in \mathrm{Par}(s')} \hat{P}_B(s \mid s') = 1$, Eq. (31) can only be satisfied if the flows are consistent with the forward transitions:

$$\sum_{s \in \mathrm{Par}(s')} \hat{P}_F(s' \mid s) \hat{F}(s) = \hat{F}(s').$$
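The following sketch illustrates this numerically: starting from state flows and a forward policy (illustrative values on the toy network above), it sets $\hat{P}_B$ by isolating it in Eq. (26) (i.e., Eq. (31)) and then verifies the non-trivial requirement, namely that the resulting $\hat{P}_B$ is a properly normalized backward probability function:

```python
# Illustrative state flows and forward policy on the toy DAG (these match
# the edge flows used above, with Z = F(s0) = 5).
F_STATE = {"s0": 5.0, "s1": 3.0, "s2": 5.0, "s3": 3.0, "sf": 5.0}
P_F = {("s0", "s1"): 0.6, ("s0", "s2"): 0.4, ("s1", "s2"): 1.0,
       ("s2", "s3"): 0.6, ("s2", "sf"): 0.4, ("s3", "sf"): 1.0}

# Set P_B by isolating it in the detailed balance condition (Eq. 31).
P_B = {(s, sp): F_STATE[s] * p / F_STATE[sp] for (s, sp), p in P_F.items()}

# Eq. (26) then holds by construction on every edge...
for (s, sp), p in P_F.items():
    assert abs(F_STATE[s] * p - F_STATE[sp] * P_B[(s, sp)]) < 1e-9

# ...so the non-trivial requirement is normalization over parents:
for sp in ("s1", "s2", "s3", "sf"):
    total = sum(p for (s, x), p in P_B.items() if x == sp)
    assert abs(total - 1.0) < 1e-9, sp
print("detailed balance holds and P_B is normalized")
```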
### 2.6 Backwards Transitions can be Chosen Freely

Consider the setting in which we are given terminating flows to be matched, i.e., where the goal is to find a flow function with the right terminating flows. This is the setting introduced in Bengio et al. (2021), which will be studied in Sec. 3. In this case, Prop. 18 tells us that in order to fully determine the forward transition probabilities and the state or state-action flows, it is not sufficient in general to specify only the terminating flows; it is also necessary to specify the backwards transition probabilities on the edges other than the terminating ones (the latter being given by the terminating flows). What this means is that the terminating flows do not specify the flow completely, e.g., because many different paths can land in the same terminating state. The preference over such different ways of achieving the same final outcome is specified by the backwards transition probability $P_B$ (except for $P_B(s \mid s_f)$, which is a function of the terminating flows and $Z$). For example, we may want to give equal weight to all parents of a node $s$, or we may prefer shorter paths, which can be achieved if we keep track in the state of the length of the shortest path to the node $s$, or we may let a learner discover a $P_B$ that makes learning $P_F$ or $F$ easier.

### 2.7 Equivalence Between Flows

In the previous sections, we have seen that Markovian flows have the property that trajectory flows or probabilities factorize according to the DAG, and we have seen different ways of characterizing Markovian flows. In Sec. 3, we show how to approximate Markovian flows in order to define probability measures over terminating states. In this section, through an equivalence relation between trajectory flows, we justify the focus on Markovian flows. Given a pointed DAG $G = (S, A)$, we denote by:

- $\mathcal{F}(G)$: the set of flows on $G$, i.e., the set of functions from $\mathcal{T}$, the set of complete trajectories in $G$, to $\mathbb{R}^+$,
- $\mathcal{F}_{\text{Markov}}(G)$: the set of flows in $\mathcal{F}(G)$ that are Markovian.

**Definition 22** Let $G = (S, A)$ be a pointed DAG, and $F_1, F_2 \in \mathcal{F}(G)$ two trajectory flow functions. We say that $F_1$ and $F_2$ are equivalent if they coincide on edge flows, i.e.:

$$\forall s \to s' \in A \quad F_1(s \to s') = F_2(s \to s').$$

Fig. 4 shows four flow functions on a simple pointed DAG, some pairs of which are equivalent. This defines an equivalence relation (i.e., a relation that is reflexive, symmetric, and transitive). Hence, each flow $F$ belongs to an equivalence class, and the set of flows $\mathcal{F}(G)$ can be partitioned into equivalence classes. Note that if two flows are equivalent, then the corresponding state flow functions also coincide (as a direct consequence of Prop. 8).

| $\tau$ | $F_1(\tau)$ | $F_2(\tau)$ | $F_3(\tau)$ | $F_4(\tau)$ |
|---|---|---|---|---|
| $s_0 \to s_2 \to s_f$ | 1 | 4/5 | 1 | 6/5 |
| $s_0 \to s_1 \to s_2 \to s_f$ | 1 | 6/5 | 1 | 4/5 |
| $s_0 \to s_2 \to s_3 \to s_f$ | 1 | 6/5 | 2 | 9/5 |
| $s_0 \to s_1 \to s_2 \to s_3 \to s_f$ | 2 | 9/5 | 1 | 6/5 |

*Figure 4: Equivalent flows and Markovian flows. Flows $F_1$ and $F_2$ are equivalent. $F_3$ and $F_4$ are equivalent, but not equivalent to $F_1$ and $F_2$. $F_2$ and $F_4$ are Markovian; $F_1$ and $F_3$ are not. $F_1$, $F_2$, $F_3$, and $F_4$ coincide on the terminating flows, i.e., at $s_2 \to s_f$ and $s_3 \to s_f$.*

**Proposition 23** Given a pointed DAG $G$: if two flow functions $F_1, F_2 \in \mathcal{F}_{\text{Markov}}(G)$ are equivalent, then they are equal. Additionally, for any flow function $F' \in \mathcal{F}(G)$, there exists a unique Markovian flow function $F \in \mathcal{F}_{\text{Markov}}(G)$ such that $F$ and $F'$ are equivalent.

**Proof** Because $F_1$ and $F_2$ are Markovian, for any trajectory $\tau = (s_0, \dots, s_{n+1} = s_f)$:

$$F_1(\tau) = \frac{\prod_{t=1}^{n+1} F_1(s_{t-1} \to s_t)}{\prod_{t=1}^{n} F_1(s_t)} = \frac{\prod_{t=1}^{n+1} F_2(s_{t-1} \to s_t)}{\prod_{t=1}^{n} F_2(s_t)} = F_2(\tau),$$

where we combined the definition of equivalent flows and Prop. 16.

Given a flow function $F'$, because its state and edge flow functions satisfy the flow matching conditions (as a consequence of Prop. 8), then according to Prop. 19, the flow $F$ defined by

$$\forall \tau = (s_0, \dots, s_{n+1} = s_f) \in \mathcal{T} \quad F(\tau) = \frac{\prod_{t=1}^{n+1} F'(s_{t-1} \to s_t)}{\prod_{t=1}^{n} F'(s_t)}$$

is Markovian and coincides with $F'$ on state and edge flows. Combining this with the statement above, we conclude that $F$ is the unique Markovian flow that is equivalent to $F'$. ∎

The previous proposition shows that in each equivalence class there stands out a particular flow function, with a property the other flows in the same equivalence class don't have: it is Markovian.
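Prop. 23 suggests a simple computation: from any flow, compute its state and edge flows, then use Eq. (23) to obtain the unique equivalent Markovian flow. The sketch below applies this to $F_1$ of Fig. 4 and recovers $F_2$:

```python
# F1 from Figure 4, as a trajectory flow on the toy DAG.
F1 = {
    ("s0", "s2", "sf"): 1.0,
    ("s0", "s1", "s2", "sf"): 1.0,
    ("s0", "s2", "s3", "sf"): 1.0,
    ("s0", "s1", "s2", "s3", "sf"): 2.0,
}

def state_flow(F, s):
    return sum(f for tau, f in F.items() if s in tau)

def edge_flow(F, s, sp):
    return sum(f for tau, f in F.items()
               if any(tau[i:i + 2] == (s, sp) for i in range(len(tau) - 1)))

def markovian_projection(F, tau):
    # Eq. (23): product of edge flows over product of intermediate state flows.
    num = den = 1.0
    for s, sp in zip(tau[:-1], tau[1:]):
        num *= edge_flow(F, s, sp)
    for s in tau[1:-1]:
        den *= state_flow(F, s)
    return num / den

for tau in F1:
    print(tau, markovian_projection(F1, tau))  # recovers F2 of Figure 4
```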
A consequence of this is that, if we essentially care about state and edge flows, instead of dealing with the full set of flows $\mathcal{F}(G)$, it suffices to restrict any flow learning problem to the set of Markovian flows $\mathcal{F}_{\text{Markov}}(G)$. The advantage of this restriction is that defining a flow requires the specification of $F(\tau)$ for all trajectories $\tau \in \mathcal{T}$, whereas defining a Markovian flow requires the specification of $F(s \to s')$ for all edges $s \to s' \in A$, a set which is generally exponentially smaller than $\mathcal{T}$ (note that the edge flows still need to satisfy the flow matching conditions of Prop. 19). Thus, in order to approximate or learn a flow function that satisfies some conditions on its edge or state values, it suffices to approximate or learn a Markovian flow, by learning the edge flow function, which is a much smaller object than the full trajectory flow function.

## 3. GFlowNets: Learning a Flow

With the theoretical preliminaries established in Sec. 1 and Sec. 2, we now consider the general class of problems introduced by Bengio et al. (2021), where some constraints or preferences over flows are given. Our goal is to find functions such as the state flow function $F(s)$ or the transition probability function $P(s \to s' \mid s)$ that best match these desiderata, using corresponding estimators $\hat{F}(s)$ and $\hat{P}(s \to s' \mid s)$ which may not correspond to a proper flow. Such learning machines are called Generative Flow Networks (or GFlowNets for short). We focus on scenarios where we are given a target reward function $R : S^f \to \mathbb{R}^+$ and aim at estimating flows $F$ that satisfy:

$$\forall s \in S^f \quad F(s \to s_f) = R(s). \tag{32}$$

Because of the equivalences that exist in the set of flows, without loss of generality we choose GFlowNets to approximate Markovian flows only. We are thus interested in the following set of flows:

$$\mathcal{F}_{\text{Markov}}(G, R) = \{F \in \mathcal{F}_{\text{Markov}}(G) : \forall s \in S^f \quad F(s \to s_f) = R(s)\}. \tag{33}$$

For now, we informally define a GFlowNet as an estimator of a Markovian flow function $F \in \mathcal{F}_{\text{Markov}}(G, R)$; we provide a more formal definition later on. With an estimator $\hat{F}$ of such a Markovian flow $F$, we can define an approximate forward transition probability function $\hat{P}_F$, as in Prop. 16, in order to draw trajectories $\tau \in \mathcal{T}$ (the set of complete trajectories in $G$) by iteratively sampling each state given the previous one, starting at $s_0$ and then with $s_{t+1} \sim \hat{P}_F(\cdot \mid s_t)$ until we reach the sink state $s_{n+1} = s_f$ for some $n$. Next, we will clarify how such an estimator can be obtained.

### 3.1 GFlowNets as an Alternative to MCMC Sampling

The main established methods to approximately sample from the distribution associated with an energy function $\mathcal{E}$ are Monte-Carlo Markov chain (MCMC) methods, which require significant computation (running a potentially very long Markov chain) to obtain samples. Instead, the GFlowNet approach amortizes upfront computation to train a generator, yielding very efficient computation (a single configuration is constructed, no chain needed) for each new sample. For example, Bengio et al. (2021) build a GFlowNet that constructs a molecule via a small sequence of actions, each of which adds an atom or a molecular substructure to an existing molecule represented by a graph, starting from an empty graph. Only one such configuration needs to be constructed, in contrast with MCMC methods, which require potentially very long chains of such configurations and suffer from the challenge of mode-mixing (Jasra et al., 2005; Bengio et al., 2013; Pompe et al., 2020), which can take time exponentially long in the distance between modes.
In GFlowNets, this computational challenge is avoided, but the computational demand is converted into that of training the GFlowNet. To see how this can be extremely beneficial, consider having already constructed some configurations $x$ and obtained their unnormalized probability or reward $R(x)$. With these pairs $(x, R(x))$, a machine learning system could potentially generalize about the value of $R$ elsewhere and, if it is a generative model, sample new $x$'s in places of large $R(x)$. Hence, if there is an underlying statistical structure in how the modes of $R$ are related to each other, a generative learner that generalizes could guess the presence of modes it has not visited yet, taking advantage of the patterns it has already uncovered from the $(x, R(x))$ pairs it has seen. On the other hand, if there is no structure (the modes are randomly placed), then we should not expect GFlowNets to do significantly better than MCMC, because training becomes intractable in high-dimensional spaces (since it requires visiting every area of the configuration space to ascertain its reward).

### 3.2 GFlowNets and flow-matching losses

We have seen in Sec. 2.4 and Sec. 2.5 different ways of parametrizing a flow: for example, with a partition function and forward transition probabilities, or with edge flows that satisfy the flow matching conditions. Because there are many ways to parametrize GFlowNets, we start with an abstract formulation, where $o \in \mathcal{O}$ represents a parameter configuration (e.g., resulting from, or arising during, the training of a GFlowNet), $\Pi(o)$ gives the corresponding probability measure over trajectories $\tau \in \mathcal{T}$, and $H$ maps a Markovian flow $F$ to its parametrization $o$. The following definition states the conditions that must be satisfied for such a parametrization to be valid.

**Definition 24** Given a pointed DAG $G = (S, A)$, with initial and sink states $s_0$ and $s_f$ respectively, and a target reward function $R : S^f \to \mathbb{R}^+$, we say that the triplet $(\mathcal{O}, \Pi, H)$ is a flow parametrization of $(G, R)$ if:

1. $\mathcal{O}$ is a non-empty set,
2. $\Pi$ is a function mapping each object $o \in \mathcal{O}$ to an element $\Pi(o) \in \Delta(\mathcal{T})$, the set of probability distributions on $\mathcal{T}$,
3. $H$ is an injective functional from $\mathcal{F}_{\text{Markov}}(G, R)$ to $\mathcal{O}$,
4. for any $F \in \mathcal{F}_{\text{Markov}}(G, R)$, $\Pi(H(F))$ is the probability measure associated with the flow $F$ (Def. 11).

For each object $o \in \mathcal{O}$, the distribution $\Pi(o)$ implicitly defines a terminating state probability measure:

$$\forall s \in S^f \quad P_T(s) := \sum_{\tau \in \mathcal{T} : s \to s_f \in \tau} \Pi(o)(\tau), \tag{34}$$

where the dependence of $P_T$ on $o$ is omitted for clarity.

The intuition behind the introduction of $(\mathcal{O}, \Pi, H)$ is that we can define a probability measure over $\mathcal{T}$ for each object $o \in \mathcal{O}$, but only some of these objects correspond to a Markovian flow with the right terminating flows. For such objects $o$ (i.e., those that can
Below, for a given DAG G, we show three examples clarifying the abstract concept of parametrization: Example 1 Edge-flow parametrization: Consider Oedge = F(A f, R+), the set of functions from A f to R+, and the functionals Hedge : FMarkov(G, R) Oedge and Πedge : Oedge (T ) defined by: Hedge(F) : (s s ) A f 7 F(s s ), τ = (s0, . . . , sn = sf) T Πedge( ˆF)(τ) t=1 P ˆF (st | st 1), P ˆF (s | s) = ˆF(s s ) P s =sf ˆF(s s )+R(s) if s = sf R(s) P s =sf ˆF(s s )+R(s) if s = sf (36) The injectivity of Hedge follows directly from Prop. 23 (two Markovian flows that coincide on both their terminating and non-terminating edge flow values are equal). And for any Markovian flow F FMarkov(G, R), Πedge(Hedge(F)) equals the probability measure associated with F, as is shown in Prop. 16. (Oedge, Πedge, Hedge) is thus a valid flow parametrization of (G, R). Example 2 Forward transition probability parametrization: Consider the set OPF = O1 O2, where O1 = F(S \ {sf}, R+) is the set of function from S \ {sf} to R+ and O2 is the set of forward probability functions ˆPF consistent with G , and the functionals HPF : FMarkov(G, R) OPF and ΠPF : OPF (T ) defined by: HPF (F) = s S \ {sf} 7 F(s), (s s ) A 7 PF (s | s) , τ = (s0, . . . , sn = sf) T ΠPF ( ˆF, ˆPF )(τ) t=1 ˆPF (st | st 1), Bengio, Lahlou, Deleu, Hu, Tiwari, Bengio where PF is the forward transition probability function associated with F (Eq. (16)). To verify that HPF is injective, consider F1, F2 FMarkov(G, R) such that HPF (F1) = HPF (F2). It means that s Sf, F1(s) = F2(s), and s s A, F1(s s ) F1(s) = F2(s s ) F2(s) . It follows that s s A, F1(s s ) = F2(s s ). Which, according to Prop. 23, means that F1 = F2. And for any Markovian flow F FMarkov(G, R), ΠPF (HPF (F)) equals the probability measure associated with F, as is shown in Prop. 16. (OPF , ΠPF , HPF ) is thus a valid flow parametrization of (G, R). Example 3 Transition probabilities parametrization: Similar to Ex. 2, we can parametrize a Markovian flow using the state-flow function and both its forward and backward transition probabilities, i.e. with OPFB = OPF O3, HPFB, and ΠPFB defined as: HPFB(F) = HPF (F), (s s ) A f 7 PB(s | s ), , τ = (s0, . . . , sn = sf) T ΠPFB( ˆF, ˆPF , ˆPB)(τ) t=1 ˆPF (st | st 1), where PB is the function defined by Eq. (17). and O3 is the set of backward probability functions ˆPB consistent with G. The injectivity of HPFB is a direct consequence of that of HPF . And for any Markovian flow F, ΠPFB(HPFB(F)) equals the probability measure associated with F, as is shown in Prop.3. (OPFB, ΠPFB, HPFB) is thus a valid flow parametrization of (G, R). We now have all the ingredients to formally define a GFlow Net: Definition 25 A GFlow Net is a tuple (G, R, O, Π, H), where: G = (S, A) is a pointed DAG with initial state s0 and sink state sf, R : Sf R+ a target reward function, (O, Π, H) a flow parametrization of (G, R). Each object o O is called a GFlow Net configuration. When it is clear from context, we will use the term GFlow Net to refer to both (G, R, O, Π, H) and a particular configuration o; similar to how the term Neural Network refers to both the class of functions that can be represented with a particular architecture, and to a particular element of that class / weight configuration. If o H(FMarkov(G, R)), then the corresponding terminating state probability measure (Eq. (34)) is proportional to the target reward R. Once we have a GFlow Net (G, R, O, Π, H), we still need a way to find objects o H(FMarkov(G, R)) O. 
We now have all the ingredients to formally define a GFlow Net:

Definition 25 A GFlow Net is a tuple (G, R, O, Π, H), where G = (S, A) is a pointed DAG with initial state s0 and sink state sf, R : Sf → R+ is a target reward function, and (O, Π, H) is a flow parametrization of (G, R).

Each object o ∈ O is called a GFlow Net configuration. When it is clear from context, we will use the term GFlow Net to refer both to (G, R, O, Π, H) and to a particular configuration o, similar to how the term Neural Network refers both to the class of functions that can be represented with a particular architecture and to a particular element of that class (a weight configuration). If o ∈ H(FMarkov(G, R)), then the corresponding terminating state probability measure (Eq. (34)) is proportional to the target reward R. Once we have a GFlow Net (G, R, O, Π, H), we still need a way to find objects o ∈ H(FMarkov(G, R)) ⊆ O. To this end, it suffices to design a loss function L on O that equals zero on objects o ∈ H(FMarkov(G, R)) and only on those objects. If our loss function L is chosen to be non-negative, then an approximation of the target distribution (on Sf) is obtained by approximating the minimum of the function L. This provides a recipe for casting the search problem of interest as a minimization problem, as we typically do in machine learning. Such loss functions can easily be designed for the natural parametrizations we considered in Ex. 1, Ex. 2, and Ex. 3, as we will illustrate below.

Definition 26 Let (G, R, O, Π, H) be a GFlow Net. A flow-matching loss is any function L : O → R+ such that:

∀ o ∈ O,  L(o) = 0 ⟺ ∃ F ∈ FMarkov(G, R), o = H(F).  (37)

We say that L is edge-decomposable if there exists a function L : O × A → R+ such that:

∀ o ∈ O,  L(o) = ∑_{s→s′ ∈ A} L(o, s→s′).

We say that L is state-decomposable if there exists a function L : O × S → R+ such that:

∀ o ∈ O,  L(o) = ∑_{s ∈ S} L(o, s).

We say that L is trajectory-decomposable if there exists a function L : O × T → R+ such that:

∀ o ∈ O,  L(o) = ∑_{τ ∈ T} L(o, τ).

As mentioned above, with such a loss function, our search problems can be written as minimization problems of the form

min_{o ∈ O} L(o),  (38)

which can be tackled with gradient-based learning if the function L is differentiable. Note that with an edge-decomposable flow-matching loss, the minimization problem in Eq. (38) is equivalent to:

min_{o ∈ O} E_{(s→s′) ∼ πT} [L(o, s→s′)],  (39)

where πT is any full-support probability distribution on A, i.e., a probability distribution such that ∀ s→s′ ∈ A, πT(s→s′) > 0. A similar statement can be made for state-decomposable or trajectory-decomposable flow-matching losses.

Example 4 Consider the edge-flow parametrization (Oedge, Πedge, Hedge), and the function LFM : Oedge × S → R+ defined for each F̂ ∈ Oedge and s′ ∈ S as

LFM(F̂, s′) = (log [(δ + ∑_{s ∈ Par(s′)} F̂(s→s′)) / (δ + R(s′) + ∑_{s″ ∈ Child(s′)\{sf}} F̂(s′→s″))])²  if s′ ∉ {s0, sf},
LFM(F̂, s′) = 0  otherwise,

where δ ≥ 0 is a hyper-parameter. The function LFM mapping each F̂ ∈ Oedge to

LFM(F̂) = ∑_{s ∈ S} LFM(F̂, s)  (40)

is a flow-matching loss that is (by definition) state-decomposable. To see this, let F̂ ∈ Oedge such that LFM(F̂) = 0, and extend it to terminating edges: ∀ s ∈ Sf, F̂(s→sf) := R(s). Now that F̂ is defined for all edges in G, we can write, for every state s′ ∈ S \ {s0, sf},

∑_{s ∈ Par(s′)} F̂(s→s′) = ∑_{s″ ∈ Child(s′)} F̂(s′→s″),

which, according to Prop. 19, means that there exists a Markovian flow F ∈ FMarkov(G, R) such that Hedge(F) = F̂. The converse, i.e. that ∀ F ∈ FMarkov(G, R), LFM(Hedge(F)) = 0, is a trivial consequence of Prop. 19. This is the loss function proposed in Bengio et al. (2021). The hyper-parameter δ reduces the importance given to small flows (those smaller than δ), and the usage of the square of the log-ratio is justified as a way to ensure that states with large flows do not contribute to the gradients of LFM much more than states with small flows.

Example 5 Detailed-balance loss: Consider the transition probabilities parametrization (OPFB, ΠPFB, HPFB), and the function LDB : OPFB × A → R+ defined for each (F̂, P̂F, P̂B) ∈ OPFB and s→s′ ∈ A as

LDB(F̂, P̂F, P̂B, s→s′) = (log [(δ + F̂(s) P̂F(s′|s)) / (δ + F̂(s′) P̂B(s|s′))])²  if s′ ≠ sf,
LDB(F̂, P̂F, P̂B, s→s′) = (log [(δ + F̂(s) P̂F(sf|s)) / (δ + R(s))])²  otherwise,

where δ ≥ 0 is a hyper-parameter. The function LDB mapping each (F̂, P̂F, P̂B) ∈ OPFB to

LDB(F̂, P̂F, P̂B) = ∑_{s→s′ ∈ A} LDB(F̂, P̂F, P̂B, s→s′)

is a flow-matching loss that is (by definition) edge-decomposable. The proof of this statement is similar to that of the example above, using Prop. 21.
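A hedged sketch of the per-edge detailed-balance loss of Ex. 5, under the assumption that F̂, P̂F, P̂B and R are supplied as callables (the names are illustrative, not the paper's implementation):

    import math

    def detailed_balance_loss(F_hat, PF_hat, PB_hat, R, s, s_next, delta=0.0):
        # Squared log-ratio comparing flow into the transition, F_hat(s) * PF,
        # with flow out of it, F_hat(s') * PB; on a terminating edge the
        # outgoing flow must instead match the reward R(s).
        if s_next != "sf":
            num = delta + F_hat(s) * PF_hat(s_next, s)
            den = delta + F_hat(s_next) * PB_hat(s, s_next)
        else:
            num = delta + F_hat(s) * PF_hat("sf", s)
            den = delta + R(s)
        return math.log(num / den) ** 2

Summing this quantity over sampled edges gives the stochastic training signal discussed below.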
According to Sec. 2.6, the reward function does not completely specify the flow. Thus, the detailed-balance loss of Ex. 5 can also be used with the (OPF, ΠPF, HPF) parametrization, using any function P̂B ∈ O3 as input to the detailed-balance loss.

Example 6 Trajectory-balance loss: This loss was introduced in Malkin et al. (2022) for the parametrization (OTB, ΠTB, HTB), where OTB = O1 × O2 × O3, with O1 = R+ parametrizing the partition function Ẑ, and O2 and O3 introduced in Ex. 2 and Ex. 3 (the sets of forward and backward probability functions consistent with G). HTB maps a Markovian flow in FMarkov(G, R) to the corresponding triplet (Z, PF, PB), and ΠTB maps a parametrization (Ẑ, P̂F, P̂B) to a probability over trajectories defined by P̂F as in Ex. 2. Prop. 18 justifies the validity of this parametrization. The loss LTB maps each (Ẑ, P̂F, P̂B) ∈ OTB to:

LTB(Ẑ, P̂F, P̂B) = ∑_{τ ∈ T} LTB(Ẑ, P̂F, P̂B, τ),

∀ τ = (s0, . . . , sn+1 = sf) ∈ T,  LTB(Ẑ, P̂F, P̂B, τ) = (log [Ẑ ∏_{t=1}^{n+1} P̂F(s_t | s_{t−1}) / (R(s_n) ∏_{t=1}^{n} P̂B(s_{t−1} | s_t))])².  (41)

Malkin et al. (2022) prove that LTB is a flow-matching loss and call it trajectory balance. It is trajectory-decomposable by definition.

Training by stochastic gradient descent: In the examples of the previous section, given a GFlow Net (G, R, O, Π, H) and a flow-matching loss L, objects o ∈ O are themselves functions or combinations of functions, and we can thus parametrize O with function approximators such as Neural Networks. However, most of the time, the evaluation (let alone the minimization) of L(o) is intractable, given that even with a full-support distribution, only a subset of edges (or states or trajectories) can be visited in finite time. In practice, we resort to a stochastic gradient, such as

∇o L(o, s→s′),  s→s′ ∼ πo  (42)

for edge-decomposable losses, or

∇o L(o, τ),  τ ∼ πo  (43)

for trajectory-decomposable losses, where πo, called the training distribution, is a distribution over edges or trajectories that can be associated with Π(o), corresponding to the online setting in RL, or defined in other ways, corresponding to the behavior policy in offline RL; see Sec. 3.3.3 below.

3.3 Extensions

In this section, we discuss possible relaxations of the GFlow Net training paradigm introduced thus far.

3.3.1 Introducing Time Stamps to Allow Cycles

Note that the state space of a GFlow Net can easily be modified to accommodate an underlying state space for which the transitions do not form a DAG, e.g., to allow cycles. Let S be such an underlying state space. Define the augmented state space S′ = S × N, where N = {0, 1, 2, . . .} is the set of natural numbers, and the augmented state is (st, t), where t is the position of the state st in the trajectory. With this augmented state space, we automatically avoid cycles. Furthermore, we may design or train the backward transition probabilities PB((st, t) | (st+1, t + 1)) to create a preference for shorter paths towards st+1, as discussed in Sec. 2.6. Note that we can further generalize this setup by replacing N with any totally ordered indexing set; the augmented state space will still have an associated DAG. An ordering < on the original state space is lifted to the augmented state space via: (s, t) < (s′, t′) if and only if t < t′ and s < s′.
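The following toy sketch illustrates the augmentation of Sec. 3.3.1, assuming a hypothetical cyclic transition structure raw_transitions: pairing each state with its step index makes the lifted graph acyclic, since the time stamp strictly increases along every transition.

    raw_transitions = {"u": ["v"], "v": ["u", "w"], "w": []}   # u <-> v is a cycle

    def augmented_children(aug_state):
        """Children of (s, t) in the augmented space S x N: (s', t + 1) for
        each raw transition s -> s'; no trajectory can revisit (s, t)."""
        s, t = aug_state
        return [(s2, t + 1) for s2 in raw_transitions[s]]

    print(augmented_children(("u", 0)))   # [('v', 1)]
    print(augmented_children(("v", 1)))   # [('u', 2), ('w', 2)]; never ('u', 0)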
3.3.2 Stochastic Rewards

We also consider the setting in which the given reward is stochastic rather than being a deterministic function of the state, yielding training procedures based on stochastic gradient descent. For example, with the trajectory-balance loss of Eq. (41), if R(s) is stochastic (even given s), we can think of what is really being optimized as the squared loss with log R(s) replaced by its expectation (given s). This is a straightforward consequence of minimizing the expected value of a squared error loss (as, for example, in neural networks trained with a squared error loss and a stochastic target output, where the neural network effectively tries to estimate the expected value of that target).

3.3.3 GFlow Nets can be trained offline

As discussed in Sec. 3.2, we do not need to train a GFlow Net using samples from its own trajectory distribution P̂ = Π(o). The training trajectories can be drawn from any training distribution πT with full support, as already shown by Bengio et al. (2021). This means that a GFlow Net can be trained offline, as in offline reinforcement learning (Ernst et al., 2005; Riedmiller, 2005; Lange et al., 2012). It should also be noted that with a proper adaptive choice of πT, and assuming that computing R is cheaper than or comparable in cost to running the GFlow Net on a trajectory, it should be more efficient to continuously draw new training samples from πT than to rehearse the same trajectories multiple times. An exception would be rehearsing the trajectories leading to high rewards if these are rare.

How should one choose the training distribution πT? It needs to cover the support of R, but if it were uniform it would be very wasteful, and if it were equal to the current GFlow Net policy π it might not have sufficient effective support and thus miss modes of R, i.e., regions where R(x) is substantially greater than 0 but PT(x) is close to 0. Hence the training distribution should be sampled from an exploratory policy that visits places that have not been visited yet and may have a high reward. High epistemic uncertainty around the current policy would make sense, and the literature on acquisition functions for Bayesian optimization (Srinivas et al., 2010) may be a good guide. More generally, this means the training distribution should be adaptive. For example, πT could be the policy of a second GFlow Net trained mostly to match a different reward function that is high when the losses observed by the main GFlow Net are large. It would also be good to regularly visit those trajectories corresponding to known large R, i.e., according to samples from π, to make sure those are not forgotten, even temporarily.

3.4 Exploiting Data as Known Terminating States

In some applications we may have access to a dataset of (s, R(s)) pairs and we would like to use them in a purely offline way to train a GFlow Net, or we may want to combine such data with queries of the reward function R to train the GFlow Net. For example, the dataset may contain examples of some of the high-reward terminating states s which would be difficult to obtain by sampling from a randomly initialized GFlow Net. How can we compute a gradient update for the GFlow Net parameters using such (s, R(s)) pairs?
If we choose to parametrize the backward transition probabilities PB (which is necessary for implementing the detailed-balance loss), then we can just sample a trajectory τ leading to s using PB and use these trajectories to update the flows and forward transition probabilities along the traversed transitions. However, this alone is not guaranteed to produce the correct GFlow Net sampling distribution, because the empirical distribution over training trajectories τ defined as above does not have full support. Suppose for example that the dataset only contains high-reward terminating states with R(s) = 1. The GFlow Net could then just sample trajectories uniformly (which would be wrong: we would like the probability of most states not in the training set to be very small). On the other hand, if we combine the distribution of trajectories leading to terminal transitions in the dataset with a training distribution whose support covers all possible trajectories, then the offline property of GFlow Nets guarantees that we can recover a flow-matching model.

4. Conditional Flows and Free Energies

A remarkable property of flow networks is that we can recover the normalizing constant Z from the initial state flow F(s0) (Prop. 10). Z also gives us the partition function associated with a given terminal reward function R specifying the terminating flows. What about internal states s with s0 < s < sf? If we had something like a normalizing constant for only the terminating flows achievable from s, we would be able to obtain a form of marginalization given state s, i.e., a conditional probability for terminating states s′ ≥ s, given s. Naturally, one could ask: does the flow F(s) through state s give us that kind of marginalization over only the downstream terminating flows? Unfortunately, in general the answer to this question is no, as illustrated in Fig. 5: in this example F(s2) = 4, whereas the sum of terminating flows achievable from s2 is 6 (the terminating states reachable from s2 are {s5, s6, s7}). The discrepancy is caused by the flow through (s0, s1, s5), which contributes to the terminating flow F(s5→sf) but not to F(s2), since there is no order relation between s1 and s2.

Figure 5: Example of a state-conditional flow network. (a) The original (Markovian) flow network. (b) The subgraph of states reachable from s2; there is a flow through (s0, s1, s5) that contributes to F(s5→sf) but not to F(s2), showing that F(s2) does not marginalize the rewards of its descendants. (c) State-conditional flow network Fs2, which differs from the original flow F on the subgraph, but satisfies the desired marginalization property.

In Sec. 5.3, we show how GFlow Nets applied to sampling sets of random variables can be used to estimate the marginal probability of the values given to a subset of the variables. It requires computing the kind of intractable sum discussed above (over the rewards associated with all the descendants of a state s, with s corresponding to such a subset of variables and a descendant to a full specification of all the variables). That motivates the following definition:

Definition 27 Given a pointed DAG G = (S, A), the corresponding partial order denoted by ≤, and a function E : S → R, called the energy function, we define the free energy 𝓕(s) of a state s as:

e^{−𝓕(s)} := ∑_{s′ : s ≤ s′} e^{−E(s′)}.  (44)
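As a toy numerical illustration of Def. 27 (Eq. (44)), assuming an arbitrary illustrative energy over the subsets of a two-element universe ordered by inclusion:

    import math
    from itertools import combinations

    U = ["a", "b"]
    states = [frozenset(c) for r in range(len(U) + 1) for c in combinations(U, r)]

    def energy(s):               # an arbitrary illustrative energy function
        return float(len(s))

    def free_energy(s):
        """F(s) = -log sum_{s' >= s} exp(-E(s')), Eq. (44), with the partial
        order >= instantiated as the superset relation."""
        total = sum(math.exp(-energy(sp)) for sp in states if s <= sp)
        return -math.log(total)

    print(free_energy(frozenset()))        # marginalizes over every state
    print(free_energy(frozenset({"a"})))   # only over supersets of {a}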
Free energies are generic formulations of the marginalization operation (i.e., summing over a large number of terms) associated with energy functions, and their estimation opens the door to interesting applications where expensive MCMC methods would typically be the main approach otherwise.

4.1 Conditional flow networks

In Sec. 2.2, we defined a flow network as a DAG augmented with some function F over the set of complete trajectories T. We can extend this notion of flow networks by conditioning each component on some information x. In general, this conditioning variable can represent any conditioning information, either external to the flow network (but influencing the terminating flows), or internal (e.g., x can be a property of complete trajectories over another flow network, like passing through a particular state).

Definition 28 Let X be a set of conditioning variables. We consider a family of DAGs Gx = (Sx, Ax) indexed by x ∈ X, along with a family of initial and terminal states denoted by (s0 | x) ∈ Sx and (sf | x) ∈ Sx respectively. For each DAG Gx, we denote by Tx the set of complete trajectories in Gx, and we denote by T their union: T = ∪_{x ∈ X} Tx. A conditional flow network is the specification of X, the family {Gx, x ∈ X}, along with a conditional flow function F, i.e., a function F : X × T → R+ such that F(x, τ) = 0 if τ ∉ Tx. For clarity, we will denote, for each x ∈ X, by Fx the function mapping each τ ∈ Tx to F(x, τ).

Similar to Sec. 2.2, Fx induces a measure on the σ-algebra 2^{Tx} for each x. Conditional flow networks effectively represent a family of flow networks, indexed by the value of x. Since conditional flow networks are defined using the same components as an unconditional flow network, they inherit all the properties of flow networks for all DAGs Gx and flow functions Fx. In particular, we can directly extend the notions of probability distribution over flows, state and edge flows, forward and backward transition probabilities (Sec. 2.3), of Markovian flows (Sec. 2.4), and any flow-matching condition (Sec. 2.5) to conditional flows; the only difference is that now every term explicitly depends on the conditioning variable x. In Sec. 4.2 and Sec. 4.3, we will elaborate two important examples of conditional flow networks: flow networks conditioned on external information that changes the reward R(s | x), and state-conditional flow networks that depend on internal information, i.e., previously visited states.

4.2 Reward-conditional flow networks

Definition 29 Let X be a set of conditioning variables. Consider a flow network given by a pointed DAG G = (S, A) and a flow function F. Consider a family R of non-negative functions on Sf: {Rx : Sf → R+, x ∈ X}. A reward-conditional flow network compatible with the family R is a conditional flow network (Def. 28), with Gx = G for every x ∈ X, such that the edge-flow functions induced by the conditional flow function F satisfy:

∀ x ∈ X, ∀ s ∈ Sf,  Fx(s→sf) = Rx(s).

We will use the notations Rx(s) and R(s | x) interchangeably. Note that the definition above implies that all the DAGs of a reward-conditional flow network are identical, and only the terminating flows differ amongst the members of the family.

Example 7 We will see in Sec. 4.4 that we can estimate a conditional flow network using a GFlow Net (Sec. 3), given a reward function R(s | x). In an Energy-Based Model, the model Pθ(s) is associated with a given energy function Eθ(s), parametrized by θ, with Pθ(s) ∝ exp(−Eθ(s)). This model can be parametrized using a reward-conditional flow network, conditioned on θ, with the reward function R(s | θ) = exp(−Eθ(s)).
We show in Sec. 4.5 how to use such a conditional flow network to learn an Energy-Based Model.

4.3 State-conditional flow networks

Definition 30 Consider a flow network given by a DAG G = (S, A) and a flow function F. For each state s ∈ S, let Gs be the subgraph of G containing all the states s′ such that s ≤ s′. A state-conditional flow network is given by the family {Gs, s ∈ S}, along with a conditional flow function F : S × T → R+, where T = ∪_{s ∈ S} Ts and Ts is the set of complete trajectories in Gs, that satisfies:

∀ s′ ≥ s,  Fs(s′→sf) = F(s′→sf).  (45)

Note that in the definition above, we abused the notation F to refer to both flow functions and edge flow functions, but also used Fs to refer to the conditional flow function (or the corresponding edge flow function) τ ↦ F(s, τ). Unlike the reward-conditional flow networks defined in Sec. 4.2, the structure of the DAG in a state-conditional flow network depends on the anchor state s. In particular, this means that the initial state (s0 | s) = s changes, but the final state (sf | s) = sf remains unchanged, for any state s. Since the definition of a state-conditional flow network depends on an original flow network, we must ensure that this definition is indeed correct, i.e., that such a state-conditional flow network satisfying the conditions in Eq. (45) exists.

Proposition 31 For any flow network given by a DAG G = (S, A) and a flow F, we can define a state-conditional flow network as per Def. 30.

Proof Let s ∈ S be a state. Since the structure of the DAG Gs is clearly well-defined, we just need to show that there exists a flow function Fs : Ts → R+ that satisfies Eq. (45). If such a function exists for every s ∈ S, then it suffices to define the conditional flow function F : S × T → R+ as:

F(s, τ) = Fs(τ) if τ ∈ Ts, and 0 otherwise.

Let As′|s be the set of complete trajectories in Ts terminating in s′ ≥ s; the condition in Eq. (45) then reads:

Fs(s′→sf) = Fs(As′|s) = ∑_{τ ∈ As′|s} Fs(τ) = F(s′→sf).  (46)

Note that in Eq. (46), F(s′→sf) is a given quantity because the flow F is known. Since the sets of trajectories {As′|s, s′ ≥ s} form a partition of all the complete trajectories Ts, Eq. (46) is a system of linear equations whose unknowns are Fs(τ) for all τ ∈ Ts, where each equation involves separate sets of unknowns. Therefore there exists at least one solution Fs(τ) of this system. We can construct such a solution in the following way. For some τ ∈ Ts, we can first select the complete trajectories τ̄ ∈ T that contain τ: Cτ = {τ̄ ∈ T : τ ⊆ τ̄}. The key difference between the DAG G and the subgraph Gs, though, is that G may contain trajectories that terminate in some s′ ≥ s but do not pass through s, and those are therefore not covered by the trajectories of Gs. Let Us′|s be the set of such complete trajectories of G, defined as Us′|s = {τ̄ ∈ T : τ̄ terminates in s′ ≥ s and s ∉ τ̄}. For all τ ∈ Ts such that τ terminates in some s′ ≥ s, we can therefore construct the flow Fs(τ) as

Fs(τ) := F(Cτ) + (1/n) F(Us′|s),

where n = |As′|s| is the number of trajectories τ ∈ Ts that terminate in s′. It is easy to verify that Fs(τ) is a solution of Eq. (46).

While we saw in Prop. 10 that the initial flow F(s0) is equal to the partition function, the initial state-conditional flow also benefits from a marginalizing property, and is now related to the free energy at s.
Proposition 32 Given a state-conditional flow network (G, F) as in Def. 30, for any state s, the initial flow of the state-conditional flow network corresponds to marginalizing the terminating flows F(s′→sf) for s′ ≥ s:

Fs(s0 | s) = Fs(s) = ∑_{s′ : s ≤ s′} F(s′→sf) = exp(−𝓕(s)),

where 𝓕(s) is the free energy associated with the energy function E(s′) = −log F(s′→sf).

Proof This is a direct consequence of Prop. 10, applied to the state-conditional flow function Fs, along with Def. 27:

Fs(s0 | s) = ∑_{τ ∈ Ts} Fs(τ) = ∑_{s′ : s ≤ s′} Fs(s′→sf) = ∑_{s′ : s ≤ s′} F(s′→sf) = ∑_{s′ : s ≤ s′} e^{−E(s′)} = e^{−𝓕(s)}.

Note that the definition of state-conditional flow networks is consistent with our original definition of (unconditional) flow networks in Section 2.1, in the sense that the original flow network is a valid state-conditional flow network anchored at the initial state. Another quantity of interest that state-conditional flow networks allow us to evaluate is the probability of terminating a trajectory in a state s′, among the trajectories passing through an earlier state s ≤ s′:

Corollary 33 Consider a flow network given by a DAG G = (S, A) and a flow F, from which we define any state-conditional flow network, as per Def. 30. Given a state s, the flow function Fs induces a probability distribution over {s′ ∈ Sf : s ≤ s′} ⊆ Sf, which we denote by PT(· | s). Under this measure, the probability of terminating a trajectory in Gs in a state s′ (i.e., the last edge of the trajectory is s′→sf) is:

PT(s′ | s) = 1_{s ≤ s′} e^{−E(s′)+𝓕(s)},  (47)

where E is the energy function mapping each state s″ that is a parent of sf to −log F(s″→sf), and 𝓕 is the corresponding free energy function.

Proof Because Fs is a flow function, Def. 11 and Prop. 10 tell us that:

PT(s′ | s) = Fs(s′→sf) / Fs(s) if s ≤ s′, and 0 otherwise.

Combining this with Prop. 32 and Eq. (45), we obtain for s ≤ s′:

PT(s′ | s) = 1_{s ≤ s′} F(s′→sf) e^{𝓕(s)} = 1_{s ≤ s′} e^{−E(s′)+𝓕(s)}.

4.4 Conditional GFlow Nets

Similar to the way we used a GFlow Net to estimate the flow of a flow network, we can also use a (conditional) GFlow Net in order to estimate a conditional flow network, with given target reward functions. A conditional GFlow Net follows the construction presented in Section 3, with the exception that all quantities to be learned now depend on the conditioning variable x ∈ X (e.g., x is an additional input of the neural network). All parametrizations and losses presented in Sec. 3.2 could in principle be used to train a conditional GFlow Net, regardless of the conditioning set. Below we discuss yet another loss, first presented in Deleu et al. (2022), that could be used to train both GFlow Nets and conditional GFlow Nets.

Example 8 Given a family of DAGs Gx and reward functions Rx indexed by x ∈ X, where each state s ∈ Gx is terminating (i.e., is a parent of sf), and following Exs. 2 and 3, we consider a parametrization given by the forward and backward transition probabilities OxP = Ox2 × Ox3, where Ox2 (resp. Ox3) is the set of forward (resp. backward) probability functions P̂F (resp. P̂B) consistent with Gx for every x ∈ X, and (ΠxP, HxP) defined as in Ex. 3. Each (OxP, ΠxP, HxP) is a flow parametrization of (Gx, Rx), which can be trained with an edge-decomposable flow-matching loss, as proved in Deleu et al. (2022), defined for every s→s′ ∈ A \ Af as:

LDB(P̂F, P̂B, s→s′, x) = (log [Rx(s′) P̂B(s | s′, x) P̂F(sf | s, x) / (Rx(s) P̂F(s′ | s, x) P̂F(sf | s′, x))])².

4.5 Training Energy-Based Models with a GFlow Net

A GFlow Net can be trained to convert an energy function into an approximate corresponding sampler. Thus, it can be used as an alternative to MCMC sampling (Sec. 3.1).
Consider the model Pθ(s) associated with a given parametrized energy function Eθ(s) with parameters θ: Pθ(s) = e^{−Eθ(s)} / Z. Sampling from Pθ(s) can be approximated by sampling from the terminating probability distribution PT(s) of a GFlow Net trained with target terminal reward R(s) = e^{−Eθ(s)} (see Eq. (34)). In practice, P̂T would be an estimator of the true Pθ because the GFlow Net training objective is not zeroed (insufficient capacity or finite training time). The GFlow Net samples drawn according to P̂T could then be used to obtain a stochastic gradient estimator for the negative log-likelihood of observed data x with respect to the parameters θ of the energy function Eθ:

∇θ (−log Pθ(x)) = ∇θ Eθ(x) − ∑_s Pθ(s) ∇θ Eθ(s).

An approximate stochastic estimator of the second term can thus be obtained by sampling one or more terminating states s ∼ P̂T(s), i.e., from the trained GFlow Net's sampler. Furthermore, if the GFlow Net's loss is 0, i.e., P̂T = Pθ, the gradient estimator would be unbiased. One could thus potentially jointly train an energy function Eθ and a corresponding GFlow Net by alternating updates of θ using the above equation (with sampling from Pθ replaced by sampling from P̂T) and updates of the GFlow Net using the updated energy function as the target terminal reward. If we fix F̂(s→sf) = R(s) by construction (which we can do if the reward function is deterministic), then we can parametrize the energy function with the same neural network that computes the flow, since E(s) = −log R(s) = −log F̂(s→sf). Hence the same parameters are used for the energy function and for the GFlow Net, which is appealing.

The above strategy for jointly learning an energy function and how to sample from it could be generalized to learning conditional distributions by using a conditional GFlow Net instead. Let x be an observed random variable and h a hidden variable, with the GFlow Net generating the pair (x, h) in two sub-trajectories: either first generate x and then generate h given x, or first generate h and then generate x given h. This can be achieved by introducing a 6-valued component u in the state to make sure that both h and x are generated before exiting into sf, with the following values and constraints:

s = s0 ⟺ u = 0  (49)
s = sf ⟺ u = 5  (50)
(ut → ut+1) ∈ {0→1, 0→2, 1→1, 2→2, 1→3, 2→4, 3→3, 4→4, 3→5, 4→5}  (51)

where u = 1 indicates that x is being generated (before h), u = 2 indicates that h is being generated (before x), u = 3 indicates that h is being generated (conditioned on x), and u = 4 indicates that x is being generated (given h). The GFlow Net cannot reach the final state sf until both x and h have been generated. The conditional GFlow Net can thus approximately sample PT(x), PT(h | x), PT(h), PT(x | h) as well as PT(x, h). If we only want to sample x (or only h), we allow exiting as soon as x (resp. h) is generated. See Sec. 5.3 for a more general discussion on how to represent, estimate and sample marginal distributions. Let us denote by Pθ(x, h) ∝ e^{−Eθ(x,h)} the joint distribution over (x, h) associated with the energy function, i.e., with F(s→sf). When x is observed but h is not, θ could thus be updated by approximating the marginal log-likelihood gradient

∇θ (−log Pθ(x)) ≈ ∑_h PT(h | x) ∇θ Eθ(x, h) − ∑_{x′,h} PT(x′, h) ∇θ Eθ(x′, h),

using samples from the estimated terminal sampling probabilities P̂T of a trained GFlow Net to approximate the above sums in a stochastic gradient fashion (using one or a batch of samples).
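The basic alternating scheme described at the start of this section can be sketched as follows. This is a schematic, not the paper's implementation: energy, grad_energy, sample_gflownet and gflownet_step are hypothetical callables standing in for the parametrized EBM, its gradient, the trained sampler P̂T, and one GFlow Net training update.

    import math

    def nll_gradient_estimate(theta, x, grad_energy, sample_gflownet):
        # One-sample stochastic estimate of the NLL gradient of the EBM:
        # grad E_theta(x) on data (positive phase) minus grad E_theta(s) at a
        # GFlow Net sample s ~ P_T_hat standing in for s ~ P_theta (negative phase).
        s = sample_gflownet()
        gx, gs = grad_energy(theta, x), grad_energy(theta, s)
        return [a - b for a, b in zip(gx, gs)]

    def train_jointly(theta, data, energy, grad_energy, sample_gflownet,
                      gflownet_step, lr=1e-3, inner_steps=10):
        # Outer loop: update the energy parameters theta on data. Inner loop:
        # retrain the GFlow Net toward the updated reward R(s) = exp(-E_theta(s)).
        for x in data:
            g = nll_gradient_estimate(theta, x, grad_energy, sample_gflownet)
            theta = [t - lr * gi for t, gi in zip(theta, g)]
            for _ in range(inner_steps):
                gflownet_step(lambda s: math.exp(-energy(theta, s)))
        return theta

The number of inner steps per outer step is exactly the open question raised below.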
Note how we now have outer-loop updates (of the energy function, i.e., the reward function) from actual data, and inner-loop updates (of the GFlow Net) using the energy function as a driving target for the GFlow Net. How many inner-loop updates are necessary for such a scheme to work is an interesting open question, but it most likely depends on the form of the underlying data-generating distribution. If the work on GANs (Goodfellow et al., 2014) is a good analogy, a good strategy may be to interleave updates of the energy function (as minus the log-terminal flow of a GFlow Net) based on a batch of data, and updates of the GFlow Net as a sampler based on both these samples (trajectories can be sampled backwards from a terminating state s using PB) and forward samples from the tempered training policy πT defined by the forward transition probabilities of the GFlow Net.

4.6 Active Learning with a GFlow Net

An interesting variant of the above scheme is one where the GFlow Net sampler is used not just to produce negative examples for the energy function but also to actively explore the environment. Jain et al. (2022) use an active learning scheme where the GFlow Net is used to sample candidates x for which we expect the reward R(x) to be generally large (since the GFlow Net approximately samples proportionally to R(x)). The challenge is that evaluating the true reward R for any x is computationally expensive and can potentially be noisy (for example, a biological assay to measure the binding energy of a drug to a given target protein). Thus, instead of using the true reward directly, the authors introduce a proxy f̂ (which approximates the true reward function f) and use it to train the GFlow Net. This leads to a setup similar to Sec. 4.5, with an inner loop where a GFlow Net is trained to match the proxy f̂, and an outer loop where the proxy f̂ is learned in a supervised fashion using (x, y) pairs, where x is proposed by the GFlow Net and y is the corresponding true reward from the environment (for example, the outcome of a biological or chemical assay). It is important to note here that the GFlow Net and the proxy are intricately linked, since the coverage of the proxy f̂ over the domain of x relies on diverse candidates from the GFlow Net; and similarly, since the GFlow Net matches a reward distribution defined by the proxy reward function f̂, it also depends on the quality of the proxy. This setup can be further extended by incorporating information about how novel a given candidate is, or how much epistemic uncertainty, u(x, f), there is in the prediction of f̂. We can use acquisition function heuristics (like Upper Confidence Bound (UCB) or Expected Improvement (EI)) from Bayesian optimization (Močkus, 1975; Srinivas et al., 2010) to combine the predicted usefulness f̂(x) of configuration x with an estimate of the epistemic uncertainty around that prediction. Using this as the reward allows the GFlow Net to explore areas where the predicted usefulness is high (f̂(x) is large) and at the same time explore areas where there is more information to be gathered about useful configurations of x. The uncertainty over the predictions of f̂, with an appropriate acquisition function, can provide more control over the exploratory behaviour of GFlow Nets.
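A schematic of one round of this active-learning loop, under stated assumptions: fit_proxy, train_gflownet, sample_candidates and true_reward are hypothetical stand-ins for the proxy learner, the inner-loop GFlow Net training, its sampler, and the expensive assay, and the UCB-style acquisition used as the GFlow Net reward is one possible choice among those mentioned above.

    def active_learning_round(dataset, fit_proxy, train_gflownet,
                              sample_candidates, true_reward,
                              batch_size=16, kappa=1.0):
        proxy = fit_proxy(dataset)            # supervised fit on (x, y) pairs

        def reward(x):
            # UCB-like acquisition: predicted usefulness plus a bonus for
            # epistemic uncertainty, clipped to stay a non-negative reward.
            return max(proxy.mean(x) + kappa * proxy.std(x), 0.0)

        gfn = train_gflownet(reward)          # inner loop: match the acquisition
        for x in sample_candidates(gfn, batch_size):
            dataset.append((x, true_reward(x)))   # expensive environment query
        return dataset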
As discussed by Bengio et al. (2021) when comparing GFlow Nets with return-maximizing reinforcement learning methods, an interesting property of sufficiently trained GFlow Nets is that they will sample from all the modes of the reward function, which is particularly desirable in a setting where exploration is required, as in active learning. The experiments in that paper also demonstrate this advantage experimentally in terms of the diversity of the solutions sampled by the GFlow Net compared with PPO, an RL method that had previously been used for generating molecular graphs and that tends to focus on a single mode of the reward function.

4.7 Estimating Entropies, Conditional Entropies and Mutual Information

Definition 34 Given a reward function R with 0 ≤ R(s) < 1 for all s, we define the entropic reward function R′ associated with R as:

R′(s) = −R(s) log R(s).  (53)

In brief, in this section we show that we can estimate entropies by training two GFlow Nets: one that estimates flows as usual for a target terminal reward function R(s), and one that estimates flows for the corresponding entropic reward function. We show below that we obtain an estimator of entropy by looking up the flow in the initial state, and if we do this exercise with conditional flows, we get conditional entropy. Once we have the conditional entropy, we can also estimate the mutual information.

Proposition 35 Consider a flow network (G, F) such that the terminating flows match a given reward function R, i.e., ∀ s ∈ Sf, F(s→sf) = R(s), with R(s) < 1 for all s, and a second flow network (G, F′) with the same pointed DAG, but with a flow function for which the terminating flows match the entropic reward function R′ (Eq. (53)). Then the entropy H[S] associated with the terminating state random variable S ∈ Sf with distribution PT(S = s) = R(s)/Z (Eq. (18)) is

H[S] = −∑_s PT(s) log PT(s) = F′(s0)/F(s0) + log F(s0).  (54)

Proof First apply the definition of PT(s), then Eq. (11) on both flows:

−∑_s PT(s) log PT(s) = −∑_s (R(s)/F(s0)) (log R(s) − log F(s0)) = −(∑_s R(s) log R(s))/F(s0) + log F(s0) = F′(s0)/F(s0) + log F(s0).

Note that we need R(s) < 1 to make sure that the rewards R′(s) (and thus the flows) are positive.

Proposition 36 Given a set X of conditioning variables, consider a conditional flow network defined by a conditional flow function F, for which the terminating flows match a target reward family R (conditioned on x ∈ X) that satisfies Rx(s) < 1 for all s, and a second conditional flow network defined by a conditional flow function F′, for which the terminating flows match the entropic reward functions R′x (Eq. (53)). Then the conditional entropy H[S | x] of random terminating states S ∈ Sf consistent with condition x is given by

H[S | x] = F′(s0 | x)/F(s0 | x) + log F(s0 | x).  (55)

In particular, for a state-conditional GFlow Net (X = S is the state space of the DAG), we obtain

H[S | s] = F′(s | s)/F(s | s) + log F(s | s).  (56)

More generally, the mutual information MI(S; X) between the random draw of a terminating state S = s according to PT(s | x) and the conditioning random variable X is

MI(S; X) = H[S] − E_X[H[S | X]] = F′(s0)/F(s0) + log F(s0) − E_X[F′(s0 | X)/F(s0 | X) + log F(s0 | X)],  (57)

where F(s) and F′(s) indicate the unconditional flows (trained with no condition x given), while F(s | x) and F′(s | x) are their conditioned counterparts.

Proof The proof of Eq. (55) follows from the fact that each (Gx, Fx) is a flow network, to which we can apply Prop. 35. Eq. (57) is a direct consequence of the definition of the mutual information, Eq. (54) and Eq. (55). If we have a sampling mechanism for P(X), we can thus approximate the expectation in Eq. (57) by a Monte-Carlo average with draws from P(X).
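Eq. (54) can be checked numerically on any enumerable reward. In the sketch below (with illustrative reward values, all below 1 as required), the two initial flows are computed exactly as sums of the corresponding terminating flows, so the identity holds up to floating-point error:

    import math

    R = {"s1": 0.2, "s2": 0.5, "s3": 0.1}          # illustrative rewards, all < 1

    F0 = sum(R.values())                            # initial flow = Z
    F0_ent = sum(-r * math.log(r) for r in R.values())  # flow for R'(s) = -R log R

    flow_entropy = F0_ent / F0 + math.log(F0)       # Eq. (54)
    direct = -sum((r / F0) * math.log(r / F0) for r in R.values())
    assert abs(flow_entropy - direct) < 1e-12
    print(flow_entropy, direct)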
5. GFlow Nets on Sets, Graphs, and to Marginalize Joint Distributions

5.1 Set GFlow Nets

We first define an action space for constructing sets, and we view the GFlow Net as a means to generate a random set S and to estimate quantities like probabilities, conditional probabilities or marginal probabilities for realizations of this random variable. The elements of those sets are taken from a larger universe set U.

Definition 37 Given a universe set U, consider the pointed DAG G = (S, A), where S := 2^U ∪ {sf} is the set of all subsets of U with an additional state sf, s0 = {} is the empty set, and for any two subsets s, s′ of U, s→s′ ∈ A if and only if there exists a ∈ U \ s such that s′ = s ∪ {a}; meaning that each transition in the DAG corresponds to adding one element of U to the current subset. Additionally, all subsets are connected to sf, i.e., ∀ s ∈ S, s→sf ∈ A. A set flow network is a flow network on this graph G, and a set GFlow Net is an estimator of such a flow network, as defined in Sec. 3. The target terminal reward function R : s ↦ F(s→sf) satisfies:

Z = ∑_{s ∈ 2^U} R(s) < ∞.  (58)

A set flow network defines a terminating probability distribution PT on states (see Def. 13 and Eq. (44)), with

PT(s) = e^{−E(s)+𝓕(s0)} = F(s→sf)/F(s0),  (59)

where E represents the energy −log R. Similarly, Cor. 33 provides us with a formula for the conditional probability of a given superset s′ of a given set s under PT:

PT(s′ | s ≤ s′) := e^{−E(s′)+𝓕(s)} = F(s′→sf)/F(s | s),  (60)

where 𝓕 indicates the free energy (see Def. 27). Remember that with a GFlow Net with state and edge flow estimator F̂, it is not guaranteed that F̂(s→sf) = R(s) for all terminating states, so we could estimate probabilities with

P̂T(s) = F̂(s→sf)/F̂(s0)  (61)

or alternatively

P̂T(s) = R(s)/F̂(s0).  (62)

Similarly, we can estimate conditional superset probabilities with Eq. (60) or with

P̂T(s′ | s ≤ s′) = R(s′)/F̂(s | s),  (63)

none of which are guaranteed to exactly sum to 1. We can also compute the marginal probability over all supersets of a given set s, as shown below.

Proposition 38 Following the notation of Def. 37, let S(s) = {s′ : s ⊆ s′} be the set of all supersets of a set s. The probability of drawing any element from S(s) given a set flow network is

PT(S(s)) = ∑_{s′ ⊇ s} PT(s′) = e^{−𝓕(s)}/Z = F(s | s)/F(s0).  (64)

Proof We can rewrite the sum as follows, first applying the definition of PT (Eq. (18)), then Prop. 32:

∑_{s′ ⊇ s} PT(s′) = ∑_{s′ ⊇ s} e^{−E(s′)}/Z = e^{−𝓕(s)}/Z = F(s | s)/F(s0),

where we notice that for states that are sets, the order relationship s ≤ s′ is equivalent to the subset relationship s ⊆ s′.

To summarize, a GFlow Net which is trained to match a given energy function (i.e., the derived rewards) over sets can be used to represent that distribution, sample from it, estimate the probability of a set under it, estimate the partition function, search for the lowest-energy set, sample a conditional distribution over supersets of a given set, estimate that conditional distribution for a given pair of set and superset, compute the marginal probability of a subset (i.e., summing over the probabilities of the supersets), and compute the entropy of the set distribution or of the conditional distribution of supersets of a set. For example, using a GFlow Net to model distributions over sets has been successfully applied in Malik et al. (2023), where the goal is to iteratively select an informative batch of points for efficient active learning. The authors additionally show that the distribution over sets defined by a mutual-information-based reward (Sec. 4.7) can be approximated to a satisfying degree using a neural network as a function approximator for the GFlow Net policy.
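A minimal sketch of the action space of Def. 37, with an illustrative three-element universe: each action adds one missing element to the current subset, or exits to sf.

    U = {"a", "b", "c"}

    def actions(s):
        # Legal transitions out of state s (a frozenset): add any single
        # element of U \ s, or take the exit action to the sink state sf.
        adds = [frozenset(s | {a}) for a in U - s]
        return adds + ["sf"]

    s0 = frozenset()
    print(actions(s0))                           # three one-element sets, plus 'sf'
    print(actions(frozenset({"a", "b", "c"})))   # only the exit action remains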
5.2 GFlow Net on Graphs

A graph is a special kind of set in which there are two kinds of elements: nodes and edges, with edges being pairs of node indices. Graphs may also have content attached to nodes and/or edges. The set operations described in the previous section can thus be specialized accordingly. Some actions (i.e., edges in the GFlow Net DAG) could insert a node while other actions could insert an edge. The set of allowable actions can be limited, for example to make sure the graph has a single connected component, or to ensure acyclicity. As for sets in general, one cannot have an action which adds a node or edge that is already in the set. Deleu et al. (2022) used a GFlow Net over DAGs in order to learn an approximation of the posterior distribution over the graphical structure of a Bayesian Network. While the GFlow Net was learned by minimizing a loss derived from flow-matching conditions as in Sec. 3.2, they showed that the resulting distribution is an accurate approximation of the target posterior. Since graphs are sets, all the GFlow Net operations on sets can be applied to graphs.

5.3 Marginalizing over Missing Variables

The ability of GFlow Nets to capture probability distributions over sets can be applied to modeling the joint distribution over random variables, to calculating marginal probabilities over given subsets of variable values, and to sampling or computing probabilities for any conditional (e.g., for a subset of variables given another subset of variables). Let X = (X1, X2, . . . , Xn) be a composite random variable with n element random variables Xi, 1 ≤ i ≤ n, each with possible values xi ∈ Xi (not necessarily numbers). If we are given an energy function or a terminal reward function R(x) to score any instance X = x, we can train a particular kind of set GFlow Net for which the set elements are pairs (i, xi), with at most one element in the set with index i, for each i. The only allowed terminating transitions are when the set has exactly size n, and every size-n set s terminates on the next transition. Note how that GFlow Net can sample an X in any possible order, if PB allows that order. Given an existing set of (i, xi) pairs (represented by a state s = {(i, xi)}i), we can estimate the marginal probability of that subset of variables (implicitly summing over all the missing ones, see Eq. (64)). We can sample the other variables by setting the state at s and continuing to sample from the GFlow Net's policy (the learned forward transition probability function). We can sample from a chosen subset S′ of the other variables by constraining that policy to only add elements which are in S′. In addition, we can do all the other things that are feasible on set GFlow Nets, such as estimating the partition function, sampling in an order which prefers the early subsequences with the largest marginal probability, searching for the most probable configuration of variables, or estimating the entropy of the distribution.

5.4 Modular Energy Function Decomposition

Let us see how we can apply the graph GFlow Net framework to a special kind of graph: a factor graph (Kschischang et al., 2001) with reusable factors.
This will yield a distribution PT(g) over graphs g, each of which is associated with an energy function value E(g) (and associated reward R(g)). Energy-based models are convenient because they can decompose a joint probability into independent pieces (possibly corresponding to independent mechanisms, Schölkopf et al., 2012; Goyal et al., 2019; Goyal and Bengio, 2020), each corresponding to a factor of a factor graph. In our case, we would like a shared set of factors F to be reusable across many factor graphs g. The factor graph will provide an energy and a probability over a set of random variables V. Let the graph g = {(F^i, v^i)}_i be written as a set of pieces (F^i, v^i), where F^i ∈ F is the index of a factor with energy function term E_{F^i}, selected from the pool F of possible factors, and where v^i = (v1, v2, . . .) is a list of realizations vj of the random variables Vj ∈ V, each Vj being a node of the factor graph. That list defines the edges of the factor graph connecting variable Vj with the j-th argument of E_{F^i}. Let us denote by E_{F^i}(v^i) the value of this energy function term E_{F^i} applied to those values v^i, i.e.,

E_{F^i}(v^i) = E_{F^i}(v1, v2, . . .).  (65)

The total energy function of such a graph can then be decomposed as follows:

E(g) = ∑_i E_{F^i}(v^i).  (66)

What is interesting with this construction is that the graph GFlow Net can now sample a graph g, possibly given some conditioning observations x: see Sec. 4.5 on how GFlow Nets can be trained jointly with an energy function, including the case where only some random variables are observed. Hence, given some observed variables (not necessarily always the same), the graph GFlow Net can sample a latent factor graph containing and connecting (with energy function terms) both observed and latent random variables, and whose structure defines an energy function over values of the joint observed and latent variables. Not only can we use the compositional nature of the objects generated by a GFlow Net to decompose the total energy into reusable energy terms corresponding to ideally independent mechanisms, but we can also decompose the GFlow Net itself into modules associated with each mechanism. The action space of this graph GFlow Net is fairly complex, with each action corresponding to the addition of a latent variable Vk or the addition of a graph piece (F^i, v^i). Such an action is taken in the context of the state of the GFlow Net, which is a partially constructed graph (arising from the previous actions). The GFlow Net and its associated energy function parameters are thus decomposed into modules. Each module knows how to compute an energy function term E_{F^i} and how to score and sample competitively (against the other modules) a new graph piece (to insert the corresponding factor in the graph). Consider some observed variables (a subset of the Vj's with their values vj), collectively denoted x. Consider a graph g among those compatible with x (i.e., with some nodes corresponding to Vj = vj for the observed variables) and denote by h the specification of the part of g not already provided by x. We can think of the latent variable h as the explanation for the observed x. Note how, by marginalizing over all the possible h, we can compute the free energy of x. The principles of Sec. 4.5 can be applied to train such an energy-based GFlow Net. It also makes sense to represent a prior over graph structures in the energy function.
For example, we may prefer sparse factors (with few arguments), and we may introduce soft or hard constraints having to do with a notion of type, as commonly used in computer programming and in natural language. Each random variable in the graph can have a type as one of its attributes, and each factor energy function argument can expect a type. Energy function terms can be added to construct this prior by favouring graph pieces (F^i, v^i = (vk)_k) in which the type of variable Vk (of which vk is a realization) matches the type expected by the k-th argument of F^i. This is very similar to attention mechanisms (Bahdanau et al., 2014; Vaswani et al., 2017), which can be seen to match a query (an expected type) with a key (the type associated with an element). For instance, Pan et al. (2023) propose the Forward-Looking GFlow Net framework, which exploits the modularity of the energy function, as in Eq. (66), and obtains a better approximation of the target distribution than those resulting from regular losses.

6. Continuous or Hybrid Actions and States

All of the mathematical developments above have used sums over states or actions, with the idea that these would be elements of a discrete space. However, for the most part one can replace these sums by integrals in case the states or actions are either continuous or hybrid (with some discrete components and some continuous components). Beyond this, we discuss below what the presence of continuous-valued actions and states changes in the GFlow Net framework. Although there are explicit sums over successors and predecessors, respectively, which come up in Eq. (40), such sums are also hiding in the detailed balance constraint of Eq. (26). Indeed, these sums are implicit as part of the normalizing constant in the conditional density of the next state or previous state in PF(st+1 | st) and PB(st | st+1). We consider below ideas to deal with this challenge.

6.1 Integrable Normalization Constants

We first note that if we can handle a continuous state, we can also handle a hybrid state, as follows. Let the state be decomposed into

s = (s^i, s^x),  (67)

where s^i is discrete and s^x is continuous. Then we can decompose any of the transition conditionals as follows:

PF(st+1 | st) = P(s^x_{t+1} | s^i_{t+1}, st) P(s^i_{t+1} | st).  (68)

We note that this is formally equivalent to decomposing the transition into two transitions: first the discrete choice about the next state, and second the continuous choice about the next state (given the discrete choice). Having continuous-valued inputs to a neural net is no problem. The challenge is to represent continuous densities on the output, with the need both to be able to compute the density of a particular value (say P(s^x_{t+1} | s^i_{t+1}, st)) and to be able to sample from it. Computing categorical probabilities and sampling from a conditional categorical is standard fare, so we only discuss the continuous conditional. One possibility is to parametrize s^x_{t+1} | s^i_{t+1}, st with a density for which the normalization constant is a known tractable integral, like the Gaussian. However, that may limit capacity too much, and may prevent a good minimization of the detailed balance or flow-matching loss. One workaround is to augment the discrete part of the state s^i with extra dimensions corresponding to "cluster IDs", i.e., to partition the continuous density into a mixture. We know that with enough mixture components, we can approximate densities from a very large family arbitrarily well.
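A hedged sketch of the factorized hybrid transition of Eq. (68), using a small Gaussian mixture for the continuous part (one tractable choice among those discussed above); all parameter values and names are illustrative.

    import math, random

    def sample_hybrid_transition(cat_probs, mixtures):
        """cat_probs: P(s_i' | s); mixtures[s_i']: list of (weight, mu, sigma).
        First the discrete choice, then the continuous draw given it."""
        si = random.choices(list(cat_probs), weights=list(cat_probs.values()))[0]
        w, mu, sigma = random.choices(mixtures[si],
                                      weights=[m[0] for m in mixtures[si]])[0]
        return si, random.gauss(mu, sigma)

    def hybrid_log_density(si, sx, cat_probs, mixtures):
        """log P(s_i', s_x' | s) = log P(s_i' | s) + log p(s_x' | s_i', s),
        with the mixture normalization known in closed form."""
        mix = sum(w * math.exp(-0.5 * ((sx - mu) / sigma) ** 2)
                  / (sigma * math.sqrt(2 * math.pi))
                  for w, mu, sigma in mixtures[si])
        return math.log(cat_probs[si]) + math.log(mix)

    cat = {"grow": 0.7, "stop": 0.3}
    mix = {"grow": [(0.5, -1.0, 0.3), (0.5, 1.0, 0.3)], "stop": [(1.0, 0.0, 1.0)]}
    si, sx = sample_hybrid_transition(cat, mix)
    print(si, sx, hybrid_log_density(si, sx, cat, mix))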
Other approaches include modeling the conditional density with an autoregressive or normalizing flow model (Rezende and Mohamed, 2015, with a different meaning of the word flow), or, as in denoising diffusion models (Sohl-Dickstein et al., 2015), decomposing the sampling of s^x into several resampling steps, transforming its distribution from a simple one to a complex one. To guarantee that the detailed balance constraint can be exactly satisfied, we could go further and think about parametrizing the edge flow F((s^i, s^x) → (s′^i, s′^x)), noting that this is the natural parametrization if we use the node-based flow-matching loss. For example, keeping with the Gaussian example, we would now have a joint Gaussian energy in the vector (s^x, s′^x) for each feasible discrete component indexed by (s^i, s′^i). Note that in practical applications of GFlow Nets, not all the transitions that satisfy the order relationship are generally allowed in the GFlow Net's underlying DAG. For example, with set GFlow Nets, the only allowed actions add one element to the set (not an arbitrary number of elements). These constraints on the action space mean that the number of legal (s^i, s′^i) pairs is manageable and corresponds to the number of discrete actions. The overall action is therefore seen as having a discrete part (choosing s′^i given s^i) and a continuous part (choosing s′^x given s′^i, s^i and s^x). With such a joint flow formulation, the forward and backward conditional densities can be computed exactly and be compatible with each other.

6.2 GFlow Nets in GFlow Nets

Another way to implement an edge flow involving continuous variables is to use a lower-level (GFlow Net, energy function) pair to represent its flow and conditional probabilities and to sample from them. Remember that such a pair can be trained following the approach discussed in Sec. 4.5. Instead of a joint Gaussian for (s^x, s′^x) given (s^i, s′^i), we could have a smaller-scale GFlow Net and energy function (representing an edge flow in the outer GFlow Net) to handle a whole family of transitions of a particular type in a larger-scale outer GFlow Net. Imagine that we have a fairly arbitrary energy function for such a transition, with parameters that we will learn. Then we can also train a GFlow Net to sample in either direction (either from st to st+1 or from st+1 to st) and to evaluate the corresponding normalizing constants (and hence the corresponding conditional probabilities). The discrete aspect of the state and of its transition may correspond to a family of transitions (e.g., insert a particular type of node in a graph GFlow Net), and a separate (GFlow Net, energy function) module may be specialized and trained to handle such transitions. While this paper was under review, Lahlou et al. (2023) extended the theory of GFlow Nets to more general state spaces, including continuous ones, and experimentally validated that the usual losses can be used to effectively perform inference in continuous domains.

7. Related Work

There are several classes of related literature that concern the problem of generating a diversity of samples given some energy or reward signal, in particular: generative models (in particular deep learning ones); RL methods that maximize reward with some form of exploratory behavior or smoothness prior; MCMC methods, which solve the problem of sampling from p(x) ∝ f(x) in principle; and evolutionary methods, which can leverage group diversity through iterations over a population of solutions.
In what follows we discuss these and offer insights into similarities and differences between GFlow Nets and these approaches. Note that the literature related to this problem is much larger than we can reference here, and extends to many other subfields of ML, such as GANs (Kumar et al., 2019), VAEs (Kingma and Welling, 2013; Kusner et al., 2017), and normalizing flows (Dinh et al., 2014, 2016; Rezende and Mohamed, 2015). Yet another related type of approach is the family of Bayesian optimization methods (Močkus, 1975; Srinivas et al., 2010), which have also been used for searching in the space of molecules (Griffiths and Hernández-Lobato, 2017). The main relation with Bayesian optimization methods is that GFlow Nets are generative and can thus complement Bayesian optimization methods, which scan a tractable list of candidates. When the search space is too large to separately compute a Bayesian optimization acquisition function score on every candidate, using a generative model is appealing. In addition, GFlow Nets are used to explore the modes of the distribution rather than to search for the single most dominant mode. This difference is similar to that with classical RL methods, discussed further below.

7.1 Contrast with Generative Models

The main difference between GFlow Nets and established deep generative models like VAEs or GANs is that whereas the latter are trained by being provided a finite set of examples sampled from the distribution of interest, a GFlow Net is normally trained by being provided an energy function or a reward function. This reward function tells us not just about the samples that are likely under the distribution of interest (which we can think of as positive examples) but also about those that are unlikely (which we can think of as negative examples) and also about those in between (whose reward is not large but is not zero either). If we think of the maximum likelihood training objective in those terms, it is like a reward function that gives a high reward to every training example (seen as a positive example, where the probability should be high) and a zero reward everywhere else. However, other reward functions are possible, as seen in the application of GFlow Nets to the discovery of new molecules (Bengio et al., 2021), where the reward is not binary and increases monotonically as a function of the value of a desirable property of the candidate molecule. Note however that the difference with other generative modeling approaches blurs when we include the learning of the energy function along with the learning of the GFlow Net sampler, as outlined in Sec. 4.5. In that case, the pair comprising the trainable GFlow Net sampler and the trainable energy function achieves a similar objective as a trainable generative model. Note that GFlow Nets have been designed for generating discrete variable-size compositional structures (like sets or graphs), for both latent and observed variables, whereas GANs, VAEs or normalizing flows start from the point of view of modeling real-valued fixed-size vectors using real-valued fixed-size latent variables. An interesting difference between GFlow Nets and most generative model training frameworks (typically some variation on maximum likelihood) is in the very nature of the training objective for GFlow Nets, which came about in the context of active learning scenarios.
Whereas the GFlow Net training pairs (s, R(s)) can come from any distribution over s (any full-support training policy πT), which does not have to be stationary (and indeed will generally not be, in an active learning setting), the maximum likelihood framework is very sensitive to changes in the distribution of the data it sees. This is connected to the offline learning property of the flow-matching objective (Sec. 3.3.3, among others).

7.2 Contrast with Regularized Reinforcement Learning

The flow-matching loss of GFlow Nets (Bengio et al., 2021) arose from the inspiration of the temporal-difference training objectives (Sutton and Barto, 2018) associated with the Bellman equation. The flow-matching equations are analogous to the Bellman equation in the sense that the training objective is local (in time and states), credit assignment propagates through a bootstrap process, and the training tries to fix the parametrization so that these equations are satisfied, knowing that if they were (everywhere), we would obtain the desired properties. However, these desired properties are different, as elaborated in the next paragraph. The context in which GFlow Nets were developed is also different from the typical way of thinking about agents learning in some environment: we can think of the deterministic environments of GFlow Nets as involving internal actions typically needed by a cognitive agent that needs to perform some kind of inference through a sequence of steps (predict or sample some things given other things), i.e., through actions internal to the agent and controlling its computation. This is in contrast with the origins of RL, focused on the actions of an agent in an external and unknown stochastic environment. GFlow Nets were introduced as a tool for learning an internal policy, similar to the use of attention in modern deep learning, where we know the effect of actions, and the composition of these actions defines an inference machinery for that agent. Classical RL (Sutton and Barto, 2018) control methods work by maximizing return in Markov Decision Processes (MDPs); their focus is on finding the policy π* ∈ argmax_π V^π(s) ∀s maximizing the expected return V^π(s), which happens to provably be achievable with a deterministic policy (Sutton and Barto, 2018), even in stochastic MDPs. In a deterministic MDP, of interest here, this means that training an RL agent is a search for the most rewarding trajectory, or in the case of terminal-reward-only MDPs (again of interest here), the most rewarding terminating state. Another perspective, which emerged out of both the probabilistic inference literature (Toussaint and Storkey, 2006) and the bandits literature (Auer et al., 2002), is concerned with finding policies of the form π(a|s) ∝ f(s, a). It turns out that maximizing both return and entropy (Ziebart et al., 2008) of policies in a control setting yields policies such that

p(τ) ∝ ∏_{t=0}^{T−1} P(st+1 | st, at) exp(η ∑_{t=0}^{T−1} R(st, at)),

where τ = (s0, a0, s1, a1, . . . , sT) and η can be seen as a temperature parameter. This result can also be found under the control-as-inference framework (Haarnoja et al., 2017; Levine, 2018). In deterministic MDPs with terminal rewards and no discounting of future rewards, this simplifies to p(τ) ∝ exp(η ρ(τ)), where ρ is the return.
In recent literature, this entropy maximization (MaxEnt) is often interpreted as a regularization scheme (Nachum et al., 2017), with entropy used either as an intrinsic reward signal or as an explicit regularization objective to be maximized. Another way to understand this scheme is to imagine ourselves in an adversarial bandit setting (Auer et al., 2002) where each arm corresponds to a unique trajectory, drawn with probability proportional to exp(ρ(τ)).

An important distinction between MaxEnt RL and GFlow Nets is that, in the general case, they do not find the same result. A GFlow Net learns a policy such that P_T(s) ∝ R(s), whereas MaxEnt RL (with appropriately chosen temperature and R) learns a policy such that P_T(s) ∝ n(s)R(s), where n(s) is the number of paths in the DAG of all trajectories that lead to s (a proof is provided in Bengio et al., 2021); the sketch above illustrates this overweighting on a toy DAG. An equivalence only exists if the DAG minus the final state s_f is a tree rooted at s_0, a setting which has been found to be useful (Buesing et al., 2019). What this overweighting by a factor n(s) means practically is that states corresponding to longer sequences (which typically have exponentially more paths leading to them) will tend to be sampled much more often (typically exponentially more often) than states corresponding to shorter sequences. Clearly, this breaks the objective of sampling terminating states in proportion to their reward and provides a strong motivation for considering GFlow Nets instead.

Another perspective on maximizing entropy in RL is that one can also maximize the entropy of the stationary distribution over states d^π (Ziebart et al., 2008), rather than the entropy of the policy. In fact, one can show that the objective of training a policy such that P_T(s) ∝ R(s) is equivalent to training a policy that maximizes the reward r(s, a) = log R(s, a) − log d^π(s, a). Unfortunately, computing stationary distributions, although possible (Nachum et al., 2019; Wen et al., 2020), is not always tractable nor precise enough for the purposes of reward regularization.

7.3 Contrast with Monte-Carlo Markov Chain Methods

MCMC has a long and rich history (Metropolis et al., 1953; Hastings, 1970; Andrieu et al., 2003), and is particularly relevant to the present work, since it is also a principled class of methods for sampling from P_T(s) ∝ R(s). MCMC-based methods have already found some amount of success with learned deep neural networks used to drive sampling (Grathwohl et al., 2021; Dai et al., 2020; Xie et al., 2021; Nash and Durkan, 2019; Seff et al., 2019). An important drawback of MCMC is its reliance on iterative sampling (forming the Markov chain, one configuration at a time, each of which is like a terminating state of a GFlow Net): a new configuration is obtained at each step of the chain by making a small stochastic change to the configuration in the previous step. Although these methods guarantee that asymptotically (in the length of the chain) we obtain samples drawn from the correct distribution, there is an important set of distributions for which finite chains are unlikely to provide enough diversity of the modes of the distribution. This is known as the mode-mixing problem (Jasra et al., 2005; Bengio et al., 2013; Pompe et al., 2020): the chances of going from one mode to a neighboring one may become exponentially small (and thus require exponentially long chains) if the modes are separated by a long sequence of low-probability configurations; the sketch below makes this concrete in one dimension.
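A minimal sketch (ours, not from the paper) illustrates the problem: a random-walk Metropolis chain on a well-separated bimodal density almost never crosses the low-probability region between the modes, so a finite chain badly under-represents one of them. The target density, step size, and chain length below are arbitrary choices for illustration:

```python
import math, random

def log_p(x, sep=8.0):
    # Unnormalized mixture of two unit-variance Gaussians centered at +/- sep/2.
    return math.log(math.exp(-0.5 * (x - sep / 2) ** 2)
                    + math.exp(-0.5 * (x + sep / 2) ** 2))

random.seed(0)
x, side, crossings = -4.0, -1, 0
for _ in range(20_000):
    proposal = x + random.gauss(0.0, 0.5)  # small local move, symmetric proposal
    if random.random() < math.exp(min(0.0, log_p(proposal) - log_p(x))):
        x = proposal                       # Metropolis accept
    if (1 if x > 0 else -1) != side:
        side, crossings = -side, crossings + 1
print("mode crossings in 20,000 steps:", crossings)  # typically 0 or very few
```

With the modes this far apart relative to the step size, the probability of traversing the gap decays roughly like the density ratio between the barrier and the mode, so crossings become vanishingly rare.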
The mode-mixing problem can be alleviated by spending more computation (sampling longer chains), but this becomes exponentially unsustainable as mode separation increases. The issue can also be reduced by introducing random sampling (e.g., drawing multiple chains) and simulated annealing (Andrieu et al., 2003) to facilitate jumping between modes. However, this becomes less effective in high dimensions and when the modes occupy a tiny volume (which can be an exponentially small fraction of the total space as its dimension increases), since random sampling is unlikely to land in the neighborhood of a mode.

In contrast, GFlow Nets belong to the family of amortized sampling methods (which includes VAEs; Kingma and Welling, 2013), where we train a machine learning system to produce samples: we have exchanged the complexity of sampling through long chains for the complexity of training the sampler. The potential advantage of such amortized samplers arises when the distribution of interest has generalizable structure: when it is possible to guess reasonably well where high-probability samples can be found, based on the knowledge of a set of known high-probability samples (the training set). This is what makes deep generative models work in the first place, and it suggests that in high-dimensional settings where modes occupy tiny volumes (as per the manifold hypothesis; Cayton, 2005; Narayanan and Mitter, 2010; Rifai et al., 2011), one can capitalize on the already observed (x, R(x)) pairs (where x is an already visited configuration and R(x) its reward) to jump from known modes to yet unvisited ones, even if these are far from the ones already visited. How well this works depends on the learner's ability to generalize, i.e., on the strength and appropriateness of its inductive biases, as usual in machine learning. In the case where there is no structure at all (and thus no possibility to generalize when learning about the distribution), there is no reason to expect amortized ML methods to fare better than MCMC. But if there is structure, then the exponential cost of mixing between modes could go away. There is plenty of evidence that ML methods can do a good job in such high-dimensional spaces (like the space of natural images), which suggests that GFlow Nets and other amortized sampling methods are worth considering wherever ML generally works well. Molecular graph generation experiments (Bengio et al., 2021) comparing GFlow Nets and MCMC methods appear to confirm this.

Another factor to consider (independent of the mode-mixing issue) is the amortization of computational costs: GFlow Nets pay a large price upfront to train the network and then a small price (sampling once from P_T) to generate each new sample, whereas MCMC has no upfront cost but pays a lot for each independent sample. Hence, if we only want a single sample, MCMC may be more efficient, whereas if we want to generate many samples, amortized methods may be advantageous; the sketch below illustrates this break-even arithmetic. One can also imagine settings where GFlow Nets and MCMC are combined to achieve some of the advantages of both approaches.
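A back-of-the-envelope sketch (ours; all constants are hypothetical placeholders, not measured costs) of this break-even arithmetic:

```python
def total_cost(n_samples, upfront, per_sample):
    return upfront + n_samples * per_sample

C_TRAIN, C_GFN_SAMPLE = 1e6, 1.0  # one-time training cost vs. cheap forward passes
C_MCMC_CHAIN = 1e4                # each independent sample needs its own long chain

# Break-even at n = C_TRAIN / (C_MCMC_CHAIN - C_GFN_SAMPLE), about 100 here:
# MCMC wins for a handful of samples, the amortized sampler wins beyond that.
for n in (1, 10, 1_000, 1_000_000):
    print(n, total_cost(n, C_TRAIN, C_GFN_SAMPLE), total_cost(n, 0.0, C_MCMC_CHAIN))
```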
Evolutionary Methods. Evolutionary methods work similarly to MCMC methods, via an iterative process of stochastic local search, through which populations of candidates are found that maximize one or many objectives (Brown et al., 2004; Salimans et al., 2017; Jensen, 2019; Swersky et al., 2020). From this perspective, they have similar advantages and disadvantages. One practical advantage of these methods is that diversity is easily obtainable via group metrics and subpopulation selection (Mouret and Doncieux, 2012). This is not something that is explicitly tackled by GFlow Nets, which instead rely on i.i.d. sampling and on giving non-zero probability to suboptimal samples as a diversity mechanism.

Sequential Monte-Carlo. Sequential Monte Carlo (SMC; Naesseth et al., 2019; Arulampalam et al., 2002) methods are a class of methods aimed at solving inference problems. Like GFlow Nets, SMC samplers are designed to sample from a distribution given by its unnormalized probability density, and they require forward and backward kernels, as in GFlow Nets. Unlike GFlow Nets, though, they require specifying intermediate targets γ_t(z_t). In addition, with GFlow Nets, the reward normally only comes at the end of the sequence, unlike the per-time-step likelihoods used to reweight particles in SMC. GFlow Nets can thus be applied in settings which do not fit the typical particle-filter setting, such as those whose intermediate states do not correspond to valid elements of the sample space. The trajectories do not necessarily represent a sequence of latent variables associated with corresponding observations, as in filtering tasks; GFlow Net trajectory distributions are defined only by the terminal-state reward function.

8. Conclusions and Open Questions

This paper extends and deepens the mathematical framework and theoretical properties of GFlow Nets (Bengio et al., 2021). It connects the notion of flow in GFlow Nets with that of a measure over trajectories and introduces a novel training objective (the detailed balance loss) which makes it possible to choose a parametrization that separates the backward policy P_B, which controls preferences over the order in which things are done, from the constraints imposed by the target reward function. An important contribution of this paper is the mathematical framework for marginalization and free energy estimation using GFlow Nets. It relies on the simple idea of conditioning the GFlow Net, extending the ability to estimate a partition function (already introduced by Bengio et al., 2021) to a much more general setting. This makes it possible in principle to estimate intractable sums of rewards over the terminating states reachable from an arbitrary state, opening the door to marginalization over supergraphs of graphs, supersets of sets, and supersets of (variable, value) pairs. In turn, this provides formulae for estimating entropies, conditional entropies and mutual information.

Many open questions remain, from the extension to continuous actions and states, to hierarchical versions of GFlow Nets with abstract actions, to integrating the energy function in the GFlow Net parametrization itself, enabling an interesting form of modularization and knowledge decomposition. Importantly, many of the mathematical formulations presented in this paper will require empirical validation to ascertain their usefulness, improve these ideas, turn them into impactful algorithms, and explore a potentially very broad range of interesting applications: from replacing MCMC or being combined with MCMC in some settings, to probabilistic reasoning, to further applications in active learning for scientific discovery.
Acknowledgements

The authors want to acknowledge the useful suggestions and feedback on the paper and its ideas provided by Alexandra Volokhova, Marc Bellemare, Valentin Thomas, Modjtaba Shokrian Zini, Mohammad Pedramfar, and Axel Nguyen Kerbel. They are also grateful for the financial support from CIFAR, Samsung, IBM, Google, Microsoft, JP Morgan Chase, and the Thomas C. Nelson Stanford Interdisciplinary Graduate Fellowship. This research was funded in part by JPMorgan Chase & Co. Any views or opinions expressed herein are solely those of the authors listed, and may differ from the views and opinions expressed by JPMorgan Chase & Co. or its affiliates. This material is not a product of the Research Department of J.P. Morgan Securities LLC. This material should not be construed as an individual recommendation for any particular client and is not intended as a recommendation of particular securities, financial instruments or strategies for a particular client. This material does not constitute a solicitation or offer in any jurisdiction.

References

C. Andrieu, N. De Freitas, A. Doucet, and M. I. Jordan. An introduction to MCMC for machine learning. Machine Learning, 50(1):5–43, 2003.

M. Arulampalam, S. Maskell, N. Gordon, and T. Clapp. A tutorial on particle filters for online nonlinear/non-Gaussian Bayesian tracking. IEEE Transactions on Signal Processing, 50(2):174–188, 2002. doi: 10.1109/78.978374.

P. Auer, N. Cesa-Bianchi, Y. Freund, and R. E. Schapire. The nonstochastic multiarmed bandit problem. SIAM Journal on Computing, 32(1):48–77, 2002.

D. Bahdanau, K. Cho, and Y. Bengio. Neural machine translation by jointly learning to align and translate. ICLR 2015, arXiv:1409.0473, 2014.

E. Bengio, M. Jain, M. Korablyov, D. Precup, and Y. Bengio. Flow network based generative models for non-iterative diverse candidate generation. NeurIPS 2021, arXiv:2106.04399, 2021.

Y. Bengio, G. Mesnil, Y. Dauphin, and S. Rifai. Better mixing via deep representations. In International Conference on Machine Learning, pages 552–560. PMLR, 2013.

N. Brown, B. McKay, F. Gilardoni, and J. Gasteiger. A graph-based genetic algorithm and its application to the multiobjective evolution of median molecules. Journal of Chemical Information and Computer Sciences, 44(3):1079–1087, 2004.

L. Buesing, N. Heess, and T. Weber. Approximate inference in discrete distributions with Monte Carlo tree search and value functions, 2019.

L. Cayton. Algorithms for manifold learning. Univ. of California at San Diego Tech. Rep, 12(1-17):1, 2005.

H. Dai, R. Singh, B. Dai, C. Sutton, and D. Schuurmans. Learning discrete energy-based models via auxiliary-variable local exploration. In Neural Information Processing Systems (NeurIPS), 2020.

T. Deleu, A. Góis, C. Emezue, M. Rankawat, S. Lacoste-Julien, S. Bauer, and Y. Bengio. Bayesian structure learning with generative flow networks. In Uncertainty in Artificial Intelligence, pages 518–528. PMLR, 2022.

L. Dinh, D. Krueger, and Y. Bengio. NICE: Non-linear independent components estimation. ICLR 2015 Workshop, arXiv:1410.8516, 2014.

L. Dinh, J. Sohl-Dickstein, and S. Bengio. Density estimation using Real NVP. ICLR 2017, arXiv:1605.08803, 2016.

D. Ernst, P. Geurts, and L. Wehenkel. Tree-based batch mode reinforcement learning. Journal of Machine Learning Research, 6:503–556, 2005.

I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets.
Advances in Neural Information Processing Systems, 27, 2014.

A. Goyal and Y. Bengio. Inductive biases for deep learning of higher-level cognition. arXiv, abs/2011.15091, 2020. https://arxiv.org/abs/2011.15091.

A. Goyal, A. Lamb, J. Hoffmann, S. Sodhani, S. Levine, Y. Bengio, and B. Schölkopf. Recurrent independent mechanisms. ICLR 2021, arXiv:1909.10893, 2019.

W. Grathwohl, K. Swersky, M. Hashemi, D. Duvenaud, and C. J. Maddison. Oops I took a gradient: Scalable sampling for discrete distributions, 2021.

R.-R. Griffiths and J. M. Hernández-Lobato. Constrained Bayesian optimization for automatic chemical design. arXiv preprint arXiv:1709.05501, 2017.

T. Haarnoja, H. Tang, P. Abbeel, and S. Levine. Reinforcement learning with deep energy-based policies. In International Conference on Machine Learning, pages 1352–1361. PMLR, 2017.

W. K. Hastings. Monte Carlo sampling methods using Markov chains and their applications. Biometrika, 1970.

E. Hu, N. Malkin, M. Jain, K. Everett, A. Graikos, and Y. Bengio. GFlowNet-EM for learning compositional latent variable models. arXiv, 2023.

M. Jain, E. Bengio, A. Hernandez-Garcia, J. Rector-Brooks, B. F. P. Dossou, C. Ekbote, J. Fu, T. Zhang, M. Kilgour, D. Zhang, L. Simine, P. Das, and Y. Bengio. Biological sequence design with GFlowNets. International Conference on Machine Learning (ICML), 2022.

M. Jain, S. C. Raparthy, A. Hernandez-Garcia, J. Rector-Brooks, Y. Bengio, S. Miret, and E. Bengio. Multi-objective GFlowNets. arXiv preprint arXiv:2210.12765, 2023.

A. Jasra, C. C. Holmes, and D. A. Stephens. Markov chain Monte Carlo methods and the label switching problem in Bayesian mixture modeling. Statistical Science, pages 50–67, 2005.

J. H. Jensen. A graph-based genetic algorithm and generative model/Monte Carlo tree search for the exploration of chemical space. Chemical Science, 10(12):3567–3572, 2019.

D. P. Kingma and M. Welling. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114, 2013.

F. R. Kschischang, B. J. Frey, and H.-A. Loeliger. Factor graphs and the sum-product algorithm. IEEE Transactions on Information Theory, 47(2):498–519, 2001.

R. Kumar, S. Ozair, A. Goyal, A. Courville, and Y. Bengio. Maximum entropy generators for energy-based models, 2019.

M. J. Kusner, B. Paige, and J. M. Hernández-Lobato. Grammar variational autoencoder. In International Conference on Machine Learning, pages 1945–1954. PMLR, 2017.

S. Lahlou, T. Deleu, P. Lemos, D. Zhang, A. Volokhova, A. Hernández-García, L. N. Ezzine, Y. Bengio, and N. Malkin. A theory of continuous generative flow networks. International Conference on Machine Learning (ICML), 2023.

S. Lange, T. Gabel, and M. Riedmiller. Batch reinforcement learning. In Reinforcement Learning, pages 45–73. Springer, 2012.

S. Levine. Reinforcement learning and control as probabilistic inference: Tutorial and review. arXiv preprint arXiv:1805.00909, 2018.

S. A. Malik, S. Lahlou, A. Jesson, M. Jain, N. Malkin, T. Deleu, Y. Bengio, and Y. Gal. BatchGFN: Generative flow networks for batch active learning. arXiv preprint arXiv:2306.15058, 2023.

N. Malkin, M. Jain, E. Bengio, C. Sun, and Y. Bengio. Trajectory balance: Improved credit assignment in GFlowNets. arXiv preprint arXiv:2201.13259, 2022.

N. Malkin, S. Lahlou, T. Deleu, X. Ji, E. Hu, K. Everett, D. Zhang, and Y. Bengio. GFlowNets and variational inference. International Conference on Learning Representations (ICLR), 2023.

N. Metropolis, A. W. Rosenbluth, M. N. Rosenbluth, A. H. Teller, and E. Teller.
Equation of state calculations by fast computing machines. The Journal of Chemical Physics, 21(6):1087–1092, 1953.

J. Močkus. On Bayesian methods for seeking the extremum. In Optimization Techniques IFIP Technical Conference, pages 400–404. Springer, 1975.

J.-B. Mouret and S. Doncieux. Encouraging behavioral diversity in evolutionary robotics: An empirical study. Evolutionary Computation, 20(1):91–133, 2012. ISSN 1063-6560. doi: 10.1162/EVCO_a_00048. URL https://doi.org/10.1162/EVCO_a_00048.

O. Nachum, M. Norouzi, K. Xu, and D. Schuurmans. Bridging the gap between value and policy based reinforcement learning. arXiv preprint arXiv:1702.08892, 2017.

O. Nachum, Y. Chow, B. Dai, and L. Li. DualDICE: Behavior-agnostic estimation of discounted stationary distribution corrections. arXiv preprint arXiv:1906.04733, 2019.

C. A. Naesseth, F. Lindsten, T. B. Schön, et al. Elements of sequential Monte Carlo. Foundations and Trends in Machine Learning, 12(3):307–392, 2019.

H. Narayanan and S. Mitter. Sample complexity of testing the manifold hypothesis. In NIPS 2010, pages 1786–1794, 2010.

C. Nash and C. Durkan. Autoregressive energy machines. In International Conference on Machine Learning, pages 1735–1744. PMLR, 2019.

L. Pan, N. Malkin, D. Zhang, and Y. Bengio. Better training of GFlowNets with local credit and incomplete trajectories. arXiv preprint arXiv:2302.01687, 2023.

E. Pompe, C. Holmes, and K. Łatuszyński. A framework for adaptive MCMC targeting multimodal distributions. The Annals of Statistics, 48(5):2930–2952, 2020.

D. Rezende and S. Mohamed. Variational inference with normalizing flows. In International Conference on Machine Learning, pages 1530–1538. PMLR, 2015.

M. Riedmiller. Neural fitted Q iteration: First experiences with a data efficient neural reinforcement learning method. In European Conference on Machine Learning, pages 317–328. Springer, 2005.

S. Rifai, Y. N. Dauphin, P. Vincent, Y. Bengio, and X. Muller. The manifold tangent classifier. Advances in Neural Information Processing Systems, 24:2294–2302, 2011.

T. Salimans, J. Ho, X. Chen, S. Sidor, and I. Sutskever. Evolution strategies as a scalable alternative to reinforcement learning, 2017.

B. Schölkopf, D. Janzing, J. Peters, E. Sgouritsa, K. Zhang, and J. Mooij. On causal and anticausal learning. In ICML 2012, pages 1255–1262, 2012.

A. Seff, W. Zhou, F. Damani, A. Doyle, and R. P. Adams. Discrete object generation with reversible inductive construction. arXiv preprint arXiv:1907.08268, 2019.

J. Sohl-Dickstein, E. Weiss, N. Maheswaranathan, and S. Ganguli. Deep unsupervised learning using nonequilibrium thermodynamics. In International Conference on Machine Learning, pages 2256–2265. PMLR, 2015.

N. Srinivas, A. Krause, S. M. Kakade, and M. Seeger. Gaussian process optimization in the bandit setting: No regret and experimental design. In International Conference on Machine Learning (ICML), 2010.

R. S. Sutton and A. G. Barto. Reinforcement Learning: An Introduction. MIT Press, 2018.

K. Swersky, Y. Rubanova, D. Dohan, and K. Murphy. Amortized Bayesian optimization over discrete spaces. In Conference on Uncertainty in Artificial Intelligence, pages 769–778. PMLR, 2020.

M. Toussaint and A. Storkey. Probabilistic inference for solving discrete and continuous state Markov decision processes. In Proceedings of the 23rd International Conference on Machine Learning, pages 945–952, 2006.

A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N.
Gomez, L. Kaiser, and I. Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008, 2017.

J. Wen, B. Dai, L. Li, and D. Schuurmans. Batch stationary distribution estimation. arXiv preprint arXiv:2003.00722, 2020.

Y. Xie, C. Shi, H. Zhou, Y. Yang, W. Zhang, Y. Yu, and L. Li. MARS: Markov molecular sampling for multi-objective drug discovery. In International Conference on Learning Representations, 2021. URL https://openreview.net/forum?id=kHSu4ebxFXY.

D. Zhang, N. Malkin, Z. Liu, A. Volokhova, A. Courville, and Y. Bengio. Generative flow networks for discrete probabilistic modeling. International Conference on Machine Learning (ICML), 2022.

D. W. Zhang, C. Rainone, M. Peschl, and R. Bondesan. Robust scheduling with GFlowNets. International Conference on Learning Representations (ICLR), 2023.

B. D. Ziebart, A. L. Maas, J. A. Bagnell, A. K. Dey, et al. Maximum entropy inverse reinforcement learning. In AAAI, volume 8, pages 1433–1438. Chicago, IL, USA, 2008.

H. Zimmermann, F. Lindsten, J.-W. van de Meent, and C. A. Naesseth. A variational perspective on generative flow networks. arXiv preprint arXiv:2210.07992, 2022.