Published as a conference paper at ICLR 2023

GFLOWNETS AND VARIATIONAL INFERENCE

Nikolay Malkin, Salem Lahlou, Tristan Deleu (Mila, Université de Montréal); Xu Ji, Edward Hu (Mila, Université de Montréal); Katie Everett (Google Research); Dinghuai Zhang (Mila, Université de Montréal); Yoshua Bengio (Mila, Université de Montréal, CIFAR)

ABSTRACT

This paper builds bridges between two families of probabilistic algorithms: (hierarchical) variational inference (VI), which is typically used to model distributions over continuous spaces, and generative flow networks (GFlowNets), which have been used for distributions over discrete structures such as graphs. We demonstrate that, in certain cases, VI algorithms are equivalent to special cases of GFlowNets in the sense of equality of expected gradients of their learning objectives. We then point out the differences between the two families and show how these differences emerge experimentally. Notably, GFlowNets, which borrow ideas from reinforcement learning, are more amenable than VI to off-policy training without the cost of high gradient variance induced by importance sampling. We argue that this property of GFlowNets can provide advantages for capturing diversity in multimodal target distributions. Code: https://github.com/GFNOrg/GFN_vs_HVI.

1 INTRODUCTION

Many probabilistic generative models produce a sample through a sequence of stochastic choices. Non-neural latent variable models (e.g., Blei et al., 2003), autoregressive models, hierarchical variational autoencoders (Sønderby et al., 2016), and diffusion models (Ho et al., 2020) can be said to rely upon a shared principle: richer distributions can be modeled by chaining together a sequence of simple actions, whose conditional distributions are easy to describe, than by performing generation in a single sampling step. When many intermediate sampled variables could generate the same object, making exact likelihood computation intractable, hierarchical models are trained with variational objectives that involve the posterior over the sampling sequence (Ranganath et al., 2016b).

This work connects variational inference (VI) methods for hierarchical models (i.e., sampling through a sequence of choices conditioned on the previous ones) with the emerging area of research on generative flow networks (GFlowNets; Bengio et al., 2021a). GFlowNets have been formulated as a reinforcement learning (RL) algorithm with states, actions, and rewards that constructs an object by a sequence of actions so as to make the marginal likelihood of producing an object proportional to its reward. While hierarchical VI is typically used for distributions over real-valued objects, GFlowNets have been successful at approximating distributions over discrete structures for which exact sampling is intractable, such as for molecule discovery (Bengio et al., 2021a), for Bayesian posteriors over causal graphs (Deleu et al., 2022), or as an amortized learned sampler for approximate maximum-likelihood training of energy-based models (Zhang et al., 2022b).

Although GFlowNets appear to have different foundations (Bengio et al., 2021b) and applications than hierarchical VI algorithms, we show here that the two are closely connected. As our main theoretical contribution, we show that special cases of variational algorithms and GFlowNets coincide in their expected gradients.
In particular, hierarchical VI (Ranganath et al., 2016b) and nested VI (Zimmermann et al., 2021) are related to the trajectory balance and detailed balance objectives for GFlowNets (Malkin et al., 2022; Bengio et al., 2021b). We also point out the differences between VI and GFlowNets: notably, that GFlowNets automatically perform gradient variance reduction by estimating a marginal quantity (the partition function) that acts as a baseline and allow off-policy learning without the need for reweighted importance sampling.

Our theoretical results are accompanied by experiments that examine what similarities and differences emerge when one applies hierarchical VI algorithms to discrete problems where GFlowNets have been used before. These experiments serve two purposes. First, they supply a missing hierarchical VI baseline for problems where GFlowNets have been used in past work. The relative performance of this baseline illustrates the aforementioned similarities and differences between VI and GFlowNets. Second, the experiments demonstrate the ability of GFlowNets, not shared by hierarchical VI, to learn from off-policy distributions without introducing high gradient variance. We show that this ability to learn with exploratory off-policy sampling is beneficial in discrete probabilistic modeling tasks, especially in cases where the target distribution has many modes.

Equal contribution. Contact: nikolay.malkin@mila.quebec.

2 THEORETICAL RESULTS

2.1 GFLOWNETS: NOTATION AND BACKGROUND

We consider the setting of Bengio et al. (2021a). We are given a pointed¹ directed acyclic graph (DAG) G = (S, A), where S is a finite set of vertices (states), and A ⊆ S × S is a set of directed edges (actions). If s → s′ is an action, we say s is a parent of s′ and s′ is a child of s. There is exactly one state that has no incoming edge, called the initial state s₀ ∈ S. States that have no outgoing edges are called terminating. We denote by X the set of terminating states. A complete trajectory is a sequence τ = (s₀ → ... → sₙ) such that each sᵢ → sᵢ₊₁ is an action and sₙ ∈ X. We denote by T the set of complete trajectories and by x_τ the last state of a complete trajectory τ.

GFlowNets are a class of models that amortize the cost of sampling from an intractable target distribution over X by learning a functional approximation of the target distribution using its unnormalized density or reward function, R : X → ℝ₊. While there exist different parametrizations and loss functions for GFlowNets, they all define a forward transition probability function, or a forward policy, P_F(· | s), which is a distribution over the children of every state s ∈ S. The forward policy is typically parametrized by a neural network that takes a representation of s as input and produces the logits of a distribution over its children. Any forward policy P_F induces a distribution over complete trajectories τ ∈ T (denoted by P_F as well), which in turn defines a marginal distribution over terminating states x ∈ X (denoted by P_F^⊤):
$$P_F(\tau = (s_0 \to \dots \to s_n)) = \prod_{i=0}^{n-1} P_F(s_{i+1} \mid s_i), \qquad \tau \in T, \qquad (1)$$
$$P_F^\top(x) = \sum_{\tau \in T : x_\tau = x} P_F(\tau), \qquad x \in X. \qquad (2)$$
Given a forward policy P_F, terminating states x ∈ X can be sampled from P_F^⊤ by sampling trajectories τ from P_F(τ) and taking their final states x_τ. GFlowNets aim to find a forward policy P_F for which P_F^⊤(x) ∝ R(x). Because the sum in (2) is typically intractable to compute exactly, training objectives for GFlowNets introduce auxiliary objects into the optimization.
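As a concrete, toy-sized illustration of equations (1) and (2), the sketch below enumerates all complete trajectories of a small hand-written pointed DAG under a tabular forward policy and sums their probabilities per terminating state. The DAG, the policy values, and all names are hypothetical and chosen only for illustration; in realistic state spaces this brute-force enumeration is exactly the intractable sum in (2) that GFlowNet training objectives avoid.

```python
# Minimal sketch of equations (1)-(2) on a hypothetical pointed DAG with
# initial state "s0" and terminating states "x1", "x2".
children = {
    "s0": ["a", "b"],
    "a":  ["x1", "x2"],
    "b":  ["x2"],
    "x1": [], "x2": [],
}
P_F = {  # P_F(. | s): a distribution over the children of each non-terminating state
    "s0": {"a": 0.6, "b": 0.4},
    "a":  {"x1": 0.5, "x2": 0.5},
    "b":  {"x2": 1.0},
}

def trajectories_from(s):
    """Enumerate all complete trajectories starting at state s."""
    if not children[s]:
        return [[s]]
    return [[s] + t for c in children[s] for t in trajectories_from(c)]

def P_F_traj(tau):
    """P_F(tau) = prod_i P_F(s_{i+1} | s_i), equation (1)."""
    p = 1.0
    for s, s_next in zip(tau[:-1], tau[1:]):
        p *= P_F[s][s_next]
    return p

# Terminating distribution, equation (2): marginalize over trajectories ending at x.
P_F_top = {}
for tau in trajectories_from("s0"):
    P_F_top[tau[-1]] = P_F_top.get(tau[-1], 0.0) + P_F_traj(tau)

print(P_F_top)  # approximately {'x1': 0.3, 'x2': 0.7}
```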
For example, the trajectory balance objective (TB; Malkin et al., 2022) introduces an auxiliary backward policy P_B, which is a learned distribution P_B(· | s′) over the parents of every state s′ ∈ S, and an estimated partition function Z, typically parametrized as exp(log Z) where log Z is the learned parameter. The TB objective for a complete trajectory τ is defined as
$$\mathcal{L}_{\mathrm{TB}}(\tau; P_F, P_B, Z) = \left( \log \frac{Z\, P_F(\tau)}{R(x_\tau)\, P_B(\tau \mid x_\tau)} \right)^2, \qquad (3)$$
where $P_B(\tau \mid x_\tau) = \prod_{(s \to s') \in \tau} P_B(s \mid s')$. If L_TB is made equal to 0 for every complete trajectory τ, then P_F^⊤(x) ∝ R(x) for all x ∈ X and Z is the inverse of the constant of proportionality: Z = Σ_{x∈X} R(x).

The objective (3) is minimized by sampling trajectories τ from some distribution and making gradient steps on (3) with respect to the parameters of P_F, P_B, and log Z. The distribution from which τ is sampled amounts to a choice of scalarization weights for the multi-objective problem of minimizing (3) over all τ ∈ T. If τ is sampled from P_F(τ) itself (note that this is a nonstationary scalarization), we say the algorithm runs on-policy. If τ is sampled from another distribution, the algorithm runs off-policy; typical choices are to sample τ from a tempered version of P_F to encourage exploration (Bengio et al., 2021a; Deleu et al., 2022) or to sample τ from the backward policy P_B(τ | x) starting from given terminating states x (Zhang et al., 2022b). By analogy with the RL nomenclature, we call the behavior policy the one that samples τ for the purpose of obtaining a stochastic gradient, e.g., the gradient of the objective L_TB in (3) for the sampled τ.

¹A pointed DAG is one with a designated initial state.

Other objectives have been studied and successfully used in past works, including detailed balance (DB; proposed by Bengio et al. (2021b) and evaluated by Malkin et al. (2022)) and subtrajectory balance (SubTB; Madan et al., 2022). In the next sections, we will show how the TB objective relates to hierarchical variational objectives. In Appendix C, we generalize this result to the SubTB loss, of which both TB and DB are special cases.

2.2 HIERARCHICAL VARIATIONAL MODELS AND GFLOWNETS

Variational methods provide a way of sampling from distributions by means of learning an approximate probability density. Hierarchical variational models (HVMs; Ranganath et al., 2016b; Sobolev & Vetrov, 2019; Vahdat & Kautz, 2020; Zimmermann et al., 2021) typically assume that the sample space is a set of sequences (z₁, ..., zₙ) of fixed length, with an assumption of conditional independence between z_{i−1} and z_{i+1} conditioned on z_i, i.e., the likelihood has a factorization q(z₁, ..., zₙ) = q(z₁) q(z₂ | z₁) ··· q(zₙ | z_{n−1}). The marginal likelihood of zₙ in a hierarchical model involves a possibly intractable sum,
$$q(z_n) = \sum_{z_1, \dots, z_{n-1}} q(z_1)\, q(z_2 \mid z_1) \cdots q(z_n \mid z_{n-1}).$$
The goal of VI algorithms is to find the conditional distributions q that minimize some divergence between the marginal q(zₙ) and a target distribution. The target is often given as a distribution with intractable normalization constant: a typical setting is a Bayesian posterior (used in VAEs, variational EM, and other applications), for which we desire q(zₙ) ∝ p_likelihood(x | zₙ) p_prior(zₙ).

The GFlowNet corresponding to an HVM: Sampling sequences (z₁, ..., zₙ) from a hierarchical model is equivalent to sampling complete trajectories in a certain pointed DAG G.
The states of G at a distance of 𝑖from the initial state are in bijection with possible values of the variable 𝑧𝑖, and the action distribution is given by π‘ž. Sampling from the HVM is equivalent to sampling trajectories from the policy 𝑃𝐹(𝑧𝑖+1|𝑧𝑖) = π‘ž(𝑧𝑖+1|𝑧𝑖) (and 𝑃𝐹(𝑧1|𝑠0) = π‘ž(𝑧1)), and the marginal distribution π‘ž(𝑧𝑛) is the terminating distribution 𝑃 𝐹. The HVM corresponding to a GFlow Net: Conversely, suppose G = (S, A) is a graded pointed DAG2 and that a forward policy 𝑃𝐹on G is given. Sampling trajectories 𝜏= (𝑠0 𝑠1 . . . 𝑠𝐿) in G is equivalent to sampling from a HVM in which the random variable 𝑧𝑖is the identity of the (𝑖+ 1)-th state 𝑠𝑖in 𝜏and the conditional distributions π‘ž(𝑧𝑖+1|𝑧𝑖) are given by the forward policy 𝑃𝐹(𝑠𝑖+1|𝑠𝑖). Specifying an approximation of the target distribution in a hierarchical model with 𝑛 layers is thus equivalent to specifying a forward policy 𝑃𝐹in a graded DAG. The correspondence can be extended to non-graded DAGs. Every pointed DAG G = (S, A) can be canonically transformed into a graded pointed DAG by the insertion of dummy states that have one child and one parent. To be precise, every edge 𝑠 𝑠 A is replaced with a sequence of β„“ β„“(𝑠) edges, where β„“(𝑠) is the length of the longest trajectory from 𝑠0 to 𝑠, β„“ = β„“(𝑠 ) if 𝑠 X, and β„“ = max𝑠 S β„“(𝑠 ) otherwise. This process is illustrated in A. We thus restrict our analysis in this section, without loss of generality, to graded DAGs. The meaning of the backward policy: Typically, the target distribution is over the objects X of the last layer of a graded DAG, rather than over complete sequences or trajectories. Any backward policy 𝑃𝐡on the DAG turns an unnormalized target distribution 𝑅over X into an unnormalized distribution over complete trajectories T: 𝜏 T 𝑃𝐡(𝜏) 𝑅(π‘₯𝜏)𝑃𝐡(𝜏| π‘₯𝜏), with unknown partition function ˆ𝑍= π‘₯ X 𝑅(π‘₯). (4) The marginal distribution of 𝑃𝐡over terminating states is equal to 𝑅(π‘₯)/ ˆ𝑍by construction. Therefore, if 𝑃𝐹is a forward policy that equals 𝑃𝐡as a distribution over trajectories, then 𝑃 𝐹(π‘₯) = 𝑅(π‘₯)/ ˆ𝑍 𝑅(π‘₯). VI training objectives: In its most general form, the hierarchical variational objective ( HVI objective in the remainder of the paper) minimizes a statistical divergence 𝐷𝑓between the learned and the target distributions over trajectories: LHVI, 𝑓(𝑃𝐹, 𝑃𝐡) = 𝐷𝑓(𝑃𝐡 𝑃𝐹) = E𝜏 𝑃𝐹 2We recall some facts about partially ordered sets. A pointed graded DAG is a pointed DAG in which all complete trajectories have the same length. Pointed graded DAGs G are also characterized by the following equivalent property: the state space S can be partitioned into disjoint sets S = Ã𝐿 𝑙=0 S𝑙, with S0 = {𝑠0}, called layers, such that all edges 𝑠 𝑠 are between states of adjacent layers (𝑠 S𝑖,𝑠 S𝑖+1 for some 𝑖). Published as a conference paper at ICLR 2023 Two common objectives are the forward and reverse Kullback-Leibler (KL) divergences (Mnih & Gregor, 2014), corresponding to 𝑓: 𝑑 𝑑log 𝑑for 𝐷KL(𝑃𝐡 𝑃𝐹) and 𝑓: 𝑑 log 𝑑for 𝐷KL(𝑃𝐹 𝑃𝐡), respectively. Other 𝑓-divergences have been used, as discussed in Zhang et al. (2019b); Wan et al. (2020). Note that, similar to GFlow Nets, (5) can be minimized with respect to both the forward and backward policies, or can be minimized using a fixed backward policy. Table 1: A comparison of algorithms for approximating a target distribution in a hierarchical variational model or a GFlow Net. 
The gradients used to update the parameters of the sampling distribution and of the auxiliary backward policy approximate the gradients of various divergences between distributions over trajectories. Surrogate loss Algorithm 𝑃𝐹(sampler) 𝑃𝐡(posterior) REVERSE KL 𝐷KL(𝑃𝐹 𝑃𝐡) 𝐷KL(𝑃𝐹 𝑃𝐡) FORWARD KL 𝐷KL(𝑃𝐡 𝑃𝐹) 𝐷KL(𝑃𝐡 𝑃𝐹) WAKE-SLEEP (WS) 𝐷KL(𝑃𝐡 𝑃𝐹) 𝐷KL(𝑃𝐹 𝑃𝐡) REVERSE WAKE-SLEEP 𝐷KL(𝑃𝐹 𝑃𝐡) 𝐷KL(𝑃𝐡 𝑃𝐹) On-policy TB 𝐷KL(𝑃𝐹 𝑃𝐡) see 2.3 Divergences between two distributions over trajectories and divergences between their two marginal distributions over terminating states distributions are linked via the data processing inequality, assuming 𝑓is convex (see e.g. Zhang et al. (2019b)), making the former a sensible surrogate objective for the latter: 𝐷𝑓(𝑅/ ˆ𝑍 𝑃 𝐹) 𝐷𝑓(𝑃𝐡 𝑃𝐹) (6) When both 𝑃𝐡and 𝑃𝐹are learned, the divergences with respect to which they are optimized need not be the same, as long as both objectives are 0 if and only if 𝑃𝐹= 𝑃𝐡. For example, wake-sleep algorithms (Hinton et al., 1995) optimize the generative model 𝑃𝐹using 𝐷KL(𝑃𝐡 𝑃𝐹) and the posterior 𝑃𝐡using 𝐷KL(𝑃𝐹 𝑃𝐡). A summary of common combinations is shown in Table 1. We remark that tractable unbiased gradient estimators for objectives such as (5) may not always exist, as we cannot exactly sample from or compute the density of 𝑃𝐡(𝜏) when its normalization constant ˆ𝑍is unknown. For example, while the REINFORCE estimator gives unbiased estimates of the gradient with respect to 𝑃𝐹when the objective is REVERSE KL (see 2.3), other objectives, such as FORWARD KL, require importance-weighted estimators. Such estimators approximate sampling from 𝑃𝐡by sampling a batch of trajectories {πœπ‘–} from another distribution πœ‹(which may equal 𝑃𝐹) and weighting a loss computed for each πœπ‘–by a scalar proportional to 𝑃𝐡(πœπ‘–) πœ‹(πœπ‘–) . Such reweighted importance sampling is helpful in various variational algorithms, despite its bias when the number of samples is finite (e.g., Bornschein & Bengio, 2015; Burda et al., 2016), but it may also introduce variance that increases with the discrepancy between 𝑃𝐡and πœ‹. 2.3 ANALYSIS OF GRADIENTS The following proposition summarizes our main theoretical claim, relating the GFN objective of (3) and the variational objective of (5). In C, we extend this result by showing an equivalence between the subtrajectory balance objective (introduced in Malkin et al. (2022) and empirically evaluated in Madan et al. (2022)) and a natural extension of the nested variational objective (Zimmermann et al., 2021) to subtrajectories. A special case of this equivalence is between the Detailed Balance objective (Bengio et al., 2021b) and the nested VI objective (Zimmermann et al., 2021). Proposition 1 Given a graded DAG G, and denoting by πœƒ, πœ™the parameters of the forward and backward policies 𝑃𝐹, 𝑃𝐡respectively, the gradients of the TB objective (3) satisfy: πœ™π·KL(𝑃𝐡 𝑃𝐹) = 1 2E𝜏 𝑃𝐡[ πœ™LTB(𝜏)], (7) πœƒπ·KL(𝑃𝐹 𝑃𝐡) = 1 2E𝜏 𝑃𝐹[ πœƒLTB(𝜏)]. (8) The proof of the extended result appears in C. An alternative proof is provided in B. While (8) is the on-policy TB gradient with respect to the parameters of 𝑃𝐹, (7) is not the on-policy TB gradient with respect to the parameters of 𝑃𝐡, as the expectation is taken over 𝑃𝐡, not 𝑃𝐹. The on-policy TB gradient can however be expressed through a surrogate loss E𝜏 𝑃𝐹[ πœ™LTB(𝜏)] = πœ™ h 𝐷log2(𝑃𝐡 𝑃𝐹) + 2(log 𝑍 log ˆ𝑍)𝐷KL(𝑃𝐹 𝑃𝐡) i , (9) where ˆ𝑍= Í π‘₯ X 𝑅(π‘₯), the unknown true partition function. 
Here 𝐷log2 is the pseudo𝑓-divergence defined by 𝑓(π‘₯) = log(π‘₯)2, which is not convex for large π‘₯. (Proof in B.) The loss in (7) is not possible to optimize directly unless using importance weighting (cf. the end of 2.2), but optimization of 𝑃𝐡using (7) and 𝑃𝐹using (8) would yield the gradients of REVERSE WAKE-SLEEP in expectation. Published as a conference paper at ICLR 2023 Score function estimator and variance reduction: Optimizing the reverse KL loss 𝐷KL(𝑃𝐹 𝑃𝐡) with respect to πœƒ, the parameters of 𝑃𝐹, requires a likelihood ratio (also known as REINFORCE) estimator of the gradient (Williams, 1992), using a trajectory 𝜏(or a batch of trajectories), which takes the form: Ξ”(𝜏) = πœƒlog 𝑃𝐹(𝜏; πœƒ)𝑐(𝜏), where 𝑐(𝜏) = log 𝑃𝐹(𝜏) 𝑅(π‘₯𝜏)𝑃𝐡(𝜏| π‘₯𝜏) (10) (Note that the term πœƒπ‘(𝜏) that is typically present in the REINFORCE estimator is 0 in expectation, since E𝜏 𝑃𝐹[ πœƒlog 𝑃𝐹(𝜏)] = Í 𝜏 𝑃𝐹(𝜏) 𝑃𝐹(𝜏) πœƒπ‘ƒπΉ(𝜏) = 0.) The estimator of (10) is known to exhibit high variance norm, thus slowing down learning. A common workaround is to subtract a baseline 𝑏from 𝑐(𝜏), which does not bias the estimator. The value of the baseline 𝑏(also called control variate) that most reduces the trace of the covariance matrix of the gradient estimator is 𝑏 = E𝜏 𝑃𝐹[𝑐(𝜏) πœƒlog 𝑃𝐹(𝜏; πœƒ) 2] E𝜏 𝑃𝐹[ πœƒlog 𝑃𝐹(𝜏; πœƒ) 2] , commonly approximated with E𝜏 𝑃𝐹[𝑐(𝜏)] (see, e.g., Weaver & Tao (2001); Wu et al. (2018)). This approximation is itself often approximated with a batch-dependent local baseline, from a batch of trajectories {πœπ‘–}𝐡 𝑖=1: 𝑖=1 𝑐(πœπ‘–) (11) A better approximation of the expectation E𝜏 𝑃𝐹[𝑐(𝜏)] can be obtained by maintaining a running average of the values 𝑐(𝜏), leading to a global baseline. After observing each batch of trajectories, the running average is updated with step size πœ‚: 𝑏global (1 πœ‚)𝑏global + πœ‚π‘local. (12) This coincides with the update rule of log 𝑍in the minimization of LTB(𝑃𝐹, 𝑃𝐡, 𝑍) with a learning rate πœ‚ 2 for the parameter log 𝑍(with respect to which the TB objective is quadratic). Consequently, (8) of Prop. 1 shows that the update rule for the parameters of 𝑃𝐹, when optimized using the REVERSE KL objective, with (12) as a control variate for the score function estimator of its gradient, is the same as the update rule obtained by optimizing the TB objective using on-policy trajectories. While learning a backward policy 𝑃𝐡can speed up convergence (Malkin et al., 2022), the TB objective can also be used with a fixed backward policy, in which case the REVERSE KL objective and the TB objective differ only in how they reduce the variance of the estimated gradients, if the trajectories are sampled on-policy. In 4, we experimentally explore the differences between the two learning paradigms that arise when 𝑃𝐡is learned, or when the algorithms run off-policy. 3 RELATED WORK (Hierarchical) VI: Variational inference (Zhang et al., 2019a) techniques originate from graphical models (Saul et al., 1996; Jordan et al., 2004), which typically include an inference machine and a generative machine to model the relationship between latent variables and observed data. The line of work on black-box VI (Ranganath et al., 2014) focuses on learning the inference machine given a data generating process, i.e., inferring the posterior over latent variables. Hierarchical modeling exhibits appealing properties under such settings as discussed in Ranganath et al. (2016b); Yin & Zhou (2018); Sobolev & Vetrov (2019). 
On the other hand, works on variational auto-encoders (VAEs) (Kingma & Welling, 2014; Rezende et al., 2014) focus on generative modeling, where the inference machine the estimated variational posterior is a tool to assist optimization of the generative machine or decoder. Hierarchical construction of multiple latent variables has also been shown to be beneficial (SΓΈnderby et al., 2016; MaalΓΈe et al., 2019; Child, 2021). While earlier works simplify the variational family with mean-field approximations (Bishop, 2006), modern inference methods rely on amortized stochastic optimization (Hoffman et al., 2013). One of the oldest and most commonly used ideas is REINFORCE (Williams, 1992; Paisley et al., 2012) which gives unbiased gradient estimation. Follow-up work (Titsias & L azaro-Gredilla, 2014; Gregor et al., 2014; Mnih & Gregor, 2014; Mnih & Rezende, 2016) proposes advanced estimators to reduce the high variance of REINFORCE. The log-variance loss proposed by Richter et al. (2020) is equivalent in expected gradient of 𝑃𝐹to the on-policy TB loss for a GFlow Net with a batch-optimal value of log 𝑍. On the other hand, path-wise gradient estimators (Kingma & Welling, 2014) have much lower variance, but have limited applicability. Later works combine these two approaches for particular distribution families (Tucker et al., 2017; Grathwohl et al., 2018). Published as a conference paper at ICLR 2023 Beyond the evidence lower bound (ELBO) objective used in most variational inference methods, more complex objectives have been studied. Tighter evidence bounds have proved beneficial to the learning of generative machines (Burda et al., 2016; Domke & Sheldon, 2018; Rainforth et al., 2018; Masrani et al., 2019). As KL divergence optimization suffers from issues such as mean-seeking behavior and posterior variance underestimation (Minka, 2005), other divergences are adopted as in expectation propagation (Minka, 2001; Li et al., 2015), more general 𝑓-divergences (Dieng et al., 2017; Wang et al., 2018; Wan et al., 2020), their special case 𝛼-divergences (Hern andez-Lobato et al., 2016), and Stein discrepancy (Liu & Wang, 2016; Ranganath et al., 2016a). GFlow Nets could be seen as providing a novel pseudo-divergence criterion, namely TB, as discussed in this work. Wake-sleep algorithms: Another branch of work, starting with Hinton et al. (1995), proposes to avoid issues from stochastic optimization (such as REINFORCE) by alternatively optimizing the generative and inference (posterior) models. Modern versions extending this framework include reweighted wake-sleep Bornschein & Bengio (2015); Le et al. (2019) and memoised wakesleep (Hewitt et al., 2020; Le et al., 2022). It was shown in Le et al. (2019) that wake-sleep algorithms behave well for tasks involving stochastic branching. GFlow Nets: GFlow Nets have been used successfully in settings where RL and MCMC methods have been used in other work, including molecule discovery (Bengio et al., 2021a; Malkin et al., 2022; Madan et al., 2022), biological sequence design (Malkin et al., 2022; Jain et al., 2022; Madan et al., 2022), and Bayesian structure learning (Deleu et al., 2022). A connection of the theoretical foundations of GFlow Nets (Bengio et al., 2021a;b) with variational methods was first mentioned by Malkin et al. (2022) and expanded in Zhang et al. (2022a; 2023). A concurrent and closely related paper (Zimmermann et al., 2022) theoretically and experimentally explores interpolations between forward and reverse KL objectives. 
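Before turning to the experiments, here is a small numerical illustration of the gradient equivalence stated in Proposition 1 (equation (8)), which also underlies the connection to the log-variance loss of Richter et al. (2020) mentioned above. The sketch uses a hypothetical one-step environment with three terminating states, so that a trajectory is a single edge and P_B is trivial; it checks that the gradient of D_KL(P_F ‖ P_B) with respect to the policy logits equals half of the expected on-policy TB gradient, for an arbitrary value of log Z. This is a minimal check under these simplifying assumptions, not the paper's released code.

```python
# Numerical check of Proposition 1, eq. (8), on a hypothetical one-step DAG:
# s0 -> x for three terminating states x, so each trajectory is a single edge
# and log P_B(tau | x) = 0.
import torch

log_R = torch.tensor([1.0, -2.0, 0.5])                       # hypothetical log-rewards R(x)
theta = torch.tensor([0.2, -0.4, 0.1], requires_grad=True)   # logits of P_F(x | s0)
log_Z = 0.3                                                   # arbitrary: eq. (8) holds for any log Z

log_PF = torch.log_softmax(theta, dim=0)
PF = log_PF.exp()

# Left-hand side: grad_theta D_KL(P_F || P_B), where P_B(x) = R(x) / Z_hat.
log_Zhat = torch.logsumexp(log_R, dim=0)
kl = (PF * (log_PF - (log_R - log_Zhat))).sum()
(grad_kl,) = torch.autograd.grad(kl, theta, retain_graph=True)

# Right-hand side: E_{x ~ P_F}[grad_theta L_TB(x)], with the sampling weights not differentiated.
expected_tb = (PF.detach() * (log_Z + log_PF - log_R) ** 2).sum()
(grad_tb,) = torch.autograd.grad(expected_tb, theta)

print(torch.allclose(grad_kl, 0.5 * grad_tb, atol=1e-5))  # True
```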
4 EXPERIMENTS The goal of the experiments is to empirically investigate two main observations consistent with the above theoretical analysis: Observation 1. On-policy VI and TB (GFlow Net) objectives can behave similarly in some cases, when both can be stably optimized, while in others on-policy TB strikes a better compromise than either the (mode-seeking) REVERSE KL or (mean-seeking) FORWARD KL VI objectives. This claim is supported by the experiments on all three domains below. However, in all cases, notable differences emerge. In particular, HVI training becomes more stable near convergence and is sensitive to learning rates, which is consistent with the hypotheses about gradient variance in 2.3. Observation 2. When exploration matters, off-policy TB outperforms both on-policy TB and VI objectives, avoiding the possible high variance induced by importance sampling in off-policy VI. GFlow Nets are capable of stable off-policy training without importance sampling. This claim is supported by experiments on all domains, but is especially well illustrated on the realistic domains in 4.2 and 4.3. This capability provides advantages for capturing a more diverse set of modes. Observation 1 and Observation 2 provide evidence that off-policy TB is the best method among those tested in terms of both accurately fitting the target distribution and effectively finding modes, where the latter is particularly important for the challenging molecule graph generation and causal graph discovery problems studied below. 4.1 HYPERGRID: EXPLORATION OF LEARNING OBJECTIVES In this section, we comparatively study the ability of the variational objectives and the GFlow Net objectives to learn a multimodal distribution given by its unnormalized density, or reward function, 𝑅. We use the synthetic hypergrid environment introduced by Bengio et al. (2021a) and further explored by Malkin et al. (2022). The states form a 𝐷-dimensional hypergrid with side length 𝐻, and the reward function has 2𝐷flat modes near the corners of the hypergrid. The states form a pointed DAG, where the source state is the origin 𝑠0 = 0, and each edge corresponds to the action of incrementing one coordinate in a state by 1 (without exiting the grid). More details about the environment are provided in D.1. We focus on the case where 𝑃𝐡is learned, which has been shown to accelerate convergence (Malkin et al., 2022). In Fig. 1, we compare how fast each learning objective discovers the 4 modes of a 128 128 grid, with an exploration parameter 𝑅0 = 0.001 in the reward function. The gap between the learned distribution 𝑃 𝐹and the target distribution is measured by the Jensen-Shannon divergence (JSD) Published as a conference paper at ICLR 2023 0k 200k 400k 600k 800k 1000k Trajectories sampled TB WS Forward KL Reverse KL Reverse WS 0k 200k 400k 600k 800k 1000k Trajectories sampled Target Distribution Figure 1: Top: The evolution of the JSD between the learned sampler 𝑃 𝐹and the target distribution on the 128 128 grid, as a function of the number of trajectories sampled. Shaded areas represent the standard error evaluated across 5 different runs (on-policy left, off-policy right). Bottom: The average (across 5 runs) final learned distribution 𝑃 𝐹for the different algorithms, along with the target distribution. To amplify variation, the plot intensity at each grid position is resampled from the Gaussian approximating the distribution over the 5 runs. 
Although WS, FORWARD KL, and REVERSE WS (off-policy) find the 4 target modes, they do not model them with high precision, and produce a textured pattern at the modes, where it should be flat. between the two distributions, to avoid giving a preference to one KL or the other. Additionally, we show graphical representations of the learned 2D terminating states distribution, along with the target distribution. We provide in E details on how 𝑃 𝐹and the JSD are evaluated and how hyperparameters were optimized separately for each learning algorithm. Exploration poses a challenge in this environment, given the distance that separates the different modes. We thus include in our analysis an off-policy version of each objective, where the behavior policy is different from, but related to, the trained sampler 𝑃𝐹(𝜏). The GFlow Net behavior policy used here encourages exploration by reducing the probability of terminating a trajectory at any state of the grid. This biases the learner towards sampling longer trajectories and helps with faster discovery of farther modes. When off-policy, the HVI gradients are corrected using importance sampling weights. For the algorithms that use a score function estimator of the gradient (FORWARD KL, REVERSE WS, and REVERSE KL), we found that using a global baseline, as explained in 2.2, was better than using the more common local baseline in most cases (see Fig. D.1). This brings the VI methods closer to GFlow Nets and thus factors out this issue from the comparison with the GFlow Net objectives. We see from Fig. 1 that while FORWARD KL and WS the two algorithms that use 𝐷KL(𝑃𝐡 𝑃𝐹) as the objective for 𝑃𝐹 discover the four modes of the distribution faster, they converge to a local minimum and do not model all the modes with high precision. This is due to the mean-seeking behavior of the forward KL objective, requiring that 𝑃 𝐹puts non-zero mass on terminating states π‘₯where 𝑅(π‘₯) > 0. Objectives that use the reverse KL to train the forward policy (REVERSE KL and REVERSE WS) are mode-seeking and can thus have a low loss without finding all the modes. The TB GFlow Net objective offers the best of both worlds, as it converges to a lower value of the JSD, discovers the four modes, and models them with high precision. This supports Observation 1. Additionally, in support of Observation 2, while both the TB objective and the HVI objectives benefit from off-policy sampling, TB benefits more, as convergence is greatly accelerated. We supplement this study with a comparative analysis of the algorithms on smaller grids in D.1. 4.2 MOLECULE SYNTHESIS We study the molecule synthesis task from Bengio et al. (2021a), in which molecular graphs are generated by sequential addition of subgraphs from a library of blocks (Jin et al., 2020; Kumar Published as a conference paper at ICLR 2023 4 8 10 16 reward exponent F(x) to log R(x) correlation Reverse KL on-policy TB on-policy Reverse KL off-policy TB off-policy learning rate Figure 2: Correlation between marginal sampling log-likelihood and log-reward on the molecule generation task for different learning algorithms, showing the advantage of off-policy TB (red) against on-policy TB (orange) and both on-policy (blue) and off-policy HVI (green). For each hyperparameter setting on the π‘₯-axis (𝛼or 𝛽), we take the optimal choice of the other hyperparameter (𝛽or 𝛼, respectively) and plot the mean and standard error region over three random seeds. et al., 2012). 
The reward function is expressed in terms of a fixed, pretrained graph neural network 𝑓that estimates the strength of binding to the soluble epoxide hydrolase protein (Trott & Olson, 2010). To be precise, 𝑅(π‘₯) = 𝑓(π‘₯)𝛽, where 𝑓(π‘₯) is the output of the binding model on molecule π‘₯ and 𝛽is a parameter that can be varied to control the entropy of the sampling model. Because the number of terminating states is too large to make exact computation of the target distribution possible, we use a performance metric from past work on this task (Bengio et al., 2021a) to evaluate sampling agents. Namely, for each molecule π‘₯in a held-out set, we compute log 𝑃 𝐹(π‘₯), the likelihood of π‘₯under the trained model (computable by dynamic programming, see E), and evaluate the Pearson correlation of log 𝑃 𝐹(π‘₯) and log 𝑅(π‘₯). This value should equal 1 for a perfect sampler, as log 𝑃 𝐹(π‘₯) and log 𝑅(π‘₯) would differ by a constant, the log-partition function log ˆ𝑍. In Malkin et al. (2022), GFlow Net samplers using the DB and TB objectives, with the backward policy 𝑃𝐡fixed to a uniform distribution over the parents of each state, were trained off-policy. Specifically, the trajectories used for DB and TB gradient updates were sampled from a mixture of the (online) forward policy 𝑃𝐹and a uniform distribution at each sampling step, with a special weight depending on the trajectory length used for the termination action. We wrote an extension of the published code of Malkin et al. (2022) with an implementation of the HVI (REVERSE KL) objective, using a reweighted importance sampling correction. We compare the off-policy TB from past work with the off-policy REVERSE KL, as well as on-policy TB and REVERSE KL objectives. (Note that on-policy TB and REVERSE KL are equivalent in expectation in this setting, since the backward policy is fixed.) Each of the four algorithms was evaluated with four values of the inverse temperature parameter 𝛽and of the learning rate 𝛼, for a total of 4 4 4 = 64 settings. (We also experimented with the off-policy FORWARD KL / WS objective for optimizing 𝑃𝐹, but none of the hyperparameter settings resulted in an average correlation greater than 0.1.) The results are shown in Fig. 2, in which, for each hyperparameter (𝛼or 𝛽), we plot the performance for the optimal value of the other hyperparameter. We make three observations: In support of Observation 2, off-policy REVERSE KL performs poorly compared to its on-policy counterpart, especially for smoother distributions (smaller values of 𝛽) where more diversity is present in the target distribution. Because the two algorithms agree in the expected gradient, this suggests that importance sampling introduces unacceptable variance into HVI gradients. In support of Observation 1, the difference between on-policy REVERSE KL and on-policy TB is quite small, consistent with their gradients coinciding in the limit of descent along the full-batch gradient field. However, REVERSE KL algorithms are more sensitive to the learning rate. In support of Observation 2, off-policy TB gives the best and lowest-variance fit to the target distribution, showing the importance of an exploratory training policy, especially for sparser reward landscapes (higher 𝛽). Published as a conference paper at ICLR 2023 Table 2: Comparison of the Jensen-Shannon divergence for Bayesian structure learning, showing the advantage of off-policy TB over on-policy TB and on-policy or off-policy HVI. 
The JSD is measured between the true posterior distribution 𝑝(𝐺| D) and the learned approximation 𝑃 𝐹(𝐺). Number of nodes Objective 3 4 5 (Modified) Detailed Balance 5.32 4.15 10 6 2.05 0.70 10 5 4.65 1.08 10 4 Off-Policy Trajectory Balance 3.70 2.51 10 7 9.35 2.99 10 6 5.44 2.47 10 4 On-Policy Trajectory Balance 0.022 0.007 0.123 0.028 0.277 0.040 On-Policy REVERSE KL (HVI) 0.022 0.007 0.125 0.027 0.306 0.042 Off-Policy REVERSE KL (HVI) 0.014 0.008 0.605 0.019 0.656 0.009 4.3 GENERATION OF DAGS IN BAYESIAN STRUCTURE LEARNING Finally, we consider the problem of learning the (posterior) distribution over the structure of Bayesian networks, as studied in Deleu et al. (2022). The goal of Bayesian structure learning is to approximate the posterior distribution 𝑝(𝐺| D) over DAGs 𝐺, given a dataset of observations D. Following Deleu et al. (2022), we treat the generation of a DAG as a sequential decision problem, where directed edges are added one at a time, starting from the completely disconnected graph. Since our goal is to approximate the posterior distribution 𝑝(𝐺| D), we use the joint probability 𝑅(𝐺) = 𝑝(𝐺, D) as the reward function, which is proportional to the former up to a normalizing constant. Details about how this reward is computed, as well as the parametrization of the forward policy 𝑃𝐹, are available in D.3. Note that similarly to 4.2, and following Deleu et al. (2022), we leave the backward policy 𝑃𝐡fixed to uniform. We only consider settings where the true posterior distribution 𝑝(𝐺| D) can be computed exactly by enumerating all the possible DAGs 𝐺over 𝑑nodes (for 𝑑 5). This allows us to exactly compare the posterior approximations, found either with the GFlow Net objectives or HVI, with the target posterior distribution. The state space grows rapidly with the number of nodes (e.g., there are 29k DAGs over 𝑑= 5 nodes). For each experiment, we sampled a dataset D of 100 observations from a randomly generated ground-truth graph 𝐺 ; the size of D was chosen to obtain highly multimodal posteriors. In addition to the (Modified) DB objective introduced by Deleu et al. (2022), we also study the TB (GFlow Net) and the REVERSE KL (HVI) objectives, both on-policy and off-policy. In Table 2, we compare the posterior approximations found using these different objectives in terms of their Jensen-Shannon divergence (JSD) to the target posterior distribution 𝑃(𝐺| D). We observe that on the easiest setting (graphs over 𝑑= 3 nodes), all methods accurately approximate the posterior distribution. But as we increase the complexity of the problem (with larger graphs), we observe that the accuracy of the approximation found with Off-Policy REVERSE KL degrades significantly, while the ones found with the off-policy GFlow Net objectives ((Modified) DB & TB) remain very accurate. We also note that the performance of On-Policy TB and On-Policy REVERSE KL degrades too, but not as significantly; furthermore, both of these methods achieve similar performance across all experimental settings, confirming our Observation 1, and the connection highlighted in 2.2. The consistent behavior of the off-policy GFlow Net objectives compared to the on-policy objectives (TB & REVERSE KL) as the problem increases in complexity (i.e., as the number of nodes 𝑑 increases, requiring better exploration) also supports our Observation 2. These observations are further confirmed when comparing the edge marginals 𝑃(𝑋𝑖 𝑋𝑗| D) in Fig. 
D.3 ( D.3), computed either with the target posterior distribution or with the posterior approximations. 5 DISCUSSION AND CONCLUSIONS The theory and experiments in this paper place GFlow Nets, which had been introduced and motivated as a reinforcement learning method, in the family of variational methods. They suggest that off-policy GFlow Net objectives may be an advantageous replacement to previous VI objectives, especially when the target distribution is highly multimodal, striking an interesting balance between the mode-seeking (REVERSE KL) and mean-seeking (FORWARD KL) VI variants. This work should prompt more research on how best to choose the behavior policy in off-policy GFlow Net training, seen as a means to efficiently explore and discover modes. Whereas the experiments performed here focused on the realm of discrete variables, future work should also investigate GFlow Nets for continuous action spaces as potential alternatives to VI in continuous-variable domains. We make some first steps in this direction in the Appendix ( F). While this paper was under review, Lahlou et al. (2023) introduced theory for continuous GFlow Nets and showed that some of our claims extend to continuous domains. Published as a conference paper at ICLR 2023 AUTHOR CONTRIBUTIONS N.M., X.J., D.Z., and Y.B. observed the connection between GFlow Nets and variational inference, providing motivation for the main ideas in this work. N.M., X.J., and T.D. did initial experimental exploration. S.L., N.M., and D.Z. contributed to the theoretical analysis. S.L. and N.M. extended the theoretical analysis to subtrajectory objectives. D.Z. reviewed the related work. S.L. performed experiments on the hypergrid domain. N.M. performed experiments on the molecule domain and the stochastic control domain. T.D., E.H., and K.E. performed experiments on the causal graph domain. All authors contributed to planning the experiments, analyzing their results, and writing the paper. ACKNOWLEDGMENTS The authors thank Moksh Jain for valuable discussions about the project. This research was enabled in part by computational resources provided by the Digital Research Alliance of Canada. All authors are funded by their primary institution. We also acknowledge funding from CIFAR, Genentech, Samsung, and IBM. Peter W Battaglia, Jessica B Hamrick, Victor Bapst, Alvaro Sanchez-Gonzalez, Vinicius Zambaldi, Mateusz Malinowski, Andrea Tacchetti, David Raposo, Adam Santoro, Ryan Faulkner, et al. Relational inductive biases, deep learning, and graph networks. ar Xiv preprint 1806.01261, 2018. Emmanuel Bengio, Moksh Jain, Maksym Korablyov, Doina Precup, and Yoshua Bengio. Flow network based generative models for non-iterative diverse candidate generation. Neural Information Processing Systems (Neur IPS), 2021a. Yoshua Bengio, Salem Lahlou, Tristan Deleu, Edward Hu, Mo Tiwari, and Emmanuel Bengio. GFlow Net foundations. ar Xiv preprint 2111.09266, 2021b. Christopher M. Bishop. Pattern Recognition and Machine Learning. Springer, 2006. David M. Blei, Michael I. Jordan, Thomas L. Griffiths, and Joshua B. Tenenbaum. Hierarchical topic models and the nested Chinese restaurant process. Neural Information Processing Systems (NIPS), 2003. J org Bornschein and Yoshua Bengio. Reweighted wake-sleep. International Conference on Learning Representations (ICLR), 2015. Yuri Burda, Roger Baker Grosse, and Ruslan Salakhutdinov. Importance weighted autoencoders. International Conference on Learning Representations (ICLR), 2016. Rewon Child. 
Very deep VAEs generalize autoregressive models and can outperform them on images. International Conference on Learning Representations (ICLR), 2021. Tristan Deleu, Ant onio G ois, Chris Emezue, Mansi Rankawat, Simon Lacoste-Julien, Stefan Bauer, and Yoshua Bengio. Bayesian structure learning with generative flow networks. Uncertainty in Artificial Intelligence (UAI), 2022. Adji Bousso Dieng, Dustin Tran, Rajesh Ranganath, John William Paisley, and David M. Blei. Variational inference via πœ’upper bound minimization. Neural Information Processing Systems (NIPS), 2017. Justin Domke and Daniel Sheldon. Importance weighting and variational inference. Neural Information Processing Systems (Neur IPS), 2018. Dan Geiger and David Heckerman. Learning Gaussian networks. In Uncertainty Proceedings 1994, pp. 235 243. Elsevier, 1994. Will Grathwohl, Dami Choi, Yuhuai Wu, Geoffrey Roeder, and David Kristjanson Duvenaud. Backpropagation through the void: Optimizing control variates for black-box gradient estimation. International Conference on Learning Representations (ICLR), 2018. Karol Gregor, Ivo Danihelka, Andriy Mnih, Charles Blundell, and Daan Wierstra. Deep Auto Regressive networks. International Conference on Machine Learning (ICML), 2014. Published as a conference paper at ICLR 2023 Jos e Miguel Hern andez-Lobato, Yingzhen Li, Mark Rowland, Thang D. Bui, Daniel Hern andez Lobato, and Richard E. Turner. Black-box alpha divergence minimization. International Conference on Machine Learning (ICML), 2016. Luke B. Hewitt, Tuan Anh Le, and Joshua B. Tenenbaum. Learning to learn generative programs with memoised wake-sleep. Uncertainty in Artificial Intelligence (UAI), 2020. Geoffrey E. Hinton, Peter Dayan, Brendan J. Frey, and R M Neal. The wake-sleep algorithm for unsupervised neural networks. Science, 268 5214:1158 61, 1995. Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. Neural Information Processing Systems (Neur IPS), 2020. Matthew D. Hoffman, David M. Blei, Chong Wang, and John William Paisley. Stochastic variational inference. Journal of Machine Learning Research (JMLR), 14:1303 1347, 2013. Moksh Jain, Emmanuel Bengio, Alex Hernandez-Garcia, Jarrid Rector-Brooks, Bonaventure F.P. Dossou, Chanakya Ekbote, Jie Fu, Tianyu Zhang, Micheal Kilgour, Dinghuai Zhang, Lena Simine, Payel Das, and Yoshua Bengio. Biological sequence design with GFlow Nets. International Conference on Machine Learning (ICML), 2022. Wengong Jin, Regina Barzilay, and Tommi Jaakkola. Chapter 11. junction tree variational autoencoder for molecular graph generation. Drug Discovery, pp. 228 249, 2020. ISSN 2041-3211. Michael I. Jordan, Zoubin Ghahramani, Tommi Jaakkola, and Lawrence K. Saul. An introduction to variational methods for graphical models. Machine Learning, 37:183 233, 2004. Diederik P. Kingma and Max Welling. Auto-encoding variational Bayes. International Conference on Learning Representations (ICLR), 2014. Jack Kuipers, Giusi Moffa, and David Heckerman. Addendum on the scoring of Gaussian directed acyclic graphical models. The Annals of Statistics, 42(4):1689 1691, 2014. Ashutosh Kumar, Arnout Voet, and Kam Y.J. Zhang. Fragment based drug design: from experimental to computational approaches. Current medicinal chemistry, 19(30):5128 5147, 2012. Salem Lahlou, Tristan Deleu, Pablo Lemos, Dinghuai Zhang, Alexandra Volokhova, Alex Hern andez-Garc Δ±a, L ena N ehale Ezzine, Yoshua Bengio, and Nikolay Malkin. A theory of continuous generative flow networks. 
ar Xiv preprint 2301.12594, 2023. Tuan Anh Le, Adam R. Kosiorek, N. Siddharth, Yee Whye Teh, and Frank Wood. Revisiting reweighted wake-sleep for models with stochastic control flow. Uncertainty in Artificial Intelligence (UAI), 2019. Tuan Anh Le, Katherine M. Collins, Luke B. Hewitt, Kevin Ellis, N. Siddharth, Samuel J. Gershman, and Joshua B. Tenenbaum. Hybrid memoised wake-sleep: Approximate inference at the discretecontinuous interface. International Conference on Learning Representations (ICLR), 2022. Yingzhen Li, Jos e Miguel Hern andez-Lobato, and Richard E. Turner. Stochastic expectation propagation. Neural Information Processing Systems (NIPS), 2015. Qiang Liu and Dilin Wang. Stein variational gradient descent: A general purpose Bayesian inference algorithm. Neural Information Processing Systems (NIPS), 2016. Ilya Loshchilov and Frank Hutter. SGDR: stochastic gradient descent with warm restarts. International Conference on Learning Representations (ICLR), 2017. Lars MaalΓΈe, Marco Fraccaro, Valentin Li evin, and Ole Winther. BIVA: A very deep hierarchy of latent variables for generative modeling. Neural Information Processing Systems (Neur IPS), 2019. Kanika Madan, Jarrid Rector-Brooks, Maksym Korablyov, Emmanuel Bengio, Moksh Jain, Andrei Nica, Tom Bosc, Yoshua Bengio, and Nikolay Malkin. Learning GFlow Nets from partial episodes for improved convergence and stability. ar Xiv preprint 2209.12782, 2022. Published as a conference paper at ICLR 2023 Nikolay Malkin, Moksh Jain, Emmanuel Bengio, Chen Sun, and Yoshua Bengio. Trajectory balance: Improved credit assignment in GFlow Nets. Neural Information Processing Systems (Neur IPS), 2022. Vaden Masrani, Tuan Anh Le, and Frank D. Wood. The thermodynamic variational objective. Neural Information Processing Systems (Neur IPS), 2019. Thomas P. Minka. Expectation propagation for approximate Bayesian inference. ar Xiv preprint 1301.2294, 2001. Thomas P. Minka. Divergence measures and message passing. 2005. Andriy Mnih and Karol Gregor. Neural variational inference and learning in belief networks. International Conference on Machine Learning (ICML), 2014. Andriy Mnih and Danilo Jimenez Rezende. Variational inference for Monte Carlo objectives. International Conference on Machine Learning (ICML), 2016. John William Paisley, David M. Blei, and Michael I. Jordan. Variational Bayesian inference with stochastic search. International Conference on Machine Learning (ICML), 2012. Tom Rainforth, Adam R. Kosiorek, Tuan Anh Le, Chris J. Maddison, Maximilian Igl, Frank Wood, and Yee Whye Teh. Tighter variational bounds are not necessarily better. International Conference on Machine Learning (ICML), 2018. Rajesh Ranganath, Sean Gerrish, and David Blei. Black box variational inference. Artificial Intelligence and Statistics (AISTATS), 2014. Rajesh Ranganath, Dustin Tran, Jaan Altosaar, and David M. Blei. Operator variational inference. Neural Information Processing Systems (NIPS), 2016a. Rajesh Ranganath, Dustin Tran, and David Blei. Hierarchical variational models. International Conference on Machine Learning (ICML), 2016b. Danilo Jimenez Rezende, Shakir Mohamed, and Daan Wierstra. Stochastic backpropagation and approximate inference in deep generative models. International Conference on Machine Learning (ICML), 2014. Lorenz Richter, Ayman Boustati, Nikolas N usken, Francisco J. R. Ruiz, and Omer Deniz Akyildiz. Var Grad: A low-variance gradient estimator for variational inference. Neural Information Processing Systems (Neur IPS), 2020. Lawrence K. 
Saul, T. Jaakkola, and Michael I. Jordan. Mean field theory for sigmoid belief networks. Journal of Artificial Intelligence Research, 4:61 76, 1996. Artem Sobolev and Dmitry Vetrov. Importance weighted hierarchical variational inference. Neural Information Processing Systems (Neur IPS), 2019. Casper Kaae SΓΈnderby, Tapani Raiko, Lars MaalΓΈe, SΓΈren Kaae SΓΈnderby, and Ole Winther. Ladder variational autoencoders. Neural Information Processing Systems (NIPS), 2016. Michalis K. Titsias and Miguel L azaro-Gredilla. Doubly stochastic variational Bayes for nonconjugate inference. International Conference on Machine Learning (ICML), 2014. Oleg Trott and Arthur J Olson. Auto Dock Vina: improving the speed and accuracy of docking with a new scoring function, efficient optimization, and multithreading. Journal of Computational Chemistry, 31(2):455 461, 2010. George Tucker, Andriy Mnih, Chris J. Maddison, John Lawson, and Jascha Narain Sohl-Dickstein. REBAR: Low-variance, unbiased gradient estimates for discrete latent variable models. Neural Information Processing Systems (NIPS), 2017. Arash Vahdat and Jan Kautz. Nvae: A deep hierarchical variational autoencoder. Neural Information Processing Systems (Neur IPS), 2020. Neng Wan, Dapeng Li, and Naira Hovakimyan. f-divergence variational inference. Neural Information Processing Systems (Neur IPS), 2020. Published as a conference paper at ICLR 2023 Dilin Wang, Hao Liu, and Qiang Liu. Variational inference with tail-adaptive f-divergence. Neural Information Processing Systems (Neur IPS), 2018. Lex Weaver and Nigel Tao. The optimal reward baseline for gradient-based reinforcement learning. Uncertainty in Artificial Intelligence (UAI), 2001. Ronald J Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3):229 256, 1992. Cathy Wu, Aravind Rajeswaran, Yan Duan, Vikash Kumar, Alexandre M. Bayen, Sham M. Kakade, Igor Mordatch, and Pieter Abbeel. Variance reduction for policy gradient with action-dependent factorized baselines. International Conference on Learning Representations (ICLR), 2018. Mingzhang Yin and Mingyuan Zhou. Semi-implicit variational inference. International Conference on Machine Learning (ICML), 2018. Cheng Zhang, Judith B utepage, Hedvig Kjellstr om, and Stephan Mandt. Advances in variational inference. IEEE Transactions on Pattern Analysis and Machine Intelligence, 41:2008 2026, 2019a. Dinghuai Zhang, Ricky T. Q. Chen, Nikolay Malkin, and Yoshua Bengio. Unifying generative models with GFlow Nets. ar Xiv preprint 2209.02606v1, 2022a. Dinghuai Zhang, Nikolay Malkin, Zhen Liu, Alexandra Volokhova, Aaron Courville, and Yoshua Bengio. Generative flow networks for discrete probabilistic modeling. International Conference on Machine Learning (ICML), 2022b. Dinghuai Zhang, Ricky T. Q. Chen, Nikolay Malkin, and Yoshua Bengio. Unifying generative models with GFlow Nets and beyond. ar Xiv preprint 2209.02606, 2023. Mingtian Zhang, Thomas Bird, Raza Habib, Tianlin Xu, and David Barber. Variational f-divergence minimization. ar Xiv preprint 1907.11891, 2019b. Heiko Zimmermann, Hao Wu, Babak Esmaeili, Sam Stites, and Jan-Willem van de Meent. Nested variational inference. Neural Information Processing Systems (Neur IPS), 2021. Heiko Zimmermann, Fredrik Lindsten, Jan-Willem van de Meent, and Christian A. Naesseth. A variational perspective on generative flow networks. ar Xiv preprint 2210.07992, 2022. 
Published as a conference paper at ICLR 2023 A CANONICAL CONSTRUCTION OF A GRADED DAG Figure A.1: Illustration of the process by which a DAG (left) can turn into a graded DAG (right). Nodes with a double border represent terminating states. Nodes with a dashed border represent dummy states added to make the DAG graded. Fig. A.1 shows the canonical conversion of a DAG into a graded DAG as described in 2.2. Note that this operation is idempotent: applying it to a graded DAG yields the same graded DAG. We prove Prop. 1. Proof For a complete trajectory 𝜏 T, denote by 𝑐(𝜏) = log 𝑃𝐹(𝜏) 𝑅(π‘₯𝜏)𝑃𝐡(𝜏|π‘₯𝜏) . We have the following: πœƒπ‘(𝜏) = πœƒlog 𝑃𝐹(𝜏) (13) πœ™π‘(𝜏) = πœ™log 𝑃𝐡(𝜏| π‘₯𝜏) = πœ™log 𝑃𝐡(𝜏) (14) Denoting by 𝑓1 : 𝑑 𝑑log 𝑑and 𝑓2 : 𝑑 log 𝑑, which correspond to the forward and reverse KL divergences respectively, and starting from LHVI, 𝑓2(𝑃𝐹, 𝑃𝐡) = 𝐷𝐾𝐿(𝑃𝐹 𝑃𝐡) = E𝜏 𝑃𝐹 = E𝜏 𝑃𝐹[𝑐(𝜏)] + log ˆ𝑍, LHVI, 𝑓1(𝑃𝐹, 𝑃𝐡) = 𝐷𝐾𝐿(𝑃𝐡 𝑃𝐹) = E𝜏 𝑃𝐡 = E𝜏 𝑃𝐡[𝑐(𝜏)] + log ˆ𝑍 , πœƒLHVI, 𝑓2(𝑃𝐹, 𝑃𝐡) = πœƒE𝜏 𝑃𝐹[𝑐(𝜏)] = E𝜏 𝑃𝐹[ πœƒlog 𝑃𝐹(𝜏)𝑐(𝜏) + πœƒπ‘(𝜏)], πœ™LHVI, 𝑓1(𝑃𝐹, 𝑃𝐡) = πœ™E𝜏 𝑃𝐡[𝑐(𝜏)] = E𝜏 𝑃𝐡[ πœ™log 𝑃𝐡(𝜏)𝑐(𝜏) + πœ™π‘(𝜏)]. From (13) and (14), we obtain: E𝜏 𝑃𝐹[ πœƒπ‘(𝜏)] = E𝜏 𝑃𝐹[ πœƒlog 𝑃𝐹(𝜏)] = 𝜏 T 𝑃𝐹(𝜏) πœƒlog 𝑃𝐹(𝜏) = 𝜏 T πœƒπ‘ƒπΉ(𝜏) = πœƒ1 = 0 Hence, for any scalar 𝑍> 0, we can write: E𝜏 𝑃𝐹[ πœƒπ‘(𝜏)] = 0 = E𝜏 𝑃𝐹[ πœƒlog 𝑃𝐹(𝜏) log 𝑍] and similarly Eπœ™ 𝑃𝐹[ πœ™π‘(𝜏)] = 0 = E𝜏 𝑃𝐡[ πœ™log 𝑃𝐡(𝜏) log 𝑍]. Plugging these two equalities back in the HVI gradients above, we obtain: πœƒLHVI, 𝑓2 (𝑃𝐹, 𝑃𝐡) = E𝜏 𝑃𝐹[ πœƒlog 𝑃𝐹(𝜏) log 𝑍𝑃𝐹(𝜏) 𝑅(π‘₯𝜏)𝑃𝐡(𝜏| π‘₯𝜏) ] πœ™LHVI, 𝑓1(𝑃𝐹, 𝑃𝐡) = E𝜏 𝑃𝐡[ πœƒlog 𝑃𝐡(𝜏) log 𝑍𝑃𝐹(𝜏) 𝑅(π‘₯𝜏)𝑃𝐡(𝜏| π‘₯𝜏) ] Published as a conference paper at ICLR 2023 The last two equalities hold for any scalar 𝑍(that does not depend on the parameters of 𝑃𝐹, 𝑃𝐡, and that does not depend on any trajectory). In particular, the equations hold for the parameter 𝑍of the Trajectory Balance objective. It thus follows that: πœƒLHVI, 𝑓2(𝑃𝐹, 𝑃𝐡) = 1 log 𝑍𝑃𝐹(𝜏) 𝑅(π‘₯𝜏)𝑃𝐡(𝜏| π‘₯𝜏) 2E𝜏 𝑃𝐡[ πœƒLTB(𝜏; 𝑃𝐹, 𝑃𝐡, 𝑍)] πœ™LHVI, 𝑓1(𝑃𝐹, 𝑃𝐡) = 1 log 𝑍𝑃𝐹(𝜏) 𝑅(π‘₯𝜏)𝑃𝐡(𝜏| π‘₯𝜏) 2E𝜏 𝑃𝐡[ πœ™LTB(𝜏; 𝑃𝐹, 𝑃𝐡, 𝑍)] As an immediate corollary, we obtain that the expected on-policy TB gradient does not depend on the estimated partition function 𝑍. Next, we will prove the identity (9), which we restate here: E𝜏 𝑃𝐹[ πœ™LTB(𝜏)] = πœ™ h 𝐷log2 (𝑃𝐡 𝑃𝐹) + 2(log 𝑍 log ˆ𝑍)𝐷KL(𝑃𝐹 𝑃𝐡) i . (15) Proof The RHS of (15) equals " log 𝑃𝐡(𝜏| π‘₯𝜏)𝑅(π‘₯𝜏) 2 + 2(log 𝑍 log ˆ𝑍) log 𝑃𝐹(𝜏) ˆ𝑍 𝑃𝐡(𝜏| π‘₯𝜏)𝑅(π‘₯𝜏) log 𝑃𝐡(𝜏| π‘₯𝜏)𝑅(π‘₯𝜏) 2 + 2(log 𝑍 log ˆ𝑍) log 𝑃𝐹(𝜏) ˆ𝑍 𝑃𝐡(𝜏| π‘₯𝜏)𝑅(π‘₯𝜏) 2 πœ™log 𝑃𝐡(𝜏| π‘₯𝜏) log 𝑃𝐡(𝜏| π‘₯𝜏)𝑅(π‘₯𝜏) ˆ𝑍𝑃𝐹(𝜏) 2(log 𝑍 log ˆ𝑍) πœ™log 𝑃𝐡(𝜏| π‘₯𝜏) πœ™log 𝑃𝐡(𝜏| π‘₯𝜏) log 𝑃𝐡(𝜏| π‘₯𝜏)𝑅(π‘₯𝜏) =E𝜏 𝑃𝐹[ πœ™LTB(𝜏)] C A VARIATIONAL OBJECTIVE FOR SUBTRAJECTORIES In this section, we extend the claim made in Prop. 1 to connect alternative GFlow Net losses to other variational objectives. Prop. 1 is thus a partial case of Prop. 2. This provides an alternative proof to Prop. 1. The detailed balance objective (DB): The loss proposed in (Bengio et al., 2021b) parametrizes a GFlow Net using its forward and backward policies 𝑃𝐹and 𝑃𝐡respectively, along with a state flow function 𝐹, which is a positive function of the states, that matches the target reward function on the terminating states. It decomposes as a sum of transition-dependent losses: 𝑠 𝑠 A LDB(𝑠 𝑠 ; 𝑃𝐹, 𝑃𝐡, 𝐹) = log 𝐹(𝑠)𝑃𝐹(𝑠 | 𝑠) 𝐹(𝑠 )𝑃𝐡(𝑠| 𝑠 ) 2 , where 𝐹(𝑠 ) = 𝑅(𝑠 ) if 𝑠 X. 
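The following is a minimal PyTorch sketch of the per-transition DB loss just defined, assuming the caller supplies the log state flows and the forward and backward log-probabilities for a single transition. The argument names and numerical values are illustrative, and enforcing the boundary condition F(s′) = R(s′) for terminating s′ is left to the caller.

```python
import torch

def detailed_balance_loss(log_F_s, log_F_s_next, log_pf, log_pb):
    """L_DB(s -> s') = (log F(s) + log P_F(s'|s) - log F(s') - log P_B(s|s'))^2.

    For a terminating child s', the caller should pass log_F_s_next = log R(s').
    All arguments are scalar tensors for a single transition.
    """
    return (log_F_s + log_pf - log_F_s_next - log_pb) ** 2

# Usage sketch with hypothetical values for one transition s -> s'.
loss = detailed_balance_loss(
    log_F_s=torch.tensor(0.7, requires_grad=True),   # log F(s), learned state flow
    log_F_s_next=torch.tensor(-0.2),                 # log F(s') (or log R(s') if s' terminates)
    log_pf=torch.log(torch.tensor(0.25)),            # log P_F(s' | s)
    log_pb=torch.log(torch.tensor(0.5)),             # log P_B(s | s')
)
loss.backward()
```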
The subtrajectory balance objective (Sub TB): Both the DB and TB objectives can be seen as special instances of the subtrajectory balance objective (Malkin et al., 2022; Madan et al., 2022). Malkin et al. (2022) suggested instead of defining the state flow function 𝐹for every state 𝑠, a state flow function could be defined on a subset of the state space S, called the hub states. The loss can be decomposed into a sum of subtrajectory-dependent losses: 𝜏= (𝑠1, . . . , 𝑠𝑛) T partial LSub TB(𝜏; 𝑃𝐹, 𝑃𝐡, 𝐹) = log 𝐹(𝑠1)𝑃𝐹(𝜏) 𝐹(𝑠𝑛)𝑃𝐡(𝜏| 𝑠𝑑) where 𝑃𝐹(𝜏) is defined for partial trajectories similarly to complete trajectories (2), 𝑃𝐡(𝜏| 𝑠) = Î (𝑠 𝑠 ) πœπ‘ƒπ΅(𝑠| 𝑠 ), and we again fix 𝐹(π‘₯) = 𝑅(π‘₯) for terminating states π‘₯ X). The Sub TB objective reduces to the DB objective for subtrajectories of length 1 and to the TB objective for complete trajectories, in which case we use 𝑍to denote 𝐹(𝑠0). Published as a conference paper at ICLR 2023 A variational objective for transitions: From now on, we work with a graded DAG G = (S, A), in which the state space S is decomposed into layers: S = Ã𝐿 𝑙=0 S𝑙, with S0 = {𝑠0} and S𝐿= X. HVI provides a class of algorithms to learn forward and backward policies on G. Rather than learning these policies (𝑃𝐹and 𝑃𝐡) using a variational objective requiring distributions over complete trajectories, nested variational inference (NVI; Zimmermann et al., 2021)), which combines nested importance sampling and variational inference, defines an objective dealing with distributions over transitions, or edges. To this end, it makes use of positive functions πΉπ‘˜of the states π‘ π‘˜ Sπ‘˜, for π‘˜= 0, . . . , 𝐿 1, to define two sets of distributions Λ‡π‘π‘˜and Λ†π‘π‘˜over edges from Sπ‘˜to Sπ‘˜+1: Λ†π‘π‘˜(π‘ π‘˜ π‘ π‘˜+1) πΉπ‘˜(π‘ π‘˜)𝑃𝐹(π‘ π‘˜+1 | π‘ π‘˜) Λ‡π‘π‘˜(π‘ π‘˜ π‘ π‘˜+1) 𝑅(𝑠𝐿)𝑃𝐡(π‘ π‘˜| 𝑠𝐿) π‘˜= 𝐿 1 πΉπ‘˜+1(π‘ π‘˜+1)𝑃𝐡(π‘ π‘˜| π‘ π‘˜+1) otherwise . (18) Learning the policies 𝑃𝐹, 𝑃𝐡and the functions πΉπ‘˜is done by minimizing losses of the form: LNVI(𝑃𝐹, 𝑃𝐡, 𝐹) = π‘˜=0 𝐷𝑓( Λ‡π‘π‘˜ Λ†π‘π‘˜) (19) The positive function πΉπ‘˜plays the same role as the state flow function in GFlow Nets (in the DB objective in particular). Before drawing the links between DB and NVI, we first propose a natural extension of NVI to subtrajectories. C.1 A VARIATIONAL OBJECTIVE FOR SUBTRAJECTORIES Consider a graded DAG G = (S, A) where S = Ã𝐿 𝑙=0 S𝑙, S0 = {𝑠0}, S𝐿= X. Amongst the 𝐿+ 1 layers 𝑙= 0, . . . , 𝐿, we consider 𝐾+ 1 𝐿+ 1 special layers, that we call junction layers, of which the states are called hub states. We denote by π‘š0, . . . , π‘šπΎthe indices of these layers, and we constrain π‘š0 = 0 to represent the layer comprised of the source state only, and π‘šπΎ= 𝐿representing the terminating states X. On each non-terminating junction layer π‘šπ‘˜ 𝐿, we define a state flow function πΉπ‘˜: Sπ‘šπ‘˜ R +. Given any forward and backward policies 𝑃𝐹and 𝑃𝐡respectively, consistent with the DAG G, the state flow functions define two sets of distributions Λ‡π‘π‘˜and Λ†π‘π‘˜over partial trajectories starting from a state π‘ π‘šπ‘˜ Sπ‘šπ‘˜and ending in a state π‘ π‘šπ‘˜+1 Sπ‘šπ‘˜+1 (we denote by Tπ‘˜the set comprised of these partial trajectories, for π‘˜= 0 . . . 𝐾 1): πœπ‘˜= (π‘ π‘šπ‘˜ . . . π‘ π‘šπ‘˜+1) Tπ‘˜ Λ†π‘π‘˜(πœπ‘˜) πΉπ‘˜(π‘ π‘šπ‘˜)𝑃𝐹(πœπ‘˜), (20) πœπ‘˜= (π‘ π‘šπ‘˜ . . . 
π‘ π‘šπ‘˜+1) Tπ‘˜ Λ‡π‘π‘˜(πœπ‘˜) πΉπ‘˜+1(π‘ π‘šπ‘˜+1)𝑃𝐡(πœπ‘˜| π‘ π‘šπ‘˜+1), (21) where 𝐹𝐾is fixed to the target reward function 𝑅. Lemma 1 If Λ†π‘π‘˜= Λ‡π‘π‘˜for all π‘˜= 0 . . . 𝐾 1, then the forward policy 𝑃𝐹induces a terminating state distribution 𝑃 𝐹that matches the target unnormalized distribution (or reward function) 𝑅. Proof Consider a complete trajectory 𝜏= (π‘ π‘š0 . . . π‘ π‘š1 . . . . . . π‘ π‘š2 . . . . . . π‘ π‘šπΎ). And let πœπ‘˜= (π‘ π‘šπ‘˜ . . . π‘ π‘šπ‘˜+1), for every π‘˜< 𝐾. Denote by Λ†π‘π‘˜and Λ‡π‘π‘˜the partition functions (constant of proportionality in (18)) of Λ†π‘π‘˜and Λ‡π‘π‘˜ respectively, for every π‘˜< 𝐾. It is straightforward to see that for every 0 < π‘˜< 𝐾: Λ†π‘π‘˜+1 = Λ‡π‘π‘˜= π‘ π‘šπ‘˜+1 Sπ‘šπ‘˜+1 πΉπ‘˜+1(π‘ π‘šπ‘˜+1) (22) π‘˜=0 Λ†π‘π‘˜(πœπ‘˜) = Î𝐾 1 π‘˜=0 πΉπ‘˜(π‘ π‘šπ‘˜) Î𝐾 1 π‘˜=0 Λ†π‘π‘˜ 𝑃𝐹(𝜏), (23) π‘˜=0 Λ‡π‘π‘˜(πœπ‘˜) = Î𝐾 1 π‘˜=0 πΉπ‘˜+1(π‘ π‘šπ‘˜+1) Î𝐾 1 π‘˜=0 Λ‡π‘π‘˜ 𝑃𝐡(𝜏| π‘ π‘šπΎ). (24) Because Λ†π‘π‘˜= Λ‡π‘π‘˜for all π‘˜= 0 . . . 𝐾 1, then both right-hand sides of (23) and (24) are equal. Combining this with (22), we obtain: ˆ𝑍0 | {z } =1 𝑃𝐹(𝜏) = 𝑅(π‘₯𝜏) Í π‘₯ X 𝑅(π‘₯) 𝑃𝐡(𝜏| π‘₯), (25) Published as a conference paper at ICLR 2023 which implies the TB constraint is satisfied for all 𝜏 T. Malkin et al. (2022) shows that this is a sufficient condition for the terminating state distribution induced by 𝑃𝐹to match the target reward function 𝑅, which completes the proof. Similar to NVI, we can use Lemma 1 to define objective functions for 𝑃𝐹, 𝑃𝐡, πΉπ‘˜, of the form: LSub NVI, 𝑓(𝑃𝐹, 𝑃𝐡, 𝐹0:𝐾 1) = π‘˜=1 𝐷𝑓( Λ‡π‘π‘˜ Λ†π‘π‘˜) (26) Note that the Sub NVI objective of (26) matches the NVI objective (Zimmermann et al., 2021) when all layers are junction layers (i.e. 𝐾= 𝐿, and π‘šπ‘˜= π‘˜for all π‘˜ 𝐿), and matches the HVI objective of (5) when only the first and last layers are junction layers (i.e. 𝐾= 1, π‘š0 = 0, and π‘š1 = 𝐿). C.2 AN EQUIVALENCE BETWEEN THE SUBNVI AND THE SUBTB OBJECTIVES Proposition 2 Given a graded DAG G as in 2.1, with junction layers π‘š0 = 0, π‘š1, . . . , π‘šπΎ= 𝐿as in C.1. For any forward and backward policies, and for any positive function πΉπ‘˜defined for the hubs, consider Λ†π‘π‘˜and Λ‡π‘π‘˜defined in (20) and (21). The subtrajectory variational objectives of (26) are equivalent to the subtrajectory balance objective (17) for specific choices of the 𝑓-divergences. Namely, denoting by πœƒ, πœ™the parameters of 𝑃𝐹, 𝑃𝐡respectively: Eπœπ‘˜ Λ‡π‘π‘˜[ πœ™LSub TB(πœπ‘˜; 𝑃𝐹, 𝑃𝐡, 𝐹)] = 2 πœ™π·π‘“1( Λ‡π‘π‘˜ Λ†π‘π‘˜) (27) Eπœπ‘˜ Λ†π‘π‘˜[ πœƒLSub TB(πœπ‘˜; 𝑃𝐹, 𝑃𝐡, 𝐹)] = 2 πœƒπ·π‘“2( Λ‡π‘π‘˜ Λ†π‘π‘˜) (28) where 𝐹= 𝐹0:𝐾 1, and 𝑓1 : 𝑑 𝑑log 𝑑and 𝑓2 : 𝑑 log 𝑑. Proof For a subtrajectory πœπ‘˜= (π‘ π‘šπ‘˜ . . . π‘ π‘šπ‘˜+1) Tπ‘˜, let 𝑐(πœπ‘˜) = log πΉπ‘˜(π‘ π‘šπ‘˜)𝑃𝐹(πœπ‘˜) πΉπ‘˜+1(π‘ π‘šπ‘˜+1 )𝑃𝐡(πœπ‘˜|π‘ π‘šπ‘˜+1 ) . First, note that because Λ†π‘π‘˜and Λ‡π‘π‘˜are not functions of πœ™, πœƒ((23)): πœ™π‘(πœπ‘˜) = πœ™log πΉπ‘˜+1(π‘ π‘šπ‘˜+1)𝑃𝐡(πœπ‘˜| π‘ π‘šπ‘˜+1) Λ‡π‘π‘˜ = πœ™log Λ‡π‘π‘˜(πœπ‘˜) (29) πœƒπ‘(πœπ‘˜) = πœƒlog πΉπ‘˜(π‘ π‘šπ‘˜)𝑃𝐹(πœπ‘˜) Λ†π‘π‘˜ = πœ™log Λ†π‘π‘˜(πœπ‘˜) (30) Published as a conference paper at ICLR 2023 We will prove (27). The proof of (28) follows the same reasoning, and is left as an exercise for the reader. 
Since D_{f_1}(̌p_k ‖ ̂p_k) = D_{KL}(̌p_k ‖ ̂p_k), we have
\[ \nabla_\phi D_{f_1}(\check p_k \,\|\, \hat p_k) = \nabla_\phi \sum_{\tau_k\in\mathcal{T}_k} \check p_k(\tau_k)\,\log\frac{\check p_k(\tau_k)}{\hat p_k(\tau_k)} = \nabla_\phi\Bigl(-\sum_{\tau_k\in\mathcal{T}_k} \check p_k(\tau_k)\,c(\tau_k)\Bigr) + \underbrace{\nabla_\phi \log\frac{\hat Z_k}{\check Z_k}}_{=0,\ \text{according to (23)}} \]
\[ = -\sum_{\tau_k\in\mathcal{T}_k} \bigl(\nabla_\phi \check p_k(\tau_k)\,c(\tau_k) + \check p_k(\tau_k)\,\nabla_\phi c(\tau_k)\bigr) = -\sum_{\tau_k\in\mathcal{T}_k} \bigl(\check p_k(\tau_k)\,\nabla_\phi \log \check p_k(\tau_k)\,c(\tau_k) + \check p_k(\tau_k)\,\nabla_\phi c(\tau_k)\bigr) \]
\[ = -\mathbb{E}_{\tau_k\sim\check p_k}[\nabla_\phi \log \check p_k(\tau_k)\,c(\tau_k)] + \sum_{\tau_k\in\mathcal{T}_k} \check p_k(\tau_k)\,\nabla_\phi \log \check p_k(\tau_k) \qquad \text{(following (29))} \]
\[ = -\mathbb{E}_{\tau_k\sim\check p_k}[\nabla_\phi \log P_B(\tau_k\mid s_{m_{k+1}})\,c(\tau_k)] + \underbrace{\nabla_\phi \sum_{\tau_k\in\mathcal{T}_k} \check p_k(\tau_k)}_{=0} \]
\[ = \mathbb{E}_{\tau_k\sim\check p_k}\!\left[\nabla_\phi \log P_B(\tau_k\mid s_{m_{k+1}})\,\log\frac{F_{k+1}(s_{m_{k+1}})\,P_B(\tau_k\mid s_{m_{k+1}})}{F_k(s_{m_k})\,P_F(\tau_k)}\right] \]
\[ = \frac{1}{2}\,\mathbb{E}_{\tau_k\sim\check p_k}\!\left[\nabla_\phi\!\left(\log\frac{F_k(s_{m_k})\,P_F(\tau_k)}{F_{k+1}(s_{m_{k+1}})\,P_B(\tau_k\mid s_{m_{k+1}})}\right)^{\!2}\right] = \frac{1}{2}\,\mathbb{E}_{\tau_k\sim\check p_k}[\nabla_\phi \mathcal{L}_{\mathrm{SubTB}}(\tau_k; P_F, P_B, F)], \]
which is (27).

As a special case of Prop. 2, when the state flow function is defined for s_0 only (and for the terminating states, where it equals the target reward function), i.e., when K = 1, the distributions ̂p_0(τ) and P_F(τ) are equal, and so are the distributions ̌p_0(τ) and P_B(τ). We thus obtain the first two equations of Prop. 1 as a consequence of Prop. 2.

D ADDITIONAL EXPERIMENTAL DETAILS

D.1 HYPERGRID EXPERIMENTS

Details about the environment: For completeness, we provide more details about the environment, as described in Malkin et al. (2022). In a D-dimensional hypergrid of side length H, the state space S is partitioned into the non-terminating states S^o = {0, ..., H−1}^D and the terminating states X = S^⊥, a disjoint copy of {0, ..., H−1}^D. The initial state is the origin (0, ..., 0) ∈ S^o, and in addition to the transitions from one non-terminating state to another (obtained by incrementing one coordinate of the state), an exit action is available at every s ∈ S^o, leading to the corresponding terminating state s^⊥ ∈ S^⊥. The reward at a terminating state s^⊥ with coordinates (s_1, ..., s_D) is
\[ R(s^\perp) = R_0 + 0.5\,\prod_{d=1}^{D} \mathbf{1}\!\left[\left|\frac{s_d}{H-1} - 0.5\right| \in (0.25, 0.5]\right] + 2\,\prod_{d=1}^{D} \mathbf{1}\!\left[\left|\frac{s_d}{H-1} - 0.5\right| \in (0.3, 0.4)\right], \tag{31} \]
where R_0 is an exploration parameter (lower values make exploration harder).

Architectural details: The forward and backward policies are parametrized as neural networks with 2 hidden layers of 256 units each. The networks take as input a one-hot representation of a state (also called a K-hot or multi-hot representation), which is an H·D-dimensional vector containing exactly D ones and (H−1)D zeros, and output the logits of P_F and P_B respectively. Forbidden actions (e.g., incrementing a coordinate that has already reached its maximum value H−1) are masked out by setting the corresponding logits to −∞ after the forward pass. Unlike Malkin et al. (2022), we do not tie the parameters of P_F and P_B.

Figure D.1: A comparison of the type of baseline used (local or global) for the three HVI algorithms that use a score-function estimator of the gradient.

Behavior policy: The behavior policy is obtained from the forward policy P_F by subtracting a scalar ε from the logits output by the forward policy neural network. The value of ε is decayed from ε_init to 0 following a cosine annealing schedule (Loshchilov & Hutter, 2017), and the value ε = 0 is reached at iteration T_max. The values of ε_init and T_max were treated as hyperparameters.
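For concreteness, the reward (31) and a cosine-annealed exploration offset can be written in a few lines of NumPy. This is a minimal sketch based on the description above; the function names and the exact form of the annealing are our own assumptions, not the released code:

```python
import numpy as np

def hypergrid_reward(s, H, R0=1e-3):
    """Reward (31) for a terminating state with coordinates s in {0, ..., H-1}^D."""
    s = np.asarray(s, dtype=float)
    u = np.abs(s / (H - 1) - 0.5)                 # per-coordinate distance from the centre
    plateau = np.all((u > 0.25) & (u <= 0.5))     # wide band near the corners
    peak = np.all((u > 0.3) & (u < 0.4))          # narrow band: the 2^D reward modes
    return R0 + 0.5 * float(plateau) + 2.0 * float(peak)

def epsilon_at(step, eps_init, T_max):
    """Cosine annealing of the exploration offset from eps_init (step 0) to 0 (step T_max)."""
    if step >= T_max:
        return 0.0
    return 0.5 * eps_init * (1.0 + np.cos(np.pi * step / T_max))

# Example: a state inside one of the four modes of a 128 x 128 grid (D = 2, H = 128).
print(hypergrid_reward([108, 20], H=128), epsilon_at(1000, eps_init=1.0, T_max=10_000))
```

The value returned by epsilon_at is the offset ε applied to the logits at a given training iteration, with ε_init and T_max tuned as described above.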
Hyperparameter optimization: Our experiments showed that HVI objectives were brittle to the choice of hyperparameters (mainly learning rates), and that the ones used for trajectory balance in Malkin et al. (2022) do not perform as well in the larger 128×128 grid we considered. To obtain a fair comparison between GFlowNets and HVI methods, particular care was given to the optimization of hyperparameters in this domain. The optimization was performed in two stages:

1. We use a batch size of 64 for all learning objectives, whether on-policy or off-policy, and the Adam optimizer with secondary parameters set to their default values, for the parameters of P_F, the parameters of P_B, and log Z (which is initialized at 0). The learning rates of P_F, P_B, and log Z, along with a schedule factor γ < 1 by which they are multiplied when the JSD plateaus for more than 500 iterations (i.e., 500 × 64 sampled trajectories), were sought separately for each combination of learning objective and sampling method (on-policy or off-policy), using a Bayesian search with the JSD evaluated after 200K trajectories as the optimization target. The choice of the baseline for HVI methods (except WS, which does not have a score-function estimator of the gradient) was treated as a hyperparameter as well.

2. All objectives were then trained for 10^6 trajectories using all the combinations of hyperparameters found in the first stage, for 5 seeds each. The final set of hyperparameters for each objective and sampling mode was then chosen as the one that leads to the lowest area under the JSD curve, approximated with the trapezoid method (see the sketch after Fig. D.2).

For off-policy runs, T_max was defined as a fraction 1/n of the total number of iterations (which is equal to 10^6/64). The values of n and ε_init were optimized in the same way as the learning rates and the schedule, as described above.

In Fig. D.1, we illustrate the differences between the two types of baselines considered (global and local) for the 3 algorithms that use a score-function estimator of the gradient, both on-policy and off-policy.

Smaller environments: The environment studied in the main body of the text (128×128, with R_0 = 10^−3) already illustrates some key differences between the forward and reverse KL objectives. As a sanity check for the HVI methods that failed to converge in this challenging environment, we consider two alternative grids, 64×64 and 8×8×8×8, both with an easier exploration parameter (R_0 = 0.1), and compare the 5 algorithms on-policy on these two extra domains. Additionally, for the two-dimensional domain (64×64), we show in Fig. D.2 a visual representation of the average distribution obtained after sampling 10^6 trajectories, for each method separately. Interestingly, unlike in the hard-exploration domain, the two algorithms with the mode-seeking KL (REVERSE KL and REVERSE WS) converge to a lower JSD than the mean-seeking KL algorithms (FORWARD KL and WS), and are on par with TB.

Figure D.2: Top: The evolution of the JSD between the learned sampler P_F^⊤ and the target distribution on the 8×8×8×8 grid (left) and the 64×64 grid (right). Trajectories are sampled on-policy. Shaded areas represent the standard error evaluated across 5 different runs. Bottom: The average (across 5 runs) final learned distribution P_F^⊤ for the different algorithms, along with the target distribution. To amplify variation, the plot intensity at each grid position is resampled from the Gaussian approximating the distribution over the 5 runs.
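The model-selection criterion of stage 2 above (area under the JSD curve) is simple to compute; a minimal sketch with our own naming, assuming the JSD is recorded together with the number of trajectories sampled so far:

```python
import numpy as np

def jsd_auc(jsd_values, trajectories_seen):
    """Area under the JSD-vs-trajectories curve, via the trapezoid rule."""
    y = np.asarray(jsd_values, dtype=float)
    x = np.asarray(trajectories_seen, dtype=float)
    return float(np.sum(0.5 * (y[1:] + y[:-1]) * np.diff(x)))

# Example: pick the hyperparameter configuration with the lowest area under the curve.
# runs = {"lr=1e-3": (jsd_curve_1, steps_1), "lr=1e-4": (jsd_curve_2, steps_2)}
# best = min(runs, key=lambda k: jsd_auc(*runs[k]))
```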
D.2 MOLECULE EXPERIMENTS

Most experiment settings were identical to those of Malkin et al. (2022): in particular, the reward model f, the held-out set of molecules used to compute the performance metric, the GFlowNet model architecture (a graph neural network introduced by Bengio et al. (2021a)), and the off-policy exploration rate. All models were trained with the Adam optimizer and batch size 4 for a maximum of 50000 batches. The metric was computed after every 5000 batches, and the last computed value of the metric was reported; this was sometimes not the value after 50000 batches, as some training runs terminated early because of numerical errors.

D.3 BAYESIAN STRUCTURE LEARNING EXPERIMENTS

Bayesian networks: A Bayesian network is a probabilistic model in which the joint distribution over d random variables {X_1, ..., X_d} factorizes according to a directed acyclic graph (DAG) G:
\[ p(X_1, \ldots, X_d) = \prod_{i=1}^{d} p(X_i \mid \mathrm{Pa}_G(X_i)), \]
where Pa_G(X_i) is the set of parent variables of X_i in the graph G. Each conditional distribution in the factorization above is also associated with a set of parameters θ ∈ Θ. The structure G of the Bayesian network is often assumed to be known. However, when the structure is unknown, we can learn it from a dataset of observations D; this is called structure learning.

Structure of the state space: We use the same structure of graded DAG G as the one described in Deleu et al. (2022), where each state of G is itself a DAG G, and where actions correspond to adding one edge to the current graph G to transition to a new graph G'. Only the actions maintaining the acyclicity of G' are considered valid; this ensures that all the states are well-defined DAGs, meaning that all the states are terminating here (we define a distribution over DAGs). Similar to the hypergrid environment, the action space also contains an extra stop action that terminates the generation process and returns the current graph as a sample of our distribution; this stop action is denoted G → G^⊥, following the notation introduced in Sec. 2.1. (A sketch of how the acyclicity constraint can be enforced when masking actions is given after Fig. D.3.)

Reward function: Our objective in Bayesian structure learning is to approximate the posterior distribution over DAGs p(G | D), given a dataset of observations D. Since our goal is to find a forward policy P_F for which P_F^⊤(G) ∝ R(G) (see Sec. 2.1), we can define the reward function as the joint distribution R(G) = p(G, D) = p(D | G) p(G), where p(G) is a prior over graphs (assumed to be uniform throughout the paper) and p(D | G) is the marginal likelihood. Since the marginal likelihood involves marginalizing over the parameters of the Bayesian network,
\[ p(\mathcal{D} \mid G) = \int_{\Theta} p(\mathcal{D} \mid \theta, G)\, p(\theta \mid G)\, d\theta, \]
it is in general intractable. We consider here a special class of models, called linear-Gaussian models, for which the marginal likelihood can be computed in closed form; for this class of models, the log-marginal likelihood is also called the BGe score (Geiger & Heckerman, 1994; Kuipers et al., 2014) in the structure learning literature.

Figure D.3: Comparison of edge marginals computed using the target posterior distribution and using the posterior approximations found either with the GFlowNet objectives or with REVERSE KL, for graphs with d = 3, 4, and 5 nodes. Performance is reported as the root mean square error (RMSE) between the marginals (lower is better).
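As referenced in the state-space paragraph above, checking whether an edge addition keeps the graph acyclic can be done with a standard transitive-closure mask. The sketch below is a generic NumPy implementation of that check; it is not the masking code of Deleu et al. (2022):

```python
import numpy as np

def valid_edge_additions(adj):
    """Boolean mask of edges X_i -> X_j that can be added to the DAG `adj`
    without creating a cycle or a duplicate edge.

    `adj` is a (d, d) 0/1 adjacency matrix. Adding i -> j creates a cycle
    exactly when j already reaches i, so we mask using the transitive closure."""
    d = adj.shape[0]
    reach = adj.astype(bool)
    for k in range(d):                      # Floyd-Warshall-style transitive closure
        reach |= reach[:, [k]] & reach[[k], :]
    no_cycle = ~reach.T                     # j does not already reach i
    mask = no_cycle & ~adj.astype(bool)     # edge not already present
    np.fill_diagonal(mask, False)           # no self-loops
    return mask

# Example: on an empty 3-node graph, every off-diagonal edge is a valid action.
print(valid_edge_additions(np.zeros((3, 3), dtype=int)))
```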
For each experiment, we sampled a dataset D of 100 samples from a randomly generated Bayesian network. The (ground-truth) structure of the Bayesian network was generated following an Erdos-Renyi model, with about d edges on average (to encourage sparsity on such small graphs, with d ≤ 5). Once the structure was fixed, the parameters of the linear-Gaussian model were sampled randomly from a standard normal distribution N(0, 1). See Deleu et al. (2022) for more details about the data generation process. For each setting (different values of d) and each objective, we repeated the experiment over 20 different seeds.

Forward policy: Deleu et al. (2022) parametrized the forward policy P_F using a linear transformer, taking all d² possible edges of the graph G as input and returning a probability distribution over those edges, with invalid actions masked out. We chose to parametrize P_F using a simpler neural network architecture, based on a graph neural network (Battaglia et al., 2018). The GNN takes the graph G as input, where each node of the graph is associated with a (learned) embedding, and it returns for each node X_i a pair of embeddings u_i and v_i. The probability of adding an edge X_i → X_j to transition from G to G' (given that we do not terminate in G) is then given by P_F(G' | G, ¬G^⊥) ∝ exp(u_i^⊤ v_j), assuming that X_i → X_j is a valid action (i.e., it does not introduce a cycle), and where the normalization runs only over the valid actions. We then use a hierarchical model to obtain the forward policy P_F(G' | G), following Deleu et al. (2022):
\[ P_F(G' \mid G) = \bigl(1 - P_F(G^\perp \mid G)\bigr)\, P_F(G' \mid G, \neg G^\perp). \]
Recall that the backward policy P_B is fixed here as the uniform distribution over the parents of G' (i.e., all the graphs where exactly one edge has been removed from G').

(Modified) detailed balance objective: For completeness, we recall here the modified detailed balance (DB) objective of Deleu et al. (2022), a special case of the DB objective (Bengio et al., 2021b; see also (16)) when all the states of G are terminating (which is the case in our Bayesian structure learning experiments):
\[ \mathcal{L}_{\mathrm{(M)DB}}(G \to G'; P_F, P_B) = \left(\log\frac{R(G')\,P_B(G \mid G')\,P_F(G^\perp \mid G)}{R(G)\,P_F(G' \mid G)\,P_F(G^\perp \mid G')}\right)^{\!2}. \]

Optimization: Following Deleu et al. (2022), we used a replay buffer for all our off-policy objectives ((modified) DB, TB, and REVERSE KL). All the objectives were optimized using a batch size of 256 graphs, sampled either on-policy from P_F or from the replay buffer. We used the Adam optimizer, with the best learning rate found among {10^−6, 3×10^−6, 10^−5, 3×10^−5, 10^−4}. For the TB objective, we learned log Z using SGD with a learning rate of 0.1 and momentum 0.8.

Edge marginals: In addition to the Jensen-Shannon divergence (JSD) between the true posterior distribution p(G | D) and the posterior approximation P_F^⊤(G) (see Sec. E for details about how this divergence is computed), we also compare the edge marginals computed with both distributions. That is, for any edge X_i → X_j in the graph, we compare
\[ p(X_i \to X_j \mid \mathcal{D}) = \sum_{G \,:\, X_i \in \mathrm{Pa}_G(X_j)} p(G \mid \mathcal{D}) \quad \text{and} \quad P_F^{\top}(X_i \to X_j) = \sum_{G \,:\, X_i \in \mathrm{Pa}_G(X_j)} P_F^{\top}(G). \]
The edge marginal quantifies how likely an edge X_i → X_j is to be present in the structure of the Bayesian network, and is of particular interest in the (Bayesian) structure learning literature. To measure how accurate the posterior approximation P_F^⊤ is for the different objectives considered here, we use the root mean square error (RMSE) between p(X_i → X_j | D) and P_F^⊤(X_i → X_j), over all possible pairs of nodes (X_i, X_j) in the graph.
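When the support can be enumerated (as for d ≤ 5 here), the edge marginals and their RMSE follow directly from the two distributions over graphs. The following is a minimal sketch with our own naming, which excludes self-loops from the average:

```python
import numpy as np

def edge_marginals(adjacencies, probs):
    """p(X_i -> X_j) = sum of the probabilities of the graphs containing the edge.
    `adjacencies`: list of (d, d) 0/1 adjacency matrices; `probs`: their probabilities
    under a normalized distribution over DAGs, e.g. p(G | D) or P_F^T(G)."""
    return sum(p * np.asarray(adj, dtype=float) for adj, p in zip(adjacencies, probs))

def edge_marginal_rmse(adjacencies, posterior_probs, model_probs):
    """RMSE between the edge marginals of the exact posterior and of the learned sampler."""
    diff = edge_marginals(adjacencies, posterior_probs) - edge_marginals(adjacencies, model_probs)
    off_diag = ~np.eye(diff.shape[0], dtype=bool)      # ignore the impossible self-loops
    return float(np.sqrt(np.mean(diff[off_diag] ** 2)))
```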
Fig. D.3 shows the RMSE of the edge marginals for the different GFlowNet objectives and for REVERSE KL (denoted as HVI here for brevity). The results on the edge marginals largely confirm the observations made in Sec. 4.3: the off-policy GFlowNet objectives ((modified) DB and TB) consistently perform well across all experimental settings; On-Policy TB and On-Policy REVERSE KL perform similarly and degrade as the complexity of the experiment increases (as d increases); and Off-Policy REVERSE KL degrades the most as the complexity increases, with edge marginals P_F^⊤(X_i → X_j) that no longer match the true edge marginals p(X_i → X_j | D) accurately.

E EVALUATION OF THE LEARNED DISTRIBUTIONS

Evaluation of the terminating state distribution P_F^⊤: When the state space is small enough (e.g., graphs with d ≤ 5 nodes in the structure learning experiments, or a 2-D hypergrid of side length 128, as in the hypergrid experiments), we can propagate the flows in order to compute the terminating state distribution P_F^⊤ from the forward policy P_F. This is done using a flow function F defined recursively:
\[ F(s') = \begin{cases} 1 & \text{if } s' = s_0, \\ \sum_{s \in \mathrm{Par}(s')} F(s)\, P_F(s' \mid s) & \text{otherwise}. \end{cases} \tag{32} \]
P_F^⊤ is then given, for a terminating state s^⊥ whose unique parent is s, by
\[ P_F^{\top}(s^\perp) = F(s)\, P_F(s^\perp \mid s). \tag{33} \]
The recursion can be carried out by dynamic programming, enumerating the states in any topological ordering consistent with the graded DAG G. In particular, the computation of the flow at a given terminating state s is linear in the number of states and actions that lie on trajectories leading to s, and the computation of the full distribution P_F^⊤ is linear in |S| + |A|.

Evaluation of the Jensen-Shannon divergence (JSD): Similarly, when the state space is small enough, the target distribution P^⊤ = R/Ẑ can be evaluated exactly, given that the marginalization is over X only. The JSD is a symmetric divergence, which motivates our choice. The JSD can be evaluated directly as
\[ \mathrm{JSD}(P^{\top} \,\|\, P_F^{\top}) = \frac{1}{2}\Bigl(D_{\mathrm{KL}}(P^{\top} \,\|\, M) + D_{\mathrm{KL}}(P_F^{\top} \,\|\, M)\Bigr), \quad \text{where } M = \tfrac{1}{2}(P^{\top} + P_F^{\top}) \tag{34} \]
\[ = \frac{1}{2}\sum_{s\in\mathcal{X}}\left(P^{\top}(s)\,\log\frac{2P^{\top}(s)}{P^{\top}(s) + P_F^{\top}(s)} + P_F^{\top}(s)\,\log\frac{2P_F^{\top}(s)}{P^{\top}(s) + P_F^{\top}(s)}\right). \]
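Both computations above are straightforward for a small, enumerable state space. The following sketch mirrors (32)-(34); the data structures (a topologically ordered list of states, parent and child maps, and probability dictionaries) are our own choices:

```python
import numpy as np

def terminating_distribution(states, parents, children, log_pf):
    """Propagate flows as in (32)-(33). `states` is topologically ordered with
    states[0] = s_0, `parents[s]`/`children[s]` list the neighbours of s, and
    log_pf[(s, s2)] = log P_F(s2 | s). Returns {x: P_F^T(x)} over terminating states."""
    flow = {states[0]: 1.0}
    for s in states[1:]:
        flow[s] = sum(flow[p] * np.exp(log_pf[(p, s)]) for p in parents[s])
    return {s: f for s, f in flow.items() if not children[s]}

def jsd(p, q):
    """Jensen-Shannon divergence (34) between two distributions given as dicts."""
    total = 0.0
    for x in set(p) | set(q):
        px, qx = p.get(x, 0.0), q.get(x, 0.0)
        m = 0.5 * (px + qx)
        if px > 0.0:
            total += 0.5 * px * np.log(px / m)
        if qx > 0.0:
            total += 0.5 * qx * np.log(qx / m)
    return total
```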
F EXTENSION TO CONTINUOUS DOMAINS

As a first step towards understanding GFlowNets with continuous action spaces, we perform an experiment on a stochastic control problem. The goal of this experiment is to explore whether the observations in the main text may hold in continuous settings as well.

We consider an environment in which an agent begins at the point x_0 = (0, 0) in the plane and makes a sequence of K = 10 steps over the time interval [0, 1], through points x_{0.1}, x_{0.2}, ..., x_1. Each step from x_t to x_{t+0.1} is Gaussian, with a learned mean depending on x_t and t and with fixed variance; the variance is isotropic with standard deviation 1/(2√K). Equivalently, the agent samples the Euler-Maruyama discretization, with interval Δt = 1/K, of the Ito stochastic differential equation
\[ d\mathbf{x}_t = f(\mathbf{x}_t, t)\,dt + \tfrac{1}{2}\,d\mathbf{w}_t, \]
where w_t is the two-dimensional Wiener process. The choice of the drift function f determines the marginal density of the final point x_1. We aim to find f such that this marginal density is proportional to a given reward function, in this case a quantity proportional to the density function of the standard "8 Gaussians" distribution, shown in Fig. F.2. We scale the distribution so that the modes of the 8 Gaussian components are at a distance of 2 from the origin and their standard deviations are 0.25.

In GFlowNet terms, the set of states is S = {(0, 0)} ∪ {(x, t) : x ∈ R², t ∈ {0.1, 0.2, ..., 1}}. States with t = 1 are terminating. There is an action from (x, t) to (x', t') if and only if t' = t + Δt. The forward policy is given by a conditional Gaussian:
\[ P_F\bigl((\mathbf{x}', t+\Delta t) \mid (\mathbf{x}, t)\bigr) = \mathcal{N}\!\left(\mathbf{x}' - \mathbf{x};\; f(\mathbf{x}, t)\,\Delta t,\; \Bigl(\tfrac{\sqrt{\Delta t}}{2}\Bigr)^{\!2} I\right). \tag{37} \]
We impose a conditional Gaussian assumption on the backward policy as well, i.e.,
\[ P_B\bigl((\mathbf{x}, t) \mid (\mathbf{x}', t+\Delta t)\bigr) = \begin{cases} \mathcal{N}\!\bigl(\mathbf{x} - \mathbf{x}';\; \mu_B(\mathbf{x}', t+\Delta t)\,\Delta t,\; \sigma_B^2(\mathbf{x}', t+\Delta t)\,\Delta t\, I\bigr) & t \neq 0, \\ 1 & t = 0, \end{cases} \tag{38} \]
where μ_B and log σ_B² are learned. Notice that all the policies, except the backward policy from time 1/K to time 0, now represent probability densities; states can have uncountably many children and parents. We parametrize the three functions f, μ_B, and log σ_B² as small MLPs (two hidden layers, 64 units per layer) taking as input the position x and an embedding of the time t. Their parameters can be optimized using any of the five algorithms in Table 1 of the main text.³

Fig. F.1 shows the marginal densities of x_t (estimated using KDE) for different t in one well-trained model, as well as some sampled points and paths. In addition to training on-policy, we consider exploratory training policies that add Gaussian noise to the mean of each transition distribution. We experiment with adding standard normal noise scaled by σ_exp, where σ_exp ∈ {0, 0.1, 0.2}.

Fig. F.2 compares the marginal densities obtained using different algorithms with on-policy and off-policy training. The algorithms that use a forward KL objective to learn P_B (namely, REVERSE WS and FORWARD KL) are not shown because they encounter NaN values in the gradients early in training, even when using a 10× lower learning rate than that used for all other algorithms (10^−3 for the parameters of f, μ_B, and log σ_B², and 10^−1 for the log Z parameter of the GFlowNet).

These results suggest that the observations made for discrete-space GFlowNets in the main text may continue to hold in continuous settings. The first two rows of Fig. F.2 show that off-policy exploration is essential for finding the modes and that TB achieves a better fit to the target distribution. Just as in Fig. 1, although all modes are found by WAKE-SLEEP, they are modeled with lower precision, appearing off-centre and having an oblong shape, which is reflected in the slightly higher MMD.

³We conjecture (and strongly believe under mild assumptions), but do not prove, that the necessary GFlowNet theory continues to hold when probabilities are replaced by probability densities; the results obtained here are evidence in support of this conjecture.

Figure F.1: Above: KDE (2560 samples, bandwidth 0.25) of the agent's position after i steps, for i = 0, 1, ..., 10 (t = 0, 0.1, ..., 1), for a model trained with off-policy TB, showing a close match to the target distribution (also convolved with the KDE kernel for a fair comparison). Below: A sample of 2560 points from the trained model and the trajectories taken by 128 of the points.

Figure F.2: KDE of learned marginal distributions for various algorithms and exploration policies (on-policy: σ_exp = 0; off-policy: σ_exp = 0.1, 0.2), and MMD with Gaussian kernel exp(−‖x − y‖²) estimated using 2560 samples. The MMD values reported in the figure, one row per algorithm, for σ_exp = 0 / 0.1 / 0.2, are: 0.1111 / 0.0005 / 0.0075; 0.0027 / 0.0059 / 0.0036; 0.0008 / 0.0036 / 0.0043.
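For reference, an MMD of the kind reported in Fig. F.2 can be estimated from two sets of samples in a few lines. The sketch below uses the biased (V-statistic) estimator of the squared MMD with the Gaussian kernel exp(−‖x − y‖²); we do not claim this is the exact estimator used to produce the numbers in the figure:

```python
import numpy as np

def mmd2_gaussian(x, y):
    """Biased estimate of MMD^2 between samples x (n, d) and y (m, d),
    with kernel k(a, b) = exp(-||a - b||^2)."""
    def gram(a, b):
        sq = np.sum(a**2, axis=1)[:, None] + np.sum(b**2, axis=1)[None, :] - 2.0 * a @ b.T
        return np.exp(-np.maximum(sq, 0.0))
    return float(gram(x, x).mean() + gram(y, y).mean() - 2.0 * gram(x, y).mean())

# Example with 2560 samples from two slightly shifted Gaussians:
rng = np.random.default_rng(0)
print(mmd2_gaussian(rng.normal(size=(2560, 2)), rng.normal(size=(2560, 2)) + 0.5))
```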