# On Divergence Measures for Training GFlowNets

Tiago da Silva, Eliezer de Souza da Silva, Diego Mesquita
{tiago.henrique, eliezer.silva, diego.mesquita}@fgv.br
School of Applied Mathematics, Getulio Vargas Foundation, Rio de Janeiro, Brazil

Generative Flow Networks (GFlowNets) are amortized samplers of unnormalized distributions over compositional objects with applications to causal discovery, NLP, and drug design. Recently, it was shown that GFlowNets can be framed as a hierarchical variational inference (HVI) method for discrete distributions. Despite this equivalence, attempts to train GFlowNets using traditional divergence measures as learning objectives were unsuccessful. Instead, current approaches for training these models rely on minimizing the log-squared difference between a proposal (forward policy) and a target (backward policy) distribution. In this work, we first formally extend the relationship between GFlowNets and HVI to distributions on arbitrary measurable topological spaces. Then, we empirically show that the ineffectiveness of divergence-based learning of GFlowNets is due to the large gradient variance of the corresponding stochastic objectives. To address this issue, we devise a collection of provably variance-reducing control variates for gradient estimation based on the REINFORCE leave-one-out estimator. Our experimental results suggest that the resulting algorithms often accelerate training convergence when compared against previous approaches. All in all, our work contributes by narrowing the gap between GFlowNet training and HVI, paving the way for algorithmic advancements inspired by the divergence-minimization viewpoint.

1 Introduction

The approximation of intractable distributions is one of the central issues in machine learning and modern statistics [7, 35]. In reinforcement learning (RL), a recurring goal is to find a diverse set of high-valued state-action trajectories according to a reward function. This problem may be cast as sampling trajectories proportionally to the reward, which is generally an intractable distribution over the environment [3, 9, 38, 50]. Similarly, practical computations in Bayesian inference and probabilistic models involve assessing intractable posterior distributions [36, 78, 102]. In the variational inference (VI) approach, circumventing this intractability involves searching for a tractable approximation to the target distribution within a family of parametric models. Conventionally, the problem reduces to minimizing a divergence measure, such as the Kullback-Leibler (KL) divergence [7, 36, 92] or the Renyi-α divergence [51, 70], between the variational approximation and the target. In particular, Generative Flow Networks (GFlowNets) [3, 4, 48] are a recently proposed family of variational approximations well-suited for distributions over compositional objects (e.g., graphs and texts). GFlowNets have found empirical success in various applications, including causal discovery [15, 16], NLP [30], and chemical and biological modeling [3, 32]. In a nutshell, a GFlowNet learns an iterative generative process (IGP) [26] over an extension of the target's support which, for sufficiently expressive parameterizations of the transition kernels, yields independent and correctly distributed samples [3, 48].
Remarkably, training GFlowNets typically consists of minimizing the log-squared difference between a proposal and a target distribution over the extended space via SGD [4, 55], contrasting with the divergence-minimizing algorithms commonly used in VI [7, 72]. Indeed, Malkin et al. [56] suggest that training GFlowNets with the trajectory balance (TB) loss leads to better approximations of the target distribution than directly minimizing the reverse and forward KL divergences, particularly in setups with sparser rewards. Nevertheless, as we highlight in Section 3, these results are a potential consequence of biases and high variance in the gradient estimates of the divergences, which an evaluation protocol reliant upon sparse target distributions can overlook. Therefore, in Section 5, we present a comprehensive empirical investigation of the minimization of well-known f-divergence measures (including reverse and forward KL), showing it is an effective procedure that often accelerates the training convergence of GFlowNets relative to alternatives. To achieve these results, we develop in Section 4 a collection of control variates (CVs) [63, 71] to reduce the variance without introducing bias on the estimated gradients, improving the efficiency of the optimization algorithms [77, 85]. In summary, our main contributions are:

1. We evaluate the performance of the forward and reverse KL [47], Renyi-α [74], and Tsallis-α [91] divergences as learning objectives for GFlowNets through an extensive empirical campaign and highlight that they frequently outperform traditionally employed loss functions.
2. We design control variates for the gradients of GFlowNets' divergence-based objectives. Therefore, it is possible to perform efficient evaluations of the optimization objectives using automatic differentiation frameworks [69], and the resulting experiments showcase the significant reduction in the variance of the corresponding estimators.
3. We develop a theoretical connection between GFlowNets and VI beyond the setup of finitely supported measures [56, 112], establishing results for arbitrary topological spaces.

2 Revisiting the relationship between GFlowNets and VI

We first review Lahlou et al. [48]'s work on GFlowNets for distributions on topological spaces, a perspective we subsequently apply to obtain the equivalence between GFlowNet training and VI divergence minimization in a more general setting. Finally, we describe standard variance reduction techniques for solving stochastic optimization problems.

Notations. Let $(S, \mathcal{T})$ be a topological space with topology $\mathcal{T}$ and $\Sigma$ be the corresponding Borel σ-algebra. Also, let $\nu \colon \Sigma \to \mathbb{R}_+$ be a measure over $\Sigma$ and $\kappa_f, \kappa_b \colon S \times \Sigma \to \mathbb{R}_+$ be transition kernels over $S$. For each $(B_1, B_2) \in \Sigma \times \Sigma$, we denote $\nu \otimes \kappa(B_1, B_2) := \int_{B_1} \nu(\mathrm{d}s)\, \kappa(s, B_2)$. Likewise, we recursively define the product kernel as $\kappa^{\otimes 0}(s, \cdot) = \kappa(s, \cdot)$ and, for $n \geq 1$, $\kappa^{\otimes n}(s, \cdot) = \kappa^{\otimes (n-1)}(s, \cdot) \otimes \kappa$ for a transition kernel $\kappa$ and $s \in S$. Note, in particular, that $\kappa^{\otimes n}$ is a function from $S \times \Sigma^{\otimes (n+1)}$ to $\mathbb{R}_+$, with $\Sigma^{\otimes (n+1)}$ representing the product σ-algebra of $\Sigma$ [1, 96]. Moreover, if $\mu$ is absolutely continuous relative to $\nu$, denoted $\mu \ll \nu$, we write $\mathrm{d}\mu/\mathrm{d}\nu$ for the corresponding density (Radon-Nikodym derivative) [1]. Furthermore, we denote by $\mathcal{P}(A) = \{S' \colon S' \subseteq A\}$ the power set of a set $A \subseteq S$ and by $[d] = \{1, \ldots, d\}$ the first $d$ positive integers.

GFlowNets.
A GFlowNet is, in its most general form, built upon the concept of a measurable pointed directed acyclic graph (DAG) [48], which we define next. Intuitively, it extends the notion of a flow network to arbitrary measurable topological spaces, replacing the directed graph with a transition kernel specifying how the underlying states are connected.

Definition 1 (Measurable pointed DAG [48]). Let $(\bar{S}, \mathcal{T}, \Sigma)$ be a measurable topological space endowed with a reference measure $\nu$ and forward $\kappa_f$ and backward $\kappa_b$ kernels. Also, let $s_o \in \bar{S}$ and $s_f \in \bar{S}$ be distinguished elements of $\bar{S}$, respectively called initial and final states, and let $S = \bar{S} \setminus \{s_f\}$. We assume $\{s_f\}$ is open. A measurable pointed DAG is then a tuple $(\bar{S}, \mathcal{T}, \Sigma, \kappa_f, \kappa_b, \nu)$ satisfying:

1. (Terminality) If $\kappa_f(s, \{s_f\}) > 0$, then $\kappa_f(s, \{s_f\}) = 1$ for $s \in S$. Also, $\kappa_f(s_f, \cdot) = \delta_{s_f}$.
2. (Reachability) For all $B \in \Sigma$, there is $n \in \mathbb{N}$ such that $\kappa_f^{\otimes n}(s_o, B) > 0$, i.e., $B$ is reachable from $s_o$.
3. (Consistency) For every $(B_1, B_2) \in \Sigma \times \Sigma$ such that $(B_1, B_2) \notin \{(\{s_o\}, \{s_o\}), (\{s_f\}, \{s_f\})\}$, $\nu \otimes \kappa_f(B_1, B_2) = \nu \otimes \kappa_b(B_2, B_1)$. Moreover, $\kappa_b(s_o, B) = 0$ for every $B \in \Sigma$.
4. (Continuity) $s \mapsto \kappa_f(s, B)$ is continuous for $B \in \Sigma$.
5. (Finite absorption) There is $N \in \mathbb{N}$ such that $\kappa_f^{\otimes N}(s, \cdot) = \delta_{s_f}$ for every $s \in \bar{S}$. We designate the corresponding DAG as finitely absorbing.

In this setting, the elements of the set $\mathcal{X} = \{s \in \bar{S} \setminus \{s_f\} \colon \kappa_f(s, \{s_f\}) > 0\}$ are called terminal states. Illustratively, when $\bar{S}$ is finite and $\nu$ is the counting measure, the preceding definition corresponds to a connected DAG with an edge from $s \in S$ to $s' \in S$ iff $\kappa_f(s, \{s'\}) > 0$, with condition 5 ensuring acyclicity and condition 2 implying connectivity. A GFlowNet, then, is characterized by a measurable pointed DAG, a potentially unnormalized distribution over the terminal states $\mathcal{X}$, and learnable transition kernels on $\bar{S}$ (Definition 2). Its goal is to find an IGP over $\bar{S}$ which, starting at $s_o$, samples from $\mathcal{X}$ proportionally to a given positive function.

Definition 2 (GFlowNets [48]). A GFlowNet is a tuple $(G, P_F, P_B, \mu)$ composed of a measurable pointed DAG $G$, a σ-finite measure $\mu \ll \nu$, and σ-finite Markov kernels $P_F \ll \kappa_f$ and $P_B \ll \kappa_b$, respectively called forward and backward policies.

Training GFlowNets. In practice, we denote by $p_{F_\theta} \colon S \times S \to \mathbb{R}_+$ the density of $P_F$ relative to $\kappa_f$, which we parameterize using a neural network with weights $\theta$. Similarly, we denote by $p_B$ the density of $P_B$ with respect to $\kappa_b$. Our objective is, for a given target measure $R \ll \mu$ on $\mathcal{X}$ with $r = \mathrm{d}R/\mathrm{d}\mu$, to estimate the $\theta$ for which the distribution over $\mathcal{X}$ induced by $P_F(s_o, \cdot)$ matches $R$, i.e., for every $B \in \Sigma$,
$$\sum_{n \geq 1} \int_{S^n} p_{F_\theta}^{\otimes n}(s_o, s_{1:n}, s_f)\, \mathbb{1}\{s_n \in B\}\, \kappa_f^{\otimes n}(s_o, \mathrm{d}s_{1:n}) = R(B).$$
Importantly, the above sum contains only finitely many non-zero terms due to the finite absorption property of $\kappa_f$. To ensure that $p_{F_\theta}$ abides by this equation, Lahlou et al. [48] showed it suffices that one of the following balance conditions is jointly satisfied by $P_F$ and $P_B$.

Definition 3 (Trajectory balance condition). For all $n \geq 0$ and $\mu^{\otimes n}$-almost every $s_{1:n} \in S^n$,
$$p_{F_\theta}^{\otimes n}(s_o, s_{1:n}, s_f) = \frac{r(s_n)}{Z_\theta}\, p_B^{\otimes n}(s_n, s_{n:1}, s_o),$$
with $Z_\theta$ denoting the target distribution's partition function.

Definition 4 (Detailed balance condition). For an auxiliary parametric function $u \colon S \to \mathbb{R}_+$ and $\mu^{\otimes 2}$-almost every $(s, s') \in S^2$, $u(s)\, p_{F_\theta}(s, s') = u(s')\, p_B(s', s)$, and $u(x) = r(x)$ for $x \in \mathcal{X}$.
To enforce the trajectory balance (TB) or detailed balance (DB) conditions, we conventionally define a stochastic optimization problem that minimizes the expected log-squared difference between the left- and right-hand sides of the corresponding condition under a probability measure $\xi$ supported on an appropriate space [4, 15, 48, 49, 52, 55, 67]. For TB, e.g., we let $\xi$ be defined on $\Sigma^{\otimes N}$ with support within $\mathrm{supp}(\mu^{\otimes N})$. Then, we estimate the GFlowNet's parameters $\theta$ by minimizing
$$\mathbb{E}_{\tau \sim \xi}\left[\left(\sum_{i=0}^{h-1} \log \frac{p_{F_\theta}(s_i, s_{i+1})}{p_B(s_{i+1}, s_i)} + \log \frac{Z_\theta}{r(s_h)}\right)^2\right], \quad (1)$$
with $\tau = s_{o:N}$ and $h = \max\{i \colon s_i \neq s_f\}$ being the index of the last non-final state in $\tau$. As denoted by the subscript on $Z_\theta$, using the TB loss incurs learning the target's normalizing constant. While tuning $p_B$ during training is also possible, the common practice is to keep it fixed.

Henceforth, we will consider the measurable space of trajectories $(\mathcal{P}_S, \Sigma_\mathcal{P})$, with $\mathcal{P}_S = \{(s_o, s_1, \ldots, s_n, s_f) \in S^{n+1} \times \{s_f\} \colon 0 \leq n \leq N-1\}$ and $\Sigma_\mathcal{P}$ as the σ-algebra generated by $\bigcup_{n=1}^{N+1} \Sigma^{\otimes n}$. For notational convenience, we use the same letters for representing the measures and kernels of $(S, \Sigma)$ and their natural product counterparts in $(\mathcal{P}_S, \Sigma_\mathcal{P})$, which exist by Carathéodory's extension theorem [96]; for example, $\nu(B) = \nu^{\otimes n}(B)$ for $B = (B_1, \ldots, B_n) \in \Sigma^{\otimes n}$, and $p_{F_\theta}(\tau|s_o; \theta)$ is the density of $P_F^{\otimes (n+1)}(s_o, \cdot)$ for $\tau = (s_o, s_1, \ldots, s_n, s_f)$ relative to $\mu^{\otimes n}$. In this case, we will write $\tau$ for a generic element of $\mathcal{P}_S$ and $x$ for its terminal state (which is unique by Definition 1). For a comprehensive overview of GFlowNets, please refer to [48, 55].

GFlowNets and VI. GFlowNets can be interpreted as hierarchical variational models by framing the forward policy $p_{F_\theta}(\tau|s_o; \theta)$ in $(\mathcal{P}_S, \Sigma_\mathcal{P})$ as a proposal for the target $\frac{r(x)}{Z}\, p_B(\tau|x)$. Malkin et al. [56] demonstrated that, for discrete target distributions, the TB loss in (1) aligns with the KL divergence in terms of expected gradients. Extending this, our Proposition 1 establishes that this relationship also holds for distributions over arbitrary topological spaces.

Proposition 1 (TB loss and KL divergence gradients for topological spaces). Let $\mathcal{L}_{TB}(\tau; \theta) = \left(\log \frac{Z\, p_{F_\theta}(\tau|s_o; \theta)}{r(x)\, p_B(\tau|x)}\right)^2$ and $p_B(\tau) = \frac{r(x)}{Z}\, p_B(s_{n-1:o}|x)$ for $\tau = (s_o, \ldots, s_{n-1}, x, s_f)$. Then,
$$\nabla_\theta \mathbb{E}_{\tau \sim P_F(s_o, \cdot)}\left[\mathcal{L}_{TB}(\tau; \theta)\right] = 2\, \nabla_\theta D_{KL}[P_F \| P_B], \quad (2)$$
where $D_{KL}[P_F \| P_B] = \mathbb{E}_{\tau \sim P_F(s_o, \cdot)}\left[\log \frac{p_{F_\theta}(\tau|s_o; \theta)}{p_B(\tau)}\right]$ is the KL divergence between $P_F$ and $P_B$.

This proposition shows that minimizing the on-policy TB loss is theoretically comparable to minimizing the KL divergence between $P_F$ and $P_B$ in terms of convergence speed. Since the TB loss requires estimating the intractable $R(\mathcal{X})$, the KL divergence, which avoids this estimation, can be a more suitable objective. Our experiments in Section 5 support this, with proofs provided in Appendix C. Extending this result to general topological spaces broadens the scope of divergence minimization strategies, extending guarantees for discrete spaces to continuous and mixed spaces. This generalization aligns with advances in generalized Bayesian inference [45] and generalized VI in function spaces [95] via optimization of generic divergences. We make the method theoretically firm and potentially widely applicable by proving the equivalence in these broader contexts.
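To make Proposition 1 concrete, the sketch below implements the TB loss of Equation 1 and a score-function surrogate for the on-policy reverse KL side by side. It is a minimal illustration under simplifying assumptions: per-trajectory log-densities and log-rewards are taken as pre-computed tensors, and names such as `log_pf_traj` are ours rather than the paper's.

```python
import torch

def tb_loss(log_pf_traj, log_pb_traj, log_reward, log_Z):
    """Trajectory balance loss (Equation 1): squared log-ratio between
    Z * p_F(tau | s_o) and r(x) * p_B(tau | x), averaged over a batch."""
    log_ratio = log_Z + log_pf_traj - log_pb_traj - log_reward
    return (log_ratio ** 2).mean()

def reverse_kl_surrogate(log_pf_traj, log_pb_traj, log_reward):
    """Score-function surrogate whose gradient is the Monte Carlo estimate of
    grad KL[P_F || P_B]; by Proposition 1, this matches half the expected TB gradient."""
    log_w = log_pf_traj - log_pb_traj - log_reward  # log p_F(tau) - log(r(x) p_B(tau|x))
    return (log_pf_traj + log_w.detach() * log_pf_traj).mean()
```

Note that the surrogate never touches $Z_\theta$: the missing $\log Z$ only shifts `log_w` by a constant, whose contribution vanishes in expectation since $\mathbb{E}_{P_F}[\nabla_\theta \log p_{F_\theta}] = 0$.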
Variance reduction. A naive Monte Carlo estimator for the gradient in Equation 2 has high variance [19], impacting the efficiency of stochastic gradient descent [97]. To mitigate this, we use control variates, i.e., random variables with zero expectation added to the estimator to reduce its variance without introducing bias [63, 71]. This method, detailed in Section 4, significantly reduces noise in gradient estimates and pragmatically improves training convergence, as shown in the experiments in Section 5.

3 Divergence measures for learning GFlowNets

This section presents four different divergence measures for training GFlowNets and the accompanying gradient estimators for stochastic optimization. Regardless of the learning objective, recall that our goal is to estimate $\theta$ by minimizing a discrepancy measure $D$ between $P_F$ and $P_B$ that is globally minimized if and only if $P_F = P_B$, i.e.,
$$\theta^{\star} = \arg\min_{\theta}\, D(P_F, P_B), \quad (3)$$
in which $P_B$ is typically fixed and $P_F$'s density $p_{F_\theta}$ is parameterized by $\theta$.

3.1 Renyi-α and Tsallis-α divergences

Renyi-α [74] and Tsallis-α [91] are families of statistical divergences including, as limiting cases, the widespread KL divergence (Section 3.2) [58]; see Definition 5. These divergences have been successfully applied to both variational inference [51] and policy search for model-based reinforcement learning [18]. Moreover, as we highlight in Section 5, their performance is competitive with, and sometimes better than, traditional learning objectives for GFlowNets based on minimizing log-squared differences between proposal and target distributions.

Definition 5 (Renyi-α and Tsallis-α divergences). Let $\alpha \in \mathbb{R}$. Also, let $p_{F_\theta}$ and $p_B$ be a GFlowNet's forward and backward policies, respectively. Then, the Renyi-α divergence between $P_F$ and $P_B$ is
$$R_\alpha(P_F \| P_B) = \frac{1}{\alpha - 1} \log \int_{\mathcal{P}_S} p_{F_\theta}(\tau|s_o)^{\alpha}\, p_B(\tau)^{1-\alpha}\, \kappa_f(s_o, \mathrm{d}\tau).$$
Similarly, the Tsallis-α divergence between $P_F$ and $P_B$ is
$$T_\alpha(P_F \| P_B) = \frac{1}{\alpha - 1} \left( \int_{\mathcal{P}_S} p_{F_\theta}(\tau|s_o)^{\alpha}\, p_B(\tau)^{1-\alpha}\, \kappa_f(s_o, \mathrm{d}\tau) - 1 \right).$$

From Definition 5, we see that both Renyi-α and Tsallis-α divergences transition from a mass-covering to a mode-seeking behavior as $\alpha$ ranges from $-\infty$ to $\infty$. Regarding GFlowNet training, this flexibility suggests that $R_\alpha$ and $T_\alpha$ are appropriate choices both for carrying out Bayesian inference [16], where interest lies in obtaining an accurate approximation to a posterior distribution, and for combinatorial optimization [109], where the goal is to find a few high-valued samples. Additionally, the choice of $\alpha$ provides a mechanism for controlling which trajectories are preferentially sampled during training, with larger values favoring the selection of trajectories leading to high-probability terminal states, resembling the effect of ε-greedy [55], Thompson sampling [73], local search [40], and forward-looking [65, 66] techniques for carrying out off-policy training of GFlowNets [56]. To illustrate the effect of $\alpha$ on the learning dynamics of GFlowNets, we show in Figure 1 an early stage of training to sample from a homogeneous mixture of Gaussian distributions by minimizing the Renyi-α divergence for different values of $\alpha$; see Section 5.1 for details on this experiment. At this stage, we note that the GFlowNet covers the target distribution's modes but fails to separate them when $\alpha$ is large and negative. In contrast, a large positive $\alpha$ causes the model to focus on a single high-probability region. Therefore, an intermediate value of $\alpha = 0.5$ results in a model that accurately approximates the target distribution. Also, our early experiments suggested the persistence of such dependence on $\alpha$ for diverse learning tasks, with $\alpha = 0.5$ leading to the best results. Thus, we fix $\alpha = 0.5$ throughout our experimental campaign.

Figure 1: Mode-seeking (α = 2) versus mass-covering (α = −2) behaviour in α-divergences.
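For completeness, and to justify the claim in Section 3.2 that the KL divergence arises as the $\alpha \to 1$ limit of both families, the following short derivation (ours, not reproduced from the paper) applies L'Hôpital's rule to Definition 5. Writing $I(\alpha) = \int_{\mathcal{P}_S} p_{F_\theta}(\tau|s_o)^{\alpha}\, p_B(\tau)^{1-\alpha}\, \kappa_f(s_o, \mathrm{d}\tau)$, we have $I(1) = 1$ and
$$\lim_{\alpha \to 1} R_\alpha(P_F \| P_B) = \lim_{\alpha \to 1} \frac{\log I(\alpha)}{\alpha - 1} = \frac{I'(1)}{I(1)} = \int_{\mathcal{P}_S} p_{F_\theta}(\tau|s_o) \log \frac{p_{F_\theta}(\tau|s_o)}{p_B(\tau)}\, \kappa_f(s_o, \mathrm{d}\tau) = D_{KL}[P_F \| P_B],$$
and the same argument applied to $T_\alpha(P_F \| P_B) = (I(\alpha) - 1)/(\alpha - 1)$ yields $\lim_{\alpha \to 1} T_\alpha(P_F \| P_B) = I'(1) = D_{KL}[P_F \| P_B]$.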
Importantly, we need only the gradients of $R_\alpha$ and $T_\alpha$ for solving the optimization problem in Equation 3 and, in particular, learning the target distribution's normalizing constant is unnecessary, as we underline in the lemma below. This property distinguishes such divergence measures from both the TB and DB losses in Equation 1 and, in principle, simplifies the training of GFlowNets.

Lemma 1 (Gradients for $R_\alpha$ and $T_\alpha$). Let $\theta$ be the parameters of $p_{F_\theta}$ in Definition 5 and, for $\tau \in \mathcal{P}_S$, $g(\tau, \theta) = \left( \frac{p_B(\tau|x)\, r(x)}{p_{F_\theta}(\tau|s_o; \theta)} \right)^{1-\alpha}$. The gradient of $R_\alpha$ with respect to $\theta$ is
$$\nabla_\theta R_\alpha(p_{F_\theta} \| p_B) = \frac{\mathbb{E}\left[\nabla_\theta g(\tau, \theta) + g(\tau, \theta)\, \nabla_\theta \log p_{F_\theta}(\tau|s_o; \theta)\right]}{(\alpha - 1)\, \mathbb{E}\left[g(\tau, \theta)\right]};$$
the expectations are computed under $P_F$. Analogously, the gradient of $T_\alpha$ with respect to $\theta$ is
$$\nabla_\theta T_\alpha(p_{F_\theta} \| p_B) \overset{c}{=} \mathbb{E}\left[\nabla_\theta g(\tau, \theta) + g(\tau, \theta)\, \nabla_\theta \log p_{F_\theta}(\tau|s_o; \theta)\right],$$
in which $\overset{c}{=}$ denotes equality up to a multiplicative constant.

Lemma 1 uses the REINFORCE method [97] to compute the gradients of both $R_\alpha$ and $T_\alpha$, and we implement Monte Carlo estimators to approximate the ensuing expectations based on a batch of trajectories $\{\tau_1, \ldots, \tau_N\}$ sampled during training [56]. Also, note that the function $g$ is computed outside the log domain and, therefore, particular care is required to avoid issues such as numerical underflow of the unnormalized distribution [3, 87]. In our implementation, we sample an initial batch of trajectories $\{\tau_i\}_{i=1}^{N}$ and compute the maximum of $r$ among the sampled terminal states in log space, i.e., $\log r_{\max} = \max_i \log r(x_i)$. Then, we use $\log \tilde{r}(x) = \log r(x) - \log r_{\max}$ as the target's unnormalized log-density. In Section 4, we will consider the design of variance reduction techniques to decrease the noise level of gradient estimates and possibly speed up the learning process.

3.2 Kullback-Leibler divergence

The KL divergence [47] is a limiting member of the Renyi-α and Tsallis-α families of divergences, derived when $\alpha \to 1$ [70], and is the most widely deployed divergence measure in statistics and machine learning. To conduct variational inference, one regularly considers both the forward and reverse KL divergences, which we review in the definition below.

Definition 6 (Forward and reverse KL). The forward KL divergence between a target $P_B$ and a proposal $P_F$ is $D_{KL}[P_B \| P_F] = \mathbb{E}_{\tau \sim P_B(s_f, \cdot)}\left[\log \frac{p_B(\tau)}{p_{F_\theta}(\tau|s_o)}\right]$. Also, the reverse KL divergence is defined by $D_{KL}[P_F \| P_B] = \mathbb{E}_{\tau \sim P_F(s_o, \cdot)}\left[\log \frac{p_{F_\theta}(\tau|s_o)}{p_B(\tau)}\right]$.

Remarkably, we cannot use a simple Monte Carlo estimator to approximate the forward KL due to the presumed intractability of $P_B$ (which depends directly on $R$). As a first approximation, we could estimate $D_{KL}[P_B \| P_F]$ via importance sampling with $P_F$ as a proposal distribution, as in [56],
$$D_{KL}[P_B \| P_F] = \mathbb{E}_{\tau \sim P_F}\left[\frac{p_B(\tau)}{p_{F_\theta}(\tau|s_o)} \log \frac{p_B(\tau)}{p_{F_\theta}(\tau|s_o)}\right],$$
and subsequently implement a REINFORCE estimator to compute $\nabla_\theta D_{KL}[P_B \| P_F]$. Nevertheless, as we only need the divergence's derivatives to perform SGD, we apply the importance weights directly to the gradient estimator. We summarize this approach in the lemma below.

Lemma 2 (Gradients for the KL divergence). Let $\theta$ be the parameters of $P_F$ and $s(\tau; \theta) = \log p_{F_\theta}(\tau|s_o; \theta)$. Then, the gradient of $D_{KL}[P_F \| P_B]$ with respect to $\theta$ satisfies
$$\nabla_\theta D_{KL}[P_F \| P_B] = \mathbb{E}_{\tau \sim P_F(s_o, \cdot)}\left[\nabla_\theta s(\tau; \theta) + \left(\log \frac{p_{F_\theta}(\tau|s_o)}{p_B(\tau|x)\, r(x)}\right) \nabla_\theta s(\tau; \theta)\right].$$
Correspondingly, the gradient of $D_{KL}[P_B \| P_F]$ with respect to $\theta$ is
$$\nabla_\theta D_{KL}[P_B \| P_F] \overset{c}{=} -\mathbb{E}_{\tau \sim P_F(s_o, \cdot)}\left[\frac{p_B(\tau|x)\, r(x)}{p_{F_\theta}(\tau|s_o)}\, \nabla_\theta s(\tau; \theta)\right].$$

Figure 2: Variance of the estimated gradients as a function of the trajectory batch size. Our control variates greatly reduce the estimator's variance, even for relatively small batch sizes.
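As a rough guide to how Lemmas 1 and 2 translate into code, the sketch below builds score-function surrogates for the Tsallis-α, Renyi-α, and forward KL gradients, including the max-shifted log-reward described above. This is our schematic PyTorch rendering under the same assumptions as the earlier snippet (pre-computed per-trajectory log-densities), not the authors' implementation.

```python
import torch

def shifted_log_reward(log_r):
    """Subtract the batch maximum of log r to avoid underflow when exponentiating
    the unnormalized density; the shift only rescales r by a constant."""
    return log_r - log_r.max().detach()

def tsallis_alpha_surrogate(log_pf_traj, log_pb_traj, log_r, alpha=0.5):
    """Surrogate whose gradient matches Lemma 1's expression for grad T_alpha
    up to a positive multiplicative constant."""
    log_g = (1.0 - alpha) * (log_pb_traj + shifted_log_reward(log_r) - log_pf_traj)
    g = log_g.exp()
    # E[grad g + g * grad log p_F]: the first term differentiates g, the second is
    # the score-function term with g detached.
    return (g + g.detach() * log_pf_traj).mean() / (alpha - 1.0)

def renyi_alpha_surrogate(log_pf_traj, log_pb_traj, log_r, alpha=0.5):
    """Self-normalized variant for grad R_alpha (Lemma 1): the same numerator
    divided by a detached batch estimate of (alpha - 1) * E[g]."""
    log_g = (1.0 - alpha) * (log_pb_traj + shifted_log_reward(log_r) - log_pf_traj)
    g = log_g.exp()
    return (g + g.detach() * log_pf_traj).mean() / ((alpha - 1.0) * g.detach().mean())

def forward_kl_surrogate(log_pf_traj, log_pb_traj, log_r):
    """Importance-weighted surrogate for grad KL[P_B || P_F] (Lemma 2),
    correct up to a positive multiplicative constant."""
    log_w = log_pb_traj + shifted_log_reward(log_r) - log_pf_traj
    return -(log_w.exp().detach() * log_pf_traj).mean()
```

In all three cases, the unknown normalizing constant $Z$ only rescales the surrogate (and cancels exactly in the Renyi ratio), so it never needs to be estimated.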
Crucially, choosing an appropriate learning objective is an empirical question that one should consider on a problem-by-problem basis, similar to the problem of selecting among Markov chain simulation techniques [27]. In particular, a one-size-fits-all solution does not exist; see Section 5 for a thorough experimental investigation. Independently of the chosen method, however, the Monte Carlo estimators for the quantities outlined in Lemma 2 are of potentially high variance and may require a relatively large number of trajectories to yield a reliable estimate of the gradients [97]. The following sections demonstrate that variance reduction techniques alleviate this issue.

4 Control variates for low-variance gradient estimation

Control variates. We first review the concept of a control variate. Let $f \colon \mathcal{P}_S \to \mathbb{R}$ be a real-valued measurable function and assume that our goal is to estimate $\mathbb{E}_{\tau \sim \pi}[f(\tau)]$ for a probability measure $\pi$ on $\Sigma_\mathcal{P}$ (see Section 2 to recall the definitions). The variance of a naive Monte Carlo estimator for this quantity based on $n$ samples is $\mathrm{Var}_\pi(f(\tau))/n$. On the other hand, consider a random variable (RV) $g \colon \mathcal{P}_S \to \mathbb{R}$ positively correlated with $f$ and with known expectation $\mathbb{E}_\pi[g(\tau)]$. Then, the variance of a naive Monte Carlo estimator for $\mathbb{E}_\pi[f(\tau) - a(g(\tau) - \mathbb{E}_\pi[g(\tau)])]$ with a baseline $a \in \mathbb{R}$ is
$$\frac{1}{n}\left(\mathrm{Var}_\pi(f(\tau)) - 2a\, \mathrm{Cov}_\pi(f(\tau), g(\tau)) + a^2\, \mathrm{Var}_\pi(g(\tau))\right), \quad (5)$$
which is potentially smaller than $\frac{1}{n}\mathrm{Var}_\pi(f(\tau))$ if the covariance between $f$ and $g$ is sufficiently large. Under these conditions, we choose the value of $a$ that minimizes Equation 5 [94], namely, $a^{\star} = \mathrm{Cov}_\pi(f(\tau), g(\tau))/\mathrm{Var}_\pi(g(\tau))$. We then call the function $g$ a control variate [63]. Also, although the quantities defining the best baseline $a^{\star}$ are generally unavailable in closed form, one commonly uses a batch-based estimate of $\mathrm{Cov}_\pi(f(\tau), g(\tau))$ and $\mathrm{Var}_\pi(g(\tau))$; the incurred bias is generally negligible relative to the reduced variance [71, 79, 85]. For vector-valued RVs, we let $a$ be a diagonal matrix and exhibit, in the next proposition, the optimal baseline minimizing the covariance matrix's trace.

Proposition 2 (Control variate for gradients). Let $f, g \colon \mathcal{P}_S \to \mathbb{R}^d$ be vector-valued functions and $\pi$ be a probability measure on $\mathcal{P}_S$. Consider a baseline $a \in \mathbb{R}$ and assume $\mathbb{E}_\pi[g(\tau)] = 0$. Then,
$$\arg\min_{a \in \mathbb{R}} \mathrm{Tr}\, \mathrm{Cov}_\pi\left[f(\tau) - a\, g(\tau)\right] = \frac{\mathbb{E}_\pi\left[g(\tau)^{\top}\left(f(\tau) - \mathbb{E}_\pi[f(\tau)]\right)\right]}{\mathbb{E}_\pi\left[g(\tau)^{\top} g(\tau)\right]}.$$

Note that, when implementing the REINFORCE gradient estimator, the expectation we wish to estimate may generally be written as $\mathbb{E}_{P_F(s_o, \cdot)}\left[\nabla_\theta f(\tau) + f(\tau)\, \nabla_\theta \log p_{F_\theta}(\tau)\right]$. For the second term, we use a leave-one-out estimator [85]; see below. For the first term, we use $\nabla_\theta \log p_{F_\theta}$ as a control variate, which satisfies $\mathbb{E}_{P_F(s_o, \cdot)}\left[\nabla_\theta \log p_{F_\theta}(\tau|s_o; \theta)\right] = 0$. Importantly, estimating the optimal baseline $a^{\star}$ in Proposition 2 cannot be done efficiently due to the non-linear dependence of the corresponding Monte Carlo estimator on the sample-level gradients [2]; i.e., it cannot be represented as a vector-Jacobian product, which is efficient to compute in reverse-mode automatic differentiation (autodiff) frameworks [8, 69]. Consequently, we consider a linear approximation of both the numerator and the denominator defining $a^{\star}$ in Proposition 2, which may be interpreted as an instantiation of the delta method [83, Sec. 7.1.3].
Then, given a batch $\{\tau_1, \ldots, \tau_N\}$ of trajectories, we instead use
$$\hat{a} = \frac{\left\langle \sum_{n=1}^{N} \nabla_\theta \log p_{F_\theta}(\tau_n),\ \sum_{n=1}^{N} \nabla_\theta f(\tau_n) \right\rangle}{\epsilon + \left\lVert \sum_{n=1}^{N} \nabla_\theta \log p_{F_\theta}(\tau_n) \right\rVert^2} \quad (6)$$
as the batch-based estimate of the REINFORCE baseline; $\langle \cdot, \cdot \rangle$ represents the inner product between vectors. Intuitively, the numerator is roughly a linear approximation to the covariance between $\nabla_\theta \log p_{F_\theta}$ and $\nabla_\theta f$ under $P_F$. In contrast, the denominator approximately measures the variance of $\nabla_\theta \log p_{F_\theta}$, and $\epsilon > 0$ is included to enhance numerical stability. As a consequence, for the reverse KL divergence, $\nabla_\theta f(\tau) = \nabla_\theta \log p_{F_\theta}(\tau)$, $\hat{a} \approx 1$, and the term corresponding to the expectation of $\nabla_\theta f(\tau)$ vanishes. We empirically find that this approach frequently reduces the variance of the estimated gradients by a large margin (see Figure 2 above and Section 5 below).

Leave-one-out estimator. We now focus on obtaining a low-variance estimate of $\mathbb{E}_{\tau \sim P_F(s_o, \cdot)}\left[f(\tau)\, \nabla_\theta \log p_{F_\theta}(\tau)\right]$. As an alternative to the estimator of Proposition 2, Shi et al. [85] and Salimans and Knowles [82] proposed a sample-dependent baseline of the form $a(\tau_i) = \frac{1}{N-1}\sum_{1 \leq n \leq N,\, n \neq i} f(\tau_n)$ for $i \in \{1, \ldots, N\}$. The resulting estimator,
$$\delta = \frac{1}{N}\sum_{n=1}^{N}\left(f(\tau_n) - \frac{1}{N-1}\sum_{j=1,\, j \neq n}^{N} f(\tau_j)\right)\nabla_\theta \log p_{F_\theta}(\tau_n),$$
is unbiased for $\mathbb{E}\left[f(\tau)\, \nabla_\theta \log p_{F_\theta}(\tau)\right]$ due to the independence between $\tau_i$ and $\tau_j$ for $i \neq j$. Strikingly, $\delta$ can be swiftly computed with autodiff: if $\mathbf{f} = (f(\tau_n))_{n=1}^{N}$ and $\mathbf{p} = (\log p_{F_\theta}(\tau_n))_{n=1}^{N}$, then $\delta$ is the $\theta$-gradient of
$$\frac{1}{N}\left\langle \mathrm{sg}\left[\mathbf{f} - \frac{1}{N-1}(\mathbf{1} - I)\,\mathbf{f}\right],\ \mathbf{p} \right\rangle, \quad (7)$$
with $\mathrm{sg}$ as the stop-gradient operation (e.g., lax.stop_gradient in JAX [8] and torch.detach in PyTorch [69]) and $\mathbf{1}$ denoting the all-ones matrix. Importantly, these techniques incur a minimal computational overhead for the stochastic optimization algorithms relative to the considerable reduction in variance they enact.

Relationship with previous works. Importantly, Malkin et al. [56] used $\hat{a} = \frac{1}{N}\sum_n f(\tau_n)$ as a baseline and an importance-weighted aggregation to adjust for the off-policy sampling of trajectories, introducing bias in the gradient estimates and relinquishing guarantees of the optimization procedure. A learnable baseline independently trained to match this quantity was also considered. This potentially entailed the inaccurate conclusion that the TB and DB losses are superior to standard divergence-based objectives. Indeed, the following section underlines that such divergence measures are sound and practical learning objectives for GFlowNets for a range of tasks.

Illustration of the control variates' effectiveness. We train the GFlowNets using increasingly larger batches of $\{2^i \colon i \in [\![5, 10]\!]\}$ trajectories with and without CVs. In this setting, Figure 2 showcases the drastic reduction in the variance, represented by the covariance matrix's trace, of the estimated learning objectives' gradients with respect to the model's parameters promoted by the CVs. Impressively, as we show in Figure 6, this approach significantly increases the efficiency of the underlying stochastic optimization algorithm. See Section 5 and Appendix D for further details.
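Both estimators above are straightforward to implement with autodiff. The following is a schematic PyTorch rendering (ours, with illustrative names) of the leave-one-out surrogate of Equation 7 and the batch baseline of Equation 6; it is a sketch under the assumption that per-trajectory values and summed per-sample gradients are available as flat tensors.

```python
import torch

def loo_reinforce_surrogate(f, log_pf_traj):
    """Leave-one-out REINFORCE surrogate (Equation 7): differentiating the returned
    scalar w.r.t. theta yields the unbiased estimator delta of E[f * grad log p_F].
    f: per-trajectory values f(tau_n), shape (N,); log_pf_traj: log p_F(tau_n | s_o)."""
    N = f.shape[0]
    loo_baseline = (f.sum() - f) / (N - 1)          # mean of the other N - 1 samples
    return ((f - loo_baseline).detach() * log_pf_traj).mean()

def cv_baseline(grad_logpf_sum, grad_f_sum, eps=1e-8):
    """Batch-based baseline of Equation 6 for the control variate grad log p_F:
    a linearized covariance-over-variance ratio computed from the summed
    per-sample gradients (each obtainable with a single backward pass)."""
    return torch.dot(grad_logpf_sum, grad_f_sum) / (eps + grad_logpf_sum.pow(2).sum())
```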
5 Training GFlowNets with divergence measures

Our experiments serve two purposes. Firstly, we show in Section 5.2 that minimizing divergence-based learning objectives leads to competitive and often better approximations than the alternatives based on log-squared violations of the flow network's balance. This underlines the effectiveness of well-established divergence measures for training GFlowNets [56, 112]. Secondly, we highlight in Section 5.3 that the reduction of variance enacted by our control variates critically accelerates the convergence of GFlowNets. We consider widely adopted benchmark tasks from the GFlowNet literature, described in Section 5.1, contemplating both discrete and continuous target distributions. Please refer to Appendix B and Appendix E for additional information on the experimental setup.

5.1 Generative tasks

Below, we provide a high-level characterization of the generative tasks used for synthetic data generation and training. For a more rigorous description in the light of Section 2, see Appendix B.

Figure 3: Divergence-based learning objectives often lead to faster training than the TB loss. Notably, contrasting with the experiments of [56], there is no single best loss function that always leads to the fastest convergence rate, and minimizing well-known divergence measures is often on par with or better than minimizing the TB loss in terms of convergence speed. Results were averaged across three different seeds. Also, we fix α = 0.5 for both Tsallis-α and Renyi-α divergences.

Set generation [3, 34, 65, 66]. A state $s$ corresponds to a set of size up to a given $S$, and the terminal states $\mathcal{X}$ are sets of size $S$; a transition corresponds to adding an element from a deposit $\mathcal{D}$ to $s$. The IGP starts at an empty set, and the log-reward of an $x \in \mathcal{X}$ is $\sum_{d \in x} f(d)$ for a fixed $f \colon \mathcal{D} \to \mathbb{R}$.

Autoregressive sequence generation [32, 55]. Similarly, a state is a sequence $s$ of maximum size $S$, and a terminal state is a sequence ended by an end-of-sequence token; a transition appends $d \in \mathcal{D}$ to $s$. The IGP starts with an empty sequence and, for $x \in \mathcal{X}$, $\log r(x) = \sum_{i=1}^{|x|} g(i)\, f(x_i)$ for functions $f, g$.

Bayesian phylogenetic inference (BPI) [111]. A state $s$ is a forest composed of binary trees with labeled leaves and unlabelled internal nodes, and a transition amounts to joining the roots of two trees to a newly added node. Then, $s$ is terminal when it is a single connected tree, called a phylogenetic tree. Finally, given a dataset of nucleotide sequences, the reward function is the unnormalized posterior over trees induced by Jukes and Cantor's (JC69) mutation model [37] and a uniform prior.

Hypergrid navigation [3, 55, 56, 66]. A state $s \in \{0, \ldots, H-1\}^d$ is a component of an $H$-sized and $d$-dimensional Euclidean grid. The IGP starts at $0$ and, if we let $\delta_i$ be the $i$-th row of the identity matrix and $\Delta(s) = \{\delta_i \colon i \in \{1, \ldots, d\},\ \max_j (s + \delta_i)_j < H\}$, a transition either adds a $\delta \in \Delta(s)$ to $s$ or stops at $s$. We use Malkin et al. [55, Section 5.1]'s reward function with $R_o = 10^{-3}$.

Bayesian structure learning [15, 16]. A state $s$ is a DAG representing a Bayesian network; a transition either adds an edge to $s$ or stops the IGP. Similarly to Deleu et al. [15], we ensure the added edges preserve the state's acyclicity. The reward function is defined as the maximum likelihood of the linear Gaussian structural model induced by the current state based on a fixed i.i.d. data set.

Mixture of Gaussians (GMs) [48, 110]. The IGP starts at $0 \in \mathbb{R}^d$ and proceeds by sequentially substituting each coordinate with a sample from a real-valued distribution. For a $K$-component GM, the reward of $x \in \mathbb{R}^d$ is defined as $\sum_{k=1}^{K} \alpha_k\, \mathcal{N}(x|\mu_k, \Sigma_k)$ with $\alpha_k \geq 0$ and $\sum_k \alpha_k = 1$.

Banana-shaped distribution [57, 76]. We use the same IGP implemented for a bi-dimensional GM. For $x \in \mathbb{R}^2$, we set $r(x)$ to a normal likelihood defined on a quadratic function of $x$; see Equation 8 in the supplement. We use HMC samples as ground truth to gauge performance on this task.
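As a concrete instance of the reward specifications above, the snippet below evaluates the Gaussian-mixture log-reward in log space with a logsumexp for numerical stability. The mixture parameters shown are illustrative placeholders, not the ones used in the experiments (those are specified in the appendices).

```python
import torch
from torch.distributions import MultivariateNormal

def gmm_log_reward(x, weights, means, covs):
    """Log-reward for the Gaussian-mixture task: log sum_k alpha_k N(x | mu_k, Sigma_k),
    evaluated with logsumexp to avoid underflow for points far from all components."""
    log_probs = torch.stack([
        MultivariateNormal(mu, cov).log_prob(x) for mu, cov in zip(means, covs)
    ])                                   # one log-density per component, shape (K,)
    return torch.logsumexp(torch.log(weights) + log_probs, dim=0)

# toy usage with two well-separated components
means = [torch.tensor([-2.0, -2.0]), torch.tensor([2.0, 2.0])]
covs = [torch.eye(2) * 0.3, torch.eye(2) * 0.3]
weights = torch.tensor([0.5, 0.5])
print(gmm_log_reward(torch.zeros(2), weights, means, covs))
```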
5.2 Assessing convergence speed

Next, we provide evidence that minimizing divergence-based objectives frequently leads to faster convergence than minimizing the standard TB loss [55].

Experimental setup. We compare the convergence speed in terms of the rate of decrease of a measure of distributional error when using different learning objectives for a GFlowNet trained to sample from each of the distributions described in Section 5.1. For discrete distributions, we adopt the evaluation protocols of previous works [3, 54, 55, 66] and compute the L1 distance between the learned $p_T(x; \theta)$ and the target $r(x)$, namely, $\sum_{x \in \mathcal{X}} |p_T(x; \theta) - r(x)/Z|$. To approximate $p_T$, we use a Monte Carlo estimate of $p_T(x; \theta) = \mathbb{E}_{\tau \sim P_B(x, \cdot)}\left[p_{F_\theta}(\tau|s_o; \theta)/p_B(\tau|x)\right]$. For continuous distributions, we echo [48, 110] and compute the Jensen-Shannon divergence between $P_T(x; \theta)$ and $R(x)$:
$$D_{JS}[P_T \| R] = \frac{1}{2}\left(D_{KL}[P_T \| M] + D_{KL}[R \| M]\right) = \frac{1}{2}\left(\mathbb{E}_{x \sim P_T}\left[\log \frac{p_T(x)}{m(x)}\right] + \mathbb{E}_{x \sim R}\left[\log \frac{r(x)}{Z\, m(x)}\right]\right),$$
with $M(B) = \frac{1}{2}\left(P_T(B) + R(B)/R(\mathcal{X})\right)$ being the averaged measure of $P_T$ and $R$, and $m$ its corresponding density relative to the reference measure $\mu$. Remarkably, for the GMs distribution, we can directly sample from the target to estimate $D_{KL}[R \| M]$, and the autoregressive nature of the generative process ensures that $p_T(x) = p_{F_\theta}(\tau|s_o)$ for the unique trajectory $\tau$ starting at $s_o$ and finishing at $x$. Hence, we get an unbiased estimate of $D_{KL}[P_T \| M]$. Finally, let $\mathcal{X}_t$ be the first $t$ terminal states encountered during training and $\{x^{(1)}, \ldots, x^{(k)}\}$ be a descending ordering of $\mathcal{X}_t$ according to $r$. Then, we select a threshold $\tau \in \mathbb{R}$ and an integer $K$ to report $\mathrm{NoM}(\mathcal{X}_t) = |\{r(x) \colon r(x) \geq \tau,\ x \in \mathcal{X}_t\}|$, called the number of modes, and $\mathrm{TopK}(\mathcal{X}_t) = \mathrm{AVG}\left\{r(x^{(i)}) \colon 1 \leq i \leq K\right\}$, referred to as the top-K average reward. Both NoM and TopK are metrics of substantial interest in the GFlowNet literature [3, 55, 56, 64, 65].

Figure 4: Average reward for the $K$ highest-scoring samples (top-K) and number of modes found during training for the tasks of sequence design, set generation, hypergrid, and DAG environments. With the only exception of the hypergrid task, the minimization of divergence-based measures leads to similar and often faster discovery of high-valued states relative to their balance-based counterparts.

Figure 5: Learned distributions for the banana-shaped target. Tsallis-α, Renyi-α, and forward KL lead to a better model than TB and reverse KL, which behave similarly, as predicted by Proposition 1.

Table 1: Divergence minimization achieves better than or similar accuracy compared to enforcing TB.

| | BPI | Sequences | Sets | GMs |
| --- | --- | --- | --- | --- |
| TB | 0.22 ± 0.04 | 0.28 ± 0.06 | 0.07 ± 0.00 | 0.31 ± 0.08 |
| Rev. KL | 0.21 ± 0.04 | 0.16 ± 0.06 | 0.03 ± 0.00 | 0.31 ± 0.09 |
| For. KL | 0.22 ± 0.04 | 0.23 ± 0.12 | 0.03 ± 0.00 | 0.09 ± 0.10 |
| Renyi-α | 0.22 ± 0.03 | 0.23 ± 0.10 | 0.03 ± 0.00 | 0.19 ± 0.13 |
| Tsallis-α | 0.21 ± 0.04 | 0.22 ± 0.09 | 0.03 ± 0.00 | 0.21 ± 0.11 |
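For reference, the sketch below outlines how the Monte Carlo estimate of $p_T(x; \theta)$ and the L1 metric can be computed in code. Here, `sample_backward`, `log_pf_fn`, and `log_pb_fn` are placeholder names for routines assumed to be available (drawing a backward trajectory ending at $x$ and evaluating the corresponding log-densities); this is an illustration of the protocol, not a verbatim excerpt of the evaluation code.

```python
import torch

def estimate_p_T(x, sample_backward, log_pf_fn, log_pb_fn, num_samples=64):
    """Monte Carlo estimate of p_T(x; theta) = E_{tau ~ P_B(x, .)}[p_F(tau|s_o) / p_B(tau|x)]."""
    ratios = []
    for _ in range(num_samples):
        tau = sample_backward(x)                       # backward trajectory ending at x
        ratios.append(torch.exp(log_pf_fn(tau) - log_pb_fn(tau, x)))
    return torch.stack(ratios).mean()

def l1_error(p_T_hat, log_r):
    """L1 distance sum_x |p_T(x) - r(x)/Z| over an enumeration of the terminal states;
    p_T_hat and log_r are tensors indexed by terminal state."""
    target = torch.softmax(log_r, dim=0)               # r(x)/Z for the enumerated support
    return (p_T_hat - target).abs().sum()
```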
Results. Figure 3 shows that minimizing divergence-based measures accelerates the training convergence of GFlowNets, whereas Figure 5 (for the banana-shaped distribution) and Table 1 highlight that we obtain a more accurate model with a fixed compute budget. The difference between learning objectives is not statistically significant for the BPI task. Also, we may attribute the superior performance of the reverse KL compared to the forward KL in the sequence generation task to the high variance of the importance-sampling-based gradient estimates. Indeed, the observed differences disappear when we increase the batch of trajectories to reduce the estimator's variance (see Figure 8 in Appendix D). In conclusion, our empirical results, based on experiments testing diverse generative settings and expanding prior art [48, 56, 112], show that training methods based on minimizing f-divergence VI objectives with adequate CVs implemented are practical and effective in many tasks. Correlatively, Figure 4 supports the fact that minimizing divergence-based objectives frequently implies better coverage of the target's high-probability regions; the only exception is the (extremely sparse) hypergrid task [55].

Figure 6: Learning curves for different objective functions in the task of set generation. The reduced variance of the gradient estimates notably increases training stability and speed.

5.3 Reducing the variance of the estimated gradients

Figure 2 demonstrates that implementing CVs for the REINFORCE estimator reduces the noise level of gradient estimates significantly. This reduction in variance also accelerates training convergence. To illustrate this, we use the same experimental setup from Section 5.2 and analyze the learning curves for each divergence measure with and without control variates.

Results. Figure 6 shows that the implemented gradient variance reduction techniques significantly enhance the learning stability of GFlowNets and drastically accelerate training convergence when minimizing the reverse KL divergence. Our results indicate that well-designed CVs for gradient estimation can greatly benefit GFlowNet training. Notably, similar improvements have been observed in the context of Langevin dynamics simulations [22, 31, 44] and policy gradient methods for RL [68, 100].

6 Conclusions, limitations and broader impact

Discussion. We showed in a comprehensive range of experiments that divergence measures common in VI (forward KL, reverse KL, Renyi-α, and Tsallis-α) are effective learning objectives for training GFlowNets, performing competitively with or better than their balance-based counterparts. To achieve this, the introduction of efficacious control variates for low-variance gradient estimation of the divergence-based objectives was crucial, which is a key distinction between our work and prior art [55, 112]. Additionally, we developed the theoretical connection between GFlowNets and VI beyond the setting of finitely supported measures, establishing results for arbitrary topological spaces.

Limitations. Albeit comprehensive and on par with the wider literature, our empirical evaluation was performed on problems of relatively small size due to the intractability of probing a GFlowNet's distributional accuracy on very large state spaces. That said, we acknowledge that an assessment on the domains of natural language processing [30] and drug discovery [3] based on context-specific metrics would strengthen our conclusions; we leave these tasks to future endeavors. Similarly, while we observed promising results for α = 0.5, there might be different choices of α that, depending on the application, might strike a better explore-exploit tradeoff and incur faster convergence.
Thus, thoroughly exploring different α might be especially useful to practitioners. Broader impact. Overall, our work highlights the potential of the once-dismissed VI-inspired schemes for training GFNs, paving the way for further research towards improving the GFlow Nets by drawing inspiration from the VI literature. For instance, one could develop χ-divergence-based losses for GFNs [19], combine GFNs with MCMC using Ruiz and Titsias [81] s divergence, or employ an objective similar to that of importance-weighted autoencoders [10]. Finally, although an ϵ-greedy off-policy sampling scheme can be easily incorporated into a divergence-minimizing algorithm through an importance-sampling correction, it remains elusive whether this would be possible for more sophisticated sampling techniques such as replay buffer [15] and local search [40]. Acknowledgements This work was supported by the Fundação Carlos Chagas Filho de Amparo à Pesquisa do Estado do Rio de Janeiro FAPERJ (SEI-260003/000709/2023), the São Paulo Research Foundation FAPESP (2023/00815-6), the Conselho Nacional de Desenvolvimento Científico e Tecnológico CNPq (404336/2023-0), and the Silicon Valley Community Foundation through the University Blockchain Research Initiative (Grant #2022-199610). [1] S. Axler. Measure, Integration &; Real Analysis. Springer International Publishing, 2020. ISBN 9783030331436. doi: 10.1007/978-3-030-33143-6. [2] A. G. Baydin, B. A. Pearlmutter, A. A. Radul, and J. M. Siskind. Automatic differentiation in machine learning: a survey. Journal of Marchine Learning Research, 2018. [3] E. Bengio, M. Jain, M. Korablyov, D. Precup, and Y. Bengio. Flow network based generative models for non-iterative diverse candidate generation. In Neur IPS (Neur IPS), 2021. [4] Y. Bengio, S. Lahlou, T. Deleu, E. J. Hu, M. Tiwari, and E. Bengio. Gflownet foundations. Journal of Machine Learning Research (JMLR), 2023. [5] M. Betancourt. A conceptual introduction to hamiltonian monte carlo. ar Xiv preprint ar Xiv:1701.02434, 2017. [6] C. M. Bishop. Pattern Recognition and Machine Learning. Springer, 2007. [7] D. M. Blei and et al. Variational inference: A review for statisticians. Journal of the American Statistical Association, 2017. [8] J. Bradbury, R. Frostig, P. Hawkins, M. J. Johnson, C. Leary, D. Maclaurin, G. Necula, A. Paszke, J. Vander Plas, S. Wanderman-Milne, and Q. Zhang. JAX: composable transformations of Python+Num Py programs, 2018. [9] L. Buesing, N. Heess, and T. Weber. Approximate inference in discrete distributions with monte carlo tree search and value functions. In AISTATS, pages 624 634. PMLR, 2020. [10] Y. Burda, R. B. Grosse, and R. Salakhutdinov. Importance weighted autoencoders. In ICLR (Poster), 2016. [11] P. Carbonetto, M. King, and F. Hamze. A stochastic approximation method for inference in probabilistic graphical models. Neur IPS, 2009. [12] B. Carpenter, A. Gelman, M. D. Hoffman, D. Lee, B. Goodrich, M. Betancourt, M. Brubaker, J. Guo, P. Li, and A. Riddell. Stan: A probabilistic programming language. Journal of statistical software, 2017. [13] T. da Silva, E. Silva, A. Ribeiro, A. Góis, D. Heider, S. Kaski, and D. Mesquita. Humanin-the-loop causal discovery under latent confounding using ancestral gflownets. ar Xiv preprint:2309.12032, 2023. [14] M. P. Deisenroth and C. E. Rasmussen. PILCO: A model-based and data-efficient approach to policy search. In ICML, Proceedings of Machine Learning Research, pages 465 472. PMLR, 2011. [15] T. Deleu, A. Góis, C. C. Emezue, M. Rankawat, S. 
Lacoste-Julien, S. Bauer, and Y. Bengio. Bayesian structure learning with generative flow networks. In UAI, 2022. [16] T. Deleu, M. Nishikawa-Toomey, J. Subramanian, N. Malkin, L. Charlin, and Y. Bengio. Joint Bayesian inference of graphical structure and parameters with a single generative flow network. In Advances in Neural Processing Systems (Neur IPS), 2023. [17] T. Deleu, P. Nouri, N. Malkin, D. Precup, and Y. Bengio. Discrete probabilistic inference as control in multi-path environments. Co RR, abs/2402.10309, 2024. [18] S. Depeweg, J. M. Hernández-Lobato, F. Doshi-Velez, and S. Udluft. Learning and policy search in stochastic dynamical systems with bayesian neural networks. ar Xiv preprint ar Xiv:1605.07127, 2016. [19] A. B. Dieng, D. Tran, R. Ranganath, J. Paisley, and D. Blei. Variational inference via χ upper bound minimization. Neur IPS, 2017. [20] J. Domke. Provable gradient variance guarantees for black-box variational inference. In Neur IPS, pages 328 337, 2019. [21] J. Domke. Provable smoothness guarantees for black-box variational inference. In ICML, volume 119 of Proceedings of Machine Learning Research, pages 2587 2596. PMLR, 2020. [22] K. A. Dubey, S. J. Reddi, S. A. Williamson, B. Poczos, A. J. Smola, and E. P. Xing. Variance reduction in stochastic gradient langevin dynamics. In Neur IPS, 2016. [23] S. Dutta, M. Das, and U. Maulik. Towards causality-based explanation of aerial scene classifiers. IEEE Geoscience and Remote Sensing Letters, 2023. [24] J.-P. Falet, H. B. Lee, N. Malkin, C. Sun, D. Secrieru, D. Zhang, G. Lajoie, and Y. Bengio. Delta-ai: Local objectives for amortized inference in sparse graphical models, 2023. [25] J. Felsenstein. Evolutionary trees from DNA sequences: A maximum likelihood approach. Journal of Molecular Evolution, 1981. [26] T. Garipov, S. D. Peuter, G. Yang, V. Garg, S. Kaski, and T. S. Jaakkola. Compositional sculpting of iterative generative processes. In Thirty-seventh Conference on Neural Information Processing Systems, 2023. [27] C. J. Geyer. Markov chain monte carlo maximum likelihood. 1991. [28] G. Hinton, N. Srivastava, and K. Swersky. Neural networks for machine learning lecture 6a overview of mini-batch gradient descent. 2012. [29] E. J. Hu, N. Malkin, M. Jain, K. E. Everett, A. Graikos, and Y. Bengio. Gflownet-em for learning compositional latent variable models. In International Conference on Machine Learning (ICLR), 2023. [30] E. J. Hu, M. Jain, E. Elmoznino, Y. Kaddar, G. Lajoie, Y. Bengio, and N. Malkin. Amortizing intractable inference in large language models. In The Twelfth International Conference on Learning Representations, 2024. [31] Z. Huang and S. Becker. Stochastic gradient langevin dynamics with variance reduction. Co RR, 2021. [32] M. Jain, E. Bengio, A. Hernandez-Garcia, J. Rector-Brooks, B. F. P. Dossou, C. A. Ekbote, J. Fu, T. Zhang, M. Kilgour, D. Zhang, L. Simine, P. Das, and Y. Bengio. Biological sequence design with GFlow Nets. In International Conference on Machine Learning (ICML), 2022. [33] M. Jain, T. Deleu, J. Hartford, C.-H. Liu, A. Hernandez-Garcia, and Y. Bengio. Gflownets for ai-driven scientific discovery. Digital Discovery, 2023. [34] H. Jang, M. Kim, and S. Ahn. Learning energy decompositions for partial inference in GFlownets. In The Twelfth International Conference on Learning Representations, 2024. [35] M. Järvenpää and J. Corander. On predictive inference for intractable models via approximate bayesian computation. Statistics and Computing, 33(2), Feb. 2023. ISSN 1573-1375. [36] M. I. 
Jordan, Z. Ghahramani, T. S. Jaakkola, and L. K. Saul. An introduction to variational methods for graphical models. Mach. Learn., 37(2):183 233, 1999. [37] T. H. Jukes and C. R. Cantor. Evolution of protein molecules. In Mammalian Protein Metabolism. Elsevier, 1969. [38] H. J. Kappen, V. Gómez, and M. Opper. Optimal control as a graphical model inference problem. Mach. Learn., 87(2):159 182, 2012. [39] K. Kim, Y. Ma, and J. Gardner. Linear convergence of black-box variational inference: Should we stick the landing? In AISTATS, volume 238 of Proceedings of Machine Learning Research, pages 235 243. PMLR, 2024. [40] M. Kim, T. Yun, E. Bengio, D. Zhang, Y. Bengio, S. Ahn, and J. Park. Local search gflownets. ar Xiv preprint ar Xiv:2310.02710, 2023. [41] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. ar Xiv preprint ar Xiv:1412.6980, 2014. [42] D. P. Kingma, S. Mohamed, D. Jimenez Rezende, and M. Welling. Semi-supervised learning with deep generative models. Neur IPS, 2014. [43] D. P. Kingma, T. Salimans, B. Poole, and J. Ho. Variational diffusion models, 2023. [44] Y. Kinoshita and T. Suzuki. Improved convergence rate of stochastic gradient langevin dynamics with variance reduction and its application to optimization. In Neur IPS, 2022. [45] J. Knoblauch, J. Jewson, and T. Damoulas. An optimization-centric view on bayes rule: Reviewing and generalizing variational inference. J. Mach. Learn. Res., 23:132:1 132:109, 2022. [46] A. Kucukelbir, D. Tran, R. Ranganath, A. Gelman, and D. M. Blei. Automatic differentiation variational inference. J. Mach. Learn. Res., 18:14:1 14:45, 2017. [47] S. Kullback and R. A. Leibler. On Information and Sufficiency. The Annals of Mathematical Statistics, 1951. [48] S. Lahlou, T. Deleu, P. Lemos, D. Zhang, A. Volokhova, A. Hernández-García, L. N. Ezzine, Y. Bengio, and N. Malkin. A theory of continuous generative flow networks. In ICML, volume 202 of Proceedings of Machine Learning Research, pages 18269 18300. PMLR, 2023. [49] E. Lau, N. M. Vemgal, D. Precup, and E. Bengio. DGFN: Double generative flow networks. In Neur IPS 2023 Generative AI and Biology (Gen Bio) Workshop, 2023. [50] S. Levine. Reinforcement learning and control as probabilistic inference: Tutorial and review. Co RR, abs/1805.00909, 2018. [51] Y. Li and R. E. Turner. Rényi divergence variational inference. Neur IPS, 29, 2016. [52] D. Liu and et al. Gflowout: Dropout with generative flow networks. In International Conference on Machine Learning, ICML 23. JMLR.org, 2023. [53] R. Liu, J. Regier, N. Tripuraneni, M. I. Jordan, and J. D. Mc Auliffe. Rao-blackwellized stochastic gradients for discrete distributions. In ICML, volume 97 of Proceedings of Machine Learning Research, pages 4023 4031. PMLR, 2019. [54] K. Madan, J. Rector-Brooks, M. Korablyov, E. Bengio, M. Jain, A. C. Nica, T. Bosc, Y. Bengio, and N. Malkin. Learning gflownets from partial episodes for improved convergence and stability. In International Conference on Machine Learning, 2022. [55] N. Malkin, M. Jain, E. Bengio, C. Sun, and Y. Bengio. Trajectory balance: Improved credit assignment in GFlownets. In Neur IPS (Neur IPS), 2022. [56] N. Malkin, S. Lahlou, T. Deleu, X. Ji, E. Hu, K. Everett, D. Zhang, and Y. Bengio. GFlow Nets and variational inference. International Conference on Learning Representations (ICLR), 2023. [57] D. Mesquita, P. Blomstedt, and S. Kaski. Embarrassingly parallel MCMC using deep invertible transformations. In UAI, 2019. [58] T. Minka. Divergence measures and message passing. 
Technical report, Research, Microsoft, 2005. URL https://www.seas.harvard.edu/courses/cs281/papers/ minka-divergence.pdf. [59] S. Mohamed, M. Rosca, M. Figurnov, and A. Mnih. Monte carlo gradient estimation in machine learning. J. Mach. Learn. Res., 21:132:1 132:62, 2020. [60] S. Mohammadpour, E. Bengio, E. Frejinger, and P.-L. Bacon. Maximum entropy gflownets with soft q-learning, 2023. [61] R. M. Neal et al. Mcmc using hamiltonian dynamics. Handbook of markov chain monte carlo, 2011. [62] A. C. Nica, M. Jain, E. Bengio, C.-H. Liu, M. Korablyov, M. M. Bronstein, and Y. Bengio. Evaluating generalization in gflownets for molecule design. In ICLR2022 Machine Learning for Drug Discovery, 2022. [63] A. B. Owen. Monte Carlo theory, methods and examples. 2013. [64] L. Pan, N. Malkin, D. Zhang, and Y. Bengio. Better training of GFlow Nets with local credit and incomplete trajectories. In International Conference on Machine Learning (ICML), 2023. [65] L. Pan, N. Malkin, D. Zhang, and Y. Bengio. Better training of gflownets with local credit and incomplete trajectories. ar Xiv preprint ar Xiv:2302.01687, 2023. [66] L. Pan, D. Zhang, A. Courville, L. Huang, and Y. Bengio. Generative augmented flow networks. In International Conference on Learning Representations (ICLR), 2023. [67] L. Pan, D. Zhang, M. Jain, L. Huang, and Y. Bengio. Stochastic generative flow networks. In UAI, volume 216 of Proceedings of Machine Learning Research, pages 1628 1638. PMLR, 2023. [68] M. Papini, D. Binaghi, G. Canonaco, M. Pirotta, and M. Restelli. Stochastic variance-reduced policy gradient. In International Conference on Machine Learning, 2018. [69] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, A. Desmaison, A. Köpf, E. Yang, Z. De Vito, M. Raison, A. Tejani, S. Chilamkurthy, B. Steiner, L. Fang, J. Bai, and S. Chintala. Pytorch: An imperative style, high-performance deep learning library, 2019. [70] B. Poczos and J. Schneider. On the estimation of α-divergences. In AISTATS. PMLR, 2011. [71] R. Ranganath, S. Gerrish, and D. M. Blei. Black box variational inference. In AISTATS, volume 33 of JMLR Workshop and Conference Proceedings, pages 814 822. JMLR.org, 2014. [72] R. Ranganath, D. Tran, J. Altosaar, and D. M. Blei. Operator variational inference. In Neur IPS, pages 496 504, 2016. [73] J. Rector-Brooks, K. Madan, M. Jain, M. Korablyov, C.-H. Liu, S. Chandar, N. Malkin, and Y. Bengio. Thompson sampling for improved exploration in gflownets, 2023. [74] A. Rényi. On measures of entropy and information. In Proceedings of the Fourth Berkeley Symposium on Mathematical Statistics and Probability, Volume 1: Contributions to the Theory of Statistics. University of California Press, 1961. [75] D. Rezende and S. Mohamed. Variational inference with normalizing flows. In International conference on machine learning. PMLR, 2015. [76] B. Rhodes and M. U. Gutmann. Variational noise-contrastive estimation. In AISTATS, 2019. [77] L. Richter, A. Boustati, N. Nüsken, F. J. R. Ruiz, and Ö. D. Akyildiz. Vargrad: A low-variance gradient estimator for variational inference. 2020. [78] C. P. Robert et al. The Bayesian choice: from decision-theoretic foundations to computational implementation, volume 2. Springer, 2007. [79] G. Roeder, Y. Wu, and D. Duvenaud. Sticking the landing: simple, lower-variance gradient estimators for variational inference. 
In Proceedings of the 31st International Conference on Neural Information Processing Systems, NIPS 17, page 6928 6937, Red Hook, NY, USA, 2017. Curran Associates Inc. ISBN 9781510860964. [80] T. G. J. Rudner, V. Pong, R. Mc Allister, Y. Gal, and S. Levine. Outcome-driven reinforcement learning via variational inference. In Neur IPS, pages 13045 13058, 2021. [81] F. Ruiz and M. Titsias. A contrastive divergence for combining variational inference and mcmc. In International Conference on Machine Learning, 2019. [82] T. Salimans and D. A. Knowles. On using control variates with stochastic approximation for variational bayes and its connection to stochastic linear regression, 2014. [83] M. J. Schervish. Theory of statistics. Springer Science & Business Media, 2012. [84] M. W. Shen, E. Bengio, E. Hajiramezanali, A. Loukas, K. Cho, and T. Biancalani. Towards understanding and improving gflownet training. In International Conference on Machine Learning, 2023. [85] J. Shi, Y. Zhou, J. Hwang, M. Titsias, and L. Mackey. Gradient estimation with discrete stein operators. Neur IPS, 35, 2022. [86] Y. Sun, L. Dong, S. Huang, S. Ma, Y. Xia, J. Xue, J. Wang, and F. Wei. Retentive network: A successor to transformer for large language models, 2023. [87] D. Tiapkin, N. Morozov, A. Naumov, and D. Vetrov. Generative flow networks as entropyregularized rl, 2024. [88] E. Todorov. General duality between optimal control and estimation. In CDC, pages 4286 4292. IEEE, 2008. [89] M. Toussaint and A. J. Storkey. Probabilistic inference for solving discrete and continuous state markov decision processes. In ICML, volume 148 of ACM International Conference Proceeding Series, pages 945 952. ACM, 2006. [90] M.-N. Tran, M. K. Pitt, and R. Kohn. Adaptive metropolis hastings sampling using reversible dependent mixture proposals. Statistics and Computing, 2016. [91] C. Tsallis. Possible generalization of boltzmann-gibbs statistics. Journal of statistical physics, 52:479 487, 1988. [92] M. J. Wainwright and M. I. Jordan. Graphical models, exponential families, and variational inference. Found. Trends Mach. Learn., 1(1-2):1 305, 2008. [93] X. Wang, T. Geffner, and J. Domke. Joint control variate for faster black-box variational inference. In AISTATS, volume 238 of Proceedings of Machine Learning Research, pages 1639 1647. PMLR, 2024. [94] L. Weaver and N. Tao. The optimal reward baseline for gradient-based reinforcement learning. ar Xiv preprint ar Xiv:1301.2315, 2013. [95] V. D. Wild, R. Hu, and D. Sejdinovic. Generalized variational inference in function spaces: Gaussian measures meet bayesian deep learning. In Neur IPS, 2022. [96] D. Williams. Probability with Martingales. Cambridge University Press, 1991. [97] R. J. Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 1992. [98] B. Xu, N. Wang, T. Chen, and M. Li. Empirical evaluation of rectified activations in convolutional network. ar Xiv preprint ar Xiv:1505.00853, 2015. [99] K. Xu, W. Hu, J. Leskovec, and S. Jegelka. How powerful are graph neural networks? International Conference on Learning Representations (ICLR), 2019. [100] P. Xu, F. Gao, and Q. Gu. An improved convergence analysis of stochastic variance-reduced policy gradient. In UAI, 2020. [101] Z. Yang. Molecular Evolution: A Statistical Approach. Oxford University Press Oxford, May 2014. ISBN 9780191782251. doi: 10.1093/acprof:oso/9780199602605.001.0001. [102] J. S. Yedidia, W. T. Freeman, and Y. Weiss. 
Bethe free energy, kikuchi approximations, and belief propagation algorithms. Advances in neural information processing systems, 13(24), 2001. [103] M. Yin and M. Zhou. Semi-implicit variational inference. In International conference on machine learning. PMLR, 2018. [104] L. Yu and C. Zhang. Semi-implicit variational inference via score matching. ar Xiv preprint ar Xiv:2308.10014, 2023. [105] M. Zaheer, S. Kottur, S. Ravanbakhsh, B. Poczos, R. R. Salakhutdinov, and A. J. Smola. Deep sets. In Neur IPS, 2017. [106] C. Zhang, B. Shahbaba, and H. Zhao. Hamiltonian monte carlo acceleration using surrogate functions with random bases. Statistics and computing, 2017. [107] D. Zhang, R. T. Chen, N. Malkin, and Y. Bengio. Unifying generative models with gflownets and beyond. ICML Beyond Bayes workshop, 2022. [108] D. Zhang, N. Malkin, Z. Liu, A. Volokhova, A. Courville, and Y. Bengio. Generative flow networks for discrete probabilistic modeling. In International Conference on Machine Learning (ICML), 2022. [109] D. Zhang, H. Dai, N. Malkin, A. Courville, Y. Bengio, and L. Pan. Let the flows tell: Solving graph combinatorial optimization problems with gflownets. In Neur IPS (Neur IPS), 2023. [110] D. Zhang, R. T. Q. Chen, C.-H. Liu, A. Courville, and Y. Bengio. Diffusion generative flow samplers: Improving learning signals through partial trajectory optimization. In The Twelfth International Conference on Learning Representations, 2024. [111] M. Y. Zhou, Z. Yan, E. Layne, N. Malkin, D. Zhang, M. Jain, M. Blanchette, and Y. Bengio. Phylo GFN: Phylogenetic inference with generative flow networks. In The Twelfth International Conference on Learning Representations, 2024. [112] H. Zimmermann, F. Lindsten, J. van de Meent, and C. A. Naesseth. A variational perspective on generative flow networks. Trans. Mach. Learn. Res., 2023, 2023. A Related works Generative Flow Networks. GFlow Nets [3, 4] were initially proposed as a reinforcement learning algorithm for finding diverse high-valued states in a discrete environment by sampling from a distribution induced by a reward function. Shortly after, they were extended to sample from complex distributions in arbitrary topological spaces [48]. Remarkably, this family of models was successfully applied to problems as diverse as structure learning and causal discovery [13, 15, 16], discrete probabilistic modeling and graphical models [24, 29, 107, 108], combinatorial optimization and stochastic control [109, 110], drug discovery [3, 33, 62], design of biological sequences [32], natural language processing [30], and aerial scene classification [23]. Concomitantly to these advances, there is a growing literature concerned with the development of more efficient training algorithms for GFlow Nets [4, 40, 55, 84] primarily drawing inspiration from existing techniques in the reinforcement learning literature [60, 65, 66, 87]. In the same spirit, Tiapkin et al. [87] showed it is possible to frame GFlow Nets as an entropy-regularized reinforcement learning. In a study closely related to ours, Malkin et al. [56] proved the equivalence between GFlow Nets and hierarchical variational inference (HVI) for discrete distributions; however, when training GFlow Nets using divergence-based methods from the VI literature, the authors found no improvement relatively to the traditional flow-matching objectives. 
Extending beyond discrete distributions, this work provides a thorough analysis of training GFlow Nets by directly minimizing divergences commonly employed in variational inference, clarifying the conditions under which divergence-based objectives yield efficient learning procedures for GFlow Nets. Reinforcement Learning as Inference. Reinforcement learning (RL) has been extensively studied as a form of probabilistic inference, a perspective often referred to as control as inference. Todorov [88] demonstrates a duality between estimation and optimal control, establishing conditions under which estimation algorithms can be applied to control problems. Kappen et al. [38] demonstrated that optimal control problems could be framed as inference problems in graphical models, providing a unified perspective for solving control tasks. Levine [50] presents a modern and comprehensive formulation of RL as inference, linking it with VI in particular. Rudner et al. [80] integrate RL and VI further, demonstrating the conceptual and algorithmic gains of combining outcome-driven RL with variational inference to optimize policy distributions. Earlier, Toussaint and Storkey [89] applied approximate probabilistic inference methods to solve Markov decision processes (MDPs) with discrete and continuous states. The approach also aligns with model-based RL techniques, such as PILCO, which utilizes probabilistic models to enhance data efficiency in policy search [14]. Recent work by Deleu et al. [17] positions discrete probabilistic inference as a control problem in multi-path environments, highlighting the synergy between control theory and probabilistic modeling in the context of GFlow Nets. This body of work relates to the approach presented in this paper, which compares the optimization of trajectory balance and flow-matching losses, tied to the sequential decisions modeled by the GFlow Net, against the minimization of f-divergence measures, tied to approximate variational inference and generalized posterior inference [36, 45, 51, 92, 95]. Divergence measures and gradient reduction for VI. Approximate inference via variational inference (VI) methods [6, 7, 36, 92] initially relied on message passing and coordinate ascent methods to minimize the KL divergence between an unnormalized target and a proposal within a tractable parametric family of distributions. Despite the generality of this optimization perspective, concrete implementations often required specialized update equations and learning objectives for specific classes of models. On the other hand, the development of algorithms and software for automatic differentiation [2] and stochastic gradient estimators [59] unlocked the application of generic gradient-based optimization algorithms to inference and learning tasks for a broad class of models. Seminal works such as Black-Box VI (BBVI) [71], using the REINFORCE/score function estimator, and Automatic Differentiation VI (ADVI) [46], using reparameterization and change-of-variables, demonstrated practical algorithms for Bayesian inference in generic models, including models combining classical statistical modeling with neural networks. Overall, Mohamed et al. [59] review the development of the main gradient estimators: the score-function estimator [11, 71, 97, 103] and the pathwise gradient estimator, also known as the reparameterization trick [42, 43, 75]. We sketch both estimators below.
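To make the distinction concrete, the following minimal sketch (ours, not from the paper; the target function and the one-dimensional Gaussian variational family are illustrative choices) estimates the same gradient with both the score-function and the pathwise estimators and compares them against the closed form.

```python
# Minimal sketch (ours): score-function (REINFORCE) vs. pathwise gradient
# estimates of  d/dmu E_{x ~ N(mu, 1)}[f(x)]  for f(x) = (x - 2)^2.
# The closed-form gradient is 2 * (mu - 2), which both estimators approximate.
import numpy as np

rng = np.random.default_rng(0)
mu, n = 0.5, 10_000
f = lambda x: (x - 2.0) ** 2

# Score-function estimator: E[f(x) * d/dmu log N(x; mu, 1)], with score (x - mu).
x = rng.normal(mu, 1.0, size=n)
grad_score = np.mean(f(x) * (x - mu))

# Pathwise (reparameterization) estimator: x = mu + eps, so d f(x) / d mu = f'(x).
eps = rng.normal(0.0, 1.0, size=n)
grad_pathwise = np.mean(2.0 * (mu + eps - 2.0))

print(grad_score, grad_pathwise, 2 * (mu - 2.0))  # both estimates are close to -3.0
```

The pathwise estimator typically exhibits much lower variance, but it requires a differentiable reparameterization of the sampling path, which is unavailable for the discrete and compositional state spaces targeted by GFlow Nets; this is why the score-function estimator, and hence variance reduction, plays a central role in this work.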
The vanilla REINFORCE/score function estimator has notoriously high variance [11, 71, 77, 97], which prompted a body of work exploring variance reduction techniques. In the original BBVI proposal, Ranganath et al. [71] explored Rao-Blackwellization, combining iterated conditional expectations and control variates, and used the score function (which has zero expectation) as a control variate. Subsequent works have continued to refine these techniques: Liu et al. [53] use Rao-Blackwellized stochastic gradients for discrete distributions, while Kim et al. [39] and Wang et al. [93] explored joint control variates and provable linear convergence in BBVI. Additionally, Domke [21] and Domke [20] provided smoothness and gradient variance guarantees, further enhancing the robustness of score-function estimators for VI methods. Our work demonstrates that effective variance reduction techniques applied to f-divergence minimization can significantly enhance the convergence speed and stability of GFlow Net training. In theory and practice, our results on variance-reduced f-divergence training of GFlow Nets are highly compatible with this body of work on variance-reduced score-function estimators for VI. Furthermore, by showing that these techniques apply to a broad class of models and optimization objectives, including continuous and mixed structured supports, we move f-divergence-based GFlow Net training closer to recent notions of generalized Bayesian inference and generalized VI [45] and of variational inference in function spaces [95], with the common thread of casting posterior inference as an optimization problem guided by some divergence measure. This generalization can enable applications of GFlow Nets to a diverse range of machine learning tasks, enhancing their versatility and practical relevance.

B Detailed description of the generative tasks

Set generation [3, 34, 65, 66]. $S$ contains the sets of size up to $N$ with elements extracted from a fixed deposit $D$ of size $|D| \ge N$, and $s_o = \emptyset$. For $s \in S$ with $|s| < N$, $\kappa_f(s, \cdot)$ is a counting measure supported at (the $\sigma$-algebra induced by) $\{s \cup \{d\} \colon d \in D \setminus s\}$; for $|s| = N$, $\kappa_f(s, \cdot) = \delta_{s_f}$. Then, $X = \{s \in S \colon |s| = N\}$. Similarly, $\kappa_b(s, \cdot)$'s support is $\{s \setminus \{d\} \colon d \in s\}$ for $s \neq s_o$. We define the unnormalized target's density by $\log r(x) = \sum_{d \in x} f(d)$ for a fixed function $f \colon D \to \mathbb{R}$. We parameterize $p_F(s, \cdot)$ as a deep set [105] and fix $p_B(s, \cdot)$ as a uniform density for $s \in S$.

Autoregressive sequence generation [32, 55]. A sequence $s \in D^n$ is, for any $K > n$, bijectively associated with a subset of $(D \cup \{\bot\}) \times [K]$ via $s \mapsto \{(s_m, m) \colon 1 \le m \le n\} \cup \{(\bot, m) \colon K \ge m > n\}$; $\bot$ is a token denoting the sequence's end. In this context, we let $S \subseteq \mathcal{P}((D \cup \{\bot\}) \times [N + 1])$ be the set of sequences of size up to $N$, i.e., if $s \in S$ and $(\bot, n + 1) \in s$, then $(d, m) \in s$ iff $d = \bot$ for $n < m \le N + 1$ and there is $v \in (D \cup \{\bot\})^n$ such that $(v_m, m) \in s$ for $m \le n$; the initial state is $s_o = \emptyset$. For conciseness, we write $d \notin s$ to mean that $(d, i) \notin s$ for every $i$. Next, $\kappa_f(s, \cdot)$ is the counting measure supported at $\{s \cup \{(d, |s| + 1)\} \colon d \in D \cup \{\bot\}\}$ if $|s| < N$ and $\bot \notin s$; at $\{s \cup \{(\bot, N + 1)\}\}$ if $|s| = N$; and at $\{s_f\}$ otherwise. Thus, $X = \{s \in S \colon \bot \in s\}$. Also, $\kappa_b(s, \cdot)$ is supported at $\{s \setminus \{(d, |s|)\} \colon d \in D\}$, which has a single element corresponding to the removal of the element of $s$ with the largest index. The target's density is $\log r(x) = \sum_{(d, i) \in x \colon d \neq \bot} f(d)\,g(i)$ for fixed functions $f \colon D \to \mathbb{R}$ and $g \colon [N] \to \mathbb{R}$. We employ an MLP to parameterize $p_F(s, \cdot)$.
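As an illustration of the construction above, the following minimal sketch (ours; the vocabulary, the functions f and g, and the EOS marker are illustrative, and termination is represented by a single end-of-sequence pair rather than by padding) enumerates the forward actions and evaluates the log-reward of the autoregressive sequence task.

```python
# Minimal sketch (ours, not from the paper) of the autoregressive sequence task:
# states are sets of (token, position) pairs, forward actions append a token or
# the end-of-sequence symbol, and log r(x) = sum of f(d) * g(i) over tokens.
D = list(range(8))          # vocabulary of 8 tokens (|D| = 8, N = 6 as in Appendix E)
N = 6                       # maximum sequence length
EOS = "eos"                 # stand-in for the end-of-sequence token

def forward_actions(state):
    """Children of `state` under the forward kernel kappa_f."""
    if any(tok == EOS for tok, _ in state):      # terminated: only the sink follows
        return ["sink"]
    if len(state) == N:                          # full: must terminate with EOS
        return [state | {(EOS, N + 1)}]
    nxt = len(state) + 1
    return [state | {(tok, nxt)} for tok in D + [EOS]]

def log_reward(state, f, g):
    """log r(x) = sum over (d, i) in x with d != EOS of f(d) * g(i)."""
    return sum(f(d) * g(i) for d, i in state if d != EOS)

s = frozenset()                                   # initial state s_o
s = s | {(3, 1)} | {(5, 2)} | {(EOS, 3)}          # the sequence (3, 5) followed by EOS
print(len(forward_actions(frozenset({(3, 1)}))))  # 9 possible continuations
print(log_reward(s, f=lambda d: d / 10.0, g=lambda i: 1.0 / i))
```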
Bayesian phylogenetic inference (BPI) [111]. A (rooted) phylogeny is a complete binary tree with labeled leaves and weighted edges; each leaf corresponds to a biological species, and the edges' weights are a measurement of evolutionary time. Formally, we let $B$ be the set of observed biological species and $V_B$ be the set of $|B| - 1$ unobserved species. Next, we represent a phylogeny over $B$ as a weighted directed tree $G_B = (B \cup V_B, E_B, \omega_B)$ with edges $E_B$ endowed with a weight assignment $\omega_B$; we denote its root by $r(G_B)$. Under these conditions, we define
$$S = \left\{ \bigcup_{k=1}^{K} G_{F_k} \colon \bigsqcup_{k} F_k = B, \ G_{F_k} \text{ is a tree} \right\}$$
as the set of forests built upon phylogenetic trees over disjoint subsets of $B$; $\bigsqcup$ represents a disjoint union of non-empty sets and $s_o = \bigcup_{b \in B} G_{\{b\}}$ is the forest composed of the singleton trees $G_{\{b\}}$. Also, we define the amalgamation of phylogenies $G_{F_k}$ and $G_{F_j}$, $A(G_{F_k}, G_{F_j})$, as the tree obtained by joining their roots $r(G_{F_k})$ and $r(G_{F_j})$ to a new node $r(G_{F_k \cup F_j})$, with a corresponding extension of the edges' weights. In contrast, the dissolution of a tree $G_F$, $R(G_F)$, returns the union of the two subtrees obtained by removing $r(G_F)$ from $G_F$. Then, $\kappa_f(s, \cdot)$ is the counting measure supported at
$$\left\{ \bigcup_{k=1,\, k \neq i, j}^{K} G_{F_k} \,\cup\, A(G_{F_i}, G_{F_j}) \colon (i, j) \in [K]^2, \ i \neq j \right\}$$
for $s = \bigcup_{k=1}^{K} G_{F_k}$ and $K \ge 2$; if $s = G_B$, $\kappa_f(s, \cdot) = \delta_{s_f}$. Hence, $X$ is the set of phylogenies over $B$. Likewise, $\kappa_b(s, \cdot)$'s support is
$$\left\{ \bigcup_{k=1,\, k \neq i}^{K} G_{F_k} \,\cup\, R(G_{F_i}) \colon i \in [K], \ r(G_{F_i}) \notin B \right\}$$
for $s = \bigcup_{k=1}^{K} G_{F_k}$ and $K \le |B|$. Finally, the unnormalized target is the posterior distribution defined by JC69's mutation model [37] given a data set of genome sequences of the species in $B$. More specifically, we let the prior be a uniform distribution and compute the model-induced likelihood function with the efficient Felsenstein's algorithm [25]. We assume the edges' weights are constant. See [101] for further details. We parameterize $p_F(s, \cdot)$ with a GIN [99] and fix $p_B(s, \cdot)$ as a uniform distribution.

Mixture of Gaussians [48, 110]. The training of GFlow Nets in continuous spaces is challenging, and the problem of designing highly expressive models in this setting is still unaddressed [16, 48]. However, as we show in Section 5.2, divergence-based measures seem to be very effective learning objectives for autoregressive sampling of a sparse mixture of Gaussians. For a $d$-dimensional Gaussian distribution, $S = \left\{ \{(0, 0)\} \cup \{(x_i, i) \colon 1 \le i \le n\} \colon n \le d, \ x \in \mathbb{R}^n \right\} \subseteq \mathcal{P}(\mathbb{R} \times (\{0\} \cup [d]))$ and $s_o = \{(0, 0)\}$; note that a state $\{(x_i, i)\}_{i=1}^{n}$ is naturally identified with a vector in $\mathbb{R}^n$. Also, for $s = \{(x_i, i)\}_{i=1}^{n}$, $\kappa_f(s, \cdot)$ is Lebesgue's measure on $\{s \cup \{(x, n + 1)\} \colon x \in \mathbb{R}\}$ if $n < d$ and $\kappa_f(s, \cdot) = \delta_{s_f}$ otherwise. In particular, $X = \{s \in S \colon \max_{(x, i) \in s} i = d\}$. Moreover, $\kappa_b(s, \cdot)$ is a measure on $\{s \setminus \{(x, |s|)\} \colon (x, |s|) \in s\}$, which is a singleton corresponding to the removal of the last sampled coordinate. We define the target's density with a homogeneous mixture of Gaussian distributions, $\frac{1}{K} \sum_{i=1}^{K} \mathcal{N}(\mu_i, \sigma^2 I)$ with $\mu_i \in \mathbb{R}^d$. We similarly define $p_F(s, \cdot)$ as a mixture of one-dimensional Gaussians with mean and variance learned via an MLP [48].

Banana distribution [57, 76]. We consider sampling from the banana distribution, whose unnormalized density evaluates a bivariate Gaussian with the covariance below at the transformed point $(x_1, x_2 + x_1^2 + 1)$,
$$r(x_1, x_2) = \mathcal{N}\!\left( \begin{pmatrix} x_1 \\ x_2 + x_1^2 + 1 \end{pmatrix};\ \begin{pmatrix} 1 & 0.9 \\ 0.9 & 1 \end{pmatrix} \right). \quad (8)$$
Given its geometry and shape, this distribution is a common baseline in the approximate Bayesian inference literature [90, 104, 106]. This task is identical to sampling from a mixture of Gaussian distributions, except for the different target density specified by the model in Equation (8). Also, we rely on the implementation of Hamiltonian Monte Carlo (HMC) [5, 61] provided by Stan [12] to obtain accurate samples from (8). A similar description may be utilized for the hypergrid and structure learning domains.
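For concreteness, the following minimal sketch (ours, not from the paper) evaluates the banana log-density of Equation (8) by plugging the transformed point $(x_1, x_2 + x_1^2 + 1)$ into a bivariate Gaussian with the stated covariance; the zero mean is an assumption on our part.

```python
# Minimal sketch (ours): evaluating the banana density of Equation (8) by
# transforming (x1, x2) -> (x1, x2 + x1^2 + 1) and evaluating a bivariate
# Gaussian with the stated covariance at the transformed point.
import numpy as np
from scipy.stats import multivariate_normal

COV = np.array([[1.0, 0.9], [0.9, 1.0]])

def banana_logpdf(x1, x2):
    """Log-density of the banana distribution at (x1, x2)."""
    return multivariate_normal.logpdf([x1, x2 + x1**2 + 1.0], mean=np.zeros(2), cov=COV)

# The density concentrates along the curve x2 = -x1^2 - 1.
print(banana_logpdf(0.0, -1.0), banana_logpdf(2.0, -5.0), banana_logpdf(2.0, 2.0))
```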
We will consider the measurable space of trajectories $(\mathcal{P}_S, \Sigma_{\mathcal{P}})$, with $\mathcal{P}_S = \{(s_o, s_1, \dots, s_n, s_f) \in S^{n+1} \times \{s_f\} \colon 0 \le n \le N - 1\}$ and $\Sigma_{\mathcal{P}}$ as the $\sigma$-algebra generated by $\bigcup_{n=1}^{N+1} \Sigma^{\otimes n}$. For notational convenience, we use the same letters to represent the measures and kernels of $(S, \Sigma)$ and their natural product counterparts in $(\mathcal{P}_S, \Sigma_{\mathcal{P}})$, which exist by Carathéodory's extension theorem [96]; for example, $\nu(B) = \nu^{\otimes n}(B)$ for $B = (B_1, \dots, B_n) \in \Sigma^{\otimes n}$, and $p_F(\tau|s_o; \theta)$ is the density of $P_F^{\otimes (n+1)}(s_o, \cdot)$ for $\tau = (s_o, s_1, \dots, s_n, s_f)$ relative to $\mu^{\otimes n}$. In this case, we will write $\tau$ for a generic element of $\mathcal{P}_S$ and $x$ for its terminal state (which is unique by Definition 1).

C.1 Proof of Proposition 1

We will show that the gradient of the expected on-policy TB loss matches the gradient of the KL divergence between the forward and backward policies. Firstly, note that
$$\nabla_\theta D_{KL}[P_F \| P_B] = \nabla_\theta \mathbb{E}_{\tau \sim P_F(s_o, \cdot)}\left[\log \frac{p_F(\tau|s_o; \theta)}{p_B(\tau)}\right] = \nabla_\theta \int_\tau \log \frac{p_F(\tau|s_o; \theta)}{p_B(\tau)}\, p_F(\tau|s_o; \theta)\, \kappa_f(s_o, \mathrm{d}\tau)$$
$$= \int_\tau \nabla_\theta \log \frac{p_F(\tau|s_o; \theta)}{p_B(\tau)}\, P_F(s_o, \mathrm{d}\tau) + \int_\tau \log \frac{p_F(\tau|s_o; \theta)}{p_B(\tau)}\, \nabla_\theta p_F(\tau|s_o; \theta)\, \kappa_f(s_o, \mathrm{d}\tau)$$
by Leibniz's rule for integrals and the product rule for derivatives. Then, since $\nabla_\theta f(\theta) = f(\theta)\, \nabla_\theta \log f(\theta)$ for any positive differentiable function $f \colon \theta \mapsto f(\theta)$, and since the expected score vanishes, $\mathbb{E}_{\tau \sim P_F(s_o, \cdot)}[\nabla_\theta \log p_F(\tau|s_o)] = 0$,
$$\nabla_\theta D_{KL}[P_F \| P_B] = \mathbb{E}_{\tau \sim P_F(s_o, \cdot)}\left[\nabla_\theta \log p_F(\tau|s_o) + \log \frac{p_F(\tau|s_o)}{p_B(\tau)}\, \nabla_\theta \log p_F(\tau|s_o)\right] = \mathbb{E}_{\tau \sim P_F(s_o, \cdot)}\left[\log \frac{p_F(\tau|s_o)}{p_B(\tau)}\, \nabla_\theta \log p_F(\tau|s_o)\right]; \quad (9)$$
we omitted the dependency of $P_F$ (and of $p_F$ thereof) on the parameters $\theta$ for conciseness. On the other hand,
$$\nabla_\theta \mathcal{L}_{TB}(\tau; \theta) = 2 \log \frac{p_F(\tau|s_o; \theta)}{p_B(\tau)}\, \nabla_\theta \log p_F(\tau|s_o; \theta) \quad (10)$$
by the chain rule for derivatives. Thus,
$$\mathbb{E}_{\tau \sim P_F(s_o, \cdot)}\left[\nabla_\theta \mathcal{L}_{TB}(\tau; \theta)\right] = 2\, \nabla_\theta D_{KL}[P_F \| P_B], \quad (11)$$
ensuring that the equivalence between $\mathcal{L}_{TB}$ and $D_{KL}$ in terms of expected gradients holds in a context broader than that of finitely supported distributions [56].

C.2 Proof of Lemma 1

Henceforth, we will recurrently refer to the score estimator for gradients of expectations [97], namely,
$$\nabla_\theta \mathbb{E}_{\tau \sim P_F(s_o, \cdot)}[f_\theta(\tau)] = \mathbb{E}_{\tau \sim P_F(s_o, \cdot)}\left[\nabla_\theta f_\theta(\tau) + f_\theta(\tau)\, \nabla_\theta \log p_F(\tau|s_o; \theta)\right], \quad (12)$$
which can be derived using the arguments of the preceding section. In this context, the Renyi-$\alpha$ divergence satisfies
$$\nabla_\theta R_\alpha(P_F \| P_B) = \frac{\nabla_\theta \mathbb{E}_{\tau \sim P_F(s_o, \cdot)}[g(\tau, \theta)]}{(\alpha - 1)\, \mathbb{E}_{\tau \sim P_F(s_o, \cdot)}[g(\tau, \theta)]},$$
with $g(\tau; \theta) = \left(p_B(\tau|x) r(x) / p_F(\tau|s_o; \theta)\right)^{1 - \alpha}$ and $\alpha \neq 1$; similarly, the Tsallis-$\alpha$ divergence abides by
$$\nabla_\theta T_\alpha(P_F \| P_B) = \frac{1}{\alpha - 1}\, \nabla_\theta \mathbb{E}_{\tau \sim P_F(s_o, \cdot)}[g(\tau, \theta)]. \quad (13)$$
The statement then follows by substituting $\nabla_\theta \mathbb{E}_{\tau \sim P_F(s_o, \cdot)}[g(\tau, \theta)]$ with the corresponding score estimator given by Equation (12).
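As a quick numerical sanity check of the score estimator in Equation (12), the following minimal sketch (ours, not from the paper) compares it against a finite-difference gradient for a toy softmax-parameterized categorical distribution; since the integrand f is chosen independent of θ, the first term of Equation (12) vanishes.

```python
# Minimal sketch (ours): checking the score-function identity of Equation (12)
# on a toy categorical p_theta with softmax parameterization and a fixed,
# theta-independent integrand f, so grad E[f] = E[f * grad log p_theta].
import numpy as np

rng = np.random.default_rng(1)
theta = rng.normal(size=4)                 # logits of a 4-outcome categorical
f = np.array([1.0, -2.0, 0.5, 3.0])        # arbitrary, theta-independent integrand

def probs(t):
    e = np.exp(t - t.max())
    return e / e.sum()

def expectation(t):
    return probs(t) @ f

# Reference gradient via finite differences.
eps = 1e-6
fd_grad = np.array([(expectation(theta + eps * np.eye(4)[i]) - expectation(theta)) / eps
                    for i in range(4)])

# Score-function estimate: average f(x) * grad_theta log p_theta(x) over samples.
p = probs(theta)
xs = rng.choice(4, size=200_000, p=p)
score = np.eye(4)[xs] - p                  # gradient of log-softmax at sampled x
sf_grad = (f[xs][:, None] * score).mean(axis=0)

print(np.round(fd_grad, 3), np.round(sf_grad, 3))   # the two should agree closely
```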
C.3 Proof of Lemma 2

Forward KL divergence. The gradient of $D_{KL}[P_B \| P_F]$ is straightforwardly obtained through the application of Leibniz's rule for integrals,
$$\nabla_\theta D_{KL}[P_B \| P_F] = -\mathbb{E}_{\tau \sim P_B(s_f, \cdot)}\left[\nabla_\theta \log p_F(\tau|s_o; \theta)\right],$$
since the averaging distribution $P_B$ does not depend on the varying parameters $\theta$. However, as we compute Monte Carlo averages over samples of $P_F$, we apply an importance reweighting scheme [63, Chapter 9] to the previous expectation to infer that, up to a positive multiplicative constant,
$$\nabla_\theta D_{KL}[P_B \| P_F] \overset{C}{=} -\mathbb{E}_{\tau \sim P_F(s_o, \cdot)}\left[\frac{p_B(\tau|x) r(x)}{p_F(\tau|s_o; \theta)}\, \nabla_\theta \log p_F(\tau|s_o; \theta)\right],$$
with $\overset{C}{=}$ denoting equality up to a positive multiplicative constant. We emphasize that most modern stochastic gradient methods for optimization, such as Adam [41] and RMSProp [28], remain unchanged when we multiply the estimated gradients by a fixed positive quantity; thus, we may harmlessly compute gradients up to multiplicative constants.

Reverse KL divergence. We verified in Equation (9) that
$$\nabla_\theta D_{KL}[P_F \| P_B] = \mathbb{E}_{\tau \sim P_F(s_o, \cdot)}\left[\log \frac{p_F(\tau|s_o)}{p_B(\tau)}\, \nabla_\theta s_\theta(\tau)\right].$$
Since $p_B(\tau) = p_B(\tau|x) r(x) / Z$ and $\mathbb{E}_{\tau \sim P_F(s_o, \cdot)}[\nabla_\theta s_\theta(\tau)] = 0$, the quantity $\nabla_\theta D_{KL}[P_F \| P_B]$ may be rewritten as
$$\nabla_\theta D_{KL}[P_F \| P_B] = \mathbb{E}_{\tau \sim P_F(s_o, \cdot)}\left[\log \frac{p_F(\tau|s_o)}{p_B(\tau)}\, \nabla_\theta s_\theta(\tau)\right] - \mathbb{E}_{\tau \sim P_F(s_o, \cdot)}\left[(\log Z)\, \nabla_\theta s_\theta(\tau)\right] = \mathbb{E}_{\tau \sim P_F(s_o, \cdot)}\left[\log \frac{p_F(\tau|s_o)}{p_B(\tau|x) r(x)}\, \nabla_\theta s_\theta(\tau)\right],$$
in which $x$ is the terminal state corresponding to the trajectory $\tau$. This proves the statement in Lemma 2.

C.4 Proof of Proposition 2

We will derive an expression for the optimal baseline of a vector-valued control variate. For this, let $f$ be the averaged function and $g \colon \tau \mapsto g(\tau)$ be the control variate. Assume, without loss of generality, that $\mathbb{E}_\pi[g] = 0$ for the averaging distribution $\pi$ over the space of trajectories. In this case, the optimal baseline $a^\star$ for the control variate is found by
$$a^\star = \underset{a \in \mathbb{R}}{\arg\min}\ \mathrm{Tr}\left(\mathrm{Cov}_{\tau \sim \pi}[f(\tau) - a\, g(\tau)]\right) \quad (14)$$
$$\phantom{a^\star} = \underset{a \in \mathbb{R}}{\arg\min}\ \mathrm{Tr}\left(-2a\, \mathrm{Cov}_\pi[f(\tau), g(\tau)] + a^2\, \mathrm{Cov}_\pi[g(\tau)]\right),$$
which is a convex optimization problem solved by
$$a^\star = \frac{\mathrm{Tr}\left(\mathrm{Cov}_\pi[f(\tau), g(\tau)]\right)}{\mathrm{Tr}\left(\mathrm{Cov}_\pi[g(\tau)]\right)} = \frac{\mathrm{Tr}\left(\mathbb{E}_\pi[(f - \mathbb{E}_\pi[f(\tau)])\, g(\tau)^T]\right)}{\mathrm{Tr}\left(\mathbb{E}_\pi[g(\tau)\, g(\tau)^T]\right)} = \frac{\mathbb{E}_\pi[\mathrm{Tr}\left((f - \mathbb{E}_\pi[f(\tau)])\, g(\tau)^T\right)]}{\mathbb{E}_\pi[\mathrm{Tr}\left(g(\tau)\, g(\tau)^T\right)]} = \frac{\mathbb{E}_\pi[(f - \mathbb{E}_\pi[f(\tau)])^T g(\tau)]}{\mathbb{E}_\pi[g(\tau)^T g(\tau)]},$$
in which we used the circular property of the trace. This expression exactly matches the result in Proposition 2. In practice, we use $g(\tau) = \nabla_\theta \log p_F(\tau)$ for both the reverse KL and $\alpha$-divergences, rendering a baseline $a^\star$ that depends non-linearly on the sample gradients and is hence difficult to compute efficiently in a GPU-powered autodiff framework. We thus use Equation (6) to estimate $a^\star$.
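To make the estimator concrete, the following minimal sketch (ours; the synthetic f and g merely stand in for the sampled gradients) computes the sample version of $a^\star = \mathbb{E}_\pi[(f - \mathbb{E}_\pi f)^T g] / \mathbb{E}_\pi[g^T g]$ and confirms that subtracting $a^\star g$ reduces the traced covariance.

```python
# Minimal sketch (ours): estimating the optimal scalar baseline a* of
# Proposition 2 from samples, a* = E[(f - E f)^T g] / E[g^T g], for a
# zero-mean, vector-valued control variate g. f and g are synthetic here.
import numpy as np

rng = np.random.default_rng(0)
n, d = 4096, 8
g = rng.normal(size=(n, d))                        # zero-mean control variate samples
f = 2.5 * g + rng.normal(scale=0.3, size=(n, d))   # f is strongly correlated with g

f_centered = f - f.mean(axis=0, keepdims=True)
a_star = np.sum(f_centered * g) / np.sum(g * g)    # sample version of the ratio

var_before = np.trace(np.cov(f, rowvar=False))
var_after = np.trace(np.cov(f - a_star * g, rowvar=False))
print(round(a_star, 3), round(var_before, 3), round(var_after, 3))  # variance drops sharply
```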
Figure 7: Learning curves for a GFlow Net trained by minimizing the TB loss (one panel per generative task, including the Gaussian Mixtures and the Banana distribution). The curves' smoothness highlights the low variance of the optimization steps incurred by the stochastic gradients of $\mathcal{L}_{TB}$, which do not use a score function estimator.

Figure 9: Additional illustration of the effect of $\alpha$ when learning GFlow Nets by minimizing the Renyi-$\alpha$ divergence in the hypergrid environment. For such a sparse target distribution, a large and negative value of $\alpha$ (left) is beneficial to ensure that all modes are appropriately covered by the learned distribution. In contrast, the mode-seeking behavior induced by a large value of $\alpha$ entails the collapse of the model onto a single mode (right).

D Additional experiments

Gradient variance for flow-network-based objectives. Figure 7 shows the learning curve for the TB loss in each of the generative tasks. Notably, it underlines the low variance of the optimization steps, which, contrarily to their divergence-based counterparts, do not rely on a score function estimator; this suggests that the design of control variates for estimating the gradients of these objectives would not be significantly helpful. Also, the gradient of $\mathcal{L}_{TB}$ depends non-linearly on the score function $\nabla_\theta \log p_F$ and, consequently, it is unclear how to implement computationally efficient variance reduction techniques in this case.

Figure 8: Results for sequence generation with larger batches (legend: Rev. KL, KL, Renyi-$\alpha$, Tsallis-$\alpha$, TB).

Forward KL for sequence generation. Figure 3 shows that a GFlow Net trained to minimize the forward KL was outperformed by the alternative approaches in terms of convergence speed. One possible cause of this underperformance is the high variance induced by the underlying importance sampling estimator. To verify this, we re-ran the corresponding experiments, increasing the size of the batch of trajectories for the forward KL estimator to 1024. Figure 8 presents the experiment's results: the increased batch size corresponds to an estimator of smaller variance that accelerates the GFlow Net's training convergence. More broadly, this suggests that the design of GFlow Net-specific variance reduction techniques, which we leave to future endeavors, may further improve this family of models.

Influence of $\alpha$ on the learning of GFlow Nets. As mentioned earlier, divergence-based measures fall short compared to their balance-based counterparts for the hypergrid navigation task. For this extremely sparse problem, the benefits of off-policy exploration during training seem to supersede the statistical efficiency enacted by the minimization of divergences, which fail to properly cover the high-probability regions of the target distribution. In this scenario, Figure 9 suggests that this issue can be mildly circumvented by choosing a sufficiently negative $\alpha$ for the Renyi-$\alpha$ divergence, thereby imposing a mass-covering behavior on the learned model. However, these results should be substantiated by further experimental analysis to be conclusive. Currently, we would suggest that practitioners stick to the balance-based objectives when dealing with very sparse target distributions.

E Experimental details

The following paragraphs provide further implementation details. Regarding open access to the code, we will make the code public upon acceptance.

Shared configurations. For every generative task, we used the Adam optimizer [41] to carry out the stochastic optimization, employing a learning rate of $10^{-1}$ for $\log Z_\theta$ when minimizing $\mathcal{L}_{TB}$ and $10^{-3}$ for the remaining parameters, following previous works [48, 55, 56, 66]. We polynomially annealed the learning rate towards 0 along training, similarly to [86]; a code sketch of this optimizer setup is given later in this appendix. Also, we use LeakyReLU [98] as the non-linear activation function of all implemented neural networks.

Set generation. We implement an MLP with two 64-dimensional layers to parameterize the policy's logits $\log p_F(s, \cdot)$. We train the model for 512 epochs with a batch of 128 trajectories for estimating the gradients. Also, we let $|D| = 32$ and $N = 16$ be the source's and set's sizes, respectively.

Autoregressive sequence generation. We parameterize the logits of the forward policy with an MLP of two 64-dimensional layers; we pad the sequences to account for their variable sizes. We consider $|D| = 8$ and $N = 6$ for the vocabulary's and sequence's sizes, respectively. To approximate the gradients, we rely on a batch of 128 sequences.

Bayesian phylogenetic inference. We parameterize the logits of the forward policy with a 2-layer GIN [99] with a 64-dimensional latent embedding, which is linearly projected to $\log p_F$. Moreover, we simulated the JC69 model [37] to obtain 25-sized sequences of nucleotides for each of the 7 observed species, setting $\lambda = 0.3$ for the instantaneous mutation rate; see [101] for an introduction to computational phylogenetics and molecular evolution. To estimate the gradients, we relied on batches of 64 trajectories.

Hypergrid navigation. We consider $H = 12$ for Figure 3 and Figure 4 and $H = 9$ for Figure 9. In both cases, $d = 2$ and $R_o = 10^{-3}$; see [55, Section 5.1]. To parameterize the policy, we used a 2-layer 256-dimensional MLP with ReLU activations. We trained the models for 10000 epochs.
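As a complement to the shared configuration described at the beginning of this appendix, the following minimal sketch (ours; the policy network, batch, and loss are illustrative placeholders) shows one way to set up Adam with a larger learning rate for $\log Z_\theta$ and a polynomially annealed schedule in PyTorch.

```python
# Minimal sketch (ours) of the shared optimization setup: Adam with a larger
# learning rate for log Z_theta than for the policy parameters, and a learning
# rate polynomially annealed towards zero over training.
import torch

policy = torch.nn.Sequential(                 # stand-in for a task-specific policy
    torch.nn.Linear(32, 64), torch.nn.LeakyReLU(),
    torch.nn.Linear(64, 64), torch.nn.LeakyReLU(),
    torch.nn.Linear(64, 9),
)
log_Z = torch.nn.Parameter(torch.zeros(1))    # only used by the TB objective

optimizer = torch.optim.Adam([
    {"params": policy.parameters(), "lr": 1e-3},
    {"params": [log_Z], "lr": 1e-1},
])

epochs = 512
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lr_lambda=lambda epoch: (1.0 - epoch / epochs) ** 2  # polynomial decay
)

for epoch in range(epochs):
    optimizer.zero_grad()
    loss = policy(torch.randn(128, 32)).pow(2).mean()  # placeholder training loss
    loss.backward()
    optimizer.step()
    scheduler.step()
```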
Bayesian structure learning. We simulated a 100-sample, 5-variable data set X from a randomly parameterized linear Gaussian structural model based on a fixed Bayesian network. We implemented a 2-layer 256-dimensional MLP with ReLU activations for the policy network, which received the flattened adjacency matrix of the current state as input. Training lasted for 4000 epochs.

Mixture of Gaussian distributions. We consider a mixture of nine 2-dimensional Gaussian distributions centered at $\mu_{ij} = (i, j)$ for $0 \le i, j \le 2$, each with an isotropic variance of $10^{-1}$; see Figure 1. We use an MLP with two 64-dimensional layers to parameterize the forward policy.

Banana-shaped distribution. The model is specified by Equation (8). We also consider an MLP with two 64-dimensional layers to parameterize the forward policy.

NeurIPS Paper Checklist

You should answer [Yes], [No], or [NA]. [NA] means either that the question is not applicable for that particular paper or the relevant information is not available. Please provide a short (1-2 sentence) justification right after your answer (even for NA).

1. Claims

Question: Do the main claims made in the abstract and introduction accurately reflect the paper's contributions and scope?

Answer: [Yes]

Justification: The proofs for the main claims in Proposition 1, Lemma 1, Proposition 2, and Lemma 2 are presented in Appendix C, and the empirical validation of practical GFlow Net training is presented in Section 5.

Guidelines: The answer NA means that the abstract and introduction do not include the claims made in the paper. The abstract and/or introduction should clearly state the claims made, including the contributions made in the paper and important assumptions and limitations. A No or NA answer to this question will not be perceived well by the reviewers. The claims made should match theoretical and experimental results, and reflect how much the results can be expected to generalize to other settings. It is fine to include aspirational goals as motivation as long as it is clear that these goals are not attained by the paper.

2. Limitations

Question: Does the paper discuss the limitations of the work performed by the authors?

Answer: [Yes]

Justification: The middle paragraph of Section 6 discusses limitations, including the potential impact of different α values and the need for further exploration in this area.

Guidelines: The answer NA means that the paper has no limitation while the answer No means that the paper has limitations, but those are not discussed in the paper. The authors are encouraged to create a separate "Limitations" section in their paper. The paper should point out any strong assumptions and how robust the results are to violations of these assumptions (e.g., independence assumptions, noiseless settings, model well-specification, asymptotic approximations only holding locally). The authors should reflect on how these assumptions might be violated in practice and what the implications would be. The authors should reflect on the scope of the claims made, e.g., if the approach was only tested on a few datasets or with a few runs. In general, empirical results often depend on implicit assumptions, which should be articulated. The authors should reflect on the factors that influence the performance of the approach. For example, a facial recognition algorithm may perform poorly when image resolution is low or images are taken in low lighting. Or a speech-to-text system might not be used reliably to provide closed captions for online lectures because it fails to handle technical jargon. The authors should discuss the computational efficiency of the proposed algorithms and how they scale with dataset size.
If applicable, the authors should discuss possible limitations of their approach to address problems of privacy and fairness. While the authors might fear that complete honesty about limitations might be used by reviewers as grounds for rejection, a worse outcome might be that reviewers discover limitations that aren't acknowledged in the paper. The authors should use their best judgment and recognize that individual actions in favor of transparency play an important role in developing norms that preserve the integrity of the community. Reviewers will be specifically instructed to not penalize honesty concerning limitations.

3. Theory Assumptions and Proofs

Question: For each theoretical result, does the paper provide the full set of assumptions and a complete (and correct) proof?

Answer: [Yes]

Justification: The statements of the propositions and lemmas include the assumptions, as do the notations and definitions presented in Section 4; proofs are provided in Appendix C.

Guidelines: The answer NA means that the paper does not include theoretical results. All the theorems, formulas, and proofs in the paper should be numbered and cross-referenced. All assumptions should be clearly stated or referenced in the statement of any theorems. The proofs can either appear in the main paper or the supplemental material, but if they appear in the supplemental material, the authors are encouraged to provide a short proof sketch to provide intuition. Inversely, any informal proof provided in the core of the paper should be complemented by formal proofs provided in appendix or supplemental material. Theorems and Lemmas that the proof relies upon should be properly referenced.

4. Experimental Result Reproducibility

Question: Does the paper fully disclose all the information needed to reproduce the main experimental results of the paper to the extent that it affects the main claims and/or conclusions of the paper (regardless of whether the code and data are provided or not)?

Answer: [Yes]

Justification: Details are provided in Section 5, Appendix B, and Appendix E.

Guidelines: The answer NA means that the paper does not include experiments. If the paper includes experiments, a No answer to this question will not be perceived well by the reviewers: Making the paper reproducible is important, regardless of whether the code and data are provided or not. If the contribution is a dataset and/or model, the authors should describe the steps taken to make their results reproducible or verifiable. Depending on the contribution, reproducibility can be accomplished in various ways. For example, if the contribution is a novel architecture, describing the architecture fully might suffice, or if the contribution is a specific model and empirical evaluation, it may be necessary to either make it possible for others to replicate the model with the same dataset, or provide access to the model. In general, releasing code and data is often one good way to accomplish this, but reproducibility can also be provided via detailed instructions for how to replicate the results, access to a hosted model (e.g., in the case of a large language model), releasing of a model checkpoint, or other means that are appropriate to the research performed. While NeurIPS does not require releasing code, the conference does require all submissions to provide some reasonable avenue for reproducibility, which may depend on the nature of the contribution.
For example: (a) If the contribution is primarily a new algorithm, the paper should make it clear how to reproduce that algorithm. (b) If the contribution is primarily a new model architecture, the paper should describe the architecture clearly and fully. (c) If the contribution is a new model (e.g., a large language model), then there should either be a way to access this model for reproducing the results or a way to reproduce the model (e.g., with an open-source dataset or instructions for how to construct the dataset). (d) We recognize that reproducibility may be tricky in some cases, in which case authors are welcome to describe the particular way they provide for reproducibility. In the case of closed-source models, it may be that access to the model is limited in some way (e.g., to registered users), but it should be possible for other researchers to have some path to reproducing or verifying the results.

5. Open access to data and code

Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material?

Answer: [Yes]

Justification: The code is provided in a zip file during the review period, with the plan for public release once the paper is made public.

Guidelines: The answer NA means that the paper does not include experiments requiring code. Please see the NeurIPS code and data submission guidelines (https://nips.cc/public/guides/CodeSubmissionPolicy) for more details. While we encourage the release of code and data, we understand that this might not be possible, so "No" is an acceptable answer. Papers cannot be rejected simply for not including code, unless this is central to the contribution (e.g., for a new open-source benchmark). The instructions should contain the exact command and environment needed to run to reproduce the results. See the NeurIPS code and data submission guidelines (https://nips.cc/public/guides/CodeSubmissionPolicy) for more details. The authors should provide instructions on data access and preparation, including how to access the raw data, preprocessed data, intermediate data, and generated data, etc. The authors should provide scripts to reproduce all experimental results for the new proposed method and baselines. If only a subset of experiments are reproducible, they should state which ones are omitted from the script and why. At submission time, to preserve anonymity, the authors should release anonymized versions (if applicable). Providing as much information as possible in supplemental material (appended to the paper) is recommended, but including URLs to data and code is permitted.

6. Experimental Setting/Details

Question: Does the paper specify all the training and test details (e.g., data splits, hyperparameters, how they were chosen, type of optimizer, etc.) necessary to understand the results?

Answer: [Yes]

Justification: Section 5 in the main text presents information about the experiments, with full details provided in Appendix B and Appendix E and in the supplemented zip file with the code.

Guidelines: The answer NA means that the paper does not include experiments. The experimental setting should be presented in the core of the paper to a level of detail that is necessary to appreciate the results and make sense of them. The full details can be provided either with the code, in appendix, or as supplemental material.
7. Experiment Statistical Significance

Question: Does the paper report error bars suitably and correctly defined or other appropriate information about the statistical significance of the experiments?

Answer: [Yes]

Justification: Plots include error bars and tables report standard deviations.

Guidelines: The answer NA means that the paper does not include experiments. The authors should answer "Yes" if the results are accompanied by error bars, confidence intervals, or statistical significance tests, at least for the experiments that support the main claims of the paper. The factors of variability that the error bars are capturing should be clearly stated (for example, train/test split, initialization, random drawing of some parameter, or overall run with given experimental conditions). The method for calculating the error bars should be explained (closed form formula, call to a library function, bootstrap, etc.). The assumptions made should be given (e.g., Normally distributed errors). It should be clear whether the error bar is the standard deviation or the standard error of the mean. It is OK to report 1-sigma error bars, but one should state it. The authors should preferably report a 2-sigma error bar than state that they have a 96% CI, if the hypothesis of Normality of errors is not verified. For asymmetric distributions, the authors should be careful not to show in tables or figures symmetric error bars that would yield results that are out of range (e.g., negative error rates). If error bars are reported in tables or plots, the authors should explain in the text how they were calculated and reference the corresponding figures or tables in the text.

8. Experiments Compute Resources

Question: For each experiment, does the paper provide sufficient information on the computer resources (type of compute workers, memory, time of execution) needed to reproduce the experiments?

Answer: [Yes]

Justification: First paragraph of Appendix E.

Guidelines: The answer NA means that the paper does not include experiments. The paper should indicate the type of compute workers (CPU or GPU, internal cluster, or cloud provider), including relevant memory and storage. The paper should provide the amount of compute required for each of the individual experimental runs as well as estimate the total compute. The paper should disclose whether the full research project required more compute than the experiments reported in the paper (e.g., preliminary or failed experiments that didn't make it into the paper).

9. Code Of Ethics

Question: Does the research conducted in the paper conform, in every respect, with the NeurIPS Code of Ethics (https://neurips.cc/public/EthicsGuidelines)?

Answer: [Yes]

Justification: Our submission follows the NeurIPS ethical guidelines.

Guidelines: The answer NA means that the authors have not reviewed the NeurIPS Code of Ethics. If the authors answer No, they should explain the special circumstances that require a deviation from the Code of Ethics. The authors should make sure to preserve anonymity (e.g., if there is a special consideration due to laws or regulations in their jurisdiction).

10. Broader Impacts

Question: Does the paper discuss both potential positive societal impacts and negative societal impacts of the work performed?

Answer: [Yes]

Justification: We provide a perspective on broader impacts in Section 6, but do not foresee any direct negative societal impact.

Guidelines: The answer NA means that there is no societal impact of the work performed.
If the authors answer NA or No, they should explain why their work has no societal impact or why the paper does not address societal impact. Examples of negative societal impacts include potential malicious or unintended uses (e.g., disinformation, generating fake profiles, surveillance), fairness considerations (e.g., deployment of technologies that could make decisions that unfairly impact specific groups), privacy considerations, and security considerations. The conference expects that many papers will be foundational research and not tied to particular applications, let alone deployments. However, if there is a direct path to any negative applications, the authors should point it out. For example, it is legitimate to point out that an improvement in the quality of generative models could be used to generate deepfakes for disinformation. On the other hand, it is not needed to point out that a generic algorithm for optimizing neural networks could enable people to train models that generate deepfakes faster. The authors should consider possible harms that could arise when the technology is being used as intended and functioning correctly, harms that could arise when the technology is being used as intended but gives incorrect results, and harms following from (intentional or unintentional) misuse of the technology. If there are negative societal impacts, the authors could also discuss possible mitigation strategies (e.g., gated release of models, providing defenses in addition to attacks, mechanisms for monitoring misuse, mechanisms to monitor how a system learns from feedback over time, improving the efficiency and accessibility of ML).

11. Safeguards

Question: Does the paper describe safeguards that have been put in place for responsible release of data or models that have a high risk for misuse (e.g., pretrained language models, image generators, or scraped datasets)?

Answer: [NA]

Justification: We do not foresee any direct risk stemming from our work.

Guidelines: The answer NA means that the paper poses no such risks. Released models that have a high risk for misuse or dual-use should be released with necessary safeguards to allow for controlled use of the model, for example by requiring that users adhere to usage guidelines or restrictions to access the model or implementing safety filters. Datasets that have been scraped from the Internet could pose safety risks. The authors should describe how they avoided releasing unsafe images. We recognize that providing effective safeguards is challenging, and many papers do not require this, but we encourage authors to take this into account and make a best faith effort.

12. Licenses for existing assets

Question: Are the creators or original owners of assets (e.g., code, data, models), used in the paper, properly credited and are the license and terms of use explicitly mentioned and properly respected?

Answer: [NA]

Justification: All code was made by the authors.

Guidelines: The answer NA means that the paper does not use existing assets. The authors should cite the original paper that produced the code package or dataset. The authors should state which version of the asset is used and, if possible, include a URL. The name of the license (e.g., CC-BY 4.0) should be included for each asset. For scraped data from a particular source (e.g., website), the copyright and terms of service of that source should be provided. If assets are released, the license, copyright information, and terms of use in the package should be provided.
For popular datasets, paperswithcode.com/datasets has curated licenses for some datasets. Their licensing guide can help determine the license of a dataset. For existing datasets that are re-packaged, both the original license and the license of the derived asset (if it has changed) should be provided. If this information is not available online, the authors are encouraged to reach out to the asset's creators.

13. New Assets

Question: Are new assets introduced in the paper well documented and is the documentation provided alongside the assets?

Answer: [NA]

Justification: No new assets.

Guidelines: The answer NA means that the paper does not release new assets. Researchers should communicate the details of the dataset/code/model as part of their submissions via structured templates. This includes details about training, license, limitations, etc. The paper should discuss whether and how consent was obtained from people whose asset is used. At submission time, remember to anonymize your assets (if applicable). You can either create an anonymized URL or include an anonymized zip file.

14. Crowdsourcing and Research with Human Subjects

Question: For crowdsourcing experiments and research with human subjects, does the paper include the full text of instructions given to participants and screenshots, if applicable, as well as details about compensation (if any)?

Answer: [NA]

Justification: No experiments with human subjects.

Guidelines: The answer NA means that the paper does not involve crowdsourcing nor research with human subjects. Including this information in the supplemental material is fine, but if the main contribution of the paper involves human subjects, then as much detail as possible should be included in the main paper. According to the NeurIPS Code of Ethics, workers involved in data collection, curation, or other labor should be paid at least the minimum wage in the country of the data collector.

15. Institutional Review Board (IRB) Approvals or Equivalent for Research with Human Subjects

Question: Does the paper describe potential risks incurred by study participants, whether such risks were disclosed to the subjects, and whether Institutional Review Board (IRB) approvals (or an equivalent approval/review based on the requirements of your country or institution) were obtained?

Answer: [NA]

Justification: No experiments with human subjects.

Guidelines: The answer NA means that the paper does not involve crowdsourcing nor research with human subjects. Depending on the country in which research is conducted, IRB approval (or equivalent) may be required for any human subjects research. If you obtained IRB approval, you should clearly state this in the paper. We recognize that the procedures for this may vary significantly between institutions and locations, and we expect authors to adhere to the NeurIPS Code of Ethics and the guidelines for their institution. For initial submissions, do not include any information that would break anonymity (if applicable), such as the institution conducting the review.