# GFlowNet Training by Policy Gradients

Puhua Niu 1  Shili Wu 1  Mingzhou Fan 1  Xiaoning Qian 1 2 3

Abstract

Generative Flow Networks (GFlowNets) have been shown effective to generate combinatorial objects with desired properties. We here propose a new GFlowNet training framework, with policy-dependent rewards, that bridges keeping flow balance of GFlowNets to optimizing the expected accumulated reward in traditional Reinforcement Learning (RL). This enables the derivation of new policy-based GFlowNet training methods, in contrast to existing ones resembling value-based RL. It is known that the design of backward policies in GFlowNet training affects efficiency. We further develop a coupled training strategy that jointly solves GFlowNet forward policy training and backward policy design. Performance analysis is provided with a theoretical guarantee of our policy-based GFlowNet training. Experiments on both simulated and real-world datasets verify that our policy-based strategies provide advanced RL perspectives for robust gradient estimation to improve GFlowNet performance. Our code is available at: github.com/niupuhua1234/GFN-PG.

1 Department of Electrical and Computer Engineering, Texas A&M University, College Station, TX, USA. 2 Department of Computer Science and Engineering, Texas A&M University, College Station, TX, USA. 3 Computational Science Initiative, Brookhaven National Laboratory, Upton, NY, USA. Correspondence to: Xiaoning Qian.

Proceedings of the 41st International Conference on Machine Learning, Vienna, Austria. PMLR 235, 2024. Copyright 2024 by the author(s).

1. Introduction

Generative Flow Networks (GFlowNets) are a family of generative models on a space of combinatorial objects $\mathcal{X}$, e.g., graphs composed by organizing nodes and edges in a particular manner, or strings composed of characters in a particular ordering. GFlowNets aim to solve a challenging task: sampling $x \in \mathcal{X}$ with probability proportional to some non-negative reward function $R(x)$ that defines an unnormalized distribution, where $|\mathcal{X}|$ can be enormous and the distribution modes are highly isolated by its combinatorial nature. GFlowNets (Bengio et al., 2021; 2023) decompose the process of generating or sampling $x \in \mathcal{X}$ into generating incremental trajectories that start from a null state, pass through intermediate states, and end at $x$ as the desired terminating state. These trajectories are interpreted as paths along a Directed Acyclic Graph (DAG). Probability measures of trajectories are viewed as amounts of water flowing along the DAG, with $R(x)$ being the total flow of trajectories that end at $x$, so that, following the forward generating policy defined by the measure, sampled trajectories end at $x$ with probability proportional to $R(x)$.

GFlowNets bear a form similar to reinforcement learning (RL) in that both operate over Markov Decision Processes (MDPs) with a reward function $R(x)$: nodes, edges, and the node-transition distributions defined by Markovian flows are treated as states, actions, and stochastic policies in MDPs. They differ, however, in the following aspects. The goal of RL is to learn optimal policies that maximize the expected cumulative trajectory reward defined by $R$. For value-based RL methods, the key to achieving this is reducing the Temporal Difference (TD) error of Bellman equations for the estimated state value function V and state-action value function Q (Sutton & Barto, 2018; Mnih et al., 2013).
GFlowNets amortize the sampling problem into finding some Markovian flow that assigns the proper probability flow to edges (actions) so that the total flow of trajectories ending at $x$ is $R(x)$. Viewed through the lens of RL, the existing GFlowNet training strategies are also value-based in that they achieve this goal by keeping the flow balance equation over states of the DAG, whose violation can be measured in trajectory-wise and edge-wise ways (Bengio et al., 2021; 2023; Malkin et al., 2022a; Madan et al., 2023). Due to the similarity between GFlowNet training and RL, investigating the relationship between them can not only deepen the understanding of GFlowNets but also help derive better training methods from RL. In this work, we propose policy-dependent rewards for GFlowNet training. This bridges GFlowNets to RL in that keeping the flow balance over DAGs can be reformulated as optimizing the expected accumulated rewards in RL problems. We then derive policy-based training strategies, which optimize the accumulated reward by its gradients with respect to (w.r.t.) the forward policy directly (Sutton et al., 1999; Sutton & Barto, 2018).

In terms of RL, we acknowledge that the existing GFlowNet training methods can be considered value-based and have the advantage of allowing off-policy training over policy-based methods (Malkin et al., 2022b). Value-based methods, however, face the difficulty of designing a powerful sampler that can balance the exploration-exploitation trade-off, especially when the combinatorial space is enormous with well-isolated modes. Besides, employing typical annealing or random-mixing solutions may leave the learned policy trapped in local optima. Finally, strategies for designing powerful samplers vary according to the structure and setting of the modeling environment. Policy-based methods, especially on-policy ones, transform the design of a powerful sampler into robust estimation of policy gradients, which can be achieved by variance-reduction techniques (Schulman et al., 2016) and improved gradient-descent directions, such as natural policy gradients (Kakade, 2001) and mirror policy descent (Zhan et al., 2023). Conservative policy updates such as Trust-Region Policy Optimization (TRPO) (Schulman et al., 2015; Achiam et al., 2017) and its first-order approximation, Proximal Policy Optimization (PPO) (Schulman et al., 2017b), have also been developed, for example as the backbone of ChatGPT (Ouyang et al., 2022). Moreover, policy-based methods can be made off-policy, for example by importance sampling (Degris et al., 2012). Our work provides alternative ways to improve GFlowNet performance via policy-based training. Our contributions can be summarized as follows:

- We reformulate GFlowNet training as RL over a special MDP where the reward is policy-dependent and the underlying Markov chain is absorbing. We further derive policy gradients for this special MDP and propose policy-based training strategies for GFlowNets, inspired by policy-gradient and TRPO methods for stationary rewards.
- We formulate the design of backward policies in GFlowNets as an RL problem and propose a coupled training strategy. While finding a desired forward policy is the goal of GFlowNet training, well-designed backward policies, as components of the training objectives, are expected to improve training efficiency (Shen et al., 2023).
- We provide performance analyses with theoretical guarantees for our GFlowNet training methods. Our theoretical results are accompanied by experiments in three application domains: hyper-grid modeling, biological and molecular sequence design, and Bayesian Network (BN) structure learning. The experimental results serve as empirical evidence for the validity of our work and also help empirically understand the relationship between GFlowNet training and RL.

2. Preliminaries

For notation compactness, we restrict DAGs of GFlowNets to be graded.^1 A DAG $G := (\mathcal{S}, \mathcal{A})$ models the MDP of a GFlowNet: $s \in \mathcal{S}$ denotes a state, $a \in \mathcal{A}$ denotes a directed edge/action $(s \to s')$, and $\mathcal{A} \subseteq \mathcal{S} \times \mathcal{S}$. Assuming that there is a topological ordering $\mathcal{S}_0, \dots, \mathcal{S}_T$ of $T+1$ disjoint subsets of $\mathcal{S}$, then $\mathcal{S} = \bigcup_{t=0}^{T} \mathcal{S}_t$ and an element of $\mathcal{S}_t$ is denoted as $s_t$. We use $\{\prec, \preceq, \succ, \succeq\}$ to define partial orders between states; for example, $t < t' : s_t \prec s_{t'}$. Furthermore, being acyclic means that no edge $(s \to s') \in \mathcal{A}$ satisfies $s' \preceq s$. Being graded means that $\mathcal{A}$ can be decomposed into $\bigcup_{t=0}^{T-1} \mathcal{A}_t$, where $\mathcal{A}_t \cap \mathcal{A}_{t' \neq t} = \emptyset$ and $a_t \in \mathcal{A}_t$ represents an edge $(s_t \to s_{t+1})$ connecting $\mathcal{S}_t$ and $\mathcal{S}_{t+1}$. For any $s \in \mathcal{S}$, we denote its parent set by $\mathrm{Pa}_G(s) = \{s' \mid (s' \to s) \in \mathcal{A}\}$ and its child set by $\mathrm{Ch}_G(s) = \{s' \mid (s \to s') \in \mathcal{A}\}$. Correspondingly, we denote the edge sets that start and end at $s$ as $\mathcal{A}_{\mathrm{out}}(s) = \{(s \to s') \mid s' \in \mathrm{Ch}_G(s)\}$ and $\mathcal{A}_{\mathrm{in}}(s) = \{(s' \to s) \mid s' \in \mathrm{Pa}_G(s)\}$, respectively. The complete trajectory set is defined as $\mathcal{T} = \{\tau = (s_0 \to \dots \to s_T) \mid \forall (s \to s') \in \tau : (s \to s') \in \mathcal{A}\}$. We use $\tau_{\geq s}$ to denote the sub-trajectory that starts at $s$, and $\tau_{\geq t}$ the sub-trajectory that starts at $s_t$. The DAG $G$ of a GFlowNet has two special states: the initial state $s_0$ with $\mathrm{Pa}(s_0) = \emptyset$ and $\mathcal{S}_0 = \{s_0\}$, and the final state $s_f$ with $\mathrm{Ch}(s_f) = \emptyset$ and $\mathcal{S}_T = \{s_f\}$. Furthermore, the terminal state set $\mathcal{S}_{T-1}$ covers the object set $\mathcal{X}$, equipped with a reward function $R : \mathcal{X} \to \mathbb{R}^+$.

^1 Any DAG can be equivalently converted to a graded one by adding dummy non-terminating states. Please refer to Appendix A of Malkin et al. (2022b) for more details.

2.1. GFlowNets

GFlowNets aim at efficient sampling from $P^*(x) := R(x)/Z^*$, where $Z^* = \sum_{x \in \mathcal{X}} R(x)$ and directly computing $Z^*$ is often challenging with typically large $|\mathcal{X}|$. To achieve this, GFlowNets define a measure $F(\tau) : \mathcal{T} \to \mathbb{R}^+$, termed a flow (Bengio et al., 2023), so that for any event $E$, $F(E) = \sum_{\tau \in E} F(\tau)$ and the total flow $Z = F(s_0) = F(s_f)$. For any events $E$ and $E'$, $P(E) := F(E)/Z$ and $P(E \mid E') := F(E \cap E')/F(E')$. Furthermore, $F$ is restricted to be Markovian, which means that $\forall \tau \in \mathcal{T}$:

$$P(\tau \mid s_0) = \prod_{t=1}^{T} P(s_{t-1} \to s_t \mid s_{t-1}) = \prod_{t=1}^{T} \frac{F(s_{t-1} \to s_t)}{F(s_{t-1})}, \qquad (1)$$

where $F(s \to s') = \sum_{\tau \in \{\tau \mid (s \to s') \in \tau\}} F(\tau)$, $F(s) = \sum_{\tau \in \{\tau \mid s \in \tau\}} F(\tau)$, and $P_F(s_t \mid s_{t-1}) := P(s_{t-1} \to s_t \mid s_{t-1})$. Similarly, $P_B(s_{t-1} \mid s_t) := P(s_{t-1} \to s_t \mid s_t) = F(s_{t-1} \to s_t)/F(s_t)$. A desired generative flow $F$ is set to have the terminal transition probability $P_\top(x) := P(x \to s_f)$ equal to $P^*(x)$. As shown in Bengio et al. (2023), the necessary and sufficient condition is that $\forall s \in \mathcal{S} \setminus \{s_0, s_f\}$:

$$\sum_{s'' \in \mathrm{Pa}(s)} F(s'' \to s) = \sum_{s' \in \mathrm{Ch}(s)} F(s \to s'), \qquad (2)$$

where $F(x \to s_f) := R(x)$ for any $x \in \mathcal{X}$.

2.2. GFlowNet training

Directly estimating the transition flow $F(s \to s')$ via the flow matching objective (Bengio et al., 2021) can suffer from the explosion of $F$ values, whose numerical issues may lead to the failure of model training. In practice, the Trajectory Balance (TB) objective has been shown to achieve state-of-the-art training performance (Malkin et al., 2022a). With the TB objective, the desired flow is estimated by the total flow $Z$ and a pair of forward/backward policies, $P_F(s' \mid s)$ and $P_B(s \mid s')$.
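To make the flow notation concrete before turning to the training objectives, the following minimal sketch (an illustration, not code from the paper; the tiny DAG, the reward values, and all helper names are assumptions) enumerates the complete trajectories of a small graded DAG, assigns each trajectory a flow so that trajectories ending at $x$ carry total flow $R(x)$, and then checks both the flow-matching condition of Eq. (2) and the resulting terminal probability $P_\top(x) = R(x)/Z$.

```python
# A minimal, self-contained sketch (illustrative only; the tiny DAG and reward
# values are assumptions, not taken from the paper). It builds a Markovian flow
# on a graded DAG, checks the flow-matching condition of Eq. (2), and verifies
# that the induced forward policy terminates at x with probability R(x)/Z.
import math

# Graded DAG: s0 -> {a, b} -> {x1, x2} -> sf, where x1 and x2 are terminating states.
edges = {("s0", "a"), ("s0", "b"), ("a", "x1"), ("a", "x2"), ("b", "x2"),
         ("x1", "sf"), ("x2", "sf")}
children = {s: [v for (u, v) in edges if u == s] for s in {u for (u, _) in edges}}
R = {"x1": 3.0, "x2": 1.0}        # reward on terminating states
Z = sum(R.values())               # total flow Z = F(s0) = F(sf)

def trajectories(state="s0", prefix=("s0",)):
    """Enumerate all complete trajectories s0 -> ... -> sf."""
    if state == "sf":
        yield prefix
        return
    for nxt in children[state]:
        yield from trajectories(nxt, prefix + (nxt,))

# Assign each trajectory a flow F(tau) so that trajectories ending at x carry
# total flow R(x) (here: split equally among them).
trajs = list(trajectories())
ending_at = {x: [t for t in trajs if t[-2] == x] for x in R}
F_traj = {t: R[x] / len(ending_at[x]) for x in R for t in ending_at[x]}

# Edge and state flows F(s -> s') and F(s) sum the flows of trajectories through them.
F_edge = {e: sum(f for t, f in F_traj.items() if e in list(zip(t, t[1:]))) for e in edges}
F_state = {s: sum(f for t, f in F_traj.items() if s in t) for s in children}
F_state["sf"] = Z

# Flow matching (Eq. 2): inflow equals outflow at every state except s0 and sf.
for s in F_state:
    if s in ("s0", "sf"):
        continue
    inflow = sum(F_edge[e] for e in edges if e[1] == s)
    outflow = sum(F_edge[e] for e in edges if e[0] == s)
    assert abs(inflow - outflow) < 1e-9

# Forward policy P_F(s'|s) = F(s -> s')/F(s); the terminal probability of x is the
# total probability of complete trajectories passing through x -> sf, and equals R(x)/Z.
P_F = {(u, v): F_edge[(u, v)] / F_state[u] for (u, v) in edges}
for x in R:
    p_top = sum(math.prod(P_F[e] for e in zip(t, t[1:])) for t in ending_at[x])
    assert abs(p_top - R[x] / Z) < 1e-9
print("flow matching and P_T(x) = R(x)/Z verified")
```

On this toy graph the forward policy is read off from a known flow; GFlowNet training goes the other way, parameterizing $P_F$ (and possibly $P_B$ and $Z$) and enforcing such balance conditions only through training objectives such as the TB objective below.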
The TB objective $L_{TB}(P_D)$ for a trajectory data sampler $P_D$ is defined as:

$$L_{TB}(P_D) := \mathbb{E}_{P_D(\tau)}[L_{TB}(\tau)], \qquad L_{TB}(\tau) := \left( \log \frac{P_F(\tau \mid s_0)\, Z}{P_B(\tau \mid x)\, R(x)} \right)^2. \qquad (3)$$

In the equation above, $P_F(\tau \mid s_0) = \prod_{t=1}^{T} P_F(s_t \mid s_{t-1})$ with $P_F(\tau) = P_F(\tau \mid s_0)$, $P_F^\top(x) := P_F(x \to s_f)$, and $P_F(\tau \mid x) = P_F(\tau \mid x \to s_f) = P_F(\tau)/P_F^\top(x)$. Correspondingly, $P_B(\tau \mid x) = P_B(\tau \mid x \to s_f) = \prod_{t=1}^{T-1} P_B(s_{t-1} \mid s_t)$, $P_B^\top(x) := P_B(x \to s_f)$ is set equal to $P^*(x)$, $P_B(\tau) = P_B^\top(x) P_B(\tau \mid x)$, and $P_B(\tau \mid s_0) = P_B(\tau)$. Furthermore, we define $\mu(s_0) := Z/\hat{Z}$ as the 1-categorical distribution over $\mathcal{S}_0$, so that $P_{F,\mu}(\tau) := P_F(\tau \mid s_0)\mu(s_0) = P_F(\tau)$, where $\hat{Z}$ is the normalizing constant whose value is clamped^2 to $Z$. We define $P_{B,\rho}(\tau) := P_B(\tau \mid x)\rho(x)$ with an arbitrary distribution $\rho$ over $\mathcal{X}$.

^2 For any parametrized function $f(\cdot; \theta)$ and $\hat{f}(\cdot)$ clamped to $f$, $\hat{f}$ is equal to $f$ but regarded as constant w.r.t. $\theta$ during gradient computation.

3. Policy gradients for GFlowNet training

Following Malkin et al. (2022b), we first extend the relationship between GFlowNet training methods based on the TB objective and the KL divergence. With the extended equivalence, we then introduce our policy-based and coupled training strategies for GFlowNets. Finally, we present theoretical analyses of our proposed strategies.

3.1. Gradient equivalence

When the trajectory sampler $P_D(\tau)$ is chosen equal to $P_F(\tau)$, the gradient equivalence between using the KL divergence and the TB objective has been proven (Malkin et al., 2022b). However, this forward gradient equivalence does not take the total flow estimator $Z$ into account. Moreover, the backward gradient equivalence requires computing an expectation over $P^*(x)$, which is not feasible. In this work, we extend the proof of the gradient equivalence to take all gradients into account and remove the dependency on $P^*(x)$, while keeping computation feasible.

Proposition 3.1. Given a forward policy $P_F(\cdot \mid \cdot; \theta)$, a backward policy $P_B(\cdot \mid \cdot; \phi)$, and a total flow estimator $Z(\theta)$, the gradients of the TB objective^3 can be written as:

$$\tfrac{1}{2}\nabla_\theta L_{TB}(P_{F,\mu}; \theta) = \nabla_\theta D^{\mu(\cdot;\theta)}_{KL}\big(P_F(\tau \mid s_0; \theta), P_B(\tau \mid s_0)\big) + \tfrac{1}{2}\nabla_\theta \big(\log Z(\theta) - \log Z^*\big)^2 = \nabla_\theta D^{\mu(\cdot;\theta)}_{KL}\big(P_F(\tau \mid s_0; \theta), \tilde{P}_B(\tau \mid s_0; \theta)\big);$$

$$\tfrac{1}{2}\nabla_\phi L_{TB}(P_{B,\rho}; \phi) = \nabla_\phi D^{\rho}_{KL}\big(P_B(\tau \mid x; \phi), P_F(\tau \mid x)\big) = \nabla_\phi D^{\rho}_{KL}\big(P_B(\tau \mid x; \phi), \tilde{P}_F(\tau \mid x)\big). \qquad (4)$$

In the equations above, $\tilde{P}_F(\tau \mid x) := \prod_{t=1}^{T-1} P_F(s_t \mid s_{t-1})$ and $\tilde{P}_B(\tau \mid s_0) := P_B(\tau \mid x) R(x)/Z$ denote the unnormalized counterparts of $P_F(\tau \mid x)$ and $P_B(\tau \mid s_0)$, respectively. For arbitrary distributions $p$, $q$, and $u$, $D^u_{KL}(p(\cdot \mid s), q(\cdot \mid s)) := \mathbb{E}_{u(s)}[D_{KL}(p(\cdot \mid s), q(\cdot \mid s))]$. The proof is provided in Appendix A.1. As the TB objective is a special case of the Sub-Trajectory Balance (Sub-TB) objective (Madan et al., 2023), we also provide the proof of the gradient equivalence with respect to the Sub-TB objective in Appendix A.3, where the initial distribution $\mu$ becomes more flexible.

^3 Training via TB was intrinsically done in an off-policy setting, so $\nabla L_{TB}(P_D) = \mathbb{E}_{P_D(\tau)}[\nabla L_{TB}(\tau)]$ for any choice of $P_D$.

3.2. RL formulation of GFlowNet training

Inspired by the equivalence relationship in Proposition 3.1, we propose new reward functions that allow us to formulate GFlowNet training as RL problems with corresponding policy-based training strategies.

Definition 3.2 (Policy-dependent Rewards). For any action $a = (s \to s') \in \mathcal{A}_{\mathrm{out}}(s)$ (equivalently, $a \in \mathcal{A}_{\mathrm{in}}(s')$), we define two reward functions as:

$$R_F(s, a; \theta) := \log \frac{\pi_F(s, a; \theta)}{\pi_B(s', a)}, \qquad R_B(s', a; \phi) := \log \frac{\pi_B(s', a; \phi)}{\pi_F(s, a)}, \qquad (5)$$

where $\pi_F(s, a; \theta) := P_F(s' \mid s; \theta)$, $\pi_B(s', a; \phi) := P_B(s \mid s'; \phi)$, and $\pi_B(s_f, a)$ is equal to $R(x)/Z$ for $a = (x \to s_f)$. For any $a \notin \mathcal{A}_{\mathrm{out}}(s)$, $R_F(s, a) := 0$. For any $a \notin \mathcal{A}_{\mathrm{in}}(s')$, $R_B(s', a) := 0$.
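As a sanity check on Definition 3.2 (the numbers below are made up for illustration and are not from the paper), the short sketch verifies the identity that motivates the RL view: summing the policy-dependent rewards $R_F(s_t, a_t)$ along one trajectory, with $\pi_B(s_f, a) = R(x)/Z$ at the terminal transition, reproduces exactly the log-ratio whose square is the TB loss in Eq. (3).

```python
# A toy check (illustrative numbers, not the paper's): the policy-dependent
# rewards of Definition 3.2, summed along one trajectory, recover the log-ratio
# inside the TB objective of Eq. (3):
#   sum_t R_F(s_t, a_t) = log[ Z * P_F(tau|s0) / (R(x) * P_B(tau|x)) ].
import math

# Assumed per-step probabilities for a trajectory s0 -> s1 -> s2 -> x -> sf (T = 4).
log_pi_F = [math.log(p) for p in (0.50, 0.40, 0.70, 0.90)]  # pi_F(s_t, a_t), incl. x -> sf
log_pi_B = [math.log(p) for p in (0.60, 0.30, 0.80)]        # pi_B(s_{t+1}, a_t), t = 0..T-2
log_R_x, log_Z = math.log(2.0), math.log(5.0)               # assumed R(x) and total-flow estimate Z

# Definition 3.2: R_F(s, a) = log pi_F(s, a) - log pi_B(s', a), where the backward
# "policy" at the terminal transition x -> sf is R(x)/Z.
log_pi_B_full = log_pi_B + [log_R_x - log_Z]
rewards_F = [lf - lb for lf, lb in zip(log_pi_F, log_pi_B_full)]

# The TB log-ratio for the same trajectory (Eq. 3 squares this quantity).
tb_log_ratio = log_Z + sum(log_pi_F) - (log_R_x + sum(log_pi_B))

assert abs(sum(rewards_F) - tb_log_ratio) < 1e-12
print("sum of R_F along the trajectory =", sum(rewards_F))
```

Taking the expectation of this accumulated reward under $P_F$ gives the KL-type quantity that the value functions and objectives below are built from.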
Tuples $(\mathcal{S}, \mathcal{A}, G, R_F)$ and $(\mathcal{S}, \mathcal{A}, G, R_B)$ specify two MDPs with policy-dependent rewards. In these MDPs, $G$ specifies a deterministic transition environment such that $P(s' \mid s, a) = \mathbb{I}[(s \to s') = a]$ with the indicator function $\mathbb{I}$; $(G, \pi_F)$ and $(G, \pi_B)$ correspond to two absorbing Markov chains. Accordingly, the graded nature of the DAG ensures that each state has only one order index, allowing us to define time-invariant expected value functions of states and state-action pairs:

$$V_F(s) := \mathbb{E}_{P_F(\tau_{>t} \mid s_t)}\Big[\sum_{l=t}^{T-1} R_F(s_l, a_l) \,\Big|\, s_t = s\Big], \qquad Q_F(s, a) := \mathbb{E}_{P_F(\tau_{>t+1} \mid s_t, a_t)}\Big[\sum_{l=t}^{T-1} R_F(s_l, a_l) \,\Big|\, s_t = s, a_t = a\Big].$$

Then we define $J_F := \mathbb{E}_{\mu(s_0)}[V_F(s_0)]$, $A_F(s, a) := Q_F(s, a) - V_F(s)$, and $d_{F,\mu}(s) := \frac{1}{T}\sum_{t=0}^{T-1} P_F(s_t = s)$. We likewise denote the corresponding functions for the backward policy as $\{V_B, Q_B, J_B, A_B, d_{B,\rho}\}$. More details are provided in Appendix B.1. By definition, $V_F(s_0) = \mathbb{E}_{P_F(\tau \mid s_0)}[\sum_{t=0}^{T-1} R_F(s_t, a_t)] = D_{KL}(P_F(\tau \mid s_0), \tilde{P}_B(\tau \mid s_0))$, so $J_F = D^{\mu}_{KL}(P_F(\tau \mid s_0), \tilde{P}_B(\tau \mid s_0))$. Likewise, we can obtain $J_B = D^{\rho}_{KL}(P_B(\tau \mid x), \tilde{P}_F(\tau \mid x))$. Thus, by Proposition 3.1, GFlowNet training can be converted into minimizing the expected value functions $J_F$ and $J_B$. With the derived $\nabla J_F$ and $\nabla J_B$ provided in Appendix B.3, we update $\pi_F$, $\pi_B$, and $\mu$ to minimize $J_F$ and $J_B$ based on the correspondingly computed gradients of the following two objectives:

$$\mathbb{E}_{\mu(s_0;\theta)}[V_F(s_0)] + T\, \mathbb{E}_{d_{F,\mu}(s), \pi_F(s,a;\theta)}\big[A_F(s, a)\big], \qquad (T-1)\, \mathbb{E}_{d_{B,\rho}(s), \pi_B(s,a;\phi)}\big[A_B(s, a)\big]. \qquad (6)$$

Our policy-based method generalizes TB-based training with $\pi_D$ equal to $\pi_F$ as follows. TB-based training corresponds to approximating $A_F(s, a)$ for $(s_t = s, a_t = a)$ empirically by $\hat{Q}_F(s_t, a_t) - C$, where $\hat{Q}_F(s_t, a_t) = \sum_{l=t}^{T-1} R_F(s_l, a_l)$ and $C$ is a constant baseline for variance reduction. For comparison, our policy-based method can be considered as approximating $A_F(s, a)$ functionally by $\hat{A}^\lambda_F(s, a) = \sum_{l=t}^{T-1} \lambda^{l-t}\big(\hat{Q}_F(s_l, a_l) - \tilde{V}_F(s_l)\big)$, where $\lambda \in [0, 1]$ controls the bias-variance trade-off for gradient estimation (Schulman et al., 2016), $\hat{Q}_F(s_l, a_l) = R_F(s_l, a_l) + \tilde{V}_F(s_{l+1})$, and $\tilde{V}_F(s_l)$ is a functional approximation of the exact $V_F(s_l)$, serving as a functional baseline. Specifically, our policy-based method with $\lambda = 1$ provides unbiased gradient estimation of $J_F$, as the TB-based method does. This supports the stability of our policy-based method, with the theoretical convergence guarantee given by Theorem 3.7 in Section 3.4. A formal discussion of their connection is provided in Appendices B.4 and B.6. Additionally, we discuss the relationship between our method and traditional Maximum Entropy (MaxEnt) RL in Appendix B.5.

To further exemplify that the proposed rewards bridge policy-based RL techniques to GFlowNet training, we specifically focus on the TRPO method, whose performance is usually more stable than that of vanilla policy-based methods due to conservative model updating rules (Schulman et al., 2015; Achiam et al., 2017). Likewise, we propose a TRPO-based objective for updating $\pi_F$:

$$\min_{\theta'}\; T\, \mathbb{E}_{d_{F,\mu}(s;\theta), \pi_F(s,a;\theta')}\big[A_F(s, a; \theta)\big] \quad \text{s.t.} \quad D^{d_{F,\mu}(\cdot;\theta)}_{KL}\big(\pi_F(s, \cdot; \theta), \pi_F(s, \cdot; \theta')\big) \leq \zeta_F. \qquad (7)$$

The objective for $\pi_B$ can be defined similarly and is omitted here.

Figure 1. Dotted lines illustrate the spanning range of trajectories. $P_B$ and $P_G$ share the ground-truth terminating distribution $R(x)/Z^*$. When pushing $P_F$ to match $P_B$ trajectory-wise, $P^\top_F(x)$ will also be pushed to match $R(x)/Z^*$.
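The sketch below illustrates, under stated assumptions, how the forward update implied by Eq. (6) can be estimated from one sampled trajectory: it computes the policy-dependent rewards of Definition 3.2, forms the $\lambda$-weighted advantage estimate $\hat{A}^\lambda_F$ with a functional baseline $\tilde{V}_F$, and backpropagates through a REINFORCE-style surrogate. The logit table standing in for the policy network, the per-state baseline vector, the chosen actions, and the value-regression loss are toy assumptions rather than the authors' implementation; the TRPO-based variant of Eq. (7) would reuse the same advantage estimates inside its constrained update.

```python
# A toy PyTorch sketch of the forward-policy update implied by Eq. (6). The logit
# table, baseline vector, actions, and the value-regression loss are illustrative
# assumptions, not the paper's implementation.
import torch
import torch.nn.functional as F

torch.manual_seed(0)
T, n_actions, lam = 4, 3, 0.9

# Stand-ins for the models: forward-policy logits at the T visited states, a fixed
# backward policy (uniform, with R(x)/Z at the terminal transition), and a learned
# scalar baseline V~_F per visited state (V~_F(s_T) = 0).
logits = torch.randn(T, n_actions, requires_grad=True)     # pi_F(s_t, .)
log_pi_B = torch.log(torch.ones(T) / 3.0)                  # pi_B(s_{t+1}, a_t)
log_pi_B[-1] = torch.log(torch.tensor(2.0 / 5.0))          # R(x)/Z at x -> sf (assumed values)
baseline = torch.zeros(T, requires_grad=True)              # V~_F(s_0 .. s_{T-1})

actions = torch.tensor([0, 2, 1, 0])                       # actions taken along the trajectory
log_pi_F = F.log_softmax(logits, dim=-1)[torch.arange(T), actions]

# Policy-dependent rewards (Definition 3.2), treated as fixed when estimating the gradient.
rewards = (log_pi_F - log_pi_B).detach()

# lambda-weighted advantage: A^lam(s_t, a_t) = sum_{l >= t} lam^{l-t} * delta_l, with
# delta_l = R_F(s_l, a_l) + V~_F(s_{l+1}) - V~_F(s_l)  (i.e., Q_hat minus the baseline).
values = torch.cat([baseline.detach(), torch.zeros(1)])
adv, gae = torch.zeros(T), torch.tensor(0.0)
for t in reversed(range(T)):
    delta = rewards[t] + values[t + 1] - values[t]
    gae = delta + lam * gae
    adv[t] = gae

# Surrogate whose gradient is a one-sample estimate of T * E[A_F * grad log pi_F] in
# Eq. (6) (the sum over t plays the role of the factor T), plus a simple Monte-Carlo
# regression loss for the baseline (one common choice, assumed here).
policy_surrogate = (adv * log_pi_F).sum()
returns_to_go = rewards.flip(0).cumsum(0).flip(0)
value_loss = ((returns_to_go - baseline) ** 2).mean()
(policy_surrogate + value_loss).backward()

# Descending this surrogate implements theta <- theta - alpha * grad_hat J_F.
print(logits.grad.shape, baseline.grad.shape)
```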
The objective in Eq. (7) is motivated as an approximation of the upper bound in Theorem 3.6, which generalizes the original results for static rewards to our setting with policy-dependent rewards and absorbing MDPs. We defer the discussion of their relationship to Section 3.4. Moreover, the model parameter updating rule based on $\nabla J_F$ can be written as $\theta \leftarrow \theta - \alpha \nabla_\theta J_F(\theta)$, or equivalently $\theta \leftarrow \arg\min_{\theta'} (\nabla J_F)^\top \theta' + \frac{1}{2\alpha}\|\theta' - \theta\|_2^2$. Here, $\|\theta' - \theta\|_2$ can be generalized to a KL divergence or a Bregman divergence, corresponding to natural or mirror policy gradients, which we leave for future work. Details of the model parameter updating rules for the proposed methods are provided in Appendix B.6.

3.3. RL formulation of guided backward policy design

During GFlowNet training, $(P_B, R)$ specifies the amount of desired flow that $(P_F, Z)$ is optimized to match. While $P_B(\cdot \mid \cdot)$ can in principle be chosen freely (Bengio et al., 2023), a well-designed $P_B$ that assigns high probabilities to sub-trajectories preceding a terminating state $x$ with a high reward value $R(x)$ will improve training efficiency. Following Shen et al. (2023), we formulate the design problem as minimizing the following objective:

$$L^G_{TB}(P_{B,\rho}) := \mathbb{E}_{P_{B,\rho}(\tau)}[L^G_{TB}(\tau)], \qquad L^G_{TB}(\tau; \phi) := \left(\log \frac{P_B(\tau \mid x; \phi)}{P_G(\tau \mid x)}\right)^2, \qquad (8)$$

where $P_G(\tau \mid x) = \prod_{t=1}^{T-1} P_G(s_{t-1} \mid \tau_{\geq t})$ is called the conditional guided trajectory distribution, which is usually non-Markovian,^4 and $P_G(\tau) = P_G(\tau \mid x) P^*(x)$. As required by the training w.r.t. $P_F$, the objective $L^G_{TB}$ aims at finding the backward policy whose Markovian flow best matches the non-Markovian flow induced by $P_G$.

^4 By the non-Markovian assumption, $P_G(\tau \mid x)$ can factorize in arbitrary ways conditioning on $x$. Here it is assumed to factorize in the backward direction for notation compactness.

Proposition 3.3. Given a conditional guided trajectory distribution $P_G(\tau \mid x)$ and a backward policy $P_B(\cdot \mid \cdot; \phi)$, the gradient of $L^G_{TB}$ can be written as:

$$\tfrac{1}{2}\nabla_\phi L^G_{TB}(P_{B,\rho}; \phi) = \nabla_\phi D^\rho_{KL}\big(P_B(\tau \mid x; \phi), P_G(\tau \mid x)\big). \qquad (9)$$

The proof can be found in Appendix A.2. Based on this proposition, we propose a new reward that allows us to formulate the backward policy design problem as an RL problem.

Definition 3.4. Given $P_G(\tau \mid x)$, we define a reward function for any action $a := (s \to s') \in \mathcal{A}_{\mathrm{in}}(s')$ as:

$$R^G_B(s', a; \phi) := \log \frac{\pi_B(s', a; \phi)}{\pi_G(s', a)}, \qquad (10)$$

where $\pi_G(s', a) := P_G(s \mid \tau_{\geq s'})$. For any $a \notin \mathcal{A}_{\mathrm{in}}(s')$, $R^G_B(s', a) := 0$.

Accordingly, we denote the associated function set as $\{V^G_B, Q^G_B, J^G_B, A^G_B, d^G_{B,\rho}\}$, defined in the same way as for $R_B$ but with $P_F$ replaced by $P_G$. By the definition of $J^G_B$ and Proposition 3.3, we can conclude that $\nabla_\phi J^G_B(\phi) = \tfrac{1}{2}\nabla_\phi L^G_{TB}(P_{B,\rho}; \phi)$, so the design of the backward policy can be solved by minimizing $J^G_B$. The form of $P_G$ is detailed in Appendix E for the corresponding experimental tasks.

In principle, following the pipeline of Shen et al. (2023), we would first need to solve the optimization of $L^G_{TB}$ to find the desired $P_B$; then, freezing $P_B$, we could optimize $L_{TB}$ to find the desired $P_F$. This two-phase training is inconvenient in practice. To avoid it, the authors mixed $P_B$ and $P_G$ as $\alpha P_B + (1-\alpha) P_G$ within the training objective w.r.t. $P_F$. This operation, however, lacks theoretical guarantees, as the mixed distribution is still non-Markovian. By comparison, our RL formulation allows us to optimize $J_F$ and $J^G_B$ jointly with a theoretical performance guarantee, which we defer to the next section. The workflow of our coupled training strategy is summarized in Algorithm 1 and depicted in Fig. 1.

3.4. Performance Analysis

In the previous sections, we formulated two RL problems with respect to $R_F$ and $R^G_B$. Now, we show below that the two problems can be solved jointly.
Algorithm 1 GFlowNet Training Workflow
Require: $P_F(\cdot \mid \cdot; \theta)$, $Z(\theta)$, $P_B(\cdot \mid \cdot; \phi)$, $P_G(\cdot \mid \cdot)$
for $n \in \{1, \dots, N\}$ do
    $D \leftarrow \{\hat{\tau} \mid \hat{\tau} \sim P_F(\tau; \theta)\}$
    Update $\theta$ w.r.t. $R_F$ and $D$
    if $\phi \neq \emptyset$ then
        $D \leftarrow \{\hat{\tau} \mid \forall x \in D : \hat{\tau}_{\mid x} \sim P_B(\tau \mid x)\}$
        if $P_G(\hat{\tau} \mid x) \neq P_B(\hat{\tau} \mid x)$ then
            Update $\phi$ w.r.t. $R^G_B$ and $D$
        else
            Update $\phi$ w.r.t. $R_B$ and $D$
        end if
    end if
end for

Theorem 3.5. Denote by $J^G_F$ the function corresponding to $R^G_F$, obtained by replacing $\pi_B$ within $R_F$ with $\pi_G$, and choose $\rho(x) = P^\top_F(x)$. Then $J^G_F$, $J_F$, and $J^G_B$ satisfy the following inequality:

$$J^G_F \;\leq\; J_F + J^G_B + (T-1)\, R^{G,\max}_B \sqrt{2\big(J_F + \log Z^* - \log Z\big)}, \qquad (11)$$

where $R^{G,\max}_B = \max_{s,a} |R^G_B(s, a)|$.

The proof is given in Appendix C.1. As shown in Proposition 3.1, minimization of $J_F$ incurs a decrease of $D^\mu_{KL}(P_F(\tau \mid s_0), P_B(\tau \mid s_0)) = J_F + \log Z^* - \log Z$. Thus, by minimizing $J_F$ and $J^G_B$ jointly, the upper bound of $J^G_F$ decreases. Moreover, the TRPO-based objective introduced in the previous section is motivated by the following upper bounds.

Theorem 3.6. For two forward policies $(\pi_F, \pi'_F)$ with $D^{d_{F,\mu}}_{KL}(\pi'_F(s, \cdot), \pi_F(s, \cdot)) \leq \zeta_F$, and two backward policies $(\pi_B, \pi'_B)$ with $D^{d_{B,\rho}}_{KL}(\pi'_B(s, \cdot), \pi_B(s, \cdot)) \leq \zeta_B$, we have:

$$J'_F - J_F \leq T\Big(\mathbb{E}_{d_{F,\mu}(s)\pi'_F(s,a)}[A_F(s, a)] + \zeta_F + \epsilon_F\sqrt{2\zeta_F}\Big), \qquad J'_B - J_B \leq (T-1)\Big(\mathbb{E}_{d_{B,\rho}(s)\pi'_B(s,a)}[A_B(s, a)] + \zeta_B + \epsilon_B\sqrt{2\zeta_B}\Big), \qquad (12)$$

where $\epsilon_F = \max_s \big|\mathbb{E}_{\pi'_F(s,a)}[A_F(s, a)]\big|$ and $\epsilon_B = \max_s \big|\mathbb{E}_{\pi'_B(s,a)}[A_B(s, a)]\big|$. Similar results also apply to $J^G_B$ and $A^G_B$ for the backward policy $\pi_B$.

The proof is given in Appendix C.2. The TRPO-based objective can be derived following a logic similar to Schulman et al. (2015) and Achiam et al. (2017). Let us denote $M(\pi) = \mathbb{E}_{d_{F,\mu}(s),\pi(s,a)}[A_F(s, a)] + \zeta_F + \epsilon_F\sqrt{2\zeta_F}$ and set $\pi'_F = \arg\min_\pi M(\pi)$. In the worst case, we can choose $\pi'_F = \pi_F$ with $M(\pi'_F) = 0$; it can then be expected that there is a conservative solution, that is, $\pi'_F \neq \pi_F$ with $\zeta_F$ negligibly small, so that $M(\pi'_F) < 0$, thereby resulting in $J'_F - J_F < 0$. This implies a monotonic performance gain. The TRPO method is an approximation to this update and usually provides more stable performance gains than the vanilla policy-based method.

Lastly, we provide a theoretical guarantee that policy-based methods with policy-dependent rewards asymptotically converge to stationary points, which draws inspiration from the results for static rewards by Agarwal et al. (2019).

Theorem 3.7. Suppose that: $J_F(\theta)$ is $\beta$-smooth; the gradient estimator is unbiased, $\mathbb{E}_{P(\cdot \mid \theta)}[\hat{\nabla}_\theta J_F(\theta)] = \nabla_\theta J_F(\theta)$; the estimation variance satisfies $\mathbb{E}_{P(\cdot \mid \theta)}\big[\|\hat{\nabla}_\theta J_F(\theta) - \nabla_\theta J_F(\theta)\|_2^2\big] \leq \sigma_F$; $|\log Z(\theta) - \log Z^*| \leq \sigma_Z$; and we update $\theta$ for $N\,(> \beta)$ iterations by $\theta_{n+1} \leftarrow \theta_n - \alpha \hat{\nabla} J_F(\theta_n)$ with $n \in \{0, \dots, N-1\}$, $\alpha = \sqrt{1/(\beta N)}$, and initial parameter $\theta_0$. Then we have:

$$\min_{n \in \{0,\dots,N-1\}} \mathbb{E}_{P(\theta_n)}\big[\|\nabla_{\theta_n} J_F(\theta_n)\|_2^2\big] \;\leq\; \frac{\sigma_F + \sigma_Z + \mathbb{E}_{P(\theta_0)}[J_F(\theta_0)]}{2\sqrt{N/\beta} - 1}. \qquad (13)$$

Similar results also apply to $J_B$ and $J^G_B$. The proof is provided in Appendix C.3. The assumption $\mathbb{E}_{P(\cdot \mid \theta)}[\hat{\nabla}_\theta J_F(\theta)] = \nabla_\theta J_F(\theta)$ means that gradient estimation is unbiased, as explained in Appendix B.6.

3.5. Related Work

GFlowNet training GFlowNets were first proposed by Bengio et al. (2021) and trained with a Flow Matching (FM) objective, which aims at minimizing the mismatch of equation (2) w.r.t. a parameterized edge flow estimator $F(s \to s')$ directly. Bengio et al. (2023) reformulated equation (2) and proposed a Detailed Balance (DB) objective, where edge flows $F(s \to s')$ are represented by $F(s)P_F(s' \mid s)$ or $F(s')P_B(s \mid s')$. Malkin et al.
(2022a) claimed that the FM and DB objectives are prone to inefficient credit propagation across long trajectories and showed that the TB objective is the more efficient alternative. Madan et al. (2023) proposed a Sub-TB objective that unified the TB and DB objectives as special cases. They can be considered as Sub-TB objectives with sub-trajectories, which are complete or of length 1 respectively. Zimmermann et al. (2022) proposed KL-based training objectives and Malkin et al. (2022b) first established the equivalence between the KL and TB objectives. Shen et al. (2023) analyzed how the TB objective helps to learn the desired flow under the sequence prepend/append MDP setting, and proposed a guided TB objective. Forward-looking GFlow Nets (Pan et al., 2023) improved the formulation of the DB objective by a better local credit assignment scheme, which was further generalized by learning energy decomposition GFlow Nets (Jang et al., 2023). Finally, back-and-forth local search (Kim et al., 2023b), Thompson Sampling (TS) (Rector-Brooks et al., 2023), and temperature conditioning (Kim et al., 2023a) were proposed for the explicit design of PD. Hierarchical Variational inference Hierarchical Variational Inference (HVI) (Vahdat & Kautz, 2020; Zimmermann et al., 2021) generalizes amortized VI (Zhang et al., 2018) to better explore specific statistical dependency structures between observed variables and latent variables by introducing the hierarchy of latent variables. Training HVI models typically involves minimizing the selected divergence measures between the target distribution and the variational distribution parametrized by neural networks (Kingma & Welling, 2014; Burda et al., 2015). GFlow Nets can be considered as a special HVI model, where non-terminating states are latent variables, the hierarchy corresponds to a DAG, and the task of minimizing divergences is achieved by keeping flow balance (Malkin et al., 2022b). Our work provides another view of divergence minimization by interpreting the divergence as the expected accumulated reward. Imitation learning Imitation learning in RL is to learn a policy that mimics the expert demonstrations with limited expert data, by minimizing the empirical gap between the learned policy and expert policy. (Rajaraman et al., 2020; Ho & Ermon, 2016). For GFlow Net training in this work, we reduce the gap between the forward policy and the expert forward policy at the trajectory level, as the expert trajectory distribution is equal to PB(τ), implicitly encouraging the learned policy to match the desired expert policy. Policy-based RL Policy-based RL optimizes the expected value function J directly based on policy gradients (Sutton et al., 1999). The most relevant policy-based methods are the Actor-Critic method (Sutton & Barto, 2018) and Trust Region Policy Optimization (TRPO) (Schulman et al., 2015) along with its extension Constrained Policy Optimization (CPO) (Achiam et al., 2017). Compared to our methods, they work under the assumption that the reward functions must be fixed w.r.t. policies. The underlying Markov chains are further assumed to be ergodic by CPO and TRPO. We note that Weber et al. (2015) proposed a VI method based on policy gradient despite lacking experimental support. Here, the objective can be interpreted as the KL divergence between two forward trajectory distributions. Without the help of A, the policy gradient is estimated in a vanilla manner, corresponding to b A1 and b A0. Besides, Rengarajan et al. 
(2022) proposed a TRPO method for imitation learning, where the objective is the expected KL divergence between two forward policies, and the underlying MDP is assumed to be ergodic, as in the original method.

MaxEnt RL Bengio et al. (2021) have shown that directly applying MaxEnt RL with a fixed $R(s, a)$ defined based on the terminating state is problematic, as it corresponds to modeling $p(x) \propto n(x)R(x)$, where $n(x)$ is the number of trajectories that can pass through $x$. As discussed in Appendix B.5, our policy-based methods, when fixing $\log \pi_B(s', a)$ and $\log Z$, can be connected to Soft-Q-learning, a typical MaxEnt RL method.

Bi-level optimization Our proposed training strategy can also be seen as a stochastic bi-level optimization method for GFlowNet training (Ji et al., 2021; Hong et al., 2023; Ghadimi & Wang, 2018). The inner problem is the RL problem w.r.t. $R_B$ or $R^G_B$ for designing backward policies. The outer problem is the RL problem w.r.t. $R_F$ for forward policies. For gradient-based solutions to bi-level optimization in general, the learning rate of the inner problem must be carefully selected to guarantee overall convergence, which is not required in our methods designed for GFlowNet training.

Additional discussion of policy-based and value-based methods in the context of RL is provided in Appendix D.

4. Experiments

To compare our policy-based training strategies for GFlowNets with the existing value-based methods, we have conducted three simulated experiments on hyper-grid modeling, four real-world experiments on biological and molecular sequence design, one experiment on Bayesian Network structure learning, and an ablation study of $\lambda$. We compare the performance of GFlowNets trained by the following strategies: (1) DB-U, (2) DB-B, (3) TB-U, (4) TB-B, (5) TB-Sub, (6) TB-TS, (7) RL-U, (8) RL-B, (9) RL-T, and (10) RL-G. Here, the suffix -U means that $\pi_B$ is a fixed uniform policy; -B means that $\pi_B$ is a parameterized policy; RL denotes our policy-based method; -T denotes our TRPO-based method with a uniform $\pi_B$; -G denotes our joint training strategy with a guided policy; -Sub denotes the weighted Sub-TB objective with a parameterized $\pi_B$ of Madan et al. (2023); and -TS denotes the TS objective with a parameterized $\pi_B$ of Rector-Brooks et al. (2023). By default, $P_D$ is $\gamma$-decayed-noisy for value-based methods. Total variation $D_{TV}$, Jensen-Shannon divergence $D_{JSD}$, and mode accuracy Acc are used to measure the gap between $P^\top_F(x)$ and $P^*(x)$. Detailed descriptions of experimental settings, including metric definitions, guided policy design, hyper-parameters, etc., can be found in Appendix E. Our implementation is built upon the torchgfn package (Lahlou et al., 2023).

4.1. Hyper-grid Modeling

In this set of experiments, we use the hyper-grid environment following Malkin et al. (2022b). In terms of GFlowNets, states are the coordinate tuples of a $D$-dimensional hyper-cubic grid with height equal to $N$. The initial state $s_0$ is $(0, \dots, 0)$. Starting from $s_0$, actions correspond to increasing one of the $D$ coordinates of the current state by 1, or stopping the process at the current state and outputting it as the terminating state $x$. A manually designed reward function $R(\cdot)$ assigns high reward values to some grid points while assigning low values to others. We conduct experiments on $256 \times 256$, $128 \times 128$, $64 \times 64 \times 64$, and $32 \times 32 \times 32 \times 32$ grids. For performance evaluation, $P^\top_F(x)$ is computed exactly by dynamic programming (Malkin et al., 2022b).
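For concreteness, the snippet below sketches the hyper-grid MDP just described. The reward uses the multi-modal form commonly seen in GFlowNet grid benchmarks (Bengio et al., 2021; Malkin et al., 2022b); the specific constants, the small grid size, and the helper names are assumptions for illustration rather than the settings of this paper, which are given in Appendix E.

```python
# A sketch of the hyper-grid environment of Section 4.1. The reward below uses the
# multi-modal form common in GFlowNet grid benchmarks; its constants, the small grid
# size, and the helper names are illustrative assumptions (see Appendix E of the
# paper for the actual experimental settings).
import itertools

D, N = 2, 8                     # a small 8 x 8 grid for illustration (the paper uses up to 256 x 256)
R0, R1, R2 = 1e-3, 0.5, 2.0     # assumed reward constants

def reward(x):
    """Isolated high-reward modes near the grid corners, a low-reward 'desert' elsewhere."""
    t = [abs(xi / (N - 1) - 0.5) for xi in x]
    near_corner = all(0.25 < ti <= 0.5 for ti in t)
    peak = all(0.30 < ti < 0.40 for ti in t)
    return R0 + R1 * near_corner + R2 * peak

def actions(state):
    """Increment one coordinate (staying on the grid) or stop and emit the state as x."""
    return [d for d in range(D) if state[d] < N - 1] + ["stop"]

# On small grids, Z* and the target P*(x) = R(x)/Z* can be enumerated exactly; together
# with P_F^T from dynamic programming, this is what the D_TV evaluation compares.
grid = list(itertools.product(range(N), repeat=D))
Z_star = sum(reward(x) for x in grid)
x_best = max(grid, key=reward)
print(f"actions at s0: {actions((0,) * D)}")
print(f"Z* = {Z_star:.3f}, argmax R = {x_best}, P*(argmax) = {reward(x_best) / Z_star:.4f}")
```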
The training curves of $D_{TV}$ across five runs for the $256 \times 256$ and $128 \times 128$ grids are plotted in Fig. 2, and Table 1 in Appendix E.5 reports the mean and standard deviation of the metric values at the last iteration. Graphical illustrations of $P^\top_F(x)$ are shown in Figs. 12 and 13 in Appendix E.6. In the first setting, it can be observed that our policy-based methods, in terms of convergence rate or converged $D_{TV}$, perform much better than all the considered value-based training methods. This shows that our policy-based training strategies give more robust gradient estimation. Besides, RL-G achieves the smallest $D_{TV}$ and converges much faster than all the other competing methods. In RL-G, the guided distribution assigns small values to the probability of terminating at coordinates with low rewards. This prevents the forward policy from falling into the reward desert between the isolated modes. Finally, RL-T outperforms RL-U and behaves more stably than RL-U during training. This confirms that, with the help of trust regions, the gradient estimator becomes less sensitive to estimation noise. Here we use a fixed constant $\zeta_F$ for trust-region control; it is expected that using a proper scheduler for $\zeta_F$ during training may further improve the performance of RL-T. In the second setting, the converged $D_{TV}$ values of policy-based and TB-based methods are similar and significantly better than those of DB-based methods. As expected, policy-based methods converge much faster than all the value-based methods. Thus, the results further support the effectiveness of our policy-based methods. Moreover, RL-G and RL-T achieve the second-best and the best convergence, respectively, and RL-T shows better stability than RL-U. This again shows the superiority of the coupled and TRPO-based strategies, confirming our theoretical analysis. More results and discussions for the $64 \times 64 \times 64$ and $32 \times 32 \times 32 \times 32$ grids can be found in Appendix E.2.

Figure 2. Training curves of $D_{TV}$ between $P^\top_F$ and $P^*$ for the $256 \times 256$ (left) and $128 \times 128$ (right) hyper-grids. The curves are plotted based on means and standard deviations of metric values across five runs and smoothed by a sliding window of length 10. Metric values are computed every 10 iterations.

Figure 3. Training curves of Acc of $P^\top_F$ w.r.t. $P^*$ for the SIX6 (left) and QM9 (right) datasets. The curves are plotted based on means and standard deviations of metric values across five runs and smoothed by a sliding window of length 10. Metric values are computed every 10 iterations.

4.2. Biological and Molecular Sequence Design

In this set of experiments, we use GFlowNets to generate nucleotide strings of length $D$ and molecular graphs composed of $D$ blocks according to given rewards. The initial state $s_0 := (-1, \dots, -1)$ denotes an empty sequence. The generative process runs as follows: starting from $s_0$, an action is taken to pick one of the empty slots and fill it with one element, until the sequence is completed. Then the sequence is returned as the terminating state $x$. We use the nucleotide string datasets SIX6 and PHO4, and the molecular graph datasets QM9 and sEH, from Shen et al.
(2023). For the metrics $D_{TV}$ and $D_{JSD}$, $P^\top_F$ is computed exactly by dynamic programming. Following Shen et al. (2023), the training curves of the mode accuracy Acc and of the number of modes for the SIX6 and QM9 datasets are shown in Fig. 3 and in Fig. 7 in Appendix E.3, respectively. For evaluation consistency, we also provide the curves of $D_{TV}$ in Fig. 6 in Appendix E.3, as well as the metric values at the last iteration summarized in Tables 3 and 4 in Appendix E.5. Graphical illustrations of $P^\top_F(x)$ are shown in Figs. 16 and 17 in Appendix E.6. In both experiments, TB-based and policy-based methods achieve better performance than DB-based methods. While the converged Acc values of TB-based methods and our policy-based methods are similar, the latter converge much faster, with only TB-U achieving a comparable convergence rate. Besides, RL-T has the fastest convergence rate in both experiments. The performance of RL-G is similar to that of RL-B, which has a parameterized $\pi_B$, but slightly better than that of RL-U with a uniform $\pi_B$. In summary, the experimental results for the QM9 and SIX6 datasets align with those of the hyper-grid tasks, confirming again the advantage offered by our policy-based methods for robust gradient estimation. More results and discussions for the PHO4 and sEH datasets can be found in Appendix E.3.

4.3. Ablation Study of $\lambda$

To investigate how the setting of $\lambda$, which controls the bias-variance trade-off, may help robust estimation of gradients, we conduct experiments in the $256 \times 256$ grid environment. We compare the performance of RL-U with different $\lambda$ values and of TB-U with different $\gamma$ values. The obtained training curves of $D_{TV}$ across five runs are shown in Fig. 5 in Appendix E.2. Among the choices of $\gamma$ for TB-U, the values 0.99 and 0.95 yield the best and the worst performances, respectively. In contrast, RL-U under all setups except $\lambda = 1$ demonstrates significantly faster convergence than TB-U. It should be pointed out that when $\lambda = 1$, $Q_F$ is approximated empirically, as in TB-based methods, but $V_F$ is approximated functionally. Additionally, the converged $D_{TV}$ values in all setups of RL-U are better than those in all setups of TB-U. These results verify that, by controlling $\lambda$, our policy-based methods can provide more robust gradient estimation than TB-based methods.

We have also conducted performance comparisons between policy-based and value-based methods for Bayesian network structure learning. The results and discussions can be found in Appendix E.4.

5. Conclusion, Limitations and Future Work

This work bridges flow-balance-based GFlowNet training to RL problems. We have developed policy-based training strategies, which provide alternative ways to improve training performance compared to the existing value-based strategies. The experimental results support our claims. Our policy-based methods are not limited to the cases where $G$ must be a DAG, as the training intrinsically corresponds to minimizing the KL divergence between two distributions, which does not necessitate $G$ being a DAG. Future work will focus on extending the proposed methods to general $G$ with cycles for more flexible modeling of the generative processes of objects $x \in \mathcal{X}$. While our policy-based training strategies do not require an explicit design of a data sampler and are shown to achieve better GFlowNet training performance, they may still get trapped in local optima due to the variance of gradient estimation when the state space is very large.
Thus, future research will also focus on further improving policy-based methods by more robust gradient estimation techniques, under the gradient equivalence relationship. Acknowledgements This work was supported in part by the U.S. National Science Foundation (NSF) grants SHF-2215573, and by the U.S. Department of Engergy (DOE) Office of Science, Advanced Scientific Computing Research (ASCR) under Awards B&R# KJ0403010/FWP#CC132 and FWP#CC138. Portions of this research were conducted with the advanced computing resources provided by Texas A&M High Performance Research Computing. Impact Statement The presented research aims to improve GFlow Net training methods to address the training performance challenge. The applications of our work encompass various societal realms, ranging from medicine to materials design. Achiam, J., Held, D., Tamar, A., and Abbeel, P. Constrained policy optimization. In International conference on machine learning, pp. 22 31. PMLR, 2017. Agarwal, A., Jiang, N., Kakade, S. M., and Sun, W. Reinforcement learning: Theory and algorithms. CS Dept., UW Seattle, Seattle, WA, USA, Tech. Rep, 32, 2019. Beck, A. First-order methods in optimization. SIAM, 2017. Bengio, E., Jain, M., Korablyov, M., Precup, D., and Bengio, Y. Flow network based generative models for noniterative diverse candidate generation. Advances in Neural Information Processing Systems, 34:27381 27394, 2021. Bengio, Y., Lahlou, S., Deleu, T., Hu, E. J., Tiwari, M., and Bengio, E. Gflownet foundations. Journal of Machine Learning Research, 24(210):1 55, 2023. Burda, Y., Grosse, R., and Salakhutdinov, R. Importance weighted autoencoders. ar Xiv preprint ar Xiv:1509.00519, 2015. Degris, T., White, M., and Sutton, R. S. Off-policy actorcritic. In Proceedings of the 29th International Coference on International Conference on Machine Learning, pp. 179 186, 2012. Deleu, T., G ois, A., Emezue, C., Rankawat, M., Lacoste Julien, S., Bauer, S., and Bengio, Y. Bayesian structure learning with generative flow networks. In Uncertainty in Artificial Intelligence, pp. 518 528. PMLR, 2022. Ghadimi, S. and Wang, M. Approximation methods for bilevel programming. ar Xiv preprint ar Xiv:1802.02246, 2018. Golpar Raboky, E. and Eftekhari, T. On nilpotent interval matrices. Journal of Mathematical Modeling, 7(2):251 261, 2019. Grinstead, C. and Snell, L. J. Introduction to probability. 2006. Haarnoja, T., Zhou, A., Abbeel, P., and Levine, S. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In International conference on machine learning, pp. 1861 1870. PMLR, 2018. Hestenes, M. R., Stiefel, E., et al. Methods of conjugate gradients for solving linear systems. Journal of research of the National Bureau of Standards, 49(6):409 436, 1952. GFlow Net Training by Policy Gradients Ho, J. and Ermon, S. Generative adversarial imitation learning. Advances in neural information processing systems, 29, 2016. Hong, M., Wai, H.-T., Wang, Z., and Yang, Z. A twotimescale stochastic algorithm framework for bilevel optimization: Complexity analysis and application to actorcritic. SIAM Journal on Optimization, 33(1):147 180, 2023. Jang, H., Kim, M., and Ahn, S. Learning energy decompositions for partial inference of gflownets. In The Twelfth International Conference on Learning Representations, 2023. Ji, K., Yang, J., and Liang, Y. Bilevel optimization: Convergence analysis and enhanced design. In International conference on machine learning, pp. 4882 4892. PMLR, 2021. Kakade, S. M. 
A natural policy gradient. Advances in neural information processing systems, 14, 2001. Kim, M., Ko, J., Zhang, D., Pan, L., Yun, T., Kim, W. C., Park, J., and Bengio, Y. Learning to scale logits for temperature-conditional gflownets. In Neur IPS 2023 AI for Science Workshop, 2023a. Kim, M., Yun, T., Bengio, E., Zhang, D., Bengio, Y., Ahn, S., and Park, J. Local search gflownets. In The Twelfth International Conference on Learning Representations, 2023b. Kingma, D. P. and Welling, M. Auto-encoding variational bayes. In Bengio, Y. and Le Cun, Y. (eds.), ICLR, 2014. URL http://dblp.uni-trier.de/db/ conf/iclr/iclr2014.html#Kingma W13. Kuipers, J., Moffa, G., and Heckerman, D. Addendum on the scoring of gaussian directed acyclic graphical models. 2014. Lahlou, S., Viviano, J. D., and Schmidt, V. torchgfn: A pytorch gflownet library. ar Xiv preprint ar Xiv:2305.14594, 2023. Madan, K., Rector-Brooks, J., Korablyov, M., Bengio, E., Jain, M., Nica, A. C., Bosc, T., Bengio, Y., and Malkin, N. Learning GFlow Nets from partial episodes for improved convergence and stability. In International Conference on Machine Learning, pp. 23467 23483. PMLR, 2023. Malkin, N., Jain, M., Bengio, E., Sun, C., and Bengio, Y. Trajectory balance: Improved credit assignment in gflownets. Advances in Neural Information Processing Systems, 35:5955 5967, 2022a. Malkin, N., Lahlou, S., Deleu, T., Ji, X., Hu, E. J., Everett, K. E., Zhang, D., and Bengio, Y. Gflownets and variational inference. In The Eleventh International Conference on Learning Representations, 2022b. Mnih, V., Kavukcuoglu, K., Silver, D., Graves, A., Antonoglou, I., Wierstra, D., and Riedmiller, M. Playing Atari with deep reinforcement learning. ar Xiv preprint ar Xiv:1312.5602, 2013. Ouyang, L., Wu, J., Jiang, X., Almeida, D., Wainwright, C., Mishkin, P., Zhang, C., Agarwal, S., Slama, K., Ray, A., et al. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730 27744, 2022. Pan, L., Malkin, N., Zhang, D., and Bengio, Y. Better training of gflownets with local credit and incomplete trajectories. In International Conference on Machine Learning, pp. 26878 26890. PMLR, 2023. Rajaraman, N., Yang, L., Jiao, J., and Ramchandran, K. Toward the fundamental limits of imitation learning. Advances in Neural Information Processing Systems, 33: 2914 2924, 2020. Rector-Brooks, J., Madan, K., Jain, M., Korablyov, M., Liu, C.-H., Chandar, S., Malkin, N., and Bengio, Y. Thompson sampling for improved exploration in gflownets. In ICML 2023 Workshop on Structured Probabilistic Inference {\&} Generative Modeling, 2023. Rengarajan, D., Vaidya, G., Sarvesh, A., Kalathil, D., and Shakkottai, S. Reinforcement learning with sparse rewards using guidance from offline demonstration. ar Xiv preprint ar Xiv:2202.04628, 2022. Schulman, J., Levine, S., Abbeel, P., Jordan, M., and Moritz, P. Trust region policy optimization. In International conference on machine learning, pp. 1889 1897. PMLR, 2015. Schulman, J., Moritz, P., Levine, S., Jordan, M., and Abbeel, P. High-dimensional continuous control using generalized advantage estimation. In Proceedings of the International Conference on Learning Representations (ICLR), 2016. Schulman, J., Chen, X., and Abbeel, P. Equivalence between policy gradients and soft q-learning. ar Xiv preprint ar Xiv:1704.06440, 2017a. Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and Klimov, O. Proximal policy optimization algorithms. ar Xiv preprint ar Xiv:1707.06347, 2017b. 
Shen, M. W., Bengio, E., Hajiramezanali, E., Loukas, A., Cho, K., and Biancalani, T. Towards understanding and improving gflownet training. In International Conference on Machine Learning, pp. 30956 30975. PMLR, 2023. GFlow Net Training by Policy Gradients Sutton, R. S. and Barto, A. G. Reinforcement learning: An introduction. MIT press, 2018. Sutton, R. S., Mc Allester, D., Singh, S., and Mansour, Y. Policy gradient methods for reinforcement learning with function approximation. Advances in neural information processing systems, 12, 1999. Tsitsiklis, J. and Van Roy, B. Analysis of temporaldiffference learning with function approximation. Advances in neural information processing systems, 9, 1996. Vahdat, A. and Kautz, J. Nvae: A deep hierarchical variational autoencoder. Advances in neural information processing systems, 33:19667 19679, 2020. Weber, T., Heess, N., Eslami, A., Schulman, J., Wingate, D., and Silver, D. Reinforced variational inference. In Advances in Neural Information Processing Systems (NIPS) Workshops, 2015. Williams, R. J. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine learning, 8(3-4):229 256, 1992. Zhan, W., Cen, S., Huang, B., Chen, Y., Lee, J. D., and Chi, Y. Policy mirror descent for regularized reinforcement learning: A generalized framework with linear convergence. SIAM Journal on Optimization, 33(2):1061 1091, 2023. Zhang, C., B utepage, J., Kjellstr om, H., and Mandt, S. Advances in variational inference. IEEE transactions on pattern analysis and machine intelligence, 41(8):2008 2026, 2018. Zimmermann, H., Wu, H., Esmaeili, B., and van de Meent, J.-W. Nested variational inference. Advances in Neural Information Processing Systems, 34:20423 20435, 2021. Zimmermann, H., Lindsten, F., van de Meent, J.-W., and Naesseth, C. A. A variational perspective on generative flow networks. Transactions on Machine Learning Research, 2022. GFlow Net Training by Policy Gradients A. Gradient Equivalence Lemma A.1. (REINFORCE trick (Williams, 1992)) Given a random variable u following a distribution p( ; ψ) parameterized by ψ and a arbitrary function f, we have ψEp(u;ψ)[f(u)] = Ep(u)[f(u) ψ log p(u; ψ)] = Ep(u)[f(u) ψ log p(u; ψ)], where p(u; ϕ) = p(u; ϕ)/ ˆZp, and ˆZp is the normalizing constant and clamped to P A.1. Proof of Proposition 3.1 Proof. First of all, we split the parameters of the total flow estimator and forward transition probability and denote them as Z(θZ) and PF ( | ; θF ) respectively. We further define c(τ) = log PF (τ|s0) R(x)PB(τ|x) . For the gradients w.r.t. θF : 1 2 θF EPF (τ|s0)[LT B(τ; θF )] = 1 2EPF,µ(τ) h θF (c(τ; θF ) + log Z)2i = EPF,µ(τ) [(c(τ) + log Z) θF log PF (τ|s0; θF )] = EPF,µ(τ) [(c(τ) + log Z) θF log PF (τ|s0; θF )] + EPF,µ(τ) [ θF (c(τ; θF ) + log Z)] | {z } (a) = Eµ(s0) θF EPF (τ|s0;θF ) [c(τ; θF ) + log Z] = Eµ(s0)[ θF DKL(PF (τ|s0; θF ), e PB(τ|s0))] = θF Dµ KL(PF (τ|s0; θF ), e PB(τ|s0)), (14) where (a) is equal to zero as EPF (τ|s0)[ θF c(τ; θF )] = EPF (τ|s0)[1 θF log PF (τ|s0; θF )] = θF EPF (τ|s0;θF )[1] = 0 by Lemma A.1. We also have: θF Dµ KL(PF (τ|s0; θF ), e PB(τ|s0)) = θF Dµ KL(PF (τ|s0; θF ), e PB(τ|s0)) + θF EPF,µ(τ;θF ) [log Z log Z] | {z } =0 = θF Dµ KL(PF (τ|s0; θF ), PB(τ|s0)). (15) It should be emphasized that PB(τ) is the ground-truth distribution with PB(x) := R(x)/Z , while e PB(τ) is the approximated one with e PB(x) := R(x)/Z. The gradients w.r.t. 
θZ can be written as: 1 2 θZEPF (τ|s0)[LT B(τ; θZ)] = 1 2EPF (τ|s0) h θZ (c(τ) + log Z(θZ))2i = EPF (τ|s0) [(c(τ) + log Z) θZ log Z(θZ)] = [DKL(PF (τ|s0), e PB(τ|s0))] [ θZ log Z(θZ)] b Z DKL(PF (τ|s0), e PB(τ|s0)) = θZDµ( ;θZ) KL (PF (τ|s0), e PB(τ|s0)). (16) Besides, we have: θZDµ( ;θZ) KL (PF (τ|s0), e PB(τ|s0)) =[ θZ log Z(θZ)] DKL(PF (τ|s0), e PB(τ|s0)) + log Z = [ θZ log Z(θZ)] DKL(PF (τ|s0), PB(τ|s0)) + log Z b Z [DKL(PF (τ|s0), PB(τ|s0))] + θZ log Z(θZ) = θZDµ( ;θZ) KL (PF (τ|s0), PB(τ|s0)) + 1 2 θZ (log Z(θZ) log Z )2 . (17) Combining equations (14) and (16), we obtain: 1 2 θEPF,µ(τ)[LT B(τ; θ)] = θDµ( ;θ) KL (PF (τ|s0; θ), e PB(τ|s0)). (18) GFlow Net Training by Policy Gradients Combining equations (15) and (17), we obtain: 1 2 θEPF,µ(τ)[LT B(τ; θ)] = θ Dµ( ;θ) KL (PF (τ|s0; θ), PB(τ|s0)) + 1 2 (log Z(θ) log Z )2 . (19) Now let s consider the backward gradients and denote c(τ) = log PB(τ|x) PF (τ x) . Then, 1 2 ϕEPB,ρ(τ)[LT B(τ; ϕ)] = 1 2EPB,ρ(τ) h ϕ c(τ; ϕ) + log R(x) log Z log PF (sf|x) 2i = EPB,ρ(τ) c(τ) + log R(x) log Z log PF (sf|x) ϕ log PB(τ|x; ϕ) = EPB,ρ(τ) [c(τ) ϕ log PB(τ|x; ϕ)] + Eρ(x) log R(x) log Z log PF (sf|x) EPB(τ|x)[ ϕ log PB(τ|x; ϕ)] | {z } =0 by Lemma A.1 = EPB,ρ(τ)[c(τ) ϕ log PB(τ|x; ϕ)] + EPB,ρ(τ)[ ϕc(τ; ϕ)] | {z } =0 by Lemma A.1 = Eρ(x) ϕDKL(PB(τ|x; ϕ), e PF (τ|x)) = ϕDρ KL(PB(τ|x; ϕ), e PF (τ|x)). (20) Besides, we have ϕDρ KL(PB(τ|x; ϕ), e PF (τ|x)) = ϕDρ KL(PB,ρ(τ|x; ϕ), e PF (τ|x)) + Eρ(x) ϕEPB(τ|x;ϕ) log(P F (x)/PF (sf|x)) | {z } =log(P F (x)/PF (sf |x)) ϕ1=0 = ϕDρ KL(PB(τ|x; ϕ), e PF (τ|x)) + ϕEPB,ρ(τ;ϕ) log(P F (x)/PF (sf|x)) = ϕDρ KL(PB(τ|x; ϕ), PF (τ|x)). (21) Equations (20) and (21) are the expected results. A.2. Proof of Proposition 3.3 Proof. The proof can be done by a procedure similar to that of backward gradients in Proposition 3.1 by replacing e PF (τ|x) with PG(τ|x). A.3. Sub-trajectory equivalence Proposition 2 in the paper by Malkin et al. (2022b) only considered the gradients of the Sub-TB objective (Madan et al., 2023) w.r.t. PF ( | ) and PB( | ). We provide an extended proposition below that also takes the gradients w.r.t. state flow estimator F( ) into consideration. For any m < n and n, m {1, T 1}, we denote the set of sub-trajectories that start at some state in Sm and end in some state in Sn as T = { τ = (sm . . . sn)| t {m, . . . , n 1} : (st st+1) At}. The sub-trajectory objective LSub T B(PD) is defined by: LSub T B(PD) = EPD( τ)[LSub T B( τ)], LSub T B( τ) = log PF ( τ|sm)F(sm) PB( τ|sn)F(sn) In the equations above, PF ( τ|sm) = Qn 1 t=m PF (st+1|st), PB( τ|sn) = Qn 1 t=m PB(st|st+1) and F(sn = x) := R(x). Besides, we define µ(sm) := F(sm)/ b Zm and ρ(sn) := F(sn)/ b Zn where b Zm and b Zn are the two normalizing constants whose values are clamped to P sm F(sm) and P Furthermore, PF,µ( τ) := µ(sm)PF ( τ|sm) and PB,ρ( τ) := ρ(sn)PB( τ|sn) so that PF,µ( τ|sn) = PF,µ( τ)/ρ (sn) and PB,ρ( τ|sm) = PB,ρ( τ)/µ (sm). Here, ρ (sn) := F (sn)/ b Z n, b Z n is clamped to P sn F (sn), and F (sn) := P τ:sn τ F(sm)PF ( τ|sm) is the ground-truth state flow over Sn implied by PF ; µ (sm) := F (sm)/ b Z m, b Z m is clamped to P sm F (sm), and F (sm) := P τ:sm τ F(sn)PB( τ|sn) is the ground-truth state flow over Sm implied by PB. Proposition A.2. 
For a forward policy PF ( | ; θ), a backward policy PB( | ; ϕ), a state flow estimator F( ; θ) for Sm, and GFlow Net Training by Policy Gradients a state flow estimator F( ; ϕ) for Sn5, the gradients of Sub-TB can be written as: 1 2 θLSub T B(PF,µ; θ) = θDµ( ;θ) KL (PF ( τ|sm; θ), PB,ρ( τ|sm)) + θDKL(µ(sm; θ), µ (sm)) = θDµ( ;θ) KL (PF ( τ|sm; θ), e PB,ρ( τ|sm)), 1 2 ϕLSub T B(PB,ρ; ϕ) = ϕDρ( ;ϕ) KL (PB( τ|sn; ϕ), PF,µ( τ|sn)) + ϕDKL(ρ(sm; ϕ), ρ (sm)) = ϕDρ( ;ϕ) KL (PB( τ|sn; ϕ), e PF,µ( τ|sn)), (23) where e PF,µ( τ|sn) := PF,µ( τ)/ρ(sn) and e PB,ρ( τ|sm) := PB,ρ( τ)/µ(sm) are approximation to PF,µ( τ|sn) and PB,ρ( τ|sm). Proof. First of all, we split the parameters of the state flow estimator and forward transition probability and denote them as F( ; θM) and PF ( | ; θF ) respectively. We further define c( τ) = log PF ( τ|sm) F (sn)PB( τ|sn) . For the gradients w.r.t. θF : 1 2 θF EPF,µ( τ)[LSub T B( τ; θF )] = 1 2EPF,µ( τ) h θF (c( τ; θF ) + log F(sm))2i = EPF,µ( τ) [(c( τ) + log F(sm)) θF log PF ( τ|sm; θF )] + EPF,µ( τ) [ θF (c( τ; θF ) + log F(sm))] | {z } =0 by Lemma A.1 = θF EPF,µ( τ;θF ) [(c( τ; θF ) + log F(sm))] + θF EPF,µ( τ;θF )[log b Zn log b Zm] | {z } =0 = θF Dµ KL(PF ( τ|sm; θF ), e PB,ρ( τ|sm)). (24) 1 2 θF EPF,µ( τ)[LSub T B( τ; θF )] = 1 2EPF,µ( τ) h θF (c( τ; θF ) + log F(sm))2i = EPF,µ( τ) [c( τ) θF log PF ( τ|sm; θF )] + Eµ(sm) log F(sm) EPF ( τ|sm) [ θF log PF ( τ|sm; θF )] | {z } =0 by Lemma A.1 = EPF,µ( τ) [c( τ) θF log PF ( τ|sm; θF )] + EPF,µ( τ) [ θF c( τ; θF )] | {z } =0 = θF EPF,µ( τ;θF ) [c( τ; θF )] + Eµ(sm) θF EPF ( τ|sm;θF )[log µ (sm)] | {z } =log µ (sm) θF 1=0 + θF EPF,µ( τ;θF )[log b Zn] | {z } =0 = θF EPF,µ( τ;θF ) [c( τ; θF )] + θF EPF,µ( τ;θF )[log µ (sm) + log b Zn] = θF Dµ KL(PF ( τ|sm; θF ), PB,ρ( τ|sm))]. (25) For the gradients w.r.t. θM, we have: 1 2 θM EPF,µ( τ)[LSub T B( τ; θM)] = 1 2EPF,µ( τ) h θM (c( τ) + log F(sm; θM))2i = EPF,µ( τ) [(c( τ) + log F(sm)) θM log F(sm; θM)] + EPF,µ( τ)[(log b Zn log b Zm) θM log F(sm; θM)] | {z } =0 by Lemma A.1 DKL(PF ( τ|sm), e PB( τ|sm)) θM log F(sm; θM) = θM Dµ( ;θM) KL (PF ( τ|sm), e PB( τ|sm)). (26) 5Here, Fθ and Fϕ actually share the same parameters and represent the same flow estimator F. Model parameters are duplicated just for the clarity of gradient equivalences. Therefore the true gradient of the state flow estimator F is θFθ + ϕFϕ. GFlow Net Training by Policy Gradients 1 2 θM EPF,µ( τ)[LSub T B( τ; θM)] = 1 2EPF,µ( τ) h θM (c( τ) + log F(sm; θM))2i = EPF,µ( τ) [c( τ) θF log F(sm; θM)] + Eµ(sm) [log F(sm) θM log F(sm; θM)] = EPF,µ( τ) [(c( τ) + log µ (sm)) θM log F(sm; θM)] + Eµ(sm) [(log F(sm) log µ (sm)) θM log F(sm; θM)] + EPF,µ( τ)[(log b Zn log b Zm) θM log F(sm; θM)] | {z } =0 by Lemma A.1 DKL(PF ( τ|sm), PB,ρ( τ|sm)) θM log F(sm; θM) (log µ(sm) log µ (sm)) θM log F(sm; θM) + Eµ(sm) [ θM (log µ(sm; θM) log µ (sm))] | {z } =0 By Lemma A.1 = θM Dµ( ;θM) KL (PF ( τ|sm), PB,ρ( τ|sm)) + θM DKL(µ(sm; θM), µ (sm)). (27) Combining equations (24) and (26), we obtain 1 2 θEPF,µ( τ)[LSub T B( τ; θ)] = θDµ( ;θ) KL (PF ( τ|sm; θ), e PB( τ|sm)). (28) Combining equations (25) and (27), we obtain 1 2 θEPF,µ( τ)[LSub T B( τ; θ)] = θ n Dµ( ;θ) KL (PF ( τ|sm; θ), PB( τ|sm)) + DKL(µ(sm; θ), µ (sm)) o . (29) Splitting ϕ into ϕB and ϕM and denoting c( τ) = log PB( τ|sn) F (sm)PF ( τ|sm), the gradient derivation of ϕ follows the similar way as θ, and is omitted here. B. RL framework B.1. Derivation of RL functions Let s first consider the case of forward policies. For any s St and a = (s s ) A(s) with t {0, . . . 
, T 1}, we define the VF,t and QF,t as: VF,t(s) := EPF (τ>t|st) l=t RF (sl, al) st = s = RF (s) + EPF (st+1|st) EPF (τ>t+1|st+1) l=t+1 RF (sl, al) st+1 = s # st = s = RF (s) + EπF (s,a)[VF,t+1(s )], QF,t(s, a) := EPF (τ>t+1|st,at) l=t RF (sl, al) st = s, at = a = RF (s, a) + EPF (τ>t+1|st+1) l=t+1 RF (sl, al) st+1 = s # = RF (s, a) + VF,t+1(s ), (30) where RF (s) := EπF (s,a)[RF (s, a)], VF,T ( ) := 0, and QF,T ( , ) := 0. Since St St = for any t = t , we can read off the time indices (topological orders) from state values. Plus the fact that RF (s, a) := 0 for any a / A(s), we are allowed to define two universal functions VF : S R and QF : S A R such that VF (st = s) := VF,t(s) and QF (st = s, a) := QF,t(s, a). GFlow Net Training by Policy Gradients Remark B.1. While the transition environment G is exactly known, the state space S can be exponentially large, making the exact values of V and Q intractable. This fact, in spirit, corresponds to a regular RL problem where the exact values of V and Q are infeasible due to the unknown and uncertain transition environment P(s |s, a). For backward policies, rewards are accumulated from T 1 to 1. Similarly, for s St and a = (s s ) A(s ), VB,t(s ) := EPB(τ T, P t = 0. (36) Proof. We prove the result by contradiction. Assuming P t(t > T) is not zero, then i = j: [P t]i,j = X k1:t 1 Pi,k1Pk1,k2 . . . Pkt 1,j > 0. (37) By the nature of DAGs, (s s) A : s s. Then the above expression is equal to: [P t]i,j = X k1:si sk1 Pi,k1 k2:sk1 sk2 Pk1,k2 . . . kt 1:skt 2 skt 1 Pkt 2,kt 1Pkt 1,j k1:t 1:(si sk1 ... skt 1 sj) Pi,k1Pk1,k2 . . . Pkt 1,j This means that there at least exists a trajectory (si sk1 . . . skt 1 sj) with non-zero probability. However, there are t + 1 distinct node orders in the path, which contradicts the assumption that there are T + 1 different node orders. Let s return to the graded DAG, G(S, A) in GFlow Nets. For the easiness of analysis, we restrict forward and backward policies and initial distribution to be tabular forms, PF R|S| |S|, µ R|S|, PB R|S| |S|, and ρ R|S| such that PF (sj|si) = [PF ]j,i and PB(si|sj) = [PB]i,j. Besides, we split initial distribution vectors by µ = ( µ; 0) R|S| and ρ = (0; ρ) R|S|, where µ and ρ denote the probabilities of states except sf and s0 respectively. We denote the graph equipped with a self-loop over sf as GF (S, A {(sf sf)}), and the reverse graph equipped with a self-loop over s0 as GB(S, A {(s0 s0)}). Accordingly, we enhance PF and PB by defining PF (sf|sf) := 1 and PB(s0|s0) := 1. (GF , PF ) specifies an absorbing Markov Chains: sf is the only absorbing state as only the self-loop is allowed once entering sf; the sub-graph over S \ {sf}, denoted as GF is still a DAG, so any state s S \ {sf} is transient as it can be visited at most one time. Similarly, (GB, PB) specifies another absorbing Markov Chain with absorbing state s0 and a DAG over S \ {s0}, denoted as GB. For graph GF and GB, their transition matrices PF and PB can be decomposed into: PF = PF 0 r F 1 , PB = 1 r B 0 PB In the equations above, r F R1 (|S| 1) and r B R1 (|S| 1) denote the probabilities of (s sf) for any s S \ {sf} and (s0 s) for any s S \ {s0} respectively; PF R(|S| 1) (|S| 1) and PB R(|S| 1) (|S| 1) denote probability of (s s ) for any s, s S \ {sf} and (s s ) for any s, s S \ {s0}, that is, the transition matrices over GF and GB respectively. Lemma B.4. 
For (G,PF , µ) and (GB, PB, ρ), d F,µ R|S| and d B,ρ R|S|, can be written in the following forms: d F,µ = d F,µ 0 , d F,µ = 1 T (I PF ) 1 µ, d B,ρ = 0 d B,ρ , d B,ρ = 1 T 1(I PB) 1 ρ. (40) Proof. We first prove the result for the forward case. By the nature of Markov Chains, PF,µ(st = si) = [(PF )tµ]i, and d F,µ = 1 T PT 1 t=0 (PF )tµ. Then, it can be easily verified (Grinstead & Snell, 2006) that: (PF )t = ( PF )t 0 1 GFlow Net Training by Policy Gradients where the explicit expression of the upper right corner is omitted. By Theorem B.2, PF is a nilpotent matrix and by Lemma B.3, PT 1 t=0 ( PF )t = P t=0( PF )t = (I PF ) 1, where the first equality follows from the fact that GF has T topological orders, and the second equality is by the fact that (I PF ) P t=0( PF )t = P t=0( PF )t P t=1( PF )t = I. Therefore, PT 1 t=0 P t F 0 1 (I PF ) 1 µ µ By Theorem 11.4 in Grinstead & Snell (2006), [(I PF ) 1]j,i is the expected number of times the chain is in state sj, starting from si, before being absorbed in sf. And [(I PF ) 1 µ]j is the expected number of times the chain is in state sj before being absorbed. Since s / S0 : µ(s) = 0 and GF is graded, any forward trajectory over sub-graph GF must start from s S0 and end in s ST 1, meaning P j[(I PF ) 1 µ]j = T. Thus, 1 T [(I PF ) 1 µ]j denotes the fraction of staying in transient state sj before being absorbed, that is, the probability observing state sj within T time steps. By the same reasoning, we can conclude that µ = 0 as sf can not be reached within T time steps. For backward case, any backward trajectory over sub-graph GB must start from s ST 1 and end in s S1 as s / ST 1 : ρ(s) = 0 and GB is graded. Then, a proof procedure for the desired result can be derived similarly, so it is omitted. Lemma B.5. For two forward policy, πF and π F , and two backward policy, πB and π B, we have: DT V (d F,µ( ), d F,µ( )) D d F,µ T V (π F (s, ), πF (s, )), DT V (d B,ρ( ), d B,ρ( )) D d B,ρ T V (π B(s, ), πB(s, )), (43) where for three arbitrary distributions p,q and u, DT V (p( ), q( )) := 1 2 p( ) q( ) 1 and Du T V (p( |s), q( |s)) := 1 2Eu(s) [ p( |s) q( |s) 1]. Proof. The proof procedure follows that of Lemma 3 in Achiam et al. (2017). For two forward policy πF and π F , let NF := (I PF ) 1 and N F := (I P F ) 1. Then, := PF P F = ( N F ) 1 N 1 F , (44) and NF N F = NF N F . (45) Then, d F,µ d F,µ 1 = d F,µ d F,µ 1 ( NF N F ) µ 1 = 1 NF 1 d F,µ 1 1 d F,µ 1 = ( PF P F ) d F,µ 1 . (46) Therefore, we have d F,µ d F,µ 1 ( P F PF ) d F,µ 1 ( P F PF ) d F,µ 1 + (r F r F ) d F,µ = (P F PF )d F,µ 1 s (P F (s |s) PF (s |s)) d F,µ(s) s,s |P F (s |s) PF (s |s)| d F,µ(s) s,a |π F (s, a) πF (s, a)| d F,µ(s) = Ed F,µ(s) [ π F (s, ) πF (s, ) 1] . (47) The result for backward policies can be derived analogously and is omitted here. GFlow Net Training by Policy Gradients B.3. Derivation of gradients Proposition B.6. The gradients of JF (θ) and JB(ϕ) w.r.t. θ and ϕ can be written as: θJF (θ) = T Ed F,µ(s)πF (s,a) [QF (s, a) θ log πF (s, a; θ)] + Eµ(s0)[VF (s0) θ log µ(s0; θ)] = T Ed F,µ(s)πF (s,a) [AF (s, a) θ log πF (s, a; θ)] + Eµ(s0)[VF (s0) θ log µ(s0; θ)], ϕJB(ϕ) = (T 1) Ed B,ρ(s)πB(s,a) [QB(s, a) ϕ log πB(s, a; ϕ)] = (T 1) Ed B,ρ(s)πB(s,a) [AB(s, a) ϕ log πB(s, a; ϕ)] . (48) Remark B.7. This result implies that an estimated value function, which may differ from the exact one, does not lead to biased gradient estimation. 
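For concreteness, the following is a minimal single-trajectory sketch of how this gradient estimator, combined with the λ-weighted advantages of Appendix B.6, might be realized in practice, assuming the per-step policy-dependent rewards R_F(s_t, a_t) and the value estimates Ṽ_F(·) are already available as tensors; function and variable names are illustrative rather than those of the released code.

```python
import torch

def lambda_advantages(rewards: torch.Tensor, values: torch.Tensor, lam: float) -> torch.Tensor:
    """Lambda-weighted advantages of Appendix B.6 for one trajectory.

    rewards: shape [T], per-step policy-dependent rewards R_F(s_t, a_t).
    values:  shape [T+1], value estimates V_F(s_0), ..., V_F(s_T), with V_F(s_T) = 0.
    """
    deltas = rewards + values[1:] - values[:-1]          # TD residuals delta_F(s_t, a_t), cf. Eq. (66)
    advs, running = [], torch.zeros(())
    for t in reversed(range(len(deltas))):               # A^lam_t = sum_{l >= t} lam^{l-t} * delta_l
        running = deltas[t] + lam * running
        advs.append(running)
    return torch.stack(advs[::-1])

def grad_JF_single_trajectory(log_pi, log_mu0, advantages, v0_hat, params):
    """Surrogate whose gradient w.r.t. params matches the single-trajectory estimate of Eq. (48):
       sum_t A_F(s_t, a_t) * grad log pi_F(s_t, a_t; theta) + V_F(s_0) * grad log mu(s_0; theta)."""
    surrogate = (advantages.detach() * log_pi).sum() + v0_hat.detach() * log_mu0
    return torch.autograd.grad(surrogate, params)
```

Setting λ = 1 recovers the unbiased Monte-Carlo estimate discussed in Appendix B.4, while λ = 0 uses the one-step functional approximation, which typically has lower variance at the cost of possible bias.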
θJF (θ) = Eµ(s0)[VF (s0) θ log µ(s0; θ)] + Eµ(s0)[ θVF (s0; θ)] | {z } (1) (1) = EPF,µ(s0) θEπF (s0,a0;θ)[QF (s0, a0; θ)] = EPF,µ(s0) EπF (s0,a0)[QF (s0, a0) θ log πF (s0, a0; θ) + θQF (s0, a0; θ)] = EPF,µ(s0 s1) [QF (s0, a0) θ log πF (s0, a0; θ)] + EPF,µ(s0 s1) [ θRF (s0, a0; θ) + θVF (s1; θ)] | {z } (2) (2) = EPF,µ(s0 s1) θ log πF (s0, a0; θ) +EPF,µ(s1) [ θVF (s1; θ)] (3) = EPF,µ(s0) EπF (s0,a0)[1 θ log πF (s0, a0; θ)] | {z } =0 by Lemma A.1 EPF,µ(s0)[ θVF (s0; θ)] (1) = EPF,µ(s0 s1) [QF (s0, a0) θ log πF (s0, a0; θ)] + EPF,µ(s1) [ θVF (s1; θ)] . (50) Keep doing the process, we have EPF,µ(st)[ θVF (st; θ)] = EPF,µ(st st+1) [QF (st, at) θ log πF (st, at; θ)] + EPF,µ(st+1) θ VF (st+1; θ) | {z } VF (s T )=0 (1) = EPF,µ(τ) t=0 QF (st, at) θ log πF (st, at; θ) = Ed F,µ(s)πF (s,a) [QF (s, a) θ log πF (s, a; θ)] . (52) (1) = Ed F,µ(s)πF (s,a) [QF (s, a) θ log πF (s, a; θ)] Ed F,µ(s) VF (s) EπF (s,a)[ θ log πF (s, a; θ) | {z } =0 = Ed F,µ(s)πF (s,a) [AF (s, a) θ log πF (s, a; θ)] . (53) The derivation of ϕJB(ϕ) follows the similar way to θJF (θ), and is omitted here. GFlow Net Training by Policy Gradients B.4. Connection of policy-based training to TB-based training The gradient of the TB objective w.r.t. θF can be written as: 1 2 θF LT B(PF,µ; θF ) = t=1 EPF,µ(τ) θF log PF (st|st 1; θF ) t=1 log PF (st|st 1) e PB(st 1|st) In equation (54), each term for t > 1 can be expanded as: EPF,µ(τ t 1) θF log PF (st|st 1; θF ) l=t log PF (sl|sl 1) e PB(sl 1|sl) + EPF,µ(τ t) θF log PF (st|st 1; θF ) l=1 log PF (sl|sl 1) e PB(sl 1|sl) =EPF,µ(τ t 1) θF log PF (st|st 1; θF ) l=t log PF (sl|sl 1) e PB(sl 1|sl) + EPF,µ(τ t 1) l=1 log PF (sl|sl 1) e PB(sl 1|sl) EPF,µ(st|st 1)[ θF log PF (st|st 1; θF )] | {z } =0 by Lemma A.1 1 2 θF LT B(PF,µ; θF ) = t=1 EPF,µ(τ) θF log PF (st|st 1; θF ) l=t log PF (sl|sl 1) e PB(sl 1|sl) t=0 EPF,µ(τ) [ θF log PF (st, at; θF )] | {z } =0 by Lemma A.1 t=1 θF log PF (st|st 1; θF ) l=t log PF (sl|sl 1) e PB(sl 1|sl) C where C is an added baseline and constant w.r.t. θF for variance reduction during gradient estimation. As shown in Appendix B.3, the gradient of JF w.r.t. θF can be written as: θF JF (θF ) = T Ed F,µ(s)πF (s,a) [AF (s, a) θF log πF (s, a; θF )] t=0 θF log πF (st, at; θF )QF (st, at) t=0 θF log πF (st, at; θF )EPF,µ(τ>t+1|st,at) l=t RF (sl, al) st, at t=0 θF log πF (st, at; θF ) l=t RF (sl, al) t=0 θF log πF (st, at; θF ) l=t RF (sl, al) C This result implies that: when we update the forward policy by the estimation of θF LT B(θF ) based on a batch of sampled trajectories, we approximate QF (st, at) empirically by b QF (st, at) = PT 1 l=t RF (sl, al) for each sample, and can further reduce the estimation variance by some unbiased constant baseline C. By comparison, the RL formulation generalizes the constant to an unbiased functional baseline e VF (s; η), which is the approximation of exact VF (s). This enables to approximate GFlow Net Training by Policy Gradients QF (st, at) and AF (st, at) functionally by b QF (st, at) = R(st, at) + e VF (st+1) and b AF (st, at) = b QF (st, at) e VF (st). Here, b AF (st, at) can further be generalized to PT 1 l=t λl t b QF (sl, al) e VF (sl) , allowing flexible bias-variance trade-off for gradient estimation (Appendix B.6). B.5. Connection between policy-based training and Soft-Q-learning In the following text, we discuss the relationship between our policy-based method and Soft-Q-learning (Haarnoja et al., 2018), one of the most representative Maximum-Entropy (Max Ent) RL methods. 
Firstly, we introduce their connection when the total flow estimator log Z is fixed. We can expand JF as: JF = T Ed F,µ(s),πF (s,a) [log πB(s , a) log πF (s, a)] = T Ed F,µ(s),πF (s,a) log πB(s , a)) + EπF (s,a)[ log πF (s, a)] = T Ed F,µ(s),πF (s,a) [log πB(s , a)) + H(πF (s, ))] , (58) where a = (s s ), and H denotes the entropy of a distribution. The equation above implies that fixing the total flow estimator log Z, maximizing JF w.r.t. πF can be interpreted as a Max Ent RL problem, where log πB(s , a) is the static reward w.r.t. πF (s, a). We define QS F (s, a) := EPF (τ>t+1|st,at) h πB(st+1, at) + PT 1 l=t+1 πB(sl+1, al) + H(πF (sl, ))|st = s, at = a i = QF (s, a) + log πF (s, a) and V S F (s, a) := EPF (τ>t|st) h PT 1 l=t πB(sl+1, al) + H(πF (sl, ))|st = s i = VF (s, a), which implies that QS F (s, a) = log πB(s , a) + V S F (s ). We use a parameterized function e QS F as the estimator of QS F , and define e V S F (s) := log P a exp{ e QS F (s, a)} as the estimator of V S F . Then, we define πF (s, a; θ) := exp e QS F (s,a;θ) exp e V S F (s;θ) , which implies that: e QS F (s, a) = e V S F (s) + log πF (s, a). (59) In Soft-Q-learning, we define the target function as: b QS F (s, a) := log πB(s , a) + e V S F (s ). (60) Then, πF is updated by the gradient of the following objective: 2 Ed D(s),πD(s,a) e QS F (s, a; θ) b QS F (s, a) 2 . (61) It has been shown that the policy gradients for static rewards plus the gradients of the policy entropy (or KL divergence from some reference policy) are equivalent to the gradients of the corresponding soft Q estimator (Schulman et al., 2017a). Likewise, we demonstrate our policy-based method with λ = 0 is equivalent to Soft-Q-learning with πD equal to πF , without any adaption. Noting Soft-Q-learning is off-policy, the gradients of equation (61) w.r.t. θ can be written as: 2 θEd F,µ(s)πF (s,a) e QS F (s, a; θ) (log PB(s , a) + e V S F (s )) 2 = TEd F,µ(s)πF (s,a) h θ e QS F (s, a; θ) e QS F (s, a; θ) log PB(s , a) e V S F (s ) i = TEd F,µ(s)πF (s,a) θ log πF (s, a) + e V S F (s) e V S F (s) + RF (s, a) e V S F (s ) = TEd F,µ(s)πF (s,a) θ log πF (s, a) + e V S F (s) ˆδF (s, a) t=0 θ log πF (st, at; θ)ˆδF (st, at) t=0 θ e V S F (st; θ)ˆδF (st, at) where ˆδF (s, a) := e VF (s) + RF (s, a) + e VF (s ), and e VF := e V S F . Compared to formula (65) with λ = 0, a clear equivalence can be established. GFlow Net Training by Policy Gradients We further connect the Soft-Q-learning objective (61) to the Flow Matching (FM) objective (Bengio et al., 2021) and explain the role that log Z plays during training. When e QS F achieves the optimal point, we have e QS F = b QS F . Consequently, for the desired flow F, e V S F (sf) = 0 by definition, e QS F (x, a T 1) = log πB(x, a T 1) + 0 = log F (x sf ) Z , and e V S F (x) = log P a exp QS F (x, a) = log F (x) Z (= log R(x) Z ). Accordingly, e QS F (s T 2, a T 2) = e V S F (x)+log πB(x, a T 2) = log F (x) Z + log F (s T 2 x) F (x) = log F (s T 2 x) Z , and e V S F (s T 2) = log P a exp QS F (s T 2, a) = log F (s T 2) Z . Continuing this process, it can be verified that e QS F (s, a) = log F (s s ) Z and e V S F (s) = log F (s) Z when e QS achieve the optimum. 
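As an illustration of the construction just described, here is a minimal sketch of the Soft-Q regression of Eqs. (59)-(61), assuming a network q_net that outputs Q̃^S_F(s, ·) over actions; all names are illustrative and invalid-action masking is omitted for brevity.

```python
import torch
import torch.nn.functional as F

def soft_q_loss(q_net, states, actions, next_states, log_pb_rev, next_is_sf):
    """One Soft-Q regression step in the sense of Eqs. (59)-(61) (sketch; names are illustrative).

    q_net(s) is assumed to return Q^S_F(s, .) over actions; log_pb_rev holds
    log pi_B(s', a) for the reverse edge of each sampled transition (at the terminating
    transition this carries the log R(x)/Z term of the augmented backward distribution).
    """
    q_sa = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)          # Q^S_F(s, a)
    with torch.no_grad():
        v_next = torch.logsumexp(q_net(next_states), dim=1)                 # V^S_F(s') = log sum_a' exp Q^S_F(s', a')
        v_next = torch.where(next_is_sf, torch.zeros_like(v_next), v_next)  # V^S_F(s_f) = 0 by definition
        target = log_pb_rev + v_next                                        # target of Eq. (60)
    return 0.5 * F.mse_loss(q_sa, target)                                   # objective of Eq. (61)

def forward_policy_from_q(q_net, states):
    """The induced forward policy of Eq. (59): pi_F(.|s) = softmax(Q^S_F(s, .))."""
    return torch.softmax(q_net(states), dim=-1)
```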
Based on the above optimum condition of e QS(s, a) and the fact that e QS(s, a) R is a parametrized function with no assumption over its output form during training, we can safely substitute it by F log(s s ) := e QS(s, a) + log Z R, where F log(s s ) is the estimator for the logarithm of the desired edge flow, log F(s s ), and no assumption over its output form during training is made as well. Then, objective (61) can be equivalently rewritten as: F log(st 1 st) log PB(st 1|st) X st+1 exp F log(st st+1) This objective is similar to the FM objective, which can be written as: st 1 exp F log(st 1 st) st exp F log(st st+1) The reason is that the optimal solution of the objective (63) satisfies exp F log(st 1 st) = PB(st 1|st) P st+1 exp F log(st st+1). Taking summation over st 1 of this equation, we have P st 1 exp F log(st 1 st) = P st+1 exp F log(st st+1), the optimal solution of the objective (64). We can see that log Z serves as a baseline for modeling log F. Without log Z, we need to approximate log F by F log directly. This often leads to numerical issues. For example, let s suppose a small perturbation of log F, denoted as ϵ. Then the flow difference is exp(log F + ϵ) F = (exp ϵ 1)F. As values of F can be exponentially large, especially for nodes near the root, the flow difference can be large even if ϵ is small, making approximation to F by exp F log very difficult. By contrast, TB-based methods and our policy-based method allow the updating of log Z to dynamically scale down the value of e Q during training. This also complements the claims by Malkin et al. (2022a), who show that the TB-based methods is more efficient than flow-matching and DB-based methods. B.6. Model parameter updating rules In the following context, we will explain the updating rules for PF and µ within the vanilla policy-based method, also called the Actor-Critic method, and the TRPO method. The updating rules for PB follow those of PF analogously. Actor-Critic First of all, we split the parameter θ into θF and θZ corresponding to πF and µ. Since computing the exact VF is usually intractable, we use e VF parametrized by η as the functional approximation. Given a batch of trajectories samples, we compute the sampling averaging approximation of the following gradient estimators to update θF , θZ and η as proposed by Schulman et al. (2016) and Tsitsiklis & Van Roy (1996): t=0 b Aλ F (st, at) θF log πF (st, at; θF ) + Eµ(s0) h b V λ F (s0) θZ log µ(s0; θZ) i , t=0 η(b V λ F (st) e VF (st; η))2 # where λ [0, 1], b Aλ F (st, at) := l=t λl tˆδF (sl, al), b V λ F (st) := l=t λl tˆδF (sl, al) + e VF (st), ˆδF (st, at) := RF (st, at) + e VF (st+1) e VF (st), (66) GFlow Net Training by Policy Gradients b AF is called critic and πF is called actor. It can be verified that b A1 F (st, at) = PT 1 l=t RF (sl, al) e VF (st; η) renders an unbiased estimator of θF J(θ) as the first term is an unbiased estimation of QF and e VF ( ; η) does not introduce estimation bias (Remark B.7); b A0 F (st, at) = R(st, at)+ e VF (st+1) e VF (st) provided an direct functional approximation of AF (st, at), which usually render biased estimation with lower variance as e VF may not equal to VF exactly. Thus, λ enables the variancebias trade-off for robust gradient estimation. Likewise, b V 1 F (st) = PT 1 l=t R(sl, al) and b V 0 F (st) = R(st, at) + e VF (st+1) for each τ, corresponding to unbiased and biased estimation of VF (st). Denoting the estimated gradient w.r.t. 
(θF , θZ, η) as (ˆg F , ˆg Z, ˆg V ), these parameters are updated by (θ F , θ Z, η ) (θF αF ˆg F , θZ αZˆg Z, η αV ˆg V ). TRPO Parameters θZ and η are updated in the same way as the actor-critic method. Parameter θF is updated by the linear approximation of objective (6): min θ F T g F θ F θF s.t. 1 2(θ F θF ) HF (θ F θF ) ζF , (67) g F = θ F Ed F,µ(s;θ),πF (s,a;θ F ) h b Aλ F (s, a; θF ) i , HF = 2 θ F Dd F,µ( ;θ) KL (πF (s, a; θF ), πF (s, a; θ F )) . (68) Let s denote the Lagrangian formulation of the above problem as L(d F , κ) := Tg F d F κ(d F HF d F ζF ) with Lagrangian constant κ and d F := θ F θF . By the optimal conditions of L(d F , κ), κL(d F , κ) = 0 and d F L(d F , κ) = 0, we have d F = 1 κH 1 F g F and κ = g F H 1 F g F 2ζF 0.5 . Thus, the maximal updating step of model parameters is: θ F θF 2ζF ˆg F b H 1 F ˆg F 0.5 b H 1 F ˆg F . When the dimension of θ F is high, computing b H 1 F is time-demanding. Thus, we adopt the conjugate gradient method to estimate b H 1 F ˆg F based on b HF ˆg F (Hestenes et al., 1952). Besides, following Schulman et al. (2015), we perform a line search of updating step size to improve performance, instead of taking the maximal step. C. Performance analysis Lemma C.1. (Descent lemma (Beck, 2017)) Supposing f( ) is a β-smooth function, then for any θ and θ : f(θ ) f(θ) + θf(θ), θ θ + β 2 θ θ 2 2 . (69) Lemma C.2. Given two forward policies (π F , πF ) with D d F,µ KL (π F (s, ), πF (s, )) ζF or two backward (π B, πB) with D d B,ρ KL (π B(s, ), πB(s, )) ζB, we have 1 T (J F JF ) Ed F,µ(s),π F (s,a)[AF (s, a)] + ζF , 1 T 1(J B JB) Ed B,ρ(s),π B(s,a)[AB(s, a)] + ζB. (70) Proof. The proof procedure is analogous to that of Schulman et al. (2015) and Rengarajan et al. (2022). By the definition of AF , EP F (τ|s0) t=0 AF (st, at) = EP F (τ|s0) RF (st, at) + VF (st+1) VF (st) # = EP F (τ|s0) t=0 RF (st, at) + EP F (τ|s0)[VF (s T ) | {z } =0 = EP F (τ|s0) t=0 (R F (st, at) + RF (st, at) R F (st, at)) = V F (s0) VF (s0) + EP F (τ|s0) t=0 (RF (st, at) R F (st, at)) GFlow Net Training by Policy Gradients J F JF = EP F,µ(τ) t=0 AF (st, at) + EP F,µ(τ) t=0 DKL(π F (st, ), πF (st, )) = T Ed F,µ(s),π F (s,a)[AF (s, a)] + T D d F,µ KL (π F (s, ), πF (s, )) T Ed F,µ(s),π F (s,a)[AF (s, a)] + ζF . (72) Using the fact that V B(s0) = 0 and backward rewards are accumulated from T 1 back to 1, the results for the backward case can be derived by a similar procedure as in the forward case, so it is omitted here. C.1. Proof of Theorem 3.5 Proof. Firstly, JG F = JF + (JG F JF ) = JF + EPF,µ(τ) log PF (τ|s0)Z PG(τ|x)R(x) log PF (τ|s0)Z PB(τ|x)R(x) = JF + EPF,µ(τ) log PB(τ|x) log PB(τ|x) log PB(τ|x) = JF + JG B + X τ (PF,µ(τ) PB,ρ(τ)) RG B(τ), (73) where RG B(τ) := log PB(τ|x) PG(τ|x) = PT 1 t=1 RG B(st, at). Then, JG F = JF + JG B + PF,µ( ) PB,ρ( ), RG B( ) JF + JG B + PF,µ( ) PB,ρ( ) 1 RG B( ) JF + JG B + (T 1) PF,µ( ) PB,ρ( ) 1 RG,max B , (74) where the first inequality holds by H older s inequality, and the second inequality holds by RG,max B := maxs,a RG B(s, a) 1 T 1 maxτ RG B(τ) . By Pinsker s inequality: PF,µ( ) PB,ρ( ) 1 1 2DKL(PF,µ(τ), PB,ρ(τ)). (75) DKL(PF,µ(τ), PB,ρ(τ)) = EPF,µ(τ) log PF,µ(τ|x)P F,µ(x) PB(τ|x)P F,µ(x) log PF,µ(τ|x) + EP F,µ(x) log P F,µ(x) R(x)/Z | {z } 0 = DKL(PF,µ(τ), PB(τ)) = Dµ KL(PF (τ|s0), PB(τ|s0)) + DKL(µ(s0), PB(s0)) | {z } =0 = Dµ KL(PF (τ|s0), e PB(τ|s0)) log Z + log Z = JF + log Z log Z. (76) Then, we have: JG F JF + JG B + (T 1) RG,max B 1 2(JF + log Z log Z). (77) GFlow Net Training by Policy Gradients C.2. 
Proof of Theorem 3.6 Proof. By Lemma C.2 and the definition of ζF : 1 T (J F JF ) Ed F,µ(s),π F (s,a)[AF (s, a)] + ζF . (78) Let AF R|S| denote the vector components of Eπ F (s,a)[AF (s, a)]. Then, we have: Ed F,µ(s)π F (s,a)[AF (s, a)] = d F,µ, AF = d F,µ, AF + d F,µ d F,µ, AF Ed F,µ(s)π F (s,a)[AF (s, a)] + d F,µ d F,µ 1 AF , (79) where the last inequality holds by H older s inequality. By Lemma B.5 and the definition of ϵF : Ed F,µ(s)π F (s,a)[AF (s, a)] Ed F,µ(s)π F (s,a)[AF (s, a)] + 2Ed F,µ(s) DT V (π F (s, ), πF (s, )) ϵF . (80) By Pinsker s inequality, DT V (π F (s, ), πF (s, )) 1 2DKL(π F (s, ), πF (s, ) 0.5 . By Jensen s inequality and the definition of ζF , 2DKL(π F (s, ), πF (s, ) 0.5# 2Ed F,µ(s) [DKL(π F (s, ), πF (s, )] 0.5 ζF Thus, we have: Ed F,µ(s)π F (s,a)[AF (s, a)] Ed F,µ(s)π F (s,a)[AF (s, a)] + (2ζF )0.5ϵF . (81) Combing inequalities (81) and (78), we have: 1 T (J F JF ) Ed F,µ(s)π F (s,a)[AF (s, a)] + ζF + (2ζF )0.5ϵF . (82) C.3. Proof of Theorem 3.7 Proof. By Lemma C.1, JF (θn+1) JF (θn) + θn JF (θn), θn+1 θn + β 2 θn+1 θn 2 2 . (83) θn JF (θn), θn+1 θn JF (θn) JF (θn+1) + β 2 θn+1 θn 2 2 , α θn JF (θn), b θn JF (θn) JF (θn) JF (θn+1) + βα2 b θn JF (θn) 2 2. Conditioning on θn, taking expectations over both sides and noting that EP ( |θn) h θn JF (θn), b θn JF (θn) i = θn JF (θn), EP ( |θn)[b θn JF (θn)] = θn JF (θn) 2 2, we have: α θn JF (θn) 2 2 JF (θn) EP (θn+1|θn) [JF (θn+1)] + βα2 2 EP ( |θn) h b θn JF (θn) 2 2 By the assumption that EP ( |θ) h b θJF (θ) θJF (θ) 2 2 i = EP ( |θ) h b θJF (θ) 2 2 i θJF (θ) 2 2 σF , we have: α θn JF (θn) 2 2 JF (θn) EP (θn+1|θn) [JF (θn+1)] + βα2 2 θn JF (θn) 2 2 + βα2σF GFlow Net Training by Policy Gradients Consequently, we have: EP (θ0:N 1) n=0 θn JF (θn) 2 2 2 + EP (θ0:N) n=0 JF (θn) JF (θn+1) n=0 EP (θn) h θn JF (θn) 2 2 i Nβα2σF 2 + EP (θ0:N) [JF (θ0) JF (θN)] , N min n {0,...,N 1} EP (θn) h θn JF (θn) 2 2 i Nβα2σF 2 + EP (θ0) [JF (θ0)] EP (θN) [JF (θN)] . (86) Setting α = p 2/(βN), we have: p (2N)/β 1 min n {0,...,N 1} EP (θn) h θn JF (θn) 2 2 i σF + EP (θ0) [JF (θ0)] EP (θN) [JF (θN)] . (87) Since JF (θ)+log Z log Z(θ) = Dµ( ;θ) KL (PF (τ|s0; θ), PB(τ|s0)), and JF (θ ) = 0 with log Z = log Z(θ ) for optimal parameter θ , then JF (θN) + log Z log Z(θN) JF (θ ) and we have: min n {0,...,N 1} EP (θn) h θn JF (θn) 2 2 i σF + EP (θ0) [JF (θ0)] EP (θN) [JF (θN) + log Z log Z(θN)] p + EP (θN) [log Z log Z(θN)] p σF + EP (θ0) [JF (θ0)] + EP (θN) [|log Z log Z(θN)|] p (2N)/β 1 . (88) By the assumption that |log Z log Z | σZ, we have: min n {0,...,N 1} EP (θn) h θn JF (θn) 2 2 i σF + σZ + EP (θ0)[JF (θ0)] p (2N)/β 1 . (89) D. Additional discussion about policy-based and value-based methods The goal of traditional RL is to learn a policy π that achieves the optimality in the expected accumulated reward Jπ (for GFlow Net training, corresponding to the distance between PF (τ) and PB(τ), Dist(PF (τ), PB(τ))) addressing the challenge of the exploration-exploitation (Exp-Exp) dilemma. While valued-based methods are usually off-policy allowing to explicitly balance the Exp-Exp trade-off by designing PD, the objectives of the valued-based methods are optimized to encourage the improvement of Jπ but they do not directly solve the optimization formulation with Jπ. Policy-based methods directly optimize Jπ w.r.t. π, enabling optimization techniques that tackle the Exp-Exp trade-off implicitly but efficiently. 
Our joint framework manages to inherit the advantages of both the value-based and the policy-based methods by keeping the optimization formulation of J_π and allowing explicit design of P_G as P_D. We provide more detailed explanations of our arguments as follows.

The Exp-Exp dilemma is the main challenge in decision-making, including different reinforcement learning (RL) formulations. RL is guided by reward functions: to learn the desired policies, an RL agent must prefer actions that it has tried in the past and found to be effective in producing rewards (exploitation), but to discover such actions, it has to try actions that it has not selected before (exploration), at the expense of an exploitation opportunity (Sutton & Barto, 2018). Therefore, both policy-based and value-based methods face this fundamental challenge and try to overcome it in different ways.

In RL, the goal is to learn a policy π that achieves optimality in the expected accumulated reward J_π. The value-based methods, represented by Q-learning and Soft-Q-learning, do not optimize J_π w.r.t. π directly. They leverage the fact that the optimal policy should satisfy the Bellman equation. By minimizing the mismatch of the Bellman equation, which typically takes an off-policy form E_{(s,a)∼P_D}[(Q^π(s,a) − Q̂^π(s,a))^2] (Haarnoja et al., 2018), the corresponding J_π is encouraged to improve. The improvement, however, is not guaranteed, since these methods do not directly solve the optimization formulation with J_π. So the core of the value-based methods turns to explicitly designing a sampler (agent) P_D that can effectively identify the state-action pairs giving rise to the largest mismatch (exploration) while allowing revisiting of state-action pairs already found to be effective (exploitation), as well as how to represent the target function Q̂^π that approximates the optimal function while balancing the trade-off properly. To exemplify that both exploration and exploitation are important, take the hyper-grid experiment, where modes are highly separated but do not exactly lie on the margin of the grids: in the extreme case, a purely explorative sampler will always favor actions that lead to visiting the marginal coordinates, which, however, yield low rewards.

Likewise, the original GFlow Net methods do not optimize the distance between P_F(τ) and P_B(τ) directly (so that P⊤_F(x) ∝ R(x)), but optimize the flow mismatch associated with P_F, which implicitly encourages the minimization of the distance; the core of training efficiency is then to design a sampler P_D that effectively balances the Exp-Exp trade-off. This motivates back-and-forth local search (Kim et al., 2023b), Thompson sampling (Rector-Brooks et al., 2023), and temperature conditioning (Kim et al., 2023a). Besides, the Detailed Balance (DB) objective can be understood as favoring exploitation, as the target edge flow is P_B(s|s′)F(s′), where F(s′) is learned from the data collected so far and represents our partial knowledge about the environment. The Trajectory Balance (TB) objective can be understood as favoring exploration, as the target trajectory flow is P_B(τ|x)R(x), which can be fixed w.r.t. P_F and regarded as pure environment feedback.
So, finding better target flow representations that properly balance the Exp-Exp trade-off is one of the common motivations for the sub-trajectory balance (Madan et al., 2023), forward-looking (Pan et al., 2023), and energy decomposition (Kim et al., 2023a) approaches.

Policy-based methods reformulate the problem of balancing the Bellman equation into optimizing J_π w.r.t. π directly. On one hand, the nature of policy-based methods requires them to be on-policy (i.e., P_D = π), so we cannot explicitly design P_D to overcome the exploration-exploitation dilemma. On the other hand, this saves us from the difficulty of sampler design. Policy-based methods compute the gradients of J_π w.r.t. π and learn π by gradient-based strategies, so the problem of designing a sampler is converted into gradient-based optimization with robust gradient estimation. Related techniques include variance reduction, improved gradient descent directions such as natural policy gradients and mirror policy descent, and conservative policy updates such as TRPO and PPO. These methods address the Exp-Exp trade-off implicitly, because a policy π that always favors either exploration or exploitation will not render the optimal J_π unless it is equal to the optimal policy. Besides, the intuition of conservative policy updates like TRPO is that we keep the policy unchanged to prevent it from getting trapped in local optima (exploration) unless we find a better point in the trust region (exploitation).

Our joint training framework manages to inherit the advantages of both value-based and policy-based methods. It keeps optimizing J_π directly and makes the associated gradient-based optimization techniques applicable. Meanwhile, we can explicitly design the guided policy P_G, playing the role of P_D in the off-policy case, to integrate expert knowledge about the environment and balance the Exp-Exp trade-off explicitly.

E. Additional experimental settings and results

In all experiments, we follow a regular way of designing the off-policy sampler P_D for value-based methods: P_D is a mixture of the learned forward policy and a uniform policy, where the mix-in factor of the uniform policy starts at 1.0 and decays exponentially at a rate of γ after each training iteration; γ is set to 0.99 based on the results of the ablation study (a sketch of this sampler is given after this paragraph). In TB-Sub, the objective is a convex combination of the sub-trajectory balance losses following Madan et al. (2023), where the hyperparameter that controls the weights assigned to sub-trajectories of different lengths is set to 0.9, selected from {0.80, 0.85, 0.90, 0.95, 0.99}. For our policy-based methods, we set the hyper-parameter λ to 0.99 for the forward policy gradients based on the results of the ablation study. The backward policy gradient is estimated unbiasedly, meaning the corresponding hyper-parameter value is 1. The trust region hyper-parameter ζ_F is set to 0.01, selected from {0.01, 0.02, 0.03, 0.04, 0.05}. We use the Adam optimizer for model optimization. The learning rates of the forward and backward policies are equal to 1×10^-3, selected from {5×10^-3, 1×10^-3, 5×10^-4, 1×10^-4} by TB-U. The learning rates of the value functions are set to 5×10^-3, selected from {1×10^-2, 5×10^-3, 1×10^-3} by RL-U. The learning rate of the total flow estimator is 1×10^-1, selected from {1×10^-1, 5×10^-2, 1×10^-2, 5×10^-3} by TB-U. The sample batch size is set to 128 for each optimization iteration.
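A minimal sketch of this decaying mixture sampler, under the schedule stated above (class and argument names are illustrative; valid_mask marks the allowed actions of each state):

```python
import torch

class MixtureSampler:
    """Sketch of the off-policy sampler P_D used for the value-based baselines:
    a mixture of the learned forward policy and a uniform policy over valid actions,
    with the uniform mix-in factor starting at 1.0 and decaying by gamma = 0.99
    after each training iteration. Names and shapes are illustrative."""

    def __init__(self, gamma: float = 0.99):
        self.mix_in = 1.0          # weight of the uniform component
        self.gamma = gamma

    def action_probs(self, pf_probs: torch.Tensor, valid_mask: torch.Tensor) -> torch.Tensor:
        # pf_probs:   [B, A] probabilities of the learned forward policy P_F(.|s)
        # valid_mask: [B, A] boolean mask of the valid actions in A(s)
        uniform = valid_mask.float() / valid_mask.float().sum(dim=-1, keepdim=True)
        return (1.0 - self.mix_in) * pf_probs + self.mix_in * uniform

    def sample(self, pf_probs: torch.Tensor, valid_mask: torch.Tensor) -> torch.Tensor:
        return torch.multinomial(self.action_probs(pf_probs, valid_mask), num_samples=1).squeeze(-1)

    def step(self):
        # call once per training iteration to decay the mix-in factor exponentially
        self.mix_in *= self.gamma
```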
For all experiments, we report the performance with five different random seeds.

E.1. Evaluation metrics

The total variation D_TV between P⊤_F(x) and P*(x) is defined as:

D_TV(P⊤_F, P*) = (1/2) Σ_{x∈X} |P⊤_F(x) − P*(x)|. (90)

The total variation is similar to the average l1-distance used in prior works, which can be computed by (1/|X|) Σ_x |P*(x) − P⊤_F(x)|. However, the average l1-distance may be inappropriate as |X| is usually large (> 10^4) and Σ_x |P*(x) − P⊤_F(x)| ≤ 2, resulting in the average l1-distance being heavily scaled down by |X|.

The Jensen-Shannon divergence D_JSD between P⊤_F(x) and P*(x) is defined as:

D_JSD(P⊤_F, P*) = (1/2) D_KL(P⊤_F, P_M) + (1/2) D_KL(P*, P_M),  P_M = (P⊤_F + P*)/2. (91)

Following Shen et al. (2023), the mode accuracy Acc of P⊤_F(x) w.r.t. P*(x) is defined as:

Acc(P⊤_F, P*) = min( E_{P⊤_F(x)}[R(x)] / E_{P*(x)}[R(x)], 1 ). (92)

For biological and molecular sequence experiments, we also count the number of modes discovered during training (Shen et al., 2023; Kim et al., 2023b). At every 10 training iterations, we sample |X_mode| terminating states from the currently learned P_F(·|·) and store the states that are modes and have not been discovered before; then we count the total number of discovered unique modes for evaluation. The mode set X_mode is defined as the set of terminating states whose rewards are in the top 0.5%, 0.5%, 0.5%, and 0.1% of all rewards for the SIX6, QM9, PHO4 and sEH datasets, respectively.

E.2. Hyper-grid modeling

Environment In this environment, S \ {s_f} is equal to {s = ([s]_1, . . . , [s]_d, . . . , [s]_D) | [s]_d ∈ {0, . . . , N−1}} (= {0, . . . , N−1}^D), where the initial state s_0 = (0, . . . , 0), and the final state s_f can be represented by any invalid coordinate tuple of the hyper-grid, denoted as (−1, . . . , −1) in our implementation. For a state s ∈ S \ {s_f}, we have D + 1 possible actions in A(s): (1) increment one coordinate [s]_d by one, arriving at s′ = ([s]_1, . . . , [s]_d + 1, . . .); (2) choose the stopping action (s → s_f), terminating the process and returning s as the terminating coordinate tuple x. In this environment, G is not a graded DAG, and S \ {s_f} = X as all coordinate tuples can be returned as terminating states. The reward R(x) is defined as:

R(x) = R_0 + R_1 Π_{d=1}^{D} I[ |[x]_d/(N−1) − 0.5| ∈ (0.25, 0.5] ] + R_2 Π_{d=1}^{D} I[ |[x]_d/(N−1) − 0.5| ∈ (0.3, 0.4] ], (93)

where R_0 = 10^-2, R_1 = 0.5 and R_2 = 2 in our experiment. Conditioning on x, we use an unnormalized conditional guided trajectory distribution P̃_G(τ|x) for backward policy design, which is defined as:

P̃_G(τ|x → s_f) := P_f(τ → x) = Π_{t=1}^{T} P_f(s_t|s_{t−1}),

for s_t ≠ s_f: P_f(s_t|s_{t−1}) := P_F(s_t|s_{t−1}) / ( Σ_{s: s ≠ s_f} P_F(s|s_{t−1}) + ϵ_f ) if R(s_{t−1}) ≤ R_0, and P_F(s_t|s_{t−1}) otherwise;

P_f(s_f|s_{t−1}) := ϵ_f / ( Σ_{s: s ≠ s_f} P_F(s|s_{t−1}) + ϵ_f ) if R(s_{t−1}) ≤ R_0, and P_F(s_f|s_{t−1}) otherwise, (94)

where ϵ_f = 10^-5; the corresponding normalized distribution can be understood as P_G(τ|x → s_f) = P_f(τ|x → s_f) ∝ P_f(τ → x), with P_f(τ) = Π_{t=1}^{T} P_f(s_t|s_{t−1}). Similar to the proof of Proposition 3.1, it can be verified that ∇_ϕ D^ρ_KL(P_B(τ|x → s_f; ϕ), P_G(τ|x → s_f)) = ∇_ϕ D^ρ_KL(P_B(τ|x → s_f; ϕ), P̃_G(τ|x → s_f)). As all the coordinate tuples can be terminating states (i.e., s_f is the child of all the other states), the expression above means that P_f assigns a low probability to the event of the terminating state being a state with a low reward. In this way, we discourage the generative process from stopping early at low-reward coordinate tuples.

Figure 4. Training curves by D_TV between P⊤_F and P* for 64×64×64 (left) and 32×32×32×32 (right) hyper-grids. The curves are plotted based on means and standard deviations of metric values across five runs and smoothed by a sliding window of length 10. Metric values are computed every 10 iterations.

Figure 5. Performance comparison between RL-U with different λ values and TB-U with different γ values. The curves are plotted based on means and standard deviations across five runs and smoothed by a sliding window of length 10. Metric values are computed every 10 iterations.

Model Architecture The forward policy P_F(·|·) is parameterized by a neural network with 4 hidden layers and a hidden dimension of 256. The backward policy P_B(·|·) is either fixed to be uniform over valid actions or parameterized in the same way as P_F. Coordinate tuples are transformed by K-hot encoding before being fed into the neural networks.

Additional Experiment Results The results obtained from five runs on 64×64×64 and 32×32×32×32 grids are summarized in Fig. 4 and Table 2. The graphical illustrations of P⊤_F(x) are shown in Figs. 14 and 15. We observe similar performance trends to those in the 256×256 and 128×128 grids: both the TB-based methods and our policy-based methods perform better than the DB-based methods, and our policy-based methods achieve much faster convergence than the TB-based methods. These trends are less obvious than in the 256×256 and 128×128 grids, which can be ascribed to the fact that the grid height N has more influence on the modeling difficulty than the dimension D: hyper-grids are homogeneous w.r.t. each dimension, and the minimum distance between modes depends only on N.

Figure 6. Training curves by D_TV between P⊤_F and P* for SIX6 (left) and QM9 (right). The curves are plotted based on means and standard deviations of metric values across five runs and smoothed by a sliding window of length 10. Metric values are computed every 10 iterations.

Figure 7. Training curves by the number of modes discovered during training for SIX6 (left) and QM9 (right). The curves are plotted based on means and standard deviations of metric values across five runs and smoothed by a sliding window of length 10. Metric values are computed every 10 iterations.

E.3. Biological and Molecular Sequence design

Environment In this environment, S \ {s_f} = {−1, 0, . . .
, N 1}D with element s corresponds to a sequence composed of integers ranging from 1 to N 1. The set {0, . . . , N 1} denotes the N nucleotide types or molecular building blocks, and the integer 1 represents that the corresponding position within s is unfilled. The initial state s0 = ( 1, . . . , 1) represents an empty sequence and the final state sf = (N, . . . , N). For st St, there are t elements in {0, . . . , N 1} and the rest equal to 1. There are N (D t) actions in A(s) that correspond to fill in one of the empty slots by one integer in {0, . . . , N 1}. The generative process will not stop until sequences are fulfilled. By definition, G is a graded DAG and SD = X = {0, . . . , N 1}D. We use the reward values provided in the dataset directly. Following Shen et al. (2023), we compute reward exponents Rβ(x) with hyper-parameter β set to 3, 5, 3, 6 and normalize the reward exponents to [1 10 3, 10], [1 10 3, 10], [0, 10] and [1 10 3, 10] for the SIX6, QM9, PHO4 and s EH datasets respectively. The guided distribution design also follows Shen et al. (2023). For content completeness, we provide the definitions as: t=1 PG(st|st 1, x), PG(st|st 1, x) = score(st|x) P s Ch(st 1) score(s |x), score(s|x) := mean({R(x )|s x , x X replay}) if s x 0 otherwise (95) where X replay corresponds to a replay buffer that stores the sampled terminating states during training. Model Architecture Policies are constructed in the same way as the hyper-grid modeling experiment. Integer sequences are transformed by K-hot encoding before being fed into neural networks. GFlow Net Training by Policy Gradients 500 1000 1500 2000 Number of iterations DB-U DB-B TB-U TB-B TB-Sub TB-TS RL-U RL-B RL-T RL-G 500 1000 1500 2000 Number of iterations DB-U DB-B TB-U TB-B TB-Sub TB-TS RL-U RL-B RL-T RL-G Figure 8. Training curves by Acc of P F w.r.t. P for PHO4 and s EH. The curves are plotted based on their mean and standard deviation values. The curves are plotted based on means and standard deviations of metric values across five runs and smoothed by a sliding window of length 10. Metric values are computed every 10 iterations. 500 1000 1500 2000 Number of iterations Number of modes DB-U DB-B TB-U TB-B TB-Sub TB-TS RL-U RL-B RL-T RL-G 500 1000 1500 2000 Number of iterations Number of modes DB-U DB-B TB-U TB-B TB-Sub TB-TS RL-U RL-B RL-T RL-G Figure 9. Training curves by the number of modes discovered over training for PHO4 and s EH. The curves are plotted based on means and standard deviations of metric values across five runs and smoothed by a sliding window of length 10. Metric values are computed every 10 iterations. Additional Experiment Results For PHO4 and s EH dataset, the exact computation of P F by dynamic programming is expensive. Thus, we only plot the training curves by Acc and the number of modes as shown in Figs. 8 and 9, and the means and standard deviations of metric values at the last iteration are provided in Table 5. We can observe similar performance trends as in the QM9 and SIX6 datasets. Our policy-based methods achieve faster convergence rates and better converged Acc values than the TB-based and DB-based methods. While the converged Acc of RL-T is slightly worse than the other policy-based methods, it achieves the fastest convergence rate. RL-G and RL-B, both employing a parametrized πB, demonstrate similar performance and converge faster than RL-U, which utilizes a uniform πB. E.4. 
Bayesian Network Structure Learning

In this experiment, we investigate GFlow Nets for BN structure learning following the settings adopted in Malkin et al. (2022b). The set X corresponds to a set of BN structures, which are also DAGs. BN structure learning can be understood as approximating P(x|D) ∝ R(x) given a dataset D. Given a set of nodes, the state space for GFlow Nets is the set of all possible DAGs over the node set, and the actions correspond to adding edges to a DAG without introducing a cycle. The generative process of a BN structure is interpreted as follows: starting from an empty graph, at each step an action is taken to either add an edge or terminate the generative process at the current graph structure.

Environment A Bayesian Network is a probabilistic model that represents the joint distribution of N random variables, where the joint distribution factorizes according to the network structure x:

P(y_1, . . . , y_N) = Π_{n=1}^{N} P(y_n | Pa_x(y_n)), (96)

where Pa_x(y_n) denotes the set of parent nodes of y_n according to graph x.

Figure 10. Training curves by D_TV of P⊤_F w.r.t. P* for the BN structure learning experiment. The curves are plotted based on means and standard deviations of metric values across five runs and smoothed by a sliding window of length 10. Metric values are computed every 10 iterations.

As the structure of any graph can be represented by its adjacency matrix, the state space can be defined as S := {s | C(s) = 0, s ∈ {0, 1}^{N×N}}, where C corresponds to the acyclicity constraint (Deleu et al., 2022), the initial state s_0 = 0_{N×N}, and, as a special convention, s_f := 1_{N×N} in our implementation. For each state s, a ∈ A(s) can be any action that turns one of the 0 entries of s into 1 (i.e., adding an edge) while keeping C(s′) = 0 for the resulting graph s′, or the action (s → s_f) that stops the generative process and returns x = s as the terminating state. By definition, the corresponding G in this environment is not graded. Given an observation dataset D_y of y_{1:N}, the structure learning problem can be understood as approximating P(x|D_y) ∝ P(x, D_y) = P(D_y|x)P(x). Without additional information about the graph structure x, P(x) is often assumed to be uniform. Thus, P(x|D_y) ∝ P(D_y|x) and the reward function is defined as R(x) ∝ P(D_y|x). The distribution P(D_y|x) is also called the graph score, and we use the BGe score (Kuipers et al., 2014) in our experiment. Following Malkin et al. (2022b), the ground-truth graph structure and the corresponding observation dataset D_y are simulated from the Erdős-Rényi model. The guided distribution design follows the hyper-grid experiment: a low probability value, 10^-5, is assigned to the transition probability of (s → s_f) if |log R(s) − log R_max| ≥ 5.

Model Architecture Policies are constructed in the same way as in the hyper-grid modeling experiments, but adjacency matrices are fed into the neural networks directly without encoding. As reported in Deleu et al. (2022), the distribution can be very peaky between adjacent graph structures, and the reward R(x) in this experiment (typically on the order of e^80) is much larger than those in the previous two sets of experiments (typically around 10). These facts give rise to numerical issues for reliable estimation of the value function. Thus, we compute the gradients w.r.t. J_B empirically as the gradients of the TB-based objective, that is, V_B is not utilized during training.
Besides, log Z is also very large, so we set the learning rate to 1, which is selected from {0.1, 0.3, 0.5, 0.8, 1.0} by TB-U. Experiment Results The number of possible DAGs grows exponentially with the number of nodes. Thus, we test the same benchmark in Malkin et al. (2022b) with the number of nodes set to 5 and the corresponding total numbers of DAGs is about 2.92 104. The number of edges in the ground-truth DAG is set to 5, and the size of the observation dataset is set to 100. The experimental results across five runs are shown in Fig. 10 and Table 6. The graphical illustrations of P F (x) are shown in Fig. 18. As expected, performance trends similar to those in the previous two sets of experiments are observed. The converged DT V values of all the policy-based methods are better than those of the value-based methods, with only TB-U achieving comparable performance. Besides, RL-T achieves the best converged DT V value and has the fastest convergence among all the methods. These results further demonstrate the effectiveness of our policy-based methods for GFlow Net training. GFlow Net Training by Policy Gradients E.5. Tables of converged metric values 256 256 128 128 Method DT V ( 10 1) DJSD ( 10 2) Time DT V ( 10 1) DJSD ( 10 2) Time DB-U 6.050 0.129 24.61 1.147 37.2 3.233 0.138 8.357 0.471 18.5 DB-B 4.459 0.135 14.42 0.656 46.8 2.621 0.122 5.621 0.502 21.3 TB-U 0.728 0.095 0.763 0.131 30.9 0.449 0.078 0.338 0.062 14.3 TB-B 2.101 1.052 6.612 5.093 32.5 0.441 0.080 0.355 0.105 16.4 TB-Sub 1.461 0.058 2.915 0.285 71.5 0.450 0.032 0.367 0.060 31.3 TB-TS 1.277 0.268 1.714 0.627 38.5 0.481 0.050 0.364 0.058 20.4 RL-U 0.621 0.057 0.770 0.096 44.4 0.440 0.077 0.390 0.053 18.3 RL-B 0.704 0.190 1.064 0.286 68.5 0.490 0.088 0.467 0.082 35.2 RL-T 0.708 0.058 0.774 0.054 90.2 0.503 0.043 0.426 0.053 58.3 RL-G 0.439 0.037 0.541 0.017 69.8 0.427 0.158 0.353 0.180 35.4 Table 1. Converged metric values of different methods for the modeling of 256 256 and 128 128 grids. Training time costs are provided in minutes. 64 64 64 32 32 32 32 Method DT V ( 10 1) DJSD ( 10 2) Time DT V ( 10 1) DJSD ( 10 2) Time DB-U 3.687 0.132 10.83 0.611 19.3 3.254 0.151 7.570 0.682 19.1 DB-B 2.606 0.083 6.559 0.486 21.5 1.248 0.041 1.684 0.119 20.4 TB-U 0.870 0.065 0.737 0.082 16.1 1.086 0.078 1.123 0.110 17.3 TB-B 0.909 0.069 0.753 0.101 17.4 1.191 0.112 1.258 0.184 18.6 TB-Sub 1.185 0.106 1.397 0.276 27.5 1.319 0.085 1.961 0.171 25.2 TB-TS 1.164 0.222 1.255 0.410 21.3 1.390 0.161 1.781 0.295 21.2 RL-U 1.082 0.060 1.287 0.105 17.4 1.180 0.049 1.455 0.111 18.5 RL-B 0.647 0.111 0.440 0.132 30.2 1.203 0.330 1.304 0.723 34.4 RL-T 0.654 0.069 0.456 0.084 49.0 0.838 0.105 0.591 0.140 59.3 RL-G 0.670 0.095 0.527 0.198 31.8 0.888 0.068 0.666 0.091 36.4 Table 2. Converged metric values of different methods for the modeling of 64 64 64 and 32 32 32 32 grids. Training time costs are provided in minutes. SIX6 Method Acc ( 102) DT V ( 10 1) DJSD ( 10 2) Number of modes DB-U 86.26 1.20 1.889 0.021 3.261 0.052 298.24 6.84 DB-B 86.26 1.51 1.886 0.020 3.263 0.064 302.00 4.97 TB-U 93.88 1.13 1.767 0.015 2.915 0.025 317.92 2.74 TB-B 90.94 1.46 1.998 0.055 3.542 0.146 295.36 4.69 TB-Sub 88.57 0.91 1.893 0.021 3.261 0.056 305.26 3.22 TB-TS 88.66 1.04 1.887 0.029 3.241 0.050 302.58 1.27 RL-U 94.30 1.22 1.779 0.018 2.956 0.025 318.84 3.08 RL-B 94.68 1.62 1.786 0.019 2.967 0.036 319.14 1.96 RL-T 93.40 1.01 1.832 0.022 3.174 0.049 321.74 1.72 RL-G 94.70 1.65 1.782 0.020 2.951 0.025 318.96 3.03 Table 3. 
Converged metric values of different methods for the SIX6 datasets. GFlow Net Training by Policy Gradients QM9 Method Acc ( 102) DT V ( 10 1) DJSD ( 10 2) Number of modes DB-U 96.30 1.07 1.889 0.021 3.261 0.052 780.2 4.32 DB-B 96.93 1.07 1.886 0.020 3.263 0.064 780.0 6.04 TB-U 98.65 1.10 1.767 0.015 2.915 0.025 788.8 2.78 TB-B 98.36 1.08 1.998 0.055 3.542 0.146 781.4 3.72 TB-Sub 97.00 1.17 1.893 0.021 3.261 0.056 781.8 4.87 TB-TS 96.83 0.76 1.887 0.029 3.241 0.050 780.6 4.16 RL-U 98.79 1.08 1.779 0.018 2.956 0.025 787.0 5.29 RL-B 99.10 0.90 1.786 0.019 2.967 0.036 792.8 2.95 RL-T 98.74 1.27 1.832 0.022 3.174 0.049 798.2 0.84 RL-G 90.04 1.20 1.782 0.020 2.951 0.025 794.0 2.16 Table 4. Converged metric values of different methods for the QM9 datasets. PHO4 s EH Method Acc ( 102) Number of modes Acc Number of modes DB-U 76.08 0.23 3957.33 18.61 88.45 0.61 23502.97 260.03 DB-B 76.61 0.32 3975.00 64.00 88.58 0.81 23454.68 422.14 TB-U 77.47 0.29 3993.33 9.87 90.49 0.48 24250.63 114.85 TB-B 77.00 0.14 3901.33 26.08 90.84 0.85 23139.13 634.65 TB-Sub 76.94 0.24 3939.33 18.61 89.81 0.61 24652.47 74.43 TB-TS 76.78 0.14 3973.33 38.68 89.57 0.24 24491.93 90.99 RL-U 77.31 0.30 3984.67 24.79 90.86 0.42 25200.67 24.61 RL-B 77.55 0.25 4004.67 15.01 91.36 0.85 25454.03 142.64 RL-T 76.81 0.48 4062.67 47.69 93.98 0.64 26530.30 142.27 RL-G 77.26 0.55 4005.67 12.50 91.28 0.79 25368.07 52.27 Table 5. Converged metric values of different methods for the PHO4 and s EH datasets Method DT V ( 10 1) DJSD ( 10 2) Method DT V ( 10 1) DJSD ( 10 2) DB-U 1.346 0.030 4.863 0.420 DB-B 1.284 0.027 7.533 1.873 TB-U 0.898 0.124 3.121 0.332 TB-B 1.521 0.281 13.301 2.107 RL-U 0.831 0.079 7.436 1.508 RL-B 0.929 0.078 6.114 1.179 RL-T 0.698 0.052 1.890 0.307 RL-G 0.964 0.174 5.712 0.845 Table 6. Converged metric values of different methods for the BN structure learning experiment, where the ground-truth BN has 5 nodes and 5 edges. GFlow Net Training by Policy Gradients E.6. Graphical representation of P F 0 50 100 150 200 250 256 256 0 20 40 60 80 100 120 128 128 0 10 20 30 40 50 60 64 64 64 0 5 10 15 20 25 30 32 32 32 32 Figure 11. Graphical representation of P for different hyper-grids. For visualization easiness, the ground-truth marginal distributions of two dimensions are plotted for 64 64 64 and 32 32 32 32 grids. 0 50 100 150 200 250 DB-U 0 50 100 150 200 250 TB-U 0 50 100 150 200 250 TB-TS 0 50 100 150 200 250 RL-U 0 50 100 150 200 250 RL-T 0 50 100 150 200 250 DB-B 0 50 100 150 200 250 TB-B 0 50 100 150 200 250 TB-Sub 0 50 100 150 200 250 RL-B 0 50 100 150 200 250 RL-G Figure 12. Graphical illustrations of P F (x) averaged across 5 runs of corresponding training strategies for a 256 256 hyper-grid. 0 20 40 60 80 100 120 DB-U 0 20 40 60 80 100 120 TB-U 0 20 40 60 80 100 120 TB-TS 0 20 40 60 80 100 120 RL-U 0 20 40 60 80 100 120 RL-T 0 20 40 60 80 100 120 DB-B 0 20 40 60 80 100 120 TB-B 0 20 40 60 80 100 120 TB-Sub 0 20 40 60 80 100 120 RL-B 0 20 40 60 80 100 120 RL-G Figure 13. Graphical illustrations of P F (x) averaged across 5 runs of corresponding training strategies for a 128 128 hyper-grid. GFlow Net Training by Policy Gradients 0 10 20 30 40 50 60 DB-U 0 10 20 30 40 50 60 TB-U 0 10 20 30 40 50 60 TB-TS 0 10 20 30 40 50 60 RL-U 0 10 20 30 40 50 60 RL-T 0 10 20 30 40 50 60 DB-B 0 10 20 30 40 50 60 TB-B 0 10 20 30 40 50 60 TB-Sub 0 10 20 30 40 50 60 RL-B 0 10 20 30 40 50 60 RL-G Figure 14. Graphical illustrations of P F (x) averaged across 5 runs of corresponding training strategies for a 64 64 64 hyper-grid. 
For visualization easiness, only the marginals of two dimensions are plotted.

Figure 15. Graphical illustrations of P⊤_F(x) averaged across 5 runs of corresponding training strategies for a 32×32×32×32 hyper-grid. For visualization easiness, only the marginals of two dimensions are plotted.

Figure 16. In each plot, the orange line represents the P*(x) of all sequences in the SIX6 dataset, with its values plotted in ascending order. The blue dots represent corresponding values of P⊤_F(x), averaged over five runs of the corresponding training strategy.

Figure 17. In each plot, the orange line represents the P*(x) of all sequences in the QM9 dataset, with its values plotted in ascending order. The blue dots represent corresponding values of P⊤_F(x), averaged over five runs of the corresponding training strategy.

Figure 18. In each plot, the orange line represents the P*(x) of all BN structures, with its values plotted in ascending order. The blue dots represent corresponding values of P⊤_F(x), averaged over five runs of the corresponding training strategy. Only the ground-truth and corresponding learned values for the top 3000 structures are plotted as the remaining values are nearly zero.