# Revisiting Non-Acyclic GFlow Nets in Discrete Environments

Nikita Morozov *1, Ian Maksimov *1, Daniil Tiapkin 2,3, Sergey Samsonov 1

*Equal contribution. 1 HSE University, Moscow, Russia. 2 CMAP, CNRS, École polytechnique, Institut Polytechnique de Paris, 91128 Palaiseau, France. 3 Université Paris-Saclay, CNRS, LMO, 91405 Orsay, France. Correspondence to: Nikita Morozov. Proceedings of the 42nd International Conference on Machine Learning, Vancouver, Canada. PMLR 267, 2025. Copyright 2025 by the author(s).

Abstract

Generative Flow Networks (GFlow Nets) are a family of generative models that learn to sample objects from a given probability distribution, potentially known up to a normalizing constant. Instead of working in the object space, GFlow Nets proceed by sampling trajectories in an appropriately constructed directed acyclic graph environment, greatly relying on the acyclicity of the graph. In our paper, we revisit the theory that relaxes the acyclicity assumption and present a simpler theoretical framework for non-acyclic GFlow Nets in discrete environments. Moreover, we provide various novel theoretical insights related to training with fixed backward policies, the nature of flow functions, and connections between entropy-regularized RL and non-acyclic GFlow Nets, which naturally generalize the respective concepts and theoretical results from the acyclic setting. In addition, we experimentally re-examine the concept of loss stability in non-acyclic GFlow Net training, as well as validate our own theoretical findings.

1. Introduction

Generative Flow Networks (GFlow Nets, Bengio et al., 2021) are models that aim to sample discrete objects from distributions known proportionally up to a constant. They operate by constructing an object through a sequence of stochastic transitions defined by a forward policy. GFlow Nets have been successfully applied in various areas, ranging from molecule generation (Bengio et al., 2021; Shen et al., 2024; Koziarski et al., 2024; Cretu et al., 2025) and biological sequence design (Jain et al., 2022; Kim et al., 2024) to combinatorial optimization (Zhang et al., 2023a;b; Kim et al., 2025) and fine-tuning of large language models and diffusion models (Hu et al., 2023; Venkatraman et al., 2024; Uehara et al., 2024; Zhang et al., 2025; Lee et al., 2025). The detailed theoretical foundations of GFlow Nets in discrete environments were developed in (Bengio et al., 2023). While the majority of GFlow Net literature considers the discrete setting, it is possible to apply the methodology of continuous GFlow Nets (Lahlou et al., 2023) to sampling problems on more general spaces.

The main idea behind the generation process in GFlow Nets lies in sampling trajectories in an appropriately constructed directed acyclic graph environment instead of working directly in the object space. A standard intuition behind this process is a sequence of actions applied in order to construct a composite object from blocks. One of the limitations of this setting is that it requires acyclicity. While this limitation can be naturally interpreted in, e.g., the molecule generation setting, it can confine the practical design of GFlow Net environments, as well as restrict the applicability of GFlow Nets in other scenarios. A motivational example for non-acyclic GFlow Nets presented by (Brunswic et al., 2024) is related to modeling distributions over objects with intrinsic symmetries.
Consider a class of environments where states are elements of some group, e.g., the symmetric group or the Rubik's Cube group. The transitions are given via a generating set of this group, thus corresponding to applying the group operation to the current state and some element of the generating set, which leads to the existence of cycles. While in some cases an acyclic environment can be designed to generate group elements, such environments of algebraic origin naturally contain cycles, thus falling under the area of our study. In addition, there is a growing body of work that connects GFlow Nets and Reinforcement Learning (Tiapkin et al., 2024; Mohammadpour et al., 2024; Deleu et al., 2024; He et al., 2024). Most RL environments contain cycles, thus understanding how GFlow Nets can be applied in cyclic environments and connecting them to RL in such cases can be a crucial step towards further bridging the two research fields.

To the best of our knowledge, methodological aspects of working with non-acyclic environments in GFlow Nets were previously considered only in the recent work of (Brunswic et al., 2024). The latter paper, similarly to (Lahlou et al., 2023), uses the machinery of measurable spaces and measure theory, which is harder to build new extensions and methodology upon. We believe that simplicity is a key merit of the theory behind discrete GFlow Nets (Bengio et al., 2023) when compared to their general-state counterparts. Thus, the main aim of our paper is to revisit the theory of non-acyclic GFlow Nets within a discrete state space, simplifying the constructions of (Brunswic et al., 2024) and providing additional theoretical and methodological insights into training GFlow Nets in this setting. The main contributions of the paper can be summarized as follows:

1. We present a simple and intuitive way to build a theory of non-acyclic GFlow Nets in discrete environments from scratch. In addition to simplicity, our construction introduces and clarifies a number of key points regarding similarities and dissimilarities between acyclic and non-acyclic discrete GFlow Nets that were not explored in (Brunswic et al., 2024), in particular regarding the nature of flows and the importance of backward policies.

2. We show that when the backward policy is fixed, the loss stability introduced by (Brunswic et al., 2024) does not impact the result of the optimization procedure. The latter becomes important only when the backward policy is also being trained.

3. When the backward policy is trained, we show that learning a non-acyclic GFlow Net with the smallest expected trajectory length is equivalent to learning a non-acyclic GFlow Net with the smallest total flow. In addition, we propose state flow regularization as a way to approach the arising optimization problem.

4. We empirically show that the scale in which the flow error is computed in the loss plays a crucial role in its stability. However, we also show that using an unstable loss with the proposed state flow regularization can lead to better sampling quality.

5. Finally, we generalize the key theoretical result of (Tiapkin et al., 2024) on the equivalence between GFlow Nets and entropy-regularized RL to the non-acyclic setting.

Source code: github.com/GreatDrake/non-acyclic-gfn.

2. Background

2.1. GFlow Nets

This section presents necessary notations and theoretical background on GFlow Nets.
GFlow Nets treat the problem of sampling from a probability distribution over discrete space X as a sequential decision-making process in a directed acyclic graph (DAG) G = (S, E), where S is a finite state space and E S S is a finite set of edges (or transitions). There is a special initial state s0 with no incoming edges and a special sink state sf with no outgoing edges. The commonly used variant of notation does not include a sink state sf, yet we prefer to use a variant with sf, since it was also used in the previous work on nonacyclic GFlow Nets (Brunswic et al., 2024) and leads to a more intuitive construction. Let T be a set of all trajectories τ = (s0 s1 . . . snτ sf) from s0 to sf, where we use nτ to denote the length of the trajectory τ. We use a convention snτ +1 = sf. We say that τ terminates in s if its last transition is s sf. Such transitions are called terminating, and the states that have an outgoing edge into sf are called terminal states. The set of terminal states coincides with X, and the probability distribution of interest R(x)/Z is defined on it, where R(x) > 0 is called GFlow Net reward and Z = P x X R(x) is an unknown normalizing constant. In addition, for any state s, we denote in(s) to be the set of states s such that there is an edge s s E (parents), and out(s) to be the set of states s such that there is an edge s s E (children). The main goal of GFlow Nets is to find a distribution P over T such that for any x X, probability that τ P terminates in x coincides with R(x)/Z. This property is called the reward matching condition. GFlow Nets operate with Markovian distributions over trajectories (see (Bengio et al., 2023) for a definition and discussion) using the following key components: 1. a forward policy PF(s | s), which is a distribution over children of each state; 2. a backward policy PB(s | s ), which is a distribution over parents of each state; 3. state/edge flows F(s), F(s s ), which coincide with an unnormalized probability that a trajectory τ passes through state/edge. PF, PB, and F are connected through the trajectory balance conditions: t=0 PF(st+1 | st) = t=0 PB(st | st+1) , (1) detailed balance conditions: F(s s ) = F(s)PF(s | s) = F(s )PB(s | s ) , (2) and flow matching conditions: s out(s) F(s s ) = X s in(s) F(s s) . (3) All of these objects are completely and uniquely specified if one fixes either 1) edge flow F(s s ), 2) initial flow F(s0) and PF, 3) initial flow F(s0) and PB. If flows satisfy F(s sf) = R(s), trajectory distribution defined by the corresponding forward policy will satisfy the reward matching condition (Bengio et al., 2023). In practice, a neural network is used to parameterize the forward policy (and, optionally, the backward policy and the flows). Then, it is Revisiting Non-Acyclic GFlow Nets in Discrete Environments trained to minimize some loss function that would enforce the reward matching condition. For example, Detailed Balance loss (Bengio et al., 2023) is defined on all transitions s s E as: LDB(s s ) log Fθ(s)PF(s | s, θ) Fθ(s )PB(s | s , θ) Reward matching is enforced by substituting F(x sf) = Fθ(sf)PB(x | sf, θ) = R(x) in the loss. Although the optimization task typically admits multiple solutions, fixing the backward policy results in a unique solution in terms of F and PF (Bengio et al., 2023). 2.2. 
GFlow Nets in Non-Acyclic Environments (Brunswic et al., 2024) state that fundamental results of GFlow Net theory also apply in the case when the environment graph G may contain cycles, and all definitions from the acyclic case remain valid and extend to the non-acyclic case. However, we will further show that this is not exactly true, e.g., flows cannot be consistently defined as unnormalized visitation probabilities. More specifically, (Brunswic et al., 2024) argue that if (3) holds for an edge flow, as well as F(s sf) = R(s) for terminating transitions, the forward policy induced by the flow PF(s | s) = F(s s ) F(s) satisfies the reward matching condition. Thus, standard GFlow Net losses, such as Flow Matching (FM, Bengio et al., 2021), Detailed Balance (DB, Bengio et al., 2023), and Trajectory Balance (TB, Malkin et al., 2022) can be applied in non-acyclic environments. However, (Brunswic et al., 2024) point out that the main distinction between non-acyclic and acyclic GFlow Nets is that in the non-acyclic setting, expected trajectory length E[nτ] (denoted as a sampling time in (Brunswic et al., 2024)) can be arbitrarily large because of the cycles, while in the acyclic setting it is always bounded. To tackle this issue, a concept of loss stability is introduced. A loss is called stable if adding a constant to the flow along a cycle can never decrease the loss, and otherwise, it is called unstable (Definition 3). It is shown that FM, DB, and TB are unstable (Theorem 3), which can lead to arbitrarily large sampling time when utilized for training. In contrast, a family of losses that are provably stable is presented (Theorem 4). Moreover, the authors show that there are stable variants of FM and DB losses, such as stable DB loss, which we denote as SDB: LSDB(s s ) log(1 + ε 2(s, s , θ))(1 + ηFθ(s)) , (s, s , θ) Fθ(s)PF(s |s, θ) Fθ(s )PB(s|s , θ) , where ε and η are hyperparameters. In addition, (Brunswic et al., 2024) show that the expected trajectory length is bounded by the total normalized state flow E[nτ] 1 F(s0) s S\{s0,sf } F(s) , (6) and using a stable loss with a regularizer that controls the total flow, e.g., the norm of the edge flow matrix, can be used to learn an acyclic flow (Theorem 1). 3. Theory of GFlow Nets in Discrete Non-Acyclic Environments 3.1. Environment All definitions regarding the environment can be introduced similarly to the setting of acyclic GFlow Nets (see Section 2.1) with one main difference: graph G can now contain cycles. In addition to finiteness, we make two technical assumptions on the structure of G: Assumption 3.1. 0) graph G is finite; 1) There is a special initial state s0 with no incoming edges and a special sink state sf with no outgoing edges; 2) For any state s S there exists a path from s0 to s and a path from s to sf. Next, we formally define trajectories: Definition 3.2. A sequence τ = (s0 s1 . . . snτ snτ +1 = sf) is called a trajectory of length nτ N if all transitions st st+1 E, t {0, . . . , nτ}. Then T is a set of all finite-length trajectories that start in s0 and finish in sf. In the above definition and further, we use a convention snτ +1 = sf. While G itself is always finite, the main difference with acyclic GFlow Nets is that T can be infinite, and T can contain trajectories of arbitrary length. 3.2. Backward Policy and Trajectory Distribution There are several equivalent ways to introduce probability distributions over trajectories in GFlow Nets. 
One of the common approaches is to start by introducing trajectory flows (Bengio et al., 2023). The main theoretical advantage of the approach based on trajectory flows is that it allows for non-Markovian flows, see (Bengio et al., 2023). At the same time, Markovian flows are primarily considered in GFlow Nets literature, and in our paper, we only consider this setting. Instead of starting from the definition of the flow, a more intuitive approach is to begin with the definitions of the forward and backward policies. Definition 3.3. A forward policy PF(s | s) consistent with G is a family of conditional probability distributions over s out(s) defined for each s S \ {sf}, Similarly, a backward policy PB(s | s ) consistent with G is a family of conditional probability distribution over s in(s ), defined for each s S \ {s0}. Revisiting Non-Acyclic GFlow Nets in Discrete Environments In the subsequent parts of the paper, we always assume that the considered PF or PB are consistent with G and do not specify this fact explicitly. Definition 3.3 is consistent with the definitions of forward and backward policies in acyclic GFlow Nets (Bengio et al., 2023, Definition 4). Note that the structure of G is symmetric with respect to an interchange between initial state s0 and sink state sf and reversion of all edges in G. Thus, starting with either PF or PB is equivalent. We prefer to start from a backward policy PB in our subsequent derivations. Using PB, we define a probability distribution P over τ = (s0 s1 . . . snτ sf) T according to t=0 PB(st | st+1) . (7) In such a case, we say that the trajectory distribution P(τ) is induced by PB. In the following lemma, we show that P(τ) is a correctly defined probability distribution over T . Lemma 3.4. Let PB(s | s ) be a backward policy, such that PB(s | s ) > 0 for any edge s s E. Then P(τ) defined in (7) is a probability distribution over T , that is, P τ T P(τ) = 1. Moreover, its expected trajectory length is always finite Eτ P[nτ] = P τ T nτP(τ) < + . In fact, the condition PB(s | s ) > 0 together with Assumption 3.1 allows us to ensure that the sequence si is a finite state-space absorbing Markov chain. Given this assumption, the proof of Lemma 3.4 almost coincides with the proof of the fact that the states of such a Markov chain are transient, see, e.g., (Douc et al., 2018). For completeness, we provide the proof in Appendix A.1. 3.3. State and Edge Flows Given a probability distribution P(τ) induced by PB, our next aim is to define state and edge flows. Before proceeding with a valid construction, we provide some intuition about our definitions. Let us first show that, contrary to the acyclic GFlow Nets, we cannot define edge flows as visitation probabilities P({τ T | s s τ}). In particular, we show that such a definition does not satisfy the flow matching conditions (3). Indeed, consider an example from (Brunswic et al., 2024): s0 1 / a 0.5 / b 1 ? c 0.5 1 / sf The number above each edge s s is PB(s | s ). Consider the distribution P(τ) defined by (7). Let us plot the visitation probability for each edge: s0 1 / a 1 / b 1 ? c 0.5 1 / sf One can see that the flow matching condition (3) does not hold for states b and c since 1 = 1 + 0.5. Instead, let us calculate the expected number of visits for each edge s s t=0 I{st = s, st+1 = s } We visualize the corresponding numbers on the plot below: s0 1 / a 1 / b 2 ? c 1 1 / sf It is easy to check that the flow matching conditions (3) are now satisfied. Next, we formally define: Definition 3.5. 
Let PB(s | s ) be a backward policy, such that PB(s | s ) > 0 for any edge s s E. Then, given a final flow F(sf) > 0, we define state and edge flows as F(s s ) F(sf) Eτ P t=0 I{st = s, st+1 = s } F(s) F(sf) Eτ P(τ) t=0 I{st = s} We say that the flows defined above are induced by the backward policy PB and final flow F(sf). It is important to note that if G does not contain cycles, the expected number of visits in (8) coincides with visitation probability, thus Definition 3.5 agrees with the usual understanding of flows in the acyclic GFlow Net literature. Next, we show that state and edge flows defined in (3.5) satisfy the detailed balance and flow matching conditions (2) (3). Proposition 3.6. Flows F defined in Definition 3.5 satisfy: 1. F(s) (a) = P s out(s) F(s s ) (b) = P s in(s) F(s s), for each s S \ {s0, sf}. Moreover, identity (a) holds for s0, and (b) holds for sf. 2. F(s s ) = F(s )PB(s | s ) for any s s E. 3. F(s0) = F(sf). We provide the proof in Appendix A.2. In the next proposition, we show that there is a one-to-one correspondence between a pair (PB, F(sf)) and edge flows F. Its proof is provided in Appendix A.3. Proposition 3.7. Let F : E R>0 be a function that satisfies the flow matching conditions (3). Define the corresponding backward policy by the relation PB(s | s ) = F(s s ) / X s in(s ) F(s s ) . Then F are edge flows induced by PB and F(sf) = P s in(sf ) F(s sf). Revisiting Non-Acyclic GFlow Nets in Discrete Environments 3.4. Forward Policy and Detailed Balance It is well-known in acyclic GFlow Nets theory (Bengio et al., 2023) that there exists a unique forward policy PF for any backward policy PB that induces the same probability distribution over T . The main implication of this fact is that by fixing rewards R(x), x X and a backward policy PB(s | s ) for each state s S \ {s0, sf}, one automatically fixes a trajectory distribution P(τ) that satisfies the reward matching condition (Malkin et al., 2022). However, sampling from a such distribution is intractable since it requires starting from a terminal state sampled from the reward distribution. Thus, during GFlow Net training, one tries to find a forward policy, which always allows tractable sampling of trajectories by construction, that will match this trajectory distribution P(τ). One can note that this bears similarities with hierarchical variational inference (Malkin et al., 2023). In the following proposition, we show that this result also holds for non-acyclic GFlow Nets. Proposition 3.8. Given any backward policy PB(s | s ) > 0, there exists a unique forward policy PF(s | s) such that t=0 PB(st | st+1) = t=0 PF(st+1 | st) , τ T . Moreover, it satisfies the detailed balance conditions F(s)PF(s | s) = F(s )PB(s | s ), s s E with the state flow F defined in (8). The proof is provided in Appendix A.4. Conversely, the following proposition shows that if a triplet F, PF, PB satisfies the detailed balance conditions (2), it will be consistent with all previous definitions and propositions. Proposition 3.9. Let F : S R>0, and let PF(s |s) > 0, PB(s|s ) > 0 be forward and backward policies, such that the detailed balance conditions (2) are satisfied. Then PF and PB induce the same trajectory distribution: t=0 PB(st | st+1) = t=0 PF(st+1 | st) , τ T . Moreover, then F are state flows induced by PB and F(sf). For proof, we refer to Appendix A.5. The above propositions directly generalize their counterparts from the nonacyclic setting (Bengio et al., 2023). 
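As a concrete illustration of Definition 3.5 and Propositions 3.6-3.8, the following minimal NumPy sketch recomputes the flows for the cyclic example of Section 3.3 by treating the backward walk induced by PB as an absorbing Markov chain (as in the proof of Lemma 3.4) and reading expected visit counts off its fundamental matrix. The graph, the backward policy, and the target numbers are taken from the example above; the code itself is ours and is not part of the paper's repository.

```python
import numpy as np

# Cyclic example from Section 3.3: s0 -> a -> b -> c, c -> b (cycle), c -> sf.
# edges_PB[(s, s')] = PB(s | s') labels each forward edge s -> s'.
edges_PB = {("s0", "a"): 1.0, ("a", "b"): 0.5, ("b", "c"): 1.0,
            ("c", "b"): 0.5, ("c", "sf"): 1.0}
states = ["s0", "a", "b", "c", "sf"]

# The trajectory distribution induced by PB is a backward random walk started in sf
# and absorbed in s0 (Lemma 3.4).  Expected visit counts, i.e. the flows of
# Definition 3.5 with F(sf) = 1, come from the fundamental matrix (I - Q)^{-1}.
transient = ["sf", "c", "b", "a"]
pos = {s: i for i, s in enumerate(transient)}
Q = np.zeros((len(transient), len(transient)))
for (s, s_next), p in edges_PB.items():
    if s != "s0":                       # the backward step into s0 is absorption
        Q[pos[s_next], pos[s]] = p      # backward move s' -> s w.p. PB(s | s')
N = np.linalg.inv(np.eye(len(transient)) - Q)
F = {s: N[pos["sf"], pos[s]] for s in transient}
F["s0"] = 1.0                           # every trajectory visits s0 exactly once
F_edge = {(s, s_next): F[s_next] * p for (s, s_next), p in edges_PB.items()}
print("state flows:", {s: round(F[s], 3) for s in states})          # F(b) = F(c) = 2
print("edge flows: ", {e: round(f, 3) for e, f in F_edge.items()})  # F(b->c) = 2

# Proposition 3.6: flow matching at intermediate states and F(s0) = F(sf).
for s in ["a", "b", "c"]:
    assert np.isclose(sum(f for (u, v), f in F_edge.items() if v == s), F[s])
    assert np.isclose(sum(f for (u, v), f in F_edge.items() if u == s), F[s])
assert np.isclose(F["s0"], F["sf"])

# Proposition 3.8: the induced forward policy PF(s' | s) = F(s -> s') / F(s)
# assigns the same probability to every trajectory as PB does.
PF = {(u, v): f / F[u] for (u, v), f in F_edge.items()}
tau = ["s0", "a", "b", "c", "b", "c", "sf"]
p_fwd = np.prod([PF[(u, v)] for u, v in zip(tau[:-1], tau[1:])])
p_bwd = np.prod([edges_PB[(u, v)] for u, v in zip(tau[:-1], tau[1:])])
assert np.isclose(p_fwd, p_bwd)         # both equal 0.25 for this trajectory
```

The same routine applies to any finite graph satisfying Assumption 3.1; the final check illustrates Proposition 3.8, namely that the products of forward and backward policy probabilities along any trajectory coincide.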
Note that, due to the symmetries between s0 and sf in G up to changing direction of edges, we could start from the forward policy and trajectory distribution induced by it in (7), and then prove the uniqueness of the corresponding backward policy PB. 3.5. Training Non-Acyclic GFlow Nets Now, we proceed with the main learning problem in GFlow Nets: finding a forward policy that matches the reward distribution over terminal states R(x)/Z, x X. The following proposition shows how the reward matching condition can be formulated in terms of flows. Proposition 3.10. Let PB(s | s ) > 0 be a backward policy, F(sf) > 0 a final flow, and R(x) > 0 GFlow Net rewards. If edge flows F(s s ) induced by PB and F(sf) satisfy: F(x sf) = R(x) x sf E , (9) the trajectory distribution P induced by PB (7) satisfies the reward matching condition, i.e. Pτ P[snτ = x] = R(x)/Z. Then, the same trajectory distribution is induced by the unique corresponding forward policy PF, thus also satisfying the reward matching condition. Proof. Notice that an edge leading into sf can be visited only once; thus, F(x sf) coincides with a probability Pτ P[snτ = x] that a trajectory terminates in x times the final flow F(sf). In addition, we have F(sf) = P x in(sf ) F(x sf) = P x X R(x) = Z, thus Pτ P[snτ = x] = F(x sf)/F(sf) = R(x)/Z. Proposition 3.10 also implies F(s0) = F(sf) = Z and PB(x | sf) = R(x)/Z by Proposition 3.6. An important fact from the literature on acyclic GFlow Nets (Malkin et al., 2022; Bengio et al., 2023) that was overlooked in the work of (Brunswic et al., 2024), but holds in the non-acyclic case as well, is that it is generally easy to manually pick a backward policy such that the induced trajectory distribution will satisfy the reward matching condition. A simple and natural choice is to take PB(x | sf) = R(x)/Z for sf and fix PB(s | s ) = 1/| in(s )| for all other states. It is worth mentioning that Z is generally unknown, but this issue is circumvented in GFlow Nets by learning unnormalized flows or making Z itself a learnable parameter depending on the chosen loss function (Malkin et al., 2022; Bengio et al., 2023; Madan et al., 2023). Moreover, Proposition 3.8 shows the uniqueness of the corresponding PF. Thus, we state the main practical corollary of this result: Corollary 3.11. When a backward policy PB > 0 is fixed, any loss from the acyclic GFlow Net literature (Bengio et al., 2021; Malkin et al., 2022; Bengio et al., 2023; Madan et al., 2023) can be used to learn the corresponding forward policy PF in the non-acyclic case as well. Lemma 3.4 and Proposition 3.8 imply that there is always a unique solution with a finite expected trajectory length, thus the stability of the loss (Brunswic et al., 2024) does not play a factor. The main disadvantage of learning with a fixed backward policy in non-acyclic GFlow Nets that does not arise in acyclic GFlow Nets is the fact that the expected trajectory length Eτ P[nτ] of a manually chosen PB can be large. A natural way to circumvent this issue is to consider a Revisiting Non-Acyclic GFlow Nets in Discrete Environments learnable backward policy, which is also a widely employed choice in acyclic GFlow Net literature (Malkin et al., 2022; Jang et al., 2024a; Gritsaev et al., 2025). However, (Brunswic et al., 2024) made an important discovery by pointing out that standard losses from acyclic GFlow Net literature are not stable (Theorem 3), meaning that the expected trajectory length can grow uncontrollably during training. 
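Before discussing stability further, the following self-contained sketch (ours, not from the paper's code) makes the instability phenomenon concrete on the cyclic example of Section 3.3: adding a constant to the edge flows along the cycle b -> c -> b preserves flow matching, detailed balance, and the terminating flow, so losses built from balance residuals admit zero-loss solutions with arbitrarily large total flow and hence arbitrarily large expected trajectory length (cf. Proposition 3.12 below).

```python
import numpy as np

# Instability on the cyclic example from Section 3.3: add a constant c to the
# edge flows along the cycle b -> c -> b and check that the perturbed flow is
# still perfectly balanced (so its DB loss is zero and the reward is still
# matched), while the normalized total flow grows linearly with c.
states = ["s0", "a", "b", "c", "sf"]
base_flow = {("s0", "a"): 1.0, ("a", "b"): 1.0, ("b", "c"): 2.0,
             ("c", "b"): 1.0, ("c", "sf"): 1.0}   # flows induced by PB from Sec. 3.3
cycle_edges = [("b", "c"), ("c", "b")]

def inspect(edge_flow):
    inflow = {s: sum(f for (u, v), f in edge_flow.items() if v == s) for s in states}
    outflow = {s: sum(f for (u, v), f in edge_flow.items() if u == s) for s in states}
    for s in ["a", "b", "c"]:                       # flow matching still holds
        assert np.isclose(inflow[s], outflow[s])
    F = {s: max(inflow[s], outflow[s]) for s in states}
    # With PF(s'|s) = F(s->s')/F(s) and PB(s|s') = F(s->s')/F(s'), every detailed
    # balance residual log[F(s)PF(s'|s)] - log[F(s')PB(s|s')] is exactly zero, and
    # the terminating flow F(c -> sf) is untouched, so reward matching is preserved.
    n_tau = sum(F[s] for s in ["a", "b", "c"]) / F["sf"]
    return edge_flow[("c", "sf")], n_tau

for c in [0.0, 10.0, 1000.0]:
    flow = {e: f + (c if e in cycle_edges else 0.0) for e, f in base_flow.items()}
    term_flow, n_tau = inspect(flow)
    print(f"c = {c:6.1f}: F(c->sf) = {term_flow:.1f}, E[n_tau] = {n_tau:.1f}")
```

This is exactly the failure mode that a stable loss, or, as we argue below, an explicit control of the total flow, is meant to prevent once the flow along cycles becomes learnable.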
The concept of stability was introduced with respect to learnable edge flows (Definition 3), which implies that the corresponding backward policy also changes during training. Using a stable loss, e.g. (5), was proposed as a way to approach this issue. At the same time, we argue that efficient training of a non-acyclic GFlow Net with controlled expected trajectory length in case of a learnable PB is possible without utilizing stable losses. The next proposition is a simple corollary of Definition 3.5: Proposition 3.12. Given a trajectory distribution P, its expected trajectory length is equal to the normalized total flow: Eτ P[nτ] = 1 F(sf) s S\{s0,sf } F(s). (10) The proof is presented in Appendix A.6. This result is a refinement of Theorem 2 of (Brunswic et al., 2024), which states only inequality in (10). Thus, we believe one of our key contributions to be pointing out the following fact: Learning a non-acyclic GFlow Net with the smallest expected trajectory length is equivalent to learning a non-acyclic GFlow Net with the smallest total flow. We also believe that exploiting this equivalence is a crucial direction for future research on non-acyclic GFlow Nets. We further explore a particular solution, which suggests the use of a state flow as a regularizer in the existing GFlow Net losses. We consider an example of the detailed balance loss DB (4). In this case, Proposition 3.9 implies that learning a non-acyclic GFlow Net with the smallest expected trajectory length can be formulated as the following constrained optimization problem: min F,PF,PB s S\{s0,sf } F(s) (11) subject to log F(s)PF(s | s) F(s )PB(s | s ) 2 = 0 , s s E , F(sf)PB(x | sf) = R(x) , x sf E . As an approximate way to solve (11), one can use DB (4) with state flow regularization: log Fθ(s)PF(s | s, θ) Fθ(s )PB(s | s , θ) 2 + λFθ(s) , (12) where λ > 0 is a hyperparameter that controls a trade-off between an expected trajectory length and an accuracy of matching the reward distribution. As in (4), reward matching is enforced by substituting Fθ(sf)PB(x | sf, θ) = R(x). Note that the DB loss is defined on individual transitions, and during training, it is optimized across transitions collected by a training policy. A standard choice is to optimize it over transitions from trajectories sampled using PF, yet the training policy can be chosen differently, or, in RL terms, training can be done in an on-policy or off-policy fashion, see (Tiapkin et al., 2024). Note that different states s might appear with different frequencies in the loss depending on a training policy, which can lead to a bias in the optimization problem (11). We discuss this phenomenon in more detail, as well as potential ways to mitigate it, in Appendix B.1. Finally, it is important to mention that flow-based regularizers in the non-acyclic case were already proposed in Theorem 1 of (Brunswic et al., 2024), but only for the stable loss setup. Moreover, they were introduced in order to find an acyclic flow. Our paper further explores and sheds new light on this phenomenon, showing that training can be carried out even when an unstable loss is utilized with regularization. Moreover, when the total flow is minimized, one can ensure the smallest possible expected trajectory length. It is also worth pointing out that the idea of introducing a constrained optimization problem to accommodate for cycles in GFlow Nets was mentioned in (Deleu, 2025). 3.6. 
Connections with Entropy-Regularized RL A recent line of works (Tiapkin et al., 2024; Deleu et al., 2024) studied connections between GFlow Nets and RL, showing that the GFlow Net learning problem is equivalent to an entropy-regularized RL (Neu et al., 2017; Geist et al., 2019) problem in an appropriately constructed deterministic MDP, given that the backward policy is fixed. We show that the same result holds for non-acyclic GFlow Nets as well. Let G be a graph of a non-acyclic GFlow Net, R a GFlow Net reward, and PB > 0 a fixed backward policy that satisfies the reward matching condition. Let F be the flow induced by PB with F(sf) = Z, and PF be a unique forward policy corresponding to PB (see Proposition 3.8). Define a deterministic MDP MG induced by a graph G, where the state space S coincides with vertices of G, the action space As for each state s corresponds to out(s), and the transition kernel is defined as transition in the graph P(s | s, a) = I{a = s }, a out(s). We use no discounting (γ = 1.0) and set RL rewards for terminating transitions r(x, sf) = log R(x), and for all other transitions r(s, s ) = log PB(s | s ). Then, the following statement holds. Theorem 3.13 (Generalization of Theorem 1 (Tiapkin et al., 2024)). The optimal policy π λ=1(s | s) for the entropy- Revisiting Non-Acyclic GFlow Nets in Discrete Environments regularized MDP MG with coefficient λ = 1 is equal to PF(s | s) for all s S \ {sf}, s As. Moreover, regularized optimal value V λ=1(s) and Q-value Q λ=1(s, s ) coincide with log F(s) and log F(s s ) respectively for all s s E. The proof and all missing definitions are provided in Appendix A.7. Note that the proof of (Tiapkin et al., 2024) cannot be directly transferred to the non-acyclic setting since it is based on induction over the topological ordering of vertices of G, which exists only for acyclic graphs. 4. Experiments In addition to verifying our theoretical findings, one of the goals of our experimental evaluation is to examine the scaling hypothesis that we put out: Scaling hypothesis. When PB is trainable, the main factor that plays a crucial role in loss stability in practice, i.e., the controlled mean trajectory length of the trained non-acyclic GFlow Net, is the scale in which the error between flows is computed. Indeed, the standard DB loss (4) operates in log-flow scale log F, while standard SDB (5) operates in flow scale F. We hypothesize that using log-flow scale losses without regularization can lead to arbitrarily large trajectory length, while flow scale losses are biased towards solutions with smaller flows and thus do not suffer from this issue. In this section, we use DB or SDB to specify the utilized loss, log F or F to specify the flow scale used to compute the error, and use λ = C to specify the strength of the proposed state flow regularization (see Section 3.5). For example, (DB, log F) in the legend corresponds to (4), (SDB, F) corresponds to (5) and (DB, log F, λ = C) corresponds to (12). Detailed discussion on loss scaling and stability is provided in Appendix B.2. 4.1. Experimental Setting We consider two discrete environments for experimental evaluation: 1) a non-acyclic version of the hypergrid environment (Bengio et al., 2021) that was introduced in (Brunswic et al., 2024); 2) non-acyclic permutation generation environment from (Brunswic et al., 2024) with a harder variant of the reward function. Experimental details are presented in Appendix C. 
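For concreteness, a minimal PyTorch sketch of the per-transition losses compared below is given next. The function names, the log-flow/log-probability parameterization, and the default values of eps, eta, and lam are our own illustrative choices, and the form of the stable loss follows our reading of (4), (5), and (12); the paper's released code may differ in details.

```python
import torch

# Hedged sketches of the per-transition losses compared in Section 4.
# log_F_s, log_F_next are log F_theta(s), log F_theta(s'); log_pf, log_pb are
# log PF(s'|s, theta), log PB(s|s', theta).  For a terminating transition x -> sf,
# reward matching is enforced by substituting log F_theta(sf) + log PB(x|sf, theta)
# = log R(x) into the same expressions.

def db_logF(log_F_s, log_pf, log_F_next, log_pb, lam=0.0):
    """(DB, log F[, lambda]): squared error in log-flow scale, as in (4)/(12),
    optionally with the state flow regularizer lambda * F_theta(s)."""
    delta = (log_F_s + log_pf) - (log_F_next + log_pb)
    return delta ** 2 + lam * torch.exp(log_F_s)

def sdb_F(log_F_s, log_pf, log_F_next, log_pb, eps=1.0, eta=1.0):
    """(SDB, F): stable-DB-style loss of Brunswic et al. (2024), our reading of (5);
    the error Delta = F_theta(s)PF(s'|s) - F_theta(s')PB(s|s') is in flow scale."""
    delta = torch.exp(log_F_s + log_pf) - torch.exp(log_F_next + log_pb)
    return torch.log1p((delta / eps) ** 2) * (1.0 + eta * torch.exp(log_F_s))

# usage on a dummy transition
lf_s, lf_n = torch.tensor(1.3), torch.tensor(1.0)
lpf, lpb = torch.tensor(-0.7), torch.tensor(-0.4)
print(db_logF(lf_s, lpf, lf_n, lpb, lam=1e-3), sdb_F(lf_s, lpf, lf_n, lpb))
```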
Mean sample reward was used as a metric in (Brunswic et al., 2024), with higher values considered better. However, we point out that this does not always represent sampling accuracy from the reward distribution R/Z. Indeed, a model that learned to sample only from the highest-reward mode still achieves a high average reward despite resulting in mode collapse. For instance, recent works argue that measuring the deviation of mean sample reward from the true expected reward Σ_{x∈X} R(x) · R(x)/Z results in a better metric; see, e.g., (Shen et al., 2023) for detailed motivation. In addition, we employ other metrics to track sampling accuracy depending on the environment, which we discuss in detail below.

In both environments, we consider two settings: training with a fixed backward policy PB that is almost uniform and using a trainable PB. In the second case, the initial log flow log Fθ(s0) = log Zθ is also being learned. Thus, we can examine its convergence to the logarithm of the true normalizing constant log Z. See Appendix B.3 for details.

Figure 1. Comparison of non-acyclic GFlow Net training losses on a small hypergrid environment. We use DB or SDB to specify the utilized loss, log F or F to specify the flow scale used to compute the error in the loss, and λ = C to specify the usage of the proposed state flow regularization. Top: evolution of L1 distance between an empirical distribution of samples and the target distribution. Bottom: evolution of mean length of sampled trajectories.

4.2. Hypergrids

We start with non-acyclic hypergrid environments (Brunswic et al., 2024). These environments are small enough that the normalizing constant Z can be computed exactly, and the trained sampler can be efficiently evaluated against the exact reward distribution. States are points with integer coordinates s ∈ {0, . . . , H−1}^D inside a D-dimensional hypercube with side length H, plus two auxiliary states s0 and sf. Possible transitions correspond to increasing or decreasing any coordinate by 1 without exiting the grid. Moreover, each state has a terminating transition s → sf, thus X = S \ {s0, sf}. The reward function has modes near the grid corners, separated by wide troughs with very small rewards. To measure sampling accuracy, the total variation distance is computed between the true reward distribution R(x)/Z and an empirical distribution of the last 2·10^5 samples seen in training, which coincides with 1/2 of the L1 distance on discrete domains.

Figure 2. Comparison of non-acyclic GFlow Net training losses on a larger hypergrid environment. We use DB or SDB to specify the utilized loss, log F or F to specify the flow scale used to compute the error in the loss, and λ = C to specify the usage of the proposed state flow regularization. Left: evolution of L1 distance between an empirical distribution of samples and the target distribution. Middle: evolution of mean length of sampled trajectories. Right: evolution of the trained initial log flow log Zθ.

Table 1. Comparison on the permutation environment. C(k) L1 is the L1 distance between the true and empirical distribution of fixed point probabilities C(k), R is the relative error of mean reward proposed in (Shen et al., 2023), log Z is |log Zθ − log Z|. Mean and std values are computed over 3 random seeds. Blue indicates the best metric, red indicates the smallest expected trajectory length.

| Loss | C(k) L1 (n=8) | R (n=8) | log Z (n=8) | E[nτ] (n=8) | C(k) L1 (n=20) | R (n=20) | log Z (n=20) | E[nτ] (n=20) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| DB, F | 0.215 ± 0.198 | 0.214 ± 0.086 | 0.814 ± 0.826 | 2.43 ± 0.28 | 0.453 ± 0.002 | 0.343 ± 0.000 | 42.98 ± 0.000 | 2.00 ± 0.00 |
| SDB, F | 0.031 ± 0.012 | 0.046 ± 0.023 | 0.074 ± 0.025 | 3.32 ± 0.15 | 0.452 ± 0.001 | 0.343 ± 0.000 | 42.98 ± 0.000 | 2.01 ± 0.00 |
| DB, log F, λ = 10⁻³ | 0.036 ± 0.015 | 0.056 ± 0.024 | 0.018 ± 0.010 | 2.80 ± 0.04 | 0.041 ± 0.002 | 0.064 ± 0.000 | 0.023 ± 0.005 | 3.23 ± 0.00 |
| SDB, log F, λ = 10⁻³ | 0.037 ± 0.013 | 0.056 ± 0.019 | 0.020 ± 0.015 | 2.79 ± 0.04 | 0.041 ± 0.002 | 0.064 ± 0.000 | 0.026 ± 0.003 | 3.22 ± 0.00 |
| DB, log F, λ = 10⁻⁵ | 0.005 ± 0.001 | 0.001 ± 0.000 | 0.005 ± 0.004 | 4.31 ± 0.05 | 0.017 ± 0.002 | 0.035 ± 0.002 | 0.003 ± 0.003 | 7.55 ± 0.50 |
| SDB, log F, λ = 10⁻⁵ | 0.005 ± 0.001 | 0.002 ± 0.000 | 0.006 ± 0.006 | 4.36 ± 0.09 | 0.014 ± 0.001 | 0.025 ± 0.001 | 0.005 ± 0.005 | 7.31 ± 0.07 |

We begin our analysis with a 7×7 grid to study the effects of learning under a fixed backward policy compared to a trainable backward policy. Since the environment is small, it is possible to find the flows induced by the fixed PB exactly, thus also its expected trajectory length, see Appendix B.4. Figure 1 presents the results. First, we note that both (DB, log F) and (SDB, F) with fixed PB converge to the true expected trajectory length induced by the fixed backward policy, which is in line with Corollary 3.11. However, for all losses, using a trainable PB allows us to find a solution with a smaller trajectory length. In addition, we observe that using a loss in F scale results in slower convergence and a slight bias in the learned forward policy compared to the case of log F scale, for both fixed and learned PB. Finally, an interesting note is that using an unstable DB loss in log F scale without state flow regularization can still result in a small expected trajectory length, as we see in this experiment. However, we further show that this is not the case for a larger environment.

Next, we consider a larger 20×20×20×20 hypergrid. The expected trajectory length induced by the chosen fixed backward policy is several orders of magnitude larger than for the smaller grid, making this approach impractical. While one can try to manually find a fixed PB with a smaller expected trajectory length, this is generally a challenging problem, thus we consider only the setting of trainable PB here. Our findings are presented in Figure 2. Similarly to the 7×7 grid, we find that learning in F scale results in a biased policy both for DB and SDB, and this bias is noticeably larger than in the smaller grid. In log F scale, both DB and SDB employed with state flow regularization learn to correctly sample from the reward distribution. While all methods converge to similar expected trajectory lengths, F scale losses have smaller E[nτ] in the middle of the training even when employed without regularization, which supports our scaling hypothesis.
In addition, Figure 4 in Appendix D shows that for both losses in log F scale, the mean length of sampled trajectories tends to infinity when the training is done without state flow regularization. Finally, we note that log F losses correctly learn the true normalizing constant Z, while F losses perform worse.

4.3. Permutations

Next, we consider the environment corresponding to the Cayley graph of the symmetric group Sn (the group of permutations of n elements {1, 2, . . . , n}) from (Brunswic et al., 2024). Each state s ∈ S \ {s0, sf} is a permutation of fixed length (s(1), . . . , s(n)), and there are n − 1 possible transitions that correspond to swapping a pair of adjacent elements s(k) and s(k+1), plus a transition that corresponds to a circular shift of the permutation to the right, (s(n), s(1), . . . , s(n−1)). In addition, each state has a terminating transition s → sf. The GFlow Net reward utilized in the experiments of (Brunswic et al., 2024) is I[s(1) = 1]. We argue that this results in a fairly simple task, and a trivial forward policy exists that just applies the circular shift until 1 is in the first position. We opt for using a more complex reward distribution in our experiments and define the GFlow Net reward in terms of the number of fixed points in a permutation: R(s) = exp( (1/2) Σ_{k=1}^{n} I{s(k) = k} ).

Since with the growth of n the number of states n! quickly becomes too large to compute the total variation distance as was done for the hypergrid, we track convergence of a number of statistics to their true respective values. Firstly, we compute the relative error between the mean reward of GFlow Net samples and the true expected reward, as proposed in (Shen et al., 2023). Secondly, denote C(k) as the probability that a permutation sampled from the reward distribution has k fixed points. We compute the L1 error between the vector (C(0), C(1), . . . , C(n)) and its empirical estimate over the last 10^5 samples seen in training. Finally, we track the convergence of the trained log Zθ to the true value of log Z. In Appendix C.3.1, we show how true reference values of these quantities can be efficiently computed.

Table 1 presents the results for n = 8 and n = 20. Here, PB is trained in all cases. While for n = 8 the environment is still relatively small, n = 20 results in a more challenging environment with 2.4 × 10^18 states, thus the trained neural network needs to generalize to states unseen during training in order to match the reward distribution. Firstly, we note that while F scale losses can learn the reward distribution to some capacity for n = 8, they fail for n = 20. However, in all cases, they converge to small E[nτ], supporting our scaling hypothesis. On the other hand, we find that GFlow Net training with log F losses and state flow regularization converges to low values of reward distribution approximation errors for both n = 8 and n = 20. In addition, we see that using a smaller regularization coefficient λ on the one hand results in a model with a larger expected trajectory length, but on the other hand results in a model that better matches the reward distribution. Finally, we perform the same experiment as for hypergrids (Figure 1) with a fixed PB compared to a trainable PB on small permutations of length n = 4, and make similar observations to the ones presented in Section 4.2. The results are presented in Figure 6 in Appendix D.

4.4. Discussion

The key observations from our experimental evaluation are:
1.
Learning with a fixed PB is possible without stable losses and regularization, however, manually picking PB with small E[nτ] is challenging; 2. When PB is trained, our results empirically support the scaling hypothesis, showing that even the standard DB in F scale is stable; however, non-acyclic GFlow Nets trained with F scale losses often fail to accurately match the reward distribution; 3. Both DB and SDB in log F scale result in better matching the reward distribution but need to be utilized with state flow regularization to ensure small expected trajectory length E[nτ]. 5. Conclusion In our paper we extended the theoretical framework of GFlow Nets to encompass non-acyclic discrete environments, revisiting and simplifying the previous constructions by (Brunswic et al., 2024). In addition, we provided a number of theoretical insights regarding backward policies and the nature of flows in non-acyclic GFlow Nets, generalized known connections between GFlow Nets training and entropy-regularized RL to this setting, and experimentally re-examined the importance of the concept of loss stability proposed in (Brunswic et al., 2024). Future work could explore applying other losses from acyclic GFlow Net literature (Madan et al., 2023; Silva et al., 2024; Hu et al., 2025) to the non-acyclic setting. Based on Theorem 3.13, another promising direction is to apply known RL techniques and algorithms to GFlow Nets in the non-acyclic case, following their success for acyclic GFlow Nets (Tiapkin et al., 2024; Mohammadpour et al., 2024; Lau et al., 2024; Morozov et al., 2024). Finally, environments where all states are terminal, i.e., have a transition into sf, naturally arise in the non-acyclic case. Then, special modifications can be introduced to improve the propagation of the reward signal during training (Deleu et al., 2022; Pan et al., 2023; Jang et al., 2024b). Revisiting Non-Acyclic GFlow Nets in Discrete Environments Acknowledgements We would like to thank Leo Maxime Brunswic for the helpful discussion and providing implementation details of the paper (Brunswic et al., 2024). This work was supported by the Ministry of Economic Development of the Russian Federation (code 25-139-66879-1-0003). This research was supported in part through computational resources of HPC facilities at HSE University (Kostenetskiy et al., 2021). Impact Statement This paper presents work whose goal is to advance the field of Machine Learning. There are many potential societal consequences of our work, none which we feel must be specifically highlighted here. Bengio, E., Jain, M., Korablyov, M., Precup, D., and Bengio, Y. Flow network based generative models for noniterative diverse candidate generation. Advances in Neural Information Processing Systems, 34:27381 27394, 2021. Bengio, Y., Lahlou, S., Deleu, T., Hu, E. J., Tiwari, M., and Bengio, E. Gflownet foundations. Journal of Machine Learning Research, 24(210):1 55, 2023. Bertsekas, D. Dynamic programming and optimal control: Volume I, volume 4. Athena scientific, 2012. Brunswic, L., Li, Y., Xu, Y., Feng, Y., Jui, S., and Ma, L. A theory of non-acyclic generative flow networks. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, pp. 11124 11131, 2024. Comtet, L. Advanced Combinatorics: The Art of Finite and Infinite Expansions. Reidel, 1974. Cretu, M., Harris, C., Igashov, I., Schneuing, A., Segler, M., Correia, B., Roy, J., Bengio, E., and Lio, P. Syn Flow Net: Design of diverse and novel molecules with synthesis constraints. 
In The Thirteenth International Conference on Learning Representations, 2025. Deleu, T. Generative flow networks: Theory and applications to structure learning. ar Xiv preprint ar Xiv:2501.05498, 2025. Deleu, T., G ois, A., Emezue, C., Rankawat, M., Lacoste Julien, S., Bauer, S., and Bengio, Y. Bayesian structure learning with generative flow networks. In Uncertainty in Artificial Intelligence, pp. 518 528. PMLR, 2022. Deleu, T., Nouri, P., Malkin, N., Precup, D., and Bengio, Y. Discrete probabilistic inference as control in multi-path environments. In The 40th Conference on Uncertainty in Artificial Intelligence, 2024. Douc, R., Moulines, E., Priouret, P., and Soulier, P. Markov chains. Springer Series in Operations Research and Financial Engineering. Springer, 2018. ISBN 978-3-31997703-4. Geist, M., Scherrer, B., and Pietquin, O. A theory of regularized markov decision processes. In International Conference on Machine Learning, pp. 2160 2169. PMLR, 2019. Gritsaev, T., Morozov, N., Samsonov, S., and Tiapkin, D. Optimizing backward policies in GFlownets via trajectory likelihood maximization. In The Thirteenth International Conference on Learning Representations, 2025. He, H., Bengio, E., Cai, Q., and Pan, L. Random policy evaluation uncovers policies of generative flow networks. ar Xiv preprint ar Xiv:2406.02213, 2024. Hu, E. J., Jain, M., Elmoznino, E., Kaddar, Y., Lajoie, G., Bengio, Y., and Malkin, N. Amortizing intractable inference in large language models. In The Twelfth International Conference on Learning Representations, 2023. Hu, R., Zhang, Y., Li, Z., and Huang, L. Beyond squared error: Exploring loss design for enhanced training of generative flow networks. In The Thirteenth International Conference on Learning Representations, 2025. Jain, M., Bengio, E., Hernandez-Garcia, A., Rector-Brooks, J., Dossou, B. F., Ekbote, C. A., Fu, J., Zhang, T., Kilgour, M., Zhang, D., et al. Biological sequence design with gflownets. In International Conference on Machine Learning, pp. 9786 9801. PMLR, 2022. Jang, H., Jang, Y., Kim, M., Park, J., and Ahn, S. Pessimistic backward policy for GFlow Nets. In Advances in Neural Information Processing Systems, volume 37, pp. 107087 107111, 2024a. Jang, H., Kim, M., and Ahn, S. Learning energy decompositions for partial inference in GFlownets. In The Twelfth International Conference on Learning Representations, 2024b. Kemeny, J. G. and Snell, J. L. Finite markov chains, volume 26. van Nostrand Princeton, NJ, 1969. Kim, H., Kim, M., Yun, T., Choi, S., Bengio, E., Hern andez Garc ıa, A., and Park, J. Improved off-policy reinforcement learning in biological sequence design. ar Xiv preprint ar Xiv:2410.04461, 2024. Kim, M., Choi, S., Kim, H., Son, J., Park, J., and Bengio, Y. Ant colony sampling with GFlow Nets for combinatorial Revisiting Non-Acyclic GFlow Nets in Discrete Environments optimization. In International Conference on Artificial Intelligence and Statistics. PMLR, 2025. Kostenetskiy, P., Chulkevich, R., and Kozyrev, V. Hpc resources of the higher school of economics. In Journal of Physics: Conference Series, volume 1740, pp. 012050. IOP Publishing, 2021. Koziarski, M., Rekesh, A., Shevchuk, D., van der Sloot, A., Gai nski, P., Bengio, Y., Liu, C., Tyers, M., and Batey, R. Rgfn: Synthesizable molecular generation using gflownets. Advances in Neural Information Processing Systems, 37:46908 46955, 2024. Lahlou, S., Deleu, T., Lemos, P., Zhang, D., Volokhova, A., Hern andez-Garcıa, A., Ezzine, L. N., Bengio, Y., and Malkin, N. 
A theory of continuous generative flow networks. In International Conference on Machine Learning, pp. 18269 18300. PMLR, 2023. Lau, E., Lu, S., Pan, L., Precup, D., and Bengio, E. Qgfn: Controllable greediness with action values. Advances in neural information processing systems, 37:81645 81676, 2024. Lee, S., Kim, M., Cherif, L., Dobre, D., Lee, J., Hwang, S. J., Kawaguchi, K., Gidel, G., Bengio, Y., Malkin, N., and Jain, M. Learning diverse attacks on large language models for robust red-teaming and safety tuning. In The Thirteenth International Conference on Learning Representations, 2025. Madan, K., Rector-Brooks, J., Korablyov, M., Bengio, E., Jain, M., Nica, A. C., Bosc, T., Bengio, Y., and Malkin, N. Learning gflownets from partial episodes for improved convergence and stability. In International Conference on Machine Learning, pp. 23467 23483. PMLR, 2023. Malkin, N., Jain, M., Bengio, E., Sun, C., and Bengio, Y. Trajectory balance: Improved credit assignment in gflownets. Advances in Neural Information Processing Systems, 35:5955 5967, 2022. Malkin, N., Lahlou, S., Deleu, T., Ji, X., Hu, E. J., Everett, K. E., Zhang, D., and Bengio, Y. GFlownets and variational inference. In The Eleventh International Conference on Learning Representations, 2023. Mohammadpour, S., Bengio, E., Frejinger, E., and Bacon, P.-L. Maximum entropy gflownets with soft q-learning. In International Conference on Artificial Intelligence and Statistics, pp. 2593 2601. PMLR, 2024. Morozov, N., Tiapkin, D., Samsonov, S., Naumov, A., and Vetrov, D. Improving gflownets with monte carlo tree search. ar Xiv preprint ar Xiv:2406.13655, 2024. Neu, G., Jonsson, A., and G omez, V. A unified view of entropy-regularized markov decision processes. ar Xiv preprint ar Xiv:1705.07798, 2017. Pan, L., Malkin, N., Zhang, D., and Bengio, Y. Better training of gflownets with local credit and incomplete trajectories. In International Conference on Machine Learning, pp. 26878 26890. PMLR, 2023. Shen, M. W., Bengio, E., Hajiramezanali, E., Loukas, A., Cho, K., and Biancalani, T. Towards understanding and improving gflownet training. In Proceedings of the 40th International Conference on Machine Learning, ICML 23. JMLR.org, 2023. Shen, T., Seo, S., Lee, G., Pandey, M., Smith, J. R., Cherkasov, A., Kim, W. Y., and Ester, M. Taco GFN: Target-conditioned GFlow Net for structure-based drug design. Transactions on Machine Learning Research, 2024. ISSN 2835-8856. Silva, T., de Souza da Silva, E., and Mesquita, D. On divergence measures for training gflownets. Advances in Neural Information Processing Systems, 37:75883 75913, 2024. Tiapkin, D., Morozov, N., Naumov, A., and Vetrov, D. P. Generative flow networks as entropy-regularized rl. In International Conference on Artificial Intelligence and Statistics, pp. 4213 4221. PMLR, 2024. Uehara, M., Zhao, Y., Biancalani, T., and Levine, S. Understanding reinforcement learning-based fine-tuning of diffusion models: A tutorial and review. ar Xiv preprint ar Xiv:2407.13734, 2024. Venkatraman, S., Jain, M., Scimeca, L., Kim, M., Sendera, M., Hasan, M., Rowe, L., Mittal, S., Lemos, P., Bengio, E., Adam, A., Rector-Brooks, J., Bengio, Y., Berseth, G., and Malkin, N. Amortizing intractable inference in diffusion models for vision, language, and control. Neural Information Processing Systems (Neur IPS), 2024. Zhang, D., Dai, H., Malkin, N., Courville, A. C., Bengio, Y., and Pan, L. Let the flows tell: Solving graph combinatorial problems with gflownets. 
In Advances in Neural Information Processing Systems, volume 36, pp. 11952 11969, 2023a. Zhang, D., Zhang, Y., Gu, J., Zhang, R., Susskind, J. M., Jaitly, N., and Zhai, S. Improving gflownets for textto-image diffusion alignment. Transactions on Machine Learning Research, 2025. ISSN 2835-8856. Zhang, D. W., Rainone, C., Peschl, M., and Bondesan, R. Robust scheduling with gflownets. In The Eleventh International Conference on Learning Representations, 2023b. Revisiting Non-Acyclic GFlow Nets in Discrete Environments A.1. Proof of Lemma 3.4 Consider a random walk on G with reversed edges transition probabilities given by backward policy. Specifically, we define a Markov chain {Xn} n=0, such that X0 = sf a.s. and P[Xi = s | Xi 1 = s ] = PB(s | s ). Let also P[Xi = s0 | Xi 1 = s0] = 1, i.e., s0 is an absorbing state. We want to show that 1. The random walk terminates at s0 with probability 1: P[ n : Xn = s0] = 1; 2. The expected length of a walk is finite: E[nτ] = E[P n=0 I{Xi = s0}] < . In particular, the first statement implies that P( ) is a correct probability measure over finite trajectories since for any τ = (s0, s1 . . . , snτ , sf) it holds P[(Xnτ +1, . . . , X0) = τ] = t=0 PB(st | st+1) = P(τ) , and we have X τ T P[(Xnτ +1, . . . , X0) = τ] = P[ τ T : (Xnτ +1, . . . , X0) = τ] = P[ n : Xn = s0] , since the events {(Xnτ +1, . . . , X0) = τ} do not intersect for different τ. First consider any intermediate state s S \ {s0, sf} and define a Markov chain {Yn} n=0 with the same transition probabilities as {Xn} n=0, with Y0 = s a.s. We define ps P[ n > 0 : Yn = s], i.e. the probability that {Yn} returns to s. First, notice that ps < 1. Indeed, based on our assumptions on G, there exists at least one path τ from s0 to s, and furthermore, there exists such a path without cycles. In this case, we have P[(Ynτ +1, . . . , Y0) = τ] = t=0 PB(st | st+1) > 0 by the condition on PB(s | s ) > 0 for s s E. Notice that {(Ynτ +1, . . . , Y0) = τ} { n > 0 : Yn = s} = , since if the trajectory of the random walk has already reached s0, it will never return to s. Thus, ps = P[ n > 0 : Yn = s] < 1. Next, for each state s S \ {s0, sf} we define Ns P i=0 I{Xi = s} as the number of visits of a state s encountered by the original process. Also, let us define N s P i=0 I{Yi = s} as the number of visits of a state s during the backward random walk that starts at s. We notice that E[Ns] = P( k > 0 : Xk = s)E[N s] E[N s] , where the first equation is due to the strong Markov property. At the same time, we have P[N s > k] = P[ n1, . . . , nk > 0 : Ynj = s] and, by Markov property, we have P[N s > k] = pk s. Thus k=0 P[N s > k] = k=0 pk s = 1 1 ps < + . Finally, we have k=0 I{Xk = s0} s S\{s0,sf } k=0 I{Xk = s} s S\{s0,sf } E[Ns] X s S\{s0,sf } E[N s] < + . The first statement of the Lemma directly follows from the finiteness of the expected length of the walk, because otherwise it has an infinite length with non-zero probability, leading to a contradiction. Revisiting Non-Acyclic GFlow Nets in Discrete Environments A.2. Proof of Proposition 3.6 First, we prove the flow matching conditions. We have F(s s ) = F(sf) Eτ t=0 I{st = s, st+1 = s } t=0 I{st = s, st+1 = s } | nτ F(s) = F(sf) Eτ t=0 I{st = s} t=0 I{st = s} | nτ Next, note that the following equations hold for any trajectory τ and any s S \ {s0, sf}: I{st = s} = X s in(s) I{st 1 = s , st = s} = X s out(s) I{st = s, st+1 = s }. Then for s S \ {s0}: F(s) = F(sf)Enτ s in(s) I{st 1 = s , st = s} | nτ s in(s) F(sf)Enτ t=1 I{st 1 = s , st = s} | nτ s in(s) F(s s). 
Similarly for any s S \ {sf}: F(s) = F(sf)Enτ s out(s) I{st = s, st+1 = s } | nτ s out(s) F(sf)Enτ t=0 I{st = s, st+1 = s } | nτ s out(s) F(s s ). Next, we prove the key relation F(s s ) = F(s )PB(s | s ). We have F(s s ) = F(sf)Enτ t=0 I{st = s, st+1 = s } | nτ t=0 Eτ[I{st = s, st+1 = s } | nτ] t=0 P(st = s, st+1 = s | nτ) (a) = F(sf)Enτ t=0 P(st+1 = s | nτ)P(st = s | st+1 = s , nτ) t=0 P(st+1 = s | nτ) PB(s | s ) = F(s )PB(s | s ). Here in (a) we used the Markov property of trajectory distribution. Finally, by definition, F(s0) is equal to, F(sf) multiplied by the expected number of times τ P visits s0, where the latter is always 1, so we have F(s0) = F(sf). A.3. Proof of Proposition 3.7 Since F(s s ) satisfies the flow matching conditions, we define s out(s) F(s s ) = X s in(s) F(s s). Revisiting Non-Acyclic GFlow Nets in Discrete Environments Next, take PB(s | s ) = F(s s )/F(s ). Let us denote ˆF to be flows from Definition 3.5 that are induced by PB and F(sf) (which correspond to expected number of visits with respect to the trajectory distribution P induced by PB). We aim to prove that F and ˆF coincide. By Proposition 3.6 and definition of PB, we have ˆF(s s ) = ˆF(s )PB(s | s ) = ˆF(s ) F(s )F(s s ) = C(s )F(s s ) , where we denote C(s) = ˆF(s)/F(s). In addition, by Proposition 3.6, ˆF satisfies the flow matching conditions, thus s out(s) ˆF(s s ) = X s in(s) ˆF(s s). Combining these statements, for any s S \ {sf} we have ˆF(s) = C(s)F(s) = X s out(s) C(s )F(s s ). The first equation is by definition of C(s) and the second equation is due to the flow matching conditions. Thus we have a system of linear equations with respect to C(s): s S \ {sf}, X s out(s) C(s )F(s s ) C(s)F(s) = 0. In addition, ˆF(sf) is equal to, by definition, F(sf) multiplied by the expected number of times τ P visits sf, where the latter is 1, so we have ˆF(sf) = F(sf), thus an additional equation is C(sf) = 1. In total, we have |S| variables and |S| equations, and are interested in strictly positive solutions. Firstly, there exists a trivial solution C(s) = 1 for each s S, which is an only constant solution since C(sf) = 1. Next, suppose there exists a non-constant solution C (s). Denote Smax = argmaxs S C (s), which will be a proper subset of S. Let us consider two cases. First, suppose sf Smax. Let τ = (s0 s1 . . . snτ sf) be any trajectory that visits some state in Smax. Then there exists an index t nτ such that st Smax and st+1 Smax. Then we have s out(st) C(s )F(st s ) C(st)F(st) < s out(st) C(st)F(st s ) C(st)F(st) = s out(st) F(st s ) The inequality is due to three facts: (i) C(s) > 0 s S, (ii) st Smax, and thus C(st) C(s ) for any s S, and (iii) the inequality is strict for at least one edge st st+1 such that C (st+1) < C (st), and it implies a contradiction. Second, suppose sf Smax. Then, denote Smin = argmins S C (s), which will be a proper subset of S. Similarly to the previous case, let τ be any trajectory that visits some state in Smin. Then there exists an index t nτ such that st Smin and st+1 Smin. Then we have P s out(st) C(s )F(st s ) C(st)F(st) > s out(st) C(st)F(st s ) C(st)F(st) = s out(st) F(st s ) Thus, in this case, there is also a contradiction. Therefore C(s) = 1 is a unique solution, meaning that ˆF(s) = F(s). Finally for any s s E ˆF(s s ) = C(s )F(s s ) = F(s s ). F and ˆF coincide, thus the proposition is proven. A.4. Proof of Proposition 3.8 Let us proof existence and uniqueness of a corresponding forward policy. Let F be the flow induced by the backward policy (3.5). 
A.4. Proof of Proposition 3.8

Let us prove the existence and uniqueness of a corresponding forward policy. Let $F$ be the flow induced by the backward policy (Definition 3.5).

Uniqueness. Suppose that such a forward policy $P_F$ exists; then the probability distributions over $\mathcal{T}$ induced by $P_B$ and $P_F$ coincide. Then
$$F(s \to s') = F(s_f)\,\mathbb{E}_{n_\tau}\Big[\sum_{t=0}^{n_\tau} \mathbb{E}_\tau\big[\mathbb{I}\{s_t = s,\, s_{t+1} = s'\} \mid n_\tau\big]\Big] = F(s_f)\,\mathbb{E}_{n_\tau}\Big[\sum_{t=0}^{n_\tau} \mathbb{P}(s_t = s,\, s_{t+1} = s' \mid n_\tau)\Big]$$
$$= F(s_f)\,\mathbb{E}_{n_\tau}\Big[\sum_{t=0}^{n_\tau} \mathbb{P}(s_t = s \mid n_\tau)\,\mathbb{P}(s_{t+1} = s' \mid s_t = s,\, n_\tau)\Big] = F(s_f)\,\mathbb{E}_{n_\tau}\Big[\sum_{t=0}^{n_\tau} \mathbb{P}(s_t = s \mid n_\tau)\Big]\, P_F(s' \mid s) = F(s)\,P_F(s' \mid s).$$
Then we have $F(s')\,P_B(s \mid s') = F(s)\,P_F(s' \mid s)$ and $P_F(s' \mid s) = F(s')\,P_B(s \mid s')/F(s)$, which finishes the proof of uniqueness.

Existence. Take
$$P_F(s' \mid s) \triangleq \frac{F(s')\,P_B(s \mid s')}{F(s)} = \frac{F(s \to s')}{F(s)}.$$
This is always a valid probability distribution since $F(s) = \sum_{s' \in \mathrm{out}(s)} F(s \to s')$. Next, for any $\tau \in \mathcal{T}$ we have
$$\mathbb{P}(\tau) = \prod_{t=0}^{n_\tau} P_B(s_t \mid s_{t+1}) = \prod_{t=0}^{n_\tau} \frac{F(s_t \to s_{t+1})}{F(s_{t+1})},$$
where the first equality is due to Proposition 3.6. Next, note that
$$F(s_0) = F(s_f)\,\mathbb{E}_\tau\Big[\sum_{t=0}^{n_\tau} \mathbb{I}\{s_t = s_0\}\Big] = F(s_f) \cdot 1 = F(s_f).$$
Then
$$\prod_{t=0}^{n_\tau} \frac{F(s_t \to s_{t+1})}{F(s_{t+1})} = \frac{F(s_0)}{F(s_f)} \prod_{t=0}^{n_\tau} \frac{F(s_t \to s_{t+1})}{F(s_t)} = \prod_{t=0}^{n_\tau} \frac{F(s_t \to s_{t+1})}{F(s_t)} = \prod_{t=0}^{n_\tau} P_F(s_{t+1} \mid s_t).$$
Thus the existence is proven. Finally, the detailed balance conditions follow from the proof of uniqueness.

A.5. Proof of Proposition 3.9

Consider an edge function $F(s \to s') \triangleq F(s)\,P_F(s' \mid s)$. It is positive since $F(s) > 0$ and $P_F(s' \mid s) > 0$ by the statement of the proposition. Since $P_F(\cdot \mid s)$ is a valid probability distribution over $\mathrm{out}(s)$, we have
$$\sum_{s' \in \mathrm{out}(s)} F(s \to s') = \sum_{s' \in \mathrm{out}(s)} F(s)\,P_F(s' \mid s) = F(s) \sum_{s' \in \mathrm{out}(s)} P_F(s' \mid s) = F(s).$$
Similarly, since $P_B(\cdot \mid s)$ is a valid probability distribution over $\mathrm{in}(s)$, and $F$, $P_F$ and $P_B$ satisfy the detailed balance conditions, we have
$$\sum_{s' \in \mathrm{in}(s)} F(s' \to s) = \sum_{s' \in \mathrm{in}(s)} F(s')\,P_F(s \mid s') = \sum_{s' \in \mathrm{in}(s)} F(s)\,P_B(s' \mid s) = F(s) \sum_{s' \in \mathrm{in}(s)} P_B(s' \mid s) = F(s).$$
Thus $F$ satisfies the flow matching conditions. In addition,
$$P_B(s \mid s') = \frac{F(s)\,P_F(s' \mid s)}{F(s')} = \frac{F(s \to s')}{F(s')} = \frac{F(s \to s')}{\sum_{s'' \in \mathrm{in}(s')} F(s'' \to s')}.$$
Thus, applying Proposition 3.7 to $F$, we get that it is the edge flow induced by $P_B$, and hence $F$ is also the state flow induced by $P_B$ and $F(s_f)$. Next, consider any trajectory $\tau = (s_0, s_1, \ldots, s_{n_\tau}, s_f) \in \mathcal{T}$. By the detailed balance conditions we have
$$\prod_{t=0}^{n_\tau} P_B(s_t \mid s_{t+1}) = \prod_{t=0}^{n_\tau} \frac{F(s_t)\,P_F(s_{t+1} \mid s_t)}{F(s_{t+1})} = \frac{F(s_0)}{F(s_f)} \prod_{t=0}^{n_\tau} P_F(s_{t+1} \mid s_t) = \prod_{t=0}^{n_\tau} P_F(s_{t+1} \mid s_t).$$
The final equality is due to the fact that the state flow $F$ is induced by $P_B$ and $F(s_f)$, so we have $F(s_f) = F(s_0)$ by Proposition 3.6. Thus the proposition is proven.

A.6. Proof of Proposition 3.12

We first note that not including $s_0$ and $s_f$ in the sum is just a matter of the definition of trajectory length presented in Section 2.1, where we do not count the first and the final state towards it. Using the fact that the length of a trajectory is the sum of the numbers of visits to each individual state in the graph, we obtain
$$\mathbb{E}_{\tau \sim \mathbb{P}}[n_\tau] = \mathbb{E}_{\tau}\Big[\sum_{s \in \mathcal{S}\setminus\{s_0,s_f\}} \sum_{t=0}^{n_\tau} \mathbb{I}\{s_t = s\}\Big] = \sum_{s \in \mathcal{S}\setminus\{s_0,s_f\}} \mathbb{E}_{\tau}\Big[\sum_{t=0}^{n_\tau} \mathbb{I}\{s_t = s\}\Big] = \sum_{s \in \mathcal{S}\setminus\{s_0,s_f\}} \frac{F(s)}{F(s_f)}.$$

A.7. Entropy-Regularized RL and Theorem 3.13

Background on Entropy-Regularized RL. Let $\mathcal{M}_{\mathcal{G}}$ be a deterministic MDP induced by a graph $\mathcal{G}$, with a state space $\mathcal{S}$ corresponding to the vertices of $\mathcal{G}$ and an action space $\mathcal{A}_s$ for each state $s$ corresponding to the outgoing edges of $s$ (associated with $\mathrm{out}(s)$), and let $\lambda > 0$ be a regularization coefficient. We define a policy $\pi$ as a mapping from each state $s \in \mathcal{S}$ to a probability measure $\pi(\cdot \mid s)$ over $\mathcal{A}_s$.
Then, for any policy $\pi$, we define the regularized value function for all $s \neq s_f$ as
$$V^\pi_\lambda(s) \triangleq \mathbb{E}_{\tau \sim \mathbb{P}^\pi_{\mathcal{T}}}\Big[\sum_{t=0}^{n_\tau} r(s_t, s_{t+1}) + \lambda\, \mathcal{H}(\pi(\cdot \mid s_t)) \,\Big|\, s_0 = s\Big]$$
and $V^\pi_\lambda(s_f) = 0$, where $\mathcal{H}$ is the Shannon entropy, $\mathbb{P}^\pi_{\mathcal{T}}$ is the trajectory distribution induced by following the policy $\pi$: $s_t \sim \pi(\cdot \mid s_{t-1})$ for all $t \ge 1$, the starting state $s_0$ is fixed to $s$ (not to be confused with the initial state in $\mathcal{G}$), and $n_\tau$ is the trajectory length defined as $n_\tau = \min\{k \ge 0 \mid s_{k+1} = s_f\}$. It is not immediately clear whether the value function is well-defined when no discounting is used ($\gamma = 1$). We call this problem a regularized shortest path problem, by analogy with the shortest path and stochastic shortest path problems (Bertsekas, 2012, Chapter 3). A policy $\pi^*$ is called optimal if it maximizes $V^\pi_\lambda(s_0)$.

Lemma A.1. Assume that (i) the graph $\mathcal{G}$ satisfies Assumption 3.1 and (ii) for any $s \in \mathcal{S}$, $s' \in \mathcal{A}_s$ it holds that $r(s, s') \le 0$ and $r(s, s') = 0$ if and only if $|\mathrm{in}(s')| = 1$. Also, assume that for any optimal policy $\pi^*$ it holds that $\mathbb{E}_{\pi^*}[n_\tau] < +\infty$. Then the regularized shortest path problem admits a unique solution, and the value of its solution satisfies the soft optimal Bellman equations
$$Q^*_\lambda(s, s') \triangleq r(s, s') + V^*_\lambda(s'), \qquad V^*_\lambda(s) = \lambda \log \sum_{s' \in \mathrm{out}(s)} \exp\Big(\frac{1}{\lambda} Q^*_\lambda(s, s')\Big),$$
where the optimal policy can be derived as $\pi^*(s' \mid s) \propto \exp\{\frac{1}{\lambda} Q^*_\lambda(s, s')\}$.

Proof. Let us define the number of visits to a vertex $s$ and to an edge $(s, s')$ of $\mathcal{G}$ on a given trajectory $\tau$ as $n_\tau(s) = \sum_{t=0}^{n_\tau} \mathbb{I}\{s_t = s\}$ and $n_\tau(s, s') = \sum_{t=0}^{n_\tau} \mathbb{I}\{s_t = s,\, s_{t+1} = s'\}$. By analogy with occupancy measures in RL, we employ the notation $d^\pi(s) \triangleq \mathbb{E}_\pi[n_\tau(s)]$ and $d^\pi(s, s') \triangleq \mathbb{E}_\pi[n_\tau(s, s')]$ for the expected numbers of visits. This definition also corresponds to the flow function in the reversed graph (see Definition 3.5) with the backward policy $\pi$. The condition on the expected trajectory length of optimal policies implies that we can consider only policies $\pi$ such that $d^\pi(s) < +\infty$ for any $s \in \mathcal{S} \setminus \{s_0, s_f\}$ and, as a result, $d^\pi(s, s') < +\infty$.

Next, we rewrite the value function in the initial state as follows:
$$V^\pi_\lambda(s_0) = \sum_{s \in \mathcal{S}} \sum_{s' \in \mathrm{out}(s)} d^\pi(s, s')\, r(s, s') + \lambda \sum_{s \in \mathcal{S}} d^\pi(s)\, \mathcal{H}(\pi(\cdot \mid s)).$$
Then, we notice that $d^\pi(s, s') = \pi(s' \mid s)\, d^\pi(s)$, thus we can rewrite the value in the following form:
$$V^\pi_\lambda(s_0) = \sum_{s \in \mathcal{S}} \sum_{s' \in \mathrm{out}(s)} d^\pi(s, s')\, r(s, s') - \lambda \underbrace{\sum_{s \in \mathcal{S}} \sum_{s' \in \mathrm{out}(s)} d^\pi(s, s') \log\Big( d^\pi(s, s') \Big/ \sum_{s'' \in \mathrm{out}(s)} d^\pi(s, s'') \Big)}_{R(d^\pi)}.$$
As a function of $d^\pi(s, s')$, the first term in the expression above is linear, whereas the second one is a relative conditional entropy (Neu et al., 2017), which is strongly concave. The set $\mathcal{K}$ of all admissible $d^\pi(s, s')$ is a polytope defined as the family of non-negative functions that satisfy the flow matching conditions (see Proposition 3.6)
$$\sum_{s' \in \mathrm{out}(s)} d(s, s') = \sum_{s' \in \mathrm{in}(s)} d(s', s), \qquad \sum_{s' \in \mathrm{out}(s_0)} d(s_0, s') = 1, \qquad \sum_{s' \in \mathrm{in}(s_f)} d(s', s_f) = 1,$$
where flows and policies are in one-to-one correspondence due to Proposition 3.7 applied in the reversed graph. Since the set $\mathcal{K}$ is a polytope, optimization of $V^\pi_\lambda(s_0)$ over occupancy measures is a strictly concave maximization problem and thus admits a unique solution $d^*$ that corresponds to a unique policy $\pi^*$.

Before proving the optimal Bellman equations, we want to show that $\pi^*$ satisfies $\pi^*(s' \mid s) > 0$ for any $s \in \mathcal{S}$, $s' \in \mathrm{out}(s)$. To do so, we examine the gradients of the regularizer, using the computations from (Neu et al., 2017), Section A.4:
$$\frac{\partial R(d^\pi)}{\partial d^\pi(s, s')} = \log \pi(s' \mid s).$$
In particular, this implies that as $\pi(s' \mid s) \to 0$, the corresponding derivative diverges, thus the optimal policy $\pi^*$ cannot satisfy $\pi^*(s' \mid s) = 0$, since this would violate the optimality conditions.
Next, we want to prove that the value of the optimal policy satisfies the soft optimal Bellman equations. First, we notice that the usual Bellman equations still hold since $n_\tau$ is a stopping time:
$$Q^\pi_\lambda(s, s') = r(s, s') + V^\pi_\lambda(s'), \qquad V^\pi_\lambda(s) = \sum_{s' \in \mathrm{out}(s)} \pi(s' \mid s)\, Q^\pi_\lambda(s, s') + \lambda\, \mathcal{H}(\pi(\cdot \mid s)),$$
with an additional initial condition $V^\pi_\lambda(s_f) = 0$, by the properties of conditional expectation. Let us consider a regularized policy improvement operation, defined as
$$\pi^{+}(\cdot \mid s) \triangleq \arg\max_{p} \Big\{\sum_{s' \in \mathrm{out}(s)} p(s')\, Q^\pi_\lambda(s, s') + \lambda\, \mathcal{H}(p)\Big\} \propto \exp\Big(\frac{1}{\lambda}\, Q^\pi_\lambda(s, \cdot)\Big).$$
Then we want to show that $V^{\pi^+}_\lambda(s_0) \ge V^{\pi}_\lambda(s_0)$ if the policy $\pi$ is positive: $\pi(s' \mid s) > 0$ for all $s \in \mathcal{S}$, $s' \in \mathrm{out}(s)$. We start from a general inequality that holds for any $s \in \mathcal{S}$:
$$V^{\pi^+}_\lambda(s) - V^{\pi}_\lambda(s) = \Big[\sum_{s' \in \mathrm{out}(s)} \pi^+(s' \mid s)\, Q^{\pi^+}_\lambda(s, s') + \lambda\mathcal{H}(\pi^+(\cdot \mid s))\Big] - \Big[\sum_{s' \in \mathrm{out}(s)} \pi(s' \mid s)\, Q^{\pi}_\lambda(s, s') + \lambda\mathcal{H}(\pi(\cdot \mid s))\Big]$$
$$= \Big[\sum_{s' \in \mathrm{out}(s)} \pi^+(s' \mid s)\, Q^{\pi^+}_\lambda(s, s') + \lambda\mathcal{H}(\pi^+(\cdot \mid s))\Big] - \Big[\sum_{s' \in \mathrm{out}(s)} \pi^+(s' \mid s)\, Q^{\pi}_\lambda(s, s') + \lambda\mathcal{H}(\pi^+(\cdot \mid s))\Big]$$
$$\quad + \Big[\sum_{s' \in \mathrm{out}(s)} \pi^+(s' \mid s)\, Q^{\pi}_\lambda(s, s') + \lambda\mathcal{H}(\pi^+(\cdot \mid s))\Big] - \Big[\sum_{s' \in \mathrm{out}(s)} \pi(s' \mid s)\, Q^{\pi}_\lambda(s, s') + \lambda\mathcal{H}(\pi(\cdot \mid s))\Big]$$
$$\ge \sum_{s' \in \mathrm{out}(s)} \pi^+(s' \mid s)\big[Q^{\pi^+}_\lambda(s, s') - Q^{\pi}_\lambda(s, s')\big] = \sum_{s' \in \mathrm{out}(s)} \pi^+(s' \mid s)\big[V^{\pi^+}_\lambda(s') - V^{\pi}_\lambda(s')\big],$$
where the inequality holds because $\pi^+$ maximizes the regularized policy improvement objective. After $t$ rollouts, we have
$$V^{\pi^+}_\lambda(s_0) - V^{\pi}_\lambda(s_0) \ge \mathbb{E}_{\tau \sim \mathbb{P}^{\pi^+}_{\mathcal{T}}}\big[V^{\pi^+}_\lambda(s_t) - V^{\pi}_\lambda(s_t)\big].$$
Since the policy $\pi$ is positive, Lemma 3.4 applied in the reversed graph implies that $d^\pi(s, s'), d^\pi(s) < +\infty$, and thus all values and Q-values are finite: $Q^\pi_\lambda(s, s') > -\infty$ for any $s \in \mathcal{S}$, $s' \in \mathrm{out}(s)$. This implies that $\pi^+$ is also positive. Thus, its trajectories are finite with probability 1, which yields $V^{\pi^+}_\lambda(s_0) \ge V^{\pi}_\lambda(s_0)$. Finally, applying the policy improvement step to $\pi^*$, we conclude the statement.

Proof of Theorem 3.13. Let $\mathbb{P}$ be the trajectory distribution induced by the GFlowNet backward policy and $\mathbb{P}^\pi_{\mathcal{T}}$ be the trajectory distribution induced by the RL policy $\pi$. Then we rewrite the value function (13) in the following form, using the tower property of conditional expectation to replace the entropy with the negative logarithm of the policy:
$$V^\pi_{\lambda=1}(s_0) = \mathbb{E}_{\tau \sim \mathbb{P}^\pi_{\mathcal{T}}}\Big[\sum_{t=0}^{n_\tau} r(s_t, s_{t+1}) - \log \pi(s_{t+1} \mid s_t)\Big].$$
Notice that there is no coefficient in front of the entropy and the reward because we set $\gamma = 1$, $\lambda = 1$ by the theorem statement. Using simple algebraic manipulations,
$$V^\pi_{\lambda=1}(s_0) = \mathbb{E}_{\tau \sim \mathbb{P}^\pi_{\mathcal{T}}}\Big[\sum_{t=0}^{n_\tau} \log \exp(r(s_t, s_{t+1})) - \log \pi(s_{t+1} \mid s_t)\Big] = \mathbb{E}_{\tau \sim \mathbb{P}^\pi_{\mathcal{T}}}\Big[\log \frac{\prod_{t=0}^{n_\tau} \exp(r(s_t, s_{t+1}))}{\prod_{t=0}^{n_\tau} \pi(s_{t+1} \mid s_t)}\Big].$$
Next, we notice that $r(s, s') = \log P_B(s \mid s')$ for all non-terminal transitions and $r(s, s_f) = \log R(s) = \log P_B(s \mid s_f) + \log Z$ for terminal transitions, due to the reward matching condition. Thus,
$$V^\pi_{\lambda=1}(s_0) = \log Z + \mathbb{E}_{\tau \sim \mathbb{P}^\pi_{\mathcal{T}}}\Big[\log \frac{\prod_{t=0}^{n_\tau} P_B(s_t \mid s_{t+1})}{\prod_{t=0}^{n_\tau} \pi(s_{t+1} \mid s_t)}\Big] = \log Z - \mathrm{KL}\big(\mathbb{P}^\pi_{\mathcal{T}} \,\big\|\, \mathbb{P}\big).$$
Here $\mathbb{P}$ is the trajectory distribution induced by $P_B$ (7). We note that the final equation is the same as the one in Proposition 1 of (Tiapkin et al., 2024) for the acyclic case. Thus, an optimal policy $\pi^*$ that maximizes $V^\pi_{\lambda=1}(s_0)$ is the one that minimizes $\mathrm{KL}(\mathbb{P}^\pi_{\mathcal{T}} \,\|\, \mathbb{P})$. By Lemma 3.4, the expected trajectory length of any optimal policy $\pi^*$ that matches $\mathbb{P}$ is finite. Thus, by Proposition 3.8, there exists a unique forward policy $P_F$ that induces the same trajectory distribution as $P_B$, which is equivalent to achieving zero KL divergence. Thus, $\pi^*$ coincides with $P_F$, and we conclude the statement by the uniqueness of the solution (see Lemma A.1). To apply Lemma A.1, without loss of generality, we can assume that the GFlowNet reward function $R$ is normalized, i.e., $Z = 1$ and $\log R(x) < 0$. Indeed, since a terminating transition $x \to s_f$ is always visited exactly once, this amounts to subtracting $\log Z$ from all terminal rewards, which does not change the optimal policy and modifies all values by the same constant.
Next, consider the soft optimal Bellman equations (14) for non-terminating transitions:
$$Q^*_{\lambda=1}(s, s') = \log P_B(s \mid s') + \log \sum_{s'' \in \mathrm{out}(s')} \exp\big(Q^*_{\lambda=1}(s', s'')\big).$$
Let us show that $Q^*_{\lambda=1}(s, s') = \log F(s \to s')$ satisfies these equations. Indeed,
$$\log F(s \to s') = \log F(s') + \log P_B(s \mid s') = \log \sum_{s'' \in \mathrm{out}(s')} F(s' \to s'') + \log P_B(s \mid s') = \log P_B(s \mid s') + \log \sum_{s'' \in \mathrm{out}(s')} \exp\big(\log F(s' \to s'')\big).$$
Here we used the equations from Proposition 3.6. For terminating transitions we simply have $Q^*_{\lambda=1}(s, s_f) = r(s, s_f) = \log R(s) = \log F(s \to s_f)$. Since there exists a unique solution to the soft optimal Bellman equations, we have proven $Q^*_{\lambda=1}(s, s') = \log F(s \to s')$. As for state flows, we have
$$V^*_{\lambda=1}(s) = \log \sum_{s' \in \mathrm{out}(s)} \exp\big(Q^*_{\lambda=1}(s, s')\big) = \log \sum_{s' \in \mathrm{out}(s)} \exp\big(\log F(s \to s')\big) = \log F(s).$$
Thus the proof is concluded.
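The identities above admit a direct numerical check: running soft value iteration on the induced MDP $\mathcal{M}_{\mathcal{G}}$ should recover the logarithms of the flows induced by $P_B$. Below is a minimal Python sketch of such a check on a toy cyclic graph with a uniform backward policy and a normalized reward ($Z = 1$); the graph and all names are illustrative assumptions, not part of the paper's released code.

```python
import math

# Toy cyclic graph (assumed for illustration): forward edges of G.
edges = {"s0": ["A"], "A": ["B", "sf"], "B": ["A", "sf"], "sf": []}       # out(s)
parents = {"A": ["s0", "B"], "B": ["A"], "sf": ["A", "B"]}                # in(s)

# Uniform backward policy; reward matching with Z = 1 gives R(x) = P_B(x | s_f),
# so the MDP rewards r(s, s') = log P_B(s | s') and r(x, s_f) = log R(x) share one formula.
def reward(s, s_child):
    return -math.log(len(parents[s_child]))

def soft_value_iteration(num_iters=200, lam=1.0):
    """Soft value iteration (gamma = 1, lambda = 1) for the regularized shortest path problem."""
    V = {s: 0.0 for s in edges}                    # V(s_f) is kept at 0
    for _ in range(num_iters):
        new_V = dict(V)
        for s in edges:
            if s == "sf":
                continue
            qs = [reward(s, c) + V[c] for c in edges[s]]
            new_V[s] = lam * math.log(sum(math.exp(q / lam) for q in qs))
        V = new_V
    Q = {(s, c): reward(s, c) + V[c] for s in edges for c in edges[s]}
    return V, Q

if __name__ == "__main__":
    V, Q = soft_value_iteration()
    # Theorem 3.13 predicts V*(s) = log F(s) and Q*(s, s') = log F(s -> s'); for this graph
    # the induced flows are F(A) = 2, F(B) = 1.5, F(s0) = F(sf) = 1.
    print({s: round(math.exp(v), 3) for s, v in V.items()})
    print({e: round(math.exp(q), 3) for e, q in Q.items()})
```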
B. Algorithmic Details

B.1. Training Policy and Flow Weighting

Recall the optimization problem in (11):
$$\min_{F, P_F, P_B} \sum_{s \in \mathcal{S}\setminus\{s_0, s_f\}} F(s) \quad \text{subject to} \quad \Big(\log \frac{F(s)\,P_F(s' \mid s)}{F(s')\,P_B(s \mid s')}\Big)^2 = 0 \ \ \forall\, s \to s' \in \mathcal{E}, \qquad F(s_f)\,P_B(x \mid s_f) = R(x) \ \ \forall\, x \to s_f \in \mathcal{E}.$$
Now, suppose that training with the DB loss (4) and state flow regularization (12) is done on-policy, i.e., trajectories are collected using the trained policy $P_F$. Let us write down the expected gradient of the loss summed over a trajectory (note that the regularization is not applied to $F(s_0)$ and $F(s_f)$):
$$\mathbb{E}_{\tau \sim P_F}\Bigg[\nabla_\theta \sum_{t=0}^{n_\tau} \Big(\log \frac{F_\theta(s_t)\,P_F(s_{t+1} \mid s_t, \theta)}{F_\theta(s_{t+1})\,P_B(s_t \mid s_{t+1}, \theta)}\Big)^2 + \lambda \sum_{t=1}^{n_\tau} \nabla_\theta F_\theta(s_t)\Bigg],$$
which can be rewritten as
$$\mathbb{E}_{\tau \sim P_F}\Bigg[\sum_{t=0}^{n_\tau} \nabla_\theta \mathcal{L}_{\mathrm{DB}}(s_t \to s_{t+1})\Bigg] + \lambda\, \mathbb{E}_{\tau \sim P_F}\Bigg[\sum_{t=1}^{n_\tau} \nabla_\theta F_\theta(s_t)\Bigg].$$
The first term is the expected gradient of the standard DB loss. As for the second term, we note that if $F_\theta$ is exactly the state flow induced by $P_F$, we have
$$\mathbb{E}_{\tau \sim P_F}\Bigg[\sum_{t=1}^{n_\tau} \nabla_\theta F_\theta(s_t)\Bigg] = \sum_{s \in \mathcal{S}\setminus\{s_0,s_f\}} \nabla_\theta F_\theta(s)\, \mathbb{E}_{\tau \sim P_F}\Bigg[\sum_{t=0}^{n_\tau} \mathbb{I}\{s_t = s\}\Bigg] = \sum_{s \in \mathcal{S}\setminus\{s_0,s_f\}} \frac{F_\theta(s)}{F_\theta(s_f)}\,\nabla_\theta F_\theta(s) = \frac{1}{2F_\theta(s_f)}\,\nabla_\theta \sum_{s \in \mathcal{S}\setminus\{s_0,s_f\}} F_\theta(s)^2.$$
This implies that on-policy training tries to minimize the sum of squared state flows rather than the sum of state flows. This happens because the trajectory distribution used to collect training data (induced by $P_F$ in this case) visits certain states more often than others, so the flow in each state receives a weight equal to the expected number of visits. However, if $P_F(s \mid s_0)$ is fixed to be uniform over $\mathcal{S} \setminus \{s_0, s_f\}$ (see Section 4 and Appendix B.3), this issue can be circumvented by applying the flow regularizer only in the first state of each sampled trajectory. Then equal weight is given to $F_\theta(s)$ in each state in the expected loss, and we minimize the sum of state flows. However, in our experiments we noticed that this does not significantly influence the results, so we leave exploring this phenomenon as a direction for further research.

B.2. Loss Scaling and Stability

In this section, we provide a more detailed explanation of our scaling hypothesis (see Section 4). Let us consider a GFlowNet that learns $F$, $P_F$ and $P_B$. Since these quantities are predicted by a neural network, a standard way is to make it predict the logits of the forward policy, the logits of the backward policy, and the logarithm of the state flow. Flow functions are always positive, thus predicting them in log scale is a natural approach (Bengio et al., 2021; 2023). Then, for any transition $s \to s'$, define two quantities:
$$\Delta\log F(s, s', \theta) \triangleq \log F_\theta(s) + \log P_F(s' \mid s, \theta) - \log F_\theta(s') - \log P_B(s \mid s', \theta),$$
$$\Delta F(s, s', \theta) \triangleq \exp\big(\log F_\theta(s) + \log P_F(s' \mid s, \theta)\big) - \exp\big(\log F_\theta(s') + \log P_B(s \mid s', \theta)\big). \tag{15}$$

Figure 3. Plots for DB and SDB losses in $F$ and $\log F$ scales with the fixed predicted log backward flow equal to 1 and a varying predicted log forward flow. More specifically, the green curve is $y = (x - 1)^2$, the red curve is $y = (e^x - e^1)^2$, the brown curve is $y = \log\big(1 + (x - 1)^2\big)\,(1 + 0.001\,e^x)$, and the blue curve is $y = \log\big(1 + (e^x - e^1)^2\big)\,(1 + 0.001\,e^x)$.

The first quantity is the difference between the predicted logarithms of the flows in the forward and backward directions, $\log F_F - \log F_B$, while the second is the difference between the predicted flows themselves, $F_F - F_B$. Then, the standard DB loss (4) is $\mathcal{L}_{\mathrm{DB}}(s \to s') = \Delta\log F(s, s', \theta)^2$, and the SDB loss proposed in (Brunswic et al., 2024) is $\mathcal{L}_{\mathrm{SDB}}(s \to s') = \log\big(1 + \varepsilon\, \Delta F(s, s', \theta)^2\big)\,\big(1 + \eta\, F_\theta(s)\big)$. However, for both losses, one can either replace $\Delta\log F$ with $\Delta F$ or the other way around. For visualization, let us fix the predicted log backward flow to be, e.g., 1, and plot the losses with respect to the varying value of the predicted log forward flow. The plots are presented in Figure 3. One can note that as the argument $\log F_F$ decreases, both losses in $F$ scale quickly plateau, so their derivatives go to zero. From the optimization perspective, this means that when the predicted log flow needs to be increased, the gradient step will be very small, since the derivative of the loss is almost zero. On the other hand, when the predicted log flow needs to be decreased, the gradient step will be larger, since the losses have much higher derivatives in the corresponding regions. In combination with Proposition 3.12, this gives a possible explanation of the stability of $F$-scale losses: they are biased towards underestimating the flows and, as a result, towards solutions with smaller expected trajectory length. We note that the same reasoning applies to the stable flow matching loss proposed in (Brunswic et al., 2024), since it also operates with differences between flows in $F$ scale. However, as we show in our experimental evaluation (Section 4), this comes at the cost of learning GFlowNets that match the reward distribution less accurately.
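To make the scaling comparison concrete, the following minimal Python sketch evaluates the four curves of Figure 3 (DB and SDB, each in $\log F$ and $F$ scale) for a fixed predicted log backward flow of 1, with $\varepsilon = 1$ and $\eta = 10^{-3}$ as in our experiments; the function names are illustrative choices, not identifiers from the paper's code.

```python
import numpy as np

LOG_FB = 1.0           # fixed predicted log backward flow
EPS, ETA = 1.0, 1e-3   # SDB hyperparameters used in the experiments

def db_log_scale(log_ff):
    """DB loss on the log-scale difference: (log F_F - log F_B)^2."""
    return (log_ff - LOG_FB) ** 2

def db_f_scale(log_ff):
    """DB loss on the F-scale difference: (F_F - F_B)^2."""
    return (np.exp(log_ff) - np.exp(LOG_FB)) ** 2

def sdb_log_scale(log_ff):
    """SDB loss with the log-scale difference inside the logarithm."""
    return np.log(1 + EPS * (log_ff - LOG_FB) ** 2) * (1 + ETA * np.exp(log_ff))

def sdb_f_scale(log_ff):
    """SDB loss with the F-scale difference inside (the form of Brunswic et al., 2024)."""
    return np.log(1 + EPS * (np.exp(log_ff) - np.exp(LOG_FB)) ** 2) * (1 + ETA * np.exp(log_ff))

if __name__ == "__main__":
    for x in np.linspace(-10, 10, 5):
        print(f"log F_F = {x:6.1f} | DB(logF) = {db_log_scale(x):10.2f} | DB(F) = {db_f_scale(x):14.2f} "
              f"| SDB(logF) = {sdb_log_scale(x):10.2f} | SDB(F) = {sdb_f_scale(x):10.2f}")
    # For log F_F far below log F_B, both F-scale losses saturate (the difference tends to -e^1),
    # so their derivatives vanish, matching the plateau discussed above.
```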
B.3. Fixed PB and Trainable PB

In non-acyclic environments, $s_0$ and $s_f$ are generally fictive states that do not correspond to any object. Then $P_F(s_f \mid s)$ corresponds to the probability of terminating a trajectory in state $s$, while $P_F(s \mid s_0)$ corresponds to the probability that a trajectory starts in state $s$. Thus, the choice of $\mathrm{out}(s_0)$ is crucial in the design of the environment. If this set is large, e.g., coincides with $\mathcal{S} \setminus \{s_0, s_f\}$, one has to fix $P_F(s \mid s_0)$ to some distribution, e.g., uniform, otherwise learning becomes intractable. However, in this case $P_B(s_0 \mid s)$ has to be trainable, otherwise it may be impossible to satisfy the detailed balance conditions for transitions $s_0 \to s$.

In our experiments, we consider two settings: training with a fixed $P_B$ and training with a trainable $P_B$. In the case of a fixed $P_B$, we consider $\mathrm{out}(s_0) = \{s_{\mathrm{init}}\}$, where $s_{\mathrm{init}}$ is some fixed state in $\mathcal{S} \setminus \{s_0, s_f\}$. Thus, the first transition of every trajectory goes from $s_0$ to $s_{\mathrm{init}}$. Then, for any $s \in \mathcal{S} \setminus \{s_0, s_f, s_{\mathrm{init}}\}$, $P_B(\cdot \mid s)$ is uniform over the parents of $s$, while $P_B(s_0 \mid s_{\mathrm{init}}) = 1 - \varepsilon$ for some small $\varepsilon > 0$ and $P_B(s' \mid s_{\mathrm{init}}) = \varepsilon/(|\mathrm{in}(s_{\mathrm{init}})| - 1)$ for the other transitions $s' \to s_{\mathrm{init}}$.

For a trainable $P_B$, we consider the case $\mathrm{out}(s_0) = \mathcal{S} \setminus \{s_0, s_f\}$. Here we fix the first forward transition probability $P_F(s \mid s_0)$ to be uniform over $\mathcal{S} \setminus \{s_0, s_f\}$. In this case, the DB loss for the first transition takes a special form:
$$\mathcal{L}_{\mathrm{DB}}(s_0 \to s) = \big(\log Z_\theta - \log |\mathcal{S} \setminus \{s_0, s_f\}| - \log P_B(s_0 \mid s, \theta) - \log F_\theta(s)\big)^2, \tag{16}$$
where $\log Z_\theta - \log |\mathcal{S} \setminus \{s_0, s_f\}|$ corresponds to $\log F_\theta(s_0) + \log P_F(s \mid s_0)$. An important note is that $\log F_\theta(s_0)$ for optimal solutions always coincides with $\log Z$; thus, it is usually harmful to apply the state flow regularization (12) to it.

B.4. Solving Small Environments Exactly

Suppose we have a fixed backward policy $P_B$ and a final flow $F(s_f)$. Then the induced flows $F$ and the corresponding forward policy $P_F$ can be obtained exactly for small environments. Consider the following system of linear equations with respect to $\hat F(s)$, which arises from Proposition 3.6:
$$\hat F(s) = \sum_{s' \in \mathrm{out}(s)} P_B(s \mid s')\, \hat F(s'), \ \ s \in \mathcal{S} \setminus \{s_f\}, \qquad \hat F(s_f) = F(s_f). \tag{17}$$
The system has $|\mathcal{S}|$ variables and $|\mathcal{S}|$ equations. $\hat F(s) = F(s)$ is a solution, where $F(s)$ are the state flows induced by $P_B$ and $F(s_f)$, and the uniqueness of the solution follows from Proposition 3.7. Thus, by solving the system, one can exactly find the induced state flows. Then, by Proposition 3.6 and Proposition 3.8, the edge flows and $P_F$ can also be exactly expressed as $F(s \to s') = P_B(s \mid s')\,F(s')$ and $P_F(s' \mid s) = P_B(s \mid s')\,F(s')/F(s)$. Finally, by Corollary 3.12, one can find the expected trajectory length of the induced trajectory distribution $\mathbb{P}$ as
$$\mathbb{E}_{\tau \sim \mathbb{P}}[n_\tau] = \frac{1}{F(s_f)}\sum_{s \in \mathcal{S}\setminus\{s_0, s_f\}} F(s).$$

Interestingly, the system (17) can also be explained from the Markov chain perspective. Let us take the graph $\mathcal{G}$ with reversed edges, add a self-loop at $s_0$, and use $P_B$ to define a Markov chain with the following transition matrix: $P(s_0 \mid s_0) = 1$, $P(s \mid s') = P_B(s \mid s')$ if there is an edge $s \to s'$, and $P(s \mid s') = 0$ otherwise. This is an absorbing Markov chain whose only absorbing state is $s_0$, since $s_0$ is reachable from any other state by Assumption 3.1. The transition matrix can be written in the following way:
$$P = \begin{pmatrix} Q & R \\ 0 & 1 \end{pmatrix},$$
where $Q$ is an $(|\mathcal{S}| - 1) \times (|\mathcal{S}| - 1)$ matrix and $R$ is an $(|\mathcal{S}| - 1) \times 1$ matrix. Its fundamental matrix $N$, i.e., the matrix such that $N_{s, s'}$ equals the expected number of visits to a non-absorbing state $s'$ before absorption when starting from a non-absorbing state $s$, can be obtained as
$$N = \sum_{k=0}^{\infty} Q^k = (I - Q)^{-1},$$
where $I - Q$ is always invertible (Kemeny & Snell, 1969, Theorem 3.2.1). One can note that the normalized flows $F(s)/F(s_f)$ coincide with the expected number of visits to $s$ when starting from $s_f$, and thus coincide with the row of the matrix $N$ corresponding to $s_f$. Finally, notice that $(I - Q)$ coincides with the transposed matrix of the truncated system (17) (excluding the variable and the equation corresponding to $s_0$); thus, such a system has the unique solution
$$F(s_f)\,(I - Q)^{-T} e_{s_f} = F(s_f)\, N^{T} e_{s_f},$$
where $e_{s_f}$ is a vector of size $|\mathcal{S}| - 1$ that has 1 in the position corresponding to $s_f$ and 0 elsewhere. The variable corresponding to $s_0$ should be handled separately, but it is easy to see that $\hat F(s_0) = \sum_{s' \in \mathrm{out}(s_0)} P_B(s_0 \mid s')\, F(s') = F(s_0)$.
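As an illustration of this procedure, the sketch below solves the system (17) with NumPy for a small toy cyclic graph with a uniform backward policy and $F(s_f) = 1$, then recovers $P_F$ and the expected trajectory length; the graph and variable names are assumptions made for the example, not an environment from the paper.

```python
import numpy as np

# Toy cyclic environment (illustrative): forward edges and a uniform backward policy.
states = ["s0", "A", "B", "sf"]
children = {"s0": ["A"], "A": ["B", "sf"], "B": ["A", "sf"], "sf": []}   # out(s)
parents = {"s0": [], "A": ["s0", "B"], "B": ["A"], "sf": ["A", "B"]}     # in(s)

def p_backward(s, s_child):
    """Uniform backward policy P_B(s | s_child) over the parents of s_child."""
    return 1.0 / len(parents[s_child])

def solve_flows(f_sf=1.0):
    """Solve the system (17): F(s) = sum_{s' in out(s)} P_B(s | s') F(s'), with F(s_f) fixed."""
    idx = {s: i for i, s in enumerate(states)}
    A = np.zeros((len(states), len(states)))
    b = np.zeros(len(states))
    for s in states:
        A[idx[s], idx[s]] = 1.0
        if s == "sf":
            b[idx[s]] = f_sf
        else:
            for c in children[s]:
                A[idx[s], idx[c]] -= p_backward(s, c)
    flows = np.linalg.solve(A, b)
    return {s: flows[idx[s]] for s in states}

if __name__ == "__main__":
    F = solve_flows()
    # Forward policy from Proposition 3.8: P_F(s' | s) = P_B(s | s') F(s') / F(s).
    P_F = {(s, c): p_backward(s, c) * F[c] / F[s] for s in states for c in children[s]}
    # Expected trajectory length from Corollary 3.12.
    expected_len = sum(F[s] for s in states if s not in ("s0", "sf")) / F["sf"]
    print(F)             # {'s0': 1.0, 'A': 2.0, 'B': 1.5, 'sf': 1.0} for this toy graph
    print(P_F)           # e.g. P_F(B | A) = 0.75, P_F(sf | A) = 0.25
    print(expected_len)  # 3.5
```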
C. Experimental Details

C.1. Loss Choice

While (Brunswic et al., 2024) used the original flow matching loss (Bengio et al., 2021) for experimental evaluation, it was previously shown to be less computationally efficient and to provide slower convergence than other GFlowNet losses (Malkin et al., 2023; Madan et al., 2023) in the acyclic case, so we carry out experimental evaluation with the more broadly employed detailed balance loss (Bengio et al., 2023). Moreover, the flow matching loss does not admit an explicit parameterization of a backward policy, nor training with fixed backward policies, which would not allow us to study some of the phenomena we explore in the experiments. In addition, we note that the proposed state flow regularization (12) can potentially be applied with other GFlowNet losses that learn flows, e.g., SubTB (Madan et al., 2023), or with the modification of DB proposed in (Deleu et al., 2022) that implicitly parametrizes flows as $F(s) = R(s)/P_F(s_f \mid s)$.

C.2. Hypergrids

Formally, $\mathcal{S} \setminus \{s_0, s_f\}$ is the set of points with integer coordinates inside a $D$-dimensional hypercube with side length $H$:
$$\big\{(s^1, \ldots, s^D) \mid s^i \in \{0, 1, \ldots, H-1\}\big\}.$$
$s_0$ and $s_f$ are auxiliary states that do not correspond to any point inside the grid. Possible transitions correspond to increasing or decreasing any coordinate by 1 without exiting the grid. In addition, for each state $s \in \mathcal{S} \setminus \{s_0, s_f\}$ there is a terminating transition $s \to s_f$. The GFlowNet reward at $s = (s^1, \ldots, s^D)$ is defined as
$$R(s) \triangleq R_0 + R_1 \prod_{i=1}^{D} \mathbb{I}\Big[0.25 < \Big|\tfrac{s^i}{H-1} - 0.5\Big|\Big] + R_2 \prod_{i=1}^{D} \mathbb{I}\Big[0.3 < \Big|\tfrac{s^i}{H-1} - 0.5\Big| < 0.4\Big],$$
where $0 < R_0 \ll R_1 < R_2$. (Brunswic et al., 2024) do not specify the reward parameters used in their experiments, so we use the parameters from the acyclic version of the environment studied in (Malkin et al., 2022), i.e., $R_0 = 10^{-3}$, $R_1 = 0.5$, $R_2 = 2.0$. The utilized metric is
$$\frac{1}{2}\sum_{x \in \mathcal{X}} \Big| \frac{R(x)}{Z} - \pi(x) \Big|,$$
where $\pi(x)$ is the empirical distribution of the last $2 \cdot 10^5$ samples seen in training (endpoints of trajectories sampled from $P_F$).

All models are parameterized by an MLP with 2 hidden layers and hidden size 256, which accepts a one-hot encoding of $s$ as input. $F_\theta(s)$, $P_F(s' \mid s, \theta)$, and $P_B(s \mid s', \theta)$ share the same backbone, with different linear heads predicting the logarithm of the state flow, the logits of the forward policy, and the logits of the backward policy. In the case of the fixed $P_B$, $s_{\mathrm{init}}$ corresponds to the center of the grid, and we take $\varepsilon = 10^{-8}$ (see Appendix B.3). We train all models on-policy. We use the Adam optimizer with a learning rate of $10^{-3}$ and a batch size of 16 (the number of trajectories sampled at each training step). For $\log Z_\theta$ we use a larger learning rate of $10^{-2}$ (see (Malkin et al., 2022)). All models are trained until $2 \cdot 10^6$ trajectories are sampled, and the empirical sample distribution $\pi(x)$ is computed over the last $2 \cdot 10^5$ samples seen in training. For SDB we set $\varepsilon = 1.0$ and $\eta = 10^{-3}$. We found that using larger values of $\eta$ can lead to smaller expected trajectory lengths, but also significantly interferes with the sampling fidelity of the learned GFlowNet, so we opt for these values in our experiments.
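For completeness, here is a small Python sketch of the hypergrid reward above, with the parameter values borrowed from (Malkin et al., 2022); the function name and the use of NumPy are our own illustrative choices.

```python
import numpy as np

R0, R1, R2 = 1e-3, 0.5, 2.0   # reward parameters from the acyclic hypergrid of Malkin et al. (2022)

def hypergrid_reward(s, H):
    """Reward R(s) for a grid point s = (s^1, ..., s^D) with coordinates in {0, ..., H-1}."""
    z = np.abs(np.asarray(s, dtype=float) / (H - 1) - 0.5)   # distance of each coordinate from the center
    plateau = np.all(z > 0.25)                               # wide higher-reward region away from the center
    mode = np.all((z > 0.3) & (z < 0.4))                     # narrow modes inside that region
    return R0 + R1 * float(plateau) + R2 * float(mode)

if __name__ == "__main__":
    H, D = 8, 2
    grid = [(i, j) for i in range(H) for j in range(H)]
    rewards = {s: hypergrid_reward(s, H) for s in grid}
    Z = sum(rewards.values())
    print(rewards[(0, 0)], rewards[(7, 7)], Z)   # corner points fall into the plateau region
```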
C.3. Permutations

All models are parameterized by an MLP with 2 hidden layers and hidden size 128, which accepts a one-hot encoding of $s$ as input. $F_\theta(s)$, $P_F(s' \mid s, \theta)$, and $P_B(s \mid s', \theta)$ share the same backbone, with different linear heads predicting the logarithm of the state flow, the logits of the forward policy, and the logits of the backward policy. In the case of the fixed $P_B$, $s_{\mathrm{init}}$ corresponds to the permutation $(n, n-1, \ldots, 2, 1)$, and we take $\varepsilon = 10^{-8}$ (see Appendix B.3). We train all models on-policy. We use the Adam optimizer with a learning rate of $10^{-3}$ and a batch size of 512 (the number of trajectories sampled at each training step). We found that using small batch sizes can significantly hinder training stability in this environment; thus, we opt for a larger value. All models are trained for $10^5$ iterations. For $\log Z_\theta$ we use a larger learning rate of $10^{-2}$ (see (Malkin et al., 2022)). To estimate $\log Z$, we take the average value of $\log Z_\theta$ over the last 10 training checkpoints. The empirical distribution $\hat C(k)$ is computed over the last $10^5$ samples seen in training. For SDB we set $\varepsilon = 1.0$ and $\eta = 10^{-3}$. We found that using larger values of $\eta$ can lead to smaller expected trajectory lengths, but also significantly interferes with the sampling fidelity of the learned GFlowNet, so we opt for these values in our experiments.

Suppose that $x_1, \ldots, x_m$ is a set of GFlowNet samples (terminal states of trajectories sampled from $P_F$). Then the empirical L1 error of fixed point probabilities is defined as
$$\sum_{k} \big| C(k) - \hat C(k) \big|, \qquad \text{where } \hat C(k) = \frac{1}{m}\sum_{i=1}^{m} \mathbb{I}\Big\{\sum_{j=1}^{n} \mathbb{I}\{x_i(j) = j\} = k\Big\}$$
is the empirical distribution of the number of fixed points, and the relative error of mean reward is defined as
$$\frac{\big| \mathbb{E}[R(x)] - \frac{1}{m}\sum_{i=1}^{m} R(x_i) \big|}{\mathbb{E}[R(x)]}, \qquad \text{where } \mathbb{E}[R(x)] = \sum_{x \in \mathcal{X}} \frac{R(x)}{Z}\, R(x).$$
We compute the mean reward over $10^4$ samples.

C.3.1. Reward Distribution Properties

We define the GFlowNet reward as $R(s) = \exp\big(\frac{1}{2}\sum_{k=1}^{n} \mathbb{I}\{s(k) = k\}\big)$. We are interested in the true values of three quantities:

1. the normalizing constant $Z = \sum_{x \in \mathcal{X}} R(x)$,
2. the true expected reward $\mathbb{E}[R(x)] = \sum_{x \in \mathcal{X}} \frac{R(x)}{Z}\, R(x)$,
3. the fixed point probabilities $C(k) = \mathbb{P}\big(\sum_{i=1}^{n} \mathbb{I}\{x(i) = i\} = k\big)$ with respect to the reward distribution.

While computing sums over all permutations is intractable for $n$ above some threshold, below we show that for this particular reward, analytical expressions for these quantities can be derived. First, we derive the formula for the total number of permutations of length $n$ with exactly $k$ fixed points, which we denote by $D(k, n)$. In combinatorics, such permutations are known as partial derangements, and the corresponding counts are known as rencontres numbers (Comtet, 1974, p. 180). Note that $D(k, n) = \binom{n}{k} D(0, n-k)$, since choosing a permutation with $k$ fixed points amounts to choosing $k$ positions for the fixed points and permuting the remaining elements so that there are no fixed points among them. So let us start with the derivation of $D(0, n)$. Denote by $S_i$ the set of permutations of $n$ elements that have a fixed point at position $i$. Then, by the inclusion-exclusion principle, we have $|S_1 \cup \ldots \cup S_n| = \sum_i$