# generalization_and_distributed_learning_of_gflownets__b7fa3803.pdf

Published as a conference paper at ICLR 2025

GENERALIZATION AND DISTRIBUTED LEARNING OF GFLOWNETS

Tiago da Silva Getulio Vargas Foundation tiago.henrique@fgv.br

Amauri Souza Federal Institute of Cear a amauriholanda@ifce.edu.br

Omar Rivasplata University of Manchester omar.rivasplata@manchester.ac.uk

Vikas Garg Yai Yai Ltd and Aalto University vgarg@csail.mit.edu

Samuel Kaski Aalto University, University of Manchester samuel.kaski@aalto.fi

Diego Mesquita Getulio Vargas Foundation diego.mesquita@fgv.br

Conventional wisdom attributes the success of Generative Flow Networks (GFlow Nets) to their ability to exploit the compositional structure of the sample space for learning generalizable flow functions (Bengio et al., 2021). Despite the abundance of empirical evidence, formalizing this belief with verifiable nonvacuous statistical guarantees has remained elusive. We address this issue with the first data-dependent generalization bounds for GFlow Nets. We also elucidate the negative impact of the state space size on the generalization performance of these models via Azuma-Hoeffding-type oracle PAC-Bayesian inequalities. We leverage our theoretical insights to design a novel distributed learning algorithm for GFlow Nets, which we call Subgraph Asynchronous Learning (SAL). In a nutshell, SAL utilizes a divide-and-conquer strategy: multiple GFlow Nets are trained in parallel on smaller subnetworks of the flow network, and then aggregated with an additional GFlow Net that allocates appropriate flow to each subnetwork. Our experiments with synthetic and real-world problems demonstrate the benefits of SAL over centralized training in terms of mode coverage and distribution matching.

1 INTRODUCTION

Generalization is a long-standing problem in the machine learning literature, asking whether a learning algorithm can reliably make predictions beyond the data it was trained on (Valiant, 1984; Vapnik, 2000; Catoni, 2007; Alquier & Guedj, 2017; Dziugaite et al., 2021; Lotfi et al., 2024a). In an age of rapid deployment of AI models to end-users, there has been an emerging interest in the design of theoretically robust algorithms, with remarkable results for GANs (Mbacke et al., 2023), diffusion models (Li et al., 2024), transformers (Lotfi et al., 2024a;b), and graph neural networks (Ju et al., 2023; Tang & Liu, 2023). In this pursuit for developing models with proven generalizability, a rich set of tools has been created (Vapnik & Chervonenkis, 2015; Shalev-Shwartz & Ben-David, 2014), with Mc Allester (1998; 1999) s PAC-Bayesian theorems often providing the tightest statistical guarantees (P erez-Ortiz et al., 2021; Dziugaite & Roy, 2017; 2018; Lotfi et al., 2024b;a). Notably, however, there is no one-size-fits-all solution for understanding generalization: the diverse nature of data and learning algorithms demands a distinctly unique approach to each problem.

In the realm of probabilistic methods, for example, it has been widely hypothesized that the outstanding performance of Generative Flow Networks (GFlow Nets, Bengio et al., 2021; 2023; Lahlou at al., 2023), which have demonstrated exceptional results in problems such as design of biological sequences (Jain et al., 2022; Malkin et al., 2022) and combinatorial optimization (Zhang et al., 2023a;b), to name a few, emerges from their potential to exploit the compositional structure of the underlying state space to learn a generalizable flow assignment function in a flow network by only

Published as a conference paper at ICLR 2025

observing a fraction of the network s nodes (Bengio et al., 2021; Nica et al., 2022; Shen et al., 2023; Atanackovic & Bengio, 2024; Krichel et al., 2024). Nonetheless, in spite of the wealth of empirical evidence indicating that generalization occurs in GFlow Net learning (Nica et al., 2022; Atanackovic & Bengio, 2024), no work so far has provided non-vacuous high-probability empirical bounds on the population risk of GFlow Nets, which might serve as statistical certificates for generalization.

Given the described scenario, in this paper we develop the first non-vacuous generalization bounds for GFlow Nets in the literature in Section 5. In doing so, the core questions we want to address in this work are when GFlow Nets (provably) generalize and which factors potentially contribute to diminish their generalization performance. To kickstart our analysis, we present in Section 4 an example in which a GFlow Net catastrophically fails to generalize even after learning a compatible flow assignment for over 90% of the flow network. This example demonstrates that, to properly understand the generalization of GFlow Nets, we must consider not only the extension of the observed flow network but also the specific parts that have been encountered during training, a fact that is also implicit in popular techniques such as the replay buffer (Vemgal et al., 2023) and local search (Kim et al., 2024b). From a technical perspective, this implies that the development of meaningful statistical guarantees for GFlow Nets must be based not on data-agnostic theoretical results, but on data-dependent priors, in the fashion of Dziugaite & Roy (2018); Dziugaite et al. (2021); P erez-Ortiz et al. (2021). This observation guides the establishment of the non-vacuous high-probability bounds for the population risk of GFlow Nets in Section 5.1 and distinguishes our approach from other investigations of GFlow Nets generalization (Krichel et al., 2024).

The empirical results in Section 5.1, however, do not provide a fine understanding of which characteristics of the flow network tend to hinder the generalization of GFlow Nets. What effect do larger trajectory lengths, for instance, have on the provable learnability of generalizable flow assignments? Intuitively, generalization is harder in larger state spaces as, borrowing the terminology from the reinforcement learning (RL) literature (Bengio et al., 2021), the visited portion of an environment by an agent constrained by a fixed time budget decreases with increasing environment s size and, therefore, the agent would have to rely on increasingly sparse information for learning from larger trajectories. In Section 5.2, we formalize this intuition through the lens of PAC-Bayesian bounds, revealing the increasing difficulty in obtaining tight statistical certificates for larger state spaces. To achieve this, our technical contributions are two-fold: First, an oracle concentration inequality for the forward Kullback-Leibler (KL) divergence between the learned and targeted flow assignments (Theorem 5.2) inspired by Malkin et al. (2023) s interpretation of GFlow Nets as variational inference. Second, an Azuma-Hoeffding-type inequality (Seldin et al., 2012b) for independently sampled martingales representing the extent to which the learned flow assignment violates the detailed balance condition (Bengio et al., 2023) in the observed trajectories (Theorem 5.4).

Motivated by these results, we pose the following question: what can we do to mitigate the issues raised by an inadequate coverage of the state space, as illustrated in Section 4, and larger trajectory sizes, as analyzed in Section 5? As we show in Section 6, the answer lies in breaking up the flow network into multi-source subnetworks grouped together by a small root network (Figure 3). In this new paradigm, finding a compatible flow assignment becomes a two-stage process. Firstly, a different GFlow Net is trained on each subnetwork in a distributed fashion. Secondly, an additional GFlow Net trained on the root network learns to assign the correct amount of flow to each subnetwork, as estimated in the previous step. The resulting algorithm, which we call Subgraph Asynchronous Learning (SAL), has several advantages over the standard approach (Section 6.2). In particular, each GFlow Net within this framework needs to solve a problem that is relatively simpler than that of a unique, centralized model. Similarly, the asynchronous nature of the algorithm implies that we are able to visit a considerably larger fraction of the original flow network within a fixed time window and, therefore, that we get a significantly better coverage of the state space. As a consequence of this, we are also able to drastically improve the discovery of high-value states within the flow network, which is a metric of great interest in the GFlow Net literature (Bengio et al., 2021; Shen et al., 2023; Zhang et al., 2023a; Pan et al., 2023a; Jang et al., 2024; Kim et al., 2024b).

In summary, our contributions are:

1. We construct a family of examples in which a GFlow Net does not generalize even after learning a compatible flow assignment on arbitrarily large fractions of the flow network (Section 4);

2. We provide the first non-vacuous generalization bounds for GFlow Nets (Section 5.1);

Published as a conference paper at ICLR 2025

3. We derive oracle PAC-Bayesian inequalities for the population risk of GFlow Nets, emphasizing the impact of the flow network s topology on generalization performance (Section 5.2); 4. We design the first distributed algorithm for learning GFlow Nets with network-level parallelization, and evaluate its performance in common benchmarks in the literature (Section 6);

The first part of the paper establishes the notation and terminology to be used throughout the paper and reviews relevant results in the GFlow Net and PAC-Bayes literature, alongside an overview of our main contributions (Sections 2 and 3). The second part provides a formal treatment of GFlow Nets from the viewpoint of the PAC-Bayesian theory (Section 5). The third and final part outlines the foundations of SAL and conducts an empirical evaluation of the algorithm in common benchmark problems (Section 6). We defer the proofs and details of the experiments to the supplement.

2 PRELIMINARIES

Notations and terminology. Let G = (V, E) by a directed acyclic graph (DAG). A forward policy over G is a Markov transition kernel p F : V V [0, 1] supported on G s edges, i.e., such that p F (v, ) is a distribution over {u: (v, u) E}, for each vertex v. We interchangeably use p F ( |v) and p F (v, ) for representing p F . A backward policy p B over G is a forward policy on the transpose graph G = (V, E ) with E = {(u, v): (v, u) E}. The uniform policy assigns the same probability mass to a state s children, i.e., p U(s |s) = 1{s Ch(s)}/|Ch(s)| with Ch(s) = {s : (s, s ) E}. We say that G is pointed if there are nodes so and sf, respectively called initial (source) and final (sink) nodes, s.t. so (resp. sf) is the only node without incoming (resp. outgoing) edges and, for each s V , there is a trajectory (directed path) between so and sf containing s. In this case, a trajectory τ in G is complete if it starts at so and finishes at sf, which we denote by τ : so sf. Clearly, a forward policy induces a distribution over trajectories starting at s via p F (τ|s) = Q

(s ,s ) τ p F (s |s ); when τ is unambiguously complete, we will often omit so from this notation. Lastly, for probability measures P and Q on the same space, we let KL(P||Q), χ2(P||Q) and TV (P, Q) respectively denote their Kullback-Leibler (KL) divergence, χ2 divergence, and total variation distance.

GFlow Nets. We represent a GFlow Net (Bengio et al., 2021; 2023; Lahlou at al., 2023) G as a tuple (S, X, G, A, T , p F , p B, R, F) consisting of a set of states S, a set of terminal states X S, a pointed DAG G = (S, E), which is called a state graph, an action mapping A: S 2A associating each state s with an abstract action space A(s) A that is isomorphic to the children of s in G, a transition function T : s S ({s} A(s)) S defining how a state s is affected by an action a A(s), forward p F and backward p B policies on G, a reward function R: X R+ attributing a positive value to each terminal state, and a flow function F : S R+ such that F|X = R. Importantly, only the elements of X are connected to the sink node sf of G. When there is no risk of ambiguity, we will simply write G = (p F , p B, F). The objective of a GFlow Net is to find a p F s.t. the marginal distribution p T (x) := p F (x|so) = P

τ : so x p F (τ) over X matches R up to a normalizing factor (Bengio et al., 2021). In Appendix A, we illustrate how this abstract representation can be instantiated to accommodate three frequently considered use-cases (Malkin et al., 2022; 2023).

Learning GFlow Nets. In this context, S, X, G, A, T , and R are problem-dependent, while p F , p B, and F are unknowns that should be estimated. Remarkably, however, p B is often fixed as uniform (Shen et al., 2023; Liu et al., 2023; Zhang et al., 2023a), an assumption that we make throghout the paper, albeit most of our theoretical results and all our methods can be extended to the case of learnable p B. Under these circumstances, many learning objectives have been proposed for learning p F and F. Two popular choices, which we adopt here, are the trajectory balance (TB, LTB(p F , F), Malkin et al., 2022) and detailed balance (DB, LDB(p F , F), Bengio et al., 2023) losses,

" log F(so)p F (τ)

R(x)p B(τ|x)

and E τ p E

log p F (s |s)F(s)

p B(s|s )F(s )

in which |τ| represents τ s length, x is τ s (unique) terminal state, and p E is an exploratory policy. Intuitively, p E has the role of the data-generating distribution in a standard supervised learning context and is often defined as p E = (ϵ)p U + (1 ϵ)p F , an ϵ-greedy version of p F , with p U denoting an uniform policy; although more sophisticated techniques have been developed (Kim et al., 2024b; Rector-Brooks et al., 2023; Vemgal et al., 2023). In Appendix A we provide a more

Published as a conference paper at ICLR 2025

thorough overview of GFlow Net learning, including the subtrajectory balance loss (Sub TB, Madan et al., 2022) and divergence-based objectives (Malkin et al., 2023; Lahlou at al., 2023).

Generalization bounds for neural networks. The field of statistical learning theory (Vapnik, 1998; 2000) seeks to develop statistical certificates for the generalization of a learned model by providing high-probability upper bounds on the population error of an estimator as a function of the observed empirical risk. In the context of GFlow Nets, we ask whether an empirically measured imbalance based on the observed trajectories, such as the losses in Equation 1 or other locally computed metrics (see Sections 4 and 5), are appropriate surrogates for the GFlow Net s overall distributional accuracy. In particular, we are interested in inductive statistical guarantees, namely, those based on the training set (as opposed to the transductive setting, in which a test set is used). To this end, the PAC-Bayes framework of Mc Allester (1998; 1999; 2013) often provides the tighest bounds (P erez-Ortiz et al., 2021; Lotfi et al., 2024b;a; Dziugaite & Roy, 2017). In a nutshell, consider data X = {Xi}m i=1 drawn from some data distribution, a significance level δ, an empirical loss ˆL(θ, X) and a population loss L(θ) = EX[ ˆL(θ, X)] associated to the model s parameters θ. Given a prior (independent of X) distribution Q over θ, a PAC-Bayes bound typically assumes the form

Eθ P [L(θ)] Eθ P [ ˆL(θ, X)] + ϕ(δ, P, Q, m). (2)

The inequality holds with probability 1 δ over draws of X, simultaneously for all posterior distributions P over θ; and ϕ is a term penalizing the model s complexity (Mc Allester, 1999). We direct the reader to Alquier (2024) for a comprehensive introduction to PAC-Bayesian analysis. For bounded L, the right-hand side of Equation 2 is termed vacuous if it is larger than an upper bound of L. Although Mc Allester s original works posited that the data were independent and identically distributed (i.i.d.) and that the risk function was uniformly bounded (Mc Allester, 1998; 1999), recent advances relaxed these assumptions by deriving generalization bounds for non-i.i.d. data (Seldin et al., 2012b; Barnes et al., 2022), with applications to multi-armed bandits and RL (Fard & Pineau, 2010; Beygelzimer et al., 2011; Tasdighi et al., 2024), and for unbounded losses limited by high-probability bounds (Alquier & Guedj, 2017; Haddouche & Guedj, 2023; Casado et al., 2024; Mbacke et al., 2023). To the best of our knowledge, however, this is the first work promoting the development of PAC-Bayesian bounds for understanding the generalization of GFlow Nets.

3 OVERVIEW OF OUR RESULTS

Before delving into the details of our work in Sections 4, 5, 6 (and further details in the appendices in the supplement), we provide below a brief discussion around our technical results under the light of the formalism presented in Section 2, alongside the main ideas they were built upon.

Non-vacuous generalization bounds for GFlow Nets. The learning objectives in Equation 1, due to the unboundedness of the logarithm, cannot be directly incorporated into standard PAC-Bayesian theorems (Mc Allester, 2013), which assume that the risk function has at least bounded exponential moments (Casado et al., 2024; Rodr ıguez-G alvez et al., 2024a;b). To circumvent this issue, our empirical analysis in Section 5.1 adopts the recently proposed FCS metric (Silva et al., 2024) as the risk functional measuring the accuracy of a trained GFlow Net, which may be written as

LFCS(p F ) = E(τ1,x1),...,(τB,x B) [TV (px1:B T , Rx1:B)] [0, 1], (3)

where px1:B T and Rx1:B are the respective restrictions of p T and R to the B-sized multiset {{x1, . . . , x B}} X of terminal states, and TV is the total variation distance. However, in spite of easily computable, LFCS is not an appropriate learning objective for GFlow Nets due to the potential numerical instability of the non-log-domain. Instead, we minimize LTB as a surrogate objective for LFCS during training and evaluate the generalization bound on LFCS in the inductive fashion mentioned in Section 2. Importantly, Figure 2 shows that the resulting bounds are remarkably tight.

Oracle generalization bounds for GFlow Nets. As a complement, we also establish non-empirical high-probability upper bounds on the population risk of GFlow Nets by assuming that a potentially intractable quantity bounds the corresponding loss function. In Section 5.2, we follow this rationale and demonstrate that there always is an α > 0 for which the set of policy networks of the form αp U + (1 α)p F contains the solution to the flow assignment problem. Armed with such a family of models, which guarantee log probabilities uniformly bounded away from zero, we consider

Published as a conference paper at ICLR 2025

the reverse KL divergence risk (Malkin et al., 2023) to avoid explicitly bounding the flow function F. Although informative, the resulting Theorem 5.2 only considers the trajectories and not transitions as data points. As the number of observed transitions is significantly larger than that of trajectories, we enrich our results by constructing a martingale difference sequence based on the DB loss and adapting Azuma s inequality (Azuma, 1967; Seldin et al., 2012b) to the context of independent martingales to derive a transition-level generalization bound for GFlow Nets. Both approaches, which are respectively encapsulated in Theorems 5.2 and 5.4, show that the population risk can be bounded with high-probability as, apart from technical nuances,

Eθ P [L(θ)] Eθ P [ ˆL(θ)] + O log tm

in which ˆL is an empirical measure of risk, tm is the maximum trajectory length of the state graph, and n is the number of observed data points: either trajectories (Theorem 5.2, α = 0.5) or transitions (Theorem 5.4, α = 1). From an analytical perspective, these results suggest that learning provable generalizable flow assignments is increasingly harder for state spaces having longer trajectories.

4 WHEN DO GFLOWNETS NOT GENERALIZE?

To start our discussion on the generalization of GFlow Nets, we introduce simple, but non-trivial, examples in which a GFlow Net does not learn a generalizable policy network even after minimizing the loss on an arbitrarily large portion of the state space, raising the questions of when do GFlow Nets generalize and how to measure such generalization, which we investigate in Sections 5.1 and 5.2.

A non-generalizable data distribution. To concretize our arguments, we recall the task of set generation for GFlow Nets (Pan et al., 2023a;b; Bengio et al., 2023; Jang et al., 2024). Each state corresponds to a subset of a set W = {1, . . . , W} for a given W; the generative process starts at an empty set so = and iterativey adds elements from W to so until a prescribed size T is achieved. For our purposes, we fix a function u: W [0, 1], representing the log-utility of each w A(so) := W and define the reward R associated to S as R(S) = 1{#S=T } exp{P

w s u(w)}. Also, let p E be a forward policy s.t. p E( |s) is supported on A(s) \ {1} := W \ ({1} s) for every s, i.e., the support of the marginal p E,T of p E on X is the set X of subsets of {2, . . . , W}. We next show that X covers an arbitrarily large portion of X for specific choices of T and W.

Lemma 4.1. For each ξ (0, 1), there exist T and W such that |X | ξ|X|.

0 200000 0.00

81.25% visited

90.62% visited

ϵ-greedy p E

Training epoch

Figure 1: Convergence speed when actions are masked (blue) or not (orange) for different state space sizes.

The (straightforward) proof of Lemma 4.1 can be found in Appendix D. Obviously, we cannot hope that a GFlow Net trained by minimizing an empirical risk defined on trajectories sampled from p E would generalize to unseen states, as no information regarding u(1) would be available during training. To empirically validate our reasoning, we show in Figure 1 that a GFlow Net trained on samples from p E fails to learn the right distribution, whereas a standard ϵ-greedy strategy succeeds. It is remarkable, however, that a GFlow Net is unable to successfuly sample from the target distribution even after minimizing the empirical risk on samples covering over 90% of the state space. From a statistical viewpoint, this behavior can be explained via a change of measure inequality: preference over states is not properly captured by the sampling distribution (p E). We formalize this intuition in the proposition below.

Proposition 4.2 (Generalization depends on the sampling distribution). Let (p F , p B, R) be a GFlow Net and p E,T be (any) distribution over X. Also, recall π(x) represents the normalized target and p T the learned marginal. Define q E,T as an uniform PMF on X, i.e., q E,T (x) = 1/|X|. Then,

TV (p T , π)

v u u t(1 + χ2(q E,T ||p E,T ))Ex p E,T

Eτ p B(τ|x)

" log p F (τ) π(x)p B(τ|x)

in which χ2(P||Q) represents the χ2 divergence between P and Q.

Published as a conference paper at ICLR 2025

We interpret Equation 5 in the following way: If the sampling policy (p E,T ) greatly deviates from the uniform (q E,T ), then a small empirical risk does not necessarily ensure an accurate distributional approximation. In contrast, Equation 5 does not entail that the uniform distribution is the optimal choice for sampling trajectories, as it does not address the algorithmic difficulty of minimizing the empirical risk via SGD. Illustratively, we show in Table 1 in Appendix B the values of χ2(q E,T ||p E,T ) when p E,T is far away from q E,T and of χ2(q E,T ||pϵ,T ) for the ϵ-greedy policy considered in Figure 1. On a fundamental level, these examples underline the importance of taking into account the data distribution for understanding generalization performance. Section 5 elaborates on this problem through the lens of PAC-Bayes bounds, albeit with data-dependent priors (Dziugaite & Roy, 2017; 2018; Dziugaite et al., 2021; P erez-Ortiz et al., 2021).

5 PAC-BAYESIAN GENERALIZATION BOUNDS FOR GFLOWNETS

Towards the objective of understanding GFlow Net generalization, we construct high-probability upper bounds on different risk functions. In Section 5.1, we build upon Mc Allester s empirical bound and Dziugaite s data-dependent priors to derive the first non-vacuous generalization bounds for GFlow Nets. Then, to gain a clearer understanding of the factors hindering the generalizability of these models, we provide both trajectoryand transition-level oracle bounds in Section 5.2 by drawing upon the martingale-based PAC-Bayesian theory for non-i.i.d. data (Beygelzimer et al., 2011).

5.1 NON-VACUOUS EMPIRICAL GENERALIZATION BOUNDS

GFlow Net learning as supervised learning. To rigorously address the generalization of GFlow Nets, we firstly frame the training of these models as a supervised learning problem (Shalev-Shwartz & Ben-David, 2014; Atanackovic & Bengio, 2024). For this, we assume that a set of independently sampled complete trajectories, Tn = {τ1, . . . , τn}, is drawn from a fixed distribution and that each trajectory τi is annotated with a noise-free target, yi = p B(τi|xi)R(xi), with xi representing τi s unique terminal state. Importantly, the only supervision during training comes from the reward function; we do not make assumptions on the distribution over Tn. In this context, minimizing LTB corresponds to finding the least-squares solution to the equation log Z + log p F (τ) = log p B(τ|x)R(x) in p F . Importantly, this setting differs from conventional GFlow Net training algorithms, for which the sampling policy depends on the trajectories observed so far, that is, the trajectories are not independently sampled. Nonetheless, the question of whether GFlow Nets generalize remains relevant even under our relatively simplified conditions, which may be seen as a single-iteration of an ϵ-greedy strategy (Krichel et al., 2024).

A bounded risk functional for GFlow Nets. As mentioned, PAC-Bayesian theory originally relied on the assumption of bounded risk functions (Mc Allester, 1998). Despite recent advances in extending the theory to unbounded losses (Casado et al., 2024; Rodr ıguez-G alvez et al., 2024a;b; Haddouche & Guedj, 2023), most generalization bounds still depend on technical and hard-to-verify conditions, e.g., on exponential moments. For this reason, we use the FCS metric as a measure of risk (Silva et al., 2024); see Appendix B for an unbiased estimator ˆLFCS(p F , Tn) of LFCS(p F ).

Data-dependent priors for PAC-Bayes. For this, we first recall the techniques originally developed by Dziugaite & Roy (2017; 2018); Dziugaite et al. (2021) in a striking series of papers for probing the generalization of overparameterized neural networks in the supervised learning context. To start with, we state below Dziugaite et al. (2021) s empirical PAC-Bayes bound, which combines results from Mc Allester (2013), Rivasplata et al. (2019), and Boucheron et al. (2013). For completeness, we also provide a self-contained proof of Proposition 5.1 in Appendix D in the supplement. Proposition 5.1 (Empirical PAC-Bayesian bounds). For any distribution ζ on parameters θ of p F , let LFCS(ζ) = Eθ ζ[LFCS(R, p T )] and define ˆLFCS(ζ, Tn) similarly. Also, let α (0, 1) and let P be a distribution on θ learned on an uniformly random (1 α)n -sized subset T1 α of Tn. Then,

LFCS(P) ˆLFCS(P, T1 α) + min

η(η + 2ˆLFCS(P, T1 α)), p η

with probability at least 1 δ over T1 α, in which η := KL(P ||Q)+log 2

(1 α)n /δ (1 α)n and Q is a distribution that does not depend on T1 α but may depend on Tα := Tn \ Tα.

Published as a conference paper at ICLR 2025

When the prior distribution Q is naively chosen (e.g., as a standard Gaussian distribution), the KL divergence in Equation 6 often dominates the right-hand side

Sets Bags Seq. SIX6 0.0

Risk (LFCS) Bound

Figure 2: Non-vacuous generalization bounds for the FCS risk functional in Eq. 10.

of the equation and results in vacuous bounds, i.e., LFCS(P) a for some a > 1. To address this issue, the influential work of Dziugaite & Roy (2017) proposed the use of a data-dependent Q learned by minimizing the empirical risk functional on a fraction α of the data and, after learning P by minimizing Equation 6, evaluating the generalization bound on the remaining (1 α) portion of the data, as presented in Proposition 5.1.

Empirical results. We follow a similar approach to derive the first non-vacuous generalization bounds for GFlow Nets in the literature. For this, we disjointly partition the dataset Tn with n = 3 104 into sets Tα and T1 α with α = 0.6. We learn an isotropic Gaussian prior Q on Tα and then a diagonal Gaussian posterior P on Tα T1 α by minimizing the bound in Equation 6. Finally, the bound is evaluated on T1 α to obtain the statistical certificate (P erez-Ortiz et al., 2021). Results in Figure 2 for the tasks of set generation (Pan et al., 2023a; Bengio et al., 2023), bag generation (Shen et al., 2023; Jang et al., 2024), sequence design (Malkin et al., 2022; 2023; Madan et al., 2022) with additive rewards, and SIX6 (Jain et al., 2022; Malkin et al., 2022; Shen et al., 2023) highlight the non-vacuousness of Equation 6 and the generalizability of the trained models. Please refer to Section 6 and to Appendix B for a detailed description of the experimental setup. Appendix A describes the design of the GFlow Net for each of these generative tasks.

5.2 ORACLE GENERALIZATION BOUNDS

Although the previous section s empirical results certify the generalization of the learned policy network to novel trajectories, they do not necessarily shed light on which characteristics of the generative task are hindering the model s generalization capability. In the remaining of this section, we thus derive generalization bounds that, despite not being directly computable, provide a finer understanding of which factors play a role when the goal is to learn a generalizable policy. In particular, we observe that larger trajectories and peakier target distributions tend to make generalization harder when a fixed sampling budget is available. In Section 6, we will see how a distributed algorithm may alleviate these issues (Yagli et al., 2020; Barnes et al., 2022; Sefidgaran et al., 2022).

Trajectory-level bounds. We start by deriving generalization bounds for GFlow Nets when the trajectories are independently sampled (see Section 5.1) and the risk functional is the KL divergence between the forward and backward policies, i.e., KL(p B||p F ), in which p B(τ) p B(τ|x)R(x). This choice is motivated by Malkin et al. (2023) s interpretation of GFlow Nets as a hierarchical variational inference algorithms and by the ability of KL(p B||p F ) to focus the model on highprobability regions of the target, which is a desirable trait of GFlow Nets. Remarkably, we show in Lemma B.1 that KL(p B||p F ) can be bounded by sensibly reparameterizing p F as a mixture policy. Then, as shown in Theorem 5.2 below, this reparametrization enables developing oracle generalization bounds in the fashion of the tight results we derived in Section 5.1.

Theorem 5.2. Let G = (p F , p B, F) be a GFlow Net with policy network p F parameterized as in Lemma B.1. Also, let Q be a probability distribution over the parameters θ of p F . Denote H[p B] = Eτ p B[log p B(τ)] for p B s entropy and MT = maxτ(|τ| log(α 1 maxs τ |Ch(s)|)). Then,

Eθ P [KL(π||p T )] Eθ P

1 i m log p B(τi)

+ ( H[p B] + MT ) η(P, Q, n), (7)

in which we recall that η(P, Q, n) = q

KL(P ||Q)+log 2 n/δ

n and π(x) R(x) is the target.

A few remarks on the excess risk upper bound of Theorem 5.2. Firstly, the assumption that trajectories are sampled according to p B(τ) p B(τ|x)R(x) is consistent with popular strategies for learning GFlow Nets that focus on sampling trajectories leading to high-reward states more often than those leading to low-reward states, e.g., using a replay buffer (Deleu et al., 2022). Secondly, in alignment with well-established practical knowledge, the result in Equation 7 shows it is harder to

Published as a conference paper at ICLR 2025

achieve tighter generalization bounds when the target distribution is spiky with a small entropy term H[p B], and when the generative task is composed of longer trajectories or larger action spaces.

Transition-level bounds. For many applications, the number of observed complete trajectories when training GFlow Nets can be orders of magnitude smaller than the number of collected state transitions. In this context, one may obtain significantly tighter generalization bounds by interpreting the transitions, and not the complete trajectories, as data samples (Lotfi et al., 2024a;b). Indeed, it is assumed that GFlow Nets outstanding potential emerges from its capacity to exploit the compositional structure of the space characterized by the state graph (Bengio et al., 2021; Nica et al., 2022; Shen et al., 2023; Atanackovic & Bengio, 2024). To incorporate this structure into our theoretical bounds, we shift our focus to the design of Azuma-Hoeffding-type concentration inequalities (Azuma, 1967; Mc Diarmid, 1998; Boucheron et al., 2013) applied to the stochastic process induced by the Markov Decision Process (MDP) governing the data-generating process. For this, we start defining a martingale difference sequence based on the DB loss (Bengio et al., 2023, Example 5). Definition 5.3 (A martingale difference sequence for the DB loss). Recall the detailed balance loss LDB(s, s ) = (log F(s)p F (s |s) log F(s )p B(s|s ))2. For a fixed sampling policy p E, we let M(Si, S<i) = LDB(Si, Si 1) Esi p E( |Si 1) [LDB(si, Si 1)|Si 1] , (8)

where S<i = {S1, . . . , Si 1}. Also, we define the natural filtration Ft = σ(S1, . . . , St) generated by the first t states of the Markov process {Si}i 1. Clearly, each M(Si, S<i) is Fi-measurable and ESi[M(Si, S<i)|F<i] = 0, i.e., {M(Si, S<i)}i 1 is a martingale difference sequence.

From this definition, it is immediate that Mt := P

1 i t M(Si, S<i) is a martingale w.r.t. the filtration {Ft}t 1. We defer to Appendix B the discussion regarding its properties and the assumptions imposed on it for proving Theorem 5.4 below. Additionally, we define

L(θ) = E τ p E 1 |τ|

1 i |τ| E [LDB(Si, Si 1)|Si 1] and ˆL(θ) = 1

1 i tj LDB(S(j) i , S(j) i 1)

as the population and empirical DB-based risk functionals for GFlow Nets. Under these conditions, Theorem 5.4 complements Theorem 5.2 with a generalization bound based on the DB loss. Theorem 5.4 (Transition-level generalization bounds for GFlow Nets). Let Mt(θ) be the martingale arising from Definition 5.3, with θ representing the parameters of the forward policy. Also, let Q be a distribution on θ. Assume that LDB(Si, Si 1) U uniformly on (Si, Si 1) and that Mtm(θ)2 K, in which tm is the maximum trajectory length. Similarly, define λ 1/2U and β (0, 1), and let P be a data-dependent posterior distribution on θ. Then, with probability at least 1 δ over the set of independent martingales {So, S(j) 1 , . . . , S(j) tj }1 j n such that So = so almost surely,

Eθ P [L(θ)] 1

β Eθ P h ˆL(θ) i + αT,n

KL(P||Q) + log 2

in which T is the number of observed transitions, αT,n = U 2β(1 β)n + 1 βT λ , and γ = e 2.

Similarly to Theorem 5.2, Theorem 5.4 implies that obtaining tighter generalization guarantees is harder for larger state spaces with longer trajectories when the sampling process is limited by a maximum number of observable transitions (or states) T, which is an often imposed constraint for comparing the sample-efficiency of different learning objectives for GFlow Nets in the literature (Pan et al., 2023b;a; Madan et al., 2022; Malkin et al., 2022; 2023). In the next section, we show how these issues can be addressed via a distributed learning scheme with network-level parallelization.

6 DIVIDE AND CONQUER: DISTRIBUTED LEARNING OF GFLOWNETS

In light of the above analysis, the diverse exploration of state graphs (Section 4) with smaller trajectory sizes (Section 5) is beneficial for the successful training of GFlow Nets. In what follows, we show how these features can be efficiently implemented by recasting the GFlow Net training as an embarrassingly parallel divide-and-conquer algorithm, which we call subgraph asynchronous learning (SAL). This is, to the best of our knowledge, the first method enabling the distributed learning of GFlow Nets with network-level parallelization. We remark that previous work on the topic (da Silva et al., 2024) promoted only the partitioning of the reward function for parallel Bayesian inference and that, in stark contrast to SAL, each client learned from the same state graph.

Published as a conference paper at ICLR 2025

Figure 3: A fixed-horizon DAG partition with three leaves (S1, S2, S3) and one root (So). For inference, the sampling policy is chosen based on the current state.

Algorithm 1 Subgraph Asynchronous Learning

1: S = So Sm j=1 Sj Fixed-horizon partition 2: Ij = Sj So for j {1, . . . , m} 3: Local training 4: parfor j {1, . . . , m} do 5: Minimize Lj ATB in Sj with SGD 6: (pj F , Fj) = arg minp F ,F Lj ATB(p F , F) 7: end parfor 8: Ro : x7 1{x X}R(x)+Pm j=1 1{x Ij}Fj(x) 9: (po F , Fo) = arg minp F ,F LTB(p F , F, Ro)

10: return {(po F , Fo)} S

1 j m{(pj F , Fj)}

6.1 SUBGRAPH ASYNCHRONOUS LEARNING

Overview. There are two ingredients making up SAL: a fixed-horizon partition (FHP) and an assignment function (AF). In short, a FHP defines the state graph split explicitly, while an AF indirectly encodes it by assigning states to partitions. Here, we formally define the former concept. We introduce the idea of an AF and provide a comprehensive theoretical analysis of SAL in Appendix B.4.

Convergence guarantees. We introduce the notion of a FHP of a pointed DAG below. In the flow network perspective, such a partition can be viewed as a collection of possibly overlapping multisource subnetworks, termed leaves, which are grouped together by a single-source network, referred to as root. We use the term fixed-horizon due to the fixed distance of the subnetworks sources to so. Also, note that a FHP is only a partition in the set-theoretical sense when the state graph is a tree.

Definition 6.1 (Fixed-horizon DAG partition). We say that S = So S

1 j m Sj is a FHP of

the state space S, with leaves {Sj}m j=1 and root So, when it satisfies the conditions below:

1. (Disjointness of sources) so So and the sets {Ij := So Sj}m j=1 are pairwise disjoint.

2. (Completeness) If s Sj for a j 1, then all descendants of s are in Sj.

3. (Regularity) If d denotes the shortest-path distance, d(so, Ij) = d(so, Ii) for all i, j.

Under Definition 6.1, we let Xj = Sj X be the set of terminal states reachable from Ij. For conciseness, we denote {Sj}m j=0 = FHP(S, m) when {Sj}m j=0 is a FHP of S with m components. We then illustrate a FHP of a tree in which m = 3 and the Ij, represented by the doubly-stroked circles, are singletons for the blue and green leaves in Figure 3. We are now ready to define SAL.

Definition 6.2 (SAL). Let {Sj}m j=0 = FHP(S, m). For each 1 j m, let Gj = (pj F , pj B, Fj) be a GFlow Net and pj E be any forward policy over the Sj-induced subgraph of the state graph. Finally, let qj be any distribution with full support on Ij. Then, define

Lj ATB(pj F , Fj) = Es qj Eτ pj E( |s)

log Fj(s)pj F (τ|s)

R(x)pj B(τ|x)

in which x represents τ s terminal state, as the amortized trajectory balance (ATB) objective. For the root, let po E be a policy in So and Go = (po F , po B, Ro). For each x (X \ Sm j=1 Xj) (Sm j=1 Ij), let Ro(x) = Rj(x) if x Ij for some j and Ro(x) = R(x) otherwise. In this context, SAL follows a two-step procedure: first, m models are trained in parallel by minimizing Equation 9; then, a global model is estimated by optimizing the TB loss with reward Ro, which we denote by LT B(p F , F, Ro) for a GFlow Net (p F , p B, F). We summarize this approach in Algorithm 1.

Clearly, any flow-based learning objectives (e.g., Sub TB (Madan et al., 2022), Munchausen DQN (Tiapkin et al., 2024), GAFlow Nets (Pan et al., 2023b)), parametrizations (e.g., forward-looking (Pan et al., 2023a), LED (Jang et al., 2024), temperature-scaled (Kim et al., 2024a)), and offpolicy sampling strategies (e.g., replay buffer (Vemgal et al., 2023) and local search (Kim et al., 2024b)) could be employed for estimating both the root and leaf GFlow Nets in Definition 6.2. In

Published as a conference paper at ICLR 2025

Centralized

Cln 1 Cln 2 SAL Cent. Model

Training time (s)

52.56 51.86

Figure 4: SAL leads to faster mode discovery and more accurate approximations given a fixed training time budget (right-most plot). Results for a centralized GFlow Net, for our algorithm (SAL), and the target reproduced from Malkin et al. (2022, Section 5.1) are shown from left to right. Running time for SAL equals the running time of the longest client plus that of the aggregation phase.

Appendix C, we demonstrate the soundness of SAL and conduct an extensive theoretical analysis on the character of the local distributions and error propagation within this framework. Similarly, Appendix E.1 discusses these issues from an empirical viewpoint.

Recursive SAL. As defined in Definition 6.2 and shown in Figure 3, SAL has a single layer: each leaf is directly connected to the root in the underlying FHP. Nonetheless, there is no obstacle preventing us from building SAL upon a multi-layered partition of the state graph, as illustrated in Figure 11 in the supplement. For this, we must first define a hierarchy of partitions. Then, we recursively learn a flow assignment for each partition by starting at the lowest levels of this hierarchy and moving upwards, in the fashion of backward-induction algorithms. Each learning step is based on minimizing the amortized trajectory balance loss in Equation 9 via SGD. We demonstrate the correctness of the resulting algorithm, termed Recursive SAL, in Proposition C.6, which follows from Theorem C.1 and an inductive argument. Although we do not provide an empirical assessment of Recursive SAL in this work, Appendix C.3 considers its potential implications.

6.2 EMPIRICAL ILLUSTRATION

Experimental setup. We evaluate the performance of SAL in six different generative tasks encompassing both synthetic and real-world problems (Appendix C.1). In Appendix C, we extensively discuss how to implicitly define a FHP via an assignment function, which allows for an efficient implementation of SAL. Also, please refer to Appendix B for a detailed account of the experimental setup.

Results. As expected, Figure 4 above, and Figures 7, 9, 10, and 13 in the supplement, show that SAL drastically speeds up the discovery of high-value states for all considered generative problems under varying computational constraints. Complementarily, Figure 6 and Table 2 in Appendix C.2 underline that our distributed algorithm achieves more accurate distributional approximations than its centralized counterpart. We discuss these promising results in more detail in Appendix C.2.

7 CONCLUSIONS

Discussion. We developed the first PAC-Bayesian bounds and non-vacuous statistical guarantees for the generalization of GFlow Nets in the literature. Additionally, our theoretical results provided deeper insights into the negative effect of the trajectory length on the proven learnability of a generalizable policy. Inspired by these conclusions, our distributed algorithm SAL, which is also the first of its kind, exhibited promising performance in both synthetic and real-world problems.

Future works and limitations. We discuss the limitations of our work at large in Appendix E. In particular, we acknowledge that a deeper theoretical understanding of advanced sampling techniques is still required. From a practitioner s perspective, we believe that SAL can greatly improve the performance of GFlow Nets in specialized domains, e.g., NLP (Hu et al., 2024) and drug discovery (Bengio et al., 2021), which are beyond the scope of our work. Finally, it has not escaped our attention that SAL is related to Mankowitz et al. (2016) s Adaptive Skills, Adaptive Partitions (ASAP) framework for learning temporally extended actions in MDPs and may find fruitful applications in multi-task RL by interpreting each leaf (resp. root) GFlow Net as an intra- (resp. inter-) skill policy.

Published as a conference paper at ICLR 2025

ACKNOWLEDGEMENTS

This work was supported by the Fundac ao Carlos Chagas Filho de Amparo a Pesquisa do Estado do Rio de Janeiro FAPERJ (SEI-260003/000709/2023), the S ao Paulo Research Foundation FAPESP (2023/00815-6), and the Conselho Nacional de Desenvolvimento Cient ıfico e Tecnol ogico CNPq (404336/2023-0). We acknowledge the UKRI EPSRC grant EP/Y028783/1.

We acknowledge the Aalto Science-IT Project from Computer Science IT and the FGV TIC for providing computational resources.

Pierre Alquier. User-friendly introduction to PAC-Bayes bounds. Foundations and Trends in Machine Learning, 17(2):174 303, 2024. URL https://arxiv.org/abs/2110.11216.

Pierre Alquier and Benjamin Guedj. Simpler PAC-Bayesian bounds for hostile data. Machine Learning, 107(5):887 902, December 2017. ISSN 1573-0565. doi: 10.1007/s10994-017-5690-0. URL http://dx.doi.org/10.1007/s10994-017-5690-0.

Pierre Alquier, Xiaoyin Li, and Olivier Wintenberger. Prediction of time series by statistical learning: general losses and fast rates, 2012. URL https://arxiv.org/abs/1211.1847.

Lazar Atanackovic and Emmanuel Bengio. Investigating Generalization Behaviours of Generative Flow Networks. ICML 2024 Workshop on Structured Probabilistic Inference & Generative Modeling, 2024. URL https://arxiv.org/abs/2402.05309.

Kazuoki Azuma. Weighted sums of certain dependent random variables. Tohoku Mathematical Journal, Second Series, 19(3):357 367, 1967.

Akshay Balsubramani. PAC-Bayes Iterated Logarithm Bounds for Martingale Mixtures. ar Xiv preprint ar Xiv:1506.06573, 2015. URL https://arxiv.org/abs/1506.06573.

Leighton Pate Barnes, Alex Dytso, and Harold Vincent Poor. Improved Information-Theoretic Generalization Bounds for Distributed, Federated, and Iterative Learning. Entropy, 24(9):1178, August 2022. ISSN 1099-4300. doi: 10.3390/e24091178. URL http://dx.doi.org/10. 3390/e24091178.

Luis A Barrera, Anastasia Vedenko, Jesse V Kurland, Julia M Rogers, Stephen S Gisselbrecht, Elizabeth J Rossin, Jaie Woodard, Luca Mariani, Kian Hong Kock, Sachi Inukai, et al. Survey of variation in human transcription factors reveals prevalent DNA binding changes. Science, 351 (6280):1450 1454, 2016.

Emmanuel Bengio, Moksh Jain, Maksym Korablyov, Doina Precup, and Yoshua Bengio. Flow Network based Generative Models for Non-Iterative Diverse Candidate Generation. In Advances in Neural Information Processing Systems (Neur IPS), 2021.

Yoshua Bengio and Nikolay Malkin. Machine learning and information theory concepts towards an AI Mathematician. Bulletin of the American Mathematical Society, 61(3):457 469, 2024.

Yoshua Bengio, Salem Lahlou, Tristan Deleu, Edward J. Hu, Mo Tiwari, and Emmanuel Bengio. GFlow Net Foundations. Journal of Machine Learning Research (JMLR), 2023.

Alina Beygelzimer, John Langford, Lihong Li, Lev Reyzin, and Robert E. Schapire. Contextual Bandit Algorithms with Supervised Learning Guarantees. In Artificial Intelligence and Statistics, pp. 19 26. PMLR, 2011. URL https://arxiv.org/abs/1002.4058.

Felix Biggs and Benjamin Guedj. On Margins and Derandomisation in PAC-Bayes. In International Conference on Artificial Intelligence and Statistics, pp. 3709 3731. PMLR, 2022. URL https: //proceedings.mlr.press/v151/biggs22a.html.

Felix Biggs and Benjamin Guedj. Tighter PAC-Bayes generalisation bounds by leveraging example difficulty. In International Conference on Artificial Intelligence and Statistics, pp. 8165 8182. PMLR, 2023. URL https://proceedings.mlr.press/v206/biggs23a.html.

Published as a conference paper at ICLR 2025

St ephane Boucheron, G abor Lugosi, and Pascal Massart. Concentration inequalities: A nonasymptotic theory of independence. Oxford University Press, 2013.

Ioar Casado, Luis A Ortega, Aritz P erez, and Andres R Masegosa. PAC-Bayes-Chernoff bounds for unbounded losses. In Advances in Neural Information Processing Systems, 2024.

Olivier Catoni. PAC-Bayesian Supervised Classification: The Thermodynamics of Statistical Learning. Institute of Mathematical Statistics, LECTURE NOTES MONOGRAPH SERIES, 56, 2007. URL https://arxiv.org/abs/0712.0248.

Sitan Chen, Sinho Chewi, Jerry Li, Yuanzhi Li, Adil Salim, and Anru Zhang. Sampling is as easy as learning the score: theory for diffusion models with minimal data assumptions. In International Conference on Learning Representations, 2023.

Yihang Chen and Lukas Mauch. Order-Preserving GFlow Nets. In International Conference on Learning Representations, 2024. URL https://arxiv.org/abs/2310.00386.

David Cohn, Les Atlas, and Richard Ladner. Improving generalization with active learning. Machine Learning, 15:201 221, 1994. URL https://link.springer.com/article/10. 1007/bf00993277.

Tiago da Silva, Eliezer Silva, Ad ele Ribeiro, Ant onio G ois, Dominik Heider, Samuel Kaski, and Diego Mesquita. Human-in-the-Loop Causal Discovery under Latent Confounding using Ancestral GFlow Nets. ar Xiv preprint, 2023. URL https://arxiv.org/abs/2309.12032.

Tiago da Silva, Luiz Max Carvalho, Amauri Souza, Samuel Kaski, and Diego Mesquita. Embarrassingly Parallel GFlow Nets, 2024. URL https://arxiv.org/abs/2406.03288.

Tristan Deleu and Yoshua Bengio. Generative Flow Networks: a Markov Chain Perspective. ar Xiv preprint ar Xiv:2307.01422, 2023. URL https://arxiv.org/abs/2307.01422.

Tristan Deleu, Ant onio G ois, Chris Chinenye Emezue, Mansi Rankawat, Simon Lacoste-Julien, Stefan Bauer, and Yoshua Bengio. Bayesian Structure Learning with Generative Flow Networks. In Uncertainty in Artificial Intelligence (UAI), 2022.

Tristan Deleu, Mizu Nishikawa-Toomey, Jithendaraa Subramanian, Nikolay Malkin, Laurent Charlin, and Yoshua Bengio. Joint Bayesian Inference of Graphical Structure and Parameters with a Single Generative Flow Network. In Advances in Neural Processing Systems (Neur IPS), 2023.

Tristan Deleu, Padideh Nouri, Nikolay Malkin, Doina Precup, and Yoshua Bengio. Discrete Probabilistic Inference as Control in Multi-path Environments, 2024. URL https://arxiv.org/ abs/2402.10309.

Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, and Luke Zettlemoyer. QLo RA: Efficient Finetuning of Quantized LLMs. In Advances in Neural Information Processing Systems, volume 36, pp. 10088 10115, 2023. URL https://arxiv.org/abs/2305.14314.

Yilun Du and Leslie Kaelbling. Compositional Generative Modeling: A Single Model is Not All You Need, 2024. URL https://arxiv.org/abs/2402.01103.

Gintare Karolina Dziugaite and Daniel M Roy. Computing nonvacuous generalization bounds for deep (stochastic) neural networks with many more parameters than training data. In Uncertainty in Artificial Intelligence (UAI), 2017.

Gintare Karolina Dziugaite and Daniel M Roy. Data-dependent PAC-Bayes priors via differential privacy. In Advances in Neural Information Processing Systems, volume 31, 2018.

Gintare Karolina Dziugaite, Kyle Hsu, Waseem Gharbieh, Gabriel Arpino, and Daniel M. Roy. On the role of data in PAC-Bayes bounds. In Artificial Intelligence and Statistics, pp. 604 612. PMLR, 2021. URL https://arxiv.org/abs/2006.10929.

M. Fard and Joelle Pineau. PAC-Bayesian Model Selection for Reinforcement Learning. In J. Lafferty, C. Williams, J. Shawe-Taylor, R. Zemel, and A. Culotta (eds.), Advances in Neural Information Processing Systems, volume 23. Curran Associates, Inc., 2010.

Published as a conference paper at ICLR 2025

Ronald Aylmer Fisher and Frank Yates. Statistical tables for biological, agricultural and medical research. Edinburgh: Oliver and Boyd, 1963.

Yarin Gal, Riashat Islam, and Zoubin Ghahramani. Deep Bayesian Active Learning with Image Data. In International Conference on Machine Learning, pp. 1183 1192. PMLR, 2017.

Timur Garipov, Sebastiaan De Peuter, Ge Yang, Vikas Garg, Samuel Kaski, and Tommi S. Jaakkola. Compositional Sculpting of Iterative Generative Processes. In Advances in Neural Information Processing Systems, 2023.

Pascal Germain, Francis Bach, Alexandre Lacoste, and Simon Lacoste-Julien. PAC-Bayesian Theory Meets Bayesian Inference. In Advances in Neural Information Processing Systems, 2016. URL https://arxiv.org/abs/1605.08636.

Benjamin Guedj. A Primer on PAC-Bayesian Learning. In Proceedings of the French Mathematical Society, volume 33, pp. 391 414. Soci et e Math ematique de France, 2019. URL https:// arxiv.org/abs/1901.05353.

Maxime Haddouche and Benjamin Guedj. Online PAC-Bayes Learning. In Advances in Neural Information Processing Systems, 2022. URL https://arxiv.org/abs/2206.00024.

Maxime Haddouche and Benjamin Guedj. PAC-Bayes generalisation bounds for heavy-tailed losses through supermartingales. Transactions on Machine Learning Research, 2023.

Maxime Haddouche, Paul Viallard, Umut Simsekli, and Benjamin Guedj. A PAC-Bayesian Link Between Generalisation and Flat Minima. In Algorithmic Learning Theory, 2024. URL https: //arxiv.org/abs/2402.08508.

Sepp Hochreiter and J urgen Schmidhuber. Flat minima. Neural computation, 9(1):1 42, 1997.

Matthew Holland. PAC-Bayes under potentially heavy tails. In Advances in Neural Information Processing Systems, volume 32, 2019. URL https://arxiv.org/abs/1905.07900.

Emiel Hoogeboom, Didrik Nielsen, Priyank Jaini, Patrick Forr e, and Max Welling. Argmax Flows: Learning Categorical Distributions with Normalizing Flows. In Third Symposium on Advances in Approximate Bayesian Inference, 2021.

Edward J Hu, Nikolay Malkin, Moksh Jain, Katie E Everett, Alexandros Graikos, and Yoshua Bengio. GFlow Net-EM for learning compositional latent variable models. In International Conference on Machine Learning (ICML), 2023.

Edward J. Hu, Moksh Jain, Eric Elmoznino, Younesse Kaddar, and et al. Amortizing intractable inference in large language models. In International Conference on Learning Representations, 2024. URL https://arxiv.org/abs/2310.04363.

Moksh Jain, Emmanuel Bengio, Alex Hernandez-Garcia, Jarrid Rector-Brooks, Bonaventure F. P. Dossou, Chanakya Ajit Ekbote, Jie Fu, Tianyu Zhang, Michael Kilgour, Dinghuai Zhang, Lena Simine, Payel Das, and Yoshua Bengio. Biological Sequence Design with GFlow Nets. In International Conference on Machine Learning (ICML), 2022.

Moksh Jain, Sharath Chandra Raparthy, Alex Hernandez-Garcia, Jarrid Rector-Brooks, Yoshua Bengio, Santiago Miret, and Emmanuel Bengio. Multi-Objective GFlow Nets. In International Conference on Machine Learning (ICML), 2023.

Hyosoon Jang, Minsu Kim, and Sungsoo Ahn. Learning Energy Decompositions for Partial Inference in GFlow Nets. In International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=P15CHILQlg.

Marco Jiralerspong, Bilun Sun, Danilo Vucetic, Tianyu Zhang, Yoshua Bengio, Gauthier Gidel, and Nikolay Malkin. Expected flow networks in stochastic environments and two-player zerosum games. In International Conference on Learning Representations, 2024. URL https: //arxiv.org/abs/2310.02779.

Published as a conference paper at ICLR 2025

Haotian Ju, Dongyue Li, Aneesh Sharma, and Hongyang R Zhang. Generalization in graph neural networks: Improved PAC-Bayesian bounds on graph diffusion. In International Conference on Artificial Intelligence and Statistics, pp. 6314 6341. PMLR, 2023.

Nitish Shirish Keskar, Dheevatsa Mudigere, Jorge Nocedal, Mikhail Smelyanskiy, and Ping Tak Peter Tang. On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima, 2017. URL https://arxiv.org/abs/1609.04836.

Minsu Kim, Joohwan Ko, Taeyoung Yun, Dinghuai Zhang, Ling Pan, Woochang Kim, Jinkyoo Park, Emmanuel Bengio, and Yoshua Bengio. Learning to Scale Logits for Temperature-Conditional GFlow Nets, 2024a. URL https://arxiv.org/abs/2310.02823.

Minsu Kim, Taeyoung Yun, Emmanuel Bengio, Dinghuai Zhang, Yoshua Bengio, Sungsoo Ahn, and Jinkyoo Park. Local Search GFlow Nets, 2024b. URL https://arxiv.org/abs/2310. 02710.

Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In International Conference on Learning Representations, 2015. URL https://arxiv.org/abs/1412. 6980.

Anas Krichel, Nikolay Malkin, Salem Lahlou, and Yoshua Bengio. On Generalization for Generative Flow Networks, 2024. URL https://arxiv.org/abs/2407.03105.

Salem Lahlou at al. A Theory of Continuous Generative Flow Networks. In International Conference on Machine Learning (ICML), 2023. URL https://proceedings.mlr.press/ v202/lahlou23a.html.

Elaine Lau, Nikhil Murali Vemgal, Doina Precup, and Emmanuel Bengio. DGFN: Double Generative Flow Networks. In Neur IPS 2023 Generative AI and Biology (Gen Bio) Workshop, 2023.

Puheng Li, Zhong Li, Huishuai Zhang, and Jiang Bian. On the generalization properties of diffusion models. In Advances in Neural Information Processing Systems, volume 36, 2024.

Jens Liebehenschel. Ranking and unranking of lexicographically ordered words: An average-case analysis. Journal of Automata, Languages and Combinatorics, 2(4):227 268, 1997.

Dianbo Liu et al. GFlow Out: Dropout with Generative Flow Networks. In International Conference on Machine Learning. PMLR, 2023. URL https://proceedings.mlr.press/v202/ liu23r.html.

Ben London, Bert Huang, Ben Taskar, and Lise Getoor. PAC-Bayesian Collective Stability. In Samuel Kaski and Jukka Corander (eds.), Proceedings of the Seventeenth International Conference on Artificial Intelligence and Statistics, volume 33 of Proceedings of Machine Learning Research, pp. 585 594, Reykjavik, Iceland, 22 25 Apr 2014. PMLR.

Sanae Lotfi, Marc Finzi, Yilun Kuang, Tim G. J. Rudner, Micah Goldblum, and Andrew Gordon Wilson. Non-Vacuous Generalization Bounds for Large Language Models. In International Conference on Machine Learning, pp. 32801 32818. PMLR, 2024a. URL https://arxiv. org/abs/2312.17173.

Sanae Lotfi, Yilun Kuang, Brandon Amos, Micah Goldblum, Marc Finzi, and Andrew Gordon Wilson. Unlocking Tokens as Data Points for Generalization Bounds on Larger Language Models. In Advances in Neural Information Processing Systems, volume 37, pp. 9229 9256, 2024b. URL https://arxiv.org/abs/2407.18158.

Jianzhu Ma, Jian Peng, Sheng Wang, and Jinbo Xu. Estimating the Partition Function of Graphical Models Using Langevin Importance Sampling. In Carlos M. Carvalho and Pradeep Ravikumar (eds.), Artificial Intelligence and Statistics, pp. 433 441. PMLR, 2013. URL http: //proceedings.mlr.press/v31/ma13a.pdf.

Kanika Madan, Jarrid Rector-Brooks, Maksym Korablyov, Emmanuel Bengio, Moksh Jain, Andrei Cristian Nica, Tom Bosc, Yoshua Bengio, and Nikolay Malkin. Learning GFlow Nets from partial episodes for improved convergence and stability. In International Conference on Machine Learning, 2022. URL https://proceedings.mlr.press/v202/madan23a.html.

Published as a conference paper at ICLR 2025

Eran Malach. Auto-Regressive Next-Token Predictors are Universal Learners. In International Conference on Machine Learning, pp. 34417 34431. PMLR, 2024.

Shreshth A. Malik, Salem Lahlou, Andrew Jesson, Moksh Jain, Nikolay Malkin, Tristan Deleu, Yoshua Bengio, and Yarin Gal. Batch GFN: Generative Flow Networks for Batch Active Learning, 2023. URL https://arxiv.org/abs/2306.15058.

Nikolay Malkin, Moksh Jain, Emmanuel Bengio, Chen Sun, and Yoshua Bengio. Trajectory balance: Improved credit assignment in GFlow Nets. In Advances in Neural Information Processing Systems (Neur IPS), 2022.

Nikolay Malkin, Salem Lahlou, Tristan Deleu, Xu Ji, Edward Hu, Katie Everett, Dinghuai Zhang, and Yoshua Bengio. GFlow Nets and variational inference. In International Conference on Learning Representations (ICLR), 2023.

Daniel J. Mankowitz, Timothy A. Mann, and Shie Mannor. Adaptive Skills, Adaptive Partitions (ASAP). In Advances in Neural Information Processing Systems, 2016. URL https: //arxiv.org/abs/1602.03351.

Andreas Maurer. A Note on the PAC Bayesian Theorem. ar Xiv preprint cs/0411099, 2004. URL

https://arxiv.org/abs/cs/0411099.

Sokhna Diarra Mbacke, Florence Clerc, and Pascal Germain. PAC-Bayesian Generalization Bounds for Adversarial Generative Models. In International Conference on Machine Learning, pp. 24271 24290. PMLR, 2023. URL https://arxiv.org/abs/2302.08942.

David Mc Allester. A PAC-Bayesian Tutorial with A Dropout Bound. ar Xiv preprint ar Xiv:1307.2118, 2013. URL https://arxiv.org/abs/1307.2118.

David A Mc Allester. Some PAC-Bayesian theorems. In Proceedings of the eleventh annual conference on Computational Learning Theory, pp. 230 234, 1998.

David A Mc Allester. PAC-Bayesian model averaging. In Proceedings of the twelfth annual conference on Computational Learning Theory, pp. 164 170, 1999.

Colin Mc Diarmid. Concentration. In Probabilistic Methods for Algorithmic Discrete Mathematics, pp. 195 248. Springer, 1998.

Kohei Miyaguchi. PAC-Bayesian Transportation Bound. ar Xiv preprint ar Xiv:1905.13435, 2019. URL https://arxiv.org/abs/1905.13435.

Wendy Myrvold and Frank Ruskey. Ranking and unranking permutations in linear time. Information Processing Letters, 79(6):281 284, 2001.

Andrei Cristian Nica, Moksh Jain, Emmanuel Bengio, Cheng-Hao Liu, Maksym Korablyov, Michael M Bronstein, and Yoshua Bengio. Evaluating generalization in GFlow Nets for molecule design. ICLR2022 Machine Learning for Drug Discovery workshop, 2022.

Ling Pan, Nikolay Malkin, Dinghuai Zhang, and Yoshua Bengio. Better Training of GFlow Nets with Local Credit and Incomplete Trajectories. In International Conference on Machine Learning (ICML), 2023a.

Ling Pan, Dinghuai Zhang, Aaron Courville, Longbo Huang, and Yoshua Bengio. Generative Augmented Flow Networks. In International Conference on Learning Representations (ICLR), 2023b.

Mohit Pandey, Gopeshh Subbaraj, and Emmanuel Bengio. GFlow Net Pretraining with Inexpensive Rewards. Neur IPS 2024 Workshop AIDrug X, 2024.

Mar ıa P erez-Ortiz, Omar Rivasplata, John Shawe-Taylor, and Csaba Szepesv ari. Tighter risk certificates for neural networks. Journal of Machine Learning Research, 22(227):1 40, 2021. URL http://jmlr.org/papers/v22/20-879.html.

Jarrid Rector-Brooks, Kanika Madan, Moksh Jain, Maksym Korablyov, Cheng-Hao Liu, Sarath Chandar, Nikolay Malkin, and Yoshua Bengio. Thompson sampling for improved exploration in GFlow Nets, 2023. URL https://arxiv.org/abs/2306.17693.

Published as a conference paper at ICLR 2025

Omar Rivasplata, Vikram M Tankasali, and Csaba Szepesvari. PAC-Bayes with backprop. ar Xiv preprint ar Xiv:1908.07380, 2019. URL https://arxiv.org/abs/1908.07380.

Omar Rivasplata, Ilja Kuzborskij, Csaba Szepesvari, and John Shawe-Taylor. PAC-Bayes Analysis Beyond the Usual Bounds. In Advances in Neural Information Processing Systems, 2020. URL https://arxiv.org/abs/2006.13057.

Borja Rodr ıguez-G alvez, Omar Rivasplata, Ragnar Thobaben, and Mikael Skoglund. A note on generalization bounds for losses with finite moments. In 2024 IEEE International Symposium on Information Theory (ISIT), pp. 2676 2681. IEEE, 2024a.

Borja Rodr ıguez-G alvez, Ragnar Thobaben, and Mikael Skoglund. More PAC-Bayes bounds: From bounded losses, to losses with general tail behaviors, to anytime validity. Journal of Machine Learning Research, 25(110):1 43, 2024b. URL https://arxiv.org/abs/2306.12214.

Julien Roy, Pierre-Luc Bacon, Christopher Pal, and Emmanuel Bengio. Goal-conditioned GFlow Nets for controllable multi-objective molecular design. ICML 2023 Workshop Deployable Generative AI, 2023. URL https://arxiv.org/abs/2306.04620.

Otmane Sakhi, Pierre Alquier, and Nicolas Chopin. PAC-Bayesian offline contextual bandits with guarantees. In International Conference on Machine Learning, pp. 29777 29799. PMLR, 2023.

Milad Sefidgaran, Romain Chor, and Abdellatif Zaidi. Rate-Distortion Theoretic Bounds on Generalization Error for Distributed Learning. In Alice H. Oh, Alekh Agarwal, Danielle Belgrave, and Kyunghyun Cho (eds.), Advances in Neural Information Processing Systems, 2022.

Yevgeny Seldin, Nicol o Cesa-Bianchi, Peter Auer, Franc ois Laviolette, and John Shawe-Taylor. PAC-Bayes-Bernstein inequality for martingales and its application to multiarmed bandits. In Proceedings of the Workshop on On-line Trading of Exploration and Exploitation 2, pp. 98 111. PMLR Workshop and Conference Proceedings, 2012a.

Yevgeny Seldin, Franc ois Laviolette, Nicol o Cesa-Bianchi, John Shawe-Taylor, and Peter Auer. PAC-Bayesian Inequalities for Martingales. IEEE Transactions on Information Theory, 58(12): 7086 7093, 2012b. URL https://arxiv.org/abs/1110.6886.

Marcin Sendera, Minsu Kim, Sarthak Mittal, Pablo Lemos, Luca Scimeca, Jarrid Rector-Brooks, Alexandre Adam, Yoshua Bengio, and Nikolay Malkin. Improved off-policy training of diffusion samplers, 2024. URL https://arxiv.org/abs/2402.05098.

Shai Shalev-Shwartz and Shai Ben-David. Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press, 2014.

John Shawe-Taylor and Robert C Williamson. A PAC analysis of a Bayesian estimator. In Proceedings of the tenth annual conference on Computational Learning Theory, pp. 2 9, 1997.

Max W. Shen, Emmanuel Bengio, Ehsan Hajiramezanali, Andreas Loukas, Kyunghyun Cho, and Tommaso Biancalani. Towards Understanding and Improving GFlow Net Training. In International Conference on Machine Learning, 2023.

Tiago Silva, Eliezer de Souza da Silva, Rodrigo Barreto Alves, Luiz Max Carvalho, Amauri H Souza, Samuel Kaski, Vikas Garg, and Diego Mesquita. Analyzing GFlow Nets: Stability, Expressiveness, and Assessment. In ICML 2024 Workshop on Structured Probabilistic Inference & Generative Modeling, 2024. URL https://openreview.net/forum?id=B8KXm XFi Fj.

Huayi Tang and Yong Liu. Towards understanding generalization of graph neural networks. In International Conference on Machine Learning, pp. 33674 33719. PMLR, 2023.

Bahareh Tasdighi, Abdullah Akg ul, Manuel Haussmann, Kenny Kazimirzak Brink, and Melih Kandemir. PAC-Bayesian Soft Actor-Critic Learning. In Symposium on Advances in Approximate Bayesian Inference, pp. 127 145. PMLR, 2024. URL https://arxiv.org/abs/2301. 12776.

Published as a conference paper at ICLR 2025

Daniil Tiapkin, Nikita Morozov, Alexey Naumov, and Dmitry Vetrov. Generative Flow Networks as Entropy-Regularized RL. In International Conference on Artificial Intelligence and Statistics, pp. 4213 4221. PMLR, 2024. URL https://arxiv.org/abs/2310.12934.

Brandon Trabucco, Xinyang Geng, Aviral Kumar, and Sergey Levine. Design-Bench: Benchmarks for Data-Driven Offline Model-Based Optimization. In International Conference on Machine Learning, pp. 21658 21676. PMLR, 2022. URL https://arxiv.org/abs/2202. 08450.

Dustin Tran, Keyon Vafa, Kumar Krishna Agrawal, Laurent Dinh, and Ben Poole. Discrete Flows: Invertible Generative Models of Discrete Data. In Advances in Neural Information Processing Systems, volume 32, 2019. URL https://arxiv.org/abs/1905.10347.

Leslie G Valiant. A theory of the learnable. Communications of the ACM, 27(11):1134 1142, 1984.

Vladimir Vapnik. Statistical Learning Theory. John Wiley & Sons, 1998.

Vladimir N. Vapnik. The Nature of Statistical Learning Theory. Springer New York, 2000. URL

http://dx.doi.org/10.1007/978-1-4757-3264-1.

Vladimir N Vapnik and A Ya Chervonenkis. On the uniform convergence of relative frequencies of events to their probabilities. In Measures of Complexity: Festschrift for Alexey Chervonenkis, pp. 11 30. Springer, 2015.

Nikhil Vemgal, Elaine Lau, and Doina Precup. An Empirical Study of the Effectiveness of Using a Replay Buffer on Mode Discovery in GFlow Nets. ICML 2023 Workshop on Structured Probabilistic Inference & Generative Modeling, 2023. URL https://arxiv.org/abs/2307. 07674.

Siddarth Venkatraman, Moksh Jain, Luca Scimeca, Minsu Kim, Marcin Sendera, Mohsin Hasan, Luke Rowe, Sarthak Mittal, Pablo Lemos, Emmanuel Bengio, Alexandre Adam, Jarrid Rector Brooks, Yoshua Bengio, Glen Berseth, and Nikolay Malkin. Amortizing intractable inference in diffusion models for vision, language, and control. In Advances in Neural Information Processing Systems, 2024. URL https://arxiv.org/abs/2405.20971.

Yi-Shan Wu, Andres Masegosa, Stephan Lorenzen, Christian Igel, and Yevgeny Seldin. Chebyshev Cantelli PAC-Bayes-Bennett Inequality for the Weighted Majority Vote. In Advances in Neural Information Processing Systems, 2021.

Semih Yagli, Alex Dytso, and H. Vincent Poor. Information-Theoretic Bounds on the Generalization Error and Privacy Leakage in Federated Learning. In 2020 IEEE 21st International Workshop on Signal Processing Advances in Wireless Communications (SPAWC), pp. 1 5. IEEE, 2020. URL https://arxiv.org/abs/2005.02503.

David W Zhang, Corrado Rainone, Markus Peschl, and Roberto Bondesan. Robust scheduling with GFlow Nets. In International Conference on Learning Representations (ICLR), 2023a.

Dinghuai Zhang, Hanjun Dai, Nikolay Malkin, Aaron Courville, Yoshua Bengio, and Ling Pan. Let the Flows Tell: Solving Graph Combinatorial Optimization Problems with GFlow Nets. In Advances in Neural Information Processing Systems (Neur IPS), 2023b.

Ming Yang Zhou, Zichao Yan, Elliot Layne, Nikolay Malkin, Dinghuai Zhang, Moksh Jain, Mathieu Blanchette, and Yoshua Bengio. Phylo GFN: Phylogenetic inference with generative flow networks. In International Conference on Learning Representations, 2024.

Pan Zhou, Jiashi Feng, Chao Ma, Caiming Xiong, Steven Hoi, and Weinan E. Towards Theoretically Understanding Why SGD Generalizes Better Than ADAM in Deep Learning. In Advances in Neural Information Processing Systems, volume 33, pp. 21285 21296, 2020. URL https: //arxiv.org/abs/2010.05627.

Published as a conference paper at ICLR 2025

SUPPLEMENTARY MATERIAL FOR GENERALIZATION AND DISTRIBUTED LEARNING OF GFLOWNETS

A Background and related works 19

A.1 Directed Acyclic Graphs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19

A.2 Generative Flow Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19

A.3 Learning GFlow Nets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20

A.4 Related works . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20

A.5 Additional review of PAC-Bayes bounds . . . . . . . . . . . . . . . . . . . . . . . 21

B Experimental details and additional discussions 23

B.1 A non-generalizable distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . 23

B.2 Non-vacuous generalization bounds . . . . . . . . . . . . . . . . . . . . . . . . . 23

B.3 Oracle generalization bounds: Lemmata . . . . . . . . . . . . . . . . . . . . . . . 24

B.4 Subgraph Asynchronous Learning . . . . . . . . . . . . . . . . . . . . . . . . . . 25

C SAL: Implementation and Theoretical Analysis 26

C.1 Summary of the experimental campaing . . . . . . . . . . . . . . . . . . . . . . . 26

C.2 An efficient implementation of SAL . . . . . . . . . . . . . . . . . . . . . . . . . 26

C.3 Theoretical analysis and extensions . . . . . . . . . . . . . . . . . . . . . . . . . . 30

C.4 Conditional SAL . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32

C.5 SAL and EP-GFlow Nets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32

D Proofs 34

D.1 Proof of Lemma 4.1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34

D.2 Proof of Proposition 4.2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34

D.3 Proof of Proposition 5.1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35

D.4 Proof of Lemma B.1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36

D.5 Proof of Theorem 5.2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36

D.6 Proof of Lemma B.2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36

D.7 Proof of Theorem 5.4 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36

D.8 Proof of Theorem C.1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38

D.9 Proof of Lemma C.3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39

D.10 Proof of Proposition C.5 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39

D.11 Proof of Proposition C.6 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40

E Limitations and future works 41

E.1 Additional experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41

Published as a conference paper at ICLR 2025

A BACKGROUND AND RELATED WORKS

For probability measures P and Q on the same space X, we recall for convenience that the Kullback-Leibler (KL) divergence is KL(P||Q) = Ex P [log (d P/d Q) (x)], the chi-squared

divergence is χ2(Q||P) = Ex P h ((d Q/d P) (x))2 1 i ; and the total variation distance is given by

TV (P, Q) = sup A X |P(A) Q(A)|. When X is finite, TV (P, Q) = 1/2 P

x X |P(x) Q(x)|. There are other notions of divergence for probability measures; we have mentioned here the ones used in our paper. Readers are refereed to Boucheron et al. (2013) for further details on the topic.

A.1 DIRECTED ACYCLIC GRAPHS

We briefly recall the definition of a pointed directed acycliy graph. For this, we firstly define the concept of a finitely absorbing Markov transition kernel (Lahlou at al., 2023) in a topological space. Henceforth, we let ({so} S {sf}, V) be a topological space endowed with a topology V and two special elements, so and sf. We denote S = {so} S {sf} the state space; so and sf are the initial and final states, respectively. We also assume that both {so} and {sf} are open sets with respect to V. Definition A.1 (Finitely absorbing Markov transition kernel (MTK)). Consider the measure space ( S, Σ, µ) with measure µ and a Borel σ-algerba Σ. Let κ: S Σ R+ be a reference kernel, i.e., κ(s, ): Σ R+ is a measure absolutely continuous with respect to µ for all s, and we recursively define κ t(s, A) = R κ t 1(s, ds )κ(s , A) for measurable A Σ. We say that ρF : S Σ R+ is a finitely absorbing Markov transition kernel if the following conditions are satisfied.

1. ρF (s, ): Σ R+ is an absolutely continuous probability measure with respect to κ(s, );

2. there is a tm < such that ρ tm F (s, {sf}) = 1 for every s {so} S and ρF (sf, {sf}) = 1;

3. s 7 ρF ( , B) is continuous for every measurable B Σ;

4. if ρF (s, {sf}) > 0, then ρF (s, {sf}) = 1;

5. for every A Σ, there is a t < tm such that ρ t F (so, A) > 0.

In this work, S is always finite, V is the discrete topology, and µ is the counting measure. The state graph G is induced by ρF , i.e., (u, v) is an edge in G if and only if ρF (u, {v}) > 0. Acyclicity is ensured by the finitely absorbing property of ρF (item 2 of Definition A.1). Notably, the finite S assumption covers the vast majority of use-cases for GFlow Nets. Under these conditions, we say κ is a backward reference kernel in S with respect to κ if κ(u, {v}) = κ (v, {u}) for all (u, v) S S. We refer the reader to (Lahlou at al., 2023) for an overview of GFlow Nets in infinite spaces.

A.2 GENERATIVE FLOW NETWORKS

A GFlow Net can be seen as a tuple ({so} S {sf}, PF , PB, ρF , ρB, κ, κ , µ, R) for which

1. κ is a forward reference kernel on S; 2. κ is a backward reference kernel in S with respect to κ; 3. ρF (resp. ρB) is a finitely aborsbing MTK with respect to κ (resp. κ ); 4. R: Σ R+ is a measure such that R µ; 5. PF : S Σ R+ is a MTK, called the forward policy, such that PF (s, ) ρF (s, ); 6. PB : S Σ R+ is a MTK, called the backward policy, such that PB(s, ) ρB(s, ).

We denote by p F and p B the densities of PF and PB with respect to their respective reference kernels. For simplicity, we interchangeably let R(x) be the density of R with respect to µ. In this scenario, the set of terminal states X is defined by X = {x S : PF (s, {sf}) > 0}. In practice, p F is parameterized by a neural network and its parameters are estimated to ensure that the marginal of PF (so, ) over X matches R up to a normalizing constant. In the terminology of Section 2, the abstract actions space A would correspond to A = S

s S{(s, u): PF (s, {u}) > 0} and A(s) = {(s, u): PF (s, {u}) > 0}. For most problems, we identify the edge (s, u) with an entity representing the difference between u and s, e.g., a nucleotide base when S is the space of nucleotide strings. We complement the discussion in Section 2 on how to learn a GFlow Net in the next section.

Published as a conference paper at ICLR 2025

A.3 LEARNING GFLOWNETS

Below, we illustrate our definition of GFlow Nets for three common generative tasks. These tasks encompass a large number of applications, e.g., Jain et al. (2022); Shen et al. (2023); Hu et al. (2024; 2023); Liu et al. (2023); Malkin et al. (2022); Pan et al. (2023b); Madan et al. (2022).

1. Autoregressive generation. Each object in S is a string of length up to a L, and G is a tree rooted at so. Also, action sets A(s) represent an alphabet and a transition T (s, a) appends the character a to the string s. Here, p B(s|(s, a)) = 1 for every s S and a A(s). 2. Set generation. Each s S is a subset of W = {1, . . . , W}, with X containing those s of size T. Action sets are A(s) = W \ s and transitions T (s, a) add the element a to s; see Figure 5. 3. Hypergrid environment. Each s S is a point within {0, . . . , H 1}d {0, 1} for given H (size) and d (dimension); so = 0d+1 and the last coordinate indicates whether s X. Also, A(s) = {ei : si < H 1} { }, with ei denoting the i-th canonical vector in Rd and a stop action. Transitions T (s, a) either add a to s, if a = ei for some i; or set sd+1 = 1, if a = .

Figure 5: State graph for the set generation task (W = 3, T = 2).

Notably, T (x, ) {sf} for every terminal state x X. We provided examples of R, F, and p F throughout the main text; in particular, see Sections 4 and 6.2 and Appendix B. Figure 5 illustrates the state graph for the set generation task (omitting sf). To learn a forward policy p F , we minimize a stochastic objective based on the observed trajectories. Besides the ones shown in Equation 1, many loss functions has been recently proposed. The Sub TB loss (Madan et al., 2022), for instance, is defined by

1 n<m |τ| λm n

log F(τn)p F (τn:m)

p B(τn:m)F(τm)

with the constraint that F(x) = R(x) for x X and τn representing the nth element within the trajectory τ. Correspondingly, the Var Grad (Zhang et al., 2023a) and contrastive balance (da Silva et al., 2024) objectives avoid the estimation of F by minimizing

Eτ,τ p E (log p F (τ) log p F (τ ) log p B(τ|x) + log p B(τ |x ) log R(x) + log R(x ))2 ,

which led to faster training convergence in some cases. On the same page, Malkin et al. (2023) considered a series of divergence-based loss functions for training GFlow Nets, showing that the on-policy version of the TB loss (Equation 1) corresponds to the reverse KL between the forward and backward policies in terms of the gradients. In particular, we note that these learning objectives can only be used for estimating the parameters of the root network in SAL. For the leaf networks, which must provide an estimate of the flow function F for the aggregation step, these loss functions cannot be used. Nonetheless, flow-based learning objetctives such as TB and Sub TB often exhibit a convergence speed comparable to that of variational alternatives and are frequently implemented for large-scale applications (Nica et al., 2022; Jain et al., 2022; Hu et al., 2024; Zhou et al., 2024). Learning objectives aside, there is a growing interest in the literature in the development of more effective parametrizations for GFlow Nets, with remarkable results for the forward-looking GFlow Nets (Pan et al., 2023a) and LED-GFlow Nets (Jang et al., 2024), which residually reparameterize F as log F(s) = log ϕ(s) + log F(s) for a (given or learnable) ϕ, temperature-scaled-GFlow Nets (Kim et al., 2024a), in which p F (s |s) exp{ϕ(β) ψ(s |s)} for neural networks ϕ and ψ and an inverse-temperature parameter β > 0, and QGFN (Lau et al., 2023), which learns a Q-function concomitantly to F and p F and prune the values of p F based on Q during inference time for controlable greediness.

A.4 RELATED WORKS

GFlow Nets (Bengio et al., 2021; 2023; Lahlou at al., 2023) were canonically proposed as a reinforcement learning algorithm for sampling compositional objects (e.g., graphs) proportionally to a prespecified reward function. From a theoretical perspective, the relationship between GFlow Nets and variational inference (Malkin et al., 2023), entropy-regularized Q-learning (Tiapkin et al., 2024; Deleu et al., 2024), and diffusion models (Lahlou at al., 2023; Sendera et al., 2024; Venkatraman et al., 2024) has been thoroughly established. From a practitioner s viewpoint, GFlow Nets have

Published as a conference paper at ICLR 2025

been successfuly applied to many problems including, but not restricted to, causal discovery (Deleu et al., 2022; 2023; da Silva et al., 2023), Bayesian phylogenetic inference (Zhou et al., 2024; da Silva et al., 2024), language and image modelling (Hu et al., 2024; Liu et al., 2023; Hu et al., 2023; Venkatraman et al., 2024), combinatorial optimization (Zhang et al., 2023a;b), and drug discovery (Bengio et al., 2021; Nica et al., 2022; Vemgal et al., 2023; Pan et al., 2023a). Indeed, we are confident that problems such as language modelling and drug discovery could greatly benefit from SAL if appropriate policy networks and fixed-horizon partitionings are designed. Nonetheless, given the open-endedness and specialized nature of these applications, we believe that they would be more suited for future, dedicated works and are, hence, not addressed in this text. Correspondingly, recent work by Jiralerspong et al. (2024) highlighted the competitive performance of stochastic GFlow Nets in two-player zero-sum games, specifically, Tic-Tac-Toe and Connect-4, and we are optimistic that an extension of SAL to stochastic environments would exhibit promising results for games having larger trajectories, e.g., Chess and Go. Orthogonal to these advances, the issue of generalization in GFlow Nets has also received significant attention in the literature (Atanackovic & Bengio, 2024; Krichel et al., 2024). In sharp contrast to previous works, ours is the first one that derives PAC-Bayesian bounds and provides non-vacuous statistical guarantees for GFlow Nets, along with a theoretical analysis that highlights which factors are potentially harmful to the model s generalization performance. Notably, a recent discussion by Bengio & Malkin (2024) provides an interesting perspective on generalization, active learning, and GFlow Nets in the context of abstract reasoning for machine-learning-based theorem proving and conjecture formation. Concomitantly, we note that there is a well-established interest in the community towards the development of more sample-efficient learning objectives for speeding up training convergence (Malkin et al., 2022; Deleu et al., 2022; Madan et al., 2022; Zhang et al., 2023a; da Silva et al., 2024; Tiapkin et al., 2024).

A.5 ADDITIONAL REVIEW OF PAC-BAYES BOUNDS

Historically, Mc Allester (1998; 1999) s PAC-Bayesian theorems, which were inspired by the work of Shawe-Taylor & Williamson (1997), were developed towards the objective of providing Probably Approximately Correct (PAC) guarantees to Bayesian algorithms with potentially misspecified prior distributions. Recently, the relationship between PAC-Bayesian theory and (approximate) Bayesian algorithms has been made explicit by Germain et al. (2016). From this perspective, Alquier (2024) provides an informative and comprehensive account of the literature on PAC-Bayes bounds, both theory and applications. In what are now well-established references, Catoni (2007) gives a rigorous foundation of PAC-Bayes bounds in supervised classification, including results on the form of the distributions that optimise the bounds; and Guedj (2019) provides a nice concise exposition of the essential form of PAC-Bayesian inequalities. In the context of contemporary machine learning, PAC-Bayesian theory has found enormous success in the development of numerical generalization bounds for overparameterized neural network classifiers, achieving non-vacuous results (Dziugaite & Roy, 2017; 2018) and tight certificates (P erez-Ortiz et al., 2021), which subsequent works have applied even for large language models (Lotfi et al., 2024a;b) with billions of parameters through appropriate compression techniques (Dettmers et al., 2023). In a recent work, Malach (2024) introduced the notion of length complexity for next-token autoregressive learning on Chain-of-Thought data, referring to the minimum number of iterations required by an AR learner to compute a target function, which is (vaguely) connected to our results regarding the harmful effects of the maximum trajectory size on GFlow Net learning. Importantly, the advantegeousness of distributed approaches for the generalization performance of learning algorithms was already pointed out by Yagli et al. (2020); Barnes et al. (2022); Sefidgaran et al. (2022); similarly to SAL, these authors consider the problem of training a set of models in parallel and subsequently aggregating them with a (possibly randomized) estimator in a central server. In spite of these advances, the development of tighter PAC-Bayes bounds with weaker assumptions on the risk functional, e.g., heavy tailedness instead of boundedness, is still an active research field (Holland, 2019; Wu et al., 2021; Balsubramani, 2015; London et al., 2014; Biggs & Guedj, 2023; Rivasplata et al., 2020; Rodr ıguez-G alvez et al., 2024a;b). Also, the development of PAC-Bayesian theory in the setting of non-i.i.d. data is still relatively underdeveloped when compared against other branches of machine learning, albeit there are interesting results in online learning (Haddouche & Guedj, 2022), reinforcement learning (Fard & Pineau, 2010; Beygelzimer et al., 2011; Sakhi et al., 2023), and time series (Alquier et al., 2012). Finally, PAC-Bayesian theorems provide statistical guarantees for stochastic predictors, which are arguably not frequently used in practice, and the problem of derandomizing the resulting bounds

Published as a conference paper at ICLR 2025

is still mostly open. Notably, the derandomization of PAC-Bayes bounds has a non-negligible cost, and we refer the reader to Miyaguchi (2019); Biggs & Guedj (2022) for further details on this topic.

Published as a conference paper at ICLR 2025

B EXPERIMENTAL DETAILS AND ADDITIONAL DISCUSSIONS

All experiments were conducted on a single Linux machine with 128 GB of RAM and featuring a NVIDIA RTX 3090 GPU and 12th Gen Intel(R) Core(TM) i9-12900K CPU. Unless specified otherwise, the code for reproducing the experiments below was executed on this GPU.

B.1 A NON-GENERALIZABLE DISTRIBUTION (W, T) χ2(q E,T ||p T,E) χ2(qϵ,T ||p T,E) (32, 6) 1.20 103 1.32 (64, 6) 4.56 102 1.24

Table 1: χ2 divergence between the exploratory (pruned, p E,T , and ϵ-greedy, pϵ,T ) and uniform (q E,T ) distributions.

For the experiments in Figure 1, we considered the set generation task (see Appendix A) with W {32, 64} elements to choose from and set size S = 6, and the forward policy was parameterized by an MLP with 2 64-dimensional layers. The elements log-utilities u were sampled from [ 1, 1] prior to training and the resulting values were normalized so that the largest reward of a set was 5. For both settings in Figure 1, the models were trained for 1500 epochs with a batch size of 128. To compute the quantities in Table 1, we compared the uniform policy of an untrained GFlow Net against a policy p E such that the (unnormalized) logit corresponding to the addition the element 1 is set to log p E(1|s) = 11.5 log 10 5. Table 1 shows the large discrepancy between the resulting p E and an uniform policy, providing a taste for the upper bound in Proposition 4.2.

B.2 NON-VACUOUS GENERALIZATION BOUNDS

A bounded risk functional for GFlow Nets. We start recalling the definition of the flow-consistency in subgraphs (FCS) metric (Silva et al., 2024). Given a policy p E, the FCS is defined as

LFCS(R, p T ) = Eτ1,...,τB p E

1 j B p T (xj) R(xi) P

1 j B R(xj)

in which B 2 is a (typically small) given integer. Equivalently, FCS may be seen the expected total variation distance between the learned p T and target R distributions over random subsets of X. It was shown by Silva et al. (2024) that LFCS(R, p T ) = 0 if and only if p T (x) R(x), i.e., the model samples correctly from the distribution proportional to R in X. Then, equipped with the dataset Tn described in Section 4, an unbiased estimate of LFCS is

ˆLFCS(Tn, R, p T ) = 1 2N

k1,...,k B U{1,...,n}

p T (xki) PB j=1 p T (xkj) R(xki) PB j=1 R(xkj)

in which the outer summation covers N uniformly random B-sized subsets of {1, . . . , n} and xki represents the kith observed terminal state in Tn. Importantly, LFCS [0, 1] for any R and p T , which enables the implementation of well-known algorithms for tightening PAC-Bayesian generalization bounds through the adoption of data-dependent priors (Dziugaite et al., 2021; Maurer, 2004).

Experimental details for computing non-vacuous bounds. To achieve the results illustrated in Figure 2, we use Tα to learn an isotropic Gaussian prior Q with variance 10 6 over the parameters θ of an MLP with 3 128-dimensional layers defining the forward policy by minimizing the expected TB loss on Tα under Q. For each problem, we used the same architecture of the neural network, changing only the input and output dimensions, and the resulting models were trained for 64 epochs on their respective datasets. Then, we freeze θ and learn both the mean and the diagonal covariance of a Gaussian posterior P over the parameters of a policy network by minimizing the upper bound in Equation 6 with ˆLFCS substituted by an unbiased estimate of the TB loss on Tα T1 α. Finally, we evaluate the upper bound in Equation 6 on T1 α to certify its tightness. We closely followed the experimental setup of Dziugaite et al. (2021); P erez-Ortiz et al. (2021) for conducting these experiments. In particular, the data-splitting protocol for learning the prior, learning the posterior, and evaluating the bound is analogous to the one used by P erez-Ortiz et al. (2021). Similarly, in contrast to the other experiments, which rely on the Adam optimizer (Kingma & Ba, 2015), we use SGD with a fixed learning rate of 10 3 that presumably achieves a flat minimum (Keskar et al., 2017) with potentially better generalization properties (Hochreiter & Schmidhuber, 1997; Zhou et al., 2020; Haddouche et al., 2024). Finally, we acknowledge Dziugaite et al. (2021) for making their code publicly available and adhering to the best current practices of scientific reproducibility.

Published as a conference paper at ICLR 2025

B.3 ORACLE GENERALIZATION BOUNDS: LEMMATA

Trajectory-level bounds. The technical lemma below ensures that KL(p B||p F ) can be directly bounded by adopting a mixture transition policy, sometimes called an α-uniform policy (Hu et al., 2023), that keeps the trajectory-level probabilities away from zero and ensures the boundedness of the log-probabilities (Dziugaite et al., 2021; Lotfi et al., 2024a) without limiting the GFlow Net s ability to learn the correct solution that samples from X proportionally to the reward. Lemma B.1 (Realizability of mixture policies). Let p U( |s) denote the uniform policy on the state space S with reward R, i.e., p U(s |s) = 1 |Ch(s)|. Then, there is a α (0, 1] s.t. the family { p F : p F ( |s) = αp U( |s) + (1 α)p F ( |s)} contains a policy sampling from X in proportion to R.

In the classical statistical learning terminology, the result above states that the family of α-uniform policy networks is realizable, meaning that a member of this family satisfies the desired balance conditions. However, as we note in the proof of Lemma B.1, finding such α depends on the knowledge of the minimum value of R(x) on X, which may be an NP-hard problem for some generative instances (Zhang et al., 2023b; Ma et al., 2013) that cannot be swifitly solved. Since the resulting generalization bound depends on hardly computable quantities, we call it a oracle bound, similarly to the distribution-dependent PAC-Bayesian inequalities in, e.g., (Alquier et al., 2012; Alquier, 2024).

Transition-level bounds. From Definition 5.3, we can readily conclude that the stochastic process

1 i t M(Si, S<i) = X

1 i t LDB(Si, Si 1) Esi p E( |Si 1) [LDB(si, Si 1)] (12)

is a martingale with respect to the filtration {Ft}t 1. In Theorem 5.4, we developed concentration inequalities for Mt to derive transition-level generalization bounds for GFlow Nets. Complementarily, the lemma bellow shows how the martingale Mt is connected to the traditionally implemented trajectory-wide DB loss (see Equation 1). There, we assume that trajectories have fixed length, an assumption that was also considered by Malkin et al. (2023) when showing that a GFlow Net can be seen as an instantiation of a hierarchical variational inference model. Lemma B.2. Let p E be the sampling distribution and p E,T be the corresponding marginal over terminal states. Then, by denoting τ = (S1, . . . , Sl) with fixed l,

1 i l LDB(Si, S<i)

1 i l ESi 1 p E,T ESi p E( |Si 1) [LDB(Si, Si 1)|Si 1] .

In other words, the trajectory-wise objective in the left-hand side of Equation 13, which is often used as a learning objective for GFlow Nets (Pan et al., 2023a;b; Madan et al., 2022; Bengio et al., 2023; Jang et al., 2024; Silva et al., 2024), corresponds to the transition-wise objective in Equation 8 when the trajectories are sampled in a Markovian fashion. Under these circumstances, we defined the risk functional associated to a specific parameterization θ of the policy network as

L(θ) = Eτ p E 1 |τ|

1 i |τ| ESi p E( |Si 1) [LDB(Si, Si 1)|Si 1] , (14)

in which LDB implicitly depends on θ via the forward policy p F . Importantly, we take the trajectory s length |τ| into account when defining L(θ), which is often done in practice (Zhang et al., 2023b). Then, given a set {so, S(j) 1 , . . . , S(j) tj }n j=1 of independently sampled trajectories, we define

1 i tj LDB(S(j) i , S(j) i 1) (15)

as the empirical estimate of L(θ). Under these conditions, Theorem 5.4 established a highprobability upper bound of L(θ) as a function of ˆL(θ) and of some characteristics of the generative process. We recall, however, that two assumptions were required to achieve this: that the DB loss and thus the martingale difference sequence M are almost surely bounded and that the training is constrained by a pre-specified transition budget. In practice, the boundedness can be achieved by

Published as a conference paper at ICLR 2025

either clipping the loss function (Mc Allester, 1999; 2013) or, when more detailed information about R is available, constraining the output of the neural networks in the fashion of Lemma B.1 with the knowledge that the optimal flow F (s) satisfying the detailed balance is bounded by F (s) minx X R(x), P

x X R(x) for each state s (Bengio et al., 2023), or a more refined version of this constraint (with upper and lower limits possibly depending on s). On the other hand, due to the CPUbounded nature of GFlow Net transition sampling (which cannot be easily parallellized in a GPU), we assume that training is computationally limited by a fixed number of observed transitions. Hence, to promote an equitable assessment of different generative tasks, we assume in Theorem 5.4 that the number n of sampled trajectories for training depends on a fixed budget of sampleable transitions T.

B.4 SUBGRAPH ASYNCHRONOUS LEARNING

Please refer to Section C.2 for a detailed experimental evaluation of SAL. We would like to emphasize that all experiments below are based on standard practices for GFlo Net training, with trajectories sampled from an ϵ-greedy sampling policy, and not on the simplified setting of Section B.2.

Published as a conference paper at ICLR 2025

C SAL: IMPLEMENTATION AND THEORETICAL ANALYSIS

Sampling correctness. We first recall how to sample a x X in the context of SAL. For a given collection {po F } {pj F }m j=1 of forward policies trained in the style of Definition 6.2, we do so by starting at so and following the root policy until we reach either a terminal state x X or a leaf partition Sj. In the former case, we interrupt the generation and return x as a sample. In the latter, we proceed to X by following the leaf s policy pj F , as shown in the highlighted trajectory in Figure 3. In Theorem C.1, we demonstrate that this approach samples x X proportionally to R(x) when both the leaf and root policies globally minimize their respective learning objectives.

Theorem C.1 (Sampling correctness of SAL). Let {Sj}m j=0 = FHP(S, m) and {Gj}m j=0 be the corresponding GFlow Nets. Let p ,o F and {p ,j F }m j=1 be global minimizers of their respective learning objectives. Then, the marginal distribution over X induced by the learned policies {p ,j F }m j=0,

p T (x) = X

τ : so s p ,o F (τ|so) X

τ : s x p ,j F (τ |s), (16)

matches the target distribution π(x) := R(x)/Z, with Z = P

Remarkably, Theorem C.1 establishes SAL as the first asymptotically correct general-purpose distributed learning algorithm for GFlow Nets. On the other hand, a successful implementation of SAL requires having an efficient mechanism concomitantly enabling to sample states from a given partition (to minimize LATB) and to recover the partition of a state (for inference). This is the reasoning behind what we name, and have long named, an assignment function. In Section C.2, we develop such mechanisms for some commonly considered generative tasks in the GFlow Net literature and provide an empirical analysis asserting the effectiveness of the resulting algorithm.

C.1 SUMMARY OF THE EXPERIMENTAL CAMPAING

We provide a summary of the considered generative tasks for assessing the performance of SAL. For a comprehensive discussion on the implementation of SAL, please consult the sections below.

1. Hypergrid (Bengio et al., 2021; Malkin et al., 2022; 2023; Pan et al., 2023b; Krichel et al.,

2024). We consider both a 8 x 8 and a 64 x 64 hypergrid environment (Section 2) with (Malkin et al., 2022, Section 5.1) s reward function, which is illustrated in Figure 4 for H = 8.

2. SIX6 (Jain et al., 2022; Malkin et al., 2022; Shen et al., 2023; Chen & Mauch, 2024; Kim et al.,

2024a). We generate 8-sized nucleotide strings. The reward represents wet-lab DNA binding measurements to a human transcription factor (Barrera et al., 2016; Trabucco et al., 2022).

3. PHO4 (Jain et al., 2022; Malkin et al., 2022; Shen et al., 2023; Chen et al., 2023). Similarly, we construct 10-sized nucleotide strings; the reward reflects wet-lab measurements of DNA binding activities to a yeast transcription factor (Barrera et al., 2016; Trabucco et al., 2022).

4. Bit sequences (Malkin et al., 2022; Madan et al., 2022; Rector-Brooks et al., 2023; Tiapkin et al., 2024). We produce 60-sized binary sequences. Given a subset M of such sequences, we define R(x) = exp{ minm M d L(x, m)}, in which d L is the edit distance.

5. Sequence design (Jain et al., 2022; da Silva et al., 2024). We build 8-sized sequences of {1, . . . , 6}. Also, R(x) = P8 i=1 g(i)f(xi), with f and g being [ 1, 1]-valued functions.

6. Set generation (Bengio et al., 2023; Pan et al., 2023a). We assemble 16-sized subsets of a fixed 32-sized set. We employ the same additive reward function described in Section 4.

C.2 AN EFFICIENT IMPLEMENTATION OF SAL

We start defining an assignment function.

Definition C.2 (Assignment function). Let f : S {0, 1, . . . , m} := [m], in which m is the number of available computational units. Assume that f satisfies the following conditions.

1. (Completeness). f 1(j) = for each j, i.e., f assigns at least one state to each available unit;

Published as a conference paper at ICLR 2025

2. (Consistency). {Sj = f 1(j)}m j=0 is a fixed-horizon partition of S.

Then, f is called an assignment function and {Sj}m j=0 is the fixed-horizon partition associated to f.

Condition (1) above, which we call completeness, ensures that no computational unit is wasted, whereas condition (2) consistency guarantees that the partition of S induced by f 1 is a FHP. In this context, we denote by RV(S) the space of S-valued random variables (measurable functions). Then, we say that a function g: [m] RV(S) is a stochastic inverse of f if g(j) f 1(j) with probability one for each j [m]; a similar concept exists in the literature of discrete normalizing flows (Hoogeboom et al., 2021; Tran et al., 2019). Notably, the distribution qj over the subnetworks sources in the definition of SAL (see Eq. 9) corresponds to the PMF of the random variable g(j).

Importantly, to efficiently implement SAL, one only needs to develop an f that is both fast to compute and easily stochastically invertible. In this section, we show how to design such an assignment function for the problems of autoregressive design and set generation and for the hypergrid environment. The reader is reminded to recall Appendix A for an overview of each generative task.

Sets Sequences Centralized 0.092 0.001 0.126 0.012 SAL 0.072 0.008 0.094 0.005

Table 2: Total variation distance between target and learned measures for the centralized model (top row) and SAL (bottom row).

SAL for autoregressive models. We first illustrate the concept of an assignment function for autoregressively generated objects, which are very common in applications (Jain et al., 2022; Malkin et al., 2022; Jiralerspong et al., 2024; Hu et al., 2024). For this problem, each state s is represented as an element of the set [[0, k 1]]L for fixed k (the vocabulary size, e.g., k = 4 for nucleotide strings) and L (the sequence s length). Then, to construct a fixed-horizon partition, we choose a distance D from the initial state and define

1 i D si ki 1 (17)

as the k-ary representation of s. Then, f(s) = 1{#s D} (h(s) (mod m) + 1) is our assignment function. Importantly, both f(s) and f 1(j) add an negligible computational overhead to the training procedure. Indeed, to sample s from f 1(j) for j 1, we first define h(s) = m ξ+(j 1) with ξ randomly sampled from [[0, k D 1/m ]]. Thus, to recover s from h(s), we only need to solve a (triangular) linear system; the details are provided next. Let hn(s) = m ξ+(j 1) (mod kn). Also,

1 i D siki 1 (mod kn) = X

1 i n siki 1 kn 1 (18)

when n D. Clearly, h1(s) = s1 and, recursively, hn(s) = kn 1sn + hn 1(s). Therefore, s = (s1, . . . , s D) jointly satisfy the triangular system Ts = h, in which Ti,j = 1{i j} kj 1 and hi = h(s) (mod ki) for i, j {1, . . . , D}. This system can be efficiently solved via backward substitution in parallel for a batch {h(s1), . . . , h(s B)} of B sequences.

1 2 3 4 10 5

Centralized

1 2 3 4 10 5

0.25 0.50 0.75 1.00 10 4

Figure 6: Target vs. learned distributions.

SAL for set generation. As in Section 4, we also consider the problem of generating S-sized sets with elements extracted from a source W = {1, . . . , W} of fixed size W. In this setting, each s W can be uniquely represented as a binary vector s {1, 0}W with si = 1 indicating that i W is a member of s. Notably, the elements s at distance D S to the initial state can also be completely described by PW i=1 si = D. For these elements, the prefix s1:D {1, 0}D has at least max{0, 2D W} components equal to 1. Then, similarly to Equation 17, we define the assignment function f for each s {1, 0}W

with PW i=1 si = D as the binary representation of s modulo the number of computational units,

1 i D 2i 1si (mod m) + 1. (19)

Published as a conference paper at ICLR 2025

5000 10000 15000 20000 25000 0

Hypergrid 64 x 64

200 400 600 800 1000 0

500 1000 1500 2000 0

1000 2000 3000 4000 5000 0

Bit sequences

500 1000 1500 2000 0

Set generation

Centralized SAL

Total number of epochs

Figure 7: SAL enacts faster mode discovery under varying time budgets. The horizontal axis represents the total number of epochs used by the centralized model, which is set to the sum of the number of epochs for each leaf GFlow Net and the root GFlow Net to ensure the approaches are fairly compared. We provide additional evidence for the enhance performance of SAL in Figure 13.

On the other hand, let νmin = 2max{0,2D W } 1/m and νmax = 2D 1/m . To stochastically reverse f, we define h(j) = m ξ + (j 1) with ξ P(νmin, νmax, λ) sampled from a Poisson truncated at [νmin, νmax] and parameterized by λ. Then, we let [h(j)] {1, 0}D be the corresponding bit-wise representation of h and yj {1, 0}W D be a random binary vector with exactly D PD i=1[h(j)]i components equal to 1, obtained via Fisher-Yates shuffling algorithm (Fisher & Yates, 1963). Finally, we construct a sample s = ([h(j)], yj) {1, 0}W by concatenating [h(j)] and yj. Notably, an assignment function is closely related to the concept of ranking and unraking functions in computational combinatorics (Myrvold & Ruskey, 2001). To see this, we recall that a ranking r (resp. unranking u) function of a set S (resp. {0, . . . , |S| 1}) injectively maps each member of S to an element of {0, . . . , |S| 1} (resp. S). A natural choice for r is based on the lexicographic order on S (Liebehenschel, 1997); the corresponding u, however, may not be efficiently computable. Given a ranking function r and a number m of computing nodes for training a GFlow Net, we may defined an assignment function as f(s) = r(s) (mod m)+1. We also provide an implementation of a lexicographic-based ranking and unranking functions for the set generation task to support future research on the development of more effective partioning schemes for SAL.

SAL for the hypergrid environment. In conclusion, we also consider the difficult-to-explore hypergrid environment, which is defined by a distribution supported on [[0, H 1]]d for fixed H (the grid s size) and d (the grid s dimension) (Bengio et al., 2021). For this problem, we note the states x at distance D to the initial state can be fully described by the equation P

1 i d xi = D. Equivalently, each x satisfying the above equation can be injectively mapped to a point within the (k 1)-simplex. Hence, we define the assignment function f over states at distance D from so as

f(x) = min 0 j<m

1/d 1 + [j = m 1]

The exponent 1/(d 1) is meant to ensure the workload is approximately homogeneously distributed among the computational units. To sample from f 1(j) for j 1, we let

νmin = l D (j 1/m) 1/d 1m and νmax = j D (j/m) 1/d 1k + [j = m 1] (21)

and pick x1 uniformly at random from [[νmin, νmax]]. Then, (x2, . . . , xd) is drawn from a Dirichletmultinomial with number of trials D x1 and concentration parameter α set (arbitrarily) to 1.

Figure 8: FHP for the hypergrid.

Remarkably, the hypergrid environment illustrates an approach differing from the strategy of encoding-as-integer and computing-the-remainder that was implemented for the other tasks. Figure 8 shows the partition for d = 2, H = 8, and m = 2, which was the setup for Figure 4. We represent in red and teal the sources of the subnetworks assigned to the leaves j = 1 and j = 2, respectively, and in blue the remaining states. Recall that, by definition, all descendants of a state s are members of s s partition. In this scenario, we hope that the development of sophisticated and expert-driven partitioning techniques will greatly benefit the use of GFlow Nets in specialized domains, e.g., drug discovery.

SAL results in better distributional approximations. Figure 4 shows that, when compared against a centralized approach, SAL achieves a better distributional approximation for the hypergrid environment under a fixed time-budget (for SAL, the training time is the longest client s training time

Published as a conference paper at ICLR 2025

SAL Cent. 0

Hypergrid 64 x 64

SAL Cent. 0

SAL Cent. 0

SAL Cent. 0

25000 23164

Bit sequences

SAL Cent. 0

Set generation

Figure 9: SAL improves the discovery of high-value states for all considered tasks. For a fair comparison, the centralized model is allowed to explore for twice the number of epochs permitted to each client, ensuring the training times are roughly the same for SAL and the standard GFlow Net.

Centralized

Cln 1 Cln 2 SAL Cent. Model

Training time (min)

23.33 23.30

Figure 10: SAL results in a more accurate approximation than a centralized approach for a similar time budget on the 12 12 grid. Complementarily to Fig. 4, all models are trained by minimizing the Sub TB (instead of TB) objective with λ = 0.9. The running time for SAL is determined by the training time of the longest client (blue columns) plus the much faster aggregation step (green column).

plus the time for aggregation). Figure 6 and Table 2 corroborate this claim for the set generation and sequence design tasks, showcasing that SAL learns a distribution that matches the target more closely than a standardly trained GFlow Net. For the other tasks, learning an accurate distributional approximation is not as important as finding high-value objects, and that is the reason we do not consider them here. Notably, these results are consistent with Theorems 5.2 and 5.4: by reducing the size of the state graph that each model needs to focus on, SAL potentially faciliates the learning of a generalizable policy network and leads to a more accurate approximation to the target.

SAL greatly improves mode-discovery. In the GFlow Net literature, a mode is often defined as a state x whose associated reward R(x) is larger than a predefined threshold t; see, e.g., (Bengio et al., 2021; Pan et al., 2023a;b; Madan et al., 2022; Jang et al., 2024; Malkin et al., 2022). For our experiments, we fix t = 0.1 for the hypergrid environment and, for the other generative tasks, we sample an initial batch of 2 104 from the uniform policy and set t as the 0.99 quantile of the observed rewards. Importantly, the same threshold is used for both the centralized, leaf, and root GFlow Nets. Under these conditions, Figures 4, 9, 7, and 13 show that SAL enacts a drastic improvement of the mode-discovery rate over a centralized approach for varying computational budgets in all considered generative problems, leading to the discovery of up to 8x more modes. There are two reasons for this. Firstly, the distributed nature of SAL ensures that a much larger portion of the state space is explored in a significantly shorter amount of time. Secondly, each client model focus on a subset of the state graph and may be regarded as a specialist in the corresponding subtask. By collecting the samples fostered by these local specialists, we end up with a significantly more diverse and valuable collection than the one that would be obtained by, e.g., independently training multiple GFlow Nets in parallel. Remarkably, this interpretation highlights the relevance of appropriately defining a fixed-horizon partition of the state graph, an issue that defines a key future direction for our work, as we discussed in Section 7 of the main text.

Alternative learning objectives for SAL. For simplicity of exposition, we outlined SAL in Definition 6.2 as a collection of TB-minimizing GFlow Nets. However, as previously discussed, one can straightforwardly adapt alternative learning objectives (e.g., Sub TB (Madan et al., 2022)) and sampling techniques (e.g., replay buffer (Deleu et al., 2022)) to the context of SAL. We illustrate

Published as a conference paper at ICLR 2025

these extensions here. Firstly, the Sub TB objective for the jth partition would take the form

Lj Sub TB(p F , F) = E s qj E τ pj E( |s)

1 s<t |τ| λt s

log F(τm)p F (τm:n|τm)

F(τn)p B(τn:m|τn)

in which qj and pj E are a distribution over initial states of and a sampling policy for the jth partition, respectively. Secondly, the replay buffer would store the trajectories τ leading to high-value states within the jth partition, as measured by either R (for leaf partitions) or Ro (for the root partition; see Algorithm 1). Figure 10 compares the accuracy of SAL against a centralized GFlow Net, both of which trained by Sub TB minimization (λ = 0.9), for the 12 12 hypergrid environment. Similarly to Figure 4, SAL achieves a better distributional approximation in this case. Additionally, we found that Sub TB leads to faster convergence with respect to (A)TB (not reported) in this particular problem, consistently with the evidence at Madan et al. (2022, Figure 1). On the other hand, our experiments did not provide evidence in favor of using the replay buffer. However, we acknowledge that a deeper empirical investigation, in the fashion of Vemgal et al. (2023) s work, is required.

C.3 THEORETICAL ANALYSIS AND EXTENSIONS

This section aims to answer two core questions regarding the nature of SAL. From a sampling perspective, we ask which distribution each client learns and suggest potential diagnostic techniques to evaluate their distributional accuracy. From a distributed learning standpoint, we assess the extent to which local errors are propagated to the global model. Additionally, we formally extend SAL to accommodate the learning over multi-layered fixed-horizon partitions of the state graph.

Local sampling distributions. Each leaf GFlow Net in SAL learns a distribution over a subset Xj of the set of terminal states X. The character of such distribution, however, was not considered in the foregoing discussion, and one may wonder whether it just corresponds to the restriction of the target to Xj. As we show below, this is not generally the case: the optimal leaf distribution, which we denote by pj T , depends on both Xj and on the specific structure of the state graph induced by the leaf Sj. Proposition C.3 (Local sampling distributions). Let {Sj}m j=0 = FHP(S) and {Gj}m j=0 be the corresponding GFlow Nets. Also, denote by TDc(s) the set of terminal descendants of s on the original state graph, i.e., x TDc(s) if x X and there is a directed path from s to x. Then, for fixed backward policy p B, the solution that globally minimizes Equation 9 satisfies

x TDc(s) R(x) X

τ : s x p B(τ|x) and pj T (x|s) = R(x)

τ : s x p B(τ|x) (22)

for each j {1, . . . , m}, s Ij, and x Xj := S

s Ij TDc(s).

When the sum over backward trajectories in Equation 22 does not depend on x, e.g., for autoregressively generated object (in which p B(τ|x) = 1 for the unique trajectory τ connecting s to x) and sets (in which the sum depends only on the depth of s when p B is fixed to an uniform policy), Proposition C.3 says that pj T can be nicely interpreted as the restriction of the original target R to the induced terminal set Xj. We emphasize this fact in the corollary below. Corollary C.4 (Local sampling distributions for autoregressive models). In the context of Proposition C.3, assume that the state graph is represented as a tree. Then, p B(τ|x) = 1 for every τ and

x TDc(s) R(x) and pj T (x|s) = R(x) P

x TDc(s) R(x). (23)

For the set generation task, it also holds that pj T (x|s) R(x) for each x TDc(s) and s Ij.

Interestingly, the result above suggests a straightforward procedure for assessing the goodness-of-fit of the locally trained GFlow Nets. When Xj is considerably smaller than X, we can compute the normalized target in Equation 23 and directly compare it against the learned distribution in X. Otherwise, any technique for diagnosing GFlow Nets can be readily applied to probe the accuracy of pj T , e.g., measuring the Spearman correlation between log pj T ( |s) and log R(x) for x Xj (Malkin et al., 2022; Madan et al., 2022; Shen et al., 2023; Tiapkin et al., 2024; Chen & Mauch, 2024).

Published as a conference paper at ICLR 2025

Figure 11: Illustration of Recursive SAL. We show a two-level partition with m1 = 2 models within the first level and m2 = 3 models within the second one. For training, we first train models at the bottommost layer (represented in blue, red, and green) and recursively proceed upwards towards the middle (magenta and yellow) and top (root partition, shown in pink) layers. For the non-root layers, learning is based on minimizing LATB with the reward defined as in Equation 25; for the root, we minimize LTB instead. For inference, we start at so and iteratively select the policy based on the current state, as illustrated in the highlighted trajectory and in the annotated text on the top-right corner.

Sensibility to error propagation. Proposition C.3 raises an important question: how do the errors of the leaf models {Gj}m j=1 affect the global goodness-of-fit? To address this issue, the next proposition shows that the contribution of Gj to the overall distributional error is an increasing function of the probability mass associated to the jth leaf by the root model, Go, and of how inaccurate Gj s itself is.

Proposition C.5 (Sensibility to error propagation). Let {Sj}m j=0 = FHP(S, m) with GFlow Nets {Gj}m j=0. Assume that Go satisfies its balance condition. Also, define

x X So R(x), ZF = X

1 j m Ij Fj(x), and Z = X

x X ZR, (24)

and note that ZR + ZF is the partition function associated to Go. Then, the TV distance between the learned distribution p T in Equation 16 and the target π(x) R(x) for x X satisfies

TV (p T , π) ZR

1 Z 1 ZR + ZF

| {z } Error in estimating ZF

π(x) ZF ZR + ZF pf(s) T (x|s) | {z } Error of the local approximations

in which po T,\X is the restriction of po T to S

1 j m Ij and f is the assignment function.

Importantly, the bound above is tight in the sense that, when the root and leaf models satisfy their balance conditions, Z = ZR + ZF and π(x) Es[pf(s) T (x|s)], as we show in the proof of Proposition C.5 in Appendix D. There, we also provide an alternative, trajectory-based upper bound on the TV distance that similarly highlights the relatively large impact of the distributional errors associated with large-probability leaves to the overall accuracy. Heuristically, this suggests that SAL may benefit from a FHP that approximately homogeneously distribute the probability mass among the leaf partitions, ensuring that no client has a disproportionate role on shaping the accuracy of the aggregated model. To achieve this, however, one needs prior knowledge of the reward function; the definition of a good fixed-horizon partition should be done in problem-by-problem basis. We also believe that a human expert would have a remarkable impact on the effectiveness of SAL for the highly-specialized, molecular-biology-based, tasks in which GFlow Net are often implemented.

Recursive SAL. In Section 6, we introduced an extension of SAL to recursively defined FHPs. Proposition C.6 formalizes this procedure and demonstrates through an inductive argument that the resulting model samples correctly from the target distribution.

Proposition C.6 (Recursive SAL). Let S be the vertices of a state graph with diameter D. Then, for sequences 0 = do < d1 < d2 < < dk D and {mo = 1, m1, . . . , mk}, we define S

1 j mi Iij as a disjoint mi-partition of the states distanced di from so. Also, let Xk = {x X : d(x, so) dk} and, for i < k, let Xi = {s: s Ii+1,j (d(s, so) di s X)}. Finally,

Published as a conference paper at ICLR 2025

we define Gi = {(pi,j F , pi,j B , Fij): 1 j mi} as a set of GFlow Nets trained on a state graph with initial states S

j Iij, terminal states Xi, and reward function Ri such that

Ri(s) = Fi+1(s), if s Ii+1 and i < k, R(s), if s X. (25)

Then, when the GFlow Nets k i=0Gi satisfy their respective balance conditions, the generative process starting at so and recursively following pi,b F until either reaching Ii+1,a, at which point the guiding forward policy is changed to pi+1,a F , for 0 i k, or reaching X, signaling to stop the generation and return the sampled object, samples each x X proportionally to R(x).

Remarks on Recursive SAL. In plain English, the above proposition says that we can use the learned flow function Fi+1 at the (i + 1)th layer as the reward function of the GFlow Nets within the ith layer to obtain a correct sampler when training the GFlow Nets in a hierarchical fashion. In computational terms, the number of trained models grows linearly with the depth k and width maxi mi of the multi-layered partition. In the light of Theorems 5.2 and 5.4, however, each model would have to solve a considerably simpler problem and we may be able to use a significantly smaller neural network to parameterize the corresponding forward policies, with advantageous consequences for both generalization via the KL term in Theorems 5.2 and 5.4, which increases with the number of estimable parameters and storage. Albeit we do not provide an empirical evaluation of Recursive SAL in this work, we believe its implementation could be beneficial for problems with very large trajectory sizes and are optimistic about its potential applications in future endeavors.

C.4 CONDITIONAL SAL

SAL and state-conditional flows. Bengio et al. (2023, Section 4.3) introduce state-conditional flows as a family {Fs}s S of flow functions defined on the subgraphs Gs induced by {s S : s s }, in which s s means that there is a path from s to s on the original state graph. Remarkably, a FHP(S, m) can be interpreted as a subset {Fs}s m j=1Ij of a state-conditional flow. In spite of these similarities, which serve only to strengthen the foundations of our work, we emphasize the novelty and demonstrated effectiveness of our distributed strategy for learning state-conditional flows.

Learning reward-conditioned GFlow Nets with SAL. Recently, there has been growing interest in reward-conditioned flows, in which we learn a family {Fc}c C of flow functions conditioned on some information c C given as an additional input to the neural network. In most applications, c corresponds to either a temperature parameter (Zhang et al., 2023a; Kim et al., 2024a) defining the peakiness of the target distribution or pharmaceutical properties (Roy et al., 2023; Pandey et al., 2024) guiding the drug discovery process. In view of this, we extend SAL to accommodate the distributed learning of reward-conditioned GFlow Nets on a conditioned FHP, i.e., a FHP that depends on the conditioning information. This may be formally expressed as follows.

Remark C.7 (Reward-conditioned SAL). Let C be a set of conditioning information and {Rc : X R+ : c C} be the corresponding family of conditioned rewards. Also, let F : C S R+ and p F : C S S [0, 1] be a conditional flow function and a conditional forward policy, that is, p F (c, , ) is a forward policy for each c. Finally, let {Sc j}m j=0 = FHP(S, m, c) be a conditioned FHP and {Gj}m j=0 be a family of reward-conditional GFlow Nets. Following the arguments for demonstrating Theorem C.1, it is easy to see that SAL samples correctly in proportion to Rc when each Gj satisfies its respective balance condition with respect to Rc for every c C and 0 j m.

Conditional GFlow Nets are commonly implemented for controllable generation by setting the conditioning information c at inference time (Roy et al., 2023); see also Lau et al. (2023) s QGFN. In fact, Pandey et al. (2024) recently explored this principle for the effective exploration of chemical space at an atomic-level given some desirable pharmacological properties, e.g., synthetizability. In this regard, Remark C.7 ensures that most of these approaches can ba adapted to the distributed setting via SAL, and we believe that assessing the resulting methods is an important research direction.

C.5 SAL AND EP-GFLOWNETS

The reward function of GFlow Nets can often be decomposed as R(x) = QK i=1 Ri(x) (Jain et al., 2023; Deleu et al., 2023; Zhou et al., 2024; Pandey et al., 2024), e.g., in multi-objective problems in

Published as a conference paper at ICLR 2025

which each Ri is an objective and our goal is finding samples that are concomintantly high valued for every Ri. In these cases, da Silva et al. (2024) proposed a divide-and-conquer algorithm for learning K GFlow Nets in parallel, each targeting a reward Ri, and then aggregating them with an extra GFlow Net. The resulting model, termed EP-GFlow Net, trains (K + 1) GFlow Nets on the same state graph, in sharp contrast to SAL which trains a GFlow Net for each component of a FHP. To further highlight the distinction between SAL and EP-GFlow Nets, we show below that these approaches can be implemented in a complementar manner.

Proposition C.8 (EP-SAL). Let R(x) = QK i=1 Ri(x) be a multiplicative decomposition of R. For each 1 i K, let {Si j}mi j=0 = FHPi(S, mi) be a fixed-horizon partition of the state space S. Also, let Gi = {(pi,j F , pi,j B , Fi,j): 0 j mi} be the rootand leaf-GFlow Nets corresponding to the i-th FHP and denote by pi F and pi B the induced distributions over trajectories. Assume that each pi F samples terminal objects in X proportionally to Ri. Finally, let p E be any positive probability measure over trajectories. Then, if a GFlow Net (p F , p B, G) globally minimizes

log QK i=1 pi F (τ) QK i=1 p B(τ|x) + log p F (τ)

the marginal p T of p F over X matches R := QK i=1 Ri up to a normalizing constant.

Proof. The result follows directly from Theorem C.1 and (da Silva et al., 2024, Theorem 3.1).

For the sake of completeness, let τ1:d denote the first d transitions of τ and τd: be its complementar.

Then, let di be the distance from the initial state of so to the sets Ii,j in the underlying state graph for j {1, . . . , mi} (recall Definition 6.1, and note that di does not depend on j). Under these conditions, the sampling distribution pi F in Proposition C.8 can be formally written as

( po F (τ) if |τ| di, po F (τ1:di) Pmi j=1 P

s Ij 1{s τ}pi,j F (τdi:|s), otherwise. (27)

Importantly, Equation 27 can be efficiently computed by keeping track of the partition associated to each sampled state. In practice, the computational overhead wrt directly evaluating p F is negligible.

2 4 Learned 10 4

Figure 12: SAL of EP-GFlow Nets.

Empirical illustration. From an empirical standpoint, Proposition C.8 says that SAL can be used to learn a set Gi of GFlow Nets jointly sampling x X in proportion to Ri(x). Then, given Gi for 1 i K, a GFlow Net sampling in proportion to QK i=1 Ri can be obtained by minimizing EP-GFlow Net s learning objective. Figure 12 empirically validates this result for the task of set generation with the same hyperparameters considered in Figure 6 with K = 2 and mi = 2 for each i {1, 2}. Each GFlow Net was trained for 512 epochs and the log-utilities defining Ri were independently sampled for each client. Nonetheless, in spite of its soundness, the effectiveness of this mixed approach in realistic problems remains to be assessed. Looking at the bigger picture, these observations emphasize the composability of GFlow Nets (Garipov et al., 2023), which might be relevant for the design of more data-efficient algorithms (Du & Kaelbling, 2024).

Published as a conference paper at ICLR 2025

D.1 PROOF OF LEMMA 4.1

We simply note that the space of T-sized subsets of {1, . . . , W} has size W T and the space of T-sized subsets of {2, . . . , W} has size W 1 T . Since W 1 T

when W , we can always find for any ξ (0, 1) a W and a T, both of which potentially depending on ξ, for which |X | ξ|X|. For the cases considered in Figure 1, in particular, we compute the following proportions: (32 6)/32 = 81.25% and (64 6)/64 90.63%.

D.2 PROOF OF PROPOSITION 4.2

Our proof has three steps. Firstly, we use H older s inequality to bound the expectation of |π(x) p T (x)|. Secondly, we rely on Jensen s inequality to bound |π(x) p T (x)| with an expectation of |p F (τ)/p B(τ|x) π(x)| over τ. Thirdly, we convert the probabilities to a log-scale with a simple technical argument based on the Taylor expansion of log. For this, let ϕ(x) = |π(x) p T (x)|. Then,

Ex q E,T [ϕ(x)] = X

x X ϕ(x)q E,T (x)

x X ϕ(x) q E,T (x)

p E,T (x)p E,T (x)

x X ϕ(x)qp E,T (x)

p p E,T (x)

= Ex p E,T [ϕ(x)q] 1/q Ex p E,T

for any p, q > 1 such that 1/p + 1/q = 1. For p = q = 2, this bound becomes

Ex q E,T [ϕ(x)] Ex p E,T [ϕ(x)2] χ2(q E,T ||p E,T ) + 1 1

For GFlow Nets, we may write p T (x) = Eτ p B[p F (τ)/p B(τ|x)]. Hence, by Jensen s inequality,

Ex p E,T ϕ(x)2 = Ex p E,T

p B(τ|x) π(x) 2#

p B(τ|x) π(x) 2##

In conclusion, we show that p F (τ)

p B(τ|x) π(x) 2 log p F (τ)

p B(τ|x) log π(x) 2 . (32)

In fact, let M = maxτ,x p F (τ) p B(τ|x), which always exists due to the finiteness of the state space. For instance, M 1 for autoregressive generative tasks (i.e., when p B(τ|x) = 1). Thus, p F (τ)

p B(τ|x) π(x) 2 M 2 p F (τ) Mp B(τ|x) π(x)

The lemma below, which is a direct consequence of the mean value theorem, ensures that the quantity above is bounded above by the log-squared difference between p F (τ)/p B(τ|x) and π(x). Lemma D.1 (Lipschitzness of x 7 ex). For every x, y (0, 1], | log x log y| |x y|.

Published as a conference paper at ICLR 2025

Proof. Consider f : ( , 0] R, f : t 7 et, and notice that |f (t)| = |et| 1. Consequently, by the mean value theorem, f is 1-Lipstchitz and |et es| |t s| for every t, s ( , 0]. By letting log x = t and log y = s, we conclude that |x y| | log x log y| for x, y (0, 1].

In summary, we have shown that

E x q E,T[ϕ(x)]

Ex p E,T Eτ p B

log p F (τ)

p B(τ|x) log π(x) 2 χ2(q E,T ||p E,T ) + 1 !1/2

The statement thereby follows by considering an uniform reference distribution, q E,T (x) = 1 |X|,

TV (p T , π) = |X|

2 Ex q E,T [ϕ(x)]

Ex p E,T Eτ p B

log p F (τ)

p B(τ|x) log π(x) 2 χ2(q E,T ||p E,T ) + 1 !1/2

D.3 PROOF OF PROPOSITION 5.1

For completeness, we provide a proof of Proposition 5.1. Clearly, it is enough to show that

LFCS(P) ˆLFCS(P) + rη

2 and LFCS(P) ˆLFCS(P) + η + q

η(η + 2ˆLFCS(P)), (34)

in which we omit the dependence of ˆLFCS on the dataset Tn for conciseness. We recall that η = KL(P ||Q)+log 2 nα/δ

nα , with nα = ( 1 α)n , is the complexity term that depends on the prior Q, posterior P, confidence δ, and the number of data points nα. Notably, both inequalities directly follow from Maurer (2004, Theorem 5) bound: with probability 1 δ over Tn,

kl(ˆLFCS(P)||LFCS(P)) η, (35)

in which kl represents the binary KL divergence, i.e., kl(p||q) = p log p

q + (1 p) log 1 p

1 q . Below, we show that kl(p||q) is greater than or equal to (p q)2/2q when p < q. Lemma D.2. (Boucheron et al., 2013, Exercise 2.8). Let h(t) = (1 t) log(1 t)+t and p: {1, 0} [0, 1] (resp. q) represent the PMF of a Bernoulli with parameter p [0, 1]. Then,

Ex Be(q)h 1 p(x)

= kl(p||q) (36)

and h(t) t2

2 for t [0, 1]. In particular, kl(p||q) (p q)2/2q when p q.

Proof. Equation 36 follows from a direct algebraic manipulation of the left-hand side. On the other hand, define

g(t) = h(t) t2

for t [0, 1]. Then, g is continuous, g(0) = 0, and g(t) 1/2 when t 1. Also, g (t) = log(1 t) t 0 for t [0, 1] since log(1 t) = | log(1 t)| t. In conclusion,

Ex Be(q)h 1 p(x)

By the symmetry of Equation 35 with respect to LFCS and ˆLFCS, we conclude that

LFCS(P) ˆLFCS(P) p

2LFCS(P)η. (39)

Under these circumstances, the inequality LFCS(P) ˆLFCS(P) + η + q

η(η + 2ˆLFCS(P)) is

obtained by solving the above quadratic inequality on p

LFCS(P). Through a similar reasoning, kl(p||q) 2(p q)2 by Pinsker s inequality and, consequently, LFCS(P) ˆLFCS(P) + p

η/2. These results jointly entail Proposition 6.

Published as a conference paper at ICLR 2025

D.4 PROOF OF LEMMA B.1

Recall that p T (x) = P

τ x p F (τ) for pα,p F F = αp U +(1 α)p F , in which we make the dependence of p F on α and on the (unconstrained) policy p F explicit. Let F(α, p F ) be the family of such policies and F(α) be the set of α-greedy policies. It is straightforward to see that F(α) is a convex set. Clearly, it is enough to ensure that minα>0,p F minx pα,p F T (x) minx π(x), namely, that the rarest object can be sampled correctly by properly adjusting p F and a (non-zero) α. Indeed, Bengio [ref, Theorem 8] showed that, for each given backward policy p B and positive reward R, there is a unique forward policy p F for which the marginal p T (x) R(x) for each x X. Hence, since pα,p F T = αp U,T + (1 α)p T , with p U,T being the marginal of p U over X, the realizability of F(α) is ensured when α satisfies minx X αp U,T (x) < minx X π(x), i.e., α < minx π(x)/minx p U,T (x), in which case we may set a p F such that p T (x) = 1 1 α (π(x) αp U,T (x)). As an example, consider the set generation task, the details of which are provided in Section 4. There, p U induces an uniform distribution over X and we may set α = N/2 minx π(x) < minx π(x)/minx p T,U(x). Importantly, our analysis is not considering the (limited) expressivity of the chosen parametric model for the policy network, which touches on a mostly open problem in the deep learning literature. Rather, we are concerned with the feasibility of finding a transition policy p F consistent and compatible with the given target distribution R, in the sense of Bengio et al. (2023, Definition 4, Definition 20).

D.5 PROOF OF THEOREM 5.2

We first show that the risk function is bounded. Then, Equation 7 follows directly from Maurer (2004, Theorem 5) and Jensen s inequality. Under these conditions, notice that

KL(p B||p F ) = Eτ p B[log p B(τ)] Eτ p B[log p F (τ)] = H[p B] Eτ p B[log p F (τ)]. (40)

Also, by definition,

p F (τ) = Y

(s,s ) τ p F (s |s)

(s,s ) τ (αp U(s |s) + (1 α)p F (s |s))

1 |Ch(s)| α maxs τ |Ch(s)|

and, consequently,

Eτ p B[log p F (τ)] min τ |τ| log α maxs τ |Ch(s)|

= max τ |τ| log maxs τ |Ch(s)|

i.e., KL(p B||p F ) H[p B] + MT . In conclusion, the convexity of the KL divergence along with the the fact that p T and π are respectively convex functions of p F and p B imply that KL(π||p T ) KL(p B||p F ). The rest follows from Maurer (2004, Theorem 5) applied to KL(p B||p F ).

D.6 PROOF OF LEMMA B.2

The result follows directly from the Markov property of the MDP. Equivalently, we note that

1 i S LDB(Si, S<i)

1 i t ES1,...,Si [LDB(Si, Si 1)]

1 i t ESi 1 p(e) T ESi pe( |Si 1) [LDB(Si, Si 1)|Si 1] .

D.7 PROOF OF THEOREM 5.4

Our proof has three main ingredients. Firstly, we build upon a Azuma-Hoeffding-type inequality to bound the expected transition-level error with the observed empirical error. Secondly, we derive

Published as a conference paper at ICLR 2025

a trajectory-level bound of the transition-level results by relying on Mc Allester s linear PAC-Bayes inequality. Thirdly, we combine these results with a standard union bound argument. To start with, Beygelzimer et al. (2011, Theorem 1) shows that the martingale Mt = P

1 i t M(Si, Si 1) defined above, with Ai M(Si, Si 1) Bi and Bi Ai C, satisfies

E exp λMt (e 2)λ2Vt 1, (43)

in which Vt = P

1 i t M(Si, Si 1)2 and λ [0, 1/C]. In our context, |M(Si, Si 1)| 2U by the triangle inequality, and we can take C = 2U. By assumption, Vt K for all t tm.

Then, for the martingale Mt(θ) and corresponding Vt(θ), with θ representing the parameters of the forward policy, by Donsker-Varadhan s variational formula, we notice that

Eθ P λMt(θ) (e 2)λ2Vt(θ) KL(P||Q) + log Eθ Q exp λMt(θ) (e 2)λ2Vt(θ) .

Similarly to (Seldin et al., 2012b, Theorem 1), let δt = δ/2tm, so that P

1 t tm δt = δ/2. Then, by Markov s inequality, with probability at least 1 δt,

Eθ Q exp λMt(θ) (e 2)λ2Vt(θ) 1

δt E Eθ Q exp λMt(θ) (e 2)λ2Vt(θ) ;

where the outer expectation is with respect to the joint distribution of {S1, . . . , St}. By Tonelli s theorem and Equation 43, the right-hand side of the equation above satisfies

1 δt E Eθ Q exp λMt(θ) (e 2)λ2Vt(θ) = 1

δt Eθ QE exp λMt(θ) (e 2)λ2Vt(θ)

Consequently, by Donsker-Varadhan s formula applied to λMt(θ) (e 2)λ2Vt(θ), bounding its exponential moment as above, and a union bound over t, yields with probability at least 1 δ

Eθ P [Mt(θ)] (e 2)λEθ P [Vt(θ)] + KL(P||Q) + log tm + log 2

Hence, by the definition of Mt(θ) and the bounded-variance assumption,

1 i t E[LDB(Si, Si 1)|S<i]

1 i t LDB(Si, Si 1)

t + KL(P||Q) + log tm + log 2/δ

Nextly, let S(j) 1 be independent samples from a forward policy p F ( |so) for 1 j n and {S(j) 1 , . . . , S(j) tj } be the correspondingly observed trajectories. Also, we recall that

L(θ) = ES1,S2,...,St

1 i t E [LDB(Si, Si 1)|S<i]

and define ˆL(θ) = 1

1 i tj E h LDB(S(j) i , S(j) i 1)|S(j) <i i ; (47)

the inner expectations are computed with respect to the Markovian data-generating process (recall that the conditional expectation E[LDB(Si, Si 1)|S<i] is a random variable). By assumption, L(θ) U. Hence, Mc Allester s linear PAC-Bayes inequality (Mc Allester, 2013, Theorem 2) entails, with probability at least 1 δ

2 over draws of {S1, . . . , St},

Eθ P [L(θ)] 1

β Eθ P h ˆL(θ) i + U 2β(1 β) KL(P||Q) + log 2/δ

Published as a conference paper at ICLR 2025

Under these conditions, equations 45 and 48 jointly imply that, by a standard union-bound argument,

Eθ P [L(θ)] 1

1 i t LDB(S(j) i , S(j) i 1) + (e 2)λK

KL(P||Q) + log tm + log 2/δ

+ U 2β(1 β) KL(P||Q) + log 2/δ

with probability 1 δ over draws of (S1, . . . , St). Since ntj T for all tj, as we observe T transitions (n is between T/tmin and T/tm , with tmin being the minimum length of a complete trajectory), and tj tm, as tm is the trajectory s maximum lenght, the result above is equivalent to

Eθ P [L(θ)] 1

1 i t LDB(S(j) i , S(j) i 1)

| {z } = ˆ L(θ)

KL(P||Q) + log tm + log 2/δ

Tλ + U 2β(1 β) KL(P||Q) + log 2/δ

By aggregating the terms corresponding to KL(P||Q) and log 2/δ, we derive the desired upper bound on the expected risk of the DB loss.

D.8 PROOF OF THEOREM C.1

Intuitively, when each balance condition is satisfied, each state s on Ij is sampled in proportion to Fj(s) and, conditioned on s, each terminal state will be sampled in proportion to R(x)/Fj(s), implying that, marginally, each x is sampled proportionally to R(x). In the following, we make this argument rigorous. We first consider the case in which x X \ So. As we are assuming that Fj(s)p F (τ|s) = p B(τ|x)R(x) for each trajectory τ starting at s Ij and finishing at x, we must conclude that

pj T (x|s) = X

τ : s x p F (τ|s) = R(x)

τ : s x p B(τ|x). (50)

On the other hand, since Fo(so)po F (τ|s) = po B(τ|s)Fj(s) for s Ij,

po T (s|so) = X

τ : so s p F (τ|so) = Fj(s)

τ : so s p B(τ|s) = Fj(s)

Fo(so), (51)

as the probability of reaching so by starting from s and following p B is equal to one since so is the only sink state of the transposed state graph. In this context,

p T (x|so) = X

s Ij pj T (x|s)po T (s|so)

Fj(s) Fo(so) R(x)

τ : s x pj B(τ|x)

R(x) Fo(so)

τ : s x pj B(τ|x)

1 j m Ij pj B(τ|s) = R(x)

i.e., p T (x|so) samples x proportionally to R(x). For the forth line above, we relied on the fact that the probability of eaching S Ij is equal to one when starting at x X \ So and following p B. Correspondingly, when x X, it follows from the satisfiability of the trajectory balance condition that p T (x|so) R(x). This ensures SAL is a sound distributed learning algorithm for GFlow Nets.

Published as a conference paper at ICLR 2025

D.9 PROOF OF LEMMA C.3

The global minimizer of Equation 9 satisfies, for every j, Fj(s)pj F (τ) = R(x)pj B(τ|x) for every trajectory τ : s x starting at s Ij and finishing at x Xj. Consequently,

pj T (x|s) = X

τ : s x pj F (τ|s) = X

pj B(τ|x)R(x)

Fj(s) = R(x)

τ : s x pj B(τ|s). (53)

Similarly, X

τ : s x Fj(s)pj F (τ|s) = X

τ : s x pj B(τ|x)R(x) (54)

implies that Fj(s) = X

τ : s x pj B(τ|x)R(x) (55)

τ : s x pj F (τ|s) for every s. These equations jointly entail the proposition.

D.10 PROOF OF PROPOSITION C.5

As in the demonstrations above, we consider two cases in separate. First, when x X So, then p T (x) = R(x)/ZF +ZR due to the satisfiability of the balance condition by the model. Hence, X

x X So |p T (x) π(x)| = X

R(x) ZF + ZR R(x)

= 1 ZF + ZR 1

x X So R(x) = 1 ZF + ZR 1

Second, when x X \ So, we note that

p T (x) = X

τ : so s x p F (τ|so)

τ : so s po F (τ|so)

τ : s x pj F (τ |s)

s Sj po T (s)pj T (x|s) = X

Fj(s) ZF + ZR pj T (x|s).

Similarly, for any X-valued function f,

ZF f(x); (58)

π(x) p T (x) = X

ZF π(x) Fj(s) ZF + ZR pj T (x|s)

π(x) ZF ZF + ZR pj T (x|s)

= Es po T,\X

π(x) ZF ZF + ZR pf(s) T (x|s) .

By recalling that TV(π, p T ) = 1

x X |π(x) p T (x)| , this result, along with Equation 56 and Jensen s inequality applied to the function x 7 |x|, implies the proposition. To further strengthen our intuition, we also consider directly bounding the accuracy of Go as a function of the trajectorylevel inaccuracies of each Gj. For this, we re-write π(x) as

s Sj π(x) X

τ : s x pj B(τ|x). (60)

Published as a conference paper at ICLR 2025

Correspondingly, by recalling the property p T (x) = P

s Sj po T (s)pj T (x|s), we conclude

|π(x) p T (x)| =

τ : s x π(x)p B(τ|x) p F (τ|s)po T (s)

π(x)pj B(τ|x) pj F (τ|s)po T (s)

| {z } Error associated to the jth client

Hence, the total variation distance between π and p T is bounded above by

TV(π, p T ) = ZR

1 Z 1 ZR + ZF

π(x)pj B(τ|x) pj F (τ|s)po T (s)

| {z } Error associated to the jth client

For tree-shaped state graphs, the second term of the equation above can be significantly simplified by noticing that (i) each x is uniquely associated to a j, a relationship which we denote by g(x) = j, and (ii) that pj B(τ|x) = 1 and pj F (τ|s) = pj T (x|s). Under these conditions,

TV(π, p T ) ZR

1 Z 1 ZR + ZF

π(x) pg(x) T (x|s)po T (s)

| {z } Error associated to the j=g(x)th model

D.11 PROOF OF PROPOSITION C.6

We proceed by strong induction on the number k of fixed-horizon partitions. For k = 1, the result above is equivalent to Equation C.1. Assume, then, that the statement holds for j fixed-horizon partitions of the state graph for all j < k. Let Gi, 0 i k, be a sequence of GFlow Nets satisfying the amortized trajectory balance condition. By induction, each x S 1 j k 1 Xj is sampled proportionally to P

1 j k 1 1[x Xj]Rj(s). In particular, if x X S

1 j k 1 Xj, then x is sampled proportionally to R(x). For what remains, let x X \ S

1 j k 1 Xj. Hence, for each state s S

1 j mk Ik,j Xk 1 and each trajectory τ : s x,

Fk(s)p F (τ|s) = p B(τ|x)R(x), (63)

i.e., p F (τ|s) = p B(τ|x)R(x)/Fk(s). Thus, by marginalizing out the non-terminal components of τ,

p T (x|s) = R(x)

τ : s x p B(τ|x) (64)

and, since each s is sampled proportionally to Rk 1(s) := Fk(s),

1 j mk Ik,j p T (x|s)Fk(s)

1 j mk Ik,j Fk(s) R(x)

τ : s x p B(τ|x)

1 j mk Ik,j

τ : s x p B(τ|x)

This ensures that each x X \ S

1 j k 1 Xj is sampled proportionally to R(x). By induction, each x X is sampled proportionally to R(x). Hence, the recursive instance of SAL is a sound approach for sampling objects proportionally to a reward function.

Published as a conference paper at ICLR 2025

5000 10000 15000 20000 25000 0

Hypergrid 64 x 64

200 400 600 800 1000 0

500 1000 1500 2000 0

1000 2000 3000 4000 5000 0

Bit sequences

500 1000 1500 2000 0

Set generation

5000 10000 15000 20000 25000 0

200 400 600 800 1000 0

500 1000 1500 2000 0

1000 2000 3000 4000 5000 0

500 1000 1500 2000 0

5000 10000 15000 20000 25000 0

200 400 600 800 1000 0

500 1000 1500 2000 0

1000 2000 3000 4000 5000 0

500 1000 1500 2000 0

Centralized SAL

Total number of epochs

Figure 13: Complementary results for Figure 7 with different random seeds. Notice that, except for the hypergrid task, the threshold defining a mode is a random variable, which explains the variability (albeit consistency) of the number of modes found within the same column.

E LIMITATIONS AND FUTURE WORKS

We presently discuss the limitations of our work and the consequent opportunities for future research. Although informative, our theoretical analysis was limited to the case of i.i.d. sampled trajectories which, as highlighted in Section 5, does not necessarily reflect the nature of usual strategies for learning GFlow Nets, e.g., ϵ-greedy sampling. Indeed, the typical training of GFlow Nets is closer in spirit to active learning (AL) (Cohn et al., 1994; Gal et al., 2017) in the sense that a batch of trajectories is sampled from a policy that is dynamically updated as more data points are observed, and it would be interesting and important to pursue an investigation in this direction (Jain et al., 2022; Malik et al., 2023). In contrast to AL, however, there is no explicit acquisition function guiding GFlow Net training. On the other hand, Deleu & Bengio (2023) s interpretation of GFlow Nets as Markov chains in the trajectory-space could be used together with Azuma s inequality (Azuma, 1967), in the fashion of Theorem 5.4, as a useful starting point for this. More specifically, one could consider that a sequence of trajectories {τt}t 1 is observed during training and use the same techniques enabling the proof of Theorem 5.4, namely, constructing a martingale difference sequence Mt = LTB(τt) Eτt[LTB(τt)|τs<t] and applying Theorem 1 of Seldin et al. (2012a) both within and between trajectories, the results of which would then be unified via an union bound argument.

Additionally, the promising results of SAL pave the road to a range of interesting investigations. Most promitently, we believe the development of principled partitioning methods can lead to substantial improvements in scaling GFlow Net training. Although our discussion is constrained to fixed-horizon partitions for easeness of exposition and implementation, the algorithm could in principle be extended to more general settings. Theorem 5.2 suggests that a good partition would ensure that the withinand between-partition distributions are close to uniform. Intuitively, we would like that the target distribution of both the leaf and root GFlow Nets are relatively simple to approximate when compared against the original target and that the most important regions of the state space, as measured by the reward function, are appropriately covered. The best way to ensure these properties, however, remains an open problem and we think it is a promising venue for future endeavors.

E.1 ADDITIONAL EXPERIMENTS

Robustness of SAL with respect to the FHP s size. It is intuitively clear that an increase in the number m of partitions in a FHP would improve the coverage of the state graph and accelerate mode discovery. In doing so, however, we also enlarge the memory cost of the algorithm due to the necessity of aggregating a larger number of leaf GFlow Nets in the server. To shed light on the effect of m on SAL s performance, Figure 14 presents the number of modes found during training for the tasks of SIX6 and PHO4. As anticipated, SAL drastically improves upon a centralized GFlow Net,

Published as a conference paper at ICLR 2025

SAL Centralized

Number of modes

Figure 14: SAL consistently outperforms a centralized GFlow Net irrespective of the number of components defining the underlying FHP. As expected, the coverage of the state graph is an increasing function of the FHP s size. Results for the centralized model are included solely for comparison.

Intentionally mode-collapsed SAL

Target distribution

Figure 17: Extremely inaccurate estimation of the flow function might lead to mode collapse. (left) Standard SAL-trained GFlow Net for a 8 8 grid with 2-sized FHP. (middle) SAL-trained GFlow Net when the flow function F1 is severely inaccurate. (right) Target distribution.

notably regardless of the size of the underlying FHP. Overall, our experiments throughout this work indicated that SAL is notably robust to the choice of hyperparameters defining its implementation.

SAL Cent. 0

50-sized subsets of {1, .. ., 100}

Figure 15: SAL leads to significantly faster mode discovery for large-scale set generation. Results averaged across 3 runs.

SAL Centralized 0.045 0.001 0.061 0.009

Table 3: SAL improves upon a centralized GFlow Net for largescale set generation in terms of FCS (Silva et al., 2024).

SAL Cent. 0

5553 Intentionally bad partition sampling

Figure 16: SAL underperforms when the within-subgraph sampling distribution is inadequately designed.

Experiments on large-scale set generation tasks (|X| 1030). To underscore the scalability of SAL with respect to the underlying domain s size, Figure 15 and Table 3 respectively show that our distributed algorithm entails faster mode discovery and lead to a more accurate distributional approximation for the problem of generating 50-sized subsets of {1, . . . , 100}; see Section A.3 for a definition of this task. For a fair comparison, the centralized GFlow Net is trained for 60 seconds twice the time allocated to each leaf GFlow Net and to the aggregation step. Besides, we adopt the experimental setup described in Section B. Importantly, these results suggest that the applicability of SAL is not limited by the domain sizes presented in the main text.

In the following paragraphs, we examine two potential causes for catastrophic failures of the aggregated model, as mentioned in Proposition C.5 on Section C.3.

Published as a conference paper at ICLR 2025

Insufficient training of a leaf GFlow Net. As the aggregation phase in SAL relies on the locally estimated flows as a surrogate reward for the root GFlow Net (recall Algorithm 1), a natural driver of catastrophic failures is an inaccurately approximated flow function by a leaf GFlow Net. In particular, if a local flow Fj is significantly larger than its correct value, the resulting model might allocate a substantial probability mass to the subgraph associated to the jth leaf GFlow Net. In this case, high-probability regions of the remaining subgraphs might be completely missed by the global GFlow Net. We illustrate this effect in Figure 17, which shows the learned distribution over a 8 8 grid for (i) a SAL-trained GFlow Net and (ii) a SAL-trained GFlow Net with a substantially over-estimated F1. To emulate (ii), we train each leaf GFlow Net in a standard fashion and, during the aggregation phase, multiply the learned F1 by 100. In both cases, we considered 2-sized FHPs and followed the experimental setup of Section 6.2. Notably, Figure 17 confirms that a severely inaccurate flow function Fj may lead to a misrepresentation of the target distribution s high-probability regions by the global GFlow Net. In spite of these results, we stress that we did not observed this pathological behavior throughout our experiments in the main text and in Section C.

Insufficiently diverse within-subgraph sampling distributions. Recall in Definition 6.2 that SAL depends on a distribution qj over the initial states Ij of the jth subgraph defining the FHP; see Figure 3. Clearly, when qj s probability mass is overly concentrated in a relatively small subset of Ij, the corresponding leaf GFlow Net might fail to accurately learn from the target distribution due a restricted exploration of the state graph this was observed in Section 4. Also, the efficiency of mode discovery during training would be significantly hindered. In this context, Figure 16 illustrates the mode discovery rate for the task of generating 16-sized subsets of {1, . . . , 32} with the reward function described in Section A when each qj is a truncated Poisson distribution with mean equal to 1/5000-th of the size of the corresponding Ij. Importantly, similarly to Figure 17, this is an extreme corner case that did not pose an issue in our experiments. In practice, we suggest setting qj as an uniform distribution to maximize the diversity of the explored partitions.