Published as a conference paper at ICLR 2025

GENERALIZATION AND DISTRIBUTED LEARNING OF GFLOWNETS

Tiago da Silva, Getulio Vargas Foundation, tiago.henrique@fgv.br
Amauri Souza, Federal Institute of Ceará, amauriholanda@ifce.edu.br
Omar Rivasplata, University of Manchester, omar.rivasplata@manchester.ac.uk
Vikas Garg, YaiYai Ltd and Aalto University, vgarg@csail.mit.edu
Samuel Kaski, Aalto University, University of Manchester, samuel.kaski@aalto.fi
Diego Mesquita, Getulio Vargas Foundation, diego.mesquita@fgv.br

ABSTRACT

Conventional wisdom attributes the success of Generative Flow Networks (GFlowNets) to their ability to exploit the compositional structure of the sample space for learning generalizable flow functions (Bengio et al., 2021). Despite the abundance of empirical evidence, formalizing this belief with verifiable non-vacuous statistical guarantees has remained elusive. We address this issue with the first data-dependent generalization bounds for GFlowNets. We also elucidate the negative impact of the state space size on the generalization performance of these models via Azuma-Hoeffding-type oracle PAC-Bayesian inequalities. We leverage our theoretical insights to design a novel distributed learning algorithm for GFlowNets, which we call Subgraph Asynchronous Learning (SAL). In a nutshell, SAL utilizes a divide-and-conquer strategy: multiple GFlowNets are trained in parallel on smaller subnetworks of the flow network and then aggregated with an additional GFlowNet that allocates appropriate flow to each subnetwork. Our experiments with synthetic and real-world problems demonstrate the benefits of SAL over centralized training in terms of mode coverage and distribution matching.
1 INTRODUCTION

Generalization is a long-standing problem in the machine learning literature, asking whether a learning algorithm can reliably make predictions beyond the data it was trained on (Valiant, 1984; Vapnik, 2000; Catoni, 2007; Alquier & Guedj, 2017; Dziugaite et al., 2021; Lotfi et al., 2024a). In an age of rapid deployment of AI models to end-users, there has been an emerging interest in the design of theoretically robust algorithms, with remarkable results for GANs (Mbacke et al., 2023), diffusion models (Li et al., 2024), transformers (Lotfi et al., 2024a;b), and graph neural networks (Ju et al., 2023; Tang & Liu, 2023). In this pursuit of models with provable generalizability, a rich set of tools has been created (Vapnik & Chervonenkis, 2015; Shalev-Shwartz & Ben-David, 2014), with McAllester's (1998; 1999) PAC-Bayesian theorems often providing the tightest statistical guarantees (Pérez-Ortiz et al., 2021; Dziugaite & Roy, 2017; 2018; Lotfi et al., 2024b;a). Notably, however, there is no one-size-fits-all solution for understanding generalization: the diverse nature of data and learning algorithms demands a distinctly tailored approach to each problem. In the realm of probabilistic methods, for example, it has been widely hypothesized that the outstanding performance of Generative Flow Networks (GFlowNets, Bengio et al., 2021; 2023; Lahlou et al., 2023), which have demonstrated exceptional results in problems such as the design of biological sequences (Jain et al., 2022; Malkin et al., 2022) and combinatorial optimization (Zhang et al., 2023a;b), to name a few, emerges from their ability to exploit the compositional structure of the underlying state space to learn a generalizable flow assignment function on a flow network while only observing a fraction of the network's nodes (Bengio et al., 2021; Nica et al., 2022; Shen et al., 2023; Atanackovic & Bengio, 2024; Krichel et al., 2024).
Nonetheless, in spite of the wealth of empirical evidence indicating that generalization occurs in GFlowNet learning (Nica et al., 2022; Atanackovic & Bengio, 2024), no work so far has provided non-vacuous high-probability empirical bounds on the population risk of GFlowNets, which might serve as statistical certificates for generalization. Given this scenario, in this paper we develop the first non-vacuous generalization bounds for GFlowNets in the literature in Section 5. In doing so, the core questions we want to address are when GFlowNets (provably) generalize and which factors potentially diminish their generalization performance. To kickstart our analysis, we present in Section 4 an example in which a GFlowNet catastrophically fails to generalize even after learning a compatible flow assignment for over 90% of the flow network. This example demonstrates that, to properly understand the generalization of GFlowNets, we must consider not only the extent of the observed flow network but also the specific parts that have been encountered during training, a fact that is also implicit in popular techniques such as the replay buffer (Vemgal et al., 2023) and local search (Kim et al., 2024b). From a technical perspective, this implies that the development of meaningful statistical guarantees for GFlowNets must be based not on data-agnostic theoretical results, but on data-dependent priors, in the fashion of Dziugaite & Roy (2018); Dziugaite et al. (2021); Pérez-Ortiz et al. (2021). This observation guides the establishment of the non-vacuous high-probability bounds for the population risk of GFlowNets in Section 5.1 and distinguishes our approach from other investigations of GFlowNet generalization (Krichel et al., 2024). The empirical results in Section 5.1, however, do not provide a fine-grained understanding of which characteristics of the flow network tend to hinder the generalization of GFlowNets.
What effect do larger trajectory lengths, for instance, have on the provable learnability of generalizable flow assignments? Intuitively, generalization is harder in larger state spaces: borrowing terminology from the reinforcement learning (RL) literature (Bengio et al., 2021), the portion of an environment visited by an agent under a fixed time budget decreases as the environment grows and, therefore, the agent must rely on increasingly sparse information when learning from longer trajectories. In Section 5.2, we formalize this intuition through the lens of PAC-Bayesian bounds, revealing the increasing difficulty of obtaining tight statistical certificates for larger state spaces. To achieve this, our technical contributions are two-fold. First, an oracle concentration inequality for the forward Kullback-Leibler (KL) divergence between the learned and targeted flow assignments (Theorem 5.2), inspired by Malkin et al.'s (2023) interpretation of GFlowNets as variational inference. Second, an Azuma-Hoeffding-type inequality (Seldin et al., 2012b) for independently sampled martingales representing the extent to which the learned flow assignment violates the detailed balance condition (Bengio et al., 2023) in the observed trajectories (Theorem 5.4). Motivated by these results, we pose the following question: what can we do to mitigate the issues raised by an inadequate coverage of the state space, as illustrated in Section 4, and by longer trajectories, as analyzed in Section 5? As we show in Section 6, the answer lies in breaking up the flow network into multi-source subnetworks grouped together by a small root network (Figure 3). In this new paradigm, finding a compatible flow assignment becomes a two-stage process. Firstly, a different GFlowNet is trained on each subnetwork in a distributed fashion.
Secondly, an additional GFlowNet trained on the root network learns to assign the correct amount of flow to each subnetwork, as estimated in the previous step. The resulting algorithm, which we call Subgraph Asynchronous Learning (SAL), has several advantages over the standard approach (Section 6.2). In particular, each GFlowNet within this framework needs to solve a problem that is relatively simpler than that of a unique, centralized model. Similarly, the asynchronous nature of the algorithm implies that we are able to visit a considerably larger fraction of the original flow network within a fixed time window and, therefore, that we get a significantly better coverage of the state space. As a consequence, we are also able to drastically improve the discovery of high-value states within the flow network, which is a metric of great interest in the GFlowNet literature (Bengio et al., 2021; Shen et al., 2023; Zhang et al., 2023a; Pan et al., 2023a; Jang et al., 2024; Kim et al., 2024b). In summary, our contributions are:

1. We construct a family of examples in which a GFlowNet does not generalize even after learning a compatible flow assignment on arbitrarily large fractions of the flow network (Section 4);
2. We provide the first non-vacuous generalization bounds for GFlowNets (Section 5.1);
3. We derive oracle PAC-Bayesian inequalities for the population risk of GFlowNets, emphasizing the impact of the flow network's topology on generalization performance (Section 5.2);
4. We design the first distributed algorithm for learning GFlowNets with network-level parallelization, and evaluate its performance on common benchmarks in the literature (Section 6).

The first part of the paper establishes the notation and terminology used throughout and reviews relevant results in the GFlowNet and PAC-Bayes literature, alongside an overview of our main contributions (Sections 2 and 3).
The second part provides a formal treatment of GFlowNets from the viewpoint of PAC-Bayesian theory (Section 5). The third and final part outlines the foundations of SAL and conducts an empirical evaluation of the algorithm on common benchmark problems (Section 6). We defer the proofs and details of the experiments to the supplement.

2 PRELIMINARIES

Notations and terminology. Let $G = (V, E)$ be a directed acyclic graph (DAG). A forward policy over $G$ is a Markov transition kernel $p_F: V \times V \to [0, 1]$ supported on $G$'s edges, i.e., such that $p_F(v, \cdot)$ is a distribution over $\{u: (v, u) \in E\}$ for each vertex $v$. We interchangeably use $p_F(\cdot|v)$ and $p_F(v, \cdot)$ to represent $p_F$. A backward policy $p_B$ over $G$ is a forward policy on the transpose graph $G^\top = (V, E^\top)$ with $E^\top = \{(u, v): (v, u) \in E\}$. The uniform policy assigns the same probability mass to each of a state's children, i.e., $p_U(s'|s) = \mathbb{1}\{s' \in \mathrm{Ch}(s)\}/|\mathrm{Ch}(s)|$ with $\mathrm{Ch}(s) = \{s': (s, s') \in E\}$. We say that $G$ is pointed if there are nodes $s_o$ and $s_f$, respectively called the initial (source) and final (sink) nodes, such that $s_o$ (resp. $s_f$) is the only node without incoming (resp. outgoing) edges and, for each $s \in V$, there is a trajectory (directed path) between $s_o$ and $s_f$ containing $s$. In this case, a trajectory $\tau$ in $G$ is complete if it starts at $s_o$ and finishes at $s_f$, which we denote by $\tau: s_o \rightsquigarrow s_f$. Clearly, a forward policy induces a distribution over trajectories starting at $s$ via $p_F(\tau|s) = \prod_{(s', s'') \in \tau} p_F(s''|s')$; when $\tau$ is unambiguously complete, we will often omit $s_o$ from this notation. Lastly, for probability measures $P$ and $Q$ on the same space, we let $KL(P\|Q)$, $\chi^2(P\|Q)$ and $TV(P, Q)$ respectively denote their Kullback-Leibler (KL) divergence, $\chi^2$ divergence, and total variation distance.

GFlowNets.
We represent a GFlowNet (Bengio et al., 2021; 2023; Lahlou et al., 2023) $\mathcal{G}$ as a tuple $(S, X, G, \mathcal{A}, T, p_F, p_B, R, F)$ consisting of a set of states $S$; a set of terminal states $X \subseteq S$; a pointed DAG $G = (S, E)$, which is called a state graph; an action mapping $A: S \to 2^{\mathcal{A}}$ associating each state $s$ with an abstract action space $A(s) \subseteq \mathcal{A}$ that is isomorphic to the children of $s$ in $G$; a transition function $T: \bigcup_{s \in S} (\{s\} \times A(s)) \to S$ defining how a state $s$ is affected by an action $a \in A(s)$; forward $p_F$ and backward $p_B$ policies on $G$; a reward function $R: X \to \mathbb{R}_+$ attributing a positive value to each terminal state; and a flow function $F: S \to \mathbb{R}_+$ such that $F|_X = R$. Importantly, only the elements of $X$ are connected to the sink node $s_f$ of $G$. When there is no risk of ambiguity, we will simply write $\mathcal{G} = (p_F, p_B, F)$. The objective of a GFlowNet is to find a $p_F$ such that the marginal distribution $p_T(x) := p_F(x|s_o) = \sum_{\tau: s_o \rightsquigarrow x} p_F(\tau)$ over $X$ matches $R$ up to a normalizing factor (Bengio et al., 2021). In Appendix A, we illustrate how this abstract representation can be instantiated to accommodate three frequently considered use-cases (Malkin et al., 2022; 2023).

Learning GFlowNets. In this context, $S$, $X$, $G$, $\mathcal{A}$, $T$, and $R$ are problem-dependent, while $p_F$, $p_B$, and $F$ are unknowns that should be estimated. Remarkably, however, $p_B$ is often fixed as uniform (Shen et al., 2023; Liu et al., 2023; Zhang et al., 2023a), an assumption that we make throughout the paper, although most of our theoretical results and all our methods can be extended to the case of a learnable $p_B$. Under these circumstances, many learning objectives have been proposed for learning $p_F$ and $F$.
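To make the preceding definitions concrete, the following sketch samples complete trajectories from a forward policy on a toy pointed DAG and accumulates $p_F(\tau)$ as the product of transition probabilities. The four-state graph and all names here are hypothetical illustrations, not objects from the paper.

```python
import random

# A minimal pointed DAG: "s0" is the source, "sf" the sink, and the two
# terminal-like states "a" and "b" are the only states connected to "sf".
EDGES = {"s0": ["a", "b"], "a": ["sf"], "b": ["sf"], "sf": []}

def uniform_policy(state):
    """Uniform forward policy p_U(.|s): equal mass on each child of s."""
    children = EDGES[state]
    return {c: 1.0 / len(children) for c in children}

def sample_trajectory(policy, rng, start="s0", sink="sf"):
    """Sample a complete trajectory s0 ~> sf and return it together with
    p_F(tau) = product of transition probabilities along tau."""
    state, traj, prob = start, [start], 1.0
    while state != sink:
        probs = policy(state)
        nxt = rng.choices(list(probs), weights=list(probs.values()))[0]
        prob *= probs[nxt]
        traj.append(nxt)
        state = nxt
    return traj, prob

rng = random.Random(0)
traj, p = sample_trajectory(uniform_policy, rng)
# Every complete trajectory here has probability 0.5 * 1.0 = 0.5.
```

In a real instantiation, `uniform_policy` would be replaced by a neural network mapping states to distributions over their children.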
Two popular choices, which we adopt here, are the trajectory balance (TB, $\mathcal{L}_{TB}(p_F, F)$, Malkin et al., 2022) and detailed balance (DB, $\mathcal{L}_{DB}(p_F, F)$, Bengio et al., 2023) losses,

$$\mathbb{E}_{\tau \sim p_E}\left[\left(\log \frac{F(s_o)\, p_F(\tau)}{R(x)\, p_B(\tau|x)}\right)^2\right] \quad \text{and} \quad \mathbb{E}_{\tau \sim p_E}\left[\frac{1}{|\tau|}\sum_{(s, s') \in \tau}\left(\log \frac{p_F(s'|s)\, F(s)}{p_B(s|s')\, F(s')}\right)^2\right], \quad (1)$$

in which $|\tau|$ represents $\tau$'s length, $x$ is $\tau$'s (unique) terminal state, and $p_E$ is an exploratory policy. Intuitively, $p_E$ plays the role of the data-generating distribution in a standard supervised learning context and is often defined as $p_E = \epsilon\, p_U + (1 - \epsilon)\, p_F$, an $\epsilon$-greedy version of $p_F$, with $p_U$ denoting a uniform policy, although more sophisticated techniques have been developed (Kim et al., 2024b; Rector-Brooks et al., 2023; Vemgal et al., 2023). In Appendix A we provide a more thorough overview of GFlowNet learning, including the subtrajectory balance loss (SubTB, Madan et al., 2022) and divergence-based objectives (Malkin et al., 2023; Lahlou et al., 2023).

Generalization bounds for neural networks. The field of statistical learning theory (Vapnik, 1998; 2000) seeks to develop statistical certificates for the generalization of a learned model by providing high-probability upper bounds on the population error of an estimator as a function of the observed empirical risk. In the context of GFlowNets, we ask whether an empirically measured imbalance based on the observed trajectories, such as the losses in Equation 1 or other locally computed metrics (see Sections 4 and 5), is an appropriate surrogate for the GFlowNet's overall distributional accuracy. In particular, we are interested in inductive statistical guarantees, namely, those based on the training set (as opposed to the transductive setting, in which a test set is used). To this end, the PAC-Bayes framework of McAllester (1998; 1999; 2013) often provides the tightest bounds (Pérez-Ortiz et al., 2021; Lotfi et al., 2024b;a; Dziugaite & Roy, 2017).
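As a minimal numerical illustration of the TB objective in Equation 1, the squared residual $\log(Z\, p_F(\tau)) - \log(R(x)\, p_B(\tau|x))$ (with $Z = F(s_o)$) vanishes exactly when the flow is balanced. The numbers below are toy values chosen for illustration, with a uniform $p_B$ assumed.

```python
import math

def tb_loss(log_Z, log_pf_traj, log_pb_traj, reward):
    """Per-trajectory trajectory-balance residual, squared:
    (log Z + log p_F(tau) - log R(x) - log p_B(tau|x))^2."""
    return (log_Z + log_pf_traj - math.log(reward) - log_pb_traj) ** 2

# Perfectly balanced toy case: Z = 2, p_F(tau) = 0.5, R(x) = 1,
# p_B(tau|x) = 1 (tree-structured graph), so the residual is zero.
balanced = tb_loss(math.log(2.0), math.log(0.5), 0.0, 1.0)

# Doubling the reward without adjusting the flow leaves a (log 2)^2 penalty.
imbalanced = tb_loss(math.log(2.0), math.log(0.5), 0.0, 2.0)
```

Training minimizes the expectation of this residual over trajectories drawn from the exploratory policy $p_E$.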
In a nutshell, consider data $X = \{X_i\}_{i=1}^m$ drawn from some data distribution, a significance level $\delta$, an empirical loss $\hat{L}(\theta, X)$, and a population loss $L(\theta) = \mathbb{E}_X[\hat{L}(\theta, X)]$ associated with the model's parameters $\theta$. Given a prior distribution $Q$ over $\theta$ (independent of $X$), a PAC-Bayes bound typically assumes the form

$$\mathbb{E}_{\theta \sim P}[L(\theta)] \le \mathbb{E}_{\theta \sim P}[\hat{L}(\theta, X)] + \phi(\delta, P, Q, m). \quad (2)$$

The inequality holds with probability $1 - \delta$ over draws of $X$, simultaneously for all posterior distributions $P$ over $\theta$, and $\phi$ is a term penalizing the model's complexity (McAllester, 1999). We direct the reader to Alquier (2024) for a comprehensive introduction to PAC-Bayesian analysis. For bounded $L$, the right-hand side of Equation 2 is termed vacuous if it is larger than an upper bound of $L$. Although McAllester's original works posited that the data were independent and identically distributed (i.i.d.) and that the risk function was uniformly bounded (McAllester, 1998; 1999), recent advances relaxed these assumptions by deriving generalization bounds for non-i.i.d. data (Seldin et al., 2012b; Barnes et al., 2022), with applications to multi-armed bandits and RL (Fard & Pineau, 2010; Beygelzimer et al., 2011; Tasdighi et al., 2024), and for unbounded losses controlled by high-probability bounds (Alquier & Guedj, 2017; Haddouche & Guedj, 2023; Casado et al., 2024; Mbacke et al., 2023). To the best of our knowledge, however, this is the first work promoting the development of PAC-Bayesian bounds for understanding the generalization of GFlowNets.

3 OVERVIEW OF OUR RESULTS

Before delving into the details of our work in Sections 4, 5, and 6 (with further details in the appendices in the supplement), we provide below a brief discussion of our technical results in light of the formalism presented in Section 2, alongside the main ideas they were built upon.

Non-vacuous generalization bounds for GFlowNets.
The learning objectives in Equation 1, due to the unboundedness of the logarithm, cannot be directly incorporated into standard PAC-Bayesian theorems (McAllester, 2013), which assume that the risk function has at least bounded exponential moments (Casado et al., 2024; Rodríguez-Gálvez et al., 2024a;b). To circumvent this issue, our empirical analysis in Section 5.1 adopts the recently proposed FCS metric (Silva et al., 2024) as the risk functional measuring the accuracy of a trained GFlowNet, which may be written as

$$\mathcal{L}_{FCS}(p_F) = \mathbb{E}_{(\tau_1, x_1), \dots, (\tau_B, x_B)}\left[TV\left(p_T^{x_{1:B}}, R^{x_{1:B}}\right)\right] \in [0, 1], \quad (3)$$

where $p_T^{x_{1:B}}$ and $R^{x_{1:B}}$ are the respective restrictions of $p_T$ and $R$ to the $B$-sized multiset $\{\{x_1, \dots, x_B\}\} \subseteq X$ of terminal states, and $TV$ is the total variation distance. However, despite being easily computable, $\mathcal{L}_{FCS}$ is not an appropriate learning objective for GFlowNets due to the potential numerical instability of the non-log domain. Instead, we minimize $\mathcal{L}_{TB}$ as a surrogate objective for $\mathcal{L}_{FCS}$ during training and evaluate the generalization bound on $\mathcal{L}_{FCS}$ in the inductive fashion mentioned in Section 2. Importantly, Figure 2 shows that the resulting bounds are remarkably tight.

Oracle generalization bounds for GFlowNets. As a complement, we also establish non-empirical high-probability upper bounds on the population risk of GFlowNets by assuming that a potentially intractable quantity bounds the corresponding loss function. In Section 5.2, we follow this rationale and demonstrate that there is always an $\alpha > 0$ for which the set of policy networks of the form $\alpha p_U + (1 - \alpha) p_F$ contains the solution to the flow assignment problem. Armed with such a family of models, whose transition probabilities are uniformly bounded away from zero, we consider the reverse KL divergence risk (Malkin et al., 2023) to avoid explicitly bounding the flow function $F$.
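A one-sample sketch of the FCS risk in Equation 3: restrict $p_T$ and $R$ to a batch of terminal states, renormalize both, and take the total variation distance. The dictionaries below are hypothetical toy values, not an implementation from the paper.

```python
def tv_on_batch(p_T, R, batch):
    """One-sample estimate of the FCS risk: total variation distance between
    the learned marginal p_T and the reward R, both restricted to a batch of
    terminal states and renormalized on that batch."""
    zp = sum(p_T[x] for x in batch)
    zr = sum(R[x] for x in batch)
    return 0.5 * sum(abs(p_T[x] / zp - R[x] / zr) for x in batch)

# Toy example: on the batch {x1, x2}, p_T is proportional to R, so the
# restricted distributions coincide and the risk is zero.
p_T = {"x1": 0.2, "x2": 0.4, "x3": 0.4}
R = {"x1": 1.0, "x2": 2.0, "x3": 2.0}
risk = tv_on_batch(p_T, R, ["x1", "x2"])
```

Averaging this quantity over many sampled batches yields an unbiased estimate of $\mathcal{L}_{FCS}$, and the value always lies in $[0, 1]$.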
Although informative, the resulting Theorem 5.2 only considers trajectories, and not transitions, as data points. As the number of observed transitions is significantly larger than that of trajectories, we enrich our results by constructing a martingale difference sequence based on the DB loss and adapting Azuma's inequality (Azuma, 1967; Seldin et al., 2012b) to the context of independent martingales to derive a transition-level generalization bound for GFlowNets. Both approaches, which are respectively encapsulated in Theorems 5.2 and 5.4, show that, apart from technical nuances, the population risk can be bounded with high probability as

$$\mathbb{E}_{\theta \sim P}[L(\theta)] \le \mathbb{E}_{\theta \sim P}[\hat{L}(\theta)] + O\left(\frac{\log t_m}{n^{\alpha}}\right), \quad (4)$$

in which $\hat{L}$ is an empirical measure of risk, $t_m$ is the maximum trajectory length in the state graph, and $n$ is the number of observed data points: either trajectories (Theorem 5.2, $\alpha = 0.5$) or transitions (Theorem 5.4, $\alpha = 1$). From an analytical perspective, these results suggest that learning provably generalizable flow assignments is increasingly harder for state spaces with longer trajectories.

4 WHEN DO GFLOWNETS NOT GENERALIZE?

To start our discussion on the generalization of GFlowNets, we introduce simple, but non-trivial, examples in which a GFlowNet does not learn a generalizable policy network even after minimizing the loss on an arbitrarily large portion of the state space, raising the questions of when GFlowNets generalize and how to measure such generalization, which we investigate in Sections 5.1 and 5.2.

A non-generalizable data distribution. To make our arguments concrete, we recall the task of set generation for GFlowNets (Pan et al., 2023a;b; Bengio et al., 2023; Jang et al., 2024). Each state corresponds to a subset of a set $\mathcal{W} = \{1, \dots, W\}$ for a given $W$; the generative process starts at an empty set $s_o = \emptyset$ and iteratively adds elements from $\mathcal{W}$ to $s_o$ until a prescribed size $T$ is achieved.
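As a quick sanity check on this set-generation environment: if a policy never adds a particular element (say, element 1), the reachable terminal states are the size-$T$ subsets of $\{2, \dots, W\}$, a $\binom{W-1}{T}/\binom{W}{T} = (W - T)/W$ fraction of all size-$T$ subsets, which approaches 1 for fixed $T$ and growing $W$. The snippet below is a direct computation of this combinatorial fact, not code from the paper; the specific $W = 32$, $T = 3$ instance is an assumption chosen because it yields a coverage consistent with the "90.62% visited" regime of Figure 1.

```python
from math import comb

def coverage(W, T):
    """Fraction of size-T subsets of {1..W} that avoid a fixed element:
    C(W-1, T) / C(W, T) = (W - T) / W."""
    return comb(W - 1, T) / comb(W, T)

c = coverage(W=32, T=3)   # 29/32 = 0.90625, i.e., ~90.62% of terminal states
```

So an exploratory policy can "see" almost the entire terminal space while still missing every state containing the excluded element.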
For our purposes, we fix a function $u: \mathcal{W} \to [0, 1]$, representing the log-utility of each $w \in A(s_o) := \mathcal{W}$, and define the reward $R$ associated to $S$ as $R(S) = \mathbb{1}\{\#S = T\} \exp\{\sum_{w \in S} u(w)\}$. Also, let $p_E$ be a forward policy such that $p_E(\cdot|s)$ is supported on $A(s) \setminus \{1\} := \mathcal{W} \setminus (\{1\} \cup s)$ for every $s$, i.e., the support of the marginal $p_{E,T}$ of $p_E$ on $X$ is the set $X'$ of subsets of $\{2, \dots, W\}$. We next show that $X'$ covers an arbitrarily large portion of $X$ for specific choices of $T$ and $W$.

Lemma 4.1. For each $\xi \in (0, 1)$, there exist $T$ and $W$ such that $|X'| \ge \xi |X|$.

[Figure 1: Convergence speed when actions are masked (blue) or not (orange) for different state space sizes (81.25% and 90.62% of states visited); curves compare $\epsilon$-greedy exploration and $p_E$; x-axis: training epoch.]

The (straightforward) proof of Lemma 4.1 can be found in Appendix D. Obviously, we cannot hope that a GFlowNet trained by minimizing an empirical risk defined on trajectories sampled from $p_E$ would generalize to unseen states, as no information regarding $u(1)$ would be available during training. To empirically validate our reasoning, we show in Figure 1 that a GFlowNet trained on samples from $p_E$ fails to learn the right distribution, whereas a standard $\epsilon$-greedy strategy succeeds. It is remarkable, however, that a GFlowNet is unable to successfully sample from the target distribution even after minimizing the empirical risk on samples covering over 90% of the state space. From a statistical viewpoint, this behavior can be explained via a change-of-measure inequality: preference over states is not properly captured by the sampling distribution ($p_E$). We formalize this intuition in the proposition below.

Proposition 4.2 (Generalization depends on the sampling distribution). Let $(p_F, p_B, R)$ be a GFlowNet and $p_{E,T}$ be (any) distribution over $X$. Also, recall that $\pi(x)$ represents the normalized target and $p_T$ the learned marginal. Define $q_{E,T}$ as a uniform PMF on $X$, i.e., $q_{E,T}(x) = 1/|X|$.
Then,

$$TV(p_T, \pi) \le \sqrt{\left(1 + \chi^2(q_{E,T} \,\|\, p_{E,T})\right)\, \mathbb{E}_{x \sim p_{E,T}}\, \mathbb{E}_{\tau \sim p_B(\tau|x)}\left[\left(\log \frac{p_F(\tau)}{\pi(x)\, p_B(\tau|x)}\right)^2\right]}, \quad (5)$$

in which $\chi^2(P\|Q)$ represents the $\chi^2$ divergence between $P$ and $Q$.

We interpret Equation 5 in the following way: if the sampling policy ($p_{E,T}$) greatly deviates from the uniform ($q_{E,T}$), then a small empirical risk does not necessarily ensure an accurate distributional approximation. In contrast, Equation 5 does not entail that the uniform distribution is the optimal choice for sampling trajectories, as it does not address the algorithmic difficulty of minimizing the empirical risk via SGD. Illustratively, we show in Table 1 in Appendix B the values of $\chi^2(q_{E,T} \| p_{E,T})$ when $p_{E,T}$ is far away from $q_{E,T}$, and of $\chi^2(q_{E,T} \| p_{\epsilon,T})$ for the $\epsilon$-greedy policy considered in Figure 1. On a fundamental level, these examples underline the importance of taking the data distribution into account for understanding generalization performance. Section 5 elaborates on this problem through the lens of PAC-Bayes bounds, albeit with data-dependent priors (Dziugaite & Roy, 2017; 2018; Dziugaite et al., 2021; Pérez-Ortiz et al., 2021).

5 PAC-BAYESIAN GENERALIZATION BOUNDS FOR GFLOWNETS

Towards the objective of understanding GFlowNet generalization, we construct high-probability upper bounds on different risk functions. In Section 5.1, we build upon McAllester's empirical bound and Dziugaite's data-dependent priors to derive the first non-vacuous generalization bounds for GFlowNets. Then, to gain a clearer understanding of the factors hindering the generalizability of these models, we provide both trajectory- and transition-level oracle bounds in Section 5.2 by drawing upon the martingale-based PAC-Bayesian theory for non-i.i.d. data (Beygelzimer et al., 2011).

5.1 NON-VACUOUS EMPIRICAL GENERALIZATION BOUNDS

GFlowNet learning as supervised learning.
To rigorously address the generalization of GFlowNets, we first frame the training of these models as a supervised learning problem (Shalev-Shwartz & Ben-David, 2014; Atanackovic & Bengio, 2024). For this, we assume that a set of independently sampled complete trajectories, $\mathcal{T}_n = \{\tau_1, \dots, \tau_n\}$, is drawn from a fixed distribution and that each trajectory $\tau_i$ is annotated with a noise-free target, $y_i = p_B(\tau_i|x_i) R(x_i)$, with $x_i$ representing $\tau_i$'s unique terminal state. Importantly, the only supervision during training comes from the reward function; we do not make assumptions on the distribution over $\mathcal{T}_n$. In this context, minimizing $\mathcal{L}_{TB}$ corresponds to finding the least-squares solution in $p_F$ to the equation $\log Z + \log p_F(\tau) = \log p_B(\tau|x) R(x)$. Importantly, this setting differs from conventional GFlowNet training algorithms, for which the sampling policy depends on the trajectories observed so far, that is, the trajectories are not independently sampled. Nonetheless, the question of whether GFlowNets generalize remains relevant even under our relatively simplified conditions, which may be seen as a single iteration of an $\epsilon$-greedy strategy (Krichel et al., 2024).

A bounded risk functional for GFlowNets. As mentioned, PAC-Bayesian theory originally relied on the assumption of bounded risk functions (McAllester, 1998). Despite recent advances in extending the theory to unbounded losses (Casado et al., 2024; Rodríguez-Gálvez et al., 2024a;b; Haddouche & Guedj, 2023), most generalization bounds still depend on technical and hard-to-verify conditions, e.g., on exponential moments. For this reason, we use the FCS metric as a measure of risk (Silva et al., 2024); see Appendix B for an unbiased estimator $\hat{\mathcal{L}}_{FCS}(p_F, \mathcal{T}_n)$ of $\mathcal{L}_{FCS}(p_F)$.

Data-dependent priors for PAC-Bayes. For this, we first recall the techniques originally developed by Dziugaite & Roy (2017; 2018) and Dziugaite et al.
(2021) in a striking series of papers for probing the generalization of overparameterized neural networks in the supervised learning context. To start with, we state below Dziugaite et al.'s (2021) empirical PAC-Bayes bound, which combines results from McAllester (2013), Rivasplata et al. (2019), and Boucheron et al. (2013). For completeness, we also provide a self-contained proof of Proposition 5.1 in Appendix D in the supplement.

Proposition 5.1 (Empirical PAC-Bayesian bounds). For any distribution $\zeta$ on the parameters $\theta$ of $p_F$, let $\mathcal{L}_{FCS}(\zeta) = \mathbb{E}_{\theta \sim \zeta}[\mathcal{L}_{FCS}(R, p_T)]$ and define $\hat{\mathcal{L}}_{FCS}(\zeta, \mathcal{T}_n)$ similarly. Also, let $\alpha \in (0, 1)$ and let $P$ be a distribution on $\theta$ learned on a uniformly random $\lceil(1 - \alpha)n\rceil$-sized subset $\mathcal{T}_{1-\alpha}$ of $\mathcal{T}_n$. Then,

$$\mathcal{L}_{FCS}(P) \le \hat{\mathcal{L}}_{FCS}(P, \mathcal{T}_{1-\alpha}) + \min\left\{\sqrt{\eta\left(\eta + 2 \hat{\mathcal{L}}_{FCS}(P, \mathcal{T}_{1-\alpha})\right)},\ \sqrt{\eta}\right\} \quad (6)$$

with probability at least $1 - \delta$ over $\mathcal{T}_{1-\alpha}$, in which $\eta := \frac{KL(P\|Q) + \log\frac{2\sqrt{(1-\alpha)n}}{\delta}}{(1-\alpha)n}$ and $Q$ is a distribution that does not depend on $\mathcal{T}_{1-\alpha}$ but may depend on $\mathcal{T}_\alpha := \mathcal{T}_n \setminus \mathcal{T}_{1-\alpha}$.

[Figure 2: Non-vacuous generalization bounds for the FCS risk functional in Eq. 10; panels: Sets, Bags, Seq., SIX6; bars compare the risk ($\mathcal{L}_{FCS}$) and the bound.]

When the prior distribution $Q$ is naively chosen (e.g., as a standard Gaussian distribution), the KL divergence in Equation 6 often dominates the right-hand side of the equation and results in vacuous bounds, i.e., $\mathcal{L}_{FCS}(P) \le a$ for some $a > 1$. To address this issue, the influential work of Dziugaite & Roy (2017) proposed the use of a data-dependent $Q$ learned by minimizing the empirical risk functional on a fraction $\alpha$ of the data and, after learning $P$ by minimizing Equation 6, evaluating the generalization bound on the remaining $(1 - \alpha)$ portion of the data, as presented in Proposition 5.1.

Empirical results. We follow a similar approach to derive the first non-vacuous generalization bounds for GFlowNets in the literature. For this, we disjointly partition the dataset $\mathcal{T}_n$, with $n = 3 \times 10^4$, into sets $\mathcal{T}_\alpha$ and $\mathcal{T}_{1-\alpha}$ with $\alpha = 0.6$.
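The data-splitting step of this protocol can be sketched as follows: shuffle the trajectory dataset and carve out an $\alpha$-fraction $\mathcal{T}_\alpha$ for learning the data-dependent prior $Q$, holding out $\mathcal{T}_{1-\alpha}$ for evaluating the bound. The helper name and the use of integer indices as stand-ins for trajectories are illustrative assumptions; only the $n = 3 \times 10^4$, $\alpha = 0.6$ configuration comes from the text.

```python
import random

def split_for_data_dependent_prior(trajectories, alpha, seed=0):
    """Disjointly partition the dataset into T_alpha (used to learn the
    prior Q) and T_{1-alpha} (held out to evaluate the bound), following
    the data-dependent-prior protocol of Dziugaite & Roy (2017)."""
    rng = random.Random(seed)
    idx = list(range(len(trajectories)))
    rng.shuffle(idx)
    cut = int(alpha * len(trajectories))
    T_alpha = [trajectories[i] for i in idx[:cut]]
    T_rest = [trajectories[i] for i in idx[cut:]]
    return T_alpha, T_rest

data = list(range(30_000))            # stand-ins for n = 3 * 10^4 trajectories
T_a, T_1a = split_for_data_dependent_prior(data, alpha=0.6)
```

Crucially, the bound is then evaluated only on the held-out part, so the prior remains independent of the data used for certification.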
We learn an isotropic Gaussian prior $Q$ on $\mathcal{T}_\alpha$ and then a diagonal Gaussian posterior $P$ on $\mathcal{T}_\alpha \cup \mathcal{T}_{1-\alpha}$ by minimizing the bound in Equation 6. Finally, the bound is evaluated on $\mathcal{T}_{1-\alpha}$ to obtain the statistical certificate (Pérez-Ortiz et al., 2021). Results in Figure 2 for the tasks of set generation (Pan et al., 2023a; Bengio et al., 2023), bag generation (Shen et al., 2023; Jang et al., 2024), sequence design with additive rewards (Malkin et al., 2022; 2023; Madan et al., 2022), and SIX6 (Jain et al., 2022; Malkin et al., 2022; Shen et al., 2023) highlight the non-vacuousness of Equation 6 and the generalizability of the trained models. Please refer to Section 6 and to Appendix B for a detailed description of the experimental setup. Appendix A describes the design of the GFlowNet for each of these generative tasks.

5.2 ORACLE GENERALIZATION BOUNDS

Although the previous section's empirical results certify the generalization of the learned policy network to novel trajectories, they do not necessarily shed light on which characteristics of the generative task hinder the model's generalization capability. In the remainder of this section, we thus derive generalization bounds that, despite not being directly computable, provide a finer understanding of which factors play a role when the goal is to learn a generalizable policy. In particular, we observe that longer trajectories and peakier target distributions tend to make generalization harder when a fixed sampling budget is available. In Section 6, we will see how a distributed algorithm may alleviate these issues (Yagli et al., 2020; Barnes et al., 2022; Sefidgaran et al., 2022).

Trajectory-level bounds. We start by deriving generalization bounds for GFlowNets when the trajectories are independently sampled (see Section 5.1) and the risk functional is the KL divergence between the backward and forward policies, i.e., $KL(p_B \| p_F)$, in which $p_B(\tau) \propto p_B(\tau|x) R(x)$. This choice is motivated by Malkin et al.
(2023)'s interpretation of GFlowNets as hierarchical variational inference algorithms and by the ability of $KL(p_B \| p_F)$ to focus the model on high-probability regions of the target, which is a desirable trait of GFlowNets. Remarkably, we show in Lemma B.1 that $KL(p_B \| p_F)$ can be bounded by sensibly reparameterizing $p_F$ as a mixture policy. Then, as shown in Theorem 5.2 below, this reparametrization enables developing oracle generalization bounds in the fashion of the tight results we derived in Section 5.1.

Theorem 5.2. Let $\mathcal{G} = (p_F, p_B, F)$ be a GFlowNet with policy network $p_F$ parameterized as in Lemma B.1. Also, let $Q$ be a probability distribution over the parameters $\theta$ of $p_F$. Denote $H[p_B] = -\mathbb{E}_{\tau \sim p_B}[\log p_B(\tau)]$ for $p_B$'s entropy and $M_T = \max_\tau \left(|\tau| \log\left(\alpha^{-1} \max_{s \in \tau} |\mathrm{Ch}(s)|\right)\right)$. Then,

$$\mathbb{E}_{\theta \sim P}[KL(\pi \| p_T)] \le \mathbb{E}_{\theta \sim P}\left[\frac{1}{n}\sum_{1 \le i \le n} \log \frac{p_B(\tau_i)}{p_F(\tau_i)}\right] + \left(-H[p_B] + M_T\right) \eta(P, Q, n), \quad (7)$$

in which we recall that $\eta(P, Q, n) = \sqrt{\frac{KL(P\|Q) + \log\frac{2\sqrt{n}}{\delta}}{n}}$ and $\pi(x) \propto R(x)$ is the target.

A few remarks on the excess risk upper bound of Theorem 5.2. Firstly, the assumption that trajectories are sampled according to $p_B(\tau) \propto p_B(\tau|x) R(x)$ is consistent with popular strategies for learning GFlowNets that focus on sampling trajectories leading to high-reward states more often than those leading to low-reward states, e.g., using a replay buffer (Deleu et al., 2022). Secondly, in alignment with well-established practical knowledge, the result in Equation 7 shows it is harder to achieve tighter generalization bounds when the target distribution is spiky, with a small entropy term $H[p_B]$, and when the generative task is composed of longer trajectories or larger action spaces.

Transition-level bounds. For many applications, the number of observed complete trajectories when training GFlowNets can be orders of magnitude smaller than the number of collected state transitions.
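To see why counting transitions instead of trajectories can tighten the bound, consider the schematic decay $O((\log t_m)/n^{\alpha})$ of the complexity term from Section 3. The snippet below is a back-of-the-envelope rate comparison only, ignoring constants and KL terms (an illustrative assumption, not the actual bound): a trajectory of length roughly $t_m$ contributes $t_m$ transitions, and the faster $n^{-1}$ rate compounds the larger sample size.

```python
import math

def complexity_decay(n, t_m, alpha):
    """Schematic decay (log t_m) / n^alpha of the complexity term; constants
    and KL terms are dropped, so this illustrates rates, not actual bounds."""
    return math.log(t_m) / n ** alpha

t_m = 20                                    # maximum trajectory length
n_traj = 1_000                              # observed complete trajectories
n_trans = n_traj * t_m                      # roughly t_m transitions each

traj_term = complexity_decay(n_traj, t_m, alpha=0.5)   # trajectory-level rate
trans_term = complexity_decay(n_trans, t_m, alpha=1.0) # transition-level rate
# trans_term is orders of magnitude smaller than traj_term.
```

This is the quantitative intuition behind preferring the transition-level analysis whenever transitions are plentiful.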
In this context, one may obtain significantly tighter generalization bounds by interpreting the transitions, and not the complete trajectories, as data samples (Lotfi et al., 2024a;b). Indeed, it is assumed that GFlowNets' outstanding potential emerges from their capacity to exploit the compositional structure of the space characterized by the state graph (Bengio et al., 2021; Nica et al., 2022; Shen et al., 2023; Atanackovic & Bengio, 2024). To incorporate this structure into our theoretical bounds, we shift our focus to the design of Azuma-Hoeffding-type concentration inequalities (Azuma, 1967; McDiarmid, 1998; Boucheron et al., 2013) applied to the stochastic process induced by the Markov Decision Process (MDP) governing the data-generating process. For this, we start by defining a martingale difference sequence based on the DB loss (Bengio et al., 2023, Example 5).

Definition 5.3 (A martingale difference sequence for the DB loss). Recall the detailed balance loss L_DB(s, s′) = (log F(s)p_F(s′|s) − log F(s′)p_B(s|s′))². For a fixed sampling policy p_E, we let M(S_i, S…

…> 0, then ρ_F(s, {s_f}) = 1; 5. for every A ∈ Σ, there is a t < t_m such that ρ_F^t(s_o, A) > 0.

In this work, S is always finite, V is the discrete topology, and µ is the counting measure. The state graph G is induced by ρ_F, i.e., (u, v) is an edge in G if and only if ρ_F(u, {v}) > 0. Acyclicity is ensured by the finitely absorbing property of ρ_F (item 2 of Definition A.1). Notably, the finite-S assumption covers the vast majority of use cases for GFlowNets. Under these conditions, we say κ′ is a backward reference kernel in S with respect to κ if κ(u, {v}) = κ′(v, {u}) for all (u, v) ∈ S × S. We refer the reader to Lahlou et al. (2023) for an overview of GFlowNets in infinite spaces.

A.2 GENERATIVE FLOW NETWORKS

A GFlowNet can be seen as a tuple ({s_o} ∪ S ∪ {s_f}, P_F, P_B, ρ_F, ρ_B, κ, κ′, µ, R) for which 1. κ is a forward reference kernel on S; 2. κ′ is a backward reference kernel in S with respect to κ; 3.
ρ_F (resp. ρ_B) is a finitely absorbing MTK with respect to κ (resp. κ′); 4. R: Σ → ℝ₊ is a measure such that R ≪ µ; 5. P_F: S × Σ → ℝ₊ is an MTK, called the forward policy, such that P_F(s, ·) ≪ ρ_F(s, ·); 6. P_B: S × Σ → ℝ₊ is an MTK, called the backward policy, such that P_B(s, ·) ≪ ρ_B(s, ·). We denote by p_F and p_B the densities of P_F and P_B with respect to their respective reference kernels. For simplicity, we interchangeably let R(x) be the density of R with respect to µ. In this scenario, the set of terminal states X is defined by X = {x ∈ S : P_F(x, {s_f}) > 0}. In practice, p_F is parameterized by a neural network and its parameters are estimated to ensure that the marginal of P_F(s_o, ·) over X matches R up to a normalizing constant. In the terminology of Section 2, the abstract action space A would correspond to A = ∪_{s∈S} {(s, u): P_F(s, {u}) > 0} and A(s) = {(s, u): P_F(s, {u}) > 0}. For most problems, we identify the edge (s, u) with an entity representing the difference between u and s, e.g., a nucleotide base when S is the space of nucleotide strings. We complement the discussion in Section 2 on how to learn a GFlowNet in the next section.

A.3 LEARNING GFLOWNETS

Below, we illustrate our definition of GFlowNets for three common generative tasks. These tasks encompass a large number of applications, e.g., Jain et al. (2022); Shen et al. (2023); Hu et al. (2024; 2023); Liu et al. (2023); Malkin et al. (2022); Pan et al. (2023b); Madan et al. (2022). 1. Autoregressive generation. Each object in S is a string of length up to L, and G is a tree rooted at s_o. Also, action sets A(s) represent an alphabet and a transition T(s, a) appends the character a to the string s. Here, p_B(s|(s, a)) = 1 for every s ∈ S and a ∈ A(s). 2. Set generation. Each s ∈ S is a subset of W = {1, . . . , W}, with X containing those s of size T. Action sets are A(s) = W \ s and transitions T(s, a) add the element a to s; see Figure 5. 3. Hypergrid environment.
Each s ∈ S is a point within {0, . . . , H−1}^d × {0, 1} for given H (size) and d (dimension); s_o = 0^{d+1} and the last coordinate indicates whether s ∈ X. Also, A(s) = {e_i : s_i < H−1} ∪ {⊤}, with e_i denoting the i-th canonical vector in ℝ^d and ⊤ a stop action. Transitions T(s, a) either add a to s, if a = e_i for some i; or set s_{d+1} = 1, if a = ⊤.

Figure 5: State graph for the set generation task (W = 3, T = 2).

Notably, T(x, ·) = s_f for every terminal state x ∈ X. We provided examples of R, F, and p_F throughout the main text; in particular, see Sections 4 and 6.2 and Appendix B. Figure 5 illustrates the state graph for the set generation task (omitting s_f). To learn a forward policy p_F, we minimize a stochastic objective based on the observed trajectories. Besides the ones shown in Equation 1, many loss functions have been recently proposed. The SubTB loss (Madan et al., 2022), for instance, is defined for a trajectory τ = (s_0, . . . , s_n) by

L_SubTB(τ) = ( ∑_{0≤i<j≤n} λ^{j−i} ( log F(s_i) ∏_{k=i}^{j−1} p_F(s_{k+1}|s_k) − log F(s_j) ∏_{k=i}^{j−1} p_B(s_k|s_{k+1}) )² ) / ( ∑_{0≤i<j≤n} λ^{j−i} )

for a given λ > 0. Other examples include QGFN (Lau et al., 2023), which learns a Q-function concomitantly with F and p_F and prunes the values of p_F based on Q at inference time for controllable greediness.

A.4 RELATED WORKS

GFlowNets (Bengio et al., 2021; 2023; Lahlou et al., 2023) were canonically proposed as a reinforcement learning algorithm for sampling compositional objects (e.g., graphs) proportionally to a prespecified reward function. From a theoretical perspective, the relationship between GFlowNets and variational inference (Malkin et al., 2023), entropy-regularized Q-learning (Tiapkin et al., 2024; Deleu et al., 2024), and diffusion models (Lahlou et al., 2023; Sendera et al., 2024; Venkatraman et al., 2024) has been thoroughly established.
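As a concrete illustration of the SubTB(λ) objective mentioned in Appendix A.3 above, the following sketch (our own; the array layout for log-flows and log-policies is a hypothetical convention, not the paper's code) computes it for a single trajectory:

```python
import numpy as np

def subtb_loss(log_F, log_pf, log_pb, lam=0.9):
    # SubTB(lambda) for one trajectory s_0 -> ... -> s_n, where
    # log_F[i]  = log-flow of s_i,
    # log_pf[i] = log p_F(s_{i+1} | s_i),
    # log_pb[i] = log p_B(s_i | s_{i+1}).
    n = len(log_pf)
    cf = np.concatenate([[0.0], np.cumsum(log_pf)])  # prefix sums of log p_F
    cb = np.concatenate([[0.0], np.cumsum(log_pb)])  # prefix sums of log p_B
    num = den = 0.0
    for i in range(n + 1):
        for j in range(i + 1, n + 1):  # every sub-trajectory s_i -> ... -> s_j
            w = lam ** (j - i)
            delta = (log_F[i] + cf[j] - cf[i]) - (log_F[j] + cb[j] - cb[i])
            num += w * delta ** 2
            den += w
    return num / den
```

When detailed balance holds exactly along the trajectory, every sub-trajectory residual telescopes to zero and the loss vanishes.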
From a practitioner's viewpoint, GFlowNets have been successfully applied to many problems including, but not restricted to, causal discovery (Deleu et al., 2022; 2023; da Silva et al., 2023), Bayesian phylogenetic inference (Zhou et al., 2024; da Silva et al., 2024), language and image modelling (Hu et al., 2024; Liu et al., 2023; Hu et al., 2023; Venkatraman et al., 2024), combinatorial optimization (Zhang et al., 2023a;b), and drug discovery (Bengio et al., 2021; Nica et al., 2022; Vemgal et al., 2023; Pan et al., 2023a). Indeed, we are confident that problems such as language modelling and drug discovery could greatly benefit from SAL if appropriate policy networks and fixed-horizon partitionings were designed. Nonetheless, given the open-endedness and specialized nature of these applications, we believe that they would be better suited for future, dedicated works and are, hence, not addressed in this text. Correspondingly, recent work by Jiralerspong et al. (2024) highlighted the competitive performance of stochastic GFlowNets in two-player zero-sum games, specifically Tic-Tac-Toe and Connect-4, and we are optimistic that an extension of SAL to stochastic environments would exhibit promising results for games with longer trajectories, e.g., Chess and Go. Orthogonal to these advances, the issue of generalization in GFlowNets has also received significant attention in the literature (Atanackovic & Bengio, 2024; Krichel et al., 2024). In sharp contrast to previous works, ours is the first to derive PAC-Bayesian bounds and provide non-vacuous statistical guarantees for GFlowNets, along with a theoretical analysis that highlights which factors are potentially harmful to the model's generalization performance.
Notably, a recent discussion by Bengio & Malkin (2024) provides an interesting perspective on generalization, active learning, and GFlowNets in the context of abstract reasoning for machine-learning-based theorem proving and conjecture formation. Concomitantly, we note that there is a well-established interest in the community in the development of more sample-efficient learning objectives for speeding up training convergence (Malkin et al., 2022; Deleu et al., 2022; Madan et al., 2022; Zhang et al., 2023a; da Silva et al., 2024; Tiapkin et al., 2024).

A.5 ADDITIONAL REVIEW OF PAC-BAYES BOUNDS

Historically, McAllester (1998; 1999)'s PAC-Bayesian theorems, which were inspired by the work of Shawe-Taylor & Williamson (1997), were developed with the objective of providing Probably Approximately Correct (PAC) guarantees for Bayesian algorithms with potentially misspecified prior distributions. Recently, the relationship between PAC-Bayesian theory and (approximate) Bayesian algorithms has been made explicit by Germain et al. (2016). From this perspective, Alquier (2024) provides an informative and comprehensive account of the literature on PAC-Bayes bounds, both theory and applications. In what are now well-established references, Catoni (2007) gives a rigorous foundation for PAC-Bayes bounds in supervised classification, including results on the form of the distributions that optimise the bounds; and Guedj (2019) provides a concise exposition of the essential form of PAC-Bayesian inequalities.
In the context of contemporary machine learning, PAC-Bayesian theory has found enormous success in the development of numerical generalization bounds for overparameterized neural network classifiers, achieving non-vacuous results (Dziugaite & Roy, 2017; 2018) and tight certificates (Pérez-Ortiz et al., 2021), which subsequent works have applied even to large language models (Lotfi et al., 2024a;b) with billions of parameters through appropriate compression techniques (Dettmers et al., 2023). In a recent work, Malach (2024) introduced the notion of length complexity for next-token autoregressive learning on chain-of-thought data, referring to the minimum number of iterations required by an AR learner to compute a target function, which is (loosely) connected to our results regarding the harmful effects of the maximum trajectory size on GFlowNet learning. Importantly, the advantageousness of distributed approaches for the generalization performance of learning algorithms was already pointed out by Yagli et al. (2020); Barnes et al. (2022); Sefidgaran et al. (2022); similarly to SAL, these authors consider the problem of training a set of models in parallel and subsequently aggregating them with a (possibly randomized) estimator in a central server. In spite of these advances, the development of tighter PAC-Bayes bounds with weaker assumptions on the risk functional, e.g., heavy-tailedness instead of boundedness, is still an active research field (Holland, 2019; Wu et al., 2021; Balsubramani, 2015; London et al., 2014; Biggs & Guedj, 2023; Rivasplata et al., 2020; Rodríguez-Gálvez et al., 2024a;b). Also, the development of PAC-Bayesian theory in the setting of non-i.i.d.
data is still relatively underdeveloped when compared against other branches of machine learning, although there are interesting results in online learning (Haddouche & Guedj, 2022), reinforcement learning (Fard & Pineau, 2010; Beygelzimer et al., 2011; Sakhi et al., 2023), and time series (Alquier et al., 2012). Finally, PAC-Bayesian theorems provide statistical guarantees for stochastic predictors, which are arguably not frequently used in practice, and the problem of derandomizing the resulting bounds is still mostly open. Notably, the derandomization of PAC-Bayes bounds has a non-negligible cost, and we refer the reader to Miyaguchi (2019); Biggs & Guedj (2022) for further details on this topic.

B EXPERIMENTAL DETAILS AND ADDITIONAL DISCUSSIONS

All experiments were conducted on a single Linux machine with 128 GB of RAM, featuring an NVIDIA RTX 3090 GPU and a 12th Gen Intel(R) Core(TM) i9-12900K CPU. Unless specified otherwise, the code for reproducing the experiments below was executed on this GPU.

B.1 A NON-GENERALIZABLE DISTRIBUTION

(W, T)    χ²(q_{E,T} || p_{E,T})    χ²(q_{E,T} || p_{ε,T})
(32, 6)   1.20 × 10³                1.32
(64, 6)   4.56 × 10²                1.24

Table 1: χ² divergence between the exploratory (pruned, p_{E,T}, and ε-greedy, p_{ε,T}) and uniform (q_{E,T}) distributions.

For the experiments in Figure 1, we considered the set generation task (see Appendix A) with W ∈ {32, 64} elements to choose from and set size T = 6, and the forward policy was parameterized by an MLP with two 64-dimensional layers. The elements' log-utilities u were sampled from [−1, 1] prior to training, and the resulting values were normalized so that the largest reward of a set was 5. For both settings in Figure 1, the models were trained for 1500 epochs with a batch size of 128.
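For concreteness, the set generation environment used in these experiments (described in Appendix A.3) can be sketched in a few lines; this is a minimal illustration with hypothetical class and method names, using a random rollout in place of a learned forward policy:

```python
import random

class SetGeneration:
    # Minimal set generation environment: states are subsets of {0, ..., W-1};
    # terminal states are the subsets of size T.
    def __init__(self, W, T, log_utilities):
        self.W, self.T = W, T
        self.u = log_utilities  # per-element log-utilities

    def actions(self, state):
        # A(s) = W \ s: any element not yet in the set may be added.
        return [a for a in range(self.W) if a not in state]

    def step(self, state, action):
        return frozenset(state | {action})

    def log_reward(self, x):
        # Additive reward: log R(x) is the sum of the chosen elements' log-utilities.
        return sum(self.u[i] for i in x)

def sample_trajectory(env, policy=random.choice):
    # Roll out a trajectory from the empty set to a size-T terminal state.
    state, traj = frozenset(), [frozenset()]
    while len(state) < env.T:
        state = env.step(state, policy(env.actions(state)))
        traj.append(state)
    return traj
```

Replacing `random.choice` with a neural sampler over `env.actions(state)` yields the learned p_F used in the paper's experiments.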
To compute the quantities in Table 1, we compared the uniform policy of an untrained GFlowNet against a policy p_E such that the (unnormalized) logit corresponding to the addition of the element 1 is set to log p_E(1|s) = −11.5 ≈ log 10⁻⁵. Table 1 shows the large discrepancy between the resulting p_E and a uniform policy, providing a taste of the upper bound in Proposition 4.2.

B.2 NON-VACUOUS GENERALIZATION BOUNDS

A bounded risk functional for GFlowNets. We start by recalling the definition of the flow-consistency in subgraphs (FCS) metric (Silva et al., 2024). Given a policy p_E, the FCS is defined as

L_FCS(R, p_T) = E_{τ_1,...,τ_B∼p_E}[ (1/2) ∑_{1≤i≤B} | p_T(x_i) / ∑_{1≤j≤B} p_T(x_j) − R(x_i) / ∑_{1≤j≤B} R(x_j) | ],

in which B ≥ 2 is a (typically small) given integer. Equivalently, FCS may be seen as the expected total variation distance between the learned p_T and the target R over random subsets of X. It was shown by Silva et al. (2024) that L_FCS(R, p_T) = 0 if and only if p_T(x) ∝ R(x), i.e., the model samples correctly from the distribution proportional to R in X. Then, equipped with the dataset T_n described in Section 4, an unbiased estimate of L_FCS is

L̂_FCS(T_n, R, p_T) = (1/2N) ∑_{k_1,...,k_B∼U{1,...,n}} ∑_{1≤i≤B} | p_T(x_{k_i}) / ∑_{j=1}^{B} p_T(x_{k_j}) − R(x_{k_i}) / ∑_{j=1}^{B} R(x_{k_j}) |,

in which the outer summation covers N uniformly random B-sized subsets of {1, . . . , n} and x_{k_i} represents the k_i-th observed terminal state in T_n. Importantly, L_FCS ∈ [0, 1] for any R and p_T, which enables the implementation of well-known algorithms for tightening PAC-Bayesian generalization bounds through the adoption of data-dependent priors (Dziugaite et al., 2021; Maurer, 2004).

Experimental details for computing non-vacuous bounds. To achieve the results illustrated in Figure 2, we use T_α to learn an isotropic Gaussian prior Q with variance 10⁻⁶ over the parameters θ of an MLP with three 128-dimensional layers defining the forward policy by minimizing the expected TB loss on T_α under Q.
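The Monte Carlo estimator L̂_FCS above reduces to a few lines; the sketch below is our own illustration (the function name and array interface are ours), assuming p_T and R are available as arrays over the observed terminal states:

```python
import numpy as np

def fcs_estimate(p_T, R, B=2, N=1000, rng=None):
    # Estimate FCS: the expected total variation distance between p_T and R after
    # renormalizing both over random B-sized subsets of the observed terminal states.
    rng = np.random.default_rng(rng)
    p_T, R = np.asarray(p_T, float), np.asarray(R, float)
    total = 0.0
    for _ in range(N):
        idx = rng.choice(len(p_T), size=B, replace=False)
        p = p_T[idx] / p_T[idx].sum()
        r = R[idx] / R[idx].sum()
        total += 0.5 * np.abs(p - r).sum()
    return total / N
```

If p_T ∝ R, every subset renormalization coincides and the estimate is zero; in general, the estimate stays in [0, 1], which is what makes the risk functional amenable to Maurer-type PAC-Bayes bounds.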
For each problem, we used the same neural network architecture, changing only the input and output dimensions, and the resulting models were trained for 64 epochs on their respective datasets. Then, we freeze θ and learn both the mean and the diagonal covariance of a Gaussian posterior P over the parameters of a policy network by minimizing the upper bound in Equation 6, with L̂_FCS substituted by an unbiased estimate of the TB loss on T_α ∪ T_{1−α}. Finally, we evaluate the upper bound in Equation 6 on T_{1−α} to certify its tightness. We closely followed the experimental setup of Dziugaite et al. (2021); Pérez-Ortiz et al. (2021) for conducting these experiments. In particular, the data-splitting protocol for learning the prior, learning the posterior, and evaluating the bound is analogous to the one used by Pérez-Ortiz et al. (2021). Also, in contrast to the other experiments, which rely on the Adam optimizer (Kingma & Ba, 2015), we use SGD with a fixed learning rate of 10⁻³, which presumably reaches a flat minimum (Keskar et al., 2017) with potentially better generalization properties (Hochreiter & Schmidhuber, 1997; Zhou et al., 2020; Haddouche et al., 2024). Finally, we acknowledge Dziugaite et al. (2021) for making their code publicly available and adhering to the best current practices of scientific reproducibility.

B.3 ORACLE GENERALIZATION BOUNDS: LEMMATA

Trajectory-level bounds. The technical lemma below ensures that KL(p_B || p_F) can be directly bounded by adopting a mixture transition policy, sometimes called an α-uniform policy (Hu et al., 2023), that keeps the trajectory-level probabilities away from zero and ensures the boundedness of the log-probabilities (Dziugaite et al., 2021; Lotfi et al., 2024a) without limiting the GFlowNet's ability to learn the correct solution, which samples from X proportionally to the reward.

Lemma B.1 (Realizability of mixture policies).
Let p_U(·|s) denote the uniform policy on the state space S with reward R, i.e., p_U(s′|s) = 1/|Ch(s)|. Then, there is an α ∈ (0, 1] such that the family {p̃_F : p̃_F(·|s) = α p_U(·|s) + (1 − α) p_F(·|s)} contains a policy sampling from X in proportion to R.

In the classical statistical learning terminology, the result above states that the family of α-uniform policy networks is realizable, meaning that a member of this family satisfies the desired balance conditions. However, as we note in the proof of Lemma B.1, finding such an α depends on knowledge of the minimum value of R(x) on X, which may be an NP-hard problem for some generative instances (Zhang et al., 2023b; Ma et al., 2013) that cannot be swiftly solved. Since the resulting generalization bound depends on hardly computable quantities, we call it an oracle bound, similarly to the distribution-dependent PAC-Bayesian inequalities in, e.g., Alquier et al. (2012); Alquier (2024).

Transition-level bounds. From Definition 5.3, we can readily conclude that the stochastic process ∑_{1≤i≤t} M(S_i, S…

…for p, q > 1 such that 1/p + 1/q = 1. For p = q = 2, this bound becomes

E_{x∼q_{E,T}}[φ(x)] ≤ ( E_{x∼p_{E,T}}[φ(x)²] · (χ²(q_{E,T} || p_{E,T}) + 1) )^{1/2}.

For GFlowNets, we may write p_T(x) = E_{τ∼p_B(·|x)}[p_F(τ)/p_B(τ|x)]. Hence, by Jensen's inequality,

E_{x∼p_{E,T}}[φ(x)²] = E_{x∼p_{E,T}}[ ( E_{τ∼p_B(·|x)}[ p_F(τ)/p_B(τ|x) ] − π(x) )² ] ≤ E_{x∼p_{E,T}}[ E_{τ∼p_B(·|x)}[ ( p_F(τ)/p_B(τ|x) − π(x) )² ] ].

In conclusion, we show that

( p_F(τ)/p_B(τ|x) − π(x) )² ≤ M² ( log (p_F(τ)/p_B(τ|x)) − log π(x) )². (32)

In fact, let M = max_{τ,x} p_F(τ)/p_B(τ|x), which always exists due to the finiteness of the state space. For instance, M ≤ 1 for autoregressive generative tasks (i.e., when p_B(τ|x) = 1). Thus,

( p_F(τ)/p_B(τ|x) − π(x) )² = M² ( p_F(τ)/(M p_B(τ|x)) − π(x)/M )².

The lemma below, which is a direct consequence of the mean value theorem, ensures that the quantity above is bounded above by the squared difference between the logarithms of p_F(τ)/p_B(τ|x) and π(x).

Lemma D.1 (Lipschitzness of x ↦ eˣ). For every x, y ∈ (0, 1], |log x − log y| ≥ |x − y|.

Proof.
Consider f: (−∞, 0] → ℝ, f: t ↦ eᵗ, and notice that |f′(t)| = |eᵗ| ≤ 1. Consequently, by the mean value theorem, f is 1-Lipschitz and |eᵗ − eˢ| ≤ |t − s| for every t, s ∈ (−∞, 0]. By letting log x = t and log y = s, we conclude that |x − y| ≤ |log x − log y| for x, y ∈ (0, 1].

In summary, we have shown that

E_{x∼q_{E,T}}[φ(x)] ≤ M ( E_{x∼p_{E,T}} E_{τ∼p_B(·|x)}[ ( log (p_F(τ)/p_B(τ|x)) − log π(x) )² ] · (χ²(q_{E,T} || p_{E,T}) + 1) )^{1/2}.

The statement thereby follows by considering a uniform reference distribution, q_{E,T}(x) = 1/|X|, and

TV(p_T, π) = (|X|/2) E_{x∼q_{E,T}}[φ(x)] ≤ (|X|/2) M ( E_{x∼p_{E,T}} E_{τ∼p_B(·|x)}[ ( log (p_F(τ)/p_B(τ|x)) − log π(x) )² ] · (χ²(q_{E,T} || p_{E,T}) + 1) )^{1/2}.

D.3 PROOF OF PROPOSITION 5.1

For completeness, we provide a proof of Proposition 5.1. Clearly, it is enough to show that

L_FCS(P) ≤ L̂_FCS(P) + √(η/2)   and   L_FCS(P) ≤ L̂_FCS(P) + η + √(η (η + 2 L̂_FCS(P))), (34)

in which we omit the dependence of L̂_FCS on the dataset T_n for conciseness. We recall that η = (KL(P || Q) + log(2√n_α/δ)) / n_α, with n_α = ⌈(1 − α)n⌉, is the complexity term that depends on the prior Q, the posterior P, the confidence δ, and the number of data points n_α. Notably, both inequalities follow directly from Maurer's (2004, Theorem 5) bound: with probability 1 − δ over T_n,

kl(L̂_FCS(P) || L_FCS(P)) ≤ η, (35)

in which kl represents the binary KL divergence, i.e., kl(p||q) = p log(p/q) + (1 − p) log((1 − p)/(1 − q)). Below, we show that kl(p||q) is greater than or equal to (p − q)²/2q when p < q.

Lemma D.2 (Boucheron et al., 2013, Exercise 2.8). Let h(t) = (1 − t) log(1 − t) + t and let p: {0, 1} → [0, 1] (resp. q) represent the PMF of a Bernoulli with parameter p ∈ [0, 1]. Then,

E_{x∼Be(q)}[ h(1 − p(x)/q(x)) ] = kl(p||q) (36)

and h(t) ≥ t²/2 for t ∈ [0, 1]. In particular, kl(p||q) ≥ (p − q)²/2q when p ≤ q.

Proof. Equation 36 follows from a direct algebraic manipulation of the left-hand side. On the other hand, define g(t) = h(t) − t²/2 for t ∈ [0, 1]. Then, g is continuous, g(0) = 0, and g(t) → 1/2 when t → 1. Also, g′(t) = −log(1 − t) − t ≥ 0 for t ∈ [0, 1] since −log(1 − t) = |log(1 − t)| ≥ t.
In conclusion, for p ≤ q, E_{x∼Be(q)}[ h(1 − p(x)/q(x)) ] ≥ q · h(1 − p/q) ≥ q (1 − p/q)²/2 = (p − q)²/2q.

By the symmetry of Equation 35 with respect to L_FCS and L̂_FCS, we conclude that

L_FCS(P) − L̂_FCS(P) ≤ √( 2 L_FCS(P) η ). (39)

Under these circumstances, the inequality L_FCS(P) ≤ L̂_FCS(P) + η + √(η (η + 2 L̂_FCS(P))) is obtained by solving the above quadratic inequality in √(L_FCS(P)). Through a similar reasoning, kl(p||q) ≥ 2(p − q)² by Pinsker's inequality and, consequently, L_FCS(P) ≤ L̂_FCS(P) + √(η/2). These results jointly entail Proposition 5.1.

D.4 PROOF OF LEMMA B.1

Recall that p_T(x) = ∑_{τ→x} p̃_F(τ) for p̃_F^{α,p_F} = α p_U + (1 − α) p_F, in which we make the dependence of p̃_F on α and on the (unconstrained) policy p_F explicit. Let F(α, p_F) be the family of such policies and F(α) be the set of α-greedy policies. It is straightforward to see that F(α) is a convex set. Clearly, it is enough to ensure that min_{α>0, p_F} min_x p_T^{α,p_F}(x) ≤ min_x π(x), namely, that the rarest object can be sampled correctly by properly adjusting p_F and a (non-zero) α. Indeed, Bengio et al. (2023, Theorem 8) showed that, for each given backward policy p_B and positive reward R, there is a unique forward policy p_F for which the marginal satisfies p_T(x) ∝ R(x) for each x ∈ X. Hence, since p_T^{α,p_F} = α p_{U,T} + (1 − α) p_T, with p_{U,T} being the marginal of p_U over X, the realizability of F(α) is ensured when α satisfies min_{x∈X} α p_{U,T}(x) < min_{x∈X} π(x), i.e., α < min_x π(x) / min_x p_{U,T}(x), in which case we may set a p_F such that p_T(x) = (1/(1 − α)) (π(x) − α p_{U,T}(x)). As an example, consider the set generation task, the details of which are provided in Section 4. There, p_U induces a uniform distribution over X and we may set α = (N/2) min_x π(x) < min_x π(x) / min_x p_{U,T}(x), in which N = |X|. Importantly, our analysis does not consider the (limited) expressivity of the chosen parametric model for the policy network, which touches on a mostly open problem in the deep learning literature.
Rather, we are concerned with the feasibility of finding a transition policy p_F consistent and compatible with the given target distribution R, in the sense of Bengio et al. (2023, Definitions 4 and 20).

D.5 PROOF OF THEOREM 5.2

We first show that the risk function is bounded. Then, Equation 7 follows directly from Maurer (2004, Theorem 5) and Jensen's inequality. Under these conditions, notice that

KL(p_B || p̃_F) = E_{τ∼p_B}[log p_B(τ)] − E_{τ∼p_B}[log p̃_F(τ)] = H[p_B] − E_{τ∼p_B}[log p̃_F(τ)]. (40)

Also, by definition,

p̃_F(τ) = ∏_{(s,s′)∈τ} p̃_F(s′|s) = ∏_{(s,s′)∈τ} ( α p_U(s′|s) + (1 − α) p_F(s′|s) ) ≥ ∏_{(s,s′)∈τ} α/|Ch(s)| ≥ ( α / max_{s∈τ} |Ch(s)| )^{|τ|}

and, consequently,

E_{τ∼p_B}[log p̃_F(τ)] ≥ min_τ |τ| log ( α / max_{s∈τ} |Ch(s)| ) = −max_τ |τ| log ( α⁻¹ max_{s∈τ} |Ch(s)| ) = −M_T,

i.e., KL(p_B || p̃_F) ≤ H[p_B] + M_T. In conclusion, the convexity of the KL divergence along with the fact that p_T and π are respectively convex functions of p̃_F and p_B implies that KL(π || p_T) ≤ KL(p_B || p̃_F). The rest follows from Maurer (2004, Theorem 5) applied to KL(p_B || p̃_F).

D.6 PROOF OF LEMMA B.2

The result follows directly from the Markov property of the MDP. Equivalently, we note that ∑_{1≤i≤S} L_DB(S_i, S…