Embarrassingly Parallel GFlow Nets

Tiago da Silva 1, Luiz Max Carvalho 1, Amauri Souza 2 3, Samuel Kaski 2 4, Diego Mesquita 1

GFlow Nets are a promising alternative to MCMC sampling for discrete compositional random variables. Training GFlow Nets requires repeated evaluations of the unnormalized target distribution, or reward function. However, for large-scale posterior sampling, this may be prohibitive since it incurs traversing the data several times. Moreover, if the data are distributed across clients, employing standard GFlow Nets leads to intensive client-server communication. To alleviate both these issues, we propose the embarrassingly parallel GFlow Net (EP-GFlow Net). EP-GFlow Net is a provably correct divide-and-conquer method to sample from product distributions of the form R(·) ∝ R1(·) · · · RN(·), e.g., in parallel or federated Bayes, where each Rn is a local posterior defined on a data partition. First, in parallel, we train a local GFlow Net targeting each Rn and send the resulting models to the server. Then, the server learns a global GFlow Net by enforcing our newly proposed aggregating balance condition, requiring a single communication step. Importantly, EP-GFlow Nets can also be applied to multi-objective optimization and model reuse. Our experiments illustrate the effectiveness of EP-GFlow Nets on several tasks, including parallel Bayesian phylogenetics, multi-objective multiset and sequence generation, and federated Bayesian structure learning.

1 Getulio Vargas Foundation, 2 Aalto University, 3 Federal Institute of Ceará, 4 University of Manchester. Correspondence to: Diego Mesquita. Proceedings of the 41st International Conference on Machine Learning, Vienna, Austria. PMLR 235, 2024. Copyright 2024 by the author(s).

1. Introduction

Generative Flow Networks (GFlow Nets) (Bengio et al., 2021) are powerful sampling tools and, despite being in their infancy, have become popular alternatives to Markov chain Monte Carlo (MCMC) methods in finite discrete state spaces. While the most straightforward use of MCMC-like methods is to sample from Bayesian posteriors (Deleu et al., 2022; 2023; Atanackovic et al., 2023b), GFlow Nets find varied applications in combinatorial optimization (Zhang et al., 2023a), multi-objective active learning (Jain et al., 2023), and design of biological sequences (Jain et al., 2022). Inheriting jargon from reinforcement learning, GFlow Nets learn to sample from a reward R : X → R+ by refining a forward policy p_F that incrementally builds objects x ∈ X by augmenting an initial state s0. In a Bayesian context, R is proportional to some posterior p(· | D), and X is its support. Nonetheless, similarly to running MCMC chains, training GFlow Nets requires repeatedly evaluating R. For posterior sampling, this implies repeated sweeps through the data. Furthermore, if the data are scattered across many clients, e.g., in Bayesian federated learning (El Mekkaoui et al., 2021; Vono et al., 2022), the multiple communication rounds between clients and the server may become a bottleneck. In the realm of MCMC, embarrassingly parallel methods (Neiswanger et al., 2014) address both aforementioned issues simultaneously through a divide-and-conquer scheme. In a nutshell, given an N-partition D1, . . . , DN of the data D, the strategy consists of sampling in parallel from N subposteriors defined on different data shards, Rn(x) ∝ p(Dn | x) p(x)^{1/N} for n = 1, . . . ,
N, and subsequently combining the results in a server to get approximate samples from the full posterior p(x|D), or equivalently from R R1R2 . . . RN. In federated settings, the data partition reflects how data arises in client devices, and this class of algorithms incurs a single communication round between the clients and the server. However, these methods are tailored towards continuous random variables, and their combination step relies on approximating (or sampling from) the product of sample-based continuous surrogates of the subposteriors, e.g., kernel density estimators (Neiswanger et al., 2014), Gaussian processes (Nemeth & Sherlock, 2018; de Souza et al., 2022), or normalizing flows (Mesquita et al., 2019). Consequently, they are not well-suited to sample from discrete state spaces. We propose EP-GFlownet, the first embarrassingly parallel sampling method for discrete distributions. EP-GFlow Nets start off by learning N local GFlow Nets in parallel to sample proportional to their corresponding reward functions R1, . . . , RN and send the resulting models to the server. Embarrassingly Parallel GFlow Nets Then, the server learns a global GFlow Nets by enforcing our newly proposed aggregating balance (AB) condition, which ensures the global model correctly samples from the product of (sampling distributions of) the local GFlow Nets. Notably, sampling from products of GFlow Nets is challenging since no simple composition (e.g., avg./product) of local policies induces the product distribution (Du et al., 2023). Towards deriving the AB, we introduce a novel balance criterion to train conventional/local GFlow Nets, the contrastive balance (CB) condition. Compared to broadly used training criteria (Bengio et al., 2023; Malkin et al., 2022), CB often leads to faster convergence, which may be attributed to its minimal parametrization, requiring only forward and backward policies. We also show that, in expectation, minimizing the CB is equivalent to minimizing the variance of an estimator of the log-partition from Zhang et al. (2023c). Remarkably, the applicability of EP-GFlow Nets goes beyond Bayesian inference. For instance, it applies to multiobjective tasks in which we need to sample from a combination of different objective functions (Daulton et al., 2021; Jain et al., 2023). EP-GFlow Nets can also be applied for model reuse (Garipov et al., 2023), allowing sampling from log-pools of experts without retraining individual models. Our experiments validate EP-GFlow Nets in different contexts, including multi-objective multiset, parallel Bayesian phylogenetic inference, and federated Bayesian structure learning. In summary, our contributions are: 1. We propose EP-GFlow Net, the first algorithm for embarrassingly parallel sampling in discrete state spaces using GFlow Nets. We provide theoretical guarantees of correctness, and also analyze EP-GFlow Net s robustness to errors in the estimation of local GFlow Nets; 2. We present the contrastive balance (CB) condition, show it is a sufficient and necessary condition for sampling proportionally to a reward, and analyze its connection to variational inference (VI); 3. We substantiate our methodological contributions with experiments on five different tasks. 
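For completeness, the divide-and-conquer setup above is exact in the following sense. Assuming the data shards D1, . . . , DN are conditionally independent given x (the standard setting of Neiswanger et al., 2014), multiplying the sub-posteriors recovers the full posterior up to normalization; the short derivation below is a recap of that identity, not a new result.

```latex
% Product of sub-posteriors recovers the full posterior (up to a constant),
% assuming the shards D_1, ..., D_N are conditionally independent given x.
\prod_{n=1}^{N} R_n(x)
  \;\propto\; \prod_{n=1}^{N} p(\mathcal{D}_n \mid x)\, p(x)^{1/N}
  \;=\; p(x) \prod_{n=1}^{N} p(\mathcal{D}_n \mid x)
  \;=\; p(x)\, p(\mathcal{D} \mid x)
  \;\propto\; p(x \mid \mathcal{D}).
```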
Our empirical results i) showcase the accuracy of EP-GFlow Nets; ii) show that, in some cases, using the CB as training criterion leads to faster convergence compared to popular loss functions; iii) illustrate EP-GFlow Nets potential in two notable applications: Bayesian phylogenetic inference, and Bayesian network structure learning. 2. Preliminaries Notation. We represent a directed acyclic graph (DAG) over nodes V and with adjacency matrix A {1, 0}|V | |V | as G = (V, A). A forward policy over V in G is a function p: V V R+ such that (i) p(v, ) is a probability mea- sure over V for every v V and (ii) p(v, w) > 0 if and only if Avw = 1; we alternatively write p(v w) and p(w|v) to represent p(v, w). A transition kernel p induces a conditional probability measure over the space of trajectories in G: if τ = (v1 vn) is a trajectory of length n in G, then p(τ|v1) = Qn 1 i=1 p(vi+1|vi). A backward policy in G is a forward policy on the transpose graph G = (V, A ). Generative flow networks. GFlow Nets are a family of amortized variational algorithms trained to sample from an unnormalized distribution over discrete and compositional objects. More specifically, let R: X R+ be an unnormalized distribution over a finite space X. We call R a reward due to terminological inheritance from the reinforcement learning literature. Define a finite set S and a variable so. Then, let G be a weakly connected DAG with nodes V = {so} {sf} S X such that (i) there are no incoming edges to so, (ii) there are no outgoing edges exiting sf and (iii) there is an edge from each x X to sf. We call the elements of V states and refer to X as the set of terminal states; so is called the initial state and sf is an absorbing state designating the end of a trajectory. We denote by T the space of trajectories in G starting at so and ending at sf. Illustratively, X could be the space of biological sequences of length 32; S, the space of sequences of lengths up to 32; and so, an empty sequence. The training objective of a GFlow Net is to learn a forward policy p F over G such that the marginal distribution p T over X satisfies p T (x) := X τ : τ leads to x p F (τ|so) R(x). (1) We usually parameterize p F as a neural network and select one among the diversely proposed training criteria to estimate its parameters, which we denote by ϕF . These criteria typically enforce a balance condition on the Markovian process defined by p F that provably imposes the desired property on the marginal distribution in Equation (1). For example, the trajectory balance (TB) criterion introduces a parametrization of the target distribution s partition function ZϕZ and of a backward policy p B( , ; ϕB) with parameters ϕZ and ϕB, respectively, and is defined as LT B(τ, ϕF , ϕB, ϕZ) = log ZϕZ log R(x) + X s s τ log p F (s, s ; ϕF ) p B(s , s; ϕB) Minimizing Equation (2) enforces the TB condition: p F (τ; ϕF ) = Z 1 ϕZ R(x) Q p B(s , s; ϕB), which implies Equation (1) if valid for all τ T . This is the most widely used training scheme for GFlow Nets. In practice, some works set p B as a uniform distribution to avoid learning ϕB, as suggested by Malkin et al. (2022). Another popular approach for training GFlow Nets uses the notion of detailed balance (DB). Here, we want Embarrassingly Parallel GFlow Nets to find forward and backward policies and state flows F (with parameters ϕS) that satisfy the DB condition: F(s; ϕS)p F (s, s ; ϕF ) = F(s ; ϕS)p B(s , s; ϕB) if s is an non-terminal state and F(s; ϕS)p F (sf|s; ϕF ) = R(s) otherwise. 
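For concreteness, the following is a minimal PyTorch-style sketch of the trajectory balance objective in Equation (2). The tensor layout (per-trajectory sums of log-probabilities passed in by the caller) and the function name are assumptions made for illustration; this is not the paper's released implementation.

```python
import torch

def trajectory_balance_loss(log_pf: torch.Tensor,
                            log_pb: torch.Tensor,
                            log_z: torch.Tensor,
                            log_reward: torch.Tensor) -> torch.Tensor:
    """Squared violation of the TB condition, averaged over a batch of trajectories.

    log_pf, log_pb : [batch] sums of log p_F(s -> s') and log p_B(s' -> s)
                     over the edges of each complete trajectory.
    log_z          : learnable scalar approximating log Z.
    log_reward     : [batch] log R(x) at each trajectory's terminal state.
    """
    violation = log_z + log_pf - log_pb - log_reward
    return (violation ** 2).mean()
```

The DB loss defined next has the same squared-violation structure, but is decomposed per transition and replaces the global log Z with learned state flows.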
Again, satisfying the DB condition for all edges in G entails Equation (1). Naturally, this condition leads to a transition-decomposable loss LDB(s, s , ϕF , ϕB, ϕS) = log p F (s,s ;ϕF ) p B(s ,s;ϕB) + log F (s;ϕS) F (s ;ϕS) 2 if s = sf, log F (s;ϕS)p F (sf |s;ϕF ) R(s) 2 otherwise. Recently, Pan et al. (2023a) proposed to residually reparameterize F with reference on a hand-crafted function ξ, namely, log F(s) = log F(s) + log ξ(s). This approach led to the forward-looking (FL) GFlow Net, whose learning objective we denote by LF L. Importantly, contrarily to Pan et al. (2023a), we do not drop the boundary conditions in Equation 3. To guarantee correctness whether using LDB or LT B, we need to integrate the loss over some exploratory policy π fully supported in T . In practice, π is typically a ϵmixture between p F and a uniform forward policy, (1 ϵ) p F +ϵ u F , or a tempered version of p F . We use the former definition for π in this work. We review alternative training schemes for GFlow Nets in the supplementary material. Problem statement. Given a set of clients n = 1, . . . , N, each with reward function Rn, we want to learn a GFlow Net to sample proportionally to a global reward function R defined as a product of the local rewards R1, . . . , RN. We want to do so with as little client-server communication as possible and without openly disclosing the local rewards to the server. While we focus on sampling from R(x) := QN n=1 Rn(x) in the main paper, we also provide in Appendix B.2 extensions of our theoretical results to exponentially weighted rewards of the form R(x) := QN n=1 Rn(x)wn with w1, . . . , w N > 0. This section derives a provably correct framework for embarrassingly parallel GFlow Nets based on the newly developed concept of aggregating balance (Section 3.1). As a cornerstone, we introduce the contrastive balance (CB) condition, a new balance condition requiring minimal parametrization. Notably, Section 3.2 shows the CB condition yields a sound learning objective for training conventional GFlow Nets. 3.1. Embarrassingly Parallel GFlow Nets To circumvent the restrictions imposed by the problem statement, we propose a divide-and-conquer scheme. First, each client trains their own GFlow Net to sample proportionally to their local reward, sending the estimated forward and backward policies p(n) F and p(n) B to a centralizing server. Then, the server estimates the policies (p F , p B) of a novel GFlow Net that approximately samples from R solely based on the local policies {(p(n) F , p(n) B )}N n=1, i.e., without ever evaluating any R(n). More specifically, the server learns a GFlow Net whose marginal distribution p T over terminal states is proportional to the product of those from the clients p(1) T , . . . , p(N) T . Toward this end, Theorem 3.1 delineates a necessary and sufficient condition guaranteeing the correctness of the aggregation phase, which we call aggregating balance (AB) condition. It is worth mentioning that Theorem 3.1 builds directly upon the contrastive balance condition in Lemma 3.5, discussed in detail in Section 3.2. Theorem 3.1 (Aggregating balance condition). Let p(1) F , p(1) B , . . . , p(N) F , p(N) B : V 2 R+ be pairs of forward and backward policies from N GFlow Nets sampling respectively proportionally to R1, . . . , RN : X R+. 
Then, another GFlow Net with forward and backward policies p_F, p_B : V² → R+ samples proportionally to R(x) := ∏_{n=1}^{N} R_n(x) if and only if the following condition holds for all terminal trajectories τ, τ' ∈ T:

$$\prod_{1 \le i \le N} \frac{\prod_{s \to s' \in \tau} p_F^{(i)}(s, s') / p_B^{(i)}(s', s)}{\prod_{s \to s' \in \tau'} p_F^{(i)}(s, s') / p_B^{(i)}(s', s)} \;=\; \frac{\prod_{s \to s' \in \tau} p_F(s, s') / p_B(s', s)}{\prod_{s \to s' \in \tau'} p_F(s, s') / p_B(s', s)}. \quad (4)$$

Based on Theorem 3.1, we can naturally derive a loss function enforcing Equation (4) that can be used to combine the locally trained GFlow Nets. To guarantee the minimum of our loss achieves aggregating balance, it suffices to integrate Equation (4) against a distribution attributing non-zero mass to every (τ, τ') ∈ T². Importantly, note the aggregating balance loss is agnostic to which loss was used to learn the local GFlow Nets, as it only requires their transition functions and not, e.g., an estimate of the partition function or of the flows going through each state.

Corollary 3.2 (Aggregating balance loss). Let p_F^{(i)} and p_B^{(i)} be forward/backward transition functions such that p_T^{(i)}(x) ∝ R_i(x) for arbitrary reward functions R_i over terminal states x ∈ X. Also, let ν : T² → R+ be a full-support distribution over pairs of complete trajectories. Moreover, assume that p_F(·, ·; ϕ_F) and p_B(·, ·; ϕ_B) denote the forward/backward policies of a GFlow Net parameterized by ϕ_F and ϕ_B. The following are equivalent: 1. p_T(x; ϕ_F) ∝ ∏_i R_i(x) for all x ∈ X; 2. E_{(τ,τ') ∼ ν}[L_AB(τ, τ', ϕ_F, ϕ_B)] = 0, where, for trajectories τ ⇝ x and τ' ⇝ x',

$$\mathcal{L}_{AB}(\tau, \tau', \phi_F, \phi_B) = \left( \log \frac{p_F(\tau; \phi_F)\, p_B(\tau' \mid x'; \phi_B)}{p_B(\tau \mid x; \phi_B)\, p_F(\tau'; \phi_F)} - \sum_{1 \le i \le N} \log \frac{p_F^{(i)}(\tau)\, p_B^{(i)}(\tau' \mid x')}{p_B^{(i)}(\tau \mid x)\, p_F^{(i)}(\tau')} \right)^2. \quad (5)$$

Figure 1. EP-GFlow Net samples proportionally to a pool of locally trained GFlow Nets. If a client correctly trains their local model (green) and another client trains theirs incorrectly (red), the distribution inferred by EP-GFlow Net (mid-right) differs from the target product distribution (right).
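To illustrate how the aggregation step can be implemented, below is a minimal PyTorch-style sketch of the contrastive balance loss (Section 3.2) and the aggregating balance loss of Corollary 3.2, for a batch of trajectories. The tensor layout and the pairing of each trajectory τ with a rolled copy of the batch as τ' are illustrative assumptions, not the released implementation.

```python
import torch

def contrastive_balance_loss(log_pf, log_pb, log_reward):
    """CB loss: contrasts pairs of trajectories so that log Z cancels out.

    log_pf, log_pb : [batch] summed log p_F(tau) and log p_B(tau | x).
    log_reward     : [batch] log R(x) at each trajectory's terminal state.
    """
    v = log_pf - log_pb - log_reward               # signed TB violation, up to log Z
    return ((v - v.roll(1, dims=0)) ** 2).mean()   # pair each tau with a shifted tau'

def aggregating_balance_loss(log_pf, log_pb, local_log_pf, local_log_pb):
    """AB loss (Equation (5)): same contrastive structure, with the sum of the
    clients' log-ratios playing the role of the unavailable global log-reward.

    local_log_pf, local_log_pb : [batch, N] per-client log-policies of each tau.
    """
    v = (log_pf - log_pb) - (local_log_pf - local_log_pb).sum(-1)
    return ((v - v.roll(1, dims=0)) ** 2).mean()
```

In both cases, contrasting two trajectories cancels the unknown normalizing constants, which is what allows the server to train the global model without ever evaluating any R_n.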
Remark 3.3 (Imperfect local inference). In practice, the local balance conditions often cannot be completely fulfilled by the local GFlow Nets, and the distributions p_T^{(1)}, . . . , p_T^{(N)} over terminal states are not proportional to the rewards R_1, . . . , R_N. In this case, aggregating balance implies the aggregated model samples proportionally to

$$\mathbb{E}_{\tau \sim p_B(\cdot \mid x)}\left[\prod_{1 \le i \le N} \frac{p_F^{(i)}(\tau)}{p_B^{(i)}(\tau \mid x)}\right]. \quad (6)$$

Interestingly, the value of Equation (6) equals the expectation of a non-deterministic random variable only if the local balance conditions are not satisfied. Otherwise, each ratio p_F^{(i)}(τ)/p_B^{(i)}(τ | x) equals R_i(x), a constant wrt τ conditioned on τ having x as its final state. Furthermore, Equation (6) also allows us to assess the probability mass function the global EP-GFlow Net is truly drawing samples from.

As mentioned in Remark 3.3, in practice, the local GFlow Nets may not be balanced with respect to their rewards, incurring errors that propagate to our aggregated model. In this context, Theorem 3.4 quantifies the extent to which these local errors impact the overall result.

Theorem 3.4 (Influence of local failures). Let π_n := R_n/Z_n, and let p_F^{(n)} and p_B^{(n)} be the forward and backward policies of the n-th client. We use τ ⇝ x to indicate that τ ∈ T is finished by x → s_f. Suppose that the local balance conditions are lower- and upper-bounded for all n ∈ [[1, N]] as

$$1 - \alpha_n \le \min_{x \in \mathcal{X},\, \tau \rightsquigarrow x} \frac{p_F^{(n)}(\tau)}{p_B^{(n)}(\tau \mid x)\, \pi_n(x)} \le \max_{x \in \mathcal{X},\, \tau \rightsquigarrow x} \frac{p_F^{(n)}(\tau)}{p_B^{(n)}(\tau \mid x)\, \pi_n(x)} \le 1 + \beta_n, \quad (7)$$

where α_n ∈ (0, 1) and β_n > 0. The Jeffrey divergence D_J between the global model π̂(x) that fulfills the aggregating balance condition and π(x) ∝ ∏_{n=1}^{N} π_n(x) then satisfies

$$D_J(\pi, \hat{\pi}) \le \sum_{n=1}^{N} \log \frac{1 + \beta_n}{1 - \alpha_n}. \quad (8)$$

There are two things worth highlighting in Theorem 3.4. First, if the local models are accurately learned (i.e., β_n = α_n = 0 for all n), the RHS of Equation (8) equals zero, implying π = π̂. Second, if either β_n → ∞ or α_n → 1 for some n, the bound in Equation (8) goes to infinity, i.e., it degenerates if one of the local GFlow Nets is poorly trained. This is well-aligned with the catastrophic failure phenomenon (de Souza et al., 2022), which was originally observed in the literature on parallel MCMC (Neiswanger et al., 2014; Nemeth & Sherlock, 2018; Mesquita et al., 2019) and refers to the incorrectness of the global model due to inadequately estimated local parameters, which can result in missing modes or misrepresentation of low-density regions.

Figure 1 shows a case where one of the local GFlow Nets is poorly trained (Client 2's). Note that minimizing the AB objective leads to a good approximation of the product of marginal distributions over terminal states (encoded by the local GFlow Nets). Nonetheless, the result is far from what we envisioned at first, i.e., the learned model significantly diverges from the product distribution R ∝ R1 R2. Additionally, Figure 13 in Appendix C.4 highlights that EP-GFlow Nets can learn a relatively good approximation to the target distribution even in the face of inaccurate local approximations.

3.2. Contrastive balance

As a stepping stone towards proving Theorem 3.1, we develop the contrastive balance condition, which is sufficient for ensuring that a GFlow Net's marginal over terminal states is proportional to its target reward (Lemma 3.5).

Lemma 3.5 (Contrastive balance condition). If p_F, p_B : V² → R+ are the forward and backward policies of a GFlow Net sampling proportionally to some arbitrary reward function R : X → R+, then, for any pair of complete trajectories τ, τ' ∈ T with τ ⇝ x and τ' ⇝ x',

$$R(x') \prod_{s \to s' \in \tau} \frac{p_F(s, s')}{p_B(s', s)} = R(x) \prod_{s \to s' \in \tau'} \frac{p_F(s, s')}{p_B(s', s)}. \quad (9)$$

Conversely, if a GFlow Net with forward and backward policies p_F, p_B abides by Equation (9), it induces a marginal distribution over x ∈ X proportional to R.

Enforcing Lemma 3.5 results in a loss that does not depend on an estimate log Z_{ϕZ} of the intractable log-partition function present in the TB condition. The next corollary guarantees that an instantiation of the GFlow Net parameterized by a global minimizer of L_CB (Equation (10)) correctly samples from p(x) ∝ R(x). We call L_CB the contrastive balance loss as it measures the contrast between randomly sampled trajectories. In practice, we observed that in some cases the CB loss leads to better results than the TB and DB losses, as we will see in Section 4.6.

Corollary 3.6 (Contrastive balance loss). Let p_F(·, ·; ϕ_F) and p_B(·, ·; ϕ_B) denote forward/backward policies, and ν : T² → R+ be a full-support probability distribution over pairs of terminal trajectories. Then, p_T(x; ϕ_F) ∝ R(x) for all x ∈ X iff E_{(τ,τ') ∼ ν}[L_CB(τ, τ', ϕ_F, ϕ_B)] = 0, where

$$\mathcal{L}_{CB}(\tau, \tau', \phi_F, \phi_B) = \left( \log \frac{p_F(\tau; \phi_F)}{p_B(\tau \mid x; \phi_B)} - \log \frac{p_F(\tau'; \phi_F)}{p_B(\tau' \mid x'; \phi_B)} - \log R(x) + \log R(x') \right)^2. \quad (10)$$

Computational advantages of L_CB. Importantly, note that L_CB incurs learning fewer parameters than the TB and DB losses. Indeed, besides requiring the forward and backward policies p_F and p_B, TB requires parameterizing the partition function of R. Alternatively, DB implies using a neural network to approximate the flow through each node.
In contrast, CB requires only learning p F and p B. LCB and VI. Notably, the next proposition ties the CB loss gradient to that of a variational objective, extending the characterization of GFlow Nets as VI started by Malkin et al. (2023) for the TB loss. More specifically, Theorem 3.7 states that the on-policy gradients of the CB objective coincide in expectation to the gradient of the KL divergence between the forward and backward policies. Theorem 3.7 (VI & CB). Let p F p F be the outer product distribution assigning probability p F (τ)p F (τ ) to each trajectory pair (τ, τ ). The criterion in Equation (10) satisfies ϕF DKL[p F ||p B] = 1 4 E (τ,τ ) p F p F[ ϕF LCB(τ, τ , ϕF )] . Notably, a corresponding result connecting the CB s and KL s gradients holds when we parameterize the backward policy, as we show in the Theorem [ref] in the appendix. Connection to other balance conditions. The CB loss may be equivalently defined as the squared difference between signed violations to the TB of two independent trajectories, namely, LCB(τ, τ ) = (VT B(τ) VT B(τ ))2, with VT B(τ) = log (p F (τ)Z/p B(τ|x)R(x)) satisfying VT B(τ)2 = LT B(τ). In this scenario, one might derive Lemma 3.5 and Theorem 3.7 as corollaries of the respective results for the TB condition by Malkin et al. (2022, Proposition 1) and by Malkin et al. (2023, Proposition 1). We present this derivation jointly with self-contained proofs in Appendix A and Appendix B.1. Appendix B.1 also shows that the expected value of the CB loss is proportional to the variance of a TB-based estimate of the partition function, which was used as a training loss by Zhang et al. (2023c). 4. Experiments The main purpose of our experiments is to verify the empirical performance of EP-GFlow Nets, i.e., their capacity to accurately sample from the combination of local distributions. To that extent, we consider five diverse tasks: sampling states from a grid world in Section 4.1, generation of multisets (Bengio et al., 2023; Pan et al., 2023a) in Section 4.2, design of sequences (Jain et al., 2022) in Section 4.3, distributed Bayesian phylogenetic inference (Zhang & Matsen IV, 2018) in Section 4.4, and federated Bayesian network structure learning (BNSL; Ng & Zhang, 2022) in Section 4.5. Since EP-GFlow Net is the first of its kind, we propose two baselines to compare it against: a centralized GFlow Net, which requires clients to disclose their rewards in a centralizing server, and a divide-and-conquer algorithm in which each client approximates its local GFlow Net with a product of categorical distributions, which are then aggregated in the server. We call the latter approach parallel categorical VI (PCVI), which may be also viewed as an implementation of the SDA-Bayes framework for distributed approximate Bayesian inference (Broderick et al., 2013). 4.1. Grid world Task description. Our grid world environment consists of a Markov decision process over an 9 9 square grid in which actions consist of choosing a direction ( , ) or stopping. The reward for each state R is the sigmoid transform of its minimum distance to a reward beacon (bright yellow in Figure 2). For the distributed setting, we consider the problem of combining the rewards from different clients, each of which has two beacons placed in different positions. Results. Figure 2 shows that EP-GFlow Net accurately approximates the targeted distribution, even in cases where Embarrassingly Parallel GFlow Nets Table 1. Quality of the parallel approximation to the combined rewards. 
The table shows (i) the L1 distance between the distribution induced by each method and the ground truth and (ii) the average log reward of the top-800 scoring samples. Our EP-GFlow Net is consistently better than the PCVI baseline regarding L1 distance, showing approximately the same performance as a centralized GFlow Net. Furthermore, EP-GFlow Net's Top-800 score perfectly matches the centralized model's, while PCVI's differs drastically. Values are the average and standard deviation over three repetitions.

                        Grid World               Multisets                Sequences
                        L1          Top-800      L1          Top-800      L1          Top-800
Centralized             0.027       6.355        0.100       27.422       0.003       1.535
                        (± 0.016)   (± 0.000)    (± 0.001)   (± 0.000)    (± 0.001)   (± 0.000)
EP-GFlow Net (ours)     0.038       6.355        0.130       27.422       0.005       1.535
                        (± 0.016)   (± 0.000)    (± 0.004)   (± 0.000)    (± 0.002)   (± 0.000)
PCVI                    0.189       6.355        0.834       26.804       1.872       16.473
                        (± 0.006)   (± 0.000)    (± 0.005)   (± 0.018)    (± 0.011)   (± 0.007)

Figure 2. Grid world. Each heatmap represents the target distribution (first row), based on the normalized reward, and the ones learned by the local GFlow Nets (second row). Results for EP-GFlow Net are in the rightmost panels. As established by Theorem 3.1, the good fit of the local models results in an accurate fit to the combined reward.

combining the client rewards leads to multiple sparsely distributed modes. Furthermore, Table 1 shows that EP-GFlow Net performs approximately on par with the centralized model trained on the product distribution in terms of L1 distance (within one standard deviation), but is three orders of magnitude better than the PCVI baseline. This is also reflected in the average reward over the top 800 samples: identical to the centralized version for EP-GFlow Net, but an order of magnitude smaller for PCVI. Again, these results corroborate our theoretical claims about the correctness of our scheme for combining GFlow Nets.

4.2. Multiset generation

Task description. Here, the set of terminal states comprises multisets of size S. A multiset S is built by iteratively adding elements from a finite dictionary U to an initially empty set. Each client n assigns a value r_u^(n) to each u ∈ U and defines the log-reward of S as the sum of its elements' values, i.e., log R_n(S) = Σ_{u ∈ S} r_u^(n). In practice, the quantities r_u^(n) are uniformly picked from the interval [0, 1] for each client. We use S = 8 and |U| = 10 in our experiments.

Results. Figure 3 provides further evidence that our algorithm is able to approximate the combined reward well, even if only the local GFlow Nets are given. This is further supported by the results in Table 1. Notably, EP-GFlow Net is roughly eight times more accurate than the PCVI baseline.

4.3. Design of sequences

Task description. This task revolves around building sequences of maximum size S. We start with an empty sequence S and proceed by iteratively appending an element from a fixed dictionary U. The process halts when (i) we select a special terminating token or (ii) the sequence length reaches S. In the distributed setting, we assume each client n assigns a score p_s^(n) to each of the S positions within the sequence and a score t_u^(n) to each of the |U| available tokens, yielding the log-reward of a sequence S = (u_1, . . . , u_M) as log R_n(S) = Σ_{i=1}^{M} p_i^(n) t_{u_i}^(n).

Results. Again, Figure 4 corroborates Theorem 3.1 and

Figure 3. Multisets: learned vs. ground-truth distributions. Plots compare target vs. distributions learned by GFlow Nets.
The five plots to the left show local models were accurately trained. Thus, a well-trained EP-GFlow Net (right) approximates well the combined reward. EP-GFlow Net Learned Figure 4. Sequences: learned ground truth distributions. . Plots compare target to distributions learned. The five leftmost plots show local GFlow Nets were well trained. Hence, as implied by Theorem 3.1, EP-GFlow Net approximates well the combined reward. shows that EP-GFlow Net accurately samples from the product of rewards. Table 1 further reinforces this conclusion, showing a small gap in L1 distance between EP-GFlow Net and the centralized GFlow Net trained with access to all rewards. Notably, our method is 8 more accurate than PCVI. Furthermore, the Top-800 average reward of EPGFlow Net perfectly matches the centralized model. 4.4. Bayesian phylogenetic inference Task description. In this task, we are interested on inferring a phylogeny T = (t, b), which is a characterization of the evolutionary relationships between biological species and is composed by a tree topology t and its (2N 1)-dimensional vector of non-negative branch lengths b. The topology t is as a leaf-labeled complete binary tree with N leaves, each corresponding to a species. Notably, T induces a probability distribution P over the space of nucleotide sequences Y1, . . . , YM ΩN, where Ωis a vocabulary of nucleobases and Ym denotes the nucleobases observed at the m-th site for each species. Assume t is rooted in some node r and that π |Ω| is the prior probability distribution over the nucleobases frequencies at r. Then, the marginal likelihood of a nucleobase Ym occurring site m for node n is recursively defined by Felsenstein (1981) s algorithm as ( One-Hot(Ym,n) if n is a leaf, [M(n, nl) M(n, nr)] otherwise, in which nl and nr are respectively the left and right children of n; bn,a is the length of the branch between nodes n and a; Q R|Ω| |Ω| is an instantaneous rate-conversion matrix for the underlying substitution rates between nucleotides, which is given beforehand; and M(n, k) = ebn,k QPk(Ym|T) represents the mutation probabilities from entity n to entity k after a time bn,k. In this context, the marginal likelihood of the observed data within the site m is Pr(Ym|T) π and, assuming conditional in- dependence of the sites given T, the overall likelihood of the data is P(Y|T) = Q 1 i M(Pr(Yi|T) π) which is naturally log-additive on the sites. For our experiments, π is a uniform distribution. For simplicity, we consider constant branch length, fixed throughout the experiments. For the distributed setting, we place a uniform prior over t and split 2500 nucleotide sites across five clients. In parallel MCMC (Neiswanger et al., 2014) fashion, each client trains a GFlow Net to sample from its local posterior, proportional to the product of its local likelihood and a scaled version of the prior. We further detail the generative process for building phylogenetic trees in Figure 8 in Appendix C.1. Results. Figure 5 shows that EP-GFlow Net accurately learns the posterior distribution over the tree s topologies: the L1 error between the learned distribution and the targeted product distribution is 0.088, whereas the average L1 error among the clients is 0.083( 0.041). Noticeably, this indicates the model s aptitude to learn a posterior distribution in a decentralized manner. 
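To make the per-site likelihood in the task description above concrete, here is a minimal NumPy/SciPy sketch of Felsenstein's pruning recursion for a single alignment site. The tree encoding (a dictionary mapping each internal node to its (child, branch length) pairs) and all names are assumptions for illustration, not the paper's implementation; the full data likelihood multiplies this quantity over sites, which are assumed conditionally independent given the tree.

```python
import numpy as np
from scipy.linalg import expm

def site_likelihood(root, tree, tip_states, Q, prior):
    """Felsenstein's pruning recursion for one alignment site.

    tree[node]       = [(child, branch_length), ...]; leaves have no entry.
    tip_states[leaf] = index of the observed nucleobase at this site.
    Q                = instantaneous rate matrix; prior = root base frequencies.
    """
    def partial(n):
        if n not in tree:                      # leaf: one-hot on the observed base
            vec = np.zeros(Q.shape[0])
            vec[tip_states[n]] = 1.0
            return vec
        result = np.ones(Q.shape[0])
        for child, b in tree[n]:
            # Mutation probabilities after time b, applied to the child's partials.
            result *= expm(b * Q) @ partial(child)
        return result

    return float(prior @ partial(root))        # average over root base frequencies
```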
Moreover, our results suggest the potential usefulness of GFlow Nets as a scalable alternative to the notoriously inefficient MCMC-based algorithms (Zhang & Matsen IV, 2018) in the field of evolutionary biology. Indeed, Figure 12 at Appendix C.1 highlights that our distributed framework significantly reduces the training runtime for GFlow Nets for Bayesian inference relatively to a centralized approach. Notably, naive strategies, like the PCVI baseline, consistently lead to sampling elements that do not belong to the support of our posterior (i.e., are invalid) which is why we do not compare against it. In future endeavors, we plan to investigate joint parallel inference on the tree s topology and its branches lengths using hybrid-space GFlow Nets (Deleu et al., 2023). Importantly, our method is also the first provably correct algorithm enabling distributed Bayesian inference over discrete objects, which may become invaluable in real-world problems with several thousands of sites (Stamatakis & Aberer, 2013). Embarrassingly Parallel GFlow Nets EP-GFlow Nets scalability wrt number of clients. To illustrate the effect of the number of clients, we perform the training of EP-GFlow Nets with an increasing number of clients and a fixed data set. Strikingly, Figure 6 shows that EP-GFlow Nets are approximately robust to the number of chunks the data is partitioned into both in terms of convergence speed and accuracy. 4.5. Federated Bayesian network structure learning Task description. While structure learning is usually carried out on centralized data, there are situations in which the possibly sensitive data is scattered across a number of clients (Reisach et al., 2021; Lorch et al., 2021). In this case, local datasets may be small and individually provide insufficient information to draw meaningful conclusions. Thus, we extend the work of Deleu et al. (2022) and show that EP-GFlow Nets can efficiently learn beliefs over Bayesian networks in a federated setting (Ng & Zhang, 2022), drawing strength from multiple data sources. To this end, let X Rd be a random variable and B Rd d be a realvalued matrix with sparsity pattern determined by a DAG G with nodes [[1, d]]. We assume that X follows a linear structural equation model X = BX + N, with N N(0, σ2I) representing independent Gaussian noise (Bielby & Hauser, 1977). Also, we consider 10 fixed clients, each owing a private dataset upon which a DAG-GFlow Net (Deleu et al., 2022) is trained. Our objective is to train a GFlow Net sampling proportionally to the belief distribution defined by the product of the local rewards without directly accessing the clients datasets, akin to (Ng & Zhang, 2022). See Appendix D for further discussion. Results. Figure 14 in Appendix D shows that the distribution learned by EP-GFlow Nets over some topological features of the DAG accurately matches the target product distribution. Also, Figure 15 highlights that the federated model finds structures with a significantly higher score wrt the complete data than the clients . Importantly, we remark that the clients could refine the global distribution by incorporating an expert s knowledge in a personalized fashion (Bharti et al., 2022). However, we leave this to future works. 4.6. Evaluating the CB loss Section 3.2 presents the CB loss as a natural development given the theory of EP-GFlow Nets. 
To evaluate its utility as a criterion to train GFlow Nets in the conventional centralized (non-parallel) setting, we report the evolution during training of the L1 error of the GFlow Net wrt the normalized reward for models trained using DB, TB, and our CB. We do so for all tasks in our experiments, with all GFlow Nets using the same architectures for the forward and backward policies (more details in the supplement). Notably, CB led to the best convergence rate in the multiset generation, phylogeny, and BNSL tasks (Figure 7), while still performing on par with DB in the remaining domains. An explanation is that CB incurs a considerably simpler parametrization than DB and TB, as we do not require estimating the flow going through each state or the target partition function. Indeed, when using Deleu et al. (2022)'s reparametrization for DB, which obviates the estimation of state flows for state graphs where all states are terminal (implemented for grid world and sequences in Figure 7), DB and CB perform similarly. Moreover, to ensure the observed difference between CB and TB is not due to an insufficiently high learning rate for the log-partition function, Appendix C.4 shows results comparing CB to TB with different learning rates for log Z_{ϕZ} in the multiset experiments (Figure 9). Noticeably, CB outperforms TB for all rates tested. A rigorous understanding of the relative virtues of the diversely proposed balance conditions, however, is still lacking in the literature and is an important direction for future research.

Figure 7. L_CB performs competitively with L_TB, L_DB and L_DBC in the training of conventional GFlow Nets.

5. Conclusions

We proposed EP-GFlow Net as a simple and elegant solution for distributed inference over discrete distributions and federated learning of GFlow Nets, which we validate on an extensive suite of experiments. Our method enjoys theoretical guarantees and builds on the concept of contrastive balance (CB). Our theoretical analysis (i) guarantees correctness when local models are perfectly trained and (ii) allows us to quantify the impact of errors of local models on the global one. We also observed that using the CB loss led to faster convergence for the local clients when intermediate states are not terminal, while being otherwise competitive. Remarkably, we believe EP-GFlow Nets pave the way for a range of applications of distributed and federated discrete Bayesian inference. We also believe EP-GFlow Nets will be useful to scale up Bayesian inference by amortizing the cost of expensive likelihood computations over different clients. In the realm of multi-objective optimization, EP-GFlow Nets enable sampling from a combination of rewards (Jain et al., 2023) by leveraging pre-trained GFlow Nets even without directly accessing the rewards, as recently explored by Garipov et al. (2023). We also believe the modular nature of EP-GFlow Nets may be instrumental in enabling the reuse and combination of large-scale models (Du et al., 2023).

Figure 5. Bayesian phylogenetic inference: learned vs. ground-truth distributions. Following the pattern in Figures 2-4, the goodness-of-fit from local GFlow Nets (Clients 1-5) is directly reflected in the distribution learned by EP-GFlow Net.

Figure 6. EP-GFlow Nets are relatively stable wrt the number of clients.
The learning curve of the aggregation phase (left) and the accuracy achieved by the resulting model in terms of L1 norm (right) are roughly the same for varying number of clients. Impact statement We proposed a framework to approximate log-pools of GFlow Nets using another GFlow Net. Our method has a broad range of applications in large-scale Bayesian inference, federated learning, and multi-objective optimization. Similar to other sampling methods, our work inherits the ethical concerns involved in formulating its target distribution and, e.g., may propagate negative societal biases or be used to generate malicious content. On the other hand, our framework enables model reuse, which can have a positive impact on the amount of carbon emissions stemming from training large generative models. Acknowledgments Tiago da Silva, Luiz Max Carvalho, and Diego Mesquita acknowledge the support of the Fundac ao de Amparo a Pesquisa do Estado do Rio de Janeiro (FAPERJ, grant 200.151/2023), the Fundac ao de Amparo a Pesquisa do Estado de S ao Paulo (FAPESP, grant 2023/00815-6), and the Conselho Nacional de Desenvolvimento Cient ıfico e Tecnol ogico (CNPq, grant 404336/2023-0). Amauri Souza and Samuel Kaski were supported by the Academy of Finland (Flagship programme: Finnish Center for Artificial Intelligence FCAI), EU Horizon 2020 (European Network of AI Excellence Centres ELISE, grant agreement 951847), UKRI Turing AI World-Leading Researcher Fellowship (EP/W002973/1). We acknowledge Aalto Science-IT Project from Computer Science IT and FGV TIC for the provided comp. resources. Atanackovic, L., Tong, A., Hartford, J., Lee, L. J., Wang, B., and Bengio, Y. Dyn GFN: Towards bayesian inference of gene regulatory networks with gflownets. In Advances in Neural Processing Systems (Neur IPS), 2023a. Atanackovic, L., Tong, A., WANG, B., Lee, L. J., Bengio, Y., and Hartford, J. Dyn GFN: Towards bayesian inference of gene regulatory networks with GFlownets. In Thirtyseventh Conference on Neural Information Processing Systems, 2023b. URL https://openreview.net/ forum?id=e7MK5Vq44Q. Bengio, E., Jain, M., Korablyov, M., Precup, D., and Bengio, Y. Flow network based generative models for noniterative diverse candidate generation. In Advances in Neural Information Processing Systems (Neur IPS), 2021. Bengio, Y., Lahlou, S., Deleu, T., Hu, E. J., Tiwari, M., and Bengio, E. Gflownet foundations. Journal of Machine Learning Research (JMLR), 2023. Bharti, A., Filstroff, L., and Kaski, S. Approximate bayesian computation with domain expert in the loop, 2022. Bielby, W. T. and Hauser, R. M. Structural equation models. Annual review of sociology, 3(1):137 161, 1977. Broderick, T., Boyd, N., Wibisono, A., Wilson, A. C., and Jordan, M. I. Streaming variational bayes. In Burges, C., Bottou, L., Welling, M., Ghahramani, Z., and Weinberger, K. (eds.), Advances in Neural Information Processing Systems, volume 26. Curran Associates, Inc., 2013. Chang, Q., Qu, H., Zhang, Y., Sabuncu, M., Chen, C., Zhang, T., and Metaxas, D. N. Synthetic learning: Learn from distributed asynchronized discriminator gan without sharing medical image data. In 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020. da Silva, T., Silva, E., Ribeiro, A., G ois, A., Heider, D., Kaski, S., and Mesquita, D. Human-in-the-loop causal discovery under latent confounding using ancestral gflownets. ar Xiv preprint:2309.12032, 2023. Embarrassingly Parallel GFlow Nets Daulton, S., Balandat, M., and Bakshy, E. 
Parallel bayesian optimization of multiple noisy objectives with expected hypervolume improvement. In Advances in Neural Information Processing Systems (Neur IPS), 2021. de Souza, D., Mesquita, D., Kaski, S., and Acerbi, L. Parallel MCMC without embarrassing failures. In Artificial Intelligence and Statistics (AISTATS), 2022. Deleu, T. and Bengio, Y. Generative flow networks: a markov chain perspective, 2023. Deleu, T., G ois, A., Emezue, C. C., Rankawat, M., Lacoste Julien, S., Bauer, S., and Bengio, Y. Bayesian structure learning with generative flow networks. In Uncertainty in Artificial Intelligence (UAI), 2022. Deleu, T., Nishikawa-Toomey, M., Subramanian, J., Malkin, N., Charlin, L., and Bengio, Y. Joint Bayesian inference of graphical structure and parameters with a single generative flow network. In Advances in Neural Processing Systems (Neur IPS), 2023. Du, Y., Durkan, C., Strudel, R., Tenenbaum, J. B., Dieleman, S., Fergus, R., Sohl-Dickstein, J., Doucet, A., and Grathwohl, W. Reduce, reuse, recycle: compositional generation with energy-based diffusion models and mcmc. In Proceedings of the 40th International Conference on Machine Learning. JMLR.org, 2023. El Mekkaoui, K., Mesquita, D., Blomstedt, P., and Kaski, S. Federated stochastic gradient langevin dynamics. In Uncertainty in artificial intelligence (UAI), 2021. Felsenstein, J. Evolutionary trees from DNA sequences: A maximum likelihood approach. Journal of Molecular Evolution, 1981. Fey, M. and Lenssen, J. E. Fast graph representation learning with Py Torch Geometric. In ICLR Workshop on Representation Learning on Graphs and Manifolds, 2019. Garipov, T., Peuter, S. D., Yang, G., Garg, V., Kaski, S., and Jaakkola, T. Compositional sculpting of iterative generative processes. In Advances in Neural Information Processing Systems (Neur IPS), 2023. Geffner, T., Antoran, J., Foster, A., Gong, W., Ma, C., Kiciman, E., Sharma, A., Lamb, A., Kukla, M., Pawlowski, N., et al. Deep end-to-end causal inference. ar Xiv preprint ar Xiv:2202.02195, 2022. Graves, A. and Graves, A. Long short-term memory. Supervised sequence labelling with recurrent neural networks, pp. 37 45, 2012. Hernandez-Garcia, A., Saxena, N., Jain, M., Liu, C.-H., and Bengio, Y. Multi-fidelity active learning with gflownets, 2023. Hinton, G. E. Training Products of Experts by Minimizing Contrastive Divergence. Neural Computation, 14(8): 1771 1800, 08 2002. Hong, J., Zhu, Z., Yu, S., Wang, Z., Dodge, H. H., and Zhou, J. Federated adversarial debiasing for fair and transferable representations. In ACM SIGKDD Conference on Knowledge Discovery & Data Mining (KDD), 2021. Hu, E. J., Jain, M., Elmoznino, E., Kaddar, Y., Lajoie, G., Bengio, Y., and Malkin, N. Amortizing intractable inference in large language models. ar Xiv preprint 2310.04363, 2023a. Hu, E. J., Malkin, N., Jain, M., Everett, K. E., Graikos, A., and Bengio, Y. Gflownet-em for learning compositional latent variable models. In International Conference on Machine Learning (ICLR), 2023b. Husmeier, D. Sensitivity and specificity of inferring genetic regulatory interactions from microarray experiments with dynamic bayesian networks. Bioinformatics, 19 (17):2271 2282, November 2003. ISSN 1367-4803. doi: 10.1093/bioinformatics/btg313. URL http://dx. doi.org/10.1093/bioinformatics/btg313. Jain, M., Bengio, E., Hernandez-Garcia, A., Rector-Brooks, J., Dossou, B. F. P., Ekbote, C. A., Fu, J., Zhang, T., Kilgour, M., Zhang, D., Simine, L., Das, P., and Bengio, Y. Biological sequence design with GFlow Nets. 
In International Conference on Machine Learning (ICML), 2022. Jain, M., Raparthy, S. C., Hernandez-Garcia, A., Rector Brooks, J., Bengio, Y., Miret, S., and Bengio, E. Multiobjective GFlow Nets. In International Conference on Machine Learning (ICML), 2023. Jukes, T. H. and Cantor, C. R. Evolution of Protein Molecules, pp. 21 132. Elsevier, 1969. ISBN 9781483232119. doi: 10.1016/b978-1-4832-3211-9. 50009-7. URL http://dx.doi.org/10.1016/ B978-1-4832-3211-9.50009-7. Lahlou, S., Deleu, T., Lemos, P., Zhang, D., Volokhova, A., Hern andez-Garcıa, A., Ezzine, L. N., Bengio, Y., and Malkin, N. A theory of continuous generative flow networks. In International Conference on Machine Learning (ICML), 2023. Lorch, L., Rothfuss, J., Sch olkopf, B., and Krause, A. Dibs: Differentiable bayesian structure learning. Advances in Neural Information Processing Systems, 34:24111 24123, 2021. Loshchilov, I. and Hutter, F. Decoupled weight decay regularization. In International Conference on Learning Representations (ICLR), 2019. Embarrassingly Parallel GFlow Nets Maas, A. L., Hannun, A. Y., Ng, A. Y., et al. Rectifier nonlinearities improve neural network acoustic models. In International Conference on Machine Learning (ICML), 2013. Malkin, N., Jain, M., Bengio, E., Sun, C., and Bengio, Y. Trajectory balance: Improved credit assignment in GFlownets. In Advances in Neural Information Processing Systems (Neur IPS), 2022. Malkin, N., Lahlou, S., Deleu, T., Ji, X., Hu, E., Everett, K., Zhang, D., and Bengio, Y. GFlow Nets and variational inference. International Conference on Learning Representations (ICLR), 2023. Mc Mahan, B., Moore, E., Ramage, D., Hampson, S., and y Arcas, B. A. Communication-efficient learning of deep networks from decentralized data. In Artificial intelligence and statistics, pp. 1273 1282. PMLR, 2017a. Mc Mahan, H. B., Moore, E., Ramage, D., Hampson, S., and y Arcas, B. A. Communication-efficient learning of deep networks from decentralized data. In Artificial Intelligence and Statistics (AISTATS), 2017b. Mesquita, D., Blomstedt, P., and Kaski, S. Embarrassingly parallel MCMC using deep invertible transformations. In Uncertainty in Artificial Intelligence (UAI), 2019. Neiswanger, W., Wang, C., and Xing, E. P. Asymptotically exact, embarrassingly parallel MCMC. In Uncertainty in Artificial Intelligence (UAI), 2014. Nemeth, C. and Sherlock, C. Merging MCMC subposteriors through Gaussian-process approximations. Bayesian Analysis, 13(2):507 530, 2018. Ng, I. and Zhang, K. Towards federated bayesian network structure learning with continuous optimization. In International Conference on Artificial Intelligence and Statistics, 2022. Pan, L., Malkin, N., Zhang, D., and Bengio, Y. Better training of GFlow Nets with local credit and incomplete trajectories. In International Conference on Machine Learning (ICML), 2023a. Pan, L., Zhang, D., Courville, A., Huang, L., and Bengio, Y. Generative augmented flow networks. In International Conference on Learning Representations (ICLR), 2023b. Pan, L., Zhang, D., Jain, M., Huang, L., and Bengio, Y. Stochastic generative flow networks. In Uncertainty in Artificial Intelligence (UAI), 2023c. Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., Antiga, L., Desmaison, A., Kopf, A., Yang, E., De Vito, Z., Raison, M., Tejani, A., Chilamkurthy, S., Steiner, B., Fang, L., Bai, J., and Chintala, S. Py Torch: An Imperative Style, High-Performance Deep Learning Library. 
In Advances in Neural Information Processing Systems (Neur IPS), 2019. Pearl, J. Probabilistic reasoning in intelligent systems: networks of plausible inference. Morgan kaufmann, 1988. Pearl, J. Causality. Cambridge University Press, Cambridge, UK, 2 edition, 2009. ISBN 978-0-521-89560-6. doi: 10.1017/CBO9780511803161. Pearl, J. and Mackenzie, D. The Book of Why: The New Science of Cause and Effect. Basic Books, Inc., USA, 1st edition, 2018. ISBN 046509760X. Polato, M. Federated variational autoencoder for collaborative filtering. In International Joint Conference on Neural Networks (IJCNN), 2021. Qu, H., Zhang, Y., Chang, Q., Yan, Z., Chen, C., and Metaxas, D. Learn distributed gan with temporary discriminators. In European Conference on Computer Vision (ECCV), 2020. Reisach, A. G., Seiler, C., and Weichwald, S. Beware of the simulated dag! causal discovery benchmarks may be easy to game. Advances in Neural Information Processing Systems, 34, 2021. Richter, L., Boustati, A., N usken, N., Ruiz, F., and Akyildiz, O. D. Vargrad: A low-variance gradient estimator for variational inference. In Advances in Neural Information Processing Systems (Neur IPS), 2020. Scott, S. L., Blocker, A. W., Bonassi, F. V., Chipman, H. A., George, E. I., and Mc Culloch, R. E. Bayes and big data: The consensus monte carlo algorithm. International Journal of Management Science and Engineering Management, 11, 2016. Shen, M. W., Bengio, E., Hajiramezanali, E., Loukas, A., Cho, K., and Biancalani, T. Towards understanding and improving gflownet training. ar Xiv preprint ar Xiv:2305.07170, 2023. Stamatakis, A. and Aberer, A. J. Novel parallelization schemes for large-scale likelihood-based phylogenetic inference. In 2013 IEEE 27th International Symposium on Parallel and Distributed Processing, pp. 1195 1204. IEEE, 2013. Vono, M., Plassier, V., Durmus, A., Dieuleveut, A., and Moulines, E. Qlsd: Quantised langevin stochastic dynamics for bayesian federated learning. In Artificial Intelligence and Statistics (AISTATS), 2022. Embarrassingly Parallel GFlow Nets Wang, X., Guo, F., Heller, K. A., and Dunson, D. B. Parallelizing MCMC with random partition trees. In Advances in Neural Information Processing Systems (Neur IPS), 2015. Xu, J., Tong, X., and Huang, S.-L. Personalized federated learning with feature alignment and classifier collaboration. In The Eleventh International Conference on Learning Representations, 2023. URL https: //openreview.net/forum?id=SXZr8a DKia. Xu, K., Hu, W., Leskovec, J., and Jegelka, S. How powerful are graph neural networks? International Conference on Learning Representations (ICLR), 2019. Zhang, C. and Matsen IV, F. A. Variational bayesian phylogenetic inference. In International Conference on Learning Representations (ICLR), 2018. Zhang, D., Chen, R. T., Malkin, N., and Bengio, Y. Unifying generative models with gflownets and beyond. ICML Beyond Bayes workshop, 2022a. Zhang, D., Malkin, N., Liu, Z., Volokhova, A., Courville, A., and Bengio, Y. Generative flow networks for discrete probabilistic modeling. In International Conference on Machine Learning (ICML), 2022b. Zhang, D., Dai, H., Malkin, N., Courville, A., Bengio, Y., and Pan, L. Let the flows tell: Solving graph combinatorial optimization problems with gflownets. In Advances in Neural Information Processing Systems (Neur IPS), 2023a. Zhang, D., Pan, L., Chen, R. T., Courville, A., and Bengio, Y. Distributional gflownets with quantile flows. ar Xiv preprint ar Xiv:2302.05793, 2023b. Zhang, D. 
W., Rainone, C., Peschl, M., and Bondesan, R. Robust scheduling with gflownets. In International Conference on Learning Representations (ICLR), 2023c. Zheng, X., Aragam, B., Ravikumar, P., and Xing, E. P. DAGs with NO TEARS: Continuous Optimization for Structure Learning. In Advances in Neural Information Processing Systems, 2018. Zhou, M., Yan, Z., Layne, E., Malkin, N., Zhang, D., Jain, M., Blanchette, M., and Bengio, Y. Phylogfn: Phylogenetic inference with generative flow networks, 2023. Embarrassingly Parallel GFlow Nets A.1. Proof of Lemma 3.5 It stems directly from the trajectory balance that, for any trajectory τ T : s s τ p F (s s ) = R(x) Y s s τ p B(s s) (11) p B(s s) p F (s s ) (12) Therefore, applying this identity to τ and τ and equating the right-hand-sides (RHSs) yields Equation (9). We are left with the task of proving the converse. Note we can rewrite Equation (9) as: p B(s s) p F (s s ) = R(x ) Y p B(s s) p F (s s ). (13) If Equation (9) holds for any pair (τ, τ ), we can vary τ freely for a fixed τ which implies the RHS of the above equation must be a constant with respect to τ . Say this constant is c, then: p B(s s) p F (s s ) = c (14) s s τ p B(s s) = c Y s s τ p F (s s ), (15) and summing the above equation over all τ T yields: X s s τ p B(s s) = c X s s τ p F (s s ) (16) s s τ p B(s s) = c (17) Furthermore, note that: X s s τ p B(s s) = c (18) x X R(x) = c (19) = Z = c (20) Plugging Z = c into Equation (14) yields the trajectory balance condition. A.2. Proof of Theorem 3.1 The proof is based on the following reasoning. We first show that, given the satisfiability of the aggregating balance condition, the marginal distribution over the terminating states is proportional to Eτ p B( |x) p(i) B (τ|x) as stated in Remark 3.3. Then, we verify that this distribution is the same as 1 i N Ri(x) (22) Embarrassingly Parallel GFlow Nets if the local balance conditions are satisfied. This proves the sufficiency of the aggregating balance condition for building a model that samples from the correct product distribution. The necessity follows from Proposition 16 of Bengio et al. (2023) and from the observation that the local balance conditions are equivalent to p(i) F (τ)/p(i) B (τ|x) = Ri(x) for each i = 1, . . . , N. Next, we provide a more detailed discussion about this proof. Similarly to Appendix A.1, notice that the contrastive nature of the aggregating balance condition implies that, if s s τ p(i) F (s,s ) p(i) B (s ,s) s s τ p(i) F (s,s ) p(i) B (s ,s) s s τ p F (s,s ) p B(s ,s) s s τ p F (s,s ) p B(s ,s) , (23) p F (τ) = c p(i) B (τ|x) p B(τ|x) (24) for a constant c > 0 that does not depend either on x or on τ. Hence, the marginal distribution over a terminating state x X is p T (x) := X s s τ p F (s s ) (25) p(i) B (τ|x) p B(τ|x) (26) = c Eτ p B( |x) p(i) B (τ|x) Correspondingly, p(i) F (τ)/p(i) B (τ|x) Ri(x) for every i = 1, . . . , N and every τ leading to x due to the satisfiability of the local balance conditions. Thus, p T (x) Eτ p B( |x) 1 i N Ri(x) 1 i N Ri(x), (28) which attests the sufficiency of the aggregating balance condition for the distributional correctness of the global model. A.3. Proof of Theorem 3.4 Initially, recall that the Jeffrey divergence, known as the symmetrized KL divergence, is defined as DJ(p, q) = DKL[p||q] + DKL[q||p] (29) for any pair p and q of equally supported distributions. 
Then, let ˆπ(x) = ˆZ Eτ p B( |x) p(i) B (τ|x) Embarrassingly Parallel GFlow Nets be the marginal distribution over the terminating states of a GFlow Net satisfying the aggregating balance condition (see Remark 3.3 and Appendix A.2). On the one hand, notice that DKL[π||ˆπ] = Ex π log π(x) log ZEτ p B( |x) p(i) B (τ|x) log Eτ p B( |x) p(i) B (τ|x)πi(x) log ˆZ + log Z (33) 1 i N (1 αi) log ˆZ + log Z (34) 1 i N log 1 1 αi in which Z := P 1 i N πi(x) 1 is π s normalization constant. On the other hand, DKL[π||ˆπ] = Ex ˆπ log ZEτ p B( |x) p(i) B (τ|x) log Eτ p B( |x) p(i) B (τ|x)πi(x) + log ˆZ log Z (38) 1 i N (1 + βi) + log ˆZ log Z (39) = log ˆZ Z + X 1 i N log (1 + βi) . (40) Thus, the Jeffrey divergence between the targeted product distribution π and the effectively learned distribution ˆπ is DJ(π, ˆπ) = DKL[π||ˆπ] + DKL[ˆπ||π] (41) 1 i N log 1 1 αi + log ˆZ Z + X 1 i N log (1 + βi) (42) 1 i N log 1 + βi A.4. Proof of Theorem 3.7 We firstly recall the construction of the unbiased REINFORCE gradient estimator (Williams 1992), which was originally designed as a method to implement gradient-ascent algorithms to tackle associative tasks involving stochastic rewards in reinforcement learning. Let pθ be a probability density (or mass function) differentiably parametrized by θ and fθ : X R be a real-value function over X possibly dependent on θ. Our goal is to estimate the gradient θEx pθ[fθ(x)], (44) Embarrassingly Parallel GFlow Nets which is not readily computable due to the dependence of pθ on θ. However, since θEx pθ[fθ(x)] = θ x X fθ(x)pθ(x)dx (45) x X (( θfθ(x))pθ(x)) dx + Z x X (( θpθ(x))fθ(x)) dx (46) = Ex pθ [ θfθ(x) + fθ(x) θ log pθ(x)] , (47) the gradient of fθ s expected value under pθ may be unbiasedly estimated by averaging the quantity θfθ(x) + fθ(x) θ log pθ(x) over samples of pθ. We use this identity to compute the KL divergence between the forward and backward policies of a GFlow Net. In this sense, notice that θDKL[p F ||p B] = θEτ p F log p F (τ) θ log p F (τ) + log p F (τ) θ log p F (τ) (49) log p F (τ) θ log p F (τ) , (50) as Eτ p F [ θ log p F (τ)] = θEτ p F [1] = 0. In contrast, the gradient of the contrastive balance loss with respect to θ is θLCB(τ, τ , θ) = θ log p F (τ) p B(τ) log p F (τ ) = 2 log p F (τ) p B(τ) log p F (τ ) ( θ log p F (τ) θ log p F (τ )) , (52) whose expectation under the outer product distribution p F p F equals the quantity 4 θDKL[p F ||p B] in Equation (48). Indeed, as log p F (τ ) θ log p F (τ) = 0, (53) with an equivalent identity obtained by interchanging τ and τ , E (τ,τ ) p F p F [ θLCB(τ, τ , θ)] = (54) E (τ,τ ) p F p F 2 log p F (τ) p B(τ) log p F (τ ) ( θ log p F (τ) θ log p F (τ )) = (55) E (τ,τ ) p F p F 2 log p F (τ) θ log p F (τ) + 2 log p F (τ ) θ log p F (τ ) = (56) 4 log p F (τ) θ log p F (τ) = 4 θDKL[p F ||p B]. (57) Thus, the on-policy gradient of the contrastive balance loss equals in expectation the gradient of the KL divergence between the forward and backward policies of a GFlow Net. B. Additional theoretical results This section rigorously lays out further theoretical results which were only briefly stated in the main paper. Firstly, section B.1 (i) shows that LCB is, in expectation, equivalent to the variance loss considered by Zhang et al. (2023c); and (ii) provides alternative and shorter proofs for Corollary 3.6 and Theorem 3.7 based on previously published results for LT B. 
Second, Section B.2 extends our aggregation scheme to accommodate generic logarithmic pooling of GFlowNets, significantly expanding the potential applicability of EP-GFlowNets to personalized federated learning (by implementing a client-level fine-tuned log-pool (Xu et al., 2023)) and to model composition (such as negation (Du et al., 2023; Garipov et al., 2023)).

B.1. Relationship of L_CB to other losses

Alternative proofs for Corollary 3.6 and Theorem 3.7. To start with, we define

V_{TB}(\tau) = \log p_F(\tau) + \log Z - \log p_B(\tau|x_\tau) - \log R(x_\tau)

as the signed violation of the TB condition in log-space; x_\tau represents the terminal state of the complete trajectory \tau. Note then that

L_{CB}(\tau, \tau') = (V_{TB}(\tau) - V_{TB}(\tau'))^2  and  L_{TB}(\tau) = V_{TB}(\tau)^2.   (58)

In this context, Proposition 1 of Malkin et al. (2022) states that the condition V_TB(\tau)^2 = 0 (equivalently, V_TB(\tau) = 0) for each \tau is sufficient to ensure that the GFlowNet's policies sample proportionally to the reward R(x). As a consequence, L_CB(\tau, \tau') = 0 for all pairs ensures that V_TB is constant and, in particular, that there exists a c > 0 such that

V_{TB}(\tau) + (\log c - \log Z) = \log p_F(\tau) + \log c - \log p_B(\tau|x_\tau) - \log R(x_\tau) = 0.   (59)

This condition corresponds to a reparametrization of the TB condition (with c in place of Z) and entails, by Proposition 1 of Malkin et al. (2022), Corollary 3.6. Moreover, since

\mathbb{E}_{(\tau,\tau') \sim p_F \otimes p_F}[\nabla_{\theta_F} L_{CB}(\tau, \tau')] = \mathbb{E}_{(\tau,\tau') \sim p_F \otimes p_F}[2 (V_{TB}(\tau) - V_{TB}(\tau')) \nabla_{\theta_F}(V_{TB}(\tau) - V_{TB}(\tau'))]   (60)

by the chain rule,

\mathbb{E}_{\tau \sim p_F}[\nabla_{\theta_F} L_{TB}(\tau)] = \mathbb{E}_{\tau \sim p_F}[2 V_{TB}(\tau) \nabla_{\theta_F} V_{TB}(\tau)] = 2 \nabla_{\theta_F} D_{KL}[p_F \| p_B]   (61)

by Proposition 1 of Malkin et al. (2023), and

\mathbb{E}_{\tau \sim p_F}[\nabla_{\theta_F} V_{TB}(\tau)] = \sum_\tau p_F(\tau) \nabla_{\theta_F} \log p_F(\tau) = \nabla_{\theta_F} \sum_\tau p_F(\tau) = 0,   (62)

one may conclude that

\mathbb{E}_{(\tau,\tau') \sim p_F \otimes p_F}[\nabla_{\theta_F} L_{CB}(\tau, \tau')] = 4 \nabla_{\theta_F} D_{KL}[p_F \| p_B],   (63)

thereby showing Theorem 3.7.

Relationship of L_CB to L_VL. Zhang et al. (2023c) proposed to train a GFlowNet by minimizing an expectation of the variance loss,

L_{VL}(\tau) = (V_{TB}(\tau) - \mathbb{E}_{\tau'}[V_{TB}(\tau')])^2;   (64)

in practice, the inner expectation was locally estimated using the average violation of a batch of trajectories. Notably, the CB loss, derived from our proposed contrastive balance condition, equals twice L_VL in expectation when the training policy samples the trajectories independently, since

\mathbb{E}_{\tau, \tau'}[(V_{TB}(\tau) - V_{TB}(\tau'))^2] = 2 \mathbb{E}_{\tau}[(V_{TB}(\tau) - \mathbb{E}_{\tau'}[V_{TB}(\tau')])^2] = 2 \mathbb{E}_{\tau}[L_{VL}(\tau)].   (65)

Remarkably, this equation may also be deduced from the general fact that the expected squared difference between two i.i.d. random variables equals twice their variance.

L_CB and VI when p_B is parameterized. The theorem below shows that the CB loss coincides with the KL divergence between the backward and forward policies in terms of expected gradients when p_F and p_B are disjointly parameterized.

Theorem B.1. Let \phi_B represent the parameters of the backward policy. Also, let p_B \otimes p_B be the outer-product distribution assigning probability (1/Z^2) R(x) R(x') p_B(\tau|x) p_B(\tau'|x') to each pair of trajectories (\tau, \tau') terminating respectively at x and x'. Then, the CB loss satisfies

\mathbb{E}_{\tau, \tau' \sim p_B \otimes p_B}[\nabla_{\phi_B} L_{CB}(\tau, \tau'; \phi_B)] = 4 \nabla_{\phi_B} D_{KL}[p_B \| p_F].   (66)

Importantly, the parameterization of p_B doubles the number of forward and backward passes performed during training, which often obfuscates the advantages enacted by the improved credit assignment provided by an appropriate p_B (Shen et al., 2023). Hence, while acknowledging the importance of designing effective learning schemes for p_B, we fix the backward policy as a uniform distribution.
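To make the relationship above concrete, the snippet below is a minimal PyTorch sketch (our own illustration, not the released implementation; the function and argument names are hypothetical) of L_CB computed from summed trajectory log-probabilities given as tensors. As Equation (58) indicates, log Z never needs to be estimated because it cancels in the difference of TB violations.

import torch

def contrastive_balance_loss(log_pf, log_pb, log_r, log_pf_p, log_pb_p, log_r_p):
    # V_TB(tau) up to the shared log Z term, which cancels in the difference below.
    v = log_pf - log_pb - log_r
    v_p = log_pf_p - log_pb_p - log_r_p
    # L_CB(tau, tau') = (V_TB(tau) - V_TB(tau'))^2, cf. Equation (58).
    return (v - v_p).pow(2)

def contrastive_balance_loss_batch(log_pf, log_pb, log_r):
    # Pair each trajectory in a batch with an independently sampled partner by
    # rolling the batch; in expectation this equals twice the variance loss L_VL.
    v = log_pf - log_pb - log_r
    return (v - v.roll(1, dims=0)).pow(2).mean()

Here, log_pf and log_pb would be the sums of log-transition probabilities along each sampled trajectory under the forward and backward policies, respectively.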
B.2. Exponentially weighted distributions

This section extends our theoretical results and shows how to train an EP-GFlowNet to sample from a logarithmic pool of locally trained GFlowNets. Henceforth, let R_1, ..., R_N : X -> R_+ be non-negative functions over X and assume that each client n = 1, ..., N trains a GFlowNet to sample proportionally to R_n. The next propositions show how to train a GFlowNet to sample proportionally to the exponentially weighted distribution \prod_{n=1}^N R_n(x)^{\omega_n} for non-negative weights \omega_1, ..., \omega_N. We omit the proofs since they are essentially identical to the ones presented in Appendix A.

First, the weighted analogue of Theorem 3.1 below proposes a modified balance condition for the global GFlowNet and shows that the satisfiability of this condition leads to a generative model that samples proportionally to the exponentially weighted distribution.

Theorem 3.1 (Aggregating balance condition, weighted version). Let (p_F^{(1)}, p_B^{(1)}), ..., (p_F^{(N)}, p_B^{(N)}) : V^2 -> R_+ be pairs of forward and backward policies from N GFlowNets sampling respectively proportionally to R_1, ..., R_N : X -> R_+. Then, another GFlowNet with forward and backward policies p_F, p_B : V^2 -> R_+ samples proportionally to R(x) := \prod_{n=1}^N R_n(x)^{\omega_n} if and only if the following condition holds for any pair of complete trajectories \tau, \tau' \in T:

\frac{\prod_{1 \le i \le N} \left( \prod_{s \to s' \in \tau} p_F^{(i)}(s, s') / p_B^{(i)}(s', s) \right)^{\omega_i}}{\prod_{1 \le i \le N} \left( \prod_{s \to s' \in \tau'} p_F^{(i)}(s, s') / p_B^{(i)}(s', s) \right)^{\omega_i}} = \frac{\prod_{s \to s' \in \tau} p_F(s, s') / p_B(s', s)}{\prod_{s \to s' \in \tau'} p_F(s, s') / p_B(s', s)}.   (67)

Second, Theorem 3.4 provides an upper bound on the discrepancy between the targeted and the learned global distributions under controlled local errors when the local distributions are heterogeneously pooled. Notably, it suggests that the effect of local failures on the global approximation may be mitigated by reducing the weights associated with improperly trained local models.

Theorem 3.4 (Influence of local failures, weighted version). Let \pi_n := R_n / Z_n, and let p_F^{(n)} and p_B^{(n)} be the forward and backward policies of the n-th client. We use \tau -> x to indicate that \tau \in T terminates at x (i.e., ends with the transition x -> s_f). Suppose that the local balance conditions are lower- and upper-bounded for n = 1, ..., N as per

1 - \alpha_n \le \min_{x \in X, \tau \to x} \frac{p_F^{(n)}(\tau)}{p_B^{(n)}(\tau|x)\, \pi_n(x)} \le \max_{x \in X, \tau \to x} \frac{p_F^{(n)}(\tau)}{p_B^{(n)}(\tau|x)\, \pi_n(x)} \le 1 + \beta_n,   (68)

where \alpha_n \in (0, 1) and \beta_n > 0. The Jeffrey divergence D_J between the global model \hat{\pi}(x) that fulfills the aggregating balance condition in Equation (67) and \pi(x) \propto \prod_{n=1}^N \pi_n(x)^{\omega_n} then satisfies

D_J(\pi, \hat{\pi}) \le \sum_{n=1}^N \omega_n \log \frac{1 + \beta_n}{1 - \alpha_n}.   (69)

Interestingly, one could train a conditional GFlowNet (Bengio et al., 2021; Zhang et al., 2023c) to build an amortized generative model able to sample proportionally to \prod_{n=1}^N R_n(x)^{\omega_n} for any non-negative weights (\omega_1, ..., \omega_N) within a prescribed set. This is a promising avenue for future research.
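To make the weighted condition concrete, the following PyTorch-style sketch (ours; names and interfaces are hypothetical, not the released code) evaluates a squared log-violation of Equation (67) for a pair of complete trajectories, treating the clients' policies as frozen callables that return summed forward and backward log-probabilities. With all weights equal to one, it reduces to the aggregating balance loss L_AB minimized in Algorithm 1 of Appendix C.1.

import torch

def aggregating_balance_loss(tau, tau_p, local_policies, global_policy, weights=None):
    # Squared log-violation of the (weighted) aggregating balance condition for a
    # pair of trajectories. Each policy maps a trajectory to (log p_F, log p_B).
    if weights is None:
        weights = [1.0] * len(local_policies)

    def log_ratio(policy, trajectory):
        log_pf, log_pb = policy(trajectory)
        return log_pf - log_pb

    # Weighted contrast of the clients' forward/backward log-ratios between tau and tau'.
    local_contrast = sum(
        w * (log_ratio(p, tau) - log_ratio(p, tau_p))
        for w, p in zip(weights, local_policies)
    )
    # The global model's contrast must match it for the balance condition to hold.
    global_contrast = log_ratio(global_policy, tau) - log_ratio(global_policy, tau_p)
    return (global_contrast - local_contrast).pow(2)

In the aggregation phase, pairs of trajectories are drawn from a mixture of the global forward policy and a uniform policy, and only the global policies are updated by gradient descent on the batch sum of this loss; the clients' policies remain frozen.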
C. Additional experiments and implementation details

This section is organized as follows. First, Appendix C.1 describes the experimental setup underlying the empirical evaluation of EP-GFlowNets in Section 4. Second, Appendix C.2 exhibits the details of the variational approximations to the combined distributions used as baselines in Table 1. Third, Appendix C.3 specifies our settings for comparing the training convergence speed of different optimization objectives. Algorithm 1 illustrates the training procedure of EP-GFlowNets. The computer code for reproducing our experiments will be publicly released at github.com/ML-FGV/ep-gflownets.

C.1. Experimental setup

In the following, we applied the same optimization settings to each environment. For the stochastic optimization, we minimized the contrastive balance objective using the AdamW optimizer (Loshchilov & Hutter, 2019) for both local and global GFlowNets. We trained the models for 5000 epochs (20000 for the grid world) with a learning rate of 3 x 10^-3 and a batch size dependent upon the environment.

Algorithm 1 Training of EP-GFlowNets
Require: clients' policies (p_F^{(1)}, p_B^{(1)}), ..., (p_F^{(K)}, p_B^{(K)}); clients' rewards R_1, ..., R_K; parameterized global policies (p_F, p_B); number of training epochs E; uniform policy u_F
Ensure: p_T(x) \propto R(x) := \prod_{1 \le k \le K} R_k(x)
parfor k in {1, ..., K} do                              # Train the clients' models in parallel
    train the policies (p_F^{(k)}, p_B^{(k)}) to sample proportionally to R_k
end parfor
for e in {1, ..., E} do                                 # Train the global model
    B <- {(\tau, \tau'): \tau, \tau' ~ 1/2 p_F + 1/2 u_F}   # Sample a batch of trajectory pairs
    L <- \sum_{(\tau,\tau') \in B} L_AB(\tau, \tau'; {(p_F^{(1)}, p_B^{(1)}), ..., (p_F^{(K)}, p_B^{(K)})})
    update the parameters of p_F and p_B through gradient descent on L
end for

Figure 8. An illustration of the generative process for phylogenetic tree topologies. We iteratively select two trees and join their roots. The final state corresponds to a single, connected graph.

Correspondingly, we define the L1 error between the distributions \pi and \hat{\pi} as twice the total variation distance between them, ||\pi - \hat{\pi}||_1 := \sum_{x \in X} |\pi(x) - \hat{\pi}(x)|. For the grid world, design of sequences, and federated BNSL setups, all intermediate GFlowNet states are also terminal, since they are connected to a sink state. For the remaining setups, most states are not terminal and exist solely as intermediate steps between the initial state and the target distribution's support.

Grid world. We considered a two-dimensional grid of side length 9 as the environment for the results of both Table 1 and Figure 2. To parameterize the forward policy, we used an MLP with two 64-dimensional layers and a Leaky ReLU activation function between them (Maas et al., 2013). For inference, we simulated 10^6 environments to (i) compute the L1 error between the targeted and the learned distributions and (ii) select the 800 most rewarding samples. We utilized a batch size of 1024 during both the training and inference phases.

Design of sequences. We trained the GFlowNets to generate sequences of size up to 6 with elements selected from a set of size 6. We parametrized the forward policies with a single-layer, 64-dimensional bidirectional LSTM network followed by an MLP with two 64-dimensional layers (Graves & Graves, 2012). For training, we used a batch size of 512. For inference, we increased the batch size to 1024 and sampled 10^6 sequences to estimate the quantities reported in Table 1 and Figure 4.

Multiset generation. We designed the GFlowNet to generate multisets of size 8 by iteratively selecting elements from a set U of size 10. Moreover, we endowed each element of U with a learnable and randomly initialized 10-dimensional embedding. To estimate the transition probabilities at a given state s, we applied an MLP with two 64-dimensional layers to the sum of the embeddings of the elements in s. During training, we used a batch size of 512 to generate multiple multisets in parallel and reduce the noisiness of the backpropagated gradients. During inference, we increased the batch size to 1024 and generated 10^6 samples to compute the results reported in Table 1 and Figure 3.

Bayesian phylogenetic inference.
We devised a GFlowNet to learn a posterior distribution over the space of rooted phylogenetic trees with 7 leaves and fixed branch lengths. Each state is represented as a forest. Initially, each leaf belongs to a different singleton tree. An action consists of picking two trees and joining their roots to a newly added node. The generative process is finished when all nodes are connected in a single tree (see Figure 8; a similar model was recently considered by Zhou et al. (2023)). To estimate the policies at the (possibly partially built) tree t, we used a graph isomorphism network (GIN; Xu et al., 2019) with two 64-dimensional layers to generate node-level representations for t and then used an MLP to project the sum of these representations to a probability distribution over the viable transitions at t. We used a tempered version of the likelihood to increase the sparsity of the targeted posterior. Importantly, we selected a batch size of 512 for training and of 1024 for inference. Results for Table 1 and Figure 5 are estimates based on 10^5 trees drawn from the learned distributions. To evaluate the likelihood function, we considered Jukes & Cantor's nucleotide substitution model for the observed sites (Jukes & Cantor, 1969), which assigns the same instantaneous substitution rate to each pair of nucleotides.

Bayesian network structure learning. Each client trains a DAG-GFlowNet (Deleu et al., 2022) to sample proportionally to the reward function evaluated at a small dataset of 20 points. However, in contrast to Deleu et al. (2022), which implemented a linear transformer to parameterize the GFlowNet's forward policy, we use a 2-layer, 128-dimensional MLP; the backward policy is fixed as a uniform distribution over a state's parents. For the global model, we increase the size of the MLP's latent layers to 256 units each. To estimate the probability P_G[U -> V], we sample {G_1, ..., G_N} and compute

(1/N) \sum_{1 \le n \le N} 1[U -> V | G_n],   (70)

with 1[U -> V | G_n] indicating whether the edge U -> V exists in G_n. We compare this quantity with the nominal value obtained by enumerating the target distribution \pi's support and directly evaluating the expectation E_{G ~ \pi}[1[U -> V | G]], similarly to Deleu et al. (2022). Our implementations were based on PyTorch (Paszke et al., 2019) and on PyTorch Geometric (Fey & Lenssen, 2019).

C.2. Parallel Categorical Variational Inference

As a simplistic approach to combining the locally learned distributions over compositional objects, we variationally approximate them as products of categorical distributions over the objects' components. For this, we select the parameters that minimize the Kullback-Leibler divergence between the GFlowNet's distribution p_T and the variational approximation q within the family Q,

\hat{q} = \arg\min_{q \in Q} D_{KL}[p_T \| q] = \arg\min_{q \in Q} -E_{x \sim p_T}[\log q(x)],   (71)

which, in asymptotic terms, is equivalent to choosing the parameters that maximize the likelihood of the GFlowNet's samples under the variational model. Then, we use a logarithmic pool of these local variational approximations as a proxy for the global model. In the next paragraphs, we present the specific instantiations of this method for the domains considered throughout our experiments. We used the same experimental setup of Appendix C.1 to train the local GFlowNets.
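As a concrete sketch of this recipe (our illustration; function names are hypothetical, and NumPy is used for brevity), fitting the categorical factors in Equation (71) amounts to taking empirical frequencies of the local GFlowNets' samples, while the log-pool of the resulting approximations is an element-wise product of parameters followed by renormalization, as in the per-domain instantiations that follow.

import numpy as np

def fit_categorical(samples, num_categories):
    # MLE of a categorical from i.i.d. samples, i.e., the minimizer of
    # Equation (71) within a categorical family: the empirical frequencies.
    counts = np.bincount(samples, minlength=num_categories)
    return counts / counts.sum()

def log_pool(local_params):
    # Logarithmic pooling of categorical parameters: element-wise product
    # across clients, renormalized to sum to one.
    pooled = np.prod(np.stack(local_params, axis=0), axis=0)
    return pooled / pooled.sum()

# Example with three hypothetical clients over a support of size 4:
rng = np.random.default_rng(0)
client_samples = [rng.integers(0, 4, size=1000) for _ in range(3)]
local_params = [fit_categorical(s, 4) for s in client_samples]
global_approx = log_pool(local_params)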
Grid world. An object in this domain is composed of its two coordinates in the grid. For a grid of width W and height H, we consider the variational family

Q = { q_{\phi,\psi} : (\phi, \psi) \in \Delta^{W+1} \times \Delta^{H+1}, q_{\phi,\psi}(x, y) = Cat(x|\phi)\, Cat(y|\psi) },   (72)

in which \Delta^d is the d-dimensional simplex and Cat(.|\phi) (resp. Cat(.|\psi)) is a categorical distribution over {0, ..., W} (resp. {0, ..., H}) parameterized by \phi (resp. \psi). Then, given the N variational approximations q_{\phi^{(1)},\psi^{(1)}}, ..., q_{\phi^{(N)},\psi^{(N)}} individually adjusted to the distributions learned by the local GFlowNets, we estimate the unnormalized parameters \tilde{\phi} and \tilde{\psi} of the variational approximation to the global distribution over the positions within the grid as the element-wise products

\tilde{\phi} = \bigodot_{1 \le i \le N} \phi^{(i)}  and  \tilde{\psi} = \bigodot_{1 \le i \le N} \psi^{(i)}.   (73)

Then, we let \phi = \tilde{\phi} / \langle \tilde{\phi}, 1_{W+1} \rangle and \psi = \tilde{\psi} / \langle \tilde{\psi}, 1_{H+1} \rangle, with 1_d as the d-dimensional vector of ones, be the parameters of the global model.

Design of sequences. We represent a sequence of size up to T over a dictionary V as a tuple (S, (x_1, ..., x_S)) denoting its size S and the particular arrangement of its elements (x_1, ..., x_S). This is naturally modeled as a hierarchical model of categorical distributions,

S ~ Cat(\theta),   (74)
x_i ~ Cat(\phi_{i,S} | S) for i in {1, ..., S},   (75)

which is parameterized by \theta \in \Delta^T and \phi_{.,S} \in R^{S x |V|} for S in {1, ..., T}. We define our family of variational approximations as the collection of all such hierarchical models and estimate the parameters \theta and \phi according to Equation (71). In this case, let (\theta^{(1)}, \phi^{(1)}), ..., (\theta^{(N)}, \phi^{(N)}) be the parameters associated with the variational approximations to each of the N locally trained GFlowNets. The unnormalized parameters \tilde{\theta} and \tilde{\phi} of the combined model that approximates the global distribution over the space of sequences are then

\tilde{\theta} = \bigodot_{1 \le i \le N} \theta^{(i)}  and  \tilde{\phi}_{.,S} = \bigodot_{1 \le i \le N} \phi^{(i)}_{.,S} for S in {1, ..., T},   (76)

whereas the normalized ones are \theta = \tilde{\theta} / \langle \tilde{\theta}, 1_T \rangle and \phi_{.,S} = diag(\tilde{\phi}_{.,S} 1_{|V|})^{-1} \tilde{\phi}_{.,S}.

Multiset generation. We model a multiset S of size |S| as a collection of elements independently sampled with replacement from a warehouse W. This characterizes the variational family of distributions q(.|\phi) with

q(S|\phi) = \prod_{s \in S} Cat(s|\phi),   (77)

in which \phi is the parameter of the categorical distribution over W estimated through Equation (71). Denote by \phi^{(1)}, ..., \phi^{(N)} the estimated parameters that disjointly approximate the distributions of the N locally trained GFlowNets. We then variationally approximate the logarithmically pooled global distribution as q(.|\phi) in Q with \phi = \tilde{\phi} / \langle \tilde{\phi}, 1_{|W|} \rangle, in which

\tilde{\phi} = \bigodot_{1 \le i \le N} \phi^{(i)}.   (78)

Notably, the best known methods for carrying out Bayesian inference over the space of phylogenetic trees are either based on Bayesian networks (Zhang & Matsen IV, 2018) or on MCMC, neither of which is amenable to data parallelization and decentralized distributional approximations without specifically tailored heuristics. More precisely, the product of Bayesian networks may not be efficiently representable as a Bayesian network, and it is usually not possible to build a global Markov chain whose stationary distribution matches the product of the stationary distributions of local Markov chains. Moreover, any categorical variational approximation factorizable over the trees' clades would not be correctly supported on the space of complete binary trees and would frequently sample invalid graphs. By a similar reasoning, we do not compare EP-GFlowNets to a composition of tractable variational approximations to the local models, such as those of Geffner et al. (2022) and Zheng et al. (2018), in the problem of federated Bayesian structure learning.

C.3. Comparison of different training criteria
Experimental setup. We considered the same environments and used the same neural network architectures described in Appendix C.1 to parametrize the transition policies of the GFlowNets. Importantly, the implementation of the DB constraint and of the FL-GFlowNet requires choosing a parametrization for the state flows (Bengio et al., 2023; Pan et al., 2023a). We model them as a neural network whose architecture essentially mirrors that of the transition policies, the only difference being the output dimension, which we set to one. Moreover, we followed the suggestions in (Pan et al., 2023a; Malkin et al., 2022) and utilized a learning rate of 3 x 10^-3 for all parameters of the policy networks except for the partition function's logarithm log Z appearing in the TB constraint, for which we used a learning rate of 1 x 10^-1. Noticeably, we found that this heterogeneous learning rate scheme is crucial to enable training convergence under the TB constraint.

Further remarks regarding Figure 7. In Figure 7, we observed that L_CB and L_TB perform similarly in the grid world and design of sequences tasks. A reasonable explanation for this is that such criteria are identically parameterized in these domains, as L_DB reduces to the constraint

R(s')\, p_B(s|s')\, p_F(s_f|s) = R(s)\, p_F(s'|s)\, p_F(s_f|s')

in environments where every state is terminal (Deleu et al., 2022). Thus, the state flow F vanishes and the difficult estimation of this function is avoided.

Table 2. Quality of the parallel approximation. The global model's performance does not critically depend on the clients' training objective; it relies only on the goodness-of-fit of their models.

                          Grid World             Multisets              Sequences
                          L1        Top-800      L1        Top-800      L1        Top-800
EP-GFlowNets (CB)         0.038     6.355        0.130     27.422       0.005     1.535
                          (+-0.016) (+-0.000)    (+-0.004) (+-0.000)    (+-0.002) (+-0.000)
EP-GFlowNets (TB)         0.039     6.355        0.131     27.422       0.006     1.535
                          (+-0.006) (+-0.000)    (+-0.018) (+-0.000)    (+-0.005) (+-0.000)

C.4. Additional experiments

Reduction in runtime achieved due to a distributed formulation. Figure 12 shows that our distributed inference framework enacts a significant reduction in runtime for the task of Bayesian phylogenetic inference relative to a centralized approach, while maintaining competitive performance under the L1 metric.1 This underlines the effectiveness of our method for Bayesian inference when the target distribution is computationally difficult to evaluate due to an expensive-to-compute likelihood function (which, for phylogenies, requires running Felsenstein's dynamic programming algorithm (Felsenstein, 1981) to carry out message passing in a graphical model). In this context, EP-GFlowNets may also be of use in Bayesian language modeling with large language models (LLMs), in which case the likelihood function is provided by an LLM with a notoriously high computational footprint; see (Hu et al., 2023a). For this experiment, we considered the same setup described in Appendix C.1, partitioning the data among an increasing number of {2, 4, 6, 8, 10} identical clients, each processing a set of 1000 random sites, and training the corresponding centralized and distributed models to sample from the full posterior. The reported times were measured on a high-performance Linux machine equipped with an AMD EPYC 7V13 64-core processor, 216 GB of DDR4 RAM, and an NVIDIA A100 80 GB PCIe 4.0 GPU.
Notably, we found that training a GFlowNet on a CPU is considerably faster in this case than training it on a GPU, possibly due to the iterative nature of the generative process and the relatively small size of the policy networks, whose accelerated evaluation on a GPU does not compensate for the corresponding slow-down of the tree-building process.

Figure 9. CB outperforms TB for different learning rates for log Z.

Comparison between TB and CB with different learning rates. Figure 9 shows that increasing the learning rate for log Z_{\phi_Z} significantly accelerates training convergence for the TB objective. In this experiment, the learning rate for the other parameters was fixed at 10^-3, following the setup of Malkin et al. (2022, Appendix B). However, CB leads to faster convergence relative to TB for all considered learning rates. In practice, note that finding an adequate learning rate for log Z_{\phi_Z} may be a difficult and computationally expensive endeavor that is completely avoided by using the CB loss.

Figure 10. Sampling from the product of the GFlowNets' policies is not equivalent to sampling from the product of the local targets.

Sampling from products of distributions. As a first approach, one could employ a simple rejection sampling procedure to sample from a product of discrete distributions q_1, ..., q_K over a shared support: independently draw one x_i from each q_i and accept the draw only if all the x_i coincide. Nonetheless, this approach scales very badly as one increases either the number of distributions or the size of their common support. Hinton (2002), in a related work, proposed an efficient algorithm to train a model to sample from a product of energy-based distributions by minimizing a contrastive divergence, acknowledging that it is not generally straightforward to sample from a product of discrete distributions. More recently, Du et al. (2023) developed an MCMC-based algorithm to sample from a multiplicative composition of energy-based diffusion models, remarking on the incorrectness of naive approaches based on sampling from the product of the reverse kernels.

1 Note that, in Figure 12, the runtime for EP-GFlowNets equals the maximum time for training the clients' models plus the running time of our aggregation step.

Figure 12. EP-GFlowNets achieve a significant reduction in runtime relative to a standard GFlowNet in the task of Bayesian phylogenetic inference. The left plot highlights that the training time for an EP-GFlowNet remains approximately constant when an increasing set of observed samples is equally partitioned among a correspondingly increasing set of clients, each receiving 1000 sites, whereas the training time of a standard GFlowNet grows roughly linearly in the number of samples. Importantly, the right plot shows that the asynchronously trained model performs comparably to the synchronously trained one for all considered data sizes.
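The following sketch (ours, for illustration only) implements the naive rejection sampler mentioned above; its acceptance probability is \sum_x \prod_k q_k(x), which decays quickly as the number of distributions or the support size grows.

import numpy as np

def product_rejection_sample(dists, rng, max_tries=100_000):
    # Draw one sample from each q_k and accept only if every draw coincides;
    # conditioned on acceptance, the sample follows the normalized product.
    support = np.arange(len(dists[0]))
    for _ in range(max_tries):
        draws = [rng.choice(support, p=q) for q in dists]
        if all(d == draws[0] for d in draws):
            return draws[0]
    raise RuntimeError("no sample accepted: acceptance rate is too low")

rng = np.random.default_rng(0)
q1 = np.array([0.7, 0.2, 0.1])
q2 = np.array([0.1, 0.2, 0.7])
sample = product_rejection_sample([q1, q2], rng)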
In the context of compounding GFlowNets with forward policies (p_F^{(i)})_{1 \le i \le N} to sample from the corresponding product distribution \propto \prod_{1 \le i \le N} R_i, however, one could exploit the structured nature of the model and naively attempt to use

p_F(.|s) \propto \prod_{1 \le i \le N} p_F^{(i)}(.|s),   (79)

hoping that the resulting GFlowNet would sample from the correct product distribution. Notably, this approach fails due to the dependence of the normalizing constant of the preceding distribution on the state s, and it is unclear according to which distribution the sampled objects are actually distributed. Figure 10 illustrates this for the problem of generating sequences: by combining the clients' policies according to Equation (79), the resulting distribution drastically differs from the targeted one, even though the clients were almost perfectly trained. The question of which distributions one may obtain by composing the policies of expensively pre-trained GFlowNets is still open in the literature (Garipov et al., 2023).

Implementing different training objectives for the clients. Table 2 suggests that the accuracy of EP-GFlowNets' distributional approximation is mostly independent of whether the clients implemented CB or TB as their training objective. Notably, the combination phase of our algorithm is by design agnostic to how the local models were trained, as long as they provide us with well-trained backward and forward policies. This is not constraining, however, since any practically useful training scheme for GFlowNets is explicitly based upon the estimation of such policies (Malkin et al., 2022; Pan et al., 2023a; Bengio et al., 2023; Zhang et al., 2023c).

Figure 11. Aggregation phase's 50-epoch cost.

Aggregation phase computational cost. Figure 11 shows that the per-epoch cost of training a naively implemented EP-GFlowNets aggregation phase, which we measure for the task of multiset generation, increases roughly linearly as a function of the number of worker nodes. Nonetheless, we remark that, in theory, one could achieve a sublinear relationship between the computational cost and the number of clients by parallelizing the evaluation of the local policies when minimizing the aggregating balance loss L_AB, which is not done by our implementation.

Sensitivity of EP-GFlowNets to imperfectly trained clients. Theorem 3.4 ensures that EP-GFlowNets can be trained to accurately sample from the combined target when the clients are sufficiently, though possibly imperfectly, trained. However, even when clients are imperfectly trained, EP-GFlowNets may achieve a relatively good approximation to the target. To exemplify this, we consider the task of multiset generation with 4 clients trained on multiplicatively noisy targets, namely, log \tilde{R}_i(x) = log R_i(x) + \epsilon with \epsilon ~ N_{|X|}(0, \sigma^2 I) sampled from an |X|-dimensional normal distribution. Then, we perform EP-GFlowNets' aggregation phase and compare the learned distribution with the product \prod_{1 \le i \le 4} R_i. Importantly, Figure 13 shows that the accuracy of the aggregated model is comparable to, and often better than, that of the most inaccurate client for \sigma^2 in {0.000, 0.002, 0.004, 0.006, 0.008, 0.01}.
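The toy computation below (our own, unrelated to the paper's sequence environment) makes the failure of Equation (79) explicit for length-2 binary sequences: renormalizing the product of forward policies step by step does not recover the normalized product of the terminal distributions, because the per-state normalizing constants differ.

import numpy as np

rng = np.random.default_rng(1)

def random_policy():
    # A toy GFlowNet over length-2 binary sequences: a distribution over the
    # first token and, per first token, a distribution over the second one.
    first = rng.dirichlet(np.ones(2))
    second = rng.dirichlet(np.ones(2), size=2)  # row i: p(x2 | x1 = i)
    return first, second

def terminal_dist(first, second):
    return first[:, None] * second  # q(x1, x2) = p(x1) p(x2 | x1)

(f1, s1), (f2, s2) = random_policy(), random_policy()

# Target: normalized product of the two terminal distributions.
target = terminal_dist(f1, s1) * terminal_dist(f2, s2)
target /= target.sum()

# Naive composition of Equation (79): multiply and renormalize each policy locally.
f_naive = f1 * f2 / (f1 * f2).sum()
s_naive = s1 * s2 / (s1 * s2).sum(axis=1, keepdims=True)
naive = terminal_dist(f_naive, s_naive)

print(np.abs(naive - target).sum())  # strictly positive L1 gap in general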
Figure 14. Federated Bayesian network structure learning. Each plot shows, for each pair of nodes (U, V), the expected (vertical) and learned (horizontal) probabilities of U and V being connected by an edge, P[U -> V], and of there existing a directed path between U and V. Notably, EP-GFlowNet accurately matches the target distribution over such edge features.

D. Federated Bayesian Network Structure Learning (BNSL)

Figure 13. EP-GFlowNet's vs. clients' accuracy.

Bayesian networks are often used to describe the statistical dependencies of a set of observed variables (Bielby & Hauser, 1977; Pearl & Mackenzie, 2018; Pearl, 2009; 1988); however, learning their structure from data is very challenging due to the combinatorially many possible structures. For example, when such variables represent some form of gene expression, we may want to infer the gene regulatory network describing the genes' causal relationships (Husmeier, 2003). Conventionally, structure learning is carried out centrally, with all data gathered in a single place (Reisach et al., 2021; Lorch et al., 2021). Nonetheless, such datasets are often small and insufficiently informative about the underlying structure. On the other hand, the rapid development of information technologies has significantly lowered the barrier for multiple parties (e.g., companies) to collaboratively train a model using their collected, privacy-sensitive datasets. In this context, we extend the work of Deleu et al. (2022; 2023) and show that EP-GFlowNets can be efficiently used to learn a belief distribution over Bayesian networks in a federated setting (Ng & Zhang, 2022).

Figure 15. The distribution learned in a federated manner (green) finds significantly higher-scoring graphs w.r.t. the complete data than any of its local counterparts (red). The dashed lines represent the average scores for the global and local models.

For this, let X in R^d be a random variable and B in R^{d x d} be a sparse real-valued matrix with sparsity pattern determined by a directed acyclic graph G with nodes {1, ..., d}. We assume that X follows a linear structural equation model (SEM)

X = BX + N,   (80)

with N ~ N(0, \sigma^2 I) representing independent Gaussian noise (Bielby & Hauser, 1977; Deleu et al., 2022). Also, we consider K fixed clients, each owning a private dataset D_k = {x_1, ..., x_N}, and our objective is to learn a belief distribution

R(G) \propto \prod_{1 \le k \le K} R_k(G)   (81)

over DAGs, in which R_k(G) = p(D_k|G) p(G) and p(D_k|G) = p(D_k|\hat{B}_k, G), with \hat{B}_k being client k's maximum likelihood estimate of B given G (Deleu et al., 2022). Intuitively, the belief R assigns high probability to DAGs which are probable under each of the R_k's and may be seen as a multiplicative composition of the local models (Garipov et al., 2023; Du et al., 2023). Notably, Ng & Zhang (2022) considered a similar setup, iteratively solving an L1-regularized continuous relaxation of the structure learning problem to find a globally optimal DAG (Zheng et al., 2018).
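A minimal sketch of the data-generating process in Equation (80) is given below (the edge weights and noise scale are illustrative assumptions, not the exact settings of our experiment). Since G is acyclic, I - B is invertible and the samples can be obtained in closed form.

import numpy as np

def simulate_linear_sem(adjacency, num_samples, noise_scale=1.0, seed=0):
    # adjacency[i, j] = 1 encodes the edge i -> j of the DAG G. In row-vector
    # form, the SEM X = B X + N reads x = x W + n, where W[i, j] is the weight
    # of edge i -> j, so x = n (I - W)^{-1}.
    rng = np.random.default_rng(seed)
    d = adjacency.shape[0]
    W = adjacency * rng.uniform(0.5, 2.0, size=(d, d))
    noise = rng.normal(scale=noise_scale, size=(num_samples, d))
    return noise @ np.linalg.inv(np.eye(d) - W)

# Example: a 4-node chain DAG, 400 samples evenly split among 4 clients.
adj = np.array([[0, 1, 0, 0],
                [0, 0, 1, 0],
                [0, 0, 0, 1],
                [0, 0, 0, 0]])
data = simulate_linear_sem(adj, num_samples=400)
client_datasets = np.array_split(data, 4)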
However, in contrast to EP-GFlowNets, Ng & Zhang (2022)'s approach does not always guarantee that the obtained graph is acyclic (Geffner et al., 2022; Zheng et al., 2018); it does not provide a belief distribution over DAGs, which could be successively refined by probing an expert (Bharti et al., 2022; da Silva et al., 2023); and it is not embarrassingly parallel, requiring many communication steps to complete.

Experimental setup. To show that EP-GFlowNets can accurately sample from the belief distribution defined by Equation (81) in a distributed setting, we first simulate a dataset of 400 independent data points according to the SEM outlined in Equation (80) with d = 4 variables. Next, we evenly partition this dataset among 4 clients, which then train their own DAG-GFlowNets (Deleu et al., 2022) on their individual datasets for 10000 epochs using AdamW (Loshchilov & Hutter, 2019). Finally, the locally trained DAG-GFlowNets are aggregated by minimizing the aggregating balance loss, and the resulting model is compared to the belief distribution in Equation (81). See Section 4.5 for further discussion of our empirical observations for this domain.

Results. Figure 14 shows that EP-GFlowNet adequately learns a distribution over DAGs in a federated setting. Correspondingly, Figure 15 stresses that the collaboratively trained GFlowNet finds graphs with higher scores (graph-conditioned maximum log-likelihood) relative to the private models. See Section 4.5 for discussion.

E. Related work

GFlowNets were originally proposed as a reinforcement learning algorithm tailored to the search for diverse and highly valuable states within a given discrete environment (Bengio et al., 2021). Recently, these algorithms have been successfully applied to the discovery of biological sequences (Jain et al., 2022), robust scheduling of operations in computation graphs (Zhang et al., 2023c), Bayesian structure learning and causal discovery (Deleu et al., 2022; 2023; da Silva et al., 2023; Atanackovic et al., 2023a), combinatorial optimization (Zhang et al., 2023a), active learning (Hernandez-Garcia et al., 2023), multi-objective optimization (Jain et al., 2023), and discrete probabilistic modeling (Zhang et al., 2022a; Hu et al., 2023b; Zhang et al., 2022b). Bengio et al. (2023) formulated the theoretical foundations of GFlowNets. Relatedly, Lahlou et al. (2023) laid out the theory of GFlowNets defined on environments with a non-countable state space. Pan et al. (2023c) and Zhang et al. (2023b) extended GFlowNets to environments with stochastic transitions and rewards. Alongside these advances, there is a growing literature that aims to better understand and improve this class of algorithms (Deleu & Bengio, 2023; Shen et al., 2023; Malkin et al., 2023), with an emphasis on the development of effective objectives and parametrizations to accelerate training convergence (Pan et al., 2023b;a; Malkin et al., 2022; Deleu et al., 2022). In recent work, for instance, Pan et al. (2023a) proposed a novel residual parametrization of the state flows that achieved promising results in terms of speeding up the training convergence of GFlowNets. More specifically, the authors assumed the existence of a function E : S -> R such that (i) E(s_0) = 0 and (ii) E(x) = log R(x) for each terminal state x in X, and reparameterized the state flows as log F(s; \phi_S) = E(s) + log \tilde{F}(s; \phi_S), where \tilde{F} is the learned residual flow.
This new training scheme was named forward-looking (FL) GFlowNets due to the inclusion of partially computed rewards in non-terminal transitions. Notably, both Malkin et al. (2023) and Zhang et al. (2023c) proposed using the variance of a TB-based estimate of the log-partition function as a training objective, building on the variance-reduction method of Richter et al. (2020). It is important to note that one may use stochastic rewards (see Bengio et al., 2023; Zhang et al., 2023b) to carry out distributed inference, in the same fashion as, e.g., distributed stochastic-gradient MCMC (El Mekkaoui et al., 2021; Vono et al., 2022). Notably, stochastic rewards have also been used in the context of causal structure learning by Deleu et al. (2023). However, this would require many communication steps between clients and server to achieve convergence, which is one of the bottlenecks EP-GFlowNets aim to avoid.

Distributed Bayesian inference mainly concerns the task of approximating or sampling from a posterior distribution when data shards are spread across different machines. This comprises both federated scenarios (El Mekkaoui et al., 2021; Vono et al., 2022) and those in which we arbitrarily split the data to speed up inference (Scott et al., 2016). Within this realm, there is a notable family of algorithms under the label of embarrassingly parallel MCMC (Neiswanger et al., 2014), which employ a divide-and-conquer strategy to assess the posterior. These methods sample from subposteriors (defined on each user's data) in parallel, subsequently sending results to the server for aggregation. The usual approach is to use local samples to approximate the subposteriors with some tractable form and then aggregate the approximations in a product. In this line, works vary mostly in the approximations employed. For instance, Mesquita et al. (2019) apply normalizing flows, Nemeth & Sherlock (2018) model the subposteriors using Gaussian processes, and Wang et al. (2015) use hyper-histograms. It is important to note, however, that these works are mostly geared towards posteriors over continuous random variables.

Federated learning was originally motivated by the need to train machine learning models on privacy-sensitive data scattered across multiple mobile devices linked by an unreliable communication network (McMahan et al., 2017b). While we are the first to tackle federated learning of GFlowNets, there are works on learning other generative models in federated/distributed settings, such as generative adversarial networks (Hong et al., 2021; Chang et al., 2020; Qu et al., 2020) and variational autoencoders (Polato, 2021). Critically, EP-GFlowNets enable the collaborative training of a global model without disclosing the data underlying the clients' reward functions (defined by a neural network (Jain et al., 2022) or a posterior distribution (Deleu et al., 2022; 2023; da Silva et al., 2023), for instance); however, they do not inherently preserve the privacy of the reward functions themselves, which may be accurately estimated from the publicly shared policy networks. Nevertheless, future work may formally investigate to what extent imperfectly trained policy networks ensure some form of differential privacy over the clients' rewards.

Comparing EP-GFlowNets to single-round FedAvg (McMahan et al., 2017a). As EP-GFlowNets are the first method enabling embarrassingly parallel inference over discrete combinatorial spaces, there are no natural baselines for our experiments.
For completeness, however, we use a single round of FedAvg (McMahan et al., 2017a) as a baseline by training local models (GFlowNets) until convergence and aggregating them in parameter space, assuming all policy networks share the same architecture. Note, however, that this baseline does not enjoy any guarantee of correctness and may lead to poor results. To validate this, we ran an experiment on the multiset generation task using FedAvg for aggregation. The resulting model attained an L1 error of 1.32, more than an order of magnitude worse than the 0.04 achieved by EP-GFlowNets.
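For reference, the single-round FedAvg baseline amounts to averaging the locally trained policy networks in parameter space; a minimal sketch (assuming identical architectures and floating-point parameters; the names are ours) is shown below.

import copy
import torch

def fedavg_single_round(local_state_dicts):
    # Average the clients' policy networks entry-wise in parameter space.
    averaged = copy.deepcopy(local_state_dicts[0])
    for name in averaged:
        averaged[name] = torch.stack(
            [sd[name] for sd in local_state_dicts]
        ).mean(dim=0)
    return averaged

# global_policy.load_state_dict(
#     fedavg_single_round([client.state_dict() for client in client_policies])
# )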