# Ergodic Generative Flows

Leo Maxime Brunswic* 1, Mateo Clemente* 1, Rui Heng Yang 1, Adam Sigal 1, Amir Rasouli 1, Yinchuan Li 2

*Equal contribution. 1 Huawei Technologies Canada, Noah's Ark Laboratories. 2 Huawei. Correspondence to: Leo Maxime Brunswic. Proceedings of the 42nd International Conference on Machine Learning, Vancouver, Canada. PMLR 267, 2025. Copyright 2025 by the author(s).

Generative Flow Networks (GFNs) were initially introduced on directed acyclic graphs to sample from an unnormalized distribution density. Recent works have extended the theoretical framework for generative methods, allowing more flexibility and enhancing the application range. However, many challenges remain in training GFNs in continuous settings and for imitation learning (IL), including intractability of the flow-matching loss, limited tests of non-acyclic training, and the need for a separate reward model in imitation learning. The present work proposes a family of generative flows called Ergodic Generative Flows (EGFs) which are used to address the aforementioned issues. First, we leverage ergodicity to build simple generative flows with finitely many globally defined transformations (diffeomorphisms), with universality guarantees and a tractable flow-matching (FM) loss. Second, we introduce a new loss involving cross-entropy coupled to weak flow-matching control, coined the KL-weak FM loss. It is designed for IL training without a separate reward model. We evaluate IL-EGFs on toy 2D tasks and real-world datasets from NASA on the sphere, using the KL-weak FM loss. Additionally, we conduct toy 2D reinforcement learning experiments with a target reward using the FM loss.

1. Introduction

Generative models aim to sample from a target distribution κ on a space S, possibly conditioned on additional context or variables. The target distribution κ is either specified by an unnormalized density with respect to a background measure λ, as in reinforcement learning (RL), or defined by a dataset of samples D, as in imitation learning (IL). While RL (Brooks et al., 2011; Sutton, 2018; Haarnoja et al., 2018; Bengio et al., 2023) and IL (Goodfellow et al., 2020; Ho et al., 2020; Song et al., 2020b; Rozen et al., 2021; Papamakarios et al., 2021) methods are often considered to be separate approaches, it is common practice to first train a reward model from a dataset and then use RL techniques to solve an IL task (Chung et al., 2024; Zhang et al., 2022b;a; Sendera et al., 2024).

Among IL methods, normalizing flows (NFs), although capable of leveraging advanced neural ODEs (Chen et al., 2018; Grathwohl et al., 2018), have not achieved state-of-the-art performance due to architectural constraints (i.e., the requirement of sequentially composed parameterized diffeomorphisms) and numerical instabilities (Caterini et al., 2021; Verine et al., 2023). While NFs are deterministic models, stochasticity has proven to be a crucial component of IL generative models, with diffusion models as its simplest embodiment (Ho et al., 2020; Song et al., 2020b). However, diffusion models face a challenging trade-off between loss-function tractability and the time-consuming denoising process for generation. This trade-off drives continued efforts to accelerate the diffusion inference process (Lu et al., 2022; Tang et al., 2025; Chen et al., 2024). Such issues are particularly important when the computational and time budgets of generation are highly constrained, as is typical on real-time tasks or mobile devices.
The present work leverages the flexibility of Generative Flow Networks (GFNs) (Bengio et al., 2021; 2023) to address these issues by building simple yet highly expressive models with short sampling trajectories. GFNs, originally formulated for RL tasks, are a family of generative methods designed to sample proportionally to a reward function $r = \varphi_\kappa$, the density of a target distribution κ. They were initially restricted to directed acyclic graphs, but have since been extended to other settings: subsequent work has generalized GFNs to continuous state spaces (Li et al., 2023; Lahlou et al., 2023) and non-acyclic structures (Brunswic et al., 2024). Despite these advancements, GFNs still face four key challenges that hinder their application in RL and IL settings.

First, although it has been suggested that GFNs themselves could be used as a reward model in the IL setting (see Bengio et al. (2023), p. 37), previous works (Zhang et al., 2022b; Lahlou et al., 2023) train a separate reward model on samples from κ, effectively reducing the problem to an RL task. The need to train an additional reward model, which incurs additional training and computation costs, stems from a lack of control of the so-called flow-matching (FM) property of the flow: indeed, the FM loss is zero by construction of the GFN self-defined reward model. Second, the FM training loss presents significant challenges in continuous settings, as its evaluation is intractable in naive implementations (Li et al., 2023), necessitating the training of an imperfect estimator when directly enforcing the FM constraint. Addressing this issue requires either handcrafting or training an additional backward policy $\pi^{\star\leftarrow}$, or resorting to higher-variance loss functions, such as the Detailed Balance (DB) (Bengio et al., 2021) and Trajectory Balance (TB) (Malkin et al., 2022) losses. Third, the acyclicity requirements (Lahlou et al., 2023) mean that additional structures must be manually crafted, whereas cycles appear naturally in naive implementations and RL environments. Brunswic et al. (2024) offers a theory for non-acyclic flows and stable FM losses, but these have yet to be tested in strongly non-acyclic settings. Lastly, Li et al. (2023) argues that exact cycles are negligible in the continuous setting due to their zero probability, whereas Brunswic et al. (2024) argues that 0-flows, i.e. measures ξ that are ergodic for $\pi^\star$, need to be addressed; otherwise, the divergence-based losses used by Bengio et al. (2023) may become unstable in the presence of such a ξ, a prediction that remains untested.

We present Ergodic Generative Flows (EGFs), which address the aforementioned key limitations. EGFs are built using a policy that, at a given state, chooses among finitely many globally defined transformations, i.e. diffeomorphisms of the state space. This allows for a tractable backward policy, hence a tractable FM loss. Since we favor finitely many simple transformations, the universality of EGFs is non-trivial. We provide provable guarantees under an ergodicity (Walters, 2000; Kifer, 2012) assumption on the group generated by the EGF transformations.

Main contributions:

1. We extend the theoretical framework of generative flows, in particular providing quantitative versions of the sampling theorem for non-acyclic generative flows.
2. We develop a theory of EGFs, presenting a universality criterion and demonstrating that, in any dimension on tori and spheres, this criterion can be fulfilled with four simple transformations.

3. We propose a coupled KL-weak FM loss to train EGFs directly for IL. This allows training EGFs for IL without a separate reward model.

2. Preliminaries: GFN for RL and IL

In this work, we focus on generative flows in the continuous setting, encompassing scenarios where the state space S is either a vector space, $S = \mathbb{R}^d$, or potentially a Riemannian manifold (Lee, 2018; Gemici et al., 2016). Both are instances of Polish spaces (a topological measurable space is Polish if its topology is completely metrizable and it is endowed with its Borel σ-algebra) that admit a natural finite background measure λ (usually the Lebesgue, Haar (Halmos, 2013) or Riemannian volume measures). A measure µ is dominated by ν, denoted $\mu \ll \nu$, if $\nu(A) = 0 \Rightarrow \mu(A) = 0$ for any measurable $A \subseteq S$; then the Radon-Nikodym derivative of µ with respect to ν, denoted $\frac{d\mu}{d\nu} \in L^1(\nu)$, is the unique ν-integrable function φ such that $\varphi\nu = \mu$. A Markov kernel $\pi : X \rightsquigarrow Y$ is a stochastic map denoted $\pi(x)$. The image measure of µ by π, denoted µπ, is defined by $\mu\pi(A) = \int_{x \in X} P(\pi(x) \in A)\, d\mu(x)$ for any measurable $A \subseteq Y$. Their tensor product is the kernel $\mu \otimes \pi(A \times B) := \int_{x \in A} P(\pi(x) \in B)\, d\mu(x)$. Even when µ is an unnormalized distribution, we write $x \sim \mu$ for "the law of x is $\frac{1}{\mu(S)}\mu$".

We adopt a formulation of generative flows that differs from Lahlou et al. (2023) and Brunswic et al. (2024), providing the necessary flexibility for the subsequent sections. Let (S, λ) be a Polish space with a finite background measure. A generative flow on S is the data of a star forward policy, i.e. a Markov kernel $\pi^\star : S \rightsquigarrow S$, and a star outflow, i.e. a finite non-negative measure $F^\star$ on S. With unnormalized initial and terminal distributions $F_{\mathrm{init}}$ and $F_{\mathrm{term}}$, it induces a Markov chain $(s_t)_{t \geq 1}$ defined by $s_1 \sim F_{\mathrm{init}}$ and $s_{t+1} = \pi^\star(s_t)$. Define the sampling time $\tau \geq 1$ as the random variable given by $P(\tau \geq 1) = 1$ and $P(\tau = t \mid \tau \geq t) = \frac{dF_{\mathrm{term}}}{d(F_{\mathrm{term}} + F^\star)}(s_t)$ (clearly $F_{\mathrm{term}} \ll F_{\mathrm{term}} + F^\star$, so the Radon-Nikodym derivative is well-defined). The sampler associated with a generative flow is obtained by emulating the Markov chain $(s_t)_{t \geq 1}$ and sampling $s_\tau$. By extending the sampling Markov chain with a source state $s_0$ and a sink state $s_f$, we can summarize the flow as follows:

$$s_0 \xrightarrow{\ F_{\mathrm{init}}\ } \big(S,\ F^\star\pi^\star\big) \xrightarrow{\ F_{\mathrm{term}}\ } s_f. \tag{1}$$

The following theorem provides theoretical guarantees on the sampler associated with a generative flow.

Theorem 2.1 ((Bengio et al., 2021; Brunswic et al., 2024)). Let $F_{\mathrm{init}}$ and $F_{\mathrm{term}}$ be unnormalized distributions with $F_{\mathrm{term}} \neq 0$ and let $(\pi^\star, F^\star)$ be a generative flow. If the flow-matching constraint

$$F_{\mathrm{init}} + F^\star\pi^\star = F_{\mathrm{term}} + F^\star \tag{2}$$

is satisfied, then $F_{\mathrm{init}}(S) = F_{\mathrm{term}}(S)$ and

$$\mathbb{E}(\tau) \leq \frac{F^\star(S)}{F_{\mathrm{init}}(S)} + 1, \qquad s_\tau \sim F_{\mathrm{term}}.$$

The pair $(F^\star, \pi^\star)$ is trained using gradient descent to minimize a loss function that enforces the flow-matching constraint in equation (2). Once convergence is achieved, the generative flow sampler yields samples of the random variable $s_\tau$; hence, by Theorem 2.1, the generative flow sampler samples from $F_{\mathrm{term}}$. For example, we can use the most straightforward stable FM loss:

$$\mathcal{L}^{\mathrm{stable}}_{\mathrm{FM},q} = \mathbb{E}_{s \sim \nu_{\mathrm{train}}}\Big[\big|f_{\mathrm{init}} + f^{\star\leftarrow} - f_{\mathrm{term}} - f^\star\big|^q(s)\Big] \tag{3}$$

where $f_{\mathrm{init}} = \frac{dF_{\mathrm{init}}}{d\lambda}$, $f_{\mathrm{term}} = \frac{dF_{\mathrm{term}}}{d\lambda}$, $f^\star = \frac{dF^\star}{d\lambda}$ and $f^{\star\leftarrow} = \frac{d(F^\star\pi^\star)}{d\lambda}$. In general, both $f_{\mathrm{init}}$ and $f^\star$ are determined by the chosen parametrization, with $f_{\mathrm{term}}$ being problem-dependent, while $f^{\star\leftarrow}$ requires the computation of an image-measure density.
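To make the sampler of Section 2 concrete, the following minimal Python sketch emulates the Markov chain of equation (1) with the stopping rule given by the sampling time τ. It assumes that $F_{\mathrm{term}}$ and $F^\star$ admit densities with respect to λ, so the stopping probability at state s is $f_{\mathrm{term}}(s)/(f_{\mathrm{term}}(s)+f^\star(s))$; the callables `sample_init`, `f_term`, `f_star` and `step` are hypothetical placeholders for the user's models, not part of the paper.

```python
import numpy as np

def sample_generative_flow(sample_init, f_term, f_star, step, rng, max_steps=1000):
    """Emulate the sampling Markov chain of a generative flow (Section 2).

    A minimal sketch, assuming Fterm and F* have densities `f_term` and `f_star`
    w.r.t. the background measure, so that P(tau = t | tau >= t) equals
    f_term(s) / (f_term(s) + f_star(s)).  `sample_init` draws s_1 ~ Finit and
    `step` draws s_{t+1} ~ pi*(s_t); both are placeholders.
    """
    s = sample_init(rng)
    for _ in range(max_steps):
        p_stop = f_term(s) / (f_term(s) + f_star(s) + 1e-12)
        if rng.random() < p_stop:   # tau reached: emit the terminal sample
            return s
        s = step(s, rng)            # otherwise follow the star forward policy
    return s                        # safeguard: truncate very long chains

# Example wiring (all callables hypothetical):
# rng = np.random.default_rng(0)
# x = sample_generative_flow(sample_init, f_term, f_star, step, rng)
```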
However, the methods outlined above face several limitations:

Limitation ①: The first limitation arises in IL. $f_{\mathrm{term}}$ is unknown, so one has to build a density model for $F_{\mathrm{term}}$. Disregarding solutions based on separate energy models, we can attempt to set $F_{\mathrm{term}} = F_{\mathrm{init}} + F^{\star\leftarrow} - F^\star$, assuming $F^{\star\leftarrow}$ is tractable, and then use a cross-entropy loss to train $F_{\mathrm{term}}$ to approximate κ. Despite the FM loss formally and trivially becoming 0, $F_{\mathrm{term}}$ may not be positive. Therefore, the existing theoretical framework fails and so does the sampling. At this stage, we move beyond the existing theoretical framework.

Limitation ②: The second limitation arises in RL. At inference time, the reward may be unknown, either because of its costly nature or because it is simply inaccessible. A straightforward solution consists of using $F_{\mathrm{term}} = \kappa$ during training and $F_{\mathrm{term}} = F_{\mathrm{init}} + F^{\star\leftarrow} - F^\star$ during inference. However, this reward-less sampling is ill-controlled because $F_{\mathrm{term}}$ may not be positive.

Limitation ③: Both training and inference rely on the tractability of $f^{\star\leftarrow}$. For instance, naive forward policies, such as those used in diffusion models, add noise in $\mathbb{R}^d$. These policies are typically defined as $\pi^\star(x) = m(x) + \epsilon$, where m is a deterministic model and $\epsilon \sim \mathcal{N}(0, \eta)$ represents Gaussian noise. This leads to an intractable star inflow, as the integral in equation (4) becomes computationally intensive:

$$f^{\star\leftarrow}(y) \propto \int_{x \in \mathbb{R}^d} f^\star(x)\, \exp\left(-\frac{\|m(x) - y\|^2}{2\eta}\right) d\lambda(x). \tag{4}$$

Limitation ④: Although Li et al. (2023) argues that exact cycles are negligible in the continuous setting due to their zero probability, Brunswic et al. (2024) argues that their generalization, called 0-flows, must still be addressed. 0-flows are measures ξ for which $\pi^\star$ is ergodic (Walters, 2000). Brunswic et al. (2024) predicts that the divergence-based losses used by Bengio et al. (2023) are unstable if such a ξ exists. This has yet to be tested.

3. Ergodic Generative Flows for RL and IL

To address the four issues outlined above, we propose Ergodic Generative Flows (EGFs).

Definition 3.1 (Ergodic Generative Flows). Let S be a Riemannian manifold endowed with its volume measure λ and let $(\Phi_i)_{i=1}^p$ be a family of diffeomorphisms $S \to S$. An EGF is a generative flow where

$$\pi^\star(s) = \sum_{i=1}^{p} \alpha^\star_i(s)\,\delta_{\Phi_i(s)} \tag{5}$$

for some policy $\alpha^\star : S \to [0, 1]^p$, and such that the group of diffeomorphisms generated by the $\Phi_i$ is topologically ergodic: for all $x, y \in S$ and any neighborhood U of y, there exists a sequence $i_1, \ldots, i_t$ such that $x\Phi_{i_1}\Phi_{i_2}\cdots\Phi_{i_t} \in U$.

There are two key ideas that motivate this definition.

Tractability of the star inflow $f^{\star\leftarrow}$ is achieved by using only finitely many diffeomorphisms. Say that $S = \mathbb{R}^d$ is endowed with the Lebesgue measure λ and that each move $\Phi_i$ is a parametrizable diffeomorphism. Then a closed-form formula for the backward policy may be deduced from a detailed-balance argument (see Appendix A for details):

$$\pi^{\star\leftarrow}(s) = \sum_{i=1}^{p} \alpha^{\star\leftarrow}_i(s)\,\delta_{\Phi_i^{-1}(s)}, \tag{6}$$

$$\alpha^{\star\leftarrow}_i(s) = \frac{(\alpha^\star_i f^\star) \circ \Phi_i^{-1}(s)\,\big|\det J_s\Phi_i^{-1}\big|}{\sum_j (\alpha^\star_j f^\star) \circ \Phi_j^{-1}(s)\,\big|\det J_s\Phi_j^{-1}\big|}, \tag{7}$$

where $J_s$ denotes the Jacobian matrix at s. Therefore, we have a closed-form formula for the density of the star inflow:

$$f^{\star\leftarrow}(s) = \sum_{i=1}^{p} (\alpha^\star_i f^\star) \circ \Phi_i^{-1}(s)\,\big|\det J_s\Phi_i^{-1}\big|. \tag{8}$$

All terms in equation (8) are tractable, so $f^{\star\leftarrow}$ is tractable as long as the number of diffeomorphisms p is small. As a consequence, the flow-matching loss given in equation (3) is tractable for p small enough.

Expressivity can potentially be an issue if the number of diffeomorphisms is small and each move is too simple (Rezende et al., 2020). Ergodicity guarantees that the parameterized family of EGFs is able to be flow-matching for any non-zero $F_{\mathrm{init}} \ll \lambda$ and $F_{\mathrm{term}} \ll \lambda$, even with simple transformations such as affine maps on tori and rotations on spheres.
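The closed-form star inflow of equation (8) and the stable FM loss of equation (3) can be estimated directly, which is the key computational point of EGFs. The sketch below is a hedged illustration in Python: `inv_maps[i]` and `inv_abs_det[i]` stand for $\Phi_i^{-1}$ and $|\det J_s\Phi_i^{-1}|$, `alpha(u)` returns the p policy weights at u, and all callables are placeholders rather than the authors' implementation.

```python
import numpy as np

def star_inflow_density(s, f_star, alpha, inv_maps, inv_abs_det):
    """Closed-form star inflow density of equation (8).

    A sketch assuming each move Phi_i is available through its inverse
    `inv_maps[i]` and the absolute Jacobian determinant of that inverse,
    `inv_abs_det[i]`; `alpha(u)` returns the vector of policy weights at u.
    """
    total = 0.0
    for i, inv in enumerate(inv_maps):
        u = inv(s)                                        # u = Phi_i^{-1}(s)
        total += alpha(u)[i] * f_star(u) * inv_abs_det[i](s)
    return total

def stable_fm_loss(states, f_init, f_term, f_star, alpha, inv_maps, inv_abs_det, q=2):
    """Monte-Carlo estimate of the stable FM loss (3) over states drawn from nu_train."""
    defects = [abs(f_init(s)
                   + star_inflow_density(s, f_star, alpha, inv_maps, inv_abs_det)
                   - f_term(s) - f_star(s)) ** q
               for s in states]
    return float(np.mean(defects))
```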
The following subsections are organized as follows: Section 3.1 establishes the universality of EGFs; Section 3.2 presents theoretical advancements for general generative flows, which form the basis for the design of the KL-weak FM loss used for IL training, described in Section 3.3.

3.1. Universality From Ergodicity

We begin with a formalization of universality, in conjunction with the master universality theorem.

Definition 3.2. On a state space (S, λ), a family of generative flows is universal if for any two distributions $\mu, \nu \ll \lambda$ with bounded density there is a flow $(f^\star, \pi^\star)$ in the family that is flow-matching for $F_{\mathrm{init}} = \nu$ and $F_{\mathrm{term}} = \mu$.

We begin with the presentation of our master universality theorem in Section 3.1.1. This theorem is then applied to tori and spheres, providing simple examples of universal families, as detailed in Sections 3.1.2 and 3.1.3.

3.1.1. MASTER UNIVERSALITY THEOREM

The next theorem states that if the family contains a sufficiently strongly ergodic policy $\pi^\star$, then the family is universal. More precisely,

Definition 3.3. A Markov kernel π on (S, λ) is summably $L^2$-mixing if π admits an invariant measure $\hat\lambda$ equivalent to λ and such that the $L^2$-mixing coefficients given by

$$\gamma_n := \sup_{\varphi \in H} \int_{s \in S} \big(\varphi\pi^n(s)\big)^2\, d\hat\lambda(s), \tag{9}$$

with $H = \{\varphi \in L^2(\lambda) \mid \|\varphi\|_2 = 1 \text{ and } \int_{s \in S} \varphi(s)\, d\lambda(s) = 0\}$, are summable: $\sum_{n \geq 0} \gamma_n < +\infty$.

The technical summably $L^2$-mixing property informally means that iterating the Markov kernel π averages out functions fast enough to ensure that the sum of errors is finite. In contrast, ergodicity alone ensures the convergence of $\varphi\pi^n$ only in the much weaker Cesàro sense. This property is in particular satisfied if the policy π induces a Koopman operator $\varphi \mapsto \frac{d(\varphi\lambda)\pi}{d\lambda}$ on $L^2(S)$ which has a so-called spectral gap (Conze & Guivarc'h, 2013): informally, the spectrum of the Koopman operator is bounded away from 1 on the subspace $\{\varphi \mid \int \varphi\, d\lambda = 0\}$. In this case, the convergence of the iterates by π is exponential.

Theorem 3.4. A parameterized family of EGFs is universal provided that it contains some summably $L^2$-mixing $\pi^\star$ and that the set of $f^\star$ is dense in $L^2(S, \lambda)$.

3.1.2. ERGODICITY ON TORI

For the sake of simplicity, let $\mathbb{T}^d$ be a flat torus of side 1 endowed with the Lebesgue measure λ and let $P : \mathbb{R}^d \to \mathbb{T}^d$ be the natural universal covering projection. The simplest family of transformations is the group of affine transformations,

$$\mathrm{Aff}(\mathbb{T}^d) := \{x \mapsto Ax + b \mid b \in \mathbb{R}^d \text{ and } A \in \mathrm{SL}_d(\mathbb{Z})\}, \tag{10}$$

with $\mathrm{SL}_d(\mathbb{Z})$ denoting the set of square matrices of order d with integer coefficients and determinant 1. Given that the $(\Phi_i)_{i=1}^p$ are all in $\mathrm{Aff}(\mathbb{T}^d)$, we can build a continuous family of generative flows using equation (5), which is then parameterized by the translation part of each $\Phi_i$ together with a model having a softmax head $\alpha^\star$ and a scalar head $f^\star$. We call such a family an affine toroidal family.

Theorem 3.5. Any affine toroidal family is a universal EGF family provided mild technical assumptions on the $(\Phi_i)_{i=1}^p$.

The technical assumptions above are in particular satisfied if the group generated by the linear parts of the $\Phi_i$ is the whole group $\mathrm{SL}_d(\mathbb{Z})$. It is shown in Conder et al. (2025) that $\mathrm{SL}_d(\mathbb{Z})$ has a presentation with two generators. Therefore, in any dimension, EGFs are universal with p = 4 and well-chosen $(\Phi_i)_{i=1}^4$: the two generators and their inverses. A minimal sketch of such a family in dimension 2 is given below.
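The sketch below instantiates a p = 4 affine toroidal family on $\mathbb{T}^2$ in Python, using the standard generating pair of SL(2, Z); the translation parts are hypothetical placeholder values that would be trainable parameters in an actual EGF.

```python
import numpy as np

# Standard generating pair of SL(2, Z) (integer matrices of determinant 1).
S_gen = np.array([[0., -1.], [1., 0.]])   # order-4 generator
T_gen = np.array([[1., 1.], [0., 1.]])    # shear generator

def affine_torus_map(A, b):
    """x -> (A x + b) mod 1 on the flat torus; |det A| = 1, so the move preserves
    the Lebesgue measure and the Jacobian factor in equation (8) is trivial."""
    A, b = np.asarray(A, float), np.asarray(b, float)
    return lambda x: (A @ np.asarray(x, float) + b) % 1.0

def inverse_affine_torus_map(A, b):
    """The inverse move y -> A^{-1}(y - b) mod 1, well defined since A is in SL(2, Z)."""
    A_inv = np.linalg.inv(np.asarray(A, float))
    b = np.asarray(b, float)
    return lambda y: (A_inv @ (np.asarray(y, float) - b)) % 1.0

# Hypothetical translation parts; in practice these are trainable.
b1, b2 = np.array([0.3, 0.1]), np.array([0.2, 0.7])
moves = [affine_torus_map(S_gen, b1), inverse_affine_torus_map(S_gen, b1),
         affine_torus_map(T_gen, b2), inverse_affine_torus_map(T_gen, b2)]
```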
3.1.3. ERGODICITY ON SPHERES

Consider the round sphere $S^d = \{x \in \mathbb{R}^{d+1} \mid \|x\| = 1\}$ endowed with its natural Riemannian volume measure λ. Again, a simple family of diffeomorphisms is provided by the so-called projective transformations,

$$\mathrm{PGL}_{d+1}(\mathbb{R}) := \{x \mapsto Ax/\|Ax\| \mid A \in \mathrm{GL}_{d+1}(\mathbb{R})\}, \tag{11}$$

where $\mathrm{GL}_{d+1}(\mathbb{R})$ is the group of invertible matrices of order d + 1. Similarly to tori, we build a continuous family of EGFs using equation (5) by parametrizing $(\Phi_i)_{i=1}^p$ in $\mathrm{PGL}_{d+1}(\mathbb{R})$, which is a Lie group, and choosing a tractable model that has a softmax head $\alpha^\star$ and a scalar head $f^\star$. Such a family is called a projective spherical family. We focus on the subfamily of isometry spherical families composed of rotations $\Phi_i \in \mathrm{SO}_{d+1}(\mathbb{R})$, for which we have theoretical guarantees.

Theorem 3.6. An isometry spherical family is a universal EGF family, provided technical assumptions on the $(\Phi_i)_{i=1}^p$.

The technical assumption is satisfied if the $\Phi_i$ have algebraic coefficients and are frozen: this is a consequence of the main theorem of Bourgain & Gamburd (2012) (see also Benoist & de Saxcé (2016)), which guarantees a spectral gap under this assumption. Furthermore, the main theorem of Breuillard & Gelander (2003) allows one to build a dense subgroup of $\mathrm{SO}_d(\mathbb{R})$ with two generators and algebraic coefficients. Therefore, EGFs are universal on spheres with p = 4 and well-chosen $(\Phi_i)_{i=1}^4$: the two generators and their inverses.

3.2. Quantitative Sampling Theorem

We present a quantitative sampling theorem for generative flows. The key idea of the proof is that any generative flow is flow-matching up to a transformation of the initial and terminal distributions.

Definition 3.7 (Virtual initial and terminal flow). Let $(\pi^\star, f^\star)$ be a generative flow and let $F_{\mathrm{init}}, F_{\mathrm{term}}$ be initial and terminal distributions. Define the initial and terminal errors as $\delta F_{\mathrm{init}} := (F_{\mathrm{init}} + F^{\star\leftarrow} - F_{\mathrm{term}} - F^\star)_-$ and $\delta F_{\mathrm{term}} := (F_{\mathrm{init}} + F^{\star\leftarrow} - F_{\mathrm{term}} - F^\star)_+$ respectively, where $\mu_+$ and $\mu_-$ denote the positive and negative parts of a measure µ. The virtual initial and terminal distributions are $\hat F_{\mathrm{init}} := F_{\mathrm{init}} + \delta F_{\mathrm{init}}$ and $\hat F_{\mathrm{term}} := F_{\mathrm{term}} + \delta F_{\mathrm{term}}$.

We note that once the initial and terminal distributions $F_{\mathrm{init}}, F_{\mathrm{term}}$ are specified, any generative flow $(\pi^\star, f^\star)$ is flow-matching with respect to the virtual initial and terminal distributions $\hat F_{\mathrm{init}}$ and $\hat F_{\mathrm{term}}$. This enables a quantitative formulation of the sampling theorem for generative flows.

Theorem 3.8 (Quantitative Sampling of Generative Flows). Let $(\pi^\star, f^\star)$ be a generative flow and let $F_{\mathrm{init}}$ and $F_{\mathrm{term}}$ be initial and terminal distributions. Assume that $F_{\mathrm{init}} \neq 0$ and consider the sample $s_\tau$ of the generative flow Markov chain from $F_{\mathrm{init}}$ to $\hat F_{\mathrm{term}}$; then

$$\mathrm{TV}\left(s_\tau,\ \frac{\hat F_{\mathrm{term}}}{\hat F_{\mathrm{term}}(S)}\right) \leq \frac{\delta F_{\mathrm{init}}(S)}{\hat F_{\mathrm{term}}(S)}, \tag{12}$$

with TV the total variation.

Corollary 3.9. For a generative flow trained on the target κ with $F_{\mathrm{init}}(S) = 1$ and using $F_{\mathrm{term}} = \hat\kappa := (F_{\mathrm{init}} + F^{\star\leftarrow} - F^\star)_+$ during inference, the sampling error is controlled by

$$\mathrm{TV}(s_\tau, \kappa) \leq \frac{\delta}{1 + \delta} + \mathrm{TV}\left(\frac{\hat\kappa}{\hat\kappa(S)},\ \frac{\kappa}{\kappa(S)}\right), \quad \text{with } \delta := (F_{\mathrm{init}} + F^{\star\leftarrow} - F^\star)_-(S). \tag{13}$$

Commonly, the training loss is designed to control the total variation term in the upper bound. However, a secondary regularization term controlling $\delta = \delta F_{\mathrm{init}}(S)$ may be added to enhance reward-less sampling. To substantiate this quantitative bound, let us consider the RL case with a reward r defined on a finite DAG (V, E) with $V = S \cup \{s_0, s_f\}$, trained using a flow-matching loss, say $\mathcal{L}^{\mathrm{stable}}_{\mathrm{FM},q}$. Take $\nu_{\mathrm{train}}$ the uniform distribution on the vertices and q = 1; then $\mathcal{L}^{\mathrm{stable}}_{\mathrm{FM},1} = \frac{1}{|V|}\big(\delta + \sum_{s \in S} |\hat r(s) - r(s)|\big)$, with $\hat r(s)$ the density of $\hat\kappa$ with respect to the counting measure. We may also rewrite

$$\mathrm{TV}\left(\frac{\hat\kappa}{\hat\kappa(S)},\ \frac{\kappa}{\kappa(S)}\right) = \frac{1}{2}\sum_{s \in S}\left|\frac{\hat r(s)}{\sum_{s' \in S}\hat r(s')} - \frac{r(s)}{\sum_{s' \in S} r(s')}\right|, \quad \text{so that if } Z = \sum_{s \in S} r(s), \text{ then } \mathrm{TV}\left(\frac{\hat\kappa}{\hat\kappa(S)},\ \frac{\kappa}{\kappa(S)}\right) \leq \frac{|V|}{Z}\,\mathcal{L}^{\mathrm{stable}}_{\mathrm{FM},1}. \tag{14}$$

3.3. KL-Weak FM Loss and IL Algorithm

Henceforth, we assume that S is a Polish space equipped with a background measure λ and that κ is a target probability distribution. Leveraging Theorem 3.8, we design the KL-weak FM loss, $\mathcal{L}_{\mathrm{KL\text{-}wFM}}$, which enables IL training of a generative flow without requiring a separate reward model. Let $\delta f_{\mathrm{init}} = \min(0, f_{\mathrm{init}} + f^{\star\leftarrow} - f^\star)$ and $\hat f_{\mathrm{term}} = \max(0, f_{\mathrm{init}} + f^{\star\leftarrow} - f^\star)$. The KL-weak FM loss can then be defined as follows:

$$\mathcal{L}_{\mathrm{KL\text{-}wFM}}(\theta) := b\,\mathbb{E}_{s \sim \nu_{\mathrm{train}}}\Big[\sum_{t=1}^{\tau}\big(\delta f_{\mathrm{init}}(s_t)\big)^2\Big] - \mathbb{E}_{s \sim \kappa}\big[\log \hat f_{\mathrm{term}}(s)\big], \tag{15}$$

for some training distribution of paths $\nu_{\mathrm{train}}$ and b > 0. The name of the loss is derived from two components: on the one hand, the cross-entropy term, which controls the Kullback-Leibler divergence between $\hat F_{\mathrm{term}}$ and κ; on the other hand, the term involving $\delta f_{\mathrm{init}}$, which resembles the FM loss but controls only the negative part of the FM defect.

We now provide a detailed description and motivation for $\mathcal{L}_{\mathrm{KL\text{-}wFM}}$. The discussion begins with a corollary of Theorem 3.8, based on Pinsker's inequality.

Corollary 3.10. Let $(\pi^\star, f^\star)$ be a generative flow. We train the generative flow on a target probability distribution κ with $F_{\mathrm{init}}(S) = 1$. By using $F_{\mathrm{term}} = \hat\kappa := (F_{\mathrm{init}} + F^{\star\leftarrow} - F^\star)_+$ during inference, the sampling error is bounded as follows:

$$\mathrm{TV}(s_\tau, \kappa) \leq \frac{\delta}{1 + \delta} + \sqrt{\tfrac{1}{2}\,\mathrm{KL}\left(\kappa\,\middle\|\,\frac{\hat\kappa}{\hat\kappa(S)}\right)}, \quad \text{with } \delta := -\int_{s \in S}\delta f_{\mathrm{init}}(s)\, d\lambda(s). \tag{16}$$

First, the weak-FM term $\mathbb{E}_{s \sim \nu_{\mathrm{train}}}\big[\sum_{t=1}^{\tau}(\delta f_{\mathrm{init}}(s_t))^2\big]$ in equation (15) controls the term $\frac{\delta}{1 + \delta}$ in equation (16). Second, since the target κ has unknown density, we employ reward-less inference, using $F_{\mathrm{term}} = (F_{\mathrm{init}} + F^{\star\leftarrow} - F^\star)_+$. It is therefore natural to directly train the density of $(F_{\mathrm{init}} + F^{\star\leftarrow} - F^\star)_+$ to match κ using cross-entropy, leading to the second term on the right-hand side of equation (15). Corollary 3.10 demonstrates that controlling the Kullback-Leibler divergence through cross-entropy also controls the total-variation sampling error. A minimal sketch of how the loss can be estimated is given below; the IL training procedure is summarized in Algorithm 1.
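The following Python sketch shows how the two terms of the KL-weak FM loss can be estimated by Monte Carlo. It follows the reconstruction of equation (15) used above and is not the authors' code; the density callables (including the star inflow of equation (8)) are hypothetical placeholders.

```python
import numpy as np

def kl_weak_fm_loss(path_batch, target_batch, f_init, f_star, f_star_in, b=1.0):
    """Monte-Carlo sketch of the KL-weak FM loss of equation (15).

    `path_batch` is a list of trajectories (lists of states) drawn from nu_train,
    `target_batch` a list of samples from the target kappa; f_init, f_star and the
    star inflow f_star_in (eq. (8)) are placeholder density callables.
    """
    weak_fm = 0.0
    for path in path_batch:
        for s in path:
            defect = f_init(s) + f_star_in(s) - f_star(s)
            weak_fm += min(0.0, defect) ** 2          # (delta f_init)^2: negative FM defect only
    weak_fm /= max(len(path_batch), 1)

    cross_entropy = 0.0
    for s in target_batch:
        f_hat_term = max(0.0, f_init(s) + f_star_in(s) - f_star(s))
        cross_entropy -= np.log(f_hat_term + 1e-12)   # -log of the reward-less terminal density
    cross_entropy /= max(len(target_batch), 1)

    return b * weak_fm + cross_entropy
```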
Algorithm 1 Ergodic flow: IL training
Input: a list of trainable bi-Lipschitz maps $\Phi_1, \ldots, \Phi_p$
Input: a softmax model $\alpha^\star : S \to \mathcal{P}(\{1, \ldots, p\})$
Input: a trainable star outflow model $f^\star : S \to \mathbb{R}_+$
Input: a target samplable distribution κ
Input: a source $F_{\mathrm{init}}$, samplable and of density $f_{\mathrm{init}}$
repeat
  Fill a replay buffer $B = B_{\mathrm{init}} \cup B_\kappa$ with B trajectories $(s_t)_{t=1}^{t_{\max}}$, using $(F_{\mathrm{init}}, \pi^\star)$ for $B_{\mathrm{init}}$ and $(\kappa, \pi^{\star\leftarrow})$ for $B_\kappa$, with $t \in \{1, \ldots, \tau\}$
  Minimization step of $b\,\mathbb{E}_{s \sim B}\big[\sum_{t=1}^{\tau} (\delta f_{\mathrm{init}}(s_t))^2\, p_t(s)\big] - \mathbb{E}_{s \sim \kappa}\big[\log \hat f_{\mathrm{term}}(s)\big]$, with weights $p_t(s) = \big(1 + (\hat f_{\mathrm{term}}/f^\star)(s_t)\big)^{-1}$ if $s \in B_\kappa$ and $p_t(s) = 1$ otherwise
until convergence

During inference we sample with the filtered terminal density $\mathbf{1}_{\hat f_{\mathrm{term}} \geq \eta}\,\hat f_{\mathrm{term}}$, where η is chosen to minimize the negative log-likelihood on a validation dataset. See Appendix D.1 for details.

4.1. RL Experiments

Two-dimensional distributions are tested in RL settings, which allows us to conduct sanity checks on the tractability of the EGF, its stability and its expressivity. An EGF on $S = \mathbb{T}^2$ is built with 16 transformations (8 translations and 2 elements of $\mathrm{SL}_d(\mathbb{Z})$, together with their inverses). The MLPs have 5 hidden layers of width 32 to parameterize $f^\star$ and $\pi^\star$. This EGF is trained with either the stable $\mathcal{L}^{\mathrm{stable}}_{\mathrm{FM}}$ or the unstable $\mathcal{L}^{\mathrm{div}}_{\mathrm{FM}}$ FM loss, with or without the regularization R (see equations (3), (19) and (20)):

$$\mathcal{L}^{\mathrm{div}}_{\mathrm{FM}} = \mathbb{E}_{s \sim \nu_{\mathrm{train}}}\Big[\sum_{t=1}^{\tau}\Big(\log \frac{f^{\star\leftarrow} + f_{\mathrm{init}}}{f^\star + f_{\mathrm{term}}}\Big)^2(s_t)\Big], \tag{19}$$

$$R = \mathbb{E}_{s \sim \nu_{\mathrm{train}}}\Big[\sum_{t=1}^{\tau}(f^\star)^2(s_t)\Big]. \tag{20}$$

The regularization R is motivated by the stability theory of Brunswic et al. (2024): as long as the directional derivative of R is positive along 0-flow directions, such a regularization helps reduce the instability of a loss. See also Morozov et al. (2025) for a study of the impact of such regularizations on detailed-balance losses on graphs.
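For concreteness, the sketch below evaluates the divergence-based loss of equation (19) and the regularization of equation (20) on a single sampled trajectory; it follows the reconstructions given above, and all density callables are hypothetical placeholders.

```python
import numpy as np

def div_fm_loss_and_reg(trajectory, f_init, f_term, f_star, f_star_in):
    """Sketch of the divergence-based FM loss (19) and the regularization R (20),
    evaluated on one trajectory (a list of states). `f_star_in` stands for the
    star inflow density of equation (8); all callables are placeholders."""
    loss, reg = 0.0, 0.0
    for s in trajectory:
        ratio = (f_star_in(s) + f_init(s)) / (f_star(s) + f_term(s) + 1e-12)
        loss += np.log(ratio + 1e-12) ** 2   # divergence-based FM defect, eq. (19)
        reg += f_star(s) ** 2                # 0-flow regularization, eq. (20)
    return loss, reg
```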
A stability comparison with the divergence-based FM loss (see Figure 1) shows the expressivity of very small EGFs as well as the instability of $\mathcal{L}^{\mathrm{div}}_{\mathrm{FM}}$ predicted by Brunswic et al. (2024) (refer to Limitation ④ in Section 2).

Figure 1. Checkerboard RL task. (a) EGF 32x5: density (left) and samples (right). (b) EGF 32x5: stability comparison between FM losses. The unregularized unstable loss (blue) blows up in flow size (dashed lines) and sampling time τ (solid lines), while the stable loss (red) converges. Regularization helps stabilization for both losses.

4.2. IL Experiments

We proceed with imitation learning experiments on the torus $\mathbb{T}^2$ and the sphere $S^2$. On $\mathbb{T}^2$, we compare small Moser Flows, DDPM and EGFs, demonstrating that EGFs can generate well-behaved distributions even when Moser Flows break down. An EGF is built with the minimal four affine transformations described in Section 3.1.2; the MLPs parameterizing $f^\star$ and $\pi^\star$ have 3 hidden layers of width 32. The Moser Flow is trained using the implementation provided in the authors' GitHub repository (Rozen, 2022), with the only modification being the model size, set to 32x3. Our EGF is trained using the KL-weak FM loss together with a regularization R as in the RL setting. Figure 2 shows how Moser Flow fails to train with such a small model, while the minimal EGF reproduces the target distribution with high fidelity.

Figure 2. Comparison of imitation learning on standard toy distributions using tiny 32x3 MLP models: (a) Moser Flow. A background filter is applied to EGF; a similar filter on Moser Flow would yield worse results. Without a filter, EGF samples have outliers similar to DDPM. Since DDPM does not give access to a density, only samples are provided. For fairness, DDPM is sampled using 100 denoising steps.

On $S^2$, we benchmark EGFs on the earth-science volcano dataset (NGDC/WDS, 2025). A sample distribution for the dataset is given in Figure 3 and negative log-likelihoods are given in Table 1. We use only six rotations: a rotation of angle π/4 around each of the three axes, plus their respective inverses. The two core MLPs of the EGF are of size 256x5, compared to the 512x6 used by Rozen et al. (2021). The learning rate is 1e-3 with an exponential decay to 1e-5 at 3000 epochs of 25 steps. We outperform Moser Flow and all related baselines. Notably, the EGF achieved its reported performance with a training time 10 times shorter.

Table 1. Negative log-likelihood scores on the volcano, earthquake and flood datasets.

| Method        | Volcano | Earthquake | Flood |
|---------------|---------|------------|-------|
| Mixture vMF   | -0.31   | 0.59       | 1.09  |
| Stereographic | -0.64   | 0.43       | 0.99  |
| Riemannian    | -0.97   | 0.19       | 0.90  |
| Moser Flow    | -2.02   | -0.09      | 0.62  |
| EGFN          | -2.31   | -0.12      | 0.56  |

Figure 3. EGF-generated points with the reward field $\hat f_{\mathrm{term}}$ (left) and the whole dataset with a KDE field (right) for the volcano task.
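The six-rotation setup used for the sphere experiments can be written down explicitly. The Python sketch below builds π/4 rotations about the three coordinate axes together with their inverses; the axis conventions are a hypothetical reading of the setup, and since rotations are isometries of $S^2$ they preserve λ, so the Jacobian factor in equation (8) equals 1.

```python
import numpy as np

def rotation_about_axis(axis, angle):
    """Rotation of S^2 about a coordinate axis (axis in {0, 1, 2}); one member of an
    isometry spherical family (Section 3.1.3)."""
    c, s = np.cos(angle), np.sin(angle)
    R = np.eye(3)
    i, j = [k for k in range(3) if k != axis]
    R[i, i], R[i, j] = c, -s
    R[j, i], R[j, j] = s, c
    return R

# Six volume-preserving moves: pi/4 rotations about each axis and their inverses.
rotations = [rotation_about_axis(a, sgn * np.pi / 4) for a in range(3) for sgn in (1, -1)]

# Applying a move to a point x on the sphere is just matrix multiplication:
# y = rotations[0] @ x   (the norm of x is preserved).
```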
5. Related Work

EGFs draw inspiration from NFs for their transformation blocks, as well as from denoising diffusion models (DDMs) (Ho et al., 2020) for incorporating stochasticity. While similar approaches have been explored in previous works (Wu et al., 2020; Zhang & Chen, 2021), our method stands out by leveraging general generative-flow theory to construct a more efficient framework. Compared to previous works, EGFs are more compact and demonstrate faster convergence, as illustrated in our experiments. Furthermore, DDMs and their variants (Song et al., 2020a) rely on a fixed discretization of the inference denoising trajectories, whereas EGFs enable trainable sampling times, offering greater flexibility and adaptability.

A particularly notable comparison arises when examining the relationship between our EGFs and the NFs introduced in Rezende et al. (2020), as both utilize affine toroidal families and projective spherical families. However, while our EGF framework provides universality guarantees, the NF setting lacks such guarantees, as it relies solely on affine and projective transforms. Indeed, both projective and affine transformations form groups (compositions of affine transforms are affine), thus limiting the expressivity. This distinction highlights a fundamental advantage of EGFs, further reinforcing their theoretical robustness and practical effectiveness.

Broadly speaking, previous methods such as Riemannian Continuous Normalizing Flows (Mathieu & Nickel, 2020) and Flow Matching on Manifolds (Chen & Lipman, 2024) rely on either handcrafted non-linear transformations or sophisticated neural ODEs (Chen et al., 2018). Our EGF does not incorporate neural ODEs, as we do not face expressivity limitations that justify their usage. However, these techniques remain fully compatible with EGFs and could be integrated into future extensions of our framework to further enhance its flexibility and adaptability if needed.

Since the elementary transformations of EGFs are derived from the transformation blocks of NFs, EGFs can be viewed as a stochastic sampler for NF architectures. In particular, any EGF-sampled trajectory $(s_t)_{t=1}^{\tau}$ is obtained by composing NF transformations as follows: $s_1 \sim F_{\mathrm{init}}$, $s_2 = \Phi_{k_1}(s_1)$, ..., $s_\tau = \Phi_{k_{\tau-1}} \circ \cdots \circ \Phi_{k_1}(s_1)$, with transition probability $P(k_{t+1} = i \mid k_t) = \alpha^\star_i(s_t)$. Each composition $\Phi_{k_{\tau-1}} \circ \cdots \circ \Phi_{k_1}$ is a diffeomorphism, and the terminal distribution is given by $F_{\mathrm{term}} = \mathbb{E}_K(F_{\mathrm{init}}K)$, where $K = \Phi_{k_{\tau-1}} \circ \cdots \circ \Phi_{k_1}$ represents a random NF. In other words, an EGF with p diffeomorphisms can be represented as a pair (ι, π), where $\iota : T_p \to \mathrm{Diffeo}(S)$ is a trainable embedding of the p-ary tree into the space of diffeomorphisms of the state space, and π is a random-walk policy on $T_p$ with a random stopping time. However, the policy π is intractable, as it is defined by $\pi(\Phi_k \mid \Phi_{k_t} \circ \cdots \circ \Phi_{k_1}) = \frac{1}{\bar F(S)}\int_{s \in S}\alpha^\star_k(s)\, d\bar F(s)$, where $\bar F := F^\star + F_{\mathrm{term}}$.

Building on this formulation, we can draw comparisons between EGFs and three related research directions: Continuously Indexed Normalizing Flows (CINF) (Caterini et al., 2021), Neural Architecture Search (NAS) (Elsken et al., 2019), and Wasserstein Gradient Descent (WGD) (Chizat & Bach, 2018). First, CINFs attempt to overcome the limitations of NFs by using a fixed architecture with conditioning sampled from a latent distribution. This approach generates the target distribution as an expectation, $F_{\mathrm{term}} = \mathbb{E}_K(F_{\mathrm{init}}K)$, where the random NF K is drawn from a continuous distribution of NFs; in contrast, our method samples K from a tree structure. Second, NAS aims to find the optimal neural network architecture for a given task. In comparison, our EGF constructs a distribution of suitable NFs, where the sampling time can be interpreted as a learned depth of the resulting architecture. Third, by training over a distribution of architectures and considering the expectation of the output (with $F_{\mathrm{init}}$ as the input and $F_{\mathrm{term}}$ as the output), our approach parallels the setup of Wasserstein Gradient Descent (WGD). Extending the theorems of Chizat & Bach (2018) to EGFs would be a valuable addition to our theoretical framework.
Lastly, from the proof of the universality Theorem 3.4, we observe that $f^\star$ can be obtained as a fixed point of a flow operator dependent on $\pi^\star$, $F_{\mathrm{init}}$, and $F_{\mathrm{term}}$. This operator is a contraction for the affine toroidal family. Although we train $f^\star$ via gradient descent, this approach bears similarities to Deep Equilibrium Models (DEQ) (Bai et al., 2019; 2020), where the fixed point is computed by a contraction neural network. The key difference is that we explicitly approximate the fixed point with a neural network, while DEQ models use a black-box solver and the implicit function theorem for gradients. Future work could develop a DEQ-EGF, where $f^\star$ is implicitly obtained as a fixed point instead of via a feedforward network.

6. Limitations and Future Work

First and foremost, our experiments are limited to low dimensions. Although ergodicity is easily achievable with a number of generators independent of the dimension (for instance, on tori, one randomly initialized non-trainable translation is sufficient to ensure ergodicity), the technical assumption of $L^2$-mixing summability needed for universality is more subtle. EGFs suffer from limited proven applicability; further theory is needed to easily enforce $L^2$-mixing summability in higher dimensions, together with related experiments. More precisely, a theoretical bound on the minimal number of transformations necessary to achieve universality would be useful for hyperparameter tuning. We give such a bound only for affine toroidal and isometry spherical families, where two generators (together with their inverses) are sufficient to achieve universality.

Second, the $L^2$-mixing summability property is essential to ensure universality. It is likely that training would be improved if some control over this property were achieved, say by adding a regularization. There is extensive mathematical literature on the so-called spectral gap (Kontoyiannis & Meyn, 2012; Guivarc'h & Le Page, 2016; Marrakchi, 2018; Bekka & Francini, 2020), which is a sufficient condition for $L^2$-mixing summability, and on absolutely continuous invariant measures (Góra & Boyarsky, 2003; Bahsoun, 2004; Galatolo, 2015), including stability results (Froyland et al., 2014). A systematic review of this literature would certainly yield such a regularization.

Lastly, our focus was on the theoretical advancements of generative flow theory through EGFs. As a result, we did not exploit EGFs' high modularity, which enables sophisticated transformations, replay buffers, or neural architectures. Additionally, hyperparameter tuning was minimal, leaving room for future work to perform a systematic analysis.

7. Conclusion

We propose a new family of generative flows that leverage ergodicity to provide universality guarantees while utilizing simple diffeomorphisms and neural networks. Our results demonstrate that EGFs maintain their expressivity even at parameter counts where Moser Flow (Rozen et al., 2021) fails. This simplicity enables us to outperform our baselines on the NASA volcano dataset with a model 30 times smaller.

Impact Statement

This paper presents work whose goal is to advance the field of Machine Learning. There are many potential societal consequences of our work, none of which we feel must be specifically highlighted here.

References

Bahsoun, W. Absolutely continuous invariant measures for random maps. PhD thesis, Concordia University, 2004. Bai, S., Kolter, J. Z., and Koltun, V.
Deep equilibrium models. In Neur IPS, 2019. Bai, S., Koltun, V., and Kolter, J. Z. Multiscale deep equilibrium models. In Neur IPS, 2020. Bekka, B. and Francini, C. Spectral gap property and strong ergodicity for groups of affine transformations of solenoids. Ergodic Theory and Dynamical Systems, 40 (5):1180 1193, 2020. Bekka, B. and Guivarc h, Y. On the spectral theory of groups of affine transformations of compact nilmanifolds. Annales Scientifiques de l Ecole Normale Sup erieure, 48 (3):607 645, 2015. Bengio, E., Jain, M., Korablyov, M., Precup, D., and Bengio, Y. Flow network based generative models for noniterative diverse candidate generation. In Neur IPS, 2021. Bengio, Y., Lahlou, S., Deleu, T., Hu, E. J., Tiwari, M., and Bengio, E. Gflownet foundations. The Journal of Machine Learning Research, 24(1):10006 10060, 2023. Benoist, Y. and de Saxc e, N. A spectral gap theorem in simple lie groups. Inventiones Mathematicae, 205(2): 337 361, 2016. Bourgain, J. and Gamburd, A. A spectral gap theorem in su.(d). Journal of the European Mathematical Society (EMS Publishing), 14(5), 2012. Breuillard, E. and Gelander, T. On dense free subgroups of lie groups. Journal of Algebra, 261(2):448 467, 2003. Brooks, S., Gelman, A., Jones, G., and Meng, X.-L. Handbook of markov chain monte carlo. CRC press, 2011. Brunswic, L., Li, Y., Xu, Y., Feng, Y., Jui, S., and Ma, L. A theory of non-acyclic generative flow networks. In AAAI, 2024. Caterini, A., Cornish, R., Sejdinovic, D., and Doucet, A. Variational inference with continuously-indexed normalizing flows. In Uncertainty in Artificial Intelligence, 2021. Chen, H., Ren, Y., Ying, L., and Rotskoff, G. M. Accelerating diffusion models with parallel sampling: Inference at sub-linear time complexity. ar Xiv preprint ar Xiv:2405.15986, 2024. Chen, R. T. and Lipman, Y. Flow matching on general geometries. In ICLR, 2024. Chen, R. T., Rubanova, Y., Bettencourt, J., and Duvenaud, D. K. Neural ordinary differential equations. In Neur IPS, 2018. Chizat, L. and Bach, F. On the global convergence of gradient descent for over-parameterized models using optimal transport. In Neur IPS, 2018. Chung, H. W., Hou, L., Longpre, S., Zoph, B., Tay, Y., Fedus, W., Li, Y., Wang, X., Dehghani, M., Brahma, S., et al. Scaling instruction-finetuned language models. Journal of Machine Learning Research, 25(70):1 53, 2024. Conder, M., Liversidge, G., and Vsemirnov, M. Generating pairs for sl (n, z). Journal of Algebra, 662:123 137, 2025. Ergodic Generative Flows Conze, J.-P. and Guivarc H, Y. Ergodicity of group actions and spectral gap, applications to random walks and markov shifts. Discrete and Continuous Dynamical Systems-Series A, 33(9):4239 4269, 2013. Elsken, T., Metzen, J. H., and Hutter, F. Neural architecture search: A survey. Journal of Machine Learning Research, 20(55):1 21, 2019. Froyland, G., Gonz alez-Tokman, C., and Quas, A. Stability and approximation of random invariant densities for lasota yorke map cocycles. Nonlinearity, 27(4):647, 2014. Galatolo, S. Statistical properties of dynamics. introduction to the functional analytic approach. ar Xiv preprint ar Xiv:1510.02615, 2015. Gemici, M. C., Rezende, D., and Mohamed, S. Normalizing flows on riemannian manifolds. ar Xiv preprint ar Xiv:1611.02304, 2016. Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., and Bengio, Y. Generative adversarial networks. Communications of the ACM, 63(11):139 144, 2020. G ora, P. and Boyarsky, A. 
Absolutely continuous invariant measures for random maps with position dependent probabilities. Journal of Mathematical Analysis and Applications, 278(1):225 242, 2003. Grathwohl, W., Chen, R. T., Bettencourt, J., Sutskever, I., and Duvenaud, D. Ffjord: Free-form continuous dynamics for scalable reversible generative models. ar Xiv preprint ar Xiv:1810.01367, 2018. Guivarc h, Y. and Le Page, E. Spectral gap properties for linear random walks and pareto s asymptotics for affine stochastic recursions. Annales de l Institut Henri Poincar e - Probabilit es et Statistiques, 52(2):503 574, 2016. Haarnoja, T., Zhou, A., Hartikainen, K., Tucker, G., Ha, S., Tan, J., Kumar, V., Zhu, H., Gupta, A., Abbeel, P., et al. Soft actor-critic algorithms and applications. ar Xiv preprint ar Xiv:1812.05905, 2018. Halmos, P. R. Measure theory, volume 18, chapter XII. Springer, 2013. Ho, J., Jain, A., and Abbeel, P. Denoising diffusion probabilistic models. In Neur IPS, 2020. Hu, W., Xiao, L., and Pennington, J. Provable benefit of orthogonal initialization in optimizing deep linear networks. ar Xiv preprint ar Xiv:2001.05992, 2020. Kifer, Y. Ergodic theory of random transformations, volume 10. Springer Science & Business Media, 2012. Kingma, D. P. and Ba, J. L. Adam: A method for stochastic gradient descent. In ICLR, 2015. Kontoyiannis, I. and Meyn, S. P. Geometric ergodicity and the spectral gap of non-reversible markov chains. Probability Theory and Related Fields, 154(1):327 339, 2012. Lahlou, S., Deleu, T., Lemos, P., Zhang, D., Volokhova, A., Hern andez-Garc ıa, A., Ezzine, L. N., Bengio, Y., and Malkin, N. A theory of continuous generative flow networks. ar Xiv preprint ar Xiv:2301.12594, 2023. Lee, J. M. Introduction to Riemannian manifolds, volume 2. Springer, 2018. Li, Y., Luo, S., Wang, H., and Hao, J. Cflownets: Continuous control with generative flow networks. ar Xiv preprint ar Xiv:2303.02430, 2023. Loshchilov, I. and Hutter, F. Decoupled weight decay regularization. In ICLR, 2019. Lu, C., Zhou, Y., Bao, F., Chen, J., Li, C., and Zhu, J. Dpmsolver: A fast ode solver for diffusion probabilistic model sampling in around 10 steps. In Neur IPS, 2022. Malkin, N., Jain, M., Bengio, E., Sun, C., and Bengio, Y. Trajectory balance: Improved credit assignment in gflownets. ar Xiv preprint ar Xiv:2201.13259, 2022. Marrakchi, A. Strongly ergodic actions have local spectral gap. Proceedings of the American Mathematical Society, 146(9):3887 3893, 2018. Mathieu, E. and Nickel, M. Riemannian continuous normalizing flows. In Neur IPS, 2020. Morozov, N., Maksimov, I., Tiapkin, D., and Samsonov, S. Revisiting non-acyclic gflownets in discrete environments. ar Xiv preprint ar Xiv:2502.07735, 2025. NGDC/WDS. Global significant volcanic eruptions database. https://www.ncei.noaa.gov/ access/metadata/landing-page/bin/iso? id=gov.noaa.ngdc.mgg.hazards:G10147, 2025. Papamakarios, G., Nalisnick, E., Rezende, D. J., Mohamed, S., and Lakshminarayanan, B. Normalizing flows for probabilistic modeling and inference. Journal of Machine Learning Research, 22(57):1 64, 2021. Rezende, D. J., Papamakarios, G., Racaniere, S., Albergo, M., Kanwar, G., Shanahan, P., and Cranmer, K. Normalizing flows on tori and spheres. In ICML, 2020. Ergodic Generative Flows Rozen, N. Moser flow. https://github.com/noamroze/moserflow, 2022. Rozen, N., Grover, A., Nickel, M., and Lipman, Y. Moser flow: Divergence-based generative modeling on manifolds. In Neur IPS, 2021. Saxe, A. M., Mc Clelland, J. L., and Ganguli, S. 
Exact solutions to the nonlinear dynamics of learning in deep linear neural networks. ar Xiv preprint ar Xiv:1312.6120, 2013. Sendera, M., Kim, M., Mittal, S., Lemos, P., Scimeca, L., Rector-Brooks, J., Adam, A., Bengio, Y., and Malkin, N. Improved off-policy training of diffusion samplers. In Neur IPS, 2024. Song, J., Meng, C., and Ermon, S. Denoising diffusion implicit models. ar Xiv preprint ar Xiv:2010.02502, 2020a. Song, Y., Sohl-Dickstein, J., Kingma, D. P., Kumar, A., Ermon, S., and Poole, B. Score-based generative modeling through stochastic differential equations. ar Xiv preprint ar Xiv:2011.13456, 2020b. Sutton, R. S. Reinforcement learning: An introduction. A Bradford Book, 2018. Tang, S., Wang, Y., Ding, C., Liang, Y., Li, Y., and Xu, D. Adadiff: Accelerating diffusion models through step-wise adaptive computation. In ECCV, 2025. Verine, A., Negrevergne, B., Chevaleyre, Y., and Rossi, F. On the expressivity of bi-lipschitz normalizing flows. In Asian Conference on Machine Learning, 2023. Walters, P. An introduction to ergodic theory, volume 79. Springer Science & Business Media, 2000. Wu, H., K ohler, J., and No e, F. Stochastic normalizing flows. In Neur IPS, 2020. Zhang, D., Chen, R. T., Malkin, N., and Bengio, Y. Unifying generative models with gflownets. ar Xiv preprint ar Xiv:2209.02606, 2022a. Zhang, D., Malkin, N., Liu, Z., Volokhova, A., Courville, A., and Bengio, Y. Generative flow networks for discrete probabilistic modeling. In ICML, 2022b. Zhang, Q. and Chen, Y. Diffusion normalizing flow. In Neur IPS, 2021. Ergodic Generative Flows A. Fundamentals of EGFs As stated in section 3, on a Differential manifold S endowed with a background measure λ absolutely continuous with respect to Lebesgue measure, we define an EGF as a Generative Flow (π , F ) with F = f λ and π (s ) = Pp i=1 αi (s)δΦi(s) where Φi are diffeomorphisms of S. With these definitions, the backward policy automatically has a closed form formula. Theorem A.1. Let (π , F ) as above, the induced Generative Flow on S has a backward star policy given by i=1 αi (s)δΦ 1 i (s) (21) αi (s) = (αi f ) Φ 1 i (s)|JsΦ 1 i | f (s) (22) j (αj f ) Φ 1 j (s)|JsΦ 1 j |. (23) Proof. From (Brunswic et al., 2024) section 3.2 and Proposition 1 in their appendix, we have for any measurable X S: F (Y ) = F π (S Y ) (24) π (s X) = d F π ( X) d F (s) (25) F π (Y X) = F π (X Y ). (26) Therefore, for all measurables X, Y S: F π (Y X) = F π (X Y ) (27) s S 1X(s)π (s Y )d F (s) (28) s S 1X(s)π (s Y )f (s)dλ(s) (29) i=1 1X(s)αi (s)δΦi(s)(Y )f (s)dλ(s) (30) s S 1X(s)αi (s)1Y (Φi(s))f (s)dλ(s) (31) u S 1X(Φ 1 i (u))αi (Φ 1 i (u))1Y (u)f (Φ 1 i (u))|JuΦ 1 i |dλ(u) (32) u Y 1X(Φ 1 i (u))αi (Φ 1 i (u))f (Φ 1 i (u))|JuΦ 1 i |dλ(u) (33) i=1 δΦ 1 i (u))(X)αi (Φ 1 i (u))f (Φ 1 i (u))|JuΦ 1 i |dλ(u). (34) Ergodic Generative Flows Since this formula is true for any Y S, we deduce that F π ( X) λ and that d F π ( X) dλ (x) = i=1 δΦ 1 i (x)(X)αi (Φ 1 i (x))f (Φ 1 i (x))|JxΦ 1 i | (35) i=1 αi (Φ 1 i (x))f (Φ 1 i (x))|JxΦ 1 i |δΦ 1 i (x) f (x) = d F π ( S) dλ (x) (37) i=1 δΦ 1 i (x)(S)αi (Φ 1 i (x))f (Φ 1 i (x))|JxΦ 1 i | (38) i=1 αi (Φ 1 i (x))f (Φ 1 i (x))|JxΦ 1 i | (39) The result follows. B. Universality Theorems B.1. General Universality Theorem We shall both write the theorems in a more formal way and provide rigorous proof of each statements. The following theorem implies Theorem 3.4. Theorem B.1. Let (S, λ) be a measured Polish space and let π be a Markov kernel acting on S. 
Assume that π is ergodic on (S, λ) with summable L2-mixing coefficients, then for any probability distributions Finit, Fterm λ with density in L2(λ) and any ε > 0, there exists f L2(λ) such that the generative flow (π , f ) from Finit to Fterm is such that δFinit(S) + δFterm(S) < ε. Proof. Let H : ν 7 νπ + Finit Fterm, let ν0 = Finit Fterm and let νt+1 = H(νt). By assumption, ν0 density is in L2(λ), then the density of νt with respect to λ is L2 for all t N. νT = HT (ν) (40) t=0 (Finit Fterm)(π )t (41) t=0 ϵt (42) with ϵt = (Finit Fterm)(π )t. For all t N, ϵt(S) = 0, hence by assumption P dλ (s) 2 dλ(s) < + . Therefore, νT converges as T + to some ν and |ν |2 P dλ (s) 2 dλ(s) < + , hence ν λ and dν dλ L2(λ). Furthermore, ν is a fix point of H so ν = ν π R + Finit. Now, ν does not necessarily provide a suitable f because it may happen that the negative part ν = 0. Since for any η > 0 we have H(ν +ηλ) = H(ν )+ηλπ = ν +ηλ, we may add ηλ to ν to get another fix point νη := ν +ηλ of H. Define f = d(νη )+ dλ , so that: Finit + (f λ)π Fterm f λ = Finit + (f λ νη + νη )π Fterm (f λ νη + νη ) (43) = (f λ νη )π (f λ νη ) (44) Therefore, δFinit := [(f λ νη )π (f λ νη )] and δFterm := [(f λ νη )π (f λ νη )]+. So that δFinit(S) + δFterm(S) = |(νη ) (π 1)|(S) 2(νη ) (S). (45) Since ν λ in particular limη + (νη ) (S) = 0, the result follows by choosing η big enough. Ergodic Generative Flows B.2. Universality on Tori The universality Theorem on tori is obtained as a consequence of the spectral gap for Affine Toroidal families. We use the following result. Theorem B.2 (Theorem 5 of (Bekka & Guivarc h, 2015) ). Let H be a countable subgroup of Aff(Td). The following properties are equivalent: (i) The action of H on T does not have a spectral gap. (ii) There exists a non-trivial H-invariant factor torus T such that the projection of H on Aut(T) is amenable. If condition (ii) of Theorem B.2 is false for the group generated by the transformations (Φi)p i=1, the we deduce any π induced by equation (5) has spectral gap if i, αi = 1 p. Since each Φi keep the Lebesgue measure λ invariant, by universality master Theorem B.1, the family is universal. Our technical condition of Theorem 3.5 is then The set of transforms (Φi)p i=1 is stable by inverse and the group generated by (Φi)p i=1 violates condition (ii). In particular, if the linear part of H is the whole SL(d, Z), projections such as the one in condition (ii) are never amenable since SL(d, Z) contains a free group with two generators. B.3. Universality on Spheres The universality Theorem on spheres is obtained as a consequence of the spectral gap for isometry spherical families. Before stating the technical mathematical result we use, we recall that a real number x R is algebraic if there exists a non-zero polynomial P Z[X] with integral coefficients such that P(x) = 0. The set of algebraic real number is denoted Q. We use the following result. Theorem B.3 (Reformulation of Theorem 1 of (Bourgain & Gamburd, 2012) ). Assume that (Φ1, , Φp) SO(d) Matd d(Q), that the group generated by Φ1, , Φp is dense in SO(d, R) and that i {1, , p}, j {1, , p}, Φ 1 i = Φj. Assume that, i, αi = 1 p then the associated Markov kernel π given by equation (5) has a spectral gap. With the technical condition The set of transforms (Φi)p i=1 is stable by inverse, the group generated by (Φi)p i=1 is dense in SO(d, R) and each of the Φi has algebraic coefficients , Theorem 3.6 then follows from master Theorem B.1. C. Quantitative Sampling Theorem Theorem C.1. 
Let $(\pi^\star, f^\star)$ be a generative flow and let $F_{\mathrm{init}}$ and $F_{\mathrm{term}}$ be initial and terminal distributions. Assume that $F_{\mathrm{init}} \neq 0$ and consider the sample $s_\tau$ of the generative flow Markov chain from $F_{\mathrm{init}}$ to $\hat F_{\mathrm{term}}$; then

$$\mathrm{TV}\left(s_\tau,\ \frac{\hat F_{\mathrm{term}}}{\hat F_{\mathrm{term}}(S)}\right) \leq \frac{\delta F_{\mathrm{init}}(S)}{\hat F_{\mathrm{term}}(S)}, \tag{46}$$

with TV the total variation.

Proof. For any generative flow $F := (\pi^\star, F^\star)$, define $\pi_{\tau,F}$ as the kernel $x \mapsto (s_\tau \mid s_1 = x)$ for the stopping condition $\hat F_{\mathrm{term}}$. The sampling distribution of $s_\tau$ for F is then $\frac{1}{F_{\mathrm{init}}(S)}F_{\mathrm{init}}\pi_{\tau,F}$. Furthermore, we have $\delta F_{\mathrm{term}}(S) - \delta F_{\mathrm{init}}(S) = F_{\mathrm{init}}(S) + F^\star\pi^\star(S) - F^\star(S) - F_{\mathrm{term}}(S) = F_{\mathrm{init}}(S) - F_{\mathrm{term}}(S)$. In particular, $\hat F_{\mathrm{term}}(S) \geq F_{\mathrm{init}}(S) > 0$, so $\hat F_{\mathrm{term}} \neq 0$. We notice that the generative flow $F := (\pi^\star, F^\star)$ satisfies the flow-matching constraint from $\hat F_{\mathrm{init}}$ to $\hat F_{\mathrm{term}}$. Applying the sampling theorem to this flow, we obtain

$$\hat F_{\mathrm{init}}\pi_{\tau,F} = \hat F_{\mathrm{term}} \tag{47}$$

$$(F_{\mathrm{init}} + \delta F_{\mathrm{init}})\pi_{\tau,F} = \hat F_{\mathrm{term}} \tag{48}$$

$$F_{\mathrm{init}}\pi_{\tau,F} + \delta F_{\mathrm{init}}\pi_{\tau,F} = \hat F_{\mathrm{term}} \tag{49}$$

$$\alpha\,\frac{1}{F_{\mathrm{init}}(S)}F_{\mathrm{init}}\pi_{\tau,F} + (1 - \alpha)\,\frac{1}{\delta F_{\mathrm{init}}(S)}\delta F_{\mathrm{init}}\pi_{\tau,F} = \frac{1}{\hat F_{\mathrm{term}}(S)}\hat F_{\mathrm{term}} \tag{50}$$

with $\alpha = \frac{F_{\mathrm{init}}(S)}{\hat F_{\mathrm{term}}(S)} = \frac{\hat F_{\mathrm{term}}(S) - \delta F_{\mathrm{init}}(S)}{\hat F_{\mathrm{term}}(S)} = 1 - \frac{\delta F_{\mathrm{init}}(S)}{\hat F_{\mathrm{term}}(S)}$. The result follows.

D. Implementation considerations

D.1. Density filtering

Since EGFs give us access to a tractable density $\hat f_{\mathrm{term}}$, we can estimate its mean m and standard deviation σ on the state space to filter out any unwanted outliers when sampling. By choosing a lower-bound saturation value $\hat f_{\mathrm{sat}} = m - k\sigma$, where k defines how strong the filter should be, we can then define a stricter sampling density $\tilde f_{\mathrm{term}}$ as

$$\tilde f_{\mathrm{term}}(s) = \begin{cases} \hat f_{\mathrm{term}}(s) & \text{if } \hat f_{\mathrm{term}}(s) \geq \hat f_{\mathrm{sat}} \\ 0 & \text{otherwise.} \end{cases} \tag{51}$$

Since the new sampling density $\tilde f_{\mathrm{term}} = 0$ for all points of low true density, the next state of the Markov chain is guaranteed not to be the sink state. This ensures the EGF only generates samples associated with high density. The factor k can be chosen manually to obtain a stricter or more lenient filter, or it can be automatically optimized by recalculating the negative log-likelihood on the training dataset to systematically evaluate all k values.

Figure 4. NLL estimation for a range of values of k for a given model on the volcano dataset.

Note that this filtering is applied to the density rather than to the samples, meaning that no samples are filtered out. Therefore, it simply has the effect of preventing the trajectories from stopping at low-density points. Figure 5 demonstrates that using the optimal filter value concentrates the samples in the zones of high true density, bypassing the uniform background noise that would otherwise be found prior to filtering.

Figure 5. Comparison of the EGF's samples using the same model with and without a density filter: (a) target distribution, (b) unfiltered samples, (c) samples with the optimal filter.
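The density filter of equation (51) is a one-liner once an estimate of the mean and standard deviation of $\hat f_{\mathrm{term}}$ is available. The Python sketch below illustrates one possible implementation; the set of evaluation points and the value of k are user choices and are hypothetical here.

```python
import numpy as np

def filtered_terminal_density(f_term_hat, eval_points, k=2.0):
    """Sketch of the density filtering of Appendix D.1: estimate the mean and the
    standard deviation of the tractable density f_term_hat over a set of points and
    zero it below the saturation value m - k * sigma, as in equation (51)."""
    values = np.array([f_term_hat(s) for s in eval_points])
    f_sat = values.mean() - k * values.std()      # lower-bound saturation value

    def f_term_filtered(s):
        v = f_term_hat(s)
        return v if v >= f_sat else 0.0           # stricter sampling density (51)
    return f_term_filtered
```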
E. Ablation study on the size of EGF

Figure 6. Variation of $f^\star$ and $\alpha^\star$ depths from 2x32 to 6x32. (a) From left to right, the depth of $f^\star$ is kept constant while the depth of $\alpha^\star$ increases. (b) From left to right, the depth of $\alpha^\star$ is kept constant while the depth of $f^\star$ increases. The theoretical prediction is that $\alpha^\star$ may be kept constant while $f^\star$ is trained. Keeping $\alpha^\star$ constant indeed yields improvements; however, the improvement obtained by increasing the depth of $\alpha^\star$ instead of $f^\star$ is comparatively better.

Figure 7. Theoretically minimal EGF with four affine moves on the torus $\mathbb{T}^2$, i.e. two generators of SL(2, Z) together with their inverses. The MLPs parameterizing $f^\star$ and $\alpha^\star$ have size 4x128. We see that the generated checkerboard has high quality and sharp boundaries, but some under-densities.

Figure 8. Generation of the spin wheel with variations of the MLPs for both $f^\star$ and $\alpha^\star$: (a) from 2x32 to 5x32 (top) and 6x32 to 9x32 (bottom); (b) from 2x64 to 5x64 (top) and 6x64 to 9x64 (bottom). Increasing depth and width increases quality until depth 4, after which training seemingly becomes less stable, as shown by the 9x64 experiment. Since we keep the learning rate constant at 1e-3, it is likely that the MLP training becomes unstable. The state space is $S = \mathbb{R}^2$; to ensure $F^\star(S) < +\infty$, we constrain $f^\star$ to be zero outside $[-4, 4]^2$. The EGF has 8 transformations, which are translations, while the MLPs for $\alpha^\star$ and $f^\star$ vary from 2x32 (2 hidden layers of width 32) to 9x64.

Figure 9. Generation of 4 circles with variations of the MLPs for both $f^\star$ and $\alpha^\star$: (a) from 2x32 to 5x32 (top) and 6x32 to 9x32 (bottom); (b) from 2x64 to 5x64 (top) and 6x64 to 9x64 (bottom). Increasing depth and width increases quality until depth 4, after which training seemingly becomes less stable, as shown by the 9x64 experiment. Since we keep the learning rate constant at 1e-3, it is likely that the MLP training becomes unstable. The state space is $S = \mathbb{R}^2$; to ensure $F^\star(S) < +\infty$, we constrain $f^\star$ to be zero outside $[-4, 4]^2$. The EGF has 8 transformations, which are translations, while the MLPs for $\alpha^\star$ and $f^\star$ vary from 2x32 (2 hidden layers of width 32) to 9x64.

Figure 10. Generation of 2 spirals with variations of the MLPs for both $f^\star$ and $\alpha^\star$: (a) from 2x32 to 5x32 (top) and 6x32 to 9x32 (bottom); (b) from 2x64 to 5x64 (top) and 6x64 to 9x64 (bottom). Increasing depth and width increases quality until depth 4, after which training seemingly becomes less stable, as shown by the 9x64 experiment. Since we keep the learning rate constant at 1e-3, it is likely that the MLP training becomes unstable. The state space is $S = \mathbb{R}^2$; to ensure $F^\star(S) < +\infty$, we constrain $f^\star$ to be zero outside $[-4, 4]^2$. The EGF has 8 transformations, which are translations, while the MLPs for $\alpha^\star$ and $f^\star$ vary from 2x32 (2 hidden layers of width 32) to 9x64.
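The non-compact ablation setup described in the captions above (a planar state space with $f^\star$ forced to vanish outside a box, and translation moves) can be expressed with a few lines of Python; the translation offsets are hypothetical placeholders, since the captions do not specify them.

```python
import numpy as np

def constrained_f_star(f_star_raw):
    """Sketch of the support constraint from the Appendix E ablations: the state
    space is R^2 and f* is forced to vanish outside [-4, 4]^2 so that F*(S) is finite."""
    def f_star(s):
        s = np.asarray(s, float)
        return f_star_raw(s) if np.all(np.abs(s) <= 4.0) else 0.0
    return f_star

# Eight translation moves of R^2 with hypothetical offsets (trainable in practice).
offsets = [np.array(o, float) for o in
           [(1, 0), (-1, 0), (0, 1), (0, -1), (1, 1), (-1, -1), (1, -1), (-1, 1)]]
translations = [(lambda x, b=b: np.asarray(x, float) + b) for b in offsets]
```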