# learning_interpolations_between_boltzmann_densities__c377980a.pdf Published in Transactions on Machine Learning Research (05/2023) Learning Interpolations between Boltzmann Densities Bálint Máté balint.mate@unige.ch Department of Computer Science, Department of Physics University of Geneva François Fleuret francois.fleuret@unige.ch Department of Computer Science University of Geneva Reviewed on Open Review: https: // openreview. net/ forum? id= TH6Yr Ecbth We introduce a training objective for continuous normalizing flows that can be used in the absence of samples but in the presence of an energy function. Our method relies on either a prescribed or a learnt interpolation ft of energy functions between the target energy f1 and the energy function of a generalized Gaussian f0(x) = ||x/σ||p p. The interpolation of energy functions induces an interpolation of Boltzmann densities pt e ft and we aim to find a time-dependent vector field Vt that transports samples along the family pt of densities. The condition of transporting samples along the family pt is equivalent to satisfying the continuity equation with Vt and pt = Z 1 t e ft. Consequently, we optimize Vt and ft to satisfy this partial differential equation. We experimentally compare the proposed training objective to the reverse KL-divergence on Gaussian mixtures and on the Boltzmann density of a quantum mechanical particle in a double-well potential. 1 Introduction We consider the task of estimating the expectation value Ex p[O(x)] of some observable O, under a probability density p proportional to the unnormalized density e f, where f : Rn R is a given energy function. In particular, we don t have access to true samples from p, all we have is the ability to evaluate f and its derivatives for any x Rn. A popular technique (Boyda et al., 2021; Albergo et al., 2021a;b; 2022; Abbott et al., 2022; de Haan et al., 2021; Gerdes et al., 2022; Noé et al., 2018; Köhler et al., 2020; Nicoli et al., 2020; 2021) for attacking this problem is to use a normalizing flow to parametrize a variational density qθ and optimize the parameters θ to minimize the reverse KL-divergence KL[qθ, p] = Ex qθ(log qθ(x) log p(x)) = Ex qθ(log qθ(x) + f(x)) + Z. (1) The use of normalizing flows for this problem is particularly attractive because qθ can be used as a proposal for importance sampling, Ex p[O(x)] = Ex qθ[ p(x) qθ(x)O(x)], to account for the inaccuracies of qθ. Unfortunately, the reverse KL-divergence is mode-seeking, making the training prone to mode-collapse (Fig. 2). To tackle this problem, several works have proposed alternative training objectives for normalizng flows. Vaitl et al. (2022) introduce better estimators of the forward and reverse KL divergences, while Midgley et al. (2022) use the α = 2 divergence instead of the reverse KL-divergence as their training objective. We propose yet another alternative based on infinitesimal deformations of Boltzmann densities (Pfau & Rezende, 2020; Máté & Fleuret, 2022). This work was motivated by denoising diffusion (Sohl-Dickstein et al., 2015; Ho et al., 2020) and score-base models (Song & Ermon, 2019). It also bears similarities to more recent works (Lipman et al., 2022; Albergo & Vanden-Eijnden, 2022; Liu et al., 2022; Neklyudov et al., 2022) that generalize diffusion models by relying on more flexible interpolations between the data and the base distribution. The implementation of our experiments is available at https://github.com/balintmate/boltzmann-interpolations. Published in Transactions on Machine Learning Research (05/2023) The contributions of this work can be summarized as follows In 3 we describe our method which relies on either a prescribed or a learnt interpolation ft of energy functions between the target energy f1 and the energy function of a generalized Gaussian f0(x) = ||x/σ||p p. Given ft we optimize a vector field Vt to transport samples along the family pt(x) e ft(x) of Boltzmann densities. After translating this condition to a PDE between Vt and ft we propose to minimize the amount by which this PDE fails to hold. First we find that in certain cases the linear interpolation ft = (1 t)f0 + tf1 already leads to improved performance over optimizing the reverse KL-divergence. We also show that in general this interpolation is insufficient. Motivated by the failure mode of the linear interpolation, we parametrize the interpolation with another neural network ft = (1 t)f0 + tf1 + t(1 t)f θ(t) and optimize f θ t together with the vector field V θ t . In 4 we run experiments on Gaussian mixtures and on the Boltzmann density of a quantum particle in a double-well potential, and report improvements in KL-divergence, effective sample size, mode coverage and also training speed. Figure 1: Interpolating in the space of unnormalized probability densities. The unnormalized target density (red) and its normalization (black) that we are trying to fit a flow to. The base density of the flow (green) lying on the subspace of normalized probability densities (orange). The interpolation connecting the base density to the unnormalized target (blue) and the induced interpolation of normalized densities (the projection of the blue trajectory to the log Z = 0 -plane). Consider the following multimodal density N 8 8 , 1 + N 8 8 , 1 + N 8 8 , 1 + N 8 8 , 1 , (2) where N(µ, σ) denotes a normal density centered at µ with covariance matrix diag(σ2). Fig. 2 shows the mode-collapse of a normalizing flow trained with the reverse KL-divergence on this target. The reason why mode collapse can happen in the first place is the training objective itself. The mode seeking behavior of the reverse KL-divergence can be explained as follows. As the sampling is done according to qθ, the difference in log-likelihoods is weighted by the likelihood qθ. This implies that if qθ completely ignores modes of p the log-likelihoods of p and qθ are not compared over regions that are not covered by qθ. In this paper we introduce a training objective for continuous normalizing flows that can be used to replace the reverse KL-divergence and investigate to what extent it solves the issue of mode collapse on multimodal targets. Figure 2: Mode-seeking nature of the reverse KL-divergence. The figures from left to right show how the latent gaussian is transformed by the continuous normalizing flow trained with the reverse KL objective. The blue blobs denote the target density (2), the arrows represent the vector field Vt. Published in Transactions on Machine Learning Research (05/2023) 2 Background Change of variables. Let p0(z) be a probability density on Rn and Ψ : Rn Rn a diffeomorphism. Pushing the density p0 forward along Ψ induces a new probability density p implicitly defined by log p0(z) = log p(Ψz) + log | det JΨ(z)|. (3) The term log | det JΨ(z)| measures how much the function Ψ expands volume locally at z. Continuous change of variables. Let now Vt be a time-dependent vector field and Ψτ denote the diffeomorphism of flowing along the integral curves of Vt from 0 to τ. This family of diffeomorphisms generates a one-parameter family of densities pτ. The amount of volume expansion a particle experiences along a trajectory t 7 Φtz between 0 and τ is R τ 0 Vt(Ψtz)dt. The log-likelihoods are then related by log p0(z) = log pτ(Ψτz) + Z τ 0 Vt(Ψtz)dt. (4) Normalizing flows. Normalizing flows (Tabak & Turner, 2013; Rezende & Mohamed, 2015; Dinh et al., 2016) (continuous normalizing flows (Chen et al., 2018)) parametrize a subset of the space of all densities on Rn. They do this by first fixing a base density p0 and using a neural network that parametrizes the transformation Ψ (the vector field Vt). The (continuous) change of variables formula is then applied to compute the density induced by Ψ (Vt). Recently, generalizations of normalizing flows have also been studied (Nielsen et al., 2020; Huang et al., 2020; Máté et al., 2022). Boltzmann densities. Let f : Rn R be an energy function with a finite normalizing constant Z = R e f(x)dnx. The function f then induces a Boltzmann density over the configurations x Rn, p(x) = 1 Z e f(x). Conversely, given a probability density function p : Rn R+,0 the corresponding energy function can be recovered up to a constant f = log p log Z. The continuity equation. One can describe a time-dependent probability density either with its time dependent density function pt or by a reference (base) density p0 and a time-dependent vector field Vt. The latter generates pτ by pushing p0 forward along the integral curves of Vt from t = 0 to t = τ. The two descriptions are related by the continuity equation tpt + (pt Vt) = 0, which, in the case of Boltzmann densities pt = Z 1 t e ft can be rewritten as follows, 0 = tpt + (pt Vt) expanding pt = Z 1 t e ft (5) = t(Z 1 t e ft) + (Z 1 t e ft Vt) (6) = e ft t(Z 1 t ) + Z 1 t t(e ft) + Z 1 t (e ft Vt) (7) = e ft t(Z 1 t ) + Z 1 t t(e ft) + Z 1 t e ft Vt + Z 1 t e ft, Vt (8) = e ft t(Z 1 t ) Z 1 t e ft t(ft) + Z 1 t e ft Vt Z 1 t e ft ft, Vt factoring out pt = Z 1 t e ft (9) = Z 1 t e ft Zt t(Z 1 t ) | {z } t log Zt tft + Vt ft, Vt , (10) where , is the Euclidean scalar product between vectors. Since pt = Z 1 t e ft > 0, we conclude tft + ft, Vt Vt + t log Zt = 0, (11) Lemma 1. Moreover, if any time-dependent energy function ft, time-dependent vector field Vt and spatially constant function Ct satisfies tft + ft, Vt Vt + Ct = 0, (12) then Ct necessarily equals t log Zt with Zt = R e ft(x)dnx. Published in Transactions on Machine Learning Research (05/2023) Proof. Let ft, Vt, Ct be such that they satisfy (12). We can then compute t log Zt = t Zt R ( tft)e ft(x)dnx R e ft(x)dnx = (e ft(x)Vt) z }| { ( ft, Vt Vt)e ft(x) dnx R e ft(x)dnx + R Cte ft(x)dnx R e ft(x)dnx = Ct (13) where the last equality follows from the divergence theorem and the fact that e ft(x) vanishes at infinity. When numerically solving the continuity equation, Lemma 1 enables us to work with (12) instead of (11) and parametrize the triplet (f, V, C) with C being a spatially constant function instead of estimating t log Zt. In what follows we only work with (12) and refer to it as the continuity equation. 3 Approximating the transport field From here on, we will use the vector field Vt and the term continuous normalizing flow interchangeably. Our goal is to sample from a target Boltzmann density ptarget e ftarget by 1. defining a family of energy functions ft, 0 t 1 interpolating between the target energy f1 = ftarget and the energy function of a generalized Gaussian f0(x) = ||x/σ||p p, 2. finding a transport field Vt such that (ft, Vt) solves the continuity equation (12). If we succeed at both of these constructions, then we can obtain samples from ptarget by sampling from p0 e ||x/σ||p p and let the samples follow the integral curves of Vt from t = 0 to t = 1. Note that we turned the problem of learning a single density e f1 into a continuous collection of problems, learning all the densities e ft for 0 t 1. It might seem like that we just made the task more difficult, but the idea is that for any 0 τ 1, the density e fτ is easier to fit once e fτ ε is already fitted for some small ε. The pointwise continuity error Regarding the second item of the above list, an analytical expression for Vt is not easy to find if we are given a family of energy functions ft. This would amount to solving (12), which is difficult in general. Therefore we will parametrize Vt with a neural network and train it to minimize the amount by which the pair (ft, Vt) fails to satisfy the continuity equation. We begin by recalling the continuity equation, tft + ft, Vt Vt + Ct = 0, (14) where Ct is a spatially constant function. In what follows, V θ t and Cθ t are parametrized by neural networks and are trained to minimize some monotonically increasing function L1 of the pointwise continuity error Eθ,x,t = | tft(x) + ft(x), V θ t (x) V θ t (x) + Cθ t |. (15) The expression L(Eθ,x,t) measures the incompatibility of ft and Vt at a single (t, x) pair of coordinates, we will need to optimize some sort of integral of this pointwise error over both t and x. The continuity loss Suppose that we have an interpolation of energy functions ft. We propose to train V θ t and Cθ t to minimize the continuity error (15) along the trajectories of V θ t . Formally, let qθ be a parametric density parametrized by a continuous normalizing flow V θ t . We update the parameters to minimalize the integral of L(E) along the trajectories of the flow, L(θ) = Ez p0 0 L Eθ,γθ t (z),t dt , (16) where γθ τ(z) is given by transporting z along the vector field V θ t between 0 and τ. We evaluate the integral (16) by discretizing time and using numerical ODE solvers. 1In our experiments we tried L : R+ R+ {E 7 E, E 7 E2, E 7 E + E2, E 7 log(1 + E)}. Published in Transactions on Machine Learning Research (05/2023) First approach: Linear interpolation Arguably the simplest way to interpolate between a pair of functions (f0, f1) is to set ft(x) = (1 t)f0(x) + tf1(x), 0 t 1. (17) The same interpolation is proposed by Wu et al. (2020) in the context of stochastic normalizing flows and is also a common choice for annealed importance sampling (Neal, 1998). We will call the family (17) the linear interpolation. Fig. 3 shows the evolution of the samples along a continuous normalizing flow trained with the continuity loss using the linear interpolation. We observe that the same network trained with the continuity loss instead of the reverse KL-divergence can capture all 4 modes of (2). Figure 3: The same continuous normalizing flow as in Fig. 2 trained with the continuity loss using the linear interpolation. The top row shows how the target density evolves under the predefined linear interpolation between f0 = x2/2 and f1 = log(Eq. 2), while the bottom row shows how the samples from qθ evolve along Vt as t is varied. The issue with the linear interpolation. Let us now consider the density 1 3N 4 4 , 1 + 2 3N 8 8 , 1 . (18) This target does not enjoy the symmetry properties of the previous mixture, one of the modes is closer to the base and has a lower relative weight than the other one. The top row of Fig. 4 shows the linear interpolation to this target and the second row shows that this interpolation is insufficient to capture the mode which is further away. In a nutshell, the reason for this is that the linear interpolation does not preserve the relative weights of the modes as t is varied. Figure 4: The issue with the linear interpolation. The top row shows how the target density evolves under the predefined linear interpolation between f0 = x2/8 and f1 = log(Eq. 18), while the bottom row shows how the samples from qθ evolve along Vt as t is varied. Published in Transactions on Machine Learning Research (05/2023) Interlude: good and bad interpolations The reason why the linear interpolation can lead to problems is illustrated in Figures 5 and 6. Figure 5 shows a particular example where the linear interpolation in log-density space induces a good interpolation in density space. Figure 6 shows that for certain target functions the linear interpolation in log-density space induces an interpolation in density space that is not local in the sense that probability mass gets moved between the modes. Such densities are difficult to learn with the linear interpolation, even if in principle there exists a vector field Vt that moves probability mass the correct way along any interpolation. t = 0.00 t = 0.14 t = 0.29 t = 0.43 t = 0.57 t = 0.71 t = 0.86 t = 1.00 Figure 5: Linear interpolation between f0(x) = x2/4 and f1 = log( 1 2e (x 3)2 + 1 2e (x+3)2). Linear interpolation in the log-density space (top row), and the induced interpolation in density space (middle row). The transport field Vt which can be represented as a scalar-valued function for one-dimensional densities (bottom row). The thin gray line in the bottom row denotes Vt = 0. t = 0.00 t = 0.14 t = 0.29 t = 0.43 t = 0.57 t = 0.71 t = 0.86 t = 1.00 Figure 6: Linear interpolation between f0(x) = x2/4 and f1 = log( 1 3e (x 1)2 + 2 3e (x+4)2). Linear interpolation in the log-density space (top row), and the induced interpolation in density space (middle row). The transport field Vt which can be represented as a scalar-valued function for one-dimensional densities (bottom row). The thin gray line in the bottom row denotes Vt = 0. Note how the relative weights of modes is not preserved as t is varied, resulting in the exploding norm of Vt. Published in Transactions on Machine Learning Research (05/2023) Sanity check: interpolate along the diffusion process Diffusion Processes. Diffusion models (Sohl-Dickstein et al., 2015; Ho et al., 2020) learn to reverse a diffusion process that is in essence also a family of probability densities pt on Rn interpolating between the data density p1(x) and the latent n dimensional gaussian p0 = N(0, σ). Samples from pt are of the form tx + 1 tz where x p1, z p0. Geometrically, the probability density pt is obtained by first stretching p1 by a factor of t and then convolving with the Gaussian kernel Gσ 1 t2. p1(x) λt t n/2p1(t 1/2x) Gt t n/2p1(t 1/2x) Gσ 1 t | {z } pt(x) where we used λt to denote the stretching by a factor of t and Gt to denote the convolution with the Gaussian kernel Gσ 1 t. If t = 1, both λt (stretching by a factor of 1) and Gt (convolving with a Diracdelta) are just the identity. If t = 0, then λt collapses p1 to a Dirac-delta at the orgin and Gt convolves it with N(0, σ) implying that p0 = N(0, σ). The attractive thing about the diffusion process is that it provides an interpolation between densities where the situation of Fig. 6 is avoided. Loosely speaking, this interpolation is local in a sense that that the linear interpolation was not. The family of Gaussian mixtures is closed under the diffusion process (19), we can even explicitly compute the time-evolution of a Gaussian mixture and use it to replace the linear interpolation. Fig. 7 shows samples from a normalizing flow that was trained with the continuity loss using the hand-computed interpolation of its diffusion process. Figure 7: The same continuous normalizing flow as in Fig. 4 with the continuity loss using the diffusion interpolation. The blue blobs denote the target density (18), a mixture of two Gaussians. The top row shows how the target density evolves under the predefined diffusion-like interpolation, while the bottom row shows how the samples from qθ evolve along the flow as t is varied. Unfortunately, since all we are given is the ability to evaluate the target energy pointwise, there is no analytical formula for calculating the diffusion process of an arbitrary target density. Moreover, any numerical approach involves integration, that gets increasingly expensive as t gets smaller and the dimensionality of the problem gets larger. Nonetheless, this result demonstrates that minimizing the continuity error is a feasible approach, one just needs to be more careful about the interpolation between the latent and target energy functions. The authors could not find a predefined interpolation that is 1) local in the still not formal, but intuitive sense and 2) easy to compute for arbitrary target energies. Instead, we propose to learn the family of densities between the base and the target. 2The Gaussian kernel Gσ 1 t is a Gaussian density with mean 0 and covariance matrix diag(σ2(1 t)). Published in Transactions on Machine Learning Research (05/2023) Parametrizing the interpolation We propose to use a neural network to parametrize the interpolation ft as ft(x) = (1 t)f0(x) + tf1(x) + t(1 t)f θ t (x), 0 t 1, (20) where f θ t is parametrized by a neural network. This parametrization ensures that the boundary conditions at t {0, 1} are satisfied, and allows for flexibility on the open time-interval (0, 1). The parameters of the interpolation are trained together with the those of the flow (and those of Ct) with the objective of minimizing the continuity loss. Fig. 8 shows that this flexibility allows a flow trained with the continuity loss to capture both modes of the distribution (18). Figure 8: The same continuous normalizing flow as in Fig. 4 trained with the continuity loss using the trainable interpolation. The blue blobs denote the target density (18), a mixture of 2 Gaussians. The top row shows how the target density evolves under the learned interpolation, while the bottom row shows how the samples from qθ evolve along Vt as t is varied. 4 Experiments To compare the continuity loss to the reverse KL-divergence we train the same normalizing flow architecture by minimizing the 1) reverse KL objective and 2) the continuity loss. We run experiments on Gaussian mixtures and on the Boltzmann distribution of a quantum mechanical particle in a double-well potential. Performance metrics To quantify the results of the experiments, for each model we report a subset of the following metrics. For all runs we report the reverse KL-divergence (minus log Z), Ex qθ(log qθ(x) log p(x)) (21) and the effective sample size, i p(xi)/qθ(xi) 2 i(p(xi)/qθ(xi))2 , xi qθ, (22) where N is the number of samples. These metrics capture how good a fit qθ is for p, but only in those regions where samples are available. They are therefore insensitive to mode collapse. To compensate for this, we compute the Hausdorff distance between the means of the modes M = {m1, ..., mk} and a batch of samples X = {x1, ..., x N} from the model, H(M, X) = max m M min x X ||m x||2. (23) Published in Transactions on Machine Learning Research (05/2023) In the case of Gaussian mixtures, the means M are the means of mixture components whereas in the case of the quantum mechanical particle, the means are (ϕ1, ϕ2, ..., ϕN) = (a, a, ..., a) and (ϕ1, ϕ2, ..., ϕN) = (b, b, ..., b) where a and b are the two local minima of V (ϕ) (see 4.2 for details). The Hausdorff distance is a good metric for measuring mode coverage but is insensitive to the shape of the distributions. Finally, the forward KL-divergence (plus log Z), Ex p(log p(x) log qθ(x)) (24) and effective sample size, i qθ(xi)/p(xi) 2 i(qθ(xi)/pθ(xi))2 , xi p. (25) provide the most accurate (both mode coverage and shape matching) description of the goodness of the fit. As they require samples from p, we only report them for the experiments on Gaussian mixtures. 4.1 Gaussian mixtures In this section we consider two targets, those given by (2) and (18). The metrics are reported in Table 1 and correspond well to what we observe in Figures 2, 3, 4 and 8. Table 1: Results of training the same flow architecture with different objectives on the targets (2) and (18). Note that the target densities are normalized, i.e. log Z = 0. Mean and standard deviation values over 5 seeds are reported. Energy function given by log(Eq.2) H(M, X) Rev. KL Rev. ESS Fw.KL Fw. ESS KL(qθ, p) 19.26 .333 1.387 .001 0.999 .000 110.6 97.6 0.254 .006 Cont. Loss with Linear Int. 0.060 .023 0.003 .002 0.998 .001 0.001 .001 0.998 .002 Cont. Loss with Trainable Int. 0.053 .010 0.000 .001 1.000 .000 0.000 .000 0.999 .000 Energy function given by log(Eq.18) H(M, X) Rev. KL Rev. ESS Fw.KL Fw. ESS KL(qθ, p) 13.25 .363 1.099 .000 1.000 .000 74.38 30.3 0.332 .007 Cont. Loss with Linear Int. 2.329 .124 1.081 .004 0.928 .080 37.68 6.99 0.336 .008 Cont. Loss with Trainable Int. 0.045 .022 0.001 .001 0.999 .000 0.000 .000 1.000 .000 4.2 Quantum mechanical particle in a double-well potential In this section we consider the trajectory of a quantum mechanical particle moving in a one-dimensional double-well potential V between t = 0 and t = T. We closely follow the experimental setup of Vaitl et al. (2022). The action associated to a continous trajectory ϕ(t) Rn, 0 t T reads S[ϕ(t)] = Z T 2 ( tϕ)2 + V (ϕ)dt, V (ϕ) = m where λ are numerical parameters. After distretizing time, the discretised action of a trajectory (ϕ1, ..., ϕN) is S(ϕ1, ... , ϕN) = T, V (ϕ) = m where T = T/N, m and the subscript i + 1 is to be understood modulo N. We replicate the choices of Vaitl et al. (2022) for T = 4, λ = 1, N {4, 8, 16, 32, 64} and we use mass values m {1.50, 3.00, 4.50, 6.00}. The goal is then, as before, to sample trajectories (ϕ1, ... , ϕN) from the Boltzmann density p(ϕ1, ... , ϕN) = 1 Z e S(ϕ1, ... ,ϕN), Z = Z RN e S(ϕ1, ... ,ϕN)d Nx (28) Published in Transactions on Machine Learning Research (05/2023) m 2 φ2 + λ 4φ4 m = 1.50 m = 3.00 m = 4.50 m = 6.00 e m 2 φ2 λ 4φ4 Figure 9: Dependence of V (ϕ) and of the Boltzmann density e V (ϕ) on m. Heuristics. For larger values of m, the energy barrier between the wells of V gets greater, resulting in a bimodal, one-dimensional Boltzmann density e V (ϕ). (Fig. 9). Intuitively, the second summand in (27) encourages the particle to follow the unnormalized density e V (ϕ) at every time step, while the first one penalizes if the values of ϕ at consecutive time steps differ too much (i.e. belong to different modes of e V (ϕ)). With the above interpretation we can argue that ϕi and ϕi+1 are likely to be close, and every ϕi is likely to belong to one of the modes of e V (ϕi). Then the density e S(ϕ1,...,ϕN) has two modes, centered at (a, a, ..., a) and (b, b, ..., b), where a and b are the local minima of V . Sensitivity to the mass parameter. Now we fix N = 16 and vary the mass of particle m {1.50, 3.00, 4.50, 6.00}. We compare the continuity loss with the trainable interpolation to the reverse KL objective. The quantitative results are summarized in Table 2. In Figure 10 we compare e V (ϕ) to the histogram of flattened samples from the trained models. This makes sense since the action encourages the particle to follow the one-dimensional Boltzmann density of the potential V (ϕ) at every time step. Note that these two densities are not supposed to perfectly match, and Fig. 10 can only be used to detect mode-collapse of the flow. Figure 10: Mode collapse of the reverse KL-divergence for higher values of m. The unnormalized density p(ϕ) e V (ϕ) (blue curve), compared to flattened samples ϕi (orange histogram). Table 2: Results of training the same flow architecture with different objectives on the energy function (27). Note that a decrease of log 2 0.7 in reverse KL-divergence corresponds to covering twice as much probability mass. Mean and standard deviation values over 3 seeds are reported. m = 1.50 m = 3.00 H(M, X) Rev. KL Rev. ESS H(M, X) Rev. KL Rev. ESS KL(qθ, p) 0.717 .06 1.052 .00 0.951 .00 0.541 .00 1.832 .01 0.801 .11 Continuity Loss 0.678 .02 1.054 .00 0.991 .00 0.486 .04 1.835 .00 0.979 .00 m = 4.50 m = 6.00 H(M, X) Rev. KL Rev. ESS H(M, X) Rev. KL Rev. ESS KL(qθ, p) 13.61 .49 9.058 .00 0.883 .05 16.67 .00 22.49 .01 0.935 .01 Continuity Loss 0.335 .00 9.765 .01 0.966 .01 0.300 .00 23.19 .01 0.963 .01 Published in Transactions on Machine Learning Research (05/2023) Sensitivity to the dimensionality and computational speedup. In this section we fix a relatively small mass value, m = 1.50, resulting in a unimodal Boltzmann density. We then train models at different time resolutions, N {4, 8, 16, 32, 64}. For each N we trained two models, one with the reverse KL and one with the continuity loss. Figure 11 shows the evolution of the (reverse) effective sample size and the continuity loss during the training process. Since the continuity loss is a pointwise objective, contrary to the reverse KL, it does not require backpropagation along the trajectories of the flow. On the other hand, the baseline reverse KL objective only requires a parametrization of Vt and the computation of Vt. In addition to this, the continuity loss also requires to parametrizations of Ct and ft and the computation of tf, f and Ct. Overall, we observed a speedup when switching to the continuity loss both in terms of number of optimization steps per unit time and also in terms of convergence speed (Figure 11). 1m 10m 1h 10h 50h training time N=4 N=8 N=16 N=32 N=64 102 103 104 105 training steps Continuity Loss N=4 N=8 N=16 N=32 N=64 Figure 11: Evolution of the (reverse) effective sample size (left) and of the continuity loss (right) during training for different dimensionality of the problem. The solid lines correspond to the models trained by optimizing the continuity loss, while the dashed lines denote the models trained by optimizing the reverse KL divergence. Implementation and training details Architectures. The parametrization of Vt and ft are given by a weighted average of 4 MLPs. The weighting is done by evenly spaced RBF time-kernels, one for each model. In the case of the Gaussian targets the MLPs has two hidden layers with 64 neurons per layer, in the case of the quantum particle the MLPs has 3 hidden layers with 128 neurons per layer. Between the linear layers we use swish nonlinearities. The parametrization of Ct consists of a single MLP with the same hyperparameters as the mixture components of the parametrizations of Vt and ft. Importantly, our architecture is completely oblivious to the Z2 Cnsymmetry of the lattice of quantum mechanical particle and to the Z2-symmetry of the double-well potential. We rely on automatic differentiation to compute all derivatives. We leave the exploitation of symmetries and the use of architectures with analytic expressions for the divergence of Vt (Köhler et al., 2020; Gerdes et al., 2022), as well as for ft and tft, for future work. Optimization. We train with a batch size of 256 using the Adam optimizer (Kingma & Ba, 2017) and evaluate on batches of size 4096. The trajectories of the flow are computed by a 4th-order Runge-Kutta solver with 50 integration steps. The N = 64 and N = 32 runs in 4.2 are trained for 2.5 105 and 105 iterations, respectively. All other models are trained for 104 iterations. The initial learning rate of 3 10 3 is annealed to 0 following a cosine schedule. Base Density. We use a standard Gaussian base for the target (2), a centered Gaussian with standard deviation 2 for the target (18) and the generalized Gaussian proportional to e x4 in the experiments of 4.2. On the choice of the function L. The definition of the the continuity loss (16) involves a somewhat arbitrary choice of the function L. We compared the functions {E 7 E, E 7 E2, E 7 E +E2, E 7 log(1+E)} and ended up using E + E2 in all of our experiments, as it empirically outperformed the other choices. The intuition is that the L2 norm provides good gradients when the error is large, while the L1 norm provides Published in Transactions on Machine Learning Research (05/2023) good gradients when the error is small and therefore their sum is expected to be superior to both of them. Our experiments support this intuition. 5 Summary and Closing remarks We introduced an alternative training objective of continuous normalizing flows that uses an interpolation of energy functions. We ve demonstrated empirically that the proposed objective is less prone to mode collapse than the KL-divergence when the target density has multiple modes and is computationally more efficient. Two reasons for mode collapse. There can be, at least, two different reasons for mode collapse when training with the reverse KL-divergence. First, if one of the modes is so far away from the base distribution that it never gets visited by the flow. Second, if a mode is visited during training by trajectories of the flow, but the flow still ignores it and fits the remaining modes of the density. It is important to distinguish these two scenarios as our proposed technique can help with the second kind, but not the first one. This is also the reason why the choice of the base density is important. When the base has a higher variance, a larger fraction of the space gets visited. Reframing the method as a PINN. Our work naturally fits into the framework of Physics-informed neural networks (Raissi et al., 2019). What this work calls the pointwise continuity error, would be called the residual to the continuity equation in the PINN literature. The core idea is essentially the same: the optmization of a neural network to satisfy a PDE. 6 Acknowledgement The authors acknowledge support from the Swiss National Science Foundation under grant number CRSII5_193716 - Robust Deep Density Models for High-Energy Particle Physics and Solar Flare Analysis (RODEM) . We also thank Samuel Klein, Jonas Köhler and Eloi Alonso for discussions, Saviz Mowlavi for mentioning the connection to PINNs, and Ricky T. Q. Chen for pointing out that (12) is just the continuity equation. We would also like to express our appreciation of the Python ecosystem (Van Rossum & Drake Jr, 1995) and its libraries that made this work possible. In particular, the experimental part of the paper relies heavily on JAX (Bradbury et al., 2018), haiku (Hennigan et al., 2020), optax (Babuschkin et al., 2020), numpy (Harris et al., 2020), hydra (Yadan, 2019), matplotlib (Hunter, 2007), jupyter (Pérez & Granger, 2007) and weights&biases (Biewald, 2020). Ryan Abbott, Michael S. Albergo, Denis Boyda, Kyle Cranmer, Daniel C. Hackett, Gurtej Kanwar, Sébastien Racanière, Danilo J. Rezende, Fernando Romero-López, Phiala E. Shanahan, Betsy Tian, and Julian M. Urban. Gauge-equivariant flow models for sampling in lattice field theories with pseudofermions, 2022. URL https://arxiv.org/abs/2207.08945. Michael S Albergo and Eric Vanden-Eijnden. Building normalizing flows with stochastic interpolants. ar Xiv preprint ar Xiv:2209.15571, 2022. Michael S. Albergo, Denis Boyda, Daniel C. Hackett, Gurtej Kanwar, Kyle Cranmer, Sébastien Racanière, Danilo Jimenez Rezende, and Phiala E. Shanahan. Introduction to normalizing flows for lattice field theory, 2021a. URL https://arxiv.org/abs/2101.08176. Michael S. Albergo, Gurtej Kanwar, Sé bastien Racanière, Danilo J. Rezende, Julian M. Urban, Denis Boyda, Kyle Cranmer, Daniel C. Hackett, and Phiala E. Shanahan. Flow-based sampling for fermionic lattice field theories. Physical Review D, 104(11), dec 2021b. doi: 10.1103/physrevd.104.114507. URL https://doi.org/10.1103%2Fphysrevd.104.114507. Published in Transactions on Machine Learning Research (05/2023) Michael S. Albergo, Denis Boyda, Kyle Cranmer, Daniel C. Hackett, Gurtej Kanwar, Sébastien Racanière, Danilo J. Rezende, Fernando Romero-López, Phiala E. Shanahan, and Julian M. Urban. Flow-based sampling in the lattice schwinger model at criticality, 2022. URL https://arxiv.org/abs/2202.11712. Igor Babuschkin, Kate Baumli, Alison Bell, Surya Bhupatiraju, Jake Bruce, Peter Buchlovsky, David Budden, Trevor Cai, Aidan Clark, Ivo Danihelka, Antoine Dedieu, Claudio Fantacci, Jonathan Godwin, Chris Jones, Ross Hemsley, Tom Hennigan, Matteo Hessel, Shaobo Hou, Steven Kapturowski, Thomas Keck, Iurii Kemaev, Michael King, Markus Kunesch, Lena Martens, Hamza Merzic, Vladimir Mikulik, Tamara Norman, George Papamakarios, John Quan, Roman Ring, Francisco Ruiz, Alvaro Sanchez, Rosalia Schneider, Eren Sezener, Stephen Spencer, Srivatsan Srinivasan, Wojciech Stokowiec, Luyu Wang, Guangyao Zhou, and Fabio Viola. The Deep Mind JAX Ecosystem, 2020. URL http://github.com/deepmind. Lukas Biewald. Experiment tracking with weights and biases, 2020. URL https://www.wandb.com/. Software available from wandb.com. Denis Boyda, Gurtej Kanwar, Sébastien Racanière, Danilo Jimenez Rezende, Michael S. Albergo, Kyle Cranmer, Daniel C. Hackett, and Phiala E. Shanahan. Sampling using SU(n) gauge equivariant flows. Phys. Rev. D, 103:074504, Apr 2021. doi: 10.1103/Phys Rev D.103.074504. URL https://link.aps.org/ doi/10.1103/Phys Rev D.103.074504. James Bradbury, Roy Frostig, Peter Hawkins, Matthew James Johnson, Chris Leary, Dougal Maclaurin, George Necula, Adam Paszke, Jake Vander Plas, Skye Wanderman-Milne, and Qiao Zhang. JAX: composable transformations of Python+Num Py programs, 2018. URL http://github.com/google/jax. Ricky T. Q. Chen, Yulia Rubanova, Jesse Bettencourt, and David Duvenaud. Neural ordinary differential equations, 2018. URL https://arxiv.org/abs/1806.07366. Pim de Haan, Corrado Rainone, Miranda C. N. Cheng, and Roberto Bondesan. Scaling up machine learning for quantum field theory with equivariant continuous flows, 2021. URL https://arxiv.org/abs/2110. 02673. Laurent Dinh, Jascha Sohl-Dickstein, and Samy Bengio. Density estimation using real nvp. ar Xiv preprint ar Xiv:1605.08803, 2016. Mathis Gerdes, Pim de Haan, Corrado Rainone, Roberto Bondesan, and Miranda C. N. Cheng. Learning lattice quantum field theories with equivariant continuous flows, 2022. URL https://arxiv.org/abs/ 2207.00283. Charles R. Harris, K. Jarrod Millman, Stéfan J. van der Walt, Ralf Gommers, Pauli Virtanen, David Cournapeau, Eric Wieser, Julian Taylor, Sebastian Berg, Nathaniel J. Smith, Robert Kern, Matti Picus, Stephan Hoyer, Marten H. van Kerkwijk, Matthew Brett, Allan Haldane, Jaime Fernández del Río, Mark Wiebe, Pearu Peterson, Pierre Gérard-Marchant, Kevin Sheppard, Tyler Reddy, Warren Weckesser, Hameer Abbasi, Christoph Gohlke, and Travis E. Oliphant. Array programming with Num Py. Nature, 585(7825):357 362, September 2020. doi: 10.1038/s41586-020-2649-2. URL https: //doi.org/10.1038/s41586-020-2649-2. Tom Hennigan, Trevor Cai, Tamara Norman, and Igor Babuschkin. Haiku: Sonnet for JAX, 2020. URL http://github.com/deepmind/dm-haiku. Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems, 33:6840 6851, 2020. Chin-Wei Huang, Laurent Dinh, and Aaron Courville. Augmented normalizing flows: Bridging the gap between generative flows and latent variable models. ar Xiv preprint ar Xiv:2002.07101, 2020. J. D. Hunter. Matplotlib: A 2d graphics environment. Computing in Science & Engineering, 9(3):90 95, 2007. doi: 10.1109/MCSE.2007.55. Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization, 2017. Published in Transactions on Machine Learning Research (05/2023) Jonas Köhler, Leon Klein, and Frank Noé. Equivariant flows: Exact likelihood generative learning for symmetric densities, 2020. URL https://arxiv.org/abs/2006.02425. Yaron Lipman, Ricky TQ Chen, Heli Ben-Hamu, Maximilian Nickel, and Matt Le. Flow matching for generative modeling. ar Xiv preprint ar Xiv:2210.02747, 2022. Xingchao Liu, Chengyue Gong, and Qiang Liu. Flow straight and fast: Learning to generate and transfer data with rectified flow. ar Xiv preprint ar Xiv:2209.03003, 2022. Bálint Máté and François Fleuret. Deformations of boltzmann distributions. ar Xiv preprint ar Xiv:2210.13772, 2022. Bálint Máté, Samuel Klein, Tobias Golling, and François Fleuret. Flowification: Everything is a normalizing flow. ar Xiv preprint ar Xiv:2205.15209, 2022. Laurence Illing Midgley, Vincent Stimper, Gregor NC Simm, Bernhard Schölkopf, and José Miguel Hernández-Lobato. Flow annealed importance sampling bootstrap. ar Xiv preprint ar Xiv:2208.01893, 2022. Radford M. Neal. Annealed importance sampling, 1998. URL https://arxiv.org/abs/physics/9803008. Kirill Neklyudov, Daniel Severo, and Alireza Makhzani. Action matching: A variational method for learning stochastic dynamics from samples. ar Xiv preprint ar Xiv:2210.06662, 2022. Kim A. Nicoli, Shinichi Nakajima, Nils Strodthoff, Wojciech Samek, Klaus-Robert Müller, and Pan Kessel. Asymptotically unbiased estimation of physical observables with neural samplers. Physical Review E, 101(2), feb 2020. doi: 10.1103/physreve.101.023304. URL https://doi.org/10.1103%2Fphysreve.101. 023304. Kim A Nicoli, Christopher J Anders, Lena Funcke, Tobias Hartung, Karl Jansen, Pan Kessel, Shinichi Nakajima, and Paolo Stornati. Estimation of thermodynamic observables in lattice field theories with deep generative models. Physical review letters, 126(3):032001, 2021. Didrik Nielsen, Priyank Jaini, Emiel Hoogeboom, Ole Winther, and Max Welling. Survae flows: Surjections to bridge the gap between vaes and flows. Advances in Neural Information Processing Systems, 33:12685 12696, 2020. Frank Noé, Simon Olsson, Jonas Köhler, and Hao Wu. Boltzmann generators sampling equilibrium states of many-body systems with deep learning, 2018. URL https://arxiv.org/abs/1812.01729. Fernando Pérez and Brian E. Granger. IPython: a system for interactive scientific computing. Computing in Science and Engineering, 9(3):21 29, May 2007. ISSN 1521-9615. doi: 10.1109/MCSE.2007.53. URL https://ipython.org. David Pfau and Danilo Rezende. Integrable nonparametric flows. ar Xiv preprint ar Xiv:2012.02035, 2020. M. Raissi, P. Perdikaris, and G.E. Karniadakis. Physics-informed neural networks: A deep learning framework for solving forward and inverse problems involving nonlinear partial differential equations. Journal of Computational Physics, 378:686 707, 2019. ISSN 0021-9991. doi: https://doi.org/10.1016/j.jcp.2018. 10.045. URL https://www.sciencedirect.com/science/article/pii/S0021999118307125. Danilo Rezende and Shakir Mohamed. Variational inference with normalizing flows. In International conference on machine learning, pp. 1530 1538. PMLR, 2015. Jascha Sohl-Dickstein, Eric Weiss, Niru Maheswaranathan, and Surya Ganguli. Deep unsupervised learning using nonequilibrium thermodynamics. In International Conference on Machine Learning, pp. 2256 2265. PMLR, 2015. Yang Song and Stefano Ermon. Generative modeling by estimating gradients of the data distribution. Advances in neural information processing systems, 32, 2019. Published in Transactions on Machine Learning Research (05/2023) Esteban G Tabak and Cristina V Turner. A family of nonparametric density estimation algorithms. Communications on Pure and Applied Mathematics, 66(2):145 164, 2013. Lorenz Vaitl, Kim A Nicoli, Shinichi Nakajima, and Pan Kessel. Gradients should stay on path: better estimators of the reverseand forward kl divergence for normalizing flows. Machine Learning: Science and Technology, 3(4):045006, oct 2022. doi: 10.1088/2632-2153/ac9455. URL https://dx.doi.org/10. 1088/2632-2153/ac9455. Guido Van Rossum and Fred L Drake Jr. Python reference manual. Centrum voor Wiskunde en Informatica Amsterdam, 1995. Hao Wu, Jonas Köhler, and Frank Noé. Stochastic normalizing flows, 2020. Omry Yadan. Hydra - a framework for elegantly configuring complex applications. Github, 2019. URL https://github.com/facebookresearch/hydra.