# Smooth Normalizing Flows

Jonas Köhler, Andreas Krämer, Frank Noé
Department of Mathematics and Computer Science, Freie Universität Berlin; Department of Physics, Freie Universität Berlin; Department of Chemistry, Rice University, Houston, TX
{jonas.koehler, andreas.kraemer, frank.noe}@fu-berlin.de

Abstract

Normalizing flows are a promising tool for modeling probability distributions in physical systems. While state-of-the-art flows accurately approximate distributions and energies, applications in physics additionally require smooth energies to compute forces and higher-order derivatives. Furthermore, such densities are often defined on non-trivial topologies. A recent example is the Boltzmann Generator for generating 3D structures of peptides and small proteins. These generative models leverage the space of internal coordinates (dihedrals, angles, and bonds), which is a product of hypertori and compact intervals. In this work, we introduce a class of smooth mixture transformations working on both compact intervals and hypertori. Mixture transformations require root-finding methods for their inversion in practice, which has so far prevented bi-directional flow training. To this end, we show that parameter gradients and forces of such inverses can be computed from forward evaluations via the inverse function theorem. We demonstrate two advantages of such smooth flows: they allow training by force matching to simulation data and can be used as potentials in molecular dynamics simulations.

1 Introduction

Generative learning using normalizing flows (NF) [50, 42, 39] has become a widely applicable tool in the physical sciences, which has e.g. been used for sampling lattice models [36, 37, 31, 4, 1], approximating the equilibrium density of molecular systems [38, 29, 56, 58], or estimating free energy differences [55, 10]. Such models approximate a target density µ via diffeomorphic maps f(·; θ): Ω ⊆ R^d → Ω by transforming samples z ∼ p_0(z) of a base density into samples x = f(z; θ) such that they follow the push-forward density

$$p_f(x; \theta) := p_0\!\left(f^{-1}(x; \theta)\right)\left|\det \partial_x f^{-1}(x; \theta)\right|. \tag{1}$$

Flows can be trained on data by maximizing the likelihood or, if µ is known up to a normalizing constant, by minimizing the reverse KL divergence D_KL[p_f(·; θ) ‖ µ]. While NFs are usually introduced as smooth diffeomorphisms, most applications like density estimation or sampling only require C^1-smooth transformations. Higher-order smoothness of flows has not been discussed so far and can become a challenge as soon as multi-modal transformations on topologies other than R^d are considered.

J.K. and A.K. contributed equally to this work.

35th Conference on Neural Information Processing Systems (NeurIPS 2021).

C^k-smoothness of NFs with k > 1 is especially important for applications in physics. Physical models usually come in the form of differential equations, where the derivatives bear a physical meaning that is often crucial to fit, evaluate, or interpret the model. Thus, the construction of expressive, smooth flow architectures will likely open up new avenues of research in this domain.

**Boltzmann Generators**  A recent application of NFs to a physical problem are Boltzmann Generators (BG) [38], which we see as the main application area for the methods introduced in this paper. They are generative models trained to sample conformations of molecules in equilibrium, which follow a Boltzmann-type distribution µ(x) ∝ exp(−u(x)).
Here u is the dimensionless potential energy defined by the molecular system and the thermodynamic state (e.g. the canonical ensemble at a certain temperature) it is simulated in. BGs can be trained by a combination of MLE on possibly biased trajectory data and simultaneous minimization of the reverse KL divergence to the target density µ(x). This bi-directional training scheme can achieve sampling of low-energy equilibrium structures. After training, BGs can be used e.g. for importance sampling or for providing efficient proposals in MCMC applications [45].

In the context of BGs, the gradient of the log push-forward density with respect to x, i.e. the negative gradient of the flow energy, corresponds to the atomistic forces. Access to well-behaved forces is pivotal in most classical molecular computations: they simplify optimization of molecular models, drive molecular dynamics (MD) simulations, enable computations of macroscopic observables, and facilitate multiscale modeling. In this work, we primarily focus on two important implications of smooth flow forces. First, we show that they enable the training of NFs via force matching (FM), i.e. minimizing the force mean-squared error with respect to reference data. By combining FM with density estimation, NFs are pushed to match both the distributions and their derivatives, which implicitly leads to favorable regularization. Second, we apply flow forces to drive dynamics simulations. Such simulations are required in many applications to not only sample the target distribution but also compute dynamical observables such as transition rates or diffusivities.

**Respecting Topological Constraints**  Many physical models operate on non-trivial topologies, e.g. the d-dimensional hypertorus T^d, which can become an obstacle in constructing smooth flows. An important example are BGs for peptides and small proteins, which require an internal coordinate (IC) transformation to achieve low-energy sampling of structures [38]. This non-trainable layer transforms Euclidean coordinates x ∈ R^{n×3} into a representation given by distances d ∈ [a_1, b_1] × · · · × [a_{n−1}, b_{n−1}], angles α ∈ [0, π]^{n−2}, and dihedral torsion angles τ ∈ T^{n−3}. As molecular energies are often invariant to translation and rotation, IC transformations are useful as they are bijective with the equivalence class of all-atom coordinates that are identical up to global rotations and translations. The learning problem can then be reduced to modeling the joint distribution µ(d, α, τ), which is supported on the non-trivial topological space X_IC := I^{2n−3} × T^{n−3}, where I = [0, 1] denotes the closed unit interval.

Recent work [38, 56] suggested to model the density within an open set Ω ⊂ X_IC, leverage C^∞-smooth normalizing flows defined on R^d, and then prevent significant mass around singular points using regularizing losses. Such an approach, however, can lead to bias and ill-behaved training and requires a-priori knowledge of the support of the densities. Later work [9] approached the problem using C^1-smooth spline flows, which leads to accurate samples but results in broken forces. To overcome these limitations while still benefiting from the merits of prior work, such as bidirectional training and fast forward/inverse transformations, we formulate the following desiderata for flow transformations on top of IC layers:

(A) They must have support on I^d and T^d.
(B) They should be C^∞-smooth.
(C) They must allow bi-directional training.
(D) Forward and inverse direction must be efficient to evaluate.
Satisfying (D) can be achieved using coupling layers [11, 12]. This reduces the problem to finding element-wise conditional transformations satisfying (A–C).

**Contributions**  In this work we propose the following novelties:

- We present a new C^∞-smooth transformation with compact support which can simultaneously be used for expressive transformations on I^d as well as the hypertorus T^d. This satisfies (A) and (B).
- We present a novel algorithm which allows optimizing non-analytic inverse directions of flows that can only be evaluated via black-box root-finding methods. This satisfies (C).
- We show that training of smooth flows through combinations of force matching, density estimation, and energy-based training can achieve nearly perfect equilibrium densities.
- We show that forces of such smooth flows can be used in dynamical simulations.

2 Related Work

While related work exists for each of the requirements (A–D) above, none of it provides a framework that addresses all of them. Specifically, multi-modality in element-wise flow transformations has been approached using splines [13, 35], monotonic polynomials [26, 41], monotonic neural networks which directly approximate an inverse CDF transformation [22, 8], mixtures of unimodal transformations [21, 43], and stochastic transition steps [56, 7]. Flows on non-trivial manifolds have been discussed for hypertori and hyperspheres [43], Lie groups [16, 4], hyperbolic spaces [3], as well as on general manifolds [15, 18, 33, 32, 5, 27]. Applications of flows to molecular systems include approximation of the equilibrium density [38, 56, 29], generation of molecular conformations [58, 44], and free energy differences [55, 10]. Using forces to train neural network potentials of molecules was done e.g. in [53, 24]. Backpropagation through black-box functions was generally discussed in [19]. More related to our work is Bai et al. [2], who discuss training models whose outputs are obtained through fixed-point equations. Finally, Shirobokov et al. [46] discuss backpropagation through black-box simulators using surrogate models.

Force matching is rarely used for training generative models in machine learning due to the absence of an (unnormalized) target density. However, a common method to train unnormalized energy-based models purely on data is given by score matching (SM) [25, 47]. SM assumes an unknown density for the data distribution and attempts to match forces implicitly, e.g. via sliced [49] or denoising [52] score matching. A survey on score matching and its relation to other approaches of training energy-based models on data is given e.g. in Song and Kingma [48]. We discuss related work on flows for I and the unit circle in detail in Section 4.

3 Incorporating Forces into Flow Training

NFs are most commonly trained by minimizing the negative log likelihood (NLL),

$$\mathcal{L}_{\mathrm{NLL}}(\theta) := \mathbb{E}_{x \sim \mu(x)}\left[-\log p_f(x; \theta)\right] = D_{\mathrm{KL}}\left[\mu \,\|\, p_f(\cdot; \theta)\right] + \mathrm{const}, \tag{2}$$

or by minimizing the reverse KL divergence (KLD)

$$\mathcal{L}_{\mathrm{KLD}}(\theta) := D_{\mathrm{KL}}\left[p_f(\cdot; \theta) \,\|\, \mu\right] + \mathrm{const}. \tag{3}$$

BGs as discussed in Noé et al. [38] use a convex combination of the two in order to avoid mode collapse while still being able to achieve low-energy samples. If reference forces f(x) = −∇_x u(x) corresponding to samples x are available and f(·; θ) is at least C^2-smooth, the optimization can naturally be augmented by the force mean-squared error:

$$\mathcal{L}_{\mathrm{FM}}(\theta) := \mathbb{E}_{x \sim \mu(x)}\left[\left\| f(x) - \nabla_x \log p_f(x; \theta) \right\|_2^2\right]. \tag{4}$$

As was shown in Wang et al. [53], such force matching can lead to unbiased potential surfaces even if the reference configurations are not sampled from equilibrium.
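To make the force-matching objective concrete, the following PyTorch-style sketch shows one way Eq. (4) can be estimated with automatic differentiation. The `flow_log_prob` callable, the tensor shapes, and the batch reduction are illustrative assumptions rather than the reference implementation; the only requirement implied by the text is that the flow density is at least C^2-smooth so that the loss remains differentiable with respect to θ.

```python
import torch

def force_matching_loss(flow_log_prob, x, ref_forces):
    """Monte-Carlo estimate of Eq. (4).

    flow_log_prob: callable returning log p_f(x; theta) per sample (hypothetical API)
    x:             (batch, dim) configurations drawn from the target density mu(x)
    ref_forces:    (batch, dim) reference forces f(x) = -grad_x u(x)
    """
    x = x.detach().requires_grad_(True)
    log_p = flow_log_prob(x).sum()
    # Flow forces: -grad_x u_theta(x) = grad_x log p_f(x; theta).
    # create_graph=True keeps the graph so the loss can still be backpropagated to theta.
    (flow_forces,) = torch.autograd.grad(log_p, x, create_graph=True)
    return ((ref_forces - flow_forces) ** 2).sum(dim=-1).mean()
```

In a combined objective such as Eq. (5) below, this term would simply be added to the NLL and KLD estimates with weight ω_f.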
Similarly to Noé et al. [38], we can thus define a loss function for smooth flows as the convex combination

$$\mathcal{L}(\theta) = \omega_n \mathcal{L}_{\mathrm{NLL}}(\theta) + \omega_k \mathcal{L}_{\mathrm{KLD}}(\theta) + \omega_f \mathcal{L}_{\mathrm{FM}}(\theta). \tag{5}$$

4 Smooth flows on the closed interval and the unit circle

We now discuss how to achieve smooth flows on I^d and T^d. To unify the discussion, we consider the unit circle S^1 to be the quotient space I/∼ using the relation x ∼ x′ ⟺ (x = x′) ∨ (x = 0 ∧ x′ = 1) ∨ (x′ = 0 ∧ x = 1). The d-dimensional hypertorus is given by the direct product T^d = S^1 × · · · × S^1. Following the usual definition, we say that f is C^k-smooth on a compact interval [a, b] iff there exists a C^k-smooth continuation of f on an open set Ω ⊃ [a, b].

**Smooth flows on I**  Any smooth diffeomorphism on R or on an open interval (a, b) ⊃ I can be restricted to I and re-scaled to satisfy this requirement. Let f: Ω → Ω be C^k-smooth for some open set Ω with I ⊂ Ω ⊆ R. Then f̃(x) := (f(x) − f(0))/(f(1) − f(0)) defines a C^k-smooth diffeomorphism on I. Simple C^∞ transformations on R are given by affine transformations [12]. After re-scaling, however, any affine map reduces to the identity mapping and thus cannot model complex densities. Powerful and computationally efficient multi-modal transformations on I can be achieved using rational-quadratic splines [13]. Those are C^1-smooth and thus result in discontinuous forces, which can be disadvantageous for physical applications. Other possible C^∞-smooth candidate transformations with analytic forward evaluation are mixtures of logistic transformations [21] or deep sigmoid / deep dense sigmoid flows [22]. Those are only continuous on (0, 1), and thus special care has to be taken to avoid problematic behavior at the tails. Finally, multi-modal C^∞ transformations on R can be achieved using non-analytic methods [6, 54]. Yet computational costs (solving and backpropagating over ODEs, inverting quadrature integrations via bisections) and numerical accuracy (non-analytic forward passes) limit their applicability, e.g. when used in deep coupling layers, which are required to capture multivariate correlations, or when trying to match densities up to high accuracy, as necessary for molecular modeling.

**Smooth flows on S^1**  Flows on the hypertorus were discussed in Rezende et al. [43], who introduced necessary conditions for the transformations to define C^0-continuous densities such that they can be used in density estimation tasks. In their work, Rezende et al. [43] introduced three candidates satisfying this condition: mixtures of non-compact projections (NCP), rational-quadratic circular spline flows (CSF), and mixtures of Möbius transformations (MoMT). While NCPs and CSFs satisfy C^1-continuity and thus render forces discontinuous, only MoMTs define smooth densities. While MoMTs trivially also define C^∞-flows on I, their periodicity would be limiting when applied to non-periodic densities.

**Smooth compact bump functions**  In addition to this previous work, we leverage a third way to construct smooth transformations which works for both I and S^1. Here we follow a general construction principle for smooth bump functions, as explained e.g. in Tu [51]. All proofs can be found in the supplementary material. First, define a C^k-smooth and strictly increasing ramp function ρ: I → I with ρ(0) = 0 and ρ(1) = 1. Second, define the generalized sigmoid

$$\sigma[\rho](x) := \frac{\rho(x)}{\rho(x) + \rho(1 - x)}. \tag{6}$$

Then σ[ρ] is a C^k-diffeomorphism on I, and furthermore all derivatives up to order k vanish at the boundary.
The first derivative of σ[ρ] defines a non-negative smooth bump function with support [0, 1] and maximum at 0.5. We can introduce a concentration parameter a ∈ R_{>0} and a location parameter b ∈ [0, 1], which makes

$$g(x) := \sigma[\rho]\!\left(a\,(x - b) + \tfrac{1}{2}\right)$$

a smooth bijection from [−1/(2a) + b, 1/(2a) + b] to [0, 1] with all values and higher-order derivatives vanishing outside this domain. Furthermore, by introducing c ∈ (0, 1] and setting

$$f(x) := (1 - c)\,\frac{g(x) - g(0)}{g(1) - g(0)} + c\,x \tag{7}$$

we can define flexible unimodal C^k-diffeomorphisms on [0, 1]. When used within coupling layers, we let a, b, c be the output of some not further constrained neural network.

**C^k and C^∞ ramp functions**  There are many possible choices for ramp functions satisfying the above-mentioned properties. A simple C^k ramp function is given by the k-th order monomial ρ(x) = x^k. More interestingly, C^∞ smoothness on [0, 1] can be achieved using the ramp ρ(x) = exp(−1/(α x^β)) for x > 0 and ρ(x) = 0 for x ≤ 0. Here we set α > 0 and β ≥ 1. We make α a trainable parameter. Optimizing β is possible in principle, yet fixing β ∈ {1, 2} and treating it as a hyper-parameter stabilized training and led to better results.

**Smooth and efficient circular wrapping**  As discussed in Rezende et al. [43] and Falorsi et al. [16], we can turn any C^k-smooth density p(x) with support on R into a C^k-smooth density on S^1 using the marginalization p̂(x) = Σ_{k ∈ Z} p(x + k). This construction is generally problematic for two reasons: (i) non-vanishing tails (e.g. if p(x) is Gaussian) require evaluating an infinite series which in most cases has no analytic expression and can be numerically challenging; (ii) smooth densities with compact support in some interval [a, b], for example sigmoidal transforms, can introduce non-smooth behavior at the interval boundaries. Smooth and compactly supported transformations as introduced above do not suffer from this: (i) due to the compact support the series always has a finite number of non-vanishing contributions, and (ii) as the transformations are C^k on all of R and all derivatives are compactly supported, wrapping always results in a C^k-smooth density on S^1.

**Multi-modality via mixtures**  Similarly, as discussed in prior work [21, 43], any convex sum of C^k-diffeomorphisms on I defines a C^k-diffeomorphism on I. Thus we can combine multiple of the unimodal transformations defined in Eq. (7) to obtain arbitrarily complex multi-modal transformations on I and S^1, see Fig. 1.

Figure 1: Construction of mixture transformations on compact intervals (left) and hypertori (right). The upper and lower row depict probability densities and cumulative distribution functions, respectively. Multiple unimodal bump functions [a) and d)] are added to a small but finite density [b) and e)] and combined to yield bijective multimodal transformations [c) and f)].
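The sketch below illustrates the construction of this section in PyTorch: the C^∞ ramp, the generalized sigmoid of Eq. (6), a single bump transformation with parameters a, b, c, and a convex mixture of several bumps. The normalization of the bump by g(0) and g(1), the softmax parameterization of the mixture weights, and the omission of the circular (wrapped) variant are illustrative assumptions; the paper's actual parameterization may differ in detail.

```python
import torch

def smooth_ramp(x, alpha=1.0, beta=1):
    """C-infinity ramp: rho(x) = exp(-1 / (alpha * x**beta)) for x > 0, and 0 otherwise."""
    z = torch.clamp(x, min=1e-12)  # avoid division by zero; exp(-1/1e-12) underflows to 0 anyway
    return torch.where(x > 0, torch.exp(-1.0 / (alpha * z ** beta)), torch.zeros_like(x))

def smooth_sigmoid(x, alpha=1.0, beta=1):
    """Generalized sigmoid of Eq. (6): sigma[rho](x) = rho(x) / (rho(x) + rho(1 - x))."""
    num = smooth_ramp(x, alpha, beta)
    return num / (num + smooth_ramp(1.0 - x, alpha, beta))

def bump_transform(x, a, b, c, alpha=1.0, beta=1):
    """Unimodal C-infinity diffeomorphism on [0, 1] in the spirit of Eq. (7):
    a bump CDF g(x) = sigma[rho](a (x - b) + 1/2), rescaled to fix 0 and 1,
    mixed with the identity so the density keeps a uniform floor of weight c."""
    g = lambda t: smooth_sigmoid(a * (t - b) + 0.5, alpha, beta)
    g0, g1 = g(torch.zeros_like(x)), g(torch.ones_like(x))
    return (1.0 - c) * (g(x) - g0) / (g1 - g0) + c * x

def mixture_transform(x, a, b, c, logits, alpha=1.0, beta=1):
    """Convex combination of K unimodal bumps: a multimodal diffeomorphism on [0, 1].
    a, b, c, logits have shape (..., K) and broadcast against x[..., None]; in a
    coupling layer they would be produced by an unconstrained conditioner network."""
    f_k = bump_transform(x[..., None], a, b, c, alpha, beta)  # (..., K)
    w = torch.softmax(logits, dim=-1)                         # convex mixture weights
    return (w * f_k).sum(dim=-1)
```

For the circle, each compactly supported bump would additionally be summed over its integer shifts before mixing, as described above, which keeps the wrapped density C^k-smooth and periodic.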
5 Optimizing non-analytic inverse flows

A drawback of mixture-based flows is their lack of an analytic inverse. Prior work such as [21, 43] suggested relying on black-box inversion methods like a bisection search to sample from trained models. While this leads to accurate samples, the discrete nature of such black-box oracles does not allow minimizing L(θ) as defined in Eq. (5). A possible remedy can be derived using the inverse function theorem. Let f(·; θ): Ω ⊆ R → Ω be a scalar diffeomorphism and let furthermore x = f^{-1}(y; θ) be obtained via a black-box inversion algorithm. To minimize losses depending on x(y; θ) or log |∂_y x(y; θ)| we need to compute gradients with respect to y and θ. Here we derive the following relations for the derivatives (all details in the supplementary material):

$$\partial_y x(y; \theta) = \left(\partial_x f(x; \theta)\right)^{-1} \tag{8}$$

$$\partial_\theta x(y; \theta) = -\left(\partial_x f(x; \theta)\right)^{-1} \partial_\theta f(x; \theta) \tag{9}$$

$$\partial_y \log\left|\partial_y x(y; \theta)\right| = -\left(\partial_x f(x; \theta)\right)^{-1} \partial_x \log\left|\partial_x f(x; \theta)\right| \tag{10}$$

$$\partial_\theta \log\left|\partial_y x(y; \theta)\right| = \left(\partial_x f(x; \theta)\right)^{-1} \left(\partial_x \log\left|\partial_x f(x; \theta)\right|\, \partial_\theta f(x; \theta) - \partial_\theta \partial_x f(x; \theta)\right) \tag{11}$$

We remark that Eq. (9) corresponds to the implicit reparameterization gradient as introduced in Figurnov et al. [17]. Using these rules, we can first compute x via a black-box oracle for any input y and then compute all necessary gradients using forward evaluations and derivatives of f(·; θ). Derivatives of f(·; θ) are usually accessible anyway for computing the log Jacobian of the transformation. Higher-order derivatives can be obtained using automatic differentiation. It is easy to see that this construction extends to multivariate diffeomorphisms with diagonal Jacobian as used in coupling layers. It can be generalized to arbitrary higher-order derivatives using the Faà di Bruno formula, e.g. when aiming to do force matching through a black-box inversion algorithm. As our experiments do not require losses involving more than second-order derivatives in the inverse direction, we leave this for future work. Note that Equations (8)–(11) are not tied to mixture models and can be applied to any other element-wise flow transformation that lacks an analytic inverse.

Another problem of bisection is its sequential execution, which can become prohibitively slow during training; e.g. achieving an error of 10^{-6} requires around 20 iterations. To obtain a speedup on GPUs, we suggest generalizing the classic binary search to searching in a grid of K bins simultaneously (see details in the supplementary material). For K = 2 this results in the usual bisection method. By leveraging vectorized execution we can reduce the number of iterations by a factor of O(log K) at the expense of increasing memory by a factor of O(K). The practical speedup depends on the actual number of parallel workers, and thus the optimal choice of K varies with dimensionality, batch size, and the number of mixture components.
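As a sketch of how Eqs. (8)–(11) can be used in practice, the snippet below inverts an element-wise monotone transformation with a vectorized K-bin bisection and then re-attaches gradients by a single differentiable correction step around the detached root. Autograd through this correction reproduces the inverse-function-theorem gradients (8) and (9); evaluating −log |∂_x f(x; θ)| at the returned x then yields (10) and (11) automatically. This is one possible realization under stated assumptions (element-wise, strictly increasing f that broadcasts over an extra trailing grid dimension), not necessarily the authors' implementation.

```python
import torch

def multibin_bisection(f, y, left, right, n_iter=20, n_bins=8):
    """Solve f(x) = y element-wise for strictly increasing f on [left, right].
    Each iteration shrinks the bracket by a factor n_bins (n_bins=2 is ordinary bisection).
    f must accept inputs with one extra trailing grid dimension of size n_bins + 1."""
    with torch.no_grad():
        lo = torch.full_like(y, left)
        hi = torch.full_like(y, right)
        t = torch.linspace(0.0, 1.0, n_bins + 1, device=y.device, dtype=y.dtype)
        for _ in range(n_iter):
            grid = lo[..., None] + (hi[..., None] - lo[..., None]) * t
            below = f(grid) <= y[..., None]              # monotone f -> a prefix of True values
            idx = below.sum(dim=-1).clamp(min=1) - 1     # last grid point with f(grid) <= y
            lo = torch.gather(grid, -1, idx[..., None]).squeeze(-1)
            hi = torch.gather(grid, -1, (idx + 1).clamp(max=n_bins)[..., None]).squeeze(-1)
        return 0.5 * (lo + hi)

def inverse_with_implicit_grad(f, y, left=0.0, right=1.0, **kwargs):
    """Black-box inverse x = f^{-1}(y) whose gradients w.r.t. y and the parameters
    inside f follow Eqs. (8)-(9)."""
    x0 = multibin_bisection(f, y, left, right, **kwargs).detach()
    # local slope f'(x0), detached (the denominator in Eq. (8))
    x_tmp = x0.clone().requires_grad_(True)
    with torch.enable_grad():
        dfdx = torch.autograd.grad(f(x_tmp).sum(), x_tmp)[0].detach()
    # One differentiable correction step: the value stays (numerically) x0, while
    # autograd sees dx/dy = 1/f'(x0) and dx/dtheta = -d_theta f(x0) / f'(x0).
    return x0 + (y - f(x0)) / dfdx
```

With n_bins = 2 this reduces to the ordinary bisection search; larger K trades memory for fewer sequential iterations, as discussed above.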
6 Experiments

The numerical experiments in this work are tailored to show the benefits of smooth flows over non-smooth flows. Therefore, we mainly compare mixtures of bump functions to neural spline flows [13], which exhibit state-of-the-art generative performance but whose densities are only first-order continuously differentiable.

6.1 Illustrative Toy Example

Figure 2: a) Reference density and corresponding force field as approximated by b) a C^1-smooth NSF and c) a C^∞-smooth flow using the mixture of bump functions introduced in Sec. 4.

To highlight the difference in terms of smoothness, the two flow architectures were applied to a two-dimensional toy example. Fig. 2 a) depicts the reference energy and forces on samples from the ground-truth distribution. Both flows were trained through density estimation on those samples (see SI for details). The smooth flows and spline flows matched the density well, which demonstrates their expressivity. However, the force field of the spline flow contained dramatic outliers, while the forces of the smooth flow matched the regular behavior of the reference forces.

6.2 Runtime Comparison

While the rational-quadratic splines are analytically invertible, mixture transformations obviously increase the computational cost of the inversion due to the iterative root-finding procedure. To quantify this gap in performance, the inverse evaluation of a spline flow was compared with a modified version, where the analytic inverse was replaced by the multi-bin bisection from Section 5. The performance of the bisection was evaluated for different numbers of bins K = 2^m, m = 1, ..., 8, and the most efficient K was picked for each input size (dim). Both transformations were coupled to a conditioner with the same input size (dim) and 64 hidden neurons. For small tensor dimensions (2–32), the optimal multi-bin bisection employed up to 256 bins on the GPU, which resulted in only a factor of 2–3 slowdown compared to analytic inversion. For larger dimensions (2048), the parallelization over multiple bins became less effective, leading to one order of magnitude difference in computational cost. The compute times and optimal bin sizes were comparable for the inverse network pass and its backward pass. More details on these runtime comparisons can be found in the supplementary material.

6.3 Smooth Flows Enable Boltzmann Generator Training through Force Matching

To demonstrate the advantages of smooth flows, we train a Boltzmann generator (BG) for a small molecule, alanine dipeptide, which is described in the supplementary material. This is a common test case for molecular computations and was previously considered for sampling with normalizing flows in [56, 9, 30]. The molecular system has 60 degrees of freedom (global rotation and translation excluded). Its potential energy surface is highly sensitive to small displacements [30] and contains singularities.

Figure 3: a) Generated sample structures. b) Torsion distribution after training through root-finding. c) Smooth normalizing flow trained on alanine dipeptide through a combination of force matching and density estimation. The top and bottom row show the performance on 10,000 samples each from the holdout test set and from the flow, respectively. Left: joint distribution of backbone torsion angles. Center: scatter plot of flow vs. target force components. Right: energy histograms for the flow energy and the target energy. (Flow energies were shifted by a constant so that the minimum energy matched the minimum energy from the molecular potential.)

In previous work [56, 9, 30], BGs for alanine dipeptide used stochastic augmentation of the base space. The introduction of noise variables facilitates creating expressive normalizing flows but prevents computation of deterministic forces. Others have used spline flows to sample molecular configurations [55, 10]. As a proof of concept for the force matching loss, we trained smooth flows for alanine dipeptide using a 1000:1 weighting between the force matching residual and the negative log likelihood; see the supplementary material for details on the flow and training setup. No stochastic augmentation or energy-based training was used, and no reweighting was conducted for postprocessing. Figure 3 c) compares the flow distribution with the target Boltzmann distribution on the test data from MD and on flow samples. The left column shows that the flow has almost perfectly learned the nontrivial joint distribution between the two backbone torsion angles φ and ψ, including the sparsely populated metastable region around φ ≈ 1. In contrast to previous work that used affine coupling layers [56], the modes are cleanly separated, see also Fig. SI-3 for a direct comparison. The center column compares flow forces with target forces.
Those forces are highly sensitive to small perturbations of the molecular configurations. Nevertheless, they matched up to root-mean-square deviations of 25 and 46 k_B T/nm in the target ensemble and the generated ensemble, respectively. Note that the training was stopped after 10 epochs, where the validation force matching residual had not yet fully converged, so that even further improvements may be possible with longer training or extensive hyperparameter optimization. A more detailed analysis of the forces and sampling efficiency of the BGs is shown in the supplementary material, Figure SI-4. Smooth flows trained by a combination of density estimation and FM attain a favorable sampling efficiency of 42%, compared to 25% and < 1% for spline and RealNVP flows, respectively.

Finally, the right column of Figure 3 c) depicts the distribution of flow and target energies evaluated on the same samples. The flow energies were shifted so that the minimum energy matched the molecular mechanics energy. It has not escaped our notice that this constant offset (189 k_B T on the test set and 190 k_B T on the flow samples) corresponds to the log partition function, a quantity whose intractability has complicated research in statistical mechanics for decades. The energy distribution of the flow tracked the ground-truth distribution almost perfectly, even though no target energies or forces were evaluated on the flow samples during training. This close-to-perfect match demonstrates that including force information in the training process presents an efficient means of regularization. Figure 3 a) shows representative conformations generated by the BG, which confirm the high quality of the molecular structures.

6.4 Boltzmann Generator Training through Black-Box Root-Finding

Table 1: Test set metrics of alanine dipeptide training with different losses: negative log likelihood (NLL), force matching error (FME), and reverse Kullback-Leibler divergence (KLD). Weighted loss functions were used as defined in Eq. (5) with weight factors ω_k, ω_f and ω_n = 1 − ω_k − ω_f. Metrics were recorded after 10 training epochs. Statistics are means and 2 standard errors over 10 replicas for each experiment. The lowest value with respect to each metric is highlighted in bold type.

| Metric | Method | ω_k = 0, ω_f = 0 | ω_k = 0, ω_f = 0.001 | ω_k = 0.1, ω_f = 0 | ω_k = 0.1, ω_f = 0.001 |
|---|---|---|---|---|---|
| NLL | spline | -210.32 (± 0.16) | – | -196.40 (± 1.53) | 48.13 (± 95.37) |
| | smooth | -211.04 (± 0.09) | **-211.40 (± 0.05)** | -206.33 (± 2.04) | -208.70 (± 1.93) |
| FME ×10⁴ | spline | 12.13 (± 4.94) | – | 313.64 (± 107.92) | 1498.98 (± 1935.72) |
| | smooth | 1.04 (± 0.08) | **0.32 (± 0.02)** | 5.80 (± 2.30) | 0.48 (± 0.05) |
| KLD | spline | 263.77 (± 3.21) | – | 230.08 (± 12.38) | 1205.69 (± 36.25) |
| | smooth | **193.91 (± 0.39)** | 195.17 (± 0.43) | 219.03 (± 22.16) | 207.81 (± 11.64) |

Two experiments were conducted to demonstrate that flows can be trained through a black-box root-finding algorithm. First, the BG from Section 6.3 was trained with all smooth transformations operating in the inverse direction. Consequently, inverse problems had to be solved when computing the negative log likelihood with respect to data, while the sampling occurred analytically. Figure 3 b) shows the joint marginal distribution of the backbone torsions on the BG samples. The distribution matches the data and the smooth flows that were trained in forward mode (see Fig. 3). This result shows that training with gradients computed from the inverse function theorem is feasible.
Second, flows were trained by different combinations of density estimation, force matching, and energy-based training, where computing ∇_θ L_KLD requires the gradients of the inverse flow. The training of spline flows with inclusion of the force matching error (ω_f > 0) was notoriously unstable. This led to 10/10 and 7/10 failed runs for ω_k = 0 and ω_k = 0.1, respectively. This is understandable from the discontinuity of the spline forces. Even the combination of NLL and KLD led to instabilities for 2/10 runs. (Gradients were not clipped during training to enable a mostly unbiased comparison.) In contrast, all training runs with our smooth flows concluded successfully. Table 1 shows the metrics on the test set after 10 training epochs using spline flows and smooth flows (see SI for specifics about the flow architecture and training). With pure density estimation (ω_k = ω_f = 0), both architectures achieved similar NLL. However, the smooth flow achieved much lower FME and KLD. The NLL and FME were further improved when force data was presented to the flow during training. Including the reverse KLD in the loss yielded reasonable metrics for the smooth flows, indicating that the backpropagation through root-finding was numerically stable. However, the metrics were consistently worse than with pure NLL training for both types of flows (spline and smooth) and were subject to large fluctuations. In contrast, including the FME proved to be a more stable way to include information about the target energy into the training.

6.5 Smooth Flows as Potentials in Molecular Dynamics

Figure 4: Total energy fluctuation in simulations with a classical force field and with two flows that employed smooth transforms and neural spline transforms, respectively. For each potential, 10 simulations were run starting from 10 different initial configurations (grey color). One run each is highlighted in black. Subfigures c) and d) show identical data on different scales.

A challenging test for the smoothness of a flow is its use as a potential energy in a dynamics simulation. Especially simulations in the microcanonical ensemble are extremely sensitive to numerical errors, as the symplectic integration does not impose any additional stabilizing mechanism on the total energy. This means that any inconsistencies between energies and forces on the scale of one integration step easily cause drifting, strongly fluctuating, or exploding energies. Therefore, we ran MD simulations using the flow energy u_θ = −log p_f(·; θ) from Section 6.3 as the potential. For comparison, simulations were also run with spline flows that were trained purely by density estimation. Simulations started from stable structures of an MD simulation and were equilibrated in each potential for 1 ps using a Langevin thermostat with a 10/ps friction coefficient. Figure 4 depicts the evolution of the total energy over 5 ps simulations in the microcanonical ensemble. As expected, the classical simulation kept the total energy roughly constant within a standard deviation of 6·10^{-4} kJ/mol per degree of freedom. Astoundingly, the smooth flow potentials also maintained the energies within 8·10^{-4} kJ/mol per degree of freedom using the same 1 fs time step. In contrast, the simulations with spline flow potentials quickly fell apart, with potential and kinetic energies growing out of bounds. Those instabilities persisted even with an order-of-magnitude smaller time step (not shown). While the infeasibility of spline flows for this task was expected, the competitive behavior of smooth flows with molecular mechanics force fields that were tailored for dynamics highlights their regularity and the consistency of their forces.
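For illustration, the following sketch shows how a microcanonical simulation of the kind described above could be driven by flow forces obtained from automatic differentiation. The velocity Verlet integrator, the unit conventions, and the `flow_log_prob` callable are assumptions made for a self-contained example; the experiments in the paper additionally used Langevin equilibration and a classical MD setup as reference.

```python
import torch

def flow_forces(flow_log_prob, x):
    """Forces from the flow potential u_theta(x) = -log p_f(x; theta): F = grad_x log p_f."""
    x = x.detach().requires_grad_(True)
    log_p = flow_log_prob(x).sum()
    return torch.autograd.grad(log_p, x)[0]

def velocity_verlet(flow_log_prob, x, v, masses, dt, n_steps):
    """Energy-conserving (NVE) velocity Verlet integration using flow forces.
    x, v: (batch, dim) positions and velocities; masses broadcastable to (batch, dim).
    dt and masses must be chosen consistently with the flow's reduced energy units."""
    f = flow_forces(flow_log_prob, x)
    for _ in range(n_steps):
        v = v + 0.5 * dt * f / masses
        x = x + dt * v
        f = flow_forces(flow_log_prob, x)
        v = v + 0.5 * dt * f / masses
    return x, v
```

Monitoring the sum of kinetic and flow potential energy along such a trajectory gives exactly the total-energy fluctuation plotted in Figure 4.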
7 Discussion

**Limitations and outlook**  Despite the promising results, we mention the following possible limitations of the method in its present state and discuss possible improvements for future work. First, smooth flows as implemented in this work impose numerical overhead in comparison to non-smooth alternatives such as spline flows. While this can be considered a question of concrete engineering, it also motivates the search for more efficient smooth alternatives. In addition, the differentiation rules for black-box inverses provide correct algebraic gradients, yet for strongly multi-modal distributions maintaining numerical stability can become a challenge. Future work should thus focus on studying and improving the numerical stability of bidirectional training. While potentials can be trained to an accuracy where they allow for energy-conserving simulations, it remains to be shown that they also provide accurate equilibrium distributions in long-running simulations. At this point a major bottleneck is the inferior per-step evaluation performance of a flow-based force field compared to common force-field implementations. Furthermore, internal coordinates are a very efficient representation space for smaller polypeptides. However, they are difficult to scale to large proteins, protein complexes, or systems with non-trivial topologies, e.g. proteins with disulfide bonds. Integrating recent advances in graph-based molecular representations with force-matched flows will likely be required for a successful modeling of such systems. Finally, training with forces improves the resulting potential compared to training with samples alone. However, for many MD datasets forces have not been stored, and recomputing them requires a significant computational overhead.

**Conclusion**  We introduced a range of contributions which can improve upcoming work on applying flows to physical problems. The proposed approach for backpropagation through black-box root-finders can help to obviate the search for analytically invertible transformations if bi-directional training is necessary. As we show, the method works especially well for relatively low-dimensional problems. Scaling the approach, generalizing it to non-diagonal Jacobians, and improving its numerics are interesting questions for future work. The MD simulation example shows that C^∞-smooth flows on non-trivial topologies can open new avenues of research as well as new applications for flows. By carefully respecting the topological domain as well as the smoothness of the target potential, we can approximate the equilibrium density of a small peptide nearly perfectly. Finally, we showed that incorporating force information together with MLE outperforms the state-of-the-art approach of combining MLE with minimization of the reverse KL divergence. It does not suffer from mode collapse, yet includes information about the target potential, which is required to achieve low-energy samples. Combined, these results could improve methods used for learning the mean potential surface of coarse-grained molecules, pave the way to multi-scale flows for modeling large protein systems, and eventually help to accelerate simulations and sampling.

Acknowledgements

We thank the anonymous reviewers at NeurIPS 2021 for their insightful comments.
Special thanks to Yaoyi Chen for setting up the alanine dipeptide system and to Manuel Dibak, Leon Klein, Oana-Iuliana Popescu, Leon Sixt, and Michael Figurnov for helpful discussions, feedback on the manuscript, and pointers to related work. We acknowledge funding from the European Commission (ERC CoG 772230 ScaleCell), Deutsche Forschungsgemeinschaft (GRK DAEDALUS, SFB1114/A04), the German Ministry for Education and Research (Berlin Institute for the Foundations of Learning and Data BIFOLD), and the Berlin Mathematics center MATH+ (Projects AA1-6 and EF1-2).

References

[1] M. Albergo, G. Kanwar, and P. Shanahan. Flow-based generative models for markov chain monte carlo in lattice field theory. Physical Review D, 100(3):034515, 2019.
[2] S. Bai, J. Z. Kolter, and V. Koltun. Deep equilibrium models. Advances in Neural Information Processing Systems, 32:690–701, 2019.
[3] J. Bose, A. Smofsky, R. Liao, P. Panangaden, and W. Hamilton. Latent variable modelling with hyperbolic normalizing flows. In International Conference on Machine Learning, pages 1045–1055. PMLR, 2020.
[4] D. Boyda, G. Kanwar, S. Racanière, D. J. Rezende, M. S. Albergo, K. Cranmer, D. C. Hackett, and P. E. Shanahan. Sampling using SU(N) gauge equivariant flows. Physical Review D, 103(7):074504, 2021.
[5] J. Brehmer and K. Cranmer. Flows for simultaneous manifold learning and density estimation. Advances in Neural Information Processing Systems, 33, 2020.
[6] T. Q. Chen, Y. Rubanova, J. Bettencourt, and D. K. Duvenaud. Neural ordinary differential equations. In Advances in Neural Information Processing Systems, pages 6571–6583, 2018.
[7] R. Cornish, A. Caterini, G. Deligiannidis, and A. Doucet. Relaxing bijectivity constraints with continuously indexed normalising flows. In International Conference on Machine Learning, pages 2133–2143. PMLR, 2020.
[8] N. De Cao, W. Aziz, and I. Titov. Block neural autoregressive flow. In Uncertainty in Artificial Intelligence, pages 1263–1273. PMLR, 2020.
[9] M. Dibak, L. Klein, and F. Noé. Temperature-steerable flows. arXiv preprint arXiv:2012.00429, 2020.
[10] X. Ding and B. Zhang. DeepBAR: A fast and exact method for binding free energy computation. The Journal of Physical Chemistry Letters, 12(10):2509–2515, 2021.
[11] L. Dinh, D. Krueger, and Y. Bengio. NICE: Non-linear independent components estimation. arXiv preprint arXiv:1410.8516, 2014.
[12] L. Dinh, J. Sohl-Dickstein, and S. Bengio. Density estimation using real NVP. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24–26, 2017, Conference Track Proceedings. OpenReview.net, 2017. URL https://openreview.net/forum?id=HkpbnH9lx.
[13] C. Durkan, A. Bekasov, I. Murray, and G. Papamakarios. Neural spline flows. In Advances in Neural Information Processing Systems, pages 7509–7520, 2019.
[14] P. Eastman, J. Swails, J. D. Chodera, R. T. McGibbon, Y. Zhao, K. A. Beauchamp, L.-P. Wang, A. C. Simmonett, M. P. Harrigan, C. D. Stern, et al. OpenMM 7: Rapid development of high performance algorithms for molecular dynamics. PLoS Computational Biology, 13(7):e1005659, 2017.
[15] L. Falorsi. Continuous normalizing flows on manifolds. arXiv preprint arXiv:2104.14959, 2021.
[16] L. Falorsi, P. de Haan, T. R. Davidson, and P. Forré. Reparameterizing distributions on lie groups. In The 22nd International Conference on Artificial Intelligence and Statistics, pages 3244–3253. PMLR, 2019.
[17] M. Figurnov, S. Mohamed, and A. Mnih. Implicit reparameterization gradients. In Proceedings of the 32nd International Conference on Neural Information Processing Systems, pages 439–450, 2018.
[18] M. C. Gemici, D. Rezende, and S. Mohamed. Normalizing flows on riemannian manifolds. arXiv preprint arXiv:1611.02304, 2016.
[19] W. Grathwohl, D. Choi, Y. Wu, G. Roeder, and D. Duvenaud. Backpropagation through the void: Optimizing control variates for black-box gradient estimation. In 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, April 30 – May 3, 2018, Conference Track Proceedings. OpenReview.net, 2018. URL https://openreview.net/group?id=ICLR.cc/2018/Conference.
[20] C. R. Harris, K. J. Millman, S. J. van der Walt, R. Gommers, P. Virtanen, D. Cournapeau, E. Wieser, J. Taylor, S. Berg, N. J. Smith, R. Kern, M. Picus, S. Hoyer, M. H. van Kerkwijk, M. Brett, A. Haldane, J. F. del Río, M. Wiebe, P. Peterson, P. Gérard-Marchant, K. Sheppard, T. Reddy, W. Weckesser, H. Abbasi, C. Gohlke, and T. E. Oliphant. Array programming with NumPy. Nature, 585(7825):357–362, Sept. 2020. doi: 10.1038/s41586-020-2649-2. URL https://doi.org/10.1038/s41586-020-2649-2.
[21] J. Ho, X. Chen, A. Srinivas, Y. Duan, and P. Abbeel. Flow++: Improving flow-based generative models with variational dequantization and architecture design. In International Conference on Machine Learning, pages 2722–2730. PMLR, 2019.
[22] C.-W. Huang, D. Krueger, A. Lacoste, and A. Courville. Neural autoregressive flows. In International Conference on Machine Learning, pages 2078–2087. PMLR, 2018.
[23] J. D. Hunter. Matplotlib: A 2D graphics environment. Computing in Science & Engineering, 9(3):90–95, 2007. doi: 10.1109/MCSE.2007.55.
[24] B. E. Husic, N. E. Charron, D. Lemm, J. Wang, A. Pérez, M. Majewski, A. Krämer, Y. Chen, S. Olsson, G. de Fabritiis, et al. Coarse graining molecular dynamics with graph neural networks. The Journal of Chemical Physics, 153(19):194101, 2020.
[25] A. Hyvärinen and P. Dayan. Estimation of non-normalized statistical models by score matching. Journal of Machine Learning Research, 6(4), 2005.
[26] P. Jaini, K. A. Selby, and Y. Yu. Sum-of-squares polynomial flow. In International Conference on Machine Learning, pages 3009–3018. PMLR, 2019.
[27] D. Kalatzis, J. Z. Ye, J. Wohlert, and S. Hauberg. Multi-chart flows. arXiv preprint arXiv:2106.03500, 2021.
[28] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
[29] J. Köhler, L. Klein, and F. Noé. Equivariant flows: exact likelihood generative learning for symmetric densities. In International Conference on Machine Learning, pages 5361–5370. PMLR, 2020.
[30] A. Krämer, J. Köhler, and F. Noé. Training invertible linear layers through rank-one perturbations. arXiv preprint arXiv:2010.07033, 2020.
[31] S.-H. Li and L. Wang. Neural network renormalization group. Physical Review Letters, 121(26):260601, 2018.
[32] A. Lou, D. Lim, I. Katsman, L. Huang, Q. Jiang, S. N. Lim, and C. M. De Sa. Neural manifold ordinary differential equations. In H. Larochelle, M. Ranzato, R. Hadsell, M. F. Balcan, and H. Lin, editors, Advances in Neural Information Processing Systems, volume 33, pages 17548–17558. Curran Associates, Inc., 2020. URL https://proceedings.neurips.cc/paper/2020/file/cbf8710b43df3f2c1553e649403426df-Paper.pdf.
[33] E. Mathieu and M. Nickel. Riemannian continuous normalizing flows. Advances in Neural Information Processing Systems, 33, 2020.
[34] R. T. McGibbon, K. A. Beauchamp, M. P. Harrigan, C. Klein, J. M. Swails, C. X. Hernández, C. R. Schwantes, L.-P. Wang, T. J. Lane, and V. S. Pande. MDTraj: a modern open library for the analysis of molecular dynamics trajectories. Biophysical Journal, 109(8):1528–1532, 2015.
[35] T. Müller, B. McWilliams, F. Rousselle, M. Gross, and J. Novák. Neural importance sampling. ACM Transactions on Graphics (TOG), 38(5):1–19, 2019.
[36] K. A. Nicoli, S. Nakajima, N. Strodthoff, W. Samek, K.-R. Müller, and P. Kessel. Asymptotically unbiased estimation of physical observables with neural samplers. Physical Review E, 101(2):023304, 2020.
[37] K. A. Nicoli, C. J. Anders, L. Funcke, T. Hartung, K. Jansen, P. Kessel, S. Nakajima, and P. Stornati. Estimation of thermodynamic observables in lattice field theories with deep generative models. Physical Review Letters, 126(3):032001, 2021.
[38] F. Noé, S. Olsson, J. Köhler, and H. Wu. Boltzmann generators: Sampling equilibrium states of many-body systems with deep learning. Science, 365(6457):eaaw1147, 2019.
[39] G. Papamakarios, E. Nalisnick, D. J. Rezende, S. Mohamed, and B. Lakshminarayanan. Normalizing flows for probabilistic modeling and inference. Journal of Machine Learning Research, 22(57):1–64, 2021. URL http://jmlr.org/papers/v22/19-1028.html.
[40] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, A. Desmaison, A. Kopf, E. Yang, Z. DeVito, M. Raison, A. Tejani, S. Chilamkurthy, B. Steiner, L. Fang, J. Bai, and S. Chintala. PyTorch: An imperative style, high-performance deep learning library. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc., 2019. URL https://proceedings.neurips.cc/paper/2019/file/bdbca288fee7f92f2bfa9f7012727740-Paper.pdf.
[41] S. Ramasinghe, K. Fernando, S. Khan, and N. Barnes. Robust normalizing flows using Bernstein-type polynomials. arXiv preprint arXiv:2102.03509, 2021.
[42] D. Rezende and S. Mohamed. Variational inference with normalizing flows. In International Conference on Machine Learning, pages 1530–1538. PMLR, 2015.
[43] D. J. Rezende, G. Papamakarios, S. Racanière, M. Albergo, G. Kanwar, P. Shanahan, and K. Cranmer. Normalizing flows on tori and spheres. In International Conference on Machine Learning, pages 8083–8092. PMLR, 2020.
[44] V. G. Satorras, E. Hoogeboom, F. B. Fuchs, I. Posner, and M. Welling. E(n) equivariant normalizing flows for molecule generation in 3D. arXiv preprint arXiv:2105.09016, 2021.
[45] L. Sbailò, M. Dibak, and F. Noé. Neural mode jump monte carlo. The Journal of Chemical Physics, 154(7):074101, 2021.
[46] S. Shirobokov, V. Belavin, M. Kagan, A. Ustyuzhanin, and A. G. Baydin. Black-box optimization with local generative surrogates. Advances in Neural Information Processing Systems, 33, 2020.
[47] Y. Song and S. Ermon. Generative modeling by estimating gradients of the data distribution. In Proceedings of the 33rd Annual Conference on Neural Information Processing Systems, 2019.
[48] Y. Song and D. P. Kingma. How to train your energy-based models. CoRR, abs/2101.03288, 2021. URL https://arxiv.org/abs/2101.03288.
[49] Y. Song, S. Garg, J. Shi, and S. Ermon. Sliced score matching: A scalable approach to density and score estimation. In Uncertainty in Artificial Intelligence, pages 574–584. PMLR, 2020.
[50] E. G. Tabak, E. Vanden-Eijnden, et al. Density estimation by dual ascent of the log-likelihood. Communications in Mathematical Sciences, 8(1):217–233, 2010.
[51] L. W. Tu. Bump Functions and Partitions of Unity, pages 127–134. Springer New York, New York, NY, 2008. ISBN 978-0-387-48101-2. doi: 10.1007/978-0-387-48101-2_13. URL https://doi.org/10.1007/978-0-387-48101-2_13.
[52] P. Vincent. A connection between score matching and denoising autoencoders. Neural Computation, 23(7):1661–1674, 2011. doi: 10.1162/NECO_a_00142.
[53] J. Wang, S. Olsson, C. Wehmeyer, A. Pérez, N. E. Charron, G. De Fabritiis, F. Noé, and C. Clementi. Machine learning of coarse-grained molecular dynamics force fields. ACS Central Science, 5(5):755–767, 2019.
[54] A. Wehenkel and G. Louppe. Unconstrained monotonic neural networks. Advances in Neural Information Processing Systems 32 (2019), 2019.
[55] P. Wirnsberger, A. Ballard, G. Papamakarios, S. Abercrombie, S. Racanière, A. Pritzel, C. Blundell, et al. Targeted free energy estimation via learned mappings. The Journal of Chemical Physics, 153(14):144112, 2020.
[56] H. Wu, J. Köhler, and F. Noé. Stochastic normalizing flows. In H. Larochelle, M. Ranzato, R. Hadsell, M. F. Balcan, and H. Lin, editors, Advances in Neural Information Processing Systems, volume 33, pages 5933–5944. Curran Associates, Inc., 2020. URL https://proceedings.neurips.cc/paper/2020/file/41d80bfc327ef980528426fc810a6d7a-Paper.pdf.
[57] C. Xenophontos. A formula for the nth derivative of the quotient of two functions. http://www.mas.ucy.ac.cy/~xenophon/pubs/capsule.pdf, 2007. Accessed: 2021-05-28.
[58] M. Xu, S. Luo, Y. Bengio, J. Peng, and J. Tang. Learning neural generative dynamics for molecular conformation generation. arXiv preprint arXiv:2102.10240, 2021.