Flow Matching for Scalable Simulation-Based Inference

Jonas Wildberger, Max Planck Institute for Intelligent Systems, Tübingen, Germany (wildberger.jonas@tuebingen.mpg.de)
Maximilian Dax, Max Planck Institute for Intelligent Systems, Tübingen, Germany (maximilian.dax@tuebingen.mpg.de)
Simon Buchholz, Max Planck Institute for Intelligent Systems, Tübingen, Germany (sbuchholz@tue.mpg.de)
Stephen R. Green, University of Nottingham, Nottingham, United Kingdom
Jakob H. Macke, Max Planck Institute for Intelligent Systems & Machine Learning in Science, University of Tübingen, Tübingen, Germany
Bernhard Schölkopf, Max Planck Institute for Intelligent Systems, Tübingen, Germany

Abstract: Neural posterior estimation methods based on discrete normalizing flows have become established tools for simulation-based inference (SBI), but scaling them to high-dimensional problems can be challenging. Building on recent advances in generative modeling, we here present flow matching posterior estimation (FMPE), a technique for SBI using continuous normalizing flows. Like diffusion models, and in contrast to discrete flows, flow matching allows for unconstrained architectures, providing enhanced flexibility for complex data modalities. Flow matching therefore enables exact density evaluation, fast training, and seamless scalability to large architectures, making it ideal for SBI. We show that FMPE achieves competitive performance on an established SBI benchmark, and then demonstrate its improved scalability on a challenging scientific problem: for gravitational-wave inference, FMPE outperforms methods based on comparable discrete flows, reducing training time by 30% with substantially improved accuracy. Our work underscores the potential of FMPE to enhance performance in challenging inference scenarios, thereby paving the way for more advanced applications to scientific problems.

1 Introduction

The ability to readily represent Bayesian posteriors of arbitrary complexity using neural networks would herald a revolution in scientific data analysis. Such networks could be trained using simulated data and used for amortized inference across observations, bringing tractable inference and speed to a myriad of scientific models. Thanks to innovative architectures such as normalizing flows [1, 2], approaches to neural simulation-based inference (SBI) [3] have seen remarkable progress in recent years. Here, we show that modern approaches to deep generative modeling (particularly flow matching) deliver substantial improvements in simplicity, flexibility and scaling when adapted to SBI.

Equal contribution. 37th Conference on Neural Information Processing Systems (NeurIPS 2023).

Figure 1: Comparison of network architectures (left) and flow trajectories (right). Discrete flows (NPE, top) require a specialized architecture for the density estimator. Continuous flows (FMPE, bottom) are based on a vector field parametrized with an unconstrained architecture. FMPE uses this additional flexibility to put an enhanced emphasis on the conditioning data x, which in the SBI context is typically high dimensional and in a complex domain.
Further, the optimal transport path produces simple flow trajectories from the base distribution (inset) to the target.

The Bayesian approach to data analysis is to compare observations to models via the posterior distribution p(θ|x). This gives our degree of belief that model parameters θ gave rise to an observation x, and is proportional to the model likelihood p(x|θ) times the prior p(θ). One is typically interested in representing the posterior in terms of a collection of samples; however, obtaining these through standard likelihood-based algorithms can be challenging for intractable or expensive likelihoods. In such cases, SBI offers an alternative based instead on data simulations x ∼ p(x|θ). Combined with deep generative modeling, SBI becomes a powerful paradigm for scientific inference [3]. Neural posterior estimation (NPE) [4–6], for instance, trains a conditional density estimator q(θ|x) to approximate the posterior, allowing for rapid sampling and density estimation for any x consistent with the training distribution.

The NPE density estimator q(θ|x) is commonly taken to be a (discrete) normalizing flow [1, 2], an approach that has brought state-of-the-art performance in challenging problems such as gravitational-wave inference [7]. Naturally, performance hinges on the expressiveness of q(θ|x). Normalizing flows transform noise to samples through a discrete sequence of basic transforms. These have been carefully engineered to be invertible with simple Jacobian determinants, enabling efficient maximum likelihood training, while at the same time producing expressive q(θ|x). Although many such discrete flows are universal density approximators [2], in practice they can be challenging to scale to very large networks, which are needed for big-data experiments.

Recent studies [8, 9] propose neural posterior score estimation (NPSE), a rather different approach that models the posterior distribution with score-matching (or diffusion) networks. These techniques were originally developed for generative modeling [10–12], achieving state-of-the-art results in many domains, including image generation [13, 14]. Like discrete normalizing flows, diffusion models transform noise into samples, but with trajectories parametrized by a continuous time parameter t. The trajectories solve a stochastic differential equation (SDE) [15] defined in terms of a vector field vt, which is trained to match the score of the intermediate distributions pt. NPSE has several advantages compared to NPE, including the ability to combine multiple observations at inference time [9] and, importantly, the freedom to use unconstrained network architectures.

We here propose to use flow matching, another recent technique for generative modeling, for Bayesian inference, an approach we refer to as flow matching posterior estimation (FMPE). Flow matching is also based on a vector field vt and thereby also admits flexible network architectures (Fig. 1). For flow matching, however, vt directly defines the velocity field of sample trajectories, which solve ordinary differential equations (ODEs) and are deterministic. As a consequence, flow matching allows for additional freedom in designing non-diffusion paths such as optimal transport, and provides direct access to the density [16]. These differences are summarized in Tab. 1.

|                                    | NPE    | NPSE | FMPE (ours) |
|------------------------------------|--------|------|-------------|
| Tractable posterior density        | Yes    | No   | Yes         |
| Unconstrained network architecture | No     | Yes  | Yes         |
| Network passes for sampling        | Single | Many | Many        |

Table 1: Comparison of posterior-estimation methods.
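As an illustration of the FMPE architecture in Fig. 1 (bottom), the following sketch combines a feature extractor acting on x with gated-linear-unit conditioning on (t, θ), as discussed further in Sec. 3.2. The layer sizes, number of blocks, and exact gating form are illustrative assumptions, not the configuration used in the experiments.

```python
import torch
import torch.nn as nn

class GLUBlock(nn.Module):
    """Residual block whose hidden features are modulated by (t, theta) via a gated linear unit."""
    def __init__(self, dim, context_dim):
        super().__init__()
        self.lin = nn.Linear(dim, dim)
        self.ctx = nn.Linear(context_dim, 2 * dim)  # produces an additive shift and a gate

    def forward(self, h, context):
        shift, gate = self.ctx(context).chunk(2, dim=-1)
        return h + torch.sigmoid(gate) * (self.lin(h) + shift)

class FMPENet(nn.Module):
    """Maps (t, theta, x) to the vector field v_{t,x}(theta); output dimension equals dim(theta)."""
    def __init__(self, x_dim, theta_dim, hidden=256, num_blocks=4):
        super().__init__()
        self.embed_x = nn.Linear(x_dim, hidden)   # stand-in for a task-specific feature extractor
        self.blocks = nn.ModuleList([GLUBlock(hidden, theta_dim + 1) for _ in range(num_blocks)])
        self.head = nn.Linear(hidden, theta_dim)

    def forward(self, t, theta, x):
        context = torch.cat([t.unsqueeze(-1), theta], dim=-1)
        h = torch.relu(self.embed_x(x))
        for block in self.blocks:
            h = block(h, context)
        return self.head(h)
```

The output head has dimension dim(θ), so the network directly parameterizes the conditional vector field rather than a full density estimator.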
Our contributions are as follows:

- We adapt flow matching to Bayesian inference, proposing FMPE. In general, the modeling requirements of SBI are different from generative modeling. For the latter, sample quality is critical, i.e., that samples lie in the support of a complex distribution (e.g., images). In contrast, for SBI, p(θ|x) is typically less complex for fixed x, but x itself can be complex and high-dimensional. We therefore consider pyramid-like architectures from x to vt, with gated linear units to incorporate (θ, t) dependence, rather than the typical U-Net used for images (Fig. 1). We also propose an alternative t-weighting in the loss, which improves performance in many tasks.
- Under certain regularity assumptions, we prove an upper bound on the KL divergence between the model and target posterior. This implies that estimated posteriors are mass-covering, i.e., that their support includes all θ consistent with observed x, which is highly desirable for scientific applications [17].
- We perform a number of experiments to investigate the performance of FMPE.² Our two-pronged approach, which involves a set of benchmark tests and a real-world problem, is designed to probe complementary aspects of the method, covering breadth and depth of applications. First, on an established suite of SBI benchmarks, we show that FMPE performs comparably or better than NPE across most tasks, and in particular exhibits mass-covering posteriors in all cases (Sec. 4). We then push the performance limits of FMPE on a challenging real-world problem by turning to gravitational-wave inference (Sec. 5). We show that FMPE substantially outperforms an NPE baseline in terms of training time, posterior accuracy, and scaling to larger networks.

2 Preliminaries

Normalizing flows. A normalizing flow [1, 2] defines a probability distribution q(θ|x) over parameters θ ∈ R^n in terms of an invertible mapping ψ_x : R^n → R^n from a simple base distribution q_0(θ),

$$q(\theta|x) = (\psi_x)_\# q_0(\theta) = q_0\big(\psi_x^{-1}(\theta)\big)\,\Big|\det \frac{\partial \psi_x^{-1}(\theta)}{\partial \theta}\Big|, \qquad (1)$$

where $(\cdot)_\#$ denotes the pushforward operator, and for generality we have conditioned on additional context x ∈ R^m. Unless otherwise specified, a normalizing flow refers to a discrete flow, where ψ_x is given by a composition of simpler mappings with triangular Jacobians, interspersed with shuffling of the θ. This construction results in expressive q(θ|x) and also efficient density evaluation [2].

Continuous normalizing flows. A continuous flow [18] also maps from base to target distribution, but is parametrized by a continuous time t ∈ [0, 1], where q_0(θ|x) = q_0(θ) and q_1(θ|x) = q(θ|x). For each t, the flow is defined by a vector field v_{t,x} on the sample space.³ This corresponds to the velocity of the sample trajectories,

$$\frac{d}{dt}\psi_{t,x}(\theta) = v_{t,x}(\psi_{t,x}(\theta)), \qquad \psi_{0,x}(\theta) = \theta. \qquad (2)$$

²Code available here.
³In the SBI literature, this is also commonly referred to as "parameter space".

We obtain the trajectories θ_t ≡ ψ_{t,x}(θ) by integrating this ODE. The final density is given by

$$q(\theta|x) = (\psi_{1,x})_\# q_0(\theta) = q_0(\theta)\,\exp\Big(-\int_0^1 \operatorname{div}\, v_{t,x}(\theta_t)\, dt\Big), \qquad (3)$$

which is obtained by solving the transport equation $\partial_t q_t + \operatorname{div}(q_t v_{t,x}) = 0$. The advantage of the continuous flow is that v_{t,x}(θ) can simply be specified by a neural network mapping R^{n+m+1} → R^n, in which case (2) is referred to as a neural ODE [18]. Since the density is tractable via (3), it is in principle possible to train the flow by maximizing the (log-)likelihood.
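To make Eqs. (2)-(3) concrete, the following sketch draws samples and evaluates log q(θ|x) for a trained vector-field network with the signature net(t, θ, x) (such as the sketch above) by Euler-integrating the ODE and accumulating the negative divergence along each trajectory. The fixed step count, the plain Euler scheme, and the divergence-by-autograd loop are illustrative choices, not those of the paper; `base` is assumed to be a torch.distributions object over θ (e.g., a standard normal).

```python
import torch

def divergence(net, t, theta, x):
    """div v_{t,x}(theta), computed exactly with autograd (one backward pass per theta-dimension)."""
    theta = theta.detach().requires_grad_(True)
    v = net(t, theta, x)
    div = torch.zeros(theta.shape[0])
    for i in range(theta.shape[1]):
        grad_i = torch.autograd.grad(v[:, i].sum(), theta, retain_graph=True)[0]
        div = div + grad_i[:, i]
    return v.detach(), div.detach()

def sample_and_log_prob(net, x, base, steps=100):
    """Integrate d theta/dt = v_{t,x}(theta) from t=0 to t=1, tracking log q via Eq. (3)."""
    theta = base.sample((x.shape[0],))      # theta_0 ~ q_0
    log_q = base.log_prob(theta)            # log q_0(theta_0)
    dt = 1.0 / steps
    for k in range(steps):
        t = torch.full((x.shape[0],), k * dt)
        v, div = divergence(net, t, theta, x)
        theta = theta + dt * v              # Euler step along the flow
        log_q = log_q - dt * div            # d/dt log q_t(theta_t) = -div v_{t,x}(theta_t)
    return theta, log_q
```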
However, this is often not feasible in practice, since both sampling and density estimation require many network passes to numerically solve the ODE (2).

Flow matching. An alternative training objective for continuous normalizing flows is provided by flow matching [16]. This directly regresses v_{t,x} on a vector field u_{t,x} that generates a target probability path p_{t,x}. It has the advantage that training does not require integration of ODEs; however, it is not immediately clear how to choose (u_{t,x}, p_{t,x}). The key insight of [16] is that, if the path is chosen on a sample-conditional basis,⁴ then the training objective becomes extremely simple. Indeed, given a sample-conditional probability path p_t(θ|θ_1) and a corresponding vector field u_t(θ|θ_1), we specify the sample-conditional flow matching loss as

$$\mathcal{L}_{\mathrm{SCFM}} = \mathbb{E}_{t \sim \mathcal{U}[0,1],\; x \sim p(x),\; \theta_1 \sim p(\theta|x),\; \theta_t \sim p_t(\theta_t|\theta_1)} \big\| v_{t,x}(\theta_t) - u_t(\theta_t|\theta_1) \big\|^2. \qquad (4)$$

Remarkably, minimization of this loss is equivalent to regressing v_{t,x}(θ) on the marginal vector field u_{t,x}(θ) that generates p_t(θ|x) [16]. Note that in this expression, the x-dependence of v_{t,x}(θ) is picked up via the expectation value, with the sample-conditional vector field independent of x.

There exists considerable freedom in choosing a sample-conditional path. Ref. [16] introduces the family of Gaussian paths

$$p_t(\theta|\theta_1) = \mathcal{N}\big(\theta \,\big|\, \mu_t(\theta_1), \sigma_t(\theta_1)^2 I_n\big), \qquad (5)$$

where the time-dependent means μ_t(θ_1) and standard deviations σ_t(θ_1) can be freely specified (subject to boundary conditions⁵). For our experiments, we focus on the optimal transport paths defined by μ_t(θ_1) = tθ_1 and σ_t(θ_1) = 1 − (1 − σ_min)t (also introduced in [16]). The sample-conditional vector field then has the simple form

$$u_t(\theta|\theta_1) = \frac{\theta_1 - (1 - \sigma_{\min})\theta}{1 - (1 - \sigma_{\min})t}. \qquad (6)$$

Neural posterior estimation (NPE). NPE is an SBI method that directly fits a density estimator q(θ|x) (usually a normalizing flow) to the posterior p(θ|x) [4–6]. NPE trains with the maximum likelihood objective $\mathcal{L}_{\mathrm{NPE}} = -\mathbb{E}_{p(\theta)p(x|\theta)} \log q(\theta|x)$, using Bayes' theorem to simplify the expectation value with $\mathbb{E}_{p(x)p(\theta|x)} \to \mathbb{E}_{p(\theta)p(x|\theta)}$. During training, L_NPE is estimated based on an empirical distribution consisting of samples (θ, x) ∼ p(θ)p(x|θ). Once trained, NPE can perform inference for every new observation using q(θ|x), thereby amortizing the computational cost of simulation and training across all observations. NPE further provides exact density evaluations of q(θ|x). Both of these properties are crucial for the physics application in Section 5, so we aim to retain these properties with FMPE.

Related work. Flow matching [16] has been developed as a technique for generative modeling, and similar techniques are discussed in [19–21] and extended in [22, 23]. Flow matching encompasses the deterministic ODE version of diffusion models [10–12] as a special instance. Although to our knowledge flow matching has not previously been applied to Bayesian inference, score-matching diffusion models have been proposed for SBI in [8, 9] with impressive results. These studies, however, use stochastic formulations via SDEs [15] or Langevin steps and are therefore not directly applicable when evaluations of the posterior density are desired (see Tab. 1). It should be noted that score modeling can also be used to parameterize continuous normalizing flows via an ODE.

⁴We refer to conditioning on θ_1 as sample-conditioning to distinguish from conditioning on x.
⁵The sample-conditional probability path should be chosen to be concentrated around θ_1 at t = 1 (within a small region of size σ_min) and to be the base distribution at t = 0.
Extension of [8, 9] to the deterministic formulation could thereby be seen as a special case of flow matching. Many of our analyses and the practical guidance provided in Section 3 therefore also apply to score matching. We here focus on comparisons of FMPE against NPE [4–6], as it best matches the requirements of the application in Section 5. Other SBI methods include approximate Bayesian computation [24–28], neural likelihood estimation [29–32] and neural ratio estimation [33–39]. Many of these approaches have sequential versions, where the estimator networks are tuned to a specific observation x_o. FMPE has a tractable density, so it is straightforward to apply the sequential NPE [4–6] approaches to FMPE. In this case, inference is no longer amortized, so we leave this extension to future work.

3 Flow matching posterior estimation

To apply flow matching to SBI, we use Bayes' theorem to make the usual replacement $\mathbb{E}_{p(x)p(\theta|x)} \to \mathbb{E}_{p(\theta)p(x|\theta)}$ in the loss function (4), eliminating the intractable expectation values. This gives the FMPE loss

$$\mathcal{L}_{\mathrm{FMPE}} = \mathbb{E}_{t \sim p(t),\; \theta_1 \sim p(\theta),\; x \sim p(x|\theta_1),\; \theta_t \sim p_t(\theta_t|\theta_1)} \big\| v_{t,x}(\theta_t) - u_t(\theta_t|\theta_1) \big\|^2, \qquad (7)$$

which we minimize using empirical risk minimization over samples (θ, x) ∼ p(θ)p(x|θ). In other words, training data is generated by sampling θ from the prior, and then simulating data x corresponding to θ. This is in close analogy to NPE training, but replaces the log likelihood maximization with the sample-conditional flow matching objective. Note that in this expression we also sample t ∼ p(t), t ∈ [0, 1] (see Sec. 3.3), which generalizes the uniform distribution in (4). This provides additional freedom to improve learning in our experiments.

3.1 Probability mass coverage

As we show in our examples, trained FMPE models q(θ|x) can achieve excellent results in approximating the true posterior p(θ|x). However, it is not generally possible to achieve exact agreement due to limitations in training budget and network capacity. It is therefore important to understand how inaccuracies manifest. Whereas sample quality is the main criterion for generative modeling, for scientific applications one is often interested in the overall shape of the distribution. In particular, an important question is whether q(θ|x) is mass-covering, i.e., whether it contains the full support of p(θ|x). This minimizes the risk of falsely ruling out possible explanations of the data. It also allows us to use importance sampling if the likelihood p(x|θ) of the forward model can be evaluated, which enables precise estimation of the posterior [40, 41].

Figure 2: A Gaussian (blue) fitted to a bimodal distribution (gray) with various objectives (panels: argmin_q D_KL(q||p), argmin_q D_KL(p||q), argmin_q L_FM).

Consider first the mass-covering property for NPE. NPE directly minimizes the forward KL divergence D_KL(p(θ|x)||q(θ|x)), and thereby provides probability-mass-covering results. Therefore, even if NPE is not accurately trained, the estimate q(θ|x) should cover the entire support of the posterior p(θ|x), and failure to do so can be observed in the validation loss. As an illustration in an unconditional setting, we observe that a unimodal Gaussian q fitted to a bimodal target distribution p captures both modes when using the forward KL divergence D_KL(p||q), but only a single mode when using the backwards direction D_KL(q||p) (Fig. 2).
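Before examining how FMPE behaves in this respect, we note that the objective (7) with the optimal transport path (5)-(6) reduces to a few lines of code. The sketch below shows one training step; the prior, simulator, batch size, σ_min, and the inverse-CDF sampling of t ∼ p_α(t) are placeholders or assumed choices rather than the paper's exact settings.

```python
import torch

def fmpe_training_step(net, optimizer, prior, simulate, batch_size=512, sigma_min=1e-4, alpha=0.0):
    """One step of empirical risk minimization of the FMPE loss (7) with the optimal transport path."""
    theta1 = prior.sample((batch_size,))                 # theta ~ p(theta)
    x = simulate(theta1)                                 # x ~ p(x | theta)
    u = torch.rand(batch_size)
    t = u ** ((1 + alpha) / (2 + alpha))                 # inverse-CDF sample of p_alpha(t) ~ t^{1/(1+alpha)}
    t_ = t.unsqueeze(-1)
    sigma_t = 1 - (1 - sigma_min) * t_                   # OT path: sigma_t = 1 - (1 - sigma_min) t
    theta_t = t_ * theta1 + sigma_t * torch.randn_like(theta1)   # theta_t ~ N(t * theta_1, sigma_t^2 I)
    u_t = (theta1 - (1 - sigma_min) * theta_t) / sigma_t         # conditional vector field, Eq. (6)
    loss = ((net(t, theta_t, x) - u_t) ** 2).sum(dim=-1).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

For the uniform time distribution of Eq. (4), one would instead set t = u directly.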
For FMPE, we can fit a Gaussian flow-matching model q(θ) = N(μ̂, σ̂²) to the same bimodal target, in this case parametrizing the vector field as

$$v_t(\theta) = \frac{\big(\sigma_t^2 + (t\hat\sigma)^2 - \sigma_t\big)\,\theta + t\hat\mu\,\sigma_t}{t\,\big(\sigma_t^2 + (t\hat\sigma)^2\big)} \qquad (8)$$

(see Appendix A). Fitting the learnable parameters (μ̂, σ̂) via (4), we also obtain a mass-covering distribution. This provides some indication that the flow matching objective induces mass-covering behavior, and leads us to investigate the more general question of whether the mean squared error between the vector fields u_t and v_t bounds the forward KL divergence. Indeed, the former agrees up to a constant with the sample-conditional loss (4) (see Sec. 2). We denote the flows of u_t, v_t by φ_t, ψ_t, respectively, and we set q_t = (ψ_t)_# q_0, p_t = (φ_t)_# q_0. The precise question then is whether we can bound D_KL(p_1||q_1) by MSE_p(u, v)^α for some positive power α. It was already observed in [42] that this is not true in general, and we provide a simple example to that effect in Lemma 1 in Appendix B. Indeed, it was found in [42] that to bound the forward KL divergence we also need to control the Fisher divergence, $\int p_t(d\theta)\, \big(\nabla \ln \tfrac{p_t(\theta)}{q_t(\theta)}\big)^2$. Here we show instead that a bound can be obtained under sufficiently strong regularity assumptions on p_0, u_t, and v_t. The following statement is slightly informal, and we refer to the supplement for the complete version.

Theorem 1. Let p_0 = q_0 and assume u_t and v_t are two vector fields whose flows satisfy p_1 = (φ_1)_# p_0 and q_1 = (ψ_1)_# q_0. Assume that p_0 is square integrable and satisfies $|\nabla \ln p_0(\theta)| \le c(1 + |\theta|)$, and that u_t and v_t have bounded second derivatives. Then there is a constant C > 0 such that (for MSE_p(u, v) < 1)

$$D_{\mathrm{KL}}(p_1\|q_1) \le C\, \mathrm{MSE}_p(u, v)^{\frac{1}{2}}. \qquad (9)$$

The proof of this result can be found in Appendix B. While the regularity assumptions are not guaranteed to hold in practice when v_t is parametrized by a neural net, the theorem nevertheless gives some indication that the flow-matching objective encourages mass coverage. In Sections 4 and 5, this is complemented with extensive empirical evidence that flow matching indeed provides mass-covering estimates. We remark that it was shown in [43] that the KL divergence of SDE solutions can be bounded by the MSE of the estimated score function. Thus, the smoothing effect of the noise ensures mass coverage, an aspect that was further studied using the Fokker-Planck equation in [42]. For flow matching, imposing the regularity assumption plays a similar role.

3.2 Network architecture

Generative diffusion or flow matching models typically operate on complicated and high-dimensional data in the θ space (e.g., images with millions of pixels). One typically uses U-Net-like architectures [44], as they provide a natural mapping from θ to a vector field v(θ) of the same dimension. The dependence on t and an (optional) conditioning vector x is then added on top of this architecture. For SBI, the data x is often associated with a complicated domain, such as image or time series data, whereas parameters θ are typically low dimensional. In this context, it is therefore useful to build the architecture starting as a mapping from x to v(x) and then add conditioning on θ and t. In practice, one can therefore use any established feature extraction architecture for data in the domain of x, and adjust the dimension of the feature vector to n = dim(θ). In our experiments, we found that the (t, θ)-conditioning is best achieved by applying gated linear units [45] to the hidden layers of the network (see also Fig. 1);
these are also commonly used for conditioning discrete flows on x.

3.3 Re-scaling the time prior

The time prior U[0, 1] in (4) distributes the training capacity uniformly across t. We observed that this is not always optimal in practice, as the complexity of the vector field may depend on t. For FMPE we therefore sample t in (7) from a power-law distribution $p_\alpha(t) \propto t^{1/(1+\alpha)}$, t ∈ [0, 1], introducing an additional hyperparameter α. This includes the uniform distribution for α = 0, but for α > 0, assigns greater importance to the vector field for larger values of t. We empirically found this to improve learning for distributions with sharp bounds (e.g., Two Moons in Section 4).

4 SBI benchmark

We now evaluate FMPE on ten tasks included in the benchmark presented in [46], ranging from simple Gaussian toy models to more challenging SBI problems from epidemiology and ecology, with varying dimensions for parameters (dim(θ) ∈ [2, 10]) and observations (dim(x) ∈ [2, 100]). For each task, we train three separate FMPE models with simulation budgets N ∈ {10³, 10⁴, 10⁵}. We use a simple network architecture consisting of fully connected residual blocks [47] to parameterize the conditional vector field. For the two tasks with dim(x) = 100 (B-GLM-Raw, SLCP-D), we condition on (t, θ) via gated linear units as described in Section 3.2 (Fig. 8 in Appendix C shows the corresponding performance gain). For the remaining tasks with dim(x) ≤ 10 we concatenate (t, θ, x) instead. We reserve 5% of the simulations for validation. See Appendix C for details.

For each task and simulation budget, we evaluate the model with the lowest validation loss by comparing q(θ|x) to the reference posteriors p(θ|x) provided in [46] for ten different observations x in terms of the C2ST score [48, 49]. This performance metric is computed by training a classifier to discriminate inferred samples θ ∼ q(θ|x) from reference samples θ ∼ p(θ|x). The C2ST score is then the test accuracy of this classifier, ranging from 0.5 (best) to 1.0. We observe that FMPE exhibits comparable performance to an NPE baseline model for most tasks and outperforms it on several (Fig. 4). In terms of the MMD metric (Fig. 6 in the Appendix), FMPE clearly outperforms NPE (but MMD can be sensitive to its hyperparameters [46]). As NPE is one of the highest-ranking methods for many tasks in the benchmark, these results show that FMPE indeed performs competitively with other existing SBI methods. We report an additional baseline for score matching in Fig. 7 in the Appendix.

As NPE and FMPE both directly target the posterior with a density estimator (in contrast to most other SBI methods), observed differences can be primarily attributed to their different approaches for density estimation. Interestingly, a substantial performance improvement of FMPE over NPE is observed for SLCP with a large simulation budget (N = 10⁵). The SLCP task is specifically designed to have a simple likelihood but a complex posterior, and the FMPE performance underscores the enhanced flexibility of the FMPE density estimator.

Figure 3: Histogram of FMPE densities log q(θ|x) for reference samples θ ∼ p(θ|x) (Two Moons task, N = 10³). The estimate q(θ|x) clearly covers p(θ|x) entirely.

Finally, we empirically investigate the mass coverage suggested by our theoretical analysis in Section 3.1. We display the density log q(θ|x) of the reference samples θ ∼ p(θ|x) under our FMPE model q as a histogram (Fig. 3). All samples θ ∼ p(θ|x) fall within the support of q(θ|x).
This becomes apparent when comparing to the density log q(θ|x) for samples θ ∼ q(θ|x) from q itself. This FMPE result is therefore mass covering. Note that this does not necessarily imply conservative posteriors (which is also not generally true for the forward KL divergence [17, 50, 51]), and some parts of p(θ|x) may still be undersampled. Probability mass coverage, however, implies that no part is entirely missed (compare Fig. 2), even for multimodal distributions such as Two Moons. Fig. 9 in the Appendix confirms the mass coverage for the other benchmark tasks.

Figure 4: Comparison of FMPE with NPE, a standard SBI method, across 10 benchmark tasks [46]. (Panels show individual tasks, e.g., GL, GL-U, GM, Two Moons, with the number of simulations, 10³ to 10⁵, on the horizontal axis.)

5 Gravitational-wave inference

5.1 Background

Gravitational waves (GWs) are ripples of spacetime predicted by Einstein and produced by cataclysmic cosmic events such as the mergers of binary black holes (BBHs). GWs propagate across the universe to Earth, where the LIGO-Virgo-KAGRA observatories measure faint time-series signals embedded in noise. To date, roughly 90 detections of merging black holes and neutron stars have been made [52], all of which have been characterized using Bayesian inference to compare against theoretical models.⁶ These have yielded insights into the origin and evolution of black holes [53], fundamental properties of matter and gravity [54, 55], and even the expansion rate of the universe [56]. Under reasonable assumptions on detector noise, the GW likelihood is tractable,⁷ and inference is typically performed using tools [57–60] based on Markov chain Monte Carlo [61, 62] or nested sampling [63] algorithms. This can take from hours to months, depending on the nature of the event and the complexity of the signal model, with a typical analysis requiring up to 10⁸ likelihood evaluations. The ever-increasing rate of detections means that these analysis times risk becoming a bottleneck.

⁶BBH parameters θ ∈ R¹⁵ include black-hole masses, spins, and the spacetime location and orientation of the system (see Tab. 4 in the Appendix). We represent x in the frequency domain; for two LIGO detectors and complex data with f ∈ [20, 512] Hz and Δf = 0.125 Hz, we have x ∈ R¹⁵⁷⁴⁴.
⁷Noise is assumed to be stationary and Gaussian, so for frequency-domain data, the GW likelihood is $p(x|\theta) = \mathcal{N}\big(h(\theta), S_n\big)(x)$. Here h(θ) is a theoretical signal model based on Einstein's theory of general relativity, and S_n is the power spectral density of the detector noise.

SBI offers a promising solution for this challenge that has thus been actively studied in the literature [64–68, 7, 69, 70, 41]. A fully amortized NPE-based method called DINGO recently achieved accuracies comparable to stochastic samplers with inference times of less than a minute per event [7]. To achieve accurate results, however, DINGO uses group-equivariant NPE (GNPE) [7, 69], an NPE extension that integrates known conditional symmetries. GNPE, therefore, does not provide a tractable density, which is problematic when verifying and correcting inference results using importance sampling [41].

5.2 Experiments

We here apply FMPE to GW inference. As a baseline, we train an NPE network with the settings described in [7] with a few minor changes (see Appendix D).⁸ This uses an embedding network [71] to compress x to a 128-dimensional feature vector, which is then used to condition a neural spline flow [72].
The embedding network consists of a learnable linear layer initialized with principal components of GW simulations, followed by a series of dense residual blocks [47]. This architecture is a powerful feature extractor for GW measurements [7]. As pointed out in Section 3.2, it is straightforward to reuse such architectures for FMPE, with the following three modifications: (1) we provide the conditioning on (t, θ) to the network via gated linear units in each hidden layer; (2) we change the dimension of the final feature vector to the dimension of θ so that the network parameterizes the conditional vector field (t, x, θ) ↦ v_{t,x}(θ); (3) we increase the number and width of the hidden layers to use the capacity freed up by removing the discrete normalizing flow.

We train the NPE and FMPE networks with 5·10⁶ simulations for 400 epochs using a batch size of 4096 on an A100 GPU. The FMPE network (1.9·10⁸ learnable parameters, training takes 2 days) is larger than the NPE network (1.3·10⁸ learnable parameters, training takes 3 days), but trains substantially faster. We evaluate both networks on GW150914 [73], the first detected GW. We generate a reference posterior using the method described in [41]. Fig. 5 compares the inferred posterior distributions qualitatively and quantitatively in terms of the Jensen-Shannon divergence (JSD) to the reference.⁹ FMPE substantially outperforms NPE in terms of accuracy, with a mean JSD of 0.5 mnat (NPE: 3.6 mnat) and max JSD < 2.0 mnat, an indistinguishability criterion for GW posteriors [59]. Remarkably, FMPE accuracy is even comparable to GNPE, which leverages physical symmetries to simplify the data. Finally, we find that the Bayesian evidences inferred with NPE (log p(x) = -7667.958 ± 0.006) and FMPE (log p(x) = -7667.969 ± 0.005) are consistent within their statistical uncertainties. A correct evidence is only obtained in importance sampling when the inferred posterior q(θ|x) covers the entire posterior p(θ|x) [41], so this is another indication that FMPE indeed induces mass-covering posteriors.

⁸Our implementation builds on the public DINGO code from https://github.com/dingo-gw/dingo.
⁹We omit the three parameters ϕ_c, ϕ_JL, θ_JN in the evaluation as we use phase marginalization in importance sampling and the reference therefore uses a different basis for these parameters [41]. For GNPE we report the results from [7], which are generated with slightly different data conditioning. Therefore, we do not display the GNPE results in the corner plot, and the JSDs serve only as a rough comparison. The JSD for the t_c parameter is not reported in [7] due to a t_c-marginalized reference.

Figure 5: Results for GW150914 [73]. Left: Corner plot showing 1D marginals on the diagonal and 2D 50% credible regions. We display four GW parameters (distance d_L, time of arrival t_c, and sky coordinates α, δ); these represent the least accurate NPE parameters. Right: Deviation between inferred posteriors and the reference, quantified by the Jensen-Shannon divergence (JSD). The FMPE posterior matches the reference more accurately than NPE, and performs similarly to symmetry-enhanced GNPE. (We do not display GNPE results on the left due to different data conditioning settings in available networks.)

5.3 Discussion

Our results for GW150914 show that FMPE substantially outperforms NPE on this challenging problem.
We believe that this is related to the network structure as follows. The NPE network allocates roughly two thirds of its parameters to the discrete normalizing flow and only one third to the embedding network (i.e., the feature extractor for x). Since FMPE parameterizes just a vector field (rather than a collection of splines in the normalizing flow), it can devote its network capacity to the interpretation of the high-dimensional x ∈ R¹⁵⁷⁴⁴. Hence, it scales better to larger networks and achieves higher accuracy. Remarkably, the performance is comparable to GNPE, which involves a much simpler learning task with likelihood symmetries integrated by construction. This enhanced performance comes in part at the cost of increased inference times, typically requiring hundreds of network forward passes. See Appendix D for further details.

In future work we plan to carry out a more complete analysis of GW inference using FMPE. Indeed, GW150914 is a loud event with good data quality, where NPE already performs quite well. DINGO with GNPE has been validated in a variety of settings [7, 69, 41, 74], including events with a larger performance gap between NPE and GNPE [69]. Since FMPE (like NPE) does not integrate physical symmetries, it would likely need further enhancements to fully compete with GNPE. This may require a symmetry-aware architecture [75], or simply further scaling to larger networks. A straightforward application of the GNPE mechanism to FMPE (GFMPE) is also possible, but less practical due to the higher inference costs of FMPE. Nevertheless, our results demonstrate that FMPE is a promising direction for future research in this field.

6 Conclusions

We introduced flow matching posterior estimation, a new simulation-based inference technique based on continuous normalizing flows. In contrast to existing neural posterior estimation methods, it does not rely on restricted density-estimation architectures such as discrete normalizing flows, and instead parametrizes a distribution in terms of a conditional vector field. Besides enabling flexible path specifications while maintaining direct access to the posterior density, we empirically found that regressing on a vector field rather than an entire distribution improves the scalability of FMPE compared to existing approaches. Indeed, fewer parameters are needed to learn this vector field, allowing for larger networks and ultimately enabling the solution of more complex problems. Furthermore, our architecture for FMPE (a straightforward ResNet with GLU conditioning) facilitates parallelization and allows for cheap forward/backward passes. We evaluated FMPE on a set of 10 benchmark tasks and found competitive or better performance compared to other simulation-based inference methods. On the challenging task of gravitational-wave inference, FMPE substantially outperformed comparable discrete flows, producing samples on par with a method that explicitly leverages symmetries to simplify training. Additionally, flow matching latent spaces are more naturally structured than those of discrete flows, particularly when using paths such as optimal transport. Looking forward, it would be interesting to exploit such structure in designing learning algorithms. This performance and flexibility underscores the capability of continuous normalizing flows to efficiently solve inverse problems.

Acknowledgements

We thank the DINGO team for helpful discussions and comments.
We would like to particularly acknowledge the contributions of Alessandra Buonanno, Jonathan Gair, Nihar Gupte and Michael Pürrer. This material is based upon work supported by NSF's LIGO Laboratory, which is a major facility fully funded by the National Science Foundation. This research has made use of data or software obtained from the Gravitational Wave Open Science Center (gw-openscience.org), a service of LIGO Laboratory, the LIGO Scientific Collaboration, the Virgo Collaboration, and KAGRA. LIGO Laboratory and Advanced LIGO are funded by the United States National Science Foundation (NSF) as well as the Science and Technology Facilities Council (STFC) of the United Kingdom, the Max-Planck-Society (MPS), and the State of Niedersachsen/Germany for support of the construction of Advanced LIGO and construction and operation of the GEO600 detector. Additional support for Advanced LIGO was provided by the Australian Research Council. Virgo is funded, through the European Gravitational Observatory (EGO), by the French Centre National de Recherche Scientifique (CNRS), the Italian Istituto Nazionale di Fisica Nucleare (INFN) and the Dutch Nikhef, with contributions by institutions from Belgium, Germany, Greece, Hungary, Ireland, Japan, Monaco, Poland, Portugal, Spain. The construction and operation of KAGRA are funded by the Ministry of Education, Culture, Sports, Science and Technology (MEXT) and the Japan Society for the Promotion of Science (JSPS), the National Research Foundation (NRF) and Ministry of Science and ICT (MSIT) in Korea, and Academia Sinica (AS) and the Ministry of Science and Technology (MoST) in Taiwan. M.D. thanks the Hector Fellow Academy for support. J.H.M. and B.S. are members of the MLCoE, EXC number 2064/1, Project number 390727645, and the Tübingen AI Center funded by the German Ministry for Science and Education (FKZ 01IS18039A).

References

[1] Danilo Rezende and Shakir Mohamed. Variational inference with normalizing flows. In International Conference on Machine Learning, pages 1530–1538, 2015. arXiv:1505.05770.
[2] George Papamakarios, Eric Nalisnick, Danilo Jimenez Rezende, Shakir Mohamed, and Balaji Lakshminarayanan. Normalizing flows for probabilistic modeling and inference. Journal of Machine Learning Research, 22(57):1–64, 2021. URL: http://jmlr.org/papers/v22/19-1028.html.
[3] Kyle Cranmer, Johann Brehmer, and Gilles Louppe. The frontier of simulation-based inference. Proc. Nat. Acad. Sci., 117(48):30055–30062, 2020. arXiv:1911.01429, doi:10.1073/pnas.1912789117.
[4] George Papamakarios and Iain Murray. Fast ε-free inference of simulation models with Bayesian conditional density estimation. In Advances in Neural Information Processing Systems, 2016. arXiv:1605.06376.
[5] Jan-Matthis Lueckmann, Pedro J Gonçalves, Giacomo Bassetto, Kaan Öcal, Marcel Nonnenmacher, and Jakob H Macke. Flexible statistical inference for mechanistic models of neural dynamics. In Proceedings of the 31st International Conference on Neural Information Processing Systems, pages 1289–1299, 2017.
[6] David Greenberg, Marcel Nonnenmacher, and Jakob Macke. Automatic posterior transformation for likelihood-free inference. In International Conference on Machine Learning, pages 2404–2414. PMLR, 2019.
[7] Maximilian Dax, Stephen R. Green, Jonathan Gair, Jakob H. Macke, Alessandra Buonanno, and Bernhard Schölkopf. Real-Time Gravitational Wave Science with Neural Posterior Estimation. Phys. Rev. Lett., 127(24):241103, 2021. arXiv:2106.12594, doi:10.1103/PhysRevLett.127.241103.
[8] Louis Sharrock, Jack Simons, Song Liu, and Mark Beaumont. Sequential neural score estimation: Likelihood-free inference with conditional score based diffusion models. arXiv preprint arXiv:2210.04872, 2022.
[9] Tomas Geffner, George Papamakarios, and Andriy Mnih. Score modeling for simulation-based inference. arXiv preprint arXiv:2209.14249, 2022.
[10] Jascha Sohl-Dickstein, Eric Weiss, Niru Maheswaranathan, and Surya Ganguli. Deep unsupervised learning using nonequilibrium thermodynamics. In International Conference on Machine Learning, pages 2256–2265. PMLR, 2015.
[11] Yang Song and Stefano Ermon. Generative modeling by estimating gradients of the data distribution. Advances in Neural Information Processing Systems, 32, 2019.
[12] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems, 33:6840–6851, 2020.
[13] Prafulla Dhariwal and Alexander Nichol. Diffusion models beat GANs on image synthesis. In M. Ranzato, A. Beygelzimer, Y. Dauphin, P.S. Liang, and J. Wortman Vaughan, editors, Advances in Neural Information Processing Systems, volume 34, pages 8780–8794. Curran Associates, Inc., 2021. URL: https://proceedings.neurips.cc/paper_files/paper/2021/file/49ad23d1ec9fa4bd8d77d02681df5cfa-Paper.pdf.
[14] Jonathan Ho, Chitwan Saharia, William Chan, David J. Fleet, Mohammad Norouzi, and Tim Salimans. Cascaded diffusion models for high fidelity image generation. J. Mach. Learn. Res., 23:47:1–47:33, 2022. URL: http://jmlr.org/papers/v23/21-0635.html.
[15] Yang Song, Jascha Sohl-Dickstein, Diederik P Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score-based generative modeling through stochastic differential equations. arXiv preprint arXiv:2011.13456, 2020.
[16] Yaron Lipman, Ricky T. Q. Chen, Heli Ben-Hamu, Maximilian Nickel, and Matt Le. Flow matching for generative modeling. CoRR, abs/2210.02747, 2022. arXiv:2210.02747, doi:10.48550/arXiv.2210.02747.
[17] Joeri Hermans, Arnaud Delaunoy, François Rozet, Antoine Wehenkel, and Gilles Louppe. Averting a crisis in simulation-based inference. arXiv preprint arXiv:2110.06581, 2021.
[18] Tian Qi Chen, Yulia Rubanova, Jesse Bettencourt, and David Duvenaud. Neural ordinary differential equations. In Samy Bengio, Hanna M. Wallach, Hugo Larochelle, Kristen Grauman, Nicolò Cesa-Bianchi, and Roman Garnett, editors, Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems 2018, NeurIPS 2018, December 3-8, 2018, Montréal, Canada, pages 6572–6583, 2018. URL: https://proceedings.neurips.cc/paper/2018/hash/69386f6bb1dfed68692a24c8686939b9-Abstract.html.
[19] Michael S Albergo and Eric Vanden-Eijnden. Building normalizing flows with stochastic interpolants. arXiv preprint arXiv:2209.15571, 2022.
[20] Xingchao Liu, Chengyue Gong, and Qiang Liu. Flow straight and fast: Learning to generate and transfer data with rectified flow. arXiv preprint arXiv:2209.03003, 2022.
[21] Kirill Neklyudov, Daniel Severo, and Alireza Makhzani. Action matching: A variational method for learning stochastic dynamics from samples. arXiv preprint arXiv:2210.06662, 2022.
[22] Aram Davtyan, Sepehr Sameni, and Paolo Favaro. Randomized conditional flow matching for video prediction. arXiv preprint arXiv:2211.14575, 2022.
[23] Alexander Tong, Nikolay Malkin, Guillaume Huguet, Yanlei Zhang, Jarrid Rector-Brooks, Kilian Fatras, Guy Wolf, and Yoshua Bengio. Conditional flow matching: Simulation-free dynamic optimal transport.
arXiv preprint arXiv:2302.00482, 2023.
[24] Scott A Sisson, Yanan Fan, and Mark A Beaumont. Overview of ABC. In Handbook of Approximate Bayesian Computation, pages 3–54. Chapman and Hall/CRC, 2018.
[25] Mark A Beaumont, Wenyang Zhang, and David J Balding. Approximate Bayesian computation in population genetics. Genetics, 162(4):2025–2035, 2002.
[26] Mark A Beaumont, Jean-Marie Cornuet, Jean-Michel Marin, and Christian P Robert. Adaptive approximate Bayesian computation. Biometrika, 96(4):983–990, 2009.
[27] Michael G. B. Blum and Olivier François. Non-linear regression models for approximate Bayesian computation. Stat. Comput., 20(1):63–73, 2010. doi:10.1007/s11222-009-9116-0.
[28] Dennis Prangle, Paul Fearnhead, Murray P. Cox, Patrick J. Biggs, and Nigel P. French. Semi-automatic selection of summary statistics for ABC model choice, 2013. arXiv:1302.5624.
[29] Simon Wood. Statistical inference for noisy nonlinear ecological dynamic systems. Nature, 466:1102–4, 08 2010. doi:10.1038/nature09319.
[30] Christopher C Drovandi, Clara Grazian, Kerrie Mengersen, and Christian Robert. Approximating the likelihood in approximate Bayesian computation, 2018. arXiv:1803.06645.
[31] George Papamakarios, David Sterratt, and Iain Murray. Sequential neural likelihood: Fast likelihood-free inference with autoregressive flows. In The 22nd International Conference on Artificial Intelligence and Statistics, pages 837–848. PMLR, 2019.
[32] Jan-Matthis Lueckmann, Giacomo Bassetto, Theofanis Karaletsos, and Jakob H Macke. Likelihood-free inference with emulator networks. In Symposium on Advances in Approximate Bayesian Inference, pages 32–53. PMLR, 2019.
[33] Rafael Izbicki, Ann Lee, and Chad Schafer. High-dimensional density ratio estimation with extensions to approximate likelihood computation. In Artificial Intelligence and Statistics, pages 420–429. PMLR, 2014.
[34] Kim Pham, David Nott, and Sanjay Chaudhuri. A note on approximating ABC-MCMC using flexible classifiers. Stat, 3, 03 2014. doi:10.1002/sta4.56.
[35] Kyle Cranmer, Juan Pavez, and Gilles Louppe. Approximating likelihood ratios with calibrated discriminative classifiers. arXiv preprint arXiv:1506.02169, 2015.
[36] Joeri Hermans, Volodimir Begy, and Gilles Louppe. Likelihood-free MCMC with approximate likelihood ratios. arXiv preprint arXiv:1903.04057, 10, 2019.
[37] Conor Durkan, Iain Murray, and George Papamakarios. On contrastive learning for likelihood-free inference. In International Conference on Machine Learning, pages 2771–2781. PMLR, 2020.
[38] Owen Thomas, Ritabrata Dutta, Jukka Corander, Samuel Kaski, and Michael U. Gutmann. Likelihood-free inference by ratio estimation, 2020. arXiv:1611.10242.
[39] Benjamin K Miller, Christoph Weniger, and Patrick Forré. Contrastive neural ratio estimation. Advances in Neural Information Processing Systems, 35:3262–3278, 2022.
[40] Thomas Müller, Brian McWilliams, Fabrice Rousselle, Markus Gross, and Jan Novák. Neural importance sampling. ACM Transactions on Graphics (TOG), 38(5):1–19, 2019.
[41] Maximilian Dax, Stephen R. Green, Jonathan Gair, Michael Pürrer, Jonas Wildberger, Jakob H. Macke, Alessandra Buonanno, and Bernhard Schölkopf. Neural Importance Sampling for Rapid and Reliable Gravitational-Wave Inference. Phys. Rev. Lett., 130(17):171403, 2023. arXiv:2210.05686, doi:10.1103/PhysRevLett.130.171403.
[42] Michael S Albergo, Nicholas M Boffi, and Eric Vanden-Eijnden. Stochastic interpolants: A unifying framework for flows and diffusions. arXiv preprint arXiv:2303.08797, 2023.
[43] Yang Song, Conor Durkan, Iain Murray, and Stefano Ermon. Maximum likelihood training of score-based diffusion models. Advances in Neural Information Processing Systems, 34:1415–1428, 2021.
[44] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-Net: Convolutional networks for biomedical image segmentation. In Medical Image Computing and Computer-Assisted Intervention, MICCAI 2015: 18th International Conference, Munich, Germany, October 5-9, 2015, Proceedings, Part III 18, pages 234–241. Springer, 2015.
[45] Yann N Dauphin, Angela Fan, Michael Auli, and David Grangier. Language modeling with gated convolutional networks. In International Conference on Machine Learning, pages 933–941. PMLR, 2017.
[46] Jan-Matthis Lueckmann, Jan Boelts, David Greenberg, Pedro Goncalves, and Jakob Macke. Benchmarking simulation-based inference. In International Conference on Artificial Intelligence and Statistics, pages 343–351. PMLR, 2021.
[47] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition, 2015. arXiv:1512.03385.
[48] Jerome H Friedman. On multivariate goodness of fit and two sample testing. Statistical Problems in Particle Physics, Astrophysics, and Cosmology, 1:311, 2003.
[49] David Lopez-Paz and Maxime Oquab. Revisiting classifier two-sample tests. arXiv preprint arXiv:1610.06545, 2016.
[50] Arnaud Delaunoy, Joeri Hermans, François Rozet, Antoine Wehenkel, and Gilles Louppe. Towards reliable simulation-based inference with balanced neural ratio estimation, 2022. arXiv:2208.13624.
[51] Arnaud Delaunoy, Benjamin Kurt Miller, Patrick Forré, Christoph Weniger, and Gilles Louppe. Balancing simulation-based inference for conservative posteriors. arXiv preprint arXiv:2304.10978, 2023.
[52] R. Abbott et al. GWTC-3: Compact Binary Coalescences Observed by LIGO and Virgo During the Second Part of the Third Observing Run. arXiv preprint arXiv:2111.03606, 11 2021. arXiv:2111.03606.
[53] R. Abbott et al. Population Properties of Compact Objects from the Second LIGO-Virgo Gravitational-Wave Transient Catalog. Astrophys. J. Lett., 913(1):L7, 2021. arXiv:2010.14533, doi:10.3847/2041-8213/abe949.
[54] B. P. Abbott et al. GW170817: Measurements of neutron star radii and equation of state. Phys. Rev. Lett., 121(16):161101, 2018. arXiv:1805.11581, doi:10.1103/PhysRevLett.121.161101.
[55] R. Abbott et al. Tests of general relativity with binary black holes from the second LIGO-Virgo gravitational-wave transient catalog. Phys. Rev. D, 103(12):122002, 2021. arXiv:2010.14529, doi:10.1103/PhysRevD.103.122002.
[56] B. P. Abbott et al. A gravitational-wave standard siren measurement of the Hubble constant. Nature, 551(7678):85–88, 2017. arXiv:1710.05835, doi:10.1038/nature24471.
[57] J. Veitch, V. Raymond, B. Farr, W. Farr, P. Graff, S. Vitale, et al. Parameter estimation for compact binaries with ground-based gravitational-wave observations using the LALInference software library. Phys. Rev., D91(4):042003, 2015. arXiv:1409.7215, doi:10.1103/PhysRevD.91.042003.
[58] Gregory Ashton et al. BILBY: A user-friendly Bayesian inference library for gravitational-wave astronomy. Astrophys. J. Suppl., 241(2):27, 2019. arXiv:1811.02042, doi:10.3847/1538-4365/ab06fc.
[59] I. M. Romero-Shaw et al. Bayesian inference for compact binary coalescences with bilby: validation and application to the first LIGO-Virgo gravitational-wave transient catalogue. Mon. Not. Roy. Astron. Soc., 499(3):3295–3319, 2020. arXiv:2006.00714, doi:10.1093/mnras/staa2850.
[60] Joshua S Speagle. dynesty: a dynamic nested sampling package for estimating Bayesian posteriors and evidences. Monthly Notices of the Royal Astronomical Society, 493(3):3132–3158, Feb 2020. URL: http://dx.doi.org/10.1093/mnras/staa278, arXiv:1904.02180, doi:10.1093/mnras/staa278.
[61] Nicholas Metropolis, Arianna W Rosenbluth, Marshall N Rosenbluth, Augusta H Teller, and Edward Teller. Equation of state calculations by fast computing machines. The Journal of Chemical Physics, 21(6):1087–1092, 1953.
[62] W. K. Hastings. Monte Carlo sampling methods using Markov chains and their applications. Biometrika, 57(1):97–109, 04 1970. arXiv:https://academic.oup.com/biomet/article-pdf/57/1/97/23940249/57-1-97.pdf, doi:10.1093/biomet/57.1.97.
[63] John Skilling. Nested sampling for general Bayesian computation. Bayesian Analysis, 1(4):833–859, 2006. doi:10.1214/06-BA127.
[64] Elena Cuoco, Jade Powell, Marco Cavaglià, Kendall Ackley, Michał Bejger, Chayan Chatterjee, Michael Coughlin, Scott Coughlin, Paul Easter, Reed Essick, et al. Enhancing gravitational-wave science with machine learning. Machine Learning: Science and Technology, 2(1):011002, 5 2020. arXiv:2005.03745, doi:10.1088/2632-2153/abb93a.
[65] Hunter Gabbard, Chris Messenger, Ik Siong Heng, Francesco Tonolini, and Roderick Murray-Smith. Bayesian parameter estimation using conditional variational autoencoders for gravitational-wave astronomy. Nature Phys., 18(1):112–117, 2022. arXiv:1909.06296, doi:10.1038/s41567-021-01425-7.
[66] Stephen R. Green, Christine Simpson, and Jonathan Gair. Gravitational-wave parameter estimation with autoregressive neural network flows. Phys. Rev. D, 102(10):104057, 2020. arXiv:2002.07656, doi:10.1103/PhysRevD.102.104057.
[67] Arnaud Delaunoy, Antoine Wehenkel, Tanja Hinderer, Samaya Nissanke, Christoph Weniger, Andrew R. Williamson, and Gilles Louppe. Lightning-Fast Gravitational Wave Parameter Inference through Neural Amortization. In Third Workshop on Machine Learning and the Physical Sciences, 10 2020. arXiv:2010.12931.
[68] Stephen R. Green and Jonathan Gair. Complete parameter inference for GW150914 using deep learning. Mach. Learn. Sci. Tech., 2(3):03LT01, 2021. arXiv:2008.03312, doi:10.1088/2632-2153/abfaed.
[69] Maximilian Dax, Stephen R. Green, Jonathan Gair, Michael Deistler, Bernhard Schölkopf, and Jakob H. Macke. Group equivariant neural posterior estimation. In International Conference on Learning Representations, 11 2022. arXiv:2111.13139.
[70] Chayan Chatterjee, Linqing Wen, Damon Beveridge, Foivos Diakogiannis, and Kevin Vinsen. Rapid localization of gravitational wave sources from compact binary coalescences using deep learning. arXiv preprint arXiv:2207.14522, 7 2022. arXiv:2207.14522.
[71] Stefan T. Radev, Ulf K. Mertens, Andreas Voss, Lynton Ardizzone, and Ullrich Köthe. BayesFlow: Learning complex stochastic models with invertible neural networks, 2020. arXiv:2003.06281.
[72] Conor Durkan, Artur Bekasov, Iain Murray, and George Papamakarios. Neural spline flows. In Advances in Neural Information Processing Systems, pages 7509–7520, 2019. arXiv:1906.04032.
[73] B.P. Abbott et al. Observation of Gravitational Waves from a Binary Black Hole Merger. Phys. Rev. Lett., 116(6):061102, 2016. arXiv:1602.03837, doi:10.1103/PhysRevLett.116.061102.
[74] Jonas Wildberger, Maximilian Dax, Stephen R. Green, Jonathan Gair, Michael Pürrer, Jakob H. Macke, Alessandra Buonanno, and Bernhard Schölkopf. Adapting to noise distribution shifts in flow-based gravitational-wave inference.
Phys. Rev. D, 107(8):084046, 2023. arXiv:2211.08801, doi:10.1103/PhysRevD.107.084046.
[75] Taco Cohen and Max Welling. Group equivariant convolutional networks. In Maria-Florina Balcan and Kilian Q. Weinberger, editors, Proceedings of the 33rd International Conference on Machine Learning, ICML 2016, New York City, NY, USA, June 19-24, 2016, volume 48 of JMLR Workshop and Conference Proceedings, pages 2990–2999. JMLR.org, 2016. URL: http://proceedings.mlr.press/v48/cohenc16.html.
[76] Mark Hannam, Patricia Schmidt, Alejandro Bohé, Leïla Haegel, Sascha Husa, Frank Ohme, Geraint Pratten, and Michael Pürrer. Simple model of complete precessing black-hole-binary gravitational waveforms. Phys. Rev. Lett., 113:151101, Oct 2014. URL: https://link.aps.org/doi/10.1103/PhysRevLett.113.151101, doi:10.1103/PhysRevLett.113.151101.
[77] Sebastian Khan, Sascha Husa, Mark Hannam, Frank Ohme, Michael Pürrer, Xisco Jiménez Forteza, and Alejandro Bohé. Frequency-domain gravitational waves from nonprecessing black-hole binaries. II. A phenomenological model for the advanced detector era. Phys. Rev., D93(4):044007, 2016. arXiv:1508.07253, doi:10.1103/PhysRevD.93.044007.
[78] Alejandro Bohé, Mark Hannam, Sascha Husa, Frank Ohme, Michael Pürrer, and Patricia Schmidt. PhenomPv2 technical notes for the LAL implementation. LIGO Technical Document, LIGO-T1500602-v4, 2016. URL: https://dcc.ligo.org/LIGO-T1500602/public.
[79] Benjamin Farr, Evan Ochsner, Will M. Farr, and Richard O'Shaughnessy. A more effective coordinate system for parameter estimation of precessing compact binaries from gravitational waves. Phys. Rev. D, 90(2):024018, 2014. arXiv:1404.7070, doi:10.1103/PhysRevD.90.024018.
[80] J.R. Dormand and P.J. Prince. A family of embedded Runge-Kutta formulae. Journal of Computational and Applied Mathematics, 6(1):19–26, 1980. URL: https://www.sciencedirect.com/science/article/pii/0771050X80900133, doi:10.1016/0771-050X(80)90013-3.

A Gaussian flow

We here derive the form of a vector field v_t(θ) that restricts the resulting continuous flow to a one-dimensional Gaussian with mean μ̂ and variance σ̂². With the optimal transport path μ_t(θ_1) = tθ_1, σ_t(θ_1) = 1 − (1 − σ_min)t ≡ σ_t from [16], the sample-conditional probability path (5) reads

$$p_t(\theta|\theta_1) = \mathcal{N}[t\theta_1, \sigma_t^2](\theta). \qquad (10)$$

We set our target distribution

$$q_1(\theta_1) = \mathcal{N}[\hat\mu, \hat\sigma^2](\theta_1). \qquad (11)$$

To derive the marginal probability path and the marginal vector field we need two identities for the convolution of Gaussian densities. Recall that the convolution of two functions is defined by $f * g(x) = \int f(x - y)\, g(y)\, dy$. We define the function

$$g_{\mu,\sigma^2}(\theta) = \theta\, \mathcal{N}[\mu, \sigma^2](\theta). \qquad (12)$$

Then the following holds:

$$\mathcal{N}[\mu_1, \sigma_1^2] * \mathcal{N}[\mu_2, \sigma_2^2] = \mathcal{N}[\mu_1 + \mu_2, \sigma_1^2 + \sigma_2^2], \qquad (13)$$

$$g_{0,\sigma_1^2} * \mathcal{N}[\mu_2, \sigma_2^2] = \frac{\sigma_1^2}{\sigma_1^2 + \sigma_2^2} \Big( g_{\mu_2,\, \sigma_1^2 + \sigma_2^2} - \mu_2\, \mathcal{N}[\mu_2, \sigma_1^2 + \sigma_2^2] \Big). \qquad (14)$$

Marginal probability path. Marginalizing over θ_1 in (10) with (11), we find

$$p_t(\theta) = \int p_t(\theta|\theta_1)\, q(\theta_1)\, d\theta_1 = \int \mathcal{N}[t\theta_1, \sigma_t^2](\theta)\, \mathcal{N}[\hat\mu, \hat\sigma^2](\theta_1)\, d\theta_1 = \int \mathcal{N}[0, \sigma_t^2](\theta - t\theta_1)\, \mathcal{N}[t\hat\mu, (t\hat\sigma)^2](t\theta_1)\, t\, d\theta_1$$
$$= \int \mathcal{N}[0, \sigma_t^2](\theta - \theta_1^t)\, \mathcal{N}[t\hat\mu, (t\hat\sigma)^2](\theta_1^t)\, d\theta_1^t = \mathcal{N}[t\hat\mu,\, \sigma_t^2 + (t\hat\sigma)^2](\theta), \qquad (15)$$

where we defined θ_1^t = tθ_1 and used (13).

Marginal vector field. We now calculate the marginalized vector field u_t(θ) based on equation (8) in [16].
Using the sample-conditional vector field (6) and the distributions (10) and (11) we find

$$u_t(\theta) = \frac{1}{p_t(\theta)} \int u_t(\theta|\theta_1)\, p_t(\theta|\theta_1)\, q(\theta_1)\, d\theta_1 = \frac{1}{p_t(\theta)} \int \frac{\theta_1 - (1-\sigma_{\min})\theta}{\sigma_t}\, \mathcal{N}[t\theta_1, \sigma_t^2](\theta)\, \mathcal{N}[\hat\mu, \hat\sigma^2](\theta_1)\, d\theta_1$$
$$= \frac{1}{p_t(\theta)} \int \frac{\theta_1 - (1-\sigma_{\min})\theta}{\sigma_t}\, \mathcal{N}[0, \sigma_t^2](\theta - t\theta_1)\, \mathcal{N}[t\hat\mu, (t\hat\sigma)^2](t\theta_1)\, t\, d\theta_1$$
$$= \frac{1}{p_t(\theta)} \int \frac{\theta_1' - (1-\sigma_{\min})t\,\theta}{\sigma_t\, t}\, \mathcal{N}[0, \sigma_t^2](\theta - \theta_1')\, \mathcal{N}[t\hat\mu, (t\hat\sigma)^2](\theta_1')\, d\theta_1'$$
$$= \frac{1}{p_t(\theta)} \int \frac{-\theta_1'' + \big(1 - (1-\sigma_{\min})t\big)\theta}{\sigma_t\, t}\, \mathcal{N}[0, \sigma_t^2](\theta_1'')\, \mathcal{N}[t\hat\mu, (t\hat\sigma)^2](\theta - \theta_1'')\, d\theta_1''$$
$$= \frac{1}{p_t(\theta)} \int \frac{-\theta_1'' + \sigma_t\,\theta}{\sigma_t\, t}\, \mathcal{N}[0, \sigma_t^2](\theta_1'')\, \mathcal{N}[t\hat\mu, (t\hat\sigma)^2](\theta - \theta_1'')\, d\theta_1'', \qquad (16)$$

where we used the changes of variables θ_1' = tθ_1 and θ_1'' = θ − θ_1'. Now we evaluate this expression using (12), then the identities (13) and (14) and the marginal probability (15):

$$u_t(\theta) = -\frac{1}{p_t(\theta)\, \sigma_t\, t}\, \big(g_{0,\sigma_t^2} * \mathcal{N}[t\hat\mu, (t\hat\sigma)^2]\big)(\theta) + \frac{\theta}{p_t(\theta)\, t}\, \big(\mathcal{N}[0, \sigma_t^2] * \mathcal{N}[t\hat\mu, (t\hat\sigma)^2]\big)(\theta)$$
$$= -\frac{1}{p_t(\theta)\, \sigma_t\, t}\, (\theta - t\hat\mu)\, \frac{\sigma_t^2}{\sigma_t^2 + (t\hat\sigma)^2}\, \mathcal{N}[t\hat\mu,\, \sigma_t^2 + (t\hat\sigma)^2](\theta) + \frac{\theta}{p_t(\theta)\, t}\, \mathcal{N}[t\hat\mu,\, \sigma_t^2 + (t\hat\sigma)^2](\theta)$$
$$= \frac{\big(\sigma_t^2 + (t\hat\sigma)^2\big)\theta - (\theta - t\hat\mu)\,\sigma_t}{t\,\big(\sigma_t^2 + (t\hat\sigma)^2\big)}\,\frac{p_t(\theta)}{p_t(\theta)} = \frac{\big(\sigma_t^2 + (t\hat\sigma)^2 - \sigma_t\big)\theta + t\hat\mu\,\sigma_t}{t\,\big(\sigma_t^2 + (t\hat\sigma)^2\big)}. \qquad (17)$$

By choosing a vector field v_t of the form (17) with learnable parameters μ̂, σ̂², we can thus define a continuous flow that is restricted to a one-dimensional Gaussian.

B Mass covering properties of flows

In this supplement, we investigate the mass covering properties of continuous normalizing flows trained using the mean squared error and in particular prove Theorem 1. We first recall the notation from the main part. We always assume that the data is distributed according to p_1(θ). In addition, there is a known and simple base distribution p_0, and we assume that there is a vector field u_t : [0, 1] × R^d → R^d that connects p_0 and p_1 in the following sense. We denote by φ_t the flow generated by u_t, i.e., φ_t satisfies

$$\partial_t \varphi_t(\theta) = u_t(\varphi_t(\theta)). \qquad (18)$$

Then we assume that (φ_1)_# p_0 = p_1 and we also define the interpolations p_t = (φ_t)_# p_0. We do not have access to the ground truth distributions p_t and the vector field u_t, but we try to learn a vector field v_t approximating u_t. We denote its flow by ψ_t and we define q_t = (ψ_t)_# q_0 and q_0 = p_0. We are interested in the mass covering properties of the learned approximation q_1 of p_1. In particular, we want to relate the KL divergence D_KL(p_1||q_1) to the mean squared error

$$\mathrm{MSE}_p(u, v) = \int_0^1 dt \int p_t(d\theta)\, \big(u_t(\theta) - v_t(\theta)\big)^2 \qquad (19)$$

of the generating vector fields. The first observation is that without any regularity assumptions on v_t it is impossible to obtain any bound on the KL divergence in terms of the mean squared error.

Lemma 1. For every ε > 0 there are vector fields u_t and v_t and a base distribution p_0 = q_0 such that

$$\mathrm{MSE}_p(u, v) < \varepsilon \quad \text{and} \quad D_{\mathrm{KL}}(p_1\|q_1) = \infty. \qquad (20)$$

In addition, we can construct u_t and v_t such that the support of p_1 is larger than the support of q_1.

Proof. We consider the uniform distribution p_0 = q_0 ∼ U([−1, 1]) and the vector fields

$$u_t(\theta) = 0, \qquad (21)$$
$$v_t(\theta) = \begin{cases} \varepsilon & \text{for } 0 \le \theta < \varepsilon, \\ 0 & \text{otherwise.} \end{cases} \qquad (22)$$

As before, let φ_t denote the flow of the vector field u_t and similarly ψ_t denote the flow of v_t. Clearly φ_t(θ) = θ. On the other hand,

$$\psi_t(\theta) = \begin{cases} \min(\theta + \varepsilon t, \varepsilon) & \text{if } 0 \le \theta < \varepsilon, \\ \theta & \text{otherwise.} \end{cases} \qquad (23)$$

In particular,

$$\psi_1(\theta) = \begin{cases} \varepsilon & \text{if } 0 \le \theta < \varepsilon, \\ \theta & \text{otherwise.} \end{cases} \qquad (24)$$

This implies that p_1 = (φ_1)_# p_0 ∼ U([−1, 1]). On the other hand, q_1 = (ψ_1)_# q_0 has support in [−1, 0] ∪ [ε, 1]. In particular, the distribution q_1 is not mass covering with respect to p_1, and D_KL(p_1||q_1) = ∞. Finally, we observe that the MSE can be arbitrarily small:

$$\mathrm{MSE}_p(u, v) = \int_0^1 dt \int p_t(d\theta)\, |u_t(\theta) - v_t(\theta)|^2 = \int_0^1 dt \int_0^\varepsilon \tfrac{1}{2}\, \varepsilon^2\, d\theta = \tfrac{\varepsilon^3}{2}. \qquad (25)$$

Here we used that the density of p_t(dθ) is 1/2 for −1 ≤ θ ≤ 1. We see that an arbitrarily small MSE loss cannot ensure that the probability distribution is mass covering and the KL divergence is finite.
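As a sanity check of this counterexample, the construction can be simulated numerically. The sketch below (with an assumed ε = 0.1 and a simple Euler integrator) shows that the learned flow leaves the interval (0, ε) empty while p_1 places mass ε/2 there, even though the mean squared error between the two vector fields is only ε³/2.

```python
# Numerical illustration of Lemma 1 (assumed setup, eps = 0.1): transporting p_0 = U([-1, 1])
# with u_t = 0 leaves p_1 uniform, while the discontinuous v_t pushes all mass from [0, eps)
# onto the point eps, leaving a "hole" in (0, eps) and hence an infinite forward KL divergence.
import numpy as np

eps, n, steps = 0.1, 100_000, 1000
rng = np.random.default_rng(0)
theta = rng.uniform(-1.0, 1.0, size=n)          # theta_0 ~ p_0 = q_0
dt = 1.0 / steps
for _ in range(steps):                          # Euler integration of d theta/dt = v_t(theta)
    v = np.where((theta >= 0.0) & (theta < eps), eps, 0.0)
    theta = theta + dt * v

in_hole_q = np.mean((theta > 0.0) & (theta < eps))   # mass q_1 assigns to (0, eps): ~0
in_hole_p = eps / 2                                  # mass p_1 assigns to (0, eps)
mse = eps**3 / 2                                     # integral of |u - v|^2 over p_t and t
print(f"q_1 mass in (0, {eps}): {in_hole_q:.4f}  vs  p_1 mass: {in_hole_p:.4f}, MSE = {mse:.1e}")
```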
B Mass covering properties of flows

In this supplement, we investigate the mass-covering properties of continuous normalizing flows trained using a mean squared error and in particular prove Theorem 1. We first recall the notation from the main part. We always assume that the data is distributed according to $p_1(\theta)$. In addition, there is a known and simple base distribution $p_0$, and we assume that there is a vector field $u_t : [0,1] \times \mathbb{R}^d \to \mathbb{R}^d$ that connects $p_0$ and $p_1$ in the following sense. We denote by $\varphi_t$ the flow generated by $u_t$, i.e., $\varphi_t$ satisfies
$\partial_t \varphi_t(\theta) = u_t(\varphi_t(\theta))$.  (18)
Then we assume that $(\varphi_1)_\ast p_0 = p_1$, and we also define the interpolations $p_t = (\varphi_t)_\ast p_0$. We do not have access to the ground-truth distributions $p_t$ and the vector field $u_t$, but we try to learn a vector field $v_t$ approximating $u_t$. We denote its flow by $\psi_t$ and we define $q_t = (\psi_t)_\ast q_0$ with $q_0 = p_0$. We are interested in the mass-covering properties of the learned approximation $q_1$ of $p_1$. In particular, we want to relate the KL divergence $D_{\mathrm{KL}}(p_1 \| q_1)$ to the mean squared error
$\mathrm{MSE}_p(u, v) = \int_0^1 \mathrm{d}t \int p_t(\mathrm{d}\theta)\, |u_t(\theta) - v_t(\theta)|^2$  (19)
of the generating vector fields. The first observation is that, without any regularity assumptions on $v_t$, it is impossible to obtain any bound on the KL divergence in terms of the mean squared error.

Lemma 1. For every $\varepsilon > 0$ there are vector fields $u_t$ and $v_t$ and a base distribution $p_0 = q_0$ such that
$\mathrm{MSE}_p(u, v) < \varepsilon$ and $D_{\mathrm{KL}}(p_1 \| q_1) = \infty$.  (20)
In addition, we can construct $u_t$ and $v_t$ such that the support of $p_1$ is larger than the support of $q_1$.

Proof. We consider the uniform distribution $p_0 = q_0 \sim U([-1, 1])$ and the vector fields
$u_t(\theta) = 0$,  (21)
$v_t(\theta) = \varepsilon$ for $0 \le \theta < \varepsilon$, and $0$ otherwise.  (22)
As before, let $\varphi_t$ denote the flow of the vector field $u_t$ and similarly $\psi_t$ the flow of $v_t$. Clearly $\varphi_t(\theta) = \theta$. On the other hand,
$\psi_t(\theta) = \min(\theta + \varepsilon t, \varepsilon)$ if $0 \le \theta < \varepsilon$, and $\psi_t(\theta) = \theta$ otherwise.  (23)
In particular,
$\psi_1(\theta) = \varepsilon$ if $0 \le \theta < \varepsilon$, and $\psi_1(\theta) = \theta$ otherwise.  (24)
This implies that $p_1 = (\varphi_1)_\ast p_0 \sim U([-1, 1])$. On the other hand, $q_1 = (\psi_1)_\ast q_0$ has support in $[-1, 0] \cup [\varepsilon, 1]$. In particular, the distribution $q_1$ is not mass covering with respect to $p_1$ and $D_{\mathrm{KL}}(p_1 \| q_1) = \infty$. Finally, we observe that the MSE can be arbitrarily small:
$\mathrm{MSE}_p(u, v) = \int_0^1 \mathrm{d}t \int p_t(\mathrm{d}\theta)\, |u_t(\theta) - v_t(\theta)|^2 = \int_0^1 \mathrm{d}t \int_0^\varepsilon \frac{\varepsilon^2}{2}\, \mathrm{d}\theta = \frac{\varepsilon^3}{2} < \varepsilon$.
Here we used that the density of $p_t(\mathrm{d}\theta)$ is $1/2$ for $-1 \le \theta \le 1$.

We see that an arbitrarily small MSE loss cannot ensure that the probability distribution is mass covering and the KL divergence is finite. On a high level, this can be explained by the fact that for vector fields $v_t$ that are not Lipschitz continuous the flow is not necessarily continuous, and we can generate holes in the distribution. Note that we chose $p_0$ to be a uniform distribution for simplicity, but the result extends to any smooth distribution; in particular, the result does not rely on the discontinuity of $p_0$.

Next, we investigate the mass-covering property for Lipschitz continuous flows. When the vector fields $u_t$ and $v_t$ are Lipschitz continuous (in $\theta$), the flow maps $\varphi_1$ and $\psi_1$ are continuous in $\theta$ and it is not possible to create holes in the distribution as shown above for non-continuous vector fields. We show a weaker bound in this setting.

Lemma 2. For every $0 \le \delta \le 1$ there is a base distribution $p_0 = q_0$ and there are Lipschitz-continuous vector fields $u_t$ and $v_t$ such that $\mathrm{MSE}_p(u, v) = \delta$ and
$D_{\mathrm{KL}}(p_1 \| q_1) \ge \tfrac{1}{2}\, \mathrm{MSE}_p(u, v)^{1/3}$.  (26)

Proof. We consider $p_0$, $q_0$ and $u_t$ as in Lemma 1, and we define
$v_t(\theta) = 2\theta$ for $0 \le \theta < \varepsilon$, $\;v_t(\theta) = 2(2\varepsilon - \theta)$ for $\varepsilon \le \theta < 2\varepsilon$, and $0$ otherwise.  (27)
Then we can calculate for $0 \le \theta \le e^{-2}\varepsilon$ that
$\psi_t(\theta) = \theta e^{2t}$.  (28)
Similarly, we obtain for $\varepsilon \le \theta \le 2\varepsilon$ (solving the ODE $f' = 2(2\varepsilon - f)$)
$\psi_t(\theta) = 2\varepsilon - (2\varepsilon - \theta) e^{-2t}$.  (29)
In particular,
$\psi_1(0) = 0, \quad \psi_1(e^{-2}\varepsilon) = \varepsilon, \quad \psi_1(\varepsilon) = 2\varepsilon - \varepsilon e^{-2}, \quad \psi_1(2\varepsilon) = 2\varepsilon$.  (30)
Next, we find for the densities of $q_1$ that
$q_1(\psi_1(\theta)) = q_0(\theta)\, |\psi_1'(\theta)|^{-1} = \tfrac{1}{2} e^{-2}$ for $0 \le \theta \le e^{-2}\varepsilon$, and $\tfrac{1}{2} e^{2}$ for $\varepsilon \le \theta \le 2\varepsilon$.  (31)
Together with (30) this implies that the density of $q_1$ is given by
$q_1(\theta) = \tfrac{1}{2} e^{-2}$ for $0 \le \theta \le \varepsilon$, and $\tfrac{1}{2} e^{2}$ for $2\varepsilon - \varepsilon e^{-2} \le \theta \le 2\varepsilon$.  (32)
Note that $p_1(\theta) = 1/2$ for $-1 \le \theta \le 1$ and therefore
$\int_0^\varepsilon \ln \frac{p_1(\theta)}{q_1(\theta)}\, p_1(\mathrm{d}\theta) = \int_0^\varepsilon 2 \cdot \tfrac{1}{2}\, \mathrm{d}\theta = \varepsilon$,  (33)
$\int_{2\varepsilon - \varepsilon e^{-2}}^{2\varepsilon} \ln \frac{p_1(\theta)}{q_1(\theta)}\, p_1(\mathrm{d}\theta) = \int_{2\varepsilon - \varepsilon e^{-2}}^{2\varepsilon} \ln(e^{-2}) \cdot \tfrac{1}{2}\, \mathrm{d}\theta = -\varepsilon e^{-2}$.  (34)
Moreover, we note
$\int_\varepsilon^{2\varepsilon - \varepsilon e^{-2}} q_1(\mathrm{d}\theta) = \int_{e^{-2}\varepsilon}^{\varepsilon} q_0(\mathrm{d}\theta) = \tfrac{1}{2}\varepsilon(1 - e^{-2}) = \int_\varepsilon^{2\varepsilon - \varepsilon e^{-2}} p_1(\mathrm{d}\theta)$,  (35)
which implies (by positivity of the KL divergence) that
$\int_\varepsilon^{2\varepsilon - \varepsilon e^{-2}} \ln \frac{p_1(\theta)}{q_1(\theta)}\, p_1(\mathrm{d}\theta) \ge 0$.  (36)
We infer, using also $p_1(\theta) = q_1(\theta) = 1/2$ for $\theta \in [-1, 0] \cup [2\varepsilon, 1]$, that
$D_{\mathrm{KL}}(p_1 \| q_1) = \int \ln \frac{p_1(\theta)}{q_1(\theta)}\, p_1(\mathrm{d}\theta) \ge \varepsilon(1 - e^{-2})$.  (37)
On the other hand, we can bound
$\mathrm{MSE}_p(u, v) = \int_0^1 \mathrm{d}t \int p_t(\mathrm{d}\theta)\, |v_t(\theta) - u_t(\theta)|^2 = \tfrac{1}{2} \int_0^{2\varepsilon} |v_t(\theta)|^2\, \mathrm{d}\theta = 4 \int_0^\varepsilon s^2\, \mathrm{d}s = \tfrac{4}{3}\varepsilon^3$.  (38)
Since $\varepsilon(1 - e^{-2}) \ge \tfrac{1}{2}\bigl(\tfrac{4}{3}\varepsilon^3\bigr)^{1/3}$, we conclude that
$D_{\mathrm{KL}}(p_1 \| q_1) \ge \tfrac{1}{2}\, \mathrm{MSE}_p(u, v)^{1/3}$.  (39)
In particular, it is not possible to bound the KL divergence by the MSE even when the vector fields are Lipschitz continuous.
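The construction of Lemma 1 is easy to probe numerically. The following self-contained sketch (our own illustration; the values of $\varepsilon$, the step size, and the seed are arbitrary) integrates the flow of the non-Lipschitz vector field with an explicit Euler scheme and confirms that $q_1$ assigns essentially no mass to the interval $(0, \varepsilon)$, while the mean squared error against $u = 0$ is only $\varepsilon^3/2$:

```python
import numpy as np

# Numerical illustration of Lemma 1: the vector field v transports all mass from
# [0, eps) to (approximately) the point eps, creating a hole in q_1, while the
# squared error against u = 0 stays small. eps and dt are arbitrary choices.
rng = np.random.default_rng(0)
eps, dt = 0.1, 1e-3

def v(theta):
    return np.where((theta >= 0) & (theta < eps), eps, 0.0)

theta = rng.uniform(-1.0, 1.0, size=200_000)   # samples from p_0 = q_0 = U([-1, 1])
for _ in range(int(1 / dt)):                   # explicit Euler for d theta / dt = v(theta)
    theta = theta + dt * v(theta)

print("q_1 mass in (0, eps):", np.mean((theta > 0) & (theta < eps)))   # ~ 0
print("p_1 mass in (0, eps):", eps / 2)                                # = 0.05
print("MSE_p(u, v) = eps^3 / 2 =", eps**3 / 2)
```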
Let us put this into context. It was already shown in [42] that one can, in general, not bound the forward KL divergence by the mean squared error, and our Lemmas 1 and 2 give concrete examples. On the other hand, when considering SDEs, the KL divergence can be bounded by the mean squared error of the drift terms, as shown in [43]. Indeed, in [42] this favorable smoothing effect was carefully investigated. Here we show that we can alternatively obtain an upper bound on the KL divergence when assuming that $u_t$, $v_t$, and $p_0$ satisfy additional regularity assumptions. This allows us to recover the mass-covering property from bounds on the mean squared error for sufficiently smooth vector fields. The scaling is nevertheless still weaker than for SDEs.

We now state our assumptions. We denote the gradient with respect to $\theta$ by $\nabla = \nabla_\mu$ and second derivatives by $\nabla^2 = \nabla^2_{\mu\nu}$. When applying the chain rule, we leave the indices implicit. We denote by $|\cdot|$ the Frobenius norm $|A| = \bigl(\sum_{ij} A_{ij}^2\bigr)^{1/2}$ of a matrix. The Frobenius norm is submultiplicative, i.e., $|AB| \le |A|\,|B|$, and directly generalizes to higher-order tensors.

Assumption 1. We assume that
$|\nabla u_t| \le L, \quad |\nabla v_t| \le L, \quad |\nabla^2 u_t| \le L', \quad |\nabla^2 v_t| \le L'$.  (40)

We require one further assumption on $p_0$.

Assumption 2. There is a constant $C_1$ such that
$|\nabla \ln p_0(\theta)| \le C_1 (1 + |\theta|)$.  (41)
We also assume that
$\mathbb{E}_{p_0} |\theta|^2 \le C_2^2 < \infty$.  (42)

Note that (41) holds, e.g., if $p_0$ follows a Gaussian distribution, but also for smooth distributions with slower decay at $\infty$. If we assume that $|\nabla \ln p_0(\theta)|$ is bounded, the proof below simplifies slightly; this is, e.g., the case if $p_0(\theta) \propto e^{-|\theta|}$ as $|\theta| \to \infty$. We need some additional notation. It is convenient to introduce $\varphi_s^t = \varphi_t \circ (\varphi_s)^{-1}$, i.e., the flow from time $s$ to time $t$ (in particular $\varphi_0^t = \varphi_t$), and similarly for $\psi$. We can now restate and prove Theorem 1.

Theorem 2. Let $p_0 = q_0$ and assume $u_t$ and $v_t$ are two vector fields whose flows satisfy $p_1 = (\varphi_1)_\ast p_0$ and $q_1 = (\psi_1)_\ast q_0$. Assume that $p_0$ satisfies Assumption 2 and that $u_t$ and $v_t$ satisfy Assumption 1. Then there is a constant $C > 0$ depending on $L$, $L'$, $C_1$, $C_2$, and $d$ such that (for $\mathrm{MSE}_p(u, v) \le 1$)
$D_{\mathrm{KL}}(p_1 \| q_1) \le C\, \mathrm{MSE}_p(u, v)^{1/2}$.  (43)

Remark 1. We do not claim that our results are optimal; it might be possible to find similar bounds for the forward KL divergence with weaker assumptions. However, we emphasize that Lemma 2 shows that the result of the theorem is not true without the assumption on the second derivatives of $v_t$ and $u_t$.

Proof. We want to control $D_{\mathrm{KL}}(p_1 \| q_1)$. It can be shown (see the equation above (25) in [43] or Lemma 2.19 in [42]) that
$\partial_t D_{\mathrm{KL}}(p_t \| q_t) = \int p_t(\mathrm{d}\theta)\, (u_t(\theta) - v_t(\theta)) \cdot (\nabla \ln p_t(\theta) - \nabla \ln q_t(\theta))$.  (44)
Using Cauchy-Schwarz, we can bound this by
$\partial_t D_{\mathrm{KL}}(p_t \| q_t) \le \Bigl( \int p_t(\mathrm{d}\theta)\, |u_t(\theta) - v_t(\theta)|^2 \Bigr)^{1/2} \Bigl( \int p_t(\mathrm{d}\theta)\, |\nabla \ln p_t(\theta) - \nabla \ln q_t(\theta)|^2 \Bigr)^{1/2}$.  (45)
We use the relation (see (3))
$\ln p_t(\varphi_t(\theta_0)) = \ln p_0(\theta_0) - \int_0^t (\operatorname{div} u_s)(\varphi_s(\theta_0))\, \mathrm{d}s$,  (46)
which can be equivalently rewritten (setting $\theta = \varphi_t(\theta_0)$) as
$\ln p_t(\theta) = \ln p_0(\varphi_t^0 \theta) - \int_0^t (\operatorname{div} u_s)(\varphi_t^s \theta)\, \mathrm{d}s$.  (47)
We use the following relation for $\nabla \varphi_t^s$:
$\nabla \varphi_t^s(\theta) = \exp\Bigl( \int_t^s \mathrm{d}\tau\, (\nabla u_\tau)(\varphi_t^\tau(\theta)) \Bigr)$.  (48)
This relation is standard and can be directly deduced from the following ODE for $\nabla \varphi_t^s$:
$\partial_s \nabla \varphi_t^s(\theta) = \nabla \partial_s \varphi_t^s(\theta) = \nabla \bigl( u_s(\varphi_t^s(\theta)) \bigr) = (\nabla u_s)(\varphi_t^s(\theta))\, \nabla \varphi_t^s(\theta)$.  (49)
We can conclude that for $0 \le s, t \le 1$ the bound
$|\nabla \varphi_t^s(\theta)| \le e^L$  (50)
holds. We find
$|\nabla \ln p_t(\theta)| = \Bigl| \nabla \ln p_0(\varphi_t^0 \theta)\, \nabla \varphi_t^0(\theta) - \int_0^t (\nabla \operatorname{div} u_s)(\varphi_t^s \theta)\, \nabla \varphi_t^s(\theta)\, \mathrm{d}s \Bigr| \le |\nabla \ln p_0(\varphi_t^0 \theta)|\, e^L + L' e^L$,  (51)
and a similar bound holds for $q_t$. In words, we have shown that the score of $p_t$ at $\theta$ can be bounded in terms of the score of $p_0$ at $\theta$ transported along the vector field $u_t$, up to a correction which quantifies the change of the score along the path. We now bound, using the definition $p_t = (\varphi_t)_\ast p_0$ and the assumption (41),
$\int p_t(\mathrm{d}\theta)\, |\nabla \ln p_0(\varphi_t^0(\theta))|^2 = \int p_0(\mathrm{d}\theta_0)\, |\nabla \ln p_0(\varphi_t^0 \varphi_t(\theta_0))|^2 = \mathbb{E}_{p_0} |\nabla \ln p_0(\theta_0)|^2 \le \mathbb{E}_{p_0} \bigl( C_1 (1 + |\theta_0|) \bigr)^2 \le 2 C_1^2 (1 + \mathbb{E}_{p_0} |\theta_0|^2) \le 2 C_1^2 (1 + C_2^2)$.  (52)
Similarly, we obtain using $q_0 = p_0$
$\int p_t(\mathrm{d}\theta)\, |\nabla \ln q_0(\psi_t^0 \theta)|^2 = \int p_0(\mathrm{d}\theta_0)\, |\nabla \ln q_0(\psi_t^0 \varphi_t \theta_0)|^2$.  (53)
In words, to control the score of $q$ integrated with respect to $p_t$ we need to control the distortion we obtain when moving forward with $u$ and backwards with $v$. We therefore investigate $\psi_t^0 \varphi_t(\theta_0)$. We now show that
$\partial_h\, \psi_{t+h}^t \varphi_t^{t+h}(\theta) \big|_{h=0} = u_t(\theta) - v_t(\theta)$.  (54)
First, by definition of $\varphi$, we find
$\partial_h \varphi_t^{t+h}(\theta) \big|_{h=0} = \partial_h \varphi_{t+h} \varphi_t^{-1}(\theta) \big|_{h=0} = u_t(\varphi_t \varphi_t^{-1}(\theta)) = u_t(\theta)$.  (55)
To evaluate the second contribution, we observe
$0 = \partial_h \theta \big|_{h=0} = \partial_h \psi_{t+h}^{t+h}(\theta) \big|_{h=0} = \partial_h \psi_{t+h} \psi_{t+h}^{-1}(\theta) \big|_{h=0} = (\partial_h \psi_{t+h})(\psi_t^{-1}(\theta)) \big|_{h=0} + \partial_h \psi_t \psi_{t+h}^{-1}(\theta) \big|_{h=0} = v_t(\psi_t \psi_t^{-1}(\theta)) + \partial_h \psi_{t+h}^t(\theta) \big|_{h=0} = v_t(\theta) + \partial_h \psi_{t+h}^t(\theta) \big|_{h=0}$.  (56)
Now (54) follows from (55) and (56) together with $\varphi_t^t = \psi_t^t = \mathrm{Id}$. Using (54) we find
$\partial_t (\psi_t^0 \varphi_t)(\theta_0) = \partial_h (\psi_t^0 \psi_{t+h}^t \varphi_t^{t+h} \varphi_t)(\theta_0) \big|_{h=0} = (\nabla \psi_t^0)(\varphi_t(\theta_0)) \cdot \bigl( (u_t - v_t)(\varphi_t(\theta_0)) \bigr)$.  (57)
Using (50) we conclude that
$|\psi_t^0 \varphi_t(\theta_0) - \theta_0| \le \int_0^t \bigl| \partial_s \psi_s^0 \varphi_s(\theta_0) \bigr|\, \mathrm{d}s \le \int_0^t |(\nabla \psi_s^0)(\varphi_s(\theta_0))|\, |u_s - v_s|(\varphi_s(\theta_0))\, \mathrm{d}s \le e^L \int_0^t |u_s - v_s|(\varphi_s(\theta_0))\, \mathrm{d}s$.
We use this and the assumption (41) to continue to estimate (53) as follows:
$\int p_t(\mathrm{d}\theta)\, |\nabla \ln q_0(\psi_t^0 \theta)|^2 = \int p_0(\mathrm{d}\theta_0)\, |\nabla \ln q_0(\psi_t^0 \varphi_t(\theta_0))|^2 \le C_1^2 \int p_0(\mathrm{d}\theta_0)\, (1 + |\psi_t^0 \varphi_t(\theta_0)|)^2$
$\le C_1^2 \int p_0(\mathrm{d}\theta_0)\, (1 + |\psi_t^0 \varphi_t(\theta_0) - \theta_0| + |\theta_0|)^2$
$\le 3 C_1^2 + 3 C_1^2 \int p_0(\mathrm{d}\theta_0)\, \bigl( |\psi_t^0 \varphi_t(\theta_0) - \theta_0|^2 + |\theta_0|^2 \bigr)$
$\le 3 C_1^2 (1 + \mathbb{E}_{p_0} |\theta_0|^2) + 3 C_1^2 e^{2L} \int p_0(\mathrm{d}\theta_0)\, \Bigl( \int_0^t \mathrm{d}s\, |u_s - v_s|(\varphi_s(\theta_0)) \Bigr)^2$.
Here we used $(a + b + c)^2 \le 3(a^2 + b^2 + c^2)$ in the second-to-last step. We bound the remaining integral using Cauchy-Schwarz as follows:
$\int p_0(\mathrm{d}\theta_0)\, \Bigl( \int_0^t |u_s - v_s|(\varphi_s(\theta_0))\, \mathrm{d}s \Bigr)^2 \le \int p_0(\mathrm{d}\theta_0) \int_0^t \mathrm{d}s\, |u_s - v_s|^2(\varphi_s(\theta_0)) = \int_0^t \mathrm{d}s \int p_0(\mathrm{d}\theta_0)\, |u_s - v_s|^2(\varphi_s(\theta_0))$
$= \int_0^t \mathrm{d}s \int p_s(\mathrm{d}\theta_s)\, |u_s - v_s|^2(\theta_s) \le \int_0^1 \mathrm{d}s \int p_s(\mathrm{d}\theta_s)\, |u_s - v_s|^2(\theta_s) = \mathrm{MSE}_p(u, v)$.
The last displays together imply
$\int p_t(\mathrm{d}\theta)\, |\nabla \ln q_0(\psi_t^0 \theta)|^2 \le 3 C_1^2 \bigl( 1 + \mathbb{E}_{p_0} |\theta_0|^2 + e^{2L}\, \mathrm{MSE}_p(u, v) \bigr)$.  (61)
Now we have all the necessary ingredients to bound the derivative of the KL divergence. We control the second integral in (45) using (51) (and again $(\sum_{i=1}^4 a_i)^2 \le 4 \sum_i a_i^2$) as follows:
$\int p_t(\mathrm{d}\theta)\, |\nabla \ln p_t(\theta) - \nabla \ln q_t(\theta)|^2 \le 8 L'^2 e^{2L} + 4 e^{2L} \int p_t(\mathrm{d}\theta)\, \bigl( |\nabla \ln q_0(\psi_t^0 \theta)|^2 + |\nabla \ln p_0(\varphi_t^0 \theta)|^2 \bigr)$.  (62)
Using (52) and (61) we finally obtain
$\int p_t(\mathrm{d}\theta)\, |\nabla \ln p_t(\theta) - \nabla \ln q_t(\theta)|^2 \le 8 L'^2 e^{2L} + C_1^2 e^{2L} \bigl( 20 (1 + C_2^2) + 12\, e^{2L}\, \mathrm{MSE}_p(u, v) \bigr) \le C (1 + \mathrm{MSE}_p(u, v))$  (63)
for some constant $C > 0$. Finally, we obtain
$D_{\mathrm{KL}}(p_1 \| q_1) = \int_0^1 \mathrm{d}t\, \partial_t D_{\mathrm{KL}}(p_t \| q_t) \le \bigl( C (1 + \mathrm{MSE}_p(u, v)) \bigr)^{1/2} \int_0^1 \mathrm{d}t\, \Bigl( \int p_t(\mathrm{d}\theta)\, |u_t(\theta) - v_t(\theta)|^2 \Bigr)^{1/2}$
$\le \bigl( C (1 + \mathrm{MSE}_p(u, v)) \bigr)^{1/2} \Bigl( \int_0^1 \mathrm{d}t \int p_t(\mathrm{d}\theta)\, |u_t(\theta) - v_t(\theta)|^2 \Bigr)^{1/2} = \bigl( C (1 + \mathrm{MSE}_p(u, v)) \bigr)^{1/2}\, \mathrm{MSE}_p(u, v)^{1/2} \le (2C)^{1/2}\, \mathrm{MSE}_p(u, v)^{1/2}$,
where we used $\mathrm{MSE}_p(u, v) \le 1$ in the last step. This proves the claim after renaming the constant.

C SBI Benchmark

In this section, we collect missing details and additional results for the analysis of the SBI benchmark in Section 4.

C.1 Network architecture and hyperparameters

For each task and simulation budget in the benchmark, we perform a mild hyperparameter optimization. We sweep over the batch size and learning rate (which is particularly important as the simulation budgets differ by orders of magnitude), the network size, and the $\alpha$ parameter for the time prior defined in Section 3.3 (see Tab. 2 for the specific values). We reserve 5% of the simulation budget for validation and choose the model with the best validation loss across all configurations.

C.2 Additional results

We here provide various additional results for the SBI benchmark. First, we compare the performance of FMPE and NPE using the Maximum Mean Discrepancy (MMD) metric. The results can be found in Fig. 6. FMPE shows superior performance to NPE for most tasks and simulation budgets. Compared to the C2ST scores in Fig. 4, the improvement shown by FMPE in terms of MMD is more substantial.

Fig. 7 compares the FMPE results with the optimal transport path from the main text to a comparable score matching model using the Variance Preserving diffusion path [15]. The score matching results were obtained using the same batch size, network size and learning rate as the FMPE network, while optimizing over $\beta_{\min} \in \{0.1, 1, 4\}$ and $\beta_{\max} \in \{4, 7, 10\}$. FMPE with the optimal transport path clearly outperforms the score-based model on almost all configurations.

In Fig. 8 we compare FMPE using the architecture proposed in Section 3.2, with $(t, \theta)$-conditioning via gated linear units, to FMPE with a naive architecture operating directly on the concatenated $(t, \theta, x)$ vector. For the two displayed tasks the context dimension $\dim(x) = 100$ is much larger than the parameter dimension $\dim(\theta) \in \{5, 10\}$, and there is a clear performance gain in using the GLU conditioning. Our interpretation is that, due to the low dimensionality of $(t, \theta)$, this information is not well captured by the network when simply concatenated with $x$.
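To illustrate the difference between the two conditioning strategies compared in Fig. 8, the following is a minimal PyTorch sketch of a residual block whose hidden units are gated by an embedding of $(t, \theta)$ via a gated linear unit. It is our own simplified illustration: the layer sizes, names, and exact placement of the gating are assumptions, not the architecture used in the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GLUConditionedBlock(nn.Module):
    """Residual block on the data representation h(x), gated by an embedding of (t, theta)."""

    def __init__(self, dim_h: int, dim_cond: int):
        super().__init__()
        self.fc1 = nn.Linear(dim_h, dim_h)
        self.fc2 = nn.Linear(dim_h, dim_h)
        self.cond_proj = nn.Linear(dim_cond, dim_h)  # produces the GLU gate from (t, theta)

    def forward(self, h: torch.Tensor, cond: torch.Tensor) -> torch.Tensor:
        residual = h
        h = F.relu(self.fc1(h))
        # GLU: the first half passes through, the second half (from the conditioning) is a sigmoid gate
        h = F.glu(torch.cat([h, self.cond_proj(cond)], dim=-1), dim=-1)
        return self.fc2(h) + residual


class FMPEVectorField(nn.Module):
    """v(t, theta, x): x flows through GLU-conditioned residual blocks; (t, theta)
    enters only through the conditioning embedding (the 'GLU embedding' variant of Fig. 8)."""

    def __init__(self, dim_theta: int, dim_x: int, dim_cond: int = 64, n_blocks: int = 4):
        super().__init__()
        self.embed = nn.Sequential(
            nn.Linear(dim_theta + 1, dim_cond), nn.ReLU(), nn.Linear(dim_cond, dim_cond)
        )
        self.blocks = nn.ModuleList(GLUConditionedBlock(dim_x, dim_cond) for _ in range(n_blocks))
        self.head = nn.Linear(dim_x, dim_theta)

    def forward(self, t: torch.Tensor, theta: torch.Tensor, x: torch.Tensor) -> torch.Tensor:
        # t: scalar or per-sample; theta: (batch, dim_theta); x: (batch, dim_x)
        t = torch.as_tensor(t, dtype=theta.dtype, device=theta.device).reshape(-1, 1)
        t = t.expand(theta.shape[0], 1)
        cond = self.embed(torch.cat([t, theta], dim=-1))
        h = x
        for block in self.blocks:
            h = block(h, cond)
        return self.head(h)
```

The 'Concatenation' baseline of Fig. 8 would instead feed torch.cat([t, theta, x], dim=-1) directly into the first linear layer of a plain residual network.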
hyperparameter | sweep values
hidden dimensions | 2^n for n in {4, ..., 10}
number of blocks | 10, ..., 18
batch size | 2^n for n in {2, ..., 9}
learning rate | 1e-3, 5e-4, 2e-4, 1e-4
α (for time prior) | -0.25, -0.5, 0, 1, 4

Table 2: Sweep values for the hyperparameters for the SBI benchmark. We split the configurations according to simulation budget; e.g., for 1000 simulations, we only swept over the smaller values for network size and batch size. The network architecture has a diamond shape, with layer widths increasing from the smallest to the largest value and then decreasing to the output dimension. Each block consists of two fully-connected residual layers.

Figure 6: Comparison of FMPE and NPE performance across 10 SBI benchmarking tasks [46]. We here quantify the deviation from the reference posterior in terms of the Maximum Mean Discrepancy (MMD) as an alternative metric to the C2ST score used in Fig. 4. MMD can be sensitive to its hyperparameters [46], so we use the C2ST score as the primary performance metric.

Fig. 9 displays the densities of the reference samples under the FMPE model as a histogram for all tasks (extended version of Fig. 3). The support of the learned model $q(\theta|x)$ covers the reference samples $\theta \sim p(\theta|x)$, providing additional empirical evidence for the mass-covering behavior theoretically explored in Thm. 1. However, samples from the true posterior distribution may have a small density under the learned model, especially if the deviation between model and reference is high; see Lotka-Volterra (bottom right panel). Fig. 10 displays P-P plots for two selected tasks. Finally, we study the impact of our time-prior re-weighting for one example task in Fig. 11. We clearly see that the proposed re-weighting leads to increased performance by up-weighting samples with $t$ closer to 1 during training.

Figure 7: Comparison of FMPE with the optimal transport path (as used throughout the main paper) with comparable models trained with a Variance Preserving diffusion path [15] by regressing on the score (SMPE). Note that the SMPE baseline shown here is not directly comparable to NPSE [8, 9], as the latter method uses Langevin steps, which reduces the dependence of the results on the vector field for small $t$ (at the cost of a tractable density).

Figure 8: Comparison of the architecture proposed in Section 3.2, with gated linear units for the $(t, \theta)$-conditioning (red), and a naive architecture based on a simple concatenation of $(t, \theta, x)$ (black). FMPE with the proposed architecture performs substantially better.

Figure 9: Histogram of FMPE densities $\log q(\theta|x)$ for samples $\theta \sim q(\theta|x)$ and reference samples $\theta \sim p(\theta|x)$ for simulation budgets $N = 10^3$ (left), $N = 10^4$ (center) and $N = 10^5$ (right). The reference samples $\theta \sim p(\theta|x)$ all lie within the support of the learned model $q(\theta|x)$, indicating mass-covering FMPE results. Nonetheless, reference samples may have a small density under $q(\theta|x)$ if the validation loss is high; see Lotka-Volterra (LV).
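As a companion to Fig. 9, the following sketch shows one way to evaluate $\log q(\theta|x)$ for given posterior samples with a trained FMPE vector field: integrate the flow ODE backwards from $t = 1$ to $t = 0$ together with the divergence term entering Eq. (3). It is our own illustration and assumes a callable vf(t, theta, x) and the torchdiffeq package; the exact-divergence loop is only practical for moderate $\dim(\theta)$.

```python
import torch
from torchdiffeq import odeint  # assumed third-party dependency

def fmpe_log_prob(vf, theta_1, x, base_dist, atol=1e-7, rtol=1e-7):
    """log q(theta_1 | x) via the continuous change of variables:
    log q_1(theta_1) = log q_0(theta_0) - int_0^1 div v_t(theta_t) dt.
    We integrate in s = 1 - t so the solver runs forward from 0 to 1."""
    batch, dim = theta_1.shape
    state_0 = torch.cat([theta_1, torch.zeros(batch, 1, device=theta_1.device)], dim=1)

    def ode_func(s, state):
        theta = state[:, :dim]
        t = 1.0 - s
        with torch.enable_grad():
            theta = theta.detach().requires_grad_(True)
            v = vf(t, theta, x)
            div = torch.zeros(batch, device=theta.device)
            for i in range(dim):  # exact divergence, one backward pass per dimension
                div = div + torch.autograd.grad(v[:, i].sum(), theta, retain_graph=True)[0][:, i]
        # d theta / ds = -v (backward in t); the divergence integral accumulates +div
        return torch.cat([-v.detach(), div.detach().unsqueeze(1)], dim=1)

    s_grid = torch.tensor([0.0, 1.0], device=theta_1.device)
    state_1 = odeint(ode_func, state_0, s_grid, atol=atol, rtol=rtol, method="dopri5")[-1]
    theta_0, div_integral = state_1[:, :dim], state_1[:, dim]
    return base_dist.log_prob(theta_0) - div_integral
```

Here base_dist is the standard-normal base distribution over $\theta$ (e.g. a torch.distributions object whose log_prob returns one value per sample).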
Figure 10: P-P plots for the marginals of the FMPE posterior for the Two Moons (upper) and SLCP (lower) tasks, for training budgets of 10^3 (left), 10^4 (center), and 10^5 (right) samples.

Figure 11: Comparison of the time-prior re-weighting (power law) proposed in Section 3.3 with a uniform prior over $t$ on the Two Moons task (Section 4). The network trained with the re-weighted prior clearly outperforms the reference on all simulation budgets.

D Gravitational-wave inference

We here provide the missing details and additional results for the gravitational-wave inference problem analyzed in Section 5.

D.1 Network architecture and hyperparameters

Compared to NPE with normalizing flows, FMPE allows for generally simpler architectures, since the output of the network is simply a vector field. This also holds for NPSE (where the model is likewise defined by a vector field) and NRE (defined by a scalar). Our FMPE architecture builds on the embedding network developed in [7]; however, we extend the network capacity by adding more residual blocks (Tab. 3, top panel). For the $(t, \theta)$-conditioning we use gated linear units applied to each residual block, as described in Section 3.2. We also use a small residual network to embed $(t, \theta)$ before applying the gated linear units. In this appendix we also perform an ablation study, using the same embedding network as the NPE network (Tab. 3, bottom panel). For this configuration, we additionally study the effect of conditioning on $(t, \theta)$ starting from different layers of the main residual network.

D.2 Data settings

We use the data settings described in [7], with a few minor modifications. In particular, we use the waveform model IMRPhenomPv2 [76-78] and the prior displayed in Tab. 4. Generation of the training dataset with 5,000,000 samples takes around 1 hour on 64 CPUs. Compared to [7], we reduce the frequency range from [20, 1024] Hz to [20, 512] Hz to reduce the computational load for data preprocessing. We also omit the conditioning on the detector noise power spectral density (PSD) introduced in [7], as we evaluate on a single GW event. Preliminary tests show that the performance with PSD conditioning is similar to the results reported in this paper. All changes to the data settings have been applied to FMPE and the NPE baselines alike to enable a fair comparison.

D.3 Additional results

Tab. 5 displays the inference times for FMPE and NPE. NPE requires only a single network pass to produce samples and (log-)probabilities, whereas many forward passes are needed for FMPE to solve the ODE to a specific level of accuracy. A significant portion of the additional time required for calculating (log-)probabilities in conjunction with the samples is spent on computing the divergence of the vector field, see Eq. (3).

hyperparameter | values (main model)
residual blocks | 2048, 4096 ×3, 2048 ×3, 1024 ×6, 512 ×8, 256 ×10, 128 ×5, 64 ×3, 32 ×3, 16 ×3
residual blocks, (t, θ) embedding | 16, 32, 64, 128, 256
batch size | 4096
learning rate | 5e-4
α (for time prior) | 1

hyperparameter | values (ablation model)
residual blocks | 2048 ×2, 1024 ×4, 512 ×4, 256 ×4, 128 ×4, 64 ×3, 32 ×3, 16 ×3
residual blocks, (t, θ) embedding | 16, 32, 64, 128, 256
batch size | 4096
learning rate | 5e-4
α (for time prior) | 1

Table 3: Hyperparameters for the FMPE models used in the main text (top) and in the ablation study (bottom, see Fig. 12). The network is composed of a sequence of residual blocks, each consisting of two fully-connected hidden layers, with a linear layer between each pair of blocks. Block widths are listed as width ×count. The ablation network is the same as the embedding network that feeds into the NPE normalizing flow.
Description | Parameter | Prior
component masses | m1, m2 | [10, 120] M_⊙, m1 ≥ m2
chirp mass | Mc = (m1 m2)^{3/5} / (m1 + m2)^{1/5} | [20, 120] M_⊙ (constraint)
mass ratio | q = m2/m1 | [0.125, 1.0] (constraint)
spin magnitudes | a1, a2 | [0, 0.99]
spin angles | θ1, θ2, φ12, φJL | standard as in [79]
time of coalescence | tc | [-0.03, 0.03] s
luminosity distance | dL | [100, 1000] Mpc
reference phase | φc | [0, 2π]
inclination | θJN | [0, π], uniform in sine
polarization | ψ | [0, π]
sky position | α, β | uniform over sky

Table 4: Priors for the astrophysical binary black hole parameters. Priors are uniform over the specified range unless indicated otherwise. Our models infer the mass parameters in the basis (Mc, q) and marginalize over the phase parameter φc.

Fig. 12 presents a comparison of the FMPE performance using networks of the same hidden dimensions as the NPE embedding network (Tab. 3, bottom panel). This comparison includes an ablation study on the placement of the $(t, \theta)$ GLU-conditioning. In the top-row network, the $(t, \theta)$ conditioning is applied only after the 256-dimensional blocks. In contrast, the middle-row network receives $(t, \theta)$ immediately after the initial residual block. With FMPE we can achieve performance comparable to NPE while having only 1/3 of the network size (most of the NPE network parameters are in the flow). This suggests that parameterizing the target distribution in terms of a vector field requires less learning capacity than directly learning its density. Delaying the $(t, \theta)$ conditioning until the final layers impairs performance. However, the number of FLOPs at inference is considerably reduced, as the context embedding can be cached and a network pass then only involves the few layers with the $(t, \theta)$ conditioning. Consequently, there is a trade-off between accuracy and inference speed, which we will explore in greater depth in future work.

Method | Network Passes | Inference Time (per batch)
FMPE (sample only) | 248 | 26 s
FMPE (sample and log probs) | 350 | 352 s
NPE (sample and log probs) | 1 | 1.5 s

Table 5: Inference times per batch for FMPE and NPE on a single Nvidia A100 GPU, using the training batch size of 4096. We solve the ODE for FMPE using the dopri5 discretization [80] with absolute and relative tolerances of 1e-7. For FMPE, generation of the (log-)probabilities additionally requires the computation of the divergence, see equation (3). This needs additional memory and therefore limits the maximum batch size that can be used at inference.

(Figure 12 panel: rows labeled FMPE late GLU and FMPE early GLU; per-parameter values not reproduced here.)

Figure 12: Jensen-Shannon divergence between inferred posteriors and the reference posteriors for GW150914 [73]. We compare two FMPE models with the same architecture as the NPE embedding network, see Tab. 3, bottom panel. For the model in the first row, the GLU conditioning on $(t, \theta)$ is only applied before the final 128-dimensional blocks. The model in the middle row is given the context after the very first 2048-dimensional block.
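As a companion to Tab. 5, the following sketch shows how FMPE samples can be drawn by integrating the learned vector field with the dopri5 solver while counting the number of vector-field evaluations, i.e. the 'network passes' reported in the table. It is our own illustration, again assuming a callable vf(t, theta, x) and the torchdiffeq package; it is not the paper's inference code.

```python
import torch
from torchdiffeq import odeint  # assumed third-party dependency

@torch.no_grad()
def fmpe_sample(vf, x, num_samples, dim_theta, atol=1e-7, rtol=1e-7):
    """Draw samples theta ~ q(theta|x) by solving d theta/dt = v_t(theta, x) from the
    standard-normal base at t=0 to t=1, and report the number of network passes."""
    n_passes = 0

    def ode_func(t, theta):
        nonlocal n_passes
        n_passes += 1
        return vf(t, theta, x)  # x is assumed to be broadcast to the sample batch

    theta_0 = torch.randn(num_samples, dim_theta)
    t_grid = torch.tensor([0.0, 1.0])
    theta_1 = odeint(ode_func, theta_0, t_grid, atol=atol, rtol=rtol, method="dopri5")[-1]
    return theta_1, n_passes
```

Because dopri5 is adaptive, the returned pass count varies with the tolerances, which is the trade-off between accuracy and inference speed discussed above.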