# action_matching_learning_stochastic_dynamics_from_samples__f7cad597.pdf Action Matching: Learning Stochastic Dynamics from Samples Kirill Neklyudov 1 Rob Brekelmans 1 Daniel Severo 1 2 Alireza Makhzani 1 2 Learning the continuous dynamics of a system from snapshots of its temporal marginals is a problem which appears throughout natural sciences and machine learning, including in quantum systems, single-cell biological data, and generative modeling. In these settings, we assume access to cross-sectional samples that are uncorrelated over time, rather than full trajectories of samples. In order to better understand the systems under observation, we would like to learn a model of the underlying process that allows us to propagate samples in time and thereby simulate entire individual trajectories. In this work, we propose Action Matching, a method for learning a rich family of dynamics using only independent samples from its time evolution. We derive a tractable training objective, which does not rely on explicit assumptions about the underlying dynamics and does not require back-propagation through differential equations or optimal transport solvers. Inspired by connections with optimal transport, we derive extensions of Action Matching to learn stochastic differential equations and dynamics involving creation and destruction of probability mass. Finally, we showcase applications of Action Matching by achieving competitive performance in a diverse set of experiments from biology, physics, and generative modeling. 1. Introduction Understanding the time evolution of systems of particles or individuals is a fundamental problem appearing across machine learning and the natural sciences. In many scenarios, it is expensive or even physically impossible to observe entire individual trajectories. For example, in quantum me- 1Vector Institute 2University of Toronto. Correspondence to: , . Proceedings of the 40 th International Conference on Machine Learning, Honolulu, Hawaii, USA. PMLR 202, 2023. Copyright 2023 by the author(s). chanics, the act of measurement at a given point collapses the wave function (Griffiths & Schroeter, 2018), while in biological applications, single-cell RNAor ATACsequencing techniques destroy the cell in question (Macosko et al., 2015; Klein et al., 2015; Buenrostro et al., 2015). Instead, from cross-sectional or independent samples at various points in time, we would like to learn a model which simulates particles such that their density matches that of the observed samples. The problem of learning stochastic dynamics from marginal samples is variously referred to as learning population dynamics (Hashimoto et al., 2016) or as trajectory inference (Lavenant et al., 2021), in contrast to time series modeling where entire trajectories are assumed to be available. Learning such models to predict entire trajectories holds the promise of facilitating simulation of complex chemical or physical systems (Vázquez, 2007; Noé et al., 2020) and understanding developmental processes or treatment effects in biology (Schiebinger et al., 2019; Tong et al., 2020; Schiebinger, 2021; Bunne et al., 2021). Furthermore, recent advances in generative modeling have been built upon learning stochastic dynamics which interpolate between the data distribution and a prior distribution. In particular, score-based diffusion models (Song et al., 2020b; Ho et al., 2020) construct a stochastic differential equation (SDE) to move samples from the data distribution to a prior distribution, while score matching (Hyvärinen & Dayan, 2005) is used to learn a reverse SDE which models the gradients of intermediate distributions. However, these methods rely on analytical forms of the SDEs and/or the tractability of intermediate Gaussian distributions (Lipman et al., 2022). Since our proposed method can learn dynamics which simulate an arbitrary path of marginal distributions, it can also be applied in the context of generative modeling. Namely, we can approach generative modeling by constructing an interpolating path between the data and an arbitrary prior distribution, and learning to model the resulting dynamics. In this work, we propose Action Matching, a method for learning population dynamics from samples of their temporal marginals qt. Our contributions are as follows: In Theorem 2.1, we establish the existence of a unique gradient field s t which traces any given time- Action Matching Figure 1. Action Matching, and its entropic (e AM) and unbalanced (u AM) variants, can learn to trace any arbitrary distributional path. For a given path, AM learns deterministic trajectories, e AM learns stochastic trajectories, and u AM learns weighted trajectories. continuous distributional path qt. Notably, our restriction to gradient fields is without loss of expressivity for this class of qt. To learn this gradient field, we propose the tractable Action Matching training objective in Theorem 2.2. In Sec. 3.1-3.3, we extend the above approach in several ways: an entropic version which can approximate ground-truth dynamics involving stochasticity, an unbalanced version which allows for creation and destruction of probability mass, and a version which can minimize an arbitrary convex cost function in the Action Matching objective. We discuss the close relationship between Action Matching and dynamical optimal transport, along with other related works in Sec. 5 and App. B. Since Action Matching relies only on samples and does not require tractable intermediate densities or knowledge of the underlying stochastic dynamics, it is applicable in a wide variety of problem settings. In particular, we demonstrate competitive performance of Action Matching in a number of experiments, including trajectory inference in biological data (Sec. 4.1), evolution of quantum systems (Sec. 4.2), and a variety of tasks in generative modeling (Sec. 4.3). 2. Action Matching 2.1. Continuity Equation Suppose we have a set of particles in space X Rd, initially distributed as qt=0. Let each particle follow a timedependent ODE (continuous flow) with the velocity field v : [0, 1] X Rd as follows d dtx(t) = vt(x(t)) , x(t = 0) = x . (1) The continuity equation describes how the density of the particles qt evolves in time t, i.e., tqt = (qtvt) , (2) which holds in the distributional sense, where denotes the divergence operator. Under mild conditions, the following theorem shows that any continuous dynamics can be modeled by the continuity equation, and moreover any continuity equation results in a continuous dynamics. Theorem 2.1 (Adapted from Theorem 8.3.1 of Ambrosio et al. (2008)). Consider a continuous dynamic with the density evolution of qt, which satisfies mild conditions (absolute continuity in the 2-Wasserstein space of distributions P2(X)). Then, there exists a unique (up to a constant) function s t (x), called the action , such that vector field v t (x) = s t (x) and qt satisfies the continuity equation tqt = (qt s t (x)) . (3) In other words, the ODE d dtx(t) = s t (x) can be used to move samples in time such that the marginals are qt. Using Theorem 2.1, the problem of learning the dynamics can be boiled down to learning the unique vector field s t , only using samples from qt. Motivated by this, we restrict our search space to the family of curl-free vector fields St = { st | st : X R} . (4) We use a neural network to parameterize the set of functions st(x; θ), and will propose the Action Matching objective in Sec. 2 to learn parameters θ such that st(x; θ) approximates s t (x). Once we have learned the vector field, we can move samples forward or backward in time by simulating the ODE in Eq. (1) with the velocity st. The continuity equation ensures that for s t , samples at any given time t [0, 1] are distributed according to qt. Note that, even though we arrived at the continuity equation and ground truth vector field s t (x) using ODEs, the continuity equation can describe a rich family of density evolutions in a wide range of stochastic processes, including those of SDEs (see Equation 37 of Song et al. (2020b)), or Action Matching even those of the porous medium equations (Otto, 2001) which are more general than SDEs. Since these processes also define an absolutely continuous curve in the density space, Theorem 2.1 applies. Thus, for the task of modeling the marginal evolution of qt, our restriction to ODEs using curl-free vector fields does not sacrifice expressivity. 2.2. Action Matching Loss The main development of this paper is the Action Matching method, which allows us to recover the true action s t while having access only to samples from qt. With this action in hand, we can simulate the continuous dynamics whose evolution matches qt using the vector field s t (see Fig. 1). In order to do so, we define the variational action st(x) parameterized by a neural network, which approximates s t (x) by minimizing the ACTION-GAP objective ACTION-GAP(s, s ) := 1 0 Eqt(x) st(x) s t (x) 2dt . (5) Note that this objective is intractable, as we do not have access to s . However as the following proposition shows, we can still derive a tractable objective for minimizing the action gap. Theorem 2.2. For an arbitrary variational action s, the ACTION-GAP(s, s ) can be decomposed as the sum of an intractable constant K, and a tractable term LAM(s) ACTION-GAP(st, s t ) = K + LAM(st) . (6) where LAM(s) is the Action Matching objective, which we minimize LAM(s) := Eq0(x) s0(x) Eq1(x) s1(x) 2 st(x) 2 + st t (x) dt (7) See App. A.1 for the proof. The term LAM is tractable, since we can use the samples from marginals qt to obtain an unbiased low variance estimate. We show in App. A.1 that the intractable constant K is the kinetic energy of the distributional path, defined as K( s t ) := 1 2 R 1 0 Eqt(x) s t (x) 2dt, and thus minimizing LAM(s) can be viewed as maximizing a variational lower bound on the kinetic energy. Connection with Optimal Transport In App. B.1, we show that the optimal dynamics of AM along the curve is also optimal in the sense of optimal transport with the 2Wasserstein cost. More precisely, at any given time t, the optimal vector field in the AM objective defines a mapping between two infinitesimally close distributions qt and qt+h, which is of the form x 7 x + h s t (x). This mapping is indeed the same as the Brenier map (Brenier, 1987) in Algorithm 1 Action Matching 1 Require: data {xj t}Nt j=1, xj t qt(x) Require: parametric model st(x, θ) for learning iterations do get batch of samples from boundaries: {xi 0}n i=1 q0(x), {xi 1}n i=1 q1(x) sample times {ti}n i=1 Uniform[0, 1] get batch of intermediate samples {xi ti}n i=1 qt(x) s0(xi 0, θ) s1(xi 1, θ) 2 sti(xi ti, θ) 2 + sti(xi ti,θ) t update the model θ Optimizer(θ, θLAM(θ)) end for output trained model st(x, θ ) optimal transport, which is of the form x 7 x + φt(x), where φt is the (c-convex) Kantorovich potential. Finally, in App. A.1, we adapt reasoning from Albergo & Vanden-Eijnden (2022) to show that the 2-Wasserstein distance between the ground truth marginals and those simulated using our learned st(x) can be upper bounded in terms of ACTION-GAP(st, s t ). 2.3. Learning, Sampling, and Likelihood Evaluation Learning We provide pseudo-code for learning with the Action Matching objective in Algorithm 1. With our learned st(x, θ), we now describe how to simulate the dynamics and evaluate likelihoods when the initial density q0 is known. Sampling We sample from the target distribution via the trained function st(x(t), θ ) by solving the following ODE forward in time: d dtx(t) = xst(x(t), θ ), x(t = 0) q0(x). (8) Recall that this sampling process is justified by Eq. (3), where st(x(t), θ ) approximates s t (x(t)). Evaluating the Log-Likelihood When the density for q0 is available, we can evaluate the log-likelihood of a sample x q1 using the continuous change of variables formula (Chen et al., 2018). Integrating the ODE backward in time, log q1(x) = log q0(x(0)) Z 1 0 dt s t (x(t)), d dtx(t) = xs t (x(t)), x(t = 1) = x, 1Notebooks with pedagogical examples of AM are given at github.com/necludov/jam#tutorials. Action Matching dt log qt = s t can be confirmed using a simple calculation and we approximate s t (x(t)) by st(x(t), θ ). 3. Extensions of Action Matching 2 In this section, we propose several extensions of Action Matching, which can be used to learn dynamics which include stochasticity (Sec. 3.1), allow for teleportation of probability mass (Sec. 3.2), and minimize alternative kinetic energy costs (Sec. 3.3). 3.1. Entropic Action Matching In this section, we propose entropic Action Matching (e AM), which can recover the ground-truth dynamics arising from diffusion processes with curl-free drift term and known diffusion term. This setting takes place in biological applications studying the Brownian motion of cells in a medium (Schiebinger et al., 2019; Tong et al., 2020). We will show in Prop. 3.1 that, at optimality, entropic AM can also learn to trace any absolutely continuous distributional path under mild conditions, so that the choice between entropic AM and deterministic AM should be made based on prior knowledge of the true underlying dynamics. Consider the stochastic differential equation dx(t) = vt(x)dt + σtd Wt , x(t = 0) = x . (10) where Wt is the Wiener process. We know that the evolution of density of this diffusion process is described by the Fokker Planck equation: tqt = (qtvt) + σ2 t 2 qt , (11) In the following proposition, we extend Theorem 2.1 and prove that any continuous distributional path, regardless of ground truth generating dynamics, can be modeled with the diffusion dynamics in the state-space. Proposition 3.1. Consider a continuous dynamic with the density evolution of qt, and suppose σt is given. Then, there exists a unique (up to a constant) function s t (x), called the entropic action , such that vector field v t (x) = s t (x) and qt satisfies the Fokker-Planck equation tqt = (qt s t ) + σ2 t 2 qt , (12) See App. A.2 for the proof. This proposition indicates that the we can use the the SDE dx(t) = s t dt + σtd Wt to move samples in time such that the marginals are qt. 2This section can be skipped without loss of understanding of our core contributions, although our experiments also evaluate the entropic AM method from Sec. 3.1. Entropic AM objective aims to recover the unique s t (x), as described by the above proposition. In order to learn the diffusion velocity vector, we define the variational action st(x), parameterized by a neural network, that approximates s t (x), by minimizing the E-ACTION-GAP objective E-ACTION-GAP(s, s ) := 1 0 Eqt(x) st(x) s t (x) 2dt . Note that while the E-ACTION-GAP is similar to the original ACTION-GAP objective, it minimizes the distance to s t , which is different than s t . As in AM, this objective is intractable since we do not have access to s t . However, we derive a tractable objective in the following proposition. Proposition 3.2. For an arbitrary variational action s, the E-ACTION-GAP(s, s ) can be decomposed as the sum of an intractable constant K, and a tractable term Le AM(s) which can be minimized: E-ACTION-GAP(s, s ) = Le AM(s) + Ke AM , where Le AM(s) is the entropic Action Matching objective, which we minimize Le AM(s) := Eq0(x)[s0(x)] Eq1(x)[s1(x)] (13) 2 st(x) 2 + st t (x) + σ2 t 2 st See App. A.2 for the proof. The constant Ke AM is the entropic kinetic energy, discussed in App. A.2. Connection with Entropic Optimal Transport In App. B.3, we describe connections between the e AM objective and dynamical formulations of entropy-regularized optimal transport (Cuturi, 2013) or Schrödinger Bridge (Léonard, 2014; Chen et al., 2016; 2021) problems. 3.2. Unbalanced Action Matching In this section, we further extend the scope of underlying dynamics which can be learned by Action Matching by allowing for the creation and destruction of probability mass via a growth rate gt(x). This term is useful to account for cell growth and death in trajectory inference for single-cell biological dynamics (Schiebinger et al., 2019; Tong et al., 2020; Baradat & Lavenant, 2021; Lübeck et al., 2022; Chizat et al., 2022), and is well-studied in relation to unbalanced optimal transport problems (Chizat et al., 2018a;b;c; Liero et al., 2016; 2018; Kondratyev et al., 2016). To introduce unbalanced Action Matching (u AM), consider the following ODE, which attaches importance weights to each sample and updates the weights according to a growth rate gt(x) while transporting the samples in space, d dtx(t) = vt(x(t)) , x(t = 0) = x . (14) d dt log wt(x(t)) = gt(x(t)) , w(t = 0) = w . (15) Action Matching where vt is the vector field moving particles, similar to continuity equation, wt(x) is the importance weight of particles, and gt(x(t)) the growth rate of particles. These importance weights can grow or shrink over time, allowing the particles to be destroyed or create mass probability without needing to transport the particles. The evolution of density governing the importance weighted ODE is given by the following continuity equation: tqt = (qtvt) + qtgt . (16) In the following proposition, we extend Theorem 2.1 to show that any distributional path (under mild conditions), regardless of how it was generated in the state-space, can be modeled with the importance weighted ODE. Proposition 3.3. Consider a continuous dynamic with density evolution qt satisfying mild conditions. Then, there exists a unique function ˆs t (x), called the unbalanced action , such that velocity field v t (x) = ˆs t (x) and growth term g t (x) = ˆs t (x) satisfy the importance weighted continuity equation: tqt = (qt ˆs t ) + qtˆs t , (17) See App. A.3 for the proof. This proposition indicates that we can use the importance weighted ODE d dtx(t) = ˆs t (x(t)) , x(t = 0) = x , (18) d dt log wt(x(t)) = ˆs t (x(t)) , w(t = 0) = w , (19) to move the particles and update their weights in time, such that the marginals are qt. Remarkably, the optimal velocity vector field v t = ˆs t and growth rate g t = ˆs t in Prop. 3.3 are linked to a single action function ˆs t (x). Thus, for learning the variational action st(x) in unbalanced AM, we add a term to the UNBALANCED-ACTION-GAP objective which encourages st to match ˆs t , U-ACTION-GAP(s, ˆs ) := 1 0 Eqt(x) st(x) ˆs t (x) 2dt 0 Eqt(x) st(x) ˆs t (x) 2dt . As before, U-ACTION-GAP(s, ˆs ) objective is intractable since we do not have access to ˆs t . However, as the following proposition shows, we can still derive a tractable objective. Proposition 3.4. For an arbitrary variational action s, the U-ACTION-GAP(s, ˆs ) can be decomposed as the sum of intractable constants K and G, and a tractable term Lu AM(s) U-ACTION-GAP(s, ˆs ) = Ku AM + Gu AM + Lu AM(s) where Lu AM(s) is the unbalanced Action Matching objective, which we minimize Lu AM(s) := Eq0(x)[s0(x)] Eq1(x)[s1(x)] (20) 2 st(x) 2 + st See App. A.3 for the proof. The constants Ku AM and Gu AM are the unbalanced kinetic and growth energy, defined in App. A.3 and B.4. We note that the entropic and unbalanced extensions of Action Matching can also be combined, as is common in biological applications (Schiebinger et al., 2019; Chizat et al., 2022). To showcase how u AM can handle creation and destruction of mass, without transporting particles, we provide a mixture of Gaussians example in App. E.3. Connection with Unbalanced Optimal Transport In App. B.4, we show that at any given time t, the optimal dynamics of u AM along the curve is optimal in the sense of the unbalanced optimal transport (Chizat et al., 2018a; Liero et al., 2016; Kondratyev et al., 2016) between two infinitesimally close distributions qt and qt+h. 3.3. Action Matching with Convex Costs In App. B.2, we further extend AM to minimize kinetic energies defined using an arbitrary strictly convex cost c(vt) (Villani (2009, Ch. 7)). For a given path qt, consider Kc AM := inf vt 0 Eqt(x)[c(vt)]dt s.t. tqt = (qtvt). In this case, the unique vector field tracing the density evolution of qt becomes v t = c ( s t ), where c is the convex conjugate of the c. The corresponding action gap becomes an integral of the Bregman divergence generated by c , ACTION-GAPc (st, s t ) := Z Dc [ st : s t ] dxtdt. In practice, we can minimize ACTION-GAPc using the following c-Action Matching loss: Lc AM(st) := Z s0(x0) q0(x0)dx0 Z s1(x1) q1(x1)dx1 Z c ( st(xt)) + st(xt) qt(xt)dxtdt . For c( ) = c ( ) = 1 2 2, we recover standard AM. Importantly, the continuity equation for this formulation is tqt = (qt c ( s t )) . (21) Thus, the sampling is done by integrating the following vector field d dtx(t) = c ( s t , t), x(t = 0) = x . (22) Action Matching 4. Applications of Action Matching 3 In this section, we discuss and empirically study applications of Action Matching. We first consider the scenario when the samples from the dynamics qt are given as a dataset, which is the case for applications in biology and physics. Furthermore, we demonstrate applications of Action Matching in generative modeling, where we would like to learn a tractable model of a target distribution represented by a dataset of samples x1 q1 without given intermediate samples or densities. 4.1. Population Dynamics in Biology Action Matching is a natural approach to population dynamics inference in biology. Namely, given the marginal distribution of cells at several timesteps one wants to learn the model of the dynamics of cells (Tong et al., 2020; Bunne et al., 2021; Huguet et al., 2022; Koshizuka & Sato, 2022). It is crucial that Action Matching does not rely on the trajectories of cells since the measurement process destroys cells, thus the individual trajectories are unavailable. We use entropic Action Matching for this task since the ground truth processes are guided by the Brownian motion. Moreover, if the diffusion coefficient is known from the experiment conditions and the drift term is a gradient flow, entropic Action Matching recovers the ground truth drift term. Synthetic Data We consider synthetic data from (Huguet et al., 2022), which simulates natural dynamics arising in cellular differentiation, including branching and merging of cells. We generate three datasets with 5, 10, 15 time steps and learn entropic Action Matching with a fixed diffusion coefficient. In Fig. 2, we compare our method with MIOFlow (Huguet et al., 2022) by measuring the distance between the empirical marginals from the dataset and the generated distributions. We see that Action Matching outperforms MIOFlow both in terms of Wasserstein-2 distance and Maximum Mean Discrepancy (MMD) (Gretton et al., 2012). Moreover, unlike MIOFlow, Action Matching performance does not degrade with more timesteps or finer granulation of the dynamics in time. In Fig. 3, we demonstrate that the learned trajectories for Action Matching stay closer to the data marginals compared to the trajectories of MIOFlow. Embryoid sc RNA-Seq Data For a real data example, we consider a embryoid body single-cell RNA sequencing dataset from Moon et al. (2019). We follow the experimental setup from NLSB (Koshizuka & Sato, 2022) and learn entropic Action Matching where the sample space is a 5-dimensional PCA decomposition of the original data. For Action Matching, we interpolate the data using mixtures between given time steps to obtain data which is more dense in time. In Table 1, we compare our method with 3The code is available at github.com/necludov/jam 0.0 0.2 0.4 0.6 0.8 1.0 time W2 distance entropic AM (5,10,15 steps) MIOFlow (5,10,15 steps) 0.0 0.2 0.4 0.6 0.8 1.0 time entropic AM (5,10,15 steps) MIOFlow (5,10,15 steps) Figure 2. Performance of entropic Action Matching and MIOFlow on the synthetic data. We simulate the dynamics starting from the initial data distribution and estimate Wasserstein-2 distance and MMD between generated distributions and dataset marginals. data (t=0.0) data (t=0.25) data (t=0.5) data (t=0.75) data (t=1.0) trajectories Figure 3. Generated trajectories by MIOFlow (left) and entropic Action Matching (right). The training data is scattered. OT-flow (Onken et al., 2021), Trajectory-NET (Tong et al., 2020), Iterative Proportional Fitting (IPF) (De Bortoli et al., 2021), Neural SDE (Li et al., 2020), and Neural Lagrangian Schrodinger Bridge (NLSB) (Koshizuka & Sato, 2022). We find that e AM performs on par with NLSB and outperforms other methods. 4.2. Quantum System Simulation In this section, we apply Action Matching to learn the dynamics of a quantum system evolving according to the Schrödinger equation. The Schrödinger equation describes the evolution of many quantum systems, and in particular, it describes the physics of molecular systems. Here, for the ground truth dynamics, we take the dynamics of an excited state of the hydrogen atom, which is described by the following equation tψ(x, t) = 1 x ψ(x, t) 1 2 2ψ(x, t). (23) The function ψ(x, t) : R3 R C is called a wavefunction and completely describes the state of the quantum system. In particular, it defines the distribution of the coordinates x by the density qt(x) := |ψ(x, t)|2, which is defined through the evolution of ψ(x, t) in Eq. (23). For the baseline, we take Annealed Langevin Dynamics (ALD) as considered in (Song & Ermon, 2019). It approximates the ground truth dynamics using only scores of the distributions by running the approximate MCMC method (which does not have access to the densities) targeting the intermediate distributions of the dynamics (see Algorithm 3). For the estimation of scores, we consider Score Matching (SM) (Hyvärinen & Dayan, 2005), Sliced Score Matching (SSM) (Song et al., 2020a), and additionally evaluate the Action Matching Distance OT-flow Trajectory-Net IPF Neural SDE NLSB e AM (ours) W2(qt1, ˆqt1) 0.75 0.64 0.65 0.016 0.62 0.016 0.63 0.015 0.58 0.015 W2(qt2, ˆqt2) 0.93 0.87 0.78 0.021 0.78 0.021 0.75 0.017 0.77 0.016 W2(qt3, ˆqt3) 0.93 0.78 0.76 0.018 0.77 0.017 0.75 0.014 0.72 0.007 W2(qt4, ˆqt4) 0.88 0.89 0.76 0.017 0.75 0.017 0.72 0.010 0.74 0.017 Table 1. Performance for embryoid body sc RNA-seq data as measured by the W2 distance computed between test data marginals and predicted marginals from the previous test data marginals. Numbers for concurrent methods are taken from (Koshizuka & Sato, 2022). Method Average MMD AM (ours) 5.7 10 4 3.1 10 4 ALD + SM 4.8 10 2 4.8 10 3 ALD + Sliced SM 4.7 10 2 4.0 10 3 ALD + True Scores 3.6 10 2 4.1 10 4 Table 2. Performance of Action Matching and ALD for the Schrödinger equation simulation. We report average MMD over time between the data and the generated samples. For ALD, we use different estimations of scores: Score Matching (SM), Sliced SM, ground true scores. baseline using the ground truth scores. For further details, we refer the reader to App. E.1 and the supplemented code. Action Matching outperforms both Score Matching and Sliced Score Matching, precisely simulating the true dynamics (see Table 2). Despite that both SM and SSM accurately recover the ground truth scores for the marginal distributions (see the right plot in Fig. 5 of App. E.1), one cannot efficiently use them for the sampling from the ground truth dynamics. Note, that even using the ground truth scores in ALD does not match the performance of Action Matching (see Table 2) since it is itself an approximation to the Metropolis-Adjusted Langevin Algorithm. Finally, we provide animations of the learned dynamics for different methods (see github.com/necludov/action-matching) to illustrate the performance difference. 4.3. Generative Modeling In addition to learning stochastic dynamics in natural sciences, Action Matching has a wide range of applications in generative modeling. The key feature of Action Matching for generative modeling is the flexibility to choose any dynamics qt that starts from the prior distribution q0 and arrives to the given data q1. After using Action Matching to learn a vector field s which simulates the appropriate dynamics, we can sample and evaluate likelihoods using Eq. (8) and Eq. (9). We consider a flexible approach for specifying dynamics through interpolants in the samplespace in Eq. (24), and instantiate it for specific problem settings in the following paragraphs. We defer analysis of experimental results in each setting to Sec. 4.3.1. There are many degrees of freedom in the definition of the density path between a given q0 and q1 = dataset. For instance, we consider interpolating between samples x0 q0 and x1 q1 from the prior and data distributions using xt = αt(x0) + βt(x1), x0 q0(x), x1 q1(x) , (24) where αt, βt are some continuous transformations of the prior and the data correspondingly. To respect the endpoint marginals q0, q1, we select αt and βt such that α0(x0) = x0, β0(x1) = 0 and α1(x0) = 0, β1(x1) = x1. The path in density space qt is implicitly defined via these samples. Unconditional Generation A simple choice of αt, βt is the linear interpolation between a simple prior and the data, xt = (1 t) x0 + t x1, x0 N(0, 1), x1 dataset , We demonstrate below that these dynamics can be learned with Action Matching, which allows us to perform unconditional image generation and log-likelihood evaluation. Conditional Generation We can also consider conditional generation tasks by simulating the dynamics only for the subset of dimensions. For instance, we can lower the resolution of images in the dataset and learn the dynamics that performs super-resolution using xt = mask x1 + (1 mask) ((1 t) x0 + t x1), x0 N(0, 1), x1 dataset , where mask is the binary valued vector that defines the subset of dimensions to condition on, e.g., which pixels we generate. Note that we still can learn the dynamics for all the dimensions of vector xt. For super-resolution, we keep one pixel in each 2 2 block, while the remaining pixels are gradually transformed to the standard normal random variable. Conditional Coupled Generation In some applications, the conditions and the predicted variables might be coupled, or share common dimensions. We still can perform conditional generation by considering the marginal distribution of conditions as a prior. To showcase this scenario, we consider image colorization, for which we define the following dynamics xt = (1 t) (10 1 x0 + gray(x1)) + t x1, x0 N(0, 1), x1 dataset , where the function gray(x1) returns the grayscale version of image x1. We found that minor distortion of the inputs Action Matching Model BPD FID IS NFE VP-SDE (uses extra information) 3.25 3.71 9.12 199 Flow Matching (uses extra information) 2.99 6.35 142 Baseline (ALD + SSM) - 86.29 5.43 1090 AM - generation (ours) 3.54 10.04 8.60 132 Baseline (ALD + SSM) n/a n/a n/a AM - superres (ours) 1.44 10.93 166 AM - color (ours) 2.47 9.88 89 Table 3. Model performance for CIFAR-10. When possible, we report log-likelihood in bits per dimension (BPD). For all tasks, we report FID and IS evaluated for 50k generated images. with Normal noise stabilizes the training significantly, while barely destroying the visual information of the image (see Fig. 6). We also can restore this information by passing the non-distorted grayscale image to the model. 4.3.1. EMPIRICAL STUDY FOR GENERATIVE MODELING We demonstrate the performance of Action Matching for the aforementioned generative modeling tasks. For evaluation, we choose the CIFAR-10 dataset of natural images. Despite the fact that we do not consider image generation to be the main application of Action Matching, we argue that it demonstrates the scalability and applicability of the proposed method for the following reasons: (i) the data is high-dimensional and challenging; (ii) the quality of generated samples is easy to assess; (iii) we use known deep learning architectures, hence, remove this component from the scope of our study. For the sake of proper comparison, we consider a baseline model using Annealed Langevin Dynamics (ALD) (Song & Ermon, 2019) and Sliced Score Matching (SSM) (Song et al., 2020a), which estimates the scores of the marginals log qt without knowledge of the underlying dynamics. As stronger baselines which rely on a known form for the dynamics, we consider the variance-preserving diffusion model (VP-SDE) from Song et al. (2020b) and Flow Matching from (Lipman et al., 2022). We train all models using the same architecture for 500k iterations and evaluate the negative log-likelihood, FID (Heusel et al., 2017) and IS (Salimans et al., 2016). For Flow Matching, we report the numbers from (Lipman et al., 2022). In Table 3, we compare the performance of all models, finding that Action Matching performs much closer to VPSDE and Flow Matching (which rely on the knowledge of the dynamics) than to the baseline, with fewer function evaluations (NFE). Moreover, for the conditional generation, the ALD+SSM baseline fails to learn any meaningful scores. We argue that this is due to the dynamics starting from a complicated prior distribution, whose scores are difficult to learn without the annealing process to the noise distribution (Song & Ermon, 2019). Although VP-SDE achieves better FID and IS scores than AM in Table 3, in Fig. 4, we find only minor qualitative differences in comparing images generated from the corresponding ODEs starting from the same noise values. We provide all generations in App. F and discuss implementation details in App. C and E.2. 5. Related Works A large body of previous work on continuous normalizing flows (Chen et al., 2018) has considered regularization inspired by dynamical optimal transport. In particular, Finlay et al. (2020); Tong et al. (2020) regularize the kinetic energy 1 2 vt(x) 2 (and its Jacobian) across time, while Yang & Karniadakis (2020); Onken et al. (2021); Koshizuka & Sato (2022) parameterize vt as the gradient of a potential function st and additionally regularize | 1 2 st(x) 2 + st t | to be close to zero. However, these methods all require simulation of and backpropagation through the differential equations during training. Further, the loss functions for these include maximum likelihood or adversarial losses at the endpoints or entropic OT losses to intermediate distributions. By contrast, Action Matching is simulation-free and evaluates the action function at the endpoints. The Action Matching objective in Eq. (7) naturally includes minimization of 1 2 st(x) 2 + st t , while we show in App. B.1 that the minimum of 1 2 vt(x) 2 is also achieved by the actionminimizing st(x). In fact, these are dual optimizations, as shown in App. B.2 or Mikami & Thieullen (2006). Thus, including both regularizers is redundant from the perspective of the Action Matching loss. Existing methods for trajectory inference often optimize an entropy-regularized (unbalanced) optimal transport loss between the observed samples qt and samples simulated from a current model ˆqt, which requires backpropagation through OT solvers (Cuturi, 2013; Chizat et al., 2018b; Cuturi et al., 2022). In particular, Tong et al. (2020); Koshizuka & Sato (2022); Huguet et al. (2022) sample ˆqt using continuous normalizing flows with the regularization described above. The JKONET scheme of Bunne et al. (2022) seeks to learn a time-independent potential function F[q] = R q(x)s(x) for which the observed qt path is a gradient flow, and generates samples from ˆqt using learned optimal transport maps. As we discuss in App. B.5, Action Matching can be viewed as learning a time-dependent potential via st(x), without simulation of ˆqt during training, using a simple objective. 5.1. Connection with Flow Matching Methods Recent related work on Flow Matching (FM) (Lipman et al., 2022; Pooladian et al., 2023; Tong et al., 2023), stochastic interpolants (Albergo & Vanden-Eijnden, 2022), and Rectified Flow (Liu et al., 2022; Liu, 2022) have recently been understood under the umbrella of bridge matching (BM) methods (Shi et al., 2023; Peluchetti, 2023; Liu et al., 2023). We highlight several distinctions between Action Matching and flow matching methods in this section. Action Matching Figure 4. Generated samples for CIFAR-10 using VP-SDE model (top row) and Action Matching (bottom row). In particular, consider a deterministic bridge which produces intermediate samples xt = ft(x0, x1) with a tractable conditional vector field ut(xt|x0, x1). A natural example is linear interpolation of the endpoint samples. A marginal vector field uθ t(xt), which does not condition on a data sample x1 and thus can be used for unconditional generation, is learned to match ut(xt|x0, x1) using a squared error loss LFM(uθ t ) = inf ut 0 Eq(x0,x1) u(xt|x0, x1) uθ t (xt) 2 dt. Liu (2022) justifies other Bregman divergence losses, which parallels our c-Action Matching method in Sec. 3.3. The LFM objective can be decomposed as in (Banerjee et al., 2005), Eq0,1 u(xt|x0, x1) uθ t (xt) 2 = (25) Eq0,1[ u(xt|x0, x1) u t (xt) 2] + Eqt[ u t (xt) uθ t (xt) 2], where u t (xt) is the unique minimizer of LFM given by u t (xt) = Eq(x0,x1)[u(xt|x0, x1)] . (26) We highlight several key differences between FM and AM. Target Vector Fields We can see that the ACTION-GAP (Eq. (5)) in AM is analogous to the second term on the right-hand side of Eq. (25), using a different target and parameterization. While both the FM target vector field u t and the AM target vector field s t yield the same marginals qt (Theorem 2.1), they have several key differences that can be understood through the Helmholtz decomposition (Ambrosio et al., 2008, Lemma 8.4.2). This theorem states that any vector field, such as u t , can be uniquely decomposed as u t = s +wt, where s is the gradient-field component, and wt is the divergence-free component, (qtwt) = 0. Among all vector fields respecting marginals qt, the AM target s t is the unique vector field without a divergencefree component, moving the particles optimally in the sense of optimal transport. By contrast, the FM target u t may contain a divergence-free component, moving the particles in a way that reflects the underlying path measure (Shi et al., 2023) defined through the endpoint distribution q(x0, x1) and the bridge process. Indeed, the gradient-field component of u t is identical to s t , and the divergence-free component of u t does not influence the marginals. As the result, the optimal s t has a lower kinetic energy or dynamical cost 1 2Eqt(xt) s t 2 1 2Eqt(xt) u t 2 t . (27) Applications Starting from the ACTION-GAP, Action Matching derives a tractable dual objective which removes the need for a known conditional vector field ut(xt|x0, x1). Thus, Action Matching may be applied in settings where linear interpolation is not a reasonable inductive bias, or no bridge vector field ut(xt|x0, x1) is available. In particular, in Sec. 4, we applied AM to scientific applications where the dynamics of interest are only given via a dataset of samples from the temporal marginals. Optimization We note that FM methods are able to optimize the simple L2 loss E[ u(xt|x0, x1) uθ t(xt) 2] due to the assumption that the target conditional vector field u(xt|x0, x1) is available in closed form. Furthermore, parametrizing uθ t : [0, 1] Rd Rd directly using a neural network, rather than parametrizing st : [0, 1] Rd R and differentiating to obtain st Rd in AM, provides computational benefits by saving a backward pass. Nevertheless, we have previously noted the benefits of the gradient-field parametrization and the generality of Action Matching. 6. Conclusion In this work, we have presented a family of Action Matching methods for learning continuous dynamics from samples along arbitrary absolutely continuous paths on the 2Wasserstein space. We propose the tractable optimization objective that relies only on samples from intermediate distributions. We further derived three extensions of Action Matching: Entropic AM, that can model stochasticity in the state space dynamics; Unbalanced AM, that allows for creation and destruction of probability mass; and c-AM, that incorporates convex cost functions on the state space. The key property of the proposed objective is that it can be efficiently optimized for a wide range of applications. We demonstrate this empirically in the physical and natural sciences, where snapshots of samples are often given. Further, we demonstrated competitive performance of Action Matching for generative modeling of images, where a prior distribution and density path can be flexibly constructed depending on the task of interest. Acknowledgement The authors thank Viktor Oganesyan for helpful discussions regarding reproducing VP-SDE results. AM acknowledges support from the Canada CIFAR AI Chairs program. Action Matching Albergo, M. S. and Vanden-Eijnden, E. Building normalizing flows with stochastic interpolants. ar Xiv preprint ar Xiv:2209.15571, 2022. Ambrosio, L., Gigli, N., and Savaré, G. Gradient flows: in metric spaces and in the space of probability measures. Springer Science & Business Media, 2008. Banerjee, A., Merugu, S., Dhillon, I. S., Ghosh, J., and Lafferty, J. Clustering with bregman divergences. Journal of machine learning research, 6(10), 2005. Baradat, A. and Lavenant, H. Regularized unbalanced optimal transport as entropy minimization with respect to branching brownian motion. ar Xiv preprint ar Xiv:2111.01666, 2021. Benamou, J.-D. and Brenier, Y. A computational fluid mechanics solution to the Monge-Kantorovich mass transfer problem. Numerische Mathematik, 84(3):375 393, 2000. Brenier, Y. Décomposition polaire et réarrangement monotone des champs de vecteurs. CR Acad. Sci. Paris Sér. I Math., 305:805 808, 1987. Buenrostro, J. D., Wu, B., Litzenburger, U. M., Ruff, D., Gonzales, M. L., Snyder, M. P., Chang, H. Y., and Greenleaf, W. J. Single-cell chromatin accessibility reveals principles of regulatory variation. Nature, 523(7561): 486 490, 2015. Bunne, C., Stark, S. G., Gut, G., del Castillo, J. S., Lehmann, K.-V., Pelkmans, L., Krause, A., and Rätsch, G. Learning single-cell perturbation responses using neural optimal transport. bio Rxiv, 2021. Bunne, C., Papaxanthos, L., Krause, A., and Cuturi, M. Proximal optimal transport modeling of population dynamics. In International Conference on Artificial Intelligence and Statistics, pp. 6511 6528. PMLR, 2022. Chen, R. T., Rubanova, Y., Bettencourt, J., and Duvenaud, D. K. Neural ordinary differential equations. Advances in neural information processing systems, 31, 2018. Chen, Y., Georgiou, T. T., and Pavon, M. On the relation between optimal transport and Schrödinger bridges: A stochastic control viewpoint. Journal of Optimization Theory and Applications, 169(2):671 691, 2016. Chen, Y., Georgiou, T. T., and Pavon, M. Stochastic control liaisons: Richard sinkhorn meets gaspard monge on a schrodinger bridge. SIAM Review, 63(2):249 313, 2021. Chizat, L., Peyré, G., Schmitzer, B., and Vialard, F.-X. An interpolating distance between optimal transport and fisher rao metrics. Foundations of Computational Mathematics, 18(1):1 44, 2018a. Chizat, L., Peyré, G., Schmitzer, B., and Vialard, F.-X. Scaling algorithms for unbalanced optimal transport problems. Mathematics of Computation, 87(314):2563 2609, 2018b. Chizat, L., Peyré, G., Schmitzer, B., and Vialard, F.-X. Unbalanced optimal transport: Dynamic and Kantorovich formulations. Journal of Functional Analysis, 274(11): 3090 3123, 2018c. Chizat, L., Zhang, S., Heitz, M., and Schiebinger, G. Trajectory inference via mean-field langevin in path space. ar Xiv preprint ar Xiv:2205.07146, 2022. Cuturi, M. Sinkhorn distances: Lightspeed computation of optimal transport. Advances in neural information processing systems, 26, 2013. Cuturi, M., Meng-Papaxanthos, L., Tian, Y., Bunne, C., Davis, G., and Teboul, O. Optimal transport tools (ott): A jax toolbox for all things Wasserstein. ar Xiv preprint ar Xiv:2201.12324, 2022. URL https://github. com/ott-jax/ott. De Bortoli, V., Thornton, J., Heng, J., and Doucet, A. Diffusion Schrödinger bridge with applications to score-based generative modeling. Advances in Neural Information Processing Systems, 34:17695 17709, 2021. Figalli, A. and Glaudo, F. An Invitation to Optimal Transport, Wasserstein Distances, and Gradient Flows. 2021. Finlay, C., Jacobsen, J.-H., Nurbekyan, L., and Oberman, A. How to train your neural ode: the world of jacobian and kinetic regularization. In International conference on machine learning, pp. 3154 3164. PMLR, 2020. Gangbo, W. and Mc Cann, R. J. The geometry of optimal transportation. Acta Mathematica, 177(2):113 161, 1996. Gretton, A., Borgwardt, K. M., Rasch, M. J., Schölkopf, B., and Smola, A. A kernel two-sample test. The Journal of Machine Learning Research, 13(1):723 773, 2012. Griffiths, D. J. and Schroeter, D. F. Introduction to quantum mechanics. Cambridge university press, 2018. Hashimoto, T., Gifford, D., and Jaakkola, T. Learning population-level diffusions with generative rnns. In International Conference on Machine Learning, pp. 2417 2426. PMLR, 2016. Heusel, M., Ramsauer, H., Unterthiner, T., Nessler, B., and Hochreiter, S. Gans trained by a two time-scale update rule converge to a local nash equilibrium. Advances in neural information processing systems, 30, 2017. Action Matching Ho, J., Jain, A., and Abbeel, P. Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems, 33:6840 6851, 2020. Huguet, G., Magruder, D. S., Fasina, O., Tong, A., Kuchroo, M., Wolf, G., and Krishnaswamy, S. Manifold interpolating optimal-transport flows for trajectory inference. ar Xiv preprint ar Xiv:2206.14928, 2022. Hyvärinen, A. and Dayan, P. Estimation of non-normalized statistical models by score matching. Journal of Machine Learning Research, 6(4), 2005. Klein, A. M., Mazutis, L., Akartuna, I., Tallapragada, N., Veres, A., Li, V., Peshkin, L., Weitz, D. A., and Kirschner, M. W. Droplet barcoding for single-cell transcriptomics applied to embryonic stem cells. Cell, 161(5):1187 1201, 2015. Kondratyev, S., Monsaingeon, L., and Vorotnikov, D. A new optimal transport distance on the space of finite radon measures. Advances in Differential Equations, 21(11/12): 1117 1164, 2016. Koshizuka, T. and Sato, I. Neural lagrangian Schrödinger bridge, 2022. URL https://arxiv.org/abs/ 2204.04853. Lavenant, H., Zhang, S., Kim, Y.-H., and Schiebinger, G. Towards a mathematical theory of trajectory inference. ar Xiv preprint ar Xiv:2102.09204, 2021. Léonard, C. A survey of the Schrödinger problem and some of its connections with optimal transport. Discrete & Continuous Dynamical Systems, 34(4):1533, 2014. Li, X., Wong, T.-K. L., Chen, R. T., and Duvenaud, D. Scalable gradients for stochastic differential equations. In International Conference on Artificial Intelligence and Statistics, pp. 3870 3882. PMLR, 2020. Liero, M., Mielke, A., and Savaré, G. Optimal transport in competition with reaction: The Hellinger Kantorovich distance and geodesic curves. SIAM Journal on Mathematical Analysis, 48(4):2869 2911, 2016. Liero, M., Mielke, A., and Savaré, G. Optimal entropytransport problems and a new Hellinger Kantorovich distance between positive measures. Inventiones mathematicae, 211(3):969 1117, 2018. Lipman, Y., Chen, R. T., Ben-Hamu, H., Nickel, M., and Le, M. Flow matching for generative modeling. ar Xiv preprint ar Xiv:2210.02747, 2022. Liu, Q. Rectified flow: A marginal preserving approach to optimal transport. ar Xiv preprint ar Xiv:2209.14577, 2022. Liu, X., Gong, C., and Liu, Q. Flow straight and fast: Learning to generate and transfer data with rectified flow. ar Xiv preprint ar Xiv:2209.03003, 2022. Liu, X., Wu, L., Ye, M., et al. Learning diffusion bridges on constrained domains. In The Eleventh International Conference on Learning Representations, 2023. Lübeck, F., Bunne, C., Gut, G., del Castillo, J. S., Pelkmans, L., and Alvarez-Melis, D. Neural unbalanced optimal transport via cycle-consistent semi-couplings. ar Xiv preprint ar Xiv:2209.15621, 2022. Macosko, E. Z., Basu, A., Satija, R., Nemesh, J., Shekhar, K., Goldman, M., Tirosh, I., Bialas, A. R., Kamitaki, N., Martersteck, E. M., et al. Highly parallel genome-wide expression profiling of individual cells using nanoliter droplets. Cell, 161(5):1202 1214, 2015. Mikami, T. and Thieullen, M. Duality theorem for the stochastic optimal control problem. Stochastic processes and their applications, 116(12):1815 1835, 2006. Mikhailov, V. Partial differential equations. 1976. Moon, K. R., van Dijk, D., Wang, Z., Gigante, S., Burkhardt, D. B., Chen, W. S., Yim, K., Elzen, A. v. d., Hirn, M. J., Coifman, R. R., et al. Visualizing structure and transitions in high-dimensional biological data. Nature biotechnology, 37(12):1482 1492, 2019. Noé, F., Tkatchenko, A., Müller, K.-R., and Clementi, C. Machine learning for molecular simulation. Annual review of physical chemistry, 71:361 390, 2020. Onken, D., Fung, S. W., Li, X., and Ruthotto, L. Ot-flow: Fast and accurate continuous normalizing flows via optimal transport. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, pp. 9223 9232, 2021. Otto, F. The geometry of dissipative evolution equations: the porous medium equation. 2001. Peluchetti, S. Diffusion bridge mixture transports, schr\" odinger bridge problems and generative modeling. ar Xiv preprint ar Xiv:2304.00917, 2023. Pooladian, A.-A., Ben-Hamu, H., Domingo-Enrich, C., Amos, B., Lipman, Y., and Chen, R. Multisample flow matching: Straightening flows with minibatch couplings. ar Xiv preprint ar Xiv:2304.14772, 2023. Ronneberger, O., Fischer, P., and Brox, T. U-net: Convolutional networks for biomedical image segmentation. In International Conference on Medical image computing and computer-assisted intervention, pp. 234 241. Springer, 2015. Action Matching Salimans, T. and Ho, J. Should ebms model the energy or the score? In Energy Based Models Workshop-ICLR 2021, 2021. Salimans, T., Goodfellow, I., Zaremba, W., Cheung, V., Radford, A., and Chen, X. Improved techniques for training gans. Advances in neural information processing systems, 29, 2016. Schiebinger, G. Reconstructing developmental landscapes and trajectories from single-cell data. Current Opinion in Systems Biology, 27:100351, 2021. Schiebinger, G., Shu, J., Tabaka, M., Cleary, B., Subramanian, V., Solomon, A., Gould, J., Liu, S., Lin, S., Berube, P., et al. Optimal-transport analysis of single-cell gene expression identifies developmental trajectories in reprogramming. Cell, 176(4):928 943, 2019. URL https://www.cell.com/cms/10. 1016/j.cell.2019.01.006/attachment/ d251b88d-a356-436a-a9a7-b7e04b56b41f/ mmc8.pdf. Shi, Y., De Bortoli, V., Campbell, A., and Doucet, A. Diffusion schr\" odinger bridge matching. ar Xiv preprint ar Xiv:2303.16852, 2023. Song, Y. and Ermon, S. Generative modeling by estimating gradients of the data distribution. Advances in Neural Information Processing Systems, 32, 2019. Song, Y., Garg, S., Shi, J., and Ermon, S. Sliced score matching: A scalable approach to density and score estimation. In Uncertainty in Artificial Intelligence, pp. 574 584. PMLR, 2020a. Song, Y., Sohl-Dickstein, J., Kingma, D. P., Kumar, A., Ermon, S., and Poole, B. Score-based generative modeling through stochastic differential equations. ar Xiv preprint ar Xiv:2011.13456, 2020b. Tong, A., Huang, J., Wolf, G., Van Dijk, D., and Krishnaswamy, S. Trajectorynet: A dynamic optimal transport network for modeling cellular dynamics. In International conference on machine learning, pp. 9526 9536. PMLR, 2020. Tong, A., Malkin, N., Huguet, G., Zhang, Y., Rector-Brooks, J., Fatras, K., Wolf, G., and Bengio, Y. Conditional flow matching: Simulation-free dynamic optimal transport. ar Xiv preprint ar Xiv:2302.00482, 2023. Vázquez, J. L. The porous medium equation: mathematical theory. Oxford University Press on Demand, 2007. Villani, C. Optimal transport: old and new, volume 338. Springer, 2009. Yang, L. and Karniadakis, G. E. Potential flow generator with l2 optimal transport regularity for generative models. IEEE Trans. Neural Networks Learn. Syst., 33(2):528 538, 2020. doi: 10.1109/TNNLS.2020. 3028042. URL https://doi.org/10.1109/ TNNLS.2020.3028042. Action Matching A.1. Action Matching Proofs Theorem 2.1 (Adapted from Theorem 8.3.1 of Ambrosio et al. (2008)). Consider a continuous dynamic with the density evolution of qt, which satisfies mild conditions (absolute continuity in the 2-Wasserstein space of distributions P2(X)). Then, there exists a unique (up to a constant) function s t (x), called the action , such that vector field v t (x) = s t (x) and qt satisfies the continuity equation tqt = (qt s t (x)) . (3) In other words, the ODE d dtx(t) = s t (x) can be used to move samples in time such that the marginals are qt. Proof of existence and uniqueness. The existence of and uniqueness of the solution could be argued by observing that tqt = (qt s t ) in X, s t , n = 0 on X, where n is the surface normal, is an elliptic PDE with the Neumann boundary condition, and that it is a classical fact that these PDEs have a solution under mild conditions on qt (Mikhailov, 1976). See Ambrosio et al. (2008) for a proof in more general settings. Theorem 2.2. For an arbitrary variational action s, the ACTION-GAP(s, s ) can be decomposed as the sum of an intractable constant K, and a tractable term LAM(s) ACTION-GAP(st, s t ) = K + LAM(st) . (6) where LAM(s) is the Action Matching objective, which we minimize LAM(s) := Eq0(x) s0(x) Eq1(x) s1(x) 2 st(x) 2 + st t (x) dt (7) ACTION-GAP(st, s t ) 0 ωt Eqt(x) st s t 2dt X ωtqt(x) st s t 2dxdt X ωtqt(x) st 2dxdt Z 1 X qt(x) st(x), s t (x) dxdt + KAM z }| { 1 2 Z Eqt(x) s t 2dt X ωtqt(x) st 2dxdt Z 1 X st(x), qt(x) s t (x) dxdt + KAM X ωtqt(x) st 2dxdt + Z 1 X st(x)[ (qt(x) s t (x))]dxdt Z 1 X qt(x)st(x) :0 s t , dn dt + KAM X ωtqt(x) st 2dxdt Z 1 X ωtst(x) qt(x) t dx dt + KAM 0 ωt Eqt(x) 2 st(x) 2 dt ωt Eqt(x)[st(x)] t=1 t=0 Z dt + ωt st(x) 0 ωt Eqt(x) 2 st(x) 2 + st(x) t + st(x)d log ωt dt ω1Eq1(x)[s1(x)] + ω0Eq0(x)[s0(x)] + KAM = LAM(s) + KAM Action Matching where in (1), we have used integration by parts for divergence operator R X g, f dx = H X g( f)dx and that s t , dn | X = 0 due to the Neumann boundary condition (see proof of Theorem 2.1 above), and in (2) we have used integration by parts. For each distributional path qt, the kinetic energy term only depends on the true actions s t and is defined as K( s t ) := 1 0 Eqt(x) s t (x) 2dt . (28) Thus, minimizing the LAM(s) can be interpreted as maximizing a variational lower bound on kinetic energy. Proposition A.1 (Adapted from Albergo & Vanden-Eijnden (2022)). Let st(x) be a learned vector field, continuously differentiable in (t, x) and Lipschitz in x uniformly on [0, 1] Rd with Lipschitz constant K. Let ˆqt denote the density induced by x t = st(x) with x0 q0. Then, the squared Wasserstein-2 distance between ˆqt and the ground truth qt (induced by s t (x)) at each time τ [0, 1] is bounded by W 2 2 (ˆqτ, qτ) e(1+2K)τ Z τ 0 Eqt(x) st(x) s t (x) 2dt where the right-hand side includes the action gap in Eq. (5). Proof. Consider two flows φt(x) and φ t (x) defined by t = st(φt(x)) and φ t (x) t = s t (φ t (x)) (29) correspondingly. Consider the quantity Qt = Z dx q0(x) φt(x) φ t (x) 2. (30) Note that W 2 2 (ˆqτ, qτ) Qτ, since (φτ)#q0 = ˆqτ and (φ τ)#q0 = qτ have the correct marginals and thus constitute an admissible coupling for the W2 distance. The time derivative of Qt is t = 2 Z dx q0(x) φt(x) φ t (x), φt(x) t φ t (x) t = 2 Z dx q0(x) φt(x) φ t (x), st(φt(x)) s t (φ t (x)) (32) = 2 Z dx q0(x) φt(x) φ t (x), st(φt(x)) st(φ t (x)) (33) + 2 Z dx q0(x) φt(x) φ t (x), st(φ t (x)) s t (φ t (x)) . By the Lipshitzness of st(x), we bound the first term as 2 φt(x) φ t (x), st(φt(x)) st(φ t (x)) 2K φt(x) φ t (x) 2. (34) The second term is bounded as follows φt(x) φ t (x) 2 2 φt(x) φ t (x), st(φ t (x)) s t (φ t (x)) + st(φ t (x)) s t (φ t (x)) 2 0, (35) 2 φt(x) φ t (x), st(φ t (x)) s t (φ t (x)) φt(x) φ t (x) 2 + st(φ t (x)) s t (φ t (x)) 2. (36) t (1 + 2K)Qt + Z dx q0(x) st(φ t (x)) s t (φ t (x)) 2, (37) and by Gronwall s lemma, we have Qτ Q0 exp(τ(1 + 2K)) Z τ 0 dt Z dx q0(x) st(φ t (x)) s t (φ t (x)) 2. (38) Clearly, Q0 = 0. Using Eq. (38) and the fact that W 2 2 (ˆqτ, qτ) Qτ from above, the proposition is proven. Action Matching A.2. Entropic Action Matching Proofs Proposition 3.1. Consider a continuous dynamic with the density evolution of qt, and suppose σt is given. Then, there exists a unique (up to a constant) function s t (x), called the entropic action , such that vector field v t (x) = s t (x) and qt satisfies the Fokker-Planck equation tqt = (qt s t ) + σ2 t 2 qt , (12) Proof of existence and uniqueness. Consider the PDE tqt = (qt s t ) + σ2 t 2 qt in X s t , n = σ2 t 2 log qt, n on X with the reparametrization s t = s t σ2 t 2 log qt, we can write this PDE as tqt = (qt s t ) + σ2 t 2 qt s t σ2 t 2 log qt s t σ2 t 2 log qt = (qt s t ) with the boundary condition of s σ2 t 2 log qt , n = s t , n = 0 From App. A.1, we know s t exists and is unique (up to a constant), thus s t also exists and is unique (up to a constant). Proposition 3.2. For an arbitrary variational action s, the E-ACTION-GAP(s, s ) can be decomposed as the sum of an intractable constant K, and a tractable term Le AM(s) which can be minimized: E-ACTION-GAP(s, s ) = Le AM(s) + Ke AM , where Le AM(s) is the entropic Action Matching objective, which we minimize Le AM(s) := Eq0(x)[s0(x)] Eq1(x)[s1(x)] (13) 2 st(x) 2 + st t (x) + σ2 t 2 st Action Matching E-ACTION-GAP(st, s t ) 0 Eqt(x) st s t 2dt X qt(x) st s t 2dxdt X qt(x) st 2dxdt Z 1 X qt(x) st(x), s t (x) dxdt + Ke AM z }| { 1 2 Z Eqt(x) s t 2dt X qt(x) st 2dxdt Z 1 X st(x), qt(x) s t (x) dxdt + Ke AM X qt(x) st 2dxdt + Z 1 X st(x)[ (qt(x) s t (x))]dxdt σ2 t 2 X st qt, dn dt + Ke AM X qt(x) st 2dxdt Z 1 tqt(x)dx dt + σ2 t 2 X st(x) qtdx dt σ2 t 2 X st qt, dn dt + Ke AM 2 st(x) 2 dt Eqt(x)[st(x)] t=1 t=0 Z dt + σ2 t 2 X qt(x) stdx dt X qt st, dn dt + Ke AM 2 st(x) 2 + st(x) t + σ2 t 2 st dt Eq1(x)[s1(x)] + Eq0(x)[s0(x)] σ2 t 2 X qt st, dn dt + Ke AM = Le AM(s) + Ke AM where in (1), we have used the integration by parts for divergence operator R X g, f dx = H X g( f)dx, and the Neumann boundary condition from App. A.2 Z X st(x), qt(x) s t (x) dx = I X qt(x)st(x) s t , dn Z X st(x) (qt(x) s t (x))dx integration by parts X stqt log qt, dn Z X st(x) (qt(x) s t (x))dx boundary condition X st qt, dn Z X st(x) (qt(x) s t (x))dx In (2), we have used the Fokker-Planck equation: (qt s t (x)) = tqt + σ2 t 2 qt. To derive (3), we have Z X st(x) qtdx Z X st qt, dn = Z X st(x)[ ( qt)]dx Z X st qt, dn X st, qt dx integration by parts for divergence theorem X qt, st dx by symmetry X qt(x) stdx Z X qt st, dn following the reverse of the first 2 steps Finally, note that the term σ2 t 2 R 1 0 H X qt st, dn dt can be dropped if qt vanishes on the boundary X. For each distributional path qt, the entropic kinetic energy term only depends on the true entropic action s t and is defined as Ke AM( s t ) := 1 0 Eqt(x) s t (x) 2dt . (39) Thus, minimizing Le AM(s) can be interpreted as maximizing a variational lower bound on the entropic kinetic energy. Action Matching A.3. Unbalanced Action Matching Proofs Proposition 3.3. Consider a continuous dynamic with density evolution qt satisfying mild conditions. Then, there exists a unique function ˆs t (x), called the unbalanced action , such that velocity field v t (x) = ˆs t (x) and growth term g t (x) = ˆs t (x) satisfy the importance weighted continuity equation: tqt = (qt ˆs t ) + qtˆs t , (17) Proof of existence and uniqueness. The existence and uniqueness of the solution could be argued by observing that tqt = (qt ˆs t ) + ˆs t qt in X ˆs t , n = 0 on X is an elliptic PDE with the Neumann boundary condition, and that it is a classical fact that these PDEs have a solution under mild conditions on qt (Mikhailov, 1976). Proposition 3.4. For an arbitrary variational action s, the U-ACTION-GAP(s, ˆs ) can be decomposed as the sum of intractable constants K and G, and a tractable term Lu AM(s) U-ACTION-GAP(s, ˆs ) = Ku AM + Gu AM + Lu AM(s) where Lu AM(s) is the unbalanced Action Matching objective, which we minimize Lu AM(s) := Eq0(x)[s0(x)] Eq1(x)[s1(x)] (20) 2 st(x) 2 + st U-ACTION-GAP(st, ˆs t ) 0 Eqt(x) st ˆs t 2dt + 1 0 Eqt(x) st ˆs t 2dt X qt(x) st ˆs t 2dxdt + 1 X qt(x) st ˆs t 2dxdt X qt(x) st(x), ˆs t (x) dxdt Z 1 X qt(x)st(x)ˆs t (x)dxdt Ku AM z }| { 1 2 Z Eqt(x) ˆs 2 t 2dt + Gu AM z }| { 1 2 Z Eqt(x)ˆs 2 t dt X st(x), qt(x) ˆs t (x) dxdt Z 1 X qt(x)st(x)ˆs t (x)dxdt + Ku AM + Gu AM X st(x)[ (qt(x) ˆs t (x)) qt(x)ˆs t (x)]dxdt Z 1 X qt(x)st(x) :0 s t , dn dt + Ku AM + Gu AM tqt(x)dx dt + Ku AM + Gu AM 2 st(x) 2 + 1 dt Eqt(x)[st(x)] t=1 t=0 Z dt + Ku AM + Gu AM 2 st(x) 2 + st(x) dt Eq1(x)[s1(x)] + Eq0(x)[s0(x)] + Ku AM + Gu AM = Lu AM(s) + Ku AM + Gu AM Action Matching where in (1), we have used integration by parts for divergence operator R X g, f dx = H X g( f)dx and that s t , dn | X = 0 due to the Neumann boundary condition (see App. A.3), in (2) we have used: tqt = (qt s ) + s qt, and in (3) we have integration by parts. For each distributional path qt, the unbalanced kinetic energy and unbalanced growth energy terms only depend on the true unbalanced action ˆs t and are defined as Ku AM( ˆs t ) := 1 0 Eqt(x) ˆs t (x) 2dt . Gu AM(ˆs t ) := 1 0 Eqt(x)ˆs t (x)2dt . (40) Thus, minimizing the Lu AM(s) can be interpreted as maximizing a variational lower bound on the summation of unbalanced kinetic energy and growth energy. B. Action Matching and Optimal Transport In this section, we describe various connections between Action Matching and optimal transport. First, we describe how Action Matching (App. B.1), entropic AM (App. B.3), and unbalanced AM (App. B.4) can be understood as solving local versions of optimal transport problems between infinitesimally close distributions, where connections with dynamical OT formulations play a key role (Benamou & Brenier, 2000; Chen et al., 2016; Chizat et al., 2018a; Liero et al., 2016). In App. B.2, we generalize AM by considering Lagrangian action cost functions beyond the squared Euclidean OT cost and standard kinetic energy underlying the other results in this paper. Finally, in App. B.5, we interpret AM as learning a linear potential functional whose Wasserstein-2 gradient flow traces the given density path qt. B.1. Action Matching as Infinitesimal Optimal Transport Using Ambrosio et al. (2008) Prop. 8.4.6 (see also Villani (2009) Remark 13.10), we can interpret s t in action matching as learning the optimal transport map between two infinitesimally close distributions on the given curve qt, under the squared Euclidean cost. In particular, by Ambrosio et al. (2008) Prop. 8.4.6, we have s t = lim dt 0 1 dt(T (qt, qt+dt) id). (41) where id is the identity mapping and T (qt, qt+dt) = xt + φ t,dt(xt) is the unique transport map (Gangbo & Mc Cann (1996) Thm. 4.5) solving the Monge formulation of the Wasserstein-2 distance between neighboring densities qt, qt+dt W 2 2 (qt, qt+dt) = inf T Z T(x) x 2qt(x)dx T#qt = qt+dt 2Eqt(x) φ t,dt(x) 2 s.t. (x + φ t,dt)#qt = qt+dt (42) and T#qt is the pushforward density of qt under the map T. Note that the continuity equation suggests the pushforward map x + dt s t (x) for small dt. Thus, among all vector fields, the Action Matching objective finds the one that satisfies the continuity equation while minimizing the infinitesimal displacement of samples according to the squared Euclidean cost. We can also compare Eq. (42) to Eq. (44) below. Kinetic Energy and Dynamical Optimal Transport To further understand this result, we observe that Action Matching is closely related to the dynamical formulation of optimal transport due to Benamou & Brenier (2000). The squared Wasserstein-2 distance between given densities p0 and p1 with densities q0 and q1 can be expressed as, W 2 2 (µ, ν) = inf qt inf vt 1 2 0 Eqt(x) vt(x) 2dt s.t. tqt = (qtvt), q0 = p0, and q1 = p1. (43) However, in Action Matching, the intermediate densities or distributional path qt are fixed via the given samples. In this case, we can interpret AM as learning a local or infinitesimal optimal transport map as in Eq. (41), although the given qt may not trace the global optimal transport path between q0 and q1. Nevertheless, it can be shown that (Ambrosio et al. (2008)) s t = arg min vt 0 Eqt(x) h vt(x) 2i dt s.t. tqt = (qtvt). (44) Action Matching To show this directly, we introduce a Lagrange multiplier st to enforce the continuity equation and integrate by parts with respect to x and t (also see Eq. (58)), L(vt, st) = sup st inf vt 1 2 Z Eqt(x) vt(x) 2 dt + Z st(xt) qt t (xt) + qt(xt)vt(xt) dxtdt (45) = sup st inf vt 1 2 0 Eqt(x) vt(x) 2 dt + Eqt(xt)[st(xt)] t=1 t + vt(xt), st(xt) dt (46) Eliminating the optimization with respect to vt, we obtain the optimality condition vt(x) = st(x) (47) Plugging this condition into the Lagrangian in Eq. (46), we obtain the AM objective from Theorem 2.2, inf st LAM(st), where LAM(st) = R 1 0 Eqt(x) 1 2 st(x) 2 + st(xt) t dt Eqt(xt)[st(xt)] t=1 t=0 . This objective is an upper bound on the optimal kinetic energy K( s t ), which is evaluated using the optimal vector field v t = s t in Eq. (44). Metric Derivative Finally, the AM objective and Eq. (44) are related to the metric derivative in the 2-Wasserstein space P2(X). By Ambrosio et al. (2008) Thm. 8.3.1, for qt, qt+dt along an absolutely continuous density path with tqt = (qtv t ), we have |q t|2 = lim dt 0 W2(qt, qt+dt) 2 = Eqt(x) v t (x) 2 = Eqt(x) s t (x) 2 . (48) The optimal value of the AM objective at each t thus reflects the infinitesimal squared W2 distance along the given path. B.2. Action Matching with Lagrangian Costs Starting from the Benamou-Brenier dynamical formulation of optimal transport in Eq. (43), we can also formulate action matching with more general Lagrangian action cost functions. For a Lagrangian L(xt, vt, t) which is strictly convex in the velocity vt, we define the dynamical optimal transport problem T (p0, p1) = inf qt inf vt Z L(xt, vt, t) qt(xt)dxtdt subj: qt t (xt) = qt(xt)vt(xt) q0 = p0 and q1 = p1. (49) See Villani (2009) Ch. 7 for in-depth analysis of Eq. (49) as an optimal transport problem, where L(xt, vt, t) = 1 recovers the Wasserstein-2 distance. For action matching, assume that the intermediate path of qt is given via its samples. We now seek to learn a vector field which minimizes the Lagrangian while satifying the continuity equation, K ℓAM := inf vt Z L(xt, vt, t) qt(xt)dxtdt subj: qt t = qtvt . (50) For this Lagrangian cost, we prove the following analogue of Theorem 2.2, which is recovered for L(xt, vt, t) = 1 2 vt 2. We first state a definition. Definition B.1. The convex conjugate of a Lagrangian L(xt, vt, t) which is strictly convex in vt is defined as H(xt, at, t) = sup vt vt, at L(xt, vt, t). (51) Solving for the optimizing argument yields the dual correspondence vt = at H(xt, at, t) and at = vt L(xt, vt, t). Action Matching Proposition B.2. For a Lagrangian L(xt, vt, t) which is strictly convex in the velocity vt, the optimization defining K ℓAM in Eq. (50) can be expressed via the dual optimization, K ℓAM = inf st LℓAM(st), (52) where LℓAM(st) := Z s0(x0) q0(x0)dx0 Z s1(x1) q1(x1)dx1 + Z H(xt, st(xt), t) + st(xt) qt(xt)dxtdt For a variational st, the corresponding action gap between st and the optimal s t , is written using a Bregman divergence K ℓAM + LℓAM(st) = ACTION-GAPH(st, s t ) (53) where ACTION-GAPH(st, s t ) := Z DH[ st(xt) : s t (xt)] qt(xt) dxtdt (54) where the Bregman divergence DH is defined as DH[ st(xt) : s t (xt)] = H(xt, st, t) H(xt, s t , t) v t (xt), st(xt) s t (xt) . (55) Using Definition B.1 and the fact that v t and s t are duals related by v t = at H(xt, s t , t), the Bregman divergence may also be written in mixed parameterization DH[ st : s t ] = DL,H[v t : st], with DL,H[v t : st(xt)] = L(xt, v t , t) + H(xt, st, t) v t (xt), st(xt) . (56) For the case of L(xt, vt, t) = 1 2 vt 2, the Hamiltonian is H(xt, st, t) = 1 2 st 2 and the two parameterizations are self-dual with vt = st. Using this transformation, the Bregman divergence is simply half the squared Euclidean norm, DL,H[v t : st] = 1 2 v t 2 + 1 2 st 2 v t , st = 1 2 s t 2 + 1 2 st 2 s t , st = 1 2 s t st 2. (57) From Definition B.1, note that in general, the optimality condition translating between vt and st is vt = at H(xt, st, t) or st = vt L(xt, vt, t). Proof. Introducing a Lagrange multipler st(x) to enforce the continuity equation constraint, we integrate by parts in both x and t using the assumption of the boundary condition vt, n = 0, K ℓAM = inf vt Z L(xt, vt, t) qt(xt)dxtdt subj: qt t (xt) = qt(xt)vt(xt) = sup st inf vt Z L(xt, vt, t)qt(xt)dxtdt + Z st(xt) qt t (xt) + qt(xt)vt(xt) dxtdt L(xt, vt, t) qt(xt)dxtdt + Z st(xt)qt(xt)dxt t=1 t qt(xt)dxtdt X qt(xt)st(xt) :0 vt(xt), dn dt vt(xt), st(xt) qt(xt)dxtdt sup vt vt(xt), st(xt) L(xt, vt, t) qt(xt)dxtdt t qt(xt)dxtdt + Z st(xt)qt(xt)dxt t=1 Action Matching We can recognize the highlighted terms as a Legendre transform, where we define the Hamiltonian H as the convex conjugate H(xt, st(xt), t) = sup vt vt(xt), st(xt) L(xt, vt, t) of the Lagrangian L. This finally results in K ℓAM := sup st Z s1(x1) q1(x1)dx1 Z s0(x0) q0(x0)dx0 H(xt, st(xt), t) + st(xt) qt(xt)dxtdt. = LℓAM(s t ) where Eq. (59) can be used to define a corresponding action matching objective LℓAM(st) for a variational st (see Eq. (52)). To calculate the action gap, we subtract LℓAM(st) from the optimal value of the kinetic energy in Eq. (50), using v t ACTION-GAPH(st, s t ) := K ℓAM + LℓAM(st) = LℓAM(st) LℓAM( s t ) (using Eq. (59)) = Z s0(x0) q0(x0)dx0 Z s1(x1) q1(x1)dx1 + Z H(xt, st(xt), t) + st(xt) qt(xt)dxtdt Z s 0(x0) q0(x0)dx0 Z s 1(x1) q1(x1)dx1 + Z H(xt, s t (xt), t) + s t (xt) qt(xt)dxtdt Z H(xt, st(xt), t) H(xt, s t (xt), t) qt(xt)dxtdt + t st(xt) + s t (xt) dxtdt Z H(xt, st(xt), t) H(xt, s t (xt), t) qt(xt)dxtdt Z qtv t (xt) st(xt) s t (xt) dxtdt Z H(xt, st(xt), t) H(xt, s t (xt), t) v t (xt), st(xt) s t (xt) qt(xt)dxtdt Z DH[ st(xt) : s t (xt)] qt(xt)dxtdt where in (1) we use the fact that R 1 0 R qt t stdxtdt = R s0 q0dx0 R s1q1dx1 + R 1 0 R st t qtdxtdt by integration by parts. In (2), we use the fact that v t satisfies the continuity equation for qt from Eq. (50). In (3), we integrate by parts with respect to x and recognize the resulting expression as the definition of the Bregman divergence from Eq. (55). B.3. Entropic Action Matching and Entropy-Regularized Optimal Transport Consider the dynamical formulation of entropy-regularized optimal transport (Léonard, 2014; Chen et al., 2016; 2021), which involves the same kinetic energy minimization as in Eq. (43) but modifying the continuity equation to account for stochasticity, 1 2Wϵ(p0, p1) = inf vt inf qt 1 2Eqt(x) vt(x) 2dt, s.t. qt(x) t = qt(x)vt(x) + σ2 t 2 qt(x) and q0 = p0, q1 = p1 . Since we fix the density path qt in our Action Matching formulation problem, we again omit the optimization over qt and the marginal constraints, Ke AM := inf vt 1 2Eqt(x) vt(x) 2dt, s.t. qt(x) t = qt(x)vt(x) + σ2 t 2 qt(x) (61) Action Matching Introducing Lagrange multiplier st to enforce the Fokker-Planck constraint leads to Ke AM = sup st inf vt 1 2Eqt(x) vt(x) 2dt + Z st(xt) qt t (xt) + qt(xt)vt(xt) σ2 t 2 qt(x) dxtdt. Compared to Eq. (46) and Eq. (58), note that the additional st(xt) σ2 t 2 qt(x) term does not depend on vt. Thus, integrating by parts and eliminating vt as above yields the condition that v t is a gradient field, vt(x) = s t (x) (62) Substituting into Eq. (61) and following derivations as in the proof of Prop. 3.2 in App. A.2 yields the entropic AM objective, Le AM := inf st(x) Z dx s0(x)p0(x) Z dx s1(x)p1(x) + dtdx qt(x) 1 2 st 2 + σ2 t 2 st + st Since s t is the unique gradient field which satisfies the Fokker-Planck equation for the distributional path of qt (Prop. 3.1), and the solution v t minimizing the kinetic energy in Eq. (61) is a gradient field satisfying the Fokker-Planck equation, we conclude that these vector fields are the same, v t = s t . B.4. Unbalanced Action Matching and Unbalanced Optimal Transport To account for growth or destruction of probability mass across time or optimal transport between positive measures with unequal normalization constant, (Chizat et al., 2018a;c; Kondratyev et al., 2016; Liero et al., 2016; 2018) analyze optimal transport problems involving a growth rate gt(x). In particular, the Wasserstein Fisher-Rao distance (Chizat et al., 2018a) is defined by adding a term involving the norm of the growth rate gt to the Benamou & Brenier (2000) dynamical OT formulation in Eq. (43) and accounting for the growth term in the modified continuity equation (see Eq. (15)-16) WFRλ(p0, p1) := inf vt inf gt inf qt 2 vt(x) 2 + λ 2 gt(x)2 dt, subj. to (64) t = qt(x)vt(x) + λgt(x)qt(x) , and q0 = p0, q1 = p1. (65) where the growth term may also be scaled by a multiplier λ. If we again fix the path qt, as in Action Matching, we define K u AM + G u AM := inf vt inf gt 2 vt(x) 2 + λ 2 gt(x)2 dt s.t. qt(x) t = qt(x)vt(x) + λgt(x)qt(x) (66) Note that Eq. (64) and Eq. (66) are each convex optimizations after the change of variables (qt, vt, gt) 7 (qt, qtvt, qtgt) (Chizat et al., 2018a). We slightly abuse notation to write integrals as expectations with respect to qt(x), which may not be a normalized probability measure in general (see below, Liero et al. (2016)). Introducing st as a Lagrange multiplier enforcing the modified continuity equation, we obtain the following Lagrangian K u AM + G u AM = sup st inf vt inf gt 2 vt(x) 2 + λ 2 gt(x)2 dt + t (x) + qt(x)vt(x) qt(x)gt(x) dxdt (1) = sup st inf vt inf gt 2 vt(x) 2 + λ 2 gt(x)2 qt(x)dxdt + Z st(x)qt(x)dx t=1 t qt(x)dxdt (67) Z h λst(x) gt(x) vt(x), st(x) i qt(x)dx where integrate by parts wrt x and t in (1). Finally, eliminating vt and gt yields the optimality conditions v t (x) = st(x) g t (x) = st(x). (68) Action Matching These optimality conditions show that the action function st links the problems of transporting mass via the vector field vt and creating or destroying mass via the growth rate gt. Substituting back into Eq. (67) and simplifying, we obtain the unbalanced objective Lu AM(st), which is to be minimized as a function of the variational action Lu AMλ(st) := Eq0(x)[s0(x)] Eq1(x)[s1(x)] + Z Eqt(x) 2 st(x) 2 + st As in Prop. 3.4, we can define an appropriate action gap involving the growth term, with the optimal ˆs t evaluating the kinetic energy K u AM( ˆs t ) := 1 2 R 1 0 Eqt(x) ˆs t (x) 2dt and growth energy G u AM(ˆs t ) := λ 2 R 1 0 Eqt(x)[ˆs t (x)2]dt of a given curve qt. Metric Derivative Finally, the u AM objective and Eq. (66) are related to the metric derivative in the space of finite positive Borel measures M(X) equipped with the Wasserstein Fisher-Rao metric distance in Eq. (64), where we consider measures with arbitrary mass due to the effect of the growth term. Liero et al. (2016) Thm. 8.16 and 8.17 are analogous to Ambrosio et al. (2008) Thm. 8.3.1 for this case. In particular, for qt, qt+dt along an absolutely continuous (AC) curve of positive measures with tqt = (qtv t ) + qtg t , we have (Liero et al., 2016) |q t|2 = lim dt 0 WFR1(qt, qt+dt) 2 = Z v t (x) 2 + g t (x) 2 qt(x)dx = Z ˆs t (x) 2 + ˆs t (x) 2 qt(x)dx (70) The optimal u AM objective (Eq. (66)) at each t thus reflects the infinitesimal squared WFR distance along an AC curve. B.5. Action Matching and Wasserstein Gradient Flows We can also view dynamics learned by action matching as parameterizing the gradient flow of a time-dependent linear functional on the Wasserstein-2 manifold. This is in contrast JKOnet (Bunne et al., 2022), which is limited to learning gradient flows for class of time-homogenous linear functionals. We give a brief review of concepts related to gradient flows, but refer to e.g. Figalli & Glaudo (2021) Ch. 3-4 for more details. Throughout this section, we let X Rd, consider the space of probability measures P2(X) with finite second moment, and identify measures µ P2(X) with their densities dµ = pdx. In order to define a gradient flow, we first need to define an inner product on the tangent space. The seminal work of Otto (2001) defines the desired Riemannian metric on TP2(X). Consider two curves p(i) t : t P2(X) passing through a point q P2(X), with p(i) 0 = q and tangent vectors p0 (1), p0 (2) Tq P2(X) for p(i) := p(i) t t . Using Theorem 2.1, each curve satisfies a continuity equation for a gradient field ψ(i) t (x), e.g. pt (i)|t=0 = p(i) 0 ψ(i) 0 at t = 0. The inner product on Tq P2(X) is defined as pt (1), pt (2) q = Z ψ(1) 0 (x), ψ(2) 0 (x) q(x)dx where pt (i)|t=0 = p(i) 0 ψ(i) 0 (71) The gradient of a functional F[pt] with respect to the Wasserstein-2 metric is defined as the tangent vector for which the inner product yields the directional derivative along a curve pt : t P2(X) at t = 0 grad W2F[pt], pt p0 = d t=0F[pt] (72) or, more explictly, as: grad W2F[pt] = pt δF[pt] where δF[p] δp is the first variation. For example, the Wasserstein gradient of a time-dependent linear functional is given by Ft[pt] = Z pt(x)st(x)dx grad W2Ft[pt] = pt st . (74) A negative gradient flow on the Wasserstein manifold is then given, in either continuous or discrete time, by t = grad W2Ft[pt] = pt st . (75) Thus, the negative gradient flow of a time-dependent linear functional on P2(X) can be modeled using the continuity equation for a vector field st. Action Matching can now be viewed as learning the functional Ft[pt] = R pt(x)st(x)dx for which a given density path pt : t P2(X) (or qt, in the main text) traces gradient flow on the Wasserstein manifold P2(X). Action Matching Comparison with JKONET In discrete time, the gradient flow in Eq. (75) can be written as pt+1 = arg min p F[p] + 1 2τ W 2 2 (p, pt). (76) where we write the Wasserstein distance between densities instead of measures. The JKONET method of Bunne et al. (2022) uses the above discrete-time approach to learn a potential function F[p] which drives the observed sample dynamics. However, they restrict attention to learning the parameters θ of a time-homogeneous linear functional F[p] = R p(x)s(x; θ)dx and their optimization methodology is notably more complex than action matching. C. Generative Modeling in Practice Algorithm 2 Generative Modeling using Action Matching (In Practice) Require: dataset {xj t}Nt j=1, xj t qt(x), batch-size n Require: parameteric model st(x, θ), weight schedule ω(t) for learning iterations do get batch of samples from boundaries: {xi 0}n i=1 q0(x), {xi 1}n i=1 q1(x) sample times {ti}n i=1 p(t) get batch of intermediate samples {xi ti}n i=1 qt(x) Li = s0(xi 0)ω(0) s1(xi 1)ω(1) + 1 2 st(xi ti) 2ω(ti) + st(xi ti) t ω(ti) + st(xi ti) ω(ti) 1 p(ti)Li update the model θ Optimizer(θ, θLθ) end for output trained model st(x, θ ) In practice, we found that the naive application of Action Matching (Algorithm 1) for complicated dynamics such as image generation might exhibit poor convergence due to the large variance of objective estimate. Moreover, the optimization problem Z Eqt(x) st(x) s t (x) 2dt (77) might be ill posed due to the singularity of the ground truth vector field s t . This happens when the data distribution q0 is concentrated close to a low dimensional manifold, and the final distribution q1 has a much higher intrinsic dimensionality (e.g., Gaussian distributions). In this case, the deterministic velocity vector field must be very large (infinite in the limit), so that it can pull apart the low dimensional manifold to transform it to higher dimensions. We now discuss an example of this behavior, when the data distribution is a mixture of delta functions. Consider the sampling process xt = ft(x0) + σtε, x0 π(x), ε N(x | 0, 1) , (78) where the target distribution is a mixture of delta-functions i δ(x xi). (79) Denoting the distribution of xt as qt(x), we can solve the continuity equation t = (qtvt) (80) analytically (see Appendix D) by finding one of the many possible solutions i qi t(x) (x ft(xi)) t log σt + ft(xi) , qi t(x) = N(x | ft(xi), σ2 t ). (81) Action Matching Note that vt is not curl-free in general, and thus is not the solution of action matching. However, it can be written as i qi t(x) si t(x), where si t(x) = 1 2(x ft(xi))2 t log σt + ft(xi) Given that the density of Gaussian distributions drop exponentially fast, we can conclude that for small values of t around each xi, qj t (x) P j qj t (x) is close to 1 if i = j, and close to 0 if i = j. Thus, vt(x) around each xi can be locally approximated with the curl-free vector field si t(x). Now suppose s t (x) is the solution of action matching, i.e., the unique curl-free vector field that solves the continuity equation in every region, including regions around each xi. Given the uniqueness of curl-free vector fields that solve continuity equation, we can conclude that si t(x) locally matches s t (x) around each xi. For generative modeling, it s essential that q0 = π(x); hence, limt 0 σt = 0 and limt 0 ft(x) = x. Assuming that σ2 t is continuous and differentiable at 0, in the limit, around each xi, we have for t 0 , s t (x) 2 1 σ2 t , and 1 2Eqt(x) s t (x) 2 1 σ2 t . (82) Thus, the loss can be properly defined only on the interval t (δ, 1], where δ > 0. In practice, we want to set δ as small as possible, i.e., we ideally want to learn st on the whole interval t [0, 1]. We can prevent learning the singularity functions just by re-weighting the objective in time as follows Z Eqt(x) st(x) s t (x) 2dt 1 ω(t)Eqt(x) st(x) s t (x) 2dt. (83) To give an example, we can take σt = t and ft(x) = x 1 t, then ω(t) = (1 t)t3/2 cancels out the singularities at t = 0 and t = 1. The second modification of the original Algorithm 1 is the sampling of time-steps for the estimation of the time integral. Namely, the optimization of Equation (83) is equivalent to the minimization of the following objective LAM(s) = ω(1)Eq1(x)[s1(x)] ω(0)Eq0(x)[s0(x)] | {z } weighted action-increment 2ω(t) st(x) 2 + ω(t) st(x) t + st(x) ω(t) dt | {z } weighted smoothness which consists of two terms. Estimation of the weighted action-increment involves only sampling from q0 and q1, while the weighted smoothness term estimate depends on the distribution of time samples p(t), i.e., p(t) p(t) |{z} =1 2ω(t) st(x) 2 + ω(t) st(x) t + st(x) ω(t) = Et p(t)Ex qt(x) 2ω(t) st(x) 2 + ω(t) st(x) t + st(x) ω(t) Note that p(t) can be viewed as a proposal importance sampling distribution, and thus every choice of it results in an unbiased estimate of the original objective function. Thus, we can design p(t) to reduce the variance of the weighted smoothness term of the objective. In our experiments, we observed that simply taking p(t) proportionally to the standard deviation of the corresponding integrand significantly reduces the variance, i.e., Ex qt(ζt Ex qtζt)2, ζt = 1 2ω(t) st(x) 2 + ω(t) st(x) t + st(x) ω(t) We implement sampling from this distribution by aggregating the estimated variances throughout the training with exponential moving average, and then followed by linear interpolation between the estimates. Action Matching D. Sparse Data Regime In this section, we find velocity vector fields that satisfy the continuity equation in the case where the data distribution q0 is a delta function or a mixture of delta functions; and the conditional kt(xt | x) is a Gaussian distribution. D.1. Delta Function Data Distribution We start with the case where the dataset consists only of a single point x0 Rd q0(x) = δ(x x0), kt(xt | x) = N(xt | ft(x), σ2 t ). (89) Then the distribution at time t is qt(x) = Z dx q0(x )kt(x | x ) = N(x | ft(x0), σ2 t ). (90) The ground truth vector field v comes from the continuity equation t = (qtv) = t log qt = log qt, v (v). (91) For our dynamics, we have t log qt = t 2 log(2πσ2 t ) 1 2σ2 t x ft(x0) 2 (92) t log σt + 1 σ2 t x ft(x0) 2 t log σt + 1 x ft(x0), ft(x0) t log σt + 1 x ft(x0), (x ft(x0)) t log σt + ft(x0) σ2 t (x ft(x0)); (95) t log qt = d t log σt log qt, (x ft(x0)) t log σt + ft(x0) Matching the corresponding terms in the continuity equation, we get v = (x ft(x0)) t log σt + ft(x0) We note that since the above vector field is curl-free, it is the unique vector field that the action matching would recover. D.2. Mixture of Delta Functions Data Distribution For the mixture of delta-functions, we denote i δ(x xi), qt(x) = 1 i qi t(x), qi t(x) = N(x | ft(xi), σ2 t ). (98) Due to the linearity of the continuity equation w.r.t. q, we have i qi tv = X t log qi t + log qi t, v + (v) = 0. (99) We first solve the equation for ft t = 0, then for t log σt = 0 and join the solutions. t = 0, we look for the solution in the following form i qi t, qi t(x) = N(x | f i t(xi), σ2 t ). (100) Action Matching Then we have i 2qi t (101) log qi t 2 d i qi t (vσ) = A P log qi t 2 d and from (99) we have t log σt + log qi t, vσ + σ2 t t log σt log qi t + (vσ) = 0. (104) From these two equations we have i qi t (vσ) = A P log qi t 2 d σ2 t t log σt X i qi t log qi t 2. (106) Thus, we have A = σ2 t t log σt. (107) t log σt = 0, we simply check that the solution is i qi t ft(xi) Indeed, the continuity equation turns into log qi t, vf ft(xi) + (vf) = 0. (109) From the solution and the continuity equation we write P i qi t (vf) in two different ways. i qi t (vf) = 1 P i qi t ft(xi) qi t, ft(xi) qi t, ft(xi) Thus, we see that (108) is indeed a solution. Finally, unifying vσ and vf, we have the full solution i qi t + 1 P i qi t ft(xi) t , qi t(x) = N(x | ft(xi), σ2 t ), (112) t log σt + ft(xi) Action Matching 0 20k 40k 60k 80k 100k train iterations Average MMD over time Action Matching Annealed Langevin + SM Annealed Langevin + SSM 0 20k 40k 60k 80k 100k train iterations Score Squared Error Figure 5. On the left, we demonstrate performance of various algorithms in terms of average MMD over the time of dynamics. The MMD is measured between generated samples and the training data. On the right, we report squared error of the score estimation for the score-based methods. E. Experiments Details E.1. Schrödinger Equation Simulation For the initial state of the dynamics tψ(x, t) = 1 x ψ(x, t) 1 2 2ψ(x, t) , (114) we take the following wavefunction ψ(x, t = 0) ψ32 1(x) + ψ210(x), and q t=0(x) = |ψ(x, t = 0)|2, (115) where n, l, m are quantum numbers and ψnlm is the eigenstate of the corresponding Hamiltonian (see (Griffiths & Schroeter, 2018)). For all the details on sampling and the exact formulas for the initial state, we refer the reader to the code github.com/necludov/action-matching. We evolve the initial state for T = 14 103 time units in the system ℏ= 1, me = 1, e = 1, ε0 = 1 collecting the dataset of samples from q t . For the time discretization, we take 103 steps; hence, we sample every 14 time units. To evaluate each method, we collect all the generated samples from the distributions qt, t [0, T] comparing them with the samples from the training data. For the metric, we measure the Maximum Mean Discrepancy (Gretton et al., 2012) between the generated samples and the training data at 10 different timesteps t = k 10T, k = 1, . . . , 10 and average the distance over the timesteps. For the Annealed Langevin Dynamics, we set the number of intermediate steps for M = 5, and select the step size dt by minimizing MMD using the exact scores log qt(x). For all methods, we use the same architecture, which is a multilayer perceptron with 5 layers 256 hidden units each. The architecture h(t, x) takes x R3 and t R and outputs 3-d vector, i.e. h(t, x) : R R3 R3. For the score-based models it already defines the score, while for action matching we use st(x) = h(t, x) x 2 as the model and the vector field is defined as st(x). In Fig. 5 we plot the convergence of Average MMD over time for Action Matching and the baselines. Despite that both SM and SSM accurately recover the ground truth scores for the marginal distributions (see the right plot in Fig. 5), one cannot efficiently use them for the sampling from the ground truth dynamics. Action Matching Algorithm 3 Annealed Langevin Dynamics for the Schrödinger Equation Require: score model st(x) = log qt(x), step size dt, number of intermediate steps M Require: initial samples xi 0 Rd for time steps t (0, T] do for intermediate steps j 1, . . . , M do εi N(0, 1) xi t = xi t + dt 2 log qt(xi t) + end for save samples xi t qt(x) end for output samples {xi t}T t=0 Algorithm 4 Annealed Langevin Dynamics for the Image Generation Require: score model st(x) = log qt(x), step size dt, number of intermediate steps M Require: initial samples xi 0 Rd for time steps t (0, 1) do α = α1(1 t)2 for intermediate steps j 1, . . . , M do εi N(0, 1) xi t = xi t + α 2 log qt(xi t) + α εi end for end for α = α1 for intermediate steps j 1, . . . , M do εi N(0, 1) xi 1 = xi 1 + α 2 log q1(xi 1) + α εi end for output samples xi 1 E.2. Generative Modeling For the architecture of the neural network parameterizing st, we follow (Salimans & Ho, 2021) with a small modification. Namely, we parameterize st(x) as unet(t, x), x , where unet(t, x) is the output of the U-net architecture (Ronneberger et al., 2015). For the U-net architecture, we follow (Song et al., 2020b). We consider the same U-net architecture for the baseline to parameterize log qt. For diffusion, we take VP-SDE from (Song et al., 2020b), which corresponds to αt = exp( 1 2 R β(s)ds) and σt = q 1 exp( R β(s)ds), where β(s) = 0.1 + 19.9t. All images are normalized to the interval [ 1, 1]. For the baseline, we managed to generate the images only taking into account the noise variance of the current distribution qt as proposed in (Song & Ermon, 2019). For propagating samples in time we select the time step dt = 10 2 and perform 10 sampling steps for every qt. We additionally run 100 sampling steps for the final distribution. In total we run 1000 steps to generate images. See Algorithm 4 for the pseudocode. Action Matching Figure 6. Examples of different noising processes used for different vision tasks. The processes interpolate between the prior distribution at t = 0 and the target distribution t = 1. For all the processes x0 N(0, 1). data (t=0.00) data (t=0.25) data (t=0.50) data (t=0.75) data (t=1.00) data (t=0.00) generation (t=0.00) 5 0 5 10 0.00 data (t=0.25) generation (t=0.25) data (t=0.50) generation (t=0.50) data (t=0.75) generation (t=0.75) data (t=1.00) generation (t=1.00) Figure 7. The histograms of the training data (top row) changing in time and the histograms of the generated samples by Unbalanced Action Matching (bottom row). E.3. Unbalanced Action Matching We showcase Unbalanced Action Matching on a toy data for which we consider a mixture of gaussians, i.e., qt(x) = αt N( 5, 1) + (1 αt)N(5, 1) , (116) and change αt linearly from 0.2 to 0.8. In Fig. 7, we demonstrate the data samples and the samples generated by Unbalanced Action Matching starting from the ground truth samples at time t = 0 and reweighting particles according to Eq. (19). Instead of attempting to transport particles from one mode to another, Unbalanced Action Matching is able to model this change of probability mass using the growth term gt(x) = st(x). Action Matching F. Generated Images Figure 8. Images generated (right) by the baseline (ALD + SSM) from the noise (left). Figure 9. Images generated (right) by Action Matching from the noise (left). Action Matching Figure 10. Images generated (right) by VP-SDE from the noise (left). Figure 11. Images generated (right) by Action Matching from the lower resolution images (left). Figure 12. Images colored (right) by Action Matching from the grayscale images (left).