# Momentum Particle Maximum Likelihood

Jen Ning Lim¹, Juan Kuntz², Samuel Power³, Adam M. Johansen¹

**Abstract.** Maximum likelihood estimation (MLE) of latent variable models is often recast as the minimization of a free energy functional over an extended space of parameters and probability distributions. This perspective was recently combined with insights from optimal transport to obtain novel particle-based algorithms for fitting latent variable models to data. Drawing inspiration from prior works which interpret momentum-enriched optimization algorithms as discretizations of ordinary differential equations, we propose an analogous dynamical-systems-inspired approach to minimizing the free energy functional. The result is a dynamical system that blends elements of Nesterov's accelerated gradient method, the underdamped Langevin diffusion, and particle methods. Under suitable assumptions, we prove that the continuous-time system minimizes the functional. By discretizing the system, we obtain a practical algorithm for MLE in latent variable models. The algorithm outperforms existing particle methods in numerical experiments and compares favourably with other MLE algorithms.

## 1. Introduction

In this work, we study parameter estimation for (probabilistic) latent variable models $p_\theta(y, x)$ with parameters $\theta \in \mathbb{R}^{d_\theta}$, unobserved (or latent) variables $x \in \mathbb{R}^{d_x}$, and observed variables $y \in \mathbb{R}^{d_y}$ (which we treat as fixed throughout). The type II maximum likelihood approach (Good, 1983) estimates $\theta$ by maximizing the marginal likelihood $p_\theta(y) := \int p_\theta(y, x)\,\mathrm{d}x$. However, for most models of practical interest, the integral has no known closed-form expression and we are unable to optimize the marginal likelihood directly. This issue is often overcome (e.g., Neal and Hinton (1998))

¹University of Warwick. ²Polygeist. ³University of Bristol. Correspondence to: Jen Ning Lim.
*Proceedings of the 41st International Conference on Machine Learning, Vienna, Austria. PMLR 235, 2024. Copyright 2024 by the author(s).*

by constructing an objective defined over an extended space, whose optima are in one-to-one correspondence with those of the MLE problem. To this end, we define the free energy functional:
$$\mathcal{E}(\theta, q) := \int \log\left(\frac{q(x)}{p_\theta(y, x)}\right) q(x)\,\mathrm{d}x. \tag{1}$$
The infimum of $\mathcal{E}$ over the extended space $\mathbb{R}^{d_\theta} \times \mathcal{P}(\mathbb{R}^{d_x})$ (where $\mathcal{P}(\mathbb{R}^{d_x})$ denotes the space of probability distributions over $\mathbb{R}^{d_x}$) coincides with the negative of the log marginal likelihood's supremum:
$$\mathcal{E}_* := \inf_{(\theta, q)} \mathcal{E}(\theta, q) = -\sup_{\theta \in \mathbb{R}^{d_\theta}} \log p_\theta(y). \tag{2}$$
To see this, note that
$$\mathcal{E}(\theta, q) = -\log p_\theta(y) + \mathrm{KL}(q, p_\theta(\cdot\,|\,y)), \tag{3}$$
where KL denotes the Kullback–Leibler divergence and $p_\theta(\cdot\,|\,y) := p_\theta(y, \cdot)/p_\theta(y)$ the posterior distribution. So, if $(\theta_*, q_*)$ minimizes the free energy, then $\theta_*$ maximizes the marginal likelihood and $q_*(\cdot) = p_{\theta_*}(\cdot\,|\,y)$. In other words, by minimizing the free energy, we solve our MLE problem.

This perspective motivates the search for practical procedures that minimize the free energy $\mathcal{E}$. One such example is the classical Expectation Maximization (EM) algorithm, applicable to models for which the posterior distributions $p_\theta(\cdot\,|\,y)$ are available in closed form. As shown in Neal and Hinton (1998), its iterates coincide with those of coordinate descent applied to the free energy $\mathcal{E}$. Recently, Kuntz et al. (2023) sought analogues of gradient descent (GD) applicable to the free energy $\mathcal{E}$. In particular, building on ideas popular in optimal transport (e.g., see Ambrosio et al. (2005)), they identified an appropriate notion of gradient $\nabla\mathcal{E}$ for the free energy, discretized the corresponding gradient flow $(\dot\theta_t, \dot q_t) = -\nabla\mathcal{E}(\theta_t, q_t)$, and obtained a practical algorithm they called particle gradient descent (PGD). In optimization, GD is well known to be suboptimal: other practical first-order methods achieve better worst-case convergence guarantees and practical performance (Nemirovskij and Yudin, 1983; Nesterov, 1983).
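Before moving on, a concrete sanity check of the decomposition (3) may help: for a conjugate Gaussian model $p_\theta(y, x) = N(y\,|\,x, 1)\,N(x\,|\,\theta, 1)$, every term is available in closed form. The sketch below (the model, names, and values are illustrative choices, not from the paper) evaluates both sides of (3) and confirms they agree.

```python
import math

# Hedged sketch: verify E(theta, q) = -log p_theta(y) + KL(q, p_theta(.|y)) (Eq. (3))
# for the conjugate model p_theta(y, x) = N(y | x, 1) N(x | theta, 1), with q Gaussian.

def free_energy(theta, y, mu_q, var_q):
    # E(theta, q) = E_q[log q(x)] - E_q[log p_theta(y, x)] for q = N(mu_q, var_q)
    e_log_q = -0.5 * math.log(2 * math.pi * var_q) - 0.5
    e_log_joint = (-math.log(2 * math.pi)
                   - 0.5 * ((y - mu_q) ** 2 + var_q)     # E_q[(y - x)^2] term
                   - 0.5 * ((mu_q - theta) ** 2 + var_q))  # E_q[(x - theta)^2] term
    return e_log_q - e_log_joint

def neg_log_marginal(theta, y):
    # under this model, p_theta(y) = N(y | theta, 2)
    return 0.5 * math.log(2 * math.pi * 2.0) + (y - theta) ** 2 / 4.0

def kl_gauss(m1, v1, m2, v2):
    # KL(N(m1, v1), N(m2, v2)) in closed form
    return 0.5 * (math.log(v2 / v1) + (v1 + (m1 - m2) ** 2) / v2 - 1.0)

theta, y, mu_q, var_q = 0.3, 1.2, -0.5, 0.7
post_mean, post_var = (y + theta) / 2.0, 0.5          # conjugate posterior p_theta(.|y)
lhs = free_energy(theta, y, mu_q, var_q)
rhs = neg_log_marginal(theta, y) + kl_gauss(mu_q, var_q, post_mean, post_var)
assert abs(lhs - rhs) < 1e-9
```

Since the identity is exact, the two sides agree to floating-point precision for any choice of $(\theta, y, \mu_q, \sigma_q^2)$.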
A common feature among algorithms that achieve the optimal accelerated convergence rates is the presence of momentum effects in the dynamics of the algorithm. Roughly speaking, if GD for a function $f$ is analogous to solving the ordinary differential equation (ODE)
$$\dot\theta_t = -\nabla_\theta f(\theta_t) \tag{4}$$
with $\nabla_\theta f$ denoting $f$'s Euclidean gradient, then a momentum method is akin to solving a second-order ODE like
$$\dot\theta_t = \eta m_t, \qquad \dot m_t = -\gamma\eta m_t - \nabla_\theta f(\theta_t), \tag{5}$$
where $\gamma, \eta$ denote positive hyperparameters (Su et al., 2014; Wibisono et al., 2016). In the context of machine learning, Sutskever et al. (2013) demonstrated empirically that incorporating momentum effects into stochastic gradient descent (SGD) often has substantial benefits, such as eliminating the performance gap between standard SGD and competing Hessian-free deterministic methods based on higher-order differential information (Martens, 2010). Given the successes of momentum methods, it is natural to ask whether PGD can be similarly accelerated. In this work, we answer this question affirmatively. Our contributions are: (1) we construct a continuous-time flow that incorporates momentum effects into the gradient flow studied in Kuntz et al. (2023); (2) we derive a discretization that outperforms PGD and compares competitively against other methods. As theoretical contributions: (3) we prove in Proposition 3.1 the existence and uniqueness of the flow's solutions; (4) under suitable conditions, we show in Theorem 4.1 that the flow converges to $\mathcal{E}$'s minima; and (5) we asymptotically justify in Proposition 5.1 the particle approximations we use when discretizing the flow. The proofs of Propositions 3.1 and 5.1 require generalizations of arguments pertaining to McKean–Vlasov SDEs, and the proof of Theorem 4.1 generalizes Ma et al. (2021)'s convergence argument. These may be of independent interest for future works that operate in the extended space $\mathbb{R}^{d_\theta} \times \mathcal{P}(\mathbb{R}^{d_x})$.
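To make the contrast between (4) and (5) concrete, here is a minimal forward-Euler simulation of both ODEs on a one-dimensional quadratic; the step size, hyperparameters, and objective are ad-hoc illustrative choices, not values from the paper.

```python
# Illustrative only: forward-Euler discretizations of the gradient flow (4) and the
# momentum ODE (5) for f(x) = 0.5 * kappa * x^2. Hyperparameters are ad-hoc choices.

def grad_f(x, kappa=10.0):
    return kappa * x

def gd(x0, h=0.01, steps=2000):
    # Euler discretization of (4): x_{k+1} = x_k - h * f'(x_k), i.e., plain GD
    x = x0
    for _ in range(steps):
        x -= h * grad_f(x)
    return x

def momentum(x0, h=0.01, gamma=2.0, eta=1.0, steps=2000):
    # Euler discretization of (5): position driven by momentum, momentum damped
    x, m = x0, 0.0
    for _ in range(steps):
        x += h * eta * m                            # dx/dt =  eta * m
        m += h * (-gamma * eta * m - grad_f(x))     # dm/dt = -gamma*eta*m - f'(x)
    return x

assert abs(gd(5.0)) < 1e-6
assert abs(momentum(5.0)) < 1e-6
```

For this quadratic, both iterations drive $x$ to the minimizer $0$; for ill-conditioned objectives and well-tuned $(\gamma, \eta)$, the momentum dynamics are the ones that admit accelerated rates.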
The structure of this manuscript is as follows: in Section 2, we review gradient flows on the Euclidean space $\mathbb{R}^{d_\theta}$, the probability space $\mathcal{P}(\mathbb{R}^{d_x})$, and the extended space $\mathbb{R}^{d_\theta} \times \mathcal{P}(\mathbb{R}^{d_x})$. We also review momentum methods on $\mathbb{R}^{d_\theta}$ and $\mathcal{P}(\mathbb{R}^{d_x})$. In Section 3, we describe how momentum can be incorporated into the gradient flow on $\mathbb{R}^{d_\theta} \times \mathcal{P}(\mathbb{R}^{d_x})$, resulting in a dynamical system we refer to as the "momentum-enriched flow". In Section 4, we establish the flow's convergence. In Section 5, we discretize the flow and obtain an MLE algorithm we refer to as Momentum Particle Descent (MPD). In Section 6, we demonstrate empirically the efficacy of our method and study the effects of various design choices. As a large-scale experiment, we benchmark our proposed method for training variational autoencoders (Kingma and Welling, 2014) against current methods for training latent variable models. We conclude in Section 7 with a discussion of our results, their limitations, and future work.

## 2. Background

Our goal is to incorporate momentum into the gradient flow on the extended space $\mathbb{R}^{d_\theta} \times \mathcal{P}(\mathbb{R}^{d_x})$ considered in Kuntz et al. (2023) and obtain an algorithm for type II MLE that outperforms PGD. We begin by surveying the standard gradient flows on the component spaces of $\mathbb{R}^{d_\theta} \times \mathcal{P}(\mathbb{R}^{d_x})$ and how momentum is incorporated in those settings.

### 2.1. Gradient Flows on $\mathbb{R}^d$ and their Acceleration

Given a differentiable function $f : \mathbb{R}^d \to \mathbb{R}$, $f$'s Euclidean gradient flow is (4). Discretizing (4) in time using a standard forward Euler integrator yields GD. As observed in Su et al. (2014) and Shi et al. (2021), algorithms such as Nesterov (1983)'s accelerated gradient (NAG) that achieve the optimal convergence rates are instead obtained by discretizing the second-order ODE (5); cf. Appendix I.2. This ODE models the evolution of a kinetic particle with both position and momentum. In particular, for convex quadratic $f$, (5) reproduces the dynamics of a damped harmonic oscillator (McCall, 2010).
Hence, the ODE's hyperparameters have intuitive interpretations: $\eta^{-1}$ represents the particle's mass and $\gamma$ determines the damping's strength. We can rewrite (5) concisely as
$$(\dot\theta_t, \dot m_t) = -D_\gamma\,\nabla_{(\theta,m)}f_\eta(\theta_t, m_t), \tag{6}$$
where $f_\eta$ denotes the "Hamiltonian function" and $D_\gamma$ the "damping matrix":
$$f_\eta(\theta, m) = f(\theta) + \frac{\eta}{2}\|m\|^2, \qquad D_\gamma := \begin{pmatrix}0_d & -I_d \\ I_d & \gamma I_d\end{pmatrix}. \tag{7}$$
In other words, (5) is a damped, or conformal, Hamiltonian flow (McLachlan and Perlmutter, 2001; Maddison et al., 2018). In particular, we can interpret NAG as a discretization of $f_\eta$'s $\gamma$-damped Hamiltonian flow (Wibisono et al., 2016; Wilson et al., 2021).

### 2.2. Gradient Flows on $\mathcal{P}(\mathbb{R}^d)$ and their Acceleration

We can follow a similar program to solve optimization problems over $\mathcal{P}(\mathbb{R}^d)$. To do so, we require an analogue, applicable to (sufficiently regular) functionals $F$ on $\mathcal{P}(\mathbb{R}^d)$, of the Euclidean gradient $\nabla_\theta f$ for (differentiable) functions $f$ on $\mathbb{R}^d$. One way of defining $\nabla_\theta f(\theta)$ at a given point $\theta$ is as the unique vector satisfying
$$f(\theta') = f(\theta) + \langle\nabla_\theta f(\theta),\, \theta' - \theta\rangle + o(d_E(\theta, \theta')), \tag{8}$$
where $\langle\cdot,\cdot\rangle$ and $d_E$ denote the Euclidean inner product and metric, related by the formula
$$d_E(\theta, \theta')^2 = \inf_\lambda \int_0^1 \langle\dot\lambda_t, \dot\lambda_t\rangle\,\mathrm{d}t, \tag{9}$$
where the infimum is taken over all (sufficiently regular) curves $\lambda : [0, 1] \to \mathbb{R}^d$ connecting $\theta$ and $\theta'$. Roughly speaking, we can reverse this process to define $F$'s gradient for a given metric on $\mathcal{P}(\mathbb{R}^d)$. Here, we focus on the Wasserstein-2 metric $W_2$, for which (9)'s analogue is the Benamou–Brenier formula (Peyré and Cuturi, 2019, Eq. 7.2):
$$W_2(q, q')^2 = \inf_\rho \int_0^1 g_{\rho_t}(\dot\rho_t, \dot\rho_t)\,\mathrm{d}t,$$
where the infimum is taken over all (sufficiently regular) curves $\rho : [0, 1] \to \mathcal{P}(\mathbb{R}^d)$ connecting $q$ and $q'$, and
$$g_q(m, m') := \int \langle v(x), v'(x)\rangle\,q(x)\,\mathrm{d}x,$$
where $v$ and $v'$ solve the respective continuity equations, i.e., $m = -\nabla_x\cdot(qv)$ and $m' = -\nabla_x\cdot(qv')$.
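For intuition about $W_2$, the one-dimensional case admits a simple closed form: there $W_2(q, q')^2 = \int_0^1 (F_q^{-1}(s) - F_{q'}^{-1}(s))^2\,\mathrm{d}s$ (the quantile coupling), and for Gaussians it reduces to $(\mu - \mu')^2 + (\sigma - \sigma')^2$. The sketch below, an illustration of this standard fact rather than anything from the paper, compares a quantile-grid approximation against the closed form; the grid size is arbitrary.

```python
from statistics import NormalDist

# 1-D sanity check of W2: quantile-coupling integral vs. the Gaussian closed form.
# Midpoint rule over the quantile variable s; n is an arbitrary resolution choice.

def w2_sq_quantile(d1, d2, n=20000):
    total = 0.0
    for i in range(n):
        s = (i + 0.5) / n
        total += (d1.inv_cdf(s) - d2.inv_cdf(s)) ** 2
    return total / n

p, q = NormalDist(0.0, 1.0), NormalDist(2.0, 0.5)
closed_form = (0.0 - 2.0) ** 2 + (1.0 - 0.5) ** 2   # (mu - mu')^2 + (sigma - sigma')^2
assert abs(w2_sq_quantile(p, q) - closed_form) < 1e-2
```

In higher dimensions no such quantile formula exists, which is precisely why the dynamic Benamou–Brenier characterization above is the useful one for defining gradient flows.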
Then, replacing $(\theta, \theta', f, \nabla_\theta f, \langle\cdot,\cdot\rangle, d_E)$ with $(q, q', F, \nabla_q F, g_q(\cdot,\cdot), W_2)$ in (8) and solving for $\nabla_q F$, we obtain
$$\nabla_q F(q) = -\nabla_x\cdot(q\,\nabla_x\delta_q F[q]),$$
where $\delta_q F[q] : \mathbb{R}^d \to \mathbb{R}$ denotes the functional's first variation: the function satisfying
$$F(q + \varepsilon\chi) = F(q) + \varepsilon\int \delta_q F[q](x)\,\chi(x)\,\mathrm{d}x + o(\varepsilon)$$
for all (sufficiently regular) $\chi : \mathbb{R}^d \to \mathbb{R}$ such that $\int \chi(x)\,\mathrm{d}x = 0$. This is, of course, not the only choice; other metrics induce other geometries and lead to other algorithms (e.g., see Duncan et al. (2023); Liu (2017); Sharrock and Nemeth (2023) for the Stein case). For a more rigorous presentation of these concepts, see Appendix C.1 and references therein. We can now define $F$'s Wasserstein gradient flow just as in the Euclidean case (except we have a PDE rather than an ODE):
$$\dot q_t = -\nabla_q F(q_t).$$
A well-known example of such a flow is that for which $F(\cdot) := \mathrm{KL}(\cdot, p)$ for a fixed $p$ in $\mathcal{P}(\mathbb{R}^d)$. In this case, $\delta_q F[q] = \log(q/p)$ and the flow reads
$$\dot q_t = \nabla_x\cdot(q_t\,\nabla_x\log(q_t/p)).$$
As noted in Jordan et al. (1998), this is the Fokker–Planck equation satisfied by the law of the overdamped Langevin SDE with invariant measure $p$:
$$\mathrm{d}X_t = \nabla_x\ell(X_t)\,\mathrm{d}t + \sqrt{2}\,\mathrm{d}W_t,$$
where $\ell(x) := \log p(x)$ and $(W_t)_{t\ge0}$ denotes a standard Wiener process. For this reason, papers such as Wibisono (2018); Ma et al. (2021) view the overdamped Langevin algorithm obtained by discretizing this SDE using the Euler–Maruyama scheme as an analogue of GD for $F$. Ma et al. (2021) go a step further and search for accelerated methods for minimizing $F$. To do so, they consider distributions over the momentum-enriched space $\mathbb{R}^{2d}$ rather than $\mathbb{R}^d$, and they replace $F$ on $\mathcal{P}(\mathbb{R}^d)$ with the following Hamiltonian on $\mathcal{P}(\mathbb{R}^d \times \mathbb{R}^d)$:
$$F_\eta(\cdot) := \mathrm{KL}(\cdot,\, p(\mathrm{d}x)\,r_\eta(\mathrm{d}u)), \tag{10}$$
where $\eta > 0$ denotes a fixed hyperparameter, $r_\eta := N(0, \eta^{-1}I_d)$ a zero-mean isotropic Gaussian distribution with variance $\eta^{-1}$ over the momentum variables $u$, and $p \otimes r_\eta$ the product measure. They then define the following $\mathcal{P}(\mathbb{R}^{2d})$-valued analogue of the $\gamma$-damped Hamiltonian flow (6):
$$\dot q_t = -D_\gamma\nabla_q F_\eta(q_t) = \nabla_{(x,u)}\cdot\left(q_t\,D_\gamma\nabla_{(x,u)}\delta_q F_\eta[q_t]\right). \tag{11}$$
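A minimal sketch of the overdamped Langevin algorithm just mentioned, i.e., the Euler–Maruyama discretization $X_{k+1} = X_k + h\,\nabla_x\ell(X_k) + \sqrt{2h}\,\xi_k$, targeting $p = N(0, 1)$; the step size, chain length, and seed are arbitrary illustrative choices.

```python
import math, random

# Overdamped Langevin algorithm (ULA) sketch targeting p = N(0, 1), for which
# grad log p(x) = -x. With step size h the chain is biased, so the stationary
# variance slightly exceeds 1; loose tolerances below account for this.

random.seed(0)

def grad_log_p(x):
    return -x

h, n_steps, burn_in = 0.05, 60000, 2000
x, samples = 3.0, []
for k in range(n_steps):
    x = x + h * grad_log_p(x) + math.sqrt(2 * h) * random.gauss(0.0, 1.0)
    if k >= burn_in:
        samples.append(x)

mean = sum(samples) / len(samples)
var = sum((s - mean) ** 2 for s in samples) / len(samples)
assert abs(mean) < 0.12 and abs(var - 1.0) < 0.15
```

The $O(h)$ bias visible here is exactly the discretization error that the flow-level analysis above abstracts away; it vanishes as $h \to 0$.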
This is the Fokker–Planck equation satisfied by the law of the underdamped Langevin SDE:
$$\mathrm{d}X_t = \eta U_t\,\mathrm{d}t, \qquad \mathrm{d}U_t = [\nabla_x\ell(X_t) - \gamma\eta U_t]\,\mathrm{d}t + \sqrt{2\gamma}\,\mathrm{d}W_t. \tag{12}$$
Discretizing the SDE, Ma et al. (2021) recover the underdamped Langevin algorithm (e.g., see Cheng et al. (2018)) and show that, under suitable conditions, it achieves faster convergence rates than the overdamped Langevin algorithm.

### 2.3. Gradient Flows on $\mathbb{R}^{d_\theta} \times \mathcal{P}(\mathbb{R}^{d_x})$

To minimize a functional $\mathcal{E}$ on $\mathbb{R}^{d_\theta} \times \mathcal{P}(\mathbb{R}^{d_x})$, we can follow steps analogous to those in the previous two sections. First, given a metric $d$ on this space, we can define a gradient $\nabla\mathcal{E}$ w.r.t. $d$ as at the beginning of Section 2.2; cf. Kuntz et al. (2023, Appendix A). Setting $d((\theta, q), (\theta', q')) := \sqrt{d_E(\theta, \theta')^2 + W_2(q, q')^2}$, one finds that $\nabla\mathcal{E} = (\nabla_\theta\mathcal{E}, \nabla_q\mathcal{E})$, and the corresponding flow reads
$$\dot\theta_t = -\nabla_\theta\mathcal{E}(\theta_t, q_t), \qquad \dot q_t = -\nabla_q\mathcal{E}(\theta_t, q_t).$$
Kuntz et al. (2023) were interested in MLE for latent variable models and set $\mathcal{E}$ to be the free energy in (1). In this case, the above reduces to
$$\dot\theta_t = \int \nabla_\theta\ell(\theta_t, x)\,q_t(\mathrm{d}x), \qquad \dot q_t = \nabla_x\cdot\left(q_t\,\nabla_x\log\frac{q_t}{\rho_{\theta_t}}\right),$$
where $\rho_\theta(\cdot) := p_\theta(y, \cdot)$. The above is the Fokker–Planck equation of the following McKean–Vlasov SDE:
$$\mathrm{d}\theta_t = \int \nabla_\theta\ell(\theta_t, x)\,q_t(\mathrm{d}x)\,\mathrm{d}t, \qquad \mathrm{d}X_t = \nabla_x\ell(\theta_t, X_t)\,\mathrm{d}t + \sqrt{2}\,\mathrm{d}W_t,$$
where $q_t := \mathrm{Law}(X_t)$ and $\ell(\theta, x) := \log\rho_\theta(x)$. Because $\theta_t$'s drift depends on $X_t$'s law, this is a McKean–Vlasov process (Kac, 1956; McKean Jr, 1966). Approximating the SDE as described in Appendix I.1, we obtain PGD.

## 3. Momentum-Enriched Flow for MLE

Following an approach analogous to those previously taken to obtain accelerated algorithms for minimizing functionals on $\mathbb{R}^{d_\theta}$ and $\mathcal{P}(\mathbb{R}^{d_x})$ (cf. Sections 2.1 and 2.2, respectively), we now derive a momentum-enriched flow that minimizes the free energy functional $\mathcal{E}$ over $\mathbb{R}^{d_\theta} \times \mathcal{P}(\mathbb{R}^{d_x})$ defined in (1). Approximating this flow, we obtain an algorithm that experimentally outperforms PGD (Section 2.3).
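A rough sketch of a PGD-style scheme on a conjugate toy model, $p_\theta(y, x) = N(y\,|\,x, 1)\,N(x\,|\,\theta, 1)$ (an illustrative stand-in, not the paper's exact algorithm): the $\theta$-drift averages $\nabla_\theta\ell(\theta, x) = x - \theta$ over the particle cloud, while each particle takes an unadjusted Langevin step. All hyperparameters are ad-hoc choices.

```python
import math, random

# PGD-style particle approximation of the McKean-Vlasov SDE above (illustrative).
# Model: p_theta(y, x) = N(y | x, 1) N(x | theta, 1), so
#   grad_theta l = x - theta,   grad_x l = (y - x) + (theta - x).

random.seed(1)
y, h, M, n_steps = 2.0, 0.05, 300, 2000
theta = -1.0
xs = [random.gauss(0.0, 1.0) for _ in range(M)]   # initial particle cloud

for _ in range(n_steps):
    theta += h * sum(x - theta for x in xs) / M   # theta-drift: cloud average
    xs = [x + h * ((y - x) + (theta - x)) + math.sqrt(2 * h) * random.gauss(0.0, 1.0)
          for x in xs]                            # one ULA step per particle

# The marginal likelihood is p_theta(y) = N(y | theta, 2), maximized at theta = y.
assert abs(theta - y) < 0.2
```

The fixed point couples the two components: $\theta$ settles where it equals the cloud mean, and the cloud settles around the posterior mean $(y + \theta)/2$, which jointly forces $\theta \approx y$, the MLE.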
We begin by enriching both components of the space with momentum variables: we replace $\theta$ and $x$ in $\mathbb{R}^{d_\theta}$ and $\mathbb{R}^{d_x}$ with $(\theta, m)$ and $(x, u)$ in $\mathbb{R}^{2d_\theta}$ and $\mathbb{R}^{2d_x}$, and we correspondingly substitute $\mathcal{P}(\mathbb{R}^{2d_x})$ for $\mathcal{P}(\mathbb{R}^{d_x})$. Next, we define a Hamiltonian on the momentum-enriched space $\mathbb{R}^{2d_\theta} \times \mathcal{P}(\mathbb{R}^{2d_x})$ analogous to $f_\eta$ and $F_\eta$ in (7, 10):
$$\mathcal{F}(\theta, m, q) := \int \log\left(\frac{q(x, u)}{\rho_{\theta,\eta_x}(x, u)}\right) q(x, u)\,\mathrm{d}x\,\mathrm{d}u + \frac{\eta_\theta}{2}\|m\|^2, \tag{13}$$
where $(\eta_\theta, \eta_x)$ denote positive hyperparameters and $\rho_{\theta,\eta_x}(x, u) := p_\theta(y, x)\,r_{\eta_x}(u)$ with $r_{\eta_x} := N(0, \eta_x^{-1}I_{d_x})$. Consider now the following $(\gamma_\theta, \gamma_x)$-damped Hamiltonian flow of $\mathcal{F}$, analogous to (6, 11):
$$(\dot\theta_t, \dot m_t) = -D_{\gamma_\theta}\nabla_{(\theta,m)}\mathcal{F}(\theta_t, m_t, q_t), \tag{14a}$$
$$\dot q_t = \nabla_{(x,u)}\cdot\left(q_t\,D_{\gamma_x}\nabla_{(x,u)}\delta_q\mathcal{F}[\theta_t, m_t, q_t]\right). \tag{14b}$$
This is the Fokker–Planck equation for the following McKean–Vlasov SDE:
$$\mathrm{d}\theta_t = \eta_\theta m_t\,\mathrm{d}t, \tag{15a}$$
$$\mathrm{d}m_t = \left[\int\nabla_\theta\ell(\theta_t, x)\,q_{t,X}(\mathrm{d}x) - \gamma_\theta\eta_\theta m_t\right]\mathrm{d}t, \tag{15b}$$
$$\mathrm{d}X_t = \eta_x U_t\,\mathrm{d}t, \tag{15c}$$
$$\mathrm{d}U_t = [\nabla_x\ell(\theta_t, X_t) - \gamma_x\eta_x U_t]\,\mathrm{d}t + \sqrt{2\gamma_x}\,\mathrm{d}W_t, \tag{15d}$$
where $q_{t,X} := \mathrm{Law}(X_t)$. Assuming that $\ell$ has a Lipschitz gradient, the SDE has a unique strong solution:

**Assumption 1** ($K$-Lipschitz gradient). We assume that the potential $\ell$ has a $K$-Lipschitz gradient, i.e., there is some $K > 0$ such that
$$\|\nabla_{(\theta,x)}\ell(\theta, x) - \nabla_{(\theta,x)}\ell(\theta', x')\| \le K\,\|(\theta, x) - (\theta', x')\|$$
for all $(\theta, x)$ and $(\theta', x')$ in $\mathbb{R}^{d_\theta} \times \mathbb{R}^{d_x}$.

**Proposition 3.1** (Existence and uniqueness of strong solutions to (15)). Under Assumption 1, there exists a unique strong solution to (15) for any initial condition $(\theta_0, m_0, q_0)$ in $\mathbb{R}^{2d_\theta} \times \mathcal{P}(\mathbb{R}^{2d_x})$.

*Proof.* The proof generalizes that of Carmona (2016, Theorem 1.7); see Appendix G.

## 4. Convergence of Momentum-Enriched Flow

To evidence that our approach is theoretically well-founded, we wish to show that the momentum-enriched flow (14) solves the maximum likelihood problem. We do this by showing that the log marginal likelihood evaluated along $(\theta_t)_{t\ge0}$ (of the momentum-enriched flow (14)) converges exponentially fast to its supremum.
Since directly working with the marginal likelihood is difficult, we exploit the following inequality: for all $m \in \mathbb{R}^{d_\theta}$ and $q \in \mathcal{P}(\mathbb{R}^{2d_x})$,
$$-\log p_\theta(y) \le \mathcal{E}(\theta, q_X) \le \mathcal{F}(\theta, m, q), \tag{16}$$
where $q_X$ denotes $q$'s $x$-marginal. From (16) and the fact that $\mathcal{E}$ and $\mathcal{F}$ share the same infimum $\mathcal{E}_*$, it follows that if $\mathcal{F}$ in (13) decays exponentially fast to its infimum along the solutions $(\theta_t, m_t, q_t)_{t\ge0}$ of (14), then we obtain the desired convergence of the log marginal likelihood (as summarized in (16)). Hence, we dedicate the remainder of this section to establishing that $\mathcal{F}$ converges exponentially along the momentum-enriched flow.

Because the momentum-enriched flow (14) is not a gradient flow, we are unable to prove convergence using the techniques commonly applied to study gradient flows; e.g., see Otto and Villani (2000). Instead, we resort to a more involved argument featuring a Lyapunov function $\mathcal{L}$ along the lines of those in Wilson et al. (2021); Wibisono et al. (2016); Ma et al. (2021). Two immediate choices of Lyapunov function are available: $\mathcal{E}$ and $\mathcal{F}$. However, unlike in the gradient case, $\mathcal{E}$ along the momentum-enriched flow (15) is not monotonic and can fluctuate. As for $\mathcal{F}$, we show in Proposition E.1 that, writing $\mathcal{F}_t := \mathcal{F}(\theta_t, m_t, q_t)$, along the flow
$$\dot{\mathcal{F}}_t = -\gamma_\theta\,\|\nabla_m\mathcal{F}_t\|^2 - \gamma_x\,\|\nabla_u\delta_q\mathcal{F}_t\|^2 \le 0, \tag{17}$$
implying that $\mathcal{F}_t$ is non-increasing. This shows that the flow is in some sense stable, but it is not enough to prove convergence: the two gradients could vanish without the flow necessarily having reached a minimizer. To preclude this, we need to force the $(\theta, m)$-gradients to feature in (17) similarly to the $(x, u)$-gradients. To achieve this, we follow Wilson et al. (2021); Ma et al.
(2021) and introduce the following twisted gradient terms:
$$\|\nabla_{(\theta,m)}\mathcal{F}\|^2_{T_{(\theta,m)}} := \langle\nabla_{(\theta,m)}\mathcal{F},\, T_{(\theta,m)}\nabla_{(\theta,m)}\mathcal{F}\rangle, \qquad \|\nabla_{(x,u)}\delta_q\mathcal{F}\|^2_{T_{(x,u)}} := \langle\nabla_{(x,u)}\delta_q\mathcal{F},\, T_{(x,u)}\nabla_{(x,u)}\delta_q\mathcal{F}\rangle,$$
$$T_{(\theta,m)} = K^{-1}\begin{pmatrix}\tau_\theta I_{d_\theta} & \tfrac{\tau_{\theta m}}{2} I_{d_\theta} \\ \tfrac{\tau_{\theta m}}{2} I_{d_\theta} & \tau_m I_{d_\theta}\end{pmatrix}, \qquad T_{(x,u)} = K^{-1}\begin{pmatrix}\tau_x I_{d_x} & \tfrac{\tau_{xu}}{2} I_{d_x} \\ \tfrac{\tau_{xu}}{2} I_{d_x} & \tau_u I_{d_x}\end{pmatrix},$$
with $K$ denoting the Lipschitz constant in Assumption 1 and $(\tau_\theta, \tau_{\theta m}, \tau_m, \tau_x, \tau_{xu}, \tau_u)$ non-negative constants such that the above matrices are positive semidefinite. We then add the above terms to $\mathcal{F} - \mathcal{E}_*$ to obtain our Lyapunov function:
$$\mathcal{L} := \mathcal{F} - \mathcal{E}_* + \|\nabla_{(\theta,m)}\mathcal{F}\|^2_{T_{(\theta,m)}} + \|\nabla_{(x,u)}\delta_q\mathcal{F}\|^2_{T_{(x,u)}}.$$
We prove convergence for models satisfying a log-Sobolev inequality (see Caprio et al. (2024) for a detailed discussion). It extends the log-Sobolev inequality popular in optimal transport and the Polyak–Łojasiewicz inequality often used to argue linear convergence of GD algorithms.

**Assumption 2** (Log-Sobolev inequality). There exists a constant $C_{\mathcal{E}} > 0$ such that, for all $(\theta, q)$ in $\mathbb{R}^{d_\theta} \times \mathcal{P}(\mathbb{R}^{d_x})$,
$$\mathcal{E}(\theta, q) - \mathcal{E}_* \le \frac{1}{2C_{\mathcal{E}}}\,\|\nabla\mathcal{E}(\theta, q)\|_q^2, \quad \text{where } \|\nabla\mathcal{E}(\theta, q)\|_q^2 := \|\nabla_\theta\mathcal{E}(\theta, q)\|^2 + \|\nabla_x\log(q/\rho_\theta)\|_q^2,$$
with $\|\cdot\|_q$ denoting the $L^2_q$ norm.

**Theorem 4.1** (Exponential convergence of the momentum-enriched flow). Suppose that Assumptions 1 and 2 hold and there exists $(\varphi, \tau_\theta, \tau_{\theta m}, \tau_m, \tau_x, \tau_{xu}, \tau_u)$ satisfying the inequalities (51) in the supplement. If $\mathcal{L}_t := \mathcal{L}(\theta_t, m_t, q_t)$, then
$$\mathcal{L}_t \le \mathcal{L}_0\exp(-\varphi C t), \tag{18}$$
where $C = \min\{C_{\mathcal{E}}, \eta_\theta, \eta_x\}$. Moreover, with $\mathcal{E}_t := \mathcal{E}(\theta_t, q_{t,X})$,
$$\sup_{\theta\in\mathbb{R}^{d_\theta}}\log p_\theta(y) - \log p_{\theta_t}(y) \le \mathcal{E}_t - \mathcal{E}_* \le \mathcal{F}_t - \mathcal{E}_* \le \mathcal{L}_t \le \mathcal{L}_0\exp(-\varphi C t). \tag{19}$$

*Proof.* The proof extends that of Ma et al. (2021, Proposition 1). It requires generalizing from $\mathcal{P}(\mathbb{R}^{d_x})$ to the product space and controlling the terms that arise from the interaction between the two spaces; cf. Appendix F.

It is not difficult to find hyperparameter choices for which the inequalities in (51) have solutions; cf. Appendix F.3 for some concrete examples.

## 5. Momentum Particle Descent

In order to realize an actionable algorithm, we need to discretize the dynamics of the momentum-enriched flow (15) both in space and in time.
In this section, we derive the resulting algorithm, called Momentum Particle Descent (MPD), via discretization. For the space discretization, following the approach of Kuntz et al. (2023), we approximate $q_{t,X} \approx \frac{1}{M}\sum_{i\in[M]}\delta_{X^{i,M}_t} =: q^M_{t,X}$ using a collection of particles. This yields the following system: for $i \in [M]$,
$$\mathrm{d}\theta^M_t = \eta_\theta m^M_t\,\mathrm{d}t, \tag{20a}$$
$$\mathrm{d}m^M_t = -\gamma_\theta\eta_\theta m^M_t\,\mathrm{d}t - \nabla_\theta\mathcal{E}(\theta^M_t, q^M_{t,X})\,\mathrm{d}t, \tag{20b}$$
$$\mathrm{d}X^{i,M}_t = \eta_x U^{i,M}_t\,\mathrm{d}t, \tag{20c}$$
$$\mathrm{d}U^{i,M}_t = -\gamma_x\eta_x U^{i,M}_t\,\mathrm{d}t + \nabla_x\ell(\theta^M_t, X^{i,M}_t)\,\mathrm{d}t + \sqrt{2\gamma_x}\,\mathrm{d}W^i_t, \tag{20d}$$
where $\{W^i_t\}_{i=1}^M$ comprises $M$ independent standard Wiener processes. Proposition 5.1 establishes asymptotic pointwise propagation of chaos, which provides asymptotic justification for using a particle approximation of $q_{t,X}$ to approximate the flow in (15). The proof can be found in Appendix H.

**Proposition 5.1.** Under Assumption 1, we have
$$\lim_{M\to\infty} \mathbb{E}\left[\sup_{t\in[0,T]}\left\{\|\vartheta_t - \vartheta^M_t\|^2 + W_2^2(q_t^{\otimes M}, q^M_t)\right\}\right] = 0,$$
where $\vartheta_t := (\theta_t, m_t)$ and $q_t^{\otimes M} := \prod_{i=1}^M q_t$ with $q_t = \mathrm{Law}(X_t, U_t)$ are defined by the SDE in (15), and $\vartheta^M_t := (\theta^M_t, m^M_t)$ and $q^M_t := \mathrm{Law}(\{(X^{i,M}_t, U^{i,M}_t)\}_{i=1}^M)$ by (20).

As for the time discretization, as noted by Ma et al. (2021), a naive application of an Euler–Maruyama scheme to momentum-enriched dynamics may be insufficient for attaining an accelerated rate of convergence. We appeal to the literature on the discretization of underdamped Langevin dynamics to design an appropriate integrator. One integrator that can achieve accelerated rates is the (Euler) exponential integrator (Cheng et al., 2018; Sanz-Serna and Zygalakis, 2021; Hochbruck and Ostermann, 2010). The main strategy is to freeze the nonlinear components of the dynamics in a suitable way and then solve the resulting linear SDE. We also incorporate a partial update inspired by Sutskever et al. (2013)'s interpretation of NAG.
More precisely, given $\theta_0, m_0, \{X^i_0, U^i_0\}_{i=1}^M$, define an approximating SDE on the time interval $t \in [0, h]$ as follows: for $i \in [M]$,
$$\mathrm{d}\tilde\theta_t = \eta_\theta \tilde m_t\,\mathrm{d}t, \tag{21a}$$
$$\mathrm{d}\tilde m_t = -\gamma_\theta\eta_\theta \tilde m_t\,\mathrm{d}t - \nabla_\theta\mathcal{E}(\bar\theta_0, \tilde q^M_{0,X})\,\mathrm{d}t, \tag{21b}$$
$$\mathrm{d}\tilde X^i_t = \eta_x \tilde U^i_t\,\mathrm{d}t, \tag{21c}$$
$$\mathrm{d}\tilde U^i_t = -\gamma_x\eta_x \tilde U^i_t\,\mathrm{d}t + \nabla_x\ell(\tilde\theta_0, \tilde X^i_0)\,\mathrm{d}t + \sqrt{2\gamma_x}\,\mathrm{d}W^i_t, \tag{21d}$$
where $\tilde q^M_{t,X} := \frac{1}{M}\sum_{i=1}^M \delta_{\tilde X^i_t}$, and the frozen variables $\bar\theta_0$ and $\tilde\theta_0$ (for reasons which will become clear once the SDE is solved) can be thought of as partial and full updates to $\theta_t$. This is a linear SDE, which can be solved analytically, yielding the following iteration (see Appendix I.4 for details): for $i \in [M]$, we have
$$\theta_k = \bar\theta_{k-1} - \gamma_\theta^{-1}\left(h - \frac{\iota_\theta(h)}{\gamma_\theta\eta_\theta}\right)\nabla_\theta\mathcal{E}\big(\bar\theta_{k-1}, q^M_{k-1,X}\big),$$
$$m_k = (1 - \iota_\theta(h))\,m_{k-1} - \frac{\iota_\theta(h)}{\gamma_\theta\eta_\theta}\,\nabla_\theta\mathcal{E}\big(\bar\theta_{k-1}, q^M_{k-1,X}\big),$$
$$X^i_k = X^i_{k-1} + \frac{\iota_x(h)}{\gamma_x}\,U^i_{k-1} + \gamma_x^{-1}\left(h - \frac{\iota_x(h)}{\gamma_x\eta_x}\right)\nabla_x\ell\big(\tilde\theta_{k-1}, X^i_{k-1}\big) + L^{XX}_\Sigma\,\xi^i_k,$$
$$U^i_k = (1 - \iota_x(h))\,U^i_{k-1} + \frac{\iota_x(h)}{\gamma_x\eta_x}\,\nabla_x\ell\big(\tilde\theta_{k-1}, X^i_{k-1}\big) + L^{XU}_\Sigma\,\xi^i_k + L^{UU}_\Sigma\,\xi'^i_k,$$
where $\iota_\theta(h) := 1 - \exp(-\eta_\theta\gamma_\theta h)$, $\iota_x(h) := 1 - \exp(-\eta_x\gamma_x h)$, $\{\xi^i_k, \xi'^i_k\} \sim N(0_{d_x}, I_{d_x})$, and $L^{XX}_\Sigma, L^{XU}_\Sigma, L^{UU}_\Sigma$ are suitable scalars depending on $(\eta_x, \gamma_x)$ that can be found in the appendix as Equation (83). We set
$$\bar\theta_{k-1} := \theta_{k-1} + \frac{\iota_\theta(h)}{\gamma_\theta}\,m_{k-1}, \qquad \tilde\theta_{k-1} := \theta_k. \tag{22}$$
The astute reader will notice that our approximating SDE (i.e., Equation (21)) deviates from Cheng et al. (2018)'s style of discretization. Specifically, the difference lies in where we compute the gradients in Equations (21b) and (21d), which we refer to as gradient correction. The choices of $\bar\theta_{k-1}$ and $\tilde\theta_{k-1}$ in Equation (22) reflect partial and full updates to $\theta_k$, respectively. This is reminiscent of NAG's use of a partial parameter update to compute the gradient, unlike Polyak's heavy ball, which uses no such correction (Sutskever et al., 2013, Section 2.1). In our case, we found empirically that this choice is critical for a more stable discretization and enables the algorithm to be faster by taking larger step sizes (see Section 6.1). The choice of the momentum parameters $(\eta_\theta, \gamma_\theta, \eta_x, \gamma_x)$ is crucial to the practical performance of MPD.
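To convey the structure of the iteration without its exact exponential-integrator coefficients and noise covariance, the following simplified stand-in discretizes the particle system (20) with plain Euler–Maruyama and adds a NAG-style partial parameter update before the $\theta$-gradient evaluation, echoing the gradient correction above. The toy model, hyperparameters, and seed are ad-hoc illustrative choices, not the paper's.

```python
import math, random

# Simplified MPD-flavoured sketch (NOT the paper's exact integrator) for the conjugate
# model p_theta(y, x) = N(y | x, 1) N(x | theta, 1), whose marginal MLE is theta* = y.

random.seed(2)
y, h, M, n_steps = 2.0, 0.05, 300, 2500
eta_t, gamma_t, eta_x, gamma_x = 1.0, 2.0, 1.0, 2.0
theta, m = -1.0, 0.0
xs = [random.gauss(0.0, 1.0) for _ in range(M)]
us = [0.0] * M

for _ in range(n_steps):
    theta += h * eta_t * m                        # partial update: theta moves first ...
    g_theta = sum(x - theta for x in xs) / M      # ... and the gradient is taken there
    m += h * (-gamma_t * eta_t * m + g_theta)     # momentum damped, driven by the cloud
    for i in range(M):
        g_x = (y - xs[i]) + (theta - xs[i])       # grad_x log p_theta(y, x)
        xs[i] += h * eta_x * us[i]
        us[i] += (h * (g_x - gamma_x * eta_x * us[i])
                  + math.sqrt(2 * gamma_x * h) * random.gauss(0.0, 1.0))

assert abs(theta - y) < 0.2
```

Swapping this Euler–Maruyama loop for the exact solution of the frozen linear SDE (21) is what turns the sketch into the actual MPD update.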
We observe that (as is also the case for NAG, the underdamped Langevin SDE, and other similar systems) there are three qualitatively different regimes for the dynamics: (i) the underdamped regime, in which the parameter values oscillate; (ii) the overdamped regime, in which one recovers PGD-type behaviour; and (iii) the critically-damped regime, in which oscillations are largely suppressed but the momentum effects still accelerate convergence relative to PGD. While rigorous approaches are limited to simple targets (e.g., Dockhorn et al. (2022)), one can use cross-validation to obtain these parameters in practice.

## 6. Experiments

In this section, we study various design choices of MPD and demonstrate the efficacy of our proposal through empirical results. The structure of this section is as follows: in Section 6.1, we demonstrate the effects of various design choices on the toy hierarchical model; then, in Section 6.2, we compare MPD with algorithms that incorporate momentum in either $\theta$ or $X$ only; finally, in Section 6.3, we compare our proposal with other methods for training VAEs. The code for reproducibility is available online: https://github.com/jenninglim/mpgd.

### 6.1. Toy Hierarchical Model

As a toy example, we consider the hierarchical model proposed in Kuntz et al. (2023). The model ToyHM$(\theta, \sigma)$ is given by $p_\theta(y, x) := \prod_{i=1}^N N(y_i\,|\,x_i, 1)\,N(x_i\,|\,\theta, \sigma^2)$, where $N = 100$ is the number of data points. The dataset $y$ is sampled from a model with $\theta = 100$ and $\sigma = 1$. In this experiment, we wish to understand the behaviour of MPD compared with PGD. We are particularly interested in (1) how the momentum parameters $(\eta_\theta, \gamma_\theta, \eta_x, \gamma_x)$ affect the optimization process; (2) comparing our proposed discretization to another that follows the style of NAG (detailed in Appendix I.3); and (3) the role of the gradient correction term described in Section 5. The results are shown in Figure 1. Further experimental details can be found in Appendix J.4.1.
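Written out in code (a hedged sketch; the data-generating values mirror the stated setup, while function names and the sanity checks are ours), the model and the gradients of $\ell(\theta, x) = \log p_\theta(y, x)$ that a PGD/MPD step would use are:

```python
import random

# ToyHM(theta, sigma): p_theta(y, x) = prod_{i=1}^N N(y_i | x_i, 1) N(x_i | theta, sigma^2).
# Data generated with theta = 100, sigma = 1, N = 100, as in the stated setup.

random.seed(3)
N, theta_true, sigma = 100, 100.0, 1.0
xs_true = [random.gauss(theta_true, sigma) for _ in range(N)]
ys = [random.gauss(x, 1.0) for x in xs_true]

def grad_theta(theta, x):
    # d/dtheta log p_theta(y, x) = sum_i (x_i - theta) / sigma^2
    return sum(xi - theta for xi in x) / sigma ** 2

def grad_x(theta, x):
    # i-th coordinate: (y_i - x_i) + (theta - x_i) / sigma^2
    return [(ys[i] - x[i]) + (theta - x[i]) / sigma ** 2 for i in range(N)]

# Sanity checks: grad_theta vanishes at theta = mean(x), and (for sigma = 1) grad_x
# vanishes at the posterior mean x_i = (y_i + theta) / 2.
theta_hat = sum(ys) / N
assert abs(grad_theta(theta_hat, ys)) < 1e-6
assert max(abs(g) for g in grad_x(theta_hat, [(yi + theta_hat) / 2 for yi in ys])) < 1e-9
```

Because the posterior over each $x_i$ is Gaussian here, this model makes the transient and stationary behaviour of the different momentum regimes easy to visualize.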
In Figure 1a, we show that different regimes can arise from different choices of hyperparameters (as discussed in Section 5). In Figure 1b, we compare the performance of MPD using (our) exponential integrator with a NAG-like discretization of the $(\theta_t, m_t)$-component. We vary the momentum coefficient $\mu$, taking $\mu \in \{0.5, 0.8, 0.9\}$ with $\mu_\theta = \mu_x = \mu$. It can be seen that MPD with our exponential integrator for $\theta_t$ performs better than the NAG-like integrator (see Appendix J.3 for more discussion). In Figure 1c, we show the effect of the gradient correction term for three different step sizes in the $(\theta, m)$- and $(X, U)$-components. The different lines in the figure are generated by multiplying the step size of each component by a constant $c \in \{0.99, 1, 1.01\}$. It can be seen that our proposed gradient correction results in a more effective discretization. In red, we show MPD without the gradient correction in Equation (21b) (i.e., using $\bar\theta_0 = \theta_0$ in (21b)), and, in green, MPD without the gradient correction in both Equations (21b) and (21d) (i.e., using $\bar\theta_0 = \tilde\theta_0 = \theta_0$ in (21b) and (21d)). It can be seen that the gradient correction term results in a more stable algorithm.

Figure 1: Toy hierarchical model. (a) Different regimes that arise from varying the momentum parameters; (b) comparison between our exponential (Exp) integrator and a Nesterov-like integrator for different momentum parameters; (c) performance of MPD with and without gradient correction.

Figure 2: Comparison of MPD with algorithms that only enrich one component. (a), (b), (c) show the performance on ToyHM$(10, 12)$ with different initializations of the particle cloud $\{X^i_0\}_{i=1}^M$; (d) shows $\hat W_1$ vs. iterations for a density estimation problem on a mixture-of-Gaussians (MoG) dataset.
MPD is shown in blue, $\theta$-only-enriched in green, $X$-only-enriched in purple, and PGD in black. The results are averaged over 10 independent trials.

### 6.2. Why Accelerate Both Components?

One may question the importance of momentum-enriching both components of PGD. We show in two experiments that enriching only one component results in a suboptimal algorithm. To show this, we consider ToyHM$(10, 12)$ and, as a more realistic example, a one-dimensional density estimation problem using a VAE on a mixture of Gaussians with two equally weighted components, $N(2, 0.5)$ and $N(-2, 0.5)$. To measure the performance of each algorithm, for the first problem, we examine the parameter of the model, which should converge to the true value of 10; for the second, we compute the empirical Wasserstein-1 distance $\hat W_1$ using the empirical CDF. Figure 2 shows the results of both experiments. Overall, it can be seen that the presence of momentum results in better performance. Furthermore, we observe that the algorithms which enrich only one component can exhibit problems which are not present for MPD. Poor initialization of the cloud drastically lengthens the transient phase of the $\theta$-only-enriched algorithm, whereas MPD recovers more rapidly. Such poor initialization is typical in high-dimensional settings (see Figure 2(d)). Conversely, while the $X$-only-enriched algorithm does not suffer as much from poor initialization, it can suffer from slow convergence in the $\theta$-component (intuitively, one pays a larger price for poor conditioning of the ideal MLE objective). Overall, MPD noticeably outperforms PGD and all other intermediate methods in all settings. For details and discussion, see Appendix K.

### 6.3. Image Generation

For this task, we consider two datasets: MNIST (LeCun et al., 1998) and CIFAR-10 (Krizhevsky and Hinton, 2009).
For the model, we use a variational autoencoder (Kingma and Welling, 2014) with a VampPrior (Tomczak and Welling, 2018). As baselines, we compare our proposed method, MPD, against PGD, alternating backpropagation (ABP) (Han et al., 2017), short-run (SR) (Nijkamp et al., 2020), and amortized variational inference (VI) (Kingma and Welling, 2014). ABP is the most similar to PGD. They differ in that ABP takes multiple steps of the (unadjusted and overdamped) Langevin algorithm (ULA) instead of PGD's single step. MPD, PGD, and ABP are also "persistent", meaning that the ULA chain starts at the previous particle location, whereas SR restarts the chain at a random location sampled from the prior but, like ABP, runs the chain for several steps. Further experimental details can be found in Appendix J.4.2.

Figure 3: Posterior cloud vs. epochs. We show the evolution of the reconstruction of a particle for persistent methods. The particle is taken at epochs {10, 20, 40} on MNIST and CIFAR-10.

| Dataset | MPD | PGD | ABP | SR | VI |
|---|---|---|---|---|---|
| MNIST | **47.9 ± 0.6** | 104.1 ± 1.2 | **45.9 ± 0.8** | 109.97 ± 12.5 | 72.6 ± 1.3 |
| CIFAR | **93.2 ± 1.5** | 106.2 ± 4.8 | 97.0 ± 8.0 | 126.25 ± 13.0 | **94.2 ± 6.7** |

Table 1: FID scores on MNIST and CIFAR-10 after the final epoch. In bold, we indicate the lowest two scores on average. We write $\mu \pm \sigma$ to indicate the mean $\mu$ and standard deviation $\sigma$ of the FID score calculated over three independent trials.

We are interested in the generative performance of the resulting models. As a qualitative measure, we show the samples produced by the model in Figure 6 for CIFAR (for MNIST samples, see Figure 5 in the Appendix). As a quantitative measure, we report the Fréchet inception distance (FID) (Heusel et al., 2017) in Table 1. It can be seen that our proposed method does well compared against the other baselines. The closest competitor is ABP. As noted by Kuntz et al.
(2023), ABP can reap the benefits of taking multiple ULA steps to locate the posterior mode quickly and reduce the transient phase. This hypothesis is further supported by Figure 3, where we visualize the evolution of a single particle across various epochs. It can be seen that ABP's and MPD's distinct methods of reducing the transient phase (taking multiple steps and utilizing momentum, respectively) are both effective. SR performs poorly, which we attribute to its non-persistence: by restarting the particles at the prior and running short chains, it is unable to overcome the bias of short chains and locate the posterior mode. Variational inference performs well on CIFAR but, surprisingly, falls slightly short on MNIST.

## 7. Conclusion, Limitations, and Future Work

We presented Momentum Particle Descent (MPD), a method that incorporates momentum into particle gradient descent. We established several theoretical results: convergence in continuous time, existence and uniqueness of solutions, and guarantees for the use of the particle approximation. Through experiments, we showed that, for suitably chosen momentum parameters, the resulting algorithm achieves better performance than PGD. The main limitation that may impede the widespread adoption of MPD over PGD is the requirement to tune the momentum parameters. This issue is inherited from underdamped Langevin dynamics, whose tuning is generally understood to be somewhat more challenging than that of its overdamped counterpart, as witnessed by the dearth of papers proposing practical tuning strategies for this class of algorithms. Although we found that the momentum coefficient heuristic (see Appendix J.3) worked well in our experiments, future work includes a systematic method of tuning the momentum parameters $(\eta_\theta, \gamma_\theta, \eta_x, \gamma_x)$ following Riou-Durand et al. (2023) (or some justification for using the momentum coefficient heuristic).
Another avenue for future work is a theoretical characterization of the difference between our discretization scheme and other potential schemes, akin to Sanz-Serna and Zygalakis (2021).

Acknowledgements

We thank our anonymous reviewers for their helpful comments, which improved the quality of the paper. JL acknowledges support from the Feuer International Scholarship. JK and AMJ acknowledge support from the Engineering and Physical Sciences Research Council (EPSRC; grant EP/T004134/1). AMJ acknowledges further support from the EPSRC (grants EP/R034710/1 and EP/Y014650/1). SP acknowledges support from the EPSRC (grant EP/R018561/1).

Impact Statement

This paper presents work whose goal is to advance the field of Machine Learning. There are many potential societal consequences of our work, none of which we feel must be specifically highlighted here.

References

Ambrosio, L., Gigli, N., and Savaré, G. (2005). Gradient Flows: in Metric Spaces and in the Space of Probability Measures. Springer Science & Business Media.
Andrieu, C. and Thoms, J. (2008). A tutorial on adaptive MCMC. Statistics and Computing, 18:343–373.
Bakry, D. and Émery, M. (2006). Diffusions hypercontractives. In Séminaire de Probabilités XIX 1983/84: Proceedings, pages 177–206. Springer.
Bernton, E. (2018). Langevin Monte Carlo and JKO splitting. In Conference on Learning Theory, pages 1777–1798. PMLR.
Caprio, R., Kuntz, J., Power, S., and Johansen, A. M. (2024). Error bounds for particle gradient descent, and extensions of the log-Sobolev and Talagrand inequalities. arXiv preprint arXiv:2403.02004.
Carmona, R. (2016). Lectures on BSDEs, Stochastic Control, and Stochastic Differential Games with Financial Applications. SIAM.
Chen, F., Lin, Y., Ren, Z., and Wang, S. (2024). Uniform-in-time propagation of chaos for kinetic mean field Langevin dynamics. Electronic Journal of Probability, 29:1–43.
Cheng, X., Chatterji, N. S., Bartlett, P. L., and Jordan, M. I.
(2018). Underdamped Langevin MCMC: A non-asymptotic analysis. In Conference on Learning Theory, pages 300–323. PMLR.
Chizat, L. (2022). Mean-field Langevin dynamics: Exponential convergence and annealing. Transactions on Machine Learning Research.
Chizat, L. and Bach, F. (2018). On the global convergence of gradient descent for over-parameterized models using optimal transport. Advances in Neural Information Processing Systems, 31.
De Bortoli, V., Durmus, A., Pereyra, M., and Vidal, A. F. (2021). Efficient stochastic optimisation by unadjusted Langevin Monte Carlo: Application to maximum marginal likelihood and empirical Bayesian estimation. Statistics and Computing, 31:1–18.
Delyon, B., Lavielle, M., and Moulines, E. (1999). Convergence of a stochastic approximation version of the EM algorithm. Annals of Statistics, pages 94–128.
Diao, M. Z., Balasubramanian, K., Chewi, S., and Salim, A. (2023). Forward-backward Gaussian variational inference via JKO in the Bures-Wasserstein space. In International Conference on Machine Learning, pages 7960–7991. PMLR.
Dockhorn, T., Vahdat, A., and Kreis, K. (2022). Score-based generative modeling with critically-damped Langevin diffusion. In International Conference on Learning Representations (ICLR).
Duncan, A., Nüsken, N., and Szpruch, L. (2023). On the geometry of Stein variational gradient descent. Journal of Machine Learning Research, 24(56):1–39.
Garbuno-Inigo, A., Nüsken, N., and Reich, S. (2020). Affine invariant interacting Langevin dynamics for Bayesian inference. SIAM Journal on Applied Dynamical Systems, 19(3):1633–1658.
Gelfand, I. M. and Silverman, R. A. (2000). Calculus of Variations. Courier Corporation.
Good, I. J. (1983). Good Thinking: The Foundations of Probability and Its Applications. University of Minnesota Press.
Han, T., Lu, Y., Zhu, S.-C., and Wu, Y. N. (2017). Alternating back-propagation for generator network. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 31.
Heusel, M., Ramsauer, H., Unterthiner, T., Nessler, B., and Hochreiter, S. (2017). GANs trained by a two time-scale update rule converge to a local Nash equilibrium. Advances in Neural Information Processing Systems, 30.
Hinton, G. E. (2002). Training products of experts by minimizing contrastive divergence. Neural Computation, 14(8):1771–1800.
Hochbruck, M. and Ostermann, A. (2010). Exponential integrators. Acta Numerica, 19:209–286.
Hu, K., Ren, Z., Šiška, D., and Szpruch, Ł. (2021). Mean-field Langevin dynamics and energy landscape of neural networks. In Annales de l'Institut Henri Poincaré (B) Probabilités et Statistiques, volume 57, pages 2043–2065. Institut Henri Poincaré.
Jordan, R., Kinderlehrer, D., and Otto, F. (1998). The variational formulation of the Fokker–Planck equation. SIAM Journal on Mathematical Analysis, 29(1):1–17.
Kac, M. (1956). Foundations of kinetic theory. In Proceedings of the Third Berkeley Symposium on Mathematical Statistics and Probability, volume 3, pages 171–197.
Kingma, D. P. and Welling, M. (2014). Auto-encoding variational Bayes. In Bengio, Y. and LeCun, Y., editors, 2nd International Conference on Learning Representations, ICLR 2014, Banff, AB, Canada, April 14–16, 2014, Conference Track Proceedings.
Krizhevsky, A. and Hinton, G. (2009). Learning multiple layers of features from tiny images.
Kruger, A. Y. (2003). On Fréchet subdifferentials. Journal of Mathematical Sciences, 116(3):3325–3358.
Kuntz, J., Lim, J. N., and Johansen, A. M. (2023). Particle algorithms for maximum likelihood training of latent variable models. In Proceedings of the 26th International Conference on Artificial Intelligence and Statistics (AISTATS), volume 206 of Proceedings of Machine Learning Research, pages 5134–5180.
Lambert, M., Chewi, S., Bach, F., Bonnabel, S., and Rigollet, P. (2022). Variational inference via Wasserstein gradient flows. Advances in Neural Information Processing Systems, 35:14434–14447.
LeCun, Y., Bottou, L., Bengio, Y., and Haffner, P. (1998). Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324.
Liu, Q. (2017). Stein variational gradient descent as gradient flow. Advances in Neural Information Processing Systems, 30.
Loève, M. (1977). Probability Concepts, pages 151–176. Springer New York, New York, NY.
Ma, Y.-A., Chatterji, N. S., Cheng, X., Flammarion, N., Bartlett, P. L., and Jordan, M. I. (2021). Is there an analog of Nesterov acceleration for gradient-based MCMC? Bernoulli, 27(3):1942–1992.
Maddison, C. J., Paulin, D., Teh, Y. W., O'Donoghue, B., and Doucet, A. (2018). Hamiltonian descent methods. arXiv preprint arXiv:1809.05042.
Martens, J. (2010). Deep learning via Hessian-free optimization. In International Conference on Machine Learning, volume 27, pages 735–742.
McCall, M. W. (2010). Classical Mechanics: From Newton to Einstein: A Modern Introduction. John Wiley & Sons.
McKean Jr, H. P. (1966). A class of Markov processes associated with nonlinear parabolic equations. Proceedings of the National Academy of Sciences, 56(6):1907–1911.
McLachlan, R. and Perlmutter, M. (2001). Conformal Hamiltonian systems. Journal of Geometry and Physics, 39(4):276–300.
Mei, S., Montanari, A., and Nguyen, P.-M. (2018). A mean field view of the landscape of two-layer neural networks. Proceedings of the National Academy of Sciences, 115(33):E7665–E7671.
Neal, R. M. and Hinton, G. E. (1998). A view of the EM algorithm that justifies incremental, sparse, and other variants. Learning in Graphical Models, pages 355–368.
Nemirovskij, A. S. and Yudin, D. B. (1983). Problem Complexity and Method Efficiency in Optimization. Wiley-Interscience.
Nesterov, Y. (2003). Introductory Lectures on Convex Optimization: A Basic Course. Springer Science & Business Media.
Nesterov, Y. E. (1983). A method of solving a convex programming problem with convergence rate O(1/k²). In Doklady Akademii Nauk, volume 269, pages 543–547.
Russian Academy of Sciences.
Nijkamp, E., Pang, B., Han, T., Zhou, L., Zhu, S.-C., and Wu, Y. N. (2020). Learning multi-layer latent variable model via variational optimization of short run MCMC for approximate inference. In Computer Vision – ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part VI, pages 361–378. Springer.
Nitanda, A. and Suzuki, T. (2017). Stochastic particle gradient descent for infinite ensembles. arXiv preprint arXiv:1712.05438.
Nitanda, A., Wu, D., and Suzuki, T. (2022). Convex analysis of the mean field Langevin dynamics. In International Conference on Artificial Intelligence and Statistics, pages 9741–9757. PMLR.
Otto, F. and Villani, C. (2000). Generalization of an inequality by Talagrand and links with the logarithmic Sobolev inequality. Journal of Functional Analysis, 173(2):361–400.
O'Donoghue, B. and Candès, E. (2015). Adaptive restart for accelerated gradient schemes. Foundations of Computational Mathematics, 15:715–732.
Pang, B., Han, T., Nijkamp, E., Zhu, S.-C., and Wu, Y. N. (2020). Learning latent space energy-based prior model. Advances in Neural Information Processing Systems, 33:21994–22008.
Peyré, G. and Cuturi, M. (2019). Computational optimal transport: With applications to data science. Foundations and Trends in Machine Learning, 11(5-6):355–607.
Platen, E. and Bruti-Liberati, N. (2010). Numerical Solution of Stochastic Differential Equations with Jumps in Finance. Springer Science & Business Media.
Riou-Durand, L., Sountsov, P., Vogrinc, J., Margossian, C., and Power, S. (2023). Adaptive tuning for Metropolis adjusted Langevin trajectories. In International Conference on Artificial Intelligence and Statistics, pages 8102–8116. PMLR.
Roberts, G. O. and Rosenthal, J. S. (2009). Examples of adaptive MCMC. Journal of Computational and Graphical Statistics, 18(2):349–367.
Santambrogio, F. (2017).
{Euclidean, metric, and Wasserstein} gradient flows: an overview. Bulletin of Mathematical Sciences, 7:87–154.
Sanz-Serna, J. M. and Zygalakis, K. C. (2021). Wasserstein distance estimates for the distributions of numerical approximations to ergodic stochastic differential equations. Journal of Machine Learning Research, 22(1):11006–11042.
Sharrock, L. and Nemeth, C. (2023). Coin sampling: Gradient-based Bayesian inference without learning rates. In Krause, A., Brunskill, E., Cho, K., Engelhardt, B., Sabato, S., and Scarlett, J., editors, Proceedings of the 40th International Conference on Machine Learning, volume 202 of Proceedings of Machine Learning Research, pages 30850–30882. PMLR.
Shi, B., Du, S. S., Jordan, M. I., and Su, W. J. (2021). Understanding the acceleration phenomenon via high-resolution differential equations. Mathematical Programming, 195:79–148.
Silvester, J. R. (2000). Determinants of block matrices. The Mathematical Gazette, 84(501):460–467.
Staib, M., Reddi, S., Kale, S., Kumar, S., and Sra, S. (2019). Escaping saddle points with adaptive gradient methods. In International Conference on Machine Learning, pages 5956–5965. PMLR.
Su, W., Boyd, S., and Candès, E. (2014). A differential equation for modeling Nesterov's accelerated gradient method: theory and insights. Advances in Neural Information Processing Systems, 27.
Sutskever, I., Martens, J., Dahl, G., and Hinton, G. (2013). On the importance of initialization and momentum in deep learning. In International Conference on Machine Learning, pages 1139–1147. PMLR.
Suzuki, T., Wu, D., and Nitanda, A. (2023). Mean-field Langevin dynamics: Time-space discretization, stochastic gradient, and variance reduction. In Thirty-seventh Conference on Neural Information Processing Systems.
Tieleman, T. and Hinton, G. (2012). Lecture 6.5 - RMSProp, COURSERA: Neural Networks for Machine Learning. University of Toronto, Technical Report, 6.
Tomczak, J. and Welling, M. (2018). VAE with a VampPrior.
In International Conference on Artificial Intelligence and Statistics, pages 1214–1223. PMLR.
Van Handel, R. (2014). Probability in High Dimension. Lecture notes, Princeton University.
Villani, C. (2009). Optimal Transport: Old and New, volume 338. Springer.
Wibisono, A. (2018). Sampling as optimization in the space of measures: The Langevin dynamics as a composite optimization problem. In Conference on Learning Theory, pages 2093–3027. PMLR.
Wibisono, A., Wilson, A. C., and Jordan, M. I. (2016). A variational perspective on accelerated methods in optimization. Proceedings of the National Academy of Sciences, 113(47):E7351–E7358.
Wilson, A. C., Recht, B., and Jordan, M. I. (2021). A Lyapunov analysis of accelerated methods in optimization. Journal of Machine Learning Research, 22(113):1–34.
Yao, R. and Yang, Y. (2022). Mean field variational inference via Wasserstein gradient flow. arXiv preprint arXiv:2207.08074.

Appendices

A  Notation
B  Related Work
C  Gradient Flow on Rdθ × P(Rdx)
   C.1  Gradient Flow on P(Rdx)
   C.2  Gradient Flow on Rdθ × P(Rdx)
   C.3  First Variation
D  A log-Sobolev inequality for E can be transferred to F
E  F(θt, mt, qt) is non-increasing
F  Proof of Theorem 4.1
   F.1  Proof of Step 1
   F.2  Proof of Step 2
   F.3  Examples of (51) holding
   F.4  Supporting Proofs and Derivations
G  Existence and Uniqueness of Strong Solutions to (15)
H  Space Discretization
I  Time Discretization
   I.1  PGD discretization
   I.2  NAG as a discretization
   I.3  Nesterov's and Cheng's discretization of MPD
   I.4  Proposed discretization of (21)
J  Practical Concerns and Experimental Details
   J.1  RMSProp Preconditioner
   J.2  Subsampling
   J.3  Heuristic
   J.4  Experimental Details
   J.5  Hyperparameter settings
   J.6  Additional Figures
K  Additional Experiments
   K.1  Toy Hierarchical Model
   K.2  Density Estimation
   K.3  Sensitivity Analysis

A. Notation

The following table summarizes some key notation used throughout.

    E                   Free energy defined in Equation (1).
    F                   Momentum-enriched free energy defined in Equation (13).
    zt                  The tuple (θt, mt, qt) ∈ Rdθ × Rdθ × P(Rdx × Rdx).
    ∇f                  Euclidean gradient of f. If f : Rn → Rm is vector-valued, then ∇f ∈ Rn×m.
    ∇a f                Euclidean gradient of f w.r.t. a.
    ∇(a,b) f            [∇a f, ∇b f]ᵀ.
    ℓ                   ℓ(θ, x) := log pθ(y, x).
    ρθ                  ρθ(x) := pθ(y, x).
    ∇·f                 If f : Rn → Rn, then ∇·f := Σi ∂i fi. If f : Rn → Rn×m, then (∇·f)i := Σj ∂j fji.
    ∇a·                 Divergence operator w.r.t. a.
    ∇a*                 Adjoint of the operator ∇a (see Appendix F.4.2).
    [n]                 [n] := {1, ..., n}.
    Δ                   Laplacian operator, Δ = ∇·∇.
    P(Rd)               The space of probability measures that are absolutely continuous w.r.t. the Lebesgue measure (i.e., have densities) and possess finite second moments.
    ⟨·,·⟩ (and ||·||)   Euclidean or Frobenius inner product (and its induced norm).
    ⟨·,·⟩ρ (and ||·||ρ) L2(ρ) inner product (and its norm).
    o(ε)                Bachmann–Landau little-o notation.
    ∂F(µ)               Fréchet subdifferential (see Definition C.1).

B. Related Work

The present work sits at the juncture of (i) deterministic gradient flows for optimising objectives over parameter spaces, typically expressible through the discretization of ODEs, and (ii) stochastic gradient flows for optimising objectives over the space of probability measures, typically expressible through the discretisation of mean-field SDEs. The former class of problems is too vast to be properly surveyed here (for an overview, see Santambrogio (2017); Ambrosio et al. (2005)), effectively including a large proportion of modern continuous optimisation problems. The latter class has seen substantial growth over the past few years in particular, with various problems related to sampling (Liu, 2017; Bernton, 2018; Garbuno-Inigo et al., 2020; Duncan et al., 2023), variational inference (Yao and Yang, 2022; Lambert et al., 2022; Diao et al., 2023), and the training of shallow neural networks (Mei et al., 2018; Chizat and Bach, 2018; Nitanda and Suzuki, 2017; Hu et al., 2021; Nitanda et al., 2022; Chizat, 2022; Chen et al., 2024; Suzuki et al., 2023) being studied in this framework. While there exist earlier works which combine optimisation with Markovian sampling (e.g.
stochastic approximation approaches to the EM algorithm (Delyon et al., 1999; De Bortoli et al., 2021), the training of energy-based models (Hinton, 2002), and hyperparameter tuning in MCMC (Andrieu and Thoms, 2008; Roberts and Rosenthal, 2009)), the connection to gradient flows remains somewhat under-developed at present. We hope that the present work can encourage further exploration of these connections.

C. Gradient Flow on Rdθ × P(Rdx)

In this section, we describe gradient flows on the extended space Rdθ × P(Rdx). We begin with an exposition of gradient flows in Wasserstein space, i.e. the space of distributions P(Rdx) endowed with the Wasserstein-2 metric. We aim to give an intuitive introduction rather than a rigorous one; readers interested in the latter are directed to Ambrosio et al. (2005). Using the ideas in Ambrosio et al. (2005), we then show how the notion of gradient can be generalized to the product space Rdθ × P(Rdx).

C.1. Gradient Flow on P(Rdx)

We begin by describing the gradient flow on P(Rdx) endowed with the Wasserstein-2 metric. The Wasserstein distance is defined as

    W2²(p, q) = inf_{π ∈ Π(p,q)} ∫_{Rd×Rd} ||x − y||² π(dx, dy),

where Π(p, q) is the set of all couplings between p and q. In Ambrosio et al. (2005, Chapter 11), the authors discuss various approaches for adapting gradient flows on well-studied spaces (such as Euclidean and Riemannian spaces) to the Wasserstein space. One of these approaches proceeds by first defining suitable notions of tangent space and subdifferential, after which the definition of gradient flows modelled on Riemannian manifolds can be reproduced. In this case, the Fréchet subdifferential (Ambrosio et al., 2005, Definition 10.1.1) is defined as follows:

Definition C.1 (Fréchet subdifferential on Wasserstein space). Let F : P(Rdx) → R be a sufficiently regular function.
We say that ξ ∈ L2(q) belongs to the Fréchet subdifferential ∂F(q) if, for all q′ ∈ P(Rdx), we have

    F(q′) − F(q) ≥ ⟨ξ, t_q^{q′} − i⟩_q + o(W2(q, q′)),

where t_q^{q′} is the optimal transport map from q to q′ (Ambrosio et al., 2005, see (7.1.4)) and i is the identity map. Furthermore, if ξ ∈ ∂F(q) also satisfies

    F(t#q) − F(q) ≥ ⟨ξ, t − i⟩_q + o(||t − i||_q),

for all t ∈ L2(q), then we say that ξ is a strong subdifferential. See Ambrosio et al. (2005, Definition 10.1.1) for more details.

The strong subdifferential can be thought of as the (Wasserstein) gradient of F. Equipped with this notion of gradient, we can define the gradient (descent) flow of F : P(Rdx) → R as follows:

Definition C.2 (Gradient Flow). We say that a curve pt : [0, 1] → P(Rdx) is a gradient flow of F : P(Rdx) → R if, for all t > 0, it satisfies the continuity equation

    ∂t pt + ∇x·(vt pt) = 0,

where the tangent vector satisfies vt ∈ −∂F(pt) for all t.

Thus, for our application, we are interested in computing the strong subdifferential of F. If F is an integral of the type

    F(p) = ∫ f(x, p(x), ∇x p(x)) dx,    (23)

where f : Rdx × R × Rdx → R is sufficiently regular, then we will see that its strong subdifferential admits an analytic expression. Functionals of this form are of great interest in the calculus of variations (e.g., see Gelfand and Silverman (2000)). A vital quantity used to study them is the first variation. Writing δpF[p] : Rdx → R for the first variation of F, it is a function that satisfies

    d/dε F(qε)|_{ε=0} = −∫ δqF[q](x) ∇x·(q(x)v(x)) dx,

for all v such that qε := (i + εv)#q ∈ P(Rdx) for sufficiently small ε. One can readily establish that, for (23)-typed F, it is given by

    δpF[p](x) := ∂2 f(x, p(x), ∇x p(x)) − ∇x·(∂3 f(x, p(x), ∇x p(x))),

where ∂i denotes the partial derivative w.r.t. the i-th argument (Ambrosio et al., 2005, Eq. (10.4.2)). It can be shown that any strong subdifferential ξ ∈ ∂F(p) satisfies ξ(x) = ∇x δpF[p](x) for p-a.e. x (Ambrosio et al., 2005, Lemma 10.4.1). C.2.
Gradient Flow on Rdθ × P(Rdx)

Here we generalize the Fréchet subdifferential (for a broad survey of Fréchet differentials, see Kruger (2003)) to the extended space Rdθ × P(Rdx):

Definition C.3 (Fréchet subdifferential on Rdθ × P(Rdx)). Let G : Rdθ × P(Rdx) → R be a sufficiently regular function. We say that (ξθ, ξq) ∈ Rdθ × L2(q) belongs to the Fréchet subdifferential ∂G(θ, q) if, for all (θ′, q′) ∈ Rdθ × P(Rdx), it holds that

    G(θ′, q′) − G(θ, q) ≥ ⟨ξθ, θ′ − θ⟩ + ⟨ξq, t_q^{q′} − i⟩_q + o(W2(q′, q) + ||θ′ − θ||).

Furthermore, if (ξθ, ξq) ∈ ∂G(θ, q) also satisfies

    G(θ + τ, t#q) − G(θ, q) ≥ ⟨ξθ, τ⟩ + ⟨ξq, t − i⟩_q + o(||t − i||_q + ||τ||),

for all (τ, t) ∈ Rdθ × L2(q), then we say that (ξθ, ξq) is a strong subdifferential.

For the remainder of the section, we shall assume that for any perturbation (σ, v) there exists some δqG[θ, q] : Rdx → R (called the first variation) such that the following expansion holds:

    d/dε G(θε, qε)|_{ε=0} = ⟨∇θG(θ, q), σ⟩ − ∫ δqG[θ, q](x) ∇x·(v(x) q(x)) dx,    (24)

where qε := (i + εv)#q and θε := θ + εσ. This assumption holds for all G of interest in this paper; see, for instance, Proposition F.4. Following the argument of Ambrosio et al. (2005, Lemma 10.4.1), we show that if (ξθ, ξq) ∈ ∂G(θ, q) is a strong subdifferential, then ξθ = ∇θG(θ, q) and ξq(x) = ∇x δqG[θ, q](x) for q-a.e. x. Let (ξθ, ξq) ∈ ∂G(θ, q) be a strong subdifferential. Using the strong subdifferential property with the perturbation (εσ, εv) and taking left and right limits in Definition C.3, we obtain

    limsup_{ε↑0} [G(θε, qε) − G(θ, q)] / ε ≤ ⟨ξθ, σ⟩ + ⟨ξq, v⟩_q ≤ liminf_{ε↓0} [G(θε, qε) − G(θ, q)] / ε.

On the other hand, from our assumption (24), we have

    lim_{ε→0} [G(θε, qε) − G(θ, q)] / ε = ⟨∇θG(θ, q), σ⟩ − ∫ δqG[θ, q](x) ∇x·(v(x) q(x)) dx = ⟨∇θG(θ, q), σ⟩ + ⟨∇x δqG[θ, q], v⟩_q,

where the last equality follows from integration by parts and the divergence theorem. Hence, we have that ξθ = ∇θG(θ, q) and ξq = ∇x δqG[θ, q] q-a.e.

C.3. First Variation

In this section, we derive the first variation for integral expressions taking a certain form.
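Before specializing to particular functionals, the q-component of the expansion (24) can be sanity-checked numerically on a simple example: the negative-entropy functional F(q) = ∫ q log q has first variation δF[q] = log q + 1, so the expansion predicts d/dε F(qε)|ε=0 = ⟨∇x log q, v⟩_q. The following sketch (not from the paper; the choices q = N(0, 1), v(x) = sin(x), and the Monte Carlo sizes are arbitrary) compares a finite-difference derivative under the pushforward qε = (i + εv)#q against this closed form:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=400_000)   # samples from q = N(0, 1)
eps = 1e-3

# Change of variables for T = i + eps*v (monotone for small eps) gives
#   F(q_eps) = E_q[ log q(X) - log(1 + eps * v'(X)) ],   with v(x) = sin(x).
def F(e):
    return np.mean(-0.5 * x**2 - 0.5 * np.log(2 * np.pi)
                   - np.log(1 + e * np.cos(x)))

fd = (F(eps) - F(-eps)) / (2 * eps)   # finite-difference derivative at eps = 0

# Closed form predicted by (24): <grad_x log q, v>_q = E_q[-X sin(X)],
# since grad_x dF[q] = grad_x log q = -x for the standard normal.
cf = np.mean(-x * np.sin(x))

print(abs(fd - cf) < 0.02)  # prints True: the two derivatives agree
```

Both quantities estimate −E[cos X] = −e^{−1/2} ≈ −0.607 (by Stein's identity E[X sin X] = E[cos X] for a standard normal), so the agreement confirms the sign convention in (24).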
In particular, we are interested in computing the strong subdifferentials of G of the following types:

    G(θ, q) := ∫ f(θ, x, q(x), ∇x q(x)) dx,  where f : Rdθ × Rdx × R × Rdx → R;    (VI-I)
    G(θ, q) := ∫∫ f(θ, x, x′, q(x), q(x′)) dx dx′,  where f : Rdθ × Rdx × Rdx × R × R → R.    (VI-II)

An example of a (VI-I)-typed G is the free energy E. Following standard techniques from the calculus of variations (Gelfand and Silverman, 2000), consider a perturbation (σ, v) ∈ Rdθ × L2(q) and define a mapping Φ by Φ(ε) := G(θε, qε), where qε := (i + εv)#q and θε := θ + εσ.

(VI-I)-typed G. By the change-of-variables formula (for instance, see Ambrosio et al. (2005, Lemma 5.5.3)), we can compute the density qε and its derivative ∂ε qε(y) as follows:

    qε(y) = [q / det(I + ε∇x v)] ∘ (i + εv)^{-1}(y),    ∂ε qε(y)|_{ε=0} = −∇x·[q v](y).

Thus, provided that the interchange of derivative and integral can be justified, the derivative of Φ is

    d/dε Φ(ε)|_{ε=0} = ∫ d/dε f(θε, x, qε(x), ∇x qε(x)) dx
      = ∫ ⟨∇θ f(θ, x, q(x), ∇x q(x)), σ⟩ dx − ∫ ∂3 f(θ, x, q(x), ∇x q(x)) ∇x·[v q](x) dx − ∫ ⟨∂4 f(θ, x, q(x), ∇x q(x)), ∇x[∇x·[v q]](x)⟩ dx.

Applying integration by parts and the divergence theorem, we obtain

    d/dε Φ(ε)|_{ε=0} = ⟨∇θ G(θ, q), σ⟩ − ∫ ∂3 f(θ, x, q(x), ∇x q(x)) [∇x·(v q)](x) dx + ∫ ∇x·[∂4 f](θ, x, q(x), ∇x q(x)) [∇x·(v q)](x) dx.

We can hence write the first variation of a (VI-I)-typed G as

    δqG[θ, q](x) = ∂3 f(θ, x, q(x), ∇x q(x)) − ∇x·(∂4 f(θ, x, q(x), ∇x q(x))).    (25)

(VI-II)-typed G. Similarly to the above, we define Φ(ε) := G(θε, qε), whose derivative is given by

    d/dε Φ(ε)|_{ε=0} = ⟨∇θ G(θ, q), σ⟩ − ∫∫ ∂4 f(θ, x, y, q(x), q(y)) ∇x·[q v](x) dx dy − ∫∫ ∂5 f(θ, x, y, q(x), q(y)) ∇y·[q v](y) dx dy.

Hence, the first variation is given by

    δqG[θ, q](x) = ∫ [∂4 f(θ, x, y, q(x), q(y)) + ∂5 f(θ, y, x, q(y), q(x))] dy.    (26)

Example 1 (First Variation of E). The free energy is (VI-I)-typed with f(θ, x, a, g) = a log [a / pθ(y, x)].
Hence, combining the formula (25) with the facts that ∂3 f(θ, x, a, g) = log [a / pθ(y, x)] + 1 and ∂4 f(θ, x, a, g) = 0, we obtain the following expression for the first variation:

    δqE[θ, q](x) = log [q(x) / pθ(y, x)] + 1.

This provides an alternative derivation of the expression given in Kuntz et al. (2023, Lemma 1).

D. A log-Sobolev inequality for E can be transferred to F

In this section, we show that nice properties of the functional E transfer to the functional F: if E satisfies a log-Sobolev inequality (Assumption 2), then so does F (with a modified constant):

Proposition D.1. If Assumption 2 holds, then, defining C := min{CE, ηθ, ηx}, it holds that

    F(θ, m, q) − E* ≤ (1 / 2C) ||∇F(θ, m, q)||²_q,  for all θ, m ∈ Rdθ, q ∈ P(R2dx),

where

    ||∇F(θ, m, q)||²_q := ||∇(θ,m)F(θ, m, q)||² + ||∇(x,u) log q − ∇(x,u) log ρθ,ηx||²_q.    (27)

Proof. As we show at the end of the proof, we have

    ||∇(x,u) log (q / (ρθ ⊗ rηx))||²_q ≥ ||∇x log (q_X / ρθ)||²_{q_X} + ||∇u log (q_{U|X} / rηx)||²_q.    (28)

From the definition, we have

    ||∇F(θ, m, q)||²_q = ||∇θF(θ, m, q)||² + ||ηθ m||² + ||∇(x,u) log (q / ρθ,ηx)||²_q.

Using (28) and the fact that

    ∇θF(θ, m, q) = −∫ ∇θ log ρθ,ηx(x, u) q(dx, du) = −∫ ∇θ log ρθ(x) q_X(dx) = ∇θE(θ, q_X),

it follows that

    ||∇F(θ, m, q)||²_q ≥ ||∇θE(θ, q_X)||² + ηθ² ||m||² + ||∇x log (q_X / ρθ)||²_{q_X} + ||∇u log (q_{U|X} / rηx)||²_q.

Because C = min{CE, ηθ, ηx}, the above implies that

    (1 / 2C) ||∇F(θ, m, q)||²_q ≥ (1 / 2CE) [ ||∇θE(θ, q_X)||² + ||∇x log (q_X / ρθ)||²_{q_X} ] + (ηθ / 2) ||m||² + (1 / 2ηx) ||∇u log (q_{U|X} / rηx)||²_q.    (29)

Since rηx = N(0, ηx⁻¹ Idx) is ηx-strongly log-concave, the Bakry–Émery criterion of Bakry and Émery (2006) gives

    (1 / 2ηx) ∫ ||∇u log (q(·|x) / rηx)||² q(du|x) ≥ KL(q_{U|x} | rηx)  for all x,

and so we obtain that

    (1 / 2ηx) ||∇u log (q_{U|X} / rηx)||²_q = E_{q_X} [ (1 / 2ηx) ∫ ||∇u log (q(·|X) / rηx)||² q(du|X) ] ≥ E_{q_X} [ KL(q_{U|X} | rηx) ].

Plugging the above into (29) and using the log-Sobolev inequality for E, we obtain

    (1 / 2C) ||∇F(θ, m, q)||²_q ≥ E(θ, q_X) − E* + (ηθ / 2) ||m||² + E_{q_X} [ KL(q_{U|X} | rηx) ] = F(θ, m, q) − E*.

We have one loose end to tie up: proving (28).
To do so, write q(x, u) = q_X(x) q_{U|X}(u|x) and ρθ,ηx = ρθ ⊗ rηx, so that ∇u log (q / (ρθ ⊗ rηx)) = ∇u log (q_{U|X} / rηx) and hence

    ||∇(x,u) log (q / (ρθ ⊗ rηx))||²_q = ||∇x log (q_X / ρθ) + ∇x log q_{U|X}||²_q + ||∇u log (q_{U|X} / rηx)||²_q.    (31)

Taking the first term and expanding the square, we have

    ||∇x log (q_X / ρθ)||²_{q_X} + 2⟨∇x log (q_X / ρθ), ∇x log q_{U|X}⟩_q + ||∇x log q_{U|X}||²_q.

Conditioning on x in the cross term and applying Jensen's inequality to the final term, we obtain the lower bound

    ||∇x log (q_X / ρθ)||²_{q_X} + 2⟨∇x log (q_X / ρθ), E_{q(du|·)}[∇x log q(U|·)]⟩_{q_X} + ||E_{q(du|·)}[∇x log q(U|·)]||²_{q_X}.

One can compute explicitly that, for all x, E_{q(du|x)}[∇x log q(u|x)] = 0; the first term of (31) is thus bounded below by ||∇x log (q_X / ρθ)||²_{q_X}, and so (28) follows from (31).

E. F(θt, mt, qt) is non-increasing

In the following proposition, we show that Ft := F(θt, mt, qt) is non-increasing in time.

Proposition E.1. For any γx ≥ 0 and γθ ≥ 0, it holds that

    Ḟt = −γθ ||∇m Ft||² − γx ||∇u δqFt||²_{qt} ≤ 0.

Proof. We begin by computing the time derivative

    Ḟt = ⟨∇(θ,m)Ft, (θ̇t, ṁt)⟩    (32)
       + ⟨∇(x,u)δqFt, (ẋt, u̇t)⟩_{qt}.    (33)

Decomposing the coefficient matrix of the (θ, m)-dynamics into its symmetric and skew-symmetric components (written D and Q, respectively), we can simplify the RHS of (32) to

    RHS(32) = −⟨∇(θ,m)Ft, D ∇(θ,m)Ft⟩ = −γθ ||∇m Ft||².

Similarly, we can show that (33) = −γx ||∇u δqFt||²_{qt}. As such, for γx ≥ 0 and γθ ≥ 0, the claim follows.

F. Proof of Theorem 4.1

The outline of the proof of Theorem 4.1 is:

Step 1 (Appendix F.1): explicitly compute an upper bound on the time derivative of F as a quadratic form.

Step 2 (Appendix F.2): under the conditions specified in (51), show that this time derivative is bounded above by another quadratic form that allows us to apply the log-Sobolev inequality of Proposition D.1.

Step 3: using the log-Sobolev inequality, Proposition D.1, and Grönwall's inequality, obtain the desired result.

The remaining sections develop the technical tools and carry out the explicit computations supporting Steps 1 and 2. Their particular roles are:

1. Appendix F.4.1: computing the time derivative of L for Step 1.

2.
Appendices F.4.2 to F.4.4: introducing the adjoint, the commutator, and a further operator, as well as their explicit forms.

3. Appendix F.4.5: using the operators and explicit forms of Appendices F.4.2 to F.4.4 to upper bound the terms introduced in Appendix F.4.1; this is used in Step 1.

4. Appendix F.4.6: bounding the cross terms (or interaction terms) between θ and x that arise in the time derivative.

5. Appendix F.4.7: establishing sufficient conditions for a matrix of a particular form to be positive semi-definite (Proposition F.18); this is used in Step 2.

Recall that the Lyapunov function is given by

    L := F − E* + ||∇(θ,m)F||²_{T(θ,m)} + ||∇(x,u)δqF||²_{T(x,u)},
    T(θ,m) := (1/2) [ τθ Idθ , τθm Idθ ; τθm Idθ , τm Idθ ],    T(x,u) := (1/2) [ τx Idx , τxu Idx ; τxu Idx , τu Idx ].    (34)

Proof of Theorem 4.1. The proof is completed in the following three steps.

Step 1. In Section F.1, we show that the time derivative of Lt := L(zt) satisfies the upper bound

    d/dt Lt ≤ −⟨∇(θ,m)Ft, S(θ,m) ∇(θ,m)Ft⟩ − ⟨∇(x,u)δqFt, S(x,u) ∇(x,u)δqFt⟩_{qt},

where S(x,u) and S(θ,m) are suitable matrices defined in (47) and (48), respectively.

Step 2. Then, in Section F.2, when Equation (51) holds for some rate φ > 0, we have

    S(θ,m) ⪰ φC (T(θ,m) + (1/2) I),    S(x,u) ⪰ φC (T(x,u) + (1/2) I),

where C := min{CE, ηθ, ηx}.

Step 3. By Steps 1 and 2, we obtain

    d/dt Lt ≤ −φC [ (1/2) ||∇(θ,m)Ft||² + ||∇(θ,m)Ft||²_{T(θ,m)} + (1/2) ||∇(x,u)δqFt||²_{qt} + ||∇(x,u)δqFt||²_{T(x,u)} ].

Comparing the above to (27), applying Proposition D.1 and Equation (18), we get d/dt Lt ≤ −φC Lt, and an application of Grönwall's inequality yields the final result:

    Ft − E* ≤ L0 exp(−φCt).

F.1. Proof of Step 1

As shown in Proposition F.2, the time derivative of the Lyapunov function L is given by

    L̇t = ⟨∇(θ,m)Lt, (θ̇t, ṁt)⟩    (35)
       + ⟨∇(x,u)δqLt, (ẋt, u̇t)⟩_{qt},    (36)

where we abbreviate Lt := L(θt, mt, qt) and Ft := F(θt, mt, qt). We deal with the inner products in (35, 36) one by one, starting with (35): we first compute the gradient of L w.r.t.
to (θ, m), finding that

    ∇(θ,m)L = ∇(θ,m)F + 2 [∇²(θ,m)F] T(θ,m) ∇(θ,m)F + 2 Eq[∇(θ,m)∇(x,u)δqF]ᵀ T(x,u) ∇(x,u)δqF,

where, as before, we assume that the derivative-integral exchange can be justified (see Appendix C.3). It then follows that

    (35) = −⟨∇(θ,m)Ft, Γθ ∇(θ,m)Ft⟩    (37)
         − 2⟨Eq[∇(θ,m)∇(x,u)[δqFt]]ᵀ T(x,u) ∇(x,u)δqFt, ∇(θ,m)Ft⟩,    (38)

where Γθ is computed directly from F's definition and the term 2T(θ,m)∇²(θ,m)Ft. Given that, for any matrix A, we have ⟨v, Av⟩ = ⟨v, A_sym v⟩ where A_sym := (A + Aᵀ)/2, then with Γ̄θ := (Γθ + Γθᵀ)/2 we obtain that

    (37) = −⟨∇(θ,m)Ft, Γ̄θ ∇(θ,m)Ft⟩,    (39)

and hence that

    Γ̄θ = [ τθm ηθ Idθ ,  ((τm + τθmγθ)/2) ηθ Idθ − (τθ/2) ∇²θFt ;
            ((τm + τθmγθ)/2) ηθ Idθ − (τθ/2) ∇²θFt ,  (2τmηθ + 1) γθ Idθ − τθm ∇²θFt ].    (40)

We now turn our attention to (36). Defining Ht(x, u) := sqrt( qt(x, u) / ρθt,ηx(x, u) ) and using Proposition F.2, we see that (36) decomposes into a dissipative term (41), a term (42) involving T(x,u), and a cross term (43) coupling ∇(θ,m)[log ρθt], T(θ,m), ∇(θ,m)Ft, and ∇(x,u) log Ht. Furthermore, we have

    (41) = −4γx ||∇u log Ht||²_{qt} = −γx ||∇u log (qt / ρθt,ηx)||²_{qt}.

In Proposition F.11, we show that (42) can be further simplified; combining it with (41), we obtain

    (41) + (42) = −2γx ⟨∇(x,u)∇u log (qt / ρθt,ηx), T(x,u) ∇(x,u)∇u log (qt / ρθt,ηx)⟩_{qt} − ⟨∇(x,u) log (qt / ρθt,ηx), Γ̄x ∇(x,u) log (qt / ρθt,ηx)⟩_{qt},

where Γ̄x := (Γx + Γxᵀ)/2 with

    Γx = [ τxu ηx Idx ,  ((τu + τxuγx)/2) ηx Idx + (τx/2) ∇²xℓ ;
           ((τu + τxuγx)/2) ηx Idx + (τx/2) ∇²xℓ ,  γx(2τuηx + 1) Idx + τxu ∇²xℓ ].    (44)

Given any p.s.d. matrix M ∈ Rd×d and a general matrix A ∈ Rd×d, it holds generically that ⟨A, MA⟩ = ⟨A, LLᵀA⟩ = ⟨LᵀA, LᵀA⟩ = ||LᵀA||² ≥ 0, where M = LLᵀ is the Cholesky decomposition of M. Since T(x,u) is positive semi-definite, the first term above is non-positive, and we have the upper bound

    (41) + (42) ≤ −⟨∇(x,u) log (qt / ρθt,ηx), Γ̄x ∇(x,u) log (qt / ρθt,ηx)⟩_{qt} = −⟨∇(x,u)δqFt, Γ̄x ∇(x,u)δqFt⟩_{qt}.    (45)

Cross terms (38) and (43). We will now deal with the cross terms of (43) and (38).
In Proposition F.17, we show the following upper-bound: for all ε > 0, it holds that (43) + (38) ε (θ,m)Ft, Γ (θ,m)Ft + 1 ε (x,u)δq Ft, (x,u)δq Ft where Γ = K2 τ 2 θ Idθ τθ τxu+τθm 2 Idθ τxu+τθm 2 2 + τ 2 x Idθ Conclusion of Step 1. Combining the results in (39, 45, 46), we upper bound the time derivative of the Lyapunov function as d dt Lt (θ,m)Ft, S(θ,m) (θ,m)Ft (x,u)δq Ft, S(x,u) (x,u)δq Ft S(x,u) = Γx 1 ε Idx τu+τxuγx 2 ηx Idx + τx 2 xℓ 2 ηx Idx + τx 2 xℓ γx(2τuηx + 1) 1 ε Idx + τxu 2 xℓ S(θ,m) = Γθ εΓ = sθIdθ sθm Idθ τθ 2 θFt sθm Idθ τθ 2 θFt sm Idθ τθm 2 θFt with constants defined as sθ := τθmηθ εK2τ 2 θ , sθm := τmηθ + τθmγθ 2 ηθ εK2τθ τxu + τθm sm := (2τmηθ + 1)γθ εK2 τxu + τθm Momentum Particle Maximum Likelihood F.2. Proof of Step 2 By the definitions of T(θ,m) and T(x,u) in (34), and those of S(θ,m) and S(x,u) in (47, 48), S(θ,m) φC T(θ,m) + 1 αθIdθ βθIdθ τθ 2 θF βθIdθ τθ 2 θF κθIdθ τθm 2 θF S(x,u) φC T(x,u) + 1 αx Idx βx Idx + τx 2 xℓ βx Idx + τx 2 xℓ κx Idx + τxu 2 xℓ αθ := τθmηθ εK2τ 2 θ φC τθ + 1 βθ := τm + τθmγθ κθ := (2τmηθ + 1)γθ εK2 τxu + τθm αx = τxuηx 1 ε φC τx + 1 βx = (τu + τxuγx)ηx κx = γx(2τuηx + 1) 1 ε φC τu + 1 Proposition F.1. Defining (αθ, βθ, κθ, αx, βx, κx) as above, assume that K 2CE if τθm K αθ κθ 0, (51a) τ 2 θ K2 + (τθmαθ 2βθτθ)K αθκθ + β2 θ 0, (51b) τ 2 θ K2 (τθmαθ 2βθτθ)K αθκθ + β2 θ 0, (51c) τxu K αx κx 0, (51d) τ 2 x K2 (τxuαx 2βxτx) K αxκx + β2 x 0, (51e) τ 2 x K2 + (τxuαx 2βxτx) K αxκx + β2 x 0, (51f) hold for some rate φ > 0. The following lower bounds for S(x,u) and S(θ,m) then hold: S(x,u) φC T(x,u) + 1 , S(θ,m) φC T(θ,m) + 1 where C := min{CE, ηθ, ηx}. Proof. Note the matrices in (49, 50) have the same form as that in (66). As we show below, KIdx 2 xℓ KIdx, KIdθ 2 θF KIdθ. (52) Then, applying Proposition F.18 tells us that (49, 50) are positive semidefinite if the conditions (51) are satisfied. Now, to prove the inequalities in (52). 
Assumption 1 and (Nesterov, 2003, Lemma 1.2.2) imply the first two inequalities and KIdθ 2 θℓ KIdθ. The other two inequalities in (52) follow from the above. To see this, for 2 θF KIdθ, we have v, (KIdθ 2 θF)v = Z v, (KIdθ + 2 θℓ)v q(dx du), and since v, 2 θ[ℓ]v K v 2, we have shown that v, (KIdθ 2 θF)v 0, implying the desired result 2 θF KIdθ. A similar argument can be made for KIdθ 2 θF. F.3. Examples of (51) holding Using a symbolic calculator written in Mathematica (see https://github.com/jenninglim/mpgd), we verified that the conditions (51) hold for the following choices of ηx, ηθ, γx, γθ and of the Lyapunov function's parameters: 1. Rate φ = 1/10, momentum parameters γx = 3/2, γθ = 3, ηx = 2K, ηθ = 2K, and elements of the Lyapunov function τθ = 3/4, τm = 2, τx = 1/4, and ε = 15. 2. Rate φ = 1/30, momentum parameters γx = 6 2, ηx = 2K, ηθ = 2K, and elements of the Lyapunov function τθ = 3/4, τm = 2, τx = 1/4, τxu = 1, τu = 3/2, and ε = 15. F.4. Supporting Proofs and Derivations F.4.1. THE DERIVATIVE OF Lt ALONG THE FLOW Here, we prove (35,36). Proposition F.2. The time derivative of the Lyapunov function L along the flow is given by d/dt Lt = − (θ,m)Lt, (θ,m)Ft − (x,u)δq Lt, (x,u)δq Ft qt, where the first variation is given by δq Lt = δq Ft + δq L1 t + δq L2 t = log qt ρθt,ηx + 1 + 2 (θ,m) log ρθt, T(θ,m) (θ,m)Ft + 4 Ht (x,u) T(x,u) (x,u) Ht , with Ht := Hθt,qt defined in Proposition F.5. Proof. From the linearity of L, we have Lt = Ft + L1 t + L2 t, L1 t := (θ,m)Ft, T(θ,m) (θ,m)Ft , L2 t := (x,u)δq Ft, T(x,u) (x,u)δq Ft . From Appendix C.2, it can be seen that the time derivatives of Ft, L1 t, and L2 t are given by Ft = D (θ,m)Ft, ( θt, mt) E + (x,u)δq Ft, ( xt, ut) , L1 t = D (θ,m)L1 t, ( θt, mt) E + (x,u)δq L1 t, ( xt, ut) , L2 t = D (θ,m)L2 t, ( θt, mt) E + (x,u)δq L2 t, ( xt, ut) , where ( θt, mt) = (θ,m)Ft and ( xt, ut) = (x,u)δq Ft. From Proposition F.3, Proposition F.4, and Proposition F.5, and the linearity of the inner product, we obtain the result as desired. Proposition F.3.
The first variation of F is given by δq F[z](x, u) = log q(x, u) ρθ,ηx(x, u) + 1, where z = (θ, m, q). Momentum Particle Maximum Likelihood Proof. It can be seen that F is (VI-I)-typed equation with f((θ, m), (x, u), q, g) = q log q ρθ,ηx(x). Thus, using the formula (25), we obtain as its first variation δq F[z](x, u) = log q(x, u) ρθ,ηx(x, u) + 1, Hence, we have as desired. For the first variation of L1, we have the following result: Proposition F.4. The first variation of L1 is given by δq L1[z](x, u) = 2 (θ,m) log ρθ(x), T(θ,m) (θ,m)F[z] , where z = (θ, m, q). Proof. It can be seen that L1 is a variational integral of type (VI-II) with f((θ, m), (x, u), (x , u ), q, q ) =qq θ log ρθ(x), Tθθ θ log ρθ(x ) q θ log ρθ(x), Tθmm q m, Tmθ θ log ρθ(x ) + m, Tmmm , where we write Tθθ for the upper left block of T(θ,m), and similarly for Tθm, and Tmm. Using the formula (26) and the fact that (4)f((θ, m), (x, u), (x , u ), q, q ) =q θ log ρθ(x), Tθθ θ log ρθ(x ) θ log ρθ(x), Tθmm , For the first term of (26), we have Z (4)f((θ, m), (x, u), (x , u ), q(x, u), q(x , u ))dx du = (θ,m) log ρθ(x), T(θ,m) (θ,m)F[z] , and by symmetry, it follows similarly for the last term of (26). We have obtained as desired. Proposition F.5. The first variation of L2 is given by δq L2[z] = 4 Hθ,q (x,u)T(x,u) (x,u)Hθ,q, where Hθ,q(x, u) := q q(x,u) ρθ,ηx(x,u) and the adjoint operator (x,u) (see Appendix F.4.2) is defined as (x,u)(f)(x, u) := (x,u) log ρθ,ηx(x, u), f(x, u) (x,u) [f](x, u), where x is the divergence operator w.r.t. x. Proof. First note that L2 can be written equivalently as (x,u)q q (x,u) log ρθ,ηx, T(x,u) q (x,u) log ρθ,ηx It can be seen that L2 is a variational integral of type (VI-I) with f((θ, m), (x, u), q, g) = q g q (x,u) log ρθ,ηx(x, u), T(x,u) q (x,u) log ρθ,ηx(x, u) . 
Momentum Particle Maximum Likelihood (3)f((θ, m), (x, u), q, g) = g q (x,u) log ρθ,ηx(x, u), T(x,u) q (x,u) log ρθ,ηx(x, u) q2 , T(x,u) q (x,u) log ρθ,ηx(x, u) q + (x,u) log ρθ,ηx(x, u), T(x,u) q (x,u) log ρθ,ηx(x, u) , (4)f(( , m), (x, u), q, g) =2T(x,u) q (x,u) log ρθ,ηx(x, u) , using formula (25), we obtain δq L2[z](x, u) = (x,u) log q(x, u) + (x,u) log ρθ,ηx(x, u), T(x,u) (x,u) log q(x, u) ρθ,ηx(x, u) (x,u) 2T(x,u) (x,u) log q(x, u) ρθ,ηx(x, u) To obtain the desired result, note that we can rewrite (53a) = (x,u) log q + (x,u) log ρθ,ηx, 4 Hθ,q T(x,u) (x,u) log Hθ,q (53b) = (x,u) 4 Hθ,q T(x,u) (x,u)Hθ,q = (x,u) 4 Hθ,q , T(x,u) (x,u)Hθ,q 4 Hθ,q (x,u) (T(x,u) (x,u)Hθ,q) = (x,u) log Hθ,q, 4 Hθ,q T(x,u) (x,u)Hθ,q 4 Hθ,q (x,u) (T(x,u) (x,u)Hθ,q). Hence, we have obtained the desired result. F.4.2. THE ADJOINT OF Throughout the proofs, we use to denote the linear map that is an adjoint of (the Euclidean gradient operator). In the view of as a linear map from (L2 ρ(X; R), , ρ) to (L2 ρ(X; X), , ρ) where X {Rdx, Rdx Rdx}, ρ P(X), and L2 ρ(X; X) := {sufficiently regular v : X X such that ||v||ρ < }. Then its adjoint is given by ρ [ρv] = log ρ, v v. where denotes the divergence operator. Whatever the gradient is taken with respect to, the adjoint is taken with respect to the corresponding quantity. We show the adjoint property for the case where X = Rdx Rdx as the other case follows similarly. To begin. f, v ρ = Z f(x, u), v(x, u) ρ(dx du) = Z f(x, u), ρ(x, u)v(x, u) dx du = Z f(x, u) [ρ(x, u)v(x, u)] dx du = f, v ρ , Momentum Particle Maximum Likelihood where we used integration by parts combined with the divergence theorem and the fact that ρ must vanish at the boundary to obtain the second equation. Another case of interest is the view of as a linear map from (L2 ρ(X; X ), , ρ) to (L2 ρ(X; X X ), , ρ), where the underlying inner product on the latter space is induced by the Frobenius inner product, and as before, X, X {Rdx, Rdx Rdx}, ρ P(X). 
The adjoint of for some M L2(X, X X ) is defined as ( M) := M log ρ M X , where M denotes the divergence operator for a matrix field defined as ( M)i = j Mji in Einstein s notation. Each element of ( M)i is given by ( M)i := log ρ, M ei (M ei), where {ei}i is the standard basis of X . For the case X = Rdx Rdx, the adjoint is property can be shown as follows: let M L2(X, X X ) and v L2(X, X ) v, M ρ = Z d X X j=1 ( v)ij Mij ρ(dx du) j=1 i[vj] Mij ρ(dx du) = Z d X X j=1 vj, Mej ρ(dx du) j=1 (vj) (ρMej) dx du j=1 (vj)[ log ρ, Mej + (Mej)] ρ(dx du) where, for the second line, we apply integration by parts combined with the divergence theorem and the fact that ρ vanishes at the boundary. F.4.3. COMMUTATOR [ , ] Another quantity widely used in the proof is the commutator denoted by [ , ]. It can be thought of as an indicator of whether two operators commute. It is defined as [A, B]f = (AB)f (BA)f. If [A, B]f = 0, then A and B is said to commute. In this section, we list propositions and their proofs that will be useful for the proof of Theorem 4.1. Proposition F.6. Let f : Rdx Rdx R be sufficiently regular, then we have [ u, u] uf = ηx uf. Proof. Expanding out the commutator, we obtain [ u, u] uf = u u uf u u uf u u log ρθ,ηx, uf u[ u uf] + 2 uf u log ρθ,ηx + u ( 2 uf), and since ( u[ u uf])i = ui uj ujf = uj uj uif = ( u ( 2 uf))i. We have that [ u, u] uf = 2 u log rηx uf = ηx uf. Momentum Particle Maximum Likelihood Proposition F.7. Let f : Rdx Rdx R be sufficiently regularly, we have that u and x x commutes. In other words, we have [ u, x x]f = 0. Proof. We begin with the definition [ u, x x]f = u x xf x x uf, For the first term, we have u x xf = u[ x log ρ, xf + x xf] = u xf x log ρ u[ x xf]. As for the second term, we have x x uf = u xf x log ρ x ( x uf). Equality or [ u, x x]f = 0 follows from the following fact: ( x ( x uf))i = xj xj uif = ui xj xjf = ( u x ( xf))i. Proposition F.8. 
For f : Rdx Rdx R be sufficiently regular, we have that x and u u commutes, i.e., we have [ x, u u]f = 0. Proof. Expanding out the commutator, we shall show that [ x, u u]f = x u uf u u xf = 0. This follows since the first term can be written as x u uf = x u log rηx, uf + x[ u ( uf)] = x u[f] u log rηx + x[ u ( uf)], and, for the second term, we have u u xf = u x[f] u log rηx + u ( u xf). Note that u ( u xf) = x[ u ( uf)], which can be seen from ( x u ( uf))i = xi uj ujf = uj uj xif = ( u ( u xf))i. Hence, we have as desired. F.4.4. ANOTHER OPERATOR B To simplify the notation, we drop the subscripts and write ρ := ρθ,ηx. Let f : Rdx Rdx R, another operator (denoted by B) that we are interested in is defined as follow: B[f] : Rdx Rdx R := (x,u) Q (x,u)f . By expanding out the adjoint, B can be written equivalently as B[f] = (x,u) log ρ, Q (x,u)f (x,u) Q (x,u)f Momentum Particle Maximum Likelihood = xf, u log ρ + x log ρ, uf . (54) One can generalize the operator B to vector-valued functions f : Rdx Rdx Rn as follows (its equivalence when n = 1 is easy to see): B[f] : Rdx Rdx Rn := (x,u) Q (x,u)f . This can be equivalently written as B[f] = (x,u)f Q (x,u) log ρ (x,u) Q (x,u)f where we use the fact that (x,u) Q (x,u)f In the following proposition, we show that the operator is anti-symmetric. Proposition F.9. For all f, g : Rdx Rdx Rn, we have B[f], g ρ = f, B[g] ρ. Proof. From the definition, we have B[f], g ρ = D (x,u) Q (x,u)f , g E ρ = Q (x,u)f, (x,u)g where we use the adjoint operator of for the second equality. Continuing from the last line, B[f], g ρ = (x,u)f, Q (x,u)g ρ = D f, (x,u)Q (x,u)g E ρ = f, B[g] ρ . Proposition F.10. For f : Rdx Rdx R, we have [ u, B]f = ηx xf, and [ x, B]f = 2 x log ρ uf Proof. For the first equality, we begin by expanding out the commutator [ u, B]f = u B[f] B u[f]. 
From Equation (54), we have u B[f] = u xf u log ρ 2 u log ρ xf + 2 uf x log ρ, and, from Equation (55), we have B u[f] = u (x,u)f = u xf u log ρ + 2 uf x log ρ, then, we must have [ u, B]f = 2 u log ρ xf = ηx xf. For the second inequality, we begin similarly [ x, B]f = x B[f] B x[f]. Momentum Particle Maximum Likelihood For the first term, from Equation (54), we have x B[f] = 2 xf u log ρ + 2 x log ρ uf + x uf x log ρ. For the second term, from Equation (55), we have B x[f] = x (x,u)f = 2 xf u log ρ + x uf x log ρ. Hence, we have [ x, B]f = [ x, B]f = 2 x log ρ uf. F.4.5. SIMPLIFYING (42) Proposition F.11 (Simplifying (42)). We have that , T(x,u) (x,u) u (x,u) log qt ρt , Γx (x,u) log qt where ρt := ρθt,ηx and 2τxuηx Idx (τuηx + τxuγxηx) Idx + 2τx 2 x[ℓ] (τuηx + τxuγxηx) Idx + 2τx 2 x[ℓ] 4γxτuηx Idx + 2τxu 2 x[ℓ] Proof. Recall that Ht := q qt ρθt . We begin by applying the quotient rule to obtain the fact that (x,u)T(x,u) (x,u)Ht = (x,u) (x,u) T(x,u) (x,u)Ht Ht (x,u)Ht (x,u) T(x,u) (x,u)Ht Then, we can write (42) as * (x,u) (x,u) T(x,u) (x,u)Ht (x,u) log Ht * (x,u)Ht (x,u) T(x,u) (x,u)Ht (x,u) log Ht We can simplify the terms individually given in Proposition F.13 and Proposition F.12 for (56) and (57) Hence, we can write (42) equivalently the following: (42) = 16γx (x,u) u [Ht] , T(x,u) (x,u)Ht Ht u Ht, T(x,u) (x,u)Ht 8γx (x,u) u [Ht] , T(x,u) (x,u) u [Ht] Momentum Particle Maximum Likelihood 4τxuηx x Ht 2 ρt (58d) 4 x Ht, (τuηx + τxuγxηx) Idx + 2τx 2 x [ℓ] u Ht 8 D u Ht, h τuηxγx Idx + τxu 2 2 x [ℓ] i u Ht E We can rewrite the sum of (58a), (58b), (58c) as (58a) + (58b) + (58c) (x,u) u [Ht] (x,u)Ht Ht u Ht, T(x,u) (x,u) u [Ht] (x,u)Ht Since we have that x u Ht = x(Ht u log Ht) = ( x Ht)( u log Ht) + Ht( x u log Ht), we can write the above as (58a) + (58b) + (58c) = 8γx (x,u) u [log Ht] , T(x,u) (x,u) u [log Ht] , T(x,u) (x,u) u As for the other terms, we have that (58d) + (58e) + (58f) = 4 D (x,u) log Ht, Γx (x,u) log Ht E = (x,u) log qt ρt , Γx (x,u) log qt 
2τxuηx Idx (τuηx + τxuγxηx) Idx + 2τx 2 xℓ (τuηx + τxuγxηx) Idx + 2τx 2 xℓ 4γxτuηx Idx + 2τxu 2 xℓ We have as desired. Proposition F.12 (Simplifying (57)). We have (x,u) u [Ht] , T(x,u) (x,u)Ht Ht u Ht, T(x,u) (x,u)Ht Proof. Noting that 1 Ht (x,u) T(x,u) (x,u)Ht is scalar-valued and using H2 t to change the inner product from , qt to , ρt, we obtain * (x,u) T(x,u) (x,u)Ht * (x,u) T(x,u) (x,u)Ht Ht , u Ht 2 2 Then, using the adjoint of operator (x,u) as was described in Appendix F.4.2, we obtain (x,u) u Ht 2 2 Ht , T(x,u) (x,u)Ht Momentum Particle Maximum Likelihood Then, using the quotient rule, we have that * (x,u) u Ht 2 Ht , T(x,u) (x,u)Ht * u Ht 2 2 (x,u)Ht H2 t , T(x,u) (x,u)Ht (x,u) u [Ht] u Ht Ht , T(x,u) (x,u)Ht * u Ht 2 (x,u)Ht H2 t , T(x,u) (x,u)Ht Finally, we obtain (x,u) u [Ht] , T(x,u) (x,u)Ht Ht u Ht, T(x,u) (x,u)Ht where denotes the outer product, and we use the fact that Mv, u = M, u v F and u Ht 2 (x,u)Ht = (x,u)Ht u Ht u Ht. Proposition F.13 (Simplifying (56)). We have (56) = 8γx (x,u) u [Ht] , T(x,u) (x,u) u [Ht] ρt 4τxuηx x Ht 2 ρt 8 D u Ht, τxηxγx Idx + τxu 2 2 x log ρt u Ht E ρt 4 x Ht, [τxηx + τxuγxηx] Idx + 2τx 2 x log ρt u Ht where ρt := ρθt,ηx. Proof. We begin by changing the inner product from , qt to , ρt (x,u) (x,u) T(x,u) (x,u)Ht , Since we have (x,u) T(x,u) (x,u)Ht = x h τx x Ht + τxu 2 u Ht i + u hτxu 2 x Ht + τu u Ht i , (56) = 4τxu (x,u) [ x u Ht + u x Ht] , (x,u) [ x ( x Ht)] , (x,u) [ u ( u Ht)] , In Proposition F.14, F.15 and F.16 we deal with these respective terms: Momentum Particle Maximum Likelihood Proposition F.14 (Simplifying (59a)). We have (59a) = 4τxuηx x Ht 2 ρt 4τxu u Ht, 2 x [ℓ] u Ht ρt 8γxτxu 2 u [Ht] , u x [Ht] ρt 4γxηxτxu x Ht, u Ht ρt . Proposition F.15 (Simplifying (59b)). We have (59b) = 8τxγx x u Ht 2 ρt 8τx x Ht, 2 x [ℓ] u Ht Proposition F.16 (Simplifying (59c)). (59c) = 8τuγx 2 u Ht 2 ρt 8τuηxγx u Ht 2 ρt 4τuηx x Ht, u Ht ρt. Their proofs can be found below. 
Thus, summing up the results of Proposition F.14, Proposition F.15 and Proposition F.16, we obtain (56) =(59a) + (59b) + (59c) = 8τxuγx 2 u Ht, u x [Ht] 8τuγx 2 u Ht 2 ρt (60b) 8τxγx x u Ht, x u Ht ρt (60c) 4τxuηx x Ht 2 ρt 8 D u Ht, τuηxγx Idx + τxu 2 2 x [ℓ] u Ht E ρt 4 x Ht, (τuηx + τxuγxηx) Idx + 2τx 2 x [ℓ] u Ht Since we can write the following sum equivalently as: (60a) + (60b) + (60c) = 8γx (x,u) u [Ht] , T(x,u) (x,u) u [Ht] we obtain as desired. Proof of Proposition F.14. Beginning from the definition, we have (59a) = 4τxu (x,u) [ x u Ht + u x Ht] , (x,u) [ x u Ht + u x Ht] , ρt = 4γxτxu u [ x u Ht + u x Ht] , u Ht ρt (61a) (x,u) [ x u Ht + u x Ht] , Momentum Particle Maximum Likelihood We will deal with (61a) and (61b) separately. Using the adjoints of u, u, x, we obtain (61a) = 4γxτxu u Ht, x u u Ht ρt 4γxτxu x Ht, u u u Ht ρt , From Proposition F.8, we have that x commutes with u u (61a) = 4γxτxu u Ht, u u x [Ht] ρt 4γxτxu x Ht, u u u Ht ρt , We can write this in terms of the commutator [ , ] (defined in Appendix F.4.3) using the fact that [ u, u] uf = ηx uf (as shown in Proposition F.6). Thus, we have (61a) = 8γxτxu u Ht, u u x [Ht] ρt 4γxτxu x Ht, [ u, u] u Ht ρt = 8γxτxu 2 u [Ht] , u x [Ht] ρt 4γxηxτxu x Ht, u Ht ρt . (62) As for (61b), we start by writing it in terms of B (see Appendix F.4.4) B[Ht] := (x,u)Q (x,u)Ht, we can write (61b) = 4τxu D (x,u) (u,x)Ht, B[Ht] E Using the fact that the adjoint of B denoted by B w.r.t. 
the inner product , ρ is B = B (see Proposition F.9) we obtain (61b) = 4τxu (u,x)Ht, (x,u)B[Ht] = 4τxu x Ht, u B[Ht] ρt + u Ht, x B[Ht] ρt = 4τxu x Ht, u B[Ht] ρt + u Ht, B x[Ht] ρt + u Ht, [ x, B][Ht] ρt = 4τxu x Ht, u B[Ht] ρt B u Ht, x[Ht] ρt + u Ht, [ x, B][Ht] ρt = 4τxu x Ht, [ u, B][Ht] ρt + u Ht, [ x, B][Ht] ρt From Proposition F.10, we have [ u, B][Ht] = ηx x Ht and [ x, B][Ht] = 2 x[ℓ] u Ht, and so (61b) = 4τxuηx x Ht 2 ρt 4τxu u Ht, 2 x [ℓ] u Ht Thus, summing up (62) and (63), we have as desired (59a) = (62) + (63) = 4τxuηx x Ht 2 ρt 4τxu u Ht, 2 x [ℓ] u Ht ρt 8γxτxu 2 u [Ht] , u x [Ht] ρt 4γxηxτxu x Ht, u Ht ρt . Proof of Proposition F.15. Beginning from the definition and following in the same expansion as in Proposition F.14, and using the adjoint of (x.u) ( i.e., (x.u)) and adjoint of x (i.e., x) we obtain the following: (59b) = 8τxγx u [ x ( x Ht)] , u Ht ρt 8τx x Ht, x B[Ht] ρt . Writing this in terms of the commutator, we get (59b) = 8τxγx u [ x ( x Ht)] , u Ht ρt 8τx x Ht, [ x, B]Ht + B x Ht ρt = 8τxγx x u Ht, x u Ht ρt 8τx x Ht, 2 x [ℓ] u Ht where, for the first term, we use the fact that u commutes with x x (see Proposition F.7); and, for the second term, we use Proposition F.10 and x Ht, B x Ht ρt = 0 from antisymmetry (see Proposition F.9). Momentum Particle Maximum Likelihood Proof of Proposition F.16. Beginning with the definition and following in the same expansion as in Proposition F.14, we obtain (59c) = 8τuγx u u [ u Ht] , u Ht ρt u Ht, u (x,u) ρt = 8τuγx u u [ u Ht] , u Ht ρt 4τu u Ht, u B[Ht] ρt . 
Writing in terms of the commutator [ , ] and B, we have equivalently (59c) = 8τuγx u u [Ht] , u u [Ht] ρt 8τuγx [ u, u] u Ht, u Ht ρt 4τu u Ht, B u Ht ρt 4τuηx x Ht, u Ht ρt, where, for the last term, we use the fact of Proposition F.10 to obtain u Ht, [ u, B]Ht ρt = ηx x Ht, u Ht ρt Using the antisymmetric property of B, we have u Ht, B u Ht ρt = 0; and, from Proposition F.6, the fact that [ u, u] uf = ηx uf, we have (59c) = 8τuγx 2 u [Ht] , 2 u [Ht] ρt 8τuηxγx u Ht, u Ht ρt 4τuηx x Ht, u Ht ρt. F.4.6. BOUNDING THE CROSS TERMS (38) AND (43). Proposition F.17. For all ε > 0, we have that (38) + (43) ε (θ,m)Ft, Γ (θ,m)Ft + 1 ε (x,u)δq Ft 2 qt, where Γ = K2 τ 2 θ Idθ τθ τxu+τθm 2 Idθ τxu+τθm 2 2 + τ 2 x Idθ Proof. Because T(x,u) is symmetric, (38) = 2 At (θ,m)Ft, (x,u)δq Ft qt , (43) = 2 D A t (θ,m)Ft, (x,u)δq Ft E At := T(x,u) (x,u) (θ,m) [δq Ft] (x,u) (θ,m) [log ρθt] T(θ,m). (x,u) (θ,m) [δq Ft] = (x,u) (θ,m) [log ρθt,ηx] = x θ [log ρθt] 0dx dθ 0dx dθ 0dx dθ Momentum Particle Maximum Likelihood we can re-write At and A t as 0dx dθ τx x θ [log ρθt] 2 x θ [log ρθt] 0dx dθ 0dx dθ τθ x θ [log ρθt] τθm 2 x θ [log ρθt] (38) + (43) =2 At (θ,m)Ft, (x,u)δq Ft At := A t At = 0 τx x θ [log ρθt] τθ x θ [log ρθt] (τxu+τθm) 2 x θ [log ρθt] Fix any ε > 0. Applying the Cauchy-Schwarz inequality and Young s inequality, we obtain that (38) + (43) 2 At (θ,m)Ft qt (x,u)δq Ft qt ε At (θ,m)Ft 2 (x,u)δq Ft 2 qt . (64) At (θ,m)Ft 2 qt = τx x θ [log ρθt] m Ft 2 qt + x θ [log ρθt] τθ θFt + τxu + τθm Given that θF(z) and m F(z) are constant in x, the above reads At (θ,m)Ft 2 qt = x θ log ρθt, τx m Ft 2 qt θ log ρθt, τθ θFt + τxu + τθm Fix any θ, v in Rdθ. Because (θ,x)ℓis K-Lipschitz (Assumption 1), the function fv,θ(x) := θ log ρθ(x), v = θℓ(θ, x), v x Rdx, is (K v )-Lipschitz: by the Cauchy-Schwarz inequality, fv,θ(x) fv,θ(x ) = θ log ℓ(θ, x) θℓ(θ, x ), v θℓ(θ, x) θℓ(θ, x ) v , (θ,m)ℓ(θ, x) (θ,m)ℓ(θ, x ) v K v x x x, x Rdx. 
Recalling that Lipschitz seminorms can be estimated by suprema of the norm of the gradient (e.g., see Van Handel (2014, Lemma 4.3)), we then see that xfv,θ 2 qt sup x Rdx xfv,θ(x) 2 K2 v 2. Applying the above inequality to (65), we find that At (θ,m)Ft 2 qt K2 τx m Ft 2 + τθ θFt + τxu + τθm = (θ,m)Ft, Γ (θ,m)Ft . Plugging the above into (64) yields the desired inequality. F.4.7. POSITIVE SEMI-DEFINITENESS CONDITIONS In this section, we establish sufficient conditions ensuring that a matrix of the form described in (66) is positive semi-definite. Proposition F.18. Given $\alpha, \beta, \kappa, a, c \in \mathbb{R}$, let $A$ be a symmetric matrix of the form $$A = \begin{pmatrix} \alpha I & \beta I + a H \\ \beta I + a H & \kappa I + c H \end{pmatrix},$$ where $H$ is a symmetric matrix that satisfies $-K I \preceq H \preceq K I$ for some $K > 0$. If $A$ satisfies the following conditions: $$\begin{cases} cK - \alpha - \kappa \le 0 & \text{if } c \ge 0, \\ -cK - \alpha - \kappa \le 0 & \text{otherwise,} \end{cases}$$ $$a^2 K^2 - (c\alpha - 2\beta a)K - \alpha\kappa + \beta^2 \le 0, \qquad a^2 K^2 + (c\alpha - 2\beta a)K - \alpha\kappa + \beta^2 \le 0,$$ then $A$ is positive semi-definite. Proof. We prove this by showing that, if the conditions are satisfied, then the eigenvalues of $A$ are non-negative. The eigenvalues $l$ of $A$ satisfy the characteristic equation $\det(A - lI) = 0$. We have that $$\det(A - lI) = \det\big((\alpha - l)(\kappa - l)I + c(\alpha - l)H - (\beta I + a H)^2\big),$$ where we use the fact that $\kappa I + cH$ and $\beta I + aH$ are polynomials in the same matrix $H$ and hence commute (e.g., see Silvester (2000)). Let $H = U \Lambda U^\top$ be the eigenvalue decomposition of $H$; then observe that $$\det(A - lI) = \det\big(U\big(((\alpha - l)(\kappa - l) - \beta^2)I + (c(\alpha - l) - 2\beta a)\Lambda - a^2 \Lambda^2\big)U^\top\big).$$ Hence, we obtain for each eigenvalue $\Lambda_i$ of $H$ the constraint $$((\alpha - l)(\kappa - l) - \beta^2) + (c(\alpha - l) - 2\beta a)\Lambda_i - a^2 \Lambda_i^2 = 0.$$ These constraints can be written equivalently as $$l^2 + (-c\Lambda_i - \alpha - \kappa)l + \alpha\kappa - \beta^2 - a^2\Lambda_i^2 + (c\alpha - 2\beta a)\Lambda_i = 0.$$ Since each constraint is a quadratic equation in $l$, we can solve for $l$ with the quadratic formula. It can be verified that the discriminant of the above quadratic equation is non-negative (hence it has real roots).
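As a numerical sanity check on Proposition F.18 (an illustrative NumPy sketch of our own; the function names are assumptions, and the paper's own verification of such conditions is a separate Mathematica notebook): under the stated sufficient conditions, the assembled block matrix should have a non-negative spectrum for every symmetric H with −KI ⪯ H ⪯ KI.

```python
import numpy as np

def conditions_hold(alpha, beta, kappa, a, c, K):
    """The sufficient conditions of Proposition F.18 for -K*I <= H <= K*I."""
    linear = (c * K if c >= 0 else -c * K) - alpha - kappa <= 0
    q_minus = a**2 * K**2 - (c * alpha - 2 * beta * a) * K - alpha * kappa + beta**2 <= 0
    q_plus = a**2 * K**2 + (c * alpha - 2 * beta * a) * K - alpha * kappa + beta**2 <= 0
    return linear and q_minus and q_plus

def is_psd_block(alpha, beta, kappa, a, c, H, tol=1e-10):
    """Assemble A = [[alpha*I, beta*I + a*H], [beta*I + a*H, kappa*I + c*H]]
    for a symmetric H and test positive semi-definiteness via its spectrum."""
    d = H.shape[0]
    I = np.eye(d)
    off = beta * I + a * H
    A = np.block([[alpha * I, off], [off, kappa * I + c * H]])
    return np.linalg.eigvalsh(A).min() >= -tol
```

For instance, (α, β, κ, a, c, K) = (3, 0, 3, 1, 1, 1) satisfies the conditions and gives a p.s.d. A for every admissible H, while shrinking the diagonal to α = κ = 0.1 with a = 1 violates the quadratic conditions and yields an indefinite A for H = I.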
To ensure that $l$ is non-negative, we require that $$c\Lambda_i - \alpha - \kappa \le 0, \qquad (67)$$ $$a^2\Lambda_i^2 - (c\alpha - 2\beta a)\Lambda_i - \alpha\kappa + \beta^2 \le 0, \qquad (68)$$ for all $\Lambda_i \in [-K, K]$. The constraints on $\Lambda_i$ come from the fact that $-KI \preceq H \preceq KI$ and the min-max theorem. Since (67) is linear and (68) is quadratic (with non-negative leading coefficient) in $\Lambda_i$, we only require that the inequalities are satisfied at the end points. Namely, we obtain the conditions $$\begin{cases} cK - \alpha - \kappa \le 0 & \text{if } c \ge 0, \\ -cK - \alpha - \kappa \le 0 & \text{otherwise,} \end{cases}$$ $$a^2 K^2 - (c\alpha - 2\beta a)K - \alpha\kappa + \beta^2 \le 0, \qquad a^2 K^2 + (c\alpha - 2\beta a)K - \alpha\kappa + \beta^2 \le 0.$$ This concludes the proof. G. Existence and Uniqueness of Strong Solutions to (15) In this section, we show the existence and uniqueness of strong solutions to the McKean-Vlasov SDE (15) under the Lipschitz condition of Assumption 1. The structure is as follows: 1. We begin by showing that F has Lipschitz θ-gradients (Proposition G.1). 2. Then, we show that the drift, defined in Equation (69), is Lipschitz (Proposition G.2). 3. Finally, we prove existence and uniqueness (Proposition 3.1). In this section, we write the SDE (15) equivalently as d(ϑ, Υ)t = b(Υt, ϑt, Law(Υt))dt + βd Wt, (69) where ϑt = (θt, mt) R2dθ, Υt = (Xt, Ut) R2dx, β = 2Diag([02dθ, 0dx, 1dx]) (where Diag(x) returns a diagonal matrix whose elements are given by x), and b : R2dx × R2dθ × P(R2dx) → R2dθ+2dx is defined as $$b((x, u), (\theta, m), q) = \big(\eta_\theta m,\; -\gamma_\theta \eta_\theta m - \nabla_\theta F(\theta, q),\; \eta_x u,\; -\gamma_x \eta_x u + \nabla_x \ell(\theta, x)\big).$$ We now prove that θF is Lipschitz. Since the θ-gradients do not depend on the momentum parameter m, we will drop the dependence on m for brevity. Proposition G.1 (F has Lipschitz θ-gradient). Under Assumption 1, the θ-gradient of F is Lipschitz, i.e., there exists a constant KF > 0 such that the following inequality holds: θF (θ, q) θF (θ , q ) KF( θ θ + W1(q, q )), for all θ, θ Rdθ and q, q P(R2dx). Proof. Starting from the definition, adding and subtracting the same quantity, and applying the triangle inequality, we obtain θF (θ, q) θF (θ , q ) θF (θ, q) θF (θ , q) + θF (θ , q) θF (θ , q ) . We treat the terms on the RHS separately.
For the first, from Jensen's inequality, we obtain θF (θ, q) θF (θ , q) = Z θℓ(θ, x) θℓ(θ , x) q X(dx) Z θℓ(θ, x) θℓ(θ , x) q X(dx) K θ θ , where the last inequality follows from Assumption 1. As for the other term, we have θF (θ , q) θF (θ , q ) Z θℓ(θ , x)(q q )(x, u)dxdu K Z 1 K θℓ(θ , x) |q q |(x, u)dxdu (a) 2KW1(q, q ). For (a), we use the fact that (x, u) 7 θℓ(θ , x) is K-Lipschitz (under Assumption 1), and so the map x 7 (1/K) θℓ(θ , x) is 1-Lipschitz; the dual representation of W1 then yields the bound. To conclude, by combining the two bounds, we have shown that θF is Lipschitz with constant KF = 2K. We now prove that the drift of the SDE (69) is Lipschitz. Proposition G.2 (Lipschitz Drift). Under Assumption 1, the drift b is Lipschitz, i.e., there is some constant Kb > 0 such that the following inequality holds: b(Υ, ϑ, q) b(Υ , ϑ , q ) Kb( Υ Υ + ϑ ϑ + W1(q, q )), for all ϑ, ϑ R2dθ, Υ, Υ R2dx, and q, q P(R2dx). Proof. We begin with the definition, apply the triangle inequality, and use the fact that, by the concavity of the square root, √(a + b) ≤ √a + √b for a, b > 0, to obtain b(Υ, ϑ, q) b(Υ , ϑ , q ) ηθ(1 + γθ) m m + θF(θ, q) θF(θ , q ) + ηx(1 + γx) u u + xℓ(θ, x) xℓ(θ , x ) ηθ(1 + γθ) m m + KF( θ θ + W1(q, q )) + ηx(1 + γx) u u + K( θ θ + x x ) 2 max{ηθ(1 + γθ), KF + K} ϑ ϑ + 2 max{ηx(1 + γx), K, KF}( Υ Υ + W1(q, q )), where we use the Lipschitz condition on ℓ (Assumption 1) and Proposition G.1. Hence, we obtain the desired result. Proof of Proposition 3.1. The proof is similar to Carmona (2016, Theorem 1.7) but with key generalizations to the product space. Fix some ν C([0, T], R2dθ P(R2dx)). We denote by νϑ and νΥ the projections onto the R2dθ and P(R2dx) components, respectively. Substituting νt into (69) in place of Law(Υt) and ϑt, it follows from Carmona (2016, Theorem 1.2) that there exists a unique strong solution for any initial point (ϑ, Υ)0. More explicitly, we have (ϑ, Υ)ν t = (ϑ, Υ)0 + Z t 0 b(Υν s , νϑ s , νΥ s )ds + Z t for t [0, T].
We define the operator FT : C([0, T], R2dθ P(R2dx)) C([0, T], R2dθ P(R2dx)) as FT : ν (t 7 (ϑν t , Law(Υν t ))). Clearly, if the process (ϑ, Υ)t is a solution to (69) then the function t 7 (ϑt, Law(Υt)) is a fixed point to the operator FT , and vice versa. Now we establish the existence and uniqueness of the fixpoint of the operator FT . We begin by endowing the space R2dθ P(R2dx) with the metric: d((ϑ, q), (ϑ , q )) = p ϑ ϑ 2 + W2(q, q )2. Note that the metric space (R2dθ P(R2dx), d) is complete (Villani, 2009). First note that using Jensen s inequality, b is Lipschitz, and the fact that (a + b + c + d)2 4(a2 + b2 + c2 + d2) we obtain ϑν t ϑν t 2 + E[ Υν t Υν t 2] = E 0 b(Υν s, νϑ s , νΥ s ) b(Υν s , ν ϑ s , ν Υ s )ds 0 E b(Υν s, νϑ s , νΥ s ) b(Υν s , ν ϑ s , ν Υ s ) 2 ds 0 E Υν s Υν s 2 + ϑν s ϑν s 2 + νϑ s ν ϑ s 2 + W2 1(νΥ s , ν Υ s )ds, where C = 4K2 b . Applying Grönwall s inequality, we obtain that ϑν t ϑν t 2 + E[ Υν t Υν t 2] C(t) Z t h W2 1(νΥ s , ν Υ s ) + νϑ s ν ϑ s 2i ds, Momentum Particle Maximum Likelihood where C(t) = Ct exp(Ct2). Then, using the fact that the LHS is an upper bound for the squared distance d and W1 W2 (Villani, 2009, Remark 6.6), we have d2(FT (ν)t, FT (ν )t) C(t) Z t h W2 2(νΥ s , ν Υ s ) + νϑ s ν ϑ s 2i ds 0 d2(νs, ν s)ds t C(T) sup s [0,T ] d2(νs, ν s), (70) where we use the fact that C(t) C(T) since t [0, T]. We show that for k 1 successive compositions of the map FT denoted by F k T , we have the following inequality: d2(F k T (ν)t, F k T (ν )t) (t C(T))k k! sup u [0,T ] d2(νu, ν u). (71) This can be proved inductively. The base case k = 1 follows immediately from (70), assume the inequality holds for k 1, then for k we have d2(F k T (ν)t, F k T (ν )t) C(T) Z t 0 d2(F k 1 T (ν)s, F k 1 T (ν )s)ds (k 1)! sup u [0,T ] d2(νu, ν u) Z t k! sup u [0,T ] d2(νu, ν u). Hence, we have shown that the inequality (71) holds. Taking the supremum, we obtain sup s [0,T ] d2(F k T (ν)s, F k T (ν )s) (TC(T))k k! sup s [0,T ] d2(νs, ν s). 
Since k 7 (TC(T))k o(k!), there exists a large enough k such that there is a constant 0 < α < 1 for the following inequality holding: sup s [0,T ] d2(F k T (ν)s, F k T (ν )s) α sup s [0,T ] d2(νs, ν s). Hence, we have shown that the operator F k T is a contraction and from the Banach Fixed Point theorem and completeness of the space (C([0, T], Rdθ P(Rdx)), sup d), we have existence and uniqueness. H. Space Discretization In this section, we establish asymptotic pointwise propagation of chaos results. We are interested in justifying the use of a particle approximation in the flow (15). The flow (15) can be rewritten equivalently as follows. Let (Υi 0, W i)i [M] be i.i.d. copies of (Υ0, W). We write (ϑ, {Υi}M i=1) as solutions of (15) starting at (ϑ0, {Υi 0}M i=1) driven by the {W i}M i=1. In other words, (15) satisfies dϑt = bϑ ϑt, Law(Υ1 t) dt, i [M] : dΥi t = bΥ(Υi t, ϑt)dt + βΥ d W i t . bϑ(ϑt, q) = γθηθm θF(θ , q) , bΥ(Υ, ϑ) = γxηxu + xℓ(θ, x) 2Diag([0dx, 1dx]). Clearly, for all i, j [M], we have Law(Υi t) = Law(Υj t). Momentum Particle Maximum Likelihood We will justify that we can replace the Law(Υ1 t) with a particle approximation to obtain the approximate process (20), or equivalently as: i=1 δΥi,M t i [M] : dΥi,M t = bΥ(Υi,M t , ϑM t )dt + βΥd W i t . (73b) Similarly to Proposition G.2, we can show that bϑ and bΥ are both Lipschitz. We are now ready to prove Proposition 5.1 justifying that (20) is a good approximation to (15). Proof of Proposition 5.1. This is equivalent to showing that sup t [0,T ] ϑt ϑM t 2 + 1 i=1 Υi t Υi,M t 2 )# More specifically, we will show that sup t [0,T ] ϑt ϑM t 2 # sup t [0,T ] i=1 Υi t Υi,M t 2 # We begin with (a). From Jensen s inequality, we have bϑ(ϑs, Law(Υ1 s)) bϑ(ϑM s , 1 i=1 δΥi,M s ) 0 E ϑs ϑM s 2 + EW2 2 Law(Υ0 s), 1 i=1 δΥi,M s where we use the fact that bϑ is Kbϑ Lipschitz, Cϑ = 2TK2 bϑ, and E(a + b)2 2(Ea2 + Eb2) (known as the Cr inequality (Loève, 1977, p157)). 
Note that by the triangle inequality Law(Υ0 s), 1 i=1 δΥi,M s Law(Υ0 s), 1 i=1 δΥis, 1 i=1 δΥi,M s The two terms can be bounded using Carmona (2016, Eq. (1.24) and Lemma 1.9) to produce the inequality Law(Υ0 s), 1 i=1 δΥi,M s i=1 E Υi s Υi,M s 2. Hence, we have that E ϑs ϑM s 2 + o(1) + 1 i=1 E Υi s Υi,M s 2 ) where C ϑ = 2Cϑ. We use the fact that ϑs ϑM s 2 sup s [0,s] ϑs ϑM s 2, 1 M i=1 Υi s Υi,M s 2 sup s [0,s] i=1 Υi s Υi,M s 2, (74) E sup s [0,s] ϑs ϑM s 2 + o(1) + E sup s [0,s] i=1 Υi s Υi,M s 2 ) Similarly, for (b), we have 0 bΥ(Υi s, ϑs) bΥ(Υi,M s , ϑM s ) 2ds 0 ϑs ϑM s 2 + 1 i=1 Υi s Υi,M s 2ds 0 E sup s [0,s] ϑs ϑM s 2 + E sup s [0,s] i=1 Υi s Υi,M s 2 ds, where CΥ := 2TK2 bΥ; for the last line, we use the trick in (74). Combining the bounds for (a) and (b), we obtain (a) + (b) C Z T E sup s [0,s] ϑs ϑM s 2 + o(1) + E sup s [0,s] i=1 Υi s Υi,M s 2 ) where C = CΥ + C ϑ. Applying Grönwall's inequality, we obtain sup t [0,T ] ϑt ϑM t sup t [0,T ] i=1 Υi t Υi,M t Taking the limit as M → ∞, we obtain the desired result. I. Time Discretization In this section, we are concerned with discretization schemes for various ODEs/SDEs. The structure is as follows: Appendix I.1. We describe the Euler-Maruyama discretization of PGD as described in Kuntz et al. (2023). Appendix I.2. We show that NAG can be obtained as a discretization of the damped Hamiltonian flow. Appendix I.3. We show a discretization of MPD (described in Equation (15)) using a scheme replicating NAG, as described in Appendix I.2, for the (θ, m)-components, and Cheng et al. (2018)'s scheme for the (x, u)-components. Appendix I.4. We derive the transition using a scheme inspired by Cheng et al. (2018) while incorporating the NAG-style gradient correction described in Sutskever et al. (2013). I.1. PGD discretization In order to obtain an implementable system, it is standard to discretise the distribution qt by representing it with a finite particle system, i.e. qt (dx) 1 M P i [M] δ Xi t, dx .
Upon making this approximation, one obtains the system i [M] θℓ θt, Xi t , for i [M] , d Xi t = xℓ θt, Xi t dt + 2 d W i t , in which all terms are readily available. Discretising this process in time then yields the Particle Gradient Descent (PGD) algorithm of (Kuntz et al., 2023), i.e. for k 1, iterate θk = θk 1 + h i=1 θℓ θk 1, Xi k 1 ! Momentum Particle Maximum Likelihood for i [M], Xi k = Xi k 1 + h xℓ θk 1, Xi k 1 + where ϵi k i.i.d. N(0dx, Idx), for some initialization (θ0, {Xi 0}M i=1). I.2. NAG as a discretization Recall the momentum-enriched ODE is given by θt + γη θt + η θf(θt) = 0. η θt, then we can write the above equivalently as the following coupled first-order ODEs: mt = γηmt f(θt), θt = ηmt. t and η = 1, we show that a particular discretization of the above is equivalent to Nesterov Accelerated Gradient (NAG) method (Nesterov, 1983). The argument is inspired by reversing the one of Su et al. (2014) who obtained the continuous limit of NAG. Recall that NAG (Nesterov, 1983), for convex f (but not strongly convex), is defined by the following iteration: θk = yk 1 h f (yk 1) , yk = θk + k 1 (θk θk 1) . k), we have θk yk 1 h θf (yk 1) , yk θk + 1 3 (θk θk 1) . (76) We will now show how a particular discretization scheme will produce (76). It uses a combination of implicit Euler for θt, and use a semi-implicit Euler scheme for mt. More specifically, the semi-implicit scheme for mt = 3 t mt θf(θt) uses an explicit approximation for the momentum 3 t mt, and implicit approximation for the gradient θf(θt). In summary, we obtain the following discretization h + θf(θt) , θt θt We can write the above equivalently through the map t 7 t h and in terms of k := t h, to obtain mk 1 h θf(θk). From our discretization, we have that hmk θk θk 1 then we obtain θk θk 1 + 1 3 k 1 (θk 1 θk 2) h θf(θk). We can write the θ update in terms of yk (cf. (76)), θk yk 1 h θf(θk). If f is Lipschitz, then we have h θf(θk) h θf(yk 1) Lh θk yk 1 h2L θf(θk) o(h). 
We replace the gradient ∇θf(θk) with ∇θf(yk−1) to obtain, as desired,

θk ≈ yk−1 − h ∇θf(yk−1),  yk ≈ θk + (1 − 3/k)(θk − θk−1).

Interestingly, for a given step size h, the (approximate) NAG iterations approximate the flow for time √h, as opposed to h (Su et al., 2014, Section 3.4).

I.3. Nesterov's and Cheng's discretization of MPD

In this section, we show how to discretize the MPD using a combination of Nesterov's (see Appendix I.2) and Cheng et al. (2018)'s discretizations. Recall that we have

dθt = ηθ mt dt,
dmt = −∇θE(θt, q^M_{t,X}) dt − γθηθ mt dt,
dX^i_t = ηx U^i_t dt for i ∈ [M],
dU^i_t = ∇xℓ(θt, X^i_t) dt − γxηx U^i_t dt + √(2γx) dW^i_t for i ∈ [M].

Discretizing the θ-component in the style of Nesterov (specifically, Sutskever et al. (2013)'s formulation) and the q-component in the style of Cheng et al. (2018) yields the MPD-NC (Nesterov-Cheng) algorithm, described as follows: given previous values (θk, vk, {X^i_k, U^i_k}_{i=1}^M) and step size h > 0, we iterate

θk+1 = θk + vk,  (77a)
vk+1 = µvk − h² ∇θE(θk + µvk, q^M_{k,X}),  (77b)
X^i_{k+1} = X^i_k + ((1 − ωx(h))/γx) U^i_k + (∇xℓ(θk+1, X^i_k)/γx)(h − (1 − ωx(h))/(γxηx)) + L^{XX}_Σ ξ^i_k,  (77c)
U^i_{k+1} = ωx(h) U^i_k + ((1 − ωx(h))/(γxηx)) ∇xℓ(θk+1, X^i_k) + L^{XU}_Σ ξ^i_k + L^{UU}_Σ ξ′^i_k,  (77d)

for all i ∈ [M], where µ := 1 − hγθηθ, ωx(t) := exp(−γxηxt), (L^{XX}_Σ, L^{XU}_Σ, L^{UU}_Σ) are described in (83), and {ξ^i_k, ξ′^i_k} ~ i.i.d. N(0, Idx). Each iteration corresponds to (approximately) solving (15) for time h. In Appendix I.3.1, we show how Nesterov's discretization produces the update (77a)-(77b); in Appendix I.3.2, given (θk+1, {X^i_k, U^i_k}_{i=1}^M), we show that the (X, U)-transition is given by (77c)-(77d).

I.3.1. DISCRETIZATION OF (θt, mt)

This discretization follows similarly to that in Appendix I.2.
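The (θ, m)-update (77a)-(77b) is, in code, just Sutskever et al. (2013)'s form of NAG with gradient step size h²; a minimal sketch (the function name and the generic gradient callable are ours):

```python
def nag_step(theta, v, grad, mu, h):
    """One step of (77a)-(77b): lookahead gradient at theta + mu*v,
    gradient step size h**2, in Sutskever et al. (2013)'s form of NAG."""
    v_new = mu * v - h**2 * grad(theta + mu * v)
    return theta + v_new, v_new
```

For a fixed particle cloud, `grad` would be the map θ ↦ ∇θE(θ, q^M_{k,X}); the sketch treats it as a generic callable.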
Similarly, let k := t/√h with the time rescaling t ↦ t√h; then we consider the following discretization:

(mk − mk−1)/√h = −γθηθ mk−1 − ∇θE(θk, q^M_{k−1,X}),  (θk − θk−1)/√h = mk,

i.e., mk = (1 − √h γθηθ) mk−1 − √h ∇θE(θk, q^M_{k−1,X}). Expanding mk, we obtain

θk = θk−1 + √h mk = θk−1 + (1 − √h γθηθ)(θk−1 − θk−2) − h ∇θE(θk, q^M_{k−1,X}).

Letting vk := θk − θk−1, we have

θk = θk−1 + vk,  vk = µvk−1 − h ∇θE(θk, q^M_{k−1,X}),

where µ := 1 − √h γθηθ. Then, as before, using the approximation h ∇θE(θk, q^M_{k−1,X}) = h ∇θE(θk−1 + µvk−1, q^M_{k−1,X}) + o(h), we obtain

θk = θk−1 + vk,  vk = µvk−1 − h ∇θE(θk−1 + µvk−1, q^M_{k−1,X}).

Similarly to Appendix I.2, this approximates the flow for time √h instead of h. Hence, we apply the appropriate rescaling (replacing h with h²) to obtain the following iteration with the desired behaviour:

θk = θk−1 + vk,  vk = µvk−1 − h² ∇θE(θk−1 + µvk−1, q^M_{k−1,X}),

where µ := 1 − hγθηθ. This is exactly Sutskever et al. (2013)'s characterization of NAG (see Equations 3 and 4 in their paper).

I.3.2. DISCRETIZATION OF (Xt, Ut)

We describe the discretization scheme of Cheng et al. (2018). For simplicity, we derive the transition of a single particle (Xt, Ut), since there are no interactions between the particles given θ. Furthermore, for brevity and without loss of generality, we derive the transition for some step size h > 0 given initial values (θ, X0, U0); this is easily generalized to subsequent transitions. Consider a time interval t ∈ (0, h]. Given (X0, U0), we first approximate the gradient ∇xℓ(θ, Xt) with ∇xℓ(θ, X0) to arrive at a linear SDE. A 2dx-dimensional linear SDE is given by

dX̄t = (AX̄t + α) dt + β dWt,  (79)

where A, β ∈ R^{2dx×2dx} and α ∈ R^{2dx} are fixed. It is clear that if we set

A = [[0dx, ηx Idx], [0dx, −γxηx Idx]],  α = (0dx, ∇xℓ(θ, X0)),  β = [[0dx, 0dx], [0dx, √(2γx) Idx]],

then the discretized underdamped Langevin SDE of (78) falls within the class of linear SDEs of the form specified in (79).
These SDEs admit the following explicit solution (Platen and Bruti-Liberati, 2010, see pages 48 and 101):

X̄t = Ψt ( X̄0 + ∫₀ᵗ Ψs⁻¹ α ds + ∫₀ᵗ Ψs⁻¹ β dWs ),  where Ψt := exp(At),  (80)

with exp understood as the matrix exponential. In our case, since Aᵏ = [[0dx, ηx(−γxηx)^{k−1} Idx], [0dx, (−γxηx)ᵏ Idx]] for k ≥ 1, the matrix exponential and its inverse are given by

Ψt = [[Idx, ((1 − ωx(t))/γx) Idx], [0dx, ωx(t) Idx]],  Ψt⁻¹ = [[Idx, ((1 − ωx(−t))/γx) Idx], [0dx, ωx(−t) Idx]],  (81)

where ωx(t) := exp(−γxηxt). It can be verified that these are indeed inverses of each other, i.e., ΨtΨt⁻¹ = I2dx. Plugging (81) into (80), we obtain

Ut = ωx(t) U0 + ∇xℓ(θ, X0) ∫₀ᵗ ωx(t − s) ds + √(2γx) ∫₀ᵗ ωx(t − s) dWs
  = ωx(t) U0 + ((1 − ωx(t))/(γxηx)) ∇xℓ(θ, X0) + √(2γx) ∫₀ᵗ ωx(t − s) dWs,

Xt = X0 + ((1 − ωx(t))/γx) U0 + (∇xℓ(θ, X0)/γx)(t − (1 − ωx(t))/(γxηx)) + √(2/γx) ∫₀ᵗ (1 − ωx(t − s)) dWs.

As noted by Cheng et al. (2018), this is a Gaussian transition. To characterize it, we need to calculate its first two moments. For the first moments, we have

µU(X0, U0, t) := E[Ut] = ωx(t) U0 + ((1 − ωx(t))/(γxηx)) ∇xℓ(θ, X0),
µX(X0, U0, t) := E[Xt] = X0 + ((1 − ωx(t))/γx) U0 + (∇xℓ(θ, X0)/γx)(t − (1 − ωx(t))/(γxηx)).

For the second moments, we have

ΣUU(t) := E[(Ut − E[Ut])(Ut − E[Ut])ᵀ] = 2γx ∫₀ᵗ ωx(2(t − s)) ds Idx = ((1 − ωx(2t))/ηx) Idx,
ΣXX(t) := E[(Xt − E[Xt])(Xt − E[Xt])ᵀ] = (2/γx) ∫₀ᵗ [1 − ωx(t − s)]² ds Idx = (1/γx)(2t − (3 − 4ωx(t) + ωx(2t))/(γxηx)) Idx,
ΣUX(t) := E[(Ut − E[Ut])(Xt − E[Xt])ᵀ] = 2 ∫₀ᵗ ωx(t − s)(1 − ωx(t − s)) ds Idx = (1/(γxηx))(1 − 2ωx(t) + ωx(2t)) Idx.

Hence, the transition can be described as N(µ(t), Σ(t)), where

µ(t) = (µX(Xk, Uk, t), µU(Xk, Uk, t)),  Σ(t) = [[ΣXX(t), ΣUX(t)], [ΣUX(t), ΣUU(t)]].

Therefore, given some point (Xk, Uk) and some step size h > 0, we have that

(Xk+1, Uk+1) ~ N(µ(h), Σ(h)).  (82)

Sampling from the transition.
Using samples from a standard Gaussian z ~ N(0d, Id), one may produce samples from a multivariate distribution N(µ, Σ) using the fact that X = µ + Lz ~ N(µ, Σ), where L is a lower-triangular matrix with positive diagonal entries obtained via the Cholesky decomposition of Σ, i.e., Σ = LLᵀ. We now describe the Cholesky decomposition of the covariance matrix in (82). Consider the LDL decomposition of a symmetric block matrix,

[[A, Bᵀ], [B, D]] = [[I, 0], [BA⁻¹, I]] [[A, 0], [0, S]] [[I, A⁻¹Bᵀ], [0, I]],

where S = D − BA⁻¹Bᵀ is the Schur complement. Given Cholesky factorizations of A and S, written A = L_A L_Aᵀ and S = L_S L_Sᵀ respectively, we can write the Cholesky decomposition of the block matrix as

[[A, Bᵀ], [B, D]] = [[L_A, 0], [B L_A⁻ᵀ, L_S]] [[L_A, 0], [B L_A⁻ᵀ, L_S]]ᵀ.

In our case, since every block of Σ(h) is a scalar multiple of Idx, we obtain the Cholesky factor

L_Σ = [[L^{XX}_Σ Idx, 0dx], [L^{XU}_Σ Idx, L^{UU}_Σ Idx]],  (83)

where the constants are defined as

L^{XX}_Σ := √( (1/γx)(2h − (3 − 4ωx(h) + ωx(2h))/(γxηx)) ),
L^{XU}_Σ := (1/(γxηx))(1 − 2ωx(h) + ωx(2h)) / L^{XX}_Σ,
L^{UU}_Σ := √( (1 − ωx(2h))/ηx − (L^{XU}_Σ)² ).

Therefore, the transition can be written as follows:

X^i_{k+1} = X^i_k + ((1 − ωx(h))/γx) U^i_k + (∇xℓ(θ, X^i_k)/γx)(h − (1 − ωx(h))/(γxηx)) + L^{XX}_Σ ξ^i_k for i ∈ [M],
U^i_{k+1} = ωx(h) U^i_k + ((1 − ωx(h))/(γxηx)) ∇xℓ(θ, X^i_k) + L^{XU}_Σ ξ^i_k + L^{UU}_Σ ξ′^i_k for i ∈ [M],

where {ξ^i_k, ξ′^i_k}_{i∈[M],k} ~ i.i.d. N(0dx, Idx).

I.4. Proposed discretization of (21)

Recall that our approximating SDE in (21) is given by:

dθ̄t = ηθ m̄t dt,
dm̄t = −γθηθ m̄t dt − ∇θE(θ̄0, q̄^M_{0,X}) dt,
dX̄^i_t = ηx Ū^i_t dt for i ∈ [M],
dŪ^i_t = −γxηx Ū^i_t dt + ∇xℓ(θ̄0, X̄^i_0) dt + √(2γx) dW^i_t for i ∈ [M].

For simplicity and without loss of generality, we assume there is a single particle, i.e., M = 1, and derive the transition for a single step.
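Since every block of Σ(h) is a scalar multiple of Idx, the factors in (83) can be computed per coordinate. The following sketch assembles them from the covariance entries derived above; note that these scalar formulas are our reconstruction of the garbled constants in (83), so treat them as an assumption:

```python
import numpy as np

def transition_chol(h, gamma_x, eta_x):
    """Scalar Cholesky factors of the (X, U) transition covariance Sigma(h).
    A 2x2 factorization of the scalar factors suffices: L_xx first, then
    L_xu = s_ux / L_xx, then the Schur complement s_uu - L_xu**2 under the
    final square root."""
    w1 = np.exp(-gamma_x * eta_x * h)        # omega_x(h)
    w2 = np.exp(-2.0 * gamma_x * eta_x * h)  # omega_x(2h)
    s_xx = 2.0 * h / gamma_x - (3.0 - 4.0 * w1 + w2) / (gamma_x**2 * eta_x)
    s_ux = (1.0 - 2.0 * w1 + w2) / (gamma_x * eta_x)
    s_uu = (1.0 - w2) / eta_x
    L_xx = np.sqrt(s_xx)
    L_xu = s_ux / L_xx
    L_uu = np.sqrt(s_uu - L_xu**2)
    return L_xx, L_xu, L_uu
```

The Schur complement s_uu − L_xu² is positive for every h > 0 by the Cauchy-Schwarz inequality, so both square roots are well defined.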
Clearly, (21) can be written as a linear SDE in dimension 2dθ + 2dx:

dX̄t = (AX̄t + α) dt + β dWt,

with

A = blockdiag( [[0dθ, ηθ Idθ], [0dθ, −ηθγθ Idθ]], [[0dx, ηx Idx], [0dx, −ηxγx Idx]] ),
α = (0dθ, −∇θE(θ̄0, q̄^M_{0,X}), 0dx, ∇xℓ(θ̄0, X̄0)),
β = blockdiag( 0dθ, 0dθ, 0dx, √(2γx) Idx ).

Hence, it admits the following explicit solution (Platen and Bruti-Liberati, 2010, see pages 48 and 101):

X̄t = Ψt ( X̄0 + ∫₀ᵗ Ψs⁻¹ α ds + ∫₀ᵗ Ψs⁻¹ β dWs ),  where Ψt := exp(At),

with exp understood as the matrix exponential. In our case, similarly to (81), we have

Ψt = blockdiag( [[Idθ, ((1 − ωθ(t))/γθ) Idθ], [0dθ, ωθ(t) Idθ]], [[Idx, ((1 − ωx(t))/γx) Idx], [0dx, ωx(t) Idx]] ),
Ψt⁻¹ = blockdiag( [[Idθ, ((1 − ωθ(−t))/γθ) Idθ], [0dθ, ωθ(−t) Idθ]], [[Idx, ((1 − ωx(−t))/γx) Idx], [0dx, ωx(−t) Idx]] ),

where ωx(t) := exp(−γxηxt) and, similarly, ωθ(t) := exp(−γθηθt). Hence, we have

θ̄t = θ̄0 + ((1 − ωθ(t))/γθ) m̄0 − (∇θE(θ̄0, q̄^M_{0,X})/γθ)(t − (1 − ωθ(t))/(γθηθ)),
m̄t = ωθ(t) m̄0 − ((1 − ωθ(t))/(γθηθ)) ∇θE(θ̄0, q̄^M_{0,X}),
X̄^i_t = X̄^i_0 + ((1 − ωx(t))/γx) Ū^i_0 + (∇xℓ(θ̄0, X̄^i_0)/γx)(t − (1 − ωx(t))/(γxηx)) + L^{XX}_Σ ξ for i ∈ [M],
Ū^i_t = ωx(t) Ū^i_0 + ((1 − ωx(t))/(γxηx)) ∇xℓ(θ̄0, X̄^i_0) + L^{XU}_Σ ξ + L^{UU}_Σ ξ′ for i ∈ [M],

where the constants are exactly those of Equation (83) and ξ, ξ′ ~ N(0, Idx). Setting t = h, the general iterations are given by

θk+1 = θk + ((1 − ωθ(h))/γθ) mk − (∇θE(θk, q^M_{k,X})/γθ)(h − (1 − ωθ(h))/(γθηθ)),
mk+1 = ωθ(h) mk − ((1 − ωθ(h))/(γθηθ)) ∇θE(θk, q^M_{k,X}),
X^i_{k+1} = X^i_k + ((1 − ωx(h))/γx) U^i_k + (∇xℓ(θk, X^i_k)/γx)(h − (1 − ωx(h))/(γxηx)) + L^{XX}_Σ ξ^i_k for i ∈ [M],
U^i_{k+1} = ωx(h) U^i_k + ((1 − ωx(h))/(γxηx)) ∇xℓ(θk, X^i_k) + L^{XU}_Σ ξ^i_k + L^{UU}_Σ ξ′^i_k for i ∈ [M],

where {ξ^i_k, ξ′^i_k}_{k,i∈[M]} ~ i.i.d. N(0, Idx).

J. Practical Concerns and Experimental Details

Here, we describe practical considerations and experimental details. First, in Appendix J.1, we describe the RMSProp preconditioner (Tieleman and Hinton, 2012; Staib et al., 2019) used in our experiments. Then, in Appendix J.2, we describe the subsampling procedure used for our image generation task. After that, in Appendix J.3, we discuss the heuristic we introduced for choosing the momentum parameters (γθ, γx, ηθ, ηx). Finally, in Appendix J.4, we detail all the parameters and models used in our experiments, for reproducibility.

J.1.
RMSProp Preconditioner

Given 0 < β < 1, the RMSProp preconditioner Gk (Tieleman and Hinton, 2012; Staib et al., 2019) at iteration k is defined by the iteration

Gk+1 = βGk + (1 − β) ∇θE(θk, q^M_{k,X})²,

where G0 = 0dθ and x² denotes the element-wise square.

J.2. Subsampling

In the presence of a large dataset, it is common to develop computationally cheaper implementations by appropriate use of data subsampling. Such a scheme was devised in Kuntz et al. (2023, Appendix E.4). We utilize the same subsampling scheme for the (θ, m)-components. However, we found it necessary to devise a new scheme for the (x, u)-components; this alternative subsampling scheme substantially improved the performance of MPD but did not noticeably affect the PGD algorithm.

Ideally, one would update the whole particle cloud at every iteration. However, in cases where each sample has its own posterior cloud approximation, as in the generative modelling task, the dataset can become prohibitively large for updating the whole cloud. In Kuntz et al. (2023), the subsampling scheme only updated the portion of the particle cloud associated with the mini-batch, which neglects the remainder of the posterior approximation. As such, we propose to update the subsampled cloud at the beginning of each iteration with a step proportional to the time/steps it has missed. This is described more succinctly in Algorithm 1.

Algorithm 1 A single subsampled step. In pink, we indicate the existing subsampling scheme of Kuntz et al. (2023). We indicate our proposed additions in teal.
Require: subsampled indices I, subsampled data {yi}_{i∈I}, step size h, previous iterates (θk, mk, {X^i_k, U^i_k}_{i∈I})
  // Update cloud for missed time
  {X^i_{k+ϵ}, U^i_{k+ϵ}}_{i∈I} ← Solve Equations (21c) and (21d) with step size h · missed[I] and (θk, mk, {X^i_k, U^i_k}_{i∈I}).
  // Reset missed time
  missed[I] = 0
  // Take a step with step size h
  θk+1, mk+1 ← Solve Equations (21a) and (21b) with step size h and (θk, mk, {X^i_{k+ϵ}, U^i_{k+ϵ}}_{i∈I}).
  X^I_{k+1}, U^I_{k+1} ← Solve Equations (21c) and (21d) with step size h and (θk, mk, {X^i_{k+ϵ}, U^i_{k+ϵ}}_{i∈I}).
  // Increment missed time for particles that were not updated
  missed[not in I] = missed[not in I] + 1

J.3. Heuristic

We (partially) circumvent the choice of momentum parameters by leveraging the relationship between Equation (5) and NAG to define a heuristic, which we briefly describe here for completeness. For a given step size h, one can define the momentum coefficient µθ = 1 − hγθηθ (and similarly for µx). Since µ has typical values in NAG, we can fix, say, µθ = 0.9 together with a value of γθ and solve for a suitable value of ηθ (and similarly for ηx given γx). In our experiments, we found that γθ ∈ [1/2, 5] performed well. Another possible approach to handling the hyperparameters is to borrow inspiration from adaptive restart methods (O'Donoghue and Candès, 2015). While some practical heuristics exist, the problem of tuning these hyperparameters remains, to a large extent, open; we leave this topic for future work.

In Appendix I.3.1, we show how we can discretize Equation (15) to obtain the following update in the (θ, m)-components:

θk = θk−1 + vk,  vk = µθvk−1 − h² ∇θE(θk−1 + µθvk−1, q^M_{k−1,X}),

where µθ = 1 − hγθηθ and h > 0 is the step size. We refer to the resulting iterations as MPD-NAG-Cheng. This is exactly the update used in Sutskever et al. (2013)'s characterization of NAG with step size h². Since there are some well-accepted choices of µθ in NAG (e.g., Nesterov (1983) advocates µt = 1 − 3/(t + 5)), we can leverage the formula µθ = 1 − hγθηθ to define a suitable choice of ηθ for a fixed γθ. One can, of course, fix ηθ instead, but we found suitable values of γθ easier to choose than values of ηθ.
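The heuristic amounts to inverting µ = 1 − hγη for η; a one-line helper (the function name is ours):

```python
def eta_from_momentum(mu, gamma, h):
    """Invert mu = 1 - h * gamma * eta for eta, given a target NAG-style
    momentum coefficient mu, damping gamma, and step size h."""
    return (1.0 - mu) / (h * gamma)
```

With the MNIST settings of Appendix J.5 (h = 10⁻⁴, γθ = 0.9, µθ = 0.95), this returns ηθ ≈ 556, matching the value reported there.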
We note that the (θ, m)-update of MPD-NAG-Cheng depends on (γθ, ηθ) only through the value of µθ, which is not the case for our discretization. In Figure 4, we show the effect of varying the momentum coefficient µθ while keeping the damping parameter γθ fixed, and vice versa. In Figure 4a, it can be seen that, for a fixed damping parameter γθ, varying the momentum coefficient changes MPD-NAG-Cheng significantly while our discretization changes little. Conversely, in Figure 4b, MPD-NAG-Cheng does not change while ours varies significantly. This suggests that the momentum coefficient in our discretization no longer has the same interpretation as in MPD-NAG-Cheng. Nonetheless, for suitably chosen γθ and momentum parameter µθ, our discretization performs better than MPD-NAG-Cheng. Hence, we use this as a heuristic when choosing the momentum parameters (γθ, γx, ηθ, ηx).

Figure 4: The effect of the damping parameter γθ and the momentum coefficient µθ on MPD. (a) Fixed γθ, varying µθ. (b) Fixed µθ, varying γθ.

J.4. Experimental Details

Here, for reproducibility, we outline the settings for the toy hierarchical model (Appendix J.4.1) and the image generation experiment (Appendix J.4.2); additional figures are in Appendix J.6. Much akin to Kuntz et al. (2023, Section 2), we found it beneficial to consider separate time scales hθ and hx for the (θt, mt)- and (Xt, Ut)-components, respectively.

J.4.1. TOY HIERARCHICAL MODEL

Model. The model is described in Kuntz et al. (2023, Example 1). For completeness, we describe it here. The model is defined by

pθ(y, x) = Π_{i=1}^{dx} N(yi; xi, 1) N(xi; θ, σ²).  (84)

Thus, for the log-likelihood with σ² = 1, we have

log pθ(x, y) = C − (1/2) Σ_{i=1}^{dx} [(xi − θ)² + (yi − xi)²],

where C is a constant independent of θ and x.

Figure 1a. For the different regimes, we use the following momentum parameters:

Underdamped: γθ = 0.1, ηθ = 403.96, γx = 0.1, ηx = 403.96.
Overdamped: γθ = 1, ηθ = 403.96, γx = 1, ηx = 403.96.
Critically damped: γθ = 0.7, ηθ = 403.96, γx = 0.7, ηx = 403.96.

The step sizes are hθ = 10⁻²/N = 10⁻⁴ and hx = 10⁻², with number of samples N = 100 and number of particles M = 100.

Figure 1b. We compare the discretizations outlined in Appendix I.3 and Section 5. We keep all parameters equal except for the momentum coefficient µ. For the step sizes, we have hθ ≈ 5.8 × 10⁻³ and hx = 10⁻³, with γx = γθ = 0.5 and the momentum coefficient set to µθ = µx = µ for µ ∈ {0.9, 0.8, 0.5}.

Lipschitz constant and strong log-concavity. For the toy hierarchical model with σ² = 1, one can show that the model satisfies our Lipschitz assumption and is strongly log-concave. We have

∇²_{(θ,x)} log pθ = [[−dx, 1ᵀ_{dx}], [1_{dx}, −2 Idx]].

The characteristic equation can be written as det(∇²_{(θ,x)} log pθ − l Idx+1) = 0. The determinant can be evaluated by expanding the minors; thus, we obtain

det(∇²_{(θ,x)} log pθ − l Idx+1) = (−dx − l)(−2 − l)^{dx} − dx(−2 − l)^{dx−1}
 = (−2 − l)^{dx−1}((2 + l)(dx + l) − dx)
 = (−2 − l)^{dx−1}(l² + (2 + dx)l + dx).

Hence, the characteristic equation is satisfied when l ∈ {−2, (−(2 + dx) ± √(dx² + 4))/2}. Note that (2 + dx) > √(dx² + 4), so every eigenvalue is strictly negative. From the above calculations, it can be seen that log pθ is strongly concave; specifically, ∇²_{(θ,x)} log pθ ⪯ −λ Idx+1 with λ = ((2 + dx) − √(dx² + 4))/2 > 0. In order to calculate the Lipschitz constant of Assumption 1, by Nesterov (2003, Lemma 1.2.2), it is equivalent to find a constant Khm > 0 such that ‖∇²_{(θ,x)} log pθ‖ ≤ Khm. Since ∇²_{(θ,x)} log pθ is symmetric, the matrix norm is given by the largest absolute value of its eigenvalues (i.e., its spectral radius). Hence, we have Khm = ((2 + dx) + √(dx² + 4))/2.

J.4.2. IMAGE GENERATION

The MNIST and CIFAR-10 datasets are processed similarly. We use N = 5000 images for training and 3000 for testing. The model is updated 6280 times using subsampling with a batch size of 32; hence, it completes 40 epochs. The model is defined as

pθ(y, x) = Π_{i=1}^N pθ(yi, xi),

where the datum (yi, xi) ∈ R^{dy} × R^{dx} denotes a single image and its corresponding latent variable.
For MNIST, we have dy = 28 × 28 = 784, and for CIFAR-10, dy = 32 × 32 × 3 = 3072. For both datasets, we set dx = 64. Thus, we have (y, x) ∈ R^{dy·N} × R^{dx·N}. For the VAE model, we have

pθ(yi, xi) = N(yi | gψ(xi), σ²I) pϕ(xi),

where ψ and ϕ are the parameters of the generator and the prior, respectively. Thus, the parameter of the model is θ = {ψ, ϕ}. We set σ² = 0.1. The following specifies the details of the generator gψ and the prior pϕ.

Generator gψ. For CIFAR-10, we use a convolution-transpose network (as in Pang et al. (2020)), shown in Table 2. For MNIST, we use the fully connected network specified in Table 3 with din = dx and dout = dy.

Prior pϕ(x). The prior is inspired by the VampPrior (Tomczak and Welling, 2018) and is defined as

pϕ(x) = (1/K) Σ_{i=1}^K N(x | µν(zi), σ²ν(zi) Idx),

where µν, σ²ν : R^{dz} → R^{dx} are neural networks (with parameters ν) that parameterize the mean and variance, and the pseudo-points {zi}_{i=1}^K are optimized; the prior parameters are thus ϕ := {ν} ∪ {zi}_{i=1}^K. The architecture can be found in Table 4 with din = dz and dout = dx.

Table 2: Generator network for the VAE used for CIFAR-10 (ngf = 256).
Layers | Size | Stride | Pad
Input | 1 × 1 × 128 | - | -
8×8 ConvT(ngf × 8), LReLU | 8 × 8 × (ngf × 8) | 1 | 0
4×4 ConvT(ngf × 4), LReLU | 16 × 16 × (ngf × 4) | 2 | 1
4×4 ConvT(ngf × 2), LReLU | 32 × 32 × (ngf × 2) | 2 | 1
3×3 ConvT(3), Tanh | 32 × 32 × 3 | 1 | 1

Table 3: Generator network for the VAE used for MNIST.
Layers | Size
Linear(dx, 512), LReLU | 512
Linear(512, 512), LReLU | 512
Linear(512, 512), LReLU | 512
Linear(512, dy), Tanh | dy

Table 4: Neural network parameterization of the mean and variance of a Gaussian distribution, used in VI and the VampPrior. Id is the identity function, and [:dout] is standard pseudo-code notation for the operation that extracts the first dout dimensions (similarly for [dout:]).
Layers | Size
Linear(din, 512), LReLU | 512
Linear(512, 512), LReLU | 512
Linear(512, 512), LReLU | 512
Linear(512, dout × 2) | dout × 2
Output: Id([:dout]), Softplus([dout:]) | dout, dout
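For concreteness, Table 4's mean/variance parameterization can be sketched at shape level in NumPy (our own illustration with untrained random weights; the LeakyReLU slope of 0.01 and the initialization scale are assumptions not stated in the text):

```python
import numpy as np

def gauss_net_forward(params, x, d_out):
    """Forward pass of the Table-4 network: three LeakyReLU layers, then a
    linear layer of width 2*d_out split into identity (mean) and softplus
    (variance) heads."""
    h = x
    for W, b in params[:-1]:
        z = h @ W + b
        h = np.maximum(0.01 * z, z)          # LeakyReLU with slope 0.01
    W, b = params[-1]
    out = h @ W + b                          # shape (..., 2 * d_out)
    mean, raw = out[..., :d_out], out[..., d_out:]
    var = np.log1p(np.exp(raw))              # softplus keeps variance positive
    return mean, var

def init_params(d_in, d_out, rng, width=512):
    """Random weights for the layer sizes of Table 4."""
    sizes = [d_in, width, width, width, 2 * d_out]
    return [(rng.standard_normal((a, b)) * 0.01, np.zeros(b))
            for a, b in zip(sizes[:-1], sizes[1:])]
```

For the VampPrior, d_in = dz and d_out = dx; the softplus head guarantees a valid (positive) variance.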
J.5. Hyperparameter settings

Unless specified otherwise, we use the same parameters for both the MNIST and CIFAR-10 datasets. Our approach to selecting step sizes was to first set them to 10⁻³ and then decrease them appropriately if instabilities arose.

MPD. For the step sizes, we have hθ = hx = 10⁻⁴. The number of particles is set to M = 5. For the momentum parameters, we use γθ = γx = 0.9 with momentum coefficients µθ = 0.95, µx = 0 (or, equivalently, ηθ ≈ 556, ηx ≈ 11,111) for MNIST, and µθ = 0.95, µx = 0.5 (or, equivalently, ηθ ≈ 556, ηx ≈ 55,555) for CIFAR-10. We use the RMSProp preconditioner (see Appendix J.1) with β = 0.9.

PGD. We have hθ = 10⁻⁴, hx = 10⁻³. The number of particles is set to M = 5. We use the RMSProp preconditioner (see Appendix J.1) with β = 0.9.

ABP. We use the SGD optimizer with step size hθ = 10⁻⁴ and run a length-5 ULA chain with step size hx = 10⁻³. We found that RMSProp worked well in the transient period but failed to achieve better performance than SGD.

SR. We use the RMSProp optimizer with step size hθ = 10⁻⁴ and run five length-20 ULA chains with step size hx = 10⁻³.

VI. We use the RMSProp optimizer with a step size of h = 10⁻⁵; we found that this resulted in the best performance. For the encoder, we use a mean-field approximation as in Kingma and Welling (2014). The neural network is a 3-layer fully connected network with Leaky ReLU activation functions; in the last layer, we use a softplus to parameterize the variance (see Table 4).

J.6. Additional Figures

Figure 5: MNIST. Samples generated from various algorithms.
Figure 6: CIFAR-10. Samples generated from various algorithms.

K. Additional Experiments

In this section, we present additional experiments, particularly comparisons of MPD with PGD and with variants of MPD that accelerate only one component of the space instead of both.
We say X-enriched to indicate algorithms with momentum incorporated in X, with momentum either included or excluded in θ. To separate the two cases, we say X-only-enriched to refer to the algorithm with plain gradient descent in θ. We use similar terminology for θ-enriched and θ-only-enriched. We consider two settings: in Appendix K.1, further results with the toy hierarchical model (see Appendix J.4.1); and, in Appendix K.2, a density estimation task using VAEs on a mixture-of-Gaussians dataset.

K.1. Toy Hierarchical Model

Figure 7. As the model, we consider the toy hierarchical model for different values of σ in (84), where σ is chosen to match the data-generating process. The data are generated from a toy hierarchical model with parameters (θ = 10, σ) for σ ∈ {5, 10, 12}. As noted by Kuntz et al. (2023, Eq. 51), the marginal maximum likelihood estimate of θ is the empirical average of the observed samples, i.e., (1/dx) Σ_{i=1}^{dx} yi. For each trial, we process the data (yi)_{i=1}^{dx} to have an empirical average of 10, for simplicity.

Figure 2 (a), (b), (c). We use a toy hierarchical model with σ = 12 and, similarly, a data-generating process with (θ = 10, σ = 12). We are interested in how the initialization of the particle cloud affects MPD, PGD, and the algorithms that momentum-enrich only one component. Each subplot shows the evolution of the parameter θ for initializations drawn from N(µ, I) with µ ∈ {−5, 20, 100}. This example illustrates the typical case where the cloud is initialized badly. It can be seen that all methods are affected by poor initialization. The X-enriched algorithms recover faster, achieving almost identical performance (at later iterations) across the different initializations; in other words, for X-enriched algorithms the transient phase is short. Methods without X-enrichment take longer to recover and are unable to match the performance they achieve when better initialized.
Overall, it can be seen that MPD performs better than PGD.

Figure 7: Comparison between MPD and PGD on the toy hierarchical model for different choices of σ: (a) σ = 5, (b) σ = 10, (c) σ = 12. The plot shows the average and standard deviation of θ across iterations, computed over 10 independent trials. MPD is shown in red and PGD in black; the dashed line shows the true value.

K.2. Density Estimation

In this problem, we show the result of training a VAE with a VampPrior on a 1-d mixture-of-Gaussians (MoG) dataset, whose PDF is shown in Figure 8. The dataset is composed of 100 samples. In Figure 2 (d), we show the performance of the various accelerated algorithms. The empirical Wasserstein-1 distance (denoted Ŵ1) is estimated from the empirical CDFs produced from 1000 samples at each iteration. The results show the average (with standard error bars, which are barely noticeable) computed over 100 trials. PGD is shown in black and MPD in green. In this case, all accelerated algorithms perform better than PGD, and MPD performs best of all.

Hyperparameters: no subsampling is used. We have dy = 1, dx = 10. We set γx = γθ = 0.4 and use the momentum heuristic with µθ = µx = 0.1 (see Appendix J.3). We use the same VAE architecture as in Appendix J.4.2 with likelihood noise σ = 0.01, an MLP decoder (see Table 4), and a VampPrior with dz = 2 and K = 20.

Figure 8: The true density of the MoG dataset.

K.3. Sensitivity Analysis

We are interested in investigating the sensitivity of MPD to the momentum parameters. One way to quantify whether one is in a regime where MPD improves on PGD is to compute an approximation to the (signed and weighted) area between the PGD and MPD curves, which we call the Area Between Curves (ABC). More specifically, we compute

ABC = Σ_{k=1}^K w(k) [PGD(k) − MPD(k)],

where PGD : Z⁺ → R and MPD : Z⁺ → R denote the loss/metric curves of PGD and MPD, respectively, and w(k) is a weighting function such that Σ_{k=1}^K w(k) = 1.
The weighting function w(k) both normalizes the sum and reweights it so that more importance is placed at the end of the run than at the beginning; we use w(k) = 2k/(K(1 + K)), which sums to one. If MPD is in a desirable regime and performs better than PGD, then the MPD curve will lower-bound the PGD curve; in this case, the ABC takes large positive values, while in the converse case it takes negative values. Hence, a higher ABC indicates better hyperparameters.

Figure 9 shows the ABC for different momentum parameters, with η ranging over ten log-spaced values from 100 to 10000 and γ over {1.2, 1.1, 1.0, 0.9, 0.7, 0.6, 0.5, 0.4, 0.2, 0.1}, keeping all other hyperparameters the same between PGD and MPD.

Figure 9: The Area Between Curves of PGD and MPD for different momentum parameters.

We use the toy hierarchical model, which gives us access to the true parameter θ, so the ABC is computed from MPD and PGD curves that report the absolute difference between the model's parameter and the true θ. We note that a similar heatmap can be produced using the loss curves. This heatmap supports our recommendation: if we set η sufficiently large, we can easily tune γ to achieve better performance than PGD. Note that the η-axis is on a log scale and that this plot applies only to the toy hierarchical model.
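The ABC computation above is a few lines (our own sketch; the function name is ours, with w(k) = 2k/(K(K+1)) as in the text):

```python
import numpy as np

def area_between_curves(pgd_curve, mpd_curve):
    """Weighted signed area between two loss/metric curves.
    Weights w(k) = 2k / (K(K+1)) sum to one and emphasize late iterations;
    positive values indicate that MPD outperforms PGD."""
    pgd = np.asarray(pgd_curve, dtype=float)
    mpd = np.asarray(mpd_curve, dtype=float)
    K = len(pgd)
    w = 2.0 * np.arange(1, K + 1) / (K * (K + 1))
    return float(np.sum(w * (pgd - mpd)))
```

With identical curves the ABC is zero; a constant gap of one in PGD's favour gives an ABC of exactly one, since the weights sum to one.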