# Accelerated Flow for Probability Distributions

Amirhossein Taghvaei¹, Prashant G. Mehta¹

¹Department of Mechanical Science and Engineering, Coordinated Science Laboratory, University of Illinois at Urbana-Champaign, Urbana, IL, USA. Correspondence to: Amirhossein Taghvaei.

Abstract

This paper presents a methodology and numerical algorithms for constructing accelerated gradient flows on the space of probability distributions. In particular, we extend the recent variational formulation of accelerated methods in (Wibisono et al., 2016) from vector-valued variables to probability distributions. The variational problem is modeled as a mean-field optimal control problem. A quantitative estimate on the asymptotic convergence rate is provided based on a Lyapunov function construction, when the objective functional is displacement convex. An important special case is considered where the objective functional is the relative entropy. For this case, two numerical approximations are presented to implement the Hamilton's equations as a system of N interacting particles. The algorithm is numerically illustrated and compared with the MCMC and Hamiltonian MCMC algorithms.

1. Introduction

Optimization on the space of probability distributions is important to a number of machine learning models including variational inference (Blei et al., 2017), generative models (Goodfellow et al., 2014; Arjovsky et al., 2017), and policy optimization in reinforcement learning (Sutton et al., 2000). A number of recent studies have considered solution approaches to these problems based upon a construction of gradient flow on the space of probability distributions (Liu & Wang, 2016; Richemond & Maginnis, 2017; Zhang et al., 2018; Frogner & Poggio, 2018; Chizat & Bach, 2018; Chen et al., 2018; Liu et al., 2018). Such constructions are useful for convergence analysis as well as for the development of numerical algorithms.

In this paper, we propose a methodology and numerical algorithms that achieve accelerated gradient flows on the space of probability distributions. The proposed numerical algorithms are related to, yet distinct from, the accelerated stochastic gradient descent (Jain et al., 2017) and Hamiltonian Markov chain Monte-Carlo (MCMC) algorithms (Neal et al., 2011; Cheng et al., 2017).

The proposed methodology extends the variational formulation of (Wibisono et al., 2016) from vector-valued variables to probability distributions. The original formulation of (Wibisono et al., 2016) was used to derive and analyze the convergence properties of a large class of accelerated optimization algorithms, the most significant of which is the continuous-time limit of Nesterov's algorithm (Su et al., 2014). In this paper, the limit is referred to as the Nesterov ODE. The extension proposed in our work is based upon a generalization of the formula for the Lagrangian in (Wibisono et al., 2016): (i) the kinetic energy term is replaced with the expected value of kinetic energy; (ii) the potential energy term is replaced with a suitably defined functional on the space of probability distributions. The variational problem is to obtain a trajectory in the space of probability distributions that minimizes the action integral of the Lagrangian.
The variational problem is modeled as a mean-field optimal control problem (Bensoussan et al., 2013; Carmona & Delarue, 2017). The maximum principle of optimal control theory is used to derive the Hamilton's equations, which represent the first-order optimality conditions. The Hamilton's equations provide a generalization of the Nesterov ODE to the space of probability distributions. A Lyapunov function is proposed for the convergence analysis of the solution of the Hamilton's equations. In this way, quantitative estimates on the convergence rate are obtained for the case when the objective functional is displacement convex (McCann, 1997). Table 1 provides a summary of the relationship between the original variational formulation in (Wibisono et al., 2016) and the extension proposed in this paper.

We also consider the important special case when the objective functional is the relative entropy functional $D(\rho\vert\rho_\infty)$, defined with respect to a target probability distribution $\rho_\infty$. In this case, the accelerated gradient flow is shown to be related to the continuous-time limit of the Hamiltonian Monte-Carlo algorithm (Cheng et al., 2017) (Remark 2). The Hamilton's equations are finite-dimensional for the special case when the initial and the target probability distributions are both Gaussian. In this case, the mean evolves according to the Nesterov ODE. For the general case, the Lyapunov function-based convergence analysis applies when the target distribution is log-concave.

As a final contribution, the proposed methodology is used to obtain a numerical algorithm. The algorithm is an interacting particle system that empirically approximates the distribution with a finite but large number $N$ of particles. The difficult part of this construction is the approximation of the interaction term between particles. For this purpose, two types of approximations are described: (i) a Gaussian approximation, which is asymptotically (as $N \to \infty$) exact in the Gaussian setting; and (ii) a diffusion map approximation, which is computationally more demanding but asymptotically exact for a general class of distributions.

The outline of the remainder of this paper is as follows: Sec. 2 provides a brief review of the variational formulation in (Wibisono et al., 2016). The proposed extension to the space of probability distributions appears in Sec. 3, where the main result is also described. The numerical algorithm, along with the results of numerical experiments, appears in Sec. 4. Comparisons with MCMC and Hamiltonian MCMC are also described. The conclusions appear in Sec. 5. The proofs of the results appear in the supplementary note.

1.1. Related work

Construction of accelerated flows for probability distributions was proposed in (Liu et al., 2018) based on the generalization of the Nesterov method to Riemannian manifolds (Liu et al., 2017). The procedure involves approximating the exponential map and the parallel transport map for probability distributions in the Wasserstein space. Our construction of the accelerated flow is different from (Liu et al., 2018) in several respects: (i) we describe a variational formulation and make a connection to mean-field optimal control theory; (ii) our variational construction yields a continuous-time algorithm, providing a straightforward comparison to Hamiltonian MCMC; (iii) we carry out convergence analysis based upon a Lyapunov function method; (iv) our analysis in the Gaussian setting shows that we recover the Nesterov ODE.
Another class of related work is the interacting particle-based numerical algorithms designed to sample from a target distribution. An example is the Stein variational gradient descent (SVGD) algorithm (Liu & Wang, 2016; Liu, 2017), based on a Riemannian construction of the gradient flow. Another example is the particle optimization method (Chen et al., 2018), whose update is obtained from a solution to an optimization problem based on the variational formulation of the Langevin dynamics. Interacting particle systems have also been shown to be useful for numerically solving the nonlinear filtering problem (Del Moral et al., 1998; Reich, 2011; Yang et al., 2016; Zhang et al., 2019).

Notation: The gradient and divergence operators are denoted as $\nabla$ and $\nabla\cdot$, respectively. With multiple variables, $\nabla_z$ denotes the gradient with respect to the variable $z$. Therefore, the divergence of the vector field $U$ is $\nabla\cdot U(x) = \sum_{n=1}^d \frac{\partial U_n}{\partial x_n}(x)$. The space of absolutely continuous probability measures on $\mathbb{R}^d$ with finite second moments is denoted by $\mathcal{P}_{ac,2}(\mathbb{R}^d)$. For a measure $\mu \in \mathcal{P}_{ac,2}(\mathbb{R}^d)$ and a measurable map $T : \mathbb{R}^d \to \mathbb{R}^d$, the push-forward of $\mu$ by $T$ is denoted by $T\#\mu$. The second-order Wasserstein distance between any two measures $\mu, \nu \in \mathcal{P}_{ac,2}(\mathbb{R}^d)$ is denoted as $W_2(\mu, \nu)$. The Wasserstein gradient and the Gâteaux derivative of a functional $F$ are denoted as $\nabla_\rho F(\rho)$ and $\frac{\partial F}{\partial \rho}(\rho)$, respectively (see the supplementary note for definitions). The probability distribution of a random variable $Z$ is denoted as $\text{Law}(Z)$.

2. Review of the variational formulation of (Wibisono et al., 2016)

The basic problem is to minimize a $C^1$ smooth convex function $f$ on $\mathbb{R}^d$. The standard form of the gradient descent algorithm for this problem is the ODE

$$\frac{dX_t}{dt} = -\nabla f(X_t), \quad t \ge 0 \tag{1}$$

Accelerated forms of this algorithm are obtained based on a variational formulation due to (Wibisono et al., 2016). The formulation is briefly reviewed here using an optimal control formalism. The Lagrangian $L : \mathbb{R}^+ \times \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}$ is defined as

$$L(t, x, u) := e^{\alpha_t + \gamma_t}\Big(\underbrace{\tfrac{1}{2}\vert e^{-\alpha_t} u\vert^2}_{\text{kinetic energy}} - \underbrace{e^{\beta_t} f(x)}_{\text{potential energy}}\Big) \tag{2}$$

where $t \ge 0$ is the time, $x \in \mathbb{R}^d$ is the state, $u \in \mathbb{R}^d$ is the velocity or control input, and the time-varying parameters $\alpha_t, \beta_t, \gamma_t$ satisfy the following scaling conditions: $\alpha_t = \log p - \log t$, $\beta_t = p \log t + \log C$, and $\gamma_t = p \log t$, where $p \ge 2$ and $C > 0$ are constants. The variational problem is

$$\underset{u}{\text{Minimize}}\quad J(u) = \int_0^\infty L(t, X_t, u_t)\, dt \qquad \text{Subject to}\quad \frac{dX_t}{dt} = u_t, \quad X_0 = x_0 \tag{3}$$

where the control input $u_t$ is chosen, over all control laws $\{u_t\}_{t>0}$ in $\mathbb{R}^d$, to minimize the objective function $J(u)$. The Hamiltonian function is

$$H(t, x, y, u) = y \cdot u - L(t, x, u) \tag{4}$$

where $y \in \mathbb{R}^d$ is the dual variable and $y \cdot u$ denotes the dot product between the vectors $y$ and $u$.

| | Vector | Probability distribution |
|---|---|---|
| State space | $\mathbb{R}^d$ | $\mathcal{P}_2(\mathbb{R}^d)$ |
| Objective function | $f(x)$ | $F(\rho) := D(\rho\vert\rho_\infty)$ |
| Lagrangian | $e^{\alpha_t+\gamma_t}\big(\tfrac{1}{2}\vert e^{-\alpha_t}u\vert^2 - e^{\beta_t}f(x)\big)$ | $e^{\alpha_t+\gamma_t}\,\mathbb{E}\big[\tfrac{1}{2}\vert e^{-\alpha_t}U\vert^2 - e^{\beta_t}\log\tfrac{\rho(X)}{\rho_\infty(X)}\big]$ |
| Lyapunov function | $\tfrac{1}{2}\vert x + e^{-\gamma_t}y - \bar{x}\vert^2 + e^{\beta_t}(f(x) - f(\bar{x}))$ | $\tfrac{1}{2}\mathbb{E}[\vert X_t + e^{-\gamma_t}Y_t - T_{\rho_t}^{\rho_\infty}(X_t)\vert^2] + e^{\beta_t}(F(\rho_t) - F(\rho_\infty))$ |
| Convergence rate | $f(x_t) - f(\bar{x}) \le O(e^{-\beta_t})$ | $F(\rho_t) - F(\rho_\infty) \le O(e^{-\beta_t})$ |

Table 1. Summary of the variational formulations for vectors and probability distributions.

According to the Pontryagin maximum principle, the optimal control $u_t^* = \arg\max_v H(t, X_t, Y_t, v) = e^{\alpha_t - \gamma_t} Y_t$. The resulting Hamilton's equations are

$$\frac{dX_t}{dt} = +\nabla_y H(t, X_t, Y_t, u_t^*) = e^{\alpha_t - \gamma_t}\, Y_t, \quad X_0 = x_0 \tag{5a}$$
$$\frac{dY_t}{dt} = -\nabla_x H(t, X_t, Y_t, u_t^*) = -e^{\alpha_t + \beta_t + \gamma_t}\, \nabla f(X_t) \tag{5b}$$

The system (5) is an example of an accelerated gradient descent algorithm; a numerical sketch of (5) is given below.
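As an illustration, the following is a minimal sketch, in Python/NumPy, that integrates the Hamilton's equations (5) with a simple explicit Euler discretization (the paper itself uses a symplectic integrator; see Sec. 4). The step size, the horizon, and the choice $C = 0.25$ are illustrative assumptions, not values from the paper.

```python
import numpy as np

def nesterov_ode(grad_f, x0, p=2, C=0.25, t0=1.0, dt=1e-3, K=20_000):
    # Integrate (5): dX/dt = e^{a-g} Y, dY/dt = -e^{a+b+g} grad f(X), under the
    # ideal scaling a_t = log p - log t, b_t = p log t + log C, g_t = p log t,
    # so that e^{a-g} = p t^{-(p+1)} and e^{a+b+g} = C p t^{2p-1}.
    x = np.asarray(x0, dtype=float)
    y = np.zeros_like(x)   # Y_0 = 0 is an illustrative initial condition
    t = t0
    for _ in range(K):
        x = x + dt * p * t ** (-(p + 1)) * y
        y = y - dt * C * p * t ** (2 * p - 1) * grad_f(x)
        t += dt
    return x

# Example: f(x) = 0.5*|x|^2, so grad_f(x) = x; the iterates approach the origin.
x_final = nesterov_ode(lambda x: x, x0=[5.0, -3.0])
```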
Specifically, if the parameters $\alpha_t, \beta_t, \gamma_t$ are defined using $p = 2$, one obtains the continuous-time limit of Nesterov's accelerated algorithm. It is referred to as the Nesterov ODE in this paper. For this system, a Lyapunov function is as follows:

$$V(t, x, y) = \frac{1}{2}\big\vert x + e^{-\gamma_t} y - \bar{x}\big\vert^2 + e^{\beta_t}\big(f(x) - f(\bar{x})\big) \tag{6}$$

where $\bar{x} \in \arg\min_x f(x)$. It is shown in (Wibisono et al., 2016) that, upon differentiating along the solution trajectory, $\frac{d}{dt} V(t, X_t, Y_t) \le 0$. This yields the convergence rate

$$f(X_t) - f(\bar{x}) \le O(e^{-\beta_t}), \quad \forall t \ge 0 \tag{7}$$

3. Variational formulation for probability distributions

3.1. Motivation and background

Let $F : \mathcal{P}_{ac,2}(\mathbb{R}^d) \to \mathbb{R}$ be a functional on the space of probability distributions. Consider the problem of minimizing $F(\rho)$. The (Wasserstein) gradient flow with respect to $F(\rho)$ is a curve $\rho_t$ such that

$$\frac{\partial \rho_t}{\partial t}(x) = \nabla\cdot\big(\rho_t(x)\, \nabla_\rho F(\rho_t)(x)\big) \tag{8}$$

where (the vector field) $\nabla_\rho F(\rho) : \mathbb{R}^d \to \mathbb{R}^d$ is the Wasserstein gradient of $F$. An important example is the relative entropy functional, where $F(\rho) = D(\rho\vert\rho_\infty) := \int_{\mathbb{R}^d} \log\big(\frac{\rho(x)}{\rho_\infty(x)}\big)\rho(x)\, dx$, and $\rho_\infty \in \mathcal{P}_{ac,2}(\mathbb{R}^d)$ is referred to as the target distribution. The Wasserstein gradient of the relative entropy is given by $\nabla_\rho F(\rho)(x) = \nabla \log\big(\frac{\rho(x)}{\rho_\infty(x)}\big)$ (Ch. 8.3 in (Villani, 2003)). The gradient flow

$$\frac{\partial \rho_t}{\partial t}(x) = -\nabla\cdot\big(\rho_t(x)\, \nabla \log(\rho_\infty(x))\big) + \Delta\rho_t(x) \tag{9}$$

is the Fokker-Planck equation (Jordan et al., 1998). The gradient flow achieves the density transport from an initial probability distribution $\rho_0$ to the target (here, also equilibrium) probability distribution $\rho_\infty$; it underlies the construction and the analysis of Markov chain Monte-Carlo (MCMC) algorithms. The simplest MCMC algorithm is the Langevin stochastic differential equation (SDE)

$$dX_t = -\nabla f(X_t)\, dt + \sqrt{2}\, dB_t, \quad X_0 \sim \rho_0$$

where $f = -\log \rho_\infty$ and $B_t$ is the standard Brownian motion in $\mathbb{R}^d$.

The main problem of this paper is to construct an accelerated form of the gradient flow (8). The proposed solution is based upon a variational formulation. As tabulated in Table 1, the solution represents a generalization of (Wibisono et al., 2016) from its original deterministic finite-dimensional setting to the probabilistic infinite-dimensional setting. The variational problem can be expressed in two equivalent forms: (i) the probabilistic form and (ii) the partial differential equation (PDE) form.

3.2. Probabilistic form of the variational problem

Consider the stochastic process $\{X_t\}_{t\ge 0}$ that takes values in $\mathbb{R}^d$ and evolves according to

$$\frac{dX_t}{dt} = U_t, \quad X_0 \sim \rho_0$$

where the control input $\{U_t\}_{t\ge 0}$ also takes values in $\mathbb{R}^d$, and $\rho_0 \in \mathcal{P}_{ac,2}(\mathbb{R}^d)$ is the probability distribution of the initial condition $X_0$. It is noted that the randomness here comes only from the random initial condition. Suppose the objective functional is of the form $F(\rho) = \int F(\rho, x)\rho(x)\, dx$. The Lagrangian $L : \mathbb{R}^+ \times \mathbb{R}^d \times \mathcal{P}_{ac,2}(\mathbb{R}^d) \times \mathbb{R}^d \to \mathbb{R}$ is defined as

$$L(t, x, \rho, u) := e^{\alpha_t + \gamma_t}\Big(\underbrace{\tfrac{1}{2}\vert e^{-\alpha_t} u\vert^2}_{\text{kinetic energy}} - \underbrace{e^{\beta_t}\, F(\rho, x)}_{\text{potential energy}}\Big) \tag{10}$$

This formula is a natural generalization of the Lagrangian (2), and the parameters $\alpha_t, \beta_t, \gamma_t$ are defined exactly as in the finite-dimensional case. The stochastic optimal control problem is

$$\underset{u}{\text{Minimize}}\quad J(u) = \mathbb{E}\Big[\int_0^\infty L(t, X_t, \rho_t, U_t)\, dt\Big] \qquad \text{Subject to}\quad \frac{dX_t}{dt} = U_t, \quad X_0 \sim \rho_0 \tag{11}$$

where $\rho_t = \text{Law}(X_t) \in \mathcal{P}_{ac,2}(\mathbb{R}^d)$ is the probability density function of the random variable $X_t$. The Hamiltonian function $H : \mathbb{R}^+ \times \mathbb{R}^d \times \mathcal{P}_{ac,2}(\mathbb{R}^d) \times \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}$ for this problem is given by (see Sec. 6.2.3 in (Carmona & Delarue, 2017))

$$H(t, x, \rho, y, u) := u \cdot y - L(t, x, \rho, u) \tag{12}$$

where $y \in \mathbb{R}^d$ is the dual variable.

Remark 1. The variational problem (11) is an example of a mean-field (McKean-Vlasov) optimal control problem.
This is because the Lagrangian depends also upon the law of the stochastic process; cf. Ch. 6 in (Carmona & Delarue, 2017).

3.3. PDE formulation of the variational problem

An equivalent PDE formulation is obtained by considering the stochastic optimal control problem (11) as a deterministic optimal control problem on the space of probability distributions. Specifically, the process $\{\rho_t\}_{t\ge 0}$ is a deterministic process that takes values in $\mathcal{P}_{ac,2}(\mathbb{R}^d)$ and evolves according to the continuity equation

$$\frac{\partial \rho_t}{\partial t} + \nabla\cdot(\rho_t u_t) = 0 \tag{13}$$

where $u_t : \mathbb{R}^d \to \mathbb{R}^d$ is now a time-varying vector field. The Lagrangian $L : \mathbb{R}^+ \times \mathcal{P}_{ac,2}(\mathbb{R}^d) \times L^2(\mathbb{R}^d; \mathbb{R}^d) \to \mathbb{R}$ is defined as

$$L(t, \rho, u) := e^{\alpha_t + \gamma_t}\Big(\int_{\mathbb{R}^d} \tfrac{1}{2}\vert e^{-\alpha_t} u(x)\vert^2 \rho(x)\, dx - e^{\beta_t}\, F(\rho)\Big)$$

The optimal control problem is

$$\underset{u}{\text{Minimize}}\quad \int_0^\infty L(t, \rho_t, u_t)\, dt \qquad \text{Subject to}\quad \frac{\partial \rho_t}{\partial t} + \nabla\cdot(\rho_t u_t) = 0 \tag{14}$$

The Hamiltonian function $H : \mathbb{R}^+ \times \mathcal{P}_{ac,2}(\mathbb{R}^d) \times C(\mathbb{R}^d; \mathbb{R}) \times L^2(\mathbb{R}^d; \mathbb{R}^d) \to \mathbb{R}$ is

$$H(t, \rho, \phi, u) := \langle \nabla\phi, u\rangle_{L^2(\rho)} - L(t, \rho, u) \tag{15}$$

where $\phi \in C(\mathbb{R}^d; \mathbb{R})$ is the dual variable and the inner product $\langle \nabla\phi, u\rangle_{L^2(\rho)} := \int_{\mathbb{R}^d} \nabla\phi(x)\cdot u(x)\rho(x)\, dx$.

3.4. Main result

Theorem 1. Consider the variational problem (11)-(14).

(i) For the probabilistic form (11) of the variational problem, the optimal control $U_t^* = e^{\alpha_t - \gamma_t} Y_t$, where the optimal trajectory $\{(X_t, Y_t)\}_{t\ge 0}$ evolves according to the Hamilton's ODEs

$$\frac{dX_t}{dt} = U_t^* = e^{\alpha_t - \gamma_t}\, Y_t, \quad X_0 \sim \rho_0 \tag{16a}$$
$$\frac{dY_t}{dt} = -e^{\alpha_t + \beta_t + \gamma_t}\, \nabla_\rho F(\rho_t)(X_t), \quad Y_0 = \nabla\phi_0(X_0) \tag{16b}$$

where $\phi_0$ is a convex function and $\rho_t = \text{Law}(X_t)$.

(ii) For the PDE form (14) of the variational problem, the optimal control is $u_t^*(x) = e^{\alpha_t - \gamma_t}\, \nabla\phi_t(x)$, where the optimal trajectory $\{(\rho_t, \phi_t)\}_{t\ge 0}$ evolves according to the Hamilton's PDEs

$$\frac{\partial \rho_t}{\partial t} = -\nabla\cdot\big(\rho_t \underbrace{e^{\alpha_t - \gamma_t}\,\nabla\phi_t}_{u_t^*}\big) \tag{17a}$$
$$\frac{\partial \phi_t}{\partial t} = -e^{\alpha_t - \gamma_t}\, \frac{\vert\nabla\phi_t\vert^2}{2} - e^{\alpha_t + \beta_t + \gamma_t}\, \frac{\partial F}{\partial \rho}(\rho_t) \tag{17b}$$

(iii) The solutions of the two forms are equivalent in the following sense: $\text{Law}(X_t) = \rho_t$, $U_t = u_t^*(X_t)$, and $Y_t = \nabla\phi_t(X_t)$.

(iv) Suppose additionally that the functional $F$ is displacement convex and $\rho_\infty$ is its minimizer. Define

$$V(t) = \frac{1}{2}\mathbb{E}\big[\vert X_t + e^{-\gamma_t}\, Y_t - T_{\rho_t}^{\rho_\infty}(X_t)\vert^2\big] + e^{\beta_t}\big(F(\rho_t) - F(\rho_\infty)\big) \tag{18}$$

where the map $T_{\rho_t}^{\rho_\infty} : \mathbb{R}^d \to \mathbb{R}^d$ is the optimal transport map from $\rho_t$ to $\rho_\infty$. Assume the dimension $d = 1$. Then $V$ is non-increasing along the optimal trajectory and, consequently, the following rate of convergence is obtained:

$$F(\rho_t) - F(\rho_\infty) \le O(e^{-\beta_t}), \quad \forall t \ge 0$$

Proof sketch. The Hamilton's equations are derived using the standard mean-field optimal control theory (Carmona & Delarue, 2017). The Lyapunov function argument is based upon the variational inequality characterization of a displacement convex functional (see Eq. 10.1.7 in (Ambrosio et al., 2008)). The detailed proof appears in the supplementary note.

We expect that the assumption $d = 1$ is not necessary. This is the subject of continuing work.

3.5. Relative entropy as the functional

In the remainder of this paper, we assume that the functional $F(\rho) = D(\rho\vert\rho_\infty)$ is the relative entropy, where $\rho_\infty \in \mathcal{P}_{ac,2}(\mathbb{R}^d)$ is a given target probability distribution. In this case the Hamilton's equations are given by

$$\frac{dX_t}{dt} = e^{\alpha_t - \gamma_t}\, Y_t, \quad X_0 \sim \rho_0 \tag{19a}$$
$$\frac{dY_t}{dt} = -e^{\alpha_t + \beta_t + \gamma_t}\big(\nabla f(X_t) + \nabla\log(\rho_t(X_t))\big), \quad Y_0 = \nabla\phi_0(X_0) \tag{19b}$$

where $\rho_t = \text{Law}(X_t)$ and $f = -\log(\rho_\infty)$. Moreover, if $f$ is convex (or, equivalently, $\rho_\infty$ is log-concave), then $F$ is displacement convex with the unique minimizer at $\rho_\infty$, and the convergence estimate is given by $D(\rho_t\vert\rho_\infty) \le O(e^{-\beta_t})$.

Remark 2. The Hamilton's equations (19) with the relative entropy functional are related to the under-damped Langevin equation (Cheng et al., 2017). A basic form of the under-damped (or second-order) Langevin equation is

$$dX_t = v_t\, dt, \qquad dv_t = -\gamma v_t\, dt - \nabla f(X_t)\, dt + \sqrt{2}\, dB_t \tag{20}$$

where $\{B_t\}_{t\ge 0}$ is the standard Brownian motion. Consider next the accelerated flow (19), and denote $v_t := e^{\alpha_t - \gamma_t} Y_t$.
Then, with an appropriate choice of scaling parameters (e.g., $\alpha_t = 0$, $\beta_t = 0$, and $\gamma_t = \gamma t$):

$$dX_t = v_t\, dt, \qquad dv_t = -\gamma v_t\, dt - \nabla f(X_t)\, dt - \nabla_x \log(\rho_t(X_t))\, dt \tag{21}$$

The scaling parameters are chosen here for the sake of comparison and do not satisfy the ideal scaling conditions of (Wibisono et al., 2016). The SDEs (20) and (21) are similar, except that the stochastic term $\sqrt{2}\, dB_t$ in (20) is replaced with the deterministic term $-\nabla_x \log(\rho_t(X_t))\, dt$ in (21). Because of this difference, the resulting distributions are different. See the supplementary note for more details.

3.6. Quadratic Gaussian case

Suppose the initial distribution $\rho_0$ and the target distribution $\rho_\infty$ are both Gaussian, denoted as $N(m_0, \Sigma_0)$ and $N(\bar{x}, Q)$, respectively. This is equivalent to the objective function $f(x)$ being quadratic, of the form $f(x) = \frac{1}{2}(x - \bar{x})^\top Q^{-1}(x - \bar{x})$. Therefore, this problem is referred to as the quadratic Gaussian case. The following proposition shows that the mean of the stochastic process $(X_t, Y_t)$ evolves according to the Nesterov ODE (5).

Proposition 1. (Quadratic Gaussian case) Consider the variational problem (11) for the quadratic Gaussian case. Then:

(i) The stochastic process $(X_t, Y_t)$ is a Gaussian process. The Hamilton's equations are given by

$$\frac{dX_t}{dt} = e^{\alpha_t - \gamma_t}\, Y_t, \qquad \frac{dY_t}{dt} = -e^{\alpha_t + \beta_t + \gamma_t}\big(Q^{-1}(X_t - \bar{x}) - \Sigma_t^{-1}(X_t - m_t)\big)$$

where $m_t$ and $\Sigma_t$ are the mean and the covariance of $X_t$.

(ii) Upon taking the expectation of both sides, and denoting $n_t := \mathbb{E}[Y_t]$,

$$\frac{dm_t}{dt} = e^{\alpha_t - \gamma_t}\, n_t, \qquad \frac{dn_t}{dt} = -e^{\alpha_t + \beta_t + \gamma_t}\, \underbrace{Q^{-1}(m_t - \bar{x})}_{\nabla f(m_t)}$$

which is identical to the Nesterov ODE (5).

Proof sketch. Fix $\bar\rho_t$. Consider the resulting pair $(X_t, Y_t)$ from (19) and let $\rho_t = \text{Law}(X_t)$. The proof follows from showing that a Gaussian $\bar\rho_t$ is a fixed point of the map $\bar\rho_t \mapsto \rho_t$.

4. Numerical algorithm

The proposed numerical algorithm is based upon an interacting particle implementation of the Hamilton's equations (19). Consider a system of $N$ particles $\{(X_t^i, Y_t^i)\}_{i=1}^N$ that evolve according to

$$\frac{dX_t^i}{dt} = e^{\alpha_t - \gamma_t}\, Y_t^i, \quad X_0^i \sim \rho_0, \qquad \frac{dY_t^i}{dt} = -e^{\alpha_t + \beta_t + \gamma_t}\Big(\nabla f(X_t^i) + \underbrace{I_t^{(N)}(X_t^i)}_{\text{interaction term}}\Big) \tag{22}$$

with $Y_0^i = \nabla\phi_0(X_0^i)$. The interaction term $I_t^{(N)}$ is an empirical approximation of the $\nabla\log(\rho_t)$ term in (19). We propose two types of empirical approximations as follows:

1. Gaussian approximation: Suppose the density is approximated as a Gaussian $N(m_t, \Sigma_t)$. In this case, $\nabla\log(\rho_t(x)) = -\Sigma_t^{-1}(x - m_t)$. This motivates the following empirical approximation of the interaction term:

$$I_t^{(N)}(x) = -\big(\Sigma_t^{(N)}\big)^{-1}\big(x - m_t^{(N)}\big) \tag{23}$$

where $m_t^{(N)} := \frac{1}{N}\sum_{i=1}^N X_t^i$ is the empirical mean and $\Sigma_t^{(N)} := \frac{1}{N-1}\sum_{i=1}^N (X_t^i - m_t^{(N)})(X_t^i - m_t^{(N)})^\top$ is the empirical covariance. Even though the approximation is asymptotically (as $N \to \infty$) exact only under the Gaussian assumption, it may be used in more general settings, particularly when the density $\rho_t$ is unimodal. The situation is analogous to the (Bayesian) filtering problem, where an ensemble Kalman filter is used as an approximate solution for non-Gaussian distributions (Evensen, 2003).

2. Diffusion map approximation: This is based upon the diffusion map approximation of the weighted Laplacian operator (Coifman & Lafon, 2006; Hein et al., 2007). For a $C^2$ function $f$, the weighted Laplacian is defined as $\Delta_\rho f := \frac{1}{\rho}\nabla\cdot(\rho\nabla f)$. Denote $e(x) = x$ as the coordinate function on $\mathbb{R}^d$. It is a straightforward calculation to show that $\nabla\log(\rho) = \Delta_\rho e$.
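Spelling the calculation out: since the Jacobian of the coordinate function is the identity matrix, $\nabla e = I_{d\times d}$, and therefore

$$\Delta_\rho e = \frac{1}{\rho}\,\nabla\cdot\big(\rho\,\nabla e\big) = \frac{1}{\rho}\,\nabla\cdot\big(\rho\, I_{d\times d}\big) = \frac{\nabla\rho}{\rho} = \nabla\log\rho$$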
This identity allows one to use the diffusion map approximation of the weighted Laplacian to approximate the interaction term as follows:

$$\text{(DM)} \qquad I_t^{(N)}(X_t^i) = \frac{1}{\epsilon}\,\frac{\sum_{j=1}^N k_\epsilon(X_t^i, X_t^j)\,(X_t^j - X_t^i)}{\sum_{j=1}^N k_\epsilon(X_t^i, X_t^j)} \tag{24}$$

where the kernel $k_\epsilon(x, y) = \frac{g_\epsilon(x, y)}{\sum_{i=1}^N g_\epsilon(y, X_t^i)}$ is constructed empirically in terms of the Gaussian kernel $g_\epsilon(x, y) = \exp(-\vert x - y\vert^2/(4\epsilon))$. The parameter $\epsilon$ is referred to as the kernel bandwidth. The approximation is asymptotically exact as $\epsilon \to 0$ and $N \to \infty$. The approximation error is of order $O(\epsilon) + O\big(\frac{1}{\sqrt{N}\,\epsilon^{d/4}}\big)$, where the first term is referred to as the bias error and the second term is referred to as the variance error (Hein et al., 2007). The variance error is the dominant term for small values of $\epsilon$, whereas the bias error is the dominant term for large values of $\epsilon$ (see Figure 2(d)).

The resulting interacting particle algorithm is summarized in Algorithm 1 (a code sketch appears at the end of this section). The symplectic method proposed in (Betancourt et al., 2018) is used to carry out the numerical integration. The algorithm is applied to two examples as described in the following sections.

Algorithm 1: Interacting particle implementation of the accelerated gradient flow
  Input: $\rho_0$, $\phi_0$, $N$, $t_0$, $\Delta t$, $p$, $C$, $K$
  Output: $\{X_k^i\}_{i=1,\,k=0}^{N,\,K}$
  Initialize $\{X_0^i\}_{i=1}^N \overset{\text{i.i.d.}}{\sim} \rho_0$, $Y_0^i = \nabla\phi_0(X_0^i)$
  Compute $I_0^{(N)}(X_0^i)$ with (23) or (24)
  for $k = 0$ to $K - 1$ do
    $t_{k+\frac{1}{2}} = t_k + \frac{1}{2}\Delta t$
    $Y_{k+\frac{1}{2}}^i = Y_k^i - \frac{1}{2} C p\, t_{k+\frac{1}{2}}^{2p-1}\big(\nabla f(X_k^i) + I_k^{(N)}(X_k^i)\big)\Delta t$
    $X_{k+1}^i = X_k^i + p\, t_{k+\frac{1}{2}}^{-(p+1)}\, Y_{k+\frac{1}{2}}^i\,\Delta t$
    Compute $I_{k+1}^{(N)}(X_{k+1}^i)$ with (23) or (24)
    $Y_{k+1}^i = Y_{k+\frac{1}{2}}^i - \frac{1}{2} C p\, t_{k+\frac{1}{2}}^{2p-1}\big(\nabla f(X_{k+1}^i) + I_{k+1}^{(N)}(X_{k+1}^i)\big)\Delta t$
    $t_{k+1} = t_{k+\frac{1}{2}} + \frac{1}{2}\Delta t$
  end for

Remark 3. For the case where there is only one particle ($N = 1$), the interaction term is zero and the system (22) reduces to the Nesterov ODE (5).

Remark 4. (Comparison with density estimation) The diffusion map approximation algorithm is conceptually different from an explicit density estimation-based approach. A basic density estimation is to approximate $\rho(x) \approx \frac{1}{N}\sum_{i=1}^N g_\epsilon(x, X_t^i)$, where $g_\epsilon(x, y)$ is the Gaussian kernel. Using such an approximation, the interaction term is approximated as

$$\text{(DE)} \qquad I_t^{(N)}(X_t^i) = \frac{1}{2\epsilon}\,\frac{\sum_{j=1}^N g_\epsilon(X_t^i, X_t^j)\,(X_t^j - X_t^i)}{\sum_{j=1}^N g_\epsilon(X_t^i, X_t^j)} \tag{25}$$

Despite the apparent similarity of the two formulae, (24) for the diffusion map approximation and (25) for density estimation, the nature of the two approximations is different. The difference arises because the kernel $k_\epsilon(x, y)$ in (24) is data-dependent whereas the kernel in (25) is not. While both approximations are exact in the asymptotic limit as $N \to \infty$ and $\epsilon \to 0$, they exhibit different convergence rates. Numerical experiments presented in Figure 2(a)-(d) show that the diffusion map approximation has a much smaller variance for intermediate values of $N$. Theoretical understanding of the difference is the subject of continuing work.

4.1. Gaussian Example

Consider the Gaussian example as described in Sec. 3.6. The simulation results for the scalar ($d = 1$) case with initial distribution $\rho_0 = N(2, 4)$ and target distribution $N(\bar{x}, Q)$, where $\bar{x} = 5.0$ and $Q = 0.25$, are depicted in Figure 1(a)-(b). For this simulation, the numerical parameters are as follows: $N = 100$, $\phi_0(x) = 0.5(x - 2)$, $t_0 = 1$, $\Delta t = 0.1$, $p = 2$, $C = 0.625$, and $K = 400$. The result numerically verifies the $O(e^{-\beta_t}) = O(\frac{1}{t^2})$ convergence rate derived in Theorem 1 for the case where the target distribution is Gaussian.

4.2. Non-Gaussian example

This example involves a non-Gaussian target distribution $\rho_\infty = \frac{1}{2}N(-m, \sigma^2) + \frac{1}{2}N(m, \sigma^2)$, which is a mixture of two one-dimensional Gaussians with $m = 2.0$ and $\sigma^2 = 0.8$.
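Before presenting the simulation results, the following is a minimal Python/NumPy sketch of the two interaction-term approximations (23) and (24), and of a single leapfrog step of Algorithm 1, as used in these examples. The array layout (particles stored as an (N, d) array) and the function names are illustrative assumptions, not taken from the paper's implementation.

```python
import numpy as np

def interaction_gaussian(X):
    # Gaussian approximation (23): I(x) = -Sigma^{-1}(x - m), with the empirical
    # mean and covariance of the particle ensemble X, an (N, d) array.
    m = X.mean(axis=0)
    Sigma = np.atleast_2d(np.cov(X, rowvar=False))   # (d, d), 1/(N-1) normalization
    return -np.linalg.solve(Sigma, (X - m).T).T

def interaction_diffusion_map(X, eps):
    # Diffusion map approximation (24) of grad log rho_t at the particles.
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)  # |X^i - X^j|^2
    g = np.exp(-sq / (4 * eps))                       # Gaussian kernel g_eps
    k = g / g.sum(axis=0, keepdims=True)              # k_eps(x, y) = g(x, y) / sum_i g(y, X^i)
    w = k / k.sum(axis=1, keepdims=True)              # normalize over j for each i
    return (w @ X - X) / eps                          # (1/eps) sum_j k (X^j - X^i) / sum_j k

def algorithm1_step(X, Y, t, grad_f, interaction, dt=0.1, p=2, C=0.625):
    # One symplectic (leapfrog) step of Algorithm 1; under the ideal scaling,
    # e^{a-g} = p t^{-(p+1)} and e^{a+b+g} = C p t^{2p-1}.
    th = t + dt / 2
    Y = Y - 0.5 * dt * C * p * th ** (2 * p - 1) * (grad_f(X) + interaction(X))
    X = X + dt * p * th ** (-(p + 1)) * Y
    Y = Y - 0.5 * dt * C * p * th ** (2 * p - 1) * (grad_f(X) + interaction(X))
    return X, Y, th + dt / 2
```

For Example 4.2, one would pass the $\nabla f$ of the mixture target together with, e.g., `interaction=lambda X: interaction_diffusion_map(X, eps=0.01)`.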
The simulation results are depicted in Figure 1(c)-(d). The numerical parameters are the same as in Example 4.1. The interaction term is approximated using the diffusion map approximation with $\epsilon = 0.01$. The numerical result depicted in Figure 1(c) shows that the diffusion map algorithm converges to the mixture-of-Gaussians target distribution. The result depicted in Figure 1(d) suggests that the convergence rate $O(e^{-\beta_t})$ also appears to hold for this non-log-concave target distribution. Theoretical justification of this is the subject of continuing work.

Figure 1. Simulation results for the Gaussian case (Example 4.1): (a) the time traces of the particles; (b) the KL divergence as a function of time. Simulation results for the non-Gaussian case (Example 4.2): (c) the time traces of the particles; (d) the KL divergence as a function of time.

4.3. Comparison with MCMC and HMCMC

This section contains numerical experiments comparing the performance of the accelerated Algorithm 1, using the diffusion map (DM) approximation (24) and the density estimation (DE)-based approximation (25), with the Markov chain Monte-Carlo (MCMC) algorithm studied in (Durmus & Moulines, 2016) and the Hamiltonian MCMC (HMCMC) algorithm studied in (Cheng et al., 2017). We consider the problem setting of the mixture of Gaussians as in Example 4.2. All algorithms are simulated with a fixed step-size of $\Delta t = 0.1$ for $K = 1000$ iterations. The performance is measured by computing the mean-squared error (m.s.e.) in estimating the expectation of the function $\psi(x) = x\,\mathbb{1}_{x \ge 0}$, denoted as $\hat{\psi} := \int \psi(x)\rho_\infty(x)\, dx$. The m.s.e. at the $k$-th iteration is computed by averaging the error over $M = 100$ runs:

$$\text{m.s.e.}_k = \frac{1}{M}\sum_{m=1}^M \Big\vert \frac{1}{N}\sum_{i=1}^N \psi(X_{t_k}^{i,m}) - \hat{\psi}\Big\vert^2 \tag{26}$$

The numerical results are depicted in Figure 2. Figure 2(a) depicts the m.s.e. as a function of $N$. It is observed that the accelerated Algorithm 1 with the diffusion map approximation admits an order of magnitude better m.s.e. for the same number of particles. It is also observed that the m.s.e. decreases rapidly for intermediate values of $N$ before saturating for large values of $N$, where the bias term dominates (see the discussion following Eq. (24)). Figure 2(b) depicts the m.s.e. as a function of the number of iterations for a fixed number of particles $N = 100$. It is observed that the accelerated Algorithm 1 displays the quickest convergence amongst the algorithms tested. Figure 2(c) depicts the average computational time per iteration as a function of the number of samples $N$. The computational time of the diffusion map approximation scales as $O(N^2)$ because it involves computing an $N \times N$ matrix $[k_\epsilon(X^i, X^j)]_{i,j=1}^N$, while the computational cost of the MCMC and HMCMC algorithms scales as $O(N)$.

Figure 2. Simulation-based comparison of the performance of the accelerated Algorithm 1 using the diffusion map (DM) approximation (24) and the density estimation (DE)-based approximation (25) with the MCMC and HMCMC algorithms: (a) the mean-squared error (26) as a function of the number of samples N; (b) the m.s.e. as a function of the number of iterations; (c) the average computational time per iteration as a function of the number of samples; (d) m.s.e. comparison between the diffusion map and the density estimation-based approaches as a function of the kernel bandwidth ε.

The computational complexity may be improved by (i) exploiting the
sparsity structure of the $N \times N$ matrix; (ii) sub-sampling the particles in computing the empirical averages; (iii) adaptively updating the $N \times N$ matrix according to a certain error criterion.

Finally, we provide a comparison between the diffusion map approximation (24) and the density estimation-based approximation (25): Figure 2(d) depicts the m.s.e. for these two approximations as a function of the kernel bandwidth $\epsilon$ for a fixed number of particles $N = 100$. For very large and for very small values of $\epsilon$, where the bias and the variance dominate the error, respectively, the two algorithms have similar m.s.e. However, for intermediate values of $\epsilon$, the diffusion map approximation has a smaller variance, and thus a lower m.s.e.

5. Conclusion and directions for future work

The main contribution of this paper is to extend the variational formulation of (Wibisono et al., 2016) to the space of probability distributions (see Theorem 1). In particular, we obtain theoretical results and numerical algorithms for accelerated gradient flow in such settings. Through this procedure, we provide an accelerated version of the Langevin equation (see equation (19)) and compare it with the under-damped Langevin equation (see Remark 2). Moreover, we recover the ODE limit of the Nesterov accelerated method in the special quadratic Gaussian case (see Proposition 1). Two algorithms based upon an interacting particle representation are presented and illustrated with numerical examples. The examples serve to verify the theoretical results, and demonstrate that our proposed approach admits faster convergence and better accuracy with the same number of samples, compared to the standard MCMC and HMCMC methods. The main bottleneck for the implementation of our approach is the $O(N^2)$ computational complexity, which is common to the related interacting particle-based approaches in the literature. This is an important problem in its own right and a subject of future work. Some directions for future work include: (i) removing the assumption $d = 1$ in the convergence result of Theorem 1; (ii) using the variational formulation to obtain meaningful approximations of the interaction term for the finite-$N$ algorithm; (iii) analysis of the convergence under the weaker assumption that the target distribution satisfies the logarithmic Sobolev inequality; and (iv) error analysis of the numerical algorithms in the finite-$N$ and discrete-time settings.

Acknowledgements

Financial support from the NSF grant CMMI-1462773 and the ARO grant W911NF1810334 is gratefully acknowledged.

References

Ambrosio, L., Gigli, N., and Savaré, G. Gradient Flows: in Metric Spaces and in the Space of Probability Measures. Springer Science & Business Media, 2008.

Arjovsky, M., Chintala, S., and Bottou, L. Wasserstein GAN. arXiv preprint arXiv:1701.07875, 2017.

Bensoussan, A., Frehse, J., and Yam, P. Mean Field Games and Mean Field Type Control Theory, volume 101. Springer, 2013.

Betancourt, M., Jordan, M. I., and Wilson, A. C. On symplectic optimization. arXiv preprint arXiv:1802.03653, 2018.

Blei, D. M., Kucukelbir, A., and McAuliffe, J. D. Variational inference: A review for statisticians. Journal of the American Statistical Association, 112(518):859–877, 2017.

Carmona, R. and Delarue, F. Probabilistic Theory of Mean Field Games with Applications I-II. Springer, 2017.

Chen, C., Zhang, R., Wang, W., Li, B., and Chen, L. A unified particle-optimization framework for scalable Bayesian sampling. arXiv preprint arXiv:1805.11659, 2018.
Cheng, X., Chatterji, N. S., Bartlett, P. L., and Jordan, M. I. Underdamped Langevin MCMC: A non-asymptotic analysis. arXiv preprint arXiv:1707.03663, 2017.

Chizat, L. and Bach, F. On the global convergence of gradient descent for over-parameterized models using optimal transport. arXiv preprint arXiv:1805.09545, 2018.

Coifman, R. R. and Lafon, S. Diffusion maps. Applied and Computational Harmonic Analysis, 21(1):5–30, 2006.

Del Moral, P. Measure-valued processes and interacting particle systems. Application to nonlinear filtering problems. The Annals of Applied Probability, 8(2):438–495, 1998.

Durmus, A. and Moulines, E. High-dimensional Bayesian inference via the unadjusted Langevin algorithm. arXiv preprint arXiv:1605.01559, 2016.

Evensen, G. The ensemble Kalman filter: Theoretical formulation and practical implementation. Ocean Dynamics, 53(4):343–367, 2003.

Frogner, C. and Poggio, T. Approximate inference with Wasserstein gradient flows. arXiv preprint arXiv:1806.04542, 2018.

Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., and Bengio, Y. Generative adversarial nets. In Advances in Neural Information Processing Systems, pp. 2672–2680, 2014.

Hein, M., Audibert, J.-Y., and von Luxburg, U. Graph Laplacians and their convergence on random neighborhood graphs. Journal of Machine Learning Research, 8(Jun):1325–1368, 2007.

Jain, P., Kakade, S. M., Kidambi, R., Netrapalli, P., and Sidford, A. Accelerating stochastic gradient descent. arXiv preprint arXiv:1704.08227, 2017.

Jordan, R., Kinderlehrer, D., and Otto, F. The variational formulation of the Fokker-Planck equation. SIAM Journal on Mathematical Analysis, 29(1):1–17, 1998.

Liu, C., Zhuo, J., Cheng, P., Zhang, R., Zhu, J., and Carin, L. Accelerated first-order methods on the Wasserstein space for Bayesian inference. arXiv preprint arXiv:1807.01750, 2018.

Liu, Q. Stein variational gradient descent as gradient flow. In Advances in Neural Information Processing Systems, pp. 3115–3123, 2017.

Liu, Q. and Wang, D. Stein variational gradient descent: A general purpose Bayesian inference algorithm. In Advances in Neural Information Processing Systems, pp. 2378–2386, 2016.

Liu, Y., Shang, F., Cheng, J., Cheng, H., and Jiao, L. Accelerated first-order methods for geodesically convex optimization on Riemannian manifolds. In Advances in Neural Information Processing Systems, pp. 4868–4877, 2017.

McCann, R. J. A convexity principle for interacting gases. Advances in Mathematics, 128(1):153–179, 1997.

Neal, R. M. MCMC using Hamiltonian dynamics. Handbook of Markov Chain Monte Carlo, 2(11):2, 2011.

Reich, S. A dynamical systems framework for intermittent data assimilation. BIT Numerical Mathematics, 51(1):235–249, 2011.

Richemond, P. H. and Maginnis, B. On Wasserstein reinforcement learning and the Fokker-Planck equation. arXiv preprint arXiv:1712.07185, 2017.

Su, W., Boyd, S., and Candes, E. A differential equation for modeling Nesterov's accelerated gradient method: Theory and insights. In Advances in Neural Information Processing Systems, pp. 2510–2518, 2014.

Sutton, R. S., McAllester, D. A., Singh, S. P., and Mansour, Y. Policy gradient methods for reinforcement learning with function approximation. In Advances in Neural Information Processing Systems, pp. 1057–1063, 2000.

Villani, C. Topics in Optimal Transportation. Number 58. American Mathematical Society, 2003.

Wibisono, A., Wilson, A. C., and Jordan, M. I.
A variational perspective on accelerated methods in optimization. Proceedings of the National Academy of Sciences, 113(47):E7351–E7358, 2016.

Yang, T., Laugesen, R. S., Mehta, P. G., and Meyn, S. P. Multivariable feedback particle filter. Automatica, 71:10–23, 2016.

Zhang, C., Taghvaei, A., and Mehta, P. G. A mean-field optimal control formulation for global optimization. IEEE Transactions on Automatic Control, 64(1):279–286, 2019.

Zhang, R., Chen, C., Li, C., and Carin, L. Policy optimization as Wasserstein gradient flows. arXiv preprint arXiv:1808.03030, 2018.