# Unbalanced Sobolev Descent

Youssef Mroueh, Mattia Rigotti
IBM Research AI
mroueh@us.ibm.com, mrg@zurich.ibm.com

Abstract

We introduce Unbalanced Sobolev Descent (USD), a particle descent algorithm for transporting a high dimensional source distribution to a target distribution that does not necessarily have the same mass. We define the Sobolev-Fisher discrepancy between distributions and show that it relates to advection-reaction transport equations and the Wasserstein-Fisher-Rao metric between distributions. USD transports particles along gradient flows of the witness function of the Sobolev-Fisher discrepancy (advection step) and reweighs the mass of particles with respect to this witness function (reaction step). The reaction step can be thought of as a birth-death process of the particles with rate of growth proportional to the witness function. When the Sobolev-Fisher witness function is estimated in a Reproducing Kernel Hilbert Space (RKHS), under mild assumptions we show that USD converges asymptotically (in the limit of infinite particles) to the target distribution in the Maximum Mean Discrepancy (MMD) sense. We then give two methods to estimate the Sobolev-Fisher witness with neural networks, resulting in two Neural USD algorithms. The first one implements the reaction step with mirror descent on the weights, while the second implements it through a birth-death process of particles. We show on synthetic examples that USD transports distributions with or without conservation of mass faster than previous particle descent algorithms, and finally demonstrate its use for molecular biology analyses, where our method is naturally suited to match developmental stages of populations of differentiating cells based on their single-cell RNA sequencing profile. Code is available at http://github.com/ibm/usd.

1 Introduction

Particle flows such as Stein Variational Gradient Descent [1], Sobolev descent [2] and MMD flows [3] allow the transport of a source distribution to a target distribution, following paths that progressively decrease a discrepancy between distributions (e.g., the Kernel Stein discrepancy or the MMD). Particle flows can be seen through the lens of Optimal Transport as gradient flows in the Wasserstein geometry [4], and they have recently been used to analyze the dynamics of gradient descent in overparametrized neural networks [5] and of Generative Adversarial Networks (GANs) training [2].

Unbalanced Optimal Transport [6, 7, 8, 9] is a new twist on the classical Optimal Transport theory [10], where the total mass between source and target distributions may not be conserved. The Wasserstein-Fisher-Rao (WFR) distance introduced in [7] gives a dynamic formulation similar to the so-called Benamou-Brenier dynamic form of the Wasserstein-2 distance [11], where the dynamics of the transport is governed by an advection term with a velocity field $V_t$ and a reaction term with a rate of growth $r_t$, corresponding to the construction and destruction of mass at that rate:
$$\mathrm{WFR}^2(p, q) = \inf_{q_t, V_t, r_t} \int_0^1 \int_{\mathcal{X}} \Big( \|V_t(x)\|^2 + \tfrac{\alpha}{2}\, r_t^2(x) \Big)\, dq_t(x)\, dt,$$
$$\partial_t q_t = -\operatorname{div}(q_t(x) V_t(x)) + \alpha\, r_t(x)\, q_t(x), \qquad q_0 = q,\; q_1 = p. \tag{1}$$

From a particle flow point of view, this advection-reaction in Unbalanced Optimal Transport corresponds to processes of birth and death, where particles are created or killed in the transport from source to target.
Particle gradient descent using the WFR geometry has been used in the analysis of over-parameterized neural networks and implemented as birth-death processes in [12] and as conic descent in [13]. In the context of particle transport, [14] showed that birth-death processes can accelerate Langevin diffusion. On the application side, Unbalanced Optimal Transport is a powerful tool in biological modeling. For instance, tumor growth trajectories have been modeled in the WFR framework by [15]. [16] and [17] used Unbalanced Optimal Transport to find differentiation trajectories of cells during development.

The dynamic formulation of WFR is challenging as it requires solving PDEs. One can use the unbalanced Sinkhorn divergence and apply an Euler scheme to find the trajectories between source and target as done in [18], but this does not give any convergence guarantees. In this paper we take another approach, similar to the one of Sobolev Descent [2]. We introduce the Kernel Sobolev-Fisher discrepancy, which is related to WFR and has the advantage of having a closed form solution. We present a particle descent algorithm in the unbalanced case named Unbalanced Sobolev Descent (USD) that consists of two steps: an advection step that uses the gradient flows of a witness function of the Sobolev-Fisher discrepancy, and a reaction step that reweighs the particles according to the witness function. We show theoretically that USD is convergent in the Maximum Mean Discrepancy (MMD) sense, that the reaction step accelerates the convergence in the sense that it results in strictly steeper descent directions, and give a variant where the witness function is efficiently estimated as a neural network. We then empirically demonstrate the effectiveness and acceleration of USD in synthetic experiments and image color transfer tasks, and finally use it to model the developmental trajectories of populations of cells from single-cell RNA sequencing data [16].

2 Sobolev-Fisher Discrepancy

In this Section we define the Sobolev-Fisher Discrepancy (SF) and show how it relates to advection-reaction PDEs. While this formulation remains computationally challenging, we will show in Section 3 how to approximate it in an RKHS.

2.1 Advection-Reaction with no Conservation of Mass

Definition 1 (Sobolev-Fisher Discrepancy). Let $p, q$ be two measures defined on $\mathcal{X} \subseteq \mathbb{R}^d$. For $\alpha > 0$, the Sobolev-Fisher Discrepancy is defined as follows:
$$\mathrm{SF}(p, q) = \sup_f \Big\{ \mathbb{E}_{x\sim p} f(x) - \mathbb{E}_{x\sim q} f(x) \;:\; \mathbb{E}_{x\sim q}\|\nabla_x f(x)\|^2 + \alpha\,\mathbb{E}_{x\sim q} f^2(x) \le 1,\; f|_{\partial\mathcal{X}} = 0 \Big\}.$$
Note that the objective of SF is an Integral Probability Metric (IPM) objective, and the function space imposes a constraint on the weighted Sobolev norm of the witness function $f$ on the support of the distribution $q$. We refer to $q$ as the source distribution and $p$ as the target distribution. The following theorem relates the solution of the Sobolev-Fisher Discrepancy to an advection-reaction PDE:

Theorem 1 (Sobolev-Fisher Critic as Solution of an Advection-Reaction PDE). Let $u$ be the solution of the advection-reaction PDE:
$$p(x) - q(x) = -\operatorname{div}\big(q(x)\nabla_x u(x)\big) + \alpha\, u(x)\, q(x), \qquad u|_{\partial\mathcal{X}} = 0.$$
Then $\mathrm{SF}^2(p, q) = \mathbb{E}_{x\sim q}\|\nabla_x u(x)\|^2 + \alpha\,\mathbb{E}_{x\sim q} u^2(x)$, with witness function $f^\star_{p,q} = u/\mathrm{SF}(p, q)$.

From Theorem 1 we see that the witness function of $\mathrm{SF}^2$ solves an advection-reaction PDE in which mass is transported from $q$ to $p$, via an advection term following the gradient flow of $\nabla_x u$, and a reaction term amounting to construction/destruction of mass that we also refer to as a birth-death process with a rate given by $u$.
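To see where Theorem 1 comes from, here is a short weak-form sketch of the argument (our summary under standard regularity assumptions, not the paper's proof). Multiplying the PDE by a test function $f$ with $f|_{\partial\mathcal{X}} = 0$ and integrating by parts gives
$$\mathbb{E}_{x\sim p} f(x) - \mathbb{E}_{x\sim q} f(x) = \mathbb{E}_{x\sim q}\big\langle \nabla_x f(x), \nabla_x u(x)\big\rangle + \alpha\,\mathbb{E}_{x\sim q} f(x)\, u(x).$$
The right-hand side is an inner product between $f$ and $u$ with respect to the weighted Sobolev-Fisher norm, so by Cauchy-Schwarz and the constraint $\mathbb{E}_{x\sim q}\|\nabla_x f\|^2 + \alpha\,\mathbb{E}_{x\sim q} f^2 \le 1$,
$$\mathbb{E}_{x\sim p} f(x) - \mathbb{E}_{x\sim q} f(x) \le \sqrt{\mathbb{E}_{x\sim q}\|\nabla_x u(x)\|^2 + \alpha\,\mathbb{E}_{x\sim q} u^2(x)},$$
with equality at $f = u / \sqrt{\mathbb{E}_{x\sim q}\|\nabla_x u\|^2 + \alpha\,\mathbb{E}_{x\sim q} u^2}$. This recovers both the value $\mathrm{SF}^2(p, q) = \mathbb{E}_{x\sim q}\|\nabla_x u(x)\|^2 + \alpha\,\mathbb{E}_{x\sim q} u^2(x)$ and the witness $f^\star_{p,q} = u/\mathrm{SF}(p, q)$ of Theorem 1.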
Intuitively, where the witness function $u(x) > 0$ we need to create mass, and where $u(x) < 0$ we need to destroy mass. This is similar to the notion of particle birth and death defined in [12] and [14]. In Proposition 1 we give a convenient unconstrained equivalent form of $\mathrm{SF}^2$:

Proposition 1 (Unconstrained Form of $\mathrm{SF}^2$). SF satisfies $\mathrm{SF}^2(p, q) = \sup_u \mathcal{L}(u)$, with
$$\mathcal{L}(u) = 2\big(\mathbb{E}_{x\sim p} u(x) - \mathbb{E}_{x\sim q} u(x)\big) - \big(\mathbb{E}_{x\sim q}\|\nabla_x u(x)\|^2 + \alpha\,\mathbb{E}_{x\sim q} u^2(x)\big).$$

Theorem 2 gives a physical interpretation of $\mathrm{SF}^2$ as finding the witness function $u$ that has minimum sum of kinetic energy and rate of birth-death while transporting $q$ to $p$ via advection-reaction:

Theorem 2 (Kinetic Energy & Birth-Death Rate Minimization). Consider the following minimization:
$$P^\star = \inf_{\substack{V:\,\mathcal{X}\to\mathbb{R}^d \\ r:\,\mathcal{X}\to\mathbb{R}}} \left\{ \int_{\mathcal{X}} \big( \|V(x)\|^2 + \alpha\, r^2(x) \big)\, q(x)\, dx \;:\; p(x) - q(x) = -\operatorname{div}\big(q(x)V(x)\big) + \alpha\, r(x)\, q(x) \right\}.$$
We then have that $P^\star = \mathrm{SF}^2(p, q)$, and moreover:
$$\mathrm{SF}^2(p, q) = \inf_u \left\{ \int_{\mathcal{X}} \|\nabla_x u(x)\|^2 q(x)\, dx + \alpha \int_{\mathcal{X}} u^2(x) q(x)\, dx \;:\; p(x) - q(x) = -\operatorname{div}\big(q(x)\nabla_x u(x)\big) + \alpha\, u(x)\, q(x) \right\}.$$

Remarks. a) When $\alpha = 0$ we obtain the Sobolev Discrepancy $\|p - q\|_{\dot{H}^{-1}(q)}$, which linearizes the Wasserstein-2 distance. b) Note that this corresponds to a Beckmann type of optimal transport [19], where we transport $q$ to $p$ ($q$ and $p$ do not have the same total mass) via an advection-reaction with mass not conserved. It is easy to see that $\int_{\mathcal{X}} (p(x) - q(x))\, dx = \alpha \int_{\mathcal{X}} u(x)\, q(x)\, dx$.

2.2 Advection-Reaction with Conservation of Mass

Define the Sobolev-Fisher Discrepancy with conservation of mass: $\overline{\mathrm{SF}}^2(p, q) = \sup_u \bar{\mathcal{L}}(u)$, where
$$\bar{\mathcal{L}}(u) = 2\big(\mathbb{E}_{x\sim p} u(x) - \mathbb{E}_{x\sim q} u(x)\big) - \Big(\mathbb{E}_{x\sim q}\|\nabla_x u(x)\|^2 + \alpha\,\mathbb{E}_{x\sim q}\big(u(x) - \mathbb{E}_{x\sim q}u(x)\big)^2\Big).$$
The only difference between this expression and $\mathrm{SF}^2$ in Proposition 1 is that the variance of the witness function is kept under control, instead of its second order moment. Defining $\bar{E}(u) = \int_{\mathcal{X}} \big(\|\nabla_x u(x)\|^2 + \alpha(u(x) - \mathbb{E}_{x\sim q}u(x))^2\big)\, q(x)\, dx$, one can similarly show that $\overline{\mathrm{SF}}$ has the primal representation:
$$\overline{\mathrm{SF}}^2(p, q) = \inf_u \Big\{ \bar{E}(u) \;:\; p(x) - q(x) = -\operatorname{div}\big(q(x)\nabla_x u(x)\big) + \alpha\big(u(x) - \mathbb{E}_{x\sim q}u(x)\big)\, q(x) \Big\}.$$
Hence, we see that $\overline{\mathrm{SF}}$ is the minimum sum of kinetic energy and variance of the birth-death rate for transporting $q$ to $p$ following an advection-reaction PDE with conserved total mass. The conservation of mass comes from the fact that $\chi(x) = -\operatorname{div}\big(q(x)\nabla_x u(x)\big) + \alpha\big(u(x) - \mathbb{E}_{x\sim q}u(x)\big)q(x)$ satisfies:
$$\int_{\mathcal{X}} (p(x) - q(x))\, dx = \int_{\mathcal{X}} \chi(x)\, dx = 0.$$

3 Kernel Sobolev-Fisher Discrepancy

In this section we turn to the estimation of the SF discrepancy by restricting the witness function to a Reproducing Kernel Hilbert Space (RKHS), resulting in a closed-form solution.

3.1 Estimation in Finite Dimensional RKHS

Consider the finite dimensional RKHS corresponding to an $m$-dimensional feature map $\Phi$: $\mathcal{H} = \{f \mid f(x) = \langle w, \Phi(x)\rangle,\; \Phi : \mathcal{X}\to\mathbb{R}^m,\; w\in\mathbb{R}^m\}$. Define the kernel mean embeddings $\mu(p) = \mathbb{E}_{x\sim p}\Phi(x)$, $\mu(q) = \mathbb{E}_{x\sim q}\Phi(x)$, and $\delta_{p,q} = \mu(p) - \mu(q)$. Let $C(q) = \mathbb{E}_{x\sim q}\Phi(x)\Phi(x)^\top$ be the covariance matrix and $D(q) = \mathbb{E}_{x\sim q}[J_\Phi(x)]^\top J_\Phi(x)$ be the Gramian of the Jacobian, where $[J_\Phi(x)]_{a,j} = \frac{\partial \Phi_j(x)}{\partial x_a}$, $a = 1\ldots d$, $j = 1\ldots m$.

Definition 2 (Regularized Kernel Sobolev-Fisher Discrepancy (KSFD)). Let $u\in\mathcal{H}$, $\lambda > 0$ and $\gamma\in\{0, 1\}$, and define:
$$\mathcal{L}_{\gamma,\lambda}(u) = 2\big(\mathbb{E}_{x\sim p}u(x) - \mathbb{E}_{x\sim q}u(x)\big) - \Big(\mathbb{E}_{x\sim q}\big[\|\nabla_x u(x)\|^2 + \alpha\big(u(x) - \gamma\,\mathbb{E}_{x\sim q}u(x)\big)^2\big] + \lambda\|u\|^2_{\mathcal{H}}\Big).$$
The Regularized Kernel Sobolev-Fisher Discrepancy is defined as $\mathrm{SF}^2_{\mathcal{H},\gamma,\lambda}(p, q) = \sup_{u\in\mathcal{H}} \mathcal{L}_{\gamma,\lambda}(u)$. When $\gamma = 0$ this corresponds to the unbalanced case, i.e. birth-death with no conservation of total mass, while for $\gamma = 1$ we have birth-death with conservation of total mass.

Proposition 2 (Estimation in RKHS).
The Kernel Sobolev-Fisher Discrepancy is given by $\mathrm{SF}^2_{\mathcal{H},\gamma,\lambda}(p, q) = \langle u^{\lambda,\gamma}_{p,q}, \delta_{p,q}\rangle$, where the critic is
$$u^{\lambda,\gamma}_{p,q} = \big(D(q) + \alpha C_\gamma(q) + \lambda I_m\big)^{-1}\delta_{p,q}, \qquad C_\gamma(q) = C(q) - \gamma\,\mu(q)\mu(q)^\top.$$
Let $u^{\lambda,\gamma}_{p,q}(x) = \langle u^{\lambda,\gamma}_{p,q}, \Phi(x)\rangle$ and $\delta_{p,q}(x) = \langle \delta_{p,q}, \Phi(x)\rangle$; then
$$\nabla_x u^{\lambda,\gamma}_{p,q}(x) = J_\Phi(x)\,\big(D(q) + \alpha C_\gamma(q) + \lambda I_m\big)^{-1}\delta_{p,q}.$$

Remarks. a) For the unbalanced case $\gamma = 0$, we refer to $\mathrm{SF}^2_{\mathcal{H},0,\lambda}$ as $\mathrm{SF}^2_{\mathcal{H},\lambda}$. For the case of mass conservation $\gamma = 1$, we refer to $\mathrm{SF}^2_{\mathcal{H},1,\lambda}$ as $\overline{\mathrm{SF}}^2_{\mathcal{H},\lambda}$. Note that $C_1(q) = \bar{C}(q) = C(q) - \mu(q)\mu(q)^\top$. b) A similar kernelized discrepancy was introduced in [20], but not as an approximation of the Sobolev-Fisher discrepancy, nor in the context of unbalanced distributions and advection-reaction. c) For $\alpha = 0$ we obtain the Kernelized Sobolev Discrepancy (KSD) of [2].

3.2 Kernel SF for Discrete Measures

Consider discrete measures $p = \sum_{i=1}^N a_i\delta_{x_i}$ and $q = \sum_{j=1}^n b_j\delta_{y_j}$ (with no conservation of mass we can have $\sum_i a_i \neq \sum_j b_j \neq 1$). An estimate of the Sobolev-Fisher critic is given by $\hat{u}^{\lambda,\gamma}_{p,q} = (\hat{D}(q) + \alpha \hat{C}_\gamma(q) + \lambda I_m)^{-1}(\hat{\mu}(p) - \hat{\mu}(q))$, where the empirical kernel mean embeddings are $\hat{\mu}(p) = \sum_{i=1}^N a_i\Phi(x_i)$ and $\hat{\mu}(q) = \sum_{j=1}^n b_j\Phi(y_j)$. The empirical operator embeddings are given by $\hat{D}(q) = \sum_{j=1}^n b_j[J_\Phi(y_j)]^\top J_\Phi(y_j)$ and $\hat{C}_\gamma(q) = \sum_{j=1}^n b_j\Phi(y_j)\Phi(y_j)^\top - \gamma\,\hat{\mu}(q)\hat{\mu}(q)^\top$.

4 Unbalanced Continuous Kernel Sobolev Descent

Given the Kernel Sobolev-Fisher Discrepancy defined in the previous sections and its relation to advection-reaction transport, in this section we construct a Markov process that transports particles drawn from a source distribution to a target distribution. Note that we do not assume that the densities are normalized nor that they have the same total mass.

4.1 Constructing the Continuous Markov Process

Given $\alpha, \lambda > 0$, $\gamma\in\{0, 1\}$ and $n$ weighted particles drawn from the source distribution $q^n_0 = q = \sum_{i=1}^n b_i\delta_{y_i}$, i.e. $X^i_0 = y_i$ and $w^i_0 = b_i$. Recall that the target distribution is given by $p = \sum_{i=1}^N a_i\delta_{x_i}$. We define the following Markov process, which we name Unbalanced Kernel Sobolev Descent:
$$dX^i_t = \nabla_x u^{\lambda,\gamma}_{p,q^n_t}(X^i_t)\, dt \quad \text{(advection step)}$$
$$dw^i_t = \alpha\Big(u^{\lambda,\gamma}_{p,q^n_t}(X^i_t) - \gamma\,\mathbb{E}_{q^n_t} u^{\lambda,\gamma}_{p,q^n_t}\Big)\, w^i_t\, dt \quad \text{(reaction step)}$$
$$q^n_t = \sum_{i=1}^n w^i_t\,\delta_{X^i_t}, \tag{2}$$
where $u^{\lambda,\gamma}_{p,q^n_t}$ is the critic of the Kernel Sobolev-Fisher discrepancy, whose expression and gradient are given in Proposition 2. We see that USD consists of two steps: an advection step that updates the particle positions following the gradient flow of the Sobolev-Fisher critic, and a reaction step that updates the weights of the particles with a growth rate proportional to that critic. The reaction step consists in mass creation or destruction, depending on the value of the witness function. It can be seen as a birth-death process on the particles, where the survival log-probability of a particle is proportional to the critic evaluated at this particle.

4.2 Generator Expression and PDE in the Limit of $n\to\infty$

Proposition 3 gives the evolution equation of a functional of the intermediate distributions $q^n_t$ produced in the descent, in the limit of infinitely many particles $n\to\infty$:

Proposition 3. Let $\Psi : \mathcal{P}(\mathcal{X}) \to \mathbb{R}$ be a functional on the space of measures. Let $q^n_t$ be the distribution produced by USD at time $t$, and let $q_t$ be its limit as $n\to\infty$. We have $\partial_t\Psi[q_t] = (\mathcal{L}\Psi)[q_t]$, where
$$(\mathcal{L}\Psi)(q) = \int \big\langle \nabla_x u^{\lambda,\gamma}_{p,q}(x),\, \nabla_x D_q\Psi(x)\big\rangle\, q(dx) + \alpha\int D_q\Psi(x)\big(u^{\lambda,\gamma}_{p,q}(x) - \gamma\,\mathbb{E}_{q} u^{\lambda,\gamma}_{p,q}\big)\, q(x)\, dx.$$
Here the functional derivative $D_\mu$ is defined through the first variation: for a signed measure $\chi$ (with $\int\chi(x)\,dx = 0$),
$$\int D_\mu\Psi(x)\,\chi(x)\,dx = \lim_{\varepsilon\to 0} \frac{\Psi(\mu + \varepsilon\chi) - \Psi(\mu)}{\varepsilon}.$$
In particular, the paths of USD in the limit of $n\to\infty$ satisfy the advection-reaction equation:
$$\partial_t q_t = -\operatorname{div}\big(q_t\,\nabla_x u^{\lambda,\gamma}_{p,q_t}\big) + \alpha\big(u^{\lambda,\gamma}_{p,q_t} - \gamma\,\mathbb{E}_{q_t} u^{\lambda,\gamma}_{p,q_t}\big)\, q_t.$$

4.3 Unbalanced Sobolev Descent Decreases the MMD

The following theorem shows that USD, when the number of particles goes to infinity, decreases the MMD distance at each step, where $\mathrm{MMD}^2(p, q) = \|\mu(p) - \mu(q)\|^2$.

Theorem 3 (Unbalanced Sobolev Descent decreases the MMD). Consider the paths $q_t$ produced by USD. In the limit of particles $n\to\infty$ we have
$$\frac{1}{2}\,\frac{d\,\mathrm{MMD}^2(p, q_t)}{dt} = -\Big(\mathrm{MMD}^2(p, q_t) - \lambda\,\mathrm{SF}^2_{\mathcal{H},\gamma,\lambda}(p, q_t)\Big) \le 0. \tag{3}$$
In particular, in the regularized case $\lambda > 0$ with strict descent (i.e. $q_t\neq p$ implies $\mathrm{MMD}^2(p, q_t) - \lambda\,\mathrm{SF}^2_{\mathcal{H},\gamma,\lambda}(p, q_t) > 0$), USD converges in the MMD sense: $\lim_{t\to\infty}\mathrm{MMD}^2(p, q_t) = 0$. Similarly to [2], strict descent is ensured if the kernel and the target distribution $p$ satisfy the condition $\delta_{p,q}\notin\mathrm{Null}\big(D(q) + \alpha C_\gamma(q)\big)$ for all $q\neq p$.

USD Accelerates the Convergence. We now prove a lemma that can be used to show that Unbalanced Sobolev Descent has an acceleration advantage over Sobolev Descent [2].

Lemma 1. In the regularized case $\lambda > 0$ with $\alpha > 0$, the Kernel Sobolev-Fisher Discrepancy $\mathrm{SF}_{\mathcal{H},\gamma,\lambda}$ is strictly upper bounded by the Kernel Sobolev Discrepancy $\mathrm{S}_{\mathcal{H},\lambda}$ (i.e. the case $\alpha = 0$) [2]:
$$\mathrm{SF}^2_{\mathcal{H},\gamma,\lambda}(p, q) < \mathrm{S}^2_{\mathcal{H},\lambda}(p, q).$$
From Lemma 1 and Eq. (3), we see that USD ($\alpha > 0$) results in a larger decrease in MMD than SD [2] ($\alpha = 0$), i.e. a steeper descent. Hence, USD's advantages over SD are twofold: 1) it allows unbalanced transport, 2) it accelerates convergence for both balanced and unbalanced transport.

USD with Universal Infinite Dimensional Kernel. While we presented USD with a finite dimensional kernel for ease of presentation, we show in Appendix D that all our results hold for an infinite dimensional kernel. For a universal or a characteristic kernel, convergence in MMD implies convergence in distribution (see [21, Theorem 12]). Hence, using a universal kernel, USD guarantees weak convergence of $q_t$ to $p$ as $\mathrm{MMD}(p, q_t)\to 0$.

4.4 Understanding the Effect of the Reaction Step: Whitened Principal Transport Directions

In [2] it was shown that the gradient of the Sobolev Discrepancy can be written as a linear combination of principal transport directions of the Gramian of derivatives $D(q)$. Here we show that unbalanced descent leads to a similar interpretation in a whitened feature space, thanks to the $\ell_2$ regularizer. Let
$$\mathcal{H}_q = \Big\{f \,\Big|\, f(x) = \big\langle v, \tilde\Phi_q(x)\big\rangle,\; \tilde\Phi_q(x) = \big(C_\gamma(q) + \tfrac{\lambda}{\alpha} I_m\big)^{-1/2}\Phi(x)\Big\},\qquad \tilde\delta_{p,q} = \big(C_\gamma(q) + \tfrac{\lambda}{\alpha} I_m\big)^{-1/2}\delta_{p,q},$$
$$\tilde D(q) = \big(C_\gamma(q) + \tfrac{\lambda}{\alpha} I_m\big)^{-1/2}\, D(q)\, \big(C_\gamma(q) + \tfrac{\lambda}{\alpha} I_m\big)^{-1/2},$$
and let $v^{\lambda,\gamma}_{p,q} = \big(\tilde D(q) + \alpha I_m\big)^{-1}\tilde\delta_{p,q}$. It is easy to see that the critic of the SF can be written as $u^{\lambda,\gamma}_{p,q}(x) = \langle u^{\lambda,\gamma}_{p,q}, \Phi(x)\rangle = \langle v^{\lambda,\gamma}_{p,q}, \tilde\Phi_q(x)\rangle$. Note that $\tilde\Phi_q$ is a whitened feature map and $\tilde D(q)$ is the Gramian of its derivatives. Let $d_j, \lambda_j$ be the eigenvectors and eigenvalues of $\tilde D(q)$. We have $v^{\lambda,\gamma}_{p,q} = \sum_{j=1}^m \frac{1}{\lambda_j + \alpha}\, d_j\, \langle d_j, \tilde\delta_{p,q}\rangle$. Hence, we can write the gradient of the Sobolev-Fisher critic as
$$\nabla_x u^{\lambda,\gamma}_{p,q}(x) = \sum_{j=1}^m \frac{1}{\lambda_j + \alpha}\,\big\langle d_j, \tilde\delta_{p,q}\big\rangle\, [J_{\tilde\Phi_q}(x)]\, d_j = \sum_{j=1}^m \frac{1}{\lambda_j + \alpha}\,\big\langle d_j, \tilde\delta_{p,q}\big\rangle\, \nabla_x \tilde d_j(x), \qquad \tilde d_j(x) = \big\langle d_j, \tilde\Phi_q(x)\big\rangle.$$
This says that mass is transported along a weighted combination of whitened principal transport directions $\nabla_x\tilde d_j(x)$. The parameter $\alpha$ introduces a damping of the transport, acting as a spectral filter $\frac{1}{\lambda_j + \alpha}$ on the transport directions in the whitened space.
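To make Proposition 2, Section 3.2 and the whitened-space view above concrete, below is a minimal NumPy sketch of the critic estimate $\hat u^{\lambda,\gamma}_{p,q}$ and its gradient for weighted samples, using random Fourier features of a Gaussian kernel as the explicit finite-dimensional feature map $\Phi$. This is an illustrative reconstruction under our own naming (`rff`, `sf_critic`, etc.), not the released implementation.

```python
import numpy as np

def rff(X, W, b):
    """Random Fourier features Phi(x) = sqrt(2/m) cos(W^T x + b); X: (N, d) -> (N, m)."""
    m = W.shape[1]
    return np.sqrt(2.0 / m) * np.cos(X @ W + b)

def rff_jacobian(x, W, b):
    """Jacobian J[a, k] = dPhi_k / dx_a at a single point x: (d,) -> (d, m)."""
    m = W.shape[1]
    z = W.T @ x + b
    return -np.sqrt(2.0 / m) * W * np.sin(z)

def sf_critic(Xp, ap, Yq, bq, W, b, alpha=1.0, lam=1e-3, gamma=1):
    """Sobolev-Fisher critic u = (D(q) + alpha*C_gamma(q) + lam*I)^(-1) (mu(p) - mu(q))."""
    m = W.shape[1]
    Pp, Pq = rff(Xp, W, b), rff(Yq, W, b)
    mu_p, mu_q = ap @ Pp, bq @ Pq                                   # weighted mean embeddings
    delta = mu_p - mu_q
    C = Pq.T @ (bq[:, None] * Pq) - gamma * np.outer(mu_q, mu_q)    # C_gamma(q)
    D = np.zeros((m, m))
    for bj, yj in zip(bq, Yq):                                      # sum_j b_j J(y_j)^T J(y_j)
        J = rff_jacobian(yj, W, b)
        D += bj * (J.T @ J)
    return np.linalg.solve(D + alpha * C + lam * np.eye(m), delta)

def critic_value_and_grad(x, u, W, b):
    """u(x) = <u, Phi(x)> drives the reaction step; grad u(x) = J_Phi(x) u drives the advection."""
    return rff(x[None, :], W, b)[0] @ u, rff_jacobian(x, W, b) @ u
```

For a small feature dimension $m$ the direct linear solve is cheap; eigendecomposing the whitened Gramian $\tilde D(q)$ instead would expose the $1/(\lambda_j + \alpha)$ damping discussed above.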
5 Discrete Time Unbalanced Kernel and Neural Sobolev Descent

In order to get a practical algorithm, in this section we discretize the continuous USD given in Eq. (2). We also give an implementation that parameterizes the critic as a neural network.

Discrete Time Kernel USD. Recall that the source distribution is $q_0 = q = \sum_{j=1}^n b_j\delta_{y_j}$; set $w^0_j = b_j$ and $x^0_j = y_j$, $j = 1\ldots n$. The target distribution is $p = \sum_{j=1}^N a_j\delta_{x_j}$, and assume for simplicity $\sum_{j=1}^N a_j = 1$. Let $\varepsilon > 0$; for $\ell = 1\ldots L$ and $j = 1\ldots n$, we discretize the advection step as:
$$x^\ell_j = x^{\ell-1}_j + \varepsilon\,\nabla_x u^{\lambda,\gamma}_{p,q_{\ell-1}}(x^{\ell-1}_j).$$
Let $m_{\ell-1} = \sum_{j=1}^n w^{\ell-1}_j\, u^{\lambda,\gamma}_{p,q_{\ell-1}}(x^{\ell-1}_j)$. For $\tau > 0$, we similarly discretize the reaction step as:
$$a^\ell_j = \log(w^{\ell-1}_j) + \tau\big(u^{\lambda,\gamma}_{p,q_{\ell-1}}(x^{\ell-1}_j) - \gamma\, m_{\ell-1}\big).$$
If $\gamma = 0$ (total mass not conserved) we define the reweighing as $w^\ell_j = \exp(a^\ell_j)$, and if $\gamma = 1$ (mass conserved) as $w^\ell_j = \exp(a^\ell_j)/\sum_{i=1}^n \exp(a^\ell_i)$; finally, $q_\ell = \sum_{j=1}^n w^\ell_j\,\delta_{x^\ell_j}$.

Neural Unbalanced Sobolev Descent. Motivated by the use of neural network critics in Sobolev Descent [2], we propose a neural variant of USD by parameterizing the critic of the Sobolev-Fisher Discrepancy as a neural network $f_\xi$ trained via gradient descent with the Augmented Lagrangian Method (ALM) on the loss function of SF given in Definition 1. The re-weighting is defined as in the kernel case above. Neural USD with re-weighting is summarized in Algorithm 1 in Appendix B. Note that the re-weighting can also be implemented via a birth-death process as in [12]. In this variant, particles are duplicated or killed with a probability driven by the growth rate given by the critic. We give the details of the implementation as a birth-death process in Algorithm 2 (Appendix B).

Computational and Sample Complexities. The computational complexity of Neural USD is given by that of updating the witness function and the particles by SGD with backpropagation, i.e. $O(N(T + B))$, where $N$ is the mini-batch size, $T$ is the training time, and $B$ is the gradient computation time for the particle update. $T$ corresponds to a forward and a backward pass through the critic and its gradient. The sample complexity for estimating the Sobolev-Fisher critic scales like $1/\sqrt{N}$, similar to MMD [22].

6 Relation to Previous Work

Table 1 in Appendix A summarizes the main differences between Sobolev descent [2], which only implements advection, and USD, which also implements advection-reaction. Our work is related to the conic particle descent that appeared in [13] and [12]. The main difference of our approach is that it is not based on the flow of a fixed functional; rather, we dynamically learn the flow that corresponds to the witness function of the Sobolev-Fisher discrepancy. The accelerated Langevin sampling of [14] also uses similar principles in the transport of distributions, via Langevin diffusion and a reaction term implemented as a birth-death process. The main difference with our work is that in Langevin sampling the log likelihood of the target distribution is required explicitly, while in USD we only need access to samples from the target distribution. USD relates to unbalanced optimal transport [6, 7, 8, 9] and offers computational flexibility compared to Sinkhorn approaches [8, 9], since it scales linearly in the number of points while Sinkhorn scales quadratically. Compared to WFR (Eq. (1)), USD greedily finds the connecting path, while WFR solves an optimal planning problem.
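Before turning to the experiments, the discrete-time kernel updates above can be summarized in a short, self-contained NumPy sketch (illustrative only: `usd_step`, the toy identity feature map and the hyperparameter values are ours; the neural variants in Algorithms 1 and 2 and the released code at http://github.com/ibm/usd differ in how the critic is obtained).

```python
import numpy as np

def usd_step(X, w, Xp, ap, phi, jac_phi, alpha, lam, gamma, eps, tau):
    """One discrete USD iteration: advect particle positions X, react on their weights w."""
    m = phi(X[:1]).shape[1]
    feats = phi(X)                                    # (n, m) source features
    mu_q = w @ feats                                  # weighted source mean embedding
    delta = ap @ phi(Xp) - mu_q                       # delta_{p,q} = mu(p) - mu(q)
    C = feats.T @ (w[:, None] * feats) - gamma * np.outer(mu_q, mu_q)     # C_gamma(q)
    D = sum(wj * (jac_phi(xj).T @ jac_phi(xj)) for wj, xj in zip(w, X))   # Gramian of Jacobians
    u = np.linalg.solve(D + alpha * C + lam * np.eye(m), delta)           # critic (Proposition 2)
    vals = feats @ u                                  # u(x_j)
    grads = np.stack([jac_phi(xj) @ u for xj in X])   # grad_x u(x_j)
    X_new = X + eps * grads                           # advection step
    a = np.log(w) + tau * (vals - gamma * (w @ vals)) # reaction step (mirror descent on weights)
    w_new = np.exp(a)
    if gamma == 1:
        w_new /= w_new.sum()                          # conserve total mass
    return X_new, w_new

# Toy usage with the identity feature map Phi(x) = x, i.e. a linear critic.
phi = lambda Z: Z
jac_phi = lambda z: np.eye(len(z))
rng = np.random.default_rng(0)
Xq = rng.normal(size=(200, 2)); wq = np.full(200, 1.0 / 200)              # source particles
Xp = rng.normal(loc=3.0, size=(300, 2)); ap = np.full(300, 1.0 / 300)     # target particles
for _ in range(100):
    Xq, wq = usd_step(Xq, wq, Xp, ap, phi, jac_phi,
                      alpha=1.0, lam=1e-3, gamma=1, eps=0.5, tau=0.1)
```

With a richer feature map (for instance the random Fourier features sketched in Section 4.4), the same loop implements the discrete kernel USD; the neural variants instead replace the closed-form critic by a network $f_\xi$ trained on the loss of Definition 1.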
7 Applications

We experiment with USD on synthetic data, image color transfer, and the prediction of developmental stages of scRNA-seq data. In all our experiments we report the MMD distance with a Gaussian kernel, computed using the random Fourier features (RF) approximation [23] with 300 RF and kernel bandwidth equal to $d$ (the input dimension). We consider the conservation of mass case, i.e. $\gamma = 1$.

Synthetic Examples. We test Neural USD (Algorithms 1 and 2) on two synthetic examples. In the first example (Figure 1), the source samples are drawn from a 2D standard Gaussian, while target samples are drawn from a Mixture of Gaussians (MOG). Samples from this MOG have uniform weights. In the second example (Figure 2), source samples are drawn from a "cat"-shaped density, whereas the target samples are drawn uniformly from a "heart". Samples from the target have non-uniform weights following a horizontal gradient. In order to reach such complex densities, USD exploits advection and reaction by following the critic gradients and by creating and destroying mass. We see in Figures 1 and 2 a faster mixing of USD, in both its weighted (w) and birth-death (bd) implementations, compared to the Sobolev descent algorithm of [2].

Figure 1: Neural USD transport of a Gaussian to a MOG (target distribution is uniformly weighted). (a) Neural USD paths in transporting a Gaussian to a MOG. We compare Sobolev descent (SD, [2]) to both USD implementations: with birth-death process (bd: Algorithm 2) and with weights (w: Algorithm 1). USD outperforms SD in capturing the modes of the MOG. (b) MMD as a function of the descent step from the Gaussian to the MOG. Both USD implementations converge faster to the target distribution, reaching lower MMD than Sobolev Descent, which relies on advection only.

Figure 2: Neural USD transport of a "cat" to a non-uniform "heart". Samples from the target distribution have non-uniform weights $a_j$ following a linearly decaying gradient. (a) Neural USD transporting a "cat"-distributed cloud to a "heart". The main difference with the example above is that the points of the target distribution have non-uniform weights describing a linear gradient, as seen from the color code in the figure. Similarly to the MOG case, USD outperforms SD and better captures the non-uniform density of the target. (b) MMD as a function of the descent step from "cat" to "heart" with gradient. Similarly to the uniform target case, USD accelerates the descent and outperforms SD.

Image Color Transfer. We test Neural USD on the image color transfer task. We choose target images that have sparse color distributions. This is a good test for unbalanced transport since, intuitively, having birth and death of particles accelerates the transport convergence in this case. We compare USD to standard optimal transport algorithms. We follow the recipe of [24] as implemented in the POT library [25], where images are subsampled for computational feasibility and then interpolated for out-of-sample points. We compare USD to Earth Mover's Distance (EMD), Sinkhorn [26] and Unbalanced Sinkhorn [8] baselines. We see in Figure 3 that USD achieves smaller MMD to the target color distribution. We give trajectories of USD in Appendix H.2 (Fig. 7).

Developmental Trajectories of Single Cells.
When the goal is not only to transport particles but also to find intermediate points along trajectories, USD becomes particularly interesting. This type of use case has recently received increased attention in developmental biology, thanks to single-cell RNA sequencing (scRNA-seq), a technique that records the expression profile of a whole population of cells at a given stage, but does so destructively. In order to trace the development of cells in between such destructive measurements, [16] proposed to use unbalanced optimal transport [8].

Figure 3: Color transfer with USD using the birth-death variant (bd, Algorithm 2), compared to OT baselines (EMD, Sinkhorn and Unbalanced Sinkhorn). Each panel is annotated with its MMD to the target color distribution. USD achieves lower MMD and faithfully captures the sparse distribution of the target.

Figure 4: Mean and standard deviation of the normalized MMD and EMD for the intermediate stage prediction by USD and by the WOT (unbalanced OT) approach of [16] (means and standard deviations are computed over time intervals). While USD outperforms WOT in MMD, the reverse holds in EMD. See text for an explanation.

Denoting those populations $q_{t_0}$ (source) and $q_{t_1}$ (target), in order to predict the population at an intermediate time $\frac{t_0 + t_1}{2}$, [16] used a linear interpolation between matched source and target points based on the coupling of unbalanced OT. This type of interpolation is a form of McCann interpolation [27]. As an alternative, we propose to use the mid-point of the USD descent as an interpolate, i.e. the timestamp $t_{1/2}$ in the descent such that $\mathrm{MMD}(q_{t_{1/2}}, q_{t_0}) = \mathrm{MMD}(q_{t_{1/2}}, q_{t_1})$. We test this procedure on the dataset released by [16]. For all time intervals $[t_0, t_1]$ in the dataset, we compute the intermediate stage $q_{t_{1/2}}$. We compare the quality of this interpolate with that obtained by the WOT algorithm of [16] in terms of MMD to the ground truth intermediate population $q^\star_{t_{1/2}}$, normalized by the MMD between the initial and final populations, i.e. $\mathrm{MMD}(q_{t_{1/2}}, q^\star_{t_{1/2}})/\mathrm{MMD}(q_{t_0}, q_{t_1})$. Figure 4 gives the mean and standard deviation of the normalized MMD between the intermediate stages predicted by USD and the ground truth. Note that means and standard deviations are computed across 35 time intervals; individual MMDs can be found in Figure 8 in Appendix H. From Figure 4 we see that USD outperforms WOT in MMD, since USD is designed to decrease the MMD distance. On the other hand, for fairness of evaluation we also report the normalized EMD (Earth Mover's Distance, normalized similarly), for which WOT outperforms USD. This is not surprising since WOT relies on unbalanced OT, while USD instead provides guarantees in terms of MMD.
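For completeness, here is a small sketch of the evaluation metric and of the mid-point selection rule described above (our reconstruction: `mmd_rff`, `pick_midpoint` and the bandwidth convention $\sigma^2 = d$ are assumptions, not the released code).

```python
import numpy as np

def rff(X, W, b):
    """Random Fourier features of a Gaussian kernel; X: (N, d) -> (N, m)."""
    m = W.shape[1]
    return np.sqrt(2.0 / m) * np.cos(X @ W + b)

def mmd_rff(X, wx, Y, wy, W, b):
    """MMD between two weighted samples, approximated as the distance of their feature means."""
    return np.linalg.norm(wx @ rff(X, W, b) - wy @ rff(Y, W, b))

def pick_midpoint(traj, source, target, W, b):
    """Index t_1/2 along the descent where MMD to the source and to the target balance."""
    gaps = [abs(mmd_rff(Xt, wt, *source, W, b) - mmd_rff(Xt, wt, *target, W, b))
            for Xt, wt in traj]          # traj: list of (positions, weights) along the descent
    return int(np.argmin(gaps))

# 300 random features, Gaussian kernel with bandwidth sigma^2 = d (our reading of "bandwidth d").
d, m = 50, 300
rng = np.random.default_rng(0)
W = rng.normal(scale=1.0 / np.sqrt(d), size=(d, m))   # w_k ~ N(0, I/d)
b = rng.uniform(0.0, 2.0 * np.pi, size=m)
```

The normalized scores reported in Figure 4 would then divide $\mathrm{MMD}(q_{t_{1/2}}, q^\star_{t_{1/2}})$ by $\mathrm{MMD}(q_{t_0}, q_{t_1})$, computed with the same estimator.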
8 Conclusion

In this paper we introduced the KSFD discrepancy and showed how it relates to advection-reaction transport. Using the critic of KSFD, we introduced Unbalanced Sobolev Descent (USD), which consists of an advection step that moves particles and a reaction step that re-weights their mass. The reaction step can be seen as a birth-death process which, as we show theoretically, speeds up the descent compared to previous particle descent algorithms. We proved the MMD convergence of Kernel USD and presented two neural implementations of USD, using weight updates and a birth-death process of particles, respectively. We empirically demonstrated on synthetic examples and in image color transfer that USD can be reliably used to transport distributions, and indeed does so with accelerated convergence, supporting our theoretical analysis. As a further demonstration of our algorithm, we showed that USD can be used to predict developmental trajectories of single cells based on their RNA expression profile. This task is representative of a situation where distributions of different mass need to be compared and interpolated between, since the different scRNA-seq measurements are taken on cell populations of dissimilar size at different developmental stages. USD can naturally deal with this unbalanced setting. Finally, we compared USD to unbalanced OT algorithms, showing its viability as a data-driven, more scalable dynamic transport method.

Broader Impact Statement

Our work provides a practical particle descent algorithm that comes with a formal convergence proof and a theoretically guaranteed acceleration over previous competing algorithms. Moreover, our algorithm can naturally handle situations where the objects of the descent are particles sampled from a source distribution descending towards a target distribution with a different mass. The applications that this enables range from theoretically principled modeling of biological growth processes (like tumor growth) and developmental processes (like the differentiation of cells in their gene expression space), to faster numerical simulation of advection-reaction systems. Since our advance is mainly theoretical and algorithmic (besides the empirical demonstrations), its implications are necessarily tied to the utilization for which it is being deployed. Besides the applications that we mentioned, particle descent algorithms like ours have been proposed as a paradigm to characterize and study the dynamics of Generative Adversarial Network (GAN) training. As such, they could indirectly contribute to the risks associated with nefarious uses of GANs such as deepfakes. On the other hand, by providing a tool to analyze and better understand GANs, our theoretical results might serve as the basis for mitigating their abuse.

Acknowledgments and Disclosure of Funding

The authors did not receive any third party funding and have no competing interests.

References

[1] Qiang Liu and Dilin Wang. Stein variational gradient descent: A general purpose Bayesian inference algorithm. In Advances in Neural Information Processing Systems 29, 2016.

[2] Youssef Mroueh, Tom Sercu, and Anant Raj. Sobolev descent. In AISTATS, Proceedings of Machine Learning Research, 2019.

[3] Michael Arbel, Anna Korba, Adil Salim, and Arthur Gretton. Maximum mean discrepancy gradient flow. arXiv preprint arXiv:1906.04370, 2019.

[4] Filippo Santambrogio. {Euclidean, metric, and Wasserstein} gradient flows: an overview. Bulletin of Mathematical Sciences, 7(1):87–154, 2017.

[5] Lenaic Chizat and Francis Bach. On the global convergence of gradient descent for over-parameterized models using optimal transport. In Advances in Neural Information Processing Systems, pages 3036–3046, 2018.

[6] Lenaic Chizat, Gabriel Peyré, Bernhard Schmitzer, and François-Xavier Vialard. Unbalanced optimal transport: Dynamic and Kantorovich formulation. arXiv preprint arXiv:1508.05216, 2015.
[7] Lenaic Chizat, Gabriel Peyré, Bernhard Schmitzer, and François-Xavier Vialard. An interpolating distance between optimal transport and Fisher-Rao metrics. Foundations of Computational Mathematics, 18(1):1–44, 2018.

[8] Lenaic Chizat, Gabriel Peyré, Bernhard Schmitzer, and François-Xavier Vialard. Scaling algorithms for unbalanced optimal transport problems. Mathematics of Computation, 87(314):2563–2609, 2018.

[9] Thibault Séjourné, Jean Feydy, François-Xavier Vialard, Alain Trouvé, and Gabriel Peyré. Sinkhorn divergences for unbalanced optimal transport. arXiv preprint arXiv:1910.12958, 2019.

[10] Cédric Villani. Optimal Transport: Old and New. Grundlehren der mathematischen Wissenschaften. Springer, 2008.

[11] Jean-David Benamou and Yann Brenier. A computational fluid mechanics solution to the Monge-Kantorovich mass transfer problem. Numerische Mathematik, 2000.

[12] Grant Rotskoff, Samy Jelassi, Joan Bruna, and Eric Vanden-Eijnden. Global convergence of neuron birth-death dynamics. arXiv preprint arXiv:1902.01843, 2019.

[13] Lenaic Chizat. Sparse optimization on measures with over-parameterized gradient descent. arXiv preprint arXiv:1907.10300, 2019.

[14] Yulong Lu, Jianfeng Lu, and James Nolen. Accelerating Langevin sampling with birth-death. arXiv preprint arXiv:1905.09863, 2019.

[15] Lénaïc Chizat and Simone Di Marino. A tumor growth model of Hele-Shaw type as a gradient flow, 2017.

[16] Geoffrey Schiebinger, Jian Shu, Marcin Tabaka, Brian Cleary, Vidya Subramanian, Aryeh Solomon, Joshua Gould, Siyan Liu, Stacie Lin, Peter Berube, et al. Optimal-transport analysis of single-cell gene expression identifies developmental trajectories in reprogramming. Cell, 176(4):928–943, 2019.

[17] Karren D. Yang and Caroline Uhler. Scalable unbalanced optimal transport using generative adversarial networks. In International Conference on Learning Representations, 2019.

[18] Jean Feydy, Thibault Séjourné, François-Xavier Vialard, Shun-Ichi Amari, Alain Trouvé, and Gabriel Peyré. Interpolating between optimal transport and MMD using Sinkhorn divergences. arXiv preprint arXiv:1810.08278, 2018.

[19] Gabriel Peyré and Marco Cuturi. Computational optimal transport. arXiv preprint arXiv:1803.00567, 2017.

[20] Michael Arbel, Dougal J. Sutherland, Mikolaj Binkowski, and Arthur Gretton. On gradient regularizers for MMD GANs. NeurIPS, 2018.

[21] Carl-Johann Simon-Gabriel and Bernhard Schölkopf. Kernel distribution embeddings: Universal kernels, characteristic kernels and kernel metrics on distributions, 2016.

[22] Arthur Gretton, Karsten M. Borgwardt, Malte J. Rasch, Bernhard Schölkopf, and Alexander Smola. A kernel two-sample test. JMLR, 2012.

[23] Ali Rahimi and Benjamin Recht. Random features for large-scale kernel machines. In Advances in Neural Information Processing Systems, pages 1177–1184, 2007.

[24] Sira Ferradans, Nicolas Papadakis, Julien Rabin, Gabriel Peyré, and Jean-François Aujol. Regularized discrete optimal transport. In International Conference on Scale Space and Variational Methods in Computer Vision, 2013.

[25] Rémi Flamary and Nicolas Courty. POT: Python Optimal Transport library, 2017.

[26] Marco Cuturi. Sinkhorn distances: Lightspeed computation of optimal transport. In Advances in Neural Information Processing Systems, pages 2292–2300, 2013.

[27] Robert J. McCann. A convexity principle for interacting gases. Advances in Mathematics, 1997.

[28] Ding-Xuan Zhou. Derivative reproducing properties for kernel methods in learning theory. Journal of Computational and Applied Mathematics, 2008.