Minimizing f-Divergences by Interpolating Velocity Fields

Song Liu 1, Jiahao Yu 1, Jack Simons 1, Mingxuan Yi 1, Mark Beaumont 1

1 University of Bristol, Bristol, UK. Correspondence to: Song Liu.

Proceedings of the 41st International Conference on Machine Learning, Vienna, Austria. PMLR 235, 2024. Copyright 2024 by the author(s).

Abstract

Many machine learning problems can be seen as approximating a target distribution using a particle distribution by minimizing their statistical discrepancy. Wasserstein Gradient Flow can move particles along a path that minimizes the $f$-divergence between the target and particle distributions. To move particles, we need to calculate the corresponding velocity fields derived from a density ratio function between these two distributions. Previous works estimated such density ratio functions and then differentiated the estimated ratios. These approaches may suffer from overfitting, leading to a less accurate estimate of the velocity fields. Inspired by non-parametric curve fitting, we directly estimate these velocity fields using interpolation techniques. We prove that our estimators are consistent under mild conditions. We validate their effectiveness using novel applications on domain adaptation and missing data imputation. The code for reproducing our results can be found at https://github.com/anewgithubname/gradest2.

1. Introduction

Many machine learning problems can be formulated as minimizing statistical divergences between distributions, e.g., Variational Inference (Blei et al., 2017), Generative Modeling (Nowozin et al., 2016; Yi et al., 2023), Domain Adaptation (Courty et al., 2017a; Yu et al., 2021), and Data Imputation (Muzellec et al., 2020). Among many divergence minimization techniques, particle-based gradient descent reduces the divergence between a set of particles (which define a distribution) and the target distribution by iteratively moving the particles according to an update rule.

One such algorithm is Stein Variational Gradient Descent (SVGD) (Liu and Wang, 2016; Liu, 2017). It minimizes the Kullback-Leibler (KL) divergence using the steepest descent algorithm and has achieved promising results in Bayesian inference. However, computing the SVGD updates requires an unnormalized target density function. If we only have samples from the target distribution, as is often the case in applications like domain adaptation and generative model training, we cannot directly apply SVGD to minimize $f$-divergences.

Figure 1. Estimating a log density ratio $\log r$ using a flexible model (RBF kernel) leads to an overfitted estimate $\log r_1$. The overfitting consequently causes huge fluctuations in the derivative $(\log r_1)'$. Our proposed method provides a much more stable estimate $\log r_2$ and a more accurate estimate of $(\log r_2)'$.

Wasserstein Gradient Flow (WGF) describes the evolution of a marginal measure $q_t$ along the steepest descent direction of a functional objective in Wasserstein geometry, where $x_t \sim q_t$ follows a probability flow ODE. It can be employed to minimize an $f$-divergence between a particle distribution and the target distribution. In particular, such WGFs are characterized by the following ODE (Yi et al., 2023; Gao et al., 2019):
$$dx_t = \nabla (h \circ r_t)(x_t)\, dt, \qquad t \in [0, \infty).$$
Here $r_t := p/q_t$ is the ratio between the target density $p$ and the particle density $q_t$ at time $t$, and $h$ is a known function which depends on the $f$-divergence. The gradient operator $\nabla$ computes the gradient of the composite function $h \circ r_t$.
If we knew $r_t$, simulating the above ODE would be straightforward using a discrete-time Euler method. The main challenge is that we do not know $r_t$ in practice: in the context we consider, we know neither the particle density nor the target density that constitute $r_t$. To overcome this issue, recent works (Gao et al., 2019; Ansari et al., 2021; Simons et al., 2021) first obtain an estimate $\hat r_t$ using a density ratio estimator (Sugiyama et al., 2012), then differentiate $h \circ \hat r_t$ to obtain $\nabla(h \circ \hat r_t)(x_t)$. However, like other estimation tasks, density ratio estimation can be prone to overfitting, and the risk is further exacerbated when employing flexible models such as kernel models or neural networks. Overfitting can be disastrous for gradient estimation: a wiggly fit of $r_t$ causes huge fluctuations in the gradient (see the blue fit in Figure 1). Moreover, density ratio estimation lacks an inductive bias for density ratio gradient estimation. In other words, while it might be good at estimating the ratio itself, it has no built-in assumptions that help capture the gradient of the density ratio function accurately.

In this paper, we directly approximate the velocity fields induced by WGF, i.e., $\nabla(h \circ r_t)(x_t)$. We show that the backward KL velocity field, where $h = \log$, can be effectively estimated using Nadaraya-Watson (NW) interpolation if we know $\log p$, and that this estimator is closely related to SVGD. We prove that the estimation error of $\nabla \log r_t(x_t)$ vanishes as the kernel bandwidth approaches zero. This finding motivates us to propose a more general linear interpolation method to approximate $\nabla(h \circ r_t)(x_t)$ for general $h$ functions using only samples from $p$ and $q_t$. Our estimators are based on the idea that, within the neighbourhood of a given point, the best linear approximation of $h \circ r_t$ has slope $\nabla(h \circ r_t)$. Under mild conditions, we show that our estimators are consistent for estimating $\nabla(h \circ r_t)(x_t)$ and achieve the optimal non-parametric regression rate. Finally, equipped with our proposed gradient estimator, we test WGF on two novel applications, domain adaptation and missing data imputation, and achieve promising performance.

2. Background

Notation: $\mathbb{R}^d$ is the $d$-dimensional real domain. Vectors are lowercase bold letters, e.g., $x := [x_1, \dots, x_d]^\top$; $x_{-j}$ is the subvector of $x$ obtained by excluding the $j$-th dimension. Matrices are uppercase bold letters, e.g., $X, Y$. $f \circ g$ is the composite function $f(g(\cdot))$. $f'$ is the derivative of a univariate function. $\partial_i f$ is the partial derivative with respect to the $i$-th input of $f$, and $\nabla f := [\partial_1 f, \partial_2 f, \dots, \partial_d f]^\top$; $\nabla^\top f$ represents its transpose. $\nabla_x f(x, y)$ is the gradient of $f$ with respect to the input $x$. $\nabla f \in \mathbb{R}^{d \times m}$ is the Jacobian of a vector-valued function $f : \mathbb{R}^m \to \mathbb{R}^d$. $\partial_i^2 f$ is the second-order partial derivative with respect to the $i$-th input of $f$, and $\nabla^2 f$ is the Hessian of $f$. $\|\cdot\|$ is the $\ell_2$ norm of a vector or the spectral norm of a matrix. $\lambda_{\min}[X]$ is the smallest eigenvalue of a matrix $X$. $a \vee b$ is the greater value between two scalars $a$ and $b$. $\mathcal{P}(\mathbb{R}^d)$ is the space of probability measures on $\mathbb{R}^d$ equipped with the Wasserstein-2 metric, or Wasserstein space for short. $\hat E[\cdot]$ is the sample approximation of an expectation $E[\cdot]$.

We begin by introducing the WGF of $f$-divergences, an effective technique for minimizing the $f$-divergence between the particle and target distributions.

| Name | Notation | $f(r_t)$ | $h(r_t)$ |
|---|---|---|---|
| Forward KL | $\mathrm{KL}[p, q_t]$ | $r_t \log r_t$ | $r_t$ |
| Backward KL | $\mathrm{KL}[q_t, p]$ | $-\log r_t$ | $\log r_t - 1$ |
| Pearson's $\chi^2$ | $\chi^2_p[p, q_t]$ | $\frac{1}{2}(r_t - 1)^2$ | $\frac{1}{2}r_t^2 - \frac{1}{2}$ |
| Neyman's $\chi^2$ | $\chi^2_n[p, q_t]$ | $\frac{(r_t - 1)^2}{2 r_t}$ | $1 - \frac{1}{r_t}$ |

Table 1. Some f-divergences, their definitions and h functions computed according to Theorem 2.1.
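To make Table 1 concrete, the $f$ and $h$ pairs can be written down directly in code. The following is a small illustrative sketch (ours, not part of the paper's released implementation), including a finite-difference check of the identity $h(r) = r f'(r) - f(r)$ stated in Theorem 2.1 below:

```python
import numpy as np

# f and h pairs from Table 1; h(r) = r f'(r) - f(r), as in Theorem 2.1.
# r is the density ratio p(x) / q_t(x), assumed strictly positive.
F_DIVERGENCES = {
    "forward_kl":  {"f": lambda r: r * np.log(r),            "h": lambda r: r},
    "backward_kl": {"f": lambda r: -np.log(r),               "h": lambda r: np.log(r) - 1.0},
    "pearson":     {"f": lambda r: 0.5 * (r - 1.0) ** 2,     "h": lambda r: 0.5 * r**2 - 0.5},
    "neyman":      {"f": lambda r: (r - 1.0) ** 2 / (2 * r), "h": lambda r: 1.0 - 1.0 / r},
}

# Sanity check: a central finite difference of f gives f'(r), and
# r f'(r) - f(r) should match the tabulated h on a grid of ratios.
r = np.linspace(0.5, 2.0, 4)
eps = 1e-6
for name, d in F_DIVERGENCES.items():
    f_prime = (d["f"](r + eps) - d["f"](r - eps)) / (2 * eps)
    assert np.allclose(r * f_prime - d["f"](r), d["h"](r), atol=1e-4), name
```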
2.1. Wasserstein Gradient Flows of f-divergences

In general terms, a Wasserstein Gradient Flow is a curve in probability space (Ambrosio et al., 2005). By moving a probability measure along this curve, a functional objective (such as a statistical divergence) is reduced. In this work, we focus solely on $f$-divergences as the functional objective. Let $q_t : \mathbb{R}_+ \to \mathcal{P}(\mathbb{R}^d)$ be a curve in Wasserstein space. Consider an $f$-divergence defined by $D_f[p, q_t] := \int q_t(x) f(r_t(x))\, dx$, where $r_t := p/q_t$ and $f$ is a twice-differentiable convex function with $f(1) = 0$.

Theorem 2.1 (Corollary 3.3 in (Yi et al., 2023)). The Wasserstein gradient flow of $D_f[p, q_t]$ characterizes the particle evolution via the ODE
$$dx_t = \nabla(h \circ r_t)(x_t)\, dt, \qquad h(r_t) = r_t f'(r_t) - f(r_t).$$

Simply speaking, particles evolving in Euclidean space according to the above ODE move the corresponding $q_t$ along a curve on which $D_f[p, q_t]$ always decreases with time. Theorem 2.1 establishes a relationship between $f$ and a function $h : \mathbb{R}_+ \to \mathbb{R}$; the gradient field of $h \circ r_t$ over time $t$ is referred to as the WGF velocity field. Similar theorems on reversed $f$-divergences have been discussed in (Gao et al., 2019; Ansari et al., 2021). Some frequently used $f$-divergences and their corresponding $h$ functions are listed in Table 1. In particular, for the backward KL divergence, $h(r) = \log r - 1$ and the constant vanishes under the gradient, so $\nabla(h \circ r_t)(x_t) = \nabla \log r_t(x_t)$.

In reality, we move particles by simulating the above ODE using the forward Euler method: we draw particles from an initial distribution $q_0$ and iteratively update them for times $t = 0, 1, \dots, T$ according to the rule
$$x_{t+1} := x_t + \eta\, \nabla(h \circ r_t)(x_t), \qquad (1)$$
where $\eta$ is a small step size. With a slight abuse of notation, we reuse $t$ for discrete-time indices.¹

Although (1) seems straightforward, we normally do not have access to $r_t$, so the update cannot be readily performed. In previous works, such as (Gao et al., 2019; Simons et al., 2021), $r_t$ is estimated using density ratio estimators and the WGF is simulated using the estimated ratio. Although these estimators achieved promising results, they are not designed for use inside WGF algorithms: a small density ratio estimation error can lead to huge deviations in gradient estimation, as demonstrated in Figure 1. Others (Wang et al., 2022) propose to estimate the gradient flow using Kernel Density Estimation (KDE) on the densities $p$ and $q_t$ separately (see Appendix J), then computing the ratio. However, KDE tends to perform poorly in high-dimensional settings (see, e.g., (Scott, 1991)).

¹ From now on, we will only discuss discrete-time algorithms.
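However the velocity field is obtained, simulating the flow itself is just the forward Euler recursion (1). A minimal sketch (ours, not the paper's released code), with the velocity field estimator left as a pluggable function:

```python
import numpy as np

def simulate_wgf(x0, velocity_field, n_steps=100, step_size=0.1):
    """Forward-Euler simulation of update (1):
    x_{t+1} = x_t + eta * grad(h o r_t)(x_t).

    x0:             (n, d) array of initial particles drawn from q_0.
    velocity_field: callable(particles) -> (n, d) array estimating
                    grad(h o r_t) at each particle; it must be
                    re-evaluated every step, because q_t changes as
                    the particles move.
    """
    x = x0.copy()
    for _ in range(n_steps):
        x += step_size * velocity_field(x)
    return x
```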
3. Direct Velocity Field Estimation by Interpolation

In this work, we consider directly estimating the velocity field, i.e., directly modelling and estimating $\nabla(h \circ r_t)$. We are encouraged by the recent successes of Score Matching (Hyvärinen, 2005; 2007; Vincent, 2011; Song et al., 2020), a direct estimator of the log density gradient that works by minimizing the squared differences between the true log density gradient and the model gradient. However, this technique cannot easily be adapted to estimate $\nabla(h \circ r_t)$, even for $h(r) = \log r$ (see Appendix K).

We start by looking at a simpler setting where $\log p$ is known. This setting itself has many interesting applications, such as Bayesian inference. The solution we derive using interpolation will serve as motivation for the other interpolation-based approaches in later sections.

3.1. Nadaraya-Watson (NW) Interpolation of the Backward KL Velocity Field

Define a local weighting function with a parameter $\sigma > 0$:
$$k_\sigma(x, x^*) := \exp\left(-\frac{\|x - x^*\|^2}{2\sigma^2}\right).$$
The Nadaraya-Watson (NW) estimator (Nadaraya, 1964; Watson, 1964) interpolates a function $g$ at a fixed point $x^*$. Suppose that we observe $g(x)$ at a set of sample points $\{x_i\}_{i=1}^n \sim q$; NW interpolates $g(x^*)$ by computing
$$\hat g(x^*) := \hat E_q[k_\sigma(x, x^*)\, g(x)]\, /\, \hat E_q[k_\sigma(x, x^*)]. \qquad (2)$$
Thus, the NW interpolation of the backward KL field² is
$$\hat u_t(x^*) := \hat E_{q_t}[k_\sigma(x, x^*)\, \nabla \log r_t(x)]\, /\, \hat E_{q_t}[k_\sigma(x, x^*)]. \qquad (3)$$
Since we cannot evaluate $\nabla \log r_t(x)$, (3) is intractable. However, assuming that $\lim_{\|x\| \to \infty} q_t(x)\, k_\sigma(x, x^*) = 0$ and using integration by parts, the expectation in the numerator of (3) can be rewritten as
$$E_{q_t}[k_\sigma \nabla \log r_t(x)] = E_{q_t}[k_\sigma \nabla \log p(x)] - E_{q_t}[k_\sigma \nabla \log q_t(x)] = E_{q_t}[k_\sigma \nabla \log p(x) + \nabla k_\sigma], \qquad (4)$$
where we shorten the kernel $k_\sigma(x, x^*)$ to $k_\sigma$. Since we can evaluate $\log p$, (4) can be approximated using samples from the particle distribution $q_t$. Thus, the NW estimator of the backward KL field can be approximated by
$$\hat u_t(x^*) \approx \hat E_{q_t}[k_\sigma \nabla \log p(x) + \nabla k_\sigma]\, /\, \hat E_{q_t}[k_\sigma]. \qquad (5)$$
Interestingly, the numerator of (5) is exactly the particle update of the SVGD algorithm (Liu and Wang, 2016) for an RKHS induced by a Gaussian kernel (see Appendix N for details on SVGD), and the equality (4) was noticed by Chewi et al. (2020). Note that the denominator in (5) differs for each $x^*$ and thus cannot be absorbed into the overall learning rate of SVGD.

² "Backward KL field" is short for "backward KL velocity field".
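A minimal sketch of (5), evaluated at the particles themselves as in SVGD (our illustration, assuming a Gaussian kernel and access to $\nabla \log p$):

```python
import numpy as np

def nw_backward_kl_field(particles, grad_log_p, sigma):
    """NW estimate (5) of the backward KL velocity field at every particle.

    particles:  (n, d) array of samples from q_t.
    grad_log_p: callable(x) -> (n, d) array with grad log p at each row.
    """
    diff = particles[:, None, :] - particles[None, :, :]    # x - x*, (n, n, d)
    sq = np.sum(diff**2, axis=-1)                            # (n, n)
    K = np.exp(-sq / (2 * sigma**2))                         # k_sigma(x, x*)
    # grad_x k_sigma(x, x*) = -(x - x*) / sigma^2 * k_sigma(x, x*)
    grad_K = -diff / sigma**2 * K[:, :, None]                # (n, n, d)
    num = K.T @ grad_log_p(particles) + grad_K.sum(axis=0)   # numerator of (5)
    den = K.sum(axis=0)[:, None]                             # denominator of (5)
    return num / den
```

Dropping the denominator `den` recovers the SVGD update direction mentioned above.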
3.2. Effectiveness of the NW Estimator

For simplicity, we drop $t$ from $u_t, r_t, x_t$ and $q_t$ when our analysis holds identically for all $t$. Although there are theoretical justifications for the convergence of WGFs given the ground-truth velocity fields, such as Langevin dynamics (Wibisono, 2018), few theories have been dedicated to the estimation of the velocity fields themselves. One contribution of this paper is a statistical theory of velocity field estimation through the lens of non-parametric regression/curve approximation.

We now prove the convergence rate of the NW estimator under the assumption that the second-order derivative of $\log r$ is well behaved. Although $\hat u(x^*)$ cannot be directly computed, assuming $\log p$, $k$ and $\nabla k$ are well behaved, the difference between $\hat u(x^*)$ and its approximation (5) can be easily bounded using concentration inequalities (such as Hoeffding's inequality (Hoeffding, 1963)). We therefore focus on the classical NW estimator in (3).

Proposition 3.1. Suppose $\sup_{x \in \mathbb{R}^d} \|\nabla^2 \log r(x)\| \le \kappa < \infty$. Define $k(y) := \exp(-\|y\|^2/2)$. Assume that there exist constants $C_k, K$ independent of $\sigma$ such that
$$\frac{\int q(\sigma y + x^*)\, k(y)\, \|y\|\, dy}{\int q(\sigma y + x^*)\, k(y)\, dy} \le C_k, \qquad \int q(\sigma y + x^*)\, k(y)\, dy > 0, \qquad (6)$$
$$\mathrm{Var}_q\!\left[\sigma^{-d} k_\sigma(x, x^*)(x - x^*)\right] = O(\sigma^{-d}), \quad \mathrm{Var}_q\!\left[\sigma^{-d} k_\sigma(x, x^*)\right] = O(\sigma^{-d}), \quad \text{as } \sigma \to 0. \qquad (7)$$
Then, with high probability, there exists a constant $K$ such that
$$\|\hat u(x^*) - \nabla \log r(x^*)\| \le \kappa\left( \frac{K}{\sqrt{n\sigma^d}} + C_k\, \sigma \right)$$
for all $\sigma > 0$. We can also deduce that, with an optimal choice $\sigma \asymp n^{-1/(d+2)}$, the estimation error is $O_p(n^{-1/(d+2)})$. See Appendix A for the proof.

$q(\sigma y + x^*)$ in the first inequality of (6) can be further expanded using a Taylor expansion; provided the kernel $k$ is well behaved, this condition becomes a regularity condition on $q(x^*)$ and its higher-order moments. The second inequality in (6) means that there should be enough mass around $x^*$ under the distribution $q$, which is a key assumption in classical nonparametric curve estimation (see, e.g., Chapter 20 in (Wasserman, 2010)). (7) is required to ensure the empirical quantities in the NW estimator converge to their population counterparts in probability.

It can be seen that the estimation error is bounded by the sample approximation error $K/\sqrt{n\sigma^d}$ and a bias that depends on $\sigma$. Interestingly, the bias term decreases at the rate $\sigma$, slower than the classical rate $\sigma^2$ for the non-parametric regression of a second-order differentiable function (Theorem 20.21 in (Wasserman, 2010)). This is expected, since NW estimates the gradient of $h \circ r$, not $h \circ r$ itself; to achieve the faster $\sigma^2$ rate, one would need to assume conditions on $\nabla^3 \log r$. This highlights a slight downside of using NW to estimate the gradient. In the following section, however, we show that another interpolator achieves the superior rate when using the same type of assumption on $\nabla^2(h \circ r)$.

4. Velocity Field Interpolation from Samples

Although we have seen that $\hat u$ is an effective estimator of the backward KL velocity field, it can only be approximated when we can evaluate $\log p$. In some applications, such as domain adaptation or generative modelling, we only have samples from the target distribution, and $\log p$ is unavailable. Moreover, since the $f$-divergence family consists of a wide variety of divergences, we hope to provide a general computational framework to estimate the different velocity fields that minimize different $f$-divergences.

Nonetheless, the success of NW motivates us to look for other interpolators to approximate $\nabla(h \circ r)(x^*)$. Another common interpolation technique is local linear regression (see, e.g., (Gasser and Müller, 1979; Fan, 1993) or Chapter 6 of (Hastie et al., 2001)). It approximates an unknown function $g$ at $x^*$ by a linear function $\hat g(x) := \langle \beta(x^*), x \rangle + \beta_0(x^*)$, where $\beta(x^*)$ and $\beta_0(x^*)$ are the minimizers of the weighted least-squares objective
$$\min_{\beta \in \mathbb{R}^d,\, \beta_0 \in \mathbb{R}} \hat E_q\!\left[ k_\sigma \left( g(x) - \langle \beta, x \rangle - \beta_0 \right)^2 \right]. \qquad (8)$$
A key insight is that, since the gradient of a function is the slope of its best local linear approximation, it is reasonable to use the slope of the fitted linear model, $\beta(x^*)$, to approximate the gradient $\nabla g(x^*)$. See Figure 6 in the Appendix for an illustration.

We apply the same rationale to estimate $\nabla(h \circ r)(x^*)$. However, unlike local linear interpolation, we cannot evaluate $h \circ r$ at any input, so we cannot directly use the least-squares objective (8) to obtain a local linear interpolation. Similar to what we did in Section 3.1, we look for a tractable population objective for estimating $h \circ r$ that can be approximated using samples from $p$ and $q$, and then convert it into a local linear objective. In the following section, we derive an objective for estimating $h \circ r$ by maximizing a variational lower bound of a mirror divergence.

4.1. Mirror Divergence

Definition 4.1. Let $D_\phi[p, q]$ and $D_\psi[p, q]$ denote two $f$-divergences with $f$ being $\phi$ and $\psi$ respectively. $D_\psi$ is the mirror of $D_\phi$ if and only if $\psi'(r) \cong r\phi'(r) - \phi(r)$, where $\cong$ means equal up to an additive constant.

For example, let $D_\phi[p, q]$ and $D_\psi[p, q]$ be $\mathrm{KL}[p, q]$ and $\chi^2_p[p, q]$ respectively. From Table 1, $\phi(r) = r \log r$ and $\psi(r) = \frac{1}{2}(r - 1)^2$. Thus $r\phi'(r) - \phi(r) = r(1 + \log r) - r \log r = r \cong \psi'(r) = r - 1$. Therefore, $\chi^2_p[p, q]$ is the mirror of $\mathrm{KL}[p, q]$. Similarly, we can verify that $\mathrm{KL}[p, q]$ is the mirror of $\mathrm{KL}[q, p]$, and $\mathrm{KL}[q, p]$ is the mirror of $\chi^2_n[p, q]$.
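These mirror relations are easy to verify symbolically. A small sketch using SymPy (our illustration, using the $f$ functions from Table 1):

```python
import sympy as sp

r = sp.symbols("r", positive=True)

def is_mirror(psi, phi):
    """Definition 4.1: psi'(r) equals r*phi'(r) - phi(r) up to a constant."""
    diff = sp.diff(psi, r) - (r * sp.diff(phi, r) - phi)
    return sp.simplify(sp.diff(diff, r)) == 0  # difference is constant in r

kl_pq = r * sp.log(r)            # phi for KL[p, q]
kl_qp = -sp.log(r)               # phi for KL[q, p]
pearson = (r - 1) ** 2 / 2       # phi for Pearson's chi^2
neyman = (r - 1) ** 2 / (2 * r)  # phi for Neyman's chi^2

print(is_mirror(pearson, kl_pq))  # True: chi^2_p is the mirror of KL[p, q]
print(is_mirror(kl_pq, kl_qp))    # True: KL[p, q] is the mirror of KL[q, p]
print(is_mirror(kl_qp, neyman))   # True: KL[q, p] is the mirror of chi^2_n
```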
In general, $D_\psi[p, q]$ being the mirror of $D_\phi[p, q]$ does not imply the converse.

4.2. Gradient Estimator using Linear Interpolation

The key observation that helps derive a tractable objective is that $h \circ r$ is the argmax of the mirror variational lower bound. Suppose $h$ is associated with an $f$-divergence $D_\phi$ as per Theorem 2.1 and $D_\psi$ is the mirror of $D_\phi$. Then $h \circ r$ is the argmax of the following objective:
$$D_\psi[p, q] = \max_d\; E_p[d(x)] - E_q[\psi_{\mathrm{con}}(d(x))], \qquad (9)$$
where $\psi_{\mathrm{con}}$ is the convex conjugate of $\psi$. The formal statement and its proof can be found in Appendix C. The equality in (9) is known from previous literature (Nguyen et al., 2010; Nowozin et al., 2016), and the objective in (9) is commonly referred to as the variational lower bound of $D_\psi[p, q]$. Notice that the expectations in (9) can be approximated by $\hat E_p[\cdot]$ and $\hat E_q[\cdot]$ using samples from $p$ and $q$ respectively.

This is a surprising result. Since $h$ is related to $D_\phi$'s field (as per Theorem 2.1), one might associate maximizing $D_\phi$'s variational lower bound with its velocity field estimation. The above observation shows that, to approximate $D_\phi$'s field, one should instead maximize the variational lower bound of its mirror divergence $D_\psi$! To the best of our knowledge, this mirror structure in the context of WGF has never been studied before.

We then localize (9) to obtain a local linear estimator of $h \circ r$ at a fixed point $x^*$. First, we parameterize the function $d$ using a linear model $d_{w,b}(x) := \langle w, x \rangle + b$. Second, we weight the objective using $k_\sigma(x, x^*)$, which leads to the following local linear objective:
$$(w(x^*), b(x^*)) = \mathop{\mathrm{argmax}}_{w \in \mathbb{R}^d,\, b \in \mathbb{R}} \ell(w, b; x^*), \quad \text{where } \ell(w, b; x^*) := \hat E_p[k_\sigma\, d_{w,b}(x)] - \hat E_q[k_\sigma\, \psi_{\mathrm{con}}(d_{w,b}(x))]. \qquad (10)$$
The above transformation is similar to how local linear regression localizes the ordinary least-squares objective. Solving (10), we get a linear approximation of $h \circ r$ at $x^*$: $h(r(x^*)) \approx \langle w(x^*), x^* \rangle + b(x^*)$. Following the intuition that $\nabla(h \circ r)(x^*)$ is the slope of the best local linear fit of $h(r(x))$, we use $w(x^*)$ to approximate $\nabla(h \circ r)(x^*)$. We will theoretically justify this approximation in Section 4.3. Now let us study two examples.

Example 4.2. Suppose we would like to estimate $\mathrm{KL}[p, q]$'s field at $x^*$, which is $\nabla r(x^*)$. Using Definition 4.1, we can verify that the mirror of $\mathrm{KL}[p, q]$ is $D_\psi = \chi^2_p[p, q]$, in which case $\psi = \frac{1}{2}(r-1)^2$ and its convex conjugate is $\psi_{\mathrm{con}}(d) = d^2/2 + d$. Substituting $\psi_{\mathrm{con}}$ into (10), the gradient estimator $w(x^*) \approx \nabla r(x^*)$ is obtained from the objective
$$\ell(w, b; x^*) := \hat E_p[k_\sigma\, d_{w,b}(x)] - \hat E_q\!\left[k_\sigma \left( \frac{d_{w,b}(x)^2}{2} + d_{w,b}(x) \right)\right]. \qquad (11)$$

Example 4.3. Suppose we would like to estimate $\mathrm{KL}[q, p]$'s field at $x^*$, which is $\nabla \log r(x^*)$. Using Definition 4.1, we can verify that the mirror of $\mathrm{KL}[q, p]$ is $D_\psi = \mathrm{KL}[p, q]$, in which case $\psi(r) = r \log r$ and $\psi_{\mathrm{con}}(d) = \exp(d - 1)$. The gradient estimator $w(x^*) \approx \nabla \log r(x^*)$ is obtained from the objective
$$\ell(w, b; x^*) := \hat E_p[k_\sigma\, d_{w,b}(x)] - \hat E_q[k_\sigma \exp(d_{w,b}(x) - 1)]. \qquad (12)$$

In the following section, we show that the estimation error $\|w(x^*) - \nabla(h \circ r)(x^*)\|$ vanishes as $\sigma \to 0$ and $n \to \infty$.
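A compact sketch of the estimator in Example 4.3 (Eq. (12)), vectorized over query points in the style of the gradients given in Appendix H.1. The names and the plain gradient-ascent loop are our own; the paper reports using adaptive-step methods such as Adam in practice:

```python
import numpy as np

def backward_kl_field(x_query, x_p, x_q, sigma, n_steps=500, lr=0.05):
    """Estimate grad log r at each row of x_query by maximizing (12).

    x_p: (np, d) samples from target p; x_q: (nq, d) samples from q.
    Returns W of shape (n, d): row i approximates grad log r(x_query[i]).
    """
    n, d = x_query.shape

    def kernel(xs):
        sq = ((x_query[:, None, :] - xs[None, :, :]) ** 2).sum(-1)
        return np.exp(-sq / (2 * sigma**2))           # (n, len(xs))

    Kp, Kq = kernel(x_p), kernel(x_q)
    W, b = np.zeros((n, d)), np.zeros(n)
    for _ in range(n_steps):
        # psi_con'(d) = exp(d - 1) for the backward KL case in (12)
        E = np.exp(W @ x_q.T + b[:, None] - 1.0)      # (n, nq)
        grad_W = Kp @ x_p / len(x_p) - (Kq * E) @ x_q / len(x_q)
        grad_b = Kp.sum(1) / len(x_p) - (Kq * E).sum(1) / len(x_q)
        W += lr * grad_W                              # gradient ascent on (10)
        b += lr * grad_b
    return W
```

Wrapping this in a closure gives a `velocity_field` usable inside the Euler loop sketched in Section 2.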
4.3. Effectiveness of Local Interpolation

In this section, we state our main theoretical result. Let $(w(x^*), b(x^*))$ be a stationary point of $\ell(w, b; x^*)$. We denote the domain of $x$ by $X$ (not necessarily $\mathbb{R}^d$). Without loss of generality, we also assume all sample averages $\hat E[\cdot]$ are taken over $n$ samples. We prove that $w(x^*)$ is a consistent estimate of $\nabla(h \circ r)(x^*)$, assuming that the rate of change of the flow is bounded.

Assumption 4.4. The rate of change of the velocity field is well behaved, i.e., $\sup_{x \in X} \|\nabla^2(h \circ r)(x)\| \le \kappa$. This is an analogue of the assumption on $\nabla^2 \log r$ in Proposition 3.1.

Assumption 4.5. There exists a constant $C_k > 0$ independent of $\sigma$ such that
$$\int q(\sigma y + x^*)\, k(y)\, \|y\|^2\, \left\|[(\sigma y + x^*)^\top, 1]^\top\right\| dy \le C_k.$$
This assumption is similar to the first inequality in (6). Expanding $q(\sigma y + x^*)$ using a Taylor expansion, and provided our kernel is well behaved, this assumption essentially implies the boundedness of $q(x^*)$ and $\|\nabla^2 q(x^*)\|$.

Define two shorthands: $w^* := \nabla(h \circ r)(x^*)$ and $b^* := h(r(x^*)) - \langle \nabla(h \circ r)(x^*), x^* \rangle$.

Assumption 4.6. Let $\bar x := [x^\top, 1]^\top$. As $\sigma \to 0$,
$$\mathrm{tr}\,\mathrm{Cov}_p\!\left[\sigma^{-d} k_\sigma\, \bar x\right] = O(\sigma^{-d}), \qquad \mathrm{tr}\,\mathrm{Cov}_q\!\left[\sigma^{-d} k_\sigma\, \psi'_{\mathrm{con}}(\langle w^*, x \rangle + b^*)\, \bar x\right] = O(\sigma^{-d}).$$
These two are analogues of (7). They are required so that our sample approximation of the objective is valid and concentration inequalities can be applied.

Assumption 4.7. For all $a \in [0, 1]$ and $x \in X$,
$$\psi''_{\mathrm{con}}\!\left[a\, h(r(x)) + (1 - a)(\langle w^*, x \rangle + b^*)\right] \le C_{\psi''_{\mathrm{con}}}.$$
This is a unique assumption for our estimator: the convex conjugate $\psi_{\mathrm{con}}$ must have a bounded second-order derivative along the segment between the true value $h(r(x))$ and the local linear approximation.

Theorem 4.8. Suppose Assumptions 4.4, 4.5, 4.6 and 4.7 hold. Suppose further that there exist strictly positive constants $W, B, \Lambda_{\min}$, independent of $\sigma$ and $n$, such that
$$\|w^*\| \le W, \qquad |b^*| \le B, \qquad (13)$$
and for all $w \in \{w \mid \|w\| < 2W\}$ and $b \in \{b \mid |b| < 2B\}$,
$$\lambda_{\min}\!\left[\hat E_q\!\left[k_\sigma\, \nabla^2_{[w,b]}\, \psi_{\mathrm{con}}(\langle w, x \rangle + b)\right]\right] \ge \sigma^d \Lambda_{\min} \qquad (14)$$
holds with high probability. Then there exist $\sigma_0, N, K > 0$ such that, for all $0 < \sigma < \sigma_0$ and $n > N$,
$$\|w(x^*) - \nabla(h \circ r)(x^*)\| \le \frac{1}{\Lambda_{\min}}\left( \frac{K}{\sqrt{n\sigma^d}} + \kappa\, C_k\, C_{\psi''_{\mathrm{con}}}\, \sigma^2 \right)$$
with high probability. The proof can be found in Appendix D.

Similar to Proposition 3.1, the estimation error is upper bounded by a sample approximation error that reduces with the effective sample size $n\sigma^d$, and a bias term that reduces with $\sigma$. Interestingly, although we make the smoothness assumption on $\nabla^2(h \circ r)$, similar to Proposition 3.1, the bias vanishes at the quadratic rate $\sigma^2$, unlike the linear rate obtained in Proposition 3.1. We can deduce that, with an optimal choice $\sigma \asymp n^{-1/(d+4)}$, the estimation error is $O_p(n^{-2/(d+4)})$, a rate faster than the rate implied by Proposition 3.1. Using Theorem 4.8, we can prove the consistency of various velocity field estimators for different $f$-divergences.

Corollary 4.9. Suppose $\sup_{x \in X} \|\nabla^2 r(x)\| \le \kappa'$, Assumptions 4.5 and 4.6 hold, and
$$\lambda_{\min}\!\left[\hat E_q[k_\sigma(x, x^*)\, \bar x \bar x^\top]\right] \ge \sigma^d \Lambda > 0 \qquad (15)$$
holds with high probability. Then there exist $\sigma_0, N, K > 0$ such that
$$\|w(x^*) - \nabla r(x^*)\| \le \frac{1}{\Lambda}\left( \frac{K}{\sqrt{n\sigma^d}} + \kappa' C_k\, \sigma^2 \right)$$
for all $0 < \sigma < \sigma_0$, $n > N$, with high probability. See Appendix E for the proof.

Corollary 4.10. Suppose $\sup_{x \in X} \|\nabla^2 \log r(x)\| \le \kappa'$, and Assumptions 4.5 and 4.6 hold. Assume $\sup_{x \in X} \|x\| \le C_X$, $\sup_{x \in X} |\log r(x)| \le C_{\log r}$, and that there exist $B, W < \infty$ such that $\|\nabla \log r(x^*)\| < W$ and $|\log r(x^*) - \langle \nabla \log r(x^*), x^* \rangle| < B$. Additionally, suppose
$$\lambda_{\min}\!\left[\hat E_q[k_\sigma(x, x^*)\, \bar x \bar x^\top]\right] \ge \sigma^d \Lambda > 0 \qquad (16)$$
holds with high probability. Then there exist $\sigma_0, N, K > 0$ such that, for all $0 < \sigma < \sigma_0$ and $n > N$,
$$\|w(x^*) - \nabla \log r(x^*)\| \le \frac{1}{\Lambda \exp(-2WC_X - 2B - 1)}\left( \frac{K}{\sqrt{n\sigma^d}} + \kappa' C_k \left[\exp(WC_X + B - 1) \vee \exp(C_{\log r} - 1)\right] \sigma^2 \right)$$
holds with high probability. See Appendix F for the proof.

Note that, due to the assumption that $\|x\|$ is bounded, Corollary 4.10 can only be applied to density ratio functions with bounded input domains. However, this does include important examples such as images, where pixel brightnesses are bounded within $[0, 1]$. Using the same proof techniques, it is possible to derive more corollaries for other $f$-divergence velocity fields. We leave them as a future investigation.

4.4. Model Selection via Linear Interpolation

Although Theorem 4.8 says the estimation bias disappears as $\sigma \to 0$, when we only have a finite number of samples, the choice of the kernel bandwidth $\sigma$ controls the bias-variance trade-off of the local estimation. We therefore propose a model selection criterion; the detailed procedure is provided in Appendix I.1. The high-level idea is: suppose we have testing samples from $p$ and $q$. A good choice of $\sigma$ results in a good approximation of $h \circ r$ at the testing points, and the best approximation of $h \circ r$ maximizes the variational lower bound (9). Therefore, we only need to evaluate (9) on testing samples to determine the optimality of $\sigma$. Our local linear estimator offers this unique model selection advantage because it is formulated as a non-parametric curve-fitting problem. In contrast, SVGD lacks a systematic approach and has to resort to the "median trick".
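A sketch of this criterion in code, for the backward KL case of Example 4.3. Here `fit_wb` stands for any routine that maximizes (12) at the given query points and returns fitted slopes and intercepts; the function names and signature are our own illustration, not the released implementation:

```python
import numpy as np

def select_bandwidth(fit_wb, xp_tr, xq_tr, xp_te, xq_te, sigmas):
    """Pick sigma by the held-out variational lower bound (9).

    fit_wb: callable(x_query, x_p, x_q, sigma) -> (W, b), a local linear
            fit of d = h o r at each query point using training samples.
    Returns the sigma whose fitted d maximizes, on held-out samples,
        mean_p[d(x)] - mean_q[psi_con(d(x))],  psi_con(d) = exp(d - 1).
    """
    def d_hat(x, W, b):                       # d(x) = <w(x), x> + b(x)
        return (W * x).sum(axis=1) + b

    scores = []
    for s in sigmas:
        Wp, bp = fit_wb(xp_te, xp_tr, xq_tr, s)
        Wq, bq = fit_wb(xq_te, xp_tr, xq_tr, s)
        score = (d_hat(xp_te, Wp, bp).mean()
                 - np.exp(d_hat(xq_te, Wq, bq) - 1.0).mean())
        scores.append(score)
    return sigmas[int(np.argmax(scores))]
```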
5. Experiments

5.1. Reducing KL Divergence: SVGD vs. NW vs. Local Linear Estimator

In this experiment, we investigate the performance of SVGD, NW and the Local Linear (LL) estimator on the task of minimizing $\mathrm{KL}[q_t, p]$. We let SVGD, NW and LL fit the target distribution $p = N(0, 1)$; 500 i.i.d. initial particles are drawn from $q_0 = N(-1, 0.25^2)$. For all methods, we use naive gradient descent with a fixed step size of 0.1 to update the particles. We also consider a variant of SVGD in which AdaGrad (Duchi et al., 2011) is applied to adjust the step size dynamically. For SVGD and SVGD with AdaGrad, we use the MATLAB code provided by Liu and Wang (2016) with its default heuristics.

We plot the particle trajectories of all methods in Figure 2. Although all algorithms move particles toward the target distribution, naive SVGD does not spread the particle mass quickly enough to cover the target distribution when using the same step size. This situation is much improved by applying the adaptive learning rate method (AdaGrad). In comparison, NW and LL both converge fast: after 20 iterations, all particles have arrived at their target positions.

Figure 2. Particle trajectories of SVGD, SVGD with AdaGrad, NW and LL, together with the approximated $\mathrm{KL}[q_t, p]$ for each method.

Since all methods are motivated by minimizing $\mathrm{KL}[q_t, p]$, we also plot $\mathrm{KL}[q_t, p]$ approximated by the Donsker-Varadhan lower bound (Donsker and Varadhan, 1976) for all four methods. The plot agrees with our qualitative assessment: AdaGrad SVGD reduces the KL significantly faster than vanilla SVGD with naive gradient descent. After 20 iterations, the KL divergence for both NW and LL particles reaches zero, indicating that the particles have fully converged to the target distribution. LL achieves a performance comparable to NW. This is a remarkable result, since NW and SVGD have access to the true $\log p$, while LL only has samples from $p$. In the next sections, we showcase the performance of LL on forward/backward KL minimization problems.
5.2. Joint Domain Adaptation

In domain adaptation, we want to use source-domain samples to help a prediction task in a target domain. This addresses situations where the training data for a method may differ from the real data seen at deployment. We assume that source samples $D_q := \{(x_q^{(i)}, y_q^{(i)})\}_{i=1}^{n_q}$ are drawn from a joint distribution $Q_{XY}$ and target samples $D_p := \{(x_p^{(i)}, y_p^{(i)})\}_{i=1}^{n_p}$ are drawn from a different joint distribution $P_{XY}$; however, $y_p$ is missing from the target set. Thus, we want to predict the missing labels in $D_p$ with the help of $D_q$.

Courty et al. (2017b;a) propose to find an optimal map that aligns the distributions $Q_{XY}$ and $P_{XY}$, then train a classifier on the aligned source samples. Inspired by this method, we propose to align samples by minimizing $\mathrm{KL}[q_t, p]$, where $p$ is the density of the target $P_{XY}$ and $q_t$ is a particle-label pair distribution whose samples are $D_{q_t} := \{(x_t^{(i)}, y_q^{(i)})\}_{i=1}^{n_q}$. To minimize $\mathrm{KL}[q_t, p]$, we evolve $x_t$ according to the backward KL field, with $x_{t=0}$ initialized to the source inputs $x_q$. In words, we transport source input samples so that the transported and target samples are aligned in the sense of minimizing the backward KL divergence. After $T$ iterations, we can train a classifier on the transported source samples $\{(x_T, y_q)\}$ to predict target labels.

One slight issue is that we do not have labels in the target domain, while performing WGF requires joint samples $(X, Y)$. To solve this, we adopt the same approach used in Courty et al. (2017a), replacing $y_p$ with a proxy $\hat g(x_p)$, where $\hat g$ is a prediction function trained to minimize an empirical transportation cost (see Section 2.2 in Courty et al. (2017a)). We demonstrate our approach on a toy example in Figure 3.

Figure 3. Left: the source classifier (represented by colored areas) misclassifies many testing points (colored dots). Middle: WGF moves particles to align the source and target samples; lines are the trajectories of sample movements in each class. Right: the classifier retrained on the transported source samples gives a much better prediction.

Table 2 compares the performance of adapted classifiers on a real-world 10-class classification dataset named "office-caltech-10", where images of the same objects are taken in four different domains (amazon, caltech, dslr and webcam). We reduce the dimensionality by projecting all samples onto a 100-dimensional subspace using PCA. We compare the performance of: base (the source RBF-kernel SVM classifier), Joint Distribution Optimal Transport (JDOT) (Courty et al., 2017a), an RBF-kernel SVM trained on the WGF-transported source samples $\{(x_T, y_q)\}$ (WGF), and an SVM trained on MMDFlow-transported source samples (Hagemann et al., 2024) (MMD). The classification accuracies on the entire target sets are reported.

| $D_q \to D_p$ | base | JDOT | WGF | MMD |
|---|---|---|---|---|
| amz. → dslr | 27.39% | 65.61% | 78.34% | 78.34% |
| amz. → web. | 61.69% | 67.80% | 84.07% | 89.15% |
| amz. → cal. | 81.66% | 63.58% | 82.72% | 82.19% |
| dslr → amz. | 70.35% | 72.96% | 85.91% | 76.30% |
| dslr → web. | 94.92% | 76.61% | 95.25% | 86.10% |
| dslr → cal. | 58.95% | 72.31% | 79.25% | 69.01% |
| web. → amz. | 75.78% | 75.37% | 91.34% | 89.46% |
| web. → dslr | 94.27% | 73.25% | 98.73% | 100.00% |
| web. → cal. | 67.05% | 63.49% | 78.09% | 75.16% |
| cal. → amz. | 83.40% | 84.55% | 91.13% | 89.04% |
| cal. → dslr | 26.11% | 69.43% | 84.71% | 84.71% |
| cal. → web. | 63.39% | 74.58% | 80.34% | 79.32% |

Table 2. Domain adaptation between domains of the Office-Caltech-10 dataset.

It can be seen that, in some cases, reusing the source classifier in the target domain leads to catastrophic results (e.g., amazon → dslr, caltech → dslr). Such performance decline can be avoided by using any of the joint distribution-based domain adaptation methods. Both WGF and MMD achieve superior performance compared to JDOT, while WGF has the best performance in most adaptation cases.
5.3. Missing Data Imputation

In missing data imputation, we are given a joint dataset $D := \{(\tilde x^{(i)}, m^{(i)})\}$, where $m \in \{0, 1\}^d$ is a mask vector and $m_j^{(i)} = 0$ indicates that the $j$-th dimension of $\tilde x^{(i)}$ is missing. The task is to guess the missing values in the $\tilde x$ vectors. In recent years, GAN-based missing value imputation, e.g., GAIN (Yoon et al., 2018), has gained significant attention (Zhang et al., 2023). Let $x$ be an imputation of $\tilde x$. The basic idea is that, if $x$ is a perfect imputation of $\tilde x$, then a classifier cannot predict $m_j$ given $(x, m_{-j})$; for example, given a perfectly imputed image, one cannot tell which pixels are imputed and which pixels are observed. Therefore, in their approach, a generative network is trained to minimize the aforementioned classification accuracy. In effect, this method teaches the generator to break the dependency between $x$ and $m$.

Inspired by this idea, we propose to impute $\{\tilde x\}$ by iteratively updating particles $\{x_t\}$. The initial particle $x_{t=0}$ is set dimension-wise to
$$x_{0,j} := \begin{cases} \tilde x_j & m_j = 1, \\ \text{a sample from, e.g., } N(0, 1) & m_j = 0. \end{cases}$$
After that, the particles $\{x_t\}$ are evolved according to the forward KL field that minimizes $\mathrm{KL}[p_{X_t M}, p_{X_t} p_M]$, i.e., the mutual information between $X_t$ (particles) and $M$ (mask). Note that we only update the missing dimensions:
$$x_{t+1,j} := \begin{cases} x_{t,j} & m_j = 1, \\ x_{t,j} + \eta\, \partial_{x_j} r_t(x_t, m) & m_j = 0. \end{cases}$$
Samples from $p_{X_t M}$ are available to us, since we observe the pairs $\{(x, m)\}$; samples from $p_{X_t} p_M$ can be constructed as $\{(x, m')\}$, where $m'$ is a random sample of $M$ given the missing pattern.

In the first experiment, we test the performance of various imputers on an "S"-shaped dataset, where samples are Missing Completely at Random (MCAR) (Rubin, 1976). The results are plotted in Figure 4. We compare our imputed results (WGF) with an Optimal Transport-based imputation method (Sinkhorn) (Muzellec et al., 2020) and a GAN-based imputation method (GAIN) (Yoon et al., 2018); $\sigma$ is chosen by the automatic model selection described in Appendix I.1. It can be seen that our imputer, based on minimizing $\mathrm{KL}[p_{X_t M}, p_{X_t} p_M]$, nicely recovers the "S" shape after 100 particle-update iterations, whereas the Sinkhorn and GAIN imputation methods struggle to accurately restore the "S"-shaped pattern.

Figure 4. Comparison of imputation methods. Fully observed samples are plotted in blue, and imputed samples in red. The leftmost plot shows the initial particles of the WGF imputer; the second plot visualizes the imputation trajectories of different particles; the third plot is the final output after 100 WGF iterations.

In the second experiment, we test the performance of our algorithm on a real-world Breast Cancer classification dataset (Zwitter and Soklic, 1988); see Figure 5. This is a 30-dimensional binary classification dataset, and we artificially create missing values by following the MCAR paradigm with different missing rates. Since the dataset is a binary classification dataset, we compare the performance of linear SVM classifiers trained on the imputed datasets, measured by the Area Under the ROC Curve (AUROC) on hold-out testing sets. In addition to Sinkhorn and GAIN, we validate the performance of our method against three additional algorithms: MMD flow using the negative distance kernel (MMDFlow) (Hagemann et al., 2024), a model selection-based method (HyperImpute) (Jarrett et al., 2022), and an autoencoder-based approach (MIWAE) (Mattei and Frellsen, 2019). The results show that an SVM trained on the dataset imputed by our method achieves performance comparable to datasets imputed by the other benchmark methods, and our method is robust against the choice of $\sigma$. Details and the selection of hyperparameters can be found in Appendix I.2.

Figure 5. AUROC of a linear SVM classifier on the imputed Breast Cancer dataset. "Base" indicates the performance of a baseline imputer that fills the missing values with Gaussian noise.
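The masked initialization and update above are mechanical once a field estimate is available. A minimal sketch (ours), with the forward KL field estimator (cf. Example 4.2, applied in the joint particle-mask space) left pluggable:

```python
import numpy as np

def init_particles(x_obs, mask, rng=np.random.default_rng(0)):
    """Fill missing entries (mask == 0) with N(0, 1) noise, keep observed ones."""
    noise = rng.standard_normal(x_obs.shape)
    return np.where(mask == 1, x_obs, noise)

def impute_step(x, mask, velocity_field, step_size=0.1):
    """One WGF imputation step: only the missing coordinates move.

    x:    (n, d) current particles (observed entries already filled in).
    mask: (n, d) binary; 1 = observed (kept fixed), 0 = missing (updated).
    velocity_field: callable(x, mask) -> (n, d) estimate of the forward
                    KL field grad r_t in the joint (particle, mask) space.
    """
    v = velocity_field(x, mask)
    return np.where(mask == 1, x, x + step_size * v)
```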
6. Limitations

All $f$-divergence fields are defined through the density ratio function $p/q_t$. However, the ratio function is not defined when the supports of $p$ and $q_t$ are non-overlapping, a problem that is particularly noticeable when working with high-dimensional, real-world datasets. This so-called density chasm problem (Rhodes et al., 2020) also affects velocity field estimation, since the assumptions required by our estimators (e.g., Assumption 4.4 or (13)) depend on the density ratio. One possible solution is to build a bridge of interpolating distributions between the two, estimate the density ratios of neighbouring interpolations, and then combine these estimates. However, this significantly increases the computational cost, as we would need to estimate the density ratio gradients of many pairs of distributions. Developing an efficient gradient estimator using target and particle distribution interpolation is an interesting future task.

Another potential limitation of the proposed estimator is its applicability to high-dimensional datasets. Although some preliminary investigations show that the proposed estimator works reasonably well on some high-dimensional generative tasks (see Appendix O.2), it is known that local regression tends to perform poorly on high-dimensional datasets (see, e.g., (Stone, 1980; 1982)). Instead of performing WGF updates directly on the high-dimensional data, we can consider performing the updates in a low-dimensional feature space (see Appendix L); some preliminary investigations have shown promising results (see Appendix O.3).

7. Conclusion

In recent years, it has been discovered that, by iteratively updating a set of particles, WGF can approximate a target distribution by reducing the $f$-divergences between the corresponding distributions. This paper addresses the important problem of estimating the velocity fields induced by various $f$-divergence WGFs. We propose novel interpolation-based estimators for different $f$-divergence fields and prove their validity. We demonstrate the effectiveness of these estimators through two novel applications: domain adaptation and missing data imputation. Our results show that, even without access to the density ratio function, the velocity fields can be efficiently estimated from samples of the target and particle distributions.

Acknowledgments

JS was supported by the EPSRC Centre for Doctoral Training in Computational Statistics and Data Science, grant number EP/S023569/1.

Impact Statement

This paper presents work whose goal is to advance the field of Machine Learning. There are many potential societal consequences of our work, none of which we feel must be specifically highlighted here.

References

L. Ambrosio, N. Gigli, and G. Savaré. Gradient Flows: in Metric Spaces and in the Space of Probability Measures. Springer Science & Business Media, 2005.

A. F. Ansari, M. L. Ang, and H. Soh. Refining deep generative models via discriminator gradient flow. In International Conference on Learning Representations (ICLR 2021), 2021.

D. M. Blei, A. Kucukelbir, and J. D. McAuliffe. Variational inference: A review for statisticians. Journal of the American Statistical Association, 112(518):859-877, 2017.
S. Chewi, T. Le Gouic, C. Lu, T. Maunu, and P. Rigollet. SVGD as a kernelized Wasserstein gradient flow of the chi-squared divergence. In Advances in Neural Information Processing Systems (NeurIPS 2020), volume 33, pages 2098-2109, 2020.

N. Courty, R. Flamary, A. Habrard, and A. Rakotomamonjy. Joint distribution optimal transportation for domain adaptation. In Advances in Neural Information Processing Systems (NeurIPS 2017), volume 30, 2017a.

N. Courty, R. Flamary, D. Tuia, and A. Rakotomamonjy. Optimal transport for domain adaptation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(9):1853-1865, 2017b.

M. D. Donsker and S. R. S. Varadhan. Asymptotic evaluation of certain Markov process expectations for large time III. Communications on Pure and Applied Mathematics, 29(4):389-461, 1976.

J. Duchi, E. Hazan, and Y. Singer. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12(7), 2011.

J. Fan. Local linear regression smoothers and their minimax efficiencies. The Annals of Statistics, 21(1):196-216, 1993.

Y. Gao, Y. Jiao, Y. Wang, Y. Wang, C. Yang, and S. Zhang. Deep generative learning via variational gradient flow. In International Conference on Machine Learning (ICML 2019), pages 2093-2101, 2019.

T. Gasser and H.-G. Müller. Kernel estimation of regression functions. In T. Gasser and M. Rosenblatt, editors, Smoothing Techniques for Curve Estimation, pages 23-68. Springer Berlin Heidelberg, 1979.

J. Givens, S. Liu, and H. W. J. Reeve. Density ratio estimation and Neyman-Pearson classification with missing data. In Proceedings of The 26th International Conference on Artificial Intelligence and Statistics (AISTATS 2023), volume 206, pages 8645-8681, 2023.

P. Hagemann, J. Hertrich, F. Altekrüger, R. Beinert, J. Chemseddine, and G. Steidl. Posterior sampling based on gradient flows of the MMD with negative distance kernel. In International Conference on Learning Representations (ICLR 2024), 2024.

T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer, 2001.

W. Hoeffding. Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association, 58(301):13-30, 1963.

A. Hyvärinen. Estimation of non-normalized statistical models by score matching. Journal of Machine Learning Research, 6:695-709, 2005.

A. Hyvärinen. Some extensions of score matching. Computational Statistics & Data Analysis, 51(5):2499-2512, 2007.

D. Jarrett, B. Cebere, T. Liu, A. Curth, and M. van der Schaar. HyperImpute: Generalized iterative imputation with automatic model selection. In International Conference on Machine Learning (ICML 2022), volume 162, pages 9916-9937, 2022.

D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. In International Conference on Learning Representations (ICLR 2015), 2015.

Q. Liu. Stein variational gradient descent as gradient flow. In Advances in Neural Information Processing Systems (NeurIPS 2017), volume 30, pages 3118-3126, 2017.

Q. Liu and D. Wang. Stein variational gradient descent: A general purpose Bayesian inference algorithm. In Advances in Neural Information Processing Systems (NeurIPS 2016), volume 29, pages 2378-2386, 2016.

D. Maoutsa, S. Reich, and M. Opper. Interacting particle solutions of Fokker-Planck equations through gradient-log-density estimation. Entropy, 22(8):802, 2020.
P.-A. Mattei and J. Frellsen. MIWAE: Deep generative modelling and imputation of incomplete data sets. In International Conference on Machine Learning (ICML 2019), volume 97, pages 4413-4423, 2019.

B. Muzellec, J. Josse, C. Boyer, and M. Cuturi. Missing data imputation using optimal transport. In International Conference on Machine Learning (ICML 2020), pages 7130-7140, 2020.

E. A. Nadaraya. On estimating regression. Theory of Probability & Its Applications, 9(1):141-142, 1964.

X. Nguyen, M. J. Wainwright, and M. I. Jordan. Estimating divergence functionals and the likelihood ratio by convex risk minimization. IEEE Transactions on Information Theory, 56(11):5847-5861, 2010.

S. Nowozin, B. Cseke, and R. Tomioka. f-GAN: Training generative neural samplers using variational divergence minimization. In Advances in Neural Information Processing Systems (NeurIPS 2016), volume 29, 2016.

B. Rhodes, K. Xu, and M. U. Gutmann. Telescoping density-ratio estimation. In Advances in Neural Information Processing Systems (NeurIPS 2020), volume 33, pages 4905-4916, 2020.

D. B. Rubin. Inference and missing data. Biometrika, 63(3):581-592, 1976.

D. W. Scott. Feasibility of multivariate density estimates. Biometrika, 78(1):197-205, 1991.

J. Simons, S. Liu, and M. Beaumont. Variational likelihood-free gradient descent. In Fourth Symposium on Advances in Approximate Bayesian Inference (AABI 2021), 2021.

Y. Song, S. Garg, J. Shi, and S. Ermon. Sliced score matching: A scalable approach to density and score estimation. In Uncertainty in Artificial Intelligence (UAI 2020), pages 574-584, 2020.

C. J. Stone. Optimal rates of convergence for nonparametric estimators. The Annals of Statistics, pages 1348-1360, 1980.

C. J. Stone. Optimal global rates of convergence for nonparametric regression. The Annals of Statistics, 10(4):1040-1053, 1982.

M. Sugiyama, T. Suzuki, and T. Kanamori. Density Ratio Estimation in Machine Learning. Cambridge University Press, 2012.

P. Vincent. A connection between score matching and denoising autoencoders. Neural Computation, 23(7):1661-1674, 2011.

M. J. Wainwright. High-Dimensional Statistics: A Non-Asymptotic Viewpoint. Cambridge University Press, 2019.

Y. Wang, P. Chen, and W. Li. Projected Wasserstein gradient descent for high-dimensional Bayesian inference. SIAM/ASA Journal on Uncertainty Quantification, 10(4):1513-1532, 2022.

L. Wasserman. All of Statistics: A Concise Course in Statistical Inference. Springer, 2010.

G. S. Watson. Smooth regression analysis. Sankhyā: The Indian Journal of Statistics, Series A, 26(4):359-372, 1964.

A. Wibisono. Sampling as optimization in the space of measures: The Langevin dynamics as a composite optimization problem. In Conference on Learning Theory (COLT 2018), pages 2093-3027, 2018.

M. Yi, Z. Zhu, and S. Liu. MonoFlow: Rethinking divergence GANs via the perspective of Wasserstein gradient flows. In International Conference on Machine Learning (ICML 2023), pages 39984-40000, 2023.

J. Yoon, J. Jordon, and M. Schaar. GAIN: Missing data imputation using generative adversarial nets. In International Conference on Machine Learning (ICML 2018), pages 5689-5698, 2018.

Q. Yu, A. Hashimoto, and Y. Ushiku. Divergence optimization for noisy universal domain adaptation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2021), pages 2515-2524, 2021.
Y. Zhang, R. Zhang, and B. Zhao. A systematic review of generative adversarial imputation network in missing data imputation. Neural Computing and Applications, 35(27):19685-19705, 2023.

M. Zwitter and M. Soklic. Breast Cancer. UCI Machine Learning Repository, 1988.

A. Proof of Proposition 3.1

We bound the error coordinate-wise:
$$|\hat u_i(x^*) - \partial_i \log r(x^*)| = \left| \frac{\hat E_q[k_\sigma(x, x^*)\, \partial_i \log r(x)]}{\hat E_q[k_\sigma(x, x^*)]} - \partial_i \log r(x^*) \right|$$
$$= \left| \frac{\hat E_q\!\left[k_\sigma(x, x^*)\left( \partial_i \log r(x^*) + \langle \nabla \partial_i \log r(\tilde x), x - x^* \rangle \right)\right]}{\hat E_q[k_\sigma(x, x^*)]} - \partial_i \log r(x^*) \right| = \left| \frac{\hat E_q[k_\sigma(x, x^*)\, \langle \nabla \partial_i \log r(\tilde x), x - x^* \rangle]}{\hat E_q[k_\sigma(x, x^*)]} \right|$$
$$\le \frac{\hat E_q[k_\sigma(x, x^*)\, \|\nabla \partial_i \log r(\tilde x)\|\, \|x - x^*\|]}{\hat E_q[k_\sigma(x, x^*)]} \le \kappa\, \frac{\hat E_q[k_\sigma(x, x^*)\, \|x - x^*\|]}{\hat E_q[k_\sigma(x, x^*)]}.$$
The second line is due to the mean value theorem, where $\tilde x$ is a point between $x$ and $x^*$ in a coordinate-wise fashion. The last inequality holds because the operator norm of a matrix is always greater than any of its row/column norms. Furthermore,
$$\frac{\hat E_q[k_\sigma\, \|x - x^*\|]}{\hat E_q[k_\sigma]} \le \left| \frac{\hat E_q[k_\sigma\, \|x - x^*\|]}{\hat E_q[k_\sigma]} - \frac{E_q[k_\sigma\, \|x - x^*\|]}{E_q[k_\sigma]} \right| + \frac{E_q[k_\sigma\, \|x - x^*\|]}{E_q[k_\sigma]}.$$
Since $\mathrm{Var}_q[\sigma^{-d} k_\sigma(x, x^*)(x - x^*)] = O(\sigma^{-d})$ by assumption, Chebyshev's inequality gives
$$\hat E_q\!\left[\sigma^{-d} k_\sigma(x, x^*)\, \|x - x^*\|\right] - E_q\!\left[\sigma^{-d} k_\sigma(x, x^*)\, \|x - x^*\|\right] = O_p\!\left(1/\sqrt{n\sigma^d}\right). \qquad (17)$$
Similarly, due to the assumption that $\mathrm{Var}_q[\sigma^{-d} k_\sigma(x, x^*)] = O(\sigma^{-d})$, we have $\hat E_q[\sigma^{-d} k_\sigma(x, x^*)] - E_q[\sigma^{-d} k_\sigma(x, x^*)] = O_p(1/\sqrt{n\sigma^d})$.

Lemma A.1. Suppose $|EA| \le A_1 < \infty$ and $|EB| \ge B_0 > 0$. If $\hat E A - EA = O_p(1/\sqrt{n\sigma^d})$ and $\hat E B - EB = O_p(1/\sqrt{n\sigma^d})$, then
$$\frac{\hat E A}{\hat E B} - \frac{EA}{EB} = O_p\!\left(1/\sqrt{n\sigma^d}\right).$$

Sketch of proof. The argument below resembles the proof technique used in A.2 of Givens et al. (2023). Write
$$\frac{\hat E A}{\hat E B} - \frac{EA}{EB} = \frac{\hat E A\, EB - EA\, \hat E B}{\hat E B\, EB}.$$
First, since $|EB|$ is bounded away from zero, for any $\epsilon$ there exists $N$ such that for all $n > N$, $|\hat E B| > |EB|/2 > B_0/2$ with probability $1 - \epsilon$. Thus $\hat E A - EA = O_p(1/\sqrt{n\sigma^d})$ implies $(\hat E A - EA)\, EB / (\hat E B\, EB) = O_p(1/\sqrt{n\sigma^d})$. Second, due to the boundedness of $|EA|$ and the bound on $|\hat E B|$ argued above, $\hat E B - EB = O_p(1/\sqrt{n\sigma^d})$ implies $(EB - \hat E B)\, EA / (\hat E B\, EB) = O_p(1/\sqrt{n\sigma^d})$. Hence $\hat E A/\hat E B - EA/EB = O_p(1/\sqrt{n\sigma^d})$. ∎

Due to Lemma A.1 and (17), we have, with high probability,
$$|\hat u_i(x^*) - \partial_i \log r(x^*)| \le \kappa\left( \frac{K}{\sqrt{n\sigma^d}} + \frac{E_q[k_\sigma\, \|x - x^*\|]}{E_q[k_\sigma]} \right) = \kappa\left( \frac{K}{\sqrt{n\sigma^d}} + \sigma\, \frac{\int q(x^* + \sigma y)\, k(y)\, \|y\|\, dy}{\int q(x^* + \sigma y)\, k(y)\, dy} \right) \le \kappa\left( \frac{K}{\sqrt{n\sigma^d}} + \sigma C_k \right),$$
using the substitution $y := (x - x^*)/\sigma$, where $K$ is a constant. ∎

B. Visualization of Gradient Estimation using Local Linear Fitting

See Figure 6.

Figure 6. Fit $g$ locally at $x^*$ using a linear function $\hat g$. Its slope is a natural estimator of $\nabla_x g(x^*)$.

C. Variational Objective for Estimating h ∘ r

Proposition C.1. The maximum in (9) is attained if and only if $d = h \circ r$.

Proof. Due to the maximizing argument, the maximum of (9) is attained if and only if $d = \psi' \circ r$. The definition of the mirror divergence indicates that $\psi'(r) \cong r\phi'(r) - \phi(r)$, and Theorem 2.1 states that $h(r) = r\phi'(r) - \phi(r) \cong \psi'(r)$; thus the maximum is attained if and only if $d = h \circ r$. ∎

D. Proof of Theorem 4.8 (Estimation Error Bound of w(x*))

Proof. To simplify notation, we denote by $\theta$ the parameter vector that combines both $w$ and $b$, i.e., $\theta := [w^\top, b]^\top$. Specifically, we define
$$\theta^* = [w^{*\top}, b^*]^\top := \left[ \nabla(h \circ r)(x^*)^\top,\; h(r(x^*)) - \langle \nabla(h \circ r)(x^*), x^* \rangle \right]^\top.$$
Let us denote the negative objective function in (10) by $\ell(\theta)$ and consider the constrained optimization problem
$$\min_\theta \ell(\theta) \quad \text{subject to: } \|\theta - \theta^*\|^2 \le \min(W, B)^2. \qquad (18)$$
This convex optimization has Lagrangian $\ell(\theta) + \lambda(\|\theta - \theta^*\|^2 - \min(W, B)^2)$, where $\lambda \ge 0$ is the Lagrange multiplier. According to the KKT conditions, the optimal solution $\theta_0$ of (18) satisfies $\nabla\ell(\theta_0) + 2\lambda(\theta_0 - \theta^*) = 0$. We apply the mean value theorem to $g(\theta) := \langle \theta_0 - \theta^*, \nabla\ell(\theta) + 2\lambda(\theta - \theta^*) \rangle$:
$$\underbrace{\langle \theta_0 - \theta^*, \nabla\ell(\theta_0) + 2\lambda(\theta_0 - \theta^*) \rangle}_{g(\theta_0) = 0 \text{ by KKT}} = \underbrace{\langle \theta_0 - \theta^*, \nabla\ell(\theta^*) \rangle}_{g(\theta^*)} + \left\langle \theta_0 - \theta^*, \left[ \nabla^2\ell(\tilde\theta) + 2\lambda I \right](\theta_0 - \theta^*) \right\rangle,$$
where $\tilde\theta$ is a point between $\theta^*$ and $\theta_0$ in an elementwise fashion.
Since $\tilde\theta$ is in a hypercube with $\theta_0$ and $\theta^*$ as opposite corners, it is in the constraint set of (18). Let us rearrange terms:
$$\langle \theta_0 - \theta^*, -\nabla\ell(\theta^*) \rangle = \left\langle \theta_0 - \theta^*, \left[ \nabla^2\ell(\tilde\theta) + 2\lambda I \right](\theta_0 - \theta^*) \right\rangle \ge \left\langle \theta_0 - \theta^*, \nabla^2\ell(\tilde\theta)(\theta_0 - \theta^*) \right\rangle.$$
For all $\theta$ in the constraint set, $\|w\| = \|w - w^* + w^*\| \le \|w - w^*\| + \|w^*\| \le \|\theta - \theta^*\| + \|w^*\| \le 2W$ and $|b| = |b - b^* + b^*| \le |b - b^*| + |b^*| \le \|\theta - \theta^*\| + |b^*| \le 2B$. Under our assumption (14), the lowest eigenvalue of $\nabla^2\ell(\theta)$ for all $\theta$ in the constraint set of (18) is lower bounded by $\sigma^d \Lambda_{\min}$. Therefore,
$$\langle \theta_0 - \theta^*, -\nabla\ell(\theta^*) \rangle \ge \sigma^d \Lambda_{\min}\, \|\theta_0 - \theta^*\|^2.$$
Using the Cauchy-Schwarz inequality, $\|\theta_0 - \theta^*\|\, \|\nabla\ell(\theta^*)\| \ge \sigma^d \Lambda_{\min}\, \|\theta_0 - \theta^*\|^2$. Assume $\theta_0 - \theta^*$ is not zero (if it is, our estimator is already consistent). Then
$$\|\theta_0 - \theta^*\| \le \frac{1}{\sigma^d \Lambda_{\min}}\, \|\nabla\ell(\theta^*)\| \le \frac{1}{\sigma^d \Lambda_{\min}}\left( \|\nabla\ell(\theta^*) - E\nabla\ell(\theta^*)\| + \|E\nabla\ell(\theta^*)\| \right). \qquad (19)$$
Since $\mathrm{tr}\,\mathrm{Cov}_p[\sigma^{-d} k(x, x^*)\, \bar x] = O(\sigma^{-d})$ and $\mathrm{tr}\,\mathrm{Cov}_q[\sigma^{-d} k(x, x^*)\, \psi'_{\mathrm{con}}(\langle w^*, x \rangle + b^*)\, \bar x] = O(\sigma^{-d})$ by assumption, the multi-dimensional version of Chebyshev's inequality gives
$$\|\nabla\ell(\theta^*) - E\nabla\ell(\theta^*)\| = \sigma^d\, O_p\!\left(1/\sqrt{n\sigma^d}\right).$$
Therefore, with high probability,
$$\|\theta_0 - \theta^*\| \le \frac{1}{\sigma^d \Lambda_{\min}}\left( \frac{\sigma^d K}{\sqrt{n\sigma^d}} + \|E\nabla\ell(\theta^*)\| \right),$$
where $K$ is a constant. We now proceed to bound $\|E\nabla\ell(\theta^*)\|$.

Lemma D.1.
$$\|E\nabla\ell(\theta^*)\| \le \kappa\, C_{\psi''_{\mathrm{con}}}\, E_q\!\left[ k_\sigma(x, x^*)\, \tfrac{1}{2}\|x - x^*\|^2\, \|\bar x\| \right] \le \tfrac{1}{2}\, \sigma^{d+2}\, \kappa\, C_{\psi''_{\mathrm{con}}}\, C_k.$$

Proof. The expression for $E\nabla\ell(\theta^*)$ is
$$E\nabla\ell(\theta^*) = -E_p[k_\sigma(x, x^*)\, \bar x] + E_q[k_\sigma(x, x^*)\, \psi'_{\mathrm{con}}(\langle w^*, x \rangle + b^*)\, \bar x]. \qquad (20)$$
Due to Taylor's theorem,
$$\langle w^*, x \rangle + b^* = h(r(x)) - \tfrac{1}{2}(x - x^*)^\top \nabla^2(h \circ r)(\tilde x)(x - x^*),$$
where $\tilde x$ is a point between $x$ and $x^*$ in an elementwise fashion. Thus, applying the mean value theorem to $\psi'_{\mathrm{con}}$,
$$\psi'_{\mathrm{con}}(\langle w^*, x \rangle + b^*) = \psi'_{\mathrm{con}}[h(r(x))] - \tfrac{1}{2}(x - x^*)^\top \nabla^2(h \circ r)(\tilde x)(x - x^*)\, \psi''_{\mathrm{con}}(y),$$
where $y$ is a scalar between $h(r(x))$ and $\langle w^*, x \rangle + b^*$. Theorem 2.1 states that $h(r) = r\phi'(r) - \phi(r)$; then, by the definition of the mirror divergence, $\psi'(r) \cong h(r)$. Moreover, by convex duality, $\psi'_{\mathrm{con}} = (\psi')^{-1}$, hence $\psi'_{\mathrm{con}}(h(r)) = \psi'_{\mathrm{con}}(\psi'(r)) = r$. Let us write
$$E\nabla\ell(\theta^*) = -E_p[k_\sigma\, \bar x] + E_q[k_\sigma \underbrace{\psi'_{\mathrm{con}}(h(r(x)))}_{r}\, \bar x] - E_q\!\left[ k_\sigma\, \tfrac{1}{2}(x - x^*)^\top \nabla^2(h \circ r)(\tilde x)(x - x^*)\, \psi''_{\mathrm{con}}(y)\, \bar x \right],$$
where the first two terms cancel since $E_q[k_\sigma\, r\, \bar x] = E_p[k_\sigma\, \bar x]$. We can therefore derive a bound for $\|E\nabla\ell(\theta^*)\|$:
$$\|E\nabla\ell(\theta^*)\| \le E_q\!\left[ k_\sigma\, \tfrac{1}{2}\left| (x - x^*)^\top \nabla^2(h \circ r)(\tilde x)(x - x^*) \right| |\psi''_{\mathrm{con}}(y)|\, \|\bar x\| \right] \le \frac{\kappa\, C_{\psi''_{\mathrm{con}}}}{2}\, E_q[k_\sigma\, \|x - x^*\|^2\, \|\bar x\|]$$
$$= \frac{\kappa\, C_{\psi''_{\mathrm{con}}}}{2}\, \sigma^{d+2} \int q(\sigma y + x^*)\, k(y)\, \|y\|^2\, \left\|[(\sigma y + x^*)^\top, 1]^\top\right\| dy \le \frac{1}{2}\, \sigma^{d+2}\, \kappa\, C_{\psi''_{\mathrm{con}}}\, C_k. \; \blacksquare$$

Finally, due to Lemma D.1 and Assumption 4.5, we can see that, with high probability,
$$\|\theta_0 - \theta^*\| \le \frac{1}{\sigma^d \Lambda_{\min}}\left( \frac{\sigma^d K}{\sqrt{n\sigma^d}} + \sigma^{d+2}\, \kappa\, C_k\, C_{\psi''_{\mathrm{con}}} \right) = \frac{1}{\Lambda_{\min}}\left( \frac{K}{\sqrt{n\sigma^d}} + \kappa\, \sigma^2\, C_k\, C_{\psi''_{\mathrm{con}}} \right).$$
Since $n\sigma^d \to \infty$ and $\sigma \to 0$, $\|\theta^* - \theta_0\| \to 0$ with high probability. There always exist $\sigma_0$ and $N$ such that for all $\sigma < \sigma_0$ and $n > N$, $\min(W, B) > \frac{1}{\Lambda_{\min}}(K/\sqrt{n\sigma^d} + \kappa\, C_k\, C_{\psi''_{\mathrm{con}}}\, \sigma^2)$. When this happens, $\theta_0$ must be in the interior of the constraint set of (18), i.e., the constraints in (18) are not active. This implies $\theta_0$ must be a stationary point of $\ell(\theta)$ as long as $\sigma$ is sufficiently small and $n$ is sufficiently large. ∎

E. Proof of Corollary 4.9

Proof. Since (11) has an unconstrained quadratic objective, its maximizers are stationary points. We can see that Assumption 4.4 holds with $h(r) = r$. To apply Theorem 4.8, we still need to show that Assumption 4.7 holds and that $W$ and $B$ exist. In this case, $\psi_{\mathrm{con}}(d) = d^2/2 + d$, so $\psi''_{\mathrm{con}} = 1$. Thus Assumption 4.7 holds automatically for every $C_{\psi''_{\mathrm{con}}} \ge 1$. Additionally,
$$\hat E_q\!\left[ k_\sigma(x, x^*)\, \nabla^2_{[w,b]}\, \psi_{\mathrm{con}}(\langle w, x \rangle + b) \right] = \hat E_q[k_\sigma(x, x^*)\, \psi''_{\mathrm{con}}(\langle w, x \rangle + b)\, \bar x \bar x^\top] = \hat E_q[k_\sigma(x, x^*)\, \bar x \bar x^\top].$$
Therefore, (15) implies that the minimum eigenvalue assumption (14) holds for every $W > 0$, $B > 0$. Thus, we can choose any $W > 0$ and $B > 0$ that satisfy (13). Noticing that $h(r(x)) = r(x)$, applying Theorem 4.8 gives the desired result. ∎
F. Proof of Corollary 4.10

Proof. Assumptions 4.4, 4.5 and (13) are already satisfied. Let us verify the eigenvalue condition (14). In this case, $\psi''_{\mathrm{con}}(d) = \exp(d - 1)$; for $\|w\| \le 2W$ and $|b| \le 2B$ we have $|\langle w, x \rangle + b| \le 2WC_X + 2B$, so
$$\exp(-2WC_X - 2B - 1) \le \exp(\langle w, x \rangle + b - 1) \le \exp(2WC_X + 2B - 1) < \infty.$$
Moreover, because $\bar x \bar x^\top$ is positive semi-definite, (16) implies
$$\lambda_{\min}\!\left[ \hat E_q[k_\sigma(x, x^*)\, \exp(\langle w, x \rangle + b - 1)\, \bar x \bar x^\top] \right] \ge \exp(-2WC_X - 2B - 1)\, \lambda_{\min}\!\left[ \hat E_q[k_\sigma(x, x^*)\, \bar x \bar x^\top] \right] \ge \sigma^d\, \Lambda\, \exp(-2WC_X - 2B - 1) > 0$$
for all $w, b$ with $\|w\| \le 2W$, $|b| \le 2B$. So (14) holds with $\Lambda_{\min} = \Lambda \exp(-2WC_X - 2B - 1)$.

Finally, let us verify Assumption 4.7. Since $\psi''_{\mathrm{con}}$ is a strictly monotone increasing function, $\sup_{a \in [0,1]} \psi''_{\mathrm{con}}(a x_0 + (1 - a) y_0)$ is attained either at $x_0$ or at $y_0$. We only need to verify that $\psi''_{\mathrm{con}}(h(r(x)))$ and $\psi''_{\mathrm{con}}(\langle w^*, x \rangle + b^*)$ are both bounded for all $x$. Both $h(r(x))$ and $\langle w^*, x \rangle + b^*$ can be bounded using our assumptions. Thus, Assumption 4.7 holds with
$$C_{\psi''_{\mathrm{con}}} = \exp(WC_X + B - 1) \vee \exp(C_{\log r} - 1).$$
Applying Theorem 4.8 completes the proof. ∎

G. ∇²r(x) for Different p and q

See Figure 7.

Figure 7. Visualizing $\nabla^2(h \circ r)(x)$ for $D_\phi = \mathrm{KL}[p, q]$ in two settings. (a) $p, q$ with different means: the further $p$ and $q$ are apart, the larger $\sup_x \|\nabla^2 r(x)\|$ is. (b) $p, q$ with different variances: the less overlapped $p$ and $q$ are, the larger $\sup_x \|\nabla^2 r(x)\|$ is.

H. Finite-sample Objectives

H.1. Practical Implementation

In our experiments, we observe that (10) can be efficiently optimized by gradient methods with adaptive learning-rate schemes, e.g., Adam (Kingma and Ba, 2015). One computational advantage of the local linear model is that the computation for each $x^*$ is independent of the others. This property allows us to parallelize the optimization. Even using a single CPU/GPU, we can easily write highly vectorized code to compute the gradient of (10) with respect to $w$ and $b$ for a large particle set $\{x_i^*\}_{i=1}^n$.

Suppose $X_p \in \mathbb{R}^{n_p \times d}$ and $X_q \in \mathbb{R}^{n_q \times d}$ are the matrices whose rows are $x_p^{(i)}$ and $x_q^{(i)}$ respectively, and $K_p \in \mathbb{R}^{n \times n_p}$, $K_q \in \mathbb{R}^{n \times n_q}$ are the kernel matrices between $\{x^*\}$ and $X_p$, and between $\{x^*\}$ and $X_q$, respectively. Let $W \in \mathbb{R}^{n \times d}$ and $b \in \mathbb{R}^n$ be the parameters whose rows are $w(x^*)$ and $b(x^*)$ respectively. Then the gradients of (10) with respect to $W$ and $b$ can be expressed as
$$\nabla_W = K_p X_p / n_p - \left[ K_q \odot \psi'_{\mathrm{con}}(W X_q^\top + b) \right] X_q / n_q,$$
$$\nabla_b = K_p \mathbf{1}_{n_p} / n_p - \left[ K_q \odot \psi'_{\mathrm{con}}(W X_q^\top + b) \right] \mathbf{1}_{n_q} / n_q,$$
where $\mathbf{1}_{n_p}$ and $\mathbf{1}_{n_q}$ are vectors of ones of length $n_p$ and $n_q$ respectively, $\psi'_{\mathrm{con}}$ is evaluated elementwise, $\odot$ is the elementwise product, and the vector $b$ is broadcast to a matrix with $n_q$ columns.
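These expressions translate directly into array code. A minimal NumPy rendering (our sketch), generic in $\psi'_{\mathrm{con}}$:

```python
import numpy as np

def grad_local_linear(W, b, Kp, Kq, Xp, Xq, psi_con_prime):
    """Vectorized gradients of (10) for all query points at once.

    W: (n, d), b: (n,), Kp: (n, np), Kq: (n, nq), Xp: (np, d), Xq: (nq, d).
    psi_con_prime: elementwise derivative of the convex conjugate, e.g.
        lambda D: np.exp(D - 1.0)   # backward KL (Example 4.3)
        lambda D: D + 1.0           # forward KL  (Example 4.2)
    Returns (grad_W, grad_b) for gradient ascent on (10).
    """
    P = psi_con_prime(W @ Xq.T + b[:, None])          # (n, nq)
    grad_W = Kp @ Xp / Xp.shape[0] - (Kq * P) @ Xq / Xq.shape[0]
    grad_b = Kp.sum(1) / Xp.shape[0] - (Kq * P).sum(1) / Xq.shape[0]
    return grad_W, grad_b
```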
I. Experiment Details for Section 5

I.1. Model Selection

Let $D_p := \{x_p^{(i)}\}_{i=1}^{n_p}$ and $D_q := \{x_q^{(i)}\}_{i=1}^{n_q}$ be training sets from $p$ and $q$ respectively, and let $\tilde D_p := \{\tilde x_p^{(i)}\}_{i=1}^{\tilde n_p}$ and $\tilde D_q := \{\tilde x_q^{(i)}\}_{i=1}^{\tilde n_q}$ be testing sets. We can fit a local linear model at each testing point using the training sets, i.e.,
$$\hat w_\sigma(\tilde x), \hat b_\sigma(\tilde x) := \mathop{\mathrm{argmax}}_{w \in \mathbb{R}^d,\, b \in \mathbb{R}} \underbrace{\ell(w, b; \tilde x, D_p, D_q)}_{\text{training objective}},$$
where the dependency on $\sigma$ comes from the smoothing kernel in the training objective. We can tune $\sigma$ by evaluating the variational lower bound (9) approximated using testing samples:
$$\underbrace{\tilde E_p\!\left[\hat d_\sigma(\tilde x)\right] - \tilde E_q\!\left[\psi_{\mathrm{con}}(\hat d_\sigma(\tilde x))\right]}_{\text{testing criterion}}, \qquad (21)$$
where $\tilde E_p[\hat d(\tilde x)] := \frac{1}{\tilde n_p}\sum_{i=1}^{\tilde n_p} \hat d(\tilde x_p^{(i)})$, i.e., the sample average over the testing points, and $\hat d_\sigma(\tilde x) := \langle \hat w_\sigma(\tilde x), \tilde x \rangle + \hat b_\sigma(\tilde x)$ is the interpolation of $d(\tilde x)$ using training samples. The best choice of $\sigma$ should maximize the above testing criterion. In our experiments, we construct training and testing sets using cross-validation and choose $\sigma$ from a list of candidates. This procedure is parallel to selecting $k$ in $k$-nearest neighbours to minimize the testing error; in our case, (21) is the negative "testing error".

I.2. Missing Data Imputation

Before running the experiments, we first pre-process the data in the following way:

1. Suppose $X_{\mathrm{true}}$ is the original data matrix, i.e., without missing values. We introduce missingness to $X_{\mathrm{true}}$, following the MCAR paradigm, and call the matrix with missing values $X$. Denote the corresponding mask matrix by $M$, where $m_j^{(i)} = 0$ if $x_j^{(i)}$ is missing, and $m_j^{(i)} = 1$ otherwise.
2. Calculate the column-wise mean $\bar x$ and standard deviation $s$ of $X$ (excluding missing values).
3. Standardize $X$ by taking $X \leftarrow (X - \bar x)/s$, where the vectors $\bar x$ and $s$ are broadcast to the same dimensions as the matrix $X$; the division here is elementwise.

Denote by $X_t$ the imputed data of $X$ at iteration $t$, where $X_0 = X \odot M + Z \odot (1 - M)$ and $Z \sim N(0, \mathrm{diag}(1))$. We performed two experiments, on toy data (the "S" shape) and on real-world data (the UCI Breast Cancer dataset³). Let $T_{\mathrm{WGF}}$ be the number of WGF iterations, and $T_{\mathrm{GradEst}}$ the number of gradient steps for gradient estimation within each iteration. We set the hyperparameters as follows:

- "S"-shape data: $T_{\mathrm{WGF}} = 100$, $T_{\mathrm{GradEst}} = 2000$; $\sigma$ is chosen by the model selection described in Appendix I.1.
- UCI Breast Cancer data: $T_{\mathrm{WGF}} = 1000$, $T_{\mathrm{GradEst}} = 100$; $\sigma$ is set by the median heuristic on the pairwise distances of $X$.

Across experiments, we observe that our method is robust against the choice of the bandwidth $\sigma$. We demonstrate this via an additional experiment that tests the imputation performance with different selections of $\sigma$. Figure 8 shows this experiment on the UCI Breast Cancer data: the imputation performance remains comparable to the original results (and to the baseline methods) when the scale of $\sigma$ varies from 0.01 to 1.

³ Available at https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/wdbc.data.

Figure 8. AUROC of a linear SVM classifier on the imputed Breast Cancer dataset, with various scales of $\sigma$ from 0.01 to 1.

I.3. Wasserstein Gradient Flow

In this experiment, we first expand MNIST digits into 32x32 pictures, then add a small random noise $\epsilon \sim N(0, 0.001^2 I)$ to each picture so that computing the sample mean and covariance does not cause numerical issues. For both forward and backward KL WGF, we use a kernel bandwidth equal to 1/5 of the typical pairwise distance in the particle dataset, as it is too computationally expensive to perform cross-validation at each iteration. After each update, we clip pixel values to $[0, 1]$ using PyTorch's torch.clamp function. To reduce computational cost, at each iteration we randomly select 4000 samples from the original dataset and 4000 particles from the particle set, and use these samples to estimate the WGF updates.

J. Discussion: Kernel Density Gradient Estimation

The Kernel Density Estimator (KDE) of $p$ is $\hat p(y) := \frac{1}{n}\sum_{i=1}^n k(x_p^{(i)}, y)/Z$, where $Z$ is a normalization constant ensuring $\int \hat p(x)\, dx = 1$. Thus,
$$\nabla_y \log \hat p(y) = \frac{\frac{1}{n}\sum_{i=1}^n \nabla_y k(x_p^{(i)}, y)}{\frac{1}{n}\sum_{i=1}^n k(x_p^{(i)}, y)},$$
in which the normalizing constant $Z$ cancels.
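In code, this KDE plug-in gradient is short. A sketch (ours), assuming a Gaussian kernel:

```python
import numpy as np

def kde_grad_log_p(y, x_p, sigma):
    """grad_y log p_hat(y) for a Gaussian-kernel KDE fit on x_p (np, d).

    The normalizer Z cancels in the ratio, as noted in Appendix J.
    y: (m, d) evaluation points. Returns an (m, d) array.
    """
    diff = x_p[None, :, :] - y[:, None, :]               # x_i - y, (m, np, d)
    K = np.exp(-np.sum(diff**2, -1) / (2 * sigma**2))    # (m, np)
    # grad_y k(x_i, y) = (x_i - y) / sigma^2 * k(x_i, y)
    num = (diff / sigma**2 * K[:, :, None]).sum(axis=1)  # (m, d)
    return num / K.sum(axis=1)[:, None]
```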
I.3. Wasserstein Gradient Flow

In this experiment, we first expand the MNIST digits into 32×32 images, then add a small random noise $\epsilon \sim N(0, 0.001^2 I)$ to each image so that computing the sample mean and covariance does not cause numerical issues. For both the forward and the backward KL WGF, we use a kernel bandwidth equal to 1/5 of the median pairwise distance in the particle dataset, as it is too computationally expensive to perform cross-validation at each iteration. After each update, we clip the pixel values to $[0, 1]$ using the PyTorch torch.clamp function. To reduce the computational cost, at each iteration we randomly select 4000 samples from the original dataset and 4000 particles from the particle set, and use them to estimate the WGF updates.

J. Discussion: Kernel Density Gradient Estimation

The Kernel Density Estimator (KDE) of $p$ is
$$\hat{p}(y) := \frac{1}{n_p}\sum_{i=1}^{n_p} k(x_p^{(i)}, y)/Z,$$
where $Z$ is a normalization constant ensuring that $\int\hat{p}(x)dx = 1$. Thus,
$$\nabla_y \log\hat{p}(y) = \frac{\frac{1}{n_p}\sum_{i=1}^{n_p}\nabla_y k(x_p^{(i)}, y)}{\frac{1}{n_p}\sum_{i=1}^{n_p} k(x_p^{(i)}, y)},$$
in which the normalizing constant $Z$ cancels.
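For a Gaussian kernel $k(x, y) = \exp(-\|x-y\|^2/(2\sigma^2))$, we have $\nabla_y k(x, y) = k(x, y)(x-y)/\sigma^2$, so the ratio above vectorizes directly. Below is a minimal sketch in our own naming, usable as the KDE baseline discussed later in Section O.1.

```python
import torch

def kde_log_grad(Xp, Y, sigma):
    """Gradient of log p-hat(y) for a Gaussian-kernel KDE (Section J).

    Xp: (np, d) samples from p;  Y: (m, d) query points
    """
    diff = Xp[None, :, :] - Y[:, None, :]                    # x_i - y, (m, np, d)
    K = torch.exp(-(diff ** 2).sum(-1) / (2 * sigma ** 2))   # k(x_i, y), (m, np)
    grad_k_sum = (K[..., None] * diff).sum(1) / sigma ** 2   # sum_i grad_y k(x_i, y)
    return grad_k_sum / K.sum(dim=1, keepdim=True)           # (m, d)
```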
K. Discussion: Why Score Matching does not Work for Log-Ratio Gradient Estimation

Score Matching (SM) estimates
$$\hat{p} := \operatorname*{argmin}_f \int p\,\|\nabla\log p - \nabla\log f\|^2 dx,$$
where the objective function is commonly referred to as the Fisher divergence. To use SM in practice, the objective function is further broken down as
$$\int p\,\|\nabla\log p - \nabla\log f\|^2 dx = \int p\,\|\nabla\log f\|^2 dx + 2\sum_i\int p\,\partial_i^2\log f\, dx + C, \qquad (22)$$
where we used dimension-wise integration by parts and $C$ is a constant. Since our target is to estimate $\nabla\log p$, we can directly model $\nabla\log p$ as $g: \mathbb{R}^d\to\mathbb{R}^d$. The objective becomes
$$\int p\,\|g\|^2 dx + 2\sum_i\int p\,\partial_i g_i\, dx + C.$$
SM can therefore be used to estimate $\nabla\log p$. One might assume that SM can also be used for estimating $\nabla\log r$, where $r := \frac{p}{q}$. Let us replace $\log p$ with $\log r$ in (22):
$$\int p\,\|\nabla\log r - \nabla\log f\|^2 dx = \int p\,\|\nabla\log f\|^2 dx - 2\sum_i\int p\,\partial_i\log r\,\partial_i\log f\, dx + C$$
$$= \int p\,\|\nabla\log f\|^2 dx - 2\sum_i\int q\,\partial_i r\,\partial_i\log f\, dx + C$$
$$= \int p\,\|\nabla\log f\|^2 dx + 2\sum_i\int r\,\partial_i(q\,\partial_i\log f)\, dx + C \qquad (23)$$
$$= \int p\,\|\nabla\log f\|^2 dx + 2\sum_i\int r\, q\,\partial_i^2\log f\, dx + 2\sum_i\int r\,\partial_i q\,\partial_i\log f\, dx + C, \qquad (24)$$
where to obtain (23) we applied integration by parts, assuming $q\, r\,\partial_i\log f \to 0$ as $x_i \to \pm\infty$. In (24), the third term is not tractable due to the lack of information about $r$ and $q$. Changing the integration measure of the objective from $p$ to $q$ would also yield an intractable objective for a similar reason.

L. Discussion: Gradient Flow Estimation in Feature Space

One of the issues of local estimation is the curse of dimensionality: local approximation does not work well in high-dimensional spaces. However, since the f-divergence gradient flow is always associated with the density ratio function, we can utilize special structures in density ratio functions to estimate $\nabla(h\circ r)(x^*)$ more effectively.

L.1. Density Ratio Preserving Map

Let $s(x)$ be a measurable function, $s: \mathbb{R}^d\to\mathbb{R}^m$, $m \le d$. Consider two random variables, $X_p$ and $X_q$, each associated with probability density functions $p$ and $q$, respectively. Define $p'$ and $q'$ as the probability density functions of the random variables $S_p := s(X_p)$ and $S_q := s(X_q)$.

Definition L.1. $s(x)$ is a density ratio preserving map if and only if it satisfies
$$r(x) = r'(s(x)), \quad \text{where } r'(s(x)) := \frac{p'(s(x))}{q'(s(x))}, \quad \forall x\in\mathcal{X}.$$

We can leverage a density ratio preserving map to reduce the dimensionality of gradient flow estimation. Suppose $s(x)$ is a known density ratio preserving map, and define $z := s(x)$ and $z^* := s(x^*)$. Then
$$\underbrace{\nabla(h\circ r)(x^*)}_{\mathbb{R}^d\to\mathbb{R}^d} = \nabla(h\circ r')(s(x^*)) = \underbrace{\nabla s(x^*)}_{\mathbb{R}^d\to\mathbb{R}^{d\times m},\ \text{known}}\ \underbrace{\nabla(h\circ r')(z^*)}_{\mathbb{R}^d\to\mathbb{R}^m}. \qquad (25)$$
If we can evaluate $\nabla s(x^*)$, we only need to estimate an $m$-dimensional gradient $\nabla(h\circ r')(z^*)$, which is potentially easier than estimating the original $d$-dimensional gradient $\nabla(h\circ r)(x^*)$ using a local linear model.

While Definition L.1 might suggest that $s$ is a very specific function, the requirement for $s$ to preserve the density ratio is quite mild: $s$ only needs to be sufficient for expressing the density ratio function. This requirement is formalized in the following proposition.

Proposition L.2. Consider a function $s: \mathbb{R}^d\to\mathbb{R}^m$. If there exists a function $g: \mathbb{R}^m\to\mathbb{R}_+$ such that $r(x) = g(s(x))$, $\forall x\in\mathcal{X}$, then $s$ is a density ratio preserving map. Additionally, it follows that $g\circ s = r = r'\circ s$ and $g = r'$.

The proof can be found in Section M. Proposition L.2 implies that we can identify a density ratio preserving map $s$ by simply learning the ratio function $r$ and using the trained feature transform as $s$. For instance, in the context of a neural network used to estimate $r$, $s$ could correspond to the function represented by the penultimate layer of the network. After identifying $s$, we can translate a high-dimensional gradient flow estimation into a low-dimensional problem according to (25). In practice, we find this method works well. However, it still requires us to estimate a high-dimensional density ratio function $r$ to obtain the feature map $s$. In the next section, we propose an algorithm that learns $s$ from data without estimating a high-dimensional density ratio function.

L.2. Finding a Density Ratio Preserving Map

Theorem L.3. Suppose $h$ is associated with an f-divergence $D_\phi$ according to Theorem 2.1 and $D_\psi$ is the mirror of $D_\phi$. If $r(x) = g^*(s^*(x))$, then $s^*$ must be an $\operatorname{argsup}$ of the following objective:
$$\sup_s\ \mathbb{E}_p[h(r'(s(x)))] - \mathbb{E}_q[\psi_{\mathrm{con}}(h(r'(s(x))))]. \qquad (26)$$

Proof. Since $r(x) = g^*(s^*(x))$, Proposition C.1 implies that $(g^*, s^*)$ is necessarily an $\operatorname{argsup}$ of the following optimization problem:
$$\sup_{s, g}\ \mathbb{E}_p[h(g(s(x)))] - \mathbb{E}_q[\psi_{\mathrm{con}}(h(g(s(x))))].$$
Due to the law of the unconscious statistician, $\mathbb{E}_p[f(s(x))] = \mathbb{E}_{p'}[f(z)]$, where $z = s(x)$. The above optimization problem can be rewritten as
$$\sup_s\ \sup_g\ \mathbb{E}_{p'}[h(g(z))] - \mathbb{E}_{q'}[\psi_{\mathrm{con}}(h(g(z)))].$$
Proposition C.1 states that for every $s$, $g = r'$ is an $\operatorname{argsup}$ of the inner optimization problem. Substituting this optimal solution of $g$ and rewriting the expectations in terms of $x$ again, we arrive at
$$\sup_s\ \mathbb{E}_p[h(r'(s(x)))] - \mathbb{E}_q[\psi_{\mathrm{con}}(h(r'(s(x))))]. \qquad\square$$

Both expectations in (26) can be approximated using samples from $p$ and $q$. Given a fixed $s$, $h(r'(z))$ can be approximated by an $m$-dimensional local linear interpolation
$$h(r'(z)) \approx \hat{d}(z) := \langle\hat{w}(z), z\rangle + \hat{b}(z), \qquad \hat{w}(z), \hat{b}(z) := \operatorname*{argmin}_{w\in\mathbb{R}^m,\, b\in\mathbb{R}} \ell(w, b; z, D_p', D_q'), \qquad (27)$$
where $D_p'$ and $D_q'$ are sets of samples from $p'$ and $q'$ respectively. Approximating the expectations in (26) with samples in $D_p$ and $D_q$ and replacing $h\circ r'$ with $\hat{d}$, we solve the following optimization to obtain an estimate of $s$:
$$\hat{s} := \operatorname*{argmax}_{s\in\mathcal{S}}\ \frac{1}{n_p}\sum_{i=1}^{n_p}\hat{d}\big(s(x_p^{(i)})\big) - \frac{1}{n_q}\sum_{i=1}^{n_q}\psi_{\mathrm{con}}\big(\hat{d}(s(x_q^{(i)}))\big). \qquad (28)$$
The optimization of (28) is a bi-level optimization problem, as $\hat{d}$ depends on (27). We propose to divide the whole problem into two alternating steps: first, let $s = \hat{s}$ and solve for $(\hat{w}, \hat{b})$; then, with the estimated $(\hat{w}, \hat{b})$, solve for $\hat{s}$. We repeat this procedure until convergence, as detailed in Algorithm 1.

Algorithm 1 Searching for a Density Ratio Preserving Map $s$
1: Inputs: $D_p$, $D_q$ and an initial guess of $\hat{s}$.
2: while $\hat{s}$ not converged do
3:   for each $x \in D_p \cup D_q$ do
4:     $z := \hat{s}(x)$
5:     $\hat{w}(z), \hat{b}(z) := \operatorname*{argmin}_{w\in\mathbb{R}^m,\, b\in\mathbb{R}} \ell(w, b; z, D_p', D_q')\,\big|_{s=\hat{s}}$
6:     $\hat{d}(z) := \langle\hat{w}(z), z\rangle + \hat{b}(z)$
7:   end for
8:   $\hat{s} := \operatorname*{argmax}_{s\in\mathcal{S}}\ \hat{\mathbb{E}}_p[\hat{d}(s(x))] - \hat{\mathbb{E}}_q[\psi_{\mathrm{con}}(\hat{d}(s(x)))]$
9: end while
10: Output: $\hat{s}$.

In practice, we restrict $\mathcal{S}$ to the set of linear maps represented by a matrix $S\in\mathbb{R}^{d\times m}$ whose columns form an orthonormal basis, i.e., $s(x) := S^\top x$; the Jacobian $\nabla s(x)$ is then simply $S$. After obtaining $\hat{s}$, we can approximate the gradient flow using the chain rule described in (25):
$$\nabla(h\circ r)(x^*) \approx \nabla\hat{s}(x^*)\,\hat{w}(z^*),$$
where $\hat{w}(z^*)$ is obtained by an $m$-dimensional local linear interpolation
$$\hat{w}(z^*), \hat{b}(z^*) := \operatorname*{argmin}_{w\in\mathbb{R}^m,\, b\in\mathbb{R}} \ell(w, b; z^*, D_p', D_q')\,\big|_{s=\hat{s}}.$$
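As a concrete illustration of the chain rule (25) with a linear feature map, here is a minimal Python sketch. `fit_local_linear` is a hypothetical helper that solves (27) at a batch of query points; all names are our own, and this is only a sketch of the idea, not the authors' implementation.

```python
import torch

def feature_space_update(Xstar, S, Zp, Zq, fit_local_linear):
    """Approximate the velocity field via the chain rule (25) with s(x) = S^T x.

    Xstar: (n, d) particles; S: (d, m) matrix with orthonormal columns
    Zp, Zq: projected training samples s(x) for x in Dp and Dq
    fit_local_linear: hypothetical helper solving (27), returns slopes W: (n, m)
    """
    Zstar = Xstar @ S                        # z* = s(x*), shape (n, m)
    W, _ = fit_local_linear(Zstar, Zp, Zq)   # m-dimensional local linear fits
    return W @ S.T                           # row i is S w(z*_i), shape (n, d)
```

Step 8 of Algorithm 1 could then update $S$ by gradient ascent on (28), re-orthonormalizing its columns (e.g., via a QR decomposition) after each step; the re-orthonormalization is our own implementation assumption.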
M. Discussion: Sufficient Condition of a Density Ratio Preserving Map

In this section, we provide a sufficient condition for $s$ to be a density ratio preserving map.

Lemma M.1. If there exists some $g: \mathbb{R}^m\to\mathbb{R}_+$ such that $r(x) = g(s(x))$, then $s$ is a density ratio preserving map and $g(s(x)) = r'(s(x))$.

Proof. The statement that $g(s(x))$ is the density ratio $\frac{p(x)}{q(x)}$ is equivalent to asserting that $\mathrm{KL}[p, q\cdot(g\circ s)] = 0$. Since the KL divergence is always non-negative, this means
$$g = \operatorname*{argmin}_{g:\ \int q(x)g(s(x))dx = 1,\ g\ge 0}\ \mathbb{E}_p\left[\log\frac{p(x)}{q(x)g(s(x))}\right] = -\mathbb{E}_p[\log g(s(x))] + C_1, \qquad (29)$$
i.e., $g$ is a minimizer of $\mathrm{KL}[p, q\cdot(g\circ s)]$, where $g\ge 0$ is constrained to a domain in which $q\cdot(g\circ s)$ is normalized to 1. Similarly, $r'(z)$ being the density ratio $\frac{p'(z)}{q'(z)}$ is the same as asserting that $\mathrm{KL}[p', q'\cdot r'] = 0$, which is equivalent to
$$r' = \operatorname*{argmin}_{g:\ \int q'(z)g(z)dz = 1,\ g\ge 0}\ \mathbb{E}_{p'}\left[\log\frac{p'(z)}{q'(z)g(z)}\right] = -\mathbb{E}_{p'}[\log g(z)] + C_2, \qquad (30)$$
i.e., $r'$ is a minimizer of $\mathrm{KL}[p', q'\cdot g]$. In fact, one can see that (29) and (30) are identical optimization problems due to the law of the unconscious statistician: $\int q(x)g(s(x))dx = \mathbb{E}_q[g(s(x))] = \mathbb{E}_{q'}[g(z)] = \int q'(z)g(z)dz$ and $\mathbb{E}_p[\log g(s(x))] = \mathbb{E}_{p'}[\log g(z)]$, which means their solution sets are the same. Therefore, any $g$ that minimizes (29) must also minimize (30), and hence it satisfies
$$r'(s(x)) = g(s(x)) = r(x),$$
where the second equality is by our assumption. □

N. Discussion: Stein Variational Gradient Descent

SVGD minimizes $\mathrm{KL}[q_{t+1}, p]$, where samples of $q_{t+1}$ are constructed using the following deterministic rule:
$$x_{t+1} = x_t + \eta\, u_t(x_t). \qquad (31)$$
Here $x_t$ are the particles at iteration $t$ and $u_t \in \mathcal{H}^d$, a $d$-dimensional Reproducing Kernel Hilbert Space (RKHS) with a kernel function $l(x, x')$. Liu and Wang (2016) show that the optimal update $u_t$ has a closed form:
$$u_t^{\mathrm{svgd}} := \mathbb{E}_{q_t}[l(x_t, \cdot)\nabla\log p(x_t) + \nabla l(x_t, \cdot)]. \qquad (32)$$
In practice, the expectations $\mathbb{E}_{q_t}[\cdot]$ can be approximated by $\hat{\mathbb{E}}_{q_t}[\cdot]$, i.e., the sample average taken over the particles at time $t$.

Chewi et al. (2020) link SVGD with the f-divergence WGF: $u_t^{\mathrm{svgd}}$ is the backward KL divergence WGF under a coordinate-wise transform of an integral operator. Indeed, the $i$-th dimension of the SVGD update $u_{t,i}^{\mathrm{svgd}}$ can be expressed as
$$u_{t,i}^{\mathrm{svgd}} := \mathbb{E}_{q_t}[l(x_t,\cdot)\,\partial_i\log p(x_t)] + \mathbb{E}_{q_t}[\partial_i l(x_t,\cdot)]$$
$$= \mathbb{E}_{q_t}[l(x_t,\cdot)\,\partial_i\log p(x_t)] - \mathbb{E}_{q_t}[l(x_t,\cdot)\,\partial_i\log q_t(x_t)]$$
$$= \int q_t(x_t)\, l(x_t,\cdot)\,\partial_i\log\frac{p(x_t)}{q_t(x_t)}\, dx_t, \qquad (33)$$
where the last line is an integral operator (Wainwright, 2019) applied to the function $\partial_i\log\frac{p}{q_t}$, i.e., the $i$-th dimension of the backward KL divergence flow $\nabla\log r_t$.⁴ Due to the reproducing property of the RKHS, the SVGD update at some fixed point $x^*$ can be written as
$$u^{\mathrm{svgd}}(x^*) = \langle u_t^{\mathrm{svgd}}, l(\cdot, x^*)\rangle_{\mathcal{H}^d} = \mathbb{E}_q[l(x, x^*)\nabla\log r(x)].$$

⁴Note that the second equality in (33) is due to integration by parts, and only holds under conditions such as $\lim_{\|x\|\to\infty} q_t(x) = 0$.
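For reference, a minimal PyTorch sketch of the empirical update (32) with an RBF kernel follows. The names are our own, and `score_p` stands for $\nabla\log p$, which SVGD assumes is available (unlike our sample-only setting); this is a sketch, not the reference SVGD implementation.

```python
import torch

def svgd_direction(X, score_p, h):
    """Empirical SVGD update direction (32) with an RBF kernel.

    X: (n, d) particles; score_p: callable, score_p(X) -> (n, d) scores
    h: kernel bandwidth
    """
    n = X.shape[0]
    diff = X[:, None, :] - X[None, :, :]                # diff[j, i] = x_j - x_i
    K = torch.exp(-(diff ** 2).sum(-1) / (2 * h ** 2))  # K[j, i] = l(x_j, x_i)
    drive = K.T @ score_p(X) / n                        # kernel-weighted scores
    # grad_{x_j} l(x_j, x_i) = l(x_j, x_i)(x_i - x_j)/h^2, summed over j
    repulse = -(K[..., None] * diff).sum(0) / (h ** 2 * n)
    return drive + repulse                              # (n, d)
```

The first term drives particles toward high-density regions of $p$; the second, the summed kernel gradients, acts as the repulsive force that keeps particles spread out.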
O. Additional Experiments

O.1. Gradient Estimation

We now investigate the performance of estimating $\nabla\log r(x)$ using the proposed gradient estimator versus an indirect estimator based on logistic regression. For the indirect estimator, we first train a Multilayer Perceptron (MLP) with binary logistic regression to approximate $\log r$, then obtain $\nabla\log r$ by auto-differentiating the estimated log-ratio. The kernel bandwidth in our method is tuned using the model selection criterion described in Section 4.4. To conduct the experiments, we let
$$p = \tfrac{1}{3}N(-5, .5) + \tfrac{1}{3}N(0, .5) + \tfrac{1}{3}N(5, .5), \qquad q = \tfrac{1}{3}N(-5, 1) + \tfrac{1}{3}N(0, 1) + \tfrac{1}{3}N(5, 1),$$
and generate 5000 samples from each distribution for approximating the gradients.

Figure 9. Left: $\nabla\log r(x)$ and its approximations. Right: Estimation error with standard error.

The left plot in Figure 9 shows the true gradient and its approximations: the direct gradient estimation is more accurate than estimating the log-ratio first and then taking the gradient. The right plot in Figure 9 displays the estimation errors of different methods, comparing the proposed method with Kernel Density Estimation (KDE) and score matching, all applied to the same distributions as in the previous experiment. KDE was previously used for approximating WGF (Wang et al., 2022): it first estimates $p$ and $q$ separately with nonparametric kernel density estimators $\hat{p}$ and $\hat{q}$, then approximates $\nabla\log r$ with $\nabla\log\hat{p} - \nabla\log\hat{q}$. Score matching approximates $\nabla\log\hat{p}$ and $\nabla\log\hat{q}$ with the minimizers of the Fisher divergence; it has also been used for simulating particle ODEs in previous work (Maoutsa et al., 2020). The estimation error plot shows that the proposed estimator yields more accurate results than both of these kernel-based gradient estimation methods.

O.2. Wasserstein Gradient Flow

In this experiment, we test the performance of the proposed gradient estimators by generating samples from a high-dimensional target distribution (MNIST handwritten digits). We check whether the quality of the particles can be improved by performing WGF using our estimated updates. Note that we do not intend to compare the generated samples with NN-based approaches, as the focus of our paper is on local estimation using kernel functions.

We perform two different WGFs, forward KL and backward KL, whose updates are approximated using the corresponding estimates $\hat{w}$. We let the initial particle distribution $q_0$ be $N(\mu, \Sigma)$, where $\mu, \Sigma$ are the mean and covariance of the target dataset, and fix the kernel bandwidth using the median trick (see Section I.3 for details). The generated samples, together with samples from the initial distribution $q_0$, are shown in Figure 10. Judging from the quality of the generated samples, both WGFs perform well and make significant improvements over the initial samples drawn from $q_0$.

Figure 10. Samples generated using the forward and backward KL velocity fields.

We also provide two videos showing the animation of the first 100 gradient steps:
Forward KL: https://youtube.com/shorts/HZcvUykrpbc
Backward KL: https://youtube.com/shorts/AgN6dsDecCM

O.3. Feature Space Wasserstein Gradient Flow

In this experiment, we run WGF in a 5-dimensional space. The target distribution is $p = p_{12}(x_{1,2})\,p_{345}(x_{3,4,5})$, where $p_{345} := N(0, I)$ and $p_{12}$ is constructed by a generative process. We generate samples in the first two dimensions as
$$X_1 = g(X_2) + \epsilon, \qquad X_2\sim N(0, 1), \qquad \epsilon\sim N(0, 1).$$
In our experiment, we draw 5000 samples from $p$ and 5000 samples from $q_0 := N(0, I)$, and run the backward KL gradient flow. This WGF clearly has a low-dimensional structure, since $p$ and $q$ only differ in the first two dimensions. We also run the feature-space backward KL flow, whose updates are calculated using (25); the feature function $s(x) = S^\top x$ is learned by Algorithm 1.

Figure 11. Particles from $q_t$ in the first two dimensions $x_1, x_2$. Feature-space WGF converges in fewer steps than WGF in the original space. Above: $g = \cos$. Below: $g = \sin$.

The resulting particle evolution for both processes is plotted in Figure 11 (for visualization purposes, we only plot the first two dimensions). The particles converge much faster when we explicitly exploit the subspace structure using the feature-space WGF; in comparison, WGF in the original space converges at a much slower speed.
In this experiment, we set the learning rate for both WGF and feature-space WGF to 0.1, and the kernel bandwidth $\sigma$ in our local estimators is tuned using cross-validation with a candidate set ranging from 0.1 to 2.
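For completeness, here is a minimal sketch of sampling the 5-dimensional target used in this experiment; the function name is our own, and $g$ is $\cos$ or $\sin$ as in Figure 11.

```python
import torch

def sample_target(n, g=torch.cos):
    """Draw n samples from p = p12(x1, x2) * N(0, I) on (x3, x4, x5)."""
    x2 = torch.randn(n)
    x1 = g(x2) + torch.randn(n)   # X1 = g(X2) + eps, eps ~ N(0, 1)
    x345 = torch.randn(n, 3)      # remaining dimensions are standard normal
    return torch.cat([x1[:, None], x2[:, None], x345], dim=1)  # (n, 5)
```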