# Sampling with Mollified Interaction Energy Descent

Published as a conference paper at ICLR 2023

Lingxiao Li (MIT CSAIL, lingxiao@mit.edu), Qiang Liu (University of Texas at Austin, lqiang@cs.utexas.edu), Anna Korba (CREST, ENSAE, IP Paris, anna.korba@ensae.fr), Mikhail Yurochkin (IBM Research, MIT-IBM Watson AI Lab, mikhail.yurochkin@ibm.com), Justin Solomon (MIT CSAIL, jsolomon@mit.edu)

ABSTRACT

Sampling from a target measure whose density is only known up to a normalization constant is a fundamental problem in computational statistics and machine learning. In this paper, we present a new optimization-based method for sampling called mollified interaction energy descent (MIED). MIED minimizes a new class of energies on probability measures called mollified interaction energies (MIEs). These energies rely on mollifier functions, smooth approximations of the Dirac delta originating from PDE theory. We show that as the mollifier approaches the Dirac delta, the MIE converges to the chi-square divergence with respect to the target measure and the minimizers of the MIE converge to the target measure. Optimizing this energy with proper discretization yields a practical first-order particle-based algorithm for sampling in both unconstrained and constrained domains. We show experimentally that for unconstrained sampling problems our algorithm performs on par with existing particle-based algorithms like SVGD, while for constrained sampling problems our method readily incorporates constrained optimization techniques to handle more flexible constraints with strong performance compared to alternatives.

1 INTRODUCTION

Sampling from an unnormalized probability density is a ubiquitous task in statistics, mathematical physics, and machine learning. While Markov chain Monte Carlo (MCMC) methods (Brooks et al., 2011) provide a way to obtain unbiased samples at the price of potentially long mixing times, variational inference (VI) methods (Blei et al., 2017) approximate the target measure with simpler (e.g., parametric) distributions at a lower computational cost. In this work, we focus on a particular class of VI methods that approximate the target measure using a collection of interacting particles. A primary example is Stein variational gradient descent (SVGD), proposed by Liu & Wang (2016), which iteratively applies deterministic updates to a set of particles to decrease the KL divergence to the target distribution.

While MCMC and VI methods have found great success in sampling from unconstrained distributions, they often break down for distributions supported in a constrained domain. Constrained sampling is needed when the target density is undefined outside a given domain (e.g., the Dirichlet distribution), when the target density is not integrable over the entire Euclidean space (e.g., the uniform distribution), or when we only want samples that satisfy certain inequalities (e.g., fairness constraints in Bayesian inference (Liu et al., 2021)). A few recent approaches (Brubaker et al., 2012; Byrne & Girolami, 2013; Liu & Zhu, 2018; Shi et al., 2021) extend classical sampling methods like Hamiltonian Monte Carlo (HMC) or SVGD to constrained domains. These extensions, however, typically contain expensive numerical subroutines, like solving nonlinear systems of equations, and require explicit formulas for quantities such as Riemannian metric tensors or mirror maps to be derived on a case-by-case basis from the constraints.
We present an optimization-based method called mollified interaction energy descent (MIED) that minimizes mollified interaction energies (MIEs) for both unconstrained and constrained sampling. An MIE takes the form of a double integral of the quotient of a mollifier (a smooth approximation of δ0, the Dirac delta at the origin) over the target density, properly scaled. Intuitively, minimizing an MIE balances two types of forces: attractive forces that drive the current measure towards the target density, and repulsive forces from the mollifier that prevent collapse. We show that as the mollifier converges to δ0, the MIE converges to the χ² divergence to the target measure up to an additive constant (Theorem 3.3). Moreover, the MIE Γ-converges to the χ² divergence (Theorem 3.6), so that minimizers of MIEs converge to the target measure, providing a theoretical basis for sampling by minimizing MIEs. While mollifiers can be interpreted as kernels with diminishing bandwidths, our analysis is fundamentally different from that of SVGD, where a fixed-bandwidth kernel is used to define a reproducing kernel Hilbert space (RKHS) on which the Stein discrepancy has a closed form (Gorham & Mackey, 2017). Deriving a version of the Stein discrepancy for constrained domains is far from trivial and requires special treatment (Shi et al., 2021; Xu, 2021). In contrast, our energy has a unified form for constrained and unconstrained domains and approximates the χ² divergence as long as the bandwidth is sufficiently small so that short-range interaction dominates the energy; this idea of using diminishing bandwidths in sampling is under-explored for methods like SVGD.

Algorithmically, we use first-order optimization to minimize MIEs discretized using particles. We introduce a log-sum-exp trick to neutralize the effect of arbitrary scaling of the mollifiers and the target density; this form also improves numerical stability significantly. Since we turn sampling into optimization, we can readily apply existing constrained optimization techniques, such as reparameterization using a differentiable (not necessarily bijective) map or the dynamic barrier method by Gong & Liu (2021), to handle generic differentiable inequality constraints. Our method is efficient as it only uses first-order derivatives of the target density and of the inequality constraint (or the reparameterization map), enabling large-scale applications in machine learning (see, e.g., Figure 4). For unconstrained sampling problems, we show MIED achieves performance comparable to particle-based algorithms like SVGD, while for constrained sampling problems, MIED demonstrates strong performance compared to alternatives while being more flexible with constraint handling.

2 RELATED WORKS

KL gradient flow and its discretization for unconstrained sampling. The Wasserstein gradient flow of the Kullback-Leibler (KL) divergence has been extensively studied, and many popular sampling algorithms can be viewed as discretizations of the KL-divergence gradient flow. Two primary examples are Langevin Monte Carlo (LMC) and Stein variational gradient descent (SVGD). LMC simulates Langevin diffusion and can be viewed as a forward-flow splitting scheme for the KL-divergence gradient flow (Wibisono, 2018). At each iteration of LMC, particles are pulled along ∇log p, where p is the target density, while random Gaussian noise is injected; a Metropolis adjustment step is typically needed for unbiased sampling.
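To make the LMC description concrete, here is a minimal sketch (assuming NumPy; the step size, iteration count, and target are illustrative choices, not values from the paper) of the unadjusted Langevin update; as noted above, a Metropolis correction would be needed for unbiased samples.

```python
import numpy as np

def lmc(grad_log_p, x0, step=1e-2, n_steps=1000, seed=0):
    # Unadjusted Langevin Monte Carlo: each particle follows grad log p
    # plus injected Gaussian noise scaled by sqrt(2 * step).
    rng = np.random.default_rng(seed)
    x = np.array(x0, dtype=float)
    for _ in range(n_steps):
        x = x + step * grad_log_p(x) + np.sqrt(2 * step) * rng.standard_normal(x.shape)
    return x

# Example: 500 particles targeting a 2D standard Gaussian, for which grad log p(x) = -x.
samples = lmc(lambda x: -x, np.zeros((500, 2)))
```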
In contrast with LMC, SVGD is a deterministic algorithm that updates a collection of particles using a combination of an attractive force involving ∇log p and a repulsive force among the particles; it can be viewed as a kernelized gradient flow of the KL divergence (Liu, 2017) or of the χ² divergence (Chewi et al., 2020). The connection to the continuous gradient flow in the Wasserstein space is fruitful for deriving sharp convergence guarantees for these sampling algorithms (Durmus et al., 2019; Balasubramanian et al., 2022; Korba et al., 2020; Salim et al., 2022).

Sampling in constrained domains. Sampling in constrained domains is more challenging than in the unconstrained setting. Typical solutions are rejection sampling and reparameterization to an unconstrained domain. However, rejection sampling can have a high rejection rate when the constrained domain is small, while reparameterization maps need to be chosen on a case-by-case basis and involve a determinant-of-Jacobian term that can be costly to evaluate. Brubaker et al. (2012) propose a constrained version of HMC for sampling on implicit submanifolds, but their algorithm is expensive as it needs to solve a nonlinear system of equations for every integration step in each step of HMC. Byrne & Girolami (2013) propose geodesic Hamiltonian Monte Carlo for sampling on embedded manifolds, but they require explicit geodesic formulae. Zhang et al. (2020) and Ahn & Chewi (2021) propose discretizations of the mirror-Langevin diffusion for constrained sampling, provided that a mirror map is given to capture the constraint. Similarly, Shi et al. (2021) propose mirror SVGD, which evolves particles using SVGD-like updates in the dual space defined by a mirror map. While these two methods have strong theoretical convergence guarantees, they are limited by the availability of mirror maps that capture the constraints. Liu et al. (2021) extend SVGD to incorporate a population constraint over the samples obtained; their setting is different from ours, where the constraint is applied to every sample.

Pairwise interaction energies on particles. Pairwise interaction energies on particles take the form E({x_i}_{i=1}^N) = Σ_{i≠j} W(x_i, x_j) for a kernel W(x, y). The gradient flow of E on N particles can give rise to phenomena like flocking and swarming, and a long line of mathematical research studies the mean-field convergence of such particle gradient flows to continuous solutions of certain PDEs as N → ∞ (Carrillo et al., 2014; Serfaty, 2020). Of particular interest to sampling, a separate line of works summarized by Borodachov et al. (2019) demonstrates that for the hypersingular Riesz kernel W(x, y) = ‖x − y‖_2^{-s} with sufficiently large s, minimizing E over a compact set yields uniform samples as N → ∞. Moreover, W can be made to depend on p to obtain non-uniform samples distributed according to p. As the hypersingular Riesz kernel is not integrable, their analysis is based entirely on geometric measure theory and avoids any variational techniques. In contrast, we take a different approach by using integrable kernels in MIEs through mollifiers. This enables us to apply variational analysis to establish interesting connections between MIEs and the χ² divergence. We compare our formulation with theirs further in Appendix B. Joseph et al. (2019) propose to minimize the maximum of the pairwise interactions, akin to the interaction in Borodachov et al. (2019), without gradient information using a greedy algorithm. Recently, Korba et al.
(2021) propose sampling by descending on the kernel Stein discrepancy, which can also be viewed as a type of interaction energy; but like SVGD, it is limited to unconstrained sampling and can be slow, as higher-order derivatives of log p are required. Craig et al. (2022) propose approximating the χ² divergence with a functional, different from ours, that also involves a mollifier. The resulting algorithm needs to evaluate an integral containing the target density over the whole domain at each step, which can be costly. Consequently, their method is only applied to special target densities for which the integral has a closed form. In comparison, our method can handle general target densities with cheap updates in each step.

3 MOLLIFIED INTERACTION ENERGY

Notation. Let X ⊆ R^n be the domain we work in. We use L^n to denote the Lebesgue measure on R^n and write ∫ f(x) dx for integration with respect to L^n. We assume all measures are Borel. We use H^d to denote the d-dimensional Hausdorff measure. We use ω_N to denote a set of points {x_1, ..., x_N} ⊂ R^n. For x ∈ R^n, let δ_x be the Dirac delta measure at x and δ_{ω_N} be the empirical measure (1/N) Σ_{k=1}^N δ_{x_k}. We denote L^p(X) := {f : X → R Lebesgue measurable with ∫_X |f(x)|^p dx < ∞}, and for f ∈ L^p(X) we let ‖f‖_p := (∫_X |f(x)|^p dx)^{1/p}. We use ‖f‖_∞ to denote the essential supremum of |f|. For f, g : R^n → R, we define the convolution (f ∗ g)(x) := ∫ f(y) g(x − y) dy, provided this integral exists. We use C^k(X) to denote the continuously k-differentiable functions and C^∞(X) the smooth functions. The space of continuously k-differentiable functions with compact support on X is C^k_c(X). We denote by P(X) the set of probability measures on X, and by P_2(X) the set of probability measures with bounded second moments. We denote by M_sign(X) the set of finite signed measures. Given a Lebesgue measurable map T : X → X and µ ∈ P_2(X), T_#µ is the pushforward measure of µ by T.

Setup. The domain X ⊆ R^n in which we constrain samples is assumed to be closed and full-dimensional, i.e., L^n(X) > 0; the unconstrained case corresponds to X = R^n. The target density, always denoted p, is assumed to satisfy p ∈ C^1(X), p(x) > 0 for all x ∈ X, and 0 < ∫_X p(x) dx < ∞. Let µ* be the target probability measure µ*(B) := ∫_B p(x) dx / ∫_X p(x) dx for Borel B ⊆ X. Our goal is to sample from µ*.

3.1 MOLLIFIERS

Our framework relies on the notion of mollifiers, originally introduced in Friedrichs (1944).

Definition 3.1 (family of mollifiers). We say {ϕϵ}ϵ>0 ⊂ C^1(R^n) is a family of mollifiers if it satisfies:
(a) for any ϵ > 0 and x ∈ R^n, ϕϵ(x) ≥ 0 and ϕϵ(x) = ϕϵ(−x);
(b) for any ϵ > 0, ‖ϕϵ‖_1 = 1 and sup_{x ∈ R^n} ϕϵ(x) < ∞;
(c) for any δ > 0 and p ∈ {1, ∞}, lim_{ϵ→0} ‖1_{R^n \ B_δ(0)} ϕϵ‖_p = 0.

As ϵ → 0, the distribution with density ϕϵ converges to δ_0, and if f ∈ L^p(R^n) is continuous, then the convolution ϕϵ ∗ f converges to f both pointwise (Proposition A.4) and in L^p (Proposition A.7). Our definition differs from the one used in PDE theory (Hörmander, 2015), where some ϕ ∈ C^∞_c(R^n) is fixed and the family of mollifiers is generated by ϕϵ(x) := ϵ^{-n} ϕ(x/ϵ). While in PDE theory mollifiers are typically used to improve the regularity of non-smooth functions, in our framework they are used to construct interaction energies that approximate the χ² divergence. We will use the following mollifiers. In the definitions below, we include normalizing constants Zϵ that ensure ‖ϕϵ‖_1 = 1. We do not write out Zϵ explicitly, as it might not admit a clean form; the normalizing constants are only relevant for the theory and not for our practical algorithm.
For s > n, the s-Riesz family of mollifiers is defined as ϕ^s_ϵ(x) := (‖x‖_2² + ϵ²)^{-s/2} / Z^s_ϵ. The Gaussian family of mollifiers is defined as ϕ^g_ϵ(x) := exp(−‖x‖_2² / (2ϵ²)) / Z^g_ϵ. The Laplace family of mollifiers is defined as ϕ^l_ϵ(x) := exp(−‖x‖_2 / ϵ) / Z^l_ϵ. Since {ϕ^g_ϵ} and {ϕ^l_ϵ} correspond to the densities of centered Gaussian and Laplace random variables, they satisfy Definition 3.1. Proposition A.1 proves that {ϕ^s_ϵ} also satisfies Definition 3.1.

3.2 MOLLIFIED INTERACTION ENERGIES

Definition 3.2 (mollified interaction energy). Given a family of mollifiers {ϕϵ}ϵ>0 satisfying Definition 3.1, for each ϵ > 0, define a symmetric kernel Wϵ : R^n × R^n → [0, ∞] by
Wϵ(x, y) := ϕϵ(x − y) (p(x) p(y))^{-1/2} if x, y ∈ X, and Wϵ(x, y) := +∞ otherwise. (1)
Define the mollified interaction energy (MIE), Eϵ : P(R^n) → [0, ∞], by
Eϵ(µ) := ∫∫ Wϵ(x, y) dµ(x) dµ(y). (2)

Intuitively, minimizing Eϵ(µ) balances two forces: a repulsive force from ϕϵ(x − y) makes sure µ has a good spread, and an attractive force from (p(x)p(y))^{-1/2} helps µ concentrate on high-density regions. The exponent −1/2 in (p(x)p(y))^{-1/2} is chosen to balance the two forces so that, as we show in Theorem 3.3, Eϵ(µ) approximates the χ² divergence with respect to µ* for small ϵ.

3.3 CONVERGENCE TO χ²-DIVERGENCE

Recall that the χ²-divergence between probability measures P, Q is defined by
χ²(Q ‖ P) := ∫ (dQ/dP − 1)² dP if Q ≪ P, and +∞ otherwise, (3)
where Q ≪ P denotes absolute continuity of Q with respect to P with density dQ/dP.

Theorem 3.3. Suppose µ ∈ P(R^n) satisfies χ²(µ ‖ µ*) < ∞. Then Eϵ(µ) < ∞ for any ϵ > 0. Furthermore, lim_{ϵ→0} Eϵ(µ) = χ²(µ ‖ µ*) + 1.

We provide a succinct proof of Theorem 3.3 in Appendix A.2 using the theory of mollifiers developed in Appendix A.1. In Remark A.10 we discuss an extension of Theorem 3.3 to cases where X is not full-dimensional; in particular, Theorem 3.3 still holds when X is "flat", i.e., has Hausdorff dimension d < n and is contained in a d-dimensional linear subspace.

3.4 CONVEXITY AND Γ-CONVERGENCE

We next study properties of the minimizers of MIEs: does min_{µ ∈ P(X)} Eϵ(µ) admit a unique minimum? If so, do minima of {Eϵ}ϵ>0 converge to µ* as ϵ → 0? To answer these questions affirmatively, we will need the associated kernel kϵ(x, y) := ϕϵ(x − y) to satisfy the following property.

Definition 3.4 (i.s.p.d. kernel). A symmetric lower semicontinuous (l.s.c.) kernel K on X × X is integrally strictly positive definite (i.s.p.d.) if for every finite signed Borel measure ν on X, the energy E_K(ν) := ∫∫ K(x, y) dν(x) dν(y) is well-defined (i.e., the integrals over the negative and positive parts of ν are not both infinite) and E_K(ν) ≥ 0, where equality holds only if ν = 0 on all Borel sets of X.

For the mollifiers we consider, the associated kernel kϵ(x, y) = ϕϵ(x − y) is i.s.p.d. on any compact set (Pronzato & Zhigljavsky (2021, Example 1.2)).

Proposition 3.5. Suppose X is compact and kϵ(x, y) := ϕϵ(x − y) is i.s.p.d. on X. Then, (a) the kernel Wϵ(x, y) defined in (1) is also i.s.p.d. on X; (b) the functional Eϵ is strictly convex on M_sign(X) and attains a unique minimum on P(X).

We next show the convergence of the minima of {Eϵ} to µ* as a consequence of Γ-convergence of the sequence {Eϵ}.

Theorem 3.6. Suppose kϵ(x, y) := ϕϵ(x − y) is i.s.p.d. on compact sets for every ϵ > 0. Then we have the Γ-convergence (Definition A.13) Eϵ →_Γ χ²(· ‖ µ*) + 1 as ϵ → 0. In particular, if X is compact and we denote µ*_ϵ := arg min_{µ ∈ P(X)} Eϵ(µ), then µ*_ϵ → µ* weakly and lim_{ϵ→0} Eϵ(µ*_ϵ) = 1.

We prove Theorem 3.6 in Appendix A.3 using Fourier transforms and Bochner's theorem.
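To make Definition 3.2 and Theorem 3.3 concrete, the short numerical sketch below (illustrative only, not part of the paper's released code) estimates Eϵ(µ) with a U-statistic over i.i.d. samples from µ and compares it against χ²(µ ‖ µ*) + 1 for a pair of one-dimensional Gaussians, for which the χ² divergence has a closed form: for µ = N(0, τ²) and µ* = N(0, 1) with τ² < 2, χ²(µ ‖ µ*) + 1 = 1/√(τ²(2 − τ²)). The Gaussian mollifier is used here because its normalizing constant is explicit.

```python
import numpy as np

def gaussian_mollifier(r, eps):
    # 1D Gaussian mollifier: phi_eps(r) = exp(-r^2 / (2 eps^2)) / (sqrt(2 pi) eps).
    return np.exp(-r ** 2 / (2 * eps ** 2)) / (np.sqrt(2 * np.pi) * eps)

def mie_estimate(x, log_p, eps):
    # U-statistic estimate of E_eps(mu) from i.i.d. samples x ~ mu:
    # average over i != j of phi_eps(x_i - x_j) (p(x_i) p(x_j))^(-1/2).
    lp = log_p(x)
    w = gaussian_mollifier(x[:, None] - x[None, :], eps) * np.exp(-0.5 * (lp[:, None] + lp[None, :]))
    np.fill_diagonal(w, 0.0)
    n = len(x)
    return w.sum() / (n * (n - 1))

rng = np.random.default_rng(0)
tau = 0.8                                                   # candidate measure mu = N(0, tau^2)
x = rng.normal(0.0, tau, size=2000)
log_p = lambda t: -0.5 * t ** 2 - 0.5 * np.log(2 * np.pi)   # normalized target density, mu* = N(0, 1)

reference = 1.0 / np.sqrt(tau ** 2 * (2 - tau ** 2))        # chi^2(mu || mu*) + 1, approx. 1.072
for eps in [0.5, 0.2, 0.1, 0.05]:
    print(eps, mie_estimate(x, log_p, eps), reference)
```

As ϵ shrinks, the estimates should stabilize near the closed-form reference (up to Monte Carlo error), illustrating the limit in Theorem 3.3.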
Theorem 3.6 provides a basis for minimizing Eϵ with a small ϵ, since its unique minimizer is then a good approximation of µ*.

3.5 DIFFERENTIAL CALCULUS OF Eϵ IN P2(R^n)

We next study the gradient flow of Eϵ in the Wasserstein space P_2(R^n) for X = R^n. Understanding the gradient flow of a functional often provides insights into the convergence of algorithms that simulate the gradient flow with time discretization (Ambrosio et al., 2005, Chapter 11) or spatial discretization (Chizat, 2022). Proofs for this section are given in Appendix A.4. Let F : P_2(R^n) → R be a functional. The Wasserstein gradient flow of F (Ambrosio et al., 2005, Definition 11.1.1) is defined as the solution {µt}_{t≥0} of the PDE ∂µt/∂t = ∇·(µt w_{F,µt}), where w_{F,µ} ∈ L²(µ; R^n) is a Fréchet subdifferential of F at µ. Intuitively, gradient flows capture the evolution of the variable being optimized if we were to run gradient descent¹ on F with step sizes going to 0. We next show that the gradient flow of Eϵ agrees with that of χ²(· ‖ µ*) in the sense that their subdifferentials coincide as ϵ → 0.

Proposition 3.7. Assume µ ∈ P_2(R^n) has density q ∈ C^1_c(R^n)². Then any strong Fréchet subdifferential (Definition A.20) of Eϵ at µ takes the form
wϵ,µ(x) = 2∇( p(x)^{-1/2} (ϕϵ ∗ (q/√p))(x) ), for µ-a.e. x ∈ R^n. (4)
Moreover, if for sufficiently small ϵ, ϕϵ has compact support independent of ϵ, then
lim_{ϵ→0} wϵ,µ(x) = wχ²,µ(x), for µ-a.e. x ∈ R^n, (5)
where wχ²,µ is a strong Fréchet subdifferential of χ²(· ‖ µ*).

While simulating the χ² divergence gradient flow is often intractable (Trillos & Sanz-Alonso, 2020, Section 3.3), our MIE admits a straightforward practical algorithm (Section 4).

¹ The more precise notion is that of minimizing movements (Ambrosio et al., 2005, Chapter 2).
² We assume that µ has compact support because Wϵ can be unbounded and cause integrability issues due to the presence of the (p(x)p(y))^{-1/2} term.

We next show that Eϵ is displacement convex at µ* ∈ P_2(R^n) as ϵ → 0, obtaining a result similar to Korba et al. (2021, Corollary 4). This hints that a gradient flow initialized near µ* will converge quickly.

Proposition 3.8. Suppose p ∈ C^2_c(R^n). Suppose that for sufficiently small ϵ, ϕϵ has compact support independent of ϵ. Assume kϵ(x, y) := ϕϵ(x − y) is i.s.p.d. Let ξ ∈ C^∞_c(R^n; R^n) and µt := (I + tξ)_#µ*. Then lim_{ϵ→0} (d²/dt²)|_{t=0} Eϵ(µt) ≥ 0.

4 A PRACTICAL SAMPLING ALGORITHM

We now present a first-order particle-based algorithm for constrained and unconstrained sampling by minimizing a discrete version of (2). Substituting an empirical distribution for µ in (2) gives the discrete mollified interaction energy for N particles ωN = {x_1, ..., x_N}:
Eϵ(ωN) := (1/N²) Σ_{i=1}^N Σ_{j=1}^N ϕϵ(x_i − x_j) (p(x_i) p(x_j))^{-1/2}. (6)
Denote ω*_{N,ϵ} ∈ arg min_{ωN ⊂ X} Eϵ(ωN) and µ*_ϵ = arg min_{µ ∈ P_2(X)} Eϵ(µ). If X is compact, then by Borodachov et al. (2019, Corollary 4.2.9) we have weak convergence δ_{ω*_{N,ϵ}} → µ*_ϵ as N → ∞. If in addition kϵ(x, y) := ϕϵ(x − y) is i.s.p.d., then by Theorem 3.6 we have weak convergence µ*_ϵ → µ* as ϵ → 0. This shows that minimizing (6) with a large N and a small ϵ results in an empirical distribution of particles that approximates µ*. Our sampling method, mollified interaction energy descent (MIED), simply runs gradient descent on (6) over the particles (Algorithm 1), with an update that resembles the one in SVGD (see the discussion in Appendix C.1).
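As a minimal sketch of this basic scheme, the following code (assuming PyTorch for automatic differentiation; function names are illustrative and the released MIED implementation may be organized differently) runs Adam on the discrete energy (6) with a Gaussian mollifier for an unconstrained target. The numerical refinements discussed next, namely optimizing in the logarithmic domain and the modified diagonal term, are omitted here.

```python
import torch

def discrete_mie(x, log_p, eps):
    # Discrete MIE (6) with a Gaussian mollifier, up to constant factors that
    # do not affect the minimizers:
    #   E_eps(omega_N) = (1/N^2) sum_{i,j} phi_eps(x_i - x_j) (p(x_i) p(x_j))^(-1/2).
    n = x.shape[0]
    sq_dists = ((x[:, None, :] - x[None, :, :]) ** 2).sum(-1)   # pairwise squared distances
    log_phi = -sq_dists / (2 * eps ** 2)                        # log Gaussian mollifier (unnormalized)
    lp = log_p(x)                                               # unnormalized log-density at particles
    return torch.exp(log_phi - 0.5 * (lp[:, None] + lp[None, :])).sum() / n ** 2

def mied_basic(log_p, x0, eps=0.1, n_steps=1000, lr=1e-2):
    # Gradient descent (Adam) on (6) over the particle positions.
    x = x0.detach().clone().requires_grad_(True)
    opt = torch.optim.Adam([x], lr=lr)
    for _ in range(n_steps):
        opt.zero_grad()
        discrete_mie(x, log_p, eps).backward()
        opt.step()
    return x.detach()

# Example: 300 particles targeting a 2D standard Gaussian (unnormalized log-density).
log_p = lambda x: -0.5 * (x ** 2).sum(dim=1)
samples = mied_basic(log_p, torch.randn(300, 2))
```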
Below we address a few practical concerns.

Optimization in the logarithmic domain. In practical applications, we only have access to the unnormalized target density p. The normalizing constant of the mollifier ϕϵ can also be hard to compute (e.g., for the Riesz family). While these normalization constants do not affect the minima of (6), they can still affect gradient step sizes during optimization. Moreover, ϕϵ can be very large when ϵ is small, and in many Bayesian applications p can be tiny, so that only log p is numerically significant. To address these issues, we optimize the logarithm of (6) using the log-sum-exp trick (Blanchard et al., 2021), which improves numerical stability and removes the arbitrary scaling of the normalizing constants:
log Eϵ(ωN) = log Σ_{i=1}^N Σ_{j=1}^N exp( log ϕϵ(x_i − x_j) − (1/2)(log p(x_i) + log p(x_j)) ) − 2 log N. (7)

Special treatment of the diagonal terms. Since ϕϵ tends to the Dirac delta as ϵ → 0, the discretization of (2) on a neighborhood of the diagonal {(x, y) : x = y} needs to be handled with extra care. In (6) the diagonal contributes Σ_{i=1}^N ϕϵ(0) p(x_i)^{-1}, which can dominate the summation when ϵ is small and ϕϵ(0) becomes too large. For the mollifiers that we consider, we use a different diagonal term Σ_{i=1}^N ϕϵ(h_i/κ_n) p(x_i)^{-1}, where h_i := min_{j≠i} ‖x_i − x_j‖ and κ_n ≥ 1 is a constant depending only on the dimension n. Since 0 ≤ ϕϵ(h_i/κ_n) ≤ ϕϵ(0), the resulting energy is bounded between (6) and the version of (6) without the diagonal terms. Hence, by the proof of Theorem 4.2.2 of Borodachov et al. (2019), the discrete minimizers of Eϵ still converge to µ*_ϵ as N → ∞ and ϵ → 0. Empirically we found that the choice κ_n = (1.3n)^{1/n} works well for the Riesz family of mollifiers, and we use the Riesz family primarily in our experiments.

Constraint handling. If X ≠ R^n, we need to minimize (7) subject to the constraint ωN = {x_1, ..., x_N} ⊂ X. Since the constraint is the same for each particle x_i, we want our algorithm to remain parallelizable across particles. We consider two types of constraints: (a) there exists a differentiable map f : R^n → X with L^n(X \ f(R^n)) = 0; (b) the set X is given by {x ∈ R^n : g(x) ≤ 0} for a differentiable g : R^n → R. For (a), we reduce the problem to unconstrained optimization in R^n using the objective log Eϵ(f(ωN)), where ωN = {x_1, ..., x_N} ⊂ R^n and f(ωN) := {f(x_1), ..., f(x_N)}. For (b), we apply the dynamic barrier method of Gong & Liu (2021) to the particles in parallel; see Appendix C.2 for details and an extension to handling multiple constraints. A compact sketch combining these ingredients appears below.
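Here is a sketch of the practical objective: the log-domain energy (7) with an unnormalized s-Riesz mollifier, the modified diagonal term ϕϵ(h_i/κ_n), and constraint handling via a differentiable reparameterization map (tanh onto a box). This is an illustrative reconstruction under the stated assumptions, not the released implementation; the dynamic barrier variant for inequality constraints is described in Appendix C.2 and omitted here, and the initialization and hyperparameters below are only examples.

```python
import math
import torch

def log_mie(x, log_p, s, eps, kappa_n):
    # log E_eps(omega_N) as in (7), with the unnormalized s-Riesz mollifier
    # phi_eps(r) proportional to (||r||^2 + eps^2)^(-s/2), and the modified diagonal:
    # the zero self-distance is replaced by h_i / kappa_n, h_i = min_{j != i} ||x_i - x_j||.
    n = x.shape[0]
    sq = ((x[:, None, :] - x[None, :, :]) ** 2).sum(-1)          # pairwise squared distances
    h_sq = (sq + torch.diag(torch.full((n,), float("inf")))).min(dim=1).values
    sq = sq + torch.diag(h_sq / kappa_n ** 2)                    # diagonal becomes (h_i / kappa_n)^2
    log_phi = -0.5 * s * torch.log(sq + eps ** 2)
    lp = log_p(x)
    exponents = log_phi - 0.5 * (lp[:, None] + lp[None, :])
    return torch.logsumexp(exponents.reshape(-1), dim=0) - 2 * math.log(n)

# Constrained example: uniform density on the box [-1, 1]^2 via the map z -> tanh(z).
d = 2
log_p = lambda x: torch.zeros(x.shape[0])                        # uniform (unnormalized) on the box
z = (0.5 * torch.randn(500, d)).requires_grad_(True)             # unconstrained particle parameters
opt = torch.optim.Adam([z], lr=1e-2)
for _ in range(2000):
    opt.zero_grad()
    log_mie(torch.tanh(z), log_p, s=d + 1e-4, eps=1e-8, kappa_n=(1.3 * d) ** (1 / d)).backward()
    opt.step()
samples = torch.tanh(z).detach()                                 # approximately uniform in the box
```

Because log p and log ϕϵ only enter through the log-sum-exp, rescaling p or the mollifier by a constant shifts the objective by a constant and leaves the gradients unchanged, which is the point of optimizing in the logarithmic domain.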
5 EXPERIMENTS

We compare MIED with recent alternatives on unconstrained and constrained sampling problems. Unless mentioned otherwise, we choose the s-Riesz family of mollifiers {ϕ^s_ϵ} with s = n + 10^{-4} and ϵ = 10^{-8}: we found that minimizing the MIE with such mollifiers results in well-separated particles, so we can take ϵ to be very small, as our theory recommends. This is not the case for the Gaussian or Laplace families, for which setting ϵ too small can cause numerical issues even when the particles are well-separated. In Appendix D.5, we compare different choices of s on a constrained mixture distribution. Unless mentioned otherwise: for SVGD (Liu & Wang, 2016), we use the adaptive Gaussian kernel as in the original implementation (adaptive kernels can be prone to collapsing samples; see Appendix D.2); for KSDD (Korba et al., 2021), we use a fixed Gaussian kernel with unit variance. All methods by default use a learning rate of 0.01 with the Adam optimizer (Kingma & Ba, 2014). The source code can be found at https://github.com/lingxiaoli94/MIED.

5.1 UNCONSTRAINED SAMPLING

Gaussians in varying dimensions. We first compare MIED with SVGD and KSDD on Gaussians of varying dimensions and with different numbers of particles. In Figure 1, we see that as the number of dimensions increases, MIED produces the best samples in terms of W2 distance (Wasserstein-2 distance computed using linear programming) with respect to 10^4 i.i.d. samples, while SVGD yields a lower energy distance (Székely & Rizzo, 2013). We think this is because MIED results in more evenly spaced samples (Figure 5), so its W2 distance is lower, but it is biased since ϵ > 0, so SVGD performs better on the energy distance. More details can be found in Appendix D.1.

Figure 1: Gaussian experiment results in varying dimensions (panels for dimensions 2, 4, and 8; W2 distance and energy distance versus the number of particles). For each method, we plot the metric versus the number of particles, averaged over 10 trials (the shaded region indicates the standard deviation).

Product of two Student's t-distributions. Next, we consider a 2-dimensional distribution constructed as the product of two independent t-distributions with ν = 2 degrees of freedom, composed with a linear transform. Unlike Gaussians, Student's t-distributions have heavy tails. On the left of Figure 2, we visualize the results of each method with 1000 samples after 2000 iterations. MIED captures the heavy tails while SVGD fails to. Quantitatively, on the right of Figure 2, while SVGD captures the distribution in [−a, a]² better for a ≤ 3, MIED yields lower metrics for larger a.

Figure 2: Left: visualization of samples from each method for the product of two Student's t-distributions (composed with a linear transform). Right: metrics of each method (W2 distance and energy distance, log scale, versus the edge length a) when restricting to [−a, a]². As t-distributions have heavy tails, if we draw i.i.d. samples from the true product distribution, a small number of them will have large norms, making the computation of the metrics unstable. Thus we restrict both the i.i.d. samples and the resulting samples from each method to [−a, a]² before computing the metrics, for a ∈ [2, 10].

Bayesian logistic regression. We compare MIED with SVGD and KSDD in the Bayesian logistic regression setting of Liu & Wang (2016). We further include a naïve baseline, independent particle descent (IPD), which simply runs gradient descent independently on each particle to maximize the posterior probability. In addition to using test accuracy as a metric, we include the W2 distance and the energy distance between the samples from each method and 10^4 samples from NUTS (Hoffman et al., 2014) after sufficient burn-in steps. We summarize the results in Table 1. All methods, including the naïve baseline IPD, are comparable in terms of test accuracy. In other words, accuracy is not a good metric for comparing the quality of posterior samples. Bayesian inference is typically preferred over maximum a posteriori estimation for its ability to capture uncertainty.
When evaluating the quality of the uncertainty of the samples using distances between distributions, MIED provides the best approximation in terms of the W2 distance, and all methods (except IPD) are comparable in terms of the energy distance.

| Dataset | NUTS | IPD | SVGD | KSDD | MIED |
|---|---|---|---|---|---|
| banana (d = 3) | 0.55 | -6.14/-3.58/0.55 | -7.81/-5.24/0.55 | -8.24/-5.76/0.55 | -7.37/-5.06/0.55 |
| breast cancer (d = 10) | 0.64 | -1.51/-1.03/0.60 | -1.62/-2.06/0.60 | -1.71/-2.23/0.60 | -1.99/-2.18/0.60 |
| diabetis (d = 9) | 0.78 | -2.18/-1.55/0.77 | -3.09/-3.42/0.77 | -2.91/-3.90/0.77 | -3.11/-3.13/0.77 |
| flare solar (d = 10) | 0.59 | 3.30/2.65/0.48 | 6.91/4.09/0.52 | 1.77/-0.08/0.55 | 7.09/4.25/0.48 |
| german (d = 21) | 0.65 | -1.80/-1.25/0.65 | -1.89/-2.63/0.64 | -1.27/-2.83/0.65 | -1.96/-2.80/0.65 |
| heart (d = 14) | 0.87 | -0.40/-0.56/0.87 | -0.41/-1.50/0.87 | -0.10/-1.76/0.87 | -0.92/-1.67/0.87 |
| image (d = 19) | 0.82 | 6.53/4.31/0.83 | 7.17/4.01/0.83 | 2.16/-0.50/0.82 | 1.14/-1.88/0.82 |
| ringnorm (d = 21) | 0.77 | -3.82/-2.45/0.77 | -4.11/-5.98/0.77 | 1.07/-2.21/0.76 | -4.03/-5.70/0.77 |
| splice (d = 61) | 0.85 | -1.47/-1.18/0.85 | -1.22/-2.65/0.85 | 2.04/-0.05/0.84 | 1.45/0.70/0.82 |
| thyroid (d = 6) | 0.84 | 1.95/0.53/0.84 | 1.17/-0.00/0.84 | 2.42/1.57/0.74 | 0.84/-0.37/0.84 |
| titanic (d = 4) | 0.40 | -1.59/-0.16/0.40 | -0.46/-0.31/0.40 | -0.63/-0.39/0.40 | -1.00/-0.45/0.40 |
| twonorm (d = 21) | 0.97 | -1.21/-1.13/0.97 | -1.32/-2.78/0.97 | 1.55/-0.62/0.97 | -1.44/-3.21/0.97 |
| waveform (d = 22) | 0.77 | -2.67/-1.87/0.78 | -2.98/-5.23/0.78 | -2.60/-4.18/0.77 | -3.09/-3.17/0.78 |

Table 1: Bayesian logistic regression results with 1000 particles. The second column gives the test accuracy of NUTS. The three numbers A/B/C in the remaining columns are the logarithmic W2 distance, the logarithmic energy distance, and the test accuracy. We use an 80%/20% training/test split. All methods are run with identical initialization and learning rate 0.01. Results are reported after 10^4 iterations.

5.2 CONSTRAINED SAMPLING

Uniform sampling in 2D. We consider uniform sampling in the square [−1, 1]² with 500 particles. We reparameterize our particles using tanh to eliminate the constraint and show results for various choices of mollifiers; we always choose the smallest ϵ for which the optimization remains stable. We compare our method with mirror LMC (Zhang et al., 2020) and SVMD/MSVGD by Shi et al. (2021) using the entropic mirror map ϕ(θ) = Σ_{i=1}^n ((1 + θ_i) log(1 + θ_i) + (1 − θ_i) log(1 − θ_i)). To demonstrate that SVGD and KSDD break down in constrained domains, we implement these two methods adapted to the constrained setting using the same reparameterization as our method. The initial particles are drawn uniformly from [−0.5, 0.5]². The left plot of Figure 3 shows that quantitatively our method achieves much lower energy distance and W2 distance (compared to 5000 i.i.d. samples drawn uniformly in [−1, 1]²).

Figure 3: Convergence of metrics and visualization of samples for uniform sampling from a 2D box; methods shown are SVGD (adaptive and fixed kernels), KSDD, SVMD, MSVGD, mirror LMC, and MIED with the Riesz, Gaussian, and Laplace mollifiers, as well as MIED (Riesz) with the mirror map.

On the right of Figure 3, we visualize the samples from each method. We see that samples from SVGD with an adaptive kernel collapse; we investigate this issue in Appendix D.2. Samples from SVGD with a fixed kernel size and from KSDD are non-uniform (we choose kernel sizes that produce the best results). While SVMD creates locally clustered artifacts, MSVGD produces good results.
For mirror LMC, the resulting samples are not evenly spaced, resulting in worse W2 distances. When using tanh as the reparameterization map, MIED produces decent results for all three families of mollifiers (Riesz, Gaussian, Laplace). When using the same entropic mirror map, MIED also produces good results. This highlights the flexibility of our method in constraint handling, whereas for LMC and SVMD/MSVGD the mirror map has to be chosen carefully to be the gradient of a strongly convex function while capturing the constraints. In Appendix D.3, we further test SVMD/MSVGD with a different choice of mirror map, for which they break down. In Appendix D.4, we conduct a similar comparison for sampling from a 20-dimensional Dirichlet distribution using the same setup as Shi et al. (2021). In scenarios where a good choice of mirror map is available, SVMD/MSVGD can obtain better performance than MIED. In Appendix D.6, we conduct additional qualitative experiments for MIED, demonstrating its effectiveness for challenging constraints and multi-modal distributions.

Fairness Bayesian neural networks. We train fair Bayesian neural networks to predict whether the annual income of a person is at least $50,000, with gender as the protected attribute, using the Adult Income dataset (Kohavi et al., 1996). We follow the same setup as Liu et al. (2021), where the dataset D = {x^(i), y^(i), z^(i)}_{i=1}^{|D|} consists of feature vectors x^(i), labels y^(i) (whether the income is at least $50,000), and genders z^(i) (the protected attribute). The target density is taken to be the posterior of logistic regression with a two-layer Bayesian neural network ŷ(·; θ) with weights θ, and we put a standard Gaussian prior on each entry of θ independently. Given t > 0, the fairness constraint is g(θ) = (cov_{(x,y,z)∼D}[z, ŷ(x; θ)])² − t ≤ 0. On the left of Figure 4, we plot the trade-off curves of the results obtained using our method and the methods from Liu et al. (2021) for t ∈ {10^{-5}, 10^{-4}, 0.001, 0.002, 0.005, 0.01}. Details can be found in Appendix D.7. Our method recovers a much larger Pareto front compared to the alternatives. On the right of Figure 4, we visualize the curves of the energy and the covariance versus the number of training iterations: as expected, a smaller t results in a larger MIE (lower log-likelihood) and a smaller covariance between the prediction and the protected attribute.

Figure 4: Left: trade-off curves of demographic parity versus accuracy on the test data for MIED and the methods from Liu et al. (2021) (Control+Langevin, Primal-Dual+Langevin, Primal-Dual+SVGD). Right: MIEs and (cov_{(x,y,z)∼D}[z, ŷ(x; θ)])² (measured on the training data) versus the number of training iterations, for various t.

6 CONCLUSION

We present a new sampling method that minimizes MIEs, discretized using particles, for unconstrained and constrained sampling. It is motivated by the insight that MIEs converge to the χ² divergence with respect to the target measure as the mollifier parameter goes to 0. The proposed method achieves promising results on the sampling problems we consider. Below we highlight three limitations. First, as discussed in Remark A.10, our theory only applies when the domain is full-dimensional or flat.
Extending our theory to handle cases where the domain is an arbitrary d-rectifiable set is an important next step, as it would allow handling more complicated constraints such as nonlinear equality constraints. Secondly, when ϵ > 0, the minimizer of the MIE can be different from the target measure. Finding a way to debias the MIE (e.g., as Sinkhorn distances are debiased (Feydy et al., 2019)) is an interesting direction. Lastly, the connection between the minimizers of the discretized MIE (6) and those of the continuous MIE (2) is only established in the limit N → ∞. We hope to investigate how well the empirical distribution of particles minimizing (6) approximates the target measure when N is finite, as in Xu et al. (2022).

Reproducibility statement. We provide self-contained proofs in Appendix A for the theoretical results stated in the main text. The source code for all experiments can be found at https://github.com/lingxiaoli94/MIED.

Acknowledgements. Qiang Liu would like to acknowledge the support of NSF CAREER 1846421 and ONR. The MIT Geometric Data Processing group acknowledges the generous support of Army Research Office grants W911NF2010168 and W911NF2110293, of Air Force Office of Scientific Research award FA9550-19-1-031, of National Science Foundation grants IIS-1838071 and CHS-1955697, from the CSAIL Systems that Learn program, from the MIT-IBM Watson AI Laboratory, from the Toyota-CSAIL Joint Research Center, from a gift from Adobe Systems, and from a Google Research Scholar award.

REFERENCES

Kwangjun Ahn and Sinho Chewi. Efficient constrained sampling via the mirror-Langevin algorithm. Advances in Neural Information Processing Systems, 34:28405-28418, 2021. Luigi Ambrosio, Nicola Gigli, and Giuseppe Savaré. Gradient flows: in metric spaces and in the space of probability measures. Springer Science & Business Media, 2005. Krishna Balasubramanian, Sinho Chewi, Murat A. Erdogdu, Adil Salim, and Shunshi Zhang. Towards a theory of non-log-concave sampling: first-order stationarity guarantees for Langevin Monte Carlo. In Conference on Learning Theory, pp. 2896-2923. PMLR, 2022. Pierre Blanchard, Desmond J. Higham, and Nicholas J. Higham. Accurately computing the log-sum-exp and softmax functions. IMA Journal of Numerical Analysis, 41(4):2311-2330, 2021. David M. Blei, Alp Kucukelbir, and Jon D. McAuliffe. Variational inference: A review for statisticians. Journal of the American Statistical Association, 112(518):859-877, 2017. Sergiy V. Borodachov, Douglas P. Hardin, and Edward B. Saff. Discrete energy on rectifiable sets. Springer, 2019. Steve Brooks, Andrew Gelman, Galin Jones, and Xiao-Li Meng. Handbook of Markov chain Monte Carlo. CRC Press, 2011. Marcus Brubaker, Mathieu Salzmann, and Raquel Urtasun. A family of MCMC methods on implicitly defined manifolds. In Artificial Intelligence and Statistics, pp. 161-172. PMLR, 2012. Simon Byrne and Mark Girolami. Geodesic Monte Carlo on embedded manifolds. Scandinavian Journal of Statistics, 40(4):825-845, 2013. José Antonio Carrillo, Young-Pil Choi, and Maxime Hauray. The derivation of swarming models: mean-field limit and Wasserstein distances. In Collective Dynamics from Bacteria to Crowds, pp. 1-46. Springer, 2014. Sinho Chewi, Thibaut Le Gouic, Chen Lu, Tyler Maunu, and Philippe Rigollet. SVGD as a kernelized Wasserstein gradient flow of the chi-squared divergence. Advances in Neural Information Processing Systems, 33:2098-2109, 2020. Lénaïc Chizat.
Sparse optimization on measures with over-parameterized gradient descent. Mathematical Programming, 194(1):487-532, 2022. Katy Craig, Karthik Elamvazhuthi, Matt Haberland, and Olga Turanova. A blob method for inhomogeneous diffusion with applications to multi-agent control and sampling. arXiv preprint arXiv:2202.12927, 2022. Alain Durmus, Szymon Majewski, and Błażej Miasojedow. Analysis of Langevin Monte Carlo via convex optimization. The Journal of Machine Learning Research, 20(1):2666-2711, 2019. Jean Feydy, Thibault Séjourné, François-Xavier Vialard, Shun-ichi Amari, Alain Trouvé, and Gabriel Peyré. Interpolating between optimal transport and MMD using Sinkhorn divergences. In The 22nd International Conference on Artificial Intelligence and Statistics, pp. 2681-2690. PMLR, 2019. Kurt Otto Friedrichs. The identity of weak and strong extensions of differential operators. Transactions of the American Mathematical Society, 55(1):132-151, 1944. Chengyue Gong and Xingchao Liu. Bi-objective trade-off with dynamic barrier gradient descent. NeurIPS 2021, 2021. Jackson Gorham and Lester Mackey. Measuring sample quality with kernels. In International Conference on Machine Learning, pp. 1292-1301. PMLR, 2017. Matthew D. Hoffman, Andrew Gelman, et al. The No-U-Turn sampler: adaptively setting path lengths in Hamiltonian Monte Carlo. J. Mach. Learn. Res., 15(1):1593-1623, 2014. Lars Hörmander. The analysis of linear partial differential operators I: Distribution theory and Fourier analysis. Springer, 2015. V. Roshan Joseph, Dianpeng Wang, Li Gu, Shiji Lyu, and Rui Tuo. Deterministic sampling of expensive posteriors using minimum energy designs. Technometrics, 2019. Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014. Achim Klenke. Probability theory: a comprehensive course. Springer Science & Business Media, 2013. Ron Kohavi et al. Scaling up the accuracy of naive-Bayes classifiers: A decision-tree hybrid. In KDD, volume 96, pp. 202-207, 1996. Anna Korba, Adil Salim, Michael Arbel, Giulia Luise, and Arthur Gretton. A non-asymptotic analysis for Stein variational gradient descent. Advances in Neural Information Processing Systems, 33:4672-4682, 2020. Anna Korba, Pierre-Cyril Aubin-Frankowski, Szymon Majewski, and Pierre Ablin. Kernel Stein discrepancy descent. In International Conference on Machine Learning, pp. 5719-5730. PMLR, 2021. Franziska Kühn. Probability and heat kernel estimates for Lévy(-type) processes. 2016. Chang Liu and Jun Zhu. Riemannian Stein variational gradient descent for Bayesian inference. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 32, 2018. Qiang Liu. Stein variational gradient descent as gradient flow. Advances in Neural Information Processing Systems, 30, 2017. Qiang Liu and Dilin Wang. Stein variational gradient descent: A general purpose Bayesian inference algorithm. Advances in Neural Information Processing Systems, 29, 2016. Xingchao Liu, Xin Tong, and Qiang Liu. Sampling with trusthworthy constraints: A variational gradient framework. Advances in Neural Information Processing Systems, 34:23557-23568, 2021. Luc Pronzato and Anatoly Zhigljavsky. Minimum-energy measures for singular kernels. Journal of Computational and Applied Mathematics, 382:113089, 2021. Adil Salim, Lukang Sun, and Peter Richtarik. A convergence theory for SVGD in the population limit under Talagrand's inequality T1.
In International Conference on Machine Learning, pp. 19139 19152. PMLR, 2022. Sylvia Serfaty. Mean field limit for coulomb-type flows. Duke Mathematical Journal, 169(15): 2887 2935, 2020. Jiaxin Shi, Chang Liu, and Lester Mackey. Sampling with mirrored stein operators. ar Xiv preprint ar Xiv:2106.12506, 2021. G abor J Sz ekely and Maria L Rizzo. Energy statistics: A class of statistics based on distances. Journal of statistical planning and inference, 143(8):1249 1272, 2013. Published as a conference paper at ICLR 2023 Ryan J Tibshirani. Dykstra s algorithm, admm, and coordinate descent: Connections, insights, and extensions. Advances in Neural Information Processing Systems, 30, 2017. Nicolas Garcia Trillos and Daniel Sanz-Alonso. The bayesian update: variational formulations and gradient flows. Bayesian Analysis, 15(1):29 56, 2020. SR Srinivasa Varadhan. Probability theory. Number 7. American Mathematical Soc., 2001. Andre Wibisono. Sampling as optimization in the space of measures: The langevin dynamics as a composite optimization problem. In Conference on Learning Theory, pp. 2093 3027. PMLR, 2018. Lantian Xu, Anna Korba, and Dejan Slepcev. Accurate quantization of measures via interacting particle-based optimization. In International Conference on Machine Learning, pp. 24576 24595. PMLR, 2022. Wenkai Xu. Generalised kernel stein discrepancy (gksd): A unifying approach for non-parametric goodness-of-fit testing. ar Xiv preprint ar Xiv:2106.12105, 2021. Kelvin Shuangjian Zhang, Gabriel Peyr e, Jalal Fadili, and Marcelo Pereyra. Wasserstein control of mirror langevin monte carlo. In Conference on Learning Theory, pp. 3814 3841. PMLR, 2020. A DETAILED ANALYSIS A.1 PRELIMINARIES ON MOLLIFIERS Proposition A.1. For s > n, the s-Riesz family of mollifiers, defined as ϕs ϵ(x) := ( x 2 2 + ϵ2) s/2/Zs ϵ , satisfies Definition 3.1. In order to prove Proposition A.1, we first prove a lemma. Lemma A.2. Let b 0. Then for any ϵ > 0, Bϵ(0) ϕs ϵ(y) y b 2 dy = Hn 1(Sn 1) Z 1 (t2 + 1)s/2 ϵn+b s, (8) where Hn 1(Sn 1) is the volume of the (n 1)-dimensional sphere. Furthermore, assuming s > b + n, then for any ϵ > 0, δ > 0, Rn\Bδ(0) ϕs ϵ(y) y b 2 dy Hn 1(Sn 1) δn+b s s (n + b). (9) Proof. For (8), we compute Bϵ(0) ϕs ϵ(y) y b 2 dy = Z y b 2 ( y 2 2 + ϵ2)s/2 = Hn 1(Sn 1) Z ϵ (r2 + ϵ2)s/2 dr = Hn 1(Sn 1) Z 1 (ϵ2t2 + ϵ2)s/2 ϵ dt = Hn 1(Sn 1) Z 1 (t2 + 1)s/2 dt ϵn+b s, where we use substitution r = ϵt on the second line. If s > b + n, then Rn\Bδ(0) ϕs ϵ(0, y) y b 2 dy = Z y b 2 ( y 2 2 + ϵ2)s/2 y b 2 y s 2 = Hn 1(Sn 1) Z = Hn 1(Sn 1) δn+b s Published as a conference paper at ICLR 2023 where the last integral uses s > b + n. Proof of Proposition A.1. It is clear that (a) of Proposition A.1 holds for ϕs ϵ(x). Since ϕs ϵ(x) is bounded we have ϕs ϵ L (Rn). For any δ > 0, we have, for ϵ δ, Bδ(0) ϕs ϵ(x) dx Zs ϵ Bϵ(0) ϕs ϵ(x) dx = Cϵn s, where we use (8) with b = 0 and C is a constant depending only on n, s. On the other hand, using (9) with b = 0, Rn\Bδ(0) ϕs ϵ(x) dx C δn s, where C depends only on n, s. With δ = ϵ, we see that Zs ϵ ϕs ϵ L1(Rn) so (b) is satisfied. Since δ is fixed and s > n, we see that Bδ(0) ϕs ϵ(x) dx R Rn\Bδ(0) ϕsϵ(x) dx lim ϵ 0 Cϵn s Bδ(0) ϕs ϵ(x) dx + R Rn\Bδ(0) ϕs ϵ(x) dx = 1, we have shown (c). In the rest of this section, we assume {ϕϵ}ϵ>0 is a family of mollifiers satisfying Definition 3.1. Lemma A.3. For any p [1, ], for any δ > 0, ϕϵ Lp(Rn) and limϵ 0 1Rn\Bδ(0)ϕϵ p = 0. holds. Proof. Assume p / {1, } since both cases are covered in the assumptions. 
Then for any δ > 0, by H older s inequality, ϕϵ p p = ϕϵ ϕp 1 ϵ 1 ϕϵ 1 ϕp 1 ϵ = ϕϵ 1 ϕϵ p 1 < . Similarly, 1Rn\Bδ(0)ϕϵ p p = 1Rn\Bδ(0)ϕp ϵ 1 = ϕϵ 1Rn\Bδ(0)ϕp 1 ϵ 1 ϕϵ 1 1Rn\Bδ(0)ϕp 1 ϵ = 1Rn\Bδ(0)ϕϵ p 1 . Letting ϵ 0 and applying (c) gives limϵ 0 1Rn\Bδ(0)ϕϵ p = 0. Proposition A.4. Let f Lp(Rn), p [1, ]. Then for every ϵ > 0, the integral Z f(x y)ϕϵ(y) dy exists, so that (f ϕϵ)(x) is finite. Moreover, if f is continous at x, then lim ϵ 0(f ϕϵ)(x) = f(x). (10) Proof. Observe that by H older s inequality, for q such that 1/p + 1/q = 1 (allowing infinity), Z |f(x y)ϕϵ(y)| dy f p ϕϵ q < . Hence (f ϕϵ)(x) is integrable and finite. For any ϵ > 0, note that |(f ϕϵ)(x) f(x)| = Z f(x y)ϕϵ(y) dy f(x) . Published as a conference paper at ICLR 2023 Since R ϕϵ(x) dx = 1, we have |(f ϕϵ)(x) f(x)| = Z (f(x y) f(x)) ϕϵ(y) dy Z |f(x y) f(x)|ϕϵ(y) dy. Fix t > 0. Continuity of f at x implies there exists δ > 0 such that |f(x y) f(x)| < t for all y Bδ(0). Then Z Bδ(0) |f(x y) f(x)|ϕϵ(y) dy t Z Bδ(0) ϕϵ(y) dy t. On the other hand, by H older s inequality, Z Rn\Bδ(x) |f(x y) f(x)|ϕϵ(y) dy τxf f p 1Rn\Bδ(0)ϕϵ q 2 f p 1Rn\Bδ(0)ϕϵ q. |(f ϕϵ)(x) f(x)| t + 2 f p 1Rn\Bδ(0)ϕϵ q. By Lemma A.3, since f p < , taking ϵ 0 we get, for any t > 0, lim sup ϵ 0 |(f ϕϵ)(x) f(x)| t. Now let t 0 we get (10). Corollary A.5. Let f Lp(Rn), p [1, ], be a continuous function. If {fϵ} is a sequence of functions that converge uniformly to f, then for every x, lim ϵ 0(fϵ ϕϵ)(x) = f(x). (11) Proof. Note that |(ϕϵ fϵ)(x) f(x)| |(ϕϵ fϵ)(x) (ϕϵ f)(x)| + |(ϕϵ f)(x) f(x)|. Proposition A.4 shows the second term goes to 0 as ϵ 0. For the first term, we have |(ϕϵ fϵ)(x) (ϕϵ f)(x)| = Z (fϵ(x y) f(x y)) ϕϵ(y) dy sup x |fϵ(x) f(x)| 0 by uniform convergence. Lemma A.6. For f Lp(Rn), p [1, ), we have lim y 0 τyf f p = 0, where we use τaf to denote the translated function τaf(x) := f(x a). Proof. Fix ϵ > 0. It is a standard fact that Cc(Rn) is dense in Lp(Rn). Hence there exists g Cc(Rn) such that f g p < ϵ. Since g is continuous with compact support, it is uniform continuous. Then there exists δ > 0 with |g(x y) g(x)| < ϵ1/p/Ln(K) if y Bδ(x). Hence for such y we have τyg g p p < ϵ with Ln(K) < . Thus τyf f p τyf τyg p + τyg g p + g f p 3ϵ. Proposition A.7. Let f Lp(Rn), p (1, ). Then for every ϵ > 0, f ϕϵ p 3 f p. In particular, f ϕϵ Lp(Rn). Moreover, lim ϵ 0 f ϕϵ f p = 0. Published as a conference paper at ICLR 2023 Proof. For every ϵ > 0, we have f ϕϵ f p p = Z Z (f(x y) f(x))ϕϵ(y) dy Z Z |f(x y) f(x)|pϕϵ(y) dy dx = Z Z |f(x y) f(x)|p dx ϕϵ(y) dy = Z τyf f p pϕϵ(y) dy, where we use Jensen s inequality with the observation that ϕϵ(y) dy is a probability measure and Tonelli s theorem to exchange the order of integration. To show the first claim, note that, f ϕϵ f p p Z τyf f p pϕϵ(y) dy Z (2 f p)pϕϵ(y) dy = (2 f p)p. Hence f ϕϵ p f p + f ϕϵ f p 3 f p. For the second claim, by Lemma A.6, the function y 7 τyf f p p is continuous at y = 0 with limit 0. Hence by Proposition A.4, we are done by taking ϵ 0. A.2 CONVERGENCE TO χ2-DIVERGENCE We start by giving an alternative formula for χ2 divergence with respect to the target measure. Lemma A.8. Define a functional E : P(Rn) [0, ] by p(x) dx if dµ(x) = q(x) d Ln(x) and µ µ otherwise. (12) Then for every µ P(Rn), E(µ) = χ2(µ µ ) + 1. Proof. If µ is not absolutely continuous with respect to µ , then both sides are infinity. Otherwise, µ(X) = 1 and µ has some density q. We then compute χ2(µ µ ) + 1 = Z dµ dµ (x) 1 2 dµ (x) + 1 = Z p(x) 1 2 p(x) dx + 1 p(x) dx 2 Z X q(x) dx + 2 = Z p(x) dx = E(µ). 
Proof of Theorem 3.3. Since χ2(µ µ ) < , we know µ µ , so µ(X) = 1. We let q(x) be the density of µ which satisfies q(x) = 0 for x Rn \ X. We compute using Tonelli s theorem (since our integrand is positive): Eϵ(µ) = ZZ ϕϵ(x y)(p(x)p(y)) 1/2q(x)q(y) dx dy (x) q p(x) dx = Z (ϕϵ g)(x)g(x) dx, (13) where we define g(x) := q(x)/ p p(x) on X and g(x) := 0 for x / X. Moreover, by Lemma A.8, χ2(µ µ ) + 1 = Z g(x)2 dx. Published as a conference paper at ICLR 2023 The assumption χ2(µ µ ) < then implies g L2(Rn). By Proposition A.7, we have ϕϵ g L2(Rn). Next notice that Eϵ(µ) χ2(µ µ ) + 1 Z |(ϕϵ g)(x) g(x)|g(x) dx ϕϵ g g 2 g 2 < . This shows the first claim. Finally, the last expression goes to 0 as ϵ 0 since ϕϵ g g 2 0 by Proposition A.7. Remark A.9. One may ask if similar results as Theorem 3.3 for f-divergences other than χ2 divergence can be obtained using similar techniques. We think the proof of Theorem 3.3 is highly specific to χ2 divergence because in (13) there are two copies of q coming from the definition of Eϵ as a double integral over µ. An f-divergence between µ and µ has the form R X f(q(x)/p(x)) dp(x), and the only way to have q2 showing up is to choose f(x) = x2 corresponding to χ2 divergence. Remark A.10. It is possible to prove versions of Theorem 3.3 when X has Hausdorff dimension d < n: in such cases χ2-divergence still makes sense as does (2). When X is flat , i.e., with Hausdorff dimension d and contained in a d-dimensional linear subspace of Rn, e.g., when X is defined by a set of linear inequalities, then Theorem 3.3 follows if we adapt the assumptions of Definition 3.1 and calculation in the proof of Theorem 3.3 to be in the subspace. For a general d-dimensional X, a similar calculation yields Eϵ(µ) = R X ϕϵ(x y) q p (y) d Hd(y) q p (x) d Hd(x). For a similar argument to go through, we will need the normalizing constant R X ϕϵ(x y) d Hd(y) to the same for all x. This is true when X is a d-dimensional sphere, but for a general X the last integral will depend on the base point x. Proving a version of Theorem 3.3 for X with Hausdorff dimension d < n is an interesting future direction. A.3 CONVEXITY AND Γ-CONVERGENCE We start by recalling a few definitions regarding functionals in P(Rn). Definition A.11. We say a functional F : P(Rn) ( , ] is proper if there exists µ P(Rn) such that F(µ) < , and is lower semicontinuous (l.s.c.) if for any weakly convergence sequence µk µ, we have lim infk F(µk) F(µ). Lemma A.12. For any ϵ > 0, the functional Eϵ : P(Rn) ( , ] is proper and l.s.c. Moreover, if X is compact, the minimum minµ P(X) Eϵ(µ) is attained by some measure in P(X). Proof. Taking any x X, since ϕϵ is bounded and p(x) > 0, we see that Eϵ(δx) < so Eϵ is proper. Moreover, given weakly convergence µk µ, by Portmanteau theorem and the fact that Wϵ is nonnegative and l.s.c., we conclude that Eϵ is also l.s.c. The set of probability distributions P(X) P(Rn) is tight by the compactness of X. It is closed since if {µk} P(X) weakly converges to µ, then by Portmanteau theorem, µ(X) lim sup µ(X) = 1 so that µ P(X). Hence by Prokhorov s theorem (Theorem 5.1.3 (Ambrosio et al., 2005)), P(X) is (sequentially) compact. It is then an elementary result that any l.s.c. function attains its minimum on a compact set. We next prove Proposition 3.5 regarding the convexity of Eϵ and the uniqueness of its minimum. Proof of Proposition 3.5. Let µ be any finite signed Borel measure. The compactness assumption and the fact that p C1(X) imply p(x) > δ for any x X for some δ > 0. 
Hence p(x) 1/2 δ 1/2, so the weighted measure µ defined by d µ(x) := p 1/2(x) dµ(x) is also finite. By the definition of i.s.p.d. kernels, we have EWϵ(µ) = Ekϵ( µ) > 0 if µ is not the zero measure, which is equivalent to µ not being zero since p > 0. Thus Wϵ is i.s.p.d. on X. Also note that (p(x)p(y)) 1/2 < δ 1 for all x, y X, so EWϵ is always finite. By Lemma 1.1 of Pronzato & Zhigljavsky (2021), we conclude that EWϵ is strictly convex in Msign, the space of finite signed measures, and in particular it is convex on P(X). Hence combined with the existence result from Lemma A.12 we conclude Eϵ attains a unique minimum in P(X). Published as a conference paper at ICLR 2023 Our goal in the rest of this section is to prove the Γ-convergence of Eϵ to E : P(Rn) [0, ] defined in (12). By Lemma A.8, we have E(µ) = χ2(µ µ ) + 1. Hence together we will have proved Theorem 3.6. Definition A.13 (Γ-convergence). A sequence of functionals Fϵ : P(Rn) ( , ] is said to Γ-converge to F : P(Rn) ( , ], denoted as Fϵ Γ F, if: (a) For any sequence µϵ P(Rn) converging weakly to µ P(Rn), lim infϵ 0 Fϵ(µϵ) F(µ); (b) For any µ P(Rn), there exists a sequence µϵ P(Rn) converging weakly to µ with lim supϵ 0 Fϵ(µϵ) F(µ). We will show Eϵ Γ E using Fourier transforms and Bochner s theorem. Definition A.14 (Fourier transform). For f L1(Rn), its Fourier transform ˆf is the complexvalued function defined via ˆf(ξ) := Z e 2πiξ xf(x) dx. More generally, for a signed finite measure µ Msign(Rn), its Fourier transform ˆµ is the complexvalued function defined via ˆµ(ξ) := Z e 2πiξ x dµ(x). This integral is always well-defined and moreover ˆµ is uniformly continuous; see Borodachov et al. (2019, Section 1.10). We will prove the following weak version (under the additional assumption that a mollifier ϕ is integrable) of Bochner s theorem suitable for our case. In particular we will need the Fourier inversion formula which is not given in the usual statement of Bochner s theorem. On the other hand, we cannot directly use the Fourier inversion formula since it is not obvious how to check the integrability of ˆϕ when ϕ is a mollifier. Lemma A.15. Suppose ϕ L1(Rn) is even, bounded, continuous, and k(x, y) := ϕ(x y) is i.s.p.d. on any compact sets. Then its Fourier transform ˆϕ is real, nonnegative, and the following inversion formula holds: ϕ(x) = Z e2πix ξ ˆϕ(ξ) dξ for all x Rn. (14) Proof. The proof is adapted from that of Varadhan (2001, Theorem 2.7) and is extended to the multi-dimensional case. Since ˆϕ(ξ) := R e 2πix ξϕ(x) dx and ϕ(x) = ϕ( x), with a change of variable x = x, we obtain ˆϕ(ξ) = ˆϕ(ξ), so ˆϕ is real. Next we show that ˆϕ is nonnegative. For T > 0, we compute, for a fixed ξ Rn, [0,T ]n e 2πi(t s) ξϕ(t s) dt ds i[|ui|,2T |ui|] 2 n dv e 2πiu ξϕ(u) du e 2πiu ξϕ(u) du e 2πiu ξϕ(u) du, Published as a conference paper at ICLR 2023 where we have used change of variable u = t s, v = t + s. Since ϕ L1(Rn), by dominated convergence theorem, we have as T ˆϕ(ξ) = Z e 2πiξ xϕ(x) dx = lim T 1 T n [0,T ]n e 2πi(t s) ξϕ(t s) dt ds. For t, s Rn, we have Re e 2πi(t s) ξϕ(t s)) = cos(2πt ξ)k(t, s) cos(2πs ξ) + sin(2πt ξ)k(t, s) sin(2πs ξ). For a fixed T, if we define a finite measure µ as dµ = 1t [0,T ]n cos(2πt ξ) dt, then k being i.s.p.d. implies ZZ ϕ(t s) dµ dµ = Z [0,T ]n cos(2πt ξ)ϕ(t s) cos(2πs ξ) dt ds 0, and similarly for sin. Since ˆϕ is real, we conclude that ˆϕ is nonnegative. Finally we prove Equation (14). For σ > 0, define ˆϕσ(ξ) := ˆϕ(ξ)e σ2 ξ 2 2. Since ˆϕ is bounded (because ϕ L1(Rn)), we see that ˆϕσ L1(Rn). 
We compute, for x Rn, using Fubini s theorem, Z e2πix ξ ˆϕσ(ξ) dξ = Z ˆϕ(ξ)e σ2 ξ 2 2e2πix ξ dξ = Z Z e 2πiy ξϕ(y) dy e σ2 ξ 2 2e2πix ξ dξ = Z Z e 2πi(y x) ξe σ2 ξ 2 2 dξ ϕ(y) dy = Z (π/σ2)n/2e π2 x y 2 2/σ2ϕ(y) dy, where we use the Fourier transform formula (Borodachov et al., 2019, (4.4.1)) of the Gaussian distribution. Notice that pσ(y) := (π/σ2)n/2e π2 x y 2 2/σ2 is the density of a multivariate Gaussian centered at x with covariance σ2/(2π2) I. Hence R e2πix ξ ˆϕσ(ξ) dξ = (pσ ϕ)(0). Since ϕ is bounded by assumption, with x = 0 we find R ˆϕσ(ξ) dξ ϕ . Taking σ 0, by monotone convergence theorem, we have R ˆϕ(ξ) dξ ϕ , so together with the fact that ˆϕ 0 we have ˆϕ L1(Rn). Finally, since ˆϕσ ˆϕ, by dominated convergence theorem and Proposition A.4 (note pσ is centered at x), we have Z e2πix ξ ˆϕ(ξ) dξ = lim σ 0 Z e2πix ξ ˆϕσ(ξ) dξ = lim σ 0(pσ ϕ)(0) = ϕ(x). Proposition A.16. Given ϕ : Rn R satisfying the assumptions of Lemma A.15, for any ν Msign(Rn), it holds that ZZ ϕ(x y) dν(x) dν(y) = Z |ˆν(ξ)|2 ˆϕ(ξ) dξ, Proof. By Lemma A.15, ZZ ϕ(x y) dν(x) dν(y) = ZZ Z e 2πi(x y) ξ ˆϕ(ξ) dξ dν(x) dν(y) = Z Z e 2πix ξ dν(x) Z e2πiy ξ dν(y) ˆϕ(ξ) dξ = Z ˆν(ξ)ˆν(ξ)ˆϕ(ξ) dξ = Z |ˆν(ξ)|2 ˆϕ(ξ) dξ. where we use Fubini s theorem (all measures are finite and e 2πi is bounded) to exchange the order of integration. Published as a conference paper at ICLR 2023 Proof of Theorem 3.6. First note that Definition A.13(b) is immediate from Theorem 3.3 with µϵ = µ: if E(µ) = , then trivially lim supϵ 0 Eϵ(µ) E(µ); otherwise we apply Theorem 3.3. So we focus on proving criterion Definition A.13(a). Fix a sequence µϵ P(Rn) that converges weakly to µ P(Rn), and our goal is to show lim infϵ 0 Eϵ(µϵ) E(µ). Without loss of generality, we may assume Eϵ(µϵ) < for all ϵ > 0 (these terms have no effect in lim inf), which implies µϵ(X) = 1. By Portmanteau s theorem and the assumption that X is closed, we have µ(X) lim supϵ 0 µϵ(X) = 1. So all of µϵ and µ will have support in X. For m > 0, ϵ > 0, define a nonnegative measure νϵ,m by dνϵ,m(x) := hm(x)p 1/2(x) dµϵ(x), where hm : Rn R is a continuous monotonically decreasing cutoff function satisfying hm(x) = 1 if x 2 2 < m and hm(x) = 0 if x 2 2 > m + 1, and that hm(x) hm (x) for m < m . Then since p > 0 is continuous, it is bounded below on any compact set in X, and hence νϵ,m is finite. Also define, for m > 0, a nonnegative measure νm by dνm(x) := hm(x)p 1/2(x) dµ(x), which is again finite following the same reasoning. Then for any m > 0, denoting dνϵ(x) := p 1/2(x) dµϵ(x), Eϵ(µϵ) = ZZ ϕϵ(x y) dνϵ(x) dνϵ(y) ZZ ϕϵ(x y) dνϵ,m(x) dνϵ,m(y) = Z |ˆνϵ,m(ξ)|2 ˆϕϵ(ξ) dξ, (15) where we apply Proposition A.16 for the last equality. On the other hand, note that ˆνϵ,m(ξ) = Z e 2πiξ x dνϵ,m(x) = Z e 2πiξ xhm(x)p 1/2(x) dµϵ(x). Since µϵ µ weakly and the last integrand is a continuous bounded function, we have lim ϵ 0 ˆνϵ,m(ξ) = Z e 2πiξ xhm(x)p 1/2(x) dµ(x) = ˆνm(ξ). On the other hand, ˆϕϵ(ξ) = R e 2πiξ xϕϵ(x) dx. Since by Definition 3.1, ϕϵ dx converges to δ0 in probability (and in particular weakly), we have limϵ 0 ˆϕϵ(ξ) = 1. Applying Fatou s lemma to (15), we obtain, for any m > 0, lim inf ϵ 0 Eϵ(µϵ) Z |ˆνm(ξ)|2 dξ. If R |ˆνm(ξ)|2 dξ = for any m > 0, then we are done since lim infϵ 0 Eϵ(µϵ) = . Otherwise, by K uhn (2016, Lemma 2.11), for every m > 0, νm has density in L2(Rn). This imples µ has density everywhere. Suppose dµ(x) = q(x) d Ln(x). 
By Plancherel's theorem and the monotone convergence theorem, we have
$$\liminf_{\epsilon\to 0} E_\epsilon(\mu_\epsilon) \ge \lim_{m\to\infty}\int |\hat\nu_m(\xi)|^2\,d\xi = \lim_{m\to\infty}\int \big(h_m(x)\,p^{-1/2}(x)\,q(x)\big)^2\,dx = \int_X p^{-1}(x)\,q(x)^2\,dx = E(\mu).$$
This completes the proof of $E_\epsilon \xrightarrow{\Gamma} E$.

Now suppose $X$ is compact, and let $\mu^*_\epsilon := \arg\min_{\mu\in\mathcal{P}(X)} E_\epsilon(\mu)$. To establish $\mu^*_\epsilon \to \mu^*$ weakly (resp. $\lim_{\epsilon\to 0} E_\epsilon(\mu^*_\epsilon) = E(\mu^*) = 1$), it suffices to show that every subsequence of $\{\mu^*_\epsilon\}$ (resp. $\{E_\epsilon(\mu^*_\epsilon)\}$) has a further subsequence converging to $\mu^*$ (resp. $E(\mu^*)$). With a slight abuse of notation, we assume the sequence of $\epsilon$ is chosen so that $\{\mu^*_\epsilon\}$ (resp. $\{E_\epsilon(\mu^*_\epsilon)\}$) is already such a subsequence of the original sequence. As argued in the proof of Lemma A.12, $\mathcal{P}(X)$ is compact with respect to weak convergence. Hence $\{\mu^*_\epsilon\}$ has a weakly convergent subsequence $\mu^*_{\epsilon_k} \to \nu$ for some $\nu \in \mathcal{P}(X)$. The Γ-convergence $E_\epsilon \xrightarrow{\Gamma} E$ implies
$$\liminf_k E_{\epsilon_k}(\mu^*) \ge \liminf_k E_{\epsilon_k}(\mu^*_{\epsilon_k}) \ge E(\nu) \ge \limsup_k E_{\epsilon_k}(\nu) \ge \limsup_k E_{\epsilon_k}(\mu^*_{\epsilon_k}),$$
where we have used, for each inequality, $\mu^*_{\epsilon_k} = \arg\min_{\mathcal{P}(X)} E_{\epsilon_k}$, Definition A.13(a), the first paragraph of this proof, and again the fact that $\mu^*_{\epsilon_k} = \arg\min_{\mathcal{P}(X)} E_{\epsilon_k}$. Since $\lim_k E_{\epsilon_k}(\mu^*) = E(\mu^*)$ by Theorem 3.3, we have
$$E(\mu^*) \ge \lim_k E_{\epsilon_k}(\mu^*_{\epsilon_k}) = E(\nu) \ge E(\mu^*),$$
where the last inequality follows because $\mu^*$ is the minimizer of $E$. Then $\lim_k E_{\epsilon_k}(\mu^*_{\epsilon_k}) = E(\mu^*) = 1$. Moreover, $\chi^2(\nu\,\|\,\mu^*) = 0$. This can only happen if $\nu$ and $\mu^*$ agree on all Borel sets, so $\nu = \mu^*$.

A.4 DIFFERENTIAL CALCULUS OF $E_\epsilon$ IN $\mathcal{P}_2(\mathbb{R}^n)$

Lemma A.17. Let $x_0 \in \mathbb{R}$, $h > 0$, and let $\Omega \subseteq \mathbb{R}^m$ be a compact set. Suppose $f : (x_0-h, x_0+h) \times \Omega \to \mathbb{R}$ is jointly continuous and the derivative $\partial_x f : (x_0-h, x_0+h) \times \Omega \to \mathbb{R}$ exists and is jointly continuous. Then $\int_\Omega f(x,\omega)\,d\omega$ is differentiable for $x \in (x_0-h, x_0+h)$, and
$$\frac{d}{dx}\int_\Omega f(x,\omega)\,d\omega = \int_\Omega \partial_x f(x,\omega)\,d\omega,$$
where the integration is with respect to the Lebesgue measure $\mathcal{L}^m$ on $\Omega$.

Proof. Fix $x \in (x_0-h, x_0+h)$ and let $t > 0$ be small enough that $[x-t, x+t] \subseteq (x_0-h, x_0+h)$. Note that $\int_\Omega f(x,\omega)\,d\omega$ is well-defined by the dominated convergence theorem since $\sup_{x'\in[x-t,x+t],\,\omega\in\Omega}|f(x',\omega)| < \infty$ and $\mathcal{L}^m(\Omega) < \infty$. Define $\theta(\omega) := \sup_{x'\in[x-t,x+t]}|\partial_x f(x',\omega)|$, which is finite since $\partial_x f(\cdot,\omega)$ is continuous; then $\theta(\omega) \ge |\partial_x f(x',\omega)|$ for all $x' \in [x-t, x+t]$, and $\theta$ is integrable since $\mathcal{L}^m(\Omega) < \infty$. Hence by the differentiation lemma (Klenke, 2013, Theorem 6.28) we are done.

The following lemma is similar to Korba et al. (2021, Proposition 3) but with different assumptions: we do not put any integrability assumptions on $W_\epsilon$, but we do restrict measures to have compact support. For a differentiable $f : \mathbb{R}^n \times \mathbb{R}^n \to \mathbb{R}$, we use $\nabla_1 f(x,y)$ to denote the gradient with respect to $x$ and $H_1 f(x,y)$ to denote the Hessian with respect to $x$; we define $\nabla_2 f(x,y)$ and $H_2 f(x,y)$ similarly.

Lemma A.18. Assume $\mu \in \mathcal{P}_2(\mathbb{R}^n)$ has density $q \in C^1_c(\mathbb{R}^n)$. Let $\xi \in C(\mathbb{R}^n;\mathbb{R}^n)$. Denote $s(x,t) := x + t\xi(x)$ and $\mu_t := s(\cdot,t)_\#\mu$. Then for all $t > 0$ and all $\epsilon > 0$,
$$\frac{d}{dt} E_\epsilon(\mu_t) = 2\iint \xi(x)\cdot\nabla_1 W_\epsilon(s(x,t), s(y,t))\,d\mu(x)\,d\mu(y). \quad (16)$$
In particular,
$$\frac{d}{dt}\Big|_{t=0} E_\epsilon(\mu_t) = 2\iint \xi(x)\cdot\nabla_1 W_\epsilon(x,y)\,d\mu(x)\,d\mu(y), \quad (17)$$
$$\frac{d^2}{dt^2}\Big|_{t=0} E_\epsilon(\mu_t) = 2\iint \Big(\xi(x)\cdot\nabla_1\nabla_2 W_\epsilon(x,y)\,\xi(y) + \xi(x)\cdot H_1 W_\epsilon(x,y)\,\xi(x)\Big)\,d\mu(x)\,d\mu(y). \quad (18)$$

Proof. We compute
$$\frac{d}{dt} E_\epsilon(\mu_t) = \frac{d}{dt}\iint W_\epsilon(s(x,t), s(y,t))\,d\mu(x)\,d\mu(y).$$
Since $\mu$ has compact support and $(x,y,t) \mapsto W_\epsilon(s(x,t), s(y,t))\,q(x)\,q(y)$ is jointly continuous with a jointly continuous derivative in $t$, by Lemma A.17 we can push $\frac{d}{dt}$ inside the double integral; using the symmetry of $W_\epsilon$ we obtain (16). Another application of Lemma A.17 shows that differentiating (16) with respect to $t$ once more and evaluating at $0$ gives (18).

Lemma A.19. Let $f \in C^1(\mathbb{R}^n)$, $p \in [1, \infty]$.
Assume $\phi_\epsilon$ has compact support for some $\epsilon > 0$. Then for all $i = 1, \dots, n$, $\phi_\epsilon * f$ is differentiable and
$$\frac{\partial}{\partial x_i}(\phi_\epsilon * f)(x) = \Big(\phi_\epsilon * \frac{\partial f}{\partial x_i}\Big)(x).$$

Proof. Since $\mathrm{supp}(\phi_\epsilon)$ is compact and $f \in C^1(\mathbb{R}^n)$ is bounded on any compact set, $\phi_\epsilon * f$ is well-defined at every $x \in \mathbb{R}^n$. Note that
$$\frac{\partial}{\partial x_i}(\phi_\epsilon * f)(x) = \frac{\partial}{\partial x_i}\int f(x-y)\,\phi_\epsilon(y)\,dy \overset{(\star)}{=} \int \frac{\partial f}{\partial x_i}(x-y)\,\phi_\epsilon(y)\,dy = \Big(\phi_\epsilon * \frac{\partial f}{\partial x_i}\Big)(x).$$
Since $\mathrm{supp}(\phi_\epsilon)$ is compact and $(x,y) \mapsto f(x-y)\phi_\epsilon(y)$ is $C^1$, Lemma A.17 justifies the existence of the derivative and the exchange of differentiation and integration in $(\star)$.

A.4.1 SUBDIFFERENTIALS OF $E_\epsilon$

Recall the following notion of a Wasserstein gradient in $\mathcal{P}_2(\mathbb{R}^n)$ from Ambrosio et al. (2005, Definition 10.1.1).

Definition A.20. A vector field $w \in L^2(\mu;\mathbb{R}^n)$ is a strong Fréchet subdifferential of a functional $F : \mathcal{P}_2(\mathbb{R}^n) \to (-\infty, +\infty]$ at $\mu$ if for all $T \in L^2(\mu;\mathbb{R}^n)$ the following holds:
$$F(T_\#\mu) - F(\mu) \ge \int w(x)\cdot(T(x) - x)\,d\mu(x) + o\big(\|T - I\|_{L^2(\mu;\mathbb{R}^n)}\big). \quad (19)$$
Note that we cannot apply Ambrosio et al. (2005, Lemma 10.4.1) directly to prove (4), because interaction energies cannot be written in the form of (10.4.1) in their setup.

Proof of Proposition 3.7. Let $\xi \in C^\infty_c(\mathbb{R}^n;\mathbb{R}^n)$. By Lemma A.18, we have
$$\frac{d}{dt}\Big|_{t=0} E_\epsilon(\mu_t) = 2\iint \xi(x)\cdot\nabla_1 W_\epsilon(x,y)\,d\mu(x)\,d\mu(y) = 2\iint \xi(x)\cdot\nabla_1\big[\phi_\epsilon(x-y)(p(x)p(y))^{-1/2}\big]\,q(y)\,dy\,d\mu(x) = 2\int \xi(x)\cdot\nabla\big[p(x)^{-1/2}\big(\phi_\epsilon * (q/\sqrt p)\big)(x)\big]\,d\mu(x),$$
where the last step follows from applying Lemma A.17, since $q$ has compact support. Now suppose $w \in L^2(\mu;\mathbb{R}^n)$ is a strong Fréchet subdifferential satisfying (19). For the maps $T_t := I + t\xi$ (so that $\mu_t = (T_t)_\#\mu$), we have by definition
$$E_\epsilon(\mu_t) - E_\epsilon(\mu) \ge \int w(x)\cdot(T_t(x) - x)\,d\mu(x) + o\big(\|T_t - I\|_{L^2(\mu;\mathbb{R}^n)}\big) = \int w(x)\cdot(t\,\xi(x))\,d\mu(x) + o(t).$$
Dividing by $t$ and letting $t \to 0^+$ (resp. $t \to 0^-$) gives
$$\liminf_{t\to 0^+}\frac{E_\epsilon(\mu_t) - E_\epsilon(\mu)}{t} \ge \int w(x)\cdot\xi(x)\,d\mu(x) \ge \limsup_{t\to 0^-}\frac{E_\epsilon(\mu_t) - E_\epsilon(\mu)}{t}.$$
The previous calculation shows that $\frac{d}{dt}\big|_{t=0} E_\epsilon(\mu_t)$ exists, and hence it is equal to $\int w(x)\cdot\xi(x)\,d\mu(x)$. This proves that for any $\xi \in C^\infty_c(\mathbb{R}^n;\mathbb{R}^n)$,
$$\int w(x)\cdot\xi(x)\,d\mu(x) = \int \xi(x)\cdot 2\nabla\big[p(x)^{-1/2}\big(\phi_\epsilon * (q/\sqrt p)\big)(x)\big]\,d\mu(x).$$
Hence we have shown (4). Finally, we show $\lim_{\epsilon\to 0} w_\epsilon(x) = w_{\chi^2}(x)$ for $\mu$-a.e. $x$ under the additional assumption that $\phi_\epsilon$ has compact support. By Ambrosio et al. (2005, Lemma 10.4.1) with $F(x,\rho(x)) = \big(\frac{\rho(x)}{p(x)} - 1\big)^2 p(x)$, we find that the strong subdifferential of $\chi^2(\cdot\,\|\,\mu^*)$ is given by
$$w_{\chi^2,\mu}(x) = 2\nabla\frac{q(x)}{p(x)}, \quad\text{for } \mu\text{-a.e. } x \in \mathbb{R}^n. \quad (20)$$
To show (5), we compute, for $\mu$-a.e. $x$,
$$w_\epsilon(x) = 2\nabla\big[p(x)^{-1/2}\big(\phi_\epsilon * (q/\sqrt p)\big)(x)\big] = 2\nabla\big[p(x)^{-1/2}\big]\big(\phi_\epsilon * (q/\sqrt p)\big)(x) + 2\,p(x)^{-1/2}\,\nabla\big(\phi_\epsilon * (q/\sqrt p)\big)(x) = 2\nabla\big[p(x)^{-1/2}\big]\big(\phi_\epsilon * (q/\sqrt p)\big)(x) + 2\,p(x)^{-1/2}\big(\phi_\epsilon * \nabla(q/\sqrt p)\big)(x),$$
where for the last equality we have applied Lemma A.19. Now taking $\epsilon \to 0$, by Proposition A.4 and the fact that $\mathrm{supp}(\phi_\epsilon)$ is compact (so that $q/\sqrt p$ and $\nabla(q/\sqrt p)$ are bounded on $x + \mathrm{supp}(\phi_\epsilon)$), we obtain (5).

A.4.2 DISPLACEMENT CONVEXITY OF $E_\epsilon$ AT $\mu^*$ AS $\epsilon \to 0$

The statement of Proposition 3.8 is similar in form to Korba et al. (2021, Corollary 4), but in our case the second term in (18) does not vanish and we need to take the limit $\epsilon \to 0$. We also do not resort to RKHS theory in the proof.

Proof of Proposition 3.8. By (18), we have
$$\frac{d^2}{dt^2}\Big|_{t=0} E_\epsilon(\mu_t) = 2(F_\epsilon + G_\epsilon), \quad F_\epsilon = \iint \xi(x)\cdot\nabla_1\nabla_2 W_\epsilon(x,y)\,\xi(y)\,d\mu^*(x)\,d\mu^*(y), \quad G_\epsilon = \iint \xi(x)\cdot H_1 W_\epsilon(x,y)\,\xi(x)\,d\mu^*(x)\,d\mu^*(y).$$
We tackle $F_\epsilon$ first. Successive integration by parts, using the fact that $\xi$ has compact support, gives
$$F_\epsilon = \sum_{i,j}\iint \xi_i(x)\,\frac{\partial^2 W_\epsilon}{\partial x_i\,\partial y_j}(x,y)\,\xi_j(y)\,p(x)\,p(y)\,dx\,dy = -\sum_{i,j}\iint \frac{\partial}{\partial x_i}\big(\xi_i(x)p(x)\big)\,\frac{\partial W_\epsilon}{\partial y_j}(x,y)\,\xi_j(y)\,p(y)\,dx\,dy = \iint \Big(\sum_i \frac{\partial}{\partial x_i}\big(\xi_i(x)p(x)\big)\Big)\,W_\epsilon(x,y)\,\Big(\sum_j \frac{\partial}{\partial y_j}\big(\xi_j(y)p(y)\big)\Big)\,dx\,dy.$$
If we view $\sum_{i=1}^n \frac{\partial}{\partial x_i}(\xi_i(x)p(x))$ as the density of a signed measure (it is integrable since it has compact support), then, since $W_\epsilon$ is i.s.p.d. on the support of $\xi$ by Proposition 3.5(a), the double integral in the last expression is non-negative.
Hence $F_\epsilon \ge 0$. Next we show $\lim_{\epsilon\to 0} G_\epsilon = 0$. Since $\mu^*$ has compact support, by Fubini's theorem,
$$G_\epsilon = \int \xi(x)\cdot H_1\Big[\int W_\epsilon(x,y)\,p(y)\,dy\Big]\,\xi(x)\,p(x)\,dx.$$
To expand the integral inside the Hessian operator, note that
$$\int W_\epsilon(x,y)\,p(y)\,dy = p(x)^{-1/2}\,(\phi_\epsilon * \sqrt p)(x).$$
First, by the chain rule and Lemma A.19, we have
$$\frac{d}{dx}\Big[p(x)^{-1/2}(\phi_\epsilon * \sqrt p)(x)\Big] = \frac{d}{dx}\Big[p(x)^{-1/2}\Big](\phi_\epsilon * \sqrt p)(x) + p(x)^{-1/2}\Big(\phi_\epsilon * \frac{d}{dx}\sqrt p\Big)(x).$$
Differentiating again while applying Lemma A.19, we obtain after rearranging terms
$$H_1\!\int W_\epsilon(x,y)\,p(y)\,dy = \frac{d^2}{dx^2}\Big[p(x)^{-1/2}\Big](\phi_\epsilon * \sqrt p)(x) + 2\,\frac{d}{dx}\Big[p(x)^{-1/2}\Big]\Big(\phi_\epsilon * \frac{d}{dx}\sqrt p\Big)(x)^{\!\top} + p(x)^{-1/2}\Big(\phi_\epsilon * \frac{d^2}{dx^2}\sqrt p\Big)(x). \quad (21)$$
By Proposition A.4 and the fact that $p \in C^2_c(\mathbb{R}^n)$, we have
$$\lim_{\epsilon\to 0} H_1\!\int W_\epsilon(x,y)\,p(y)\,dy = \frac{d^2}{dx^2}\Big[p(x)^{-1/2}\Big]\sqrt{p(x)} + 2\,\frac{d}{dx}\Big[p(x)^{-1/2}\Big]\Big(\frac{d}{dx}\sqrt{p(x)}\Big)^{\!\top} + p(x)^{-1/2}\,\frac{d^2}{dx^2}\sqrt{p(x)},$$
which is the Hessian of $p(x)^{-1/2}\sqrt{p(x)} \equiv 1$ and hence vanishes. Finally, we have
$$\lim_{\epsilon\to 0} G_\epsilon = \lim_{\epsilon\to 0}\int \xi(x)\cdot H_1\Big[\int W_\epsilon(x,y)p(y)\,dy\Big]\,\xi(x)\,p(x)\,dx = \int \xi(x)\cdot\Big[\lim_{\epsilon\to 0} H_1\!\int W_\epsilon(x,y)p(y)\,dy\Big]\,\xi(x)\,p(x)\,dx = 0,$$
where interchanging the limit and the integral is justified by the dominated convergence theorem and the fact that $p$ and $\phi_\epsilon$ have compact support (we need the compact-support assumption on $\phi_\epsilon$ to make sure the convolutions appearing in (21) are uniformly bounded when $\epsilon$ is sufficiently small).

A.4.3 A DESCENT LEMMA FOR $E_\epsilon$ WITH TIME DISCRETIZATION

We now consider a time discretization of the gradient flow of $E_\epsilon$: for an initial measure $\mu_0 \in \mathcal{P}_2(\mathbb{R}^n)$, $m \in \mathbb{Z}_{\ge 0}$, and a step size $\gamma > 0$,
$$\mu_{m+1} := (I - \gamma\,w_{\epsilon,\mu_m})_\#\,\mu_m. \quad (22)$$
Similar to Korba et al. (2021, Proposition 14), we can show the following descent lemma for the iterations (22) under the following additional assumptions.

Assumption A.21. Suppose the target density $p \in C^2(\mathbb{R}^n)$ satisfies $1/C_p \le p(x) \le C_p$, $\|\nabla p(x)\|_2 \le C'_p$, and $\|Hp(x)\|_2 \le C''_p$ for some constants $C_p, C'_p, C''_p$ and all $x \in \mathbb{R}^n$, where we use $\|A\|_2$ to denote the matrix spectral norm (i.e., $\|A\|_2 = \sigma_{\max}(A)$). Suppose the mollifier $\phi_\epsilon \in C^2(\mathbb{R}^n)$ satisfies, in addition to Definition 3.1, $\phi_\epsilon(x) \le C_\epsilon$, $\|\nabla\phi_\epsilon(x)\|_2 \le C'_\epsilon$, and $\|H\phi_\epsilon(x)\|_2 \le C''_\epsilon$ for some $C_\epsilon, C'_\epsilon, C''_\epsilon$.

Lemma A.22. Under Assumption A.21, there exists a constant $L > 0$ such that the function $\nabla_1 W_\epsilon : \mathbb{R}^n \times \mathbb{R}^n \to \mathbb{R}^n$ is $L$-Lipschitz with respect to $\|\cdot\|_2$ in either input.

Proof. Denote $r(x) := p(x)^{-1/2}$. Then our assumptions imply $r(x) \le C_r := C_p^{1/2}$, $\|\nabla r(x)\|_2 \le C'_r := \frac{1}{2}C_p^{3/2}C'_p$, and $\|Hr(x)\|_2 \le C''_r := \frac{3}{4}C_p^{5/2}(C'_p)^2 + \frac{1}{2}C_p^{3/2}C''_p$. We compute, for $x, y \in \mathbb{R}^n$,
$$\nabla_1 W_\epsilon(x,y) = \nabla_x\big(\phi_\epsilon(x-y)\,r(x)\,r(y)\big) = \nabla\phi_\epsilon(x-y)\,r(x)\,r(y) + \phi_\epsilon(x-y)\,\nabla r(x)\,r(y),$$
$$H_1 W_\epsilon(x,y) = H\phi_\epsilon(x-y)\,r(x)\,r(y) + \nabla\phi_\epsilon(x-y)\,\nabla r(x)^\top r(y) + \nabla r(x)\,\nabla\phi_\epsilon(x-y)^\top r(y) + \phi_\epsilon(x-y)\,Hr(x)\,r(y),$$
$$\nabla_2\nabla_1 W_\epsilon(x,y) = -H\phi_\epsilon(x-y)\,r(x)\,r(y) + \nabla\phi_\epsilon(x-y)\,r(x)\,\nabla r(y)^\top - \nabla r(x)\,\nabla\phi_\epsilon(x-y)^\top r(y) + \phi_\epsilon(x-y)\,\nabla r(x)\,\nabla r(y)^\top.$$
Then we have
$$\|H_1 W_\epsilon(x,y)\|_2 \le C''_\epsilon C_r^2 + 2C'_\epsilon C'_r C_r + C_\epsilon C''_r C_r, \qquad \|\nabla_2\nabla_1 W_\epsilon(x,y)\|_2 \le C''_\epsilon C_r^2 + 2C'_\epsilon C'_r C_r + C_\epsilon (C'_r)^2.$$
Hence we conclude that $\nabla_1 W_\epsilon$ is $L$-Lipschitz with $L := C''_\epsilon C_r^2 + 2C'_\epsilon C'_r C_r + C_\epsilon \max\big(C''_r C_r,\ (C'_r)^2\big)$.

Remark A.23. Compared with Korba et al. (2021, Lemma 1), to ensure $\nabla_1 W_\epsilon$ is Lipschitz we only require uniform boundedness of $p$ and $\phi_\epsilon$ up to second-order derivatives instead of up to order 3.

Proposition A.24. Under Assumption A.21, suppose $\mu_0 \in \mathcal{P}_2(\mathbb{R}^n)$ has compact support. Then for any $\gamma \le \frac{1}{2L}$, with $L$ defined as in Lemma A.22 and $\{\mu_m\}_{m>0}$ defined as in (22),
$$E_\epsilon(\mu_{m+1}) - E_\epsilon(\mu_m) \le -\gamma(1 - 2\gamma L)\,\|w_{\epsilon,\mu_m}\|^2_{L^2(\mu_m)} \le 0. \quad (23)$$

Proof. We first show that $\mu_m$ has compact support for all $m \in \mathbb{Z}_{\ge 0}$. For $\mu \in \mathcal{P}_2(\mathbb{R}^n)$ with compact support, the proof of Proposition 3.7 implies (note that Proposition 3.7 assumes $\mu$ has a density, but this is not needed to obtain the following formula with the same proof)
$$w_{\epsilon,\mu}(x) = 2\int \nabla_1 W_\epsilon(x,y)\,d\mu(y).$$
Since $x \mapsto \nabla_1 W_\epsilon(x,y)$ is $C^1$ by the assumptions and $\mu$ has compact support, by Lemma A.17 we see that $w_{\epsilon,\mu}$ is differentiable with
$$\nabla w_{\epsilon,\mu}(x) = 2\int H_1 W_\epsilon(x,y)\,d\mu(y).$$
By induction, suppose $\mu_m$ has compact support.
Then
$$\mathrm{supp}(\mu_{m+1}) \subseteq (I - \gamma\,w_{\epsilon,\mu_m})\big(\mathrm{supp}(\mu_m)\big) \subseteq \mathrm{supp}(\mu_m) + \overline{B}\Big(0,\ \gamma\sup_{x\in\mathrm{supp}(\mu_m)}\|w_{\epsilon,\mu_m}(x)\|_2\Big).$$
Since $w_{\epsilon,\mu_m}$ is continuous, the set on the right-hand side is bounded. Hence $\mu_{m+1}$ has compact support.

Now fix $m \in \mathbb{Z}_{\ge 0}$ and let us show (23). Define a path $\{\mu_t\}_{t\in[0,1]}$ by $\mu_t := (I - \gamma t\,w_{\epsilon,\mu_m})_\#\mu_m$. Let $f(t) := E_\epsilon(\mu_t)$. By Lemma A.18, the continuity of $w_{\epsilon,\mu_m}$ implies that $f$ is differentiable. Moreover, another application of Lemma A.17 implies that $f$ is twice differentiable, and in particular continuously differentiable. Hence $f$ is absolutely continuous on the compact interval $[0,1]$. By the fundamental theorem of calculus,
$$E_\epsilon(\mu_{m+1}) - E_\epsilon(\mu_m) = f(1) - f(0) = f'(0) + \int_0^1 \big(f'(t) - f'(0)\big)\,dt.$$
Observe that by Lemma A.18,
$$f'(0) = 2\iint \big(-\gamma\,w_{\epsilon,\mu_m}(x)\big)\cdot\nabla_1 W_\epsilon(x,y)\,d\mu_m(x)\,d\mu_m(y) = \int \big(-\gamma\,w_{\epsilon,\mu_m}(x)\big)\cdot\Big(2\int \nabla_1 W_\epsilon(x,y)\,d\mu_m(y)\Big)\,d\mu_m(x) = -\gamma\,\|w_{\epsilon,\mu_m}\|^2_{L^2(\mu_m)},$$
where we apply Fubini's theorem to exchange the order of integration, together with the fact that $\mu_m$ has compact support and the integrand is continuous. Let $s(x,t) := x - \gamma t\,w_{\epsilon,\mu_m}(x)$. Note that, again by Lemma A.18,
$$|f'(t) - f'(0)| = \Big|2\iint \big(-\gamma\,w_{\epsilon,\mu_m}(x)\big)\cdot\big(\nabla_1 W_\epsilon(s(x,t), s(y,t)) - \nabla_1 W_\epsilon(x,y)\big)\,d\mu_m(x)\,d\mu_m(y)\Big| \le 2\gamma\iint \|w_{\epsilon,\mu_m}(x)\|_2\,\|\nabla_1 W_\epsilon(s(x,t), s(y,t)) - \nabla_1 W_\epsilon(x,y)\|_2\,d\mu_m(x)\,d\mu_m(y).$$
By Lemma A.22, we have
$$\|\nabla_1 W_\epsilon(s(x,t), s(y,t)) - \nabla_1 W_\epsilon(x,y)\|_2 \le \|\nabla_1 W_\epsilon(s(x,t), s(y,t)) - \nabla_1 W_\epsilon(s(x,t), y)\|_2 + \|\nabla_1 W_\epsilon(s(x,t), y) - \nabla_1 W_\epsilon(x,y)\|_2 \le L\gamma t\big(\|w_{\epsilon,\mu_m}(x)\|_2 + \|w_{\epsilon,\mu_m}(y)\|_2\big).$$
Hence
$$|f'(t) - f'(0)| \le 2\gamma^2 L t\iint \|w_{\epsilon,\mu_m}(x)\|_2\big(\|w_{\epsilon,\mu_m}(x)\|_2 + \|w_{\epsilon,\mu_m}(y)\|_2\big)\,d\mu_m(x)\,d\mu_m(y) = 2\gamma^2 L t\Big(\|w_{\epsilon,\mu_m}\|^2_{L^2(\mu_m)} + \Big(\int \|w_{\epsilon,\mu_m}(x)\|_2\,d\mu_m(x)\Big)^2\Big) \le 4\gamma^2 L t\,\|w_{\epsilon,\mu_m}\|^2_{L^2(\mu_m)},$$
where we have used the Cauchy-Schwarz inequality in the last step. Combining everything, we have shown
$$E_\epsilon(\mu_{m+1}) - E_\epsilon(\mu_m) = f'(0) + \int_0^1 \big(f'(t) - f'(0)\big)\,dt \le -\gamma(1 - 2\gamma L)\,\|w_{\epsilon,\mu_m}\|^2_{L^2(\mu_m)} \le 0,$$
since $\gamma \le \frac{1}{2L}$.

B WEIGHTED HYPERSINGULAR RIESZ ENERGY

We recall the results most relevant to us from Borodachov et al. (2019). Suppose $X \subseteq \mathbb{R}^n$ is compact, of Hausdorff dimension $d$, and $d$-rectifiable, i.e., the image of a bounded set in $\mathbb{R}^d$ under a Lipschitz mapping. Given a non-vanishing continuous probability density $p$ on $X$, define a measure $\mathcal{H}^p_d(B) := \int_B p(x)\,d\mathcal{H}_d(x)$ for any Borel set $B \subseteq X$. The target measure $h^p_d$ is defined to be $h^p_d(B) := \mathcal{H}^p_d(B)/\mathcal{H}^p_d(X)$. The weighted $s$-Riesz energy is defined as the interaction energy
$$E_s(\omega_N) := \sum_{i \ne j} \frac{(p(x_i)p(x_j))^{-s/(2d)}}{\|x_i - x_j\|^s}. \quad (24)$$
The following result states that minimizers of the weighted $s$-Riesz energy approximate the target measure when $s$ is sufficiently large.

Theorem B.1 (Theorem 11.1.2, Borodachov et al. (2019)). Suppose $s > d$. For $\omega^*_N \in \arg\min E_s$ with $\omega^*_N = \{x^N_1, \dots, x^N_N\}$, the (normalized) counting measures $\delta_{\omega^*_N}$ converge weakly to $h^p_d$ as $N \to \infty$.

A similar result with a slightly different assumption holds for $s = d$ (Theorem 11.1.3 of Borodachov et al. (2019)).

Comparison of (24) with the discretized MIE (6). Our theory is only valid for the full-dimensional case, i.e., $n = d$ (see a few exceptions discussed in Remark A.10). When this is the case and the mollifier is taken from the $s$-Riesz family, (6) becomes, for $s > n$,
$$\sum_{i,j} \frac{(p(x_i)p(x_j))^{-1/2}}{\big(\|x_i - x_j\|_2^2 + \epsilon^2\big)^{s/2}}.$$
Compared with (24), in our case there is an $\epsilon$ in the denominator, so that the continuous version of the energy $E_\epsilon$ does not blow up all the time (this is in contrast with the continuous version of (24); see Borodachov et al. (2019, Theorem 4.3.1)). Moreover, the exponential scaling on $p$ is different: in our case we use $-1/2$, whereas in (24) it is $-s/(2d) = -s/(2n)$ here. The diagonal $i = j$ is included in our energy, but this is inconsequential, as another valid discretization is to discard the diagonal terms (Borodachov et al., 2019, Theorem 4.2.2). On the other hand, our theory allows a bigger class of mollifiers that are not necessarily from Riesz families. We also allow $X$ to be non-compact. An interesting future research direction is to extend our theory to cases where $d < n$, e.g., when $X$ is an embedded $d$-dimensional submanifold of $\mathbb{R}^n$.
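To make this comparison concrete, the following is a minimal numerical sketch of the two discrete energies (24) and (6) with an $s$-Riesz mollifier. It uses our own conventions (NumPy, unnormalized energies, a vectorized density callable `p`); it is an illustration, not the authors' released code.

```python
import numpy as np

def weighted_riesz_energy(X, p, s, d):
    """Discrete weighted s-Riesz energy (24), up to normalization:
    sum over i != j of (p(x_i) p(x_j))^(-s/(2d)) / ||x_i - x_j||^s."""
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(dist, np.inf)                  # (24) excludes the diagonal
    w = (p(X)[:, None] * p(X)[None, :]) ** (-s / (2.0 * d))
    return float(np.sum(w / dist ** s))

def mie_riesz_energy(X, p, s, eps):
    """Discretized MIE (6) with an s-Riesz mollifier, up to normalization:
    sum over all i, j of (p(x_i) p(x_j))^(-1/2) (||x_i - x_j||^2 + eps^2)^(-s/2).
    The eps^2 term keeps the diagonal (and the continuous energy) finite."""
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    w = (p(X)[:, None] * p(X)[None, :]) ** (-0.5)
    return float(np.sum(w * (sq + eps ** 2) ** (-s / 2.0)))
```

For $n = d$, the two expressions differ only in the treatment of the diagonal, the $\epsilon$ regularization, and the exponent on $p$, which is exactly the comparison made above.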
C ALGORITHMIC DETAILS

In this section we provide algorithmic details of MIED and compare its updates with those of SVGD (Liu & Wang, 2016). The negative gradient of (7) with respect to $x_i$ is
$$-\nabla_{x_i}\log E_\epsilon(\omega_N) = \sum_{j=1}^N \frac{e^{I_{ij}}}{\sum_{i,j} e^{I_{ij}}}\Big(2\nabla\log\phi_\epsilon(x_j - x_i) + \nabla\log p(x_i)\Big), \quad\text{where}\quad I_{ij} := \log\phi_\epsilon(x_i - x_j) - \tfrac{1}{2}\big(\log p(x_i) + \log p(x_j)\big), \quad (25)$$
and to combine the diagonal terms we used the fact that $\nabla\phi_\epsilon(0) = 0$, thanks to the assumption $\phi_\epsilon(x) = \phi_\epsilon(-x)$. Then gradient descent on (7) gives our algorithm (Algorithm 1); a code sketch of this update is given at the end of Appendix C.1. The special treatment of the diagonal terms described in Section 4 amounts to modifying only the diagonal $I_{ii}$.

Algorithm 1: Mollified interaction energy descent (MIED) in the logarithmic domain.
Input: target density $p$, mollifier $\phi_\epsilon$, initial particles $\{x^0_i\}_{i=1}^N$, learning rate $\eta$, total steps $T$.
for $t \leftarrow 1$ to $T$ do
  for $i \leftarrow 1$ to $N$ do
    $x^{t+1}_i \leftarrow x^t_i + \eta \sum_{j=1}^N \frac{e^{I_{ij}}}{\sum_{i,j} e^{I_{ij}}}\big(2\nabla\log\phi_\epsilon(x_j - x_i) + \nabla\log p(x_i)\big)$, where $I_{ij}$ is defined in (25);
  end
end
return final particles $\{x^T_i\}_{i=1}^N$.

C.1 COMPARISON WITH SVGD

The update formula in Algorithm 1 is similar to that of SVGD if we use $\phi_\epsilon(x-y)$ in place of the kernel $k(x,y)$ in the SVGD update and rewrite it as
$$x^{t+1}_i = x^t_i + \eta\sum_{j=1}^N\big(\nabla\phi_\epsilon(x_j - x_i) + \phi_\epsilon(x_j - x_i)\,\nabla\log p(x_j)\big) = x^t_i + \eta\sum_{j=1}^N \phi_\epsilon(x_j - x_i)\big(\nabla\log\phi_\epsilon(x_j - x_i) + \nabla\log p(x_j)\big). \quad\text{(SVGD)}$$
For both algorithms, the update for each particle consists of attraction and repulsion terms, and the total time complexity of each iteration is $O(N^2)$. We note the following differences. First, in our formulation we have scaling factors $e^{I_{ij}}/\sum_{i,j}e^{I_{ij}}$, which help stabilize the optimization (as a by-product of working in the logarithmic domain) and put more weight on nearby particles as well as on particles in low-density regions, whereas in (SVGD) the scaling factors are $\phi_\epsilon(x_j - x_i)$, which are not adapted to prioritize low-density regions. Second, in MIED the attraction force for particle $i$ only comes from $\nabla\log p(x_i)$, whereas in (SVGD) the attraction force comes from $\nabla\log p(x_j)$ for all $j$. Third, for each $j$, in our formulation the repulsive force has an additional factor of 2 in front of $\nabla\log\phi_\epsilon(x_j - x_i)$. Empirically, the additional scaling factors in MIED help produce samples with good separation compared to SVGD, since closer pairs of points receive larger weights. Additionally, since MIED optimizes a finite-dimensional objective (7), we can employ accelerated gradient-based optimizers like Adam (Kingma & Ba, 2014), which we used in our experiments. In contrast, SVGD does not optimize any finite-dimensional objective. While practical SVGD implementations also use optimizers like Adam, it is unclear how the resulting particle dynamics relates to the gradient flow of the KL divergence.
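As a concrete illustration of Algorithm 1 and of the log-sum-exp weights discussed above, here is a minimal PyTorch sketch of one MIED iteration. The function names and signatures are ours, not the authors' released implementation; it assumes `log_p` returns the unnormalized log-density of shape (N,) and `log_phi_eps` maps pairwise differences of shape (N, N, n) to (N, N). For the s-Riesz family one may take, for example, `log_phi_eps = lambda d: -0.5 * s * torch.log((d ** 2).sum(-1) + eps ** 2)`, since normalization constants cancel in the log-sum-exp objective.

```python
import torch

def mied_step(x, log_p, log_phi_eps, optimizer):
    """One MIED iteration in the log domain (a sketch of Algorithm 1).

    x            -- (N, n) particle tensor created with requires_grad=True
    log_p        -- callable returning the unnormalized log-density, shape (N,)
    log_phi_eps  -- callable mapping pairwise differences (N, N, n) to (N, N)
    optimizer    -- e.g. torch.optim.Adam([x], lr=1e-2)
    """
    optimizer.zero_grad()
    diff = x.unsqueeze(0) - x.unsqueeze(1)          # entry [i, j] is x_j - x_i
    lp = log_p(x)
    I = log_phi_eps(diff) - 0.5 * (lp.unsqueeze(0) + lp.unsqueeze(1))
    # The diagonal I_ii may need the special treatment of Section 4 (not shown here).
    loss = torch.logsumexp(I.reshape(-1), dim=0)    # log of (7), up to an additive constant
    loss.backward()                                 # autograd reproduces the direction in (25)
    optimizer.step()
    return float(loss)
```

The gradient produced by `loss.backward()` matches the update direction in (25) up to the diagonal treatment and the optimizer's preconditioning, which is why accelerated optimizers such as Adam can be applied directly.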
C.2 HANDLING CONSTRAINTS WITH DYNAMIC BARRIER METHOD

The dynamic barrier method is introduced in Gong & Liu (2021) to solve $\min_x f(x)$ subject to $g(x) \le 0$, where $g$ is scalar-valued. Intuitively, their method computes update directions by decreasing $g(x)$ when $g(x) > 0$, following the gradient of $f$ when the constraint is satisfied, or balancing both gradient directions. In order to handle multiple constraints such as in Figure 11, we consider a generalized version of their dynamic barrier method. In this generalized setting, $g : \mathbb{R}^n \to \mathbb{R}^m$ is vector-valued and the constraints are $g(x) \le 0$, where the inequality is interpreted coordinate-wise. Suppose we are at iteration $t$ with current solution $x_t$. Then the next update direction $v$ is taken to be the minimizer of
$$\min_{v\in\mathbb{R}^n} \|v - \nabla f(x_t)\|_2^2 \quad\text{s.t.}\quad \nabla g_i(x_t)^\top v \ge \alpha_i\, g_i(x_t), \quad i = 1, \dots, m, \quad (26)$$
where $\alpha_i > 0$ are fixed hyperparameters; in our implementation we simply choose $\alpha_i = 1$. Then $x_{t+1} = x_t - \eta v$ with learning rate $\eta$. In our implementation we use Adam (Kingma & Ba, 2014), which modulates the update directions. Observe that the optimization problem (26) is the same as projecting the point $\nabla f(x_t)$ onto the polyhedron formed by the intersection of the halfspaces $\bigcap_{i=1}^m \{v \in \mathbb{R}^n : \nabla g_i(x_t)^\top v \ge \alpha_i\, g_i(x_t)\}$. To solve (26), we use Dykstra's algorithm (Tibshirani, 2017), which can be interpreted as running coordinate descent on the dual of (26). We use a fixed number of 20 iterations for Dykstra's algorithm, which we found to be sufficient for our experiments; in the case of a single constraint, only one iteration is needed.
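The projection step (26) can be sketched in a few lines. The following NumPy snippet uses our own helper names (with the $\alpha_i = 1$ choice and the 20 Dykstra iterations mentioned above) and is an illustration, not the authors' exact implementation.

```python
import numpy as np

def dynamic_barrier_direction(grad_f, grad_gs, g_vals, alphas, n_iters=20):
    """Solve (26): project grad_f onto the intersection of the halfspaces
    {v : grad_gs[i] @ v >= alphas[i] * g_vals[i]} with Dykstra's algorithm."""
    m = len(g_vals)
    v = grad_f.copy()
    increments = [np.zeros_like(v) for _ in range(m)]
    for _ in range(n_iters):
        for i in range(m):
            w = v + increments[i]
            a, b = grad_gs[i], alphas[i] * g_vals[i]
            # Euclidean projection of w onto the halfspace {v : a @ v >= b}
            slack = a @ w - b
            v = w if slack >= 0 else w + (-slack / (a @ a + 1e-12)) * a
            increments[i] = w - v
    return v  # used as x_{t+1} = x_t - eta * v (possibly modulated by Adam)
```

With a single constraint, one pass of the inner loop already returns the exact halfspace projection, which matches the closed-form update of Gong & Liu (2021).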
D EXPERIMENT DETAILS AND ADDITIONAL RESULTS

D.1 GAUSSIANS IN VARYING DIMENSIONS

We generate the $n$-dimensional Gaussians used to produce Figure 1 as follows. We generate a matrix $A \in \mathbb{R}^{n\times n}$ with i.i.d. entries uniform in $[-1, 1]$, and then rescale $A$ so that $\det(A) = 1$. We then use $A$ as the covariance matrix for the Gaussian (centered at 0). We use Adam with learning rate 0.01 for all methods for a total of 2000 iterations. This is enough for SVGD and MIED to converge, while for KSDD the convergence can be much slower. We visualize the samples from each method for $n = 2$ in Figure 5. We notice that MIED is capable of generating well-separated samples, while for SVGD there is a gap between the inner cluster of samples and a sparse outer ring. For KSDD we see the artifact where too many samples concentrate on the diagonal.

Figure 5: Visualization of samples from each method for a 2D skewed Gaussian.

D.2 COLLAPSED SAMPLES WHEN THE KERNEL WIDTH IS TOO BIG

In Figure 3, we see that samples from SVGD collapse with an adaptive kernel whose variance is taken to be half of the median of the squared distances among all pairs of points (Liu & Wang, 2016); at termination, the median of the squared distances is greater than 1 in that experiment. Here we investigate this issue further. In Figure 6, for the same uniform sampling setup, we visualize the results of SVGD with a fixed kernel width and MIED with Gaussian mollifiers of the same kernel width: when the kernel width (i.e., $2\epsilon^2$ in $\phi^g_\epsilon(x) = \exp(-\|x\|_2^2/(2\epsilon^2))/Z^g_\epsilon$) is too big, both SVGD and MIED produce collapsed samples. This is because, since the target density is constant, the only force in the updates is the repulsive force. When the kernel width is too large, the repulsive force coming from points in the same collapsed cluster is dominated by the repulsive force coming from points in other clusters; this is evident in the update directions shown in the leftmost column of Figure 6 (SVGD with kernel width 0.1). When using the dynamic barrier method of Gong & Liu (2021) to enforce the square constraints instead of reparameterization with tanh, we obtain similar results as in Figure 6.

This pathological phenomenon is not limited to constrained sampling: when sampling the 2D Gaussian from Figure 5, using too big a kernel width can also result in collapsing (Figure 7). We emphasize that our theory of MIED suggests that in practice we need to choose the kernel width very small in order to sample from the correct target measure, according to Theorem 3.3 and Theorem 3.6. In comparison, the theory of SVGD has no such implication.

Figure 6: SVGD and MIED with fixed-size Gaussian kernels for uniform sampling in a square. In each cell of the grid, we plot the samples along with the update directions (black arrows) at that iteration. Rows correspond to iterations 100, 200, 1000. Columns correspond to each method with varying kernel widths (twice the variance of the Gaussian kernel/mollifier) indicated in parentheses.

Figure 7: SVGD and MIED with fixed-size Gaussian kernels for sampling a 2D Gaussian as in Figure 5. Using too big a kernel size can lead to collapsed samples for both SVGD and MIED.

D.3 UNIFORM SAMPLING WITH AN ALTERNATIVE MIRROR MAP

In this section, we show that for sampling from a uniform distribution in the square $[-1, 1]^2$, the results of SVMD/MSVGD (Shi et al., 2021) depend heavily on the choice of the mirror map. Instead of the entropic mirror map used to produce the results in Figure 3, here we use the mirror map $\phi(\theta) = \sum_{i=1}^n \big(\log\frac{1}{1-\theta_i} + \log\frac{1}{1+\theta_i}\big)$ as in Ahn & Chewi (2021). The results are shown in Figure 8. SVMD/MSVGD fail to draw samples near the boundary; we suspect this is because the gradient of the conjugate, $\nabla\phi^*(\eta) = (\sqrt{1+\eta^2} - 1)/\eta$ (coordinate-wise arithmetic), requires coordinates of $\eta$ to go to $\pm\infty$ to land near the boundary. We verify this phenomenon by using $\nabla\phi^*(\eta)$ as the reparameterization map in MIED (rightmost figure in Figure 8): indeed, with such a reparameterization MIED also struggles near the boundary.

Figure 8: Visualization of samples for uniform sampling from a 2D box when using a suboptimal mirror map (SVMD, MSVGD, and MIED(Riesz) with the mirror-map reparameterization). All three methods fail to draw samples near the boundary of the box $[-1, 1]^2$.

D.4 20-DIMENSIONAL DIRICHLET DISTRIBUTION

We sample from the 20-dimensional Dirichlet distribution in the same setup as in Shi et al. (2021) with 50 particles. Results and visualizations are shown in Figure 9. We see that, unlike sampling from a box (Figure 3), both MSVGD and SVMD by Shi et al. (2021) perform well in this experiment. This is because the entropic mirror map used here is a well-tested choice for the simplex constraint, yet obtaining a good mirror map for a generic constraint, even a linear one, can be challenging. Our method does not have such a limitation, as it can easily incorporate existing constrained optimization tools.

Figure 9: Top: W2 distance and energy distance versus the number of iterations for sampling the 20-dimensional Dirichlet distribution with SVMD, MSVGD, MIED(Riesz), MIED(Gaussian), and MIED(Laplace). Bottom: visualization of samples from each method.

D.5 EFFECT OF s FOR RIESZ MOLLIFIERS

When we use the $s$-Riesz family of mollifiers in MIED, we have the freedom of choosing the hyperparameter $s$ so long as $s > n$.
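For reference, the mollifier families appearing in these experiments can be written in the log domain as follows. This is a sketch with our own function names; normalization constants are omitted since they cancel in the log-sum-exp objective of Appendix C, and the Laplace family is analogous. Recall that Section 5 uses $s = n + 10^{-4}$ by default for the Riesz family.

```python
import torch

def log_riesz_mollifier(diff, eps, s):
    # s-Riesz family: phi_eps(x) proportional to (||x||_2^2 + eps^2)^(-s/2).
    return -0.5 * s * torch.log((diff ** 2).sum(-1) + eps ** 2)

def log_gaussian_mollifier(diff, eps):
    # Gaussian family: phi_eps(x) proportional to exp(-||x||_2^2 / (2 eps^2)),
    # i.e. the kernel whose width 2 eps^2 is discussed in Appendix D.2.
    return -(diff ** 2).sum(-1) / (2.0 * eps ** 2)
```

These are the kind of `log_phi_eps` callables assumed by the sketch in Appendix C.1.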
We now study the effect of $s$ on the resulting samples. We consider the problem of sampling from a mixture of four 2D Gaussians centered at $(\pm 1, \pm 1)$, each with diagonal variance 0.3, constrained to the box $[-1, 1]^2$. We vary $s$ in $[2, 10]$ and the number of particles $N$ in $\{100, 200, 500, 1000, 2000\}$. All runs use a total of 1000 iterations with learning rate 0.01. In the top of Figure 10, we plot the W2 distance and the energy distance as functions of $s$ for each $N$. Interestingly, we see the best-performing $s$ lies in $[3, 5]$ and depends on $N$. This suggests that our choice of $s = n + 10^{-4}$ in Section 5 may not be optimal, and there is room for hyperparameter tuning to further improve the performance of MIED with the Riesz kernel. At the bottom of Figure 10 we visualize the samples of MIED with $s = 3$ and of SVGD with kernel width 0.01 (adaptive kernels would result in collapsed samples, and the other widths we tested produced worse samples). While the SVGD samples form visible artifacts, the samples of MIED are evenly distributed, and the four modes of the mixture emerge as $N$ increases.

Figure 10: Ablation study of the hyperparameter $s$ for MIED with the $s$-Riesz family of mollifiers. Top: W2 distance and energy distance as functions of $s$ for $N \in \{100, 200, 500, 1000, 2000\}$. Bottom: visualization of samples of MIED with a Riesz mollifier with $s = 3$ and of SVGD with kernel width 0.01.

D.6 MORE CONSTRAINED SAMPLING EXPERIMENTS

In this section we test MIED on more low-dimensional constrained sampling problems and qualitatively assess the results. Note that mirror LMC (Zhang et al., 2020; Ahn & Chewi, 2021) and mirror SVGD (Shi et al., 2021) cannot be applied due to the non-convexity of the constraints. In Figure 11, we consider uniform sampling of a challenging 2D region with initial samples drawn from the top-right corner: as the number of iterations increases, MIED gradually propagates samples to fill the entire region. In Figure 12, we consider sampling from a von Mises-Fisher distribution on the unit sphere. Although our theory focuses on sampling from a full-dimensional distribution, as discussed in Remark A.10, we can extend Theorem 3.3 to the case of a sphere due to its symmetry. The samples visualized in Figure 12 capture the two modes that emerge by restricting the Gaussian to the unit sphere.

D.7 DETAILS ON FAIRNESS BAYESIAN NEURAL NETWORK EXPERIMENT

We use an 80%/20% training/test split as in Liu et al. (2021). We use the source code provided by Liu et al. (2021) with default hyperparameters. The source code provided by the authors of Liu et al. (2021) does not implement the calculation of $g(\theta)$ faithfully as written in the formula, so we corrected it. All methods use 2000 iterations for training. For our method we use learning rate 0.001. One of their four methods (Control+SVGD) got stuck at initialization (with accuracy around 0.75), so we omit its result from the plot in Figure 4.

Figure 11: Uniform sampling of the region $\{(x, y) \in [-1, 1]^2 : (\cos(3\pi x) + \cos(3\pi y))^2 < 0.3\}$ using MIED with a Riesz mollifier ($s = 3$), where the constraint is enforced using the dynamic barrier method. The plot in row $i$, column $j$ shows the samples at iteration $100 + 200(6i + j)$. The initial samples are drawn uniformly from the top-right square $[0.5, 1.0]^2$.
Figure 12: Sampling from the von Mises-Fisher distribution obtained by constraining the 3-dimensional Gaussian from Appendix D.1 to the unit sphere. The unit-sphere constraint is enforced using the dynamic barrier method, and the results shown are obtained using MIED with the Riesz kernel and $s = 3$. The six plots are views from six evenly spaced azimuthal angles.
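To connect the two constrained examples above back to the generalized dynamic barrier step of Appendix C.2, here is one possible encoding of their constraint functions $g$ under the convention $g(x) \le 0$. The exact encoding used in our experiments (e.g., whether the box is added as explicit constraints, or how the equality constraint is expressed) is not specified in the text, so this is an assumption for illustration; the gradients $\nabla g_i$ required by (26) can be obtained analytically or by automatic differentiation.

```python
import numpy as np

def g_figure11(x):
    # Inequality constraints (<= 0) describing the Figure 11 region
    # {(x, y) in [-1, 1]^2 : (cos(3*pi*x) + cos(3*pi*y))^2 < 0.3}.
    region = (np.cos(3 * np.pi * x[0]) + np.cos(3 * np.pi * x[1])) ** 2 - 0.3
    box = np.abs(x) - 1.0                      # coordinate-wise box constraints
    return np.concatenate(([region], box))

def g_sphere(x):
    # One way to encode the unit-sphere equality constraint of Figure 12 as
    # two inequalities: ||x||^2 - 1 <= 0 and 1 - ||x||^2 <= 0.
    r = float(np.dot(x, x)) - 1.0
    return np.array([r, -r])
```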