# learning_rate_free_sampling_in_constrained_domains__78d2c39f.pdf

Learning Rate Free Sampling in Constrained Domains

Louis Sharrock Department of Mathematics and Statistics Lancaster University, UK l.sharrock@lancaster.ac.uk

Lester Mackey Microsoft Research New England Cambridge, MA lmackey@microsoft.com

Christopher Nemeth Department of Mathematics and Statistics Lancaster University, UK c.nemeth@lancaster.ac.uk

We introduce a suite of new particle-based algorithms for sampling in constrained domains which are entirely learning rate free. Our approach leverages coin betting ideas from convex optimisation, and the viewpoint of constrained sampling as a mirrored optimisation problem on the space of probability measures. Based on this viewpoint, we also introduce a unifying framework for several existing constrained sampling algorithms, including mirrored Langevin dynamics and mirrored Stein variational gradient descent. We demonstrate the performance of our algorithms on a range of numerical examples, including sampling from targets on the simplex, sampling with fairness constraints, and constrained sampling problems in postselection inference. Our results indicate that our algorithms achieve competitive performance with existing constrained sampling methods, without the need to tune any hyperparameters.

1 Introduction

The problem of sampling from unnormalised probability distributions is of central importance to computational statistics and machine learning. Standard approaches to this problem include Markov chain Monte Carlo (MCMC) [11, 76] and variational inference (VI) [41, 96]. More recently, there has been growing interest in particle-based variational inference (Par VI) methods, which deterministically evolve a collection of particles towards the target distribution and combine favourable aspects of both MCMC and VI. Perhaps the most well-known of these methods is Stein variational gradient descent (SVGD) [63], which iteratively updates particles according to a form of gradient descent on the Kullback-Leibler (KL) divergence. This approach has since given rise to several extensions [16, 27, 97, 105] and found success in a wide range of applications [27, 66, 74].

While such methods have enjoyed great success in sampling from unconstrained distributions, they typically break down when applied to constrained targets [see, e.g., 60, 82]. This is required for targets that are not integrable over the entire Euclidean space (e.g., the uniform distribution), when the target density is undefined outside a particular domain (e.g., the Dirichlet distribution), or when the target must satisfy certain additional constraints (e.g., fairness constraints in Bayesian inference [65]). Notable examples include latent Dirichlet allocation [9], ordinal data models [45], regularised regression [15], survival analysis [49], and post-selection inference [54, 88]. In recent years, there has been increased interest in this topic, with several new methodological and theoretical developments [1, 19, 43, 44, 60, 61, 82, 86, 103, 104].

37th Conference on Neural Information Processing Systems (Neur IPS 2023).

One limitation shared by all of these methods is that their performance depends, often significantly, on a suitable choice of learning rate. In principle, convergence rates for existing constrained sampling schemes (e.g., [1, Theorem 1] or [86, Corollary 1]) allow one to compute the optimal learning rate for a given problem. In practice, however, this optimal learning rate is a function of the unknown target and thus cannot be computed. Motivated by this problem, in this paper we introduce a suite of new sampling algorithms for constrained domains which are entirely learning rate free.

Our contributions We first propose a unifying perspective of several existing constrained sampling algorithms, based on the viewpoint of sampling in constrained domains as a mirrored optimisation problem on the space of probability measures. Based on this perspective, we derive a general class of solutions to the constrained sampling problem via the mirrored Wasserstein gradient flow (MWGF). The MWGF includes, as special cases, several existing constrained sampling methods, e.g., mirrored Langevin dynamics [43] and mirrored SVGD (MSVGD) [82]. Using this formulation, we also introduce mirrored versions of several other particle-based sampling algorithms which are suitable for sampling in constrained domains, and study their convergence properties in continuous time.

By leveraging this perspective, and extending the coin sampling methodology recently introduced in [81], we then obtain a suite of learning rate free algorithms suitable for constrained domains. In particular, we introduce mirrored coin sampling, which includes as a special case Coin MSVGD, as well as several other tuning-free constrained sampling algorithms. We also outline an alternative approach which does not rely on a mirror mapping, namely, coin mollified interaction energy descent (Coin MIED). This algorithm can be viewed as the natural coin sampling analogue of the method recently introduced in [60]. Finally, we demonstrate the performance of our methods on a wide range of examples, including sampling on the simplex, sampling with fairness constraints, and constrained sampling problems in post-selection inference. Our empirical results indicate that our methods achieve competitive performance with the optimal performance of existing approaches, with no need to tune a learning rate.

2 Preliminaries

The Wasserstein Space. We will make use of the following notation. For any X Rd, we write P2(X) for the set of probability measures µ on X with finite 2nd moment: R

X ||x||2µ(dx) < , where || || denotes the Euclidean norm. For µ P2(X), we write L2(µ) for the set of measurable f : X X with finite ||f||2 L2(µ) = R

X ||f(x)||2µ(dx). For f, g L2(µ), we also define f, g L2(µ) = R

X f(x) g(x)µ(dx). Given µ P2(X) and T L2(µ), we write T#µ for the pushforward of µ under T, that is, the distribution of T(X) when X µ.

Given µ, ν P2(X), the Wasserstein 2-distance between µ and ν is defined according to W 2 2 (µ, ν) = infγ Γ(µ,ν) R

X X ||x y||2γ(dx, dy), where Γ(µ, ν) denotes the set of couplings between µ and ν. The Wasserstein distance W2 is indeed a distance over P2(X) [2, Chapter 7.1], with the metric space (P2(X), W2) known as the Wasserstein space. Given a functional F : P(X) ( , ], we will write W2F(µ) for the Wasserstein gradient of F at µ, which exists under appropriate regularity conditions [2, Lemma 10.4.1]. We refer to App. A for further details on calculus in the Wasserstein space.

Wasserstein Gradient Flows. Suppose that π P2(X) and that π admits a density π e V with respect to Lebesgue measure,1 where V : X R is a smooth potential function. The problem of sampling from π can be recast as an optimisation problem over P2(X) [e.g., 99]. In particular, one can view the target π as the solution of

π = arg min µ P2(X) F(µ) (1)

where F : P(X) ( , ] is a functional uniquely minimised at π. In the unconstrained case, X = Rd, a now rather classical method for solving this problem is to identify a continuous process which pushes samples from an initial distribution µ0 to the target π. This corresponds to finding a family of vector fields (vt)t 0, with vt L2(µt), which transports µt to π via the continuity equation

t + (vtµt) = 0. (2)

1In a slight abuse of notation, throughout this paper we will use the same letter to refer to a distribution and its density with respect to Lebsegue measure.

A standard choice is vt = W2F(µt), in which case the solution (µt)t 0 of (2) is referred to as the Wasserstein gradient flow (WGF) of F. In fact, for different choices of (vt)t 0, one can obtain the continuous-time limit of several popular sampling algorithms [3, 20, 40, 46, 51, 63, 99]. Unfortunately, it is not straightforward to extend these approaches to the case in which X Rd is a constrained domain. For example, the vector fields vt may push the random variable xt outside of its support X, thus rendering all future updates undefined.

3 Mirrored Wasserstein Gradient Flows

We now outline one way to extend the WGF approach to the constrained case, based on the use of a mirror map. We will first require some basic definitions from convex optimisation [e.g., 77]. Let X be a closed, convex domain in Rd. Let ϕ : X R { } be a proper, lower semicontinuous, strongly convex function of Legendre type [77]. This implies, in particular, that ϕ(X) = Rd

and ϕ : X Rd is bijective [77, Theorem 26.5]. Moreover, its inverse ( ϕ) 1 : Rd X satisfies ( ϕ) 1 = ϕ , where ϕ : Rd R denotes the Fenchel conjugate of ϕ, defined as ϕ (y) = supx X x, y ϕ(x). We will refer to ϕ : X Rd as the mirror map and ϕ(X) = Rd as the dual space.

3.1 Constrained Sampling as Optimisation

Using the mirror map ϕ : X Rd, we can now reformulate the constrained sampling problem as the solution of a mirrored version of the optimisation problem in (1). Let us define ν = ( ϕ)#π, with π = ( ϕ )#ν. We can then view the target π as the solution of

π = ( ϕ )#ν, ν = arg min η P2(Rd) F(η), (3)

where now F : P2(Rd) ( , ] is a functional uniquely minimised by ν = ( ϕ)#π. The motivation for this formulation is clear. Rather than directly solving a constrained optimisation problem for the target π P2(X) as in (1), we could instead solve an unconstrained optimisation problem for the mirrored target ν P2(Rd), and then recover the target via π = ( ϕ )#ν. This is a rather natural extension of the formulation of sampling as optimisation to the constrained setting.

3.2 Mirrored Wasserstein Gradient Flows

One way to solve the mirrored optimisation problem in (3) is to find a continuous process which transports samples from η0 := ( ϕ)#µ0 P2(Rd) to the mirrored target ν. According to the mirror map, this process will implicitly also push samples from µ0 to the target π. More precisely, we would like to find (wt)t 0, where wt : Rd Rd, which evolves ηt on Rd and thus µt on X, through

t + (wtηt) = 0 , µt = ( ϕ )#ηt (4)

where (wt)t 0 should ensure convergence of ηt to ν. In the case that wt = W2F(ηt), we will refer to (4) as the mirrored Wasserstein gradient flow (MWGF) of F. The idea of (4) is to simulate the WGF in the (unconstrained) dual space and then map back to the primal (constrained) space using a carefully chosen mirror map. This approach bypasses the inherent difficulties associated with simulating a gradient flow on the constrained space itself.

While a similar approach has previously been used to derive certain mirrored sampling schemes [43, 82, 86], these works focus on specific choices of F and (wt)t 0, while we consider the general case. Our general perspective provides a unifying framework for these approaches, whilst also allowing us to obtain new constrained sampling algorithms and to study their convergence properties (see App. B). We now consider one such approach in detail.

3.2.1 Mirrored Stein Variational Gradient Descent

MSVGD [82] is a mirrored version of the popular SVGD algorithm. In [82], MSVGD is derived using a mirrored Stein operator in a manner analogous to the original SVGD derivation [63]. Here, we take an alternative perspective, which originates in the MWGF. This formulation is more closely aligned with other more recent papers analysing the non-asymptotic convergence of SVGD [51, 78, 85, 86].

Algorithm 1 MSVGD

input: target density π, kernel k, mirror function ϕ, particles (xi 0)N i=1 µ0, step size γ. initialise: yi 0 = ϕ(xi 0) for i [N] for t = 0, . . . , T 1 do

For i [N], yi t+1 = yi t γPˆηt,kϕ log ˆηt

ν (yi t), where ˆηt = 1

N PN j=1 δyj t .

For i [N], xi t+1 = ϕ (yi t+1). return (xi T )N i=1.

We will require the following additional notation. Let k : X X R denote a positive semi-definite kernel and Hk the corresponding reproducing kernel Hilbert space (RKHS) of real-valued functions f : X R, with inner product , Hk and norm || ||Hk. Let H := Hd k denote the Cartesian product RKHS, consisting of elements f = (f1, . . . , fd) for fi Hk. We write Sµ,k : L2(µ) H for the operator Sµ,kf = R k(x, )f(x)dµ(x). We assume R k(x, x)dµ(x) < for µ P2(X), in which case H L2(µ) for all µ P2(X) [52, Sec. 3.1]. We also define the identity embedding i : H L2(µ) with adjoint i = Sµ,k. Finally, we define Pµ,k : L2(µ) L2(µ) as Pµ,k = i Sµ,k.

We are now ready to introduce our perspective on MSVGD. We begin by introducing the continuoustime MSVGD dynamics, which are defined by the continuity equation

t Pηt,kϕ log ηt

ηt = 0, µt = ( ϕ )#ηt, (5)

where kϕ(y, y ) = k( ϕ (y), ϕ (y )) for some base kernel k : X X R. Clearly, this is a special case of the MWGF in (4), in which (wt)t 0 are defined according to wt = Pηt,kϕ W2F(ηt), with F(ηt) = KL(ηt||ν). In particular, wt can be interpreted as the negative Wasserstein gradient of KL(ηt||ν), the KL divergence of the ηt with respect to ν, under the inner product of Hkϕ.

The key point here is that, given samples from ηt, estimating Pηt,kϕ log( ηt

ν ) is straightforward. In particular, if lim||y || kϕ(y , )ν(y ) = 0, then, integrating by parts,

Pηt,kϕ log(ηt

Rd[kϕ(y , y) y log ν(y ) + y kϕ(y , y)]dηt(y ). (6)

To obtain an implementable algorithm, we must discretise in time. By applying an explicit Euler discretisation to (5), we arrive at

ηt+1 = id γPηt,kϕ log ηt

# ηt, µt+1 = ( ϕ )#ηt+1, (7)

where γ > 0 is a step size or learning rate, and id is the identity map. This is the population limit of the MSVGD algorithm in [82]. Now, suppose µ0 P2(X) admits a density with respect to the Lebesgue measure. In addition, suppose x0 µ0 and y0 = ϕ(x0). Then, for t N0, if we define

yt+1 = yt γPηt,kϕ log ηt

(yt), xt+1 = ϕ (yt+1), (8)

then (ηt)t N0 and (µt)t N0 in (7) are precisely the densities of (yt)t N0 and (xt)t N0 in (8). In practice, these densities are unknown and must be estimated using samples. In particular, to approximate (8), we initialise a collection of particles xi 0 i.i.d. µ0 for i [N], set yi 0 = ϕ(xi 0), and then update

yi t+1 = yi t γPˆηt,kϕ log ˆηt

(yi t), xi t+1 = ϕ (yi t+1) (9)

where ˆηt = 1

N PN j=1 δyj t . This is precisely the MSVGD algorithm introduced in [82] and since also studied in [86].

3.2.2 Other Algorithms

In App. B, we outline several algorithms which arise as special cases of the MWGF. These include existing algorithms, such as the mirrored Langevin dynamics (MLD) [43] and MSVGD [82], and two new algorithms, which correspond to mirrored versions of Laplacian adjusted Wasserstein gradient descent (LAWGD) [20] and kernel Stein discrepancy descent (KSDD) [51]. We also study the convergence properties of these algorithms in continuous time.

4 Mirrored Coin Sampling

By construction, any algorithm derived as a discretisation of the MWGF, including MSVGD, will depend on an appropriate choice of learning rate γ. In this section, we consider an alternative approach, leading to new algorithms which are entirely learning rate free. Our approach can be viewed as an extension of the coin sampling framework introduced in [81] to the constrained setting, based on the reformulation of the constrained sampling problem as a mirrored optimisation problem.

4.1 Coin Sampling

We begin by reviewing the general coin betting framework [22, 71, 72]. Consider a gambler with initial wealth w0 = 1 who makes bets on a series of adversarial coin flips ct { 1, 1}, where +1 denotes heads and 1 denotes tails. We encode the gambler s bet by xt R. In particular, sign(xt) { 1, 1} denotes whether the bet is on heads or tails, and |xt| R denotes the size of the bet. Thus, in the tth round, the gambler wins |xtct| if sign(ct) = sign(xt) and loses |xtct| otherwise. Finally, we write wt for the wealth of the gambler at the end of the tth round. The gambler s wealth thus accumulates as wt = 1 + Pt s=1 csxs. We assume that the gambler s bets satisfy xt = βtwt 1, for a betting fraction βt [ 1, 1], which means the gambler does not borrow any money.

In [71], the authors show how this approach can be used to solve convex optimisation problems on Rd, that is, problems of the form x = arg minx Rd f(x), for convex f : Rd R. In particular, [71] consider a betting game with outcomes ct = f(xt), replacing scalar multiplications ctxt by scalar products ct, xt . In this case, under certain assumptions, they show that f( 1

T PT t=1 xt) f(x ) at a rate determined by the betting strategy. There are many suitable choices for the betting strategy. Perhaps the simplest of these is βt = t 1 Pt 1 s=1 cs, known as the Krichevsky-Trofimov (KT) betting strategy [53, 71], which yields the sequence of bets xt = βtwt 1 = t 1 Pt 1 s=1 cs(1 + Pt 1 s=1 cs, xs ). Other choices, however, are also possible [22, 70, 72].

In [81], this approach was extended to solve optimisation problems on the space of probability measures. In this case, several further modifications are necessary. First, in each round, one now bets xt x0 rather than xt, where x0 µ0 is distributed according to some µ0 P2(Rd). In this case, viewing xt : Rd Rd as a function that maps x0 7 xt(x0), one can define a sequence of distributions (µt)t N as the push-forwards of µ0 under the functions (xt)t N. In particular, this means that, if x0 µ0, then xt µt.

By using these modifications in the original coin betting framework, and choosing (ct)t N := (cµt(xt))t N appropriately, [81] show that it is now possible to solve problems of the form µ = arg minµ P2(X) F(µ). This approach is referred to as coin sampling. For example, by considering a betting game with ct = W2F(µt)(xt), [81] obtain a coin Wasserstein gradient descent algorithm, which can be viewed as a learning-rate free analogue of Wasserstein gradient descent. Alternatively, setting ct = Pµt,k W2F(µt)(xt), one can obtain a learning-rate free analogue of the population limit of SVGD, in which the updates are given by

xt = x0 Pt 1 s=1 Pµs,k W2F(µs)(xs)

s=1 Pµs,k W2F(µs)(xs), xs x0

where, as described above, µt = (xt)#µ0. In practice, the densities (µt)t N0 and thus the outcomes (ct)t N are almost always intractable, and will be approximated using a set of interacting particles.

4.2 Mirrored Coin Sampling

By leveraging the formulation of constrained sampling as a mirrored optimisation problem over the space of probability measures, we can extend the coin sampling approach to constrained domains. In particular, as an alternative to discretising the MWGF to solve this optimisation problem (Sec. 3.2), we propose instead to use a mirrored version of coin sampling (Sec. 4.1). The idea is to use coin sampling to solve the optimisation problem in the dual space, and then map back to the primal space using the mirror map. In particular, this suggests the following scheme. Let x0 µ0 P2(X), and y0 = ϕ(x0) Rd. Then, for t N, update

yt = y0 + Pt 1 s=1 cηs(ys)

s=1 cηs(ys), ys y0

, xt = ϕ (xt), (11)

Algorithm 2 Coin MSVGD

input: target density π, kernel k, mirror function ϕ, particles (xi 0)N i=1 µ0. intialise: yi 0 = ϕ(xi 0) for i [N] for t = 1, . . . , T do

For i [N], yi t = yi 0

Pt 1 s=1 Pˆ ηs,kϕ log( ˆ ηs

ν )(yi s) t 1 Pt 1 s=1 Pˆηs,kϕ log( ˆηs

ν )(yi s), yi s yi 0 .

For i [N], xi t = ϕ (yi t). return (xi T )N i=1.

where ηt = (yt)#η0 and µt = ( ϕ )#ηt, and where cηs(ys) = W2F(ηs)(ys), or some variant thereof. We will refer to this approach as mirrored coin sampling, or mirrored coin Wasserstein gradient descent. In order to obtain an implementable algorithm, we proceed as follows. First, decide on a suitable functional F and a suitable (ct)t N := (cηt(yt))t N. For example, F(η) = KL(η|ν), and (ct)t N = ( Pηt,kϕ W2F(ηt)(yt))t N. Then, substitute these into (11), and approximate the updates using a set of interacting particles, as in existing Par VIs algorithms.

Following these steps, we obtain a learning rate free analogue of MSVGD [82], which we will refer to as Coin MSVGD. This is summarised in Alg. 2. For different choices of (cηt)t N in (11), we also obtain learning rate free analogues of two other mirrored Par VI algorithms introduced in App. B, namely mirrored LAWGD (App. B.3), and mirrored KSDD (App. B.4). These algorithms, which we term Coin MLAWGD and Coin MKSDD, are given in Alg. 6 and Alg. 7 in App. D.

Even in the unconstrained setting, establishing the convergence of coin sampling remains an open problem. In [81], the authors provide a technical sufficient condition under which it is possible to establish convergence to the target measure in the population limit; however, it is difficult to verify this condition in general. In the interest of completeness, in App. C, we provide an analogous convergence result for mirrored coin sampling, adapted appropriately to the constrained setting.

4.3 Alternative Approaches

In this section, we outline an alternative approach, based on a coinification of the mollified interaction energy descent (MIED) method recently introduced in [60]. MIED is based on minimising a function known as the mollified interaction energy (MIE) Eϵ : P(Rd) [0, ]. The idea is that minimising this function balances two forces: a repulsive force which ensures the measure is well spread, and an attractive force which ensures the measure is concentrated around regions of high density. In order to obtain a practical sampling algorithm, [60] propose to minimise a discrete version of the logarithmic MIE, using, e.g., Adagrad [31] or Adam [48]. That is, minimising the function

log Eϵ(ωn) := log 1

j=1 ϕϵ(xi xj)(π(xi)π(xj)) 1

where ωn = {x1, . . . , xn}, and where (ϕϵ)ϵ>0 is a family of mollifiers [60, Definition 3.1]. This approach can be adapted to handle the constrained case, as outlined in [60]. In particular, if there exists a differentiable f : Rd X such that X \ f(Rd) has Lebesgue measure zero, then one can reduce the constrained problem to an unconstrained one by minimising log Eϵ(f(wn)), where f(ωn) = {f(x1), . . . , f(xn)}. Meanwhile, if X = {x Rd : g(x) 0} for some differentiable g : Rd R, then one can use a variant of the dynamic barrier method introduced in [37].

Here, as an alternative to the optimisation methods considered in [60], we instead propose to use a learning-rate free algorithm to minimise (12), based on the coin betting framework described in 4.1. We term the resulting method, summarised in Alg. 3, Coin MIED. In comparison to our mirrored coin sampling algorithm, this approach has the advantage that it only requires a differentiable (not necessarily bijective) map.

5 Related Work

Sampling as Optimisation. The connection between sampling and optimisation has a long history, dating back to [46], which established that the evolution of the law of the Langevin diffusion

Algorithm 3 Coin MIED

Input: target density π, mollifier ϕϵ, particles ωn 0 = (xi 0)N i=1. for t = 1, . . . , T do

For i [N], xi t = xi 0

Pt 1 s=1 xis log Eϵ(ωn s )

t 1 Pt 1 s=1 xis log Eϵ(ωn s ), xi s xi 0 .

return (xi T )N i=1.

corresponds to the WGF of the KL divergence. In recent years, there has been renewed interest in this perspective [99]. In particular, the viewpoint of existing algorithms such as LMC [23, 24, 33, 34] and SVGD [32, 63] as discretisations of WGFs has proved fruitful in deriving sharp convergence rates [6, 34, 52, 78, 83, 85]. It has also resulted in the rapid development of new sampling algorithms, inspired by ideas from the optimisation literature such as proximal methods [100], coordinate descent [29, 30], Nesterov acceleration [18, 25, 67], Newton methods [68, 84, 98], and coin betting [80, 81].

Sampling in Constrained Domains. Recent interest in constrained sampling has resulted in a range of general-purpose algorithms for this task [e.g., 43, 60, 82, 103, 104]. In cases where it is possible to explicitly parameterise the target domain in a lower dimensional space, one can use variants of classical methods, including LMC [98], rejection sampling [28], Hamiltonian Monte Carlo (HMC) [13], and Riemannian manifold HMC [55, 73]. Several other approaches are based on the use of mirror map. In particular, [43] proposed mirrored Langevin dynamics and its first-order discretisation. [103], and an earlier draft of [43], introduced the closely related mirror Langevin diffusion and the mirror Langevin algorithm, which has since also been studied in [1, 19, 44, 61]. Along the same lines, [82] introduced two variants of SVGD suitable for constrained domains based on the use of a mirror map, and established convergence of these schemes to the target distribution; see also [86]. In certain cases [e.g., 56, 58] one cannot explicitly parameterise the manifold of interest. In this setting, different approaches are required. [12] introduced a constrained version of HMC for this case. Meanwhile, [102] proposed a constrained Metropolis-Hastings algorithm with a reverse projection check to ensure reversibility; this approach has since been extended in [57, 59]. Other, more recent approaches suitable for this setting include MIED [60] (see Sec. 4.3), and O-SVGD [104].

6 Numerical Results

In this section, we perform an empirical evaluation of Coin MSVGD (Sec. 6.1 - 6.2) and Coin MIED (Sec. 6.3). We consider several simulated and real data experiments, including sampling from targets defined on the simplex (Sec. 6.1), confidence interval construction for post-selection inference (Sec. 6.2), and inference in a fairness Bayesian neural network (Sec. 6.3). We compare our methods to their learning-rate-dependent analogues, namely, MSVGD [82] and MIED [60]. We also include a comparison with Stein Variational Mirror Descent (SVMD) [82], and projected versions of Coin SVGD [81] and SVGD [63], which include a Euclidean projection step to ensure the iterates remain in the domain. We provide additional experimental details in App. F and additional numerical results in App. G. Code to reproduce all of the numerical results can be found at https://github.com/louissharrock/constrained-coin-sampling.

6.1 Simplex Targets

Following [82], we first test the performance of our algorithms on two 20-dimensional targets defined on the simplex: the sparse Dirichlet posterior of [73] and the quadratic simplex target of [1]. We employ the IMQ kernel and the entropic mirror map [7]; and use N = 50 particles, T = 500 iterations. In Fig. 1 and Fig. 6 - 7 (App. F.1), we plot the energy distance [87] to a set of surrogate ground truth samples, either obtained i.i.d. (sparse Dirichlet target) or using the No-U-Turn Sampler (NUTS) [42] (quadratic target). After T = 500 iterations, Coin MSVGD has comparable performance to MSVGD and SVMD with optimal learning rates but significantly outperforms both algorithms for sub-optimal learning rates (Fig. 1). In both cases, the projected methods fail to converge. For the sparse Dirichlet posterior, Coin MSVGD converges more rapidly than MSVGD and SVMD, even for well-chosen values of the learning rate (Fig. 6a in App. F.1). Meanwhile, for the quadratic target, Coin MSVGD generally converges more rapidly than MSVGD but not as fast as SVMD [82], which takes advantage of the log-concavity of the target in the primal space (Fig. 6a in App. F.1).

10 5 10 4 10 3 10 2 10 1 100

Learning Rate

Energy Distance

Coin MSVGD Proj. Coin SVGD MSVGD SVMD Proj. SVGD

(a) Sparse Dirichlet Posterior.

10 5 10 4 10 3 10 2 10 1 100

Learning Rate

Energy Distance

Coin MSVGD Proj. Coin SVGD MSVGD SVMD Proj. SVGD

(b) Quadratic Target.

Figure 1: Results for the simplex targets in [73] and [1]. Posterior approximation quality for Coin MSVGD, MSVGD, projected Coin SVGD, projected SVGD, and SVMD.

6.2 Confidence Intervals for Post Selection Inference

We next consider a constrained sampling problem arising in post-selection inference [54, 88]. Suppose we are given data (X, y) Rn p Rn. We are interested in obtaining valid confidence intervals (CIs) for regression coefficients obtained via the randomised Lasso [89], defined as the solution of

ˆβ = arg min β Rp

2||y Xβ||2 2 + λ||β||1 ω β + ε

where λ, ε R+ are penalty parameters and ω Rp is an auxiliary random vector. Let ˆβE Rq represent the non-zero coefficients of ˆβ and ˆz E = sign(ˆβE) for their signs. If the support E were known a priori, we could determine CIs for βE using the asymptotic normality of βE = (X E XE) 1X E y. However, when E is based on data, these classical CIs will no longer be valid. In this case, one approach is to condition on the result of the initial model selection, i.e., knowledge of E and ˆz E. In practice, this means sampling from the so-called selective distribution, which has support D = {(ˆβE, ˆz E) : sign(ˆβE) = ˆz E , ||ˆz E|| 1}, and density proportional to [e.g., 90, Sec. 4.2]

ˆg(ˆβE, ˆz E) g ε(ˆβE, 0) X (y XE ˆβE) + λ(ˆz E, ˆz E) . (14)

Synthetic Example. We first consider the model setup described in [79, Sec. 5.3]; see App. F.2 for full details. In Fig. 2, we plot the energy distance between samples obtained using Coin MSVGD and projected Coin SVGD, and a set of 1000 samples obtained using NUTS [42], for a two-dimensional selective distribution. Similar to our previous examples, the performance of MSVGD is very sensitive to the choice of learning rate. If the learning rate is too small, the particle converge rather slowly; on the other hand, if the learning rate is too big, the particles may not even converge. Meanwhile, Coin MSVGD converges rapidly to the true target, with no need to tune a learning rate. Similar to the examples considered in Sec. 6.1, the projected methods once again fail to converge to the true target, highlighting the need for the mirrored approach.

0 200 400 600 800 1000 Iterations

Energy Distance

Coin MSVGD MSVGD (LR=0.1) MSVGD (LR=0.05) MSVGD (LR=0.01) MSVGD (LR=0.005)

(a) Coin MSVGD.

0 200 400 600 800 1000 Iterations

Energy Distance

Proj. Coin SVGD Proj. SVGD (LR=0.1) Proj. SVGD (LR=0.05) Proj. SVGD (LR=0.01) Proj. SVGD (LR=0.005)

(b) Projected Coin SVGD.

Figure 2: Results for a two-dimensional post-selection inference target. Energy distance versus iterations for (a) Coin MSVGD and MSVGD and (b) projected Coin SVGD and projected SVGD, for a selective distribution of the randomised Lasso in which two features are selected.

0.80 0.85 0.90 0.95 1.00 Nominal coverage

Actual coverage

Coin MSVGD MSVGD SVMD Standard

(a) Actual vs Nominal Coverage

1000 1500 2000 2500 3000 Number of Samples

Actual Coverage

Coin MSVGD MSVGD SVMD Standard

(b) Coverage vs Number of Samples

Figure 3: Coverage results for post-selection inference. Coverage of the post-selection confidence intervals obtained by Coin MSVGD, MSVGD, and SVMD, for 100 repeats of the simulation in [79, Sec. 5.3].

In Fig. 3, we plot the coverage of the CIs obtained using Coin MSVGD, MSVGD, SVMD, and the norejection MCMC algorithm in selective Inference [91], as we vary the nominal coverage or the total number of samples. For MSVGD and SVMD, we use RMSProp [92] to adapt the learning rate, with an initial learning rate of γ = 0.1, following [82]. As the nominal coverage varies, Coin MSVGD achieves similar coverage to MSVGD and SVMD; and significantly higher actual coverage than norejection (Fig. 3a). Meanwhile, as the number of samples varies, Coin MSVGD, MSVGD, and SVMD consistently all obtain a higher coverage than the fixed nominal coverage of 90% (Fig. 3b). This is important for small sample sizes, where the standard approach undercovers. The performance of MSVGD and SVMD is, of course, highly dependent on an appropriate choice of learning rate. In Fig. 11 (App. G.3), we provide additional plots illustrating how the coverage of the CIs obtained using MSVGD and SVMD can deteriorate when the learning-rate is chosen sub-optimally.

Real Data Example. We next consider a post-selection inference problem involving the HIV-1 drug resistance dataset studied in [8, 75]; see App. F.2 for full details. The goal is to identify statistically significant mutations associated with the response to the drug Lamivudine (3TC). The randomised Lasso selects a subset of mutations, for which we compute 90% CIs using 5000 samples and five methods: unadjusted (the unadjusted CIs), standard (the method in selective Inference), MSVGD, SVMD, and Coin MSVGD. Our results (Fig. 4) indicate that the CIs obtained by Coin MSVGD are similar to those obtained via the standard approach, which we view as a benchmark, as well as by MSVGD and SVMD (with a well chosen learning rate). Importantly, they differ from the CIs obtained using the unadjusted approach (see, e.g., mutation P65R or P184V in Fig. 4). Meanwhile, for sub-optimal choices of the learning rate (Fig. 4b and Fig. 4c, Fig. 12 in App. G.3), the CIs obtained using MSVGD and SVMD differ substantially from those obtained using the baseline standard approach and by Coin MSVGD, once more highlighting the sensitivity of these methods to the choice of an appropriate learning rate.

2.4 Coin MSVGD MSVGD SVMD Unadjusted Standard

(a) γ = 0.01.

P41L P62V P65R 0.2

Coin MSVGD MSVGD SVMD Unadjusted Standard

(b) γ = 0.001.

P41L P62V P65R 0.2

Coin MSVGD MSVGD SVMD Unadjusted Standard

(c) γ = 1.0.

Figure 4: Real data results for post-selection inference. Confidence intervals for the mutations selected by the randomised Lasso as candidates for HIV-1 drug resistance. We report (a) all mutations when MSVGD and SVMD use a well chosen learning rate (γ = 0.01) and (b) - (c) a subset of mutations when MSVGD and SVMD use a smaller (γ = 0.001) or larger (γ = 1.0) learning rate.

0.825 0.830 0.835 0.840 0.845

Primal-Dual + Langevin Control + Langevin Primal-Dual + SVGD MIED Coin MIED

(a) Trade off curves.

0 500 1000 1500 2000 Iterations

Coin MIED (t=1e-2) Coin MIED (t=1e-3) Coin MIED (t=1e-4) Coin MIED (t=1e-6)

MIED (t=1e-2) MIED (t=1e-3) MIED (t=1e-4) MIED (t=1e-6)

(b) Constraint vs. Iterations.

0 500 1000 1500 2000 Iterations

Riesz Energy

Coin MIED (t=1e-2) Coin MIED (t=1e-3) Coin MIED (t=1e-4) Coin MIED (t=1e-6)

MIED (t=1e-2) MIED (t=1e-3) MIED (t=1e-4) MIED (t=1e-6)

(c) Riesz Energy vs. Iterations.

Figure 5: Results for the fairness Bayesian neural network. (a) Trade-off curves showing disparate impact versus test accuracy for MIED, Coin MIED, and the methods in [65]. (b) - (c) Cov(x,y,z) D[z, ˆy(x; θ)]2 and MIE versus the number of iterations, for different values of t, for MIED and Coin MIED.

6.3 Fairness Bayesian Neural Network

Finally, following [60, 64, 65, 69], we train a fairness Bayesian neural network to predict whether an individual s income is greater than $50,000, with gender as a protected characteristic. We use the Adult Income dataset [50]. The dataset is of the form D = {xi, yi, zi}n i=1, where xi denote the feature vectors, yi denote the labels (i.e., whether the income is greater than $50,000), and zi denote the protected attribute (i.e., the gender). We train a two-layer Bayesian neural network ˆy(x; w) with weights w and place a standard Gaussian prior on each weight independently. The fairness constraint is given by g(θ) = Cov(x,y,z) D[z, ˆy(x; θ)]2 t 0 for some user-specified t > 0. In testing, we evaluate each method using a Calder-Verwer (CV) score [14], a standard measure of disparate impact. We run each method for T = 2000 iterations and use N = 50 particles.

In Fig. 5a, we plot the trade-off curve between test accuracy and CV score, for t {10 6, 10 5, 10 4, 10 3, 2 10 3, 5 10 3, 10 2}. We compare the results for Coin MIED, MIED, and the methods in [65], using the default implementations. In this experiment, Coin MIED is clearly preferable to the other methods, achieving a much larger Pareto front. In Fig. 5b and Fig. 5c, we plot the constraint and the MIE versus training iterations for both coin MIED and MIED. Once again, coin MIED tends to outperform MIED, achieving both lower values of the constraint and lower values of the energy.

7 Discussion

Summary. In this paper, we introduced several new particle-based algorithms for constrained sampling which are entirely learning rate free. Our first approach was based on the coin sampling framework introduced in [81], and a perspective of constrained sampling as a mirrored optimisation problem on the space of probability measures. Based on this perspective, we also unified several existing constrained sampling algorithms, and studied their theoretical properties in continuous time. Our second approach can be viewed as the coin sampling analogue of the recently proposed MIED algorithm [60]. Empirically, our algorithms achieved comparable or superior performance to other particle-based constrained sampling algorithms, with no need to tune a learning rate.

Limitations. We highlight three limitations. First, like any mirrored sampling algorithm, mirrored coin sampling (e.g., Coin MSVGD), necessarily depends on the availability of a mirror map that can appropriately capture the constraints of the problem at hand. Second, Coin MSVGD has a cost of O(N 2) per update. Finally, even in the population limit, establishing the convergence of mirrored coin sampling under standard conditions (e.g., strong log-concavity or a mirrored log-Sobolev inequality) remains an open problem. We leave this as an interesting direction for future work.

Acknowledgements

LS and CN were supported by the UK Research and Innovation (UKRI) Engineering and Physical Sciences Research Council (EPSRC), grant number EP/V022636/1. CN acknowledges further support from the EPSRC, grant number EP/R01860X/1.

[1] Kwangjun Ahn and Sinho Chewi. Efficient constrained sampling via the mirror-Langevin algorithm. In Proceedings of the 35th International Conference on Neural Information Processing Systems (Neur IPS 2021), Online, 2021. 1, 2, 7, 8, 18, 31, 32, 33

[2] Luigi Ambrosio, Nicola Gigli, and Giuseppe Savaré. Gradient Flows: In Metric Spaces and in the Space of Probability Measures. Birkhäuser, Basel, 2008. 2, 17, 23, 28

[3] Michael Arbel, Anna Korba, Adil Salim, and Arthur Gretton. Maximum Mean Discrepancy Gradient Flow. In Proceedings of the 33rd International Conference on Neural Information Processing Systems (Neur IPS 2019), Vancouver, Canada, 2019. 3

[4] D Bakry and M Émery. Diffusions hypercontractives. In Jacques Azéma and Marc Yor, editors, Séminaire de Probabilités XIX 1983/84, pages 177 206, Berlin, Heidelberg, 1985. Springer Berlin Heidelberg. 20, 23

[5] Dominique Bakry, Ivan Gentil, and Michel Ledoux. Analysis and Geometry of Markov Diffusion Operators. Springer Cham, 2014. 25

[6] Krishnakumar Balasubramanian, Sinho Chewi, Murat A. Erdogdu, Adil Salim, and Matthew Zhang. Towards a Theory of Non-Log-Concave Sampling: First-Order Stationarity Guarantees for Langevin Monte Carlo. In Proceedings of the 35th Annual Conference on Learning Theory (COLT 2022), London, UK, 2022. 7

[7] Amir Beck and Marc Teboulle. Mirror descent and nonlinear projected subgradient methods for convex optimization. Operations Research Letters, 31(3):167 175, 2003. 7

[8] Nan Bi, Jelena Markovic, Lucy Xia, and Jonathan Taylor. Inferactive data analysis. Scandinavian Journal of Statistics, 47(1):212 249, 2020. 9

[9] David M. Blei, Andrew Y. Ng, and Michael I. Jordan. Latent Dirichlet Allocation. Journal of Machine Learning Research, 3:993 1022, 2003. 1, 31

[10] Sergey G Bobkov. Isoperimetric and Analytic Inequalities for Log-Concave Probability Measures. The Annals of Probability, 27(4):1903 1921, 1999. 20

[11] Steve Brooks, Andrew Gelman, Galin Jones, and Xiao-Li Meng. Handbook of Markov Chain Monte Carlo. Chapman and Hall/CRC, 1st edition, 2011. 1

[12] Marcus Brubaker, Mathieu Salzmann, and Raquel Urtasun. A Family of MCMC Methods on Implicitly Defined Manifolds. In Proceedings of the 15th International Conference on Artificial Intelligence and Statistics (AISTATS), La Palma, Canary Islands, 2012. 7

[13] Simon Byrne and Mark Girolami. Geodesic Monte Carlo on Embedded Manifolds. Scandinavian Journal of Statistics, 40(4):825 845, 2013. 7

[14] Toon Calders and Sicco Verwer. Three Naive Bayes Approaches for Discrimination-Free Classification. Data Mining and Knowledge Discovery, 21(2):277 292, 2010. 10

[15] Gilles Celeux, Mohammed El Anbari, Jean-Michel Marin, and Christian P Robert. Regularization in Regression: Comparing Bayesian and Frequentist Methods in a Poorly Informative Situation. Bayesian Analysis, 7(2):477 502, 2012. 1

[16] Peng Chen and Omar Ghattas. Projected Stein Variational Gradient Descent. In Proceedings of the 34th International Conference on Neural Information Processing Systems (Neur IPS 2020), Vancouver, Canada, 2020. 1

[17] Yuansi Chen. An Almost Constant Lower Bound of the Isoperimetric Coefficient in the KLS Conjecture. Geometric and Functional Analysis, 31(1):34 61, 2021. 20

[18] Xiang Cheng, Niladri S. Chatterji, Peter L. Bartlett, and Michael I. Jordan. Underdamped Langevin MCMC: A non-asymptotic analysis. In Proceedings of the 31st Annual Conference on Learning Theory (COLT 2018), volume 75, pages 1 24, Stockholm, Sweden, 2018. 7

[19] Sinho Chewi, Thibaut Le Gouic, Chen Lu, Tyler Maunu, Philippe Rigollet, and Austin Stromme. Exponential ergodicity of mirror-Langevin diffusions. In Proceedings of the 34th International Conference on Neural Information Processing Systems (Neur IPS 2020), Online, 2020. 1, 7, 18, 20

[20] Sinho Chewi, Thibaut Le Gouic, Chen Lu, Tyler Maunu, and Philippe Rigollet. SVGD as a kernelized Wasserstein gradient flow of the chi-squared divergence. In Proceedings of the 34th International Conference on Neural Information Processing Systems (Neur IPS 2020), pages 2098 2109, 2020. 3, 4, 23, 24, 25

[21] Kacper Chwialkowski, Heiko Strathmann, and Arthur Gretton. A Kernel Test of Goodness of Fit. In Proceedings of the 33rd International Conference on Machine Learning (ICML 2016), New York, NY, 2016. 26

[22] Ashok Cutkosky and Francesco Orabona. Black-Box Reductions for Parameter-free Online Learning in Banach Spaces. In Proceedings of the 31st Annual Conference on Learning Theory (COLT 2018), Stockholm, Sweden, 2018. 5

[23] Arnak S Dalalyan. Theoretical guarantees for approximate sampling from smooth and logconcave densities. Journal of the Royal Statistical Society. Series B (Statistical Methodology), 79(3):651 676, 2017. 7

[24] Arnak S Dalalyan and Avetik Karagulyan. User-friendly guarantees for the Langevin Monte Carlo with inaccurate gradient. Stochastic Processes and their Applications, 129(12):5278 5311, 2019. 7

[25] Arnak S Dalalyan and Lionel Riou-Durand. On sampling from a log-concave density using kinetic Langevin diffusions. Bernoulli, 26(3):1956 1988, 2020. 7

[26] Nabarun Deb, Young-Heon Kim, Soumik Pal, and Geoffrey Schiebinger. Wasserstein Mirror Gradient Flow as the Limit of the Sinkhorn Algorithm. ar Xiv preprint ar Xiv:2307.16421, 2023. 18

[27] Gianluca Detommaso, Tiangang Cui, Alessio Spantini, Youssef Marzouk, and Robert Scheichl. A Stein variational Newton method. In Proceedings of the 32nd International Conference on Neural Information Processing Systems (NIPS 2018), Montreal, Canada, 2018. 1

[28] Persi Diaconis, Susan Holmes, and Mehrdad Shahshahani. Sampling from a manifold. Advances in modern statistical theory and applications: a Festschrift in honor of Morris L. Eaton, 10:102 125, 2013. 7

[29] Zhiyan Ding, Qin Li, Jianfeng Lu, and Stephen J. Wright. Random Coordinate Langevin Monte Carlo. In Proceedings of the 34th Annual Conference on Learning Theory (COLT 2021), volume 134, pages 1 28, Boulder, CO, 2021. 7

[30] Zhiyan Ding, Qin Li, Jianfeng Lu, and Stephen J. Wright. Random Coordinate Underdamped Langevin Monte Carlo. In Proceedings of The 24th International Conference on Artificial Intelligence and Statistics (AISTATS 2021), San Diego, CA, 2021. 7

[31] John Duchi, Elad Hazan, and Yoram Singer. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12(61):2121 2159, 2011. 6, 30

[32] Andrew Duncan, Nikolas Nüsken, and Lukasz Szpruch. On the geometry of Stein variational gradient descent. Journal of Machine Learning Research, 24:1 40, 2023. 7, 22

[33] Alain Durmus, Szymon Majewski, and Blazej Miasojedow. Analysis of Langevin Monte Carlo via Convex Optimization. Journal of Machine Learning Research, 20(1):2666 2711, 2019. 7

[34] Alain Durmus and Éric Moulines. High-dimensional Bayesian inference via the unadjusted Langevin algorithm. Bernoulli, 25(4A):2854 2882, 2019. 7

[35] Murat A. Erdogdu and Rasa Hosseinzadeh. On the Convergence of Langevin Monte Carlo: The Interplay between Tail Growth and Smoothness. In Proceedings of the 34th Annual Conference on Learning Theory (COLT 2021), Boulder, CO, 2021. 21

[36] Damien Garreau, Wittawat Jitkrittum, and Motonobu Kanagawa. Large sample analysis of the median heuristic. ar Xiv preprint ar Xiv:1707.07269, 2017. 30

[37] Chengyue Gong, Xingchao Liu, and Qiang Liu. Bi-objective trade-off with dynamic barrier gradient descent. In Proceedings of the 35th International Conference on Neural Information Processing Systems (Neur IPS 2021), Online, 2021. 6

[38] Jackson Gorham and Lester Mackey. Measuring Sample Quality with Kernels. In Proceedings of the 34th International Conference on Machine Learning (ICML 2017), Sydney, Australia, 2017. 26, 30

[39] T H Gronwall. Note on the Derivatives with Respect to a Parameter of the Solutions of a System of Differential Equations. Annals of Mathematics, 20(4):292 296, 1919. 20

[40] Richard Grumitt, Biwei Dai, and Uros Seljak. Deterministic Langevin Monte Carlo with Normalizing Flows for Bayesian Inference. In Proceedings of the 36th International Conference on Neural Information Processing Systems (Neur IPS 2022), New Orleans, LA, 2022. 3, 21

[41] Matthew D. Hoffman, David M. Blei, Chong Wang, and John Paisley. Stochastic variational inference. Journal of Machine Learning Research, 14(1):1303 1347, 2013. 1

[42] Matthew D Hoffman and Andrew Gelman. The No-U-Turn sampler: adaptively setting path lengths in Hamiltonian Monte Carlo. Journal of Machine Learning Research, 15(1):1593 1623, 2014. 7, 8, 32

[43] Ya-Ping Hsieh, Ali Kavis, Paul Rolland, and Volkan Cevher. Mirrored Langevin Dynamics. In Proceedings of the 32nd International Conference on Neural Information Processing Systems (NIPS 2018), Montreal, Canada, 2018. 1, 2, 3, 4, 7, 17, 18, 20, 21, 23, 25, 28

[44] Qijia Jiang. Mirror Langevin Monte Carlo: the Case Under Isoperimetry. In Proceedings of the 35th International Conference on Neural Information Processing Systems (Neur IPS 2021), Online, 2021. 1, 7, 18, 20

[45] Valen E. Johnson and James H. Albert. Ordinal Data Modeling. Springer, New York, NY, 1999. 1

[46] Richard Jordan, David Kinderlehrer, and Felix Otto. The Variational Formulation of the Fokker Planck Equation. SIAM Journal on Mathematical Analysis, 29(1):1 17, 1998. 3, 6

[47] Ravi Kannan, László Lovász, and Miklós Simonovits. Isoperimetric problems for convex bodies and a localization lemma. Discrete & Computational Geometry, 13(3):541 559, 1995. 20

[48] Diederik P. Kingma and Jimmy Ba. Adam: a method for stochastic optimisation. In Proceedings of the 3rd International Conference on Learning Representations (ICLR 15), pages 1 13, San Diego, CA, 2015. 6, 30

[49] John P. Klein and Melvin L. Moeschberger. Survival Analysis: Techniques for Censored and Truncated Data. Springer, New York, NY, 2003. 1

[50] Ron Kohavi. Scaling up the Accuracy of Naive-Bayes Classifiers: A Decision-Tree Hybrid. In Proceedings of the Second International Conference on Knowledge Discovery and Data Mining, KDD 96, pages 202 207. AAAI Press, 1996. 10

[51] Anna Korba, Pierre-Cyril, Aubin-Frankowski, Szymon Majewski, and Pierre Ablin. Kernel Stein Discrepancy Descent. In Proceedings of the 38th International Conference on Machine Learning (ICML 2021), Online, 2021. 3, 4, 26, 27

[52] Anna Korba, Adil Salim, Michael Arbel, Giulia Luise, and Arthur Gretton. A Non-Asymptotic Analysis for Stein Variational Gradient Descent. In Proceedings of the 34th International Conference on Neural Information Processing Systems (Neur IPS 2020), Vancouver, Canada, 2020. 4, 7, 22

[53] Raphail E Krichevsky and Victor K Trofimov. The performance of universal encoding. IEEE Transactions on Information Theory, 27(2):199 207, 1981. 5

[54] Jason D Lee, Dennis L Sun, Yuekai Sun, and Jonathan E Taylor. Exact post-selection inference, with application to the lasso. The Annals of Statistics, 44(3):907 927, 2016. 1, 8

[55] Yin Tat Lee and Santosh S Vempala. Convergence Rate of Riemannian Hamiltonian Monte Carlo and Faster Polytope Volume Computation. In Proceedings of the 50th Annual ACM Symposium on the Theory of Computing (STOC 2018), pages 1115 1121, Los Angeles,CA, 2018. 7

[56] Tony Lelièvre, Mathias Rousset, and Gabriel Stoltz. Free Energy Computations. Imperial College Press, 2010. 7

[57] Tony Lelièvre, Mathias Rousset, and Gabriel Stoltz. Hybrid Monte Carlo methods for sampling probability measures on submanifolds. Numerische Mathematik, 143(2):379 421, 2019. 7

[58] Tony Lelièvre and Gabriel Stoltz. Partial differential equations and stochastic methods in molecular dynamics. Acta Numerica, 25:681 880, 2016. 7

[59] Tony Lelièvre, Gabriel Stoltz, and Wei Zhang. Multiple projection Markov chain Monte Carlo algorithms on submanifolds. IMA Journal of Numerical Analysis, 43(2):737 788, 2023. 7

[60] Lingxiao Li, Qiang Liu, Anna Korba, Mikhail Yurochkin, and Justin Solomon. Sampling with Mollified Interaction Energy Descent. In Proceedings of the 11th International Conference on Learning Representations, Kigali, Rwanda, 2023. 1, 2, 6, 7, 10, 30, 32, 34

[61] Ruilin Li, Molei Tao, Santosh S. Vempala, and Andre Wibisono. The Mirror Langevin Algorithm Converges with Vanishing Bias. In Proceedings of the 33rd International Conference on Algorithmic Learning Theory (ALT 2022), Paris, France, 2022. 1, 7, 18

[62] Qiang Liu, Jason D. Lee, and Michael Jordan. A Kernelized Stein Discrepancy for Goodnessof-fit Tests. In Proceedings of the 33rd International Conference on Machine Learning (ICML 2016), New York, NY, 2016. 26, 27

[63] Qiang Liu and Dilin Wang. Stein Variational Gradient Descent: A General Purpose Bayesian Inference Algorithm. In Proceedings of the 30th Conference on Neural Information Processings Systems (NIPS 2016), Barcelona, Spain, 2016. 1, 3, 7, 30

[64] Suyun Liu and Luis Nunes Vicente. Accuracy and fairness trade-offs in machine learning: a stochastic multi-objective approach. Computational Management Science, 19(3):513 537, 2022. 10, 32

[65] Xingchao Liu, Xin T. Tong, and Qiang Liu. Sampling with Trustworthy Constraints: A Variational Gradient Framework. In Proceedings of the 35th International Conference on Neural Information Processing Systems (Neur IPS 2021), Online, 2021. 1, 10, 32

[66] Yang Liu, Prajit Ramachandran, Qiang Liu, and Jian Peng. Stein Variational Policy Gradient. In Proceedings of the Conference on Uncertainty In Artificial Intelligence (UAI 2017), Sydney, Australia, 2017. 1

[67] Yi-An Ma, Niladri S Chatterji, Xiang Cheng, Nicolas Flammarion, Peter L Bartlett, and Michael I Jordan. Is there an analog of Nesterov acceleration for gradient-based MCMC? Bernoulli, 27(3):1942 1992, 2021. 7

[68] James Martin, Lucas C Wilcox, Carsten Burstedde, and Omar Ghattas. A Stochastic Newton MCMC Method for Large-Scale Statistical Inverse Problems with Application to Seismic Inversion. SIAM Journal on Scientific Computing, 34(3):1460 1487, 2012. 7

[69] Natalia Martinez, Martin Bertran, and Guillermo Sapiro. Minimax Pareto Fairness: A Multi Objective Perspective. In Proceedings of the 37th International Conference on Machine Learning (ICML 2020), Online, 2020. 10

[70] Francesco Orabona and Ashok Cutkosky. Tutorial on Parameter-Free Online Learning. In Proceedings of the 37th International Conference on Machine Learning (ICML 2020), Online, 2020. 5

[71] Francesco Orabona and David Pal. Coin Betting and Parameter-Free Online Learning. In Proceedings of the 30th Conference on Neural Information Processings Systems (NIPS 2016), Barcelona, Spain, 2016. 5

[72] Francesco Orabona and Tatiana Tommasi. Training Deep Networks without Learning Rates Through Coin Betting. In Proceedings of the 31st International Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, 2017. 5, 29

[73] Sam Patterson and Yee Whye Teh. Stochastic gradient Riemannian Langevin dynamics on the probability simplex. In Proceedings of the 27th International Conference on Neural Information Processing Systems (NIPS 2013), Lake Tahoe, NV, 2013. 7, 8, 31, 32, 33

[74] Yunchen Pu, Zhe Gan, Ricardo Henao, Chunyuan Li, Shaobo Han, and Lawrence Carin. VAE Learning via Stein Variational Gradient Descent. In Proceedings of the 31st International Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, 2017. 1

[75] Soo-Yon Rhee, Jonathan Taylor, Gauhar Wadhera, Asa Ben-Hur, Douglas L Brutlag, and Robert W Shafer. Genotypic predictors of human immunodeficiency virus type 1 drug resistance. Proceedings of the National Academy of Sciences, 103(46):17355 17360, 2006. 9

[76] Christian P Robert and George Casella. Monte Carlo Statistical Methods. Springer-Verlag, New York, 2 edition, 2004. 1

[77] Ralph Tyrrell Rockafellar. Convex Analysis. Princeton University Press, Princeton, NJ, 1970.

[78] Adil Salim, Lukang Sun, and Peter Richtárik. A Convergence Theory for SVGD in the Population Limit under Talagrand s Inequality T1. In Proceedings of the 39th International Conference on Machine Learning (ICML 2022), Online, 2022. 3, 7

[79] Amir Sepehri and Jelena Markovic. Non-reversible, tuningand rejection-free Markov chain Monte Carlo via iterated random functions. ar Xiv preprint ar Xiv:1711.07177, 2017. 8, 9, 31

[80] Louis Sharrock, Daniel Dodd, and Christopher Nemeth. Coin EM: Tuning-Free Particle-Based Variational Inference for Latent Variable Models. ar Xiv preprint ar Xiv:2305.14916, 2023. 7

[81] Louis Sharrock and Christopher Nemeth. Coin Sampling: Gradient-Based Bayesian Inference without Learning Rates. In Proceedings of the 40th International Conference on Machine Learning (ICML 2023), Honolulu, HI, 2023. 2, 5, 6, 7, 10, 28, 29

[82] Jiaxin Shi, Chang Liu, and Lester Mackey. Sampling with Mirrored Stein Operators. In Proceedings of the 10th International Conference on Learning Representations (ICLR 2022), Online, 2022. 1, 2, 3, 4, 6, 7, 9, 21, 22, 24, 30, 31, 32

[83] Jiaxin Shi and Lester Mackey. A Finite-Particle Convergence Rate for Stein Variational Gradient Descent. In Proceedings of the 37th International Conference on Neural Information Processing Systems (Neur IPS 2023), New Orleans, LA, 2023. 7

[84] Umut Simsekli, Roland Badeau, A. Taylan Cemgil, and Gaël Richard. Stochastic Quasi Newton Langevin Monte Carlo. In Proceedings of the 33rd International Conference on Machine Learning (ICML 2016), New York, NY, 2016. 7

[85] Lukang Sun, Avetik Karagulyan, and Peter Richtarik. Convergence of Stein Variational Gradient Descent under a Weaker Smoothness Condition. In Proceedings of the 26th International Conference on Artificial Intelligence and Statistics (AISTATS 2023), Valencia, Spain, 2023. 3, 7, 23

[86] Lukang Sun and Peter Richtárik. A Note on the Convergence of Mirrored Stein Variational Gradient Descent under (L0,L1)-Smoothness Condition. ar Xiv preprint ar Xiv:2206.09709, 2022. 1, 2, 3, 4, 7, 22, 23

[87] Gábor J. Székely and Maria L. Rizzo. Energy statistics: A class of statistics based on distances. Journal of Statistical Planning and Inference, 143(8):1249 1272, 2013. 7

[88] Jonathan Taylor and Robert J Tibshirani. Statistical learning and selective inference. Proceedings of the National Academy of Sciences, 112(25):7629 7634, 2015. 1, 8

[89] Xiaoying Tian, Nan Bi, and Jonathan Taylor. MAGIC: a general, powerful and tractable method for selective inference. ar Xiv preprint ar Xiv:1607.02630, 2016. 8, 31

[90] Xiaoying Tian, Snigdha Panigrahi, Jelena Markovic, Nan Bi, and Jonathan Taylor. Selective sampling after solving a convex problem. ar Xiv preprint ar Xiv:1609.05609, 2016. 8, 31

[91] Ryan Tibshirani, Rob Tibshirani, Jonatha Taylor, Joshua Loftus, Stephen Reid, and Jelena Markovic. selective Inference: Tools for Post-Selection Inference, 2019. 9, 34

[92] Tijmen Tieleman and Geoffrey E Hinton. Lecture 6.5-rmsprop: divide the gradient by a running average of its recent magnitude. COURSERA: Neural networks for machine learning, 4(2):26 31, 2012. 9, 30, 31, 34

[93] Alexandre B. Tsybakov. Introduction to Nonparametric Estimation. Springer, New York, NY, 2009. 20, 24

[94] Santosh S. Vempala and Andre Wibisono. Rapid Convergence of the Unadjusted Langevin Algorithm: Isoperimetry Suffices. In Proceedings of the 33rd International Conference on Neural Information Processing Systems (Neur IPS 2019), Vancouver, Canada, 2019. 21

[95] Cédric Villani. Optimal Transport: Old and New. Springer-Verlag, Berlin, 2008. 23

[96] Martin J Wainwright and Michael I Jordan. Graphical Models, Exponential Families, and Variational Inference. Foundations and Trends in Machine Learning, 1(1-2):1 305, 2008. 1

[97] Dilin Wang, Zhe Zeng, and Qiang Liu. Stein Variational Message Passing for Continuous Graphical Models. In Proceedings of the 35th International Conference on Machine Learning (ICML 2018), Stockholm, Sweden, 2018. 1

[98] Xiao Wang, Qi Lei, and Ioannis Panageas. Fast Convergence of Langevin Dynamics on Manifold: Geodesics meet Log-Sobolev. In Proceedings of the 34th International Conference on Neural Information Processing Systems (Neur IPS 2020), Vancouver, Canada, 2020. 7

[99] Andre Wibisono. Sampling as optimization in the space of measures: The Langevin dynamics as a composite optimization problem. In Proceedings of the 31st Annual Conference on Learning Theory (COLT 2018), Stockholm, Sweden, 2018. 2, 3, 7, 20

[100] Andre Wibisono. Proximal Langevin Algorithm: Rapid Convergence Under Isoperimetry. ar Xiv preprint ar Xiv:1911.01469, 2019. 7

[101] Edwin B Wilson. Probable Inference, the Law of Succession, and Statistical Inference. Journal of the American Statistical Association, 22(158):209 212, 1927. 31

[102] Emilio Zappa, Miranda Holmes-Cerfon, and Jonathan Goodman. Monte Carlo on Manifolds: Sampling Densities and Integrating Functions. Communications on Pure and Applied Mathematics, 71(12):2609 2647, 2018. 7

[103] Kelvin Shuangjian Zhang, Gabriel Peyré, Jalal Fadili, and Marcelo Pereyra. Wasserstein Control of Mirror Langevin Monte Carlo. In Proceedings of the 33rd Annual Conference on Learning Theory (COLT 2020), Graz, Austria, 2020. 1, 7, 18

[104] Ruqi Zhang, Qiang Liu, and Xin T. Tong. Sampling in Constrained Domains with Orthogonal Space Variational Gradient Descent. In Proceedings of the 36th International Conference on Neural Information Processing Systems (Neur IPS 2022), New Orleans, LA, 2022. 1, 7

[105] Jingwei Zhuo, Chang Liu, Jiaxin Shi, Jun Zhu, Ning Chen, and Bo Zhang. Message Passing Stein Variational Gradient Descent. In Proceedings of the 35th International Conference on Machine Learning (ICML 2018), Stockholm, Sweden, 2018. 1

A Calculus in the Wasserstein Space

We recall the following from [2, Chapter 10]. Suppose µ P2(Rd) and ξ L2(µ). Let F be a proper and lower semi-continuous functional on P2(Rd). We say that ξ L2(µ) belongs to the Fréchet subdifferential of F at µ and write ξ F(µ) if, for any ν P2(Rd),

lim inf ν µ F(ν) F(µ) R

Rd ξ(x), tν µ(x) x µ(dx) W2(ν, µ) 0, (15)

where tν µ : Rd Rd denotes the optimal transport map from µ to ν [2, Chapter 7.1]. Under mild conditions [2, Lemma 10.4.1], the subdifferential F is single-valued, F(µ) = { W2F(µ)}, and W2F(µ) L2(µ) is given by

W2F(µ) = F(µ)

µ (x) for µ-a.e. x Rd, (16)

µ : Rd R denotes the first variation of F at µ, that is, the unique function such that

lim ε 0 1 ε (F(µ + εζ) F(µ)) = Z

µ (x)ζ(dx), (17)

where ζ = ν µ, and ν P2(Rd). We will refer to W2F(µ) as the Wasserstein gradient of F at µ.

B Examples of Mirrored Wasserstein Gradient Flows

In this section, we outline several algorithms which arise as special cases of the MWGF introduced in Sec 3.2, namely ηt

t + (wtηt) = 0 , µt = ( ϕ )#ηt, (18)

where (wt)t 0 are vector fields chosen to ensure the convergence of (ηt)t 0 to ν = ( ϕ)#π.

We will assume, throughout this section, that π e V and that ν = ( ϕ)#π e W . The Monge-Ampere equation determines the relationship between W and V . In particular, we have [e.g., 43] e V = e W ϕdet 2ϕ , e W = e V ϕ det 2ϕ . (19) Thus, the potential W(y) of the mirrored target ν evaluated at y = ϕ(x) can be expressed in terms of the potential V (x) of the original target π evaluated at x, as

W(y) = V (x) + log det 2ϕ(x). (20)

B.1 Mirrored Langevin Dynamics

Suppose that F(η) = KL(η||ν), where, as elsewhere, ν = ( ϕ)#π is the dual target. In addition, let wt = W2F(ηt). In this case, it is straightforward to show that [e.g. 2, Chapter 10] W2F(η) = log η

ν . Thus, for this choice of (wt)t 0, the MWGF in (18) reads ηt

ηt = 0, µt = ( ϕ )#ηt. (21)

Substituting log ν = W, and using the fact that ηt log(ηt) = ηt, this can also be written as ηt

t (ηt W + ηt) = 0, µt = ( ϕ )#ηt. (22)

This PDE is nothing more than the Fokker-Planck equation describing the evolution of the law of the overdamped Langevin SDE with respect to the mirrored target ν e W . In particular, suppose that x0 µ0, y0 = ϕ(x0) η0, and that (xt)t 0 and (yt)t 0 are the solutions of

dyt = W(yt)dt +

2dbt , xt = ϕ (yt), (23) where b = (bt)t 0 is a standard Brownian motion. Then, if we let (µt)t 0 and (ηt)t 0 denote the laws of (xt)t 0 and (yt)t 0, it follows that (µt)t 0 and (ηt)t 0 satisfy (22). This is precisely the mirrored Langevin dynamics (MLD) introduced in [43]. We remark that, using (20), the MLD in (23) can also be written as

dyt = 2ϕ(xt) 1( V (xt) + log det 2ϕ(xt))dt +

2dbt, xt = ϕ (yt). (24)

Remark 1. It is worth distinguishing between the mirrored Langevin dynamics in (23) and the so-called mirror Langevin diffusion [e.g., 44, Equation 2], which refers to the solution of

dyt = V (xt)dt +

2 dbt, xt = ϕ (yt) (25) These dynamics, and the corresponding Euler-Maruyama discretisation, known as the mirror Langevin algorithm or mirror Langevin Monte Carlo, were first studied in [103], and have since also been analysed in [1, 44, 61]. An alternative time discretisation has also been studied in [19].

In this case, one can show that the law (µt)t 0 of the solution of the mirror Langevin diffusion in (25) satisfies the Fokker-Planck equation [e.g., 44, App. A.2] µt

t + (µtvt) = 0, vt = [ 2ϕ] 1 log µt

Recalling that W2KL(µt||π) = log µt

π , one can view (26) as a special case of the so-called Wasserstein mirror flow (WMF), defined according to [e.g. 1, App. C]

t + (vtµt) = 0, vt = ( 2ϕ) 1 W2F(µt). (27)

The WMF is rather different from the MWGF in (18). In particular, the WMF in (27) describes the evolution of µt according to the Wasserstein tangent vector ( 2ϕ) 1 W2F(µt), where F is a functional minimised at π, while the MWGF in (18) describes the evolution of the mirrored ηt := ( ϕ)#µt according to the tangent vector W2F(ηt), where F is now a functional minimised at ν = ( ϕ)#π. We refer to [19] for a detailed discussion of the mirror Langevin diffusion from this perspective; and [26] for a more general formulation of the WMF.

B.1.1 Continuous Time Results

We now study the properties of the dynamics in (23) in continuous time, starting with the dissipation of KL(µt|π) and χ2(µt|π) along the MLD. These results are a natural extension of existing results in the unconstrained case to our setting. Proposition 1. The dissipation of KL( ||π) along the MWGF in (21) is given by d KL(µt||π)

dt = [ 2ϕ] 1 log µt

L2(µt) . (28)

Proof. Using differential calculus in the Wasserstein space and the chain rule, we have that d KL(ηt||ν)

dt = D wt, log ηt

L2(ηt) = log ηt

L2(ηt) . (29)

By [43, Theorem 2], we have that KL(η||ν) = KL(µ||π). To deal with the term on the RHS, first note that, using the formula for the change of variable in a probability density, we have that η(y) = µ( ϕ (y))|det 2ϕ( ϕ (y))| 1 , ν(y) = π( ϕ (y))|det 2ϕ( ϕ (y))| 1. (30) Thus, in particular, if y = ϕ(x), then we have that

(y) = log µ

( ϕ (y)) = log µ

Using this, and the fact that η = ( ϕ)#µ, we can now compute log ηt

L2(ηt) = Z y log ηt

(y) 2 dηt(y) (32)

= Z y log ηt

(y) 2 d(( ϕ)#µt)(y) (33)

= Z y log ηt

( ϕ(x)) 2 dµt(x) (34)

= Z y log µt

(x) 2 dµt(x) (35)

= Z [ 2ϕ(x)] 1 x log µt

(y) 2 dµt(y) (36)

= [ 2ϕ(y)] 1 y log µt

L2(µt) . (37)

The conclusion now follows.

Proposition 2. The dissipation of χ2( ||π) along the MWGF in (21) is given by

dt = 2 [ 2ϕ] 1 µt

L2(π) . (38)

Proof. The proof is rather similar to the previous one. In this case, using the fact that W2X 2(η||ν) = 2 η

ν , we have

dt = D log ηt

L2(ηt) = 2 ηt

We next observe, using now the fact that η

π(x) which follows from (30), that

χ2(ηt||ν) = Z ηt

ν (y) 1 2 dν(y) = Z ηt

ν (y) 1 2 d(( ϕ)#π)(y) (39)

ν ( ϕ(x)) 1 2 dπ(x) = Z µt

π (x) 1 2 dπ(x) = χ2(µt||π). (40)

In addition, based on a very similar argument, we have that ηt

L2(ν) = Z y ηt

(y) 2 dν(y) = Z y ηt

(y) 2 d(( ϕ)#π)(y) (41)

( ϕ(x)) 2 dπ(x) = Z y µt

(x) 2 dπ(x) (42)

= Z [ 2ϕ(y)] 1 y µt

(y) 2 dπ(y) = [ 2ϕ(y)] 1 y µt

This completes the proof.

Since the RHS of both (28) and (38) are non-positive, these results show that KL(µt|π) and χ2(µt|π) decrease along the MLD. As an immediate corollary, we have the following continuous time convergence rate for the time-average of ||[ 2ϕ] 1 log( µt

π )||2 L2(µt) and ||[ 2ϕ] 1 ( µt

π )||2 L2(π).

Proposition 3. For any t 0, it holds that

[ 2ϕ] 1 log µt

[ 2ϕ] 1 log µs

L2(µs) ds KL(µ0||π)

(44) Proposition 4. For any t 0, it holds that

L2(π) ds χ2(µ0||π)

Proof. These results follow straightforwardly from integration of (28) or (38) and the non-negativity of the KL or χ2 divergence.

To obtain stronger convergence results, we will require additional assumptions on the target measure. In the first case, we can assume that the target satisfies what we will refer to as a mirrored logarithmic Sobolev inequality (LSI). Proposition 5. Assume that π satisfies a mirrored LSI with constant λ > 0. In particular, for all µ P2(X), there exists λ > 0 such that

[ 2ϕ] 1 log µ

L2(µ) . (46)

Then the KL divergence converges exponentially along the MWGF in (21):

KL(µt||π) e 2λt KL(µ0||π). (47)

Proof. Substituting the mirrored LSI (46) into (28), we have that

d KL(µt||π)

dt = [ 2ϕ] 1 log µt

L2(µt) 2λKL(µt||π). (48)

The conclusion now follows straightforwardly from Grönwall s inequality [39].

Alternatively, under the assumption that the target satisfies a mirrored Poincaré inequality (PI), we can obtain exponential convergence in a number of metrics. Proposition 6. Assume that π satisfies a mirrored PI with constant κ > 0. In particular, for all locally Lipschitz g L2(π), there exists κ > 0 such that

[ 2ϕ] 1 g 2 L2(π) (49)

Then the total variation distance, Hellinger distance, KL divergence, and the χ2 divergence all converge exponentially along the MWGF in (21):

max(2||µt π||2 TV, H2(µt, π), λ

2 W 2 2 (µt, π), KL(µt||π), χ2(µt||π)) e 2κtχ2(µ0||π). (50)

Proof. The proof proceeds similarly to above. In particular, this time substituting the mirrored PI (49) with g = µt

π 1 into (38), we have

dt = 2 [ 2ϕ] 1 µt

L2(π) 2κχ2(µt||π).

The result for the χ2 divergence follows via an application of Grönwall s inequality. The remaining bounds follow from standard comparison inequalities [e.g., 93, Sec. 2.4].

Remark 2. The assumption that π satisfies the mirrored LSI in (46) or the mirrored PI in (49) is precisely equivalent to the assumption that the mirrored target ν = ( ϕ)#π satisfies the classical LSI or the classical PI, respectively. The LSI holds, for example, for all strongly log-concave distributions, with constant equal to the reciprocal of the strong convexity constant of the potential [4]. The class of measures satisfying the PI is even larger, including all strongly log-concave measures and, more generally, all log-concave measures [10, 17, 47]. Remark 3. The mirrored LSI in (46) and the mirrored PI in (49) are distinct from the so-called mirror LSI [44, Assumption 2] and mirror PI [19, Definition 1]. These assumptions are used in [44, Proposition 1] and [19, Theorem 1], respectively, to establish exponential convergence of the mirror Langevin diffusion [e.g., 44, Equation 2], which we recall is distinct from the mirrored Langevin diffusion in (23); see Remark 1. Remark 4. In Propositions 5 - 6 above, and also in Proposition 7 below, our assumptions on the target π are, in some sense, better viewed as assumptions on the mirrored target ν = ( ϕ)#π. Indeed, by construction, the imposed assumptions are equivalent to the assumptions that the mirrored target ν satisfies a LSI (Proposition 5, 7), or that the mirrored target ν satisfies a PI (Proposition 6).

In some sense, it would be preferable to establish such results under more direct assumptions on the target π itself. One possibility is to assume that the target π is strongly log-concave. In this case, [43, Proof of Theorem 3] guarantees the existence of a mirror map such that the mirrored target ν = ( ϕ)#π is also strongly log-concave. Thus, in particular, the mirrored target satisfies a LSI and a PI or, equivalently, the target π satisfies a mirrored LSI and a mirrored PI. Thus, Propositions 5 - 7 all still hold.

B.1.2 Discrete Time Results

In practice, of course, it is necessary to discretise the dynamics in (23) in time. Applying a standard Euler-Maruyama discretisation to (23), or equivalently a forward-flow discretisation to (21) [e.g., 99], one arrives at the mirrored Langevin algorithm (MLA) from [43], viz

yt+1 = yt h W(yt) +

2hξt, xt+1 = ϕ (yt+1) (51)

where h > 0 is the step size or learning rate and where (ξt)t N are a sequence of i.i.d. standard normal random variables.

Proposition 7. Suppose that π satisfies the mirrored LSI with constant λ > 0. In addition, suppose that π is mirror-L-smooth.2 Then, if 0 < γ λ 4L2 , the iterates (xt)t 0 in (51) satisfy

KL(µt||π) e λγt KL(µ0||π) + 8γd L2

Proof. By [43, Theorem 2], and (32) - (37) in the proof of Proposition 1, if π satisfies a mirrored LSI with constant λ > 0, then ν = ( ϕ)#π satisfies a LSI with the same constant λ. In addition, by definition, if π is mirror-L-smooth, then ν is L-smooth. Thus, it follows from [94, Theorem 1] that

KL(ηt||ν) e λγt KL(η0||ν) + 8γd L2

Finally, using the invariance of the KL divergence under the mirror map ϕ [43, Theorem 2], the result follows.

Remark 5. Using a similar approach, one can also obtain bounds on χ2(µt||π) under a mirrored LSI [35, Theorem 4], and for the Renyi divergence Rq(µt||π) under a mirrored LSI [94, Theorem 2] or a mirrored PI [94, Theorem 3]. Remark 6. One could also consider a forward Euler discretisation of the dynamics in (21), which leads to a mirrored version of the so-called deterministic Langevin Monte Carlo algorithm [40].

B.2 Mirrored Stein Variational Gradient Descent

We now turn our attention to the MSVGD dynamics given in Sec. 3.2.1, which we recall are defined by the continuity equation ηt

t (Pηt,kϕ log ηt

ηt) = 0, µt = ( ϕ )#ηt, (54)

where, as previously, kϕ is the kernel defined according to kϕ(y, y ) = k( ϕ (y), ϕ (y )), given some base kernel k : X X R.

B.2.1 Continuous Time Results

We now study the convergence properties of the continuous-time MSVGD dynamics. We note that a similar set of results can also found in [82, App. H]. We provide them here to highlight the analogues with the convergence results obtained for the MLD in Sec. B.1. Proposition 8. The dissipation of the KL along the mirrored SVGD gradient flow is given by d KL(µt||π)

dt = Sµt,k[ 2ϕ] 1 log µt

Hd k . (55)

Proof. The proof is similar to the proof of Proposition 1. In particular, using differential calculus on the Wasserstein space and the chain rule, we have d KL(ηt||ν)

dt = D wt, log ηt

L2(ηt) = Sηt,kϕ log ηt

Hd kϕ . (56)

By Theorem 2 in [43], we have that KL(η||ν) = KL(µ||π). In addition, using (31), we can compute Sηt,kϕ log ηt

= Z Z k( ϕ (y), ϕ (y )) D y log ηt

(y), y log ηt

(y ) E dηt(y)dηt(y ) (58)

= Z Z k(x, x ) D y log µt

(x), y log µt

(x ) E dµt(x)dµt(x ) (59)

= Z Z k(x, x ) D [ 2ϕ(x)] 1 x log µt

(x), [ 2ϕ(x )] 1 x log µt

(x ) E dµt(x)dµt(x )

= Sµt,k[ 2ϕ] 1 log µt

Hd k . (60)

2We say that π e V is mirror-L-smooth if ν e W is L-smooth. This holds, in particular, if W = V ϕ + log det 2ϕ has Hessian bounded by L.

Remark 7. The quantity on the RHS of (55) is referred to in [86] as the mirrored Stein Fisher information of µ with respect to π, with mirror map ϕ. This quantity represents the Stein Fisher information [32] between ( ϕ)#µ and ( ϕ)#π, given the kernel kϕ( , ) = k( ϕ ( ), ϕ ( )).

Similar to before, a straightforward consequence of this result is a continuous time convergence rate for the average of the mirrored stein discrepancy along the continuous-time MSVGD dynamics.

Proposition 9. For all t 0, it holds that

Sµs,k[ 2ϕ] 1 log µs

Sµs,k[ 2ϕ] 1 log µs

Hd k ds KL(µ0||π)

Proof. We integrate (55), and use the non-negativity of the KL divergence.

To obtain stronger convergence guarantees, we must once again assume some additional properties on the target. Here, a natural choice is the mirrored Stein LSI, the analogue of the Stein LSI introduced in [32] in the mirrored setting; see also [82, Definition 2].

Proposition 10. Assume that π satisfies the mirrored Stein LSI with constant λ > 0. In particular, for all µ P2(X), there exists λ > 0 such that

Sµ,k[ 2ϕ] 1 log µ

Hd k . (62)

Then the KL divergence converges exponentially along the continuous-time MSVGD dynamics:

KL(µt||π) e 2λt KL(µ0||π). (63)

Proof. The result is a straightforward consequence of (55) and (62). In particular, substituting (62) into (55), we have

d KL(µt||π)

dt = Sµt,k[ 2ϕ] 1 log µt

Hd k 2λKL(µt||π). (64)

We conclude by applying Grönwall s inequality.

Remark 8. The assumption that π satisfies the mirrored Stein LSI is equivalent to the assumption that ν = ( ϕ)#π satisfies the classical Stein LSI with kernel kϕ. The Stein LSI is rather less well known than the classical LSI, and holds for a much smaller class of targets. In particular, this condition fails to hold if the kernel k is too regular with respect to the target π, e.g., if π has exponential tails and the derivatives of the kernel k and the potential V grow at most polynomially [32, 52]. It remains an open problem to establish conditions for which this conditions holds.

B.2.2 Discrete Time Results

In discrete time, [86] recently established a descent lemma and a complexity bound for MSVGD in the population limit. For completeness, we recall one of these results here.

Theorem 1 (Corollary 1 in [86]). Suppose that the following assumptions hold:

(1) The mirror function ϕ : X ( , + ) is strongly K-convex.

(2) There exist B1, B2 > 0 such that for all x X, k(x, x ) B2 1 and xi xjk(x, x )|x=x B2 2 for all i [d], j [d].

(3) There exist L0, L1 0 such that || 2W(y)|| L0 + L1|| W(y)||.

(4) There exists p 1, y0 Rd, and s > 0 such that R exp [s||y y0||p] dν(y) < .

(5) There exists p 1, Cp > 0 such that || W(y)|| Cp (||y||p + 1) for any y Rd.

Then, provided γ satisfies [86, Equation 21], then

Sµt,k[ 2ϕ] 1 log µt

Hd k 2KL(µ0||π)

Remark 9. The results in [86] are obtained by extending the proofs used to establish convergence of SVGD in [85] to the mirrored setting. In this case, essentially all of the same arguments can be applied (in the dual space), provided suitable assumptions are imposed on the mirrored target ν = ( ϕ)#π. The disadvantage of this approach is that the required assumptions are specified in terms of the mirrored target ν, rather than the target π.

Following a similar approach to [43, Theorem 3], let us show how to recover this result under a standard assumption on π itself. Suppose we replace [86, Assumption 4] by the assumption that π is strongly log-concave. Under this assumption, [43, Proof of Theorem 3] guarantees the existence of a mirror map such that ν is also strongly log-concave. By the Bakry-Emery criterion [4], it follows that the T2 inequality is satisfied by the dual target ν, and thus so too is the T1 inequality. Finally, the T1 inequality is necessary and sufficient for [86, Assumption 4] by [95, Theorem 22.10]. Thus, [86, Corollary 1] holds whenever the target π is strongly log-concave.

B.3 Mirrored Laplacian Adjusted Wasserstein Gradient Descent

We now consider the MWGF of the chi-squared divergence; see also [20, Sec. 3] or [2, Theorem 11.2.1] in the unconstrained case. Following similar steps to [20], we will then obtained a mirrored version of the Laplacian Adjusted Wasserstein Gradient Descent (LAWGD) algorithm.

B.3.1 MWGF of the Chi-Squared Divergence

Suppose that F(η) = χ2(η||ν) with W2F(η) = 2 η

ν . Setting wt = W2F(ηt) in (4), we are interested in the following MWGF

ηt = 0, µt = ( ϕ)#ηt. (66)

Similar to before, we will begin by studying the convergence properties of this MWGF in continuous time. Proposition 11. The dissipation of KL( ||π) along the MWGF in (66) is given by

d KL(µt||π)

dt = 2 [ 2ϕ] 1 µt

L2(π) . (67)

Proof. The proof is essentially identical to the proof of Proposition 2, although the roles of the Lyapunov functional and the Wasserstein gradient are now reversed. In particular, we now have

d KL(ηt||ν)

dt = D 2 ηt

L2(ηt) = 2 ηt

By Theorem 2 in [43] and (41) - (43) in the proof of Proposition 2, we have that KL(ηt||ν) = KL(µt||π) and ηt

ν 2 L2(ν) = [ 2ϕ] 1 µt

π 2 L2(π), from which the conclusion follows.

Proposition 12. The dissipation of χ2( ||π) along the MWGF in (66) is given by

dt = 4 [ 2ϕ] 1 µt

L2(µt) . (68)

Proof. Similar to the last result, the proof follows very closely the proof of Proposition 1. On this occasion, we have

dt = D wt, 2 ηt

L2(ηt) = 4 ηt

L2(ηt) . (69)

Arguing as in (39) - (40) and (41) - (43) in the proof of Proposition 2, we have that χ2(η||ν) = χ2(µ||π) and ηt

ν 2 L2(ηt) = [ 2ϕ] 1 µt

π 2 L2(µt), from which the conclusion follows.

Similar to before, we can also obtain exponential convergence to the target under a mirrored functional inequality. Proposition 13. Assume that π satisfies a mirrored PI with constant κ > 0. In particular, for all locally Lipschitz g L2(π), there exists κ > 0 such that

[ 2ϕ] 1 g 2 L2(π) (70)

Then the KL divergence converges exponentially along the MWGF in (66):

KL(µt||π) KL(µ0||π)e 2λt. (71)

Furthermore, for t 1 2λ, the following convergence rate holds:

KL(µt||π) (KL(µ0||π) 2) e 2λt. (72)

Proof. The proof follows closely the proof of [20, Theorem 1]. In particular, substituting the mirrored PI (70) with g = µt

π 1 into (67), we have that

d KL(µt||π)

dt = [ 2ϕ] 1 µt

L2(µt) 2λχ2(µt||π) 2λKL(µt||π), (73)

where in the final inequality we have used the fact that KL( ||π) χ2( ||π) (e.g., [93, Sec. 2.4]). The first bound now follows straightforwardly from Grönwall s inequality.

For the second bound, we will instead use the inequality KL( ||π) log 1 + χ2( ||π) (e.g., [93, Sec. 2.4]), which implies that e KL( ||π) 1 χ2( ||π). Using this in (73), it follows that

1 e KL(µt||π) 2λ 1 e KL(µt||π) . (74)

Applying Grönwall s lemma, we then have that 1 e KL(µt||π) e 2λt 1 e KL(µt||π) e 2λt, (75)

where in the final line we have used the elementary inequality g(x) = 1 e x 1, for x 0. Finally, observe that if t 1 2λ, then 1 e KL(µt||π) e 2λt e 1. Combining this with the fact that x 2g(x) whenever g(x) e 1, we have that

KL(µt||π) 2 1 e KL(µt||π) 2e 2λt (76)

for t 1 2λ. Alongside the bound in (71), this completes the proof of (72).

B.3.2 Kernelised MWGF of the Chi-Squared Divergence

In [20], the authors provide an interpretation of SVGD as a kernelised WGF of the χ2 divergence. Similarly, we can interpret MSVGD as a kernelised MWGF of the χ2 divergence; see also [82, App. H]. In particular, observe that [20, Sec. 2.3]

Pηt,kϕ log ηt

= Z k(x, ) log ηt

ν (x) dηt(x) (77)

= Z k(x, ) ηt

ν (x) dν(x) = Pν,kϕ ηt

It follows that the continuous-time MSVGD dynamics in (5) or (54) can equivalently be written as

ηt = 0, µt = ( ϕ )#ηt. (79)

Comparing (79) and (66), it is clear that, up to a factor of two, MSVGD can be interpreted as the MWGF obtained by replacing the gradient of chi-squared divergence, ηt

ν , by Pν,kϕ ηt

ν . In [82, App. H], the authors obtain continuous-time convergence results for this scheme, under a mirrored Stein PI. Here, we proceed differently, based on the approach in [20].

B.3.3 Mirrored Laplacian Adjusted Wasserstein Gradient Descent (MLAWGD)

Following [20], suppose now that we replace Pν,kϕ ηt

ν by the vector field Pν,kϕ ηt

ν . The new dynamics, which we refer to as the mirrored Laplacian adjusted Wasserstein gradient flow (MLAWGF), are then given by ηt

ηt = 0, µt = ( ϕ )#ηt. (80)

The dynamics in (80) can essentially be viewed as a change of measure of the standard LAWGD dynamics in [20, Sec. 4], designed to converge to the dual target ν = ( ϕ)#π.

Suppose, in addition, that the kernel kϕ were chosen such that Pν,kϕ = L 1 ν , where Lν,kϕ denotes the infinitesimal generator of the Langevin SDE with stationary distribution ν. Thus, in particular,

Lνfϕ(y) = W(y), fϕ(y) + fϕ(y). (81)

where fϕ : Rd R denotes the function fϕ( ) = f( ϕ ( )), given some base function f : X R. Letting y = ϕ(x), using (20), and then the chain rule, we can rewrite this as

Lvf(x) = 2ϕ(x) 1 V (x) + log det 2ϕ(x) , 2ϕ(x) 1 f(x) (82)

+ Tr [ 2ϕ(x)] 1 3ϕ(x)[ 2ϕ(x)] 2 f(x) + [ 2ϕ(x)] 2 2f(x) . (83)

We will denote this kernel k Lν. In practice, computing k Lν requires computing the spectral decomposition of the operator Lν, which is challenging even in the moderate dimensions. Nonetheless, following [20, Theorem 4], for this choice of kernel we can show that the MLAWGF converges exponentially fast to the target distribution, at a rate independent of the Poincaré constant. Proposition 14. Assume that π satisfies a mirrored PI with some constant κ > 0. In addition, suppose that Lν has a discrete spectrum. Then the KL divergence converges exponentially along the MLAWGF, with a rate independent of κ:

KL(µt||π) (KL(µ0||π) 2) e t. (84)

Proof. Using standard rules for calculus on the Wasserstein space, and the integration by parts formula Eν f, g = Eν[f Lνg] [e.g., 5], we have

d KL(ηt||ν)

, Pν,k Lν ηt

L2(π) = Z ηt

ν LνPν,k Lν ηt

ν dν(x) = ηt

By [43, Theorem 2] and (41) - (43) in the proof of Proposition 2, we have that KL(ηt||ν) = KL(µt||π) and ηt

ν 2 L2(ν) = [ 2ϕ] 1 µt

π 2 L2(π), which implies that

d KL(µt||π)

L2(π) = X 2(µt||π) KL(µt||π), (85)

where in the final line we have once again used the fact that KL( ||π) X 2( ||π). The conclusion now follows via Grönwall s inequality.

To obtain an implementable algorithm, observe that the vector field in the continuity equation can be rewritten as [20]

ν ( ) = Z 1k Lν( , y)ηt

ν (y)dν(y) = Z 1k Lν( , y)dη(y). (86)

Next, using a forward Euler discretisation in time, we arrive at

ηt+1 = id γ Pν,k Lν ηt

# ηt, µt+1 = ( ϕ)#ηt+1. (87)

Finally, we approximate the expectations in (87) with samples. In particular, after initialising a collection of particles xi 0 i.i.d. µ0 for i [N], we set yi 0 = ϕ(xi 0) for i [N], and then update

yi t+1 = yi t γ 1

j=1 yik Lν(yi t, yj t ), xi t+1 = ϕ (yi t+1). (88)

This algorithm, which we refer to as the mirrored LAWGD (MLAWGD) algorithm, is summarised in Alg. 4.

Algorithm 4 Mirrored LAWGD

input: target density π, kernel k Lν, mirror function ϕ, particles (xi 0)N i=1 µ, step size γ. initialise: for i [N], yi 0 = ϕ(xi 0). for t = 0, . . . , T 1 do

For i [N], yi t+1 = yi t γ 1

N PN j=1 yik Lν(yi t, yj t ). For i [N], xi t+1 = ϕ (yi t+1). return (xi T )N i=1.

B.4 Mirrored Kernel Stein Discrepancy Descent

Finally, we consider the MWGF of the kernel Stein discrepancy (KSD) [21, 38, 62]. In so doing, we will obtain a mirrored version of the kernel Stein discrepancy descent (KSDD) algorithm in [51].

B.4.1 MWGF of the Kernel Stein Discrepancy

We begin by defining F(η) = 1

2KSD2 ϕ(η|ν), where KSDϕ(η|ν) is the mirrored KSD, which we define as

KSDϕ(η|ν) =

Rd kν,ϕ(y, y )dη(y)dη(y ), (89)

where kν,η is the mirrored Stein kernel, defined in terms of the score sν = log ν of the mirrored target ν = ( ϕ)#π, and the positive semi-definite kernel kϕ as

kν,ϕ(y, y ) = y log ν(y) kϕ(y, y ) y log ν(y ) + y log ν(y) y kϕ(y, y )

+ yk ϕ (y, y ) log ν(y ) + ykϕ(y, ), y kϕ( , y ) Hd kϕ, (90)

and where, as before, kϕ is the kernel defined according to kϕ(y, y ) = k( ϕ (y), ϕ (y )), for some base kernel k : X X R. The MKSD is nothing more than the KSD between η = ( ϕ)#µ and ν = ( ϕ)#π, with respect to the kernel kϕ.

In this case, one can show that W2F(η) = Ey η[ 2kν,ϕ(y, )] [51, Proposition 2]. We can thus define the MKSD gradient flow according to

t (Ey ηt[ 2kν,ϕ(y, )]ηt) = 0, µt = ( ϕ )#ηt. (91)

Similar to the previous sections, we would like to study the properties of this scheme in continuous time.

Proposition 15. The dissipation of 1

2KSD2 ϕ(ηt|ν) along the MWGF in (91) is given by

2KSD2 ϕ(ηt|ν) dt = Ey ηt h ||Ey ηt [ y kν,ϕ(y, y )]||2i (92)

Proof. The result follows straightforwardly using the chain rule. In particular, we have

2KSD2 ϕ(ηt|ν) dt = wt, W2 1 2KSD2(ηt|ν)

L2(ηt) (93)

= Ey ηt h ||Ey ηt [ y kν,ϕ(y, y )]||2i . (94)

Remark 10. Unlike elsewhere (i.e., Propositions 1 - 2, Proposition 8, and Propositions 11 - 12) this dissipation result is given in terms of the mirrored densities (ηt)t 0 and the mirrored target ν = ( ϕ)#π, rather than densities (µt)t 0 and the target π. The crucial difference here is that the objective functional 1

2KSD2(ηt|ν) is not equal to 1

2KSD2(µt|π), whereas previously it was true that, e.g., KL(ηt||ν) = KL(µt||π) or χ2(ηt||ν) = χ2(µt||π). Nonetheless, it is possible to give a dissipation result in terms of the target π.

Algorithm 5 Mirrored KSDD

input: target density π, kernel k, mirror function ϕ, particles (xi 0)N i=1 µ, step size γ. initialise: for i [N], yi 0 = ϕ(xi 0). for t = 0, . . . , T 1 do

For i [N], yi t+1 = yi t γ 1

N2 PN j=1 yikν,ϕ(yj t , yi t). For i [N], xi t+1 = ϕ (yi t+1). return (xi T )N i=1.

We begin by rewriting the LHS of (92). Recall that the squared KSD between µ and π is identical to the Stein Fisher information between µ and π [e.g., 62, Theorem 3.6]. This also holds for η = ( ϕ)#µ and ν = ( ϕ)#π in the mirrored space, and thus we have

KSD2 ϕ(η|ν) = Sµ,kϕ log η

Hd kϕ = Sµ,k[ 2ϕ] 1 log µ

Hd k , (95)

where in the second equality we have used the result obtained in (57) - (60); see the proof of Proposition 8. Thus, in particular, F(η) is equal to the mirrored Stein Fisher information, up to a factor half.

We now turn our attention to the RHS of (92). We would like to simplify y kν,ϕ(y, y ), as defined in (90). Let y = ϕ(x) and y = ϕ(x ). Using the definition of kϕ, we have kϕ(y, y ) = k(x, x ). Using also the chain rule, we have

ykϕ(y, y ) = [ 2ϕ(x)] 1 xk(x, x ) , y kϕ(y, y ) = [ 2ϕ(x )] 1 x k(x, x ) (96)

Similarly, it holds that

ykϕ(y, ), y kϕ( , y ) Hd kϕ = 2ϕ(x) 1 xk(x, ), 2ϕ(x ) 1 x k( , x ) Hd k. (97)

Finally, due to (20), we have y log ν(y) = [ 2ϕ(x)] 1 x log π(x) x log det 2ϕ(x) . Putting everything together, we thus have that

kν,ϕ(y, y ) = [ 2ϕ(x)] 1[ x log π(x) x log det 2ϕ(x)] k(x, x )

[ 2ϕ(x)] 1[ x log π(x) x log det 2ϕ(x)]

+ [ 2ϕ(x)] 1[ x log π(x) x log det 2ϕ(x)] [ 2ϕ(x)] 1 xk(x, x )

+ [ 2ϕ(x )] 1 x k(x, x ) [ 2ϕ(x)] 1[ x log π(x) x log det 2ϕ(x)]

+ [ 2ϕ(x)] 1 xk(x, ), [ 2ϕ(x )] 1 x k( , x ) := kπ,ϕ(x, x ). (98)

Using (95) and (98) we thus have the following dissipation result for the mirrored Stein Fisher information in terms of the target π:

Sµt,k[ 2ϕ] 1 log µt

Hd k = 2Ex µt

Ex µt h kπ,ϕ(x, x ) i 2 . (99)

B.4.2 Mirrored Kernel Stein Discrepancy Descent (MKSDD)

To obtain an implementable MKSDD algorithm, it is, of course, necessary to discretise in time and in space. In this case, following [51], we will first discretise in space, replacing the measure η by a discrete approximation ˆη = 1

N PN j=1 δyj. This yields, writing ωN = {y1, . . . , yn},

F ωN := F(ˆη) = 1 2N 2

i,j=1 kν,ϕ(yi, yj), yi F ωN := F(ˆη) = 1 N 2

j=1 2kν,ϕ(yj, yi).

(100) The mirrored KSDD algorithm thus takes the following form. Initialise a collection of particles xi 0 i.i.d. µ0 for i [N], and set yi 0 = ϕ(xi 0) for i [N]. Then, update

yi t+1 = yt γ 1

j=1 2kν,ϕ(yj t , yi t) , xi t+1 = ϕ (yi t+1). (101)

C Convergence of Mirrored Coin Wasserstein Gradient Descent

Proposition 16. Suppose that π is strongly log-concave and that || W2F(ηt)(yt)|| L for all t [T]. In addition, suppose that (ηt)t N0 and ν satisfy [81, Assumption B.1]. Then there exists a mirror map ϕ such that

X || ϕ(x)||

T ln 1 + 96K2T 2 || ϕ(x)||2 π(dx)

X || ϕ(x)||

T ln 1 + 96T 2 || ϕ(x)||2 µ0(dx) , (102)

where K > 0 is the constant defined in [81, Assumption B.1].

Proof. First note, using Jensen s inequality, and [43, Theorem 2], that

t=1 KL ( µt|| π) (103)

t=1 [KL(ηt||ν) KL(ν||ν)] (104)

t=1 F (ηt) F(ν), (105)

where in the final line we have defined F(η) := KL(η||π). Now, under the assumption that π is strongly log-concave, [43, Proof of Theorem 3] guarantees the existence of a mirror map ϕ : X R { } such that the mirrored target ν = ( ϕ)#π is also strongly log-concave. It follows, in particular, that F(η) := KL(η||π) is geodesically convex [e.g., 2, Chapter 9]. Thus, arguing as in the proof of [81, Proposition 3.3], with w0 = 1, it follows that

t=1 F (ηt) F(ν) L

T ln 1 + 96K2T 2 ||y||2 ν(dy)

T ln 1 + 96T 2 ||y||2 η0(dy) . (106)

Finally, once more using the definition of the mirror map and in particular the fact that ν = ( ϕ)#π and η0 = ( ϕ)#η0, we can rewrite the RHS in (106) as the RHS in (102). This completes the proof.

D Other Mirrored Coin Sampling Algorithms

In this section, we provide some other mirrored coin sampling algorithms besides Coin MSVGD. Recall, from Sec. 4.2, the general form of the mirrored coin sampling algorithm. Let x0 µ0 P2(X), and y0 = ϕ(x0) Rd. Then, for t N, update

yt = y0 + Pt 1 s=1 cηs(ys)

s=1 cηs(ys), ys y0

, xt = ϕ (xt), (107)

where cηt(yt) = W2F(ηt)(yt), or some variant thereof. As noted in Sec. 4.2, for different choices of F and (cηt)t N0, one obtains learning-rate free analogues of different mirrored Par VI algorithms.

In Sec. 4.2, we provided the Coin MSVGD algorithm (Alg. 2), which corresponds to F = KL( |ν) and (cηt)t N0 = ( Pηt,kϕ log( ηt

ν ))t N0. This can be viewed as the coin sampling analogue of MSVGD (Sec. 3.2.1 and App. B.2). We now provide two more mirrored coin sampling algorithms, which correspond to the coin sampling analogues of MLAWGD (App. B.3) and MKSDD (App. B.4).

D.1 Coin MLAWGD

Algorithm 6 Coin MLAWGD

input: target density π, kernel k Lν, mirror map ϕ, initial particles (xi 0)N i=1 µ0. initialise: for i [N], yi 0 = ϕ(xi 0). for t = 1, . . . , T do

yi t = yi 0

Pt 1 s=1 1 N PN j=1 yik Lν(yi s, yj s)

j=1 yik Lν(yi s, yj s), yi s yi 0

xi t = ϕ (yi t).

where k Lν is the mirrored LAWGD kernel defined in (81). return (xi T )N i=1.

D.2 Coin MKSDD

Algorithm 7 Coin MKSDD

input: target density π, kernel k, mirror map ϕ, initial particles (xi 0)N i=1 µ0. initialise: for i [N], yi 0 = ϕ(xi 0). for t = 1, . . . , T do

yi t = yi 0

Pt 1 s=1 1 N 2 PN j=1 yikν,ϕ(yj s, yi s)

j=1 yikν,ϕ(yj s, yi s), yi s yi 0

xi t = ϕ (yi t).

where kν,ϕ is the mirrored Stein kernel defined in (90). return (xi T )N i=1.

E Adaptive Coin MSVGD

In Sec. 4.1 - 4.2, we described a betting strategy which is valid under the assumption that the sequence of outcomes (ct)t N := (cηt(yt))t N are bounded by one, in the sense that ||cηt(yi t)|| 1 for all t [T] and for all i [N] [see also 81, Sec. 3.4]. In practice, this may not be the case. If, instead, ||cηt(yi t)|| L for some constant L, then one can simply replace cηt(yt) by its normalised version, namely ˆcηt(yt) = cηt(yt)/L in Alg. 2.

If, on the other hand, the constant L is unknown in advance, then one can use a coordinate-wise empirical estimate Li t,j which is adaptively updated as the algorithm runs, following [72] and later [81, App. D].3 We provide full details of this version of Coin MSVGD in Alg. 8. One can also obtain an analogous version of Coin MIED; we omit the full details here in the interest of brevity.

Finally, when we use a coin sampling algorithm to perform inference for a (fairness) Bayesian neural network (see Sec. 6.3), we alter the denominator of the betting fraction in Alg. 8 such that it is at least 100Li t,j. Thus, the update equation in Alg. 8 (or, analogously, the update equation in the adaptive version of Coin MIED) now reads as

yi t,j = yi 0,j +

Pt 1 s=1 ci s,j max(Gi t,j + Li t,j, 100Li t,j)(1 + Ri t,j Li t,j ). (108)

This modification, recommended in [72, Sec. 6], has the effect of restricting the value of the particles in the first few iterations. We should emphasise that these adaptive algorithms, which in practice we use in all experiments, are still free of any hyperparameters which need to be tuned by hand.

3Here, i [N] indexes the particle, and j [d] indexes the dimension (see Alg. 8).

Algorithm 8 Adaptive Coin MSVGD

input: target density π, kernel k, mirror map ϕ, initial particles (xi 0)N i=1 µ0. initialise: for i [N], yi 0 = ϕ(xi 0); for i [N] and j [d], Li 0,j = 0, Gi 0,j = 0, Ri 0,j = 0. for t = 1 to T do

ci t 1 = Pˆηt,kϕ log ˆηt 1

(yi t 1) , ˆηt 1 = 1

k=1 δyk t 1 (compute SVGD gradient)

For i [N], j [d],

Li t,j = max(Li t 1,j, |ci t 1,j|) (update max. observed scale)

Gi t,j = Gi t 1,j + |ci t 1,j| (update sum of abs. value of gradients)

Ri t,j = max(Ri t 1,j + ci t 1,j, yi t 1,j yi 0,j , 0) (update total reward)

yi t,j = yi 0,j +

Pt 1 s=1 ci s,j Gi t,j + Li t,j (1 + Ri t,j Li t,j ). (update the particles)

For i [N], xi t = ϕ (yi t). output: (xi T )N i=1.

Computational Cost. It is worth noting that, in terms of computational and memory cost, the adaptive Coin MSVGD algorithm is no worse than MSVGD when the latter is paired, as is common, with a method such as Adagrad [31], RMSProp [92], or Adam [48]. The computational cost per iteration is O(N 2), due to the kernelised gradient. Meanwhile, in terms of memory requirements, it is necessary to keep track of the maximum observed absolute value of the gradients (component-wise), the sum of the absolute value of the gradients (component-wise), and the total reward (component-wise). This is similar to, e.g., Adam, which keeps track of exponential moving averages of the gradient and the (component-wise) squared gradient [48].

F Additional Experimental Details

In this section, we provide additional details relevant to the experiments carried out in Sec. 6. We implement all methods using Python 3, Py Torch, and Tensor Flow. We use the existing implementations of MSVGD and SVMD [82] and MIED [60] to run these methods. We perform all experiments using a Mac Book Pro 16" (2021) laptop with Apple M1 Pro chip and 16GB of RAM.

For the experiments in Sec. 6.1 and 6.2, we use the inverse multiquadric (IMQ) kernel k(x, x ) = (1 + {||x x ||2 2/h2) 1

2 due to its favourable convergence control properties (e.g., [38]). Unless otherwise stated, the bandwidth h is determined using the median heuristic [36, 63]. For the experiment in Sec. 6.3, following [60], we use the s-Riesz family of mollififiers {ϕs ϵ} with s = d + 10 4 and ϵ = 10 8.

F.1 Simplex Targets

For both experiments in Sec. 6.1, we initialise all methods with i.i.d samples from a Dirichlet(5) distribution and use the entropic mirror map

k=1 xk log xk + (1

k=1 xk) log(1

k=1 xk). (109)

We run all methods for T = 500 iterations using N = 50 particles. In the interest of a fair comparison, we run all of the learning-rate dependent methods (MSVGD, SVMD, projected SVGD) using RMSProp [92], an extension of the Adagrad algorithm [31], to adapt the learning rates.

Sparse Dirichlet Posterior. We first consider the sparse Dirichlet posterior in [73], which is given by

k=1 xnk+αk 1 k , x int( ), (110)

for parameters α1, . . . , αd+1 > 0 and observations n1, . . . , nd+1, where nk is the number of observations of category k. We consider the sparse regime, in which most of the nk are zero and the remaining nk are large, e.g., of order O(d). In particular, as in [82], we set αk = 0.1 for k {1, . . . , 20} and use count data n1 = 90, n2 = n3 = 5, and nj = 0 for j > 3. This target is used to replicate the multimodal sparse conditionals that arise in Latent Dirichlet Allocation (LDA) [9]. While this distribution is not log-concave, the dual-target is strictly log-concave under the entropic mirror map.

Quadratic Target. We next consider the quadratic simplex target introduced in [1]. In this case, the target takes the form

2σ2 x Ax , (111)

where A Rd d is a symmetric positive semidefinite matrix with all entries bounded in magnitude by 1. Following [82], we set σ = 0.01 and generate A by normalising products of random matrices with i.i.d. elements drawn from Unif[ 1, 1].

F.2 Confidence Intervals for Post Selection Inference

Given data (X, y) Rn p Rn, the randomised Lasso is defined as the solution of [89, Sec. 3.1]

ˆβ = arg min β Rp

2||y Xβ||2 2 + λ||β||1 ω β + ε

where λ is the ℓ1 penalty parameter, ε R+ is a ridge term, and ω Rp is an auxiliary random vector drawn from some distribution G with density g. We choose G to be a zero-mean independent Gaussian distribution. The scale of this distribution, and the parameter ε, are both determined automatically by the randomized Lasso function in the selective Inference package. The distribution of interest has density proportional to (e.g., [90, Sec. 4.2])

ˆg(ˆβE, ˆz E) g ε(ˆβE, 0) X (y XE ˆβE) + λ(ˆz E, ˆz E) , (113)

with support D = {(ˆβE, ˆz E) : sign(ˆβE) = ˆz E , ||ˆz E|| 1}. In all experiments, following [89], we integrate out ˆz E analytically and reparameterise ˆβE as ˆz E |ˆβE|. This yields a log-concave density for |ˆβE|, with domain given by the non-negative orthant in Rq. We then use the mirror map

k=1 (xk log xk xk) . (114)

Synthetic Example. We consider the simulation in [79, Sec. 5.3], with n = 100 and p = 40. The design matrix X is generated from an equi-correlated model. In particular, for i = 1, . . . , n, the rows xi i.i.d. N(0, Σ), where Σij = δi,j + ρ(1 δi,j), with ρ = 0.3. We also normalise the columns of X to have norm 1. Meanwhile, the response is distributed as y N(0, In), independent of X. That is the true parameter β = 0. We set ε = ˆσ2 n, where ˆσ denotes the empirical standard deviation of the response. We set λ using the default value returned by theoretical.lambda in the selective Inference package, multiplied by a factor of 0.7n. This adjustment reduces the ℓ1 regularisation, ensuring that a reasonable subset of the p = 40 features are selected.

We run each algorithm for T = 1000 iterations. For Coin MSVGD, MSVGD, and SVMD, we use N = 50 particles, and generate Ntotal samples by aggregating the particles from Ntotal/N independent runs. For the norejection method, we generate Ntotal samples by running the algorithm for Ntotal iterations after burn-in. Finally, following [82], in Fig. 3 the error bars for the actual coverage levels correspond to 95% Wilson intervals [101].

HIV-1 Drug Resistance. We use the vitro measurement of log-fold change under 3TC as the response and include mutations that appear more than 10 times in the dataset as predictors. The resulting dataset consists of n = 633 cases and p = 91 predictors. In the selection stage, we apply the randomised Lasso, setting the regularisation parameter to the theoretical value in selective Inference, multiplied by a factor of n. We implement SVMD and MSVGD using RMSProp [92], with an initial learning rate of γ = 0.01 following [82]. We implement each method for T = 2000 iterations, using N = 50 particles.

F.3 Fairness Bayesian Neural Network

We use a train-test split of 80% / 20%, as in [60, 65]. We implement the methods in [60, 65] using their source code, using the default hyperparameters. We run all algorithms for T = 2000 iterations, and using N = 50 particles. Similar to [60], we found that one of the algorithms (Control + SVGD) in [64] got stuck after initialisation, with accuracy close to 0.75 and did not improve during training. We thus omit the results for this method in Fig. 5.

G Additional Numerical Results

G.1 Simplex Targets

In Fig. 6 and Fig. 7, we provide additional results for the sparse Dirichlet posterior [73] and the quadratic simplex target [1] considered in Sec. 6.1. In particular, in Fig. 6a and Fig. 7a, we plot the energy distance to a set of ground truth samples either obtained i.i.d. (sparse Dirichlet target) or using NUTS [42] (quadratic target) versus the number of iterations for Coin MSVGD, MSVGD, SVMD, and projected versions of Coin SVGD and SVGD. Meanwhile, in Fig. 6b and Fig. 7b, we compare the performance of Coin MSVGD with the performance of MSVGD, for multiple choices of the learning rate.

0 100 200 300 400 500 Iterations

Energy Distance

Coin MSVGD Projected Coin SVGD MSVGD SVMD Projected SVGD

(a) All methods.

0 100 200 300 400 500 Iterations

1.75 Coin MSVGD MSVGD (LR=5e-01) MSVGD (LR=1e-01)

MSVGD (LR=1e-02) MSVGD (LR=1e-03) MSVGD (LR=1e-04)

(b) Coin MSVGD vs MSVGD.

Figure 6: Additional results for the sparse dirichlet posterior in [73]. (a) Energy distance vs. iterations for Coin MSVGD, MSVGD, SVMD, projected Coin SVGD, and projected SVGD; and (b) energy distance vs iterations for Coin MSVGD and MSVGD, using five different learning rates.

0 100 200 300 400 500 Iterations

Energy Distance

Coin MSVGD Projected Coin SVGD MSVGD SVMD Projected SVGD

(a) All methods.

0 100 200 300 400 500 Iterations

Coin MSVGD MSVGD (LR=5e-01) MSVGD (LR=1e-01)

MSVGD (LR=1e-02) MSVGD (LR=1e-03) MSVGD (LR=1e-04)

(b) Coin MSVGD vs MSVGD.

Figure 7: Additional results for the quadratic simplex target in [1]. (a) Energy distance vs. iterations for Coin MSVGD, MSVGD, SVMD, projected Coin SVGD, and projected SVGD; and (b) energy distance vs iterations for Coin MSVGD and MSVGD, using five different learning rates. As observed in Sec. 6.1, for the sparse Dirichlet posterior, Coin MSVGD converges more rapidly than MSVGD and SVMD, even for well-chosen values of the learning rate (Fig. 6). Meanwhile, for the quadratic target, Coin MSVGD generally converges more rapidly than MSVGD but not as fast as SVMD [82], which leverages the log-concavity of the target in the primal space (Fig. 7). In both cases, for poorly chosen values of the learning rate, MSVGD (e.g., γ = 5 10 1 in Fig. 6b and γ = 1 10 4, γ = 1 10 3, or γ = 5 10 1 in Fig. 7b) entirely fails to converge to the true target.

Results for Different Numbers of Particles. In Fig. 8 and Fig. 9, we provide additional results for the two simplex targets, now varying the number of particles N. The results are similar to those reported in Sec. 6.1. In particular, Coin MSVGD has comparable performance to MSVGD and SVMD with optimal learning rates, but significantly outperforms both algorithms for sub-optimal learning rates. In addition, as expected, the posterior approximation quality of Coin MSVGD, as measured by the energy distance, improves as a function of the number of particles N (Fig. 8d and Fig. 9d).

10 5 10 4 10 3 10 2 10 1 100

Learning Rate

Energy Distance

Coin MSVGD Proj. Coin SVGD MSVGD SVMD Proj. SVGD

(a) N = 10.

10 5 10 4 10 3 10 2 10 1 100

Learning Rate

Energy Distance

Coin MSVGD Proj. Coin SVGD MSVGD SVMD Proj. SVGD

(b) N = 20.

10 5 10 4 10 3 10 2 10 1 100

Learning Rate

Energy Distance

Coin MSVGD Proj. Coin SVGD MSVGD SVMD Proj. SVGD

(c) N = 100.

50 100 150 200 250 N

Energy Distance

Coin MSVGD MSVGD (LR=1e-01)

MSVGD (LR=5e-02) MSVGD (LR=1e-02)

(d) N {10, . . . , 250}.

Figure 8: Additional results for the sparse Dirichlet posterior in [73]. (a) - (c) Energy distance vs learning rate, for several values of N, for Coin MSVGD, MSVGD, SVMD, projected Coin SVGD, and projected SVGD. (d) Energy distance vs N, for several different values of the learning rate, for Coin MSVGD and MSVGD. We run each algorithm for T = 250 iterations.

10 5 10 4 10 3 10 2 10 1 100

Learning Rate

Energy Distance

Coin MSVGD Proj. Coin SVGD MSVGD SVMD Proj. SVGD

(a) N = 10.

10 5 10 4 10 3 10 2 10 1 100

Learning Rate

Energy Distance

Coin MSVGD Proj. Coin SVGD MSVGD SVMD Proj. SVGD

(b) N = 20.

10 5 10 4 10 3 10 2 10 1 100

Learning Rate

Energy Distance

Coin MSVGD Proj. Coin SVGD MSVGD SVMD Proj. SVGD

(c) N = 100.

50 100 150 200 250 N

Energy Distance

Coin MSVGD MSVGD (LR=5e-01) MSVGD (LR=1e-01)

MSVGD (LR=5e-02) MSVGD (LR=2e-02) MSVGD (LR=1e-02)

(d) N {10, . . . , 250}.

Figure 9: Additional results for the quadratic target in [1]. (a) - (c) Energy distance vs learning rate, for several values of N, for Coin MSVGD, MSVGD, SVMD, projected Coin SVGD, and projected SVGD. (d) Energy distance vs N, for several different values of the learning rate, for Coin MSVGD and MSVGD. We run each algorithm for T = 500 iterations.

G.2 Sampling from a Uniform Distribution

In this section, we consider the task of sampling from the uniform distribution on the unit square, namely Unif[ 1, 1]2. For MSVGD and Coin MSVGD, we once again use the entropic mirror map. We also now use the RBF kernel, with a fixed bandwidth of h = 0.01.4 For MIED and Coin MIED, following [60], we reparameterise the particles using tanh to eliminate the constraint. We show results for several choices of mollifiers, in each casing choosing the smallest ϵ which guarantees the optimisation remains stable. We initialise all methods with particles drawn i.i.d. from U[ 0.5, 0.5]2. For both methods, in the interest of a fair comparison, we adapt the learning rates using RMSProp [92]. We use N = 100 particles and run each experiment for T = 250 iterations. We then assess the quality of the posterior approximations by computing the energy distance to a set of 1000 samples drawn i.i.d. from the true target distribution.

The results for this experiment are shown in Fig. 10. Similar to our previous synthetic examples, Coin MSVGD is competitive with the optimal performance of MSVGD. Meanwhile, for sub-optimal choices of the learning rate, the coin sampling algorithm is clearly preferable. The same is also true for Coin MIED which, depending on the choice of mollifier, obtains comparable or superior performance to MIED with the optimal choice of learning rate.

10 5 10 3 10 1

Learning Rate

Energy Distance

Coin MSVGD MSVGD

10 5 10 3 10 1

Learning Rate

Energy Distance

MIED (Riesz) Coin MIED (Riesz)

(b) MIED (Riesz)

10 5 10 3 10 1

Learning Rate

Energy Distance

MIED (Gaussian) Coin MIED (Gaussian)

(c) MIED (Gaussian)

10 5 10 3 10 1

Learning Rate

Energy Distance

MIED (Laplace) Coin MIED (Laplace)

(d) MIED (Laplace)

0 50 100 150 200 250 Iterations

Energy Distance

Coin MSVGD Coin MIED (Riesz) Coin MIED (Laplace) Coin MIED (Gaussian)

MSVGD MIED (Riesz) MIED (Laplace) MIED (Gaussian)

(e) Comparison of Methods.

Figure 10: Results for the uniform target: Unif[ 1, 1]2. Energy distance vs learning rate after 250 iterations for (a) Coin MSVGD and MSVGD; and (b) - (d) Coin MIED and MIED, for three different choices of mollifier.

G.3 Post-Selection Inference

In this section, we provide additional results for the post-selection inference examples considered in Sec. 6.2. In Fig. 11, we once again plot the coverage of the CIs obtained using Coin MSVGD, MSVGD, SVMD, and the norejection algorithm in selective Inference [91], as we vary either the nominal coverage or the total number of samples. Now, however, we consider two sub-optimal choices of the learning rate for MSVGD and SVMD. In particular, we now use an initial learning rate of γ = 0.001 (Fig. 11a - 11b) or γ = 0.1 (Fig. 11c - 11d) for RMSProp [92], rather than γ = 0.01 as in Fig. 3 (see Sec. 6.2).

4In this example, the use of an adaptive kernel (e.g., with the bandwidth chosen according to the median rule), can cause the samples from MSVGD to collapse [60].

In this case, we see that the performance of both MSVGD and SVMD deteriorates. In particular, they both now obtain significantly lower coverage than Coin MSVGD. As noted in the main text, this increased coverage is of particular importance for smaller sample sizes (see Fig. 11b and Fig. 11c), and for greater nominal coverage levels (see Fig. 11a and 11c), where the standard CIs generally undercover.

0.80 0.85 0.90 0.95 1.00 Nominal coverage

Actual coverage

Coin MSVGD MSVGD SVMD Standard

(a) Actual vs Nominal Coverage (γ = 0.001).

1000 1500 2000 2500 3000 Number of Samples

Actual Coverage

Coin MSVGD MSVGD SVMD Standard

(b) Coverage vs No. of Samples (γ = 0.001).

0.80 0.85 0.90 0.95 1.00 Nominal coverage

Actual coverage

Coin MSVGD MSVGD SVMD Standard

(c) Actual vs Nominal Coverage (γ = 0.1).

1000 1500 2000 2500 3000 Number of Samples

Actual Coverage

Coin MSVGD MSVGD SVMD Standard

(d) Coverage vs No. of Samples (γ = 0.1).

Figure 11: Additional coverage results for the post-selection inference. Coverage of post-selection confidence intervals for Coin MSVGD, MSVGD, and SVMD, using sub-optimal learning rates.

Finally, in Fig. 12, we perform a similar experiment for the real-data example involving the HIV-1 drug resistance dataset. In particular, we compute 90% CIs for the subset of mutations selected by the randomised Lasso, using 5000 samples and the same five methods as before: unadjusted (the unadjusted CIs), standard (the method in selective Inference), MSVGD, SVMD, and Coin MSVGD. Now, however, we use sub-optimal choices of the initial learning rate in RMSProp, namely, γ = 0.001 (Fig. 12a) and γ = 1.0 (Fig. 12b), compared to γ = 0.01 in the main paper.

Similar to the synthetic example, the impact of using a sub-optimal choice of learning rate is clear. In particular, CIs obtained using MSVGD and SVMD now differ substantially from those obtained by Coin MSVGD and the standard approach. This means, in particular, that several mutations are now incorrectly flagged as significant by MSVGD and SVMD (e.g., mutation P41L, P62V, P70R, or P215Y in Fig. 12b), when in reality there is insufficient evidence after conditioning on the selection event.

2.4 Coin MSVGD MSVGD SVMD Unadjusted Standard

(a) γ = 0.001.

2.4 Coin MSVGD MSVGD SVMD Unadjusted Standard

(b) γ = 1.0.

Figure 12: Additional real data results for post-selection inference. Unadjusted and post-selection confidence intervals for the mutations selected by the randomised Lasso as candidates for HIV-1 drug resistance, for two sub-optimal choices of the learning rate γ.