Published as a conference paper at ICLR 2022

SAMPLING WITH MIRRORED STEIN OPERATORS

Jiaxin Shi1, Chang Liu2, Lester Mackey1
1 Microsoft Research New England  2 Microsoft Research Asia
{jiaxinshi,chang.liu,lmackey}@microsoft.com

ABSTRACT

We introduce a new family of particle evolution samplers suitable for constrained domains and non-Euclidean geometries. Stein Variational Mirror Descent and Mirrored Stein Variational Gradient Descent minimize the Kullback-Leibler (KL) divergence to constrained target distributions by evolving particles in a dual space defined by a mirror map. Stein Variational Natural Gradient exploits non-Euclidean geometry to more efficiently minimize the KL divergence to unconstrained targets. We derive these samplers from a new class of mirrored Stein operators and adaptive kernels developed in this work. We demonstrate that these new samplers yield accurate approximations to distributions on the simplex, deliver valid confidence intervals in post-selection inference, and converge more rapidly than prior methods in large-scale unconstrained posterior inference. Finally, we establish the convergence of our new procedures under verifiable conditions on the target distribution.

1 INTRODUCTION

Accurately approximating an unnormalized distribution with a discrete sample is a fundamental challenge in machine learning, probabilistic inference, and Bayesian inference. Particle evolution methods like Stein variational gradient descent (SVGD, Liu & Wang, 2016) tackle this challenge by applying deterministic updates to particles using operators based on Stein's method (Stein, 1972; Gorham & Mackey, 2015; Oates et al., 2017; Liu et al., 2016; Chwialkowski et al., 2016; Gorham & Mackey, 2017) and reproducing kernels (Berlinet & Thomas-Agnan, 2011) to sequentially minimize the Kullback-Leibler (KL) divergence.
SVGD has found great success in approximating unconstrained distributions for probabilistic learning (Feng et al., 2017; Haarnoja et al., 2017; Kim et al., 2018) but breaks down for constrained targets, like distributions on the simplex (Patterson & Teh, 2013) or the targets of post-selection inference (Taylor & Tibshirani, 2015; Lee et al., 2016; Tian et al., 2016), and fails to exploit informative non-Euclidean geometry (Amari, 1998). In this work, we derive a family of particle evolution samplers suitable for target distributions with constrained domains and non-Euclidean geometries. Our development draws inspiration from mirror descent (MD) (Nemirovskij & Yudin, 1983), a first-order optimization method that generalizes gradient descent with non-Euclidean geometry. To sample from a distribution with constrained support, our method first maps particles to a dual space. There, we update particle locations using a new class of mirrored Stein operators and adaptive reproducing kernels introduced in this work. Finally, the dual particles are mapped back to sample points in the original space, ensuring that all constraints are satisfied. We illustrate this procedure in Fig. 1. In Sec. 3, we develop two algorithms, Mirrored SVGD (MSVGD) and Stein Variational Mirror Descent (SVMD), with different updates in the dual space; when only a single particle is used, MSVGD reduces to gradient ascent on the log dual space density, and SVMD reduces to mirror ascent on the log target density. In addition, by exploiting the connection between MD and natural gradient descent (Amari, 1998; Raskutti & Mukherjee, 2015), we develop a third algorithm, Stein Variational Natural Gradient (SVNG), that extends SVMD to unconstrained targets with non-Euclidean geometry. In Sec.
5, we demonstrate the advantages of our algorithms on benchmark simplex-constrained problems from the literature, constrained sampling problems in post-selection inference (Taylor & Tibshirani, 2015; Lee et al., 2016; Tian et al., 2016), and unconstrained large-scale posterior inference with the Fisher information metric. Finally, we analyze the convergence of our mirrored algorithms in Sec. 6 and discuss our results in Sec. 7.

Figure 1: Updating particle approximations in constrained domains Θ. Standard updates like SVGD (dashed arrow) can push particles outside of the support. Our mirrored Stein updates in Alg. 1 (solid arrows) preserve the support by updating particles in a dual space and mapping back to Θ.

Related work Our mirrored Stein operators (6) are instances of diffusion Stein operators in the sense of Gorham & Mackey (2017), but their specific properties have not been studied, nor have they been used to develop sampling algorithms. There is now a large body of work on transferring algorithmic ideas from optimization to MCMC (e.g., Welling & Teh, 2011; Simsekli et al., 2016; Dalalyan, 2017; Durmus et al., 2018; Ma et al., 2019) and SVGD-like sampling methods (e.g., Liu et al., 2019a;b; Zhu et al., 2020; Zhang et al., 2020a). The closest to our work in this space is the recent marriage of mirror descent and MCMC. For example, Hsieh et al. (2018) propose to run Langevin Monte Carlo (LMC, an Euler discretization of the Langevin diffusion) in a mirror space. Zhang et al. (2020b) analyze the convergence properties of the mirror-Langevin diffusion, Chewi et al. (2020) demonstrate its advantages over the Langevin diffusion when using a Newton-type metric, and Ahn & Chewi (2020) study its discretization for MCMC sampling in constrained domains. Relatedly, Patterson & Teh (2013) proposed stochastic Riemannian LMC for sampling on the simplex.
Several modifications of SVGD have been proposed to incorporate geometric information. Riemannian SVGD (RSVGD, Liu & Zhu, 2018) generalizes SVGD to Riemannian manifolds, but, even with the same metric tensor, their updates are more complex than ours: notably, they require higher-order kernel derivatives, do not operate in a mirror space, and do not reduce to natural gradient descent when a single particle is used. They also reportedly do not perform well with scalable stochastic estimates of $\nabla \log p$. Stein variational Newton (SVN, Detommaso et al., 2018; Chen et al., 2019) introduces second-order information into SVGD. Their algorithm requires an often expensive Hessian computation and need not lead to descent directions, so inexact approximations are employed in practice. Our SVNG can be seen as an instance of matrix SVGD (Wang et al., 2019) with an adaptive time-dependent kernel discussed in Sec. 4.4, a choice that is not explored in Wang et al. (2019) and which recovers natural gradient descent when n = 1, unlike the heuristic kernel constructions of Wang et al. (2019). None of the aforementioned works provide convergence guarantees, and neither SVN nor matrix SVGD deals with constrained domains.

2 BACKGROUND: MIRROR DESCENT AND NON-EUCLIDEAN GEOMETRY

Standard gradient descent can be viewed as optimizing a local quadratic approximation to the target function f: $\theta_{t+1} = \arg\min_{\theta \in \Theta} \langle \nabla f(\theta_t), \theta \rangle + \frac{1}{2\epsilon_t} \|\theta - \theta_t\|_2^2$. When $\Theta \subseteq \mathbb{R}^d$ is constrained, it can be advantageous to replace $\|\cdot\|_2^2$ with a function Ψ that reflects the geometry of the problem (Nemirovskij & Yudin, 1983; Beck & Teboulle, 2003):

$$\theta_{t+1} = \arg\min_{\theta \in \Theta} \langle \nabla f(\theta_t), \theta \rangle + \tfrac{1}{\epsilon_t} \Psi(\theta, \theta_t). \quad (1)$$

We consider the mirror descent algorithm (Nemirovskij & Yudin, 1983; Beck & Teboulle, 2003), which chooses Ψ to be the Bregman divergence induced by a strongly convex, essentially smooth1 function $\psi : \Theta \to \mathbb{R} \cup \{\infty\}$:

$$\Psi(\theta, \theta') = \psi(\theta) - \psi(\theta') - \langle \nabla\psi(\theta'), \theta - \theta' \rangle.$$
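To make the definition concrete: for the negative-entropy function $\psi(\theta) = \sum_i \theta_i \log \theta_i$ (the choice discussed next for the simplex), the induced Bregman divergence between two points of the simplex is exactly the KL divergence. A quick numerical check, a sketch with arbitrarily chosen simplex points:

```python
import numpy as np

def neg_entropy(theta):
    return np.sum(theta * np.log(theta))

def grad_neg_entropy(theta):
    return np.log(theta) + 1.0

def bregman(theta, theta2):
    # Psi(theta, theta') = psi(theta) - psi(theta') - <grad psi(theta'), theta - theta'>
    return (neg_entropy(theta) - neg_entropy(theta2)
            - grad_neg_entropy(theta2) @ (theta - theta2))

theta_a = np.array([0.2, 0.3, 0.5])
theta_b = np.array([0.4, 0.4, 0.2])
# For points summing to 1, the "+1" in grad_neg_entropy cancels in the
# correction term, and the Bregman divergence equals KL(theta_a || theta_b).
kl = np.sum(theta_a * np.log(theta_a / theta_b))
```

This identity is what makes the negative entropy a natural mirror function on the simplex: the proximity term in (1) penalizes movement in KL rather than Euclidean distance.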
When Θ is the (d + 1)-simplex $\{\theta \in \mathbb{R}^d : \sum_{i=1}^d \theta_i \le 1 \text{ and } \theta_i \ge 0 \text{ for } i \in [d]\}$, a common choice of ψ is the negative entropy $\psi(\theta) = \sum_{i=1}^{d+1} \theta_i \log \theta_i$, where $\theta_{d+1} := 1 - \sum_{i=1}^d \theta_i$. The solution of (1) is given by

$$\theta_{t+1} = \nabla\psi^*(\nabla\psi(\theta_t) - \epsilon_t \nabla f(\theta_t)), \quad (2)$$

where $\psi^*(\eta) := \sup_{\theta \in \Theta} \langle \eta, \theta \rangle - \psi(\theta)$ is the convex conjugate of ψ and $\nabla\psi$ is a bijection from Θ to $\mathrm{dom}(\nabla\psi^*)$ with inverse map $(\nabla\psi)^{-1} = \nabla\psi^*$. We can view the update in (2) as first mapping $\theta_t$ to $\eta_t = \nabla\psi(\theta_t)$, applying the update $\eta_{t+1} = \eta_t - \epsilon_t \nabla f(\theta_t)$, and mapping back through $\theta_{t+1} = \nabla\psi^*(\eta_{t+1})$.

1 ψ is continuously differentiable on the interior of Θ with $\|\nabla\psi(\theta_t)\| \to \infty$ whenever $\theta_t \to \theta \in \partial\Theta$.

Mirror descent can also be viewed as a discretization of the continuous-time dynamics $d\eta_t = -\nabla f(\theta_t)\,dt$, $\theta_t = \nabla\psi^*(\eta_t)$, which is equivalent to the Riemannian gradient flow (see App. A):

$$d\theta_t = -\nabla^2\psi(\theta_t)^{-1} \nabla f(\theta_t)\,dt, \quad \text{or equivalently,} \quad d\eta_t = -\nabla^2\psi^*(\eta_t)^{-1} \nabla_{\eta_t} f(\nabla\psi^*(\eta_t))\,dt, \quad (3)$$

where $\nabla^2\psi(\theta)$ and $\nabla^2\psi^*(\eta)$ are Riemannian metric tensors. In information geometry, the discretization of (3) is known as natural gradient descent (Amari, 1998). There is considerable theoretical and practical evidence (Martens, 2014) that natural gradient methods work efficiently in learning.

3 STEIN'S IDENTITY AND MIRRORED STEIN OPERATORS

Stein's identity (Stein, 1972) is a tool for characterizing a target distribution P using a so-called Stein operator. We assume P has a differentiable density p with closed convex support $\Theta \subseteq \mathbb{R}^d$. A Stein operator $\mathcal{S}_p$ takes as input functions g from a Stein set $\mathcal{G}$ and outputs mean-zero functions under p:

$$\mathbb{E}_{\theta \sim p}[(\mathcal{S}_p g)(\theta)] = 0 \quad \text{for all } g \in \mathcal{G}. \quad (4)$$

Gorham & Mackey (2015) proposed the Langevin Stein operator given by

$$(\mathcal{S}_p g)(\theta) = \langle g(\theta), \nabla \log p(\theta) \rangle + \nabla \cdot g(\theta), \quad (5)$$

where g is a vector-valued function and $\nabla \cdot g$ is its divergence. For an unconstrained domain with $\mathbb{E}_p[\|\nabla \log p(\theta)\|_2] < \infty$, Stein's identity (4) holds for this operator whenever $g \in C^1$ is bounded and Lipschitz (Gorham et al., 2019, proof of Prop. 3).
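Identity (4) for the Langevin operator (5) is easy to verify numerically on an unconstrained target. A one-dimensional Monte Carlo sketch for $p = \mathcal{N}(0,1)$, using the bounded Lipschitz test function $g = \tanh$ (our arbitrary choice, not one used in the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

def stein_term(theta):
    # (S_p g)(theta) = g(theta) * d/dtheta log p(theta) + g'(theta)
    # with p = N(0, 1), so grad log p(theta) = -theta, and g = tanh,
    # whose derivative is sech^2 = 1 / cosh^2.
    return np.tanh(theta) * (-theta) + 1.0 / np.cosh(theta) ** 2

samples = rng.standard_normal(1_000_000)
est = stein_term(samples).mean()  # Monte Carlo estimate; should be near 0
```

The sample mean concentrates around zero, as the identity predicts; the next paragraph explains why the same check can fail on constrained domains.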
However, on constrained domains Θ, Stein's identity fails to hold for many reasonable inputs g if p is non-vanishing or explosive at the boundary. Motivated by this deficiency and by a desire to exploit non-Euclidean geometry, we propose an alternative mirrored Stein operator,

$$(\mathcal{M}_{p,\psi} g)(\theta) = \langle g(\theta), \nabla^2\psi(\theta)^{-1} \nabla\log p(\theta) \rangle + \nabla \cdot (\nabla^2\psi(\theta)^{-1} g(\theta)), \quad (6)$$

where ψ is a strongly convex, essentially smooth function as in Sec. 2 with $(\nabla^2\psi)^{-1}$ differentiable and Lipschitz on Θ. We derive this operator from the (infinitesimal) generator of the mirror-Langevin diffusion (19) in App. C. The following result, proved in App. I.1, shows that $\mathcal{M}_{p,\psi}$ generates mean-zero functions under p whenever $\nabla^2\psi^{-1}$ suitably cancels the growth of p at the boundary.

Proposition 1. Suppose that $\nabla^2\psi(\theta)^{-1} \nabla\log p(\theta)$ and $\nabla \cdot \nabla^2\psi(\theta)^{-1}$ are p-integrable. If $\lim_{r \to \infty} \int_{\partial\Theta_r} p(\theta)\, \|\nabla^2\psi(\theta)^{-1} n_r(\theta)\|_2\, d\theta = 0$ for $\Theta_r := \{\theta \in \Theta : \|\theta\| \le r\}$ and $n_r(\theta)$ the outward unit normal vector2 to $\partial\Theta_r$ at θ, then $\mathbb{E}_p[(\mathcal{M}_{p,\psi} g)(\theta)] = 0$ whenever $g \in C^1$ is bounded and Lipschitz.

Example 1 (Dirichlet p, negative entropy ψ). When $\theta_{1:d+1} \sim \mathrm{Dir}(\alpha)$ for $\alpha \in \mathbb{R}^{d+1}_+$, even setting g(θ) = 1 in (5) need not cause the identity to hold when some $\alpha_j \le 1$. However, when $\psi(\theta) = \sum_{j=1}^{d+1} \theta_j \log \theta_j$, we show in App. B that the conditions of Prop. 1 are met for any α. Remarkably, the mirror-Langevin diffusion for our choice of ψ is the Wright-Fisher diffusion (Ethier, 1976), which Gan et al. (2017) recently used to bound distances to Dirichlet distributions.

4 SAMPLING WITH MIRRORED STEIN OPERATORS

Liu & Wang (2016) pioneered the idea of using Stein operators to approximate a target distribution with particles. Their popular SVGD algorithm updates each particle in its approximation by applying the update rule $\theta_{t+1} = \theta_t + \epsilon_t g_t(\theta_t)$ for a chosen mapping $g_t : \mathbb{R}^d \to \mathbb{R}^d$. Specifically, SVGD chooses the mapping $g_t^*$ that leads to the largest decrease in KL divergence to p in the limit as $\epsilon_t \to 0$. The following theorem summarizes their main findings.

Theorem 2 (Liu & Wang, 2016, Thm. 3.1).
Suppose $(\theta_t)_{t \ge 0}$ satisfies $d\theta_t = g_t(\theta_t)\,dt$ for bounded Lipschitz $g_t \in C^1 : \mathbb{R}^d \to \mathbb{R}^d$ and that $\theta_t$ has density $q_t$ with $\mathbb{E}_{q_t}[\|\nabla\log q_t\|_2] < \infty$. If $\mathrm{KL}(q_t \,\|\, p) := \mathbb{E}_{q_t}[\log(q_t/p)]$ exists, then, for the Langevin Stein operator $\mathcal{S}_p$ (5),

$$\frac{d}{dt} \mathrm{KL}(q_t \,\|\, p) = -\mathbb{E}_{\theta_t \sim q_t}[(\mathcal{S}_p g_t)(\theta_t)]. \quad (7)$$

2 For a closed convex set whose boundary ∂Θ can be locally represented as $F(\theta_1, \ldots, \theta_d) = 0$, its unit normal vector is defined as $n(\theta) = \nabla F(\theta_1, \ldots, \theta_d) / \|\nabla F(\theta_1, \ldots, \theta_d)\|_2$.

Algorithm 1 Mirrored Stein Variational Gradient Descent & Stein Variational Mirror Descent
Input: density p on Θ, kernel k, mirror function ψ, particles $(\theta_0^i)_{i=1}^n \subset \Theta$, step sizes $(\epsilon_t)_{t=1}^T$
Init: $\eta_0^i \leftarrow \nabla\psi(\theta_0^i)$ for $i \in [n]$
for t = 0 : T do
    if SVMD then $K_t \leftarrow K_{\psi,t}$ (13) else $K_t \leftarrow k \cdot I$ (MSVGD)
    for $i \in [n]$: $\eta_{t+1}^i \leftarrow \eta_t^i + \epsilon_t \frac{1}{n} \sum_{j=1}^n \mathcal{M}_{p,\psi} K_t(\theta_t^i, \theta_t^j)$ (for $\mathcal{M}_{p,\psi} K_t(\cdot, \theta)$ defined in Thm. 3)
    for $i \in [n]$: $\theta_{t+1}^i \leftarrow \nabla\psi^*(\eta_{t+1}^i)$
return $\{\theta_{T+1}^i\}_{i=1}^n$.

To improve its current particle approximation, SVGD finds the choice of $g_t$ that most quickly decreases $\mathrm{KL}(q_t\|p)$ at time t, i.e., it minimizes $\frac{d}{dt}\mathrm{KL}(q_t\|p)$ over a set of candidate directions $g_t$. SVGD searches for $g_t$ in a reproducing kernel Hilbert space (RKHS, Berlinet & Thomas-Agnan, 2011) norm ball $B_{\mathcal{H}^d} = \{g : \|g\|_{\mathcal{H}^d} \le 1\}$, where $\mathcal{H}^d$ is the product RKHS containing vector-valued functions with each component in the RKHS $\mathcal{H}$ of k. Then the optimal $g_t^* \in B_{\mathcal{H}^d}$ that minimizes (7) is

$$g_t^* \propto g^*_{q_t,k} := \mathbb{E}_{\theta_t \sim q_t}[k(\theta_t, \cdot)\nabla\log p(\theta_t) + \nabla_{\theta_t} k(\theta_t, \cdot)] = \mathbb{E}_{\theta_t \sim q_t}[\mathcal{S}_p K_k(\cdot, \theta_t)],$$

where we let $K_k(\theta, \theta') = k(\theta, \theta')I$, and $\mathcal{S}_p K_k(\cdot, \theta)$ denotes applying $\mathcal{S}_p$ to each row of $K_k(\cdot, \theta)$. SVGD has found great success in approximating unconstrained target distributions p but breaks down for constrained targets and fails to exploit non-Euclidean geometry. Our goal is to develop new particle evolution samplers suitable for constrained domains and non-Euclidean geometries.

4.1 MIRRORED DYNAMICS

SVGD encounters two difficulties when faced with a constrained support.
First, the SVGD updates can push the random variable $\theta_t$ outside of its support Θ, rendering all future updates undefined. Second, Stein's identity (4) often fails to hold for candidate directions in $B_{\mathcal{H}^d}$ (cf. Ex. 1). When this occurs, SVGD need not converge to p, as p need not be a stationary point of its dynamics (i.e., $\frac{d}{dt}\mathrm{KL}(q_t\|p) \neq 0$ even when $q_t = p$). Inspired by mirror descent (Nemirovskij & Yudin, 1983), we consider the following mirrored dynamics:

$$\theta_t = \nabla\psi^*(\eta_t) \quad \text{for} \quad d\eta_t = g_t(\theta_t)\,dt, \quad \text{or, equivalently,} \quad d\theta_t = \nabla^2\psi(\theta_t)^{-1} g_t(\theta_t)\,dt, \quad (8)$$

where $g_t : \Theta \to \mathbb{R}^d$ now represents the update direction in η space. The inverse mirror map $\nabla\psi^*$ automatically guarantees that $\theta_t$ belongs to the constrained domain Θ. Since ψ is strongly convex and $\nabla^2\psi^{-1}$ is bounded Lipschitz, it follows from Thm. 2 that, for any bounded Lipschitz $g_t$,

$$\frac{d}{dt}\mathrm{KL}(q_t\|p) = -\mathbb{E}_{\theta_t \sim q_t}[(\mathcal{M}_{p,\psi} g_t)(\theta_t)], \quad (9)$$

where $\mathcal{M}_{p,\psi}$ is the mirrored Stein operator (6). In the following sections, we propose three new deterministic sampling algorithms by seeking the optimal direction $g_t$ that minimizes (9) over different function classes. Thm. 3 (proved in App. I.2) forms the basis of our analysis.

Theorem 3 (Optimal mirror updates in an RKHS). Suppose $(\theta_t)_{t\ge 0}$ follows the mirrored dynamics (8). Let $\mathcal{H}_K$ denote the RKHS of a matrix-valued kernel $K : \Theta \times \Theta \to \mathbb{S}^{d \times d}$ (Micchelli & Pontil, 2005). Then the optimal direction $g_t$ that minimizes (9) over the norm ball $B_{\mathcal{H}_K} := \{g : \|g\|_{\mathcal{H}_K} \le 1\}$ is

$$g_t^* \propto g^*_{q_t,K} := \mathbb{E}_{\theta_t \sim q_t}[\mathcal{M}_{p,\psi} K(\cdot, \theta_t)], \quad (10)$$

where $\mathcal{M}_{p,\psi} K(\cdot, \theta)$ applies $\mathcal{M}_{p,\psi}$ (6) to each row of the matrix-valued function $K_\theta = K(\cdot, \theta)$.

4.2 MIRRORED STEIN VARIATIONAL GRADIENT DESCENT

Following the pattern of SVGD, one can choose the K of Thm. 3 to be $K_k(\theta, \theta') = k(\theta, \theta')I$, where k is any scalar-valued kernel. In this case, the resulting update $g^*_{q_t,K_k}(\cdot) = \mathbb{E}_{\theta_t \sim q_t}[\mathcal{M}_{p,\psi} K_k(\cdot, \theta_t)]$ is equivalent to running SVGD in the dual η space before mapping back to Θ.

Theorem 4 (Mirrored SVGD updates). In the setting of Thm.
3, let $k_\psi(\eta, \eta') = k(\nabla\psi^*(\eta), \nabla\psi^*(\eta'))$, let $p_H(\eta) = p(\nabla\psi^*(\eta))\,|\det \nabla^2\psi^*(\eta)|$ denote the density of $\eta = \nabla\psi(\theta)$ when $\theta \sim p$, and let $q_{t,H}$ denote the distribution of $\eta_t$ under the mirrored dynamics (8). If $K_k = k \cdot I$, then

$$g^*_{q_t,K_k}(\theta') = \mathbb{E}_{\eta_t \sim q_{t,H}}[k_\psi(\eta_t, \eta')\nabla\log p_H(\eta_t) + \nabla_{\eta_t} k_\psi(\eta_t, \eta')] \quad \forall \theta' \in \Theta,\ \eta' = \nabla\psi(\theta'). \quad (11)$$

The proof is in App. I.3. By discretizing the dynamics $d\eta_t = g^*_{q_t,K_k}(\theta_t)\,dt$ and initializing with any particle approximation $q_0 = \frac{1}{n}\sum_{i=1}^n \delta_{\theta_0^i}$, we obtain Mirrored SVGD (MSVGD), our first algorithm for sampling in constrained domains. The details are summarized in Alg. 1. When only a single particle is used (n = 1) and the differentiable input kernel satisfies $\nabla k(\theta, \theta) = 0$, the MSVGD update (11) reduces to gradient ascent on $\log p_H(\eta)$. Note, however, that the modes of the mirrored density $p_H(\eta)$ need not match those of the target density p(θ) (see the examples in App. E). Since we are primarily interested in the θ-space density, it is natural to ask whether there exists a mirrored dynamics that reduces to finding the mode of p(θ) in this limiting case. In the next section, we give an answer to this question by designing an adaptive reproducing kernel that yields a mirror descent-like update.

4.3 STEIN VARIATIONAL MIRROR DESCENT

Our second sampling algorithm for constrained problems is called Stein Variational Mirror Descent (SVMD). We start by introducing a new matrix-valued kernel that incorporates the metric $\nabla^2\psi$ and evolves with the distribution $q_t$.

Definition 1 (Kernels for SVMD). Given a continuous scalar-valued kernel k, consider the Mercer representation3 $k(\theta, \theta') = \sum_{i \ge 1} \lambda_{t,i} u_{t,i}(\theta) u_{t,i}(\theta')$ w.r.t. $q_t$, where $u_{t,i}$ is an eigenfunction satisfying

$$\mathbb{E}_{\theta_t \sim q_t}[k(\theta, \theta_t) u_{t,i}(\theta_t)] = \lambda_{t,i} u_{t,i}(\theta). \quad (12)$$

For $k_t^{1/2}(\theta, \theta') := \sum_{i \ge 1} \lambda_{t,i}^{1/2} u_{t,i}(\theta) u_{t,i}(\theta')$, we define the adaptive SVMD kernel at time t,

$$K_{\psi,t}(\theta, \theta') := \mathbb{E}_{\theta_t \sim q_t}[k_t^{1/2}(\theta, \theta_t)\,\nabla^2\psi(\theta_t)\,k_t^{1/2}(\theta_t, \theta')]. \quad (13)$$

By Thm.
3, the optimal update direction for the SVMD kernel ball is $g^*_{q_t,K_{\psi,t}} = \mathbb{E}_{q_t}[\mathcal{M}_{p,\psi} K_{\psi,t}(\cdot, \theta_t)]$. We obtain the SVMD algorithm (summarized in Alg. 1) by discretizing $d\eta_t = g^*_{q_t,K_{\psi,t}}(\theta_t)\,dt$ and initializing with $q_0 = \frac{1}{n}\sum_{i=1}^n \delta_{\theta_0^i}$. Because of the discrete representation of $q_t$, $K_{\psi,t}$ takes the form

$$K_{\psi,t}(\theta, \theta') = \sum_{i=1}^n \sum_{j=1}^n \lambda_{t,i}^{1/2} \lambda_{t,j}^{1/2}\, u_{t,i}(\theta)\, u_{t,j}(\theta')\, \Gamma_{t,ij}, \quad \Gamma_{t,ij} := \frac{1}{n}\sum_{\ell=1}^n u_{t,i}(\theta_t^\ell)\, u_{t,j}(\theta_t^\ell)\, \nabla^2\psi(\theta_t^\ell).$$

Here both $\lambda_{t,j}$ and $u_{t,j}(\theta_t^i)$ can be computed by solving a matrix eigenvalue problem involving the particle set $\{\theta_t^i\}_{i=1}^n$: $B_t v_{t,j} = n\lambda_{t,j} v_{t,j}$, where $B_t = (k(\theta_t^i, \theta_t^j))_{i,j=1}^n \in \mathbb{R}^{n \times n}$ is the Gram matrix of pairwise kernel evaluations at particle locations, and the i-th element of $v_{t,j}$ is $u_{t,j}(\theta_t^i)$. To compute $\nabla_\theta K_{\psi,t}(\theta, \theta')$, we differentiate both sides of (12) to find that $\nabla u_{t,j}(\theta) = \frac{1}{n\lambda_{t,j}} \sum_{i=1}^n v_{t,j,i}\, \nabla_\theta k(\theta, \theta_t^i)$. This technique was used in Shi et al. (2018) to estimate gradients of eigenfunctions w.r.t. a continuous q. Following their recommendations, we truncate the sum at the J largest eigenvalues, with J chosen so that $\sum_{j=1}^J \lambda_{t,j} / \sum_{j=1}^n \lambda_{t,j} \ge \tau$ for a threshold τ, to ensure numerical stability. Notably, SVMD differs from MSVGD only in its choice of kernel, but, whenever $\nabla k(\theta, \theta) = 0$, this change is sufficient to exactly recover mirror descent when n = 1.

Proposition 5 (Single-particle SVMD is mirror descent). If n = 1, then one step of SVMD becomes

$$\eta_{t+1} = \eta_t + \epsilon_t\big(k(\theta_t, \theta_t)\nabla\log p(\theta_t) + \nabla k(\theta_t, \theta_t)\big), \quad \theta_{t+1} = \nabla\psi^*(\eta_{t+1}).$$

4.4 STEIN VARIATIONAL NATURAL GRADIENT

The fact that SVMD recovers mirror descent as a special case is not only of relevance in constrained problems. We next exploit the connection between MD and natural gradient descent discussed in Sec. 2 to design a new sampler, Stein Variational Natural Gradient (SVNG), that more efficiently approximates unconstrained targets. The idea is to replace the Hessian $\nabla^2\psi(\cdot)$ in the SVMD dynamics $d\theta_t = \nabla^2\psi(\theta_t)^{-1} g^*_{q_t,K_{\psi,t}}(\theta_t)\,dt$ with a general metric tensor $G(\cdot)$.
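The eigenvalue computation above can be sketched directly: the Gram-matrix eigenproblem $B_t v_{t,j} = n\lambda_{t,j} v_{t,j}$ yields the empirical Mercer eigenpairs, and a Nyström-style extension evaluates $u_{t,j}$ away from the particle set. Below is an illustrative implementation; the Gaussian kernel, its bandwidth, and the particle configuration are our own choices for the sketch, not the paper's:

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 50, 3
particles = rng.standard_normal((n, d))

def k(x, y, h=1.0):
    # Gaussian kernel (illustrative choice)
    return np.exp(-np.sum((x - y) ** 2, axis=-1) / (2 * h ** 2))

# Gram matrix and the eigenproblem B v = n * lambda * v
B = k(particles[:, None, :], particles[None, :, :])
lam, V = np.linalg.eigh(B)          # eigh returns ascending order
lam, V = lam[::-1] / n, V[:, ::-1]  # descending Mercer eigenvalues lambda_j

# Scale eigenvectors so v_{j,i} = u_j(theta_i): the eigenfunctions are
# orthonormal w.r.t. the empirical measure, (1/n) sum_i u_j(theta_i)^2 = 1
U = V * np.sqrt(n)

def u(theta, j):
    # Nystrom extension: u_j(theta) = (1/(n*lambda_j)) sum_i v_{j,i} k(theta, theta_i)
    return k(theta, particles).dot(U[:, j]) / (n * lam[j])

# Truncate at the J largest eigenvalues capturing a fraction tau of the spectrum
tau = 0.98
J = int(np.searchsorted(np.cumsum(lam) / lam.sum(), tau)) + 1
```

Differentiating the Nyström formula in `u` gives the eigenfunction gradients needed for $\nabla_\theta K_{\psi,t}$, exactly as in the displayed expression above.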
Replacing $\nabla^2\psi$ with $G$ yields the Riemannian gradient flow $d\theta_t = G(\theta_t)^{-1} g^*_{q_t,K_{G,t}}(\theta_t)\,dt$ with

$$K_{G,t}(\theta, \theta') := \mathbb{E}_{\theta_t \sim q_t}[k_t^{1/2}(\theta, \theta_t)\,G(\theta_t)\,k_t^{1/2}(\theta_t, \theta')]. \quad (14)$$

3 See App. F for background on Mercer representations in non-compact domains.

Figure 2: Quality of 50-particle approximations to 20-dimensional distributions on the simplex after T particle updates, measured by energy distance for projected SVGD, SVMD, and MSVGD (with kernels k and k2). (Left) Sparse Dirichlet posterior of Patterson & Teh (2013). (Right) Quadratic simplex target of Ahn & Chewi (2020). Details of the target distributions are in App. G.1.

Given any initial particle approximation $q_0 = \frac{1}{n}\sum_{i=1}^n \delta_{\theta_0^i}$, we discretize these dynamics to obtain the unconstrained SVNG sampler of Alg. 2 in the appendix. SVNG can be seen as an instance of matrix SVGD (Wang et al., 2019) with a new adaptive time-dependent kernel $G^{-1}(\theta) K_{G,t}(\theta, \theta') G^{-1}(\theta')$. However, similar to Prop. 5 and unlike the heuristic kernels of Wang et al. (2019), SVNG reduces to natural gradient ascent for finding the mode of p(θ) when n = 1. SVNG is well-suited to Bayesian inference problems where the target is a posterior distribution $p(\theta) \propto \pi(\theta)\pi(y|\theta)$. There, the metric tensor G(θ) can be set to the Fisher information matrix $\mathbb{E}_{\pi(y|\theta)}[\nabla\log\pi(y|\theta)\,\nabla\log\pi(y|\theta)^\top]$ of the data likelihood π(y|θ). Ample precedent from natural gradient variational inference (Hoffman et al., 2013; Khan & Nielsen, 2018) and Riemannian MCMC (Patterson & Teh, 2013) suggests that encoding problem geometry in this manner often leads to more rapid convergence.

5 EXPERIMENTS

We next conduct a series of simulated and real-data experiments to assess (1) distributional approximation on the simplex, (2) frequentist confidence interval construction for (constrained) post-selection inference, and (3) large-scale posterior inference with non-Euclidean geometry.
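For the geometry used in assessment (3), the Fisher information matrix of a logistic likelihood has a closed form: for $y_i \mid x_i \sim \mathrm{Bernoulli}(\sigma(x_i^\top\theta))$, the expected score outer product is $\sum_i \sigma_i(1-\sigma_i)\,x_i x_i^\top$. A minimal sketch; the damping term is a common practical stabilizer of our own choosing, not prescribed by the paper:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fisher_information(theta, X, damping=1e-3):
    """Fisher information of the logistic likelihood at theta.

    The expectation over y of the score outer product reduces to
    sum_i s_i (1 - s_i) x_i x_i^T; a small damping term (an assumption of
    this sketch) keeps the metric safely invertible.
    """
    s = sigmoid(X @ theta)
    W = s * (1 - s)                       # per-datapoint Bernoulli variance
    G = (X * W[:, None]).T @ X            # sum_i W_i x_i x_i^T
    return G + damping * np.eye(X.shape[1])
```

At $\theta = 0$ every $s_i = 1/2$, so the metric is $\tfrac{1}{4}X^\top X$ plus damping, which makes the formula easy to sanity-check by hand.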
To compare with standard SVGD on constrained domains and to prevent its particles from exiting the domain Θ, we introduce a Euclidean projection onto Θ following each SVGD update. For SVMD, we need to solve an eigenvalue problem, which costs $O(n^3)$ time. In practice, the number of particles used for particle evolution algorithms is relatively small, even for SVGD, due to the $O(n^2)$ cost of updates. We have produced a practical SVMD implementation that is computationally competitive with MSVGD and SVGD for standard particle counts like n = 50 (used in all experiments).

5.1 APPROXIMATION QUALITY ON THE SIMPLEX

We first measure distributional approximation quality using two 20-dimensional simplex-constrained targets: the sparse Dirichlet posterior of Patterson & Teh (2013) extended to 20 dimensions and the quadratic simplex target of Ahn & Chewi (2020). The Dirichlet target mimics the multimodal sparse conditionals that arise in latent Dirichlet allocation (Blei et al., 2003) but induces a log-concave density in η space, while the quadratic target is log-concave in θ space. In Fig. 2, we compare the quality of MSVGD, SVMD, and projected SVGD with 50 particles and the inverse multiquadric kernel k (Gorham & Mackey, 2017) by computing the energy distance (Székely & Rizzo, 2013) to a surrogate ground-truth sample of size 1000 (drawn i.i.d. or, in the quadratic case, from the No-U-Turn Sampler (Hoffman & Gelman, 2014)). We also compare to MSVGD with $k_2(\theta, \theta') = k(\nabla\psi(\theta), \nabla\psi(\theta'))$, a choice which corresponds to running SVGD in the dual space with kernel k by Thm. 4 and which ensures the convergence of MSVGD to p by the upcoming Thms. 6 to 8. In the quadratic case, SVMD is favored over MSVGD as it is able to exploit the log-concavity of p(θ). In contrast, for the multimodal sparse Dirichlet target with p(θ) unbounded near the boundary, MSVGD converges slightly more rapidly than SVMD by exploiting the log-concave structure in η space. This parallels the observation of Hsieh et al.
(2018) that LMC in the mirror space outperforms Riemannian LMC for sparse Dirichlet distributions. Projected SVGD fails to converge to the target in both cases and has particular difficulty in approximating the sparse Dirichlet target with unbounded density.

Figure 3: Coverage of post-selection CIs (standard MCMC, MSVGD, SVMD) across (a) 500 / (b) 200 replications of the simulation of Sepehri & Markovic (2017). (a) Actual coverage at nominal level 0.9 as the number of sample points N varies; (b) actual vs. nominal coverage with N = 5000 sample points.

MSVGD with k and k2 perform very similarly, but we observe that k yields better approximation quality upon convergence. Therefore, we employ k in the remaining MSVGD experiments.

5.2 CONFIDENCE INTERVALS FOR POST-SELECTION INFERENCE

We next apply our algorithms to the constrained sampling problems that arise in post-selection inference (Taylor & Tibshirani, 2015; Lee et al., 2016). Specifically, we consider the task of forming valid confidence intervals (CIs) for regression parameters selected using the randomized Lasso (Tian et al., 2016) with data $X \in \mathbb{R}^{n \times p}$ and $y \in \mathbb{R}^n$ and user-generated randomness $w \in \mathbb{R}^p$ drawn from a log-concave distribution with density g. The randomized Lasso returns $\hat\beta \in \mathbb{R}^p$ with non-zero coefficients denoted by $\hat\beta_E$ and their signs by $s_E$. It is common practice to report least squares CIs for $\beta_E$ by running a linear regression on the selected features E. However, since E is chosen based on the same data, the resulting CIs are often invalid. Post-selection inference solves this problem by conditioning the inference on the knowledge of E and $s_E$. To construct valid CIs, it suffices to approximate the selective distribution with support $\{(\hat\beta_E, u_{-E}) : s_E \odot \hat\beta_E > 0,\ u_{-E} \in [-1,1]^{p-|E|}\}$ and density

$$\hat g(\hat\beta_E, u_{-E}) \propto g\left(-X^\top y + \begin{bmatrix} X_E^\top X_E + \epsilon I_{|E|} \\ X_{-E}^\top X_E \end{bmatrix} \hat\beta_E + \lambda \begin{bmatrix} s_E \\ u_{-E} \end{bmatrix}\right).$$

In our experiments, we integrate out $u_{-E}$ analytically, following Tian et al.
(2016), and reparameterize $\hat\beta_E$ as $s_E \odot |\hat\beta_E|$ to obtain a log-concave density of $|\hat\beta_E|$ supported on the nonnegative orthant with mirror function $\psi(\theta) = \sum_{j=1}^d (\theta_j \log \theta_j - \theta_j)$. In Fig. 4a we show an example of a 2D selective distribution using samples drawn by NUTS (Hoffman & Gelman, 2014). We also plot the results of projected SVGD, SVMD, and MSVGD in this example. Projected SVGD fails to approximate the target, with many samples gathering at the truncation boundary, while the samples from MSVGD and SVMD closely resemble the truth. We then compare our methods with the standard no-rejection MCMC approach of the selectiveInference R package (Tibshirani et al., 2019) using the example simulation setting described in Sepehri & Markovic (2017) and a penalty factor of 0.7. To generate N total sample points, we run MCMC for N iterations after burn-in or aggregate the particles from N/n independent runs of MSVGD or SVMD with n = 50 particles. As N ranges from 1000 to 3000 in Fig. 3a, the MSVGD and SVMD CIs consistently yield higher coverage than the standard 90% CIs. This increased coverage is of particular value for smaller sample sizes, for which the standard CIs tend to undercover. For a much larger sample size of N = 5000 in Fig. 3b, the SVMD and standard CIs closely track one another across confidence levels, while MSVGD consistently yields longer CIs with high coverage. The higher coverage of MSVGD is only of value for larger confidence levels, at which the other methods begin to undercover. We next apply our samplers to a post-selection inference task on the HIV-1 drug resistance dataset (Rhee et al., 2006), where we run the randomized Lasso (Tian et al., 2016) to find statistically significant mutations associated with drug resistance using susceptibility data on virus isolates.
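The nonnegative-orthant mirror function used above, $\psi(\theta) = \sum_j (\theta_j \log \theta_j - \theta_j)$, has a particularly simple map pair: $\nabla\psi = \log$ and $\nabla\psi^* = \exp$, so any dual-space update maps back to a strictly positive point. A minimal sketch (the update vector is chosen arbitrarily for illustration):

```python
import numpy as np

# Mirror map pair for psi(theta) = sum_j (theta_j log theta_j - theta_j):
grad_psi = np.log        # maps (0, inf)^d onto R^d
grad_psi_star = np.exp   # inverse map back onto the positive orthant

theta = np.array([0.5, 2.0, 1e-3])
eta = grad_psi(theta)
eta_updated = eta - 0.1 * np.array([1.0, -2.0, 5.0])  # arbitrary dual update
theta_updated = grad_psi_star(eta_updated)            # always strictly positive
```

This is the mechanism by which Alg. 1 keeps every particle inside the constraint set, no matter how aggressive the dual-space step.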
We take the in vitro measurement of log-fold change under the 3TC drug as the response and include mutations that appeared at least 11 times in the dataset as regressors. In Fig. 4b we plot the CIs of selected mutations obtained with N = 5000 sample points. We see that the invalid unadjusted least squares CIs can lead to premature conclusions, e.g., declaring mutation 215Y significant when there is insufficient support after conditioning on the selection event. In contrast, mutation 184V, which has a known association with drug resistance, is declared significant by all methods even after post-selection adjustment. The MSVGD and SVMD CIs mostly track those of the standard selectiveInference method, but their conclusions sometimes differ: e.g., 62Y is flagged as significant by MSVGD and SVMD but not by selectiveInference.

Figure 4: (a) Sampling from a 2D selective density (truth vs. projected SVGD, SVMD, and MSVGD); (b) Unadjusted and post-selection CIs for the mutations selected by the randomized Lasso as candidates for HIV-1 drug resistance (see Sec. 5.2).

5.3 LARGE-SCALE POSTERIOR INFERENCE WITH NON-EUCLIDEAN GEOMETRY

Finally, we demonstrate the advantages of exploiting non-Euclidean geometry by recreating the real-data large-scale Bayesian logistic regression experiment of Liu & Wang (2016) with 581,012 datapoints and d = 54 feature dimensions. Here, the target p is the posterior distribution over logistic regression parameters. We adopt the Fisher information metric tensor G, compare 20-particle SVNG to SVGD and its prior geometry-aware variants RSVGD (Liu & Zhu, 2018) and matrix SVGD with average and mixture kernels (Wang et al., 2019), and for all methods use stochastic minibatches of size 256 to scalably approximate each log likelihood query. In Fig. 5, all geometry-aware methods substantially improve on the log predictive probability of SVGD.
SVNG also strongly outperforms RSVGD and converges to its maximum test probability in half as many steps as matrix SVGD (Avg) and more rapidly than matrix SVGD (Mixture).

Figure 5: Value of non-Euclidean geometry in large-scale Bayesian logistic regression: test log predictive probability over particle updates T for SVGD, SVNG, and RSVGD (left) and for SVNG versus matrix SVGD (Avg and Mixture) (center, right).

6 CONVERGENCE GUARANTEES

We next turn our attention to the convergence properties of our proposed methods. For $K_t$ and $\epsilon_t$ as in Alg. 1, let $(\bar q_t, \bar q_{t,H})$ represent the distributions of the mirrored Stein updates $(\theta_t, \eta_t)$ when $\theta_0 \sim \bar q_0$ and $\eta_{t+1} = \eta_t + \epsilon_t g^*_{\bar q_t, K_t}(\theta_t)$ for $t \ge 0$. Our first result, proved in App. I.5, shows that if the Alg. 1 initialization $q^n_{0,H} = \frac{1}{n}\sum_{i=1}^n \delta_{\eta_0^i}$ converges in Wasserstein distance to a distribution $\bar q_{0,H}$ as $n \to \infty$, then, on each round t > 0, the output of Alg. 1, $q^n_t = \frac{1}{n}\sum_{i=1}^n \delta_{\theta_t^i}$, converges to $\bar q_t$.

Theorem 6 (Convergence of mirrored updates as $n \to \infty$). Suppose Alg. 1 is initialized with $q^n_{0,H} = \frac{1}{n}\sum_{i=1}^n \delta_{\eta_0^i}$ satisfying $W_1(q^n_{0,H}, \bar q_{0,H}) \to 0$ for $W_1$ the $L^1$ Wasserstein distance. Define the η-induced kernel $K_{\nabla\psi^*,t}(\eta, \eta') := K_t(\nabla\psi^*(\eta), \nabla\psi^*(\eta'))$. If, for some $c_1, c_2 > 0$,

$$\|K_{\nabla\psi^*,t}(\cdot, \eta)\nabla\log p_H(\eta) + \nabla_\eta \cdot K_{\nabla\psi^*,t}(\cdot, \eta)\|_{\mathrm{op}} \le c_1(1 + \|\eta\|_2),$$
$$\|K_{\nabla\psi^*,t}(\eta', \cdot)\nabla\log p_H(\cdot) + \nabla \cdot K_{\nabla\psi^*,t}(\eta', \cdot)\|_{\mathrm{op}} \le c_2(1 + \|\eta'\|_2),$$

then $W_1(q^n_{t,H}, \bar q_{t,H}) \to 0$ and $q^n_t \Rightarrow \bar q_t$ for each round t.

Remark. The pre-conditions hold, for example, whenever $\nabla\log p_H$ is Lipschitz, ψ is strongly convex, and $K_t = k \cdot I$ for k bounded with bounded derivatives.
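The $W_1$ convergence in Thm. 6 can also be monitored empirically: for two equal-size one-dimensional samples, $W_1$ reduces to the mean absolute difference between order statistics. A small sketch, with distributions chosen arbitrarily for illustration:

```python
import numpy as np

def w1_empirical_1d(x, y):
    """L1 Wasserstein distance between two equal-size 1D empirical
    distributions: mean absolute difference of sorted samples."""
    return np.mean(np.abs(np.sort(x) - np.sort(y)))

rng = np.random.default_rng(0)
a = rng.standard_normal(4000)
b = rng.standard_normal(4000)        # same distribution: W1 near 0
c = rng.standard_normal(4000) + 1.0  # unit shift: W1 near 1
```

For multivariate particle outputs, one can apply this check coordinate-wise to the marginals as a cheap, if partial, diagnostic.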
Given a mirrored Stein operator (6), an arbitrary Stein set $\mathcal{G}$, and an arbitrary matrix-valued kernel K, we define the mirrored Stein discrepancy and mirrored kernel Stein discrepancy

$$\mathrm{MSD}(q, p, \mathcal{G}) := \sup_{g \in \mathcal{G}} \mathbb{E}_q[(\mathcal{M}_{p,\psi} g)(\theta)] \quad \text{and} \quad \mathrm{MKSD}_K(q, p) := \mathrm{MSD}(q, p, B_{\mathcal{H}_K}). \quad (15)$$

The former is an example of a diffusion Stein discrepancy (Gorham et al., 2019) and the latter an example of a diffusion kernel Stein discrepancy (Barp et al., 2019). Since the MKSD optimization problem (15) matches that in Thm. 3, we have that $\mathrm{MKSD}_K(q, p) = \|g^*_{q,K}\|_{\mathcal{H}_K}$. Our next result, proved in App. I.6, shows that the infinite-particle mirrored Stein updates reduce the KL divergence to p whenever the step size is sufficiently small and drive MKSD to 0 if, for example, $\epsilon_t = \Omega(\mathrm{MKSD}_{K_t}(\bar q_t, p)^\alpha)$ for any α > 0. We also provide two conditions in App. H that generalize the Stein log-Sobolev and Stein Poincaré inequalities of Duncan et al. (2019) and Korba et al. (2020) and which imply exponential convergence rates for our algorithms in continuous time.

Theorem 7 (Infinite-particle mirrored Stein updates decrease KL and MKSD). Assume $\kappa_1 := \sup_\theta \|K_t(\theta, \theta)\|_{\mathrm{op}} < \infty$, $\kappa_2 := \sum_{i=1}^d \sup_\theta \|\nabla^2_{i,d+i} K_t(\theta, \theta)\|_{\mathrm{op}} < \infty$, $\nabla\log p_H$ is L-Lipschitz, and ψ is α-strongly convex. If $\epsilon_t < 1/\big(2 \sup_\theta \|\nabla^2\psi(\theta)^{-1}\nabla g^*_{\bar q_t,K_t}(\theta) + \nabla g^*_{\bar q_t,K_t}(\theta)^\top \nabla^2\psi(\theta)^{-1}\|_{\mathrm{op}}\big)$, then

$$\mathrm{KL}(\bar q_{t+1} \,\|\, p) - \mathrm{KL}(\bar q_t \,\|\, p) \le -\Big(\epsilon_t - \frac{L\kappa_1}{\alpha^2}\epsilon_t^2\Big)\, \mathrm{MKSD}_{K_t}(\bar q_t, p)^2.$$

Our last result, proved in App. I.7, shows that $\bar q_t \Rightarrow p$ if $\mathrm{MKSD}_{K_k}(\bar q_t, p) \to 0$. Hence, by Thms. 6 and 7, n-particle MSVGD converges weakly to p if $\epsilon_t$ decays at a suitable rate.

Theorem 8 ($\mathrm{MKSD}_{K_k}$ determines weak convergence). Assume $p_H$ is distantly dissipative (Eberle, 2016) with $\nabla\log p_H$ Lipschitz, ψ is strongly convex with continuous $\nabla\psi^*$, and $k(\theta, \theta') = \kappa(\nabla\psi(\theta), \nabla\psi(\theta'))$ for $\kappa(x, y) = (c^2 + \|x - y\|_2^2)^\beta$ with $\beta \in (-1, 0)$. Then $\bar q_t \Rightarrow p$ if $\mathrm{MKSD}_{K_k}(\bar q_t, p) \to 0$.

Remark. The pre-conditions hold, for example, for any Dirichlet target with negative entropy ψ.
7 DISCUSSION

This paper introduced the mirrored Stein operator along with three new particle evolution algorithms for sampling with constrained domains and non-Euclidean geometries. The first algorithm, MSVGD, performs SVGD updates in a mirrored space before mapping back to the target domain. The other two algorithms are different discretizations of the same continuous dynamics for exploiting non-Euclidean geometry: SVMD is a multi-particle generalization of mirror descent for constrained domains, while SVNG is designed for unconstrained problems with informative metric tensors. We highlight three limitations. First, like SVGD, our MSVGD requires O(n²) time per update. Second, SVMD and SVNG are more costly than MSVGD due to the adaptive kernel construction; low-rank kernel approximation may be needed to reduce their complexity. Third, we leave open the question of convergence when stochastic gradient estimates are employed, but we suspect the results of Gorham et al. (2020, Thm. 7) can be extended to our setting. In the future, we hope to deploy our mirrored Stein operators for other inferential tasks on constrained domains, including sample quality measurement (Gorham & Mackey, 2015; Huggins & Mackey, 2018), goodness-of-fit testing (Chwialkowski et al., 2016; Liu et al., 2016; Jitkrittum et al., 2017), graphical model inference (Zhuo et al., 2018; Wang et al., 2018), parameter estimation (Barp et al., 2019), thinning (Riabiz et al., 2020), and de novo sampling (Chen et al., 2018; Futami et al., 2019).

REPRODUCIBILITY STATEMENT

See App. G for experimental details and https://github.com/thjashin/mirror-stein-samplers for Python and R code replicating all experiments.

REFERENCES

Kwangjun Ahn and Sinho Chewi. Efficient constrained sampling via the mirror-Langevin algorithm. arXiv preprint arXiv:2010.16212, 2020.

Shun-Ichi Amari. Natural gradient works efficiently in learning. Neural Computation, 10(2):251–276, 1998.

Andrew D Barbour.
Stein's method and Poisson process convergence. Journal of Applied Probability, pp. 175–184, 1988.

Alessandro Barp, François-Xavier Briol, Andrew Duncan, Mark Girolami, and Lester Mackey. Minimum Stein discrepancy estimators. In Advances in Neural Information Processing Systems, pp. 12964–12976, 2019.

Amir Beck and Marc Teboulle. Mirror descent and nonlinear projected subgradient methods for convex optimization. Operations Research Letters, 31(3):167–175, 2003.

Alain Berlinet and Christine Thomas-Agnan. Reproducing kernel Hilbert spaces in probability and statistics. Springer Science & Business Media, 2011.

Rabi N Bhattacharya and Edward C Waymire. Stochastic processes with applications. SIAM, 2009.

David M Blei, Andrew Y Ng, and Michael I Jordan. Latent Dirichlet allocation. Journal of Machine Learning Research, 3:993–1022, 2003.

Peng Chen, Keyi Wu, Joshua Chen, Tom O'Leary-Roseberry, and Omar Ghattas. Projected Stein variational Newton: A fast and scalable Bayesian inference method in high dimensions. In Advances in Neural Information Processing Systems, volume 32, 2019.

Wilson Ye Chen, Lester Mackey, Jackson Gorham, François-Xavier Briol, and Chris Oates. Stein points. In International Conference on Machine Learning, pp. 844–853, 2018.

Sinho Chewi, Thibaut Le Gouic, Chen Lu, Tyler Maunu, Philippe Rigollet, and Austin Stromme. Exponential ergodicity of mirror-Langevin diffusions. arXiv preprint arXiv:2005.09669, 2020.

Kacper Chwialkowski, Heiko Strathmann, and Arthur Gretton. A kernel test of goodness of fit. In International Conference on Machine Learning, pp. 2606–2615, 2016.

Arnak Dalalyan. Further and stronger analogy between sampling and optimization: Langevin Monte Carlo and gradient descent. In Conference on Learning Theory, pp. 678–689, 2017.

Gianluca Detommaso, Tiangang Cui, Youssef Marzouk, Alessio Spantini, and Robert Scheichl. A Stein variational Newton method. In Advances in Neural Information Processing Systems, volume 31, pp. 9187–9197, 2018.
John Duchi, Elad Hazan, and Yoram Singer. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12(7), 2011.

Andrew Duncan, Nikolas Nüsken, and Lukasz Szpruch. On the geometry of Stein variational gradient descent. arXiv preprint arXiv:1912.00894, 2019.

Alain Durmus, Eric Moulines, and Marcelo Pereyra. Efficient Bayesian computation by proximal Markov chain Monte Carlo: when Langevin meets Moreau. SIAM Journal on Imaging Sciences, 11(1):473–506, 2018.

Andreas Eberle. Reflection couplings and contraction rates for diffusions. Probability Theory and Related Fields, 166(3):851–886, 2016.

Stewart N Ethier. A class of degenerate diffusion processes occurring in population genetics. Communications on Pure and Applied Mathematics, 29(5):483–493, 1976.

Yihao Feng, Dilin Wang, and Qiang Liu. Learning to draw samples with amortized Stein variational gradient descent. In Uncertainty in Artificial Intelligence, 2017.

JC Ferreira and VA Menegatto. Eigenvalues of integral operators defined by smooth positive definite kernels. Integral Equations and Operator Theory, 64(1):61–81, 2009.

Futoshi Futami, Zhenghang Cui, Issei Sato, and Masashi Sugiyama. Bayesian posterior approximation via greedy particle optimization. In Proceedings of the AAAI Conference on Artificial Intelligence, pp. 3606–3613, 2019.

Han L Gan, Adrian Röllin, and Nathan Ross. Dirichlet approximation of equilibrium distributions in Cannings models with mutation. Advances in Applied Probability, 49(3):927–959, 2017.

Damien Garreau, Wittawat Jitkrittum, and Motonobu Kanagawa. Large sample analysis of the median heuristic. arXiv preprint arXiv:1707.07269, 2017.

Jackson Gorham and Lester Mackey. Measuring sample quality with Stein's method. In Advances in Neural Information Processing Systems, pp. 226–234, 2015.

Jackson Gorham and Lester Mackey. Measuring sample quality with kernels.
In International Conference on Machine Learning, pp. 1292–1301, 2017.

Jackson Gorham, Andrew B Duncan, Sebastian J Vollmer, and Lester Mackey. Measuring sample quality with diffusions. The Annals of Applied Probability, 29(5):2884–2928, 2019.

Jackson Gorham, Anant Raj, and Lester Mackey. Stochastic Stein discrepancies. arXiv preprint arXiv:2007.02857, 2020.

Tuomas Haarnoja, Haoran Tang, Pieter Abbeel, and Sergey Levine. Reinforcement learning with deep energy-based policies. In International Conference on Machine Learning, pp. 1352–1361, 2017.

Geoffrey Hinton, Nitish Srivastava, and Kevin Swersky. Neural networks for machine learning, lecture 6a: overview of mini-batch gradient descent. 2012.

Matthew D Hoffman and Andrew Gelman. The No-U-Turn sampler: adaptively setting path lengths in Hamiltonian Monte Carlo. Journal of Machine Learning Research, 15(1):1593–1623, 2014.

Matthew D Hoffman, David M Blei, Chong Wang, and John Paisley. Stochastic variational inference. Journal of Machine Learning Research, 14(5), 2013.

Ya-Ping Hsieh, Ali Kavis, Paul Rolland, and Volkan Cevher. Mirrored Langevin dynamics. In Advances in Neural Information Processing Systems, pp. 2878–2887, 2018.

Jonathan Huggins and Lester Mackey. Random feature Stein discrepancies. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett (eds.), Advances in Neural Information Processing Systems, pp. 1903–1913, 2018.

Wittawat Jitkrittum, Wenkai Xu, Zoltán Szabó, K. Fukumizu, and A. Gretton. A linear-time kernel goodness-of-fit test. In Advances in Neural Information Processing Systems, 2017.

Sham Kakade, Shai Shalev-Shwartz, Ambuj Tewari, et al. On the duality of strong convexity and strong smoothness: Learning applications and matrix regularization. Unpublished manuscript, 2(1), 2009.

Mohammad Emtiyaz Khan and Didrik Nielsen. Fast yet simple natural-gradient descent for variational inference in complex models.
In 2018 International Symposium on Information Theory and Its Applications (ISITA), pp. 31–35. IEEE, 2018.

Taesup Kim, Jaesik Yoon, Ousmane Dia, Sungwoong Kim, Yoshua Bengio, and Sungjin Ahn. Bayesian model-agnostic meta-learning. In Advances in Neural Information Processing Systems, 2018.

Anna Korba, Adil Salim, Michael Arbel, Giulia Luise, and Arthur Gretton. A non-asymptotic analysis for Stein variational gradient descent. Advances in Neural Information Processing Systems, 33, 2020.

Jason D Lee, Dennis L Sun, Yuekai Sun, and Jonathan E Taylor. Exact post-selection inference, with application to the lasso. Annals of Statistics, 44(3):907–927, 2016.

Chang Liu and Jun Zhu. Riemannian Stein variational gradient descent for Bayesian inference. In Proceedings of the AAAI Conference on Artificial Intelligence, pp. 3627–3634, 2018.

Chang Liu, Jingwei Zhuo, Pengyu Cheng, Ruiyi Zhang, and Jun Zhu. Understanding and accelerating particle-based variational inference. In International Conference on Machine Learning, pp. 4082–4092, 2019a.

Chang Liu, Jingwei Zhuo, and Jun Zhu. Understanding MCMC dynamics as flows on the Wasserstein space. In Proceedings of the 36th International Conference on Machine Learning, pp. 4093–4103, 2019b.

Qiang Liu. Stein variational gradient descent as gradient flow. In Advances in Neural Information Processing Systems, pp. 3115–3123, 2017.

Qiang Liu and Dilin Wang. Stein variational gradient descent: A general purpose Bayesian inference algorithm. Advances in Neural Information Processing Systems, 29:2378–2386, 2016.

Qiang Liu, Jason Lee, and Michael Jordan. A kernelized Stein discrepancy for goodness-of-fit tests. In International Conference on Machine Learning, pp. 276–284, 2016.

Yi-An Ma, Tianqi Chen, and Emily Fox. A complete recipe for stochastic gradient MCMC. In Advances in Neural Information Processing Systems, pp. 2917–2925, 2015.
Yi-An Ma, Niladri Chatterji, Xiang Cheng, Nicolas Flammarion, Peter Bartlett, and Michael I Jordan. Is there an analog of Nesterov acceleration for MCMC? arXiv preprint arXiv:1902.00996, 2019.

James Martens. New insights and perspectives on the natural gradient method. arXiv preprint arXiv:1412.1193, 2014.

Charles A Micchelli and Massimiliano Pontil. On learning vector-valued functions. Neural Computation, 17(1):177–204, 2005.

Arkadij Semenovic Nemirovskij and David Borisovich Yudin. Problem complexity and method efficiency in optimization. 1983.

Chris J Oates, Mark Girolami, and Nicolas Chopin. Control functionals for Monte Carlo integration. Journal of the Royal Statistical Society: Series B (Methodological), 79(3):695–718, 2017.

Bernt Øksendal. Stochastic Differential Equations: An Introduction with Applications. Springer Science & Business Media, 2003.

Sam Patterson and Yee Whye Teh. Stochastic gradient Riemannian Langevin dynamics on the probability simplex. In Advances in Neural Information Processing Systems, pp. 3102–3110, 2013.

Garvesh Raskutti and Sayan Mukherjee. The information geometry of mirror descent. IEEE Transactions on Information Theory, 61(3):1451–1457, 2015.

Soo-Yon Rhee, Jonathan Taylor, Gauhar Wadhera, Asa Ben-Hur, Douglas L Brutlag, and Robert W Shafer. Genotypic predictors of human immunodeficiency virus type 1 drug resistance. Proceedings of the National Academy of Sciences, 103(46):17355–17360, 2006.

Marina Riabiz, Wilson Chen, Jon Cockayne, Pawel Swietach, Steven A Niederer, Lester Mackey, Chris Oates, et al. Optimal thinning of MCMC output. arXiv preprint arXiv:2005.03952, 2020.

Amir Sepehri and Jelena Markovic. Non-reversible, tuning- and rejection-free Markov chain Monte Carlo via iterated random functions. arXiv preprint arXiv:1711.07177, 2017.

Jiaxin Shi, Shengyang Sun, and Jun Zhu. A spectral approach to gradient estimation for implicit distributions.
In International Conference on Machine Learning, pp. 4644–4653, 2018.

Umut Simsekli, Roland Badeau, Taylan Cemgil, and Gaël Richard. Stochastic quasi-Newton Langevin Monte Carlo. In International Conference on Machine Learning, pp. 642–651, 2016.

Charles Stein. A bound for the error in the normal approximation to the distribution of a sum of dependent random variables. In Proceedings of the Sixth Berkeley Symposium on Mathematical Statistics and Probability, Volume 2: Probability Theory. The Regents of the University of California, 1972.

Gábor J Székely and Maria L Rizzo. Energy statistics: A class of statistics based on distances. Journal of Statistical Planning and Inference, 143(8):1249–1272, 2013.

Jonathan Taylor and Robert J Tibshirani. Statistical learning and selective inference. Proceedings of the National Academy of Sciences, 112(25):7629–7634, 2015.

Xiaoying Tian, Nan Bi, and Jonathan Taylor. MAGIC: a general, powerful and tractable method for selective inference. arXiv preprint arXiv:1607.02630, 2016.

Ryan Tibshirani, Rob Tibshirani, Jonathan Taylor, Joshua Loftus, Stephen Reid, and Jelena Markovic. selectiveInference: Tools for post-selection inference, 2019. URL https://CRAN.R-project.org/package=selectiveInference. R package version 1.2.5.

Dilin Wang, Zhe Zeng, and Qiang Liu. Stein variational message passing for continuous graphical models. In International Conference on Machine Learning, pp. 5219–5227, 2018.

Dilin Wang, Ziyang Tang, Chandrajit Bajaj, and Qiang Liu. Stein variational gradient descent with matrix-valued kernels. In Advances in Neural Information Processing Systems, pp. 7836–7846, 2019.

Max Welling and Yee W Teh. Bayesian learning via stochastic gradient Langevin dynamics. In International Conference on Machine Learning, pp. 681–688, 2011.

Edwin B Wilson. Probable inference, the law of succession, and statistical inference. Journal of the American Statistical Association, 22(158):209–212, 1927.
Tatiana Xifara, Chris Sherlock, Samuel Livingstone, Simon Byrne, and Mark Girolami. Langevin diffusions and the Metropolis-adjusted Langevin algorithm. Statistics & Probability Letters, 91:14–19, 2014.

Jianyi Zhang, Yang Zhao, and Changyou Chen. Variance reduction in stochastic particle-optimization sampling. In Proceedings of the 37th International Conference on Machine Learning, pp. 11307–11316, 2020a.

Kelvin Shuangjian Zhang, Gabriel Peyré, Jalal Fadili, and Marcelo Pereyra. Wasserstein control of mirror Langevin Monte Carlo. arXiv preprint arXiv:2002.04363, 2020b.

Michael Zhu, Chang Liu, and Jun Zhu. Variance reduction and quasi-Newton for particle-based variational inference. In Proceedings of the 37th International Conference on Machine Learning, pp. 11576–11587, 2020.

Jingwei Zhuo, Chang Liu, Jiaxin Shi, Jun Zhu, Ning Chen, and Bo Zhang. Message passing Stein variational gradient descent. In International Conference on Machine Learning, pp. 6018–6027, 2018.

Algorithm 2 Stein Variational Natural Gradient (SVNG)
Input: density p(θ) on R^d, kernel k, metric tensor G(θ), particles (θ^i_0)_{i=1}^n, step sizes (ϵ_t)_{t=1}^T
for t = 0 : T do
  for i ∈ [n], θ^i_{t+1} ← θ^i_t + ϵ_t G(θ^i_t)^{−1} g∗_{G,t}(θ^i_t), where
  g∗_{G,t}(θ) = (1/n)∑_{j=1}^n [K_{G,t}(θ, θ^j_t)G(θ^j_t)^{−1}∇log p(θ^j_t) + ∇_{θ^j_t}·(K_{G,t}(θ, θ^j_t)G(θ^j_t)^{−1})]  (see (14))
return {θ^i_{T+1}}_{i=1}^n.

A MIRROR DESCENT, RIEMANNIAN GRADIENT FLOW, AND NATURAL GRADIENT

The equivalence between the mirror flow dη_t = −∇f(θ_t)dt, θ_t = ∇ψ*(η_t) and the Riemannian gradient flow in (3) is a direct result of the chain rule:

dθ_t/dt = ∇_{η_t}θ_t dη_t/dt = (∇_{θ_t}η_t)^{−1} dη_t/dt = −∇²ψ(θ_t)^{−1}∇f(θ_t),  (16)
dη_t/dt = −∇f(θ_t) = −∇_{θ_t}η_t ∇_{η_t}f(∇ψ*(η_t)) = −∇²ψ*(η_t)^{−1}∇_{η_t}f(∇ψ*(η_t)).  (17)

Depending on whether one discretizes (16) or (17), two natural gradient descent (NGD) updates can arise from the same gradient flow:

NGD (a): θ_{t+1} = θ_t − ϵ_t ∇²ψ(θ_t)^{−1}∇f(θ_t),
NGD (b): η_{t+1} = η_t − ϵ_t ∇²ψ*(η_t)^{−1}∇_{η_t}f(∇ψ*(η_t)).
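The two discretizations can be compared directly on a toy problem. The sketch below is our own illustration (the quadratic objective, start point, and step size are arbitrary assumptions): it minimizes f(θ) = ½∥θ − θ*∥² over the interior of a simplex with the entropic mirror map, for which ∇²ψ(θ)^{−1} = diag(θ) − θθ⊤ (see App. B). Both iterations reach the same minimizer, but their finite-step paths differ.

```python
import numpy as np

# Hedged sketch comparing NGD (a) (primal discretization) with NGD (b)
# (dual / mirror-descent discretization) for the entropic mirror map on the
# simplex. The quadratic objective and step size are illustrative assumptions.

def softmax0(eta):
    """nabla psi*: map dual coordinates (eta, 0) back to the simplex."""
    z = np.concatenate([eta, [0.0]])
    z -= z.max()
    e = np.exp(z)
    return (e / e.sum())[:-1]            # first d coordinates

target = np.array([0.5, 0.3])            # interior minimizer (theta_3 = 0.2)
grad_f = lambda th: th - target          # f(theta) = 0.5 * ||theta - target||^2
eps, steps = 0.2, 500

# NGD (a): theta_{t+1} = theta_t - eps * Hess(psi)(theta_t)^{-1} grad f(theta_t)
th_a = np.array([1.0 / 3, 1.0 / 3])
path_a = [th_a.copy()]
for _ in range(steps):
    precond = np.diag(th_a) - np.outer(th_a, th_a)   # Hess(psi)^{-1} on simplex
    th_a = th_a - eps * precond @ grad_f(th_a)
    path_a.append(th_a.copy())

# NGD (b) == mirror descent: eta_{t+1} = eta_t - eps * grad f(theta_t)
eta = np.zeros(2)                        # nabla psi at the uniform start point
th_b = softmax0(eta)
path_b = [th_b.copy()]
for _ in range(steps):
    eta = eta - eps * grad_f(th_b)
    th_b = softmax0(eta)
    path_b.append(th_b.copy())

# Same limit, different finite-step paths
gap = max(np.abs(a - b).max() for a, b in zip(path_a, path_b))
```

This matches the discussion that follows: with finite step sizes the two updates agree only to first order in ϵ_t, and only NGD (b) requires an explicit dual variable.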
With finite step sizes ϵ_t, the two updates need not coincide and can follow different optimization paths. Since ∇f(θ_t) = ∇²ψ*(η_t)^{−1}∇_{η_t}f(∇ψ*(η_t)), NGD (b) is equivalent to the dual-space update of mirror descent. This relationship was pointed out by Raskutti & Mukherjee (2015) and has been used to develop natural gradient variational inference algorithms (Khan & Nielsen, 2018). We emphasize, however, that our SVNG algorithm developed in Sec. 4.4 corresponds to the discretization in the primal space, as in NGD (a). Therefore, it does not require an explicit dual space and allows replacing ∇²ψ with more general information metric tensors.

B DETAILS OF EXAMPLE 1

For the entropic mirror map ψ(θ) = ∑_{j=1}^{d+1} θ_j log θ_j, we have ∇²ψ(θ)^{−1} = diag(θ) − θθ⊤. Note that here θ denotes a d-dimensional vector and does not include θ_{d+1} = 1 − ∑_{j=1}^d θ_j. Since Θ is a (d+1)-simplex, its boundary ∂Θ is composed of d + 1 faces, with θ in the j-th face satisfying θ_j = 0. The outward unit normal vectors n(θ) for the first d faces are −e_j for 1 ≤ j ≤ d, where e_j denotes the j-th standard basis vector of R^d. The outward unit normal vector for the (d+1)-st face is the vector with 1/√d in all coordinates. Therefore, we have

∫_{∂Θ} p(θ)g(θ)⊤∇²ψ(θ)^{−1}n(θ)dθ = ∫_{∂Θ} p(θ)g(θ)⊤(diag(θ) − θθ⊤)n(θ)dθ
 = ∑_{j=1}^d ∫_{θ_j=0} p(θ)(θ⊤g(θ) − g_j(θ))θ_j dθ + (1/√d)∫_{θ_{d+1}=0} p(θ)θ⊤g(θ)θ_{d+1} dθ = 0,

where in the second-to-last identity we used θ⊤1 = 1 − θ_{d+1}, and each face integral vanishes because its integrand carries a factor (θ_j or θ_{d+1}) that is zero on that face. Finally, we can verify the condition in Prop. 1, as

∫_{∂Θ_r} p(θ)∥∇²ψ(θ)^{−1}n_r(θ)∥₂ dθ = sup_{∥g∥≤1} ∫_{∂Θ} p(θ)g(θ)⊤∇²ψ(θ)^{−1}n(θ)dθ = 0.

C DERIVATION OF THE MIRRORED STEIN OPERATOR

We first review the (overdamped) Langevin diffusion, a Markov process that underlies many recent advances in Stein's method, along with its recent mirrored generalization.
The Langevin diffusion with equilibrium density p on R^d is a Markov process (θ_t)_{t≥0} ⊂ R^d satisfying the stochastic differential equation (SDE)

dθ_t = ∇log p(θ_t)dt + √2 dB_t  (18)

with (B_t)_{t≥0} a standard Brownian motion (Bhattacharya & Waymire, 2009, Sec. 4.5). To identify Stein operators that satisfy (4) for broad classes of targets p, Gorham & Mackey (2015) proposed to build upon the generator method of Barbour (1988): first, identify a Markov process (θ_t)_{t≥0} that has p as its equilibrium density; they chose the Langevin diffusion of (18). Next, build a Stein operator based on the (infinitesimal) generator A of the process (Øksendal, 2003, Def. 7.3.1), (Af)(θ) = lim_{t→0} (1/t)(E[f(θ_t)] − E[f(θ_0)]) for f : R^d → R, as the generator satisfies E_{θ∼p}[(Af)(θ)] = 0 under relatively mild conditions. We use the following theorem to derive the generator of the processes described by SDEs like (18):

Theorem 9 (Generator of Itô diffusion; Øksendal, 2003, Thm. 7.3.3). Let (x_t)_{t≥0} be the Itô diffusion in X ⊆ R^d satisfying dx_t = b(x_t)dt + σ(x_t)dB_t. For any f ∈ C²_c(X), the (infinitesimal) generator A of (x_t)_{t≥0} is (Af)(x) = b(x)⊤∇f(x) + ½Tr(σ(x)σ(x)⊤∇²f(x)).

For the Langevin diffusion (18), substituting ∇log p(·) for b(·) and √2 I for σ(·) in Thm. 9, we obtain Af = ⟨∇log p, ∇f⟩ + Δf. Replacing ∇f with a vector-valued function g gives the Langevin Stein operator in (5). To derive a Stein operator that works well for constrained domains, we consider the Riemannian Langevin diffusion (Patterson & Teh, 2013; Xifara et al., 2014; Ma et al., 2015), which extends the Langevin diffusion to non-Euclidean geometries encoded in a positive definite metric tensor G(θ):

dθ_t = (G(θ_t)^{−1}∇log p(θ_t) + ∇·G(θ_t)^{−1})dt + √2 G(θ_t)^{−1/2}dB_t.⁴

We show in App. D that the choice G = ∇²ψ yields the recent mirror-Langevin diffusion (Zhang et al., 2020b; Chewi et al., 2020)

θ_t = ∇ψ*(η_t),  dη_t = ∇log p(θ_t)dt + √2 ∇²ψ(θ_t)^{1/2}dB_t.  (19)

According to Thm.
9, the generator of the mirror-Langevin diffusion described by (20) is

(A_{p,ψ}f)(θ) = (∇²ψ(θ)^{−1}∇log p(θ) + ∇·∇²ψ(θ)^{−1})⊤∇f(θ) + Tr(∇²ψ(θ)^{−1}∇²f(θ))
 = ∇f(θ)⊤∇²ψ(θ)^{−1}∇log p(θ) + ∇·(∇²ψ(θ)^{−1}∇f(θ)).

Now substituting g(θ) for ∇f(θ), we obtain the associated mirrored Stein operator:

(M_{p,ψ}g)(θ) = g(θ)⊤∇²ψ(θ)^{−1}∇log p(θ) + ∇·(∇²ψ(θ)^{−1}g(θ)).

D RIEMANNIAN LANGEVIN DIFFUSIONS AND MIRROR-LANGEVIN DIFFUSIONS

Zhang et al. (2020b) pointed out that (19) is a particular case of the Riemannian Langevin diffusion but did not give an explicit derivation. The Riemannian Langevin diffusion (Patterson & Teh, 2013; Xifara et al., 2014; Ma et al., 2015) with ∇²ψ(·) as the metric tensor is

dθ_t = (∇²ψ(θ_t)^{−1}∇log p(θ_t) + ∇·∇²ψ(θ_t)^{−1})dt + √2 ∇²ψ(θ_t)^{−1/2}dB_t.  (20)

To see the connection with the mirror-Langevin diffusion, we derive the SDE that describes the evolution of η_t = ∇ψ(θ_t) under this diffusion. This requires the following theorem, which provides the analog of the chain rule for SDEs.

⁴A matrix divergence ∇·G(θ) is the vector obtained by computing the divergence of each row of G(θ).

Theorem 10 (Itô formula; Øksendal, 2003, Thm. 4.2.1). Let (x_t)_{t≥0} be an Itô process in X ⊆ R^d satisfying dx_t = b(x_t)dt + σ(x_t)dB_t. Let f ∈ C²(R^d; R^{d′}). Then y_t = f(x_t) is again an Itô process, and its i-th coordinate satisfies

dy_{t,i} = (∇f_i(x_t)⊤b(x_t) + ½Tr(∇²f_i(x_t)σ(x_t)σ(x_t)⊤))dt + ∇f_i(x_t)⊤σ(x_t)dB_t.

Substituting ∇ψ for f in Thm. 10, we have the SDE of η_t = ∇ψ(θ_t) as

dη_t = (∇log p(θ_t) + ∇²ψ(θ_t)∇·∇²ψ(θ_t)^{−1} + h(θ_t))dt + √2 ∇²ψ(θ_t)^{1/2}dB_t,

where h(θ_t)_i = Tr(∇²_{θ_t}(∂_{θ_{t,i}}ψ(θ_t))∇²ψ(θ_t)^{−1}). Moreover, we have

[∇²ψ(θ_t)∇·∇²ψ(θ_t)^{−1}]_i + Tr(∇²_{θ_t}(∂_{θ_{t,i}}ψ(θ_t))∇²ψ(θ_t)^{−1})
 = ∑_{j,ℓ} ∇²ψ(θ_t)_{ij} ∂_{θ_{t,ℓ}}[∇²ψ(θ_t)^{−1}]_{jℓ} + ∑_{j,ℓ} ∂_{θ_{t,ℓ}}∇²ψ(θ_t)_{ij}[∇²ψ(θ_t)^{−1}]_{jℓ}
 = ∑_ℓ ∂_{θ_{t,ℓ}} ∑_j ∇²ψ(θ_t)_{ij}[∇²ψ(θ_t)^{−1}]_{jℓ} = ∑_ℓ ∂_{θ_{t,ℓ}} I_{iℓ} = 0.

Therefore, the η_t diffusion is described by the SDE

dη_t = ∇log p(θ_t)dt + √2 ∇²ψ(θ_t)^{1/2}dB_t,  θ_t = ∇ψ*(η_t).
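The dual-space SDE above is straightforward to simulate with an Euler–Maruyama discretization. The sketch below is our own illustration (target, step size, chain count, and the dual-coordinate clipping guard are all assumptions, not the paper's code): it targets a Beta(2, 3) density on [0, 1] with the negative-entropy map ψ(θ) = θ log θ + (1 − θ) log(1 − θ), for which ∇ψ is the logit, ∇ψ* the sigmoid, and ∇²ψ(θ) = 1/(θ(1 − θ)).

```python
import numpy as np

# Hedged sketch: Euler-Maruyama discretization of the mirror-Langevin SDE
#   d eta = grad log p(theta) dt + sqrt(2) Hess(psi)(theta)^{1/2} dB,
#   theta = sigmoid(eta),
# for a Beta(a, b) target on [0, 1]. All parameters are illustrative, and we
# clip eta to a bounded box as a crude numerical guard near the boundary.

def mirror_langevin_beta(a, b, n_chains=2000, n_steps=4000, dt=1e-3, seed=0):
    rng = np.random.default_rng(seed)
    eta = rng.normal(size=n_chains)                 # arbitrary dual-space init
    for _ in range(n_steps):
        theta = 1.0 / (1.0 + np.exp(-eta))          # nabla psi* (sigmoid)
        score = (a - 1) / theta - (b - 1) / (1 - theta)   # grad log p(theta)
        diff = np.sqrt(1.0 / (theta * (1 - theta)))        # Hess(psi)^{1/2}
        eta = eta + dt * score \
            + np.sqrt(2 * dt) * diff * rng.normal(size=n_chains)
        eta = np.clip(eta, -8.0, 8.0)               # numerical stability guard
    return 1.0 / (1.0 + np.exp(-eta))

theta = mirror_langevin_beta(2, 3)   # Beta(2, 3) has mean 2/5
```

Every sample respects the [0, 1] constraint automatically because the chain only ever returns ∇ψ* of a dual coordinate; no rejection or projection is needed.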
E MODE MISMATCH UNDER TRANSFORMATIONS

Figure 6: The density functions of the same distribution in θ (left) and η (right) space under the transformation η = ∇ψ(θ). Each θ follows a Beta distribution on [0, 1]. We choose the negative entropy ψ(θ) = θ log θ + (1 − θ) log(1 − θ), so the transformation is the logit function η = log(θ/(1 − θ)) and its inverse is the sigmoid function θ = 1/(1 + e^{−η}). Top: θ ∼ Beta(0.5, 0.5); dashed lines mark the mode of the transformed density p_H(η) and the corresponding θ, which gives the lowest value of p(θ). Bottom: θ ∼ Beta(1.1, 10); dashed lines mark the mode of the target density p(θ) and the corresponding η, which clearly does not match the mode of p_H(η).

F BACKGROUND ON REPRODUCING KERNEL HILBERT SPACES

Let H be a Hilbert space of functions defined on X and taking values in R. We say k is a reproducing kernel (or kernel) of H if, for all x ∈ X, k(x, ·) ∈ H and, for all f ∈ H, ⟨f, k(x, ·)⟩_H = f(x). H is called a reproducing kernel Hilbert space (RKHS) if it has a kernel. Kernels are positive definite (p.d.) functions, meaning that matrices of the form (k(x_i, x_j))_{ij} are positive semidefinite. For any p.d. function k, there is a unique RKHS with k as its reproducing kernel, which can be constructed as the completion of {∑_{i=1}^n a_i k(x_i, ·) : x_i ∈ X, a_i ∈ R, n ∈ N}. Now assume X is a metric space, k is a bounded continuous kernel with RKHS H, and ν is a positive measure on X. Let L²(ν) denote the space of all square-integrable functions w.r.t. ν. Then the kernel integral operator T_k : L²(ν) → L²(ν) defined by (T_k g)(·) = ∫_X g(x)k(x, ·)dν(x) is compact and self-adjoint. Therefore, according to the spectral theorem, there exists an at most countable set of positive eigenvalues {λ_j}_{j∈J} ⊂ R with λ₁ ≥ λ₂ ≥ · · ·
converging to zero and orthonormal eigenfunctions {u_j}_{j∈J} such that T_k u_j = λ_j u_j, and k has the representation k(x, x′) = ∑_{j∈J} λ_j u_j(x)u_j(x′) (Mercer's theorem on non-compact domains), where the convergence of the sum is absolute and uniform on compact subsets of X × X (Ferreira & Menegatto, 2009).

G SUPPLEMENTARY EXPERIMENTAL DETAILS AND ADDITIONAL RESULTS

In this section, we report supplementary details and additional results for the experiments of Sec. 5. In Secs. 5.1 and 5.2, we use the inverse multiquadric input kernel k(θ, θ′) = (1 + ∥θ − θ′∥²₂/ℓ²)^{−1/2} due to its convergence control properties (Gorham & Mackey, 2017). In the unconstrained experiments of Sec. 5.3, we use the Gaussian kernel k(θ, θ′) = exp(−∥θ − θ′∥²₂/ℓ²) for consistency with past work. The bandwidth ℓ is determined by the median heuristic (Garreau et al., 2017). We select τ from {0.98, 0.99} for all SVMD experiments. For unconstrained targets, we report, for each method, results from the best fixed step size ϵ ∈ {0.01, 0.05, 0.1, 0.5, 1} selected on a separate validation set. For constrained targets, we select step sizes adaptively to accommodate rapid density growth near the boundary; specifically, we use RMSProp (Hinton et al., 2012), an extension of the AdaGrad algorithm (Duchi et al., 2011) used in Liu & Wang (2016), and report performance with the best learning rate. Results were recorded on an Intel(R) Xeon(R) CPU E5-2690 v4 @ 2.60GHz and an NVIDIA Tesla P100 PCIe 16GB.

G.1 APPROXIMATION QUALITY ON THE SIMPLEX

The sparse Dirichlet posterior of Patterson & Teh (2013), extended to 20 dimensions, features a sparse, symmetric Dir(α) prior with α_k = 0.1 for k ∈ {1, . . . , 20} and sparse count data n₁ = 90, n₂ = n₃ = 5, n_j = 0 (j > 3), modeled via a multinomial likelihood. The quadratic target satisfies log p(θ) = −(1/(2σ²))θ⊤Aθ + const, where we slightly modify the target density of Ahn & Chewi (2020) to make it less flat by introducing a scale parameter σ = 0.01.
A ∈ R^{19×19} is a positive definite matrix generated by normalizing products of random matrices with i.i.d. elements drawn from Unif[−1, 1]. We initialize all methods with i.i.d. samples from Dirichlet(5) to prevent any of the initial particles from being too close to the boundary. For each method and each learning rate we apply 500 particle updates. For SVMD we set τ = 0.98. We search the base learning rates of RMSProp in {0.1, 0.01, 0.001} for SVMD and MSVGD. Since projected SVGD applies updates in the θ space, the appropriate learning rate range is smaller than those of SVMD and MSVGD; there we search the base learning rate of RMSProp in {0.01, 0.001, 0.0001}. For all methods, the results under each base learning rate are plotted in Fig. 7.

Figure 7: Sampling from a Dirichlet target on a 20-simplex. We plot the energy distance to a ground truth sample of size 1000.

Figure 8: Sampling from a quadratic target on a 20-simplex. We plot the energy distance to a ground truth sample of size 1000 drawn by NUTS (Hoffman & Gelman, 2014).

G.2 CONFIDENCE INTERVALS FOR POST-SELECTION INFERENCE

Given a dataset X ∈ R^{n×p}, y ∈ R^n, the randomized Lasso (Tian et al., 2016) solves the following problem:

argmin_{β∈R^p} ½∥y − Xβ∥²₂ + λ∥β∥₁ − ω⊤β + (ϵ/2)∥β∥²₂,  ω ∼ G,

where G is a user-specified log-concave distribution with density g.
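For readers without the R package, the randomized Lasso objective above can be solved with a plain proximal-gradient (ISTA) loop: a gradient step on the smooth part ½∥y − Xβ∥² − ω⊤β + (ϵ/2)∥β∥² followed by soft-thresholding for the λ∥β∥₁ term. The sketch below is our own illustration with arbitrary synthetic data and penalty levels, not the selectiveInference implementation.

```python
import numpy as np

# Hedged sketch: proximal gradient (ISTA) for the randomized Lasso objective
#   min_beta 0.5||y - X beta||^2 + lam||beta||_1 - w' beta + (eps/2)||beta||^2.
# Data, penalty levels, and iteration counts are illustrative assumptions.

def randomized_lasso(X, y, w, lam, eps, n_iter=1000):
    L = np.linalg.norm(X, 2) ** 2 + eps        # Lipschitz const of smooth part
    t = 1.0 / L
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        grad = X.T @ (X @ beta - y) - w + eps * beta
        z = beta - t * grad
        beta = np.sign(z) * np.maximum(np.abs(z) - t * lam, 0.0)  # soft-threshold
    return beta

rng = np.random.default_rng(0)
n, p = 100, 40
X = rng.normal(size=(n, p)) / np.sqrt(n)
y = rng.normal(size=n)                         # global null, as in the simulation
w = 0.5 * rng.normal(size=p)                   # randomization draw omega ~ G
lam, eps = 0.1, 1.0
beta = randomized_lasso(X, y, w, lam, eps)
grad = X.T @ (X @ beta - y) - w + eps * beta   # for a KKT optimality check
```

At the optimum, the KKT conditions require grad = −λ sign(β_j) on active coordinates and |grad_j| ≤ λ on inactive ones, which gives an easy convergence check; the set {j : β_j ≠ 0} is the selected model that the post-selection target conditions on.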
We choose G to be a zero-mean Gaussian distribution with independent coordinates, leaving its scale and the ridge parameter ϵ to be automatically determined by the randomized Lasso function of the selectiveInference package. We initialize the particles of SVMD and MSVGD as follows: first, we map the solution β̂_E to the dual space by ∇ψ; next, we add i.i.d. standard Gaussian noise to n copies of the image in the dual space; finally, we map the n particles back to the primal space by ∇ψ* and use them as the initial locations. Below we discuss the remaining settings and additional results of the simulation and the HIV-1 drug resistance experiment separately.

Simulation. In our simulation, we mostly follow the settings of Sepehri & Markovic (2017) except for using a different penalty level λ recommended in the selectiveInference R package. We set n = 100 and p = 40. The design matrix X is generated from an equi-correlated model, i.e., each datapoint x_i ∈ R^p is generated i.i.d. from N(0, Σ) with Σ_ii = 1, Σ_ij = 0.3 (i ≠ j) and then normalized to have almost unit length. The normalization is done by first centering each dimension by subtracting the mean and dividing by the standard deviation of that column of X, then additionally multiplying by 1/n^{1/2}. y is generated from a standard Gaussian independent of X, i.e., we assume the global null setting where the true value of β is zero. We set λ to the value returned by theoretical.lambda of the selectiveInference R package multiplied by a coefficient of 0.7n, where the 0.7 adjustment is introduced in the test examples of the R package to reduce the regularization effect so that we have a reasonably large set of selected features when p = 40. The base learning rates for SVMD and MSVGD are set to 0.01 and we run them for T = 1000 particle updates. τ is set to 0.98 for SVMD. Our 2D example in Fig. 4a is taken from one run of the simulation in which only 2 features happened to be selected by the randomized Lasso.
The selective distribution in this case has log density

log p(θ) = −8.07193((2.39859θ₁ + 1.90816θ₂ + 2.39751)² + (1.18099θ₂ − 1.46104)²) + const,  θ₁, θ₂ ≥ 0.

The error bars for actual coverage levels in Fig. 3a and Fig. 3b are 95% Wilson intervals (Wilson, 1927), which are known to be more accurate than 2-standard-deviation intervals for binomial proportions like the coverage. In Fig. 9a and Fig. 9b we additionally plot the average length of the confidence intervals w.r.t. different sample sizes N and nominal coverage levels. For all three methods the CI widths are very close, although MSVGD consistently has wider intervals than SVMD and selectiveInference. This indicates that SVMD may be preferred over MSVGD when both methods produce coverage above the nominal level.

HIV-1 drug resistance. We take the in vitro measurement of log-fold change under the 3TC drug as the response and include mutations that appeared ≥ 11 times in the dataset as regressors. This results in n = 663 datapoints with p = 91 features. We choose λ to be the value returned by theoretical.lambda of the selectiveInference R package multiplied by n. The base learning rates for SVMD and MSVGD are set to 0.01 and we run them for T = 2000 particle updates. τ is set to 0.99 for SVMD.

Figure 9: Width of post-selection CIs across (a) 500 / (b) 200 replications of the simulation of Sepehri & Markovic (2017). (a) Nominal coverage: 0.9; (b) N = 5000 sample points. [Panels plot CI width against the number of sample points N and the nominal coverage for Standard, MSVGD, and SVMD.]

G.3 LARGE-SCALE POSTERIOR INFERENCE WITH NON-EUCLIDEAN GEOMETRY

The Bayesian logistic regression model we consider is ∏_{i=1}^n p(y_i|x_i, w)p(w), where p(w) = N(w|0, I) and p(y_i|x_i, w) = Bernoulli(σ(w⊤x_i)). The bias parameter is absorbed into w by adding an additional feature 1 to each x_i.
The gradient of the log density of the posterior distribution of w is

∇_w log p(w|{y_i, x_i}_{i=1}^N) = ∑_{i=1}^N x_i(y_i − σ(w⊤x_i)) − w.

We choose the metric tensor ∇²ψ(w) to be the Fisher information matrix (FIM) of the likelihood:

F(w) = ∑_{i=1}^N E_{p(y_i|w,x_i)}[∇_w log p(y_i|x_i, w)∇_w log p(y_i|x_i, w)⊤] = ∑_{i=1}^N σ(w⊤x_i)(1 − σ(w⊤x_i))x_i x_i⊤.

Following Wang et al. (2019), for each iteration r (r ≥ 1), we estimate the sum with a stochastic minibatch B_r of size 256, F̂_{B_r} = (N/|B_r|)∑_{i∈B_r} σ(w⊤x_i)(1 − σ(w⊤x_i))x_i x_i⊤, and approximate the FIM with a moving average across iterations: F̂_r = ρ_r F̂_{r−1} + (1 − ρ_r)F̂_{B_r}, where ρ_r = min(1 − 1/r, 0.95). To ensure the positive definiteness of the FIM, a damping term 0.01I is added before taking the inverse. For RSVGD and SVNG, the gradient of the inverse of the FIM is estimated as ∂_{w_j}F^{−1} ≈ −F̂_r^{−1}(∂̂_{w_j}F)F̂_r^{−1}, where ∂̂^r_{w_j}F = ρ_r ∂̂^{r−1}_{w_j}F + (1 − ρ_r)∂_{w_j}F̂_{B_r}. We run each method for T = 3000 particle updates with learning rates in {0.01, 0.05, 0.1, 0.5, 1} and average the results over 5 random trials. τ is set to 0.98 for SVNG. For each run, we randomly keep 20% of the dataset as test data, 20% of the remaining points as the validation set, and all the rest as the training set. The results of each method on validation sets with all choices of learning rates are plotted in Fig. 10. We see that the SVNG updates are very robust to changes in the learning rate and can accommodate very large learning rates (up to 1) without a significant loss in performance. The results in Fig. 5 are reported with the learning rate that performs best on the validation set.
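The FIM recipe above is straightforward to vectorize. The sketch below uses illustrative sizes and synthetic data (it is not the authors' code): it forms the minibatch estimate F̂_B, the moving average F̂_r with ρ_r = min(1 − 1/r, 0.95), and the damped inverse used for a natural-gradient direction.

```python
import numpy as np

# Hedged sketch of the minibatch Fisher information estimate for Bayesian
# logistic regression, with the moving average and damping described above.
# Data sizes, labels, and the fixed weight vector are illustrative assumptions.

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

def minibatch_fim(X, w, batch_idx, n_total):
    """F_hat_B = (n/|B|) * sum_{i in B} s_i (1 - s_i) x_i x_i^T."""
    Xb = X[batch_idx]
    s = sigmoid(Xb @ w)
    return (n_total / len(batch_idx)) * (Xb * (s * (1 - s))[:, None]).T @ Xb

rng = np.random.default_rng(0)
n, d, batch = 512, 10, 64
X = rng.normal(size=(n, d))
w = rng.normal(size=d) / np.sqrt(d)

F_avg = np.zeros((d, d))
for r in range(1, 51):                        # 50 rounds of moving averaging
    rho = min(1 - 1 / r, 0.95)                # rho_1 = 0, so F_avg starts unbiased
    idx = rng.choice(n, size=batch, replace=False)
    F_avg = rho * F_avg + (1 - rho) * minibatch_fim(X, w, idx, n)

F_damped = F_avg + 0.01 * np.eye(d)           # damping before inversion
y = rng.integers(0, 2, size=n)                # synthetic labels, illustrative
grad = X.T @ (y - sigmoid(X @ w)) - w         # posterior log-density gradient
nat_grad = np.linalg.solve(F_damped, grad)    # natural-gradient direction

s_full = sigmoid(X @ w)
F_full = (X * (s_full * (1 - s_full))[:, None]).T @ X   # exact FIM, for reference
```

Because every F̂_B is positive semidefinite, the moving average is too, so the 0.01I damping guarantees a well-conditioned solve; the preconditioned direction satisfies grad⊤F̂^{−1}grad > 0 and is therefore always an ascent direction.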
[Figure 10: Logistic regression results on validation sets with learning rates in {0.01, 0.05, 0.1, 0.5, 1}. Each panel plots the validation log predictive probability against the number of particle updates $T$ for one method, including Matrix SVGD (Avg) and Matrix SVGD (Mixture). Running RSVGD with learning rates 0.5 and 1 produces numerical errors; therefore, we did not include those curves in the plot.]

H EXPONENTIAL CONVERGENCE OF CONTINUOUS-TIME ALGORITHMS

We derive a time-inhomogeneous generalization of the Stein log-Sobolev inequality of Duncan et al. (2019) and Korba et al. (2020), which ensures the exponential convergence of our continuous-time algorithms, and a time-inhomogeneous generalization of the Stein Poincaré inequality of Duncan et al. (2019), which guarantees exponential convergence near equilibrium (i.e., when $q_t$ is sufficiently close to $p$). As the results hold for a generic sequence of kernels $(K_t)_{t \ge 0}$, the implications apply to both MSVGD and SVMD.

Definition 2 (Mirror Stein log-Sobolev inequality). We define the Mirror Stein log-Sobolev inequality with constant $\lambda > 0$ (cf. Korba et al., 2020, Def. 2) as
$$\mathrm{KL}(q_t \,\|\, p) \le \frac{1}{2\lambda}\, \mathrm{MKSD}_{K_t}^2(q_t, p) = \frac{1}{2\lambda}\, \mathbb{E}_{q_t}\Big[\big(\nabla^2\psi^{-1} \nabla \log \tfrac{q_t}{p}\big)^\top P_{K_t, q_t}\, \nabla^2\psi^{-1} \nabla \log \tfrac{q_t}{p}\Big],$$
where $\mathrm{MKSD}_{K_t}$ is defined in Eq. (15); $P_{K_t, q_t} : L^2(q_t) \to L^2(q_t)$ is the kernel integral operator $(P_{K_t, q_t}\phi)(\cdot) \triangleq \mathbb{E}_{q_t(\theta)}[K_t(\cdot, \theta)\phi(\theta)]$ for a general kernel $K_t$ and vector-valued function $\phi$ on $\Theta$; and the stated equality holds whenever integration-by-parts is applicable.

Proposition 11. Suppose $(\theta_t)_{t \ge 0}$ follows the mirrored dynamics (8) with $g_t$ chosen to be $g^*_{q_t, K_t}$ as in (10).
Then the dissipation of $\mathrm{KL}(q_t \,\|\, p)$ is
$$\frac{d}{dt}\, \mathrm{KL}(q_t \,\|\, p) = -\mathrm{MKSD}_{K_t}(q_t, p)^2.$$

Proof. The proof directly follows from Thm. 3, since the optimization problem there matches the definition of MKSD in (15).

Therefore, when the Mirror Stein log-Sobolev inequality holds, we have $\frac{d}{dt}\, \mathrm{KL}(q_t \,\|\, p) \le -2\lambda\, \mathrm{KL}(q_t \,\|\, p)$, and the exponential convergence $\mathrm{KL}(q_t \,\|\, p) \le \mathrm{KL}(q_0 \,\|\, p)\, e^{-2\lambda t}$ follows by Grönwall's lemma (Gronwall, 1919).

Definition 3 (Mirror Stein Poincaré inequality). We say that the distribution $p$ satisfies the Mirror Stein Poincaré inequality (cf. Duncan et al. 2019, Eq. (57)) with strongly convex $\psi$ and constant $\lambda > 0$ if
$$\lambda\, \mathrm{Var}_p[\phi] \le \mathbb{E}_p[\nabla\phi^\top P_{K_t, p}\, \nabla^2\psi^{-1} \nabla\phi]$$
for all locally Lipschitz $\phi \in L^2(p) \cap C^1(\Theta)$, where $P_{K_t, p}$ is the kernel integral operator under $p$, defined as in Definition 2. This inequality can also be viewed as a kernelized generalization of the mirror Poincaré inequality introduced in Chewi et al. (2020, Def. 1) for proving exponential convergence of the mirror-Langevin diffusion. In a manner analogous to Thm. 1 of Chewi et al. (2020), the following proposition relates the Mirror Stein Poincaré inequality to the chi-squared divergence.

Proposition 12. Suppose $(\theta_t)_{t \ge 0}$ follows the mirrored dynamics (8) with $g_t$ chosen to be $g^*_{q_t, K_t}$ as in (10). Then the dissipation of the chi-squared divergence $\chi^2(q_t \,\|\, p)$ is
$$\frac{d}{dt}\, \chi^2(q_t \,\|\, p) = -2\, \mathbb{E}_{q_t}\Big[\nabla\big(\tfrac{q_t}{p}\big)^\top P_{K_t, p}\, \nabla^2\psi^{-1}\, \nabla\big(\tfrac{q_t}{p}\big)\Big]$$
whenever integration-by-parts is applicable.

Proof. We first note that, by applying integration-by-parts, $g^*_{q_t, K_t}$ as in (10) can be equivalently written as
$$g^*_{q_t, K_t} = -P_{K_t, q_t}\, \nabla^2\psi^{-1} \nabla \log \tfrac{q_t}{p}.$$
Then, using the Fokker-Planck equation of $q_t$ under the dynamics, we have
$$\frac{d}{dt}\, \chi^2(q_t \,\|\, p) = \frac{d}{dt} \int \Big(\frac{q_t}{p}\Big)^2 dp = -2 \int q_t\, \nabla\big(\tfrac{q_t}{p}\big)^\top P_{K_t, q_t}\, \nabla^2\psi^{-1} \nabla \log \tfrac{q_t}{p}\, d\theta,$$
and the claim follows since $q_t \nabla \log \tfrac{q_t}{p} = p\, \nabla \tfrac{q_t}{p}$ implies $P_{K_t, q_t}\, \nabla^2\psi^{-1} \nabla \log \tfrac{q_t}{p} = P_{K_t, p}\, \nabla^2\psi^{-1} \nabla \tfrac{q_t}{p}$.

Note that the right-hand side of the equation differs from the Mirror Stein Poincaré inequality only in the base measure of the expectation. Duncan et al. (2019) propose to replace $q_t$ with $p$ to study the convergence near equilibrium (see their Sec. 6, where Eq. (46) is replaced with Eq. (51)).
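The comparison $\mathrm{KL}(q \,\|\, p) \le \chi^2(q \,\|\, p)$ invoked in the convergence argument is a standard inequality (via $\mathrm{KL} \le \log(1 + \chi^2) \le \chi^2$). A quick numerical sanity check on random discrete distributions (not part of the proof; all names are illustrative):

```python
import numpy as np

def kl(q, p):
    """KL divergence between discrete distributions q and p."""
    return float(np.sum(q * np.log(q / p)))

def chi2(q, p):
    """Chi-squared divergence: sum (q - p)^2 / p = sum q^2/p - 1."""
    return float(np.sum((q - p) ** 2 / p))

rng = np.random.default_rng(0)
pairs = [(rng.dirichlet(np.ones(5)), rng.dirichlet(np.ones(5)))
         for _ in range(200)]
# KL(q || p) <= chi^2(q || p) on every random pair
holds = all(kl(q, p) <= chi2(q, p) + 1e-12 for q, p in pairs)
```

Because of this ordering, any exponential decay established for $\chi^2(q_t \,\|\, p)$ immediately transfers to the KL divergence.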
If we do the same and combine this identity with the Mirror Stein Poincaré inequality, we obtain
$$\frac{d}{dt}\, \chi^2(q_t \,\|\, p) \le -2\lambda\, \mathrm{Var}_p\big[\tfrac{q_t}{p}\big] = -2\lambda\, \chi^2(q_t \,\|\, p),$$
which implies exponential convergence in KL,
$$\mathrm{KL}(q_t \,\|\, p) \le \chi^2(q_t \,\|\, p) \le \chi^2(q_0 \,\|\, p)\, e^{-2\lambda t},$$
by Grönwall's lemma (Gronwall, 1919).

I.1 PROOF OF PROP. 1

Proof. Fix any $g \in \mathcal{G}_\psi$. Since $g$ and $\nabla g$ are bounded, and $\nabla^2\psi(\theta)^{-1} \nabla \log p(\theta)$ and $\nabla \cdot \nabla^2\psi(\theta)^{-1}$ are $p$-integrable, the expectation $\mathbb{E}_{\theta \sim p}[(\mathcal{M}_{p,\psi}\, g)(\theta)]$ exists. Because $\Theta$ is convex, $\Theta_r$ is bounded and convex with Lipschitz boundary. Since $p\, \nabla^2\psi^{-1} g \in C^1$, we have
$$|\mathbb{E}_p[(\mathcal{M}_{p,\psi}\, g)(\theta)]| = \big|\mathbb{E}_p[g(\theta)^\top \nabla^2\psi(\theta)^{-1} \nabla \log p(\theta) + \nabla \cdot (\nabla^2\psi(\theta)^{-1} g(\theta))]\big|$$
$$= \Big|\int_\Theta \nabla p(\theta)^\top \nabla^2\psi(\theta)^{-1} g(\theta) + p(\theta)\, \nabla \cdot (\nabla^2\psi(\theta)^{-1} g(\theta))\, d\theta\Big|$$
$$= \Big|\int_\Theta \nabla \cdot \big(p(\theta)\, \nabla^2\psi(\theta)^{-1} g(\theta)\big)\, d\theta\Big|$$
$$= \lim_{r \to \infty} \Big|\int_{\Theta_r} \nabla \cdot \big(p(\theta)\, \nabla^2\psi(\theta)^{-1} g(\theta)\big)\, d\theta\Big| \quad \text{(by dominated convergence)}$$
$$= \lim_{r \to \infty} \Big|\int_{\partial\Theta_r} \big(p(\theta)\, \nabla^2\psi(\theta)^{-1} g(\theta)\big)^\top n_r(\theta)\, d\theta\Big| \quad \text{(by the divergence theorem)}$$
$$\le \lim_{r \to \infty} \int_{\partial\Theta_r} p(\theta)\, \|g(\theta)\|_2\, \|\nabla^2\psi(\theta)^{-1} n_r(\theta)\|_2\, d\theta \quad \text{(by Cauchy-Schwarz)}$$
$$\le \lim_{r \to \infty} \int_{\partial\Theta_r} p(\theta)\, \|\nabla^2\psi(\theta)^{-1} n_r(\theta)\|_2\, d\theta = 0 \quad \text{(by assumption)}.$$

I.2 PROOF OF THM. 3: OPTIMAL MIRROR UPDATES IN RKHS

Proof. Let $e_i$ denote the standard basis vector of $\mathbb{R}^d$ with the $i$-th element being 1 and the others 0. Since $m \in \mathcal{H}_K$, we have
$$m(\theta)^\top \nabla^2\psi(\theta)^{-1} \nabla \log p(\theta) = \big\langle m,\, K(\cdot, \theta)\, \nabla^2\psi(\theta)^{-1} \nabla \log p(\theta) \big\rangle_{\mathcal{H}_K},$$
$$\nabla \cdot (\nabla^2\psi(\theta)^{-1} m(\theta)) = \sum_{i=1}^d \partial_{\theta_i}\big(m(\theta)^\top \nabla^2\psi(\theta)^{-1} e_i\big) = \sum_{i=1}^d \big\langle m,\, \partial_{\theta_i}\big(K(\cdot, \theta)\, \nabla^2\psi(\theta)^{-1} e_i\big) \big\rangle_{\mathcal{H}_K} = \big\langle m,\, \nabla_\theta \cdot \big(K(\cdot, \theta)\, \nabla^2\psi(\theta)^{-1}\big) \big\rangle_{\mathcal{H}_K},$$
where we define the divergence of a matrix as the vector whose elements are the divergences of each row of the matrix. Then we write (9) as
$$\mathbb{E}_{q_t}\big[m(\theta)^\top \nabla^2\psi(\theta)^{-1} \nabla \log p(\theta) + \nabla \cdot (\nabla^2\psi(\theta)^{-1} m(\theta))\big] = \Big\langle m,\, \mathbb{E}_{q_t}\big[K(\cdot, \theta)\, \nabla^2\psi(\theta)^{-1} \nabla \log p(\theta) + \nabla_\theta \cdot \big(K(\cdot, \theta)\, \nabla^2\psi(\theta)^{-1}\big)\big] \Big\rangle_{\mathcal{H}_K} = \big\langle m,\, \mathbb{E}_{q_t}[\mathcal{M}_{p,\psi}\, K(\cdot, \theta)] \big\rangle_{\mathcal{H}_K}.$$
Therefore, the optimal direction in the $\mathcal{H}_K$ norm ball $B_{\mathcal{H}_K} = \{g : \|g\|_{\mathcal{H}_K} \le 1\}$ that minimizes (9) is $g^*_t \equiv g^*_{q_t, K} = \mathbb{E}_{q_t}[\mathcal{M}_{p,\psi}\, K(\cdot, \theta)]$.

I.3 PROOF OF THM. 4: MIRRORED SVGD UPDATES

Proof. A p.d. kernel $k$ composed with any map $\phi$ is still a p.d. kernel. To prove this, let $\{x_1, \ldots, x_p\} = \{\phi(\eta_1), \ldots, \phi(\eta_n)\}$ with $p \le n$.
Then
$$\sum_{i,j} \alpha_i \alpha_j\, k(\phi(\eta_i), \phi(\eta_j)) = \sum_{\ell,m} \beta_\ell \beta_m\, k(x_\ell, x_m) \ge 0,$$
where $\beta_\ell = \sum_{i \in S_\ell} \alpha_i$ and $S_\ell = \{i : \phi(\eta_i) = x_\ell\}$. Therefore, $k_\psi(\eta, \eta') = k(\nabla\psi^*(\eta), \nabla\psi^*(\eta'))$ is a p.d. kernel. Plugging $K = k \cdot I$ into Lem. 13, for any $\theta' \in \Theta$ and $\eta' = \nabla\psi(\theta')$, we have
$$g^*_{q_t, K_k}(\theta') = \mathbb{E}_{\eta_t \sim q_{t,H}}\big[K_{\nabla\psi^*}(\nabla\psi(\theta'), \eta_t)\, \nabla \log p_H(\eta_t) + \nabla_{\eta_t} \cdot K_{\nabla\psi^*}(\nabla\psi(\theta'), \eta_t)\big]$$
$$= \mathbb{E}_{\eta_t \sim q_{t,H}}\Big[k(\nabla\psi^*(\eta'), \nabla\psi^*(\eta_t))\, \nabla \log p_H(\eta_t) + \sum_{j=1}^d \partial_{\eta_{t,j}} k(\nabla\psi^*(\eta'), \nabla\psi^*(\eta_t))\, e_j\Big]$$
$$= \mathbb{E}_{\eta_t \sim q_{t,H}}\big[k_\psi(\eta', \eta_t)\, \nabla \log p_H(\eta_t) + \nabla_{\eta_t} k_\psi(\eta', \eta_t)\big].$$

I.4 PROOF OF PROP. 5: SINGLE-PARTICLE SVMD IS MIRROR DESCENT

Proof. When $n = 1$, $\lambda_1 = k(\theta_t, \theta_t)$ and $u_1 = 1$, and thus $K_{\psi,t}(\theta_t, \theta_t) = k(\theta_t, \theta_t)\, \nabla^2\psi(\theta_t)$.

I.5 PROOF OF THM. 6: CONVERGENCE OF MIRRORED UPDATES AS $n \to \infty$

Proof. The idea is to reinterpret our mirrored updates as one step of a matrix SVGD in $\eta$ space based on Lem. 13 and then follow the path of Gorham et al. (2020, Thm. 7). Assume that $q^n_{t,H}$ and $q^\infty_{t,H}$ have integrable means. Let $(\eta^n, \eta^\infty)$ be an optimal Wasserstein-1 coupling of $q^n_{t,H}$ and $q^\infty_{t,H}$. Let $\Phi_{q_t, K_t}$ denote the transform through one step of the mirrored update: $\theta_t = \nabla\psi^*(\eta_t)$, $\eta_{t+1} = \eta_t + \epsilon_t g^*_{q_t, K_t}(\theta_t)$. Then, with Lem. 13, we have
$$\|\Phi_{q^n_t, K_t}(\eta) - \Phi_{q^\infty_t, K_t}(\eta')\|_2 = \|\eta + \epsilon_t g^*_{q^n_t, K_t}(\theta) - \eta' - \epsilon_t g^*_{q^\infty_t, K_t}(\theta')\|_2$$
$$\le \|\eta - \eta'\|_2 + \epsilon_t \|g^*_{q^n_t, K_t}(\theta) - g^*_{q^\infty_t, K_t}(\theta')\|_2$$
$$\le \|\eta - \eta'\|_2 + \epsilon_t \big\|\mathbb{E}_{\eta^n}\big[K_{\nabla\psi^*,t}(\eta, \eta^n)\, \nabla \log p_H(\eta^n) + \nabla_{\eta^n} \cdot K_{\nabla\psi^*,t}(\eta, \eta^n) - \big(K_{\nabla\psi^*,t}(\eta', \eta^n)\, \nabla \log p_H(\eta^n) + \nabla_{\eta^n} \cdot K_{\nabla\psi^*,t}(\eta', \eta^n)\big)\big]\big\|_2$$
$$\quad + \epsilon_t \big\|\mathbb{E}_{\eta^n, \eta^\infty}\big[K_{\nabla\psi^*,t}(\eta', \eta^n)\, \nabla \log p_H(\eta^n) + \nabla_{\eta^n} \cdot K_{\nabla\psi^*,t}(\eta', \eta^n) - \big(K_{\nabla\psi^*,t}(\eta', \eta^\infty)\, \nabla \log p_H(\eta^\infty) + \nabla_{\eta^\infty} \cdot K_{\nabla\psi^*,t}(\eta', \eta^\infty)\big)\big]\big\|_2$$
$$\le \|\eta - \eta'\|_2 + \epsilon_t c_1 \big(1 + \mathbb{E}[\|\eta^n\|_2]\big) \|\eta - \eta'\|_2 + \epsilon_t c_2 (1 + \|\eta'\|_2)\, \mathbb{E}_{\eta^n, \eta^\infty}[\|\eta^n - \eta^\infty\|_2]$$
$$= \|\eta - \eta'\|_2 + \epsilon_t c_1 \big(1 + \mathbb{E}_{q^n_{t,H}}[\|\cdot\|_2]\big) \|\eta - \eta'\|_2 + \epsilon_t c_2 (1 + \|\eta'\|_2)\, W_1(q^n_{t,H}, q^\infty_{t,H}).$$
Since $\Phi_{q^n_t, K_t}(\eta^n) \sim q^n_{t+1,H}$ and $\Phi_{q^\infty_t, K_t}(\eta^\infty) \sim q^\infty_{t+1,H}$, we conclude
$$W_1(q^n_{t+1,H}, q^\infty_{t+1,H}) \le \mathbb{E}\big[\|\Phi_{q^n_t, K_t}(\eta^n) - \Phi_{q^\infty_t, K_t}(\eta^\infty)\|_2\big]$$
$$\le \big(1 + \epsilon_t c_1 (1 + \mathbb{E}_{q^n_{t,H}}[\|\cdot\|_2])\big)\, \mathbb{E}[\|\eta^n - \eta^\infty\|_2] + \epsilon_t c_2 \big(1 + \mathbb{E}_{q^\infty_{t,H}}[\|\cdot\|_2]\big)\, W_1(q^n_{t,H}, q^\infty_{t,H})$$
$$\le \big(1 + \epsilon_t c_1 (1 + \mathbb{E}_{q^n_{t,H}}[\|\cdot\|_2]) + \epsilon_t c_2 (1 + \mathbb{E}_{q^\infty_{t,H}}[\|\cdot\|_2])\big)\, W_1(q^n_{t,H}, q^\infty_{t,H}).$$
The final claim $q^n_t \Rightarrow q^\infty_t$ now follows by the continuous mapping theorem, as $\nabla\psi^*$ is continuous.

I.6 PROOF OF THM.
7: INFINITE-PARTICLE MIRRORED STEIN UPDATES DECREASE KL AND MKSD

Proof. Let $T_{q^\infty_t, K_t}$ denote the transform of the density function through one step of the mirrored update: $\theta_t = \nabla\psi^*(\eta_t)$, $\eta_{t+1} = \eta_t + \epsilon_t g^*_{q^\infty_t, K_t}(\theta_t)$. Then
$$\mathrm{KL}(q^\infty_{t+1} \,\|\, p) - \mathrm{KL}(q^\infty_t \,\|\, p) = \mathrm{KL}(q^\infty_t \,\|\, T^{-1}_{q^\infty_t, K_t} p) - \mathrm{KL}(q^\infty_t \,\|\, p)$$
$$= \mathbb{E}_{\eta_t \sim q^\infty_{t,H}}\big[\log p_H(\eta_t) - \log p_H(\eta_t + \epsilon_t g^*_{q^\infty_t, K_t}(\theta_t)) - \log |\det(I + \epsilon_t \nabla_{\eta_t} g^*_{q^\infty_t, K_t}(\theta_t))|\big],$$
where we have used the invariance of the KL divergence under reparameterization: $\mathrm{KL}(q_t \,\|\, p) = \mathrm{KL}(q_{t,H} \,\|\, p_H)$. Following Liu (2017), we bound the difference of the first two terms as
$$\log p_H(\eta_t) - \log p_H(\eta_t + \epsilon_t g^*_{q^\infty_t, K_t}(\theta_t)) = -\int_0^1 \partial_s \log p_H(\eta_t(s))\, ds, \quad \text{where } \eta_t(s) \triangleq \eta_t + s\epsilon_t g^*_{q^\infty_t, K_t}(\theta_t),$$
$$= -\int_0^1 \nabla \log p_H(\eta_t(s))^\top \big(\epsilon_t g^*_{q^\infty_t, K_t}(\theta_t)\big)\, ds$$
$$= -\epsilon_t \nabla \log p_H(\eta_t)^\top g^*_{q^\infty_t, K_t}(\theta_t) - \int_0^1 \big(\nabla \log p_H(\eta_t(s)) - \nabla \log p_H(\eta_t)\big)^\top \big(\epsilon_t g^*_{q^\infty_t, K_t}(\theta_t)\big)\, ds$$
$$\le -\epsilon_t \nabla \log p_H(\eta_t)^\top g^*_{q^\infty_t, K_t}(\theta_t) + \epsilon_t \int_0^1 \|\nabla \log p_H(\eta_t) - \nabla \log p_H(\eta_t(s))\|_2\, \|g^*_{q^\infty_t, K_t}(\theta_t)\|_2\, ds$$
$$\le -\epsilon_t \nabla \log p_H(\eta_t)^\top g^*_{q^\infty_t, K_t}(\theta_t) + \frac{L\epsilon_t^2}{2}\, \|g^*_{q^\infty_t, K_t}(\theta_t)\|_2^2,$$
and bound the log-determinant term using Lem. 15:
$$-\log |\det(I + \epsilon_t \nabla_{\eta_t} g^*_{q^\infty_t, K_t}(\theta_t))| \le -\epsilon_t\, \mathrm{Tr}(\nabla_{\eta_t} g^*_{q^\infty_t, K_t}(\theta_t)) + 2\epsilon_t^2\, \|\nabla_{\eta_t} g^*_{q^\infty_t, K_t}(\theta_t)\|_F^2.$$
The next thing to notice is that $\mathbb{E}_{\eta_t \sim q^\infty_{t,H}}\big[\nabla \log p_H(\eta_t)^\top g^*_{q^\infty_t, K_t}(\theta_t) + \mathrm{Tr}(\nabla_{\eta_t} g^*_{q^\infty_t, K_t}(\theta_t))\big]$ is the square of the MKSD in (15). We can show this equivalence using the identity proved in Lem. 14:
$$\mathbb{E}_{\eta_t \sim q^\infty_{t,H}}\big[g^*_{q^\infty_t, K_t}(\theta_t)^\top \nabla \log p_H(\eta_t) + \mathrm{Tr}(\nabla_{\eta_t} g^*_{q^\infty_t, K_t}(\theta_t))\big]$$
$$= \mathbb{E}_{\theta_t \sim q^\infty_t}\big[g^*_{q^\infty_t, K_t}(\theta_t)^\top \nabla^2\psi(\theta_t)^{-1} \nabla_{\theta_t}\big(\log p(\theta_t) - \log \det \nabla^2\psi(\theta_t)\big) + \mathrm{Tr}\big(\nabla^2\psi(\theta_t)^{-1} \nabla g^*_{q^\infty_t, K_t}(\theta_t)\big)\big]$$
$$= \mathbb{E}_{\theta_t \sim q^\infty_t}\big[g^*_{q^\infty_t, K_t}(\theta_t)^\top \nabla^2\psi(\theta_t)^{-1} \nabla \log p(\theta_t) + \nabla \cdot \big(\nabla^2\psi(\theta_t)^{-1} g^*_{q^\infty_t, K_t}(\theta_t)\big)\big] \quad \text{(Lem. 14)}$$
$$= \mathbb{E}_{\theta_t \sim q^\infty_t}\big[(\mathcal{M}_{p,\psi}\, g^*_{q^\infty_t, K_t})(\theta_t)\big] = \mathrm{MKSD}_{K_t}(q^\infty_t, p)^2.$$
Finally, we bound $\|g^*_{q^\infty_t, K_t}(\theta_t)\|_2^2$ and $\|\nabla_{\eta_t} g^*_{q^\infty_t, K_t}(\theta_t)\|_F^2$. From the assumptions, $\psi$ is $\alpha$-strongly convex, and thus $\psi^*$ is $\frac{1}{\alpha}$-strongly smooth (Kakade et al., 2009); therefore $\|\nabla^2\psi^*(\cdot)\|_2 \le \frac{1}{\alpha}$. By Lem.
16, we know
$$\|g^*_{q^\infty_t, K_t}(\theta_t)\|_2^2 \le \|g^*_{q^\infty_t, K_t}\|_{\mathcal{H}_{K_t}}^2\, \|K_t(\theta_t, \theta_t)\|_{op} = \mathrm{MKSD}_{K_t}(q^\infty_t, p)^2\, \|K_t(\theta_t, \theta_t)\|_{op},$$
$$\|\nabla_{\eta_t} g^*_{q^\infty_t, K_t}(\theta_t)\|_F^2 = \|\nabla^2\psi^*(\eta_t)\, \nabla g^*_{q^\infty_t, K_t}(\theta_t)\|_F^2 \le \|\nabla^2\psi^*(\eta_t)\|_2^2\, \|\nabla g^*_{q^\infty_t, K_t}(\theta_t)\|_F^2$$
$$\le \frac{1}{\alpha^2}\, \|g^*_{q^\infty_t, K_t}\|_{\mathcal{H}_{K_t}}^2 \sum_{i=1}^d \|\partial^2_{i,d+i} K_t(\theta_t, \theta_t)\|_{op} = \frac{1}{\alpha^2}\, \mathrm{MKSD}_{K_t}(q^\infty_t, p)^2 \sum_{i=1}^d \|\partial^2_{i,d+i} K_t(\theta_t, \theta_t)\|_{op},$$
where $\partial^2_{i,d+i} K(\theta, \theta)$ denotes $\partial^2_{\theta_i, \theta'_i} K(\theta, \theta')|_{\theta' = \theta}$. Combining all of the above, we have
$$\mathrm{KL}(q^\infty_{t+1} \,\|\, p) - \mathrm{KL}(q^\infty_t \,\|\, p) \le -\Big(\epsilon_t - \frac{L\epsilon_t^2}{2} \sup_\theta \|K_t(\theta, \theta)\|_{op} - \frac{2\epsilon_t^2}{\alpha^2} \sum_{i=1}^d \sup_\theta \|\partial^2_{i,d+i} K_t(\theta, \theta)\|_{op}\Big)\, \mathrm{MKSD}_{K_t}(q^\infty_t, p)^2.$$
Plugging in the definitions of $\kappa_1$ and $\kappa_2$ finishes the proof.

I.7 PROOF OF THM. 8: $\mathrm{MKSD}_{K_k}$ DETERMINES WEAK CONVERGENCE

Proof. According to Thm. 4,
$$g^*_{q, K_k} = \mathbb{E}_{q_H}\big[k(\cdot, \nabla\psi^*(\eta))\, \nabla \log p_H(\eta) + \nabla_\eta k(\nabla\psi^*(\eta), \cdot)\big],$$
where $q_H(\eta)$ denotes the density of $\eta = \nabla\psi(\theta)$ under the distribution $\theta \sim q$. From the assumptions we have $k(\theta, \theta') = \kappa(\nabla\psi(\theta), \nabla\psi(\theta'))$. With this specific choice of $k$, the squared MKSD is
$$\mathrm{MKSD}_{K_k}(q, p)^2 = \|g^*_{q, K_k}\|_{\mathcal{H}_{K_k}}^2 = \mathbb{E}_{q_H \otimes q_H}\Big[\frac{1}{p_H(\eta)\, p_H(\eta')}\, \nabla_\eta^\top \nabla_{\eta'}\big(p_H(\eta)\, k(\nabla\psi^*(\eta), \nabla\psi^*(\eta'))\, p_H(\eta')\big)\Big]$$
$$= \mathbb{E}_{q_H \otimes q_H}\Big[\frac{1}{p_H(\eta)\, p_H(\eta')}\, \nabla_\eta^\top \nabla_{\eta'}\big(p_H(\eta)\, \kappa(\eta, \eta')\, p_H(\eta')\big)\Big]. \quad (21)$$
The final expression in (21) is the squared kernel Stein discrepancy (KSD) (Liu et al., 2016; Chwialkowski et al., 2016; Gorham & Mackey, 2017) between $q_H$ and $p_H$ with the kernel $\kappa$: $\mathrm{KSD}_\kappa(q_H, p_H)^2$. Recall that it is proved in Gorham & Mackey (2017, Theorem 8) that, for $\kappa(x, y) = (c^2 + \|x - y\|_2^2)^\beta$ with $\beta \in (-1, 0)$ and distantly dissipative $p_H$ with Lipschitz score function, $q_H \Rightarrow p_H$ if $\mathrm{KSD}_\kappa(q_H, p_H) \to 0$. The advertised result ($q \Rightarrow p$ if $\mathrm{MKSD}_{K_k}(q, p) \to 0$) now follows by the continuous mapping theorem, as $\nabla\psi^*$ is continuous.

Lemma 13. Let $K_{\nabla\psi^*}(\eta, \eta') \triangleq K(\nabla\psi^*(\eta), \nabla\psi^*(\eta'))$. The mirrored update $g^*_{q_t, K}$ in (10) can be equivalently expressed as
$$g^*_{q_t, K} = \mathbb{E}_{q_{t,H}}\big[K_{\nabla\psi^*}(\nabla\psi(\cdot), \eta)\, \nabla \log p_H(\eta) + \nabla_\eta \cdot K_{\nabla\psi^*}(\nabla\psi(\cdot), \eta)\big].$$

Proof. We will use the identity proved in Lem. 14.
$$g^*_{q_t, K} = \mathbb{E}_{q_t}[\mathcal{M}_{p,\psi}\, K(\cdot, \theta)]$$
$$= \mathbb{E}_{q_t}\big[K(\cdot, \theta)\, \nabla^2\psi(\theta)^{-1} \nabla \log p(\theta) + \nabla_\theta \cdot \big(K(\cdot, \theta)\, \nabla^2\psi(\theta)^{-1}\big)\big]$$
$$= \mathbb{E}_{q_t}\big[K(\cdot, \theta)\, \nabla^2\psi(\theta)^{-1} \nabla_\theta\big(\log p_H(\nabla\psi(\theta)) + \log \det \nabla^2\psi(\theta)\big) + \nabla_\theta \cdot \big(K(\cdot, \theta)\, \nabla^2\psi(\theta)^{-1}\big)\big] \quad \text{(by the change-of-variable formula)}$$
$$= \mathbb{E}_{q_t}\Big[K(\cdot, \theta)\, \nabla^2\psi(\theta)^{-1} \nabla_\theta \log p_H(\nabla\psi(\theta)) + \sum_{i,j=1}^d [\nabla^2\psi(\theta)^{-1}]_{ij}\, \partial_{\theta_i} K(\cdot, \theta)_{:,j}\Big] \quad \text{(by applying Lem. 14 to each row of } K(\cdot, \theta))$$
$$= \mathbb{E}_{q_t}\Big[K(\cdot, \theta)\, \nabla^2\psi(\theta)^{-1} \nabla_\theta \log p_H(\nabla\psi(\theta)) + \sum_{j=1}^d \partial_{\eta_j} K(\cdot, \theta)_{:,j}\Big]$$
$$= \mathbb{E}_{q_{t,H}}\Big[K(\cdot, \nabla\psi^*(\eta))\, \nabla \log p_H(\eta) + \sum_{j=1}^d \partial_{\eta_j} K(\cdot, \nabla\psi^*(\eta))_{:,j}\Big]$$
$$= \mathbb{E}_{q_{t,H}}\big[K_{\nabla\psi^*}(\nabla\psi(\cdot), \eta)\, \nabla \log p_H(\eta) + \nabla_\eta \cdot K_{\nabla\psi^*}(\nabla\psi(\cdot), \eta)\big],$$
where $A_{:,j}$ denotes the $j$-th column of a matrix $A$.

Lemma 14. For a strictly convex function $\psi \in C^2 : \mathbb{R}^d \to \mathbb{R}$ and any vector-valued $g \in C^1 : \mathbb{R}^d \to \mathbb{R}^d$, the following relation holds:
$$\nabla \cdot \big(\nabla^2\psi(\theta)^{-1} g(\theta)\big) = \mathrm{Tr}\big(\nabla^2\psi(\theta)^{-1} \nabla g(\theta)\big) - g(\theta)^\top \nabla^2\psi(\theta)^{-1} \nabla_\theta \log \det \nabla^2\psi(\theta).$$

Proof. By the product rule of differentiation,
$$\nabla \cdot \big(\nabla^2\psi(\theta)^{-1} g(\theta)\big) = \mathrm{Tr}\big(\nabla^2\psi(\theta)^{-1} \nabla g(\theta)\big) + g(\theta)^\top \big(\nabla \cdot \nabla^2\psi(\theta)^{-1}\big). \quad (22)$$
This already gives us the first term on the right side. Next, we have
$$\big[\nabla^2\psi(\theta)^{-1} \nabla \log \det \nabla^2\psi(\theta)\big]_i = \sum_{j=1}^d [\nabla^2\psi(\theta)^{-1}]_{ij}\, \mathrm{Tr}\big(\nabla^2\psi(\theta)^{-1} \partial_{\theta_j} \nabla^2\psi(\theta)\big)$$
$$= \sum_{j=1}^d [\nabla^2\psi(\theta)^{-1}]_{ij} \sum_{\ell,m=1}^d [\nabla^2\psi(\theta)^{-1}]_{\ell m}\, [\partial_{\theta_j} \nabla^2\psi(\theta)]_{m\ell}$$
$$= \sum_{j,\ell,m=1}^d [\nabla^2\psi(\theta)^{-1}]_{ij}\, [\nabla^2\psi(\theta)^{-1}]_{\ell m}\, \partial_{\theta_j} \nabla^2\psi(\theta)_{m\ell}$$
$$= \sum_{j,\ell,m=1}^d [\nabla^2\psi(\theta)^{-1}]_{ij}\, \partial_{\theta_m} \nabla^2\psi(\theta)_{j\ell}\, [\nabla^2\psi(\theta)^{-1}]_{\ell m}$$
$$= -\sum_{m=1}^d \partial_{\theta_m}\big(\nabla^2\psi(\theta)^{-1}\big)_{im} = -\big[\nabla \cdot \nabla^2\psi(\theta)^{-1}\big]_i.$$
Plugging the above relation into (22) proves the claimed result.

Lemma 15 (Liu, 2017, Lemma A.1). Let $A$ be a square matrix and $0 < \epsilon < \frac{1}{2\|A + A^\top\|_{op}}$. Then
$$\log |\det(I + \epsilon A)| \ge \epsilon\, \mathrm{Tr}(A) - 2\epsilon^2\, \|A\|_F^2,$$
where $\|\cdot\|_F$ denotes the Frobenius norm of a matrix.

Lemma 16. Let $K$ be a matrix-valued kernel and $\mathcal{H}_K$ the corresponding RKHS. Then, for any vector-valued $f \in \mathcal{H}_K$, we have
$$\|f(x)\|_2 \le \|f\|_{\mathcal{H}_K}\, \|K(x, x)\|_{op}^{1/2}, \qquad \|\nabla f(x)\|_F^2 \le \|f\|_{\mathcal{H}_K}^2 \sum_{i=1}^d \big\|\partial^2_{x_i, x'_i} K(x, x')|_{x'=x}\big\|_{op},$$
where $\|\cdot\|_{op}$ denotes the operator norm of a matrix induced by the vector 2-norm.

Proof. We first bound $\|f(x)\|_2$ as
$$\|f(x)\|_2 = \sup_{\|y\|_2=1} f(x)^\top y = \sup_{\|y\|_2=1} \langle f, K(\cdot, x)\, y \rangle_{\mathcal{H}_K} \le \|f\|_{\mathcal{H}_K} \sup_{\|y\|_2=1} \|K(\cdot, x)\, y\|_{\mathcal{H}_K} = \|f\|_{\mathcal{H}_K} \sup_{\|y\|_2=1} \big(y^\top K(x, x)\, y\big)^{1/2}$$
$$\le \|f\|_{\mathcal{H}_K} \sup_{\|y\|_2=1} \sup_{\|u\|_2=1} \big(u^\top K(x, x)\, y\big)^{1/2} = \|f\|_{\mathcal{H}_K} \sup_{\|y\|_2=1} \|K(x, x)\, y\|_2^{1/2} = \|f\|_{\mathcal{H}_K}\, \|K(x, x)\|_{op}^{1/2}.$$
The second result follows similarly:
$$\|\nabla f(x)\|_F^2 = \sum_{i=1}^d \|\partial_{x_i} f(x)\|_2^2 = \sum_{i=1}^d \sup_{\|y\|_2=1} \big(\partial_{x_i} f(x)^\top y\big)^2 = \sum_{i=1}^d \sup_{\|y\|_2=1} \big(\langle f, \partial_{x_i} K(\cdot, x)\, y \rangle_{\mathcal{H}_K}\big)^2$$
$$\le \|f\|_{\mathcal{H}_K}^2 \sum_{i=1}^d \sup_{\|y\|_2=1} \|\partial_{x_i} K(\cdot, x)\, y\|_{\mathcal{H}_K}^2 = \|f\|_{\mathcal{H}_K}^2 \sum_{i=1}^d \sup_{\|y\|_2=1} \big(y^\top \partial^2_{x_i, x'_i} K(x, x')|_{x'=x}\, y\big)$$
$$\le \|f\|_{\mathcal{H}_K}^2 \sum_{i=1}^d \sup_{\|y\|_2=1} \sup_{\|u\|_2=1} \big(u^\top \partial^2_{x_i, x'_i} K(x, x')|_{x'=x}\, y\big) = \|f\|_{\mathcal{H}_K}^2 \sum_{i=1}^d \big\|\partial^2_{x_i, x'_i} K(x, x')|_{x'=x}\big\|_{op}.$$
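The first bound of Lem. 16 is easy to sanity-check numerically in the scalar-kernel case, where $f = \sum_i \alpha_i k(\cdot, x_i)$ has $\|f\|_{\mathcal{H}_k}^2 = \alpha^\top G \alpha$ for the Gram matrix $G$. The following is a sketch under an RBF kernel, not part of the proof; all names are illustrative:

```python
import numpy as np

def k(x, y, h=1.0):
    """Scalar RBF (squared-exponential) kernel on R."""
    return np.exp(-(x - y) ** 2 / (2.0 * h ** 2))

rng = np.random.default_rng(1)
centers = rng.normal(size=5)
alpha = rng.normal(size=5)
Gram = np.array([[k(a, b) for b in centers] for a in centers])
# RKHS norm of f = sum_i alpha_i k(., x_i): ||f||^2 = alpha' G alpha
f_norm = np.sqrt(alpha @ Gram @ alpha)

def f(x):
    return sum(a * k(c, x) for a, c in zip(alpha, centers))

# check |f(x)| <= ||f||_H * k(x, x)^{1/2} at random test points
xs = rng.normal(size=20)
bounds_hold = all(abs(f(x)) <= f_norm * np.sqrt(k(x, x)) + 1e-12 for x in xs)
```

Here $k(x, x) = 1$ for the RBF kernel, so the bound specializes to $|f(x)| \le \|f\|_{\mathcal{H}_k}$, which the check confirms pointwise.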