# Mirrored Langevin Dynamics

Ya-Ping Hsieh, Ali Kavis, Paul Rolland, Volkan Cevher
Laboratory for Information and Inference Systems (LIONS), EPFL, Lausanne, Switzerland
{ya-ping.hsieh, ali.kavis, paul.rolland, volkan.cevher}@epfl.ch

We consider the problem of sampling from constrained distributions, which has posed significant challenges to both non-asymptotic analysis and algorithmic design. We propose a unified framework, inspired by the classical mirror descent, to derive novel first-order sampling schemes. We prove that, for a general target distribution with strongly convex potential, our framework implies the existence of a first-order algorithm achieving $O(\epsilon^{-2}d)$ convergence, suggesting that the state-of-the-art $O(\epsilon^{-6}d^5)$ can be vastly improved. With the important Latent Dirichlet Allocation (LDA) application in mind, we specialize our algorithm to sample from Dirichlet posteriors, and derive the first non-asymptotic $O(\epsilon^{-2}d^2)$ rate for first-order sampling. We further extend our framework to the mini-batch setting and prove convergence rates when only stochastic gradients are available. Finally, we report promising experimental results for LDA on real datasets.

## 1 Introduction

Many modern learning tasks involve sampling from a high-dimensional and large-scale distribution, which calls for algorithms that are scalable with respect to both the dimension and the data size. One approach [32] that has found wide success is to discretize the Langevin Dynamics:

$$dX_t = -\nabla V(X_t)\,dt + \sqrt{2}\,dB_t, \tag{1.1}$$

where $e^{-V(x)}dx$ represents the target distribution and $B_t$ is a $d$-dimensional Brownian motion. Such a framework has inspired numerous first-order sampling algorithms [1, 7, 13, 15, 18, 19, 26, 29], and the convergence rates are by now well-understood for unconstrained and log-concave distributions [8, 12, 14].

However, applying (1.1) to sampling from constrained distributions (i.e., when $V$ has a bounded convex domain) remains a difficult challenge. From the theoretical perspective, there are only two existing algorithms [4, 5] that possess non-asymptotic guarantees, and their rates are significantly worse than in the unconstrained scenario under the same assumptions; cf. Table 1. Furthermore, many important constrained distributions are inherently non-log-concave. A prominent instance is the Dirichlet posterior, which, in spite of the presence of several tailor-made first-order algorithms [18, 26], still lacks a non-asymptotic guarantee.

In this paper, we aim to bridge these two gaps at the same time. For general constrained distributions with a strongly convex potential $V$, we prove the existence of a first-order algorithm that achieves the same convergence rates as if there were no constraint at all, suggesting that the state-of-the-art $O(\epsilon^{-6}d^5)$ can be brought down to $O(\epsilon^{-2}d)$. When specialized to the important case of simplex constraints, we provide the first non-asymptotic guarantee for Dirichlet posteriors: $O(\epsilon^{-2}d^2 R_0)$ for the deterministic and $O\!\left(\epsilon^{-2}(Nd + \sigma^2)R_0\right)$ for the stochastic version of our algorithms; cf. Examples 1 and 2 for the involved parameters.

Our framework combines ideas from the Mirror Descent algorithm [2, 25] for optimization and the theory of Optimal Transport [31]. Concretely, for constrained sampling problems, we propose to use the mirror map to transform the target into an unconstrained distribution, whereby many existing methods apply. Optimal Transport theory then comes in handy to relate the convergence rates between the original and transformed problems. For simplex constraints, we use the entropic mirror map to design practical first-order algorithms that possess rigorous guarantees and are amenable to mini-batch extensions.
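Since the transformed distribution is unconstrained, the workhorse inside our framework is the standard Euler-Maruyama discretization of (1.1), the unadjusted Langevin algorithm. As a point of reference, here is a minimal sketch; the potential gradient `grad_V` is a stand-in for any smooth, unconstrained potential:

```python
import numpy as np

def ula(grad_V, x0, beta, T, rng=np.random.default_rng(0)):
    """Unadjusted Langevin algorithm: Euler-Maruyama discretization of (1.1).

    grad_V: callable returning the gradient of the potential V
    x0:     initial point (d-dimensional array)
    beta:   step size
    T:      number of iterations
    """
    x = np.array(x0, dtype=float)
    for _ in range(T):
        xi = rng.standard_normal(x.shape)
        # x_{t+1} = x_t - beta * grad V(x_t) + sqrt(2 * beta) * xi_t
        x = x - beta * grad_V(x) + np.sqrt(2.0 * beta) * xi
    return x

# Example: sampling from a standard Gaussian, V(x) = ||x||^2 / 2.
sample = ula(grad_V=lambda x: x, x0=np.zeros(2), beta=1e-2, T=10_000)
```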
The rest of the paper is organized as follows. We briefly review the notion of push-forward measures in Section 2. In Section 3, we propose the Mirrored Langevin Dynamics and prove its convergence rates for constrained sampling problems. Mini-batch extensions are derived in Section 4. Finally, in Section 5, we provide synthetic and real-world experiments to demonstrate the empirical efficiency of our algorithms.

### 1.1 Related Work

First-Order Sampling Schemes with Langevin Dynamics: There is a large body of literature on (stochastic) first-order sampling schemes derived from Langevin Dynamics or its variants [1, 4-6, 8, 9, 12, 14, 16, 20, 26, 32]. However, to our knowledge, this work is the first to consider mirror descent extensions of the Langevin Dynamics. The authors in [21] proposed a formalism that can, in principle, incorporate any variant of Langevin Dynamics for a given distribution $e^{-V(x)}dx$. The Mirrored Langevin Dynamics, however, targets the push-forward measure $e^{-W(y)}dy$ (see Section 3.1), and hence our framework is not covered by [21]. For Dirichlet posteriors, [26] uses a variable transformation similar to our entropic mirror map (see the reduced-natural parametrization therein). The dynamics in [26] is nonetheless drastically different from ours, as it features a position-dependent matrix multiplying the Brownian motion, whereas our dynamics has no such feature; see (3.2).

Mirror Descent-Type Dynamics for Stochastic Optimization: Although there is some existing work on mirror descent-type dynamics for stochastic optimization [17, 24, 27, 33], we are unaware of any prior result on sampling.

## 2 Preliminaries

### 2.1 Notation

In this paper, all Lipschitzness and strong convexity are with respect to the Euclidean norm $\|\cdot\|$. We use $C^k$ to denote $k$-times differentiable functions with continuous $k$th derivative. The Fenchel dual [28] of a function $h$ is denoted by $h^*$. Given two mappings $T, F$ of proper dimensions, we denote their composite map by $T \circ F$. For a probability measure $\mu$, we write $X \sim \mu$ to mean that $X$ is a random variable whose probability law is $\mu$.

### 2.2 Push-Forward and Optimal Transport

Let $d\mu = e^{-V(x)}dx$ be a probability measure with support $\mathcal{X} := \mathrm{dom}(V) = \{x \in \mathbb{R}^d \mid V(x) < +\infty\}$, and let $h$ be a convex function on $\mathcal{X}$. Throughout the paper we assume:

Assumption 1. $h$ is closed, proper, $h \in C^2$, and $\nabla^2 h \succ 0$ on $\mathcal{X} \subseteq \mathbb{R}^d$.

Assumption 2. All measures have finite second moments.

Assumption 3. All measures vanish on sets with Hausdorff dimension [22] at most $d-1$.

The gradient map $\nabla h$ induces a new probability measure $d\nu := e^{-W(y)}dy$ through $\nu(E) = \mu\!\left((\nabla h)^{-1}(E)\right)$ for every Borel set $E$ on $\mathbb{R}^d$. We say that $\nu$ is the push-forward measure of $\mu$ under $\nabla h$, and we denote it by $\nabla h \# \mu = \nu$. If $X \sim \mu$ and $Y \sim \nu$, we will sometimes abuse the notation by writing $\nabla h \# X = Y$ to mean $\nabla h \# \mu = \nu$.

If $\nabla h \# \mu = \nu$, the triplet $(\mu, \nu, h)$ must satisfy the Monge-Ampère equation:

$$e^{-V} = e^{-W \circ \nabla h} \det \nabla^2 h. \tag{2.1}$$

Using $(\nabla h)^{-1} = \nabla h^*$ and $\nabla^2 h^* \circ \nabla h = (\nabla^2 h)^{-1}$, we see that (2.1) is equivalent to

$$e^{-W} = e^{-V \circ \nabla h^*} \det \nabla^2 h^*, \tag{2.2}$$

which implies $\nabla h^* \# \nu = \mu$.

The 2-Wasserstein distance between $\mu_1$ and $\mu_2$ is defined by¹

$$W_2^2(\mu_1, \mu_2) := \inf_{T:\, T\#\mu_1 = \mu_2} \int \|x - T(x)\|^2 \, d\mu_1(x). \tag{2.3}$$

¹ In general, (2.3) is ill-defined; see [30]. The validity of (2.3) here is guaranteed by McCann's theorem [23] under Assumptions 2 and 3.
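As a quick numerical illustration of (2.2), the following sketch pushes a standard Gaussian forward through the arbitrarily chosen, strictly convex map $h(x) = e^x$, for which $\nabla h^*(y) = \log y$ and $\det \nabla^2 h^*(y) = 1/y$; the predicted dual density $e^{-W(y)} = \varphi(\log y)/y$ is the log-normal density, which the empirical histogram should match:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

# Sample X ~ mu = N(0, 1) and push it forward through grad h(x) = e^x.
X = rng.standard_normal(200_000)
Y = np.exp(X)                       # Y ~ nu = (grad h) # mu

# Density predicted by (2.2):
#   e^{-W(y)} = e^{-V(grad h*(y))} * det grad^2 h*(y) = phi(log y) / y,
# i.e., the log-normal density.
edges = np.linspace(0.01, 20.0, 400)
hist, _ = np.histogram(Y, bins=edges, density=True)
centers = 0.5 * (edges[1:] + edges[:-1])
predicted = norm.pdf(np.log(centers)) / centers

# Small discrepancy, up to sampling noise and binning.
print(f"max deviation from (2.2): {np.max(np.abs(hist - predicted)):.3f}")
```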
## 3 Mirrored Langevin Dynamics

This section demonstrates a framework for transforming constrained sampling problems into unconstrained ones. We then focus on applications to sampling from strongly log-concave distributions and simplex-constrained distributions, even though the framework itself is more general.

### 3.1 Motivation and Algorithm

We begin by briefly recalling the mirror descent (MD) algorithm for optimization. In order to minimize a function over a bounded domain, say $\min_{x \in \mathcal{X}} f(x)$, MD uses a mirror map $h$ to transform the primal variable $x$ into the dual space, $y := \nabla h(x)$, and then performs gradient updates in the dual: $y^+ = y - \beta \nabla f(x)$ for some step-size $\beta$. The mirror map $h$ is chosen to adapt to the geometry of the constraint $\mathcal{X}$, which can often lead to faster convergence [25] or, more pivotal to this work, an unconstrained optimization problem [2].

Inspired by the MD framework, we would like to use the mirror map idea to remove the constraint in sampling problems. Toward this end, we first establish a simple fact [30]:

Theorem 1. Let $h$ satisfy Assumption 1. Suppose that $X \sim \mu$ and $Y = \nabla h(X)$. Then $Y \sim \nu := \nabla h \# \mu$ and $\nabla h^*(Y) \sim \mu$.

Proof. For any Borel set $E$, we have $\nu(E) = P(Y \in E) = P\!\left(X \in (\nabla h)^{-1}(E)\right) = \mu\!\left((\nabla h)^{-1}(E)\right)$. Since $\nabla h$ is one-to-one, $Y = \nabla h(X)$ if and only if $X = (\nabla h)^{-1}(Y) = \nabla h^*(Y)$.

In the context of sampling, Theorem 1 suggests the following simple procedure: For any target distribution $e^{-V(x)}dx$ with support $\mathcal{X}$, we choose a mirror map $h$ on $\mathcal{X}$ satisfying Assumption 1, and we consider the dual distribution associated with $e^{-V(x)}dx$ and $h$:

$$e^{-W(y)}dy := \nabla h \# e^{-V(x)}dx. \tag{3.1}$$

Theorem 1 dictates that if we are able to draw a sample $Y$ from $e^{-W(y)}dy$, then $\nabla h^*(Y)$ immediately gives a sample from the desired distribution $e^{-V(x)}dx$. Furthermore, suppose for the moment that $\mathrm{dom}(h^*) = \mathbb{R}^d$, so that $e^{-W(y)}dy$ is unconstrained. Then we can simply exploit the classical Langevin Dynamics (1.1) to efficiently take samples from $e^{-W(y)}dy$.

The above reasoning leads us to set up the Mirrored Langevin Dynamics (MLD):

$$\text{MLD:} \qquad \begin{cases} dY_t = -(\nabla W \circ \nabla h)(X_t)\,dt + \sqrt{2}\,dB_t \\ X_t = \nabla h^*(Y_t). \end{cases} \tag{3.2}$$

Notice that the stationary distribution of $Y_t$ in MLD is $e^{-W(y)}dy$, since $(\nabla W \circ \nabla h)(X_t) = \nabla W(Y_t)$ and hence $dY_t$ is nothing but the Langevin Dynamics (1.1) with $V \leftarrow W$. As a result, we have $X_t \to X^\star \sim e^{-V(x)}dx$.

| Assumption | $D(\cdot\,\|\,\cdot)$ | $W_2$ | $d_{TV}$ | Algorithm |
|---|---|---|---|---|
| $LI \succeq \nabla^2 V \succeq mI$ | unknown | unknown | $O(\epsilon^{-6}d^5)$ | MYULA [4] |
| $LI \succeq \nabla^2 V \succeq 0$ | unknown | unknown | $O(\epsilon^{-12}d^{12})$ | PLMC [5] |
| $\nabla^2 V \succeq mI$ | $O(\epsilon^{-1}d)$ | $O(\epsilon^{-2}d)$ | $O(\epsilon^{-2}d)$ | MLD; this work |
| $LI \succeq \nabla^2 V \succeq mI$, $V$ unconstrained | $O(\epsilon^{-1}d)$ | $O(\epsilon^{-2}d)$ | $O(\epsilon^{-2}d)$ | Langevin Dynamics [8, 11, 14] |

Table 1: Convergence rates (iteration complexity for achieving $\epsilon$ accuracy in the respective metric) for sampling from $e^{-V(x)}dx$ with $\mathrm{dom}(V)$ bounded.

Using (2.1), we can equivalently write the $dY_t$ term in (3.2) as

$$dY_t = -\nabla^2 h(X_t)^{-1}\left(\nabla V(X_t) + \nabla \log\det \nabla^2 h(X_t)\right)dt + \sqrt{2}\,dB_t.$$

In order to arrive at a practical algorithm, we then discretize the MLD, giving rise to the following equivalent iterations:

$$y_{t+1} = y_t - \beta_t \nabla W(y_t) + \sqrt{2\beta_t}\,\xi_t \quad \text{or} \quad y_{t+1} = y_t - \beta_t \nabla^2 h(x_t)^{-1}\left(\nabla V(x_t) + \nabla\log\det\nabla^2 h(x_t)\right) + \sqrt{2\beta_t}\,\xi_t, \tag{3.3}$$

where in both cases $x_{t+1} = \nabla h^*(y_{t+1})$, the $\xi_t$'s are i.i.d. standard Gaussian vectors, and the $\beta_t$'s are step-sizes. The first formulation in (3.3) is useful when $W$ has a tractable form, while the second one can be computed using solely the information of $V$ and $h$.

Next, we turn to the convergence of the discretized MLD. Since $dY_t$ in (3.2) is the classical Langevin Dynamics, and since we have assumed that $W$ is unconstrained, it is typically not difficult to prove the convergence of $y_t$ to $Y^\star \sim e^{-W(y)}dy$.
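For concreteness, here is a minimal sketch of the first formulation in (3.3), assuming the dual potential gradient `grad_W` and the inverse mirror map `grad_h_star` are supplied as callables (both names are ours; Section 3.2.2 instantiates them for the simplex):

```python
import numpy as np

def mld(grad_W, grad_h_star, y0, betas, rng=np.random.default_rng(0)):
    """Discretized Mirrored Langevin Dynamics, first formulation in (3.3).

    grad_W:      gradient of the (unconstrained) dual potential W
    grad_h_star: inverse mirror map, x = grad h*(y)
    y0:          initial dual point
    betas:       iterable of step sizes beta_t
    """
    y = np.array(y0, dtype=float)
    for beta in betas:
        xi = rng.standard_normal(y.shape)
        # Plain Langevin step in the unconstrained dual space, cf. (3.3).
        y = y - beta * grad_W(y) + np.sqrt(2.0 * beta) * xi
    # Map the final dual iterate back to the primal, constrained space.
    return grad_h_star(y)
```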
However, what we ultimately care about is the guarantee on the primal distribution $e^{-V(x)}dx$. The purpose of the next theorem is to fill the gap between primal and dual convergence. We consider the three most common metrics for evaluating approximate sampling schemes, namely the 2-Wasserstein distance $W_2$, the total variation $d_{TV}$, and the relative entropy $D(\cdot\,\|\,\cdot)$.

Theorem 2 (Convergence in $y_t$ implies convergence in $x_t$). For any $h$ satisfying Assumption 1, we have $d_{TV}(\nabla h \# \mu_1, \nabla h \# \mu_2) = d_{TV}(\mu_1, \mu_2)$ and $D(\nabla h \# \mu_1 \,\|\, \nabla h \# \mu_2) = D(\mu_1 \,\|\, \mu_2)$. In particular, we have $d_{TV}(y_t, Y^\star) = d_{TV}(x_t, X^\star)$ and $D(y_t \,\|\, Y^\star) = D(x_t \,\|\, X^\star)$ in (3.3). If, furthermore, $h$ is $\rho$-strongly convex ($\nabla^2 h \succeq \rho I$), then $W_2(x_t, X^\star) \le \frac{1}{\rho} W_2(y_t, Y^\star)$.

Proof. See Appendix A.

### 3.2 Applications to Sampling from Constrained Distributions

We now consider applications of MLD. For strongly log-concave distributions with a general constraint, we prove rates matching those of the unconstrained case; see Section 3.2.1. In Section 3.2.2, we consider the important case where the constraint is a probability simplex (more examples of mirror maps can be found in Appendix B).

#### 3.2.1 Sampling from a strongly log-concave distribution with constraint

As alluded to in the introduction, the existing convergence rates for constrained distributions are significantly worse than their unconstrained counterparts; see Table 1 for a comparison. The main result of this subsection is the existence of a good mirror map for an arbitrary constraint, with which the dual distribution $e^{-W(y)}dy$ becomes unconstrained:

Theorem 3 (Existence of a good mirror map for MLD). Let $d\mu(x) = e^{-V(x)}dx$ be a probability measure with bounded convex support such that $V \in C^2$, $\nabla^2 V \succeq mI \succ 0$, and $V$ is bounded away from $+\infty$ in the interior of the support. Then there exists a mirror map $h \in C^2$ such that the discretized MLD (3.3) yields

$$D(x_T \,\|\, X^\star) = O\!\left(\frac{d}{T}\right), \quad W_2(x_T, X^\star) = O\!\left(\sqrt{\frac{d}{T}}\right), \quad d_{TV}(x_T, X^\star) = O\!\left(\sqrt{\frac{d}{T}}\right).$$

Proof. See Appendix C.

Remark 1. We remark that Theorem 3 is only an existence result, not an actual algorithm. Practical algorithms are considered in the next subsection.

#### 3.2.2 Sampling Algorithms on the Simplex

We apply the discretized MLD (3.3) to the task of sampling from distributions on the probability simplex $\Delta_d := \{x \in \mathbb{R}^d \mid \sum_{i=1}^{d} x_i \le 1,\ x_i \ge 0\}$, which is instrumental in many fields of machine learning and statistics. On a simplex, the most natural choice of $h$ is the entropic mirror map [2], which is well-known to be 1-strongly convex:

$$h(x) = \sum_{\ell=1}^{d+1} x_\ell \log x_\ell, \quad \text{where } x_{d+1} := 1 - \sum_{\ell=1}^{d} x_\ell \text{ and } 0 \log 0 := 0. \tag{3.4}$$

In this case, the associated dual distribution can be computed explicitly.

Lemma 1 (Sampling on a simplex with entropic mirror map). Let $e^{-V(x)}dx$ be the target distribution on $\Delta_d$, $h$ the entropic mirror map (3.4), and $e^{-W(y)}dy := \nabla h \# e^{-V(x)}dx$. Then the potential $W$ of the push-forward measure admits the expression

$$W(y) = V(\nabla h^*(y)) - \sum_{\ell=1}^{d} y_\ell + (d+1)\,h^*(y), \tag{3.5}$$

where $h^*(y) = \log\!\left(1 + \sum_{\ell=1}^{d} e^{y_\ell}\right)$ is the Fenchel dual of $h$, which is strictly convex with 1-Lipschitz gradient.

Proof. See Appendix D.

Crucially, we have $\mathrm{dom}(h^*) = \mathbb{R}^d$, so that the Langevin Dynamics for $e^{-W(y)}dy$ is unconstrained.
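A minimal sketch of the entropic mirror map (3.4) and its Fenchel dual from Lemma 1 (the function names are ours); a short calculation gives $(\nabla h(x))_\ell = \log(x_\ell / x_{d+1})$ and $(\nabla h^*(y))_\ell = e^{y_\ell} / (1 + \sum_k e^{y_k})$:

```python
import numpy as np
from scipy.special import logsumexp, softmax

def grad_h(x):
    """Entropic mirror map (3.4): (grad h(x))_l = log(x_l / x_{d+1})."""
    x_last = 1.0 - np.sum(x)          # x_{d+1} := 1 - sum_l x_l
    return np.log(x / x_last)

def h_star(y):
    """Fenchel dual h*(y) = log(1 + sum_l exp(y_l)), computed stably."""
    return logsumexp(np.append(y, 0.0))

def grad_h_star(y):
    """Inverse mirror map: (grad h*(y))_l = exp(y_l) / (1 + sum_k exp(y_k))."""
    return softmax(np.append(y, 0.0))[:-1]

# Round-trip check: grad h* inverts grad h on the interior of the simplex.
x = np.array([0.2, 0.3, 0.1])                  # a point in Delta_3 (x_4 = 0.4)
assert np.allclose(grad_h_star(grad_h(x)), x)
```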
Based on Lemma 1, we now present the surprising case of the non-log-concave Dirichlet posteriors, a distribution of central importance in topic modeling [3], for which the dual distribution $e^{-W(y)}dy$ becomes strictly log-concave.

Example 1 (Dirichlet Posteriors). Given parameters $\alpha_1, \alpha_2, \ldots, \alpha_{d+1} > 0$ and observations $n_1, n_2, \ldots, n_{d+1}$, where $n_\ell$ is the number of appearances of category $\ell$, the probability density function of the Dirichlet posterior is

$$p(x) = \frac{1}{C} \prod_{\ell=1}^{d+1} x_\ell^{\,n_\ell + \alpha_\ell - 1}, \quad x \in \mathrm{int}(\Delta_d), \tag{3.6}$$

where $C$ is a normalizing constant and $x_{d+1} := 1 - \sum_{\ell=1}^{d} x_\ell$. The corresponding $V$ is

$$V(x) = -\log p(x) = \log C - \sum_{\ell=1}^{d+1} (n_\ell + \alpha_\ell - 1) \log x_\ell, \quad x \in \mathrm{int}(\Delta_d).$$

The interesting regime of the Dirichlet posterior is when it is sparse, meaning that the majority of the $n_\ell$'s are zero and a few $n_k$'s are large, say of order $O(d)$. It is also common to set $\alpha_\ell < 1$ for all $\ell$ in practice. Evidently, $V$ is neither convex nor concave in this case, and no existing non-asymptotic rate applies. However, plugging $V$ into (3.5) gives

$$W(y) = \log C - \sum_{\ell=1}^{d} (n_\ell + \alpha_\ell)\, y_\ell + \left(\sum_{\ell=1}^{d+1} (n_\ell + \alpha_\ell)\right) h^*(y), \tag{3.7}$$

which, magically, becomes strictly convex with $O(d)$-Lipschitz gradient no matter what the observations and parameters are! In view of Theorem 2 and [14, Corollary 7], one can then apply (3.3) to obtain an $O(\epsilon^{-2}d^2 R_0)$ convergence in relative entropy, where $R_0 := W_2^2(y_0, e^{-W(y)}dy)$ is the initial Wasserstein distance to the target.

## 4 Stochastic Mirrored Langevin Dynamics

We have thus far considered only deterministic methods based on exact gradients. In practice, however, evaluating gradients typically involves one pass over the full data, which can be time-consuming in large-scale applications. In this section, we turn our attention to the mini-batch setting, where one can use a small subset of the data to form stochastic gradients. Toward this end, we assume:

Assumption 4 (Primal Decomposability). The target distribution $e^{-V(x)}dx$ admits a decomposable structure $V = \sum_{i=1}^{N} V_i$ for some functions $V_i$.

Consider the following common scheme for obtaining stochastic gradients. Given a batch-size $b$, we randomly pick a mini-batch $B$ from $\{1, 2, \ldots, N\}$ with $|B| = b$, and form an unbiased estimate of $\nabla V$ by computing

$$\nabla \widetilde{V} := \frac{N}{b} \sum_{i \in B} \nabla V_i. \tag{4.1}$$

The following lemma asserts that exactly the same procedure can be carried out in the dual.

Lemma 2. Assume that $h$ is 1-strongly convex. For $i = 1, 2, \ldots, N$, let $W_i$ be such that

$$e^{-N W_i} = \nabla h \# \left( \frac{e^{-N V_i}}{\int e^{-N V_i}} \right). \tag{4.2}$$

Define $W := \sum_{i=1}^{N} W_i$ and $\nabla \widetilde{W} := \frac{N}{b} \sum_{i \in B} \nabla W_i$, where $B$ is chosen as in (4.1). Then:

1. Primal decomposability implies dual decomposability: there is a constant $C$ such that $e^{-(W + C)} = \nabla h \# e^{-V}$.
2. For each $i$, the gradient $\nabla W_i$ depends only on $V_i$ and the mirror map $h$.
3. The gradient estimate is unbiased: $\mathbb{E}\, \nabla \widetilde{W} = \nabla W$.
4. The dual stochastic gradient is more accurate: $\mathbb{E}\, \|\nabla \widetilde{W} - \nabla W\|^2 \le \mathbb{E}\, \|\nabla \widetilde{V} - \nabla V\|^2$.

Proof. See Appendix E.

Lemma 2 furnishes a template for the mini-batch extension of MLD. The pseudocode is detailed in Algorithm 1, whose convergence rate is given by the next theorem.

Algorithm 1 Stochastic Mirrored Langevin Dynamics (SMLD)
Require: Target distribution $e^{-V(x)}dx$ with $V = \sum_{i=1}^{N} V_i$, step-sizes $\beta_t$, batch-size $b$
1: Find $W_i$ such that $e^{-N W_i} \propto \nabla h \# e^{-N V_i}$ for all $i$.
2: for $t = 0, 1, \ldots, T-1$ do
3:   Pick a mini-batch $B$ of size $b$ uniformly at random.
4:   Update $y_{t+1} = y_t - \beta_t \frac{N}{b} \sum_{i \in B} \nabla W_i(y_t) + \sqrt{2\beta_t}\,\xi_t$
5:   $x_{t+1} = \nabla h^*(y_{t+1})$  ▷ Update only when necessary.
6: end for
7: return $x_T$
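A direct Python rendering of Algorithm 1, under the assumption that the per-component dual gradients $\nabla W_i$ and the inverse mirror map $\nabla h^*$ are supplied as callables (the names are ours):

```python
import numpy as np

def smld(grad_W_i, grad_h_star, y0, N, b, betas, rng=np.random.default_rng(0)):
    """Stochastic Mirrored Langevin Dynamics (Algorithm 1).

    grad_W_i:    callable (y, i) -> gradient of the i-th dual component W_i
    grad_h_star: inverse mirror map, x = grad h*(y)
    N:           number of components in V = sum_i V_i
    b:           mini-batch size
    betas:       iterable of step sizes beta_t
    """
    y = np.array(y0, dtype=float)
    for beta in betas:
        batch = rng.choice(N, size=b, replace=False)      # mini-batch B, |B| = b
        # Unbiased dual gradient estimate (N / b) * sum_{i in B} grad W_i(y).
        grad = (N / b) * sum(grad_W_i(y, i) for i in batch)
        y = y - beta * grad + np.sqrt(2.0 * beta) * rng.standard_normal(y.shape)
    return grad_h_star(y)   # map the final dual iterate back to the primal
```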
Theorem 4. Let $e^{-V(x)}dx$ be a distribution satisfying Assumption 4, and $h$ a 1-strongly convex mirror map. Let $\sigma^2 := \mathbb{E}\,\|\nabla \widetilde{V} - \nabla V\|^2$ be the variance of the stochastic gradient of $V$ in (4.1). Suppose that the corresponding dual distribution $e^{-W(y)}dy = \nabla h \# e^{-V(x)}dx$ satisfies $LI \succeq \nabla^2 W \succeq 0$. Then, applying SMLD with constant step-size $\beta_t = \beta$ yields³

$$D\!\left(x_T \,\|\, e^{-V(x)}dx\right) \le \sqrt{\frac{2\, W_2^2\!\left(y_0, e^{-W(y)}dy\right) (Ld + \sigma^2)}{T}} = O\!\left(\sqrt{\frac{Ld + \sigma^2}{T}}\right), \tag{4.3}$$

provided that $\beta \le \min\left\{ \sqrt{\dfrac{2\, W_2^2\!\left(y_0, e^{-W(y)}dy\right)}{T (Ld + \sigma^2)}},\ \dfrac{1}{L} \right\}$.

³ Our guarantee is in fact for a randomly chosen iterate from $\{x_1, x_2, \ldots, x_T\}$, instead of the final iterate $x_T$. In practice, we observe that the final iterate always gives the best performance, and we ignore this minor difference in the theorem statement.

Proof. See Appendix F.

Example 2 (SMLD for Dirichlet Posteriors). For the case of Dirichlet posteriors, we have seen in (3.7) that the corresponding dual distribution satisfies $(N + \Gamma)I \succeq \nabla^2 W \succ 0$, where $N := \sum_{\ell=1}^{d+1} n_\ell$ and $\Gamma := \sum_{\ell=1}^{d+1} \alpha_\ell$. Furthermore, it is easy to see that the stochastic gradient $\nabla \widetilde{W}$ can be computed efficiently (see Appendix G):

$$\left(\frac{N}{b} \sum_{i \in B} \nabla W_i(y)\right)_{\!\ell} = -\frac{N m_\ell}{b} - \alpha_\ell + (N + \Gamma)\, \frac{e^{y_\ell}}{1 + \sum_{k=1}^{d} e^{y_k}}, \tag{4.4}$$

where $m_\ell$ is the number of observations of category $\ell$ in the mini-batch $B$. As a result, Theorem 4 states that SMLD achieves

$$D\!\left(x_T \,\|\, e^{-V(x)}dx\right) \le \sqrt{\frac{2\, W_2^2\!\left(y_0, e^{-W(y)}dy\right) \left((N + \Gamma)(d+1) + \sigma^2\right)}{T}} = O\!\left(\sqrt{\frac{(N + \Gamma)d + \sigma^2}{T}}\right)$$

with a constant step-size.

## 5 Experiments

We conduct experiments with a two-fold purpose. First, we use low-dimensional synthetic data, where we can evaluate the total variation error by comparing histograms, to verify the convergence rates in our theory. Second, we demonstrate that SMLD, modulo a necessary modification for resolving numerical issues, outperforms state-of-the-art first-order methods on the Latent Dirichlet Allocation (LDA) application with the Wikipedia corpus.

### 5.1 Synthetic Experiment for Dirichlet Posterior

We implement the deterministic MLD for sampling from an 11-dimensional Dirichlet posterior (3.6) with $n_1 = 10{,}000$, $n_2 = n_3 = 10$, and $n_4 = n_5 = \cdots = n_{11} = 0$, which aims to capture the sparse nature of real observations in topic modeling. We set $\alpha_\ell = 0.1$ for all $\ell$. As a baseline comparison, we include the Stochastic Gradient Riemannian Langevin Dynamics (SGRLD) [26] with the expanded-mean parametrization. SGRLD is a tailor-made first-order scheme for simplex constraints, and it remains one of the state-of-the-art algorithms for LDA. For a fair comparison, we use deterministic gradients for SGRLD. We perform a grid search over the constant step-size for both algorithms, and we keep the best three for MLD and SGRLD. For each iteration, we build an empirical distribution by running 2,000,000 independent trials, and we compute its total variation with respect to the histogram generated by the true distribution.

Figure 1(a) reports the total variation error along the first dimension, where we can see that MLD outperforms SGRLD by a substantial margin. As dictated by our theory, all the MLD curves decay at the $O(T^{-1/2})$ rate until they saturate at the discretization error level. In contrast, SGRLD lacks non-asymptotic guarantees, and there is no clear convergence rate we can infer from Figure 1(a). The improvements along all other dimensions (i.e., topics with fewer observations) are even more significant; see Appendix H.1.

### 5.2 Latent Dirichlet Allocation with Wikipedia Corpus

An influential framework for topic modeling is the Latent Dirichlet Allocation (LDA) [3], which, given a text collection, requires inferring the posterior word distributions without knowing the exact topic of each word. The full model description is standard but somewhat convoluted; we refer to the classic [3] for details. Each topic $k$ in LDA determines a word distribution $\pi_k$; suppose there are in total $K$ topics and $W + 1$ words. The variable of interest is therefore $\pi := (\pi_1, \pi_2, \ldots, \pi_K) \in \Delta_W \times \cdots \times \Delta_W$. Since this domain is a Cartesian product of simplices, we propose to use $\bar{h}(\pi) := \sum_{k=1}^{K} h(\pi_k)$, where $h$ is the entropic mirror map (3.4), for SMLD; see the sketch below. It is easy to see that all of our computations for Dirichlet posteriors generalize to this setting.
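As an illustration, a minimal sketch of the inverse map for this product construction, reusing the per-simplex maps from Section 3.2.2 (the names are ours; each row of the $K \times W$ dual matrix is treated independently):

```python
import numpy as np
from scipy.special import softmax

def grad_h_star_product(Y):
    """Inverse of the product entropic mirror map h_bar(pi) = sum_k h(pi_k).

    Y: (K, W) array of dual variables, one row per topic.
    Returns a (K, W) array whose rows lie in the interior of Delta_W.
    """
    # Append the implicit 0 dual coordinate per row, apply the per-simplex
    # inverse map (a softmax), and drop the last coordinate again.
    Y_ext = np.concatenate([Y, np.zeros((Y.shape[0], 1))], axis=1)
    return softmax(Y_ext, axis=1)[:, :-1]

# Example: K = 3 topics over a vocabulary of W + 1 words with W = 4.
Y = np.random.default_rng(0).standard_normal((3, 4))
Pi = grad_h_star_product(Y)
assert np.all(Pi > 0) and np.all(Pi.sum(axis=1) < 1)  # rows in int(Delta_4)
```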
Figure 1: (a) Synthetic data, first dimension. (b) LDA on Wikipedia corpus.

#### 5.2.1 Experimental Setup

We implement the SMLD for LDA on the Wikipedia corpus with 100,000 documents, and we compare the performance against SGRLD [26]. In order to keep the comparison fair, we adopt exactly the same setting as in [26], including the model parameters, the batch-size, the Gibbs sampler steps, etc. See Sections 4 and 5 in [26] for omitted details. Another state-of-the-art first-order algorithm for LDA is the SGRHMC in [21], whose implementation we skip, as we do not know how the $\hat{B}_t$ was chosen in [21]. Instead, we repeat the same experimental setting as [21] and directly compare our results against the ones reported in [21]. See Appendix H.2 for the comparison against SGRHMC.

#### 5.2.2 A Numerical Trick and the SMLD-approximate Algorithm

A major drawback of the SMLD in practice is that the stochastic gradients (4.4) involve exponential functions, which are unstable for large-scale problems. For instance, in Python, `np.exp(800)` evaluates to `inf`, whereas the relevant variable regime in this experiment extends to 1600. To resolve such numerical issues, we appeal to the linear approximation⁴ $\exp(y) \approx \max\{0, 1 + y\}$. Admittedly, our theory no longer holds under such numerical tricks, and we shall not claim that our algorithm is provably convergent for LDA. Instead, the contribution of MLD here is to identify the dual dynamics associated with (3.7), which would otherwise have been difficult to perceive. We name the resulting algorithm SMLD-approximate to indicate its heuristic nature.

⁴ One can also use a higher-order Taylor approximation of $\exp(y)$, or add a small threshold, $\exp(y) \approx \max\{\epsilon, 1 + y\}$, to prevent the iterates from going to the boundary. In practice, we observe that these variants do not make a huge impact on the performance.

#### 5.2.3 Results

Figure 1(b) reports the perplexity on the test data up to 100,000 documents, with the five best step-sizes we found via grid search for SMLD-approximate. For SGRLD, we use the best step-sizes reported in [26]. From the figure, we can see a clear improvement of SMLD-approximate over SGRLD, both in terms of convergence speed and saturation level. One plausible explanation for this phenomenon is that our MLD, as a simple unconstrained Langevin Dynamics, is less sensitive to discretization. The underlying dynamics of SGRLD, on the other hand, is a more sophisticated Riemannian diffusion, which requires a finer discretization than MLD to achieve the same level of approximation to the original continuous-time dynamics, and this is true even in the presence of noisy gradients and our numerical heuristics.

## Acknowledgments

This project has received funding from the European Research Council (ERC) under the European Union's Horizon 2020 research and innovation programme (grant agreement no. 725594 - time-data).

## References

[1] Sungjin Ahn, Anoop Korattikara, and Max Welling. Bayesian posterior sampling via stochastic gradient Fisher scoring. In Proceedings of the 29th International Conference on Machine Learning, pages 1771-1778, 2012.

[2] Amir Beck and Marc Teboulle. Mirror descent and nonlinear projected subgradient methods for convex optimization. Operations Research Letters, 31(3):167-175, 2003.
[3] David M. Blei, Andrew Y. Ng, and Michael I. Jordan. Latent Dirichlet allocation. Journal of Machine Learning Research, 3(Jan):993-1022, 2003.

[4] Nicolas Brosse, Alain Durmus, Éric Moulines, and Marcelo Pereyra. Sampling from a log-concave distribution with compact support with proximal Langevin Monte Carlo. In Proceedings of the 2017 Conference on Learning Theory, volume 65 of Proceedings of Machine Learning Research, pages 319-342. PMLR, 07-10 Jul 2017.

[5] Sébastien Bubeck, Ronen Eldan, and Joseph Lehec. Sampling from a log-concave distribution with projected Langevin Monte Carlo. arXiv preprint arXiv:1507.02564, 2015.

[6] Changyou Chen, Nan Ding, and Lawrence Carin. On the convergence of stochastic gradient MCMC algorithms with high-order integrators. In Advances in Neural Information Processing Systems, pages 2278-2286, 2015.

[7] Tianqi Chen, Emily Fox, and Carlos Guestrin. Stochastic gradient Hamiltonian Monte Carlo. In International Conference on Machine Learning, pages 1683-1691, 2014.

[8] Xiang Cheng and Peter Bartlett. Convergence of Langevin MCMC in KL-divergence. In Proceedings of Algorithmic Learning Theory, volume 83 of Proceedings of Machine Learning Research, pages 186-211. PMLR, 07-09 Apr 2018.

[9] Xiang Cheng, Niladri S. Chatterji, Peter L. Bartlett, and Michael I. Jordan. Underdamped Langevin MCMC: A non-asymptotic analysis. arXiv preprint arXiv:1707.03663, 2017.

[10] Bo Dai, Niao He, Hanjun Dai, and Le Song. Provable Bayesian inference via particle mirror descent. In Artificial Intelligence and Statistics, pages 985-994, 2016.

[11] Arnak S. Dalalyan. Theoretical guarantees for approximate sampling from smooth and log-concave densities. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 79(3):651-676, 2017.

[12] Arnak S. Dalalyan and Avetik G. Karagulyan. User-friendly guarantees for the Langevin Monte Carlo with inaccurate gradient. arXiv preprint arXiv:1710.00095, 2017.

[13] Nan Ding, Youhan Fang, Ryan Babbush, Changyou Chen, Robert D. Skeel, and Hartmut Neven. Bayesian sampling using stochastic gradient thermostats. In Advances in Neural Information Processing Systems, pages 3203-3211, 2014.

[14] Alain Durmus, Szymon Majewski, and Błażej Miasojedow. Analysis of Langevin Monte Carlo via convex optimization. arXiv preprint arXiv:1802.09188, 2018.

[15] Alain Durmus, Umut Simsekli, Éric Moulines, Roland Badeau, and Gaël Richard. Stochastic gradient Richardson-Romberg Markov chain Monte Carlo. In Advances in Neural Information Processing Systems, pages 2047-2055, 2016.

[16] Raaz Dwivedi, Yuansi Chen, Martin J. Wainwright, and Bin Yu. Log-concave sampling: Metropolis-Hastings algorithms are fast! arXiv preprint arXiv:1801.02309, 2018.

[17] Walid Krichene and Peter L. Bartlett. Acceleration and averaging in stochastic descent dynamics. In Advances in Neural Information Processing Systems, pages 6799-6809, 2017.

[18] Shiwei Lan and Babak Shahbaba. Sampling constrained probability distributions using spherical augmentation. In Algorithmic Advances in Riemannian Geometry and Applications, pages 25-71. Springer, 2016.

[19] Chang Liu, Jun Zhu, and Yang Song. Stochastic gradient geodesic MCMC methods. In Advances in Neural Information Processing Systems, pages 3009-3017, 2016.

[20] Tung Luu, Jalal Fadili, and Christophe Chesneau. Sampling from non-smooth distribution through Langevin diffusion. 2017.

[21] Yi-An Ma, Tianqi Chen, and Emily Fox. A complete recipe for stochastic gradient MCMC.
In Advances in Neural Information Processing Systems, pages 2917-2925, 2015.

[22] Benoit B. Mandelbrot. The Fractal Geometry of Nature, volume 173. W. H. Freeman, New York, 1983.

[23] Robert J. McCann. Existence and uniqueness of monotone measure-preserving maps. Duke Mathematical Journal, 80(2):309-324, 1995.

[24] Panayotis Mertikopoulos and Mathias Staudigl. On the convergence of gradient-like flows with noisy gradient input. SIAM Journal on Optimization, 28(1):163-197, 2018.

[25] A. S. Nemirovsky and D. B. Yudin. Problem Complexity and Method Efficiency in Optimization. 1983.

[26] Sam Patterson and Yee Whye Teh. Stochastic gradient Riemannian Langevin dynamics on the probability simplex. In Advances in Neural Information Processing Systems, pages 3102-3110, 2013.

[27] Maxim Raginsky and Jake Bouvrie. Continuous-time stochastic mirror descent on a network: Variance reduction, consensus, convergence. In Decision and Control (CDC), 2012 IEEE 51st Annual Conference on, pages 6793-6800. IEEE, 2012.

[28] Ralph Tyrrell Rockafellar. Convex Analysis. Princeton University Press, 1970.

[29] Umut Simsekli, Roland Badeau, Taylan Cemgil, and Gaël Richard. Stochastic quasi-Newton Langevin Monte Carlo. In International Conference on Machine Learning, pages 642-651, 2016.

[30] Cédric Villani. Topics in Optimal Transportation. Number 58. American Mathematical Society, 2003.

[31] Cédric Villani. Optimal Transport: Old and New, volume 338. Springer Science & Business Media, 2008.

[32] Max Welling and Yee W. Teh. Bayesian learning via stochastic gradient Langevin dynamics. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pages 681-688, 2011.

[33] Pan Xu, Tianhao Wang, and Quanquan Gu. Accelerated stochastic mirror descent: From continuous-time dynamics to discrete-time algorithms. In International Conference on Artificial Intelligence and Statistics, pages 1087-1096, 2018.