# Stochastic Gradient Geodesic MCMC Methods

Chang Liu, Jun Zhu, Yang Song
Dept. of Comp. Sci. & Tech., TNList Lab; Center for Bio-Inspired Computing Research; State Key Lab for Intell. Tech. & Systems, Tsinghua University, Beijing, China
Dept. of Physics, Tsinghua University, Beijing, China
{chang-li14@mails, dcszj@}.tsinghua.edu.cn; songyang@stanford.edu

We propose two stochastic gradient MCMC methods for sampling from Bayesian posterior distributions defined on Riemann manifolds with a known geodesic flow, e.g. hyperspheres. Our methods are the first scalable sampling methods on these manifolds, with the aid of stochastic gradients. Novel dynamics are conceived and 2nd-order integrators are developed. By adopting embedding techniques and the geodesic integrator, the methods do not require a global coordinate system of the manifold and do not involve inner iterations. Synthetic experiments show the validity of the methods, and their application to the challenging inference for spherical topic models indicates practical usability and efficiency.

## 1 Introduction

Dynamics-based Markov Chain Monte Carlo methods (D-MCMCs) are sampling methods that use dynamics simulation for state transition in a Markov chain. They have become a workhorse for Bayesian inference, with well-known examples like Hamiltonian Monte Carlo (HMC) [22] and stochastic gradient Langevin dynamics (SGLD) [29]. Here we consider variants for sampling from distributions defined on Riemann manifolds. Overall, geodesic Monte Carlo (GMC) [7] stands out for its notable performance on manifolds with known geodesic flow, such as the simplex, the hypersphere and the Stiefel manifold [26, 16]. Its applicability to manifolds with no global coordinate system (e.g. hyperspheres) is enabled by the embedding technique, and its geodesic integrator eliminates inner iterations (iterations within one step of dynamics simulation) to ensure efficiency. It is also used for efficient sampling from constrained distributions [17]. Constrained HMC (CHMC) [6] aims at manifolds defined by a constraint in some $\mathbb{R}^n$. It covers all common manifolds, but inner iterations make it less appealing. Other D-MCMCs involving Riemann manifolds, e.g. Riemann manifold Langevin dynamics (RMLD) and Riemann manifold HMC (RMHMC) [13], are invented for better performance but still on the task of sampling in Euclidean space, where the target variable is treated as the global coordinates of some distribution manifold. Although they can be used to sample on non-Euclidean Riemann manifolds by replacing the distribution manifold with the target manifold, a global coordinate system of the target manifold is required. Moreover, RMHMC suffers from expensive inner iterations.

However, GMC scales undesirably to large datasets, which are becoming common. An effective strategy to scale up D-MCMCs is to randomly sample a data subset to estimate a noisy but unbiased stochastic gradient, as in stochastic gradient MCMC methods (SG-MCMCs). Welling et al. [29] pioneered this direction by developing stochastic gradient Langevin dynamics (SGLD). Chen et al. [9] apply the idea to HMC with stochastic gradient HMC (SGHMC), where a non-trivial dynamics with friction has to be conceived. Ding et al. [10] propose stochastic gradient Nosé-Hoover thermostats (SGNHT) to automatically adapt the friction to the noise by a thermostat. To unify the dynamics used for SG-MCMCs, Ma et al. [19] develop a complete recipe to formulate them.
(JZ is the corresponding author; YS is with the Department of Computer Science, Stanford University, CA.)

30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.

Table 1: A summary of some D-MCMCs. *: sampling on manifold not supported; †: the integrator is not in the SSI scheme (it is unclear whether the claimed 2nd order is equivalent to ours); ‡: 2nd-order integrators for SGHMC and mSGNHT are developed by [8] and [18], respectively.

| Method | Stochastic gradient | No inner iteration | No global coordinates | Order of integrator |
|---|---|---|---|---|
| GMC [7] | no | yes | yes | 2nd† |
| RMLD [13] | no | yes | no | 1st |
| RMHMC [13] | no | no | no | 2nd† |
| CHMC [6] | no | no | yes | 2nd† |
| SGLD [29] | yes | yes | no* | 1st |
| SGHMC [9] / SGNHT [10] | yes | yes | no* | 1st‡ |
| SGRLD [23] / SGRHMC [19] | yes | yes | no | 1st |
| SGGMC / gSGNHT (proposed) | yes | yes | yes | 2nd |

In this paper, we present two SG-MCMCs for manifolds with known geodesic flow: stochastic gradient geodesic Monte Carlo (SGGMC) and geodesic stochastic gradient Nosé-Hoover thermostats (gSGNHT). They are the first scalable sampling methods on manifolds with known geodesic flow and no global coordinate systems. We use the recipe of [19] to tackle the non-trivial task of conceiving the dynamics. Our novel dynamics are also suitable for developing 2nd-order integrators by adopting the symmetric splitting integrator (SSI) scheme [8]. A key property of a $K$th-order integrator is that the bias of the expected sample average at iteration $L$ can be upper bounded by $L^{-K/(K+1)}$ and the mean squared error by $L^{-2K/(2K+1)}$ [8], so a higher-order integrator basically performs better. Our integrators also incorporate the geodesic integrator to avoid inner iterations. Our methods can also be used to scalably sample from constrained distributions [17], like GMC.

There exist other SG-MCMCs on Riemann manifolds, e.g. SGRLD [23] and SGRHMC [19], the stochastic gradient versions of RMLD and RMHMC respectively. But they also require the Riemann manifold to have a global coordinate system, like their original versions as mentioned above, so basically they cannot draw samples from hyperspheres, while our methods can. Technically, SGRLD/SGRHMC (and RMLD/RMHMC) sample in the coordinate space, so a global one is needed for them to be valid. The explicit use of the Riemann metric tensor also makes these methods more difficult to implement. Our methods (and GMC) sample in the isometrically embedded space, where the whole manifold is represented and the Riemann metric tensor is implicitly embodied by the isometric embedding. Moreover, our integrators are of a higher order. Tab. 1 summarizes the key properties of the aforementioned D-MCMCs, where our advantages are clearly shown.

Finally, we apply our samplers to perform inference for the spherical admixture model (SAM) [24]. SAM defines a hierarchical generative process to describe data that are expressed as unit vectors (i.e., elements on the hypersphere). The task of posterior inference is to identify a set of latent topics, which are also unit vectors. This task is highly challenging due to a non-conjugate structure and the strict manifold constraints. None of the existing MCMC methods is both applicable to the task and scalable. We demonstrate that our methods are the most efficient methods for learning SAM on large datasets, with a good performance in terms of test-data perplexity.

## 2 Preliminaries

We briefly review the basics of SG-MCMCs. Consider a Bayesian model with latent variable $q$, prior $\pi_0(q)$ and likelihood $\pi(x|q)$.
Given a dataset $\mathcal{D} = \{x_d\}_{d=1}^{D}$, sampling from the posterior $\pi(q|\mathcal{D})$ by D-MCMCs requires computing the gradient of the potential energy $U(q) \triangleq -\log \pi(q|\mathcal{D}) = -\log \pi_0(q) - \sum_{d=1}^{D} \log \pi(x_d|q)$, which is linear in the data size $D$ and thus not scalable. SG-MCMCs address this challenge by randomly drawing a subset $S$ of $\mathcal{D}$ to build the stochastic gradient $\tilde{\nabla}_q U(q) \triangleq -\nabla_q \log \pi_0(q) - \frac{D}{|S|} \sum_{x \in S} \nabla_q \log \pi(x|q)$, a noisy but unbiased estimate. Under the i.i.d. assumption on $\mathcal{D}$, the central limit theorem holds: in the sense of convergence in distribution for large $D$,
$$\tilde{\nabla}_q U(q) = \nabla_q U(q) + \mathcal{N}(0, V(q)), \quad (1)$$
where we use $\mathcal{N}(\cdot, \cdot)$ to denote a Gaussian random variable and $V(q)$ is some covariance matrix.

The gradient noise imposes challenging restrictions on the SG-MCMC dynamics. Ma et al. [19] provide a recipe to construct correct dynamics. It states that for a random variable $z$, given a Hamiltonian $H(z)$, a skew-symmetric matrix (curl matrix) $Q(z)$ and a positive semi-definite matrix (diffusion matrix) $D(z)$, the dynamics defined by the following stochastic differential equation (SDE)
$$\mathrm{d}z = f(z)\,\mathrm{d}t + \sqrt{2 D(z)}\,\mathrm{d}W(t) \quad (2)$$
has the unique stationary distribution $\pi(z) \propto \exp\{-H(z)\}$, where $W(t)$ is the Wiener process and
$$f(z) = -[D(z) + Q(z)]\, \nabla_z H(z) + \Gamma(z), \qquad \Gamma_i(z) = \sum_j \frac{\partial}{\partial z_j} \big(D_{ij}(z) + Q_{ij}(z)\big). \quad (3)$$
The above dynamics is compatible with stochastic gradients. For SG-MCMCs, $z$ is usually an augmentation of the target variable $q$, and the Hamiltonian usually takes the form $H(z) = T(z) + U(q)$. Referring to Eqn. (1), $\tilde{\nabla}_q H(z) = \nabla_q H(z) + \mathcal{N}(0, V(q))$ and $\tilde{f}(z) = f(z) + \mathcal{N}(0, B(z))$, where $B(z)$ is the covariance matrix of the Gaussian noise passed from $\nabla_z H(z)$ to $f(z)$ through Eqn. (3). We informally rewrite $\mathrm{d}W(t)$ as $\mathcal{N}(0, \mathrm{d}t)$ and express dynamics (2) as
$$\mathrm{d}z = f(z)\,\mathrm{d}t + \mathcal{N}(0, 2D(z)\,\mathrm{d}t) = f(z)\,\mathrm{d}t + \mathcal{N}(0, B(z)\,\mathrm{d}t^2) + \mathcal{N}\big(0, 2D(z)\,\mathrm{d}t - B(z)\,\mathrm{d}t^2\big) = \tilde{f}(z)\,\mathrm{d}t + \mathcal{N}\big(0, 2D(z)\,\mathrm{d}t - B(z)\,\mathrm{d}t^2\big). \quad (4)$$
This tells us that the same dynamics can be exactly expressed in terms of the stochastic gradient. Moreover, the recipe is complete: for any continuous Markov process defined by Eqn. (2) with a unique stationary distribution $\pi(z) \propto \exp\{-H(z)\}$, there exists a skew-symmetric matrix $Q(z)$ such that Eqn. (3) holds.

## 3 Stochastic Gradient Geodesic MCMC Methods

We now formally develop our SGGMC and gSGNHT. We will describe the task settings, develop the dynamics, and show how to simulate them with 2nd-order integrators and stochastic gradients.

### 3.1 Technical Descriptions of the Settings

[Figure 1: An illustration of manifold $\mathcal{M}$ with local coordinate system $(N, \Phi)$ and embedding $\Xi$ (coordinate space $\mathbb{R}^m$ with $m = 2$; embedded space $\mathbb{R}^n$ with $n = 3$). See text for details.]

We first describe a Riemann manifold; the main concepts are depicted in Fig. 1. Let $\mathcal{M}$ be an $m$-dimensional Riemann manifold, which is covered by a set of local coordinate systems. Denote one of them by $(N, \Phi)$, where $N \subseteq \mathcal{M}$ is an open subset and $\Phi : N \to \Omega,\ Q \mapsto q$, with $\Omega \triangleq \Phi(N) \subseteq \mathbb{R}^m$, $Q \in N$ and $q \in \Omega$, is a homeomorphism. Additionally, transition mappings between any two intersecting local coordinate systems are required to be smooth. Denote the Riemann metric tensor under $(N, \Phi)$ by $G(q)$, an $m \times m$ symmetric positive-definite matrix. Another way to describe $\mathcal{M}$ is through embedding: a diffeomorphism $\Xi : \mathcal{M} \to \Xi(\mathcal{M}) \subseteq \mathbb{R}^n$ ($n \geq m$). In $(N, \Phi)$, $\Xi$ can be embodied by a more concrete mapping $\xi \triangleq \Xi \circ \Phi^{-1} : \mathbb{R}^m \to \mathbb{R}^n,\ q \mapsto x$, which links the coordinate space and the embedded space. For convenience, we only consider isometric embeddings (whose existence is guaranteed [21]): $\Xi$ such that $G(q)_{ij} = \sum_{l=1}^{n} \frac{\partial \xi_l(q)}{\partial q_i} \frac{\partial \xi_l(q)}{\partial q_j}$, $1 \leq i, j \leq m$, holds for any local coordinate system. Common manifolds are subsets of some $\mathbb{R}^n$, in which case the identity mapping (as $\Xi$) from $\mathbb{R}^n$ (where $\mathcal{M}$ is defined) to $\mathbb{R}^n$ (the embedded space) is isometric.
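To make the isometric-embedding condition concrete, the following minimal numerical sketch (ours, not part of the paper's algorithms; it assumes NumPy and uses hypothetical function names) checks that $G(q)$ equals the Gram matrix of the embedding Jacobian for $S^2$ under a local spherical-coordinate chart:

```python
import numpy as np

# Minimal check of G(q)_ij = sum_l (d xi_l / d q_i)(d xi_l / d q_j)
# for S^2 with local coordinates q = (theta, phi) and embedding
# xi(q) = (sin t cos p, sin t sin p, cos t) into R^3.

def xi(q):
    t, p = q
    return np.array([np.sin(t) * np.cos(p),
                     np.sin(t) * np.sin(p),
                     np.cos(t)])

def metric_from_embedding(q, eps=1e-6):
    # Jacobian J_lj = d xi_l / d q_j by central finite differences
    J = np.zeros((3, len(q)))
    for j in range(len(q)):
        dq = np.zeros(len(q))
        dq[j] = eps
        J[:, j] = (xi(q + dq) - xi(q - dq)) / (2 * eps)
    return J.T @ J  # G(q)

q = np.array([0.7, 1.2])
print(metric_from_embedding(q))           # numerically computed G(q)
print(np.diag([1.0, np.sin(q[0]) ** 2]))  # analytic round metric diag(1, sin^2 theta)
```

Spherical coordinates are of course only a local chart; the point of the methods below is precisely to avoid relying on such coordinates.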
To define a distribution on a Riemann manifold, from which we want to sample, we need a measure. In the coordinate space $\mathbb{R}^m$, $\Omega$ naturally possesses the Lebesgue measure $\lambda^m(\mathrm{d}q)$, and the probability density can be defined in $\Omega$, which we denote by $\pi(q)$. In the embedded space $\mathbb{R}^n$, $\Xi(N)$ naturally possesses the Hausdorff measure $\mathcal{H}^m(\mathrm{d}x)$, and we denote the probability density w.r.t. this measure by $\pi_{\mathcal{H}}(x)$. The relation between them is $\pi_{\mathcal{H}}(\xi(q)) = \pi(q) / \sqrt{|G(q)|}$.

### 3.2 The Dynamics

We now construct our dynamics using the recipe [19] so that they naturally have the desired stationary distribution, leading to correct samples. It is important to note that the recipe only suits dynamics in a Euclidean space, so we can only develop the dynamics in the coordinate space, not in the embedded space $\Xi(\mathcal{M})$, which is generally not Euclidean. However, it is advantageous to simulate the dynamics in the embedded space (see Sec. 3.3).

**Dynamics for SGGMC.** Define the momentum in the coordinate space $p \in \mathbb{R}^m$ and the augmented variable $z = (q, p) \in \mathbb{R}^{2m}$. Define the Hamiltonian
$$H(z) = U(q) + \frac{1}{2} \log |G(q)| + \frac{1}{2} p^\top G(q)^{-1} p,$$
where $U(q) \triangleq -\log \pi(q)$. (Another derivation of the momentum and the Hamiltonian, originating from physics, in both the coordinate and embedded spaces is provided in Appendix C.) We define the Hamiltonian so that the canonical distribution $\pi(z) \propto \exp\{-H(z)\}$ marginalized w.r.t. $p$ recovers the target distribution $\pi(q)$. For a symmetric positive definite $n \times n$ matrix $C$, define the diffusion matrix $D(z)$ and the curl matrix $Q(z)$ as
$$D(z) = \begin{pmatrix} 0 & 0 \\ 0 & M(q)^\top C M(q) \end{pmatrix}, \qquad Q(z) = \begin{pmatrix} 0 & -I \\ I & 0 \end{pmatrix},$$
where $M(q)$ is the $n \times m$ matrix with $M(q)_{ij} = \partial \xi_i(q) / \partial q_j$. Then from Eqns. (2, 3), the dynamics
$$\mathrm{d}q = G^{-1} p\,\mathrm{d}t, \quad \mathrm{d}p = -\nabla_q U\,\mathrm{d}t - \tfrac{1}{2} \nabla_q \log |G|\,\mathrm{d}t - M^\top C M G^{-1} p\,\mathrm{d}t - \tfrac{1}{2} \nabla_q \big(p^\top G^{-1} p\big)\,\mathrm{d}t + \mathcal{N}(0, 2 M^\top C M\,\mathrm{d}t) \quad (5)$$
has the unique stationary distribution $\pi(z) \propto \exp\{-H(z)\}$.

**Dynamics for gSGNHT.** Define $z = (q, p, \xi) \in \mathbb{R}^{2m+1}$, where $\xi \in \mathbb{R}$ is the thermostat variable. For a positive scalar $C \in \mathbb{R}$, define the Hamiltonian
$$H(z) = U(q) + \frac{1}{2} \log |G(q)| + \frac{1}{2} p^\top G(q)^{-1} p + \frac{m}{2} (\xi - C)^2,$$
whose marginalized canonical distribution is $\pi(q)$ as desired. Define $D(z)$ and $Q(z)$ as
$$D(z) = \begin{pmatrix} 0 & 0 & 0 \\ 0 & C G(q) & 0 \\ 0 & 0 & 0 \end{pmatrix}, \qquad Q(z) = \begin{pmatrix} 0 & -I & 0 \\ I & 0 & p/m \\ 0 & -p^\top/m & 0 \end{pmatrix}.$$
Then by Eqns. (2, 3) the proper dynamics of gSGNHT is
$$\mathrm{d}q = G^{-1} p\,\mathrm{d}t, \quad \mathrm{d}p = -\nabla_q U\,\mathrm{d}t - \tfrac{1}{2} \nabla_q \log |G|\,\mathrm{d}t - \xi p\,\mathrm{d}t - \tfrac{1}{2} \nabla_q \big(p^\top G^{-1} p\big)\,\mathrm{d}t + \mathcal{N}(0, 2 C G\,\mathrm{d}t), \quad \mathrm{d}\xi = \Big(\tfrac{1}{m} p^\top G^{-1} p - 1\Big)\mathrm{d}t. \quad (6)$$

These two dynamics are novel. They extend the dynamics of SGHMC and SGNHT, respectively, to Riemann manifolds. Conceiving the dynamics in this form is also intended for convenience in developing 2nd-order geodesic integrators, which differs from SGRHMC.

### 3.3 Simulation with 2nd-order Geodesic Integrators

In this part we develop our integrators following the symmetric splitting integrator (SSI) scheme [8], which is guaranteed to be of 2nd order. The idea of SSI is to first split the dynamics into parts, each analytically solvable, and then alternately simulate each part exactly with its analytic solution. Although it is also an SSI, the integrator of GMC does not fit our dynamics, where diffusion arises; but we adopt its embedding technique to get rid of any local coordinate system and thus remove the global coordinate system requirement. So we will solve and simulate the split dynamics in the isometrically embedded space, where everything is expressed by the position $x = \xi(q)$ and the velocity $v = \dot{x}$ (which is actually the momentum in the isometrically embedded space, see Appendix C; the overhead dot denotes the time derivative), instead of $q$ and $p$.
**Integrator for SGGMC.** We first split dynamics (5) into sub-SDEs, each analytically solvable:
$$A: \left\{\begin{array}{l} \mathrm{d}q = G^{-1} p\,\mathrm{d}t \\ \mathrm{d}p = -\frac{1}{2} \nabla_q \big(p^\top G^{-1} p\big)\,\mathrm{d}t \end{array}\right., \quad B: \left\{\begin{array}{l} \mathrm{d}q = 0 \\ \mathrm{d}p = -M^\top C M G^{-1} p\,\mathrm{d}t \end{array}\right., \quad O: \left\{\begin{array}{l} \mathrm{d}q = 0 \\ \mathrm{d}p = -\nabla_q U(q)\,\mathrm{d}t - \frac{1}{2} \nabla_q \log |G(q)|\,\mathrm{d}t + \mathcal{N}(0, 2 M^\top C M\,\mathrm{d}t) \end{array}\right.$$
As noted in GMC, the solution of dynamics A is the geodesic flow of the manifold [1]. Intuitively, dynamics A describes motion with no force, so a particle moves freely on the manifold, e.g. uniform motion in Euclidean space, and motion along great circles (the velocity rotating with the varying tangents along the trajectory) on the hypersphere $S^{d-1} \triangleq \{x \in \mathbb{R}^d \mid \|x\| = 1\}$ ($\|\cdot\|$ denotes the $\ell_2$-norm). The evolution of the position and velocity of this kind is the geodesic flow. We require an explicit form of the geodesic flow in the embedded space. For $S^{d-1}$,
$$x(t) = x(0) \cos(\alpha t) + \frac{v(0)}{\alpha} \sin(\alpha t), \qquad v(t) = -\alpha\, x(0) \sin(\alpha t) + v(0) \cos(\alpha t) \quad (7)$$
is the geodesic flow expressed by the embedded variables $x$ and $v$, where $\alpha = \|v(0)\|$. By the details in [7] or Appendix A, dynamics B and O are solved as
$$B: \left\{\begin{array}{l} x(t) = x(0) \\ v(t) = \operatorname{expm}\{-\Lambda(x(0))\, C t\}\, v(0) \end{array}\right., \qquad O: \left\{\begin{array}{l} x(t) = x(0) \\ v(t) = v(0) + \Lambda(x(0)) \big[ -\nabla_x U_{\mathcal{H}}(x(0))\, t + \mathcal{N}(0, 2 C t) \big] \end{array}\right.,$$
where $U_{\mathcal{H}}(x) \triangleq -\log \pi_{\mathcal{H}}(x)$, $\operatorname{expm}\{\cdot\}$ is the matrix exponential, and $\Lambda(x)$ is the projection onto the tangent space at $x$ of the embedded manifold. For $\mathbb{R}^n$, $\Lambda(x) = I_n$ (the identity mapping in $\mathbb{R}^n$), and for $S^{n-1}$ embedded in $\mathbb{R}^n$, $\Lambda(x) = I_n - x x^\top$ (see Appendix A.3). We further reduce dynamics B for scalar $C$: $v(t) = \Lambda(x(0)) \exp\{-Ct\}\, v(0) = \exp\{-Ct\}\, v(0)$, by noting that $\exp\{-Ct\}$ is a scalar and $v(0)$ already lies in the tangent space at $x(0)$. To illustrate this form, we expand the exponential for small $t$ and get $v(t) = (1 - Ct)\, v(0)$, which is exactly the action of a friction dissipating energy to control the injected noise, as proposed in SGHMC. Our investigation reveals that this form holds generally for $v$ as the momentum in the isometrically embedded space, but not for the usual momentum $p$ in the coordinate space. In SGHMC, $v$ and $p$ are indistinguishable, but in our case $v$ can only lie in the tangent space while $p$ is arbitrary in $\mathbb{R}^m$.

**Integrator for gSGNHT.** We split dynamics (6) in a similar way:
$$A: \left\{\begin{array}{l} \mathrm{d}q = G^{-1} p\,\mathrm{d}t \\ \mathrm{d}p = -\frac{1}{2} \nabla_q \big(p^\top G^{-1} p\big)\,\mathrm{d}t \\ \mathrm{d}\xi = \big(\frac{1}{m} p^\top G^{-1} p - 1\big)\,\mathrm{d}t \end{array}\right., \quad B: \left\{\begin{array}{l} \mathrm{d}q = 0 \\ \mathrm{d}p = -\xi p\,\mathrm{d}t \\ \mathrm{d}\xi = 0 \end{array}\right., \quad O: \left\{\begin{array}{l} \mathrm{d}q = 0 \\ \mathrm{d}p = -\nabla_q U\,\mathrm{d}t - \frac{1}{2} \nabla_q \log |G|\,\mathrm{d}t + \mathcal{N}(0, 2 C G\,\mathrm{d}t) \\ \mathrm{d}\xi = 0 \end{array}\right.$$
For dynamics A, the solution of $q$ and $p$ is again the geodesic flow. To solve for $\xi$, we first note that along dynamics A, $p^\top G^{-1} p$ is constant:
$$\frac{\mathrm{d}}{\mathrm{d}t}\big(p^\top G(q)^{-1} p\big) = \nabla_q \big(p^\top G(q)^{-1} p\big)^\top \dot{q} + 2 \big(G(q)^{-1} p\big)^\top \dot{p} = -2\, \dot{p}^\top \dot{q} + 2\, \dot{q}^\top \dot{p} = 0.$$
Alternatively, we note that $\frac{1}{2} p^\top G^{-1} p = \frac{1}{2} v^\top v$ is the kinetic energy ($p^\top G^{-1} p = (G^{-1}p)^\top G (G^{-1}p) = \dot{q}^\top (M^\top M)\, \dot{q} = (M\dot{q})^\top (M\dot{q}) = v^\top v$ for an isometric embedding), which is conserved by motion with no force. Now the evolution of $\xi$ can be solved as $\xi(t) = \xi(0) + \big(\frac{1}{m} v(0)^\top v(0) - 1\big)\, t$. Dynamics O is identical to that of SGGMC. Dynamics B can be solved similarly with only $v$ updated: $v(t) = \exp\{-\xi(0) t\}\, v(0)$. Expanding this recovers the dissipation of energy by an adaptive friction as proposed in SGNHT, and we extend it to an embedded space.

Now we consider incorporating the stochastic gradient. Only the common dynamics O is affected. Similarly to Eqn. (1), we express the stochastic gradient as $\tilde{\nabla}_x U_{\mathcal{H}}(x) = \nabla_x U_{\mathcal{H}}(x) + \mathcal{N}(0, V(x))$, and then reformulate the solution of dynamics O as
$$v(t) = v(0) + \Lambda(x(0)) \Big[ -\tilde{\nabla}_x U_{\mathcal{H}}(x(0))\, t + \mathcal{N}\big(0,\, 2 C t - V(x(0))\, t^2\big) \Big]. \quad (8)$$
To estimate the usually unknown $V(x)$, a simple way is just to take it as zero, in the sense that $V(x)\, t^2$ is a higher-order infinitesimal of $2Ct$ for $t$ a small simulation step size. Another way to estimate $V(x)$ is by the empirical Fisher information, as is done in [2]. Finally, as SSI suggests, we simulate the complete dynamics by exactly simulating these solutions alternately in an ABOBA pattern: for a time step size of $\varepsilon$, each occurrence of dynamics A and B advances by $\varepsilon/2$ and dynamics O by $\varepsilon$.
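To make the ABOBA simulation concrete, here is a minimal sketch (ours, assuming NumPy; `grad_U_H`, `sggmc_step` and the other names are hypothetical) of one SGGMC update on $S^{d-1}$ in the embedded space, with a scalar friction $C$ and $V(x)$ taken as zero:

```python
import numpy as np

def project(x, v):
    # Lambda(x) = I - x x^T : projection onto the tangent space at x on S^{d-1}
    return v - x * (x @ v)

def geodesic_flow(x, v, t):
    # Closed-form geodesic flow on the sphere, Eqn. (7)
    alpha = np.linalg.norm(v)
    if alpha < 1e-12:
        return x, v
    x_new = x * np.cos(alpha * t) + (v / alpha) * np.sin(alpha * t)
    v_new = -alpha * x * np.sin(alpha * t) + v * np.cos(alpha * t)
    return x_new, v_new

def sggmc_step(x, v, grad_U_H, eps, C, rng):
    # grad_U_H(x): user-supplied stochastic gradient of U_H in the embedded space
    x, v = geodesic_flow(x, v, eps / 2)               # A: half step
    v = np.exp(-C * eps / 2) * v                      # B: friction, scalar C
    noise = rng.normal(scale=np.sqrt(2 * C * eps), size=x.shape)
    v = v + project(x, -grad_U_H(x) * eps + noise)    # O: full step, Eqn. (8) with V = 0
    v = np.exp(-C * eps / 2) * v                      # B: half step
    x, v = geodesic_flow(x, v, eps / 2)               # A: half step
    return x, v
```

A sampler would call this step repeatedly, recomputing the minibatch stochastic gradient at the current position each time; a gSGNHT step would differ only in using the thermostat variable in place of the constant $C$ in the B sub-steps and updating the thermostat with the kinetic energy in the A sub-steps.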
As with other SG-MCMCs, we omit the unscalable Metropolis-Hastings test, but consistency of e.g. estimation by averaging over samples drawn from SG-MCMCs is still guaranteed [8]. The algorithms for SGGMC and gSGNHT are listed in Appendix E.

## 4 Application to the Spherical Admixture Model

We now apply SGGMC/gSGNHT to the challenging task of posterior inference in the Spherical Admixture Model (SAM) [24]. SAM is a Bayesian topic model for spherical data (each datum lies in some $S^{d-1}$), such as the tf-idf representation of text data. It enables more feature representations for hierarchical Bayesian models, and has the benefit over Latent Dirichlet Allocation (LDA) [5] of directly modeling the absence of words. The structure of SAM is shown in Fig. 2. Each document $v_d$, each topic $\beta_k$, the corpus mean $\mu$ and the hyper-parameter $m$ are all in $S^{V-1}$ with $V$ the vocabulary size. Each topic proportion $\theta_d$ is in the $(K-1)$-dimensional simplex with $K$ the number of topics.

[Figure 2: An illustration of the SAM model structure.]

SAM uses the von Mises-Fisher distribution (vMF) (see e.g. [20]) to model variables on hyperspheres. The vMF on $S^{d-1}$ with mean $\mu \in S^{d-1}$ and concentration parameter $\kappa \in \mathbb{R}_+$ has pdf (w.r.t. the Hausdorff measure) $\mathrm{vMF}(x|\mu, \kappa) = c_d(\kappa) \exp\{\kappa \mu^\top x\}$, where $c_d(\kappa) = \kappa^{d/2-1} / \big((2\pi)^{d/2} I_{d/2-1}(\kappa)\big)$ and $I_r(\cdot)$ denotes the modified Bessel function of the first kind and order $r$. The generating process of SAM is then:
- Draw $\mu \sim \mathrm{vMF}(\mu|m, \kappa_0)$;
- For $k = 1, \dots, K$, draw topic $\beta_k \sim \mathrm{vMF}(\beta_k|\mu, \sigma)$;
- For $d = 1, \dots, D$, draw $\theta_d \sim \mathrm{Dir}(\theta_d|\alpha)$ and $v_d \sim \mathrm{vMF}(v_d|\bar{v}(\beta, \theta_d), \kappa)$,

where $\bar{v}(\beta, \theta_d) \triangleq \frac{\beta \theta_d}{\|\beta \theta_d\|}$ with $\beta \triangleq (\beta_1, \dots, \beta_K)$ is an approximate spherical weighted mean of the topics. The joint distribution of $v \triangleq (v_1, \dots, v_D)$, $\beta$, $\theta \triangleq (\theta_1, \dots, \theta_D)$ and $\mu$ is known. The inference task is to estimate the topic posterior $\pi(\beta|v)$. As it is intractable, [24] provides a mean-field variational inference method and solves an optimization problem under the spherical constraint, which is tackled by repeated normalization. However, this treatment is not applicable to most sampling methods, since it may corrupt the distribution of the samples. [24] tries a simple adaptive Metropolis-Hastings sampler with undesirable results, and no further attempt at sampling methods appears. Due to the lack of a global coordinate system for the hypersphere, most Riemann manifold samplers, including SGRLD and SGRHMC, fail. To our knowledge, only CHMC and GMC are suitable, yet neither is scalable. Our samplers are appropriate for the task, with the advantage of scalability.

Now we present our inference method that uses SGGMC/gSGNHT to directly sample from $\pi(\beta|v)$. First we note that $\mu$ can be collapsed analytically, and the marginalized distribution of $(v, \beta, \theta)$ is
$$\pi(v, \beta, \theta) = c_V(\kappa_0)\, c_V(\sigma)^K\, c_V\big(\|\tilde{m}(\beta)\|\big)^{-1} \prod_{d=1}^{D} \mathrm{Dir}(\theta_d|\alpha)\, \mathrm{vMF}\big(v_d \,\big|\, \bar{v}(\beta, \theta_d), \kappa\big), \quad (9)$$
where $\tilde{m}(\beta) \triangleq \kappa_0 m + \sigma \sum_{k=1}^K \beta_k$. To sample from $\pi(\beta|v)$ using our samplers, we only need a stochastic estimate of the gradient of the potential energy $\nabla_\beta U(\beta|v) \triangleq -\nabla_\beta \log \pi(\beta|v)$, which can be obtained by adopting the technique used in [11]:
$$\nabla_\beta \log \pi(\beta|v) = \frac{1}{\pi(\beta|v)} \nabla_\beta \int \pi(\beta, \theta|v)\,\mathrm{d}\theta = \int \frac{\pi(\beta, \theta|v)}{\pi(\beta|v)} \frac{\nabla_\beta \pi(\beta, \theta|v)}{\pi(\beta, \theta|v)}\,\mathrm{d}\theta = \mathbb{E}_{\pi(\theta|\beta, v)}\big[\nabla_\beta \log \pi(\beta, \theta|v)\big],$$
where $\nabla_\beta \log \pi(\beta, \theta|v) = \nabla_\beta \log \pi(v, \beta, \theta)$ is known, and the expectation can be estimated by averaging over a set of samples $\{\theta^{(n)}\}_{n=1}^N$ from $\pi(\theta|v, \beta)$: $\nabla_\beta U(\beta|v) \approx -\frac{1}{N} \sum_{n=1}^N \nabla_\beta \log \pi(v, \beta, \theta^{(n)})$.
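Both the potential above and the perplexity evaluation in Sec. 5.3 require evaluating $\log c_V(\cdot)$, whose Bessel factor overflows for large arguments. Here is a minimal sketch of a numerically stable vMF log-density (ours, assuming SciPy; the exponentially scaled Bessel function `ive` avoids the overflow, and the function names are hypothetical):

```python
import numpy as np
from scipy.special import ive

# log vMF(x | mu, kappa) = log c_d(kappa) + kappa * mu^T x, w.r.t. the Hausdorff measure,
# with c_d(kappa) = kappa^{d/2-1} / ((2 pi)^{d/2} I_{d/2-1}(kappa)).

def log_cd(kappa, d):
    nu = d / 2.0 - 1.0
    # log I_nu(kappa) = log ive(nu, kappa) + kappa, since ive(nu, z) = iv(nu, z) * exp(-z)
    return nu * np.log(kappa) - (d / 2.0) * np.log(2 * np.pi) \
           - (np.log(ive(nu, kappa)) + kappa)

def log_vmf(x, mu, kappa):
    d = x.shape[-1]
    return log_cd(kappa, d) + kappa * (mu @ x)
```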
To draw $\{\theta^{(n)}\}_{n=1}^N$, noting the simplex constraint and that the target distribution $\pi(\theta|v, \beta)$ is known up to a constant multiplier, we use GMC for this task. To scale up, we use a subset $\{d(s)\}_{s=1}^S$ of indices of randomly chosen documents from the whole dataset to obtain a stochastic estimate of each $\nabla_\beta \log \pi(v, \beta, \theta^{(n)})$. The final stochastic gradient is
$$\tilde{\nabla}_\beta U(\beta|v) \approx \nabla_\beta \log c_V\big(\|\tilde{m}(\beta)\|\big) - \kappa \frac{D}{NS} \sum_{n=1}^N \sum_{s=1}^S \nabla_\beta \Big( v_{d(s)}^\top\, \bar{v}\big(\beta, \theta^{(n)}_{d(s)}\big) \Big). \quad (10)$$
The inference algorithm for SAM by SGGMC/gSGNHT is summarized in Alg. 3 in Appendix E.

## 5 Experiments

We present empirical results on both synthetic and real datasets to demonstrate the accuracy and efficiency of our methods. All target densities are expressed in the embedded space w.r.t. the Hausdorff measure, so we omit the subscript $\mathcal{H}$. Synthetic experiments are presented only for SGGMC, since the advantage of using thermostats has been shown by [10] and the effectiveness of gSGNHT is presented on real datasets. Detailed settings of the experiments are provided in Appendix F.

### 5.1 Toy Experiment

We first demonstrate the utility and check the correctness of SGGMC in a greenhouse experiment with known stochastic gradient noise. Consider sampling from a circle ($S^1$) for easy visualization. We set the target distribution such that the potential energy is $U(x) = -\log\big(\exp\{5 \mu_1^\top x\} + 2 \exp\{5 \mu_2^\top x\}\big)$, where $x, \mu_1, \mu_2 \in S^1$ and $\mu_1 = -\mu_2 = \pi/3$ (angle from the $+x$ direction). The stochastic gradient is produced by corrupting the true gradient with $\mathcal{N}(0, 1000 I)$, whose variance is used as $V(x)$ in Eqn. (8) for sampling. Fig. 3(a) shows 100 samples from SGGMC and the empirical distribution of 10,000 samples in the embedded space $\mathbb{R}^2$. The true and empirical distributions are compared in Fig. 3(b) in the angle space (local coordinate space). We see no obvious corruption of the result when using the stochastic gradient.

[Figure 3: Toy experiment results: (a) samples and empirical distribution of SGGMC in the embedded space; (b) comparison of true and empirical distributions in angle space.]

It should be stressed that although it is possible to apply scalable methods like SGRLD in spherical coordinate systems (which are almost global), it is too troublesome to work out the form of e.g. the Riemann metric tensor, and special treatments like reflection at boundaries have to be considered. Numerical instability at the boundaries also tends to appear. All of these issues get even worse in higher dimensions. Our methods work in embedded spaces, so these issues are bypassed and the methods extend elegantly to high dimensions.

### 5.2 Synthetic Experiment

We then test SGGMC on a simple Bayesian posterior estimation task. We adopt a model with a structure similar to the one used in [29]: a mixture of two vMFs on $S^1$ with equal weights,
$$\pi(v_1) = \mathrm{vMF}(v_1|e_1, \kappa_1), \quad \pi(v_2) = \mathrm{vMF}(v_2|e_1, \kappa_2), \quad \pi(x_i|v_1, v_2) \propto \mathrm{vMF}(x_i|v_1, \kappa_x) + \mathrm{vMF}(x_i|\mu, \kappa_x),$$
where $e_1 = (1, 0)^\top$ and $\mu \triangleq (v_1 + v_2) / \|v_1 + v_2\|$. The task is to infer the posterior $\pi(v_1, v_2|\mathcal{D})$, where $\mathcal{D} = \{x_i\}_{i=1}^{D=100}$ is our synthetic data, generated from the likelihood with $v_1 = \pi/8$ (angle from the $+x$ direction) and $\kappa_1 = \kappa_2 = \kappa_x = 20$ by GMC. SGGMC uses the empirical Fisher information in the manner of [2] for $V(x)$ in Eqn. (8), and uses a batch size of 10.

[Figure 4: (a-b) True and empirical densities for $\pi(v_1|\mathcal{D})$ and $\pi(v_2|\mathcal{D})$. (c) True (left) and empirical by SGGMC (right) densities for $\pi(v_1, v_2|\mathcal{D})$.]
Fig. 4(a-b) show the true and empirical marginal posteriors of $v_1$ and $v_2$, and Fig. 4(c) presents the empirical joint posterior from SGGMC samples together with its true density. We see that samples from SGGMC exhibit no observable corruption when a mini-batch is used, and they fully explore the two modes and the strong correlation of $v_1$ and $v_2$ (Appendix D provides a rationale for the shape of the joint posterior).

### 5.3 Spherical Admixture Models

**Setups.** For baselines, we compare with the mean-field variational inference (VI) of [24] and its stochastic version (StoVI) based on [15], as well as GMC methods. It is problematic for GMC to directly sample from the target distribution $\pi(\beta|v)$, since the potential energy, required for the Metropolis-Hastings (MH) test in GMC, is hard to estimate. An approximate Monte Carlo estimate is provided in Appendix B, and the corresponding method for SAM is GMC-apprMH. An alternative is GMC-bGibbs, which adopts blockwise Gibbs sampling to alternately sample from $\pi(\beta|\theta, v)$ and $\pi(\theta|\beta, v)$ (both known up to a constant multiplier) using GMC. We evaluate the methods by log-perplexity, the average negative log-likelihood on a held-out test set $\mathcal{D}_{\mathrm{test}}$. Variational methods produce a single point estimate $\hat{\beta}$, and the log-perplexity is $\mathrm{log\text{-}perp} = -\frac{1}{|\mathcal{D}_{\mathrm{test}}|} \sum_{d \in \mathcal{D}_{\mathrm{test}}} \log \pi(v_d|\hat{\beta})$. Sampling methods draw a set of samples $\{\beta^{(m)}\}_{m=1}^M$, and $\mathrm{log\text{-}perp} = -\frac{1}{|\mathcal{D}_{\mathrm{test}}|} \sum_{d \in \mathcal{D}_{\mathrm{test}}} \log\big(\frac{1}{M} \sum_{m=1}^M \pi(v_d|\beta^{(m)})\big)$. In both cases the intractable $\pi(v_d|\beta)$ needs to be estimated. Noting that $\pi(v_d|\beta) = \int \pi(v_d, \theta_d|\beta)\,\mathrm{d}\theta_d = \mathbb{E}_{\pi(\theta_d|\beta)}[\pi(v_d|\beta, \theta_d)]$, we estimate it by averaging $\pi(v_d|\beta, \theta_d^{(n)})$ (exactly known from the generating process) over samples $\{\theta_d^{(n)}\}_{n=1}^N$ drawn from $\pi(\theta_d|\beta) = \pi(\theta_d) = \mathrm{Dir}(\alpha)$, the prior of $\theta_d$. The log-perplexity is not comparable across different models, so we exclude LDA from our baselines. We show the performance of all methods on a small and a large dataset. The hyper-parameters of SAM are fixed during training and set the same for all methods. $V(x)$ in Eqn. (8) is taken as zero for SGGMC/gSGNHT. All sampling methods are implemented in C++ and fairly parallelized with OpenMP. (All the code and data can be found at http://ml.cs.tsinghua.edu.cn/~changliu/sggmcmc-sam/.) VI/StoVI are run with the MATLAB code of [24] and we only use their final scores for comparison. Appendix F gives further implementation details, including techniques to avoid overflow.

[Figure 5: Evolution of log-perplexity along wall time (in seconds, log scale) of all methods (VI, StoVI, GMC-apprMH, GMC-bGibbs, SGGMC-batch, SGGMC-full, gSGNHT-batch, gSGNHT-full) on (a) the 20News-different dataset and (b) the 150K Wikipedia subset.]

**On the small dataset.** The small dataset is the 20News-different dataset used by [24], which consists of 3 categories from the 20Newsgroups dataset. It is small (1,666 training and 1,107 test documents), so we have the chance to see the eventual results of all methods. We use 20 topics and a batch size of 50. Fig. 5(a) shows the performance of all methods. We can see that our SGGMC and gSGNHT perform better than the others. VI converges swiftly but cannot go any lower due to the intrinsic gap between the mean-field variational distribution and the true posterior. StoVI converges more slowly than VI in this small-scale case, and exhibits the same limit. All sampling methods eventually go below the variational methods, and ours go the lowest.
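As an implementation note on the evaluation described under Setups, the log-perplexity of a sampling method can be computed from a matrix of per-document, per-sample log-likelihoods with a log-sum-exp, one of the overflow-avoiding techniques alluded to above (a minimal sketch, ours, assuming SciPy; the names are hypothetical):

```python
import numpy as np
from scipy.special import logsumexp

# log_lik[d, m] = log pi(v_d | beta^(m)) for test document d and posterior sample m,
# itself estimated by averaging pi(v_d | beta^(m), theta) over prior samples of theta.

def log_perplexity(log_lik):
    # log (1/M) sum_m exp(log_lik[d, m]), computed stably for each document d
    per_doc = logsumexp(log_lik, axis=1) - np.log(log_lik.shape[1])
    return -per_doc.mean()
```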
gSGNHT shows its benefit by outperforming SGGMC under the same settings. For our methods, an appropriately smaller batch size achieves a better result due to the speed-up from subsampling. Note that even the full-batch SGGMC and gSGNHT outperform the GMC variants. This may be because the randomness in the dynamics helps jumping out of one local mode to another for a better exploration.

**On the large dataset.** For the large dataset, we use a subset of the Wikipedia dataset with 150K training and 1K test documents, to challenge the scalability of all the methods. We use 50 topics and a batch size of 100. Fig. 5(b) shows the outcome. We see that the gap between our methods and the other baselines gets larger, indicating our scalability. The bounded curves of VI/StoVI, the advantage of using thermostats and the speed-up from subsampling appear again. Our full-batch versions are still better than the GMC variants. GMC-apprMH and GMC-bGibbs scale badly; they converge slowly in this case.

## 6 Conclusions and Discussions

We propose SGGMC and gSGNHT, SG-MCMCs for scalable sampling from manifolds with a known geodesic flow. They are saliently efficient in their applications. Novel dynamics are constructed and 2nd-order geodesic integrators are developed. We apply the methods to the SAM topic model for more accurate and scalable inference. Synthetic experiments verify the validity, and experiments for SAM on real-world data show an obvious advantage in accuracy over variational inference methods and in scalability over other applicable sampling methods. There remain broad possible applications of our methods, including models involving the vMF distribution (e.g. mixtures of vMFs [4, 14, 28], DP mixtures of vMFs [12, 3, 27]), constrained distributions [17] (e.g. the truncated Gaussian), and distributions on the Stiefel manifold (e.g. Bayesian matrix completion [25]), where the ability to scale up will be appealing.

## Acknowledgments

The work was supported by the National Basic Research Program (973 Program) of China (No. 2013CB329403), National NSF of China Projects (Nos. 61620106010, 61322308, 61332007), the Youth Top-notch Talent Support Program, and the Tsinghua Initiative Scientific Research Program (No. 20141080934).

## References

[1] Ralph Abraham and Jerrold E. Marsden. Foundations of Mechanics. Benjamin/Cummings Publishing Company, Reading, Massachusetts, 1978.
[2] Sungjin Ahn, Anoop Korattikara, and Max Welling. Bayesian posterior sampling via stochastic gradient Fisher scoring. arXiv preprint arXiv:1206.6380, 2012.
[3] Nguyen Kim Anh, Nguyen The Tam, and Ngo Van Linh. Document clustering using Dirichlet process mixture model of von Mises-Fisher distributions. In The 4th International Symposium on Information and Communication Technology, SoICT 2013, pages 131–138, 2013.
[4] Arindam Banerjee, Inderjit S. Dhillon, Joydeep Ghosh, and Suvrit Sra. Clustering on the unit hypersphere using von Mises-Fisher distributions. Journal of Machine Learning Research, 6:1345–1382, 2005.
[5] David M. Blei, Andrew Y. Ng, and Michael I. Jordan. Latent Dirichlet allocation. The Journal of Machine Learning Research, 3:993–1022, 2003.
[6] Marcus A. Brubaker, Mathieu Salzmann, and Raquel Urtasun. A family of MCMC methods on implicitly defined manifolds. In Proceedings of the 15th International Conference on Artificial Intelligence and Statistics (AISTATS), pages 161–172, 2012.
[7] Simon Byrne and Mark Girolami. Geodesic Monte Carlo on embedded manifolds. Scandinavian Journal of Statistics, 40(4):825–845, 2013.
[8] Changyou Chen, Nan Ding, and Lawrence Carin. On the convergence of stochastic gradient MCMC algorithms with high-order integrators. In Advances in Neural Information Processing Systems, pages 2269–2277, 2015.
[9] Tianqi Chen, Emily Fox, and Carlos Guestrin. Stochastic gradient Hamiltonian Monte Carlo. In Proceedings of the 31st International Conference on Machine Learning (ICML-14), pages 1683–1691, 2014.
[10] Nan Ding, Youhan Fang, Ryan Babbush, Changyou Chen, Robert D. Skeel, and Hartmut Neven. Bayesian sampling using stochastic gradient thermostats. In Advances in Neural Information Processing Systems, pages 3203–3211, 2014.
[11] Chao Du, Jun Zhu, and Bo Zhang. Learning deep generative models with doubly stochastic MCMC. arXiv preprint arXiv:1506.04557, 2015.
[12] Kaushik Ghosh, Rao Jammalamadaka, and Ram C. Tiwari. Semiparametric Bayesian techniques for problems in circular data. Journal of Applied Statistics, 30(2):145–161, 2003.
[13] Mark Girolami and Ben Calderhead. Riemann manifold Langevin and Hamiltonian Monte Carlo methods. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 73(2):123–214, 2011.
[14] Siddharth Gopal and Yiming Yang. Von Mises-Fisher clustering models. In Proceedings of the 31st International Conference on Machine Learning (ICML-14), 2014.
[15] Matthew D. Hoffman, David M. Blei, Chong Wang, and John Paisley. Stochastic variational inference. The Journal of Machine Learning Research, 14(1):1303–1347, 2013.
[16] I. M. James. The Topology of Stiefel Manifolds, volume 24. Cambridge University Press, 1976.
[17] Shiwei Lan, Bo Zhou, and Babak Shahbaba. Spherical Hamiltonian Monte Carlo for constrained target distributions. In Proceedings of the 31st International Conference on Machine Learning (ICML-14), pages 629–637, 2014.
[18] Chunyuan Li, Changyou Chen, Kai Fan, and Lawrence Carin. High-order stochastic gradient thermostats for Bayesian learning of deep models. arXiv preprint arXiv:1512.07662, 2015.
[19] Yi-An Ma, Tianqi Chen, and Emily Fox. A complete recipe for stochastic gradient MCMC. In Advances in Neural Information Processing Systems, pages 2899–2907, 2015.
[20] Kanti V. Mardia and Peter E. Jupp. Distributions on spheres. Directional Statistics, pages 159–192, 2000.
[21] John Nash. The imbedding problem for Riemannian manifolds. Annals of Mathematics, pages 20–63, 1956.
[22] Radford M. Neal. MCMC using Hamiltonian dynamics. Handbook of Markov Chain Monte Carlo, 2, 2011.
[23] Sam Patterson and Yee Whye Teh. Stochastic gradient Riemannian Langevin dynamics on the probability simplex. In Advances in Neural Information Processing Systems, pages 3102–3110, 2013.
[24] Joseph Reisinger, Austin Waters, Bryan Silverthorn, and Raymond J. Mooney. Spherical topic models. In Proceedings of the 27th International Conference on Machine Learning (ICML-10), pages 903–910, 2010.
[25] Yang Song and Jun Zhu. Bayesian matrix completion via adaptive relaxed spectral regularization. In The 30th AAAI Conference on Artificial Intelligence (AAAI-16), 2016.
[26] Eduard L. Stiefel. Richtungsfelder und Fernparallelismus in n-dimensionalen Mannigfaltigkeiten. Commentarii Mathematici Helvetici, 8(1):305–353, 1935.
[27] Julian Straub, Jason Chang, Oren Freifeld, and John W. Fisher III. A Dirichlet process mixture model for spherical data. In Proceedings of the 18th International Conference on Artificial Intelligence and Statistics (AISTATS), pages 930–938, 2015.
[28] Jalil Taghia, Zhanyu Ma, and Arne Leijon. Bayesian estimation of the von Mises-Fisher mixture model with variational inference. IEEE Transactions on Pattern Analysis and Machine Intelligence, 36(9):1701–1715, 2014.
[29] Max Welling and Yee Whye Teh. Bayesian learning via stochastic gradient Langevin dynamics. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pages 681–688, 2011.