# Geometric Convergence of Elliptical Slice Sampling

Viacheslav Natarovskii¹, Daniel Rudolf¹, Björn Sprungk²

¹Institute for Mathematical Stochastics, Georg-August-Universität Göttingen, Göttingen, Germany. ²Faculty of Mathematics and Computer Science, Technische Universität Bergakademie Freiberg, Germany. Correspondence to: Viacheslav Natarovskii, Daniel Rudolf, Björn Sprungk.

Proceedings of the 38th International Conference on Machine Learning, PMLR 139, 2021. Copyright 2021 by the author(s).

Abstract

For Bayesian learning, given likelihood function and Gaussian prior, the elliptical slice sampler, introduced by Murray, Adams and MacKay (2010), provides a tool for the construction of a Markov chain for approximate sampling of the underlying posterior distribution. Besides its wide applicability and simplicity, its main feature is that no tuning is required. Under weak regularity assumptions on the posterior density we show that the corresponding Markov chain is geometrically ergodic and therefore obtain qualitative convergence guarantees. We illustrate our result for Gaussian posteriors as they appear in Gaussian process regression, as well as in a setting of a multi-modal distribution. Remarkably, our numerical experiments indicate a dimension-independent performance of elliptical slice sampling even in situations where our ergodicity result does not apply.

1. Introduction

Probabilistic modeling provides a versatile tool in the analysis of data and allows for statistical inference. In particular, in Bayesian approaches one is able to quantify model and prediction uncertainty by extracting knowledge from the posterior distribution through sampling. The generation of exact samples w.r.t. the posterior distribution is usually quite difficult, since it is in most scenarios only known up to a normalizing constant. Let $\varrho\colon \mathbb{R}^d \to (0, \infty)$ be determined by a likelihood function given some data (which we omit in the following for simplicity), mapping from the parameter space into the positive reals, and let $\mu_0 = N(0, C)$ be a Gaussian prior distribution on $\mathbb{R}^d$ with non-degenerate covariance matrix C, such that the posterior distribution µ on $\mathbb{R}^d$ takes the form

$$\mu(\mathrm{d}x) = \frac{\varrho(x)}{Z}\, \mu_0(\mathrm{d}x), \qquad Z := \int_{\mathbb{R}^d} \varrho(x)\, \mu_0(\mathrm{d}x). \tag{1}$$

For convenience, we abbreviate the former relation between the measures µ(dx) and ϱ(x)µ0(dx) as $\mu(\mathrm{d}x) \propto \varrho(x)\,\mu_0(\mathrm{d}x)$.

A standard approach for generating approximate samples w.r.t. µ is given by Markov chain Monte Carlo. The idea is to construct a Markov chain which has µ as its stationary and limit distribution¹. For this purpose, in machine learning (and computational statistics in general) Metropolis–Hastings algorithms and slice sampling algorithms (which include Gibbs sampling) are classical tools, see, e.g., (Neal, 1993; Andrieu et al., 2003; Neal, 2003). Murray, Adams and MacKay in (Murray et al., 2010) introduced the elliptical slice sampler. On the one hand it is based on a Metropolis–Hastings method suggested by Neal (Neal, 1999) (nowadays also known as the preconditioned Crank–Nicolson Metropolis (Cotter et al., 2013; Rudolf & Sprungk, 2018)) and on the other hand it is a modification of slice sampling with stepping-out and shrinkage (Neal, 2003). Elliptical slice sampling is illustrated in (Murray et al., 2010) on a number of applications, such as Gaussian regression, Gaussian process classification and a log-Gaussian Cox process.
Apart from its simplicity and wide applicability, the main advantage of the suggested algorithm is that it performs well in practice and no tuning is necessary. In addition to that, in many scenarios it appears as a building block of, and/or has influenced the methodological development of, sampling approaches (Fagan et al., 2016; Hahn et al., 2019; Bierkens et al., 2020; Murray & Graham, 2016; Nishihara et al., 2014). However, despite the arguments for being reversible w.r.t. the desired posterior in (Murray et al., 2010), there is, to our knowledge, no theory that indeed guarantees convergence of the corresponding Markov chain. Under a tail and a weak boundedness assumption on ϱ we derive a small set and a Lyapunov function, which imply geometric ergodicity by standard theorems for Markov chains on general state spaces, see e.g. Chapter 15 in (Meyn & Tweedie, 2009) and/or (Hairer & Mattingly, 2011).

¹Limit distribution in the sense that for $n \to \infty$ the distribution of the $n$th random variable of the Markov chain converges to µ.

Before we state our ergodicity result in Section 2 we provide the algorithm and introduce notation as well as basic facts. Afterwards we present the detailed analysis, in particular the strategy of proof, and verify the two crucial conditions of having a Lyapunov function and a sufficiently large small set. In Section 4 we illustrate the applicability of our theoretical result in a fully Gaussian and in a multi-modal scenario. Additionally, we compare elliptical with simple slice sampling and different Metropolis–Hastings algorithms numerically. The experiments indicate dimension-independent statistical efficiency of elliptical slice sampling, which will be the content of future research.

2. Convergence of Elliptical Slice Sampling

We start by stating the transition mechanism/kernel of elliptical slice sampling in algorithmic form and provide our notation. Let (Ω, F, P) be the underlying probability space of all subsequently used random variables. For $a, b \in \mathbb{R}$ with a < b let U[a, b] be the uniform distribution on [a, b] and let $\mathcal{B}(\mathbb{R}^d)$ be the Borel σ-algebra of $\mathbb{R}^d$. Furthermore, the Euclidean ball with radius R > 0 around $x \in \mathbb{R}^d$ is denoted by $B_R(x)$ and the Euclidean norm is given by $\|\cdot\|$.

2.1. Transition Mechanism

We use the function $p\colon \mathbb{R}^d \times \mathbb{R}^d \times [0, 2\pi] \to \mathbb{R}^d$ defined as

$$p(x, w, \theta) := \cos(\theta)\, x + \sin(\theta)\, w, \tag{2}$$

where, for fixed $x, w \in \mathbb{R}^d$, the map $\theta \mapsto p(x, w, \theta)$ describes an ellipse in $\mathbb{R}^d$ with conjugate diameters x, w. Furthermore, for $t \geq 0$ let

$$G_t := \{x \in \mathbb{R}^d : \varrho(x) \geq t\}$$

be the (super-)level set of ϱ w.r.t. t. Using this notation, a single transition of elliptical slice sampling from $x \in \mathbb{R}^d$ to y is presented in Algorithm 1. Here $y \in \mathbb{R}^d$ is considered as a realization of a random variable $Y_x$.

Algorithm 1 Elliptical Slice Sampler
input: current state $x \in \mathbb{R}^d$
output: next state y as realization of a random variable $Y_x$
1: draw $W \sim \mu_0$, call the result w;
2: draw $T_x \sim U[0, \varrho(x)]$, call the result t;
3: draw $\Theta \sim U[0, 2\pi]$, call the result θ;
4: $\theta_{\min} \leftarrow \theta - 2\pi$
5: $\theta_{\max} \leftarrow \theta$
6: while $p(x, w, \theta) \notin G_t$ do
7:   if $\theta < 0$ then
8:     $\theta_{\min} \leftarrow \theta$
9:   else
10:    $\theta_{\max} \leftarrow \theta$
11:  end if
12:  draw $\Theta \sim U[\theta_{\min}, \theta_{\max}]$, set the result to θ;
13: end while
14: $y \leftarrow p(x, w, \theta)$

Let us denote the transition kernel which corresponds to elliptical slice sampling by $E\colon \mathbb{R}^d \times \mathcal{B}(\mathbb{R}^d) \to [0, 1]$ and for $A \in \mathcal{B}(\mathbb{R}^d)$ observe that

$$E(x, A) = \frac{1}{\varrho(x)} \int_0^{\varrho(x)} \int_{\mathbb{R}^d} E_{x,w,t}(A)\, \mu_0(\mathrm{d}w)\, \mathrm{d}t, \qquad E_{x,w,t}(A) := \mathbb{P}(Y_x \in A \mid W = w, T_x = t), \tag{3}$$

where $E_{x,w,t}$ is determined by steps 3–14 of Algorithm 1.
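For concreteness, a minimal NumPy sketch of the transition in Algorithm 1 might look as follows; function and variable names are illustrative, and ϱ is used directly as in the pseudo-code (in practice one would typically work with log ϱ and a log-uniform level to avoid numerical underflow).

```python
import numpy as np

def ess_transition(x, rho, chol_C, rng):
    """One transition of the elliptical slice sampler (Algorithm 1).

    x      : current state, array of shape (d,)
    rho    : callable returning the (unnormalized) likelihood value rho(x) > 0
    chol_C : Cholesky factor of the prior covariance C, so chol_C @ z ~ N(0, C)
    rng    : numpy.random.Generator
    """
    d = x.shape[0]
    w = chol_C @ rng.standard_normal(d)                  # line 1: w ~ mu_0 = N(0, C)
    t = rng.uniform(0.0, rho(x))                         # line 2: level t ~ U[0, rho(x)]
    theta = rng.uniform(0.0, 2.0 * np.pi)                # line 3: initial angle
    theta_min, theta_max = theta - 2.0 * np.pi, theta    # lines 4-5: initial bracket

    while True:
        y = np.cos(theta) * x + np.sin(theta) * w        # candidate p(x, w, theta) on the ellipse
        if rho(y) >= t:                                  # accept as soon as y lies in G_t
            return y
        if theta < 0.0:                                  # lines 7-11: shrink the bracket
            theta_min = theta
        else:
            theta_max = theta
        theta = rng.uniform(theta_min, theta_max)        # line 12: redraw the angle
```

The loop terminates almost surely, since the bracket shrinks towards θ = 0, where p(x, w, 0) = x and ϱ(x) ≥ t by construction of t.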
These steps of the algorithm determine the sampling mechanism on $G_t$ intersected with the ellipse by using a suitable adaptation of the shrinkage procedure, see (Neal, 2003; Murray et al., 2010).

Let $(X_n)_{n\in\mathbb{N}}$ be a Markov chain generated by Algorithm 1, that is, a Markov chain on $\mathbb{R}^d$ with transition kernel E. Then, for any $n \in \mathbb{N}$, $A \in \mathcal{B}(\mathbb{R}^d)$ and $x \in \mathbb{R}^d$ we have

$$\mathbb{P}(X_{n+1} \in A \mid X_1 = x) = E^n(x, A), \tag{4}$$

where $E^n$ is iteratively defined as

$$E^{n+1}(x, A) = \int_{\mathbb{R}^d} E^n(z, A)\, E(x, \mathrm{d}z), \tag{5}$$

with $E^0(x, A) = \mathbf{1}_A(x)$ denoting the indicator function of the set A.

2.2. Main Result

Before we formulate the theorem, we state the assumptions which eventually imply the convergence result.

Assumption 2.1. The function $\varrho\colon \mathbb{R}^d \to (0, \infty)$ satisfies the following properties:
1. It is bounded away from 0 and ∞ on any compact set.
2. There exist an α > 0 and R > 0, such that $B_{\alpha\|x\|}(0) \subseteq G_{\varrho(x)}$ for $\|x\| > R$.

The boundedness condition from below and above of ϱ on compact sets is relatively weak and appears frequently in qualitative proofs for geometric ergodicity of Markov chain algorithms, see e.g. (Roberts & Tweedie, 1996). The second condition tells us that ϱ has a sufficiently nice tail behavior. It is satisfied if the tails are rotationally invariant and monotonically decreasing, e.g., like $\exp(-\kappa\|x\|)$ for arbitrary κ > 0. For examples of ϱ which satisfy Assumption 2.1 we refer to Section 4.

For stating the geometric ergodicity of elliptical slice sampling we introduce the total variation distance of two probability measures π, ν on $\mathbb{R}^d$ as

$$\|\pi - \nu\|_{\mathrm{tv}} := \sup_{\|f\|_\infty \leq 1} \left| \int_{\mathbb{R}^d} f(x)\,\big(\pi(\mathrm{d}x) - \nu(\mathrm{d}x)\big) \right|,$$

where $\|f\|_\infty := \sup_{x\in\mathbb{R}^d} |f(x)|$ for $f\colon \mathbb{R}^d \to \mathbb{R}$.

Theorem 2.2. For elliptical slice sampling under Assumption 2.1 there exist constants C > 0 and γ ∈ (0, 1), such that

$$\|E^n(x, \cdot) - \mu\|_{\mathrm{tv}} \leq C\,(1 + \|x\|)\,\gamma^n, \qquad n \in \mathbb{N},\; x \in \mathbb{R}^d. \tag{6}$$

Remark 2.3. A transition kernel which satisfies an inequality as in (6) is called geometrically ergodic, since the distribution of $X_{n+1}$, given that the initial state is $X_1 = x$, converges exponentially/geometrically fast to µ. Here, the right-hand side depends on x only via the term $1 + \|x\|$. We view this result as a qualitative statement telling us about exponential convergence of the Markov chain, whereas we do not care too much about the constants C > 0 and γ ∈ (0, 1). The main reason behind this is that the employed technique of proof usually does not provide sharp bounds on γ and C, particularly regarding their dependence on the dimension d.

3. Detailed Analysis

For proving geometric ergodicity for Markov chains on general state spaces we employ a standard strategy, which consists of the verification of a suitable small set as well as a drift or Lyapunov condition, see e.g. Chapter 15 in (Meyn & Tweedie, 2009) or (Hairer & Mattingly, 2011). More precisely, we use a consequence of the Harris ergodic theorem as formulated in (Hairer & Mattingly, 2011), which provides a relatively concise introduction and proof of a geometric ergodicity result for Markov chains.

3.1. Strategy of Proof

To formulate the convergence theorem we need the notion of a Lyapunov function and a small set.
For this let $P\colon \mathbb{R}^d \times \mathcal{B}(\mathbb{R}^d) \to [0, 1]$ be a generic transition kernel. We call a function $V\colon \mathbb{R}^d \to [0, \infty)$ a Lyapunov function of P with δ ∈ [0, 1) and L ∈ [0, ∞) if for all $x \in \mathbb{R}^d$ holds

$$PV(x) := \int_{\mathbb{R}^d} V(y)\, P(x, \mathrm{d}y) \leq \delta V(x) + L. \tag{7}$$

Furthermore, a set $S \in \mathcal{B}(\mathbb{R}^d)$ is a small set w.r.t. P and a non-zero finite measure ν on $\mathbb{R}^d$, if

$$P(x, A) \geq \nu(A), \qquad x \in S,\; A \in \mathcal{B}(\mathbb{R}^d).$$

With this terminology we can state a consequence of Theorem 1.2 in (Hairer & Mattingly, 2011), which we justify for the convenience of the reader in the supplementary material.

Proposition 3.1. Suppose that for a transition kernel P there is a Lyapunov function $V\colon \mathbb{R}^d \to [0, \infty)$ with δ ∈ [0, 1) and L ∈ [0, ∞) (i.e., (7) is satisfied). Additionally, for some constant R > 2L/(1 − δ) let

$$S_R := \{x \in \mathbb{R}^d : V(x) \leq R\} \tag{8}$$

be a small set w.r.t. P and a non-zero measure ν on $\mathbb{R}^d$. Then, there is a unique stationary distribution $\mu_*$ on $\mathbb{R}^d$, that is, for all $A \in \mathcal{B}(\mathbb{R}^d)$

$$\mu_* P(A) := \int_{\mathbb{R}^d} P(x, A)\, \mu_*(\mathrm{d}x) = \mu_*(A),$$

and there exist constants γ ∈ (0, 1) as well as C < ∞ such that

$$\|P^n(x, \cdot) - \mu_*\|_{\mathrm{tv}} \leq C\,(1 + V(x))\,\gamma^n, \qquad n \in \mathbb{N},\; x \in \mathbb{R}^d.$$

(Here $P^n$ is the n-step transition kernel defined as in (5).)

From the arguments of reversibility of elliptical slice sampling w.r.t. µ derived in (Murray et al., 2010) we know already that µ is a stationary distribution w.r.t. the transition kernel E. The idea is now to first detect a suitable Lyapunov function V of E satisfying (7) for P = E with some δ ∈ [0, 1) and L ∈ [0, ∞) and, having this, to prove that the corresponding set $S_R$ from (8) is a small set w.r.t. E and a suitable measure ν.

3.2. Lyapunov Function

Besides the usefulness of a Lyapunov function in the context of geometric convergence of Markov chains as in Proposition 3.1, it also arises in the derivation of certain stability properties; e.g., it crucially appears in the perturbation theory of Markov chains in measuring the difference of transition kernels (Rudolf & Schweizer, 2018; Medina-Aguayo et al., 2020). We start with the following abstract proposition inspired by Lemma 3.2 in (Hairer et al., 2014), see also Proposition 3 in (Hosseini & Johndrow, 2018).

Proposition 3.2. Let P be a transition kernel on $\mathbb{R}^d$ such that for $Y_x \sim P(x, \cdot)$, $x \in \mathbb{R}^d$, there exists a random variable $W_x$ with $\mathbb{E}\|W_x\| \leq K$ for a constant K < ∞ independent of x, and

$$\|Y_x\| \leq \|x\| + \|W_x\| \quad \text{almost surely.} \tag{9}$$

Additionally, assume that there exist a radius R > 0 and constants ℓ ∈ (0, 1] and $\tilde\ell \in [0, 1)$ such that for all $x \in B_R(0)^c := \{x \in \mathbb{R}^d : \|x\| > R\}$ there is a set $D_x \in \mathcal{B}(\mathbb{R}^d)$ satisfying

(a) $P(x, D_x) \geq \ell$,
(b) $\sup_{y \in D_x} \|y\| \leq \tilde\ell\,\|x\|$.

Then $V(x) := \|x\|$ is a Lyapunov function for P with $L := R + K < \infty$ and $\delta := 1 - (1 - \tilde\ell)\,\ell < 1$.

Proof. We distinguish whether $x \in B_R(0)$ or $x \in B_R(0)^c$.

Consider the case $x \in B_R(0)$: By assumption we have a.s. $V(Y_x) \leq V(x) + V(W_x)$, such that

$$PV(x) = \mathbb{E}\, V(Y_x) \leq V(x) + \mathbb{E}\|W_x\| \leq R + K \leq \delta V(x) + R + K.$$

Consider the case $x \in B_R(0)^c$: We have

$$PV(x) = \int_{D_x} V(y)\, P(x, \mathrm{d}y) + \int_{D_x^c} V(y)\, P(x, \mathrm{d}y).$$

For the first term we obtain

$$\int_{D_x} V(y)\, P(x, \mathrm{d}y) \overset{(b)}{\leq} \tilde\ell\, V(x)\, P(x, D_x).$$

To bound the second term observe that

$$\int_{D_x^c} V(y)\, P(x, \mathrm{d}y) = \mathbb{E}\big(\mathbf{1}_{D_x^c}(Y_x)\, V(Y_x)\big) \overset{(9)}{\leq} V(x)\, \mathbb{P}(Y_x \in D_x^c) + \mathbb{E}\|W_x\|.$$

Using $\mathbb{P}(Y_x \in D_x^c) = 1 - \mathbb{P}(Y_x \in D_x) = 1 - P(x, D_x)$ and combining both estimates above yields

$$PV(x) \leq \tilde\ell\, V(x)\, P(x, D_x) + \big(1 - P(x, D_x)\big)\, V(x) + L = \big[1 - P(x, D_x) + \tilde\ell\, P(x, D_x)\big]\, V(x) + L.$$

By the fact that $1 - (1 - \tilde\ell)\, P(x, D_x) \leq \delta$ the assertion is proven.
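The drift inequality (7) for V(x) = ‖x‖ can also be probed empirically for a given transition kernel; the following rough Monte Carlo sketch (the names and the toy kernel are illustrative choices, not taken from the paper) estimates PV(x) = E‖Y_x‖ and could, for instance, be applied to the ess_transition sketch from Section 2.1.

```python
import numpy as np

def empirical_drift(transition, x, n_samples=2000, seed=0):
    """Monte Carlo estimate of PV(x) = E[ ||Y_x|| ] for V(y) = ||y||,
    where `transition(x, rng)` implements one step of the kernel."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    return np.mean([np.linalg.norm(transition(x, rng)) for _ in range(n_samples)])

# Toy illustration (not elliptical slice sampling): the autoregressive kernel
# Y_x = 0.5 x + N(0, I) satisfies (7) with delta = 0.5 and L = E||N(0, I)||.
toy_kernel = lambda x, rng: 0.5 * x + rng.standard_normal(x.size)
for norm in (1.0, 10.0, 100.0):
    x = np.full(5, norm / np.sqrt(5.0))            # a state with ||x|| = norm
    print(norm, empirical_drift(toy_kernel, x))    # stays below 0.5 * norm + L
```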
We apply this proposition in the context of elliptical slice sampling and obtain the following result.

Lemma 3.3. Assume that there exist an $\alpha \in (0, 1/\sqrt{2}]$ and R > 0, such that $B_{\alpha\|x\|}(0) \subseteq G_{\varrho(x)}$ for all $x \in B_R(0)^c$. Then, the function $V(x) := \|x\|$ is a Lyapunov function for E with some δ ∈ [0, 1) and L ∈ [0, ∞).

Proof. From (2) we have for all $x, w \in \mathbb{R}^d$ and any $\theta \in [0, 2\pi]$ that

$$\|p(x, w, \theta)\| \leq \|x\| + \|w\|. \tag{10}$$

Thus, condition (9) is satisfied for the transition kernel E with $W_x \sim \mu_0$ being the random variable W in line 1 of Algorithm 1. Next, we show that for any $x \in B_R(0)^c$ and $D_x := B_{\alpha\|x\|}(0)$ the assumptions (a) and (b) of Proposition 3.2 are satisfied for an ℓ ∈ (0, 1] and an $\tilde\ell \in [0, 1)$. Obviously, $\sup_{y \in D_x} \|y\| \leq \tilde\ell \|x\|$ for $\tilde\ell = \frac{1}{\sqrt{2}} < 1$, even for all $x \in \mathbb{R}^d$. Thus, it is sufficient to find a number ℓ ∈ (0, 1] such that $E(x, D_x) \geq \ell$ for all $x \in B_R(0)^c$.

For this notice that the probability to move to a set $A \in \mathcal{B}(\mathbb{R}^d)$ after all trials described in lines 6–13 of Algorithm 1 is larger than the probability to move to A after exactly one iteration of the loop. Thus, for any $x, w \in \mathbb{R}^d$, $t \in [0, \varrho(x)]$ and $A \in \mathcal{B}(\mathbb{R}^d)$ we have

$$E_{x,w,t}(A) \geq \frac{1}{2\pi} \int_0^{2\pi} \mathbf{1}_{A \cap G_t}\big(p(x, w, \theta)\big)\, \mathrm{d}\theta, \tag{11}$$

with $E_{x,w,t}$ as given in (3). Further, notice that for any $x \in B_R(0)^c$ and any $t \in [0, \varrho(x)]$ we have $D_x = B_{\alpha\|x\|}(0) \subseteq G_{\varrho(x)} \subseteq G_t$. Defining $\widetilde\Theta$ to be a $[0, 2\pi]$-uniformly distributed random variable and using (11), we have for any $x \in B_R(0)^c$, $w \in \mathbb{R}^d$ and $t \in [0, \varrho(x)]$ that

$$E_{x,w,t}(D_x) \geq \frac{1}{2\pi} \int_0^{2\pi} \mathbf{1}_{D_x}\big(p(x, w, \theta)\big)\, \mathrm{d}\theta = \mathbb{P}\big(p(x, w, \widetilde\Theta) \in D_x\big).$$

Additionally, let $W \sim \mu_0$ be independent of $\widetilde\Theta$. Then we have for all $x \in B_R(0)^c$

$$E(x, D_x) \geq \mathbb{P}\big(p(x, W, \widetilde\Theta) \in D_x\big). \tag{12}$$

Hence, we need to study the event $p(x, W, \widetilde\Theta) \in D_x$ in more detail. We have $p(x, W, \widetilde\Theta) \in D_x$ if and only if $\|p(x, W, \widetilde\Theta)\| \leq \alpha \|x\|$, which is equivalent to

$$\|p(x, W, \widetilde\Theta)\|^2 = \|x\|^2 \cos^2(\widetilde\Theta) + \|W\|^2 \sin^2(\widetilde\Theta) + 2 \langle x, W\rangle \sin(\widetilde\Theta) \cos(\widetilde\Theta) \leq \alpha^2 \|x\|^2,$$

where $\langle\cdot,\cdot\rangle$ denotes the standard inner product on $\mathbb{R}^d$. Defining

$$A_W := \|x\|^2 - \|W\|^2, \qquad B_W := 2\langle x, W\rangle, \qquad C_W := (2\alpha^2 - 1)\|x\|^2 - \|W\|^2,$$

and using the trigonometric identities $\cos(2\theta) = 2\cos^2(\theta) - 1 = 1 - 2\sin^2(\theta)$ and $\sin(2\theta) = 2\cos(\theta)\sin(\theta)$, we have that $p(x, W, \widetilde\Theta) \in D_x$ is equivalent to

$$A_W \cos(2\widetilde\Theta) + B_W \sin(2\widetilde\Theta) \leq C_W.$$

Letting $\varphi_W \in [0, 2\pi)$ be an angle satisfying

$$\cos(\varphi_W) = \frac{A_W}{\sqrt{A_W^2 + B_W^2}}, \qquad \sin(\varphi_W) = \frac{B_W}{\sqrt{A_W^2 + B_W^2}},$$

and using the cosine of sum identity we get $\cos(2\widetilde\Theta - \varphi_W) \leq \frac{C_W}{\sqrt{A_W^2 + B_W^2}}$. At this point we have

$$\big\{p(x, W, \widetilde\Theta) \in D_x\big\} = \Big\{\cos(2\widetilde\Theta - \varphi_W) \leq \tfrac{C_W}{\sqrt{A_W^2 + B_W^2}}\Big\}. \tag{13}$$

Note that $A_W, B_W, C_W, \varphi_W$ are all random variables which depend on W, but are independent of $\widetilde\Theta$. We aim to condition on the event $\|W\|^2 \leq \frac{\alpha^2 R^2}{2 - \alpha^2}$. In this case $C_W < 0$ and $A_W > 0$, such that

$$\frac{C_W}{\sqrt{A_W^2 + B_W^2}} \geq \frac{C_W}{A_W} = \frac{(2\alpha^2 - 1)\|x\|^2 - \|W\|^2}{\|x\|^2 - \|W\|^2}.$$

The last fraction can be rewritten as

$$\frac{(\alpha^2 - 1)\big(\|x\|^2 - \|W\|^2\big) + \alpha^2 \|x\|^2 + (\alpha^2 - 2)\|W\|^2}{\|x\|^2 - \|W\|^2}$$

or equivalently as

$$(\alpha^2 - 1) + \frac{\alpha^2 \|x\|^2 - (2 - \alpha^2)\|W\|^2}{\|x\|^2 - \|W\|^2}.$$

The second term is non-negative, therefore we have

$$\frac{C_W}{\sqrt{A_W^2 + B_W^2}} \geq \alpha^2 - 1 > -1. \tag{14}$$

With $\ell_R := \mathbb{P}\big(\|W\|^2 \leq \tfrac{\alpha^2 R^2}{2 - \alpha^2}\big) > 0$ we have

$$\mathbb{P}\big(p(x, W, \widetilde\Theta) \in D_x\big) \geq \mathbb{P}\Big(p(x, W, \widetilde\Theta) \in D_x \;\Big|\; \|W\|^2 \leq \tfrac{\alpha^2 R^2}{2 - \alpha^2}\Big)\, \ell_R.$$

Now using (13) and (14) we have that

$$\mathbb{P}\Big(p(x, W, \widetilde\Theta) \in D_x \;\Big|\; \|W\|^2 \leq \tfrac{\alpha^2 R^2}{2 - \alpha^2}\Big) \geq \mathbb{P}\Big(\cos(2\widetilde\Theta - \varphi_W) \leq \alpha^2 - 1 \;\Big|\; \|W\|^2 \leq \tfrac{\alpha^2 R^2}{2 - \alpha^2}\Big).$$

For any random variable ξ independent of $\widetilde\Theta$ we have that the distribution of $\cos(2\widetilde\Theta - \xi)$ coincides with the distribution of $\cos(2\widetilde\Theta)$, since $\widetilde\Theta$ is uniformly distributed on $[0, 2\pi]$. Recall that $\varphi_W$ is independent of $\widetilde\Theta$. Therefore, with $\varepsilon_\alpha := \mathbb{P}\big(\cos(2\widetilde\Theta) \leq \alpha^2 - 1\big) > 0$ we obtain

$$\mathbb{P}\Big(\cos(2\widetilde\Theta - \varphi_W) \leq \alpha^2 - 1 \;\Big|\; \|W\|^2 \leq \tfrac{\alpha^2 R^2}{2 - \alpha^2}\Big) = \varepsilon_\alpha.$$

Putting everything together, we conclude that

$$E(x, D_x) \overset{(12)}{\geq} \mathbb{P}\big(p(x, W, \widetilde\Theta) \in D_x\big) \geq \varepsilon_\alpha\, \ell_R > 0,$$

and all assumptions of Proposition 3.2 are then satisfied with $\ell := \varepsilon_\alpha \ell_R$.
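For a concrete prior the quantities appearing in this proof can be evaluated numerically. The sketch below does this for an isotropic prior µ0 = N(0, σ²I), an illustrative assumption of ours, using that ε_α = P(cos(2Θ̃) ≤ α² − 1) = 1 − arccos(α² − 1)/π and that ℓ_R is then a χ²-probability.

```python
import numpy as np
from scipy.stats import chi2

def lemma33_constants(alpha, R, d, sigma2=1.0):
    """Constants from the proof of Lemma 3.3 for mu_0 = N(0, sigma2 * I_d):
    returns (eps_alpha, ell_R, ell, delta) with ell = eps_alpha * ell_R and
    delta = 1 - (1 - 1/sqrt(2)) * ell as in Proposition 3.2 (tilde-ell = 1/sqrt(2))."""
    assert 0.0 < alpha <= 1.0 / np.sqrt(2.0) and R > 0
    eps_alpha = 1.0 - np.arccos(alpha**2 - 1.0) / np.pi     # P(cos(2*Theta) <= alpha^2 - 1)
    threshold = alpha**2 * R**2 / (2.0 - alpha**2)          # bound on ||W||^2
    ell_R = chi2.cdf(threshold / sigma2, df=d)              # P(||W||^2 <= threshold)
    ell = eps_alpha * ell_R
    delta = 1.0 - (1.0 - 1.0 / np.sqrt(2.0)) * ell
    return eps_alpha, ell_R, ell, delta

# Example: lemma33_constants(alpha=0.5, R=4.0, d=10) yields a delta strictly
# below 1, typically very close to 1, reflecting that the bound is qualitative.
```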
3.3. Small Set

In this section we show that under suitable assumptions any compact set is small w.r.t. the transition kernel E of elliptical slice sampling.

Lemma 3.4. Assume that ϱ is bounded away from 0 and ∞ on any compact set. Then any compact set $G \subseteq \mathbb{R}^d$ is small w.r.t. E and the measure $\varepsilon\, \lambda_G$, where ε > 0 is some constant and $\lambda_G$ denotes the d-dimensional Lebesgue measure restricted to G.

Proof. Let $x \in G$ be arbitrary and recall that for any $A \in \mathcal{B}(\mathbb{R}^d)$ we have

$$E(x, A) = \frac{1}{\varrho(x)} \int_0^{\varrho(x)} \int_{\mathbb{R}^d} E_{x,w,t}(A)\, \mu_0(\mathrm{d}w)\, \mathrm{d}t,$$

where we argued in (11) that

$$E_{x,w,t}(A) \geq \frac{1}{2\pi} \int_0^{2\pi} \mathbf{1}_{A \cap G_t}\big(p(x, w, \theta)\big)\, \mathrm{d}\theta$$

for any $w \in \mathbb{R}^d$ and $t \in [0, \varrho(x)]$. Therefore, we obtain

$$E(x, A) \geq \frac{1}{2\pi \varrho(x)} \int_0^{\varrho(x)} \int_{\mathbb{R}^d} \int_0^{2\pi} \mathbf{1}_{A \cap G_t}\big(p(x, w, \theta)\big)\, \mathrm{d}\theta\, \mu_0(\mathrm{d}w)\, \mathrm{d}t.$$

Changing the order of integration yields

$$E(x, A) \geq \frac{1}{2\pi \varrho(x)} \int_0^{\varrho(x)} \int_0^{2\pi} \mathbb{E}\big[\mathbf{1}_{A \cap G_t}\big(p(x, W, \theta)\big)\big]\, \mathrm{d}\theta\, \mathrm{d}t$$

for some random vector $W \sim N(0, C)$. Define the auxiliary random vector $Z_{x,\theta} := p(x, W, \theta)$ with corresponding distribution $\nu_{x,\theta} := N\big(x \cos(\theta), \sin^2(\theta)\, C\big)$. Then

$$E(x, A) \geq \frac{1}{2\pi \varrho(x)} \int_0^{\varrho(x)} \int_0^{2\pi} \mathbb{E}\big[\mathbf{1}_{A \cap G_t}(Z_{x,\theta})\big]\, \mathrm{d}\theta\, \mathrm{d}t = \frac{1}{2\pi \varrho(x)} \int_0^{\varrho(x)} \int_0^{2\pi} \int_A \mathbf{1}_{G_t}(z)\, \nu_{x,\theta}(\mathrm{d}z)\, \mathrm{d}\theta\, \mathrm{d}t.$$

Using the fact that $\mathbf{1}_{G_t}(z) = \mathbf{1}_{[0, \varrho(z)]}(t)$ we have

$$E(x, A) \geq \frac{1}{2\pi \varrho(x)} \int_0^{2\pi} \int_A \int_0^{\varrho(x)} \mathbf{1}_{[0, \varrho(z)]}(t)\, \mathrm{d}t\, \nu_{x,\theta}(\mathrm{d}z)\, \mathrm{d}\theta.$$

Notice that

$$\frac{1}{\varrho(x)} \int_0^{\varrho(x)} \mathbf{1}_{[0, \varrho(z)]}(t)\, \mathrm{d}t = \frac{\min\{\varrho(x), \varrho(z)\}}{\varrho(x)} = \min\Big\{1, \frac{\varrho(z)}{\varrho(x)}\Big\}.$$

Moreover, for all $x, z \in G$ by the boundedness assumption on ϱ we have

$$\min\Big\{1, \frac{\varrho(z)}{\varrho(x)}\Big\} \geq \min\Big\{1, \frac{\inf_{a\in G} \varrho(a)}{\sup_{a\in G} \varrho(a)}\Big\} =: \beta_0 > 0,$$

such that

$$E(x, A) \geq \frac{\beta_0}{2\pi} \int_0^{2\pi} \nu_{x,\theta}(A \cap G)\, \mathrm{d}\theta \geq \frac{\beta_0}{2\pi} \int_{\pi/4}^{\pi/2} \nu_{x,\theta}(A \cap G)\, \mathrm{d}\theta.$$

Since G is a compact set, there exists a finite constant κ > 0, such that

$$(z - x\cos\theta)^T C^{-1} (z - x\cos\theta) \leq \kappa, \qquad x, z \in G,\; \theta \in [0, 2\pi].$$

Moreover, for all $\theta \in [\pi/4, \pi/2]$ we have $\tfrac12 \leq \sin^2(\theta) \leq 1$. Therefore, for such θ and $z \in G$ the density of the Gaussian distribution $\nu_{x,\theta}$ satisfies

$$\big(2\pi \sin^2(\theta)\big)^{-d/2} \det(C)^{-1/2} \exp\Big(-\frac{(z - x\cos\theta)^T C^{-1} (z - x\cos\theta)}{2\sin^2(\theta)}\Big) \geq (2\pi)^{-d/2} \det(C)^{-1/2} \exp(-\kappa),$$

and hence

$$\nu_{x,\theta}(A \cap G) \geq \exp(-\kappa)\, (2\pi)^{-d/2} \det(C)^{-1/2}\, \lambda_G(A),$$

such that finally with $\varepsilon := \tfrac{\beta_0 \exp(-\kappa)}{8}\, (2\pi)^{-d/2} \det(C)^{-1/2}$ we have $E(x, A) \geq \varepsilon\, \lambda_G(A)$, which finishes the proof.

Remark 3.5. For a compact set $G \subseteq \mathbb{R}^d$ suppose that $\varrho\colon G \to (0, \infty)$ with $0 < \inf_{x\in G} \varrho(x)$ and $\sup_{x\in G} \varrho(x) < \infty$. In this setting the same arguments as in the proof of Lemma 3.4 can be used to verify that the whole state space G is small w.r.t. elliptical slice sampling. This leads to the fact that elliptical slice sampling is uniformly ergodic in this scenario, see for example Theorem 15.3.1 in (Douc et al., 2018). For a summary of different ergodicity properties and their relations to each other we refer to Section 3.1 in (Rudolf, 2012).

3.4. Proof of Theorem 2.2

We apply Proposition 3.1. First, recall that in (Murray et al., 2010) it is verified that elliptical slice sampling is reversible w.r.t. µ and therefore µ is a stationary distribution of E. Hence, it is sufficient to provide a Lyapunov function and to check the smallness of $S_R$. By part 2 of Assumption 2.1 the requirements of Lemma 3.3 are satisfied, such that $V(x) := \|x\|$ is a Lyapunov function with some δ ∈ [0, 1) and L ∈ [0, ∞). By part 1 of Assumption 2.1 and Lemma 3.4 we obtain that for any R > 2L/(1 − δ) the set $S_R = B_R(0)$ is compact and therefore small w.r.t. the transition kernel E and some non-trivial finite measure. Therefore, all requirements of Proposition 3.1 are satisfied and the statement of Theorem 2.2 follows.

4. Illustrative Examples

In this section we verify, in toy scenarios as well as in more demanding settings, the conditions of Assumption 2.1 to illustrate the applicability of our result. In the supplementary material we provide a discussion in terms of the exponential family.

4.1. Gaussian Posterior

In (Murray et al., 2010) Gaussian regression is considered as a test scenario for elliptical slice sampling, since there the posterior distribution is again Gaussian. We see covering that setting as a minimal requirement for our theory. Here, for some $x_0 \in \mathbb{R}^d$ we have

$$\varrho(x) = \exp\Big(-\tfrac12 (x - x_0)^T \Sigma^{-1} (x - x_0)\Big), \qquad x \in \mathbb{R}^d, \tag{15}$$

that is, ϱ is proportional to a Gaussian density with non-degenerate covariance matrix Σ. Thus, the matrix $\Sigma \in \mathbb{R}^{d\times d}$ is symmetric and positive-definite, and we denote its eigenvalues by $\lambda_1, \ldots, \lambda_d$. Notice that all eigenvalues are strictly positive and define $\lambda_{\min} := \min_{i=1,\ldots,d} \lambda_i$ and $\lambda_{\max} := \max_{i=1,\ldots,d} \lambda_i$.
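Since Gaussian likelihood and Gaussian prior are conjugate, the posterior in this setting is N(m, S) with S = (C⁻¹ + Σ⁻¹)⁻¹ and m = SΣ⁻¹x0, which makes it a convenient correctness check for the sampler. A small sketch with illustrative choices of C, Σ and x0, reusing the ess_transition sketch from Section 2.1:

```python
import numpy as np

# assumes ess_transition(x, rho, chol_C, rng) from the sketch in Section 2.1 is in scope

d = 5
rng = np.random.default_rng(1)
C = np.eye(d)                               # prior covariance of mu_0 = N(0, C)
Sigma_inv = np.linalg.inv(0.5 * np.eye(d))  # inverse likelihood covariance in (15)
x0 = np.ones(d)

rho = lambda x: np.exp(-0.5 * (x - x0) @ Sigma_inv @ (x - x0))   # the density (15)

# exact posterior N(m, S) by Gaussian conjugacy
S = np.linalg.inv(np.linalg.inv(C) + Sigma_inv)
m = S @ Sigma_inv @ x0

# run the elliptical slice sampler and compare the empirical posterior mean with m
chol_C = np.linalg.cholesky(C)
x, samples = np.zeros(d), []
for _ in range(20_000):
    x = ess_transition(x, rho, chol_C, rng)
    samples.append(x)
print(np.mean(samples[2_000:], axis=0), m)   # the two vectors should roughly agree
```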
The covariance matrix induces a norm $\|\cdot\|_{\Sigma^{-1}}$ on $\mathbb{R}^d$ by $\|x\|^2_{\Sigma^{-1}} = x^T \Sigma^{-1} x$. It is well-known that the Euclidean and the $\Sigma^{-1}$-norm are equivalent. One has

$$\lambda_{\max}^{-1} \|x\|^2 \leq \|x\|^2_{\Sigma^{-1}} \leq \lambda_{\min}^{-1} \|x\|^2, \qquad x \in \mathbb{R}^d. \tag{16}$$

Now we are able to formulate and prove the following proposition guaranteeing the applicability of Theorem 2.2.

Proposition 4.1. For ϱ defined in (15) Assumption 2.1 is satisfied with $R = 4 \sqrt{\lambda_{\max}/\lambda_{\min}}\, \|x_0\|$ and $\alpha = \tfrac12 \sqrt{\lambda_{\min}/\lambda_{\max}}$.

Proof. Observe that ϱ is continuous, bounded by 1 and strictly larger than 0 everywhere, such that part 1 of Assumption 2.1 is true. By exploiting both inequalities in (16) we show part 2 of Assumption 2.1, that is, we verify that for all $x \in B_{4\sqrt{\lambda_{\max}/\lambda_{\min}}\,\|x_0\|}(0)^c$ holds $B_{\frac12\sqrt{\lambda_{\min}/\lambda_{\max}}\,\|x\|}(0) \subseteq G_{\varrho(x)}$. For this fix $x \in B_{4\sqrt{\lambda_{\max}/\lambda_{\min}}\,\|x_0\|}(0)^c$. Therefore, we have

$$\|x\| > 4 \sqrt{\lambda_{\max}/\lambda_{\min}}\, \|x_0\|. \tag{17}$$

Now let $y \in B_{\frac12\sqrt{\lambda_{\min}/\lambda_{\max}}\,\|x\|}(0)$. Therefore, we have

$$\|y\| \leq \tfrac12 \sqrt{\lambda_{\min}/\lambda_{\max}}\, \|x\|, \tag{18}$$

and one might observe that $G_{\varrho(x)} = \{y \in \mathbb{R}^d : \|y - x_0\|_{\Sigma^{-1}} \leq \|x - x_0\|_{\Sigma^{-1}}\}$. With this we obtain

$$\begin{aligned}
\|y - x_0\|_{\Sigma^{-1}} &\leq \|y\|_{\Sigma^{-1}} + \|x_0\|_{\Sigma^{-1}} \overset{(16)}{\leq} \frac{\|y\|}{\lambda_{\min}^{1/2}} + \|x_0\|_{\Sigma^{-1}} \overset{(18)}{\leq} \frac{\|x\|}{2\lambda_{\max}^{1/2}} + \|x_0\|_{\Sigma^{-1}} \\
&\overset{(16)}{\leq} \frac{\|x\|_{\Sigma^{-1}}}{2} + \|x_0\|_{\Sigma^{-1}} = \|x\|_{\Sigma^{-1}} - \|x_0\|_{\Sigma^{-1}} - \Big(\frac{\|x\|_{\Sigma^{-1}}}{2} - 2\|x_0\|_{\Sigma^{-1}}\Big) \\
&\overset{(16)}{\leq} \|x - x_0\|_{\Sigma^{-1}} - \Big(\frac{\|x\|}{2\lambda_{\max}^{1/2}} - 2\|x_0\|_{\Sigma^{-1}}\Big) \overset{(17)}{\leq} \|x - x_0\|_{\Sigma^{-1}} - \Big(\frac{2\|x_0\|}{\lambda_{\min}^{1/2}} - 2\|x_0\|_{\Sigma^{-1}}\Big) \overset{(16)}{\leq} \|x - x_0\|_{\Sigma^{-1}},
\end{aligned}$$

which provides the desired result.

In Gaussian process regression as well as Bayesian inverse problems with linear forward maps the resulting posterior distribution has again a Gaussian density ϱ with respect to the Gaussian prior µ0. However, in these applications the corresponding covariance matrix of ϱ is typically only positive semi-definite, and we have to replace $\Sigma^{-1}$ in (15) by its pseudo-inverse $\Sigma^{\dagger}$. We emphasize that also in this more general situation Assumption 2.1 is satisfied, since ϱ is then simply constant on the null space of Σ and on its orthogonal complement we can apply Proposition 4.1.

4.2. Multi-modality

In the previous section we considered the setting of a Gaussian posterior distribution µ. In particular, ϱ had just a single peak. It seems that such a requirement is not necessary to verify the crucial Assumption 2.1. Here we introduce a class of density functions which might behave almost arbitrarily in their center (the central part of the state space) and exhibit a certain tail behavior. For formulating the result, let $|\cdot|$ be a norm on $\mathbb{R}^d$ which is equivalent to the Euclidean norm $\|\cdot\|$, that is, there exist constants $c_1, c_2 \in (0, \infty)$ such that

$$c_1 \|x\| \leq |x| \leq c_2 \|x\|, \qquad x \in \mathbb{R}^d. \tag{19}$$

Proposition 4.2. For some $R' > 0$ and some $x_0 \in \mathbb{R}^d$ let $\varrho_{R'}\colon B_{R'}(x_0) \to (0, \infty)$ be continuous and let $r\colon [R', \infty) \to (0, \infty)$ be decreasing. Furthermore, suppose that

$$\inf_{z \in B_{R'}(x_0)} \varrho_{R'}(z) \geq \sup_{t \geq R'} r(t). \tag{20}$$

Then, the function

$$\varrho(x) := \begin{cases} \varrho_{R'}(x) & x \in B_{R'}(x_0), \\ r(|x - x_0|) & x \in B_{R'}(x_0)^c, \end{cases}$$

satisfies Assumption 2.1 with $R = \max\{R', 4 \tfrac{c_2}{c_1} \|x_0\|\}$ and $\alpha = \tfrac{c_1}{2 c_2}$.

Proof. By the continuity of $\varrho_{R'}$ and the fact that r is strictly positive as well as decreasing, part 1 of Assumption 2.1 is satisfied. For part 2 let $x \in B_R(0)^c$, i.e., $\|x\| > R'$ and $\|x\| > 4 \tfrac{c_2}{c_1} \|x_0\|$. Hence, we have by (20) and the decreasing property of r that

$$G_{\varrho(x)} \supseteq B_{R'}(x_0) \cup \big\{y \in B_{R'}(x_0)^c : |y - x_0| \leq |x - x_0|\big\}.$$

Now let $y \in B_{\alpha\|x\|}(0)$ and distinguish two cases:

1. For $y \in B_{R'}(x_0)$ we immediately have $y \in G_{\varrho(x)}$, and we are done.

2. For $y \in B_{R'}(x_0)^c \cap B_{\alpha\|x\|}(0)$ we obtain, due to $\|y\| \leq \tfrac{c_1}{2 c_2} \|x\|$, that

$$|y - x_0| \leq |y| + |x_0| \overset{(19)}{\leq} \tfrac12 |x| + |x_0|$$

and, furthermore, by exploiting $\|x\| > 4 \tfrac{c_2}{c_1} \|x_0\|$ (which by (19) yields $|x| \geq 4|x_0|$),

$$\tfrac12 |x| + |x_0| = |x| - |x_0| - \big(\tfrac12 |x| - 2 |x_0|\big) \overset{(19)}{\leq} |x - x_0| - 2|x_0| + 2|x_0| = |x - x_0|,$$

which leads again to $y \in G_{\varrho(x)}$.

Both cases combined yield the statement.
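Propositions 4.1 and 4.2 verify the ball inclusion of Assumption 2.1 analytically; when experimenting with other densities, the inclusion $B_{\alpha\|x\|}(0) \subseteq G_{\varrho(x)}$ for ‖x‖ > R can at least be probed by random spot checks. A heuristic sketch (not a proof; all names and parameter choices are illustrative):

```python
import numpy as np

def spot_check_assumption(rho, alpha, R, d, n_x=200, n_y=200, seed=0):
    """Randomly test whether B_{alpha ||x||}(0) appears to lie inside G_{rho(x)}
    for states x with ||x|| > R (part 2 of Assumption 2.1)."""
    rng = np.random.default_rng(seed)
    for _ in range(n_x):
        x = rng.standard_normal(d)
        x *= R * (1.0 + 9.0 * rng.random()) / np.linalg.norm(x)      # ||x|| in [R, 10R)
        for _ in range(n_y):
            y = rng.standard_normal(d)
            y *= alpha * np.linalg.norm(x) * rng.random() / np.linalg.norm(y)
            if rho(y) < rho(x):                  # y would fall outside G_{rho(x)}
                return False, x, y
    return True, None, None

# Example: the Gaussian density of Proposition 4.1 with Sigma = I and x0 = 0,
# rho = lambda z: float(np.exp(-0.5 * z @ z)), passes for any alpha <= 1, R > 0.
```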
To state an example which satisfies the assumption of Proposition 4.2 we consider the following "volcano density".

Example 4.3. Set $\varrho(x) := \exp(\|x\| - \tfrac12 \|x\|^2)$. Let $|\cdot| = \|\cdot\|$, $x_0 = 0$, $R' = 2$, $r(t) := \exp(t - t^2/2)$ and let $\varrho_{R'}$ be the restriction of ϱ to $B_2(0)$. It is easily checked that for this choice of parameters all required properties are satisfied. One can argue that the function ϱ is highly multimodal, since its maximum is attained on a (d−1)-dimensional manifold (a sphere). For illustration, it is plotted in Figure 1.

Figure 1. Plot of the function $x \mapsto \exp(\|x\| - \tfrac12 \|x\|^2)$ for d = 2.

4.3. Volcano Density and Limitations of the Result

In the last section we showed the applicability of Theorem 2.2 for a "volcano density". Here we use this density differently. Namely,

$$\mu(\mathrm{d}x) \propto \exp\big(\|x\| - \tfrac12 \|x\|^2\big)\, \mathrm{d}x,$$

that is, the Lebesgue density of µ is proportional to the function plotted in Figure 1. Setting $\mu_0 = N(0, I)$ with identity matrix I, we obtain

$$\varrho(x) = \exp(\|x\|), \qquad x \in \mathbb{R}^d. \tag{21}$$

Observe that in this setting for any $x \in \mathbb{R}^d$ we have $G_{\varrho(x)} = \{y \in \mathbb{R}^d : \|y\| \geq \|x\|\} = B_{\|x\|}(0)^c$, such that $G_{\varrho(x)}$ never completely contains a ball around the origin and Assumption 2.1 cannot be satisfied.

For this scenario we conduct numerical experiments in various dimensions, namely, d = 10, 30, 100, 300, 1000. Although our sufficient Assumption 2.1 is not satisfied², we still observe a good performance of the elliptical slice sampler. In particular, its statistical efficiency in terms of the effective sample size (ESS) seems to be independent of the dimension, see Figure 2. To check whether this dimension-independent behavior is inherently due to the particular setting or not, we also consider other Markov chain based sampling algorithms.

For estimating the ESS we use an empirical proxy of the autocorrelation function

$$\gamma_f(k) := \mathrm{Corr}\big(f(X_{n_0}), f(X_{n_0+k})\big)$$

of the underlying Markov chain $(X_k)_{k\in\mathbb{N}}$ for a chosen quantity of interest $f\colon \mathbb{R}^d \to \mathbb{R}$, where $n_0$ denotes a burn-in parameter. Since the ESS takes the form

$$\mathrm{ESS}\big(n, f, (X_k)_{k\in\mathbb{N}}\big) = \frac{n}{1 + 2 \sum_{k=1}^{\infty} \gamma_f(k)},$$

where $n \in \mathbb{N}$ denotes the chosen sample size, we approximate it by using the empirical proxy of $\gamma_f(k)$ and truncating the summation at $k = 10^4$.

In Figure 2 we display estimates of the ESS for four different Markov chain Monte Carlo algorithms, namely, the random walk Metropolis algorithm (RWM), the preconditioned Crank–Nicolson Metropolis (pCN), the simple slice sampler and the elliptical one. For each algorithm we set the initial state to be $0 \in \mathbb{R}^d$ and compute the ESS for $f(x) := \log(1 + \|x\|)$, $n_0 := 10^5$ and $n := 10^6$. Both Metropolis algorithms (the RWM and the pCN Metropolis) were tuned to an averaged acceptance probability of approximately 0.25. We clearly see in Figure 2 the dimension-dependence of the ESS for the simple slice sampler³ and the RWM. In contrast to that, the results for the elliptical slice sampler and the pCN Metropolis indicate a dimension-independent efficiency. Let us remark that elliptical slice sampling does not need to be tuned in comparison to the pCN Metropolis, which performs similarly. However, the price for this is the requirement of evaluating the function ϱ more often within a single transition. Here the function ϱ was evaluated on average 1.5 times in each iteration of the elliptical slice sampler.

²In the supplementary material we provide further discussions how Assumption 2.1 can be satisfied in this scenario by taking a modification into account.

³In the light of (Natarovskii et al., 2021) the dimension-dependent behavior for simple slice sampling is not surprising. There, for a certain class of ϱ a spectral gap of size 1/d is proven.

Figure 2. Proxies for ESS for different MCMC algorithms depending on the dimension of the space (shown for the elliptical slice sampler, the preconditioned Crank–Nicolson Metropolis, the simple slice sampler and the random walk Metropolis).
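A sketch of the ESS proxy just described, based on the empirical autocorrelations of f along the chain and truncating the sum at a maximal lag (names are illustrative):

```python
import numpy as np

def ess_proxy(chain, f, burn_in=0, max_lag=10_000):
    """Effective sample size proxy n / (1 + 2 * sum_k gamma_f(k)), where gamma_f(k)
    are empirical autocorrelations of f along the chain, truncated at max_lag."""
    y = np.asarray([f(x) for x in chain[burn_in:]], dtype=float)
    y -= y.mean()
    n = y.size
    var = (y @ y) / n
    acf = np.array([(y[:n - k] @ y[k:]) / (n * var)
                    for k in range(1, min(max_lag, n - 1) + 1)])
    return n / (1.0 + 2.0 * acf.sum())

# quantity of interest used in the experiments reported here
f = lambda x: np.log1p(np.linalg.norm(x))
```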
Intuitively, the example of this section is not covered by our theorem since the tail behavior of ϱ is "bad": for $\|x\| \to \infty$ we have $\varrho(x) \to \infty$. It seems that for convergence only the tail behavior of likelihood times prior, considered as a Lebesgue density, matters.

Let us briefly comment on different approaches how to verify the numerically observed dimension-independence. Similarly to the strategy employed in (Hairer et al., 2014; Rudolf & Sprungk, 2018) for the pCN Metropolis, one might be able to extend elliptical slice sampling to infinite-dimensional Hilbert spaces. If one proves the existence of an absolute spectral gap of the corresponding transition kernel, then this directly gives bounds on the total variation distance of the nth step distribution to the stationary one. Due to the infinite-dimensional setting one might argue that the estimate must be independent of the dimension. Another approach is to prove dimension-free Wasserstein contraction rates, as, for example, has been done in (Eberle, 2016; Eberle et al., 2019; De Bortoli & Durmus, 2019) for diffusion processes.

4.4. Logistic Regression

Suppose data $(\xi_i, y_i)_{i=1,\ldots,N}$ with $\xi_i \in \mathbb{R}^d$ and $y_i \in \{-1, 1\}$ for $i = 1, \ldots, N$ is given. For logistic regression the function $\varrho\colon \mathbb{R}^d \to (0, \infty)$ takes the form

$$\varrho(x) = \prod_{i=1}^{N} \frac{1}{1 + \exp(-y_i x^T \xi_i)}, \qquad x \in \mathbb{R}^d. \tag{22}$$

Moreover, assume we have a Gaussian prior distribution µ0 on $\mathbb{R}^d$ with $\mu_0 = N(0, I)$. Thus, the distribution of interest, i.e., the posterior distribution µ, is determined by $\mu(\mathrm{d}x) \propto \varrho(x)\, \mu_0(\mathrm{d}x)$. The function ϱ does not satisfy Assumption 2.1, since it has no vanishing tails. For example for $d = N = \xi_1 = y_1 = 1$ we have $\varrho(x) = (1 + \exp(-x))^{-1}$, which is increasing with $G_{\varrho(x)} = [x, \infty)$ for all $x \in \mathbb{R}$. Thus, ϱ cannot satisfy Assumption 2.1. In the general setting the phenomenon is the same and the arguments are similar. Therefore, our theory for elliptical slice sampling seems not to be applicable.

However, with a tail-shift modification we can satisfy Assumption 2.1. The idea is to take a small part of the Gaussian prior and shift it to the likelihood function, such that it gets a sufficiently nice tail behavior. For arbitrary ε ∈ (0, 1) set

$$\widetilde\mu_0 := N\big(0, (1 - \varepsilon)^{-1} I\big) \quad \text{and} \quad \widetilde\varrho(x) := \varrho(x)\, \exp\big(-\varepsilon \|x\|^2 / 2\big). \tag{23}$$

Observe that $\widetilde\varrho$ has, in contrast to ϱ, exponential tails. Moreover, note that $\mu_0(\mathrm{d}x) \propto \exp(-\varepsilon\|x\|^2/2)\, \widetilde\mu_0(\mathrm{d}x)$ and therefore $\mu(\mathrm{d}x) \propto \varrho(x)\, \mu_0(\mathrm{d}x) \propto \widetilde\varrho(x)\, \widetilde\mu_0(\mathrm{d}x)$. Now considering µ as given through $\widetilde\varrho$ and $\widetilde\mu_0$, our main theorem is applicable. In the supplementary material we prove the following result and provide a discussion of the tail-shift modification.

Proposition 4.4. For ε ∈ (0, 1) the function $\widetilde\varrho$ given in (23) satisfies Assumption 2.1 for $\alpha = \varepsilon/2$ and $R = 4N \min_{i=1,\ldots,N} \|\xi_i\| / \varepsilon$.

Finally, note that for having the guarantee of geometric ergodicity of elliptical slice sampling one can choose ε ∈ (0, 1) arbitrarily small, whereas for ε = 0 our theory does not apply.
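The tail-shift modification only re-factorizes the posterior, so one can run the elliptical slice sampler with (ϱ̃, µ̃0) instead of (ϱ, µ0). A sketch of both factorizations for the logistic-regression likelihood (22)–(23) on the log scale (data layout and names are illustrative):

```python
import numpy as np

def log_rho(x, xi, y):
    """log of the logistic-regression likelihood (22) for data xi (N, d), y in {-1, 1}^N."""
    return -np.sum(np.logaddexp(0.0, -y * (xi @ x)))

def log_rho_shifted(x, xi, y, eps):
    """log of the tail-shifted likelihood (23): rho(x) * exp(-eps * ||x||^2 / 2)."""
    return log_rho(x, xi, y) - 0.5 * eps * (x @ x)

# With the shifted likelihood the prior becomes mu_0_tilde = N(0, (1 - eps)^{-1} I),
# i.e. for the elliptical slice sampler only the scale of the w-draws changes:
#     w = rng.standard_normal(d) / np.sqrt(1.0 - eps)
# Both factorizations determine the same posterior mu(dx) ∝ rho(x) mu_0(dx);
# rho itself is np.exp(log_rho(...)) if the sampler is run on the linear scale.
```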
5. Conclusion

In this paper we provide a mild sufficient condition for the geometric ergodicity of the elliptical slice sampler in finite dimensions. In particular, it is satisfied if the density of the target measure with respect to a Gaussian measure µ0 is continuous, strictly positive and has a sufficiently nice tail behavior. Besides that, our numerical results indicate that (a) our condition is not necessary and (b) the elliptical slice sampler shows a dimension-independent efficiency. Both issues will be addressed in future research.

Acknowledgements

We thank the anonymous referees for their valuable remarks, in particular, for bringing the tail-shift modification to our attention. VN thanks the DFG Research Training Group 2088 for their support. BS acknowledges support of the DFG within project 389483880. DR gratefully acknowledges support of the DFG within project 432680300 SFB 1456 (subproject B02).

References

Andrieu, C., De Freitas, N., Doucet, A., and Jordan, M. I. An introduction to MCMC for machine learning. Machine Learning, 50(1-2):5–43, 2003.

Bierkens, J., Grazzi, S., Kamatani, K., and Roberts, G. The boomerang sampler. In International Conference on Machine Learning, pp. 908–918. PMLR, 2020.

Cotter, S. L., Roberts, G. O., Stuart, A. M., and White, D. MCMC methods for functions: modifying old algorithms to make them faster. Statistical Science, pp. 424–446, 2013.

De Bortoli, V. and Durmus, A. Convergence of diffusions and their discretizations: from continuous to discrete processes and back. arXiv preprint arXiv:1904.09808, 2019.

Douc, R., Moulines, E., Priouret, P., and Soulier, P. Markov Chains. Springer, 2018.

Eberle, A. Reflection couplings and contraction rates for diffusions. Probability Theory and Related Fields, 166(3):851–886, 2016.

Eberle, A., Majka, M. B., et al. Quantitative contraction rates for Markov chains on general state spaces. Electronic Journal of Probability, 24, 2019.

Fagan, F., Bhandari, J., and Cunningham, J. P. Elliptical slice sampling with expectation propagation. In UAI, 2016.

Hahn, P. R., He, J., and Lopes, H. F. Efficient sampling for Gaussian linear regression with arbitrary priors. Journal of Computational and Graphical Statistics, 28(1):142–154, 2019.

Hairer, M. and Mattingly, J. C. Yet another look at Harris ergodic theorem for Markov chains. In Seminar on Stochastic Analysis, Random Fields and Applications VI, pp. 109–117. Springer, 2011.

Hairer, M., Stuart, A. M., and Vollmer, S. J. Spectral gaps for a Metropolis–Hastings algorithm in infinite dimensions. The Annals of Applied Probability, 24(6):2455–2490, 2014.

Hosseini, B. and Johndrow, J. E. Spectral gaps and error estimates for infinite-dimensional Metropolis–Hastings with non-Gaussian priors. arXiv preprint arXiv:1810.00297, 2018.

Medina-Aguayo, F., Rudolf, D., and Schweizer, N. Perturbation bounds for Monte Carlo within Metropolis via restricted approximations. Stochastic Processes and their Applications, 130(4):2200–2227, 2020.

Meyn, S. and Tweedie, R. Markov Chains and Stochastic Stability. 2nd edn, Cambridge University Press (with a prologue by P. W. Glynn), 2009.

Murray, I. and Graham, M. Pseudo-marginal slice sampling. In Artificial Intelligence and Statistics, pp. 911–919, 2016.

Murray, I., Adams, R., and MacKay, D. Elliptical slice sampling. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pp. 541–548, 2010.

Natarovskii, V., Rudolf, D., and Sprungk, B. Quantitative spectral gap estimate and Wasserstein contraction of simple slice sampling. The Annals of Applied Probability, 31(2):806–825, 2021.

Neal, R. M. Probabilistic inference using Markov chain Monte Carlo methods. Department of Computer Science, University of Toronto, Toronto, Ontario, Canada, 1993.
Neal, R. M. Regression and classification using Gaussian process priors. In J. M. Bernardo et al. (eds.), Bayesian Statistics, 6:475–501, 1999.

Neal, R. M. Slice sampling. Annals of Statistics, pp. 705–741, 2003.

Nishihara, R., Murray, I., and Adams, R. P. Parallel MCMC with generalized elliptical slice sampling. The Journal of Machine Learning Research, 15(1):2087–2112, 2014.

Roberts, G. O. and Tweedie, R. L. Geometric convergence and central limit theorems for multidimensional Hastings and Metropolis algorithms. Biometrika, 83(1):95–110, 1996.

Rudolf, D. Explicit error bounds for Markov chain Monte Carlo. Dissertationes Mathematicae, 485:1–93, 2012.

Rudolf, D. and Schweizer, N. Perturbation theory for Markov chains via Wasserstein distance. Bernoulli, 24(4A):2610–2639, 2018.

Rudolf, D. and Sprungk, B. On a generalization of the preconditioned Crank–Nicolson Metropolis algorithm. Foundations of Computational Mathematics, 18(2):309–343, 2018.