# Gibbsian Polar Slice Sampling

Philip Schär¹, Michael Habeck¹, Daniel Rudolf²

Polar slice sampling (Roberts & Rosenthal, 2002) is a Markov chain approach for approximate sampling of distributions that is difficult, if not impossible, to implement efficiently, but behaves provably well with respect to the dimension. By updating the directional and radial components of chain iterates separately, we obtain a family of samplers that mimic polar slice sampling, and yet can be implemented efficiently. Numerical experiments in a variety of settings indicate that our proposed algorithm outperforms the two most closely related approaches, elliptical slice sampling (Murray et al., 2010) and hit-and-run uniform slice sampling (MacKay, 2003). We prove the well-definedness and convergence of our methods under suitable assumptions on the target distribution.

1. Introduction

Bayesian inference heavily relies on efficient sampling schemes for posterior distributions that are defined on high-dimensional spaces, with probability density functions that can only be evaluated up to their normalizing constants. We develop a Markov chain method that can be used to approximately sample a large variety of target distributions. For convenience we frame the problem in a black-box setting. We assume that the distribution of interest $\nu$ is defined on $(\mathbb{R}^d, \mathcal{B}(\mathbb{R}^d))$ and given by a density, i.e. a measurable function $\varrho_\nu : \mathbb{R}^d \to \mathbb{R}_+ := [0, \infty)$, such that
$$\nu(A) = \frac{\int_A \varrho_\nu(x)\,dx}{\int_{\mathbb{R}^d} \varrho_\nu(x)\,dx}, \qquad A \in \mathcal{B}(\mathbb{R}^d).$$
In the following, knowledge of the normalization constant of $\varrho_\nu$, i.e. the denominator of the above ratio, is not required.

¹ Microscopic Image Analysis Group, Friedrich Schiller University Jena, Jena, Germany. ² Faculty of Computer Science and Mathematics, University of Passau, Passau, Germany. Correspondence to: Daniel Rudolf.

Proceedings of the 40th International Conference on Machine Learning, Honolulu, Hawaii, USA. PMLR 202, 2023. Copyright 2023 by the author(s).

Since exact sampling is generally infeasible, we aim to produce approximate samples from $\nu$, i.e. realizations of random variables whose distributions are, in some sense, close to $\nu$. To pursue this goal, we rely on Markov chain Monte Carlo (MCMC), which implements an irreducible and aperiodic Markov kernel $P$ that leaves $\nu$ invariant. For such a kernel, well-established theory shows that the distribution of iterates $X_n$ of a Markov chain $(X_n)_{n \in \mathbb{N}_0}$ with transition kernel $P$ converges to $\nu$ as $n \to \infty$. An MCMC method generates a realization $(x_n)_{1 \le n \le N} \subseteq \mathbb{R}^d$ of $X_1, \dots, X_N$ from $(X_n)_{n \in \mathbb{N}_0}$ and uses some or all of the realized chain iterates $x_n$ as approximate samples of $\nu$.

Here, we focus on slice sampling MCMC methods, which use auxiliary slice or threshold random variables $(T_n)_{n \in \mathbb{N}_0}$. In general, $T_n$ given $X_{n-1}$ follows a uniform distribution over an interval that depends on $X_{n-1}$. Specifically, we consider polar slice sampling (PSS), which was proposed by Roberts & Rosenthal (2002). PSS factorizes the target density $\varrho_\nu$ into
$$\varrho_\nu(x) = \varrho_\nu^{(0)}(x)\,\varrho_\nu^{(1)}(x), \qquad \varrho_\nu^{(0)}(x) = \|x\|^{1-d}, \quad \varrho_\nu^{(1)}(x) = \|x\|^{d-1}\varrho_\nu(x) \tag{1}$$
for $x \neq 0$, where $\|\cdot\|$ is the Euclidean norm on $\mathbb{R}^d$. Given an initial value $x_0 \in \mathbb{R}^d$ with $x_0 \neq 0$ and $\varrho_\nu(x_0) > 0$, PSS recursively realizes the $n$-th chain iterate $x_n$ from the $(n-1)$-th iterate $x_{n-1}$ as follows: An auxiliary variable $t_n$ is chosen as a realization of¹ $T_n \sim \mathcal{U}((0, \varrho_\nu^{(1)}(x_{n-1})))$.
Given $t_n$, the next chain iterate is generated using polar coordinates $X_n = R_n \Theta_n$, where the radius $R_n$ is a random variable on $\mathbb{R}_+$ and the direction $\Theta_n$ is a random variable on the $(d-1)$-sphere $\mathbb{S}^{d-1} := \{\theta \in \mathbb{R}^d \mid \|\theta\| = 1\} \subseteq \mathbb{R}^d$. To conform with the general slice sampling principle of Besag & Green (1993), the variables $(R_n, \Theta_n)$ need to be sampled from the joint uniform distribution
$$\mathcal{U}\big(\{(r, \theta) \in \mathbb{R}_+ \times \mathbb{S}^{d-1} \mid \varrho_\nu^{(1)}(r\theta) > t_n\}\big). \tag{2}$$
Standard slice sampling theory then guarantees that the resulting transition kernel has invariant distribution $\nu$. For the convenience of the reader, in Appendix A we elaborate on how PSS can be derived from the general slice sampling principle of Besag & Green (1993). In the following, we refer to the process of sampling $X_n$ given $T_n$ as the X-update of PSS.

¹ With $\mathcal{U}(G)$ we denote the uniform distribution on a set $G$ w.r.t. some reference measure that is clear from the context, as it is always either the Lebesgue measure, the surface measure on the $(d-1)$-sphere or a product measure of the two.

The theoretical analysis of PSS by Roberts & Rosenthal (2002) offered performance guarantees for approximate sampling that are dimension-independent for rotationally invariant, log-concave target densities. Overall, their analysis suggests that PSS is generally robust in high-dimensional scenarios. Despite this, PSS received little attention in the MCMC literature during the twenty years since its publication. We believe that this lack of engagement is not the result of PSS performing poorly on paper, but rather that of practical challenges in efficiently implementing it. Concurrent work by Rudolf & Schär (2023) supports this view. They prove, again in the rotationally invariant, log-concave setting, dimension-independent spectral gap estimates for PSS, which imply dimension-independence w.r.t. the mean squared error and the asymptotic variance within the central limit theorem of the MCMC average of a summary function.

The practical challenge in implementing PSS is that the polar variables $(R_n, \Theta_n)$ need to be jointly drawn uniformly from a high-dimensional set that often has a complicated structure. Therefore, this step is usually implemented by an acceptance/rejection scheme using uniform samples from a tractable superset. In moderate to high dimensions, for target densities $\varrho_\nu$ that are not rotationally invariant, the fraction of directions $\theta \in \mathbb{S}^{d-1}$ for which the set $\{r \in \mathbb{R}_+ \mid \varrho_\nu^{(1)}(r\theta) > t\}$ is non-empty becomes tiny for most thresholds $t$ occurring as realizations of $T_n$. This usually leads to an impractically low acceptance rate of the aforementioned acceptance/rejection scheme, such that in expectation an astronomically large number of proposals needs to be drawn during a single transition. In other words, although a valid implementation of PSS is available in principle, the iterations of the sampler are computationally inefficient, resulting in exceedingly long simulation runs.

To address this deficiency, we develop an MCMC framework that imitates PSS, but is guaranteed to run in a computationally efficient manner. Imitating here refers to keeping the PSS structure and splitting the difficult joint uniform sampling of radius and direction of (2) into separate steps. Intuition suggests that if the resulting transition mechanism is close to the original version of PSS, then some of its desirable convergence properties will also be inherited. The eventually proposed MCMC algorithm is essentially tuning-free and explores the state space remarkably quickly.
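To make the roles of the two factors in (1) concrete, the following minimal Python sketch shows the transform $\varrho_\nu^{(1)}$ and the threshold draw, both in log-space (as also recommended in Appendix C for numerical stability). The function names are ours, not part of the reference implementation.

```python
import numpy as np

def log_rho1(log_rho_nu, x):
    """Evaluate log varrho^(1)_nu(x) = (d - 1) * log ||x|| + log varrho_nu(x)."""
    d = x.shape[0]
    return (d - 1) * np.log(np.linalg.norm(x)) + log_rho_nu(x)

def draw_log_threshold(log_rho_nu, x_prev, rng):
    """Draw log t_n for T_n ~ U((0, varrho^(1)_nu(x_{n-1}))) in log-space."""
    # log(a * U) = log(a) + log(U) with U ~ U(0, 1)
    return log_rho1(log_rho_nu, x_prev) + np.log(rng.uniform())
```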
We provide a basic theoretical underpinning of our method, proving that it asymptotically samples from the target distribution $\nu$ under mild regularity conditions. Moreover, we illustrate its potential to improve upon related methods through a series of numerical experiments.

The remainder of this paper is structured as follows: In Section 2 we propose our modifications of PSS, which we term Gibbsian polar slice sampling. To provide a better understanding, we present different variants that culminate in a Gibbsian polar slice sampler that incorporates a stepping-out and shrinkage procedure. In Section 3 we provide theoretical support for our methods. We discuss possible extensions and alternative mechanisms that can also be used in our framework in Section 4 and comment on related approaches in Section 5. In Section 6 we present a number of numerical experiments comparing our algorithm to the two most closely related ones that are similarly feasible to use in practice. We conclude with a short discussion in Section 7.

2. Gibbsian Polar Slice Sampling

The idea is to decompose the X-update of PSS by replacing the joint sampling of radius $r_n$ and direction $\theta_n$ with separate updates of both variables in a Gibbsian fashion.

Our initial modification of PSS is mostly of theoretical interest. First, we only update the directional component of the last chain iterate $x_{n-1}$, resulting in an intermediate state $r_{n-1}\theta_n$, where $r_{n-1} = \|x_{n-1}\|$ and $\theta_n$ is a realization of
$$\Theta_n \sim \mathcal{U}\big(\{\theta \in \mathbb{S}^{d-1} \mid \varrho_\nu^{(1)}(r_{n-1}\theta) > t_n\}\big). \tag{3}$$
We then update the radial component of the intermediate state, resulting in the new chain iterate $x_n := r_n\theta_n$, where $r_n$ is a realization of
$$R_n \sim \mathcal{U}\big(\{r \in \mathbb{R}_+ \mid \varrho_\nu^{(1)}(r\theta_n) > t_n\}\big). \tag{4}$$
In contrast to standard Gibbs sampling, which cycles over coordinate-wise updates of the current state, we rely on a systematic conditional renewal in terms of the polar transformation components given as radius and direction. Although the modification does not solve the runtime issue, it lays the groundwork for further algorithmic improvements. At this stage, we would already like to mention Theorem 3.1, which states that $\nu$ is the invariant distribution of the transition kernel of the 1st variant of Gibbsian polar slice sampling (GPSS). Therefore, GPSS provides a correct MCMC method for targeting $\nu$. We note that by randomizing the deterministic updating scheme (i.e. rather than always updating the direction first and then the radius, both are sampled in random order), one can obtain the stronger statement that the transition kernel is reversible w.r.t. $\nu$.

The direction update is still a challenging implementation issue, since it requires sampling from a uniform distribution over a $(d-1)$-dimensional set with $d$ possibly being large. Therefore, to sample the next direction, we suitably use (ideal) spherical slice sampling (Habeck et al., 2023), a recently developed MCMC method for (approximate) sampling from distributions on $\mathbb{S}^{d-1}$. The radius update is not changed in this variant (since it consists of a one-dimensional uniform distribution sampling step). Let $\mathbb{S}^{d-2}_\theta$ be the great subsphere w.r.t. $\theta$, i.e. the set $\{\vartheta \in \mathbb{S}^{d-1} \mid \theta^T\vartheta = 0\}$ of directions in $\mathbb{S}^{d-1}$ that are orthogonal to $\theta$. In (Habeck et al., 2023) it is shown that one can sample uniformly from $\mathbb{S}^{d-2}_\theta$ as follows: Draw $V_1 \sim \mathcal{N}_d(0, I_d)$, set $V_2 := V_1 - (\theta^T V_1)\theta$ and finally $V_3 := V_2 / \|V_2\|$, then $V_3 \sim \mathcal{U}(\mathbb{S}^{d-2}_\theta)$. Furthermore, for any $y \in \mathbb{S}^{d-2}_\theta$ the set $\{\theta\cos(\omega) + y\sin(\omega) \mid \omega \in [0, 2\pi)\}$ is the unique great circle in $\mathbb{S}^{d-1}$ that contains both $\theta$ and $y$.
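The construction of a uniform draw from the great subsphere translates directly into code; a minimal sketch (the function name is ours):

```python
import numpy as np

def sample_great_subsphere(theta, rng):
    """Draw a point uniformly from the great subsphere orthogonal to theta."""
    v1 = rng.standard_normal(theta.shape[0])   # V1 ~ N_d(0, I_d)
    v2 = v1 - (theta @ v1) * theta             # remove the component along theta
    return v2 / np.linalg.norm(v2)             # normalize back onto the sphere
```

Together with the parametrization $\theta\cos(\omega) + y\sin(\omega)$ of the great circle through $\theta$ and $y$, this is all the randomness needed for the direction updates described next.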
A single iteration of this variant imitates the X-update of PSS as follows: First, a reference point $y$ is drawn uniformly from the great subsphere w.r.t. the direction $\theta_{n-1}$ of the previous sample. The new direction $\theta_n$ is then sampled uniformly from the great circle of $\mathbb{S}^{d-1}$ running through both $\theta_{n-1}$ and $y$, intersected with $\{\theta \in \mathbb{S}^{d-1} \mid \varrho_\nu^{(1)}(r_{n-1}\theta) > t_n\}$. Then, a new radius is chosen by sampling (4).

Variant 3: The Concrete Algorithm

In practice, the univariate direction and radius updates of the 2nd variant of GPSS still need to be implemented as acceptance/rejection schemes that might exhibit low acceptance rates. Therefore, the final variant of GPSS replaces both updates with adaptive procedures that are essentially guaranteed to be fast and result in an algorithm that empirically converges to the correct target distribution. In the radius update, we use the stepping-out and shrinkage procedure as proposed for uniform slice sampling by Neal (2003). In the direction update, we incorporate a shrinkage procedure. Actually, our direction update can be interpreted as running the shrinkage-based spherical slice sampler (Habeck et al., 2023).

Algorithm 1 Gibbsian Polar Slice Sampling
1: Input: target density $\varrho_\nu$, initial value $x_0 \in \mathbb{R}^d$ with $x_0 \neq 0$ and $\varrho_\nu(x_0) > 0$, initial interval length $w > 0$
2: Define $\varrho_\nu^{(1)} : x \mapsto \|x\|^{d-1}\varrho_\nu(x)$
3: Set $r_0 := \|x_0\|$ and $\theta_0 := x_0 / r_0$
4: for $n = 1, 2, \dots$ do
5:   Draw $T_n \sim \mathcal{U}((0, \varrho_\nu^{(1)}(x_{n-1})))$, call the result $t_n$
6:   $\theta_n :=$ GeodesicShrinkage($\varrho_\nu^{(1)}, t_n, r_{n-1}, \theta_{n-1}$)
7:   $r_n :=$ RadiusShrinkage($\varrho_\nu^{(1)}, t_n, r_{n-1}, \theta_n, w$)
8:   $x_n := r_n\theta_n$
9: end for
10: return $(x_n)_{n \ge 0}$

Algorithm 2 Geodesic Shrinkage
1: Input: transform $\varrho_\nu^{(1)}$ of target density, current threshold $t_n$, current radius $r_{n-1}$, current direction $\theta_{n-1}$
2: Draw $Y \sim \mathcal{U}(\mathbb{S}^{d-2}_{\theta_{n-1}})$, call the result $y$
3: Draw $\omega_{\max} \sim \mathcal{U}([0, 2\pi])$, set $\omega_{\min} := \omega_{\max} - 2\pi$
4: repeat
5:   Draw $\Omega \sim \mathcal{U}([\omega_{\min}, \omega_{\max}])$, call the result $\omega$
6:   Set $\theta_n := \theta_{n-1}\cos\omega + y\sin\omega$
7:   if $\omega < 0$ then $\omega_{\min} := \omega$ else $\omega_{\max} := \omega$
8: until $\varrho_\nu^{(1)}(r_{n-1}\theta_n) > t_n$
9: return $\theta_n$

Algorithm 3 Radius Shrinkage
1: Input: transform $\varrho_\nu^{(1)}$ of target density, current threshold $t_n$, current radius $r_{n-1}$, current direction $\theta_n$, initial interval length $w > 0$
2: Sample $U \sim \mathcal{U}([0, 1])$, call the result $u$
3: $r_{\min} := \max(r_{n-1} - u\,w,\ 0)$
4: $r_{\max} := r_{n-1} + (1 - u)\,w$
5: while $r_{\min} > 0$ and $\varrho_\nu^{(1)}(r_{\min}\theta_n) > t_n$ do
6:   $r_{\min} := \max(r_{\min} - w,\ 0)$
7: end while
8: while $\varrho_\nu^{(1)}(r_{\max}\theta_n) > t_n$ do
9:   $r_{\max} := r_{\max} + w$
10: end while
11: Sample $R_n \sim \mathcal{U}([r_{\min}, r_{\max}])$, call the result $r_n$
12: while $\varrho_\nu^{(1)}(r_n\theta_n) \le t_n$ do
13:   if $r_n < r_{n-1}$ then $r_{\min} := r_n$ else $r_{\max} := r_n$
14:   Sample $R_n \sim \mathcal{U}([r_{\min}, r_{\max}])$, call the result $r_n$
15: end while
16: return $r_n$

This ultimate variant is readily implemented by Algorithms 1, 2 and 3. Briefly, a single iteration works as follows: An element $y$ of the great subsphere is determined as in variant 2 and then the new direction $\theta_n$ is sampled via a shrinkage procedure on the great circle of $\mathbb{S}^{d-1}$ running through both $\theta_{n-1}$ and $y$. Finally, a new radius is chosen via a stepping-out and shrinkage procedure on the ray emanating from the origin in direction $\theta_n$, where shrinkage can be performed around the old radius $r_{n-1}$ because it satisfies the target condition by construction.
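The following Python sketch mirrors Algorithms 1, 2 and 3 step by step. The function names are ours and, unlike the reference implementation described in Appendix C, the sketch evaluates $\varrho_\nu^{(1)}$ directly rather than in log-space, so it is only meant to illustrate the control flow.

```python
import numpy as np

def geodesic_shrinkage(rho1, t, r_prev, theta_prev, rng):
    """Direction update (Algorithm 2): shrinkage on a random great circle."""
    d = theta_prev.shape[0]
    # reference point y, uniform on the great subsphere orthogonal to theta_prev
    v = rng.standard_normal(d)
    v -= (theta_prev @ v) * theta_prev
    y = v / np.linalg.norm(v)
    # shrinkage on the angle parametrizing the great circle through theta_prev and y
    w_max = rng.uniform(0.0, 2.0 * np.pi)
    w_min = w_max - 2.0 * np.pi
    while True:
        w = rng.uniform(w_min, w_max)
        theta = theta_prev * np.cos(w) + y * np.sin(w)
        theta /= np.linalg.norm(theta)  # re-normalize, cf. Appendix C
        if rho1(r_prev * theta) > t:
            return theta
        if w < 0.0:
            w_min = w
        else:
            w_max = w

def radius_shrinkage(rho1, t, r_prev, theta, w, rng):
    """Radius update (Algorithm 3): stepping-out and shrinkage on the ray."""
    u = rng.uniform()
    r_min = max(r_prev - u * w, 0.0)
    r_max = r_prev + (1.0 - u) * w
    # stepping-out
    while r_min > 0.0 and rho1(r_min * theta) > t:
        r_min = max(r_min - w, 0.0)
    while rho1(r_max * theta) > t:
        r_max += w
    # shrinkage towards the old radius
    while True:
        r = rng.uniform(r_min, r_max)
        if rho1(r * theta) > t:
            return r
        if r < r_prev:
            r_min = r
        else:
            r_max = r

def gpss(log_rho_nu, x0, w, n_iter, seed=0):
    """Gibbsian polar slice sampling (Algorithm 1), returning an array of samples."""
    rng = np.random.default_rng(seed)
    d = x0.shape[0]
    # direct evaluation of varrho^(1); a production implementation should stay in log-space
    rho1 = lambda x: np.linalg.norm(x) ** (d - 1) * np.exp(log_rho_nu(x))
    r, theta = np.linalg.norm(x0), x0 / np.linalg.norm(x0)
    samples = np.empty((n_iter, d))
    for n in range(n_iter):
        t = rng.uniform(0.0, rho1(r * theta))            # threshold T_n
        theta = geodesic_shrinkage(rho1, t, r, theta, rng)
        r = radius_shrinkage(rho1, t, r, theta, w, rng)
        samples[n] = r * theta
    return samples
```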
3. Validation

Theoretical Support. We provide some basic theoretical underpinning of the proposed methods with a focus on the 2nd variant of GPSS. The reasons for this are threefold: First, in principle this variant can also be implemented by using univariate acceptance/rejection schemes.² Second, we want to avoid any deep discussion about the stepping-out procedure that is involved in Algorithm 3. Third, since Algorithms 2 and 3 both contain a shrinkage procedure, no explicit representation of the corresponding transition kernels is available to our knowledge. The proofs of the following statements can be found in Appendix B.

² For the direction update this is immediately possible, since $[0, 2\pi]$ is a superset of the corresponding acceptance region. For the radius update this is more complicated and may require some structural knowledge about $\varrho_\nu$.

Theorem 3.1. The transition kernels corresponding to the 1st and 2nd variant of GPSS admit $\nu$ as invariant distribution.

To verify that $\nu$ is an invariant distribution, we argue that the direction and radius update are both individually reversible w.r.t. $\nu$. This could also serve as a strategy to prove the invariance of the 3rd variant of GPSS. For the direction update (Algorithm 2), this can be done by virtue of results of (Habeck et al., 2023; Hasenpflug et al., 2023). For the radius update, the interplay of stepping-out and shrinkage makes it difficult to prove reversibility rigorously. However, intuitively the arguments of Neal (2003) apply. Under weak regularity assumptions on $\varrho_\nu$ we also get a convergence statement.

Theorem 3.2. Assume that the target density $\varrho_\nu$ is strictly positive, i.e. $\varrho_\nu : \mathbb{R}^d \to (0, \infty)$. Then, for $\nu$-almost every initial value $x_0$, the distribution of an iterate $X_n$ of a Markov chain $(X_n)_{n \in \mathbb{N}_0}$ with $X_0 = x_0$ and transition kernel corresponding to the 2nd variant of GPSS converges to $\nu$ in total variation as $n \to \infty$.

Under an additional, restrictive structural requirement the convergence result also holds for the 3rd variant of GPSS, namely if we assume that the radius shrinkage of Algorithm 3 with threshold $t_n$, current radius $r_{n-1}$ and direction $\theta_n$ realizes sampling w.r.t. $\mathcal{U}(\{r \in \mathbb{R}_+ \mid \varrho_\nu^{(1)}(r\theta_n) > t_n\})$. For example, this is true if $\varrho_\nu^{(1)}$ is unimodal along rays. This actually is a scenario where the direction update relies on the shrinkage-based Algorithm 2, but the radius update is exact as in the 2nd variant of GPSS.

4. Alternative Transition Mechanisms

We emphasize that our 3rd variant of GPSS is just one of many possible ways to create a valid and efficiently implementable method that is based on the 2nd variant. For example, many of the ideas in Section 2.4 of (Murray et al., 2010) for modifying the shrinkage procedure used by elliptical slice sampling easily transfer to the shrinkage procedure used in the direction update of GPSS. For the radius update, one could drop stepping-out, i.e. just place an interval of size $w > 0$ randomly around the current radius and then use the shrinkage procedure right away (sketched below). With the same choice of the hyperparameter $w$, this reduces the number of target density evaluations per iteration at the cost of also reducing the speed at which the sampler explores the distribution of interest. One could also replace our radius update by the update mechanism of latent slice sampling (Li & Walker, 2023). In principle, one could even use Metropolis-Hastings transitions for either one of the updates. As long as each transition mechanism leaves the corresponding distribution in either (3) or (4) invariant, the resulting sampler should be well-behaved. In this sense, GPSS provides a flexible framework that leads to a variety of different samplers.
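For concreteness, here is a minimal sketch of the simplified radius update just described (a randomly placed interval of length $w$ around the current radius, then shrinkage only, no stepping-out), with the same interface as the radius-shrinkage sketch above; the function name is ours.

```python
import numpy as np

def radius_shrinkage_no_stepping_out(rho1, t, r_prev, theta, w, rng):
    """Alternative radius update: randomly placed interval of length w, shrinkage only."""
    u = rng.uniform()
    r_min = max(r_prev - u * w, 0.0)   # interval of length w placed randomly around r_prev
    r_max = r_prev + (1.0 - u) * w     # no stepping-out expansion
    while True:
        r = rng.uniform(r_min, r_max)
        if rho1(r * theta) > t:
            return r
        if r < r_prev:                 # shrink towards the current radius
            r_min = r
        else:
            r_max = r
```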
5. Related Work

The radius update we use for GPSS in the 3rd variant has previously been considered in Section 4.2.2 of (Thompson, 2011), where it was suggested to alternate it with a standard update on $\mathbb{R}^d$, i.e. one that does not keep the radius fixed. Intuitively, our GPSS improves upon this approach by introducing a dedicated direction update, which, by not wasting effort on exploring the entire sample space, can more efficiently determine a (hopefully) good new direction.

Furthermore, although GPSS was developed as an imitation of classical polar slice sampling, it also bears some resemblance to two other slice sampling methods, namely elliptical slice sampling (ESS) (Murray et al., 2010) and hit-and-run uniform slice sampling³ (HRUSS). Both methods have been analyzed theoretically to some extent, ESS in (Natarovskii et al., 2021; Hasenpflug et al., 2023), HRUSS in (Łatuszyński & Rudolf, 2014) and an idealized version of HRUSS in (Rudolf & Ullrich, 2018). Moreover, a sophisticated sampling algorithm that employs a large number of ESS-based Markov chains running in parallel has been suggested in (Nishihara et al., 2014).

³ The idea of HRUSS is already formulated in the paragraph "How slice sampling is used in real problems" in Section 29.7 of (MacKay, 2003).

Both ESS and HRUSS follow the same basic principle as all other slice sampling approaches: In iteration $n$, they draw a threshold $t_n$ w.r.t. the value of some fixed function $\varrho_\nu^{(1)}$ at the latest sample $x_{n-1}$. They then determine the next sample $x_n$ by approximately sampling from some distribution restricted to the slice (or level set) of $\varrho_\nu^{(1)}$ at threshold $t_n$.

For a given threshold, ESS draws an auxiliary variable from a mean zero multivariate Gaussian and performs shrinkage on the zero-centered ellipse running through both the auxiliary point and the latest sample, using the latter as the reference point for shrinkage. This is very similar to the direction update in the 3rd variant of GPSS,⁴ where we draw the auxiliary variable $y$ from the uniform distribution on the great subsphere w.r.t. the direction $\theta_{n-1}$ of the latest sample $x_{n-1}$ and then perform shrinkage on the unique great circle of $\mathbb{S}^{d-1}$ that contains both $y$ and $\theta_{n-1}$.

⁴ In fact, both approaches can even be implemented to use the same random variables, $d$ samples from $\mathcal{N}(0, 1)$, for determining the one-dimensional object to perform shrinkage on.

HRUSS, on the other hand, uses neither ellipses nor circles. For a given threshold, it proceeds by choosing a random direction (uniformly from $\mathbb{S}^{d-1}$) and determining the next sample $x_n$ by performing stepping-out and shrinkage procedures on the line through the latest sample $x_{n-1}$ in this direction. This is obviously very similar to the radius update used in the 3rd GPSS variant. The two major differences are that the latter does not use an auxiliary variable to determine the direction along which it performs the update, and that HRUSS performs an update on an entire line, whereas GPSS only considers a ray (by requiring radii to be non-negative).

Based on these comparisons, we may expect an iteration of our 3rd GPSS variant to be roughly as costly as one of ESS and one of HRUSS combined. Various experiments suggest this to be a good rule of thumb, though GPSS actually tends to be faster than ESS if the latter is not well-tuned (we explain what this means in Section 6).
Although HRUSS is clearly the fastest among the three, it usually takes a large number of very small steps, so that it tends to lag behind the other two samplers when considering metrics like effective sample size per time unit.

6. Experiments

We illustrate the strengths of our 3rd variant of GPSS by a series of numerical experiments in which we compare its performance with those of ESS and HRUSS.⁵ Source code allowing the reproduction (in nature) of our experimental results is provided in a GitHub repository.⁶ Some remarks on the pitfalls in implementing our method are provided in Appendix C, additional sampling statistics in Appendix D, an additional experiment on Bayesian logistic regression in Appendix E, and further illustrative plots in Appendix F.

⁵ All of our experiments were conducted on a workstation equipped with an AMD Ryzen 5 PRO 4650G CPU.
⁶ https://github.com/microscopic-imageanalysis/Gibbsian Polar Slice Sampling

Before we discuss the individual experiments, some explanation of our general approach to using and comparing these methods is in order. The positive hyperparameter $w$ in HRUSS determines the initial size of the search interval in the stepping-out procedure, playing the same role as the eponymous parameter of GPSS in Algorithm 3. Therefore, when we compare the two samplers in some fixed setting, we always use the same value of $w$ for both of them. We note, however, that neither parameter has much of an influence on the sampler's performance as long as they are not chosen orders of magnitude too small. Accordingly, we did not carefully tune them in any of the experiments.

As ESS is technically only intended for posterior densities with mean zero Gaussian priors, whenever we are in a different setting, we artificially introduce a mean zero Gaussian prior and use the target divided by the prior as likelihood. As we will see in the following, the performance of the method can be quite sensitive to the choice of the covariance matrix for this artificial prior, so we sometimes consider both a naive (or "untuned") and a sophisticated (or "tuned") choice in our comparisons. To make the sophisticated choices, we typically rely on the fact that the radii of samples from $\mathcal{N}_d(0, I_d)$ scatter roughly around $\sqrt{d}$, but also on an understanding of the desired range of radii gained either through theoretical analysis of the target density or by examining samples generated by GPSS. In one of our experiments, where the variables are highly correlated in the sense that their joint density (our target density) is very far from rotational invariance, we even use the empirical covariance of samples generated by GPSS as the covariance matrix for ESS.

6.1. Multivariate Standard Cauchy Distribution

First we considered the multivariate generalization of the pathologically heavy-tailed standard Cauchy distribution, i.e. the distribution $\nu$ with $\varrho_\nu(x) = (1 + \|x\|^2)^{-(d+1)/2}$. We chose the sample space dimension to be $d = 100$, initialized all samplers with $x_0 := (1, \dots, 1)^T$ and ran each of them for $N = 10^6$ iterations. Since $\nu$ is rotationally invariant and the covariance of naive ESS was already on a reasonable scale, we refrained from using a tuned version of ESS here. As the canonical test statistics mean and covariance are undefined for $\nu$, we instead measured the sample quality by letting the samplers estimate a probability w.r.t. $\nu$.
For this we chose the probability of the event that $\|Z\| > b$ and simultaneously $Z_1 > 0$, where $Z = (Z_1, \dots, Z_d)^T \sim \nu$ and $b > 0$. Described in formulas, the samplers estimated
$$p(b) := \nu(\{z \in \mathbb{R}^d \mid \|z\| > b,\ z_1 > 0\}), \tag{5}$$
where $z_1$ is the first entry of $z = (z_1, \dots, z_d)^T \in \mathbb{R}^d$. Note that the condition $\|Z\| > b$ measures how well the sample radii reflect those that would occur in exact sampling from $\nu$, whereas the condition $Z_1 > 0$ detects if the sample directions (i.e. the samples divided by their radii) are biased towards either one of the two half-spaces separated by the hyperplane through all but the first coordinate axes. Hence both radii and directions need to be sampled well in order for the estimate of $p(b)$ to quickly approach the true value.

Figure 1. Progression over iterations of the estimates of p(b) from (5) in the multivariate standard Cauchy experiment. Note the logarithmic scale. The dashed black line denotes the ground truth, which, using our knowledge of the target's rotational invariance, we could compute by numerically solving a one-dimensional integral.

The progressions of the samplers' estimates of $p(b)$ are shown in Figure 1. It can be seen that those produced by GPSS converge to the true value orders of magnitude faster than those by both HRUSS and ESS. In Appendix F, Figure 11, we provide a peek into the sampling behind these results by displaying the progression of radii and log radii over $N_{\text{window}} = 5 \cdot 10^4$ iterations. As exact sampling from $\nu$ is tractable (using samples from the multivariate normal and the $\chi^2$-distribution), we also display traces of exact i.i.d. samples for comparison. Figure 11 suggests that the samples produced by GPSS are comparable in quality to i.i.d. ones, and shows that those by HRUSS and ESS are certainly not. Upon closer examination, the reason for this becomes evident: Although all samplers take trips to the distribution's tails, those taken by HRUSS and ESS into the same distance last several orders of magnitude longer than those of GPSS. Consequently, HRUSS and ESS make their excursions to the extremely far-off parts of the tails (say, the points of radii in the thousands or more) with vanishingly small frequency. Thus GPSS needs a much shorter chain to properly reflect the tails of the target distribution. We acknowledge, however, that the advantage GPSS has over HRUSS in this setting would be considerably smaller if the target density was not centered around the origin.

6.2. Hyperplane Disk

Next we considered the target density
$$\varrho_\nu(x) = \exp\Big(-\Big(\textstyle\sum_{i=1}^d x_i\Big)^2 - \|x\|^2\Big) \tag{6}$$
for $x = (x_1, \dots, x_d)^T \in \mathbb{R}^d$. Intuitively, the sum term within the density leads to a concentration of its probability mass around a hyperplane, i.e. a $(d-1)$-dimensional subspace, given by the set of points $x \in \mathbb{R}^d$ for which the sum term vanishes. The norm term ensures that the function is integrable with Gaussian tails in all directions, which intuitively further concentrates the distribution around a circular disk within the hyperplane.

We set $d = 200$ and initialized all chains in an area of high probability mass. We then ran each sampler for $N = 10^4$ iterations. To tune ESS, we took all $N$ samples generated by GPSS, computed their empirical covariance matrix and used it as the covariance parameter for the artificial Gaussian prior of ESS. The results are shown in Appendix F, Figure 12. The progression of the sample radii suggests that GPSS has a considerable advantage over all other approaches.
However, impressions based on sample radii should be taken with some caution, since GPSS is the only method that specifically updates the radii during sampling. Accordingly, when considering empirical step sizes, the advantage of GPSS is less pronounced, but still present. For example, the mean step sizes are 5.0 for GPSS, 3.5 for tuned ESS, 2.4 for untuned ESS and 0.6 for HRUSS.

There are two drawbacks regarding the performance of tuned ESS, which by the aforementioned aspects is the closest competitor to our method in this setting. On the one hand, as noted before, we tuned ESS using the samples generated by GPSS, thus relying on the robust performance of GPSS to compute a good estimate of the target distribution's true covariance matrix. On the other hand, the computational overhead of sampling from a multivariate Gaussian with non-diagonal covariance matrix slows tuned ESS down significantly. As a result, tuned ESS consistently ran slower than all the other samplers in this experiment. The advantage of GPSS over tuned ESS is remarkable, not just because only the latter uses a proposal distribution adjusted to the shape of the target distribution, but also because the tails of the target are Gaussian, which in principle should benefit ESS (by virtue of its proposal distribution being Gaussian as well).

We attribute the relatively poor performance of HRUSS to the curse of dimensionality: As a result of the target density being narrowly concentrated around a hyperplane of a very high-dimensional space, from any given point in the target's high-probability region there are very few directions in which one can take large steps without leaving the high-probability region. Nevertheless, HRUSS isotropically samples the direction along which it will move, such that most of the time only small steps along the sampled direction are allowed.

6.3. Axial Modes

Our third experiment was concerned with the target density
$$\varrho_\nu(x) = \|x\|_\infty^4 \exp(-\|x\|_1). \tag{7}$$
Due to the counteracting forces of $\infty$-norm and $1$-norm, $\varrho_\nu$ possesses $2d$ fairly isolated modes, 2 along each coordinate axis, see Figure 8 (appendix) for an illustration. This particular structure enables an interesting quantitative diagnostic for the performance of MCMC methods targeting this distribution: For any given $x = (x_1, \dots, x_d)^T \in \mathbb{R}^d$, one can use the absolute values of its components to assign it to a pair of modes (that lie on the same coordinate axis) via
$$\operatorname{axis}(x) := \operatorname*{argmax}_{1 \le i \le d} |x_i|.$$
For a finite chain of samples $(x^{(n)})_{n=1,\dots,N} \subseteq \mathbb{R}^d$ that was generated as the output of some MCMC method, numerous quantitative diagnostics can then be applied to the values $(\operatorname{axis}(x^{(n)}))_{n=1,\dots,N}$, as illustrated in the sketch below. For example, one can say that the chain jumped between modes in step $i$ if and only if $\operatorname{axis}(x^{(i)}) \neq \operatorname{axis}(x^{(i-1)})$. One can then count the total number of jumps within the $N$ iterations (which provides information about how quickly the chain moved back and forth between the mode pairs) and compare these values between different chains. Alternatively, one could compute the mean dwelling time, i.e. the average number of iterations the chain spent at a mode pair until jumping to the next. This may be a more helpful quantity than the total number of jumps, because it is essentially independent of the number $N$ of iterations. It may also be worthwhile to consider the maximum dwelling time, i.e. the largest number of iterations the chain spent at a mode pair without leaving, as this is more suitable than mean dwelling time and total number of jumps for detecting whether a chain occasionally gets stuck at a mode pair for excessively many iterations.
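The diagnostics just described are straightforward to compute from a stored chain. The following minimal sketch (function names are ours) assumes the samples are stored row-wise in a NumPy array.

```python
import numpy as np

def axis_labels(samples):
    """Assign each sample to its mode pair via axis(x) = argmax_i |x_i|."""
    return np.argmax(np.abs(samples), axis=1)

def mode_jump_diagnostics(samples):
    """Total number of jumps, mean dwelling time and maximum dwelling time."""
    labels = axis_labels(samples)
    jumps = np.flatnonzero(labels[1:] != labels[:-1]) + 1   # steps i with a mode jump
    n_jumps = jumps.size
    # dwelling times are the lengths of the runs between consecutive jumps
    boundaries = np.concatenate(([0], jumps, [labels.size]))
    dwell = np.diff(boundaries)
    return n_jumps, dwell.mean(), dwell.max()
```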
We ran each sampler for $N = 10^5$ iterations in each of the dimensions $d = 10, 20, \dots, 100$. As initialization we used $x_0 := (5, 1, \dots, 1)^T \in \mathbb{R}^d$. For the ESS covariance we considered both the usual naive choice $\Sigma = I_d$ and the carefully hand-tuned choice $\Sigma = (5 + d/10)^2/d \cdot I_d$.

Figure 2. Progression over dimensions d of the mean dwelling time, determined based on N = 10^5 iterations, in the axial modes experiment, as described in Section 6.3.

Figure 3. Progression over dimensions d of the maximum dwelling times within N = 10^5 iterations in the axial modes experiment, as described in Section 6.3.

In Figures 2 and 3 we display for each sampler the progression over the dimensions $d$ of mean and maximum dwelling time. In terms of mean dwelling time, GPSS has a clear advantage over both untuned ESS and HRUSS; only the carefully tuned ESS is competitive with it. In terms of maximum dwelling time, GPSS is slightly ahead of all other samplers, even tuned ESS. In Appendix F, Figure 13, we provide a peek into the sampling behind these results by displaying the progression of the currently visited mode pair in the last $N_{\text{window}} = 2 \cdot 10^4$ iterations of the samplers' runs for the highest dimension $d = 100$.

6.4. Neal's Funnel

In our fourth experiment, we considered an arguably even more challenging target density that was originally proposed by Neal (2003) and is commonly termed Neal's funnel. It is given by
$$\varrho_\nu(x) = \mathcal{N}(x_1; 0, 9) \prod_{i=2}^{d} \mathcal{N}(x_i; 0, \exp(x_1)), \tag{8}$$
where we write $x = (x_1, \dots, x_d)^T \in \mathbb{R}^d$ and denote by $\mathcal{N}(z; \mu, \sigma^2)$ the density of $\mathcal{N}(\mu, \sigma^2)$ evaluated at $z$. As the name suggests, the density is shaped like a funnel, having both a very narrow and a very wide region of high probability mass, which smoothly transition into one another. Besides being a challenging target, the funnel is of particular interest to us because its marginal distribution in the first coordinate is simply $\mathcal{N}(0, 9)$ and a sampler needs to explore both the narrow and the wide part of the funnel equally well in order to properly approximate this marginal distribution via the marginals of its samples (i.e. the set containing each of its samples truncated after the first component).
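As an illustration of how such a target enters the samplers, here is a minimal sketch of the unnormalized log-density of (8) in NumPy (the function name is ours); in this form it can be passed directly to any of the slice samplers discussed here.

```python
import numpy as np

def log_neal_funnel(x):
    """Unnormalized log-density of Neal's funnel (8): x_1 ~ N(0, 9), x_i | x_1 ~ N(0, exp(x_1))."""
    x1, rest = x[0], x[1:]
    log_p = -0.5 * x1**2 / 9.0                      # first coordinate
    log_p -= 0.5 * np.sum(rest**2) / np.exp(x1)     # remaining coordinates
    log_p -= 0.5 * (x.shape[0] - 1) * x1            # x1-dependent normalization of their variance
    return log_p
```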
We ran the slice samplers for the funnel in dimension $d = 10$ (as suggested by Neal) and initialized all of them with $x_0 := (2, 0, \dots, 0)^T \in \mathbb{R}^d$. For ESS we used the relatively well-tuned covariance $\Sigma = \operatorname{diag}(9, 70, \dots, 70)$. For this experiment we also extend our comparison to a sampler that is only related to GPSS by the fact that it is also an MCMC method. Namely, we compare GPSS with the No-U-Turn Sampler (NUTS) (Hoffman & Gelman, 2014), which is widely regarded to be the current state of the art in MCMC sampling.⁷ We used the implementation of NUTS provided by the probabilistic programming Python library PyMC (Salvatier et al., 2016).

⁷ Note, however, that NUTS needs the target density to be differentiable and requires oracle access to its gradient, neither of which is true for the slice sampling methods we consider.

In this experiment we study the convergence to several target quantities. To enable a fair comparison of the samplers, we took differences in the speed of the iterations into account. This was achieved by allocating a fixed time budget of 1 minute to each sampler and letting it run for sufficiently many iterations to deplete this time budget, while tracking how much time had elapsed after each completed iteration.⁸ We assessed the convergence ten times per second, resulting in 600 measurement times in total. Finally, we determined for each measurement time $t$ and each sampler $s$ the exact number of iterations $n_{t,s} \in \mathbb{N}$ it had completed up to that time (using the aforementioned logs) and estimated the target quantities from only the $n_{t,s}$ samples produced in these iterations. As target quantities we considered mean, standard deviation, 0.001-quantile and 0.999-quantile. To generate estimates of these quantities from marginal samples, we simply used their empirical versions.

⁸ Except for NUTS, where we only approximated this by assuming the runtime per iteration to be constant.

Figure 4. Progression over runtime of the empirical mean of the marginal samples in the experiment on Neal's funnel (8). The dashed black line denotes the ground truth.

Figure 5. Progression over runtime of the empirical standard deviation of the marginal samples in the experiment on Neal's funnel (8). The dashed black line denotes the ground truth.

Figure 6. Progression over runtime of the empirical 0.001-quantile of the marginal samples in the experiment on Neal's funnel (8). The dashed black line denotes the ground truth.

Figure 7. Progression over runtime of the empirical 0.999-quantile of the marginal samples in the experiment on Neal's funnel (8). The dashed black line denotes the ground truth.

The progression over time of each sampler's approximations to these target quantities is shown in Figures 4, 5, 6 and 7. Additionally, we provide the marginal histograms of all samples generated within the time budget as well as a peek into the progression of the marginal samples and sample radii in Appendix F, Figure 14. It can be clearly seen from the first four figures that GPSS performs better overall than any of the other methods. We note that HRUSS is more competitive with it than ESS and that GPSS converges well despite completing only $3.39 \cdot 10^5$ iterations in the allotted time, which is less than half the $7.95 \cdot 10^5$ iterations completed by HRUSS, and about the same as the $2.94 \cdot 10^5$ completed by ESS. In other words, had we used the same number of iterations for all slice samplers, as in the other experiments, the convergence results would attribute to GPSS an even larger advantage. Regarding the performance of NUTS, we observe that it is very successful at retrieving the 0.999-quantile, but, due to its refusal to enter the narrow part of the funnel, it performs worst among the four methods not just for the 0.001-quantile, but also for mean and standard deviation.

7. Discussion

We introduced a Gibbsian polar slice sampling (GPSS) framework as a general Markov chain approach for approximate sampling from distributions on $\mathbb{R}^d$ given by an unnormalized Lebesgue density. The efficiently implementable version that we propose is essentially tuning-free: it has only a single hyperparameter, with little impact on the method's performance as long as it is not chosen orders of magnitude too small. GPSS can quickly produce samples of high quality, and numerical experiments indicate advantages compared to related approaches in a variety of settings. Although GPSS is generally quite robust, its performance slowly deteriorates when the distance between the target distribution's center of mass and the coordinate origin is increased.
This could potentially be avoided by automatically centering the target on the origin. However, such a modification would likely follow an adaptive MCMC approach and therefore result in a method that no longer fits the strict MCMC framework.

Particularly good use cases for GPSS appear to be heavy-tailed target distributions. For example, one could apply GPSS to intractable posterior distributions resulting from Cauchy priors (perhaps on just some of the variables) and likelihoods that do not change the nature of the tails. As illustrated in Section 6.1, GPSS can have enormous advantages over related methods when heavy tails are involved. Another type of target for which GPSS could be of practical use are distributions with strong funneling, which naturally occur in Bayesian hierarchical models (cf. Neal (2003)). By nature, these targets call for methods with variable step sizes, because they contain both very narrow regions, which necessitate small step sizes, and very wide regions, in which much larger step sizes are advantageous to speed up the exploration of the sample space. Due to their use of variable step sizes, slice sampling methods might be considered a natural choice and, as demonstrated in Section 6.4, GPSS appears to perform better than related slice samplers for such targets. We envision that many more applications for GPSS will be found.

Moreover, we think it may be possible to derive qualitative or even quantitative geometric convergence guarantees for GPSS. Aside from giving helpful insight into how well GPSS retains the desirable theoretical properties of PSS, this would further justify using GPSS in real-world applications, where the sample quality is often hard to validate. Finally, we emphasize that the 2nd variant of GPSS, the intermediate step between idealized framework and efficiently implementable method, can also be used as the foundation for a variety of other hybrid samplers (cf. Section 4).

Acknowledgements

We thank the anonymous referees for their suggestions. PS and MH gratefully acknowledge funding by the Carl Zeiss Foundation within the program "CZS Stiftungsprofessuren" and the project "Interactive Inference". We are grateful for the support of the DFG within project 432680300 SFB 1456, subprojects A05 and B02.

References

Besag, J. and Green, P. J. Spatial statistics and Bayesian computation. Journal of the Royal Statistical Society Series B, 55(1):25–37, 1993.

Blackard, J. A. Comparison of neural networks and discriminant analysis in predicting forest cover types. PhD thesis, Colorado State University, 1998.

Blackard, J. A., Dean, D. J., and Anderson, C. W. Covertype data set. URL https://archive.ics.uci.edu/ml/datasets/Covertype.

Douc, R., Moulines, E., Priouret, P., and Soulier, P. Markov Chains. Springer Series in Operations Research and Financial Engineering. Springer, Cham, 2018.

Habeck, M., Hasenpflug, M., Kodgirwar, S., and Rudolf, D. Geodesic slice sampling on the sphere. arXiv preprint arXiv:2301.08056, 2023.

Hasenpflug, M., Natarovskii, S., and Rudolf, D. Reversibility of elliptical slice sampling revisited. arXiv preprint arXiv:2301.02426, 2023.

Hoffman, M. D. and Gelman, A. The No-U-Turn sampler: adaptively setting path lengths in Hamiltonian Monte Carlo. Journal of Machine Learning Research, 15:1593–1623, 2014.

Łatuszyński, K. and Rudolf, D. Convergence of hybrid slice sampling via spectral gap. arXiv preprint arXiv:1409.2709, 2014.

Li, Y. and Walker, S. G. A latent slice sampling algorithm.
Computational Statistics and Data Analysis, 179, 2023. Mac Kay, D. Information Theory, Inference and Learning Algorithms. Cambridge University Press, 2003. Murray, I., Adams, R., and Mac Kay, D. Elliptical slice sampling. Journal of Machine Learning Research, 9: 541 548, 2010. Natarovskii, V., Rudolf, D., and Sprungk, B. Geometric convergence of elliptical slice sampling. Proceedings of the 38th International Conference on Machine Learning, 139:7969 7978, 2021. Neal, R. M. Slice sampling. The Annals of Statistics, 31(3): 705 767, 2003. Nishihara, R., Murray, I., and Adams, R. Parallel MCMC with generalized elliptical slice sampling. Journal of Machine Learning Research, 15, 2014. Roberts, G. O. and Rosenthal, J. S. The polar slice sampler. Stochastic Models, 18(2):257 280, 2002. Figure 8. Illustration of the axial modes target density (7) in dimension d = 2. Rudolf, D. and Sch ar, P. Dimension-independent spectral gap of polar slice sampling. ar Xiv preprint ar Xiv:2305.03685, 2023. Rudolf, D. and Ullrich, M. Comparison of hit-and-run, slice sampler and random walk metropolis. Journal of Applied Probability, 55, 2018. Salvatier, J., Wiecki, T. V., and Fonnesbeck, C. Probabilistic programming in python using Py MC3. Peer J Computer Science, 2(e55), 2016. Schilling, R. L. Measures, Integrals and Martingales. Cambridge University Press, 2005. Thompson, M. B. Slice Sampling with Multivariate Steps. Ph D thesis, University of Toronto, 2011. Tierney, L. Markov chains for exploring posterior distributions. The Annals of Statistics, 22(4):1701 1728, 1994. A. Formal Derivation In order to explain how PSS as described in Section 1 can be derived from the usual slice sampling framework laid out in (Besag & Green, 1993), we provide two technical tools. The first one is a well-known identity from measure theory. Proposition A.1. Any integrable real-valued function g on (Rd, B(Rd)) satisfies Z Rd g(x)dx = Z Sd 1 g(rθ)rd 1σd(dθ)dr, where σd is the surface measure on (Sd 1, B(Sd 1)). Proof. See for example Theorem 15.13 in (Schilling, 2005). Gibbsian Polar Slice Sampling Though this identity does not by itself involve any probabilistic quantities, we apply it in a stochastic setting to obtain our second tool, which we term sampling in polar coordinates. Corollary A.2. Let ξ be a distribution on (Rd, B(Rd)) with probability density ϱξ. Then by Proposition A.1 we get for any A B(Rd) that Rd 1A(x)ϱξ(x)dx 0 1A(rθ)ϱξ(rθ)rd 1drσd(dθ). Consequently a random variable X ξ can be sampled in polar coordinates as X := R Θ by sampling (R, Θ) from the joint distribution with probability density (r, θ) 7 ϱξ(rθ)rd 11R+(r) w.r.t. λ1 σd, where λ1 denotes the Lebesgue measure on (R, B(R)). We can now apply the second tool to PSS. In the framework of Besag & Green (1993), the X-update of slice sampling for a target density factorized as in (1) is to be performed by sampling Xn from the distribution with unnormalized density x 7 ϱ(0) ν (x)1(tn, )(ϱ(1) ν (x)) = x 1 d1(tn, )(ϱ(1) ν (x)). By Corollary A.2, this can equivalently be done by sampling Xn in polar coordinates as Xn := RnΘn, where (Rn, Θn) is drawn from the joint distribution with unnormalized density (r, θ) 7 r1 d1(tn, )(ϱ(1) ν (rθ)) rd 11R+(r) = 1(tn, )(ϱ(1) ν (rθ))1R+(r) w.r.t. λ1 σd. As this distribution is simply (2), PSS as described in Section 1 is equivalent to the slice sampler resulting from factorization (1) in the general slice sampling framework of Besag & Green (1993). 
We assume some familiarity with transition kernels and Markov chains throughout this section. For details we refer to the introductory sections of (Douc et al., 2018). We introduce some notation and provide a few observations. Let Cν := R Rd ϱν(x)dx, so that Cν ν(dx) = ϱ(0) ν (x)ϱ(1) ν (x)dx. For t > 0 define the level set L(t) := {x Rd | ϱ(1) ν (x) > t}. Note that, by 1 (0,ϱ(1) ν (x))(t) = 1L(t)(x), the transition kernel corresponding to PSS for ν can be expressed as P(x, A) := 1 0 µt(A)1L(t)(x)dt for x Rd, A B(Rd), where we set A ϱ(0) ν (x)1L(t)(x)dx R Rd ϱ(0) ν (x)1L(t)(x)dx for t > 0. It is known that νP = ν (Roberts & Rosenthal, 2002). We formulate a criterion for invariance w.r.t. imitations of polar slice sampling. Lemma B.1. Suppose that for each t > 0 we have a transition kernel U (t) X on Rd B(Rd) with µt U (t) X = µt. Then the transition kernel Q on Rd B(Rd) given by Q(x, A) := 1 0 U (t) X (x, A)1L(t)(x)dt satisfies νQ = ν. Proof. Let A B(Rd) arbitrary. Then Cν νQ(A) = Cν Rd Q(x, A)ν(dx) 0 U (t) X (x, A)1L(t)(x)dtϱ(0) ν (x)dx Rd U (t) X (x, A)ϱ(0) ν (x)1L(t)(x)dxdt Rd U (t) X (x, A)µt(dx) Rd ϱ(0) ν (x)1L(t)(x)dx dt Rd ϱ(0) ν (x)1L(t)(x)dxdt 0 µt(A)1L(t)(x)dtϱ(0) ν (x)dx Rd P(x, A)ν(dx) = Cν νP(A) = Cν ν(A), which shows νQ = ν. Note that the framework we rely on was previously used in Lemma 1 in (Latuszy nski & Rudolf, 2014) to analyze the reversibility of hybrid uniform slice sampling. Further Gibbsian Polar Slice Sampling note that in proving Lemma B.1 we never use the precise factorization of ϱν into ϱ(0) ν and ϱ(1) ν that is dictated by PSS. Hence the result also holds true for any other slice sampling scheme that adheres to this format. Moreover, in the previous lemma it is assumed that U (t) X is a transition kernel on Rd B(Rd). For given t > 0 it is sometimes more convenient to define a corresponding transition kernel U (t) X on L(t) B(L(t)). In such cases we consider U (t) X as extension of U (t) X to Rd B(Rd) given as U (t) X (x, A) = ( U (t) X (x, A L(t)) x L(t) 1A(x) x L(t). In the following we write U (t) X for U (t) X and consider U (t) X as extension if necessary. We turn our attention to the 1st variant of GPSS. Lemma B.2. For Tn = t the transition kernel U (t) X on (L(t), B(L(t))) of the X-update of the 1st variant of GPSS takes the form U (t) X = U (t) D1U (t) R with kernels U (t) D1(rθ, A) = Sd 1 1A(r θ)σd(d θ) R Sd 1 1L(t)(r θ)σd(d θ) , U (t) R (rθ, A) = R 0 1A( rθ)d r R 0 1L(t)( rθ)d r, (9) where r R+, θ Sd 1, x = rθ L(t) and A B(L(t)). Moreover, U (t) D1 and U (t) R are reversible w.r.t. µt. Proof. The overall X-update consists of consecutively realizing (3) and (4). Obviously, the direction update rn 1θn 1 7 rn 1θn (according to (3)) corresponds to U (t) D1 and the radius update rn 1θn 7 rnθn (according to (4)) corresponds to U (t) R . Consequently, the transition kernel of the X-update is U (t) X = U (t) D1U (t) R , that is, U (t) X is the product of the kernels U (t) D1 and U (t) R . Now we prove the claimed invariance property. We use the fact that ϱ(0) ν (x) = x 1 d. To simplify the notation, set Rd x 1 d1L(t)(x)dx 1 , (10) so that the restriction of µt to L(t) can be written as µt(dx) = ct x 1 ddx. Observe that Proposition A.1 yields for all A, B B(L(t)) Z A U (t) D1(x, B)µt(dx) A U (t) D1(x, B) x 1 ddx Sd 1 1A(rθ)U (t) D1(rθ, B)σd(dθ)dr Sd 1 1A(rθ)σd(dθ) R Sd 1 1B(r θ)σd(d θ) R Sd 1 1L(t)(r θ)σd(d θ) dr. 
Similarly, again by Proposition A.1, we obtain Z A U (t) R (x, B)µt(dx) A U (t) R (x, B) x 1 ddx 0 1A(rθ)U (t) R (rθ, B)drσd(dθ) R 0 1A(rθ)dr R 0 1B( rθ)d r R 0 1L(t)( rθ)d r σd(dθ). As the last expression in each of these two computations is symmetric in A and B, they show both U (t) D1 and U (t) R to be reversible w.r.t. µt. We also provide a suitable representation for the 2nd variant of GPSS. Lemma B.3. For Tn = t the transition kernel U (t) X on (L(t), B(L(t))) of the X-update of the 2nd variant of GPSS takes the form U (t) X = U (t) D2U (t) R with U (t) R being specified as in (9) and a suitable9, w.r.t. µt reversible, kernel U (t) D2. Proof. We require some further notation: For r R+ and θ, ϑ Sd 1 let g(θ,ϑ) : [0, 2π] Sd 1 be given by g(θ,ϑ)(α) = θ cos(α) + ϑ sin(α) and define L(t, r) := {θ Sd 1 | rθ L(t)} L(t, r, θ, ϑ) := {α [0, 2π] | g(θ,ϑ)(α) L(t, r)}. Define the transition kernel S(t,r) on L(t, r) B(Sd 1) as S(r,t)(θ, C) := Z 1C(g(θ,ϑ)(α)) dα λ1(L(t, r, θ, ϑ)) ξθ(dϑ), where θ L(t, r) and C B(Sd 1), with ξθ being the uniform distribution on Sd 2 θ . The transition kernel S(r,t) coincides with the transition kernel of the ideal geodesic slice sampler on the sphere that is reversible w.r.t. the uniform distribution on L(t, r), see Lemma 14 in (Habeck et al., 9We provide an explicit expression of U (t) D2 in the proof of the statement. Gibbsian Polar Slice Sampling 2023). Denote the uniform distribution on L(t, r) as γ(r,t), i.e. γ(r,t)(dθ) = 1L(t,r)(θ)σd(dθ) σd(L(t, r)) . Now observe that the transition kernel of the direction update U (t) D2 specified in the 2nd variant of GPSS for x = rθ L(t) with r R+ and θ Sd 1 is given by U (t) D2(rθ, A) = S(r,t)(θ, A(r)), (11) where for A B(Rd) we denote A(r) := {ϑ Sd 1 | rϑ A}. Note that the radius update of the 2nd variant of GPSS coincides with the one of the 1st variant, i.e. U (t) R is given by (9). Hence the total X-update takes the form U (t) X = U (t) D2U (t) R . It remains to prove the reversibility of U (t) D2 w.r.t. µt: For A, B B(L(t)) we have with ct as in (10) that Z A U (t) D2(x, B)µt(dx) A U (t) D2(x, B) x 1 ddx Sd 1 1A(rθ)U (t) D2(rθ, B)σd(dθ)dr A(r) S(r,t)(θ, B(r))γ(r,t)(dθ) σd(L(t, r)) dr, which is, by the reversibility of S(r,t) w.r.t. γ(r,t), symmetric in A and B. Consequently, U (t) D2 is reversible w.r.t. µt and the claimed statement is proven. Theorem 3.1 is now an easy consequence. Proof of Theorem 3.1. By Lemmas B.2 and B.3, given Tn = t, the transition kernel of the X-update of the 1st and 2nd variant of GPSS is U (t) X := U (t) Di U (t) R for i = 1, 2 respectively. By the reversibility of the individual kernels w.r.t. µt, see Lemma B.2 and Lemma B.3, we have µt U (t) X = µt U (t) Di U (t) R = µt U (t) R = µt, proving that U (t) X leaves µt for i = 1, 2 invariant. By Lemma B.1, this yields that the variants of GPSS leave the target distribution ν invariant. We add another auxiliary result. Lemma B.4. For (t, θ) (0, ) Sd 1 let L(t, θ) := {r R+ | rθ L(t)}. Then λ1(L(t, y/ y )) < holds for λ1 λd-almost all (t, y) R+ Rd. Proof. By Proposition A.1 follows Z Rd ϱν(x)dx = Z 0 ϱ(1) ν (rθ)drσd(dθ) 0 1L(t)(rθ)drσd(dθ)dt Sd 1 λ1(L(t, θ))σd(dθ)dt. From this and the fact that ϱν is integrable by assumption, it is immediate that λ1(L(t, θ)) < for λ1 σd-almost all (t, θ) R+ Sd 1. By defining F := {(t, θ) R+ Sd 1 | λ1(L(t, θ)) = }, we can alternatively express this result as (λ1 σd)(F) = 0. 
With another application of Proposition A.1, this yields (λ1 λd)({(t, y) R+ Rd | λ1(L(t, y/ y )) = }) = (λ1 λd)({(t, y) R+ Rd | (t, y/ y ) F}) 0 1F (t, y/ y )dtdy 0 1F (t, θ)dt σd(dθ)rd 1dr 0 (λ1 σd)(F) rd 1dr = 0, which proves the lemma s claim. Now we turn to the proof of Theorem 3.2. Proof of Theorem 3.2. We use the same notation as in the proof of Theorem 3.1. Moreover, we know from the proof10 of Theorem 15 in (Habeck et al., 2023) that there exists a constant ε > 0 (independent of t, r) such that S(r,t)(θ, D) ε σd(D L(t, r)), for any θ L(t, r) and D B(Sd 1). By (11) this implies for x = rθ L(t) and B B(L(t)) that U (t) D2(rθ, B) = S(r,t)(θ, B(r)) ε σd(B(r) L(t, r)) Sd 1 1B L(t)(rϑ)σd(dϑ) Sd 1 1L(t)(rϑ)δrϑ(B)σd(dϑ), where δz denotes the Dirac-measure at z Rd. Concisely written down, the former inequality yields U (t) D2(rθ, dy) ε Z Sd 1 1L(t)(rϑ)δrϑ(dy)σd(dϑ). 10The theorem applies in their notation with p(θ) = 1L(t,r)(θ), C = L(t, r) and β = 1, where also Remark 1 of (Habeck et al., 2023) should be taken into account. Gibbsian Polar Slice Sampling For ϑ Sd 1 we use L(t, ϑ) as defined in Lemma B.4 and note that the normalizing constant within U (t) R satisfies R 0 1L(t)(sϑ)ds = λ1(L(t, ϑ)). Taking the representation of the X-update of the 2nd variant of GPSS into account we obtain (using the same variables as earlier) U (t) X (rθ, B) = U (t) D2U (t) R (rθ, B) Rd U (t) R (y, B)U (t) D2(rθ, dy) Rd U (t) R (y, B)1L(t)(rϑ)δrϑ(dy)σd(dϑ) Sd 1 U (t) R (rϑ, B)1L(t)(rϑ)σd(dϑ) 1B( rϑ) λ1(L(t, ϑ))1L(t)(rϑ)d r σd(dϑ) 1L(t)(ry/ y ) y d 1λ1(L(t, y/ y )) dy, where the last equality follows by Proposition A.1. Therefore the transition kernel Q corresponding to the 2nd variant of GPSS satisfies for any A B(Rd) and 0 = x Rd with x = rθ that Q(x, A) = 1 Z ϱ(1) ν (x) 0 U (t) X (x, A L(t))dt Z ϱ(1) ν (x) 1L(t)(ry/ y )1L(t)(y) y d 1λ1(L(t, y/ y )) dt dy Z min{ϱ(1) ν (x),ϱ(1) ν (y),ϱ(1) ν (ry/ y )} λ1(L(t, y/ y ))dt dy. By Lemma B.4 we have that the mapping (t, y) 7 λ1(L(t, y/ y )) is λ1 λd-almost surely finite, so that we obtain that the function (t, y) 7 y 1 d λ1(L(t, y/ y )) is λ1 λd-almost surely strictly larger than zero. By the assumption and definition of ϱ(1) ν we have that ϱ(1) ν > 0 on Rd \ {0}, such that Q(x, A) > 0 whenever λd(A) > 0. We apply the former implication: Let A B(Rd) such that ν(A) > 0. Then, by the absolute continuity of ν w.r.t. λd we obtain that λd(A) > 0 and thus Q(x, A) > 0 for any 0 = x Rd. This yields that the transition kernel of the 2nd variant of GPSS is ν-irreducible and aperiodic, as defined in Section 3 in (Tierney, 1994). Since by Theorem 3.1 we also know that ν is an invariant distribution of Q, applying Theorem 1 of (Tierney, 1994) yields that ν is the unique invariant distribution. Moreover, the same theorem gives that the distribution of Xn converges ν-almost surely (regarding the initial state) to ν in the total variation distance. C. Implementation Notes The code we provide not only allows for the reproduction of our experimental results, but also contains an easily usable, general purpose implementation of GPSS in Python 3.10, based on numpy. Those still seeking to implement GPSS themselves, perhaps in another programming language, can pretty much follow Algorithms 1, 2 and 3. Still we find it appropriate to give two pieces of advice. First, the target density ϱν, its transform ϱ(1) ν and the thresholds tn should all be moved into log-space in order to make the implementation numerically stable. Second, upon generating a direction proposal, i.e. 
immediately following the code-adaptation of line 6 of Algorithm 2, the proposal should be re-normalized. Although in theory the proposal should already be normalized, in practice the operations involved in generating it always introduce small numerical errors. Individually, these errors are far too small to matter, but, due to the way in which each direction variable depends on that from the previous iteration, without re-normalization the errors accumulate and lead the direction variables to be more and more unnormalized. If the sampler goes through many thousands of iterations, these accumulating errors eventually introduce large amounts of bias into the sampling, because the initial proposal set in the direction update is no longer a great circle but rather an elongated ellipse. Of course the renormalization is not necessary if the direction variables are extracted from the full samples whenever required instead of being stored separately. D. Additional Sampling Statistics Table 1. Target density evaluations per iteration (TDE/I) and integrated autocorrelation times (IAT) for the experiments presented in Sections 6.1 and 6.2. The IATs were computed w.r.t. the sample log radii in the Cauchy experiment and w.r.t. the sample radii in the hyperplane experiment. CAUCHY HYPERPLANE SAMPLER TDE/I IAT TDE/I IAT GPSS 6.90 8.59 12.23 1.09 HRUSS 8.46 51346.93 7.78 684.57 NAIVE ESS 5.86 35543.94 7.91 199.17 TUNED ESS 6.51 188.13 Gibbsian Polar Slice Sampling E. Bayesian Logistic Regression In this section we examine how well GPSS performs in a more classical machine learning setting, namely Bayesian logistic regression with a mean-zero Gaussian prior. We begin by giving a brief overview of the problem. Bayesian logistic regression is a binary regression problem. Supposing the training data to be given as pairs (a(i), b(i)), i = 1, ..., m, where a(i) Rd and b(i) { 1, 1}, and choosing the prior distribution as Nd(0, 0.12Id), the posterior density of interest is given by ϱπ(x) = Nd(x; 0, 0.12Id) 1 1 + exp b(i) a(i), x , where , denotes the standard inner product on Rd. We consider the Cover Type data set (Blackard, 1998; Blackard et al.), which has cartographic variables as its features and forest cover types as its labels. The raw data set has m = 581,012 instances, 54 features and 7 different labels. As one of the labels occurs in about the same number of instances (283,301) as all the other labels combined (297,711), we make the regression problem binary by mapping this label to +1 and all other labels to 1. As the data set is quite large and we want to keep all of our experiments easily reproducible (in particular avoiding prohibitively large execution times), we use only 10% of it as training data and the remaining 90% as test data. Moreover, we normalized the data (i.e. each feature was affine-linearly transformed to have mean zero and standard deviation one) and added a constant feature (to enable a non-zero intercept of the regression line), rendering the problem d = 55 dimensional. We then ran the usual three slice samplers and NUTS for N = 5000 iterations each, initializing them equally with a draw x0 from the prior distribution Nd(0, 0.12Id). Viewing the first Nb = 5000 iterations as burn-in, we computed sample means for each sampler based only on the latter Na = 5000 iterations and then used these sample means as predictors to compute training and testing accuracies. 
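As an illustration, a minimal sketch of how the unnormalized log-posterior of this model can be written for use with the samplers above, assuming the standard logistic likelihood $1/(1 + \exp(-b^{(i)}\langle a^{(i)}, x\rangle))$ and the stated prior $\mathcal{N}_d(0, 0.1^2 I_d)$; the function name and the names `A` (feature matrix) and `b` (labels in $\{-1, 1\}$) are ours.

```python
import numpy as np

def log_posterior_logreg(x, A, b, prior_std=0.1):
    """Unnormalized log-posterior for Bayesian logistic regression with N(0, prior_std^2 I) prior."""
    log_prior = -0.5 * np.sum(x**2) / prior_std**2
    # log of prod_i 1 / (1 + exp(-b_i <a_i, x>)), computed stably via logaddexp
    margins = b * (A @ x)
    log_likelihood = -np.sum(np.logaddexp(0.0, -margins))
    return log_prior + log_likelihood
```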
For reference, we also solved the problem¹¹ with the default solver of the classical Python machine learning package scikit-learn (typically abbreviated sklearn). As can be seen in Table 2, all five methods solved the logistic regression problem equally well, achieving virtually identical accuracies on both training and test data.

¹¹ Note that sklearn technically solved a slightly different problem, because it only performed maximum likelihood estimation without taking the prior into account. However, this did not appear to have a significant effect in practice, see Table 2.

Table 2. Training and testing accuracies of the different solvers in the Bayesian logistic regression experiment.

SOLVER      TRAIN ACC   TEST ACC
GPSS        0.75520     0.75571
HRUSS       0.75472     0.75583
ESS         0.75506     0.75585
NUTS        0.75510     0.75573
SKLEARN     0.75562     0.75580

Nevertheless, some differences between the samplers' behaviors become evident when considering the usual sampling metrics, see Table 3: On the one hand, GPSS used roughly as many target density evaluations as HRUSS and ESS combined; on the other hand, it also achieved more than twice the mean step size of either method (which should in principle help GPSS explore the target distribution's mode more thoroughly, though evidently this did not lead to it finding a better predictor than the other samplers, cf. Table 2). In terms of IAT, the three slice samplers perform relatively similarly¹². Notably, NUTS, for which target density evaluations are neither available nor a sensible metric (due to the sampler's use of gradient information), vastly outperformed all three slice samplers in terms of both IAT and mean step size.

¹² Note that the IAT values varied significantly between different runs of this experiment.

Table 3. Target density evaluations per iteration (TDE/I), integrated autocorrelation time (IAT) and mean step size (MSS) for the Bayesian logistic regression experiment. The IATs were computed w.r.t. the marginal samples consisting of the second entry of each sample. Whereas the TDE/I were computed based on the entire sampling runs, both IAT and MSS are based only on the samples generated after the burn-in period.

SAMPLER   TDE/I    IAT      MSS
GPSS      23.55    311.37   0.0268
HRUSS      9.12    179.30   0.0125
ESS       10.81    116.17   0.0115
NUTS        n/a      1.18   0.2942
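For completeness, the two summary statistics reported in Table 3 alongside the TDE/I can be estimated from a chain of samples as in the following sketch. The simple truncation rule in the IAT estimate (cutting the autocorrelation sum at the first non-positive term) is a common heuristic standing in for whichever estimator one prefers; the helper names are illustrative.

```python
import numpy as np

def mean_step_size(samples):
    # average Euclidean distance between consecutive samples (MSS);
    # samples has shape (n_samples, d)
    return np.linalg.norm(np.diff(samples, axis=0), axis=1).mean()

def integrated_autocorrelation_time(x):
    # IAT of a scalar chain x, i.e. 1 + 2 * (sum of autocorrelations),
    # truncated at the first non-positive autocorrelation estimate
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    n = x.size
    var = np.dot(x, x) / n
    iat = 1.0
    for lag in range(1, n // 2):
        rho = np.dot(x[:-lag], x[lag:]) / (n * var)
        if rho <= 0.0:
            break
        iat += 2.0 * rho
    return iat

# the IATs in Table 3 refer to the second coordinate of each sample, e.g.
# integrated_autocorrelation_time(samples[:, 1])
```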
Aside from classification accuracies, another important aspect to consider when using samplers for logistic regression is how long they need to reach the target distribution's mode from wherever they are initialized. For this we refer to Figures 9 and 10, where it can be seen that, at least in this case, GPSS and HRUSS converged towards the mode about equally fast, with the convergence of ESS appearing slightly slower. NUTS again worked vastly better than all slice samplers, converging almost instantaneously. This may seem remarkable at first glance, but is to be expected when considering that NUTS, unlike the slice samplers, relies on gradient information, which is extremely valuable in solving logistic regression tasks.

Figure 9. Log target density values for the Bayesian logistic regression experiment, over the course of each sampler's N_a = 2500 iterations after the burn-in period.

Figure 10. Sample radii (Euclidean norms) for the Bayesian logistic regression experiment over the course of each sampler's N = 5000 iterations.

Overall, the results do not suggest Bayesian logistic regression to be a particularly worthwhile application of GPSS, which is not surprising to us based on our observations about the method's apparent strengths and weaknesses (cf. Sections 6 and 7).

F. Additional Results of Numerical Experiments

Figure 11. Sample runs for the multivariate standard Cauchy distribution, i.e. the density ϱ_ν(x) = (1 + ‖x‖²)^(−(d+1)/2), in dimension d = 100. The plots in the left column display the progression of the radii (Euclidean norms) of the individual samples over the course of N = 5 · 10^4 iterations. Their counterparts on the right show the logarithms of these values. We include the log radii plots to provide more insight into the short-term behavior of exact sampling and the GPSS chain, which cannot be ascertained from the radii plots due to the magnitude of their respective outliers.

Figure 12. Sample runs and step size histograms for the hyperplane disk density (6) in dimension d = 200. The plots in the left column display the progression of the radii (Euclidean norms) of the individual samples over the course of N = 10^4 iterations. The plots in the right column show histograms of the Euclidean distances between consecutive samples. The descriptor "tuned ESS" refers to ESS where the artificial Gaussian prior is given the empirical covariance of the samples generated by GPSS as its covariance parameter Σ. Untuned ESS corresponds, as usual, to Σ := I_d.

Figure 13. Progression of the currently visited mode pair over the last N_window = 2 · 10^4 iterations for the axial modes target density (7) in dimension d = 100. Here the covariance used by untuned ESS was Σ = I_d and that used by tuned ESS was Σ = (5 + d/10)² / d · I_d.

Figure 14. Marginal traces, marginal histograms and sample radii for Neal's funnel (8) in dimension d = 10. The plots in the left column show the progression of the first coordinate component of each sample over the course of each sampler's final N_window = 10^4 iterations. The plots in the middle column display histograms of the first coordinate component of all samples each sampler generated within the awarded time budget, with the thick black line marking the target marginal distribution. The plots in the right column show the progression of sample radii (Euclidean norms) over the course of each sampler's final N_window = 10^4 iterations. In particular, the quantities displayed in the left and right columns of each row are derived from the same N_window samples.
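As a concrete companion to Figure 11, the heavy-tailed target used there is easily provided in log-space (cf. Appendix C); the following is a minimal sketch with illustrative names.

```python
import numpy as np

def log_cauchy(x):
    # log of the multivariate standard Cauchy density
    # rho_nu(x) = (1 + ||x||^2)^(-(d+1)/2), up to its normalizing constant
    d = x.shape[0]
    return -0.5 * (d + 1) * np.log1p(np.dot(x, x))

# radii and log radii as shown in the left and right columns of Figure 11
# (samples is a hypothetical array of shape (n_samples, d)):
# radii = np.linalg.norm(samples, axis=1)
# log_radii = np.log(radii)
```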