# The Boomerang Sampler

Joris Bierkens¹, Sebastiano Grazzi¹, Kengo Kamatani², Gareth Roberts³

Abstract

This paper introduces the Boomerang Sampler as a novel class of continuous-time non-reversible Markov chain Monte Carlo algorithms. The methodology begins by representing the target density as a density, $e^{-U}$, with respect to a prescribed (usually) Gaussian measure, and constructs a continuous trajectory consisting of a piecewise elliptical path. The method moves from one elliptical orbit to another according to a rate function which can be written in terms of $U$. We demonstrate that the method is easy to implement and demonstrate empirically that it can out-perform existing benchmark piecewise deterministic Markov processes such as the Bouncy Particle Sampler and the Zig-Zag Sampler. In the Bayesian statistics context, these competitor algorithms are of substantial interest in the large data setting due to the fact that they can adopt data subsampling techniques which are exact (i.e. they induce no error in the stationary distribution). We demonstrate theoretically and empirically that we can also construct a control-variate subsampling Boomerang Sampler which is also exact, and which possesses remarkable scaling properties in the large data limit. We furthermore illustrate a factorised version on the simulation of diffusion bridges.

## 1. Introduction

Markov chain Monte Carlo (MCMC) remains the gold standard for asymptotically exact (i.e. bias-free) Bayesian inference for complex problems in Statistics and Machine Learning; see for example (Brooks et al., 2011). Yet a major impediment to its routine implementation for large data sets is the need to evaluate the target density (and possibly other related functionals) at each algorithm iteration.

¹Technische Universiteit Delft, Netherlands. ²Graduate School of Engineering Science, Osaka University, Japan. ³Department of Statistics, University of Warwick, United Kingdom. Correspondence to: Joris Bierkens.

Proceedings of the 37th International Conference on Machine Learning, Online, PMLR 119, 2020. Copyright 2020 by the author(s).

Partly motivated by this, in recent years there has been a surge in the development of innovative piecewise deterministic Monte Carlo (PDMC) methods, most notably the Bouncy Particle Sampler (BPS) (Bouchard-Côté et al., 2017) and the Zig-Zag Sampler (ZZ) (Bierkens et al., 2019), as competitors for classical MCMC algorithms such as Metropolis–Hastings and Gibbs sampling. We refer to (Fearnhead et al., 2018) for an accessible introduction to the PDMC setting. The primary benefits of these methods are the possibility of exact subsampling and non-reversibility. Exact subsampling refers to the possibility of using only a subset of the full data set (or even just a single observation) at each iteration of the algorithm, without introducing bias in the output of the algorithm (Fearnhead et al., 2018). Non-reversibility is a property of MCMC algorithms related to a notion of direction of the algorithm: it reduces the number of backtracking steps, thus reducing the diffusivity of the algorithm and reducing the asymptotic variance, as analyzed e.g. in (Diaconis et al., 2000; Andrieu & Livingstone, 2019).

The current key proponents of the PDMC paradigm, BPS and ZZ, share the following description of their dynamics. The process moves continuously in time according to a constant velocity over random time intervals, which are separated by switching events.
These switching events occur at stochastic times at which the velocity, or a component of it, is either reflected or randomly refreshed. The direction of a reflection, and the random time at which it occurs, is influenced by the target probability distribution. In this paper we explore the effect of modifying the property of constant velocity. By doing so we introduce the Boomerang Sampler, which has dynamics of the simple form

$$\frac{\mathrm{d}^2 x}{\mathrm{d}t^2} = -x.$$

Similar ideas were introduced in (Vanetti et al., 2017) and termed Hamiltonian-BPS, a method which can be seen as a special case of our approach. We generalise the Hamiltonian-BPS algorithm in three important ways.

1. We relax a condition which restricts the covariance function of the auxiliary velocity process to be isotropic. This generalisation is crucial to ensure good convergence properties of the algorithm.
2. We extend the Boomerang Sampler to allow for exact subsampling (as introduced above), thus permitting its efficient application to large data sets.
3. We introduce a factorised extension of the sampler which has important computational advantages in the common situation where the statistical model exhibits suitable conditional dependence structure.

Our method also has echoes of the elliptical slice sampler (Murray et al., 2010), which has been a successful discrete-time MCMC method, especially within machine learning applications. Both methods are strongly motivated by Hamiltonian dynamics, although there are also major differences between the two approaches. Finally we mention some other PDMP methods with non-linear dynamics, such as Randomized HMC (Bou-Rabee & Sanz-Serna, 2017; Deligiannidis et al., 2018), and others (Markovic & Sepehri, 2018; Terenin & Thorngren, 2018).

We shall study the Boomerang Sampler and two subsampling alternatives theoretically by analysing the interaction of Bayesian posterior contraction, data size ($n$) and subsampling schemes in the regular (smooth density) case. We shall show that, no matter the rate of posterior contraction, a suitably constructed subsampled Boomerang Sampler achieves an $O(n)$ advantage over non-subsampled algorithms. At the same time, we show that for the (non-subsampled) Boomerang Sampler, the number of switching events, and thus the computational cost, can be reduced by a factor $O(1/d)$ (where $d$ is the number of dimensions) relative to other piecewise deterministic methods, thanks to the deterministic Hamiltonian dynamics of the Boomerang Sampler. We illustrate these analyses with empirical investigations in which we compare the properties of Boomerang samplers against other PDMC benchmarks, demonstrating the superiority of the subsampled Boomerang for sufficiently large data size, for any fixed dimension, in the setting of logistic regression. We shall also give an empirical study to compare the Boomerang Sampler with its competitors as dimension increases. Finally, as a potentially very useful application, we describe the simulation of diffusion bridges using the Factorised Boomerang Sampler, demonstrating substantial computational advantages compared to its natural alternatives.

For $a \in \mathbb{R}^d$ and $\Sigma$ a positive definite matrix in $\mathbb{R}^{d \times d}$ we write $N(a, \Sigma)$ for the Gaussian distribution in $\mathbb{R}^d$ with mean $a$ and covariance matrix $\Sigma$. Let $\langle \cdot, \cdot \rangle$ denote the Euclidean inner product in $\mathbb{R}^d$. We write $(a)_+ := \max(a, 0)$ for the positive part of $a \in \mathbb{R}$, and we write $\langle \cdot, \cdot \rangle_+ := (\langle \cdot, \cdot \rangle)_+$ for the positive part of the inner product.

## 2. The Boomerang Sampler
The Boomerang Sampler is a continuous-time piecewise deterministic Markov process (PDMP) with state space $S = \mathbb{R}^d \times \mathbb{R}^d$. The two copies of $\mathbb{R}^d$ will be referred to as the position space and the velocity space, respectively. Our primary interest is in sampling the position coordinate, for which the auxiliary velocity coordinate is a useful tool.

Let $\mu_0$ denote a Gaussian measure on $S$ specified by $\mu_0 = N(x_\star, \Sigma) \otimes N(0, \Sigma)$, where $\Sigma$ is a positive definite matrix in $\mathbb{R}^{d \times d}$. Often we take $x_\star = 0$ to shorten expressions, which can be done without loss of generality by a shift in the position coordinate. The measure $\mu_0$ will be referred to as the reference measure. The Boomerang Sampler is designed in such a way that it has stationary probability distribution $\mu$ with density $\exp(-U(x))$ relative to $\mu_0$. Equivalently, it has density proportional to

$$\exp\left( -U(x) - \tfrac{1}{2} (x - x_\star)^\top \Sigma^{-1} (x - x_\star) - \tfrac{1}{2} v^\top \Sigma^{-1} v \right)$$

relative to the Lebesgue measure $dx \, dv$ on $\mathbb{R}^d \times \mathbb{R}^d$. We assume that this density has a finite integral. The marginal distribution of $\mu$ with respect to $x$ is denoted by $\Pi$.

The Boomerang process moves along deterministic trajectories $(x_t, v_t) \in \mathbb{R}^d \times \mathbb{R}^d$ which change direction at random times. The deterministic trajectories satisfy the following simple ordinary differential equation:

$$\frac{dx_t}{dt} = v_t, \qquad \frac{dv_t}{dt} = -(x_t - x_\star), \tag{1}$$

with explicit solution $x_t = x_\star + (x_0 - x_\star)\cos(t) + v_0 \sin(t)$, $v_t = -(x_0 - x_\star)\sin(t) + v_0 \cos(t)$. Note that $(x, v) \mapsto \langle x - x_\star, Q(x - x_\star)\rangle + \langle v, Qv \rangle$ is invariant with respect to the flow of (1) for any symmetric matrix $Q$. In particular the flow of (1) preserves the Gaussian measure $\mu_0$ on $S$.

Given an initial position $(x_0, v_0) \in S$, the process moves according to the motion specified by (1), resulting in a trajectory $(x_t, v_t)_{t \geq 0}$, until the first event occurs. The distribution of the first reflection event time $T$ is specified by

$$\mathbb{P}(T \geq t) = \exp\left( -\int_0^t \lambda(x_s, v_s) \, ds \right),$$

where $\lambda : S \to [0, \infty)$ is the event rate, specified as

$$\lambda(x, v) = \langle v, \nabla U(x) \rangle_+. \tag{2}$$

For $x \in \mathbb{R}^d$ we define the contour reflection $R(x)$ to be the linear operator from $\mathbb{R}^d$ to $\mathbb{R}^d$ given, for $(x, v) \in S$, by

$$R(x)v = v - 2\, \frac{\langle \nabla U(x), v \rangle}{|\Sigma^{1/2} \nabla U(x)|^2}\, \Sigma \nabla U(x). \tag{3}$$

Importantly the reflection satisfies

$$\langle R(x)v, \nabla U(x) \rangle = -\langle v, \nabla U(x) \rangle \tag{4}$$

and

$$|\Sigma^{-1/2} R(x)v| = |\Sigma^{-1/2} v|, \tag{5}$$

which are key in establishing that the resulting Boomerang Sampler has the correct stationary distribution. At the random time $T$ at which a switch occurs, we put $v_T := R(x_T) v_{T-}$, where we use the notation $y_{t-} := \lim_{s \uparrow t} y_s$. The process then starts afresh according to the dynamics (1) from the new position $(x_T, v_T)$.

Additionally, at random times generated by a homogeneous Poisson process with rate $\lambda_{\mathrm{refr}} > 0$, the velocity is refreshed; i.e. at such a random time $T$ we independently draw $v_T \sim N(0, \Sigma)$. This additional input of randomness guarantees that the Boomerang Sampler can visit the full state space and is therefore ergodic, as is the case for e.g. BPS (Bouchard-Côté et al., 2017). In Section 1 of the Supplement we define the generator of the Boomerang Sampler, which can in particular be used to prove that $\mu$ is a stationary distribution for the Boomerang process, and which can be used in subsequent studies to understand its probabilistic properties.

Remark 2.1 (On the choice of the reference measure). In principle we can express any probability distribution $\Pi(dx) \propto \exp(-E(x))\, dx$ as a density relative to $\mu_0$ by defining

$$U(x) = E(x) - \tfrac{1}{2} (x - x_\star)^\top \Sigma^{-1} (x - x_\star). \tag{6}$$

As mentioned before, we can take $\mu_0$ to be identical to a Gaussian prior measure in the Bayesian setting. Alternatively, and this is the approach which we will adopt in this paper, we may choose $\mu_0$ to be a Gaussian approximation of the measure $\Pi$, which may be obtained at relatively small computational cost in a preconditioning step.
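Since the flow of (1) is available in closed form, the deterministic part of the sampler requires no numerical integration. The following is a minimal Python sketch (our own illustrative code, not the authors' released implementation) of the flow map and the contour reflection (3); `grad` is assumed to be the gradient $\nabla U(x)$:

```python
import numpy as np

def flow(x0, v0, t, x_star):
    """Exact flow of the elliptical dynamics (1): a rotation around x_star."""
    c, s = np.cos(t), np.sin(t)
    x = x_star + (x0 - x_star) * c + v0 * s
    v = -(x0 - x_star) * s + v0 * c
    return x, v

def contour_reflection(v, grad, Sigma):
    """Contour reflection R(x)v of (3); grad is the gradient of U at x."""
    Sg = Sigma @ grad               # note |Sigma^{1/2} grad|^2 = grad . (Sigma grad)
    return v - 2.0 * (grad @ v) / (grad @ Sg) * Sg
```

One can check numerically that `contour_reflection` satisfies (4) and (5), and that `flow` leaves $|\Sigma^{-1/2}(x - x_\star)|^2 + |\Sigma^{-1/2} v|^2$ invariant, in line with the discussion above.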
### 2.1. Factorised Boomerang Sampler

As a variation on the Boomerang Sampler introduced above, we introduce the Factorised Boomerang Sampler (FBS), which is designed to perform well in situations where the conditional dependencies in the target distribution are sparse. For simplicity we restrict to the case of a diagonal reference covariance $\Sigma = \mathrm{diag}(\sigma_1^2, \ldots, \sigma_d^2)$. The deterministic dynamics of the FBS are identical to those of the standard Boomerang Sampler, and given by (1). The difference is that every component of the velocity has its own switching intensity. This is fully analogous to the difference between BPS and ZZ, where the latter can be seen as a factorised Bouncy Particle Sampler. In the current case, this means that as switching intensity for the $i$-th component of the velocity we take

$$\lambda_i(x, v) = \left( v_i\, \partial_i U(x) \right)_+,$$

and once an event occurs, the velocity changes according to the operator $F_i(v)$ given by

$$F_i(v) = \left( v_1, \ldots, v_{i-1}, -v_i, v_{i+1}, \ldots, v_d \right).$$

Also, the velocity of each component is refreshed according to $v_i \sim N(0, \sigma_i^2)$ at rate $\lambda_{\mathrm{refr},i} > 0$. Note that the computation of these reflections has a computational cost of $O(1)$, compared to the reflections in (3) being at least of $O(d)$, depending upon the sparsity of $\Sigma$; a sketch follows at the end of this subsection. A sparse conditional dependence structure implies that the individual switching intensities $\lambda_i(x, v)$ are in fact functions of a subset of the components of $x$, contributing to fast computation. This feature can be exploited by an efficient local implementation of the FBS algorithm which reduces the number of Poisson times simulated by the algorithm (similar in spirit to the local Bouncy Particle Sampler (Bouchard-Côté et al., 2017) and the local Zig-Zag sampler in (Bierkens et al., 2020)). In Section 3.2 we will briefly comment on the dimensional scaling of FBS. As an illustration of a realistic use, FBS will be applied to the simulation of diffusion bridges in Section 4.2.
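As a minimal sketch of the factorised mechanism (our own notation; `grad_U` returns the full gradient here, whereas a local implementation would evaluate only the coordinates affected):

```python
import numpy as np

def fbs_rates(x, v, grad_U):
    """Per-coordinate switching intensities lambda_i(x, v) = (v_i * dU/dx_i)_+."""
    return np.maximum(v * grad_U(x), 0.0)

def flip(v, i):
    """Event operator F_i: negate the i-th velocity component, keep the rest."""
    w = v.copy()
    w[i] = -w[i]
    return w
```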
### 2.2. Subsampling with control variates

Let $E(x)$ be the energy function, i.e. the negative log density of $\Pi$ with respect to the Lebesgue measure. Consider the setting where $E(x) = \frac{1}{n} \sum_{i=1}^n E_i(x)$, as is often the case in e.g. Bayesian statistics or computational physics. (Let us stress that $n$ represents a quantity such as the number of interactions or the size of the data, and not the dimensionality of $x$, which is instead denoted by $d$.) Using this structure, we introduce a subsampling method using the Gaussian reference measure as a tool for the efficient construction of the Monte Carlo method. Relative to a Gaussian reference measure with covariance $\Sigma$ centred at $x_\star$, the negative log density is given by (6). Let us assume

$$\Sigma = \left[ \nabla^2 E(x_\star) \right]^{-1} \tag{7}$$

for a reference point $x_\star$. In words, the curvature of the reference measure will agree around $x_\star$ with the curvature of the target distribution. We can think of $x_\star$ as the mean or mode of an appropriate Gaussian approximation used for the Boomerang Sampler. Note however that we shall not require that $\nabla E(x_\star) = 0$ for the sampler and its subsampling alternatives to work well, although some restrictions will be imposed in Section 3.1. In this setting it is possible to employ a subsampling method which is exact, in the sense that it targets the correct stationary distribution. This is an extension of methodology used for subsampling in other piecewise deterministic methods; see e.g. (Fearnhead et al., 2018) for an overview.

Assume for notational convenience that $x_\star = 0$. As an unbiased estimator for the gradient of $U$ we could simply take

$$\widetilde{\nabla U}(x) = \nabla E_I(x) - \nabla^2 E(0)\, x, \tag{8}$$

where $I$ is a random variable with uniform distribution over $\{1, \ldots, n\}$. We shall see in Proposition 3.1 that this estimator leads to weights which increase with $n$, and therefore we shall consider a control-variate alternative. Consider therefore the control-variate gradient estimator $\widehat{\nabla U}(x) = G_I(x)$, where, for $i = 1, \ldots, n$,

$$G_i(x) = \nabla E_i(x) - \nabla^2 E_i(0)\, x - \nabla E_i(0) + \nabla E(0). \tag{9}$$

Taking the expectation with respect to $I$,

$$\frac{1}{n} \sum_{i=1}^n \left( \nabla E_i(x) - \nabla^2 E_i(0)\, x - \nabla E_i(0) \right) + \nabla E(0) = \nabla E(x) - \nabla^2 E(0)\, x = \nabla U(x),$$

so that $\widehat{\nabla U}(x)$ is indeed an unbiased estimator of $\nabla U(x)$. In Section 3 we shall show that $\widehat{\nabla U}(x)$ has significantly superior scaling properties for large $n$ compared to $\widetilde{\nabla U}(x)$.

Remark 2.2. In various situations we can find a reference point $x_\star$ such that $\nabla E(x_\star) = 0$, in which case the final term in (9) vanishes. We include the term here so that we can accommodate the general situation in which $\nabla E(x_\star) \neq 0$.

Upon reflection, conditional on the random draw $I$, we reflect according to

$$R_I(x)v = v - 2\, \frac{\langle G_I(x), v \rangle}{|\Sigma^{1/2} G_I(x)|^2}\, \Sigma\, G_I(x).$$

The Boomerang Sampler that switches at the random rate $\hat\lambda(x, v) = \langle v, \widehat{\nabla U}(x) \rangle_+$ and reflects according to $R_I$ will preserve the desired target distribution, in analogy to the argument found in (Bierkens et al., 2019).

### 2.3. Simulation

The implementation of the Boomerang Sampler depends crucially on the ability to simulate from a non-homogeneous Poisson process with a prescribed rate. In this section we make a few general comments on how to achieve this task for the Boomerang Sampler and for the subsampled Boomerang Sampler. Suppose we wish to generate the first event according to a switching intensity $\lambda(x_t, v_t)$, where $(x_t, v_t)$ satisfies (1). This is challenging because it is non-trivial to generate points according to a time-inhomogeneous Poisson process, and moreover the function $\lambda(x_t, v_t)$ may be expensive to evaluate. It is customary in the simulation of PDMPs to employ the technique of Poisson thinning: we generate an event according to a deterministic rate function $\bar\lambda(t) \geq 0$, referred to as a computational bound, such that $\lambda(x_t, v_t) \leq \bar\lambda(t)$ for all $t \geq 0$. The function $\bar\lambda(t)$ should be suitable in the sense that we can explicitly simulate $T$ according to the law

$$\mathbb{P}(T \geq t) = \exp\left( -\int_0^t \bar\lambda(s)\, ds \right).$$

After generating $T$ from this distribution, we accept $T$ as a true switching event with probability $\lambda(x_T, v_T)/\bar\lambda(T)$. As a consequence of this procedure, the first time $T$ that is accepted in this way is a Poisson event with associated intensity $\lambda(x_t, v_t)$. In this paper we only consider bounds $\bar\lambda(t)$ of the form $\bar\lambda(t; x_0, v_0) = a(x_0, v_0) + t\, b(x_0, v_0)$. We call the bound constant if $b(x, v) = 0$ for all $(x, v)$, and affine otherwise.

As a simple example, consider the situation in which $|\nabla U(x)| \leq m$ for all $x$. In this case we have

$$\lambda(x, v) = \langle v, \nabla U(x) \rangle_+ \leq m|v| \leq m\sqrt{|x|^2 + |v|^2}.$$

Since the final expression is invariant under the dynamics (1), we find that

$$\lambda(x_t, v_t) \leq m\sqrt{|x_0|^2 + |v_0|^2}, \qquad t \geq 0,$$

which gives us a simple constant bound.

In the case of subsampling, the switching intensity $\hat\lambda(x, v)$ is random. The bound $\bar\lambda(t; x_0, v_0)$ then needs to be an upper bound for all random realizations of $\hat\lambda(x, v)$. In the case where we use the unbiased gradient estimator $\widehat{\nabla U}(x) = G_I(x)$ of (9), we can bound e.g.

$$\hat\lambda(x, v) \leq \sup_{i,x} |G_i(x)|\, |v| \leq \sup_{i,x} |G_i(x)|\, \sqrt{|x|^2 + |v|^2},$$

assuming all gradient estimators $G_i$ are globally bounded.
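Putting these pieces together, the following is a compact sketch (our illustrative Python, not the authors' released code) of the subsampled Boomerang Sampler with a constant computational bound. It assumes the user supplies a valid global bound `lam_bar` dominating $\langle v, G_i(x)\rangle_+$ for every $i$ along the trajectory, takes $x_\star = 0$, and records the skeleton at every (accepted or shadow) event:

```python
import numpy as np

rng = np.random.default_rng(1)

def boomerang_subsampled(x0, v0, G, n, lam_bar, lam_refr, Sigma, T_end):
    """Sketch of the subsampled Boomerang Sampler via Poisson thinning.
    Assumes lam_bar >= <v, G_i(x)>_+ for every i and reachable (x, v);
    G(i, x) returns the control-variate estimate G_i(x) of (9); x_star = 0."""
    L = np.linalg.cholesky(Sigma)
    x, v, t = x0.astype(float), v0.astype(float), 0.0
    skeleton = [(t, x.copy(), v.copy())]
    while t < T_end:
        tau_ev = rng.exponential(1.0 / lam_bar)    # proposed reflection (thinning)
        tau_re = rng.exponential(1.0 / lam_refr)   # refreshment time
        tau = min(tau_ev, tau_re)
        c, s = np.cos(tau), np.sin(tau)
        x, v = x * c + v * s, -x * s + v * c       # exact elliptical flow (1)
        t += tau
        if tau_re < tau_ev:
            v = L @ rng.standard_normal(x.size)    # refresh: v ~ N(0, Sigma)
        else:
            i = rng.integers(n)                    # draw a subsample index I
            g = G(i, x)
            if rng.random() * lam_bar < max(v @ g, 0.0):   # accept w.p. rate/bound
                Sg = Sigma @ g
                v -= 2.0 * (g @ v) / (g @ Sg) * Sg # reflection R_I(x) v
        skeleton.append((t, x.copy(), v.copy()))
    return skeleton
```

The non-subsampled sampler is recovered by letting `G` ignore its first argument and return the exact gradient $\nabla U(x)$.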
We will introduce different bounds in detail in Section 2 of the Supplement.

## 3. Scaling for large data sets and large dimension

### 3.1. Robustness to large n

In this section we investigate the variability of the rates induced by the Boomerang Sampler and its subsampling variants. The size of these rates is related to the size of the upper bounding rate of the Poisson process used to simulate them. Moreover, the rate of the upper bounding Poisson process is proportional to the number of density evaluations, which in turn is a sensible surrogate for the computing cost of running the algorithm.

As in Section 2.2, we describe $E$ as a sum of $n$ constituent negative log-likelihood terms: $E(x) = \sum_{i=1}^n \ell_i(x)$. (In the notation above we are just setting $\ell_i(x) = n^{-1} E_i(x)$.) Under suitable regularity conditions, the target probability measure $\Pi$ satisfies posterior contraction around $x_\star = 0$ at rate $\eta$; that is, for all $\epsilon > 0$ there exists $\delta > 0$ such that $\Pi(B_{n^{-\eta}\delta}(0)) > 1 - \epsilon$, where $B_r(0)$ denotes the ball of radius $r$ centred at $0$. As a result, we typically have velocities of order $n^{-\eta}$, ensuring that the dynamics in (1) circle the state space in $O(1)$ time.

The various algorithms will have computational times roughly proportional to the number of likelihood evaluations, which in turn depends on the event rate (and its upper bound). Therefore we introduce explicitly the subsampling bounce rates corresponding to the use of the unbiased estimators in (8) and (9):

$$\tilde\lambda(x, v) = \langle v, \widetilde{\nabla U}(x) \rangle_+; \qquad \hat\lambda(x, v) = \langle v, \widehat{\nabla U}(x) \rangle_+.$$

To simplify the arguments below, we also assume that each $\ell_i$ has all its third derivatives uniformly bounded, implying that all third derivative terms of $E$ are bounded by a constant multiple of $n$. This allows us to write down the expansion

$$\nabla U(x) = \nabla E(0) + \nabla^2 E(0)\, x - \Sigma^{-1} x + O(n|x|^2) = \nabla E(0) + O(n|x|^2). \tag{10}$$

Similarly, using the same Taylor expansion, we can write

$$\widehat{\nabla U}(x) = n \nabla \ell_I(x) - n \nabla^2 \ell_I(0)\, x - n \nabla \ell_I(0) + \nabla E(0) = \nabla E(0) + O(n|x|^2). \tag{11}$$

We can now use these estimates directly to obtain bounds on the event rates. We summarise this discussion in the following result.

Proposition 3.1. Suppose that $x, v \in B_{n^{-\eta}\delta}(0)$ for some $\delta > 0$. Under the assumptions described above, we have that

$$\lambda(x, v) \leq O\left( n^{-\eta} \left( |\nabla E(0)| + n^{1-2\eta} \right) \right), \tag{12}$$

$$\tilde\lambda(x, v) \leq O\left( n^{-\eta} \left( |\nabla E(0)| + n \right) \right), \tag{13}$$

$$\hat\lambda(x, v) \leq O\left( n^{-\eta} \left( |\nabla E(0)| + n^{1-2\eta} \right) \right). \tag{14}$$

Thus the use of $\widehat{\nabla U}(x)$ does not result in an increased event rate (in order of magnitude). There is therefore an $O(n)$ computational advantage obtained from using subsampling, due to each target density evaluation being $O(n)$ quicker.

Proposition 3.1 shows that as long as the reference point $x_\star$ (chosen to be $0$ here for convenience) is chosen sufficiently close to the mode that $|\nabla E(0)|$ is at most $O(n^{1-2\eta})$, we have

$$\lambda(x, v) = \hat\lambda(x, v) = O\left( n^{1-3\eta} \right).$$

Note that this rate can go to $0$ when $\eta > 1/3$. In particular, in the regular case where the Bernstein–von Mises theorem holds, we have $\eta = 1/2$. In this case the rate of jumps for the Boomerang can recede to $0$ at rate $n^{-1/2}$, so long as $|\nabla E(0)|$ is at most $O(1)$.
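To see the contrast between (8) and (9) concretely, consider a toy quadratic example (our own construction, not from the paper): $\ell_i(x) = |x - y_i|^2/2$ with $\nabla^2 \ell_i = I$, so that $\Sigma^{-1} = \nabla^2 E(0) = nI$ and the Taylor remainder vanishes exactly. A short numerical sketch:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 100_000, 2
y = rng.standard_normal((n, d))          # toy "data"; ell_i(x) = |x - y_i|^2 / 2

grad_E0 = -y.sum(axis=0)                 # grad E(0) = sum_i grad ell_i(0)
x = 1e-2 * rng.standard_normal(d)        # a point at roughly posterior scale
I = rng.integers(n, size=1000)           # subsampled indices

naive = n * (x - y[I]) - n * x                     # estimator (8)
cv = n * (x - y[I]) - n * x + n * y[I] + grad_E0   # estimator (9)

print(np.abs(naive).max())               # O(n): grows with the data size
print(np.abs(cv).max())                  # O(|grad E(0)|): stays moderate
```

Both estimators are unbiased for $\nabla U(x) = \nabla E(x) - \Sigma^{-1}x$, but the naive estimates fluctuate at order $n$ while the control-variate estimates concentrate near $\nabla E(0)$, matching the contrast between (13) and (14).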
### 3.2. Scaling with dimension

In this section we discuss how the Boomerang Sampler has an attractive scaling property in high dimension. This property is qualitatively similar to that of the preconditioned Crank–Nicolson algorithm (Neal, 1999; Beskos et al., 2008) and the elliptical slice sampler (Murray et al., 2010), which take advantage of the reference Gaussian distribution.

The dimensional complexity of BPS and ZZ was studied in (Bierkens et al., 2018; Deligiannidis et al., 2018; Andrieu et al., 2018). For the case of an isotropic target distribution, the rate of reflections per unit of time is constant for BPS and proportional to $d$ for ZZ with unit speeds in all directions. On the other hand, the time until convergence is of order $d$ for BPS and $1$ for ZZ. Therefore, the total number of reflections required for convergence of these two algorithms is of the same order, which grows linearly with dimension.

For the Boomerang Sampler we consider the following setting. Consider reference measures $\mu_{0,d}(dx, dv) = N(0, \Sigma_d) \otimes N(0, \Sigma_d)$ for increasing dimension $d$, where for every $d = 1, 2, \ldots$, $\Sigma_d$ is a $d$-dimensional positive definite matrix. Relative to these reference measures we consider a sequence of potential functions $U_d(x)$. Thus, relative to the Lebesgue measure, our target distribution $\Pi_d(dx)$ has density $\exp(-E_d(x))$, where $E_d(x) = U_d(x) + \frac{1}{2} \langle x, \Sigma_d^{-1} x \rangle$. Let $\mathbb{E}_d$ denote expectation with respect to $\Pi_d(dx) \otimes N(0, \Sigma_d)(dv)$. We assume that the sequence $(U_d)$ satisfies

$$\sup_{d = 1, 2, \ldots} \mathbb{E}_d\left[ |\Sigma_d^{1/2} \nabla U_d(x)|^2 \right] < \infty. \tag{15}$$

The condition (15) arises naturally, for instance, in the context of Gaussian regression, spatial statistics and Bayesian inverse problems, as well as in the setting of the diffusion bridge simulation example described in detail in Section 4.2. Furthermore we assume that the following form of the Poincaré inequality holds:

$$\mathbb{E}_d[f_d(x)^2] \leq \frac{1}{C^2}\, \mathbb{E}_d\left[ |\Sigma_d^{1/2} \nabla f_d(x)|^2 \right] \tag{16}$$

with constant $C > 0$ independent of dimension, where $f_d : \mathbb{R}^d \to \mathbb{R}$ is any mean-zero differentiable function. A sufficient condition for (16) to hold is

$$C^2 I \preceq \Sigma_d^{1/2}\, \nabla^2 E_d(x)\, \Sigma_d^{1/2} = \Sigma_d^{1/2}\, \nabla^2 U_d(x)\, \Sigma_d^{1/2} + I$$

by the classical Brascamp–Lieb inequality (Brascamp & Lieb, 1976; Bakry et al., 2014); note that (16) may also hold in the non-convex case, see e.g. (Lorenzi & Bertoldi, 2007), Section 8.6.

Under the stated assumptions we argue that (i) the expected number of reflections per unit of time scales as $O(1)$ with respect to dimension, and (ii) within a continuous time interval that scales as $O(1)$, the Boomerang Sampler mixes well. Claims (i) and (ii) are given a heuristic motivation in the Supplement; a rigorous proof remains part of our future work.

In the ideal but non-sparse scenario, the computational cost of the event time calculation for the Boomerang Sampler is thus expected to be a factor $d$ smaller compared to BPS and ZZ, assuming that the cost per event is the same for these algorithms. However, this comparison is unrealistic, since in general we cannot simulate reflections directly. In practice we need to use the thinning method discussed in Section 2.3. The thinning method introduces a significant number of shadow events (which are rejected after inspection), and the true events usually represent a small portion relative to the number of shadow events. As a result there can be a high cost for calculating shadow events even when the number of true events is small.

For the FBS, the expected number of events per unit of time is $\sum_{i=1}^d \mathbb{E}_d[(v_i\, \partial_i U(x))_+]$. Under the hypotheses above, this is of $O(d^{1/2})$. Thus, the number of events is much bigger than for the Boomerang. However, as in the case of ZZ, under a sparse model assumption the cost of calculation per jump is of constant order, whereas it is of order $d$ for the Boomerang Sampler. Therefore, the Factorised Boomerang Sampler should outperform the Boomerang Sampler in this sparse setup.
## 4. Applications and experiments

### 4.1. Logistic regression

As a suitable test bed we consider the logistic regression inference problem. Given predictors $y^{(1)}, \ldots, y^{(n)}$ in $\mathbb{R}^d$ and outcomes $z^{(1)}, \ldots, z^{(n)}$ in $\{0, 1\}$, we define the log-likelihood function as

$$\ell(x) = -\sum_{i=1}^{n} \left\{ \log\left( 1 + e^{x^\top y^{(i)}} \right) - z^{(i)}\, x^\top y^{(i)} \right\}.$$

Furthermore we impose a Gaussian prior distribution over $x$, which for simplicity we keep fixed to a standard normal distribution throughout these experiments. As a result we arrive at the negative log target density

$$E(x) = \sum_{i=1}^{n} \left\{ \log\left( 1 + e^{x^\top y^{(i)}} \right) - z^{(i)}\, x^\top y^{(i)} \right\} + \tfrac{1}{2}|x|^2.$$

As a preprocessing step when applying the Boomerang Sampler, and all subsampled methods, we find the mode $x_\star$ of the posterior distribution and define $\Sigma$ by (7). We apply the Boomerang Sampler with and without subsampling. These samplers are equipped with an affine computational bound and a constant computational bound respectively, both discussed in Section 2 of the Supplement (the affine bound is usually preferred over a constant bound, but a useful affine bound is not available in the subsampling case). We compare the Boomerang to both BPS and ZZ, with and without subsampling. In all subsampling applications we employ appropriate control variate techniques to reduce the variability of the random switching intensities, as discussed in Section 2.2. Furthermore, in the dimension dependence study we include the Metropolis-adjusted Langevin algorithm (MALA) for comparison.

Throughout these experiments we use effective sample size (ESS) per second of CPU time as the measure of efficiency of the methods used. ESS is estimated using the batch means method, where we take a fixed number of 50 batches for all our estimates. ESS is averaged over the dimensions of the simulation and then divided by the runtime of the algorithm to obtain average ESS per second (other ESS summaries could also have been used). The time horizon is throughout fixed at 10,000 (with 10,000 iterations for MALA). For ZZ and BPS the magnitude of the velocities is rescaled to be comparable on average with the Boomerang, to avoid unbalanced runtimes of the different algorithms. In Figures 1 and 2 the boxplots are taken over 20 randomly generated experiments, where each experiment corresponds to a logistic regression problem with a random (standard normal) parameter, based on randomly generated data from the model.¹ The refreshment rates for BPS and the Boomerang Samplers are taken to be 0.1.

The Boomerang Sampler is seen to outperform the other algorithms, both in terms of scaling with dimension and with respect to an increase in the number of observations. For a fixed dimension, the subsampling algorithms clearly outperform the non-subsampling algorithms as the number of observations $n$ grows. In particular, the ESS/sec stays fixed for the subsampled algorithms, and decreases as $O(1/n)$ for the non-subsampled versions. In this case we did not include the MALA algorithm, since we observed its complexity strongly deteriorating as the number of observations increases.

¹The code used to carry out all of the experiments of this paper may be found online at https://github.com/jbierkens/ICML-boomerang.

Figure 1. Scaling of the Boomerang Sampler compared to other PDMC methods for the logistic regression problem of Section 4.1 as a function of the number of observations. Here d = 2.
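For concreteness, the preprocessing step described above (mode finding and setting $\Sigma$ via (7)) can be sketched as follows. This is our own illustrative code, with SciPy's BFGS as one arbitrary optimiser choice:

```python
import numpy as np
from scipy.optimize import minimize

def make_logistic_target(y, z):
    """Negative log target E(x) for logistic regression with an N(0, I) prior,
    with its gradient and Hessian. y: (n, d) predictors, z: (n,) in {0, 1}."""
    def E(x):
        t = y @ x
        return np.sum(np.logaddexp(0.0, t) - z * t) + 0.5 * x @ x
    def grad(x):
        p = 1.0 / (1.0 + np.exp(-(y @ x)))   # logistic probabilities
        return y.T @ (p - z) + x
    def hess(x):
        p = 1.0 / (1.0 + np.exp(-(y @ x)))
        return (y * (p * (1 - p))[:, None]).T @ y + np.eye(y.shape[1])
    return E, grad, hess

def preprocess(y, z):
    """Locate the mode x_star and set Sigma = [hess E(x_star)]^{-1}, as in (7)."""
    E, grad, hess = make_logistic_target(y, z)
    res = minimize(E, np.zeros(y.shape[1]), jac=grad, method="BFGS")
    x_star = res.x
    Sigma = np.linalg.inv(hess(x_star))
    return x_star, Sigma
```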
For a large number of observations ($n \geq 10{,}000$, $d = 2$) we see that the Boomerang Sampler (with and without subsampling) accepts almost none of the proposed switches. This means that effectively we are sampling from the Gaussian reference measure. This observed behaviour is in line with the scaling analysis in Section 3.1.

In the second experiment we let the dimension $d$ grow for a fixed number of observations. The subsampling algorithms currently do not scale as well as the non-subsampled versions. For practical purposes we therefore only consider non-subsampled algorithms for the comparison with respect to dimensional dependence. For the dimensions $d \leq 32$ we tested, the Boomerang outperforms MALA, but it seems empirically that MALA has a better scaling behaviour with dimension. Note that MALA needs careful tuning to exhibit this good scaling. We remark that the beneficial scaling properties of the underlying Boomerang process as discussed in Section 3.2 may be adversely affected by suboptimal computational bounds. We are optimistic that the dimensional scaling of subsampled algorithms can be further improved by designing better computational bounds. In all cases the necessary preprocessing steps can be done very quickly. In particular, the plots are not affected by including (or excluding) the preprocessing time in the computation of ESS/sec.

Figure 2. Scaling of the Boomerang Sampler compared to other PDMC methods and MALA for the logistic regression problem of Section 4.1 as a function of the number of dimensions. Here the number of observations is n = 1,000.

### 4.2. Diffusion bridges

In (Bierkens et al., 2020) the authors introduce a framework for the simulation of diffusion bridges (diffusion processes conditioned to hit a prescribed endpoint), taking strong advantage of the use of factorised piecewise deterministic samplers. This invites the use of the Factorised Boomerang Sampler (FBS). We consider time-homogeneous one-dimensional conditional diffusion processes (diffusion bridges) of the form

$$dX_t = b(X_t)\, dt + dW_t, \qquad X_0 = u, \quad X_T = v,$$

where $W$ is a scalar Brownian motion and $b$ satisfies some mild regularity conditions (see (Bierkens et al., 2020) for details). This simulation problem is an essential building block in the Bayesian analysis of non-linear diffusion models with low frequency observations (Roberts & Stramer, 2001). We consider the approach of (Bierkens et al., 2020), where the diffusion path on $[0, T]$ is expanded with a truncated Faber–Schauder basis as

$$X^N_t = \bar\phi(t)\, u + \phi(t)\, v + \sum_{i=0}^{N} \sum_{j=0}^{2^i - 1} \phi_{i,j}(t)\, x_{i,j}.$$

Here

$$\phi(t) = t/T, \qquad \bar\phi(t) = 1 - t/T,$$

$$\phi_{0,0}(t) = \sqrt{T} \left[ (t/T)\, \mathbf{1}_{[0, T/2]}(t) + (1 - t/T)\, \mathbf{1}_{(T/2, T]}(t) \right],$$

$$\phi_{i,j}(t) = 2^{-i/2}\, \phi_{0,0}(2^i t - jT), \qquad i \geq 0, \quad 0 \leq j \leq 2^i - 1,$$

are the Faber–Schauder functions and $N$ is the truncation level of the expansion. In (Bierkens et al., 2020), ZZ is used to sample the corresponding coefficients $x := (x_{0,0}, \ldots, x_{N, 2^N - 1})$, which have a density written with respect to a standard Gaussian reference measure (corresponding to a Brownian bridge measure in path space; see (Bierkens et al., 2020) for details). In particular we have that

$$\frac{d\mu}{d\mu_0}(x, v) \propto \exp\left( -\frac{1}{2} \int_0^T \left[ b^2(X^N_s) + b'(X^N_s) \right] ds \right), \tag{17}$$

where $b'$ is the derivative of $b$ and $\mu_0 = N(0, I) \otimes N(0, I)$ with $I$ the $(2^{N+1} - 1)$-dimensional identity matrix.
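A short sketch (our own code, following the reconstruction of the basis above) of evaluating the basis functions and the truncated path expansion; `coef[i][j]` holds the coefficient $x_{i,j}$:

```python
import numpy as np

def phi00(t, T):
    """Mother Faber-Schauder hat function on [0, T]."""
    t = np.asarray(t, dtype=float)
    up = np.where((t >= 0) & (t <= T / 2), t / T, 0.0)
    down = np.where((t > T / 2) & (t <= T), 1 - t / T, 0.0)
    return np.sqrt(T) * (up + down)

def phi(i, j, t, T):
    """phi_{i,j}(t) = 2^{-i/2} phi00(2^i t - j T), supported on [jT/2^i, (j+1)T/2^i]."""
    return 2.0 ** (-i / 2) * phi00(2.0 ** i * t - j * T, T)

def bridge_path(t, coef, u, v, T, N):
    """Evaluate the truncated expansion X^N_t from the coefficients."""
    X = (1 - t / T) * u + (t / T) * v
    for i in range(N + 1):
        for j in range(2 ** i):
            X = X + phi(i, j, t, T) * coef[i][j]
    return X
```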
The measure given by (17) has a remarkable conditional independence property (Proposition 2, (Bierkens et al., 2020)), and the coefficients $x_{i,j}$ for large $i$, responsible for the local behaviour of the process, are approximately independent standard Gaussian, reflecting the fact that, locally, the process behaves as a Brownian motion. In (Bierkens et al., 2020) the authors devise a local implementation of ZZ which optimally exploits the sparse conditional independence structure of the target distribution, alleviating the computational cost in high-dimensional settings (e.g. at a high truncation level $N$). Since the Girsanov density (17) is expressed relative to a standard normal distribution on the coefficients $x$, the Factorised Boomerang Sampler is a natural candidate for a further reduction in computational cost, by reducing the required number of simulated events, in particular at the higher levels where the coefficients approximately have a Gaussian distribution. This allows for a further increase of the truncation level $N$ and/or faster computations at a fixed truncation level.

We consider the class of diffusion bridges with drift

$$b(x) = \alpha \sin(x), \qquad \alpha \geq 0. \tag{18}$$

The higher $\alpha$, the stronger the attraction of the diffusion paths to the stable points $(2k - 1)\pi$, $k \in \mathbb{N}$, while for $\alpha = 0$ the process reduces to a Brownian bridge with $\mu = \mu_0$. As in (Bierkens et al., 2020), we use subsampling, since the gradient of the log density in (17) involves a time integral that cannot be computed analytically in most cases. The unbiased estimator for $\partial_{x_{i,j}} U(x)$ is the integrand evaluated at a uniform random point, multiplied by the range of the integral; see the sketch at the end of this subsection. The Poisson bounding rates used for the subsampling can be found in the Supplement, Section 5.

Figure 3 shows the resulting bridges for $\alpha = 1$, starting at $u = \pi$ and hitting $v = 3\pi$ at final time $T = 50$, after running the FBS with clock $\mathcal{T} = 20{,}000$, as simulated on a standard desktop computer. The refreshment rate for each coefficient $x_{i,j}$ is fixed to $\lambda_{\mathrm{refr}, i, j} = 0.01$ and the truncation level of the expansion is $N = 6$.

Figure 3. 1000 diffusion bridges with drift (18) with α = 1, u = π, v = 3π, T = 50, N = 6, sampled with the FBS with time horizon 20,000 and refreshment rates λ_refr,i,j = 0.01 for all i, j. The straight horizontal lines are the attraction points.

In Figure 4 we compare the performance of the Boomerang Sampler and ZZ by computing the average number of reflections (y-axis, on a log-scale) for the coefficients $x_{i,j}$ at each level (x-axis). The number of reflections is understood as a measure of the complexity of the algorithm. We repeat the experiment for $\alpha = 0.5$ and $\alpha = 0$ (where $\mu = \mu_0$) and fix the truncation level to $N = 10$, which corresponds to a $(2047 + 2047)$-dimensional space for $(x, v)$. For a fair comparison we set the expected $\ell_1$ norm of the velocities and the time horizon of the two samplers to be the same. In both cases, the average number of reflections converges to the average number of reflections under $\mu_0$ (dashed lines), indicating that the coefficients at high levels are approximately standard normally distributed; but while ZZ requires a fixed number of reflections for sampling from $\mu = \mu_0$, the Boomerang does not, allowing for a high resolution of the diffusion bridges at a lower computational cost.
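Under our reading of (17), differentiating the potential gives $\partial_{x_{i,j}} U(x) = \int_0^T \phi_{i,j}(s) \left[ b(X^N_s)\, b'(X^N_s) + \tfrac{1}{2} b''(X^N_s) \right] ds$, and the one-point estimator draws $s$ uniformly on the support of $\phi_{i,j}$ (of length $T/2^i$, where the integrand is non-zero) and rescales by that length; sampling uniformly on $[0, T]$ and rescaling by $T$ would be equally valid. A sketch, reusing `phi` and `bridge_path` from the earlier sketch and the drift derivatives for (18):

```python
import numpy as np

rng = np.random.default_rng(2)
alpha = 1.0
b   = lambda x: alpha * np.sin(x)    # drift (18)
db  = lambda x: alpha * np.cos(x)
d2b = lambda x: -alpha * np.sin(x)

def grad_U_estimate(i, j, coef, u, v, T, N):
    """Unbiased one-point estimate of dU/dx_{i,j} for the potential in (17),
    with s uniform on the support [j T / 2^i, (j+1) T / 2^i] of phi_{i,j}."""
    length = T / 2 ** i
    s = (j + rng.random()) * length
    Xs = bridge_path(s, coef, u, v, T, N)
    integrand = phi(i, j, s, T) * (b(Xs) * db(Xs) + 0.5 * d2b(Xs))
    return length * integrand
```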
Figure 4. Average number of reflections (on a log-scale) for the coefficients x_{i,j} at levels i = 0, 1, ..., 10 for the diffusion bridge given by (18) with α = 0.5 (solid lines) and α = 0.0 (dashed lines), for the Zig-Zag Sampler (blue lines) and the Factorised Boomerang Sampler (red lines), with time horizon 2,000 and Boomerang refreshment rates λ_refr,i = 0.01 for all i.

### 4.3. Dependence upon reference measure

In a final experiment we investigate the dependence of the performance of the Boomerang Sampler upon the choice of reference measure. For this we consider a simple setting in which the target distribution is a standard normal distribution in $d$ dimensions. However, instead of using the standard normal distribution as reference measure, we perturb it in two ways: (i) we vary the component-wise variance $\sigma^2$ of the reference measure, and (ii) we vary the mean $x_\star$ of the reference measure. Specifically, we choose a reference measure $N(x_\star, \Sigma) \otimes N(0, \Sigma)$, with in case (i) $x_\star = 0$, $\Sigma = \sigma^2 I$, and in case (ii) $x_\star = \alpha(1, \ldots, 1)^\top$, $\Sigma = I$. As performance measure we use the ESS per second for the quantity $|x|^2$. We use refreshment rate 0.1 for the Boomerang, and we compare to the Bouncy Particle Sampler with refreshment rate 1.0, with both samplers run over a time horizon of 10,000. In Figure 5 the results of this experiment are displayed for varying $\sigma^2$, and in Figure 6 the results are displayed for varying $x_\star$. The box plots are taken over 20 experiments of the Boomerang Sampler, which are compared to a single run of the Bouncy Particle Sampler (dashed line).

Figure 5. Effect of perturbing the variance of the reference measure. As reference measure we choose N(0, σ²I) ⊗ N(0, σ²I), where σ² is varied from 0.5 to 2.0.

In this setting, the Boomerang Sampler significantly outperforms the BPS, although the performance is seen to depend upon the choice of reference measure. Note however that the dependencies of $\Sigma$ on $\sigma^2$ and of $x_\star$ upon $\alpha$ scale as $\mathrm{trace}\, \Sigma = \sigma^2 d$ and $|x_\star| = \alpha d^{1/2}$ respectively, so that in high-dimensional cases the sensitivity to $x_\star$ and $\Sigma$ may be more moderate than might appear from Figures 5 and 6.

Figure 6. Effect of perturbing the mean of the reference measure. As reference measure we choose N(α(1, ..., 1)ᵀ, I) ⊗ N(0, I), where α is varied from 0.0 to 2.0.

## 5. Conclusion

We presented the Boomerang Sampler as a new and promising methodology which outperforms other piecewise deterministic methods in the large-$n$, moderate-$d$ setting, as explained theoretically and demonstrated in a suitable benchmark test. The theoretical properties of the underlying Boomerang process in high dimension are very good; however, currently a large computational bound, and therefore a large number of rejected switches, hampers the efficiency. We gave a numerical comparison which demonstrates that the Boomerang performs well against its natural competitors; nonetheless, one should be cautious about drawing too many conclusions about the performance of the Boomerang without a more comprehensive simulation study. Further research is required to understand in more detail the dependence upon e.g. the reference covariance $\Sigma$, the centering position $x_\star$, the refreshment rate, computational bounds and the choice of efficiency measure. We furthermore introduced the Factorised Boomerang Sampler and illustrated its ability to tackle a challenging simulation problem using an underlying sparse structure.
An important direction for further research is the improvement of the computational bounds, in particular with the aim of achieving good scaling with dimension for subsampled algorithms. Related to this, it is important to gain a better understanding of the relative optimality of subsampled versus non-subsampled algorithms in the large-$n$, large-$d$ case.

## Acknowledgements

We thank the anonymous reviewers and meta-reviewer for their helpful comments, which have helped to improve the paper. JB and SG acknowledge funding under the research programme "Zig-zagging through computational barriers" with project number 016.Vidi.189.043, which is financed by the Netherlands Organisation for Scientific Research (NWO). KK is supported by JSPS KAKENHI Grant Number 20H04149 and JST CREST Grant Number JPMJCR14D7. GR is supported by EPSRC grants EP/R034710/1 (CoSInES) and EP/R018561/1 (Bayes for Health), and by Wolfson merit award WM140096.

## References

Andrieu, C. and Livingstone, S. Peskun–Tierney ordering for Markov chain and process Monte Carlo: beyond the reversible scenario. arXiv preprint arXiv:1906.06197, 2019. URL http://arxiv.org/abs/1906.06197.

Andrieu, C., Durmus, A., Nüsken, N., and Roussel, J. Hypocoercivity of piecewise deterministic Markov process Monte Carlo. arXiv preprint arXiv:1808.08592, 2018. URL http://arxiv.org/abs/1808.08592.

Bakry, D., Gentil, I., and Ledoux, M. Analysis and Geometry of Markov Diffusion Operators. Grundlehren der mathematischen Wissenschaften. Springer, 2014. ISSN 2196-9701. doi: 10.1007/978-3-319-00227-9. URL http://dx.doi.org/10.1007/978-3-319-00227-9.

Beskos, A., Roberts, G. O., Stuart, A., and Voss, J. MCMC methods for diffusion bridges. Stoch. Dyn., 8(3):319–350, 2008. URL https://doi.org/10.1142/S0219493708002378.

Bierkens, J., Kamatani, K., and Roberts, G. O. High-dimensional scaling limits of piecewise deterministic sampling algorithms. arXiv preprint arXiv:1807.11358, 2018. URL http://arxiv.org/abs/1807.11358.

Bierkens, J., Fearnhead, P., and Roberts, G. O. The Zig-Zag process and super-efficient sampling for Bayesian analysis of big data. Ann. Statist., 47(3):1288–1320, 2019. doi: 10.1214/18-AOS1715. URL https://doi.org/10.1214/18-AOS1715.

Bierkens, J., Grazzi, S., van der Meulen, F., and Schauer, M. A piecewise deterministic Monte Carlo method for diffusion bridges. arXiv preprint arXiv:2001.05889, 2020. URL http://arxiv.org/abs/2001.05889.

Bou-Rabee, N. and Sanz-Serna, J. M. Randomized Hamiltonian Monte Carlo. Ann. Appl. Probab., 27(4):2159–2194, 2017. doi: 10.1214/16-AAP1255. URL https://doi.org/10.1214/16-AAP1255.

Bouchard-Côté, A., Vollmer, S. J., and Doucet, A. The Bouncy Particle Sampler: a non-reversible rejection-free Markov chain Monte Carlo method. Journal of the American Statistical Association, 2017. doi: 10.1080/01621459.2017.1294075. URL http://www.tandfonline.com/doi/abs/10.1080/01621459.2017.1294075.

Brascamp, H. J. and Lieb, E. H. On extensions of the Brunn–Minkowski and Prékopa–Leindler theorems, including inequalities for log concave functions, and with an application to the diffusion equation. J. Functional Analysis, 22(4):366–389, 1976. doi: 10.1016/0022-1236(76)90004-5. URL https://doi.org/10.1016/0022-1236(76)90004-5.

Brooks, S., Gelman, A., Jones, G. L., and Meng, X.-L. (eds.). Handbook of Markov Chain Monte Carlo. Chapman & Hall/CRC Handbooks of Modern Statistical Methods. CRC Press, Boca Raton, FL, 2011. ISBN 978-1-4200-7941-8. doi: 10.1201/b10905. URL https://doi.org/10.1201/b10905.

Deligiannidis, G., Paulin, D., and Doucet, A.
Randomized Hamiltonian Monte Carlo as scaling limit of the Bouncy Particle Sampler and dimension-free convergence rates. arXiv preprint arXiv:1808.04299, 2018. URL http://arxiv.org/abs/1808.04299.

Diaconis, P., Holmes, S., and Neal, R. M. Analysis of a nonreversible Markov chain sampler. Ann. Appl. Probab., 10(3):726–752, 2000. ISSN 1050-5164. doi: 10.1214/aoap/1019487508. URL https://doi.org/10.1214/aoap/1019487508.

Fearnhead, P., Bierkens, J., Pollock, M., and Roberts, G. O. Piecewise deterministic Markov processes for continuous-time Monte Carlo. Statist. Sci., 33(3):386–412, 2018. doi: 10.1214/18-STS648. URL https://doi.org/10.1214/18-STS648.

Lorenzi, L. and Bertoldi, M. Analytical Methods for Markov Semigroups, volume 283 of Pure and Applied Mathematics (Boca Raton). Chapman & Hall/CRC, Boca Raton, FL, 2007. ISBN 978-1-58488-659-4.

Markovic, J. and Sepehri, A. Bouncy hybrid sampler as a unifying device. arXiv preprint arXiv:1802.04366, 2018. URL http://arxiv.org/abs/1802.04366.

Murray, I., Adams, R., and MacKay, D. Elliptical slice sampling. In Teh, Y. W. and Titterington, M. (eds.), Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, volume 9 of Proceedings of Machine Learning Research, pp. 541–548, Chia Laguna Resort, Sardinia, Italy, 13–15 May 2010. PMLR. URL http://proceedings.mlr.press/v9/murray10a.html.

Neal, R. M. Regression and classification using Gaussian process priors. In Bayesian Statistics, 6 (Alcoceber, 1998), pp. 475–501. Oxford Univ. Press, New York, 1999.

Roberts, G. O. and Stramer, O. On inference for partially observed nonlinear diffusion models using the Metropolis–Hastings algorithm. Biometrika, 88(3):603–621, 2001. ISSN 0006-3444. URL http://www.jstor.org/stable/2673434.

Terenin, A. and Thorngren, D. A piecewise deterministic Markov process via (r, θ) swaps in hyperspherical coordinates. arXiv preprint arXiv:1807.00420, 2018. URL http://arxiv.org/abs/1807.00420.

Vanetti, P., Bouchard-Côté, A., Deligiannidis, G., and Doucet, A. Piecewise-deterministic Markov chain Monte Carlo. arXiv preprint arXiv:1707.05296, 2017. URL http://arxiv.org/abs/1707.05296.