# Greedy Poisson Rejection Sampling

Gergely Flamich, Department of Engineering, University of Cambridge. gf332@cam.ac.uk

One-shot channel simulation is a fundamental data compression problem concerned with encoding a single sample from a target distribution Q using a coding distribution P with as few bits as possible on average. Algorithms that solve this problem find applications in neural data compression and differential privacy and can serve as a more efficient alternative to quantization-based methods. Sadly, existing solutions are too slow or have limited applicability, preventing widespread adoption. In this paper, we conclusively solve one-shot channel simulation for one-dimensional problems where the target-proposal density ratio is unimodal by describing an algorithm with optimal runtime. We achieve this by constructing a rejection sampling procedure equivalent to greedily searching over the points of a Poisson process. Hence, we call our algorithm greedy Poisson rejection sampling (GPRS) and analyze the correctness and time complexity of several of its variants. Finally, we empirically verify our theorems, demonstrating that GPRS significantly outperforms the current state-of-the-art method, A* coding. Our code is available at https://github.com/gergely-flamich/greedy-poisson-rejection-sampling.

1 Introduction

It is a common misconception that quantization is essential to lossy data compression; in fact, it is merely a way to discard information deterministically. In this paper, we consider the alternative, that is, to discard information stochastically using one-shot channel simulation. To illustrate the main idea, take lossy image compression as an example. Assume we have a generative model given by a joint distribution Px,y over images y and latent variables x, e.g. we might have trained a variational autoencoder (VAE; Kingma & Welling, 2014) on a dataset of images. To compress a new image y, we encode a single sample from its posterior x ∼ Px|y as its stochastic lossy representation. The decoder can obtain a lossy reconstruction of y by decoding x and drawing a sample ŷ ∼ Py|x (though in practice, for a VAE we normally just take the mean predicted by the generative network). Abstracting away from our example, in this paper we will be entirely focused on channel simulation for a pair of correlated random variables x, y ∼ Px,y: given a source symbol y ∼ Py, we wish to encode a single sample x ∼ Px|y. A simple way to achieve this is to encode x with entropy coding using the marginal Px, whose average coding cost is approximately the entropy H[x]. Surprisingly, however, we can do much better by using a channel simulation protocol, whose average coding cost is approximately the mutual information I[x; y] (Li & El Gamal, 2018). This is remarkable, since not only is I[x; y] ≤ H[x], but in many cases I[x; y] might be finite even though H[x] is infinite, such as when x is continuous. Sadly, most existing protocols place heavy restrictions on Px,y or their runtime scales much worse than O(I[x; y]), limiting their practical applicability (Agustsson & Theis, 2020). In this paper, we propose a family of channel simulation protocols based on a new rejection sampling algorithm, which we can apply to simulate samples from a target distribution Q using a proposal distribution P over an arbitrary probability space.
The inspiration for our construction comes from an exciting recent line of work which recasts random variate simulation as a search problem over a set 37th Conference on Neural Information Processing Systems (Neur IPS 2023). Figure 1: Illustration of three GPRS procedures for a Gaussian target Q = N(1, 0.252) and Gaussian proposal distribution P = N(0, 1), with the time axis truncated to the first 17 units. All three variants find the first arrival of the same (1, P)-Poisson process Π under the graph of ϕ = σ r indicated by the thick dashed black line in each plot. Here, r = d Q/d P is the target-proposal density ratio, and σ is given by Equation (5). Left: Algorithm 3 sequentially searching through the points of Π. The green circle ( ) shows the first point of Π that falls under ϕ, and is accepted. All other points are rejected, as indicated by red crosses ( ). In practice, Algorithm 3 does not simulate points of Π that arrive after the accepted arrival. Middle: Parallelized GPRS (Algorithm 4) searching through two independent (1/2, P)-Poisson processes Π1 and Π2 in parallel. Blue points are arrivals in Π1 and orange points are arrivals in Π2. Crosses ( ) indicate rejected, and circles ( ) indicate accepted points by each thread. In the end, the algorithm accepts the earliest arrival across all processes, which in this case is marked by the blue circle ( ). Right: Branch-and-bound GPRS (Algorithm 5), when ϕ is unimodal. The shaded red areas are never searched or simulated by the algorithm since, given the first two rejections, we know points in those regions cannot fall under ϕ. of randomly placed points, specifically a Poisson process (Maddison, 2016). The most well-known examples of sampling-as-search are the Gumbel-max trick and A* sampling (Maddison et al., 2014). Our algorithm, which we call greedy Poisson rejection sampling (GPRS), differs significantly from all previous approaches in terms of what it is searching for, which we can succinctly summarise as: GPRS searches for the first arrival of a Poisson process Π under the graph of an appropriately defined function ϕ . The first and simplest variant of GPRS is equivalent to an exhaustive search over all points of Π in time order. Next, we show that the exhaustive search is embarrassingly parallelizable, leading to a parallelized variant of GPRS. Finally, when the underlying probability space has more structure, we develop branch-and-bound variants of GPRS that perform a binary search over the points of Π. See Figure 1 for an illustration of these three variants. While GPRS is an interesting sampling algorithm on its own, we also show that each of its variants induces a new one-shot channel simulation protocol. That is, after we receive y Py, we can set Q Px|y and P Px and use GPRS to encode a sample x Px|y at an average bitrate of a little more than the mutual information I[x; y]. In particular, on one-dimensional problems where the density ratio d Px|y/d Px is unimodal for all y (which is often the case in practice), branch-andbound GPRS leads to a protocol with an average runtime of O(I[x; y]), which is optimal. This is a considerable improvement over A* coding (Flamich et al., 2022), the current state-of-the-art method. In summary, our contributions are as follows: We construct a new rejection sampling algorithm called greedy Poisson rejection sampling, which we can construe as a greedy search over the points of a Poisson process (Algorithm 3). 
We propose a parallelized (Algorithm 4) and a branch-and-bound variant (Algorithms 5 and 6) of GPRS. We analyze the correctness and runtime of these algorithms. We show that each variant of GPRS induces a one-shot channel simulation protocol for correlated random variables x, y Px,y, achieving the optimal average codelength of I[x; y] + log2(I[x; y] + 1) + O(1) bits. We prove that when x is a R-valued random variable and the density ratio d Px|y/d Px is always unimodal, the channel simulation protocol based on the binary search variant of GPRS achieves O(I[x; y]) runtime, which is optimal. We conduct toy experiments on one-dimensional problems and show that GPRS compares favourably against A* coding, the current state-of-the-art channel simulation protocol. 2 Background The sampling algorithms we construct in this paper are search procedures on randomly placed points in space whose distribution is given by a Poisson process Π. Thus, in this section, we first review the necessary theoretical background for Poisson processes and how we can simulate them on a computer. Then, we formulate standard rejection sampling as a search procedure over the points of a Poisson process to serve as a prototype for our algorithm in the next section. Up to this point, our exposition loosely follows Sections 2, 3 and 5.1 of the excellent work of Maddison (2016), peppered with a few additional results that will be useful for analyzing our algorithm later. Finally, we describe the channel simulation problem, using rejection sampling as a rudimentary solution and describe its shortcomings. This motivates the development of greedy Poisson rejection sampling in Section 3. 2.1 Poisson Processes A Poisson process Π is a countable collection of random points in some mathematical space Ω. In the main text, we will always assume that Π is defined over Ω= R+ Rd and that all objects involved are measure-theoretically well-behaved for simplicity. For this choice of Ω, the positive reals represent time, and Rd represents space. However, most results generalize to settings when the spatial domain Rd is replaced with some more general space, and we give a general measure-theoretic construction in Appendix A, where the spatial domain is an arbitrary Polish space. Basic properties of Π: For a set A Ω, let N(A) def = |Π A| denote the number of points of Π falling in the set A, where | | denotes the cardinality of a set. Then, Π is characterized by the following two fundamental properties (Kingman, 1992). First, for two disjoint sets A, B Ω, A B = , the number of points of Π that fall in either set are independent random variables: N(A) N(B). Second, N(A) is Poisson distributed with mean measure µ(A) def = E[N(A)]. Time-ordering the points of Π: Since we assume that Ω= R+ Rd has a product space structure, we may write the points of Π as a pair of time-space coordinates: Π = {(Tn, Xn)} n=1. Furthermore, we can order the points in Π with respect to their time coordinates and index them accordingly, i.e. for i < j we have Ti < Tj. Hence, we refer to (Tn, Xn) as the nth arrival of the process. As a slight abuse of notation, we define N(t) def = N([0, t) Rd) and µ(t) def = E[N(t)], i.e. these quantities measure the number and average number of points of Π that arrive before time t, respectively. In this paper, we assume µ(t) has derivative µ (t), and assume for each t 0 there is a conditional probability distribution PX|T =t with density p(x | t), such that we can write the mean measure as µ(A) = R A p(x | t)µ (t) dx dt. 
Algorithm 1: Generating a (λ, PX|T )-Poisson process. Input :Time rate λ, Spatial distribution PX|T T0 0 for n = 1, 2, . . . do n Exp(λ) Tn Tn 1 + n Xn PX|T =Tn yield (Tn, Xn) end Simulating Π: A simple method to simulate Π on a computer is to realize it in time order, i.e. at step n, simulate Tn and then use it to simulate Xn PX|T =Tn. We can find the distribution of Π s first arrival by noting that no point of Π can come before it and hence P[T1 t] = P[N(t) = 0] = exp( µ(t)), where the second equality follows since N(t) is Poisson distributed with mean µ(t). A particularly important case is when Π is time-homogeneous, i.e. µ(t) = λt for some λ > 0, in which case T1 Exp(λ) is an exponential random variable with rate λ. In fact, all of Π s inter-arrival times n = Tn Tn 1 for n 1 share this simple distribution, where we set T0 = 0. To see this, note that P[ n t | Tn 1] = P N [Tn 1, Tn 1 + t) Rd = 0 | Tn 1 = exp( λt), (1) i.e. all n | Tn 1 Exp(λ). Therefore, we can use the above procedure to simulate timehomogeneous Poisson processes, described in Algorithm 1. We will refer to a time-homogeneous Poisson process with time rate λ and spatial distribution PX|T as a (λ, PX|T )-Poisson process. Rejection sampling using Π: Rejection sampling is a technique to simulate samples from a target distribution Q using a proposal distribution P, assuming we can find an upper bound M > 0 for Algorithm 2: Standard rejection sampler. Input :Proposal distribution P, Density ratio r = d Q/d P, Upper bound M for r. Output :Sample X Q and its index N. // Generator for a (1, P)-Poisson process using Algorithm 1. Π Simulate PP(1, P) for n = 1, 2, . . . do (Tn, Xn) next(Π) Un Unif(0, 1) if Un < r(Xn)/M then return Xn, n end end Algorithm 3: Greedy Poisson rejection sampler. Input :Proposal distribution P, Density ratio r = d Q/d P, Stretch function σ. Output :Sample X Q and its index N. // Generator for a (1, P)-Poisson process using Algorithm 1. Π Simulate PP(1, P) for n = 1, 2, . . . do (Tn, Xn) next(Π) if Tn < σ (r(Xn)) then return Xn, n end end their density ratio r = d Q/d P (technically, the Radon-Nikodym derivative). We can formulate this procedure using a Poisson process: we simulate the arrivals (Tn, Xn) of a (1, P)-Poisson process Π, but we only keep them with probability r(Xn)/M, otherwise, we delete them. This algorithm is described in Algorithm 2; its correctness is guaranteed by the thinning theorem (Maddison, 2016). Rejection sampling is suboptimal: Using Poisson processes to formulate rejection sampling highlights a subtle but crucial inefficiency: it does not make use of Π s temporal structure and only uses the spatial coordinates. GPRS fixes this by using a rejection criterion that does depend on the time variable. As we show, this significantly speeds up sampling for certain classes of distributions. 2.2 Channel Simulation The main motivation for our work is to develop a one-shot channel simulation protocol using the sampling algorithm we derive in Section 3. Channel simulation is of significant theoretical and practical interest. Recent works used it to compress neural network weights, achieving state-of-the-art performance (Havasi et al., 2018); to perform image compression using variational autoencoders (Flamich et al., 2020) and diffusion models with perfect realism (Theis et al., 2022); and to perform differentially private federated learning by compressing noisy gradients (Shah et al., 2022). 
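To make Algorithms 1 and 2 above concrete, here is a minimal Python sketch of time-ordered simulation of a time-homogeneous (λ, P)-Poisson process and of rejection sampling viewed as thinning that process. The Gaussian target and proposal, the bound M = 7, and all function names are illustrative assumptions rather than part of the paper.

```python
import numpy as np
from scipy import stats

def poisson_process(rate, spatial_dist, rng):
    """Algorithm 1 (sketch): yield arrivals (T_n, X_n) of a time-homogeneous
    (rate, P)-Poisson process in time order; inter-arrival times are Exp(rate)."""
    t = 0.0
    while True:
        t += rng.exponential(1.0 / rate)           # Delta_n ~ Exp(rate)
        yield t, spatial_dist.rvs(random_state=rng)

def rejection_sample(target, proposal, M, rng):
    """Algorithm 2 (sketch): thin a (1, P)-process, keeping each arrival w.p. r(X_n)/M."""
    r = lambda x: target.pdf(x) / proposal.pdf(x)  # density ratio dQ/dP
    for n, (t, x) in enumerate(poisson_process(1.0, proposal, rng), start=1):
        if rng.uniform() < r(x) / M:
            return x, n                            # accepted sample and its index N

# Example (assumed): Q = N(1, 0.25^2), P = N(0, 1); M must upper-bound r = dQ/dP.
rng = np.random.default_rng(0)
Q, P = stats.norm(1.0, 0.25), stats.norm(0.0, 1.0)
x, n = rejection_sample(Q, P, M=7.0, rng=rng)
```

When M is tight, the returned index N is geometrically distributed, which is exactly the quantity the rejection-sampling-based channel simulation protocol of the next subsection transmits.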
One-shot channel simulation is a communication problem between two parties, Alice and Bob, sharing a joint distribution Px,y over two correlated random variables x and y, where we assume that Alice and Bob can simulate samples from the marginal Px. In a single round of communication, Alice receives a sample y ∼ Py from the marginal distribution over y. Then, she needs to send the minimum number of bits to Bob such that he can simulate a single sample from the conditional distribution x ∼ Px|y. Note that Bob does not want to learn Px|y; he just wants to simulate a single sample from it. Surprisingly, when Alice and Bob have access to shared randomness, e.g. by sharing the seed of their random number generator before communication, they can solve channel simulation very efficiently. Mathematically, in this paper we will always model this shared randomness by some time-homogeneous Poisson process (or processes) Π, since given a shared random seed, Alice and Bob can always simulate the same process using Algorithm 1. Then, the average coding cost of x given Π is its conditional entropy H[x | Π] and, surprisingly, it is always upper bounded as H[x | Π] ≤ I[x; y] + log2(I[x; y] + 1) + O(1), where I[x; y] is the mutual information between x and y (Li & El Gamal, 2018). This is an especially curious result, given that in many cases H[x] is infinite while I[x; y] is finite, e.g. when x is a continuous variable. In essence, this result means that given the additional structure provided by Px,y, channel simulation protocols can offload an infinite amount of information into the shared randomness and only communicate the finitely many necessary bits.

An example channel simulation protocol with rejection sampling: Given Π and y ∼ Py, Alice sets Q ← Px|y as the target and P ← Px as the proposal distribution with density ratio r = dQ/dP, and runs the rejection sampler in Algorithm 2 to find the first point of Π that was not deleted. She counts the number of samples N she had to simulate before acceptance and sends this number to Bob. He can then decode a sample x ∼ Px|y by selecting the spatial coordinate of the Nth arrival of Π. Unfortunately, this simple protocol is suboptimal. To see this, let D∞[Q‖P] := sup_{x∈Ω} {log2 r(x)} denote the Rényi ∞-divergence of Q from P, and recall two standard facts: (1) the best possible upper bound Alice can use for rejection sampling is Mopt = exp2(D∞[Q‖P]), where exp2(x) = 2^x, and (2) the number of samples N drawn until acceptance is a geometric random variable with mean Mopt (Maddison, 2016). We now state the two issues with rejection sampling that GPRS solves.

Problem 1: Codelength. By using the formula for the entropy of a geometric random variable and assuming Alice uses the best possible bound Mopt in the protocol, we find that

$$H[x \mid \Pi] = \mathbb{E}_{y \sim P_y}\bigl[H[N \mid y]\bigr] \gtrsim \mathbb{E}_{y \sim P_y}\bigl[D_\infty[P_{x|y} \,\|\, P_x]\bigr] \overset{(a)}{\geq} I[x; y], \tag{2}$$

see Appendix I for the derivation. Unfortunately, inequality (a) can be arbitrarily loose, hence the average codelength scales with the expected ∞-divergence instead of I[x; y], as would be optimal.

Problem 2: Slow runtime. We are interested in classifying the time complexity of our protocol. As we saw, for a target Q and proposal P, Algorithm 2 draws Mopt = exp2(D∞[Q‖P]) samples on average. Unfortunately, under the computational hardness assumption RP ≠ NP, Agustsson & Theis (2020) showed that without any further assumptions, there is no sampler whose runtime scales polynomially in DKL[Q‖P]. However, with further assumptions, we can do much better, as we show in Section 3.1.
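To see how large the gap between the two divergences can get, the short sketch below evaluates DKL[Q‖P] and D∞[Q‖P] in bits for Q = N(m, s²) and P = N(0, 1), using the standard Gaussian closed forms (for s < 1); the specific (m, s) values are assumptions chosen for illustration. Holding DKL fixed while s → 1 sends D∞, and therefore the expected sample count exp2(D∞) of Algorithm 2, to infinity.

```python
import numpy as np

LOG2E = np.log2(np.e)

def gaussian_divergences_bits(m, s):
    """D_KL and D_infinity of Q = N(m, s^2) from P = N(0, 1), in bits (requires s < 1,
    otherwise sup dQ/dP, and hence D_infinity, is infinite)."""
    d_kl = LOG2E * ((m**2 + s**2 - 1.0) / 2.0 - np.log(s))
    d_inf = LOG2E * (np.log(1.0 / s) + m**2 / (2.0 * (1.0 - s**2)))
    return d_kl, d_inf

# Mirror the spirit of the one-shot experiment described later: choose m so that
# D_KL[Q||P] stays fixed at 2 bits while s -> 1, which blows up D_inf and the
# expected runtime 2^{D_inf} of standard rejection sampling.
target_kl_nats = 2.0 / LOG2E
for s in [0.5, 0.9, 0.99]:
    m = np.sqrt(2.0 * (target_kl_nats + np.log(s)) + 1.0 - s**2)
    d_kl, d_inf = gaussian_divergences_bits(m, s)
    print(f"s={s:.2f}: D_KL = {d_kl:.2f} bits, D_inf = {d_inf:.1f} bits, "
          f"E[N] ~ {2**d_inf:.0f} samples")
```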
3 Greedy Poisson Rejection Sampling

We now describe GPRS; its pseudo-code is shown in Algorithm 3. This section assumes that Q and P are the target and proposal distributions, respectively, and r = dQ/dP is their density ratio. Let Π be a (1, P)-Poisson process. Our proposed rejection criterion is now embarrassingly simple: for an appropriate invertible function σ : R+ → R+, accept the first arrival of Π that falls under the graph of the composite function ϕ = σ ∘ r, as illustrated in the left plot in Figure 1. We refer to σ as the stretch function for r, as its purpose is to stretch the density ratio along the time-axis towards ∞.

Deriving the stretch function (sketch, see Appendix A for details): Let ϕ = σ ∘ r, where for now σ is an arbitrary invertible function on R+, let U = {(t, x) ∈ Ω | t ≤ ϕ(x)} be the set of points under the graph of ϕ, and let Π̃ = Π ∩ U. By the restriction theorem (Kingman, 1992), Π̃ is also a Poisson process with mean measure μ̃(A) = μ(A ∩ U). Let (T̃, X̃) be the first arrival of Π̃, i.e. the first arrival of Π under ϕ, and let Qσ be the marginal distribution of X̃. Then, as we show in Appendix A, the density ratio dQσ/dP is given by

$$\frac{dQ_\sigma}{dP}(x) = \int_0^{\phi(x)} \mathbb{P}[\tilde{T} \geq t]\, dt. \tag{3}$$

Now we pick σ such that Qσ = Q, for which we need to ensure that dQσ/dP = r. Substituting t = ϕ(x) into Equation (3) and differentiating, we get (σ⁻¹)′(t) = P[T̃ ≥ t]. Since (T̃, X̃) falls under the graph of ϕ by definition, we have P[T̃ ≥ t] = P[T̃ ≥ t, r(X̃) ≥ σ⁻¹(t)]. By expanding the definition of the right-hand side, we obtain a time-invariant ODE for σ⁻¹:

$$\left(\sigma^{-1}\right)' = w_Q\!\left(\sigma^{-1}\right) - \sigma^{-1}\, w_P\!\left(\sigma^{-1}\right), \quad \text{with } \sigma^{-1}(0) = 0, \tag{4}$$

where w_P(h) := P_{Z∼P}[r(Z) ≥ h] and w_Q is defined analogously. Finally, solving the ODE, we get

$$\sigma(h) = \int_0^h \frac{1}{w_Q(\eta) - \eta\, w_P(\eta)}\, d\eta. \tag{5}$$

Remember that picking σ according to Equation (5) ensures that GPRS is correct by construction. To complete the picture, in Appendix A we prove that (T̃, X̃) always exists and Algorithm 3 terminates with probability 1. We now turn our attention to analyzing the runtime of Algorithm 3 and, surprisingly, we find that the expected runtime of GPRS matches that of standard rejection sampling; see Appendix B.2 for the proof. Note that in our analyses, we identify GPRS's time complexity with the number of arrivals of Π it simulates. Moreover, we assume that we can evaluate σ in O(1) time.

Theorem 3.1 (Expected Runtime). Let Q and P be the target and proposal distributions for Algorithm 3, respectively, and r = dQ/dP their density ratio. Let N denote the number of samples simulated by the algorithm before it terminates. Then,

$$\mathbb{E}[N] = \exp_2\left(D_\infty[Q \,\|\, P]\right) \quad \text{and} \quad \mathbb{V}[N] \geq \exp_2\left(D_\infty[Q \,\|\, P]\right), \tag{6}$$

where V[·] denotes the variance of a random variable.

Furthermore, using ideas from the work of Liu & Verdu (2018), we can show the following upper bound on the fractional moments of the index; see Appendix B.3 for the proof.

Theorem 3.2 (Fractional Moments of the Index). Let Q and P be the target and proposal distributions for Algorithm 3, respectively, and r = dQ/dP their density ratio. Let N denote the number of samples simulated by the algorithm before it terminates. Then, for α ∈ (0, 1) we have

$$\mathbb{E}[N^\alpha] \leq \frac{1}{1-\alpha}\left(1 + D_\infty[Q \,\|\, P]\right) \exp_2\!\left(\alpha\, D_{\frac{1}{1-\alpha}}[Q \,\|\, P]\right), \tag{7}$$

where $D_\alpha[Q \,\|\, P] = \frac{1}{\alpha-1}\log_2 \mathbb{E}_{Z\sim Q}\left[r(Z)^{\alpha-1}\right]$ is the Rényi α-divergence of Q from P.

GPRS induces a channel simulation protocol: We use an analogous solution to the standard rejection sampling-based scheme in Section 2.2 to obtain a channel simulation protocol for a pair of correlated random variables x, y ∼ Px,y.
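As a concrete illustration of the construction, below is a rough numerical sketch of Algorithm 3 in Python for the Gaussian example of Figure 1. The Monte Carlo estimates of w_P and w_Q, the Euler integration of Equation (4), and all step sizes are assumptions made for this sketch; the closed forms of Appendix G and a proper ODE solver would be used in practice. Because σ is increasing, the acceptance test Tn < σ(r(Xn)) is checked equivalently as σ⁻¹(Tn) < r(Xn), so only the numerically stable inverse is ever needed.

```python
import numpy as np
from scipy import stats

# Assumed example, matching Figure 1: target Q = N(1, 0.25^2), proposal P = N(0, 1).
rng = np.random.default_rng(0)
Q, P = stats.norm(1.0, 0.25), stats.norm(0.0, 1.0)
r = lambda x: Q.pdf(x) / P.pdf(x)                        # density ratio dQ/dP

# Monte Carlo estimates of the survival functions w_P(h) = P_{Z~P}[r(Z) >= h], w_Q analogous.
rP = np.sort(r(P.rvs(100_000, random_state=rng)))
rQ = np.sort(r(Q.rvs(100_000, random_state=rng)))
w_P = lambda h: 1.0 - np.searchsorted(rP, h) / rP.size
w_Q = lambda h: 1.0 - np.searchsorted(rQ, h) / rQ.size

# Euler-integrate the ODE (4): (sigma^{-1})'(t) = w_Q(s) - s * w_P(s), s(0) = 0.
dt = 1e-2
ts = np.arange(0.0, 50.0, dt)
s = np.zeros_like(ts)
for i in range(1, ts.size):
    s[i] = s[i - 1] + dt * (w_Q(s[i - 1]) - s[i - 1] * w_P(s[i - 1]))
sigma_inv = lambda t: np.interp(t, ts, s)                # clamps to the plateau for large t

def gprs(rng):
    """Algorithm 3 (sketch): accept the first arrival (T_n, X_n) of a (1, P)-process
    with sigma^{-1}(T_n) < r(X_n), i.e. T_n < sigma(r(X_n))."""
    t, n = 0.0, 0
    while True:
        n += 1
        t += rng.exponential(1.0)
        x = P.rvs(random_state=rng)
        if sigma_inv(t) < r(x):
            return x, n

samples = np.array([gprs(rng)[0] for _ in range(2_000)])  # approximately Q-distributed
```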
First, we assume that both the encoder and the decoder have access to the same realization of a (1, Px)-Poisson process Π, which they will use as their shared randomness. In a single round of communication, the encoder receives y ∼ Py, sets Q ← Px|y as the target and P ← Px as the proposal distribution, and runs Algorithm 3 to find the index N of the accepted arrival in Π. Finally, following Li & El Gamal (2018), they use a zeta distribution ζ(n | s) ∝ n^{−s} with (s − 1)^{−1} = I[x; y] + 2 log2 e to encode N using entropy coding. Then, the decoder simulates a Px|y-distributed sample by looking up the spatial coordinate XN of Π's Nth arrival. The following theorem shows that this protocol is optimally efficient; see Appendix B.4 for the proof.

Theorem 3.3 (Expected Codelength). Let Px,y be a joint distribution over correlated random variables x and y, and let Π be the (1, Px)-Poisson process used by Algorithm 3. Then, the above protocol induced by the algorithm is optimally efficient up to a constant, in the sense that

$$H[x \mid \Pi] \leq I[x; y] + \log_2\left(I[x; y] + 1\right) + 6. \tag{8}$$

Note that this result shows we can use GPRS as an alternative to the Poisson functional representation to prove the strong functional representation lemma (Theorem 1; Li & El Gamal, 2018).

GPRS in practice: The real-world applicability of Algorithm 3 is limited by the following three computational obstacles: (1) GPRS requires exact knowledge of the density ratio dQ/dP to evaluate ϕ, which immediately excludes many practical problems where dQ/dP is known only up to a normalizing constant. (2) To compute σ, we need to evaluate Equation (4) or (5), which requires us to compute w_P and w_Q, which is usually intractable for general dQ/dP. Fortunately, in some special cases they can be computed analytically: in Appendix G, we provide the expressions for discrete, uniform, triangular, Gaussian, and Laplace distributions. (3) However, even when we can compute w_P and w_Q, analytically evaluating the integral in Equation (5) is usually not possible. Moreover, solving it numerically is unstable as σ is unbounded. Instead, we numerically solve for σ⁻¹ using Equation (4) in practice, which fortunately turns out to be stable. However, we emphasize that GPRS's raison d'être is to perform channel simulation using the protocol we outline above or the ones we present in Section 3.1, not general-purpose random variate simulation. Practically relevant channel simulation problems, such as encoding a sample from the latent posterior of a VAE (Flamich et al., 2020) or encoding variational implicit neural representations (Guo et al., 2023; He et al., 2023), usually involve simple distributions such as Gaussians. In these cases, we have readily available formulae for w_P and w_Q and can cheaply evaluate σ⁻¹ via numerical integration.

GPRS as greedy search: The defining feature of GPRS is that its acceptance criterion at each step is local: if at step n the arrival (Tn, Xn) in Π falls under ϕ, the algorithm accepts it. Thus, it is a greedy search procedure. This is in contrast with A* sampling (Maddison et al., 2014), whose acceptance criterion is global, as the acceptance of the nth arrival (Tn, Xn) depends on all other points of Π in the general case. In other words, using the language of Liu & Verdu (2018), GPRS is a causal sampler, as its acceptance criterion at each step only depends on the samples the algorithm has examined in previous steps.
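The protocol is straightforward to sketch end to end: the shared randomness is just a shared PRNG seed, Alice transmits only the index N, and Bob replays the same process. In the sketch below, the helper names and the `accept` predicate (which stands in for the test σ⁻¹(Tn) < r(Xn) with r = dPx|y/dPx) are illustrative assumptions.

```python
import numpy as np
from scipy import stats
from scipy.special import zeta as riemann_zeta

def shared_arrivals(seed, Px):
    """Arrivals of the shared (1, Px)-Poisson process, reproducible from the seed."""
    rng = np.random.default_rng(seed)
    t = 0.0
    while True:
        t += rng.exponential(1.0)
        yield t, Px.rvs(random_state=rng)

def alice_encode(seed, Px, accept):
    """Run GPRS on the shared process and return only the index N of the accepted arrival."""
    for n, (t, x) in enumerate(shared_arrivals(seed, Px), start=1):
        if accept(t, x):
            return n

def bob_decode(seed, Px, n):
    """Replay the shared process and output the spatial coordinate of the N-th arrival."""
    arrivals = shared_arrivals(seed, Px)
    for _ in range(n - 1):
        next(arrivals)
    return next(arrivals)[1]

def zeta_codelength_bits(n, mutual_info_bits):
    """Codelength of N under zeta(n | s) with 1/(s - 1) = I[x; y] + 2 log2 e."""
    s = 1.0 + 1.0 / (mutual_info_bits + 2.0 * np.log2(np.e))
    return s * np.log2(n) + np.log2(riemann_zeta(s))   # -log2 of n^{-s} / zeta(s)
```

In the Gaussian example from the earlier sketch, `accept` would simply reuse `sigma_inv` and `r` with Q = Px|y and P = Px.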
On the other hand, A* sampling is non-causal as its acceptance criterion in each step depends on future samples as well. Surprisingly, this difference between the search criteria does not make a difference in the average runtimes and codelengths in the general case. However, as we show in the next section, GPRS can be much faster in special cases. 3.1 Speeding up the greedy search This section discusses two ways to improve the runtime of GPRS. First, we show how we can utilize available parallel computing power to speed up Algorithm 3. Second, we propose an advanced search Algorithm 4: Parallel GPRS with J available threads. Input :Proposal distribution P, Density ratio r = d Q/d P, Stretch function σ, Number of parallel threads J. Output :Sample X Q and its code (j , Nj ). T , X , j , Nj , nil, nil, nil in parallel for j = 1, . . . , J do Πj Simulate PP(1/J, P) for nj = 1, 2, . . . do T (j) nj , X(j) nj next(Πj) if T < T (j) nj then terminate thread j. end if T (j) nj < σ r X(j) nj then T , X , j , Nj T (j) nj , X(j) nj , j, nj terminate thread j. end end end return X , (j , Nj ) Algorithm 5: Branch-and-bound GPRS on R with unimodal r Input :Proposal distribution P, Density ratio r = d Q/d P, Stretch function σ, Location x of the mode of r. Output :Sample X Q and its heap index H. T0, H, B (0, 1, R) for d = 1, 2, . . . do Xd P|B d Exp (P(B)) Td Td 1 + d if Td < σ(r(Xd)) then return Xd, H end if Xd x then B B ( , Xd) H 2H else B B (Xd, ) H 2H + 1 end end strategy when the spatial domain Ωhas more structure and obtain a super-exponential improvement in the runtime from exp2 (D [Q P]) to O(DKL[Q P]) in certain cases. Parallel GPRS: The basis for parallelizing GPRS is the superposition theorem (Kingman, 1992): For a positive integer J, let Π1, . . . ΠJ all be (1/J, P)-Poisson processes; then, Π = SJ j=1 Πj is a (1, P)-Poisson process. As shown in Algorithm 4, this result makes parallelizing GPRS very simple given J parallel threads: First, we independently look for the first arrivals of Π1, . . . , ΠJ under the graph of ϕ, yielding T (1) N1 , X(1) N1 , . . . , T (J) NJ , X(J) NJ , respectively, where the Nj corresponds to the index of Πj s first arrival under ϕ. Then, we select the candidate with the earliest arrival time, i.e. j = arg minj {1,...,J} T (j) Nj . Now, by the superposition theorem, T (j ) Nj , X(j ) Nj is the first arrival of Π under ϕ, and hence X(j ) Nj Q. Finally, Alice encodes the sample via the tuple (j , Nj ), i.e. which of the J processes the first arrival occurred in, and the index of the arrival in Πj . See the middle graphic in Figure 1 for an example with J = 2 parallel threads. Our next results show that parallelizing GPRS results in a linear reduction in both the expectation and variance of its runtime and a more favourable codelength guarantee for channel simulation. Theorem 3.4 (Expected runtime of parallelized GPRS). Let Q, P and r be defined as above, and let νj denote the random variable corresponding to the number of samples simulated by thread j in Algorithm 4 using J threads. Then, for all j, E[νj] = (exp2 (D [Q P]) 1) J + 1 and V[νj] (exp2 (D [Q P]) 1) J + 1. (9) Theorem 3.5 (Expected codelength of parallelized GPRS). Let Px,y be a joint distribution over correlated random variables x and y, and let Π1, . . . , ΠJ be (1/J, Px)-Poisson processes. Then, assuming log2 J I[x; y], parallelized GPRS induces a channel simulation protocol such that H[x | Π1, . . . , ΠJ] I[x; y] + log2 (I[x; y] log2 J + 1) + 8. (10) See Appendix C for the proofs. 
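A minimal sketch of Algorithm 4 in Python, assuming the same `r` and `sigma_inv` helpers as in the earlier sketches; the J sub-processes are simulated sequentially here purely for clarity, whereas a real implementation would run them on parallel threads.

```python
import numpy as np

def parallel_gprs(J, P_dist, r, sigma_inv, rng):
    """Algorithm 4 (sketch): search J independent (1/J, P)-processes for their first
    arrivals under phi and keep the earliest one across processes (superposition)."""
    best = None                                    # (time, x, j, n_j) of current winner
    for j in range(1, J + 1):
        t, n = 0.0, 0
        while True:
            n += 1
            t += rng.exponential(J)                # rate 1/J  =>  mean inter-arrival J
            if best is not None and t > best[0]:
                break                              # thread j can no longer win
            x = P_dist.rvs(random_state=rng)
            if sigma_inv(t) < r(x):                # thread j's first arrival under phi
                if best is None or t < best[0]:
                    best = (t, x, j, n)
                break
    _, x_star, j_star, n_star = best
    return x_star, (j_star, n_star)                # sample and its two-part code
```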
Note that we can use the same parallelisation argument, with the appropriate modifications, to speed up A* sampling/coding too.

Branch-and-bound GPRS on R: We briefly restrict our attention to problems on R when the density ratio r is unimodal, as we can exploit this additional structure to search more efficiently for Π's first arrival under ϕ. Consider the example in the right graphic in Figure 1: we simulate the first arrival (T1, X1) and reject it, since T1 > ϕ(X1). Since r is unimodal by assumption and σ is increasing, ϕ is also unimodal. Hence, for x < X1 we must have ϕ(x) < ϕ(X1), while the arrival time of any of the later arrivals will be larger than T1. Therefore, none of the arrivals to the left of X1 will fall under ϕ either! Hence, it is sufficient to simulate Π to the right of X1, which we can easily do using generalized inverse transform sampling.1 We repeat this argument for all further arrivals to obtain an efficient search procedure for finding Π's first arrival under ϕ, described in Algorithm 5: We simulate the next arrival (TB, XB) of Π within some bounds B, starting with B = Ω. If TB ≤ ϕ(XB), we accept; otherwise, we truncate the bound to B ← B ∩ (−∞, XB) or B ← B ∩ (XB, ∞) based on where XB falls relative to ϕ's mode. We repeat these two steps until we find the first arrival. Since Algorithm 5 does not simulate every point of Π, we cannot use the index N of the accepted arrival to obtain a channel simulation protocol as before. Instead, we encode the search path, i.e. whether we chose the left or the right side of our current sample at each step. Similarly to A* coding, we encode the path using its heap index (Flamich et al., 2022): the root has index Hroot = 1, and for a node with index H, its left child is assigned index 2H and its right child 2H + 1. As the following theorems show, this version of GPRS is, in fact, optimally efficient; see Appendix E for the proofs.

Theorem 3.6 (Expected Runtime of GPRS with binary search). Let Q, P and r be defined as above, let D denote the number of samples simulated by Algorithm 5, and let λ = 1/log2(4/3). Then,

$$\mathbb{E}[D] \leq \lambda \cdot D_{\mathrm{KL}}[Q \,\|\, P] + \mathcal{O}(1). \tag{11}$$

Theorem 3.7 (Expected Codelength of GPRS with binary search). Let Px,y be a joint distribution over correlated random variables x and y, and let Π be a (1, Px)-Poisson process. Then, GPRS with binary search induces a channel simulation protocol such that

$$H[x \mid \Pi] \leq I[x; y] + 2\log_2\left(I[x; y] + 1\right) + 11. \tag{12}$$

Branch-and-bound GPRS with splitting functions: With some additional machinery, Algorithm 5 can be extended to more general settings, such as R^D and cases where r is not unimodal, by introducing the notion of a splitting function. For a region of space B ⊆ Ω, a splitting function split simply returns a binary partition of the set, i.e. {L, R} = split(B), such that L ∩ R = ∅ and L ∪ R = B. In this case, we can perform a similar tree search to Algorithm 5, captured in Algorithm 6. Recall that (T̃, X̃) is the first arrival of Π under ϕ. Starting with the whole space B = Ω, we simulate the next arrival (T, X) of Π in B at each step. If we reject it, we partition B into two parts, L and R, using split. Then, with probability P[X̃ ∈ R | X̃ ∈ B, T̃ ≥ T], we continue searching through the arrivals of Π only in R, and only in L otherwise. We show the correctness of this procedure in Appendix F and describe how to compute P[X̃ ∈ R | X̃ ∈ B, T̃ ≥ T] and σ in practice.
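Below is a minimal Python sketch of Algorithm 5, again assuming the `r` and `sigma_inv` helpers from the earlier sketches. Truncated proposals are drawn with the inverse-CDF trick from the footnote, and the heap-index bookkeeping follows the left-child 2H / right-child 2H + 1 convention from the text; pairing each rejected sample's side of the mode with a particular child is our own consistent choice for this sketch.

```python
import numpy as np

def branch_and_bound_gprs(P_dist, r, sigma_inv, mode, rng):
    """Algorithm 5 (sketch) for a unimodal density ratio r on R; `mode` is the
    location of r's maximum and B = (lo, hi) is the current search bound."""
    lo, hi = -np.inf, np.inf
    t, heap_index = 0.0, 1
    while True:
        F_lo, F_hi = P_dist.cdf(lo), P_dist.cdf(hi)
        t += rng.exponential(1.0 / (F_hi - F_lo))  # next arrival within B: Exp(P(B))
        x = P_dist.ppf(rng.uniform(F_lo, F_hi))    # X ~ P truncated to B
        if sigma_inv(t) < r(x):                    # first arrival under phi: accept
            return x, heap_index
        if x <= mode:                              # everything left of x lies below phi(x)
            lo, heap_index = x, 2 * heap_index + 1 # continue in the right sub-interval
        else:                                      # everything right of x lies below phi(x)
            hi, heap_index = x, 2 * heap_index     # continue in the left sub-interval
```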
This splitting procedure is analogous to the general version of A* sampling/coding (Maddison et al., 2014; Flamich et al., 2022), which is parameterized by the same splitting functions. Note that Algorithm 5 is a special case of Algorithm 6, where Ω = R, r is unimodal, and at each step, for an interval bound B = (a, b) and sample X ∈ (a, b), we split B into {(a, X), (X, b)}.

4 Experiments

We compare the average-case and one-shot efficiency of our proposed variants of GPRS and a couple of other channel simulation protocols in Figure 2. See the figure's caption and Appendix H for details of our experimental setup. The top two plots in Figure 2 demonstrate that our methods' expected runtimes and codelengths align well with our theorems' predictions and compare favourably to other methods. Furthermore, we find that the mean performance is a robust statistic for the binary search-based variants of GPRS in that it lines up closely with the median, and the interquartile range is quite narrow. However, we also see that Theorem 3.7 is slightly loose empirically. We conjecture that the coefficient of the log2(I[x; y] + 1) term in Equation (12) could be reduced to 1, and that the constant term could be improved as well. On the other hand, the bottom plot in Figure 2 demonstrates the most salient property of the binary search variant of GPRS: unlike all previous methods, its runtime scales with DKL[Q‖P] and not D∞[Q‖P]. Thus, we can apply it to a larger family of channel simulation problems, where DKL[Q‖P] is finite but D∞[Q‖P] is large or even infinite, and other methods would not terminate.

1 Suppose we have a real-valued random variable X with CDF F and inverse CDF F⁻¹. Then, we can simulate X truncated to [a, b] by simulating U ∼ Unif[F(a), F(b)] and computing X = F⁻¹(U).

Algorithm 6: Branch-and-bound GPRS with splitting function.
Input: Proposal distribution P, Density ratio r = dQ/dP, Stretch function σ, Splitting function split.
Output: Sample X ∼ Q and its heap index H.
(T0, H, B) ← (0, 1, Ω)
for d = 1, 2, . . . do
  Xd ∼ P|B
  Δd ∼ Exp(P(B))
  Td ← Td−1 + Δd
  if Td < σ(r(Xd)) then
    return Xd, H
  else
    {B0, B1} ← split(B)
    ρ ← P[X̃ ∈ B1 | X̃ ∈ B, T̃ ≥ Td]
    β ∼ Bernoulli(ρ)
    H ← 2H + β
    B ← Bβ
  end
end

Figure 2: Left: Binary search GPRS with arbitrary splitting function. Right: Performance comparison of different channel simulation protocols. In each plot, dashed lines indicate the mean, solid lines the median and the shaded areas the 25-75 percentile region of the relevant performance metric. We computed the statistics over 1000 runs for each setting. Top right: Runtime comparison on a 1D Gaussian channel simulation problem Px,µ, plotted against increasing mutual information I[x; µ]. Alice receives µ ∼ N(0, σ²) and encodes a sample x | µ ∼ N(µ, 1) to Bob. The abbreviations in the legend are: GPRS: Algorithm 3; PFR: Poisson functional representation / Global-bound A* coding (Li & El Gamal, 2018; Flamich et al., 2022); AS*: Split-on-sample A* coding (Flamich et al., 2022); sGPRS: split-on-sample GPRS (Algorithm 5); dGPRS: GPRS with dyadic split (Algorithm 6). Middle right: Average codelength comparison of our proposed algorithms on the same channel simulation problem as above. UB in the legend corresponds to an upper bound of I[x; µ] + log2(I[x; µ] + 1) + 2 bits.
We estimate the algorithms' expected codelengths by encoding the indices returned by GPRS using a Zeta distribution ζ(n | λ) ∝ n^{−λ} with λ = 1 + 1/I[x; y] in each case, which is the optimal maximum entropy distribution for this problem setting (Li & El Gamal, 2018). Bottom right: One-shot runtime comparison of sGPRS with AS* coding. Alice encodes samples from a target Q = N(m, s²) using P = N(0, 1) as the proposal. We computed m and s² such that DKL[Q‖P] = 2 bits for each problem, while D∞[Q‖P] increases. The runtime of sGPRS stays fixed as it scales with DKL[Q‖P], while the runtime of A* keeps increasing.

5 Related Work

Poisson processes for channel simulation: Poisson processes were introduced to the channel simulation literature by Li & El Gamal (2018) via their construction of the Poisson functional representation (PFR) in their proof of the strong functional representation lemma. Flamich et al. (2022) observed that the PFR construction is equivalent to a certain variant of A* sampling (Maddison et al., 2014; Maddison, 2016). Thus, they proposed an optimized version of the PFR called A* coding, which achieves O(D∞[Q‖P]) runtime for one-dimensional unimodal distributions. GPRS was mainly inspired by A* coding, and they are dual constructions to each other in the following sense: depth-limited A* coding can be thought of as an importance sampler, i.e. a Monte Carlo algorithm that returns an approximate sample in fixed time. On the other hand, GPRS is a rejection sampler, i.e. a Las Vegas algorithm that returns an exact sample in random time.

Channel simulation with dithered quantization: Dithered quantization (DQ; Ziv, 1985) is an alternative to rejection and importance sampling-based approaches to channel simulation. DQ exploits the fact that for any c ∈ R and U, U′ ∼ Unif(−1/2, 1/2), the quantities ⌊c − U⌉ + U and c + U′ are equal in distribution. While DQ has been around for decades as a tool to model and analyze quantization error, Agustsson & Theis (2020) reinterpreted it as a channel simulation protocol and used it to develop a VAE-based neural image compression algorithm. Unfortunately, basic DQ only allows uniform target distributions, limiting its applicability. As a partial remedy, Theis & Yosri (2022) showed DQ could be combined with other channel simulation protocols to speed them up and thus called their approach hybrid coding (HQ). Originally, HQ required that the target distribution be compactly supported, a requirement lifted by Flamich & Theis (2023), who developed an adaptive rejection sampler using HQ. In a different vein, Hegazy & Li (2022) generalize and analyze a method proposed in the appendix of Agustsson & Theis (2020) and show that DQ can be used to realize channel simulation protocols for one-dimensional symmetric, unimodal distributions.

Greedy Rejection Coding: Concurrently to our work, Flamich et al. (2023) introduce greedy rejection coding (GRC), which generalizes Harsha et al.'s rejection sampling algorithm to arbitrary probability spaces and arbitrary splitting functions, extending the work of Flamich & Theis (2023). Furthermore, they also utilize a space-partitioning procedure to speed up the convergence of their sampler and prove that a variant of their sampler also achieves optimal runtime for one-dimensional problems where dQ/dP is unimodal. However, the construction of their method differs significantly from ours. GRC is a direct generalization of Harsha et al.'s (2007) algorithm and, thus, a more conventional rejection sampler, while we base our construction on Poisson processes.
Thus, our proof techniques are also significantly different. It is an interesting research question whether GRC could be formulated using Poisson processes, akin to standard rejection sampling in Algorithm 2, as this could be used to connect the two algorithms and improve both. 6 Discussion and Future Work Using the theory of Poisson processes, we constructed greedy Poisson rejection sampling. We proved the correctness of the algorithm and analyzed its runtime, and showed that it could be used to obtain a channel simulation protocol. We then developed several variations on it, analyzed their runtimes, and showed that they could all be used to obtain channel simulation protocols. As the most significant result of the paper, we showed that using the binary search variant of GPRS we can achieve O(DKL[Q P]) runtime for arbitrary one-dimensional, unimodal density ratios, significantly improving upon the previous best O(D [Q P]) bound by A* coding. There are several interesting directions for future work. From a practical perspective, the most pressing question is whether efficient channel simulation algorithms exist for multivariate problems; finding an efficient channel simulation protocol for multivariate Gaussians would already have farreaching practical consequences. We also highlight an interesting theoretical question. We found that the most general version of GPRS needs to simulate exactly exp2(D [Q P]) samples in expectation, which matches A* sampling. From Agustsson & Theis (2020), we know that for general problems, the average number of simulated samples has to be at least on the order of exp2(DKL[Q P]). Hence we ask whether this lower bound can be tightened to match the expected runtime of GPRS and A* sampling or whether there exists an algorithm that achieves a lower runtime. 7 Acknowledgements We are grateful to Lennie Wells for the many helpful discussions throughout the project, particularly for suggesting the nice argument for Lemma D.2; Peter Wildemann for his advice on solving some of the differential equations involved in the derivation of GPRS and Stratis Markou for his help that aided our understanding of some high-level concepts so that we could derive Algorithm 6. We would also like to thank Lucas Theis for the many enjoyable early discussions about the project and for providing helpful feedback on the manuscript; and Cheuk Ting Li for bringing Liu & Verdu s work to our attention. Finally, we would like to thank Daniel Goc for proofreading the manuscript and Lorenzo Bonito for helping to design Figures 1 and 2 and providing feedback on the early version of the manuscript. GF acknowledges funding from Deep Mind. Eirikur Agustsson and Lucas Theis. Universally quantized neural compression. In Advances in Neural Information Processing Systems, volume 33, 2020. Harry Bateman and Arthur Erdélyi. Higher transcendental functions, volume 1. Mc Graw-Hill book company, 1953. Robert M. Corless, Gaston H. Gonnet, David E. G. Hare, David J. Jeffrey, and Donald E. Knuth. On the Lambert W function. Advances in Computational mathematics, 5:329 359, 1996. Gergely Flamich and Lucas Theis. Adaptive greedy rejection sampling. In IEEE International Symposium on Information Theory (ISIT), pp. 454 459, 2023. Gergely Flamich, Marton Havasi, and José Miguel Hernández Lobato. Compressing images by encoding their latent representations with relative entropy coding. Advances in Neural Information Processing Systems, 33, 2020. Gergely Flamich, Stratis Markou, and José Miguel Hernández Lobato. 
Fast relative entropy coding with A* coding. In International Conference on Machine Learning, pp. 6548 6577, 2022. Gergely Flamich, Stratis Markou, and José Miguel Hernández Lobato. Faster relative entropy coding with greedy rejection coding. In Advances in Neural Information Processing Systems, 2023. Zongyu Guo, Gergely Flamich, Jiajun He, Zhibo Chen, and José Miguel Hernández Lobato. Compression with Bayesian implicit neural representations. In Advances in Neural Information Processing Systems, 2023. Prahladh Harsha, Rahul Jain, David Mc Allester, and Jaikumar Radhakrishnan. The communication complexity of correlation. In Twenty-Second Annual IEEE Conference on Computational Complexity (CCC 07), pp. 10 23. IEEE, 2007. Marton Havasi, Robert Peharz, and José Miguel Hernández Lobato. Minimal random code learning: Getting bits back from compressed model parameters. In International Conference on Learning Representations, 2018. Jiajun He, Gergely Flamich, Zongyu Guo, and José Miguel Hernández Lobato. RECOMBINER: Robust and enhanced compression with Bayesian implicit neural representations. ar Xiv preprint ar Xiv:2309.17182, 2023. Mahmoud Hegazy and Cheuk Ting Li. Randomized quantization with exact error distribution. In 2022 IEEE Information Theory Workshop (ITW), pp. 350 355. IEEE, 2022. Diederik P. Kingma and Max Welling. Auto-encoding variational Bayes. In International Conference on Learning Representations, 2014. John F. C. Kingman. Poisson Processes. Oxford Studies in Probability. Clarendon Press, 1992. ISBN 9780191591242. Cheuk Ting Li and Abbas El Gamal. Strong functional representation lemma and applications to coding theorems. IEEE Transactions on Information Theory, 64(11):6967 6978, 2018. Jingbo Liu and Sergio Verdu. Rejection sampling and noncausal sampling under moment constraints. In 2018 IEEE International Symposium on Information Theory (ISIT), pp. 1565 1569. IEEE, 2018. Chris J. Maddison. A Poisson process model for Monte Carlo. Perturbation, Optimization, and Statistics, pp. 193 232, 2016. Chris J. Maddison, Daniel Tarlow, and Tom Minka. A* sampling. In Advances in Neural Information Processing Systems, volume 27, pp. 3086 3094, 2014. Pat Muldowney, Krzysztof Ostaszewski, and Wojciech Wojdowski. The Darth Vader rule. Tatra Mountains Mathematical Publications, 52(1):53 63, 2012. Jorma Rissanen and Glen G Langdon. Arithmetic coding. IBM Journal of research and development, 23(2):149 162, 1979. Abhin Shah, Wei-Ning Chen, Johannes Balle, Peter Kairouz, and Lucas Theis. Optimal compression of locally differentially private mechanisms. In International Conference on Artificial Intelligence and Statistics, pp. 7680 7723. PMLR, 2022. Lucas Theis and Noureldin Yosri. Algorithms for the communication of samples. In International Conference on Machine Learning, 2022. Lucas Theis, Tim Salimans, Matthew D. Hoffman, and Fabian Mentzer. Lossy compression with Gaussian diffusion. ar Xiv:2206.08889, 2022. Jacob Ziv. On universal quantization. IEEE Transactions on Information Theory, 31(3):344 347, 1985. A Measure-theoretic Construction of Greedy Poisson Rejection Sampling In this section, we provide a construction of greedy Poisson rejection sampling (GPRS) in its greatest generality. However, we first establish the notation for the rest of the appendix. Notation: In what follows, we will always denote GPRS s target distribution as Q and its proposal distribution as P. 
We assume that Q and P are Borel probability measures over some Polish space Ω with Radon-Nikodym derivative r def = d Q/d P. We denote the standard Lebesgue measure on Rn by λ. All logarithms are assumed to be to the base 2 denoted by log2. Similarly, we use the less common notation of exp2 to denote exponentiation with base 2, i.e. exp2(x) = 2x. The relative entropy of Q from P is defined as DKL[Q P] def = R Ωlog2 r(x) d P(x) and the Rényi -divergence as D [Q P] def = ess supx Ω{log r(x)}, where the essential supremum is taken with respect to P. Restricting a Poisson process under the graph of a function : We first consider the class of functions under which we will be restricting our Poisson processes. Let ϕ : Ω R+ be a measurable function, such that |ϕ(Ω)| = |r(Ω)|, i.e. the image of ϕ has the same cardinality as the image of r = d Q/d P. This implies that there exist bijections between r(Ω) and ϕ(Ω). Since both images are subsets of R+, both images have a linear order. Hence, a monotonically increasing bijection σ exists such that ϕ = σ r. Since σ is monotonically increasing, we can extend it to a monotonically increasing continuous function σ : R+ R+. This fact is significant, as it allows us to reduce questions about functions of interest from Ωto R+ to invertible functions on the positive reals. Since σ is continuous and monotonically increasing on the positive reals, it is invertible on the extended domain. We now consider restricting a Poisson process under the graph of ϕ, and work out the distribution of the spatial coordinate of the first arrival. Let Π = {(Tn, Xn)} n=1 be a Poisson process over R+ Ωwith mean measure µ = λ P. Let U def = {(t, x) Ω| t ϕ(x)} be the set of points under the graph of ϕ. By the restriction theorem (Kingman, 1992), Π def = Π U is a Poisson process with mean measure µ(A) = µ(A U) for an arbitrary Borel set A. As a slight abuse of notation, we define the shorthand µ(t) = µ([0, t) Ω) for the average number of points in Π up to time t. Let N Π denote the counting measure of Π and let ( T, X) be the first arrival of Π, i.e. the first point of the original process Π that falls under the graph of ϕ, assuming that it exists. To develop GPRS, we first work out the distribution of X for an arbitrary ϕ. Remark: Note, that the construction below only holds if the first arrival is guaranteed to exist, otherwise the quantities below do not make sense. From a formal point of view, we would have to always condition on the event N Π( ) def = limt N Π(t) > 0, e.g. we would need to write P[ T t | N Π( ) > 0]. However, we make a leap of faith instead and assume that our construction is valid to obtain our guesses for the functions σ and ϕ. Then, we show that for our specific guesses the derivation below is well-defined, i.e. namely that P[N Π( ) > 0] = 1, which then implies P[ T t | N Π( ) > 0] = P[ T t]. To do this, note that P[N Π( ) > 0] = 1 P[N Π( ) = 0] (13) = 1 lim t e µ(t). (14) Hence, for the well-definition of our construction it is enough to show that for our choice of σ, µ(t) as t . Constructing σ: We wish to derive d P[ X = x] d P[ T = t, X = x] d(λ P) dt. (15) To do this, recall that we defined N Π(t) def = N Π([0, t) Ω) and define N Π(s, t) def = N Π([s, t] Ω). Similarly, let µ(t) def = E[N Π(t)] and µ(s, t) def = E[N Π(s, t)] = µ(t) µ(s). 
Then, P[ X A, T dt] = lim s t P[ X A, T [s, t]] = lim s t P[ X A, N Π(s, t) = 1, N Π(s) = 0] = lim s t P[ X A | N Π(s, t) = 1]P[N Π(s, t) = 1]P[N Π(s) = 0] where the last equality holds, because N Π(s) N Π(s, t), since [0, s) Ωand [s, t] Ωare disjoint. By Lemma 3 from Maddison (2016), P[ X A | N Π(s, t) = 1] = P[( T, X) [s, t] A | N Π(s, t) = 1] (19) = µ([s, t] A) µ(s, t) . (20) Substituting this back into Equation (18) and using the fact that N Π is Poisson, we find P[ X A, T dt] = lim s t µ([s, t] A) µ(s, t) µ(s, t)e µ(s,t) e µ(s) . (t s) (21) µ([s, t] A) e ( µ(t) µ(s)) e µ(s) . (t s) (22) = e µ(t) lim s t µ([s, t] A) A 1[t ϕ(x)] d P(x), (24) where the last equation holds by the definition of the derivative and the fundamental theorem of calculus. From this, by inspection we find d P[ T = t, X = x] d(λ P) = 1[t ϕ(x)]e µ(t) = 1[t ϕ(x)]P[ T t]. (25) Finally, by marginalizing out the first arrival time, we find that the spatial distribution is d P[ X = x] d P[ T = t, X = x] d(λ P) dt (26) 0 P[ T t] dt. (27) Deriving ϕ by finding σ: Note that Equation (27) holds for any ϕ. However, to get a correct algorithm, we need to set it such that d P[ X=x] d P = d Q d P = r. Since we can write ϕ = σ r by our earlier argument, this problem is reduced to finding an appropriate invertible continuous function σ : R+ R+. Thus, we wish to find σ such that x Ω, r(x) = Z σ(r(x)) 0 P[ T t] dt. (28) Now, introduce τ = σ(r(x)), so that σ 1(τ) = r(x). Note that this substitution only makes sense for τ r(Ω). However, since we are free to extend σ 1 to R+ in any we like so long as it is monotone, we may require that this equation hold for all τ R+. Then, we find σ 1(τ) = Z τ 0 P[ T t] dt (29) σ 1 (τ) = P[ T τ] (30) with σ 1(0) = 0. Before we solve for σ, we define w P (h) def = PY P [r(Y ) h] (31) w Q(h) def = PY Q[r(Y ) h]. (32) Note that w P and w Q are supported on [0, r ), where r def = ess supx Ω{r(x)} = exp2(D [Q P]). Now, we can rewrite Equation (30) as σ 1 (t) = P[ T t] (33) = P[ T t, ϕ( X) t] + P[ T t, ϕ( X) < t] | {z } =0 due to mutual exclusivity = P[ T 0, ϕ( X) t] | {z } =P[ϕ( X) t], since T 0 always P[ϕ( X) t T] (35) = P[r( X) σ 1(t)] P[r( X) σ 1(t) σ 1( T)] (36) = w Q(σ 1(t)) σ 1(t)w P (σ 1(t)) (37) where the second term in the last equation follows by noting that P[r( X) σ 1(t) σ 1( T)] = Z 0 1[r(x) σ 1(t)]1[r(x) σ 1(τ)] | {z } =1[r(x) σ 1(t)], since τ 0 : 1[a x] = Z x 0 δ(y a) dy, (43) where δ is the Dirac delta function. Then, we have Ω 1[r(x) h] d Q(x) (44) 0 r(x)δ(y r(x)) dy d P(x) (45) Ω δ(y r(x)) d P(x) dy (46) = 1 [y(1 w P (y))]h 0 + Z h 0 (1 w P (y)) dy (47) = 1 + h w P (h) Z h 0 w P (y) dy. (48) Then, from this identity, we find that P[ T t] = w Q(σ 1(t)) σ 1(t)w P (σ 1(t)) (49) = 1 Z σ 1(t) 0 w P (y) dy. (50) Well-definition of GPRS and Algorithm 3 terminates with probability 1: To ensure that our construction is useful, we need to show that the first arrival under the graph T exists and is almost surely finite, so that Algorithm 3 terminates with probability 1. First, note that for any t > 0, we have µ(t) µ(t) = t < . Since P[N Π(t) = 0] = e µ(t), this implies that for any finite time t, P[N Π(t) = 0] > 0. Second, note that for h [0, r ], w Q(h) h w P (h) = Z Ω 1[r(x) h](r(x) h) d P(x), (51) w Q(r ) r w P (r ) = Z Ω 1[r(x) r ](r(x) r ) d P(x) (52) Ω 1[r(x) = r ](r(x) r ) d P(x) (53) Thus, in particular, we find that lim h r e µ(σ(h)) = lim h r P[N Π(σ(h)) = 0] eq. (37) = lim h r (w Q(h) h w P (h)) = 0. (55) By continuity, this can only hold if µ(σ(h)) as h r . 
Since µ(t) is finite for all t > 0 and σ is monotonically increasing, this also implies that σ(h) as h r . Thus, we have shown a couple of facts: σ(h) as h r , hence ϕ is always unbounded at points in Ωthat achieve the supremum r . Furthermore, this implies that σ 1(t) r as t . (56) Since σ 1 is increasing, this implies that it is bounded from above by r . µ(t) < for all t > 0, but µ(t) as t . Hence, by Equation (14) we have P[N Π( ) > 0] = 1. Thus, the first arrival of Π exists almost surely, and our construction is well-defined. In particular, it is meaningful to write P[ T t]. P[ T t] = P[N Π(t) = 0] 0 as t , which shows that the first arrival time is finite with probability 1. In turn, this implies that Algorithm 3 will terminate with probability one, as desired. Note, that w P and w Q can be computed in many practically relevant cases, see Appendix G. B Analysis of Greedy Poisson Rejection Sampling Now that we constructed a correct sampling algorithm in Appendix A, we turn our attention to deriving the expected first arrival time T in the restricted process Π, the expected number of samples N before Algorithm 3 terminates and bound on N s variance, and an upper bound on its coding cost. The proofs below showcase the true advantage of formulating the sampling algorithm using the language of Poisson processes: the proofs are all quite short and elegant. B.1 The Expected First Arrival Time At the end of the previous section, we showed that T is finite with probability one, which implies that E[ T] < . Now, we derive the value of this expectation exactly. Since T is a positive random variable, by the Darth Vader rule (Muldowney et al., 2012), we may write its expectation as E h T i = Z 0 P[ T t] dt eq. (29) = lim t σ 1(t) eq. (56) = r . (57) B.2 The Expectation and Variance of the Runtime Expectation of N: First, let ΠC def = Π \ Π be the set of points in Π above the graph of ϕ. Since Π and ΠC are defined on complementary sets, they are independent. Since by definition T is the first arrival of Π and the Nth arrival of Π, it must mean that the first N 1 arrivals of Π occurred in ΠC. Thus, conditioned on T, we have N 1 = |{(T, X) ΠC | T < T}| is Poisson distributed with mean E h N 1 | T i = Z T Ω 1[t ϕ(x)] d P(x) dt (58) Ω 1 1[t < ϕ(x)] d P(x) dt (59) = T µ T . (60) By the law of iterated expectations, E[N 1] = E T h E h N 1 | T ii (61) = E h T i E h µ T i . (62) Focusing on the second term, we find E h µ T i = Z 0 µ(t) P[ T dt] dt (63) 0 µ(t) µ (t)e µ(t) dt (64) 0 ue u du = 1, (65) where the third equality follows by substituting u = µ(t). Finally, plugging the above and Equation (57) into Equation (62), we find E[N] = 1 + E[N 1] = 1 + r 1 = r . (66) Variance of N: We now show that the distribution of N is super-Poissonian, i.e. E[N] V[N]. Similarly to the above, we begin with the law of iterated variances to find V[N] = V[N 1] = E T [V[N 1 | T]] + V T [E[N 1 | T]] (67) = E h T i E h µ T i + V h T i + V h µ T i , (68) where the second equality follows from the fact that the variance of N 1 matches its mean conditioned on T, since it is a Poisson random variable. Focussing on the last term, we find V h µ T i = E µ T 2 E h µ T i2 (69) 0 µ(t)2 P[ T dt] dt 1 (70) 0 u2e u dt 1 = 1, (71) where the third equality follows from a similar u-substitution as in Equation (65). Thus, putting everything together, we find V[N] = r 1 + V h T i + 1 = r + V h T i > E[N]. 
(72) B.3 The Fractional Moments of the Index In this section, we follow the work of Liu & Verdu (2018) and analyse the fractional moments of the index N and prove Theorem 3.2. As before, set Q and P as GPRS s target and proposal distribution, respectively, and let r = d Q d P . To begin, we define the Rényi α-divergence between two distributions for α (0, 1) (1, ) as Dα[Q P] = 1 α 1 log2 EZ Q r(Z)α 1 . (73) By taking limits, we can extend the definiton to α = 1, as D1[Q P] def = lim α 1 Dα[Q P] = DKL[Q P] (74) D [Q P] def = lim α Dα[Q P] = ess sup x Ω {log2 r(x)} . (75) Now fix α (0, 1) and x Ωand note that by Jensen s inequality, E[N α | X = x] E[N | X = x]α. (76) Now, to continue, we will show the following bound: E[N | X = x] 1 w P (r(x)) (77) Let Π be the (1, P)-Poisson process realized by Algorithm 3, and let M = min{m N | (Tm, Xm) Π, r(Xm) r(x)}, (78) i.e. the index of the first arrival in Π such that the density ratio of the spatial coordinate exceeds r(x). Since Xi P are i.i.d., M Geom (PZ P [r(Z) r(x)]) , (79) E[M] = 1 PZ P [r(Z) r(x)] = 1 w P (r(x)). (80) Now, since geometrically distributed random variables are memoryless, we find E[M] = E[M N | M > N] (81) E[M | M > N], (82) which then implies E[M | M N] E[M]. (83) Before we proceed, we first require the following result Lemma B.1. Let all quantities be as defined above and let x, y Ωsuch that r(x) r(y). Then, N | X = y stochastically dominates N | X = x, i.e. for all positive integers n we have P[N n | X = x] P[N n | X = y]. (84) Proof. The proof follows via direct calculation. Concretely, P[N n | X = x] = P[N 1 n 1 | X = x] (85) = E T P T | X=x h P[N 1 n 1 | X = x, T] i (86) = 1 (n 1)!E T P T | X=x h Γ(n, T µ( T)) i , (87) where Γ(s, x) is the upper incomplete Γ-function defined as Γ(s, x) def = Z x ts 1e t dt. (88) Equation (87) follows from the fact that conditioned on T we have that N 1 is a Poisson random variable with mean T µ( T). Now, focussing on the expectation term in Equation (87), we let ε(r(x)) def = E T P T | X=x h Γ(n, T µ( T)) i = 1 r(x) 0 Γ(n, t µ(t)) σ 1 (t) dt. (89) Substituting h = r(x), we now show that ϵ is a decreasing function, which proves the lemma. To this end, regardless of the image of the density ratio r(Ω), extend the definition of ϵ to the entirety of (0, ess sup r(x)). Now, observe, that h Γ(n, σ(h) µ(σ(h))) =1 by inverse function theorem z }| { σ 1 (σ(h)) σ (h) Z σ(h) 0 Γ(n, t µ(t)) σ 1 (t) dt Focussing on the integral, via integration by parts we find Z σ(h) 0 Γ(n, t µ(t)) σ 1 (t) dt = (91) h Γ(n, σ(h) µ(σ(h))) + Z σ(h) 0 (t µ(t))(n 1)e (t µ(t))(1 w P (σ 1(t)))σ 1(t) dt. Plugging this back into the above calculation, we get 0 (t µ(t))(n 1)e (t µ(t))(1 w P (σ 1(t)))σ 1(t) dt. (93) Note, that since the integrand is positive and h was arbitrary, dϵ dh is negative for every h in its domain, which proves the lemma. Returning to Equation (83), we now examine the two cases M < N and M = N separately. Case M < N: In this case, we have the following inequalities: r(x) r(XM) by the definition of M. Applying σ to both sides we find ϕ(x) ϕ(XM). ϕ(XM) < TM, since M < N and by definition N is the index of the first arrival of Π under the graph of ϕ. Now, note that since M < N conditioned on TM, the first M 1 points of Π are all rejected and hence are all in ΠC = Π \ Π. Thus, M 1 | (M < N, TM) is Poisson distributed with mean TM µ(TM). Hence, E[M | M < N] = 1 + E[M 1 | M < N] (94) = 1 + ETM PTM |M ϕ(X) o (152) be the number of points in Πj that are rejected before the global first arrival occurs. 
By an independence argument analogous to the one given in Appendix B.2, the ν1, . . . , νJ are independent Poisson random variables, each distributed with mean E h νj | T i = 1 T µ T . (153) Hence, by the law of iterated expectations, we find that we get E [νj] eq. (62) = r 1 rejections in each of the J threads on average before the global first arrival occurs. Now, a thread j terminates either when the global first arrival occurs in it or when the thread s current arrival time T (j) nj provably exceeds the global first arrival time. Thus, on average, each thread will have one more rejection compared to Equation (154). Hence the average runtime of a thread is E[νj + 1] = r 1 J + 1, (155) and the average number of samples simulated by PGPRS across all its threads is J + 1 = r + J 1. (156) Variance of Runtime: Once again, similarly to Appendix B.2, we find by the law of iterated variances that the variance of the runtime in each of the j threads is V[νj + 1] = V[νj] = r 1 V h T i + 1 , (157) meaning that we make an O(1/J) reduction in the variance of the runtime compared to regular GPRS. Codelength: PGPRS can also realize a one-shot channel simulation protocol for a pair of correlated random variables x, y Px,y. For a fixed y Py, Alice applies PGPRS to the target Q = Px|y and proposal P = Px, and encodes the two-part code (J , NJ ). Bob can then simulate NJ samples from ΠJ and recover Alice s sample. Encoding J : By the symmetry of the setup, the global first arrival will occur with equal probability in each subprocess Πj. Hence J follows a uniform distribution on {1, . . . J}. Therefore, Alice can encode J optimally using log2 J bits. Encoding NJ : We can develop a bound using an almost identical argument to the one in Appendix B.4. In particular, by adapting the conditional bound in Equation (138) appropriately using Equation (155), we get E h log2 NJ | T = t, X = x, J i log2(r(x) + 1) + µ(t) log2 e log2 J. (158) Then, using this conditional bound and adapting Equation (143), we find E [log2 NJ | y, J ] DKL[Px|y Px] + 2 log2 e log2 J (159) to obtain a one-shot bound. Taking expectation over y Py, we get E [log2 NJ | J ] I[x; y] + 2 log2 e log2 J, (160) hence, by adapting the maximum entropy bound in Equation (150), we find H[NJ | J ] < I[x; y] log2 J + log2(I[x; y] log2 J + 1) + 6. (161) Thus, we finally find that the entropy of the two-part code (J , NJ ) is upper bounded by H[J , NJ ] < I[x; y] + log2(I[x; y] log2 J + 1) + 7. (162) Using a Zeta distribution ζ(n | s) n s to encode NJ | J with s def = 1 I[x; y] + 2 log2 e log2 J , (163) we find that the expected codelength of the two-part code is upper bounded by H[J , NJ ] < I[x; y] + log2(I[x; y] log2 J + 1) + 8 bits. (164) D Simulating Poisson Processes Using Tree-Structured Partitions Algorithm 7: Simulating a tree construction with another Input :Proposal distribution P Target splitting function splittarget, Simulating splitting function splitsim, Target heap index Htarget Output :Arrival (T, X) with heap index Htarget in Ttarget, and heap index Hsim of the arrival in Tsim. P Priority Queue B Ω K log2 Htarget X P T Exp(1) P.push(T, X, Ω, 1) for k = 0, . . . , K do T, X, C, H P.pop() L, R splitsim(C) if B L = then L Exp(P(L)) TL T + L XL P|L P.push(TL, XL, L, 2H) end if B R = then R Exp(P(R)) TR T + R XR P|R P.push(TR, XR, R, 2H + 1) end until X B /* Exit loop when we find first arrival in B. 
*/ if k < K then {B0, B1} splittarget(B) d l Htarget 2K k m mod 2 B Bd end end return T, X, H In this section, we examine an advanced simulation technique for Poisson processes, which is required to formulate the binary search-based variants of GPRS. We first recapitulate the tree-based simulation technique from (Flamich et al., 2022) and some important results from their work. Then, we present Algorithm 7, using which we can finally formulate the optimally efficient variant of GPRS in Appendix E. Note: For simplicity, we present the ideas for Poisson processes whose spatial measure P is non-atomic. These ideas can also be extended to atomic spatial measures with appropriate modifications. Splitting functions: Let Π be a spatiotemporal Poisson process on some space Ωwith non-atomic spatial measure P. In this paragraph, we will not deal with Π itself yet, but the space on which it is defined and the measure P. Now, assume that there is a function split, which for any given Borel set B Ω, produces a (possibly random) P-essential partition of B consisting of two Borel sets, i.e. split(B) = {L, R}, such that L R = and P(L R) = P(B). (165) The last condition simply allows split to discard some points from B that will not be included in either the left or right split. For example, if we wish to design a splitting function for a subset of B Rd to split B along some hyperplane, we do not need to include the points on the hyperplane in either set. The splitting function that always exists is the trivial splitting function, which just returns the original set and the empty set: splittrivial(B) = {B, }. (166) A more interesting example when Ω= R is the on-sample splitting function, where for a R-valued random variable X P|B it splits B as spliton-samp(B) = {( , X) B, (X, ) B}. (167) This function is used by Algorithm 5 and AS* coding (Flamich et al., 2022). Another example is the dyadic splitting function operating on a bounded interval (a, b), splitting as splitdyad((a, b)) = a, a + b 2 , b . (168) We call this dyadic because when we apply this splitting function to (0, 1) and subsets produced by applying it, we get the set of all intervals with dyadic endpoints on (0, 1), i.e. {(0, 1), (0, 1/2), (1/2, 1), (0, 1/4), . . .}. This splitting function is used by Algorithm 6 and AD* coding. As one might imagine, there might be many more possible splitting functions than the two examples we give above, all of which might be more or less useful in practice. Split-induced binary space partitioning tree: Every splitting function on a space Ωinduces an infinite set of subsets by repeatedly applying the splitting function to the splits it produces. These sets can be organised into an infinite binary space partitioning tree (BSP-tree), where each node in the tree is represented by a set produced by split and an unique index. Concretely, let the root of the tree be represented by the whole space Ωand the index Hroot = 1. Now we recursively construct the rest of the tree as follows: Let (B, H) be a node in the tree, with B a Borel set and H its index, and let {L, R} = split(B) be the left and right splits of B. Then, we set (L, 2H) as the node s left child and (R, 2H + 1) as its right child. We refer to each node s associated index as their heap index. Heap-indexing the points in Π and a strict heap invariant: As we saw in Section 2.1, each point in Π can be uniquely identified by their time index N, i.e. if we time-order the points of Π, N represents the Nth arrival in the ordered list. 
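For concreteness, the on-sample and dyadic splitting functions above admit very short implementations. The sketch below represents one-dimensional bounds as `(a, b)` tuples and assumes a scipy.stats-style frozen proposal `p` with `.cdf` and `.ppf` methods; these representational choices are ours, not the paper's.

```python
import numpy as np

def split_on_sample(bounds, p, rng):
    """On-sample split (Equation (167)): draw X ~ P restricted to `bounds`
    via the inverse-CDF trick and split the interval at X."""
    a, b = bounds
    x = p.ppf(rng.uniform(p.cdf(a), p.cdf(b)))
    return (a, x), (x, b)

def split_dyad(bounds):
    """Dyadic split (Equation (168)): split a bounded interval at its midpoint."""
    a, b = bounds
    return (a, (a + b) / 2), ((a + b) / 2, b)
```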
However, we can also uniquely index each point in Π using a splitting function-induced BSP-tree as follows. We extend each node in the BSP-tree with a point from Π, such that the extended tree satisfies a strict heap invariant: First, we extend the root node (Ω, 1) by adjoining Π s first arrival (T1, X1) and the first arrival index 1 to get (T1, X1, Ω, 1, 1). Then, for every other extended node (T, X, B, H, N) with parent node (T , X , B , H/2 , M) we require that (T, X) is the first arrival in the restricted process Π ((T , ) B). This restriction enforces that we always have that T < T and, therefore, M < N. Furthermore, it is strict because there are no other points of Π in B between those two arrival times. Notation for the tree structure: Let us denote the extended BSP tree T on Π induced by split. Each node ν T is a 5-tuple ν = (T, X, B, H, N) consisting of the arrival time T, spatial coordinate X, its bounds B, split-induced heap index H and time index N. As a slight overload of notation, for a node ν, we use it as subscript to refer to its elements, e.g. Tν is the arrival time associated with node ν. We now prove the following important result, which ensures that we do not lose any information by using T to represent Π. An essentially equivalent statement was first proven for Gumbel processes (Appendix, Equivalence under partition subsection; Maddison et al., 2014). Lemma D.1. Let Π, P, split and T be as defined above. Then, P-almost surely, T contains every point of Π. Proof. We give an inductive argument. Base case: the first arrival of Π is by definition contained in the root node of T . Inductive hypothesis: assume that the first N arrivals of Π are all contained in T . Case for N + 1: We begin by observing that if the first N nodes are contained in T , they must form a connected subtree TN. To see this, assume the contrary, i.e. that the first N arrivals form a subgraph ΓN T with multiple disconnected components. Let ν ΓN be a node contained in a component of ΓN that is disconnected from the component of ΓN containing the root node. Since T is a tree, there is a unique path π from the root node to ν in T , and since ν and the root are disconnected in ΓN, π must contain a node c ΓN. However, since c is an ancestor of ν, by the heap invariant of T we must have that the time index of c is Nc < Nν N hence c ΓN, a contradiction. Thus, let TN represent the subtree of T containing the first N arrivals of Π. Now, let FN represent the frontier of TN, i.e. the leaf nodes children: FN def = {ν T | ν TN, parent(ν) TN} , (169) where parent retrieves the parent of a node in T . Let ν FN Bν (170) be the union of all the bounds of the nodes in the frontier. A simple inductive argument shows that for all N the nodes in FN provide a P-essential partition of Ω, from which P( ΩN) = 1. Let TN be the Nth arrival time of Π. Now, by definition, the arrival time associated with every node in the frontier FN must be later than TN. Finally, consider the first arrival time across the nodes in the frontier: T N def = min ν FN Tν. (171) Then, conditioned on TN, T N is the first arrival of Π restricted to (TN, ) Ω, thus it is P-almost surely the N + 1st arrival in Π, as desired. A connection between the time index an heap index of a node: Now that we have two ways of uniquely identifying each point in Π it is natural to ask whether there is any relation between them. In the next paragraph we adapt an argument from Flamich et al. (2022) to show that under certain circumstances, the answer is yes. 
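Generatively, the strict heap invariant corresponds to a simple recipe for extending a node, exactly as in the inner loop of Algorithm 7: the child's point is the first arrival of $\Pi$ restricted to $(T', \infty) \times B$, so its arrival time is the parent's time plus an $\mathrm{Exp}(P(B))$ gap and its location is a draw from $P$ restricted to $B$. A minimal sketch for one-dimensional interval bounds, again assuming a scipy.stats-style frozen proposal `p`:

```python
import numpy as np

def child_arrival(t_parent, bounds, p, rng):
    """First arrival of Pi restricted to (t_parent, inf) x bounds, used to
    extend a node of the BSP-tree (cf. Algorithm 7's inner loop): the waiting
    time is exponential with rate P(bounds) and the location is P restricted
    to bounds, drawn here with the inverse-CDF trick."""
    a, b = bounds
    mass = p.cdf(b) - p.cdf(a)                      # P(B)
    t = t_parent + rng.exponential(1.0 / mass)      # Exp(P(B)) gap after the parent
    x = p.ppf(rng.uniform(p.cdf(a), p.cdf(b)))      # X ~ P restricted to B
    return t, x
```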
First, we need to define two concepts, the depth of a node in an extended BSP-tree and a contractive splitting function. Thus, let T be an extended BSP-tree over Π induced by split. Let ν T . Then, the depth D of ν in T is defined as the distance from ν to the root. A simple argument shows that the heap index Hν of every node ν at depth D in T is between 2D Hν < 2D+1. Thus, we can easily obtain the depth of a node ν from the heap index via the formula Dν = log2 Hν . Next, we say that split is contractive if all the bounds it produces shrink on average. More formally, let ϵ < 1, and for a node ν T , let Aν denote the set of ancestor nodes of ν. Then, split is contractive if for every non-root node ν T we have ETAν |Dν[P(Bν)] ϵDν, (172) where TAν and denotes the subtree of T containing the arrivals of the ancestors of ν. Note that ϵ 1/2 for any split function. This is because if {L, R} = split(B) for some set B, then P(R) = P(B) P(L). Thus, if P(R) = αP(B), then P(L) = (1 α)P(B), and by definition ϵ = max{α, 1 α}, which is minimized when α = 1/2, from which ϵ = 1/2. For example, splitdyad is contractive with ϵ = 1/2, while splittrivial is not contractive. By Lemma 1 from Flamich et al. (2022), spliton-samp is also contractive with ϵ = 3/4. We now strengthen this result using a simple argument, and show that spliton-samp is contractive with ϵ = 1/2. Lemma D.2. Let ν, D be defined as above, let P be a non-atomic probability measure over R with CDF FP . Let Π a (1, P)-Poisson process and T be the BSP-tree over Π induced by spliton-samp. Then, ETAν |Dν[P(Bν)] = 2 Dν. (173) Proof. Fix Dν = d, and let Nd = {n T | Dn = d} be the set of nodes in T whose depth is d. Note, that |Nd| = 2d. We will show that n, m Nd : P(Bn) d= P(Bm), (174) i.e. that the distributions of the bound sizes are all the same. From this, we will immediately have E[P(Bn)] = E[P(Bν)] for every n Nd. Then, using the fact, that the nodes in Nd for a P-almost partition of Ω, we get: n Nd E [P(Bn)] = |Nd| E [P(Bν)] . (175) Dividing the very left and very right by |Nd| = 2d yields the desired result. To complete the proof, we now show that by symmetry, Equation (174) holds. We begin by exposing the fundamental symmetry of spliton-samp: for a node ν with left child L and right child R, the left and right bound sizes are equal in distribution: P(BL) d= P(Br) | P(Bν). (176) First, note that by definition, all involved bounds will be intervals. Namely, assume that Bν = (a, b) for some a < b and Xν is the sample associated with ν. Then, BL = (a, Xν) and BR = (Xν, b) and hence P(BL) = FP (Xν) FP (a). Since Xν P|Bν, by the probability integral transform, we have F(Xν) Unif(FP (a), FP (b)), from which P(BL) Unif(0, FP (b) FP (a)) = Unif(0, P(Bν)). Since P(BR) = P(Bν) P(BL), we similarly have P(BR) Unif(0, P(Bν)), which establishes our claim. Now, to show Equation (174), fix d and fix n Nd. Let An denote the ancestor nodes of n. As we saw in the paragraph above, P(Bn) | P(Bparent(n)) d= P(Bparent(n)) U, U Unif(0, 1), (177) regardless of whether n is a left or a right child of its parent. We can apply this d times to the ancestors of n find the marginal: i=1 Ui, Ui Unif(0, 1). (178) Since the choice of n was arbitrary, all nodes in Nd have this distribution, which is what we wanted to show. Now, we have the following result. Lemma D.3. Let split be a contractive splitting function for some ϵ [1/2, 1). Then, for every node ν in T with time index Nν and depth Dν, we have EDν|Nν[Dν] logϵ Nν. (179) Proof. Let us examine the case Nν = 1 first. 
In this case, ν is the root of T and has depth Dν = 0 by definition. Thus, 0 logϵ 1 holds trivially. Now, fix ν T with time index Nν = N > 1. Let TN 1 be the subtree of T containing the first N 1 arrivals, FN 1 be the frontier of TN 1 and TN 1 the (N 1)st arrival time. Then, as we saw in Lemma D.1, the Nth arrival in Π occurs in one of the nodes in the frontier FN 1, after TN 1. In particular, conditioned on TN 1, the arrival times associated with each node f FN 1 will be shifted exponentials Tf = TN 1 + Exp(P(Bf)), and the Nth arrival time in Π is the minimum of these: Tν = TN = minf FN 1 Tf. It is a standard fact (see e.g. Lemma 6 in Maddison (2016)) that the index of the minimum FN = arg min f FN 1 Tf (180) is independent of Tν = TFN and P[FN = f | TN 1] = P(Bf). A simple inductive argument shows that the number of nodes on the frontier |FN 1| = N. Thus, we have a simple upper bound on the entropy of F: EFN|TN 1,Nν=N[ log2 P(Bν)] = H[FN | TN 1, Nν = N] log2 N. (181) Thus, taking expectation over TN 1, we find log2 N eq. (181) EFN,TN 1|Nν=N[ log2 P(Bν)] (182) = EDν|Nν=N EFN,TN 1|Dν,Nν=N[ log2 P(Bν)] (183) EDν|Nν=N log2 EFN,TN 1|Dν,Nν=N[P(Bν)] (184) eq. (172) EDν|Nν=N log2 ϵDν (185) = ( log2 ϵ) EDν|Nν=N[Dν]. (186) The second inequality holds by Jensen s inequality. In the third inequality, we apply Equation (172), and one might worry about conditioning on Nν here. However, this is not an issue because EFN,TN 1|Dν=d,Nν=N[P(Bν)] (187) = 1[d N 1] X f FN 1 ETAf |FN=f,Dν=d[P(Bf)] P[FN = f | Df = d] eq. (172) 1[d N 1] X f FN 1 ϵd P[FN = f | Df = d] (189) = 1[d N 1] ϵd. (190) Thus, we finally get ( log2 ϵ) EDν|Nν=N[Dν] log2 N EDν|Nν [Dν] logϵ Nν (191) by dividing both sides by log2 ϵ and we obtain the desired result. Converting between different heap indices: Assume now that we have two splitting functions, splittarget and splitsim, which induce their own BSP-ordering on Π, Ttarget and Tsim. Now, given a splittarget-induced heap index H, Algorithm 7 presents a method for simulating the appropriate node ν Ttarget by simulating nodes from Tsim. In other words, given a node with some heap index induced by a splitting function, Algorithm 7 lets us find the heap index of the same arrival induced by a different splitting function. The significance of Algorithm 7 is that it lets us develop convenient search methods using a given splitting function, but it might be more efficient to encode the heap index induced by another splitting function. Theorem D.4. Let Π, splittarget, splitsim, Ttarget and Tsim be as above. Let ν Ttarget with and let (Tsim, Xsim, Hsim) be the arrival and its heap index output by Algorithm 7 given the above as input as well as Hν as the target index. Then, Algorithm 7 is correct, in the sense that Tν d= Tsim and Xν d= Xsim, (192) and Hsim is the heap index of (Tsim, Xsim) in Tsim. Proof. First, observe that when splittarget = splitsim, Algorithm 7 collapses onto just the extended BSP tree construction for Π and simply returns the arrival with the given heap index Htarget in Ttarget. In particular, the inner loop will always exit after one iteration, and every time one and only one of the conditional blocks will be executed. In other words, in this case, the algorithm becomes equivalent to Algorithm 2 in Flamich et al. (2022). Let us now consider the case when splittarget = splitsim. Denote the depth of the required node by Dν = log2 Hν . Now, we give an inductive argument for correctness. Base case: Consider Dν = 0. 
In this case, the target bounds are $B = \Omega$, and the first sample we draw, $X \sim P$, is guaranteed to fall in $\Omega$. Hence, for $D_\nu = 0$ the outer loop only runs for one iteration. Furthermore, during that iteration, the inner loop also exits after one iteration, and Algorithm 7 returns the sample $(T, X)$ drawn before the outer loop, with heap index 1. Since $T \sim \mathrm{Exp}(1)$ and $X \sim P$, this is a correctly distributed output with the appropriate heap index. Inductive hypothesis: Assume Algorithm 7 is correct for heap indices with depths up to $D_\nu = d$. Case $D_\nu = d + 1$: Let $\rho \in \mathcal{T}_{\mathrm{target}}$ be the parent node of $\nu$ with arrival $(T_\rho, X_\rho)$. Then $D_\rho = d$, hence by the inductive hypothesis Algorithm 7 correctly simulates a branch of $\mathcal{T}_{\mathrm{target}}$ up to node $\rho$. At the end of the $d$th iteration of the outer loop, Algorithm 7 sets the target bounds $B \leftarrow B_\nu$. Then, in the final, $(d+1)$st iteration, the inner loop simply realizes $\mathcal{T}_{\mathrm{sim}}$ and accepts the first sample that falls inside $B_\nu$ and whose arrival time satisfies $T > T_\rho$. Due to the priority queue, the loop simulates the nodes of $\mathcal{T}_{\mathrm{sim}}$ in time order; hence the accepted sample is also the one with the earliest arrival time. Furthermore, Algorithm 7 only ever considers nodes of $\mathcal{T}_{\mathrm{sim}}$ whose bounds intersect the target bounds $B_\nu$, hence the inner loop is guaranteed to terminate, which finishes the proof.

E GPRS with Binary Search

We now utilise the machinery we developed in Appendix D to analyze Algorithm 5.

Correctness of Algorithm 5: Observe that Algorithm 5 constructs the extended BSP tree induced by the on-sample splitting function, which we denote by $\mathcal{T}$; it then performs a binary tree search over this tree. The first step of the algorithm matches GPRS's first step (Algorithm 3), hence it is correct for the first step. Now consider the algorithm at an arbitrary step $k$ before termination, where the candidate sample $(T, X)$ is rejected, i.e. $T > \phi(X)$. By assumption, the density ratio $r$ is unimodal, and since $\sigma$ is monotonically increasing, $\phi = \sigma \circ r$ is unimodal too. Thus, let $x^* \in \Omega$ be such that $r(x^*) = r^*_\infty$, where $r^*_\infty = \mathrm{exp}_2(D_\infty[Q \,\|\, P])$. Assume for now that $X < x^*$; the case $X > x^*$ follows by a symmetric argument. By the unimodality assumption, since $X < x^*$, it must hold that $\phi(y) \leq \phi(X)$ for all $y < X$. Consider now the arrival $(T_L, X_L)$ of $\Pi$ in the current node's left child. Then we have $T < T_L$ and $X_L < X$ by construction. Thus, we finally get
$\phi(X_L) \leq \phi(X) < T < T_L,$   (193)
meaning that the current node's left child is also guaranteed to be rejected. This argument easily extends to show that any left-descendant of the current node will be rejected, so it is sufficient to search its right-descendants only. By a similar argument, when $X > x^*$, it is sufficient to check only the current node's left-descendants. Finally, since Algorithms 3 and 5 both simulate $\Pi$ and search for its first arrival under $\phi$, by the construction in Appendix A the sample returned by both algorithms follows the desired target distribution.

Analyzing the expected runtime and codelength: Since Algorithm 5 simulates a single branch of $\mathcal{T}_{\text{on-sample}}$, its runtime equals the depth $D$ of that branch. Furthermore, let $N$ denote the time index of the accepted arrival and $H$ its heap index in $\mathcal{T}_{\text{on-sample}}$, so that $D = \lfloor \log_2 H \rfloor$. It is tempting to use Lemmas D.2 and D.3 directly in our analysis of $N$, $H$ and $D$; the sketch below first makes the search procedure just described concrete.
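A minimal sketch of branch-and-bound GPRS for a one-dimensional scipy.stats-style proposal `p` (with `.cdf` and `.ppf`), a unimodal density ratio `r` with known mode `x_star`, and the stretch function `sigma`; the function and argument names are ours, not the paper's.

```python
import numpy as np

def branch_and_bound_gprs(p, r, sigma, x_star, rng, max_depth=10_000):
    """Sketch of Algorithm 5: walk a single branch of the on-sample BSP-tree,
    at each step simulating the next arrival of Pi inside the current bounds
    and accepting it if it lies under phi = sigma(r(.)). Returns the accepted
    sample, its heap index H and its depth D."""
    lo, hi = -np.inf, np.inf       # current bounds B
    t, heap_index = 0.0, 1
    for depth in range(max_depth):
        mass = p.cdf(hi) - p.cdf(lo)                 # P(B)
        t += rng.exponential(1.0 / mass)             # next arrival in B after time t
        x = p.ppf(rng.uniform(p.cdf(lo), p.cdf(hi))) # X ~ P restricted to B
        if t <= sigma(r(x)):                         # first arrival under phi: accept
            return x, heap_index, depth
        if x < x_star:     # everything left of X lies below phi(X) < T: go right
            lo, heap_index = x, 2 * heap_index + 1
        else:              # symmetric case: discard the right part, go left
            hi, heap_index = x, 2 * heap_index
    raise RuntimeError("max_depth exceeded")
```

For the triangular-uniform pair of Appendix G.2, $\sigma$ has a closed form, so the sketch can be run end to end; when only $\sigma^{-1}$ is available (e.g. from numerically solving Equation (4)), the acceptance test $t \leq \sigma(r(x))$ can equivalently be written as $\sigma^{-1}(t) \leq r(x)$, since $\sigma$ is increasing.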
However, they do not apply, because they assume no restriction on D and N, while in this section, we condition on the fact that N, H and D identify the arrival in Π that was accepted by Algorithm 5. Hence, we first prove a lemma analogous to lemma D.2 in this restricted case, based on Lemma 4 of Flamich et al. (2022). Note the additional assumptions we need for our result. Lemma E.1. Let Q and P be distributions over R with unimodal density ratio r = d Q/d P, given to Algorithm 5 as the target and proposal distribution as input, respectively. Assume P has CDF FP . Let H denote the accepted sample s heap index, let D = log2 H , and let BH denote the bounds associated with the accepted node. Then, E[P(BH) | H] 3 Proof. We prove the claim by induction. For H = 1, D = 0 the claim holds trivially, since B1 = R, hence P(B1) = 1. Assume now that the claim holds for D = log2 H = d 1 for d 1, and we prove the statement when D = log2 H = d. Let X0, . . . , Xd 1, Xd denote the samples simulated along the branch that leads to the accepted node, i.e. X0 is the sample associated with the root node and Xd is the accepted sample. By the definition of the on-sample splitting function and law of iterated expectations, we have EX0:d 1[P(BH) | H] = EX0[P(BH) | H] when d = 1 EX0:d 2[EXd 1|X0:d 2[P(BH) | H] | H] when d 2. (195) Below, we only examine the inner expectation in the d 2 case; computing the expectation in the d = 1 case follows mutatis mutandis. First, denote the bounds of the accepted node s parent as B H/2 = (α, β) for some real numbers α < β and define a = FP (α), b = FP (β) and U = FP (Xd 1). Since Xd 1 | X1:d 2 P|B H/2 , by the generalized probability integral transform we have U Unif(a, b), where Unif(a, b) denotes the uniform distribution on the interval (a, b). The two possible intervals from which Algorithm 5 will choose are (α, Xd 1) and (Xd 1, β), whose measures are P((α, Xd 1)) = FP (Xd 1) FP (α) = U a and similarly P((Xd 1, β)) = b U. Then, in the worst case, the algorithm always picks the bigger of the two intervals, thus P(BH) max{U A, B U}, from which we obtain the bound EXd 1|X1:d 2[P(BH) | H] EU[max{U a, b U}] = 3 4P(B H/2 ). (196) Plugging this into Equation (195), we get EX1:d 1[P(BH) | H] 3 4EX1:d 2 P(B H/2 ) | H (197) d 1 , (198) where the second inequality follows from the inductive hypothesis, which finishes the proof. Fortunately, lemma D.3 does not depend on this condition, so we can modify it mutatis mutandis so that it still applies in this latter case as well, and we restate it here for completeness: Lemma E.2. Let N, D, H, BH be defined as above. Then E[ log P(BH)] log2 N (199) ED|N[D] log3/4 N. (200) Proof. Equation (199) follows via the same argument as Equation (181), and Equation (200) follows via the same argument as lemma D.3, combined with lemma E.1. Expected runtime: Combining Equation (200) with Equation (143) yields E[D] DKL[Q P] + 2 log2 e log2(4/3) , (201) which proves Theorem 3.6. In the channel simulation case, taking a pair of correlated random variables x, y Px,y and setting Q Px|y and P Px, and taking expectation over y, we get E[D] I[x; y] + 2 log2 e log2(4/3) . (202) Entropy coding the heap index: Let ν Ton-sample now denote the node accepted and returned by Algorithm 5. We will encode ν using its heap index H. As we saw in Appendix D, H can be decomposed into two parts: the depth D and the residual R: H = 2D + R, 0 R < 2D. (203) Then, we have H[H] H[D, R] (204) = H[R | D] + H[D]. 
(205) Dealing with the second term first, we note that the maximum entropy distribution for a random variable Z over the positive integers with the moment constraint E[Z] = 1/p is the geometric distribution with rate p. Thus, based on Equation (202), we set p = log2(4/3) I[x; y] + 2 log2 e, (206) from which we get H[D] H [Geom (p)] (207) = log2 p 1 p p log2(1 p) (208) log2(I[x; y] + 2 log2 e) log2 log2(4/3) + log2 e (209) log2(I[x; y] + 1) + log2(2 log2 e) log2 log2(4/3) + log2 e, (210) where Equation (208) follows from the fact that the term 1 p p log2(1 p) is bounded from above by log2 e, which it attains as I[x; y] . For the H[R | D] term, we begin by noting that there is a particularly natural encoding for R given a particular run of the algorithm. Let B(D) 1 , . . . , B(D) 2D be the bounds associated with the depth-D nodes of Ton-sample | D, the split-on-sample BSP tree given D. Then, since B(D) 1 , . . . , B(D) 2D form a P-essential partition of Ω, we can encode R by encoding the bounds associated with the accepted sample, B(D) R . Here, we can borrow a technique from arithmetic coding (Rissanen & Langdon, 1979): we encode the largest dyadic interval that is a subset of B(D) R , which can be achieved using log2 P(B(D) R )+1 bits. Note, that Algorithm 5 only ever realises a single branch of Ton-sample, hence, in this section only, let X1: denote the infinite sequence of samples along this branch, starting with X1 P at the root. Furthermore, from here onwards, let Bd and Td denote the bounds and arrival time associated with Xd in the sequence, respectively. Then, by the above reasoning, we have H[R | D] EX1: ,T1: [ED[ log2 P(BD)]] + 1 (211) k=1 EX1:k 1,T1:k [P[D = k | X1:k 1, T1:k] log2 P(Bk)] + 1. (212) We now develop a bound on P(Bk). To this end, note, that by the definition of GPRS, P [D k + 1 | X1:k 1, T1:k, D k] = P [N Π([Tk 1, Tk] Bk) = 0 | X1:k 1, T1:k 1, D k] . P [D = k | X1:k 1, T1:k, D k] = 1 P [D k + 1 | X1:k 1, T1:k, D k] (213) = P [N Π([Tk 1, Tk] Bk) 1 | X1:k 1, T1:k 1, D k] (214) E [N Π([Tk 1, Tk] Bk) | X1:k 1, T1:k 1, D k] (215) E [NΠ([Tk 1, Tk] Bk) | X1:k 1, T1:k 1, D k] (216) = P(Bk)(Tk Tk 1), (217) where eq. (215) follows from Markov s inequality, and eq. (216) follows from the fact that the mean measure of Π is just a restricted version of Π s mean measure. For convenience, we now write k = Tk Tk 1. Then, rearranging the above inequality, we get log2 P(Bk) log2 P [D = k | X1:k 1, T1:k 1, k, D k] + log2 k. (218) Substituting this into eq. (212), we get H[R | D] (219) k=1 EX1:k 1,T1:k 1, k [P[D = k | X1:k 1, T1:k 1, k] log2 P[D = k | X1:k 1, T1:k 1, k]] k=1 EX1:k 1,T1:k 1, k [P[D = k | X1:k 1, T1:k 1, k] log2 k] + 1. (220) We deal with the above two sums separately. In the first sum, we apply Jensen s inequality on the concave function x 7 x log2 x in each summand, from which we get k=1 EX1:k 1,T1:k 1, k [P[D = k | X1:k 1, T1:k 1, k] log2 P[D = k | X1:k 1, T1:k 1, k]] k=1 P[D = k] log2 P[D = k] (221) = H[D]. (222) For the second sum, consider the kth summand first: P[D = k | X1:k 1,T1:k 1, k, D k] log2 k (223) d P[D = k, Xk = x | X1:k 1, T1:k 1, k, D k] d P log2 d P(x). Bk 1[r(x) σ 1(Tk)] log2 d P(x). To continue, we now examine the integrand: r(x) σ 1(Tk) eq. (132) P[ T Tk] Tk P[ T Tk] k (224) from which log2 k log2 r(x) log2 P[ T Tk]. (225) Substituting this back, we get P[D = k | X1:k 1, T1:k 1, k, D k] log2 k (226) Bk 1[r(x) σ 1(Tk)] log2 r(x) d P(x) P[D = k | X1:k 1, T1:k 1, k, D k] log2 P[ T Tk]. 
(227) Therefore, for the second summand, we get X k=1 EX1:k 1,T1:k 1, k [P[D = k | X1:k 1, T1:k 1, k] log2 k] (228) d P[D = k, Xk = X] d P log2 r(X) k=1 P[D = k] log2 P[ T Tk]. (229) For the first term, by Fubini s theorem and by the correctness of GPRS, we get d P[D = k, Xk = X] d P log2 r(X) = EX P d P[D = k, Xk = X] d P log2 r(X) (230) = EX P [r(X) log2 r(X)] (231) = DKL[Q P]. (232) For the second term, note that P[D = k] = P[ T = Tk]. (233) Hence, k=1 P[D = k] log2 P[ T Tk] = k=1 P[D = k] log2 P[D k] (234) k=1 (P[D k] P[D k + 1]) log2 1 P[D k] (235) 1 log2 p dp (236) = log2 e. (237) Thus, we finally get that H[R | D] DKL[Q P] + H[D] + log2 e + 1. (238) In the average case, this yields H[R | D] I[x; y] + H[D] + log2 e + 1. (239) Thus, we finally get that the joint entropy is bounded above by H[R, D] I[x; y] + 2 log2(I[x; y] + 1) (240) + 2 (log2(2 log2 e) log2 log2(4/3) + log2 e) + log2 e + 1 (241) < I[x; y] + 2 log2(I[x; y] + 1) + 11, (242) which finally proves Theorem 3.7. F General GPRS with Binary Search We finally present a generalized version of branch-and-bound GPRS (Algorithm 6) for more general spaces and remove the requirement that r be unimodal. Decomposing a Poisson process into a mixture of processes: Let Π be a process over R+ Ω as before, with spatial measure P, and let Q be a target measure, ϕ = σ r and U = {(t, x) R+ Ω| t ϕ(x)} as before. Let Π be Π = Π U restricted under ϕ and ( T1, X1) its first arrival. Let {L, R} form an P-essential partiton of Ω, i.e. L R = and P(L R) = 1, and let ΠL = Π R+ L and ΠR = Π R+ R be the restriction of Π to L and R, respectively. Let ΠL = ΠL U and ΠR = ΠR U be the restrictions of the two processes under ϕ as well. Let µL(t) and µR(t) be the mean measures of these processes. Thus, the first arrival times in these processes have survival functions P[ T L 1 t] = e µL(t) (243) P[ T R 1 t] = e µR(t). (244) Note, that by the superposition theorem, ΠL ΠR = Π, hence the first arrival ( T1, X1) occurs in either ΠL or ΠR. Assume now that we have already searched through the points of Π up to time τ without finding the first arrival. At this point, we can ask: will the first arrival occur in ΠL, given that T1 τ? Using Bayes rule, we find P[ X1 L | T1 τ] = P[ X1 L, T1 τ] P[ T1 τ] . (245) More generally, assume that the first arrival of Π occurs in some set A Ω, and we know that the first arrival time is larger than τ. Then, what is the probability that the first arrival occurs in some set B A? Similarly to the above, we find P[ X1 B | T1 τ, X1 A] = P[ X1 B, T1 τ, X1 A] P[ T1 τ, X1 A] (246) = P[ T1 τ, X1 B] P[ T1 τ, X1 A] . (247) Let ( TL, XL) be the first arrival of ΠL. Then, the crucial observation is that ( TL, XL) d= T1, X1 | X1 L. (248) This enables us to search for the first arrival of Π under the graph of ϕ using an extended BSP tree construction. At each node, if we reject, we draw a Bernoulli random variable b with mean equal to the probability that the first arrival occurs within the bounds associated with the right child node. Then, if b = 1, we continue the search along the right branch. Otherwise, we search along the left branch. Note, however, that in a restricted process ΠA, the spatial measure no longer integrates to 1. Furthermore, our target Radon-Nikodym derivative is r(x) 1[x A]/Q(A). This means we need to change the graph ϕ to some new graph ϕA to ensure that the spatial distribution of the returned sample is still correct. 
Therefore, for a set A we define the restricted versions of previous quantities: µA(t) def = Z t A 1[τ ϕA(x)] d P(x) dt (249) w P (h | A) def = Z A 1[h r(x)] d P(x) (250) Then, via analogous arguments to the ones in Appendix A, we find d P[ TA = t, XA = x] d(λ P) = 1[x A]1[t ϕA(x)]P[ TA t] (251) d P[ XA = x] d P = 1[x A] Z ϕA(x) 0 P[ TA t] dt (252) ϕA = σA r. (253) Similarly, setting d P[ XA=x] d P = 1[x A] r(x)/Q(A), and setting τ = σA(r(x)), we get σ 1 A (τ) = Q(A) Z τ 0 P[ TA t] dt (254) σ 1 A (τ) = P[ TA τ] (255) = P[ T τ, X A]. (256) From this, again using similar arguments to the ones in Appendix A, we find σ 1 A (τ) = w Q(σ 1 A (t) | A) σ 1 A (t) w P (σ 1 A (t) | A). (257) G Necessary Quantities for Implementing GPRS in Practice Ultimately, given a target-proposal pair (Q, P) with density ratio r, we would want an easy-toevaluate expression for the appropriate stretch function σ or σ 1 to plug directly into our algorithms. Computing σ requires computing the integral in Equation (5) and finding σ 1 requires solving the non-linear ODE in Equation (4), neither of which is usually possible in practice. Hence, we usually resort to computing σ 1 numerically by using an ODE solver for Equation (4). In any case, we need to compute w P and w Q, which are analytically tractable in the practically relevant cases. Hence, in this section, we give closed-form expressions for w P and w Q for all the examples we consider and give closed-form expressions for σ and σ 1 for some of them. If we do not give a closed-form expression of σ, we use numerical integration to compute σ 1 instead. Note on notation: In this section only, we denote the natural logarithm as ln and exponentiation with respect to the natural base as exp. G.1 Uniform-Uniform Case Let P be the uniform distribution over an arbitrary space Ωand Q a uniform distribution supported on some subset X Ω, with P(X) = C for some C 1 Then, C 1[x X] (258) w P (h) = C (259) w Q(h) = 1 (260) C log (1 C h) (261) C (1 exp( C h)) . (262) Note that using GPRS in the uniform-uniform case is somewhat overkill, as in this case, it is simply equivalent to standard rejection sampling. G.2 Triangular-Uniform Case Let P = Unif(0, 1) and for some numbers 0 < a < c < b < 1, let Q be the triangular distribution, defined by the PDF 0 if x < a 2(x a) (b a)(c a) if a x < c 2 b a if x = c 2(b x) (b a)(b c) if c < x b 0 if b < x. In this parameterization, the distribution has support on (a, b). Moreover, it is linearly increasing on (a, c) and linearly decreasing on (c, b), with the mode located at c. For convenience, let ℓ= b a. Then, r(x) = q(x) (264) DKL[Q P] = log2 2 log2 e (265) D [Q P] = 1 log2 ℓ (266) Dα[Q P] = log2 ℓ+ α log2(1 + α) w P (h) = ℓ ℓ2 w Q(h) = 1 ℓ2 σ(h) = 2h 2 ℓ h (270) σ 1(t) = 2t 2 + ℓ t (271) Additionally, in this case we find P[ T t] = 4 (2 + ℓ t)2 (272) P[ T dt] = 8 ℓ (2 + ℓ t)3 (273) ϕ T (s) = 2G3,1 1,3 0 0, 1, 3/2 + 2is2 sign(s) l |s| 2Ci 2|s| l + π 2Si 2|s| l2|s| (275) M T (s) = ℓ2 + 2ℓs 4s2e 2s/ℓEi 2s ℓ2 for s 0, (276) where ϕ T and M T denote the characteristic and the moment-generating functions of T, respectively. Furthermore, Si, Ci and Ei are the sine, cosine and exponential integrals, defined as Si(x) = Z x 0 t 1 sin(t) dt (277) x t 1 cos(t) dt (278) Ei(x) = Z x t 1et dt. (279) Finally, G3,1 1,3 is the Meijer-G function (Sections 5.1-3; Bateman & Erdélyi, 1953) Interestingly, we can see that in this case, while the expectation of T is finite, we find that V[ T] = . 
Thus, by Equation (72), we find that V[N] = as well, where N is the number of samples simulated by Algorithm 3. G.3 Finite Discrete-Discrete / Piecewise Constant Case Without loss of generality, let Q be a discrete distribution over a finite set with K elements defined by the probability vector (r1 < r2 < . . . < r K) and P = Unif([1 : K]) the uniform distribution on K elements. Note that any discrete distribution pair can be rearranged into this setup. Now, we can embed this distribution into [0, 1] by mapping P to ˆP = Unif(0, 1) the uniform distribution over [0, 1] and mapping Q to ˆQ with density k=1 1 [ K x = k 1] K rk. (280) Then, we can draw a sample x from Q by simulating a sample ˆx ˆQ and computing x = K ˆx/K + 1. Hence, we can instead reduce the problem to sampling from a piecewise constant distribution. Thus, let us now instead present the more general case of arbitrary piecewise constant distributions over [0, 1], with Q defined by probabilities (q1 < q2 < . . . < q K) and corresponding piece widths (w1 < w2 < . . . < w K). We require, that P k=1 qk = PK k=1 wk = 1. Then, the density is Define rk = qk/wk. Then, k=1 1[h rk] wk (282) k=1 1[h rk] wk rk (283) k=1 1[h rk 1] 1 Bk ln Ak Bkrk 1 Ak Bk min{rk, h} where we defined j=k qj (285) j=k wj (286) G.4 Diagonal Gaussian-Gaussian Case Without loss of generality, let Q = N(µ, σ2I) and P = N(0, I) be d-dimensional Gaussian distributions with diagonal covariance. As a slight abuse of notation, let N(x | µ, σ2I) denote the probability density function of a Gaussian random variable with mean µ and covariance σ2I evaluated at x. Then, when σ2 < 1, we have r(x) = Z N x | m, s2I (287) m = µ 1 σ2 (288) Z = (1 σ2) d N(µ | 0, (1 σ2)I) (290) w P (h) = P χ2 d, m 2 2s2 ln h + C (291) 2s2 ln h + C C = s2 2 ln Z d ln(2πs2) . (293) Unfortunately, in this case, it is unlikely that we can solve for the stretch function analytically, so in our experiments, we solved for it numerically using Equation (4). G.5 One-dimensional Laplace-Laplace Case Without loss of generality, let Q = L(µ, b) and P = L(0, 1) be Laplace distributions. As a slight abuse of notation, let L(x | µ, b) denote the probability density function of a Laplace random variable with mean µ and scale b evaluated at x. Then, for b < 1 we have b exp |x µ| b + |x| (294) 1 exp b ln(bh) 1 b cosh µ 1 b if ln h |µ| exp b2 ln(bh) |µ| 1 b2 sinh b(ln(bh) |µ|) 2 (bh) b 1+b exp |µ| 1+b (bh) b 1 b exp |µ| 1 b if ln h > |µ| 1 exp ln(hb) 1 b cosh µ 1 b if ln h |µ| 1 exp ln(hb) |µ| 1 b2 cosh b(ln(hb) |µ|) 2(bh) 1 1 b exp |µ| 2(bh) 1 1+b exp |µ| 1+b if ln h > |µ| 1 σ 1 + b b 1 b b 1 1 b | {z } 0 on [0,1] cosh µ 1 b σ 1 1 1 b if ln h |µ| 2 b b 1 b b 1 1 b exp |µ| 1 b σ 1 1 1 b 2 b b 1+b + b 1 1+b exp |µ| 1+b σ 1 1 1+b otherwise Similarly to the Gaussian case, we resorted to numerical integration using Equation (4) to solve for σ 1. H Experimental Details Comparing Algorithm 3 versus Global-bound A* coding: We use a setup similar to the one used by Theis & Yosri (2022). Concretely, we assume the following model for correlated random variables x, µ: Pµ = N(0, 1) (298) Px|µ = N(µ, σ2). (299) From this, we find that the marginal on x must be Px = N(0, σ2 + 1). The mutual information is I[x; µ] = 1 2 log2 1 + σ2 bits, which is what we plot as I[x; µ] in the top two panels in Figure 2. For the bottom panel in Figure 2, we fixed a standard Gaussian prior P = N(0, 1), fixed K = DKL[Q P] and linearly increased R = D [Q P]. 
To find a target that achieves the desired values for these given divergences, we set its mean and variance as
$\sigma^2 = \exp\big(W(A \exp(B)) - B\big)$   (300)
$\mu = \sqrt{2K - \sigma^2 + \ln \sigma^2 + 1}$   (301)
$A = 2 \ln R - 2K - 1$   (302)
$B = 2 \ln R - 1,$   (303)
where $W$ is the principal branch of the Lambert $W$-function (Corless et al., 1996). We can derive these formulae as follows. We wish to find $\mu$ and $\sigma^2$ such that, for fixed numbers $K$ and $R$ and densities $q(x) = \mathcal{N}(x \mid \mu, \sigma^2)$ and $p(x) = \mathcal{N}(x \mid 0, 1)$, we have
$D_{\mathrm{KL}}[q \,\|\, p] = K \quad \text{and} \quad \sup_{x \in \mathbb{R}} \frac{q(x)}{p(x)} = R.$   (304)
We know that
$K = D_{\mathrm{KL}}[q \,\|\, p] = \frac{1}{2}\big(\mu^2 + \sigma^2 - \ln \sigma^2 - 1\big) \quad \text{and} \quad \ln R = \ln \sup_{x \in \mathbb{R}} \frac{q(x)}{p(x)} = \frac{\mu^2}{2(1 - \sigma^2)} - \ln \sigma.$   (305)
From these, we get that
$\mu^2 = 2K - \sigma^2 + \ln \sigma^2 + 1 \quad \text{and} \quad \mu^2 = 2(1 - \sigma^2)(\ln R + \ln \sigma).$   (306)
Setting these equal to each other and rearranging, we find
$2K - \sigma^2 + \ln \sigma^2 + 1 = 2 \ln R + \ln \sigma^2 - 2\sigma^2 \ln R - \sigma^2 \ln \sigma^2$
$\sigma^2 \ln \sigma^2 - \sigma^2 + 2\sigma^2 \ln R = 2 \ln R - 2K - 1$
$\sigma^2 \ln \sigma^2 + \sigma^2 (2 \ln R - 1) = A$
$\sigma^2 (\ln \sigma^2 + B) = A$
$\sigma^2 \ln(\sigma^2 e^B) = A$
$e^B \sigma^2 \ln(\sigma^2 e^B) = A e^B$
$e^{\ln(\sigma^2 e^B)} \ln(\sigma^2 e^B) = A e^B$
$\ln(\sigma^2 e^B) = W(A e^B)$
$\sigma^2 = e^{W(A e^B) - B},$
where we made the substitutions $A = 2 \ln R - 2K - 1$ and $B = 2 \ln R - 1$.

I Rejection Sampling Index Entropy Lower Bound

Assume that we have a pair of correlated random variables $\mathbf{x}, \mathbf{y} \sim P_{\mathbf{x},\mathbf{y}}$, and that Alice and Bob wish to realize a channel simulation protocol using standard rejection sampling as given by, e.g., Algorithm 2. Thus, when Alice receives a source symbol $\mathbf{y} \sim P_{\mathbf{y}}$, she sets $Q = P_{\mathbf{x}|\mathbf{y}}$ as the target and $P = P_{\mathbf{x}}$ as the proposal for the rejection sampler. Let $N$ denote the index of Alice's accepted sample, which is also the number of samples she needs to draw before her algorithm terminates. Since each acceptance decision is an independent Bernoulli trial in standard rejection sampling, $N$ follows a geometric distribution whose mean equals the upper bound $M$ used for the density ratio (Maddison, 2016). The lower bound on the optimal coding cost for $N$ is given by its entropy
$\mathbb{H}[N] = (M - 1) \log_2\!\left(\frac{M}{M - 1}\right) + \log_2 M \geq \log_2 M,$   (308)
where the inequality follows by omitting the first term, which is guaranteed to be positive since $x \mapsto -x \log_2 x$ is positive on $(0, 1)$. Hence, by using the optimal upper bound on the density ratio, $M = \mathrm{exp}_2(D_\infty[Q \,\|\, P])$, and plugging it into the formula above, we find that
$D_\infty[Q \,\|\, P] \leq \mathbb{H}[N].$   (309)
Now, taking expectation over $\mathbf{y}$, we find
$\mathbb{E}_{\mathbf{y} \sim P_{\mathbf{y}}}\big[D_\infty[P_{\mathbf{x}|\mathbf{y}} \,\|\, P_{\mathbf{x}}]\big] \leq \mathbb{H}[N].$   (310)
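As a small numerical check of the Gaussian target construction in Appendix H above, the following minimal sketch computes $\mu$ and $\sigma^2$ from Equations (300)-(303) and verifies the two constraints in Equation (304); note that it works in natural logarithms and, as in the derivation, takes $R$ to be the supremum of the density ratio itself.

```python
import numpy as np
from scipy.special import lambertw

def gaussian_target_for(K, R):
    """Mean and variance of Q = N(mu, var) such that, against P = N(0, 1),
    D_KL[Q||P] = K nats and sup_x dQ/dP(x) = R (Equations (300)-(303))."""
    A = 2.0 * np.log(R) - 2.0 * K - 1.0
    B = 2.0 * np.log(R) - 1.0
    var = np.exp(lambertw(A * np.exp(B)).real - B)   # principal branch of W
    mu = np.sqrt(2.0 * K - var + np.log(var) + 1.0)
    return mu, var

mu, var = gaussian_target_for(K=1.0, R=20.0)
kl = 0.5 * (mu**2 + var - np.log(var) - 1.0)                          # ~ 1.0
sup_ratio = np.exp(mu**2 / (2.0 * (1.0 - var)) - 0.5 * np.log(var))   # ~ 20.0
print(mu, var, kl, sup_ratio)
```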