# Greedy Poisson Rejection Sampling

Gergely Flamich, Department of Engineering, University of Cambridge. gf332@cam.ac.uk

One-shot channel simulation is a fundamental data compression problem concerned with encoding a single sample from a target distribution Q using a coding distribution P with as few bits as possible on average. Algorithms that solve this problem find applications in neural data compression and differential privacy and can serve as a more efficient alternative to quantization-based methods. Sadly, existing solutions are too slow or have limited applicability, preventing widespread adoption. In this paper, we conclusively solve one-shot channel simulation for one-dimensional problems where the target-proposal density ratio is unimodal by describing an algorithm with optimal runtime. We achieve this by constructing a rejection sampling procedure equivalent to greedily searching over the points of a Poisson process. Hence, we call our algorithm greedy Poisson rejection sampling (GPRS) and analyze the correctness and time complexity of several of its variants. Finally, we empirically verify our theorems, demonstrating that GPRS significantly outperforms the current state-of-the-art method, A* coding. Our code is available at https://github.com/gergely-flamich/greedy-poisson-rejection-sampling.

1 Introduction

It is a common misconception that quantization is essential to lossy data compression; in fact, it is merely a way to discard information deterministically. In this paper, we consider the alternative, that is, to discard information stochastically using one-shot channel simulation. To illustrate the main idea, take lossy image compression as an example. Assume we have a generative model given by a joint distribution Px,y over images y and latent variables x, e.g. we might have trained a variational autoencoder (VAE; Kingma & Welling, 2014) on a dataset of images. To compress a new image y, we encode a single sample from its posterior x ∼ Px|y as its stochastic lossy representation. The decoder can obtain a lossy reconstruction of y by decoding x and drawing a sample ŷ ∼ Py|x (though in practice, for a VAE we normally just take the mean predicted by the generative network). Abstracting away from our example, in this paper we will be entirely focused on channel simulation for a pair of correlated random variables x, y ∼ Px,y: given a source symbol y ∼ Py, we wish to encode a single sample x ∼ Px|y. A simple way to achieve this is to encode x with entropy coding using the marginal Px, whose average coding cost is approximately the entropy H[x]. Surprisingly, however, we can do much better by using a channel simulation protocol, whose average coding cost is approximately the mutual information I[x; y] (Li & El Gamal, 2018). This is remarkable, since not only is I[x; y] ≤ H[x], but in many cases I[x; y] might be finite even though H[x] is infinite, such as when x is continuous. Sadly, most existing protocols place heavy restrictions on Px,y or their runtime scales much worse than O(I[x; y]), limiting their practical applicability (Agustsson & Theis, 2020). In this paper, we propose a family of channel simulation protocols based on a new rejection sampling algorithm, which we can apply to simulate samples from a target distribution Q using a proposal distribution P over an arbitrary probability space.
The inspiration for our construction comes from an exciting recent line of work which recasts random variate simulation as a search problem over a set 37th Conference on Neural Information Processing Systems (Neur IPS 2023). Figure 1: Illustration of three GPRS procedures for a Gaussian target Q = N(1, 0.252) and Gaussian proposal distribution P = N(0, 1), with the time axis truncated to the first 17 units. All three variants find the first arrival of the same (1, P)-Poisson process Π under the graph of ϕ = σ r indicated by the thick dashed black line in each plot. Here, r = d Q/d P is the target-proposal density ratio, and σ is given by Equation (5). Left: Algorithm 3 sequentially searching through the points of Π. The green circle ( ) shows the first point of Π that falls under ϕ, and is accepted. All other points are rejected, as indicated by red crosses ( ). In practice, Algorithm 3 does not simulate points of Π that arrive after the accepted arrival. Middle: Parallelized GPRS (Algorithm 4) searching through two independent (1/2, P)-Poisson processes Π1 and Π2 in parallel. Blue points are arrivals in Π1 and orange points are arrivals in Π2. Crosses ( ) indicate rejected, and circles ( ) indicate accepted points by each thread. In the end, the algorithm accepts the earliest arrival across all processes, which in this case is marked by the blue circle ( ). Right: Branch-and-bound GPRS (Algorithm 5), when ϕ is unimodal. The shaded red areas are never searched or simulated by the algorithm since, given the first two rejections, we know points in those regions cannot fall under ϕ. of randomly placed points, specifically a Poisson process (Maddison, 2016). The most well-known examples of sampling-as-search are the Gumbel-max trick and A* sampling (Maddison et al., 2014). Our algorithm, which we call greedy Poisson rejection sampling (GPRS), differs significantly from all previous approaches in terms of what it is searching for, which we can succinctly summarise as: GPRS searches for the first arrival of a Poisson process Π under the graph of an appropriately defined function ϕ . The first and simplest variant of GPRS is equivalent to an exhaustive search over all points of Π in time order. Next, we show that the exhaustive search is embarrassingly parallelizable, leading to a parallelized variant of GPRS. Finally, when the underlying probability space has more structure, we develop branch-and-bound variants of GPRS that perform a binary search over the points of Π. See Figure 1 for an illustration of these three variants. While GPRS is an interesting sampling algorithm on its own, we also show that each of its variants induces a new one-shot channel simulation protocol. That is, after we receive y Py, we can set Q Px|y and P Px and use GPRS to encode a sample x Px|y at an average bitrate of a little more than the mutual information I[x; y]. In particular, on one-dimensional problems where the density ratio d Px|y/d Px is unimodal for all y (which is often the case in practice), branch-andbound GPRS leads to a protocol with an average runtime of O(I[x; y]), which is optimal. This is a considerable improvement over A* coding (Flamich et al., 2022), the current state-of-the-art method. In summary, our contributions are as follows: We construct a new rejection sampling algorithm called greedy Poisson rejection sampling, which we can construe as a greedy search over the points of a Poisson process (Algorithm 3). 
We propose a parallelized (Algorithm 4) and a branch-and-bound variant (Algorithms 5 and 6) of GPRS. We analyze the correctness and runtime of these algorithms. We show that each variant of GPRS induces a one-shot channel simulation protocol for correlated random variables x, y Px,y, achieving the optimal average codelength of I[x; y] + log2(I[x; y] + 1) + O(1) bits. We prove that when x is a R-valued random variable and the density ratio d Px|y/d Px is always unimodal, the channel simulation protocol based on the binary search variant of GPRS achieves O(I[x; y]) runtime, which is optimal. We conduct toy experiments on one-dimensional problems and show that GPRS compares favourably against A* coding, the current state-of-the-art channel simulation protocol. 2 Background The sampling algorithms we construct in this paper are search procedures on randomly placed points in space whose distribution is given by a Poisson process Π. Thus, in this section, we first review the necessary theoretical background for Poisson processes and how we can simulate them on a computer. Then, we formulate standard rejection sampling as a search procedure over the points of a Poisson process to serve as a prototype for our algorithm in the next section. Up to this point, our exposition loosely follows Sections 2, 3 and 5.1 of the excellent work of Maddison (2016), peppered with a few additional results that will be useful for analyzing our algorithm later. Finally, we describe the channel simulation problem, using rejection sampling as a rudimentary solution and describe its shortcomings. This motivates the development of greedy Poisson rejection sampling in Section 3. 2.1 Poisson Processes A Poisson process Π is a countable collection of random points in some mathematical space Ω. In the main text, we will always assume that Π is defined over Ω= R+ Rd and that all objects involved are measure-theoretically well-behaved for simplicity. For this choice of Ω, the positive reals represent time, and Rd represents space. However, most results generalize to settings when the spatial domain Rd is replaced with some more general space, and we give a general measure-theoretic construction in Appendix A, where the spatial domain is an arbitrary Polish space. Basic properties of Π: For a set A Ω, let N(A) def = |Π A| denote the number of points of Π falling in the set A, where | | denotes the cardinality of a set. Then, Π is characterized by the following two fundamental properties (Kingman, 1992). First, for two disjoint sets A, B Ω, A B = , the number of points of Π that fall in either set are independent random variables: N(A) N(B). Second, N(A) is Poisson distributed with mean measure µ(A) def = E[N(A)]. Time-ordering the points of Π: Since we assume that Ω= R+ Rd has a product space structure, we may write the points of Π as a pair of time-space coordinates: Π = {(Tn, Xn)} n=1. Furthermore, we can order the points in Π with respect to their time coordinates and index them accordingly, i.e. for i < j we have Ti < Tj. Hence, we refer to (Tn, Xn) as the nth arrival of the process. As a slight abuse of notation, we define N(t) def = N([0, t) Rd) and µ(t) def = E[N(t)], i.e. these quantities measure the number and average number of points of Π that arrive before time t, respectively. In this paper, we assume µ(t) has derivative µ (t), and assume for each t 0 there is a conditional probability distribution PX|T =t with density p(x | t), such that we can write the mean measure as µ(A) = R A p(x | t)µ (t) dx dt. 
Algorithm 1: Generating a (λ, PX|T )-Poisson process. Input :Time rate λ, Spatial distribution PX|T T0 0 for n = 1, 2, . . . do n Exp(λ) Tn Tn 1 + n Xn PX|T =Tn yield (Tn, Xn) end Simulating Π: A simple method to simulate Π on a computer is to realize it in time order, i.e. at step n, simulate Tn and then use it to simulate Xn PX|T =Tn. We can find the distribution of Π s first arrival by noting that no point of Π can come before it and hence P[T1 t] = P[N(t) = 0] = exp( µ(t)), where the second equality follows since N(t) is Poisson distributed with mean µ(t). A particularly important case is when Π is time-homogeneous, i.e. µ(t) = λt for some λ > 0, in which case T1 Exp(λ) is an exponential random variable with rate λ. In fact, all of Π s inter-arrival times n = Tn Tn 1 for n 1 share this simple distribution, where we set T0 = 0. To see this, note that P[ n t | Tn 1] = P N [Tn 1, Tn 1 + t) Rd = 0 | Tn 1 = exp( λt), (1) i.e. all n | Tn 1 Exp(λ). Therefore, we can use the above procedure to simulate timehomogeneous Poisson processes, described in Algorithm 1. We will refer to a time-homogeneous Poisson process with time rate λ and spatial distribution PX|T as a (λ, PX|T )-Poisson process. Rejection sampling using Π: Rejection sampling is a technique to simulate samples from a target distribution Q using a proposal distribution P, assuming we can find an upper bound M > 0 for Algorithm 2: Standard rejection sampler. Input :Proposal distribution P, Density ratio r = d Q/d P, Upper bound M for r. Output :Sample X Q and its index N. // Generator for a (1, P)-Poisson process using Algorithm 1. Π Simulate PP(1, P) for n = 1, 2, . . . do (Tn, Xn) next(Π) Un Unif(0, 1) if Un < r(Xn)/M then return Xn, n end end Algorithm 3: Greedy Poisson rejection sampler. Input :Proposal distribution P, Density ratio r = d Q/d P, Stretch function σ. Output :Sample X Q and its index N. // Generator for a (1, P)-Poisson process using Algorithm 1. Π Simulate PP(1, P) for n = 1, 2, . . . do (Tn, Xn) next(Π) if Tn < σ (r(Xn)) then return Xn, n end end their density ratio r = d Q/d P (technically, the Radon-Nikodym derivative). We can formulate this procedure using a Poisson process: we simulate the arrivals (Tn, Xn) of a (1, P)-Poisson process Π, but we only keep them with probability r(Xn)/M, otherwise, we delete them. This algorithm is described in Algorithm 2; its correctness is guaranteed by the thinning theorem (Maddison, 2016). Rejection sampling is suboptimal: Using Poisson processes to formulate rejection sampling highlights a subtle but crucial inefficiency: it does not make use of Π s temporal structure and only uses the spatial coordinates. GPRS fixes this by using a rejection criterion that does depend on the time variable. As we show, this significantly speeds up sampling for certain classes of distributions. 2.2 Channel Simulation The main motivation for our work is to develop a one-shot channel simulation protocol using the sampling algorithm we derive in Section 3. Channel simulation is of significant theoretical and practical interest. Recent works used it to compress neural network weights, achieving state-of-the-art performance (Havasi et al., 2018); to perform image compression using variational autoencoders (Flamich et al., 2020) and diffusion models with perfect realism (Theis et al., 2022); and to perform differentially private federated learning by compressing noisy gradients (Shah et al., 2022). 
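To make Algorithms 1 and 2 above concrete, here is a minimal Python sketch of time-ordered simulation of a time-homogeneous (λ, P)-Poisson process and of rejection sampling viewed as thinning that process. The Gaussian target and proposal, the bound M = 7, and all function names are illustrative assumptions rather than part of the paper.

```python
import numpy as np
from scipy import stats

def poisson_process(rate, spatial_dist, rng):
    """Algorithm 1 (sketch): yield arrivals (T_n, X_n) of a time-homogeneous
    (rate, P)-Poisson process in time order; inter-arrival times are Exp(rate)."""
    t = 0.0
    while True:
        t += rng.exponential(1.0 / rate)           # Delta_n ~ Exp(rate)
        yield t, spatial_dist.rvs(random_state=rng)

def rejection_sample(target, proposal, M, rng):
    """Algorithm 2 (sketch): thin a (1, P)-process, keeping each arrival w.p. r(X_n)/M."""
    r = lambda x: target.pdf(x) / proposal.pdf(x)  # density ratio dQ/dP
    for n, (t, x) in enumerate(poisson_process(1.0, proposal, rng), start=1):
        if rng.uniform() < r(x) / M:
            return x, n                            # accepted sample and its index N

# Example (assumed): Q = N(1, 0.25^2), P = N(0, 1); M must upper-bound r = dQ/dP.
rng = np.random.default_rng(0)
Q, P = stats.norm(1.0, 0.25), stats.norm(0.0, 1.0)
x, n = rejection_sample(Q, P, M=7.0, rng=rng)
```

When M is tight, the returned index N is geometrically distributed, which is exactly the quantity the rejection-sampling-based channel simulation protocol of the next subsection transmits.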
One-shot channel simulation is a communication problem between two parties, Alice and Bob, sharing a joint distribution Px,y over two correlated random variables x and y, where we assume that Alice and Bob can simulate samples from the marginal Px. In a single round of communication, Alice receives a sample y ∼ Py from the marginal distribution over y. Then, she needs to send the minimum number of bits to Bob such that he can simulate a single sample from the conditional distribution x ∼ Px|y. Note that Bob does not want to learn Px|y; he just wants to simulate a single sample from it. Surprisingly, when Alice and Bob have access to shared randomness, e.g. by sharing the seed of their random number generator before communication, they can solve channel simulation very efficiently. Mathematically, in this paper we will always model this shared randomness by some time-homogeneous Poisson process (or processes) Π, since given a shared random seed, Alice and Bob can always simulate the same process using Algorithm 1. Then, the average coding cost of x given Π is its conditional entropy H[x | Π] and, surprisingly, it is always upper bounded as H[x | Π] ≤ I[x; y] + log2(I[x; y] + 1) + O(1), where I[x; y] is the mutual information between x and y (Li & El Gamal, 2018). This is an especially curious result, given that in many cases H[x] is infinite while I[x; y] is finite, e.g. when x is a continuous variable. In essence, this result means that given the additional structure provided by Px,y, channel simulation protocols can offload an infinite amount of information into the shared randomness and only communicate the finitely many necessary bits.

An example channel simulation protocol with rejection sampling: Given Π and y ∼ Py, Alice sets Q ← Px|y as the target and P ← Px as the proposal distribution with density ratio r = dQ/dP, and runs the rejection sampler in Algorithm 2 to find the first point of Π that was not deleted. She counts the number of samples N she had to simulate before acceptance and sends this number to Bob. He can then decode a sample x ∼ Px|y by selecting the spatial coordinate of the Nth arrival of Π. Unfortunately, this simple protocol is suboptimal. To see this, let D∞[Q‖P] := sup_{x∈Ω} {log2 r(x)} denote the Rényi ∞-divergence of Q from P, and recall two standard facts: (1) the best possible upper bound Alice can use for rejection sampling is Mopt = exp2(D∞[Q‖P]), where exp2(x) = 2^x, and (2) the number of samples N drawn until acceptance is a geometric random variable with mean Mopt (Maddison, 2016). We now state the two issues with rejection sampling that GPRS solves.

Problem 1: Codelength. By using the formula for the entropy of a geometric random variable and assuming Alice uses the best possible bound Mopt in the protocol, we find that

$$H[x \mid \Pi] = \mathbb{E}_{y \sim P_y}\bigl[H[N \mid y]\bigr] \gtrsim \mathbb{E}_{y \sim P_y}\bigl[D_\infty[P_{x|y} \,\|\, P_x]\bigr] \overset{(a)}{\geq} I[x; y], \tag{2}$$

see Appendix I for the derivation. Unfortunately, inequality (a) can be arbitrarily loose, hence the average codelength scales with the expected ∞-divergence instead of I[x; y], as would be optimal.

Problem 2: Slow runtime. We are interested in classifying the time complexity of our protocol. As we saw, for a target Q and proposal P, Algorithm 2 draws Mopt = exp2(D∞[Q‖P]) samples on average. Unfortunately, under the computational hardness assumption RP ≠ NP, Agustsson & Theis (2020) showed that without any further assumptions, there is no sampler whose runtime scales polynomially in DKL[Q‖P]. However, with further assumptions, we can do much better, as we show in Section 3.1.
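To see how large the gap between the two divergences can get, the short sketch below evaluates DKL[Q‖P] and D∞[Q‖P] in bits for Q = N(m, s²) and P = N(0, 1), using the standard Gaussian closed forms (for s < 1); the specific (m, s) values are assumptions chosen for illustration. Holding DKL fixed while s → 1 sends D∞, and therefore the expected sample count exp2(D∞) of Algorithm 2, to infinity.

```python
import numpy as np

LOG2E = np.log2(np.e)

def gaussian_divergences_bits(m, s):
    """D_KL and D_infinity of Q = N(m, s^2) from P = N(0, 1), in bits (requires s < 1,
    otherwise sup dQ/dP, and hence D_infinity, is infinite)."""
    d_kl = LOG2E * ((m**2 + s**2 - 1.0) / 2.0 - np.log(s))
    d_inf = LOG2E * (np.log(1.0 / s) + m**2 / (2.0 * (1.0 - s**2)))
    return d_kl, d_inf

# Mirror the spirit of the one-shot experiment described later: choose m so that
# D_KL[Q||P] stays fixed at 2 bits while s -> 1, which blows up D_inf and the
# expected runtime 2^{D_inf} of standard rejection sampling.
target_kl_nats = 2.0 / LOG2E
for s in [0.5, 0.9, 0.99]:
    m = np.sqrt(2.0 * (target_kl_nats + np.log(s)) + 1.0 - s**2)
    d_kl, d_inf = gaussian_divergences_bits(m, s)
    print(f"s={s:.2f}: D_KL = {d_kl:.2f} bits, D_inf = {d_inf:.1f} bits, "
          f"E[N] ~ {2**d_inf:.0f} samples")
```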
3 Greedy Poisson Rejection Sampling

We now describe GPRS; its pseudo-code is shown in Algorithm 3. This section assumes that Q and P are the target and proposal distributions, respectively, and r = dQ/dP is their density ratio. Let Π be a (1, P)-Poisson process. Our proposed rejection criterion is now embarrassingly simple: for an appropriate invertible function σ : R+ → R+, accept the first arrival of Π that falls under the graph of the composite function ϕ = σ ∘ r, as illustrated in the left plot in Figure 1. We refer to σ as the stretch function for r, as its purpose is to stretch the density ratio along the time-axis towards ∞.

Deriving the stretch function (sketch, see Appendix A for details): Let ϕ = σ ∘ r, where for now σ is an arbitrary invertible function on R+, let U = {(t, x) ∈ Ω | t ≤ ϕ(x)} be the set of points under the graph of ϕ, and let Π̃ = Π ∩ U. By the restriction theorem (Kingman, 1992), Π̃ is also a Poisson process with mean measure μ̃(A) = μ(A ∩ U). Let (T̃, X̃) be the first arrival of Π̃, i.e. the first arrival of Π under ϕ, and let Qσ be the marginal distribution of X̃. Then, as we show in Appendix A, the density ratio dQσ/dP is given by

$$\frac{dQ_\sigma}{dP}(x) = \int_0^{\phi(x)} \mathbb{P}[\tilde{T} \geq t]\, dt. \tag{3}$$

Now we pick σ such that Qσ = Q, for which we need to ensure that dQσ/dP = r. Substituting t = ϕ(x) into Equation (3) and differentiating, we get (σ⁻¹)′(t) = P[T̃ ≥ t]. Since (T̃, X̃) falls under the graph of ϕ by definition, we have P[T̃ ≥ t] = P[T̃ ≥ t, r(X̃) ≥ σ⁻¹(t)]. By expanding the definition of the right-hand side, we obtain a time-invariant ODE for σ⁻¹:

$$\left(\sigma^{-1}\right)' = w_Q\!\left(\sigma^{-1}\right) - \sigma^{-1}\, w_P\!\left(\sigma^{-1}\right), \quad \text{with } \sigma^{-1}(0) = 0, \tag{4}$$

where w_P(h) := P_{Z∼P}[r(Z) ≥ h] and w_Q is defined analogously. Finally, solving the ODE, we get

$$\sigma(h) = \int_0^h \frac{1}{w_Q(\eta) - \eta\, w_P(\eta)}\, d\eta. \tag{5}$$

Remember that picking σ according to Equation (5) ensures that GPRS is correct by construction. To complete the picture, in Appendix A we prove that (T̃, X̃) always exists and Algorithm 3 terminates with probability 1. We now turn our attention to analyzing the runtime of Algorithm 3 and, surprisingly, we find that the expected runtime of GPRS matches that of standard rejection sampling; see Appendix B.2 for the proof. Note that in our analyses, we identify GPRS's time complexity with the number of arrivals of Π it simulates. Moreover, we assume that we can evaluate σ in O(1) time.

Theorem 3.1 (Expected Runtime). Let Q and P be the target and proposal distributions for Algorithm 3, respectively, and r = dQ/dP their density ratio. Let N denote the number of samples simulated by the algorithm before it terminates. Then,

$$\mathbb{E}[N] = \exp_2\left(D_\infty[Q \,\|\, P]\right) \quad \text{and} \quad \mathbb{V}[N] \geq \exp_2\left(D_\infty[Q \,\|\, P]\right), \tag{6}$$

where V[·] denotes the variance of a random variable.

Furthermore, using ideas from the work of Liu & Verdu (2018), we can show the following upper bound on the fractional moments of the index; see Appendix B.3 for the proof.

Theorem 3.2 (Fractional Moments of the Index). Let Q and P be the target and proposal distributions for Algorithm 3, respectively, and r = dQ/dP their density ratio. Let N denote the number of samples simulated by the algorithm before it terminates. Then, for α ∈ (0, 1) we have

$$\mathbb{E}[N^\alpha] \leq \frac{1}{1-\alpha}\left(1 + D_\infty[Q \,\|\, P]\right) \exp_2\!\left(\alpha\, D_{\frac{1}{1-\alpha}}[Q \,\|\, P]\right), \tag{7}$$

where $D_\alpha[Q \,\|\, P] = \frac{1}{\alpha-1}\log_2 \mathbb{E}_{Z\sim Q}\left[r(Z)^{\alpha-1}\right]$ is the Rényi α-divergence of Q from P.

GPRS induces a channel simulation protocol: We use an analogous solution to the standard rejection sampling-based scheme in Section 2.2 to obtain a channel simulation protocol for a pair of correlated random variables x, y ∼ Px,y.
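As a concrete illustration of the construction, below is a rough numerical sketch of Algorithm 3 in Python for the Gaussian example of Figure 1. The Monte Carlo estimates of w_P and w_Q, the Euler integration of Equation (4), and all step sizes are assumptions made for this sketch; the closed forms of Appendix G and a proper ODE solver would be used in practice. Because σ is increasing, the acceptance test Tn < σ(r(Xn)) is checked equivalently as σ⁻¹(Tn) < r(Xn), so only the numerically stable inverse is ever needed.

```python
import numpy as np
from scipy import stats

# Assumed example, matching Figure 1: target Q = N(1, 0.25^2), proposal P = N(0, 1).
rng = np.random.default_rng(0)
Q, P = stats.norm(1.0, 0.25), stats.norm(0.0, 1.0)
r = lambda x: Q.pdf(x) / P.pdf(x)                        # density ratio dQ/dP

# Monte Carlo estimates of the survival functions w_P(h) = P_{Z~P}[r(Z) >= h], w_Q analogous.
rP = np.sort(r(P.rvs(100_000, random_state=rng)))
rQ = np.sort(r(Q.rvs(100_000, random_state=rng)))
w_P = lambda h: 1.0 - np.searchsorted(rP, h) / rP.size
w_Q = lambda h: 1.0 - np.searchsorted(rQ, h) / rQ.size

# Euler-integrate the ODE (4): (sigma^{-1})'(t) = w_Q(s) - s * w_P(s), s(0) = 0.
dt = 1e-2
ts = np.arange(0.0, 50.0, dt)
s = np.zeros_like(ts)
for i in range(1, ts.size):
    s[i] = s[i - 1] + dt * (w_Q(s[i - 1]) - s[i - 1] * w_P(s[i - 1]))
sigma_inv = lambda t: np.interp(t, ts, s)                # clamps to the plateau for large t

def gprs(rng):
    """Algorithm 3 (sketch): accept the first arrival (T_n, X_n) of a (1, P)-process
    with sigma^{-1}(T_n) < r(X_n), i.e. T_n < sigma(r(X_n))."""
    t, n = 0.0, 0
    while True:
        n += 1
        t += rng.exponential(1.0)
        x = P.rvs(random_state=rng)
        if sigma_inv(t) < r(x):
            return x, n

samples = np.array([gprs(rng)[0] for _ in range(2_000)])  # approximately Q-distributed
```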
First, we assume that both the encoder and the decoder have access to the same realization of a (1, Px)-Poisson process Π, which they will use as their shared randomness. In a single round of communication, the encoder receives y ∼ Py, sets Q ← Px|y as the target and P ← Px as the proposal distribution, and runs Algorithm 3 to find the index N of the accepted arrival in Π. Finally, following Li & El Gamal (2018), they use a zeta distribution ζ(n | s) ∝ n^{−s} with (s − 1)^{−1} = I[x; y] + 2 log2 e to encode N using entropy coding. Then, the decoder simulates a Px|y-distributed sample by looking up the spatial coordinate XN of Π's Nth arrival. The following theorem shows that this protocol is optimally efficient; see Appendix B.4 for the proof.

Theorem 3.3 (Expected Codelength). Let Px,y be a joint distribution over correlated random variables x and y, and let Π be the (1, Px)-Poisson process used by Algorithm 3. Then, the above protocol induced by the algorithm is optimally efficient up to a constant, in the sense that

$$H[x \mid \Pi] \leq I[x; y] + \log_2\left(I[x; y] + 1\right) + 6. \tag{8}$$

Note that this result shows we can use GPRS as an alternative to the Poisson functional representation to prove the strong functional representation lemma (Theorem 1; Li & El Gamal, 2018).

GPRS in practice: The real-world applicability of Algorithm 3 is limited by the following three computational obstacles: (1) GPRS requires exact knowledge of the density ratio dQ/dP to evaluate ϕ, which immediately excludes many practical problems where dQ/dP is known only up to a normalizing constant. (2) To compute σ, we need to evaluate Equation (4) or (5), which requires us to compute w_P and w_Q, which is usually intractable for general dQ/dP. Fortunately, in some special cases they can be computed analytically: in Appendix G, we provide the expressions for discrete, uniform, triangular, Gaussian, and Laplace distributions. (3) However, even when we can compute w_P and w_Q, analytically evaluating the integral in Equation (5) is usually not possible. Moreover, solving it numerically is unstable as σ is unbounded. Instead, we numerically solve for σ⁻¹ using Equation (4) in practice, which fortunately turns out to be stable. However, we emphasize that GPRS's raison d'être is to perform channel simulation using the protocol we outline above or the ones we present in Section 3.1, not general-purpose random variate simulation. Practically relevant channel simulation problems, such as encoding a sample from the latent posterior of a VAE (Flamich et al., 2020) or encoding variational implicit neural representations (Guo et al., 2023; He et al., 2023), usually involve simple distributions such as Gaussians. In these cases, we have readily available formulae for w_P and w_Q and can cheaply evaluate σ⁻¹ via numerical integration.

GPRS as greedy search: The defining feature of GPRS is that its acceptance criterion at each step is local: if at step n the arrival (Tn, Xn) in Π falls under ϕ, the algorithm accepts it. Thus, it is a greedy search procedure. This is in contrast with A* sampling (Maddison et al., 2014), whose acceptance criterion is global, as the acceptance of the nth arrival (Tn, Xn) depends on all other points of Π in the general case. In other words, using the language of Liu & Verdu (2018), GPRS is a causal sampler, as its acceptance criterion at each step only depends on the samples the algorithm has examined in previous steps.
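The protocol is straightforward to sketch end to end: the shared randomness is just a shared PRNG seed, Alice transmits only the index N, and Bob replays the same process. In the sketch below, the helper names and the `accept` predicate (which stands in for the test σ⁻¹(Tn) < r(Xn) with r = dPx|y/dPx) are illustrative assumptions.

```python
import numpy as np
from scipy import stats
from scipy.special import zeta as riemann_zeta

def shared_arrivals(seed, Px):
    """Arrivals of the shared (1, Px)-Poisson process, reproducible from the seed."""
    rng = np.random.default_rng(seed)
    t = 0.0
    while True:
        t += rng.exponential(1.0)
        yield t, Px.rvs(random_state=rng)

def alice_encode(seed, Px, accept):
    """Run GPRS on the shared process and return only the index N of the accepted arrival."""
    for n, (t, x) in enumerate(shared_arrivals(seed, Px), start=1):
        if accept(t, x):
            return n

def bob_decode(seed, Px, n):
    """Replay the shared process and output the spatial coordinate of the N-th arrival."""
    arrivals = shared_arrivals(seed, Px)
    for _ in range(n - 1):
        next(arrivals)
    return next(arrivals)[1]

def zeta_codelength_bits(n, mutual_info_bits):
    """Codelength of N under zeta(n | s) with 1/(s - 1) = I[x; y] + 2 log2 e."""
    s = 1.0 + 1.0 / (mutual_info_bits + 2.0 * np.log2(np.e))
    return s * np.log2(n) + np.log2(riemann_zeta(s))   # -log2 of n^{-s} / zeta(s)
```

In the Gaussian example from the earlier sketch, `accept` would simply reuse `sigma_inv` and `r` with Q = Px|y and P = Px.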
On the other hand, A* sampling is non-causal as its acceptance criterion in each step depends on future samples as well. Surprisingly, this difference between the search criteria does not make a difference in the average runtimes and codelengths in the general case. However, as we show in the next section, GPRS can be much faster in special cases. 3.1 Speeding up the greedy search This section discusses two ways to improve the runtime of GPRS. First, we show how we can utilize available parallel computing power to speed up Algorithm 3. Second, we propose an advanced search Algorithm 4: Parallel GPRS with J available threads. Input :Proposal distribution P, Density ratio r = d Q/d P, Stretch function σ, Number of parallel threads J. Output :Sample X Q and its code (j , Nj ). T , X , j , Nj , nil, nil, nil in parallel for j = 1, . . . , J do Πj Simulate PP(1/J, P) for nj = 1, 2, . . . do T (j) nj , X(j) nj next(Πj) if T < T (j) nj then terminate thread j. end if T (j) nj < σ r X(j) nj then T , X , j , Nj T (j) nj , X(j) nj , j, nj terminate thread j. end end end return X , (j , Nj ) Algorithm 5: Branch-and-bound GPRS on R with unimodal r Input :Proposal distribution P, Density ratio r = d Q/d P, Stretch function σ, Location x of the mode of r. Output :Sample X Q and its heap index H. T0, H, B (0, 1, R) for d = 1, 2, . . . do Xd P|B d Exp (P(B)) Td Td 1 + d if Td < σ(r(Xd)) then return Xd, H end if Xd x then B B ( , Xd) H 2H else B B (Xd, ) H 2H + 1 end end strategy when the spatial domain Ωhas more structure and obtain a super-exponential improvement in the runtime from exp2 (D [Q P]) to O(DKL[Q P]) in certain cases. Parallel GPRS: The basis for parallelizing GPRS is the superposition theorem (Kingman, 1992): For a positive integer J, let Π1, . . . ΠJ all be (1/J, P)-Poisson processes; then, Π = SJ j=1 Πj is a (1, P)-Poisson process. As shown in Algorithm 4, this result makes parallelizing GPRS very simple given J parallel threads: First, we independently look for the first arrivals of Π1, . . . , ΠJ under the graph of ϕ, yielding T (1) N1 , X(1) N1 , . . . , T (J) NJ , X(J) NJ , respectively, where the Nj corresponds to the index of Πj s first arrival under ϕ. Then, we select the candidate with the earliest arrival time, i.e. j = arg minj {1,...,J} T (j) Nj . Now, by the superposition theorem, T (j ) Nj , X(j ) Nj is the first arrival of Π under ϕ, and hence X(j ) Nj Q. Finally, Alice encodes the sample via the tuple (j , Nj ), i.e. which of the J processes the first arrival occurred in, and the index of the arrival in Πj . See the middle graphic in Figure 1 for an example with J = 2 parallel threads. Our next results show that parallelizing GPRS results in a linear reduction in both the expectation and variance of its runtime and a more favourable codelength guarantee for channel simulation. Theorem 3.4 (Expected runtime of parallelized GPRS). Let Q, P and r be defined as above, and let νj denote the random variable corresponding to the number of samples simulated by thread j in Algorithm 4 using J threads. Then, for all j, E[νj] = (exp2 (D [Q P]) 1) J + 1 and V[νj] (exp2 (D [Q P]) 1) J + 1. (9) Theorem 3.5 (Expected codelength of parallelized GPRS). Let Px,y be a joint distribution over correlated random variables x and y, and let Π1, . . . , ΠJ be (1/J, Px)-Poisson processes. Then, assuming log2 J I[x; y], parallelized GPRS induces a channel simulation protocol such that H[x | Π1, . . . , ΠJ] I[x; y] + log2 (I[x; y] log2 J + 1) + 8. (10) See Appendix C for the proofs. 
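A minimal sketch of Algorithm 4 in Python, assuming the same `r` and `sigma_inv` helpers as in the earlier sketches; the J sub-processes are simulated sequentially here purely for clarity, whereas a real implementation would run them on parallel threads.

```python
import numpy as np

def parallel_gprs(J, P_dist, r, sigma_inv, rng):
    """Algorithm 4 (sketch): search J independent (1/J, P)-processes for their first
    arrivals under phi and keep the earliest one across processes (superposition)."""
    best = None                                    # (time, x, j, n_j) of current winner
    for j in range(1, J + 1):
        t, n = 0.0, 0
        while True:
            n += 1
            t += rng.exponential(J)                # rate 1/J  =>  mean inter-arrival J
            if best is not None and t > best[0]:
                break                              # thread j can no longer win
            x = P_dist.rvs(random_state=rng)
            if sigma_inv(t) < r(x):                # thread j's first arrival under phi
                if best is None or t < best[0]:
                    best = (t, x, j, n)
                break
    _, x_star, j_star, n_star = best
    return x_star, (j_star, n_star)                # sample and its two-part code
```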
Note that we can use the same parallelisation argument, with the appropriate modifications, to speed up A* sampling/coding too.

Branch-and-bound GPRS on R: We briefly restrict our attention to problems on R when the density ratio r is unimodal, as we can exploit this additional structure to search more efficiently for Π's first arrival under ϕ. Consider the example in the right graphic in Figure 1: we simulate the first arrival (T1, X1) and reject it, since T1 > ϕ(X1). Since r is unimodal by assumption and σ is increasing, ϕ is also unimodal. Hence, for x < X1 we must have ϕ(x) < ϕ(X1), while the arrival time of any of the later arrivals will be larger than T1. Therefore, none of the arrivals to the left of X1 will fall under ϕ either! Hence, it is sufficient to simulate Π to the right of X1, which we can easily do using generalized inverse transform sampling.1 We repeat this argument for all further arrivals to obtain an efficient search procedure for finding Π's first arrival under ϕ, described in Algorithm 5: We simulate the next arrival (TB, XB) of Π within some bounds B, starting with B = Ω. If TB ≤ ϕ(XB), we accept; otherwise, we truncate the bound to B ← B ∩ (−∞, XB) or B ← B ∩ (XB, ∞) based on where XB falls relative to ϕ's mode. We repeat these two steps until we find the first arrival. Since Algorithm 5 does not simulate every point of Π, we cannot use the index N of the accepted arrival to obtain a channel simulation protocol as before. Instead, we encode the search path, i.e. whether we chose the left or the right side of our current sample at each step. Similarly to A* coding, we encode the path using its heap index (Flamich et al., 2022): the root has index Hroot = 1, and for a node with index H, its left child is assigned index 2H and its right child 2H + 1. As the following theorems show, this version of GPRS is, in fact, optimally efficient; see Appendix E for the proofs.

Theorem 3.6 (Expected Runtime of GPRS with binary search). Let Q, P and r be defined as above, let D denote the number of samples simulated by Algorithm 5, and let λ = 1/log2(4/3). Then,

$$\mathbb{E}[D] \leq \lambda \cdot D_{\mathrm{KL}}[Q \,\|\, P] + \mathcal{O}(1). \tag{11}$$

Theorem 3.7 (Expected Codelength of GPRS with binary search). Let Px,y be a joint distribution over correlated random variables x and y, and let Π be a (1, Px)-Poisson process. Then, GPRS with binary search induces a channel simulation protocol such that

$$H[x \mid \Pi] \leq I[x; y] + 2\log_2\left(I[x; y] + 1\right) + 11. \tag{12}$$

Branch-and-bound GPRS with splitting functions: With some additional machinery, Algorithm 5 can be extended to more general settings, such as R^D and cases where r is not unimodal, by introducing the notion of a splitting function. For a region of space B ⊆ Ω, a splitting function split simply returns a binary partition of the set, i.e. {L, R} = split(B), such that L ∩ R = ∅ and L ∪ R = B. In this case, we can perform a similar tree search to Algorithm 5, captured in Algorithm 6. Recall that (T̃, X̃) is the first arrival of Π under ϕ. Starting with the whole space B = Ω, we simulate the next arrival (T, X) of Π in B at each step. If we reject it, we partition B into two parts, L and R, using split. Then, with probability P[X̃ ∈ R | X̃ ∈ B, T̃ ≥ T], we continue searching through the arrivals of Π only in R, and only in L otherwise. We show the correctness of this procedure in Appendix F and describe how to compute P[X̃ ∈ R | X̃ ∈ B, T̃ ≥ T] and σ in practice.
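Below is a minimal Python sketch of Algorithm 5, again assuming the `r` and `sigma_inv` helpers from the earlier sketches. Truncated proposals are drawn with the inverse-CDF trick from the footnote, and the heap-index bookkeeping follows the left-child 2H / right-child 2H + 1 convention from the text; pairing each rejected sample's side of the mode with a particular child is our own consistent choice for this sketch.

```python
import numpy as np

def branch_and_bound_gprs(P_dist, r, sigma_inv, mode, rng):
    """Algorithm 5 (sketch) for a unimodal density ratio r on R; `mode` is the
    location of r's maximum and B = (lo, hi) is the current search bound."""
    lo, hi = -np.inf, np.inf
    t, heap_index = 0.0, 1
    while True:
        F_lo, F_hi = P_dist.cdf(lo), P_dist.cdf(hi)
        t += rng.exponential(1.0 / (F_hi - F_lo))  # next arrival within B: Exp(P(B))
        x = P_dist.ppf(rng.uniform(F_lo, F_hi))    # X ~ P truncated to B
        if sigma_inv(t) < r(x):                    # first arrival under phi: accept
            return x, heap_index
        if x <= mode:                              # everything left of x lies below phi(x)
            lo, heap_index = x, 2 * heap_index + 1 # continue in the right sub-interval
        else:                                      # everything right of x lies below phi(x)
            hi, heap_index = x, 2 * heap_index     # continue in the left sub-interval
```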
This splitting procedure is analogous to the general version of A* sampling/coding (Maddison et al., 2014; Flamich et al., 2022), which is parameterized by the same splitting functions. Note that Algorithm 5 is a special case of Algorithm 6, where Ω = R, r is unimodal, and at each step, for an interval bound B = (a, b) and sample X ∈ (a, b), we split B into {(a, X), (X, b)}.

4 Experiments

We compare the average-case and one-shot efficiency of our proposed variants of GPRS and a couple of other channel simulation protocols in Figure 2. See the figure's caption and Appendix H for details of our experimental setup. The top two plots in Figure 2 demonstrate that our methods' expected runtimes and codelengths align well with our theorems' predictions and compare favourably to other methods. Furthermore, we find that the mean performance is a robust statistic for the binary search-based variants of GPRS in that it lines up closely with the median, and the interquartile range is quite narrow. However, we also see that Theorem 3.7 is slightly loose empirically. We conjecture that the coefficient of the log2(I[x; y] + 1) term in Equation (12) could be reduced to 1, and that the constant term could be improved as well. On the other hand, the bottom plot in Figure 2 demonstrates the most salient property of the binary search variant of GPRS: unlike all previous methods, its runtime scales with DKL[Q‖P] and not D∞[Q‖P]. Thus, we can apply it to a larger family of channel simulation problems, where DKL[Q‖P] is finite but D∞[Q‖P] is large or even infinite, and other methods would not terminate.

1 Suppose we have a real-valued random variable X with CDF F and inverse CDF F⁻¹. Then, we can simulate X truncated to [a, b] by simulating U ∼ Unif[F(a), F(b)] and computing X = F⁻¹(U).

Algorithm 6: Branch-and-bound GPRS with splitting function.
Input: Proposal distribution P, Density ratio r = dQ/dP, Stretch function σ, Splitting function split.
Output: Sample X ∼ Q and its heap index H.
(T0, H, B) ← (0, 1, Ω)
for d = 1, 2, . . . do
  Xd ∼ P|B
  Δd ∼ Exp(P(B))
  Td ← Td−1 + Δd
  if Td < σ(r(Xd)) then
    return Xd, H
  else
    {B0, B1} ← split(B)
    ρ ← P[X̃ ∈ B1 | X̃ ∈ B, T̃ ≥ Td]
    β ∼ Bernoulli(ρ)
    H ← 2H + β
    B ← Bβ
  end
end

Figure 2: Left: Binary search GPRS with arbitrary splitting function. Right: Performance comparison of different channel simulation protocols. In each plot, dashed lines indicate the mean, solid lines the median and the shaded areas the 25-75 percentile region of the relevant performance metric. We computed the statistics over 1000 runs for each setting. Top right: Runtime comparison on a 1D Gaussian channel simulation problem Px,µ, plotted against increasing mutual information I[x; µ]. Alice receives µ ∼ N(0, σ²) and encodes a sample x | µ ∼ N(µ, 1) to Bob. The abbreviations in the legend are: GPRS: Algorithm 3; PFR: Poisson functional representation / Global-bound A* coding (Li & El Gamal, 2018; Flamich et al., 2022); AS*: Split-on-sample A* coding (Flamich et al., 2022); sGPRS: split-on-sample GPRS (Algorithm 5); dGPRS: GPRS with dyadic split (Algorithm 6). Middle right: Average codelength comparison of our proposed algorithms on the same channel simulation problem as above. UB in the legend corresponds to an upper bound of I[x; µ] + log2(I[x; µ] + 1) + 2 bits.
We estimate the algorithms' expected codelengths by encoding the indices returned by GPRS using a Zeta distribution ζ(n | λ) ∝ n^{−λ} with λ = 1 + 1/I[x; y] in each case, which is the optimal maximum entropy distribution for this problem setting (Li & El Gamal, 2018). Bottom right: One-shot runtime comparison of sGPRS with AS* coding. Alice encodes samples from a target Q = N(m, s²) using P = N(0, 1) as the proposal. We computed m and s² such that DKL[Q‖P] = 2 bits for each problem, while D∞[Q‖P] increases. The runtime of sGPRS stays fixed as it scales with DKL[Q‖P], while the runtime of A* keeps increasing.

5 Related Work

Poisson processes for channel simulation: Poisson processes were introduced to the channel simulation literature by Li & El Gamal (2018) via their construction of the Poisson functional representation (PFR) in their proof of the strong functional representation lemma. Flamich et al. (2022) observed that the PFR construction is equivalent to a certain variant of A* sampling (Maddison et al., 2014; Maddison, 2016). Thus, they proposed an optimized version of the PFR called A* coding, which achieves O(D∞[Q‖P]) runtime for one-dimensional unimodal distributions. GPRS was mainly inspired by A* coding, and they are dual constructions to each other in the following sense: depth-limited A* coding can be thought of as an importance sampler, i.e. a Monte Carlo algorithm that returns an approximate sample in fixed time. On the other hand, GPRS is a rejection sampler, i.e. a Las Vegas algorithm that returns an exact sample in random time.

Channel simulation with dithered quantization: Dithered quantization (DQ; Ziv, 1985) is an alternative to rejection and importance sampling-based approaches to channel simulation. DQ exploits the fact that for any c ∈ R and U, U′ ∼ Unif(−1/2, 1/2), the quantities ⌊c − U⌉ + U and c + U′ are equal in distribution. While DQ has been around for decades as a tool to model and analyze quantization error, Agustsson & Theis (2020) reinterpreted it as a channel simulation protocol and used it to develop a VAE-based neural image compression algorithm. Unfortunately, basic DQ only allows uniform target distributions, limiting its applicability. As a partial remedy, Theis & Yosri (2022) showed DQ could be combined with other channel simulation protocols to speed them up and thus called their approach hybrid coding (HQ). Originally, HQ required that the target distribution be compactly supported, a requirement lifted by Flamich & Theis (2023), who developed an adaptive rejection sampler using HQ. In a different vein, Hegazy & Li (2022) generalize and analyze a method proposed in the appendix of Agustsson & Theis (2020) and show that DQ can be used to realize channel simulation protocols for one-dimensional symmetric, unimodal distributions.

Greedy Rejection Coding: Concurrently to our work, Flamich et al. (2023) introduce greedy rejection coding (GRC), which generalizes Harsha et al.'s rejection sampling algorithm to arbitrary probability spaces and arbitrary splitting functions, extending the work of Flamich & Theis (2023). Furthermore, they also utilize a space-partitioning procedure to speed up the convergence of their sampler and prove that a variant of their sampler also achieves optimal runtime for one-dimensional problems where dQ/dP is unimodal. However, the construction of their method differs significantly from ours. GRC is a direct generalization of Harsha et al.'s (2007) algorithm and, thus, a more conventional rejection sampler, while we base our construction on Poisson processes.
Thus, our proof techniques are also significantly different. It is an interesting research question whether GRC could be formulated using Poisson processes, akin to standard rejection sampling in Algorithm 2, as this could be used to connect the two algorithms and improve both. 6 Discussion and Future Work Using the theory of Poisson processes, we constructed greedy Poisson rejection sampling. We proved the correctness of the algorithm and analyzed its runtime, and showed that it could be used to obtain a channel simulation protocol. We then developed several variations on it, analyzed their runtimes, and showed that they could all be used to obtain channel simulation protocols. As the most significant result of the paper, we showed that using the binary search variant of GPRS we can achieve O(DKL[Q P]) runtime for arbitrary one-dimensional, unimodal density ratios, significantly improving upon the previous best O(D [Q P]) bound by A* coding. There are several interesting directions for future work. From a practical perspective, the most pressing question is whether efficient channel simulation algorithms exist for multivariate problems; finding an efficient channel simulation protocol for multivariate Gaussians would already have farreaching practical consequences. We also highlight an interesting theoretical question. We found that the most general version of GPRS needs to simulate exactly exp2(D [Q P]) samples in expectation, which matches A* sampling. From Agustsson & Theis (2020), we know that for general problems, the average number of simulated samples has to be at least on the order of exp2(DKL[Q P]). Hence we ask whether this lower bound can be tightened to match the expected runtime of GPRS and A* sampling or whether there exists an algorithm that achieves a lower runtime. 7 Acknowledgements We are grateful to Lennie Wells for the many helpful discussions throughout the project, particularly for suggesting the nice argument for Lemma D.2; Peter Wildemann for his advice on solving some of the differential equations involved in the derivation of GPRS and Stratis Markou for his help that aided our understanding of some high-level concepts so that we could derive Algorithm 6. We would also like to thank Lucas Theis for the many enjoyable early discussions about the project and for providing helpful feedback on the manuscript; and Cheuk Ting Li for bringing Liu & Verdu s work to our attention. Finally, we would like to thank Daniel Goc for proofreading the manuscript and Lorenzo Bonito for helping to design Figures 1 and 2 and providing feedback on the early version of the manuscript. GF acknowledges funding from Deep Mind. Eirikur Agustsson and Lucas Theis. Universally quantized neural compression. In Advances in Neural Information Processing Systems, volume 33, 2020. Harry Bateman and Arthur Erdélyi. Higher transcendental functions, volume 1. Mc Graw-Hill book company, 1953. Robert M. Corless, Gaston H. Gonnet, David E. G. Hare, David J. Jeffrey, and Donald E. Knuth. On the Lambert W function. Advances in Computational mathematics, 5:329 359, 1996. Gergely Flamich and Lucas Theis. Adaptive greedy rejection sampling. In IEEE International Symposium on Information Theory (ISIT), pp. 454 459, 2023. Gergely Flamich, Marton Havasi, and José Miguel Hernández Lobato. Compressing images by encoding their latent representations with relative entropy coding. Advances in Neural Information Processing Systems, 33, 2020. Gergely Flamich, Stratis Markou, and José Miguel Hernández Lobato. 
Fast relative entropy coding with A* coding. In International Conference on Machine Learning, pp. 6548 6577, 2022. Gergely Flamich, Stratis Markou, and José Miguel Hernández Lobato. Faster relative entropy coding with greedy rejection coding. In Advances in Neural Information Processing Systems, 2023. Zongyu Guo, Gergely Flamich, Jiajun He, Zhibo Chen, and José Miguel Hernández Lobato. Compression with Bayesian implicit neural representations. In Advances in Neural Information Processing Systems, 2023. Prahladh Harsha, Rahul Jain, David Mc Allester, and Jaikumar Radhakrishnan. The communication complexity of correlation. In Twenty-Second Annual IEEE Conference on Computational Complexity (CCC 07), pp. 10 23. IEEE, 2007. Marton Havasi, Robert Peharz, and José Miguel Hernández Lobato. Minimal random code learning: Getting bits back from compressed model parameters. In International Conference on Learning Representations, 2018. Jiajun He, Gergely Flamich, Zongyu Guo, and José Miguel Hernández Lobato. RECOMBINER: Robust and enhanced compression with Bayesian implicit neural representations. ar Xiv preprint ar Xiv:2309.17182, 2023. Mahmoud Hegazy and Cheuk Ting Li. Randomized quantization with exact error distribution. In 2022 IEEE Information Theory Workshop (ITW), pp. 350 355. IEEE, 2022. Diederik P. Kingma and Max Welling. Auto-encoding variational Bayes. In International Conference on Learning Representations, 2014. John F. C. Kingman. Poisson Processes. Oxford Studies in Probability. Clarendon Press, 1992. ISBN 9780191591242. Cheuk Ting Li and Abbas El Gamal. Strong functional representation lemma and applications to coding theorems. IEEE Transactions on Information Theory, 64(11):6967 6978, 2018. Jingbo Liu and Sergio Verdu. Rejection sampling and noncausal sampling under moment constraints. In 2018 IEEE International Symposium on Information Theory (ISIT), pp. 1565 1569. IEEE, 2018. Chris J. Maddison. A Poisson process model for Monte Carlo. Perturbation, Optimization, and Statistics, pp. 193 232, 2016. Chris J. Maddison, Daniel Tarlow, and Tom Minka. A* sampling. In Advances in Neural Information Processing Systems, volume 27, pp. 3086 3094, 2014. Pat Muldowney, Krzysztof Ostaszewski, and Wojciech Wojdowski. The Darth Vader rule. Tatra Mountains Mathematical Publications, 52(1):53 63, 2012. Jorma Rissanen and Glen G Langdon. Arithmetic coding. IBM Journal of research and development, 23(2):149 162, 1979. Abhin Shah, Wei-Ning Chen, Johannes Balle, Peter Kairouz, and Lucas Theis. Optimal compression of locally differentially private mechanisms. In International Conference on Artificial Intelligence and Statistics, pp. 7680 7723. PMLR, 2022. Lucas Theis and Noureldin Yosri. Algorithms for the communication of samples. In International Conference on Machine Learning, 2022. Lucas Theis, Tim Salimans, Matthew D. Hoffman, and Fabian Mentzer. Lossy compression with Gaussian diffusion. ar Xiv:2206.08889, 2022. Jacob Ziv. On universal quantization. IEEE Transactions on Information Theory, 31(3):344 347, 1985. A Measure-theoretic Construction of Greedy Poisson Rejection Sampling In this section, we provide a construction of greedy Poisson rejection sampling (GPRS) in its greatest generality. However, we first establish the notation for the rest of the appendix. Notation: In what follows, we will always denote GPRS s target distribution as Q and its proposal distribution as P. 
We assume that Q and P are Borel probability measures over some Polish space Ω with Radon-Nikodym derivative r def = d Q/d P. We denote the standard Lebesgue measure on Rn by λ. All logarithms are assumed to be to the base 2 denoted by log2. Similarly, we use the less common notation of exp2 to denote exponentiation with base 2, i.e. exp2(x) = 2x. The relative entropy of Q from P is defined as DKL[Q P] def = R Ωlog2 r(x) d P(x) and the Rényi -divergence as D [Q P] def = ess supx Ω{log r(x)}, where the essential supremum is taken with respect to P. Restricting a Poisson process under the graph of a function : We first consider the class of functions under which we will be restricting our Poisson processes. Let ϕ : Ω R+ be a measurable function, such that |ϕ(Ω)| = |r(Ω)|, i.e. the image of ϕ has the same cardinality as the image of r = d Q/d P. This implies that there exist bijections between r(Ω) and ϕ(Ω). Since both images are subsets of R+, both images have a linear order. Hence, a monotonically increasing bijection σ exists such that ϕ = σ r. Since σ is monotonically increasing, we can extend it to a monotonically increasing continuous function σ : R+ R+. This fact is significant, as it allows us to reduce questions about functions of interest from Ωto R+ to invertible functions on the positive reals. Since σ is continuous and monotonically increasing on the positive reals, it is invertible on the extended domain. We now consider restricting a Poisson process under the graph of ϕ, and work out the distribution of the spatial coordinate of the first arrival. Let Π = {(Tn, Xn)} n=1 be a Poisson process over R+ Ωwith mean measure µ = λ P. Let U def = {(t, x) Ω| t ϕ(x)} be the set of points under the graph of ϕ. By the restriction theorem (Kingman, 1992), Π def = Π U is a Poisson process with mean measure µ(A) = µ(A U) for an arbitrary Borel set A. As a slight abuse of notation, we define the shorthand µ(t) = µ([0, t) Ω) for the average number of points in Π up to time t. Let N Π denote the counting measure of Π and let ( T, X) be the first arrival of Π, i.e. the first point of the original process Π that falls under the graph of ϕ, assuming that it exists. To develop GPRS, we first work out the distribution of X for an arbitrary ϕ. Remark: Note, that the construction below only holds if the first arrival is guaranteed to exist, otherwise the quantities below do not make sense. From a formal point of view, we would have to always condition on the event N Π( ) def = limt N Π(t) > 0, e.g. we would need to write P[ T t | N Π( ) > 0]. However, we make a leap of faith instead and assume that our construction is valid to obtain our guesses for the functions σ and ϕ. Then, we show that for our specific guesses the derivation below is well-defined, i.e. namely that P[N Π( ) > 0] = 1, which then implies P[ T t | N Π( ) > 0] = P[ T t]. To do this, note that P[N Π( ) > 0] = 1 P[N Π( ) = 0] (13) = 1 lim t e µ(t). (14) Hence, for the well-definition of our construction it is enough to show that for our choice of σ, µ(t) as t . Constructing σ: We wish to derive d P[ X = x] d P[ T = t, X = x] d(λ P) dt. (15) To do this, recall that we defined N Π(t) def = N Π([0, t) Ω) and define N Π(s, t) def = N Π([s, t] Ω). Similarly, let µ(t) def = E[N Π(t)] and µ(s, t) def = E[N Π(s, t)] = µ(t) µ(s). 
Then, P[ X A, T dt] = lim s t P[ X A, T [s, t]] = lim s t P[ X A, N Π(s, t) = 1, N Π(s) = 0] = lim s t P[ X A | N Π(s, t) = 1]P[N Π(s, t) = 1]P[N Π(s) = 0] where the last equality holds, because N Π(s) N Π(s, t), since [0, s) Ωand [s, t] Ωare disjoint. By Lemma 3 from Maddison (2016), P[ X A | N Π(s, t) = 1] = P[( T, X) [s, t] A | N Π(s, t) = 1] (19) = µ([s, t] A) µ(s, t) . (20) Substituting this back into Equation (18) and using the fact that N Π is Poisson, we find P[ X A, T dt] = lim s t µ([s, t] A) µ(s, t) µ(s, t)e µ(s,t) e µ(s) . (t s) (21) µ([s, t] A) e ( µ(t) µ(s)) e µ(s) . (t s) (22) = e µ(t) lim s t µ([s, t] A) A 1[t ϕ(x)] d P(x), (24) where the last equation holds by the definition of the derivative and the fundamental theorem of calculus. From this, by inspection we find d P[ T = t, X = x] d(λ P) = 1[t ϕ(x)]e µ(t) = 1[t ϕ(x)]P[ T t]. (25) Finally, by marginalizing out the first arrival time, we find that the spatial distribution is d P[ X = x] d P[ T = t, X = x] d(λ P) dt (26) 0 P[ T t] dt. (27) Deriving ϕ by finding σ: Note that Equation (27) holds for any ϕ. However, to get a correct algorithm, we need to set it such that d P[ X=x] d P = d Q d P = r. Since we can write ϕ = σ r by our earlier argument, this problem is reduced to finding an appropriate invertible continuous function σ : R+ R+. Thus, we wish to find σ such that x Ω, r(x) = Z σ(r(x)) 0 P[ T t] dt. (28) Now, introduce τ = σ(r(x)), so that σ 1(τ) = r(x). Note that this substitution only makes sense for τ r(Ω). However, since we are free to extend σ 1 to R+ in any we like so long as it is monotone, we may require that this equation hold for all τ R+. Then, we find σ 1(τ) = Z τ 0 P[ T t] dt (29) σ 1 (τ) = P[ T τ] (30) with σ 1(0) = 0. Before we solve for σ, we define w P (h) def = PY P [r(Y ) h] (31) w Q(h) def = PY Q[r(Y ) h]. (32) Note that w P and w Q are supported on [0, r ), where r def = ess supx Ω{r(x)} = exp2(D [Q P]). Now, we can rewrite Equation (30) as σ 1 (t) = P[ T t] (33) = P[ T t, ϕ( X) t] + P[ T t, ϕ( X) < t] | {z } =0 due to mutual exclusivity = P[ T 0, ϕ( X) t] | {z } =P[ϕ( X) t], since T 0 always P[ϕ( X) t T] (35) = P[r( X) σ 1(t)] P[r( X) σ 1(t) σ 1( T)] (36) = w Q(σ 1(t)) σ 1(t)w P (σ 1(t)) (37) where the second term in the last equation follows by noting that P[r( X) σ 1(t) σ 1( T)] = Z 0 1[r(x) σ 1(t)]1[r(x) σ 1(τ)] | {z } =1[r(x) σ 1(t)], since τ 0 : 1[a x] = Z x 0 δ(y a) dy, (43) where δ is the Dirac delta function. Then, we have Ω 1[r(x) h] d Q(x) (44) 0 r(x)δ(y r(x)) dy d P(x) (45) Ω δ(y r(x)) d P(x) dy (46) = 1 [y(1 w P (y))]h 0 + Z h 0 (1 w P (y)) dy (47) = 1 + h w P (h) Z h 0 w P (y) dy. (48) Then, from this identity, we find that P[ T t] = w Q(σ 1(t)) σ 1(t)w P (σ 1(t)) (49) = 1 Z σ 1(t) 0 w P (y) dy. (50) Well-definition of GPRS and Algorithm 3 terminates with probability 1: To ensure that our construction is useful, we need to show that the first arrival under the graph T exists and is almost surely finite, so that Algorithm 3 terminates with probability 1. First, note that for any t > 0, we have µ(t) µ(t) = t < . Since P[N Π(t) = 0] = e µ(t), this implies that for any finite time t, P[N Π(t) = 0] > 0. Second, note that for h [0, r ], w Q(h) h w P (h) = Z Ω 1[r(x) h](r(x) h) d P(x), (51) w Q(r ) r w P (r ) = Z Ω 1[r(x) r ](r(x) r ) d P(x) (52) Ω 1[r(x) = r ](r(x) r ) d P(x) (53) Thus, in particular, we find that lim h r e µ(σ(h)) = lim h r P[N Π(σ(h)) = 0] eq. (37) = lim h r (w Q(h) h w P (h)) = 0. (55) By continuity, this can only hold if µ(σ(h)) as h r . 
Since µ(t) is finite for all t > 0 and σ is monotonically increasing, this also implies that σ(h) as h r . Thus, we have shown a couple of facts: σ(h) as h r , hence ϕ is always unbounded at points in Ωthat achieve the supremum r . Furthermore, this implies that σ 1(t) r as t . (56) Since σ 1 is increasing, this implies that it is bounded from above by r . µ(t) < for all t > 0, but µ(t) as t . Hence, by Equation (14) we have P[N Π( ) > 0] = 1. Thus, the first arrival of Π exists almost surely, and our construction is well-defined. In particular, it is meaningful to write P[ T t]. P[ T t] = P[N Π(t) = 0] 0 as t , which shows that the first arrival time is finite with probability 1. In turn, this implies that Algorithm 3 will terminate with probability one, as desired. Note, that w P and w Q can be computed in many practically relevant cases, see Appendix G. B Analysis of Greedy Poisson Rejection Sampling Now that we constructed a correct sampling algorithm in Appendix A, we turn our attention to deriving the expected first arrival time T in the restricted process Π, the expected number of samples N before Algorithm 3 terminates and bound on N s variance, and an upper bound on its coding cost. The proofs below showcase the true advantage of formulating the sampling algorithm using the language of Poisson processes: the proofs are all quite short and elegant. B.1 The Expected First Arrival Time At the end of the previous section, we showed that T is finite with probability one, which implies that E[ T] < . Now, we derive the value of this expectation exactly. Since T is a positive random variable, by the Darth Vader rule (Muldowney et al., 2012), we may write its expectation as E h T i = Z 0 P[ T t] dt eq. (29) = lim t σ 1(t) eq. (56) = r . (57) B.2 The Expectation and Variance of the Runtime Expectation of N: First, let ΠC def = Π \ Π be the set of points in Π above the graph of ϕ. Since Π and ΠC are defined on complementary sets, they are independent. Since by definition T is the first arrival of Π and the Nth arrival of Π, it must mean that the first N 1 arrivals of Π occurred in ΠC. Thus, conditioned on T, we have N 1 = |{(T, X) ΠC | T < T}| is Poisson distributed with mean E h N 1 | T i = Z T Ω 1[t ϕ(x)] d P(x) dt (58) Ω 1 1[t < ϕ(x)] d P(x) dt (59) = T µ T . (60) By the law of iterated expectations, E[N 1] = E T h E h N 1 | T ii (61) = E h T i E h µ T i . (62) Focusing on the second term, we find E h µ T i = Z 0 µ(t) P[ T dt] dt (63) 0 µ(t) µ (t)e µ(t) dt (64) 0 ue u du = 1, (65) where the third equality follows by substituting u = µ(t). Finally, plugging the above and Equation (57) into Equation (62), we find E[N] = 1 + E[N 1] = 1 + r 1 = r . (66) Variance of N: We now show that the distribution of N is super-Poissonian, i.e. E[N] V[N]. Similarly to the above, we begin with the law of iterated variances to find V[N] = V[N 1] = E T [V[N 1 | T]] + V T [E[N 1 | T]] (67) = E h T i E h µ T i + V h T i + V h µ T i , (68) where the second equality follows from the fact that the variance of N 1 matches its mean conditioned on T, since it is a Poisson random variable. Focussing on the last term, we find V h µ T i = E µ T 2 E h µ T i2 (69) 0 µ(t)2 P[ T dt] dt 1 (70) 0 u2e u dt 1 = 1, (71) where the third equality follows from a similar u-substitution as in Equation (65). Thus, putting everything together, we find V[N] = r 1 + V h T i + 1 = r + V h T i > E[N]. 
(72) B.3 The Fractional Moments of the Index In this section, we follow the work of Liu & Verdu (2018) and analyse the fractional moments of the index N and prove Theorem 3.2. As before, set Q and P as GPRS s target and proposal distribution, respectively, and let r = d Q d P . To begin, we define the Rényi α-divergence between two distributions for α (0, 1) (1, ) as Dα[Q P] = 1 α 1 log2 EZ Q r(Z)α 1 . (73) By taking limits, we can extend the definiton to α = 1, as D1[Q P] def = lim α 1 Dα[Q P] = DKL[Q P] (74) D [Q P] def = lim α Dα[Q P] = ess sup x Ω {log2 r(x)} . (75) Now fix α (0, 1) and x Ωand note that by Jensen s inequality, E[N α | X = x] E[N | X = x]α. (76) Now, to continue, we will show the following bound: E[N | X = x] 1 w P (r(x)) (77) Let Π be the (1, P)-Poisson process realized by Algorithm 3, and let M = min{m N | (Tm, Xm) Π, r(Xm) r(x)}, (78) i.e. the index of the first arrival in Π such that the density ratio of the spatial coordinate exceeds r(x). Since Xi P are i.i.d., M Geom (PZ P [r(Z) r(x)]) , (79) E[M] = 1 PZ P [r(Z) r(x)] = 1 w P (r(x)). (80) Now, since geometrically distributed random variables are memoryless, we find E[M] = E[M N | M > N] (81) E[M | M > N], (82) which then implies E[M | M N] E[M]. (83) Before we proceed, we first require the following result Lemma B.1. Let all quantities be as defined above and let x, y Ωsuch that r(x) r(y). Then, N | X = y stochastically dominates N | X = x, i.e. for all positive integers n we have P[N n | X = x] P[N n | X = y]. (84) Proof. The proof follows via direct calculation. Concretely, P[N n | X = x] = P[N 1 n 1 | X = x] (85) = E T P T | X=x h P[N 1 n 1 | X = x, T] i (86) = 1 (n 1)!E T P T | X=x h Γ(n, T µ( T)) i , (87) where Γ(s, x) is the upper incomplete Γ-function defined as Γ(s, x) def = Z x ts 1e t dt. (88) Equation (87) follows from the fact that conditioned on T we have that N 1 is a Poisson random variable with mean T µ( T). Now, focussing on the expectation term in Equation (87), we let ε(r(x)) def = E T P T | X=x h Γ(n, T µ( T)) i = 1 r(x) 0 Γ(n, t µ(t)) σ 1 (t) dt. (89) Substituting h = r(x), we now show that ϵ is a decreasing function, which proves the lemma. To this end, regardless of the image of the density ratio r(Ω), extend the definition of ϵ to the entirety of (0, ess sup r(x)). Now, observe, that h Γ(n, σ(h) µ(σ(h))) =1 by inverse function theorem z }| { σ 1 (σ(h)) σ (h) Z σ(h) 0 Γ(n, t µ(t)) σ 1 (t) dt Focussing on the integral, via integration by parts we find Z σ(h) 0 Γ(n, t µ(t)) σ 1 (t) dt = (91) h Γ(n, σ(h) µ(σ(h))) + Z σ(h) 0 (t µ(t))(n 1)e (t µ(t))(1 w P (σ 1(t)))σ 1(t) dt. Plugging this back into the above calculation, we get 0 (t µ(t))(n 1)e (t µ(t))(1 w P (σ 1(t)))σ 1(t) dt. (93) Note, that since the integrand is positive and h was arbitrary, dϵ dh is negative for every h in its domain, which proves the lemma. Returning to Equation (83), we now examine the two cases M < N and M = N separately. Case M < N: In this case, we have the following inequalities: r(x) r(XM) by the definition of M. Applying σ to both sides we find ϕ(x) ϕ(XM). ϕ(XM) < TM, since M < N and by definition N is the index of the first arrival of Π under the graph of ϕ. Now, note that since M < N conditioned on TM, the first M 1 points of Π are all rejected and hence are all in ΠC = Π \ Π. Thus, M 1 | (M < N, TM) is Poisson distributed with mean TM µ(TM). Hence, E[M | M < N] = 1 + E[M 1 | M < N] (94) = 1 + ETM PTM |M ϕ(X) o (152) be the number of points in Πj that are rejected before the global first arrival occurs. 
By an independence argument analogous to the one given in Appendix B.2, the ν1, . . . , νJ are independent Poisson random variables, each distributed with mean E h νj | T i = 1 T µ T . (153) Hence, by the law of iterated expectations, we find that we get E [νj] eq. (62) = r 1 rejections in each of the J threads on average before the global first arrival occurs. Now, a thread j terminates either when the global first arrival occurs in it or when the thread s current arrival time T (j) nj provably exceeds the global first arrival time. Thus, on average, each thread will have one more rejection compared to Equation (154). Hence the average runtime of a thread is E[νj + 1] = r 1 J + 1, (155) and the average number of samples simulated by PGPRS across all its threads is J + 1 = r + J 1. (156) Variance of Runtime: Once again, similarly to Appendix B.2, we find by the law of iterated variances that the variance of the runtime in each of the j threads is V[νj + 1] = V[νj] = r 1 V h T i + 1 , (157) meaning that we make an O(1/J) reduction in the variance of the runtime compared to regular GPRS. Codelength: PGPRS can also realize a one-shot channel simulation protocol for a pair of correlated random variables x, y Px,y. For a fixed y Py, Alice applies PGPRS to the target Q = Px|y and proposal P = Px, and encodes the two-part code (J , NJ ). Bob can then simulate NJ samples from ΠJ and recover Alice s sample. Encoding J : By the symmetry of the setup, the global first arrival will occur with equal probability in each subprocess Πj. Hence J follows a uniform distribution on {1, . . . J}. Therefore, Alice can encode J optimally using log2 J bits. Encoding NJ : We can develop a bound using an almost identical argument to the one in Appendix B.4. In particular, by adapting the conditional bound in Equation (138) appropriately using Equation (155), we get E h log2 NJ | T = t, X = x, J i log2(r(x) + 1) + µ(t) log2 e log2 J. (158) Then, using this conditional bound and adapting Equation (143), we find E [log2 NJ | y, J ] DKL[Px|y Px] + 2 log2 e log2 J (159) to obtain a one-shot bound. Taking expectation over y Py, we get E [log2 NJ | J ] I[x; y] + 2 log2 e log2 J, (160) hence, by adapting the maximum entropy bound in Equation (150), we find H[NJ | J ] < I[x; y] log2 J + log2(I[x; y] log2 J + 1) + 6. (161) Thus, we finally find that the entropy of the two-part code (J , NJ ) is upper bounded by H[J , NJ ] < I[x; y] + log2(I[x; y] log2 J + 1) + 7. (162) Using a Zeta distribution ζ(n | s) n s to encode NJ | J with s def = 1 I[x; y] + 2 log2 e log2 J , (163) we find that the expected codelength of the two-part code is upper bounded by H[J , NJ ] < I[x; y] + log2(I[x; y] log2 J + 1) + 8 bits. (164) D Simulating Poisson Processes Using Tree-Structured Partitions Algorithm 7: Simulating a tree construction with another Input :Proposal distribution P Target splitting function splittarget, Simulating splitting function splitsim, Target heap index Htarget Output :Arrival (T, X) with heap index Htarget in Ttarget, and heap index Hsim of the arrival in Tsim. P Priority Queue B Ω K log2 Htarget X P T Exp(1) P.push(T, X, Ω, 1) for k = 0, . . . , K do T, X, C, H P.pop() L, R splitsim(C) if B L = then L Exp(P(L)) TL T + L XL P|L P.push(TL, XL, L, 2H) end if B R = then R Exp(P(R)) TR T + R XR P|R P.push(TR, XR, R, 2H + 1) end until X B /* Exit loop when we find first arrival in B. 
*/ if k < K then {B0, B1} splittarget(B) d l Htarget 2K k m mod 2 B Bd end end return T, X, H In this section, we examine an advanced simulation technique for Poisson processes, which is required to formulate the binary search-based variants of GPRS. We first recapitulate the tree-based simulation technique from (Flamich et al., 2022) and some important results from their work. Then, we present Algorithm 7, using which we can finally formulate the optimally efficient variant of GPRS in Appendix E. Note: For simplicity, we present the ideas for Poisson processes whose spatial measure P is non-atomic. These ideas can also be extended to atomic spatial measures with appropriate modifications. Splitting functions: Let Π be a spatiotemporal Poisson process on some space Ωwith non-atomic spatial measure P. In this paragraph, we will not deal with Π itself yet, but the space on which it is defined and the measure P. Now, assume that there is a function split, which for any given Borel set B Ω, produces a (possibly random) P-essential partition of B consisting of two Borel sets, i.e. split(B) = {L, R}, such that L R = and P(L R) = P(B). (165) The last condition simply allows split to discard some points from B that will not be included in either the left or right split. For example, if we wish to design a splitting function for a subset of B Rd to split B along some hyperplane, we do not need to include the points on the hyperplane in either set. The splitting function that always exists is the trivial splitting function, which just returns the original set and the empty set: splittrivial(B) = {B, }. (166) A more interesting example when Ω= R is the on-sample splitting function, where for a R-valued random variable X P|B it splits B as spliton-samp(B) = {( , X) B, (X, ) B}. (167) This function is used by Algorithm 5 and AS* coding (Flamich et al., 2022). Another example is the dyadic splitting function operating on a bounded interval (a, b), splitting as splitdyad((a, b)) = a, a + b 2 , b . (168) We call this dyadic because when we apply this splitting function to (0, 1) and subsets produced by applying it, we get the set of all intervals with dyadic endpoints on (0, 1), i.e. {(0, 1), (0, 1/2), (1/2, 1), (0, 1/4), . . .}. This splitting function is used by Algorithm 6 and AD* coding. As one might imagine, there might be many more possible splitting functions than the two examples we give above, all of which might be more or less useful in practice. Split-induced binary space partitioning tree: Every splitting function on a space Ωinduces an infinite set of subsets by repeatedly applying the splitting function to the splits it produces. These sets can be organised into an infinite binary space partitioning tree (BSP-tree), where each node in the tree is represented by a set produced by split and an unique index. Concretely, let the root of the tree be represented by the whole space Ωand the index Hroot = 1. Now we recursively construct the rest of the tree as follows: Let (B, H) be a node in the tree, with B a Borel set and H its index, and let {L, R} = split(B) be the left and right splits of B. Then, we set (L, 2H) as the node s left child and (R, 2H + 1) as its right child. We refer to each node s associated index as their heap index. Heap-indexing the points in Π and a strict heap invariant: As we saw in Section 2.1, each point in Π can be uniquely identified by their time index N, i.e. if we time-order the points of Π, N represents the Nth arrival in the ordered list. 
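For concreteness, the on-sample and dyadic splitting functions above admit very short implementations. The sketch below represents one-dimensional bounds as `(a, b)` tuples and assumes a scipy.stats-style frozen proposal `p` with `.cdf` and `.ppf` methods; these representational choices are ours, not the paper's.

```python
import numpy as np

def split_on_sample(bounds, p, rng):
    """On-sample split (Equation (167)): draw X ~ P restricted to `bounds`
    via the inverse-CDF trick and split the interval at X."""
    a, b = bounds
    x = p.ppf(rng.uniform(p.cdf(a), p.cdf(b)))
    return (a, x), (x, b)

def split_dyad(bounds):
    """Dyadic split (Equation (168)): split a bounded interval at its midpoint."""
    a, b = bounds
    return (a, (a + b) / 2), ((a + b) / 2, b)
```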
However, we can also uniquely index each point in Π using a splitting function-induced BSP-tree as follows. We extend each node in the BSP-tree with a point from Π, such that the extended tree satisfies a strict heap invariant: First, we extend the root node (Ω, 1) by adjoining Π s first arrival (T1, X1) and the first arrival index 1 to get (T1, X1, Ω, 1, 1). Then, for every other extended node (T, X, B, H, N) with parent node (T , X , B , H/2 , M) we require that (T, X) is the first arrival in the restricted process Π ((T , ) B). This restriction enforces that we always have that T < T and, therefore, M < N. Furthermore, it is strict because there are no other points of Π in B between those two arrival times. Notation for the tree structure: Let us denote the extended BSP tree T on Π induced by split. Each node ν T is a 5-tuple ν = (T, X, B, H, N) consisting of the arrival time T, spatial coordinate X, its bounds B, split-induced heap index H and time index N. As a slight overload of notation, for a node ν, we use it as subscript to refer to its elements, e.g. Tν is the arrival time associated with node ν. We now prove the following important result, which ensures that we do not lose any information by using T to represent Π. An essentially equivalent statement was first proven for Gumbel processes (Appendix, Equivalence under partition subsection; Maddison et al., 2014). Lemma D.1. Let Π, P, split and T be as defined above. Then, P-almost surely, T contains every point of Π. Proof. We give an inductive argument. Base case: the first arrival of Π is by definition contained in the root node of T . Inductive hypothesis: assume that the first N arrivals of Π are all contained in T . Case for N + 1: We begin by observing that if the first N nodes are contained in T , they must form a connected subtree TN. To see this, assume the contrary, i.e. that the first N arrivals form a subgraph ΓN T with multiple disconnected components. Let ν ΓN be a node contained in a component of ΓN that is disconnected from the component of ΓN containing the root node. Since T is a tree, there is a unique path π from the root node to ν in T , and since ν and the root are disconnected in ΓN, π must contain a node c ΓN. However, since c is an ancestor of ν, by the heap invariant of T we must have that the time index of c is Nc < Nν N hence c ΓN, a contradiction. Thus, let TN represent the subtree of T containing the first N arrivals of Π. Now, let FN represent the frontier of TN, i.e. the leaf nodes children: FN def = {ν T | ν TN, parent(ν) TN} , (169) where parent retrieves the parent of a node in T . Let ν FN Bν (170) be the union of all the bounds of the nodes in the frontier. A simple inductive argument shows that for all N the nodes in FN provide a P-essential partition of Ω, from which P( ΩN) = 1. Let TN be the Nth arrival time of Π. Now, by definition, the arrival time associated with every node in the frontier FN must be later than TN. Finally, consider the first arrival time across the nodes in the frontier: T N def = min ν FN Tν. (171) Then, conditioned on TN, T N is the first arrival of Π restricted to (TN, ) Ω, thus it is P-almost surely the N + 1st arrival in Π, as desired. A connection between the time index an heap index of a node: Now that we have two ways of uniquely identifying each point in Π it is natural to ask whether there is any relation between them. In the next paragraph we adapt an argument from Flamich et al. (2022) to show that under certain circumstances, the answer is yes. 
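Generatively, the strict heap invariant corresponds to a simple recipe for extending a node, exactly as in the inner loop of Algorithm 7: the child's point is the first arrival of $\Pi$ restricted to $(T', \infty) \times B$, so its arrival time is the parent's time plus an $\mathrm{Exp}(P(B))$ gap and its location is a draw from $P$ restricted to $B$. A minimal sketch for one-dimensional interval bounds, again assuming a scipy.stats-style frozen proposal `p`:

```python
import numpy as np

def child_arrival(t_parent, bounds, p, rng):
    """First arrival of Pi restricted to (t_parent, inf) x bounds, used to
    extend a node of the BSP-tree (cf. Algorithm 7's inner loop): the waiting
    time is exponential with rate P(bounds) and the location is P restricted
    to bounds, drawn here with the inverse-CDF trick."""
    a, b = bounds
    mass = p.cdf(b) - p.cdf(a)                      # P(B)
    t = t_parent + rng.exponential(1.0 / mass)      # Exp(P(B)) gap after the parent
    x = p.ppf(rng.uniform(p.cdf(a), p.cdf(b)))      # X ~ P restricted to B
    return t, x
```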
First, we need to define two concepts, the depth of a node in an extended BSP-tree and a contractive splitting function. Thus, let T be an extended BSP-tree over Π induced by split. Let ν T . Then, the depth D of ν in T is defined as the distance from ν to the root. A simple argument shows that the heap index Hν of every node ν at depth D in T is between 2D Hν < 2D+1. Thus, we can easily obtain the depth of a node ν from the heap index via the formula Dν = log2 Hν . Next, we say that split is contractive if all the bounds it produces shrink on average. More formally, let ϵ < 1, and for a node ν T , let Aν denote the set of ancestor nodes of ν. Then, split is contractive if for every non-root node ν T we have ETAν |Dν[P(Bν)] ϵDν, (172) where TAν and denotes the subtree of T containing the arrivals of the ancestors of ν. Note that ϵ 1/2 for any split function. This is because if {L, R} = split(B) for some set B, then P(R) = P(B) P(L). Thus, if P(R) = αP(B), then P(L) = (1 α)P(B), and by definition ϵ = max{α, 1 α}, which is minimized when α = 1/2, from which ϵ = 1/2. For example, splitdyad is contractive with ϵ = 1/2, while splittrivial is not contractive. By Lemma 1 from Flamich et al. (2022), spliton-samp is also contractive with ϵ = 3/4. We now strengthen this result using a simple argument, and show that spliton-samp is contractive with ϵ = 1/2. Lemma D.2. Let ν, D be defined as above, let P be a non-atomic probability measure over R with CDF FP . Let Π a (1, P)-Poisson process and T be the BSP-tree over Π induced by spliton-samp. Then, ETAν |Dν[P(Bν)] = 2 Dν. (173) Proof. Fix Dν = d, and let Nd = {n T | Dn = d} be the set of nodes in T whose depth is d. Note, that |Nd| = 2d. We will show that n, m Nd : P(Bn) d= P(Bm), (174) i.e. that the distributions of the bound sizes are all the same. From this, we will immediately have E[P(Bn)] = E[P(Bν)] for every n Nd. Then, using the fact, that the nodes in Nd for a P-almost partition of Ω, we get: n Nd E [P(Bn)] = |Nd| E [P(Bν)] . (175) Dividing the very left and very right by |Nd| = 2d yields the desired result. To complete the proof, we now show that by symmetry, Equation (174) holds. We begin by exposing the fundamental symmetry of spliton-samp: for a node ν with left child L and right child R, the left and right bound sizes are equal in distribution: P(BL) d= P(Br) | P(Bν). (176) First, note that by definition, all involved bounds will be intervals. Namely, assume that Bν = (a, b) for some a < b and Xν is the sample associated with ν. Then, BL = (a, Xν) and BR = (Xν, b) and hence P(BL) = FP (Xν) FP (a). Since Xν P|Bν, by the probability integral transform, we have F(Xν) Unif(FP (a), FP (b)), from which P(BL) Unif(0, FP (b) FP (a)) = Unif(0, P(Bν)). Since P(BR) = P(Bν) P(BL), we similarly have P(BR) Unif(0, P(Bν)), which establishes our claim. Now, to show Equation (174), fix d and fix n Nd. Let An denote the ancestor nodes of n. As we saw in the paragraph above, P(Bn) | P(Bparent(n)) d= P(Bparent(n)) U, U Unif(0, 1), (177) regardless of whether n is a left or a right child of its parent. We can apply this d times to the ancestors of n find the marginal: i=1 Ui, Ui Unif(0, 1). (178) Since the choice of n was arbitrary, all nodes in Nd have this distribution, which is what we wanted to show. Now, we have the following result. Lemma D.3. Let split be a contractive splitting function for some ϵ [1/2, 1). Then, for every node ν in T with time index Nν and depth Dν, we have EDν|Nν[Dν] logϵ Nν. (179) Proof. Let us examine the case Nν = 1 first. 
In this case, ν is the root of T and has depth Dν = 0 by definition. Thus, 0 logϵ 1 holds trivially. Now, fix ν T with time index Nν = N > 1. Let TN 1 be the subtree of T containing the first N 1 arrivals, FN 1 be the frontier of TN 1 and TN 1 the (N 1)st arrival time. Then, as we saw in Lemma D.1, the Nth arrival in Π occurs in one of the nodes in the frontier FN 1, after TN 1. In particular, conditioned on TN 1, the arrival times associated with each node f FN 1 will be shifted exponentials Tf = TN 1 + Exp(P(Bf)), and the Nth arrival time in Π is the minimum of these: Tν = TN = minf FN 1 Tf. It is a standard fact (see e.g. Lemma 6 in Maddison (2016)) that the index of the minimum FN = arg min f FN 1 Tf (180) is independent of Tν = TFN and P[FN = f | TN 1] = P(Bf). A simple inductive argument shows that the number of nodes on the frontier |FN 1| = N. Thus, we have a simple upper bound on the entropy of F: EFN|TN 1,Nν=N[ log2 P(Bν)] = H[FN | TN 1, Nν = N] log2 N. (181) Thus, taking expectation over TN 1, we find log2 N eq. (181) EFN,TN 1|Nν=N[ log2 P(Bν)] (182) = EDν|Nν=N EFN,TN 1|Dν,Nν=N[ log2 P(Bν)] (183) EDν|Nν=N log2 EFN,TN 1|Dν,Nν=N[P(Bν)] (184) eq. (172) EDν|Nν=N log2 ϵDν (185) = ( log2 ϵ) EDν|Nν=N[Dν]. (186) The second inequality holds by Jensen s inequality. In the third inequality, we apply Equation (172), and one might worry about conditioning on Nν here. However, this is not an issue because EFN,TN 1|Dν=d,Nν=N[P(Bν)] (187) = 1[d N 1] X f FN 1 ETAf |FN=f,Dν=d[P(Bf)] P[FN = f | Df = d] eq. (172) 1[d N 1] X f FN 1 ϵd P[FN = f | Df = d] (189) = 1[d N 1] ϵd. (190) Thus, we finally get ( log2 ϵ) EDν|Nν=N[Dν] log2 N EDν|Nν [Dν] logϵ Nν (191) by dividing both sides by log2 ϵ and we obtain the desired result. Converting between different heap indices: Assume now that we have two splitting functions, splittarget and splitsim, which induce their own BSP-ordering on Π, Ttarget and Tsim. Now, given a splittarget-induced heap index H, Algorithm 7 presents a method for simulating the appropriate node ν Ttarget by simulating nodes from Tsim. In other words, given a node with some heap index induced by a splitting function, Algorithm 7 lets us find the heap index of the same arrival induced by a different splitting function. The significance of Algorithm 7 is that it lets us develop convenient search methods using a given splitting function, but it might be more efficient to encode the heap index induced by another splitting function. Theorem D.4. Let Π, splittarget, splitsim, Ttarget and Tsim be as above. Let ν Ttarget with and let (Tsim, Xsim, Hsim) be the arrival and its heap index output by Algorithm 7 given the above as input as well as Hν as the target index. Then, Algorithm 7 is correct, in the sense that Tν d= Tsim and Xν d= Xsim, (192) and Hsim is the heap index of (Tsim, Xsim) in Tsim. Proof. First, observe that when splittarget = splitsim, Algorithm 7 collapses onto just the extended BSP tree construction for Π and simply returns the arrival with the given heap index Htarget in Ttarget. In particular, the inner loop will always exit after one iteration, and every time one and only one of the conditional blocks will be executed. In other words, in this case, the algorithm becomes equivalent to Algorithm 2 in Flamich et al. (2022). Let us now consider the case when splittarget = splitsim. Denote the depth of the required node by Dν = log2 Hν . Now, we give an inductive argument for correctness. Base case: Consider Dν = 0. 
In this case, the target bounds are $B = \Omega$, and the first sample we draw, $X \sim P$, is guaranteed to fall in $\Omega$. Hence, for $D_\nu = 0$ the outer loop only runs for one iteration. Furthermore, during that iteration, the inner loop also exits after one iteration, and Algorithm 7 returns the sample $(T, X)$ drawn before the outer loop, with heap index 1. Since $T \sim \mathrm{Exp}(1)$ and $X \sim P$, this is a correctly distributed output with the appropriate heap index. Inductive hypothesis: Assume Algorithm 7 is correct for heap indices with depths up to $D_\nu = d$. Case $D_\nu = d + 1$: Let $\rho \in \mathcal{T}_{\mathrm{target}}$ be the parent node of $\nu$ with arrival $(T_\rho, X_\rho)$. Then $D_\rho = d$, hence by the inductive hypothesis Algorithm 7 correctly simulates a branch of $\mathcal{T}_{\mathrm{target}}$ up to node $\rho$. At the end of the $d$th iteration of the outer loop, Algorithm 7 sets the target bounds $B \leftarrow B_\nu$. Then, in the final, $(d+1)$st iteration, the inner loop simply realizes $\mathcal{T}_{\mathrm{sim}}$ and accepts the first sample that falls inside $B_\nu$ and whose arrival time satisfies $T > T_\rho$. Due to the priority queue, the loop simulates the nodes of $\mathcal{T}_{\mathrm{sim}}$ in time order; hence the accepted sample is also the one with the earliest arrival time. Furthermore, Algorithm 7 only ever considers nodes of $\mathcal{T}_{\mathrm{sim}}$ whose bounds intersect the target bounds $B_\nu$, hence the inner loop is guaranteed to terminate, which finishes the proof.

E GPRS with Binary Search

We now utilise the machinery we developed in Appendix D to analyze Algorithm 5.

Correctness of Algorithm 5: Observe that Algorithm 5 constructs the extended BSP tree induced by the on-sample splitting function, which we denote by $\mathcal{T}$; it then performs a binary tree search over this tree. The first step of the algorithm matches GPRS's first step (Algorithm 3), hence it is correct for the first step. Now consider the algorithm at an arbitrary step $k$ before termination, where the candidate sample $(T, X)$ is rejected, i.e. $T > \phi(X)$. By assumption, the density ratio $r$ is unimodal, and since $\sigma$ is monotonically increasing, $\phi = \sigma \circ r$ is unimodal too. Thus, let $x^* \in \Omega$ be such that $r(x^*) = r^*_\infty$, where $r^*_\infty = \mathrm{exp}_2(D_\infty[Q \,\|\, P])$. Assume for now that $X < x^*$; the case $X > x^*$ follows by a symmetric argument. By the unimodality assumption, since $X < x^*$, it must hold that $\phi(y) \leq \phi(X)$ for all $y < X$. Consider now the arrival $(T_L, X_L)$ of $\Pi$ in the current node's left child. Then we have $T < T_L$ and $X_L < X$ by construction. Thus, we finally get
$\phi(X_L) \leq \phi(X) < T < T_L,$   (193)
meaning that the current node's left child is also guaranteed to be rejected. This argument easily extends to show that any left-descendant of the current node will be rejected, so it is sufficient to search its right-descendants only. By a similar argument, when $X > x^*$, it is sufficient to check only the current node's left-descendants. Finally, since Algorithms 3 and 5 both simulate $\Pi$ and search for its first arrival under $\phi$, by the construction in Appendix A the sample returned by both algorithms follows the desired target distribution.

Analyzing the expected runtime and codelength: Since Algorithm 5 simulates a single branch of $\mathcal{T}_{\text{on-sample}}$, its runtime equals the depth $D$ of that branch. Furthermore, let $N$ denote the time index of the accepted arrival and $H$ its heap index in $\mathcal{T}_{\text{on-sample}}$, so that $D = \lfloor \log_2 H \rfloor$. It is tempting to use Lemmas D.2 and D.3 directly in our analysis of $N$, $H$ and $D$; the sketch below first makes the search procedure just described concrete.
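A minimal sketch of branch-and-bound GPRS for a one-dimensional scipy.stats-style proposal `p` (with `.cdf` and `.ppf`), a unimodal density ratio `r` with known mode `x_star`, and the stretch function `sigma`; the function and argument names are ours, not the paper's.

```python
import numpy as np

def branch_and_bound_gprs(p, r, sigma, x_star, rng, max_depth=10_000):
    """Sketch of Algorithm 5: walk a single branch of the on-sample BSP-tree,
    at each step simulating the next arrival of Pi inside the current bounds
    and accepting it if it lies under phi = sigma(r(.)). Returns the accepted
    sample, its heap index H and its depth D."""
    lo, hi = -np.inf, np.inf       # current bounds B
    t, heap_index = 0.0, 1
    for depth in range(max_depth):
        mass = p.cdf(hi) - p.cdf(lo)                 # P(B)
        t += rng.exponential(1.0 / mass)             # next arrival in B after time t
        x = p.ppf(rng.uniform(p.cdf(lo), p.cdf(hi))) # X ~ P restricted to B
        if t <= sigma(r(x)):                         # first arrival under phi: accept
            return x, heap_index, depth
        if x < x_star:     # everything left of X lies below phi(X) < T: go right
            lo, heap_index = x, 2 * heap_index + 1
        else:              # symmetric case: discard the right part, go left
            hi, heap_index = x, 2 * heap_index
    raise RuntimeError("max_depth exceeded")
```

For the triangular-uniform pair of Appendix G.2, $\sigma$ has a closed form, so the sketch can be run end to end; when only $\sigma^{-1}$ is available (e.g. from numerically solving Equation (4)), the acceptance test $t \leq \sigma(r(x))$ can equivalently be written as $\sigma^{-1}(t) \leq r(x)$, since $\sigma$ is increasing.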
However, they do not apply, because they assume no restriction on D and N, while in this section, we condition on the fact that N, H and D identify the arrival in Π that was accepted by Algorithm 5. Hence, we first prove a lemma analogous to lemma D.2 in this restricted case, based on Lemma 4 of Flamich et al. (2022). Note the additional assumptions we need for our result. Lemma E.1. Let Q and P be distributions over R with unimodal density ratio r = d Q/d P, given to Algorithm 5 as the target and proposal distribution as input, respectively. Assume P has CDF FP . Let H denote the accepted sample s heap index, let D = log2 H , and let BH denote the bounds associated with the accepted node. Then, E[P(BH) | H] 3 Proof. We prove the claim by induction. For H = 1, D = 0 the claim holds trivially, since B1 = R, hence P(B1) = 1. Assume now that the claim holds for D = log2 H = d 1 for d 1, and we prove the statement when D = log2 H = d. Let X0, . . . , Xd 1, Xd denote the samples simulated along the branch that leads to the accepted node, i.e. X0 is the sample associated with the root node and Xd is the accepted sample. By the definition of the on-sample splitting function and law of iterated expectations, we have EX0:d 1[P(BH) | H] = EX0[P(BH) | H] when d = 1 EX0:d 2[EXd 1|X0:d 2[P(BH) | H] | H] when d 2. (195) Below, we only examine the inner expectation in the d 2 case; computing the expectation in the d = 1 case follows mutatis mutandis. First, denote the bounds of the accepted node s parent as B H/2 = (α, β) for some real numbers α < β and define a = FP (α), b = FP (β) and U = FP (Xd 1). Since Xd 1 | X1:d 2 P|B H/2 , by the generalized probability integral transform we have U Unif(a, b), where Unif(a, b) denotes the uniform distribution on the interval (a, b). The two possible intervals from which Algorithm 5 will choose are (α, Xd 1) and (Xd 1, β), whose measures are P((α, Xd 1)) = FP (Xd 1) FP (α) = U a and similarly P((Xd 1, β)) = b U. Then, in the worst case, the algorithm always picks the bigger of the two intervals, thus P(BH) max{U A, B U}, from which we obtain the bound EXd 1|X1:d 2[P(BH) | H] EU[max{U a, b U}] = 3 4P(B H/2 ). (196) Plugging this into Equation (195), we get EX1:d 1[P(BH) | H] 3 4EX1:d 2 P(B H/2 ) | H (197) d 1 , (198) where the second inequality follows from the inductive hypothesis, which finishes the proof. Fortunately, lemma D.3 does not depend on this condition, so we can modify it mutatis mutandis so that it still applies in this latter case as well, and we restate it here for completeness: Lemma E.2. Let N, D, H, BH be defined as above. Then E[ log P(BH)] log2 N (199) ED|N[D] log3/4 N. (200) Proof. Equation (199) follows via the same argument as Equation (181), and Equation (200) follows via the same argument as lemma D.3, combined with lemma E.1. Expected runtime: Combining Equation (200) with Equation (143) yields E[D] DKL[Q P] + 2 log2 e log2(4/3) , (201) which proves Theorem 3.6. In the channel simulation case, taking a pair of correlated random variables x, y Px,y and setting Q Px|y and P Px, and taking expectation over y, we get E[D] I[x; y] + 2 log2 e log2(4/3) . (202) Entropy coding the heap index: Let ν Ton-sample now denote the node accepted and returned by Algorithm 5. We will encode ν using its heap index H. As we saw in Appendix D, H can be decomposed into two parts: the depth D and the residual R: H = 2D + R, 0 R < 2D. (203) Then, we have H[H] H[D, R] (204) = H[R | D] + H[D]. 
(205) Dealing with the second term first, we note that the maximum entropy distribution for a random variable Z over the positive integers with the moment constraint E[Z] = 1/p is the geometric distribution with rate p. Thus, based on Equation (202), we set p = log2(4/3) I[x; y] + 2 log2 e, (206) from which we get H[D] H [Geom (p)] (207) = log2 p 1 p p log2(1 p) (208) log2(I[x; y] + 2 log2 e) log2 log2(4/3) + log2 e (209) log2(I[x; y] + 1) + log2(2 log2 e) log2 log2(4/3) + log2 e, (210) where Equation (208) follows from the fact that the term 1 p p log2(1 p) is bounded from above by log2 e, which it attains as I[x; y] . For the H[R | D] term, we begin by noting that there is a particularly natural encoding for R given a particular run of the algorithm. Let B(D) 1 , . . . , B(D) 2D be the bounds associated with the depth-D nodes of Ton-sample | D, the split-on-sample BSP tree given D. Then, since B(D) 1 , . . . , B(D) 2D form a P-essential partition of Ω, we can encode R by encoding the bounds associated with the accepted sample, B(D) R . Here, we can borrow a technique from arithmetic coding (Rissanen & Langdon, 1979): we encode the largest dyadic interval that is a subset of B(D) R , which can be achieved using log2 P(B(D) R )+1 bits. Note, that Algorithm 5 only ever realises a single branch of Ton-sample, hence, in this section only, let X1: denote the infinite sequence of samples along this branch, starting with X1 P at the root. Furthermore, from here onwards, let Bd and Td denote the bounds and arrival time associated with Xd in the sequence, respectively. Then, by the above reasoning, we have H[R | D] EX1: ,T1: [ED[ log2 P(BD)]] + 1 (211) k=1 EX1:k 1,T1:k [P[D = k | X1:k 1, T1:k] log2 P(Bk)] + 1. (212) We now develop a bound on P(Bk). To this end, note, that by the definition of GPRS, P [D k + 1 | X1:k 1, T1:k, D k] = P [N Π([Tk 1, Tk] Bk) = 0 | X1:k 1, T1:k 1, D k] . P [D = k | X1:k 1, T1:k, D k] = 1 P [D k + 1 | X1:k 1, T1:k, D k] (213) = P [N Π([Tk 1, Tk] Bk) 1 | X1:k 1, T1:k 1, D k] (214) E [N Π([Tk 1, Tk] Bk) | X1:k 1, T1:k 1, D k] (215) E [NΠ([Tk 1, Tk] Bk) | X1:k 1, T1:k 1, D k] (216) = P(Bk)(Tk Tk 1), (217) where eq. (215) follows from Markov s inequality, and eq. (216) follows from the fact that the mean measure of Π is just a restricted version of Π s mean measure. For convenience, we now write k = Tk Tk 1. Then, rearranging the above inequality, we get log2 P(Bk) log2 P [D = k | X1:k 1, T1:k 1, k, D k] + log2 k. (218) Substituting this into eq. (212), we get H[R | D] (219) k=1 EX1:k 1,T1:k 1, k [P[D = k | X1:k 1, T1:k 1, k] log2 P[D = k | X1:k 1, T1:k 1, k]] k=1 EX1:k 1,T1:k 1, k [P[D = k | X1:k 1, T1:k 1, k] log2 k] + 1. (220) We deal with the above two sums separately. In the first sum, we apply Jensen s inequality on the concave function x 7 x log2 x in each summand, from which we get k=1 EX1:k 1,T1:k 1, k [P[D = k | X1:k 1, T1:k 1, k] log2 P[D = k | X1:k 1, T1:k 1, k]] k=1 P[D = k] log2 P[D = k] (221) = H[D]. (222) For the second sum, consider the kth summand first: P[D = k | X1:k 1,T1:k 1, k, D k] log2 k (223) d P[D = k, Xk = x | X1:k 1, T1:k 1, k, D k] d P log2 d P(x). Bk 1[r(x) σ 1(Tk)] log2 d P(x). To continue, we now examine the integrand: r(x) σ 1(Tk) eq. (132) P[ T Tk] Tk P[ T Tk] k (224) from which log2 k log2 r(x) log2 P[ T Tk]. (225) Substituting this back, we get P[D = k | X1:k 1, T1:k 1, k, D k] log2 k (226) Bk 1[r(x) σ 1(Tk)] log2 r(x) d P(x) P[D = k | X1:k 1, T1:k 1, k, D k] log2 P[ T Tk]. 
(227) Therefore, for the second summand, we get X k=1 EX1:k 1,T1:k 1, k [P[D = k | X1:k 1, T1:k 1, k] log2 k] (228) d P[D = k, Xk = X] d P log2 r(X) k=1 P[D = k] log2 P[ T Tk]. (229) For the first term, by Fubini s theorem and by the correctness of GPRS, we get d P[D = k, Xk = X] d P log2 r(X) = EX P d P[D = k, Xk = X] d P log2 r(X) (230) = EX P [r(X) log2 r(X)] (231) = DKL[Q P]. (232) For the second term, note that P[D = k] = P[ T = Tk]. (233) Hence, k=1 P[D = k] log2 P[ T Tk] = k=1 P[D = k] log2 P[D k] (234) k=1 (P[D k] P[D k + 1]) log2 1 P[D k] (235) 1 log2 p dp (236) = log2 e. (237) Thus, we finally get that H[R | D] DKL[Q P] + H[D] + log2 e + 1. (238) In the average case, this yields H[R | D] I[x; y] + H[D] + log2 e + 1. (239) Thus, we finally get that the joint entropy is bounded above by H[R, D] I[x; y] + 2 log2(I[x; y] + 1) (240) + 2 (log2(2 log2 e) log2 log2(4/3) + log2 e) + log2 e + 1 (241) < I[x; y] + 2 log2(I[x; y] + 1) + 11, (242) which finally proves Theorem 3.7. F General GPRS with Binary Search We finally present a generalized version of branch-and-bound GPRS (Algorithm 6) for more general spaces and remove the requirement that r be unimodal. Decomposing a Poisson process into a mixture of processes: Let Π be a process over R+ Ω as before, with spatial measure P, and let Q be a target measure, ϕ = σ r and U = {(t, x) R+ Ω| t ϕ(x)} as before. Let Π be Π = Π U restricted under ϕ and ( T1, X1) its first arrival. Let {L, R} form an P-essential partiton of Ω, i.e. L R = and P(L R) = 1, and let ΠL = Π R+ L and ΠR = Π R+ R be the restriction of Π to L and R, respectively. Let ΠL = ΠL U and ΠR = ΠR U be the restrictions of the two processes under ϕ as well. Let µL(t) and µR(t) be the mean measures of these processes. Thus, the first arrival times in these processes have survival functions P[ T L 1 t] = e µL(t) (243) P[ T R 1 t] = e µR(t). (244) Note, that by the superposition theorem, ΠL ΠR = Π, hence the first arrival ( T1, X1) occurs in either ΠL or ΠR. Assume now that we have already searched through the points of Π up to time τ without finding the first arrival. At this point, we can ask: will the first arrival occur in ΠL, given that T1 τ? Using Bayes rule, we find P[ X1 L | T1 τ] = P[ X1 L, T1 τ] P[ T1 τ] . (245) More generally, assume that the first arrival of Π occurs in some set A Ω, and we know that the first arrival time is larger than τ. Then, what is the probability that the first arrival occurs in some set B A? Similarly to the above, we find P[ X1 B | T1 τ, X1 A] = P[ X1 B, T1 τ, X1 A] P[ T1 τ, X1 A] (246) = P[ T1 τ, X1 B] P[ T1 τ, X1 A] . (247) Let ( TL, XL) be the first arrival of ΠL. Then, the crucial observation is that ( TL, XL) d= T1, X1 | X1 L. (248) This enables us to search for the first arrival of Π under the graph of ϕ using an extended BSP tree construction. At each node, if we reject, we draw a Bernoulli random variable b with mean equal to the probability that the first arrival occurs within the bounds associated with the right child node. Then, if b = 1, we continue the search along the right branch. Otherwise, we search along the left branch. Note, however, that in a restricted process ΠA, the spatial measure no longer integrates to 1. Furthermore, our target Radon-Nikodym derivative is r(x) 1[x A]/Q(A). This means we need to change the graph ϕ to some new graph ϕA to ensure that the spatial distribution of the returned sample is still correct. 
Therefore, for a set A we define the restricted versions of previous quantities: µA(t) def = Z t A 1[τ ϕA(x)] d P(x) dt (249) w P (h | A) def = Z A 1[h r(x)] d P(x) (250) Then, via analogous arguments to the ones in Appendix A, we find d P[ TA = t, XA = x] d(λ P) = 1[x A]1[t ϕA(x)]P[ TA t] (251) d P[ XA = x] d P = 1[x A] Z ϕA(x) 0 P[ TA t] dt (252) ϕA = σA r. (253) Similarly, setting d P[ XA=x] d P = 1[x A] r(x)/Q(A), and setting τ = σA(r(x)), we get σ 1 A (τ) = Q(A) Z τ 0 P[ TA t] dt (254) σ 1 A (τ) = P[ TA τ] (255) = P[ T τ, X A]. (256) From this, again using similar arguments to the ones in Appendix A, we find σ 1 A (τ) = w Q(σ 1 A (t) | A) σ 1 A (t) w P (σ 1 A (t) | A). (257) G Necessary Quantities for Implementing GPRS in Practice Ultimately, given a target-proposal pair (Q, P) with density ratio r, we would want an easy-toevaluate expression for the appropriate stretch function σ or σ 1 to plug directly into our algorithms. Computing σ requires computing the integral in Equation (5) and finding σ 1 requires solving the non-linear ODE in Equation (4), neither of which is usually possible in practice. Hence, we usually resort to computing σ 1 numerically by using an ODE solver for Equation (4). In any case, we need to compute w P and w Q, which are analytically tractable in the practically relevant cases. Hence, in this section, we give closed-form expressions for w P and w Q for all the examples we consider and give closed-form expressions for σ and σ 1 for some of them. If we do not give a closed-form expression of σ, we use numerical integration to compute σ 1 instead. Note on notation: In this section only, we denote the natural logarithm as ln and exponentiation with respect to the natural base as exp. G.1 Uniform-Uniform Case Let P be the uniform distribution over an arbitrary space Ωand Q a uniform distribution supported on some subset X Ω, with P(X) = C for some C 1 Then, C 1[x X] (258) w P (h) = C (259) w Q(h) = 1 (260) C log (1 C h) (261) C (1 exp( C h)) . (262) Note that using GPRS in the uniform-uniform case is somewhat overkill, as in this case, it is simply equivalent to standard rejection sampling. G.2 Triangular-Uniform Case Let P = Unif(0, 1) and for some numbers 0 < a < c < b < 1, let Q be the triangular distribution, defined by the PDF 0 if x < a 2(x a) (b a)(c a) if a x < c 2 b a if x = c 2(b x) (b a)(b c) if c < x b 0 if b < x. In this parameterization, the distribution has support on (a, b). Moreover, it is linearly increasing on (a, c) and linearly decreasing on (c, b), with the mode located at c. For convenience, let ℓ= b a. Then, r(x) = q(x) (264) DKL[Q P] = log2 2 log2 e (265) D [Q P] = 1 log2 ℓ (266) Dα[Q P] = log2 ℓ+ α log2(1 + α) w P (h) = ℓ ℓ2 w Q(h) = 1 ℓ2 σ(h) = 2h 2 ℓ h (270) σ 1(t) = 2t 2 + ℓ t (271) Additionally, in this case we find P[ T t] = 4 (2 + ℓ t)2 (272) P[ T dt] = 8 ℓ (2 + ℓ t)3 (273) ϕ T (s) = 2G3,1 1,3 0 0, 1, 3/2 + 2is2 sign(s) l |s| 2Ci 2|s| l + π 2Si 2|s| l2|s| (275) M T (s) = ℓ2 + 2ℓs 4s2e 2s/ℓEi 2s ℓ2 for s 0, (276) where ϕ T and M T denote the characteristic and the moment-generating functions of T, respectively. Furthermore, Si, Ci and Ei are the sine, cosine and exponential integrals, defined as Si(x) = Z x 0 t 1 sin(t) dt (277) x t 1 cos(t) dt (278) Ei(x) = Z x t 1et dt. (279) Finally, G3,1 1,3 is the Meijer-G function (Sections 5.1-3; Bateman & Erdélyi, 1953) Interestingly, we can see that in this case, while the expectation of T is finite, we find that V[ T] = . 
Thus, by Equation (72), we find that V[N] = as well, where N is the number of samples simulated by Algorithm 3. G.3 Finite Discrete-Discrete / Piecewise Constant Case Without loss of generality, let Q be a discrete distribution over a finite set with K elements defined by the probability vector (r1 < r2 < . . . < r K) and P = Unif([1 : K]) the uniform distribution on K elements. Note that any discrete distribution pair can be rearranged into this setup. Now, we can embed this distribution into [0, 1] by mapping P to ˆP = Unif(0, 1) the uniform distribution over [0, 1] and mapping Q to ˆQ with density k=1 1 [ K x = k 1] K rk. (280) Then, we can draw a sample x from Q by simulating a sample ˆx ˆQ and computing x = K ˆx/K + 1. Hence, we can instead reduce the problem to sampling from a piecewise constant distribution. Thus, let us now instead present the more general case of arbitrary piecewise constant distributions over [0, 1], with Q defined by probabilities (q1 < q2 < . . . < q K) and corresponding piece widths (w1 < w2 < . . . < w K). We require, that P k=1 qk = PK k=1 wk = 1. Then, the density is Define rk = qk/wk. Then, k=1 1[h rk] wk (282) k=1 1[h rk] wk rk (283) k=1 1[h rk 1] 1 Bk ln Ak Bkrk 1 Ak Bk min{rk, h} where we defined j=k qj (285) j=k wj (286) G.4 Diagonal Gaussian-Gaussian Case Without loss of generality, let Q = N(µ, σ2I) and P = N(0, I) be d-dimensional Gaussian distributions with diagonal covariance. As a slight abuse of notation, let N(x | µ, σ2I) denote the probability density function of a Gaussian random variable with mean µ and covariance σ2I evaluated at x. Then, when σ2 < 1, we have r(x) = Z N x | m, s2I (287) m = µ 1 σ2 (288) Z = (1 σ2) d N(µ | 0, (1 σ2)I) (290) w P (h) = P χ2 d, m 2 2s2 ln h + C (291) 2s2 ln h + C C = s2 2 ln Z d ln(2πs2) . (293) Unfortunately, in this case, it is unlikely that we can solve for the stretch function analytically, so in our experiments, we solved for it numerically using Equation (4). G.5 One-dimensional Laplace-Laplace Case Without loss of generality, let Q = L(µ, b) and P = L(0, 1) be Laplace distributions. As a slight abuse of notation, let L(x | µ, b) denote the probability density function of a Laplace random variable with mean µ and scale b evaluated at x. Then, for b < 1 we have b exp |x µ| b + |x| (294) 1 exp b ln(bh) 1 b cosh µ 1 b if ln h |µ| exp b2 ln(bh) |µ| 1 b2 sinh b(ln(bh) |µ|) 2 (bh) b 1+b exp |µ| 1+b (bh) b 1 b exp |µ| 1 b if ln h > |µ| 1 exp ln(hb) 1 b cosh µ 1 b if ln h |µ| 1 exp ln(hb) |µ| 1 b2 cosh b(ln(hb) |µ|) 2(bh) 1 1 b exp |µ| 2(bh) 1 1+b exp |µ| 1+b if ln h > |µ| 1 σ 1 + b b 1 b b 1 1 b | {z } 0 on [0,1] cosh µ 1 b σ 1 1 1 b if ln h |µ| 2 b b 1 b b 1 1 b exp |µ| 1 b σ 1 1 1 b 2 b b 1+b + b 1 1+b exp |µ| 1+b σ 1 1 1+b otherwise Similarly to the Gaussian case, we resorted to numerical integration using Equation (4) to solve for σ 1. H Experimental Details Comparing Algorithm 3 versus Global-bound A* coding: We use a setup similar to the one used by Theis & Yosri (2022). Concretely, we assume the following model for correlated random variables x, µ: Pµ = N(0, 1) (298) Px|µ = N(µ, σ2). (299) From this, we find that the marginal on x must be Px = N(0, σ2 + 1). The mutual information is I[x; µ] = 1 2 log2 1 + σ2 bits, which is what we plot as I[x; µ] in the top two panels in Figure 2. For the bottom panel in Figure 2, we fixed a standard Gaussian prior P = N(0, 1), fixed K = DKL[Q P] and linearly increased R = D [Q P]. 
To find a target that achieves the desired values for these given divergences, we set its mean and variance as
$\sigma^2 = \exp\big(W(A \exp(B)) - B\big)$   (300)
$\mu = \sqrt{2K - \sigma^2 + \ln \sigma^2 + 1}$   (301)
$A = 2 \ln R - 2K - 1$   (302)
$B = 2 \ln R - 1,$   (303)
where $W$ is the principal branch of the Lambert $W$-function (Corless et al., 1996). We can derive these formulae as follows. We wish to find $\mu$ and $\sigma^2$ such that, for fixed numbers $K$ and $R$ and densities $q(x) = \mathcal{N}(x \mid \mu, \sigma^2)$ and $p(x) = \mathcal{N}(x \mid 0, 1)$, we have
$D_{\mathrm{KL}}[q \,\|\, p] = K \quad \text{and} \quad \sup_{x \in \mathbb{R}} \frac{q(x)}{p(x)} = R.$   (304)
We know that
$K = D_{\mathrm{KL}}[q \,\|\, p] = \frac{1}{2}\big(\mu^2 + \sigma^2 - \ln \sigma^2 - 1\big) \quad \text{and} \quad \ln R = \ln \sup_{x \in \mathbb{R}} \frac{q(x)}{p(x)} = \frac{\mu^2}{2(1 - \sigma^2)} - \ln \sigma.$   (305)
From these, we get that
$\mu^2 = 2K - \sigma^2 + \ln \sigma^2 + 1 \quad \text{and} \quad \mu^2 = 2(1 - \sigma^2)(\ln R + \ln \sigma).$   (306)
Setting these equal to each other and rearranging, we find
$2K - \sigma^2 + \ln \sigma^2 + 1 = 2 \ln R + \ln \sigma^2 - 2\sigma^2 \ln R - \sigma^2 \ln \sigma^2$
$\sigma^2 \ln \sigma^2 - \sigma^2 + 2\sigma^2 \ln R = 2 \ln R - 2K - 1$
$\sigma^2 \ln \sigma^2 + \sigma^2 (2 \ln R - 1) = A$
$\sigma^2 (\ln \sigma^2 + B) = A$
$\sigma^2 \ln(\sigma^2 e^B) = A$
$e^B \sigma^2 \ln(\sigma^2 e^B) = A e^B$
$e^{\ln(\sigma^2 e^B)} \ln(\sigma^2 e^B) = A e^B$
$\ln(\sigma^2 e^B) = W(A e^B)$
$\sigma^2 = e^{W(A e^B) - B},$
where we made the substitutions $A = 2 \ln R - 2K - 1$ and $B = 2 \ln R - 1$.

I Rejection Sampling Index Entropy Lower Bound

Assume that we have a pair of correlated random variables $\mathbf{x}, \mathbf{y} \sim P_{\mathbf{x},\mathbf{y}}$, and that Alice and Bob wish to realize a channel simulation protocol using standard rejection sampling as given by, e.g., Algorithm 2. Thus, when Alice receives a source symbol $\mathbf{y} \sim P_{\mathbf{y}}$, she sets $Q = P_{\mathbf{x}|\mathbf{y}}$ as the target and $P = P_{\mathbf{x}}$ as the proposal for the rejection sampler. Let $N$ denote the index of Alice's accepted sample, which is also the number of samples she needs to draw before her algorithm terminates. Since each acceptance decision is an independent Bernoulli trial in standard rejection sampling, $N$ follows a geometric distribution whose mean equals the upper bound $M$ used for the density ratio (Maddison, 2016). The lower bound on the optimal coding cost for $N$ is given by its entropy
$\mathbb{H}[N] = (M - 1) \log_2\!\left(\frac{M}{M - 1}\right) + \log_2 M \geq \log_2 M,$   (308)
where the inequality follows by omitting the first term, which is guaranteed to be positive since $x \mapsto -x \log_2 x$ is positive on $(0, 1)$. Hence, by using the optimal upper bound on the density ratio, $M = \mathrm{exp}_2(D_\infty[Q \,\|\, P])$, and plugging it into the formula above, we find that
$D_\infty[Q \,\|\, P] \leq \mathbb{H}[N].$   (309)
Now, taking expectation over $\mathbf{y}$, we find
$\mathbb{E}_{\mathbf{y} \sim P_{\mathbf{y}}}\big[D_\infty[P_{\mathbf{x}|\mathbf{y}} \,\|\, P_{\mathbf{x}}]\big] \leq \mathbb{H}[N].$   (310)
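As a small numerical check of the Gaussian target construction in Appendix H above, the following minimal sketch computes $\mu$ and $\sigma^2$ from Equations (300)-(303) and verifies the two constraints in Equation (304); note that it works in natural logarithms and, as in the derivation, takes $R$ to be the supremum of the density ratio itself.

```python
import numpy as np
from scipy.special import lambertw

def gaussian_target_for(K, R):
    """Mean and variance of Q = N(mu, var) such that, against P = N(0, 1),
    D_KL[Q||P] = K nats and sup_x dQ/dP(x) = R (Equations (300)-(303))."""
    A = 2.0 * np.log(R) - 2.0 * K - 1.0
    B = 2.0 * np.log(R) - 1.0
    var = np.exp(lambertw(A * np.exp(B)).real - B)   # principal branch of W
    mu = np.sqrt(2.0 * K - var + np.log(var) + 1.0)
    return mu, var

mu, var = gaussian_target_for(K=1.0, R=20.0)
kl = 0.5 * (mu**2 + var - np.log(var) - 1.0)                          # ~ 1.0
sup_ratio = np.exp(mu**2 / (2.0 * (1.0 - var)) - 0.5 * np.log(var))   # ~ 20.0
print(mu, var, kl, sup_ratio)
```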