# Fast Relative Entropy Coding with A* coding

Gergely Flamich 1, Stratis Markou 1, José Miguel Hernández-Lobato 1 2 3

Equal contribution. 1 Department of Engineering, University of Cambridge, Cambridge, UK. 2 Microsoft Research, Cambridge, UK. 3 Alan Turing Institute, London, UK. Correspondence to: Gergely Flamich, Stratis Markou. Proceedings of the 39th International Conference on Machine Learning, Baltimore, Maryland, USA, PMLR 162, 2022. Copyright 2022 by the author(s).

Abstract

Relative entropy coding (REC) algorithms encode a sample from a target distribution Q using a proposal distribution P, such that the expected codelength is O(D_KL[Q‖P]). REC can be seamlessly integrated with existing learned compression models since, unlike entropy coding, it does not assume discrete Q or P, and does not require quantisation. However, general REC algorithms require an intractable Ω(e^{D_KL[Q‖P]}) runtime. We introduce AS* and AD* coding, two REC algorithms based on A* sampling. We prove that, for continuous distributions over R, if the density ratio is unimodal, AS* has O(D_∞[Q‖P]) expected runtime, where D_∞[Q‖P] is the Rényi ∞-divergence. We provide experimental evidence that AD* also has O(D_∞[Q‖P]) expected runtime. We prove that AS* and AD* achieve an expected codelength of O(D_KL[Q‖P]). Further, we introduce DAD*, an approximate algorithm based on AD*, which retains its favourable runtime and has bias similar to that of alternative methods. Focusing on VAEs, we propose the Iso KL VAE (IKVAE), which can be used with DAD* to further improve compression efficiency. We evaluate A* coding with (IK)VAEs on MNIST, showing that it can losslessly compress images near the theoretically optimal limit.

1. Introduction

In recent years, there has been significant progress in compression using machine learning, an approach known as learned compression. Most of the prominent learned compression methods, including the state of the art in both lossless (Townsend et al., 2019; Hoogeboom et al., 2019; Zhang et al., 2021) and lossy compression (Ballé et al., 2017), perform non-linear transform coding (Ballé et al., 2020). In transform coding, a datum is first mapped to a latent representation and encoded with entropy coding. Entropy coding assumes that the latent representation and the coding distribution are discrete, which requires a non-differentiable quantization step. Since gradient-based optimization requires derivatives, most state-of-the-art methods use a continuous approximation to quantization during training and switch to hard quantization only at compression time (Ballé et al., 2017). This mismatch has been argued to harm the compression efficiency of these methods (Havasi et al., 2018; Flamich et al., 2020; Theis & Agustsson, 2021). Relative entropy coding (REC; Flamich et al., 2020) is a recently proposed alternative which does not require quantization and avoids this mismatch. A REC algorithm uses samples from a proposal distribution P to produce a random code representing a sample from a target distribution Q, with expected length of approximately D_KL[Q‖P]_2, where the subscript indicates that the KL is measured in bits rather than nats. General-purpose REC algorithms place no restrictions on Q and P beyond D_KL[Q‖P]_2 being finite, so they can be applied even when Q and P are continuous.
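To make concrete the two divergences that govern REC throughout this paper, the KL divergence D_KL[Q‖P] (which controls the codelength) and the Rényi ∞-divergence D_∞[Q‖P] (which controls the runtime of the algorithms introduced below), here is a minimal sketch that computes both for a hypothetical pair of one-dimensional Gaussians. The distributions, variable names and NumPy/SciPy usage are illustrative assumptions, not taken from the paper.

```python
import numpy as np
from scipy import stats, optimize

# Hypothetical 1-D Gaussian target Q and proposal P (sigma_q < sigma_p, so that
# the density ratio dQ/dP is bounded and unimodal).
mu_q, sigma_q = 0.8, 0.5
mu_p, sigma_p = 0.0, 1.0

# KL[Q||P]: governs the expected REC codelength (reported here in bits).
kl_nats = (np.log(sigma_p / sigma_q)
           + (sigma_q**2 + (mu_q - mu_p)**2) / (2 * sigma_p**2) - 0.5)

# Renyi infinity-divergence D_inf[Q||P] = log sup_x q(x)/p(x): governs the
# expected runtime of the AS*/AD* algorithms introduced in this paper.
log_ratio = lambda x: (stats.norm.logpdf(x, mu_q, sigma_q)
                       - stats.norm.logpdf(x, mu_p, sigma_p))
opt = optimize.minimize_scalar(lambda x: -log_ratio(x), bounds=(-10, 10), method="bounded")
d_inf_nats = -opt.fun

print(f"KL[Q||P]    = {kl_nats / np.log(2):.3f} bits")
print(f"D_inf[Q||P] = {d_inf_nats / np.log(2):.3f} bits")
```

Note that D_∞[Q‖P] ≥ D_KL[Q‖P] always holds, and the gap between the two is exactly what separates the runtime of A* coding from the codelength it achieves.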
Thus REC can be naturally applied to perform compression with generative models trained via gradient descent, for applications including but not limited to: (1) data compression with variational autoencoders (VAE; Kingma & Welling, 2014), where Q corresponds to a variational posterior over latent variables and P to a prior over these latent variables; (2) model compression (Havasi et al., 2018), where Q corresponds to an approximate posterior over parameters (Blundell et al., 2015), and P to a prior over those parameters. However, REC algorithms that make no further assumptions on Q or P require Ω 2DKL[Q P ]2 steps to terminate in expectation (Agustsson & Theis, 2020), which is a severe limitation in practice. Thus, in order to make progress, it is necessary to impose additional assumptions on Q and P. Universal quantisation (Ziv, 1985) can be regarded as a REC algorithm that achieves O(DKL[Q P]2) runtime and has been demonstrated to work well with the state-of-theart VAE-based learned compression methods (Agustsson & Theis, 2020). However, it places heavy limitations on Q and P, which might be overly restrictive in many cases. Fast Relative Entropy Coding with A* Coding In this work, we introduce AS* and AD* coding, two closely related REC algorithms based on A* sampling (Maddison et al., 2014), which achieve significantly faster runtimes than existing alternatives. For Q and P over R, and without further assumptions, we show that the expected codelength achieved by AS* and AD* is bounded by λ DKL[Q P]2 + λ log2(DKL[Q P]2 + 1) + O(1), (1) where λ 2.41 and λ = 1 for AS* and AD* respectively. With the additional assumption that Q and P are continuous with unimodal density ratio, we show that the expected runtime of AS* is O(D [Q P]), where D [Q P] = log supx X d Q d P (x) is the R enyi -divergence. While we do not prove an analogous bound for the runtime of AD*, we conduct extensive experiments on different Q, P pairs, and observe that the runtime of AD* is also O(D [Q P]). Thus AS* and AD* significantly improve upon the runtime of existing REC methods, without requiring as severe assumptions as universal quantization. While AS* and AD* require unimodality, this assumption is satisfied by many models in learnt compression, such as most VAEs. In addition, a practical limitation of REC algorithms is that, since the codelength of a sample is random, additional bits must be used to communicate the codelength itself to ensure the message is decodable. This additional code, which corresponds to the second and third terms of eq. (1), accounts for a large portion of the overall codelength and grows linearly with the number of dimensions. To remedy this, we consider approximate REC algorithms, in which the codelength is a parameter, which is set prior to coding. This allows us to form blocks of variables which are coded using the same codelength. Thus, the additional cost must be paid only once per block, rather than once per variable, thereby greatly reducing this overhead codelength. To this end, we first introduce an approximate variant of AD* coding which, similarly to existing approximate REC algorithms, has a tunable codelength and a provably low bias. Unlike existing methods however, it retains the favourable runtime of AD* coding. Second, we propose to parameterize latent variable distributions by explicitly specifying DKL[Q P]. For example, instead of parameterizing a Gaussian Q using a mean and variance, we can specify its mean and DKL[Q P], from which the variance is uniquely determined. 
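The last point above, that fixing the mean and the KL pins down the variance, can be checked directly. The sketch below recovers the posterior standard deviation from a target KL by root-finding, assuming a standard Gaussian prior; Section 4 gives a closed form via the Lambert W function, and the function names here are ours.

```python
import numpy as np
from scipy import optimize

def kl_gauss(mu_q, sigma_q, mu_p=0.0, sigma_p=1.0):
    # KL[N(mu_q, sigma_q^2) || N(mu_p, sigma_p^2)] in nats.
    return (np.log(sigma_p / sigma_q)
            + (sigma_q**2 + (mu_q - mu_p)**2) / (2 * sigma_p**2) - 0.5)

def sigma_from_kl(mu_q, kl_target):
    # For a N(0, 1) prior and sigma_q restricted to (0, 1), the KL is strictly
    # decreasing in sigma_q, so bisection finds the unique solution
    # (one exists whenever kl_target > mu_q**2 / 2).
    f = lambda s: kl_gauss(mu_q, s) - kl_target
    return optimize.brentq(f, 1e-6, 1.0 - 1e-9)

sigma = sigma_from_kl(mu_q=0.7, kl_target=1.2)
print(sigma, kl_gauss(0.7, sigma))   # recovered std, and its KL (approx. 1.2)
```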
This allows us to construct blocks of latent variables with tied KL divergences, which can be coded with the same codelength. This codelength must be communicated once per block instead of once per variable. We consider VAE models using this parameterization, which we refer to as iso KL VAEs (IKVAEs). We present experiments on lossless image compression on MNIST which demonstrate that the performance of IKVAEs is comparable to that of VAEs, while reducing the codelength overhead. Our contributions can be summarised as follows: We introduce AS* and AD* coding, two REC algorithms based on A* sampling, for coding samples from one-dimensional distributions. We prove that, the expected codelength of AS* and AD* is O(DKL[Q P]). We prove that if d Q/d P is bounded and unimodal, AS* achieves O(D [Q P]) runtime. We present empirical evidence that the runtime of AD* is also linear in D . Therefore AS* and AD* significantly improve over the exponential runtime of existing REC algorithms. A direct consequence of the above is that A* sampling with unimodal d Q/d P also has O(D [Q P]) runtime. We introduce an approximate variant of AD* and bound its bias. Similar to existing ones, this algorithm can code low-bias samples using fixed codelengths, but retains the favourable runtime of AD*. We introduce a novel modification for VAEs, in which the KL divergences across either all, or some of, the latent dimensions are tied. This modification, which we refer to as the iso KL VAE (IKVAE), can be used with any fixed-codelength approximate REC algorithm, such as our own, to greatly reduce overhead codes. We demonstrate the favourable performance of AS* and AD* on toy problems, comparing it with that of alternative methods. Lastly, we apply our approximate AD* algorithm to VAE and IKVAE models on image data, showing that it can losslessly compress images near the theoretically optimal ELBO. 2. Background Relative Entropy Coding: The central problem which we tackle in this work is the REC problem, defined as follows. Definition 1 (REC problem and algorithm). Let Q be a target and P be a proposal distribution, with DKL[Q P] < , and let S = (s1, s2, . . . ) be an infinite sequence of publicly available independent fair coin tosses. Relative entropy coding (REC) is the problem of producing a uniquely decodable code C representing a sample from Q given S, such that the codelength |C| satisfies E[|C|] = O(DKL[Q P]), (2) An algorithm which solves this problem is a REC algorithm. In practice, S is implemented by using a pseudo-random number generator (PRNG) with a publicly available seed. Crucially, REC applies to both discrete and continuous distributions, and can be integrated into learned compression pipelines, without requiring quantization. Several existing algorithms solve the REC problem without further assumptions on Q or P, however, they are impractically slow. Fast Relative Entropy Coding with A* Coding Poisson Functional Representation: Li & El Gamal (2018) introduced a REC algorithm for general Q and P, here referred to as Poisson functional representation (PFR) coding. Li & El Gamal showed that if Ti are the ordered arrival times of a homogeneous Poisson process on R+ (Kingman, 1992), and Xi P, then arg min i N d Q(Xi) Q. (3) Further, Li & El Gamal showed that a sample may be represented by coding the index which minimises eq. (3), and bounded the expected codelength of PFR by K E[|C|] K + log2(K + 1) + O(1), (4) where K = DKL[Q P]2. 
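The minimisation in eq. (3) is easy to simulate. Below is a self-contained sketch of the PFR search for a hypothetical Gaussian pair, terminating once the running arrival time guarantees that no later index can improve the minimum (this uses a bound on sup dQ/dP, anticipating the Global Bound A* view discussed next); the function names and constants are ours, not the authors' implementation.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)   # stands in for the shared public randomness S

def pfr_encode(log_r, r_max, sample_p, rng, max_steps=100_000):
    """Sketch of the PFR search in eq. (3): return (index, sample).

    log_r(x) = log dQ/dP(x); r_max >= sup dQ/dP is needed for termination;
    sample_p draws X_i ~ P from the shared randomness.
    """
    t, best_val, best_idx, best_x = 0.0, np.inf, None, None
    for i in range(1, max_steps + 1):
        t += rng.exponential()             # ordered Poisson arrival times T_i
        x = sample_p(rng)
        val = t * np.exp(-log_r(x))        # T_i * dP/dQ(X_i)
        if val < best_val:
            best_val, best_idx, best_x = val, i, x
        if t / r_max >= best_val:          # no later arrival can beat the current minimum
            break
    return best_idx, best_x

# Hypothetical example: Q = N(1, 0.5^2), P = N(0, 1).
log_r = lambda x: stats.norm.logpdf(x, 1.0, 0.5) - stats.norm.logpdf(x, 0.0, 1.0)
r_max = np.exp(max(log_r(x) for x in np.linspace(-5.0, 5.0, 10_001)))  # crude sup estimate
idx, x = pfr_encode(log_r, r_max, lambda g: g.normal(0.0, 1.0), rng)
print("code index:", idx, "decoded sample:", x)
```

The index returned here is what PFR encodes; a decoder holding the same seed S regenerates the stream of (T_i, X_i) pairs and reads off the sample at that index.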
We note that PFR converts REC into a search problem, just like the A* sampling algorithm (Maddison et al., 2014) converts sampling into a search problem. In fact, it can be shown that the minimization in eq. (3), is equivalent to a variant of A* sampling, called Global Bound A* sampling. In particular arg min i N Ti d P d Q(Xi) = arg max i N { log Ti + log r(Xi)} d= arg max i N {Gi + log r(Xi)} (5) where r = d Q/d P, and Gi is sampled according to Gi TG (0, Gi 1) , (6) where TG (µ, κ) denotes the Gumbel distribution with mean µ and unit scale, truncated to the interval ( , κ], and defining G0 def= . The maximisation in eq. (5) is identical to the Global Bound A* sampling algorithm (see Appendix of Maddison et al.). Thus, Global Bound A* sampling and PFR are identical, with the exception that the former works in the negative log-space of the latter (eq. 5). Unfortunately, the runtime T of PFR is so large that it renders the algorithm intractable in practice. In particular, the runtime of Global Bound A*, and thus also of PFR, can be shown (see Appendix in Maddison et al.) to be equal to E[T] = exp (D [Q P]) exp (DKL[Q P]) . (7) where D [Q P] = log supx X r(x). This bound is perhaps unsurprising, considering a more general result by Agustsson & Theis (2020), who proved that without further assumptions on Q and P, the expected runtime of any REC algorithm is Ω(2DKL[Q P ]2). Additional assumptions: In order to develop a REC algorithm that is fast enough to be practical, we must make further assumptions about the target and proposal distributions. Focusing on continuous distributions over R, we will show that an assumption which enables fast runtimes is the unimodality of r. While somewhat restrictive, this assumption is satisfied by virtually all models used in learned compression. We will show that A* sampling can be modified to solve the REC problem, achieving O(D [Q P]) runtime whenever r is bounded and unimodal. Henceforth, we will assume that Q and P are continuous distributions on R with densities q and p. However, we note that the methods we present in this work can be generalised to arbitrary measure spaces equipped with a total ordering over their elements, using the Radon-Nikodym derivative in place of the density ratio (Grimmett & Stirzaker, 2001). A* sampling: The A* sampling algorithm (Maddison et al., 2014) is an extension of the Gumbel-max trick (Papandreou & Yuille, 2011) to arbitrary probability spaces. A* sampling is a branch-and-bound algorithm (Land & Doig, 1960), which converts the problem of sampling from Q into the maximization in eq. (5). A* sampling builds a binary search tree, where each node n is associated with a triplet (Bn, Xn, Gn), where: (1) Bn Ωis a subset of the sample space; (2) Xn P|Bn is a sample distributed according to the restriction of P to Bn; (3) Gn TG log P(Bn), Gpar(n) is a truncated Gumbel sample, where par(n) is the parent of n. At each step, A* sampling expands the children of n, partitioning Bn into two disjoint subsets Ln Rn = Bn, using some rule which we refer to as partition. A* sampling then constructs the children s triplets following the definitions above. It then bounds the objective from eq. (5) on each of the children, and utilises the bounds to narrow the search and quickly locate the maximum. Setting partition to be the degenerate splitting partition GBA (B, X) def= ( , B) (8) yields the Global Bound A* algorithm. Equation (8) hints at why Global Bound A*, and by extension PFR, have large runtimes. 
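The chain of truncated Gumbels in eq. (6) can be sampled with a one-line identity (restated later as Lemma 9 in the appendix): if T ~ Exp(λ) and T₀ ≥ 0, then −log(T + T₀) is a unit-scale Gumbel with location log λ, truncated at −log T₀. A short sketch, with helper names of our own choosing:

```python
import numpy as np

rng = np.random.default_rng(1)

def trunc_gumbel(mu, kappa, rng):
    """G ~ TG(mu, kappa): Gumbel with location mu, unit scale, truncated to (-inf, kappa].

    Uses -log(T + T0) ~ TG(log lambda, -log T0) for T ~ Exp(lambda), with
    lambda = exp(mu) and T0 = exp(-kappa).
    """
    t0 = 0.0 if np.isinf(kappa) else np.exp(-kappa)
    return -np.log(rng.exponential(np.exp(-mu)) + t0)

# Eq. (6): the chain G_1 > G_2 > ... with G_i ~ TG(0, G_{i-1}) is just the
# negative log of the arrival times T_i of a rate-1 Poisson process.
g = np.inf
for i in range(5):
    g = trunc_gumbel(0.0, g, rng)
    print(f"G_{i + 1} = {g:+.3f}   (equivalently T_{i + 1} = {np.exp(-g):.3f})")
```

Each draw adds an Exp(1) inter-arrival time to the Poisson process behind eq. (5), which is the sense in which PFR and the degenerate global-bound partition of eq. (8) run the same search.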
This partition function does not refine the search as the algorithm progresses, maintaining the same fixed global bound throughout a run. In this work we consider using two different partition functions, yielding the AS* and AD* coding algorithms. When applied to unimodal r, these partitioning schemes enable the algorithm to quickly refine its search and achieve a fast runtime. Approximate REC algorithms: Other lines of work have introduced alternative algorithms, such as Minimal Random Coding (MRC; Havasi et al., 2018) and Ordered Random Coding (ORC; Theis & Yosri, 2022), which produce a code representing an approximate, instead of an exact, sample from Q. Because these algorithms produce biased samples, strictly speaking they do not satisfy the requirements of definition 1, so we refer to them as approximate REC algorithms. Both MRC and ORC accept the codelength |C| as Fast Relative Entropy Coding with A* Coding an input parameter. Increasing |C| reduces sample bias, but increases compression cost and runtime. Unfortunately, the runtime required to reduce the bias sufficiently in order to make the samples useful in practice scales as O(2DKL[q p]2), making MRC and ORC as expensive as PFR. However, one benefit of a tunable codelength is that, when communicating multiple samples, the overhead code corresponding to the second and third terms in eq. (4), can be significantly reduced. By grouping several random variables into blocks, and coding all samples of a block with the same codelength, we only need to communicate a codelength per block, as opposed to one codelength per variable. This procedure reduces the codelength overhead by a factor equal to the number of variables in each block. 3. A* Coding A* sampling returns a node n with associated triplet (Bn, Xn, Gn), where Xn is an exact sample from Q (Maddison et al., 2014). Therefore, the only addition we need to make to A* sampling to turn it into a REC algorithm, is a way to represent n using a uniquely decodable code C. Given such a C, we can decode the sample Xn by determining the Bn corresponding to n, and then sampling Xn P|Bn given the public source of randomness S. Since A* sampling may return any node in its binary search tree, we propose to use heap indexing, also known as the ahnentafel or Eytzinger ordering to code nodes. Let the parent of the root node be nil. Then, the heap index of a node n is 1 if par(n) = nil 2Hpar(n) if n is left child of par(n) 2Hpar(n) + 1 if n is right child of par(n). (9) Let Dn denote the depth of node n in the binary tree, where Droot = 1. We can see that Dn = log2 Hn + 1. Thus, assuming that Dn is known, Hn can be encoded in Dn bits. Therefore, we can modify A* sampling to return the heap index of the optimal node, from which n can be decoded. This yields Algorithm 1, which we refer to as A* coding. Maddison et al. (2014) show that A* sampling is correct regardless of the choice of partition. The following theorem shows that under mild assumptions on partition, the expected codelength of A* coding is O(DKL[Q P]). Theorem 1 (Expected codelength of A* coding). Let Q and P be the target and proposal distributions passed to A* coding (Algorithm 1), respectively. Assume that partition satisfies the following property: there exists ϵ [1/2, 1) such that for any node n we have E[P(Bn)] ϵDn, (10) where the expectation is taken over the joint distribution of the samples associated with the ancestor nodes of n. Let k be the node returned by A* coding. Then, we have E[Dk] 1 log ϵ DKL[Q P] + e 1 + log 2 . 
(11) In particular, when ϵ = 1/2, E[Dk] DKL[Q P]2 + e 1 log2 e + 1. (12) Proof. See Appendix A for proof. Motivated by the results of Theorem 1, we examine two variants of A* coding based on particular choices for partition, which yield AS* and AD* coding. AS* coding: For node n with triplet (Bn, Xn, Gn), where Bn = (α, β), we define partition as partition AS (Bn, Xn) def= (α, Xn), (Xn, β). (13) In this case, the following result holds. Lemma 1. Let P be the proposal distribution passed to AS* coding and let partition be as defined in eq. (13). Then the condition in eq. (10) is satisfied with ϵ = 3/4, that is E[P(Bn)] (3/4)Dn. (14) Proof. See Appendix B for proof. Hence, the codelength of AS* sampling is bounded by λ DKL[Q P]2 + λ log(DKL[Q P] + 1) + O(1), (15) where λ = log 2/ log(3/4) 2.41. Further, we have the following result. Theorem 2 (Expected runtime of AS* coding). Let Q and P be the target and proposal distributions passed to AS* coding (Algorithm 1), and assume r = d Q/d P is quasiconcave. Let T be the number of steps AS* takes before it terminates. Then, we have E[T] = O(D [Q P]) + O(1). (16) Proof. See Appendix C for proof. Theorem 2 identifies a general class of target-proposal pairs where REC and A* sampling can be performed much faster than the Ω(exp(DKL[Q P])) and O(exp(D [Q P])) bounds shown by Agustsson & Theis (2020) and Maddison et al. (2014); Maddison (2016), respectively. AD* coding: Unfortunately, the upper bound on the codelength of AS* coding in eq. (15) is not tight enough to be optimal. However, Theorem 1 suggests this may be addressed by choosing a partition function for which Fast Relative Entropy Coding with A* Coding ϵ = 1/2. One such partition function is dyadic partitioning, which splits Bn = (α, β) as partition AD (Bn, Xn) def= (α, γ), (γ, β), (17) where γ is chosen such that P((α, γ)) = P((γ, β)), which is always possible for continuous P. We refer to this partitioning as dyadic, and the corresponding algorithm AD* coding, because for every node n in the search tree with Bn = (α, β), we have that (FP (α), FP (β)) forms a dyadic subinterval of (0, 1), where FP is the CDF of P. Algorithm 1: A* coding. Blue parts show modifications of A* sampling (Maddison et al., 2014). Input :Target Q, proposal P, bounding function M, maximum search depth Dmax. (LB, X , k) ( , null, 1) Π Priority Queue (D1, H1) (1, 1) G1 TG (0, ) X1 P M1 M(R) Π.push(1, G1 + M1) while ! Π.empty() and LB < Π.top Priority() do n Π.pop Highest() LBn Gn + (d Q/d P)(Xn) if LB < LBn then (LB, X ) (LBn, Xn) if Dn Dmax then L, R partition(Bn, Xn) for B {L, R} do (k, Bk) (k + 1, B) ( 2Hn if B = L 2Hn + 1 if B = R Gk TG (log P(Bk), Gn) Xk P|Bk if LB < Gk + Mn then Mk M(Bk) if LB < Gk + Mk then Π.push(k, Gk + Mk) end end end end end return (X , H ) We conjecture, that the expected runtime of AD* is O(D [Q P]) and in Section 5 we provide thorough ex- perimental evidence for this. Depth-limited A* coding: There is a natural way to set up an approximate REC algorithm based on A*, which takes |C| as an input parameter. Specifically, we can limit the maximal depth Dmax to which alg. 1 is allowed to search. The number of nodes in a complete binary tree of depth D is 2D 1, so setting Dmax = |C| ensures that each node can be encoded using a heap index with |C| bits. By limiting the search depth, A* coding returns the optimal node up to depth Dmax instead of the global optimum. However, if we set Dmax large enough, then depth-limited algorithm should also be able to find the global optimum. 
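As a concrete illustration of the heap indexing in eq. (9) and of why a depth limit caps the codelength, the sketch below (our own helper functions, not part of Algorithm 1) computes the heap index of a node from its left/right path and reads its depth off the bit length of that index:

```python
def heap_index(path):
    # Heap index (eq. 9) of the node reached from the root by a left/right path,
    # e.g. "RLR" means: go right, then left, then right.
    h = 1
    for step in path:
        h = 2 * h + (step == "R")
    return h

def depth(h):
    return h.bit_length()     # D_n = floor(log2 H_n) + 1, with D_root = 1

h = heap_index("RLR")
print(h, depth(h), format(h, "b"))   # -> 13 4 1101
```

Dropping the leading 1 of the binary index recovers the left/right path, so any node of depth at most Dmax is decodable from Dmax bits; and once Dmax exceeds the depth of the optimal node, limiting the depth changes nothing.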
This intuition is made precise in the following lemma. Lemma 2. Let Q and P be the target and proposal distributions passed to A* coding (Algorithm 1). Let H be the heap index returned by unrestricted A* coding and H d be the index returned by its depth-limited version with Dmax = d. Then, conditioned on the public random sequence S, we have H d H . Further, there exists D N such that for all d > D we have H d = H . Proof. See Appendix E for proof. If the depth of the global optimum is larger than Dmax, then depth-limited A* coding will not return an exact sample. As we reduce Dmax, we force the algorithm to return increasingly sub-optimal solutions, which correspond to more biased samples. It is therefore important to quantify the trade-off between Dmax and sample bias. Theorem 3 bounds the sample bias of depth-limited AD* coding, which we refer to as DAD*, as a function of Dmax. This result is similar to existing bounds for MRC and ORC. Theorem 3 (Biasedness of DAD* coding). Let Q and P be the target and proposal distributions passed to DAD* coding (Algorithm 1). Let K def= DKL[Q P]2 , D def= K + t, N def= 2D (18) where t is a non-negative integer, r = d Q/d P and Y Q. Let f be a measurable function and define EQ[f 2]. (19) Let e QD the distribution of the approximate sample returned by DAD* coding with depth-limit D. Define P [log2 r(Y ) > K + t/2] P E e QD[f] EQ[f] 2 f Qδ Proof. See Appendix D for proof. Fast Relative Entropy Coding with A* Coding Theorem 3 says that the bit budget for DAD* should be approximately DKL[Q P]2 in order to obtain reasonably low-bias samples. As we increase the budget beyond this point, we observe that in practice δ, and by extension the bias, decay quickly. In particular, as t , δ 0, and we recover exact AD*. We note that it is also possible to depthlimit other variants of A* coding, such as AS*. However, for any fixed Dmax, the sample bias of these variants will be larger than the bias of DAD*. This is because, as shown in theorem 1, if we use a different partition with ϵ > 1/2, A* coding will need to search deeper down the tree to find the global optimum. AD* achieves the lowest possible average depth out of all variants of A* coding, because it has ϵ = 1/2. Equivalently, AD* can achieve the same sample quality with a lower Dmax than any other variant of A* coding. In practice, we observed that depth limited AS* gives significantly more biased samples than AD*, in line with the above reasoning, so we did not further pursue any depth-limited variants other than DAD*. Runtime of DAD* coding: Based on our conjecture for exact AD* and the result of Lemma 2, we conjecture that DAD* runs in O(D [Q P]) time. We provide experimental evidence for this in Section 5. Using all available codewords: When Dmax = D, the number of nodes in the binary search tree of DAD* is 2D 1, which is one fewer than the number of items we can encode in D bits. In particular, the codeword 0 is never used, since heap indexing starts at 1. We can make AD* slightly more efficient by drawing 2 samples and arrival times at the root node instead of a single one. We found that this has a significant effect on the bias for small D, and becomes negligible for large D. In Section 5, we perform our experiments with this modified version of DAD*. Tying codelengths: Since the codelength is a parameter of DAD*, we can code samples from different variables using the same codelength, grouping them in a block and passing Dmax = |C| to Algorithm 1 for each variable in the block. 
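A minimal sketch of what such a block code could look like, assuming for illustration an 8-bit header for the shared codelength (the exact serialisation format is ours, not specified in the paper):

```python
def pack_block(heap_indices, d_max, header_bits=8):
    # One header carrying the shared depth limit d_max, then exactly d_max bits
    # per heap index (indices start at 1, so they always fit in d_max bits).
    assert all(1 <= h < 2 ** d_max for h in heap_indices)
    bits = format(d_max, f"0{header_bits}b")
    bits += "".join(format(h, f"0{d_max}b") for h in heap_indices)
    return bits

def unpack_block(bits, n_vars, header_bits=8):
    d_max = int(bits[:header_bits], 2)
    body = bits[header_bits:]
    return [int(body[i * d_max:(i + 1) * d_max], 2) for i in range(n_vars)]

code = pack_block([5, 12, 1, 9], d_max=4)
print(len(code), unpack_block(code, n_vars=4))   # 24 bits in total -> [5, 12, 1, 9]
```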
Since the variables have the same codelength, we only need to communicate this codelength once per block.

4. Iso KL layers and VAEs

Tying KL divergences: Although DAD* can be used to tie together the codelengths of different samples, Theorem 3 suggests that the search depth |C| used in DAD* affects the sample bias. In order to obtain low-bias samples, we must set |C| ≈ D_KL[Q‖P]. If we group variables with different KL divergences and code them using the same |C|, one of the two following unwanted effects might occur: (1) if D_KL[Q‖P] ≫ |C| for a variable, then the corresponding sample will be highly biased; (2) if D_KL[Q‖P] ≪ |C| for a variable, then an excessive codelength is being used to code its sample, which is inefficient. To avoid these cases, it would be useful if |C| ≈ D_KL[Q‖P] for all variables in a block. We can achieve this by constraining the KL divergences of all variables in a block to be equal.

Figure 1. Two different ways of parameterizing a Gaussian variational posterior Q. Left: the usual mean-variance parameterization. Right: Iso KL layer using our proposed mean-KL parameterization, where the latent dimensions are divided into b blocks and the KLs are shared within each block.

Suppose Q and P are diagonal Gaussians over (X_1, . . . , X_N) with means (μ_1, . . . , μ_N) and (ν_1, . . . , ν_N), and variances (σ_1², . . . , σ_N²) and (ρ_1², . . . , ρ_N²), respectively. We can parameterize Q such that

D_KL[N(μ_n, σ_n²) ‖ N(ν_n, ρ_n²)] = κ and σ_n < ρ_n (21)

for each n = 1, . . . , N, by setting

|μ_n − ν_n| < ρ_n √(2κ), (22)
σ_n² = −ρ_n² W(−exp(Δ_n² − 2κ − 1)), (23)

where Δ_n = (μ_n − ν_n)/ρ_n and W is the principal branch of the Lambert W function (Lambert, 1758). Note that the condition σ_n < ρ_n ensures that D_∞[Q‖P] < ∞. While W is not elementary, it can be computed numerically in a fast and efficient way. We refer to this as an Iso KL Gaussian layer (see Figure 1), and call a VAE model using such a Q an Iso KL VAE (IKVAE). Although we focus on Gaussians, this approach can be extended to other distributions. See Appendix F for implementation details and mathematical details on deriving the necessary quantities for Iso KL layers.
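A small sketch of the computation in eqs. (21)-(23), with κ in nats and SciPy's Lambert W; the helper names are ours and this is a simplified stand-in for the Appendix F implementation, not a copy of it:

```python
import numpy as np
from scipy.special import lambertw

def iso_kl_sigma(mu, nu, rho, kappa):
    # sigma such that KL[N(mu, sigma^2) || N(nu, rho^2)] = kappa (in nats) and
    # sigma < rho, via eq. (23). Requires |mu - nu| < rho * sqrt(2 * kappa), eq. (22).
    delta = (mu - nu) / rho
    assert np.all(delta**2 < 2 * kappa), "KL target too small for this mean offset"
    s = -lambertw(-np.exp(delta**2 - 2 * kappa - 1), k=0).real   # principal branch
    return rho * np.sqrt(s)

def gauss_kl(mu, sigma, nu, rho):
    return np.log(rho / sigma) + (sigma**2 + (mu - nu)**2) / (2 * rho**2) - 0.5

mu = np.array([0.3, -0.8, 1.1])
sigma = iso_kl_sigma(mu, nu=0.0, rho=1.0, kappa=0.7)
print(gauss_kl(mu, sigma, 0.0, 1.0))   # all entries approx. 0.7
```

Because the principal branch evaluates to a value in (−1, 0) here, the recovered ratio σ_n²/ρ_n² always lies in (0, 1), which is exactly the σ_n < ρ_n condition in eq. (21).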
5. Experiments

Experiments for exact REC: We conducted experiments using PFR, AS* and AD* coding to perform REC on Gaussian, uniform, and disjoint mixture-of-uniform distributions, where we systematically vary the problem parameters. Figure 2 shows measurements for PFR, AS* and AD* coding. We report the number of steps executed by each algorithm rather than wall-clock time, as the former is proportional to the runtime and is unaffected by specific differences in implementation. We also report the codelength, in bits, excluding the additional logarithmic and constant overhead.

Figure 2. Number of steps (top) and codelength (bottom) for PFR, AS* and AD* coding. Reported codelengths do not include the overhead terms. Circles show mean values, and error bars show first and third quartiles. AS* and AD* are significantly quicker than PFR.

[Figure 3 consists of three columns of panels for Gaussian Q and P with D_∞[Q‖P] = 5, 8 and 14; the x-axis is the number of extra bits beyond the baseline budget, with curves for MRC, ORC and DAD*; the top row reports the number of steps and the bottom row the bias (KL in bits).]

Figure 3. Number of steps (top) and bias (bottom) for MRC, ORC and DAD* coding. The bias is measured as the KL divergence from the empirical distribution of samples to the true target (discussion in text). DAD* has similar bias to MRC and ORC, but is much faster.

First, we observe that the number of steps taken by PFR scales exponentially with D_∞[Q‖P], as expected. This scaling renders PFR computationally intractable for practical problems. By contrast, the number of steps taken by AS* and AD* coding increases linearly with D_∞[Q‖P].

Second, we observe that the number of modes has an effect on the runtime of both AS* and AD* coding. In the top right of Figure 2 we have set Q to a mixture of uniforms with disjoint support and P to a uniform, and fixed D_∞[Q‖P] while varying the number of modes in Q. For a small number of modes, AS* and AD* are both significantly faster than PFR. As the number of modes increases, so does the number of steps executed by AS* and AD*. This trend is expected, because for unimodal Q, A* coding can quickly bound the objective for large regions of the input space which have low dQ/dP throughout. As dQ/dP becomes increasingly multimodal, the bounding function M becomes large in increasingly many disconnected regions of the search space. Thus both AS* and AD* must drill down to and search increasingly many of these regions before producing a sample, which requires a larger number of steps. By contrast, PFR is unaffected by the number of modes in dQ/dP, since it is equivalent to Global Bound A*, which retains the entire input space in a single active search branch. These results suggest that the unimodality of dQ/dP is a key attribute for enabling fast coding with AS* and AD*.

Third, we observe that the codelengths of PFR, AS* and AD* scale linearly with D_KL[Q‖P], as expected. However, in some cases AS* produces a larger mean codelength than PFR and AD*. This can be explained by the fact that the dominant term in the codelength bounds of the three algorithms is λ · D_KL[Q‖P], where PFR and AD* have λ = 1, while AS* has λ ≈ 2.41. AS* can produce larger |C| than AD* because, as Lemma 1 states, the expected rate at which its search region shrinks is slower than that of AD*. Therefore, in order to refine its search by the same amount, AS* needs a greater number of steps, leading to a larger expected |C|.

Experiments for approximate REC: We conducted experiments using MRC, ORC and DAD* coding to perform approximate REC with Gaussian Q and P, varying the bit budget allowed to the coders in addition to the baseline budget of D_KL[Q‖P]_2 bits. Figure 3 shows the effect of the additional bit budget on the number of steps and the sample bias for three problems with different D_∞[Q‖P]. We quantify the sample bias as the KL divergence, D_KL[Q̂‖Q]_2, from the empirical distribution Q̂ of the approximate samples to the true Q. In each case, we draw 100 samples and follow the method of Pérez-Cruz (2008) to estimate D_KL[Q̂‖Q]_2. We observe that in all three cases (fig. 3, bottom), while there is a slight difference in the level of bias when no extra bits are allowed, the difference in bias becomes negligible when one or more extra bits are added. However, DAD* achieves this bias far faster than MRC and ORC (fig. 3, top), making it a far more tractable method.

Table 1. ELBO and lossless compression rates (bpp) on MNIST.

| Model | # latent | Neg. ELBO | AD* | DAD* |
|-------|----------|-------------|-------------|-------------|
| VAE   | 20 | 1.43 ± 0.01 | 1.53 ± 0.01 | |
| VAE   | 50 | 1.40 ± 0.01 | 1.66 ± 0.01 | |
| IKVAE | 20 | 1.44 ± 0.01 | 1.55 ± 0.01 | 1.51 ± 0.01 |
| IKVAE | 50 | 1.44 ± 0.01 | 1.69 ± 0.01 | 1.60 ± 0.01 |
Table 2. Overhead codelength (bits) on MNIST.

| Model | # latent | AD* | DAD* |
|-------|----------|--------------|-------------|
| IKVAE | 20 | 86.31 ± 0.01 | 5.00 ± 0.01 |
| IKVAE | 50 | 195.20 ± 0.01 | 5.00 ± 0.01 |

Image compression on MNIST: We compared the performance of AD* and DAD* on image compression experiments on MNIST, using the feedforward VAE architecture of Townsend et al. (2018), with Gaussian Q and P. We also trained IKVAEs with the same architecture, using an Iso KL Gaussian Q. As Theorem 1 shows and as we discuss above, AS* has a strictly worse expected codelength than AD*, hence we did not include it in our experiments. For DAD*, we set κ = 2 based on preliminary experiments. Table 1 shows the lossless compression rates of the different model and coder combinations, from which we observe the following trends. First, the IKVAEs achieve similar ELBOs to the VAEs, suggesting that tying the KLs does not degrade model performance. Further, we observe that for the IKVAE architectures, using DAD* improves the compression rate over AD*. This is because, as corroborated by Table 2, DAD* significantly reduces codelength overheads. We do not provide results for DAD* applied to a standard VAE posterior, as it would yield strictly worse performance than AD*: the KLs in each latent dimension are different, hence not only would we need to communicate a codelength for each dimension, but DAD* also returns approximate samples from the target distribution, as opposed to AD*. We also note that in this task the number of latent dimensions is relatively small (20 or 50) compared to the number of image pixels (784). Since the overhead costs increase with the number of latent variables, we expect the savings of DAD* to be more pronounced for larger IKVAEs. Overall, this experiment demonstrates that AD* can be effective for VAEs, while DAD* and IKVAEs further reduce coding overheads.

6. Related Work

Quantization-based approaches: Most state-of-the-art methods in both lossy and lossless compression include quantization in the compression pipeline and circumvent its non-differentiability during training in some way. Current widespread approaches in lossy compression use VAEs with a particular choice of latent distributions (Ballé et al., 2017; 2018). Instead of quantizing latent representations during training, these methods perturb the latents with uniform noise, a technique known as dithering. To perform compression, the methods switch back to hard quantization. Dithering is equivalent to using a uniform variational posterior distribution, and has been demonstrated to work well in practice (Ballé et al., 2020). However, this introduces a mismatch between the training and compression phases, and it also constrains the design choices of new methods to uniform distributions. A related variant is the work of Agustsson & Theis (2020), who propose to use universal quantization (Ziv, 1985) for compression. Universal quantisation can be regarded as a REC algorithm. While this method performs very well in the experiments of Agustsson & Theis (2020), it is limited to a particular choice of distribution. Our work can be regarded as a step towards lifting the restrictions imposed by quantization-based methods.

Bits-back coding: Townsend et al. (2018) introduced a practical way to combine bits-back coding (Hinton & Van Camp, 1993) and VAEs to perform lossless data compression. The method of Townsend et al.
(2018) has later been applied to normalizing flows (Ho et al., 2019), which, together with discrete (van den Berg et al., 2020) and quantization based approaches (Zhang et al., 2021) represent the state-of-the-art in lossless image compression. However, bits-back coding is only applicable to perform lossless compression. Furthermore, it is only asymptotically efficient, meaning that it has a large constant codelength overhead that becomes insignificant as we compress larger batches of data. In contrast, our method is applicable to both lossy and lossless compression and can be used to perform one-shot compression. REC and reverse channel coding: REC was first proposed by Havasi et al. (2018), who developed MRC for compressing Bayesian neural networks, and was later extended to data compression using VAEs by Flamich et al. (2020). Both of these works were inspired by reverse channel coding (Bennett et al., 2002). REC and reverse channel coding can be viewed as the worst-case and average-case approaches to the same problem. Concretely, REC requires that for fixed proposal P and public randomness S, the codelength bound in eq. (2) holds for any target Q. In contrast, reverse channel coding assumes a distribution over a family of possible targets and requires that eq. (2) holds in expectation. The first general REC algorithm for discrete distributions was proposed by Harsha et al. (2007), however this method Fast Relative Entropy Coding with A* Coding has an impractically long runtime. Li & El Gamal (2018) developed PFR coding, an alternative based on Poisson processes, however this also is computationally impractical. Recently, Theis & Yosri (2022) proposed ORC, which combines some the benefits of MRC and PFR, but remains computationally impractical. These algorithms all share the limitation that their expected runtime is Ω(exp(DKL[Q P])). Hybrid Coding: To improve the runtime of REC/RCC algorithms, Theis & Yosri (2022) proposed hybrid coding (HC). HC is applicable whenever the support of the target is compact, and can be combined with any of the existing REC/RCC algorithms to improve their runtime by a multiplicative factor. However, HC does not change the asymptotic complexity of the REC/RCC algorithm that it is combined with. We note that while Theis & Yosri (2022) used HC in conjunction with ORC, HC can equally well be combined with A* coding. Thus the speedup that HC provides is complementary to that of A* coding. Efficient rejection sampling: While rejection sampling is a applicable to any target Q and proposal P where D [Q P] < and d Q/d P can be evaluated, the basic algorithm has O(exp(D [Q P])) expected runtime complexity (Maddison, 2016). Thus, a natural question is to ask under what assumptions it is possible to perform rejection sampling from a target Q using a proposal P efficiently. Recently, Chewi et al. (2022) consider the problem of constructing an upper envelope for rejection sampling from discrete probability distributions. In particular, they study the time complexity of constructing the envelope as a function of the alphabet size. They show that shape constraints on the target distribution, such as monotonicity and logconcavity can be utilized to design algorithms whose runtime scales logarithmically in the alphabet size. Our work is complementary to theirs, as Theorem 2 provides an initial result showing that shape constraints can be leveraged to design more efficient sampling algorithms for continuous distributions as well. 7. 
Conclusion Summary: In this work we proposed AS* coding and AD* coding, two algorithms based on A* sampling for performing REC with one-dimensional target and proposal distributions Q and P. We proved that the expected codelengths of AS* and AD* are O(DKL[Q P]) and that, whenever d Q/d P is unimodal, the expected runtime of AS* is O(D [Q P]). Experimental evidence suggests that the runtime of AD* is also O(D [Q P]). This runtime significantly improves upon the existing runtimes of existing REC algorithms, without placing severe conditions on Q and P. In addition, we proposed two methods to eliminate overhead codelength when encoding multiple samples. First, we introduced an approximate depth-limited variant of AD* coding, DAD* coding, in which the codelength of the encoder is a tunable variable, and proved an upper bound for its bias. Second, we introduced the Iso KL parameterization, in which latent dimensions are grouped into blocks, and the KL divergences of all latent dimensions in a block are constrained to be equal. DAD* together with the Iso KL parameterization allow us to encode multiple samples with the same codelength, thereby amortising coding overheads, while maintaining low sample bias. Experimentally, we demonstrated the favourable runtimes of AS* and AD* coding on extensive toy experiments. We also showed that DAD* coding achieves levels of bias comparable to existing approximate REC algorithms, while maintaining a significantly faster runtime. On lossless image compression experiments on MNIST, DAD* together with an Iso KL VAE (IKVAE) parameterization achieved a compression rate close to the theoretically optimal ELBO. Further work: One of the central remaining questions of this work is the runtime of AD* coding. Based on our experiments, we conjecture that the runtime of AD* coding is also O(D [Q P]) whenever d Q/d P is unimodal, however this remains to be shown. In general, for fixed DKL[Q P] we can have arbitrarily high D [Q P], hence a second, more general question is if there exists a REC algorithm with O(DKL[Q P]) expected runtime. Adaptive rejection sampling (Gilks & Wild, 1992), OS* sampling (Dymetman et al., 2012) and ideas from (Theis & Yosri, 2022) could be good starting points for developing such an algorithm. Another promising direction is to apply the Iso KL parameterization to larger VAEs, such as those used by Townsend et al. (2019), and scale our approach up to real-world compression tasks. Lastly, our methods can also be readily applied to lossy compression. 8. Author Contributions GF discovered that A* sampling can be modified to obtain A* coding (alg. 1), which can be used to perform relative entropy coding, and provided proofs for theorems 1 and 3. SM provided a proof for theorem 2. GF and SM contributed equally to the experiments and the writing of this paper. JMH supervised and steered the project. 9. Acknowledgements We would like to thank Rich Turner and Lennie Wells for useful feedback on an early manuscript of this paper. GF acknowledges funding from Deep Mind. SM acknowledges funding from the Vice Chancellor s & George and Marie Vergottis scholarship of the Cambridge Trust. Fast Relative Entropy Coding with A* Coding Agustsson, E. and Theis, L. Universally quantized neural compression. Advances in Neural Information Processing Systems, 33, 2020. Ball e, J., Laparra, V., and Simoncelli, E. P. End-to-end optimized image compression. In International Conference on Learning Representations, 2017. Ball e, J., Minnen, D., Singh, S., Hwang, S. 
J., and Johnston, N. Variational image compression with a scale hyperprior. In International Conference on Learning Representations, 2018. Ball e, J., Chou, P. A., Minnen, D., Singh, S., Johnston, N., Agustsson, E., Hwang, S. J., and Toderici, G. Nonlinear transform coding. IEEE Journal of Selected Topics in Signal Processing, 15(2):339 353, 2020. Bennett, C. H., Shor, P. W., Smolin, J. A., and Thapliyal, A. V. Entanglement-assisted capacity of a quantum channel and the reverse Shannon theorem. IEEE Transactions on Information Theory, 48(10):2637 2655, 2002. Blundell, C., Cornebise, J., Kavukcuoglu, K., and Wierstra, D. Weight uncertainty in neural networks. In Bach, F. and Blei, D. (eds.), Proceedings of the 32nd International Conference on Machine Learning, volume 37 of Proceedings of Machine Learning Research, pp. 1613 1622, Lille, France, 07 09 Jul 2015. PMLR. Brezinski, C. Extrapolation algorithms and Pad e approximations: a historical survey, 1994. Chatterjee, S. and Diaconis, P. The sample size required in importance sampling. The Annals of Applied Probability, 28(2):1099 1135, 2018. Chewi, S., Gerber, P. R., Lu, C., Le Gouic, T., and Rigollet, P. Rejection sampling from shape-constrained distributions in sublinear time. In International Conference on Artificial Intelligence and Statistics, pp. 2249 2265. PMLR, 2022. Corless, R. M., Gonnet, G. H., Hare, D. E., Jeffrey, D. J., and Knuth, D. E. On the lambertw function. Advances in Computational mathematics, 5(1):329 359, 1996. Dymetman, M., Bouchard, G., and Carter, S. The OS* algorithm: a joint approach to exact optimization and sampling. ar Xiv preprint ar Xiv:1207.0742, 2012. Flamich, G., Havasi, M., and Hern andez-Lobato, J. M. Compressing images by encoding their latent representations with relative entropy coding. Advances in Neural Information Processing Systems, 33, 2020. Gilks, W. R. and Wild, P. Adaptive rejection sampling for gibbs sampling. Journal of the Royal Statistical Society: Series C (Applied Statistics), 41(2):337 348, 1992. Grimmett, G. and Stirzaker, D. Probability and random processes. Oxford University Press, U.S.A., 2001. Grimmett, G. and Welsh, D. Probability: an introduction. Oxford University Press, 2014. Harsha, P., Jain, R., Mc Allester, D., and Radhakrishnan, J. The communication complexity of correlation. In Twenty-Second Annual IEEE Conference on Computational Complexity (CCC 07), pp. 10 23. IEEE, 2007. Havasi, M., Peharz, R., and Hern andez-Lobato, J. M. Minimal random code learning: Getting bits back from compressed model parameters. In International Conference on Learning Representations, 2018. Hinton, G. E. and Van Camp, D. Keeping the neural networks simple by minimizing the description length of the weights. In Proceedings of the sixth annual conference on Computational learning theory, pp. 5 13, 1993. Ho, J., Lohn, E., and Abbeel, P. Compression with flows via local bits-back coding. Advances in Neural Information Processing Systems, 32:3879 3888, 2019. Hoogeboom, E., Peters, J., van den Berg, R., and Welling, M. Integer discrete flows and lossless compression. Advances in Neural Information Processing Systems, 32:12134 12144, 2019. Kingma, D. P. and Welling, M. Auto-encoding variational Bayes. International Conference on Learning Representations, 2014. Kingman, J. Poisson Processes. Oxford Studies in Probability. Clarendon Press, 1992. ISBN 9780191591242. Lambert, Johann, H. Observationes variae in mathesin puram. Acta Helvetica Physico-Mathematico-Anatomico Botanico-Medica, 3:128 168, 1758. 
Land, A. and Doig, A. An automatic method of solving discrete programming problems. Econometrica, 28(3): 497 520, 1960. Li, C. T. and El Gamal, A. Strong functional representation lemma and applications to coding theorems. IEEE Transactions on Information Theory, 64(11):6967 6978, 2018. Maddison, C. Poisson process model for Monte Carlo. Perturbation, Optimization, and Statistics, pp. 193 232, 2016. Fast Relative Entropy Coding with A* Coding Maddison, C. J., Tarlow, D., and Minka, T. A* sampling. Advances in Neural Information Processing Systems, 27: 3086 3094, 2014. Markou, S. Notes on the runtime of A* sampling. ar Xiv preprint ar Xiv.2205.15250, 2022. Papandreou, G. and Yuille, A. L. Perturb-and-MAP random fields: Using discrete optimization to learn and sample from energy models. In 2011 International Conference on Computer Vision, pp. 193 200. IEEE, 2011. P erez-Cruz, F. Kullback-Leibler divergence estimation of continuous distributions. In 2008 IEEE international symposium on information theory, pp. 1666 1670. IEEE, 2008. Theis, L. and Agustsson, E. On the advantages of stochastic encoders. In Neural Compression Workshop at ICLR, 2021. Theis, L. and Yosri, N. Algorithms for the communication of samples. In International Conference on Machine Learning, 2022. Townsend, J., Bird, T., and Barber, D. Practical lossless compression with latent variables using bits back coding. In International Conference on Learning Representations, 2018. Townsend, J., Bird, T., Kunze, J., and Barber, D. Hilloc: lossless image compression with hierarchical latent variable models. In International Conference on Learning Representations, 2019. van den Berg, R., Gritsenko, A. A., Dehghani, M., Sønderby, C. K., and Salimans, T. IDF++: Analyzing and improving integer discrete flows for lossless compression. In International Conference on Learning Representations, 2020. Zhang, S., Kang, N., Ryder, T., and Li, Z. iflow: Numerically invertible flows for efficient lossless compression via a uniform coder. Advances in Neural Information Processing Systems, 34, 2021. Ziv, J. On universal quantization. IEEE Transactions on Information Theory, 31(3):344 347, 1985. Fast Relative Entropy Coding with A* Coding A. Proof of Theorem 1 Notation: Throughout the appendix, we will write [a : b] def= [a, b] N for a, b N and [a] def= [1 : a]. Furthermore, for a vector we will write x1:n def= (x1, . . . xn) for n N. We make use of the top-down construction of Gumbel processes (Algorithm 2; Maddison et al., 2014). The top-down construction realizes samples from a base distribution P along with their associated Gumbel values, which together form a Gumbel process. A Gumbel process can be thought of as a generalization of the Gumbel-Max trick (Papandreou & Yuille, 2011), where the log-probability of each member of a sample space is perturbed with i.i.d. Gumbel noise. Then, it can be shown that the maximum of this process is Gumbel distributed, and the argmaximum has law P. Gumbel processes can be shown to be equal in distribution to exponential races, where the time variable is mapped to its negative logarithm (Maddison, 2016). This is important, as it allows us to switch between the Gumbel and Poisson process representations, to leverage existing results in our analysis. Algorithm 2 realizes its Gumbel process using a space partitioning binary tree construction while also recording the depths and heap indices of nodes. It is therefore an extension of Algorithm 1 in Maddison et al. (2014). 
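For intuition, the following is a schematic sketch of the kind of top-down realisation Algorithm 2 performs, specialised to a uniform base measure on (0, 1) with a dyadic split; it is not the paper's Algorithm 2, and the helper names are ours:

```python
import heapq
import numpy as np

rng = np.random.default_rng(2)

def trunc_gumbel(mu, kappa, rng):
    # G ~ Gumbel(mu) truncated to (-inf, kappa], via -log(T + exp(-kappa)), T ~ Exp(e^mu).
    t0 = 0.0 if np.isinf(kappa) else np.exp(-kappa)
    return -np.log(rng.exponential(np.exp(-mu)) + t0)

def top_down_nodes(n_nodes, rng):
    # Expand nodes in decreasing Gumbel order, recording (heap index, depth, G, X)
    # for a uniform base measure on (0, 1) and a dyadic split.
    root_g = trunc_gumbel(0.0, np.inf, rng)          # log P((0, 1)) = 0
    frontier = [(-root_g, 1, 1, (0.0, 1.0))]         # max-Gumbel node pops first
    out = []
    while frontier and len(out) < n_nodes:
        neg_g, h, d, (a, b) = heapq.heappop(frontier)
        g = -neg_g
        x = rng.uniform(a, b)                        # X_n ~ P restricted to B_n
        out.append((h, d, round(g, 3), round(x, 3)))
        mid = 0.5 * (a + b)                          # dyadic split: P(L) = P(R)
        for child_h, (lo, hi) in ((2 * h, (a, mid)), (2 * h + 1, (mid, b))):
            g_child = trunc_gumbel(np.log(hi - lo), g, rng)   # TG(log P(B_child), G_parent)
            heapq.heappush(frontier, (-g_child, child_h, d + 1, (lo, hi)))
    return out

for node in top_down_nodes(6, rng):
    print(node)   # (heap index, depth, Gumbel value, sample)
```

Nodes pop off the queue in decreasing Gumbel order, so the k-th node popped has in-order index K = k, while its heap index records where it sits in the tree; this distinction between K and H is what the argument below relies on.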
Algorithm 2 can be realized using public randomness by anyone with access to the public seed S. A* sampling can be viewed as performing a binary tree search on the Gumbel process with measure P, as realized by Algorithm 2, to search for a sample with distribution Q. The key observation is that for the search to proceed, the whole realization of the Gumbel process with measure P is not needed, and in fact it can be realized on-the-go. A* coding (Algorithm 1) first runs the regular A* sampling procedure by simulating the Gumbel process with the proposal measure P using the publicly available randomness S. Then, it encodes the returned sample using the heap index Hn of the node n with which the sample is associated. Since any node with a given heap index can be simulated without reference to Q using Algorithm 2, the code returned by A* coding is always uniquely decodable given S, and the correctness of A* sampling (Maddison et al., 2014) will guarantee that the sample A* coding returns has the correct distribution Q. PFR and ORC also operate on Gumbel/Poisson processes, however, they use a different encoding process. They encode the index K of a sample as opposed to its heap index H. For a sample x, K is obtained by sorting the arrival times in the Gumbel/Poisson process with measure P, and returning the index associated with x in the sorted list. PFR and ORC obtain K easily, because they realize the Gumbel/Poisson process in-order (for example, see Algorithm 3 in (Maddison et al., 2014)). Similarly, Algorithm 2 also constructs the process in order, however, it uses the top-down construction. Theorem 1 shows that A* coding is not only correct and uniquely decodable, but its expected codelength is also optimal. However, to show this, we first show the following intermediate result, which relates the index K of a sample to its expected depth D in the top-down construction. Lemma 3 (Average depth of nodes in a Gumbel process). Let P be a Borel probability measure over some Polish space that is supplied to Algorithm 2. Assume that partition satisfies the following property: there exists ϵ [1/2, 1), such that for any node n we have E[P(Bn)] ϵDn, (24) where the expectation is taken over the joint distribution of the samples associated with the ancestor nodes of n. Let Kn be the index of a node n realized by Algorithm 2 and let Dn be its depth in the tree. Then, E[Dn | Kn] logϵ Kn. (25) Before delving into the proof, we clarify the allowed domain of ϵ. First note that ϵ is solely a property of partition. Note that fixing ϵ = 1 would imply that the bounds do not shrink. The reason why ϵ 1/2 is because if partition breaks some region B into L and R with P(L) = ϵLP(B), then necessarily P(R) = ϵRP(B) = (1 ϵL)P(B), hence we need to take ϵ = max{ϵL, ϵR}. This means that the minimal ϵ is achieved when ϵL = ϵR = 1/2. Finally, since partition might depend on the samples drawn up to reaching the node with bound B, we need to take expectation over these. Proof. Let Fk = (f1, . . . fk) be the frontier of alg. 2 after k steps. We shall say that alg. 2 or alg. 1 expand a node, which means that they pop off the highest priority node from their priority queue. Fk consists of all the nodes that could be expanded, starting with F1 only containing the root node. In the search literature Fk is also commonly referred to as the open set of nodes. A simple inductive argument shows, that for a binary tree on k nodes will always have k + 1 nodes on its frontier, i.e. |Fk| = k. 
Fast Relative Entropy Coding with A* Coding Let {(Xi, Gi)}k 1 i=1 be the samples and Gumbels realized by alg. 2, sorted in descending order by the Gis up to the k 1 largest one. Let µf = log P(Bf) be the location parameter of the truncated Gumbel variate Gf for a node f Fk. Maddison et al. (2014) show (see their Appendix, the section titled Equivalence Under partition ), that regardless of the choice of partition, f Fk Gf TG (µf, Gk 1) . (26) Let Fk Fk denote the node that is expanded in step k, i.e. Fk arg max f Fk {Gf}. (27) Then a simple Gumbel-max trick-type argument shows (Maddison et al., 2014), that p(Fk = f | G1:k 1, X1:k 1) = exp(µf) P ϕ Fk exp(µϕ) = exp(log P(Bf)) P ϕ Fk exp(log P(Bϕ)) where the last equality holds because the bounds associated with the nodes on Fk form a partition of the whole sample space for any k. Then, Ep(Fk=f|G1:k 1,X1:k 1)[ µf] = X f Fk P(Bf)µf f Fk P(Bf) log P(Bf) The last equality follows, since the maximal Shannon entropy of a distribution over k items is the uniform distribution with entropy log k. Now, taking expectations over G1:k 1, X1:k 1, we get Ep(Fk=f,G1:k 1,X1:k 1|k)[ µf] log k. (30) Finally, log k Ep(Fk=f,G1:k 1,X1:k 1|k)[ µf] = Ep(Fk=f,G1:k 1,X1:k 1|k)[ log P(Bf)] log Ep(Fk=f,G1:k 1,X1:k 1|k)[P(Bf)] = Dk log ϵ, where the second inequality holds by Jensen s inequality and the third inequality holds by our assumption on partition. Since 0 < log ϵ, rearranging the two sides of the inequality gives the desired result. With this result in mind, we are now ready to prove Theorem 1, which we state again for completeness: Theorem 4 (Expected codelength of A* coding). Let Q and P be the target and proposal measures passed to A* coding (Algorithm 1). Assume that partition satisfies the following property: there exists ϵ [1/2, 1) such that for any node n we have E[P(Bn)] ϵDn, (32) Fast Relative Entropy Coding with A* Coding where the expectation is taken over the joint distribution of the samples associated with the ancestor nodes of n. Let k be the node returned by A* coding. Then, we have E[Dk] 1 log ϵ DKL[Q P] + e 1 + log 2 . (33) In particular, when ϵ = 1/2, E[Dk] DKL[Q P]2 + e 1 log2 e + 1. (34) Proof. Let K be the random variable that represents the index (not the heap index) of the sample returned by A* coding. Maddison (2016) show that K is equal in distribution to the index returned by running global bound A* sampling, which is equal in distribution to the index returned by PFR coding. In Li & El Gamal (2018), Appendix, section A it is shown, that E[log K] DKL[Q P] + e 1 + log 2. (35) Putting this together with Lemma 3, we get the desired result. B. Proof of Lemma 1 In the main text, the result is stated for AS* sampling. However, the result does not depend on the target Q, only the realization of the Gumbel process with base measure P in alg. 2, hence we restate the lemma here as follows: Lemma 4. Let P be a non-atomic proposal measure over a 1-dimensional sample space, passed to Algorithm 2, and let partition be as defined in eq. (13). Then the condition in eq. (10) is satisfied with ϵ = 3/4, that is Proof. Let F denote the CDF of the measure P. we will prove the claim by induction. For the base case, note, that depth D = 0 can be associated with not having drawn any samples yet, i.e. the next node that is expanded by alg. 2 will be the root node. Since the sample is drawn from the whole space, we will have P(B1) = P(Ω) = 1 = (3/4)0. For the hypothesis, assume the claim holds for D = d. Let D = d + 1. Fix a node n such that Dn = d + 1. 
Let A(n) = (n1, . . . , n Dn 1) denote the ancestors of n, where n1 is the root of the tree and n Dn 1 = par(n) is the direct parent node of n in the tree constructed by alg. 2. Then, by the law of iterated expectations, we have Ep(XA(n))[P(Bn)] = Ep(XA(par(n)))[Ep(Xpar(n)|XA(par(n)))[P(Bn)]]. (37) Focusing on the inner expectation, let Bpar(n) = (a, b). Then, by the definition of partition, L = (a, Xn), R = (Xn, b). Note, that since Bn is either L or R, we have P(Bn) max{P(L), P(R)}. Furthermore, since the space is 1 dimensional, we get P(L) = F(Xn) F(a) and P(R) = F(b) F(Xn). Let U (c, d) denote the uniform density on (c, d). Let α = F(a), β = F(b). Then, by the generalized probability integral transform, we find, that U def= F(Xn) U (α, β). Thus, by the law of the unconscious statistician, the inner expectation of eq. (37) can be rewritten as Ep(Xpar(n)|XA(par(n)))[P(Bn)] Ep(U)[max{U α, β U}] max{u α, β u} 4P(Bpar(n)). Now, by the induction hypothesis eq. (37) becomes 3 4Ep(XA(par(n)))[P(Bn)] 3 which concludes the proof. Fast Relative Entropy Coding with A* Coding C. Proof of Theorem 2 Note: Our original argument for the linear runtime of A* coding contained an error. Markou (2022) provided a proof for theorem 2, which we reproduce here. Overview: The proof breaks down the execution of AS* coding into two stages. For the first stage, we consider how AS* shrinks its search bounds, until it obtains a sufficiently good candidate sample. Here, a sufficiently good sample is a sample which falls within a predefined super-level set of the density ratio. Lemma 8 gives an upper bound on the expected number of steps in this first stage of the algorithm. For the second stage, we quantify how many additional steps AS* must subsequently make until it terminates, after obtaining a good candidate sample in the first stage. Lemma 11 gives an upper bound on the expected number of steps in this second stage of the algorithm. Putting lemmas 8 and 11 together, we obtain an upper bound on the runtime of AS*, stated in corollary 1. This bound depends on, and holds for any, super-level set. Therefore, we can minimise this bound over all super-level sets of the density ratio. Lastly, we show that even for the worst case density ratios of this bound, this minimum results in a runtime is linear in the -divergence, resulting in theorem 2. Notation: In this section, all indices to random variables are integers corresponding to the depth of the variable within the binary tree being searched. This is in contrast to other sections where the random variables are indexed by the node of the binary tree to which they belong. We found this notational overloading makes the exposition clearer, and the meaning of the indexing should be clear from the context. Assumption 1 (Continuous distributions, finite D ). We assume that measures Q and P describe continuous random variables, so their densities q and p exist. Since P Q, the Radon-Nikodym derivative r(x) = (d Q/d P)(x) also exists. We also assume r(x) is unimodal and satisfies D [Q||P] = log sup x R d Q d P (x) = log rmax < . (40) Without loss of generality, we can also assume P to be the uniform measure on [0, 1], as shown by the next lemma. This is because we can push P and Q through the CDF of P to ensure P is uniform, while leaving the Radon-Nikodym derivative unimodal and the -divergence unchanged. Lemma 5 (Without loss of generality, P is uniform). Suppose Q is a target measure and P a proposal measure as specified in Assumption 1. 
Let $\Phi$ be the CDF associated with $P$ and consider the measures $P', Q'$ on $[0, 1]$ defined as

$P' = P \circ \Phi^{-1}$ and $Q' = Q \circ \Phi^{-1}$.  (41)

Then, $P'$ is the uniform measure on $[0, 1]$. Further, the Radon–Nikodym derivative $(\mathrm{d}Q'/\mathrm{d}P')(z)$ is unimodal, and

$\log \sup_{z \in [0,1]} \frac{\mathrm{d}Q'}{\mathrm{d}P'}(z) = \log \sup_{x \in \mathbb{R}} \frac{\mathrm{d}Q}{\mathrm{d}P}(x)$.  (42)

Proof. First, $P'$ is the uniform measure on $[0, 1]$, since for any $z \in [0, 1]$

$P'([0, z]) = P \circ \Phi^{-1}([0, z]) = P((-\infty, \Phi^{-1}(z)]) = \Phi(\Phi^{-1}(z)) = z$.  (43)

Now, let the densities of $Q$ and $P$ be $q$ and $p$, and the densities of $Q'$ and $P'$ be $q'$ and $p'$. Then, by the change of variables formula,

$p'(z) = p(\Phi^{-1}(z))\,(\Phi^{-1})'(z)$ and $q'(z) = q(\Phi^{-1}(z))\,(\Phi^{-1})'(z)$.  (44)

Therefore, we have

$\frac{\mathrm{d}Q'}{\mathrm{d}P'}(z) = \frac{q'(z)}{p'(z)} = \frac{q(\Phi^{-1}(z))}{p(\Phi^{-1}(z))} = \frac{\mathrm{d}Q}{\mathrm{d}P}\!\left(\Phi^{-1}(z)\right)$.  (45)

Now, since $(\mathrm{d}Q/\mathrm{d}P)(x)$ is a unimodal function of $x$ and $\Phi^{-1}(z)$ is increasing in $z$, the function $(\mathrm{d}Q/\mathrm{d}P) \circ \Phi^{-1}(z)$ is unimodal in $z$. Also, taking the supremum and logarithm of both sides,

$\log \sup_{z \in [0,1]} \frac{\mathrm{d}Q'}{\mathrm{d}P'}(z) = \log \sup_{z \in [0,1]} \frac{\mathrm{d}Q}{\mathrm{d}P}\!\left(\Phi^{-1}(z)\right) = \log \sup_{x \in \mathbb{R}} \frac{\mathrm{d}Q}{\mathrm{d}P}(x)$,  (46)

arriving at the result.

We now define the super-level sets of the density ratio, and the super-level set width function, on which the argument relies.

Definition 2 (Super-level set, width). We define the super-level set function $S : [0, 1] \to 2^{[0,1]}$ as

$S(\gamma) = \{x \in [0, 1] \mid r(x) \ge \gamma r_{\max}\}$,  (47)

and let $x_{\max}$ denote an arbitrary maximiser of the density ratio, i.e. $x_{\max} \in \arg\max_{x \in [0,1]} r(x)$. We also define the super-level set width function $w : [0, 1] \to [0, 1]$ as

$w(\gamma) = \inf\{\delta \in [0, 1] \mid \exists z \in [0, 1],\ S(\gamma) \subseteq [z, z + \delta]\}$.  (48)

Because width functions are defined in terms of a ratio of probability densities, they satisfy certain properties, stated in Lemma 6 and proved below. We use these properties later to prove Lemma 12.

Lemma 6 (Properties of $w$). The width function $w(\gamma)$ is non-increasing in $\gamma$ and satisfies

$\int_0^1 w(\gamma)\,\mathrm{d}\gamma = \frac{1}{r_{\max}}$ and $w(0) \ge \frac{1}{r_{\max}}$.  (49)

Proof. First, we note that if $\gamma_1 \le \gamma_2$, then $S(\gamma_2) \subseteq S(\gamma_1)$, which implies $w(\gamma_2) \le w(\gamma_1)$. Therefore

$\gamma_1 \le \gamma_2 \implies w(\gamma_2) \le w(\gamma_1)$,  (50)

so $w(\gamma)$ is non-increasing in $\gamma$. Second, let $A = [0, 1] \times [0, r_{\max}]$, define $B = \{(x, y) \in A \mid y \le r(x)\}$ and consider the integral

$\int_A \mathbf{1}((x, y) \in B)\,\mathrm{d}(x, y)$.  (51)

Since the integrand is a non-negative measurable function, by Fubini's theorem we have

$\int_0^1 \int_0^{r_{\max}} \mathbf{1}((x, y) \in B)\,\mathrm{d}y\,\mathrm{d}x = \int_0^{r_{\max}} \int_0^1 \mathbf{1}((x, y) \in B)\,\mathrm{d}x\,\mathrm{d}y$  (52)
$\int_0^1 r(x)\,\mathrm{d}x = \int_0^{r_{\max}} w(y / r_{\max})\,\mathrm{d}y$  (53)
$\int_0^1 q(x)\,\mathrm{d}x = \int_0^1 w(\gamma)\, r_{\max}\,\mathrm{d}\gamma$  (54)
$\int_0^1 w(\gamma)\,\mathrm{d}\gamma = r_{\max}^{-1}$.  (55)

Last, since $w$ is non-increasing, we have $w(0) \ge r_{\max}^{-1}$, because otherwise $\int_0^1 w(\gamma)\,\mathrm{d}\gamma < r_{\max}^{-1}$.

Now we define the two stages into which we break down the execution of AS* coding. In particular, we define $N(\gamma)$ as the number of steps required until AS* produces a sample in the super-level set $S(\gamma)$, and we define $K(\gamma)$ as the number of subsequent steps required for AS* to terminate.

Definition 3 (# steps to $S(\gamma)$, # residual steps). Suppose AS* is applied to a target–proposal pair $Q, P$ satisfying Assumption 1, producing a sequence of samples $X_1, X_2, \ldots$. We use $T \in \mathbb{Z}_{>0}$ to denote the total number of steps taken by AS* until it terminates, and define the random variables

$N(\gamma) = \min\{n \in \mathbb{Z}_{>0} \mid X_n \in S(\gamma)\}$ and $K(\gamma) = \max\{0, T - N(\gamma)\}$.  (56)

Because the bounds of AS* shrink exponentially quickly, we can bound the probability that AS* is still in its first stage after $n$ steps by a quantity which also shrinks exponentially, as stated in Lemma 7.

Lemma 7 (Upper bound on the probability of $P(B_n) \ge w(\gamma)$). Let $Z_n = P(B_n)$. Then

$\mathbb{P}(Z_n \ge w(\gamma)) \le \frac{1}{w(\gamma)}\left(\frac{3}{4}\right)^{n}$.  (57)

Proof. Noting that $Z_n \ge 0$, we apply Markov's inequality and Lemma 4 to get

$\mathbb{P}(Z_n \ge w(\gamma)) \le \frac{1}{w(\gamma)}\,\mathbb{E}[Z_n] \le \frac{1}{w(\gamma)}\left(\frac{3}{4}\right)^{n}$,  (58)

as required.
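To illustrate the contraction that drives Lemma 7, the following minimal sketch (assuming NumPy; the depth, trial count and random seed are arbitrary choices) simulates the worst case used in the proof of Lemma 4: $P$ is uniform on $[0, 1]$, each bound is split at a uniformly drawn sample, and we always descend into the larger child, so the retained mass is multiplied by $\max\{U, 1 - U\}$ at every depth. The empirical mean of $P(B_d)$ stays at or below the $(3/4)^d$ bound to which Markov's inequality is applied above.

```python
import numpy as np

rng = np.random.default_rng(0)
depth, trials = 12, 100_000

# Worst case from the proof of Lemma 4: split the current interval at a uniform
# sample and keep the larger piece, i.e. multiply the mass by max{U, 1 - U}.
lengths = np.ones(trials)
for d in range(1, depth + 1):
    cut = rng.uniform(size=trials)           # sample X ~ P|_B, rescaled to (0, 1)
    lengths *= np.maximum(cut, 1.0 - cut)    # mass of the larger child
    if d in (4, 8, 12):
        print(d, lengths.mean(), 0.75 ** d)  # empirical E[P(B_d)] vs the (3/4)^d bound
```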
Now, using Lemma 7, we can upper bound the expectation of $N(\gamma)$, which depends on the logarithm of the width $w(\gamma)$.

Lemma 8 (Bound on expected $N(\gamma)$). The random variable $N(\gamma)$ satisfies

$\mathbb{E}[N(\gamma)] \le \alpha \log \frac{1}{w(\gamma)} + 6$, where $\alpha = \log(4/3)^{-1}$.  (59)

Proof. Let $N_0 = \left\lceil \frac{\log w(\gamma)}{\log(3/4)} \right\rceil + 1$, and let $B_0, B_1, \ldots$ be the bounds produced by AS*. Noting that, by the unimodality of $r$, $S(\gamma)$ is an interval with $x_{\max} \in S(\gamma)$, and $x_{\max} \in B_n$, we have

$P(B_n) < w(\gamma) \implies N(\gamma) \le n$,  (60)

that is, the event $P(B_n) < w(\gamma)$ implies the event $N(\gamma) \le n$. From this it follows that

$\mathbb{P}(P(B_n) < w(\gamma)) \le \mathbb{P}(N(\gamma) \le n)$, or equivalently, $\mathbb{P}(N(\gamma) \ge n + 1) \le \mathbb{P}(P(B_n) \ge w(\gamma))$.  (61)

Using this together with Lemma 7, we can write

$\mathbb{E}_{B_{0:\infty}}[N(\gamma)] = \sum_{n=1}^{\infty} \mathbb{P}(N(\gamma) = n)\, n$  (62)
$= \sum_{n=1}^{\infty} \mathbb{P}(N(\gamma) \ge n)$  (63)
$\le N_0 + \sum_{n=N_0+1}^{\infty} \mathbb{P}(N(\gamma) \ge n)$  (64)
$= N_0 + \sum_{n=1}^{\infty} \mathbb{P}(N(\gamma) \ge N_0 + n)$  (65)
$\le N_0 + \sum_{n=1}^{\infty} \mathbb{P}(P(B_{N_0+n-1}) \ge w(\gamma))$  (66)
$\le N_0 + \sum_{n=1}^{\infty} \frac{1}{w(\gamma)}\left(\frac{3}{4}\right)^{N_0+n-1}$  (67)
$\le N_0 + \sum_{n=1}^{\infty} \left(\frac{3}{4}\right)^{n-1}$  (68)
$= N_0 + 4$  (69)
$\le \frac{\log w(\gamma)}{\log(3/4)} + 6 = \alpha \log \frac{1}{w(\gamma)} + 6$,  (70)

where the equality of (62) and (63) is a standard identity (Grimmett & Welsh, 2014), (63) to (64) follows from the fact that probabilities are bounded above by 1, (64) to (65) follows by relabelling the indices, (65) to (66) follows from eq. (61), (66) to (67) follows from Lemma 7, and (67) to (69) follow from the definition of $N_0$ and summing the geometric series.

Now we turn to bounding the expectation of $K(\gamma)$. For this, we must consider how the difference between the upper and lower bounds maintained by the search shrinks. To do so, we will use Lemma 10; Lemma 9 is an intermediate result, which we use to show Lemma 10.

Lemma 9 (Exponentials and truncated Gumbels). Let $T \sim \mathrm{Exp}(\lambda)$ and $T_0 \ge 0$. Then

$Z := -\log(T + T_0) \stackrel{d}{=} G$, where $G \sim \mathrm{TG}(\log \lambda, -\log T_0)$.  (71)

Proof. Let $T \sim \mathrm{Exp}(\lambda)$, $T_0 \ge 0$ and define

$Z := -\log(T + T_0)$.  (72)

We note that $Z \le -\log T_0$. For $z \le -\log T_0$, we can apply the change of variables formula to obtain the density of $Z$. Let $p_Z$ and $p_T$ be the densities of $Z$ and $T$. Then

$p_Z(z) = p_T(t)\left|\frac{\mathrm{d}t}{\mathrm{d}z}\right|$  (73)
$= \lambda e^{-\lambda t} \left|\frac{\mathrm{d}}{\mathrm{d}z}\left(e^{-z} - T_0\right)\right|$  (74)
$= \lambda e^{-\lambda (e^{-z} - T_0)}\, e^{-z}$  (75)
$\propto e^{-(z - \kappa)} \exp\!\left(-e^{-(z - \kappa)}\right)$,  (76)

where $\kappa = \log \lambda$. Therefore $Z$ has distribution $\mathrm{TG}(\log \lambda, -\log T_0)$.

Lemma 10 (Mean of exponentiated negative truncated Gumbel). Let $B_1, \ldots, B_N$ be the first $N$ bounds produced by AS*, let $G_N$ be the $N$-th Gumbel produced by AS* and define $E_N = e^{-G_N}$. Then

$\mathbb{E}[E_N \mid B_{1:N}] = \sum_{n=1}^{N} \frac{1}{P(B_n)}$.  (77)

Proof. Define $E_n = e^{-G_n}$ for $n = 1, 2, \ldots$. By the definition of AS*, we have

$G_N \mid G_{N-1}, B_{1:N} \sim \mathrm{TG}(\log P(B_N), G_{N-1})$.  (78)

Negating $G_N$ and $G_{N-1}$, taking exponentials and applying Lemma 9, we obtain

$E_N \mid E_{N-1}, B_{1:N} \stackrel{d}{=} \tau_N + E_{N-1}$, where $\tau_N \sim \mathrm{Exp}(P(B_N))$.  (79)

Repeating this step and taking expectations, we have

$\mathbb{E}[E_N \mid B_{1:N}] = \mathbb{E}\!\left[\sum_{n=1}^{N} \tau_n \,\Big|\, B_{1:N}\right] = \sum_{n=1}^{N} \frac{1}{P(B_n)}$,  (80)

as required.
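The distributional identity in Lemma 9, which also drives the recursion behind Lemma 10, is easy to check by simulation. The sketch below (assuming NumPy; the values of $\lambda$, $T_0$ and the sample size are arbitrary) draws $Z = -\log(T + T_0)$ with $T \sim \mathrm{Exp}(\lambda)$, draws truncated Gumbels $\mathrm{TG}(\log\lambda, -\log T_0)$ by inverting their CDF, and compares quantiles.

```python
import numpy as np

rng = np.random.default_rng(1)
lam, T0, n = 0.7, 0.4, 200_000
mu, kappa = np.log(lam), -np.log(T0)        # TG location and truncation point from Lemma 9

# Left-hand side of eq. (71): Z = -log(T + T0) with T ~ Exp(lam).
z = -np.log(rng.exponential(scale=1.0 / lam, size=n) + T0)

# Right-hand side: TG(mu, kappa) sampled by inverting its CDF,
# F(g) = exp(-exp(-(g - mu))) / exp(-exp(-(kappa - mu))) for g <= kappa.
u = rng.uniform(size=n)
g = mu - np.log(np.exp(-(kappa - mu)) - np.log(u))

for q in (0.1, 0.5, 0.9):
    print(q, np.quantile(z, q), np.quantile(g, q))  # matching quantiles support equality in distribution
```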
Using Lemma 10, we can bound the expectation of $K(\gamma)$, as stated and proved in Lemma 11. Note that this bound does not depend on $N(\gamma)$, which has been marginalised out.

Lemma 11 (Bound on expected $K(\gamma)$). The random variable $K(\gamma)$ satisfies

$\mathbb{E}[K(\gamma)] \le \alpha\left(\log\frac{1}{\gamma} + \log\frac{1}{w(\gamma)}\right) + 16$, where $\alpha = \log(4/3)^{-1}$.  (81)

Proof. Let the global upper and lower bounds of AS* at step $n$ be $U_n$ and $L_n$ respectively. Then, by the definition of the upper bound of AS* coding,

$U_{N(\gamma)} = \log r_{\max} + G_{N(\gamma)}$,  (82)

and, by the definition of the lower bound of AS* coding,

$L_{N(\gamma)} = \max_{n \in [1:N(\gamma)]}\left\{\log r(X_n) + G_n\right\} \ge \log r\!\left(X_{N(\gamma)}\right) + G_{N(\gamma)} \ge \log(\gamma r_{\max}) + G_{N(\gamma)}$.  (83)

Now, for $k = 0, 1, 2, \ldots$, we have

$U_{N(\gamma)+k} - L_{N(\gamma)+k} \le 0 \implies T \le N(\gamma) + k$,  (84)

that is, the event $U_{N(\gamma)+k} - L_{N(\gamma)+k} \le 0$ implies the event $T \le N(\gamma) + k$. This is because if $U_{N(\gamma)+k} - L_{N(\gamma)+k} \le 0$, then the algorithm has terminated by step $N(\gamma) + k$, so it follows that $T \le N(\gamma) + k$. Further,

$U_{N(\gamma)+k} - L_{N(\gamma)+k} \le U_{N(\gamma)+k} - L_{N(\gamma)}$  (85)
$\le \log r_{\max} + G_{N(\gamma)+k} - \log(\gamma r_{\max}) - G_{N(\gamma)}$  (86)
$= \log\frac{1}{\gamma} + G_{N(\gamma)+k} - G_{N(\gamma)}$.  (87)

Therefore, we have

$G_{N(\gamma)+k} - G_{N(\gamma)} \le \log \gamma \implies T \le N(\gamma) + k \implies K(\gamma) \le k$,  (88)

that is, the event $G_{N(\gamma)+k} - G_{N(\gamma)} \le \log \gamma$ implies the event $K(\gamma) \le k$. This holds because if $G_{N(\gamma)+k} - G_{N(\gamma)} \le \log \gamma$, then $U_{N(\gamma)+k} - L_{N(\gamma)+k} \le 0$, which in turn implies $K(\gamma) \le k$. Therefore

$\mathbb{P}\!\left(G_{N(\gamma)+k} - G_{N(\gamma)} \le \log \gamma\right) \le \mathbb{P}(K(\gamma) \le k)$, or equivalently, $\mathbb{P}(K(\gamma) \ge k + 1) \le \mathbb{P}\!\left(G_{N(\gamma)+k} - G_{N(\gamma)} > \log \gamma\right)$.  (89)

Equation (89) upper bounds the probability that the second stage of the algorithm has not yet terminated by the probability that the Gumbel values have not yet decreased sufficiently. To proceed, we lower bound the probability of the complementary event $G_{N(\gamma)+k} - G_{N(\gamma)} \le \log \gamma$. Let $\Phi_{\mathrm{TG}}(g; \mu, \kappa)$ denote the CDF of a truncated Gumbel distribution with location parameter $\mu$ and unit scale, truncated at $\kappa$. Then

$\mathbb{P}\!\left(G_{N(\gamma)+n} - G_{N(\gamma)} \le \log \gamma \mid N(\gamma), G_{N(\gamma)}, B_{0:N(\gamma)+n}\right)$  (90)
$= \mathbb{E}_{G_{N(\gamma)+n-1}}\!\left[\mathbb{P}\!\left(G_{N(\gamma)+n} \le \log \gamma + G_{N(\gamma)} \mid N(\gamma), G_{N(\gamma)}, G_{N(\gamma)+n-1}, B_{0:N(\gamma)+n}\right)\right]$  (91)
$= \mathbb{E}_{G_{N(\gamma)+n-1}}\!\left[\Phi_{\mathrm{TG}}\!\left(\log \gamma + G_{N(\gamma)};\ \log P(B_{N(\gamma)+n}),\ G_{N(\gamma)+n-1}\right)\right]$  (92)
$\ge \Phi_{\mathrm{TG}}\!\left(\log \gamma + G_{N(\gamma)};\ \log P(B_{N(\gamma)+n}),\ \infty\right)$  (93)
$= \exp\!\left(-e^{-\left(\log \gamma + G_{N(\gamma)} - \log P(B_{N(\gamma)+n})\right)}\right)$  (94)
$= \exp\!\left(-\frac{P(B_{N(\gamma)+n})}{\gamma}\, e^{-G_{N(\gamma)}}\right)$.  (95)

Taking an expectation over $G_{N(\gamma)}$ and $B_{0:N(\gamma)+n}$, we have

$\mathbb{P}\!\left(G_{N(\gamma)+n} - G_{N(\gamma)} \le \log \gamma \mid N(\gamma)\right) \ge \mathbb{E}_{G_{N(\gamma)}, B_{0:N(\gamma)+n}}\!\left[\exp\!\left(-\frac{P(B_{N(\gamma)+n})}{\gamma}\, e^{-G_{N(\gamma)}}\right) \Big| N(\gamma)\right]$  (96)
$\ge \exp\!\left(-\frac{1}{\gamma}\,\mathbb{E}_{G_{N(\gamma)}, B_{0:N(\gamma)+n}}\!\left[P(B_{N(\gamma)+n})\, e^{-G_{N(\gamma)}} \mid N(\gamma)\right]\right)$.  (97)

Focusing on the term in the exponent, we have

$\mathbb{E}_{G_{N(\gamma)}, B_{0:N(\gamma)+n}}\!\left[P(B_{N(\gamma)+n})\, e^{-G_{N(\gamma)}} \mid N(\gamma)\right]$  (98)
$= \mathbb{E}_{B_{0:N(\gamma)-1}}\!\left[\mathbb{E}_{G_{N(\gamma)}, B_{N(\gamma):N(\gamma)+n}}\!\left[P(B_{N(\gamma)+n})\, e^{-G_{N(\gamma)}} \mid B_{0:N(\gamma)-1}, N(\gamma)\right] \Big| N(\gamma)\right]$.  (99)

Applying Lemma 4 to $P(B_{N(\gamma)+n})$, applying Lemma 10 to $e^{-G_{N(\gamma)}}$, and using the fact that the bounds are nested, so that $P(B_{N(\gamma)-1}) \le P(B_m)$ for every $m \le N(\gamma) - 1$, we obtain

$\mathbb{E}_{G_{N(\gamma)}, B_{0:N(\gamma)+n}}\!\left[P(B_{N(\gamma)+n})\, e^{-G_{N(\gamma)}} \mid N(\gamma)\right] \le \left(\frac{3}{4}\right)^{n+1} N(\gamma)$.  (102)

Substituting eq. (102) into eq. (97), we obtain

$\mathbb{P}\!\left(G_{N(\gamma)+n} - G_{N(\gamma)} \le \log \gamma \mid N(\gamma)\right) \ge \exp\!\left(-\frac{N(\gamma)}{\gamma}\left(\frac{3}{4}\right)^{n+1}\right)$,  (103)

and, taking the expectation over $N(\gamma)$ and applying Jensen's inequality together with ineq. (69), we obtain

$\mathbb{P}\!\left(G_{N(\gamma)+n} - G_{N(\gamma)} \le \log \gamma\right) \ge \exp\!\left(-\frac{N_0 + 4}{\gamma}\left(\frac{3}{4}\right)^{n+1}\right)$,  (104)

arriving at a deterministic lower bound which does not depend on any random quantities. Now we also have

$\log\frac{1}{\gamma} + \log(N_0 + 4) = \log\frac{1}{\gamma} + \log\!\left(\left\lceil\frac{\log w(\gamma)}{\log(3/4)}\right\rceil + 5\right)$  (105)
$\le \log\frac{1}{\gamma} + \log\!\left(\frac{\log w(\gamma)}{\log(3/4)} + 6\right)$  (106)
$\le \log\frac{1}{\gamma} + \log\frac{1}{w(\gamma)} + 2$,  (107)

where going from (106) to (107) can be verified numerically. Therefore, letting $K_0 = \left\lceil \frac{\log(1/\gamma) + \log(1/w(\gamma)) + 2}{\log(4/3)} \right\rceil$, we have

$\mathbb{E}[K(\gamma)] = \sum_{k=0}^{\infty} \mathbb{P}(K(\gamma) = k)\, k$  (108)
$= \sum_{k=1}^{\infty} \mathbb{P}(K(\gamma) \ge k)$  (109)
$\le K_0 + \sum_{k=K_0+1}^{\infty} \mathbb{P}(K(\gamma) \ge k)$  (110)
$= K_0 + \sum_{k=1}^{\infty} \mathbb{P}(K(\gamma) \ge K_0 + k)$  (111)
$\le K_0 + \sum_{k=1}^{\infty} \mathbb{P}\!\left(G_{N(\gamma)+K_0+k-1} - G_{N(\gamma)} > \log \gamma\right)$  (112)
$\le K_0 + \sum_{k=1}^{\infty}\left[1 - \exp\!\left(-\frac{N_0 + 4}{\gamma}\left(\frac{3}{4}\right)^{K_0+k}\right)\right]$  (113)
$\le K_0 + 4$  (114)
$\le \frac{\log \gamma}{\log(3/4)} + \frac{\log w(\gamma)}{\log(3/4)} + 16$,  (115)

where the equality of (108) and (109) is a standard identity (Grimmett & Stirzaker, 2001), (109) to (110) follows because probabilities are bounded above by 1, (110) to (111) follows by relabelling the indices, (111) to (112) follows by ineq. (89), (112) to (113) follows by ineq. (104), (113) to (114) can be verified by evaluating the sum numerically using the definition of $K_0$, and (114) to (115) follows by the definition of $K_0$.

Putting Lemmas 8 and 11 together, we obtain Corollary 1, which is a bound on the expected runtime of AS*. This bound holds for any width function $w$ and any $\gamma \in [0, 1]$. Note that whenever $\gamma = 0$ or $w(\gamma) = 0$ the bound becomes vacuous.

Corollary 1 (Upper bound on $T$ for given $w$). For any $\gamma \in [0, 1]$, the total number of steps $T$ satisfies

$\mathbb{E}[T] \le 2\alpha\left(\log\frac{1}{w(\gamma)} + 2\log\frac{1}{\gamma}\right) + 22$, where $\alpha = \log(4/3)^{-1}$.  (116)

Proof. By the definition of $N(\gamma)$ and $K(\gamma)$, we have

$T \le N(\gamma) + K(\gamma) \implies \mathbb{E}[T] \le \mathbb{E}[N(\gamma) + K(\gamma)]$,  (117)

for all $\gamma \in [0, 1]$. From Lemma 8 and Lemma 11, we have

$\mathbb{E}[T] \le \alpha\left(2\log\frac{1}{w(\gamma)} + \log\frac{1}{\gamma}\right) + 22 \le 2\alpha\left(\log\frac{1}{w(\gamma)} + 2\log\frac{1}{\gamma}\right) + 22$,  (118)

where $\alpha = \log(4/3)^{-1}$, as required. Note that in eq. (118) we intentionally loosen the bound: this makes the subsequent manipulations easier.
While a more careful analysis may result in a tighter bound, it could only improve our bound by a constant factor, and we leave this as a point for further work. Since Corollary 1 holds for any $\gamma \in [0, 1]$, we can minimise its right-hand side with respect to $\gamma$ to make the bound as tight as possible. This results in a bound that is a function of $w$; however, we are interested in producing a bound that holds for all $w$. Therefore, after minimising with respect to $\gamma$, we consider the worst possible width functions, which maximise the resulting quantity, and show that even for these worst-case width functions the bound is linear in $\log r_{\max}$, i.e. in the $\infty$-divergence. Definition 4 introduces the family of these worst-case width functions for a given $r_{\max}$.

Definition 4 (Bound functions $f, g, h$; worst-case width set $W^\star$). We define

$f(\gamma, w) = \log\frac{1}{w(\gamma)}$, $g(\gamma) = 2\log\frac{1}{\gamma}$ and $h(\gamma, w) = f(\gamma, w) + g(\gamma)$.  (119)

For fixed $r_{\max}$, let $W(r_{\max})$ be the set of all possible width functions. We define the set $W^\star$ of worst-case width functions as

$W^\star = \left\{ w^\star \in W(r_{\max}) \ \middle|\ \inf_{\gamma'} h(\gamma', w^\star) \ge \inf_{\gamma'} h(\gamma', w) \text{ for all } w \in W(r_{\max}) \right\}$.  (120)

We refer to members of this set as worst-case width functions. Next, for a given $r_{\max}$, we define a width function $\widetilde{w}$ with a particular form, and show that $\widetilde{w} \in W(r_{\max})$. We also show that if $w \in W(r_{\max})$ is any other width function, then

$\inf_{\gamma} h(\gamma, w) \le \inf_{\gamma} h(\gamma, \widetilde{w})$,  (121)

from which it follows that $\widetilde{w}$ is a worst-case width function, that is, $\widetilde{w} \in W^\star$.

Lemma 12 (An explicit worst-case width function). The function

$\widetilde{w}(\gamma) = \begin{cases} 1 & \text{for } 0 \le \gamma \le \widetilde{\gamma} \\ (\widetilde{\gamma}/\gamma)^2 & \text{for } \widetilde{\gamma} < \gamma \le 1 \end{cases}$,  (122)

where $\widetilde{\gamma} = 1 - \sqrt{1 - r_{\max}^{-1}}$, is a width function: $\widetilde{w} \in W(r_{\max})$. Further, if $w \in W(r_{\max})$, then

$\inf_{\gamma} h(\gamma, w) \le \inf_{\gamma} h(\gamma, \widetilde{w})$.  (123)

Proof. Suppose $w \in W(r_{\max})$ and let

$m = \inf_{\gamma} h(\gamma, w)$,  (124)

and let $\gamma_m$ be the point where $g$ equals $m$, that is,

$g(\gamma_m) = 2\log\frac{1}{\gamma_m} = m \implies \gamma_m = e^{-m/2}$.  (125)

Define $v : [0, 1] \times [0, 1] \to [0, 1]$ as

$v(\gamma, \gamma') = \begin{cases} 1 & \text{for } 0 \le \gamma \le \gamma' \\ (\gamma'/\gamma)^2 & \text{for } \gamma' < \gamma \le 1 \end{cases}$,  (126)

and consider $v(\gamma, \gamma_m)$ as a function of $\gamma$. Note that $v(\cdot, \gamma_m)$ may not be in $W(r_{\max})$ because, while it is non-increasing and continuous, it may not integrate to $r_{\max}^{-1}$. In particular, it holds that

$h(\gamma, w) \ge m$ for all $\gamma \in [0, 1] \implies w(\gamma) \le v(\gamma, \gamma_m)$ for all $\gamma \in [0, 1] \implies \int_0^1 v(\gamma, \gamma_m)\,\mathrm{d}\gamma \ge r_{\max}^{-1}$.  (127)

Now, note that

$\int_0^1 v(\gamma, \gamma')\,\mathrm{d}\gamma = 2\gamma' - (\gamma')^2$.  (128)

By the intermediate value theorem, there exists some $0 < \widetilde{\gamma} \le \gamma_m$ such that $2\widetilde{\gamma} - \widetilde{\gamma}^2 = r_{\max}^{-1}$. For this $\widetilde{\gamma}$, we define $\widetilde{w}(\gamma) = v(\gamma, \widetilde{\gamma})$, which is a width function because it is non-increasing and integrates to $r_{\max}^{-1}$. Specifically, $\widetilde{w} \in W(r_{\max})$ because the probability density function

$q(x) = r_{\max} \min\left\{1,\ \widetilde{\gamma}\, x^{-1/2}\right\}$  (129)

has $\widetilde{w}(\gamma)$ as its width function. In addition, note that $\widetilde{\gamma} \le \gamma_m$, so we have

$\widetilde{\gamma} \le \gamma_m \implies \inf_{\gamma} h(\gamma, \widetilde{w}) = \inf_{\gamma} h(\gamma, v(\cdot, \widetilde{\gamma})) \ge \inf_{\gamma} h(\gamma, v(\cdot, \gamma_m)) = \inf_{\gamma} h(\gamma, w)$.  (130)

Therefore it holds that

$w \in W(r_{\max}) \implies \inf_{\gamma} h(\gamma, w) \le \inf_{\gamma} h(\gamma, \widetilde{w})$,  (131)

from which it follows that $\widetilde{w} \in W^\star$ is a worst-case width function.
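As a quick numerical sanity check of Lemma 12 (a minimal sketch assuming NumPy, with $r_{\max} = 50$ chosen arbitrarily), the snippet below verifies that $\widetilde{w}$ integrates to $1/r_{\max}$, and that $\inf_\gamma h(\gamma, \widetilde{w}) = 2\log(1/\widetilde{\gamma})$, which is at most $2\log(2 r_{\max})$ — the quantity used in the proof of the theorem that follows.

```python
import numpy as np

r_max = 50.0
gamma_t = 1.0 - np.sqrt(1.0 - 1.0 / r_max)             # gamma~ from Lemma 12

def w_tilde(g):
    return np.where(g <= gamma_t, 1.0, (gamma_t / g) ** 2)

def h(g):                                              # h(gamma, w~) = log(1/w~(gamma)) + 2 log(1/gamma)
    return np.log(1.0 / w_tilde(g)) + 2.0 * np.log(1.0 / g)

g = np.linspace(1e-7, 1.0, 2_000_000)
dg = g[1] - g[0]
print((w_tilde(g) * dg).sum(), 1.0 / r_max)            # integral of w~ is ~ 1/r_max, so w~ is a width function
print(h(g).min(), 2.0 * np.log(1.0 / gamma_t))         # inf_gamma h(gamma, w~) = 2 log(1/gamma~)
print(2.0 * np.log(2.0 * r_max))                       # ... which is at most 2 log(2 r_max)
```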
Last, we can put Corollary 1 together with Lemma 12 to arrive at the main result.

Theorem (AS* runtime upper bound). Let $T$ be the total number of steps taken by AS* until it terminates. Then

$\mathbb{E}[T] \le 4\alpha \log r_{\max} + 4\alpha \log 2 + 22$.  (132)

Proof. Suppose AS* is applied to a target $Q$ and proposal $P$ with $D_\infty[Q \| P] = \log r_{\max}$, and corresponding width function $w \in W(r_{\max})$. Now consider the worst-case width function $\widetilde{w}$ defined in Lemma 12, and note that

$\widetilde{\gamma} = 1 - \sqrt{1 - r_{\max}^{-1}} \ge \frac{1}{2 r_{\max}}$.  (133)

Then we have

$\mathbb{E}[T] \le 2\alpha \inf_{\gamma} h(\gamma, w) + 22 \le 2\alpha \inf_{\gamma} h(\gamma, \widetilde{w}) + 22 \le 2\alpha\, h(\widetilde{\gamma}, \widetilde{w}) + 22$,  (134)

and, substituting the expression for $h$ and using $\widetilde{w}(\widetilde{\gamma}) = 1$ together with eq. (133), we obtain

$\mathbb{E}[T] \le 2\alpha\left(\log\frac{1}{\widetilde{w}(\widetilde{\gamma})} + 2\log\frac{1}{\widetilde{\gamma}}\right) + 22 \le 4\alpha \log r_{\max} + 4\alpha \log 2 + 22$,  (135)

arriving at the result.

D. Proof of Theorem 3

Before we state the precise form of Theorem 3, we clarify the precise form of the approximate distribution $\widetilde{Q}_D$ of the output of DAD* for target $Q$ and proposal $P$ with depth limit $D_{\max} = D$. The depth limit means that there are $N = 2^D - 1$ nodes in the binary tree constructed by alg. 1, with associated Gumbel values and samples $\{(G_i, X_i)\}_{i=1}^N$. Let $r = \mathrm{d}Q/\mathrm{d}P$ and $r_i = r(X_i)$. Then, DAD* searches for $I := \arg\max_{i \in [N]}\{\log r(X_i) + G_i\}$, which can therefore be interpreted as simply performing the Gumbel-max trick on $N$ atoms. Hence, we know that

$w_i := p(I = i \mid X_{1:N}) = \frac{\exp(\log r_i)}{\sum_{j=1}^N \exp(\log r_j)} = \frac{r_i}{\sum_{j=1}^N r_j}$.  (136)

Hence, we finally get

$\widetilde{Q}_D(\cdot \mid X_{1:N}) = \sum_{i=1}^N w_i\, \delta(\cdot - X_i)$,  (137)

where $\delta$ denotes the Dirac delta function. Then, the $\widetilde{Q}_D$-expectation of a measurable function $f$ is

$\mathbb{E}_{\widetilde{Q}_D}[f] = \sum_{i=1}^N w_i f(X_i) = \frac{\sum_{i=1}^N r_i f(X_i)}{\sum_{j=1}^N r_j}$.  (138)

Now, define

$I_D(f) := \frac{1}{N}\sum_{i=1}^N f(X_i)\, r_i$.  (139)

Note that $I_D(f)$ looks very similar to the usual importance sampling estimator of the $Q$-expectation of the function $f$ using $N$ samples with distribution $P$. However, in this case the $X_i$ used in $I_D(f)$ are not identically distributed, though they are independent. In particular, let $n$ be a node in the tree realized by alg. 2 with $D_{\max} = D$; then $X_n \sim P|_{B_n}$. The importance of $I_D(f)$ is that the $\widetilde{Q}_D$-expectation of $f$ in eq. (138) can be written as

$\mathbb{E}_{\widetilde{Q}_D}[f] = \frac{I_D(f)}{I_D(1)}$,  (140)

where $1$ in the above is the constant function identically equal to $1$. Hence, we begin by investigating the properties of $I_D(f)$. First, we show that it is unbiased:

Lemma 13 (Unbiasedness of $I_D(f)$). Let $Q, P, r, f$ and $I_D(f)$ be defined as above. Then

$\mathbb{E}_{P(X_{1:N})}[I_D(f)] = \mathbb{E}_Q[f]$.  (141)

Proof. Let $T_D$ denote the set of all nodes in the binary tree constructed by alg. 2 with $D_{\max} = D$. We first note that for a node $n \in T_D$ we have $X_n \sim P|_{B_n}$, hence $\mathrm{d}P|_{B_n}(X_n) = \mathrm{d}P(X_n) / P(B_n)$ for $X_n \in B_n$. Furthermore, since partition is dyadic, we know that for a fixed depth $1 \le d \le D$, every node $n$ with $D_n = d$ satisfies $P(B_n) = 2^{-(d-1)}$. Finally, by construction, the bounds at a given depth are disjoint and form a partition of the whole space. Therefore,

$\mathbb{E}_{p(X_{1:N})}[I_D(f)] = \frac{1}{N}\sum_{n \in T_D} \mathbb{E}[r(X_n) f(X_n)] = \frac{1}{N}\sum_{n \in T_D} \frac{1}{P(B_n)} \int_{B_n} f(x)\,\frac{\mathrm{d}Q}{\mathrm{d}P}(x)\,\mathrm{d}P(x) = \frac{1}{N}\sum_{n \in T_D} \frac{1}{P(B_n)} \int_{B_n} f(x)\,\mathrm{d}Q(x) = \frac{1}{N}\sum_{d=1}^{D} 2^{d-1} \int_{\Omega} f(x)\,\mathrm{d}Q(x) = \mathbb{E}_Q[f]$,  (142)

as required.
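To make eqs. (136)–(140) and Lemma 13 concrete, here is a minimal sketch (assuming NumPy and SciPy). It takes $P$ uniform on $[0, 1]$ and, purely for illustration, $Q = \mathrm{Beta}(2, 5)$, $f(x) = x$ and $D = 10$ — all arbitrary choices. It draws one point per node of the depth-limited dyadic tree (one uniform sample per dyadic interval at each depth), forms the weights of eq. (136), and compares the self-normalised estimate of eq. (138) with $\mathbb{E}_Q[f]$; it also averages $I_D(f)$ over independent trees to illustrate the unbiasedness of Lemma 13. Note that it samples the tree nodes directly rather than running the DAD* search itself.

```python
import numpy as np
from scipy.stats import beta

rng = np.random.default_rng(2)
D = 10                                      # depth limit D_max; the tree has N = 2^D - 1 nodes
target = beta(2, 5)                         # illustrative target Q on [0, 1]; P is uniform, so r = dQ/dP = target.pdf
f = lambda x: x                             # test function, E_Q[f] = 2/7

def dyadic_samples(rng):
    xs = []
    for d in range(1, D + 1):               # at depth d, the bounds are the 2^(d-1) dyadic intervals
        edges = np.linspace(0.0, 1.0, 2 ** (d - 1) + 1)
        xs.append(rng.uniform(edges[:-1], edges[1:]))   # one sample X_n ~ P|_{B_n} per node
    return np.concatenate(xs)

x = dyadic_samples(rng)
r = target.pdf(x)
w = r / r.sum()                             # eq. (136)
print((w * f(x)).sum(), 2 / 7)              # E_{Q~_D}[f] from eq. (138) vs E_Q[f]

# Lemma 13: I_D(f) = (1/N) sum_i f(X_i) r_i is unbiased for E_Q[f]; average it over fresh trees.
runs = [(f(xx) * target.pdf(xx)).mean() for xx in (dyadic_samples(rng) for _ in range(200))]
print(np.mean(runs))                        # ~ 2/7
```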
Given the unbiasedness of $I_D(f)$, the rest of the proof follows mostly that of Chatterjee & Diaconis (2018), with appropriate modifications in the necessary places. Hence, we begin with the following lemma:

Lemma 14 (Mean absolute deviation bound for $I_D(f)$). Let $Q, P, r, f$ and $I_D(f)$ be defined as above. Let

$K := \lceil D_{\mathrm{KL}}[Q \| P]_2 \rceil$, $D := K + t$, $N := 2^D$,  (143)

where $t$ is a non-negative integer, and let $Y \sim Q$. Define

$\|f\|_Q := \sqrt{\mathbb{E}_Q[f^2]}$,  (144)
$b := 2^{-t/4} + 2\sqrt{\mathbb{P}\left[\log_2 r(Y) > K + t/2\right]}$.  (145)

Then,

$\mathbb{E}\left[\left|I_D(f) - \mathbb{E}_Q[f]\right|\right] \le b\, \|f\|_Q$.  (146)

Proof. Let $a = 2^{K + t/2}$, and define

$h(x) = f(x)\,\mathbf{1}[r(x) \le a]$,  (147)

where $\mathbf{1}[\cdot]$ is an indicator function. Let $\phi, \psi \in L^2(Q)$, and define

$\langle \phi, \psi \rangle_Q := \int_\Omega \phi \psi\,\mathrm{d}Q$.  (148)

By the triangle inequality,

$\left|I_D(f) - \mathbb{E}_Q[f]\right| \le \left|\mathbb{E}_Q[f] - \mathbb{E}_Q[h]\right| + \left|I_D(f) - I_D(h)\right| + \left|I_D(h) - \mathbb{E}_Q[h]\right|$.  (149)

We now proceed to bound each term on the right-hand side of this inequality. First,

$\left|\mathbb{E}_Q[f] - \mathbb{E}_Q[h]\right| = \left|\mathbb{E}_Q[f - h]\right| \le \mathbb{E}_Q\left[|f|\,\mathbf{1}[r > a]\right] = \langle |f|, \mathbf{1}[r > a] \rangle_Q \le \|f\|_Q \sqrt{\mathbb{P}[r(Y) > a]}$,  (150)

where the first inequality follows by Jensen's inequality and the second follows by the Cauchy–Schwarz inequality. Next,

$\mathbb{E}\left[\left|I_D(f) - I_D(h)\right|\right] = \mathbb{E}\left[\left|I_D(f - h)\right|\right] \le \mathbb{E}\left[I_D(|f - h|)\right] = \mathbb{E}_Q\left[|f|\,\mathbf{1}[r > a]\right] = \langle |f|, \mathbf{1}[r > a] \rangle_Q \le \|f\|_Q \sqrt{\mathbb{P}[r(Y) > a]}$,  (151)

where the first inequality holds by Jensen's inequality, the second equality follows by Lemma 13, and the second inequality follows by Cauchy–Schwarz. Finally,

$\mathbb{E}\left[\left|I_D(h) - \mathbb{E}_Q[h]\right|\right]^2 \le \mathbb{E}\left[\left(I_D(h) - \mathbb{E}_Q[h]\right)^2\right] = \frac{1}{N^2}\sum_{n \in T_D} \mathrm{Var}\left(h(X_n)\, r(X_n)\right)$,  (152)

where the first inequality holds by Jensen's inequality. Now, examining a single variance term in the above sum:

$\mathrm{Var}\left(h(X_n)\, r(X_n)\right) \le \mathbb{E}\left[h^2(X_n)\, r^2(X_n)\right] = \frac{1}{P(B_n)}\int_{B_n} h^2(x)\left(\frac{\mathrm{d}Q}{\mathrm{d}P}(x)\right)^2\mathrm{d}P(x) = \frac{1}{P(B_n)}\int_{B_n} h^2(x)\, r(x)\,\mathrm{d}Q(x) = \frac{1}{P(B_n)}\int_{B_n} f^2(x)\,\mathbf{1}[r(x) \le a]\, r(x)\,\mathrm{d}Q(x) \le \frac{a}{P(B_n)}\int_{B_n} f^2(x)\,\mathrm{d}Q(x)$.

Plugging this back into eq. (152), we get

$\mathbb{E}\left[\left|I_D(h) - \mathbb{E}_Q[h]\right|\right]^2 \le \frac{a}{N^2}\sum_{d=1}^{D} 2^{d-1}\int_{\Omega} f^2(x)\,\mathrm{d}Q(x) \le \frac{a\,\|f\|_Q^2\, N}{N^2} = \frac{a\,\|f\|_Q^2}{N}$.

Taking square roots of the very left and very right, we get

$\mathbb{E}\left[\left|I_D(h) - \mathbb{E}_Q[h]\right|\right] \le \|f\|_Q \sqrt{\frac{a}{N}} = \|f\|_Q\, 2^{-t/4}$.

Putting the three bounds together gives the desired result.

We are now ready to state our result on the $\widetilde{Q}_D$-expectation.

Theorem 5 (Bias of DAD* coding). Let $Q, P, r, f, I, \widetilde{Q}_D, K, D, N$ and $Y$ be defined as in Lemma 14. Define

$b := 2^{-t/4} + 2\sqrt{\mathbb{P}\left[\log_2 r(Y) > K + t/2\right]}$ and $\delta := \sqrt{b}$.

Then

$\mathbb{P}\!\left(\left|\mathbb{E}_{\widetilde{Q}_D}[f] - \mathbb{E}_Q[f]\right| \ge \frac{2\|f\|_Q\,\delta}{1 - \delta}\right) \le 2\delta$.

Proof. The proof is mutatis mutandis the same as the proof of Theorem 1.2 of Chatterjee & Diaconis (2018), as it only relies on eq. (140) and Lemma 14; we repeat it here for completeness. Let $a = 2^{K+t/2}$ and let $b$ be as above. Then, by Markov's inequality and Lemma 14, for $\delta \in (0, 1)$ we have

$\mathbb{P}\left[\left|I_D(1) - 1\right| \ge \delta\right] \le \frac{\mathbb{E}\left[\left|I_D(1) - 1\right|\right]}{\delta} \le \frac{b}{\delta}$,

and for $\alpha > 0$ we get

$\mathbb{P}\left[\left|I_D(f) - \mathbb{E}_Q[f]\right| \ge \alpha\right] \le \frac{\mathbb{E}\left[\left|I_D(f) - \mathbb{E}_Q[f]\right|\right]}{\alpha} \le \frac{b\,\|f\|_Q}{\alpha}$.

Now, if $\left|I_D(f) - \mathbb{E}_Q[f]\right| < \alpha$ and $\left|I_D(1) - 1\right| < \delta$, then

$\left|\mathbb{E}_{\widetilde{Q}_D}[f] - \mathbb{E}_Q[f]\right| = \left|\frac{I_D(f)}{I_D(1)} - \mathbb{E}_Q[f]\right| \le \frac{1}{I_D(1)}\left(\left|I_D(f) - \mathbb{E}_Q[f]\right| + \left|\mathbb{E}_Q[f]\right|\left|1 - I_D(1)\right|\right) < \frac{\alpha + \left|\mathbb{E}_Q[f]\right|\delta}{1 - \delta}$.

Finally, setting $\delta = \sqrt{b}$ and $\alpha = \|f\|_Q\,\delta$ gives the desired result.

E. Proof of Lemma 2

Lemma 15. Let $Q$ and $P$ be the target and proposal distributions passed to A* coding (Algorithm 1). Let $H^*$ be the heap index returned by unrestricted A* coding and $H^*_d$ be the index returned by its depth-limited version with $D_{\max} = d$. Then, conditioned on the public random sequence $S$, we have $H^*_d \le H^*$. Further, there exists $D \in \mathbb{N}$ such that for all $d > D$ we have $H^*_d = H^*$.

Proof. Let $n, m \in \mathcal{T}$ be two nodes in the binary tree representation $\mathcal{T}$ of the Gumbel process with base measure $P$ realized by alg. 2, simulated using the public random sequence $S$, such that $D_m < D_n$. It follows from the definition of heap indexing that $H_m < H_n$. Given $S$, A* coding and its depth-limited version search over the same tree $\mathcal{T}$, with the difference that the depth-limited version only searches $\mathcal{T}_d \subseteq \mathcal{T}$, the tree truncated after depth $D_{\max} = d$. Let

$n^* = \arg\max_{n \in \mathcal{T}}\{G_n + \log r(X_n)\}$ and $n^*_d = \arg\max_{n \in \mathcal{T}_d}\{G_n + \log r(X_n)\}$  (161)

be the nodes from $\mathcal{T}$ returned by unrestricted A* coding and by its depth-limited version, respectively. Clearly $D_{n^*_d} \le d$. Then, we have the following two cases.

Case 1: $d < D_{n^*}$. In this case, we have $D_{n^*_d} \le d < D_{n^*}$, hence $H^*_d < H^*$.

Case 2: $d \ge D_{n^*}$. In this case, depth-limited A* coding finds and returns $n^*$, and since it is the global maximum, increasing the depth budget further makes no difference to the returned node. Thus, in this case we get $H^*_d = H^*$, as required.

F. KL Parameterization for Distributions

In this section, we demonstrate how Gaussian and uniform target distributions can be parameterized by their KL divergence to some reference distribution, and give some details on how to implement these parameterizations in practice.
F.1. KL-Mean Parameterization for Gaussians

Given a reference Gaussian distribution $P = \mathcal{N}(\nu, \rho^2)$, we want to parameterize $Q = \mathcal{N}(\mu, \sigma^2)$ such that

$D_{\mathrm{KL}}[Q \| P] = \kappa$ and $\sigma < \rho$.  (162)

The condition $\sigma < \rho$ incorporates the additional inductive bias that, since $P$ in practical applications acts as a prior and $Q$ represents a variational approximation to a posterior, the posterior should have less uncertainty than the prior. It is also a condition required by A* coding, as it is necessary and sufficient to ensure $D_\infty[Q \| P] < \infty$. The KL divergence from $Q$ to $P$ is

$D_{\mathrm{KL}}[Q \| P] = \log\frac{\rho}{\sigma} + \frac{\sigma^2 + (\mu - \nu)^2}{2\rho^2} - \frac{1}{2} = \kappa$.  (163)

Algorithm 2: Priority queue-based top-down construction of a Gumbel process.
Input: Base distribution $P$ with density $p$, number of samples to realize $K$, maximum search depth $D_{\max}$.
Output: Next realization from the Gumbel process $\mathrm{GP}_P = \{(X_k, G_k)\}_{k=1}^{K}$.
  $k, D_1, H_1 \leftarrow 1, 1, 1$
  $\Pi \leftarrow$ PriorityQueue
  $G_1 \sim \mathrm{Gumbel}(0)$
  $X_1 \sim P(\cdot)$
  $\Pi$.pushWithPriority$(1, G_1)$
  while !$\Pi$.empty() do
    $i \leftarrow \Pi$.popHighest()
    if $D_i < D_{\max}$ then
      $L, R \leftarrow$ partition$(B_i, X_i)$
      for $C \in \{L, R\}$ do
        $k \leftarrow k + 1$
        $B_k \leftarrow C$
        $D_k \leftarrow D_i + 1$
        $H_k \leftarrow 2H_i$ if $C = L$, else $H_k \leftarrow 2H_i + 1$
        $G_k \sim \mathrm{TruncGumbel}(\log P(B_k), G_i)$
        $X_k \sim P(\cdot)|_{B_k} / P(B_k)$
        $\Pi$.pushWithPriority$(k, G_k)$
      end
    end
    yield $X_i, G_i, H_i, D_i$
  end

We observe that the largest value that $|\mu - \nu|$ can take while satisfying this equality and the constraint $0 < \sigma < \rho$ occurs as $\sigma \to \rho$, in which case we have

$\frac{(\mu - \nu)^2}{2\rho^2} = \kappa \implies |\mu - \nu| < \rho\sqrt{2\kappa}$.  (164)

Defining $\Delta = (\mu - \nu)/\rho$, we can rearrange eq. (163) to

$\frac{\sigma^2}{\rho^2}\, e^{-\sigma^2/\rho^2} = e^{\Delta^2 - 2\kappa - 1}$.  (165)

We can rearrange this equation, using the Lambert $W$ function (Lambert, 1758), into the form

$\sigma^2 = -\rho^2\, W\!\left(-e^{\Delta^2 - 2\kappa - 1}\right)$.  (166)

While the Lambert $W$ function does not have an expression in terms of elementary functions, it can be evaluated numerically (Corless et al., 1996). While a numerical solver for the $W$ function is available in TensorFlow, we found it computationally faster and numerically more stable to use a Padé approximant of $W$ in practice (Brezinski, 1994). We make this approximant available in our code repository.

Implementing a Gaussian Iso-KL layer: Note that the parameters derived above lie in a constrained domain. Thus, given $P = \mathcal{N}(\nu, \rho^2)$, we can reparameterize $\kappa$ and $\mu$ to an unconstrained domain as follows (a short code sketch is given below, after Section F.2):
1. Let $\alpha, \beta$ be real numbers.
2. Set $\kappa \leftarrow \exp(\alpha)$. This ensures that $\kappa > 0$.
3. Set $\mu \leftarrow \nu + \rho\sqrt{2\kappa}\tanh(\beta)$. This ensures the inequality on $|\mu - \nu|$ in eq. (164) is satisfied.
4. Set $\sigma^2 \leftarrow -\rho^2\, W\!\left(-\exp(\Delta^2 - 2\kappa - 1)\right)$, where $\Delta = (\mu - \nu)/\rho$.

F.2. KL–Infinity-Divergence Parameterization for Gaussians

Assume now that, instead of just controlling the KL divergence, we wish to control the Rényi $\infty$-divergence as well. Concretely, for a given reference Gaussian distribution $P = \mathcal{N}(\nu, \rho^2)$, we want to parameterize $Q = \mathcal{N}(\mu, \sigma^2)$ such that

$D_{\mathrm{KL}}[Q \| P] = K$ and $D_\infty[Q \| P] = R$.  (167)

Assuming, without loss of generality, that $\nu = 0$ and $\rho = 1$ (the general case follows by standardization), we know that

$K = D_{\mathrm{KL}}[Q \| P] = \frac{1}{2}\left(\mu^2 + \sigma^2 - \log \sigma^2 - 1\right)$ and $R = \log \sup_{x \in \mathbb{R}} \frac{\mathrm{d}Q}{\mathrm{d}P}(x) = \frac{\mu^2}{2(1 - \sigma^2)} - \log \sigma$.  (168)

From these, we get that

$\mu^2 = 2K - \sigma^2 + \log \sigma^2 + 1$ and $\mu^2 = 2(1 - \sigma^2)(R + \log \sigma)$.  (169)

Setting these equal to each other and rearranging:

$2K - \sigma^2 + \log \sigma^2 + 1 = 2R + \log \sigma^2 - 2\sigma^2 R - \sigma^2 \log \sigma^2$
$\sigma^2 \log \sigma^2 + \sigma^2(2R - 1) = 2R - 2K - 1 = A$
$\sigma^2\left(\log \sigma^2 + B\right) = A$
$\sigma^2 \log\!\left(\sigma^2 e^{B}\right) = A$
$e^{B} \sigma^2 \log\!\left(\sigma^2 e^{B}\right) = A e^{B}$
$e^{\log(\sigma^2 e^{B})} \log\!\left(\sigma^2 e^{B}\right) = A e^{B}$
$\log\!\left(\sigma^2 e^{B}\right) = W\!\left(A e^{B}\right)$
$\sigma^2 = e^{W(A e^{B}) - B}$,  (170)

where we made the substitutions $A = 2R - 2K - 1$ and $B = 2R - 1$.

Note: This parameterization is only unique up to the sign of the target mean $\mu$, due to the symmetry of the Gaussian distribution.
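As a concrete illustration of the KL-mean recipe above, here is a minimal sketch (assuming NumPy and SciPy; the values of $\nu$, $\rho$, $\alpha$, $\beta$ are arbitrary). It uses scipy.special.lambertw for the principal branch $W_0$ in place of the Padé approximant mentioned above, maps unconstrained $(\alpha, \beta)$ to $(\mu, \sigma)$ following steps 1–4, and checks that the resulting Gaussian hits the requested KL. The F.2 variant follows the same pattern, with $\sigma^2 = e^{W(Ae^B) - B}$ in place of step 4.

```python
import numpy as np
from scipy.special import lambertw

def gaussian_iso_kl(alpha, beta, nu, rho):
    """Map unconstrained (alpha, beta) to (mu, sigma) with KL[Q || P] = exp(alpha), following steps 1-4."""
    kappa = np.exp(alpha)                                   # target KL in nats, > 0
    mu = nu + rho * np.sqrt(2.0 * kappa) * np.tanh(beta)    # satisfies |mu - nu| < rho * sqrt(2 kappa)
    delta = (mu - nu) / rho
    # sigma^2 = -rho^2 W_0(-exp(delta^2 - 2 kappa - 1)); the argument lies in [-1/e, 0),
    # so the principal branch W_0 returns a value in [-1, 0) and hence sigma < rho.
    sigma2 = -rho**2 * np.real(lambertw(-np.exp(delta**2 - 2.0 * kappa - 1.0), k=0))
    return mu, np.sqrt(sigma2)

def gaussian_kl(mu, sigma, nu, rho):
    """Eq. (163): KL between N(mu, sigma^2) and N(nu, rho^2)."""
    return np.log(rho / sigma) + (sigma**2 + (mu - nu)**2) / (2.0 * rho**2) - 0.5

mu, sigma = gaussian_iso_kl(alpha=0.3, beta=-1.2, nu=1.0, rho=2.0)
print(gaussian_kl(mu, sigma, nu=1.0, rho=2.0), np.exp(0.3))   # both ~1.3499, as requested
```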
F.3. KL-Mean Parameterization for Uniforms

Given a uniform reference distribution $P = \mathcal{U}(\nu, \rho)$ with mean $\nu$ and width $\rho$, i.e. supported on the interval $[\nu - \rho/2, \nu + \rho/2]$, we want to parameterize $Q = \mathcal{U}(\mu, \sigma)$ such that

$D_{\mathrm{KL}}[Q \| P] = \kappa$.  (171)

Note that we must have

$\nu - \rho/2 \le \mu - \sigma/2 < \mu + \sigma/2 \le \nu + \rho/2$  (172)

to ensure $Q \ll P$. Now, we have

$\kappa = D_{\mathrm{KL}}[Q \| P] = \log\frac{\rho}{\sigma}$,  (173)

from which we get

$\sigma = \rho \exp(-\kappa)$.  (174)

Implementing a Uniform Iso-KL layer: Note that the parameters derived above lie in a constrained domain. Thus, given $P = \mathcal{U}(\nu, \rho)$, we can reparameterize $\kappa$ and $\mu$ to an unconstrained domain as follows (see the sketch below):
1. Let $\alpha, \beta$ be real numbers.
2. Set $\kappa \leftarrow \exp(\alpha)$. This ensures that $\kappa > 0$.
3. Set $\sigma \leftarrow \rho \exp(-\kappa)$.
4. Set $\mu \leftarrow \nu + \frac{\rho - \sigma}{2}\tanh(\beta)$. This ensures $Q \ll P$.
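The uniform case is simple enough to write down directly; the following sketch (assuming NumPy, with arbitrary example values) implements steps 1–4 and checks that the achieved KL matches the requested one.

```python
import numpy as np

def uniform_iso_kl(alpha, beta, nu, rho):
    """Map unconstrained (alpha, beta) to Q = U(mu, sigma) with KL[Q || P] = exp(alpha), following steps 1-4."""
    kappa = np.exp(alpha)                             # target KL in nats, > 0
    sigma = rho * np.exp(-kappa)                      # eq. (174)
    mu = nu + 0.5 * (rho - sigma) * np.tanh(beta)     # keeps [mu - sigma/2, mu + sigma/2] inside the support of P
    return mu, sigma

mu, sigma = uniform_iso_kl(alpha=-0.5, beta=0.7, nu=0.0, rho=1.0)
print(np.log(1.0 / sigma), np.exp(-0.5))              # achieved KL = log(rho/sigma) vs requested kappa; both ~0.6065
```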