# Progressive Compression with Universally Quantized Diffusion Models

Published as a conference paper at ICLR 2025

Yibo Yang, Justus C. Will, Stephan Mandt
Department of Computer Science, University of California, Irvine
{yibo.yang, jcwill, mandt}@uci.edu

ABSTRACT

Diffusion probabilistic models have achieved mainstream success in many generative modeling tasks, from image generation to inverse problem solving. A distinct feature of these models is that they correspond to deep hierarchical latent variable models optimizing a variational evidence lower bound (ELBO) on the data likelihood. Drawing on a basic connection between likelihood modeling and compression, we explore the potential of diffusion models for progressive coding, resulting in a sequence of bits that can be incrementally transmitted and decoded with progressively improving reconstruction quality. Unlike prior work based on Gaussian diffusion or conditional diffusion models, we propose a new form of diffusion model with uniform noise in the forward process, whose negative ELBO corresponds to the end-to-end compression cost using universal quantization. We obtain promising first results on image compression, achieving competitive rate-distortion and rate-realism results on a wide range of bit-rates with a single model, bringing neural codecs a step closer to practical deployment. Our code can be found at https://github.com/mandt-lab/uqdm.

1 INTRODUCTION

A diffusion probabilistic model can be equivalently viewed as a deep latent-variable model (Sohl-Dickstein et al., 2015; Ho et al., 2020; Kingma et al., 2021), a cascade of denoising autoencoders that perform score matching at different noise levels (Vincent, 2011; Song & Ermon, 2019), or a neural SDE (Song et al., 2021b). Here we take the latent-variable model view and explore the potential of diffusion models for communicating information.
Given the strong performance of these models on likelihood estimation (Kingma et al., 2021; Nichol & Dhariwal, 2021), it is natural to ask whether they also excel in the closely related task of data compression (MacKay, 2003; Yang et al., 2023).

Ho et al. (2020) and Theis et al. (2022) first suggested a progressive compression method based on an unconditional diffusion model and demonstrated its strong potential for data compression. Such a progressive codec is desirable as it allows us to decode data reconstructions from partial bit-streams, ranging from lossy reconstructions at low bit-rates to perfect (lossless) reconstructions at high bit-rates, all with a single model. The ability to decode intermediate reconstructions without having to wait for all bits to be received is a highly useful feature present in many traditional codecs, such as JPEG. The use of diffusion models has the additional advantage that we can, in theory, obtain perfectly realistic reconstructions (Theis et al., 2022), even at ultra-low bit-rates. Unfortunately, the proposed method requires the communication of Gaussian samples across many steps, which remains intractable because of the exponential runtime complexity of channel simulation (Goc & Flamich, 2024).

In this work, we take first steps towards a diffusion-based progressive codec that is computationally tractable. The key idea is to replace Gaussian distributions in the forward process with suitable uniform distributions and adjust the reverse process distributions accordingly. These modifications allow the application of universal quantization (Zamir & Feder, 1992) for simulating uniform noise channels, avoiding the intractability of Gaussian channel simulation in Theis et al. (2022). Specifically, our contributions are as follows:

(Author footnote: equal contribution.)

Figure 1: Example reconstructions from several traditional and neural codecs, chosen at roughly similar bitrates.
At high bitrates, our UQDM method preserves details (e.g. shape and color pattern of the spider, or sharpness of the calligraphy) better than other neural codecs. Note that among the methods considered here, only ours and CTC (Jeon et al., 2023) implement progressive coding.

1. We introduce a new form of diffusion model, Universally Quantized Diffusion Model (UQDM), that is suitable for end-to-end learned progressive data compression. Unlike in the closely-related Gaussian diffusion model (Kingma et al., 2021), compression with UQDM is performed efficiently with universal quantization, avoiding the generally exponential runtime of relative entropy coding (Agustsson & Theis, 2020; Goc & Flamich, 2024).
2. We investigate design choices of UQDM, specifying its forward and reverse processes largely by matching the moments of those in Gaussian diffusion, and obtain the best results when we learn the reverse-process variance as inspired by Nichol & Dhariwal (2021).
3. We provide theoretical insight into UQDM in relation to VDM, and derive the continuous-time limit of its forward process approaching that of the Gaussian diffusion. These results may inspire future research in improving the modeling formalism and training efficiency.
4. We apply UQDM to image compression, and obtain competitive rate-distortion and rate-realism results which exceed existing progressive codecs at a wide range of bit-rates (up to lossless compression), all with a single model. Our results demonstrate, for the first time, the high potential of an unconditional diffusion model as a practical progressive codec.

2 BACKGROUND

Diffusion models. Diffusion probabilistic models learn to model data by inverting a Gaussian noising process.
Following the discrete-time setup of VDM (Kingma et al., 2021), the forward noising process begins with a data observation $x$ and defines a sequence of increasingly noisy latent variables $z_t$ with a conditional Gaussian distribution,

$$
q(z_t|x) = \mathcal{N}(\alpha_t x, \sigma_t^2 I), \quad t = 0, 1, \ldots, T.
$$

Here $\alpha_t$ and $\sigma_t^2$ are positive scalar-valued functions of time, with a signal-to-noise ratio $\mathrm{SNR}(t) := \alpha_t^2/\sigma_t^2$ that is strictly monotonically decreasing in $t$, so that the latents become noisier as $t$ grows. The variance-preserving process of DDPM (Ho et al., 2020) corresponds to the choice $\alpha_t^2 = 1 - \sigma_t^2$.

The reverse-time generative model is defined by a collection of conditional distributions $p(z_{t-1}|z_t)$, a prior $p(z_T) = \mathcal{N}(0, I)$, and a likelihood model $p(x|z_0)$. The conditional distributions $p(z_{t-1}|z_t) := q(z_{t-1}|z_t, x = \hat{x}_\theta(z_t, t))$ are chosen to have the same distributional form as the forward posterior distribution $q(z_{t-1}|z_t, x)$, with $x$ estimated from its noisy version $z_t$ through the learned denoising model $\hat{x}_\theta$. Further details on the forward and backward processes can be found in Appendix A and B. Throughout the paper the logarithms use base 2. The model is trained by minimizing the negative ELBO (Evidence Lower BOund),

$$
L(x) = \underbrace{\mathrm{KL}(q(z_T|x) \,\|\, p(z_T))}_{:= L_T} + \underbrace{\mathbb{E}[-\log p(x|z_0)]}_{:= L_{x|z_0}} + \sum_{t=1}^{T} \underbrace{\mathbb{E}[\mathrm{KL}(q(z_{t-1}|z_t, x) \,\|\, p(z_{t-1}|z_t))]}_{:= L_{t-1}}, \tag{1}
$$

where the expectations are taken with respect to the forward process $q(z_{0:T}|x)$. Kingma et al. (2021) showed that a larger $T$ corresponds to a tighter bound on the marginal likelihood $\log p(x)$, and as $T \to \infty$ the loss approaches the loss of a class of continuous-time diffusion models that includes the ones considered by Song et al. (2021b).

Relative Entropy Coding (REC). Relative Entropy Coding (REC) deals with the problem of efficiently communicating a single sample from a target distribution $q$ using a coding distribution $p$.
Suppose two parties in communication have access to a common prior distribution $p$ and pseudo-random number generators with a common seed; a Relative Entropy Coding (REC) method (Flamich et al., 2020) allows the sender to transmit a sample $z \sim q$ using close to $\mathrm{KL}(q \,\|\, p)$ bits on average. If $q$ arises from a conditional distribution, e.g., $q_x = q(z|x)$ is the inference distribution of a VAE (which can be viewed as a noisy channel), a reverse channel coding or channel simulation (Theis & Ahmed, 2022) algorithm then allows the sender to transmit $z \sim q_x$ with $x \sim p(x)$ using close to $\mathbb{E}_{x \sim p(x)}[\mathrm{KL}(q(z|x) \,\|\, p(z))]$ bits on average.

At a high level, a typical REC method works as follows. The sender generates a (possibly large) number of candidate $z$ samples from the prior $p$, $z_n \sim p$, $n = 1, 2, 3, \ldots$, and appropriately chooses an index $K$ such that $z_K$ is a fair sample from the target distribution, i.e., $z_K \sim q$. The chosen index $K \in \mathbb{N}$ is then converted to binary and transmitted to the receiver. The receiver recovers $z_K$ by drawing the same sequence of $z$ candidates from $p$ (made possible by using a pseudo-random number generator with the same seed as the sender) and stopping at the $K$th one.

A major challenge of REC algorithms is that their computational complexity generally scales exponentially with the amount of information being communicated (Agustsson & Theis, 2020; Goc & Flamich, 2024). As an example, the MRC algorithm (Cuff, 2008; Havasi et al., 2018) draws $M$ candidate samples and selects $K \in \{1, 2, \ldots, M\}$ with a probability proportional to the importance weights $q(z_n)/p(z_n)$, $n = 1, \ldots, M$; similarly to importance sampling, $M$ needs to be on the order of $2^{\mathrm{KL}(q \| p)}$ for $z_K$ to be (approximately) a fair sample from $q$, thus requiring a number of drawn samples that scales exponentially with the relative entropy $\mathrm{KL}(q \,\|\, p)$ (the cost of transmitting $K$ is thus $\log M \approx \mathrm{KL}(q \,\|\, p)$ bits).
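
To make the sample-then-index procedure concrete, the following is a minimal sketch of MRC-style coding for a one-dimensional Gaussian target under a standard normal prior. The function names, the use of NumPy, and the separate seed for the sender's private randomness are illustrative assumptions rather than part of the cited algorithms, and the index is returned as a plain integer instead of being entropy-coded.

```python
import numpy as np

def log_normal_pdf(z, mean, std):
    """Log-density of N(mean, std^2) evaluated at z."""
    return -0.5 * ((z - mean) / std) ** 2 - np.log(std) - 0.5 * np.log(2 * np.pi)

def mrc_encode(q_mean, q_std, num_candidates, shared_seed, private_seed=1234):
    """Sender: draw M candidates from the prior p = N(0, 1) with the shared seed,
    then pick an index K with probability proportional to q(z_n) / p(z_n)."""
    candidates = np.random.default_rng(shared_seed).standard_normal(num_candidates)
    log_w = log_normal_pdf(candidates, q_mean, q_std) - log_normal_pdf(candidates, 0.0, 1.0)
    probs = np.exp(log_w - log_w.max())   # subtract the max for numerical stability
    probs /= probs.sum()
    # The index is chosen with randomness known only to the sender (assumed seed).
    return int(np.random.default_rng(private_seed).choice(num_candidates, p=probs))

def mrc_decode(index, num_candidates, shared_seed):
    """Receiver: regenerate the same candidates and return the K-th one."""
    candidates = np.random.default_rng(shared_seed).standard_normal(num_candidates)
    return candidates[index]

# Target q = N(1.5, 0.3^2); KL(q || p) in bits against the standard normal prior.
q_mean, q_std = 1.5, 0.3
kl_bits = (-np.log(q_std) + 0.5 * (q_std**2 + q_mean**2) - 0.5) / np.log(2)
M = 2 ** int(np.ceil(kl_bits) + 3)        # M on the order of 2^KL, with some slack
k = mrc_encode(q_mean, q_std, M, shared_seed=0)
z = mrc_decode(k, M, shared_seed=0)
```

In this sketch the number of candidates is tied to the relative entropy through `M`, so every additional bit of relative entropy doubles the number of prior samples the sender has to draw and weigh.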
The exponential complexity prevents, e.g., naively communicating the entire latent tensor $z$ in a Gaussian VAE for lossy compression, as the relative entropy $\mathrm{KL}(q(z|x) \,\|\, p(z))$ easily exceeds thousands of bits, even for a small image. This difficulty can be partly remedied by performing REC on sub-problems with lower dimensions (Flamich et al., 2020; 2022) for which computationally viable REC algorithms exist (Flamich et al., 2024; Flamich, 2024), but at the expense of worse bitrate efficiency due to the accumulation of codelength overhead across the dimensions.

Progressive Coding with Diffusion. A progressive compression algorithm allows for lossy reconstructions with improving quality as more bits are sent, up to a lossless reconstruction. This results in variable-rate compression with a single bitstream, and is highly desirable in practical applications. As we will explain, the NELBO of a diffusion model (eq. (1)) naturally corresponds to the lossless coding cost of a progressive codec, which can be optimized end-to-end on the data distribution of interest.

Given a trained diffusion model, a REC algorithm, and a data point $x$, we can perform progressive compression as follows (Ho et al., 2020; Theis et al., 2022): Initially, at time $T$, the sender transmits a sample of $q(z_T|x)$ under the prior $p(z_T)$, using $L_T$ bits on average. At each subsequent time step $t$, the sender transmits a sample of $q(z_{t-1}|z_t, x)$ given the previously transmitted $z_t$, under the (conditional) prior $p(z_{t-1}|z_t)$, using approximately $L_{t-1}$ bits. Finally, given $z_0$ at $t = 0$, $x$ can be transmitted losslessly under the model $p(x|z_0)$ by an entropy coding algorithm (e.g., arithmetic coding), with a codelength close to $L_{x|z_0}$ bits (Polyanskiy & Wu, 2022, Chapter 13.1). Thus, the overall cost of losslessly compressing $x$ sums up to $L(x)$ bits, as in the NELBO in eq. (1). Crucially, at any time $t$, the receiver can use the most-recently-received $z_t$ to obtain a lossy data reconstruction $\hat{x}_t$.
For this, several options are possible: Ho et al. (2020) consider using the diffusion model's denoising prediction $\hat{x}_\theta(z_t)$, while Theis et al. (2022) consider sampling $\hat{x}_t \sim p(x|z_t)$, either by ancestral sampling or a probability flow ODE (Song et al., 2021b). Note that if the reverse generative model captures the data distribution perfectly, then $\hat{x}_t \sim p(x|z_t)$ follows the same marginal distribution as the data and has the desirable property of perfect realism, i.e., being indistinguishable from real data (Theis et al., 2022).

Universal Quantization. Although general-purpose REC algorithms suffer from exponential runtime (Agustsson & Theis, 2020; Goc & Flamich, 2024), efficient REC algorithms exist if we are willing to restrict the kinds of target and coding distributions allowed (Flamich et al., 2022; 2024). Here, we focus on the special case where the target distribution $q$ is given by a uniform noise channel, which is solved efficiently by Universal Quantization (UQ) (Roberts, 1962; Zamir & Feder, 1992; Agustsson & Theis, 2020). Specifically, suppose we (the sender) have access to a scalar r.v. $Y \sim p_Y$, and would like to communicate a noise-perturbed version of it,

$$
\tilde{Y} = Y + U,
$$

where $U \sim \mathcal{U}(-\Delta/2, \Delta/2)$ is an independent r.v. with a uniform distribution on the interval $[-\Delta/2, \Delta/2]$. UQ accomplishes this as follows:

Step 1. Perturb $Y$ by adding another independent noise $U' \sim \mathcal{U}(-\Delta/2, \Delta/2)$, and quantize the result to the closest quantization point $K$ on a uniform grid of width $\Delta$, i.e., computing $K := \Delta \lfloor (Y + U')/\Delta \rceil$, where $\lfloor \cdot \rceil$ denotes rounding to the nearest integer.
Step 2. Entropy code and transmit $K$ under the conditional distribution of $K$ given $U'$.
Step 3. The receiver draws the same $U'$ by using the same random number generator and obtains a reconstruction $\hat{Y} := K - U' = \Delta \lfloor (Y + U')/\Delta \rceil - U'$.

Zamir & Feder (1992) showed that $\hat{Y}$ indeed has the same distribution as $\tilde{Y}$, and the entropy coding cost of $K$ is related to the differential entropy of $\tilde{Y}$ via

$$
H[K|U'] = I(Y; \tilde{Y}) = h(\tilde{Y}) - \log(\Delta).
$$
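
The three UQ steps can be sketched in a few lines of NumPy. This is an illustrative sketch only: the function names are hypothetical, the entropy coding of $K$ in Step 2 is omitted (the quantized values are passed directly), and shared randomness is emulated by seeding two generators identically.

```python
import numpy as np

def uq_encode(y, delta, rng):
    """Step 1: add shared uniform noise U' and round to the grid of width delta."""
    u = rng.uniform(-delta / 2, delta / 2, size=np.shape(y))
    return delta * np.round((y + u) / delta)      # K, a point on the grid delta * Z

def uq_decode(k, delta, rng):
    """Step 3: redraw the same U' from the shared generator and subtract it."""
    u = rng.uniform(-delta / 2, delta / 2, size=np.shape(k))
    return k - u                                   # Y_hat = K - U'

delta = 0.5
y = np.random.default_rng(1).normal(size=100_000)       # source samples Y
k = uq_encode(y, delta, np.random.default_rng(seed=0))   # sender
y_hat = uq_decode(k, delta, np.random.default_rng(seed=0))  # receiver, same seed
error = y_hat - y   # should behave like U(-delta/2, delta/2) noise
```

The reconstruction error never exceeds $\Delta/2$ in magnitude and its empirical variance is close to $\Delta^2/12$, the variance of the uniform noise channel being simulated. `np.round` resolves exact ties by rounding to even, which happens with probability zero for continuous inputs.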
In the above, the optimal entropy coding distribution $P(K|U' = u')$ is obtained by discretizing $p_{\tilde{Y}} := p_Y \star \mathcal{U}(-\Delta/2, \Delta/2)$ on a grid of width $\Delta$ and offset by $U' = u'$ (Zamir & Feder, 1992), where $\star$ denotes convolution. If the true $p_{\tilde{Y}}$ is unknown, we can replace it with a surrogate density model $f_\theta(\tilde{y})$ during entropy coding and incur a higher coding cost,

$$
\mathbb{E}_{y \sim P_Y}[\mathrm{KL}(u(\cdot|y) \,\|\, f_\theta(\cdot))] \geq I(Y; \tilde{Y}), \tag{2}
$$

where $u(\cdot|y)$ denotes the density function of the uniform noise channel $q_{\tilde{Y}|Y=y} = \mathcal{U}(y - \Delta/2, y + \Delta/2)$. It can be shown that the optimal choice of $f_\theta$ is the convolution of $p_Y$ with $\mathcal{U}(-\Delta/2, \Delta/2)$. Therefore, as in prior work (Agustsson & Theis, 2020; Ballé et al., 2018), we will choose $f_\theta$ to have the form of another underlying density model $g_\theta$ convolved with uniform noise, i.e.

$$
f_\theta(\cdot) = g_\theta(\cdot) \star \mathcal{U}(\cdot\,; -\Delta/2, \Delta/2). \tag{3}
$$

3 UNIVERSALLY QUANTIZED DIFFUSION MODELS

We follow the same conceptual framework of progressive compression with diffusion models as in Ho et al. (2020) and Theis et al. (2022), reviewed in the previous section. While Theis et al. (2022) use Gaussian diffusion, relying on the communication of Gaussian samples which remains intractable in higher dimensions, we want to apply UQ to similarly achieve a compression cost given by the NELBO, while remaining computationally efficient. We therefore introduce a new model with a modified forward process and reverse process, which we term universally quantized diffusion model (UQDM), replacing Gaussian noise channels with uniform noise channels.

3.1 FORWARD PROCESS

The forward process of a standard diffusion model is often given by the transition kernel $q(z_{t+1}|z_t)$ (Ho et al., 2020) or perturbation kernel $q(z_t|x)$ (Kingma et al., 2021), which in turn determines the conditional (reverse-time) distributions $q(z_T|x)$ and $\{q(z_{t-1}|z_t, x) \mid t = 1, \ldots, T\}$ appearing in the NELBO in eq. (1). As we are interested in operationalizing and optimizing the coding cost associated with eq.
(1), we will directly specify these conditional distributions to be compatible with UQ, rather than deriving them from a transition/perturbation kernel. We thus specify the forward process with the same factorization as in DDIM (Song et al., 2021a) via $q(z_{0:T}|x) = q(z_T|x) \prod_{t=1}^{T} q(z_{t-1}|z_t, x)$, and consider a discrete-time non-Markovian process as follows,

$$
\begin{cases}
q(z_T|x) := \mathcal{N}(\alpha_T x, \sigma_T^2 I), \\
q(z_{t-1}|z_t, x) := \mathcal{U}\!\left(b(t) z_t + c(t) x - \frac{\Delta(t)}{2},\; b(t) z_t + c(t) x + \frac{\Delta(t)}{2}\right), \quad t = 1, 2, \ldots, T,
\end{cases} \tag{4}
$$

where $b(t)$, $c(t)$, and $\Delta(t)$ are scalar-valued functions of time. Note that unlike in Gaussian diffusion, our $q(z_{t-1}|z_t, x)$ is chosen to be a uniform distribution so that it can be efficiently simulated with UQ (as a result, our $q(z_t|x)$ for any $t \neq T$ does not admit a simple distributional form). There is freedom in these choices of the forward process, but for simplicity we base them closely on the Gaussian case: we choose a standard isotropic Gaussian $q(z_T|x)$, and set $b(t)$, $c(t)$, $\Delta(t)$ so that $q(z_{t-1}|z_t, x)$ has the same mean and variance as in the Gaussian case (see Appendix A for more details):

$$
b(t) = \frac{\alpha_t}{\alpha_{t-1}} \frac{\sigma_{t-1}^2}{\sigma_t^2}, \qquad
c(t) = \sigma_{t|t-1}^2 \frac{\alpha_{t-1}}{\sigma_t^2}, \qquad
\Delta(t) = \sqrt{12}\, \sigma_{t|t-1} \frac{\sigma_{t-1}}{\sigma_t}, \qquad
\text{with } \sigma_{t|t-1}^2 := \sigma_t^2 - \frac{\alpha_t^2}{\alpha_{t-1}^2} \sigma_{t-1}^2.
$$

We note here that a sample from $q(z_t|z_T, x)$ can be written as a sum of independent uniform random variables, which, as we increase $T$, converges in distribution to a Gaussian by the Central Limit Theorem. This implies that $q(z_t|x)$ also converges to a Gaussian for every $t$, and that our forward process has the same underlying continuous-time limit as in VDM (Kingma et al., 2021). We give the precise statement and a proof in Appendix A.3.

As in VDM (Kingma et al., 2021), the forward process schedules (i.e., $\alpha_t$ and $\sigma_t$, as well as $b(t)$, $c(t)$, $\Delta(t)$) can be learned end-to-end, e.g., by parameterizing $\sigma_t^2 = \mathrm{sigmoid}(\phi(t))$, where $\phi$ is a monotonic neural network.
We did not find learning the schedule to yield significant improvements compared to using a linear noise schedule similar to the one in Kingma et al. (2021).

3.2 BACKWARD PROCESS

Analogously to the Gaussian case, we want to define a conditional distribution $p(z_{t-1}|z_t)$ that leverages a denoising model $\hat{x}_t = \hat{x}_\theta(z_t, t)$ and closely matches the forward posterior $q(z_{t-1}|z_t, x)$. In our case, the forward posterior corresponds to a uniform noise channel with width $\Delta(t)$, i.e., $z_{t-1} = b(t) z_t + c(t) x + \Delta(t) u_t$, $u_t \sim \mathcal{U}(-1/2, 1/2)$; to simulate it with UQ, we choose a density model for $z_{t-1}$ with the same form as the convolution in eq. (3). Specifically, we let

$$
p(z_{t-1}|z_t) = g_\theta(z_{t-1}; z_t, t) \star \mathcal{U}(-\Delta(t)/2, \Delta(t)/2), \tag{5}
$$

where $g_\theta(z_{t-1}; z_t, t)$ is a learned density chosen to match $q(z_{t-1}|z_t, x)$. Recall that in Gaussian diffusion (Kingma et al., 2021), $p(z_{t-1}|z_t)$ is chosen to be a Gaussian of the form $q(z_{t-1}|z_t, x = \hat{x}_\theta(z_t; t))$, i.e., the same as $q(z_{t-1}|z_t, x)$ but with the original data $x$ replaced by a denoised prediction $x = \hat{x}_\theta(z_t; t)$. For simplicity, we base $g_\theta$ closely on the choice of $p(z_{t-1}|z_t)$ in Gaussian diffusion, e.g.,

$$
g_\theta(z_{t-1}; z_t, t) = \mathcal{N}\!\left(b(t) z_t + c(t) \hat{x}_\theta(z_t; t),\; \sigma_Q^2(t) I\right) \tag{6}
$$

or a logistic distribution with the same mean and variance,

$$
g_\theta(z_{t-1}; z_t, t) = \mathrm{Logistic}\!\left(b(t) z_t + c(t) \hat{x}_\theta(z_t; t),\; \sigma_Q^2(t) I\right), \tag{7}
$$

where $\sigma_Q^2(t)$ is the variance of the Gaussian forward posterior, and we use the same noise-prediction network for $\hat{x}_\theta$ as in Kingma et al. (2021). We found the Gaussian and logistic distributions to give similar results, but the logistic to be numerically more stable, and therefore adopt it in all our experiments.

Inspired by Nichol & Dhariwal (2021), we found that learning a per-coordinate variance in the reverse process significantly improves the log-likelihood, which we demonstrate in Sec. 5. In practice, this is implemented by doubling the output dimension of the score network to also compute a tensor of scaling factors $s_\theta(z_t)$, so that the variance of $g_\theta$ is $\sigma_\theta^2 = \sigma_Q^2(t)\, s_\theta(z_t)$.
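
For the logistic choice in eq. (7), the convolution in eq. (5) has a closed form as a difference of logistic CDFs, and the entropy model used for UQ is the same quantity evaluated on a shifted grid. The following NumPy sketch illustrates this under stated assumptions: the function names and the scalar `mean` and `std` inputs are hypothetical stand-ins for the network outputs (with a learned variance, `std` would be $\sigma_Q(t)\sqrt{s_\theta(z_t)}$), and the grid is truncated to a finite range.

```python
import numpy as np

def logistic_cdf(x):
    """CDF of the standard logistic distribution, written with tanh for stability."""
    return 0.5 * (1.0 + np.tanh(0.5 * x))

def reverse_density(z, mean, std, delta):
    """Density of Logistic(mean, variance std^2) convolved with U(-delta/2, delta/2)."""
    scale = std * np.sqrt(3.0) / np.pi   # logistic scale whose variance equals std^2
    upper = logistic_cdf((z + delta / 2 - mean) / scale)
    lower = logistic_cdf((z - delta / 2 - mean) / scale)
    return (upper - lower) / delta

def entropy_model(grid, u, mean, std, delta):
    """P(k | z_t, u_t) for k on the grid delta * Z: the mass that g_theta assigns
    to the cell of width delta centred at z_{t-1} = k - delta * u."""
    return delta * reverse_density(grid - delta * u, mean, std, delta)

delta, u = 0.5, 0.2                      # grid width Delta(t) and shared noise u_t
grid = delta * np.arange(-400, 401)      # truncated grid of candidate values k
pmf = entropy_model(grid, u, mean=0.3, std=1.0, delta=delta)
```

Because the cells of width $\Delta$ tile the real line for any offset, the probabilities sum to one up to the truncation of the grid, so they can be handed directly to an entropy coder.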
Refer to Appendix B.2 for a more detailed analysis of the log-likelihood and how a learned variance is beneficial.

We note that other possibilities for $g_\theta$ exist besides Gaussian or logistic, e.g., mixture distributions (Cheng et al., 2020), which trade off higher computation cost for increased modeling power. Analyzing the time reversal of our forward process, similarly to Song et al. (2021a), may also suggest better choices of the reverse-time density model $g_\theta$. We leave these explorations to future work.

We adopt the same form of categorical likelihood model $p(x|z_0)$ as in VDM (Kingma et al., 2021), as well as the use of Fourier features.

Algorithm 1 Encoding
  $z_T \sim p(z_T)$
  for $t = T, \ldots, 2, 1$ do
    Let $\Delta_t = \Delta(t)$, $\mu_Q = b(t) z_t + c(t) x$.
    Compute the parameters of $p(z_{t-1}|z_t)$.
    ▷ Send $z_{t-1} \sim q(z_{t-1}|z_t, x)$ with UQ:
    $u_t \sim \mathcal{U}(-1/2, 1/2)$.
    $k_t = \Delta_t \lfloor \mu_Q/\Delta_t + u_t \rceil$.
    Derive entropy model $p(k|z_t, u_t)$ by discretizing $p(z_{t-1}|z_t)$.
    Entropy-encode $k_t$ under $p(k|z_t, u_t)$.
    $z_{t-1} = k_t - \Delta_t u_t$.
  end for
  Entropy-encode $x$ with $p(x|z_0)$.

Algorithm 2 Decoding
  $z_T \sim p(z_T)$  ▷ Using shared seed
  for $t = T, \ldots, 2, 1$ do
    Let $\Delta_t = \Delta(t)$.
    Compute the parameters of $p(z_{t-1}|z_t)$.
    $u_t \sim \mathcal{U}(-1/2, 1/2)$.  ▷ Using shared seed
    Derive entropy model $p(k|z_t, u_t)$ by discretizing $p(z_{t-1}|z_t)$.
    Entropy-decode $k_t$ under $p(k|z_t, u_t)$.
    $z_{t-1} = k_t - \Delta_t u_t$.
    $\hat{x}_t = \hat{x}_\theta(z_{t-1}; t-1)$.  ▷ Lossy reconstruction
  end for
  Entropy-decode $x$ with $p(x|z_0)$.  ▷ Lossless

3.3 PROGRESSIVE CODING

Given a UQDM trained on the NELBO in eq. (1), we can use it for progressive compression similarly to Ho et al. (2020) and Theis et al. (2022), as outlined in Section 2.

The initial step $t = T$ involves transmitting a Gaussian $z_T$. Since we do not assume access to an efficient REC scheme for the Gaussian channel, we will instead draw the same $z_T \sim p(z_T) = \mathcal{N}(0, I)$ on both the encoder and decoder side, with the help of a shared pseudo-random seed.¹ To avoid a train/compression mismatch, we therefore always ensure $q(z_T|x) \approx p(z_T)$ and hence $L_T \approx 0$.
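
As a rough illustration of how Algorithms 1 and 2 keep the sender and receiver in sync, the sketch below runs the UQ loop end to end in NumPy. Several parts are assumptions made only for this example: the linear variance-preserving schedule in `make_schedule`, the function names, the use of $z_{t-1}/\alpha_{t-1}$ as a crude stand-in for the learned denoiser $\hat{x}_\theta$, and the omission of the entropy model and entropy coder (the integer grid indices are passed as-is, and the final lossless step is skipped).

```python
import numpy as np

def make_schedule(T):
    """Hypothetical variance-preserving schedule with sigma_t^2 increasing in t."""
    sigma2 = np.linspace(0.01, 0.99, T + 1)
    alpha = np.sqrt(1.0 - sigma2)
    return alpha, sigma2

def coefficients(alpha, sigma2, t):
    """b(t), c(t), Delta(t) from the moment-matching formulas of Sec. 3.1."""
    sigma2_cond = sigma2[t] - (alpha[t] / alpha[t - 1]) ** 2 * sigma2[t - 1]
    b = (alpha[t] / alpha[t - 1]) * (sigma2[t - 1] / sigma2[t])
    c = sigma2_cond * alpha[t - 1] / sigma2[t]
    delta = np.sqrt(12.0 * sigma2_cond * sigma2[t - 1] / sigma2[t])
    return b, c, delta

def encode(x, T, seed):
    alpha, sigma2 = make_schedule(T)
    rng = np.random.default_rng(seed)
    z = rng.standard_normal(x.shape)               # z_T ~ p(z_T) from the shared seed
    indices = []
    for t in range(T, 0, -1):
        b, c, delta = coefficients(alpha, sigma2, t)
        mu_q = b * z + c * x                        # forward posterior mean
        u = rng.uniform(-0.5, 0.5, size=x.shape)    # shared noise u_t
        n = np.round(mu_q / delta + u).astype(np.int64)   # k_t = delta * n
        indices.append(n)                           # would be entropy-coded in practice
        z = delta * n - delta * u                   # z_{t-1} = k_t - Delta_t * u_t
    return indices, z

def decode(indices, shape, T, seed):
    alpha, sigma2 = make_schedule(T)
    rng = np.random.default_rng(seed)
    z = rng.standard_normal(shape)                  # same z_T; a real decoder feeds it to the network
    reconstructions = []
    for t, n in zip(range(T, 0, -1), indices):
        _, _, delta = coefficients(alpha, sigma2, t)
        u = rng.uniform(-0.5, 0.5, size=shape)      # same u_t as the sender
        z = delta * n - delta * u
        reconstructions.append(z / alpha[t - 1])    # crude stand-in for the denoiser
    return z, reconstructions

x = np.random.default_rng(42).uniform(-1.0, 1.0, size=(8,))
indices, z0_sender = encode(x, T=4, seed=0)
z0_receiver, reconstructions = decode(indices, x.shape, T=4, seed=0)
```

Passing only the first few entries of `indices` to `decode` mimics decoding a partial bit-stream: the loop simply stops early and the last entry of `reconstructions` serves as the current lossy output.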
At any subsequent step $t$, instead of sampling $z_{t-1} = b(t) z_t + c(t) x + \Delta(t) u_t$ as in training, we apply UQ to communicate the forward posterior mean vector $\mu_Q := b(t) z_t + c(t) x$. Specifically, given $z_t$, the sender computes $\mu_Q$ and the parameters of $p(z_{t-1}|z_t)$ (by evaluating the score network), draws a pseudo-random noise $u_t \sim \mathcal{U}(-1/2, 1/2)$, quantizes $\mu_Q$ to $k_t = \Delta_t \lfloor \mu_Q/\Delta_t + u_t \rceil$ where $\Delta_t := \Delta(t)$, derives an entropy model $p(k|z_t, u_t)$ (by discretizing $p(z_{t-1}|z_t)$ on a grid of width $\Delta_t$ and offset by $u_t$), and entropy-encodes $k_t$ under $p(k|z_t, u_t)$. The receiver draws the same pseudo-random $u_t \sim \mathcal{U}(-1/2, 1/2)$, entropy-decodes $k_t$ under the same entropy model $p(k|z_t, u_t)$, and computes $z_{t-1} = k_t - \Delta_t u_t$ and (optionally) a lossy reconstruction $\hat{x}_t$ from $z_{t-1}$. Finally, after having transmitted $z_0$ when $t = 1$, $x$ is losslessly compressed using the entropy model $p(x|z_0)$. Pseudocode can be found in Algorithms 1 and 2.

Note that we can replace the denoised prediction $\hat{x}_t = \hat{x}_\theta(z_{t-1}; t-1)$ with more sophisticated ways to obtain lossy reconstructions such as flow-based reconstruction or ancestral sampling (Theis et al., 2022). As our method is progressive, the algorithm can be stopped at any time and the most recent lossy reconstruction used as the output. Compared to compression with VDM (Theis et al., 2022), the main difference is that we transmit $z_{t-1} \sim q(z_{t-1}|z_t, x)$ under $p(z_{t-1}|z_t)$ using UQ instead of Gaussian channel simulation; the overall computation complexity is now dominated by the evaluation of the denoising network $\hat{x}_\theta$ (for computing the parameters of $p(z_{t-1}|z_t)$), which scales linearly with the number of time steps. We implemented the progressive codec using tensorflow-compression (Ballé et al.), and found the actual file size to be within 3% of the theoretical NELBO.

4 RELATED WORK

Diffusion models (Sohl-Dickstein et al., 2015) have achieved impressive results on image generation (Ho et al., 2020; Song et al., 2021a) and density estimation (Kingma et al., 2021; Nichol & Dhariwal, 2021).
Our work is closely based on the latent-variable formalism of diffusion models (Ho et al., 2020; Kingma et al., 2021), with our forward and backward processes adapted from the Gaussian case. Our forward process is non-Markovian like DDIM (Song et al., 2021a), and our reverse process uses learned variance, inspired by Nichol & Dhariwal (2021). Recent research has focused on efficient sampling (Song et al., 2021a; Pandey et al., 2023) and better scalability via latent diffusion (Rombach et al., 2022), consistency models (Song et al., 2023), and distillation (Sauer et al., 2024), whereas we focus on the compression task. Related to our approach, cold diffusion (Bansal et al., 2024) showed that alternative forward processes other than the Gaussian still produce good image generation results.

Several diffusion-based neural compression methods exist, but they use conditional diffusion models (Yang & Mandt, 2023; Careil et al., 2023; Hoogeboom et al., 2023) which do not permit progressive decoding. Furthermore, they are also less flexible as a separate model has to be trained for each bitrate.

Progressive neural compression has so far been mostly achieved by combining non-linear transform coding (for example using a VAE) with progressive quantization schemes. Such methods include PLONQ (Lu et al., 2021), which uses nested quantization, DPICT (Lee et al., 2022) and its extension CTC (Jeon et al., 2023), which use trit-plane coding, and DeepHQ (Lee et al., 2024), which uses a learned quantization scheme. Finally, codecs based on hierarchical VAEs (Townsend et al., 2024; Duan et al., 2023) are closely related but do not directly target the realism criterion.

¹ This corresponds to a trivial REC problem where a sample from $q = p$ can be transmitted using $\mathrm{KL}(q \,\|\, p) = 0$ bits.

5 EXPERIMENTS

We train UQDM end-to-end by directly optimizing the NELBO loss eq. (1), summing up $L_t$ across all time steps.
This involves simulating the entire forward process $\{z_0, \ldots, z_T\}$ according to eq. (4), which can be computationally expensive when $T$ is large; the cost can be avoided by using a Monte-Carlo estimate based on a single $L_t$ as in the diffusion literature (Ho et al., 2020). We found a small $T$ ($< 10$) to give the best compression performance, and therefore leave the investigation of training with a single-step Monte-Carlo objective to future work. Note that this would require sampling from the marginal distribution $q(z_t|x)$, which becomes approximately Gaussian for large $T$ (see Sec. 3.1).

When considering the progressive compression performance of VDM and UQDM, we consider three ways of computing progressive reconstructions from $z_t$: denoise, where $\hat{x} = \hat{x}_\theta(z_t; t)$ is the prediction from the denoising network; ancestral, where $\hat{x} \sim p(x|z_t)$ is drawn by ancestral sampling; and flow-based, where $\hat{x} \sim p(x|z_t)$ is computed deterministically using the probability flow ODE as in Theis et al. (2022). In VDM, the probability flow ODE produces the same trajectory of marginal distributions as ancestral sampling, but gives improved lossy compression performance (Theis et al., 2022). In the case of UQDM, we apply the same update equations and observe similar benefits, likely due to the continuous-time equivalence of the underlying processes of UQDM and VDM. See Appendix B.3 for details. Note that DiffC-A and DiffC-F (Theis et al., 2022) directly correspond to our VDM results with ancestral and flow-based reconstructions.

In all experiments involving VDM and UQDM, we always use the same denoising U-net architecture for both, except that UQDM uses twice as many output dimensions to additionally predict the reverse-process variance (see Sec. 3). We refer to Appendix Sec. C for further experiment details.
5.1 SWIRL TOY DATA

We obtain initial insights into the behavior of our proposed UQDM by experimenting on toy swirl data (see Appendix C.1 for details) and comparing with the hypothetical performance of VDM (Kingma et al., 2021).

First, we train UQDM end-to-end for various values of $T \in \{3, 4, 5, 10, 15, 20, 30\}$, with and without learning the reverse process variance. For comparison, we also train a single VDM with $T = 1000$, but compute the progressive-coding NELBO eq. (1) using different $T$. Fig. 2 plots the resulting NELBO values, corresponding to the bits-per-dimension cost of lossless compression. We observe that for UQDM, learning the reverse-process variance significantly improves the NELBO across all $T$, and a higher $T$ is not necessarily better. In fact, there seems to be an optimal $T \approx 5$, for which we obtain a bpd of around 8. The theoretical performance of VDM, by comparison, monotonically improves with $T$ (green curve) until it converges to a bpd of 5.8 at $T = 1000$, consistent with theory (Kingma et al., 2021). We also tried initializing a UQDM without learned reverse-process variances to use the pre-trained VDM weights; interestingly, this resulted in very similar performance to the end-to-end trained result (blue curve), and further finetuning gave little to no improvement.

[Figure 2 panels. Left: NELBO (bpd) vs. T, with curves UQDM (fixed rev. var.), UQDM, VDM, and VDM (T = 1000). Middle and right: plotted against Rate (bpd), with curves VDM (T = 100) and UQDM with T = 5, 10, 20, 30.]

Figure 2: Results on swirl data. The VDM curves correspond to the hypothetical performance of REC that remains computationally intractable. Left: Lossless compression rates v.s. the choice of $T$, for UQDM with/without learned reverse-process variance (blue/orange) and VDM (green). For UQDM, learning the reverse-process variance significantly improved the NELBO, and there is an optimal $T \approx 5$.
Middle, Right: Progressive lossy compression performance for VDM and UQDM, measured in fidelity (PSNR) v.s. bit-rate (middle), or realism (sliced Wasserstein distance) v.s. bit-rate (right).

Figure 3: Progressive lossy compression performance of UQDM on the CIFAR10 dataset, comparing fidelity (PSNR) and realism (FID) with bit-rate per pixel (bpp), using either ancestral sampling or denoised prediction to obtain progressive reconstructions as indicated. The VDM curve corresponds to hypothetical performance of REC that is computationally intractable. We achieve better fidelity and realism than JPEG and JPEG2000 across all bit-rates and than BPG in the high bit-rate regime.

This suggests that a pretrained VDM can already be used for progressive compression with UQ via our moment-matching scheme (see Section 3), although the compression performance will be much worse compared to end-to-end trained UQDM with learned reverse-process variances.

We then examine the lossy compression performance of progressive coding. Here, we train UQDM end-to-end with learned reverse-process variances, and perform progressive reconstruction by ancestral sampling. Figure 2 plots the results in fidelity v.s. bit-rate and realism v.s. bit-rate. For reference, we also show the theoretical performance of VDM using $T = 100$ discretization steps, assuming a hypothetical REC algorithm that operates with no overhead. The results are consistent with those on lossless compression, with a similar performance ranking for $T$ among UQDM, and a gap remains to the hypothetical performance of VDM.

Finally, we examine the quality of unconditional samples from UQDM with varying $T$. Although our earlier results indicate worse compression performance for $T > 5$, Figure 7 shows that UQDM's sample quality monotonically improves with increasing $T$.

5.2 CIFAR10

Next, we apply our method to natural images. We start with the CIFAR10 dataset containing $32 \times 32$ images.
We train a baseline VDM model with a smaller architecture than that used by Kingma et al. (2021), converging to around 3 bits per dimension. We use the noise schedule $\sigma_t^2 = \mathrm{sigmoid}(\gamma_t)$, where $\gamma_t$ is linear in $t$ with learned endpoints $\gamma_T$ and $\gamma_0$. For our UQDM model we empirically find that $T \approx 4$ yields the best trade-off between bit-rate and reconstruction quality. We train our model end-to-end on the progressive coding NELBO eq. (1) with learned reverse-process variances.

Figure 4: Progressive lossy compression performance of UQDM on the ImageNet64 dataset, comparing fidelity (PSNR) and realism (FID) with bit-rate per pixel (bpp), using either ancestral sampling or the denoised prediction to obtain progressive reconstructions as indicated. The VDM curve corresponds to hypothetical performance of REC that remains computationally intractable. While the reconstruction quality of other codecs like CDC or BPG plateaus at higher bit-rates, our method continues to gradually improve fidelity and realism even at higher bit-rates, where it achieves the best results of any baseline. We beat the compression performance of JPEG, JPEG2000, and CTC across all bit-rates. Note that only UQDM, CTC, and JPEG2000 implement progressive coding.

We compare against the traditional codecs JPEG, JPEG2000, and BPG (Bellard, 2018). For JPEG and BPG we use a fixed set of quality levels and encode the images independently; for JPEG2000 we instead use its progressive compression mode, which allows us to set the approximate size reduction in each quality layer and obtain a rate-distortion curve from one bit-stream. As shown in Figure 3, we consistently outperform both JPEG and JPEG2000 over all bit-rates and metrics.
Even though BPG, a competitive non-progressive codec optimized for rate-distortion performance, achieves better reconstruction fidelity (as measured in PSNR) in the low bit-rate regime, our method closely matches BPG in realism (as measured in FID) and even beats BPG in PSNR at higher bit-rates. The theoretical performance of compression with Gaussian diffusion (e.g., VDM) (Theis et al., 2022), especially with a high number of steps such as T = 1000, is currently computationally infeasible, both due to the large number of neural function evaluations required and due to the intractable runtime of REC algorithms in the Gaussian case. Still, for reference we report theoretical results both for T = 1000 and T = 20, where the latter uses a smaller and more practical number of diffusion/progressive reconstruction steps.

5.3 IMAGENET 64 × 64

Finally, we present results on the ImageNet 64 × 64 dataset. We train a baseline VDM model with the same architecture as in (Kingma et al., 2021), reproducing their reported BPD of around 3.4; we train a UQDM of the same architecture with learned reverse-process variances and T = 4. In addition to the baselines described in the previous section, we also compare with CTC (Jeon et al., 2023), a recent progressive neural codec, and CDC (Yang & Mandt, 2023), a non-progressive neural codec based on a conditional diffusion model that can trade off between distortion and realism via a hyperparameter p. We separately report results for both p = 0, which purely optimizes the conditional diffusion objective, and p = 0.9, which prioritizes more realistic reconstructions by additionally minimizing a perceptual loss. For CTC we use pre-trained model checkpoints from the official implementation (Jeon et al., 2023); for CDC we fix the architecture but train a new model for each trade-off between bit-rate and reconstruction quality/realism. The results are shown in Figure 4.
When obtaining progressive reconstructions from denoised predictions, UQDM again outperforms both JPEG and JPEG2000. Our results are comparable to, if not slightly better than, CTC, and even though the reconstruction quality of other codecs plateaus at higher bit-rates, our method continues to improve quality and realism gradually, even at higher bit-rates. Refer to Figures 1, 5 and 8 for qualitative results demonstrating progressive coding and comparison across codecs. At high bit-rates, UQDM preserves details better than other neural codecs. UQDM with denoised predictions tends to introduce blurriness, while ancestral sampling introduces graininess at low bit-rates, likely because the UQDM is unable to completely capture the data distribution and achieve perfect realism. Flow-based denoising matches the distortion of denoised predictions but achieves significantly higher realism as measured by FID. We note that the ideal of perfect realism (i.e., achieving zero divergence between the data distribution and the model's distribution) remains a challenge even for state-of-the-art diffusion models.

Figure 5: Example progressive reconstructions from UQDM trained with T = 4, obtained with denoised prediction (left) or ancestral sampling (right). The latter avoids blurriness but introduces graininess at low bit-rates, likely because the UQDM is unable to completely capture the data distribution and achieve perfect realism (perfect realism is also difficult to achieve for Gaussian diffusion, as seen in the rate-realism plot of Theis et al. (2022)). Flow-based reconstructions are qualitatively similar to the denoising-based reconstructions and can be found in Figure 8.

6 DISCUSSION

In this paper, we presented a new progressive coding scheme based on a novel adaptation of the standard diffusion model.
Our universally quantized diffusion model (UQDM) implements the idea of progressive compression with an unconditional diffusion model (Theis et al., 2022) but bypasses the intractability of Gaussian channel simulation by using universal quantization (Zamir & Feder, 1992) instead. We present promising first results that match or outperform classic and neural compression baselines, including a recent progressive neural image compression method (Jeon et al., 2023). Given the practical advantages of a progressive neural codec (dynamic trade-offs between rate, distortion and computation, support for both lossy and lossless compression, and potential for high realism, all in a single model), our approach brings neural compression a step closer towards real-world deployment.

Future work may further improve our approach to close the performance gap to Gaussian diffusion; the latter represents the ideal lossy compression performance under a perfect realism constraint for an approximately Gaussian-distributed data source (Theis et al., 2022). This may require more sophisticated methods for computing progressive reconstructions that can achieve higher quality with fewer steps, or exploring different parameterizations of the forward and reverse processes with better theoretical properties. Finally, we expect further improvement in computational efficiency and scalability when combining our method with ideas such as latent diffusion (Rombach et al., 2022), distillation (Sauer et al., 2024), or consistency models (Song et al., 2023).

ETHICS STATEMENT

Our work focuses on the methodology of a learning-based data compression method, and thus has no direct ethical implications. The deployment of neural lossy compression, however, carries with it risks of miscommunication and misrepresentation (Yang et al., 2023), which need to be carefully analyzed and mitigated in future research.
REPRODUCIBILITY STATEMENT

We include proofs for all theoretical results introduced in the main text in Appendix A and B. We include further experimental and implementation details (including model architectures and other hyperparameter choices) in Appendix C. Our code can be found at https://github.com/mandt-lab/uqdm.

ACKNOWLEDGMENTS

Justus Will and Yibo Yang acknowledge support from the HPI Research Center in Machine Learning and Data Science at UC Irvine. Stephan Mandt acknowledges support from the National Science Foundation (NSF) under an NSF CAREER Award IIS-2047418 and IIS-2007719, the NSF LEAP Center, by the Department of Energy under grant DE-SC0022331, the IARPA WRIVA program, the Hasso Plattner Research Center at UCI, the Chan Zuckerberg Initiative, and gifts from Qualcomm and Disney. We thank Kushagra Pandey for feedback on the manuscript.

REFERENCES

Eirikur Agustsson and Lucas Theis. Universally Quantized Neural Compression. NeurIPS, 2020.

Johannes Ballé, David Minnen, Saurabh Singh, Sung Jin Hwang, and Nick Johnston. Variational Image Compression with a Scale Hyperprior. ICLR, 2018.

Johannes Ballé, Sung Jin Hwang, Nick Johnston, and David Minnen. Tensorflow-compression: Data compression in tensorflow. URL https://github.com/tensorflow/compression.

Arpit Bansal, Eitan Borgnia, Hong-Min Chu, Jie Li, Hamid Kazemi, Furong Huang, Micah Goldblum, Jonas Geiping, and Tom Goldstein. Cold Diffusion: Inverting Arbitrary Image Transforms without Noise. NeurIPS, 2024.

Fabrice Bellard. Bpg image format, 2018. URL https://bellard.org/bpg/.

Marlène Careil, Matthew J Muckley, Jakob Verbeek, and Stéphane Lathuilière. Towards Image Compression with Perfect Realism at Ultra-low Bitrates. ICLR, 2023.

Zhengxue Cheng, Heming Sun, Masaru Takeuchi, and Jiro Katto. Learned Image Compression with Discretized Gaussian Mixture Likelihoods and Attention Modules. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 7939–7948, 2020.

Paul Cuff. Communication requirements for generating correlated random variables. In 2008 IEEE International Symposium on Information Theory, pp. 1393–1397. IEEE, 2008.

Zhihao Duan, Ming Lu, Zhan Ma, and Fengqing Zhu. Lossy Image Compression with Quantized Hierarchical VAEs. In IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 198–207, 2023.

Rick Durrett. Probability: theory and examples, volume 49. Cambridge university press, 2019.

Gergely Flamich. Greedy Poisson Rejection Sampling. NeurIPS, 2024.

Gergely Flamich, Marton Havasi, and José Miguel Hernández-Lobato. Compressing Images by Encoding their Latent Representations with Relative Entropy Coding. NeurIPS, 2020.

Gergely Flamich, Stratis Markou, and José Miguel Hernández-Lobato. Fast Relative Entropy Coding with A* Coding. ICML, 2022.

Gergely Flamich, Stratis Markou, and José Miguel Hernández-Lobato. Faster Relative Entropy Coding with Greedy Rejection Coding. NeurIPS, 2024.

Daniel Goc and Gergely Flamich. On Channel Simulation with Causal Rejection Samplers. arXiv preprint arXiv:2401.16579, 2024.

Marton Havasi, Robert Peharz, and José Miguel Hernández-Lobato. Minimal random code learning: Getting bits back from compressed model parameters. arXiv preprint arXiv:1810.00440, 2018.

Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising Diffusion Probabilistic Models. NeurIPS, 2020.

Emiel Hoogeboom, Eirikur Agustsson, Fabian Mentzer, Luca Versari, George Toderici, and Lucas Theis. High-Fidelity Image Compression with Score-Based Generative Models. arXiv preprint arXiv:2305.18231, 2023.

Seungmin Jeon, Kwang Pyo Choi, Youngo Park, and Chang-Su Kim. Context-Based Trit-Plane Coding for Progressive Image Compression. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 14348–14357, 2023.

Diederik Kingma, Tim Salimans, Ben Poole, and Jonathan Ho. Variational Diffusion Models. NeurIPS, 2021.
Jae-Han Lee, Seungmin Jeon, Kwang Pyo Choi, Youngo Park, and Chang-Su Kim. DPICT: Deep Progressive Image Compression using Trit-Planes. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16113–16122, 2022.

Jooyoung Lee, Se Yoon Jeong, and Munchurl Kim. DeepHQ: Learned Hierarchical Quantizer for Progressive Deep Image Coding. arXiv preprint arXiv:2408.12150, 2024.

Yadong Lu, Yinhao Zhu, Yang Yang, Amir Said, and Taco S Cohen. Progressive Neural Image Compression with Nested Quantization and Latent Ordering. In IEEE International Conference on Image Processing, pp. 539–543, 2021.

David JC MacKay. Information Theory, Inference and Learning Algorithms. Cambridge University Press, 2003.

Alexander Quinn Nichol and Prafulla Dhariwal. Improved Denoising Diffusion Probabilistic Models. ICML, 2021.

Kushagra Pandey, Maja Rudolph, and Stephan Mandt. Efficient Integrators for Diffusion Generative Models. ICLR, 2023.

Yury Polyanskiy and Yihong Wu. Information theory: From coding to learning. Book draft, 2022.

Lawrence Roberts. Picture Coding using Pseudo-Random Noise. IRE Transactions on Information Theory, pp. 145–154, 1962.

Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-Resolution Image Synthesis with Latent Diffusion Models. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10684–10695, 2022.

Axel Sauer, Frederic Boesel, Tim Dockhorn, Andreas Blattmann, Patrick Esser, and Robin Rombach. Fast High-Resolution Image Synthesis with Latent Adversarial Diffusion Distillation. arXiv preprint arXiv:2403.12015, 2024.

Jascha Sohl-Dickstein, Eric Weiss, Niru Maheswaranathan, and Surya Ganguli. Deep Unsupervised Learning using Nonequilibrium Thermodynamics. ICML, 2015.

Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising Diffusion Implicit Models. ICLR, 2021a.

Yang Song and Stefano Ermon. Generative Modeling by Estimating Gradients of the Data Distribution. NeurIPS, 2019.
Yang Song, Jascha Sohl-Dickstein, Diederik P Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score-Based Generative Modeling through Stochastic Differential Equations. ICLR, 2021b.

Yang Song, Jascha Sohl-Dickstein, Diederik P Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score-based generative modeling through stochastic differential equations. In International Conference on Learning Representations, 2021c.

Yang Song, Prafulla Dhariwal, Mark Chen, and Ilya Sutskever. Consistency Models. ICML, 2023.

Lucas Theis and Noureldin Y Ahmed. Algorithms for the Communication of Samples. ICML, 2022.

Lucas Theis, Tim Salimans, Matthew D Hoffman, and Fabian Mentzer. Lossy Compression with Gaussian Diffusion. arXiv preprint arXiv:2206.08889, 2022.

James Townsend, Thomas Bird, Julius Kunze, and David Barber. HiLLoC: Lossless Image Compression with Hierarchical Latent Variable Models. ICLR, 2024.

Pascal Vincent. A Connection Between Score Matching and Denoising Autoencoders. Neural Computation, pp. 1661–1674, 2011.

Ruihan Yang and Stephan Mandt. Lossy Image Compression with Conditional Diffusion Models. NeurIPS, 2023.

Yibo Yang, Stephan Mandt, Lucas Theis, et al. An Introduction to Neural Data Compression. Foundations and Trends in Computer Graphics and Vision, pp. 113–200, 2023.

R. Zamir and M. Feder. On Universal Quantization by Randomized Uniform/Lattice Quantizers. IEEE Transactions on Information Theory, pp. 428–436, 1992.

A FORWARD PROCESS DETAILS

A.1 GAUSSIAN (DDPM/VDM)

For completeness and reference, we restate the forward process and related conditionals given in (Kingma et al., 2021). The forward process is defined by

$$q(z_t \mid x) := \mathcal{N}(\alpha_t x, \sigma_t^2 I),$$

where $\alpha_t$ and $\sigma_t^2$ are positive scalar-valued functions of $t$.
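As in (Kingma et al., 2021), we define the following shorthand notation, which is used in the rest of the appendix: for any $s < t$, let

$$\alpha_{t|s} := \frac{\alpha_t}{\alpha_s}, \quad \sigma^2_{t|s} := \sigma_t^2 - \frac{\alpha_t^2}{\alpha_s^2}\sigma_s^2, \quad b_{t|s} := \frac{\alpha_t}{\alpha_s}\frac{\sigma_s^2}{\sigma_t^2}, \quad c_{t|s} := \sigma^2_{t|s}\frac{\alpha_s}{\sigma_t^2}, \quad \beta_{t|s} := \sigma_{t|s}\frac{\sigma_s}{\sigma_t}.$$

By properties of the Gaussian distribution, it can be shown that for any $0 \le s < t \le T$,

$$q(z_t \mid z_s) = \mathcal{N}(\alpha_{t|s} z_s, \sigma^2_{t|s} I), \qquad q(z_s \mid z_t, x) = \mathcal{N}(b_{t|s} z_t + c_{t|s} x, \beta^2_{t|s} I).$$

In particular,

$$q(z_{t-1} \mid z_t, x) = \mathcal{N}(b_{t|t-1} z_t + c_{t|t-1} x, \beta^2_{t|t-1} I), \qquad q(z_t \mid z_T, x) = \mathcal{N}(b_{T|t} z_T + c_{T|t} x, \beta^2_{T|t} I),$$

and we can use the reparameterization trick to write

$$z_{t-1} = b_{t|t-1} z_t + c_{t|t-1} x + \beta_{t|t-1} \epsilon_t, \quad \epsilon_t \sim \mathcal{N}(0, I),$$
$$z_t = b_{T|t} z_T + c_{T|t} x + \beta_{T|t} \epsilon_T, \quad \epsilon_T \sim \mathcal{N}(0, I).$$

A.2 UNIFORM (OURS)

Our forward process is specified by $q(z_T \mid x)$ and $q(z_{t-1} \mid z_t, x)$ for each $t$, and closely follows that of the Gaussian diffusion. We set $q(z_T \mid x)$ to be the same as in the Gaussian case, i.e., $q(z_T \mid x) := \mathcal{N}(\alpha_T x, \sigma_T^2 I)$, and $q(z_{t-1} \mid z_t, x)$ to be a uniform distribution with the same mean and variance as in the Gaussian case, such that

$$q(z_{t-1} \mid z_t, x) := \mathcal{U}\big(b_{t|t-1} z_t + c_{t|t-1} x - \sqrt{3}\,\beta_{t|t-1},\; b_{t|t-1} z_t + c_{t|t-1} x + \sqrt{3}\,\beta_{t|t-1}\big),$$

or in other words,

$$z_{t-1} = b_{t|t-1} z_t + c_{t|t-1} x + \sqrt{12}\,\beta_{t|t-1} u_t, \quad u_t \sim \mathcal{U}(-1/2, 1/2).$$

In the notation of eq. (4) this corresponds to letting $b(t) = b_{t|t-1}$, $c(t) = c_{t|t-1}$, $\Delta(t) = \sqrt{12}\,\beta_{t|t-1}$. It follows by algebraic manipulation that

$$z_t = b_{T|t} z_T + c_{T|t} x + \underbrace{\sum_{v=t+1}^{T} \sqrt{12}\,\delta_{v|t}\, u_v}_{:= \omega_t},$$

where $u_v \sim \mathcal{U}(-1/2, 1/2)$, $v = t+1, \dots, T$, are independent uniform noise variables, and

$$\delta_{v|t} := \beta_{v|v-1} \prod_{j=t+1}^{v-1} b_{j|j-1} = \frac{\sigma_t^2}{\alpha_t}\sqrt{\mathrm{SNR}(v-1) - \mathrm{SNR}(v)}, \qquad \mathrm{SNR}(s) := \frac{\alpha_s^2}{\sigma_s^2}.$$

It can be verified that

$$\mathbb{E}[\omega_t] = 0, \qquad \mathrm{Var}[\omega_t] = \sum_{v=t+1}^{T} \delta^2_{v|t}\, I = \frac{\sigma_t^4}{\alpha_t^2}\,[\mathrm{SNR}(t) - \mathrm{SNR}(T)]\, I = \beta^2_{T|t}\, I,$$

or in other words, at any step $t$ our forward-process posterior distribution $q(z_t \mid z_T, x)$ has the same mean and variance as in the Gaussian case.

A.3 CONVERGENCE TO THE GAUSSIAN CASE

We show that both forward processes are equivalent in the continuous-time limit.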
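To allow comparison across different numbers of steps $T$, we suppose that $\alpha_t$ and $\sigma_t$ are obtained from continuous-time schedules $\alpha(\cdot): [0, 1] \to \mathbb{R}^+$ and $\sigma(\cdot): [0, 1] \to \mathbb{R}^+$ (fixed ahead of time), such that $\alpha_t := \alpha(t/T)$ and $\sigma_t := \sigma(t/T)$ for $t = 0, \dots, T$, for any choice of $T$. As in VDM (Kingma et al., 2021), we assume that the continuous-time signal-to-noise ratio $\mathrm{snr}(\cdot) := \alpha(\cdot)^2/\sigma(\cdot)^2$ is strictly monotonically decreasing. To obtain the continuous-time limit, we hold the continuous time $\rho := t/T$ fixed for some $\rho \in [0, 1)$, and let $T \to \infty$ (or equivalently, let the time discretization $1/T \to 0$). We note that the quantities $b_{T|t}$, $c_{T|t}$, $\beta^2_{T|t}$ only depend on $\rho$, and are thus well-defined when we hold $\rho$ fixed and let $T \to \infty$:

$$b_{T|t} = \frac{\alpha_T}{\alpha_t}\frac{\sigma_t^2}{\sigma_T^2} = \frac{\alpha(1)}{\alpha(\rho)}\frac{\sigma^2(\rho)}{\sigma^2(1)},$$
$$c_{T|t} = \left(\sigma^2(1) - \frac{\alpha^2(1)}{\alpha^2(\rho)}\sigma^2(\rho)\right)\frac{\alpha(\rho)}{\sigma^2(1)},$$
$$\beta^2_{T|t} = \left(\sigma^2(1) - \frac{\alpha^2(1)}{\alpha^2(\rho)}\sigma^2(\rho)\right)\frac{\sigma^2(\rho)}{\sigma^2(1)} = \frac{\sigma^4(\rho)}{\alpha^2(\rho)}\big(\mathrm{snr}(\rho) - \mathrm{snr}(1)\big).$$

We start by showing that our $q(z_t \mid z_T, x)$ converges to the corresponding Gaussian distribution in VDM in the continuous-time limit, which in turn implies the convergence of our $q(z_t \mid x)$ to the corresponding Gaussian distribution in VDM.

Theorem A.1. For every fixed $\rho := t/T \in [0, 1)$, $q(z_t \mid z_T, x) \xrightarrow{d} \mathcal{N}(b_{T|t} z_T + c_{T|t} x, \beta^2_{T|t} I)$ as $T \to \infty$.

Proof. Recall the following fact in the forward process of UQDM (see eq. (8)):

$$z_t = b_{T|t} z_T + c_{T|t} x + \underbrace{\sum_{v=t+1}^{T} \sqrt{12}\,\delta_{v|t}\, u_v}_{:= \omega_t},$$

where $u_v \sim \mathcal{U}(-1/2, 1/2)$, $v = t+1, \dots, T$, are independent uniform noise variables, and

$$\delta_{v|t} := \beta_{v|v-1} \prod_{j=t+1}^{v-1} b_{j|j-1} = \frac{\sigma_t^2}{\alpha_t}\sqrt{\mathrm{SNR}(v-1) - \mathrm{SNR}(v)}, \qquad \mathrm{SNR}(s) := \frac{\alpha_s^2}{\sigma_s^2}.$$

It therefore suffices to show that $\omega_t$ converges in distribution to $\mathcal{N}(0, \beta^2_{T|t} I)$ in the continuous-time limit. Since the different coordinates of $\omega_t$ are independent, we focus on a single coordinate and study the continuous-time limit of a scalar $\Omega_t$, given by a sum of scaled uniform variables,

$$\Omega_t := \sum_{v=t+1}^{T} \sqrt{12}\,\delta_{v|t}\, U_v = \sum_{j=1}^{n} \sqrt{12}\,\frac{\sigma^2(\rho)}{\alpha(\rho)}\sqrt{\mathrm{snr}\!\left(\rho + \tfrac{j-1}{T}\right) - \mathrm{snr}\!\left(\rho + \tfrac{j}{T}\right)}\; U_j,$$

where the $U_j$'s are i.i.d. $\mathcal{U}(-1/2, 1/2)$ variables, and in the last step we set $n := n(T) = T - t$ and switched the summation index to $j = v - t$. Define a triangular array of variables by

$$X_{n,j} := \sqrt{12}\,\frac{\sigma^2(\rho)}{\alpha(\rho)}\sqrt{\mathrm{snr}\!\left(\rho + \tfrac{j-1}{T}\right) - \mathrm{snr}\!\left(\rho + \tfrac{j}{T}\right)}\; U_j$$

for $j = 1, 2, \dots, n$ and for $n \in \mathbb{N}^+$. For each $n$, $\{X_{n,j}\}_{j=1,2,\dots,n}$ are independent variables with $\mathbb{E}[X_{n,j}] = 0$, and it can be verified that

$$\sum_{j=1}^{n} \mathbb{E}[X^2_{n,j}] = \mathrm{Var}(\Omega_t) = \beta^2_{T|t} = \frac{\sigma^4(\rho)}{\alpha^2(\rho)}\big(\mathrm{snr}(\rho) - \mathrm{snr}(1)\big).$$

To apply the Lindeberg-Feller central limit theorem (Durrett, 2019, Theorem 3.4.10) to $\Omega_t = X_{n,1} + \dots + X_{n,n}$, it remains to verify the condition

$$\forall \epsilon > 0, \quad \lim_{n \to \infty} \sum_{j=1}^{n} \mathbb{E}\big[X^2_{n,j}\, \mathbb{1}\{|X_{n,j}| > \epsilon\}\big] = 0.$$

Let $\epsilon > 0$. Since $\mathrm{snr}(\cdot)$ is continuous on the compact domain $[0, 1]$, it is also uniformly continuous; hence there exists a $\delta > 0$ such that

$$|\mathrm{snr}(x_1) - \mathrm{snr}(x_2)| < \left(\frac{\epsilon\,\alpha(\rho)}{\sqrt{12}\,\sigma^2(\rho)}\right)^2, \quad \forall x_1, x_2 \text{ with } |x_1 - x_2| < \delta. \tag{12}$$

Let $T$ (and thus $n = T - t$) become sufficiently large such that $1/T < \delta$. Then, for all such $T$ (and thus $n$) sufficiently large, and for all $j$, it holds that $\mathbb{1}\{|X_{n,j}| > \epsilon\} = 0$ almost surely:

$$\begin{aligned}
P(|X_{n,j}| > \epsilon) &= P\left(\sqrt{12}\,\frac{\sigma^2(\rho)}{\alpha(\rho)}\sqrt{\mathrm{snr}\!\left(\rho + \tfrac{j-1}{T}\right) - \mathrm{snr}\!\left(\rho + \tfrac{j}{T}\right)}\; |U_j| > \epsilon\right) \notag \\
&= P\left(|U_j| > \frac{\epsilon\,\alpha(\rho)}{\sqrt{12}\,\sigma^2(\rho)} \cdot \frac{1}{\sqrt{\mathrm{snr}\!\left(\rho + \tfrac{j-1}{T}\right) - \mathrm{snr}\!\left(\rho + \tfrac{j}{T}\right)}}\right) \notag \\
&\le P(|U_j| > 1) \quad \text{(by eq. (12))} \tag{15} \\
&= 0 \tag{16}
\end{aligned}$$

since $U_j \sim \mathcal{U}(-1/2, 1/2)$, and it follows that $\mathbb{E}[X^2_{n,j}\, \mathbb{1}\{|X_{n,j}| > \epsilon\}] = 0$ for all $j$ for all sufficiently large $n$. We conclude by the Lindeberg-Feller theorem that $\Omega_t = X_{n,1} + \dots + X_{n,n} \xrightarrow{d} \mathcal{N}(0, \beta^2_{T|t})$ as $T \to \infty$. Applying the above argument coordinate-wise then proves the original statement.

Corollary A.1.1. If we assume $\sigma_T$ and $\alpha_T$ to be constants, then for every $t$, $q(z_t \mid x) \xrightarrow{d} \mathcal{N}(\alpha_t x, \sigma_t^2 I)$ as $T \to \infty$; that is, our forward model approaches the Gaussian forward process of VDM with an increasing number of diffusion steps.

Proof. As $q(z_T \mid x) = \mathcal{N}(\alpha_T x, \sigma_T^2 I)$ does not depend on $T$, the joint distribution $q(z_t, z_T \mid x) = q(z_t \mid z_T, x)\, q(z_T \mid x)$ converges in distribution, which in turn implies convergence of $q(z_t \mid x)$. The statement then follows from the identity

$$\mathcal{N}(z_t; \alpha_t x, \sigma_t^2 I) = \int \mathcal{N}(z_t; b_{T|t} z_T + c_{T|t} x, \beta^2_{T|t} I)\; \mathcal{N}(z_T; \alpha_T x, \sigma_T^2 I)\, dz_T.$$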
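The two facts used in this appendix, that $\omega_t$ has variance $\beta^2_{T|t}$ for every $T$ and that its distribution approaches a Gaussian as $T$ grows, can be checked numerically. The sketch below is illustrative only. It assumes a variance-preserving schedule $\alpha_t^2 = 1 - \sigma_t^2$ with $\sigma_t^2 = \mathrm{sigmoid}(\gamma_t)$ and fixed endpoints $\gamma \in [-6, 6]$ (values picked for the example, not taken from the paper), and the helper names `schedule` and `omega_stats` are hypothetical. Instead of sampling, it uses the facts that the excess kurtosis of $\mathcal{U}(-1/2, 1/2)$ is $-1.2$ and that fourth cumulants of independent variables add, so the excess kurtosis of $\Omega_t$ is $-1.2 \sum_v \delta^4_{v|t} / (\sum_v \delta^2_{v|t})^2$; a Gaussian has excess kurtosis 0.

```python
import numpy as np


def schedule(T, gamma_min=-6.0, gamma_max=6.0):
    """Return (alpha_t, sigma_t^2) on the grid t = 0..T.

    Assumed for illustration: gamma is linear in t/T, sigma_t^2 = sigmoid(gamma_t),
    and alpha_t^2 = 1 - sigma_t^2 (variance preserving).
    """
    gamma = gamma_min + (gamma_max - gamma_min) * np.arange(T + 1) / T
    sigma2 = 1.0 / (1.0 + np.exp(-gamma))
    alpha = np.sqrt(1.0 - sigma2)
    return alpha, sigma2


def omega_stats(t, T):
    """Variance of omega_t, the Gaussian target beta_{T|t}^2, and excess kurtosis."""
    alpha, sigma2 = schedule(T)
    snr = alpha**2 / sigma2
    # delta_{v|t}^2 = (sigma_t^2 / alpha_t)^2 * (SNR(v-1) - SNR(v)) for v = t+1..T
    delta2 = (sigma2[t] / alpha[t]) ** 2 * (snr[t:T] - snr[t + 1:T + 1])
    var_uniform = delta2.sum()
    beta2 = (sigma2[t] ** 2 / alpha[t] ** 2) * (snr[t] - snr[T])
    # Fourth cumulants add across independent terms; U(-1/2, 1/2) has excess kurtosis -1.2.
    excess_kurtosis = -1.2 * (delta2**2).sum() / var_uniform**2
    return var_uniform, beta2, excess_kurtosis


if __name__ == "__main__":
    for T in (4, 16, 64, 256):
        t = T // 2  # hold rho = t / T = 0.5 fixed while T grows
        var_u, beta2, kurt = omega_stats(t, T)
        print(f"T={T:4d}  Var[omega_t]={var_u:.6f}  beta^2={beta2:.6f}  excess kurtosis={kurt:.4f}")
```

Under these assumptions the variance column matches $\beta^2_{T|t}$ for every $T$, while the excess kurtosis shrinks toward zero as $T$ increases, which is consistent with Theorem A.1 (though a vanishing kurtosis alone is of course not a proof of convergence in distribution).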
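B BACKWARD PROCESS DETAILS AND RATE ESTIMATES

B.1 GAUSSIAN (DDPM/VDM)

Kingma et al. (2021) set

$$p(z_{t-1} \mid z_t) := q(z_{t-1} \mid z_t, x = \hat{x}_t) = \mathcal{N}(b_{t|t-1} z_t + c_{t|t-1} \hat{x}_t, \beta^2_{t|t-1} I),$$

which yields

$$\begin{aligned}
L_{t-1} &= \mathrm{KL}\big(\mathcal{N}(b_{t|t-1} z_t + c_{t|t-1} x, \beta^2_{t|t-1} I) \,\|\, \mathcal{N}(b_{t|t-1} z_t + c_{t|t-1} \hat{x}_t, \beta^2_{t|t-1} I)\big) \\
&= \frac{c^2_{t|t-1}}{2\beta^2_{t|t-1}} \|x - \hat{x}_t\|_2^2 = \frac{1}{2}\big(\mathrm{SNR}(t-1) - \mathrm{SNR}(t)\big) \|x - \hat{x}_t\|_2^2.
\end{aligned}$$

We have that $L_{t-1} \to 0$ as $T \to \infty$, due to the continuity of $\mathrm{SNR}(\cdot/T) = \mathrm{snr}(\cdot) = \alpha(\cdot)^2/\sigma(\cdot)^2$.

B.2 UNIFORM (OURS)

Recall that we choose each coordinate of the reverse-process model $p(z_{t-1} \mid z_t)$ to have the density

$$p(z_{t-1} \mid z_t)_i := g_t(z) \star \mathcal{U}(z; -\Delta_t/2, \Delta_t/2) = \frac{1}{\Delta_t}\int_{z - \Delta_t/2}^{z + \Delta_t/2} g_t(z')\, dz' = \frac{1}{\Delta_t}\big(G_t(z + \Delta_t/2) - G_t(z - \Delta_t/2)\big),$$

where $G_t$ and $g_t$ are the cdf and pdf of a distribution with mean $\hat{\mu}_t := b_{t|t-1} z + c_{t|t-1} \hat{x}$ and variance $\sigma_g^2$, $z := (z_t)_i$, $x := x_i$, and $\hat{x} := \hat{x}_\theta(z_t; t)_i$. Using the shorthand $\mu_t := b_{t|t-1} z + c_{t|t-1} x$, we can derive the rate associated with the $i$th coordinate:

$$\begin{aligned}
L_{t-1} &= \mathrm{KL}\big(\mathcal{U}(z; \mu_t - \Delta_t/2, \mu_t + \Delta_t/2) \,\|\, g_t(z) \star \mathcal{U}(z; -\Delta_t/2, \Delta_t/2)\big) \\
&= \int \frac{1}{\Delta_t} \mathbb{1}_{[\mu_t - \Delta_t/2,\, \mu_t + \Delta_t/2]}(z) \log \frac{\frac{1}{\Delta_t} \mathbb{1}_{[\mu_t - \Delta_t/2,\, \mu_t + \Delta_t/2]}(z)}{\frac{1}{\Delta_t}\big(G_t(z + \Delta_t/2) - G_t(z - \Delta_t/2)\big)}\, dz \\
&= \frac{1}{\Delta_t} \int_{-\Delta_t/2}^{\Delta_t/2} \underbrace{-\log\big(G_t(z + \mu_t + \Delta_t/2) - G_t(z + \mu_t - \Delta_t/2)\big)}_{=:\, h(z)}\, dz.
\end{aligned}$$

To gain some intuition for this rate, note that $h(z)$ is lowest when most of the probability mass of $G_t$ is concentrated tightly around $z + \mu_t$, which is the case when $|\mu_t - \hat{\mu}_t|$ is small. Specifically, if $G_t$ is in a distributional family with a standardized cdf $G_0$ such that $G_t(z) = G_0((z - \hat{\mu}_t)/\sigma_g)$, then

$$G_t(z + \mu_t + \Delta_t/2) - G_t(z + \mu_t - \Delta_t/2) \to \begin{cases} 1 & \text{if } |z + \mu_t - \hat{\mu}_t| < \Delta_t/2, \\ G_0(0) & \text{if } |z + \mu_t - \hat{\mu}_t| = \Delta_t/2, \\ 0 & \text{else} \end{cases}$$

as $\sigma_g \to 0$. Thus, if $|\mu_t - \hat{\mu}_t| \ll \Delta_t/2$, we obtain improved bit-rates for $\sigma_g$ that are small (relative to $\Delta_t$). On the other hand, as almost certainly $|\mu_t - \hat{\mu}_t| > 0$, we cannot choose arbitrarily small $\sigma_g$, because in that case both $\max(h(-\Delta_t/2), h(\Delta_t/2)) \to \infty$ and $L_{t-1} \to \infty$ as $\sigma_g \to 0$. This further motivates learning the backward variances as $\sigma_g^2 = s_\theta(z)\,\beta^2_{t|t-1} = s_\theta(z)\,\Delta_t^2/12$, allowing them to adapt to $|\mu_t - \hat{\mu}_t|$.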
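To make the dependence of this rate on $\sigma_g$ and on the mismatch $|\mu_t - \hat{\mu}_t|$ concrete, the following sketch integrates $h(z)$ numerically for a single coordinate. It is a minimal illustration under stated assumptions: $G_t$ is taken to be Gaussian, the rate is reported in bits (log base 2), the integral is approximated with a plain midpoint rule, and the function names `gaussian_cdf` and `uq_rate_bits` as well as the example values of $\Delta_t$ and the mismatch are hypothetical.

```python
import math


def gaussian_cdf(x, mean, std):
    """CDF of N(mean, std^2) evaluated at x."""
    return 0.5 * (1.0 + math.erf((x - mean) / (std * math.sqrt(2.0))))


def uq_rate_bits(mu, mu_hat, sigma_g, delta, n=2000):
    """Midpoint-rule estimate of L_{t-1} in bits for a single coordinate.

    Averages h(z) = -log2(G_t(z + mu + delta/2) - G_t(z + mu - delta/2))
    over z in (-delta/2, delta/2), with G_t assumed Gaussian(mu_hat, sigma_g^2).
    """
    total = 0.0
    for k in range(n):
        z = -delta / 2 + (k + 0.5) * delta / n  # midpoint of the k-th cell
        mass = (gaussian_cdf(z + mu + delta / 2, mu_hat, sigma_g)
                - gaussian_cdf(z + mu - delta / 2, mu_hat, sigma_g))
        # Clip so that floating-point underflow does not produce log(0).
        total += -math.log2(max(mass, 1e-300))
    return total / n


if __name__ == "__main__":
    delta = 1.0
    beta = delta / math.sqrt(12.0)  # moment-matched standard deviation
    for mismatch in (0.0, 0.1, 0.3):
        for sigma_g in (0.5 * beta, beta, 2.0 * beta):
            rate = uq_rate_bits(mismatch, 0.0, sigma_g, delta)
            print(f"|mu - mu_hat|={mismatch:.1f}  sigma_g={sigma_g:.3f}  rate={rate:.3f} bits")
```

Scanning `sigma_g` for a fixed mismatch gives a quick feel for why a learned, input-dependent variance can help: the best value depends on how far the predicted mean is from the true one.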
Conversely, by the mean value theorem, there exists a $c \in (-\Delta_t/2, \Delta_t/2)$ such that

$$G_t(z + \mu_t + \Delta_t/2) - G_t(z + \mu_t - \Delta_t/2) = \Delta_t\, g_t(z + \mu_t + c) \approx \Delta_t\, g_t(z + \mu_t),$$

where the last approximation becomes more accurate for larger $\sigma_g$. If we further assume that $G_t$ is Gaussian (or sufficiently similar), $h(z)$ becomes approximately quadratic. In that case we study

$$\tilde{h}(z) := \frac{\Delta_t^2 - 4z^2}{\Delta_t^2}\, h(0) + \frac{2z^2 - \Delta_t z}{\Delta_t^2}\, h(-\Delta_t/2) + \frac{2z^2 + \Delta_t z}{\Delta_t^2}\, h(\Delta_t/2),$$

a quadratic function that exactly matches $h$ at the values $z \in \{-\Delta_t/2, 0, \Delta_t/2\}$. Finally, this results in

$$\begin{aligned}
L_{t-1} &\approx \frac{1}{\Delta_t}\left[\frac{2}{\Delta_t^2}\big(h(\Delta_t/2) + h(-\Delta_t/2) - 2h(0)\big)\int_{-\Delta_t/2}^{\Delta_t/2} z^2\, dz + \frac{1}{\Delta_t}\big(h(\Delta_t/2) - h(-\Delta_t/2)\big)\int_{-\Delta_t/2}^{\Delta_t/2} z\, dz + \Delta_t\, h(0)\right] \\
&= \frac{1}{6}\big[4h(0) + h(-\Delta_t/2) + h(\Delta_t/2)\big] \ge -\frac{1}{6}\log(0.25),
\end{aligned}$$

where the last inequality uses $h(z) \ge 0$ and $h(-\Delta_t/2) + h(\Delta_t/2) \ge -\log(0.25)$, which follow from the fact that $G_t$ is a cdf. Empirically we note that this estimate is very accurate as long as $\sigma_g^2 \gtrsim \beta^2_{t|t-1}$, demonstrating that simply matching moments as in VDM will incur a constant overhead for each diffusion step. As seen in Figure 2, this can be partly mitigated with smaller $\sigma_g^2$, but increasing the number of diffusion steps $T$ might still lead to an increase in the NELBO. Numerical integration of $L_{t-1}$ confirms that if $\sigma_g$ is close to the optimal choice of $\sigma_g \approx |\mu_t - \hat{\mu}_t|$, then $L_{t-1} \to 0$ as $T \to \infty$, as in the Gaussian case.

B.3 FLOW-BASED RECONSTRUCTIONS

Given an intermediate latent $z_t$, ancestral sampling yields an intermediate lossy reconstruction $\hat{x} \sim p(x \mid z_t)$, which requires us to repeatedly sample from the conditional $p(z_{t-1} \mid z_t)$ until finally obtaining a reconstruction from $z_0$ with the help of $p(x \mid z_0)$. This is equivalent to approximately solving a reverse SDE (Song et al., 2021c) and introduces additional noise during inference, which can make reconstructions grainy for diffusion models with a small number of steps, as can be seen in Figure 5. Song et al. (2021c) further note that an alternative approximate solution to the SDE can be obtained by deterministically reversing a probability-flow ODE (see also Theis et al. (2022)).
Specifically, this involves repeatedly evaluating $z_{t-1} = f(z_t, t)$, where $f$ for VDM is defined as

$$f(z_t, t) = \frac{\sigma_{t-1}}{\sigma_t}\, z_t + \left(\alpha_{t-1} - \frac{\sigma_{t-1}}{\sigma_t}\,\alpha_t\right) \hat{x}_\theta(z_t; t), \tag{17}$$

recovering the same process defined in (Song et al., 2021a). The equivalence of the continuous limit in Corollary A.1.1 suggests that the discrete-time backward processes of UQDM and VDM are similar enough that eq. (17) also approximately solves the implied reverse SDE of UQDM. Thus we use eq. (17) to obtain flow-based reconstructions for both VDM and UQDM.

[Figure 7 panels: true data samples; unconditional UQDM samples for T = 5, T = 10, T = 30]

Figure 7: Unconditional samples from UQDM models trained with varying T on the swirl dataset. The sample quality improves with larger T; however, the compression performance becomes worse for T > 5, as discussed in Section 5.

[Figure 6 plot: NELBO (bpd) vs. T; curves: VDM, UQDM (VDM weights), UQDM (from scratch), UQDM (learned rev. var.), VDM (T = 1000)]

Figure 6: Left: 1000 samples from the toy swirl source. Right: Additional results on swirl data. We examined the compression performance of applying universal quantization to a pre-trained VDM model. When using fixed reverse-process variances, we can directly re-use weights from a pretrained VDM model (orange), which achieves results comparable to training a UQDM model from scratch, even for a smaller number of timesteps.

Figure 8: Additional results on ImageNet 64 × 64 data. Left: Example progressive reconstructions from UQDM trained with T = 4, obtained with flow-based denoising, as in Figure 5. Flow-based reconstructions achieve distortion (as measured by PSNR) similar to that of denoised predictions, at higher realism (as measured by FID). Right: Ablation of the influence of model size on validation loss. Bars are labeled with the number of parameters for each model. Increasing the size of the denoising network allows for smaller bitrates.
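The deterministic update of Appendix B.3 can be sketched in a few lines. The code below is an illustration in the DDIM form of Song et al. (2021a), not the authors' implementation: it first converts the denoised prediction into an implied noise estimate and then re-noises at the lower noise level. The function name `flow_step`, the example schedule values, and the use of the true `x` as a stand-in for the denoiser output are assumptions made for the example.

```python
import numpy as np


def flow_step(z_t, x_hat, alpha_t, sigma_t, alpha_prev, sigma_prev):
    """One deterministic (DDIM-style) step from time t to t-1.

    x_hat is the denoiser's prediction of x given z_t; here it is passed in
    directly so the sketch stays independent of any particular network.
    """
    eps_hat = (z_t - alpha_t * x_hat) / sigma_t  # implied noise estimate
    return alpha_prev * x_hat + sigma_prev * eps_hat


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.normal(size=8)
    eps = rng.normal(size=8)
    # Hypothetical variance-preserving schedule values at steps t and t-1.
    alpha_t, sigma_t = 0.6, 0.8
    alpha_prev, sigma_prev = 0.8, 0.6
    z_t = alpha_t * x + sigma_t * eps
    # With an oracle denoiser (x_hat = x) the step lands exactly on the
    # less noisy latent that shares the same noise realization.
    z_prev = flow_step(z_t, x, alpha_t, sigma_t, alpha_prev, sigma_prev)
    print(np.allclose(z_prev, alpha_prev * x + sigma_prev * eps))
```

Expanding `alpha_prev * x_hat + sigma_prev * eps_hat` gives $\frac{\sigma_{t-1}}{\sigma_t} z_t + (\alpha_{t-1} - \frac{\sigma_{t-1}}{\sigma_t}\alpha_t)\hat{x}$, so no noise is injected and all randomness in the reconstruction comes from the received latent $z_t$.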
C ADDITIONAL EXPERIMENTAL RESULTS

C.1 SWIRL DATA

We use the swirl data from the codebase of (Kingma et al., 2021); Figure 6 shows 1000 samples from the toy data source. We use the same denoising network $\hat{x}_\theta$ as in the official implementation (see footnote 2), which consists of 2 hidden layers with 512 units each. Figure 6 highlights the consequence of Corollary A.1.1: because VDM and UQDM share the same continuous limit, we can use the weights of a pretrained VDM to obtain UQDM results comparable to those of a UQDM model that has been trained from scratch.

C.2 CIFAR10

We use a scaled-down version of the denoising network from the VDM paper (Kingma et al., 2021) for faster experimentation. We use a U-Net of depth 8, consisting of 8 ResNet blocks in the forward direction and 9 ResNet blocks in the reverse direction, with a single attention layer and two additional ResNet blocks in the middle. We keep the number of channels constant throughout at 128.

We verified that our UQDM implementation based on tensorflow-compression achieves a file size close to the theoretical NELBO. When compressing a single 32 × 32 CIFAR image, we observe a file size overhead on the order of 3% of the theoretical NELBO. In terms of computation speed, it takes our model with fixed reverse-process variance less than 1 second to encode or decode a CIFAR image, either on CPU or GPU (see footnote 3), likely because of the very few neural-network evaluations required (T = 4). For our model with learned reverse-process variance, however, it takes about 5 minutes to compress or decompress a CIFAR image, with nearly all of the compute time spent on a single CPU core. This is because with learned reverse-process variance, each latent dimension has a different predicted variance, and a separate CDF table needs to be built for each latent dimension during entropy coding; the tensorflow-compression library builds the CDF table for each coordinate in a naive for-loop rather than in parallel.
Thus we expect the coding speed to be dramatically faster with a parallel implementation of entropy coding, e.g., using the DietGPU library (see footnote 4).

C.3 IMAGENET 64 × 64

We use the same denoising network as in the VDM paper (Kingma et al., 2021). We use a U-Net of depth 64, consisting of 64 ResNet blocks in the forward direction and 65 ResNet blocks in the reverse direction, with a single attention layer and two additional ResNet blocks in the middle. We keep the number of channels constant throughout at 256. To investigate the impact of the size of the denoising network, in addition to this configuration with 237M parameters, which we call UQDM-big, we also run experiments with three smaller networks: 32 ResNet blocks and 128 channels (UQDM-medium, 122M parameters), 8 ResNet blocks and 64 channels (UQDM-small, 2M parameters), and 1 ResNet block and 32 channels (UQDM-tiny, 127K parameters). Smaller networks are significantly faster and more resource-efficient but naturally suffer from higher bitrates, as can be seen in Figure 8.

The required number of FLOPS per pixel for encoding and decoding is strongly dominated by the number of neural function evaluations (NFE) of our denoising network, which depends on how soon we stop the encoding and decoding process. For lossless compression we have to multiply the FLOPS per NFE by T, which is equal to 4 in our case. For lossy compression after t steps, with lossy reconstructions obtained through a denoised prediction, we obtain the required FLOPS for encoding and decoding by multiplying by t and t + 1, respectively. The FLOPS per NFE depend on the network size; our investigated model sizes require 389K, 2.3M, 105M, and 204M FLOPS per pixel, in order from smallest to biggest model.

Figures 9 and 10 show more example reconstructions from several traditional and neural codecs, similar to Figure 1. At lower bitrates the artifacts each codec introduces become more visible.

Figure 9: Additional example reconstructions, chosen at roughly similar (high) bitrates.

Figure 10: Additional example reconstructions, chosen at roughly similar (low) bitrates.

Footnote 2: https://github.com/google-research/vdm/blob/main/colab/2D_VDM_Example.ipynb

Footnote 3: Around 0.6 s for encoding and 0.5 s for decoding on an Intel(R) Xeon(R) Gold 5218 CPU @ 2.30GHz; 0.5 s for encoding and 0.3 s for decoding on a single Quadro RTX 8000 GPU.

Footnote 4: https://github.com/facebookresearch/dietgpu
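As a rough companion to the FLOPS accounting in Appendix C.3, the following sketch turns the stated rule (NFE count times FLOPS per NFE) into a small calculator. It is illustrative only: the mapping of the four reported FLOPS figures to the names tiny, small, medium, and big is inferred from the "smallest to biggest" ordering in the text, and the function names `num_nfe` and `flops_per_pixel` are hypothetical.

```python
# FLOPS per pixel for one neural function evaluation (NFE), as reported in
# Appendix C.3, ordered from the smallest to the biggest denoising network.
FLOPS_PER_NFE = {
    "tiny": 389e3,
    "small": 2.3e6,
    "medium": 105e6,
    "big": 204e6,
}

T_TOTAL = 4  # number of diffusion steps used in the experiments


def num_nfe(steps, role, lossless=False):
    """Number of denoiser evaluations needed.

    Lossless coding always runs all T_TOTAL steps. For lossy coding stopped
    after `steps` steps, the encoder needs `steps` evaluations and the decoder
    one more to produce the denoised prediction.
    """
    if lossless:
        return T_TOTAL
    if role == "encode":
        return steps
    if role == "decode":
        return steps + 1
    raise ValueError(f"unknown role: {role!r}")


def flops_per_pixel(model, steps, role, lossless=False):
    """Approximate FLOPS per pixel, counting only denoiser evaluations."""
    return FLOPS_PER_NFE[model] * num_nfe(steps, role, lossless)


if __name__ == "__main__":
    for model in FLOPS_PER_NFE:
        enc = flops_per_pixel(model, 2, "encode")
        dec = flops_per_pixel(model, 2, "decode")
        print(f"{model:>6}: encode {enc:.3g}, decode {dec:.3g} FLOPS/pixel after 2 steps")
```

The calculator ignores entropy coding and any other non-network cost, in line with the statement that the total is dominated by the number of NFEs.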