# conditional_variational_diffusion_models__68bd60db.pdf Published as a conference paper at ICLR 2024 CONDITIONAL VARIATIONAL DIFFUSION MODELS Gabriel della Maggiora1,2, , Luis Alberto Croquevielle3, Nikita Deshpande1,2, Harry Horsley4, Thomas Heinis3, Artur Yakimovich1,2,5 1 Center for Advanced Systems Understanding (CASUS), G orlitz, Germany 2 Helmholtz-Zentrum Dresden-Rossendorf e. V. (HZDR), Dresden, Germany 3 Department of Computing, Imperial College London, London, United Kingdom 4 Bladder Infection and Immunity Group (BIIG), UCL Centre for Kidney and Bladder Health, Division of Medicine, University College London, Royal Free Hospital Campus, London, United Kingdom 5 Institute of Computer Science, University of Wrocław, Wrocław, Poland correspondence: g.della-maggiora-valdes@hzdr.de, aac622@ic.ac.uk, a.yakimovich@hzdr.de Inverse problems aim to determine parameters from observations, a crucial task in engineering and science. Lately, generative models, especially diffusion models, have gained popularity in this area for their ability to produce realistic solutions and their good mathematical properties. Despite their success, an important drawback of diffusion models is their sensitivity to the choice of variance schedule, which controls the dynamics of the diffusion process. Fine-tuning this schedule for specific applications is crucial but time-consuming and does not guarantee an optimal result. We propose a novel approach for learning the schedule as part of the training process. Our method supports probabilistic conditioning on data, provides high-quality solutions, and is flexible, proving able to adapt to different applications with minimum overhead. This approach is tested in two unrelated inverse problems: super-resolution microscopy and quantitative phase imaging, yielding comparable or superior results to previous methods and fine-tuned diffusion models. We conclude that fine-tuning the schedule by experimentation should be avoided because it can be learned during training in a stable way that yields better results1. 1 INTRODUCTION Inverse problems deal with the task of finding parameters of interest from observations. Formally, for a mapping A : Y X and x X (the data), the inverse problem is to find an input y Y such that A(y) = x. Examples abound in computer science. For instance, single-image super-resolution can be formulated in this setting. Inverse problems are usually ill-posed, which means that observations underdetermine the system or small errors in the data propagate greatly to the solution. Deep networks trained with supervised learning have recently gained popularity for tackling inverse problems in contexts where input-data samples are available. Depending on the application, the optimization might target the L1 or L2 norm, or pixel-wise distance measures in image-based tasks. These methods are effective in many scenarios, but the supervised learning approach can introduce undesired artifacts (Lehtinen et al., 2018) and it does not yield uncertainty estimates for the solutions. Since most inverse problems are ill-posed, it makes sense to model the uncertainty explicitly. This can be done by considering y and x as realizations of random variables Y and X, respectively, and learning a conditional probability distribution PY |X. Several methods are available to learn a distribution from pairs of samples (Isola et al., 2016; Peng & Li, 2020; Sohl-Dickstein et al., Equal contribution 1The code is available on https://github.com/casus/cvdm. Published as a conference paper at ICLR 2024 2015; Kohl et al., 2018). In this work, we use diffusion models, a likelihood-based method with state-of-the-art generation capabilities (Dhariwal & Nichol, 2021; Saharia et al., 2021). Diffusion models produce high quality samples and offer stable training (Ho et al., 2020; Kingma et al., 2023). Despite these advantages, they have a few drawbacks, like the number of steps required to generate samples (Song et al., 2022) and their sensitivity to the choice of variance schedule (Saharia et al., 2021; Nichol & Dhariwal, 2021). The variance schedule controls the dynamics of the diffusion process, and it is usually necessary to fine-tune it with a hyperparameter search for each application. This is time-consuming and leads to suboptimal performance (Chen et al., 2020). In this work, we introduce the Conditional Variational Diffusion Model (CVDM), a flexible method to learn the schedule that involves minimum fine-tuning. Our detailed contributions are: Following Kingma et al. (2023) we learn the schedule as part of the training, extending their approach to the conditioned case. Furthermore, we allow for learning a different schedule for each element in the output (e.g., a pixel-wise for images). These extensions require several technical novelties, including a separation-of-variables strategy for the schedule. We prove that the rate of convergence of the discrete-time diffusion loss to the continuoustime case depends strongly on the derivatives of the schedule. This shows that the finding in Kingma et al. (2023) that continuous-time diffusion loss is invariant under choice of schedule may not hold in practice. Based on this result, we introduce a novel regularization term that proves to be critical for the performance of the method in our general formulation. We implement the schedule by replacing the architecture in Kingma et al. (2023) with two networks, one required to be positive for the conditioning variable and one monotonicconvolutional network. This allows us to test our model with inputs of different resolutions without retraining. Moreover, thanks to our clean formulation, our method does not need the post-processing of the schedule that Kingma et al. (2023) introduces, nor the preprocessing of the input. This further contributes to streamlining our implementation. We test CVDM in three distinct applications. For super-resolution microscopy, our method shows comparable reconstruction quality and enhanced image resolution compared to previous methods. For quantitative phase imaging, it significantly outperforms previous methods. For image super-resolution, reconstruction quality is also comparable to previous methods. CVDM proves to be versatile and accurate, and it can be applied to previously unseen contexts with minimum overhead. 2 RELATED WORK Diffusion probabilistic models (DPMs) (Sohl-Dickstein et al., 2015) are a subset of variational autoencoders methods (Kingma & Welling, 2022; Kingma et al., 2023). Recently, these models have demonstrated impressive generation capabilities (Ho et al., 2020; Dhariwal & Nichol, 2021), performing better than generative adversarial networks (Isola et al., 2016) while maintaining greater training stability. Fundamentally, diffusion models operate by reversing a process wherein the structure of an input is progressively corrupted by noise, according to a variance schedule. Building on the DPM setup, conditioned denoising diffusion probabilistic models (CDDPMs) have been introduced to tackle challenges like image super-resolution (Saharia et al., 2021), inpainting, decompression (Saharia et al., 2022), and image deblurring (Chen et al., 2023). All the cited works use different variance schedules to achieve their results, which is to be expected, because fine-tuning the schedule is usually a prerequisite to optimize performance (Saharia et al., 2021). Efforts have been made to understand variance schedule optimization more broadly. Cosine-shaped schedule functions (Nichol & Dhariwal, 2021) are now the standard in many applications (Dhariwal & Nichol, 2021; Croitoru et al., 2023), showing better performance than linear schedules in various settings. Beyond this empirical observation, Kingma et al. (2023) developed a framework (Variational Diffusion Models, or VDMs) to learn the variance schedule and proved several theoretical properties. Their work supports the idea that learning the schedule during training leads to better performance. Specifically, Kingma et al. (2023) formulate a Gaussian diffusion process for unconditioned distribution sampling. The latent variables are indexed in continuous time, and the forward process diffuses the input driven by a learnable schedule. The schedule must satisfy a few minimal conditions, and its Published as a conference paper at ICLR 2024 parameters are defined as a monotonic network. By forcing the schedule to be contained within a range, the model can be trained by minimizing a weighted version of the noise prediction loss. Their work also introduces Fourier features to improve the prediction of high-frequency details. In this work, we extend VDMs to the conditioned case. To achieve this, we define the schedule as a function of two variables, time t and condition x. This requires careful consideration because the schedule should be monotonic in t and not necessarily with respect to x. To solve this issue, we propose a novel factorization of the schedule and derive several theoretical requirements for the schedule functions. By incorporating these requirements into the loss function, we can train the model in a straightforward way and dispense with the schedule post-processing of Kingma et al. (2023). We adapt the framework proposed by Saharia et al. (2021), which learns a noise prediction model that takes the variance schedule directly as input. To streamline the framework, we eliminate the noise embedding and directly concatenate the output of our learned schedule to the noise prediction model. 3.1 CONDITIONED DIFFUSION MODELS AND INVERSE PROBLEMS For a brief introduction to inverse problems, see Appendix A. For a formulation of non-conditioned DPMs, see Ho et al. (2020); Kingma et al. (2023). In conditioned DPMs we are interested in sampling from a distribution PY |X=x(y), which we denote p(y|x). The samples from this distribution can be considered as solutions to the inverse problem. Following Kingma et al. (2023), we use a continuous-time parametrization of the schedule and latent variables. Specifically, the schedule is represented by a function γ(t, x) and for each t [0, 1] there is a latent variable zt. We formulate the diffusion process using a time discretization, with continuous-time introduced as a limiting case. This formulation is better for the introduction of key concepts like the regularization strategy in Section 3.4. To start with, for a finite number of steps T, let ti = i/T for i {0, 1, . . . , T}, and define the forward process by q(zti|y, x) = N p γ(ti, x)y, σ(ti, x)I . (1) We use the variance-preserving condition σ(ti, x) = 1 γ(ti, x) but sometimes write σ(ti, x) to simplify notation. The whole process is conditioned on y: it starts at zt0 = y and it gradually injects noise, so that each subsequent zti is a noisier version of y, and zt T is almost a pure Gaussian variable. The forward process should start at y in t = 0, so for all x we enforce the condition γ(0, x) = 1. We follow Kingma et al. (2023) in learning the schedule instead of fine-tuning it as a hyperparameter, with some key differences. If y has a vector representation, we learn a different schedule for each component. For example, in the case of images, each pixel has a schedule. This means that functions of γ apply element-wise and expressions like p γ(ti, x)y represent a Hadamard product. Similarly to the non-conditioned case, the diffusion process is Markovian and each step 0 < i T is characterized by a Gaussian distribution q(zti|zti 1, x) (see Appendix B.1). The posterior q(zti 1|zti, y, x) also distributes normally, and the parameters ultimately depend on the function γ (see Appendix B.2). The reverse process is chosen to take the natural shape pν(zti 1|zti, x) = q(zti 1|zti, y = ˆyν(zti, ti, x), x), (2) such that it only differs from the posterior in that y is replaced by a deep learning prediction ˆyν. The forward process should end in an almost pure Gaussian variable, so q(zt T |y, x) N(0, I) should hold. Hence, for the reverse process we model zt T with pν(zt T |x) = N(0, I) (3) for all x. The reverse process is defined by equations (2) and (3) without dependence on y, so we can use them to sample y. Specifically, we can sample from pν(zt T |x) and then use relation (2) repeatedly until we reach zt0 = y. All the relevant distributions are Gaussian, so this procedure is computationally feasible. In other words, equations (2) and (3) completely define the reverse process, such that we can sample any zti conditioned on x, including zt0 = y. If we are able to learn the pν(zti 1|zti, x), we should have a good proxy for p(y|x) from which we can sample. Published as a conference paper at ICLR 2024 3.2 DEFINING THE SCHEDULE We now describe in more detail our approach to learning the schedule. Recall that the latent variables zt are indexed by a continuous time parameter t [0, 1], and we introduced the diffusion process using a time discretization {ti}T i=0. We now consider the non-discretized case, starting with the forward process. In this case, the mechanics described by equation (1) can be extended straightforwardly to the continuous case q(zt|y, x). The continuous-time version of the forward Markovian transitions and the posterior distribution are more complicated. Consider the forward transitions, which in the discretized version we denote by q(zti|zti 1, x). To extend this idea to the continuous case, Kingma et al. (2023) consider q(zt|zs, x) for s < t. We use a different approach and focus on the infinitesimal transitions q(zt+dt|zt, x). This idea can be formalized using stochastic calculus, but we do not need that framework here. From Appendix B.1, the forward transitions are given by q(zti|zti 1, x) = N q 1 ˆβT (ti, x)y, ˆβT (ti, x)I . where ˆβT (ti, x) = 1 γ(ti, x)/γ(ti 1, x). As defined, the values ˆβT (ti, x) control the change in the latent variables over a short period of time. Now, consider the continuous-time limit T . The intuition is that there is a function β such that the change of the latent variables over an infinitesimal period of time dt is given by β(t, x)dt. In the discretization, this becomes ˆβT (ti, x) = β(ti, x)/T. This idea leads to the following relation between γ and β (details in Appendix F): t = β(t, x)γ(t, x). (4) In view of this relation, our approach is as follows. First, we make the assumption that β can be decomposed into two independent functions, respectively depending on the time t and the data x. We write this as β(t, x) = τθ(t)λϕ(x), where both τθ and λϕ are learnable positive functions. This assumption takes inspiration from many separable phenomena in physics, and it proves to be general enough in our experiments. Moreover, γ(t, x) should be decreasing in t, since the forward process should start at y in t = 0 and gradually inject noise from there. This monotony condition is much simpler to achieve in training if the t and x variables are separated in the schedule functions. Replacing this form of β in equation (4) and integrating with initial condition γ(0, x) = 1, we get γ(t, x) = e λϕ(x) R t 0 τθ(s)ds. (5) Equation (5) could be used to compute γ during training and inference, but we follow a different approach to avoid integration. Since τθ is defined to be positive, its integral is an increasing function. This motivates the parametrization of γ as γ(t, x) = e λϕ(x)ρχ(t), where ρχ is a learnable function which is increasing by design. Summarizing, we parametrize β and γ such that they share the same function λϕ(x) and separate time-dependent functions. A priori, ρχ and the integral of τθ could not coincide. So, to ensure that equation (4) holds, we use the Deep Galerkin Method (Sirignano & Spiliopoulos, 2018) by including the following term as part of the loss function: Lβ(x) = Et U([0,1]) t + β(t, x)γ(t, x) + γ(0, x) 1 2 2 + γ(1, x) 0 2 2, where U(A) represents the continuous uniform distribution over set A. The last two terms codify the soft constraints γ(0, x) = 1 and γ(1, x) = 0, which help to ensure that the forward process starts at y (or close) and ends in a standard Gaussian variable (or close). Noise injection is gradual because γ is defined by a deep learning model as a smooth function (see details about architecture in Appendix G). In the rest of Section 3, we provide the remaining details about the loss function. 3.3 NON-REGULARIZED LEARNING The schedule functions γ, β define the forward process and determine the form of the posterior distribution and the reverse process (Appendix B). On the other hand, ˆyν helps to define the reverse process pν. To learn these functions we need access to a dataset of input-data pairs (y, x). The Published as a conference paper at ICLR 2024 standard approach for diffusion models is to use these samples to minimize the Evidence Lower Bound (ELBO), given in our case by (details in Appendix C) LELBO(y, x) = Lprior(y, x) + Ldiffusion(y, x). The term Lprior = DKL (q(zt T |zt0, x) || pν(zt T |x)) helps to ensure that the forward process ends in an almost pure Gaussian variable. As described in Appendix D, Ldiffusion(y, x) is analytic and differentiable, but is computationally inconvenient. To avoid this problem, this term can be rewritten as an expected value LT (y, x) which has a simple and efficient Monte Carlo estimator (Appendix E.1). For reasons outlined in the following section, we work in the continuous-time case, i.e. the T limit. Under certain assumptions, we get the following form of the ELBO when T ˆLELBO(y, x) = DKL (q(z1|y, x) || pν(z1|x)) + L (y, x). In the above expression, z1 represents the latent variable at time t = 1, such that pν(z1|x) is a standard Gaussian and q(z1|y, x) = N( p γ(1, x)y, σ(1, x)I). On the other hand, L is a continuous-time estimator for Ldiffusion and takes an integral form. See Appendix E.2 for the full details. In summary, ˆLELBO provides a differentiable and efficient-to-compute version of the ELBO, and we use it as the core loss function to learn the forward and reverse processes correctly. In the next section, we describe two important modifications we make to the loss function and the learning process. 3.4 REGULARIZED DIFFUSION MODEL As mentioned above, we work with a diffusion process that is defined for continuous time. The schedule γ and the model prediction ˆyν depend on a continuous variable t [0, 1], and the latent variables zt are also parametrized by t [0, 1]. We also derived a continuous-time version of the diffusion loss, L . In our final implementation, we get better results by replacing L with ˆL (y, x) = 1 2Eϵ N(0,I),t U([0,1]) h ϵ ˆϵν(zt(ϵ), t, x) 2 2 i , where ˆϵν is a noise prediction model. See details in Appendix E.3. This provides a natural Monte Carlo estimator for the diffusion loss, by taking samples ϵ N(0, I) and t U([0, 1]). As shown in Kingma et al. (2023), increasing the number of timesteps T should reduce the diffusion loss, so it makes sense to work in the T limit. Also, a continuous-time setup is easier to implement. Importantly, Kingma et al. (2023) prove that the continuous-time diffusion loss is invariant to the choice of variance schedule. However, we argue this is not necessarily the case in practice. Since all computational implementations are ultimately discrete, we look for conditions on γ that make the discrete case as close as possible to the continuous one. As explained in Appendix E.2, one way of achieving this is to keep the Euclidean norm of SNR (t, x) low, where SNR(t, x) = γ(t, x)/σ(t, x). We use f (t, x) to represent the partial derivative of a function f(t, x) with respect to the time t. A natural way of incorporating this condition would be to include a regularization term in the loss function, with a form like LSNR(x) = SNR ( , x) L2([0,1]) where by definition SNR(t, x) = γ(t, x) σ(t, x) = γ(t, x) 1 γ(t, x). From this, we can see that LSNR can be complicated to implement. It involves a fraction and a second derivative, operations that can be both numerically unstable. Moreover, as we have mentioned before, for t = 0 it should hold that γ(t, x) 1, which makes SNR more unstable around t = 0. Since SNR(t, x) γ(t, x) for values of t closer to 1, we replace LSNR with the more stable Lγ(x) = Et U([0,1]) h γ (t, x) 2 2 i . This regularization term is actually key for the performance of our method. To see this, recall that a variable zt q(zt|y, x) can be reparametrized as zt = p γ(t, x)y + p σ(t, x)ϵ with ϵ N(0, I). This means that γ 0, σ 1 make zt = ϵ, so that the noise prediction model ˆϵν(zt(ϵ), t, x) = zt can perfectly predict ϵ and make ˆL = 0. Now, γ 0 is not compatible with the Lβ loss term, but any function γ that starts at 1 for t = 0 and then abruptly drops to 0 is permitted. Lγ prevent this type of undesirable solution. Once we include this term, the full loss function takes the form LCVDM = E(y,x) p(y,x) h Lβ(x) + DKL (q(z1|y, x) || pν(z1|x)) + ˆL (y, x) + αLγ(x) i , (6) Published as a conference paper at ICLR 2024 where α controls the weight of the regularization term and p(y, x) is the joint distribution of y and x. We optimize a Monte Carlo estimator of LCVDM by using the available (y, x) samples. For the KL divergence term, we optimize the analytical form of the KL divergence between two Gaussian distributions with the log-variance (Kingma & Welling, 2022; Nichol & Dhariwal, 2021). 4 EXPERIMENTS AND RESULTS We assess the performance of our model on three distinct benchmarks. First, we evaluate the model s ability to recover high-spatial frequencies using the Bio SR super-resolution microscopy benchmark (Qiao et al., 2021). Second, we examine the model s effectiveness in retrieving the phase of a diffractive system with synthetic data and real brightfield image stacks from a clinical sample assessed by two experienced microscopists. The last benchmark is image super-resolution on Image Net 1K (Deng et al., 2009) . For the first two, performance is measured using two key metrics: multi-scale structural similarity index measure (MS-SSIM) (Wang et al., 2003) and Mean Absolute Error (MAE), both detailed in Appendix G. In the case of Image Net, performance is measured using SSIM and peak signal-to-noise ratio (PSNR). For Bio SR, the resolution of the reconstruction (a metric related to the highest frequency in the Fourier space) is additionally evaluated as per Qiao et al. (2021), using a parameter-free estimation (Descloux et al., 2019). We evaluate against methods developed for each benchmark, as well as CDDPM. In the case of CDDPM, we follow the implementation shown in Saharia et al. (2021). The specific implementation of our model is described in Appendix G. 4.1 SUPER-RESOLUTION MICROSCOPY Super-resolution microscopy aims to overcome the diffraction limit, which restricts the observation of fine details in images. It involves reconstructing a high-resolution image y from its diffraction-limited version x, expressed mathematically as x = K y + η, where K is the point spread function (PSF) and η represents inherent noise. Convolution of the PSF with the high-resolution image y leads to the diffraction-limited image x, which complicates y recovery due to information loss and noise. In this context, we utilize the Bio SR dataset (Qiao et al., 2021). Bio SR consists of pairs of widefield and structured illumination microscopy (SIM) images which encapsulate varied biological structures and signal-to-noise ratio (SNR) levels. The structures present in the dataset have varying complexities: clathrin-coated pits (CCPs), endoplasmic reticulum (ER), microtubules (MTs), and F-actin filaments, ordered by increasing structural complexity. Each image pair is captured over ten different SNR levels. Our results are compared with DFCAN, a regressionbased method implemented as in Qiao et al. (2021), and CDDPM trained as in Saharia et al. (2021). For diffusion methods during inference, best results are found at T = 500, resulting in similar inference time for CVDM and CDDPM. For CDDPM, fine-tuning results in optimal performance for a linear schedule ranging from 0.0001 to 0.03. All models are trained for 400,000 iterations. 4.1.1 RESULTS Table 1 shows the enhanced resolution achieved in the Bio SR benchmark, with our model surpassing other methods in the ER and F-actin structures. Our approach consistently delivers comparable or superior performance than CDDPM across all structures, and improves markedly over DFCAN in the resolution metric. Figure 6b facilitates a visual inspection of these achievements, comparing our reconstructions to those of DFCAN, and Figure 6a, which underscores the contrast in quality between our model and the CDDPM benchmark. Additional comparative insights are detailed in Appendix I. 4.2 QUANTITATIVE PHASE IMAGING Quantitative phase imaging (QPI) has gained prominence in diverse applications, including bioimaging, drug screening, object localization, and security scanning (Zuo et al., 2020). The Transport of Intensity Equation (TIE) method (Teague, 1983; Streibl, 1984) is a notable approach to phase retrieval, linking the diffraction intensity differential along the propagation direction to the lateral phase profile. This relationship, under the paraxial approximation, is formulated as k I(x, y; z) z = (x,y) (I(x, y; z) (x,y)φ(x, y; z)), Published as a conference paper at ICLR 2024 Table 1: Performance on Bio SR Structures. Bold represents a statistically significant best result. Underline represents a statistically significant second place. Statistical significance is determined by comparing the sample mean errors of two different methods over the dataset, using a hypothesis test. (a) CCP structures. Metric / Model DFCAN CDDPM CVDM (ours) MS-SSIM ( ) 0.957 0.952 0.955 MAE ( ) 0.006 0.007 0.007 Resolution (nm) ( ) 107 100 96 (b) ER structures. Metric / Model DFCAN CDDPM CVDM (ours) MS-SSIM ( ) 0.928 0.920 0.934 MAE ( ) 0.033 0.033 0.032 Resolution (nm) ( ) 165 157 152 (c) MT structures. Metric / Model DFCAN CDDPM CVDM (ours) MS-SSIM ( ) 0.901 0.857 0.887 MAE ( ) 0.033 0.042 0.04 Resolution (nm) ( ) 127 101 97 (d) F-actin structures. Metric / Model DFCAN CDDPM CVDM (ours) MS-SSIM ( ) 0.853 0.831 0.863 MAE ( ) 0.049 0.049 0.043 Resolution (nm) ( ) 151 104 98 Table 2: Performance on QPI Data. Bold represents a statistically significant best result. (a) Performance on QPI on synthetic HCOCO. Metric / Model US-TIE CDDPM CVDM (ours) MS-SSIM ( ) 0.907 0.881 0.943 MAE ( ) 0.110 0.134 0.073 (b) MS-SSIM on QPI Brightfield images. Sample US-TIE CDDPM CVDM (ours) 1 0.740 0.863 0.892 2 0.625 0.653 0.742 3 0.783 0.823 0.851 where k is the wavenumber, I is the intensity, φ is the phase of the image, and x, y, z are the coordinates. In practice, the intensity derivative I(x, y; z)/ z at z = 0 is approximated via finite difference calculations, using 2-shot measurements at z = d and z = d: We train the model with a synthetic dataset created by using Image Net to simulate Id and I d from grayscale images. The process, informed by the Fresnel diffraction approximation, is expressed as with z = d, the defocus distance, fixed at 2µm during the training phase. Noise was added using a N(ξ, ξ) distribution, where ξ fluctuates randomly between 0 and 0.2 for every image pair to approximate Poisson noise. To evaluate the effectiveness of our method, we compare against two baselines: CDDPM and US-TIE (Zhang et al., 2020), an iterative parameter-free estimation method for the QPI problem. For diffusion methods during inference, best results are found at T = 400, resulting in similar inference time for CVDM and CDDPM. For CDDPM, fine-tuning results in optimal performance for a linear schedule ranging from 0.0001 to 0.02. Both our model and CDDPM undergo 200,000 iterations of training. 4.2.1 RESULTS FOR THE SYNTHETIC QPI DATASET We estimate the phase of images using the HCOCO dataset (Cong et al., 2019). These images are simulated at a distance of d = 2µm. Table 2a presents the model s performance on the provided test split of HCOCO, which is conducted without noise. Figure 1a visually compares the methods. For more detailed comparisons, please refer to Appendix I. 4.2.2 RESULTS FOR QPI GENERATED FROM CLINICAL BRIGHTFIELD IMAGES We evaluate our method on microscope brightfield images, consisting of three stacks with varying defocus distances, obtained from clinical urine microscopy samples. The phase ground truth is established by computing I(x, y; z)/ z, using a 20th-order polynomial fitting for each pixel within the stack, following Waller et al. (2010). This fitting is performed at distances d = 2kµm, with k Published as a conference paper at ICLR 2024 US TIE CDDPM CVDM Id / GT US TIE CDDPM CVDM Id / GT Figure 1: QPI methods evaluated in the synthetic dataset (a) and brightfield clinical microscopy showing epithelial cells (b). From left to right: first column displays the defocused image at distance d (Id), with the respective ground truth (GT) situated directly below. Second, third, and fourth columns represent each a different method, with the reconstruction on top and the error image at the bottom. ranging from 1 to 20, and the gradient of the polynomial is employed to apply the US-TIE method to generate ground truth data. All methods are evaluated using a 2-shot approach at d = 2µm. The diffusion models are evaluated in a zero-shot scenario from the synthetic experiment. Figure 1b illustrates the reconstructions of all stacks, while Table 2b presents the MS-SSIM for each sample and method, with sample numbers corresponding to the figure rows. 4.3 ABLATIONS AND ADDITIONAL EXPERIMENTS In addition to super-resolution microscopy and QPI, we also test our method for the problem of image super-resolution over Image Net. For this task, we use the architecture of Saharia et al. (2021), equipped with our schedule-learning framework. Without any fine-tuning of the schedule, our results are comparable to Saharia et al. (2021) and slightly better than Kawar et al. (2022), another diffusion-based method. See Appendix L for more details and reconstruction examples. We also evaluate the impact of our design decisions. First, the regularization term (Section 3.4) is critical in preventing the schedule from converging to a meaningless solution. See Appendix K for a detailed analysis. Second, the separation of variables for β (Section 3.2) ensures that the monotonic behavior of γ(t, x) with respect to t is not extended to x. This is key for achieving competitive performance, so we do not include a detailed ablation study. Finally, we compare learning a pixel-wise schedule to a single, global schedule using the synthetic QPI dataset. Our experiment shows that performance drops with a single learned schedule (see details in Appendix K). 5 ANALYSIS AND DISCUSSION Accuracy. Our model competes favorably with DFCAN on the Bio SR dataset, particularly excelling in image resolution (Table 1). It shows the versatility of diffusion models in generating realistic solutions across diverse phenomena. Additionally, it outperforms CDDPM in more complex biological structures, and it advances significantly in the QPI problem by overcoming noise challenges near the singularity (Wu et al., 2022). While not designed specifically for these tasks, our approach shows promise as a flexible tool for solving inverse problems with available input-data pair samples. Schedule. As explained in Section 1, the schedule is a key parameter for diffusion models, so we aim to understand whether our method yields reasonable values. Based on the relation between γ and β , Published as a conference paper at ICLR 2024 Sample mean Sample standard deviation Figure 2: Schedules and sample mean and deviations for the represented images from the Bio SR dataset. (a) Schedule (β) values for a microtubule image. The graph shows the average of the pixels in the respective region. (b) Mean and standard deviations for microtubule (top row) and endoplasmic reticulum (bottom row). The images were reconstructed using 20 samples obtained with CVDM. the forward model can be parametrized as (details in Appendix J): q(zt|y, x) = N e 1 2 R t 0 β(s,x)dsy, 1 e R t 0 β(s,x)ds I . We can see that for large values of β, the latent variable zt gets rapidly close to a N(0, I) distribution. For small values of β, on the other hand, zt remains closer to y for longer. In general, a steeper graph of β results in a diffusion process that adds noise more abruptly around the middle timesteps, resulting in more difficult inversion of the process. In our image-based data applications, the pixelwise dedicated schedule supports this analytical insight. In Bio SR (Figure 2a), structure pixels (i.e., pixels with high-frequency information) are more difficult to denoise, which is reflected in the steeper β graph. In contrast, background pixels (i.e., pixels with low-frequency information) are easier to resolve. This is consistent with the diffraction limit in optical microscopy, which mostly consists of low frequencies. Conversely, in QPI background pixels have a steeper β graph, a phenomenon linked to the amplification of low-frequency noise around the singularity point in k-space (Wu et al., 2022). Uncertainty. In our experiments, we measure the uncertainty of the reconstruction by the pixel-wise variance on the model predictions. This uncertainty is theoretically tied to β. As described in Song et al. (2021), the reverse diffusion process can be characterized by a stochastic differential equation with a diffusion coefficient of p β(t, x) (in the variance-preserving case). Hence, higher values of β introduce more diffusion into the reverse process over time, leading to more varied reconstructions. For Bio SR, structure pixels exhibit higher values of β and consequently higher reconstruction variance, as seen in Figure 2b. For QPI, the converse phenomenon is true (Figure 12). For further analysis and illustration of uncertainty, see Appendix M. 6 CONCLUSION We introduce a method that extends Variational Diffusion Models (VDMs) to the conditioned case, providing new theoretical and experimental insights for VDMs. While theoretically, the choice of schedule should be irrelevant for a continuous-time VDM, we show that it is important when a discretization is used for inference. We test our method in three applications and get comparable or better performance than previous methods. For image super-resolution and super-resolution microscopy we obtain comparable results in reconstruction metrics. Additionally, for super-resolution microscopy we improve resolution by 4.42% when compared against CDDPM and 26.27% against DFCAN. For quantitative phase imaging, we outperform 2-shot US-TIE by 50.6% in MAE and 3.90% in MS-SSIM and improve the CDDPM benchmark by 83.6% in MAE and 7.03% in MS-SSIM. In the wild brightfield dataset, we consistently outperform both methods. In general, our approach improves over fine-tuned diffusion models, showing that learning the schedule yields better results than setting it as a hyperparameter. Remarkably, our approach produces convincing results for a wild clinical microscopy sample, suggesting its immediate applicability to medical microscopy. Published as a conference paper at ICLR 2024 ACKNOWLEDGEMENTS This work was partially funded by the Center for Advanced Systems Understanding (CASUS) which is financed by Germany s Federal Ministry of Education and Research (BMBF) and by the Saxon Ministry for Science, Culture, and Tourism (SMWK) with tax funds on the basis of the budget approved by the Saxon State Parliament. The authors acknowledge the financial support by the Federal Ministry of Education and Research of Germany and by S achsische Staatsministerium f ur Wissenschaft, Kultur und Tourismus in the programme Center of Excellence for AI-research Center for Scalable Data Analytics and Artificial Intelligence Dresden/Leipzig , project identification number: Sca DS.AI. The authors also acknowledge the support of the National Agency for Research and Development (ANID) through the Scholarship Program (DOCTORADO BECAS CHILE/2023 - 72230222). Martin Benning and Martin Burger. Modern regularization methods for inverse problems. Acta numerica, 27:1 111, 2018. Sayantan Bhadra, Varun A Kelkar, Frank J Brooks, and Mark A Anastasio. On hallucinations in tomographic image reconstruction. IEEE transactions on medical imaging, 40(11):3249 3260, 2021. Daniela Calvetti and Erkki Somersalo. Inverse problems: From regularization to bayesian inference. Wiley Interdisciplinary Reviews: Computational Statistics, 10(3):e1427, 2018. Nanxin Chen, Yu Zhang, Heiga Zen, Ron J Weiss, Mohammad Norouzi, and William Chan. Wavegrad: Estimating gradients for waveform generation. ar Xiv preprint ar Xiv:2009.00713, 2020. Zheng Chen, Yulun Zhang, Ding Liu, Bin Xia, Jinjin Gu, Linghe Kong, and Xin Yuan. Hierarchical integration diffusion model for realistic image deblurring, 2023. Wenyan Cong, Jianfu Zhang, Li Niu, Liu Liu, Zhixin Ling, Weiyuan Li, and Liqing Zhang. Image harmonization datasets: Hcoco, hadobe5k, hflickr, and hday2night. Co RR, abs/1908.10526, 2019. URL http://arxiv.org/abs/1908.10526. Florinel-Alin Croitoru, Vlad Hondru, Radu Tudor Ionescu, and Mubarak Shah. Diffusion Models in Vision: A Survey. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(9): 10850 10869, September 2023. ISSN 0162-8828, 2160-9292, 1939-3539. doi: 10.1109/TPAMI. 2023.3261988. URL http://arxiv.org/abs/2209.04747. ar Xiv:2209.04747 [cs]. Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248 255, 2009. doi: 10.1109/CVPR.2009.5206848. A. Descloux, K. S. Grußmayer, and A. Radenovic. Parameter-free image resolution estimation based on decorrelation analysis. Nature Methods, 16(9):918 924, September 2019. ISSN 15487105. doi: 10.1038/s41592-019-0515-7. URL https://www.nature.com/articles/ s41592-019-0515-7. Number: 9 Publisher: Nature Publishing Group. Prafulla Dhariwal and Alex Nichol. Diffusion models beat gans on image synthesis. Co RR, abs/2105.05233, 2021. URL https://arxiv.org/abs/2105.05233. Juan Luis Fern andez-Mart ınez, Z Fern andez-Mu niz, JLG Pallero, and Luis Mariano Pedruelo Gonz alez. From bayes to tarantola: new insights to understand uncertainty in inverse problems. Journal of Applied Geophysics, 98:62 72, 2013. Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising Diffusion Probabilistic Models, December 2020. URL http://arxiv.org/abs/2006.11239. ar Xiv:2006.11239 [cs, stat]. Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A. Efros. Image-to-image translation with conditional adversarial networks. Co RR, abs/1611.07004, 2016. URL http://arxiv.org/ abs/1611.07004. Published as a conference paper at ICLR 2024 Bahjat Kawar, Michael Elad, Stefano Ermon, and Jiaming Song. Denoising diffusion restoration models, 2022. Diederik P. Kingma and Max Welling. Auto-Encoding Variational Bayes, December 2022. URL http://arxiv.org/abs/1312.6114. ar Xiv:1312.6114 [cs, stat]. Diederik P. Kingma, Tim Salimans, Ben Poole, and Jonathan Ho. Variational Diffusion Models, April 2023. URL http://arxiv.org/abs/2107.00630. ar Xiv:2107.00630 [cs, stat]. Simon Kohl, Bernardino Romera-Paredes, Clemens Meyer, Jeffrey De Fauw, Joseph R. Ledsam, Klaus Maier-Hein, S. M. Ali Eslami, Danilo Jimenez Rezende, and Olaf Ronneberger. A Probabilistic U-Net for Segmentation of Ambiguous Images. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett (eds.), Advances in Neural Information Processing Systems, volume 31. Curran Associates, Inc., 2018. URL https://proceedings.neurips.cc/paper_files/paper/2018/ file/473447ac58e1cd7e96172575f48dca3b-Paper.pdf. Yury Korolev and Jonas Latz. Lecture notes, michaelmas term 2020 university of cambridge. 2021. Jaakko Lehtinen, Jacob Munkberg, Jon Hasselgren, Samuli Laine, Tero Karras, Miika Aittala, and Timo Aila. Noise2noise: Learning image restoration without clean data, 2018. Michael Lustig, David Donoho, and John M. Pauly. Sparse MRI: The application of compressed sensing for rapid MR imaging. Magnetic Resonance in Medicine, 58(6):1182 1195, December 2007. ISSN 0740-3194. doi: 10.1002/mrm.21391. David JC Mac Kay. Information theory, inference and learning algorithms. Cambridge university press, 2003. Ahmad Mousavi, Mehdi Rezaee, and Ramin Ayanzadeh. A survey on compressive sensing: Classical results and recent advancements, 2020. Alex Nichol and Prafulla Dhariwal. Improved Denoising Diffusion Probabilistic Models, February 2021. URL http://arxiv.org/abs/2102.09672. ar Xiv:2102.09672 [cs, stat]. Richard Nickl. Bayesian non-linear statistical inverse problems. 2023. Lukas Owens. Exploring the rate of convergence of approximations to the riemann integral. Preprint, 2014. Shichong Peng and Ke Li. Generating unobserved alternatives, 2020. Anna Pyzara, Beata Bylina, and Jarosław Bylina. The influence of a matrix condition number on iterative methods convergence. In 2011 Federated Conference on Computer Science and Information Systems (Fed CSIS), pp. 459 464. IEEE, 2011. Chang Qiao, Di Li, Yuting Guo, Chong Liu, Tao Jiang, Qionghai Dai, and Dong Li. Evaluation and development of deep neural networks for image super-resolution in optical microscopy. Nature Methods, 18(2):194 202, February 2021. ISSN 1548-7105. doi: 10.1038/s41592-020-01048-5. URL https://www.nature.com/articles/s41592-020-01048-5. Number: 2 Publisher: Nature Publishing Group. Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for biomedical image segmentation, 2015. Chitwan Saharia, Jonathan Ho, William Chan, Tim Salimans, David J. Fleet, and Mohammad Norouzi. Image Super-Resolution via Iterative Refinement, June 2021. URL http://arxiv.org/abs/ 2104.07636. ar Xiv:2104.07636 [cs, eess]. Chitwan Saharia, William Chan, Huiwen Chang, Chris A. Lee, Jonathan Ho, Tim Salimans, David J. Fleet, and Mohammad Norouzi. Palette: Image-to-image diffusion models, 2022. Justin Sirignano and Konstantinos Spiliopoulos. DGM: A deep learning algorithm for solving partial differential equations. Journal of Computational Physics, 375:1339 1364, dec 2018. doi: 10.1016/ j.jcp.2018.08.029. URL https://doi.org/10.1016%2Fj.jcp.2018.08.029. Published as a conference paper at ICLR 2024 Jascha Sohl-Dickstein, Eric Weiss, Niru Maheswaranathan, and Surya Ganguli. Deep unsupervised learning using nonequilibrium thermodynamics. In International conference on machine learning, pp. 2256 2265. PMLR, 2015. Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models, 2022. Yang Song, Jascha Sohl-Dickstein, Diederik P. Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score-Based Generative Modeling through Stochastic Differential Equations, February 2021. URL http://arxiv.org/abs/2011.13456. ar Xiv:2011.13456 [cs, stat]. N. Streibl. Phase imaging by the transport equation of intensity. Optics Communications, 49(1): 6 10, 1984. ISSN 0030-4018. doi: https://doi.org/10.1016/0030-4018(84)90079-8. URL https: //www.sciencedirect.com/science/article/pii/0030401884900798. Michael Reed Teague. Deterministic phase retrieval: a green s function solution. J. Opt. Soc. Am., 73(11):1434 1441, Nov 1983. doi: 10.1364/JOSA.73.001434. URL https://opg.optica. org/abstract.cfm?URI=josa-73-11-1434. Laura Waller, Lei Tian, and George Barbastathis. Transport of intensity phase-amplitude imaging with higher order intensity derivatives. Opt. Express, 18(12):12552 12561, Jun 2010. doi: 10.1364/OE.18.012552. URL https://opg.optica.org/oe/abstract.cfm?URI= oe-18-12-12552. Zhou Wang, Eero P Simoncelli, and Alan C Bovik. Multiscale structural similarity for image quality assessment. In The Thrity-Seventh Asilomar Conference on Signals, Systems & Computers, 2003, volume 2, pp. 1398 1402. Ieee, 2003. Xiaofeng Wu, Ziling Wu, Sibi Chakravarthy Shanmugavel, Hang Z. Yu, and Yunhui Zhu. Physicsinformed neural network for phase imaging based on transport of intensity equation. Opt. Express, 30(24):43398 43416, Nov 2022. doi: 10.1364/OE.462844. URL https://opg.optica. org/oe/abstract.cfm?URI=oe-30-24-43398. Jialin Zhang, Qian Chen, Jiasong Sun, Long Tian, and Chao Zuo. On a universal solution to the transport-of-intensity equation. Optics Letters, 45(13):3649, July 2020. ISSN 01469592, 1539-4794. doi: 10.1364/OL.391823. URL http://arxiv.org/abs/1912.07371. ar Xiv:1912.07371 [physics]. Chao Zuo, Jiaji Li, Jiasong Sun, Yao Fan, Jialin Zhang, Linpeng Lu, Runnan Zhang, Bowen Wang, Lei Huang, and Qian Chen. Transport of intensity equation: a tutorial. Optics and Lasers in Engineering, 135:106187, 2020. ISSN 0143-8166. doi: https://doi.org/10.1016/j.optlaseng. 2020.106187. URL https://www.sciencedirect.com/science/article/pii/ S0143816619320858. Published as a conference paper at ICLR 2024 A INVERSE PROBLEMS We briefly describe a formalization of inverse problems, and mention some of the relevant ideas in this area. Consider two spaces X, Y and a mapping A : Y X. For any particular x X (usually called the data), we are interested in finding an input y Y such that A(y) = x. In most interesting cases A is non-invertible, and even when it is, it may be ill-conditioned, in the sense that the inverse operation will amplify the noise or measurements errors in the data (Fern andez-Mart ınez et al., 2013). Consider the case of solving Ay = x, for an invertible matrix A. If the condition number of the matrix is large, even this relatively simple problem may be dominated by numerical errors in practice (Pyzara et al., 2011). The situation is much more numerically complex when the problem involves a mapping between infinite-dimensional spaces and there is a need for discretization. As a consequence, rather than inverting the operator A, inverse problems are more commonly solved with a least-squares approach (Lustig et al., 2007; Mousavi et al., 2020): min y Y A(y) x . The solution of the above optimization problem defines a generalized inverse for A (Korolev & Latz, 2021). However, if the problem is ill-posed, this approach will still be highly sensitive to the noise in the data. This has motivated the use of regularization (Calvetti & Somersalo, 2018; Benning & Burger, 2018), which changes the problem into min y Y A(y) x 2 2 + αJ (y), where the regularization term usually takes the form J (y) = y , and α determines how much weight is assigned to the second term (prior knowledge of y) as compared to the first (data fitting). Depending on several conditions, regularization can guarantee a stable optimization problem (Korolev & Latz, 2021; Calvetti & Somersalo, 2018). Still, further methods for solving inverse problems have been explored. One reason for this is that the theoretical understanding of regularization is not complete for nonlinear mappings (Benning & Burger, 2018). Moreover, as we mentioned already, one of the challenges for inverse problems is the presence of noise in the data. This is usually modeled as x = A(y) + η, where η is the realization of a random variable. If the noise plays an important role, it would be desirable to provide uncertainty estimation for the solution. This has motivated the use of stochastic methods for inverse problems, one of the most prominent being the use of the Bayesian approach (Benning & Burger, 2018; Nickl, 2023). B FORWARD PROCESS AND POSTERIOR DISTRIBUTION B.1 FORWARD PROCESS MARKOVIAN TRANSITIONS Using the result in Appendix A of Kingma et al. (2023), for each step 0 < i T we have q(zti|zti 1, x) = N γ(ti, x) γ(ti 1, x)zti 1, σ(ti, x) γ(ti, x) γ(ti 1, x)σ(ti 1, x) I We are working with the variance-preserving case, that is, σ(t, x) = 1 γ(t, x). Looking at the variance in the expression above, notice that σ(ti, x) γ(ti, x) γ(ti 1, x)σ(ti 1, x) = (1 γ(ti, x)) γ(ti, x) γ(ti 1, x)(1 γ(ti 1, x)) = 1 γ(ti, x) γ(ti, x) γ(ti 1, x) + γ(ti, x) = 1 γ(ti, x) γ(ti 1, x). To simplify these expressions, we introduce the following notation: ˆβT (ti, x) := 1 γ(ti, x) γ(ti 1, x). Published as a conference paper at ICLR 2024 Notice that ˆβT depends on both ti 1 and ti, but we have only parametrized it in terms of ti. We can do this because, for any fixed number of steps T, we can easily recover ti 1 from ti. Our notation will be convenient when we consider the limit case T , as explained in Section 3.2. With this notation, we can restate the result in a more concise way: q(zti|zti 1, x) = N q 1 ˆβT (ti, x)zti 1, ˆβT (ti, x)I . B.2 POSTERIOR DISTRIBUTION By Bayes theorem, we know that q(zti 1|zti, y, x) = q(zti|zti 1, y, x)q(zti 1|y, x) q(zti|y, x) = q(zti|zti 1, x)q(zti 1|y, x) q(zti|y, x) . Notice that both the prior q(zti 1|y, x) and the likelihood q(zti|zti 1, x) are Gaussian, which means that the posterior will be normally distributed too (Mac Kay, 2003). The computation of the parameters is somewhat involved, but it has already been done in the diffusion models literature. Following Kingma et al. (2023), we get q(zti 1|zti, y, x) = N (µB(zti, ti, y, x), σB(ti, x)I) , µB(zti, ti, y, x) = q 1 ˆβT (ti, x)σ(ti 1, x) σ(ti, x) zti + p γ(ti 1, x) ˆβT (ti, x) 1 ˆβT (ti, x)1 γ(ti 1, x) 1 γ(ti, x) zti + p γ(ti 1, x) ˆβT (ti, x) 1 γ(ti, x)y σB(ti, x) = ˆβT (ti, x)σ(ti 1, x) = ˆβT (ti, x)1 γ(ti 1, x) 1 γ(ti, x) . Notice that, similar to ˆβT , both µB and σB require ti 1 as input besides ti. However, if the number of steps T is fixed, then ti can be used to find ti 1. So, same as ˆβT , we write these functions as depending only on ti, like ˆβT (ti, x), rather than including ti 1 as an extra parameter. C EVIDENCE LOWER BOUND In this appendix, we derive the most important aspects of the loss function. Most of these details correspond to the derivations in Sohl-Dickstein et al. (2015); Ho et al. (2020) and other references, but we include them for completeness and to point out some key differences in our setup. For simplicity of notation, we denote zti as zi in most of this section. Let x be the data. We want the learned pν(y|x) distribution to be as close as possible to the true distribution p(y|x). Hence, it makes sense to maximize the following log-likelihood term: E(y,x) p(y,x) [log pν(y|x)] . Actual implementations of diffusion models take the equivalent approach of minimizing the negative log-likelihood, so we focus on the following term: L = log pν(y|x) = log pν(z0|x) = log Z pν(z T |x) i=1 pν(zi 1|zi, x)dz1 . . . dz T = log Z pν(z T |x) QT i=1 pν(zi 1|zi, x) QT i=1 q(zi|zi 1, x) i=1 q(zi|zi 1, x)dz1 . . . dz T Published as a conference paper at ICLR 2024 Since the process is Markovian, QT i=1 q(zi|zi 1, x) = q(z1, . . . , z T |z0, x) and we get = log E(z1,...,z T ) q(z1,...,z T |z0,x) pν(z T |x) QT i=1 pν(zi 1|zi, x) QT i=1 q(zi|zi 1, x) By Jensen s inequality: E(z1,...,z T ) q(z1,...,z T |z0,x) pν(z T |x) QT i=1 pν(zi 1|zi, x) QT i=1 q(zi|zi 1, x) For simplicity of notation, in the appendices, we compute the loss terms before application of E(y,x) p(y,x) [ ]. This means, for instance, that L depends on (y, x) and should be written as L(y, x). Once again, in the interest of simplifying notation in the appendices, we write the loss terms without explicit dependence on y and x. So, for instance, we write Lprior in the appendices, but in the main body of the paper, we use the more precise Lprior(y, x). In the following, we denote E(z1,...,z T ) q(z1,...,z T |z0,x) [ ] by just Eq|z0,x [ ]. We continue by taking the logarithm of the whole product: pν(z T |x) QT i=1 pν(zi 1|zi, x) QT i=1 q(zi|zi 1, x) i=1 log q(zi|zi 1, x) pν(zi 1|zi, x) log pν(z T |x) i=2 log q(zi|zi 1, x) pν(zi 1|zi, x) + q(z1|z0, x) pν(z0|z1, x) log pν(z T |x) i=2 log q(zi 1|zi, z0, x) pν(zi 1|zi, x) q(zi|z0, x) q(zi 1|z0, x) + q(z1|z0, x) pν(z0|z1, x) log pν(z T |x) i=2 log q(zi 1|zi, z0, x) pν(zi 1|zi, x) + i=2 log q(zi|z0, x) q(zi 1|z0, x) + q(z1|z0, x) pν(z0|z1, x) log pν(z T |x) i=2 log q(zi 1|zi, z0, x) pν(zi 1|zi, x) + log q(z T |z0, x) q(z1|z0, x) + q(z1|z0, x) pν(z0|z1, x) log pν(z T |x) i=2 log q(zi 1|zi, z0, x) pν(zi 1|zi, x) + log q(z T |z0, x) pν(z T |x) log pν(z0|z1, x) = Lprior + Lreconstruction + Ldiffusion. We take a look at each one of these terms in turn. To be consistent with the main body of the paper, we now return to the zti notation instead of using zi. The prior loss term corresponds to Lprior = DKL (q(zt T |zt0, x) || pν(zt T |x)) , and it helps ensure that the forward process ends with a similar distribution as to that with which the reverse process begins. In our setup, pν(zt T |x) is fixed as a normal distribution for all z T and has no trainable parameters. In many DPM setups (Ho et al., 2020; Nichol & Dhariwal, 2021) the schedule is fixed and hence q(zt T |zt0, x) has no trainable parameters either, so Lprior is discarded altogether for training. However, in our framework we need the key flexibility of learning the schedule function γ, so we do not discard this term. There are known ways to estimate and minimize Lprior, since pν(zt T |x) is a standard normal variable and q(zt T |zt0, x) has a Gaussian distribution too. Hence, the KL divergence between the two has an analytical expression that is differentiable and can be minimized with standard techniques (see Section 3.4). On the other hand, consider the reconstruction loss, given by Lreconstruction = Ezt1 q(zt1|zt0,x) [ log pν(zt0|zt1, x)] , Published as a conference paper at ICLR 2024 which helps ensure that the last step of the reverse process pν(zt0|zt1, x) gives the correct converse operation with respect to the forward process q(zt1|zt0, x). In most DPM setups, this term is discarded or included in an altered form (Ho et al., 2020; Nichol & Dhariwal, 2021). We choose to discard it for training, because its effect is very small when compared to Ldiffusion, and because the last step of the reverse process is not especially important. Consider the following: a diffusion process has to produce change in a gradual way, and in the end pν(zt1|x) should be almost identical to pν(zt0|x), especially when the number of steps T is high or even infinite (we discuss the continuous-time case in Appendix D). The quality of pν is ultimately determined by the diffusion loss as a whole rather than by the last step. Finally, consider the diffusion loss, given by Ldiffusion = i=2 E(zt1,...,zt T ) q(zt1,...,zt T |zt0,x) log q(zti 1|zti, zt0, x) pν(zti 1|zti, x) i=2 Ezti q(zti|zt0,x) DKL q(zti 1|zti, zt0, x) || pν(zti 1|zti, x) Each Lti term simplifies greatly, as we show in Appendix D. In the end, after discarding the reconstruction loss, the Evidence Lower Bound loss becomes LELBO = Lprior + Ldiffusion = Lprior + D DIFFUSION LOSS AND TRAINING From Appendix C, we know that one of the terms in the Evidence Lower Bound is Ldiffusion = i=2 Ezti q(zti|zt0,x) DKL q(zti 1|zti, zt0, x) || pν(zti 1|zti, x) . We now simplify this term, essentially following Kingma et al. (2023). We include this material for completeness and to highlight any differences that may come about from conditioning on x. Each Lti includes the KL divergence between two Gaussian distributions, which has an analytic expression. Moreover, both distributions have the same variance. From Section 3.1, recall that by definition pν(zti 1|zti, x) = q(zti 1|zti, zt0 = ˆyν(zti, ti, x), x), Now, from Appendix B.2 recall that the variance σB of q(zti 1|zti, zt0, x) only depends on ti and x, and it does not depend on y. Hence, as we mentioned before, both q(zti 1|zti, zt0, x) and pν(zti 1|zti, x) are Gaussian distributions with exactly the same variance. This means that the KL divergence between them simplifies to DKL q(zti 1|zti, zt0, x) || pν(zti 1|zti, x) = 1 2σB(ti, x) µ µν 2 2, where µ and µν are the means of q(zti 1|zti, zt0, x) and pν(zti 1|zti, x), respectively. Now, from Appendix B.2, we know that µ = µB(zti, ti, y, x) = q 1 ˆβT (ti, x)1 γ(ti 1, x) 1 γ(ti, x) zti + p γ(ti 1, x) ˆβT (ti, x) 1 γ(ti, x)y, µν = µB(zti, ti, y, x) = q 1 ˆβT (ti, x)1 γ(ti 1, x) 1 γ(ti, x) zti + p γ(ti 1, x) ˆβT (ti, x) 1 γ(ti, x)ˆyν γ(ti 1, x) ˆβT (ti, x) 1 γ(ti, x)(y ˆyν). Published as a conference paper at ICLR 2024 Since σB(ti, x) = ˆβT (ti, x)(1 γ(ti 1, x))/(1 γ(ti, x)), we can conclude that DKL q(zti 1|zti, zt0,x) || pν(zti 1|zti, x) = 1 2σB(ti, x) µ µν 2 2 = 1 γ(ti, x) 2ˆβT (ti, x)(1 γ(ti 1, x)) γ(ti 1, x) ˆβT (ti, x) 1 γ(ti, x)(y ˆyν) = γ(ti 1, x)ˆβT (ti, x) 2(1 γ(ti 1, x))(1 γ(ti, x)) y ˆyν 2 2 = γ(ti 1, x) 1 γ(ti,x) γ(ti 1,x) 2(1 γ(ti 1, x))(1 γ(ti, x)) y ˆyν 2 2 = γ(ti 1, x) γ(ti, x) 2(1 γ(ti 1, x))(1 γ(ti, x)) y ˆyν 2 2 γ(ti 1, x) 1 γ(ti 1, x) γ(ti, x) 1 γ(ti, x) σ(ti 1, x) γ(ti, x) where ˆyν = ˆyν(zti, ti, x) The fraction γ(t, x)/σ(t, x) is what Kingma et al. (2023) call the signalto-noise ratio and it should be a decreasing function of time, so the above expression makes sense as a nonnegative loss. Since we are in the variance-preserving case where σ(t, x) = 1 γ(t, x), it is sufficient that γ(t, x) is a decreasing function of time to guarantee that the above expression is nonnegative. Recapping, we have that 2Ezti q(zti|zt0,x) σ(ti 1, x) γ(ti, x) y ˆyν(zti, ti, x) 2 2 By the definition of the forward process q(zti|zt0, x) in Section 3.1, a variable zti q(zti|zt0, x) can be reparametrized as zti(ϵ) = p γ(ti, x)y + p σ(ti, x)ϵ with ϵ N(0, I). This means that we can write the i-th term of the diffusion loss as σ(ti 1, x) γ(ti, x) y ˆyν(zti(ϵ), ti, x) 2 2 Putting everything together, the diffusion loss is given by Ldiffusion = i=2 Eϵ N(0,I) σ(ti 1, x) γ(ti, x) y ˆyν(zti(ϵ), ti, x) 2 2 i=2 Eϵ N(0,I) γ(ti 1, x) 1 γ(ti 1, x) γ(ti, x) 1 γ(ti, x) y ˆyν(zti(ϵ), ti, x) 2 2 E ESTIMATORS FOR DIFFUSION LOSS E.1 MONTE CARLO ESTIMATOR FOR DIFFUSION LOSS From Appendix D, we know that Ldiffusion = 1 i=2 Eϵ N(0,I) σ(ti 1, x) γ(ti, x) y ˆyν(zti(ϵ), ti, x) 2 2. Published as a conference paper at ICLR 2024 For large T, computing the sum becomes computationally expensive. By linearity of expectation: σ(ti 1, x) γ(ti, x) y ˆyν(zti(ϵ), ti, x) 2 2 2 Eϵ N(0,I) σ(ti 1, x) γ(ti, x) y ˆyν(zti(ϵ), ti, x) 2 2 We can recognize the expression in brackets as an expected value, where i is chosen uniformly at random from {2, . . . , T} with probability 1/(T 1). In other words, we can rewrite Ldiffusion as: LT (y, x) := T 1 2 Eϵ N(0,I),i U{2,...,T } σ(ti 1, x) γ(ti, x) y ˆyν(zti(ϵ), ti, x) 2 2 where UA represents the discrete uniform distribution over a finite set A. As mentioned in Section 3.3, the advantage of writing the loss term like this is that it provides a straightforward Monte Carlo estimator, by taking samples ϵ N(0, I) and i U{2,...,T }. E.2 CONTINUOUS-TIME DIFFUSION LOSS We mention again that throughout the appendices, we sometimes omit writing function parameters to simplify notation, mainly in the case of the loss terms. For instance, we write LT instead of LT (y, x). Recall from Section 3.1 that zti = i/T, so intuitively the limit case T takes us into a continuous-time diffusion process, which can be described by a stochastic differential equation. This fact has been noticed before in the literature and it allows to present both diffusion models and score-based generative models as part of the same framework (Song et al., 2021). In our case, we are mostly interested in how the loss term LT changes when T goes to infinity. Kingma et al. (2023) give the following result when taking T : L := lim T L = 1 0 SNR (t, x) y ˆyν(zt(ϵ), t, x) 2 2dt (7) 2Eϵ N(0,I),t U([0,1]) h SNR (t, x) y ˆyν(zt(ϵ), t, x) 2 2 i , where SNR(t, x) represents the signal-to-noise ratio at time t and is defined as γ(t, x)/σ(t, x). Throughout this section, for simplicity of notation we use SNR (t, x) to denote the partial derivative of SNR(t, x) with respect to the time t, and analogously for SNR (t, x). We are interested in better understanding this result, in particular regarding sufficient conditions under which the above equality holds, and its rate of convergence. This provides an interesting regularization idea for the training process, which we discuss in Section 3.4. Starting from the expressions for LT and Ldiffusion given in Appendices D and E.1, we know that LT = Ldiffusion σ(ti 1, x) γ(ti, x) y ˆyν(zti(ϵ), ti, x) 2 2 i=2 (SNR(ti 1, x) SNR(ti, x)) y ˆyν(zti(ϵ), ti, x) 2 2 i=2 (SNR(ti, x) SNR(ti 1, x)) y ˆyν(zti(ϵ), ti, x) 2 2 2Eϵ N(0,I) [ST ] , Published as a conference paper at ICLR 2024 where we have denoted the sum inside the brackets as ST . Now, we denote h T = 1/T and rewrite ST in the following way, such that a derivative-like expression appears: i=2 (SNR(ti, x) SNR(ti 1, x)) y ˆyν(zti(ϵ), ti, x) 2 2 SNR(ti, x) SNR(ti 1, x) h T y ˆyν(zti(ϵ), ti, x) 2 2h T i=2 f T (ti)h T , where the function f T is defined as f T (t) = SNR(t, x) SNR(t h T , x) h T y ˆyν(zt(ϵ), t, x) 2 2. Now, we need to understand the behaviour of ST as T goes to infinity. Assuming that both SNR and ˆyν are continuously differentiable, it is easy to see that f T (t) converges pointwise: f T (t) T g(t) := SNR (t, x) y ˆyν(zt(ϵ), t, x) 2 2. We have established pointwise convergence of f T to g, but what we are really interested in is to understand the convergence of LT to L , defined as in equation (7). Notice that 2Eϵ N(0,I) [ST ] , L = 1 Denote the integral R 1 0 g(t)dt by I. The above expression strongly suggests that we should understand the relation between ST and I when T goes to infinity, in order to establish the convergence of LT to L . With that in mind, for any given positive integer T, the difference between the sum ST and the integral I can be bounded as follows: i=2 f T (ti)h T Z 1 i=2 f T (ti)h T i=2 g(ti)h T i=1 g(ti)h T Z 1 + |g(t1)h T | | {z } E3 We have assumed that both SNR and ˆyν are continuously differentiable, which makes g continuously differentiable too. Given the smoothness of g, for large T we have On the other hand, E2 is just the difference between a Riemann Sum for g and its integral. For a uniform partition such as ours (i.e. ti = i/T) the Riemann Sum converges as O(1/T). In particular, since g is differentiable, for large T we have (Owens, 2014) E2 |g(1) g(0)| On the other hand, notice that f T (ti) g(ti) h T SNR(ti, x) SNR(ti h T , x) h T SNR (ti, x) y ˆyν(zti(ϵ), ti, x) 2 2h T Published as a conference paper at ICLR 2024 Assuming further that SNR is twice differentiable, the forward difference approximates the derivative: Di = SNR(ti, x) SNR(ti h T , x) h T SNR (ti, x) 1 SNR ( , x) L ([ti 1,ti]). Replacing in the bound for E1, we get i=2 Di y ˆyν(zti(ϵ), ti, x) 2 2h T i=2 |Di| y ˆyν(zti(ϵ), ti, x) 2 2h T SNR ( , x) L ([ti 1,ti]) y ˆyν(zti(ϵ), ti, x) 2 2h T SNR ( , x) L ([ti 1,ti]) y ˆyν(zti(ϵ), ti, x) 2 2h T Approximating the Riemann Sum by its integral SNR (t, x) y ˆyν(zt(ϵ), t, x) 2 2dt. Putting everything together, we get |ST I| E3 + E2 + E1 T + |g(1) g(0)| SNR (t, x) y ˆyν(zt(ϵ), t, x) 2 2dt Replacing g and using Cauchy-Schwarz inequality: SNR (0, x) y ˆyν(z0(ϵ), 0, x) 2 2 SNR (1, x) y ˆyν(z1(ϵ), 1, x) 2 2 SNR (0, x) y ˆyν(z0(ϵ), 0, x) 2 2 SNR (t, x) 2dt 1/2 Z 1 0 y ˆyν(zt(ϵ), t, x) 4 2dt 1/2 . If the model is good enough, we would expect y ˆyν(z0(ϵ), 0, x) 2 2 0: |SNR (1, x)| y ˆyν(z1(ϵ), 1, x) 2 2 T + SNR ( , x) L2([0,1]) 0 y ˆyν(zt(ϵ), t, x) 4 2dt 1/2 . Summarizing, the main assumptions so far are that SNR is twice differentiable and that ˆyν is continuously differentiable. With that in mind, we can use the last expression to prove that ST converges to I when T goes to infinity. Adding an extra condition of boundedness for ˆyν, we can use the Dominated Convergence Theorem to conclude that lim T LT = lim T 1 2Eϵ N(0,I) [ST ] = 1 2Eϵ N(0,I) h lim T ST i = 1 We thus establish sufficient conditions for the convergence of LT to L . Our analysis shows that, in some way, this convergence is of order O(1/T) and its speed depends on the magnitude of SNR and SNR . Also, it depends on the approximation quality of the neural network model ˆyν, which is Published as a conference paper at ICLR 2024 consistent with the finding in Kingma et al. (2023) that increasing the number of steps T decreases the error on the condition that the model is good enough. This provides an interesting insight. Kingma et al. (2023) show that in the continuous-time limit, the diffusion loss is invariant to the choice of SNR function, as long as it fulfills a few basic conditions. However, in any computational implementation, continuous time cannot truly exist, so the rate of convergence to the continuous-time case does matter. This means that for choices of SNR with ill-behaved derivatives, or for a model ˆyν that is not good enough, the invariance under the choice of SNR does not necessarily hold. This gives intuition for the regularization introduced in Section 3.4. E.3 NOISE PREDICTION MODEL From Appendix E.2, we know that the continuous-time diffusion loss is given by 2Eϵ N(0,I),t U([0,1]) h SNR (t, x) y ˆyν(zt(ϵ), t, x) 2 2 i . By the definition of the forward process q(zt|y, x) in Section 3.1, a variable zt q(zt|y, x) can be reparametrized as zt = p γ(t, x)y + p σ(t, x)ϵ with ϵ N(0, I). Rearranging the terms: σ(t, x)ϵ . (8) Our model for predicting y is ˆyν(zt(ϵ), t, x). From equation (8), notice that ˆyν receives, as explicit input, all the values necessary to compute y excepting ϵ. As a consequence, we can follow Ho et al. (2020) and repurpose ˆyν into a noise prediction model ˆϵν by parametrizing σ(t, x)ˆϵν . This means that the diffusion loss takes the form 2Eϵ N(0,I),t U([0,1]) h SNR (t, x) y ˆyν(zt(ϵ), t, x) 2 2 i 2Eϵ N(0,I),t U([0,1]) SNR (t, x)σ(t, x) γ(t, x) ϵ ˆϵν(zt(ϵ), t, x) 2 2 2Eϵ N(0,I),t U([0,1]) SNR(t, x) ϵ ˆϵν(zt(ϵ), t, x) 2 2 This is the form of the diffusion loss that we use as part of the loss function in the end (see Section 3.4). In experiments, we get better results by dropping the rational term SNR (t, x)/SNR(t, x), which is consistent with the approach in Ho et al. (2020). Otherwise, the training process can converge to trivial solutions that minimize L by making SNR (t, x) 0 for all t, x. In the end, the actual form of the diffusion loss for our method is given by 2Eϵ N(0,I),t U([0,1]) h ϵ ˆϵν(zt(ϵ), t, x) 2 2 i . F CONTINUOUS-TIME SCHEDULE In Appendix B.1, for all i {1, . . . , T} we define ˆβT (ti, x) = 1 γ(ti, x) γ(ti 1, x). Now, we want to study the relationship between functions γ and ˆβT when T goes to infinity and we move into a continuous-time framework. Since we want a diffusion process to be smooth and free of sudden jumps, we require that γ be least continuous on [0, 1] and continuously differentiable on (0, 1). By definition, notice that for any i {1, . . . , T} γ(ti, x) = γ(ti 1, x)(1 ˆβT (ti, x)). (9) Published as a conference paper at ICLR 2024 Recall that the discretization we are using assumes a number of steps T N, and we use the uniform partition of [0, 1] defined by {ti}T i=0 where ti = i/T. Now, take any t [0, 1] and define t = t T /T. Notice that t is an element of the partition {ti} defined above, and that t t when T . Then, equation (9) implies γ(t , x) = γ(t h T , x) 1 ˆβT (t , x) . Now, assume there exists a continuous function β(t, x) such that ˆβT (t , x) = β(t , x)/T (10) for all t , x. A function like this would allow us to calculate ˆβT for any discretization, so we are interested in learning more about β. Denoting h T = 1/T, we have γ(t , x) = γ(t h T , x) (1 β(t , x)h T ) = γ(t h T , x) γ(t h T , x)β(t , x)h T = γ(t , x) γ(t h T , x) = γ(t h T , x)β(t , x)h T = γ(t , x) γ(t h T , x) h T = γ(t h T , x)β(t , x). We can now take the limit T (which means h T 0 and t t). With our assumption that γ is continuously differentiable, the left-hand side of the equation above converges to γ(t, x)/ t. On the right-hand side, γ(t h T , x) converges to γ(t, x) and, with our assumption that β is continuous, β(t , x) converges to β(t, x). In conclusion, we get the equation t = γ(t, x)β(t, x). During training, we learn both the γ and β functions, and we enforce the above differential equation by adding a corresponding term to the loss function. This ensures that the correct relation between γ and β is preserved during training. Afterwards, having learned these functions successfully, we can use equation (10) to discretize β into ˆβT , and then sample from the forward and the reverse process using the equations derived in Appendix B. G IMPLEMENTATION DETAILS G.1 METRICS DEFINITION We use two main metrics to measure our results (Section 4). The first is MAE, defined as: MAE := 1 |P| p P |yp ˆyp|, where p indexes the set of pixel positions P in the images, y stands for the ground truth image, and ˆy is the predicted image. The second metric is MS-SSIM, which measures the structural similarity between images and is defined as: MS-SSIM(y, ˆy) := [l M(y, ˆy)]αM M Y j=1 [cj(y, ˆy)]βj[sj(y, ˆy)]γj, where lj, cj, and sj are the measures of luminance, contrast, and structure corresponding to scale j. We use five scales for this evaluation and we set αj = βj = γj such that PM j=1 γj = 1. G.2 ARCHITECTURAL DETAILS Following Kingma et al. (2023), to learn the functions τθ(t) and ρχ(t) we parametrize them using a monotonic neural network. This network is composed of a residual block with three convolutional layers. The first and third layers employ linear activation and are linked by a skip connection, while Published as a conference paper at ICLR 2024 the second layer uses a sigmoid activation and is equipped with 1024 filters. All layers adopt 1 1 convolutions. The decision to use convolutional layers over a dense network stems from the desire to facilitate the model s operation at various resolutions without retraining. Additionally, we constrain the weights of the network to be positive. To satisfy the condition β(0, x) = 0, we multiply the network s output by t. For λϕ(x), we parametrize it using a U-Net architecture (Ronneberger et al., 2015), represented in Figure 3. The model incorporates 5 scales, doubling the number of filters at each scale while concurrently halving the resolution. Each block consists of two sequences: convolution, activation, and instance normalization. Upscaling is achieved using transposed convolutions to ensure differentiability. Softplus activations are used throughout the architecture. Mirrored filters are concatenated, and the output s positivity is guaranteed by a final softplus activation. Figure 3: Architecture of λϕ(x). Our denoising model (Figure 4) closely mirrors this implementation with two notable distinctions: first, its input comprises the concatenation of x, γ(t, x), and zt, and the predicted zt 1 output has a linear rather than softplus activation. In line with common practices, the network predicts the noise ϵ at the corresponding timestep. We determine zt 1 by using the algorithm detailed in Appendix H. Figure 4: Architecture of the score predictor used in Bio SR and QPI. H TRAINING AND INFERENCE ALGORITHMS Figure 5 presents a high-level overview of the training algorithm and the inference process for CVDM. Algorithm 1 describes the training of the denoiser using a learnable schedule, while Algorithm 2 demonstrates the inference process and how to use the learned schedule during this procedure. Published as a conference paper at ICLR 2024 Algorithm 1: Training of Denoising Model ˆϵν(zt(ϵ), t, x) (x, y) p(x, y) t U([0, 1]) ϵ N(0, I) Take a gradient descent step ωLCVDM(γ, x, y, ϵ) where ω are the parameters of the joint model χ, ϕ, θ, ν. See equation (6). until converged; Algorithm 2: Inference in T timesteps for t = 0 to T do βt β(t/T,x) T αt 1 βt γt Q t α