# scoreoptimal_diffusion_schedules__c982c89b.pdf Score-Optimal Diffusion Schedules Christopher Williams Department of Statistics University of Oxford Andrew Campbell Department of Statistics University of Oxford Arnaud Doucet Department of Statistics University of Oxford Saifuddin Syed Department of Statistics University of Oxford {williams,campbell,doucet,saifuddin.syed}@stats.ox.ac.uk Denoising diffusion models (DDMs) offer a flexible framework for sampling from high dimensional data distributions. DDMs generate a path of probability distributions interpolating between a reference Gaussian distribution and a data distribution by incrementally injecting noise into the data. To numerically simulate the sampling process, a discretisation schedule from the reference back towards clean data must be chosen. An appropriate discretisation schedule is crucial to obtain high quality samples. However, beyond hand crafted heuristics, a general method for choosing this schedule remains elusive. This paper presents a novel algorithm for adaptively selecting an optimal discretisation schedule with respect to a cost that we derive. Our cost measures the work done by the simulation procedure to transport samples from one point in the diffusion path to the next. Our method does not require hyperparameter tuning and adapts to the dynamics and geometry of the diffusion path. Our algorithm only involves the evaluation of the estimated Stein score, making it scalable to existing pre-trained models at inference time and online during training. We find that our learned schedule recovers performant schedules previously only discovered through manual search and obtains competitive FID scores on image datasets. 1 Introduction Denoising Diffusion models (Sohl-Dickstein et al., 2015; Ho et al., 2020; Song et al., 2021) or DDMs are state-of-the-art generative models. They are formulated through considering a forward noising process that interpolates from the target to a reference Gaussian distribution by gradually introducing noise into an empirical data distribution. Simulating the time reversal of this process then produces samples from the data distribution. Specifically, we evolve data distribution p0 through the forward diffusion process on time interval [0, 1], described by d Xt = f(t)Xtdt + g(t)d Wt, X0 p0, (1) with drift f(t)Xt, diffusion coefficient g(t) and Brownian noise increment d Wt. The coefficients f(t) and g(t) are chosen such that at time t = 1 the distribution of X1 is very close to a reference Gaussian distribution p1 in distribution. A sample starting at p0 and following Equation (1) until time t will be distributed according to pt which is a mollified version of the data distribution X0 p0(x0)pt|0(xt|x0)dx0, pt|0(xt|x0) = N(xt; s(t)x0, σ2(t)I). (2) 38th Conference on Neural Information Processing Systems (Neur IPS 2024). 0.6 0.4 0.2 0.0 0.2 0.4 0.6 X Training Data Samples 0.6 0.4 0.2 0.0 0.2 0.4 0.6 X Linear Schedule Samples 0.6 0.4 0.2 0.0 0.2 0.4 0.6 X Optimised Schedule Samples Figure 1: Density estimates of the mollified Cantor distribution (left) using a DDM with schedule T = {ti}100 i=0 generated with 100 linearly spaces discretisation times ti = i/100 (middle), compared to the optimised schedule T = {t i }50 i=0 with 50 discretisation times t i generated by Algorithm 1 (right). The eight modes present in our true mollified distribution are shown in grey on each plot. The parameters s(t) and σ2(t) define the noising schedule. They can be found in closed form in terms of f(t) and g(t) (Karras et al., 2022). To obtain samples from p0, we can reverse the dynamics of the forward diffusion in Equation (1) to obtain the backward diffusion, d Xt = f(t)Xt g(t)2 x log pt(Xt) dt + g(t)df Wt, X1 p1. (3) By simulating Equation (3) backwards in time, we evolve reference samples X1 p1 from t = 1 to t = 0 to obtain samples that are terminally distributed according to the data distribution p0. To simulate Equation (3) numerically, we must decide upon a discretisation of time, T = {ti}T i=0 with t T = 1, t0 = 0, which we refer to as the discretisation schedule. For a given noising schedule, it is important to select an appropriate discretisation schedule such that (3) can be simulated accurately, i.e. pti and pti 1 should not differ significantly. In this paper, we derive a methodology to compute an optimal discretisation schedule. Prior work has often joined together the choice of noising schedule and discretisation schedule. A uniform splitting of time would be chosen, ti = i/T, with the noising schedule dictating the change between pti and pti 1. Two prominent examples have the form s(t) = p α(t), σ2(t) = 1 α(t) with the linear schedule introduced by Ho et al. (2020) having α(t) = 1 exp R t 0 β(s)ds with linear β(t) = βmin + t(βmax βmin). Alternatively, Nichol and Dhariwal (2021) introduce the cosine schedule with α(t) = f(t)/f(0), f(t) = cos t+ϵ 1+ϵ π 2 2 for ϵ = 0.008. Karras et al. (2022) refines this approach by splitting the choice of noising schedule from that of the discretisation schedule, however, picking the discretisation schedule is still a matter of hyperparameter tuning. A good discretisation schedule can drastically impact the efficiency of the training and inference of the generative model, but unfortunately can be difficult to select for complex target distributions. For example, for a distribution supported on the Cantor set (Figure 1, left), the default linear schedule fails entirely to capture the modes of the data distribution (Figure 1, middle). However, our optimised schedule learned using Algorithm 1 recovers these modes (Figure 1, right). Without such an automatic algorithm, finding performant discretisation schedules often reduces to an expensive and laborious hyperparameter sweep. We devise a method for selecting a discretisation schedule that yields high-quality samples from p0. Our key contribution is defining an appropriate notion of cost incurred when simulating from one step in the diffusion path to the next. We then choose our discretisation schedule to minimise the total cost incurred when simulating the entire path from p1 to p0. Our cost purely depends on the distributional shifts along the diffusion path and assumes perfect score estimation, hence, we refer to our schedules as score-optimal diffusion schedules. The resulting algorithm is cheap to compute, easy to implement and requires no hyperparameter search. Our algorithm can be applied to find discretisation schedules for sampling pre-trained models as well as performed online during DDM training. We demonstrate our proposed method on highly complex 1D distributions and show our method scales to high dimensional image data where it recovers previously known performant discretisation schedules discovered only through manual hyperparameter search. To the best of our knowledge, this is the first online data-dependent adaptive discretisation schedule tuning algorithm. 2 The Cost of Traversing the Diffusion Path To derive our optimal discretisation schedule, we first need to derive a notion of cost of traversing from our reference distribution p1 to the data distribution p0 through each intermediate distribution pt, referred to as the diffusion path. We will then later find the discretisation schedule that minimises the total cost of traversing this path. Our notion of cost is based on the idea that while integrating Equation (3) from time t to t will always take you from pt to pt , the simulation step will need to do more work to make sure the samples are distributed according to pt if pt and pt are very different distributions rather than if they are close. In the following, we will make this intuition precise. 2.1 Predictor/Corrector Decomposition of the Diffusion Update To begin, let X = Rd and P(X) be the space of Lebesgue probability measures on X with smooth densities. For t [0, 1], define the diffusion path pt P(X) as the law of Xt satisfying the forward diffusion Equation (1) initialised at the data distribution X0 p0, or equivalently the law of the Xt satisfying the backward diffusion Equation (3) initialised at the reference distribution X1 p1. Given a sample Xt pt, a sample Xt pt can be generated by integrating the backward diffusion Equation (3). In Song et al. (2021), it was shown that we can further decompose the backward diffusion Equation (3) at time t into a deterministic flow governed by the probability flow ODE, and the stochastic flow driven by Langevin dynamics targeting pt, d Xt = f(t)Xt 1 2g(t)2 log pt(Xt) dt | {z } Probability Flow Prediction ODE 2g(t)2 log pt(Xt)dt + g(t)df Wt. | {z } Langevin Correction SDE This decomposition into a deterministic flow and a correction will help us derive our cost in Section 2.2 by analysing the work done by the correction to keep the samples on the diffusion path. Here, we will first expand upon this decomposition by defining a hypothetical two-step sampling procedure that could be used to sample the DDM. It consists of: (1) a predictor step that generates a deterministic prediction of Xt and (2) a corrector step that uses Langevin dynamics targeting pt to correct any accrued error. Note we are not advocating for the implementation of such a procedure, only that by imagining simulating with this hypothetical predictor-corrector algorithm, it will be helpful for our theoretical derivation of a cost intrinsically linked to the sampling of DDMs. The two stages of the predictor-corrector algorithm are rigorously defined as follows. Definition 2.1. A predictor for t, t [0, 1] is a smooth bijective mapping Ft,t : X 7 X, such that det Ft,t = 0, and the predicted distribution is the pushforward of pt by Ft,t denoted F t,t pt. Definition 2.2. A corrector for t [0, 1], τ [0, ) is a one-parameter family Markov transition kernel Lt,τ : X P(X) 7 [0, 1] such that Zτ Lt,τ(z, dzτ) is the law of Langevin dynamics stationary distribution pt at time τ, initialised at z X, and running at speed v(t) > 0, Z0 = z, d Zτ = v(t) log pt(Zτ)dτ + p 2v(t)d Wτ. (5) The corrector map Lt,τ is specified through an integration time τ and time-dependent speed v(t). We assume τ is fixed and we will describe the appropriate choice of v(t) in Section 3.3. The predictorcorrector algorithm, given Xt pt, first applies the predictor Z0 = Ft,t (Xt) and then uses the corrector to drive the predicted samples towards pt , Predictor: Z0 = Ft,t (Xt) Corrector: Xt ,τ Lt ,τ(Z0, dxτ). (6) In general, Z0 will not be a sample from pt exactly because Ft,t may not be a perfect transport from pt to pt . In Section 2.2, assessing the work done by Lt ,τ to drive the Z0 towards pt will be key in deriving our cost. Our cost will then depend upon the specific choice for Ft,t . Two natural choices for Ft,t are apparent. Setting Ft,t to the identity means our hypothetical sampling algorithm reduces to the annealed Langevin algorithm for DDMs introduced by Song and Ermon (2019). The second natural choice is to set Ft,t to the integrator of the probability flow ODE (Song et al., 2021). Example 2.1 (Annealed Langevin). The predictor step is trivial when the predictor map is the identity Ft,t (x) = x. In such a case, the predicted state reduces to the initial state, Ft,t (Xt) = Xt pt. The work done by the corrector step will then be related to the full discrepancy between pt and pt because the predictor provides no help in transporting the sample. Example 2.2 (Probability Flow ODE). The predictor step is optimal when Ft,t is a transport from pt to pt . In such a case, the predicted state produces a sample from the target distribution, Ft,t (Xt) pt , and so the corrector step would have to perform no work. An optimal predictor map Ft,t can be obtained by integrating the probability flow ODE from time t, t , dt = f(t)xt 1 2g(t)2 log pt(xt). (7) Practical algorithms numerically integrate Equation (7), e.g. an Euler step with t = t t, Ft,t (x) = x + f(t)x 1 2g(t)2 log pt(x) t. (8) In such a case, the work done by the corrector depends on the error in the probability flow integrator. 2.2 The Incremental Cost of Correction We now focus on deriving a cost related to the work done by the corrector step in the predictorcorrector algorithm. Later, in Section 3, we will find the discretisation schedule that minimises the total cost. To derive the cost, we will analyse the movement of Z0 under the corrector step s dynamics Lt ,τ(Z0, dxτ). This requires some care because even if Z0 is already at stationarity, i.e. perfectly distributed according to pt , applying the Langevin correction step will still result in movement of Z0 due to the stochasticity of the update. However, the computed work done by the correction step in this case should be 0. To correctly assign the work done, we will compare two processes. The first is the trajectory of Langevin dynamics, Zτ, defined by the corrector Lt ,τ initialised at Z0 = Ft,t (Xt) targeting pt . The second is a virtual coupled Langevin dynamics Zτ initialised at Ft,t (Xt), driven by the same noise and speed but targeting the stationary distribution of the predictor F t,t pt, Z0 = Ft,t (Xt), d Zτ = v(t ) log pt (Zτ)dτ + p 2v(t )d Wτ, (9) Z0 = Ft,t (Xt), d Zτ = v(t ) log F t,t pt( Zτ)dτ + p 2v(t )d Wτ. (10) Notably, Zτ d= Xt ,τ and Zτ d= Ft,t (Xt) share the same law as the corrected sample and predicted sample respectively. Since Zτ and Zτ are coupled to have the same noise, the difference in their trajectory, Zτ Zτ, isolates the change in corrector dynamics due to discrepancy between F t,t pt and pt . If F t,t pt is very different to pt , then Zτ Zτ will be large, signifying the corrector is needing to do lots of work to push the distribution of Z towards the target pt . Conversely, if F t,t pt = pt , then Zτ Zτ = 0 and no work is done. For small τ, (Zτ Z0)/τ is the initial velocity of Z under the pt corrector dynamics, and similarly for ( Zτ Z0)/τ. We can then define the incremental cost L(t, t ) by taking limits as τ 0+, measuring the expected L2 norm of the difference, L(t, t ) = lim τ 0+ τ 2 E (Zτ Z0) ( Zτ Z0) 2 = lim τ 0+ τ 2 E Zτ Zτ 2 . (11) We can approximate Zτ Zτ using an Euler step, noting that the coupled noise terms cancel, Zτ Zτ = τv(t )( log pt (Zt,t ) log F t,t pt(Zt,t )) + o(τ). (12) By substituting Equation (12) in Equation (11), we have L(t, t ) = v(t )2E log pt (Z0) log F t,t pt(Z0) 2 = v(t )2D(pt F t,t pt), (13) where D(p q) = EX q[ log p(X) log q(X) 2] is a statistical divergence on p, q P(X), measuring the L2 distance between the scores of q and p with respect q. D(p q) is referred to as the Stein divergence or the Fisher divergence; see e.g. (Johnson, 2004). For a given choice of v(t) and Ft,t we now have a cost measuring the change from pt to pt . This cost is intrinsically linked with the effort performed by a DDM sampling algorithm because it is derived through considering the work done by a hypothetical predictor-corrector style update. We note, however, that this general cost can be used to obtain discretisation schedules for use in any style of DDM sampler. 2.3 Corrector and Predictor Optimised Cost By inverting Z0 = Ft,t (Xt), we can express Equation (13) in terms of an expectation with respect to the reference sample Xt pt, and the score of Gt,t : X 7 R+, the incremental weight function associated with the transport Ft,t from the Sequential Monte Carlo literature (Arbel et al., 2021), L(t, t ) = v(t )2E h log Gt,t (Xt) 2i , Gt,t (x) = pt (Ft,t (x)) pt(x) |det Ft,t (x)| . (14) In most cases, it is infeasible to efficiently compute the Jacobian correction in Equation (14). When Ft,t (x) = x is the identity map corresponding to the corrector optimised update from Example 2.1 Equation (14) reduces a rescaled Stein discrepancy between pt and pt , and Gt,t(x) = pt (x)/pt(x) reduces to the likelihood-ratio between pt and pt. We will refer to this case as the corrector-optimised cost denoted Lc(t, t ), to distinguished it from the predictor-optimised cost Lp(t, t ) derived above, where when relevant, we will use subscripts c and p to distinguish between the two: Lc(t, t ) = v(t )2D(pt pt), Lp(t, t ) = v(t )2D(pt F t,t pt). (15) The corrector-optimised cost Lc(t, t ) provides meaningful information during the update from reference pt to the target pt . It is worth computing even when the predictor-optimised cost Lp(t, t ) is accessible. Lc(t, t ) measures the change between the reference and target distribution independent of the predictor, whereas Lp(t, t ) measures the residual error between the predictor and target. Notably, Lc(t, t ) encodes information about the incremental geometry of the diffusion path, whereas Lp(t, t ) quantifies information about the incremental efficiency of the predictor. Generally, one does not dominate the other, but if the predictor is well-tuned and the predictor flows samples Xt pt towards pt , we would expect Lp(t, t ) Lc(t, t ). For deriving our optimal discretisation schedule, we require a notion of how L(t, t ) increases with small increases in t i.e. knowing local changes in incremental cost. In Section 3, we use this local cost to assign distances to schedules through time, enabling us to find the best schedule. We derive the desired local cost in Theorem 2.1, see Appendix A for a PDE and geometric interpretation. Theorem 2.1. Suppose pt(x), Ft,t (x), v(t) and Gt,t (x) are three-times continuously differentiable in t, t , x and let Ft(x) = t Ft,t (x) t =t and Gt(x) = t Gt,t (x) t =t. Suppose the following hold: (1) for all x X, t [0, 1], Ft,t(x) = x and (2) there exists V : X 7 R such that for all x X and t [0, 1], Gt(x) 2 V (x) and supt [0,1] EXt pt[V (Xt)] < . Then for all t [0, 1], we have L(t, t ) = δ(t) t2 + O( t3), where δ(t) = v(t)2EXt pt Gt(Xt) 2 , Gt = t log pt + log pt Ft + Tr Ft. (16) Theorem 2.1 shows that, under regularity assumptions, then the incremental cost is L(t, t ) δ(t) t2 is locally quadratic and controlled by the local cost δ(t). The δ(t) measures the sensitivity of the incremental cost L(t, t ) to moving samples along the diffusion path to t t. Notably, δ(t) = 0 if and only if the predictor satisfies the continuity equation, tpt + (pt Ft) = 0. 3 Score-Optimal Schedules Given a discretisation schedule T = (ti)T i=0 satisfying 0 = t0 < < t T = 1, our hypothetical predictor-corrector algorithm recursively uses the predictor and corrector maps to generate a sequence (Xi)T i=0 starting at XT p1 such that the terminal state X0 approximates samples from p0, Xi Lti,τ(Fti+1,ti(Xi+1), dxi). (17) We want to identify a discretisation schedule that maximises the efficiency of this iterative procedure. This is not generally possible due to the potential complex interactions that arise from the accrued errors. To simplify our analysis, we make the following assumption. Assumption 3.1. For all t, t , if Xt pt and Xt ,τ = Lt ,τ(Ft,t (Xt), dxτ), then Xt ,τ pt . Assumption 3.1 is reasonable if, in our hypothetical corrector steps, τ is set sufficiently large such that the Langevin correction converges to stationarity. We find in our experiments that even if the schedules derived under Assumption 3.1 are used in sampling algorithms for which Assumption 3.1 does not hold, we still obtain high quality samples. Equipped with Assumption 3.1, we can measure the efficiency of the path update through total accumulated cost L = PT i=1 L(ti+1, ti), which we will use as our objective to optimise T . In this section, we will identify the optimal schedule T minimising the cost L by considering an infinitely dense limit. We will then provide a tuning procedure amenable to online schedule optimisation during training. Finally, we will discuss a suitable choice for v(t), the velocity of our hypothetical corrector steps, as well as related work. 3.1 Diffusion Schedule Path Length and Energy Let φ : [0, 1] 7 [0, 1] be a strictly increasing, differentiable function such that φ(0) = 0 and φ(1) = 1. We will say T is generated by φ if ti = φ(i/T) for all i = 0, . . . , T. The schedule generator φ dictates how fast our samples move through their diffusion path. Since every schedule T of size T is generated by some φ, optimising T is equivalent to finding a generator φ minimising L(φ, T), the total cost accumulated by the schedule of size T generated by φ. By Jensen s inequality, we have L(φ, T) Λ(φ, T)2/T, where for ti = φ(i/T), i=1 L(ti+1, ti), Λ(φ, T) = L(ti+1, ti). (18) As we later prove in Theorem 3.1, in the dense schedule limit as T , the cost L(φ, T) and its lower bound Λ(φ, T) are controlled by the energy E(φ) and length Λ respectively where, 0 δ(φ(s)) φ(s)2ds, Λ = Z 1 δ(t)dt. (19) The intuition for why E(φ) is an energy, and Λ a length can be gained by first conceptualising the diffusion time t as a spatial variable rescaled by the metric δ(t) defined by our cost L. We have φ and φ are position and velocity, respectively. Integrating the speed R 1 0 p δ(φ(s)) φ(s)ds = R 1 0 p δ(t)dt along a curve φ(s) obtains the length Λ, whilst integrating a speed squared, R 1 0 δ(φ(s)) φ(s)2ds obtains a kinetic energy E(φ). Note that the length is an invariant of the schedule, whereas the kinetic energy is not. The length Λ measures the intrinsic difficulty of traversing the diffusion path according to the cost independent of φ, whereas E(φ) measures the efficiency of how the path was traversed using φ. This geometric intuition hints at the solution to the optimal scheduling problem. The optimal φ should travel on a geodesic path from p1 to p0, at a constant speed with respect to metric δ. For this optimal φ, we then have the kinetic energy being equal to the square of length between p1 and p0. Theorem 3.1 makes the previous discussion precise. Theorem 3.1. Suppose the assumptions of Theorem 2.1 hold. For all schedule generators φ, lim T TL(φ, T) = E(φ), lim T Λ(φ, T) = Λ. (20) Moreover, E(φ) Λ2, with equality if and only if φ satisfies, φ (s) = Λ 1(Λs), Λ(t) = Z t δ(u)du. (21) Notably independent of the choice of φ, as T , the cost L(φ, T) E(φ)/T. This implies that the cost decays to zero at a linear rate, proportional to E(φ) and L(φ, T) Λ2/T independent of φ. Equation (21) provides an explicit formula for the optimal schedule generator that minimises the dense limit of the total cost and obtains the lower bound E(φ ) = Λ2. The intuition for the formula φ (s) = Λ 1(Λs) is that this implies Λ(φ (s)) = Λs meaning say 10% of the way through the optimal schedule, we should have traversed 10% of the way along the distance between p1 and p0 i.e. 0.1 Λ. This relation holds for constant speed straight lines, meaning φ is the optimal schedule. For a finite T, Theorem 2.1 implies the optimal schedule T = {t i }T i=0 generated by φ ensures the incremental cost is constant L(t i+1, t i ) Λ2/T 2 for all i = 0, . . . , T 1. Our geometric intuition in the language of differential geometry is that the diffusion path M = {pt}t [0,1] is Riemannian manifold with metric δ endowed by the incremental cost L(t, t ). The schedule generator defines a curve s 7 pφ(s) M reparametrising the diffusion path between p0 and p1. Theorem 3.1 shows that φ is the geodesic of length Λ in M between p1 and p0 that traverses the diffusion path at a constant speed p δ(φ (s)) φ (s) = Λ with respect to δ and minimises the cost. 3.2 Estimation of Score-Optimal Schedules Given a schedule T = {ti}T i=0 and estimates of the incremental cost L(ti+1, ti), Algorithm 1 adapts Algorithm 3 from Syed et al. (2021) to estimate the optimal schedule T = {t i }T i=0 generated by φ . We can use Algorithm 1 to refine the schedule for a pre-trained DDM or learn the schedule jointly with the score function. For this joint procedure, we detail in Appendix B.1 how function evaluations can be reused to estimate the cost to minimise computational overhead. For Lc(t, t ) we need only evaluate log pt(Xt) and log pt (Xt) both available through our model s score estimate. Computing Lp(t, t ) is more challenging since there are Hessian terms that arise in Equation (14). Under the assumption that the step size t > 0 is sufficiently small, we can approximate log | det Ft,t (Xt)| through Proposition B.1. This approximation only requires us to compute the gradient trace of the Jacobian of our predicted score. With computational cost proportionate to the computational effort for computing the first derivative. Using a Hutchinson trace (Hutchinson, 1989) like estimator in Proposition B.1, we compute this quantity memory-efficiently in high dimensions, requiring only standard auto-differentiation back-propagation. 3.3 Choice of Velocity Scaling Recall that our cost is derived by considering a Langevin dynamics step with velocity v(t). This velocity should be selected so that Langevin dynamics explores the same proportion of our distribution at varying times throughout our diffusion path. Thus, v(t) should be on the same scale as the spread of the target, pt. Commonly used noising schedules have s(t) 1, and our data distribution is normalised so the scale of pt is on the order of σ(t). We therefore set v(t) = σ(t). This results in a σ(t )-weighted divergence for our incremental cost L(t, t ) = σ(t )2D(pt ||pt). This can be compared to the weighted denoising score matching loss used to train DDMs (Song et al., 2021), which is also a squared norm of score differences: λ(t)EX0,Xt sθ(Xt, t) log pt|0(Xt|X0) 2 for some weighting function λ(t) chosen to equalise the magnitude of the cost over the path. In Song et al. (2021), λ(t) 1/E[ log pt|0(Xt|X0) 2] was chosen, which, as we show in Appendix B.3, is λ(t) σ2(t). This choice of velocity scaling provides an alternative perspective on this commonly used weighting of squared norms of score differences. Algorithm 1 Update Schedule Require: Schedule T = {ti}T i=0, incremental costs {L(ti+1, ti)}T 1 i=0 1: ˆΛ(ti) = Pi 1 j=0 p L(tj+1, tj), i = 0, . . . T Equation (21); 2: ˆΛ = ˆΛ(t T ) Λ in Equation (19) 3: ˆΛ 1( ) = Interpolate({(ˆΛ(t0), t0), . . . , (ˆΛ(t T ), t T )}); E.g. Fritsch and Carlson (1980) 4: t i = ˆΛ 1(ˆΛ i T ), i = 0, . . . , T Equation (21) 5: Return: T = {t i }T i=0 3.4 Related Work Previous works have devised algorithms and heuristics for designing noising and discretisation schedules. The DDM training objective is invariant to the noising schedule shape, as demonstrated by Kingma et al. (2021), necessitating auxiliary costs and objectives for schedule design. Uniform steps in the signal-to-noise ratio, log (s(ti)/σ(ti)), are used by Lu et al. (2022), but this ignores the target distribution s geometry. Watson et al. (2021) optimise the schedule by differentiating through sampling to maximise quality, but GPU memory constraints necessitate gradient rematerialisation. We avoid this with a simulation-free cost. Closely related to our work is Sabour et al. (2024), who minimise a pathwise KL-divergence between discretised and continuous processes. They require multi-stage optimisation with early stopping to prevent over-optimisation of their objective which would otherwise result in worse schedules. Amongst the wider literature, various strategies for discretisation schedule tuning have been proposed. Das et al. (2023) derive an equally spaced schedule using the Fisher metric but assume Gaussian data. Santos et al. (2023) assign time points proportional to the Fisher information of pt|0(xt | x0), ignoring the true target distribution. Xue et al. (2024) derive a schedule to control ODE simulation error, but their cost depends only on the ODE solver, and not on the data distribution. 4 Computational Experiments 4.1 Sampling the Mollified Cantor Distribution The Cantor distribution (Cantor, 1884) lacks a Lebesgue density, with its cumulative distribution function represented by the Devil s staircase and its support being the Cantor set, forming a challenging 1-D test example. When mollified with Gaussian noise, it becomes absolutely continuous and possesses a Stein score. We mollify by running a diffusion with the linear schedule for time t = 10 5. With this mollification, our data density has eight pronounced peaks. We train a one-dimensional DDM for 150,000 iterations using both a fixed linear schedule and our optimisation algorithm Algorithm 2 initialised at the linear schedule. We find that the non-data-specific default schedule fails to capture these modes, whilst our adaptive method faithfully reproduces the data distribution. In Figure 7 we show the complexity of the learned score which displays a self-similar fractal structure. 4.2 Adaptive Schedule Learning for Bimodal Example We train a DDM on a simple bimodal Gaussian distribution. When the variance of the target bimodal Gaussian is low, it becomes difficult to adequately sample from the target distribution. In our instance, the standard Gaussian reference from the diffusion is given, and the target is the density p0(x) = 1 2pleft(x) + 1 2pright(x), where pleft and pright are normal distributions with means 6 and 6, respectively, and a common variance σ2 = 0.12. We learn two diffusion models, one using the linear schedule, and the other using a schedule that is learned online during training. We compute the likelihood of the samples generated from either model during training, which is possible in this example because the true probability density is known. It can be seen in Figure 2 that when the schedule is learned during training, the likelihood evaluation increases and the true score error decreases, in contrast to the linear schedule that remains constant, or worsens, in this regard during training. 0 2000 4000 6000 8000 Training Iterations Likelihood Average Comparison of Linear and Learned Schedules Linear Schedule Learned Schedule 0 2000 4000 6000 8000 Training Iterations Score Error Error Comparison of Linear and Learned Schedules Linear Schedule Error Learned Schedule Error Figure 2: Comparison of Linear and Learned Schedules over Training Iterations for the bimodal example. Each point corresponds to 500 training iterations. 4.3 Scalable Schedule Learning Diffusion Here we demonstrate that jointly learning the schedule and score using our online training methodology (Algorithm 2) scales to high-dimensional data and converges to a stable solution. We train DDMs on CIFAR-10 and MNIST initialised at the cosine schedule using the codebase from Nichol and Dhariwal (2021). In Figure 3 (left), we show the incremental costs p L(tj+1, tj) for the cosine schedule and our learned schedule, finding that the increments approximately equalise over the diffusion path as expected by the discussion in Section 3.1. Figure 3 (right) shows the learned schedule spends more time at high-frequency details, we visualise a sampling trajectory in Figure 11. Table 1: Sample quality measured by Fréchet Inception Distance (FID) versus schedule on CIFAR10 (32 32), FFHQ, AFHQv2, Image Net (64 64). Pretrained models are used from Karras et al. (2022). All FIDs are calculated using 50000 samples. We highlight the best FID in bold. The Image Net model lacks second-order differentiation, so no predictor optimised schedule is shown. Schedule CIFAR-10 FFHQ AFHQv2 Image Net Eq (22) ρ = 3 5.47 2.80 2.05 1.46 Eq (22) ρ = 7 1.96 2.46 2.05 1.42 Log Linear (Lu et al., 2022) 2.05 2.42 2.06 1.45 Convex Schedule 22.1 2.43 2.48 1.64 Corrector optimised 1.99 2.46 2.05 1.44 Predictor optimised 1.99 2.48 2.05 - Schedules from low to high FID FID CO-Cost ( 103) PO-Cost ( 103) KLUB ( 106) Eq (22) ρ = 7 1.96 9.86 2.16 1.75 CO (ours) 1.99 9.54 2.09 1.39 PO (ours) 1.99 9.61 2.09 1.39 Log Linear 2.05 10.5 2.32 1.05 Eq (22) ρ = 3 5.47 17.2 4.02 2.82 Convex Schedule 22.1 45.0 8.17 0.284 0 20 40 60 80 100 Timestep Noise Level CIFAR10 Corrector CIFAR10 Predictor Figure 4: (Left) Costs associated with different schedule choices for the CIFAR10 dataset. Schedules are ordered from lowest FID to highest FID. We compare our Corrector-optimised (CO) cost and Predictor-optimised (PO) cost versus the Kullback-Leibler Upper Bound (KLUB) from Sabour et al. (2024). The minimum value for each cost is highlighted in bold. Note low cost is associated with low FID for our cost and not for the KLUB. (Right) Visualisation of schedules during generative sampling with 100 timesteps. rho=3 and rho=7 refer to Eq 22 with ρ = 3 and ρ = 7 respectively. Log Linear from Lu et al. (2022) and a convex schedule are also shown. We show our cost optimised schedules for CIFAR10 both using the corrector optimised cost and the predictor optimised cost. 0 200 400 600 800 1000 Timestep Length Increment CIFAR Corrector Schedule lr = 0.1 CIFAR Corrector Schedule lr = 0.01 CIFAR Cosine Schedule 0 200 400 600 800 1000 Timestep CIFAR Optimal Scedule Cosine Initial Scedule Figure 3: (Left) Incremental costs p L(tj+1, tj) for the cosine schedule and our online adaptive algorithm. Higher learning rates enforce equalisation of costs more quickly. (Right) Progression of the learned schedule during 40k training iterations, depicted through the standard-deviation 1 αt. 4.4 Sampling Pre-Trained Models In this experiment we demonstrate that our algorithm can recover performant schedules for large image models used in practice and our schedules generate high quality samples. We use the pretrained models from Karras et al. (2022), whose DDM is parameterised such that the forward noising distribution is of the form pt|0(xt|x0) = N(xt; x0, σ2 t I). The scheduling problem then reduces to deciding on a stepping scheme through {σi}N i=1, σN = 0. Karras et al. (2022) suggest a polynomial based schedule with a parameter ρ that controls the curvature of the schedule 1 ρ max + i N 1 σ 1 ρ max ρ and σN = 0. (22) A lower ρ value results in steps near σmin being shortened and steps near σmax being lengthened. Through analysing the truncation error for sampling in Karras et al. (2022), they find that setting ρ = 3 approximately equalises this error, however it is found empirically that ρ = 7 results in better sample quality. We also compare against a schedule that takes uniform steps in log σ space Lu et al. (2022) which we refer to as the Log Linear schedule and a schedule that takes a convex shape in log space. Schedule visualisations are provided in Figure 4 (right). We sampled the pre-trained models using these schedules and computed the sample quality using FID. We use the same number of schedule steps (18 for CIFAR10, 40 for FFHQ and AFHQv2, 256 for Image Net) and solver (Heun second order) as Karras et al. (2022). Our results are shown in Figure 4. Our optimised schedules are able to achieve competitive FID to the best performing ρ = 7 schedule hand-tuned in Karras et al. (2022). This is expected as our schedules take a similar shape to the ρ = 7 schedule as shown in Figure 4 (right). Therefore, our method provides an entirely automatic and hyperparameter free algorithm to recover this performant schedule that was previously only discovered through trial-and-error. We further analyse how the number of discretisation points, T, used during sampling affects the quality of generated samples for different schedules. We report our results on CIFAR10 in Table 2. Notably, the FID decreases with T for all schedules and achieves comparable FID once T is large enough. However, when T is small, only the optimised schedules maintain stable performance. This empirically demonstrates an optimised schedule can improve the sampling efficiency by allowing for coarser discretisations and, hence, faster sampling, as predicted by Theorem 3.1. We observe an identical trend for s FID in Table 3 in the Appendix C.2. # points, T 10 20 30 50 100 CO (ours) 2.46 2.02 2.04 2.06 2.07 ρ = 3 50.75 3.92 2.09 2.01 2.05 ρ = 7 2.70 2.00 2.06 2.05 2.07 ρ = 100 3.09 2.06 2.05 2.06 2.07 Table 2: Comparison of FID across different amounts of discretisation points for different schedules on CIFAR10. CO stands for our corrector optimised schedule. We also compare corrector optimised schedules to predictor optimised schedules in Table 1. They provide similar performance so, on image datasets, we encourage the use of the cheaper to compute corrector optimised schedule. Finally, in Figure 4 (left), we report the raw values of our corrector optimised costs and compare these costs to the values of the objective introduced in Sabour et al. (2024). Both algorithms aim to find schedules that minimise these costs and therefore it is desirable for low values of cost to be associated with good sample quality (i.e. low FID). We find that low values of our cost correlate much more closely with low FID than the objective introduced by Sabour et al. (2024). Indeed, Sabour et al. (2024) introduce a bespoke multi-stage optimisation for their cost because they found over-optimising their objective can lead to worse schedules which is explained by the objective not correlating well with FID. We further find that our predictor optimised costs are lower than the corrector optimised costs which is to be expected as the predictor reduces the work done by the corrector and thus reduces the incremental cost. The overall shape of schedule, however, between the corrector optimised and predictor optimised costs is similar. 5 Discussion We have introduced a method for selecting an optimal DDM discretisation schedule by minimising a cost linked to the work done in transporting samples along the diffusion path. Our algorithm is computationally cheap and does not require hyperparameter tuning. Our learned schedule achieves competitive FID scores. Regarding limitations, the computation of Lp can be computationally expensive due the calculation of second derivatives, however, in Section 4.4 we found Lc to provide a cheap and performant alternative. Furthermore, our theory is derived assuming perfect score estimation. Future work can expand on the geometric interpretation of the diffusion path and links to information geometry to further refine the DDM methodology. Arbel, M., Matthews, A., and Doucet, A. (2021). Annealed flow transport Monte Carlo. In International Conference on Machine Learning. Cantor, G. (1884). De la puissance des ensembles parfaits de points: Extrait d une lettre adressée à l éditeur. Acta Mathematica, 4:381 392. Reprinted in: E. Zermelo (Ed.), Gesammelte Abhandlungen Mathematischen und Philosophischen Inhalts, Springer, New York, 1980. Choi, Y., Uh, Y., Yoo, J., and Ha, J.-W. (2020). Stargan v2: Diverse image synthesis for multiple domains. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Das, A., Fotiadis, S., Batra, A., Nabiei, F., Liao, F., Vakili, S., Shiu, D.-s., and Bernacchia, A. (2023). Image generation with shortest path diffusion. In International Conference on Machine Learning. Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., and Fei-Fei, L. (2009). Imagenet: A large-scale hierarchical image database. IEEE Conference on Computer Vision and Pattern Recognition. Fritsch, F. N. and Carlson, R. E. (1980). Monotone piecewise cubic interpolation. SIAM Journal on Numerical Analysis, 17(2):238 246. Ho, J., Jain, A., and Abbeel, P. (2020). Denoising diffusion probabilistic models. In Advances in Neural Information Processing Systems. Hutchinson, M. (1989). A stochastic estimator of the trace of the influence matrix for Laplacian smoothing splines. Communications in Statistics - Simulation and Computation, 18(3):1059 1076. Johnson, O. (2004). Information Theory and the Central Limit Theorem. World Scientific. Karras, T., Aittala, M., Aila, T., and Laine, S. (2022). Elucidating the design space of diffusion-based generative models. In Advances in Neural Information Processing Systems. Karras, T., Laine, S., and Aila, T. (2018). A style-based generator architecture for generative adversarial networks. arxiv e-prints. In Conference on Computer Vision and Pattern Recognition (CVPR). Kingma, D., Salimans, T., Poole, B., and Ho, J. (2021). Variational diffusion models. In Advances in Neural Information Processing Systems. Krizhevsky, A., Hinton, G., et al. (2009). Learning multiple layers of features from tiny images. Lu, C., Zhou, Y., Bao, F., Chen, J., Li, C., and Zhu, J. (2022). DPM-solver: A fast ODE solver for diffusion probabilistic model sampling in around 10 steps. In Advances in Neural Information Processing Systems. Nichol, A. Q. and Dhariwal, P. (2021). Improved denoising diffusion probabilistic models. In International Conference on Machine Learning. Poincaré, H. (1890). Sur les équations aux dérivées partielles de la physique mathématique. American Journal of Mathematics, pages 211 294. Sabour, A., Fidler, S., and Kreis, K. (2024). Align your steps: Optimizing sampling schedules in diffusion models. In International Conference on Machine Learning. Santos, J. E., Fox, Z. R., Lubbers, N., and Lin, Y. T. (2023). Blackout diffusion: generative diffusion models in discrete-state spaces. In International Conference on Machine Learning. Sohl-Dickstein, J., Weiss, E. A., Maheswaranathan, N., and Ganguli, S. (2015). Deep unsupervised learning using nonequilibrium thermodynamics. In International Conference on Machine Learning. Song, Y. and Ermon, S. (2019). Generative modeling by estimating gradients of the data distribution. In Advances in Neural Information Processing Systems. Song, Y., Sohl-Dickstein, J., Kingma, D. P., Kumar, A., Ermon, S., and Poole, B. (2021). Score-based generative modeling through stochastic differential equations. In International Conference on Learning Representations. Syed, S., Bouchard-Côté, A., Deligiannidis, G., and Doucet, A. (2021). Non-reversible parallel tempering: a scalable highly parallel MCMC scheme. Journal of the Royal Statistical Society (Series B), 84:321 350. Watson, D., Ho, J., Norouzi, M., and Chan, W. (2021). Learning to efficiently sample from diffusion probabilistic models. ar Xiv preprint ar Xiv:2106.03802. Xue, S., Liu, Z., Chen, F., Zhang, S., Hu, T., Xie, E., and Li, Z. (2024). Accelerating diffusion sampling with optimized time steps. ar Xiv preprint ar Xiv:2402.17376. A Analysis of incremental cost A.1 Proof of Theorem 2.1 Proof. We first note that by the mean value theorem, there exists s(t, t ) [t, t ] L(t, t ) |t t|2 = v(t )2EXt pt " Gs(t,t )(Xt) = v(t )2EXt pt Gs(t,t )(Xt) 2 (24) Since all t, t we have Gs(t, t )(x) 2 V (x), and EXt pt[V (Xt)] < , the dominated convergence theorem and the continuity of v(t ) implies, lim t 7 t L(t, t ) |t t|2 = v(t)2EXt pt lim t 7 t Gs(t,t )(Xt) 2 (25) = v(t)2EXt pt Gs(t,t )(Xt) 2 . (26) The last equality follows since Gt is continuous in t. We will now compute Gt. Since Ft,t(x) = x, we have Gt,t(x) = 1 for all x. This implies, we can express Gt in terms of the derivative of log Gt,t , Gt(x) = t log Gt,t (x) t =t , (27) where log Gt,t equals, log Gt,t (x) = log pt (Ft,t (x)) pt(x) + log det Ft,t (x). (28) By combining Equation (27) to Equation (28), we obtain, Gt(x) = t log pt (Ft,t (x)) t =t + t log det Ft,t (x) t =t . (29) For the first term in Equation (29), we use chain rule to obtain, t log pt (Ft,t (x)) = 1 pt (Ft,t (x)) t (Ft,t (x)) + pt (Ft,t (x)) Ft,t t (x) , (30) where denotes a dot product of vectors. By evaluating at t = t, we have t log pt (Ft,t (x)) t =t = 1 pt(x) pt t (x) + 1 pt(x) pt(x) Ft(x) (31) t log pt(x) + log pt(x) Ft(x). (32) For the second term in Equation (29), we note that as t = t t 0 Ft,t (x) = x + Ft(x) t + O( t2). (33) This implies that the Jacobian determinant admits the following asymptotic expansion, log det Ft,t = log det(I + Ft t + o( t)) (34) = log(1 + Tr Ft t + o( t)) (35) = Tr Ft t + O( t2). (36) Consequentially we have, t log det Ft,t (x) t =t = Tr Ft(x). (37) By substituting in Equations (32) and (37) into Equation (29) we obtain, t log pt(x) + log pt(x) Ft(x) + Tr Ft(x). (38) A.2 Proof of Theorem 3.1 Proof. Let si = i/T and ti = φ(si), Theorem 2.1 implies L(ti+1, ti) = δ(ti+1) t2 i + o( 2 T ), (39) p L(ti+1, ti) = p δ(ti+1) ti + o( T ), (40) where T = maxi | ti|. By the the mean value theorem, T sups [0,1] φ(s)/T and hence is O(T 1) as T . We will first establish the convergence of Λ(φ, T). Using Equation (40) we obtain the following estimate for TL(φ, T), L(ti+1, ti) = δ(ti+1) ti + o(1). (41) In the limit as T , this Riemann sum converges to Λ, lim T Λ(φ, T) = Z 1 δ(t)dt. (42) We will now obtain the limit of TL(φ, T). First denote si = i/T and si = 1/T. Using the differentiability of φ we have, ti = φ(si+1) φ(si) = φ(si+1) T + o T 1 . (43) Substituting Equation (46) into Equation (39), we obtain the following estimate for TL(φ, T), TL(φ, T) = T i=0 L(ti+1, ti) (44) i=0 δ(ti+1) t2 i + o(T 1) (45) i=0 δ(φ(si+1)) φ(si+1)2 si + o(1). (46) In the limit as t , this converges to the integral for E(φ), TL(φ, T) = Z 1 0 δ(φ(s)) φ(s)2ds = E(φ). (47) Combining Jensen s inequality with the fact that φ is increasing implies, 0 δ(φ(s)) φ(s)2ds Z 1 λ(φ(s)) φ(s)ds 2 = Z 1 δ(t)dt 2 = Λ2. (48) The last equality follows by substituting t = φ(s). Note a schedule generator φ obtains the Jensen lower bound if only if there is a C such that for all s [0, 1], C = δ(φ (s)) φ (s)2. (49) By taking square roots and integrating from 0 to s, δ(φ (s )) φ (s )ds = Z φ (s) δ(t)dt = Λ(φ (s)). (50) By using the substitution in s = 1, along with the constraints Λ(1) = Λ and φ (1) = 1 we obtain C = Λ(φ (1)) = Λ(1) = Λ. (51) Finally, by inverting Equation (50), we conclude our proof, φ (s) = Λ 1( Cs) = Λ 1(Λs). (52) A.3 Comparison to Fisher Information When Ft,t (x) = x, the quantity Gt,t (x) = pt (x)/pt(x) reduces to the Radon Nykodym derivative between pt and pt, and Gt(x) reduce to the Fisher score function, t log pt(x). By Theorem 2.1 the corrector optimised cost satisfies, Lc(t, t ) = δc(t) t2 + o( t), where δc(t) = v(t)2EXt pt G2 t 2 = v(t)2EXt pt Suppose pt is sufficiently regular, and the Poincaré inequality (Poincaré, 1890) holds. Since the expectation of the Fisher score with respect to pt is zero, we have, δc(t) C(t)v(t)EXt pt = C(t)v(t)δF (t), (54) where C(t) > 0 is some constant. Here δF (t) = Var Xt pt t log pt is the Fisher information for the diffusion path pt. Suppose we view the diffusion path as a curve in the probability distribution space with metric δc(t). Equation (54) shows that the topology induced by δc(t) is stronger than the Fisher information, and geometry is more regular. B Training Algorithms B.1 Adaptive training We may incorporate Algorithm 1 into an online algorithm used during training. For a fixed score function along the diffusion path, there is an optimal schedule. Similarly, for a fixed schedule, there is an optimal score function that can be learnt from the data. To incorporate these two steps, we propose a two-step algorithm for online training. First, for a fixed schedule, we optimise the score function. Then, using this estimated score function, we compute the optimal schedule through Algorithm 1. To add regularity throughout training, as our score predictions are over batches rather than the entire dataset, we do not replace the current schedule with our computed optimal one. Instead, we take a weighted combination of the current schedule with the computed optimal one. The weighting factor γ (0, 1) is akin to a learning rate for the schedule optimisation. If γ is set too high, our schedule learning may be overly influenced by the current batch, which could negatively affect the score training performance. Algorithm 2 Adaptive Schedule Training Require: Initial schedule T = {ti}T i=0, learning rate γ (0, 1), score estimate sθ 1: while not converged do 2: for each batch B from data do 3: Fix T and assign θ argminθLtraining(θ, B, T ) 4: Fix sθ and over batch estimate L(ti+1, ti), i = 0, . . . , T 1 (Equation (13)) 5: Assign T Update Schedule(T , L(ti+1, ti)) 6: Update time locations ti γt i + (1 γ)ti 7: end for 8: end while B.2 Estimating predictor optimised cost Proposition B.1. Let Ft,t be the predictor map given by the forward Euler discretisation (8) of the probability flow ODE. For N N and let ˆJt,N(x) to be the Jacobian of the Hutchinson trace estimator (Hutchinson, 1989) for ( log pt(x))) at x X and t [0, 1], ˆJt,N(x) = 1 n=1 (v T n Jt(x)vn), vn N(0, I). (55) If t is small enough such that, t Tr f(t)I 1 2g(t)2 2 log pt(x) < 1. (56) Then, as N the following limit exists almost surely, log det Ft,t (x) = t 2 g(t)2 lim N ˆJt,N + O( t2). (57) Proof. The probability flow ODE update is given through Ft,t (x) = x + t f(x)x 1 2g(t)2 log pt(x) . (58) By taking the gradient, we have, Ft,t (x) = I + t f(t)I 1 2g(t) 2 log pt(x) . (59) Since log det(1 + t) = t Tr A + O( t2) that holds when Tr(A) < 1, our assumption in Equation (56) hold, we can apply a Taylor expansion log det Ft,t (x) = t Tr f(t)I 1 2g(t) 2 log pt(x) + O( t2). (60) By taking a gradient, we have log det Ft,t (x) = t 2 g(t)2 Tr 2 log pt(X) + O( t2) (61) = E(vv T ) t 2 g(t)2 Tr 2 log pt(X) + O( t2) (62) 2 g(t)2E Tr v T 2 log pt(X)v + O( t2). (63) The last line used the fact that the Hessian of log pt is equivalently the Jacobian ( log pt), the latter we can approximate unbiasedly using the Hutchinson trace estimator to gain ˆJt,N(x). B.3 Denoising Score Matching Weighting The standard denoising score matching weighting used by Song et al. (2021) is λ(t) 1/E[|| log pt|0(Xt|X0)||2]. For pt|0(xt|x0) = N(xt; s(t)x0, σ2(t)I) we have, log pt|0(xt|x0) = 1 2 ||xt s(t)x0||2 2 log 2πσ(t)2 . (64) xt log pt|0(xt|x0) = s(t)x0 xt The expectation for E[|| log pt|0(Xt|X0)||2] is taken with respect to Xt pt|0. We therefore have, E[|| log pt|0(Xt|X0)||2] = Ept|0(Xt|X0) " s(t)X0 Xt = EN(ϵ;0,I) " s(t)X0 s(t)X0 σ(t)ϵ = EN(ϵ;0,I) = 1 σ(t)2 EN(ϵ;0,I) ϵ 2 (69) = d σ(t)2 . (70) where on the second line we have used the reparameterisation trick. Therefore, we have that the standard weighting for denoising score matching is λ(t) σ(t)2. 0.0 0.2 0.4 0.6 0.8 1.0 time Schedule ( t) Linear Schedule Corrector Schedule Predictor Schedule 6 4 2 0 2 4 6 x Corrector Schedule Predictor Schedule Linear Schedule True Density 6 4 2 0 2 4 6 x Corrector Schedule Predictor Schedule Linear Schedule Figure 5: Schedules and density estimates for: linear (blue); Stein score optimised (green); and predictor optimised (red) schedules. The predictor optimised schedule identifies a bump along the diffusion path where the reference Gaussian density splits into two modes. In the regions where the score is evaluated (around 6), our trained score is accurate compared to the linear schedule score, which fails to match the slope of the true score, resulting in a wider variance density estimate. C Experiment Details C.1 1D Density Estimation For all 1D experiments, we train a one spatial dimension model with continuous time encoding via Gaussian Fourier features to embed time values into a higher-dimensional space. The model architecture includes a hidden dimension of 128, five layers, an embedding dimension of 12, and one residual time step. It features a combination of residual blocks, both incorporating linear layers with GELU activation and Layer Norm, tailored to integrate time embeddings. C.1.1 Bimodal Example We train our model in the form of f(t) = βt/2 and g(t) = βt using Algorithm 2 with γ = 0.1. We use the weight function v2 tn = (1 αtn), where αtn = 1 βtn and αtn = Qn i=1 αti. We train our model for 5 thousand iterations using both a fixed linear schedule and our optimisation algorithm initialised at the linear schedule. Due to the non-linear dependence during training of the transport-optimised schedule, initialisation in this context plays an important role. We initialise our predictor-optimised schedule with the optimal schedule generated with respect to Lc without the Jacobian term as a first approximation. We then add the Jacobian term, optimise the schedule, and train for an additional 5 thousand iterations. In Figure 5 we see that the linear schedule forms an estimate with greater variance than the true data and the predicted densities of the optimised schedules. 0.4 0.2 0.0 0.2 0.4 x Corrector Optimised Score 0.0 0.2 0.4 0.6 0.8 1.0 Time Schedule ( t) Linear Schedule Corrector Schedule Predictor Schedule 250 Predictor Optimised Score 0.4 0.2 0.0 0.2 0.4 x Linear Schedule Score Figure 6: Comparison of sampling from a mollified Cantor distribution using DDMs with two different schedules: linear (blue) and optimised (green). The optimised schedule enables the DDM to capture eight distinct data modes centered on the mollified Cantor distribution (grey), whereas the linear schedule does not have clear mode separation. The optimised schedule diffusion accurately predicts the score for the mollified Cantor distribution, being near vertical lines interweaving the Cantor set, where the linear schedule fails to adequately approximate the score. C.1.2 Mollified Cantor Distribution We train our model with f(t) = βt/2 and g(t) = βt using the online Algorithm 2 with γ = 0.01. The weight function is v2 tn = (1 αtn), where αtn = 1 βtn and αtn = Qn i=1 αti. To capture high-frequency details, we train the one-dimensional model for 150,000 iterations using both a fixed linear schedule and our optimisation algorithm initialised at the linear schedule. The difference in sample quality is evident in Figure 1. C.2 Pretrained Image Models For sampling pretrained image models, we use the networks and implementation from Karras et al. (2022), https://github.com/NVlabs/edm. We compute the FID using the provided FID script within the codebase with the standard 50, 000 samples using the same fixed seeds 0 49999 for all schedules. For each image dataset, we use the default sampling strategy included in the codebase by Karras et al. (2022). For unconditional CIFAR10 this is a deterministic 2nd order Heun solver with 18 timesteps, therefore a total of 35 NFE (number of function evaluations) for the underlying denoising network. For both unconditional FFHQ and unconditional AFHQv2 the same deterministic 2nd order Heun solver is used with 40 timesteps (NFE = 79). For class conditional Image Net, the bespoke stochastic solver from Karras et al. (2022) is used with 256 timesteps (NFE=511). The stochasticity settings are left at their default values for this dataset of Schurn = 40, Smin = 0.05, Smax = 50, Snoise = 1.003. To compute the corrector optimised schedule for each dataset, we use Algorithm 1. We use 100 data samples for each dataset when computing Lc as we find the variation in learned schedule is small between different samples from the dataset. Our initial discretisation schedule used to calculate Λ(t) is Log Linear with 100 steps. We then fit a monotonic spline to the cumulative estimated Λ(t) and invert this function to find φ function from which we can derive our schedules. This takes on the order of 5 minutes to find the Corrector Optimised Schedule for CIFAR10 on a single RTX 2080Ti GPU. For predictor optimised schedules we repeat the same procedure however also include estimation of the Hessian term. We use 5 samples of v for each image datapoint when using Hutchinson s trace estimator. It takes 5 GPU hours to compute the Predictor Optimised Schedule for CIFAR10 due to the extra computational cost of computing the second derivatives. We compare the corrector optimised and predictor optimised schedules for the 4 image datasets in Figure 8. We find that all of the schedules have the same general shape with increasing step sizes in log space as the generative process approaches clean data. However, the curvature of the schedule varies between datasets which is to be expected as the schedule is determined through the score which will vary depending on the data distribution. We find that, in general, the higher resolution datasets (FFHQ, AFHQv2 and Image Net) favour shorter steps nearer the start of the generative process at the expense of larger steps at low noise levels near the end of the generative process. 0.50 0.25 0.00 0.25 0.50 0.50 0.25 0.00 0.25 0.50 300 0.50 0.25 0.00 0.25 0.50 0.50 0.25 0.00 0.25 0.50 0.50 0.25 0.00 0.25 0.50 0.50 0.25 0.00 0.25 0.50 0.50 0.25 0.00 0.25 0.50 0.50 0.25 0.00 0.25 0.50 0.50 0.25 0.00 0.25 0.50 0.50 0.25 0.00 0.25 0.50 0.50 0.25 0.00 0.25 0.50 0.50 0.25 0.00 0.25 0.50 0.50 0.25 0.00 0.25 0.50 0.50 0.25 0.00 0.25 0.50 0.50 0.25 0.00 0.25 0.50 15 0.50 0.25 0.00 0.25 0.50 0.50 0.25 0.00 0.25 0.50 0.50 0.25 0.00 0.25 0.50 7.5 0.50 0.25 0.00 0.25 0.50 6 0.50 0.25 0.00 0.25 0.50 Figure 7: Evolution of the estimated score for the mollified Cantor distribution Section 4.1 with a Corrector Optimised Schedule. In this case the linear schedule fails to evenly progress the progression of the score, see Figure 6 showing the terminal score estimate in this case. The estimated score exhibits a self-similar nature of interweaving roots around the centers of mass of the mollified Cantor distribution. Identification of these roots amounts to estimated modes in our density estimate, see Figure 1. 0 20 40 60 80 100 Timestep Noise Level CIFAR10 Corrector CIFAR10 Predictor FFHQ Corrector FFHQ Predictor AFHQv2 Corrector AFHQv2 Predictor Image Net Corrector Figure 8: Corrector optimised and predictor optimised schedules for the 4 image datasets, CIFAR10, FFHQ, AFHQv2 and Image Net. 10 2 10 1 100 101 102 Noise Level CIFAR10 Corrector CIFAR10 Predictor FFHQ Corrector FFHQ Predictor AFHQv2 Corrector AFHQv2 Predictor Image Net Corrector Figure 9: Local cost p δ(σ) versus σ for 4 image datasets and using the corrector optimised versus predictor optimised costs. We analyse the local cost as a function of noise level p δ(σ) for the 4 images datasets within Figure 9. This local cost is used as the metric when determining velocities in the space of schedules from p1 to p0. # points, T 10 20 30 50 100 CO (ours) 3.94 3.78 3.80 3.81 3.81 ρ = 3 24.08 4.90 3.80 3.77 3.80 ρ = 7 4.02 3.76 3.78 3.80 3.81 ρ = 100 4.31 3.81 3.81 3.81 3.81 Table 3: Comparison of s FID across different amounts of discretisation points for different schedules on CIFAR10. CO stands for our corrector optimised schedule. C.3 Online Schedule Optimisation of Images MNIST: For the MNIST experiments, we trained a model with an image size of 32, 32 channels, with a U-Net architecture, with 1 residual block per U-Net resolution, without learning sigma, and 0 10000 20000 30000 40000 50000 60000 Iteration Schedule lr=0.05 Cosine Schedule 0 10000 20000 30000 40000 50000 60000 Iteration Schedule lr=0.05 Cosine Schedule 0 100 200 300 400 500 Timestep Schedule lr=0.05 Cosine Schedule Figure 10: Progression of the length and energy Equation (18) over training of MNIST. Both models are trained from initialisation, one with adaptive schedule learning (red) and one without (blue). We can see that the energy and length quantities increase during training. Recall that for a fixed path of scores t 7 log pt that the length Λ is constant. As we are learning the score, this value is not constant during training. Interestingly, by optimising the schedule during training we observe a larger length value, possibly indicating that the diffusion path learned with the optimised schedule differs greatly from the path learned without. Figure 11: Sample progression of MNIST digits for the standard cosine schedule with ϵ = 0.008 (top) against our optimised schedule (bottom). As we can see, the cosine schedule spends more time near the Gaussian reference distribution whereas the optimised schedule quickly determines large scale features and spends more time toward the data distribution. 0 10000 20000 30000 40000 50000 Iteration Corrector Optimised loss Schedule lr=0.1 Schedule lr=0.01 Cosine Schedule 0 10000 20000 30000 40000 50000 Iteration Schedule lr=0.1 Schedule lr=0.01 Cosine Schedule Figure 12: Progression of the length Λ and cost through online training for CIFAR-10. For the larger learning rate, Algorithm 2 seems to garner a larger Λ value at a faster rate that the lower schedule learning rate. For the fixed cosine schedule run, Λ is stable, perhaps because the score estimate has stabilised for the fixed cosine schedule already during the model training burn-in phase. with a dropout rate of 0.3. The diffusion process was configured with 500 diffusion steps. Training was conducted with a learning rate of 1e-4. The schedule was also initialised with the Cosine schedule and trained for 60,000 iterations on an NVIDIA 1080 Ti GPU with 12 GB RAM. The batch size for MNIST was set to 128. We trained two models: one with the schedule optimisation and one without. The schedule training rate was set to γ = 0.05, as stated in Algorithm 2. Training took approximately 12 hours for either model. CIFAR-10: For the CIFAR-10 experiments, we trained a model with an image size of 32, 128 channels, and a U-Net architecture with 3 residual blocks per multiplier resolution (described in codebase Nichol and Dhariwal (2021)), without learning sigma, and with a dropout rate of 0.3. The diffusion process was configured with 1000 diffusion steps and a cosine noise schedule. The schedule was initialised with the Cosine schedule and trained for 160,000 iterations on 4 NVIDIA A40 GPUs, each with 48 GB RAM. The batch size was set to 1,536. After training for 160,000 iterations, we trained two models online: one with the corrector schedule optimisation and one without, for an additional 50,000 iterations on a single GPU with a batch size of 384. The burn in phase for training took approximately 50 hours, with the individual schedule optimisations after this taking approximately 24 hours. C.4 Licenses Improved Denoising Diffusion Probabilistic Models Nichol and Dhariwal (2021): MIT License Elucidating the Design Space of Diffusion-Based Generative Models Karras et al. (2022): Attribution-Non Commercial-Share Alike 4.0 International CIFAR-10 Krizhevsky et al. (2009): MIT license FFHQ Karras et al. (2018): Creative Commons BY-NC-SA 4.0 license AFHQv2 Choi et al. (2020): Creative Commons BY-NC 4.0 license Image Net Deng et al. (2009): Unknown License D Acknowledgements of Funding CW acknowledges support from DST Australia. AC acknowledges support from the EPSRC CDT in Modern Statistics and Statistical Machine Learning (EP/S023151/1). SS acknowledges support from the NSERC Postdoctoral Fellow Program. Neur IPS Paper Checklist Question: Do the main claims made in the abstract and introduction accurately reflect the paper s contributions and scope? Answer: [Yes] . Justification: Our main theoretical and experimental contributions are clearly stated in the abstract and demonstrated in the paper. They reflect the paper s contributions and scope. 2. Limitations Question: Does the paper discuss the limitations of the work performed by the authors? Answer: [Yes] Justification: We discuss the limitations of our work in Section 5. 3. Theory Assumptions and Proofs Question: For each theoretical result, does the paper provide the full set of assumptions and a complete (and correct) proof? Answer: [Yes] . Justification: All the proofs are proven in the supplementary material. They are duly cross-referenced. 4. Experimental Result Reproducibility Question: Does the paper fully disclose all the information needed to reproduce the main experimental results of the paper to the extent that it affects the main claims and/or conclusions of the paper (regardless of whether the code and data are provided or not)? Answer: [Yes] Justification: We provide detailed descriptions of our experimental procedures in Appendix C and provide the code to run our experiments. 5. Open access to data and code Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material? Answer: [Yes] Justification: We provide the code necessary to run our experiments. 6. Experimental Setting/Details Question: Does the paper specify all the training and test details (e.g., data splits, hyperparameters, how they were chosen, type of optimizer, etc.) necessary to understand the results? Answer: [Yes] Justification: We provide all our experiment details in Appendix C. 7. Experiment Statistical Significance Question: Does the paper report error bars suitably and correctly defined or other appropriate information about the statistical significance of the experiments? Answer: [No] Justification: Our main metric is FID score for which it is standard practice to report it calculated on the first 50,000 images generated from the model. Standard deviations could be bootstrapped from this set but this is not standard practice. Furthermore, the computational cost to sample large image models 50,000 times multiple times is prohibitive for our academic compute cluster. 8. Experiments Compute Resources Question: For each experiment, does the paper provide sufficient information on the computer resources (type of compute workers, memory, time of execution) needed to reproduce the experiments? Answer: [Yes] Justification: We provide details of the compute resources used in Appendix C. 9. Code Of Ethics Question: Does the research conducted in the paper conform, in every respect, with the Neur IPS Code of Ethics https://neurips.cc/public/Ethics Guidelines? Answer: [Yes] . Justification: After careful review of the Neur IPS Code of Ethics, it is clear that the research presented in this paper conforms with the Code of Ethics in every respect. 10. Broader Impacts Question: Does the paper discuss both potential positive societal impacts and negative societal impacts of the work performed? Answer: [NA] . Justification: This paper is mostly theoretical and methodological. We do not see immediate societal impact of this work. However, we acknowledge that large scale implementation of our algorithm might suffer from the same societal biases as any other generative models. Indeed it could improve the quality of generative models and hence been used to generate deepfakes for disinformation. 11. Safeguards Question: Does the paper describe safeguards that have been put in place for responsible release of data or models that have a high risk for misuse (e.g., pretrained language models, image generators, or scraped datasets)? Answer: [NA] . Justification: The paper poses no such risks. 12. Licenses for existing assets Question: Are the creators or original owners of assets (e.g., code, data, models), used in the paper, properly credited and are the license and terms of use explicitly mentioned and properly respected? Answer: [Yes] Justification: We provide the references and licenses for the codebases and datasets used in our work in Appendix C.4. 13. New Assets Question: Are new assets introduced in the paper well documented and is the documentation provided alongside the assets? Answer: [NA] Justification: This paper does not introduce any new assets. Guidelines: 14. Crowdsourcing and Research with Human Subjects Question: For crowdsourcing experiments and research with human subjects, does the paper include the full text of instructions given to participants and screenshots, if applicable, as well as details about compensation (if any)? Answer: [NA] . Justification: The paper does not involve crowdsourcing nor research with human subjects. 15. Institutional Review Board (IRB) Approvals or Equivalent for Research with Human Subjects Question: Does the paper describe potential risks incurred by study participants, whether such risks were disclosed to the subjects, and whether Institutional Review Board (IRB) approvals (or an equivalent approval/review based on the requirements of your country or institution) were obtained? Answer: [NA] . Justification: The paper does not involve crowdsourcing nor research with human subjects.