# Analyzing Diffusion as Serial Reproduction

Raja Marjieh (1), Ilia Sucholutsky (2), Thomas A. Langlois (2,3), Nori Jacoby (3), Thomas L. Griffiths (1,2)

(1) Department of Psychology, Princeton University, Princeton, USA. (2) Department of Computer Science, Princeton University, Princeton, USA. (3) Max Planck Institute for Empirical Aesthetics, Frankfurt am Main, Germany. Correspondence to: Raja Marjieh.

Proceedings of the 40th International Conference on Machine Learning, Honolulu, Hawaii, USA. PMLR 202, 2023.

## Abstract

Diffusion models are a class of generative models that learn to synthesize samples by inverting a diffusion process that gradually maps data into noise. While these models have enjoyed great success recently, a full theoretical understanding of their observed properties is still lacking, in particular, their weak sensitivity to the choice of noise family and the role of adequate scheduling of noise levels for good synthesis. By identifying a correspondence between diffusion models and a well-known paradigm in cognitive science known as serial reproduction, whereby human agents iteratively observe and reproduce stimuli from memory, we show how the aforementioned properties of diffusion models can be explained as a natural consequence of this correspondence. We then complement our theoretical analysis with simulations that exhibit these key features. Our work highlights how classic paradigms in cognitive science can shed light on state-of-the-art machine learning problems.

## 1. Introduction

Diffusion models are a class of deep generative models that have enjoyed great success recently in the context of image generation (Sohl-Dickstein et al., 2015; Ho et al., 2020; Song & Ermon, 2019; Rombach et al., 2022; Ramesh et al., 2022), with some particularly impressive text-to-image applications such as DALL-E 2 (Ramesh et al., 2022) and Stable Diffusion (Rombach et al., 2022). The idea behind diffusion models is to learn a data distribution by training a model to invert a diffusion process that gradually destroys data by adding noise (Sohl-Dickstein et al., 2015). Given the trained model, sampling is then done using a sequential procedure whereby an input signal (e.g., a noisy image) is iteratively denoised at different noise levels which, in turn, are successively made finer until a sharp sample is generated. Initially, the noise family was restricted to the Gaussian class (Sohl-Dickstein et al., 2015; Song & Ermon, 2019; Ho et al., 2020) and the process was understood as a form of Langevin dynamics (Song & Ermon, 2019). However, recent work showed that this assumption can be relaxed substantially (Bansal et al., 2022; Daras et al., 2022) by training diffusion models with a wide array of degradation families. One feature of this work is that it highlights the idea that sampling (i.e., synthesis) can be thought of more generally as an alternating process between degradation and restoration operators (Bansal et al., 2022). This in turn calls into question the theoretical understanding of these models and necessitates new approaches.
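To make the alternating degradation/restoration view concrete, here is a minimal sketch of such a generic sampling loop. This is our own illustration rather than any of the cited implementations; the `degrade` and `restore` callables, the toy shrinkage "restorer", and the particular `schedule` are illustrative placeholders standing in for a real corruption operator and a trained network.

```python
import numpy as np

def generalized_sampler(restore, degrade, schedule, x_T, rng):
    """Generic diffusion-style sampling: alternate degradation and restoration.

    restore(x, level)       -- maps a degraded signal at a given noise level toward data
    degrade(x, level, rng)  -- re-applies the degradation at that noise level
    schedule                -- noise levels from coarsest to finest
    x_T                     -- initial sample drawn from the noise distribution
    """
    x = x_T
    for level in schedule:
        x_hat = degrade(x, level, rng)   # degradation step (e.g., add noise)
        x = restore(x_hat, level)        # restoration step (e.g., a trained denoiser)
    return x

# Toy usage: Gaussian degradation and a shrinkage "restorer" standing in for a learned model.
rng = np.random.default_rng(0)
degrade = lambda x, s, rng: x + s * rng.standard_normal(x.shape)
restore = lambda x, s: x / (1.0 + s ** 2)
print(generalized_sampler(restore, degrade, np.linspace(1.0, 0.1, 10),
                          x_T=rng.standard_normal(2), rng=rng))
```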
A hint at a strategy for understanding diffusion models comes from noting that the structure of the sampling procedure in these generalized models (i.e., as a cascade of noising-denoising units), as well as its robustness to the choice of noise model, bears striking resemblance to a classic paradigm in cognitive science known as serial reproduction (Bartlett & Bartlett, 1995; Xu & Griffiths, 2010; Jacoby & McDermott, 2017; Langlois et al., 2021). In a serial reproduction task, participants observe a certain stimulus, e.g., a drawing or a piece of text, and then are asked to reproduce it from memory (Figure 1A). The reproduction then gets passed on to a new participant who in turn repeats the process and so on. The idea is that as people repeatedly observe (i.e., encode) a stimulus and then reproduce (i.e., decode) it from memory, their internal biases build up so that the asymptotic dynamics of this process end up revealing their inductive biases (or prior beliefs) with regard to that stimulus domain. By modeling this process using Bayesian agents, Xu & Griffiths (2010) showed that the process can be interpreted as a Gibbs sampler and, moreover, that the stationary behavior of this process is in fact independent of the nature of the cognitive noise involved, making serial reproduction a particularly attractive tool for studying human priors (Figure 1B).

The main contribution of the present paper is to make the correspondence between diffusion and serial reproduction precise and show how the observed properties of diffusion models, namely, their robustness to the choice of noise family and the role of adequate noise level scheduling for good sampling, jointly arise as a natural consequence. Moreover, by analyzing in what precise ways these properties could break in practice we derive a new measure, which we term inversion complexity, that quantifies the smoothness of noise schedules, and we show empirically how it can track reconstruction error across different noise schedules.

The paper proceeds as follows. In Section 2 we review the mathematical formulation of serial reproduction and diffusion models and set the ground for the analysis that follows. In Section 3, we establish a correspondence between sampling in diffusion models and serial reproduction and show how it explains key properties of these models. In Section 4 we complement our theoretical analysis with simulations, and then conclude with a discussion in Section 5.

## 2. Background

### 2.1. Serial Reproduction

We begin with a brief exposition of the mathematical formulation of the serial reproduction paradigm (Xu & Griffiths, 2010; Jacoby & McDermott, 2017). A serial reproduction process is a Markov chain over a sequence of stimuli (images, sounds, text, etc.) $x_0 \rightarrow x_1 \rightarrow \dots \rightarrow x_t \rightarrow \dots$ where the dynamics are specified by the encoding-decoding cascade of a Bayesian agent with some prior $\pi(x)$ and a likelihood model $p(\hat{x}\mid x)$. The prior captures the previous experiences of the agent with the domain (i.e., their inductive bias), and the likelihood specifies how input stimuli $x$ map into noisy percepts $\hat{x}$ (e.g., due to perceptual, production or cognitive noise). Specifically, given an input stimulus $x_t$, the agent encodes $x_t$ as a noisy percept $\hat{x}_t$, and then at reproduction decodes it into a new stimulus $x_{t+1}$ by sampling from the Bayesian posterior (a phenomenon known as probability matching) (Griffiths & Kalish, 2007),

$$p(x_{t+1}\mid \hat{x}_t) = \frac{p(\hat{x}_t\mid x_{t+1})\,\pi(x_{t+1})}{\int p(\hat{x}_t\mid \tilde{x}_{t+1})\,\pi(\tilde{x}_{t+1})\,d\tilde{x}_{t+1}}. \tag{1}$$
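Because the stationary distribution of the chain defined by Equation (1) turns out to be the prior itself, the noise-independence claim is easy to verify numerically. The sketch below is our own toy illustration (not from the paper's released code): it discretizes a one-dimensional stimulus space, builds the encode-decode transition kernel implied by Equation (1) for Gaussian likelihoods of very different widths, and checks that the prior is stationary in every case.

```python
import numpy as np

K = 50                                    # discrete stimulus values 0..K-1
xs = np.arange(K)
prior = np.exp(-0.5 * ((xs - 15) / 4.0) ** 2) + 0.5 * np.exp(-0.5 * ((xs - 35) / 3.0) ** 2)
prior /= prior.sum()                      # an arbitrary bimodal prior pi(x)

def reproduction_kernel(sigma):
    """One encode-decode step of serial reproduction as a transition matrix."""
    like = np.exp(-0.5 * ((xs[:, None] - xs[None, :]) / sigma) ** 2)
    like /= like.sum(axis=0, keepdims=True)          # p(xhat | x), column-stochastic
    post = like * prior[None, :]                      # numerator of Eq. (1)
    post /= post.sum(axis=1, keepdims=True)           # p(x' | xhat), row-stochastic
    # P[x_{t+1}, x_t] = sum over xhat of p(x_{t+1} | xhat) p(xhat | x_t)
    return post.T @ like

for sigma in (1.0, 3.0, 10.0):            # very different amounts of "cognitive noise"
    P = reproduction_kernel(sigma)
    gap = np.abs(P @ prior - prior).max() # detailed balance implies P @ prior == prior
    print(f"sigma={sigma:4.1f}  max|P @ prior - prior| = {gap:.2e}")
```

For every noise width the printed gap should sit at machine precision, which is exactly the noise-insensitivity property derived next.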
The generated stimulus $x_{t+1}$ is then passed on to a new Bayesian agent with similar prior and likelihood who in turn repeats the process and so on. From here, we see that the transition kernel of the process can be derived by integrating over all the intermediate noise values against the posterior

$$p(x_{t+1}\mid x_t) = \int p(x_{t+1}\mid \hat{x}_t)\,p(\hat{x}_t\mid x_t)\,d\hat{x}_t = \int \frac{p(\hat{x}_t\mid x_{t+1})\,p(\hat{x}_t\mid x_t)}{\int p(\hat{x}_t\mid \tilde{x}_{t+1})\,\pi(\tilde{x}_{t+1})\,d\tilde{x}_{t+1}}\,\pi(x_{t+1})\,d\hat{x}_t. \tag{2}$$

Crucially, by noting that $x_{t+1}$ is a dummy integration variable, we see that the prior $\pi(x)$ satisfies the detailed-balance condition with respect to this kernel (see Appendix A):

$$p(x_{t+1}\mid x_t)\,\pi(x_t) = p(x_t\mid x_{t+1})\,\pi(x_{t+1}). \tag{3}$$

This in turn implies that the prior $\pi(x)$ is the stationary distribution of the serial reproduction process irrespective of the noise model $p(\hat{x}\mid x)$, so long as it allows for finite transition probabilities between any pair of stimuli to preserve ergodicity (Xu & Griffiths, 2010; Griffiths & Kalish, 2007). This insensitivity to noise is what makes serial reproduction a particularly attractive tool for studying inductive biases in humans (Figure 1B). It is also worth noting that another way to derive these results is by observing that the full process over stimulus-percept pairs $(x, \hat{x})$, whereby one alternates between samples from the likelihood $p(\hat{x}\mid x)$ and samples from the posterior $p(x\mid \hat{x})$, implements a Gibbs sampler for the joint distribution $p(x, \hat{x}) = p(\hat{x}\mid x)\,\pi(x)$.

Figure 1. Serial reproduction paradigm. A. Participants observe (encode) a stimulus and then try to reproduce (decode) it from memory. As the process unfolds, the generated samples gradually change until they become consistent with people's priors. Drawings are reproduced from (Bartlett & Bartlett, 1995). B. Data from a real serial reproduction task reproduced from (Langlois et al., 2021). Participants observed a red dot placed on a background image (here a triangle) and were instructed to reproduce the location of that dot from memory. The new dot location then gets passed to a new participant who in turn repeats the task. The initial uniform distribution gets transformed into a highly concentrated distribution around the triangle's corners, capturing people's visual priors. Panels show the initial seeds and the reproductions after 1, 5, 10, 15, and 20 steps.

### 2.2. Diffusion Models

We next review the basics of diffusion models and set the ground for the analysis in the next section. Following Sohl-Dickstein et al. (2015) and Ho et al. (2020), a diffusion model is a generative model that learns to sample data out of noise by inverting some specified forward process $q(x_0, \dots, x_T)$ that gradually maps data $x_0 \sim q_d(x_0)$ to noise $x_T \sim q_n(x_T)$, where $q_d(x)$ and $q_n(x)$ are given data and noise distributions, respectively. Such a forward process can be implemented as a Markov chain

$$q(x_0, \dots, x_T) = q_d(x_0)\prod_{t=1}^{T} q(x_t\mid x_{t-1}) \tag{4}$$

with some pre-specified transition probabilities $q(x_t\mid x_{t-1}) = T_{q_n}(x_t\mid x_{t-1}; \beta_t)$, where $T_{q_n}$ is a noise (diffusion) kernel and $\beta_t$ is a diffusion parameter, for which the noise distribution $q_n$ is stationary, i.e., $\int T_{q_n}(y\mid x)\,q_n(x)\,dx = q_n(y)$. This ensures that for a sufficiently large time $t = T$ we are guaranteed to transform $q_d(x)$ into $q_n(x)$. A common explicit example of this is a Gaussian kernel $T_{q_n}(x_t\mid x_{t-1}; \beta_t) = \mathcal{N}(x_t; \sqrt{1-\beta_t}\,x_{t-1}, \beta_t I)$, where $I$ is the identity matrix (Ho et al., 2020); however, we will not assume that. The inversion is then done by solving a variational problem with respect to a trainable reverse process $p_\theta(x_0, \dots, x_T)$ which itself is assumed to be Markov,

$$p_\theta(x_0, \dots, x_T) = p(x_T)\prod_{t=1}^{T} p_\theta(x_{t-1}\mid x_t) \tag{5}$$

where $p(x_T) = q_n(x_T)$, that is, the reverse process starts from noise and iteratively builds its way back to data. Since we are interested in the optimal structure of $p_\theta$ we will suppress $\theta$ in what follows. The reverse process induces a probability distribution over data $p(x_0)$ by marginalizing Equation (5) over all $x_{t>0}$. The variational objective is then given by a bound $K$ on the log-likelihood of the data under the reverse process, $L = \int q_d(x_0)\,\log p(x_0)\,dx_0$, and can
be written as (Sohl-Dickstein et al., 2015)

$$K = \sum_{t=1}^{T}\int q(x_0, \dots, x_T)\,\log\frac{p(x_{t-1}\mid x_t)}{q(x_t\mid x_{t-1})}\,dx_{0\ldots T} - H_{q_n} \tag{6}$$

where $H_{q_n}$ is the entropy of the noise distribution $q_n$, which is a constant. Finally, by defining the forward posterior

$$q(x_{t-1}\mid x_t) \equiv \frac{q(x_t\mid x_{t-1})\,q(x_{t-1})}{\int q(x_t\mid \tilde{x}_{t-1})\,q(\tilde{x}_{t-1})\,d\tilde{x}_{t-1}} = \frac{q(x_t\mid x_{t-1})\,q(x_{t-1})}{q(x_t)} \tag{7}$$

where $q(x_{t-1})$ and $q(x_t)$ are the marginals of the forward process (Equation (4)) at steps $t-1$ and $t$, respectively, we can rewrite $K$ as (see Appendix B)

$$K = -\sum_{t=1}^{T}\mathbb{E}_{x_t\sim q}\,D_{\mathrm{KL}}\!\left[q(x_{t-1}\mid x_t)\,\|\,p(x_{t-1}\mid x_t)\right] + C_q \tag{8}$$

where $D_{\mathrm{KL}}$ is the Kullback-Leibler divergence and $C_q$ is a constant. While Equation (8) is not necessarily the most tractable form of the bound $K$, it will prove useful in the next section when we make the connection to serial reproduction (see Appendix A, Eqs. 17-26 in Ho et al. (2020) for a similar derivation).

Before proceeding to the next section, it is worth pausing for a moment to review the existing theoretical interpretations of diffusion models and the challenges they face. Two general formulations of diffusion models exist in the literature, namely, Denoising Diffusion Probabilistic Models (DDPMs) (Ho et al., 2020; Sohl-Dickstein et al., 2015), which adopt a formulation similar to the one used here, and Score-Based Models (Song & Ermon, 2019; Daras et al., 2022), which learn to approximate the gradient of the log-likelihood of the data distribution under different degradations, also known as the score of a distribution, and then incorporate that gradient in a stochastic process that samples from the data distribution. These formulations are not entirely independent and in fact have been shown to arise from a certain class of stochastic differential equations (Song et al., 2020). Importantly, these analyses often assume that the structure of noise is Gaussian, either to allow for tractable expressions for variational loss functions (Sohl-Dickstein et al., 2015), or as a way to link sampling to well-known processes such as Langevin dynamics (Song & Ermon, 2019). This theoretical reliance on Gaussian noise has been recently called into question as successful applications of diffusion models with a wide variety of noise classes were demonstrated empirically (Bansal et al., 2022; Daras et al., 2022). We seek to remedy this issue in the next section.

## 3. Diffusion Sampling as Serial Reproduction

### 3.1. Mathematical Derivation

As noted in the introduction, the sampling procedure for diffusion models is strikingly similar to the process of serial reproduction. In what follows we will make this statement more precise and show how it allows us to explain key features of diffusion models. First, observe that the non-negativity of the KL divergence implies that the bound in Equation (8) is maximized by the solution

$$p(x_{t-1}\mid x_t) = q(x_{t-1}\mid x_t) = \frac{q(x_t\mid x_{t-1})\,q(x_{t-1})}{\int q(x_t\mid \tilde{x}_{t-1})\,q(\tilde{x}_{t-1})\,d\tilde{x}_{t-1}}. \tag{9}$$
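To see what Equation (9) asks the network to approximate, consider a discretized toy example. The sketch below is our own illustration (one-dimensional, with an arbitrary bimodal marginal and a plain Gaussian kernel rather than the paper's setup): the optimal denoiser at a given step is just the Bayesian inversion of the forward kernel against the forward marginal.

```python
import numpy as np

xs = np.linspace(-1.0, 1.0, 201)
q_prev = np.exp(-0.5 * ((xs - 0.4) / 0.1) ** 2) + np.exp(-0.5 * ((xs + 0.4) / 0.1) ** 2)
q_prev /= q_prev.sum()                         # forward marginal q_{t-1} (toy bimodal)

beta = 0.05                                    # diffusion parameter for this step
kernel = np.exp(-0.5 * (xs[:, None] - xs[None, :]) ** 2 / beta)
kernel /= kernel.sum(axis=0, keepdims=True)    # q(x_t | x_{t-1}), column-stochastic

q_next = kernel @ q_prev                       # forward marginal q_t

# Optimal reverse kernel, Eq. (9): p(x_{t-1} | x_t) proportional to q(x_t | x_{t-1}) q(x_{t-1})
posterior = kernel * q_prev[None, :]
posterior /= posterior.sum(axis=1, keepdims=True)   # row i: denoiser output given x_t = xs[i]

i = np.abs(xs - 0.5).argmin()
print("E[x_{t-1} | x_t = 0.5] =", round(float(posterior[i] @ xs), 3))  # pulled toward the mode at 0.4
```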
In other words, what the diffusion model is approximating at each step $t-1$ is simply a Bayesian posterior with the diffusion kernel $q(x_t\mid x_{t-1}) = T_{q_n}(x_t\mid x_{t-1}; \beta_t)$ serving as likelihood and the forward marginal at step $t-1$, namely $q(x_{t-1})$, serving as a prior. For clarity, in what follows we will denote the optimal posterior distribution at step $t-1$ (Equation (9)) as $p_{t-1}(x\mid y)$ and the marginal at step $t-1$ as $q_{t-1}(x)$.

Next, to make contact with the recent literature that extends diffusion models to generalized noise families which decompose sampling into a process that alternates between restoration and degradation (Bansal et al., 2022; Daras et al., 2022), we define the sampling process for a diffusion model as a Markov sequence $x_T \rightarrow \hat{x}_T \rightarrow x_{T-1} \rightarrow \dots \rightarrow x_t \rightarrow \hat{x}_t \rightarrow x_{t-1} \rightarrow \dots \rightarrow x_0$, where $x_T \sim q_n(x_T)$, $\hat{x}_t$ is a noisy version of $x_t$ under the noise kernel $T_{q_n}(\hat{x}_t\mid x_t; \beta_t)$, and the transition $\hat{x}_t \rightarrow x_{t-1}$ is done by denoising using the posterior $p_{t-1}$. While this might seem at first to be at odds with the standard definition of sampling in Gaussian DDPMs, whereby sampling is done by iterative applications of the posterior in Equation (9), we will show in Section 3.2 that our definition is in fact equivalent. From here, the transition kernel for the sampling process at step $t-1$, which we denote by $p_{s,t-1}(x_{t-1}\mid x_t)$, is given by

$$p_{s,t-1}(x_{t-1}\mid x_t) = \int p_{t-1}(x_{t-1}\mid \hat{x}_t)\,T_{q_n}(\hat{x}_t\mid x_t; \beta_t)\,d\hat{x}_t = \int \frac{T_{q_n}(\hat{x}_t\mid x_{t-1}; \beta_t)\,T_{q_n}(\hat{x}_t\mid x_t; \beta_t)}{\int T_{q_n}(\hat{x}_t\mid \tilde{x}_{t-1}; \beta_t)\,q_{t-1}(\tilde{x}_{t-1})\,d\tilde{x}_{t-1}}\,q_{t-1}(x_{t-1})\,d\hat{x}_t. \tag{10}$$

This kernel has an identical structure to the serial reproduction kernel in Equation (2), and as such, the forward marginal $q_{t-1}$ satisfies detailed balance with respect to it. In other words, we have a set of detailed-balance conditions given by

$$p_{s,t-1}(x_t\mid x_{t-1})\,q_{t-1}(x_{t-1}) = p_{s,t-1}(x_{t-1}\mid x_t)\,q_{t-1}(x_t). \tag{11}$$

In particular, the last of these, $p_{s,0}$, satisfies this condition for the true data distribution since by definition $q_0(x) = q_d(x)$. Now, observe that unlike the case in serial reproduction where we had a single distribution $\pi(x)$ satisfying detailed balance at all steps, here we have a sequence of such distributions. Nevertheless, we can still analyze the induced sampling distribution $p_s(x_0)$ in light of Equation (11). Indeed, by substituting the detailed-balance conditions we have

$$p_s(x_0) = \int p_{s,0}(x_0\mid x_1)\cdots p_{s,T-1}(x_{T-1}\mid x_T)\,q_n(x_T)\,dx_{1\ldots T} = q_d(x_0)\int \frac{p_{s,0}(x_1\mid x_0)\,q_1(x_1)}{q_0(x_1)}\cdots \frac{p_{s,T-1}(x_T\mid x_{T-1})\,q_n(x_T)}{q_{T-1}(x_T)}\,dx_{1\ldots T}. \tag{12}$$

From here we see that the performance of $p_s(x_0)$ as a good approximator of the true data distribution $q_d(x_0)$ critically depends on whether the integral on the right-hand side of Equation (12) sums up to one. Observe that the integral can be evaluated step-by-step by first integrating over $x_T$, then $x_{T-1}$, and down to $x_1$. For the $x_T$ integral we have

$$\int p_{s,T-1}(x_T\mid x_{T-1})\,\frac{q_n(x_T)}{q_{T-1}(x_T)}\,dx_T = \int T_{q_n}(\hat{x}\mid x_{T-1}; \beta_T)\,\frac{\int T_{q_n}(\hat{x}\mid x_T; \beta_T)\,q_n(x_T)\,dx_T}{\int T_{q_n}(\hat{x}\mid \tilde{x}_{T-1}; \beta_T)\,q_{T-1}(\tilde{x}_{T-1})\,d\tilde{x}_{T-1}}\,d\hat{x}. \tag{13}$$

Now, using the fact that $q_n$ is stationary with respect to $T_{q_n}$ and that by construction $q_T(x) = q_n(x)$, the rightmost integral and the denominator cancel out and the remaining integral over $\hat{x}$ integrates to 1. Going one step further to the integral over $x_{T-1}$ (and its own latent $\hat{x}$) we have

$$\int p_{s,T-2}(x_{T-1}\mid x_{T-2})\,\frac{q_{T-1}(x_{T-1})}{q_{T-2}(x_{T-1})}\,dx_{T-1} = \int T_{q_n}(\hat{x}\mid x_{T-2}; \beta_{T-1})\,\frac{\int T_{q_n}(\hat{x}\mid x_{T-1}; \beta_{T-1})\,q_{T-1}(x_{T-1})\,dx_{T-1}}{\int T_{q_n}(\hat{x}\mid \tilde{x}_{T-2}; \beta_{T-1})\,q_{T-2}(\tilde{x}_{T-2})\,d\tilde{x}_{T-2}}\,d\hat{x}. \tag{14}$$

Unlike before, we are no longer guaranteed stationarity for $q_{T-1}$. One trivial solution for this is to use a very strong (abrupt) diffusion schedule $\{\beta_t\}$ such that the marginals behave like $q_0 = q_d$ and $q_{t>0} = q_n$, that is, within one step we are pretty much at noise level. From the perspective of the Bayesian posterior in Equation (9), this limit corresponds to the case where the Bayesian inversion simply ignores the very noisy input and instead relies on its learned prior, which happens to be the data distribution for $p_0$. Such a solution, however, is not really feasible in practice because it means that the denoising network that approximates the final Bayesian posterior $p_0$ must learn to map pure noise to true data samples in one step, but this is precisely the problem that we are trying to solve. We therefore exclude this solution.
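Putting the pieces together, the sampling chain $x_T \rightarrow \hat{x}_t \rightarrow x_{t-1} \rightarrow \dots$ can be simulated end-to-end on the same kind of discretized toy problem. The sketch below is our own illustration with arbitrary parameter choices (a 1-D bimodal data distribution and a gradual Gaussian schedule), not the paper's 2-D Swiss-roll simulation: it precomputes the forward marginals and then alternates a noising step with the optimal posterior of Equation (9).

```python
import numpy as np

rng = np.random.default_rng(0)
xs = np.linspace(-1.0, 1.0, 201)
q_d = np.exp(-0.5 * ((xs - 0.4) / 0.1) ** 2) + np.exp(-0.5 * ((xs + 0.4) / 0.1) ** 2)
q_d /= q_d.sum()                                    # toy data distribution q_d(x_0)

def gauss_kernel(beta):
    k = np.exp(-0.5 * (xs[:, None] - xs[None, :]) ** 2 / beta)
    return k / k.sum(axis=0, keepdims=True)         # T(x_t | x_{t-1}; beta_t)

T_steps = 40
betas = np.linspace(0.002, 0.05, T_steps)           # a gradual (smooth) schedule
kernels, marginals = [], [q_d]
for b in betas:                                     # forward process: q_t = T_t q_{t-1}
    kernels.append(gauss_kernel(b))
    marginals.append(kernels[-1] @ marginals[-1])

def sample(n=2000):
    x = rng.choice(len(xs), size=n, p=marginals[-1])        # start near the noise distribution
    for t in reversed(range(T_steps)):                      # x_{t+1} -> xhat -> x_t
        K, q_prev = kernels[t], marginals[t]
        post = K * q_prev[None, :]                          # Eq. (9) posterior at this level
        post /= post.sum(axis=1, keepdims=True)
        xhat = np.array([rng.choice(len(xs), p=K[:, i]) for i in x])    # noising step
        x = np.array([rng.choice(len(xs), p=post[j]) for j in xhat])    # denoising step
    return xs[x]

samples = sample()
print("mean |x| of generated samples (modes sit at +/-0.4):", round(float(np.abs(samples).mean()), 3))
```

The generated samples concentrate around the two data modes even though the chain starts from (approximately) pure noise, illustrating the cascade of serial-reproduction-like steps.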
Figure 2. Simulation results for the optimal sampling process. Data was generated by sampling two-dimensional vectors from a Swiss-roll distribution. We considered different diffusion noise families and computed their corresponding reverse sampling processes. Each image corresponds to the distribution of samples at a given iteration in a process. A. Forward process for a Gaussian noise kernel. B. The corresponding reverse sampling process. C. Reverse sampling process for a bimodal kernel. D. Reverse sampling for a fade-type noise with uniform stationary distribution. E. Reverse sampling for a fade-type noise with mixture stationary distribution.

Luckily, there is another way around this problem, namely, if the diffusion parameter $\beta_{T-1}$ is chosen such that it changes $q_{T-1}$ only by a little, so that it is approximately stationary, that is,

$$q_{T-1}(\hat{x}) \approx \int T_{q_n}(\hat{x}\mid x_{T-1}; \beta_{T-1})\,q_{T-1}(x_{T-1})\,dx_{T-1}, \tag{15}$$

then the integral in Equation (14) will again be close to 1, as the denominator cancels out with the rightmost integral. By induction, we can repeat this process for all lower levels and conclude that

$$p_s(x_0) \approx q_d(x_0). \tag{16}$$

Crucially, under this schedule denoising networks only need to learn to invert between successive levels of noise. This suggests that the structure of the optimal sampling process allows one to trade a hard abrupt noise schedule for a feasible smooth one by spreading the reconstruction complexity along the diffusion path in a divide-and-conquer fashion.

It is worth noting at this point that the fact that the data distribution $q_d(x_0)$ appears on the right-hand side of the formula for the sampling distribution $p_s(x_0)$ in Equation (12) is suggestive of a bound on the sampling error $D_{\mathrm{KL}}[q_d(x_0)\,\|\,p_s(x_0)]$ that can incorporate the stationarity condition in Equation (15). While we are not aware of an analytical formula for this bound, we will explore it empirically in Section 4.
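The approximate-stationarity requirement in Equation (15) is easy to probe numerically. The following is our own small sketch on a discretized 1-D toy problem (not the paper's simulation code): it tracks the forward marginals under a given schedule and reports the largest violation of Equation (15), i.e., how much any marginal still changes when its own kernel is applied once more.

```python
import numpy as np

xs = np.linspace(-1.0, 1.0, 201)
q_d = np.exp(-0.5 * ((xs - 0.4) / 0.1) ** 2) + np.exp(-0.5 * ((xs + 0.4) / 0.1) ** 2)
q_d /= q_d.sum()                                    # toy data distribution

def gauss_kernel(beta):
    k = np.exp(-0.5 * (xs[:, None] - xs[None, :]) ** 2 / beta)
    return k / k.sum(axis=0, keepdims=True)

def max_stationarity_gap(betas):
    """max_t TV(q_t, T_t q_t): how badly Eq. (15) is violated along the forward path."""
    q, gaps = q_d, []
    for b in betas:
        K = gauss_kernel(b)
        q = K @ q                                   # forward marginal q_t
        gaps.append(0.5 * np.abs(K @ q - q).sum())  # distance of q_t from stationarity under K_t
    return max(gaps)

coarse = np.full(3, 0.02)                # a few moderate steps: marginals keep changing
gradual = np.linspace(0.002, 0.05, 40)   # many small steps ending in strong noise
print("coarse schedule :", round(max_stationarity_gap(coarse), 3))
print("gradual schedule:", round(max_stationarity_gap(gradual), 3))
```

The gradual schedule keeps every marginal close to stationary under its own kernel, whereas the coarse schedule leaves late marginals that are still structured yet still changing, which is the regime in which the cancellations behind Equation (16) fail. Figure S1 in the appendix reports a related stationarity score.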
Figure 3. Effect of noise schedule on reconstruction error. A. Different noise injection schedules for a Gaussian noise kernel. B. Reconstruction error as measured by the KL divergence between the generated distribution and the true data distribution. C. A measure of inversion complexity as quantified by the maximum KL divergence between two consecutive marginals along the forward diffusion path.

### 3.2. Connecting to a Wider Range of Diffusion Models

Before moving on to the empirical analysis, we complete our theoretical derivation by showing that our definition of the sampling process is equivalent to the one used in the well-studied Gaussian DDPM formalism of Ho et al. (2020). As noted in that paper (Section 3.2 of Ho et al. (2020)), the posterior is taken to be of the form $p(x_{t-1}\mid x_t) = \mathcal{N}(x_{t-1}; \mu(x_t, t), \Sigma(x_t, t))$ with $\Sigma(x_t, t) = \sigma_t^2 I$, where $\mu(x_t, t)$ is a trainable function and $I$ is the identity matrix. This can also be written as $x_{t-1} = \mu(x_t, t) + \sigma_t z$ where $z \sim \mathcal{N}(0, I)$. In other words, the posterior is given by a Gaussian distribution around a function of the input $\mu(x_t, t)$ with some diagonal covariance matrix with variance $\sigma_t^2$. Intuitively, equivalence then follows from the fact that introducing an additional noising step in the sampling process simply adds Gaussian noise to a Gaussian posterior, which corresponds to a redefinition of the mean and variance parameters (which are design parameters; Dhariwal & Nichol (2021)). More explicitly, our sampling process is defined as $x_T \rightarrow \hat{x}_T \rightarrow x_{T-1} \rightarrow \dots \rightarrow x_0$, that is, we start from an initial sample $x_T$ and then successively add noise to it and then denoise it with the posterior. Now, since the initial $x_T \sim q_n(x_T)$ is sampled from the stationary noise distribution, adding noise to it (i.e., transitioning to $\hat{x}_T$) does not change its distribution, so we can equivalently start by denoising $x_T$ using the posterior (as in DDPM) and then adding noise and so on. Mathematically, this corresponds to applying to the generated posterior sample (i.e., the denoised $x_T$) a generic noisy transformation of the form $x \rightarrow \alpha x + \sigma z$ where $z \sim \mathcal{N}(0, I)$ and $\alpha$ and $\sigma$ are some scaling and variance parameters (similar to Equation 2 in Ho et al. (2020)). Now, when combined with the Gaussian posterior above this yields $x_{T-1} = \alpha\mu(x_T, T) + \alpha\sigma_T z + \sigma z'$, which corresponds to $p(x_{T-1}\mid x_T) = \mathcal{N}(x_{T-1}; \alpha\mu(x_T, T), (\alpha^2\sigma_T^2 + \sigma^2) I)$, but this is equivalent to the formula used in Ho et al. (2020) up to a redefinition of the mean and variance, namely, $\mu_T(x_T, T) \rightarrow \alpha\mu(x_T, T)$ and $\sigma_T^2 \rightarrow \alpha^2\sigma_T^2 + \sigma^2$. The same holds for all subsequent steps which, as in the case of DDPM, terminate with a final application of the posterior denoiser. Thus, we see that the two definitions are equivalent up to a redefinition of the trainable posterior parameters.
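The reparameterization step above is elementary for Gaussians, but it is the crux of the equivalence, so a quick Monte Carlo check may be helpful. The snippet below is our own sanity check with arbitrary illustrative values for $\mu$, $\sigma_T$, $\alpha$ and $\sigma$ (these are stand-ins, not learned quantities): denoising first and then adding noise matches a single Gaussian posterior with the redefined mean and variance.

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma_T = 1.3, 0.5      # stand-ins for the learned posterior mean and std at step T
alpha, sigma = 0.9, 0.2     # parameters of the extra noising step x -> alpha * x + sigma * z
n = 1_000_000

# Denoise with the Gaussian posterior, then apply the generic noisy transformation:
x = mu + sigma_T * rng.standard_normal(n)
x = alpha * x + sigma * rng.standard_normal(n)

# Single Gaussian posterior with redefined parameters:
x_equiv = alpha * mu + np.sqrt((alpha * sigma_T) ** 2 + sigma ** 2) * rng.standard_normal(n)

print(round(x.mean(), 3), round(x.std(), 3))          # ~ (alpha*mu, sqrt(alpha^2 sigma_T^2 + sigma^2))
print(round(x_equiv.mean(), 3), round(x_equiv.std(), 3))
```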
Figure 4. Reconstruction error of a DDPM trained on various datasets for different noise schedules (as captured by the total number of steps interpolating between two noise levels in the forward process). A. Performance quantified using the FID score. B. Example samples from different models.

### 3.3. Summary

We have shown that the sampling process of an optimal diffusion model approximates the true data distribution irrespective of the choice of noise family, so long as the noise schedule $\{\beta_t\}$ is chosen such that it alters the input distribution gradually. This result is consistent with recent work suggesting that a good scheduling heuristic minimizes the sum of Wasserstein distances between pairs of distributions along the diffusion path of the forward process (Daras et al. (2022); see also Dhariwal & Nichol (2021)). It also explains the necessary thinning of $\beta_t$ in later stages of the synthesis, where the marginal becomes more and more structured. In the next section, we provide empirical support for this theoretical analysis.

## 4. Simulations

To test the theoretical findings of the previous section, we selected a simulation setup in which the Bayesian posteriors are tractable and can be computed analytically. Specifically, similar to Sohl-Dickstein et al. (2015), we considered a case where the data is given by two-dimensional vectors $(x, y)$ in the range $-0.5 \le x, y \le 0.5$ that are sampled from a Swiss-roll-like data distribution. For tractability, we discretized the space into a finite grid of size $41 \times 41$ (1,681 bins) with wrapped boundaries (to avoid edge artifacts). We then analytically computed the forward (noising) distributions using Equation (4) (Figure 2A) and the reverse sampling distributions using Equation (10) (Figure 2B); we did not train any neural networks. As for the noise families, we considered multiple options to test whether the process is indeed robust to noise type. Specifically, we considered Gaussian noise with a linear variance schedule $\sigma_t = 0.03 + 0.04\,(t/T)$ (Figure 2A-B), a bimodal noise kernel consisting of a mixture with two modes symmetrically positioned 0.07 units away from the center, resulting in diagonal smearing (Figure 2C), and a fade-type noise where there is an increasing probability $p_t = 0.01 + 0.99\,(t/T)$ to sample from a different pre-specified distribution, namely, uniform (Figure 2D) and Gaussian mixture (Figure 2E). $T$ denotes the total number of steps as before (see Supplementary Section C for additional details about the simulations). As can be clearly seen, in all of these cases, whether we changed the noise kernel or the stationary noise distribution, the sampler was able to converge on a good approximation of the underlying Swiss-roll distribution (see Supplementary Figure S2 for additional simulations with the posterior-only sampling scheme and Supplementary Figures S3-S4 for examples of bad samplers due to inadequate noise schedules).

Next, we wanted to test how the approximation accuracy of the sampler depends on the graduality of the noise injection in the forward process, as captured by the number of diffusion steps. To that end, we restricted ourselves to the Gaussian case and considered different noise schedules by varying the number of steps interpolating between two fixed levels of variance, that is, $\sigma_t = 0.01 + 0.04\,(t/T)$ for different values of the number of steps $T$ (Figure 3A). To quantify accuracy, we computed the reconstruction error as measured by the KL divergence between the generated data distribution and the true data distribution, i.e., $D_{\mathrm{KL}}[q_d(x_0)\,\|\,p_s(x_0)]$, for each of the chosen schedules (Figure 3B). We also added a measure of inversion complexity, which measures the biggest change between consecutive distributions along the forward diffusion path, that is, $\max_t D_{\mathrm{KL}}[q_{t+1}(x)\,\|\,q_t(x)]$. The idea is that the bigger this number is, the less gradual (more abrupt) the noise injection is, making the learning of the denoiser (i.e., recovering a clean sample from noisy data) harder in practice (Figure 3C; see also Figure S1 for an additional stationarity metric). As predicted, we see that adding more steps results in lower reconstruction error and lower complexity. In addition, we also see that beyond a certain number of steps the accuracy of the sampler saturates, presumably because the forward process has enough time to reach stationarity and to do so in a smooth way (see also Figure S5 for another similar simulation with a different noise type).
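Both quantities used above are cheap to compute once the forward marginals and the optimal sampler are available in closed form on a grid. The sketch below is our own 1-D, unwrapped analogue of that computation (the paper uses a 2-D wrapped grid and a Swiss-roll distribution), with the same functional form of the schedule: it reports the reconstruction error $D_{\mathrm{KL}}[q_d \| p_s]$ and the inversion complexity $\max_t D_{\mathrm{KL}}[q_{t+1} \| q_t]$ for several numbers of steps $T$.

```python
import numpy as np

xs = np.linspace(-0.5, 0.5, 101)
q_d = np.exp(-0.5 * ((xs - 0.2) / 0.05) ** 2) + np.exp(-0.5 * ((xs + 0.2) / 0.05) ** 2)
q_d = 0.9 * q_d / q_d.sum() + 0.1 / len(xs)          # avoid zero-density bins (cf. Appendix C)

def kernel(var):
    k = np.exp(-0.5 * (xs[:, None] - xs[None, :]) ** 2 / var)
    return k / k.sum(axis=0, keepdims=True)           # column-stochastic T(xhat | x)

def kl(p, q):
    return float(np.sum(p * np.log(p / q)))            # in nats

for T in (2, 5, 10, 25, 50):
    sigmas = 0.01 + 0.04 * np.arange(1, T + 1) / T     # linear variance schedule from Sec. 4
    Ks = [kernel(s) for s in sigmas]
    marg = [q_d]
    for K in Ks:                                       # forward marginals q_1, ..., q_T
        marg.append(K @ marg[-1])
    complexity = max(kl(marg[t + 1], marg[t]) for t in range(T))
    p_s = marg[-1]                                     # start the sampler from q_T (~ q_n)
    for t in reversed(range(T)):                       # one noising + one Eq.-(9) denoising step
        post = Ks[t] * marg[t][None, :]
        post /= post.sum(axis=1, keepdims=True)
        p_s = post.T @ (Ks[t] @ p_s)
    print(f"T={T:3d}  error={kl(q_d, p_s):.4f}  complexity={complexity:.3f}")
```

With this toy setup both numbers should shrink and then saturate as $T$ grows, mirroring the qualitative trend in Figure 3 (the absolute values depend on the toy distribution and are reported here in nats rather than bits).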
Finally, we also performed diffusion experiments with deep neural networks to validate that this dependence on the number of steps indeed occurs. Specifically, we trained a Denoising Diffusion Probabilistic Model (Ho et al., 2020) to denoise MNIST (LeCun & Cortes, 2005), FMNIST (Xiao et al., 2017), KMNIST (Clanuwat et al., 2018), and CIFAR10 (Krizhevsky et al., 2009) images (we used Brian Pulfer's PyTorch re-implementation of DDPM, https://github.com/BrianPulfer/PapersReimplementations). For this model, the noise at step $t$ depends on the diffusion parameter $\beta_t = \beta_{\min} + (\beta_{\max} - \beta_{\min})\,t/T$. To investigate the effect of the noise schedule, we set $\beta_{\min} = 0.0001$, $\beta_{\max} = 0.02$ and retrained the model multiple times with a different number of total steps each time ($T \in [50, 500]$). To evaluate the sample quality of each trained model, we generated 6,000 images using the same number of steps as the model was trained on, and computed the Fréchet inception distance (FID; Heusel et al., 2017) between each set of generated images and the training set. Smaller FID values correspond to better image generation quality. We report the results in Figure 4A alongside a sample of the resulting images for different values of $T$ (Figure 4B). We see that, similar to the idealized case, gradual noise schedules with a bigger number of steps tend to improve sample quality and reach some saturation level beyond a certain number of steps. We should note that the number of steps needed for high-quality samples may be reduced by an appropriate choice of sampling method; however, the overall dependence on the number of steps remains the same (see, e.g., Tables 1-4 and Figures 1, A.1-3 of Watson et al. (2021)).
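For readers who want to reproduce the schedule dependence, the only model-independent ingredient is the linear $\beta_t$ schedule and the closed-form forward corruption it induces. The sketch below is our own minimal PyTorch illustration (the actual training used Brian Pulfer's re-implementation linked above, and FID was computed separately); it shows how much signal the forward process retains at $t = T$ for different total numbers of steps under the same $\beta_{\min}, \beta_{\max}$.

```python
import torch

def linear_schedule(T, beta_min=1e-4, beta_max=0.02):
    """beta_t = beta_min + (beta_max - beta_min) * t / T for t = 0, ..., T (Appendix C.2)."""
    t = torch.arange(0, T + 1, dtype=torch.float32)
    return beta_min + (beta_max - beta_min) * t / T

def forward_noise(x0, t, betas):
    """Sample x_t given x_0 via the standard closed form sqrt(abar_t) x_0 + sqrt(1 - abar_t) eps."""
    alpha_bar = torch.cumprod(1.0 - betas, dim=0)[t].view(-1, 1, 1, 1)
    eps = torch.randn_like(x0)
    return alpha_bar.sqrt() * x0 + (1.0 - alpha_bar).sqrt() * eps

x0 = torch.rand(64, 1, 28, 28)                        # stand-in batch of images
for T in (50, 100, 200, 500):
    betas = linear_schedule(T)
    abar_T = float(torch.cumprod(1.0 - betas, dim=0)[-1])
    xT = forward_noise(x0, torch.full((64,), T), betas)
    corr = float(torch.corrcoef(torch.stack([x0.flatten(), xT.flatten()]))[0, 1])
    print(f"T={T:3d}  alpha_bar_T={abar_T:.4f}  corr(x_0, x_T)={corr:.3f}")
```

With $\beta_{\max}$ fixed at 0.02, shorter schedules leave a non-negligible amount of the original signal at the final step, which is one concrete way to see why too few steps degrade sample quality under this schedule family.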
## 5. Discussion

In this work, we derived a correspondence between diffusion models and an experimental paradigm used in cognitive science known as serial reproduction, whereby Bayesian agents iteratively encode and decode stimuli in a fashion similar to the game of telephone. This was achieved by analyzing the structure of the underlying optimal inversion that the model was attempting to approximate and integrating it in a diffusion sampling process. Crucially, this allowed us to provide a new theoretical understanding of key features of diffusion models that challenged previous formulations, specifically, their robustness to the choice of noise family and the role of noise scheduling (note also that our framework is relevant for discrete diffusion models (Austin et al., 2021), as we did not commit to any specific data space $x$, and the information-theoretic quantities used in the derivation are well-defined for discrete variables). Our analysis also identified in what ways these features could break in practice and yielded a simple practical metric (inversion complexity) that can help to formalize when a network can or cannot be trained to perform the inversion, and can lead to estimates of reconstruction errors. Finally, we validated our theoretical findings by simulating the optimal process as well as by training real diffusion models.

We conclude with some remarks regarding promising future directions. First, in this work we focused on probabilistic diffusion models, as these are the most common. Nevertheless, a new class of models suggests that it is possible to implement completely deterministic diffusion processes (Bansal et al., 2022). A related formulation of serial reproduction (Griffiths & Kalish, 2007), where the assumption of probability matching is replaced with deterministic maximum a posteriori (MAP) estimation such that the Gibbs sampling process becomes an Expectation-Maximization (EM) algorithm, could provide a suitable setup for reinterpreting such diffusion models. Second, in the present work we assumed asymptotic stationarity (Sohl-Dickstein et al., 2015) as we wanted to focus on the simplest theoretical setup that would allow addressing the question of robustness to noise types. However, recent work suggests that this assumption too can be relaxed (Watson et al., 2021); a follow-up study could extend our work to include such setups. Third, in the present work we considered a discrete-time formulation of diffusion models to make contact with the widely used DDPM formalism as well as the core literature on serial reproduction. However, both diffusion models and serial reproduction have alternative continuous-time formulations (see, e.g., Thompson & Griffiths (2021); Xu & Griffiths (2010); Song & Ermon (2019)) which could allow for an analytical perturbation analysis of reconstruction error as a function of stationarity violations in the forward process. Fourth, future work could use this new perspective as an inspiration for defining better noise scheduling metrics by directly trading off intermediate stationarity, convergence rate, and sample accuracy. Finally, this interpretation also suggests that it may be possible to come up with multi-threaded sampling procedures that incorporate multiple serial processes with different learned priors and selection strategies as a way of generating samples that possess a collection of desired qualities. This is inspired by the idea that serial reproduction can be interpreted as a process of cumulative cultural evolution whereby a heterogeneous group of agents jointly reproduce and mutate stimuli so as to optimize them for different properties (Xu et al., 2013; Thompson et al., 2022; van Rijn et al., 2022). We hope to explore these avenues in future research.

**Reproducibility.** Code for reproducing all simulations of the optimal sampler is available at the following link: https://github.com/raja-marjieh/diffusion-sr. For the DDPM training code, see the re-implementation referenced in Section 4; further details can be found in Appendix C.

**Acknowledgments.** This work was supported by an ONR grant (N00014-18-1-2873) to TLG and an NSERC fellowship (567554-2022) to IS.

## References

Austin, J., Johnson, D. D., Ho, J., Tarlow, D., and van den Berg, R. Structured denoising diffusion models in discrete state-spaces. Advances in Neural Information Processing Systems, 34:17981–17993, 2021.

Bansal, A., Borgnia, E., Chu, H.-M., Li, J. S., Kazemi, H., Huang, F., Goldblum, M., Geiping, J., and Goldstein, T. Cold diffusion: Inverting arbitrary image transforms without noise. arXiv preprint arXiv:2208.09392, 2022.

Bartlett, F. C. and Bartlett, F. C. Remembering: A study in experimental and social psychology. Cambridge University Press, 1995.

Clanuwat, T., Bober-Irizar, M., Kitamoto, A., Lamb, A., Yamamoto, K., and Ha, D. Deep learning for classical Japanese literature, 2018.

Daras, G., Delbracio, M., Talebi, H., Dimakis, A. G., and Milanfar, P. Soft diffusion: Score matching for general corruptions. arXiv preprint arXiv:2209.05442, 2022.
Dhariwal, P. and Nichol, A. Diffusion models beat GANs on image synthesis. Advances in Neural Information Processing Systems, 34:8780–8794, 2021.

Griffiths, T. L. and Kalish, M. L. Language evolution by iterated learning with Bayesian agents. Cognitive Science, 31(3):441–480, 2007.

Heusel, M., Ramsauer, H., Unterthiner, T., Nessler, B., and Hochreiter, S. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. Advances in Neural Information Processing Systems, 30, 2017.

Ho, J., Jain, A., and Abbeel, P. Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems, 33:6840–6851, 2020.

Jacoby, N. and McDermott, J. H. Integer ratio priors on musical rhythm revealed cross-culturally by iterated reproduction. Current Biology, 27(3):359–370, 2017.

Krizhevsky, A., Hinton, G., et al. Learning multiple layers of features from tiny images. 2009.

Langlois, T. A., Jacoby, N., Suchow, J. W., and Griffiths, T. L. Serial reproduction reveals the geometry of visuospatial representations. Proceedings of the National Academy of Sciences, 118(13):e2012938118, 2021.

LeCun, Y. and Cortes, C. The MNIST database of handwritten digits. 2005.

Ramesh, A., Dhariwal, P., Nichol, A., Chu, C., and Chen, M. Hierarchical text-conditional image generation with CLIP latents. arXiv preprint arXiv:2204.06125, 2022.

Rombach, R., Blattmann, A., Lorenz, D., Esser, P., and Ommer, B. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10684–10695, 2022.

Ronneberger, O., Fischer, P., and Brox, T. U-Net: Convolutional networks for biomedical image segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 234–241. Springer, 2015.

Sohl-Dickstein, J., Weiss, E., Maheswaranathan, N., and Ganguli, S. Deep unsupervised learning using nonequilibrium thermodynamics. In International Conference on Machine Learning, pp. 2256–2265. PMLR, 2015.

Song, Y. and Ermon, S. Generative modeling by estimating gradients of the data distribution. Advances in Neural Information Processing Systems, 32, 2019.

Song, Y., Sohl-Dickstein, J., Kingma, D. P., Kumar, A., Ermon, S., and Poole, B. Score-based generative modeling through stochastic differential equations. arXiv preprint arXiv:2011.13456, 2020.

Thompson, B. and Griffiths, T. L. Human biases limit cumulative innovation. Proceedings of the Royal Society B, 288(1946):20202752, 2021.

Thompson, B., van Opheusden, B., Sumers, T., and Griffiths, T. Complex cognitive algorithms preserved by selective social learning in experimental populations. Science, 376(6588):95–98, 2022.

van Rijn, P., Lee, H., and Jacoby, N. Bridging the prosody gap: Genetic algorithm with people to efficiently sample emotional prosody. arXiv preprint arXiv:2205.04820, 2022.

Watson, D., Chan, W., Ho, J., and Norouzi, M. Learning fast samplers for diffusion models by differentiating through sample quality. In International Conference on Learning Representations, 2021.

Xiao, H., Rasul, K., and Vollgraf, R. Fashion-MNIST: a novel image dataset for benchmarking machine learning algorithms, 2017.

Xu, J. and Griffiths, T. L. A rational analysis of the effects of memory biases on serial reproduction. Cognitive Psychology, 60(2):107–126, 2010.

Xu, J., Dowman, M., and Griffiths, T. L. Cultural transmission results in convergence towards colour term universals. Proceedings of the Royal Society B: Biological Sciences, 280(1758):20123073, 2013.
## A. Appendix - Derivation of Equation (3)

The explicit derivation of the detailed balance condition in Equation (3) is as follows:

$$\begin{aligned} p(x_{t+1}\mid x_t)\,\pi(x_t) &= \int \frac{p(\hat{x}_t\mid x_{t+1})\,p(\hat{x}_t\mid x_t)}{\int p(\hat{x}_t\mid \tilde{x}_{t+1})\,\pi(\tilde{x}_{t+1})\,d\tilde{x}_{t+1}}\,\pi(x_{t+1})\,\pi(x_t)\,d\hat{x}_t \\ &= \int \frac{p(\hat{x}_t\mid x_t)\,p(\hat{x}_t\mid x_{t+1})}{\int p(\hat{x}_t\mid \tilde{x}_t)\,\pi(\tilde{x}_t)\,d\tilde{x}_t}\,\pi(x_t)\,\pi(x_{t+1})\,d\hat{x}_t \\ &= p(x_t\mid x_{t+1})\,\pi(x_{t+1}) \end{aligned}$$

where the first equality follows from Equation (2), the second equality from the fact that $x_{t+1}$ is a dummy integration variable, and the third equality from Equation (2) and the fact that $\hat{x}_t$ is also a dummy integration variable.

## B. Appendix - Derivation of Equation (8)

To get to Equation (8), we simply plug Equation (7) into Equation (6), which yields

$$\begin{aligned} K &= \sum_{t=1}^{T}\int q(x_0, \dots, x_T)\,\log\frac{p(x_{t-1}\mid x_t)}{q(x_t\mid x_{t-1})}\,dx_0\cdots dx_T - H_{q_n} \\ &= \sum_{t=1}^{T}\int q(x_0, \dots, x_T)\,\log\frac{p(x_{t-1}\mid x_t)\,q(x_{t-1})}{q(x_{t-1}\mid x_t)\,q(x_t)}\,dx_0\cdots dx_T - H_{q_n} \\ &= \sum_{t=1}^{T}\int q(x_0, \dots, x_T)\,\log\frac{p(x_{t-1}\mid x_t)}{q(x_{t-1}\mid x_t)}\,dx_0\cdots dx_T + C_q \end{aligned}$$

where we defined $C_q = \sum_{t=1}^{T}\int q(x_0, \dots, x_T)\,\log\big(q(x_{t-1})/q(x_t)\big)\,dx_0\cdots dx_T - H_{q_n}$, which is simply a constant with respect to $p$. Next, observe that the Markovian decomposition in Equation (4) implies that

$$\int q(x_0, \dots, x_T)\,dx_0\cdots dx_{t-2}\,dx_{t+1}\cdots dx_T = q(x_t\mid x_{t-1})\,q(x_{t-1}) \tag{19}$$

where $q(x_{t-1}) = \int q(x_{t-1}\mid x_{t-2})\cdots q(x_1\mid x_0)\,q_d(x_0)\,dx_0\cdots dx_{t-2}$ is the marginal at step $t-1$. By combining this with the Bayes formula in Equation (7) we can write

$$\begin{aligned} K &= \sum_{t=1}^{T}\int q(x_t\mid x_{t-1})\,q(x_{t-1})\,\log\frac{p(x_{t-1}\mid x_t)}{q(x_{t-1}\mid x_t)}\,dx_{t-1}\,dx_t + C_q \\ &= \sum_{t=1}^{T}\int q(x_{t-1}\mid x_t)\,q(x_t)\,\log\frac{p(x_{t-1}\mid x_t)}{q(x_{t-1}\mid x_t)}\,dx_{t-1}\,dx_t + C_q \\ &= -\sum_{t=1}^{T}\mathbb{E}_{x_t\sim q}\,D_{\mathrm{KL}}\!\left[q(x_{t-1}\mid x_t)\,\|\,p(x_{t-1}\mid x_t)\right] + C_q \end{aligned}$$

which is the desired result. It is worth noting that a similar result can be found in Ho et al. (2020).

## C. Appendix - Additional Simulation Details

### C.1. Simulations with the Idealized Model

We ran idealized simulations in two dimensions $(x, y)$ in the range $-0.5 \le x, y \le 0.5$. We used a 41 by 41 finite grid (1,681 bins) with a bin width of 0.025. For the data distribution we used a Swiss-roll distribution similar to Sohl-Dickstein et al. (2015). We made training a bit more challenging by avoiding bins that have zero density. This was done by interpolating the Swiss-roll distribution $p$ with a uniform distribution over the finite grid $p_u$, so in practice we used $p = 0.9\,p + 0.1\,p_u$. We computed marginal distributions as vectors over the 1,681 bins, and conditional distributions as 1,681 by 1,681 matrices. In all cases, we computed noise terms with wrapped boundaries so that boundary artifacts are avoided. We used three types of noise: (1) Gaussian noise, (2) bimodal noise and (3) fade noise. The Gaussian noise over $(x, y)$ was defined as $p(y\mid x) = C\,\exp\!\big(-\tfrac{1}{2}(y-x)^{\top}\Sigma^{-1}(y-x)\big)$, where $\Sigma = \sigma I_2$ is a diagonal matrix, $\sigma$ is the noise parameter, and $C$ is a normalization constant. The bimodal distribution was defined as follows: $p(y\mid x) = C\,\big(\exp\!\big(-\tfrac{1}{2}(y-x+\mu)^{\top}\Sigma^{-1}(y-x+\mu)\big) + \exp\!\big(-\tfrac{1}{2}(y-x-\mu)^{\top}\Sigma^{-1}(y-x-\mu)\big)\big)$, where $\mu = (0.05, 0.05)$ and $\Sigma = \sigma I_2$. The fade noise depended on a parameter $f$ between 0 and 1, and linearly interpolated between not changing the distribution and completely changing it to some final distribution $p_F$. It was defined by the following formula: $p(y\mid x) = (1-f)\,\delta(y-x) + f\,p_F(y)$, where $\delta(z)$ is the two-dimensional Dirac delta function. For the stationary distribution we chose either uniform (Fig. 2D) or a mixture with three modes (Fig. 2E). The three equally weighted modes were located at points $x_1 = (0.1, 0.2)$, $x_2 = (-0.2, 0.1)$, $x_3 = (0.2, 0.2)$ and had covariance matrices of $0.0128\,I_2$, $0.0128\,I_2$, and $0.0128\,I_3$, where $I_2$ is the 2 by 2 identity matrix and $I_3$ is the matrix $[2, 1; 1, 2]$.
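As a rough illustration of how these kernels can be assembled, the sketch below (our own, not the released simulation code) builds the three noise families as $1681 \times 1681$ column-stochastic matrices over the wrapped $41 \times 41$ grid described above; wrapping is handled by mapping pairwise displacements onto $[-0.5, 0.5)$.

```python
import numpy as np

n = 41                                               # 41 x 41 grid on [-0.5, 0.5]^2
coords = (np.arange(n) - n // 2) * 0.025             # bin centers along one axis
X, Y = np.meshgrid(coords, coords, indexing="ij")
grid = np.stack([X.ravel(), Y.ravel()], axis=1)      # (1681, 2) grid points

def wrapped_diff(a, b):
    d = a[:, None, :] - b[None, :, :]                # pairwise displacements y - x
    return (d + 0.5) % 1.0 - 0.5                     # wrap onto [-0.5, 0.5)

def normalize(K):
    return K / K.sum(axis=0, keepdims=True)          # columns become p(y | x)

def gaussian_kernel(sigma):
    d = wrapped_diff(grid, grid)
    return normalize(np.exp(-0.5 * (d ** 2).sum(-1) / sigma))

def bimodal_kernel(sigma, mu=np.array([0.05, 0.05])):
    d = wrapped_diff(grid, grid)
    bump = lambda z: np.exp(-0.5 * (z ** 2).sum(-1) / sigma)
    return normalize(bump(d + mu) + bump(d - mu))

def fade_kernel(f, p_final):
    # (1 - f) * identity  +  f * (jump to the final distribution p_F)
    return (1.0 - f) * np.eye(n * n) + f * np.tile(p_final[:, None], (1, n * n))

uniform = np.full(n * n, 1.0 / (n * n))
for name, K in [("gaussian", gaussian_kernel(0.03)),
                ("bimodal", bimodal_kernel(0.03)),
                ("fade", fade_kernel(0.2, uniform))]:
    print(name, K.shape, bool(np.allclose(K.sum(axis=0), 1.0)))
```

Forward marginals are then just repeated matrix-vector products of these kernels with the discretized data distribution, and the optimal reverse kernels follow from Equation (9) as in the main text.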
We provide the code reproducing all figures and simulations as part of the Supplemental materials.

### C.2. Simulations with DDPM

For our experiments with deep neural networks, we trained a Denoising Diffusion Probabilistic Model (Ho et al., 2020) to denoise MNIST (LeCun & Cortes, 2005), FMNIST (Xiao et al., 2017), KMNIST (Clanuwat et al., 2018), and CIFAR10 (Krizhevsky et al., 2009) images. We used Brian Pulfer's PyTorch re-implementation of DDPM (https://github.com/BrianPulfer/PapersReimplementations). The UNet (Ronneberger et al., 2015) used by this implementation is designed to be compatible with single-channel images of size 28x28 (which is standard for MNIST variants), so CIFAR10 images first had to be resized and transformed to grayscale. Our focus is not on the technical implementation of DDPM, so we direct interested readers to Brian Pulfer's repository as it contains helpful documentation and commentary. The key detail is that for this model, the noise at step $t$ depends on the diffusion parameter $\beta_t = \beta_{\min} + (\beta_{\max} - \beta_{\min})\,t/T$, where $T$ is the total number of steps and $t = 0, 1, 2, \dots, T$. To investigate the effect of the noise schedule, we set $\beta_{\min} = 0.0001$, $\beta_{\max} = 0.02$ and retrained the model multiple times with a different number of total steps each time ($T \in [50, 500]$). For each dataset, this process took less than 2 hours to run on a single RTX 3080 Laptop GPU. To evaluate the sample quality of each trained model, we generated 6,000 images using the same number of steps as the model was trained on, and computed the Fréchet inception distance (FID; Heusel et al., 2017) between each set of generated images and the training set of the respective dataset.

Figure S1. Effect of noise schedule on a complementary stationarity metric. A. Different noise injection schedules for a Gaussian noise kernel, similar to Figure 3. B. Stationarity of the forward path as quantified by the RMSE of the integral on the right-hand side of the second equality in Equation (12) minus the identity vector.

Figure S2. Comparison between sampling with noise injection and without noise injection for different noise types. Panels: A-B. Gaussian noise, sampling with vs. without noise injection. C-D. Bimodal noise, sampling with vs. without noise injection. E-F. Fade noise with a non-uniform stationary distribution, sampling with vs. without noise injection.

Figure S3. Examples of bad noise schedules. Similar to the other simulations we had $T = 10$ steps, but unlike before the schedules here have a fixed noise parameter that did not change over the 10 steps. A. Gaussian noise.
We used a constant $\sigma = 0.01$ in all steps. B. Bimodal noise. We used a constant $\sigma = 0.001$ in all steps. C. Fade noise. We used a constant $f = 0.0001$ in all steps. (See Supplementary Section C for the definitions of these parameters.)

Figure S4. Evaluation metrics for the bad noise schedules shown in Figure S3 and an additional better Gaussian schedule with more steps ($T = 100$) for reference. A. Performance (reconstruction error). B. Complexity. C. Stationarity (see Figure S1 for definition).

Figure S5. Effect of noise schedule on reconstruction error for an additional bimodal noise type.