Published in Transactions on Machine Learning Research (02/2025)

Active Diffusion Subsampling

Oisín Nolan o.i.nolan@tue.nl
Tristan S.W. Stevens t.s.w.stevens@tue.nl
Wessel L. van Nierop w.l.v.nierop@tue.nl
Ruud J.G. van Sloun r.j.g.v.sloun@tue.nl
Eindhoven University of Technology

Reviewed on OpenReview: https://openreview.net/forum?id=OGifiton47

Abstract

Subsampling is commonly used to mitigate costs associated with data acquisition, such as time or energy requirements, motivating the development of algorithms for estimating the fully-sampled signal of interest x from partially observed measurements y. In maximum-entropy sampling, one selects measurement locations that are expected to have the highest entropy, so as to minimize uncertainty about x. This approach relies on an accurate model of the posterior distribution over future measurements, given the measurements observed so far. Recently, diffusion models have been shown to produce high-quality posterior samples of high-dimensional signals using guided diffusion. In this work, we propose Active Diffusion Subsampling (ADS), a method for designing intelligent subsampling masks using guided diffusion, in which the model tracks a distribution of beliefs over the true state of x throughout the reverse diffusion process, progressively decreasing its uncertainty by actively choosing to acquire measurements with maximum expected entropy, and ultimately producing the posterior distribution p(x | y). ADS can be applied using pre-trained diffusion models at any subsampling rate, and requires no task-specific retraining, only the specification of a measurement model. Furthermore, the maximum-entropy sampling policy employed by ADS is interpretable, enhancing transparency relative to existing methods that use black-box policies.
Experimentally, we show that by designing informative subsampling masks, ADS significantly improves reconstruction quality compared to fixed sampling strategies on the MNIST and CelebA datasets, as measured by standard image quality metrics including PSNR, SSIM, and LPIPS. Furthermore, on the task of Magnetic Resonance Imaging (MRI) acceleration, we find that ADS performs competitively with existing supervised methods in reconstruction quality while using a more interpretable acquisition scheme design procedure. Code is available at https://active-diffusion-subsampling.github.io/.

Figure 1: Active Diffusion Subsampling jointly designs a subsampling mask and reconstructs the target signal in a single reverse diffusion process. (Panels: reverse diffusion, intermediate belief distributions, generated mask, reconstruction.)

1 Introduction

In recent years, diffusion models have defined the state of the art in inverse problem solving, particularly in the image domain, through novel posterior sampling methods such as Diffusion Posterior Sampling (DPS) (Chung et al., 2022) and Posterior Sampling with Latent Diffusion (PSLD) (Rout et al., 2024). These methods are often evaluated on inverse imaging problems such as inpainting, which is akin to image subsampling. Typical benchmarks evaluate inpainting ability using naïve subsampling masks, such as randomly masked pixels or a box mask in the center of the image (Rout et al., 2024). In real-world applications, however, more sophisticated subsampling strategies are typically employed, for example in Magnetic Resonance Imaging (MRI) acceleration (Lustig & Pauly, 2010; Bridson, 2007). These subsampling strategies are usually designed by domain experts, and are therefore not generalizable across tasks.
Some recent literature has explored learning subsampling masks for various tasks (Bahadir et al., 2020; Baldassarre et al., 2016; Huijben et al., 2020; Van Gorp et al., 2021), but these methods typically depend on black-box policy functions and require task-specific training. In this work, we introduce Active Diffusion Subsampling (ADS), an algorithm for automatically designing task- and sample-adaptive subsampling masks using diffusion models, without the need for further training or fine-tuning (Figure 2). ADS uses a white-box policy function based on maximum-entropy sampling (Caticha, 2021), in which the model chooses sampling locations that are expected to maximize the information gained about the reconstruction target. In order to implement this policy, ADS leverages quantities that are already computed during the reverse diffusion process, leading to minimal additional inference time. We anticipate that ADS can be employed by practitioners in various domains that use diffusion models for subsampling tasks but currently rely on non-adaptive subsampling strategies. Such domains include medical imaging, such as MRI, CT, and ultrasound (Shaul et al., 2020; Pérez et al., 2020; Faltin et al., 2011), geophysics (Campman et al., 2017; Cao et al., 2011), and astronomy (Feng et al., 2023). Our main contributions are thus as follows:

- A novel approach to active subsampling which can be employed with existing diffusion models using popular posterior sampling methods;
- A white-box policy function for sample selection, grounded in theory from Bayesian experimental design;
- Significant improvement in reconstruction quality relative to baseline sampling strategies;
- Application to MRI acceleration and ultrasound scan-line subsampling.

Figure 2: Schematic overview of the proposed Active Diffusion Subsampling (ADS) method.
2 Related Work

Methods aiming to select maximally informative measurements appear in many domains, spanning statistics, signal processing, and machine learning, but share foundations in information theory and Bayesian inference. Optimal Bayesian Experimental Design (Lindley, 1956) aims to determine which experiment will be most informative about some quantity of interest θ (Rainforth et al., 2024), typically the parameters of a statistical model. Active learning (Houlsby et al., 2011) performs an analogous task in machine learning, aiming to identify which samples, if included in the training set, would lead to the greatest performance gain on the true data distribution. While our method focuses on subsampling high-dimensional signals, it could also be interpreted as a Bayesian regression solver for the forward problem y = f(x) + n, in which x is now seen as the parameters of a model f, and the active sampler seeks measurements y which minimize uncertainty about the parameters x, relating it to the task of Bayesian experimental design. From a signal processing perspective, ADS can be characterized as a novel approach to adaptive compressive sensing, in which sparse signals are reconstructed from measurements with sub-Nyquist sampling rates (Rani et al., 2018), typically in imaging and communication applications. Measurement matrices relating observed measurements to the signal of interest x are then designed so as to minimize reconstruction error on x. In Bayesian compressive sensing, measurement matrices are designed so as to minimize a measure of uncertainty about the value of x. Adaptive approaches, such as that of Braun et al. (2015), aim to decrease this uncertainty iteratively, by greedily choosing measurements that maximize the mutual information between x and y = Ux + n.
These methods typically assume a Gaussian prior on x, however, limiting the degree to which more complex prior structure can be used. More recently, a number of methods using deep learning to design subsampling strategies have emerged. These approaches typically learn subsampling strategies from data so as to minimize reconstruction error between x and y. Methods by Huijben et al. (2020) and Bahadir et al. (2020) learn fixed sampling strategies, in which a single mask is designed a priori for a given domain and applied to all samples at inference. These methods can be effective, but suffer in cases where the optimal mask differs across samples. Sample-adaptive methods (Van Gorp et al., 2021; Bakker et al., 2020; Yin et al., 2021; Stevens et al., 2022) move past this limitation by designing sampling strategies at inference time. A popular application of such methods is MRI acceleration, spurred by the fastMRI benchmark (Zbontar et al., 2018), in which a full MRI image must be reconstructed from subsampled κ-space measurements. In A-DPS (Van Gorp et al., 2021), for example, a neural network is trained to build an acquisition mask consisting of M κ-space lines iteratively, adaptively adding new lines based on the current reconstruction and prior context. Bakker et al. (2020) implement the same procedure using a reinforcement learning agent. One drawback of these methods is their reliance on black-box policies, making it difficult to detect and interpret failure cases. Generative approaches with transparent sampling policies circumvent this issue. For example, methods by Sanchez et al. (2020) and van de Camp et al. (2023) take a generative approach to adaptive MRI acquisition design, using variants of generative adversarial networks (GANs) to generate posterior samples over MRI images, with maximum-variance sampling in κ-space as the measurement selection policy.
The performance of such generative approaches depends on how well they can model the true posterior distribution over x given observations. This motivates our choice of diffusion models for generative modeling, as they have shown excellent performance in diverse domains, such as computer vision (Dhariwal & Nichol, 2021), medical imaging (Chung & Ye, 2022; Stevens et al., 2024), and natural language processing (Yu et al., 2022). The existing works discussed above have in common that they run the reconstruction model N times to acquire N measurements during inference. Due to the slower iterative inference procedure used by diffusion models, running inference N times may not be feasible in real-world applications, which are typically time-bound by the acquisition process. ADS alleviates this issue by using the Tweedie estimate of the posterior distribution as an approximate posterior, allowing N measurements to be acquired in a single reverse diffusion process. We also highlight a related concurrent work by Elata et al. (2025), which uses denoising diffusion restoration models (DDRMs) to generate sensing matrices for adaptive compressed sensing; it differs from ADS in that it requires full DDRM inference in order to acquire each sample.

3 Background

3.1 Bayesian Optimal Experimental Design

In Bayesian Optimal Experimental Design (Rainforth et al., 2024) and Bayesian Active Learning (Houlsby et al., 2011), the objective is to choose the optimal design, or set of actions $A = A^*$, leading to new observations of a measurement variable y that will minimize uncertainty about a related quantity of interest x, as measured by entropy H. It was shown by Lindley (1956) that this objective is equivalent to finding the actions A that maximize the mutual information I between x and y, i.e.
selecting actions leading to observations of y that will be most informative about x:

$$A^* = \arg\min_A \left[ H(x \mid A, y) \right] = \arg\max_A \left[ I(y; x \mid A) \right] \quad (1)$$

This objective is commonly optimized actively, wherein the design is created iteratively by choosing actions that maximize mutual information, taking past observations into account when selecting new actions (Rainforth et al., 2024). This active paradigm invites an agent-based framing, in which the agent's goal is to minimize its own uncertainty, and beliefs over x are updated as new measurements of y are taken. The active design objective can be formulated as follows, where $a_t$ is a possible action at time t, $A_t = A_{t-1} \cup \{a_t\}$ is the set of actions $\{a_0, \ldots, a_t\}$ taken up to time t, and $y_{t-1}$ is the set of partial observations of the measurement variable y up to time $t-1$:

$$
\begin{aligned}
a_t^* &= \arg\max_{a_t} \left[ I(y_t; x \mid A_t, y_{t-1}) \right] \\
&= \arg\max_{a_t} \left[ \mathbb{E}_{p(y_t \mid x, A_t)\, p(x \mid y_{t-1})} \left[ \log p(y_t \mid x, A_t, y_{t-1}) - \log p(y_t \mid A_t, y_{t-1}) \right] \right] \\
&= \arg\max_{a_t} \left[ H(y_t \mid A_t, y_{t-1}) - H(y_t \mid x, A_t, y_{t-1}) \right]
\end{aligned} \quad (2)
$$

We can interpret the agent's behavior in Equation 2 as trying to maximize marginal uncertainty about $y_t$, while minimizing model uncertainty about what value $y_t$ should take on given a particular x (Houlsby et al., 2011). Active designs are typically preferred over fixed designs, in which a set of actions is chosen up-front rather than progressively as measurements are acquired (Rainforth et al., 2024). While fixed designs may be more computationally efficient, they are less sample-specific, which can lead to lower information gain about x. Finally, it is worth noting that this active optimization scheme, while greedy, has been shown to be near-optimal due to the submodularity of conditional entropy (Golovin & Krause, 2011).

3.2 Subsampling

Image reconstruction tasks can generally be formulated as inverse problems, given by:

$$y = Ux + n, \quad (3)$$

where $y \in \mathcal{Y}^M$ is a measurement, $x \in \mathcal{X}^N$ the signal of interest, and $n \in \mathcal{N}^M$ some noise source, typically Gaussian.
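As a concrete toy instance of this inverse problem, the following sketch builds a subsampling-style measurement matrix U from a set of sampled indices and forms a noiseless measurement. This is our own illustration with made-up variable names, not code from the paper.

```python
import numpy as np

# Toy instance of the inverse problem y = U x + n with a subsampling-style U:
# U keeps the rows of the identity matrix indexed by the sampled locations,
# so U^T U is diagonal and its diagonal acts as a binary mask on x.
N = 8
sampled = [1, 4, 6]                  # indices of observed entries
U = np.eye(N)[sampled]               # M x N measurement matrix (M = 3)
x = np.arange(1.0, N + 1.0)          # toy signal of interest
y = U @ x                            # noiseless measurement (n = 0)

mask = np.diag(U.T @ U)              # binary subsampling mask
y_zero_filled = U.T @ y              # measurement embedded back in signal space

# The zero-filled measurement equals the element-wise masked signal.
assert np.allclose(y_zero_filled, mask * x)
print(y)  # [2. 5. 7.]
```

Here `U.T @ y` places each observed value back at its original index, which is the zero-filled form used throughout the subsampling literature.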
For the subsampling problem, the measurement matrix $U \in \mathbb{R}^{M \times N}$ can be expressed in terms of a binary subsampling matrix with one-hot encoded rows, such that we obtain an element-wise mask $m = \mathrm{diag}(U^\top U)$, where only the diagonal entries of $U^\top U$ are retained, representing the subsampling pattern. The subsampling mask m is related to the measurement through the zero-filled measurement, obtained as $y_{zf} = U^\top y = m \odot x + U^\top n$. Since we are interested in the adaptive design of these masks, we express their generation as $m = \mathrm{diag}(U(A_t)^\top U(A_t))$, where the measurement matrix is now a function of the actions $A_t = \{a_0, \ldots, a_t\}$ taken by the agent up to time t. The $i$th element of the mask is defined as follows:

$$m_i = \left[ U(A_t)^\top U(A_t) \right]_{ii} = \begin{cases} 1 & \text{if } i \in A_t \\ 0 & \text{otherwise.} \end{cases} \quad (4)$$

The measurement model in Equation 3 can now be extended to an active setting via $y_t = U(A_t)x + n_t$. Note too that in some applications we have an additional forward model f, mapping from the data domain to the measurement domain, yielding $y_t = U(A_t)f(x) + n_t$.

3.3 Posterior Sampling with Diffusion Models

Denoising diffusion models learn to reverse a stochastic differential equation (SDE) that progressively noises samples x towards a standard Normal distribution (Song et al., 2020). The (variance-preserving) SDE defining the noising process is as follows:

$$dx = -\frac{\beta(\tau)}{2} x \, d\tau + \sqrt{\beta(\tau)} \, dw, \quad (5)$$

where $x(0) \in \mathbb{R}^d$ is an initial clean sample, $\tau \in [0, T]$, $\beta(\tau)$ is the noise schedule, w is a standard Wiener process, and $x(T) \sim \mathcal{N}(0, I)$. According to Anderson (1982), this SDE can be reversed once the score function $\nabla_x \log p_\tau(x)$ is known, where $\bar{w}$ is a standard Wiener process running backwards:

$$dx = \left[ -\frac{\beta(\tau)}{2} x - \beta(\tau) \nabla_x \log p_\tau(x) \right] d\tau + \sqrt{\beta(\tau)} \, d\bar{w} \quad (6)$$

Following the notation of Ho et al. (2020) and Chung et al. (2022), the discrete setting of the SDE is represented using $x_\tau = x(\tau T/N)$, $\beta_\tau = \beta(\tau T/N)$, $\alpha_\tau = 1 - \beta_\tau$, $\bar{\alpha}_\tau = \prod_{s=1}^{\tau} \alpha_s$, where N is the number of discretized segments.
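The discrete quantities $\beta_\tau$, $\alpha_\tau$, and $\bar{\alpha}_\tau$ can be computed directly. The sketch below uses an assumed linear β schedule, a common choice that is our own illustrative assumption rather than anything specified in the text, purely to show the cumulative product and the variance-preserving property.

```python
import numpy as np

# Discrete VP-SDE quantities under an assumed linear beta schedule
# (the specific schedule here is our illustrative choice, not the paper's).
T = 1000
beta = np.linspace(1e-4, 0.02, T)    # beta_tau
alpha = 1.0 - beta                   # alpha_tau = 1 - beta_tau
alpha_bar = np.cumprod(alpha)        # abar_tau = prod_{s<=tau} alpha_s

# Forward marginal: x_tau = sqrt(abar_tau) * x_0 + sqrt(1 - abar_tau) * eps,
# so the signal and noise variances sum to 1 at every step.
total_var = alpha_bar + (1.0 - alpha_bar)
print(alpha_bar[0], alpha_bar[-1])   # near 1 early, near 0 late
```

Early steps keep nearly all of the signal ($\bar{\alpha}_\tau \approx 1$), while late steps are nearly pure noise, which is what lets the reverse process start from $\mathcal{N}(0, I)$.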
The diffusion model achieves the SDE reversal by learning the score function with a neural network parameterized by θ, $s_\theta(x_\tau, \tau) \approx \nabla_{x_\tau} \log p_\tau(x_\tau)$. The reverse diffusion process can be conditioned on a measurement y to produce samples from the posterior p(x | y). This is done by substituting the conditional score function $\nabla_{x_\tau} \log p_\tau(x_\tau \mid y)$ into Equation 6. The intractability of the noise-perturbed likelihood $\nabla_{x_\tau} \log p_\tau(y \mid x_\tau)$, which follows from factoring the posterior using Bayes' rule, has led to various approximate guidance schemes to compute these gradients with respect to a partially-noised sample $x_\tau$ (Chung et al., 2022; Song et al., 2023; Rout et al., 2024). Most of these rely on Tweedie's formula (Efron, 2011; Robbins, 1992), which serves as a one-step denoising process from τ to 0, denoted $D_\tau(\cdot)$, yielding the Minimum Mean-Squared Error denoising of $x_\tau$ (Milanfar & Delbracio, 2024). Using our trained score function $s_\theta(x_\tau, \tau)$ we can approximate $x_0$ as follows:

$$\hat{x}_0 = \mathbb{E}[x_0 \mid x_\tau] = D_\tau(x_\tau) = \frac{1}{\sqrt{\bar{\alpha}_\tau}} \left( x_\tau + (1 - \bar{\alpha}_\tau) s_\theta(x_\tau, \tau) \right) \quad (7)$$

Diffusion Posterior Sampling (DPS) uses Equation 7 to approximate $\nabla_{x_\tau} \log p(y \mid x_\tau) \approx \nabla_{x_\tau} \log p(y \mid \hat{x}_0)$. In the case of active subsampling, this leads to the guidance term $\nabla_{x_\tau} \| y_t - U(A_t) f(\hat{x}_0) \|_2^2$, indicating the direction in which $x_\tau$ should step in order to be more consistent with $y_t = U(A_t)f(x) + n$. The conditional reverse diffusion process then alternates between standard reverse diffusion steps and guidance steps in order to generate samples from the posterior $p(x \mid y_t)$. Finally, we note that a number of posterior sampling methods for diffusion models have recently emerged, and refer the interested reader to the survey by Daras et al. (2024).
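A minimal numerical sketch of one DPS-style guidance step for a linear measurement model follows. In practice the gradient is taken by autodiff through the trained score network $s_\theta$; here we substitute a zero placeholder score and use the analytic gradient of the resulting linear chain, so all names and the placeholder are our own illustration, not the paper's implementation.

```python
import numpy as np

def tweedie(x_tau, score, alpha_bar):
    # Tweedie estimate: hat_x0 = (x_tau + (1 - abar) * score) / sqrt(abar)
    return (x_tau + (1.0 - alpha_bar) * score) / np.sqrt(alpha_bar)

def dps_guidance_step(x_tau, y, U, alpha_bar, zeta):
    score = np.zeros_like(x_tau)            # placeholder for s_theta(x_tau, tau)
    x0_hat = tweedie(x_tau, score, alpha_bar)
    residual = y - U @ x0_hat               # data-fidelity residual
    # With a constant score, hat_x0 is linear in x_tau, so the gradient of
    # ||y - U hat_x0||^2 with respect to x_tau is analytic:
    grad = (-2.0 / np.sqrt(alpha_bar)) * (U.T @ residual)
    return x_tau - zeta * grad

U = np.eye(4)[[0, 2]]                       # measure coordinates 0 and 2
x_true = np.array([1.0, -1.0, 2.0, 0.5])
y = U @ x_true
x_next = dps_guidance_step(np.zeros(4), y, U, alpha_bar=1.0, zeta=0.25)
print(x_next)  # only the measured coordinates move toward the data
```

Note how the update only pulls the coordinates covered by U toward their measurements; the unmeasured coordinates are shaped by the prior (the score network) in the full method.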
4.1 Active Diffusion Subsampling

ADS (Algorithm 1) operates by running a reverse diffusion process generating a batch $\{x_\tau^{(i)}\}$, $i \in \{0, \ldots, N_p - 1\}$, of possible reconstructions of the target signal, guided by an evolving set of measurements $\{y_t\}_{t=0}^{T}$ which are revealed to it through subsampling actions taken at reverse diffusion steps satisfying $\tau \in S$, where S is a subsampling schedule, i.e. the set of diffusion time steps at which to acquire new measurements. For example, in an application in which one would like to acquire 10 measurements, $\{y_t\}_{t=0}^{9}$, across a diffusion process consisting of T = 100 steps, one might choose a linear subsampling schedule S = {0, 10, 20, ..., 90}. Then, during the diffusion process, if the diffusion step τ is an element of S, a new measurement is acquired and revealed to the diffusion model. These measurements could be, for example, individual pixels or groups of pixels. We refer to the elements of the batch of reconstructions $\{x_\tau^{(i)}\}$ as particles in the data space, as they implicitly track a belief distribution over the true, fully-denoised $x = x_0$ throughout reverse diffusion. These particles are used to compute estimates of uncertainty about x, which ADS aims to minimize by choosing actions $a_t$ that maximize the mutual information between x and $y_t$ given $a_t$. Figure 3 illustrates this action selection procedure. The remainder of the section describes how this is achieved through (i) employing running estimates of $x_0$ given by $D_\tau(x_\tau)$, (ii) modeling assumptions on the measurement entropy, and (iii) computational advantages afforded by the subsampling operator.
Algorithm 1: Active Diffusion Subsampling

Require: $T, N_p, S, \zeta, \{\tilde{\sigma}_\tau\}_{\tau=0}^{T}, \{\alpha_\tau\}_{\tau=0}^{T}, A_{\mathrm{init}}$
1:  $t = 0$; $A_0 = A_{\mathrm{init}}$; $y_0 = U(A_0)f(x) + n_0$; $x_T^{(i)} \sim \mathcal{N}(0, I)$ for $i = 0, \ldots, N_p - 1$
2:  for τ = T to 0 do  // batch processed in parallel for efficient inference
3:    for i = 0 to $N_p - 1$ do
4:      $\hat{s} \leftarrow s_\theta(x_\tau^{(i)}, \tau)$
5:      $\hat{x}_0^{(i)} \leftarrow D_\tau(x_\tau^{(i)}) = \frac{1}{\sqrt{\bar{\alpha}_\tau}}\left(x_\tau^{(i)} + (1 - \bar{\alpha}_\tau)\hat{s}\right)$  // calculate Tweedie estimate
6:      $\hat{y}^{(i)} \leftarrow f(\hat{x}_0^{(i)})$  // simulate full measurement
7:      $z \sim \mathcal{N}(0, I)$
8:      $x_{\tau-1}^{(i)} \leftarrow \frac{\sqrt{\alpha_\tau}(1 - \bar{\alpha}_{\tau-1})}{1 - \bar{\alpha}_\tau} x_\tau^{(i)} + \frac{\sqrt{\bar{\alpha}_{\tau-1}}\,\beta_\tau}{1 - \bar{\alpha}_\tau} \hat{x}_0^{(i)} + \tilde{\sigma}_\tau z$  // reverse diffusion step
9:      $x_{\tau-1}^{(i)} \leftarrow x_{\tau-1}^{(i)} - \zeta \nabla_{x_\tau^{(i)}} \| y_t - U(A_t)\hat{y}^{(i)} \|_2^2$  // data fidelity step
10:   if τ ∈ S then
11:     // find the group of pixels with maximum entropy across simulated measurements
12:     $a_t^* = \arg\max_{a_t} \sum_{l \in a_t} \sum_{i,j} \left(\hat{y}_l^{(i)} - \hat{y}_l^{(j)}\right)^2$
13:     $A_t = A_{t-1} \cup \{a_t^*\}$
14:     $y_t = U(A_t)f(x) + n_t$  // acquire new measurements of the true x
return: $\{\hat{x}_0^{(i)}\}_{i=0}^{N_p-1}$  // return posterior samples

ADS follows an information-maximizing policy, selecting measurements $a_t^* = \arg\max_{a_t}[I(y_t; x \mid A_t, y_{t-1})]$. Assuming a measurement model with additive noise $y_t = U(A_t)f(x) + n_t$, the posterior entropy term $H(y_t \mid x, A_t, y_{t-1})$ in the mutual information is unaffected by the choice of action $a_t$: it is solely determined by the noise component $n_t$. This simplifies the policy function, leaving only the marginal entropy term $H(y_t \mid A_t, y_{t-1})$ to maximize. This entropy term (Equation 2) is the negative expectation of $\log p(y_t \mid A_t, y_{t-1})$, the predictive distribution of measurements $y_t$ given the observations so far and the action set $A_t$. We approximate the true measurement posterior in Equation 8 as a mixture of $N_p$ isotropic Gaussians, with means set to estimates of future measurements under possible actions, and variance $\sigma_y^2 I$.
The means $\hat{y}_t^{(i)} = U(A_t)f(\hat{x}_0^{(i)})$ are computed by applying the forward model to posterior samples $\hat{x}_0^{(i)} \sim p(x \mid y_{t-1})$, which are estimated from the batch of partially denoised particles $x_\tau^{(i)}$ via $\hat{x}_0^{(i)} = D_\tau(x_\tau^{(i)})$, yielding:

$$p(y_t \mid A_t, y_{t-1}) \approx \sum_{i=0}^{N_p - 1} w_i \, \mathcal{N}\!\left(\hat{y}_t^{(i)}, \sigma_y^2 I\right) \quad (8)$$

Note the parameter $\sigma_y^2$, which controls the width of the Gaussians. The entropy of this isotropic Gaussian Mixture Model can then be approximated as follows, as given by Hershey & Olsen (2007):

$$H(y_t \mid A_t, y_{t-1}) \approx \text{constant} + \sum_{i,j} \frac{\| \hat{y}_t^{(i)} - \hat{y}_t^{(j)} \|_2^2}{2\sigma_y^2} \quad (9)$$

Intuitively, this approximation computes the entropy of the mixture of Gaussians by summing the Gaussian error between each pair of means, which grows with distance according to the hyperparameter $\sigma_y^2$. This hyperparameter can therefore be tuned to control the sensitivity of the entropy to outliers.

Figure 3: Illustration of a single action selection using ADS. Panel 1 shows the current batch of partially-denoised images $\{x_\tau^{(i)}\}$ at diffusion step τ. In panel 2, this batch of particles is mapped using Tweedie's formula to a batch of fully-denoised particles $\{\hat{x}_0^{(i)}\}$, constituting the belief distribution at time τ. The forward model f is then applied in panel 3 to simulate the set of measurements resulting from the belief distribution, used to approximate the measurement posterior as a GMM. Given this GMM, the measurement entropy is computed in panel 4 using Equation 10. Finally, in panel 5, the maximum-entropy line is selected as the next measurement location.

Finally, we leverage the fact that $U(A_t)$ is a subsampling matrix to derive an efficient final formulation of our policy. Because $U(A_t)$ is a subsampling matrix, the optimal choice for the next action $a_t$ will be at the region of the measurement space with the largest disagreement among the particles, as measured by the Gaussian error in Equation 9.
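The pairwise-distance entropy proxy of Equation 9 can be sketched in a few lines, assuming equal particle weights; the helper name and toy particle sets below are ours, not the paper's code.

```python
import numpy as np

def entropy_proxy(y_particles, sigma_y):
    """Pairwise-distance proxy for the GMM entropy (up to a constant),
    assuming equal particle weights. A sketch, not the paper's code."""
    # y_particles: (Np, D) simulated measurements f(hat_x0^(i))
    diffs = y_particles[:, None, :] - y_particles[None, :, :]
    return np.sum(diffs ** 2) / (2.0 * sigma_y ** 2)

tight = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1]])   # particles agree
spread = np.array([[0.0, 0.0], [3.0, 0.0], [0.0, 3.0]])  # particles disagree
print(entropy_proxy(tight, 1.0), entropy_proxy(spread, 1.0))
```

Particles that disagree yield a larger proxy value, so maximizing it steers acquisition toward regions where the belief distribution is most uncertain.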
Therefore, rather than computing a separate set of subsampled particles for each possible next subsampling mask, we instead compute a single set of fully-sampled measurement particles $\hat{y}^{(i)} = f(\hat{x}_0^{(i)})$, and simply choose the optimal action as the region with the largest error. For example, when pixel-subsampling an image, the particles $\hat{y}^{(i)}$ become predicted estimates of the full image given the pixels observed so far, and the next sample is chosen as whichever pixel has the largest total error across the particles. Similarly, in accelerated MRI, the next κ-space line selected is the one with the largest error across estimates of the full κ-space. Denoting by $l \in a_t$ the set of indices sampled by each possible action $a_t$, and assuming equal weights for all particles, $w_i = w_j \; \forall i, j$, the final form of the policy function is given as follows (see Appendix A.1 for the derivation):

$$a_t^* = \arg\max_{a_t} \sum_{l \in a_t} \sum_{i,j} \left(\hat{y}_l^{(i)} - \hat{y}_l^{(j)}\right)^2 \quad (10)$$

5 Experiments

We evaluate our method with three sets of experiments, covering a variety of data distributions and application domains. The first two experiments, in Sections 5.1 and 5.2, evaluate the proposed maximum-entropy subsampling strategy employed by ADS against baseline strategies, keeping the generative model fixed to avoid confounding. Next, in Section 5.3, we evaluate the model end-to-end on both sampling and reconstruction by applying it to the real-world task of MRI acceleration with the fastMRI dataset, comparing it to existing supervised approaches.

5.1 MNIST

In order to evaluate the effectiveness of the maximum-entropy subsampling strategy employed by ADS, we compare it to two baseline subsampling strategies on the task of reconstructing images of digits from the MNIST dataset (LeCun et al., 1998). To this end, a diffusion model was trained on the MNIST training dataset, resized to 32×32 pixels. See Appendix A.2.1 for further details on training and architecture.
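The maximum-entropy selection rule of Equation 10 for line-based subsampling can be sketched as follows; array shapes and names are our own illustration, not the paper's implementation.

```python
import numpy as np

def select_max_entropy_line(y_particles):
    """Choose the column (e.g. an image column or kappa-space line) with the
    largest total pairwise squared disagreement across particles, following
    the maximum-entropy policy with equal particle weights. A sketch."""
    # y_particles: (Np, H, W) fully-sampled simulated measurements
    diffs = y_particles[:, None] - y_particles[None, :]     # (Np, Np, H, W)
    per_pixel = np.sum(diffs ** 2, axis=(0, 1))             # (H, W)
    per_line = per_pixel.sum(axis=0)                        # (W,) sum over rows
    return int(np.argmax(per_line))

rng = np.random.default_rng(0)
Np, H, W = 8, 4, 6
y_particles = np.zeros((Np, H, W))
y_particles[..., 2] = rng.normal(size=(Np, H))  # particles disagree only on line 2
print(select_max_entropy_line(y_particles))     # -> 2
```

Because the particles agree everywhere except column 2, that column carries all the pairwise disagreement and is selected as the next measurement.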
Using this trained diffusion model, each subsampling strategy was used to reconstruct 500 unseen samples from the MNIST test set at various subsampling rates. Both pixel-based and line-based subsampling were evaluated, where line-based subsampling selects single-pixel-wide columns. The measurement model is thus $y_t = U(A_t)x$, as there is no measurement noise or measurement transformation, i.e. f(x) = x. The baseline subsampling strategies used for comparison were as follows: (i) Random subsampling selects measurement locations from a uniform categorical distribution without replacement, and (ii) Data Variance subsampling selects measurement locations without replacement from a categorical distribution in which the probability of a given location is proportional to the variance at that location across the training set. The latter can therefore be seen as a data-driven but fixed design strategy. Inference was performed using Diffusion Posterior Sampling for measurement guidance, with guidance weight ζ = 1 and T = 1000 reverse diffusion steps. For ADS, measurements were taken at regular intervals in the window [0, 800], with $N_p = 16$ particles and $\sigma_y = 10$. For the fixed sampling strategies, the subsampling masks were set a priori, such that all diffusion steps are guided by the measurements, as is typical in inverse problem solving with diffusion models. The results of this comparison are illustrated in Figure 4, with numbers provided in Appendix A.3. We use Mean Absolute Error (MAE) as the evaluation metric, since MNIST consists of single-channel brightness values. It is clear from these results that ADS outperforms fixed-mask baselines, most notably in comparison with data-variance sampling: for pixel-based sampling, we find that maximum-entropy sampling with a budget of 100 pixels outperforms data-variance sampling with a budget of 250 pixels, i.e. actively sampling the measurements is as good as having 2.5× the number of measurements with the data-variance strategy.
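The Data Variance baseline described above can be sketched as follows; the helper name and toy training set are our own, and this is an illustration of the described procedure rather than the paper's exact code.

```python
import numpy as np

def data_variance_mask(train_images, budget, rng):
    """Fixed-mask baseline: draw `budget` locations without replacement,
    with probability proportional to the per-location variance over the
    training set. A sketch of the baseline described in the text."""
    var = train_images.var(axis=0).ravel()
    probs = var / var.sum()
    idx = rng.choice(var.size, size=budget, replace=False, p=probs)
    mask = np.zeros(var.size, dtype=bool)
    mask[idx] = True
    return mask.reshape(train_images.shape[1:])

rng = np.random.default_rng(0)
scales = np.linspace(0.1, 2.0, 16).reshape(4, 4)    # per-pixel noise scales
train = rng.normal(scale=scales, size=(100, 4, 4))  # toy "training set"
mask = data_variance_mask(train, budget=5, rng=rng)
print(mask.sum())  # 5
```

Since the mask is fixed once drawn, high-variance locations are favored on average, but the design cannot adapt to the particular sample being reconstructed, which is the gap ADS targets.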
We also find that the standard deviation of the reconstruction errors over the test set is significantly lower than that of the baselines for 25% and 50% subsampling rates (typically 2-3× lower), leading to more reliable reconstructions. Further experiments on MNIST are carried out in Appendix A.3, where ADS is compared to Active Deep Probabilistic Subsampling (Van Gorp et al., 2021), an end-to-end supervised method for active subsampling.

Figure 4: Comparison of ADS (ours) with two non-adaptive baselines (Random and Data Variance), for (a) pixel-based subsampling and (b) line-based subsampling. Evaluated based on reconstruction Mean Absolute Error (MAE) on N = 500 unseen samples from the MNIST test set. Note that MAE is plotted on a log scale.

5.2 CelebA

In order to evaluate ADS on a natural image dataset, a diffusion model was trained on the CelebA (Liu et al., 2015) training set at 128×128 resolution (see Appendix A.2.2 for training details). ADS was then benchmarked on N = 100 samples from the CelebA test set against DPS with baseline sampling strategies. This evaluation was carried out for a number of sampling rates |S| ∈ {50, 100, 200, 300, 500}. The measurement scheme employed here samples boxes of size 4×4 pixels from the image. The number of diffusion steps taken during inference was chosen based on the sampling rate, with 400 steps for |S| < 300 and 600 steps for |S| = 500. In each case, for sampling rate |S| the sampling window [10, |S| + 10] was used. These sampling windows were chosen empirically, finding that some initial unguided reverse diffusion steps help create a better initial estimate of the posterior. $N_p = 16$ particles with $\sigma_y = 1$ were used to model the posterior distribution.
The peak signal-to-noise ratio (PSNR) values between the posterior mean and test samples are provided in Figure 5a and Table 1, with further metrics, including metrics across individual posterior samples, provided in Appendix A.5. It is clear from these results that ADS outperforms baseline sampling strategies with DPS across a range of sampling rates. Note, for example, that ADS with 200 measurements achieves a higher PSNR than random-mask DPS with 300 measurements. In Figure 5b, a selection of samples is provided to exemplify the mask designs produced by ADS. Note here that the masks focus on important features, such as facial features, in order to minimize the posterior entropy, leading to higher information gain about the target and ultimately better recovery.

Figure 5: In (a), PSNR (↑) scores for ADS vs. DPS with random and data-variance sampling on N = 100 unseen samples from the CelebA dataset are plotted, for increasing numbers of measurements. A measurement here is a 4×4 box of pixels. (b) shows examples of ADS inference versus DPS with random measurements on the CelebA dataset from the evaluation with 200 boxes sampled.

| # Samples (Max %) | Random DPS | Data Variance DPS | ADS |
|---|---|---|---|
| 50 (4.88%) | 18.099 (0.019) | 17.747 (0.018) | 18.932 (0.024) |
| 100 (9.76%) | 20.580 (0.019) | 19.892 (0.017) | 22.725 (0.024) |
| 200 (19.53%) | 23.555 (0.020) | 22.895 (0.018) | 27.954 (0.026) |
| 300 (29.29%) | 25.919 (0.019) | 24.779 (0.019) | 31.483 (0.027) |
| 500 (48.82%) | 29.123 (0.018) | 27.512 (0.018) | 35.630 (0.027) |

Table 1: PSNR (↑) on test samples from the CelebA dataset. Next to each # Samples |S| is a Max % indicating the maximum % of a 128×128 image that |S| 4×4 boxes could cover.
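For reference, PSNR as reported above follows the standard definition; this helper is our own, not the paper's evaluation code.

```python
import numpy as np

def psnr(x, y, max_val=1.0):
    """Peak signal-to-noise ratio in dB between images x and y,
    with peak value max_val (standard definition)."""
    mse = np.mean((np.asarray(x) - np.asarray(y)) ** 2)
    return 10.0 * np.log10(max_val ** 2 / mse)

a = np.zeros((8, 8))
b = np.full((8, 8), 0.1)     # constant error of 0.1 -> MSE = 0.01
print(psnr(a, b))            # 20.0 dB
```

Higher is better; each 10 dB corresponds to a 10× reduction in mean squared error relative to the peak.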
5.3 MRI Acceleration

To assess the real-world practicability of ADS, it was evaluated on the popular fastMRI (Zbontar et al., 2018) 4× acceleration benchmark for knee MRIs. In this task, one must reconstruct a fully-sampled knee MRI image given a budget of only 25% of the κ-space measurements, where each κ-space measurement is a vertical line of width 1 pixel. We compare with existing MRI acceleration methods focused specifically on learning sampling strategies, namely PG-MRI (Bakker et al., 2020), LOUPE (Bahadir et al., 2020), and SeqMRI (Yin et al., 2021), each of which is detailed in Appendix A.9. We use the same train / validation / test split and data preprocessing as Yin et al. (2021) for comparability. In particular, the data samples are κ-space slices cropped and centered at 128×128, with 34,732 train samples, 1,785 validation samples, and 1,851 test samples. We train a diffusion model on complex-valued image-space samples $x \in \mathbb{C}^{128 \times 128}$ obtained by computing the inverse Fourier transform of the κ-space training samples (see Appendix A.2.3 for further training details). The data space is therefore the complex image space, with κ-space acting as the measurement space. This yields the measurement model $y_t = U(A_t)F(x) + n_t$, where F is the Discrete Fourier Transform, $n_t \sim \mathcal{N}_{\mathbb{C}}(0, \sigma_n)$ is complex Gaussian measurement noise, and $U(A_t)$ is the subsampling matrix selecting samples at indices $A_t$. ADS proceeds by running Diffusion Posterior Sampling in the complex image domain with guidance from κ-space measurements through the measurement model, selecting maximum-entropy lines in κ-space. We observed on data from the validation set that ADS reconstruction performance increases with the number of reverse diffusion steps, although with diminishing returns. This indicates that in applying ADS, one can choose to increase sample quality at the cost of inference time and compute.
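The κ-space line measurement model above can be sketched with a plain 2-D FFT; the function and variable names are illustrative, not from the paper's code.

```python
import numpy as np

def kspace_line_measurement(x, line_indices, noise_std=0.0, rng=None):
    """Measure selected vertical kappa-space lines of image x:
    y_t = U(A_t) F(x) + n_t, with F a 2-D DFT and optional complex
    Gaussian noise. A sketch of the measurement model described above."""
    k = np.fft.fft2(x)                       # F(x): full kappa-space
    y = k[:, line_indices]                   # U(A_t): keep selected columns
    if noise_std > 0.0:
        rng = rng or np.random.default_rng()
        noise = rng.normal(size=y.shape) + 1j * rng.normal(size=y.shape)
        y = y + noise_std * noise
    return y

x = np.ones((8, 8))                          # toy image
y = kspace_line_measurement(x, [0, 3])       # measure two lines, noiseless
print(y.shape)  # (8, 2)
```

For the constant toy image, all spectral energy sits in the DC bin, so the measured line 0 contains a single nonzero entry and line 3 is zero, illustrating why informative line selection matters.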
To showcase the potential of ADS, we chose a large number of steps, T = 10k. Further, we chose guidance weight ζ = 0.85, a sampling schedule S evenly partitioning [50, 2500], and an initial action set $A_{\mathrm{init}}$ = {63}, starting with a single central κ-space line. $N_p = 16$ and $\sigma_y = 50$ were used to model the posterior distribution. Reconstructions were evaluated using the structural similarity index measure (SSIM) (Wang et al., 2004), comparing the absolute values of the fully-sampled target image and the reconstructed image. The SSIM uses a window size of 4×4 with $k_1 = 0.01$ and $k_2 = 0.03$, as set by the fastMRI challenge. Table 2 shows the SSIM results on the test set, comparing ADS to recent supervised methods, along with fixed-mask Diffusion Posterior Sampling using the same inference parameters, serving as a strong unsupervised baseline. The fixed mask used with DPS measures the 8% of lines at the center of κ-space, and random lines elsewhere, as used by Zbontar et al. (2018). It is clear from the results that ADS performs competitively with supervised baselines, and outperforms the fixed-mask diffusion-based approach. Figure 6 shows two reconstructions created by ADS. See Appendix A.11 for a histogram of all SSIMs over the test set for the diffusion-based approaches. Finally, we note the inference time for this model, which is an important factor in making active sampling worthwhile. Our model for fastMRI (Table 2) uses 40 ms / step with 76 steps per acquisition, leading to 3040 ms per acquisition on our NVIDIA GeForce RTX 2080 Ti GPU. A typical acquisition time for a κ-space line in MRI is 500-2500 ms, or higher, depending on the desired quality (Jung & Weigel, 2013). Given modern hardware with increased FLOPs, we believe that this method is already near real-time, even without employing additional tricks to accelerate inference, such as quantization (Shang et al., 2023) or distillation (Salimans & Ho, 2022).
Method                         Unsupervised   SSIM (↑)
PG-MRI (Bakker et al., 2020)                  87.97
LOUPE (Bahadir et al., 2020)                  89.52
Fixed-mask DPS                 ✓              90.13
SeqMRI (Yin et al., 2021)                     91.08
ADS (Ours)                     ✓              91.26

Table 2: SSIM scores on the fastMRI knee test set with 4× acceleration. See Appendix A.11 for SSIM histograms for the diffusion-based methods.

5.4 Focused Ultrasound Scan-Line Subsampling

As a final experiment, we explore focused ultrasound scan-line subsampling (Huijben et al., 2020). In focused ultrasound, a set of focused scan-lines spanning a region of tissue is typically acquired, with the lines displayed side-by-side to create an image. By acquiring only a small subset of these scan-lines, the power consumption and data-transfer-rate requirements of the probe can be reduced. Successful scan-line subsampling could find applications in (i) extending battery life in wireless ultrasound probes (Baribeau et al., 2020), (ii) reducing bandwidth requirements for cloud-based ultrasound (Federici et al., 2023), and (iii) increasing frame rates in 3D ultrasound (Giangrossi et al., 2022). In this experiment, we demonstrate using ADS to identify informative subsets of scan-lines, from which the full image can be reconstructed.

Figure 6: Sample fastMRI reconstructions produced by ADS, including the generated κ-space masks. The SSIMs are 95.8 for the left and 94.4 for the right.

To this end, we trained a diffusion model on samples from the CAMUS (Leclerc et al., 2019) dataset, with the train set consisting of 7,448 frames from cardiac ultrasound scans across 500 patients, resized to 128×128 in the polar domain.
Each column of pixels in the polar domain is taken to represent a single beamformed scan-line, yielding the model y_t = U(A_t)x, where x is the fully-sampled target image, U(A_t) is a mask selecting a set of scan-lines, and y_t is the set of scan-line measurements selected by A_t. Inference takes place in the polar domain, and results are scan-converted for visualization. For ADS inference, the parameters chosen were T = 400 steps, S evenly partitioning the interval [0, 320], Np = 16, σy = 1, and ζ = 10. We benchmark ADS against DPS using random and data-variance fixed-mask strategies on a test set consisting of frames from 50 unseen patients, finding that ADS significantly outperforms both in terms of reconstruction quality metrics. Figure 7 presents the PSNR results along with a sample reconstruction produced by ADS versus random fixed-mask DPS, with further quantitative results and sample outputs provided in Appendix A.5.

Figure 7: (a) A comparison of selected measurements and reconstructions generated by ADS versus DPS with a random mask on the CAMUS (Leclerc et al., 2019) dataset. The measurement budget is 16/128 = 12.5% of the scan-lines. (b) PSNR scores across a held-out test set consisting of 875 frames from 50 unseen patients, with measurement budgets of 16 and 32 scan-lines, representing subsampling rates of 12.5% and 25%, respectively.

6 Discussion

While ADS appears to outperform fixed-mask baselines on MNIST, CelebA, and fastMRI, it is interesting to note that the relative improvement offered by ADS is less pronounced in the case of fastMRI.
For line-based image subsampling on MNIST with 25% samples, ADS achieves a 50% reduction in reconstruction error versus a fixed-mask approach (MAE = 0.019 vs. 0.037), whereas with line-based κ-space subsampling for fastMRI with 25% samples, ADS achieves only a 12% relative reduction in error (SSIM = 91.26 vs. 90.13). We find that the size of this performance gap between fixed and active mask design strategies can be explained by examining the distribution of masks designed by ADS on each task (see Appendix A.10). Indeed, the masks designed for fastMRI are very similar to one another, whereas those designed for MNIST typically differ depending on the sample. When mask designs are similar, fixed masks will perform similarly to actively designed masks. This is in part a feature of the data distribution: for example, most information in κ-space is contained in the center, at lower frequencies. Tasks in which one might expect significantly better performance from ADS are therefore those in which the optimal masks are highly sample-dependent. Another interesting trend appears in the results when observing the relative improvement of ADS over fixed-mask strategies as a function of the number of samples taken. One might expect that the relative improvement is highest when the subsampling rate is lowest, monotonically decreasing as more samples are taken. In fact, however, we observe that the largest relative improvement appears not at the lowest subsampling rates, but rather at medium rates, e.g. 10%–50%. One possible explanation for this is the under-performance of diffusion posterior sampling in scenarios with very few measurements, leading to inaccurate estimates of the posterior distribution and therefore sub-optimal sampling strategies. Further study of the performance of posterior sampling methods in cases with very few measurements could provide an interesting direction for future work.
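The relative-improvement figures above can be reproduced with a few lines of arithmetic. This is our own reading of the quoted numbers, treating 100 − SSIM as the reconstruction-error term for fastMRI:

```python
# MNIST, line-based, 25% budget: relative MAE reduction, ADS vs. fixed mask.
mae_fixed, mae_ads = 0.037, 0.019
mnist_gain = (mae_fixed - mae_ads) / mae_fixed  # ~0.49, i.e. ~50%

# fastMRI, 25% budget: relative reduction in (100 - SSIM),
# ADS vs. fixed-mask DPS (our interpretation of the ~12% figure).
ssim_fixed, ssim_ads = 90.13, 91.26
mri_gain = ((100 - ssim_fixed) - (100 - ssim_ads)) / (100 - ssim_fixed)
print(round(mnist_gain, 2), round(mri_gain, 2))  # prints: 0.49 0.11
```

The second figure lands near the quoted ~12%, illustrating how much smaller the active-sampling gain is on fastMRI than on MNIST.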
We also observe that data-variance sampling outperforms random sampling on MNIST, but not on CelebA. This somewhat surprising result can be explained by noting that the variance in MNIST appears in a highly informative region, namely around the center of the images where the digits are contained. In contrast, the variance in CelebA appears mostly in the background, where backgrounds may exhibit large deviations in brightness and color. Despite the large variance in the backgrounds, they are typically homogeneous blocks of color that can be inpainted well from sparse samples. Hence, taking most samples in this high-variance background region leads to suboptimal sampling strategies and reconstructions. A final point of discussion concerns choosing the number of particles Np for a particular task. We find in Appendix A.5 that 8–16 particles are sufficient for modelling the posterior in CelebA, with diminishing returns for increasing Np. In general, we advise choosing Np such that the modes of the posterior distribution can be sufficiently represented, so that regions of disagreement between modes can be identified by the entropy computation. For example, an MNIST dataset containing only 2 digits will require fewer particles than an MNIST dataset containing 10. In conclusion, we have proposed a method for using diffusion models as active subsampling agents without the need for additional training, using a simple, interpretable action selection policy. We show that this method significantly outperforms fixed-mask baselines on MNIST, CelebA, and ultrasound scan-line subsampling, and competes with existing supervised approaches in MRI acceleration without task-specific training. This method therefore takes a step towards transparent active sensing with automatically generated adaptive strategies, decreasing cost factors such as exposure time and energy usage.
7 Limitations & Future Work

While the experiments in Section 5 evidence some strengths of ADS over baseline sampling strategies, it is not without limitations. For example, the duration of inference in ADS is dependent on that of the diffusion posterior sampling method. Since low latency is essential in active subsampling, future work could aim to accelerate posterior sampling with diffusion models, leading to accelerated ADS. Another limitation is that the number of measurements taken is upper-bounded by the number of reverse diffusion steps T; this limitation could be overcome by extending ADS to generate batch designs (Azimi et al., 2010), containing multiple measurements, from a single posterior estimate. Future work applying ADS in diverse domains would also help to further assess the robustness of the method. Finally, as mentioned in Section 6, we note that ADS does not show significant improvements over baseline sampling strategies under very low sampling rates, e.g. < 5%. We hypothesize that this may be due to inaccurate posterior sampling with very few measurements, and believe that this presents an interesting direction for future study.

References

Brian DO Anderson. Reverse-time diffusion equation models. Stochastic Processes and their Applications, 12(3):313–326, 1982.

Javad Azimi, Alan Fern, and Xiaoli Fern. Batch Bayesian optimization via simulation matching. Advances in Neural Information Processing Systems, 23, 2010.

Cagla D Bahadir, Alan Q Wang, Adrian V Dalca, and Mert R Sabuncu. Deep-learning-based optimization of the under-sampling pattern in MRI. IEEE Transactions on Computational Imaging, 6:1139–1152, 2020.

Tim Bakker, Herke van Hoof, and Max Welling. Experimental design for MRI by greedy policy search. Advances in Neural Information Processing Systems, 33:18954–18966, 2020.

Luca Baldassarre, Yen-Huan Li, Jonathan Scarlett, Baran Gözcü, Ilija Bogunovic, and Volkan Cevher.
Learning-based compressive subsampling. IEEE Journal of Selected Topics in Signal Processing, 10(4):809–822, 2016.

Yanick Baribeau, Aidan Sharkey, Omar Chaudhary, Santiago Krumm, Huma Fatima, Feroze Mahmood, and Robina Matyal. Handheld point-of-care ultrasound probes: the new generation of POCUS. Journal of Cardiothoracic and Vascular Anesthesia, 34(11):3139–3145, 2020.

James Bradbury, Roy Frostig, Peter Hawkins, Matthew James Johnson, Chris Leary, Dougal Maclaurin, George Necula, Adam Paszke, Jake VanderPlas, Skye Wanderman-Milne, and Qiao Zhang. JAX: composable transformations of Python+NumPy programs, 2018. URL http://github.com/google/jax.

Gábor Braun, Sebastian Pokutta, and Yao Xie. Info-greedy sequential adaptive compressed sensing. IEEE Journal of Selected Topics in Signal Processing, 9(4):601–611, 2015.

Robert Bridson. Fast Poisson disk sampling in arbitrary dimensions. SIGGRAPH Sketches, 10(1):1, 2007.

Xander Campman, Zijian Tang, Hadi Jamali-Rad, Boris Kuvshinov, Mike Danilouchkine, Ying Ji, Wim Walk, and Dirk Smit. Sparse seismic wavefield sampling. The Leading Edge, 36(8):654–660, 2017.

Jingjie Cao, Yanfei Wang, Jingtao Zhao, and Changchun Yang. A review on restoration of seismic wavefields based on regularization and compressive sensing. Inverse Problems in Science and Engineering, 19(5):679–704, 2011.

Ariel Caticha. Entropy, information, and the updating of probabilities. Entropy, 23(7):895, 2021.

François Chollet et al. Keras. https://keras.io, 2015.

Hyungjin Chung and Jong Chul Ye. Score-based diffusion models for accelerated MRI. Medical Image Analysis, 80:102479, 2022.

Hyungjin Chung, Jeongsol Kim, Michael T Mccann, Marc L Klasky, and Jong Chul Ye. Diffusion posterior sampling for general noisy inverse problems. arXiv preprint arXiv:2209.14687, 2022.

Giannis Daras, Hyungjin Chung, Chieh-Hsin Lai, Yuki Mitsufuji, Jong Chul Ye, Peyman Milanfar, Alexandros G Dimakis, and Mauricio Delbracio. A survey on diffusion models for inverse problems.
arXiv preprint arXiv:2410.00083, 2024.

Prafulla Dhariwal and Alexander Nichol. Diffusion models beat GANs on image synthesis. In M. Ranzato, A. Beygelzimer, Y. Dauphin, P.S. Liang, and J. Wortman Vaughan (eds.), Advances in Neural Information Processing Systems, volume 34, pp. 8780–8794. Curran Associates, Inc., 2021. URL https://proceedings.neurips.cc/paper_files/paper/2021/file/49ad23d1ec9fa4bd8d77d02681df5cfa-Paper.pdf.

Bradley Efron. Tweedie's formula and selection bias. Journal of the American Statistical Association, 106(496):1602–1614, 2011.

Noam Elata, Tomer Michaeli, and Michael Elad. Adaptive compressed sensing with diffusion-based posterior sampling. In European Conference on Computer Vision, pp. 290–308. Springer, 2025.

Peter Faltin, Kraisorn Chaisaowong, Thomas Kraus, and Til Aach. Markov-Gibbs model based registration of CT lung images using subsampling for the follow-up assessment of pleural thickenings. In 2011 18th IEEE International Conference on Image Processing, pp. 2181–2184. IEEE, 2011.

Beatrice Federici, Ben Luijten, Andre Immink, Ruud JG van Sloun, and Massimo Mischi. Cloud-based ultrasound platform for advanced real-time image formation. In Dutch Biomedical Engineering Conference, 2023.

Berthy T Feng, Jamie Smith, Michael Rubinstein, Huiwen Chang, Katherine L Bouman, and William T Freeman. Score-based diffusion models as principled priors for inverse imaging. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10520–10531, 2023.

Claudio Giangrossi, Alessandro Ramalli, Alessandro Dallai, Daniele Mazierli, Valentino Meacci, Enrico Boni, and Piero Tortoli. Requirements and hardware limitations of high-frame-rate 3-D ultrasound imaging systems. Applied Sciences, 12(13):6562, 2022.

Daniel Golovin and Andreas Krause. Adaptive submodularity: Theory and applications in active learning and stochastic optimization.
Journal of Artificial Intelligence Research, 42:427–486, 2011.

John R Hershey and Peder A Olsen. Approximating the Kullback–Leibler divergence between Gaussian mixture models. In 2007 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP '07), volume 4, pp. IV-317. IEEE, 2007.

Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems, 33:6840–6851, 2020.

Neil Houlsby, Ferenc Huszár, Zoubin Ghahramani, and Máté Lengyel. Bayesian active learning for classification and preference learning. arXiv preprint arXiv:1112.5745, 2011.

Iris Huijben, Bastiaan S Veeling, and Ruud JG van Sloun. Deep probabilistic subsampling for task-adaptive compressed sensing. In 8th International Conference on Learning Representations, ICLR 2020, 2020.

Bernd André Jung and Matthias Weigel. Spin echo magnetic resonance imaging. Journal of Magnetic Resonance Imaging, 37(4):805–817, 2013.

Sarah Leclerc, Erik Smistad, Joao Pedrosa, Andreas Østvik, Frederic Cervenansky, Florian Espinosa, Torvald Espeland, Erik Andreas Rye Berg, Pierre-Marc Jodoin, Thomas Grenier, et al. Deep learning for segmentation using an open large-scale dataset in 2D echocardiography. IEEE Transactions on Medical Imaging, 38(9):2198–2210, 2019.

Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.

Dennis V Lindley. On a measure of the information provided by an experiment. The Annals of Mathematical Statistics, 27(4):986–1005, 1956.

Ziwei Liu, Ping Luo, Xiaogang Wang, and Xiaoou Tang. Deep learning face attributes in the wild. In Proceedings of the IEEE International Conference on Computer Vision, pp. 3730–3738, 2015.

Michael Lustig and John M Pauly. SPIRiT: Iterative self-consistent parallel imaging reconstruction from arbitrary k-space.
Magnetic Resonance in Medicine, 64(2):457–471, 2010.

Peyman Milanfar and Mauricio Delbracio. Denoising: A powerful building-block for imaging, inverse problems, and machine learning. arXiv preprint arXiv:2409.06219, 2024.

Eduardo Pérez, Jan Kirchhof, Fabian Krieg, and Florian Römer. Subsampling approaches for compressed sensing with ultrasound arrays in non-destructive testing. Sensors, 20(23):6734, 2020.

Tom Rainforth, Adam Foster, Desi R Ivanova, and Freddie Bickford Smith. Modern Bayesian experimental design. Statistical Science, 39(1):100–114, 2024.

Meenu Rani, Sanjay B Dhok, and Raghavendra B Deshmukh. A systematic review of compressive sensing: Concepts, implementations and applications. IEEE Access, 6:4875–4894, 2018.

Herbert E Robbins. An empirical Bayes approach to statistics. In Breakthroughs in Statistics: Foundations and Basic Theory, pp. 388–394. Springer, 1992.

Litu Rout, Negin Raoof, Giannis Daras, Constantine Caramanis, Alex Dimakis, and Sanjay Shakkottai. Solving linear inverse problems provably via posterior sampling with latent diffusion models. Advances in Neural Information Processing Systems, 36, 2024.

Tim Salimans and Jonathan Ho. Progressive distillation for fast sampling of diffusion models. arXiv preprint arXiv:2202.00512, 2022.

Thomas Sanchez, Igor Krawczuk, Zhaodong Sun, and Volkan Cevher. Closed loop deep Bayesian inversion: Uncertainty driven acquisition for fast MRI, 2020. URL https://openreview.net/forum?id=BJlPOlBKDB.

Yuzhang Shang, Zhihang Yuan, Bin Xie, Bingzhe Wu, and Yan Yan. Post-training quantization on diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1972–1981, 2023.

Roy Shaul, Itamar David, Ohad Shitrit, and Tammy Riklin Raviv. Subsampled brain MRI reconstruction by generative adversarial neural networks. Medical Image Analysis, 65:101747, 2020.

Jiaming Song, Arash Vahdat, Morteza Mardani, and Jan Kautz.
Pseudoinverse-guided diffusion models for inverse problems. In International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=9_gsMA8MRKQ.

Yang Song, Jascha Sohl-Dickstein, Diederik P Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score-based generative modeling through stochastic differential equations. arXiv preprint arXiv:2011.13456, 2020.

Tristan SW Stevens, Nishith Chennakeshava, Frederik J de Bruijn, Martin Pekař, and Ruud JG van Sloun. Accelerated intravascular ultrasound imaging using deep reinforcement learning. In ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 1216–1220. IEEE, 2022.

Tristan SW Stevens, Faik C Meral, Jason Yu, Iason Z Apostolakis, Jean-Luc Robert, and Ruud JG Van Sloun. Dehazing ultrasound using diffusion models. IEEE Transactions on Medical Imaging, 2024.

Koen CE van de Camp, Hamdi Joudeh, Duarte J Antunes, and Ruud JG van Sloun. Active subsampling using deep generative models by maximizing expected information gain. In ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 1–5. IEEE, 2023.

Hans Van Gorp, Iris Huijben, Bastiaan S Veeling, Nicola Pezzotti, and Ruud JG Van Sloun. Active deep probabilistic subsampling. In International Conference on Machine Learning, pp. 10509–10518. PMLR, 2021.

Zhou Wang, Alan C Bovik, Hamid R Sheikh, and Eero P Simoncelli. Image quality assessment: from error visibility to structural similarity. IEEE Transactions on Image Processing, 13(4):600–612, 2004.

Tianwei Yin, Zihui Wu, He Sun, Adrian V Dalca, Yisong Yue, and Katherine L Bouman. End-to-end sequential sampling and reconstruction for MRI. In Machine Learning for Health, pp. 261–281. PMLR, 2021.

Peiyu Yu, Sirui Xie, Xiaojian Ma, Baoxiong Jia, Bo Pang, Ruiqi Gao, Yixin Zhu, Song-Chun Zhu, and Ying Nian Wu.
Latent diffusion energy-based model for interpretable text modelling. In International Conference on Machine Learning, pp. 25702–25720. PMLR, 2022.

Jure Zbontar, Florian Knoll, Anuroop Sriram, Tullie Murrell, Zhengnan Huang, Matthew J Muckley, Aaron Defazio, Ruben Stern, Patricia Johnson, Mary Bruno, et al. fastMRI: An open dataset and benchmarks for accelerated MRI. arXiv preprint arXiv:1811.08839, 2018.

Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 586–595, 2018.

A.1 Derivation of Equation (10)

Here we show that maximising the policy function does not require computing a set of particles for each possible action in the case where the action is a subsampling mask. Because the subsampling mask A_t = A_{t-1} ∪ a_t only varies in a_t within the arg max, the elements of each particle ŷ^(i) will remain the same for each possible A_t, except at those indices selected by a_t. We therefore decompose the squared L2 norm into two squared L2 norms, one for the indices in a_t and the other for those in A_{t-1}. The latter then becomes a constant in the arg max, and can be ignored. This results in a formulation in which we only need to compute the squared L2 norms for the set of elements corresponding with a_t. We use U(A_t) to indicate the subsampling matrix containing 1s on the diagonal at indices in A_t.
$$
\begin{aligned}
a_t^\ast
&= \arg\max_{a_t} \sum_{i,j} \frac{\lVert \hat{y}_t^{(i)} - \hat{y}_t^{(j)} \rVert_2^2}{2\sigma_y^2} \\
&= \arg\max_{a_t} \sum_{i,j} \frac{\lVert U(A_t) f(\hat{x}_0^{(i)}) - U(A_t) f(\hat{x}_0^{(j)}) \rVert_2^2}{2\sigma_y^2} \\
&= \arg\max_{a_t} \sum_{i,j} \sum_{k \in A_t} \bigl( f(\hat{x}_0^{(i)})_k - f(\hat{x}_0^{(j)})_k \bigr)^2 \\
&= \arg\max_{a_t} \sum_{i,j} \Bigl[ \sum_{l \in a_t} \bigl( f(\hat{x}_0^{(i)})_l - f(\hat{x}_0^{(j)})_l \bigr)^2 + \sum_{m \in A_{t-1}} \bigl( f(\hat{x}_0^{(i)})_m - f(\hat{x}_0^{(j)})_m \bigr)^2 \Bigr] \\
&= \arg\max_{a_t} \sum_{i,j} \sum_{l \in a_t} \bigl( f(\hat{x}_0^{(i)})_l - f(\hat{x}_0^{(j)})_l \bigr)^2 \\
&= \arg\max_{a_t} \sum_{i,j} \sum_{l \in a_t} \bigl( \hat{y}_l^{(i)} - \hat{y}_l^{(j)} \bigr)^2
\end{aligned}
$$

A.2 Training Details

The methods and models are implemented in the Keras 3.1 (Chollet et al., 2015) library using the JAX backend (Bradbury et al., 2018). The DDIM architecture is provided by Keras 3 at the following URL: https://keras.io/examples/generative/ddim/. Each model was trained using one GeForce RTX 2080 Ti (NVIDIA, Santa Clara, CA, USA) with 11 GB of VRAM.

A.2.1 MNIST

The model was trained for 500 epochs with the following parameters: widths=[32, 64, 128], block_depth=2, diffusion_steps=30, ema=0.999, learning_rate=0.0001, weight_decay=0.0001, loss="mae".

A.2.2 CelebA

The model was trained for 200 epochs with the following parameters: widths=[32, 64, 96, 128], block_depth=2, diffusion_steps=30, ema=0.999, learning_rate=0.0001, weight_decay=0.0001, loss="mae".

A.2.3 fastMRI

The model was trained for 305 epochs with the following parameters: widths=[32, 64, 96, 128], block_depth=2, diffusion_steps=30, ema=0.999, learning_rate=0.0001, weight_decay=0.0001, loss="mae".
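The final expression in the derivation of A.1 admits an efficient implementation: compute the pairwise squared differences between particles once, then score each candidate line by summing over only its own indices. Below is a minimal sketch of this selection rule under our own naming conventions (not the paper's code):

```python
import numpy as np

def select_max_entropy_line(y_particles, candidate_lines):
    """Pick the candidate line maximizing the criterion from Eq. (10):
    the sum over particle pairs (i, j) of sum_{l in a_t} (y_l^(i) - y_l^(j))^2.

    y_particles: (Np, H, W) array of predicted measurements f(x_hat_0^(i))
    candidate_lines: column indices not yet acquired
    """
    # Pairwise squared differences between particles, per pixel: (Np, Np, H, W).
    # abs() keeps the score real-valued for complex k-space particles.
    diffs = np.abs(y_particles[:, None] - y_particles[None, :]) ** 2
    # Aggregate over particle pairs and rows -> per-column disagreement score
    col_scores = diffs.sum(axis=(0, 1, 2))
    best = max(candidate_lines, key=lambda l: col_scores[l])
    return best, col_scores
```

As the derivation shows, scoring only the candidate indices is equivalent to scoring the full mask A_t, since the contribution of the already-acquired indices A_{t-1} is constant across candidates.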
A.3 MNIST Metrics

# Pixels (%)     Random         Data Variance   ADS (Ours)
10   (0.97%)     0.231 (.002)   0.197 (.002)    0.207 (.002)
25   (2.44%)     0.190 (.002)   0.125 (.002)    0.124 (.002)
50   (4.88%)     0.140 (.002)   0.067 (.001)    0.042 (.001)
100  (9.76%)     0.074 (.001)   0.034 (.001)    0.011 (.000)
250  (24.41%)    0.024 (.000)   0.015 (.000)    0.007 (.000)
500  (48.82%)    0.011 (.000)   0.008 (.000)    0.007 (.000)

# Lines (%)      Random         Data Variance   ADS (Ours)
2    (6.25%)     0.197 (.003)   0.152 (.002)    0.148 (.002)
4    (12.5%)     0.143 (.002)   0.086 (.002)    0.071 (.001)
8    (25%)       0.073 (.001)   0.037 (.001)    0.001 (.000)
16   (50%)       0.022 (.001)   0.013 (.000)    0.000 (.000)
24   (75%)       0.010 (.000)   0.0076 (.000)   0.0074 (.000)

Table 3: Mean and (standard error) of the Mean Absolute Error (↓) in reconstruction of MNIST samples, using pixel- and line-based subsampling.

A.4 Comparison with Active Deep Probabilistic Subsampling

In this appendix we provide qualitative and quantitative results comparing ADS to a supervised active sampling approach, Active Deep Probabilistic Subsampling (ADPS) by Van Gorp et al. (2021). ADPS performs active sampling by training a Long Short-Term Memory (LSTM) model to produce logits for each sampling location at each iteration. This is trained jointly with a "task model", which performs some task based on the partial observations; this could be, for example, a reconstruction model. We chose to compare ADS to ADPS on the task of reconstructing MNIST digits to facilitate a qualitative comparison of the masks generated by each method. In order to compare on a reconstruction task, we adapted the MNIST classification model provided in the ADPS codebase¹ for reconstruction by replacing their fully-connected classification layers with a large UNet and using an MSE training objective to reconstruct 32×32-pixel MNIST digits from partial observations containing 81/1024 ≈ 8% of the pixels.
We chose a UNet architecture with widths=[32, 64, 128, 256] and block_depth=2 as the reconstruction model for ADPS, making it similar to the UNet denoising network used by ADS. Both models were trained on the MNIST training set and evaluated on 1000 unseen samples from the validation set. The results are presented in Figures 8, 9, 10, and 11, indicating that ADS outperforms ADPS in terms of reconstruction error, and generates more adaptive and intuitive masks. One reason for the limited adaptivity in the masks generated by ADPS could be the fact that the reconstruction model is trained jointly with the sampling model, leading to a strategy wherein the reconstruction model can depend on receiving certain inputs and not others. We note, however, that the performance of ADPS could change depending on the choice of reconstruction model, loss function, and other design choices. In general, the major advantage of ADS over ADPS is that it does not need to be re-trained for different choices of forward model or subsampling rate: all that's needed is a diffusion prior, and sampling schemes can vary freely at inference time.

¹https://github.com/IamHuijben/Deep-Probabilistic-Subsampling/tree/master/ADPS_vanGorp2021/MNIST_Classification

Figure 8: t-SNE plots of the masks generated by (a) ADS and (b) ADPS for 1000 unseen samples from the MNIST validation set. The t-SNE plot for ADS appears to exhibit stronger clustering between different digits, indicating that the generated masks are more digit-dependent. ADPS creates a distinct mask for the digit "1", but there is much overlap across the other digits.

Figure 9: The mean across all masks for each digit generated by (a) ADS and (b) ADPS.
ADS has generated masks that trace the digits, whereas ADPS has generated spread-out masks in the central region of the image where digits may appear.

Figure 10: Random samples from the test set with masks, measurements, and reconstructions generated by (a) ADS and (b) ADPS.

Figure 11: Digit-wise comparison of the Mean Absolute Error between targets and reconstructions generated by ADS versus ADPS. The error bars indicate the standard error of the mean.

A.5 CelebA Further Metrics

Table 4: MAE (↓) of the posterior mean on targets from the CelebA test set.

Num Samples   Random DPS       Data Variance DPS   ADS
50            19.768 (0.043)   20.625 (0.044)      19.569 (0.060)
100           13.623 (0.028)   14.809 (0.028)      11.731 (0.034)
200            8.909 (0.020)    9.786 (0.020)       6.356 (0.018)
300            6.538 (0.014)    7.507 (0.016)       4.426 (0.012)
500            4.320 (0.009)    5.185 (0.010)       2.825 (0.007)

Table 5: SSIM (↑) of the posterior mean on targets from the CelebA test set.

Num Samples   Random DPS      Data Variance DPS   ADS
50            0.577 (0.001)   0.557 (0.001)       0.567 (0.001)
100           0.687 (0.001)   0.652 (0.001)       0.697 (0.001)
200           0.791 (0.001)   0.759 (0.000)       0.831 (0.001)
300           0.851 (0.000)   0.820 (0.000)       0.889 (0.000)
500           0.911 (0.000)   0.881 (0.000)       0.941 (0.000)

Table 6: LPIPS (↓) (Zhang et al., 2018) of the posterior mean on targets from the CelebA test set.
Num Samples   Random DPS      Data Variance DPS   ADS
50            0.306 (0.001)   0.320 (0.000)       0.287 (0.001)
100           0.230 (0.000)   0.248 (0.000)       0.190 (0.001)
200           0.152 (0.000)   0.174 (0.000)       0.103 (0.000)
300           0.111 (0.000)   0.131 (0.000)       0.071 (0.000)
500           0.069 (0.000)   0.086 (0.000)       0.043 (0.000)

Table 7: Mean MAE (↓) across posterior samples on targets from the CelebA test set.

Num Samples   Random DPS       Data Variance DPS   ADS
50            24.486 (0.048)   25.321 (0.050)      23.895 (0.067)
100           17.436 (0.035)   18.493 (0.034)      14.463 (0.040)
200           11.255 (0.025)   12.337 (0.024)       7.704 (0.021)
300            8.230 (0.018)    9.372 (0.020)       5.225 (0.014)
500            5.389 (0.011)    6.477 (0.013)       3.268 (0.008)

Table 8: Mean PSNR (↑) across posterior samples on targets from the CelebA test set.

Num Samples   Random DPS       Data Variance DPS   ADS
50            15.988 (0.016)   15.752 (0.015)      16.971 (0.022)
100           18.169 (0.017)   17.791 (0.016)      20.828 (0.023)
200           21.237 (0.019)   20.621 (0.016)      26.314 (0.025)
300           23.617 (0.018)   22.604 (0.018)      30.160 (0.026)
500           26.795 (0.018)   25.274 (0.018)      34.546 (0.026)

Table 9: Mean SSIM (↑) across posterior samples on targets from the CelebA test set.

Num Samples   Random DPS      Data Variance DPS   ADS
50            0.486 (0.001)   0.471 (0.001)       0.481 (0.001)
100           0.599 (0.001)   0.569 (0.001)       0.623 (0.001)
200           0.725 (0.001)   0.691 (0.001)       0.783 (0.001)
300           0.799 (0.000)   0.764 (0.000)       0.859 (0.001)
500           0.877 (0.000)   0.841 (0.000)       0.925 (0.000)

Table 10: Mean LPIPS (↓) (Zhang et al., 2018) across posterior samples on targets from the CelebA test set.

Num Samples   Random DPS      Data Variance DPS   ADS
50            0.321 (0.001)   0.329 (0.001)       0.305 (0.001)
100           0.250 (0.000)   0.260 (0.000)       0.208 (0.001)
200           0.172 (0.000)   0.184 (0.000)       0.112 (0.000)
300           0.127 (0.000)   0.141 (0.000)       0.073 (0.000)
500           0.079 (0.000)   0.095 (0.000)       0.041 (0.000)

A.6 Further Results on Ultrasound Scan-Line Subsampling

Table 11: MAE (↓) on test samples from the CAMUS dataset.
Num Samples   Random DPS       Data Variance DPS   ADS
16            12.198 (0.002)   12.431 (0.002)      9.533 (0.001)
32             8.520 (0.001)    8.599 (0.001)      6.701 (0.001)

Table 12: PSNR (↑) on test samples from the CAMUS dataset.

Num Samples   Random DPS       Data Variance DPS   ADS
16            22.643 (0.002)   22.480 (0.002)      25.473 (0.001)
32            25.893 (0.001)   25.824 (0.002)      29.130 (0.001)

Figure 12: Targets, measurements, and reconstructions for 5 random samples from the CAMUS test set, by (a) ADS and (b) random fixed-mask DPS. The Mean Absolute Error for each reconstruction is provided.

A.7 Runtime for CelebA

Figure 13: Runtime of ADS on samples from the CelebA dataset with size 128×128 pixels, using an NVIDIA GeForce RTX 2080 Ti GPU. Note the contribution to runtime from DPS, below the dashed purple line, and the contribution from ADS, above the dashed purple line, which scales linearly with the number of measurements. The execution time presented is the mean over N = 20 runs, with ±1 standard deviation indicated by the error bars.

A.8 Effect of Varying the Number of Particles

Figure 14: Reconstruction accuracy on N = 20 samples from the CelebA validation set, as measured by the MAE between the target and posterior mean, for a range of values of Np.

A.9 fastMRI Comparison Methods

A.9.1 PG-MRI

Bakker et al. (2020) use policy-gradient methods from reinforcement learning to learn a policy function πϕ(a_t | x̂_t) that outputs new measurement locations given the current reconstruction. Reconstructions are then generated using a pre-existing U-Net-based reconstruction model provided by the fastMRI repository.
A.9.2 LOUPE

LOUPE (Bahadir et al., 2020) introduces an end-to-end learning framework that trains a neural network to output under-sampling masks in combination with an anti-aliasing (reconstruction) model, trained on under-sampled full-resolution MRI scans. Their loss function consists of a reconstruction term and a trick that enables differentiable sampling of a mask.

A.9.3 SeqMRI

With SeqMRI, Yin et al. (2021) propose an end-to-end differentiable sequential sampling framework. They jointly learn the sampling policy and the reconstruction model, such that the sampling policy can best fit the strengths and weaknesses of the reconstruction model, and vice versa.

A.10 ADS Mask Distributions for MNIST and fastMRI

Figure 15: The distribution of masks chosen by ADS varies according to the task: (a) ADS mask distribution for 500 samples from the MNIST test set; (b) ADS mask distribution for 1,851 samples from the fastMRI test set. We observe that the masks chosen for MNIST are less predictable a priori than those chosen for fastMRI, leading to a stronger performance by ADS relative to fixed-mask approaches. At the bottom of each plot we show estimates of the probability that each line appears in a mask generated by ADS, treating line inclusion as a Bernoulli variable (either the line is present in the mask, or it is not). To quantify the predictability of these masks, we compute the average entropy over these variables, finding that MNIST masks are significantly less predictable than those for fastMRI.
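The mask-predictability measure described in the Figure 15 caption can be sketched as follows. This is our own minimal implementation of the stated idea (average Bernoulli entropy of per-line inclusion); names are illustrative:

```python
import numpy as np

def mask_predictability(masks):
    """Average Bernoulli entropy (in nats) of per-line mask inclusion.

    masks: (N, L) binary array; masks[n, l] = 1 if mask n contains line l.
    Lower average entropy means the masks are more predictable a priori.
    """
    p = masks.mean(axis=0)                        # P(line l in mask)
    p = np.clip(p, 1e-12, 1 - 1e-12)              # avoid log(0)
    h = -p * np.log(p) - (1 - p) * np.log(1 - p)  # entropy per Bernoulli
    return h.mean()
```

Identical masks give an average entropy near zero, while lines that appear in half of the masks contribute the maximum ln 2 ≈ 0.69 nats each, matching the finding that fastMRI masks score much lower than MNIST masks.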
A.11 fastMRI SSIM Distributions

Figure 16: Histograms comparing the distribution of SSIM scores using ADS and diffusion posterior sampling. (a) 200-bin histogram of SSIM scores across the fastMRI knee test set for 4× acceleration using ADS (mean = 91.26, std = 11.01). (b) 200-bin histogram of SSIM scores across the fastMRI knee test set for 4× acceleration using diffusion posterior sampling with fixed κ-space masks (mean = 90.13, std = 11.04).