# influenceguided_diffusion_for_dataset_distillation__e9d1bf5f.pdf

Published as a conference paper at ICLR 2025

INFLUENCE-GUIDED DIFFUSION FOR DATASET DISTILLATION

Mingyang Chen1,2, Jiawei Du3, Bo Huang1,2, Yi Wang4, Xiaobo Zhang5, Wei Wang1,2

1The Hong Kong University of Science and Technology (Guangzhou) 2The Hong Kong University of Science and Technology 3CFAR, A*STAR, Singapore 4Dongguan University of Technology 5Southwest Jiaotong University mchenbt@connect.ust.hk,weiwcs@ust.hk

Dataset distillation aims to streamline the training process by creating a compact yet effective dataset for a much larger original dataset. However, existing methods often struggle with distilling large, high-resolution datasets due to prohibitive resource costs and limited performance, primarily stemming from sample-wise optimizations in the pixel space. Motivated by the remarkable capabilities of diffusion generative models in learning target dataset distributions and controllably sampling high-quality data tailored to user needs, we propose framing dataset distillation as a controlled diffusion generation task aimed at generating data specifically tailored for effective training purposes. By establishing a correlation between the overarching objective of dataset distillation and the trajectory influence function, we introduce the Influence-Guided Diffusion (IGD) sampling framework to generate training-effective data without the need to retrain diffusion models. An efficient guided function is designed by leveraging the trajectory influence function as an indicator to steer diffusion models to produce data with influence promotion and diversity enhancement. Extensive experiments show that the training performance of distilled datasets generated by diffusion models can be significantly improved by integrating our IGD method, achieving state-of-the-art performance in distilling ImageNet datasets. In particular, an exceptional result is achieved on ImageNet-1K, reaching 60.3% at IPC=50. Our code is available at https://github.com/mchen725/DD_IGD.

1 INTRODUCTION

Dataset distillation has gained significant attention due to its ability to balance the conflicting demands of maintaining training effectiveness and reducing overwhelming resource overhead. This method involves crafting a compact yet effective surrogate dataset for a large-scale original dataset. The surrogate is optimized to retain essential information from the cumbersome original, enabling models trained on it to achieve performance comparable to those trained on the complete one.

Early dataset distillation methods have made significant strides in distillation efficacy through various insightful paradigms (Zhao et al., 2021; Kim et al., 2022; Nguyen et al., 2021; Cazenavette et al., 2022; Du et al., 2023; Cui et al., 2023). However, their success is mainly limited to distilling small datasets like CIFAR (Krizhevsky & Hinton, 2009) or downscaled ImageNet (Russakovsky et al., 2015) with low resolution. Extending these methods to higher-resolution datasets (e.g., 128 × 128) is hindered by treating each data sample as an entity and refining it at the pixel level. This escalates time and computational costs with data dimensionality and preset compression ratios, typically indicated by Images Per Class (IPC). Moreover, prioritizing pixel-level optimization overlooks distributional shifts from the original dataset. As a result, at higher resolutions, synthetic data retains ineffective high-frequency patterns, leading to performance degradation (Cazenavette et al., 2023).

Recognizing the robust capability of diffusion models to capture intricate data distributions, a recent approach (Gu et al., 2024) integrates diffusion models to tackle the high-resolution challenges faced by previous pixel-oriented methods, achieving cutting-edge performance. This technique entails fine-tuning a latent diffusion model through a minimax criterion, yielding distilled datasets that harmonize representativeness and diversity for better alignment with the authentic data distribution. However, research on core-set selection techniques (Killamsetty et al., 2021a;b; Iyer et al., 2021) indicates that even data sampled directly from the authentic distribution can contribute unevenly to model training. Concerns remain about the effectiveness of the proposed objective in generating distilled datasets that are optimally tailored for highly effective training.

Figure 1: Enhanced cross-architecture performance with average influence by integrating IGD in distilling ImageNette with IPC=100.

In this work, we introduce a new paradigm of using diffusion models in the task of dataset distillation, termed the Influence-Guided Diffusion (IGD) sampling method. This method is conceptually tailored to directly guide diffusion models in generating data under a generalized training-effective condition, without retraining the diffusion models. We highlight the challenges inherent in achieving this target, particularly due to the abstract nature of the tailored condition, in contrast to existing controlled diffusion generation tasks that involve explicit content specifications (Rombach et al., 2022; Ho & Salimans, 2022). To address this challenge, we first establish a correlation between the overarching objective of dataset distillation and the trajectory influence function (Pruthi et al., 2020), which approximates the impact of training on given data in terms of test loss variation. Building on this connection, we develop diversity-constraint and influence-based guidance as indicators to steer the diffusion models towards generating data with influence promotion and diversity enhancement, as illustrated in Figure 5. As evidenced by Figure 1, integrating IGD significantly enhances the performance of the vanilla Diffusion Transformer (DiT), outperforming results obtained through the fine-tuning method Minimax. Moreover, IGD complements Minimax to achieve even better results, with a simultaneous increase in influence.

In summary, our contributions are as follows:

- We propose a new scheme for dataset distillation by framing the task as a guided-diffusion generation problem.
- We establish a novel diffusion sampling framework that pioneers the integration of the influence function as guidance for controlled diffusion generation, with the aim of achieving generalized training-enhancing objectives.
- Experimental results illustrate that our method significantly improves the performance of diffusion models across different architectures on two ImageNet subsets. Furthermore, a state-of-the-art result is achieved on ImageNet-1K, reaching 60.3% at IPC=50.

2 PRELIMINARIES

2.1 BACKGROUND ON DATASET DISTILLATION

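We refer to the target dataset as $\mathcal{T} = \{(x_i, y_i)\}_{i=1}^{|\mathcal{T}|}$. Each sample $x_i$ is drawn i.i.d. from a natural distribution $q(x)$, where $x_i \in \mathbb{R}^d$ and $y_i \in \mathcal{Y} = \{1, 2, \dots, C\}$ refers to the ground-truth label. Dataset Distillation (DD) aims to condense this large labelled dataset $\mathcal{T}$ into a smaller synthetic dataset $\mathcal{S} = \{(u_i, y_i)\}_{i=1}^{|\mathcal{S}|}$, with $u_i \in \mathbb{R}^d$ and $y_i \in \mathcal{Y}$, such that $|\mathcal{S}| \ll |\mathcal{T}|$. The reduced dataset $\mathcal{S}$ is optimized to retain essential information from $\mathcal{T}$ to ensure that any model initialized with parameters $\theta_0$ can be optimized to minimize the validation loss on the target dataset $\mathcal{T}$:

$$\min_{\mathcal{S}} \frac{1}{|\mathcal{T}|} \sum_{i=1}^{|\mathcal{T}|} \left[ \ell\left(x_i, y_i; \theta^{\mathcal{S}}\right) - \ell(x_i, y_i; \theta_0) \right] \quad \text{s.t.} \quad \theta^{\mathcal{S}} = \text{Alg}(\mathcal{S}, \theta_0). \tag{1}$$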

Here, $\text{Alg}(\mathcal{S}, \theta_0) = \arg\min_{\theta} \mathbb{E}_{(u_i, y_i) \sim \mathcal{S}}[\ell(u_i, y_i; \theta)]$ represents the training algorithm that optimizes the initialized parameters $\theta_0$ over the synthetic data $\mathcal{S}$, and $\ell(x, y; \theta)$ denotes the prediction loss of a model with parameters $\theta$ on a data pair $(x, y)$. To prevent unexpected distributional shift, we propose to frame DD as learning a conditional distribution of the authentic distribution, e.g., $p(x \mid \text{Condition})$, to sample near-real data under the generalized training-effective conditions.

2.2 GUIDED DIFFUSION GENERATION

Given samples from the data distribution $q(x)$, diffusion models are capable of learning a parameterized distribution $p_\phi(x)$ that approximates $q(x)$ and is easy to sample from (Song et al., 2020b). At a high level, this is implemented through a forward noising process and a reverse denoising process. Concretely, the forward process gradually adds Gaussian noise $\epsilon \sim \mathcal{N}(0, I)$ of different magnitudes to a clean data point $z_0$: $z_t = \sqrt{\alpha_t}\, z_0 + \sqrt{1 - \alpha_t}\, \epsilon$, where $\alpha_t$ controls the noise scale at step $t$. A diffusion model is a denoising function that learns by minimizing the dissimilarity, e.g., mean squared error, between the predicted noise $\epsilon_\phi(z_t, t, c)$ and $\epsilon$, where $c$ is a conditional input such as labels. The reverse process generates denoised samples by sampling from $p_\phi(z_{t-1} \mid z_t, z_0)$, which is generally parameterized as a Gaussian distribution and varies across studies in its approximation (Ho et al., 2020). For instance, the Denoising Diffusion Implicit Model (DDIM) (Song et al., 2020a) first predicts the clean data point $\hat{z}_{0|t}$ based on $z_t$ as:

$$\hat{z}_{0|t} = \frac{1}{\sqrt{\alpha_t}} \left( z_t - \sqrt{1 - \alpha_t}\, \epsilon_\phi(z_t, t, c) \right). \tag{2}$$

$z_{t-1}$ is then sampled from $\mathcal{N}\left( \sqrt{\alpha_{t-1}}\, \hat{z}_{0|t} + \sqrt{1 - \alpha_{t-1} - \sigma_t^2}\, \epsilon_\phi(z_t, t, c),\ \sigma_t^2 I \right)$, where $\sigma_t$ is the predefined noise factor. For notational simplicity, we abstract this process as $z_{t-1} = s(z_t, t, \epsilon_\phi)$. In this work, we adopt the widely utilized latent diffusion (Peebles & Xie, 2023) as the backbone. Here, an encoder is employed to transform images to latent codes $z = E(x)$ and a decoder $D(\cdot)$ reconstructs latent codes back to the image space to obtain the distilled dataset $\mathcal{S} = \{(D(z_i), y_i)\}_{i=1}^{|\mathcal{S}|}$.
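The clean-sample prediction of Equation (2) and the subsequent DDIM update can be illustrated with a minimal pure-Python sketch on a toy latent vector. The function names `predict_clean` and `ddim_step` are hypothetical and not taken from the authors' code, and the learned noise predictor is replaced here by the known true noise so that the recovery can be checked by hand.

```python
import math

def predict_clean(z_t, eps, alpha_t):
    """Equation (2): estimate the clean latent from a noisy latent and the predicted noise."""
    return [(z - math.sqrt(1.0 - alpha_t) * e) / math.sqrt(alpha_t)
            for z, e in zip(z_t, eps)]

def ddim_step(z_t, eps, alpha_t, alpha_prev, sigma_t=0.0, noise=None):
    """One reverse step z_t -> z_{t-1}; deterministic when sigma_t == 0.

    When sigma_t > 0 the caller is expected to pass standard Gaussian `noise`.
    """
    z0_hat = predict_clean(z_t, eps, alpha_t)
    direction = math.sqrt(1.0 - alpha_prev - sigma_t ** 2)
    noise = noise if noise is not None else [0.0] * len(z_t)
    return [math.sqrt(alpha_prev) * z0 + direction * e + sigma_t * n
            for z0, e, n in zip(z0_hat, eps, noise)]

# Toy check: noise a clean latent with the forward process, then invert it.
z0 = [0.5, -1.0, 2.0]
true_eps = [0.1, 0.3, -0.2]
alpha_t = 0.6
z_t = [math.sqrt(alpha_t) * a + math.sqrt(1.0 - alpha_t) * b
       for a, b in zip(z0, true_eps)]

recovered = predict_clean(z_t, true_eps, alpha_t)          # matches z0 up to rounding
z_prev = ddim_step(z_t, true_eps, alpha_t, alpha_prev=0.8)  # less noisy latent
```

With a perfect noise estimate the clean latent is recovered exactly, which is why the predicted $\hat{z}_{0|t}$ is a convenient stand-in for $z_0$ later in the method.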

Diffusion models typically employ conditioning to tailor outputs to specific user inputs, such as labels or text prompts. However, our purpose diverges from explicit content specifications, focusing instead on more abstract requirements. We aim to guide the diffusion model to identify conditional distributions within the learned distribution and selectively sample data to optimize training effectiveness. To this end, we employ a more adaptable method of controlling model outputs through guided-diffusion generation (Bansal et al., 2023; Yu et al., 2023; Gopalakrishnan Nair et al., 2023). These methods are largely inspired by the energy-based model (EBM) used for formulating score-based diffusions (Song et al., 2020b; 2021). Intuitively, any metric function $f_C(\cdot)$ that subtly measures the compatibility of the noisy sample $z_t$ with the condition $C$ is valid for providing steering guidance. By this means, the sampling step can generally be implemented as:

$$z_{t-1} = s(z_t, t, \epsilon_\phi) - \rho_t \nabla_{z_t} f_C(z_t), \tag{3}$$

where $\rho_t$ is defined to align with the denoising scale of the current $\epsilon_\phi(z_t, t)$. By introducing a meticulously designed guided function that effectively measures the impact of data on training efficacy (e.g., as depicted by validation loss), this implementation seamlessly aligns with our objective of framing dataset distillation as sampling data from a desirable conditional distribution.
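As a small illustration of the guided sampling step in Equation (3), the sketch below applies a gradient correction to a vanilla sample. It assumes the guidance term is subtracted (the metric is treated as a loss to be reduced), and it uses a toy quadratic metric $f_C(z) = \tfrac{1}{2}\lVert z - \text{target}\rVert^2$ that is purely hypothetical and stands in for a real compatibility measure.

```python
def guided_update(z_unguided, grad_f, rho_t):
    """Equation (3): shift the vanilla sample against the gradient of the metric f_C."""
    return [z - rho_t * g for z, g in zip(z_unguided, grad_f)]

def sq_dist(a, b):
    """Squared Euclidean distance, used only to inspect the toy example."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

# Toy metric (assumption for illustration): f_C(z) = 0.5 * ||z - target||^2,
# whose gradient with respect to z is (z - target).
target = [1.0, 1.0]
z_t = [0.0, 0.0]
grad_f = [z - c for z, c in zip(z_t, target)]   # gradient evaluated at z_t

z_unguided = [0.1, -0.1]   # stands in for s(z_t, t, eps_phi)
z_guided = guided_update(z_unguided, grad_f, rho_t=0.5)
```

In this toy setting the guided sample lands closer to the target than the unguided one, which is the qualitative behaviour the guidance term is meant to provide.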

3.1 ESTIMATING DATA INFLUENCE AS DIFFUSION CONDITIONAL GUIDANCE

We identify the influence function (Koh & Liang, 2017) as insightful parallel research that can quantify the impact of specific training data on model validation loss. This is highly relevant to the design of metric functions used for steering guidance in diffusion models under our training-effective condition. Leveraging the Fundamental Theorem of Calculus, Pruthi et al. (2020) introduced trajectory influence to estimate the cumulative influence of a training data pair $(x, y)$ on a validation data pair $(x', y')$. This method integrates the stepwise changes in the loss of the validation data throughout the training process. In our case, employing Stochastic Gradient Descent (SGD) as the training algorithm Alg, the model update can be expressed as $\theta_{t+1} - \theta_t = -\eta_t \nabla_\theta \ell(x, y; \theta_t)$, where $\eta_t$ represents the learning rate at timestep $t$. Utilizing the first-order Taylor expansion, the loss change of $(x', y')$ at each timestep can be approximated by:

$$\ell(x', y'; \theta_{t+1}) - \ell(x', y'; \theta_t) \approx \nabla_\theta \ell(x', y'; \theta_t) \cdot (\theta_{t+1} - \theta_t) = -\eta_t \nabla_\theta \ell(x', y'; \theta_t) \cdot \nabla_\theta \ell(x, y; \theta_t). \tag{4}$$

The overall influence of $(x, y)$ on $(x', y')$ throughout the training trajectory is quantified by aggregating these stepwise changes across epochs:

$$\mathcal{I}(x, x') = \sum_{e=0}^{E} \eta_e \nabla_\theta \ell(x', y'; \theta_e) \cdot \nabla_\theta \ell(x, y; \theta_e) \approx \ell(x', y'; \theta_0) - \ell(x', y'; \theta_E), \tag{5}$$

where $\eta_e$ denotes the learning rate of the $e$-th epoch, for a total of $E$ epochs. By substituting the validation data $(x', y')$ with the real data in the original dataset, this formulation is an effective approximation to the general objective of dataset distillation defined in Equation (1). Based on this insight, we define the objective of the guidance for a latent code $z$ given a certain class $c$ as:
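The approximation behind the trajectory influence can be checked numerically on a toy problem. The following sketch is an illustration under stated assumptions, not the authors' implementation: it uses a hypothetical one-parameter linear model with squared loss, trains it with SGD on a single training pair, accumulates the learning-rate-weighted gradient products, and compares the sum against the actual drop in validation loss.

```python
def loss(theta, x, y):
    """Squared loss of the toy model y_hat = theta * x."""
    return 0.5 * (theta * x - y) ** 2

def grad(theta, x, y):
    """Derivative of the squared loss with respect to theta."""
    return (theta * x - y) * x

def trajectory_influence(train, val, theta0, lr, steps):
    """Accumulate lr * grad_val * grad_train along an SGD trajectory on `train`."""
    theta, total = theta0, 0.0
    for _ in range(steps):
        g_train = grad(theta, *train)
        g_val = grad(theta, *val)
        total += lr * g_val * g_train   # stepwise first-order loss change
        theta -= lr * g_train           # SGD update on the training pair
    return total, theta

train_pair = (1.0, 2.0)   # hypothetical training pair (x, y)
val_pair = (1.5, 3.0)     # hypothetical validation pair (x', y')
theta0 = 0.0

influence, theta_final = trajectory_influence(train_pair, val_pair, theta0, lr=0.01, steps=100)
actual_drop = loss(theta0, *val_pair) - loss(theta_final, *val_pair)
relative_gap = abs(influence - actual_drop) / actual_drop
```

With a small learning rate the accumulated influence tracks the true validation loss drop closely; the gap grows as the learning rate increases, since the first-order Taylor expansion becomes less accurate.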

$$\max_{z} \frac{1}{|\mathcal{T}_c|} \sum_{i=1}^{|\mathcal{T}_c|} \mathcal{I}(D(z), x_i) = \max_{z} \sum_{e=0}^{E} \eta_e \nabla_\theta \ell_c\left(X^c; \theta^{\mathcal{S}}_e\right) \cdot \nabla_\theta \ell_c\left(D(z); \theta^{\mathcal{S}}_e\right), \tag{6}$$

where $\nabla_\theta \ell_c(X^c; \theta^{\mathcal{S}}_e) = \frac{1}{|\mathcal{T}_c|} \sum_{i=1}^{|\mathcal{T}_c|} \nabla_\theta \ell\left(x_i, c; \theta^{\mathcal{S}}_e\right)$ based on Fubini's theorem, $\mathcal{T}_c$ is the subset of the given class $c$, and $\theta^{\mathcal{S}}_e$ represents a checkpoint obtained on the decoded data. Intuitively, this objective can be optimized if models trained on synthetic data obtain trajectories equivalent to those trained on $\mathcal{T}_c$, thereby maximizing the validation loss drop. This essentially shares a similar purpose with the Gradient-Matching (GM) scheme (Zhao et al., 2021; Zhao & Bilen, 2021a). However, we identify three primary issues with directly adapting this formulation as the metric function for guided diffusion in dataset distillation: (1) prohibitive cost: the necessity of model retraining at each diffusion sampling step is computationally burdensome; (2) accumulated error: akin to the limitations of the GM method, the gap between trajectories inevitably accumulates during training on synthetic data, leading to ineffective matching and consequently degraded performance (Cazenavette et al., 2022); (3) information redundancy: the relatively poor diversity of diffusion-generated data limits its effectiveness for dataset distillation (Du et al., 2023), and matching with the averaged real gradients, as shown in Equation (6), may further exacerbate this issue.

In the following section, we tackle these challenges by developing diversity-constrained guided functions and detailing our Influence-Guided Diffusion (IGD) sampling framework.

3.2 EFFICIENT INFLUENCE-GUIDED DIFFUSION SAMPLING WITH DIVERSITY CONSTRAINT

Denote $\theta^{\mathcal{T}_c}_e = \theta^{\mathcal{T}_c}_{e-1} - \eta_{e-1} \nabla_\theta \ell_c(X^c; \theta^{\mathcal{T}_c}_{e-1})$ as checkpoints trained on the real subset $\mathcal{T}_c$ with SGD and the same learning rate schedule as on the synthetic data. Replacing the checkpoints $\theta^{\mathcal{S}}_e$ with $\theta^{\mathcal{T}_c}_e$ in Equation (6) yields an equivalent optimization target. This equivalence holds because these two targets converge to the same optimal solution when $z$ can provide the same training dynamics as $\mathcal{T}_c$, i.e., $\nabla_\theta \ell_c(X^c; \theta^{\mathcal{T}_c}_e) = \nabla_\theta \ell_c(D(z); \theta^{\mathcal{T}_c}_e)\ \forall e \in [0, E]$. Building on this insight, in practical implementation, we extend this usage to the checkpoints $\theta^{\mathcal{T}}_e$ obtained through standard mini-batch updates over the entire dataset $\mathcal{T}$. This adjustment mitigates the mismatch caused by the discrepancy between synthetic and real trajectories (Kim et al., 2022), while also eliminating the time cost associated with retraining models on $\mathcal{S}$ at each sampling step. Additionally, we use cosine similarity instead of the dot product to stabilize the magnitude of the guidance signal provided by the influence function. These modifications yield the influence-guided loss function:

$$G_I(z) = \frac{1}{|E|} \sum_{e=1}^{|E|} \left( 1 - \frac{\nabla_\theta \ell_c\left(X^c; \theta^{\mathcal{T}}_e\right) \cdot \nabla_\theta \ell_c\left(D(z); \theta^{\mathcal{T}}_e\right)}{\left\lVert \nabla_\theta \ell_c\left(X^c; \theta^{\mathcal{T}}_e\right) \right\rVert \left\lVert \nabla_\theta \ell_c\left(D(z); \theta^{\mathcal{T}}_e\right) \right\rVert} \right). \tag{7}$$

Directly computing the influence over an intermediate noisy $z_t$ is undesirable, as the checkpoints $\theta^{\mathcal{T}}_e$ are trained on clean data and do not adapt to provide meaningful guidance as a metric function when the input is noisy (Ho & Salimans, 2022). To mitigate this issue, we utilize the predicted clean data $\hat{z}_{0|t}$ of the current $z_t$, based on Equation (2) as defined by DDIM, as an approximation of the real $z_0$. Subsequently, we compute the influence guidance $G_I(\hat{z}_{0|t})$ on the predicted clean data and derive the guided gradient $\nabla_{z_t} G_I\left( (z_t - \sqrt{1 - \alpha_t}\, \epsilon_\phi(z_t, t)) / \sqrt{\alpha_t} \right)$ through backpropagation.
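The influence-guided loss can be sketched as an average of one minus the cosine similarity between real and synthetic gradients over the retained checkpoints. The snippet below is a minimal illustration under that reading of Equation (7); the gradients are given as plain lists of floats, whereas a real implementation would obtain them by backpropagation through the surrogate network and the decoder. The function names are hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def influence_loss(real_grads, synth_grads):
    """Mean of (1 - cosine similarity) across checkpoints; 0 means perfect alignment."""
    terms = [1.0 - cosine(g_real, g_syn)
             for g_real, g_syn in zip(real_grads, synth_grads)]
    return sum(terms) / len(terms)

# Two hypothetical checkpoints, each with a 2-dimensional flattened gradient.
aligned = influence_loss([[1.0, 0.0], [0.0, 2.0]], [[2.0, 0.0], [0.0, 1.0]])
half = influence_loss([[1.0, 0.0], [1.0, 0.0]], [[1.0, 0.0], [0.0, 1.0]])
opposed = influence_loss([[1.0, 0.0]], [[-3.0, 0.0]])
```

Because only directions matter, rescaling either gradient leaves the loss unchanged, which is the stabilizing effect the cosine similarity is chosen for.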


Algorithm 1: Influence-Guided Diffusion Sampling

Parameters: Class $c$, influence factor $\rho_t$, deviation factor $\gamma_t$, scales $\{\alpha_t\}_{t=1}^{T}$, guided range $[A, B]$
Required: Pre-trained diffusion model $\epsilon_\phi$, list of retained checkpoints $R$, list of averaged gradients $G_c$, generated data memory $\mathcal{M}_c$, decoder model $D$
Initialize: Sample initial random noise $z_T \sim \mathcal{N}(0, I)$
for $t = T$ to $1$ do
    Obtain the denoised signal $\epsilon_\phi(z_t, t, c)$ from the diffusion model
    if $t \in [A, B]$ then
        Calculate the influence metric $G_I(\hat{z}_{0|t})$ as in Equation (7) with $R$ and $G_c$
        Calculate the deviation metric $G_D(z_t)$ as in Equation (8) with $\mathcal{M}_c$
        Implement guided sampling $z_{t-1} = s(z_t, t, \epsilon_\phi) - \rho_t \nabla_{z_t} G_I(\hat{z}_{0|t}) - \gamma_t \nabla_{z_t} G_D(z_t)$
    else
        Implement vanilla sampling $z_{t-1} = s(z_t, t, \epsilon_\phi)$
    end if
end for
return Decoded synthetic image $D(z_0)$

To ensure diversity and avoid excessive redundancy in the surrogate dataset's training signals, we propose adding a constraint to the generation objective. This constraint ensures that the similarity between generated data within a certain class does not exceed a specified threshold: $\text{sim}(z_i, z_j) \le \delta,\ \forall z_i, z_j \in Z_c$, where $z_i \ne z_j$. In practice, we incorporate this constraint using a Lagrangian multiplier and propose a deviation guidance function to optimize it in each guided sampling step:

$$G_D(z) = \frac{z \cdot z^*}{\lVert z \rVert \lVert z^* \rVert} \quad \text{subject to} \quad z^* = \arg\max_{z' \in \mathcal{M}_c} \frac{z \cdot z'}{\lVert z \rVert \lVert z' \rVert}, \tag{8}$$

where $\mathcal{M}_c$ represents the set of all previously generated data for a certain class $c$.
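The deviation guidance amounts to finding the most similar previously generated latent and reporting its cosine similarity to the current one. A minimal sketch, assuming latents are flattened into plain lists of floats, could look as follows; returning `0.0` for an empty memory is an assumption made here for the first image of a class, not a detail stated in the text.

```python
import math

def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def deviation_guidance(z, memory):
    """Return (G_D value, index of the nearest stored latent).

    Assumption: with an empty memory there is nothing to deviate from,
    so the value is 0.0 and the index is None.
    """
    if not memory:
        return 0.0, None
    sims = [cosine(z, m) for m in memory]
    nearest = max(range(len(sims)), key=sims.__getitem__)
    return sims[nearest], nearest

memory_c = [[1.0, 0.0], [0.0, 1.0]]   # hypothetical latents already generated for class c
value, nearest = deviation_guidance([1.0, 0.1], memory_c)
empty_value, empty_index = deviation_guidance([1.0, 0.1], [])
```

Descending the gradient of this value pushes the current latent away from its nearest neighbour in the memory, which is how the similarity constraint is enforced step by step.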

Ultimately, we utilize the influence guidance of GI(ˆz0|t) alongside the deviation guidance of GD(zt), reformulating the guided sampling step as:

$$z_{t-1} = s(z_t, t, \epsilon_\phi) - \rho_t \nabla_{z_t} G_I(\hat{z}_{0|t}) - \gamma_t \nabla_{z_t} G_D(z_t), \quad \text{where} \quad \rho_t = k \sqrt{1 - \alpha_t}\, \frac{\lVert \epsilon_\phi(z_t, t, c) \rVert}{\lVert \nabla_{z_t} G_I(\hat{z}_{0|t}) \rVert} \tag{9}$$

is the scale factor designed to adaptively adjust the magnitude of the influence guidance alongside the dynamics of the denoised signal $\epsilon_\phi$, and $\gamma_t$ is empirically preset for the deviation guidance. Furthermore, we introduce two practical techniques that are essential for enhancing both the efficiency and efficacy of the proposed IGD framework.
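The adaptive scale factor can be illustrated with a short sketch. It assumes the scale in Equation (9) reads $\rho_t = k \sqrt{1 - \alpha_t}\, \lVert \epsilon_\phi \rVert / \lVert \nabla_{z_t} G_I \rVert$; the function name `influence_scale` and the small constant `tiny` guarding against division by zero are hypothetical additions for this example.

```python
import math

def l2_norm(v):
    """Euclidean norm of a flat vector."""
    return math.sqrt(sum(x * x for x in v))

def influence_scale(k, alpha_t, eps, grad_gi, tiny=1e-12):
    """rho_t = k * sqrt(1 - alpha_t) * ||eps|| / ||grad G_I|| (tiny avoids division by zero)."""
    return k * math.sqrt(1.0 - alpha_t) * l2_norm(eps) / (l2_norm(grad_gi) + tiny)

eps = [3.0, 4.0]        # hypothetical predicted noise, norm 5
grad_gi = [0.0, 0.01]   # hypothetical guidance gradient, norm 0.01

rho_t = influence_scale(k=2.0, alpha_t=0.75, eps=eps, grad_gi=grad_gi)
guidance_norm = rho_t * l2_norm(grad_gi)

# Rescaling the raw gradient leaves the applied guidance magnitude unchanged.
grad_big = [10.0 * g for g in grad_gi]
guidance_norm_big = influence_scale(2.0, 0.75, eps, grad_big) * l2_norm(grad_big)
```

Under this reading, the applied guidance always has magnitude $k \sqrt{1 - \alpha_t}\, \lVert \epsilon_\phi \rVert$, so $k$ alone controls how strongly the influence term competes with the denoising signal.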

Choosing representative checkpoints via gradient similarity. For efficiency, trajectory influence initially suggests saving checkpoints at regular intervals to compute step-wise influence. However, given the non-linear nature of training dynamics, evenly spaced checkpoints may dilute the attention paid to critical stages. To efficiently calculate the influence guidance, we propose a simple yet effective filtering algorithm. We store $\theta^{\mathcal{T}}_0$ as the first checkpoint in a list $R$ and compute its averaged gradient $\mathbb{E}_c[\nabla_\theta \ell_c(X^c; \theta^{\mathcal{T}}_0)]$ as the initial reference. For each subsequent checkpoint, we compute the averaged gradient and calculate its cosine similarity with the reference. If the similarity is below a given threshold, we store the current checkpoint and update its averaged gradient as the new reference. This process traverses all epochs, and only the retained checkpoints in $R$ are used by influence guidance.
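The filtering procedure described above can be sketched in a few lines. This is an illustrative version that takes precomputed averaged gradients (one flat vector per epoch) as input; the toy trajectory, in which the averaged gradient rotates by 10 degrees per epoch, is hypothetical and chosen so the result can be verified by hand.

```python
import math

def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def select_checkpoints(avg_grads, threshold):
    """Keep epoch 0, then keep an epoch whenever its averaged gradient
    falls below `threshold` similarity to the current reference."""
    kept = [0]
    reference = avg_grads[0]
    for epoch in range(1, len(avg_grads)):
        if cosine(avg_grads[epoch], reference) < threshold:
            kept.append(epoch)
            reference = avg_grads[epoch]   # the new checkpoint becomes the reference
    return kept

# Hypothetical trajectory: the averaged gradient rotates 10 degrees per epoch.
avg_grads = [[math.cos(math.radians(10 * e)), math.sin(math.radians(10 * e))]
             for e in range(10)]

kept = select_checkpoints(avg_grads, threshold=0.7)         # [0, 5]
kept_strict = select_checkpoints(avg_grads, threshold=0.99)  # every epoch is kept
```

In the toy trajectory, epoch 5 is the first whose gradient is more than about 45.6 degrees away from the reference (cosine below 0.7), so it is retained; a higher threshold retains more checkpoints, matching the trend reported for the threshold ablation.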

Mitigating overfitting and reducing runtime by early-stage Guidance. Guided diffusion tasks face a trade-off between generation quality and the impact of guidance (Lugmayr et al., 2022; Bansal et al., 2023). In our problem, we observe that samples generated with a large preset k in ρt achieve significant influence loss reduction but also exhibit noticeable abnormalities and degraded performance. Detailed evaluations are provided in Section 4.4. Empirical observations in diffusion generation demonstrate that most semantic content is generated during the early-to-mid stages of sampling (Yu et al., 2023). We adopt guided sampling only in these partial steps, allowing vanilla sampling to refine details in the remaining steps. For example, in DDIM with 50 sampling steps, guided sampling is applied only when t is in [30, 45]. This approach allows data generated with strong guidance to maintain comparable influence without noticeable abnormalities or performance degradation. Consequently, this also reduces the runtime associated with guidance calculation.
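The control flow of early-stage guidance can be sketched as a sampling loop that applies the guidance term only inside a step range. The code below is a schematic illustration with scalar latents and stub callables; treating the range $[30, 45]$ as inclusive on both ends is an assumption, and `vanilla_step` and `guidance` are hypothetical placeholders for the DDIM update and the combined guidance gradients.

```python
def guided_steps(num_steps, guided_range):
    """List the timesteps (counting down from num_steps to 1) that receive guidance."""
    low, high = guided_range
    return [t for t in range(num_steps, 0, -1) if low <= t <= high]

def run_sampling(z_T, num_steps, guided_range, vanilla_step, guidance):
    """Run the reverse process, subtracting the guidance term only inside guided_range."""
    low, high = guided_range
    z = z_T
    for t in range(num_steps, 0, -1):
        z_next = vanilla_step(z, t)
        if low <= t <= high:
            z_next = z_next - guidance(z, t)   # guided sampling step
        z = z_next
    return z

steps = guided_steps(50, (30, 45))

# Stub example: an identity "denoiser" and a constant guidance term of 1.0,
# so the final value simply counts how often guidance was applied.
result = run_sampling(0.0, 50, (30, 45),
                      vanilla_step=lambda z, t: z,
                      guidance=lambda z, t: 1.0)
```

Under the inclusive-range assumption, guidance is computed on 16 of the 50 steps, which is where the runtime saving relative to guiding every step comes from.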


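Table 1: ImageNette & ImageWoof: Performance comparison with state-of-the-art pixel-level distillation methods, pretrained DiT and Minimax-tuned DiT models with vanilla generation. DiT-IGD and Minimax-IGD denote utilizing our proposed IGD sampling framework for generation.

| Dataset | Model | IPC | Random | DM | IDC-1 | DiT | DiT-IGD | Minimax | Minimax-IGD | Full |
|---|---|---|---|---|---|---|---|---|---|---|
| Nette | ConvNet-6 | 10 | 46.0 ± 0.5 | 49.8 ± 1.1 | 48.2 ± 1.2 | 56.2 ± 1.3 | 61.9 ± 1.9 | 58.2 ± 0.9 | 58.8 ± 1.0 | 94.3 ± 0.5 |
| Nette | ConvNet-6 | 50 | 71.8 ± 1.2 | 70.3 ± 0.8 | 72.4 ± 0.7 | 74.1 ± 0.6 | 80.9 ± 0.9 | 76.9 ± 0.9 | 82.3 ± 0.8 | |
| Nette | ConvNet-6 | 100 | 79.9 ± 0.8 | 78.5 ± 0.8 | 80.6 ± 1.1 | 78.2 ± 0.3 | 84.5 ± 0.7 | 81.1 ± 0.3 | 86.3 ± 0.8 | |
| Nette | ResNetAP-10 | 10 | 54.2 ± 1.2 | 60.2 ± 0.7 | 60.4 ± 0.6 | 62.8 ± 0.8 | 66.5 ± 1.1 | 63.2 ± 1.0 | 63.5 ± 1.1 | 94.6 ± 0.5 |
| Nette | ResNetAP-10 | 50 | 77.3 ± 1.0 | 76.7 ± 1.0 | 77.4 ± 0.7 | 76.9 ± 0.5 | 81.0 ± 1.2 | 78.2 ± 0.7 | 82.3 ± 1.1 | |
| Nette | ResNetAP-10 | 100 | 81.1 ± 0.6 | 80.9 ± 0.7 | 81.5 ± 1.2 | 80.1 ± 1.1 | 85.2 ± 0.5 | 81.3 ± 0.9 | 86.1 ± 0.9 | |
| Nette | ResNet-18 | 10 | 55.8 ± 1.0 | 60.9 ± 0.7 | 61.0 ± 0.8 | 62.5 ± 0.9 | 67.7 ± 0.3 | 64.9 ± 0.6 | 66.2 ± 1.2 | 95.3 ± 0.6 |
| Nette | ResNet-18 | 50 | 75.8 ± 1.1 | 75.0 ± 1.0 | 77.8 ± 0.7 | 75.2 ± 0.9 | 81.0 ± 0.7 | 78.1 ± 0.6 | 82.0 ± 0.3 | |
| Nette | ResNet-18 | 100 | 82.0 ± 0.4 | 81.5 ± 0.4 | 81.7 ± 0.8 | 77.8 ± 0.6 | 84.4 ± 0.8 | 81.3 ± 0.7 | 86.0 ± 0.6 | |
| Woof | ConvNet-6 | 10 | 25.2 ± 1.1 | 27.6 ± 1.2 | 34.1 ± 0.8 | 32.3 ± 0.8 | 35.0 ± 0.8 | 33.5 ± 1.4 | 36.2 ± 1.6 | 85.9 ± 0.4 |
| Woof | ConvNet-6 | 50 | 41.9 ± 1.4 | 43.8 ± 1.1 | 42.6 ± 0.9 | 48.5 ± 1.3 | 54.2 ± 0.7 | 50.7 ± 1.8 | 55.7 ± 0.8 | |
| Woof | ConvNet-6 | 100 | 52.3 ± 1.5 | 50.1 ± 0.9 | 51.0 ± 1.1 | 54.2 ± 1.5 | 61.1 ± 1.0 | 57.1 ± 1.9 | 63.0 ± 1.8 | |
| Woof | ResNetAP-10 | 10 | 31.6 ± 0.8 | 29.8 ± 1.0 | 38.5 ± 0.7 | 39.0 ± 0.9 | 41.0 ± 0.8 | 39.6 ± 1.2 | 43.3 ± 0.3 | 87.2 ± 0.6 |
| Woof | ResNetAP-10 | 50 | 50.1 ± 1.6 | 47.8 ± 1.2 | 49.3 ± 0.9 | 55.8 ± 1.1 | 62.7 ± 1.2 | 59.8 ± 0.8 | 65.0 ± 0.8 | |
| Woof | ResNetAP-10 | 100 | 59.2 ± 0.9 | 59.8 ± 1.3 | 56.4 ± 0.5 | 62.5 ± 0.9 | 69.7 ± 0.9 | 66.8 ± 1.2 | 71.5 ± 0.8 | |
| Woof | ResNet-18 | 10 | 30.9 ± 1.3 | 30.2 ± 0.6 | 36.7 ± 0.8 | 40.6 ± 0.6 | 44.8 ± 0.8 | 42.2 ± 1.2 | 47.2 ± 1.6 | 89.0 ± 0.6 |
| Woof | ResNet-18 | 50 | 54.0 ± 0.8 | 53.9 ± 0.7 | 54.5 ± 1.0 | 57.4 ± 0.7 | 62.0 ± 1.1 | 60.5 ± 0.5 | 65.4 ± 1.8 | |
| Woof | ResNet-18 | 100 | 63.6 ± 0.5 | 64.9 ± 0.7 | 57.7 ± 0.8 | 62.3 ± 0.5 | 70.6 ± 1.8 | 67.4 ± 0.7 | 72.1 ± 0.9 | |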

Algorithm 1 outlines the detailed process of our influence-guided diffusion sampling framework for generating each synthetic image. Before constructing the surrogate dataset, we first obtain checkpoints $\{\theta^{\mathcal{T}}_e\}_{e=1}^{E}$ trained on $\mathcal{T}$ and apply the proposed filtering algorithm to retain representative checkpoints in the list $R$. Before initiating generation for a specific class $c$, we calculate the averaged gradient $\nabla_\theta \ell_c(X^c; \theta^{\mathcal{T}}_e)$ at each retained checkpoint and store them in a list $G_c$. Subsequently, we execute the algorithm, storing the generated images in memory $\mathcal{M}_c$ until the number of images reaches the preset target IPC (images per class).

4 EXPERIMENTS

4.1 EXPERIMENTAL SETUP

Datasets. As our primary interest lies in large-scale, high-resolution distillation tasks, we assess the performance of our method on the complete ImageNet-1K dataset (224 × 224) (Russakovsky et al., 2015). To provide comparable evaluations across varying task difficulties, we conduct comprehensive experiments on two representative subsets, ImageNette and ImageWoof (Howard, 2019). ImageNette consists of 10 classes with low inter-class similarity that are therefore easier to distinguish, in contrast to ImageWoof, a challenging subset containing 10 classes of dog breeds.

Baselines and evaluation metric. We compare our method with several state-of-the-art dataset distillation methods including DM (Zhao & Bilen, 2021b), IDC-1 (Kim et al., 2022), SRe2L (Yin et al., 2024), G-VBSM (Shao et al., 2023), and RDED (Sun et al., 2024). Additionally, we regard pretrained DiT (Peebles & Xie, 2023) as a notable baseline because it achieves performance comparable to state-of-the-art methods even without tailored optimizations for dataset distillation. Furthermore, we include Minimax (Gu et al., 2024), a recent work that refines DiT specifically for dataset distillation through a fine-tuning scheme, as an orthogonal baseline. Test architectures include ConvNet-6, ResNet-10 (He et al., 2016) with average pooling (ResNetAP-10), ResNet-18, ResNet-101, MobileNet-V2 (Sandler et al., 2018), EfficientNet-B0 (Tan, 2019) and Swin Transformer (Liu et al., 2021). The top-1 test accuracies of models trained on distilled datasets with different IPC (Images Per Class) are reported.

Implementation detail. For a fair comparison, we follow the official implementation of Minimax, utilizing a latent DiT model from PyTorch's official repository and an open-source VAE model from Stable Diffusion. DDIM (Song et al., 2020a) with 50 denoising steps is used as the vanilla sampling method for generation. For each test dataset, we train a 6-layer ConvNet (ConvNet-6) for 50 epochs with a learning rate of $1 \times 10^{-2}$ to collect the surrogate checkpoints used in Equation (7). The similarity threshold for choosing representative checkpoints is set to 0.7. The detailed setup of hyperparameters $k$ and $\gamma_t$ for each dataset is discussed in Appendix A.10. All the experimental results of our method can be obtained on a single RTX 4090 GPU.


Table 2: ImageNet-1K: Performance comparison over ResNet-18 with state-of-the-art dataset distillation methods, pretrained DiT and Minimax-tuned DiT models with vanilla DDIM generation.

| Dataset | IPC | SRe2L | G-VBSM | RDED | DiT | DiT-IGD | Minimax | Minimax-IGD |
|---|---|---|---|---|---|---|---|---|
| ImageNet-1K | 10 | 21.3 ± 0.6 | 31.4 ± 0.5 | 42.0 ± 0.1 | 39.6 ± 0.4 | 45.5 ± 0.5 | 44.3 ± 0.5 | 46.2 ± 0.6 |
| ImageNet-1K | 50 | 46.8 ± 0.2 | 51.8 ± 0.4 | 56.5 ± 0.1 | 52.9 ± 0.6 | 59.8 ± 0.3 | 58.6 ± 0.3 | 60.3 ± 0.4 |

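Table 3: ImageNet-1K: Cross-architecture generalization performance comparison.

| Method | ResNet101 IPC10 | ResNet101 IPC50 | MobileNet-V2 IPC10 | MobileNet-V2 IPC50 | EfficientNet-B0 IPC10 | EfficientNet-B0 IPC50 | Swin Transformer IPC10 | Swin Transformer IPC50 |
|---|---|---|---|---|---|---|---|---|
| RDED | 48.3 ± 1.0 | 61.2 ± 0.4 | 40.4 ± 0.1 | 53.3 ± 0.2 | 31.0 ± 0.1 | 58.5 ± 0.4 | 42.3 ± 0.6 | 53.2 ± 0.8 |
| DiT-IGD | 52.6 ± 1.2 | 66.2 ± 0.2 | 39.2 ± 0.2 | 57.8 ± 0.2 | 47.7 ± 0.1 | 62.0 ± 0.1 | 44.1 ± 0.6 | 58.6 ± 0.5 |
| Minimax-IGD | 53.4 ± 0.9 | 66.8 ± 0.2 | 39.7 ± 0.4 | 58.5 ± 0.3 | 48.5 ± 0.1 | 62.7 ± 0.2 | 44.8 ± 0.8 | 58.2 ± 0.5 |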

4.2 COMPARISON WITH STATE-OF-THE-ART METHODS

Evaluation on Woof & Nette. As a training-free sampling framework, our IGD method can be incorporated into the pretrained DiT and Minimax-tuned DiT during the generation process. We designate these two methods as DiT-IGD and Minimax-IGD, respectively. As depicted in Table 1, our IGD-based methods demonstrate a significant improvement over the original backbone methods, and achieve state-of-the-art performance across both Woof and Nette datasets in all IPC settings. These enhancements are consistently observed across all evaluations conducted on the three tested architectures, highlighting a robust cross-architecture generalization ability. Particularly for IPC 50, DiT-IGD notably enhances the performance of DiT by 5.8% on Nette and by 6.6% on Woof, on average. Compared with Minimax, Minimax-IGD provides an average boost of 4.7% on Nette and 5.1% on Woof. Moreover, we observe that DiT-IGD outperforms Minimax in most evaluations. Especially for the easier dataset Nette, despite the class distinctions facilitating knowledge condensation, Minimax only shows a marginal average improvement of 2.5% over DiT at IPC=100. In contrast, DiT-IGD achieves an average boost of 6.1%. Compared to diffusion-based methods, the pixel-level optimization methods DM and IDC-1 achieve moderate performance gains over random original images at IPC=10. However, as the IPC increases, the performance gain drastically diminishes or even becomes negative.

Evaluation on ImageNet-1K. Recent approaches proposed for efficiently distilling ImageNet-1K data rely on using well-trained models to provide synthetic images with soft labels to acquire richer information. Following the evaluation protocol of RDED, we employ a ResNet-18 model, trained on the original dataset, to generate soft labels for synthetic images. The performances shown in Table 2 are evaluated over the same ResNet-18 architecture. The results demonstrate consistent improvements from integrating our IGD method into the DiT and Minimax methods. In particular, DiT-IGD demonstrates significant improvement over raw DiT, with enhancements of 5.9% at IPC=10 and 6.9% at IPC=50. This also positions our Minimax-IGD method at the forefront of this practical distillation task, surpassing the state-of-the-art image-based method RDED by 4.0%. In the cross-architecture comparison detailed in Table 3, synthetic datasets generated using our IGD methods generally outperform those created by RDED across four different unseen networks. Notably, our DiT-IGD and Minimax-IGD methods surpass RDED by an average margin of 4.6% and 5.0% at IPC=50, respectively. These remarkable performance improvements underline the promising potential of diffusion-based methods in the future of dataset distillation research.
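The margins quoted above can be reproduced from the mean accuracies in Tables 2 and 3 with a few lines of arithmetic. In the sketch below, reading the 4.0% margin over RDED as the average of the IPC=10 and IPC=50 gaps is an assumption; the other figures follow directly from the table entries.

```python
# Mean top-1 accuracies copied from Table 2 (ResNet-18, ImageNet-1K).
table2 = {
    "RDED":        {10: 42.0, 50: 56.5},
    "DiT":         {10: 39.6, 50: 52.9},
    "DiT-IGD":     {10: 45.5, 50: 59.8},
    "Minimax-IGD": {10: 46.2, 50: 60.3},
}

# IPC=50 columns of Table 3: ResNet101, MobileNet-V2, EfficientNet-B0, Swin Transformer.
table3_ipc50 = {
    "RDED":        [61.2, 53.3, 58.5, 53.2],
    "DiT-IGD":     [66.2, 57.8, 62.0, 58.6],
    "Minimax-IGD": [66.8, 58.5, 62.7, 58.2],
}

def mean(values):
    return sum(values) / len(values)

# Gain of DiT-IGD over raw DiT at each IPC.
dit_gain = {ipc: round(table2["DiT-IGD"][ipc] - table2["DiT"][ipc], 1) for ipc in (10, 50)}

# Margin of Minimax-IGD over RDED, averaged over both IPC settings (assumed reading).
rded_margin = round(mean([table2["Minimax-IGD"][ipc] - table2["RDED"][ipc]
                          for ipc in (10, 50)]), 1)

# Average cross-architecture margin over RDED at IPC=50.
cross_margin = {m: round(mean(table3_ipc50[m]) - mean(table3_ipc50["RDED"]), 1)
                for m in ("DiT-IGD", "Minimax-IGD")}
```

Under these readings, the computed values agree with the reported 5.9%, 6.9%, 4.0%, 4.6% and 5.0% figures.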

4.3 CROSS-ARCHITECTURE ROBUSTNESS OF INFLUENCE GUIDANCE

In our IGD framework, influence guidance necessitates a surrogate model to be trained on the original dataset, collecting representative checkpoints for calculating guided loss. Here, we test the impact of influence guidance obtained over networks of different architectures, including ConvNet-6, ResNetAP-10, and ResNet-18. We then train these networks on generated surrogate datasets from scratch and evaluate their cross-architecture performance. Table 4 demonstrates that datasets generated based on ConvNet-6 generally exhibit superior performance. In most cross-architecture evaluations involving ResNetAP-10 and ResNet-18, they even outperform datasets generated with


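Table 4: Cross-architecture performance of DiT-IGD using different surrogate architectures to calculate influence guidance. Column groups give the test architecture.

| Dataset | Surrogate | ConvNet-6 IPC10 | ConvNet-6 IPC50 | ConvNet-6 IPC100 | ResNetAP-10 IPC10 | ResNetAP-10 IPC50 | ResNetAP-10 IPC100 | ResNet-18 IPC10 | ResNet-18 IPC50 | ResNet-18 IPC100 |
|---|---|---|---|---|---|---|---|---|---|---|
| Nette | ConvNet-6 | 61.9 ± 1.9 | 80.9 ± 0.9 | 84.5 ± 0.7 | 66.5 ± 1.1 | 81.0 ± 1.2 | 85.2 ± 0.5 | 67.7 ± 0.3 | 81.0 ± 0.7 | 84.4 ± 0.8 |
| Nette | ResNetAP-10 | 58.9 ± 0.4 | 79.5 ± 0.8 | 83.7 ± 0.4 | 66.2 ± 0.8 | 82.3 ± 0.2 | 84.4 ± 0.8 | 66.7 ± 0.6 | 82.3 ± 0.9 | 85.4 ± 0.4 |
| Nette | ResNet-18 | 62.2 ± 0.3 | 78.5 ± 0.9 | 80.1 ± 0.3 | 63.3 ± 1.6 | 79.5 ± 0.8 | 82.1 ± 0.5 | 63.1 ± 0.3 | 80.3 ± 0.7 | 83.3 ± 1.4 |
| Woof | ConvNet-6 | 35.0 ± 0.8 | 54.2 ± 0.7 | 61.1 ± 1.0 | 41.0 ± 0.8 | 62.7 ± 1.2 | 69.7 ± 0.9 | 44.8 ± 0.8 | 62.0 ± 1.1 | 70.6 ± 1.8 |
| Woof | ResNetAP-10 | 33.8 ± 1.0 | 53.5 ± 0.3 | 60.0 ± 0.4 | 39.6 ± 0.4 | 61.5 ± 0.8 | 68.8 ± 0.5 | 43.6 ± 0.5 | 65.5 ± 0.7 | 69.3 ± 0.4 |
| Woof | ResNet-18 | 34.3 ± 0.8 | 54.3 ± 0.8 | 61.0 ± 1.8 | 39.5 ± 1.1 | 61.0 ± 1.4 | 68.7 ± 0.7 | 43.8 ± 1.4 | 62.9 ± 1.0 | 69.5 ± 0.4 |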

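Table 5: The ablation study of proposed influence guidance $G_I$ and deviation guidance $G_D$ tested with ResNet-18 on ImageNette.

| $G_I$ | $G_D$ | DiT-IGD IPC50 | DiT-IGD IPC100 | Minimax-IGD IPC50 | Minimax-IGD IPC100 |
|---|---|---|---|---|---|
| - | - | 75.2 ± 0.9 | 77.8 ± 0.6 | 78.1 ± 0.6 | 81.3 ± 0.7 |
| ✓ | - | 76.5 ± 0.6 | 79.1 ± 0.4 | 81.5 ± 0.4 | 85.1 ± 0.4 |
| - | ✓ | 78.2 ± 0.4 | 80.7 ± 0.7 | 78.5 ± 0.2 | 82.8 ± 0.3 |
| ✓ | ✓ | 81.0 ± 0.7 | 84.4 ± 0.8 | 82.0 ± 0.3 | 86.0 ± 0.6 |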

Table 6: Comparison of checkpoint selection strategies for Minimax-IGD: the gradient-similarity-based method versus regular interval selection, on ImageNette with ResNet-18.

| Threshold | # Checkpoints | Regular | Ours |
|---|---|---|---|
| 0.65 | 3 | 79.5 ± 0.6 | 80.4 ± 0.7 |
| 0.70 | 4 | 79.8 ± 1.1 | 82.0 ± 0.3 |
| 0.75 | 6 | 80.5 ± 0.4 | 81.4 ± 0.5 |
| 0.80 | 10 | 81.1 ± 0.5 | 80.8 ± 0.3 |

the test architecture. Additionally, because ConvNet-6 has fewer model parameters than the other two, the computational time required for influence loss calculations is reduced. Based on these observations, we choose to utilize ConvNet-6 as the surrogate in our formal implementation. However, we also note that the performance gap between datasets generated with different architectures is not significant. In particular, datasets generated with ResNetAP-10 notably outperform those generated with ConvNet-6 in several tests against ResNet-18. These results further validate the robustness and generalization ability of our proposed IGD sampling framework.

4.4 ABLATION STUDY AND ANALYSIS

Guidance component analysis. Table 5 presents the performance achieved by independently applying influence guidance and deviation guidance to raw DiT and Minimax. The independent utilization of the two proposed guidance mechanisms still enhances the performance of both backbone methods. Specifically, in the case of raw DiT, the incorporation of deviation guidance yields results akin to those obtained with raw Minimax, primarily due to its ability to augment the diversity of generated data. Conversely, for Minimax, sole reliance on influence guidance markedly elevates its performance, achieving near parity with the comprehensive framework. Despite Minimax's inherent focus on refining sample diversity through fine-tuning, additional gains can be attained through the integration of deviation guidance. Moreover, it is important to note that although influence guidance yields moderate improvements for raw DiT, the integration of deviation guidance results in significant enhancements. These observations substantiate our discussion of the critical role of data diversity in optimizing influence effectiveness. In conclusion, influence guidance and deviation guidance complement each other, enabling our guided sampling framework to align with the training-enhancing objective.

Early-stage guidance analysis. We assess the practicability of our early-stage guidance strategy by comparing it with guidance applied over the entire sampling process on ImageWoof, varying the influence guidance scaling factor k. Figure 2b shows that applying influence guidance throughout the entire generation stage with a large preset k can significantly reduce the influence loss. However, as illustrated by Figure 2c, when k ≥ 10, validation accuracy drops notably despite the reduction in loss, likely due to overfitting to the surrogate used for influence calculation. This also leads to the abnormal image generation shown in Figure 2a. In contrast, the early-stage guidance strategy allows strong guidance signals to steer the generation process effectively while mitigating the overfitting problem. Consequently, this strategy achieves superior performance in less generation time, improving both the efficacy and the efficiency of the process.


Figure 2: (a) Examples generated using entire and early-stage guidance with varying influence magnitude k on ImageWoof; (b) Averaged normalized loss $G_I$ of datasets generated with different values of k and IPC=100; (c) Corresponding validation accuracies for varying k.

Figure 3: Visualization study for sample distributions of synthetic datasets (IPC=100) generated by four methodologies versus the original ImageWoof dataset. Panels: (a) DiT; (b) DiT-IGD; (c) Minimax; (d) Minimax-IGD. Smaller Wasserstein distances to the original dataset $\mathcal{T}$ signify closer alignment with the authentic distribution.

Checkpoint selection strategy analysis. We assess the efficacy of the gradient-similarity-based checkpoint selection strategy proposed for computing the influence-guided loss (Equation (7)). A predetermined threshold determines whether a checkpoint is selected, based on the similarity of its averaged real gradient to the current reference; we empirically identify [0.6, 0.8] as a suitable range. Thresholds above this range result in excessive checkpoint selection and diminished efficiency, while overly small thresholds select very few checkpoints. The baseline is the original trajectory-influence strategy, which saves checkpoints at fixed regular intervals. In Table 6, we contrast our strategy's results at various thresholds against this regular strategy. To ensure fairness, an equal number of regularly collected checkpoints is used for guidance calculation at each threshold. The comparison shows superior performance of our strategy over the regular approach. Notably, at a threshold of 0.7, our strategy with 4 checkpoints outperforms 10 regularly selected checkpoints, demonstrating enhanced efficiency and efficacy. As a case study, the checkpoint indexes {0, 4, 11, 40} are selected at a threshold of 0.7, compared to the regular indexes {0, 16, 32, 48}. This adaptive selection indicates better alignment with typical training dynamics, as more checkpoints are selected from the early stages of training.

Figure 4: Comparison of image generation results from raw DiT, DiT-IGD, Minimax, and Minimax-IGD. Images in each column share the same random seed. Integrating IGD directly into the generation process produces high-quality data with varying semantic content and enhanced diversity compared to vanilla generation. Many instances exhibit robust consistency under the guidance of IGD.

4.5 VISUALIZATION STUDY ON GENERATED DATA

Data distribution comparison. To investigate the effect of our guided sampling method on diffusion generation, Figure 3 shows t-SNE distribution comparisons among the full ImageWoof training dataset and data produced by two baseline methods, DiT and Minimax, as well as our two IGD-based approaches, each at IPC=100. Additionally, we use the Wasserstein distance to quantitatively evaluate how well the distributions of the generated datasets align with the entire training dataset. Relative to the Minimax method, our IGD approach guides the diffusion process to achieve a closer match to the original training set's distribution, offering more comprehensive coverage and lower Wasserstein distances. Notably, Minimax-IGD surpasses DiT-IGD in performance, despite a higher Wasserstein distance from the original dataset. This finding lends partial support to our hypothesis that pinpointing a pivotal conditional distribution within the authentic distribution can be more beneficial than mere distribution alignment.

Synthetic image comparison. Figure 4 compares images generated by vanilla sampling of raw DiT and Minimax with those from the guided sampling methods DiT-IGD and Minimax-IGD, using the same random seeds for each column. While baseline DiT generates high-quality images, they often share similar content, such as poses and structures. Minimax attempts to address the diversity issue in the generated data by fine-tuning DiT, but in many cases, the primary content or layout of the objects does not significantly change. In contrast, our method introduces additional signals in each guided generation step, achieving significant content variation and enhanced diversity without reducing quality. Furthermore, the guided signal from IGD is robust, producing similar content in both Minimax fine-tuned DiT and raw DiT in many cases.

5 CONCLUSION

In this work, we introduce a novel approach to dataset distillation by framing it as a guided diffusion generation problem. We correlate the general objective of dataset distillation with the trajectory influence function, designing an efficient influence-guided function for the diffusion sampling process. Additionally, we implement a deviation guidance function to ensure diversity and prevent training signal redundancy. These innovations enable us to create an efficient influence-guided diffusion sampling framework. Comprehensive experimental results show that our method significantly improves the performance of diffusion models and demonstrates remarkable cross-architecture generalization ability.


ACKNOWLEDGEMENTS

Jiawei Du was supported by the A*STAR Career Development Fund (Grant No. C233312004). Yi Wang was supported in part by the Guangdong Basic and Applied Basic Research Foundation (Grant No. 2023B1515120058). Wei Wang was supported by the Guangdong Provincial Key Laboratory of Integrated Communication, Sensing, and Computation for Ubiquitous Internet of Things (Grant No. 2023B1212010007), the Guangzhou Municipal Science and Technology Project (Grant Nos. 2023A03J0003, 2023A03J0013, and 2024A03J0621), and the Institute of Education Innovation and Practice Project (Grant Nos. G01RF000012 and G01RF000017).

REFERENCES

Arpit Bansal, Hong-Min Chu, Avi Schwarzschild, Soumyadip Sengupta, Micah Goldblum, Jonas Geiping, and Tom Goldstein. Universal guidance for diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 843–852, 2023.

Lukas Bossard, Matthieu Guillaumin, and Luc Van Gool. Food-101 – mining discriminative components with random forests. In Computer Vision – ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part VI 13, pp. 446–461. Springer, 2014.

George Cazenavette, Tongzhou Wang, Antonio Torralba, Alexei A. Efros, and Jun-Yan Zhu. Dataset distillation by matching training trajectories. In CVPR, pp. 10708–10717. IEEE, 2022.

George Cazenavette, Tongzhou Wang, Antonio Torralba, Alexei A. Efros, and Jun-Yan Zhu. Generalizing dataset distillation via deep generative prior. In CVPR, pp. 3739–3748. IEEE, 2023.

Mingyang Chen, Bo Huang, Junda Lu, Bing Li, Yi Wang, Minhao Cheng, and Wei Wang. Dataset distillation via adversarial prediction matching, 2023. URL https://arxiv.org/abs/2312.08912.

Hyungjin Chung, Jeongsol Kim, Michael T McCann, Marc L Klasky, and Jong Chul Ye. Diffusion posterior sampling for general noisy inverse problems. arXiv preprint arXiv:2209.14687, 2022.

Justin Cui, Ruochen Wang, Si Si, and Cho-Jui Hsieh. Scaling up dataset distillation to imagenet-1k with constant memory. In ICML, volume 202 of Proceedings of Machine Learning Research, pp. 6565–6590. PMLR, 2023.

Prafulla Dhariwal and Alexander Quinn Nichol. Diffusion models beat GANs on image synthesis. In NeurIPS, pp. 8780–8794, 2021.

Jiawei Du, Yidi Jiang, Vincent Y. F. Tan, Joey Tianyi Zhou, and Haizhou Li. Minimizing the accumulated trajectory error to improve dataset distillation. In CVPR, pp. 3749–3758. IEEE, 2023.

Nithin Gopalakrishnan Nair, Anoop Cherian, Suhas Lohit, Ye Wang, Toshiaki Koike-Akino, Vishal M Patel, and Tim K Marks. Steered diffusion: A generalized framework for plug-and-play conditional image synthesis. arXiv e-prints, pp. arXiv–2310, 2023.

Alexandros Graikos, Nikolay Malkin, Nebojsa Jojic, and Dimitris Samaras. Diffusion models as plug-and-play priors. Advances in Neural Information Processing Systems, 35:14715–14728, 2022.

Jianyang Gu, Saeed Vahidian, Vyacheslav Kungurtsev, Haonan Wang, Wei Jiang, Yang You, and Yiran Chen. Efficient dataset distillation via minimax diffusion. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024.

Ziyao Guo, Kai Wang, George Cazenavette, Hui Li, Kaipeng Zhang, and Yang You. Towards lossless dataset distillation via difficulty-aligned trajectory matching. In ICLR. OpenReview.net, 2024.

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, pp. 770–778. IEEE Computer Society, 2016.

Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance. arXiv preprint arXiv:2207.12598, 2022.


Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems, 33:6840–6851, 2020.

Jeremy Howard. Imagenette: A smaller subset of 10 easily classified classes from imagenet, March 2019. URL https://github.com/fastai/imagenette.

Rishabh Iyer, Ninad Khargoankar, Jeff Bilmes, and Himanshu Asanani. Submodular combinatorial information measures with applications in machine learning. In Algorithmic Learning Theory, pp. 722–754. PMLR, 2021.

Bahjat Kawar, Michael Elad, Stefano Ermon, and Jiaming Song. Denoising diffusion restoration models. Advances in Neural Information Processing Systems, 35:23593–23606, 2022.

Krishnateja Killamsetty, Sivasubramanian Durga, Ganesh Ramakrishnan, Abir De, and Rishabh Iyer. Grad-match: Gradient matching based data subset selection for efficient deep model training. In International Conference on Machine Learning, pp. 5464–5474. PMLR, 2021a.

Krishnateja Killamsetty, Durga Sivasubramanian, Ganesh Ramakrishnan, and Rishabh Iyer. Glister: Generalization based data subset selection for efficient and robust learning. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, pp. 8110–8118, 2021b.

Jang-Hyun Kim, Jinuk Kim, Seong Joon Oh, Sangdoo Yun, Hwanjun Song, Joonhyun Jeong, Jung-Woo Ha, and Hyun Oh Song. Dataset condensation via efficient synthetic-data parameterization. In ICML, volume 162 of Proceedings of Machine Learning Research, pp. 11102–11118. PMLR, 2022.

Pang Wei Koh and Percy Liang. Understanding black-box predictions via influence functions. In International Conference on Machine Learning, pp. 1885–1894. PMLR, 2017.

Alex Krizhevsky and Geoffrey Hinton. Learning multiple layers of features from tiny images. https://www.cs.toronto.edu/~kriz/cifar.html, 2009. Accessed: March 1, 2023.

Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10012–10022, 2021.

Noel Loo, Ramin M. Hasani, Mathias Lechner, and Daniela Rus. Dataset distillation with convexified implicit gradients. In ICML, volume 202 of Proceedings of Machine Learning Research, pp. 22649–22674. PMLR, 2023.

Cheng Lu, Yuhao Zhou, Fan Bao, Jianfei Chen, Chongxuan Li, and Jun Zhu. DPM-Solver: A fast ODE solver for diffusion probabilistic model sampling in around 10 steps, 2022. URL https://arxiv.org/abs/2206.00927.

Andreas Lugmayr, Martin Danelljan, Andres Romero, Fisher Yu, Radu Timofte, and Luc Van Gool. Repaint: Inpainting using denoising diffusion probabilistic models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11461–11471, 2022.

Timothy Nguyen, Zhourong Chen, and Jaehoon Lee. Dataset meta-learning from kernel ridge-regression. In ICLR. OpenReview.net, 2021.

William Peebles and Saining Xie. Scalable diffusion models with transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4195–4205, 2023.

Garima Pruthi, Frederick Liu, Satyen Kale, and Mukund Sundararajan. Estimating training data influence by tracing gradient descent. Advances in Neural Information Processing Systems, 33:19920–19930, 2020.

Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10684–10695, 2022.

Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael S. Bernstein, Alexander C. Berg, and Li Fei-Fei. Imagenet large scale visual recognition challenge. Int. J. Comput. Vis., 115(3):211–252, 2015.


Mark Sandler, Andrew Howard, Menglong Zhu, Andrey Zhmoginov, and Liang-Chieh Chen. Mobilenetv2: Inverted residuals and linear bottlenecks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4510–4520, 2018.

Shitong Shao, Zeyuan Yin, Muxin Zhou, Xindong Zhang, and Zhiqiang Shen. Generalized large-scale data condensation via various backbone and statistical matching. arXiv preprint arXiv:2311.17950, 2023.

Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models. arXiv preprint arXiv:2010.02502, 2020a.

Yang Song, Jascha Sohl-Dickstein, Diederik P Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score-based generative modeling through stochastic differential equations. arXiv preprint arXiv:2011.13456, 2020b.

Yang Song, Conor Durkan, Iain Murray, and Stefano Ermon. Maximum likelihood training of score-based diffusion models. Advances in Neural Information Processing Systems, 34:1415–1428, 2021.

Duo Su, Junjie Hou, Weizhi Gao, Yingjie Tian, and Bowen Tang. D^4: Dataset distillation via disentangled diffusion model. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5809–5818, 2024.

Peng Sun, Bei Shi, Daiwei Yu, and Tao Lin. On the diversity and realism of distilled dataset: An efficient dataset distillation paradigm. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024.

Mingxing Tan. Efficientnet: Rethinking model scaling for convolutional neural networks. arXiv preprint arXiv:1905.11946, 2019.

Kai Wang, Bo Zhao, Xiangyu Peng, Zheng Zhu, Shuo Yang, Shuo Wang, Guan Huang, Hakan Bilen, Xinchao Wang, and Yang You. CAFE: learning to condense dataset by aligning features. In CVPR, pp. 12186–12195. IEEE, 2022a.

Yinhuai Wang, Jiwen Yu, and Jian Zhang. Zero-shot image restoration using denoising diffusion null-space model. arXiv preprint arXiv:2212.00490, 2022b.

Zeyuan Yin, Eric Xing, and Zhiqiang Shen. Squeeze, recover and relabel: Dataset condensation at imagenet scale from a new perspective. Advances in Neural Information Processing Systems, 36, 2024.

Jiwen Yu, Yinhuai Wang, Chen Zhao, Bernard Ghanem, and Jian Zhang. Freedom: Training-free energy-guided conditional diffusion model. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 23174–23184, 2023.

Jianhao Yuan, Jie Zhang, Shuyang Sun, Philip Torr, and Bo Zhao. Real-fake: Effective training data synthesis through distribution matching. arXiv preprint arXiv:2310.10402, 2023.

Bo Zhao and Hakan Bilen. Dataset condensation with differentiable siamese augmentation. In ICML, volume 139 of Proceedings of Machine Learning Research, pp. 12674–12685. PMLR, 2021a.

Bo Zhao and Hakan Bilen. Dataset condensation with distribution matching. CoRR, abs/2110.04181, 2021b.

Bo Zhao, Konda Reddy Mopuri, and Hakan Bilen. Dataset condensation with gradient matching. In ICLR. OpenReview.net, 2021.

Yongchao Zhou, Ehsan Nezhadarya, and Jimmy Ba. Dataset distillation using neural feature regression. In NeurIPS, 2022.


Figure 5: An intuitive comparison between our influence-guided diffusion generation and the vanilla diffusion generation frameworks.

A.1 RELATED WORK

Dataset Distillation. Current dataset distillation methods can be categorized into meta-learning, data-matching, and model-inversion approaches. Meta-learning methods (Nguyen et al., 2021; Zhou et al., 2022; Loo et al., 2023) tackle dataset distillation as a nested optimization problem, aiming to minimize generalization errors on original data caused by models trained on distilled data. Data-matching methods synthesize data to replicate specific behaviours of the original dataset, such as latent distributions (Zhao & Bilen, 2021b; Wang et al., 2022a), gradients (Zhao et al., 2021; Zhao & Bilen, 2021a; Kim et al., 2022), training trajectories (Cazenavette et al., 2022; Du et al., 2023; Cui et al., 2023) and surrogate predictions (Chen et al., 2023). Model-inversion methods (Yin et al., 2024; Shao et al., 2023) build on data-free knowledge distillation (DFKD) techniques with specific batch normalization statistic alignment. Additionally, recent research (Gu et al., 2024) has integrated diffusion models into dataset distillation alongside a fine-tuning scheme, which is orthogonal to our training-free, sampling-oriented approach.

Guided-Diffusion Sampling. Works in this category employ a pre-trained diffusion model as a foundation but modify the sampling method to guide generation with feedback from a guidance function (Kawar et al., 2022; Chung et al., 2022; Graikos et al., 2022). Early work employed classifiers as guidance, adjusting gradients during sampling (Dhariwal & Nichol, 2021). However, classifiers for noisy images are domain-specific and often unavailable. Wang et al. (2022b) introduced linear operator-based external guidance, generating images in the null space of these operators, though extending this to non-linear functions is challenging. Several recent works (Gopalakrishnan Nair et al., 2023; Yu et al., 2023; Bansal et al., 2023) explored general guidance functions, modifying the sampling process with gradients of the guidance function on expected denoised images. However, these methods rely on existing metric functions that can concretely measure specific requirements. In contrast, our contribution lies in guiding the model to generate data that meets abstract, training-enhancing criteria.

A.2 LIMITATIONS AND FUTURE WORK

The main limitation of our method is the additional time incurred by guidance calculations during the diffusion sampling process. Despite efforts to improve efficiency, our sampling framework takes 5 to 6× longer than the vanilla method. For example, whereas raw DDIM generates a 256 × 256 image in 1.5 seconds, our method takes 8.2 seconds on an RTX 4090 GPU. This is particularly challenging for distilling extensive datasets in resource-constrained scenarios. Consequently, improving the generation efficiency of the guided diffusion sampling method will be a key focus of our future research.

A.3 GRADIENT-SIMILARITY-BASED CHECKPOINT SELECTION ALGORITHM

In Algorithm 2, we present a detailed implementation of the gradient-similarity-based checkpoint selection algorithm introduced in Section 3.2. This algorithm is designed to select representative checkpoints for calculating the influence guidance $G_I$. The core intuition is that if the gradients at a checkpoint closely resemble those at the most recently retained checkpoint, the retained checkpoint can effectively represent the current one.

Complexity analysis. The computational overhead primarily stems from calculating the averaged gradient $g_t$ w.r.t. the model parameters $\theta$ at each of the $E$ checkpoints collected during training. When using the same cross-entropy loss as in model training, due to its additive nature, the computational complexity of calculating $g_t$ at a given checkpoint $\theta_t$ is equivalent to the complexity of one epoch of gradient descent, approximately $O(|\theta| \cdot B \cdot \frac{N}{B} \cdot d)$, where $B$ is the batch size, $N$ is the number of data instances, and $d$ is the data dimension. Essentially, without any optimization, the complexity of this algorithm is similar to training a model parameterized by $\theta$ for $E$ epochs. In practice, instead of loading the entire dataset into the dataloader to compute the average gradient $\nabla_\theta \ell_c$ for each class, we first load all images from a class folder into CPU memory and slice them into GPU memory for gradient computation and accumulation. Empirically, this approach further reduces the runtime of the filtering algorithm. Additionally, the cross-architecture evaluation discussed in Section 4.3 and Table 4 demonstrates that using models with simpler architectures (e.g., ConvNet) as surrogates can provide more effective influence guidance, further reducing the time overhead for selecting representative checkpoints.

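Algorithm 2: Filtering Algorithm for Influence Guidance

Input: Original dataset $\mathcal{T}$, initial checkpoint $\theta^{\mathcal{T}}_0$, threshold $\delta$

Output: Retained checkpoints list $R$

1 Initialize: $R \leftarrow \{\theta^{\mathcal{T}}_0\}$;

2 Compute $\mathbb{E}_c[\nabla_\theta \ell_c(X_c; \theta^{\mathcal{T}}_0)]$ as reference gradient $g_{\mathrm{ref}}$;

3 for $t = 1$ to $E$ do

4     Compute averaged gradient $g_t = \mathbb{E}_c[\nabla_\theta \ell_c(X_c; \theta^{\mathcal{T}}_t)]$;

5     Calculate cosine similarity $s = \frac{g_t \cdot g_{\mathrm{ref}}}{\|g_t\| \, \|g_{\mathrm{ref}}\|}$;

6     if $s < \delta$ then

7         $R \leftarrow R \cup \{\theta^{\mathcal{T}}_t\}$;

8         Update reference gradient $g_{\mathrm{ref}} = g_t$;

9 return $R$;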
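The filtering loop of Algorithm 2 can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: it assumes the averaged gradients have already been computed and flattened into plain lists of floats, and the names `cosine_similarity` and `select_checkpoints` are hypothetical. Treating a zero-norm gradient as having similarity 0 is also an assumption made only to keep the sketch well defined.

```python
import math
from typing import List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two flattened gradient vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # assumption: a zero gradient is treated as dissimilar
    return dot / (norm_a * norm_b)


def select_checkpoints(avg_grads: Sequence[Sequence[float]], delta: float) -> List[int]:
    """Return the indexes of retained checkpoints.

    avg_grads[t] is the (precomputed) averaged gradient at checkpoint t,
    with t = 0 being the initial checkpoint.
    """
    retained = [0]            # the initial checkpoint is always kept
    g_ref = avg_grads[0]      # reference gradient
    for t in range(1, len(avg_grads)):
        s = cosine_similarity(avg_grads[t], g_ref)
        if s < delta:         # gradient direction has drifted enough
            retained.append(t)
            g_ref = avg_grads[t]
    return retained


# Toy example: 2-D "gradients" pointing at the given angles (degrees).
angles = [0, 10, 20, 50, 55, 100]
toy_grads = [(math.cos(math.radians(a)), math.sin(math.radians(a))) for a in angles]
print(select_checkpoints(toy_grads, delta=0.8))  # [0, 3, 5]
```

In the toy run, checkpoints 1 and 2 stay within the similarity threshold of checkpoint 0 and are skipped, while checkpoint 3 becomes the new reference; a larger `delta` retains more checkpoints, which matches the threshold behaviour described in Section 4.4.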

A.4 ADDITIONAL PERFORMANCE EVALUATION ON FOOD-101 DATASET

We evaluate the performance of our IGD methods on the Food-101 dataset (Bossard et al., 2014) to further test their ability to distill other large, high-resolution datasets. Food-101 is a challenging dataset that includes 101 food categories, totaling 101,000 images, with each category containing 250 manually reviewed test images and 750 training images. All images are scaled to a maximum side length of 256 pixels. The results in Table 7 show that our IGD methods achieve superior performance in all IPC scenarios. Furthermore, applying our method to the baseline methods DiT and Minimax results in noticeable performance gains, with average improvements of 3.8% and 3.5%, respectively. In contrast, the Minimax method yields only a marginal average improvement of 0.8% over DiT. These findings align with the evaluations conducted on ImageNet, indicating robust scalability across large, high-resolution datasets.
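The averages quoted above can be reproduced directly from the mean accuracies in Table 7. The short sketch below is only a worked check of that arithmetic; the dictionary layout and the helper name `average_gain` are illustrative choices, and the gains are differences in accuracy percentage points averaged over the three IPC settings.

```python
# Mean accuracies (%) copied from Table 7 (Food-101, ResNetAP-10), keyed by IPC.
table7 = {
    "DiT":         {10: 23.9, 50: 40.8, 100: 45.9},
    "DiT-IGD":     {10: 27.2, 50: 45.2, 100: 49.7},
    "Minimax":     {10: 24.8, 50: 41.6, 100: 46.5},
    "Minimax-IGD": {10: 28.3, 50: 44.8, 100: 50.3},
}


def average_gain(method: str, baseline: str) -> float:
    """Average accuracy difference (method minus baseline) over all IPCs."""
    gains = [table7[method][ipc] - table7[baseline][ipc] for ipc in table7[baseline]]
    return sum(gains) / len(gains)


print(round(average_gain("DiT-IGD", "DiT"), 1))          # 3.8
print(round(average_gain("Minimax-IGD", "Minimax"), 1))  # 3.5
print(round(average_gain("Minimax", "DiT"), 1))          # 0.8
```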

Table 7: Food-101: performance comparison with state-of-the-art pixel-level distillation methods, pretrained DiT and Minimax-tuned DiT models with vanilla generation. The results are obtained on ResNetAP-10 at different IPCs.

| IPC | Random | DM | DiT | DiT-IGD | Minimax | Minimax-IGD | Full |
|---|---|---|---|---|---|---|---|
| 10 | 16.2 ± 0.5 | 18.5 ± 0.8 | 23.9 ± 1.0 | 27.2 ± 0.9 | 24.8 ± 0.9 | 28.3 ± 0.9 | 78.6 ± 0.4 |
| 50 | 36.9 ± 0.3 | 37.8 ± 0.4 | 40.8 ± 0.7 | 45.2 ± 0.7 | 41.6 ± 1.0 | 44.8 ± 0.7 | 78.6 ± 0.4 |
| 100 | 46.8 ± 0.3 | 44.8 ± 0.3 | 45.9 ± 0.5 | 49.7 ± 0.3 | 46.5 ± 0.5 | 50.3 ± 0.6 | 78.6 ± 0.4 |

A.5 ADDITIONAL PERFORMANCE EVALUATION ON CIFAR DATASETS

The results presented in Tables 1 and 2 demonstrate the outstanding performance of our method in distilling targeted high-resolution datasets. In this section, we further investigate the generalizability of our method to smaller datasets. Specifically, we compare the performance of our framework with two state-of-the-art high-resolution-oriented methods, SRe2L (Yin et al., 2024) and RDED (Sun et al., 2024), as well as two low-resolution-oriented methods, DM (Zhao & Bilen, 2021b) and DATM (Guo et al., 2024), on CIFAR-10 and CIFAR-100. Notably, DATM, a recent strong baseline, has been shown to achieve lossless distillation on small-scale datasets such as CIFAR. Table 8 presents the comparison results on ConvNet under varying IPC values. Our experimental findings indicate that our method outperforms both SRe2L and RDED in most scenarios. Remarkably, our approach achieves nearly lossless performance, comparable to that of DATM, even at a 20% compression ratio. These results, combined with the exceptional performance observed on larger datasets like ImageNet, suggest that our method is a versatile and unified solution that excels in both low-resolution and high-resolution settings.

Table 8: CIFAR-10 & CIFAR-100: Performance comparison with two low-resolution-oriented methods, DM and DATM, and two high-resolution-oriented methods, SRe2L and RDED.

| Method | CIFAR-10, IPC=50 (1.0%) | CIFAR-10, IPC=500 (10.0%) | CIFAR-10, IPC=1000 (20.0%) | CIFAR-100, IPC=10 (2.0%) | CIFAR-100, IPC=50 (10.0%) | CIFAR-100, IPC=100 (20.0%) |
|---|---|---|---|---|---|---|
| DM | 63.1 ± 0.4 | 74.3 ± 2 | 79.2 ± 0.2 | 29.7 ± 0.3 | 43.6 ± 0.4 | 47.1 ± 0.4 |
| DATM | 76.1 ± 0.3 | 83.5 ± 0.2 | 85.5 ± 0.4 | 47.2 ± 0.4 | 55.0 ± 0.2 | 57.5 ± 0.2 |
| SRe2L | 43.2 ± 0.3 | 55.3 ± 0.4 | 57.1 ± 0.4 | 24.5 ± 0.4 | 45.2 ± 0.3 | 46.6 ± 0.5 |
| RDED | 68.4 ± 0.2 | 78.1 ± 0.4 | 79.8 ± 0.4 | 46.4 ± 0.3 | 51.5 ± 0.3 | 52.6 ± 0.4 |
| DiT-IGD | 66.8 ± 0.5 | 82.6 ± 0.6 | 84.6 ± 0.5 | 45.8 ± 0.5 | 53.9 ± 0.6 | 55.9 ± 0.4 |

A.6 COMPARISON WITH OTHER DIFFUSION-BASED METHODS ON IMAGENET-1K

We compare our method with two other synthetic dataset generation methods, namely TDSDM (Yuan et al., 2023) and D4M (Su et al., 2024), which leverage pre-trained diffusion models. Although TDSDM was not initially designed for dataset distillation tasks, its goal is to enhance the training efficacy of synthetic data generated by diffusion models. Along with our baseline method, Minimax, these three approaches utilize distribution-matching-like objectives to fine-tune diffusion models and improve the performance of synthetic data in training. The results presented in Table 9 are evaluated on ImageNet-1K under the evaluation protocol of RDED. Our findings demonstrate that our guided-diffusion method consistently outperforms the others, reinforcing the importance of introducing informative guidance when applying diffusion models in dataset distillation.

Table 9: Performance comparison over ResNet-18 with state-of-the-art diffusion-finetuning methods.

| Dataset | IPC | TDSDM | D4M | Minimax | DiT-IGD | Minimax-IGD |
|---|---|---|---|---|---|---|
| ImageNet-1K | 10 | 44.5 ± 0.4 | 27.9 ± 0.7 | 44.3 ± 0.5 | 45.5 ± 0.5 | 46.2 ± 0.6 |
| ImageNet-1K | 50 | 59.4 ± 0.3 | 55.2 ± 0.3 | 58.6 ± 0.3 | 59.8 ± 0.3 | 60.3 ± 0.4 |


A.7 EVALUATING THE ROBUSTNESS OF INFLUENCE-GUIDED DIFFUSION WITH DPM SOLVER

In the practical implementation of our IGD methods, we use DDIM with 50 denoising steps. To assess the generalizability of our proposed Influence Guidance and Deviation Guidance across different diffusion solvers, we additionally explore the use of the DPM solver (Lu et al., 2022) as a replacement for the DDIM solver in outputting $s(z_t, t, \epsilon_\phi)$ in Equation (9). The DPM solver is a fast, high-order solver for diffusion ODEs with a convergence order guarantee. We employ the second-order DPM solver with 20 denoising steps by default. Accordingly, we adjust the guided range for diffusion guidance to [12, 18]. This adjustment reduces the average sampling time by nearly 50%, from 8.2 seconds to 4.3 seconds on an RTX 4090. Table 10 compares the average performance of DDIM with 50 steps and the DPM solver with 20 steps for distilling ImageNette and ImageWoof with IPC=50. The results indicate that using the DPM solver with fewer denoising steps does not lead to a significant degradation in performance, and in some scenarios, it even yields slight improvements. This further validates the robustness of our influence-guided sampling method.
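To make the guided range concrete, the following sketch shows one way a sampler could restrict guidance to a window of steps, together with the arithmetic behind the reported time saving. It is an illustration under stated assumptions, not the released implementation: the helper names are hypothetical, steps are assumed to be indexed from 0 in the sampler's own order, and both endpoints of the range are assumed to be inclusive.

```python
from typing import List, Tuple


def guided_steps(num_steps: int, guided_range: Tuple[int, int]) -> List[int]:
    """Return the step indexes at which guidance would be applied.

    Assumptions (not confirmed by the paper): steps are indexed
    0..num_steps-1 and both endpoints of guided_range are inclusive.
    """
    start, end = guided_range
    return [step for step in range(num_steps) if start <= step <= end]


def relative_reduction(before: float, after: float) -> float:
    """Percentage reduction when going from `before` to `after`."""
    return 100.0 * (before - after) / before


dpm_guided = guided_steps(20, (12, 18))          # DPM solver, 20 steps
print(len(dpm_guided))                           # 7
print(round(relative_reduction(8.2, 4.3), 1))    # 47.6
```

Under these assumptions only 7 of the 20 DPM steps carry the extra guidance cost, and the measured sampling time drops by about 47.6%.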

Table 10: Performance of ResNet-18 using the DDIM-50 and DPM-20 solvers for diffusion generation.

| Dataset | Solver | DiT | DiT-IGD | Minimax | Minimax-IGD |
|---|---|---|---|---|---|
| ImageNette | DDIM-50 | 75.2 ± 0.9 | 81.0 ± 0.7 | 78.1 ± 0.6 | 82.0 ± 0.3 |
| ImageNette | DPM-20 | 73.5 ± 0.8 | 82.0 ± 0.5 | 76.6 ± 0.4 | 80.6 ± 0.5 |
| ImageWoof | DDIM-50 | 57.4 ± 0.7 | 62.0 ± 1.1 | 60.5 ± 0.5 | 65.4 ± 1.8 |
| ImageWoof | DPM-20 | 57.8 ± 0.6 | 64.0 ± 1.0 | 60.3 ± 0.6 | 64.5 ± 1.3 |

A.8 DISTRIBUTION DIVERSITY AND COVERAGE ANALYSIS

In Figure 3, we provided t-SNE visualizations comparing the distributions of data generated by two baseline methods (DiT and Minimax) and our IGD-based methods with IPC=100. The figure shows that integrating IGD enhances diversity and alignment with the original dataset, supported by lower Wasserstein distances to the original dataset. In this section, we further compare the FID scores and coverage of surrogate datasets (IPC=100) generated for ImageWoof by different methods. Here, the coverage metric is assessed based on whether each original data point has a nearest neighbor in the surrogate dataset within a given threshold (e.g., 300 in the InceptionV3 latent space). For fairness, we excluded data selected by the Random method from the original dataset during coverage calculation. The results in Table 11 also demonstrate a clear diversity improvement of our IGD method over vanilla DiT. However, we also observe that although the randomly selected dataset has the lowest FID and highest coverage, its accuracy (63.6%) is well below that of the IGD-based methods. Similarly, while Minimax-IGD has worse FID and coverage than DiT-IGD, it performs better. These findings suggest that our diversity-constrained influence-guided objective is a more effective measure for DD than relying solely on distribution alignment.
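The coverage metric described above can be sketched as follows. This is a minimal illustration that assumes Euclidean distance between feature vectors and an inclusive threshold; the paper does not state these details here, and the function name `coverage` and the toy 2-D points (standing in for InceptionV3 features) are hypothetical.

```python
import math
from typing import Sequence


def coverage(original: Sequence[Sequence[float]],
             surrogate: Sequence[Sequence[float]],
             threshold: float) -> float:
    """Percentage of original points whose nearest surrogate point lies
    within `threshold` (Euclidean distance, assumed for illustration)."""
    covered = 0
    for x in original:
        nearest = min(math.dist(x, s) for s in surrogate)
        if nearest <= threshold:
            covered += 1
    return 100.0 * covered / len(original)


# Toy 2-D features standing in for latent embeddings.
original_feats = [(0.0, 0.0), (1.0, 0.0), (5.0, 5.0), (10.0, 10.0)]
surrogate_feats = [(0.0, 0.5), (5.0, 4.5)]
print(coverage(original_feats, surrogate_feats, threshold=1.0))  # 50.0
```

Two of the four toy points have a surrogate neighbor within the threshold, giving 50% coverage. The brute-force nearest-neighbor search is quadratic in the dataset sizes, so a real evaluation would likely use a vectorized or indexed search instead.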

Table 11: FID and distribution coverage comparison among different methods.

| Metrics | DiT | DiT-IGD | Minimax | Minimax-IGD | Random |
|---|---|---|---|---|---|
| FID | 81.1 | 75.9 | 80.1 | 76.4 | 54.1 |
| Coverage (%) | 65.4 | 68.1 | 66.5 | 67.2 | 72.1 |
| Accuracy (%) | 62.3 | 70.6 | 67.4 | 72.1 | 63.6 |

A.9 PARAMETER ANALYSIS

In our IGD sampling framework, two critical hyper-parameters are k, which controls the magnitude of influence guidance, and $\gamma_t$, which controls the magnitude of deviation guidance. In Figure 6, we examine the impact of these scaling factors on DiT-IGD and Minimax-IGD, using the ImageNette dataset as an example. For DiT-IGD, variations in both k and $\gamma_t$ significantly influence performance. Increasing the values of these parameters enhances performance, highlighting the importance of influence and dataset diversity for model training. However, setting k too high results in a notable performance drop. As discussed in Section 3.2, this is likely due to excessive overfitting to the surrogate, which produces data with distorted content. In contrast, for Minimax-IGD, increasing $\gamma_t$ contributes only marginally to performance improvement, because Minimax already focuses on increasing diversity as a core aspect of its fine-tuning-based scheme. Increasing influence guidance by enlarging k, however, significantly improves its results, although a similar performance drop is observed when k becomes excessively large. These findings underscore the necessity of carefully tuning k and $\gamma_t$ to optimize the effectiveness of our IGD sampling framework, ensuring balanced influence and diversity without overfitting.

Figure 6: Hyper-parameter analysis on (a) & (c) the scaling factor k of influence guidance, and (b) & (d) the scaling factor $\gamma_t$ of deviation guidance for DiT-IGD and Minimax-IGD.

A.10 HYPERPARAMETER SETUP AND GUIDELINES

In Table 12, we provide the detailed hyperparameter configuration for k and $\gamma_t$ in Equation (7) needed to replicate the results obtained on the ImageNette, ImageWoof, and ImageNet-1K datasets. Although we incorporate an adaptive scaling factor based on the ratio between the denoised signal magnitude from diffusion and the guided signal from the influence guidance $G_I$, manually specifying the scale factor k remains essential to prevent unexpected overfitting caused by the influence guidance. Drawing on our ablation study, illustrated in Figure 6, we recommend setting k within [1, 50] when applying our method to distillation tasks involving other ImageNet subsets. Similarly, we suggest a grid-search range of [10, 200] for the scaling factor $\gamma_t$ of the deviation guidance. For scenarios with small IPC in particular, we recommend starting from a relatively small value of k to preserve the representativeness of the generated data.
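A grid search over the recommended ranges might be organized as in the sketch below. Only the two ranges come from the guidelines above; the candidate values, the function name `grid_search`, and the `evaluate` callback are hypothetical placeholders. In practice `evaluate` would generate a distilled set with the given k and $\gamma_t$ and return a validation accuracy, which is far more expensive than the toy scoring function used here.

```python
from itertools import product
from typing import Callable, Iterable, Optional, Tuple

K_RANGE = (1, 50)        # recommended range for k
GAMMA_RANGE = (10, 200)  # recommended grid-search range for gamma_t

# Hypothetical candidate values inside the recommended ranges.
K_CANDIDATES = [1, 5, 10, 15, 25, 50]
GAMMA_CANDIDATES = [10, 50, 100, 120, 200]


def grid_search(evaluate: Callable[[float, float], float],
                k_values: Iterable[float] = K_CANDIDATES,
                gamma_values: Iterable[float] = GAMMA_CANDIDATES
                ) -> Optional[Tuple[float, float, float]]:
    """Return (best_score, k, gamma) over all in-range combinations."""
    best = None
    for k, gamma in product(list(k_values), list(gamma_values)):
        in_range = (K_RANGE[0] <= k <= K_RANGE[1]
                    and GAMMA_RANGE[0] <= gamma <= GAMMA_RANGE[1])
        if not in_range:
            continue  # skip values outside the recommended ranges
        score = evaluate(k, gamma)
        if best is None or score > best[0]:
            best = (score, k, gamma)
    return best


# Toy stand-in for "generate data, train, and measure validation accuracy".
def toy_score(k: float, gamma: float) -> float:
    return -((k - 5) ** 2) - ((gamma - 120) ** 2)


print(grid_search(toy_score))  # (0, 5, 120)
```

Following the guideline for small IPC, the candidate list for k could be iterated from its smallest values first and stopped early once validation accuracy starts to fall.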

A.11 MORE VISUALIZATION COMPARISON OF SYNTHETIC DATA.

Here, we provide an additional visual comparison between images generated by two backbone models with vanilla DDIM sampling, the raw DiT and the Minimax-tuned DiT, and with our IGD sampling framework, DiT-IGD and Minimax-IGD. All synthetic data were generated for the ImageWoof and ImageNette datasets.


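Table 12: Detailed setup of hyperparameters k and $\gamma_t$ in Equation (7) for reproducing the results reported in Tables 1 and 2.

| Dataset | Parameter | DiT-IGD IPC10 | DiT-IGD IPC50 | DiT-IGD IPC100 | Minimax-IGD IPC10 | Minimax-IGD IPC50 | Minimax-IGD IPC100 |
|---|---|---|---|---|---|---|---|
| Nette | k | 5 | 5 | 5 | 15 | 15 | 15 |
| Nette | $\gamma_t$ | 50 | 120 | 120 | 10 | 10 | 10 |
| Woof | k | 5 | 5 | 5 | 10 | 10 | 10 |
| Woof | $\gamma_t$ | 50 | 120 | 120 | 50 | 100 | 100 |
| 1K | k | 5 | 5 | 5 | 10 | 10 | 10 |
| 1K | $\gamma_t$ | 120 | 120 | 120 | 100 | 100 | 100 |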

Figure 7: Visualization comparison between raw DiT and DiT-IGD on ImageWoof.


Figure 8: Visualization comparison between Minimax and Minimax-IGD on ImageWoof.


Figure 9: Visualization comparison between raw DiT and DiT-IGD on ImageNette.


Figure 10: Visualization comparison between Minimax and Minimax-IGD on ImageNette.