# Sequential Subset Matching for Dataset Distillation

Jiawei Du, Qin Shi, Joey Tianyi Zhou

Centre for Frontier AI Research (CFAR), Agency for Science, Technology and Research (A*STAR), Singapore
Institute of High Performance Computing (IHPC), Agency for Science, Technology and Research (A*STAR), Singapore
{dujw,Joey_Zhou}@cfar.a-star.edu.sg, shiqin924924@gmail.com

Abstract

Dataset distillation is a newly emerging task that synthesizes a small dataset used to train deep neural networks (DNNs), reducing data storage and model training costs. The synthetic datasets are expected to capture the essence of the knowledge contained in real-world datasets such that the former yields a similar performance as the latter. Recent advancements in distillation methods have produced notable improvements in generating synthetic datasets. However, current state-of-the-art methods treat the entire synthetic dataset as a unified entity and optimize each synthetic instance equally. This static optimization approach may lead to performance degradation in dataset distillation. Specifically, we argue that static optimization can give rise to a coupling issue within the synthetic data, particularly when a larger amount of synthetic data is being optimized. This coupling issue, in turn, leads to the failure of the distilled dataset to extract the high-level features learned by the deep neural network (DNN) in the latter epochs. In this study, we propose a new dataset distillation strategy called Sequential Subset Matching (SeqMatch), which tackles this problem by adaptively optimizing the synthetic data to encourage sequential acquisition of knowledge during dataset distillation. Our analysis indicates that SeqMatch effectively addresses the coupling issue by sequentially generating the synthetic instances, thereby enhancing its performance significantly. Our proposed SeqMatch outperforms state-of-the-art methods on various datasets, including SVHN, CIFAR-10, CIFAR-100, and Tiny ImageNet. Our code is available at https://github.com/shqii1j/seqmatch.

1 Introduction

Recent advancements in Deep Neural Networks (DNNs) have demonstrated their remarkable ability to extract knowledge from large-scale real-world data, as exemplified by the impressive performance of the large language model GPT-3, which was trained on a staggering 45 terabytes of text data [4]. However, the use of such massive datasets comes at a significant cost in terms of data storage, model training, and hyperparameter tuning. The challenges associated with the use of large-scale datasets have motivated the development of various techniques aimed at reducing dataset size while preserving the essential characteristics of the data. One such technique is dataset distillation [5, 6, 13, 22, 29, 35, 42, 48, 50, 51, 52, 53], which involves synthesizing a smaller dataset that effectively captures the knowledge contained within the original dataset. Models trained on these synthetic datasets have been shown to achieve comparable performance to those trained on the full dataset. In recent years, dataset distillation has garnered increasing attention from the deep learning community and has been leveraged in various practical applications, including continual learning [41, 52, 53], neural architecture search [21, 39, 40, 51, 53], and privacy-preserving tasks [12, 15, 31], among others.

Corresponding Author. 37th Conference on Neural Information Processing Systems (NeurIPS 2023).
Existing methods for dataset distillation, as proposed in [5, 11, 13, 33, 37, 38, 47, 51, 53], have improved the distillation performance through enhanced optimization methods. These approaches have achieved commendable improvements in consolidating knowledge from the original dataset and generating superior synthetic datasets. However, the knowledge condensed by these existing methods primarily originates from the easy instances, which exhibit a rapid reduction in training loss during the early stages of training. These easy instances constitute the majority of the dataset and typically encompass low-level, yet commonly encountered visual features (e.g., edges and textures [49]) acquired in the initial epochs of training. In contrast, the remaining, less frequent, but more challenging instances encapsulate high-level features (e.g., shapes and contours) that are extracted in the subsequent epochs and significantly impact the generalization capability of deep neural networks (DNNs). The findings depicted in Figure 1 reveal that an overemphasis on low-level features hinders the extraction and condensation of high-level features from hard instances, thereby resulting in a decline in performance.

Figure 1: Left: MTT [5] fails to extract adequate high-level features. The loss drop rate between easy and hard instances is employed as the metric to evaluate the condensation efficacy of low-level and high-level features. The upper solid lines represent the loss change of hard instances, while the lower dashed lines depict the loss change of easy instances. The inability to decrease the loss of hard instances indicates MTT's inadequacy in capturing high-level features. In contrast, our proposed SeqMatch successfully minimizes the loss for both hard and easy instances. Right: The consequent performance improvement of SeqMatch on the CIFAR [25] datasets. Experiments are conducted with 50 images per class (ipc = 50).

In this paper, we investigate the factors that hinder the efficient condensation of high-level features in dataset distillation. Firstly, we reveal that DNNs are optimized through a process of learning from low-level visual features and gradually adapting to higher-level features. The condensation of high-level features determines the effectiveness of dataset distillation. Secondly, we argue that existing dataset distillation methods fail to extract high-level features because they treat the synthetic data as a unified entity and optimize each synthetic instance unvaryingly. Such static optimization makes the synthetic instances more prone to becoming coupled with each other, particularly when more synthetic instances are optimized. As a result, increasing the size of the synthetic dataset will over-condense the low-level features but fail to condense additional knowledge from the real dataset, let alone the higher-level features.

Building upon the insights derived from our analysis, we present a novel dataset distillation strategy, termed Sequential Subset Matching (SeqMatch), which is designed to extract both low-level and high-level features from the real dataset, thereby improving dataset distillation. Our approach adopts a simple yet effective strategy for reorganizing the synthesized dataset S during the distillation and evaluation phases. Specifically, we divide the synthetic dataset into multiple subsets and encourage each subset to acquire knowledge in the order that DNNs learn from the real dataset.
Our approach can be seamlessly integrated into existing dataset distillation methods. The experiments, as shown in Figure 1, demonstrate that SeqMatch effectively enables the latter subsets to capture high-level features. This, in turn, leads to a substantial improvement in performance compared to the baseline method MTT, which struggles to compress higher-level features from the real dataset. Extensive experiments demonstrate that SeqMatch outperforms state-of-the-art methods, particularly in high compression ratio² scenarios, across a range of datasets including CIFAR-10, CIFAR-100, Tiny ImageNet, and subsets of ImageNet. In a nutshell, our contributions can be summarized as follows.

- We examine the inefficacy of current dataset distillation in condensing hard instances from the original dataset. We present insightful analyses regarding the plausible factors contributing to this inefficacy and reveal the inherent preference of dataset distillation in condensing knowledge.
- We thereby propose a novel dataset distillation strategy called Sequential Subset Matching (SeqMatch) to targetedly encourage the condensing of higher-level features. SeqMatch seamlessly integrates with existing dataset distillation methods, offering easy implementation.
- Experiments on diverse datasets demonstrate the effectiveness of SeqMatch, achieving state-of-the-art performance.

² Compression ratio = compressed dataset size / full dataset size [8].

2 Related work

Coreset selection is the traditional dataset reduction approach, which selects representative prototypes from the original dataset [2, 7, 17, 43, 45]. However, the non-editable nature of the coreset limits its performance potential. The idea of synthesizing the coreset can be traced back to Wang et al. [47]. Compared to coreset selection, dataset distillation has demonstrated greatly superior performance. Based on the approach used to optimize the synthetic data, dataset distillation can be taxonomized into two types: data-matching methods and meta-learning methods [30].

Data-matching methods encourage the synthetic data to imitate the influence of the target data, involving the gradients, trajectories, and distributions. Zhao and Bilen [52] proposed distribution matching to update synthetic data. Zhao et al. [53] matched the gradients of the target and synthetic data in each iteration for optimization. This approach led to the development of several advanced gradient-matching methods [5, 20, 22, 51]. Trajectory-matching methods [5, 9, 13] further matched multi-step gradients to optimize the synthetic data, achieving state-of-the-art performance. Factorization-based methods [11, 29, 33] distilled the synthetic data into a low-dimensional manifold and used a decoder to recover the source instances from the factorized features.

Meta-learning methods treat the synthetic data as the parameters to be optimized by a meta (or outer) algorithm [3, 11, 32, 34, 37, 38, 54]. A base (or inner) algorithm solves the supervised learning problem and is nested inside the meta (or outer) algorithm with respect to the synthetic data. The synthetic data can be directly updated to minimize the empirical risk of the network. Kernel ridge regression (KRR) based methods [34, 37, 38] have achieved remarkable performance among meta-learning methods.
Both data-matching and meta-learning methods optimize each synthetic instance equally. The absence of variation in converged synthetic instances may lead to the extraction of similar knowledge and result in over-representation of low-level features.

3 Preliminaries

Background. Throughout this paper, we denote the target dataset as $\mathcal{T} = \{(x_i, y_i)\}_{i=1}^{|\mathcal{T}|}$. Each data pair is drawn i.i.d. from a natural distribution $\mathcal{D}$, with $x_i \in \mathbb{R}^d$ and $y_i \in \mathcal{Y} = \{0, 1, \ldots, C-1\}$, where $d$ is the dimension of the input data and $C$ is the number of classes. We denote the synthetic dataset as $\mathcal{S} = \{(s_i, y_i)\}_{i=1}^{|\mathcal{S}|}$, where $s_i \in \mathbb{R}^d$ and $y_i \in \mathcal{Y}$. Each class of $\mathcal{S}$ contains ipc (images per class) data pairs. Thus, $|\mathcal{S}| = \mathrm{ipc} \cdot C$, and ipc is typically set so that $|\mathcal{S}| \ll |\mathcal{T}|$.

We employ $f_\theta$ to denote a deep neural network $f$ with weights $\theta$. An ideal training process searches for an optimal weight parameter $\hat{\theta}$ that minimizes the expected risk over the natural distribution $\mathcal{D}$, defined as $\mathcal{L}_{\mathcal{D}}(f_\theta) \triangleq \mathbb{E}_{(x,y) \sim \mathcal{D}}\,\ell(f_\theta(x), y)$. However, as we can only access the training set $\mathcal{T}$ sampled from the natural distribution $\mathcal{D}$, the practical approach to training the network $f$ is empirical risk minimization (ERM) on the training set $\mathcal{T}$:

$$\hat{\theta} = \mathrm{alg}(\mathcal{T}, \theta_0) = \arg\min_{\theta} \mathcal{L}_{\mathcal{T}}(f_\theta), \quad \text{where } \mathcal{L}_{\mathcal{T}}(f_\theta) = \frac{1}{|\mathcal{T}|} \sum_{x_i \in \mathcal{T}} \ell\big(f_\theta(x_i), y_i\big), \qquad (1)$$

where $\ell$ can be any training loss function; $\mathrm{alg}$ is the given training algorithm that optimizes the initialized weight parameters $\theta_0$ over the training set $\mathcal{T}$; and $\theta_0$ is initialized by sampling from a distribution $P_{\theta_0}$.

Dataset distillation aims to condense the knowledge of $\mathcal{T}$ into the synthetic dataset $\mathcal{S}$ so that training over the synthetic dataset $\mathcal{S}$ can achieve a comparable performance to training over the target dataset $\mathcal{T}$. The objective of dataset distillation can be formulated as

$$\mathbb{E}_{(x,y) \sim \mathcal{D},\, \theta_0 \sim P_{\theta_0}} \big[\ell\big(f_{\mathrm{alg}(\mathcal{T}, \theta_0)}(x), y\big)\big] \simeq \mathbb{E}_{(x,y) \sim \mathcal{D},\, \theta_0 \sim P_{\theta_0}} \big[\ell\big(f_{\mathrm{alg}(\mathcal{S}, \theta_0)}(x), y\big)\big]. \qquad (2)$$

Gradient Matching Methods. We take gradient matching methods as the backbone method to present our distillation strategy. Matching the gradients induced by $\mathcal{T}$ and $\mathcal{S}$ helps to solve for $\mathcal{S}$ in Equation 2. By doing so, gradient matching methods achieve advanced performance in dataset distillation. Specifically, gradient matching methods introduce a distance metric $D(\cdot, \cdot)$ to measure the distance between gradients. A widely-used distance metric [53] is defined as $D(X, Y) = \sum_{i=1}^{I} \big(1 - \frac{\langle X_i, Y_i \rangle}{\|X_i\|\,\|Y_i\|}\big)$, where $X, Y \in \mathbb{R}^{I \times J}$ and $X_i, Y_i \in \mathbb{R}^{J}$ are the $i$th columns of $X$ and $Y$, respectively. With the defined distance metric $D(\cdot, \cdot)$, gradient matching methods consider solving

$$\widehat{\mathcal{S}} = \arg\min_{\mathcal{S} \subset \mathbb{R}^d \times \mathcal{Y},\; |\mathcal{S}| = \mathrm{ipc} \cdot C} \; \sum_{m=1}^{M} \mathcal{L}(\mathcal{S}, \theta_m), \quad \text{where } \mathcal{L}(\mathcal{S}, \theta) = D\big(\nabla_\theta \mathcal{L}_{\mathcal{S}}(f_\theta), \nabla_\theta \mathcal{L}_{\mathcal{T}}(f_\theta)\big), \qquad (3)$$

where $\theta_m$ denotes the intermediate weights that are continually updated by training the network $f_{\theta_0}$ over the target dataset $\mathcal{T}$. The methods employ $M$ as the hyperparameter to control the length of the teacher trajectory to be matched, starting from the initialized weights $\theta_0 \sim P_{\theta_0}$; $\mathcal{L}(\mathcal{S}, \theta)$ is the matching loss. The teacher trajectory $\{\theta_0, \theta_1, \ldots, \theta_M\}$ is equivalent to a series of gradients $\{g_1, g_2, \ldots, g_M\}$. To ensure the robustness of the synthetic dataset $\mathcal{S}$ to different weight initializations, $\theta_0$ is sampled from $P_{\theta_0}$ many times. As a consequence, the distributions of the gradients used for training can be represented as $\{P_{g_1}, P_{g_2}, \ldots, P_{g_M}\}$.

Algorithm 1 Training with SeqMatch in the Distillation Phase.
Input: Target dataset $\mathcal{T}$; number of subsets $K$; number of iterations $N$ for updating each subset; a base distillation method $\mathcal{A}$.
1: Initialize the synthetic dataset $\mathcal{S}_{all}$
2: Divide $\mathcal{S}_{all}$ into $K$ subsets of equal size $|\mathcal{S}_{all}|/K$, i.e., $\mathcal{S}_{all} = \mathcal{S}_1 \cup \mathcal{S}_2 \cup \cdots \cup \mathcal{S}_K$
3: for each $\mathcal{S}_k$ do
4:   Optimize each subset $\mathcal{S}_k$ sequentially:
5:   repeat
6:     if $k = 1$ then
7:       Initialize network weights $\theta_0^k \sim P_{\theta_0}$
8:     else
9:       Load network weights $\theta_0^k \sim P_{\theta_N^{k-1}}$ saved while optimizing the previous subset $\mathcal{S}_{k-1}$
10:    for $i = 1$ to $N$ do
11:      Update the network weights using subset $\mathcal{S}_k$:
12:        $\theta_i^k = \mathrm{alg}(\mathcal{S}_k \cup \mathcal{S}_{(k-1)}, \theta_{i-1}^k)$
13:      Update $\mathcal{S}_k$ by the base distillation method:
14:        $\mathcal{S}_k \leftarrow \mathcal{A}(\mathcal{T}, \mathcal{S}_k, \theta_i^k)$
15:    Record and save the updated network weights $\theta_N^k$
16: until converge
Output: Distilled synthetic dataset $\mathcal{S}_{all}$

Increasing the size of a synthetic dataset is a straightforward approach to incorporating additional high-level features. However, our findings reveal that simply optimizing more synthetic data leads to an excessive focus on knowledge learned from easy instances. In this section, we first introduce the concept of sequential knowledge acquisition in a standard training procedure (refer to subsection 4.1). Subsequently, we argue that the varying rate of convergence causes certain portions of the synthetic data to abandon the extraction of further knowledge in the later stages (as discussed in subsection 4.2). Finally, we present our proposed strategy, Sequential Subset Matching (referred to as SeqMatch), which is outlined in Algorithm 1 in subsection 4.3.

4.1 Features Are Represented Sequentially

Many studies have observed the sequential acquisition of knowledge in training DNNs. Zeiler et al. [49] revealed that DNNs are optimized to extract low-level visual features, such as edges and textures, in the lower layers, while higher-level features, such as object parts and shapes, are represented in the higher layers. Han et al. [16] leverage the observation that DNNs learn the knowledge from easy instances first and gradually adapt to hard instances [1] to propose noisy-label learning methods. The sequential acquisition of knowledge is a critical aspect of DNN training. However, effectively condensing knowledge throughout the entire training process presents significant challenges for existing dataset distillation methods. While the synthetic dataset S is employed to learn from extensive teacher trajectories, extending the length of these trajectories during distillation can exacerbate the issue of domain shifting in gradient distributions, thereby resulting in performance degradation. This is primarily due to the fact that the knowledge extracted from the target dataset T varies across different epochs, leading to corresponding shifts in the domains of gradient distributions. Consequently, the synthetic dataset S may struggle to adequately capture and consolidate knowledge from prolonged teacher trajectories.

To enhance distillation performance, a common approach is to match a shorter teacher trajectory while disregarding knowledge extracted from the latter epochs of T. For instance, in the case of the CIFAR-10 dataset, the optimal hyperparameters for M (measured in epochs) in the MTT [5] method were found to be 2, 20, and 40 for the ipc = 1, 10, 50 settings, respectively. The compromise made in matching a shorter teacher trajectory unexpectedly resulted in a performance gain, thereby confirming the presence of excessive condensation on easy instances. Taking into account the sequential acquisition of knowledge during deep neural network (DNN) training is crucial for improving the generalization ability of synthetic data.
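As a companion to Algorithm 1, the following Python sketch illustrates the distillation-phase loop. It is a minimal illustration rather than the released implementation: `sample_init_weights`, `train_step`, and `base_distill_step` are hypothetical callables standing in for the weight initialization $P_{\theta_0}$, the inner training algorithm alg, and the backbone distillation update $\mathcal{A}$ (e.g., an MTT or gradient-matching step).

```python
import copy
import random

def seqmatch_distill(target_data, synthetic_all, K, N, num_repeats,
                     sample_init_weights, train_step, base_distill_step):
    """Sketch of Algorithm 1: sequentially optimize K synthetic subsets."""
    # Step 2: split S_all into K equally sized subsets S_1, ..., S_K.
    size = len(synthetic_all) // K
    subsets = [synthetic_all[k * size:(k + 1) * size] for k in range(K)]

    prev_pool = []  # checkpoints theta_N^{k-1}, i.e., the distribution P_{theta_N^{k-1}}
    for k in range(K):
        frozen = sum(subsets[:k], [])      # S_(k-1): previously optimized subsets, kept fixed
        cur_pool = []
        for _ in range(num_repeats):       # "repeat ... until converge"
            # Steps 6-9: fresh weights for the first subset, otherwise resume
            # from a checkpoint saved while optimizing the previous subset.
            theta = sample_init_weights() if k == 0 else copy.deepcopy(random.choice(prev_pool))
            for _ in range(N):
                # Step 12: train the network on S_k united with the frozen subsets.
                theta = train_step(theta, frozen + subsets[k])
                # Step 14: update only S_k with the base distillation method A.
                subsets[k] = base_distill_step(target_data, subsets[k], theta)
            # Step 15: record theta_N^k to warm-start the next subset.
            cur_pool.append(copy.deepcopy(theta))
        prev_pool = cur_pool
    return sum(subsets, [])                # distilled S_all = S_1 ∪ ... ∪ S_K
```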
Involving more synthetic data is the most straightforward approach to condensing additional knowledge from a longer teacher trajectory. However, our experimental findings, as illustrated in Figure 1, indicate that current gradient matching methods tend to prioritize the consolidation of knowledge derived from easy instances in the early epochs. Consequently, we conducted further investigations into the excessive emphasis on low-level features in existing dataset distillation methods.

4.2 Coupled Synthetic Dataset

Figure 2: The accuracy discrepancy between the networks trained using $S^+$ and $S^-$ separately. The discrepancy increases with the magnitude of $R(s_i, f_{\theta_m})$. These results verify the coupling issue between $S^+$ and $S^-$, and our proposed method SeqMatch successfully mitigates the coupling issue. More experimental details can be found in subsection 5.3.

The coupling issue within the synthetic dataset impedes its effectiveness in condensing additional high-level features. Existing dataset distillation methods optimize the synthetic dataset S as a unified entity, resulting in the backpropagated gradients used to update S being applied globally. The gradients on each instance only differ across different initializations and preassigned labels, implying that instances sharing a similar initialization within the same class will converge similarly. Consequently, a portion of the synthetic data only serves the purpose of alleviating the gradient matching error for the pre-existing synthetic data.

Consider a synthetic dataset $\mathcal{S}$ that is newly initialized to be distilled from a target dataset $\mathcal{T}$. The distributions of the gradients for distillation are $\{P_{g_1}, P_{g_2}, \ldots, P_{g_M}\}$, and the sampled gradients for training are $\{g_1, g_2, \ldots, g_M\}$. Suppose that $G$ is the integrated gradient calculated by $\mathcal{S}$. By minimizing the loss function as stated in Equation 3, the gradient used for updating $s_i$ when $\theta = \theta_m$ would be

$$\nabla_{s_i} \mathcal{L}(\mathcal{S}, \theta_m) = \frac{\partial \mathcal{L}}{\partial G} \cdot \frac{\partial G}{\partial \nabla_{\theta_m} \ell(f_{\theta_m}(s_i), y_i)} \cdot \frac{\partial \nabla_{\theta_m} \ell(f_{\theta_m}(s_i), y_i)}{\partial s_i} \triangleq \frac{\partial \mathcal{L}}{\partial G} \cdot R(s_i, f_{\theta_m}), \qquad (4)$$

where we have $\frac{\partial G}{\partial \nabla_{\theta_m} \ell(f_{\theta_m}(s_i), y_i)} = 1$, because $G$ is accumulated from the gradients of each synthetic instance, i.e., $G = \nabla_{\theta_m} \mathcal{L}_{\mathcal{S}}(f_{\theta_m}) = \sum_{i=1}^{|\mathcal{S}|} \nabla_{\theta_m} \ell(f_{\theta_m}(s_i), y_i)$. Here we define the amplification function $R(s_i, f_{\theta_m}) \in \mathbb{R}^d$. Then, the gradients used for updating the synthetic instances, $\nabla_{s_i} \mathcal{L}(\mathcal{S}, \theta_m)$, share the same $\frac{\partial \mathcal{L}}{\partial G}$ and only vary in $R(s_i, f_{\theta_m})$. The amplification function $R(s_i, f_{\theta_m})$ is only affected by the pre-assigned label and initialization of $s_i$. More importantly, the magnitude of $R(s_i, f_{\theta_m})$ determines the rate of convergence of each synthetic instance $s_i$.

Sorted by the $\ell_1$-norm of the amplification function, $\|R(s_i, f_{\theta_m})\|_1$, the synthetic dataset $\mathcal{S}$ can be divided into two subsets $S^+$ and $S^-$, where $S^+$ contains the synthetic instances with greater values of $\|R(s_i, f_{\theta_m})\|_1$ than those in $S^-$. This implies that instances in $S^+$ converge faster to minimize $D(\nabla_{\theta_m} \mathcal{L}_{S^+}(f_{\theta_m}), g_m)$, and $S^+$ is optimized to imitate $g_m$. On the other hand, the instances in $S^-$ converge more slowly and are optimized to minimize $D(\nabla_{\theta_m} \mathcal{L}_{S^-}(f_{\theta_m}), \epsilon)$, where $\epsilon$ represents the gradient matching error of $S^+$, i.e., $\epsilon = g_m - \nabla_{\theta_m} \mathcal{L}_{S^+}(f_{\theta_m})$. Therefore, $S^-$ is optimized to imitate $\epsilon$, and its effectiveness is achieved by compensating for the gradient matching error of $S^+$. $S^-$ is coupled with $S^+$ and unable to capture the higher-level features in the latter epochs.
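To make the shared $\partial\mathcal{L}/\partial G$ term above concrete, the sketch below shows a gradient-matching loss in the style of Equation 3: the synthetic gradient is accumulated over all synthetic instances before being compared with the real gradient, so every instance receives the same upstream signal during backpropagation. The helper names and the exact grouping of gradient entries are our own simplifications, not the released code.

```python
import torch
import torch.nn.functional as F

def gradient_distance(grads_syn, grads_real):
    """D(X, Y) in the spirit of Equation 3: sum of (1 - cosine similarity)
    over groups of gradient entries, one group per output unit (simplified)."""
    dist = 0.0
    for gs, gr in zip(grads_syn, grads_real):
        gs = gs.reshape(gs.shape[0], -1)
        gr = gr.reshape(gr.shape[0], -1)
        dist = dist + (1.0 - F.cosine_similarity(gs, gr, dim=1)).sum()
    return dist

def matching_loss(model, criterion, syn_images, syn_labels, grads_real):
    """Matching loss L(S, theta_m): the synthetic gradient G is accumulated
    over the whole synthetic batch, so the upstream dL/dG is shared by every
    s_i and per-instance updates differ only through R(s_i, f_theta)."""
    params = [p for p in model.parameters() if p.requires_grad]
    loss_syn = criterion(model(syn_images), syn_labels)
    grads_syn = torch.autograd.grad(loss_syn, params, create_graph=True)
    return gradient_distance(grads_syn, grads_real)
```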
We conducted experiments to investigate whether $S^-$ solely compensates for the gradient matching error of $S^+$ and is unable to extract knowledge independently. To achieve this, we sorted $S^+$ and $S^-$ by the $\ell_1$-norm of the amplification function, $\|R(s_i, f_{\theta_m})\|_1$, and trained separate networks with $S^+$ and $S^-$. As depicted in Figure 2, we observed a significant discrepancy in accuracy, which increased with the difference in magnitude of $R(s_i, f_{\theta_m})$. Further details and discussion are provided in subsection 5.3. These experiments verify the coupling issue wherein $S^-$ compensates for the matching error of $S^+$, thereby reducing its effectiveness in condensing additional knowledge.

4.3 Sequential Subset Matching

If we use a standard deep learning task as an analogy for the dataset distillation problem, the synthetic dataset S can be thought of as the weight parameters that need to be optimized. However, simply increasing the size of the synthetic dataset is comparable to multiplying the parameters of a model within a single layer without rearchitecting the newly added parameters, and the resulting performance improvement is marginal. We thereby propose SeqMatch to reorganize the synthetic dataset S so as to utilize the newly added synthetic data. We incorporate additional variability into the optimization process of the synthetic data to encourage the capture of higher-level features extracted later in the training process. To do this, SeqMatch divides the synthetic dataset S into K equal subsets, i.e., $\mathcal{S} = \mathcal{S}_1 \cup \mathcal{S}_2 \cup \cdots \cup \mathcal{S}_K$ with $|\mathcal{S}_k| = \frac{|\mathcal{S}|}{K}$. SeqMatch optimizes each $\mathcal{S}_k$ by solving

$$\widehat{\mathcal{S}}_k = \arg\min_{\mathcal{S}_k \subset \mathbb{R}^d \times \mathcal{Y},\; |\mathcal{S}_k| = |\mathcal{S}|/K} \; \sum_{m=(k-1)n}^{kn} \mathcal{L}\big(\mathcal{S}_k \cup \mathcal{S}_{(k-1)}, \theta_m\big), \qquad (5)$$

where $\mathcal{S}_{(k-1)} = \mathcal{S}_1 \cup \mathcal{S}_2 \cup \cdots \cup \mathcal{S}_{k-1}$ represents the union of the former subsets. $\mathcal{S}_{(k-1)}$ is fixed and only $\mathcal{S}_k$ is updated. The subset $\mathcal{S}_k$ is encouraged to match the corresponding $k$th segment of the teacher trajectory so as to condense the knowledge of the later epochs. Let $n = \frac{M}{K}$ denote the length of the trajectory segment to be matched by each subset $\mathcal{S}_k$ in the proposed framework. To strike a balance between providing adequate capacity for distillation and avoiding coupled synthetic data, the size of each subset $\mathcal{S}_k$ is controlled by $K$.

In the distillation phase, the subsets are arranged in ascending order and optimized sequentially. We reveal that the first subset $\mathcal{S}_1$, with $\frac{1}{K}$ of the size of the original synthetic dataset $\mathcal{S}$, is sufficient to condense adequate knowledge from the former epochs. Each subsequent subset $\mathcal{S}_k$ is encouraged to condense knowledge different from that condensed in the previous subsets. This is achieved by minimizing the matching loss $\mathcal{L}(\mathcal{S}_k \cup \mathcal{S}_{(k-1)}, \theta_m)$ while only $\mathcal{S}_k$ is updated. During the evaluation phase, the subsets of the synthetic dataset are used sequentially to train the neural network $f_\theta$, with the weight parameters $\theta$ being iteratively updated by $\theta_k = \mathrm{alg}(\mathcal{S}_k, \theta_{k-1})$. This training process emulates the sequential feature extraction from the real dataset $\mathcal{T}$ during training. Further details regarding SeqMatch and the optimization of $\theta$ can be found in Algorithm 1.

5 Experiment

In this section, we provide implementation details for our proposed method, along with instructions for reproducibility. We compare the performance of SeqMatch against state-of-the-art dataset distillation methods on a variety of datasets. To ensure a fair and comprehensive comparison, we follow the experimental setup as stated in [8, 30].
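In all of the experiments that follow, networks trained on SeqMatch data consume the subsets sequentially, as described in subsection 4.3. The sketch below summarizes this evaluation-phase procedure; `make_model` and `train_on` are hypothetical stand-ins for the ConvNet constructor and the standard training loop of the backbone method.

```python
import torch

def seqmatch_evaluate(subsets, make_model, train_on, test_loader, device="cuda"):
    """Evaluation-phase sketch: theta_k = alg(S_k, theta_{k-1}) for k = 1..K,
    followed by standard test-set accuracy."""
    model = make_model().to(device)
    for subset in subsets:            # S_1, S_2, ..., S_K in ascending order
        model = train_on(model, subset)
    model.eval()
    correct = total = 0
    with torch.no_grad():
        for images, labels in test_loader:
            images, labels = images.to(device), labels.to(device)
            correct += (model(images).argmax(dim=1) == labels).sum().item()
            total += labels.numel()
    return correct / total
```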
We provide more experiments to verify the effectiveness of SeqMatch, including results on ImageNet subsets and analysis experiments, in Appendix A.1 due to page constraints.

5.1 Experimental Setup

Datasets: We evaluate the performance of dataset distillation methods on several widely-used datasets across various resolutions. MNIST [28], a fundamental classification dataset, is included with a resolution of 28 × 28. SVHN [36] is also considered, which is composed of RGB images of house numbers with a resolution of 32 × 32. CIFAR-10 and CIFAR-100 [25], two datasets frequently used in dataset distillation, are evaluated in this study. These datasets consist of 50,000 training images and 10,000 test images from 10 and 100 different categories, respectively. Additionally, our proposed method is evaluated on the Tiny ImageNet [27] dataset with a resolution of 64 × 64 and on the ImageNet [24] subsets with a resolution of 128 × 128.

Evaluation Metric: The evaluation involves a distillation phase and an evaluation phase. In the former, the synthetic dataset is optimized under a distillation budget that typically restricts the number of images per class (ipc). We evaluate the performance of our method and baseline methods under the settings ipc = {10, 50}. We do not evaluate the setting with ipc = 1 since our approach requires ipc ≥ 2. To facilitate a clear comparison, we mark the factorization-based baselines with an asterisk (*) since they often employ an additional decoder, following the suggestion in [30]. We employ a 4-layer ConvNet [14] for the Tiny ImageNet dataset, whereas for the other datasets we use a 3-layer ConvNet [14]. In the evaluation phase, we utilize the optimized synthetic dataset to train neural networks using a standard training procedure. Specifically, we use each synthetic dataset to train five networks with random initializations for 1,000 iterations and report the mean accuracy and its standard deviation.

Implementation Details: To ensure the reproducibility of SeqMatch, we provide detailed implementation specifications. Our method relies on a single hyperparameter, denoted by K, which determines the number of subsets. In order to balance the inclusion of sufficient knowledge in each segment with the capture of high-level features in the later stages, we set K = {2, 3} for the scenarios where ipc = {10, 50}, respectively. Notably, our evaluation results demonstrate that the choice of K remains consistent across the various datasets. As a plug-in strategy, SeqMatch requires a backbone method for dataset synthesis. Each synthetic subset is optimized using a standard training procedure specific to the chosen backbone method. The only hyperparameters that require adjustment in the backbone method are those that control the segments of the teacher trajectory to be learned by the synthetic dataset, whereas the remaining hyperparameters remain consistent without adjustment. Such adjustment ensures that each synthetic subset effectively condenses the knowledge in stages. The precise hyperparameters of the backbone methods are presented in Appendix A.3. We conduct our experiments on a server with four Tesla V100 GPUs.

5.2 Results

Our proposed SeqMatch is plugged into the methods MTT [5] and IDC [23], denoted as SeqMatch-MTT and SeqMatch-IDC, respectively. Table 1 summarizes the classification accuracies of ConvNet [14] trained using each dataset distillation method.
The results indicate that SeqMatch significantly outperforms the backbone methods across various datasets and surpasses the state-of-the-art baseline methods in different settings. Notably, SeqMatch achieves a greater performance improvement in scenarios with a high compression ratio (i.e., ipc = 50). For instance, we observe a 3.5% boost in the performance of MTT [5], achieving 51.2% accuracy on CIFAR-100. Similarly, we observe a 1.9% performance enhancement in IDC [23], achieving 92.1% accuracy on SVHN, which approaches the 95.4% accuracy obtained using the real dataset. These results suggest that our method is effective in mitigating the adverse effects of coupling and effectively condenses high-level features in high compression ratio scenarios.

Table 1: Performance comparison of dataset distillation methods across a variety of datasets. The abbreviations GM, TM, DM, and META stand for gradient matching, trajectory matching, distribution matching, and meta-learning, respectively. We reproduce the results of MTT [5] and IDC [23] and cite the results of the other baselines [30]. The best results of non-factorized methods (without decoders) are highlighted in orange font. The best results of factorization-based methods are highlighted in blue font.

| Methods | Schemes | MNIST 10 | MNIST 50 | SVHN 10 | SVHN 50 | CIFAR-10 10 | CIFAR-10 50 | CIFAR-100 10 | CIFAR-100 50 | Tiny ImageNet 10 |
|---|---|---|---|---|---|---|---|---|---|---|
| DD [47] | META | 79.5 ± 8.1 | - | - | - | 36.8 ± 1.2 | - | - | - | - |
| DC [53] | GM | 94.7 | | | | | | | | |
| DSA [51] | GM | 97.8 | | | | | | | | |
| DM [52] | DM | 97.3 ± 0.2 | | | | 48.9 | | | | |
| CAFE [46] | DM | 97.5 | | | | | | | | |
| KIP [37, 38] | KRR | 97.5 | | | | | | | | |
| FTD [13] | TM | - | - | - | - | 66.6 | | | | |
| MTT [5] | TM | 97.3 | | | | | | | | |
| SeqMatch-MTT | TM | 97.6 | | | | | | | | |
| IDC [23]³ | GM | 98.4 | | | | | | | | |
| SeqMatch-IDC | GM | 98.6 | | | | | | | | |
| RTP [11] | META | 99.3 | | | | | | | | |
| HaBa [33] | TM | - | - | 83.2 | | | | | | |
| Whole | - | 99.6 | | | | | | | | |

³ Although IDC [23] is not categorized as a factorization-based method, it employs data parameterization to better improve the performance of the synthetic dataset. Therefore, we compare IDC to the factorization-based methods, as factorization can be treated as a special kind of data parameterization.

Cross-Architecture Generalization: We also conducted experiments to evaluate cross-architecture generalization, as illustrated in Table 2. The ability to generalize effectively across different architectures is crucial for the practical application of dataset distillation. We evaluated our proposed SeqMatch on the CIFAR-10 dataset with ipc = 50. Following the evaluation metric established in [13, 46], three additional neural network architectures were utilized for evaluation: ResNet [19], VGG [44], and AlexNet [26]. Our SeqMatch approach demonstrated a significant improvement in performance during cross-architecture evaluation, highlighting its superior generalization ability.

Table 2: Cross-architecture results trained with ConvNet on CIFAR-10 with ipc = 50. We cite the results reported in Du et al. [13].

| Method | ConvNet | ResNet18 | VGG11 | AlexNet |
|---|---|---|---|---|
| DC [53] | 53.9 | | | |
| CAFE [46] | 55.5 | | | |
| MTT [5] | 71.6 | | | |
| FTD [13] | 73.8 | | | |
| SeqMatch-MTT | 74.4 | | | |
| SeqMatch-IDC | 75.3 ± 0.2 | 69.7 ± 0.6 | 73.4 ± 0.1 | 72.0 ± 0.2 |

5.3 Discussions

Sequential Knowledge Acquisition: We conducted experiments on CIFAR-10 with ipc = 50, presented in Figure 1, to investigate the inability of existing baseline methods to capture the knowledge learned in the later epochs, as discussed in subsection 4.1. Inspired by [16], we utilized the change in instance-wise loss on the real dataset to measure the effectiveness of condensing high-level features. Specifically, we recorded the loss of each instance from the real dataset T at every epoch, where the network was trained with the synthetic dataset for only 20 iterations in each epoch.
To distinguish hard instances from easy ones, we employed the k-means algorithm [18] to cluster all instances in the real dataset into two clusters based on the recorded instance-wise loss. The resulting distribution of instances in terms of difficulty is as follows: 77% are considered easy instances, while 23% are classified as hard instances. We evaluated MTT [5] and SeqMatch as described above. Our results show that MTT [5] over-condenses the knowledge learned in the former epochs. In contrast, SeqMatch is able to successfully capture the knowledge learned in the later epochs.

Coupled Synthetic Subsets: In order to validate our hypothesis that the synthetic subset $S^-$ is ineffective at condensing knowledge independently and results in over-condensation of the knowledge learned in the former epochs, we conducted the experiments shown in Figure 2. We sorted the subsets $S^+$ and $S^-$ of the same size by the $\ell_1$-norm of the amplification function, $\|R(s_i, f_{\theta_m})\|_1$, as explained in subsection 4.2. We then recorded the accuracy discrepancies between the separate networks trained with $S^+$ and $S^-$ with respect to the mean $\ell_1$-norm difference, i.e., $\mathbb{E}_{s_i \in S^+}[\|R(s_i, f_{\theta_0})\|_1] - \mathbb{E}_{s_i \in S^-}[\|R(s_i, f_{\theta_0})\|_1]$. As shown in Figure 2, the accuracy discrepancies increase linearly with the $\ell_1$-norm difference, which verifies our hypothesis that $S^-$ is coupled with $S^+$ and that this coupling leads to excessive condensation of low-level features. Our proposed method, SeqMatch, is able to alleviate the coupling issue by encouraging $S^-$ to condense knowledge more efficiently.

Figure 3: Visualization example of "car" synthetic images distilled by MTT [5] and SeqMatch from 32 × 32 CIFAR-10 (ipc = 50).

Synthetic Image Visualization: In order to demonstrate the distinction between MTT [5] and SeqMatch, we visualized synthetic images within the "car" class from CIFAR-10 [25] and visually compared them. As depicted in Figure 3, the synthetic images produced by MTT exhibit more concrete features and closely resemble actual "car" images. Conversely, the synthetic images generated by SeqMatch in the 2nd and 3rd subsets possess more abstract attributes and contain complex car shapes. We provide more visualizations of the synthetic images in Appendix A.2.

5.4 Limitations and Future Work

We acknowledge the limitations of our work from two perspectives. Firstly, our proposed sequential optimization of synthetic subsets increases the overall training time, potentially doubling or tripling it. To address this, future research could investigate optimization methods that allow for parallel optimization of each synthetic subset. Secondly, as the performance of subsequent synthetic subsets builds upon the performance of previous subsets, a strategy is required to adaptively distribute the distillation budget of each subset. Further research could explore strategies to address this limitation and effectively enhance the performance of dataset distillation, particularly in high compression ratio scenarios.

6 Conclusion

In this study, we provide empirical evidence of the failure to condense high-level features in dataset distillation, attributed to the sequential acquisition of knowledge in training DNNs. We reveal that the static optimization of synthetic data leads to a bias toward over-condensing the low-level features that are predominantly extracted from the majority of instances during the initial stages of training.
To address this issue in a targeted manner, we introduce an adaptive, plug-in distillation strategy called SeqMatch. Our proposed strategy divides the synthetic data into multiple subsets, which are sequentially optimized, thereby promoting the effective condensation of high-level features learned in the later epochs. Through comprehensive experimentation on diverse datasets, we validate the effectiveness of our analysis and proposed strategy, achieving state-of-the-art performance.

Acknowledgements

This work is supported by Joey Tianyi Zhou's A*STAR SERC Central Research Fund (Use-inspired Basic Research) and the Singapore Government's Research, Innovation and Enterprise 2020 Plan (Advanced Manufacturing and Engineering domain) under Grant A18A1b0045.

References

[1] Devansh Arpit, Stanisław Jastrzębski, Nicolas Ballas, David Krueger, Emmanuel Bengio, Maxinder S Kanwal, Tegan Maharaj, Asja Fischer, Aaron Courville, Yoshua Bengio, et al. A closer look at memorization in deep networks. In International Conference on Machine Learning, pages 233–242. PMLR, 2017.
[2] Olivier Bachem, Mario Lucic, and Andreas Krause. Practical coreset constructions for machine learning. arXiv preprint arXiv:1703.06476, 2017.
[3] Ondrej Bohdal, Yongxin Yang, and Timothy Hospedales. Flexible dataset distillation: Learn labels instead of images. arXiv preprint arXiv:2006.08572, 2020.
[4] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in Neural Information Processing Systems, 33:1877–1901, 2020.
[5] George Cazenavette, Tongzhou Wang, Antonio Torralba, Alexei A Efros, and Jun-Yan Zhu. Dataset distillation by matching training trajectories. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4750–4759, 2022.
[6] George Cazenavette, Tongzhou Wang, Antonio Torralba, Alexei A Efros, and Jun-Yan Zhu. Generalizing dataset distillation via deep generative prior. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3739–3748, 2023.
[7] Yutian Chen, Max Welling, and Alex Smola. Super-samples from kernel herding. arXiv preprint arXiv:1203.3472, 2012.
[8] Justin Cui, Ruochen Wang, Si Si, and Cho-Jui Hsieh. DC-BENCH: Dataset condensation benchmark. arXiv preprint arXiv:2207.09639, 2022.
[9] Justin Cui, Ruochen Wang, Si Si, and Cho-Jui Hsieh. Scaling up dataset distillation to ImageNet-1K with constant memory. arXiv preprint arXiv:2211.10586, 2022.
[10] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 248–255. IEEE, 2009.
[11] Zhiwei Deng and Olga Russakovsky. Remember the past: Distilling datasets into addressable memories for neural networks. arXiv preprint arXiv:2206.02916, 2022.
[12] Tian Dong, Bo Zhao, and Lingjuan Lyu. Privacy for free: How does dataset condensation help privacy? arXiv preprint arXiv:2206.00240, 2022.
[13] Jiawei Du, Yidi Jiang, Vincent Y. F. Tan, Joey Tianyi Zhou, and Haizhou Li. Minimizing the accumulated trajectory error to improve dataset distillation. arXiv preprint arXiv:2211.11004, 2022.
[14] Spyros Gidaris and Nikos Komodakis. Dynamic few-shot visual learning without forgetting.
In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4367–4375, 2018.
[15] Jack Goetz and Ambuj Tewari. Federated learning via synthetic data. arXiv preprint arXiv:2008.04489, 2020.
[16] Bo Han, Quanming Yao, Xingrui Yu, Gang Niu, Miao Xu, Weihua Hu, Ivor Tsang, and Masashi Sugiyama. Co-teaching: Robust training of deep neural networks with extremely noisy labels. Advances in Neural Information Processing Systems, 31, 2018.
[17] Sariel Har-Peled and Akash Kushal. Smaller coresets for k-median and k-means clustering. In Proceedings of the Twenty-First Annual Symposium on Computational Geometry, pages 126–134, 2005.
[18] John A Hartigan and Manchek A Wong. Algorithm AS 136: A k-means clustering algorithm. Journal of the Royal Statistical Society, Series C (Applied Statistics), 28(1):100–108, 1979.
[19] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
[20] Zixuan Jiang, Jiaqi Gu, Mingjie Liu, and David Z Pan. Delving into effective gradient matching for dataset condensation. arXiv preprint arXiv:2208.00311, 2022.
[21] Haifeng Jin, Qingquan Song, and Xia Hu. Auto-Keras: An efficient neural architecture search system. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 1946–1956, 2019.
[22] Jang-Hyun Kim, Jinuk Kim, Seong Joon Oh, Sangdoo Yun, Hwanjun Song, Joonhyun Jeong, Jung-Woo Ha, and Hyun Oh Song. Dataset condensation via efficient synthetic-data parameterization. In International Conference on Machine Learning, pages 11102–11118. PMLR, 2022.
[23] Jang-Hyun Kim, Jinuk Kim, Seong Joon Oh, Sangdoo Yun, Hwanjun Song, Joonhyun Jeong, Jung-Woo Ha, and Hyun Oh Song. Dataset condensation via efficient synthetic-data parameterization. arXiv preprint arXiv:2205.14959, 2022.
[24] Alexander Kolesnikov, Lucas Beyer, Xiaohua Zhai, Joan Puigcerver, Jessica Yung, Sylvain Gelly, and Neil Houlsby. Big Transfer (BiT): General visual representation learning. In European Conference on Computer Vision, pages 491–507. Springer, 2020.
[25] Alex Krizhevsky, Vinod Nair, and Geoffrey Hinton. CIFAR-10 and CIFAR-100 datasets. URL: https://www.cs.toronto.edu/~kriz/cifar.html, 2009.
[26] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. ImageNet classification with deep convolutional neural networks. Advances in Neural Information Processing Systems, 25, 2012.
[27] Ya Le and Xuan Yang. Tiny ImageNet visual recognition challenge. CS 231N, 7(7):3, 2015.
[28] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
[29] Hae Beom Lee, Dong Bok Lee, and Sung Ju Hwang. Dataset condensation with latent space knowledge factorization and sharing. arXiv preprint arXiv:2208.10494, 2022.
[30] Shiye Lei and Dacheng Tao. A comprehensive survey to dataset distillation. arXiv preprint arXiv:2301.05603, 2023.
[31] Guang Li, Ren Togo, Takahiro Ogawa, and Miki Haseyama. Soft-label anonymous gastric X-ray image distillation. In 2020 IEEE International Conference on Image Processing (ICIP), pages 305–309. IEEE, 2020.
[32] Ping Liu, Xin Yu, and Joey Tianyi Zhou. Meta knowledge condensation for federated learning. arXiv preprint arXiv:2209.14851, 2022.
[33] Songhua Liu, Kai Wang, Xingyi Yang, Jingwen Ye, and Xinchao Wang. Dataset distillation via factorization. arXiv preprint arXiv:2210.16774, 2022.
[34] Noel Loo, Ramin Hasani, Alexander Amini, and Daniela Rus. Efficient dataset distillation using random feature approximation. arXiv preprint arXiv:2210.12067, 2022.
[35] Noel Loo, Ramin Hasani, Mathias Lechner, and Daniela Rus. Dataset distillation with convexified implicit gradients. arXiv preprint arXiv:2302.06755, 2023.
[36] Yuval Netzer, Tao Wang, Adam Coates, Alessandro Bissacco, Bo Wu, and Andrew Y Ng. Reading digits in natural images with unsupervised feature learning. 2011.
[37] Timothy Nguyen, Zhourong Chen, and Jaehoon Lee. Dataset meta-learning from kernel ridge-regression. arXiv preprint arXiv:2011.00050, 2020.
[38] Timothy Nguyen, Roman Novak, Lechao Xiao, and Jaehoon Lee. Dataset distillation with infinitely wide convolutional networks. Advances in Neural Information Processing Systems, 34:5186–5198, 2021.
[39] Hieu Pham, Melody Guan, Barret Zoph, Quoc Le, and Jeff Dean. Efficient neural architecture search via parameters sharing. In International Conference on Machine Learning, pages 4095–4104. PMLR, 2018.
[40] Pengzhen Ren, Yun Xiao, Xiaojun Chang, Po-Yao Huang, Zhihui Li, Xiaojiang Chen, and Xin Wang. A comprehensive survey of neural architecture search: Challenges and solutions. ACM Computing Surveys (CSUR), 54(4):1–34, 2021.
[41] Andrea Rosasco, Antonio Carta, Andrea Cossu, Vincenzo Lomonaco, and Davide Bacciu. Distilled replay: Overcoming forgetting through synthetic samples. arXiv preprint arXiv:2103.15851, 2021.
[42] Noveen Sachdeva and Julian McAuley. Data distillation: A survey. arXiv preprint arXiv:2301.04272, 2023.
[43] Ozan Sener and Silvio Savarese. Active learning for convolutional neural networks: A core-set approach. arXiv preprint arXiv:1708.00489, 2017.
[44] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
[45] Ivor W Tsang, James T Kwok, Pak-Ming Cheung, and Nello Cristianini. Core vector machines: Fast SVM training on very large data sets. Journal of Machine Learning Research, 6(4), 2005.
[46] Kai Wang, Bo Zhao, Xiangyu Peng, Zheng Zhu, Shuo Yang, Shuo Wang, Guan Huang, Hakan Bilen, Xinchao Wang, and Yang You. CAFE: Learning to condense dataset by aligning features. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12196–12205, 2022.
[47] Tongzhou Wang, Jun-Yan Zhu, Antonio Torralba, and Alexei A Efros. Dataset distillation. arXiv preprint arXiv:1811.10959, 2018.
[48] Ruonan Yu, Songhua Liu, and Xinchao Wang. Dataset distillation: A comprehensive review. arXiv preprint arXiv:2301.07014, 2023.
[49] Matthew D Zeiler and Rob Fergus. Visualizing and understanding convolutional networks. In Computer Vision – ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part I, pages 818–833. Springer, 2014.
[50] Lei Zhang, Jie Zhang, Bowen Lei, Subhabrata Mukherjee, Xiang Pan, Bo Zhao, Caiwen Ding, Yao Li, and Dongkuan Xu. Accelerating dataset distillation via model augmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11950–11959, 2023.
[51] Bo Zhao and Hakan Bilen. Dataset condensation with differentiable siamese augmentation. In International Conference on Machine Learning, pages 12674–12685.
PMLR, 2021.
[52] Bo Zhao and Hakan Bilen. Dataset condensation with distribution matching. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 6514–6523, 2023.
[53] Bo Zhao, Konda Reddy Mopuri, and Hakan Bilen. Dataset condensation with gradient matching. ICLR, 2021.
[54] Yongchao Zhou, Ehsan Nezhadarya, and Jimmy Ba. Dataset distillation using neural feature regression. arXiv preprint arXiv:2206.00719, 2022.

A More Experiments

A.1 ImageNet Subsets

To assess the efficacy of our approach, we conducted experiments on subsets of the ImageNet dataset [10]. These subsets were constructed by selecting ten pertinent categories from the ImageNet-1k dataset [10], with a resolution of 128 × 128. Consequently, the ImageNet subsets pose greater challenges compared to the CIFAR-10/100 [25] and Tiny ImageNet [27] datasets. We adhered to the configuration of the ImageNet subsets as suggested by previous studies [5, 13, 33], encompassing subsets such as ImageNette (diverse objects), ImageWoof (dog breeds), ImageFruit (various fruits), and ImageMeow (cats). To synthesize the dataset, we employed a 5-layer ConvNet [14] with ipc = 10. The evaluation of the synthetic dataset involved performing five trials with randomly initialized networks. We compared the outcomes of our proposed method, SeqMatch, with the baseline approach MTT [5], as well as the plug-in strategies FTD [13] and HaBa [33], which build upon MTT. The comprehensive results are presented in Table 3. Our proposed SeqMatch consistently outperformed the baseline MTT across all subsets. Notably, we achieved a performance improvement of 4.3% on the ImageFruit subset. Additionally, SeqMatch demonstrated superior performance compared to HaBa and achieved comparable results to FTD.

Table 3: Performance comparison trained with a 5-layer ConvNet on the ImageNet subsets with a resolution of 128 × 128. We cite the results as reported in MTT [5], FTD [13] and HaBa [33]. The latter two methods, FTD and HaBa, are plug-in strategies that build upon the foundation of MTT. Our proposed approach, SeqMatch, exhibits superior performance compared to both MTT and HaBa, demonstrating a significant improvement in results.

| Method | ImageNette | ImageWoof | ImageFruit | ImageMeow |
|---|---|---|---|---|
| Real dataset | 87.4 | | | |
| MTT [5] | 63.0 | | | |
| FTD [13] | 67.7 | | | |
| HaBa [33] | 64.7 | | | |
| SeqMatch-MTT | 66.9 | | | |
| SeqMatch-FTD | 70.6 | | | |

A.2 More Visualizations

Instance-wise loss change: We have presented the average loss change of easy and hard instances in Figure 1, revealing that MTT failed to effectively condense the knowledge learned from the hard instances. To avoid the bias introduced by averaging, we have meticulously recorded and visualized the precise loss change of each individual instance. This is accomplished by employing a heatmap representation, as demonstrated in Figure 4. Each instance is depicted as a horizontal line exhibiting varying colors, with deeper shades of blue indicating higher loss values. Following the same clustering approach as depicted in Figure 1, we visualize the hard instances at the top and the easy instances at the bottom. The individual loss changes of MTT, depicted in Figure 4, remain static across epochs. The losses of easy instances decrease to a small value during the initial stages, while the losses of hard instances persist at a high value until the end of training. These results confirm that MTT excessively focuses on low-level features.
In contrast, the visualization of SeqMatch clearly exhibits a color gradient, indicating a decrease in loss for the majority of instances. Notably, the losses of hard instances experience a significant decrease when a subsequent subset is introduced. These results validate that SeqMatch effectively consolidates knowledge in a sequential manner.

Figure 4: The heatmap illustrates the loss change of each instance in the real dataset across epochs. Each row in the heatmap represents an instance, while a deeper blue color denotes a higher instance-wise loss. The network is trained with the synthetic datasets distilled by MTT and SeqMatch. Left: MTT fails to reduce the loss of hard instances while excessively reducing the loss of easy instances. Right: SeqMatch minimizes the loss of both the hard and easy instances.

Figure 5: Visualization example of synthetic images distilled by MTT [5] and SeqMatch from 32 × 32 CIFAR-10 (ipc = {2, 3}). The SeqMatch (ipc = 2) result is seamlessly embedded as the first two rows within the SeqMatch (ipc = 3) visualization.

Synthetic Dataset Visualization: We compare the synthetic images with ipc = {2, 3} from the CIFAR-10 dataset to highlight the differences between the subsets of SeqMatch in Figure 5. We provide more visualizations of the synthetic datasets for ipc = 10 from the 128 × 128 resolution ImageNet dataset: the ImageWoof subset in Figure 7 and the ImageMeow subset in Figure 8. In addition, parts of the visualizations of synthetic images from the 32 × 32 resolution CIFAR-100 dataset are shown in Figure 6. We observe that the synthetic images generated by SeqMatch in the subsequent subset contain more abstract features than those in the previous subset.

Figure 6: Visualization of the first 10 classes of synthetic images distilled by SeqMatch from 32 × 32 CIFAR-100 (ipc = 10). The initial 5 image rows and the final 5 image rows match the first and second subsets, respectively.

A.3 Hyperparameter Details

The hyperparameter K of SeqMatch-MTT is set to {2, 3} for the settings ipc = {10, 50}, respectively. The optimal value of the hyperparameter K is obtained via grid search within the set {2, 3, 4, 5, 6} on a validation set within the CIFAR-10 dataset. We find that a subset that is too small will fail to condense adequate knowledge from the corresponding segment of the teacher trajectories, resulting in performance degradation in the subsequent subsets. For the rest of the hyperparameters, we report them in Table 4.

⁴ ImageFruit has a different setting of Max Start Epoch from the other ImageNet subsets: {10, 10}.

Figure 7: Visualization of the synthetic images distilled by SeqMatch from 128 × 128 ImageWoof (ipc = 10). The initial 5 image rows and the final 5 image rows match the first and second subsets, respectively.

Table 4: Hyperparameter values used for SeqMatch-MTT in the main result table. The hyperparameters Max Start Epoch and Synthetic Step vary with the subsets; we use a sequence of numbers to denote the parameters used in the corresponding subsets. Img. is the abbreviation of ImageNet.
| | CIFAR-10 | CIFAR-10 | CIFAR-100 | CIFAR-100 | Tiny Img. | Img. Subsets |
|---|---|---|---|---|---|---|
| ipc | 10 | 50 | 10 | 50 | 10 | 10 |
| K | 2 | 3 | 2 | 3 | 2 | 2 |
| Max Start Epoch | {20,10} | {20,20,10} | {20,40} | {40,20,20} | {20,10} | {10,5}⁴ |
| Synthetic Step | {30,80} | 30 | 30 | 80 | 20 | 20 |
| Expert Epoch | {2,3} | 2 | 2 | 2 | 2 | 2 |
| Synthetic Batch Size | - | - | - | 125 | 100 | 20 |
| Learning Rate (Pixels) | 100 | 100 | 1000 | 1000 | 10000 | 100000 |
| Learning Rate (Step Size) | 1e-5 | 1e-5 | 1e-5 | 1e-5 | 1e-4 | 1e-6 |
| Learning Rate (Teacher) | 0.001 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 |

Figure 8: Visualization of the synthetic images distilled by SeqMatch from 128 × 128 ImageMeow (ipc = 10). The initial 5 image rows and the final 5 image rows match the first and second subsets, respectively.
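For convenience, the CIFAR-10 columns of Table 4 can also be expressed as a small configuration dictionary. This is only an illustrative sketch; the key names are ours and do not correspond to identifiers in the released code.

```python
# SeqMatch-MTT hyperparameters for CIFAR-10, transcribed from Table 4.
# Values given per subset are listed in subset order.
SEQMATCH_MTT_CIFAR10 = {
    10: {                      # ipc = 10, distilled with K = 2 subsets
        "K": 2,
        "max_start_epoch": [20, 10],
        "synthetic_step": [30, 80],
        "expert_epoch": [2, 3],
        "lr_pixels": 100,
        "lr_step_size": 1e-5,
        "lr_teacher": 0.001,
    },
    50: {                      # ipc = 50, distilled with K = 3 subsets
        "K": 3,
        "max_start_epoch": [20, 20, 10],
        "synthetic_step": 30,
        "expert_epoch": 2,
        "lr_pixels": 100,
        "lr_step_size": 1e-5,
        "lr_teacher": 0.01,
    },
}
```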