# Data Distillation: A Survey

Published in Transactions on Machine Learning Research (07/2023)

Noveen Sachdeva, nosachde@ucsd.edu, Computer Science & Engineering, University of California, San Diego

Julian McAuley, jmcauley@ucsd.edu, Computer Science & Engineering, University of California, San Diego

Reviewed on OpenReview: https://openreview.net/forum?id=lmXMXP74TO

The popularity of deep learning has led to the curation of a vast number of massive and multifarious datasets. Although the resulting models reach close-to-human performance on individual tasks, training parameter-hungry models on large datasets poses multi-faceted problems such as (a) high model-training time; (b) slow research iteration; and (c) poor eco-sustainability. As an alternative, data distillation approaches aim to synthesize terse data summaries, which can serve as effective drop-in replacements of the original dataset for scenarios like model training, inference, architecture search, etc. In this survey, we present a formal framework for data distillation, along with a detailed taxonomy of existing approaches. Additionally, we cover data distillation approaches for different data modalities, namely images, graphs, and user-item interactions (recommender systems), while also identifying current challenges and future research directions.

## 1 Introduction

(Loose) Definition 1. (Data distillation) Approaches that aim to synthesize tiny and high-fidelity data summaries which distill the most important knowledge from a given target dataset. Such distilled summaries are optimized to serve as effective drop-in replacements of the original dataset for efficient and accurate data-usage applications like model training, inference, architecture search, etc.

The recent "scale-is-everything" viewpoint (Ghorbani et al., 2021; Hoffmann et al., 2022; Kaplan et al., 2020) argues that training bigger models (i.e., consisting of a higher number of parameters) on bigger datasets, and using larger computational resources, is the sole key to advancing the frontier of artificial intelligence. Such studies observe and hypothesize the generalizability of neural networks as a power-law w.r.t. the aforementioned factors, albeit with small exponents. On the other hand, a reasonable argument is that a principled and well-reasoned solution will be more amenable to various scaling-laws, thereby leading to faster progress. Data distillation (Definition 1) is clearly a task rooted in the latter school of thought, introducing the fidelity of data as an important covariate in such neural scaling-laws. Sorscher et al. (2022) demonstrate this viewpoint analytically by using simple heuristics to prune away data with low measures of signal for model training. Clearly, the scale viewpoint still holds, in that if we keep increasing the amount of data (albeit now compressed and of higher quality), we will observe an improvement in both upstream and downstream generalization, but at a faster rate.

Motivation. A terse, high-quality data summary has use cases from a variety of standpoints. First and foremost, it leads to a faster model-training procedure. In turn, faster model training equates to (1) compute-cost savings and expedited research iterations, i.e., the investigative procedure of manually experimenting with different ideas; and (2) improved eco-sustainability, i.e., lowering the amount of compute time directly leads to a lower carbon footprint from running power-hungry accelerated hardware (Gupta et al., 2022).
Additionally, a small data summary democratizes the entire pipeline, as more people can train state-of-the-art algorithms on reasonably accessible hardware using the data summary. Finally, a high-quality data summary indirectly also accelerates orthogonal procedures like neural architecture search (Liu et al., 2019), approximate nearest neighbour search (Arya et al., 1998), knowledge distillation (Hinton et al., 2015), etc., where the procedure needs to iterate over the entire dataset multiple times.

[Figure 1: The premise of data distillation demonstrated using an image dataset. A learning algorithm trained on far fewer than 50K distilled images reaches performance similar to one trained on the full 50K images.]

Comparison with data pruning. Another reasonable avenue for summarizing large datasets is pruning away low-quality data which presumably does not carry a large amount of signal to be captured during model training. The primary challenge for such data pruning approaches (a.k.a. coreset construction) lies in tagging the hardness of each data-point, which can be used for subsequent pruning (typically in a greedy fashion). Prominent data-pruning approaches propose heuristics for the same, relying on concepts such as Shapley values (Ghorbani & Zou, 2019), confidence scores (Coleman et al., 2020), error-contribution (Toneva et al., 2019), feature-space geometry (Abbas et al., 2023; Sorscher et al., 2022; Welling, 2009), etc. Another line of work builds on the advances in submodular optimization (see Bilmes (2022) for a review) to approximately solve the NP-Hard combinatorial optimization of selecting the subset that maximizes a set-level goodness function, when such goodness functions are provably submodular (Killamsetty et al., 2021; Mirzasoleiman et al., 2020; S et al., 2021).
Notably, such data pruning methodologies inherently share the same goal as data distillation but are severely restricted because they only retain data already in the target dataset, thereby leading to finite expressivity and hence, generally, lower sample-fidelity (see Ayed & Hayou (2023) for a deeper theoretical outlook on the fundamental limitations of data pruning). Further, recent empirical studies of data pruning methodologies (Guo et al., 2022) demonstrate that the efficacy of such data pruning heuristics rarely and irregularly transfers to practical scenarios, with random downsampling being a hard baseline.

Comparison with knowledge distillation & transfer learning. Although they inherently distill some notion of knowledge, both knowledge distillation and transfer learning are orthogonal procedures to data distillation, and can potentially work in conjunction with it to perform both tasks more efficiently. More specifically, knowledge distillation (Hinton et al., 2015) entails distilling the knowledge from a trained teacher network into a smaller student network. On the other hand, transfer learning (Pratt, 1992) focuses on transferring knowledge across similar tasks, e.g., from image classification to image segmentation. Orthogonally, data distillation aims to distill the knowledge from a given dataset into a terse data summary. Such data summaries can be used in conjunction with knowledge distillation or transfer learning procedures for both (1) faster learning of the teacher models; and (2) faster knowledge transfer to the student models. The same holds true for model compression techniques (LeCun et al., 1989), where similar to knowledge distillation, the goal is to reduce model storage size rather than reducing the training time or increasing the sample-fidelity.

In this survey, we intend to provide a succinct overview of various data distillation frameworks across different data modalities.
We start by presenting a formal data distillation framework in Section 2, and present technicalities of various existing techniques. We classify all data distillation techniques into four categories (see Figure 2 for a taxonomy) and provide a detailed empirical comparison of image distillation techniques in Table 1. Subsequently, in Section 3, we discuss existing data distillation frameworks for synthesizing data of different modalities, as well as outlining the associated challenges. In Section 4, we discuss alternative applications of synthesizing a high-fidelity data summary rather than simply accelerating model training, along with pointers to existing work. Finally, in Section 5, we conclude by presenting common pitfalls in existing data distillation techniques, along with proposing interesting directions for future work.

## 2 The Data Distillation Framework

Before going into the specifics of data distillation, we start by outlining useful notation. Let $\mathcal{D} \triangleq \{(x_i, y_i)\}_{i=1}^{|\mathcal{D}|}$ be a given dataset which needs to be distilled, where $x_i \in \mathcal{X}$ is the set of input features, and $y_i \in \mathcal{Y}$ is the desired label for $x_i$. For classification tasks, let $\mathcal{C}$ be the set of unique classes in $\mathcal{Y}$, and $\mathcal{D}^c \triangleq \{(x_i, y_i) \mid y_i = c\}_{i=1}^{|\mathcal{D}|}$ be the subset of $\mathcal{D}$ with class $c$. We also define the matrices $X \triangleq [x_i]_{i=1}^{|\mathcal{D}|}$ and $Y \triangleq [y_i]_{i=1}^{|\mathcal{D}|}$ for convenience. Given a data budget $n \in \mathbb{Z}^+$, data distillation techniques aim to synthesize a high-fidelity data summary $\mathcal{D}_{\mathsf{syn}} \triangleq \{(\tilde{x}_i, \tilde{y}_i)\}_{i=1}^{n}$ such that $n \ll |\mathcal{D}|$. We define $\mathcal{D}^c_{\mathsf{syn}}$, $X_{\mathsf{syn}}$, and $Y_{\mathsf{syn}}$ similarly as defined for $\mathcal{D}$. Let $\Phi_\theta : \mathcal{X} \mapsto \mathcal{Y}$ represent a learning algorithm parameterized by $\theta$. We also assume access to a twice-differentiable cost function $l : \mathcal{Y} \times \mathcal{Y} \mapsto \mathbb{R}$, and define $\mathcal{L}_{\mathcal{D}}(\theta) \triangleq \mathbb{E}_{(x,y) \sim \mathcal{D}}[l(\Phi_\theta(x), y)]$ for convenience. Notation is also summarized in Appendix A.

Notably, since $\mathcal{D}$ and $\mathcal{D}_{\mathsf{syn}}$ share the same data domain ($\mathcal{X}$), under reasonable systems assumptions, training $\Phi$ using gradient descent (GD) on $\mathcal{D}_{\mathsf{syn}}$ will have a $|\mathcal{D}|/n$ training-time speedup compared to training $\Phi$ on $\mathcal{D}$. For the sake of uniformity, we refer to the data synthesized by data distillation techniques as a data summary henceforth. Inspired by the definition of coresets (Bachem et al., 2017), we formally define an $\epsilon$-approximate data summary, and the data distillation task, as follows:

Definition 2. ($\epsilon$-approximate data summary) Given a learning algorithm $\Phi$, let $\theta^{\mathcal{D}}$, $\theta^{\mathcal{D}_{\mathsf{syn}}}$ represent the optimal set of parameters for $\Phi$ estimated on $\mathcal{D}$ and $\mathcal{D}_{\mathsf{syn}}$, and $\epsilon \in \mathbb{R}^+$; we define an $\epsilon$-approximate data summary as one which satisfies:

$$\sup \left\{ \left| l\left(\Phi_{\theta^{\mathcal{D}}}(x), y\right) - l\left(\Phi_{\theta^{\mathcal{D}_{\mathsf{syn}}}}(x), y\right) \right| \right\}_{x \sim \mathcal{X},\, y \sim \mathcal{Y}} \leq \epsilon \tag{1}$$

Definition 3. (Data distillation) Given a learning algorithm $\Phi$, let $\theta^{\mathcal{D}}$, $\theta^{\mathcal{D}_{\mathsf{syn}}}$ represent the optimal set of parameters for $\Phi$ estimated on $\mathcal{D}$ and $\mathcal{D}_{\mathsf{syn}}$; we define data distillation as optimizing the following:

$$\underset{\mathcal{D}_{\mathsf{syn}},\, n}{\arg\min} \left( \sup \left\{ \left| l\left(\Phi_{\theta^{\mathcal{D}}}(x), y\right) - l\left(\Phi_{\theta^{\mathcal{D}_{\mathsf{syn}}}}(x), y\right) \right| \right\}_{x \sim \mathcal{X},\, y \sim \mathcal{Y}} \right) \tag{2}$$

From Definition 3, we highlight three cornerstones of evaluating data distillation methods: (1) Performance: downstream evaluation of models trained on the synthesized data summary vs. the full dataset (e.g., accuracy, FID, nDCG, etc.); (2) Efficiency: how quickly models can reach full-data performance (or even exceed it), i.e., the scaling of $n$ vs. downstream task-performance; and (3) Transferability: how well data summaries generalize to a diverse pool of learning algorithms, in terms of downstream evaluation.

No free lunch. The universal "No Free Lunch" theorem (Wolpert & Macready, 1997) applies to data distillation as well. For example, the transferability of a data summary is strongly dependent on the set of encoded inductive biases, i.e., through the choice of the learning algorithm $\Phi$ used while distilling, as well as the objective function $l(\cdot, \cdot)$.
Such biases are unavoidable for any data distillation technique, in the sense that learning algorithms closely following the set of encoded inductive biases will be able to generalize better on the data summary than others.

Keeping these preliminaries in mind, we now present a formal framework for data distillation, encapsulating existing data distillation approaches. Notably, the majority of existing techniques intrinsically solve a bilevel optimization problem, which is a tractable surrogate of Equation (2). The inner-loop typically optimizes a representative learning algorithm on the data summary, and using the optimized learning algorithm, the outer-loop optimizes a tractable proxy of Equation (2).

[Figure 2: A taxonomy of existing data distillation approaches, organized by optimization framework (Sections 2.1 to 2.4) and by data modality (image, graph, and recommender-system distillation).]

Some common assumptions that existing data distillation techniques follow are: (1) static-length data summary, i.e., $n$ is fixed and is treated as a tunable hyper-parameter; and (2) we have on-demand access to the target dataset $\mathcal{D}$, which is also assumed to be iid. Notably, the outer-loop optimization of $\mathcal{D}_{\mathsf{syn}}$ happens simply through GD on the analogously defined $X_{\mathsf{syn}} \in \mathbb{R}^{n \times \dim(\mathcal{X})}$, which is instantiated as free parameters. Note that the labels, $Y_{\mathsf{syn}} \in \mathbb{R}^{n \times \dim(\mathcal{Y})}$, can be similarly optimized through GD as well (Bohdal et al., 2020). For the sake of notational clarity, we will interchangeably use optimization of $\mathcal{D}_{\mathsf{syn}}$ or $(X_{\mathsf{syn}}, Y_{\mathsf{syn}})$ henceforth.

### 2.1 Data Distillation by Meta-model Matching

Meta-model matching-based data distillation approaches fundamentally optimize for the transferability of models trained on the data summary when generalized to the original dataset:

$$\underset{\mathcal{D}_{\mathsf{syn}}}{\arg\min}\ \mathcal{L}_{\mathcal{D}}\left(\theta^{\mathcal{D}_{\mathsf{syn}}}\right) \quad \text{s.t.} \quad \theta^{\mathcal{D}_{\mathsf{syn}}} \triangleq \underset{\theta}{\arg\min}\ \mathcal{L}_{\mathcal{D}_{\mathsf{syn}}}(\theta), \tag{3}$$

where intuitively, the inner-loop trains a representative learning algorithm on the data summary until convergence, and the outer-loop subsequently optimizes the data summary for the transferability of the optimized learning algorithm to the original dataset. Besides the common assumptions mentioned earlier, the key simplifying assumption for this family of methods is that a perfect classifier exists and can be estimated on $\mathcal{D}$, i.e., $\exists\, \theta^{\mathcal{D}}$ s.t. $l(\Phi_{\theta^{\mathcal{D}}}(x), y) = 0,\ \forall x \sim \mathcal{X}, y \sim \mathcal{Y}$. Plugging this assumption along with the iid assumption of $\mathcal{D}$ into Equation (2) directly translates to Equation (3). Despite the assumption, Equation (3) is highly expensive both in terms of computation time and memory, due to which methods from this family typically resort to making further assumptions.

Wang et al. (2018) (DD) originally proposed the task of data distillation, and used the meta-model matching framework for optimization. DD makes the optimization in Equation (3) tractable by performing (1) local optimization à la stochastic gradient descent (SGD) in the inner-loop, and (2) outer-loop optimization using Truncated Back-Propagation Through Time (TBPTT), i.e., unrolling only a limited number of inner-loop optimization steps. Formally, the modified optimization objective for DD is as follows:

$$\underset{\mathcal{D}_{\mathsf{syn}}}{\arg\min}\ \underset{\theta_0 \sim P_\theta}{\mathbb{E}} \left[ \mathcal{L}_{\mathcal{D}}(\theta_T) \right] \quad \text{s.t.} \quad \theta_{t+1} \leftarrow \theta_t - \eta \cdot \nabla_\theta \mathcal{L}_{\mathcal{D}_{\mathsf{syn}}}(\theta_t), \tag{4}$$

where $P_\theta$ is a parameter initialization distribution of choice, $T$ accounts for the truncation in TBPTT, and $\eta$ is a tunable learning rate. We also elucidate DD's control-flow in Algorithm 1 for reference. Notably, TBPTT has been associated with drawbacks such as (1) computationally expensive inner-loop unrolling; (2) bias involved with truncated unrolling (Wu et al., 2018); and (3) poorly conditioned loss landscapes, particularly with long unrolls (Metz et al., 2019). Consequently, the TBPTT framework was empirically shown to be ineffective for data distillation (Zhao et al., 2021).
However, recent work (Deng & Russakovsky, 2022) claims that using momentum-based optimizers and longer inner-loop unrolling can greatly improve performance. We delay a deeper discussion of this work to Section 2.5 for clarity.

Algorithm 1: Control-flow of data distillation using naïve meta-matching (Equation (4))

Input: Target dataset $\mathcal{D}$, outer-loop iterations $K$, parameter initialization distribution $P_\theta$, inner-loop iterations $T$, inner-loop learning rate $\eta$, outer-loop learning rate $\eta_{\mathsf{syn}}$

1. Initialize: $(X^0_{\mathsf{syn}}, Y^0_{\mathsf{syn}}) \sim \mathcal{D}$
2. for $k = 1, \ldots, K$ do (outer-loop: optimize $\mathcal{D}_{\mathsf{syn}}$)
   3. Initialize $\theta_0 \sim P_\theta$
   4. for $t = 1, \ldots, T$ do (inner-loop: optimize $\Phi$ on $\mathcal{D}^{k-1}_{\mathsf{syn}}$)
      5. $\theta_t \leftarrow \theta_{t-1} - \eta \cdot \nabla_\theta \mathcal{L}_{\mathcal{D}^{k-1}_{\mathsf{syn}}}(\theta_{t-1})$
   6. $X^k_{\mathsf{syn}} \leftarrow X^{k-1}_{\mathsf{syn}} - \eta_{\mathsf{syn}} \cdot \nabla_{X_{\mathsf{syn}}} \mathcal{L}_{\mathcal{D}}(\theta_T)$ (update $X_{\mathsf{syn}}$ by computing the unrolled meta-gradient)
   7. $Y^k_{\mathsf{syn}} \leftarrow Y^{k-1}_{\mathsf{syn}} - \eta_{\mathsf{syn}} \cdot \nabla_{Y_{\mathsf{syn}}} \mathcal{L}_{\mathcal{D}}(\theta_T)$ (update $Y_{\mathsf{syn}}$ by computing the unrolled meta-gradient)

Output: $\mathcal{D}^K_{\mathsf{syn}} \triangleq (X^K_{\mathsf{syn}}, Y^K_{\mathsf{syn}})$

Analogously, a separate line of work focuses on using Neural Tangent Kernel (NTK) (Jacot et al., 2018) based algorithms to solve the inner-loop in closed form. As a brief side note, the infinite-width correspondence states that performing Kernelized Ridge Regression (KRR) using the NTK of a given neural network is equivalent to training the same $\infty$-width neural network with L2 reconstruction loss for $\infty$ SGD-steps. These $\infty$-width neural networks have been shown to perform reasonably compared to their finite-width counterparts, while also being solved in closed-form (see Lee et al. (2020) for a detailed analysis on finite vs. infinite neural networks for image classification). KIP uses the NTK of a fully-connected neural network (Nguyen et al., 2021a), or a convolutional network (Nguyen et al., 2021b), in the inner-loop of Equation (3) for efficient data distillation.
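
As a minimal sketch of what "solving the inner-loop in closed form" means, the snippet below fits kernel ridge regression on a small summary and measures its squared error on the target data. An RBF kernel stands in for the NTK purely for illustration; the helper names `rbf_kernel` and `krr_loss`, the toy data, and the regularizer `lam` are all assumptions and not part of any cited method.

```python
import numpy as np

def rbf_kernel(A, B, gamma=0.5):
    # Gram matrix with entries exp(-gamma * ||A_i - B_j||^2).
    # Stand-in for the NTK (an assumption made for illustration only).
    sq_dists = (A ** 2).sum(axis=1)[:, None] + (B ** 2).sum(axis=1)[None, :] - 2.0 * A @ B.T
    return np.exp(-gamma * np.maximum(sq_dists, 0.0))

def krr_loss(X, Y, X_syn, Y_syn, lam=1e-3):
    # Squared error on the target data of the closed-form KRR predictor fit on the summary:
    # || Y - K_{X, X_syn} (K_{X_syn, X_syn} + lam * I)^{-1} Y_syn ||^2
    K_ts = rbf_kernel(X, X_syn)
    K_ss = rbf_kernel(X_syn, X_syn)
    alpha = np.linalg.solve(K_ss + lam * np.eye(len(X_syn)), Y_syn)
    return float(((Y - K_ts @ alpha) ** 2).sum())

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))                 # toy target inputs (hypothetical data)
Y = np.sin(X[:, :1])                          # toy regression targets, shape (200, 1)
X_syn, Y_syn = X[:10].copy(), Y[:10].copy()   # summary initialized from real points
loss_summary = krr_loss(X, Y, X_syn, Y_syn)
loss_full = krr_loss(X, Y, X, Y)              # using all of D as the "summary"
```

In a distillation setting, `krr_loss` would be differentiated with respect to `X_syn` and `Y_syn` by an automatic-differentiation library and minimized by gradient descent; that outer loop is omitted here.
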
More formally, given the NTK $\mathcal{K} : \mathcal{X} \times \mathcal{X} \mapsto \mathbb{R}$ of a neural network architecture, KIP optimizes the following objective:

$$\underset{X_{\mathsf{syn}},\, Y_{\mathsf{syn}}}{\arg\min}\ \left\| Y - K_{X X_{\mathsf{syn}}} \cdot \left(K_{X_{\mathsf{syn}} X_{\mathsf{syn}}} + \lambda I\right)^{-1} \cdot Y_{\mathsf{syn}} \right\|^2, \tag{5}$$

where $K_{AB} \in \mathbb{R}^{|A| \times |B|}$ represents the gramian matrix of two sets $A$ and $B$, whose $(i, j)$-th element is defined by $\mathcal{K}(A_i, B_j)$. Although KIP doesn't impose any additional simplifications to the meta-model matching framework, it has an $O(|\mathcal{D}| \cdot n \cdot \dim(\mathcal{X}))$ time and memory complexity, limiting its scalability. Subsequently, RFAD (Loo et al., 2022) proposes using (1) the light-weight Empirical Neural Network Gaussian Process (NNGP) kernel (Neal, 2012) instead of the NTK; and (2) a classification loss (e.g., NLL) instead of the L2-reconstruction loss for the outer-loop, to get $O(n)$ time complexity while also having better performance.

On a similar note, FRePO (Zhou et al., 2022b) decouples the feature extractor and a linear classifier in $\Phi$, and alternately optimizes (1) the data summary along with the classifier, and (2) the feature extractor. To be precise, let $f_\theta : \mathcal{X} \mapsto \mathcal{X}'$ be the feature extractor and $g_\psi : \mathcal{X}' \mapsto \mathcal{Y}$ be the linear classifier, s.t. $\Phi(x) \equiv g_\psi(f_\theta(x))\ \forall x \in \mathcal{X}$; the optimization objective for FRePO can be written as:

$$\underset{X_{\mathsf{syn}},\, Y_{\mathsf{syn}}}{\arg\min}\ \underset{\theta_0 \sim P_\theta}{\mathbb{E}} \left[ \sum_{t=0}^{T} \left\| Y - K^{\theta_t}_{X X_{\mathsf{syn}}} \cdot \left(K^{\theta_t}_{X_{\mathsf{syn}} X_{\mathsf{syn}}} + \lambda I\right)^{-1} \cdot Y_{\mathsf{syn}} \right\|^2 \right]$$

$$\text{s.t.} \quad \theta_{t+1} \leftarrow \theta_t - \eta \cdot \underset{(x,y) \sim \mathcal{D}_{\mathsf{syn}}}{\mathbb{E}} \left[ \nabla_\theta\, l(g_\psi(f_\theta(x)), y) \right] \ ; \quad K^{\theta_t}_{X_{\mathsf{syn}} X_{\mathsf{syn}}} \triangleq f_{\theta_t}(X_{\mathsf{syn}})\, f_{\theta_t}(X_{\mathsf{syn}})^{\top}, \tag{6}$$

where $T$ represents the number of inner-loop update steps for the feature extractor $f_\theta$. Notably, (1) a wide architecture for $f_\theta$ is crucial for distillation quality in FRePO; and (2) despite the bilevel optimization, FRePO is shown to be more scalable compared to KIP (Equation (5)), while also being more generalizable.

### 2.2 Data Distillation by Gradient Matching

Gradient matching based data distillation, at a high level, performs one-step distance matching on (1) the network trained on the target dataset ($\mathcal{D}$) vs. (2) the same network trained on the data summary ($\mathcal{D}_{\mathsf{syn}}$).
In contrast to the meta-model matching framework, such an approach circumvents the unrolling of the inner-loop, thereby making the overall optimization much more efficient. First proposed by Zhao et al. (2021) (DC), data summaries optimized by gradient-matching significantly outperformed data pruning methodologies, as well as the TBPTT-based data distillation proposed by Wang et al. (2018). Formally, given a learning algorithm $\Phi$, DC solves the following optimization objective:

$$\underset{\mathcal{D}_{\mathsf{syn}}}{\arg\min}\ \underset{\substack{\theta_0 \sim P_\theta \\ c \sim \mathcal{C}}}{\mathbb{E}} \left[ \sum_{t=0}^{T} \mathbf{D}\left( \nabla_\theta \mathcal{L}_{\mathcal{D}^c}(\theta_t),\ \nabla_\theta \mathcal{L}_{\mathcal{D}^c_{\mathsf{syn}}}(\theta_t) \right) \right] \quad \text{s.t.} \quad \theta_{t+1} \leftarrow \theta_t - \eta \cdot \nabla_\theta \mathcal{L}_{\mathcal{D}_{\mathsf{syn}}}(\theta_t), \tag{7}$$

where $T$ accounts for model similarity $T$-steps in the future, and $\mathbf{D} : \mathbb{R}^{|\theta|} \times \mathbb{R}^{|\theta|} \mapsto \mathbb{R}$ is a distance metric of choice (typically cosine distance). In addition to the assumptions imposed by the meta-model matching framework (Section 2.1), gradient-matching assumes (1) inner-loop optimization of only $T$ steps; (2) local smoothness: two sets of model parameters close to each other (given a distance metric) imply model similarity; and (3) first-order approximation of $\theta^{\mathcal{D}}_t$: instead of exactly computing the training trajectory of optimizing $\theta_0$ on $\mathcal{D}$ (say $\theta^{\mathcal{D}}_t$), perform a first-order approximation on the optimization trajectory of $\theta_0$ on the much smaller $\mathcal{D}_{\mathsf{syn}}$ (say $\theta^{\mathcal{D}_{\mathsf{syn}}}_t$), i.e., approximate $\theta^{\mathcal{D}}_t$ as a single gradient-descent update on $\theta^{\mathcal{D}_{\mathsf{syn}}}_{t-1}$ using $\mathcal{D}$, rather than on $\theta^{\mathcal{D}}_{t-1}$ (Figure 3).

Subsequently, numerous other approaches have been built atop this framework with subtle variations. DSA (Zhao & Bilen, 2021) improves over DC by performing the same image-augmentations (e.g., crop, rotate, jitter, etc.) on both $\mathcal{D}$ and $\mathcal{D}_{\mathsf{syn}}$ while optimizing Equation (7). Since these augmentations are universal and are applicable across data distillation frameworks, augmentations performed by DSA have become a common part of all methods proposed henceforth, but we omit them for notational clarity.
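
The quantity inside Equation (7) can be sketched numerically as follows. This is a minimal illustration under stated assumptions, not the DC implementation: the learning algorithm is taken to be a linear softmax classifier so that gradients have a closed form, the distance is cosine distance on flattened gradients, and all names (`grad_ce`, `gradient_matching_loss`, the toy data) are hypothetical.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)   # subtract row-wise max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def grad_ce(W, X, y):
    # Gradient w.r.t. W of the mean cross-entropy of a linear softmax classifier.
    P = softmax(X @ W)
    P[np.arange(len(y)), y] -= 1.0
    return X.T @ P / len(y)

def cosine_distance(g1, g2, eps=1e-12):
    g1, g2 = g1.ravel(), g2.ravel()
    return 1.0 - float(g1 @ g2) / (np.linalg.norm(g1) * np.linalg.norm(g2) + eps)

def gradient_matching_loss(W0, X, y, X_syn, y_syn, classes, T=5, eta=0.1):
    W, total = W0.copy(), 0.0
    for _ in range(T + 1):                       # t = 0, ..., T
        per_class = [
            cosine_distance(grad_ce(W, X[y == c], y[y == c]),
                            grad_ce(W, X_syn[y_syn == c], y_syn[y_syn == c]))
            for c in classes
        ]
        total += float(np.mean(per_class))       # expectation over classes
        W = W - eta * grad_ce(W, X_syn, y_syn)   # inner-loop step on D_syn
    return total

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 4)); y = np.arange(60) % 3          # toy target dataset
X_syn = rng.normal(size=(6, 4)); y_syn = np.arange(6) % 3    # toy summary, 2 per class
W0 = np.zeros((4, 3))
loss = gradient_matching_loss(W0, X, y, X_syn, y_syn, classes=range(3))
```

Only the loss for a single initialization is computed here; the outer-loop update of `X_syn`, which requires differentiating this loss with respect to the summary, is left to an automatic-differentiation framework.
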
DCC (Lee et al., 2022b) further modifies the gradient-matching objective to incorporate class contrastive signals inside each gradient-matching step and is shown to improve stability as well as performance. With $\theta_t$ evolving similarly as in Equation (7), the modified optimization objective for DCC can be written as:

$$\underset{\mathcal{D}_{\mathsf{syn}}}{\arg\min}\ \underset{\theta_0 \sim P_\theta}{\mathbb{E}} \left[ \sum_{t=0}^{T} \mathbf{D}\left( \underset{c \in \mathcal{C}}{\mathbb{E}} \left[ \nabla_\theta \mathcal{L}_{\mathcal{D}^c}(\theta_t) \right],\ \underset{c \in \mathcal{C}}{\mathbb{E}} \left[ \nabla_\theta \mathcal{L}_{\mathcal{D}^c_{\mathsf{syn}}}(\theta_t) \right] \right) \right] \tag{8}$$

Most recently, Kim et al. (2022) (IDC) extend the gradient matching framework by: (1) multi-formation: to synthesize a higher amount of data within the same memory budget, store the data summary (e.g., images) in a lower resolution to remove spatial redundancies, and upsample (using e.g., bilinear, FSRCNN (Dong et al., 2016)) to the original scale during usage; and (2) matching gradients of the network's training trajectory over the full dataset $\mathcal{D}$ rather than the data summary $\mathcal{D}_{\mathsf{syn}}$. To be specific, given a $k\times$ upscaling function $f : \mathbb{R}^{d \times d} \mapsto \mathbb{R}^{kd \times kd}$, the modified optimization objective for IDC can be formalized as:

$$\underset{\mathcal{D}_{\mathsf{syn}}}{\arg\min}\ \underset{\substack{\theta_0 \sim P_\theta \\ c \sim \mathcal{C}}}{\mathbb{E}} \left[ \sum_{t=0}^{T} \mathbf{D}\left( \nabla_\theta \mathcal{L}_{\mathcal{D}^c}(\theta_t),\ \nabla_\theta \mathcal{L}_{f(\mathcal{D}^c_{\mathsf{syn}})}(\theta_t) \right) \right] \quad \text{s.t.} \quad \theta_{t+1} \leftarrow \theta_t - \eta \cdot \nabla_\theta \mathcal{L}_{\mathcal{D}}(\theta_t) \tag{9}$$

Kim et al. (2022) further hypothesize that training models on $\mathcal{D}_{\mathsf{syn}}$ instead of $\mathcal{D}$ in the inner-loop has two major drawbacks: (1) strong coupling of the inner- and outer-loop resulting in a chicken-egg problem (McLachlan & Krishnan, 2007); and (2) vanishing network gradients due to the small size of $\mathcal{D}_{\mathsf{syn}}$, leading to an improper outer-loop optimization for gradient-matching based techniques.

### 2.3 Data Distillation by Trajectory Matching

Cazenavette et al. (2022) proposed MTT, which aims to match the training trajectories of models trained on $\mathcal{D}$ vs. $\mathcal{D}_{\mathsf{syn}}$. More specifically, let $\{\theta^{\mathcal{D}}_t\}_{t=0}^{T}$ represent the training trajectory of training $\Phi_\theta$ on $\mathcal{D}$; trajectory matching algorithms aim to solve the following optimization:

$$\underset{\mathcal{D}_{\mathsf{syn}},\, \eta}{\arg\min}\ \underset{\theta_0 \sim P_\theta}{\mathbb{E}} \left[ \sum_{t=0}^{T-M} \frac{\mathbf{D}\left(\theta^{\mathcal{D}}_{t+M},\ \theta^{\mathcal{D}_{\mathsf{syn}}}_{t+N}\right)}{\mathbf{D}\left(\theta^{\mathcal{D}}_{t+M},\ \theta^{\mathcal{D}}_{t}\right)} \right]$$

$$\text{s.t.} \quad \theta^{\mathcal{D}_{\mathsf{syn}}}_{t+i+1} \leftarrow \theta^{\mathcal{D}_{\mathsf{syn}}}_{t+i} - \eta \cdot \nabla_\theta \mathcal{L}_{\mathcal{D}_{\mathsf{syn}}}\left(\theta^{\mathcal{D}_{\mathsf{syn}}}_{t+i}\right) \ ; \quad \theta^{\mathcal{D}_{\mathsf{syn}}}_{t+1} \leftarrow \theta^{\mathcal{D}}_{t} - \eta \cdot \nabla_\theta \mathcal{L}_{\mathcal{D}_{\mathsf{syn}}}\left(\theta^{\mathcal{D}}_{t}\right), \tag{10}$$

where $\mathbf{D} : \mathbb{R}^{|\theta|} \times \mathbb{R}^{|\theta|} \mapsto \mathbb{R}$ is a distance metric of choice (typically L2 distance). Such an optimization can intuitively be seen as optimizing for similar quality models trained with $N$ SGD steps on $\mathcal{D}_{\mathsf{syn}}$, compared to $M \gg N$ steps on $\mathcal{D}$, thereby invoking long-horizon trajectory matching. Notably, calculating the gradient of Equation (10) w.r.t. $\mathcal{D}_{\mathsf{syn}}$ encompasses gradient unrolling through $N$ timesteps, thereby limiting the scalability of MTT. On the other hand, since the trajectory of training $\Phi_\theta$ on $\mathcal{D}$, i.e., $\{\theta^{\mathcal{D}}_t\}_{t=0}^{T}$, is independent of the optimization of $\mathcal{D}_{\mathsf{syn}}$, it can be pre-computed for various $\theta_0 \sim P_\theta$ initializations and directly substituted. Similar to gradient matching methods (Section 2.2), the trajectory matching framework also optimizes the first-order distance between parameters, thereby inheriting the local smoothness assumption. As a scalable alternative, Cui et al. (2022b) proposed TESLA, which re-parameterizes the parameter-matching loss of MTT in Equation (10) (specifically when $\mathbf{D}$ is set as the L2 distance), using linear algebraic manipulations to make the bilevel optimization's memory complexity independent of $N$. Furthermore, TESLA uses learnable soft-labels ($Y_{\mathsf{syn}}$) during the optimization for an increased compression efficiency.

[Figure 3: The underlying optimization in various data distillation frameworks (meta-model matching, gradient matching, and trajectory matching).]

### 2.4 Data Distillation by Distribution Matching

Even though the aforementioned gradient-matching and trajectory-matching based data distillation techniques have been empirically shown to synthesize high-quality data summaries, the underlying bilevel optimization is oftentimes a computationally expensive procedure.
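
To make that cost concrete, the sketch below evaluates the normalized parameter-matching ratio of Equation (10) for one starting step $t$. It is an illustration under assumptions rather than the MTT implementation: the model is linear with a squared-error loss so that gradients are analytic, the distance is taken to be squared L2, and the names `expert_trajectory` and `trajectory_matching_loss` are hypothetical.

```python
import numpy as np

def grad_mse(theta, X, y):
    # Gradient of 0.5 * mean squared error for a linear model x -> x @ theta.
    return X.T @ (X @ theta - y) / len(y)

def expert_trajectory(theta0, X, y, steps, lr):
    # Pre-computed trajectory {theta^D_t} of gradient descent on the target dataset D.
    traj = [theta0]
    for _ in range(steps):
        traj.append(traj[-1] - lr * grad_mse(traj[-1], X, y))
    return traj

def trajectory_matching_loss(traj, t, M, N, X_syn, y_syn, eta):
    theta = traj[t]                        # student starts from the expert checkpoint theta^D_t
    for _ in range(N):                     # N gradient steps on D_syn
        theta = theta - eta * grad_mse(theta, X_syn, y_syn)
    num = np.sum((traj[t + M] - theta) ** 2)       # distance to the expert after M steps
    den = np.sum((traj[t + M] - traj[t]) ** 2)     # normalizer: how far the expert moved
    return float(num / den)

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3)); y = X @ np.array([1.0, -2.0, 0.5])   # toy target dataset
traj = expert_trajectory(np.zeros(3), X, y, steps=20, lr=0.1)
X_syn, y_syn = X[:5].copy(), y[:5].copy()                           # toy summary
loss = trajectory_matching_loss(traj, t=0, M=10, N=3, X_syn=X_syn, y_syn=y_syn, eta=0.1)
```

The expert trajectory is computed once and reused, whereas the loop over `N` student steps is what an outer-loop gradient with respect to `X_syn` would have to be unrolled through.
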
To this end, distribution-matching techniques solve a correlated proxy task via a single-level optimization, leading to vastly improved scalability. More specifically, instead of matching the quality of models on $\mathcal{D}$ vs. $\mathcal{D}_{\mathsf{syn}}$, distribution-matching techniques directly match the distribution of $\mathcal{D}$ vs. $\mathcal{D}_{\mathsf{syn}}$. The key assumption for this family of methods is that two datasets that are similar according to a particular distribution divergence metric lead to similarly trained models. First proposed by Zhao & Bilen (2023), DM uses (1) numerous parametric encoders to cast high-dimensional data into respective low-dimensional latent spaces; and (2) an approximation of the Maximum Mean Discrepancy to compute the distribution mismatch between $\mathcal{D}$ and $\mathcal{D}_{\mathsf{syn}}$ in each of the latent spaces. More precisely, given a set of $k$ encoders $\mathcal{E} \triangleq \{\psi_i : \mathcal{X} \mapsto \mathcal{X}_i\}_{i=1}^{k}$, the optimization objective can be written as:

$$\underset{\mathcal{D}_{\mathsf{syn}}}{\arg\min}\ \underset{\substack{\psi \sim \mathcal{E} \\ c \sim \mathcal{C}}}{\mathbb{E}} \left[ \left\| \underset{x \sim \mathcal{D}^c}{\mathbb{E}} \left[\psi(x)\right] - \underset{\tilde{x} \sim \mathcal{D}^c_{\mathsf{syn}}}{\mathbb{E}} \left[\psi(\tilde{x})\right] \right\|^2 \right] \tag{11}$$

DM uses a set of randomly initialized neural networks (with the same architecture) to instantiate $\mathcal{E}$. They observe similar performance when $\mathcal{E}$ is instantiated with more meaningful, task-optimized neural networks, despite it being much less efficient. CAFE (Wang et al., 2022) further refines the distribution-matching idea by: (1) solving a bilevel optimization problem for jointly optimizing a single encoder ($\Phi$) and the data summary, rather than using a pre-determined set of encoders ($\mathcal{E}$); and (2) assuming a neural network encoder ($\Phi$), matching the latent representations obtained at all intermediate layers of the encoder instead of only the last layer. Formally, given an $(L + 1)$-layer neural network $\Phi_\theta : \mathcal{X} \mapsto \mathcal{Y}$ where $\Phi^l_\theta$ represents $\Phi$'s output at the $l$-th layer, the optimization problem for CAFE can be specified as:

$$\underset{\mathcal{D}_{\mathsf{syn}}}{\arg\min}\ \underset{c \sim \mathcal{C}}{\mathbb{E}} \left[ \sum_{l=1}^{L} \left\| \underset{x \sim \mathcal{D}^c}{\mathbb{E}} \left[\Phi^l_{\theta_t}(x)\right] - \underset{\tilde{x} \sim \mathcal{D}^c_{\mathsf{syn}}}{\mathbb{E}} \left[\Phi^l_{\theta_t}(\tilde{x})\right] \right\|^2 - \beta \cdot \underset{(x,y) \sim \mathcal{D}^c}{\mathbb{E}} \left[ \log \hat{p}(y \mid x, \theta_t) \right] \right]$$

$$\text{s.t.} \quad \theta_{t+1} \leftarrow \theta_t - \eta \cdot \nabla_\theta \mathcal{L}_{\mathcal{D}_{\mathsf{syn}}}(\theta_t) \ ; \quad \hat{p}(y \mid x, \theta) \triangleq \underset{y}{\mathrm{softmax}} \left\langle \Phi^L_\theta(x),\ \underset{x' \sim \mathcal{D}^y_{\mathsf{syn}}}{\mathbb{E}} \left[ \Phi^L_\theta(x') \right] \right\rangle, \tag{12}$$

where $\hat{p}(\cdot \mid \cdot, \theta)$ intuitively represents the nearest centroid classifier on $\mathcal{D}_{\mathsf{syn}}$ using the latent representations obtained by the last layer of $\Phi_\theta$. Analogously, IT-GAN (Zhao & Bilen, 2022) also uses the distribution-matching framework in Equation (11) to generate data that is informative for model training, in contrast to the traditional GAN (Goodfellow et al., 2014) which focuses on generating realistic data.

### 2.5 Data Distillation by Factorization

All of the aforementioned data distillation frameworks intrinsically maintain the synthesized data summary as a large set of free parameters, which are in turn optimized. Arguably, such a setup prohibits knowledge sharing between synthesized data points (parameters), which might introduce data redundancy. On the other hand, factorization-based data distillation techniques parameterize the data summary using two separate components: (1) bases: a set of mutually independent base vectors; and (2) hallucinators: a mapping from the bases' vector space to the joint data- and label-space. In turn, both the bases and hallucinators are optimized for the task of data distillation. Formally, let $\mathcal{B} \triangleq \{b_i \in \mathbb{B}\}_{i=1}^{|\mathcal{B}|}$ be the set of bases, and $\mathcal{H} \triangleq \{h_i : \mathbb{B} \mapsto \mathcal{X} \times \mathcal{Y}\}_{i=1}^{|\mathcal{H}|}$ be the set of hallucinators; then the data summary is parameterized as $\mathcal{D}_{\mathsf{syn}} \triangleq \{h(b)\}_{b \sim \mathcal{B},\, h \sim \mathcal{H}}$. Even though such a two-pronged approach seems similar to generative modeling of data, note that unlike classic generative models, (1) the input space consists only of a fixed and optimized set of latent codes and isn't meant to take any other inputs; and (2) given a specific $\mathcal{B}$ and $\mathcal{H}$, we can generate at most $|\mathcal{B}| \cdot |\mathcal{H}|$ sized data summaries.
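
The bases-and-hallucinators parameterization can be sketched as below. This is a minimal illustration under assumptions: each hallucinator is taken to be a random affine map from the bases' space into the concatenated data-and-label space, and the dimensions, the names `bases`, `hallucinators`, and `synthesize`, and the random initialization are all hypothetical. In practice both components would be optimized rather than sampled.

```python
import numpy as np

rng = np.random.default_rng(0)
dim_b, dim_x, num_classes = 8, 32, 3      # hypothetical dimensions
num_bases, num_halluc = 5, 4

# B: a fixed set of latent codes (would be optimized during distillation).
bases = rng.normal(size=(num_bases, dim_b))

# H: each hallucinator maps a base to a point in X x Y.
# An affine map is an assumption made for illustration only.
hallucinators = [
    (rng.normal(size=(dim_b, dim_x + num_classes)), rng.normal(size=dim_x + num_classes))
    for _ in range(num_halluc)
]

def synthesize(bases, hallucinators, dim_x):
    # D_syn = { h(b) : b in B, h in H }, returned as (X_syn, Y_syn).
    out = np.stack([b @ W + c for (W, c) in hallucinators for b in bases])
    return out[:, :dim_x], out[:, dim_x:]

X_syn, Y_syn = synthesize(bases, hallucinators, dim_x)
```

Every base is passed through every hallucinator, so the summary here has `num_bases * num_halluc` = 20 points, which is the upper bound on the summary size for a given pair of sets.
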
Notably, such a hallucinator-bases data parameterization can be optimized using any of the aforementioned data optimization frameworks (Sections 2.1 to 2.4).

This framework was concurrently proposed by Deng & Russakovsky (2022) (we take the liberty to term their unnamed model as "Lin-ear Ba-ses", or LinBa) and Liu et al. (2022c) (HaBa). LinBa modifies the general hallucinator-bases framework by assuming (1) the bases' vector space ($\mathbb{B}$) to be the same as the task input space ($\mathcal{X}$); and (2) the hallucinator to be linear and additionally conditioned on a given predictand. More specifically, the data parameterization can be formalized as follows:

$$\mathcal{D}_{\mathsf{syn}} \triangleq \left\{ \left( y H^{\top} \mathcal{B},\ y \right) \right\}_{y \sim \mathcal{C},\ H \sim \mathcal{H}} \quad \text{s.t.} \quad \mathcal{B} \in \mathbb{R}^{|\mathcal{B}| \times \dim(\mathcal{X})} \triangleq [b_i \in \mathcal{X}]_{i=1}^{|\mathcal{B}|} \ ; \quad \mathcal{H} \triangleq \left\{ H_i \in \mathbb{R}^{|\mathcal{B}| \times |\mathcal{C}|} \right\}_{i=1}^{|\mathcal{H}|}, \tag{13}$$

where for the sake of notational simplicity, we assume $y \in \mathbb{R}^{|\mathcal{C}|}$ represents the one-hot vector of the label for which we want to generate data, and the maximum amount of data that can be synthesized is $n \leq |\mathcal{C}| \cdot |\mathcal{H}|$. Since the data generation (Equation (13)) is end-to-end differentiable, both $\mathcal{B}$ and $\mathcal{H}$ are jointly optimized using the TBPTT framework discussed in Section 2.1, albeit with some crucial modifications for vastly improved performance: (1) using momentum-based optimizers instead of vanilla SGD in the inner-loop; and (2) longer unrolling (on the order of 100 steps) of the inner-loop during TBPTT.

Liu et al. (2022c) (HaBa) relax the linear and predictand-conditional hallucinator assumption of LinBa, equating to the following data parameterization:

$$\mathcal{D}_{\mathsf{syn}} \triangleq \left\{ \left( h(b),\ y \right) \right\}_{(b, y) \sim \mathcal{B},\ h \sim \mathcal{H}} \quad \text{s.t.} \quad \mathcal{B} \triangleq \left\{ (b_i \in \mathcal{X},\ y_i \in \mathcal{Y}) \right\}_{i=1}^{|\mathcal{B}|} \ ; \quad \mathcal{H} \triangleq \left\{ h_{\theta_i} : \mathcal{X} \mapsto \mathcal{X} \right\}_{i=1}^{|\mathcal{H}|}, \tag{14}$$

where $\mathcal{B}$ and $\mathcal{H}$ are optimized using the trajectory matching framework (Section 2.3) with an additional contrastive constraint to promote diversity in $\mathcal{D}_{\mathsf{syn}}$ (cf. Liu et al. (2022c), Equation (6)). Following this setup, HaBa can generate at most $|\mathcal{B}| \cdot |\mathcal{H}|$ sized data summaries.
Furthermore, one striking difference between HaBa (Equation (14)) and LinBa (Equation (13)) is that to generate each data point, LinBa uses a linear combination of all the bases, whereas HaBa generates a data point using a single base vector.

Table 1: Comparison of data distillation methods. Each method (1) synthesizes the data summary on the train-set; (2) unless mentioned, trains a 128-width ConvNet (Gidaris & Komodakis, 2018) on the data summary; and (3) evaluates it on the test-set. Confidence intervals are obtained by training at least 5 networks on the data summary. LinBa (No Fact.) represents LinBa without factorization. Methods evaluated using KRR are marked as ($\infty$-Conv) or ($\infty$-FC). The equivalent storage-in-bytes is used for factorization-based techniques instead of IPC. The best method in each category is emboldened, the best-overall non-factorized method evaluated on ConvNet is colored orange, and the best-overall factorized method is colored blue.

| Group | Method | MNIST, IPC 1 | MNIST, IPC 10 | MNIST, IPC 50 | CIFAR-10, IPC 1 | CIFAR-10, IPC 10 | CIFAR-10, IPC 50 | CIFAR-100, IPC 1 | CIFAR-100, IPC 10 | CIFAR-100, IPC 50 | Tiny ImageNet, IPC 1 | Tiny ImageNet, IPC 10 | Tiny ImageNet, IPC 50 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | Random | 64.9 ± 3.5 | 95.1 ± 0.9 | 97.9 ± 0.2 | 14.4 ± 2.0 | 26.0 ± 1.2 | 43.4 ± 1.0 | 4.2 ± 0.3 | 14.6 ± 0.5 | 30.0 ± 0.4 | 1.5 ± 0.1 | 6.0 ± 0.8 | 16.8 ± 1.8 |
| | Herding¹ | 89.2 ± 1.6 | 93.7 ± 0.3 | 94.9 ± 0.2 | 21.5 ± 1.2 | 31.6 ± 0.7 | 40.4 ± 0.6 | 8.4 ± 0.3 | 17.3 ± 0.5 | 33.7 ± 0.5 | - | - | - |
| | Forgetting² | 35.5 ± 5.6 | 68.1 ± 3.3 | 88.2 ± 1.2 | 13.5 ± 1.2 | 23.3 ± 1.0 | 23.3 ± 1.1 | 4.5 ± 0.2 | 15.1 ± 0.3 | 30.5 ± 0.3 | - | - | - |
| Meta-model Matching | DD³ | - | 79.5 ± 8.1 | - | - | 36.8 ± 1.2 | - | - | - | - | - | - | - |
| | LinBa (No Fact.)¹⁶ | 95.2 ± 0.3 | 98.8 ± 0.1 | 99.2 ± 0.1 | 49.1 ± 0.6 | 62.4 ± 0.4 | 70.5 ± 0.4 | 21.3 ± 0.6 | 34.7 ± 0.5 | - | - | - | - |
| | KIP (ConvNet)⁴ | 90.1 ± 0.1 | 97.5 ± 0.0 | 98.3 ± 0.1 | 49.9 ± 0.2 | 62.7 ± 0.3 | 68.6 ± 0.2 | 15.7 ± 0.2 | 28.3 ± 0.1 | - | - | - | - |
| | RFAD (ConvNet)⁵ | 94.4 ± 1.5 | 98.5 ± 0.1 | 98.8 ± 0.1 | 53.6 ± 1.2 | 66.3 ± 0.5 | 71.1 ± 0.4 | 26.3 ± 1.1 | 33.0 ± 0.3 | - | - | - | - |
| | FRePO (ConvNet)⁶ | 93.0 ± 0.4 | 98.6 ± 0.1 | 99.2 ± 0.1 | 46.8 ± 0.7 | 65.5 ± 0.6 | 71.7 ± 0.2 | 28.7 ± 0.1 | 42.5 ± 0.2 | 44.3 ± 0.2 | 15.4 ± 0.3 | 25.4 ± 0.2 | - |
| | KIP (∞-FC)⁷ | 85.5 ± 0.1 | 97.2 ± 0.2 | 98.4 ± 0.1 | 40.5 ± 0.4 | 53.1 ± 0.5 | 58.6 ± 0.4 | - | - | - | - | - | - |
| | KIP (∞-Conv)⁴ | 97.3 ± 0.1 | 99.1 ± 0.1 | 99.5 ± 0.1 | 64.7 ± 0.2 | 75.6 ± 0.2 | 80.6 ± 0.1 | 34.9 ± 0.1 | 49.5 ± 0.3 | - | - | - | - |
| | RFAD (∞-Conv)⁵ | 97.2 ± 0.2 | 99.1 ± 0.0 | 99.1 ± 0.0 | 61.4 ± 0.8 | 73.7 ± 0.2 | 76.6 ± 0.3 | 44.1 ± 0.1 | 46.8 ± 0.2 | - | - | - | - |
| | FRePO (∞-Conv)⁶ | 92.6 ± 0.4 | 98.6 ± 0.1 | 99.2 ± 0.1 | 47.9 ± 0.6 | 68.0 ± 0.2 | 74.4 ± 0.1 | 32.3 ± 0.1 | 44.9 ± 0.2 | 43.0 ± 0.3 | 19.1 ± 0.3 | 26.5 ± 0.1 | - |
| | DC⁸ | 91.7 ± 0.5 | 97.4 ± 0.2 | 98.2 ± 0.2 | 28.3 ± 0.5 | 44.9 ± 0.5 | 53.9 ± 0.5 | 12.8 ± 0.3 | 25.2 ± 0.3 | 30.5 ± 0.3 | 4.6 ± 0.6 | 11.2 ± 1.6 | 10.9 ± 0.7 |
| | DSA⁹ | 88.7 ± 0.6 | 97.8 ± 0.1 | 99.2 ± 0.1 | 28.8 ± 0.7 | 52.1 ± 0.5 | 60.6 ± 0.5 | 13.9 ± 0.3 | 32.3 ± 0.3 | 42.8 ± 0.4 | 6.6 ± 0.2 | 14.4 ± 2.0 | 22.6 ± 2.6 |
| | DCC¹⁰ | - | - | - | 34.0 ± 0.7 | 54.5 ± 0.5 | 64.2 ± 0.4 | 14.6 ± 0.3 | 33.5 ± 0.3 | 39.3 ± 0.4 | - | - | - |
| | DM¹¹ | 89.7 ± 0.6 | 97.5 ± 0.1 | 98.6 ± 0.1 | 26.0 ± 0.8 | 48.9 ± 0.6 | 63.0 ± 0.4 | 11.4 ± 0.3 | 29.7 ± 0.3 | 43.6 ± 0.4 | 3.9 ± 0.2 | 12.9 ± 0.4 | 24.1 ± 0.3 |
| | CAFE¹² | 90.8 ± 0.5 | 97.5 ± 0.1 | 98.9 ± 0.2 | 31.6 ± 0.8 | 50.9 ± 0.5 | 62.3 ± 0.4 | 14.0 ± 0.3 | 31.5 ± 0.2 | 42.9 ± 0.2 | - | - | - |
| | MTT¹³ | - | - | - | 46.3 ± 0.8 | 65.3 ± 0.7 | 71.6 ± 0.2 | 24.3 ± 0.3 | 40.1 ± 0.4 | 47.7 ± 0.2 | 8.8 ± 0.3 | 23.2 ± 0.2 | 28.0 ± 0.3 |
| | TESLA¹⁴ | - | - | - | 48.5 ± 0.8 | 66.4 ± 0.8 | 72.6 ± 0.7 | 24.8 ± 0.4 | 41.7 ± 0.3 | 47.9 ± 0.3 | - | - | - |
| Factorization | IDC¹⁵ | - | - | - | 50.0 ± 0.4 | 67.5 ± 0.5 | 74.5 ± 0.1 | - | 44.8 ± 0.2 | - | - | - | - |
| | LinBa¹⁶ | 98.7 ± 0.7 | 99.3 ± 0.5 | 99.4 ± 0.4 | 66.4 ± 0.4 | 71.2 ± 0.4 | 73.6 ± 0.5 | 34.0 ± 0.4 | 42.9 ± 0.7 | - | 16.0 ± 0.7 | - | - |
| | HaBa¹⁷ | - | - | - | 48.3 ± 0.8 | 69.9 ± 0.4 | 74.0 ± 0.2 | 33.4 ± 0.4 | 40.2 ± 0.2 | 47.0 ± 0.2 | - | - | - |
| | KFS¹⁸ | - | - | - | 59.8 ± 0.5 | 72.0 ± 0.3 | 75.0 ± 0.2 | 40.0 ± 0.5 | 50.6 ± 0.2 | - | 22.7 ± 0.2 | 27.8 ± 0.2 | - |

Full Dataset: MNIST 99.6 ± 0.1; CIFAR-10 84.8 ± 0.1; CIFAR-100 56.2 ± 0.3; Tiny ImageNet 37.6 ± 0.4.

¹ (Welling, 2009), ² (Toneva et al., 2019), ³ (Wang et al., 2018), ⁴ (Nguyen et al., 2021b), ⁵ (Loo et al., 2022), ⁶ (Zhou et al., 2022b), ⁷ (Nguyen et al., 2021a), ⁸ (Zhao et al., 2021), ⁹ (Zhao & Bilen, 2021), ¹⁰ (Lee et al., 2022b), ¹¹ (Zhao & Bilen, 2023), ¹² (Wang et al., 2022), ¹³ (Cazenavette et al., 2022), ¹⁴ (Cui et al., 2022b), ¹⁵ (Kim et al., 2022), ¹⁶ (Deng & Russakovsky, 2022), ¹⁷ (Liu et al., 2022c), ¹⁸ (Lee et al., 2022a)

Lee et al. (2022a) (KFS) further build atop this framework by maintaining a different bases vector space $\mathbb{B}$ from the data domain $\mathcal{X}$, such that $\dim(\mathbb{B}) < \dim(\mathcal{X})$. This parameterization allows KFS to store an even larger number of images, with a comparable storage budget to other methods. Formally, the data parameterization for KFS can be specified as:

$$
\mathcal{D}_{\mathsf{syn}} \triangleq \bigcup_{c \in \mathcal{C}} \big\{ (h(b), c) \big\}_{b \in \mathcal{B}_c,\; h \in \mathcal{H}} \;;\quad \mathcal{B} \triangleq \bigcup_{c \in \mathcal{C}} \mathcal{B}_c \;;\quad \mathcal{B}_c \triangleq \{ b^c_i \in \mathbb{B} \}_{i=1}^{B} \;;\quad \mathcal{H} \triangleq \{ h_{\theta_i} : \mathbb{B} \mapsto \mathcal{X} \}_{i=1}^{|\mathcal{H}|} \tag{15}
$$

where KFS stores $B$ bases per class, equivalent to a total of $n = |\mathcal{C}| \cdot B \cdot |\mathcal{H}|$ sized data summaries. Following this data parameterization, $\mathcal{B}$ and $\mathcal{H}$ are optimized using the distribution matching framework for data distillation (Equation (11)) to ensure fast, single-level optimization.

Data Distillation vs. Data Compression. We highlight that it is non-trivial to ensure a fair comparison between data distillation techniques that (1) are non-factorized, i.e., maintain each synthesized data point as a set of free parameters (Sections 2.1 to 2.4); and (2) use factorized approaches discussed in this section to efficiently organize the data summary.
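To make the two notions of size concrete, the following sketch builds a toy factorized summary in the spirit of the KFS parameterization (Equation (15)) and reports both the number of synthesized points n and the number of scalars that actually need to be stored. It is a minimal illustration under stated assumptions, not the authors' implementation: the decoders are random linear maps standing in for learned networks, and every function name and size below is hypothetical.

```python
import random


def make_factorized_summary(num_classes, bases_per_class, basis_dim,
                            data_dim, num_decoders, seed=0):
    """Toy KFS-style parameterization (hypothetical helper).

    Returns per-class bases living in a low-dimensional space and a shared
    set of decoders mapping that space to the data domain. Decoders here are
    random linear maps, an illustrative stand-in for learned networks.
    """
    rng = random.Random(seed)
    bases = {
        c: [[rng.gauss(0.0, 1.0) for _ in range(basis_dim)]
            for _ in range(bases_per_class)]
        for c in range(num_classes)
    }
    decoders = [
        [[rng.gauss(0.0, 1.0) for _ in range(basis_dim)]
         for _ in range(data_dim)]
        for _ in range(num_decoders)
    ]
    return bases, decoders


def decode(decoder, basis):
    """Apply one linear decoder h to one basis b, giving a point in the data domain."""
    return [sum(w * b for w, b in zip(row, basis)) for row in decoder]


def materialize(bases, decoders):
    """Pair every decoder with every basis of every class: {(h(b), c)}."""
    return [(decode(h, b), c)
            for c, class_bases in bases.items()
            for h in decoders
            for b in class_bases]


def stored_scalars(bases, decoders):
    """Count the scalars actually kept in storage (bases plus decoder weights)."""
    n_bases = sum(len(b) for class_bases in bases.values() for b in class_bases)
    n_decoders = sum(len(row) for h in decoders for row in h)
    return n_bases + n_decoders


bases, decoders = make_factorized_summary(
    num_classes=10, bases_per_class=2, basis_dim=8, data_dim=64, num_decoders=3)
summary = materialize(bases, decoders)
n = len(summary)                                  # |C| * B * |H|
factorized_cost = stored_scalars(bases, decoders)
free_parameter_cost = n * 64                      # n free points of dimension 64
print(n, factorized_cost, free_parameter_cost)
```

With these toy sizes, the factorized summary materializes n = 60 points from 1,696 stored scalars, whereas keeping 60 free 64-dimensional points would take 3,840 scalars, which is why the choice of efficiency metric matters in the comparison that follows.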
If we use the size of the data summary (n) as the efficiency metric, factorized approaches are adversely affected as they need a much smaller storage budget to synthesize the same-sized data summaries. On the other hand, if we use end-to-end bytes of storage as the efficiency metric, non-factorized approaches are adversely affected as they perform no kind of data compression, but focus solely on better understanding the model-to-data relationship through the lens of optimization. For a better intuition, one can apply post-hoc lossless compression (e.g., Huffman coding) on data synthesized by non-factorized data distillation approaches to fit more images in the same storage budget (Schirrmeister et al., 2022). Such techniques unintentionally deviate from the original intent of data distillation, and progress more toward better data compression techniques. As a potential solution, we encourage the community to consider reporting results for both scenarios: a fixed data summary size n, as well as fixed bytes-of-storage. Nonetheless, for the ease of empirical comparison amongst the discussed data distillation techniques, we provide a collated set of results over four image-classification datasets in Table 1.

3 Data Modalities

Having learned about different kinds of optimization frameworks for data distillation, we now discuss an orthogonal (and important) aspect of data distillation: what kinds of data can data distillation techniques summarize? From continuous-valued images to heterogeneous and semi-structured graphs, the underlying data for each unique application of machine learning has its own modality, structure, and set of assumptions. While the earliest data distillation techniques were designed to summarize images, recent steps have been taken to expand the horizon of data distillation into numerous other scenarios. In what follows, we categorize existing data distillation methods by their intended data modality, while also discussing their unique challenges.
Images. A large portion of existing data distillation techniques are designed for image classification data (Cazenavette et al., 2022; Deng & Russakovsky, 2022; Kim et al., 2022; Lee et al., 2022a;b; Liu et al., 2022c; Loo et al., 2022; Nguyen et al., 2021a;b; Wang et al., 2022; 2018; Zhao & Bilen, 2021; 2022; 2023; Zhao et al., 2021; Zhou et al., 2022b) simply because images have a real-valued, continuous data-domain ($\mathcal{X} \equiv \mathbb{R}^{d \times d}$). This allows SGD-based optimization directly on the data, which is treated as a set of free parameters. Intuitively, incrementally changing each pixel value can be treated as slight perturbations in the color space, and hence given a suitable data distillation loss, can be naïvely optimized using SGD.

Figure 4: Overview of distilling data for a few commonly observed data modalities.

Text. Textual data is available in large amounts from sources like websites, news articles, academic manuscripts, etc., and is also readily accessible with datasets like the Common Crawl (https://commoncrawl.org/the-data/), which sizes up to almost 541TB. Furthermore, with the advent of large language models (LLMs) (Brown et al., 2020; Devlin et al., 2019; Thoppilan et al., 2022), training such models from scratch on large datasets has become an increasingly expensive procedure.
Despite recent efforts in democratizing LLM training (Geiping & Goldstein, 2022; Scao et al., 2022; Wolf et al., 2020), effectively distilling large-scale textual data as a solution is yet to be explored. The key bottlenecks for distilling textual data are: (1) the inherently discrete nature of data, where a token should belong in a limited vocabulary of words; (2) the presence of a rich underlying structure, i.e., sentences of words (text) obey fixed patterns according to a grammar; and (3) richness of context, i.e., a given piece of text could have wildly different semantic interpretations under different contexts. Sucholutsky & Schonlau (2021) take a latent-embedding approach to textual data distillation. On a high level, to circumvent the discreteness of the optimization, the authors perform distillation in a continuous embedding space. More specifically, assuming access to a latent space specified by a fixed text-encoder, the authors learn continuous representations of each word in the distilled text and optimize it using the TBPTT data-distillation framework proposed by Wang et al. (2018) (Equation (4)). Finally, the distilled text representations are decoded by following a simple nearest-neighbor protocol.

Graphs. A wide variety of data and applications can inherently be modeled as graphs, e.g., user-item interactions (Mittal et al., 2021; Sachdeva & McAuley, 2020; Wu et al., 2020), social networks (Fan et al., 2019), autonomous driving (Casas et al., 2020; Sachdeva et al., 2022b), etc. Taking the example of social networks, underlying user-user graphs typically size up to the billion-scale (Chen et al., 2021), calling for principled scaling solutions. Graph distillation trivially solves a majority of the scale challenges, but synthesizing tiny, high-fidelity graphs has the following hurdles: (1) nodes in a graph can be highly abstract, e.g., users, products, etc.
and could be discrete, heterogeneous, or even numerical IDs; (2) graphs follow a variety of intrinsic patterns (e.g., spatial (Kipf & Welling, 2017)) which need to be retained in the distilled graphs; and (3) the quadratic size of the adjacency matrix could be computationally prohibitive for data optimization.

Jin et al. (2022b) propose GCond which distills graphs in the inductive node-classification setting, specified by its node-feature matrix $X$, adjacency matrix $A$, and node-target matrix $Y$. GCond distills the given graph by learning a synthetic node-feature matrix $X_{\mathsf{syn}}$, and using $X_{\mathsf{syn}}$ to generate $A_{\mathsf{syn}} \triangleq f_\theta(X_{\mathsf{syn}})$ which can be realized, e.g., through a parametric similarity function $\mathrm{sim}_\theta(\cdot, \cdot)$ between the features of two nodes, i.e., $A^{i,j}_{\mathsf{syn}} \triangleq \sigma(\mathrm{sim}_\theta(X^{i}_{\mathsf{syn}}, X^{j}_{\mathsf{syn}}))$, where $\sigma(\cdot)$ is the sigmoid function. Finally, both $X_{\mathsf{syn}}$ and $\theta$ are optimized using the gradient-matching framework proposed by Zhao et al. (2021) (Equation (7)). Another work (Liu et al., 2022a) (GCDM) shares the same framework as GCond but instead uses the distribution matching framework proposed by Zhao & Bilen (2023) (Equation (11)) to optimize $X_{\mathsf{syn}}$ and $\theta$.

Extending to a graph-classification setting, Jin et al. (2022a) further propose DosCond with two major changes compared to GCond: (1) instead of parameterizing the adjacency matrix using a similarity function on $X_{\mathsf{syn}}$, they maintain a free-parameter matrix $\Omega$ with the same size as the adjacency matrix, and sample each $A^{i,j}_{\mathsf{syn}}$ entry through an independent Bernoulli draw on $\Omega_{i,j}$ as the prior using the reparameterization trick (Maddison et al., 2017). Such a procedure ensures differentiability as well as discrete matrix synthesis; and (2) $X_{\mathsf{syn}}$ and $\Omega$ are still optimized using the gradient-matching framework (Equation (7)), albeit with only a single step, i.e., $T = 1$, for improved scalability and without empirically observing a loss in performance.

Recommender Systems.
The amount of online user-feedback data available for training recommender systems is rapidly increasing (Wu et al., 2022). Furthermore, typical user-facing recommender systems need to be periodically re-trained (Naumov et al., 2019), which adds to requirements for smarter data summarization solutions (see Sachdeva et al. (2022c) for background on sampling recommender systems data). However, distilling recommender systems data has the following challenges: (1) the data is available in the form of abstract and discrete (user ID, item ID, relevance) tuples, which departs from the typical (features, label) setup; (2) the distribution of both user- and item-popularity follows a strong power-law which leads to data scarcity and unstable optimization; and (3) the data exhibits a variety of inherent structures, e.g., sequential patterns (Kang & McAuley, 2018; Sachdeva et al., 2019), user-item graph patterns (Wu et al., 2019), item-item co-occurrence patterns (Steck, 2019), missing-not-at-randomness (Sachdeva et al., 2020; Schnabel et al., 2016), etc.

Sachdeva et al. (2022a) propose Distill-CF which distills implicit-feedback recommender systems data, i.e., when the observed user-item relevance is binary (e.g., click or no-click). Such data can be visualized as a binary user-item matrix $R$ where each row represents a single user, and each column represents an item. At a high level, Distill-CF synthesizes fake users along with their item-consumption histories, visualized as a synthetic user-item matrix $R_{\mathsf{syn}}$. Notably, to preserve semantic meaning, the item-space in $R_{\mathsf{syn}}$ is the same as in $R$. To alleviate the data discreteness problem, Distill-CF maintains a sampling-prior matrix $\Omega$ which has the same size as $R_{\mathsf{syn}}$, and can in turn be used to generate $R_{\mathsf{syn}}$ using multi-step Gumbel sampling with replacement (Jang et al., 2017) for each user's prior in $\Omega$ (equivalent to each row).
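The sampling step just described can be pictured with a small, self-contained sketch. It is only an illustration under assumptions that are not taken from the source: each row of the prior matrix is treated as unnormalized log-probabilities over items, each of the `num_draws` steps produces one relaxed one-hot vector via Gumbel-softmax, the draws are summed (sampling with replacement), and the result is clamped to [0, 1]. The function names, the temperature `tau`, and the toy prior are all hypothetical.

```python
import math
import random


def gumbel_softmax(logits, tau, rng):
    """One relaxed one-hot draw over items (Gumbel-softmax relaxation)."""
    noisy = []
    for logit in logits:
        u = max(rng.random(), 1e-12)          # guard against log(0)
        gumbel = -math.log(-math.log(u))      # Gumbel(0, 1) noise
        noisy.append((logit + gumbel) / tau)
    top = max(noisy)                          # subtract the max for numerical stability
    exps = [math.exp(v - top) for v in noisy]
    total = sum(exps)
    return [e / total for e in exps]


def sample_user(prior_logits, num_draws, tau, rng):
    """Sum several relaxed draws for one synthetic user, then clamp to [0, 1]."""
    row = [0.0] * len(prior_logits)
    for _ in range(num_draws):
        draw = gumbel_softmax(prior_logits, tau, rng)
        row = [r + d for r, d in zip(row, draw)]
    return [min(1.0, v) for v in row]


def sample_synthetic_matrix(omega, num_draws, tau=0.5, seed=0):
    """Generate one relaxed synthetic user-item matrix from the prior, row by row."""
    rng = random.Random(seed)
    return [sample_user(row, num_draws, tau, rng) for row in omega]


# Toy prior for 2 synthetic users over 4 items (illustrative values only).
omega = [[2.0, 0.0, -1.0, 0.5],
         [0.0, 0.0, 3.0, -2.0]]
r_syn = sample_synthetic_matrix(omega, num_draws=2)
for row in r_syn:
    print([round(v, 3) for v in row])
```

Because every operation above is differentiable with respect to the prior, gradients of a downstream distillation loss could in principle flow back into it, which is the property that makes a relaxed sampler attractive for discrete interaction data.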
Such a formulation automatically also circumvents the dynamic user- and item-popularity artifact in recommender systems data, which can analogously be controlled by the row- and column-wise entropy of $\Omega$. Finally, $\Omega$ is optimized using the meta-model matching framework proposed by Nguyen et al. (2021a). Notably, Sachdeva et al. (2022a) also propose infinite-width autoencoders which suit the task of item recommendation while also leading to closed-form computation of the inner-loop in the meta-model matching framework (Equation (5)).

4 Applications

While the data distillation task was originally designed to accelerate model training, there are numerous other applications of a high-fidelity data summary. Below we briefly discuss a few such promising applications, along with providing pointers to existing works.

Differential Privacy. Data distillation was recently shown to be a promising solution for differential privacy as defined by Dwork (2008). Dong et al. (2022) show that data distillation techniques can perform better than existing state-of-the-art differentially-private data generators (Cao et al., 2021; Harder et al., 2021) on both performance and privacy grounds. Notably, the privacy benefits of data distillation techniques are virtually free, as none of these methods were optimized for generating differentially-private data. Chen et al. (2022) further modify the gradient matching framework (Equation (7)) by clipping and adding white noise to the gradients obtained on the original dataset during optimization. Such a routine was shown to have better sample utility, while also satisfying strict differential privacy guarantees. From a purely application perspective, data distillation has been used to effectively distill sensitive medical data as well (Li et al., 2020a; 2022).

Neural Architecture Search (NAS). Automatic searching of neural-network architectures can alleviate a majority of manual effort, as well as lead to more accurate models (see Elsken et al.
(2019) for a detailed review). As a counterpart to model extrapolation, i.e., extrapolating the performance of an under-trained model architecture on the full dataset, data extrapolation aims to train models on a small, high-fidelity data sample until convergence. Zhao et al. (2021) show promise of their technique (DC) on a small custom NAS test-bed consisting of only 720 variations of the ConvNet architecture (Gidaris & Komodakis, 2018) by employing the data extrapolation framework. However, Cui et al. (2022a) show that data distillation does not perform well when evaluating diverse architectures on the bigger test-bed, NAS-Bench-201 (Dong & Yang, 2020), calling for better rank-preserving data distillation techniques.

Continual Learning. Never-ending learning (see Parisi et al. (2019) for a detailed review) has been frequently associated with catastrophic forgetting (French, 1999), i.e., patterns extracted from old data/tasks are easily forgotten when patterns from new data/tasks are learned. Data distillation has been shown as an effective solution to alleviate catastrophic forgetting, by simply using the distilled data summary in a replay buffer that is continually updated and used in subsequent data/task training (Rosasco et al., 2021; Sangermano et al., 2022; Wiewel & Yang, 2021). Deng & Russakovsky (2022) show further evidence of a simple compress-then-recall strategy outperforming existing state-of-the-art continual learning approaches. Notably, only the data summary is stored for each task, and a new model is trained (from scratch) using all previous data summaries, for each new incoming task.

Federated Learning. Federated or collaborative learning (see Li et al. (2020b) for a detailed survey) involves training a learning algorithm in a decentralized fashion.
A standard approach to federated learning is to synchronize local parameter updates to a central server, instead of synchronizing the raw data itself (Konečný et al., 2016). Data distillation, on the other hand, alleviates the need to synchronize large parametric models across clients and servers, by synchronizing tiny synthesized data summaries to the central server instead. Subsequently, the entire training happens only on the central server. Such data distillation-based federated learning methods (Goetz & Tewari, 2020; Hu et al., 2022; Liu et al., 2022b; Song et al., 2022; Xiong et al., 2022; Zhou et al., 2020) are shown to perform better than model-synchronization based federated learning approaches, while also requiring multiple orders of magnitude less client-server communication.

5 Challenges & Future Directions

Despite achieving remarkable progress in data-efficient learning, there are numerous framework-based, theoretical, and application-based directions yet to be explored in data distillation. In what follows, we highlight and discuss such directions for the community to further explore, based either on early evidence or our intuition.

New data modalities. Extending the discussion in Section 3, existing data distillation techniques have largely been restricted to cater to image datasets, primarily due to the amenable data-optimization in the continuous pixel-domain of images. Despite recent efforts in increasing the horizon of data distillation to other data modalities such as graphs (Jin et al., 2022a;b) and recommender systems (Sachdeva et al., 2022a), each data modality poses its unique challenges and calls for future work, e.g., handling long sequences of time-series data in audio-classification (Hershey et al., 2017), video classification (Karpathy et al., 2014), self-driving (Sun et al., 2020); millions of categorical features in tabular data (Wang et al., 2021); sparse and noisy financial data (Xu & Cohen, 2018); etc.

New predictive tasks.
Another limitation of existing data distillation techniques is that their underlying optimization is primarily designed for classification scenarios. However, a large number of predictive tasks fail to naïvely fit into the existing supervised data distillation framework, e.g., image-generation (Ramesh et al., 2022; Rombach et al., 2022), language modeling (Brown et al., 2020; Devlin et al., 2019; Touvron et al., 2023), representation learning (Chen et al., 2020; Grill et al., 2020), etc. Further, the aforementioned tasks have gained immense popularity and have seen widespread practical use in recent years, calling for future work in developing data distillation techniques for more predictive tasks.

Better scaling. Existing data distillation techniques validate their prowess only in the super low-data regime (typically 1–50 data points per class) due to (i) computational difficulties in synthesizing large data summaries with existing techniques; and (ii) collapse to the random-sampling baseline when synthesizing large data summaries, as noted by Cui et al. (2022a). This calls for future work from both directions: developing efficient data distillation techniques that are scalable to web-scale datasets, and deeper investigations of the cause and potential fixes of the observed scaling artifacts of existing techniques.

Improved optimization. A unifying thread across data distillation techniques is an underlying bilevel optimization, which is provably NP-hard even in the linear inner-optimization case (Vicente et al., 1994). Notably, bilevel optimization has been successfully applied in a variety of other applications like meta-learning (Finn et al., 2017; Li et al., 2017), hyper-parameter optimization (Lorraine et al., 2020; Maclaurin et al., 2015), neural architecture search (Liu et al., 2019), coreset construction (Borsos et al., 2020; Zhou et al., 2022a), etc.
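As a toy picture of what such a bilevel problem looks like in practice, the sketch below distills a one-dimensional regression dataset into a single synthetic point. The inner problem (ridge regression on the synthetic point) has a closed-form solution, and the outer problem adjusts the synthetic label by gradient descent so that the inner solution fits the real data. This is a deliberately simplified illustration, not any method from the literature above; the function names, the fixed synthetic input `x_syn`, the regularizer `lam`, and the learning rate are all assumptions chosen for the example.

```python
def inner_solution(x_syn, y_syn, lam):
    """Closed-form ridge-regression weight fitted on a single synthetic point."""
    return x_syn * y_syn / (x_syn ** 2 + lam)


def outer_loss(w, real_x, real_y):
    """Mean squared error of the inner solution evaluated on the real data."""
    return sum((w * x - y) ** 2 for x, y in zip(real_x, real_y)) / len(real_x)


def distill_label(real_x, real_y, x_syn=1.0, lam=0.1, lr=0.1, steps=200):
    """Outer loop: gradient descent on the synthetic label y_syn."""
    y_syn = 0.0
    dw_dy = x_syn / (x_syn ** 2 + lam)        # derivative of the inner solution w.r.t. y_syn
    for _ in range(steps):
        w = inner_solution(x_syn, y_syn, lam)
        dloss_dw = sum(2 * (w * x - y) * x
                       for x, y in zip(real_x, real_y)) / len(real_x)
        y_syn -= lr * dloss_dw * dw_dy        # chain rule through the closed-form inner loop
    return y_syn


# Real data generated by y = 2x; one synthetic point should summarize it.
real_x = [0.5, 1.0, 1.5, 2.0]
real_y = [2.0 * x for x in real_x]
y_syn = distill_label(real_x, real_y)
w_star = inner_solution(1.0, y_syn, 0.1)
print(y_syn, w_star, outer_loss(w_star, real_x, real_y))
```

Here the distilled label settles near 2.2 rather than 2.0: the outer loop compensates for the shrinkage introduced by the inner regularizer, a small reminder that the optimal synthetic data depends on the learning algorithm it is meant to be used with.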
Despite its success, many theoretical underpinnings are yet to be explored, e.g., the effect of the commonly-used singleton-solution assumption (Franceschi et al., 2018), the effect of over-parameterization on bilevel optimization (Vicol et al., 2022), connections to statistical influence functions (Bae et al., 2022), the bias-variance tradeoff (Vicol et al., 2021), etc. Clearly, an overall better understanding of bilevel optimization will directly enable the development of better data distillation techniques.

Improved data-quality evaluation. As briefly discussed in Section 2, data synthesized using data distillation is evaluated from performance, efficiency, and transferability standpoints. However, numerous high-stakes use-cases call for being able to train robust models from a variety of angles such as fairness (Mehrabi et al., 2021), adversarial robustness (Madry et al., 2018), etc. Hence, synthesizing data summaries able to support such robust model training is a practical and important direction for future work. Notably, while popular metrics exist for evaluating the robustness of learning algorithms from the aforementioned standpoints, developing such notions at the dataset-level is non-trivial, and with little existing literature (Ben-Eliezer & Yogev, 2020; Celis et al., 2018).

Acknowledgments

We sincerely thank Zhiwei Deng, Bo Zhao, and George Cazenavette for their feedback on early drafts of this survey.

References

Amro Abbas, Kushal Tirumala, Dániel Simig, Surya Ganguli, and Ari S Morcos. Semdedup: Data-efficient learning at web-scale through semantic deduplication. arXiv preprint arXiv:2303.09540, 2023.
Sunil Arya, David M Mount, Nathan S Netanyahu, Ruth Silverman, and Angela Y Wu. An optimal algorithm for approximate nearest neighbor searching fixed dimensions. Journal of the ACM (JACM), 45(6):891–923, 1998.
Fadhel Ayed and Soufiane Hayou. Data pruning and neural scaling laws: fundamental limitations of score-based algorithms.
arXiv preprint arXiv:2302.06960, 2023.
Olivier Bachem, Mario Lucic, and Andreas Krause. Practical coreset constructions for machine learning. arXiv preprint arXiv:1703.06476, 2017.
Juhan Bae, Nathan Ng, Alston Lo, Marzyeh Ghassemi, and Roger Grosse. If influence functions are the answer, then what is the question? arXiv preprint arXiv:2209.05364, 2022.
Omri Ben-Eliezer and Eylon Yogev. The adversarial robustness of sampling. In Proceedings of the 39th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems, pp. 49–62, 2020.
Jeff Bilmes. Submodularity in machine learning and artificial intelligence. arXiv preprint arXiv:2202.00132, 2022.
Ondrej Bohdal, Yongxin Yang, and Timothy Hospedales. Flexible dataset distillation: Learn labels instead of images. arXiv preprint arXiv:2006.08572, 2020.
Zalán Borsos, Mojmir Mutny, and Andreas Krause. Coresets via bilevel optimization for continual learning and streaming. Advances in Neural Information Processing Systems, 33:14879–14890, 2020.
Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901, 2020.
Tianshi Cao, Alex Bie, Arash Vahdat, Sanja Fidler, and Karsten Kreis. Don't generate me: Training differentially private generative models with sinkhorn divergence. Advances in Neural Information Processing Systems, 2021.
Sergio Casas, Cole Gulino, Renjie Liao, and Raquel Urtasun. Spagnn: Spatially-aware graph neural networks for relational behavior forecasting from sensor data. In 2020 IEEE International Conference on Robotics and Automation (ICRA), pp. 9491–9497. IEEE, 2020.
George Cazenavette, Tongzhou Wang, Antonio Torralba, Alexei A Efros, and Jun-Yan Zhu. Dataset distillation by matching training trajectories.
In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 4750–4759, 2022.
Elisa Celis, Vijay Keswani, Damian Straszak, Amit Deshpande, Tarun Kathuria, and Nisheeth Vishnoi. Fair and diverse dpp-based data summarization. In International Conference on Machine Learning, pp. 716–725. PMLR, 2018.
Dingfan Chen, Raouf Kerkouche, and Mario Fritz. Private set generation with discriminative information. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), 2022.
Qi Chen, Bing Zhao, Haidong Wang, Mingqin Li, Chuanjie Liu, Zengzhong Li, Mao Yang, and Jingdong Wang. Spann: Highly-efficient billion-scale approximate nearest neighborhood search. Advances in Neural Information Processing Systems, 34:5199–5212, 2021.
Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. A simple framework for contrastive learning of visual representations. In International conference on machine learning, pp. 1597–1607. PMLR, 2020.
Cody Coleman, Christopher Yeh, Stephen Mussmann, Baharan Mirzasoleiman, Peter Bailis, Percy Liang, Jure Leskovec, and Matei Zaharia. Selection via proxy: Efficient data selection for deep learning. In International Conference on Learning Representations, 2020.
Justin Cui, Ruochen Wang, Si Si, and Cho-Jui Hsieh. DC-BENCH: Dataset condensation benchmark. In Thirty-sixth Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2022a.
Justin Cui, Ruochen Wang, Si Si, and Cho-Jui Hsieh. Scaling up dataset distillation to imagenet-1k with constant memory. arXiv preprint arXiv:2211.10586, 2022b.
Zhiwei Deng and Olga Russakovsky. Remember the past: Distilling datasets into addressable memories for neural networks. In Advances in Neural Information Processing Systems, 2022.
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding.
In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Minneapolis, Minnesota, June 2019. Association for Computational Linguistics.
Chao Dong, Chen Change Loy, and Xiaoou Tang. Accelerating the super-resolution convolutional neural network. In European conference on computer vision, pp. 391–407. Springer, 2016.
Tian Dong, Bo Zhao, and Lingjuan Lyu. Privacy for free: How does dataset condensation help privacy? In Proceedings of the 39th International Conference on Machine Learning. PMLR, 2022.
Xuanyi Dong and Yi Yang. Nas-bench-201: Extending the scope of reproducible neural architecture search. In International Conference on Learning Representations, 2020.
Cynthia Dwork. Differential privacy: A survey of results. In International conference on theory and applications of models of computation, pp. 1–19. Springer, 2008.
Thomas Elsken, Jan Hendrik Metzen, and Frank Hutter. Neural architecture search: A survey. The Journal of Machine Learning Research, 20(1):1997–2017, 2019.
Wenqi Fan, Yao Ma, Qing Li, Yuan He, Eric Zhao, Jiliang Tang, and Dawei Yin. Graph neural networks for social recommendation. In The world wide web conference, pp. 417–426, 2019.
Chelsea Finn, Pieter Abbeel, and Sergey Levine. Model-agnostic meta-learning for fast adaptation of deep networks. In International conference on machine learning, pp. 1126–1135. PMLR, 2017.
Luca Franceschi, Paolo Frasconi, Saverio Salzo, Riccardo Grazzi, and Massimiliano Pontil. Bilevel programming for hyperparameter optimization and meta-learning. In International Conference on Machine Learning, pp. 1568–1577. PMLR, 2018.
Robert M French. Catastrophic forgetting in connectionist networks. Trends in cognitive sciences, 3(4):128–135, 1999.
Jonas Geiping and Tom Goldstein. Cramming: Training a language model on a single gpu in one day, 2022.
Amirata Ghorbani and James Zou. Data shapley: Equitable valuation of data for machine learning. In International Conference on Machine Learning, pp. 2242–2251. PMLR, 2019.
Behrooz Ghorbani, Orhan Firat, Markus Freitag, Ankur Bapna, Maxim Krikun, Xavier Garcia, Ciprian Chelba, and Colin Cherry. Scaling laws for neural machine translation. arXiv preprint arXiv:2109.07740, 2021.
Spyros Gidaris and Nikos Komodakis. Dynamic few-shot visual learning without forgetting. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 4367–4375, 2018.
Jack Goetz and Ambuj Tewari. Federated learning via synthetic data. arXiv preprint arXiv:2008.04489, 2020.
Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in neural information processing systems, pp. 2672–2680, 2014.
Jean-Bastien Grill, Florian Strub, Florent Altché, Corentin Tallec, Pierre Richemond, Elena Buchatskaya, Carl Doersch, Bernardo Avila Pires, Zhaohan Guo, Mohammad Gheshlaghi Azar, et al. Bootstrap your own latent-a new approach to self-supervised learning. Advances in neural information processing systems, 33:21271–21284, 2020.
Chengcheng Guo, Bo Zhao, and Yanbing Bai. Deepcore: A comprehensive library for coreset selection in deep learning. In Database and Expert Systems Applications: 33rd International Conference, DEXA 2022, Vienna, Austria, August 22–24, 2022, Proceedings, Part I, pp. 181–195. Springer, 2022.
Udit Gupta, Young Geun Kim, Sylvia Lee, Jordan Tse, Hsien-Hsin S Lee, Gu-Yeon Wei, David Brooks, and Carole-Jean Wu. Chasing carbon: The elusive environmental footprint of computing. IEEE Micro, 42(4):37–47, 2022.
Frederik Harder, Kamil Adamczewski, and Mijung Park. Dp-merf: Differentially private mean embeddings with random features for practical privacy-preserving data generation.
In International conference on artificial intelligence and statistics, pp. 1819–1827. PMLR, 2021.
Shawn Hershey, Sourish Chaudhuri, Daniel PW Ellis, Jort F Gemmeke, Aren Jansen, R Channing Moore, Manoj Plakal, Devin Platt, Rif A Saurous, Bryan Seybold, et al. Cnn architectures for large-scale audio classification. In 2017 ieee international conference on acoustics, speech and signal processing (icassp), pp. 131–135. IEEE, 2017.
Geoffrey Hinton, Oriol Vinyals, Jeff Dean, et al. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2(7), 2015.
Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, et al. Training compute-optimal large language models. arXiv preprint arXiv:2203.15556, 2022.
Shengyuan Hu, Jack Goetz, Kshitiz Malik, Hongyuan Zhan, Zhe Liu, and Yue Liu. Fedsynth: Gradient compression via synthetic data in federated learning. arXiv preprint arXiv:2204.01273, 2022.
Arthur Jacot, Franck Gabriel, and Clément Hongler. Neural tangent kernel: Convergence and generalization in neural networks. Advances in neural information processing systems, 31, 2018.
Eric Jang, Shixiang Gu, and Ben Poole. Categorical reparameterization with gumbel-softmax. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings. OpenReview.net, 2017.
Wei Jin, Xianfeng Tang, Haoming Jiang, Zheng Li, Danqing Zhang, Jiliang Tang, and Bing Yin. Condensing graphs via one-step gradient matching. In Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pp. 720–730, 2022a.
Wei Jin, Lingxiao Zhao, Shichang Zhang, Yozen Liu, Jiliang Tang, and Neil Shah. Graph condensation for graph neural networks. In International Conference on Learning Representations, 2022b.
Wang-Cheng Kang and Julian McAuley. Self-attentive sequential recommendation.
In 2018 IEEE international conference on data mining (ICDM), pp. 197–206. IEEE, 2018.
Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361, 2020.
Andrej Karpathy, George Toderici, Sanketh Shetty, Thomas Leung, Rahul Sukthankar, and Li Fei-Fei. Large-scale video classification with convolutional neural networks. In Proceedings of the IEEE conference on Computer Vision and Pattern Recognition, pp. 1725–1732, 2014.
Krishnateja Killamsetty, Durga S, Ganesh Ramakrishnan, Abir De, and Rishabh Iyer. Grad-match: Gradient matching based data subset selection for efficient deep model training. In Marina Meila and Tong Zhang (eds.), Proceedings of the 38th International Conference on Machine Learning, volume 139 of Proceedings of Machine Learning Research, pp. 5464–5474. PMLR, 18–24 Jul 2021.
Jang-Hyun Kim, Jinuk Kim, Seong Joon Oh, Sangdoo Yun, Hwanjun Song, Joonhyun Jeong, Jung-Woo Ha, and Hyun Oh Song. Dataset condensation via efficient synthetic-data parameterization. In Proceedings of the 39th International Conference on Machine Learning, 2022.
Thomas N. Kipf and Max Welling. Semi-Supervised Classification with Graph Convolutional Networks. In Proceedings of the 5th International Conference on Learning Representations, ICLR '17, 2017.
Jakub Konečný, H Brendan McMahan, Daniel Ramage, and Peter Richtárik. Federated optimization: Distributed machine learning for on-device intelligence. arXiv preprint arXiv:1610.02527, 2016.
Yann LeCun, John Denker, and Sara Solla. Optimal brain damage. Advances in neural information processing systems, 2, 1989.
Hae Beom Lee, Dong Bok Lee, and Sung Ju Hwang. Dataset condensation with latent space knowledge factorization and sharing. arXiv preprint arXiv:2208.10494, 2022a.
Jaehoon Lee, Samuel Schoenholz, Jeffrey Pennington, Ben Adlam, Lechao Xiao, Roman Novak, and Jascha Sohl-Dickstein. Finite versus infinite neural networks: an empirical study. Advances in Neural Information Processing Systems, 33:15156–15172, 2020.

Saehyung Lee, Sanghyuk Chun, Sangwon Jung, Sangdoo Yun, and Sungroh Yoon. Dataset condensation with contrastive signals. In Proceedings of the 39th International Conference on Machine Learning, pp. 12352–12364, 2022b.

Guang Li, Ren Togo, Takahiro Ogawa, and Miki Haseyama. Soft-label anonymous gastric x-ray image distillation. In 2020 IEEE International Conference on Image Processing (ICIP), pp. 305–309. IEEE, 2020a.

Guang Li, Ren Togo, Takahiro Ogawa, and Miki Haseyama. Compressed gastric image generation based on soft-label dataset distillation for medical data sharing. Computer Methods and Programs in Biomedicine, pp. 107189, 2022.

Tian Li, Anit Kumar Sahu, Ameet Talwalkar, and Virginia Smith. Federated learning: Challenges, methods, and future directions. IEEE Signal Processing Magazine, 37(3):50–60, 2020b.

Zhenguo Li, Fengwei Zhou, Fei Chen, and Hang Li. Meta-sgd: Learning to learn quickly for few-shot learning. arXiv preprint arXiv:1707.09835, 2017.

Hanxiao Liu, Karen Simonyan, and Yiming Yang. DARTS: Differentiable architecture search. In International Conference on Learning Representations, 2019.

Mengyang Liu, Shanchuan Li, Xinshi Chen, and Le Song. Graph condensation via receptive field distribution matching. arXiv preprint arXiv:2206.13697, 2022a.

Ping Liu, Xin Yu, and Joey Tianyi Zhou. Meta knowledge condensation for federated learning. arXiv preprint arXiv:2209.14851, 2022b.

Songhua Liu, Kai Wang, Xingyi Yang, Jingwen Ye, and Xinchao Wang. Dataset distillation via factorization. NeurIPS, 2022c.

Noel Loo, Ramin Hasani, Alexander Amini, and Daniela Rus. Efficient dataset distillation using random feature approximation. In Advances in Neural Information Processing Systems, 2022.
Jonathan Lorraine, Paul Vicol, and David Duvenaud. Optimizing millions of hyperparameters by implicit differentiation. In International Conference on Artificial Intelligence and Statistics, pp. 1540–1552. PMLR, 2020.

Dougal Maclaurin, David Duvenaud, and Ryan Adams. Gradient-based hyperparameter optimization through reversible learning. In International conference on machine learning, pp. 2113–2122. PMLR, 2015.

Chris J. Maddison, Andriy Mnih, and Yee Whye Teh. The concrete distribution: A continuous relaxation of discrete random variables. In International Conference on Learning Representations, 2017.

Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian Vladu. Towards deep learning models resistant to adversarial attacks. In International Conference on Learning Representations, 2018.

Geoffrey J McLachlan and Thriyambakam Krishnan. The EM algorithm and extensions. John Wiley & Sons, 2007.

Ninareh Mehrabi, Fred Morstatter, Nripsuta Saxena, Kristina Lerman, and Aram Galstyan. A survey on bias and fairness in machine learning. ACM Computing Surveys (CSUR), 54(6):1–35, 2021.

Luke Metz, Niru Maheswaranathan, Jeremy Nixon, Daniel Freeman, and Jascha Sohl-Dickstein. Understanding and correcting pathologies in the training of learned optimizers. In International Conference on Machine Learning, pp. 4556–4565. PMLR, 2019.

Baharan Mirzasoleiman, Jeff Bilmes, and Jure Leskovec. Coresets for data-efficient training of machine learning models. In International Conference on Machine Learning, pp. 6950–6960. PMLR, 2020.

Anshul Mittal, Noveen Sachdeva, Sheshansh Agrawal, Sumeet Agarwal, Purushottam Kar, and Manik Varma. Eclare: Extreme classification with label graph correlations. In Proceedings of the Web Conference 2021, WWW '21, 2021.
Maxim Naumov, Dheevatsa Mudigere, Hao-Jun Michael Shi, Jianyu Huang, Narayanan Sundaraman, Jongsoo Park, Xiaodong Wang, Udit Gupta, Carole-Jean Wu, Alisson G. Azzolini, Dmytro Dzhulgakov, Andrey Mallevich, Ilia Cherniavskii, Yinghai Lu, Raghuraman Krishnamoorthi, Ansha Yu, Volodymyr Kondratenko, Stephanie Pereira, Xianjie Chen, Wenlin Chen, Vijay Rao, Bill Jia, Liang Xiong, and Misha Smelyanskiy. Deep learning recommendation model for personalization and recommendation systems. CoRR, abs/1906.00091, 2019.

Radford M Neal. Bayesian learning for neural networks, volume 118. Springer Science & Business Media, 2012.

Timothy Nguyen, Zhourong Chen, and Jaehoon Lee. Dataset meta-learning from kernel ridge-regression. In International Conference on Learning Representations, 2021a.

Timothy Nguyen, Roman Novak, Lechao Xiao, and Jaehoon Lee. Dataset distillation with infinitely wide convolutional networks. Advances in Neural Information Processing Systems, 34, 2021b.

German I Parisi, Ronald Kemker, Jose L Part, Christopher Kanan, and Stefan Wermter. Continual lifelong learning with neural networks: A review. Neural Networks, 113:54–71, 2019.

Lorien Y Pratt. Discriminability-based transfer between neural networks. Advances in neural information processing systems, 5, 1992.

Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:2204.06125, 2022.

Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10684–10695, 2022.

Andrea Rosasco, Antonio Carta, Andrea Cossu, Vincenzo Lomonaco, and Davide Bacciu. Distilled replay: Overcoming forgetting through synthetic samples. arXiv preprint arXiv:2103.15851, 2021.

Durga S, Rishabh Iyer, Ganesh Ramakrishnan, and Abir De.
Training data subset selection for regression with controlled generalization error. In Marina Meila and Tong Zhang (eds.), Proceedings of the 38th International Conference on Machine Learning, volume 139 of Proceedings of Machine Learning Research, pp. 9202–9212. PMLR, 18–24 Jul 2021.

Noveen Sachdeva and Julian McAuley. How Useful Are Reviews for Recommendation? A Critical Review and Potential Improvements, pp. 1845–1848. SIGIR '20. Association for Computing Machinery, New York, NY, USA, 2020. ISBN 9781450380164. doi: 10.1145/3397271.3401281.

Noveen Sachdeva, Giuseppe Manco, Ettore Ritacco, and Vikram Pudi. Sequential variational autoencoders for collaborative filtering. In Proceedings of the twelfth ACM international conference on web search and data mining, pp. 600–608, 2019.

Noveen Sachdeva, Yi Su, and Thorsten Joachims. Off-policy bandits with deficient support. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, KDD '20, pp. 965–975, New York, NY, USA, 2020. Association for Computing Machinery. doi: 10.1145/3394486.3403139.

Noveen Sachdeva, Mehak Preet Dhaliwal, Carole-Jean Wu, and Julian McAuley. Infinite recommendation networks: A data-centric approach. In Advances in Neural Information Processing Systems, 2022a.

Noveen Sachdeva, Ziran Wang, Kyungtae Han, Rohit Gupta, and Julian McAuley. Gapformer: Fast autoregressive transformers meet rnns for personalized adaptive cruise control. In 2022 IEEE 25th International Conference on Intelligent Transportation Systems (ITSC), pp. 2528–2535, 2022b. doi: 10.1109/ITSC55140.2022.9922275.

Noveen Sachdeva, Carole-Jean Wu, and Julian McAuley. On sampling collaborative filtering datasets. In Proceedings of the Fifteenth ACM International Conference on Web Search and Data Mining, WSDM '22, 2022c.

Mattia Sangermano, Antonio Carta, Andrea Cossu, and Davide Bacciu. Sample condensation in online continual learning.
In 2022 International Joint Conference on Neural Networks (IJCNN), pp. 01–08. IEEE, 2022.

Teven Le Scao, Angela Fan, Christopher Akiki, Ellie Pavlick, Suzana Ilić, Daniel Hesslow, Roman Castagné, Alexandra Sasha Luccioni, François Yvon, Matthias Gallé, et al. Bloom: A 176b-parameter open-access multilingual language model. arXiv preprint arXiv:2211.05100, 2022.

Robin Tibor Schirrmeister, Rosanne Liu, Sara Hooker, and Tonio Ball. When less is more: Simplifying inputs aids neural network understanding. arXiv preprint arXiv:2201.05610, 2022.

Tobias Schnabel, Adith Swaminathan, Ashudeep Singh, Navin Chandak, and Thorsten Joachims. Recommendations as treatments: Debiasing learning and evaluation. In Proceedings of The 33rd International Conference on Machine Learning, volume 48 of Proceedings of Machine Learning Research, pp. 1670–1679. PMLR, 2016.

Rui Song, Dai Liu, Dave Zhenyu Chen, Andreas Festag, Carsten Trinitis, Martin Schulz, and Alois Knoll. Federated learning via decentralized dataset distillation in resource-constrained edge environments. arXiv preprint arXiv:2208.11311, 2022.

Ben Sorscher, Robert Geirhos, Shashank Shekhar, Surya Ganguli, and Ari S. Morcos. Beyond neural scaling laws: beating power law scaling via data pruning. In Alice H. Oh, Alekh Agarwal, Danielle Belgrave, and Kyunghyun Cho (eds.), Advances in Neural Information Processing Systems, 2022.

Harald Steck. Embarrassingly shallow autoencoders for sparse data. In The World Wide Web Conference, 2019.

Ilia Sucholutsky and Matthias Schonlau. Soft-label dataset distillation and text dataset distillation. In 2021 International Joint Conference on Neural Networks (IJCNN), pp. 1–8. IEEE, 2021.

Pei Sun, Henrik Kretzschmar, Xerxes Dotiwalla, Aurelien Chouard, Vijaysai Patnaik, Paul Tsui, James Guo, Yin Zhou, Yuning Chai, Benjamin Caine, et al. Scalability in perception for autonomous driving: Waymo open dataset.
In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 2446–2454, 2020.

Romal Thoppilan, Daniel De Freitas, Jamie Hall, Noam Shazeer, Apoorv Kulshreshtha, Heng-Tze Cheng, Alicia Jin, Taylor Bos, Leslie Baker, Yu Du, et al. Lamda: Language models for dialog applications. arXiv preprint arXiv:2201.08239, 2022.

Mariya Toneva, Alessandro Sordoni, Remi Tachet des Combes, Adam Trischler, Yoshua Bengio, and Geoffrey J. Gordon. An empirical study of example forgetting during deep neural network learning. In International Conference on Learning Representations, 2019.

Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023.

Luis Vicente, Gilles Savard, and Joaquim Júdice. Descent approaches for quadratic bilevel programming. Journal of Optimization theory and applications, 81(2):379–399, 1994.

Paul Vicol, Luke Metz, and Jascha Sohl-Dickstein. Unbiased gradient estimation in unrolled computation graphs with persistent evolution strategies. In International Conference on Machine Learning, pp. 10553–10563. PMLR, 2021.

Paul Vicol, Jonathan P Lorraine, Fabian Pedregosa, David Duvenaud, and Roger B Grosse. On implicit bias in overparameterized bilevel optimization. In International Conference on Machine Learning. PMLR, 2022.

Kai Wang, Bo Zhao, Xiangyu Peng, Zheng Zhu, Shuo Yang, Shuo Wang, Guan Huang, Hakan Bilen, Xinchao Wang, and Yang You. Cafe: Learning to condense dataset by aligning features. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 12196–12205, 2022.

Ruoxi Wang, Rakesh Shivanna, Derek Cheng, Sagar Jain, Dong Lin, Lichan Hong, and Ed Chi.
Dcn v2: Improved deep & cross network and practical lessons for web-scale learning to rank systems. In Proceedings of the web conference 2021, pp. 1785–1797, 2021.

Tongzhou Wang, Jun-Yan Zhu, Antonio Torralba, and Alexei A Efros. Dataset distillation. arXiv preprint arXiv:1811.10959, 2018.

Max Welling. Herding dynamical weights to learn. In Proceedings of the 26th Annual International Conference on Machine Learning, ICML '09, 2009.

Felix Wiewel and Bin Yang. Condensed composite memory continual learning. In 2021 International Joint Conference on Neural Networks (IJCNN), pp. 1–8. IEEE, 2021.

T. Wolf, L. Debut, V. Sanh, J. Chaumond, C. Delangue, A. Moi, P. Cistac, T. Rault, R. Louf, M. Funtowicz, J. Davison, S. Shleifer, P. von Platen, C. Ma, Y. Jernite, J. Plu, C. Xu, T. Le Scao, S. Gugger, M. Drame, Q. Lhoest, and A. Rush. Transformers: State-of-the-art natural language processing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp. 38–45, Online, October 2020. Association for Computational Linguistics. doi: 10.18653/v1/2020.emnlp-demos.6.

D.H. Wolpert and W.G. Macready. No free lunch theorems for optimization. IEEE Transactions on Evolutionary Computation, 1(1):67–82, 1997. doi: 10.1109/4235.585893.

Carole-Jean Wu, Ramya Raghavendra, Udit Gupta, Bilge Acun, Newsha Ardalani, Kiwan Maeng, Gloria Chang, Fiona Aga, Jinshi Huang, Charles Bai, et al. Sustainable ai: Environmental implications, challenges and opportunities. Proceedings of Machine Learning and Systems, 4:795–813, 2022.

Shiwen Wu, Fei Sun, Wentao Zhang, Xu Xie, and Bin Cui. Graph neural networks in recommender systems: a survey. ACM Computing Surveys (CSUR), 2020.

Shu Wu, Yuyuan Tang, Yanqiao Zhu, Liang Wang, Xing Xie, and Tieniu Tan. Session-based recommendation with graph neural networks. In Proceedings of the AAAI conference on artificial intelligence, 2019.

Yuhuai Wu, Mengye Ren, Renjie Liao, and Roger Grosse.
Understanding short-horizon bias in stochastic meta-optimization. In International Conference on Learning Representations, 2018.

Yuanhao Xiong, Ruochen Wang, Minhao Cheng, Felix Yu, and Cho-Jui Hsieh. Feddm: Iterative distribution matching for communication-efficient federated learning. arXiv preprint arXiv:2207.09653, 2022.

Yumo Xu and Shay B Cohen. Stock movement prediction from tweets and historical prices. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 1970–1979, 2018.

Bo Zhao and Hakan Bilen. Dataset condensation with differentiable siamese augmentation. In International Conference on Machine Learning, pp. 12674–12685. PMLR, 2021.

Bo Zhao and Hakan Bilen. Synthesizing informative training samples with gan. arXiv preprint arXiv:2204.07513, 2022.

Bo Zhao and Hakan Bilen. Dataset condensation with distribution matching. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), 2023.

Bo Zhao, Konda Reddy Mopuri, and Hakan Bilen. Dataset condensation with gradient matching. In International Conference on Learning Representations, 2021.

Xiao Zhou, Renjie Pi, Weizhong Zhang, Yong Lin, Zonghao Chen, and Tong Zhang. Probabilistic bilevel coreset selection. In International Conference on Machine Learning, pp. 27287–27302. PMLR, 2022a.

Yanlin Zhou, George Pu, Xiyao Ma, Xiaolin Li, and Dapeng Wu. Distilled one-shot federated learning. arXiv preprint arXiv:2009.07999, 2020.

Yongchao Zhou, Ehsan Nezhadarya, and Jimmy Ba. Dataset distillation using neural feature regression. In Advances in Neural Information Processing Systems, 2022b.
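
| Notation | Description |
|---|---|
| **Dataset related** | |
| $\mathcal{D} \triangleq \{(x_i \in \mathcal{X},\, y_i \in \mathcal{Y})\}_{i=1}^{\lvert\mathcal{D}\rvert}$ | The target dataset to be distilled |
| $\mathcal{X}$ | Data domain |
| $\mathcal{Y}$ | Predictand domain |
| $\mathcal{C}$ | Set of unique classes in $\mathcal{Y}$ |
| $\mathcal{D}^{c} \triangleq \{(x_i, y_i) \mid y_i = c\}_{i=1}^{\lvert\mathcal{D}\rvert}$ | Portion of $\mathcal{D}$ with class $c$ |
| $\mathbf{X} \triangleq [x_i]_{i=1}^{\lvert\mathcal{D}\rvert}$ | Matrix of all features in $\mathcal{D}$ |
| $\mathbf{Y} \triangleq [y_i]_{i=1}^{\lvert\mathcal{D}\rvert}$ | Matrix of all predictands in $\mathcal{D}$ |
| $n$ | Size of data summary |
| $\mathcal{D}_{\mathsf{syn}} \triangleq \{(\tilde{x}_i, \tilde{y}_i)\}_{i=1}^{n}$ | Data summary |
| $\mathcal{D}^{c}_{\mathsf{syn}} \triangleq \{(\tilde{x}_i, \tilde{y}_i) \mid \tilde{y}_i = c\}_{i=1}^{n}$ | Portion of $\mathcal{D}_{\mathsf{syn}}$ with class $c$ |
| $\mathbf{X}_{\mathsf{syn}} \triangleq [\tilde{x}_i]_{i=1}^{n}$ | Matrix of all features in $\mathcal{D}_{\mathsf{syn}}$ |
| $\mathbf{Y}_{\mathsf{syn}} \triangleq [\tilde{y}_i]_{i=1}^{n}$ | Matrix of all predictands in $\mathcal{D}_{\mathsf{syn}}$ |
| **Learning related** | |
| $\Phi_\theta : \mathcal{X} \mapsto \mathcal{Y}$ | Learning algorithm parameterized by $\theta$ |
| $l : \mathcal{Y} \times \mathcal{Y} \mapsto \mathbb{R}$ | Twice-differentiable cost function |
| $\mathcal{L}_{\mathcal{D}}(\theta) \triangleq \mathbb{E}_{(x,y) \sim \mathcal{D}}\,[l(\Phi_\theta(x), y)]$ | Expected loss of $\Phi$ on $\mathcal{D}$ |
| $\mathcal{L}_{\mathcal{D}_{\mathsf{syn}}}(\theta) \triangleq \mathbb{E}_{(x,y) \sim \mathcal{D}_{\mathsf{syn}}}\,[l(\Phi_\theta(x), y)]$ | Expected loss of $\Phi$ on $\mathcal{D}_{\mathsf{syn}}$ |
| $\dim(A)$ | Size of basis of $A$ |
| $\lvert A \rvert$ | Number of elements in $A$ |
| $\sup$ | Supremum |
| $\arg\min_{\theta} f(\theta)$ | Optimum value of $\theta$ which minimizes $f(\theta)$ |
| $\mathbb{E}_{x}[f(x)] \triangleq \sum_{x} p(x) \cdot f(x)$ | Expected value of $f(x)$ when domain of $x$ is discrete |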
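
As a concrete reading of the notation, the sketch below evaluates the expected losses L_D(θ) and L_Dsyn(θ) and extracts the class portion D^c on a toy dataset. It is only an illustration under stated assumptions: the linear predictor standing in for Φθ, the squared-error cost standing in for l, the toy values of D, Dsyn, and θ, and the treatment of the expectation as a uniform average over examples are all hypothetical choices made here, not something the survey prescribes.

```python
from typing import Callable, List, Sequence, Tuple

# One labelled example (x, y): a feature vector and its predictand.
Example = Tuple[Sequence[float], float]


def linear_model(theta: Sequence[float], x: Sequence[float]) -> float:
    """Hypothetical stand-in for Phi_theta: a linear predictor."""
    return sum(t * xi for t, xi in zip(theta, x))


def squared_error(prediction: float, target: float) -> float:
    """Hypothetical stand-in for the cost l; it is twice differentiable."""
    return (prediction - target) ** 2


def expected_loss(
    dataset: List[Example],
    theta: Sequence[float],
    model: Callable[[Sequence[float], Sequence[float]], float] = linear_model,
    cost: Callable[[float, float], float] = squared_error,
) -> float:
    """Empirical L_D(theta): mean of l(Phi_theta(x), y) over the dataset,
    i.e. the expectation under a uniform distribution over its examples."""
    if not dataset:
        raise ValueError("expected_loss needs a non-empty dataset")
    return sum(cost(model(theta, x), y) for x, y in dataset) / len(dataset)


def class_portion(dataset: List[Example], c: float) -> List[Example]:
    """D^c: the examples of the dataset whose predictand equals class c."""
    return [(x, y) for x, y in dataset if y == c]


# Toy target dataset D (|D| = 4) and toy data summary D_syn (n = 2).
D: List[Example] = [
    ((1.0, 0.0), 1.0),
    ((0.0, 1.0), 0.0),
    ((1.0, 1.0), 1.0),
    ((2.0, 0.0), 1.0),
]
D_syn: List[Example] = [((1.0, 0.0), 1.0), ((0.0, 1.0), 0.0)]
theta = (1.0, 0.0)

print(expected_loss(D, theta))      # L_D(theta)
print(expected_loss(D_syn, theta))  # L_Dsyn(theta)
print(len(class_portion(D, 1.0)))   # |D^c| for c = 1
```

With these toy values the loss on the summary is lower than the loss on the full dataset for the same θ, which shows that the two quantities are distinct objects and can disagree for a given parameter setting.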