Hybrid Data-Free Knowledge Distillation

Jialiang Tang1,2,3, Shuo Chen4*, Chen Gong5*
1School of Computer Science and Engineering, Nanjing University of Science and Technology, China
2Key Laboratory of Intelligent Perception and Systems for High-Dimensional Information of Ministry of Education, China
3Jiangsu Key Laboratory of Image and Video Understanding for Social Security, China
4Center for Advanced Intelligence Project, RIKEN, Japan
5Department of Automation, Institute of Image Processing and Pattern Recognition, Shanghai Jiao Tong University, China
tangjialiang@njust.edu.cn, shuo.chen.ya@riken.jp, chen.gong@sjtu.edu.cn

Abstract

Data-free knowledge distillation aims to learn a compact student network from a pre-trained large teacher network without using the original training data of the teacher network. Existing collection-based and generation-based methods train student networks by collecting massive real examples and generating synthetic examples, respectively. However, they inevitably become weak in practical scenarios due to the difficulties in gathering or emulating sufficient real-world data. To solve this problem, we propose a novel method called Hybrid Data-Free Distillation (HiDFD), which leverages only a small amount of collected data and generates sufficient examples for training student networks. Our HiDFD comprises two primary modules, i.e., teacher-guided generation and student distillation. The teacher-guided generation module guides a Generative Adversarial Network (GAN) with the teacher network to produce high-quality synthetic examples from very few real-world collected examples. Specifically, we design a feature integration mechanism to prevent the GAN from overfitting and to facilitate reliable representation learning from the teacher network. Meanwhile, we derive a category frequency smoothing technique via the teacher network to balance the generative training of each category.
In the student distillation module, we explore a data inflation strategy to properly utilize the blend of real and synthetic data to train the student network via a classifier-sharing-based feature alignment technique. Intensive experiments across multiple benchmarks demonstrate that our HiDFD achieves state-of-the-art performance using 120 times less collected data than existing methods. Code is available at https://github.com/tangjialiang97/HiDFD.

*Corresponding authors: Chen Gong, Shuo Chen.
Copyright 2025, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

Introduction

The success of Deep Neural Networks (DNNs) (He et al. 2016; Hao et al. 2024) is usually accompanied by significant computational and storage demands, which hinders their deployment on resource-limited devices. Knowledge Distillation (KD) (Hinton, Vinyals, and Dean 2015; Miles and Mikolajczyk 2024) has served as an effective compression technique that transfers knowledge from a complex pre-trained teacher network to improve the performance of a lightweight student network. However, in practice, the training data of the teacher network is usually inaccessible due to privacy concerns, and only the pre-trained teacher network itself can be used to learn the student network. This is because users may prefer sharing a pre-trained black-box DNN rather than disclosing their sensitive data. In such cases, vanilla KD methods can hardly train a reliable student network owing to the absence of the original training data. To address this issue, various Data-Free Knowledge Distillation (DFKD) approaches (Binici et al. 2022; Chen et al. 2019, 2021b; Tang et al. 2023) have been developed to train the student network without using any original data. Among existing DFKD methods, collection-based approaches (Chen et al. 2021b; Tang et al. 2023) can achieve satisfactory performance by amassing numerous real examples to train the student network.
However, it is still difficult for collection-based methods to train a reliable student network in practical tasks, e.g., medical image classification, because gathering sufficient training examples can be challenging. On the other hand, generation-based methods (Yin et al. 2020; Chen et al. 2019) leverage the teacher network to guide a generative model (Creswell et al. 2018) in producing fake examples, thereby successfully training the student network without reliance on real examples. Nevertheless, the synthesized examples may exhibit low quality in the absence of real data supervision, leading to suboptimal student performance, especially on challenging recognition tasks such as ImageNet (Deng et al. 2009). The inherent constraints of both collection-based and generation-based DFKD methods prompt an essential question: Can we train an effective generative model using only a small number of collected examples and then learn reliable student networks with hybrid data comprising both collected and synthetic examples?

To answer the above question under the practical data-free distillation scenario, we need a generative model that not only possesses powerful generative capabilities but also has the ability to acquire valuable knowledge from the teacher network. Recent studies (Cui et al. 2023; Rangwani, Mopuri, and Babu 2021) suggest that the Generative Adversarial Network (GAN) (Mirza and Osindero 2014) can easily learn from pre-trained models and then generate high-quality synthetic examples, so we employ this approach as our generative module. The standard GAN consists of a generator and a discriminator trained in an adversarial manner, where the generator attempts to produce fake examples to deceive the discriminator while the discriminator strives to distinguish between real and fake examples.

The Thirty-Ninth AAAI Conference on Artificial Intelligence (AAAI-25)
However, the collected data in practical tasks like medical image classification has two inherent characteristics that may impede the training of the GAN, namely: 1) limited data quantity, as capturing medical images requires expensive and complex equipment; and 2) imbalanced class distribution, where certain diseases (e.g., vascular lesions) are rarer than others (e.g., nevus). When trained on collected data with limited examples and an imbalanced class distribution, the discriminator is susceptible to overfitting (Huang et al. 2022; Jiang et al. 2021). That is, the discriminator tends to memorize all real examples and almost perfectly distinguish them from fake examples, causing the gradient for the generator to vanish. Moreover, the generator training is dominated by the few classes occupying the majority of examples, which prevents it from generating diverse examples. Therefore, it is critical to overcome the overfitting issue of the discriminator and the data imbalance issue of the generator when training with scarce collected examples.

In this paper, we propose a novel approach called Hybrid Data-Free Distillation (HiDFD), which learns reliable student networks on hybrid data comprising synthetic examples and very few real collected examples. Our HiDFD is composed of two pivotal modules: teacher-guided generation and student distillation. In the teacher-guided generation module, we aim to solve the critical issues in the GAN mentioned above and thus generate high-quality synthetic examples. Specifically, we propose a feature integration mechanism to aggregate the features of both the collected and synthetic examples between the teacher network and the GAN. Such an integration mechanism not only mitigates the overfitting of the discriminator, which would otherwise forcibly distinguish closely resembling examples, but also transfers valuable representations that guide the discriminator to capture category dependencies.
Meanwhile, we also develop a new technique called category frequency smoothing to alleviate the imbalanced training of the generator. In the student distillation module, we develop a data inflation operation to adjust the contribution of collected examples within the hybrid data when training the student network. Finally, we design a classifier-sharing-based strategy to closely align the features of the student network with those of the teacher network to enhance student performance. Thanks to effectively transferring knowledge from the teacher network to both the GAN and the student network, our HiDFD can successfully train reliable student networks using very few collected real-world examples. The contributions of our HiDFD are summarized as follows: 1) Considering the difficulties in gathering or emulating real-world data, we propose a novel data-free distillation method called HiDFD, which only requires a small amount of collected data to generate high-quality synthetic examples for training the student network. 2) We design a teacher-guided generation module to effectively tackle the critical issues of discriminator overfitting and imbalanced learning in generating synthetic examples, which empowers the distillation module to learn reliable student networks from the teacher network. 3) Our HiDFD achieves State-Of-The-Art (SOTA) performance using only 1/120 (5,000/600,000) of the examples required by existing collection-based DFKD methods.

Related Works

In this section, we review the relevant works on knowledge distillation and generative models.

Knowledge Distillation

Traditional KD methods (Chen et al. 2021a; Li et al. 2023) learn a compact and reliable student network by encouraging it to mimic a variety of knowledge, i.e., softened logits (Zhao et al. 2022a), intermediate features (Chen et al. 2022), and representation relationships (Peng et al. 2019), from a large teacher network using ample original training data.
However, in practical applications, these approaches might be ineffective because the original data is usually unavailable due to privacy concerns. To address this issue, generation-based (Tran et al. 2024; Wang et al. 2024a,b) and collection-based (Chen et al. 2021b; Tang et al. 2023) DFKD methods have been proposed to train student networks using synthetic and collected data, respectively. The generation-based methods utilize the teacher network to guide a generator in producing examples from statistics stored in the teacher network or from random noise. However, the resulting student network still achieves suboptimal performance due to the flawed synthetic examples. Conversely, collection-based methods assume that numerous examples are easily accessible in the real world, and they acquire an oversized collected dataset (e.g., 600,000 examples for CIFAR10) to train the student network. In practical tasks, it is hard to gather so many examples, and thus these methods still fail to train reliable student networks. In this paper, our HiDFD only utilizes a small collected dataset that contains fewer examples than the original data; it first lets the teacher guide the GAN in training on this collected data and then trains the student on adequate data composed of the synthetic and collected examples.

Generative Models

Recent advances in generative models, including Variational Autoencoders (Kingma and Welling 2013; Zhao, Song, and Ermon 2019), diffusion models (Ho, Jain, and Abbeel 2020; Mei and Patel 2023), and GANs (Hou et al. 2021; Mirza and Osindero 2014), have significantly propelled data generation. This paper focuses on the powerful GAN due to its ability to learn from pre-trained models (Cui et al. 2023; Rangwani, Mopuri, and Babu 2021). The traditional GAN (Goodfellow et al.
2020) consists of a generator and a discriminator, where the generator produces fake examples to deceive the discriminator, and the discriminator tries to accurately distinguish between real and fake examples. Recently, the Auxiliary Discriminative Classifier GAN (ADCGAN) (Hou et al. 2022) captures dependencies between generated examples and class labels by encouraging the discriminator to classify synthetic examples into specific categories, which effectively improves the quality of synthetic data. In our method, we want the GAN to produce high-quality synthetic examples that are easily classifiable, enabling the training of a precise student network. Therefore, we adopt ADCGAN as the foundational generative model. The ADCGAN is composed of a generator $N_G: \mathcal{Z} \times \mathcal{Y} \to \mathcal{X}$, which maps a noise-label pair $(z, y)$ to a fake example $N_G(z, y) \in \mathcal{X}$ that can be precisely predicted as $y \in \mathcal{Y}$, and a discriminator $N_D: \mathcal{X} \to \{0, 1\}$, which determines whether the input example is real (i.e., 1) or fake (i.e., 0) and also has a classifier $\Psi_D: \mathcal{X} \to \mathcal{Y}^+ \cup \mathcal{Y}^-$ ($y^+ \in \mathcal{Y}^+$ and $y^- \in \mathcal{Y}^-$ denote the labels for real and fake examples, respectively). Mathematically, the objective functions for the discriminator and generator in the ADCGAN are defined as $\mathcal{L}_d^{adc}$ and $\mathcal{L}_g^{adc}$, respectively:

$$\mathcal{L}_d^{adc} = \mathcal{L}_d + \mathbb{E}_{x,y \sim P_{X,Y}}[\log \Psi_D(y^+ \mid x)] + \mathbb{E}_{x,y \sim Q_{X,Y}}[\log \Psi_D(y^- \mid x)],$$
$$\mathcal{L}_g^{adc} = \mathcal{L}_g - \mathbb{E}_{x,y \sim Q_{X,Y}}[\log \Psi_D(y^+ \mid x)] + \mathbb{E}_{x,y \sim Q_{X,Y}}[\log \Psi_D(y^- \mid x)], \tag{1}$$

where $\mathcal{L}_d = \mathbb{E}_{x \sim P_X}[\log N_D(x)] + \mathbb{E}_{x \sim Q_X}[\log(1 - N_D(x))]$ and $\mathcal{L}_g = \mathbb{E}_{x \sim Q_X}[\log(1 - N_D(x))]$ are the loss functions of the standard GAN, and $P$ and $Q$ denote the distributions of the real collected data and the fake synthetic data, respectively. $\Psi_D(y^+ \mid \cdot)$ (resp. $\Psi_D(y^- \mid \cdot)$) denotes the probability that the input example is simultaneously classified as label $y$ and as real (resp. fake) by the classifier of the discriminator. Formally,

$$\Psi_D(y^+ \mid x) = \frac{\exp\!\big(\Psi_D^+(y)^\top \Phi_D(x)\big)}{\sum_{\tilde{y} \in \mathcal{Y}^+} \exp\!\big(\Psi_D^+(\tilde{y})^\top \Phi_D(x)\big) + \sum_{\tilde{y} \in \mathcal{Y}^-} \exp\!\big(\Psi_D^-(\tilde{y})^\top \Phi_D(x)\big)},$$

where $\Phi_D$ represents the feature extractor shared between the original discriminator $N_D$ and the classifier $\Psi_D$, and $\Psi_D^+$ (resp.
$\Psi_D^-$) captures the dependencies between the category labels and the real (resp. fake) data. Notably, DeGAN (Addepalli et al. 2020) also trains a GAN using collected data, but it still requires a large number of collected examples and can only utilize synthetic examples to train the target model.

Data-free distillation aims to train a compact student network $N_S$ using a pre-trained teacher network $N_T$ without accessing the teacher's original training data $D_o$. Both $N_T$ and $N_S$ consist of a feature extractor $\Phi$ and a classifier $\Psi$, where the subscripts $T$ and $S$ indicate "teacher" and "student", respectively. Existing collection-based DFKD methods (Tang et al. 2023) usually rely on collected data $D_c$ with an overwhelming number of examples searched based on the categories of the original data. Here, the data amount $|D_c| \gg |D_o|$, which is hard to satisfy in practical tasks. To overcome this limitation, we propose a more practical method that only requires a small number of collected examples for DFKD, i.e., $|D_c| \ll |D_o|$. To this end, we develop a hybrid framework that generates abundant synthetic examples from the very few collected examples and then integrates them into hybrid data for training a reliable student network.

Motivation of the Hybrid Learning

Formally, we denote the distributions of the collected data $D_c$ and the synthetic data $D_s$ as $P$ and $Q$, respectively, while the distribution of the hybrid data $D = D_c \cup D_s$ is represented as $U = \alpha P + (1 - \alpha) Q$. Here $\alpha = |D_c| / (|D_c| + |D_s|)$ represents the proportion of collected examples in the hybrid data. In general, synthetic and collected examples usually exhibit a significant distribution gap. This can cause substantial fluctuations during the training of the student network on the hybrid data, ultimately leading to poor performance (Wang, Zhang, and Wang 2024). Therefore, it is essential to align the distribution of the synthetic data with that of the collected data, thereby forming reliable hybrid data.
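To make the mixture concrete, the hybrid distribution $U = \alpha P + (1-\alpha)Q$ and the total variation gaps analyzed in the derivation that follows can be checked numerically. The sketch below uses small discrete toy distributions; all names and values are illustrative and not from the paper:

```python
import random

def tvd(p, q):
    """Total Variation Distance between two discrete distributions."""
    return 0.5 * sum(abs(pi - qi) for pi, qi in zip(p, q))

def random_dist(k, rng):
    """A random probability vector of length k."""
    w = [rng.random() for _ in range(k)]
    s = sum(w)
    return [x / s for x in w]

rng = random.Random(0)
for _ in range(1000):
    p = random_dist(6, rng)           # collected-data distribution P
    q = random_dist(6, rng)           # synthetic-data distribution Q
    a = rng.random()                  # mix proportion alpha
    u = [a * pi + (1 - a) * qi for pi, qi in zip(p, q)]  # hybrid U
    # Eq. (4) holds exactly: TVD(U, P) = (1 - alpha) * TVD(P, Q).
    assert abs(tvd(u, p) - (1 - a) * tvd(p, q)) < 1e-12
    # Eq. (5) is a (loose) bound: TVD(U, Q) <= (2 - alpha) * TVD(P, Q).
    assert tvd(u, q) <= (2 - a) * tvd(p, q) + 1e-12
```

The check confirms that shrinking $\mathrm{TVD}(P, Q)$ (better synthetic data) or raising $\alpha$ (more weight on collected data) tightens the gap.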
Here, the synthetic data is generated under the supervision of the collected data, so we assume that the synthetic, collected, and hybrid data share the same support set $\mathcal{X}$. Then, the distribution gap between the hybrid data and the synthetic data can be characterized by the Total Variation Distance (TVD), defined as

$$\mathrm{TVD}(U, Q) = \frac{1}{2} \sum_{x \in \mathcal{X}} |U(x) - Q(x)|, \tag{2}$$

where $U(x) \in (0, 1)$ and $Q(x) \in (0, 1)$ measure the probability of $x$ under the hybrid and synthetic data distributions, respectively. Here $\mathrm{TVD}(\cdot, \cdot) \geq 0$, and $\mathrm{TVD}(U, Q) = \frac{1}{2} \sum_{x \in \mathcal{X}} |U(x) - Q(x)| \leq \frac{1}{2} \sum_{x \in \mathcal{X}} (|U(x)| + |Q(x)|) = 1$. Based on the triangle inequality (Steerneman 1983) of the TVD, we easily have

$$\mathrm{TVD}(U, Q) \leq \mathrm{TVD}(U, P) + \mathrm{TVD}(P, Q). \tag{3}$$

Then, given that $U = \alpha P + (1 - \alpha) Q$ with the parameter $\alpha$ controlling the weight of the collected data, we can compute $\mathrm{TVD}(U, P)$ as

$$\mathrm{TVD}(U, P) = \frac{1}{2} \sum_{x \in \mathcal{X}} |U(x) - P(x)| = \frac{1}{2} \sum_{x \in \mathcal{X}} |\alpha P(x) + (1 - \alpha) Q(x) - P(x)| = \frac{1 - \alpha}{2} \sum_{x \in \mathcal{X}} |Q(x) - P(x)| = (1 - \alpha)\,\mathrm{TVD}(Q, P). \tag{4}$$

By invoking the symmetry of the TVD and Eq. (3), we obtain

$$\mathrm{TVD}(U, Q) \leq (2 - \alpha)\,\mathrm{TVD}(P, Q). \tag{5}$$

Eq. (5) reveals that the quality of the synthetic data $D_s$ (which governs $\mathrm{TVD}(P, Q)$) and the mix proportion $\alpha$ are two critical factors influencing the distribution gap $\mathrm{TVD}(U, Q)$. This observation inspires us to employ two modules to align the distribution of the synthetic data with that of the collected data, as shown in Fig. 1(c). In the teacher-guided generation module, we employ the teacher network to guide the GAN to enhance the quality of the synthetic data, which solves its intrinsic issues when trained on the small and imbalanced collected data, namely the overfitting of the discriminator and the imbalanced learning of the generator:

Figure 1: The diagram of (a) generation-based methods (Fang et al. 2021; Yin et al. 2020; Chen et al. 2019; Micaelli and Storkey 2019), (b) collection-based methods (Chen et al. 2021b; Tang et al. 2023), and (c) our HiDFD. In HiDFD, the teacher-guided generation module employs the teacher network to guide the training of the GAN on limited collected data.
Subsequently, the student distillation module closely aligns the features of the student network with those of the teacher network on the hybrid data comprising high-quality synthetic examples and properly inflated collected examples.

Discriminator Overfitting. When trained with very few collected examples, the discriminator is prone to become overconfident in identifying fake examples, i.e., $\mathbb{E}_{x \sim Q_X}[N_D(x)]$ tends to 0. As a result, the gradient of $\mathcal{L}_g$ in Eq. (1), which is responsible for pushing the generator to produce high-quality examples, may become ineffective, namely

$$\nabla_{N_G}\,\mathbb{E}_{x \sim Q_X}[\log(1 - N_D(x))] = \mathbb{E}_{x \sim Q_X}\!\left[\nabla_{N_G} \log(1 - N_D(x))\right] \to 0, \tag{6}$$

as the parameters of $N_D$ and $N_G$ are independent of each other; Eq. (6) is proved by (Arjovsky and Bottou 2022). Meanwhile, the discriminator also has a classifier that provides valuable category dependencies for the generator by precisely predicting input examples, thereby promoting the generator to produce classifiable examples. However, multi-class classification is more challenging than the binary discrimination of real and fake. Given very few collected examples, it is difficult for the discriminator to learn representations powerful enough for its classifier to achieve precise classification.

Imbalanced Generator Learning. Given the optimal classifier $\Psi^*_D$ of the discriminator (where $\Psi^*_D(y^+ \mid x) = \frac{p(x,y)}{p(x)+q(x)}$ and $\Psi^*_D(y^- \mid x) = \frac{q(x,y)}{p(x)+q(x)}$; see Appendix), optimizing the generator to produce classifiable examples, i.e., $\max_{N_G}\big[\mathbb{E}_{x,y \sim Q_{X,Y}}[\log \Psi^*_D(y^+ \mid x)] - \mathbb{E}_{x,y \sim Q_{X,Y}}[\log \Psi^*_D(y^- \mid x)]\big]$, is equivalent to

$$\max_{N_G}\,\mathbb{E}_{x,y \sim Q_{X,Y}}\!\left[\log \frac{p(x, y)}{q(x, y)}\right] \;\Longleftrightarrow\; \min_{N_G}\,\mathrm{KL}(Q_{X,Y} \,\|\, P_{X,Y}), \tag{7}$$

where KL represents the Kullback-Leibler divergence; the proof of Eq. (7) is provided in the Appendix. Eq. (7) indicates that optimizing the generator forces the joint distribution $Q_{X,Y}$ of the synthetic data toward $P_{X,Y}$ of the imbalanced collected data, inevitably resulting in synthetic examples with poor diversity.
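The equivalence in Eq. (7) rests on the identity $\mathbb{E}_{Q}[\log(p/q)] = -\mathrm{KL}(Q \,\|\, P)$, which can be verified numerically on toy joint distributions (the four-cell distributions below are purely illustrative):

```python
import math

# Toy joint distributions over 4 (example, label) cells.
p = [0.4, 0.3, 0.2, 0.1]        # imbalanced "collected" joint P_{X,Y}
q = [0.25, 0.25, 0.25, 0.25]    # "synthetic" joint Q_{X,Y}

# The generator's classification objective (left side of Eq. (7)).
objective = sum(qi * math.log(pi / qi) for pi, qi in zip(p, q))

# KL(Q || P): the quantity the generator implicitly minimizes.
kl = sum(qi * math.log(qi / pi) for pi, qi in zip(p, q))

# objective == -KL(Q || P): raising the objective drags Q toward the
# imbalanced P, which is exactly the diversity problem described above.
```

Since the two quantities differ only in sign, maximizing the objective and minimizing the KL divergence are the same optimization.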
In student distillation, we properly inflate the collected examples to construct the hybrid data with a moderate mix proportion $\alpha$ for effectively training the student network.

Teacher-Guided Generation

In this section, we enable the GAN to generate high-quality examples by solving its critical issues under the guidance of the teacher network. To mitigate discriminator overfitting, we design a feature integration mechanism that forces the aggregation of the features of both real collected examples and fake synthetic examples. Specifically, we blur the boundary between real and fake examples to make it harder for the discriminator to discriminate them accurately, thus preventing the discriminator from becoming overconfident:

$$\mathcal{L}_{blend} = \mathbb{E}_{x,y \sim P_{X,Y},\, \hat{x},\hat{y} \sim Q_{X,Y}}\!\left[\mathbb{I}(p > q)\left(\|\Phi_T(x) - \Phi_D(\hat{x})\|_2 + \|\Phi_T(\hat{x}) - \Phi_D(x)\|_2\right)\right], \tag{8}$$

where $\mathbb{I}(p > q)$ is an indicator function that controls the stochastic application of $\mathcal{L}_{blend}$: its value is 1 if $p > q$ and 0 otherwise ($p$ is sampled uniformly from $[0, 1]$, and $q = 0.7$ is analyzed in the Appendix). Meanwhile, we transfer the expressive features of the teacher network to enhance the representation ability of the discriminator:

$$\mathcal{L}_{trans} = \mathbb{E}_{x,y \sim P_{X,Y},\, \hat{x},\hat{y} \sim Q_{X,Y}}\!\left[\|\Phi_T(x) - \Phi_D(x)\|_2 + \|\Phi_T(\hat{x}) - \Phi_D(\hat{x})\|_2\right]. \tag{9}$$

To alleviate the imbalanced learning of the generator, we regularize the GAN training across all categories. During generator training, we dynamically update the class frequencies $\{\bar{n}^t_c\}_{c=1}^C$ ($C$ represents the number of categories) at the beginning of iteration $t$ via the following exponential moving average with a weight $\gamma \in [0, 1]$:

$$\bar{n}^t_c = (1 - \gamma)\,\bar{n}^{t-1}_c + \gamma\, n^{t-1}_c, \tag{10}$$

where $n^{t-1}_c$ is the number of synthetic examples belonging to class $c$ in iteration $t-1$, $\bar{n}^0_c$ is initially set to a constant, and $\gamma = 0.5$ (analyzed in the Appendix). Then, each class frequency is normalized as

$$\hat{n}^t_c = \frac{\bar{n}^t_c}{\sum_{j=1}^{C} \bar{n}^t_j}. \tag{11}$$
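Eqs. (10) and (11) amount to an exponential moving average followed by normalization. A minimal sketch, with illustrative counts; the split between smoothed ($\bar{n}$) and raw ($n$) counts follows our reading of the (garbled) Eq. (10):

```python
def ema_update(freqs, counts, gamma=0.5):
    """Eq. (10): exponential moving average of per-class synthetic counts.
    freqs:  smoothed frequencies from iteration t-1 (bar n);
    counts: raw class counts observed at iteration t-1 (n)."""
    return [(1 - gamma) * f + gamma * n for f, n in zip(freqs, counts)]

def normalize(freqs):
    """Eq. (11): normalize smoothed frequencies so they sum to 1."""
    s = sum(freqs)
    return [f / s for f in freqs]

freqs = [1.0, 1.0, 1.0]            # bar n_c^0 initialised to a constant
counts = [8, 1, 1]                 # an imbalanced synthetic batch
freqs = ema_update(freqs, counts)  # -> [4.5, 1.0, 1.0]
hat = normalize(freqs)             # hat n_c^1, feeds the 1/hat_n weights
```

The smoothing keeps a transiently over-generated class from dominating the regularization weights in a single iteration.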
Thereafter, the generator is regularized to produce balanced examples by minimizing the loss

$$\mathcal{L}_{reg} = \sum_{c=1}^{C} \frac{p^c_T \log p^c_T}{\hat{n}^t_c}, \tag{12}$$

where $p_T = \mathbb{E}_{x,y \sim Q_{X,Y}}[\mathrm{SoftMax}(N_T(x))]$ is the average softmax vector output by the teacher network. The teacher network is well trained on the original data, so it can precisely predict synthetic examples. In this case, $p^c_T$ can be regarded as the proportion of examples of category $c$ within the synthetic data. In Eq. (12), the generation of examples in a category $c$ with a lower (or higher) $p^c_T$ is adjusted by a larger (or smaller) $1/\hat{n}^t_c$. The loss functions of the discriminator and generator in our teacher-guided GAN are summarized as

$$\mathcal{L}_D = \mathcal{L}_d^{adc} + \lambda_d(\mathcal{L}_{blend} + \mathcal{L}_{trans}), \qquad \mathcal{L}_G = \mathcal{L}_g^{adc} + \lambda_g \mathcal{L}_{reg}, \tag{13}$$

where $\mathcal{L}_d^{adc}$ and $\mathcal{L}_g^{adc}$ are defined in Eq. (1), and the trade-off parameters satisfy $\lambda_d > 0$ and $\lambda_g > 0$.

Student Distillation

In the teacher-guided generation module, we trained an effective GAN for generating high-quality synthetic examples, which are then combined with the collected examples to construct the hybrid data $D$ for training the student network. However, directly combining the limited collected examples with numerous synthetic examples results in a small mix proportion $\alpha$ (i.e., a large distribution gap $\mathrm{TVD}(U, Q)$) that disturbs the training of the student network. Therefore, we inflate the collected data by repeating examples, enlarging $\alpha$ from $|D_c| / (|D_c| + |D_s|)$ to $N|D_c| / (N|D_c| + |D_s|)$, where $N$ is the inflation factor. We adopt a moderate inflation factor of $N = \lfloor |D_s| / |D_c| \rfloor$; further details are available in Extended Experiments. Recent works (Chen et al. 2021b; Tang et al. 2023) indicate that collected data usually contains many noisy examples, which may mislead the GAN into producing undesired synthetic examples with wrong labels. As a result, these potentially noisy examples can harm the training of the student network, particularly its classifier.
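The effect of the inflation factor on the mix proportion can be sketched directly. The sizes below reuse the paper's 5,000/600,000 example counts for illustration, and `mix_proportion` is a hypothetical helper, not code from the paper:

```python
def mix_proportion(n_collected, n_synthetic, inflate=True):
    """Mix proportion alpha of collected examples in the hybrid data.
    With the inflation factor N = floor(|Ds| / |Dc|), each collected
    example is repeated N times before mixing."""
    n = n_synthetic // n_collected if inflate else 1
    return (n * n_collected) / (n * n_collected + n_synthetic)

alpha_raw = mix_proportion(5_000, 600_000, inflate=False)  # ~0.0083
alpha_inf = mix_proportion(5_000, 600_000)                 # N=120 -> 0.5
```

Without inflation the collected data is all but invisible in the mixture; with $N = 120$ the two sources contribute equally, which is the moderate proportion the method targets.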
In DFKD, the teacher network is well trained on the original data and possesses an accurate classifier. Recent studies (Tang et al. 2023; Chen et al. 2022) show that the teacher's classifier contains useful category information about the original data. Therefore, we share the classifier of the teacher network with the student network. Then, we closely align the features of the student network with those of the teacher network as follows:

$$\mathcal{L}_{align} = \mathbb{E}_{x \sim D}\left[\|\Phi_S(x) - \Phi_T(x)\|_2\right]. \tag{14}$$

By minimizing $\mathcal{L}_{align}$, the features of the student network are closely aligned with those of the teacher network, and feeding the aligned features into the shared classifier produces predictions as accurate as those of the teacher network. The student network does not use any example labels during training, thereby avoiding the negative impact of potentially noisy labels. The whole algorithm of our proposed HiDFD is given in the Appendix.

Experiments

Datasets and Implementation Details

Original Datasets. We evaluate the effectiveness of our HiDFD on popular datasets, including CIFAR (Krizhevsky 2009), CINIC (Darlow et al. 2018), and Tiny ImageNet (Le and Yang 2015), which are widely used by existing DFKD methods (Chen et al. 2019, 2021b). Additionally, we conduct experiments on the large-scale ImageNet (Deng et al. 2009) and the practical medical image dataset HAM (Tschandl, Rosendahl, and Kittler 2018), which are challenging for existing DFKD methods.

Collected Datasets. When using CIFAR and CINIC as the original datasets, we search for examples from ImageNet. With Tiny ImageNet and ImageNet as the original datasets, we utilize WebVision (Li et al. 2017) as our source of collected data. Moreover, we collect examples from ISIC (Codella et al. 2018) when using HAM as the original dataset. We follow (Chen et al. 2021b) and sample a portion of examples from the corresponding dataset as the collected data $D_c$.
Here, we define the ratio between the collected data $D_c$ and the original data $D_o$ as $\rho = |D_c| / |D_o|$. We construct small ($\rho = 0.1$) and moderate ($\rho = 1.0$) collected datasets for the experiments on collection-based DFKD methods. Notably, the original dataset is solely required for pre-training the teacher network. Detailed information regarding these datasets and the corresponding synthesized examples is provided in the Appendix.

Implementation Details. All student networks in our HiDFD employ SGD with a weight decay of $5 \times 10^{-4}$ and a momentum of 0.9 as the optimizer. The student networks are trained for 240 epochs with a learning rate of 0.05, which is divided by 10 at the 150th, 180th, and 210th epochs. Meanwhile, the generator and discriminator of the GAN use Adam for optimization with learning rates of $1 \times 10^{-4}$ and $4 \times 10^{-4}$, respectively, and both are trained for 500 epochs. Additionally, the hyper-parameters in Eq. (13) are configured as $\lambda_d = 0.1$ and $\lambda_g = 0.1$.

| Dataset | Pair | ACC_T | ACC_S | DAFL | DDAD | DI | CMI | SSNet | DeGAN ρ=0.1 | DeGAN ρ=1.0 | DFND ρ=0.1 | DFND ρ=1.0 | KD3 ρ=0.1 | KD3 ρ=1.0 | HiDFD ρ=0.1 | HiDFD ρ=1.0 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| CIFAR10 | ResNet34→ResNet18 | 95.70 | 95.20 | 92.22 | 93.08 | 93.26 | 94.84 | 95.39 | 90.39 | 91.95 | 48.82 | 85.82 | 65.70 | 93.37 | 94.74 | 95.11 |
| CIFAR10 | VGG16→VGG13 | 94.07 | 92.69 | 86.92 | 90.85 | 85.27 | 88.49 | 92.00 | 87.52 | 90.37 | 48.65 | 89.22 | 48.93 | 91.49 | 92.28 | 93.14 |
| CIFAR10 | ResNet34→VGG13 | 95.70 | 92.69 | 83.36 | 89.76 | 90.24 | 86.63 | 92.03 | 86.40 | 89.69 | 49.48 | 90.60 | 65.10 | 93.05 | 92.90 | 93.76 |
| CIFAR100 | ResNet34→ResNet18 | 78.05 | 77.10 | 74.47 | 73.64 | 61.32 | 77.04 | 77.41 | 53.20 | 62.94 | 21.45 | 64.73 | 26.96 | 72.90 | 76.93 | 78.35 |
| CIFAR100 | VGG16→VGG13 | 74.53 | 72.28 | 65.36 | 68.33 | 60.00 | 59.70 | 71.16 | 53.97 | 61.80 | 23.48 | 63.90 | 21.27 | 71.44 | 71.26 | 74.18 |
| CIFAR100 | ResNet34→VGG13 | 78.05 | 72.28 | 45.28 | 68.59 | 61.07 | 61.80 | 72.38 | 46.82 | 56.44 | 23.86 | 64.54 | 25.25 | 72.46 | 73.44 | 75.65 |
| CINIC | ResNet34→ResNet18 | 86.62 | 85.09 | 60.54 | 80.10 | 78.57 | 78.47 | 83.47 | 57.59 | 76.78 | 24.53 | 80.94 | 39.35 | 82.68 | 85.62 | 86.68 |
| CINIC | VGG16→VGG13 | 84.22 | 83.28 | 59.08 | 77.90 | 68.90 | 74.99 | 79.63 | 54.36 | 76.11 | 29.53 | 77.41 | 29.88 | 78.18 | 81.92 | 82.27 |
| CINIC | ResNet34→VGG13 | 86.62 | 83.28 | 44.62 | 77.63 | 59.52 | 75.46 | 80.30 | 54.43 | 74.40 | 33.40 | 79.33 | 71.57 | 80.28 | 81.90 | 82.88 |
| Tiny ImageNet | ResNet34→ResNet18 | 66.44 | 64.87 | 52.20 | 59.84 | 6.98 | 64.01 | 64.04 | 25.74 | 49.11 | 26.36 | 60.09 | 20.26 | 63.63 | 65.96 | 66.61 |
| Tiny ImageNet | VGG16→VGG13 | 62.34 | 61.55 | 53.89 | 42.25 | 1.22 | 17.73 | 57.82 | 23.13 | 44.65 | 25.39 | 58.47 | 24.26 | 61.06 | 60.46 | 62.69 |
| Tiny ImageNet | ResNet34→VGG13 | 66.44 | 61.55 | 52.46 | 44.20 | 2.27 | 20.57 | 59.16 | 21.09 | 48.12 | 25.53 | 58.18 | 27.37 | 61.98 | 61.67 | 65.27 |
| HAM | – | 81.18 | 79.64 | 32.05 | 44.68 | 62.79 | 67.34 | 74.52 | 34.75 | 64.43 | 27.55 | 62.59 | 64.10 | 68.44 | 77.08 | 81.52 |
| ImageNet | – | 73.27 | 67.00 | 1.92 | 1.46 | 1.14 | 1.84 | 5.74 | 22.28 | 43.96 | 28.99 | 45.66 | 35.02 | 55.05 | 65.36 | 66.89 |

Table 1: Accuracies (in %) of student networks trained by various methods on six image classification datasets. The columns ACC_T and ACC_S report the accuracies yielded by the teacher network and by a student network trained on the full original data, respectively. The Pair column lists the teacher-student architectures.

Experiments on Benchmark Datasets

In this section, we conduct comprehensive experiments on various benchmark datasets to evaluate our proposed HiDFD against SOTA generation-based (Chen et al. 2019; Zhao et al. 2022b; Yin et al. 2020; Binici et al. 2022; Fang et al. 2021; Yu et al. 2023) and collection-based (Addepalli et al. 2020; Chen et al. 2021b; Tang et al. 2023) DFKD methods. These methods are reproduced using their official source codes. Tab. 1 reports the results of the compared methods and our proposed HiDFD. Firstly, our HiDFD using only a small quantity of collected examples ($\rho = 0.1$) achieves performance comparable to students trained on the full original data. Secondly, when trained on the modestly sized collected data ($\rho = 1.0$), our HiDFD significantly outperforms the compared methods on most tasks, especially on the challenging HAM and ImageNet.
Thirdly, the generation-based methods, which utilize generative models to produce training examples without the supervision of real examples, tend to perform unsatisfactorily due to the deficiencies in their synthetic examples. These results demonstrate that our HiDFD can train robust student networks by effectively generating training examples from limited real-world examples and properly utilizing all realistic examples.

Ablation Studies & Parametric Sensitivities

In this section, we evaluate the effectiveness of our method with small collected data ($\rho = 0.1$), where CIFAR and ImageNet serve as the original and collected datasets, respectively. Moreover, ResNet34 and ResNet18 are used as the teacher and student networks, respectively.

Ablation Studies. We evaluate three key operations ($\mathcal{L}_{blend}$, $\mathcal{L}_{trans}$, and $\mathcal{L}_{reg}$) in the teacher-guided generation and the classifier-sharing-based strategy in the student distillation. The experimental results are reported in Tab. 2, and the contributions of these components are analyzed as follows:

1) Teacher-Guided Generation. The feature blending $\mathcal{L}_{blend}$ in Eq. (8) and the feature transferring $\mathcal{L}_{trans}$ in Eq. (9) are designed to prevent the overfitting of the discriminator and to enhance its representation ability. Meanwhile, the generator regularization $\mathcal{L}_{reg}$ in Eq. (12) is essential for maintaining the balanced training of the generator. Therefore, omitting any of these components leads to a noticeable reduction in the performance of the student network. In particular, training the student network only on synthetic examples without any guidance from the teacher network results in the poorest performance (the row "w/o $\mathcal{L}_{blend}$, $\mathcal{L}_{trans}$, $\mathcal{L}_{reg}$"). These results indicate the importance of these operations for robust GAN training with limited collected examples, thereby generating high-quality examples for training reliable student networks.

2) Student Distillation. We examine the impact of replacing the classifier-sharing-based feature alignment with traditional KD methods (Hinton, Vinyals, and Dean 2015; Chen et al. 2021a). All student networks are trained on the hybrid data composed of collected and synthetic examples. We find that the student networks trained by these methods generally achieve suboptimal performance due to their inability to effectively handle the potentially noisy examples among the hybrid data. These results highlight the suitability of our training strategy for learning reliable student networks in data-free distillation scenarios.

| Type | Algorithm | CIFAR10 | CIFAR100 |
|---|---|---|---|
| Teacher-Guided Generation | w/o L_blend | 92.87 (↓1.87) | 74.18 (↓2.75) |
| | w/o L_trans | 91.86 (↓2.88) | 74.40 (↓2.53) |
| | w/o L_reg | 92.76 (↓1.98) | 73.95 (↓2.98) |
| | w/o L_blend, L_trans | 91.10 (↓3.64) | 71.02 (↓5.91) |
| | w/o L_blend, L_reg | 90.77 (↓3.97) | 71.83 (↓5.10) |
| | w/o L_trans, L_reg | 91.42 (↓3.32) | 72.10 (↓4.83) |
| | w/o L_blend, L_trans, L_reg | 89.55 (↓5.19) | 70.25 (↓6.68) |
| Student Distillation | OFAKD | 92.88 (↓1.86) | 70.86 (↓6.07) |
| | VKD | 92.69 (↓2.05) | 66.96 (↓9.97) |
| | SemCKD | 93.49 (↓1.25) | 70.93 (↓6.00) |
| | CC | 92.63 (↓2.11) | 69.52 (↓7.41) |
| | DKD | 92.95 (↓1.79) | 68.25 (↓8.68) |
| | RKD | 92.40 (↓2.34) | 70.53 (↓6.40) |
| | CAT-KD | 92.49 (↓2.25) | 68.69 (↓8.24) |
| | NKD | 93.26 (↓1.48) | 65.31 (↓11.62) |
| | HiDFD (ours) | 94.74 | 76.93 |

Table 2: Accuracies (in %) of ablation studies.

Parametric Sensitivity. There are two tuning parameters in our HiDFD, namely $\lambda_d$ and $\lambda_g$ in Eq. (13). To analyze their sensitivities, we individually vary each parameter while keeping the other constant during training. The accuracies of the corresponding student networks are shown in Fig. 2(a) and Fig. 2(b).

Figure 2: Parametric sensitivities of (a) $\lambda_d$ and (b) $\lambda_g$ in Eq. (13). Accuracies (in %) of the student networks trained with collected data with (c) varying inflation factors and (d) various data quantities.
Despite the large fluctuations in these parameters, where λd, λg ∈ {0.001, 0.01, 0.1, 1, 10}, the accuracy curve of the student network remains relatively stable. These results indicate the robustness of our HiDFD against parameter variations. Additionally, the student network achieves the best performance when λd = λg = 0.1, so we adopt this parameter configuration in our method.

Extended Experiments

Experiments with Various Backbones. We evaluate our HiDFD across several widely used teacher-student pairs to assess its adaptability to different networks. The results are shown in Tab. 3; we can observe that our HiDFD consistently achieves satisfactory performance across different teacher-student pairs, where the trained students perform comparably to those trained on the original data.

Experiments with Varying Inflation Factors. We report the accuracies of the student networks trained on collected data with various inflation factors in Fig. 2(c). The student network performs better with increasing N, and the best accuracy is observed at N=10. Furthermore, excessive inflation may reduce the diversity brought by the synthetic data, so the student network encounters performance degradation when N>10. Therefore, we adopt a moderate inflation factor of N = |Ds|/|Dc|. These experiments demonstrate that appropriately inflating the collected examples, which is crucial for reducing the distribution gap between synthetic and collected data, can effectively improve the performance of the student network.

Experiments on Collected Data with Various Data Quantities. We explore the impact of varying the volume of collected data on the performance of student networks, with ρ values ranging from 0.1 to 1. As shown in Fig. 2(d), student networks trained by the compared collection-based DFKD methods (Chen et al. 2021b; Tang et al.
2023) tend to underperform with small values of ρ. Conversely, our HiDFD consistently achieves satisfactory performance across a spectrum of ρ values. These results further demonstrate the effectiveness of our HiDFD in training reliable student networks with limited collected data.

| Dataset | Teacher | Student | ACCS | HiDFD |
|---|---|---|---|---|
| CIFAR10 | ResNet32×4 | ResNet110 | 93.37 | 95.04 |
| CIFAR10 | ResNet32×4 | ShuffleNet | 93.23 | 93.62 |
| CIFAR10 | ResNet110×2 | ResNet116 | 93.21 | 94.83 |
| CIFAR10 | ResNet110×2 | WRN40-2 | 94.86 | 95.35 |
| CIFAR100 | ResNet32×4 | ResNet110 | 74.31 | 75.69 |
| CIFAR100 | ResNet32×4 | ShuffleNet | 72.60 | 75.03 |
| CIFAR100 | ResNet110×2 | ResNet116 | 74.46 | 74.49 |
| CIFAR100 | ResNet110×2 | WRN40-2 | 76.31 | 75.65 |

Table 3: Accuracies (in %) of various networks (ρ=1.0).

Conclusion

In this paper, we proposed a new data-free distillation approach termed HiDFD to train student networks on hybrid data comprising high-quality synthetic examples and scarce collected examples, which well meets practical requirements. Our investigation reveals that bridging the distribution gap between the hybrid and synthetic data is crucial for training reliable student networks, implying that the quality of synthetic data and the weight of collected data are two key factors in reducing this gap. This observation inspired our hybrid distillation framework, in which the teacher-guided generation module effectively generates high-quality synthetic examples from the limited collected data by leveraging the teacher network to guide the GAN training process, and the student distillation module properly enhances the influence of the collected examples within the hybrid data by inflating their frequency. Consequently, we can naturally define a classifier-sharing-based feature alignment to distill the student network, achieving state-of-the-art performance using significantly fewer examples than existing methods. The limitations and broader impacts of our HiDFD are discussed in the Appendix.
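The frequency-inflation idea named above, repeating the scarce collected set so that real examples appear as often as synthetic ones with N = |Ds|/|Dc|, can be sketched as follows; the uniform mixing, the naming, and the function itself are illustrative assumptions, not the authors' code.

```python
import random

# Hypothetical sketch of the data inflation strategy: the scarce
# collected set D_c is repeated N = |D_s| // |D_c| times before being
# mixed with the synthetic set D_s, so collected examples are not
# drowned out during student training.

def inflate_and_mix(collected, synthetic, seed=0):
    """Repeat the collected set N times, mix with synthetic data,
    and shuffle deterministically. Returns (hybrid_set, N)."""
    n = max(1, len(synthetic) // len(collected))  # inflation factor N
    hybrid = collected * n + synthetic
    random.Random(seed).shuffle(hybrid)
    return hybrid, n

# Toy example: 3 collected vs. 30 synthetic examples gives N = 10,
# so each real example appears ten times in the hybrid set.
collected = ["real_%d" % i for i in range(3)]
synthetic = ["fake_%d" % i for i in range(30)]
hybrid, n = inflate_and_mix(collected, synthetic)
```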
Acknowledgments

This research is supported by NSF of China (Nos: 62336003, 12371510), and NSF for Distinguished Young Scholar of Jiangsu Province (No: BK20220080).

References

Addepalli, S.; Nayak, G. K.; Chakraborty, A.; and Radhakrishnan, V. B. 2020. DeGAN: Data-enriching GAN for retrieving representative samples from a trained classifier. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, 3130–3137.
Arjovsky, M.; and Bottou, L. 2022. Towards principled methods for training generative adversarial networks. In International Conference on Learning Representations (ICLR).
Binici, K.; Aggarwal, S.; Pham, N. T.; Leman, K.; and Mitra, T. 2022. Robust and resource-efficient data-free knowledge distillation by generative pseudo replay. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 36, 6089–6096.
Chen, D.; Mei, J.-P.; Zhang, H.; Wang, C.; Feng, Y.; and Chen, C. 2022. Knowledge distillation with the reused teacher classifier. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 11933–11942.
Chen, D.; Mei, J.-P.; Zhang, Y.; Wang, C.; Wang, Z.; Feng, Y.; and Chen, C. 2021a. Cross-layer distillation with semantic calibration. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, 7028–7036.
Chen, H.; Guo, T.; Xu, C.; Li, W.; Xu, C.; Xu, C.; and Wang, Y. 2021b. Learning student networks in the wild. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 6428–6437.
Chen, H.; Wang, Y.; Xu, C.; Yang, Z.; Liu, C.; Shi, B.; Xu, C.; Xu, C.; and Tian, Q. 2019. Data-free learning of student networks. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 3514–3522.
Codella, N. C.; Gutman, D.; Celebi, M. E.; Helba, B.; Marchetti, M. A.; Dusza, S. W.; Kalloo, A.; Liopyris, K.; Mishra, N.; Kittler, H.; et al. 2018.
Skin lesion analysis toward melanoma detection: A challenge at the 2017 international symposium on biomedical imaging, hosted by the international skin imaging collaboration. In 15th IEEE International Symposium on Biomedical Imaging (ISBI), 168–172.
Creswell, A.; White, T.; Dumoulin, V.; Arulkumaran, K.; Sengupta, B.; and Bharath, A. A. 2018. Generative adversarial networks: An overview. IEEE Signal Processing Magazine (SPM), 35(1): 53–65.
Cui, K.; Yu, Y.; Zhan, F.; Liao, S.; Lu, S.; and Xing, E. P. 2023. KD-DLGAN: Data limited image generation via knowledge distillation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 3872–3882.
Darlow, L. N.; Crowley, E. J.; Antoniou, A.; and Storkey, A. J. 2018. CINIC-10 is not ImageNet or CIFAR-10. arXiv preprint arXiv:1810.03505.
Deng, J.; Dong, W.; Socher, R.; Li, L.-J.; Li, K.; and Fei-Fei, L. 2009. ImageNet: A large-scale hierarchical image database. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 248–255.
Fang, G.; Song, J.; Wang, X.; Shen, C.; Wang, X.; and Song, M. 2021. Contrastive model inversion for data-free knowledge distillation. arXiv preprint arXiv:2105.08584.
Goodfellow, I.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; and Bengio, Y. 2020. Generative adversarial networks. Communications of the ACM, 63(11): 139–144.
Guo, Z.; Yan, H.; Li, H.; and Lin, X. 2023. Class attention transfer based knowledge distillation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 11868–11877.
Hao, Z.; Guo, J.; Han, K.; Tang, Y.; Hu, H.; Wang, Y.; and Xu, C. 2023. One-for-all: Bridge the gap between heterogeneous architectures in knowledge distillation. Advances in Neural Information Processing Systems (NeurIPS), 36: 79570–79582.
Hao, Z.; Guo, J.; Wang, C.; Tang, Y.; Wu, H.; Hu, H.; Han, K.; and Xu, C. 2024.
Data-efficient large vision models through sequential autoregression. In Forty-first International Conference on Machine Learning (ICML).
He, K.; Zhang, X.; Ren, S.; and Sun, J. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 770–778.
Hinton, G.; Vinyals, O.; and Dean, J. 2015. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531.
Ho, J.; Jain, A.; and Abbeel, P. 2020. Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems (NeurIPS), 33: 6840–6851.
Hou, L.; Cao, Q.; Shen, H.; Pan, S.; Li, X.; and Cheng, X. 2022. Conditional GANs with auxiliary discriminative classifier. In International Conference on Machine Learning (ICML), 8888–8902. PMLR.
Hou, L.; Yuan, Z.; Huang, L.; Shen, H.; Cheng, X.; and Wang, C. 2021. Slimmable generative adversarial networks. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, 7746–7753.
Huang, J.; Cui, K.; Guan, D.; Xiao, A.; Zhan, F.; Lu, S.; Liao, S.; and Xing, E. 2022. Masked generative adversarial networks are data-efficient generation learners. Advances in Neural Information Processing Systems (NeurIPS), 35: 2154–2167.
Jiang, L.; Dai, B.; Wu, W.; and Loy, C. C. 2021. Deceive D: Adaptive pseudo augmentation for GAN training with limited data. Advances in Neural Information Processing Systems (NeurIPS), 34: 21655–21667.
Kingma, D. P.; and Welling, M. 2013. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114.
Krizhevsky, A. 2009. Learning multiple layers of features from tiny images. Master's thesis, University of Toronto.
Le, Y.; and Yang, X. 2015. Tiny ImageNet visual recognition challenge. CS 231N, 7(7): 3.
Li, W.; Wang, L.; Li, W.; Agustsson, E.; and Van Gool, L. 2017. WebVision database: Visual learning and understanding from web data. arXiv preprint arXiv:1708.02862.
Li, Z.; Li, X.; Yang, L.; Zhao, B.; Song, R.; Luo, L.; Li, J.; and Yang, J. 2023. Curriculum temperature for knowledge distillation. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 37, 1504–1512.
Mei, K.; and Patel, V. 2023. VIDM: Video implicit diffusion models. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 37, 9117–9125.
Micaelli, P.; and Storkey, A. J. 2019. Zero-shot knowledge transfer via adversarial belief matching. Advances in Neural Information Processing Systems (NeurIPS), 32.
Miles, R.; and Mikolajczyk, K. 2024. Understanding the role of the projector in knowledge distillation. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, 4233–4241.
Mirza, M.; and Osindero, S. 2014. Conditional generative adversarial nets. arXiv preprint arXiv:1411.1784.
Park, W.; Kim, D.; Lu, Y.; and Cho, M. 2019. Relational knowledge distillation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 3967–3976.
Peng, B.; Jin, X.; Liu, J.; Li, D.; Wu, Y.; Liu, Y.; Zhou, S.; and Zhang, Z. 2019. Correlation congruence for knowledge distillation. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 5007–5016.
Rangwani, H.; Mopuri, K. R.; and Babu, R. V. 2021. Class balancing GAN with a classifier in the loop. In Uncertainty in Artificial Intelligence (UAI), 1618–1627. PMLR.
Steerneman, T. 1983. On the total variation and Hellinger distance between signed measures; an application to product measures. Proceedings of the American Mathematical Society (AMS), 88(4): 684–688.
Tang, J.; Chen, S.; Niu, G.; Sugiyama, M.; and Gong, C. 2023. Distribution shift matters for knowledge distillation with webly collected images. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV).
Tran, M.-T.; Le, T.; Le, X.-M.; Harandi, M.; Tran, Q. H.; and Phung, D. 2024.
NAYER: Noisy layer data generation for efficient and effective data-free knowledge distillation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 23860–23869.
Tschandl, P.; Rosendahl, C.; and Kittler, H. 2018. The HAM10000 dataset, a large collection of multi-source dermatoscopic images of common pigmented skin lesions. Scientific Data, 5(1): 1–9.
Wang, Y.; Qian, B.; Liu, H.; Rui, Y.; and Wang, M. 2024a. Unpacking the gap box against data-free knowledge distillation. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI).
Wang, Y.; Yang, D.; Chen, Z.; Liu, Y.; Liu, S.; Zhang, W.; Zhang, L.; and Qi, L. 2024b. De-confounded data-free knowledge distillation for handling distribution shifts. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 12615–12625.
Wang, Y.; Zhang, J.; and Wang, Y. 2024. Do generated data always help contrastive learning? In International Conference on Learning Representations (ICLR).
Yang, Z.; Zeng, A.; Li, Z.; Zhang, T.; Yuan, C.; and Li, Y. 2023. From knowledge distillation to self-knowledge distillation: A unified approach with normalized loss and customized soft labels. arXiv preprint arXiv:2303.13005.
Yin, H.; Molchanov, P.; Alvarez, J. M.; Li, Z.; Mallya, A.; Hoiem, D.; Jha, N. K.; and Kautz, J. 2020. Dreaming to distill: Data-free knowledge transfer via DeepInversion. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 8715–8724.
Yu, S.; Chen, J.; Han, H.; and Jiang, S. 2023. Data-free knowledge distillation via feature exchange and activation region constraint. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 24266–24275.
Zhao, B.; Cui, Q.; Song, R.; Qiu, Y.; and Liang, J. 2022a. Decoupled knowledge distillation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 11953–11962.
Zhao, H.; Sun, X.; Dong, J.; Manic, M.; Zhou, H.; and Yu, H. 2022b. Dual discriminator adversarial distillation for data-free model compression. International Journal of Machine Learning and Cybernetics (IJMLC), 13(5): 1213–1230.
Zhao, S.; Song, J.; and Ermon, S. 2019. InfoVAE: Balancing learning and inference in variational autoencoders. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, 5885–5892.