Hybrid Data-Free Knowledge Distillation

Jialiang Tang1,2,3, Shuo Chen4*, Chen Gong5*
1School of Computer Science and Engineering, Nanjing University of Science and Technology, China
2Key Laboratory of Intelligent Perception and Systems for High-Dimensional Information of Ministry of Education, China
3Jiangsu Key Laboratory of Image and Video Understanding for Social Security, China
4Center for Advanced Intelligence Project, RIKEN, Japan
5Department of Automation, Institute of Image Processing and Pattern Recognition, Shanghai Jiao Tong University, China
tangjialiang@njust.edu.cn, shuo.chen.ya@riken.jp, chen.gong@sjtu.edu.cn

Abstract

Data-free knowledge distillation aims to learn a compact student network from a pre-trained large teacher network without using the original training data of the teacher network. Existing collection-based and generation-based methods train student networks by collecting massive real examples and generating synthetic examples, respectively. However, they inevitably become weak in practical scenarios due to the difficulties in gathering or emulating sufficient real-world data. To solve this problem, we propose a novel method called Hybrid Data-Free Distillation (HiDFD), which leverages only a small amount of collected data and generates sufficient examples for training student networks. Our HiDFD comprises two primary modules, i.e., teacher-guided generation and student distillation. The teacher-guided generation module guides a Generative Adversarial Network (GAN) with the teacher network to produce high-quality synthetic examples from very few real-world collected examples. Specifically, we design a feature integration mechanism to prevent the GAN from overfitting and to facilitate reliable representation learning from the teacher network. Meanwhile, we derive a category frequency smoothing technique via the teacher network to balance the generative training of each category.
In the student distillation module, we explore a data inflation strategy to properly utilize the blend of real and synthetic data to train the student network via a classifier-sharing-based feature alignment technique. Intensive experiments across multiple benchmarks demonstrate that our HiDFD achieves state-of-the-art performance using 120 times less collected data than existing methods. Code is available at https://github.com/tangjialiang97/HiDFD.

*Corresponding authors: Chen Gong, Shuo Chen.
Copyright 2025, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

Introduction

The success of Deep Neural Networks (DNNs) (He et al. 2016; Hao et al. 2024) is usually accompanied by significant computational and storage demands, which hinders their deployment on resource-limited devices. Knowledge Distillation (KD) (Hinton, Vinyals, and Dean 2015; Miles and Mikolajczyk 2024) has served as an effective compression technique that transfers knowledge from a complex pre-trained teacher network to improve the performance of a lightweight student network. However, in practice, the training data of the teacher network is usually inaccessible due to privacy concerns, and only the pre-trained teacher network itself can be used to learn the student network. This is because users may prefer sharing a pre-trained black-box DNN rather than disclosing their sensitive data. In such cases, vanilla KD methods can hardly train a reliable student network owing to the absence of the original training data. To address this issue, various Data-Free Knowledge Distillation (DFKD) approaches (Binici et al. 2022; Chen et al. 2019, 2021b; Tang et al. 2023) have been developed to train the student network without using any original data. Among existing DFKD methods, collection-based approaches (Chen et al. 2021b; Tang et al. 2023) can achieve satisfactory performance by amassing numerous real examples to train the student network.
However, it is still difficult for collection-based methods to train a reliable student network in practical tasks, e.g., medical image classification, because gathering sufficient training examples can be challenging. On the other hand, generation-based methods (Yin et al. 2020; Chen et al. 2019) leverage the teacher network to guide a generative model (Creswell et al. 2018) in producing fake examples, thereby successfully training the student network without reliance on real examples. Nevertheless, the synthesized examples may exhibit low quality in the absence of real data supervision, leading to suboptimal student performance, especially on challenging recognition tasks such as ImageNet (Deng et al. 2009). The inherent constraints of both collection-based and generation-based DFKD methods prompt an essential question: Can we train an effective generative model using only a small number of collected examples and then learn reliable student networks with hybrid data comprising both collected and synthetic examples?

To answer the above question under the practical data-free distillation scenario, we need a generative model that not only possesses powerful generative capabilities but also has the ability to acquire valuable knowledge from the teacher network. Recent studies (Cui et al. 2023; Rangwani, Mopuri, and Babu 2021) suggest that the Generative Adversarial Network (GAN) (Mirza and Osindero 2014) can easily learn from pre-trained models and then generate high-quality synthetic examples, so we employ this approach as our generative module. The standard GAN consists of a generator and a discriminator trained in an adversarial manner, where the generator attempts to produce fake examples to deceive the discriminator while the discriminator strives to distinguish between real and fake examples.

The Thirty-Ninth AAAI Conference on Artificial Intelligence (AAAI-25)
However, the collected data in practical tasks like medical image classification has two inherent characteristics that may impede the training of the GAN, namely: 1) limited data quantity, as capturing medical images requires expensive and complex equipment; and 2) imbalanced class distribution, where certain diseases (e.g., vascular lesions) are rarer than others (e.g., nevus). When trained on collected data with limited examples and an imbalanced class distribution, the discriminator is susceptible to overfitting (Huang et al. 2022; Jiang et al. 2021). That is, the discriminator tends to memorize all real examples and almost perfectly distinguish them from fake examples, causing the gradient for the generator to vanish. Moreover, the generator training is dominated by the few classes occupying the majority of examples, which prevents it from generating diverse examples. Therefore, it is critical to overcome the overfitting issue of the discriminator and the data imbalance issue of the generator when training with scarce collected examples.

In this paper, we propose a novel approach called Hybrid Data-Free Distillation (HiDFD), which learns reliable student networks on hybrid data comprising synthetic examples and very few real collected examples. Our HiDFD is composed of two pivotal modules: teacher-guided generation and student distillation. In the teacher-guided generation module, we aim to solve the critical issues in the GAN mentioned above and thus generate high-quality synthetic examples. Specifically, we propose a feature integration mechanism to aggregate the features of both the collected and synthetic examples between the teacher network and the GAN. Such an integration mechanism not only mitigates the overfitting of the discriminator, which would otherwise forcibly distinguish closely resembling examples, but also transfers valuable representations that guide the discriminator to capture category dependencies.
Meanwhile, we also develop a new technique called category frequency smoothing to alleviate the imbalanced training of the generator. In the student distillation module, we develop a data inflation operation to adjust the contribution of collected examples within the hybrid data when training the student network. Finally, we design a classifier-sharing-based strategy to closely align the features of the student network with those of the teacher network to enhance student performance. Thanks to effectively transferring knowledge from the teacher network to both the GAN and the student network, our HiDFD can successfully train reliable student networks using very few collected real-world examples. The contributions of our HiDFD are summarized as follows: 1) Considering the difficulties in gathering or emulating real-world data, we propose a novel data-free distillation method called HiDFD, which only requires a small amount of collected data to generate high-quality synthetic examples for training the student network. 2) We design a teacher-guided generation module to effectively tackle the critical issues of discriminator overfitting and imbalanced learning in generating synthetic examples, which empowers the distillation module to learn reliable student networks from the teacher network. 3) Our HiDFD achieves State-Of-The-Art (SOTA) performance using only 1/120 (5,000/600,000) of the examples required by existing collection-based DFKD methods.

Related Works

In this section, we review the relevant works on knowledge distillation and generative models.

Knowledge Distillation

Traditional KD methods (Chen et al. 2021a; Li et al. 2023) learn a compact and reliable student network by encouraging it to mimic a variety of knowledge, i.e., softened logits (Zhao et al. 2022a), intermediate features (Chen et al. 2022), and representation relationships (Peng et al. 2019), from a large teacher network using ample original training data.
However, in practical applications, these approaches might be ineffective because the original data is usually unavailable due to privacy concerns. To address this issue, generation-based (Tran et al. 2024; Wang et al. 2024a,b) and collection-based (Chen et al. 2021b; Tang et al. 2023) DFKD methods have been proposed to train student networks using synthetic and collected data, respectively. The generation-based methods utilize the teacher network to guide a generator in producing examples from statistics stored in the teacher network or from random noise. However, the resulting student network still achieves suboptimal performance due to the flawed synthetic examples. Conversely, collection-based methods assume that numerous examples are easily accessible in the real world, and they acquire an oversized collected dataset (e.g., 600,000 examples for CIFAR10) to train the student network. In practical tasks, it is hard to gather so many examples, and thus these methods still fail to train reliable student networks. In this paper, our HiDFD only utilizes a small collected dataset that contains fewer examples than the original data; it first lets the teacher guide the GAN in training on this collected data and then trains the student on adequate data composed of the synthetic and collected examples.

Generative Models

Recent advances in generative models, including Variational Autoencoders (Kingma and Welling 2013; Zhao, Song, and Ermon 2019), diffusion models (Ho, Jain, and Abbeel 2020; Mei and Patel 2023), and GANs (Hou et al. 2021; Mirza and Osindero 2014), have significantly propelled data generation. This paper focuses on the powerful GAN due to its ability to learn from pre-trained models (Cui et al. 2023; Rangwani, Mopuri, and Babu 2021). The traditional GAN (Goodfellow et al.
2020) consists of a generator and a discriminator, where the generator produces fake examples to deceive the discriminator, and the discriminator tries to accurately distinguish between real and fake examples. Recently, the Auxiliary Discriminative Classifier GAN (ADCGAN) (Hou et al. 2022) captures dependencies between generated examples and class labels by encouraging the discriminator to classify synthetic examples into specific categories, which effectively improves the quality of synthetic data. In our method, we want the GAN to produce high-quality synthetic examples that are easily classifiable, enabling the training of a precise student network. Therefore, we adopt ADCGAN as the foundational generative model. The ADCGAN is composed of a generator $N_G: \mathcal{Z} \times \mathcal{Y} \to \mathcal{X}$, which maps a noise-label pair $(z, y)$ to a fake example $N_G(z, y) \in \mathcal{X}$ that can be precisely predicted as $y \in \mathcal{Y}$, and a discriminator $N_D: \mathcal{X} \to \{0, 1\}$, which determines whether the input example is real (i.e., 1) or fake (i.e., 0) and also has a classifier $\Psi_D: \mathcal{X} \to \mathcal{Y}^+ \cup \mathcal{Y}^-$ ($y^+ \in \mathcal{Y}^+$ and $y^- \in \mathcal{Y}^-$ denote the labels for real and fake examples, respectively). Mathematically, the objective functions for the discriminator and generator in the ADCGAN are defined as $\mathcal{L}_d^{adc}$ and $\mathcal{L}_g^{adc}$, respectively:

$$\mathcal{L}_d^{adc} = \mathcal{L}_d + \mathbb{E}_{x,y \sim P_{X,Y}}[\log \Psi_D(y^+ \mid x)] + \mathbb{E}_{x,y \sim Q_{X,Y}}[\log \Psi_D(y^- \mid x)],$$
$$\mathcal{L}_g^{adc} = \mathcal{L}_g - \mathbb{E}_{x,y \sim Q_{X,Y}}[\log \Psi_D(y^+ \mid x)] + \mathbb{E}_{x,y \sim Q_{X,Y}}[\log \Psi_D(y^- \mid x)], \tag{1}$$

where $\mathcal{L}_d = \mathbb{E}_{x \sim P_X}[\log N_D(x)] + \mathbb{E}_{x \sim Q_X}[\log(1 - N_D(x))]$ and $\mathcal{L}_g = \mathbb{E}_{x \sim Q_X}[\log(1 - N_D(x))]$ are the loss functions of the standard GAN, and $P$ and $Q$ denote the distributions of the real collected data and the fake synthetic data, respectively. $\Psi_D(y^+ \mid \cdot)$ (resp. $\Psi_D(y^- \mid \cdot)$) denotes the probability that the input example is simultaneously classified as label $y$ and as real (resp. fake) by the classifier of the discriminator. Formally,

$$\Psi_D(y^+ \mid x) = \frac{\exp\!\big(\Psi_D^+(y)^\top \Phi_D(x)\big)}{\sum_{\tilde{y} \in \mathcal{Y}^+} \exp\!\big(\Psi_D^+(\tilde{y})^\top \Phi_D(x)\big) + \sum_{\tilde{y} \in \mathcal{Y}^-} \exp\!\big(\Psi_D^-(\tilde{y})^\top \Phi_D(x)\big)},$$

where $\Phi_D$ represents the feature extractor shared between the original discriminator $N_D$ and the classifier $\Psi_D$, and $\Psi_D^+$ (resp.
$\Psi_D^-$) captures the dependencies between the category labels and the real (resp. fake) data. Notably, DeGAN (Addepalli et al. 2020) also trains a GAN using collected data, but it still requires a large number of collected examples and can only utilize synthetic examples to train the target model.

Data-free distillation aims to train a compact student network $N_S$ using a pre-trained teacher network $N_T$ without accessing the teacher's original training data $D_o$. Both $N_T$ and $N_S$ consist of a feature extractor $\Phi$ and a classifier $\Psi$, where the subscripts $T$ and $S$ indicate "teacher" and "student", respectively. Existing collection-based DFKD methods (Tang et al. 2023) usually rely on collected data $D_c$ with an overwhelming number of examples searched based on the categories of the original data. Here, the data amount $|D_c| \gg |D_o|$, which is hard to satisfy in practical tasks. To overcome this limitation, we propose a more practical method that only requires a small number of collected examples for DFKD, i.e., $|D_c| \ll |D_o|$. To this end, we develop a hybrid framework that generates abundant synthetic examples from the very few collected examples and then integrates them into hybrid data for training a reliable student network.

Motivation of the Hybrid Learning

Formally, we denote the distributions of the collected data $D_c$ and the synthetic data $D_s$ as $P$ and $Q$, respectively, while the distribution of the hybrid data $D = D_c \cup D_s$ is represented as $U = \alpha P + (1 - \alpha) Q$. Here $\alpha = |D_c| / (|D_c| + |D_s|)$ represents the proportion of collected examples in the hybrid data. In general, synthetic and collected examples usually exhibit a significant distribution gap. This can cause substantial fluctuations during the training of the student network on the hybrid data, ultimately leading to poor performance (Wang, Zhang, and Wang 2024). Therefore, it is essential to align the distribution of the synthetic data with that of the collected data, thereby forming reliable hybrid data.
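To make the mixture concrete, the hybrid distribution $U = \alpha P + (1-\alpha)Q$ and the total variation gaps analyzed in the derivation that follows can be checked numerically. The sketch below uses small discrete toy distributions; all names and values are illustrative and not from the paper:

```python
import random

def tvd(p, q):
    """Total Variation Distance between two discrete distributions."""
    return 0.5 * sum(abs(pi - qi) for pi, qi in zip(p, q))

def random_dist(k, rng):
    """A random probability vector of length k."""
    w = [rng.random() for _ in range(k)]
    s = sum(w)
    return [x / s for x in w]

rng = random.Random(0)
for _ in range(1000):
    p = random_dist(6, rng)           # collected-data distribution P
    q = random_dist(6, rng)           # synthetic-data distribution Q
    a = rng.random()                  # mix proportion alpha
    u = [a * pi + (1 - a) * qi for pi, qi in zip(p, q)]  # hybrid U
    # Eq. (4) holds exactly: TVD(U, P) = (1 - alpha) * TVD(P, Q).
    assert abs(tvd(u, p) - (1 - a) * tvd(p, q)) < 1e-12
    # Eq. (5) is a (loose) bound: TVD(U, Q) <= (2 - alpha) * TVD(P, Q).
    assert tvd(u, q) <= (2 - a) * tvd(p, q) + 1e-12
```

The check confirms that shrinking $\mathrm{TVD}(P, Q)$ (better synthetic data) or raising $\alpha$ (more weight on collected data) tightens the gap.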
Here, the synthetic data is generated under the supervision of the collected data, so we assume that the synthetic, collected, and hybrid data share the same support set $\mathcal{X}$. Then, the distribution gap between the hybrid data and the synthetic data can be characterized by the Total Variation Distance (TVD), defined as

$$\mathrm{TVD}(U, Q) = \frac{1}{2} \sum_{x \in \mathcal{X}} |U(x) - Q(x)|, \tag{2}$$

where $U(x) \in (0, 1)$ and $Q(x) \in (0, 1)$ measure the probability of $x$ under the hybrid and synthetic data distributions, respectively. Here $\mathrm{TVD}(\cdot, \cdot) \geq 0$, and $\mathrm{TVD}(U, Q) = \frac{1}{2} \sum_{x \in \mathcal{X}} |U(x) - Q(x)| \leq \frac{1}{2} \sum_{x \in \mathcal{X}} (|U(x)| + |Q(x)|) = 1$. Based on the triangle inequality (Steerneman 1983) of the TVD, we easily have

$$\mathrm{TVD}(U, Q) \leq \mathrm{TVD}(U, P) + \mathrm{TVD}(P, Q). \tag{3}$$

Then, given that $U = \alpha P + (1 - \alpha) Q$ with the parameter $\alpha$ controlling the weight of the collected data, we can compute $\mathrm{TVD}(U, P)$ as

$$\mathrm{TVD}(U, P) = \frac{1}{2} \sum_{x \in \mathcal{X}} |U(x) - P(x)| = \frac{1}{2} \sum_{x \in \mathcal{X}} |\alpha P(x) + (1 - \alpha) Q(x) - P(x)| = \frac{1 - \alpha}{2} \sum_{x \in \mathcal{X}} |Q(x) - P(x)| = (1 - \alpha)\,\mathrm{TVD}(Q, P). \tag{4}$$

By invoking the symmetry of the TVD and Eq. (3), we obtain

$$\mathrm{TVD}(U, Q) \leq (2 - \alpha)\,\mathrm{TVD}(P, Q). \tag{5}$$

Eq. (5) reveals that the quality of the synthetic data $D_s$ (which governs $\mathrm{TVD}(P, Q)$) and the mix proportion $\alpha$ are two critical factors influencing the distribution gap $\mathrm{TVD}(U, Q)$. This observation inspires us to employ two modules to align the distribution of the synthetic data with that of the collected data, as shown in Fig. 1(c). In the teacher-guided generation module, we employ the teacher network to guide the GAN to enhance the quality of the synthetic data, which solves its intrinsic issues when trained on the small and imbalanced collected data, namely the overfitting of the discriminator and the imbalanced learning of the generator:

Figure 1: The diagram of (a) generation-based methods (Fang et al. 2021; Yin et al. 2020; Chen et al. 2019; Micaelli and Storkey 2019), (b) collection-based methods (Chen et al. 2021b; Tang et al. 2023), and (c) our HiDFD. In HiDFD, the teacher-guided generation module employs the teacher network to guide the training of the GAN on limited collected data.
Subsequently, the student distillation module closely aligns the features of the student network with those of the teacher network on the hybrid data comprising high-quality synthetic examples and properly inflated collected examples.

Discriminator Overfitting. When trained with very few collected examples, the discriminator is prone to become overconfident in identifying fake examples, i.e., $\mathbb{E}_{x \sim Q_X}[N_D(x)]$ tends to 0. As a result, the gradient of $\mathcal{L}_g$ in Eq. (1), which is responsible for pushing the generator to produce high-quality examples, may become ineffective, namely

$$\nabla_{N_G}\,\mathbb{E}_{x \sim Q_X}[\log(1 - N_D(x))] = \mathbb{E}_{x \sim Q_X}\!\left[\nabla_{N_G} \log(1 - N_D(x))\right] \to 0, \tag{6}$$

as the parameters of $N_D$ and $N_G$ are independent of each other; Eq. (6) is proved by (Arjovsky and Bottou 2022). Meanwhile, the discriminator also has a classifier that provides valuable category dependencies for the generator by precisely predicting input examples, thereby promoting the generator to produce classifiable examples. However, multi-class classification is more challenging than the binary discrimination of real and fake. Given very few collected examples, it is difficult for the discriminator to learn representations powerful enough for its classifier to achieve precise classification.

Imbalanced Generator Learning. Given the optimal classifier $\Psi^*_D$ of the discriminator (where $\Psi^*_D(y^+ \mid x) = \frac{p(x,y)}{p(x)+q(x)}$ and $\Psi^*_D(y^- \mid x) = \frac{q(x,y)}{p(x)+q(x)}$; see Appendix), optimizing the generator to produce classifiable examples, i.e., $\max_{N_G}\big[\mathbb{E}_{x,y \sim Q_{X,Y}}[\log \Psi^*_D(y^+ \mid x)] - \mathbb{E}_{x,y \sim Q_{X,Y}}[\log \Psi^*_D(y^- \mid x)]\big]$, is equivalent to

$$\max_{N_G}\,\mathbb{E}_{x,y \sim Q_{X,Y}}\!\left[\log \frac{p(x, y)}{q(x, y)}\right] \;\Longleftrightarrow\; \min_{N_G}\,\mathrm{KL}(Q_{X,Y} \,\|\, P_{X,Y}), \tag{7}$$

where KL represents the Kullback-Leibler divergence; the proof of Eq. (7) is provided in the Appendix. Eq. (7) indicates that optimizing the generator forces the joint distribution $Q_{X,Y}$ of the synthetic data toward $P_{X,Y}$ of the imbalanced collected data, inevitably resulting in synthetic examples with poor diversity.
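The equivalence in Eq. (7) rests on the identity $\mathbb{E}_{Q}[\log(p/q)] = -\mathrm{KL}(Q \,\|\, P)$, which can be verified numerically on toy joint distributions (the four-cell distributions below are purely illustrative):

```python
import math

# Toy joint distributions over 4 (example, label) cells.
p = [0.4, 0.3, 0.2, 0.1]        # imbalanced "collected" joint P_{X,Y}
q = [0.25, 0.25, 0.25, 0.25]    # "synthetic" joint Q_{X,Y}

# The generator's classification objective (left side of Eq. (7)).
objective = sum(qi * math.log(pi / qi) for pi, qi in zip(p, q))

# KL(Q || P): the quantity the generator implicitly minimizes.
kl = sum(qi * math.log(qi / pi) for pi, qi in zip(p, q))

# objective == -KL(Q || P): raising the objective drags Q toward the
# imbalanced P, which is exactly the diversity problem described above.
```

Since the two quantities differ only in sign, maximizing the objective and minimizing the KL divergence are the same optimization.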
In student distillation, we properly inflate the collected examples to construct the hybrid data with a moderate mix proportion $\alpha$ for effectively training the student network.

Teacher-Guided Generation

In this section, we enable the GAN to generate high-quality examples by solving its critical issues under the guidance of the teacher network. To mitigate discriminator overfitting, we design a feature integration mechanism that forces the aggregation of the features of both real collected examples and fake synthetic examples. Specifically, we blur the boundary between real and fake examples to make it harder for the discriminator to discriminate them accurately, thus preventing the discriminator from becoming overconfident:

$$\mathcal{L}_{blend} = \mathbb{E}_{x,y \sim P_{X,Y},\, \hat{x},\hat{y} \sim Q_{X,Y}}\!\left[\mathbb{I}(p > q)\left(\|\Phi_T(x) - \Phi_D(\hat{x})\|_2 + \|\Phi_T(\hat{x}) - \Phi_D(x)\|_2\right)\right], \tag{8}$$

where $\mathbb{I}(p > q)$ is an indicator function that controls the stochastic application of $\mathcal{L}_{blend}$: its value is 1 if $p > q$ and 0 otherwise ($p$ is sampled uniformly from $[0, 1]$, and $q = 0.7$ is analyzed in the Appendix). Meanwhile, we transfer the expressive features of the teacher network to enhance the representation ability of the discriminator:

$$\mathcal{L}_{trans} = \mathbb{E}_{x,y \sim P_{X,Y},\, \hat{x},\hat{y} \sim Q_{X,Y}}\!\left[\|\Phi_T(x) - \Phi_D(x)\|_2 + \|\Phi_T(\hat{x}) - \Phi_D(\hat{x})\|_2\right]. \tag{9}$$

To alleviate the imbalanced learning of the generator, we regularize the GAN training across all categories. During generator training, we dynamically update the class frequencies $\{\bar{n}^t_c\}_{c=1}^C$ ($C$ represents the number of categories) at the beginning of iteration $t$ via the following exponential moving average with a weight $\gamma \in [0, 1]$:

$$\bar{n}^t_c = (1 - \gamma)\,\bar{n}^{t-1}_c + \gamma\, n^{t-1}_c, \tag{10}$$

where $n^{t-1}_c$ is the number of synthetic examples belonging to class $c$ in iteration $t-1$, $\bar{n}^0_c$ is initially set to a constant, and $\gamma = 0.5$ (analyzed in the Appendix). Then, each class frequency is normalized as

$$\hat{n}^t_c = \frac{\bar{n}^t_c}{\sum_{j=1}^{C} \bar{n}^t_j}. \tag{11}$$
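Eqs. (10) and (11) amount to an exponential moving average followed by normalization. A minimal sketch, with illustrative counts; the split between smoothed ($\bar{n}$) and raw ($n$) counts follows our reading of the (garbled) Eq. (10):

```python
def ema_update(freqs, counts, gamma=0.5):
    """Eq. (10): exponential moving average of per-class synthetic counts.
    freqs:  smoothed frequencies from iteration t-1 (bar n);
    counts: raw class counts observed at iteration t-1 (n)."""
    return [(1 - gamma) * f + gamma * n for f, n in zip(freqs, counts)]

def normalize(freqs):
    """Eq. (11): normalize smoothed frequencies so they sum to 1."""
    s = sum(freqs)
    return [f / s for f in freqs]

freqs = [1.0, 1.0, 1.0]            # bar n_c^0 initialised to a constant
counts = [8, 1, 1]                 # an imbalanced synthetic batch
freqs = ema_update(freqs, counts)  # -> [4.5, 1.0, 1.0]
hat = normalize(freqs)             # hat n_c^1, feeds the 1/hat_n weights
```

The smoothing keeps a transiently over-generated class from dominating the regularization weights in a single iteration.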
Thereafter, the generator is regularized to produce balanced examples by minimizing the loss

$$\mathcal{L}_{reg} = \sum_{c=1}^{C} \frac{p^c_T \log p^c_T}{\hat{n}^t_c}, \tag{12}$$

where $p_T = \mathbb{E}_{x,y \sim Q_{X,Y}}[\mathrm{SoftMax}(N_T(x))]$ is the average softmax vector output by the teacher network. The teacher network is well trained on the original data, so it can precisely predict synthetic examples. In this case, $p^c_T$ can be regarded as the proportion of examples of category $c$ within the synthetic data. In Eq. (12), the generation of examples in a category $c$ with a lower (or higher) $p^c_T$ is adjusted by a larger (or smaller) $1/\hat{n}^t_c$. The loss functions of the discriminator and generator in our teacher-guided GAN are summarized as

$$\mathcal{L}_D = \mathcal{L}_d^{adc} + \lambda_d(\mathcal{L}_{blend} + \mathcal{L}_{trans}), \qquad \mathcal{L}_G = \mathcal{L}_g^{adc} + \lambda_g \mathcal{L}_{reg}, \tag{13}$$

where $\mathcal{L}_d^{adc}$ and $\mathcal{L}_g^{adc}$ are defined in Eq. (1), and the trade-off parameters satisfy $\lambda_d > 0$ and $\lambda_g > 0$.

Student Distillation

In the teacher-guided generation module, we trained an effective GAN for generating high-quality synthetic examples, which are then combined with the collected examples to construct the hybrid data $D$ for training the student network. However, directly combining the limited collected examples with numerous synthetic examples results in a small mix proportion $\alpha$ (i.e., a large distribution gap $\mathrm{TVD}(U, Q)$) that disturbs the training of the student network. Therefore, we inflate the collected data by repeating examples, enlarging $\alpha$ from $|D_c| / (|D_c| + |D_s|)$ to $N|D_c| / (N|D_c| + |D_s|)$, where $N$ is the inflation factor. We adopt a moderate inflation factor of $N = \lfloor |D_s| / |D_c| \rfloor$; further details are available in Extended Experiments. Recent works (Chen et al. 2021b; Tang et al. 2023) indicate that collected data usually contains many noisy examples, which may mislead the GAN into producing undesired synthetic examples with wrong labels. As a result, these potentially noisy examples can harm the training of the student network, particularly its classifier.
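The effect of the inflation factor on the mix proportion can be sketched directly. The sizes below reuse the paper's 5,000/600,000 example counts for illustration, and `mix_proportion` is a hypothetical helper, not code from the paper:

```python
def mix_proportion(n_collected, n_synthetic, inflate=True):
    """Mix proportion alpha of collected examples in the hybrid data.
    With the inflation factor N = floor(|Ds| / |Dc|), each collected
    example is repeated N times before mixing."""
    n = n_synthetic // n_collected if inflate else 1
    return (n * n_collected) / (n * n_collected + n_synthetic)

alpha_raw = mix_proportion(5_000, 600_000, inflate=False)  # ~0.0083
alpha_inf = mix_proportion(5_000, 600_000)                 # N=120 -> 0.5
```

Without inflation the collected data is all but invisible in the mixture; with $N = 120$ the two sources contribute equally, which is the moderate proportion the method targets.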
In DFKD, the teacher network is well trained on the original data and possesses an accurate classifier. Recent studies (Tang et al. 2023; Chen et al. 2022) show that the teacher's classifier contains useful category information about the original data. Therefore, we share the classifier of the teacher network with the student network. Then, we closely align the features of the student network with those of the teacher network as follows:

$$\mathcal{L}_{align} = \mathbb{E}_{x \sim D}\left[\|\Phi_S(x) - \Phi_T(x)\|_2\right]. \tag{14}$$

By minimizing $\mathcal{L}_{align}$, the features of the student network are closely aligned with those of the teacher network, and feeding the aligned features into the shared classifier produces predictions as accurate as those of the teacher network. The student network does not use any example labels during training, thereby avoiding the negative impact of potentially noisy labels. The whole algorithm of our proposed HiDFD is given in the Appendix.

Experiments

Datasets and Implementation Details

Original Datasets. We evaluate the effectiveness of our HiDFD on popular datasets, including CIFAR (Krizhevsky 2009), CINIC (Darlow et al. 2018), and Tiny ImageNet (Le and Yang 2015), which are widely used by existing DFKD methods (Chen et al. 2019, 2021b). Additionally, we conduct experiments on the large-scale ImageNet (Deng et al. 2009) and the practical medical image dataset HAM (Tschandl, Rosendahl, and Kittler 2018), which are challenging for existing DFKD methods.

Collected Datasets. When using CIFAR and CINIC as the original datasets, we search for examples from ImageNet. With Tiny ImageNet and ImageNet as the original datasets, we utilize WebVision (Li et al. 2017) as our source of collected data. Moreover, we collect examples from ISIC (Codella et al. 2018) when using HAM as the original dataset. We follow (Chen et al. 2021b) and sample a portion of examples from the corresponding dataset as the collected data $D_c$.
Here, we define the ratio between the collected data $D_c$ and the original data $D_o$ as $\rho = |D_c| / |D_o|$. We construct small ($\rho = 0.1$) and moderate ($\rho = 1.0$) collected datasets for the experiments on collection-based DFKD methods. Notably, the original dataset is solely required for pre-training the teacher network. Detailed information regarding these datasets and the corresponding synthesized examples is provided in the Appendix.

Implementation Details. All student networks in our HiDFD employ SGD with a weight decay of $5 \times 10^{-4}$ and a momentum of 0.9 as the optimizer. The student networks are trained for 240 epochs with a learning rate of 0.05, which is divided by 10 at the 150th, 180th, and 210th epochs. Meanwhile, the generator and discriminator of the GAN use Adam for optimization with learning rates of $1 \times 10^{-4}$ and $4 \times 10^{-4}$, respectively, and both are trained for 500 epochs. Additionally, the hyper-parameters in Eq. (13) are configured as $\lambda_d = 0.1$ and $\lambda_g = 0.1$.

| Dataset | Pair | ACC_T | ACC_S | DAFL | DDAD | DI | CMI | SSNet | DeGAN ρ=0.1 | DeGAN ρ=1.0 | DFND ρ=0.1 | DFND ρ=1.0 | KD3 ρ=0.1 | KD3 ρ=1.0 | HiDFD ρ=0.1 | HiDFD ρ=1.0 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| CIFAR10 | ResNet34→ResNet18 | 95.70 | 95.20 | 92.22 | 93.08 | 93.26 | 94.84 | 95.39 | 90.39 | 91.95 | 48.82 | 85.82 | 65.70 | 93.37 | 94.74 | 95.11 |
| CIFAR10 | VGG16→VGG13 | 94.07 | 92.69 | 86.92 | 90.85 | 85.27 | 88.49 | 92.00 | 87.52 | 90.37 | 48.65 | 89.22 | 48.93 | 91.49 | 92.28 | 93.14 |
| CIFAR10 | ResNet34→VGG13 | 95.70 | 92.69 | 83.36 | 89.76 | 90.24 | 86.63 | 92.03 | 86.40 | 89.69 | 49.48 | 90.60 | 65.10 | 93.05 | 92.90 | 93.76 |
| CIFAR100 | ResNet34→ResNet18 | 78.05 | 77.10 | 74.47 | 73.64 | 61.32 | 77.04 | 77.41 | 53.20 | 62.94 | 21.45 | 64.73 | 26.96 | 72.90 | 76.93 | 78.35 |
| CIFAR100 | VGG16→VGG13 | 74.53 | 72.28 | 65.36 | 68.33 | 60.00 | 59.70 | 71.16 | 53.97 | 61.80 | 23.48 | 63.90 | 21.27 | 71.44 | 71.26 | 74.18 |
| CIFAR100 | ResNet34→VGG13 | 78.05 | 72.28 | 45.28 | 68.59 | 61.07 | 61.80 | 72.38 | 46.82 | 56.44 | 23.86 | 64.54 | 25.25 | 72.46 | 73.44 | 75.65 |
| CINIC | ResNet34→ResNet18 | 86.62 | 85.09 | 60.54 | 80.10 | 78.57 | 78.47 | 83.47 | 57.59 | 76.78 | 24.53 | 80.94 | 39.35 | 82.68 | 85.62 | 86.68 |
| CINIC | VGG16→VGG13 | 84.22 | 83.28 | 59.08 | 77.90 | 68.90 | 74.99 | 79.63 | 54.36 | 76.11 | 29.53 | 77.41 | 29.88 | 78.18 | 81.92 | 82.27 |
| CINIC | ResNet34→VGG13 | 86.62 | 83.28 | 44.62 | 77.63 | 59.52 | 75.46 | 80.30 | 54.43 | 74.40 | 33.40 | 79.33 | 71.57 | 80.28 | 81.90 | 82.88 |
| Tiny ImageNet | ResNet34→ResNet18 | 66.44 | 64.87 | 52.20 | 59.84 | 6.98 | 64.01 | 64.04 | 25.74 | 49.11 | 26.36 | 60.09 | 20.26 | 63.63 | 65.96 | 66.61 |
| Tiny ImageNet | VGG16→VGG13 | 62.34 | 61.55 | 53.89 | 42.25 | 1.22 | 17.73 | 57.82 | 23.13 | 44.65 | 25.39 | 58.47 | 24.26 | 61.06 | 60.46 | 62.69 |
| Tiny ImageNet | ResNet34→VGG13 | 66.44 | 61.55 | 52.46 | 44.20 | 2.27 | 20.57 | 59.16 | 21.09 | 48.12 | 25.53 | 58.18 | 27.37 | 61.98 | 61.67 | 65.27 |
| HAM | – | 81.18 | 79.64 | 32.05 | 44.68 | 62.79 | 67.34 | 74.52 | 34.75 | 64.43 | 27.55 | 62.59 | 64.10 | 68.44 | 77.08 | 81.52 |
| ImageNet | – | 73.27 | 67.00 | 1.92 | 1.46 | 1.14 | 1.84 | 5.74 | 22.28 | 43.96 | 28.99 | 45.66 | 35.02 | 55.05 | 65.36 | 66.89 |

Table 1: Accuracies (in %) of student networks trained by various methods on six image classification datasets. The columns ACC_T and ACC_S report the accuracies yielded by the teacher network and by a student network trained on the full original data, respectively. The Pair column lists the teacher-student architectures.

Experiments on Benchmark Datasets

In this section, we conduct comprehensive experiments on various benchmark datasets to evaluate our proposed HiDFD against SOTA generation-based (Chen et al. 2019; Zhao et al. 2022b; Yin et al. 2020; Binici et al. 2022; Fang et al. 2021; Yu et al. 2023) and collection-based (Addepalli et al. 2020; Chen et al. 2021b; Tang et al. 2023) DFKD methods. These methods are reproduced using their official source codes. Tab. 1 reports the results of the compared methods and our proposed HiDFD. Firstly, our HiDFD using only a small quantity of collected examples ($\rho = 0.1$) achieves performance comparable to students trained on the full original data. Secondly, when trained on the modestly sized collected data ($\rho = 1.0$), our HiDFD significantly outperforms the compared methods on most tasks, especially on the challenging HAM and ImageNet.
Thirdly, the generation-based methods, which utilize generative models to produce training examples without the supervision of real examples, tend to perform unsatisfactorily due to the deficiencies in their synthetic examples. These results demonstrate that our HiDFD can train robust student networks by effectively generating training examples from limited real-world examples and properly utilizing all realistic examples.

Ablation Studies & Parametric Sensitivities

In this section, we evaluate the effectiveness of our method with small collected data ($\rho = 0.1$), where CIFAR and ImageNet serve as the original and collected datasets, respectively. Moreover, ResNet34 and ResNet18 are used as the teacher and student networks, respectively.

Ablation Studies. We evaluate three key operations ($\mathcal{L}_{blend}$, $\mathcal{L}_{trans}$, and $\mathcal{L}_{reg}$) in the teacher-guided generation and the classifier-sharing-based strategy in the student distillation. The experimental results are reported in Tab. 2, and the contributions of these components are analyzed as follows:

1) Teacher-Guided Generation. The feature blending $\mathcal{L}_{blend}$ in Eq. (8) and the feature transferring $\mathcal{L}_{trans}$ in Eq. (9) are designed to prevent the overfitting of the discriminator and to enhance its representation ability. Meanwhile, the generator regularization $\mathcal{L}_{reg}$ in Eq. (12) is essential for maintaining the balanced training of the generator. Therefore, omitting any of these components leads to a noticeable reduction in the performance of the student network. In particular, training the student network only on synthetic examples without any guidance from the teacher network results in the poorest performance (the row "w/o $\mathcal{L}_{blend}$, $\mathcal{L}_{trans}$, $\mathcal{L}_{reg}$"). These results indicate the importance of these operations for robust GAN training with limited collected examples, thereby generating high-quality examples for training reliable student networks.

2) Student Distillation. We examine the impact of replacing the classifier-sharing-based feature alignment with traditional KD methods (Hinton, Vinyals, and Dean 2015; Chen et al. 2021a). All student networks are trained on the hybrid data composed of collected and synthetic examples. We find that the student networks trained by these methods generally achieve suboptimal performance due to their inability to effectively handle the potentially noisy examples among the hybrid data. These results highlight the suitability of our training strategy for learning reliable student networks in data-free distillation scenarios.

| Type | Algorithm | CIFAR10 | CIFAR100 |
|---|---|---|---|
| Teacher-Guided Generation | w/o L_blend | 92.87 (↓1.87) | 74.18 (↓2.75) |
| | w/o L_trans | 91.86 (↓2.88) | 74.40 (↓2.53) |
| | w/o L_reg | 92.76 (↓1.98) | 73.95 (↓2.98) |
| | w/o L_blend, L_trans | 91.10 (↓3.64) | 71.02 (↓5.91) |
| | w/o L_blend, L_reg | 90.77 (↓3.97) | 71.83 (↓5.10) |
| | w/o L_trans, L_reg | 91.42 (↓3.32) | 72.10 (↓4.83) |
| | w/o L_blend, L_trans, L_reg | 89.55 (↓5.19) | 70.25 (↓6.68) |
| Student Distillation | OFAKD | 92.88 (↓1.86) | 70.86 (↓6.07) |
| | VKD | 92.69 (↓2.05) | 66.96 (↓9.97) |
| | SemCKD | 93.49 (↓1.25) | 70.93 (↓6.00) |
| | CC | 92.63 (↓2.11) | 69.52 (↓7.41) |
| | DKD | 92.95 (↓1.79) | 68.25 (↓8.68) |
| | RKD | 92.40 (↓2.34) | 70.53 (↓6.40) |
| | CAT-KD | 92.49 (↓2.25) | 68.69 (↓8.24) |
| | NKD | 93.26 (↓1.48) | 65.31 (↓11.62) |
| | HiDFD (ours) | 94.74 | 76.93 |

Table 2: Accuracies (in %) of ablation studies.

Parametric Sensitivity. There are two tuning parameters in our HiDFD, namely $\lambda_d$ and $\lambda_g$ in Eq. (13). To analyze their sensitivities, we individually vary each parameter while keeping the other constant during training. The accuracies of the corresponding student networks are shown in Fig. 2(a) and Fig. 2(b).

Figure 2: Parametric sensitivities of (a) $\lambda_d$ and (b) $\lambda_g$ in Eq. (13). Accuracies (in %) of the student networks trained with collected data with (c) varying inflation factors and (d) various data quantities.
Despite the large fluctuations in these parameters, where λd, λg ∈ {0.001, 0.01, 0.1, 1, 10}, the accuracy curve of the student network remains relatively stable. These results indicate the robustness of our HiDFD against parameter variations. Additionally, the student network achieves the best performance when λd = λg = 0.1, so we adopt this parameter configuration in our method.

Extended Experiments

Experiments with Various Backbones. We evaluate our HiDFD across several widely used teacher-student pairs to assess its adaptability to different networks. The results are shown in Tab. 3; we can observe that our HiDFD consistently achieves satisfactory performance across different teacher-student pairs, where the trained students perform comparably to those trained on the original data.

Experiments with Varying Inflation Factors. We report the accuracies of the student networks trained on collected data with various inflation factors in Fig. 2(c). The student network performs better with increasing N, and the best accuracy is observed at N=10. Furthermore, excessive inflation may reduce the diversity brought by the synthetic data, so the student network encounters performance degradation when N>10. Therefore, we adopt a moderate inflation factor of N = |Ds|/|Dc|. These experiments demonstrate that appropriately inflating the collected examples, which is crucial for reducing the distribution gap between synthetic and collected data, can effectively improve the performance of the student network.

Experiments on Collected Data with Various Data Quantities. We explore the impact of varying the volume of collected data on the performance of student networks, with ρ values ranging from 0.1 to 1. As shown in Fig. 2(d), student networks trained by the compared collection-based DFKD methods (Chen et al. 2021b; Tang et al.
2023) tend to underperform with small values of ρ. Conversely, our HiDFD consistently achieves satisfactory performance across a spectrum of ρ values. These results further demonstrate the effectiveness of our HiDFD in training reliable student networks with limited collected data.

| Dataset | Teacher | Student | ACCS | HiDFD |
|---|---|---|---|---|
| CIFAR10 | ResNet32×4 | ResNet110 | 93.37 | 95.04 |
| CIFAR10 | ResNet32×4 | ShuffleNet | 93.23 | 93.62 |
| CIFAR10 | ResNet110×2 | ResNet116 | 93.21 | 94.83 |
| CIFAR10 | ResNet110×2 | WRN40-2 | 94.86 | 95.35 |
| CIFAR100 | ResNet32×4 | ResNet110 | 74.31 | 75.69 |
| CIFAR100 | ResNet32×4 | ShuffleNet | 72.60 | 75.03 |
| CIFAR100 | ResNet110×2 | ResNet116 | 74.46 | 74.49 |
| CIFAR100 | ResNet110×2 | WRN40-2 | 76.31 | 75.65 |

Table 3: Accuracies (in %) of various networks (ρ=1.0).

Conclusion

In this paper, we proposed a new data-free distillation approach termed HiDFD to train student networks on hybrid data comprising high-quality synthetic examples and scarce collected examples, which well meets practical requirements. Our investigation reveals that bridging the distribution gap between the hybrid and synthetic data is crucial for training reliable student networks, implying that the quality of synthetic data and the weight of collected data are two key factors in reducing this gap. This observation inspired our hybrid distillation framework, in which the teacher-guided generation module effectively generates high-quality synthetic examples from the limited collected data by leveraging the teacher network to guide the GAN training process, and the student distillation module properly enhances the influence of the collected examples within the hybrid data by inflating their frequency. Consequently, we can naturally define a classifier-sharing-based feature alignment to distill the student network, achieving state-of-the-art performance using significantly fewer examples than existing methods. The limitations and broader impacts of our HiDFD are discussed in the Appendix.
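The frequency-inflation idea named above, repeating the scarce collected set so that real examples appear as often as synthetic ones with N = |Ds|/|Dc|, can be sketched as follows; the uniform mixing, the naming, and the function itself are illustrative assumptions, not the authors' code.

```python
import random

# Hypothetical sketch of the data inflation strategy: the scarce
# collected set D_c is repeated N = |D_s| // |D_c| times before being
# mixed with the synthetic set D_s, so collected examples are not
# drowned out during student training.

def inflate_and_mix(collected, synthetic, seed=0):
    """Repeat the collected set N times, mix with synthetic data,
    and shuffle deterministically. Returns (hybrid_set, N)."""
    n = max(1, len(synthetic) // len(collected))  # inflation factor N
    hybrid = collected * n + synthetic
    random.Random(seed).shuffle(hybrid)
    return hybrid, n

# Toy example: 3 collected vs. 30 synthetic examples gives N = 10,
# so each real example appears ten times in the hybrid set.
collected = ["real_%d" % i for i in range(3)]
synthetic = ["fake_%d" % i for i in range(30)]
hybrid, n = inflate_and_mix(collected, synthetic)
```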
Acknowledgments

This research is supported by NSF of China (Nos: 62336003, 12371510), and NSF for Distinguished Young Scholar of Jiangsu Province (No: BK20220080).

References

Addepalli, S.; Nayak, G. K.; Chakraborty, A.; and Radhakrishnan, V. B. 2020. DeGAN: Data-enriching GAN for retrieving representative samples from a trained classifier. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, 3130–3137.
Arjovsky, M.; and Bottou, L. 2022. Towards principled methods for training generative adversarial networks. In International Conference on Learning Representations (ICLR).
Binici, K.; Aggarwal, S.; Pham, N. T.; Leman, K.; and Mitra, T. 2022. Robust and resource-efficient data-free knowledge distillation by generative pseudo replay. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 36, 6089–6096.
Chen, D.; Mei, J.-P.; Zhang, H.; Wang, C.; Feng, Y.; and Chen, C. 2022. Knowledge distillation with the reused teacher classifier. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 11933–11942.
Chen, D.; Mei, J.-P.; Zhang, Y.; Wang, C.; Wang, Z.; Feng, Y.; and Chen, C. 2021a. Cross-layer distillation with semantic calibration. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, 7028–7036.
Chen, H.; Guo, T.; Xu, C.; Li, W.; Xu, C.; Xu, C.; and Wang, Y. 2021b. Learning student networks in the wild. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 6428–6437.
Chen, H.; Wang, Y.; Xu, C.; Yang, Z.; Liu, C.; Shi, B.; Xu, C.; Xu, C.; and Tian, Q. 2019. Data-free learning of student networks. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 3514–3522.
Codella, N. C.; Gutman, D.; Celebi, M. E.; Helba, B.; Marchetti, M. A.; Dusza, S. W.; Kalloo, A.; Liopyris, K.; Mishra, N.; Kittler, H.; et al. 2018.
Skin lesion analysis toward melanoma detection: A challenge at the 2017 international symposium on biomedical imaging, hosted by the international skin imaging collaboration. In 15th IEEE International Symposium on Biomedical Imaging (ISBI), 168–172.
Creswell, A.; White, T.; Dumoulin, V.; Arulkumaran, K.; Sengupta, B.; and Bharath, A. A. 2018. Generative adversarial networks: An overview. IEEE Signal Processing Magazine (SPM), 35(1): 53–65.
Cui, K.; Yu, Y.; Zhan, F.; Liao, S.; Lu, S.; and Xing, E. P. 2023. KD-DLGAN: Data limited image generation via knowledge distillation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 3872–3882.
Darlow, L. N.; Crowley, E. J.; Antoniou, A.; and Storkey, A. J. 2018. CINIC-10 is not ImageNet or CIFAR-10. arXiv preprint arXiv:1810.03505.
Deng, J.; Dong, W.; Socher, R.; Li, L.-J.; Li, K.; and Fei-Fei, L. 2009. ImageNet: A large-scale hierarchical image database. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 248–255.
Fang, G.; Song, J.; Wang, X.; Shen, C.; Wang, X.; and Song, M. 2021. Contrastive model inversion for data-free knowledge distillation. arXiv preprint arXiv:2105.08584.
Goodfellow, I.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; and Bengio, Y. 2020. Generative adversarial networks. Communications of the ACM, 63(11): 139–144.
Guo, Z.; Yan, H.; Li, H.; and Lin, X. 2023. Class attention transfer based knowledge distillation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 11868–11877.
Hao, Z.; Guo, J.; Han, K.; Tang, Y.; Hu, H.; Wang, Y.; and Xu, C. 2023. One-for-all: Bridge the gap between heterogeneous architectures in knowledge distillation. Advances in Neural Information Processing Systems (NeurIPS), 36: 79570–79582.
Hao, Z.; Guo, J.; Wang, C.; Tang, Y.; Wu, H.; Hu, H.; Han, K.; and Xu, C. 2024.
Data-efficient large vision models through sequential autoregression. In Forty-first International Conference on Machine Learning (ICML).
He, K.; Zhang, X.; Ren, S.; and Sun, J. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 770–778.
Hinton, G.; Vinyals, O.; and Dean, J. 2015. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531.
Ho, J.; Jain, A.; and Abbeel, P. 2020. Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems (NeurIPS), 33: 6840–6851.
Hou, L.; Cao, Q.; Shen, H.; Pan, S.; Li, X.; and Cheng, X. 2022. Conditional GANs with auxiliary discriminative classifier. In International Conference on Machine Learning (ICML), 8888–8902. PMLR.
Hou, L.; Yuan, Z.; Huang, L.; Shen, H.; Cheng, X.; and Wang, C. 2021. Slimmable generative adversarial networks. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, 7746–7753.
Huang, J.; Cui, K.; Guan, D.; Xiao, A.; Zhan, F.; Lu, S.; Liao, S.; and Xing, E. 2022. Masked generative adversarial networks are data-efficient generation learners. Advances in Neural Information Processing Systems (NeurIPS), 35: 2154–2167.
Jiang, L.; Dai, B.; Wu, W.; and Loy, C. C. 2021. Deceive D: Adaptive pseudo augmentation for GAN training with limited data. Advances in Neural Information Processing Systems (NeurIPS), 34: 21655–21667.
Kingma, D. P.; and Welling, M. 2013. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114.
Krizhevsky, A. 2009. Learning multiple layers of features from tiny images. Master's thesis, University of Toronto.
Le, Y.; and Yang, X. 2015. Tiny ImageNet visual recognition challenge. CS 231N, 7(7): 3.
Li, W.; Wang, L.; Li, W.; Agustsson, E.; and Van Gool, L. 2017. WebVision database: Visual learning and understanding from web data. arXiv preprint arXiv:1708.02862.
Li, Z.; Li, X.; Yang, L.; Zhao, B.; Song, R.; Luo, L.; Li, J.; and Yang, J. 2023. Curriculum temperature for knowledge distillation. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 37, 1504–1512.
Mei, K.; and Patel, V. 2023. VIDM: Video implicit diffusion models. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 37, 9117–9125.
Micaelli, P.; and Storkey, A. J. 2019. Zero-shot knowledge transfer via adversarial belief matching. Advances in Neural Information Processing Systems (NeurIPS), 32.
Miles, R.; and Mikolajczyk, K. 2024. Understanding the role of the projector in knowledge distillation. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, 4233–4241.
Mirza, M.; and Osindero, S. 2014. Conditional generative adversarial nets. arXiv preprint arXiv:1411.1784.
Park, W.; Kim, D.; Lu, Y.; and Cho, M. 2019. Relational knowledge distillation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 3967–3976.
Peng, B.; Jin, X.; Liu, J.; Li, D.; Wu, Y.; Liu, Y.; Zhou, S.; and Zhang, Z. 2019. Correlation congruence for knowledge distillation. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 5007–5016.
Rangwani, H.; Mopuri, K. R.; and Babu, R. V. 2021. Class balancing GAN with a classifier in the loop. In Uncertainty in Artificial Intelligence (UAI), 1618–1627. PMLR.
Steerneman, T. 1983. On the total variation and Hellinger distance between signed measures; an application to product measures. Proceedings of the American Mathematical Society (AMS), 88(4): 684–688.
Tang, J.; Chen, S.; Niu, G.; Sugiyama, M.; and Gong, C. 2023. Distribution shift matters for knowledge distillation with webly collected images. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV).
Tran, M.-T.; Le, T.; Le, X.-M.; Harandi, M.; Tran, Q. H.; and Phung, D. 2024.
NAYER: Noisy layer data generation for efficient and effective data-free knowledge distillation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 23860–23869.
Tschandl, P.; Rosendahl, C.; and Kittler, H. 2018. The HAM10000 dataset, a large collection of multi-source dermatoscopic images of common pigmented skin lesions. Scientific Data, 5(1): 1–9.
Wang, Y.; Qian, B.; Liu, H.; Rui, Y.; and Wang, M. 2024a. Unpacking the gap box against data-free knowledge distillation. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI).
Wang, Y.; Yang, D.; Chen, Z.; Liu, Y.; Liu, S.; Zhang, W.; Zhang, L.; and Qi, L. 2024b. De-confounded data-free knowledge distillation for handling distribution shifts. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 12615–12625.
Wang, Y.; Zhang, J.; and Wang, Y. 2024. Do generated data always help contrastive learning? In International Conference on Learning Representations (ICLR).
Yang, Z.; Zeng, A.; Li, Z.; Zhang, T.; Yuan, C.; and Li, Y. 2023. From knowledge distillation to self-knowledge distillation: A unified approach with normalized loss and customized soft labels. arXiv preprint arXiv:2303.13005.
Yin, H.; Molchanov, P.; Alvarez, J. M.; Li, Z.; Mallya, A.; Hoiem, D.; Jha, N. K.; and Kautz, J. 2020. Dreaming to distill: Data-free knowledge transfer via DeepInversion. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 8715–8724.
Yu, S.; Chen, J.; Han, H.; and Jiang, S. 2023. Data-free knowledge distillation via feature exchange and activation region constraint. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 24266–24275.
Zhao, B.; Cui, Q.; Song, R.; Qiu, Y.; and Liang, J. 2022a. Decoupled knowledge distillation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 11953–11962.
Zhao, H.; Sun, X.; Dong, J.; Manic, M.; Zhou, H.; and Yu, H. 2022b. Dual discriminator adversarial distillation for data-free model compression. International Journal of Machine Learning and Cybernetics (IJMLC), 13(5): 1213–1230.
Zhao, S.; Song, J.; and Ermon, S. 2019. InfoVAE: Balancing learning and inference in variational autoencoders. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, 5885–5892.