# Model Conversion via Differentially Private Data-Free Distillation

Bochao Liu1,2, Pengju Wang1,2, Shikun Li1,2, Dan Zeng3, Shiming Ge1,2
1Institute of Information Engineering, Chinese Academy of Sciences, Beijing 100085, China
2School of Cyber Security, University of Chinese Academy of Sciences, Beijing 100049, China
3School of Communication and Information Engineering, Shanghai University, Shanghai 200444, China
{liubochao, wangpengju, lishikun, geshiming}@iie.ac.cn, dzeng@shu.edu.cn

While massive valuable deep models trained on large-scale data have been released to facilitate the artificial intelligence community, they may encounter attacks in deployment that lead to privacy leakage of the training data. In this work, we propose a learning approach termed differentially private data-free distillation (DPDFD) for model conversion, which converts a pretrained model (teacher) into its privacy-preserving counterpart (student) via an intermediate generator without access to the training data. The learning collaborates three parties in a unified way. First, massive synthetic data are generated with the generator. Then, they are fed into the teacher and student to compute differentially private gradients by normalizing the gradients and adding noise before performing descent. Finally, the student is updated with these differentially private gradients, and the generator is updated by taking the student as a fixed discriminator, in an alternating manner. In addition to a privacy-preserving student, the generator can generate synthetic data in a differentially private way for other downstream tasks. We theoretically prove that our approach guarantees differential privacy and convergence. Extensive experiments clearly demonstrate that our approach significantly outperforms other differentially private generative approaches.

Shiming Ge is the corresponding author (geshiming@iie.ac.cn). Supplementary material is available at: https://arxiv.org/abs/2304.12528

1 Introduction

The success of deep neural networks in a wide array of applications [Jia et al., 2019a; Jia et al., 2019b; Ye et al., 2020] greatly owes to the open source of massive models. However, a major problem is that the training data of these models often contain a large amount of sensitive information that can be easily recovered with only a few accesses to the models [Fredrikson et al., 2015; Yang et al., 2019]. How to protect such private information while maintaining the model performance has attracted a lot of attention.

Differential privacy (DP) [Dwork et al., 2006] is a common technique for protecting privacy. [Abadi et al., 2016] guaranteed that the model was differentially private with respect to the training data by clipping gradients and adding Gaussian noise to the gradients. However, the model accuracy decreases severely when the privacy requirements increase, so the method cannot be applied directly to the case where only pretrained models are given. [Papernot et al., 2017; Papernot et al., 2018] proposed a semi-supervised learning framework called PATE to reduce the impact of the DP noise by leveraging the noisy aggregation of multiple teacher models trained directly on the private data. It is possible to train a privacy-preserving student model using the PATE framework given only sensitive teacher models, but it is difficult to find a suitable unlabeled public dataset for the distillation process.
In the meantime, an independent line of research concerning model compression shows that some data-free knowledge distillation (DFKD) approaches [Chen et al., 2019; Zhu et al., 2021; Choi et al., 2020] can achieve performance similar to vanilla training with only a teacher model. The data used for the distillation process are generated by a generator, which could be a remedy for the above problem of finding a suitable public dataset. The generators of such methods mainly learn the data distribution rather than the image details, which intuitively also provides a degree of privacy protection. [Ge et al., 2023] has shown that it is possible to leverage the power of data-free knowledge distillation to train a privacy-preserving student model without access to the original dataset. But there is still a long way to go to fully exploit this intuition.

Inspired by the above observations, in this paper we propose a model conversion approach with Differentially Private Data-Free Distillation (DPDFD) to facilitate model releasing by distilling a pretrained model as teacher into a differentially private student. Specifically, our DPDFD combines DFKD and DP: it applies DFKD to distill private knowledge and a DP mechanism A_{C,σ} to guarantee privacy. The objective is to enable an effective conversion that achieves strong privacy protection with minimum accuracy loss when only private models (teachers) are given. As shown in Fig. 1, we first generate massive synthetic data with a generator. Then, we feed the synthetic data into the teacher model and student model to compute the loss. Differentially private gradients are calculated by applying the DP mechanism A_{C,σ}. Finally, we update the student with these gradients and update the generator by taking the student as a fixed discriminator. In particular, we achieve DP by performing normalization on the gradients of the student outputs and adding Gaussian noise to the gradients during student learning. The reason for performing normalization instead of clipping is that it retains the relative size information of the gradients and achieves better performance with a smaller norm bound. The reason for adding Gaussian noise to the gradients of the student outputs is that they have a lower dimension compared to other gradients. Both the smaller norm bound and the lower-dimensional gradients make it easier to balance performance and privacy. In addition, the DP mechanism A_{C,σ} also ensures differentially private training of the generator according to the post-processing mechanism. We can use the generator to generate data for other downstream tasks if needed. We also provide privacy and convergence analysis for our DPDFD in theory. Furthermore, DPDFD can be extended to the multi-model case, which aggregates multiple sensitive teachers into a privacy-preserving student model.

In summary, our DPDFD can effectively convert a sensitive teacher model to a privacy-preserving student model through three key components. First, performing normalization instead of the clipping that is usually used in other approaches retains information about the relative size of the gradients. Second, achieving DP by adding noise to the gradients of the lower-dimensional outputs makes it easier to balance performance and privacy. Third, synthetic data generated by the generator have a distribution similar to the training data of the teacher model.
In this way, we can convert the sensitive model into a privacy-preserving student with minimum accuracy loss. Our major contributions are threefold: 1) we propose a model conversion approach to convert a pretrained sensitive model into a privacy-preserving model for secure releasing; 2) we provide privacy analysis and convergence analysis for our approach in theory; 3) we conduct extensive experiments and analysis to demonstrate the scalability and effectiveness of our approach.

2 Related Works

The approach we propose in this paper aims to train privacy-preserving student models by distilling knowledge from given sensitive model(s) via differentially private data-free distillation. Therefore, we briefly review the related works from two aspects, namely differentially private learning and data-free knowledge distillation.

2.1 Differentially Private Learning

Differentially private learning aims to ensure that the learned model is differentially private with respect to the private data. [Abadi et al., 2016] proposed a differentially private stochastic gradient descent (DPSGD) algorithm which achieves DP by clipping and adding noise to the gradients of all parameters during the training process. However, the model performance degrades severely under strong privacy requirements. [Papernot et al., 2017] later proposed PATE, which uses semi-supervised learning to transfer the knowledge of a teacher ensemble to the student by means of a noisy aggregation. It can reduce the impact of noise on performance by increasing the number of teacher models. However, it is difficult to find an unlabeled public dataset that has a similar distribution to the training data of the teachers. Some works aim to train differentially private generators that generate data with a distribution similar to the private data while preserving privacy. [Xie et al., 2018] applied DPSGD to the training process of Generative Adversarial Networks (GAN) [Goodfellow et al., 2014] to obtain a differentially private generator. [Chen et al., 2020] suggested that it is not necessary to clip and add noise to all gradients; it suffices to achieve DP in the back-propagation from the discriminator to the generator. [Cao et al., 2021] applied differentially private optimal transport theory to train generators. [Chen et al., 2022] proposed an energy-guided network trained on sanitized data to indicate the direction of the true data distribution via a Langevin Markov chain Monte Carlo sampling method. In this paper, we achieve a better balance between performance and privacy by applying normalization instead of clipping and by exploiting the post-processing property of DP.

2.2 Data-Free Knowledge Distillation

Data-free knowledge distillation is a class of approaches that aims to train a student model with a pretrained teacher model without access to the original training data. It uses the information extracted from the teacher model to synthesize the data used in the distillation process. [Srinivas and Babu, 2015] proposed to directly merge similar neurons in fully-connected layers, which cannot be applied to convolutional layers and networks where the detailed architecture and parameter information is unknown. [Lopes et al., 2017] first proposed data-free knowledge distillation, using the distribution information of the original data to reconstruct the synthetic data used in the distillation process. [Deng and Zhang, 2021] used this approach to generate synthetic graph data for data-free knowledge distillation on graph classification tasks.
[Zhu et al., 2021] modified and applied it to federated settings. [Chen et al., 2019] proposed a novel framework named DAFL for training the student model by exploiting a GAN: it uses the teacher model as the discriminator to train a generator. [Choi et al., 2020] proposed matching statistics from the batch normalization layers between the generated data and the original data in the teacher, which makes the generated data closer to the original data. [Fang et al., 2022] proposed FastDFKD, which applies the idea of meta-learning to the training process to accelerate data synthesis. Inspired by these approaches, it is possible to convert a sensitive model into a privacy-preserving model without access to the original data.

3 Preliminaries

Here we first provide some background knowledge about differential privacy. We then draw connections between the definitions and theorems introduced here and our DP analysis of DPDFD later. The following definition explains how DP provides rigorous privacy guarantees.

Definition 1 (Differential Privacy). A randomized mechanism A with domain R is (ε, δ)-differentially private if for all O ⊆ R and any adjacent datasets D and D′:

Pr[A(D) ∈ O] ≤ e^ε Pr[A(D′) ∈ O] + δ, (1)

where adjacent datasets D and D′ differ from each other in only one training example. ε is the privacy budget (the smaller the better), and δ is the failure probability.

Figure 1: Overview of our differentially private data-free distillation approach. The approach learns to convert a pretrained model ϕ_t into a privacy-preserving student ϕ_s via an intermediate generator ϕ_g. The learning is performed to collaborate three parties in a unified way. First, the generator generates massive data. Then, these data are fed into the teacher and student models to calculate the gradients g_s. Finally, the student and generator are updated with differentially private gradients g̃_s, which are computed by applying the DP mechanism A_{C,σ} to g_s. Here, C is the norm bound and N(0, σ²) is Gaussian noise with mean 0 and variance σ².

[Mironov, 2017] proposed a variant, Rényi differential privacy (RDP), which computes privacy with the Rényi divergence. We review the definition of RDP and its connection to DP.

Definition 2 (Rényi Differential Privacy). A randomized mechanism A is (λ, ε)-RDP with λ > 1 if for any adjacent datasets D and D′:

D_λ(A(D) || A(D′)) = (1 / (λ − 1)) log E_{x∼A(D)} [ (Pr[A(D) = x] / Pr[A(D′) = x])^{λ−1} ] ≤ ε. (2)

Different from DP, RDP has a more friendly composition theorem and can be applied in both data-independent and data-dependent settings. It supports a tighter composition of the privacy budget, which can be described as follows: for a sequence of mechanisms A_1, ..., A_i, ..., A_k, where A_i is (λ, ε_i)-RDP, their composition A_1 ∘ ... ∘ A_i ∘ ... ∘ A_k is (λ, Σ_i ε_i)-RDP. In our case, A_i represents one query to the teacher model. Moreover, the connection between RDP and DP can be described as follows:

Theorem 1 (Convert RDP to DP). A (λ, ε)-RDP mechanism A also satisfies (ε + log((λ − 1)/λ) − (log δ + log λ)/(λ − 1), δ)-DP.

To provide DP guarantees, we exploit the post-processing property [Dwork and Roth, 2014] described as follows:

Theorem 2 (Post-processing). If a mechanism A satisfies (ε, δ)-DP, the composition of a data-independent function F with A is also (ε, δ)-DP.
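To make this accounting concrete, the snippet below sketches how per-query RDP costs could be composed and converted to a final (ε, δ)-DP guarantee following Theorem 1. It is an illustration rather than the implementation used in our experiments; the per-query cost and the Rényi order λ are placeholder values, and in practice the bound would be minimized over λ.

```python
import math

def compose_rdp(per_query_eps, lam):
    """RDP composition: k mechanisms, each (lam, eps_i)-RDP, compose to
    a (lam, sum_i eps_i)-RDP mechanism."""
    assert lam > 1
    return sum(per_query_eps)

def rdp_to_dp(rdp_eps, lam, delta):
    """Theorem 1: a (lam, eps)-RDP mechanism also satisfies
    (eps + log((lam-1)/lam) - (log(delta) + log(lam)) / (lam - 1), delta)-DP."""
    return rdp_eps + math.log((lam - 1) / lam) - (math.log(delta) + math.log(lam)) / (lam - 1)

# Example with placeholder numbers: 1000 queries to the teacher, each
# consuming 1e-3 RDP at order lam = 32, reported at delta = 1e-5.
total_rdp = compose_rdp([1e-3] * 1000, lam=32)
eps_dp = rdp_to_dp(total_rdp, lam=32, delta=1e-5)
print(f"({eps_dp:.3f}, 1e-5)-DP")
```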
4 Proposed Approach

4.1 Problem Formulation

Given a sensitive teacher model ϕ_t with parameters θ_t, the objective is to convert it into a privacy-preserving student model ϕ_s with parameters θ_s that does not reveal data privacy and performs similarly to the teacher model. To achieve that, we introduce a differentially private data-free distillation approach. We first sample a batch of noise vectors z = {z_i}_{i=1}^B and feed them into the generator ϕ_g with parameters θ_g to generate massive synthetic data D = {d_i}_{i=1}^B. We enter them into the teacher model and student model to calculate the loss L_T(ϕ_t(θ_t; D), ϕ_s(θ_s; D)) and then calculate updated student outputs ỹ_s that achieve DP through a differentially private mechanism A_{C,σ}. Finally, we update the student with the loss L_S(ϕ_s(θ_s; D), ỹ_s). Thus, the conversion process can be formulated as minimizing an energy function E:

E(θ_s; θ_t) = E(ϕ_s(θ_s; D), ỹ_s) = E(ϕ_s(θ_s; D), ϕ_s(θ_s; D) − γ A_{C,σ}(g)), (3)

where g = ∂L_T(ϕ_t(θ_t; D), ϕ_s(θ_s; D)) / ∂ϕ_s(θ_s; D) and γ is the learning rate. We suppress the risk of privacy leakage with the DP mechanism A_{C,σ}, which normalizes the gradients with norm bound C and adds Gaussian noise with variance σ². We introduce how the differentially private mechanism A_{C,σ} protects privacy in detail in the following.

4.2 Our Solution: DPDFD

A student model distilled directly from a sensitive teacher model may leak privacy, and another main problem is that we do not have access to the original dataset. Thus, we aim to convert the sensitive teacher model into a privacy-preserving student model with similar performance to the sensitive model in a data-free manner. Unlike [Abadi et al., 2016], which requires clipping and adding Gaussian noise to the gradients of all parameters, thanks to the post-processing property of DP we only need to perform normalization and add noise to the gradients of the student's outputs to calculate new differentially private outputs. As shown in Fig. 1 and Alg. 1, we first sample a batch of noise vectors z = {z_i}_{i=1}^B and feed them into the generator ϕ_g with parameters θ_g to obtain massive synthetic data D = ϕ_g(z). Then we enter the synthetic data D into the teacher and student to compute the loss L_T(ϕ_t(θ_t; D), ϕ_s(θ_s; D)). To get better results, we treat argmax(ϕ_t(θ_t; D)) as the target labels and calculate the distillation loss L_T in the same way as [Zhao et al., 2022]. After that, we achieve DP with the mechanism A_{C,σ}, which can be described as follows:

A_{C,σ}(g) = (1/B) ( Σ_{i=1}^B C g_i / (||g_i||_2 + e) + N(0, σ²C²I) ), (4)

and generate new differentially private outputs ỹ_s.

Algorithm 1 DPDFD
Input: training iterations T, loss functions L_T, L_S, L_G, noise scale σ, sample size B, learning rates γ, γ_s, γ_g, gradient norm bound C, a positive stability constant e
1: for t ∈ [T] do
2:   Sample B noise samples z = {z_i}_{i=1}^B
3:   Generate B synthetic samples D = {ϕ_g(θ_g; z_i)}_{i=1}^B
4:   for each synthetic sample d_i = ϕ_g(θ_g; z_i) do
5:     Compute loss L_T(ϕ_t(θ_t; d_i), ϕ_s(θ_s; d_i))
6:     Compute the gradient g_i = ∂L_T / ∂ϕ_s(θ_s; d_i)
7:     Normalize the gradient ḡ_i = C g_i / (||g_i||_2 + e)
8:   end for
9:   Add noise: g̃ = (1/B) ( Σ_{i=1}^B ḡ_i + N(0, σ²C²I) )
10:  Compute differentially private outputs of the student ỹ_s = ϕ_s(θ_s; D) − γ g̃
11:  Compute loss L_S(ϕ_s(θ_s; D), ỹ_s)
12:  Update the student θ_s^{t+1} = θ_s^t − γ_s ∂L_S / ∂θ_s^t
13:  Compute loss L_G(ϕ_s(θ_s; D))
14:  Update the generator θ_g^{t+1} = θ_g^t − γ_g ∂(L_S + L_G) / ∂θ_g^t
15: end for
16: return θ_s and θ_g
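The following PyTorch-style sketch illustrates how Eq. (4) and lines 4-10 of Alg. 1 could be realized for one batch of synthetic data. It is a simplified illustration under our notation rather than the released implementation; in particular, the distillation loss L_T is reduced to a cross-entropy against the teacher's argmax labels.

```python
import torch
import torch.nn.functional as F

def dp_student_targets(teacher_logits, student_logits, C, sigma, gamma, e=0.01):
    """Sketch of A_{C,sigma}: per-sample gradients of the (simplified) distillation
    loss w.r.t. the student outputs are normalized to norm bound C, summed,
    noised and averaged, then used to form differentially private targets
    (Alg. 1, lines 4-10)."""
    targets = teacher_logits.argmax(dim=1)                   # pseudo labels from the teacher
    per_sample = []
    for i in range(student_logits.size(0)):
        logit_i = student_logits[i:i + 1].detach().requires_grad_(True)
        loss_i = F.cross_entropy(logit_i, targets[i:i + 1])  # simplified L_T for sample d_i
        (g_i,) = torch.autograd.grad(loss_i, logit_i)        # g_i = dL_T / d phi_s(theta_s; d_i)
        per_sample.append(C * g_i / (g_i.norm() + e))        # normalization with bound C
    g_sum = torch.cat(per_sample, dim=0).sum(dim=0)          # sum over the batch
    noise = sigma * C * torch.randn_like(g_sum)              # N(0, sigma^2 C^2 I)
    g_tilde = (g_sum + noise) / student_logits.size(0)       # Eq. (4)
    # Differentially private student targets; everything downstream is post-processing.
    return (student_logits - gamma * g_tilde).detach()
```

The returned targets ỹ_s are the only quantities that depend on the teacher, so the subsequent student and generator updates are post-processing of a differentially private output.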
Compared with clipping, normalization can achieve higher accuracy at a smaller norm bound C. This is because when C is small, clipping makes the gradients lose their variability, whereas normalization retains the relative size relationship of the gradients. Another important point is that the smaller C is, the smaller the privacy budget consumed by each query to the teacher model, which is exactly what we want. So we choose a small C together with the normalization operation to obtain a lower privacy budget and better performance. Finally, we compute the loss L_S(ϕ_s(θ_s; D), ỹ_s) and update the student model with it. This loss function can take many forms, and we use the cross-entropy loss in our experiments.

In order to make the distribution of the synthetic data closer to the private data and to balance the classes of the synthetic data, we add an additional loss L_G when updating the generator. It can be formulated as:

L_G(ϕ_s(θ_s; D)) = ℓ(ϕ_s(θ_s; D), argmax(ϕ_s(θ_s; D))) + α ϕ_s(θ_s; D) log(ϕ_s(θ_s; D)) + β ||ϕ_s(θ′_s; D)||_2, (5)

where α and β are tuning parameters that balance the effect of the three terms, and we set both of them to 1. The first term ℓ(·) is a cross-entropy function that measures the one-hot classification loss, which enforces the synthetic data to have a distribution similar to the private data. The second term is the information entropy loss, which measures the class balance of the generated data. The third term uses the l2-norm ||·||_2 to measure the activation loss, since the features extracted by the student that correspond to the output before the fully-connected layer tend to receive higher activation values if the input data are real rather than random vectors; here θ′_s ⊂ θ_s denotes the student's backbone parameters. We update the student and generator alternately in our experiments.
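A compact sketch of how the three terms of L_G in Eq. (5) could be computed from the student's outputs on a batch of synthetic data is given below. It is an illustration under our notation with α = β = 1 by default, not the released code; the backbone features are assumed to be the pooled activations before the fully-connected layer.

```python
import torch
import torch.nn.functional as F

def generator_loss(student_logits, student_features, alpha=1.0, beta=1.0):
    """Sketch of L_G in Eq. (5) for one batch of synthetic data."""
    # Term 1: one-hot cross-entropy against the student's own argmax predictions.
    pseudo_labels = student_logits.argmax(dim=1)
    ce = F.cross_entropy(student_logits, pseudo_labels)

    # Term 2: information entropy term on the batch-averaged class distribution;
    # minimizing sum(p_bar * log(p_bar)) pushes the classes toward balance.
    p_bar = F.softmax(student_logits, dim=1).mean(dim=0)
    ie = (p_bar * torch.log(p_bar + 1e-8)).sum()

    # Term 3: activation term measured with an l2-norm on the backbone features.
    act = student_features.norm(p=2, dim=1).mean()

    return ce + alpha * ie + beta * act
```

In Alg. 1 (line 14) this loss is added to L_S for the generator update.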
In a nutshell, the trained student and generator satisfy DP because the training process can be seen as post-processing, given that ỹ_s is a differentially private result. Overall, DPDFD can be formally proved to be DP in Theorem 3.

Theorem 3. DPDFD guarantees (2C²nBTλ/σ² + log((λ − 1)/λ) − (log δ + log λ)/(λ − 1), δ)-DP for all λ > 1 and δ ∈ (0, 1).

4.3 Convergence Analysis

To analyze the convergence of our DPDFD, we follow the standard assumptions in the SGD literature [Allen-Zhu, 2017; Bottou et al., 2018; Ghadimi and Lan, 2013], with an additional assumption on the gradient noise. We assume that L_T has a lower bound L_* and that L_T is κ-smooth, i.e., for all x, y there is a non-negative constant κ such that L_T(x) − L_T(y) ≤ ∇L_T(x)ᵀ(x − y) + (κ/2)||x − y||². The additional assumption is that (g_r − g) ∼ N(0, ζ²), where g_r is the ideal gradient of L_T and g is the gradient we compute as an unbiased estimate of g_r. Then, according to [Bu et al., 2022], in our case F(χ) admits an upper bound governed by

2(L_0 − L_*) + 2Tκγ²C²(1 + σ²d), (6)

where χ = min_{0≤t≤T} E(||g_r^t||), d is a constant, L_0 is the initial loss, and F(·) results only from the normalization operation, as in [Bu et al., 2022], and does not affect the monotonicity of its input. We simply set the learning rate γ ∝ 1/√T, and the gradients gradually tend to 0 as T increases.

4.4 Discussion

Extension to Multi-Model Ensemble. Our DPDFD can be extended to the case of multiple sensitive models. Different from the single-teacher case, which reduces the impact of the DP noise by averaging a batch of gradients, the multi-model case achieves it by averaging the gradients from the given sensitive models. In particular, given multiple teacher models {ϕ_t^j}_{j=1}^n, we first sample noise vectors and generate massive synthetic data as in the single-teacher case. For each data point d_i, we enter it into the teacher models {ϕ_t^j(θ_t^j; d_i)}_{j=1}^n and the student model ϕ_s(θ_s; d_i) and compute the losses {L_T(ϕ_t^j(θ_t^j; d_i), ϕ_s(θ_s; d_i))}_{j=1}^n. Then, we compute the gradients g_i^j = ∂L_T(ϕ_t^j(θ_t^j; d_i), ϕ_s(θ_s; d_i)) / ∂ϕ_s(θ_s; d_i) for j = 1, ..., n from the losses. After that, we normalize them with norm bound C and add noise to them to get the differentially private gradient g̃_i = (1/n) ( Σ_{j=1}^n C g_i^j / (||g_i^j||_2 + e) + N(0, σ²C²I) ). Next we compute an updated student output ỹ_s^i = ϕ_s(θ_s; d_i) − γ g̃_i. Finally, we update the student and generator in the same way as Alg. 1. In this way, we can aggregate multiple sensitive teacher models into a privacy-preserving student model.

Direct Training on Private Data. The proposed DP mechanism A_{C,σ} is a key building block of DPDFD and can also be applied to situations where private data are available. Given a private dataset {x, y}, we input the data x into the model and calculate the cross-entropy loss L(ϕ(θ; x), y). After that, we run the same procedures to achieve DP and update the model as in Alg. 1, except that we do not need to update the generator. In this way, we can train a privacy-preserving model with direct access to private data.
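Both extensions reuse the mechanism A_{C,σ}. As an illustration, the sketch below shows how the multi-model variant could form the differentially private target for a single synthetic sample by averaging normalized per-teacher gradients; as before, this is a simplified sketch under our notation (with L_T reduced to a cross-entropy against each teacher's argmax label), not the released implementation.

```python
import torch
import torch.nn.functional as F

def multi_teacher_dp_target(teacher_logits_list, student_logit, C, sigma, gamma, e=0.01):
    """Sketch of the multi-model extension for one synthetic sample d_i:
    per-teacher gradients w.r.t. the student output are normalized to norm
    bound C, averaged over the n teachers, and noised before forming the
    differentially private target y_s^i."""
    n = len(teacher_logits_list)
    grads = []
    for teacher_logit in teacher_logits_list:
        logit = student_logit.detach().requires_grad_(True)
        target = teacher_logit.argmax(dim=1)              # pseudo label from teacher j
        loss = F.cross_entropy(logit, target)             # simplified L_T for teacher j
        (g_j,) = torch.autograd.grad(loss, logit)         # g_i^j
        grads.append(C * g_j / (g_j.norm() + e))          # normalization with bound C
    g_sum = torch.stack(grads, dim=0).sum(dim=0)
    noise = sigma * C * torch.randn_like(g_sum)           # N(0, sigma^2 C^2 I)
    g_tilde = (g_sum + noise) / n                         # average over the n teachers
    return (student_logit - gamma * g_tilde).detach()     # y_s^i, used to train the student
```

Averaging over the n teachers plays the role that batch averaging plays in the single-teacher case, so the impact of the DP noise shrinks as more teachers are available.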
5 Experiments

In this section, we present the experimental evaluation of DPDFD for converting a sensitive model into a differentially private model with high performance.

5.1 Experimental Setup

Datasets. We conduct our experiments on 7 image datasets, including MNIST [LeCun et al., 1998], Fashion-MNIST (FMNIST) [Xiao et al., 2017], CIFAR10 [Krizhevsky, 2009], CelebA [Liu et al., 2015], PathMNIST [Yang et al., 2021], COVIDx and ImageNet [Deng et al., 2009]. We created CelebA-H and CelebA-G based on CelebA; they are classification datasets with hair color (black/blonde/brown) and gender as the label, respectively. COVIDx is a classification dataset for COVID.

Baselines. We perform comparisons with 14 state-of-the-art benchmarks, including 9 explicit approaches that train classifiers on generated data (DP-GAN [Xie et al., 2018], GS-WGAN [Chen et al., 2020], PATE-GAN [Jordon et al., 2019], DP-MERF [Harder et al., 2021], P3GM [Takagi et al., 2021], DataLens [Wang et al., 2021], G-PATE [Long et al., 2021], DPSH [Cao et al., 2021], DPGEN [Chen et al., 2022]) and 5 implicit approaches that train classifiers with differentially private learning (DPSGD [Abadi et al., 2016], TSADP [Papernot et al., 2021], TOPAGG [Wang et al., 2021], GM-DP [McMahan et al., 2018], DGD [Ge et al., 2023]). To make the comparisons fair, we take the results from the original papers or run the official codes.

Networks. We adopt several popular network architectures as our teacher models, including AlexNet [Krizhevsky et al., 2012], VGGNet [Simonyan and Zisserman, 2015], ResNet [He et al., 2016], Wide ResNet [Zagoruyko and Komodakis, 2017], DenseNet [Huang et al., 2016], MobileNet [Howard et al., 2017], ShuffleNet [Zhang et al., 2018], GoogLeNet [Szegedy et al., 2015] and ViT [Dosovitskiy et al., 2020]. For VGGNet, we use the 19-layer network with BN. For ResNet, we use 50-layer networks for ImageNet and 34-layer networks for the others. For Wide ResNet, we use 50-layer networks as the teacher. For DenseNet, we use 161-layer networks with a growth rate of 24. For ViT, we use the same architecture as [Dosovitskiy et al., 2020]. For the student model, we use a 34-layer ResNet for ImageNet and the same network as DataLens for the others.

5.2 Experimental Results

In this section, we compare DPDFD with the 14 baselines and evaluate it on different networks to verify its effectiveness. We first compare the model performance of DPDFD and other state-of-the-art approaches. Then, we conduct experiments on ImageNet with different networks under different privacy budgets. We show that our DPDFD is scalable and outperforms all baselines.

Comparisons with 9 Explicit Baselines. To demonstrate the effectiveness of our DPDFD, we conduct comparisons with 9 baselines under different privacy budgets and report the results in Tab. 1. All approaches are under a low failure probability δ = 10^-5. We can see that our DPDFD achieves the highest performance under the same privacy budget. In particular, when ε = 1, our DPDFD achieves an accuracy of 0.9512 on MNIST and 0.8386 on FMNIST, corresponding to accuracy drops of only 0.0409 and 0.0716 with respect to the teachers, while most of the other approaches fail to reach an accuracy of 0.8000. This shows that our approach has the best privacy-preserving ability and minimal accuracy drop. Even for high-dimensional datasets like CelebA, whose dimensionality is about 16 times larger than MNIST, all approaches suffer an accuracy drop with respect to their counterparts under a high privacy budget, while our DPDFD still delivers the highest test accuracy. This demonstrates that our approach is also effective for high-dimensional datasets.

| Dataset | Teacher | ε | DP-GAN | PATE-GAN | G-PATE | GS-WGAN | DP-MERF | P3GM | DataLens | DPSH | DPGEN | DPDFD |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| MNIST | 0.9921 | 1 | 0.4036 | 0.4168 | 0.5810 | 0.1432 | 0.6367 | 0.7369 | 0.7123 | N/A | 0.9046 | 0.9512 |
| | | 10 | 0.8011 | 0.6667 | 0.8092 | 0.8075 | 0.6738 | 0.7981 | 0.8066 | 0.8320 | 0.9357 | 0.9751 |
| FMNIST | 0.9102 | 1 | 0.1053 | 0.4222 | 0.5567 | 0.1661 | 0.5862 | 0.7223 | 0.6478 | N/A | 0.8283 | 0.8386 |
| | | 10 | 0.6098 | 0.6218 | 0.6934 | 0.6579 | 0.6162 | 0.7480 | 0.7061 | 0.7110 | 0.8784 | 0.8988 |
| CelebA-G | 0.9353 | 1 | 0.5330 | 0.6068 | 0.6702 | 0.5901 | 0.5936 | 0.5673 | 0.7058 | N/A | 0.6999 | 0.7237 |
| | | 10 | 0.5211 | 0.6535 | 0.6897 | 0.6136 | 0.6082 | 0.5884 | 0.7287 | 0.7630 | 0.8835 | 0.8992 |
| CelebA-H | 0.8868 | 1 | 0.3447 | 0.3789 | 0.4985 | 0.4203 | 0.4413 | 0.4532 | 0.6061 | N/A | 0.6614 | 0.7839 |
| | | 10 | 0.3920 | 0.3900 | 0.6217 | 0.5225 | 0.4489 | 0.4858 | 0.6224 | N/A | 0.8147 | 0.8235 |

Table 1: Accuracy comparisons with 9 explicit approaches under different privacy budget ε (δ = 10^-5).

Comparisons with 5 Implicit Baselines. In addition, we conduct experimental comparisons with 5 implicit baselines on MNIST and CIFAR10. The results are shown in Tab. 2, where our DPDFD achieves the highest accuracy of 0.9512 and 0.8601 under the lowest privacy budgets of 1.0 and 2.0. The main reason is that we use normalization instead of traditional clipping: in this way, the differentially private gradients obtained after A_{C,σ} retain more information (the relative size between gradients).

| Approach | ε | MNIST | ε | CIFAR10 |
|---|---|---|---|---|
| DPSGD | 2.0 | 0.9500 | 2.0 | 0.6623 |
| TSADP | 1.0 | 0.7991 | 7.5 | 0.6620 |
| TOPAGG | 1.0 | 0.9465 | 2.0 | 0.8518 |
| GM-DP | 1.0 | 0.9508 | 2.0 | 0.8597 |
| DGD | 1.0 | 0.7360 | 3.0 | 0.7365 |
| DPDFD | 1.0 | 0.9512 | 2.0 | 0.8601 |

Table 2: Accuracy comparisons with 5 implicit approaches under different privacy budget ε.
Evaluation on Open Source Networks. To demonstrate the scalability of our DPDFD, we apply it to convert several popular networks pretrained on ImageNet under different privacy budgets ε. These teacher models are taken directly from the PyTorch model zoo. The results are shown in Tab. 3.

| Network | Teacher | ε = 1 | ε = 2 | ε = 5 |
|---|---|---|---|---|
| AlexNet | 0.5655 | 0.2756 | 0.5124 | 0.5218 |
| VGGNet | 0.7423 | 0.3419 | 0.6381 | 0.6602 |
| ResNet | 0.7602 | 0.3823 | 0.6499 | 0.6781 |
| Wide ResNet | 0.7848 | 0.2982 | 0.5997 | 0.7117 |
| DenseNet | 0.7765 | 0.4552 | 0.6612 | 0.7029 |
| MobileNet | 0.7186 | 0.3830 | 0.6203 | 0.6424 |
| ShuffleNet | 0.6953 | 0.4964 | 0.6313 | 0.6652 |
| GoogLeNet | 0.6246 | 0.2987 | 0.5717 | 0.5858 |
| ViT | 0.8140 | 0.3142 | 0.6880 | 0.7240 |

Table 3: Accuracy on ImageNet with different networks under different privacy budget ε (δ = 10^-5).

We find that the effect of the network architecture is sometimes greater than the effect of the public teacher model's accuracy when ε is small. In particular, when ε = 1, the student model reaches the highest accuracy of 0.4964 when the teacher model is ShuffleNet, even though this teacher's accuracy is 0.1187 lower than that of the best teacher, ViT. We conjecture this is because the simpler the teacher model, the easier it is for the generator to learn the distribution of its training data, resulting in faster convergence of the student model. As ε increases, the student model can learn more and more fully from the teacher model, so the effect of teacher performance dominates. The more complex the teacher model, the more information about the training data it contains. The student model learns more accurate predictions from the teacher and the generator learns a more similar data distribution, both of which lead to higher accuracy of the student model. We can see that at ε = 5 the accuracy of the student is positively correlated with the accuracy of the teacher model and approaches that of the teachers, which proves that our DPDFD is not only scalable but also effective for a variety of popular network architectures.

Evaluation of Training Directly with Private Data. In this section, we conduct experiments to evaluate the application of our DP mechanism A_{C,σ} to the case where private data can be accessed. We compare with 3 state-of-the-art approaches that train directly with private data (based on the DPSGD mechanism): DPSGD, TOPAGG and GM-DP, on MNIST and CIFAR10, under the same settings except for the DP mechanism. The results are shown in Tab. 4.

| Dataset | ε | DPSGD | TOPAGG | GM-DP | DPDFD |
|---|---|---|---|---|---|
| MNIST | 0.5 | 0.8103 | 0.9235 | 0.9331 | 0.9377 |
| | 0.7 | 0.8932 | 0.9382 | 0.9438 | 0.9447 |
| | 1.0 | 0.9247 | 0.9465 | 0.9508 | 0.9762 |
| CIFAR10 | 2.0 | 0.6623 | 0.8518 | 0.8597 | 0.8652 |
| | 4.0 | 0.6884 | 0.8540 | 0.8663 | 0.8708 |
| | 8.0 | 0.7159 | 0.8562 | 0.8705 | 0.8973 |

Table 4: Accuracy comparisons with 3 DPSGD mechanisms on MNIST and CIFAR10 under different privacy budget ε.

We can see that our approach achieves the best performance on both the low-dimensional MNIST dataset and the higher-dimensional CIFAR10 dataset. These results come from two aspects. On the one hand, we only need to compute a new differentially private output in the first step of backpropagation instead of adding noise to every gradient as other approaches do. On the other hand, we use normalization instead of clipping, which retains more gradient information when the norm bound C is small. So our approach achieves better results.

5.3 Ablation Studies

After the promising performance is achieved, we further analyze the impact of each component in our approach, including normalization vs. clipping, the norm bound, the noise scale, and the composition of the loss function.

Normalization vs. Clipping. To study the effect of different operations on the gradients, we conduct experiments with 34-layer ResNets pretrained on MNIST and FMNIST as teachers and report the results in Fig. 2.

Figure 2: Student accuracy under different norm bounds C and different operations on gradients ("-C" or "-N") in the case ε = 0.6 and δ = 10^-5 ("-C": Clipping, "-N": Normalization).

Performing normalization is significantly better than clipping when C is small; with clipping the student may even fail to converge. The advantage of normalization gradually decreases as C increases, and when C increases beyond a certain level (about 10^-3, as Fig. 2 shows), clipping becomes superior to normalization. This is because normalization retains the relative size information of the gradients while clipping retains the absolute size information.
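To make this distinction concrete, a minimal sketch contrasting the two operations on per-sample gradients is given below; the clipping branch follows the standard DPSGD-style rule, while the normalization branch is the per-sample rescaling used in A_{C,σ}, whose stability constant e is what lets small gradients keep their relative magnitudes.

```python
import torch

torch.manual_seed(0)

def clip_gradient(g, C):
    """DPSGD-style clipping: rescale only when the norm exceeds C."""
    return g * min(1.0, C / (float(g.norm()) + 1e-12))

def normalize_gradient(g, C, e=0.01):
    """Normalization used in A_{C,sigma}: rescale every gradient toward norm C;
    the stability constant e keeps small gradients proportionally smaller."""
    return C * g / (g.norm() + e)

# Per-sample gradients whose norms differ by two orders of magnitude.
gs = [torch.randn(10) for _ in range(3)]
gs = [g / g.norm() * s for g, s in zip(gs, (1e-3, 1e-2, 1e-1))]

C = 1e-4  # small norm bound, as favored in our ablation
print([float(clip_gradient(g, C).norm()) for g in gs])       # all collapse to exactly C
print([float(normalize_gradient(g, C).norm()) for g in gs])  # relative ordering preserved
```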
Norm Bound. The norm bound is an important hyper-parameter of our DP mechanism A_{C,σ}. We conduct experiments to investigate how the norm bound C affects the performance of the student model. The results are shown as the red and green lines in Fig. 2. As we can see, the student performs better when C is small. This is because normalization preserves information about the relative size of the gradients. Although a smaller C is more affected by the DP noise, it allows more training epochs, so the student performs better when C is small.

Noise Scale. Like the norm bound, the noise scale is also an important hyper-parameter of our DP mechanism A_{C,σ}. To study its effect, we conduct experiments and report the results in Tab. 5.

| Dataset | σ = 20 | σ = 50 | σ = 100 | σ = 200 | σ = 500 |
|---|---|---|---|---|---|
| MNIST | 0.4159 | 0.8314 | 0.9822 | 0.9815 | 0.9751 |
| FMNIST | 0.5193 | 0.6578 | 0.8386 | 0.8053 | 0.8142 |

Table 5: Student accuracy under different noise scales σ.

We observe that the model performance gets better as the noise scale σ increases from 20 to 100. This is because a larger noise scale consumes a smaller privacy budget per epoch, which allows the student model to learn more fully within a limited privacy budget. However, we find a slight decrease in model performance as the noise scale increases from 100 to 500. This is because the gradients are more corrupted with a larger noise scale, leading to slightly worse performance. In practical applications, a trade-off should be made based on the actual situation.

Composition of Loss Functions. To check the effect of the loss terms used to train the generator, we investigate how each component of the loss function contributes to the performance of the student, with a 34-layer ResNet as the teacher model. We evaluate the impact by adding or removing each component, and the results are shown in Tab. 6.

| Loss terms | MNIST | FMNIST |
|---|---|---|
| CE + IE + Norm | 0.9751 | 0.8988 |
| IE + Norm (w/o CE) | 0.9691 | 0.7649 |
| CE + Norm (w/o IE) | 0.5463 | 0.5463 |
| CE + IE (w/o Norm) | 0.9496 | 0.8806 |

Table 6: Impact of each term in L_G. The test accuracy of student models trained on synthetic data under ε = 10 is reported. CE: Cross Entropy; IE: Information Entropy; Norm: l2-Normalization.

We note that the information entropy loss term is the most important component: removing it results in an imbalance in the classes of the synthetic data. The cross-entropy loss contributes differently for different datasets, with about a 1% improvement on MNIST and a 13% improvement on FMNIST. The l2-normalization term is also useful though less critical, contributing 2%-3% of the performance improvement as shown in Tab. 6. In summary, all three components have a positive effect on the performance of the converted model.

6 Conclusion

Public pretrained models in model zoos may pose a risk of privacy leakage.
To facilitate model deployment, we proposed a differentially private data-free distillation approach to convert sensitive teacher models into privacy-preserving student models. We train a generator to approximate the private dataset without the training data, and student networks can be learned effectively through the knowledge distillation scheme. In addition, we perform normalization on the gradients of the student outputs and add Gaussian noise to them to guarantee privacy. We also provide privacy analysis and convergence analysis for DPDFD. Extensive experiments are conducted to show the effectiveness of our approach. In the future, we will explore the approach in more real-world applications like federated learning on medical images.

Acknowledgements

This work was partially supported by grants from the National Key Research and Development Plan (2020AAA0140001) and the Beijing Natural Science Foundation (19L2040).

References

[Abadi et al., 2016] Martin Abadi, Andy Chu, Ian Goodfellow, et al. Deep learning with differential privacy. In CCS, pages 308-318, 2016.
[Allen-Zhu, 2017] Zeyuan Allen-Zhu. Natasha 2: Faster non-convex optimization than SGD. arXiv:1708.08694, 2017.
[Bottou et al., 2018] Léon Bottou, Frank E. Curtis, and Jorge Nocedal. Optimization methods for large-scale machine learning. SIAM Review, 60(2):223-311, 2018.
[Bu et al., 2022] Zhiqi Bu, Yu-Xiang Wang, Sheng Zha, and George Karypis. Automatic clipping: Differentially private deep learning made easier and stronger. arXiv:2206.07136, 2022.
[Cao et al., 2021] Tianshi Cao, Alex Bie, Arash Vahdat, Sanja Fidler, and Karsten Kreis. Don't generate me: Training differentially private generative models with Sinkhorn divergence. In NeurIPS, pages 12480-12492, 2021.
[Chen et al., 2019] Hanting Chen, Yunhe Wang, Chang Xu, Zhaohui Yang, Chuanjian Liu, Boxin Shi, Chunjing Xu, Chao Xu, and Qi Tian. Data-free learning of student networks. In ICCV, pages 3514-3522, 2019.
[Chen et al., 2020] Dingfan Chen, Tribhuvanesh Orekondy, and Mario Fritz. GS-WGAN: A gradient-sanitized approach for learning differentially private generators. In NeurIPS, pages 12673-12684, 2020.
[Chen et al., 2022] Jia-Wei Chen, Chia-Mu Yu, Ching-Chia Kao, Tzai-Wei Pang, and Chun-Shien Lu. DPGEN: Differentially private generative energy-guided network for natural image synthesis. In CVPR, pages 8387-8396, 2022.
[Choi et al., 2020] Yoojin Choi, Jihwan Choi, Mostafa El-Khamy, and Jungwon Lee. Data-free network quantization with adversarial knowledge distillation. In CVPR Workshop, pages 710-711, 2020.
[Deng and Zhang, 2021] Xiang Deng and Zhongfei Zhang. Graph-free knowledge distillation for graph neural networks. arXiv:2105.07519, 2021.
[Deng et al., 2009] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In CVPR, pages 248-255, 2009.
[Dosovitskiy et al., 2020] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. In ICLR, 2020.
[Dwork and Roth, 2014] Cynthia Dwork and Aaron Roth. The algorithmic foundations of differential privacy. FTTCS, pages 211-407, 2014.
[Dwork et al., 2006] Cynthia Dwork, Frank McSherry, Kobbi Nissim, and Adam Smith. Calibrating noise to sensitivity in private data analysis. In TCC, pages 265-284, 2006.
[Fang et al., 2022] Gongfan Fang, Kanya Mo, Xinchao Wang, Jie Song, Shitao Bei, Haofei Zhang, and Mingli Song. Up to 100x faster data-free knowledge distillation. In AAAI, pages 6597-6604, 2022.
[Fredrikson et al., 2015] Matt Fredrikson, Somesh Jha, and Thomas Ristenpart. Model inversion attacks that exploit confidence information and basic countermeasures. In CCS, pages 1322-1333, 2015.
[Ge et al., 2023] Shiming Ge, Bochao Liu, Pengju Wang, Yong Li, and Dan Zeng. Learning privacy-preserving student networks via discriminative-generative distillation. IEEE TIP, 32:116-127, 2023.
[Ghadimi and Lan, 2013] Saeed Ghadimi and Guanghui Lan. Stochastic first- and zeroth-order methods for nonconvex stochastic programming. SIAM Journal on Optimization, 23(4):2341-2368, 2013.
[Goodfellow et al., 2014] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In NeurIPS, pages 2672-2680, 2014.
[Harder et al., 2021] Frederik Harder, Kamil Adamczewski, and Mijung Park. DP-MERF: Differentially private mean embeddings with random features for practical privacy-preserving data generation. In AISTATS, pages 1819-1827, 2021.
[He et al., 2016] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, pages 770-778, 2016.
[Howard et al., 2017] Andrew G. Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, and Hartwig Adam. MobileNets: Efficient convolutional neural networks for mobile vision applications. arXiv:1704.04861, 2017.
[Huang et al., 2016] Gao Huang, Yu Sun, Zhuang Liu, Daniel Sedra, and Kilian Q. Weinberger. Deep networks with stochastic depth. In ECCV, pages 646-661, 2016.
[Jia et al., 2019a] Ruoxi Jia, David Dao, Boxin Wang, Frances Ann Hubis, Nezihe Merve Gürel, Bo Li, Ce Zhang, Costas J. Spanos, and Dawn Song. Efficient task-specific data valuation for nearest neighbor algorithms. Proc. VLDB Endow., pages 1610-1623, 2019.
[Jia et al., 2019b] Ruoxi Jia, David Dao, Boxin Wang, Frances Ann Hubis, Nick Hynes, Nezihe Merve Gürel, Bo Li, Ce Zhang, Dawn Song, and Costas J. Spanos. Towards efficient data valuation based on the Shapley value. In AISTATS, pages 1167-1176, 2019.
[Jordon et al., 2019] James Jordon, Jinsung Yoon, and Mihaela van der Schaar. PATE-GAN: Generating synthetic data with differential privacy guarantees. In ICLR, 2019.
[Krizhevsky et al., 2012] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. ImageNet classification with deep convolutional neural networks. In NeurIPS, pages 1097-1105, 2012.
[Krizhevsky, 2009] Alex Krizhevsky. Learning multiple layers of features from tiny images. Technical report, University of Toronto, 2009.
[LeCun et al., 1998] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278-2324, 1998.
[Liu et al., 2015] Ziwei Liu, Ping Luo, Xiaogang Wang, and Xiaoou Tang. Deep learning face attributes in the wild. In ICCV, pages 3730-3738, 2015.
[Long et al., 2021] Yunhui Long, Boxin Wang, Zhuolin Yang, Bhavya Kailkhura, Aston Zhang, Carl Gunter, and Bo Li. G-PATE: Scalable differentially private data generator via private aggregation of teacher discriminators. In NeurIPS, pages 2965-2977, 2021.
[Lopes et al., 2017] Raphael Lopes, Stefano Fenu, and Thad Starner. Data-free knowledge distillation for deep neural networks. arXiv:1710.07535, 2017.
[McMahan et al., 2018] H. Brendan McMahan, Galen Andrew, Ulfar Erlingsson, Steve Chien, Ilya Mironov, Nicolas Papernot, and Peter Kairouz. A general approach to adding differential privacy to iterative training procedures. arXiv:1812.06210, 2018.
[Mironov, 2017] Ilya Mironov. Rényi differential privacy. In CSF, pages 263-275, 2017.
[Papernot et al., 2017] Nicolas Papernot, Martín Abadi, Ulfar Erlingsson, Ian Goodfellow, and Kunal Talwar. Semi-supervised knowledge transfer for deep learning from private training data. In ICLR, 2017.
[Papernot et al., 2018] Nicolas Papernot, Shuang Song, Ilya Mironov, Ananth Raghunathan, Kunal Talwar, and Ulfar Erlingsson. Scalable private learning with PATE. In ICLR, 2018.
[Papernot et al., 2021] Nicolas Papernot, Abhradeep Thakurta, Shuang Song, Steve Chien, and Ulfar Erlingsson. Tempered sigmoid activations for deep learning with differential privacy. In AAAI, pages 9312-9321, 2021.
[Simonyan and Zisserman, 2015] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. In ICLR, 2015.
[Srinivas and Babu, 2015] Suraj Srinivas and R. Venkatesh Babu. Data-free parameter pruning for deep neural networks. arXiv:1507.06149, 2015.
[Szegedy et al., 2015] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. In CVPR, pages 1-9, 2015.
[Takagi et al., 2021] Shun Takagi, Tsubasa Takahashi, Yang Cao, and Masatoshi Yoshikawa. P3GM: Private high-dimensional data release via privacy preserving phased generative model. In ICDE, pages 169-180, 2021.
[Wang et al., 2021] Boxin Wang, Fan Wu, Yunhui Long, Luka Rimanic, Ce Zhang, and Bo Li. DataLens: Scalable privacy preserving training via gradient compression and aggregation. In CCS, pages 2146-2168, 2021.
[Xiao et al., 2017] Han Xiao, Kashif Rasul, and Roland Vollgraf. Fashion-MNIST: A novel image dataset for benchmarking machine learning algorithms. arXiv:1708.07747, 2017.
[Xie et al., 2018] Liyang Xie, Kaixiang Lin, Shu Wang, Fei Wang, and Jiayu Zhou. Differentially private generative adversarial network. arXiv:1802.06739, 2018.
[Yang et al., 2019] Ziqi Yang, Jiyi Zhang, Ee-Chien Chang, and Zhenkai Liang. Neural network inversion in adversarial setting via background knowledge alignment. In CCS, pages 225-240, 2019.
[Yang et al., 2021] Jiancheng Yang, Rui Shi, Donglai Wei, Zequan Liu, Lin Zhao, Bilian Ke, Hanspeter Pfister, and Bingbing Ni. MedMNIST v2: A large-scale lightweight benchmark for 2D and 3D biomedical image classification. arXiv:2110.14795, 2021.
[Ye et al., 2020] Yunan Ye, Hengzhi Pei, Boxin Wang, Pin-Yu Chen, Yada Zhu, Ju Xiao, and Bo Li. Reinforcement-learning based portfolio management with augmented asset movement prediction states. In AAAI, pages 1112-1119, 2020.
[Zagoruyko and Komodakis, 2017] Sergey Zagoruyko and Nikos Komodakis. Wide residual networks. In CVPR, 2017.
[Zhang et al., 2018] Xiangyu Zhang, Xinyu Zhou, Mengxiao Lin, and Jian Sun. ShuffleNet: An extremely efficient convolutional neural network for mobile devices. In CVPR, pages 6848-6856, 2018.
[Zhao et al., 2022] Borui Zhao, Quan Cui, Renjie Song, Yiyu Qiu, and Jiajun Liang. Decoupled knowledge distillation. In CVPR, pages 11953-11962, 2022.
[Zhu et al., 2021] Zhuangdi Zhu, Junyuan Hong, and Jiayu Zhou. Data-free knowledge distillation for heterogeneous federated learning. In ICML, pages 12878-12889, 2021.