# private_gans_revisited__07d72777.pdf

Published in Transactions on Machine Learning Research (10/2023)

Private GANs, Revisited

Alex Bie (yabie@uwaterloo.ca), University of Waterloo

Gautam Kamath (g@csail.mit.edu), University of Waterloo

Guojun Zhang (guojun.zhang@huawei.com), Huawei Noah's Ark Lab

Reviewed on OpenReview: https://openreview.net/forum?id=9sVCIngrhP

We show that the canonical approach for training differentially private GANs, updating the discriminator with differentially private stochastic gradient descent (DPSGD), can yield significantly improved results after modifications to training. Specifically, we propose that existing instantiations of this approach neglect to consider how adding noise only to discriminator updates inhibits discriminator training, disrupting the balance between the generator and discriminator necessary for successful GAN training. We show that a simple fix, taking more discriminator steps between generator steps, restores parity between the generator and discriminator and improves results. Additionally, with the goal of restoring parity, we experiment with other modifications, namely large batch sizes and adaptive discriminator update frequency, to improve discriminator training and see further improvements in generation quality. Our results demonstrate that on standard image synthesis benchmarks, DPSGD outperforms all alternative GAN privatization schemes. Code: https://github.com/alexbie98/dpgan-revisit.

## 1 Introduction

Differential privacy (DP) (Dwork et al., 2006b) has emerged as a compelling approach for training machine learning models on sensitive data. However, incorporating DP requires changes to the training process. Notably, it prevents the modeller from working directly with sensitive data, complicating debugging and exploration. Furthermore, upon exhausting their allocated privacy budget, the modeller is restricted from interacting with sensitive data.
One approach to alleviate these issues is to produce differentially private synthetic data, which can be plugged directly into existing machine learning pipelines without further concern for privacy. Towards generating high-dimensional, complex data (such as images), a line of work has examined privatizing generative adversarial networks (GANs) (Goodfellow et al., 2014) to produce DP synthetic data. Initial efforts proposed to use differentially private stochastic gradient descent (DPSGD) (Abadi et al., 2016) as a drop-in replacement for SGD to update the GAN discriminator, an approach referred to as DPGAN (Xie et al., 2018; Beaulieu-Jones et al., 2019; Torkzadehmahani et al., 2019). However, follow-up work (Jordon et al., 2019; Long et al., 2021; Chen et al., 2020; Wang et al., 2021) departs from this approach: it proposes alternative privatization schemes for GANs, and reports significant improvements over the DPGAN baseline. Other methods for generating DP synthetic data diverge from GAN-based architectures, yielding improvements to utility in most cases (Table 2). This raises the question of whether GANs are suitable for DP training, or if bespoke architectures are required for DP data generation.

Authors GK and GZ are listed in alphabetical order. Work performed in part while interning at Huawei.

[Figure 1 plots. Panel (a): MNIST FID over a training run; x-axis: privacy budget $\varepsilon$ (ticks at 4, 6, 8, 10); curves: non-private (FID 3.4), $n_D = 1$, $n_D = 10$, $n_D = 20$, $n_D = 50$. Panel (b): corresponding images at $(10, 10^{-5})$-DP.]

Figure 1: DPGAN results on MNIST synthesis at $(10, 10^{-5})$-DP. (a) We run 3 seeds, plotting mean, min, and max FID along the runs. We find that increasing $n_D$, the number of discriminator steps taken between generator steps, significantly improves image synthesis. Increasing $n_D = 1 \to n_D = 50$ improves FID from $205.3 \pm 0.9 \to 18.5 \pm 0.9$. (b) Corresponding synthesized images (each is trained with the same privacy budget).
We observe that large $n_D$ improves visual quality, and low $n_D$ leads to mode collapse.

| Privacy | Method | MNIST FID | MNIST Acc. (%) | FashionMNIST FID | FashionMNIST Acc. (%) | CelebA-Gender FID | CelebA-Gender Acc. (%) |
|---|---|---|---|---|---|---|---|
| $\varepsilon = \infty$ | Real data | 1.0 | 99.2 | 1.5 | 92.5 | 1.1 | 96.6 |
| | GAN | 3.4 ± 0.1 | 97.0 ± 0.1 | 16.5 ± 1.7 | 79.5 ± 0.8 | 30.0 ± 1.6 | 92.0 ± 0.4 |
| $\varepsilon = 10$ | Best Private GAN | 61.34 | 80.92 | 131.34 | 70.61 | - | 70.72 |
| | DPGAN | 179.16 | 80.11 | 243.80 | 60.98 | - | 54.09 |
| $\varepsilon = 9.32$ | Our DPGAN | 12.8 ± 0.3 | 95.1 ± 0.1 | 62.3 ± 8.7 | 74.7 ± 0.4 | 170.8 ± 20.3 | 82.4 ± 4.4 |

Table 1: A summary of our results compared to results reported in previous work on private GANs. For our results, we run 3 seeds and report mean ± std. Acc. (%) refers to downstream classification accuracy of CNN models trained with generated data. The middle two rows are a composite of the best results reported in the literature for DPGAN and alternative GAN privatization schemes ("Best Private GAN"); see Tables 2 and 3 for correspondences. Here we use Gopi et al. (2021) privacy accounting for our results. We find significant improvement over all previous GAN-based methods for DP image synthesis.

Our contributions. We show that DPGANs give far better utility than previously demonstrated, and compete with or outperform almost all other methods for DP image synthesis.¹ Hence, we conclude that previously demonstrated deficiencies of DPGANs should not be attributed to inherent limitations of the framework, but rather to training issues. Specifically, we propose that the asymmetric noise addition in DPGANs (adding noise to discriminator updates only) inhibits discriminator training while leaving generator training untouched, disrupting the balance necessary for successful GAN training. Indeed, the seminal study of Goodfellow et al. (2014) points to the challenge of synchronizing the discriminator with the generator in GAN training, suggesting that "G must not be trained too much without updating D, in order to avoid 'the Helvetica scenario' [mode collapse]".
Prior DPGAN implementations in the literature do not take this into consideration when porting over non-private GAN training recipes. We propose that taking more discriminator steps between generator updates addresses the imbalance introduced by noise. With this change, DPGANs improve significantly (see Figure 1 and Table 1). Furthermore, we show that this perspective on DPGAN training ("restoring parity to a discriminator weakened by DP noise") can be applied to improve training further. We make other modifications to discriminator training, namely larger batch sizes and adaptive discriminator update frequency, and further improve upon the aforementioned results.

¹ A notable exception is diffusion models, discussed further in Section 2.

In summary, we make the following contributions:

- We find that taking more discriminator steps between generator steps significantly improves DPGANs. Contrary to previous results in the literature, DPGANs outperform alternative GAN privatization schemes.
- We present empirical findings towards understanding why more frequent discriminator steps help. We propose an explanation based on asymmetric noise addition for why vanilla DPGANs do not perform well, and why taking more frequent discriminator steps helps.
- We employ our explanation as a principle for designing better private GAN training recipes, incorporating larger batch sizes and adaptive discriminator update frequency, and indeed are able to improve over the aforementioned results.

## 2 Related work

Private GANs. The baseline DPGAN that employs a DPSGD-trained discriminator was introduced in Xie et al. (2018), and studied in follow-up work of Torkzadehmahani et al. (2019); Beaulieu-Jones et al. (2019).
Despite significant interest in the approach (400 citations at time of writing), we were unable to find studies that explore the modifications we perform or uncover similar principles for improving DPGAN training. We note that the number of discriminator steps taken per generator step, $n_D$, appears as a hyperparameter in the framework outlined by the seminal study of Goodfellow et al. (2014), and in follow-up work such as WGAN (Arjovsky et al., 2017). Xie et al. (2018) privatize WGAN, adopting its imbalanced stepping strategy of $n_D = 5$, but make no mention of the importance of the parameter (nor do Torkzadehmahani et al. (2019), who use $n_D = 1$). As we show in Figure 1a, ensuring that $n_D$ lies within a critical range (as determined by DPSGD hyperparameters) is key to adapting a GAN training recipe to DP; the selection of $n_D$ is the difference between performance competitive with the state of the art and a model that does not work at all.²

As a consequence, subsequent work has departed from DPGANs, examining alternative privatization schemes for GANs (Jordon et al., 2019; Long et al., 2021; Chen et al., 2020; Wang et al., 2021). Broadly speaking, these approaches employ subsample-and-aggregate (Nissim et al., 2007) via the PATE approach (Papernot et al., 2017), dividing the data into 1K disjoint partitions and training teacher discriminators separately on each one. Our work shows that these privatization schemes are outperformed by DPSGD.

DP generative models. Other generative modelling frameworks have been applied to generate DP synthetic data: VAEs (Chen et al., 2018), maximum mean discrepancy (Harder et al., 2021; Vinaroz et al., 2022; Harder et al., 2022), Sinkhorn divergences (Cao et al., 2021), normalizing flows (Waites & Cummings, 2021), and diffusion models (Dockhorn et al., 2022). In a different vein, Chen et al.
(2022) avoid learning a generative model, and instead generate a coreset of examples (20 per class) for the purpose of training a classifier. These approaches fall into two camps: applications of DPSGD to existing, highly-performant generative models; or custom approaches designed specifically for privacy, which fall short of GANs when evaluated at their non-private limits ($\varepsilon \to \infty$).

Concurrent work on DP diffusion models. Simultaneous and independent work by Dockhorn et al. (2022) is the first to investigate DP training of diffusion models. They achieve impressive state-of-the-art results for DP image synthesis in a variety of settings, in particular outperforming the DPGAN results reported in this paper. We consider our results to still be of significant interest to the community, as we challenge the conventional wisdom regarding deficiencies of DPGANs, showing that they give much better utility than previously thought. Indeed, GANs are still one of the most popular and well-studied generative models, and consequently, there are many cases where one would prefer a GAN over an alternative approach. By revisiting several of the design choices in DPGANs, we give guidance on how to seamlessly introduce differential privacy into such pipelines. Furthermore, both our work and the work of Dockhorn et al. (2022) are aligned in supporting a broader message: training conventional machine learning architectures with DPSGD frequently achieves state-of-the-art results under differential privacy. Indeed, both our results and theirs outperform almost all custom methods designed for DP image synthesis. This reaffirms a similar message recently demonstrated in other private ML settings, including image classification (De et al., 2022) and NLP (Li et al., 2022; Yu et al., 2022).

² For further discussion on the role of hyperparameters in DP machine learning, see Appendix F.
## 3 Preliminaries

Our goal is to train a generative model on sensitive data that is safe to release, that is, it does not leak the secrets of individuals in the training dataset. We do this by ensuring that the training algorithm $\mathcal{A}$, which takes as input the sensitive dataset $D \in \mathcal{U}$ and returns the parameters of a trained (generative) model $\theta \in \Theta$, satisfies differential privacy.

Definition 1 (Differential Privacy, Dwork et al. 2006b). A randomized algorithm $\mathcal{A} : \mathcal{U} \to \Theta$ is $(\varepsilon, \delta)$-differentially private if for every pair of neighbouring datasets $D, D' \in \mathcal{U}$, we have

$$\mathbb{P}\{\mathcal{A}(D) \in S\} \le \exp(\varepsilon) \cdot \mathbb{P}\{\mathcal{A}(D') \in S\} + \delta \quad \text{for all (measurable) } S \subseteq \Theta.$$

In this work, we adopt the add/remove definition of DP, and say two datasets $D$ and $D'$ are neighbouring if they differ in at most one entry, that is, $D' = D \cup \{x\}$ or $D' = D \setminus \{x\}$.

One convenient property of DP is closure under post-processing, which says that further outputs computed from the output of a DP algorithm (without accessing private data by any other means) are safe to release, satisfying the same DP guarantees as the original outputs. In our case, this means that interacting with a privatized model (e.g., using it to compute gradients on non-sensitive data, or to generate samples) does not lead to any further privacy violation.

DPSGD. A gradient-based learning algorithm can be privatized by employing differentially private stochastic gradient descent (DPSGD) (Song et al., 2013; Bassily et al., 2014; Abadi et al., 2016) as a drop-in replacement for SGD. DPSGD involves clipping per-example gradients and adding Gaussian noise to their sum, which effectively bounds and masks the contribution of any individual point to the final model parameters. Privacy analysis of DPSGD follows from several classic tools in the DP toolbox: Gaussian mechanism, privacy amplification by subsampling, and composition (Dwork et al., 2006a; Dwork & Roth, 2014; Abadi et al., 2016; Wang et al., 2019).
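As a rough illustration of the clip-and-noise step described above, the following sketch computes one noisy averaged gradient from a matrix of per-example gradients. It is a minimal NumPy sketch under stated assumptions, not the Opacus implementation used in this work: the function name `dpsgd_noisy_gradient` and its arguments are hypothetical, and Poisson sampling and privacy accounting are omitted.

```python
import numpy as np


def dpsgd_noisy_gradient(per_example_grads, clip_norm, noise_multiplier,
                         expected_batch_size, rng):
    """Return a noisy averaged gradient (illustrative sketch only).

    per_example_grads: array of shape (n, d), one flattened gradient per example.
    clip_norm:         C, the maximum L2 norm allowed for each per-example gradient.
    noise_multiplier:  sigma; the Gaussian noise has standard deviation C * sigma.
    """
    grads = np.asarray(per_example_grads, dtype=float)

    # Clip: rescale each row so that its L2 norm is at most clip_norm.
    norms = np.linalg.norm(grads, axis=1, keepdims=True)
    scale = np.minimum(1.0, clip_norm / np.maximum(norms, 1e-12))
    clipped_sum = (grads * scale).sum(axis=0)

    # Noise: add isotropic Gaussian noise to the SUM of clipped gradients.
    noise = rng.normal(0.0, clip_norm * noise_multiplier, size=clipped_sum.shape)

    # Average by the expected batch size, which does not depend on the data drawn.
    return (clipped_sum + noise) / expected_batch_size


# Hypothetical usage with two 2-dimensional per-example gradients.
rng = np.random.default_rng(0)
example_grads = np.array([[3.0, 4.0], [0.3, 0.4]])
noisy_grad = dpsgd_noisy_gradient(example_grads, clip_norm=1.0,
                                  noise_multiplier=1.0,
                                  expected_batch_size=2, rng=rng)
```

Setting `noise_multiplier=0.0` reduces the function to an average of clipped gradients, which makes the clipping behaviour easy to check in isolation.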
In our work, we use two different privacy accounting methods for DPSGD: (a) the classical approach of Mironov et al. (2019), implemented in Opacus (Yousefpour et al., 2021), and (b) the recent exact privacy accounting of Gopi et al. (2021). By default, we use the former technique for a closer direct comparison with prior works (though we note that some prior works use even looser accounting techniques). However, the latter technique gives tighter bounds on the true privacy loss and, for all practical purposes, is the preferred method of privacy accounting. We use Gopi et al. (2021) accounting only where indicated in Tables 1, 2, and 3.

DPGANs. Algorithm 1 details the training algorithm for DPGANs, which is effectively an instantiation of DPSGD. Note that only gradients for the discriminator $\mathcal{D}$ must be privatized (via clipping and noise), and not those for the generator $\mathcal{G}$. This is a consequence of closure under post-processing: the generator only interacts with the sensitive dataset indirectly via discriminator parameters, and therefore does not need further privatization.

Algorithm 1: TrainDPGAN($D$; $\phi_0$, $\theta_0$, $\mathrm{Opt}_{\mathcal{D}}$, $\mathrm{Opt}_{\mathcal{G}}$, $n_D$, $T$, $B$, $C$, $\sigma$, $\delta$)

1. Input: Labelled dataset $D = \{(x_j, y_j)\}_{j=1}^{n}$. Discriminator $\mathcal{D}$ and generator $\mathcal{G}$ initializations $\phi_0$ and $\theta_0$. Optimizers $\mathrm{Opt}_{\mathcal{D}}$, $\mathrm{Opt}_{\mathcal{G}}$. Hyperparameters: $n_D$ ($\mathcal{D}$ steps per $\mathcal{G}$ step), $T$ (total number of $\mathcal{D}$ steps), $B$ (expected batch size), $C$ (clipping norm), and $\sigma$ (noise level). Privacy parameter $\delta$.
2. $q \leftarrow B/|D|$ and $t, k \leftarrow 0$ ▷ Calculate sampling rate $q$, initialize counters.
3. while $t < T$ do ▷ Update $\mathcal{D}$ with DPSGD.
4. $S_t \sim \mathrm{PoissonSample}(D, q)$ ▷ Sample a real batch $S_t$ by including each $(x, y) \in D$ w.p. $q$.
5. $\tilde{S}_t \sim \mathcal{G}(\cdot\,; \theta_k)^B$ ▷ Sample fake batch $\tilde{S}_t$.
6. $g_{\phi_t} \leftarrow \sum_{(x,y) \in S_t} \mathrm{clip}\left(\nabla_{\phi_t}(-\log(\mathcal{D}(x, y; \phi_t))); C\right) + \sum_{(\tilde{x},\tilde{y}) \in \tilde{S}_t} \mathrm{clip}\left(\nabla_{\phi_t}(-\log(1 - \mathcal{D}(\tilde{x}, \tilde{y}; \phi_t))); C\right)$ ▷ Clip per-example gradients.
7. $\hat{g}_{\phi_t} \leftarrow \frac{1}{2B}(g_{\phi_t} + z_t)$, where $z_t \sim N(0, C^2\sigma^2 I)$ ▷ Add Gaussian noise.
8. $\phi_{t+1} \leftarrow \mathrm{Opt}_{\mathcal{D}}(\phi_t, \hat{g}_{\phi_t})$
9. $t \leftarrow t + 1$
10. if $n_D$ divides $t$ then ▷ Perform $\mathcal{G}$ update every $n_D$ steps.
11. $\tilde{S}'_t \sim \mathcal{G}(\cdot\,; \theta_k)^B$
12. $g_{\theta_k} \leftarrow \frac{1}{B} \sum_{(\tilde{x},\tilde{y}) \in \tilde{S}'_t} \nabla_{\theta_k}(-\log(\mathcal{D}(\tilde{x}, \tilde{y}; \phi_t)))$
13. $\theta_{k+1} \leftarrow \mathrm{Opt}_{\mathcal{G}}(\theta_k, g_{\theta_k})$
14. $k \leftarrow k + 1$
15. end if
16. end while
17. $\varepsilon \leftarrow \mathrm{PrivacyAccountant}(T, \sigma, q, \delta)$ ▷ Compute privacy budget spent.
18. Output: Final $\mathcal{G}$ parameters $\theta_k$ and $(\varepsilon, \delta)$-DP guarantee.

## 4 Frequent discriminator steps improve private GANs

In this section, we discuss our main finding: $n_D$, the number of discriminator steps taken between each generator step (see Algorithm 1), plays a significant role in the success of DPGAN training. Fixing a setting of DPSGD hyperparameters, there is an optimal range of values for $n_D$ that maximizes generation quality, in terms of both visual quality and utility for downstream classifier training. This value can be quite large ($n_D \approx 100$ in some cases).

### 4.1 Experimental details

Setup. We focus on labelled generation of MNIST (LeCun et al., 1998) and FashionMNIST (Xiao et al., 2017), both of which comprise 60K $28 \times 28$ grayscale images divided into 10 classes. To build a strong baseline, we begin from an open source PyTorch (Paszke et al., 2019) implementation³ of DCGAN (Radford et al., 2016) that performs well non-privately, and copy its training recipe. We then adapt its architecture to our purposes: removing BatchNorm layers (which are not compatible with DPSGD) and adding label embedding layers to enable labelled generation. Training this configuration non-privately yields labelled generation that achieves FID scores of $3.4 \pm 0.1$ on MNIST and $16.5 \pm 1.7$ on FashionMNIST. $\mathcal{D}$ and $\mathcal{G}$ have 1.72M and 2.27M trainable parameters respectively. For further details, please see Appendix B.1.

Privacy implementation. To privatize training, we use Opacus (Yousefpour et al., 2021), which implements per-example gradient computation. As discussed before, we use the Rényi differential privacy (RDP) accounting of Mironov et al. (2019) (except in a few noted instances, where we instead use the tighter Gopi et al. (2021) accounting).
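The stepping schedule in Algorithm 1 can be sketched as follows. This is an illustrative calculation only: the helper names are hypothetical, and the example values ($|D| = 60$K, $B = 128$, $T = 450$K) are the MNIST baseline settings reported in this section. It shows that the quantities passed to the privacy accountant (the sampling rate $q$ and the number of discriminator steps $T$) do not depend on $n_D$, whereas the number of generator steps is $\lfloor T / n_D \rfloor$.

```python
def dpgan_schedule(dataset_size, expected_batch_size, total_d_steps, n_d):
    """Return (q, generator_steps) for the stepping rule in Algorithm 1.

    q is the Poisson sampling rate B / |D|; a generator update happens
    whenever n_d divides the discriminator step counter t.
    """
    q = expected_batch_size / dataset_size
    generator_steps = total_d_steps // n_d
    return q, generator_steps


def count_generator_steps_by_loop(total_d_steps, n_d):
    """Replay only the counters of Algorithm 1 (no training) as a check."""
    t, k = 0, 0
    while t < total_d_steps:
        t += 1            # one DPSGD discriminator step
        if t % n_d == 0:  # "if n_D divides t"
            k += 1        # one (non-private) generator step
    return k


# Baseline values from this section: 60K images, B = 128, T = 450K.
schedule = {n_d: dpgan_schedule(60_000, 128, 450_000, n_d)
            for n_d in (1, 10, 50, 100, 200)}
# For example, n_D = 50 leaves 450_000 // 50 = 9000 generator steps,
# while q (and hence the privacy cost for fixed T and sigma) is unchanged.
```

The design point this makes explicit is that raising $n_D$ trades generator steps for discriminator quality at a fixed privacy cost, since only discriminator steps touch real data.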
For our baseline setting, we use the following DPSGD hyperparameters: we keep the non-private (expected) batch size $B = 128$, and use a noise level $\sigma = 1$ and clipping norm $C = 1$. Under these settings, we have the budget for $T = 450$K discriminator steps when targeting $(10, 10^{-5})$-DP.

Evaluation. We evaluate our generative models by examining the visual quality and utility for downstream tasks of generated images. Following prior work, we measure visual quality by computing the Fréchet Inception Distance (FID) (Heusel et al., 2017) between 60K generated images and the entire test set.⁴ To measure downstream task utility, we again follow prior work, and train a CNN classifier on 60K generated image-label pairs and report its accuracy on the real test set.

³ Courtesy of Hyeonwoo Kang (https://github.com/znxlwm). Code available at this link.

⁴ We use an open source PyTorch implementation to compute FID: https://github.com/mseitzer/pytorch-fid.

[Figure 2 plots. Panel (a): MNIST accuracy; panel (b): FashionMNIST FID; panel (c): FashionMNIST accuracy. x-axis: privacy budget $\varepsilon$ (ticks at 4, 6, 8, 10); y-axis in (a) and (c): accuracy (%). Curves in (a): non-private (97.0), $n_D \in \{1, 10, 50, 100, 200\}$. Curves in (b): non-private (16.5), $n_D \in \{10, 50, 100, 200\}$. Curves in (c): non-private (79.5), $n_D \in \{10, 50, 100, 200\}$.]

Figure 2: DPGAN results over training runs using different discriminator update frequencies $n_D$, targeting $(10, 10^{-5})$-DP. Each plotted line indicates the mean, min, and max utility of 3 training runs with different seeds, as the privacy budget is expended. (a) We plot the test set accuracy of a CNN trained on generated data only. Accuracy mirrors the FID scores from Figure 1a. Going from $n_D = 1$ to $n_D = 50$ improves accuracy from $40.3 \pm 6.3\% \to 93.0 \pm 0.6\%$. Further $n_D$ increases hurt accuracy. (b) and (c) We obtain similar results for FashionMNIST. Note that the optimal $n_D$ is higher (around $n_D \approx 100$).
At $n_D = 100$, we obtain an FID of $85.9 \pm 6.4$ and accuracy of $71.7 \pm 1.0\%$.

Figure 3: Evolution of samples drawn during training with $n_D = 10$, when targeting $(10, 10^{-5})$-DP. This setting reports its best FID and downstream accuracy at $t = 50$K iterations ($\varepsilon \approx 2.85$). As training progresses beyond this point, we observe mode collapse for several classes (e.g., the 6's and 7's, particularly at $t = 150$K), co-occurring with the deterioration in evaluation metrics (these samples correspond to the first 4 data points in the $n_D = 10$ line in Figures 1a and 2a).

### 4.2 Results

More frequent discriminator steps improve generation. We plot in Figures 1a and 2 the evolution of FID and accuracy during DPGAN training for both MNIST and FashionMNIST, under varying discriminator update frequencies $n_D$. This parameter has an outsized impact on the final results. For MNIST, $n_D = 50$ yields the best results; on FashionMNIST, $n_D = 100$ is the best.

We emphasize that increasing the frequency of discriminator steps, relative to generator steps, does not affect the privacy cost of Algorithm 1. For any setting of $n_D$, we perform the same number of noisy gradient queries on real data; what changes is the total number of generator steps taken over the course of training, which is reduced by a factor of $n_D$.

Private GANs are on a path to mode collapse. For our MNIST results, we observe that at low discriminator update frequencies ($n_D = 10$), the best FID and accuracy scores occur early in training, well before the privacy budget we are targeting is exhausted.⁵ Examining Figures 1a and 2a at 50K discriminator steps (the leftmost points on the charts; $\varepsilon \approx 2.85$), the $n_D = 10$ runs (in orange) have better FID and accuracy than both: (a) later checkpoints of the $n_D = 10$ runs, after training longer and spending more privacy budget; and (b) other settings of $n_D$ at that stage of training.

⁵ This observation has been reported in Neunhoeffer et al.
(2021), serving as motivation for their remedy of taking a mixture of intermediate models encountered in training. We are not aware of any mentions of this aspect of DPGAN training in papers reporting DPGAN baselines for labelled image synthesis.

[Figure 4 plot. x-axis: generator step $k$ (0 to 14000); y-axis: discriminator accuracy. Curves: non-private ($n_D = 1$), $n_D = 50$, $n_D = 10$, $n_D = 1$.]

Figure 4: Exponential moving average ($\beta = 0.95$) of GAN discriminator accuracy on mini-batches, immediately before each generator step. While non-privately the discriminator maintains about 60% accuracy, the private discriminator with $n_D = 1$ is effectively a random guess. Increasing the number of discriminator steps recovers the discriminator's advantage early on, leading to generator improvement. As the generator improves, the discriminator's task is made more difficult, driving down accuracy.

We attribute generator deterioration with more training to mode collapse: a known failure mode of GANs where the generator resorts to producing a small set of examples rather than representing the full variation present in the underlying data distribution. In Figure 3, we plot the evolution of generated images for an $n_D = 10$ run over the course of training and observe qualitative evidence of mode collapse: at 50K steps, all generated images are varied, whereas at 150K steps, many of the columns (in particular the 6's and 7's) are slight variations of the same image. In contrast, successfully trained GANs do not exhibit this behaviour (see the $n_D = 50$ images in Figure 1b). Mode collapse co-occurs with the deterioration in FID and accuracy observed in the first 4 data points of the $n_D = 10$ runs (in orange) in Figures 1a and 2a.

An optimal discriminator update frequency.
These results suggest that, fixing other DPSGD hyperparameters, there is an optimal setting for the discriminator step frequency $n_D$ that strikes a balance between: (a) being too low, causing generation quality to peak early in training and then undergo mode collapse, so that all subsequent training consumes additional privacy budget without improving the model; and (b) being too high, preventing the generator from taking enough steps to converge before the privacy budget is exhausted (an example of this is the $n_D = 200$ run in Figure 2a). Striking this balance results in the most effective utilization of privacy budget towards improving the generator.

## 5 Why does taking more steps help?

In this section, we present empirical findings towards understanding why more frequent discriminator steps improve DPGAN training. We propose an explanation that is consistent with our findings.

How does DP affect GAN training? Figure 4 compares the accuracy of the GAN discriminator on held-out real and fake examples immediately before each generator step, between private and non-private training with different settings of $n_D$. We observe that non-privately at $n_D = 1$, discriminator accuracy stabilizes at around 60%. Naively introducing DP ($n_D = 1$) leads to a qualitative difference: DP causes discriminator accuracy to drop to 50% (i.e., comparable to random guessing) immediately at the start of training, never to recover.⁶

For other settings of $n_D$, we make the following observations: (1) larger $n_D$ corresponds to higher discriminator accuracy in early training; (2) in a training run, discriminator accuracy decreases throughout as the generator improves; (3) after discriminator accuracy falls below a certain threshold, the generator degrades or sees limited improvement.⁷

⁶ Our plot only shows the first 15K generator steps, but we remark that this persists until the end of training (450K steps).

[Figure 5 plots. x-axis in all panels: generator step $k$ (1000 to 4000). Curves in all panels: non-private ($n_D = 1$), $n_D = 50$, $n_D = 10$, $n_D = 1$. Panel (a): exponential moving average ($\beta = 0.95$) of discriminator accuracy on mini-batches, after checkpoint restarts. Panel (b): FID after checkpoint restarts. Panel (c): accuracy after checkpoint restarts.]

Figure 5: We restart training under various privacy and $n_D$ settings at 3 checkpoints taken at 1K, 2K, and 3K generator steps into non-private training. We plot the progression of discriminator accuracy, FID, and downstream classification accuracy. The black dots correspond to the initial values of a checkpoint. We observe that low $n_D$ settings do not achieve discriminator accuracy comparable to non-private training (a), which results in degradation of utility ((b) and (c)). Discriminator accuracy for $n_D = 50$ tracks non-private training, and we observe utility improvement throughout training, as in the non-private setting.

Based on these observations, we propose the following explanation for why more frequent discriminator steps help:

Generator improvement occurs when the discriminator is effective at distinguishing between real and fake data. The asymmetric noise addition introduced by DP to the discriminator makes such a task difficult, resulting in limited generator improvement. Allowing the discriminator to train longer on a fixed generator improves its accuracy, recovering the non-private case where the generator and discriminator are balanced.

Checkpoint restarting experiment. We perform a checkpoint restarting experiment to examine this explanation in a more controlled setting.
We train a non-private GAN for 3K generator steps, and save checkpoints of the discriminator and generator (and their respective optimizers) at 1K, 2K, and 3K generator steps. We restart training from each of these checkpoints for 1K generator steps under different $n_D$ and privacy settings. We plot the progression of discriminator accuracy, FID, and downstream classification accuracy. Results are pictured in Figure 5. Broadly, our results corroborate the observations that discriminator accuracy improves with larger $n_D$ and decreases with better generators, and that generator improvement occurs when the discriminator has sufficiently high accuracy.

⁷ For $n_D = 10$, accuracy falls below 50% after 5K generator steps (= 50K discriminator steps), which corresponds to the first point in the $n_D = 10$ line in Figures 1a and 2a. For $n_D = 50$, accuracy falls below 50% after 5K generator steps (= 250K discriminator steps), which corresponds to the 5th point in the $n_D = 50$ line in Figures 1a and 2a.

[Figure 6 plots. x-axis in both panels: noise scale $\sigma$ (ticks at 0.4, 0.6, 0.8, 1.0). Panel (a): varying $\sigma$ only vs. FID at $n_D = 1$; dashed reference lines: $n_D = 1$ at $\sigma = 0.45$ (127.1) and $n_D = 50$ at $\sigma = 1$ (18.5). Panel (b): varying $\sigma$ only vs. accuracy (%) at $n_D = 1$; dashed reference lines: $n_D = 1$ at $\sigma = 0.45$ (57.5) and $n_D = 50$ at $\sigma = 1$ (93.0).]

Figure 6: On MNIST, we fix $n_D = 1$ and report results for various settings of the DPSGD noise level $\sigma$, where the number of iterations $T$ is chosen for each $\sigma$ to target $(10, 10^{-5})$-DP. The gap between the dashed lines represents the advancement of the utility frontier by incorporating the choice of $n_D$ into our design space.

Does reducing noise accomplish the same thing? In light of the above explanation, we ask if reducing the noise level $\sigma$ can offer the same improvement as taking more steps, as reducing $\sigma$ should also improve discriminator accuracy before a generator step.
To test this, we start from our setting in Section 4, fix $n_D = 1$, target MNIST at $\varepsilon = 10$, and search over a grid of noise levels $\sigma$ (the lowest of which, $\sigma = 0.4$, admits a budget of only $T = 360$ discriminator steps). Results are pictured in Figure 6. We obtain a best FID of 127.1 and best accuracy of 57.5% at noise level $\sigma = 0.45$, well short of the FID of 18.5 and accuracy of 93.0% obtained with $n_D = 50$ at $\sigma = 1$. Hence we conclude that, in this experimental setting, incorporating discriminator update frequency in our design space allows for more effective use of privacy budget for improving generation quality.

Does taking more discriminator steps always help? As we discuss in more detail in Section 6.1, when we are able to find other means to improve the discriminator beyond taking more steps, tuning discriminator update frequency may not yield improvements. To illustrate with an extreme case, consider eliminating the privacy constraint. In non-private GAN training, taking more steps is known to be unnecessary. We corroborate this result: we run our non-private baseline from Section 4 with the same number of generator steps, but opt to take 10 discriminator steps between each generator step instead of 1. FID worsens from $3.4 \pm 0.1 \to 8.3$, and accuracy worsens from $97.0 \pm 0.1\% \to 91.3\%$.

## 6 Better generators via better discriminators

Our proposed explanation in Section 5 provides a concrete suggestion for improving GAN training: effectively use our privacy budget to maximize the number of generator steps taken when the discriminator has sufficiently high accuracy. We experiment with modifications to the private GAN training recipe towards these ends, which translate to improved generation.

### 6.1 Larger batch sizes

Several recent works have demonstrated that for classification tasks, DPSGD achieves higher accuracy with larger batch sizes, after tuning the noise level $\sigma$ accordingly (Tramèr & Boneh, 2021; Anil et al., 2022; De et al., 2022).
On simpler, less diverse datasets (such as MNIST, CIFAR-10, and FFHQ), GAN training is typically conducted with small batch sizes (for example, DCGAN uses $B = 128$ (Radford et al., 2016), which we adopt; StyleGAN(2|3) uses $B = 32/64$ (Karras et al., 2019; 2020; 2021)).⁸ Therefore it is interesting to see if large batch sizes help in our setting. We corroborate that on MNIST, larger batch sizes do not significantly improve our non-private baseline from Section 4: when we go up to $B = 2048$ from $B = 128$, FID goes from $3.4 \pm 0.1 \to 3.2$ and accuracy goes from $97.0 \pm 0.1\% \to 97.5\%$.

⁸ However, on more complex, diverse datasets (such as ImageNet), it has been found that larger batch sizes help: this is the conclusion from BigGAN (Brock et al., 2019). Recent work scaling up StyleGAN to diverse datasets, StyleGAN-XL (Sauer et al., 2022) and GigaGAN (Kang et al., 2023), corroborates this result, seeing improvements from scaling up batch sizes to 2048 and 1024 respectively.

| Privacy | Method | Reported in | MNIST FID | MNIST Acc. (%) | FashionMNIST FID | FashionMNIST Acc. (%) |
|---|---|---|---|---|---|---|
| $\varepsilon = \infty$ | Real data | This work | 1.0 | 99.2 | 1.5 | 92.5 |
| | GAN | This work | 3.4 ± 0.1 | 97.0 ± 0.1 | 16.5 ± 1.7 | 79.5 ± 0.8 |
| $\varepsilon = 10$ | DP-MERF | Cao et al. (2021) | 116.3 | 82.1 | 132.6 | 75.5 |
| | DP-Sinkhorn | Cao et al. (2021) | 48.4 | 83.2 | 128.3 | 75.1 |
| | PSG⁹ | Chen et al. (2022) | - | 95.6 | - | 77.7 |
| | DPDM | Dockhorn et al. (2022) | 5.01 | 97.3 | 18.6 | 84.9 |
| | DPGAN¹⁰ | Chen et al. (2020) | 179.16 | 63 | 243.80 | 50 |
| | DPGAN¹⁰ | Long et al. (2021) | 304.86 | 80.11 | 433.38 | 60.98 |
| | GS-WGAN | Chen et al. (2020) | 61.34 | 80 | 131.34 | 65 |
| | PATE-GAN | Long et al. (2021) | 253.55 | 66.67 | 229.25 | 62.18 |
| | G-PATE | Long et al. (2021) | 150.62 | 80.92 | 171.90 | 69.34 |
| | DataLens | Wang et al. (2021) | 173.50 | 80.66 | 167.68 | 70.61 |
| $\varepsilon = 9.32$* | Our DPGAN | This work | 18.5 ± 0.9 | 93.0 ± 0.6 | 85.9 ± 6.4 | 71.7 ± 1.0 |
| | + large batches | This work | 13.2 ± 1.1 | 94.0 ± 0.6 | 70.9 ± 6.3 | 73.0 ± 1.1 |
| | + adaptive $n_D$ | This work | 12.8 ± 0.3 | 95.1 ± 0.1 | 62.3 ± 8.7 | 74.7 ± 0.4 |
| $\varepsilon = 1$ | DP-MERF¹¹ | Vinaroz et al. (2022) | - | 80.7 | - | 73.9 |
| | DP-HP | Vinaroz et al. (2022) | - | 81.5 | - | 72.3 |
| | PSG | Chen et al. (2022) | - | 80.9 | - | 70.2 |
| | DPDM | Dockhorn et al. (2022) | 23.4 | 95.3 | 37.8 | 79.4 |
| | DPGAN | Long et al. (2021) | 470.20 | 40.36 | 472.03 | 10.53 |
| | GS-WGAN | Long et al. (2021) | 489.75 | 14.32 | 587.31 | 16.61 |
| | PATE-GAN | Long et al. (2021) | 231.54 | 41.68 | 253.19 | 42.22 |
| | G-PATE | Long et al. (2021) | 153.38 | 58.80 | 214.78 | 58.12 |
| | DataLens | Wang et al. (2021) | 186.06 | 71.23 | 194.98 | 64.78 |
| $\varepsilon = 0.912$* | Our DPGAN | This work | 111.1 ± 17.9 | 76.9 ± 0.6 | 155.3 ± 7.1 | 64.9 ± 0.8 |
| | + large batches | This work | 106.2 ± 64.0 | 67.5 ± 7.8 | 158.9 ± 6.0 | 67.2 ± 1.0 |
| | + adaptive $n_D$ | This work | 52.6 ± 3.2 | 81.3 ± 0.8 | 126.4 ± 4.1 | 69.1 ± 0.1 |

Table 2: We gather previously reported results in the literature on the performance of various methods for labelled generation of MNIST and FashionMNIST, compared with our results. For our results, we run 3 seeds and report mean ± std. Note that "Reported in" refers to the source of the numerical result, not the originator of the approach. For downstream accuracy, we report the best accuracy among the classifiers each work uses, and compare against our CNN classifier accuracy. (*) For our results, we target $\varepsilon = 10$ / $\varepsilon = 1$ with Opacus accounting and additionally report $\varepsilon$ using the improved privacy accounting of Gopi et al. (2021).

Results. We scale up batch sizes, considering $B \in \{128, 512, 2048\}$, and search for the optimal noise level $\sigma$ and $n_D$ (details in Appendix B.2). We target both $\varepsilon = 1$ and $\varepsilon = 10$. We report the best results from our hyperparameter search in Table 2. We find that larger batch sizes lead to improvements: for $\varepsilon = 1$ and $\varepsilon = 10$, the best results are achieved at $B = 512$ and $B = 2048$ respectively. We also note that for large batch sizes, the optimal number of generator steps can be quite small. For $B = 2048$, $\sigma = 4.0$, targeting MNIST at $\varepsilon = 10$, $n_D = 5$ is the optimal discriminator update frequency, and improves over our best $B = 128$ setting employing $n_D = 50$. For full results, see Appendix D.3.

### 6.2 Adaptive discriminator step frequency

Our observations from Sections 4 and 5 motivate us to consider adaptive discriminator step frequencies.
As pictured in Figures 4 and 5a, discriminator accuracy drops during training as the generator improves. In this scenario, we want to take more steps to improve the discriminator, in order to further improve the generator. However, using a large discriminator update frequency right from the beginning of training is wasteful, as evidenced by the fact that low n_D achieves the best FID and accuracy early in training. Hence we propose to start at a low discriminator update frequency (n_D = 1), and ramp up when our discriminator is performing poorly.

[9] Since PSG produces a coreset of only 200 examples (20 per class), the covariance of its InceptionNet-extracted features is singular, and therefore it is not possible to compute an FID score.

[10] We group per-class unconditional GANs together with conditional GANs under the DPGAN umbrella.

[11] Results from Vinaroz et al. (2022) are presented graphically in the paper. Exact numbers can be found in their code.

Top section:

| Privacy | Method | Reported in | FID | Acc. (%) |
|---|---|---|---|---|
| ε = ∞ | Real data | This work | 1.1 | 96.6 |
| ε = ∞ | GAN | This work | 30.0 ± 1.6 | 92.0 ± 0.4 |
| ε = 10 | DP-MERF | Cao et al. (2021) | 274.0 | 65 |
| ε = 10 | DP-Sinkhorn | Cao et al. (2021) | 189.5 | 76.3 |
| ε = 10 | DPDM | Dockhorn et al. (2022) | 21.1 | - |
| ε = 10 | DPGAN | Long et al. (2021) | - | 54.09 |
| ε = 10 | GS-WGAN | Long et al. (2021) | - | 63.26 |
| ε = 10 | PATE-GAN | Long et al. (2021) | - | 58.70 |
| ε = 10 | G-PATE | Long et al. (2021) | - | 70.72 |
| ε = 9.39* | Our DPGAN | This work | 170.8 ± 20.3 | 82.4 ± 4.4 |

Bottom section:

| Privacy | Method | Reported in | FID | Acc. (%) |
|---|---|---|---|---|
| ε = 10 | DPGAN | Long et al. (2021) | 485.41 | 52.11 |
| ε = 10 | GS-WGAN | Long et al. (2021) | 432.58 | 61.36 |
| ε = 10 | PATE-GAN | Long et al. (2021) | 424.60 | 65.35 |
| ε = 10 | G-PATE | Long et al. (2021) | 305.92 | 68.97 |
| ε = 10 | DataLens | Wang et al. (2021) | 320.84 | 72.87 |

Table 3: Top section of the table: Comparison to state-of-the-art results on 32×32 CelebA-Gender, targeting (ε, 10⁻⁶)-DP (except for the results reported in Long et al. (2021), which target a weaker (ε, 10⁻⁵)-DP). We run 3 seeds and report the mean ± std. (*) For our results, we target ε = 10 with Opacus accounting and additionally report ε using the improved privacy accounting of Gopi et al. (2021). DPDM reports a much better FID score than our DPGAN (which is itself an improvement over previous results). Our DPGAN achieves the best reported accuracy score. Bottom section of the table: Results for GAN-based approaches reported in Long et al. (2021) and Wang et al. (2021), which are not directly comparable because they target (10, 10⁻⁵)-DP and use 64×64 CelebA-Gender.

Accuracy on real data must be released with DP. While this is feasible, it introduces the additional problem of having to find the right split of privacy budget for the best performance. We observe that discriminator accuracy is related to discriminator accuracy on fake samples only (which are free to evaluate on, by post-processing). Hence we use it as a proxy to assess discriminator performance.

We propose an adaptive step frequency, parameterized by β and d. β is the decay parameter used to compute the exponential moving average (EMA) of discriminator accuracy on fake batches before each generator update. d is the accuracy floor: once the EMA reaches it, we move to the next update frequency n_D ∈ {1, 2, 5, 10, 20, 50, 100, 200, 500, 1000, ...}. Additionally, we promise a grace period of 2/(1 − β) generator steps before moving on to the next update frequency, motivated by the fact that the β-EMA's value is primarily determined by its last 2/(1 − β) observations. We use β = 0.99 in all settings, and try d = 0.6 and d = 0.7.

The additional benefit of the adaptive step frequency is that it means we do not have to search for the optimal update frequency. Although the adaptive step frequency introduces the extra hyperparameter of the threshold d, we found that these two settings (d = 0.6 and d = 0.7) were sufficient to improve over results of a much more extensive hyperparameter search over n_D (whose optimal value varied significantly based on the noise level σ and expected batch size B).
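The adaptive schedule of Section 6.2 can be sketched in a few lines. This is a minimal illustration rather than the authors' implementation: the class name `AdaptiveStepFrequency` and its fields are hypothetical, the EMA is assumed to be seeded with the first observation, "reaching the floor" is read as the EMA dropping below d, and the grace period is assumed to apply from the start of training as well as after each switch.

```python
from dataclasses import dataclass
from typing import Optional

# Ladder of update frequencies listed in Section 6.2 (cut off at 1000 here).
FREQUENCIES = (1, 2, 5, 10, 20, 50, 100, 200, 500, 1000)


@dataclass
class AdaptiveStepFrequency:
    """Hypothetical helper; not taken from the authors' code."""

    beta: float = 0.99            # EMA decay (value used in the paper)
    floor: float = 0.6            # accuracy floor d (the paper tries 0.6 and 0.7)
    level: int = 0                # index into FREQUENCIES, so training starts at n_D = 1
    ema: Optional[float] = None   # assumption: the EMA is seeded with the first observation
    steps_since_switch: int = 0   # generator steps taken at the current frequency

    @property
    def grace(self) -> int:
        # Grace period of 2 / (1 - beta) generator steps (200 for beta = 0.99).
        return round(2 / (1 - self.beta))

    @property
    def n_d(self) -> int:
        return FREQUENCIES[self.level]

    def update(self, fake_accuracy: float) -> int:
        """Record accuracy on one fake batch before a generator step; return n_D."""
        if self.ema is None:
            self.ema = fake_accuracy
        else:
            self.ema = self.beta * self.ema + (1 - self.beta) * fake_accuracy
        self.steps_since_switch += 1
        grace_over = self.steps_since_switch >= self.grace
        at_top = self.level == len(FREQUENCIES) - 1
        # Assumption: "reaching the floor" is read as the EMA dropping below d.
        if grace_over and not at_top and self.ema < self.floor:
            self.level += 1
            self.steps_since_switch = 0
        return self.n_d
```

In a training loop, one would call `update` with the discriminator's accuracy on a freshly generated fake batch just before each generator step, then run the returned number of discriminator steps before the next one. Because only fake samples are evaluated, this bookkeeping consumes no additional privacy budget.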
6.3 Comparison with previous results in the literature

6.3.1 MNIST and Fashion MNIST

Table 2 summarizes our best experimental settings for MNIST and Fashion MNIST, and situates them in the context of previously reported results for the task. We also present a visual comparison in Figure 7. We provide some examples of generated images in Figures 9 and 10 for ε = 10, and Figures 11 and 12 for ε = 1.

Figure 7: MNIST and Fashion MNIST results at (10, 10⁻⁵)-DP for different methods. Images of other methods are from Cao et al. (2021) and Dockhorn et al. (2022).

Figure 8: 32×32 CelebA-Gender at (10, 10⁻⁶)-DP. From top to bottom: DPDM (unconditional generation), DP-Sinkhorn, and our DPGAN. Images of other methods are from Cao et al. (2021) and Dockhorn et al. (2022).

Plain DPSGD beats all alternative GAN privatization schemes. Our baseline DPGAN from Section 4, with the appropriate choice of n_D (and without the modifications described in this section yet), outperforms all other GAN-based approaches proposed in the literature (GS-WGAN, PATE-GAN, G-PATE, and DataLens) uniformly across both metrics, both datasets, and both privacy levels.

Large batch sizes and adaptive discriminator step frequency improve GAN training. Broadly speaking, across both privacy levels and both datasets, we see an improvement from taking larger batch sizes, and then another with the adaptive step frequency.

Comparison with state-of-the-art. With the exception of DPDM, our best DPGANs are competitive with state-of-the-art approaches for DP synthetic data, especially in terms of FID scores.

6.3.2 CelebA-Gender

We also report results on generating 32×32 CelebA, conditioned on gender at (10, 10⁻⁶)-DP. For these experiments, we employed large batches (B = 2048) and adaptive discriminator step frequency with threshold d = 0.6. Full implementation details can be found in Appendix C.
Results are summarized in Table 3 and visualized in Figure 8. For more example generations, see Figure 13.

7 Conclusion

We revisit differentially private GANs and show that, with appropriate tuning of the training procedure, they can perform dramatically better than previously thought. Some crucial modifications include increasing discriminator step frequency, increasing the batch size, and introducing adaptive discriminator step frequency. We explore the hypothesis that the previous deficiencies of DPGANs were due to poor classification accuracy of the discriminator. More broadly, our work supports the recurring finding that carefully-tuned DPSGD on conventional architectures can yield strong results for differentially private machine learning.

Acknowledgements

AB is supported by an NSERC Discovery Grant, a David R. Cheriton Graduate Scholarship, and an Ontario Graduate Scholarship. GK is supported by an NSERC Discovery Grant, an unrestricted gift from Google, and a University of Waterloo startup grant. We would like to thank the TMLR anonymous reviewers and action editor for providing constructive feedback.

References

Martin Abadi, Andy Chu, Ian Goodfellow, H. Brendan McMahan, Ilya Mironov, Kunal Talwar, and Li Zhang. Deep learning with differential privacy. In CCS '16: 2016 ACM SIGSAC Conference on Computer and Communications Security, 2016.

Rohan Anil, Badih Ghazi, Vineet Gupta, Ravi Kumar, and Pasin Manurangsi. Large-scale differentially private BERT. In Findings of the Association for Computational Linguistics: EMNLP '22, 2022.

Martin Arjovsky, Soumith Chintala, and Léon Bottou. Wasserstein generative adversarial networks. In Proceedings of the 34th International Conference on Machine Learning (ICML '17), 2017.

Raef Bassily, Adam Smith, and Abhradeep Thakurta. Private empirical risk minimization: Efficient algorithms and tight error bounds.
In 55th Annual IEEE Symposium on Foundations of Computer Science (FOCS '14), 2014.

Brett K. Beaulieu-Jones, Zhiwei Steven Wu, Chris Williams, Ran Lee, Sanjeev P. Bhavnani, James Brian Byrd, and Casey S. Greene. Privacy-preserving generative deep neural networks support clinical data sharing. Circulation: Cardiovascular Quality and Outcomes, 12(7), 2019.

Andrew Brock, Jeff Donahue, and Karen Simonyan. Large scale GAN training for high fidelity natural image synthesis. In 7th International Conference on Learning Representations (ICLR '19), 2019.

Tianshi Cao, Alex Bie, Arash Vahdat, Sanja Fidler, and Karsten Kreis. Don't generate me: Training differentially private generative models with Sinkhorn divergence. In Advances in Neural Information Processing Systems 34 (NeurIPS '21), 2021.

Tatjana Chavdarova, Matteo Pagliardini, Sebastian U. Stich, François Fleuret, and Martin Jaggi. Taming GANs with lookahead-minmax. In 9th International Conference on Learning Representations (ICLR '21), 2021.

Dingfan Chen, Tribhuvanesh Orekondy, and Mario Fritz. GS-WGAN: A gradient-sanitized approach for learning differentially private generators. In Advances in Neural Information Processing Systems 33 (NeurIPS '20), 2020.

Dingfan Chen, Raouf Kerkouche, and Mario Fritz. Private set generation with discriminative information. In Advances in Neural Information Processing Systems 35 (NeurIPS '22), 2022.

Qingrong Chen, Chong Xiang, Minhui Xue, Bo Li, Nikita Borisov, Dali Kaafar, and Haojin Zhu. Differentially private data generative models. CoRR, abs/1812.02274, 2018.

Soumith Chintala, Emily Denton, Martin Arjovsky, and Michael Mathieu. How to train a GAN? Tips and tricks to make GANs work. https://github.com/soumith/ganhacks, 2016.

Soham De, Leonard Berrada, Jamie Hayes, Samuel L Smith, and Borja Balle. Unlocking high-accuracy differentially private image classification through scale. CoRR, abs/2204.13650, 2022.

Tim Dockhorn, Tianshi Cao, Arash Vahdat, and Karsten Kreis.
Differentially private diffusion models. CoRR, abs/2210.09929, 2022.

Cynthia Dwork and Aaron Roth. The algorithmic foundations of differential privacy. Foundations and Trends in Theoretical Computer Science, 9(3-4):211–407, 2014.

Cynthia Dwork, Krishnaram Kenthapadi, Frank McSherry, Ilya Mironov, and Moni Naor. Our data, ourselves: Privacy via distributed noise generation. In 25th Annual International Conference on the Theory and Applications of Cryptographic Techniques (EUROCRYPT '06), 2006a.

Cynthia Dwork, Frank McSherry, Kobbi Nissim, and Adam Smith. Calibrating noise to sensitivity in private data analysis. In Proceedings of the 3rd Conference on Theory of Cryptography (TCC '06), 2006b.

Tanner Fiez and Lillian J. Ratliff. Local convergence analysis of gradient descent ascent with finite timescale separation. In 9th International Conference on Learning Representations (ICLR '21), 2021.

Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems 27 (NIPS '14), 2014.

Sivakanth Gopi, Yin Tat Lee, and Lukas Wutschitz. Numerical composition of differential privacy. In Advances in Neural Information Processing Systems 34 (NeurIPS '21), 2021.

Frederik Harder, Kamil Adamczewski, and Mijung Park. DP-MERF: Differentially private mean embeddings with random features for practical privacy-preserving data generation. In 24th International Conference on Artificial Intelligence and Statistics (AISTATS '21), 2021.

Frederik Harder, Milad Jalali Asadabadi, Danica J. Sutherland, and Mijung Park. Differentially private data generation needs better features. CoRR, abs/2205.12900, 2022.

Moritz Hardt, Katrina Ligett, and Frank McSherry. A simple and practical algorithm for differentially private data release.
In Advances in Neural Information Processing Systems 25 (NIPS '12), 2012.

Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. In Advances in Neural Information Processing Systems 30 (NIPS '17), 2017.

James Jordon, Jinsung Yoon, and Mihaela van der Schaar. PATE-GAN: Generating synthetic data with differential privacy guarantees. In 7th International Conference on Learning Representations (ICLR '19), 2019.

Minguk Kang, Jun-Yan Zhu, Richard Zhang, Jaesik Park, Eli Shechtman, Sylvain Paris, and Taesung Park. Scaling up GANs for text-to-image synthesis. In Proceedings of the 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR '23), 2023.

Tero Karras, Samuli Laine, and Timo Aila. A style-based generator architecture for generative adversarial networks. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR '19), 2019.

Tero Karras, Samuli Laine, Miika Aittala, Janne Hellsten, Jaakko Lehtinen, and Timo Aila. Analyzing and improving the image quality of StyleGAN. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR '20), 2020.

Tero Karras, Miika Aittala, Samuli Laine, Erik Härkönen, Janne Hellsten, Jaakko Lehtinen, and Timo Aila. Alias-free generative adversarial networks. In Advances in Neural Information Processing Systems 34 (NeurIPS '21), 2021.

Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In 3rd International Conference on Learning Representations (ICLR '15), 2015.

Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proc. IEEE, 86(11):2278–2324, 1998.

Xuechen Li, Florian Tramèr, Percy Liang, and Tatsunori Hashimoto. Large language models can be strong differentially private learners.
In 10th International Conference on Learning Representations (ICLR '22), 2022.

Jingcheng Liu and Kunal Talwar. Private selection from private candidates. In Proceedings of the 51st Annual ACM SIGACT Symposium on Theory of Computing (STOC '19), 2019.

Ziwei Liu, Ping Luo, Xiaogang Wang, and Xiaoou Tang. Deep learning face attributes in the wild. In Proceedings of the 2015 IEEE International Conference on Computer Vision (ICCV '15), 2015.

Yunhui Long, Boxin Wang, Zhuolin Yang, Bhavya Kailkhura, Aston Zhang, Carl Gunter, and Bo Li. G-PATE: Scalable differentially private data generator via private aggregation of teacher discriminators. In Advances in Neural Information Processing Systems 34 (NeurIPS '21), 2021.

Ryan McKenna, Daniel Sheldon, and Gerome Miklau. Graphical-model based estimation and inference for differential privacy. In Proceedings of the 36th International Conference on Machine Learning (ICML '19), 2019.

Ilya Mironov, Kunal Talwar, and Li Zhang. Rényi differential privacy of the sampled Gaussian mechanism. CoRR, abs/1908.10530, 2019.

Takeru Miyato, Toshiki Kataoka, Masanori Koyama, and Yuichi Yoshida. Spectral normalization for generative adversarial networks. In 6th International Conference on Learning Representations (ICLR '18), 2018.

Shubhankar Mohapatra, Sajin Sasy, Xi He, Gautam Kamath, and Om Thakkar. The role of adaptive optimizers for honest private hyperparameter selection. In Proceedings of the Thirty-Sixth AAAI Conference on Artificial Intelligence (AAAI-22), 2022.

Marcel Neunhoeffer, Steven Wu, and Cynthia Dwork. Private post-GAN boosting. In 9th International Conference on Learning Representations (ICLR '21), 2021.

Kobbi Nissim, Sofya Raskhodnikova, and Adam Smith. Smooth sensitivity and sampling in private data analysis. In Proceedings of the 39th Annual ACM Symposium on the Theory of Computing (STOC '07), 2007.

Nicolas Papernot, Martín Abadi, Úlfar Erlingsson, Ian J.
Goodfellow, and Kunal Talwar. Semi-supervised knowledge transfer for deep learning from private training data. In 5th International Conference on Learning Representations (ICLR '17), 2017.

Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Köpf, Edward Z. Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. PyTorch: An imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems 32 (NeurIPS '19), 2019.

Natalia Ponomareva, Hussein Hazimeh, Alex Kurakin, Zheng Xu, Carson Denison, H. Brendan McMahan, Sergei Vassilvitskii, Steve Chien, and Abhradeep Guha Thakurta. How to dp-fy ML: A practical guide to machine learning with differential privacy. J. Artif. Intell. Res., 77:1113–1201, 2023.

Alec Radford, Luke Metz, and Soumith Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. In 4th International Conference on Learning Representations (ICLR '16), 2016.

Axel Sauer, Katja Schwarz, and Andreas Geiger. StyleGAN-XL: Scaling StyleGAN to large diverse datasets. In Special Interest Group on Computer Graphics and Interactive Techniques Conference (SIGGRAPH '22), 2022.

Shuang Song, Kamalika Chaudhuri, and Anand D. Sarwate. Stochastic gradient descent with differentially private updates. In 2013 IEEE Global Conference on Signal and Information Processing, 2013.

Yuchao Tao, Ryan McKenna, Michael Hay, Ashwin Machanavajjhala, and Gerome Miklau. Benchmarking differentially private synthetic data generation algorithms. CoRR, abs/2112.09238, 2021.

Reihaneh Torkzadehmahani, Peter Kairouz, and Benedict Paten. DP-CGAN: Differentially private synthetic data and label generation.
In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPR Workshops '19), 2019.

Florian Tramèr and Dan Boneh. Differentially private learning needs better features (or much more data). In 9th International Conference on Learning Representations (ICLR '21), 2021.

Margarita Vinaroz, Mohammad-Amin Charusaie, Frederik Harder, Kamil Adamczewski, and Mi Jung Park. Hermite polynomial features for private data generation. In Proceedings of the 39th International Conference on Machine Learning (ICML '22), 2022.

Chris Waites and Rachel Cummings. Differentially private normalizing flows for privacy-preserving density estimation. CoRR, abs/2103.14068, 2021.

Boxin Wang, Fan Wu, Yunhui Long, Luka Rimanic, Ce Zhang, and Bo Li. DataLens: Scalable privacy preserving training via gradient compression and aggregation. In CCS '21: 2021 ACM SIGSAC Conference on Computer and Communications Security, 2021.

Yu-Xiang Wang, Borja Balle, and Shiva Prasad Kasiviswanathan. Subsampled Rényi differential privacy and analytical moments accountant. In 22nd International Conference on Artificial Intelligence and Statistics (AISTATS '19), 2019.

Han Xiao, Kashif Rasul, and Roland Vollgraf. Fashion-MNIST: a novel image dataset for benchmarking machine learning algorithms. CoRR, abs/1708.07747, 2017.

Liyang Xie, Kaixiang Lin, Shu Wang, Fei Wang, and Jiayu Zhou. Differentially private generative adversarial network. CoRR, abs/1802.06739, 2018.

Ashkan Yousefpour, Igor Shilov, Alexandre Sablayrolles, Davide Testuggine, Karthik Prasad, Mani Malek, John Nguyen, Sayan Gosh, Akash Bharadwaj, Jessica Zhao, Graham Cormode, and Ilya Mironov. Opacus: User-friendly differential privacy library in PyTorch. CoRR, abs/2109.12298, 2021.

Da Yu, Saurabh Naik, Arturs Backurs, Sivakanth Gopi, Huseyin A.
Inan, Gautam Kamath, Janardhan Kulkarni, Yin Tat Lee, Andre Manoel, Lukas Wutschitz, Sergey Yekhanin, and Huishuai Zhang. Differentially private fine-tuning of language models. In 10th International Conference on Learning Representations (ICLR '22), 2022.

Jun Zhang, Graham Cormode, Cecilia M. Procopiuc, Divesh Srivastava, and Xiaokui Xiao. PrivBayes: Private data release via Bayesian networks. ACM Trans. Database Syst., 42(4), 2017.

A Generated samples

We provide a few non-cherrypicked samples for MNIST and Fashion MNIST at ε = 10 and ε = 1, as well as 32×32 CelebA-Gender at ε = 10.

Figure 9: Some non-cherrypicked MNIST samples from our method, ε = 10.

Figure 10: Some non-cherrypicked Fashion MNIST samples from our method, ε = 10.

Figure 11: Some non-cherrypicked MNIST samples from our method, ε = 1.

Figure 12: Some non-cherrypicked Fashion MNIST samples from our method, ε = 1.

Figure 13: Some non-cherrypicked CelebA samples from our method, ε = 10.

B MNIST and Fashion MNIST implementation details

B.1 Training recipe

For MNIST and Fashion MNIST, we begin from an open source PyTorch implementation of DCGAN (Radford et al., 2016) (available at this link) that performs well non-privately, and copy their training recipe. This includes: batch size B = 128, the Adam optimizer (Kingma & Ba, 2015) with parameters (α = 0.0002, β1 = 0.5, β2 = 0.999) for both G and D, the non-saturating GAN loss (Goodfellow et al., 2014), and a 5-layer fully convolutional architecture with width parameter d = 128.
To adapt it to our purposes, we make three architectural modifications: in both G and D we (1) remove all BatchNorm layers (which are not compatible with DPSGD); (2) add label embedding layers to enable labelled generation; and (3) adjust convolutional/transpose convolutional stride lengths and kernel sizes as well as remove the last layer, in order to process 1×28×28 images without having to resize. Finally, we remove their custom weight initialization, opting for PyTorch defaults.

Our baseline non-private GANs are trained for 45K steps. We train our non-private GANs with Poisson sampling as well: for each step of discriminator training, we sample real examples by including each element of our dataset independently with probability B/n, where n is the size of our dataset. We then add B fake examples sampled from G to form our fake/real combined batch.

Clipping fake sample gradients. When training the discriminator privately with DPSGD, we draw B fake examples and compute clipped per-example gradients on the entire combined batch of real and fake examples (see Algorithm 1). This is the approach taken in the prior work of Torkzadehmahani et al. (2019). We remark that this is purely a design choice: it is not necessary to clip the gradients of the fake samples, nor to process them together in the same batch. So long as we preserve the sensitivity of gradient queries with respect to the real data, the same amount of noise will suffice for privacy.

B.2 Large batch size hyperparameter search

We scale up batch sizes, considering B ∈ {64, 128, 512, 2048}, and search for the optimal noise level σ and n_D. For B = 128 targeting ε = 10, we search over three noise levels $\Sigma^{10}_{128} = \{0.6, 1.0, 1.4\}$. We choose candidate noise levels for other batch sizes as follows: when considering a batch size B = 128n, we search over $\Sigma^{10}_{128n} := \{\sqrt{n} \cdot \sigma : \sigma \in \Sigma^{10}_{128}\}$.

We also target the high privacy (ε = 1) regime.
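The candidate noise levels of Appendix B.2 can be enumerated with a short sketch. It assumes the scaling factor for a batch size B = 128n is √n (a reading consistent with the σ = 4.0 used at B = 2048 in Section 6.1), and it applies the factor of 5 that Appendix B.2 uses for ε = 1. The function name `candidate_sigmas` and its arguments are hypothetical, not taken from the authors' code.

```python
import math

BASE_BATCH = 128
BASE_SIGMAS_EPS10 = (0.6, 1.0, 1.4)  # noise grid for B = 128 at eps = 10 (Appendix B.2)


def candidate_sigmas(batch_size: int, eps: int = 10) -> list:
    """Noise levels searched for a given batch size (hypothetical helper)."""
    if eps not in (1, 10):
        raise ValueError("only eps = 1 and eps = 10 are covered by Appendix B.2")
    n = batch_size / BASE_BATCH
    # Assumption: noise scales with sqrt(n); the eps = 1 grid is 5x the eps = 10 grid.
    scale = math.sqrt(n) * (5.0 if eps == 1 else 1.0)
    return [round(scale * sigma, 4) for sigma in BASE_SIGMAS_EPS10]


grid_2048 = candidate_sigmas(2048)           # B = 2048, eps = 10
grid_128_eps1 = candidate_sigmas(128, eps=1)  # B = 128, eps = 1
```

Under these assumptions, the B = 2048 grid at ε = 10 is {2.4, 4.0, 5.6}, which contains the σ = 4.0 setting reported in Section 6.1.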
For ε = 1, we multiply all noise levels by 5: $\Sigma^{1}_{B} = \{5\sigma : \sigma \in \Sigma^{10}_{B}\}$. For each setting of (B, σ), we search over a grid of n_D ∈ {1, 2, 5, 10, 20, 50, 100, 200, 500}. Due to compute limitations, we omit some values that we are confident will fail (e.g., trying n_D = 1 when mode collapse occurs for n_D = 5).

C CelebA implementation details

The CelebA dataset (Liu et al., 2015) consists of 202,599 178×218 RGB images of celebrity faces, each labelled with 40 binary attributes. The version of the dataset we work with, 32×32 CelebA-Gender (a benchmark reported in Cao et al. (2021)), is obtained by resizing to 32×32 and labelling with the gender attribute. The 202,599 images are partitioned into a training set of size 182,637 and a test set of size 19,962.

We use essentially the same model architectures we used for MNIST and Fashion MNIST for CelebA: 4-layer fully convolutional networks with label embedding layers for both D and G. We adjust convolutional/transpose convolutional stride lengths and kernel sizes to process 3×32×32 images without having to resize. D and G for CelebA are slightly larger, having 2.64M and 3.16M trainable parameters respectively.

Drawing from the results of our MNIST and Fashion MNIST experiments, we used a large batch size (B = 2048) and adaptive discriminator updates, with threshold d = 0.6. We experimented with a few settings for noise level σ ∈ {2, 3, 4}. Our best results were with the largest noise σ = 4, which gave us 385K discriminator steps when targeting ε = 10.

D Ablations

D.1 Varying discriminator size

We train DPGANs on MNIST under the setting of Section 4: using noise level σ = 1, batch size B = 128, and targeting ε = 10, which yields 450K discriminator steps. By adjusting d_D (the number of filter banks in the first convolutional layer of the discriminator, which controls width throughout), we can obtain discriminators with roughly 0.25×, 0.5×, and 2× the parameter count (Table 4).
For these experiments, we vary discriminator size while keeping the generator size (2.27M parameters) fixed.

| d_D | D parameter count | Ratio |
|---|---|---|
| 64 | 0.44M | 0.26× |
| 96 | 0.97M | 0.57× |
| 128 | 1.72M | 1× |
| 196 | 3.86M | 2.24× |

Table 4: Number of trainable parameters in discriminator size variants.

Results. In Figure 14 we plot the progression of FID and downstream classifier accuracy of generated MNIST samples during non-private training with discriminators of varying size. We observe that, non-privately, larger discriminators do better in terms of FID early on, and converge to slightly worse accuracies.

In Figures 15 and 16, we plot the progression of FID and accuracy (respectively) for DPGANs trained on MNIST (targeting ε = 10) at different discriminator update frequencies n_D. In each plot, we compare the d_D = 128 runs (in green), which correspond to results from Figures 1a and 2a, to the results of training with discriminators with 0.26× to 2.24× as many trainable parameters. These additional settings mostly track the d_D = 128 runs. Larger discriminators appear to perform slightly better, especially in terms of accuracy early in training. Larger discriminators also use significantly more compute.

[Figure 14 panels: (a) Non-private FID; (b) Non-private accuracy (%). x-axis: Steps.]

Figure 14: MNIST FID and downstream classifier accuracy for non-private GAN training with various discriminator sizes. The green (d_D = 128) line corresponds to the 1.72M parameter discriminator used in previous experiments.
[Figure 15 panels: (a) FID at n_D = 1; (b) FID at n_D = 10; (c) FID at n_D = 50; (d) FID at n_D = 100. x-axis: Privacy budget.]

Figure 15: MNIST FID for DPGAN training (targeting ε = 10) at various discriminator update frequencies n_D. In each plot, we present results from training with various discriminator sizes. The green (d_D = 128) lines correspond to the results pictured in Figure 1a. Discriminators with 0.26× to 2.24× as many trainable parameters track the results of the original d_D = 128 setting.

D.2 Varying learning rate

We train DPGANs on MNIST under the setting of Section 4: using noise level σ = 1, batch size B = 128, and targeting ε = 10, which yields 450K discriminator steps. Here, we keep discriminator size d_D = 128 fixed, and vary the learning rates of G and D, while keeping the other Adam parameters β1 and β2 for both G and D fixed. Table 5 lists the learning rate settings we consider.

[Figure 16 panels: (a) Accuracy at n_D = 1; (b) Accuracy at n_D = 10; (c) Accuracy at n_D = 50; (d) Accuracy at n_D = 100. x-axis: Privacy budget; y-axis: Accuracy (%).]

Figure 16: MNIST downstream classifier accuracy for DPGAN training (targeting ε = 10) at various discriminator update frequencies n_D. In each plot, we present results from training with various discriminator sizes. The green (d_D = 128) lines correspond to the results pictured in Figure 2a. Discriminators with 0.26× to 2.24× as many trainable parameters track the results of the original d_D = 128 setting.
| Setting | G LR | D LR |
|---|---|---|
| Base | 0.0002 | 0.0002 |
| 5× LR | 0.001 | 0.001 |
| 0.2× LR | 0.00004 | 0.00004 |
| 5× D LR | 0.0002 | 0.001 |
| 0.2× D LR | 0.0002 | 0.00004 |

Table 5: Learning rate settings.

Results. In Figure 17 we plot the progression of FID and downstream classifier accuracy of generated MNIST samples during non-private training under various learning rate settings. We observe FID and accuracy degradation near the end of training for the 5× (D) LR settings. The 0.2× LR setting converges much more slowly. This is remedied when we adjust only the D LR by 0.2× and leave G LR unchanged (comparing the green line to the purple line in Figure 17).

Figures 18 and 19 examine the case where we adjust both G and D learning rates by 5× and 0.2× respectively. Broadly, we see the same behaviour as in Section 4: FID and downstream classification accuracy improve significantly as we take n_D ≫ 1, up until the point where n_D is too high, limiting the generator from taking enough steps to converge. However, we note some differences: (1) the performance of the best settings for n_D is reduced across the board (most prominently in the case of accuracy in the 0.2× LR setting; see Figure 19b); and (2) the n_D which results in the best performance is different: while n_D = 50 leads to the best results for MNIST at ε = 10 in Section 4, n_D = 200 performs the best for 5× LR and n_D = 100 is the best for 0.2× LR. Note that these two differences are not observed in the experiments where we vary discriminator size d_D in Appendix D.1: all runs with different d_D track the d_D = 128 run closely.

In Figures 20 and 21, we examine the case where we only adjust D LR, by 5× and 0.2× respectively, and keep G LR fixed at 0.0002. Again, we observe large improvements in utility as we take n_D ≫ 1, up to the point where n_D is too high. We note that when keeping G LR fixed, the 0.2× setting gets much closer to the level of improvement from varying n_D observed in the base LR setting.
In summary: changing learning rates while keeping other hyperparameters fixed still exhibits the benefit of increasing n_D, but compared to the base setting, does not recover: (1) the scale of the improvement, and (2) the precise behaviour of the phenomenon, i.e. the same optimal n_D. We leave open the question of understanding more precisely how the phenomenon changes under different learning rates: it may be fruitful to investigate how Adam's momentum parameters (β1, β2) and DPSGD noise level σ impact the results, and also perhaps the degradation of the non-private GAN results for large D LR.

[Figure 17 panels, each plotted against training steps (10000 to 40000) for the Base, 5× LR, 0.2× LR, 5× D LR, and 0.2× D LR settings: (a) Non-private FID; (b) Non-private accuracy (%).]

Figure 17: MNIST FID and downstream classifier accuracy for non-private GAN training with various learning rate settings. The blue (d_D = 128) line corresponds to the base learning rate setting used in previous experiments.

[Figure 18 panels, each plotted against privacy budget for n_D = 1, 10, 50, 100, 200: (a) FID at 5× LR, legend includes non-private (8.3); (b) Accuracy (%) at 5× LR, legend includes non-private (94.2).]

Figure 18: MNIST FID and downstream classifier accuracy for DPGAN training runs targeting ε = 10 and using 5× the base learning rate for both G and D, under various settings of n_D.

[Figure 19 panels, each plotted against privacy budget for n_D = 1, 10, 50, 100, 200: (a) FID at 0.2× LR, legend includes non-private (5.9); (b) Accuracy (%) at 0.2× LR, legend includes non-private (97.2).]

Figure 19: MNIST FID and downstream classifier accuracy for DPGAN training runs targeting ε = 10 and using 0.2× the base learning rate for both G and D, under various settings of n_D.
[Figure 20 panels, each plotted against privacy budget for n_D = 1, 10, 50, 100, 200: (a) FID at 5× D LR, legend includes non-private (6.5); (b) Accuracy (%) at 5× D LR, legend includes non-private (93.1).]

Figure 20: MNIST FID and downstream classifier accuracy for DPGAN training runs targeting ε = 10 and using 5× the base learning rate for D only (G LR unchanged), under various settings of n_D.

[Figure 21 panels, each plotted against privacy budget for n_D = 1, 10, 50, 100, 200: (a) FID at 0.2× D LR, legend includes non-private (4.5); (b) Accuracy (%) at 0.2× D LR, legend includes non-private (98.1).]

Figure 21: MNIST FID and downstream classifier accuracy for DPGAN training runs targeting ε = 10 and using 0.2× the base learning rate for D only (G LR unchanged), under various settings of n_D.

D.3 Varying batch size and noise level

Fixing a batch size B and a noise level σ yields a total discriminator step budget T allowed under our privacy budget ε. For example, the results from Section 4 and Appendices D.1 and D.2 use B = 128 and σ = 1, which allows for T = 450K when targeting ε = 10 on MNIST. Again targeting MNIST at ε = 10, we take various combinations of (B, σ), and plot the final FID and accuracy of DPGANs trained at each setting, over a spectrum of n_D. Results are pictured in Figures 22, 23, and 24.

[Figure 22 panels, each plotted against discriminator step frequency n_D (axis ticks 1, 4, 16, 64, 256) for σ = 0.6, 1, 1.4: (a) FID at B = 128; (b) Accuracy (%) at B = 128.]

Figure 22: MNIST FID and downstream classifier accuracy for B = 128 runs targeting ε = 10, with σ ∈ {0.6, 1, 1.4}. We report final utility over a range of n_D for the 3 noise levels. The x-axis is log-scaled.
[Figure 23 panels, each plotted against discriminator step frequency n_D (axis ticks 1, 4, 16, 64) for σ = 1.2, 2, 2.8: (a) FID at B = 512; (b) Accuracy (%) at B = 512.]

Figure 23: MNIST FID and downstream classifier accuracy for B = 512 runs targeting ε = 10, with σ ∈ {1.2, 2, 2.8}. We report final utility over a range of n_D for the 3 noise levels. The x-axis is log-scaled.

[Figure 24 panels, each plotted against discriminator step frequency n_D (axis ticks 1, 2, 4, 8, 16, 32) for σ = 2.4, 4, 5.6: (a) FID at B = 2048; (b) Accuracy (%) at B = 2048.]

Figure 24: MNIST FID and downstream classifier accuracy for B = 2048 runs targeting ε = 10, with σ ∈ {2.4, 4, 5.6}. We report final utility over a range of n_D for the 3 noise levels. The x-axis is log-scaled.

Results. At all batch sizes and noise levels, we observe the same U-shaped utility curve described in Section 4, which predicts the existence of an optimal n_D for any fixed setting of (σ, B). For fixed B, the optimal n_D is lower for smaller σ. We also see that for settings with low σ and large B, the optimal n_D can be quite low. For all batch sizes, choosing noise levels that achieve their optimal n_D at fairly large values (≫ 1) tends to outperform smaller noise levels, which achieve their optimal n_D early.

E Additional results

E.1 Wall clock times

We report wall clock times for training runs under various hyperparameter settings. All runs are executed on single NVIDIA A40 GPU setups running PyTorch 1.11.0+CUDA 11.3.1 and Opacus 1.1.3.

Table 6 presents results on MNIST, in particular comparing the effect of n_D on training time. The total number of discriminator steps, T, is determined by the privacy budget and DPSGD hyperparameters. Hence, increasing n_D results in fewer total G steps and shorter training time. Table 7 presents training times under adaptive discriminator step frequency for various datasets.
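To make the relationship between T, n_D, and the number of generator steps concrete, the following sketch counts updates under the assumption that one G step is taken after every n_D-th D step. The function name `count_steps` is hypothetical, and the schedule in the released code may differ in details such as how leftover D steps are handled.

```python
def count_steps(total_d_steps, n_d):
    """Count (D steps, G steps) when G updates once per n_d D updates.

    Assumption: the generator is updated after every n_d-th
    discriminator step, so leftover D steps trigger no G step.
    """
    if total_d_steps < 0 or n_d < 1:
        raise ValueError("need total_d_steps >= 0 and n_d >= 1")
    g_steps = 0
    for t in range(1, total_d_steps + 1):
        # A (noisy) discriminator update would happen here.
        if t % n_d == 0:
            g_steps += 1  # A generator update would happen here.
    return total_d_steps, g_steps


if __name__ == "__main__":
    T = 450_000  # D step budget for B = 128, sigma = 1, epsilon = 10
    for n_d in (1, 10, 50, 100, 200):
        _, g = count_steps(T, n_d)
        print(f"n_D = {n_d:>3}: {g} G steps")
```

Under this assumption, T = 450K with n_D = 50 leaves 9,000 generator steps, compared with 450,000 at n_D = 1, since the privacy budget fixes only the number of D steps.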
All private settings are much slower than non-private training. Indeed, the best DP results tend to come from training for a long time with large noise levels, trading off computation for utility (De et al., 2022). For example, the best DP diffusion models (Dockhorn et al., 2022) use 8 V100s for 1 day to train their best MNIST models. Although not directly comparable, we note that our best ε = 10 results train in 7.5 hours on 1 A40.

| Privacy | B | σ | T | n_D | FID | Wall clock time |
|---|---|---|---|---|---|---|
| ε = ∞ | 128 | - | 45K | 1 | 3.4 ± 0.1 | 44m |
| ε = 10 | 128 | 1 | 450K | 1 | 205.3 ± 0.9 | 11h 03m |
| | | | | 10 | 103.4 ± 5.8 | 6h 33m |
| | | | | 50 | 18.5 ± 0.9 | 5h 56m |
| | | | | 100 | 21.0 ± 1.6 | 5h 57m |
| | | | | 200 | 26.6 ± 2.2 | 5h 54m |
| | 2048 | 5.6 | 98K | 20 | 13.2 ± 1.0 | 16h 54m |
| ε = 1 | 128 | 5 | 325K | 200 | 111.1 ± 17.9 | 4h 40m |
| | 512 | 14 | 165K | 50 | 106.2 ± 64.0 | 7h 31m |

Table 6: Wall clock times on MNIST for various settings. The privacy level ε, batch size B, and noise level σ determine the total number of D steps taken during training, T. Given T, the discriminator update frequency n_D determines the number of G steps taken during training. Blank cells repeat the value above them.

| Privacy | Dataset | B | σ | T | FID | Wall clock time |
|---|---|---|---|---|---|---|
| ε = ∞ | MNIST | 128 | - | 45K | 3.4 ± 0.1 | 44m |
| | Fashion MNIST | | | | 16.5 ± 1.7 | 42m |
| | CelebA | | | | 30.0 ± 1.6 | 47m |
| ε = 10 | MNIST | 512 | 2 | 174K | 12.8 ± 0.3 | 7h 35m |
| | Fashion MNIST | | | | 62.3 ± 8.7 | 7h 35m |
| | CelebA | 2048 | 4 | 385K | 170.8 ± 20.3 | 3d 17h 49m |
| ε = 1 | MNIST | 512 | 14 | 165K | 52.6 ± 3.2 | 7h 43m |
| | Fashion MNIST | | | | 126.4 ± 4.1 | 7h 26m |

Table 7: Wall clock times on MNIST, Fashion MNIST and 32 × 32 CelebA for runs using adaptive discriminator step frequency (with the exception of the ε = ∞ results, which use n_D = 1).

E.2 Increasing discriminator learning rate instead of step frequency

From the experimental setting in Section 4 targeting MNIST at ε = 10, we consider the n_D = 1 setting, and increase the discriminator learning rate instead of n_D. G LR is kept fixed at 0.0002. Results are in Table 8. We do not observe the same level of improvement obtained by increasing n_D while keeping D LR/G LR at 1.

| D LR/G LR | FID | Acc. (%) |
|---|---|---|
| 1 | 205.9 | 33.7 |
| 5 | 251.8 | 45.0 |
| 10 | 228.1 | 37.5 |
| 50 | 237.2 | 54.5 |

Table 8: Results on MNIST at ε = 10 for increasing D LR while keeping n_D = 1. For reference, using n_D = 50 while keeping D LR/G LR at 1 yields an FID score of 18.5 ± 0.9 and accuracy of 93.0 ± 0.6%.

F Additional discussion

DP tabular data synthesis. Our investigation focuses on image datasets, while many important applications of private data generation involve tabular data. In these settings, marginal-based approaches (Hardt et al., 2012; Zhang et al., 2017; McKenna et al., 2019) perform the best. While Tao et al. (2021) find that private GAN-based approaches fail to preserve even basic statistics in these settings, we speculate that our techniques may yield improvements.

Taking multiple discriminator steps. The original GAN formulation of Goodfellow et al. (2014) has n_D, the number of discriminator steps taken between generator steps, as a tunable hyperparameter. In the WGAN framework (Arjovsky et al., 2017), it is suggested to train discriminators as much as possible between generator steps, i.e. to optimality, for best performance. In practice, WGAN implementations use n_D = 5 to save on computation. Several studies empirically explore the effect of taking multiple discriminator steps (Miyato et al., 2018; Brock et al., 2019), finding that searching for an optimal n_D can improve results. Similar imbalanced setups, such as lookahead and imbalanced learning rates, have been analyzed theoretically (Chavdarova et al., 2021; Fiez & Ratliff, 2021).

In the private setting, applying such strategies to improve DPGAN training has been relatively unexplored. DPGAN training recipes are largely ports of non-private approaches, inheriting many parameter choices designed for performant non-private training that are sub-optimal under DP. Guidance in the non-private setting (tip 14 of Chintala et al.
(2016)) prescribes training the discriminator for more steps in the presence of noise (a regularization approach used in non-private GANs). This also holds under DP, and is our core strategy that yields the most significant gains in utility. We were not aware of this tip when we discovered this phenomenon, but it serves as validation of our finding. While Chintala et al. (2016) provide little elaboration, looking at further explorations of this principle (and other strategies) in the non-private setting may offer guidance for improving DPGANs. For instance, Chavdarova et al. (2021) propose a lookahead update rule that enables fast convergence in the presence of noise, without using large batches; such techniques may help in the private setting as well.

Hyperparameter tuning in DP machine learning. Hyperparameter tuning is crucial to getting deep learning to work. The same is true under privacy, with two additional concerns: (1) tuning is not free, since naive composition says privacy loss scales with the number of runs; and (2) DPSGD alters the hyperparameter landscape, introducing extra hyperparameters and changing the relative importance of existing ones. Concern (1) is addressed by Liu & Talwar (2019) (although composition is competitive in settings where adaptive selection is important (Mohapatra et al., 2022)). On (2), Ponomareva et al. (2023) offer a comprehensive guide on DP hyperparameter tuning; for GAN training, our work identifies an important parameter with outsized impact in the DP setting.

To compare with prior work, we report our best hyperparameter settings. Indeed, introducing a highly dataset-dependent parameter can result in worse performance overall when accounting for the cost of search in a real deployment setting. Our adaptive discriminator update frequency is motivated by this concern, and our use of MNIST hyperparameters directly for Fashion MNIST is a brittleness sanity-check.
The aspect of our work that identifies the importance of discriminator update frequency in private GAN training is unaffected by concerns regarding private hyperparameter search. Developing evaluation approaches that take into account the cost of search when comparing algorithms is an important direction, which we do not address in this work. For benchmark datasets, the problem is complicated by implicit knowledge encoded in various algorithms' design choices and default hyperparameter ranges.
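The cost of search discussed above can be illustrated with basic (naive) composition: k training runs on the same data, each satisfying (ε, δ)-DP, together satisfy (kε, kδ)-DP. The sketch below is only an illustration of that bound; the helper name and the number of runs are hypothetical, and tighter accounting or private selection methods would give smaller totals.

```python
def naive_composition(epsilon, delta, num_runs):
    """Total (epsilon, delta) under basic composition of num_runs runs,
    each of which is (epsilon, delta)-DP on the same dataset."""
    if num_runs < 1:
        raise ValueError("num_runs must be at least 1")
    return num_runs * epsilon, num_runs * delta


if __name__ == "__main__":
    # Illustrative only: a 5-value sweep over n_D, each run at (10, 1e-5)-DP.
    total_eps, total_delta = naive_composition(10.0, 1e-5, 5)
    print(f"total budget: ({total_eps:g}, {total_delta:g})-DP")
```

Even a small sweep over a single hyperparameter multiplies the privacy cost under this bound, which is why a dataset-dependent parameter such as n_D is costly to tune in a real deployment.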