Published as a conference paper at ICLR 2023

STOCHASTIC DIFFERENTIALLY PRIVATE AND FAIR LEARNING

Andrew Lowy, University of Southern California, lowya@usc.edu
Devansh Gupta*, Indraprastha Institute of Information Technology, Delhi, devansh19160@iiitd.ac.in
Meisam Razaviyayn, University of Southern California, razaviya@usc.edu

*Work done as a visiting scholar at the University of Southern California, Viterbi School of Engineering.

ABSTRACT

Machine learning models are increasingly used in high-stakes decision-making systems. In such applications, a major concern is that these models sometimes discriminate against certain demographic groups, such as individuals of a certain race, gender, or age. Another major concern in these applications is the violation of the privacy of users. While fair learning algorithms have been developed to mitigate discrimination issues, these algorithms can still leak sensitive information, such as individuals' health or financial records. Utilizing the notion of differential privacy (DP), prior works have aimed to develop learning algorithms that are both private and fair. However, existing algorithms for DP fair learning are either not guaranteed to converge or require a full batch of data in each iteration of the algorithm in order to converge. In this paper, we provide the first stochastic differentially private algorithm for fair learning that is guaranteed to converge. Here, the term "stochastic" refers to the fact that our proposed algorithm converges even when minibatches of data are used at each iteration (i.e. stochastic optimization). Our framework is flexible enough to permit different fairness notions, including demographic parity and equalized odds. In addition, our algorithm can be applied to non-binary classification tasks with multiple (non-binary) sensitive attributes. As a byproduct of our convergence analysis, we provide the first utility guarantee for a DP algorithm for solving nonconvex-strongly concave min-max problems. Our numerical experiments show that the proposed algorithm consistently offers significant performance gains over the state-of-the-art baselines, and can be applied to larger-scale problems with non-binary target/sensitive attributes.

1 INTRODUCTION

In recent years, machine learning algorithms have been increasingly used to inform decisions with far-reaching consequences (e.g. whether to release someone from prison or grant them a loan), raising concerns about their compliance with laws, regulations, societal norms, and ethical values. Specifically, machine learning algorithms have been found to discriminate against certain sensitive demographic groups (e.g. racial minorities), prompting a profusion of algorithmic fairness research (Dwork et al., 2012; Sweeney, 2013; Datta et al., 2015; Feldman et al., 2015; Bolukbasi et al., 2016; Angwin et al., 2016; Calmon et al., 2017; Hardt et al., 2016a; Fish et al., 2016; Woodworth et al., 2017; Zafar et al., 2017; Bechavod & Ligett, 2017; Kearns et al., 2018; Prost et al., 2019; Baharlouei et al., 2020; Lowy et al., 2022a). The algorithmic fairness literature aims to develop fair machine learning algorithms that output non-discriminatory predictions. Fair learning algorithms typically need access to the sensitive data in order to ensure that the trained model is non-discriminatory. However, consumer privacy laws (such as the E.U. General Data Protection Regulation) restrict the use of sensitive demographic data in algorithmic decision-making.
These two requirements, fair algorithms trained with private data, present a quandary: how can we train a model to be fair to a certain demographic group if we do not even know which of our training examples belong to that group? The works of Veale & Binns (2017); Kilbertus et al. (2018) proposed a solution to this quandary using secure multi-party computation (MPC), which allows the learner to train a fair model without directly accessing the sensitive attributes. Unfortunately, as Jagielski et al. (2019) observed, MPC does not prevent the trained model from leaking sensitive data. For example, with MPC, the output of the trained model could be used to infer the race of an individual in the training data set (Fredrikson et al., 2015; He et al., 2019; Song et al., 2020; Carlini et al., 2021). To prevent such leaks, Jagielski et al. (2019) argued for the use of differential privacy (Dwork et al., 2006) in fair learning. Differential privacy (DP) provides a strong guarantee that no company (or adversary) can learn much more about any individual than they could have learned had that individual's data never been used.

Since Jagielski et al. (2019), several follow-up works have proposed alternate approaches to DP fair learning (Xu et al., 2019; Ding et al., 2020; Mozannar et al., 2020; Tran et al., 2021b;a; 2022). As shown in Fig. 1, each of these approaches suffers from at least two critical shortcomings. In particular, none of these methods have convergence guarantees when mini-batches of data are used in training. In training large-scale models, memory and efficiency constraints require the use of small minibatches in each iteration of training (i.e. stochastic optimization). Thus, existing DP fair learning methods cannot be used in such settings, since they require computations on the full training data set in every iteration. See Appendix A for a more comprehensive discussion of related work.

Our Contributions: In this work, we propose a novel algorithmic framework for DP fair learning. Our approach builds on the non-private fair learning method of Lowy et al. (2022a). We consider a regularized empirical risk minimization (ERM) problem in which the regularizer penalizes fairness violations, as measured by the Exponential Rényi Mutual Information. Using a result from Lowy et al. (2022a), we reformulate this fair ERM problem as a min-max optimization problem. Then, we use an efficient differentially private variation of stochastic gradient descent-ascent (DP-SGDA) to solve this fair ERM min-max objective. The main features of our algorithm are:

1. Guaranteed convergence for any privacy and fairness level, even when mini-batches of data are used in each iteration of training (i.e. the stochastic optimization setting). As discussed, stochastic optimization is essential in large-scale machine learning scenarios. Our algorithm is the first stochastic DP fair learning method with provable convergence.

2. Flexibility to handle non-binary classification with multiple (non-binary) sensitive attributes (e.g. race and gender) under different fairness notions such as demographic parity or equalized odds. In each of these cases, our algorithm is guaranteed to converge.

Empirically, we show that our method outperforms the previous state-of-the-art methods in terms of the fairness vs. accuracy trade-off across all privacy levels. Moreover, our algorithm is capable of training with mini-batch updates and can handle non-binary targets and non-binary sensitive attributes.
By contrast, existing DP fairness algorithms could not converge in our stochastic/non-binary experiment.

A byproduct of our algorithmic developments and analyses is the first convergent DP algorithm for nonconvex min-max optimization: namely, we provide an upper bound on the stationarity gap of DP-SGDA for solving problems of the form $\min_\theta \max_W F(\theta, W)$, where $F(\cdot, W)$ is non-convex. We expect this result to be of independent interest to the DP optimization community. Prior works that provide convergence results for DP min-max problems have assumed that $F(\cdot, W)$ is either (strongly) convex (Boob & Guzmán, 2021; Zhang et al., 2022) or satisfies a generalization of strong convexity known as the Polyak-Łojasiewicz (PL) condition (Yang et al., 2022).

2 PROBLEM SETTING AND PRELIMINARIES

Let $Z = \{z_i = (x_i, s_i, y_i)\}_{i=1}^{n}$ be a data set with non-sensitive features $x_i \in \mathcal{X}$, discrete sensitive attributes (e.g. race, gender) $s_i \in [k] := \{1, \ldots, k\}$, and labels $y_i \in [l]$. Let $\hat{y}_\theta(x)$ denote the model predictions parameterized by $\theta$, and let $\ell(\theta, x, y) = \ell(\hat{y}_\theta(x), y)$ be a loss function (e.g. the cross-entropy loss). Our goal is to (approximately) solve the empirical risk minimization (ERM) problem

$$\min_\theta \Big\{ \hat{L}(\theta) := \frac{1}{n} \sum_{i=1}^{n} \ell(\theta, x_i, y_i) \Big\} \qquad (1)$$

in a fair manner, while maintaining the differential privacy of the sensitive data $\{s_i\}_{i=1}^{n}$. We consider two different notions of fairness in this work:[1]

Definition 2.1 (Fairness Notions). Let $\mathcal{A} : \mathcal{Z} \to \mathcal{Y}$ be a classifier. $\mathcal{A}$ satisfies demographic parity (Dwork et al., 2012) if the predictions $\mathcal{A}(Z)$ are statistically independent of the sensitive attributes. $\mathcal{A}$ satisfies equalized odds (Hardt et al., 2016a) if the predictions $\mathcal{A}(Z)$ are conditionally independent of the sensitive attributes given $Y = y$, for all $y$.

[1] Our method can also handle any other fairness notion that can be defined in terms of statistical (conditional) independence, such as equal opportunity. However, our method cannot handle all fairness notions: for example, false discovery rate and calibration error are not covered by our framework.

Figure 1: Comparison with existing works. The table compares Jagielski et al. (2019) (post-proc.)*, Jagielski et al. (2019) (in-proc.), Ding et al., Mozannar et al., and three works of Tran et al. along three axes: non-binary classification, support for multiple fairness notions, and convergence guarantee in polynomial time (individual table entries omitted). Guarantee refers to a provable guarantee. N/A: the post-processing method of Jagielski et al. (2019) is not an iterative algorithm. *Method requires access to the sensitive data at test time. The in-processing method of Jagielski et al. (2019) is inefficient. The work of Mozannar et al. (2020) specializes to equalized odds, but most of their analysis seems to be extendable to other fairness notions.

Depending on the specific problem at hand, one fairness notion may be more desirable than the other (Dwork et al., 2012; Hardt et al., 2016a). In practical applications, achieving exact fairness, i.e. (conditional) independence of $\hat{Y}$ and $S$, is unrealistic. In fact, achieving exact fairness can be impossible for a differentially private algorithm that achieves non-trivial accuracy (Cummings et al., 2019). Thus, we instead aim to design an algorithm that achieves small fairness violation on the given data set $Z$. Fairness violation can be measured in different ways: see e.g. Lowy et al. (2022a) for a thorough survey. For example, if demographic parity is the desired fairness notion, then one can measure the (empirical) demographic parity violation by

$$\max_{\hat{y} \in \mathcal{Y}} \max_{s \in \mathcal{S}} \big| \hat{p}_{\hat{Y}|S}(\hat{y} \mid s) - \hat{p}_{\hat{Y}}(\hat{y}) \big|, \qquad (2)$$

where $\hat{p}$ denotes an empirical probability calculated directly from $(Z, \{\hat{y}_i\}_{i=1}^{n})$.
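To make the measure (2) concrete, here is a minimal sketch (not from the paper; the function name and toy inputs are illustrative assumptions) of how the empirical demographic parity violation can be computed from a vector of predicted labels and a vector of sensitive attributes.

```python
import numpy as np

def demographic_parity_violation(y_hat, s):
    """Empirical demographic parity violation, as in Eq. (2):
    max over predicted labels j and groups r of |P(Yhat=j | S=r) - P(Yhat=j)|."""
    y_hat, s = np.asarray(y_hat), np.asarray(s)
    violation = 0.0
    for j in np.unique(y_hat):
        p_j = np.mean(y_hat == j)                      # empirical P(Yhat = j)
        for r in np.unique(s):
            p_j_given_r = np.mean(y_hat[s == r] == j)  # empirical P(Yhat = j | S = r)
            violation = max(violation, abs(p_j_given_r - p_j))
    return violation

# toy example (hypothetical data): predictions that correlate with the sensitive attribute
y_hat = np.array([1, 1, 1, 0, 0, 0, 1, 0])
s     = np.array([0, 0, 0, 0, 1, 1, 1, 1])
print(demographic_parity_violation(y_hat, s))  # 0.25
```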
Next, we define differential privacy (DP). Following the DP fair learning literature (Jagielski et al., 2019; Tran et al., 2021b; 2022), we consider a relaxation of DP in which only the sensitive attributes require privacy. Say $Z$ and $Z'$ are adjacent with respect to sensitive data if $Z = \{(x_i, y_i, s_i)\}_{i=1}^{n}$, $Z' = \{(x_i, y_i, s_i')\}_{i=1}^{n}$, and there is a unique $i \in [n]$ such that $s_i \neq s_i'$.

Definition 2.2 (Differential Privacy w.r.t. Sensitive Attributes). Let $\epsilon \geq 0$, $\delta \in [0, 1)$. A randomized algorithm $\mathcal{A}$ is $(\epsilon, \delta)$-differentially private w.r.t. the sensitive attributes $S$ (DP) if for all pairs of data sets $Z, Z'$ that are adjacent w.r.t. the sensitive attributes, we have

$$\mathbb{P}(\mathcal{A}(Z) \in O) \leq e^{\epsilon}\, \mathbb{P}(\mathcal{A}(Z') \in O) + \delta \qquad (3)$$

for all measurable $O \subseteq \mathcal{Y}$.

As discussed in Section 1, Definition 2.2 is useful if a company wants to train a fair model, but is unable to use the sensitive attributes (which are needed to train a fair model) due to privacy concerns and laws (e.g., the E.U. GDPR). Definition 2.2 enables the company to privately use the sensitive attributes to train a fair model, while satisfying legal and ethical constraints. That being said, Definition 2.2 still may not prevent leakage of non-sensitive data. Thus, if the company is concerned with privacy of user data beyond the sensitive demographic attributes, then it should impose DP for all the features. Our algorithm and analysis readily extend to DP for all features: see Section 3.

Throughout the paper, we shall restrict attention to data sets that contain at least a $\rho$-fraction of every sensitive attribute for some $\rho \in (0, 1)$: i.e. $\frac{1}{|Z|} \sum_{i=1}^{|Z|} \mathbb{1}\{s_i = r\} \geq \rho$ for all $r \in [k]$. This is a reasonable assumption in practice: for example, if sex is the sensitive attribute and a data set contains all men, then training a model that is fair with respect to sex and has non-trivial performance (better than random) seems almost impossible. Understanding what performance is (im-)possible for DP fair learning in the absence of sample diversity is an important direction for future work.

3 PRIVATE FAIR ERM VIA EXPONENTIAL RÉNYI MUTUAL INFORMATION

A standard in-processing strategy in the literature for enforcing fairness is to add a regularization term to the empirical objective that penalizes fairness violations (Zhang et al., 2018; Donini et al., 2018; Mary et al., 2019; Baharlouei et al., 2020; Cho et al., 2020b; Lowy et al., 2022a). We can then jointly optimize for fairness and accuracy by solving

$$\min_\theta \Big\{ \hat{L}(\theta) + \lambda\, D(\hat{Y}, S, Y) \Big\},$$

where $D$ is some measure of statistical (conditional) dependence between the sensitive attributes and the predictions (given $Y$), and $\lambda \geq 0$ is a scalar balancing fairness and accuracy considerations. The choice of $D$ is crucial and can lead to different fairness-accuracy profiles. Inspired by the strong empirical performance and amenability to stochastic optimization of Lowy et al. (2022a), we choose $D$ to be the Exponential Rényi Mutual Information (ERMI):

Definition 3.1 (ERMI: Exponential Rényi Mutual Information).
We define the exponential Rényi mutual information between random variables $\hat{Y}$ and $S$ with empirical joint distribution $\hat{p}_{\hat{Y},S}$ and marginals $\hat{p}_{\hat{Y}}, \hat{p}_{S}$ by:

$$\hat{D}_R(\hat{Y}, S) := \mathbb{E}\left[ \frac{\hat{p}_{\hat{Y},S}(\hat{Y}, S)}{\hat{p}_{\hat{Y}}(\hat{Y})\, \hat{p}_{S}(S)} \right] - 1 = \sum_{j \in [l]} \sum_{r \in [k]} \frac{\hat{p}_{\hat{Y},S}(j, r)^2}{\hat{p}_{\hat{Y}}(j)\, \hat{p}_{S}(r)} - 1. \qquad \text{(ERMI)}$$

Definition 3.1 is what we would use if demographic parity were the desired fairness notion. If instead one wanted to encourage equalized odds, then Definition 3.1 can be readily adapted to that fairness notion by substituting appropriate conditional probabilities for $\hat{p}_{\hat{Y},S}$, $\hat{p}_{\hat{Y}}$, and $\hat{p}_{S}$ in (ERMI): see Appendix B for details.[2] It can be shown that ERMI $\geq 0$, and that it equals zero if and only if demographic parity (or equalized odds, for the conditional version of ERMI) is satisfied (Lowy et al., 2022a). Further, ERMI provides an upper bound on other commonly used measures of fairness violation: e.g. the measure (2), Shannon mutual information (Cho et al., 2020a), Rényi correlation (Baharlouei et al., 2020), and $L_q$ fairness violation (Kearns et al., 2018; Hardt et al., 2016a) (Lowy et al., 2022a). This implies that any algorithm that makes ERMI small will also have small fairness violation with respect to these other notions. Lastly, (Lowy et al., 2022a, Proposition 2) shows that empirical ERMI (Definition 3.1) is an asymptotically unbiased estimator of population ERMI, which can be defined as in Definition 3.1, except that empirical distributions are replaced by their population counterparts.

[2] To simplify the presentation, we will assume that demographic parity is the fairness notion of interest in the remainder of this section. However, we consider both fairness notions in our numerical experiments.

Our approach to enforcing fairness is to augment (1) with an ERMI regularizer and privately solve:

$$\min_\theta \Big\{ \mathrm{FERMI}(\theta) := \hat{L}(\theta) + \lambda\, \hat{D}_R(\hat{Y}_\theta(X), S) \Big\}. \qquad \text{(FERMI obj.)}$$

Since empirical ERMI is an asymptotically unbiased estimator of population ERMI, a solution to (FERMI obj.) is likely to generalize to the corresponding fair population risk minimization problem (Lowy et al., 2022a). There are numerous ways to privately solve (FERMI obj.). For example, one could use the exponential mechanism (McSherry & Talwar, 2007), or run noisy gradient descent (GD) (Bassily et al., 2014). The problem with these approaches is that they are inefficient or require computing $n$ gradients at every iteration, which is prohibitive for large-scale problems, as discussed earlier. Notice that we cannot run noisy stochastic GD (SGD) on (FERMI obj.) because we do not (yet) have a statistically unbiased estimate of $\nabla_\theta \hat{D}_R(\hat{Y}_\theta(X), S)$. Our next goal is to derive a stochastic, differentially private fair learning algorithm.

For feature input $x$, let the predicted class label be $\hat{y}(x, \theta) = j \in [l]$ with probability $F_j(x, \theta)$, where $F(x, \theta)$ is differentiable in $\theta$, has range $[0, 1]^l$, and $\sum_{j=1}^{l} F_j(x, \theta) = 1$. For instance, $F(x, \theta) = (F_1(x, \theta), \ldots, F_l(x, \theta))$ could represent the output of a neural net after a softmax layer, or the probability label assigned by a logistic regression model. Then we have the following min-max re-formulation of (FERMI obj.):

Theorem 3.2 (Lowy et al. (2022a)). There are differentiable functions $\hat{\psi}_i$ such that (FERMI obj.) is equivalent to

$$\min_\theta \max_{W \in \mathbb{R}^{k \times l}} \Big\{ \hat{F}(\theta, W) := \hat{L}(\theta) + \frac{\lambda}{n} \sum_{i=1}^{n} \hat{\psi}_i(\theta, W) \Big\}. \qquad (4)$$

Further, $\hat{\psi}_i(\theta, \cdot)$ is strongly concave for all $\theta$. The functions $\hat{\psi}_i$ are given explicitly in Appendix C.
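For concreteness, the following sketch (illustrative only, not the authors' code; the function name is an assumption) evaluates empirical ERMI from Definition 3.1 directly from predictions and sensitive attributes. This is the quantity that the inner maximization over $W$ in Theorem 3.2 recovers when averaged over the data.

```python
import numpy as np

def empirical_ermi(y_hat, s):
    """Empirical ERMI for demographic parity (Definition 3.1):
    sum_{j,r} p(j,r)^2 / (p(j) * p(r)) - 1, using empirical probabilities."""
    y_hat, s = np.asarray(y_hat), np.asarray(s)
    ermi = -1.0
    for j in np.unique(y_hat):
        p_j = np.mean(y_hat == j)
        for r in np.unique(s):
            p_r = np.mean(s == r)
            p_jr = np.mean((y_hat == j) & (s == r))  # empirical joint probability
            ermi += p_jr ** 2 / (p_j * p_r)
    return ermi

# ERMI is nonnegative and equals zero exactly under (empirical) demographic parity:
print(empirical_ermi([0, 1, 0, 1], [0, 0, 1, 1]))  # 0.0 (predictions independent of S)
print(empirical_ermi([0, 0, 1, 1], [0, 0, 1, 1]))  # 1.0 (predictions determined by S)
```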
Theorem 3.2 is useful because it permits us to use stochastic optimization to solve (FERMI obj.): for any batch size $m \in [n]$, the gradients (with respect to $\theta$ and $W$) of $\frac{1}{m}\sum_{i \in B} [\ell(x_i, y_i; \theta) + \lambda \hat{\psi}_i(\theta, W)]$ are statistically unbiased estimators of the gradients of $\hat{F}(\theta, W)$, if $B$ is drawn uniformly from $Z$. However, when differential privacy of the sensitive attributes is also desired, the formulation (4) presents some challenges, due to the non-convexity of $\hat{F}(\cdot, W)$. Indeed, there is no known DP algorithm for solving non-convex min-max problems that is proven to converge. Next, we provide the first such convergence guarantee.

3.1 NOISY DP-FERMI FOR STOCHASTIC PRIVATE FAIR ERM

Our proposed stochastic DP algorithm for solving (FERMI obj.) is given in Algorithm 1. It is a noisy DP variation of two-timescale stochastic gradient descent-ascent (SGDA) (Lin et al., 2020).

Algorithm 1 DP-FERMI Algorithm for Private Fair ERM
1: Input: $\theta^0 \in \mathbb{R}^{d_\theta}$, $W^0 = 0 \in \mathbb{R}^{k \times l}$, step-sizes $(\eta_\theta, \eta_w)$, fairness parameter $\lambda \geq 0$, iteration number $T$, minibatch size $|B_t| = m \in [n]$, set $\mathcal{W} \subset \mathbb{R}^{k \times l}$, noise parameters $\sigma_w^2, \sigma_\theta^2$.
2: Compute $\hat{P}_S^{-1/2}$.
3: for $t = 0, 1, \ldots, T$ do
4:   Draw a mini-batch $B_t$ of data points $\{(x_i, s_i, y_i)\}_{i \in B_t}$.
5:   Set $\theta_{t+1} \leftarrow \theta_t - \frac{\eta_\theta}{|B_t|} \sum_{i \in B_t} \big[ \nabla_\theta \ell(x_i, y_i; \theta_t) + \lambda\big( \nabla_\theta \hat{\psi}_i(\theta_t, W_t) + u_t \big) \big]$, where $u_t \sim N(0, \sigma_\theta^2 I_{d_\theta})$.
6:   Set $W_{t+1} \leftarrow \Pi_{\mathcal{W}}\Big[ W_t + \frac{\eta_w \lambda}{|B_t|} \sum_{i \in B_t} \big( \nabla_w \hat{\psi}_i(\theta_t, W_t) + V_t \big) \Big]$, where $V_t$ is a $k \times l$ matrix with independent random Gaussian entries $(V_t)_{r,j} \sim N(0, \sigma_w^2)$.
7: end for
8: Pick $\hat{t}$ uniformly at random from $\{1, \ldots, T\}$.
9: Return: $\hat{\theta}_T := \theta_{\hat{t}}$.

Explicit formulae for $\nabla_\theta \hat{\psi}_i(\theta_t, W_t)$ and $\nabla_w \hat{\psi}_i(\theta_t, W_t)$ are given in Lemma D.1 (Appendix D). We provide the privacy guarantee of Algorithm 1 in Theorem 3.3:

Theorem 3.3. Let $\epsilon \leq 2\ln(1/\delta)$, $\delta \in (0, 1)$, and $T \geq \big(\frac{n\sqrt{\epsilon}}{2m}\big)^2$. Assume $F(x, \cdot)$ is $L_\theta$-Lipschitz for all $x$, and $|(W_t)_{r,j}| \leq D$ for all $t \in [T]$, $r \in [k]$, $j \in [l]$. Then, for

$$\sigma_w^2 \geq \frac{16\, T \ln(1/\delta)}{\epsilon^2 n^2 \rho} \quad \text{and} \quad \sigma_\theta^2 \geq \frac{16\, L_\theta^2 D^2\, T \ln(1/\delta)}{\epsilon^2 n^2 \rho},$$

Algorithm 1 is $(\epsilon, \delta)$-DP with respect to the sensitive attributes for all data sets containing at least a $\rho$-fraction of minority attributes. Further, if

$$\sigma_w^2 \geq \frac{32\, T \ln(1/\delta)}{\epsilon^2 n^2}\Big( \frac{1}{\rho} + D^2 \Big) \quad \text{and} \quad \sigma_\theta^2 \geq \frac{64\, L_\theta^2 D^2\, T \ln(1/\delta)}{\epsilon^2 n^2 \rho} + \frac{32\, D^4 L_\theta^2 l^2\, T \ln(1/\delta)}{\epsilon^2 n^2},$$

then Algorithm 1 is $(\epsilon, \delta)$-DP (with respect to all features) for all data sets containing at least a $\rho$-fraction of minority attributes.

See Appendix D for the proof. Next, we give a convergence guarantee for Algorithm 1:

Theorem 3.4. Assume the loss function $\ell(\cdot, x, y)$ and $F(x, \cdot)$ are Lipschitz continuous with Lipschitz gradients for all $(x, y)$, and $\hat{p}_S(r) \geq \rho > 0$ for all $r \in [k]$. In Algorithm 1, choose $\mathcal{W}$ to be a sufficiently large ball that contains $W^*(\theta) := \mathrm{argmax}_W\, \hat{F}(\theta, W)$ for every $\theta$ in some neighborhood of $\theta^* \in \mathrm{argmin}_\theta \max_W \hat{F}(\theta, W)$. Then there exist algorithmic parameters such that the $(\epsilon, \delta)$-DP Algorithm 1 returns $\hat{\theta}_T$ with

$$\mathbb{E}\big\| \nabla \mathrm{FERMI}(\hat{\theta}_T) \big\|^2 = O\left( \frac{\sqrt{\max(d_\theta, kl)\, \ln(1/\delta)}}{\epsilon n} \right),$$

treating $D = \mathrm{diameter}(\mathcal{W})$, $\lambda$, $\rho$, $l$, and the Lipschitz and smoothness parameters of $\ell$ and $F$ as constants.

Theorem 3.4 shows that Algorithm 1 finds an approximate stationary point of (FERMI obj.). Finding approximate stationary points is generally the best one can hope to do in polynomial time for nonconvex optimization (Murty & Kabadi, 1985). The stationarity gap in Theorem 3.4 depends on the number of samples $n$ and model parameters $d_\theta$, the desired level of privacy $(\epsilon, \delta)$, and the number of labels $l$ and sensitive attributes $k$. For large-scale models (e.g. deep neural nets), we typically have $d_\theta \gg 1$ and $k, l = O(1)$, so that the convergence rate of Algorithm 1 is essentially immune to the number of labels and sensitive attributes. In contrast, no existing works with convergence guarantees are able to handle non-binary classification ($l > 2$), even with full batches and a single binary sensitive attribute.
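As a small illustration of the noise calibration in Theorem 3.3 (for the sensitive-attributes-only guarantee), the sketch below computes the Gaussian noise standard deviations used in lines 5-6 of Algorithm 1 from the privacy parameters; the function name and numerical inputs are assumptions chosen purely for illustration.

```python
import math

def dp_fermi_noise_stds(eps, delta, n, T, rho, L_theta, D):
    """Noise standard deviations per Theorem 3.3 (DP w.r.t. sensitive attributes only):
        sigma_w^2     >= 16 T ln(1/delta) / (eps^2 n^2 rho)
        sigma_theta^2 >= 16 L_theta^2 D^2 T ln(1/delta) / (eps^2 n^2 rho)."""
    base = 16.0 * T * math.log(1.0 / delta) / (eps ** 2 * n ** 2 * rho)
    sigma_theta = math.sqrt(L_theta ** 2 * D ** 2 * base)
    sigma_w = math.sqrt(base)
    return sigma_theta, sigma_w

# illustrative values only (not tied to any experiment in the paper)
print(dp_fermi_noise_stds(eps=1.0, delta=1e-5, n=40_000, T=1_000, rho=0.3, L_theta=1.0, D=1.0))
```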
A few more remarks are in order. First, the utility bound in Theorem 3.4 corresponds to DP for all of the features. If DP is only required for the sensitive attributes, then using the smaller $\sigma_\theta^2, \sigma_w^2$ from Theorem 3.3 would improve the dependence on the constants $D$, $l$, $L_\theta$ in the utility bound. Second, the choice of $\mathcal{W}$ in Theorem 3.4 implies that (4) is equivalent to $\min_\theta \max_{W \in \mathcal{W}} \hat{F}(\theta, W)$, which is what our algorithm directly solves (c.f. (7)). Lastly, note that while we return a uniformly random iterate in Algorithm 1 for our theoretical convergence analysis, we recommend returning the last iterate $\theta_T$ in practice: our numerical experiments show strong performance of the last iterate.

In Theorem E.1 of Appendix E, we prove a result which is more general than Theorem 3.4. Theorem E.1 shows that noisy DP-SGDA converges to an approximate stationary point of any smooth nonconvex-strongly concave min-max optimization problem (not just (4)). We expect Theorem E.1 to be of general interest to the DP optimization community beyond its applications to DP fair learning, since it is the first DP convergence guarantee for nonconvex min-max optimization. We also give a bound on the iteration complexity $T$ in Appendix E. The proof of Theorem E.1 involves a careful analysis of how the Gaussian noises propagate through the optimization trajectories of $\theta_t$ and $w_t$. Compared with DP non-convex minimization analyses (e.g. Wang et al. (2019); Hu et al. (2021); Ding et al. (2021b); Lowy et al. (2022b)), the two noises required to privatize the solution of the min-max problem we consider complicate the analysis and require careful tuning of $\eta_\theta$ and $\eta_W$. Compared to existing analyses of DP min-max games in Boob & Guzmán (2021); Yang et al. (2022); Zhang et al. (2022), which assume that $f(\cdot, w)$ is convex or PL, dealing with non-convexity is a challenge that requires different optimization techniques.

4 NUMERICAL EXPERIMENTS

In this section, we evaluate the performance of our proposed approach (DP-FERMI) in terms of fairness violation vs. test error for different privacy levels. We present our results in two parts. In Section 4.1, we assess the performance of our method in training logistic regression models on several benchmark tabular datasets. Since this is a standard setup that existing DP fairness algorithms can handle, we are able to compare our method against the state-of-the-art baselines. We carefully tuned the hyperparameters of all baselines for a fair comparison. We find that DP-FERMI consistently outperforms all state-of-the-art baselines across all data sets and all privacy levels. These observations hold for both the demographic parity and equalized odds fairness notions. To quantify the improvement of our results over the state-of-the-art baselines, we calculated the performance gain with respect to fairness violation (for a fixed accuracy level) that our model yields over all the datasets. For demographic parity, we obtained a performance gain that was 79.648% better than Tran et al. (2021b) on average, and 65.89% better in median. The average performance gain for equalized odds was 96.65%, while the median percentage gain was 90.02%.
In Section 4.2, we showcase the scalability of DP-FERMI by using it to train a deep convolutional neural network for classification on a large image dataset. In Appendix F, we give detailed descriptions of the data sets, experimental setups, and training procedure, along with additional results.

4.1 STANDARD BENCHMARK EXPERIMENTS: LOGISTIC REGRESSION ON TABULAR DATASETS

In the first set of experiments, we train a logistic regression model using DP-FERMI (Algorithm 1) for demographic parity and a modified version of DP-FERMI (described in Appendix F) for equalized odds. We compare DP-FERMI against all applicable publicly available baselines in each experiment.

4.1.1 DEMOGRAPHIC PARITY

We use four benchmark tabular datasets: Adult Income, Retired Adult, Parkinsons, and Credit Card, all from the UCI machine learning repository (Dua & Graff, 2017). The predicted variables and sensitive attributes are both binary in these datasets. We analyze fairness-accuracy trade-offs with four different values of $\epsilon \in \{0.5, 1, 3, 9\}$ for each dataset. We compare against the state-of-the-art algorithms proposed in Tran et al. (2021a) and (the demographic parity objective of) Tran et al. (2021b). The results displayed are averages over 15 trials (random seeds) for each value of $\epsilon$.

For the Adult dataset, the task is to predict whether the income is greater than $50K or not, keeping gender as the sensitive attribute. The Retired Adult dataset is the same as the Adult dataset, but with updated data. We use the same output and sensitive attributes for both experiments. The results for Adult and Retired Adult are shown in Figs. 2 and 6 (in Appendix F.2). Compared to Tran et al. (2021a;b), DP-FERMI offers superior fairness-accuracy tradeoffs at every privacy ($\epsilon$) level.

Figure 2: Private, Fair (Demographic Parity) logistic regression on the Adult dataset (one panel per $\epsilon \in \{0.5, 1, 3, 9\}$).

In the Parkinsons dataset, the task is to predict whether the total UPDRS score of the patient is greater than the median or not, keeping gender as the sensitive attribute. Results for $\epsilon \in \{1, 3\}$ are shown in Fig. 3. See Fig. 8 in Appendix F for $\epsilon \in \{0.5, 9\}$. Our algorithm again outperforms the baselines Tran et al. (2021a;b) at all tested privacy levels. In the Credit Card dataset, the task is to predict whether the user will default on their payment the next month, keeping gender as the sensitive attribute. Results are shown in Fig. 7 in Appendix F.2. Once again, DP-FERMI provides the most favorable privacy-fairness-accuracy profile.

Figure 3: Private, Fair (Demographic Parity) logistic regression on the Parkinsons dataset.

4.1.2 EQUALIZED ODDS

Next, we consider a slightly modified version of Algorithm 1, which is designed to minimize the equalized odds violation by replacing the absolute probabilities in the objective with class-conditional probabilities: see Appendix F.2.4 for details. We considered the Credit Card and Adult datasets for these experiments, using the same sensitive attributes as mentioned above. Results for Credit Card are shown in Fig. 4. Adult results are given in Fig. 9 in Appendix F.2.4. Compared to Jagielski et al. (2019) and the equalized odds objective in Tran et al. (2021b), our equalized odds variant of DP-FERMI outperforms these state-of-the-art baselines at every privacy level.
Figure 4: Private, Fair (Equalized Odds) logistic regression on the Credit Card dataset.

4.2 LARGE-SCALE EXPERIMENT: DEEP CONVOLUTIONAL NEURAL NETWORK ON IMAGE DATASET

In our second set of experiments, we train a deep 9-layer VGG-like classifier (Simonyan & Zisserman, 2015) with roughly 1.6 million parameters on the UTK-Face dataset (Zhang et al., 2017) using Algorithm 1. We classify the facial images into 9 age groups, similar to the setup in Tran et al. (2022), while keeping race (containing 5 classes) as the sensitive attribute. See Appendix F.3 for more details. We consider four different privacy levels $\epsilon \in \{10, 25, 50, 100\}$. Compared to the tabular datasets, larger $\epsilon$ is needed to obtain stable results in the large-scale setting, since the number of parameters $d$ is much larger and the cost of privacy increases with $d$ (see Theorem 3.4). Larger values of $\epsilon > 100$ were used in the baseline Jagielski et al. (2019) for smaller-scale experiments.

The results in Fig. 5 empirically verify our main theoretical result: DP-FERMI converges even for non-binary classification with small batch size and non-binary sensitive attributes. We took Tran et al. (2021a;b) as our baselines and attempted to adapt them to this non-binary large-scale task. We observed that the baselines were very unstable during training and mostly gave degenerate results (predicting a single output irrespective of the input). By contrast, our method was able to obtain stable and meaningful tradeoff curves. Also, while Tran et al. (2022) reported results on UTK-Face, their code is not publicly available and we were unable to reproduce their results.

Figure 5: DP-FERMI on a Deep CNN for Image Classification on UTK-Face.

5 CONCLUDING REMARKS

Motivated by pressing legal, ethical, and social considerations, we studied the challenging problem of learning fair models with differentially private demographic data. We observed that existing works suffer from a few crucial limitations that render their approaches impractical for large-scale problems. Specifically, existing approaches require full batches of data in each iteration (and/or exponential runtime) in order to provide convergence/accuracy guarantees. We addressed these limitations by deriving a DP stochastic optimization algorithm for fair learning, and we rigorously proved the convergence of the proposed method. Our convergence guarantee holds even for non-binary classification (with any hypothesis class, even one of infinite VC dimension, c.f. Jagielski et al. (2019)) with multiple sensitive attributes and access to random minibatches of data in each iteration. Finally, we evaluated our method in extensive numerical experiments and found that it significantly outperforms the previous state-of-the-art models in terms of the fairness-accuracy tradeoff. The potential societal impacts of our work are discussed in Appendix G.

ACKNOWLEDGMENTS

This work was supported in part with funding from the NSF CAREER award 2144985, from the YIP AFOSR award, from a gift from the USC-Meta Center for Research and Education in AI & Learning, and from a gift from the USC-Amazon Center on Secure & Trusted Machine Learning.

REFERENCES

Martin Abadi, Andy Chu, Ian Goodfellow, H. Brendan McMahan, Ilya Mironov, Kunal Talwar, and Li Zhang. Deep learning with differential privacy. Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security, Oct 2016. doi: 10.1145/2976749.2978318.
URL http://dx.doi.org/10.1145/2976749.2978318. Alekh Agarwal, Alina Beygelzimer, Miroslav Dudík, John Langford, and Hanna Wallach. A reductions approach to fair classification. In International Conference on Machine Learning, pp. 60 69. PMLR, 2018. Julia Angwin, Jeff Larson, Surya Mattu, and Lauren Kirchner. Machine bias. Pro Publica, 2016. Eugene Bagdasaryan, Omid Poursaeed, and Vitaly Shmatikov. Differential privacy has disparate impact on model accuracy. Advances in neural information processing systems, 32, 2019. Sina Baharlouei, Maher Nouiehed, Ahmad Beirami, and Meisam Razaviyayn. Rényi fair inference. In ICLR, 2020. Raef Bassily, Adam Smith, and Abhradeep Thakurta. Private empirical risk minimization: Efficient algorithms and tight error bounds. In 2014 IEEE 55th Annual Symposium on Foundations of Computer Science, pp. 464 473. IEEE, 2014. Raef Bassily, Vitaly Feldman, Kunal Talwar, and Abhradeep Thakurta. Private stochastic convex optimization with optimal rates. In Advances in Neural Information Processing Systems, 2019. Yahav Bechavod and Katrina Ligett. Penalizing unfairness in binary classification. ar Xiv preprint ar Xiv:1707.00044, 2017. Yahav Bechavod, Katrina Ligett, Aaron Roth, Bo Waggoner, and Zhiwei Steven Wu. Equal opportunity in online classification with partial feedback. ar Xiv preprint ar Xiv:1902.02242, 2019. Tolga Bolukbasi, Kai-Wei Chang, James Y Zou, Venkatesh Saligrama, and Adam T Kalai. Man is to computer programmer as woman is to homemaker? debiasing word embeddings. In Advances in neural information processing systems, pp. 4349 4357, 2016. Digvijay Boob and Cristóbal Guzmán. Optimal algorithms for differentially private stochastic monotone variational inequalities and saddle-point problems. ar Xiv preprint ar Xiv:2104.02988, 2021. Flavio Calmon, Dennis Wei, Bhanukiran Vinzamuri, Karthikeyan Natesan Ramamurthy, and Kush R Varshney. Optimized pre-processing for discrimination prevention. In Advances in Neural Information Processing Systems, pp. 3992 4001, 2017. Nicholas Carlini, Florian Tramer, Eric Wallace, Matthew Jagielski, Ariel Herbert-Voss, Katherine Lee, Adam Roberts, Tom Brown, Dawn Song, Ulfar Erlingsson, et al. Extracting training data from large language models. In 30th USENIX Security Symposium (USENIX Security 21), pp. 2633 2650, 2021. Jaewoong Cho, Gyeongjo Hwang, and Changho Suh. A fair classifier using mutual information. In 2020 IEEE International Symposium on Information Theory (ISIT), pp. 2521 2526. IEEE, 2020a. Jaewoong Cho, Gyeongjo Hwang, and Changho Suh. A fair classifier using kernel density estimation. In Hugo Larochelle, Marc Aurelio Ranzato, Raia Hadsell, Maria-Florina Balcan, and Hsuan-Tien Lin (eds.), Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, Neur IPS 2020, December 6-12, 2020, virtual, 2020b. Rachel Cummings, Varun Gupta, Dhamma Kimpara, and Jamie Morgenstern. On the compatibility of privacy and fairness. In Adjunct Publication of the 27th Conference on User Modeling, Adaptation and Personalization, pp. 309 315, 2019. Amit Datta, Michael Carl Tschantz, and Anupam Datta. Automated experiments on ad privacy settings. Proceedings on privacy enhancing technologies, 2015(1):92 112, 2015. Published as a conference paper at ICLR 2023 Frances Ding, Moritz Hardt, John Miller, and Ludwig Schmidt. Retiring adult: New datasets for fair machine learning. In M. Ranzato, A. Beygelzimer, Y. Dauphin, P.S. Liang, and J. 
Wortman Vaughan (eds.), Advances in Neural Information Processing Systems, volume 34, pp. 6478 6490. Curran Associates, Inc., 2021a. URL https://proceedings.neurips.cc/paper/2021/file/ 32e54441e6382a7fbacbbbaf3c450059-Paper.pdf. Jiahao Ding, Xinyue Zhang, Xiaohuan Li, Junyi Wang, Rong Yu, and Miao Pan. Differentially private and fair classification via calibrated functional mechanism. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pp. 622 629, 2020. Jiahao Ding, Guannan Liang, Jinbo Bi, and Miao Pan. Differentially private and communication efficient collaborative learning. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, pp. 7219 7227, 2021b. Michele Donini, Luca Oneto, Shai Ben-David, John S Shawe-Taylor, and Massimiliano Pontil. Empirical risk minimization under fairness constraints. In Advances in Neural Information Processing Systems, pp. 2791 2801, 2018. Dheeru Dua and Casey Graff. UCI machine learning repository, 2017. URL http://archive. ics.uci.edu/ml. Cynthia Dwork, Frank Mc Sherry, Kobbi Nissim, and Adam Smith. Calibrating noise to sensitivity in private data analysis. In Theory of cryptography conference, pp. 265 284. Springer, 2006. Cynthia Dwork, Moritz Hardt, Toniann Pitassi, Omer Reingold, and Richard Zemel. Fairness through awareness. In Proceedings of the 3rd innovations in theoretical computer science conference, pp. 214 226, 2012. Michael Feldman, Sorelle A Friedler, John Moeller, Carlos Scheidegger, and Suresh Venkatasubramanian. Certifying and removing disparate impact. In proceedings of the 21th ACM SIGKDD international conference on knowledge discovery and data mining, pp. 259 268, 2015. Benjamin Fish, Jeremy Kun, and Ádám D Lelkes. A confidence-based approach for balancing fairness and accuracy. In Proceedings of the 2016 SIAM International Conference on Data Mining, pp. 144 152. SIAM, 2016. Matt Fredrikson, Somesh Jha, and Thomas Ristenpart. Model inversion attacks that exploit confidence information and basic countermeasures. In Proceedings of the 22nd ACM SIGSAC Conference on Computer and Communications Security, pp. 1322 1333, 2015. Moritz Hardt, Eric Price, and Nati Srebro. Equality of opportunity in supervised learning. In Advances in neural information processing systems, pp. 3315 3323, 2016a. Moritz Hardt, Ben Recht, and Yoram Singer. Train faster, generalize better: Stability of stochastic gradient descent. In Maria Florina Balcan and Kilian Q. Weinberger (eds.), Proceedings of The 33rd International Conference on Machine Learning, volume 48 of Proceedings of Machine Learning Research, pp. 1225 1234, New York, New York, USA, 20 22 Jun 2016b. PMLR. URL http://proceedings.mlr.press/v48/hardt16.html. Zecheng He, Tianwei Zhang, and Ruby B Lee. Model inversion attacks against collaborative inference. In Proceedings of the 35th Annual Computer Security Applications Conference, pp. 148 162, 2019. Rui Hu, Yuanxiong Guo, and Yanmin Gong. Concentrated differentially private federated learning with performance analysis. IEEE Open Journal of the Computer Society, 2:276 289, 2021. doi: 10.1109/OJCS.2021.3099108. Matthew Jagielski, Michael Kearns, Jieming Mao, Alina Oprea, Aaron Roth, Saeed Sharifi-Malvajerdi, and Jonathan Ullman. Differentially private fair learning. In International Conference on Machine Learning, pp. 3000 3008. PMLR, 2019. Michael Kearns, Seth Neel, Aaron Roth, and Zhiwei Steven Wu. Preventing fairness gerrymandering: Auditing and learning for subgroup fairness. 
In International Conference on Machine Learning, pp. 2564 2572, 2018. Published as a conference paper at ICLR 2023 Niki Kilbertus, Adrià Gascón, Matt Kusner, Michael Veale, Krishna Gummadi, and Adrian Weller. Blind justice: Fairness with encrypted sensitive attributes. In International Conference on Machine Learning, pp. 2630 2639. PMLR, 2018. Niki Kilbertus, Manuel Gomez Rodriguez, Bernhard Schölkopf, Krikamol Muandet, and Isabel Valera. Fair decisions despite imperfect predictions. In Silvia Chiappa and Roberto Calandra (eds.), Proceedings of the Twenty Third International Conference on Artificial Intelligence and Statistics, volume 108 of Proceedings of Machine Learning Research, pp. 277 287. PMLR, 26 28 Aug 2020. Lihua Lei, Cheng Ju, Jianbo Chen, and Michael I Jordan. Non-convex finite-sum optimization via scsg methods. In Proceedings of the 31st International Conference on Neural Information Processing Systems, pp. 2345 2355, 2017. Tianyi Lin, Chi Jin, and Michael Jordan. On gradient descent ascent for nonconvex-concave minimax problems. In International Conference on Machine Learning, pp. 6083 6093. PMLR, 2020. Andrew Lowy, Sina Baharlouei, Rakesh Pavan, Meisam Razaviyayn, and Ahmad Beirami. A stochastic optimization framework for fair risk minimization. Transactions on Machine Learning Research, 2022a. Andrew Lowy, Ali Ghafelebashi, and Meisam Razaviyayn. Private non-convex federated learning without a trusted server. ar Xiv preprint ar Xiv:2203.06735, 2022b. Jérémie Mary, Clément Calauzenes, and Noureddine El Karoui. Fairness-aware learning for continuous attributes and treatments. In International Conference on Machine Learning, pp. 4382 4391. PMLR, 2019. Frank Mc Sherry and Kunal Talwar. Mechanism design via differential privacy. In 48th Annual IEEE Symposium on Foundations of Computer Science (FOCS 07), pp. 94 103. IEEE, 2007. Hussein Mozannar, Mesrob Ohannessian, and Nathan Srebro. Fair learning with private demographic data. In International Conference on Machine Learning, pp. 7066 7075. PMLR, 2020. Katta G Murty and Santosh N Kabadi. Some np-complete problems in quadratic and nonlinear programming. 1985. Flavien Prost, Hai Qian, Qiuwen Chen, Ed H Chi, Jilin Chen, and Alex Beutel. Toward a better trade-off between performance and fairness with kernel-based distribution matching. ar Xiv preprint ar Xiv:1910.11779, 2019. Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. In International Conference on Learning Representations, 2015. Mengkai Song, Zhibo Wang, Zhifei Zhang, Yang Song, Qian Wang, Ju Ren, and Hairong Qi. Analyzing user-level privacy attack against federated learning. IEEE Journal on Selected Areas in Communications, 38(10):2430 2444, 2020. doi: 10.1109/JSAC.2020.3000372. Latanya Sweeney. Discrimination in online ad delivery. ar Xiv preprint ar Xiv:1301.6822, 2013. Cuong Tran, My Dinh, and Ferdinando Fioretto. Differentially private empirical risk minimization under the fairness lens. Advances in Neural Information Processing Systems, 34:27555 27565, 2021a. Cuong Tran, Ferdinando Fioretto, and Pascal Van Hentenryck. Differentially private and fair deep learning: A lagrangian dual approach. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, pp. 9932 9939, 2021b. Cuong Tran, Ferdinando Fioretto, Pascal Van Hentenryck, and Zhiyan Yao. Decision making with differential privacy under a fairness lens. In IJCAI, pp. 560 566, 2021c. 
Cuong Tran, Keyu Zhu, Ferdinando Fioretto, and Pascal Van Hentenryck. SF-PATE: Scalable, fair, and private aggregation of teacher ensembles. arXiv preprint arXiv:2204.05157, 2022.

Michael Veale and Reuben Binns. Fairer machine learning in the real world: Mitigating discrimination without collecting sensitive data. Big Data & Society, 4(2):2053951717743530, 2017.

Lingxiao Wang, Bargav Jayaraman, David Evans, and Quanquan Gu. Efficient privacy-preserving stochastic nonconvex optimization. arXiv preprint arXiv:1910.13659, 2019.

Blake Woodworth, Suriya Gunasekar, Mesrob I Ohannessian, and Nathan Srebro. Learning non-discriminatory predictors. arXiv preprint arXiv:1702.06081, 2017.

Depeng Xu, Shuhan Yuan, and Xintao Wu. Achieving differential privacy and fairness in logistic regression. In Companion Proceedings of the 2019 World Wide Web Conference, pp. 594-599, 2019.

Zhenhuan Yang, Shu Hu, Yunwen Lei, Kush R Varshney, Siwei Lyu, and Yiming Ying. Differentially private SGDA for minimax problems. arXiv preprint arXiv:2201.09046, 2022.

Muhammad Bilal Zafar, Isabel Valera, Manuel Gomez Rogriguez, and Krishna P Gummadi. Fairness constraints: Mechanisms for fair classification. In Artificial Intelligence and Statistics, pp. 962-970. PMLR, 2017.

Brian Hu Zhang, Blake Lemoine, and Margaret Mitchell. Mitigating unwanted biases with adversarial learning. In Proceedings of the 2018 AAAI/ACM Conference on AI, Ethics, and Society, pp. 335-340, 2018.

Liang Zhang, Kiran Koshy Thekumparampil, Sewoong Oh, and Niao He. Bring your own algorithm for optimal differentially private stochastic minimax optimization. arXiv preprint arXiv:2206.00363, 2022.

Zhifei Zhang, Yang Song, and Hairong Qi. Age progression/regression by conditional adversarial autoencoder. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5810-5818, 2017.

A ADDITIONAL DISCUSSION OF RELATED WORK

The study of differentially private fair learning algorithms was initiated by Jagielski et al. (2019). Jagielski et al. (2019) considered equalized odds and proposed two DP algorithms: 1) an $\epsilon$-DP post-processing approach derived from Hardt et al. (2016a); and 2) an $(\epsilon, \delta)$-DP in-processing approach based on Agarwal et al. (2018). The major drawback of their post-processing approach is the unrealistic requirement that the algorithm have access to the sensitive attributes at test time, which Jagielski et al. (2019) admit isn't feasible (or legal) in certain applications. Additionally, post-processing approaches are known to suffer from inferior fairness-accuracy tradeoffs compared with in-processing methods. While the in-processing method of Jagielski et al. (2019) does not require access to sensitive attributes at test time, it comes with a different set of disadvantages: 1) it is limited to binary classification; 2) its theoretical performance guarantees require the use of the computationally inefficient (i.e. exponential-time) exponential mechanism (McSherry & Talwar, 2007); 3) its theoretical performance guarantees require computations on the full training set and do not permit mini-batch implementations; and 4) it requires the hypothesis class $\mathcal{H}$ to have finite VC dimension. In this work, we propose the first algorithm that overcomes all of these pitfalls: our algorithm is amenable to multi-way classification with multiple sensitive attributes, is computationally efficient, and comes with convergence guarantees that hold even when mini-batches of $m < n$ samples are used in each iteration of training, and even when $\mathrm{VC}(\mathcal{H}) = \infty$. Furthermore, our framework is flexible enough to accommodate many notions of group fairness besides equalized odds (e.g. demographic parity, accuracy parity).

Following Jagielski et al. (2019), several works have proposed other DP fair learning algorithms. None of these works have managed to simultaneously address all the shortcomings of the method of Jagielski et al. (2019). The work of Xu et al. (2019) proposed DP and fair binary logistic regression, but did not provide any theoretical convergence/performance guarantees. The work of Mozannar et al. (2020) combined aspects of both Hardt et al. (2016a) and Agarwal et al. (2018) in a two-step locally differentially private fairness algorithm. Their approach is limited to binary classification. Moreover, their algorithm requires $n/2$ samples in each iteration (of their in-processing step), making it impractical for large-scale problems. More recently, Tran et al. (2021b) devised another DP in-processing method based on Lagrangian duality, which covers non-binary classification problems. In a subsequent work, Tran et al. (2021a) studied the effect of DP on accuracy parity in ERM, and proposed using a regularizer to promote fairness. Finally, Tran et al. (2022) provided a semi-supervised fair Private Aggregation of Teacher Ensembles framework. A shortcoming of each of these three most recent works is their lack of theoretical convergence or accuracy guarantees. In another vein, some works have observed the disparate impact of privacy constraints on demographic subgroups (Bagdasaryan et al., 2019; Tran et al., 2021c).

B EQUALIZED ODDS VERSION OF ERMI

If equalized odds (Hardt et al., 2016a) is the desired fairness notion, then one should use the following variation of ERMI as a regularizer (Lowy et al., 2022a):

$$\hat{D}_R(\hat{Y}; S \mid Y) := \mathbb{E}\left[ \frac{\hat{p}_{\hat{Y},S|Y}(\hat{Y}, S \mid Y)}{\hat{p}_{\hat{Y}|Y}(\hat{Y} \mid Y)\, \hat{p}_{S|Y}(S \mid Y)} \right] - 1 = \sum_{y} \sum_{j} \sum_{r} \frac{\hat{p}_{\hat{Y},S|Y}(j, r \mid y)^2}{\hat{p}_{\hat{Y}|Y}(j \mid y)\, \hat{p}_{S|Y}(r \mid y)}\, \hat{p}_Y(y) - 1. \qquad (5)$$

Here $\hat{p}_{\hat{Y},S|Y}$ denotes the empirical joint distribution of the predictions and sensitive attributes $(\hat{Y}, S)$ conditional on the true labels $Y$. In particular, if $\hat{D}_R(\hat{Y}; S \mid Y) = 0$, then $\hat{Y}$ and $S$ are conditionally independent given $Y$ (i.e. equalized odds is satisfied).
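To illustrate (5), here is a minimal sketch (not the authors' code; the function name is an assumption) that evaluates the conditional, equalized-odds version of empirical ERMI from predicted labels, sensitive attributes, and true labels.

```python
import numpy as np

def empirical_ermi_eq_odds(y_hat, s, y):
    """Conditional ERMI of Eq. (5): within each true-label group y, compute
    sum_{j,r} p(j,r|y)^2 / (p(j|y) p(r|y)), average the results with weights p(y),
    and subtract 1. The value is 0 iff equalized odds holds empirically."""
    y_hat, s, y = map(np.asarray, (y_hat, s, y))
    total = 0.0
    for y_val in np.unique(y):
        mask = (y == y_val)
        p_y = np.mean(mask)
        yh, ss = y_hat[mask], s[mask]
        inner = 0.0
        for j in np.unique(yh):
            p_j = np.mean(yh == j)
            for r in np.unique(ss):
                p_r = np.mean(ss == r)
                p_jr = np.mean((yh == j) & (ss == r))
                inner += p_jr ** 2 / (p_j * p_r)
        total += p_y * inner
    return total - 1.0

# toy check: predictions independent of s within each label group give 0
print(empirical_ermi_eq_odds(y_hat=[0, 1, 0, 1], s=[0, 0, 1, 1], y=[0, 0, 0, 0]))  # 0.0
```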
C COMPLETE VERSION OF THEOREM 3.2

Let $\hat{\mathbf{y}}(x_i; \theta) \in \{0, 1\}^l$ and $\mathbf{s}_i \in \{0, 1\}^k$ be the one-hot encodings of $\hat{y}(x_i, \theta)$ and $s_i$, respectively: i.e., $\hat{\mathbf{y}}_j(x_i; \theta) = \mathbb{1}\{\hat{y}(x_i, \theta) = j\}$ and $\mathbf{s}_{i,r} = \mathbb{1}\{s_i = r\}$ for $j \in [l]$, $r \in [k]$. Also, denote $\hat{P}_S = \mathrm{diag}(\hat{p}_S(1), \ldots, \hat{p}_S(k))$, where $\hat{p}_S(r) := \frac{1}{n}\sum_{i=1}^{n} \mathbb{1}\{s_i = r\} \geq \rho > 0$ is the empirical probability of attribute $r$ ($r \in [k]$). Then we have the following re-formulation of (FERMI obj.) as a min-max problem:

Theorem C.1 (Lowy et al. (2022a)). (FERMI obj.) is equivalent to

$$\min_\theta \max_{W \in \mathbb{R}^{k \times l}} \Big\{ \hat{F}(\theta, W) := \hat{L}(\theta) + \frac{\lambda}{n} \sum_{i=1}^{n} \hat{\psi}_i(\theta, W) \Big\},$$

where

$$\hat{\psi}_i(\theta, W) := -\mathrm{Tr}\big( W\, \mathbb{E}[\hat{\mathbf{y}}(x_i, \theta)\hat{\mathbf{y}}(x_i, \theta)^T \mid x_i]\, W^T \big) + 2\, \mathrm{Tr}\big( W\, \mathbb{E}[\hat{\mathbf{y}}(x_i; \theta)\mathbf{s}_i^T \mid x_i, s_i]\, \hat{P}_S^{-1/2} \big) - 1,$$

$\mathbb{E}[\hat{\mathbf{y}}(x_i; \theta)\hat{\mathbf{y}}(x_i; \theta)^T \mid x_i] = \mathrm{diag}(F_1(x_i, \theta), \ldots, F_l(x_i, \theta))$, and $\mathbb{E}[\hat{\mathbf{y}}(x_i; \theta)\mathbf{s}_i^T \mid x_i, s_i]$ is an $l \times k$ matrix whose $(j, r)$ entry equals $F_j(x_i, \theta)\, \mathbf{s}_{i,r}$. Strong concavity of $\hat{\psi}_i(\theta, \cdot)$ is shown in Lowy et al. (2022a).
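The following sketch (an illustration based on the form of $\hat{\psi}_i$ in Theorem C.1 as written above, not the authors' implementation; all names are assumptions) evaluates $\hat{\psi}_i(\theta, W)$ for a single example, given the softmax output $F(x_i, \theta)$, the one-hot sensitive attribute, and the empirical sensitive-attribute frequencies.

```python
import numpy as np

def psi_i(W, F_i, s_onehot, p_s):
    """psi_i(theta, W) as in Theorem C.1:
    -Tr(W diag(F) W^T) + 2 Tr(W (F s^T) P_S^{-1/2}) - 1,
    where F = (F_1(x_i, theta), ..., F_l(x_i, theta)) is the predicted class
    distribution and p_s holds the empirical sensitive-attribute frequencies."""
    P_inv_sqrt = np.diag(1.0 / np.sqrt(p_s))   # \hat{P}_S^{-1/2}, shape (k, k)
    E_yyT = np.diag(F_i)                       # E[yhat yhat^T | x_i], shape (l, l)
    E_ysT = np.outer(F_i, s_onehot)            # E[yhat s_i^T | x_i, s_i], shape (l, k)
    return (-np.trace(W @ E_yyT @ W.T)
            + 2.0 * np.trace(W @ E_ysT @ P_inv_sqrt)
            - 1.0)

# evaluate at an arbitrary W for one sample (k = 2 sensitive groups, l = 3 classes)
rng = np.random.default_rng(0)
W = rng.normal(size=(2, 3))
print(psi_i(W, F_i=np.array([0.2, 0.5, 0.3]),
            s_onehot=np.array([1.0, 0.0]),
            p_s=np.array([0.5, 0.5])))
```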
D DP-FERMI ALGORITHM: PRIVACY We begin with a routine calculation of the derivatives of pψi, which follows by elementary matrix calculus: Lemma D.1. Let pψipθ, Wq Trp WErpypxi, θqpypxi, θq T |xis W T q 2 Trp WErpypxi; θqs T i |xi, sis p P 1{2 s q 1, where Erpypxi; θqpypxi; θq T |xis diagp F1pxi, θq, . . . , Flpxi, θqq and Erpypxi; θqs T i |xi, sis is a k ˆ l matrix with Erpypxi; θqs T i |xi, sisr,j si,r Fjpxi, θq. Then, θ pψipθ, Wq θ vecp Erpypxi, θqpypxi, θq T |xisq T vecp W T Wq 2 θ vecp Ersipypxi, θq T |xi, sisq vec ˆ W T p PS 1{2 and w pψipθ, Wq 2WErpypxi, θqpypxi, θq T |xis 2 p P 1{2 S Ersipypxi, θq T |xi, sis. Using Theorem D.1, we can prove that Algorithm 1 is DP: Theorem D.2 (Re-statement of Theorem 3.3). Let ϵ ď 2 lnp1{δq, δ P p0, 1q, and T ě n ?ϵ 2m 2 . Assume Fp , xq is Lθ-Lipschitz for all x, and |p Wtqr,j| ď D for all t P r Ts, r P rks, j P rls. Then, for σ2 w ě 16T lnp1{δq ϵ2n2ρ and σ2 θ ě 16L2 θD2 lnp1{δq T ϵ2n2ρ , Algorithm 1 is pϵ, δq-DP with respect to the sensitive attributes for all data sets containing at least ρ-fraction of minority attributes. Further, if σ2 w ě 32T lnp1{δq ϵ2n2 1 ρ D2 and σ2 θ ě 64L2 θD2 lnp1{δq T ϵ2n2ρ 32D4L2 θl2T lnp1{δq ϵ2n2 , then Algorithm 1 is pϵ, δq-DP (with respect to all features) for all data sets containing at least ρ-fraction of minority attributes. Proof. First consider the case in which only the sensitive attributes are private. By the moments accountant Theorem 1 in Abadi et al. (2016), it suffices to bound the sensitivity of the gradient updates by 2 θ ď 8D2L2 θ m2ρ and 2 w ď 8 m2ρ. Here 2 θ sup Z Z1,θ,W θ pψpθ, W; ziq θ pψpθ, W; z1 iq ı Published as a conference paper at ICLR 2023 and Z Z1 means that Z and Z1 are two data sets (both with ρ-fraction of minority attributes) that differ in exactly one person s sensitive attributes: i.e. si s1 i for some unique i P rns, but zj z1 j for all j i and pxi, yiq px1 i, y1 iq. Likewise, 2 w sup Z Z1,θ,W w pψpθ, W; ziq w pψpθ, W; z1 iq ı Now, by Theorem D.1, θ pψipθ, Wq θ vecp Erpypxi, θqpypxi, θq T |xisq T vecp W T Wq 2 θ vecp Ersipypxi, θq T |xi, sisq vec ˆ W T p PS 1{2 , and notice that only the second term depends on S. Therefore, we can bound the ℓ2-sensitivity of the θ-gradient updates by: 2 θ sup Z Z1,W,θ i 1 2 θ vecp Ersipypxi, θq T |xi, sisq vec ˆ W T p PS 1{2 2 θ vecp Ers1 ipypxi, θq T |xi, s1 isq vec ˆ W T p PS1 1{2 ď 4 m2 sup x,si,s1 i,W,θ j 1 } θFjpθ, xq}2W 2 r,j p PSprq s1 i,r b ď 8 ρm2 sup x,W,θ j 1 } θFjpθ, xq}2W 2 r,j ď 8D2L2 θ ρm2 , using Lipschitz continuity of Fp , xq, the assumption that W has diameter bounded by D, the assumption that the data sets have at least ρ-fraction of sensitive attribute r for all r P rks. Similarly, for the W-gradients, we have w pψipθ, Wq 2WErpypxi, θqpypxi, θq T |xis 2 p P 1{2 S Ersipypxi, θq T |xi, sis by Theorem D.1. Hence 2 W sup θ,W,si,s1 i Wdiagp F1pθ, xiq, . . . , Flpθ, xiqq p P 1{2 S Ersipyipxi; θtq T |xi, sis Wdiagp F1pθ, xiq, . . . , Flpθ, xiqq p P 1{2 S1 Ers1 ipyipxi; θtq T |xi, s1 is ď 4 m2 sup θ,W,si,s1 i j 1 Fjpθ, xiq2 kÿ p PSprq s1 i,r b since řl j 1 Fjpθ, xiq2 ď řl j 1 Fjpθ, xiq 1. This establishes the desired privacy guarantee with respect to sensitive attributes for Algorithm 1. Now consider the case in which all features are private. We aim to bound the sensitivities of the gradient updates to changes in a single sample zi psi, xi, yiq. 
Denote these new sensitivities by

$$\tilde{\Delta}_\theta := \sup_{Z \sim Z',\, \theta, W} \big\| \nabla_\theta \hat{\psi}(\theta, W; z_i) - \nabla_\theta \hat{\psi}(\theta, W; z_i') \big\|,$$

where we now write $Z \sim Z'$ to mean that $Z$ and $Z'$ are two data sets (both with a $\rho$-fraction of minority attributes) that differ in exactly one person's (sensitive and non-sensitive) data: i.e. $z_i \neq z_i'$ for some unique $i \in [n]$. Likewise,

$$\tilde{\Delta}_W := \sup_{Z \sim Z',\, \theta, W} \big\| \nabla_w \hat{\psi}(\theta, W; z_i) - \nabla_w \hat{\psi}(\theta, W; z_i') \big\|.$$

By Lemma D.1,

$$\tilde{\Delta}_\theta \leq \frac{1}{m} \sup_{z_i, z_i', \theta, W} \Big\| \Big[ -\nabla_\theta \mathrm{vec}\big(\mathbb{E}[\hat{\mathbf{y}}(x_i, \theta)\hat{\mathbf{y}}(x_i, \theta)^T \mid x_i]\big)^T \mathrm{vec}(W^T W) + 2\, \nabla_\theta \mathrm{vec}\big(\mathbb{E}[\mathbf{s}_i \hat{\mathbf{y}}(x_i, \theta)^T \mid x_i, s_i]\big)^T \mathrm{vec}\big(W^T \hat{P}_S^{-1/2}\big) \Big] - \Big[ -\nabla_\theta \mathrm{vec}\big(\mathbb{E}[\hat{\mathbf{y}}(x_i', \theta)\hat{\mathbf{y}}(x_i', \theta)^T \mid x_i']\big)^T \mathrm{vec}(W^T W) + 2\, \nabla_\theta \mathrm{vec}\big(\mathbb{E}[\mathbf{s}_i' \hat{\mathbf{y}}(x_i', \theta)^T \mid x_i', s_i']\big)^T \mathrm{vec}\big(W^T \hat{P}_{S'}^{-1/2}\big) \Big] \Big\|.$$

Thus,

$$\tilde{\Delta}_\theta^2 \leq \frac{4 L_\theta^2 l^2 D^2}{m^2} + 2\Delta_\theta^2.$$

Therefore, by the moments accountant, the collection of all $\theta_t$ updates in Algorithm 1 is $(\epsilon, \delta)$-DP if

$$\sigma_\theta^2 \geq \frac{32 D^2 L_\theta^2 T \ln(1/\delta)}{\rho\, \epsilon^2 n^2} + \frac{8 D^2 L_\theta^2 l^2 T \ln(1/\delta)}{\epsilon^2 n^2} = \frac{8 L_\theta^2 D^2 T \ln(1/\delta)}{\epsilon^2 n^2} \Big( \frac{4}{\rho} + l^2 \Big).$$

Next, we bound the sensitivity $\tilde{\Delta}_W$ of the $W$-gradient updates. We have

$$\tilde{\Delta}_W^2 = \sup_{\theta, W, z_i, z_i'} \frac{4}{m^2} \Big\| \Big[ W \mathrm{diag}\big(F_1(\theta, x_i), \ldots, F_l(\theta, x_i)\big) - \hat{P}_S^{-1/2} \mathbb{E}[\mathbf{s}_i \hat{\mathbf{y}}(x_i; \theta_t)^T \mid x_i, s_i] \Big] - \Big[ W \mathrm{diag}\big(F_1(\theta, x_i'), \ldots, F_l(\theta, x_i')\big) - \hat{P}_{S'}^{-1/2} \mathbb{E}[\mathbf{s}_i' \hat{\mathbf{y}}(x_i'; \theta_t)^T \mid x_i', s_i'] \Big] \Big\|^2 \leq \frac{8}{m^2} \sup_{\theta, W, x_i, x_i'} \big\| W \mathrm{diag}\big(F_1(\theta, x_i) - F_1(\theta, x_i'), \ldots, F_l(\theta, x_i) - F_l(\theta, x_i')\big) \big\|^2 + 2\Delta_W^2 \leq 2\Delta_W^2 + \frac{16 D^2}{m^2} \sup_{\theta, x_i} \sum_{j=1}^{l} F_j(\theta, x_i)^2 \leq 2\Delta_W^2 + \frac{16 D^2}{m^2},$$

since $\sum_{j=1}^{l} F_j(\theta, x_i)^2 \leq \sum_{j=1}^{l} F_j(\theta, x_i) = 1$. Therefore, by the moments accountant, the collection of all $W_t$ updates in Algorithm 1 is $(\epsilon, \delta)$-DP if $\sigma_w^2 \geq \frac{32 T \ln(1/\delta)}{\epsilon^2 n^2}\big( \frac{1}{\rho} + D^2 \big)$. This completes the proof.

E DP-FERMI ALGORITHM: UTILITY

To prove Theorem 3.4, we will first derive a more general result. Namely, in Appendix E.1, we will provide a precise upper bound on the stationarity gap of noisy DP stochastic gradient descent-ascent (DP-SGDA).

E.1 NOISY DP-SGDA FOR NONCONVEX-STRONGLY CONCAVE MIN-MAX PROBLEMS

Consider a generic (smooth) nonconvex-strongly concave min-max ERM problem:

$$\min_{\theta \in \mathbb{R}^{d_\theta}} \max_{w \in \mathcal{W}} \Big\{ F(\theta, w) := \frac{1}{n} \sum_{i=1}^{n} f(\theta, w; z_i) \Big\}, \qquad (7)$$

where $f(\theta, \cdot; z)$ is $\mu$-strongly concave[3] for all $\theta, z$, but $f(\cdot, w; z)$ is potentially non-convex. We propose Noisy DP-SGDA[4] (Algorithm 2) for privately solving (7), which is a noisy DP variation of two-timescale SGDA (Lin et al., 2020).

[3] We say a differentiable function $g$ is $\mu$-strongly concave if $g(\alpha) + \langle \nabla g(\alpha), \alpha' - \alpha \rangle - \frac{\mu}{2}\|\alpha - \alpha'\|^2 \geq g(\alpha')$ for all $\alpha, \alpha'$.
[4] DP-SGDA was also used in Yang et al. (2022) for convex and PL min-max problems.

Algorithm 2 Noisy Differentially Private Stochastic Gradient Descent-Ascent (DP-SGDA)
1: Input: data $Z$, $\theta_0 \in \mathbb{R}^{d_\theta}$, $w_0 \in \mathcal{W}$, step-sizes $(\eta_\theta, \eta_w)$, privacy noise parameters $\sigma_\theta, \sigma_w$, batch size $m$, iteration number $T \geq 1$.
2: for $t = 0, 1, \ldots, T - 1$ do
3:   Draw a batch of data points $\{z_i\}_{i=1}^{m}$ uniformly at random from $Z$.
4:   Update $\theta_{t+1} \leftarrow \theta_t - \eta_\theta \big( \frac{1}{m}\sum_{i=1}^{m} \nabla_\theta f(\theta_t, w_t; z_i) + u_t \big)$, where $u_t \sim N(0, \sigma_\theta^2 I_{d_\theta})$, and $w_{t+1} \leftarrow \Pi_{\mathcal{W}}\big[ w_t + \eta_w \big( \frac{1}{m}\sum_{i=1}^{m} \nabla_w f(\theta_t, w_t; z_i) + v_t \big) \big]$, where $v_t \sim N(0, \sigma_w^2 I_{d_w})$.
5: end for
6: Draw $\hat{\theta}_T$ uniformly at random from $\{\theta_t\}_{t=1}^{T}$.
7: Return: $\hat{\theta}_T$.

Now, we provide the first theoretical convergence guarantee for DP non-convex min-max optimization:

Theorem E.1 (Privacy and Utility of Algorithm 2, Informal Version). Let $\epsilon \leq 2\ln(1/\delta)$, $\delta \in (0, 1)$. Assume: $f(\cdot, w; z)$ is $L_\theta$-Lipschitz[5] and $f(\theta, \cdot; z)$ is $L_w$-Lipschitz for all $\theta, w, z$; and $\mathcal{W} \subset \mathbb{R}^{d_w}$ is a convex, compact set. Denote $\Phi(\theta) = \max_{w \in \mathcal{W}} F(\theta, w)$. Choose $\sigma_w^2 = \frac{8 T L_w^2 \ln(1/\delta)}{\epsilon^2 n^2}$, $\sigma_\theta^2 = \frac{8 T L_\theta^2 \ln(1/\delta)}{\epsilon^2 n^2}$, and $T \geq \big( \frac{n\sqrt{\epsilon}}{2m} \big)^2$. Then, Algorithm 2 is $(\epsilon, \delta)$-DP. Further, if $f(\cdot, \cdot; z)$ has Lipschitz gradients and $f(\theta, \cdot; z)$ is strongly concave, then there exist $T, \eta_\theta, \eta_w$ such that

$$\mathbb{E}\big\| \nabla\Phi(\hat{\theta}_T) \big\|^2 = O\left( \frac{\sqrt{d\, \ln(1/\delta)}}{\epsilon n} \right),$$

where $d = \max(d_\theta, d_w)$. (The expectation is solely over the algorithm.)

[5] We say a function $g$ is $L$-Lipschitz if $\|g(\alpha) - g(\alpha')\| \leq L\|\alpha - \alpha'\|$ for all $\alpha, \alpha'$.
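As a self-contained illustration of Algorithm 2 and the noise calibration in Theorem E.1, the sketch below runs noisy two-timescale SGDA on a toy nonconvex-strongly concave objective. The objective, step sizes, and Lipschitz constants are assumptions invented for this example; none of it is taken from the paper's experiments.

```python
import math
import numpy as np

# Toy objective (an assumption for illustration): f(theta, w; z) = z*sin(theta)*w - 0.5*w**2,
# which is nonconvex in theta and 1-strongly concave in w.
def grad_theta(theta, w, z):
    return z * math.cos(theta) * w

def grad_w(theta, w, z):
    return z * math.sin(theta) - w

def dp_sgda(data, T, m, eta_theta, eta_w, eps, delta, L_theta, L_w, w_radius, seed=0):
    """Minimal DP-SGDA sketch (Algorithm 2): noisy descent in theta and noisy projected
    ascent in w, with sigma^2 = 8 T L^2 ln(1/delta) / (eps^2 n^2) as in Theorem E.1."""
    rng = np.random.default_rng(seed)
    n = len(data)
    sigma_theta = math.sqrt(8 * T * L_theta ** 2 * math.log(1 / delta) / (eps ** 2 * n ** 2))
    sigma_w = math.sqrt(8 * T * L_w ** 2 * math.log(1 / delta) / (eps ** 2 * n ** 2))
    theta, w = 0.5, 0.0
    for _ in range(T):
        batch = rng.choice(data, size=m, replace=False)
        g_theta = np.mean([grad_theta(theta, w, z) for z in batch]) + rng.normal(0.0, sigma_theta)
        g_w = np.mean([grad_w(theta, w, z) for z in batch]) + rng.normal(0.0, sigma_w)
        theta = theta - eta_theta * g_theta                       # noisy descent step
        w = float(np.clip(w + eta_w * g_w, -w_radius, w_radius))  # noisy projected ascent step
    return theta

data = np.random.default_rng(1).uniform(-1.0, 1.0, size=1000)
print(dp_sgda(data, T=500, m=32, eta_theta=0.01, eta_w=0.1,
              eps=1.0, delta=1e-5, L_theta=2.0, L_w=3.0, w_radius=2.0))
```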
In our DP fair learning application, fpθ, W; ziq ℓpθ, xi, yiq λ pψipθ, Wq and the strong concavity assumption on f in Theorem E.1 is automatically satisfied, by Lowy et al. (2022a). The Lipschitz and smoothness assumptions on f are standard in optimization literature and are satisfied for loss functions that are typically used in pracdtice. In our application to DP-FERMI, these assumptions hold as long as the loss function ℓand F are Lipschitz continuous with Lipschitz gradients. Our next goal is to prove (the precise, scale-invariant version of) Theorem E.1. To that end, we require the following notation. Notation and Assumptions: Let f : Rdθ ˆ Rdw ˆ Z Ñ R, and Fpθ, wq 1 n řn i 1 fpθ, w; ziq for fixed training data Z pz1, , znq P Zn. Let W Ă Rdw be a convex, compact set. For any θ P Rdθ, denote w pθq P argmaxw PW Fpθ, wq and pΦpθq maxw PW Fpθ, wq. Let Φ pΦpθ0q infθ pΦZpθq. Recall that a function h is β-smooth if its derivative h is β-Lipschitz. We write a À b if there is an absolute constant C ą 0 such that a ď Cb. Assumption E.2. 1. fp , w; zq is Lθ-Lipschitz and βθ-smooth for all w P W, z P Z. 2. fpθ, ; zq is Lw-Lipschitz, βw-smooth, and µ-strongly concave on W for all θ P Rdθ, z P Z. 3. } wfpθ, w; zq wfpθ1, w; zq} ď βθw}θ θ1} and } θfpθ, w; zq θfpθ, w1; zq} ď βθw}w w1} for all θ, θ1, w, w1, z. 4. W has ℓ2 diameter bounded by D ě 0. 5. w Fpθ, w pθqq 0 for all θ, where w pθq denotes the unconstrained global minimizer of Fpθ, q. The first four assumptions are standard in (DP and min-max) optimization. The fifth assumption means that W contains the unconstrained global minimizer w pθq of Fpθ, q for all θ. Hence (7) is equivalent to min θPRdθ max w PRdw Fpθ, wq. 4DP-SGDA was also used in Yang et al. (2022) for convex and PL min-max problems. 5We say function g is L-Lipschitz if }gpαq gpα1q} ď L}α α1} for all α, α1. Published as a conference paper at ICLR 2023 This assumption is not actually necessary for our convergence result to hold, but we will need it when we apply our results to the DP fairness problem. Moreover, it simplifies the proof of our convergence result. We refer to problems of the form (7) that satisfy Theorem E.2 as (smooth) nonconvex-strongly concave min-max. We denote κw : βw µ and κθw : βθw We can now provide the complete, precise version of Theorem E.1: Theorem E.3 (Privacy and Utility of Algorithm 2, Formal Version). Let ϵ ď 2 lnp1{δq, δ P p0, 1q. Grant Theorem E.2. Choose σ2 w 8T L2 w lnp1{δq ϵ2n2 , σ2 θ 8T L2 θ lnp1{δq ϵ2n2 , and T ě n ?ϵ 2m 2 . Then Algorithm 2 is pϵ, δq-DP. Further, if we choose ηθ 1 16κwpβθ βθwκθwq, ηw 1 βw , and κwr Φpβθ βθwκθwq β2 θw D2sϵn min 1 Lθ ?dθ , βw βθw Lw ?κwdw E} ΦpˆθT q}2 À b Φ pβθ βθwκθwqκw κwβ2 θw D2q ϵn ˆβθw ?κw βw ˆ L2 θ κwβ2 θw L2 w β2w In particular, if m ě min ˆ ϵn Lθ ? dθκwr Φpβθ βθwκθwq β2 θw D2s, ϵn Lw?κw βθwβw? dwκwr Φpβθ βθwκθwq β2 θw D2s E} ΦpˆθT q}2 À b κwr Φpβθ βθwκθwq β2 θw D2s dθ ˆβθw ?κw βw The proof of Theorem E.3 will require several technical lemmas. These technical lemmas, in turn, require some preliminary lemmas, which we present below. We begin with a refinement of Lemma 4.3 from Lin et al. (2020): Lemma E.4. Grant Theorem E.2. Then Φ is 2pβθ βθwκθwq-smooth with Φpθq θFpθ, w pθqq, and w p q is κw-Lipschitz. Proof. The proof follows almost exactly as in the proof of Lemma 4.3 of Lin et al. (2020), using Danskin s theorem, but we carefully track the different smoothness parameters with respect to w and θ (and their units) to obtain the more precise result. Lemma E.5 (Lei et al. (2017)). 
Let talul Prns be an arbitrary collection of vectors such that řn l 1 al 0. Further, let S be a uniformly random subset of rns of size m. Then, n m pn 1qm 1 n l 1 }al}2 ď 1tmănu Lemma E.6 (Co-coercivity of the gradient). For any β-smooth and convex function g, we have } gpaq gpbq}2 ď 2βpgpaq gpbq xgpbq, a byq, for all a, b P domainpgq. Having recalled the necessary preliminaries, we now provide the novel technical ingredients that we ll need for the proof of Theorem E.3. The next lemma quantifies the progress made in minimizing Φ from a single step of noisy stochastic gradient descent in θ (i.e. line 4 of Algorithm 2): Lemma E.7. For all t P r Ts, the iterates of Algorithm 2 satisfy EΦpθtq ď Φpθt 1q ηθ 2 2pβθ βθwκθwqη2 θ E} Φpθt 1q}2 2 2η2 θpβθ βθwκθwq E} Φpθt 1q θFpθt 1, wt 1q}2 pβθβθwκθwqη2 θ ˆ dθσ2 θ 4L2 θ m 1tmănu conditional on θt 1, wt 1. Published as a conference paper at ICLR 2023 Proof. Let us denote rg : 1 m řm i 1 θfpθt 1, wt 1; ziq ut 1 : g ut 1, so θt θt 1 ηθrg. First condition on the randomness due to sampling and Gaussian noise addition. By smoothness of Φ (see Theorem E.4), we have Φpθtq ď Φpθt 1q ηθxrg, Φpθt 1qy pβθ βθwκθwqη2 θ}rg}2 Φpθt 1q ηθ} Φpθt 1q}2 ηθxrg Φpθt 1q, Φpθt 1qy pβθ βθwκθwqη2 θ}rg}2. Taking expectation (conditional on θt 1, wt 1), ErΦpθtqs ď Φpθt 1q ηθ} Φpθt 1q}2 ηθx θFpθt 1, wt 1q Φpθt 1q, Φpθt 1qy pβθ βθwκθwqη2 θ dθσ2 θ E}g θFpθt 1, wt 1q}2 } θFpθt 1, wt 1q}2 ď Φpθt 1q ηθ} Φpθt 1q}2 ηθx θFpθt 1, wt 1q Φpθt 1q, Φpθt 1qy pβθ βθwκθwqη2 θ dθσ2 θ 4L2 θ m 1tmănu } θFpθt 1, wt 1q}2 ȷ ď Φpθt 1q ηθ} Φpθt 1q}2 ηθx θFpθt 1, wt 1q Φpθt 1q, Φpθt 1qy pβθ βθwκθwqη2 θ dθσ2 θ 4L2 θ m 1tmănu 2} θFpθt 1, wt 1q Φpθt 1q}2 2} Φpθt 1q}2 ȷ ď Φpθt 1q ηθ} Φpθt 1q}2 ηθ } Φpθt 1q θFpθt 1, wt 1q}2 } Φpθt 1q}2 pβθ βθwκθwqη2 θ dθσ2 θ 4L2 θ m 1tmănu 2} θFpθt 1, wt 1q Φpθt 1q}2 2} Φpθt 1q}2 ȷ ď Φpθt 1q ηθ 2 2pβθ βθwκθwqη2 θ } Φpθt 1q}2 2 2pβθ βθwκθwqη2 θ } Φpθt 1q θFpθt 1, wt 1q}2 pβθ βθwκθwqη2 θ ˆ dθσ2 θ 4L2 θ m 1tmănu In the first inequality, we used the fact that the Gaussian noise has mean zero and is independent of pθt 1, wt 1, Zq, plus the fact that Eg θFpθt 1, wt 1q. In the second inequality, we used Theorem E.5 and Lipschitz continuity of f. In the third and fourth inequalities, we used Young s inequality and Cauchy-Schwartz. For the particular ηθ prescribed in Theorem E.3, we obtain: Lemma E.8. Grant Theorem E.2. If ηθ 1 16κwpβθ βθwκθwq, then the iterates of Algorithm 2 satisfy (@t ě 0) EΦpθt 1q ď E Φpθtq 3 8ηθ}Φpθtq}2 5 ˆ β2 θw}w pθtq wt}2 dθσ2 θ 4L2 θ m 1tmănu Proof. By Theorem E.7, we have EΦpθt 1q ď EΦpθtq ηθ 2 2pβθ βθwκθwqη2 θ E} Φpθtq}2 2 2η2 θpβθ βθwκθwq E} Φpθtq θFpθt, wtq}2 pβθβθwκθwqη2 θ ˆ dθσ2 θ 4L2 θ m 1tmănu 8ηθE} Φpθtq}2 5 E} Φpθtq θFpθt, wtq}2 dθσ2 θ 4L2 θ m 1tmănu 8ηθE} Φpθtq}2 5 β2 θw E}w pθtq wt}2 dθσ2 θ 4L2 θ m 1tmănu In the second inequality, we simply used the definition of ηθ. In the third inequality, we used the fact that Φpθtq θFpθt, w pθtqq (see Theorem E.4) together with Theorem E.2 (part 3). Next, we describe the progress made in the wt updates: Published as a conference paper at ICLR 2023 Lemma E.9. Grant Theorem E.2. If ηw 1 βw , then the iterates of Algorithm 2 satisfy (@t ě 0) E}w pθt 1q wt 1}2 ď ˆ 1 1 2κw 4κwκ2 θwη2 θβ2 θw E}w pθtq wt}2 2 ˆ4L2 w m 1tmănu dwσ2 w 4κwκ2 θwη2 θ E} Φpθtq}2 dθσ2 θ . Proof. Fix any t and denote δt : E}w pθtq wt}2 : E}w wt}2. We may assume without loss of generality that fpθ, ; zq is µ-strongly convex and that wt 1 ΠWrwt 1 βw 1 m řm i 1 wfpθt, wt; ziq vt s : ΠWrwt 1 βw p hpwtq vtqs : ΠWrwt 1 βw hpwtqs. 
Now, E}wt 1 w }2 E ΠWrwt 1 βw hpwtqs w E}wt w }2 1 E} hpwtq}2 dwσ2 w 2 βw E A wt w , rhpwtq E ď E}wt w }2 1 E} hpwtq}2 dwσ2 w 2 βw E Fpθt, wtq Fpθt, w q µ 2 }wt w }2ı βw E r Fpθt, wtq Fpθt, w qs E} hpwtq}2 β2w dwσ2 w β2w . E} hpwtq}2 E } hpwtq w Fpθt, wtq}2 } w Fpθt, wtq}2 ď 4L2 w m 1tmănu E} w Fpθt, wtq}2 ď 4L2 w m 1tmănu 2βwr Fpθt, wtq Fpθt, w pθtqqs, using independence and Theorem E.5 plus Lipschitz continuity of f in the first inequality and Theorem E.6 (plus Theorem E.2 part 5) in the second inequality. This implies E}wt 1 w }2 ď δt dwσ2 w 4L2 w m 1tmănu δt 1 E}wt 1 w pθtq w pθtq w pθt 1q}2 ď ˆ 1 1 2κw 1 E}wt 1 w pθtq}2 2κw E}w pθtq w pθt 1q}2 ď ˆ 1 1 2κw 1 dwσ2 w 4L2 w m 1tmănu ȷ 2κw E}w pθtq w pθt 1q}2 ď ˆ 1 1 2κw 1 dwσ2 w 4L2 w m 1tmănu ȷ 2κwκ2 θw E}θt θt 1}2 ď ˆ 1 1 2κw 1 dwσ2 w 4L2 w m 1tmănu 4κwκ2 θwη2 θ E} θFpθt, wtq Φpθtq}2 } Φpθtq}2 dθσ2 t ˆ 1 1 2κw 1 dwσ2 w 4L2 w m 1tmănu 4κwκ2 θwη2 θ E} θFpθt, wtq θFpθt, w pθtq}2 } Φpθtq}2 dθσ2 t ď ˆ 1 1 2κw 1 dwσ2 w 4L2 w m 1tmănu 4κwκ2 θwη2 θ β2 θw E}wt w pθtq}2 } Φpθtq}2 dθσ2 t , Published as a conference paper at ICLR 2023 by Young s inequality, (8), and Theorem E.4. Since 1 1 2κw 1 1 1 κw ď 1 1 2κw , we obtain δt 1 ď ˆ 1 1 2κw 4κwκ2 θwη2 θβ2 θw dwσ2 w 4L2 w m 1tmănu ȷ 4κwκ2 θwη2 θ } Φpθtq}2 dθσ2 t , as desired. We are now prepared to prove Theorem E.3. Proof of Theorem E.3. Privacy: This is an easy consequence of Theorem 1 in Abadi et al. (2016) (with precise constants obtained from the proof therein, as in Bassily et al. (2019)) applied to both the min (descent in θ) and max (ascent in w) updates. Unlike Abadi et al. (2016), we don t clip the gradients here before adding noise, but the Lipschitz continuity assumptions (Theorem E.2 parts 1 and 2) imply that the ℓ2-sensitivity of the gradient updates in lines 4 and 5 of Algorithm 2 are nevertheless bounded by 2Lθ{m and 2Lw{m, respectively. Thus, Theorem 1 in Abadi et al. (2016) still applies. Convergence: Denote ζ : 1 1 2κw 4κwκ2 θwη2 θβ2 θw, δt E}w pθtq wt}2, and ˆ4L2 w m 1tmănu dwσ2 w 4κwκ2 θwη2 θ E} Φpθtq}2 dθσ2 θ , so that Theorem E.9 reads as δt ď ζδt 1 Ct 1 (9) for all t P r Ts. Applying (9) recursively, we have j 0 Ct j 1ζj ď ζt D2 4κwκ2 θwη2 θ j 0 ζt 1 j E} Φpθjq}2 j 0 ζt 1 j 2 ˆ4L2 w m 1tmănu dwσ2 w 4κwκ2 θwη2 θdθσ2 θ Combining this inequality with Theorem E.8, we get EΦpθtq ď E Φpθt 1q 3 8ηθ} Φpθt 1q}2 ȷ 5 ˆ dθσ2 θ 4L2 θ m 1tmănu ζt D2 4κwκ2 θwη2 θ j 0 ζt 1 j E} Φpθjq}2 j 0 ζt 1 j 2 ˆ4L2 w m 1tmănu dwσ2 w 4κwκ2 θwη2 θdθσ2 θ Summing over all t P r Ts and re-arranging terms yields EΦpθT q ď Φpθ0q 3 t 1 E} Φpθt 1q}2 5 8ηθT ˆ dθσ2 t 4L2 θ m 1tmănu 4η3 θβ2 θwκwκ2 θw j 0 ζt 1 j E} Φpθjq}2 ˆ4L2 w m 1tmănu dwσ2 w 4κwκ2 θwη2 θdθσ2 θ Published as a conference paper at ICLR 2023 Now, ζ ď 1 1 4κw , which implies that t 1 ζt ď 4κw and j 0 ζt 1 j ď 4κw T. t 1 E} Φpθtq}2 ď 3rΦpθ0q EΦpθT qs ˆ dθσ2 θ 4L2 θ m 1tmănu 7β2 θw D2κw 48η2 θβ2 θwκ2 wκ2 θw T t 1 E} Φpθtq}2 8κwβ2 θw 2 β2w ˆ4L2 w m 1tmănu dwσ2 w 32β2 θwκ2 wκ2 θwη2 θdθσ2 θ. Since η2 θβ2 θwκ2 wκ2 θw ď 1 256, we obtain E} ΦpˆθT q}2 À Φκw T pβθ βθwκθwq dθL2 θT lnp1{δq ˆ L2 θ κwβ2 θw L2 w β2w κwβ2 θw L2 wdw T lnp1{δq β2wϵ2n2 Our choice of T then implies E} ΦpˆθT q}2 À b Φ pβθ βθwκθwqκw κwβ2 θw D2q ϵn ˆβθw ?κw βw ˆ L2 θ κwβ2 θw L2 w β2w Finally, our choice of sufficiently large m yields the last claim in Theorem E.3. E.2 PROOF OF THEOREM 3.4 Theorem 3.4 is an easy consequence of Theorem E.1, which we proved in the above subsection: Theorem E.10 (Re-statement of Theorem 3.4). 
Assume the loss function $\ell(\cdot, x, y)$ and $\mathcal{F}(x, \cdot)$ are Lipschitz continuous with Lipschitz gradients for all $(x, y)$, and $\hat{p}_S(r) \geq \rho > 0$ for all $r \in [k]$. In Algorithm 1, choose $\mathcal{W}$ to be a sufficiently large ball that contains $W^*(\theta) \in \operatorname{argmax}_{W} \widehat{F}(\theta, W)$ for every $\theta$ in some neighborhood of $\theta^* \in \operatorname{argmin}_{\theta} \max_{W} \widehat{F}(\theta, W)$. Then there exist algorithmic parameters such that the $(\epsilon, \delta)$-DP Algorithm 1 returns $\hat{\theta}_T$ with
$$\mathbb{E}\big\|\nabla \widehat{F}_{\mathrm{FERMI}}(\hat{\theta}_T)\big\|^2 = \mathcal{O}\!\left(\frac{\sqrt{\max(d_\theta, kl)\,\ln(1/\delta)}}{\epsilon n}\right),$$
treating $D = \mathrm{diameter}(\mathcal{W})$, $\lambda$, $\rho$, $l$, and the Lipschitz and smoothness parameters of $\ell$ and $\mathcal{F}$ as constants.

Proof. By Theorem E.1, it suffices to show that $f(\theta, W; z_i) := \ell(\theta, x_i, y_i) + \lambda \hat{\psi}_i(\theta, W)$ is Lipschitz continuous with Lipschitz gradient in both the $\theta$ and $W$ variables for any $z_i = (x_i, y_i, s_i)$, and that $f(\theta, \cdot; z_i)$ is strongly concave. We assumed that $\ell(\cdot, x_i, y_i)$ is Lipschitz continuous with Lipschitz gradient. Further, the work of Lowy et al. (2022a) showed that $f(\theta, \cdot; z_i)$ is strongly concave. Thus, it suffices to show that $\hat{\psi}_i(\theta, W)$ is Lipschitz continuous with Lipschitz gradient. This clearly holds by Theorem D.1, since $\mathcal{F}(x, \cdot)$ is Lipschitz continuous with Lipschitz gradient and $W \in \mathcal{W}$ is bounded.

F NUMERICAL EXPERIMENTS: ADDITIONAL DETAILS AND RESULTS

F.1 MEASURING DEMOGRAPHIC PARITY AND EQUALIZED ODDS VIOLATION

We used the expressions given in (10) and (11) to measure the demographic parity violation and the equalized odds violation, respectively. We denote by $\mathcal{Y}$ the set of all possible output classes and by $\mathcal{S}$ the classes of the sensitive attribute; $\widehat{\Pr}[E]$ denotes the empirical probability of an event $E$.
$$\max_{y' \in \mathcal{Y},\; s_1, s_2 \in \mathcal{S}} \Big| \widehat{\Pr}[\hat{y} = y' \mid s = s_1] - \widehat{\Pr}[\hat{y} = y' \mid s = s_2] \Big| \quad (10)$$
$$\max_{y' \in \mathcal{Y},\; s_1, s_2 \in \mathcal{S}} \max\Big( \big| \widehat{\Pr}[\hat{y} = y' \mid s = s_1, y = y'] - \widehat{\Pr}[\hat{y} = y' \mid s = s_2, y = y'] \big|,\ \big| \widehat{\Pr}[\hat{y} = y' \mid s = s_1, y \neq y'] - \widehat{\Pr}[\hat{y} = y' \mid s = s_2, y \neq y'] \big| \Big) \quad (11)$$
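To make (10) and (11) concrete, here is a minimal NumPy sketch that computes the two empirical violations from arrays of predictions, true labels, and sensitive attributes. It is our own illustration of the formulas above (with the maximum over $y'$ taken over the classes observed in the data), not the evaluation code used in the experiments.

```python
import numpy as np

def demographic_parity_violation(y_pred, s):
    """Empirical demographic parity violation, Eq. (10): the largest gap
    |Pr[y_hat = y' | s = s1] - Pr[y_hat = y' | s = s2]| over predicted
    classes y' and pairs of sensitive groups (s1, s2)."""
    y_pred, s = np.asarray(y_pred), np.asarray(s)
    violation = 0.0
    for c in np.unique(y_pred):                      # candidate output classes
        rates = [np.mean(y_pred[s == g] == c) for g in np.unique(s)]
        violation = max(violation, max(rates) - min(rates))
    return violation

def equalized_odds_violation(y_pred, y_true, s):
    """Empirical equalized odds violation, Eq. (11): the same gap as above,
    computed conditionally on y = y' and on y != y', taking the larger one."""
    y_pred, y_true, s = np.asarray(y_pred), np.asarray(y_true), np.asarray(s)
    violation = 0.0
    for c in np.unique(np.concatenate((y_pred, y_true))):
        for cond in (y_true == c, y_true != c):
            rates = [np.mean(y_pred[(s == g) & cond] == c)
                     for g in np.unique(s) if np.any((s == g) & cond)]
            if len(rates) >= 2:
                violation = max(violation, max(rates) - min(rates))
    return violation
```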
F.2 TABULAR DATASETS

F.2.1 MODEL DESCRIPTION AND EXPERIMENTAL DETAILS

Demographic Parity: We split each dataset in a 3:1 train:test ratio. We preprocess the data similarly to Hardt et al. (2016a) and use a simple logistic regression model with a sigmoid output $O = \sigma(Wx + b)$, which we treat as conditional probabilities $\hat{p}(\hat{y} = i \mid x)$. Both the predicted variable and the sensitive attribute are binary across all of these datasets. We analyze fairness-accuracy trade-offs with four different values of $\epsilon \in \{0.5, 1, 3, 9\}$ for each dataset. We compare against the state-of-the-art algorithms proposed in Tran et al. (2021a) and (the demographic parity objective of) Tran et al. (2021b). The tradeoff curves for DP-FERMI were generated by sweeping across different values of $\lambda \in [0, 2.5]$ (a schematic of this sweep is given at the end of Appendix F.2.3). The learning rates for the descent and ascent steps, $\eta_\theta$ and $\eta_w$, remained constant during optimization and were chosen from $[0.005, 0.01]$. The batch size was 1024. We tuned the $\ell_2$ diameter of the projection set $\mathcal{W}$ and the $\theta$-gradient clipping threshold in $[1, 5]$ in order to generate stable results at high privacy (i.e. low $\epsilon$). Each model was trained for 200 epochs. The results displayed are averages over 15 trials (random seeds) for each value of $\epsilon$.

Equalized Odds: We replicated the experimental setup described above, but took the $\ell_2$ diameter of $\mathcal{W}$ and the $\theta$-gradient clipping threshold in $[1, 2]$. Also, we only tested three values of $\epsilon \in \{0.5, 1, 3\}$.

F.2.2 DESCRIPTION OF DATASETS

Adult Income Dataset: This dataset contains census information about individuals. The classification task is to predict whether a person earns more than 50k per year. We followed a preprocessing approach similar to Lowy et al. (2022a). After preprocessing, there were a total of 102 input features. The sensitive attribute was taken to be gender. The dataset consists of around 48,000 entries spanning two CSV files, which we combine before taking a 3:1 train-test split.

Retired Adult Income Dataset: The Retired Adult Income Dataset proposed by Ding et al. (2021a) is essentially a superset of the Adult Income Dataset that attempts to counter some caveats of the Adult dataset. The input and output attributes are the same as those of the Adult dataset, and we again use gender as the sensitive attribute. This dataset contains around 45,000 entries.

Parkinsons Dataset: We use the part of the Parkinsons dataset that contains UPDRS scores along with some features of the recordings obtained from individuals affected and not affected by Parkinson's disease. The classification task is to predict from these features whether the UPDRS score is greater than the median score. After preprocessing, there were a total of 19 input features, and the sensitive attribute was taken to be gender. This dataset contains around 5,800 entries in total. We took a 3:1 train-test split.

Credit Card Dataset: This dataset contains financial data of clients of a bank in Taiwan, consisting of their gender, education level, age, marital status, previous bills, and payments. The classification task is to predict whether a person defaults on their credit card bills, i.e., whether the client is creditworthy. We followed a preprocessing approach similar to Lowy et al. (2022a). After preprocessing, there were a total of 85 input features. The sensitive attribute was taken to be gender. The dataset consists of around 30,000 entries, from which we take a 3:1 train-test split.

UTK-Face Dataset: UTK-Face is a large-scale face image dataset with an age span from 0 to 116 years. It consists of over 20,000 face images annotated with age, gender, and ethnicity, and covers large variations in pose, facial expression, illumination, occlusion, and resolution. We consider the age classification task with 9 age groups, similar to the experimental setup of Tran et al. (2022), and take ethnicity, which consists of 5 classes, as the sensitive attribute.

F.2.3 DEMOGRAPHIC PARITY

Retired Adult Results: See Fig. 6 for our results on the Retired Adult dataset. The results are qualitatively similar to those reported in the main body: our algorithm (DP-FERMI) achieves the most favorable fairness-accuracy tradeoffs across all privacy levels.

Figure 6: Private, fair logistic regression on the Retired Adult Dataset

Credit Card Results: See Fig. 7 for our results on the Credit Card dataset. DP-FERMI offers a superior fairness-accuracy-privacy profile compared to all applicable baselines.

Figure 7: Private, fair (demographic parity) logistic regression on the Credit Card Dataset

Additional Results for Parkinsons Dataset: More results for Parkinsons are shown in Fig. 8. DP-FERMI offers the best performance.

Figure 8: Private, fair (demographic parity) logistic regression on the Parkinsons Dataset
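To summarize how the tradeoff curves above were produced, the following schematic sketches the $\lambda$-sweep described in Appendix F.2.1. The helpers train_fn and eval_fn are assumed, caller-supplied routines standing in for DP-FERMI training and for test evaluation (accuracy plus a violation metric such as Eq. (10)); this is a sketch of the experimental protocol, not the actual experiment harness.

```python
import numpy as np

def tradeoff_curve(train_data, test_data, epsilon, lambdas, train_fn, eval_fn, n_seeds=15):
    """Sweep the fairness strength lambda at a fixed privacy level epsilon.

    train_fn(train_data, epsilon, lam, seed) is assumed to return a trained
    model, and eval_fn(model, test_data) to return a (test accuracy, fairness
    violation) pair, e.g. using Eq. (10).  Returns one
    (lambda, mean accuracy, mean violation) point per value of lambda.
    """
    curve = []
    for lam in lambdas:
        results = [eval_fn(train_fn(train_data, epsilon, lam, seed), test_data)
                   for seed in range(n_seeds)]
        accs, viols = zip(*results)
        curve.append((lam, float(np.mean(accs)), float(np.mean(viols))))
    return curve
```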
F.2.4 EQUALIZED ODDS

Equalized Odds Variation of DP-FERMI Algorithm: The FERMI objective (FERMI obj.) minimizes the Exponential Rényi Mutual Information (ERMI) between the model output and the sensitive attributes, which leads to a reduced demographic parity violation. The equalized odds criterion is more stringent: it enforces the demographic parity condition on the data grouped according to the true labels. For equalized odds, the ERMI between the predictions and the sensitive attributes is therefore minimized conditionally on each label present in the output variable of the dataset. Accordingly, the FERMI regularizer is split into as many parts as there are output labels; each part minimizes the ERMI with the corresponding output label held fixed, and each part has its own matrix W that is maximized in order to form a stochastic estimator of the label-conditional ERMI. (A minimal sketch of this construction is given at the end of Appendix F.)

Adult Results: Results for the equalized odds version of DP-FERMI on the Adult dataset are shown in Fig. 9. Our approach outperforms the previous state-of-the-art methods.

Figure 9: Results obtained by applying DP-FERMI with the equalized odds objective to logistic regression on the Adult Dataset

Retired Adult Results: Initial results for the equalized odds version of DP-FERMI on the Retired Adult dataset are shown in Fig. 10. Our approach outperforms Tran et al. (2021b); we are still tuning the non-private and/or non-fair versions of our models, along with Jagielski et al. (2019).

Figure 10: Results obtained by applying DP-FERMI with the equalized odds objective to logistic regression on the Retired Adult Dataset

F.3 IMAGE DATASET (UTK-FACE)

We split the dataset in a 3:1 train:test ratio. The batch size was 64, and $128 \times 128$ normalized images were used as input to our model. We tuned the $\ell_2$ diameter of $\mathcal{W}$ and the $\theta$-gradient clipping threshold in $[1, 2]$; the learning rates for the descent and ascent steps, $\eta_\theta$ and $\eta_w$, remained constant during optimization and were chosen as 0.001 and 0.005, respectively. We analyze the fairness-accuracy trade-offs with five different values of $\epsilon \in \{10, 25, 50, 100, 200\}$. The results displayed are averages over 5 random seeds for each value of $\epsilon$. Each model was trained for 150 epochs. The tradeoff curves for this set of experiments were obtained by sweeping across different values of $\lambda \in [0, 500]$.
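To make the per-label construction of Appendix F.2.4 concrete, the following sketch builds an equalized-odds-style regularizer by applying a generic FERMI-type penalty separately to the examples carrying each true label, each with its own matrix W_y. The interface fermi_penalty is an assumed placeholder for a stochastic ERMI estimator; this is a minimal illustration of the splitting described above, not the authors' exact implementation.

```python
def equalized_odds_regularizer(theta, W_per_label, batch, fermi_penalty):
    """Per-label FERMI-style regularizer, as described in Appendix F.2.4.

    W_per_label maps each output label y to its own ascent variable W_y, and
    fermi_penalty(theta, W, examples) is a placeholder for a stochastic ERMI
    estimator evaluated on the given examples (an assumed interface).  The
    penalty is applied separately to the examples carrying each true label,
    so each term enforces demographic parity conditionally on that label.
    """
    total = 0.0
    for y_label, W_y in W_per_label.items():
        group = [(x, y, s) for (x, y, s) in batch if y == y_label]
        if group:
            total = total + fermi_penalty(theta, W_y, group)
    return total
```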
G SOCIETAL IMPACTS

In this paper, we considered the socially consequential problem of privately learning fair models from sensitive data. Motivated by the lack of scalable private fair learning methods in the literature, we developed the first differentially private (DP) fair learning algorithm that is guaranteed to converge with small batches (stochastic optimization). We hope that our method will help companies, governments, and other organizations use sensitive, private data responsibly. Specifically, we hope that our DP-FERMI algorithm will be useful in reducing discrimination in algorithmic decision-making while simultaneously preventing leakage of sensitive user data. The stochastic nature of our algorithm may be especially appealing to organizations that train very large models on very large datasets.

On the other hand, there are also some important limitations of our method that need to be considered before deployment. One caveat of our work is that we have assumed that the given data set contains fair and accurate labels. For example, if gender is the sensitive attribute and likelihood of repaying a loan is the target, then we assume that the training data accurately describes everyone's financial history without discrimination. If the training data is biased against a certain demographic group, then it is possible that our algorithm could amplify (rather than mitigate) unfairness. See, e.g., Kilbertus et al. (2020); Bechavod et al. (2019) for further discussion. Another important practical consideration is how to weigh the different desiderata (privacy, fairness, and accuracy) when deploying our method. As shown in prior works (e.g., Cummings et al. (2019)) and reinforced in the present work, there are fundamental tradeoffs between fairness, accuracy, and privacy: improvements in one generally come at a cost to the other two. Determining the relative importance of these three desiderata is a critical question that lacks a clear or general answer. Depending on the application, one might be seriously concerned with either discrimination or privacy attacks, and should calibrate $\epsilon$ and $\lambda$ accordingly. Or, perhaps very high accuracy is necessary for a particular task, with privacy and/or fairness as secondary concerns. In such a case, one might start with a very large $\epsilon$ and a small $\lambda$ to ensure high accuracy, and then gradually shrink $\epsilon$ and/or increase $\lambda$ to improve privacy/fairness until training accuracy dips below a critical threshold. A thorough and rigorous exploration of these issues could be an interesting direction for future work.