Differentially Private Post-Processing for Fair Regression

Ruicheng Xian 1, Qiaobo Li 1, Gautam Kamath 2, Han Zhao 1

1 University of Illinois Urbana-Champaign. 2 University of Waterloo and Vector Institute. Correspondence to: Ruicheng Xian, Han Zhao.

Proceedings of the 41st International Conference on Machine Learning, Vienna, Austria. PMLR 235, 2024. Copyright 2024 by the author(s).

Abstract

This paper describes a differentially private post-processing algorithm for learning fair regressors satisfying statistical parity, addressing privacy concerns of machine learning models trained on sensitive data, as well as fairness concerns about their potential to propagate historical biases. Our algorithm can be applied to post-process any given regressor to improve fairness by remapping its outputs. It consists of three steps: first, the output distributions are estimated privately via histogram density estimation and the Laplace mechanism; then their Wasserstein barycenter is computed; and finally the optimal transports to the barycenter are used for post-processing to satisfy fairness. We analyze the sample complexity of our algorithm and provide a fairness guarantee, revealing a trade-off between the statistical bias and variance induced by the choice of the number of bins in the histogram, in which using fewer bins always favors fairness at the expense of error.

1. Introduction

Prediction and forecasting models trained by machine learning algorithms are ubiquitous in real-world applications, and their performance hinges on the availability and quality of training data, often collected from end-users or customers. This reliance on data has raised ethical concerns including fairness and privacy. Models trained on past data may propagate and exacerbate historical biases against disadvantaged demographics by producing less favorable predictions (Bolukbasi et al., 2016; Buolamwini & Gebru, 2018), resulting in unfair treatments and outcomes, especially in areas such as criminal justice, healthcare, and finance (Barocas & Selbst, 2016; Berk et al., 2021). Models also risk leaking highly sensitive private information contained in the training data collected for these applications (Dwork & Roth, 2014).

While there has been significant effort at addressing these concerns, few works treat them in combination, i.e., designing algorithms that train fair models in a privacy-preserving manner. A difficulty is that privacy and fairness may not be compatible: exactly achieving a group fairness criterion such as statistical parity or equalized odds requires precise (estimates of) group-level statistics, but to ensure privacy, only noisy statistics are allowed under the notion of differential privacy. Resorting to approximate fairness, prior work has proposed private learning algorithms for reducing disparity, but the focus has been on the classification setting (Jagielski et al., 2019; Xu et al., 2019; Mozannar et al., 2020; Tran et al., 2021).

In this paper, we propose and analyze a differentially private post-processing algorithm for learning attribute-aware fair regressors under the squared loss, with respect to the fairness notion of statistical parity. It can take any (privately pre-trained) regressor and remap its outputs (with minimum deviations) to improve fairness.
At a high level, our algorithm consists of three steps: estimating the output distributions of the regressor from data, computing their Wasserstein barycenter, and computing the optimal transports to the barycenter (Chzhen et al., 2020; Le Gouic et al., 2020; Xian et al., 2023). To make this process differentially private, we use private histogram density estimates (HDE) of the distributions via the Laplace mechanism (Diakonikolas et al., 2015; Xu et al., 2012), followed by re-normalization, which introduces additional complexity into our analysis. The choice of the number of bins in the HDE induces a trade-off between the statistical bias and variance in the cost of privacy and fairness. Our theoretical analysis and experiments show that using fewer bins always improves fairness at the expense of higher error.

Paper Organization. In Section 2, we introduce definitions, notation, and the problem setup. Section 3 describes our private post-processing algorithm for learning fair regressors, with a finite-sample analysis of the accuracy-privacy-fairness trade-offs in Section 3.4. Finally, in Section 4, we empirically explore the trade-offs achieved by our post-processing algorithm on the Law School and Communities & Crime datasets.

1.1. Related Work

Differential Privacy. Under the notion of differential privacy (Dwork et al., 2006) and subsequent refinements such as Rényi DP (Mironov, 2017), a variety of private learning algorithms have been proposed for settings including regression, classification, and distribution learning. There are private variants of logistic regression (Papernot et al., 2017), decision trees (Fletcher & Islam, 2019), and linear regression (Wang, 2018; Covington et al., 2021; Alabi et al., 2022). The problem of private distribution learning has been studied for finite-support distributions (Xu et al., 2012; Diakonikolas et al., 2015) and parameterized families (Bun et al., 2019; Aden-Ali et al., 2021). Lastly, private optimization algorithms, including those based on objective perturbation (Chaudhuri et al., 2011; Kifer et al., 2012) and DP-SGD (Song et al., 2013; Bassily et al., 2014; Abadi et al., 2016), have been proposed for solving convex and general optimization problems.

Algorithmic Fairness. In parallel, the study of algorithmic fairness revolves around the formalization of fairness notions and the design of bias mitigation methods (Barocas et al., 2023). Fairness criteria include those that focus on disparate impact of the model, such as individual fairness (Dwork et al., 2012), which asks the model to treat similar inputs similarly, or group-level statistical parity (Calders et al., 2009) and equalized odds (Hardt et al., 2016); and those on performance inequality, e.g., accuracy parity (Buolamwini & Gebru, 2018; Chi et al., 2021), predictive parity (Chouldechova, 2017), and multi-calibration (Hébert-Johnson et al., 2018). Fair algorithms are categorized into pre-processing, which removes biased correlations in the training data (Calmon et al., 2017); in-processing, which turns the original learning problem into a constrained one (Kamishima et al., 2012; Zemel et al., 2013; Agarwal et al., 2018; 2019); and post-processing, which remaps the predictions of a trained model post hoc to meet the fairness criteria (Hardt et al., 2016; Pleiss et al., 2017; Chzhen et al., 2020; Zhao & Gordon, 2022; Xian et al., 2023).

Fairness and Privacy.
Private fair learning and bias mitigation algorithms have been proposed and studied in prior work (Jagielski et al., 2019; Xu et al., 2019; Mozannar et al., 2020; Tran et al., 2021), but they have so far focused on the classification setting. It remains an open question how to achieve private fair regression, and what trade-offs between privacy, fairness, and accuracy are attainable. Cummings et al. (2019) and Agarwal (2020) showed that fairness and privacy are incompatible, in the sense that no $\varepsilon$-DP algorithm can generally guarantee group fairness on the training set (from which population-level guarantees can be derived via generalization), unless the hypothesis class is restricted to constant predictors. The argument is that a predictor $f$ that is fair on $S$ may not be fair on its neighbor $S'$, so an $\varepsilon$-DP algorithm that outputs $f$ on $S$ with nonzero probability may also output $f$ on $S'$, which is unfair. Work on private fair algorithms circumvents this incompatibility by relaxing to high-probability guarantees for fairness. Bagdasaryan et al. (2019) showed that the performance impact of privacy may be more severe on underrepresented groups, resulting in accuracy disparity.

2. Preliminaries

A regression problem is defined by a joint distribution $\mu$ of the observations $X \in \mathcal{X}$, the sensitive attribute $A \in \mathcal{A}$ (which has finite support), and the response $Y \in \mathbb{R}$. We use upper case $X, A, Y$ to denote the random variables, and lower case $x, a, y$ for instances of them. The goal of this paper is to develop a privacy-preserving post-processing algorithm for learning (randomized) fair regressors. We consider the attribute-aware setting, i.e., the sensitive attribute $A$ is available explicitly during both training and prediction and can be taken as input by the regressor, $f : \mathcal{X} \times \mathcal{A} \to \mathbb{R}$.

The risk of a regressor is defined to be the mean squared error, and its excess risk is defined with respect to the Bayes regressor $f^*(x, a) := \mathbb{E}[Y \mid X = x, A = a]$:
$$\mathcal{R}(f) := \mathbb{E}[(f(X, A) - Y)^2], \qquad \mathrm{ER}(f) := \mathcal{R}(f) - \mathcal{R}(f^*) = \mathbb{E}[(f(X, A) - f^*(X, A))^2]$$
by the orthogonality principle, where $\mathbb{E}$ is taken with respect to $\mu$ (and the randomness of $f$).

Given a (training) dataset $S$ consisting of $n$ i.i.d. samples of $(X, A, Y)$, a private learning algorithm minimizes the leakage of any individual's information in its output. We use the notion of differential privacy (Dwork et al., 2006), which limits the influence of any single training sample:

Definition 2.1 (Differential Privacy). A randomized (learning) algorithm $\mathscr{A}$ is $\varepsilon$-differentially private (DP) if for all pairs of nonempty neighboring datasets $S, S'$,
$$\mathbb{P}(\mathscr{A}(S) \in O) \le e^{\varepsilon}\, \mathbb{P}(\mathscr{A}(S') \in O), \quad \forall O \subseteq \operatorname{range}(\mathscr{A}),$$
where $\mathbb{P}$ is taken with respect to the randomness of $\mathscr{A}$.

We say two datasets are neighboring if they differ in one entry by substitution (our results also cover insertion and deletion operations, which may have lower sensitivity). Note that privacy is guaranteed with respect to both $A$ and $X$.

For fairness, we consider statistical parity (Calders et al., 2009), which requires the output distributions of $f$ conditioned on each group to be similar. The similarity between distributions is measured in Kolmogorov-Smirnov distance, defined for probability measures $p, q$ supported on $\mathbb{R}$ by
$$D_{\mathrm{KS}}(p, q) = \sup_{t \in \mathbb{R}} \left| \int_{-\infty}^{t} (p(x) - q(x))\, dx \right|.$$

Definition 2.2 (Statistical Parity). A (randomized) regressor $f : \mathcal{X} \times \mathcal{A} \to \mathbb{R}$ satisfies $\alpha$-approximate statistical parity (SP) if
$$\Delta_{\mathrm{SP}}(f) := \max_{a, a' \in \mathcal{A}} D_{\mathrm{KS}}(r_a, r_{a'}) \le \alpha,$$
where $r_a$ is the distribution of the regressor output $f(X, A)$ conditioned on $A = a$.
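To make Definition 2.2 concrete, the following is a minimal sketch (ours, not from the paper's released code) of estimating the statistical parity violation from samples: compute the empirical CDF of the regressor outputs within each group and take the largest pairwise KS distance. The function and array names are illustrative.

```python
# Empirical SP violation of Definition 2.2: max pairwise KS distance between
# the per-group output distributions of a regressor.
import numpy as np

def ks_distance(u: np.ndarray, v: np.ndarray) -> float:
    """Empirical Kolmogorov-Smirnov distance between two 1-D samples."""
    grid = np.sort(np.concatenate([u, v]))
    cdf_u = np.searchsorted(np.sort(u), grid, side="right") / len(u)
    cdf_v = np.searchsorted(np.sort(v), grid, side="right") / len(v)
    return float(np.max(np.abs(cdf_u - cdf_v)))

def sp_violation(outputs: np.ndarray, groups: np.ndarray) -> float:
    """max_{a, a'} D_KS(r_a, r_a') estimated from regressor outputs."""
    outputs, groups = np.asarray(outputs), np.asarray(groups)
    labels = np.unique(groups)
    if len(labels) < 2:
        return 0.0
    return max(
        ks_distance(outputs[groups == a], outputs[groups == b])
        for i, a in enumerate(labels) for b in labels[i + 1:]
    )
```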
3. Fair and Private Post-Processing

We describe a private post-processing algorithm such that, given an unlabeled training dataset $S = \{(x_i, a_i)\}_{i \in [n]}$ sampled from $\mu$, and a (privately) pre-trained regressor $f$ (e.g., learned using the algorithms mentioned in Section 1.1), the algorithm learns a mapping $g$ that transforms the output of $f$ (a.k.a. post-processing) so that $\tilde f := g \circ f$ is fair, while preserving differential privacy with respect to $S$.

Compared to in-processing approaches for fairness, post-processing decouples the conflicting goals of fairness and error minimization, since the regressor $f$ can be trained to optimality without any constraint, and then post-processed to satisfy fairness. We show that this decoupling does not affect the optimality of the resulting fair regressor $\tilde f$: the optimal fair regressor can be recovered via post-processing if the algorithm in the pre-training stage learns the Bayes regressor $f^*$ (Theorem 3.1). Moreover, post-processing has low sample complexity and only requires unlabeled data (Theorem 3.3), so labeled data (which may be scarce in some domains) can be dedicated entirely to error minimization, which is typically a more difficult problem.

3.1. Fair Post-Processing with Wasserstein Barycenters

Exact statistical parity ($\alpha = 0$) requires all groups to have identical output distributions, so finding the fair post-processing mapping $g$ that incurs minimum deviations from the original outputs of $f$ amounts to the following steps (Chzhen et al., 2020; Le Gouic et al., 2020):

1. Learn the output distributions of $f$ conditioned on each group, $r_a$, for all $a \in \mathcal{A}$.
2. Find a common (fair) distribution $q$ that is close to the original distributions (i.e., their barycenter).
3. Compute the optimal transports $g_a$ from $r_a$ to the barycenter $q$.

The (randomized) transports are then applied to post-process $f$ to obtain a fair attribute-aware regressor $\tilde f(x, a) = (g_a \circ f)(x, a)$, so that every group has the same output distribution, equal to $q$. Formally, $q$ is called the Wasserstein barycenter of the $r_a$'s, which is a distribution supported on $\mathbb{R}$ with minimum total distance to the $r_a$'s as measured in Wasserstein distance:
$$q \in \operatorname*{arg\,min}_{q' :\, \operatorname{supp}(q') \subseteq \mathbb{R}} \sum_{a \in \mathcal{A}} w_a\, W_2^2(r_a, q'), \qquad (1)$$
where $w_a := \mathbb{P}(A = a)$,
$$W_2^2(r_a, q) = \min_{\pi_a \in \Pi(r_a, q)} \int_{\mathbb{R} \times \mathbb{R}} (y - y')^2\, d\pi_a(y, y'), \qquad (2)$$
and $\Pi(p, q) = \{\pi : \operatorname{supp}(\pi) \subseteq \mathbb{R} \times \mathbb{R},\ \int_{\mathbb{R}} \pi(x, y')\, dy' = p(x),\ \int_{\mathbb{R}} \pi(x', y)\, dx' = q(y),\ \forall x, y\}$ is the collection of probability couplings of $p, q$. The value of $W_2^2(r_a, q)$ naturally represents the squared cost of transforming $r_a$ into $q$, i.e., the minimum amount of output deviations (in squared distance) required to post-process $f$ for group $a$ so that its output distribution is transformed to $q$ (which is a consequence of Lemma A.4); specifically, this optimal post-processor is given by the optimal transport from $r_a$ to $q$.

3.2. Generalization to Approximate Statistical Parity

To handle approximate statistical parity ($\alpha > 0$), in step 2, we first replace the barycenter in Equation (1) by a KS ball of radius $\alpha/2$ and relax the problem to finding target output distributions $q_a$ for each group $a$ inside the ball with minimum $W_2^2$ distance to the original distribution $r_a$. Then in step 3, we find the optimal transports $g_a$ from $r_a$ to $q_a$. So, the problem in Equation (1) is generalized to the approximate SP case with
$$P(\{r_a\}_{a \in \mathcal{A}}, \{w_a\}_{a \in \mathcal{A}}, \alpha):\quad \operatorname*{arg\,min}_{\substack{q :\, \operatorname{supp}(q) \subseteq \mathbb{R} \\ \{q_a\}_{a \in \mathcal{A}} \subseteq B_{\mathrm{KS}}(q, \alpha/2)}} \sum_{a \in \mathcal{A}} w_a\, W_2^2(r_a, q_a),$$
where $B_{\mathrm{KS}}(q, \alpha/2) = \{p : D_{\mathrm{KS}}(p, q) \le \alpha/2\}$ is a KS ball centered at $q$ in the space of probability distributions supported on $\mathbb{R}$.
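For intuition, the following is a minimal sketch (ours, not the paper's method) of steps 1-3 in the exact-parity special case ($\alpha = 0$) on raw samples: in one dimension, the $W_2$ barycenter's quantile function is the $w_a$-weighted average of the per-group quantile functions, and the optimal transport maps an output $y$ of group $a$ to the barycenter's quantile at level $F_a(y)$. Algorithm 1 instead solves a discretized linear program (Section 3.3) so that $\alpha > 0$ and differential privacy can be handled. Names and the grid size are illustrative.

```python
# Exact-SP (alpha = 0) post-processing on empirical samples via the 1-D
# Wasserstein-2 barycenter: average the group quantile functions, then map
# each output through its group CDF into the barycenter's quantiles.
import numpy as np

def barycenter_transport(outputs, groups, n_grid: int = 1000):
    outputs, groups = np.asarray(outputs, float), np.asarray(groups)
    labels, counts = np.unique(groups, return_counts=True)
    weights = counts / counts.sum()                      # empirical w_a
    levels = (np.arange(n_grid) + 0.5) / n_grid          # quantile levels in (0, 1)
    quantiles = {a: np.quantile(outputs[groups == a], levels) for a in labels}
    bary_quantile = sum(w * quantiles[a] for a, w in zip(labels, weights))

    def g(y: float, a) -> float:
        # F_a(y): empirical rank of y in group a, then read the barycenter quantile.
        grp = np.sort(outputs[groups == a])
        level = np.searchsorted(grp, y, side="right") / len(grp)
        idx = min(int(level * n_grid), n_grid - 1)
        return float(bary_quantile[idx])

    return g   # g(y, a) plays the role of g_a(y); all groups share one output distribution
```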
When learning from finite samples $S$ (non-privately), we replace $r_a, w_a$ in $P$ with the respective empirical distributions/estimates. Computationally, solving $P$ on the empirical distributions is as hard as the finite-support Wasserstein barycenter problem ($\alpha = 0$), which can be formulated as a linear program of size exponential in $|\mathcal{A}|$ (Anderes et al., 2016; Altschuler & Boix-Adserà, 2021). To reduce the complexity, we restrict the support of the barycenter $q$ via discretization; this fixed-support approximation is common in prior work (Cuturi & Doucet, 2014; Staib et al., 2017).

[Figure 1. Illustration of the private fair post-processing steps 1-3 performed in Algorithm 1, with panels: 1a. Discretize & histogram; 1b. Add noise & re-normalize; 2. Compute barycenter; 3. Optimal transports. The (randomized) transports to the barycenter are represented by (sparse) $k \times k$ matrices, and the value at the $(i, j)$-th entry is the probability of transporting to bin $j$ given bin $i$.]

The validity of the post-processing algorithm described above is supported by the fact that, if the regressor being post-processed is Bayes optimal, then the resulting fair regressor is also optimal. This has been established for the exact SP case in prior work (e.g., Le Gouic et al., 2020, Theorem 3), and here we provide a more general result for approximate SP:

Theorem 3.1. Let a regression problem be given by a joint distribution $\mu$ of $(X, A, Y)$, denote $w_a = \mathbb{P}(A = a)$, the Bayes regressor by $f^*(x, a) = \mathbb{E}[Y \mid X = x, A = a]$, and its output distribution conditioned on group $a$ by $r_a^*$. Then
$$\min_{f :\, \Delta_{\mathrm{SP}}(f) \le \alpha} \mathrm{ER}(f) = \min_{\substack{q :\, \operatorname{supp}(q) \subseteq \mathbb{R} \\ \{q_a\}_{a \in \mathcal{A}} \subseteq B_{\mathrm{KS}}(q, \alpha/2)}} \sum_{a \in \mathcal{A}} w_a\, W_2^2(r_a^*, q_a).$$

This shows that the excess risk of the optimal fair regressor is indeed the value of the Wasserstein barycenter problem ($P$) discussed above, and can be achieved by $f^*$ post-processed using the optimal transports to the barycenter. All proofs are deferred to Appendices A and B; this result follows from an equivalence between learning regressors and learning post-processings of $f^*$.

3.3. Privacy via Discretization and Private PMF Estimation

To make the fair post-processing algorithm in Section 3.1 private with respect to $S$, it suffices to perform step 1, estimating the output distributions, privately. This is because the subsequent steps 2 and 3 of computing the barycenter (including the target distributions) and the optimal transports only depend on the estimated distributions, so privacy is preserved by the post-processing immunity of DP. To estimate the distributions, one could construct a family of distributions (a.k.a. hypotheses) and (privately) choose the most likely one (Bun et al., 2019; Aden-Ali et al., 2021), or use non-parametric estimators. For generality, we adopt the latter approach and use a (private) histogram density estimator (Diakonikolas et al., 2015; Xu et al., 2012).

Altogether, our fair and private post-processing algorithm is detailed in Algorithm 1 and illustrated in Figure 1. It takes as inputs the regressor $f$ being post-processed, the samples $S$, the fairness tolerance $\alpha$, and the privacy budget $\varepsilon$. For performing HDE, it also requires specifying an interval (using prior/domain knowledge) that contains the image of the regressor, $[s, t] \supseteq f[\operatorname{supp}(\mu_{X,A})]$, and the number of bins $k$; specifying such an interval is necessary for pure $(\varepsilon, 0)$-DP, due to a lower bound by Hardt & Talwar (2010) using a packing argument. We describe the algorithm in detail in the steps below.
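Before the step-by-step description, here is a minimal sketch (ours, not the authors' released code) of the discretization and Laplace-noised joint histogram on lines 1-4 of Algorithm 1 (steps 1a-1b in Figure 1). The bin midpoints are offset by $s$ so that they lie in $[s, t]$, which is an implementation choice on our part.

```python
# Step 1a-1b: discretize regressor outputs onto k bin midpoints over [s, t],
# form the empirical joint PMF of (A, bin), and add Laplace(2/(n*eps)) noise.
import numpy as np

def private_joint_histogram(outputs, groups, s, t, k, eps, rng=None):
    rng = np.random.default_rng() if rng is None else rng
    outputs, groups = np.asarray(outputs, float), np.asarray(groups)
    n = len(outputs)
    labels = np.unique(groups)
    # bin midpoints v_j = s + (j - 1/2)(t - s)/k, j = 1..k
    v = s + (np.arange(k) + 0.5) * (t - s) / k
    # discretizer h: index of the nearest midpoint
    bins = np.clip(np.round((outputs - s) / (t - s) * k - 0.5).astype(int), 0, k - 1)
    # empirical joint PMF p_hat(a, v_j)
    p_hat = np.zeros((len(labels), k))
    for i, a in enumerate(labels):
        p_hat[i] = np.bincount(bins[groups == a], minlength=k) / n
    # Laplace mechanism; the L1 sensitivity of p_hat is at most 2/n (Remark B.1)
    p_check = p_hat + rng.laplace(loc=0.0, scale=2.0 / (n * eps), size=p_hat.shape)
    return v, labels, p_check   # noisy joint "PMF" (may be negative; fixed in Step 1b's re-normalization)
```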
Step 1 (Estimate Output Distributions). On line 3, we compute the empirical joint distribution of $A$ and $h \circ f(X, A)$, the discretized regressor output (the support is $v := \{(j - 1/2)(t - s)/k\}_{j \in [k]}$). We then make this statistic private via the Laplace mechanism on line 4, noting that the $L^1$-sensitivity of $\hat p$ is at most $2/n$ (Remark B.1); the Laplace mechanism could be replaced by, e.g., the Gaussian mechanism, to relax the pure $(\varepsilon, 0)$-DP guarantee to approximate $(\varepsilon, \delta)$-DP. With the private joint distribution, we get a private estimate of the group marginal distributions $\tilde w_a$ (clipping negative values to zero) on line 6, and the private group-conditional discretized output distributions $\tilde p_a$ (with re-normalization) on lines 7-9. The re-normalization of $\tilde p_a$ (it could also be applied to $\tilde w_a$), required for defining the optimal transport problem in the subsequent step, is done by performing isotonic regression on the partial sums (so that the values are non-decreasing) with clipping to get a valid CDF.

Algorithm 1 Fair and Private Post-Processing (Attribute-Aware)
Require: Regressor $f : \mathcal{X} \times \mathcal{A} \to \mathbb{R}$, samples $\{(x_i, a_i)\}_{i \in [n]}$, interval $[s, t]$, number of bins $k$, fairness tolerance $\alpha$, privacy budget $\varepsilon$
1: Let $v_j = (j - 1/2)(t - s)/k$ for all $j \in [k]$   (midpoints of the bins)
2: Define $h(y) = \operatorname{arg\,min}_{j \in [k]} |y - v_j|$   (discretizer)
3: Define $\hat p(a, v_j) = \frac{1}{n} \sum_{i \in [n]} \mathbf{1}[h \circ f(x_i, a_i) = v_j,\ a_i = a]$   (empirical joint distribution)
4: Define $\check p(a, v_j) = \hat p(a, v_j) + \mathrm{Laplace}(0, 2/(n\varepsilon))$   (Laplace mechanism to get private joint distribution)
5: for $a \in \mathcal{A}$ do
6:   $\tilde w_a \leftarrow \max\big(\sum_{j=1}^{k} \check p(a, v_j),\ 0\big)$   (private group marginal distribution)
7:   Define $\check F_a(v_j) = \frac{1}{\tilde w_a} \sum_{\ell=1}^{j} \check p(a, v_\ell)$   (scaled partial sums)
8:   Define $\tilde F_a(v_j) = \operatorname{proj}_{[0,1]}\big(\tfrac{1}{2}(\check F_a(v_{l_j}) + \check F_a(v_{r_j}))\big)$ if $j < k$, and $\tilde F_a(v_k) = 1$, where $(l_j, r_j) = \operatorname{arg\,max}_{l \le j \le r} (\check F_a(v_l) - \check F_a(v_r))$   ($L^\infty$ isotonic regression and clipping to get private CDF)
9:   Define $\tilde p_a(v_j) = \tilde F_a(v_j) - \tilde F_a(v_{j-1})$   (private group conditional PMF)
10: $(\{\pi_a\}_{a \in \mathcal{A}}, \tilde q, \{\tilde q_a\}_{a \in \mathcal{A}}) \leftarrow \mathrm{LP}(\{\tilde p_a\}_{a \in \mathcal{A}}, \{\tilde w_a\}_{a \in \mathcal{A}}, \alpha)$   (compute barycenter and get optimal transports)
11: for $a \in \mathcal{A}$ do
12:   Define $g_a(v_j) = v_\ell$ w.p. $\pi_a(v_j, v_\ell)/\tilde p_a(v_j)$, $\ell \in [k]$, if $\tilde p_a(v_j) > 0$, and $g_a(v_j) = v_j$ otherwise   (post-processing mappings)
13: return $(x, a) \mapsto g_a \circ h \circ f(x, a)$   (privately post-processed fair regressor)

We analyze the accuracy of the private estimates:

Theorem 3.2. Let $p_a(v_j) = \mathbb{P}(h \circ f(X, A) = v_j \mid A = a)$ and $w_a = \mathbb{P}(A = a)$ for all $a \in \mathcal{A}$, $j \in [k]$, and let $\tilde p_a, \tilde w_a$ denote their privately estimated counterparts in Algorithm 1. Then for all $n \ge \Omega(\max_a (1/w_a) \ln(1/\beta))$, with probability at least $1 - \beta$, for all $a \in \mathcal{A}$,
$$|\tilde w_a - w_a| \le O\bigg(\frac{\sqrt{k}}{n\varepsilon} \ln \frac{k|\mathcal{A}|}{\beta}\bigg), \qquad D_{\mathrm{TV}}(p_a, \tilde p_a) \le O\bigg(\sqrt{\frac{k}{n w_a} \ln \frac{k|\mathcal{A}|}{\beta}} + \frac{k}{n w_a \varepsilon} \ln \frac{k|\mathcal{A}|}{\beta}\bigg),$$
$$D_{\mathrm{KS}}(p_a, \tilde p_a) \le O\bigg(\sqrt{\frac{1}{n w_a} \ln \frac{|\mathcal{A}|}{\beta}} + \frac{\sqrt{k}}{n w_a \varepsilon} \ln \frac{k|\mathcal{A}|}{\beta}\bigg).$$

The private estimation of PMFs with the Laplace mechanism has been analyzed in prior work (Diakonikolas et al., 2015; Vadhan, 2017), but our algorithm performs an extra re-normalization (lines 7-9) after adding Laplace noise to ensure that the PMF returned is valid. This makes the analysis of Theorem 3.2 more involved, because the noise added to each bin can interact during re-normalization. The rate of $\widetilde O(\sqrt{k/n} + k/(n\varepsilon))$ for TV distance is consistent with existing results (Diakonikolas et al., 2015). For KS distance, the rate is improved by a $\sqrt{k}$ factor, which is not surprising, as KS is a weaker metric than TV. We note that our analysis attaining this improvement is made easier by the $L^\infty$ isotonic regression CDF re-normalization scheme, because KS distance is also defined via the CDF.
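The re-normalization on lines 6-9 admits a short implementation using the standard closed form for $L^\infty$ isotonic regression (the pointwise average of the running maximum and running minimum, which is what line 8's $\operatorname{arg\,max}_{l \le j \le r}$ selects). The sketch below is ours, not the authors' code; the small constant guarding the division is an implementation detail not in Algorithm 1.

```python
# Step 1b re-normalization (lines 6-9 of Algorithm 1) for one group's row of
# the noisy joint PMF: recover the group weight, rescale partial sums, make
# them monotone, clip to [0, 1], pin the last value to 1, and difference.
import numpy as np

def renormalize_group(p_check_row: np.ndarray):
    """p_check_row: noisy values (p_check(a, v_1), ..., p_check(a, v_k))."""
    w_tilde = max(p_check_row.sum(), 0.0)                      # line 6: private group marginal
    F_check = np.cumsum(p_check_row) / max(w_tilde, 1e-12)     # line 7: scaled partial sums
    # line 8: L-infinity isotonic regression, G_j = (max_{l<=j} F_l + min_{r>=j} F_r) / 2
    G = 0.5 * (np.maximum.accumulate(F_check)
               + np.minimum.accumulate(F_check[::-1])[::-1])
    F_tilde = np.clip(G, 0.0, 1.0)                             # clipping to a valid CDF
    F_tilde[-1] = 1.0                                          # last value fixed to 1
    p_tilde = np.diff(F_tilde, prepend=0.0)                    # line 9: private conditional PMF
    return w_tilde, p_tilde
```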
Steps 2 and 3 (Compute Barycenter and Optimal Transports). With the output distributions $\tilde p_a$ estimated privately, we now compute their barycenter and the optimal transports to obtain the fair post-processing mappings $g_a$, i.e., we solve $P(\{\tilde p_a\}_{a \in \mathcal{A}}, \{\tilde w_a\}_{a \in \mathcal{A}}, \alpha)$. Since the $\tilde p_a$'s are distributions supported on $v$, by restricting the support of the barycenter to $v$,³ the barycenter and the optimal transports can be computed by solving a linear program with $O(k^2|\mathcal{A}|)$ variables and constraints (cf. $P$ and Equation (2)):
$$\mathrm{LP}(\{p_a\}_{a \in \mathcal{A}}, \{w_a\}_{a \in \mathcal{A}}, \alpha):\quad \operatorname*{arg\,min}_{\{\pi_a\}_{a \in \mathcal{A}} \ge 0,\ q \ge 0,\ \{q_a\}_{a \in \mathcal{A}} \ge 0}\ \sum_{a \in \mathcal{A}} w_a \sum_{j, \ell \in [k]} (v_j - v_\ell)^2\, \pi_a(v_j, v_\ell)$$
subject to
$$\sum_{\ell \in [k]} \pi_a(v_j, v_\ell) = p_a(v_j) \quad \forall a \in \mathcal{A},\ j \in [k]; \qquad \sum_{j \in [k]} \pi_a(v_j, v_\ell) = q_a(v_\ell) \quad \forall a \in \mathcal{A},\ \ell \in [k];$$
$$\bigg|\sum_{j \le \ell} \big(q_a(v_j) - q(v_j)\big)\bigg| \le \frac{\alpha}{2} \quad \forall a \in \mathcal{A},\ \ell \in [k].$$

³ This approximation reduces the size of the barycenter problem to polynomial, at $O((t - s)/k)$ error, the same as that from discretizing the outputs.

The constraints enforce that the $\pi_a$'s are couplings, and that the target output distributions $\tilde q_a$ are valid PMFs satisfying the fairness constraint $\max_{a, a'} D_{\mathrm{KS}}(\tilde q_a, \tilde q_{a'}) \le \alpha$.

Post-Processed Fair Regressor. On lines 11 to 13, the (randomized) optimal transports $g_a$ from $\tilde p_a$ to $\tilde q_a$ are extracted by reading them off from the optimal couplings $\pi_a$ (represented by $k \times k$ matrices), and are used to construct the fair regressor, $\tilde f(x, a) = g_a \circ h \circ f(x, a)$. Given an input $(x, a)$, the fair regressor $\tilde f$ obtained from Algorithm 1 calls $f$ to make a prediction $y = f(x, a)$, discretizes it to get $\bar y = h(y)$, and then uses the optimal transport $g_a$ of the respective group to post-process and return a fair prediction, $\tilde y = g_a(\bar y)$.

3.4. Statistical Analysis

Algorithm 1 is $\varepsilon$-DP by the post-processing theorem of DP (Dwork & Roth, 2014, Proposition 2.1), because its output depends on $S$ only via the private statistic $\check p$ on line 4, which is $\varepsilon$-DP. We analyze the suboptimality (or fair excess risk) of the post-processed fair regressor and provide a fairness guarantee:

Theorem 3.3. Let a regressor $f$ be given along with samples $S = \{(x_i, a_i)\}_{i \in [n]}$ of $\mu_{X,A}$. Denote $L = t - s$, assume $L \le 1$, and $Y, f(X, A) \in [s, t]$ almost surely. Let $\tilde f$ denote the fair regressor returned by Algorithm 1 on $(f, S, [s, t], k, \alpha, \varepsilon)$. Then for all $n \ge \Omega(\max_a (1/w_a) \ln(|\mathcal{A}|/\beta))$, with probability at least $1 - \beta$,
$$\mathcal{R}(\tilde f) - \mathcal{R}(\tilde f^*) \le O\bigg(\sqrt{\frac{k|\mathcal{A}|}{n} \ln \frac{k|\mathcal{A}|}{\beta}} + \frac{k|\mathcal{A}|}{n\varepsilon} \ln \frac{k|\mathcal{A}|}{\beta} + \frac{1}{k}\bigg) + 5\, \mathbb{E}[|f(X, A) - f^*(X, A)|],$$
where $\tilde f^* = \operatorname{arg\,min}_{f' :\, \Delta_{\mathrm{SP}}(f') \le \alpha} \mathcal{R}(f')$ is the optimal fair regressor and $f^*$ is the (unconstrained) Bayes regressor, and
$$\Delta_{\mathrm{SP}}(\tilde f) \le \alpha + \max_{a \in \mathcal{A}} O\bigg(\frac{k}{n w_a \varepsilon} \ln \frac{k|\mathcal{A}|}{\beta} + \sqrt{\frac{k}{n w_a} \ln \frac{k|\mathcal{A}|}{\beta}}\bigg).$$

The risk bound reflects four potential sources of error: the first term is the finite-sample estimation error, the second term is due to the noise added for $\varepsilon$-DP, the third term is the error introduced by discretization, and the last term is the $L^1$ excess risk of the regressor $f$ being post-processed (carrying over the error of the pre-trained regressor). The first three terms of the risk bound (and the last two in the fairness bound) are attributed to the accuracy of the private distribution estimate using HDE (cf. Theorem 3.2). In particular, a trade-off between the statistical bias and variance is incurred by the choice of $k$, the number of bins:
$$\underbrace{\widetilde O\big(\sqrt{k/n} + k/(n\varepsilon)\big)}_{\text{variance}} + \underbrace{O(1/k)}_{\text{bias}}.$$
Using too few bins leads to a large discretization error (statistical bias), whereas using too many suffers from large variance due to data sampling and the noise added for privacy. The cost of privacy is dominated by the estimation error as long as $n \gtrsim \max_a k/(w_a \varepsilon^2) \vee k|\mathcal{A}|/\varepsilon^2$, in which case the choice of $k = \widetilde\Theta(n^{1/3})$ is optimal for MSE; this is consistent with classical non-parametric estimation results for HDE (Rudemo, 1982).
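Returning to steps 2 and 3 (line 10 and lines 11-13 of Algorithm 1), here is a minimal sketch of solving $\mathrm{LP}(\{\tilde p_a\}, \{\tilde w_a\}, \alpha)$ and reading off the randomized transports with an off-the-shelf solver. The use of cvxpy, the explicit normalization of $q$, and the helper names are our choices for illustration, not the authors' implementation.

```python
# Steps 2-3: solve the barycenter LP and extract the randomized post-processing
# mappings g_a by row-normalizing the optimal couplings pi_a.
import numpy as np
import cvxpy as cp

def solve_barycenter_lp(v, p, w, alpha):
    """v: (k,) bin midpoints; p[a]: (k,) private PMFs; w[a]: private weights."""
    k, groups = len(v), list(p.keys())
    cost = (v[:, None] - v[None, :]) ** 2                     # (v_j - v_l)^2
    pi = {a: cp.Variable((k, k), nonneg=True) for a in groups}
    q = cp.Variable(k, nonneg=True)                           # KS-ball center (barycenter)
    qa = {a: cp.Variable(k, nonneg=True) for a in groups}
    cons = [cp.sum(q) == 1]                                   # normalization (added for this sketch)
    for a in groups:
        cons += [cp.sum(pi[a], axis=1) == p[a],               # row marginals = p_a
                 cp.sum(pi[a], axis=0) == qa[a],              # column marginals = q_a
                 cp.abs(cp.cumsum(qa[a] - q)) <= alpha / 2]   # q_a in the KS ball around q
    obj = cp.Minimize(sum(w[a] * cp.sum(cp.multiply(cost, pi[a])) for a in groups))
    cp.Problem(obj, cons).solve()
    # line 12: g_a(v_j) = v_l with probability pi_a(v_j, v_l) / p_a(v_j)
    g = {}
    for a in groups:
        P = np.maximum(pi[a].value, 0.0)
        rows = P.sum(axis=1, keepdims=True)
        g[a] = np.where(rows > 0, P / np.maximum(rows, 1e-12), np.eye(k))  # identity if p_a(v_j) = 0
    return g, q.value
```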
The fairness bound, on the other hand, only contains variance terms but not the statistical bias; therefore, using fewer bins is always more favorable for fairness at the expense of MSE (e.g., in the extreme case of $k = 1$, the post-processor outputs a constant value, which is exactly fair). This suggests that when $n$ is small, $k$ can be decreased to reduce variance for higher levels of fairness.

The optimal choice of $k$ (in combination with $\alpha$) is data-dependent, and is typically tuned on a validation split; however, extra care should be taken, as this practice could, in principle, violate differential privacy. In our experiments, we sweep $\alpha$ and $k$ to empirically explore the trade-offs between error, privacy, and fairness attainable with our Algorithm 1, which is common practice in differential privacy research (Mohapatra et al., 2022). Regarding the selection of hyperparameters while preserving privacy, we refer readers to Liu & Talwar (2019) and Papernot & Steinke (2022).

4. Experiments

In this section, we evaluate the private and fair post-processing algorithm described in Algorithm 1. We do not compare to other algorithms because we are not aware of any existing private algorithms for learning fair regressors. Although it may be possible to adapt some existing algorithms to this setting, e.g., using DP-SGD as in Tran et al. (2021), making them practical and competitive requires care, and is hence left to future work. Our post-processing algorithm is based on Chzhen et al. (2020); we refer to their paper for empirical comparisons to in-processing algorithms in the non-private setting ($\varepsilon = \infty$).

Setup. Since the excess risk can be decomposed according to the pre-training and post-processing stages (Theorem 3.3, where $\mathbb{E}[|f - f^*|]$ is carried over from the suboptimality of the regressor $f$ trained in the pre-training stage), we simplify our experiment setup to isolate and focus on the performance and trade-offs induced by post-processing alone. This means we access the ground-truth responses and directly apply Algorithm 1 (i.e., $X = Y$ and $f = \mathrm{Id}$, whereby $\mathbb{E}[|f - f^*|] = 0$). Code is available at https://github.com/rxian/fair-regression.

Datasets. The datasets are randomly split 70-30 for training (i.e., post-processing) and testing.

Communities & Crime (Redmond & Baveja, 2002). It contains socioeconomic and crime data of communities in the US, and the task is to predict the rate of violent crimes per 100k population ($Y \in [0, 1]$). The sensitive attribute is an indicator for whether the community has a significant presence of a minority population ($|\mathcal{A}| = 2$). The total size of the dataset is 1,994, and the number of training examples for the smallest group is 679 ($\approx n w_a$).

[Figure 2. Error-privacy-fairness trade-offs achieved by Algorithm 1 by sweeping $\alpha$ under the indicated number of bins $k$, with different privacy budgets $\varepsilon \in \{10, 5, 1, 0.5, 0.1\}$; the panels plot mean squared error against fairness violation for Communities & Crime ($k = 12$), Law School ($k = 36$), Communities & Crime ($k = 60$), and Law School ($k = 180$). Fairness violation is measured in KS distance as defined in Definition 2.2 ($\Delta_{\mathrm{SP}}$). Average of 50 random seeds (also for Figures 3 and 4).]

Law School (Wightman, 1998). This dataset contains the academic performance of law school applicants, and the task is to predict the student's undergraduate GPA ($Y \in [1, 4]$).
The sensitive attribute is race ($|\mathcal{A}| = 4$), and the dataset has size 21,983. The smallest group has 628 training examples.

4.1. Results

In Figure 2, we show the MSE and fairness trade-offs (in KS distance; $\Delta_{\mathrm{SP}}$, Definition 2.2) attained with Algorithm 1 under various settings of $\alpha$. The main observations are:

1. The cost of discretization (indicated by the horizontal distance from 0 to the top-left starting points of the curves) can be expected to be insignificant compared to the fairness improvement, unless the model is already very fair without post-processing.

2. Although the amount of data available for post-processing is small by modern standards, the results are very insensitive to the privacy budget until the highest levels of DP are demanded or a very large $k$ is used. This is because, according to Theorem 3.3, the error attributed to DP noise is dominated by the estimation error when $n \gtrsim \max_a k/(w_a \varepsilon^2)$. Note that in our experiments, we have $n \gtrsim k|\mathcal{A}|/\varepsilon^2$ on both datasets, except for $\varepsilon = 0.1$, and for $\varepsilon = 0.5$ when $k = 180$. Using more bins increases the cost of privacy due to variance, as we will show in Figure 3.

3. The right end of each line is obtained by setting $\alpha = 0$. Relaxing it to larger values could give better trade-offs, especially when $n, \varepsilon$ are small, because in these cases the estimated distributions can be inaccurate due to estimation error and the noise added, so aiming for exact SP may fit to data artifacts or to the noise rather than the true signal, causing MSE to increase without actual improvements to fairness.

Trade-off Between Statistical Bias and Variance. Recall from the analysis and discussion in Section 3.4 that the MSE of Algorithm 1 exhibits a trade-off between the statistical bias and variance from the choice of $k$: using more bins lowers the discretization error but suffers from larger estimation error and more noise added for DP, and vice versa. On the other hand, the fairness only depends on the variance. Hence, one can decrease $k$ to achieve higher levels of fairness at the expense of MSE. In Figure 3, we plot the trade-offs on the Law School dataset for $\varepsilon = 0.1$ under different settings of $k$. On the right extreme, exact SP ($\Delta_{\mathrm{SP}} = 0$) is achieved by setting $k = 1$, but with a high MSE of 0.6772.

[Figure 3. Error-fairness trade-offs (mean squared error vs. fairness violation) achieved by Algorithm 1 on the Law School dataset by sweeping $\alpha$ and $k \in \{7, 8, 11, 24, 36, 48, 60\}$ for $\varepsilon = 0.1$. The black line is the lower envelope, and ends on the right at (0.6772, 0) (outside the cropped figure).]

[Figure 4. Error-privacy-fairness trade-offs (mean squared error vs. fairness violation, for $\varepsilon \in \{10, 5, 1, 0.5, 0.1\}$) achieved by Algorithm 1 on the Law School dataset by sweeping $\alpha$ and $k$ and taking the lower envelope. The line for $\varepsilon = 0.1$ is the black line in Figure 3. All lines meet and end on the right at (0.6772, 0).]

As expected:

1. Smaller settings of $k$ lead to smaller $\Delta_{\mathrm{SP}}$, but this comes with higher discretization error, as reflected in the rightward-shifting starting points (i.e., results with $\alpha = \infty$, i.e., not post-processed for fairness).

2. Using more bins may result in worse trade-offs compared to fewer bins due to large variance. Note that the trade-offs with $k = 60$ bins are almost completely dominated by those with $k = 36$ bins.

3. The general trend is that, to achieve the best trade-offs (i.e., the lower envelope), one should use smaller $(\alpha, k)$ when aiming for higher levels of fairness, and larger $(\alpha, k)$ for smaller MSE (but less fairness).
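The lower envelopes in Figures 3 and 4 (discussed next) can be computed by sweeping $(\alpha, k)$ and keeping only the non-dominated (fairness, error) pairs. A minimal sketch follows; `run_algorithm` is a hypothetical wrapper around Algorithm 1 plus evaluation on the test split, not a function from the paper's code.

```python
# Sweep (alpha, k) at a fixed privacy budget and keep the Pareto front
# (lower envelope) over (SP violation, MSE) pairs, both to be minimized.
import numpy as np

def pareto_front(points):
    """points: iterable of (sp_violation, mse); return the non-dominated subset."""
    front, best_mse = [], np.inf
    for sp, mse in sorted(points):          # sort by SP, then MSE
        if mse < best_mse:                  # strictly better error at this fairness level
            front.append((sp, mse))
            best_mse = mse
    return front

def sweep(run_algorithm, alphas, ks, eps):
    """run_algorithm(alpha, k, eps) -> (mse, sp) is an assumed evaluation wrapper."""
    results = [run_algorithm(a, k, eps) for a in alphas for k in ks]
    return pareto_front([(sp, mse) for mse, sp in results])
```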
Error-Privacy-Fairness Trade-Off. The black line in Figure 3 shows the optimal error-fairness trade-offs attainable with Algorithm 1 for $\varepsilon = 0.1$ from sweeping $\alpha$ and $k$ and taking the lower envelope (segments of the line not reached by any single $k$ can be achieved by combining two regressors via randomization, although obtaining them via post-processing requires double the privacy budget). We repeat this experiment for all settings of $\varepsilon$ on the Law School dataset, and plot their lower envelopes in Figure 4. This illustrates the Pareto front of the error-privacy-fairness trade-offs achieved by Algorithm 1. Demanding stricter privacy degrades the trade-offs between error and fairness.

Lastly, we remark that while Figures 2 to 4 illustrate the trade-offs that can possibly be attained with our post-processing algorithm by varying the hyperparameters $\alpha$ and $k$, selecting the desired trade-off requires tuning them (privately) on a validation set. Readers are referred to Liu & Talwar (2019) and Papernot & Steinke (2022) on the topic of differentially private hyperparameter tuning.

5. Conclusion and Future Work

In this paper, we described and analyzed a private post-processing algorithm for learning attribute-aware fair regressors, in which privacy is achieved by performing histogram density estimation of the distributions privately, and fairness by computing the optimal transports to their Wasserstein barycenter. We evaluated the error-privacy-fairness trade-offs attained by our algorithm on two datasets.

Although we only studied the attribute-aware setting, that is, where we have explicit access to $A$ during training and when making predictions, our post-processing algorithm could be extended to the attribute-blind setting, where $A$ is only available in the training data. This requires training an extra predictor for the sensitive attribute, $\hat P_A = \hat P(A \mid X) \in \Delta^{|\mathcal{A}| - 1}$, to estimate the joint distribution of $(\hat Y, \hat P_A)$ (vs. the joint of $(\hat Y, A)$ estimated in Algorithm 1 for the attribute-aware setting), and modifying $P$ (and LP) to use the predicted group membership $\hat P_A$ rather than the true $A$. This has a higher sample complexity, and the fairness of the post-processed regressor will additionally depend on the accuracy of $\hat P_A$. We leave the implementation and analysis of this extension to future work.

Acknowledgements

GK is supported by a Canada CIFAR AI Grant, an NSERC Discovery Grant, and an unrestricted gift from Google. HZ is partially supported by a research grant from the Amazon-Illinois Center on AI for Interactive Conversational Experiences (AICE) and a Google Research Scholar Award.

Broader Impacts

This paper continues the study of privacy and fairness in machine learning, and fills a gap in prior work on private and fair regression. Since the setting is well-established, and we have theoretically analyzed the risk of our algorithm, we do not find outstanding societal consequences for discussion.

References

Abadi, M., Chu, A., Goodfellow, I., McMahan, H. B., Mironov, I., Talwar, K., and Zhang, L. Deep Learning with Differential Privacy. In Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security, 2016.

Aden-Ali, I., Ashtiani, H., and Kamath, G. On the Sample Complexity of Privately Learning Unbounded High-Dimensional Gaussians. In Proceedings of the 32nd International Conference on Algorithmic Learning Theory, 2021.

Agarwal, A., Beygelzimer, A., Dudík, M., Langford, J., and Wallach, H. A Reductions Approach to Fair Classification.
In Proceedings of the 35th International Conference on Machine Learning, 2018. Agarwal, A., Dudík, M., and Wu, Z. S. Fair Regression: Quantitative Definitions and Reduction-Based Algorithms. In Proceedings of the 36th International Conference on Machine Learning, 2019. Agarwal, S. Trade-Offs between Fairness and Privacy in Machine Learning. In IJCAI 2020 Workshop on AI for Social Good, 2020. Alabi, D., Mc Millan, A., Sarathy, J., Smith, A., and Vadhan, S. Differentially Private Simple Linear Regression. Proceedings on Privacy Enhancing Technologies, 2022. Altschuler, J. M. and Boix-Adserà, E. Wasserstein Barycenters can be Computed in Polynomial Time in Fixed Dimension. Journal of Machine Learning Research, 22(44), 2021. Anderes, E., Borgwardt, S., and Miller, J. Discrete Wasserstein Barycenters: Optimal Transport for Discrete Data. Mathematical Methods of Operations Research, 84(2), 2016. Bagdasaryan, E., Poursaeed, O., and Shmatikov, V. Differential Privacy Has Disparate Impact on Model Accuracy. In Advances in Neural Information Processing Systems, volume 32, 2019. Barocas, S. and Selbst, A. D. Big Data s Disparate Impact. California Law Review, 104(3), 2016. Barocas, S., Hardt, M., and Narayanan, A. Fairness and Machine Learning: Limitations and Opportunities. The MIT Press, 2023. Bassily, R., Smith, A., and Thakurta, A. Private Empirical Risk Minimization: Efficient Algorithms and Tight Error Bounds. In IEEE 55th Annual Symposium on Foundations of Computer Science, 2014. Berk, R., Heidari, H., Jabbari, S., Kearns, M., and Roth, A. Fairness in Criminal Justice Risk Assessments: The State of the Art. Sociological Methods & Research, 50 (1), 2021. Bolukbasi, T., Chang, K.-W., Zou, J., Saligrama, V., and Kalai, A. Man is to Computer Programmer as Woman is to Homemaker? Debiasing Word Embeddings. In Advances in Neural Information Processing Systems, volume 29, 2016. Bun, M., Kamath, G., Steinke, T., and Wu, S. Z. Private Hypothesis Selection. In Advances in Neural Information Processing Systems, volume 32, 2019. Buolamwini, J. and Gebru, T. Gender Shades: Intersectional Accuracy Disparities in Commercial Gender Classification. In Proceedings of the 2018 Conference on Fairness, Accountability, and Transparency, 2018. Calders, T., Kamiran, F., and Pechenizkiy, M. Building Classifiers with Independency Constraints. In 2009 IEEE International Conference on Data Mining Workshops, 2009. Calmon, F. P., Wei, D., Vinzamuri, B., Ramamurthy, K. N., and Varshney, K. R. Optimized Pre-Processing for Discrimination Prevention. In Advances in Neural Information Processing Systems, volume 30, 2017. Chaudhuri, K., Monteleoni, C., and Sarwate, A. D. Differentially Private Empirical Risk Minimization. Journal of Machine Learning Research, 12(29), 2011. Chi, J., Tian, Y., Gordon, G. J., and Zhao, H. Understanding and Mitigating Accuracy Disparity in Regression. In Proceedings of the 38th International Conference on Machine Learning, 2021. Chizat, L., Roussillon, P., Léger, F., Vialard, F.-X., and Peyré, G. Faster Wasserstein Distance Estimation with the Sinkhorn Divergence. In Advances in Neural Information Processing Systems, volume 33, 2020. Chouldechova, A. Fair Prediction with Disparate Impact: A Study of Bias in Recidivism Prediction Instruments. Big Data, 5(2), 2017. Differentially Private Post-Processing for Fair Regression Chzhen, E., Denis, C., Hebiri, M., Oneto, L., and Pontil, M. Fair Regression with Wasserstein Barycenters. In Advances in Neural Information Processing Systems, volume 33, 2020. 
Covington, C., He, X., Honaker, J., and Kamath, G. Unbiased Statistical Estimation and Valid Confidence Intervals Under Differential Privacy, 2021. arxiv:2110.14465 [stat.ME]. Cummings, R., Gupta, V., Kimpara, D., and Morgenstern, J. On the Compatibility of Privacy and Fairness. In Adjunct Publication of the 27th Conference on User Modeling, Adaptation and Personalization, 2019. Cuturi, M. and Doucet, A. Fast Computation of Wasserstein Barycenters. In Proceedings of the 31st International Conference on Machine Learning, 2014. Diakonikolas, I., Hardt, M., and Schmidt, L. Differentially Private Learning of Structured Discrete Distributions. In Advances in Neural Information Processing Systems, volume 28, 2015. Dwork, C. and Roth, A. The Algorithmic Foundations of Differential Privacy. Foundations and Trends in Theoretical Computer Science, 9(3 4), 2014. Dwork, C., Mc Sherry, F., Nissim, K., and Smith, A. Calibrating Noise to Sensitivity in Private Data Analysis. In Theory of Cryptography, 2006. Dwork, C., Hardt, M., Pitassi, T., Reingold, O., and Zemel, R. Fairness Through Awareness. In Proceedings of the 3rd Innovations in Theoretical Computer Science Conference, 2012. Fletcher, S. and Islam, M. Z. Decision Tree Classification with Differential Privacy: A Survey. ACM Computing Surveys, 52(4), 2019. Hardt, M. and Talwar, K. On the Geometry of Differential Privacy. In Proceedings of the Forty-Second ACM Symposium on Theory of Computing, 2010. Hardt, M., Price, E., and Srebro, N. Equality of Opportunity in Supervised Learning. In Advances in Neural Information Processing Systems, volume 29, 2016. Hébert-Johnson, Ú., Kim, M. P., Reingold, O., and Rothblum, G. N. Multicalibration: Calibration for the (Computationally-Identifiable) Masses. In Proceedings of the 35th International Conference on Machine Learning, 2018. Jagielski, M., Kearns, M., Mao, J., Oprea, A., Roth, A., Sharifi-Malvajerdi, S., and Ullman, J. Differentially Private Fair Learning. In Proceedings of the 36th International Conference on Machine Learning, 2019. Kamishima, T., Akaho, S., Asoh, H., and Sakuma, J. Fairness-Aware Classifier with Prejudice Remover Regularizer. In Machine Learning and Knowledge Discovery in Databases, 2012. Kifer, D., Smith, A., and Thakurta, A. Private Convex Empirical Risk Minimization and High-dimensional Regression. In Proceedings of the 25th Annual Conference on Learning Theory, 2012. Le Gouic, T., Loubes, J.-M., and Rigollet, P. Projection to Fairness in Statistical Learning, 2020. arxiv:2005.11720 [cs.LG]. Li, J. and Tkocz, T. Tail Bounds for Sums of Independent Two-Sided Exponential Random Variables. In High Dimensional Probability IX, 2023. Liu, J. and Talwar, K. Private Selection from Private Candidates. In Proceedings of the 51st Annual ACM SIGACT Symposium on Theory of Computing, 2019. Mironov, I. Rényi Differential Privacy. In IEEE 30th Computer Security Foundations Symposium, 2017. Mohapatra, S., Sasy, S., He, X., Kamath, G., and Thakkar, O. The Role of Adaptive Optimizers for Honest Private Hyperparameter Selection. Proceedings of the AAAI Conference on Artificial Intelligence, 36(7), 2022. Mozannar, H., Ohannessian, M. I., and Srebro, N. Fair Learning with Private Demographic Data. In Proceedings of the 37th International Conference on Machine Learning, 2020. Papernot, N. and Steinke, T. Hyperparameter Tuning with Renyi Differential Privacy. In The Tenth International Conference on Learning Representations, 2022. Papernot, N., Abadi, M., Erlingsson, Ú., Goodfellow, I., and Talwar, K. 
Semi-supervised Knowledge Transfer for Deep Learning from Private Training Data. In The Fifth International Conference on Learning Representations, 2017. Pleiss, G., Raghavan, M., Wu, F., Kleinberg, J., and Weinberger, K. Q. On Fairness and Calibration. In Advances in Neural Information Processing Systems, volume 30, 2017. Redmond, M. and Baveja, A. A data-driven software tool for enabling cooperative information sharing among police departments. European Journal of Operational Research, 141(3), 2002. Rudemo, M. Empirical Choice of Histograms and Kernel Density Estimators. Scandinavian Journal of Statistics, 9 (2), 1982. Differentially Private Post-Processing for Fair Regression Song, S., Chaudhuri, K., and Sarwate, A. D. Stochastic gradient descent with differentially private updates. In IEEE Global Conference on Signal and Information Processing, 2013. Staib, M., Claici, S., Solomon, J., and Jegelka, S. Parallel Streaming Wasserstein Barycenters. In Advances in Neural Information Processing Systems, volume 30, 2017. Stout, Q. F. L Isotonic Regression for Linear, Multidimensional, and Tree Orders, 2017. arxiv:1507.02226 [cs.DS]. Tran, C., Fioretto, F., and Hentenryck, P. V. Differentially Private and Fair Deep Learning: A Lagrangian Dual Approach. Proceedings of the AAAI Conference on Artificial Intelligence, 35(11), 2021. Vadhan, S. The Complexity of Differential Privacy. In Tutorials on the Foundations of Cryptography: Dedicated to Oded Goldreich. Springer International Publishing, 2017. Wang, Y.-X. Revisiting differentially private linear regression: Optimal and adaptive prediction & estimation in unbounded domain. In Proceedings of the 34th Uncertainty in Artificial Intelligence Conference, 2018. Weissman, T., Ordentlich, E., Seroussi, G., Verdu, S., and Weinberger, M. J. Inequalities for the L1 Deviation of the Empirical Distribution. Technical Report HPL-200397R1, Hewlett-Packard Laboratories, 2003. Wightman, L. F. LSAC National Longitudinal Bar Passage Study. Law School Admission Council, 1998. Xian, R., Yin, L., and Zhao, H. Fair and Optimal Classification via Post-Processing. In Proceedings of the 40th International Conference on Machine Learning, 2023. Xu, D., Yuan, S., and Wu, X. Achieving Differential Privacy and Fairness in Logistic Regression. In Companion Proceedings of the 2019 World Wide Web Conference, 2019. Xu, J., Zhang, Z., Xiao, X., Yang, Y., and Yu, G. Differentially Private Histogram Publication. In IEEE 28th International Conference on Data Engineering, 2012. Zemel, R., Wu, Y. L., Swersky, K., Pitassi, T., and Dwork, C. Learning Fair Representations. In Proceedings of the 30th International Conference on Machine Learning, 2013. Zhao, H. and Gordon, G. J. Inherent Tradeoffs in Learning Fair Representations. Journal of Machine Learning Research, 23(57), 2022. Differentially Private Post-Processing for Fair Regression A. Excess Risk of the Optimal Fair Regressor This section proves Theorem 3.1, that the excess risk of the optimal attribute-aware fair regressor can be expressed as the sum of Wasserstein distances from the output distributions r a of the Bayes regressor f conditioned on each group a A to their barycenter q. We will cover the approximate fairness setting using the same analysis in (Xian et al., 2023). The result is a direct consequence of Lemma A.4 but before stating which, because the fair regressor is randomized, to make the discussions involving randomized functions rigorous, we provide a formal definition for them with the Markov kernel. 
Definition A.1 (Markov Kernel). A Markov kernel from a measurable space (X, S) to (Y, T ) is a mapping K : X T [0, 1], such that K( , T) is S-measurable, T T , and K(x, ) is a probability measure on (Y, T ), x X. Definition A.2 (Randomized Function). A randomized function f : (X, S) (Y, T ) is associated with a Markov kernel K : X T [0, 1], such that x X, T T , P(f(x) T) = K(x, T). Definition A.3 (Push-Forward Distribution). Let p be a measure on (X, S) and f : (X, S) (Y, T ) a randomized function with Markov kernel K. The push-forward of p under f, denoted by f p, is a measure on Y given by f p(T) = R X K(x, T) dp(x), T T . Now, we state the lemma of which Theorem 3.1 is a direct consequence; it says that given any (randomized) regressor f with a particular shape q, one can derive a regressor g f from the Bayes regressor f that has the same shape and excess risk (g is a randomized function with Markov kernel K(y, T) = π(y,T )/π(y,R) where π is given in Equation (3)): Lemma A.4. Let a regression problem be given by a joint distribution µ of (X, Y ), denote the Bayes regressor by f : X R and r = f µX, and let q be an arbitrary distribution on R. Then, for any randomized regressor f with Markov kernel K satisfying f µX = q, π(y , y) = Z f 1(y ) K(x, y) dµX(x) (3) (where f 1(y ) := {x X : f (x) = y }) is a coupling π Π(r , q) that satisfies ER(f) = Z (y y)2 dπ(y , y). (4) Conversely, for any π Π(r , q), the randomized regressor f with Markov kernel K(x, T) = π(f (x), T) π(f (x), R) satisfies f µX = q and Equation (4). Proof. For the first direction, note that ER(f) = E[(f (X) f(X))2] R R (y y)2 P(f (X) = y , f(X) = y) d(y , y) R R (y y)2 Z X P(f (X) = y , f(X) = y, X = x) dx d(y , y) R R (y y)2 Z f 1(y ) P(f(X) = y, X = x) dx R R (y y)2 Z f 1(y ) P(f(X) = y | X = x) dµX(x) R R (y y)2π(y , y) d(y , y) (5) Differentially Private Post-Processing for Fair Regression as desired, where line 4 is because P(f (X) = y , f(X) = y, X = x) = 1[f (x) = y ] P(f(X) = y, X = x) as f is deterministic. We also verify that the constructed π is a coupling: Z R π(y , y) dy = Z f 1(y ) K(x, y) dµX(x) dy R K(x, y) dy dµX(x) f 1(y ) dµX(x) = P(f (X) = y ) by Definitions A.1 and A.3, and Z R π(y , y) dy = Z f 1(y ) K(x, y) dµX(x) dy X K(x, y) dµX(x) X P(f(X) = y | X = x) dµX(x) = P(f(X) = y) by Definition A.2 and the assumption that f µX = q. For the converse direction, it suffices to show that the Markov kernel constructed for f satisfies the equality in Equation (3), which would immediately imply f µX = q, and Equation (4) with the same arguments in Equation (5). Let y, y R and z f 1(y ) be arbitrary, then π(y , y) = π(y , y) π(y , R)π(y , R) = π(f (z), y) π(f (z), R)π(y , R) = K(z, y)π(y , R) = K(z, y)r (y ) = K(z, y) Z f 1(y ) dµX(x) f 1(y ) K(z, y) dµX(x) f 1(y ) K(x, y) dµX(x), where line 3 is by construction of K, line 4 by the assumption that π Π(r , q), and the last line is because K(x, y) = K(z, y) for all x f 1(y ), also by construction. This lemma allows us to formulate the problem of finding the optimal regressor under a shape constraint q as that of finding the optimal coupling π Π(r , q) with the squared cost (and π can be used to derive the regressor g f that achieves the minimum of the original problem). Because statistical parity is a shape constraint on the regressors, we can leverage this lemma to prove Theorem 3.1: Proof of Theorem 3.1. 
Because we are finding an attribute-aware fair regressor, f : X A R, we can optimize the components corresponding to each group independently, i.e., fa := f( , a), a A. Denote the excess risk conditioned on Differentially Private Post-Processing for Fair Regression group a by ERa(fa) = E[(fa(X) Y )2 | A = a], and the marginal distribution of X conditioned on A = a by µX|a, then min f: SP(f) α ER(f) = min f:DKS(fa µX|a,fa µX|a) α ER(f) = min f:DKS(ra(f),ra (f)) α a A wa ERa(fa) = min {fa}a A:DKS(fa µX|a,fa µX|a ) α a A wa ERa(fa) = min {qa}a A:DKS(qa,qa ) α a A wa min fa:fa µX|a=qa ERa(fa) = min q:supp(q) R {qa}a A BKS(q,α/2) a A wa min fa:fa µX|a=qa ERa(fa) (the proof that DKS(µ, ν) α q s.t. µ, ν BKS(q, α/2) is omitted; a hint for the forward direction is that such a q can be constructed by averaging the CDFs of µ, ν), where, by Lemma A.4 and the definition of Wasserstein distance, min fa:fa µX|a=qa ERa(fa) = min πa Π(r a,qa) Z (y y)2 dπa(y , y) = W 2 2 (r a, qa), and the theorem follows by plugging this back into the previous equation. B. Proofs for Section 3 We analyze the L1 sensitivity for our notion of neighboring datasets described in Section 2, and provide the proofs to Theorems 3.2 and 3.3, in that order. Remark B.1. For nonempty neighboring datasets S, S that differ in at most one entry by insertion, deletion or substitution, the L1 sensitivity to the empirical PMF is at most 2/n. Let ˆp, ˆp denote the empirical PMFs of S and S , respectively, assume w.l.o.g. that they have two coordinate, and the insertion/deletion takes place in the first coordinate. Denote n = |S|, n1 = nˆp1, and n2 = nˆp2. (Insertion). The sensitivity is ˆp ˆp 1 = n1 = n1(n + 1) (n1 + 1)n + n2(n + 1) n2n = n1 n n(n + 1) + n2 n(n + 1) = 2 n2 n(n + 1) (Deletion). Similarly, ˆp ˆp 1 = |n1/n n1 1/n 1| + |n2/n n2/n 1| = 2|n2/n(n 1)| 2/n, because n2 n 1. (Substitution). ˆp ˆp 1 = |n1/n n1 1/n| + |n2/n n2+1/n| = |1/n| + |1/n| = 2/n. For the proofs of the theorems, several technical results are required. First, a concentration bound of i.i.d. sum of Laplace random variables based on the following fact (Li & Tkocz, 2023), which is due to p 2Yj Z Laplace(0, 1) and Pk j=1 aj Zj N(0, Pk j=1 a2 j) for Z1, . . . , Zk N(0, 1) and a1, . . . , ak 0: Proposition B.2. Let X1, . . . , Xk Laplace(0, 1), Y1, . . . , Yk Exponential(1), and Z N(0, 1) all be independent, then Pk j=1 Xj has the same distribution as (2 Pk j=1 Yj)1/2Z. Lemma B.3. Let independent X1, . . . , Xk Laplace(0, 1), then for all t 0, with probability at least 1 β, |Pk j=1 Xj| 2 Differentially Private Post-Processing for Fair Regression Proof. Using Proposition B.2, we bound Pk j=1 Xj by analyzing q Pk j=1 Yj|Z|. For all t 0, P j s.t. Yj t On the other hand, the Chernoff bound implies that P(|Z| t) 2 exp( t2/2). With a union bound, with probability at least 1 β, j=1 Yj|Z| 2 Next, an L1 (TV) convergence result of empirical distributions with finite support, which follows from the concentration of i.i.d. sum of Multinoulli random variables: Theorem B.4 (Weissman et al., 2003). Let p d 1 the (d 1) simplex and ˆpn 1/n Multinomial(n, p), then with probability at least 1 β, p ˆpn 1 p 2d/n ln 2/β. Lemma B.5. Let independent x1, . . . , xn p with finite support X, and denote the empirical distribution by ˆpn = 1/n Pn i=1 βxi, then with probability at least 1 β, p ˆpn 1 p 2|X|/n ln 2/β. For the proof of Theorem 3.2, we need two technical results: Lemma B.6. Let independent x1, . . . 
, xn p supported on [k], and denote the empirical PMF by ˆpj = 1/n P i 1[xi = j], for all j [k]. Let E [k] be a subset of size ℓ, denote w E = P(x E), the PMF conditioned on the event x E by p|E, and its empirical counterpart by ˆpj|E = 1/n E P i 1[xi = j], where n E = P i 1[xj E]. Then for all n 8/w E ln 8/β, with probability at least 1 β, 1 nw E ln 8ℓ DTV(p|E, ˆp|E) = 1 2 p|E ˆp|E 1 ℓ 4nw E ln 8 DKS(p|E, ˆp|E) 1 nw E ln 8ℓ Proof. By Chernoff bound on the Binomial distribution, for all n 8/w E ln 2/β, with probability at least 1 β, 2 n E 2nw E. (6) Order the samples so that the ones with xi E are at the front (or, consider first sample n E, then sample the first n E samples from p|E). Conditioned on Equation (6), by Hoeffding s inequality and a union bound, for all j E, with probability at least 1 β, |pj|E ˆpj|E| = 1 w E P(X = j) 1 i=1 1[xi = j] i=1 w E 1[xi = j] w2 E 2n E ln 2ℓ 1 nw E ln 2ℓ Differentially Private Post-Processing for Fair Regression Next, by Lemma B.5, with probability at least 1 β, DTV(p|E, ˆp|E) = 1 1 w E P(X = j) 1 i=1 1[xi = j] ℓ 2n E ln 2 ℓ 4nw E ln 2 Lastly, DKS computes the L -distance between two CDFs, so similar to the ℓ bound, by Hoeffding s inequality and a union bound, with probability at least 1 β, DKS(p|E, ˆp|E) = max j E 1 w E P(X j) 1 i=1 1[xi j] i=1 w E 1[xi j] w2 E 2n E ln 2ℓ 1 nw E ln 2ℓ The result follows by taking a final union bound over the four events during the analysis and rescaling β β/4. Lemma B.7. Let constants a1, . . . , ak 0, and independent ξ1, . . . , ξk Laplace(0, b). Denote Sj = Pj ℓ=1 aj, s = Sk, and let t 0. Define for all j [k], (add noise) xj := aj + ξj, Fj = (isotonic regression) yj = Gj Gj 1, Gj := 1 2 Flj + Frj , (clipping) zj = Hj Hj 1, Hj := ( proj[0,t] Gj if j < k t else, where (lj, rj) = arg max l j r (Fl Fr). Then with probability at least 1 β, a z 1 3|s t| + 74bk ln 4k a z 2|s t| + 32b S H |s t| + 12b Differentially Private Post-Processing for Fair Regression Proof. Our analysis for proceeds by using the triangle inequality to decompose into and bounding each of the following terms (shown here for 1, analogously for and the partial sums): a z 1 a x 1 + x y 1 + y z 1. (7) We will use the following concentration result of Laplace random variables: by Lemma B.3, with probability at least 1 β, for all 0 ℓ m k, m ℓln 2(m ℓ)k2 First Term in Equation (7). By the CDF of the exponential distribution (of which the Laplace distribution is the two-sided version), with a union bound, with probability at least 1 β, |aj xj| = |ξj| 2b ln k and it follows that j=1 |aj xj| 2bk ln k For the partial sums, by Equation (8), S H = max j |Sj Fj| = max j max j 12b p Second Term in Equation (7). Note that for any ℓ m such that Fℓ Fm (i.e., a violating pair for isotonic regression), Gℓ Gm = Fℓ Fℓ j=ℓ+1 ξℓ 12b because aj 0. So for all j [k], 0 |xj yj| = |Gj Gj 1 (Fj Fj 1)| |Gj Fj| + |Gj 1 Fj 1| ( Gj Fj if Gj > Fj Fj Gj else ( Gj 1 Fj 1 if Gj 1 > Fj 1 Fj 1 Gj 1 else Gj Glj + Grj 2 if Gj > Fj Gj 1 Glj 1 + Grj 1 2 if Gj 1 > Fj 1 Glj 1 + Grj 1 2 Gj 1 else Gj Gj + Grj 2 if Gj > Fj Gj 1 Gj 1 + Grj 1 2 if Gj 1 > Fj 1 Glj 1 + Gj 1 2 Gj 1 else ( Gj Grj if Gj > Fj Glj Gj else ( Gj 1 Grj 1 if Gj 1 > Fj 1 Glj 1 Gj 1 else max(rj j, j lj) ln 2k max(k j, j) ln 2k where line 5 is because Grm Gm Glm for all j, and line 7 by Equation (9). 
It then follows that j=1 |xj yj| 24bk ln 2k Differentially Private Post-Processing for Fair Regression Lastly, using the fact that the L error of L isotonic regression is 1/2 maxℓ m(Gℓ Gm) (Stout, 2017), and by Equation (9) again, 2 max ℓ m(Gℓ Gm) 6b Third Term in Equation (7). Note that because aj [0, s], by Equation (8), min j Gj = min j 1 2(Flj +Frj) min j Fj = min j m=1 (am + ξm) min j m=1 ξm max j and similarly, max j Gj max j m=1 (am + ξm) s + 12b Since G is nondecreasing after isotonic regression, clipping only affects its prefix and/or suffix. For the prefix, let l = max{j [k] : Hj = 0}. If l does not exist, then no clipping to zero has occurred. Otherwise, for all j l, by Equation (10), |Gj Hj| = max( Gj, 0) 12b |yj zj| = |Gj Gj 1 (Hj Hj 1)| Gj Gj 1 2 max min j Gj, 0 24b For the suffix, let r = min{j [k] : Hj = t}, then for all r j < k, by Equation (11), |Gj Hj| = Gj t |s t| + 12b |yj zj| (Gj t) + max(Gj 1 t, 0) 2 max max j Gj t, 0 2|s t| + 24b ( Gk t if Gk t t Gk else = ( Gk t if Gk t t s + Pk m=1 ξm else |s t| + 12b |yj zj| |Gj t| + max(Gj 1 t, 0) 2|s t| + 24b Finally, for 1, j=1 |Gj Gj 1 (Hj Hj 1)| + j=r |Gj Gj 1 (Hj Hj 1)| j=1 (Gj Gj 1) + |yr zr| + j=r+1 (Gj Gj 1) + |yk zk| = Gl G1 + |yr zr| + Gk 1 Gr + |yk zk| G1 + |yr zr| + Gk 1 t + |yk zk| β + 2 |s t| + 12b Differentially Private Post-Processing for Fair Regression 3|s t| + 48b keep in mind that 1 l and r k 1, k on line 3 and onward. The result follows by taking a final union bound over the two events above and rescaling β β/2. Proof of Theorem 3.2. Because wa 0, by Lemma B.3, with probability at least 1 β, j=1 Laplace(0, 2/nε), 0 j=1 Laplace(0, 2/nε), wa j=1 Laplace(0, 2/nε) k nε ln k|A| Next, define ˆpa(vj) = 1 i=1 1[h f(xi, ai), ai = a], ˇpa(vj) = 1 wa p(a, vj), where na = 1/n Pn i=1 1[ai = a]. Note that q Fa(vj) = Pj ℓ=1 ˇpa(vℓ). By triangle inequality, pa pa pa ˇpa + ˇpa ˆpa + ˆpa pa wa ˇp(a, ) + 1 wa ˇp(a, ) 1 ˆwa ˆp(a, ) + ˆpa pa ˆwa ˇp(a, ) + 2 1 wa ˇp(a, ) 1 ˆwa ˇp(a, ) + 1 ˆwa ˇp(a, ) 1 ˆwa ˆp(a, ) + ˆpa pa ˆwa ˆwa pa ˇp(a, ) + 2 ˆwa | ˆwa wa| + 1 ˆwa ˇp(a, ) ˆp(a, ) + ˆpa pa ˆwa wa pa ˆwa pa + 1 ˆwa wa pa ˇp(a, ) + 2 ˆwa | ˆwa wa| + 1 ˆwa ˇp(a, ) ˆp(a, ) + ˆpa pa ˆwa wa pa ˇp(a, ) + 3 ˆwa | ˆwa wa| + 1 ˆwa ˇp(a, ) ˆp(a, ) + ˆpa pa | ˆwa wa| + k nε ln k|A| 1 nwa ln k|A| k nε ln k|A| k nε ln k|A| 1 nwa ln k|A| k n ˆwaε ln k|A| 1 nwa ln k|A| k nwaε ln k|A| 1 nwa ln k|A| Differentially Private Post-Processing for Fair Regression by Equation (12) and Lemmas B.6 and B.7; the bound on the ˇp(a, ) ˆp(a, ) follows the same analysis in the proof of Lemma B.7 for the first term. ˆwa wa pa ˇp(a, ) 1 + 3 ˆwa | ˆwa wa| + 1 ˆwa ˇp(a, ) ˆp(a, ) 1 + ˆpa pa 1 k nwaε ln k|A| k nwa ln |A| DKS(pa, pa) ℓ=1 ( wa pa(vℓ) ˇp(a, vℓ)) ˆwa | ˆwa wa| + 1 ℓ=1 (ˇp(a, ) ˆp(a, )) + DKS(ˆpa, pa) k nwaε ln k|A| k nwa ln k|A| For the proof of Theorem 3.3, we need the following technical result for the difference of W 2 2 distances: Lemma B.8 (Chizat et al., 2020). Let µ, µ , ν, ν be distributions whose supports are contained in the centered ball of radius R in Rd, then W 2 2 (µ, ν) W 2 2 (µ , ν ) Z x 2 2 d(µ µ )(x) + Z x 2 2 d(ν ν )(x) + 2R sup convex f Lip(1) Z f(x) d(µ µ )(x) + 2R sup convex g Lip(1) Z g(x) d(ν ν )(x) 4RW1(µ, µ ) + 4RW1(ν, ν ). The last line follows from the dual representation of W1 distance for distributions with bounded support: W1(µ, ν) = sup f Lip(1) Z f(x) d(µ ν)(x) , and the fact that x 7 x 2 2 is 2R-Lipschitz on the centered ball of radius R. 
For the proof of Theorem 3.3, we need the following technical result on differences of $W_2^2$ distances.

Lemma B.8 (Chizat et al., 2020). Let $\mu, \mu', \nu, \nu'$ be distributions whose supports are contained in the centered ball of radius $R$ in $\mathbb{R}^d$. Then
$$W_2^2(\mu, \nu) - W_2^2(\mu', \nu') \leq \int \|x\|_2^2\, d(\mu - \mu')(x) + \int \|x\|_2^2\, d(\nu - \nu')(x) + 2R\sup_{\text{convex } f \in \mathrm{Lip}(1)}\left|\int f(x)\, d(\mu - \mu')(x)\right| + 2R\sup_{\text{convex } g \in \mathrm{Lip}(1)}\left|\int g(x)\, d(\nu - \nu')(x)\right| \leq 4R\, W_1(\mu, \mu') + 4R\, W_1(\nu, \nu').$$
The last line follows from the dual representation of the $W_1$ distance for distributions with bounded support,
$$W_1(\mu, \nu) = \sup_{f \in \mathrm{Lip}(1)}\left|\int f(x)\, d(\mu - \nu)(x)\right|,$$
and the fact that $x \mapsto \|x\|_2^2$ is $2R$-Lipschitz on the centered ball of radius $R$.

Also, recall that the $W_1$ distance between distributions supported on a ball of radius $R$ can be upper bounded by the total variation distance:
$$W_1(\mu, \nu) = \inf_{\pi \in \Pi(\mu,\nu)} \int d(x, y)\, d\pi(x, y) \leq 2R \inf_{\pi \in \Pi(\mu,\nu)} \int 1[x \neq y]\, d\pi(x, y) = 2R\left(1 - \sup_{\pi \in \Pi(\mu,\nu)} \int 1[x = y]\, d\pi(x, y)\right) = 2R\left(1 - \int \min(\mu(x), \nu(x))\, dx\right) = 2R\int \max\big(0,\, \nu(x) - \mu(x)\big)\, dx = R\int |\nu(x) - \mu(x)|\, dx =: 2R\, D_{\mathrm{TV}}(\mu, \nu), \quad (13)$$
where the last two equalities use $\int (\mu(x) - \nu(x))\, dx = 0$.

And, note the following simple fact regarding optimal transports $T : \mathbb{R} \to \mathbb{R}$ under the squared cost; in the special case where $T$ is a Monge transportation plan, the lemma is equivalent to saying that $T$ is a nondecreasing function (see the last panel of Figure 1 for a picture):

Lemma B.9. Let $p, q$ be two distributions supported on $[k]$, and let $\pi \in \arg\min_{\pi \in \Pi(p,q)}\sum_{m,\ell}(m - \ell)^2\,\pi(m, \ell)$. Then for all $j \in [k]$, there exist $l_1, \ldots, l_k \in [k]$ such that
$$\frac{\sum_{\ell=1}^{j}\pi(m, \ell)}{p(m)} \ \begin{cases} = 1 & \text{if } m < l_j,\\ \in [0, 1] & \text{if } m = l_j,\\ = 0 & \text{if } m > l_j, \end{cases} \qquad \forall m \text{ s.t. } p(m) > 0. \quad (14)$$

Proof. Let $j \in [k]$ be arbitrary, and write $f(m) := \sum_{\ell=1}^{j}\pi(m, \ell)/p(m)$. Suppose to the contrary that there is no $l_j$ such that Equation (14) holds; then either $f$ as a function of $m$ is not non-increasing, i.e., $f(m + 1) > f(m)$ for some $m$, or $0 < f(m + 1) \leq f(m) < 1$ for some $m$. We show that either case contradicts the optimality of $\pi$. In both cases, there must exist $l \leq j < r$ such that $\pi(m, r) \geq q$ and $\pi(m + 1, l) \geq q$ for some $q > 0$, because $\sum_{\ell=1}^{j}\pi(m + 1, \ell) = p(m + 1)f(m + 1) > 0$ and $\sum_{\ell=j+1}^{k}\pi(m, \ell) = p(m)(1 - f(m)) > 0$. Then a coupling $\gamma$ with a lower cost can be constructed by (partially) exchanging the two entries:
$$\gamma(i, \ell) = \begin{cases} \pi(m, r) - q & \text{if } i = m,\ \ell = r,\\ \pi(m, l) + q & \text{if } i = m,\ \ell = l,\\ \pi(m + 1, r) + q & \text{if } i = m + 1,\ \ell = r,\\ \pi(m + 1, l) - q & \text{if } i = m + 1,\ \ell = l,\\ \pi(i, \ell) & \text{else.} \end{cases}$$
We verify that it has a lower cost than $\pi$:
$$\sum_{i,\ell}(i - \ell)^2(\gamma - \pi)(i, \ell) = -q(m - r)^2 + q(m - l)^2 + q(m + 1 - r)^2 - q(m + 1 - l)^2 = 2q(l - r) < 0.$$
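Lemma B.9 says that, for the squared cost on the line, an optimal coupling has the monotone (quantile) structure, so the row-wise cumulative mass fractions in Equation (14) form a staircase of the form $1, \ldots, 1, \text{fractional}, 0, \ldots, 0$. The following sketch constructs such a coupling for two PMFs on a common grid and exhibits the staircase; it is an illustration under the assumption that both PMFs sum to one, not part of Algorithm 1.

```python
import numpy as np

def monotone_coupling(p, q):
    """Monotone (quantile) coupling between PMFs p and q on a common grid.
    For the squared cost on the line this coupling is optimal, which is the
    structure described by Lemma B.9."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    k = len(p)
    pi = np.zeros((k, k))
    i = j = 0
    p_rem, q_rem = p[0], q[0]
    while i < k and j < k:
        mass = min(p_rem, q_rem)      # ship as much mass as possible diagonally
        pi[i, j] += mass
        p_rem -= mass
        q_rem -= mass
        if p_rem <= 1e-12:            # source cell exhausted: move to the next row
            i += 1
            p_rem = p[i] if i < k else 0.0
        if q_rem <= 1e-12:            # target cell exhausted: move to the next column
            j += 1
            q_rem = q[j] if j < k else 0.0
    return pi

# The row-wise cumulative mass fractions reproduce the staircase of Equation (14):
# for each j, sum_{l <= j} pi(m, l) / p(m) equals 1, then one fractional value, then 0.
p = np.array([0.2, 0.5, 0.3])
q = np.array([0.4, 0.4, 0.2])
fractions = np.cumsum(monotone_coupling(p, q), axis=1) / p[:, None]
```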
Proof of Theorem 3.3. This proof also relies on properly applying the triangle inequality to decompose the excess risk into comparable terms. We list the quantities that will be compared. Denote the Bayes regressor by $f^*$, and recall that $f$ is the regressor being post-processed. Denote the output distribution of $f^*$ conditioned on group $a$ by $r^*_a := f^*(\cdot, a)_\sharp \mu_{X|a}$, which will be compared to that of $f$, $r_a := f(\cdot, a)_\sharp \mu_{X|a}$. Given a discretizer $h$, the discretized conditional output distribution of $f^*$ is denoted by $p^*_a := h_\sharp r^*_a$, and that of $f$ by $p_a := h_\sharp r_a$. We will compare $r^*_a$ to its discretized version $p^*_a$, and $p_a$ to $\tilde p_a$, the empirical conditional discretized output distribution of $f$ estimated privately. Denote the private group marginal distribution estimated from the samples by $\tilde w_a$, which will be compared to the ground truth $w_a := P(A = a)$. Let $(\cdot, \{\tilde q'_a\}_{a \in A}) \in \mathcal{P}(\{\tilde p_a\}_{a \in A}, \{\tilde w_a\}_{a \in A}, \alpha)$ and $(\cdot, \cdot, \{\tilde q_a\}_{a \in A}) \in \mathrm{LP}(\{\tilde p_a\}_{a \in A}, \{\tilde w_a\}_{a \in A}, \alpha)$; the difference between $\tilde q'_a$ and $\tilde q_a$ is that the support of the latter is restricted to $v$. Recall that the fair regressor returned by Algorithm 1 has the form $\bar f = g_a \circ h \circ f$, where $g_a$ is the optimal transport from $\tilde p_a$ to $\tilde q_a$. The $\tilde q_a$'s will be compared to the $\tilde q'_a$'s, which are in turn compared to the output distributions $q^*_a$ of an optimal $\alpha$-fair regressor, denoted by $\bar f^*$ (note that $(\cdot, \{q^*_a\}_{a \in A}) \in \mathcal{P}(\{r^*_a\}_{a \in A}, \{w_a\}_{a \in A}, \alpha)$).

Error Bound. Note that $R(\bar f) - R(\bar f^*) = \mathcal{E}R(\bar f) - \mathcal{E}R(\bar f^*)$, where $\mathcal{E}R$ denotes the excess risk over the Bayes regressor. We begin by analyzing the first term. By the orthogonality principle,
$$\mathcal{E}R(\bar f) = \mathbb{E}\big[(\bar f(X, A) - f^*(X, A))^2\big] = \sum_{a \in A} w_a\, \mathbb{E}_{X \sim \mu_{X|a}}\big[(\bar f(X, a) - f^*(X, a))^2\big]$$
$$= \sum_{a \in A} w_a\, \mathbb{E}_{X \sim \mu_{X|a}}\big[\big(\bar f(X, a) - f(X, a) + (f(X, a) - f^*(X, a))\big)^2\big]$$
$$= \sum_{a \in A} w_a\, \mathbb{E}_{X \sim \mu_{X|a}}\big[(\bar f(X, a) - f(X, a))^2 + (f(X, a) - f^*(X, a))^2 + 2(\bar f(X, a) - f(X, a))(f(X, a) - f^*(X, a))\big]$$
$$\leq \sum_{a \in A} w_a\, \mathbb{E}_{X \sim \mu_{X|a}}\big[(\bar f(X, a) - f(X, a))^2\big] + 3\,\underbrace{\mathbb{E}\big[|f(X, A) - f^*(X, A)|\big]}_{E_1},$$
where the last line is because of the assumption that the images of $f, f^*$ (and hence $\bar f$) are contained in $[0, 1]$. The second term on the last line is three times the $L_1$ excess risk of $f$; for the first term,
$$\mathbb{E}_{X \sim \mu_{X|a}}\big[(\bar f(X, a) - f(X, a))^2\big] = \mathbb{E}_{X \sim \mu_{X|a}}\big[(g_a\circ h\circ f(X, a) - f(X, a))^2\big]$$
$$= \mathbb{E}_{X \sim \mu_{X|a}}\big[\big(g_a\circ h\circ f(X, a) - h\circ f(X, a) + (h\circ f(X, a) - f(X, a))\big)^2\big]$$
$$\leq \mathbb{E}_{X \sim \mu_{X|a}}\big[(g_a\circ h\circ f(X, a) - h\circ f(X, a))^2 + 3|h\circ f(X, a) - f(X, a)|\big]$$
$$\leq \mathbb{E}_{X \sim \mu_{X|a}}\big[(g_a\circ h\circ f(X, a) - h\circ f(X, a))^2\big] + \frac{3}{2k}$$
$$= \sum_{j=1}^{k} p_a(v_j)\, \mathbb{E}\big[(g_a(v_j) - v_j)^2\big] + \frac{3}{2k}$$
$$\leq \sum_{j=1}^{k} \tilde p_a(v_j)\, \mathbb{E}\big[(g_a(v_j) - v_j)^2\big] + \sum_{j=1}^{k}\big(p_a(v_j) - \tilde p_a(v_j)\big)\, \mathbb{E}\big[(g_a(v_j) - v_j)^2\big] + \frac{3}{2k}$$
$$\leq W_2^2(\tilde p_a, \tilde q_a) + \|p_a - \tilde p_a\|_1 + \frac{3}{2k}$$
$$\leq W_2^2(\tilde p_a, \tilde q_a) + O\left(\sqrt{\frac{k}{n w_a}\ln\frac{k|A|}{\beta}} + \frac{k}{n w_a\varepsilon}\ln\frac{k|A|}{\beta}\right) + \frac{3}{2k},$$
where line 4 is because $h$ discretizes its input to the midpoint of the bin that it falls in, which displaces it by at most $L/2k = 1/2k$; line 5 is because $p_a(v_j) = P(h\circ f(X, a) = v_j)$; the first term on line 7 is because $g_a$ is the optimal transport from $\tilde p_a$ to $\tilde q_a$; and the last line is by Theorem 3.2.

Then, combining the above, by Theorem 3.2, with probability at least $1 - \beta$,
$$\mathcal{E}R(\bar f) \leq \sum_{a \in A} w_a\, W_2^2(\tilde p_a, \tilde q_a) + E_1 + \underbrace{O\left(\sum_{a \in A}\left(\sqrt{\frac{k w_a}{n}\ln\frac{k|A|}{\beta}} + \frac{k}{n\varepsilon}\ln\frac{k|A|}{\beta} + |\tilde w_a - w_a|\right)\right)}_{E_2} + \frac{3}{2k}$$
$$\leq \sum_{a \in A} \tilde w_a\, W_2^2(\tilde p_a, \tilde q_a) + \sum_{a \in A}|\tilde w_a - w_a|\, W_2^2(\tilde p_a, \tilde q_a) + E_1 + E_2 + \frac{3}{2k} \leq \sum_{a \in A} \tilde w_a\, W_2^2(\tilde p_a, \tilde q_a) + E_1 + E_2 + \frac{3}{2k},$$
using $W_2^2(\tilde p_a, \tilde q_a) \leq 1$ and absorbing constant multiples into the $O(\cdot)$ defining $E_2$. Next,
$$\sum_{a \in A} \tilde w_a\, W_2^2(\tilde p_a, \tilde q_a) \leq \sum_{a \in A} \tilde w_a\, W_2^2(\tilde p_a, h_\sharp \tilde q'_a) \leq \sum_{a \in A} \tilde w_a\big(W_2(\tilde p_a, \tilde q'_a) + W_2(\tilde q'_a, h_\sharp \tilde q'_a)\big)^2$$
$$= \sum_{a \in A} \tilde w_a\Big(W_2^2(\tilde p_a, \tilde q'_a) + 2\,W_2(\tilde p_a, \tilde q'_a)\,W_2(\tilde q'_a, h_\sharp \tilde q'_a) + W_2^2(\tilde q'_a, h_\sharp \tilde q'_a)\Big) \leq \sum_{a \in A} \tilde w_a\, W_2^2(\tilde p_a, \tilde q'_a) + \frac{3}{2k} \leq \sum_{a \in A} \tilde w_a\, W_2^2(\tilde p_a, q^*_a) + \frac{3}{2k},$$
where the first inequality follows by noting that $\{h_\sharp \tilde q'_a\}_{a \in A}$ is a feasible solution to LP, as it can be verified that $D_{\mathrm{KS}}(h_\sharp \tilde q'_a, h_\sharp \tilde q'_{a'}) \leq \alpha$ given that $D_{\mathrm{KS}}(\tilde q'_a, \tilde q'_{a'}) \leq \alpha$ for all $a, a' \in A$ (hence restricting the support of the barycenter to $v$ introduces an additional error of $3/2k$, as discussed in footnote 3); the second-to-last inequality is because $h$ displaces points by at most $1/2k$, so $W_2(\tilde q'_a, h_\sharp \tilde q'_a) \leq 1/2k$ while $W_2(\tilde p_a, \tilde q'_a) \leq 1$; and the last inequality is because $\{\tilde q'_a\}_{a \in A}$ is a minimizer of $\mathcal{P}(\{\tilde p_a\}_{a \in A}, \{\tilde w_a\}_{a \in A}, \alpha)$, for which $\{q^*_a\}_{a \in A}$ is feasible.

So for the suboptimality of $\bar f$, by Theorem 3.1 (and again absorbing the swaps between $\tilde w_a$ and $w_a$ into $E_2$),
$$\mathcal{E}R(\bar f) - \mathcal{E}R(\bar f^*) \leq \sum_{a \in A} w_a\big(W_2^2(\tilde p_a, q^*_a) - W_2^2(r^*_a, q^*_a)\big) + E_1 + E_2 + O\Big(\frac{1}{k}\Big)$$
$$\leq \sum_{a \in A} w_a\Big(W_2^2(\tilde p_a, q^*_a) - \big(W_2(p^*_a, q^*_a) - W_2(r^*_a, p^*_a)\big)^2\Big) + E_1 + E_2 + O\Big(\frac{1}{k}\Big)$$
$$\leq \sum_{a \in A} w_a\Big(W_2^2(\tilde p_a, q^*_a) - W_2^2(p^*_a, q^*_a) + 2\,W_2(p^*_a, q^*_a)\,W_2(p^*_a, r^*_a)\Big) + E_1 + E_2 + O\Big(\frac{1}{k}\Big)$$
$$\leq \sum_{a \in A} w_a\big(W_2^2(\tilde p_a, q^*_a) - W_2^2(p^*_a, q^*_a)\big) + E_1 + E_2 + O\Big(\frac{1}{k}\Big),$$
where the last line is because $h$ is a transport from $r^*_a$ to $p^*_a$ with displacements of at most $1/2k$, so $W_2(p^*_a, r^*_a) \leq 1/2k$. For the first term, by Lemma B.8 and Equation (13),
$$\sum_{a \in A} w_a\big(W_2^2(\tilde p_a, q^*_a) - W_2^2(p^*_a, q^*_a)\big) \leq 4\sum_{a \in A} w_a\, W_1(\tilde p_a, p^*_a)$$
$$\leq 4\sum_{a \in A} w_a\big(W_1(\tilde p_a, p_a) + W_1(p_a, r_a) + W_1(r_a, r^*_a) + W_1(r^*_a, p^*_a)\big)$$
$$\leq 4\sum_{a \in A} w_a\Big(\|\tilde p_a - p_a\|_1 + \frac{1}{2k} + \mathbb{E}_{X \sim \mu_{X|a}}\big[|f^*(X, a) - f(X, a)|\big] + \frac{1}{2k}\Big)$$
$$\leq 4\sum_{a \in A} w_a\, \mathbb{E}_{X \sim \mu_{X|a}}\big[|f^*(X, a) - f(X, a)|\big] + E_2 + \frac{4}{k} = 4E_1 + E_2 + \frac{4}{k},$$
where the third inequality uses Equation (13) for the first term, the fact that $h$ displaces points by at most $1/2k$ for the second and fourth terms, and, for the third term, that the joint distribution of $(f(X, a), f^*(X, a))$ is a valid coupling belonging to $\Pi(r_a, r^*_a)$ that incurs a transportation cost of
$$\mathbb{E}\big[|f^*(X, a) - f(X, a)|\big] = \int |y - y'|\, dP\big(f(X, a) = y,\ f^*(X, a) = y'\big) \geq W_1(r_a, r^*_a);$$
the last line again absorbs the Theorem 3.2 error and the swap between $\tilde w_a$ and $w_a$ into $E_2$. Putting everything together, and with a union bound over the two events above, gives the result in the theorem statement.
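The error bound above is stated in terms of $W_2^2$ distances and barycenter costs of one-dimensional histograms, which are cheap to evaluate through quantile functions. The sketch below computes $W_2^2$ on a grid and a quantile-averaged (unconstrained) barycenter; the $D_{\mathrm{KS}} \leq \alpha$ constraint and the support restriction to $v$ handled by the LP in Algorithm 1 are deliberately ignored, so this is only a rough illustration of the quantities appearing in the bound, not the paper's procedure.

```python
import numpy as np

def w2_squared(p, q, v, m=10_000):
    """Squared Wasserstein-2 distance between PMFs p, q on the grid v, via the
    one-dimensional quantile formula: the average over u in (0, 1) of
    (F_p^{-1}(u) - F_q^{-1}(u))**2."""
    u = (np.arange(m) + 0.5) / m
    qp = v[np.minimum(np.searchsorted(np.cumsum(p), u), len(v) - 1)]
    qq = v[np.minimum(np.searchsorted(np.cumsum(q), u), len(v) - 1)]
    return float(np.mean((qp - qq) ** 2))

def unconstrained_barycenter_quantiles(ps, ws, v, m=10_000):
    """Quantile function of the weighted W_2 barycenter of one-dimensional
    histograms, obtained by averaging quantile functions with weights ws."""
    u = (np.arange(m) + 0.5) / m
    quantiles = [v[np.minimum(np.searchsorted(np.cumsum(p), u), len(v) - 1)] for p in ps]
    return np.average(np.stack(quantiles), axis=0, weights=ws)
```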
Fairness Guarantee. Let $\bar p_a$ denote the output distribution of the post-processed regressor $\bar f$ conditioned on group $a$. Using the triangle inequality, for any $a, a' \in A$,
$$D_{\mathrm{KS}}(\bar p_a, \bar p_{a'}) \leq D_{\mathrm{KS}}(\tilde q_a, \tilde q_{a'}) + D_{\mathrm{KS}}(\bar p_a, \tilde q_a) + D_{\mathrm{KS}}(\bar p_{a'}, \tilde q_{a'}) \leq \alpha + D_{\mathrm{KS}}(\bar p_a, \tilde q_a) + D_{\mathrm{KS}}(\bar p_{a'}, \tilde q_{a'}),$$
and
$$D_{\mathrm{KS}}(\bar p_a, \tilde q_a) = \max_j\left|\sum_{\ell=1}^{j}\big(\bar p_a(v_\ell) - \tilde q_a(v_\ell)\big)\right|$$
$$= \max_j\left|\sum_{\ell=1}^{j}\Big(P\big(\bar f(X, a) = v_\ell \mid A = a\big) - \tilde q_a(v_\ell)\Big)\right|$$
$$= \max_j\left|\sum_{\ell=1}^{j}\left(\sum_{m=1}^{k} p_a(v_m)\, P\big(g_a(v_m) = v_\ell\big) - \sum_{m=1}^{k}\tilde p_a(v_m)\, P\big(g_a(v_m) = v_\ell\big)\right)\right|$$
$$= \max_j\left|\sum_{m=1}^{k}\big(p_a(v_m) - \tilde p_a(v_m)\big)\sum_{\ell=1}^{j} P\big(g_a(v_m) = v_\ell\big)\right|$$
$$\leq \max_j\left(\left|\sum_{m=1}^{\iota_j - 1}\big(p_a(v_m) - \tilde p_a(v_m)\big)\right| + \big|p_a(v_{\iota_j}) - \tilde p_a(v_{\iota_j})\big|\, P\big(g_a(v_{\iota_j}) \leq v_j\big)\right)$$
$$\leq D_{\mathrm{KS}}(p_a, \tilde p_a) + \max_j \big|p_a(v_{\iota_j}) - \tilde p_a(v_{\iota_j})\big|$$
$$= O\left(\frac{\sqrt{k}}{n w_a\varepsilon}\ln\frac{k|A|}{\beta} + \sqrt{\frac{1}{n w_a}\ln\frac{k|A|}{\beta}}\right),$$
where line 3 is because $g_a$ is a transport from $\tilde p_a$ to $\tilde q_a$ (so $\tilde q_a(v_\ell) = \sum_m \tilde p_a(v_m) P(g_a(v_m) = v_\ell)$, and likewise $\bar p_a(v_\ell) = \sum_m p_a(v_m) P(g_a(v_m) = v_\ell)$ since $\bar f = g_a \circ h \circ f$), line 5 uses Lemma B.9 and the fact that $\sum_{\ell=1}^{j}\pi_a(v_m, v_\ell)/\tilde p_a(v_m) = P(g_a(v_m) \leq v_j)$, which equals $1$ for $m < \iota_j$ and $0$ for $m > \iota_j$, and line 7 is by Theorem 3.2. Combining the two displays establishes the fairness guarantee.
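The fairness guarantee is a bound on pairwise Kolmogorov-Smirnov distances between group-conditional output distributions, which is straightforward to audit empirically. The sketch below computes these gaps from output histograms on a common grid; the dictionary-of-histograms input format and the function names are assumptions of this illustration.

```python
import numpy as np

def dks(p, q):
    """Kolmogorov-Smirnov distance between two PMFs on a common grid."""
    return float(np.max(np.abs(np.cumsum(p) - np.cumsum(q))))

def parity_gaps(output_hists):
    """Pairwise D_KS gaps between the group-conditional output histograms of
    the post-processed regressor; under the fairness guarantee each gap is at
    most alpha plus the O(.) estimation term from the proof above."""
    labels = list(output_hists)
    return {(a, b): dks(output_hists[a], output_hists[b])
            for i, a in enumerate(labels) for b in labels[i + 1:]}
```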