Published as a conference paper at ICLR 2022

GENERALIZED DEMOGRAPHIC PARITY FOR GROUP FAIRNESS

Zhimeng Jiang1, Xiaotian Han1, Chao Fan1, Fan Yang2, Ali Mostafavi1, and Xia Hu2
1Texas A&M University, 2Rice University

ABSTRACT

This work aims to generalize demographic parity to continuous sensitive attributes while preserving tractable computation. Current fairness metrics for continuous sensitive attributes largely rely on intractable statistical independence measures between variables, such as the Hirschfeld-Gebelein-Renyi (HGR) maximal correlation and mutual information. Fairness metric estimators that rely on either tractable bounds or neural network approximation, however, cannot be fully trusted for ranking the prediction bias of algorithms, because they lack estimation accuracy guarantees. To make fairness metrics trustworthy, we propose Generalized Demographic Parity (GDP), a group fairness metric for continuous and discrete sensitive attributes. We provide an understanding of GDP from the probability perspective and theoretically reveal the connection between the GDP regularizer and adversarial debiasing. To estimate GDP, we adopt hard and soft group strategies via a one-hot or soft group indicator, representing the membership of each sample in the different groups of the sensitive attribute. We provably and numerically show that the soft group strategy achieves a faster estimation error convergence rate. Experiments show better bias mitigation performance of the GDP regularizer, compared with adversarial debiasing, for regression and classification tasks on tabular and graph benchmarks.^1

1 INTRODUCTION

The fairness problem has attracted increasing attention in many high-stakes applications, such as credit rating, insurance pricing, and college admission (Mehrabi et al., 2021; Du et al., 2020; Bellamy et al., 2018), where the adopted machine learning models may encode and even amplify societal biases against groups with different sensitive attributes.
The majority of existing fairness metrics, such as demographic parity (DP) (Feldman et al., 2015) and equalized odds (EO) (Hardt et al., 2016), presume discrete sensitive attributes such as gender and race. In many real-world applications, including urban studies and mobility prediction (Tessum et al., 2021), however, individuals' sensitive attributes are unavailable due to privacy constraints. Instead, only aggregated attributes following continuous distributions are available, and thus fairness requires unbiased prediction over neighborhood- or region-level objects. Additionally, some sensitive attributes, such as age and weight, are inherently continuous (Mary et al., 2019; Grari et al., 2020). The wide existence of continuous sensitive attributes calls for new fairness metric definitions and bias mitigation methods. Existing fairness metrics for continuous sensitive attributes rely on statistical measures of independence, such as the Hirschfeld-Gebelein-Renyi (HGR) maximal correlation coefficient (Mary et al., 2019) and mutual information (Jha et al., 2021; Creager et al., 2019), which are computationally intractable due to the functional optimization involved. Since mutual information involves a ratio of probability density functions, it is intractable to estimate it directly via density estimation because of the sensitivity to the density values, especially where the density is low. Previous works (Roh et al., 2020; Lowy et al., 2021; Cho et al., 2020) that adopt mutual information or its variants as a regularizer, however, rely on tractable bounds, computationally complex singular value decomposition operations (Mary et al., 2019), or neural network approximations that require training (Belghazi et al., 2018), such as the Donsker-Varadhan representation (Belghazi et al., 2018) and variational bounds (Poole et al., 2019).
^1 Code is available at https://github.com/zhimengj0326/GDP

Figure 1: Illustration of the demographic parity definition for binary and quaternary sensitive attributes, and of generalized demographic parity for a continuous sensitive attribute. Markers represent the average prediction for specific discrete sensitive attributes; the red dashed line and the blue solid curve represent the prediction average over all data and over data with a specific sensitive attribute, respectively. TV(·,·) denotes the weighted total variation distance.

Nevertheless, it is unreliable to adopt a mathematical bound of a fairness metric to evaluate different algorithms, since a lower bound on the metric does not necessarily imply lower prediction bias. A question is raised: Can we extend DP to continuous attributes while preserving tractable computation? In this work, we provide positive answers by proposing generalized demographic parity (GDP) from a regression perspective. Figure 1 provides an illustrative example of DP and GDP. The local prediction average (blue solid curve) and the global prediction average (red dashed line) represent the average prediction value given the sensitive attribute and over the whole data sample, respectively. The local and global prediction averages should be consistent at any specific continuous sensitive attribute value. Therefore, we define GDP via the weighted total variation distance between the local and global prediction averages, where the weight is the probability density of the continuous sensitive attribute. We also theoretically demonstrate the equivalence of GDP and DP for binary sensitive attributes, provide an understanding of GDP from a probability perspective, and reveal the connection between the GDP regularizer and adversarial debiasing for bias mitigation.
Although GDP is clearly defined on the unknown underlying joint distribution of the prediction and sensitive attribute, only data samples are available in practice. To this end, we propose two GDP estimation methods, named histogram estimation (hard group strategy) and kernel estimation (soft group strategy), where kernel estimation provably attains a faster estimation error convergence rate w.r.t. the data sample size. Specifically, histogram estimation manually divides the continuous sensitive attribute into several disjoint, complementary bins, and each sample belongs only to the bin containing its sensitive attribute; in other words, the group indicator of the sample is one-hot. As for kernel estimation, instead of quantizing the continuous sensitive attribute into several groups, the group indicator of each sample is given by a kernel function. In other words, to calculate the mean prediction value for a specific sensitive attribute value, a group smoothing strategy is adopted via a soft indicator determined by the distance between the sample's sensitive attribute and the target sensitive attribute. In short, the contributions of this paper are:
- We develop a tractable group fairness metric, GDP, for continuous sensitive attributes.
- We theoretically justify GDP by demonstrating its equivalence with DP for binary sensitive attributes, providing an understanding of GDP from the probability perspective, and revealing the connection between the GDP regularizer and adversarial debiasing.
- We propose histogram and kernel GDP estimation and provably demonstrate the superiority of the kernel GDP estimation method with a faster estimation error convergence rate w.r.t. sample size.
- We experimentally evaluate the effectiveness and extensibility of GDP on benchmarks from different domains (e.g., tabular, graph, and temporal graph data), tasks (e.g., classification and regression), and compositional sensitive attributes.
2 RELATED WORK

Machine Learning Fairness. Fair machine learning targets bias mitigation for automated decision-making systems. Various fairness definitions, such as group fairness and individual fairness, have been proposed (Zemel et al., 2013). Group metrics, such as DP and EO, measure the prediction difference between groups with different sensitive attributes, such as gender and age (Louizos et al., 2016; Hardt et al., 2016). While pre- and post-processing methods have been proposed for fairness boosting, these methods can still lead to higher prediction bias (Barocas et al., 2017) compared with in-processing methods, such as adding a regularizer, adversarial debiasing, and data augmentation. For example, a regularizer on the covariance between the predictions and sensitive attributes is imposed to boost independence in (Woodworth et al., 2017). (Zafar et al., 2017) constrains the decision boundaries of the classifier to minimize the prediction disparity between different groups. Adversarial training was originally proposed for deep generative modeling (Goodfellow et al., 2014) and has been introduced for prediction debiasing in representation learning (Zhao et al., 2020; Beutel et al., 2017; Louppe et al., 2017) and transfer learning (Madras et al., 2018). Data augmentation, such as fair mixup (Chuang & Mroueh, 2020), can improve the generalization ability of fairness. Representation neutralization has been proposed to boost fairness without sensitive attributes (Du et al., 2021).

Kernel Density Estimation and Kernel Regression. Kernel density estimation (KDE) is a non-parametric method to estimate the continuous probability density function of a random variable (Davis et al., 2011; Parzen, 1962). Given finite data samples, KDE smoothly estimates the probability density function via a weighted summation, where the weights are determined by a kernel function (Epanechnikov, 1969).
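To make the weighted-summation idea concrete, the following is a minimal KDE sketch using the Epanechnikov kernel K(u) = 0.75(1 - u^2) for |u| <= 1; the function name, evaluation grid, and bandwidth are illustrative choices, not taken from the paper:

```python
import numpy as np

def kde(x_grid, samples, h):
    """Kernel density estimate with the Epanechnikov kernel
    K(u) = 0.75 * (1 - u^2) for |u| <= 1 (Epanechnikov, 1969)."""
    u = (x_grid[:, None] - samples[None, :]) / h          # pairwise scaled distances
    k = np.where(np.abs(u) <= 1, 0.75 * (1 - u**2), 0.0)  # kernel weights
    return k.sum(axis=1) / (len(samples) * h)             # normalized weighted sum
```

Evaluating the estimate at zero for standard-normal samples recovers a value close to the true density 1/sqrt(2*pi) ≈ 0.399.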
Kernel regression is a non-parametric technique to estimate the conditional expectation of a random variable (Nadaraya, 1964). The Nadaraya-Watson kernel regression estimator performs regression via a locally normalized weighted average (Bierens, 1988), where the sample weights are determined by a kernel function.

3 GENERALIZED DEMOGRAPHIC PARITY

Without loss of generality, we consider a binary classification task that predicts the output variable $Y$ given the input variable $X$, while avoiding prediction bias w.r.t. the sensitive attribute $S$. Define the input $X \in \mathcal{X} \subseteq \mathbb{R}^d$, labels $Y \in \{0, 1\}$, and a machine learning model $f: \mathbb{R}^d \rightarrow [0, 1]$ that provides the prediction score $\hat{Y} = f(X)$. Fairness requires the prediction $\hat{Y}$ to be independent of the sensitive attribute $S$, regardless of whether it is continuous or discrete, i.e., $P(\hat{Y} = \hat{y}) = P(\hat{Y} = \hat{y} \mid S = s)$ for any support values $\hat{y}$ and $s$ (Beutel et al., 2017). Since the independence constraint is difficult to optimize, the relaxed demographic parity (DP) metric (Madras et al., 2018) was proposed to quantitatively measure the prediction bias for a binary sensitive attribute $S \in \{0, 1\}$. Formally, demographic parity is defined as $\mathrm{DP} = |\mathbb{E}[\hat{Y} \mid S = 0] - \mathbb{E}[\hat{Y} \mid S = 1]|$, where $\mathbb{E}[\cdot]$ denotes expectation. For a categorical sensitive attribute $S \in \mathcal{S}$, (Cho et al., 2020) introduces a fairness metric named the difference w.r.t. demographic parity (DDP): $\mathrm{DDP} = \sum_{s \in \mathcal{S}} |\mathbb{E}[\hat{Y} \mid S = s] - \mathbb{E}[\hat{Y}]|$. Although DP has been widely used to evaluate prediction bias, it is inapplicable to continuous sensitive attributes, since the data samples cannot be directly divided into several distinct groups based on the sensitive attribute. Without loss of generality, we assume a continuous sensitive attribute $S \in [0, 1]$ and propose GDP to extend tractable DP to continuous sensitive attributes.
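The DP and DDP definitions translate directly into sample estimates. A small sketch (the helper names are ours, not the paper's) computes DP for a binary attribute and DDP for a categorical one:

```python
import numpy as np

def dp_binary(y_hat, s):
    """Demographic parity for a binary sensitive attribute:
    DP = |E[Y_hat | S=0] - E[Y_hat | S=1]|."""
    return abs(y_hat[s == 0].mean() - y_hat[s == 1].mean())

def ddp_categorical(y_hat, s):
    """Difference w.r.t. demographic parity (Cho et al., 2020):
    DDP = sum over s of |E[Y_hat | S=s] - E[Y_hat]|."""
    global_avg = y_hat.mean()
    return sum(abs(y_hat[s == v].mean() - global_avg) for v in np.unique(s))

# Toy example: predictions biased toward group S=1.
y_hat = np.array([0.2, 0.3, 0.8, 0.9])
s = np.array([0, 0, 1, 1])
print(dp_binary(y_hat, s))        # ≈ 0.6 = |0.25 - 0.85|
print(ddp_categorical(y_hat, s))  # ≈ 0.6 = 0.3 + 0.3
```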
Assume the joint distribution of the tuple $(S, \hat{Y})$ is $P_{S,\hat{Y}}(s, \hat{y})$. The local prediction average and global prediction average are defined as the prediction expectation given the sensitive attribute $S = s$ and without any sensitive attribute condition, i.e., local prediction average $m(s) = \mathbb{E}[\hat{Y} \mid S = s]$ and global prediction average $m_{avg} = \mathbb{E}_S[m(S)] = \mathbb{E}[\hat{Y}]$, respectively. We then adopt a weighted total variation distance between the local and global prediction averages, where the weight is specified by the probability density function of the sensitive attribute. The formal definition of generalized demographic parity for continuous sensitive attributes is
$$\mathrm{GDP} = \int \big| m(s) - m_{avg} \big| \, P_S(S = s) \, ds = \mathbb{E}_S\big[ |m(S) - m_{avg}| \big]. \quad (1)$$
We also provide the connection between GDP and DP for binary sensitive attributes, which shows that GDP is equivalent to DP in that case:
Theorem 1 (Connection between DP and GDP). For a binary sensitive attribute $S \in \{0, 1\}$, GDP and DP are equivalent up to a coefficient that depends only on the dataset. Specifically, $\mathrm{GDP} = 2 P_S(S = 1) P_S(S = 0) \cdot \mathrm{DP}$.
The proof of Theorem 1 is presented in Appendix A. For categorical sensitive attributes, it is easy to obtain $\mathrm{GDP} = \sum_{s \in \mathcal{S}} P_S(S = s) \, |\mathbb{E}[\hat{Y} \mid S = s] - \mathbb{E}[\hat{Y}]|$, i.e., GDP is a weighted DDP for categorical sensitive attributes. In a nutshell, GDP is a natural fairness metric extension for binary and categorical sensitive attributes. Since independence of the prediction $\hat{Y}$ and the sensitive attribute $S$ implies that the joint distribution $P_{S,\hat{Y}}(s, \hat{y})$ and the product of marginal distributions $P_S(s) P_{\hat{Y}}(\hat{y})$ are the same, the bias can be measured by the distance between the joint distribution and the product of marginals. Subsequently, we show the connection between GDP and a prediction-weighted total variation distance between these two distributions:
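Theorem 1 can be checked numerically: for a binary attribute, the discrete form of GDP collapses exactly to 2 P(S=1) P(S=0) · DP. A small sketch with synthetic data (the data-generating choices here are arbitrary illustrations):

```python
import numpy as np

rng = np.random.default_rng(0)
N = 100_000
s = (rng.random(N) < 0.3).astype(int)                  # binary attribute, P(S=1) ~ 0.3
y_hat = np.where(s == 1, 0.7, 0.4) + 0.05 * rng.standard_normal(N)

p1 = s.mean()                                           # empirical P(S=1)
dp = abs(y_hat[s == 0].mean() - y_hat[s == 1].mean())   # demographic parity
m_avg = y_hat.mean()
# Discrete GDP: sum over s of P(S=s) * |E[Y_hat | S=s] - E[Y_hat]|
gdp = sum(np.mean(s == v) * abs(y_hat[s == v].mean() - m_avg) for v in (0, 1))

print(gdp, 2 * p1 * (1 - p1) * dp)   # the two values coincide up to rounding
```

The agreement is exact in the empirical quantities, since m_avg = p0 m0 + p1 m1 makes each deviation term proportional to |m0 - m1|.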
Theorem 2 (Probability View of GDP). Assume the joint distribution of $(\hat{Y}, S)$ with support $[0, 1]^2$ is $P_{S,\hat{Y}}(s, \hat{y})$. Define the prediction-weighted total variation distance as $\mathrm{TV}_{pred}(P^1, P^2) = \int_0^1 \int_0^1 \hat{y} \, |P^1(\hat{y}, s) - P^2(\hat{y}, s)| \, d\hat{y} \, ds$. Then the proposed GDP for a continuous sensitive attribute is upper bounded by the prediction-weighted total variation distance between the joint distribution and the product of marginal distributions:
$$\mathrm{GDP} = \int_0^1 \Big| \int_0^1 \hat{y} \big[ P_{S,\hat{Y}}(s, \hat{y}) - P_S(s) P_{\hat{Y}}(\hat{y}) \big] \, d\hat{y} \Big| \, ds \;\le\; \mathrm{TV}_{pred}\big( P_{S,\hat{Y}}(s, \hat{y}), \, P_S(s) P_{\hat{Y}}(\hat{y}) \big).$$
The proof of Theorem 2 is presented in Appendix B. Theorem 2 demonstrates that GDP is a lower bound on the prediction-weighted total variation distance between these two distributions, which implies the necessity of GDP for bias measurement.

4 GDP ESTIMATION

GDP is defined based on the underlying joint distribution $P_{S,\hat{Y}}(s, \hat{y})$ of the tuple $(S, \hat{Y})$, where $S, \hat{Y} \in [0, 1]$. The underlying joint distribution, however, is unknown. Thus, we aim to estimate GDP given samples $\{(s_n, \hat{y}_n), n \in [N]\}$, where $[N] = \{1, \ldots, N\}$. To bridge this gap, we propose histogram GDP estimation and kernel GDP estimation methods based on different group strategies. Specifically, histogram GDP estimation groups data samples in a hard manner by creating consecutive, non-overlapping bins, and the local prediction average is estimated by the average prediction among the data samples in each bin. As for kernel GDP estimation, a soft group indicator, determined by the kernel function and the sensitive attribute distance, is adopted to provide a group smoothing strategy. Specifically, the local prediction average for a target sensitive attribute is calculated via a weighted prediction, where samples with sensitive attributes close to the target possess large weights.

Histogram GDP Estimation. A histogram is an approximate representation of the underlying data probability distribution, created from several consecutive, non-overlapping intervals or bins with, usually but not necessarily, equal bandwidth.
In this paper, we assume all bins have equal bandwidth $h$ and the number of bins $N_h = \frac{1}{h}$ is an integer. In other words, the data tuples can be divided into $N_h$ groups, and the bin intervals are $B_1 = [0, h), B_2 = [h, 2h), \ldots, B_{N_h} = [(N_h - 1)h, 1]$. Define the indicator function $\mathbb{I}(A)$ as $1$ if event $A$ happens and $0$ otherwise. Thereby, the group indicator $w_h(n, i)$ of sample $(s_n, \hat{y}_n)$ for the $i$-th bin is $\mathbb{I}(s_n \in B_i)$. Note that the bin intervals are complementary; each sample belongs to one and only one bin, i.e., the indicator vector $w_h(n) = [w_h(n, 1), \ldots, w_h(n, N_h)]$ for sample $n$ is one-hot, which formally defines the hard group strategy. The local prediction average and the probability of the sensitive attribute can be point-wise estimated from the empirical expectation and distribution. Specifically, for $s \in B_i$, the local and global prediction averages are
$$\hat{m}^h(s) = \frac{\sum_{n=1}^N \mathbb{I}(s_n \in B_i) \, \hat{y}_n}{\sum_{n=1}^N \mathbb{I}(s_n \in B_i)}, \; \text{for } i \in [N_h]; \qquad \hat{m}^h_{avg} = \frac{1}{N} \sum_{n=1}^N \hat{y}_n. \quad (2)$$
Similarly, the probability of the sensitive attribute is $\hat{P}^h_S(s \in B_i) = \frac{\sum_{n=1}^N \mathbb{I}(s_n \in B_i)}{N}$. Finally, we combine all estimations to calculate the prediction bias as
$$\widehat{\mathrm{GDP}}(h) = \sum_{i=1}^{N_h} \big| \hat{m}^h(s \in B_i) - \hat{m}^h_{avg} \big| \, \hat{P}^h_S(s \in B_i). \quad (3)$$

Kernel GDP Estimation. Histogram GDP estimation provides a hard group strategy by creating several bins. However, a tiny sensitive attribute perturbation can lead to a different group indicator, so histogram estimation is not robust to the sensitive attribute. Moreover, a data tuple does not necessarily belong to a single group for continuous attributes. For example, a data sample with $s = 0.5$ may belong with equal probability to the target sensitive attributes $s = 0.4$ and $s = 0.6$. Based on these observations, we propose kernel GDP estimation via group smoothing. Intuitively, when calculating the local prediction average or probability density for a target sensitive attribute, tuples with closer attributes receive higher weights.
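The hard group strategy above can be sketched in a few lines; the bin-index arithmetic assumes the sensitive attribute lies in [0, 1], and the function name is ours:

```python
import numpy as np

def histogram_gdp(s, y_hat, n_bins):
    """Histogram (hard-group) GDP estimate: bin the sensitive attribute,
    then weight |bin-average prediction - global average| by bin mass."""
    s, y_hat = np.asarray(s, float), np.asarray(y_hat, float)
    bins = np.clip((s * n_bins).astype(int), 0, n_bins - 1)  # one-hot bin index per sample
    m_avg = y_hat.mean()
    gdp = 0.0
    for i in range(n_bins):
        mask = bins == i
        if mask.any():
            gdp += mask.mean() * abs(y_hat[mask].mean() - m_avg)
    return gdp
```

A prediction that equals the attribute itself yields a large estimate, while a constant (attribute-independent) prediction yields zero.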
Specifically, we introduce a symmetric one-dimensional smoothing kernel function $K(s) \ge 0$ satisfying the normalization condition $\int K(s) \, ds = 1$, the symmetry condition $\int s K(s) \, ds = 0$, and finite variance $\sigma_K^2 = \int s^2 K(s) \, ds > 0$. Define $h > 0$ as the kernel bandwidth. For a target sensitive attribute $s$, the tuple weight for a sample with sensitive attribute $s_n$ is $w(s_n, s) = \frac{1}{h} K\big( \frac{|s_n - s|}{h} \big)$. In short, the kernel function provides the group smoothing strategy based on the sensitive attribute distance of the tuple pair. Given the smoothing kernel function $K(s)$, the local and global prediction averages can be obtained via the normalized weighted average (Nadaraya-Watson kernel estimator) as follows:
$$\tilde{m}^h(s) = \frac{\sum_{n=1}^N \hat{y}_n \, K\big( \frac{s_n - s}{h} \big)}{\sum_{n=1}^N K\big( \frac{s_n - s}{h} \big)}, \qquad \tilde{m}^h_{avg} = \frac{1}{N} \sum_{n=1}^N \hat{y}_n. \quad (4)$$
Similarly, the probability density of the sensitive attribute is $\tilde{p}^h_S(s) = \frac{1}{Nh} \sum_{n=1}^N K\big( \frac{s_n - s}{h} \big)$. Finally, we combine all estimations to calculate the kernel GDP estimate as
$$\widetilde{\mathrm{GDP}}(h) = \int_0^1 \big| \tilde{m}^h(s) - \tilde{m}^h_{avg} \big| \, \tilde{p}^h_S(s) \, ds. \quad (5)$$

Estimation Error Analysis. We provide a theoretical analysis of the GDP estimation error and prove the superiority of kernel GDP estimation. Assume the data samples are independent and identically distributed. The estimated GDP is therefore a random variable, and we adopt the expected mean squared error (MSE) to quantify the accuracy of the estimation methods. Formally, the errors of histogram estimation and kernel estimation are
$$\mathrm{Err}_{hist} = \mathbb{E}\big[ |\widehat{\mathrm{GDP}} - \mathrm{GDP}|^2 \big]; \qquad \mathrm{Err}_{kernel} = \mathbb{E}\big[ |\widetilde{\mathrm{GDP}} - \mathrm{GDP}|^2 \big], \quad (6)$$
where the expectation is taken over the $N$ tuples. We provide an asymptotic analysis of the estimation error and show the superiority of kernel GDP estimation in the following:
Theorem 3 (Estimation Error Convergence Rate). Assume that the mean prediction function $m(s)$, given sensitive attribute $s$, is smooth and satisfies the $L$-Lipschitz condition^2 on the local average, $|m(s) - m(s')| \le L |s - s'|$ for any $s, s'$.
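The soft group strategy can be sketched as a Nadaraya-Watson local average plus a KDE of the attribute density, followed by numerical integration over [0, 1]. We use a Gaussian kernel, a fixed evaluation grid, and the trapezoidal rule for illustration; these are our choices, not the paper's:

```python
import numpy as np

def kernel_gdp(s, y_hat, h, n_grid=200):
    """Kernel (soft-group) GDP estimate: Nadaraya-Watson local average,
    kernel density of the sensitive attribute, trapezoidal integration."""
    s, y_hat = np.asarray(s, float), np.asarray(y_hat, float)
    grid = np.linspace(0.0, 1.0, n_grid)
    u = (s[None, :] - grid[:, None]) / h
    w = np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)   # Gaussian K((s_n - s)/h)
    m_local = (w @ y_hat) / w.sum(axis=1)          # Nadaraya-Watson estimator
    p_s = w.sum(axis=1) / (len(s) * h)             # kernel density estimate
    integrand = np.abs(m_local - y_hat.mean()) * p_s
    dx = grid[1] - grid[0]                          # trapezoidal rule over the grid
    return (integrand.sum() - 0.5 * (integrand[0] + integrand[-1])) * dx
```

As with the histogram sketch, a prediction equal to the attribute gives a large estimate, while an independent prediction gives a value near zero.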
Then the optimal bandwidth choices for the histogram and kernel estimation methods are $h^*_{hist} = O(N^{-1/3})$ and $h^*_{kernel} = O(N^{-1/5})$. Additionally, the estimation errors satisfy $\mathrm{Err}_{hist} = O(N^{-2/3})$ and $\mathrm{Err}_{kernel} = O(N^{-4/5})$, where $O(\cdot)$ is big-O notation.
The proof of Theorem 3 is presented in Appendix C. The proof sketch is to separately upper bound the estimation errors of the local average and of the sensitive attribute probability density, and then combine the two to obtain the estimation error for GDP.

Computation Complexity Analysis. Since the GDP calculation involves integral operations, approximate numerical integration is usually adopted with $M$ probing sensitive attribute values. The complexity of calculating the local prediction average and the probability density at $M$ probing values is $O(MN)$, and thus the complexity of both histogram and kernel estimation is $O(MN)$. In (Mary et al., 2019), the intractable HGR coefficient equals the second largest singular value of the distribution ratio matrix, or can be upper bounded by the tractable chi-square distance between the joint distribution and the product of marginals. Assuming $M$ probing predictions, the complexity of two-dimensional probability density estimation and SVD is $O(M^2 N)$ and $O(M^3)$, respectively, and that of the chi-square distance is $O(M^2 N)$. Therefore, the computation complexity for HGR is $O(M^2 (M + N))$. As for neural-network-based approximations of HGR or mutual information, the training complexity is quite large compared with direct computation.

^2 The Lipschitz condition requires a bounded gradient w.r.t. the sensitive attribute and guarantees a rational local average estimation based on neighboring data samples.

5 ANALYSIS ON GDP REGULARIZER AND ADVERSARIAL DEBIASING

With GDP bias measurement for continuous sensitive attributes, it is natural to add a GDP regularizer to enforce fairness.
Another bias mitigation approach is adversarial debiasing, a two-player framework between a predictor and an adversary with a regression task. We establish the connection between the GDP regularizer and adversarial debiasing, and demonstrate that adversarial debiasing with a specific adversary regression objective implicitly minimizes GDP.

GDP Regularizer: As a reminder, our goal is to learn a predictor $\hat{Y} = f(X)$ that approximates the label $Y$ for each input while reducing the prediction bias GDP w.r.t. the continuous sensitive attribute $S$. It is natural to add GDP regularization to enforce fairness. Given the prediction loss $L_{pred}$, the fairness-enforcing objective function is $\min_f L_{pred}(f(X), Y) + \lambda \, \mathrm{GDP}$, where $L_{pred}$ can be a regression or classification loss and $\lambda$ is the hyperparameter that controls the trade-off between prediction performance and prediction bias reduction.

Adversarial Debiasing: Adversarial debiasing is another natural method to ensure fair prediction via a two-player game between a predictor and an adversary. Specifically, the predictor $f$ yields the prediction $\hat{Y}$, and given $\hat{Y}$, the adversary $g$ tries to predict the continuous sensitive attribute $\hat{S} = g(\hat{Y})$ in a regression task. Similar to adversarial debiasing (Louppe et al., 2017), we adopt the same classifier and adversary structure, where the input of the adversary is the output of the classifier. For the adversary objective, we adopt the utility function proposed in (Madras et al., 2018)^3 to represent the sensitive attribute prediction accuracy. In this case, the predictor aims to generate accurate predictions and fool the adversary simultaneously, while the adversary targets high utility, i.e., accurate sensitive attribute prediction. Let $L_{pred}$ denote the prediction loss and $L_{adv}$ the adversarial utility. Then adversarial debiasing is trained following the min-max procedure $\min_f \max_g L_{pred}(f(X), Y) + \lambda L_{adv}(g(f(X)), S)$.
Defining $g^* = \arg\max_g L_{adv}(g(f(X)), S)$ as the optimal adversary given the predictor $f$, the min-max procedure simplifies to the objective function $\min_f L_{pred}(f(X), Y) + \lambda L_{adv}(g^*(f(X)), S)$.

Theoretical Connection: We provide an inherent connection between the GDP regularizer and adversarial debiasing. Specifically, we demonstrate that the optimal adversarial utility $L_{adv}(g^*(f(X)), S)$ is an upper bound on GDP, as long as the adversary utility function is $L_{adv}(\hat{S}, S) = 1 - |\hat{S} - S|$:
Theorem 4 (GDP and Adversarial Debiasing Connection). Consider a predictor $\hat{Y} = f(X)$ and an adversary $\hat{S} = g(\hat{Y})$. Given the adversary utility $L_{adv}(\hat{S}, S) = 1 - |\hat{S} - S|$ and the optimal adversary $g^*$, GDP is bounded by the utility function with the optimal adversary, i.e., $L_{adv}(g^*(f(X)), S) \ge \mathrm{GDP}$.
The proof of Theorem 4 is presented in Appendix E. Intuitively, a fair classifier aims to minimize the adversary utility to make its predictions fair, and thus induces a lower GDP metric. Theorem 4 reveals the inherent connection between the GDP regularizer and adversarial debiasing: adversarial debiasing behaves like predictor loss optimization with GDP regularization, as long as the adversary is optimal. Moreover, this connection holds not only for the underlying data distribution but also for the empirical data distribution; the proof sketch is similar, replacing the underlying distribution with the empirical one in the expectation operation. In practice, alternating optimization is usually adopted in adversarial debiasing, updating either the predictor or the adversary at each training step while keeping the other fixed.

^3 (Madras et al., 2018) adopts an encoder-decoder structure for transferable representation learning. We only adopt the utility function to connect GDP and adversarial debiasing (Louppe et al., 2017).

6 EXPERIMENTS

We evaluate the effectiveness and extensibility of GDP.
First, we show the lower GDP estimation error of kernel GDP estimation^4 with group smoothing, compared to that of histogram estimation, via two synthetic experiments. We empirically evaluate the effectiveness and extensibility of kernel estimation on multiple prediction tasks, including classification and regression, and on real-world datasets from multiple domains, including tabular, graph, and temporal graph data (see Appendix H.1). Moreover, kernel estimation is also applicable to compositional continuous sensitive attributes. For a fair comparison, we compare our method kernel, which adds the kernel GDP estimate as a regularizer, with (a) vanilla: training with empirical risk minimization (ERM) without any regularization; (b) histogram: histogram estimation with the continuous sensitive attribute; (c) adv: adversarial debiasing (Louppe et al., 2017); (d) adv-bn: adversarial debiasing with a binary-quantized sensitive attribute; (e) hgr: the upper bound of HGR as a regularizer (Mary et al., 2019); and (f) hgr-bn: the upper bound of HGR as a regularizer with a binary-quantized sensitive attribute. Specifically, we demonstrate the trade-off between prediction performance and GDP by varying the hyper-parameter $\lambda$. In particular, we adopt accuracy (Acc) for classification tasks and mean absolute error (MAE) for regression tasks. Details about the baselines and experimental setups for each dataset can be found in Appendix H.2.

6.1 SYNTHETIC EXPERIMENTS

We test the GDP estimation error and investigate the robustness of histogram and kernel estimation via two synthetic experiments. The goal is to investigate the effect of the bandwidth choice and the number of samples on the GDP estimation error. For data generation, we first generate samples $(S, \hat{Y})$ from a bivariate Gaussian distribution $N(\mu, \Sigma)$ with mean $\mu = [0.5, 1.0]$ and covariance matrix $\Sigma = \begin{pmatrix} 1 & 0.5 \\ 0.5 & 2.0 \end{pmatrix}$. Additionally, we generate samples from $p_{S,\hat{Y}}(s, \hat{y}) = s + \hat{y}$ for $0 \le s, \hat{y} \le 1$ via the acceptance-rejection method (Wells et al., 2004).
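The second synthetic distribution can be sampled as follows. Since p(s, y) = s + y is bounded by 2 on the unit square, acceptance-rejection with a uniform proposal and bound M = 2 applies (the function name is ours; batch sizes are arbitrary):

```python
import numpy as np

def sample_accept_reject(n, rng):
    """Draw n samples from the density p(s, y) = s + y on [0, 1]^2 via
    acceptance-rejection with a uniform proposal q = 1 and bound M = 2."""
    out = np.empty((0, 2))
    while len(out) < n:
        cand = rng.random((2 * n, 2))                      # uniform candidate batch
        keep = rng.random(2 * n) < cand.sum(axis=1) / 2.0  # accept w.p. p / (M * q)
        out = np.vstack([out, cand[keep]])
    return out[:n]

rng = np.random.default_rng(0)
samples = sample_accept_reject(50_000, rng)
# The marginal of S is p(s) = s + 1/2, so E[S] = 7/12 ≈ 0.583.
```

The overall acceptance probability is 1/2, so each proposal batch of size 2n yields about n accepted samples.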
For estimation, we select two typical kernel functions (tricube and Aitchison-Aitken kernels (Mussa, 2013)) for the local prediction average, and two kernel functions (linear and cosine kernels) for the probability density estimation. Figure 2 shows the GDP estimation error w.r.t. the bandwidth $h$ and the number of samples $N$ for the bivariate Gaussian distribution and the second synthetic bivariate distribution, respectively. We observe that the histogram GDP estimation error is highly sensitive to the bandwidth choice, while kernel estimation is robust to the bandwidth and kernel function choice due to the flexibility of group smoothing. As for the error rate with the number of samples, our experiments show the error rate curves for histogram and kernel estimation. Kernel estimation achieves a faster error convergence rate when the optimal bandwidth is searched for each method. In addition, the estimation error is almost the same for different kernel functions, which supports the theoretical results in Theorem 3.

Figure 2: Demographic parity estimation error analysis with respect to the bandwidth and the number of samples for the Gaussian and second synthetic distributions.

6.2 EXPERIMENTS ON TABULAR DATA

Dataset: We consider two benchmark tabular datasets, UCI Adult and Crimes, to evaluate the effectiveness of kernel estimation for classification and regression tasks. The UCI Adult dataset^5 contains more than 40,000 individual records from the 1994 US Census. The classification task is to predict whether a person's income exceeds $50K/yr (Kohavi, 1996). We consider normalized age^6 as the sensitive attribute to measure the fairness of algorithms. The Crimes dataset^7 includes 128 attributes for 1,994 samples from communities in the US.

^4 Since there is no ground truth for the underlying data distribution, we adopt the reliable kernel-based estimated GDP as the fairness metric on real-world datasets.
^5 https://archive.ics.uci.edu/ml/datasets/adult
The regression task is to predict the number of violent crimes per population in US communities. We adopt the black group ratio as the continuous sensitive attribute.
Model: We adopt a two-layer SELU network (Klambauer et al., 2017) with hidden size 50 and report the mean prediction performance and GDP over 5 runs.

Figure 3: Mitigation performance on the tabular datasets with kernel estimation and other baselines. (a) The tradeoff between accuracy (classification) and GDP for the Adult dataset; (b) the tradeoff between MAE (regression) and GDP for the Crimes dataset.

Results: We compare the bias mitigation performance, i.e., the tradeoff between prediction performance and GDP, of kernel estimation with other baselines on the two tabular datasets in Figure 3. The hyper-parameter $\lambda$ in the objective function controls the tradeoff between prediction performance and GDP. Specifically, we choose the accuracy metric for the classification task on the Adult dataset, and the MAE metric for the regression task on the Crimes dataset. For adversarial debiasing, we vary the regularization weights to obtain the tradeoff curve. Overall, we make the following observations: (a) kernel estimation outperforms all other baseline methods in terms of the performance-fairness tradeoff curve for the Adult and Crimes datasets. Specifically, kernel estimation achieves more than 1% accuracy improvement when GDP is lower than 0.02 on the Adult dataset, and more than 2% MAE reduction on the Crimes dataset; (b) kernel estimation has lower computation complexity than adversarial debiasing.
Kernel and histogram estimation decrease running time by more than 50% on average across the tabular datasets, which makes them readily usable for large-scale real-world datasets; (c) histogram estimation and binary-quantized sensitive attributes lead to a mitigation performance drop relative to kernel estimation, adversarial debiasing, and HGR, respectively. This fact implies the importance of the order information in continuous sensitive attributes for bias mitigation. In addition, we observe that adversarial debiasing and HGR training are not stable, and a larger hyper-parameter does not necessarily lead to better bias mitigation.

6.3 EXPERIMENTS ON GRAPH DATA

Dataset: We consider two real-world graph datasets, Pokec-z and Pokec-n, sampled from Pokec, a large Facebook-like social network in Slovakia. User profiles contain gender, age, interests, education, working field, etc. We treat normalized age as the continuous sensitive attribute, and the node classification task is to predict the working field of the users.
Model: We use three graph neural network backbones, graph convolutional networks (GCN) (Kipf & Welling, 2017), graph attention networks (GAT) (Veličković et al., 2018), and simplifying graph convolutional networks (SGC) (Wu et al., 2019), with 64 feature dimensions. We train each GNN for 200 epochs over 5 runs and report the average accuracy and GDP. In each trial, the dataset is randomly split into training, validation, and test sets with a 50%/25%/25% partition.
Results: We compare the mitigation performance of kernel estimation with the other baselines on the two datasets with three backbones in Figure 4. Similarly, kernel estimation consistently outperforms the other baselines by a large margin, and binary-quantized sensitive attributes inevitably deteriorate the mitigation performance.
Another observation is that GDP can be reduced by at least 80% at the cost of 2% accuracy for kernel estimation on the two datasets and three backbones, while larger accuracy drops result for the other baselines. Additionally, adversarial debiasing and HGR are also unstable during training, and a higher hyper-parameter may lead to larger GDP.

6 The biggest difference between continuous and discrete sensitive attributes lies in the existence of order information (Mary et al., 2019) for attribute values.
7 https://archive.ics.uci.edu/ml/datasets/communities+and+crime

Figure 4: Mitigation performance of GCN, GAT and SGC for the Pokec-n and Pokec-z datasets.

6.4 EXPERIMENTS ON COMPOSITIONAL CONTINUOUS SENSITIVE ATTRIBUTES

Dataset: The same two benchmark tabular datasets are adopted to evaluate the effectiveness of kernel estimation for compositional continuous sensitive attributes. Specifically, to evaluate bias mitigation performance, we treat the normalized age and education number for the Adult (UCI) dataset, and the black group ratio and normalized age for the Crimes dataset, as compositional continuous sensitive attributes. Model: We also adopt a two-layer SELU network with hidden size 50 over 5 runs and report the average prediction performance and GDP. Results: Figure 5 shows the bias mitigation tradeoff between prediction performance and GDP. Again, kernel estimation consistently achieves a better tradeoff than all other baselines, and binary-quantized compositional sensitive attributes lead to a mitigation performance drop.

Figure 5: Mitigation performance for compositional sensitive attributes.
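The GDP regularizer behind these tradeoff curves combines the task loss with a $\lambda$-weighted kernel GDP estimate (Eq. (5)). The following NumPy sketch illustrates one plausible form; the function names, the Gaussian kernel choice, and the fixed bandwidth are our assumptions rather than the authors' released code.

```python
import numpy as np

def gaussian_kernel(u):
    # Standard Gaussian kernel; the paper also evaluates tricube and
    # Aitchison-Aitken kernels in its synthetic study (Section G.2).
    return np.exp(-0.5 * u ** 2) / np.sqrt(2.0 * np.pi)

def gdp_kernel(preds, s, bandwidth):
    # Soft-group GDP estimate: Nadaraya-Watson local prediction average
    # at each sample, then the sample mean of |m(s_n) - m_avg|, which
    # Monte-Carlo-approximates \int |m(s) - m_avg| p_S(s) ds.
    preds = np.asarray(preds, dtype=float)
    s = np.asarray(s, dtype=float)
    w = gaussian_kernel((s[None, :] - s[:, None]) / bandwidth)  # (N, N) weights
    m_local = w @ preds / w.sum(axis=1)
    return float(np.mean(np.abs(m_local - preds.mean())))

def regularized_loss(task_loss, preds, s, lam, bandwidth=0.1):
    # Eq. (5)-style objective: task loss plus a lambda-weighted GDP penalty;
    # larger lam trades prediction performance for lower GDP.
    return task_loss + lam * gdp_kernel(preds, s, bandwidth)
```

A prediction vector that is constant in the sensitive attribute incurs zero penalty, while predictions correlated with the attribute are penalized in proportion to the local-average gap.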
7 CONCLUSION

We generalize the demographic parity fairness metric to continuous sensitive attributes, named GDP, while preserving tractable computation. We theoretically justify that the proposed GDP unifies continuous and discrete sensitive attributes, and show the necessity of GDP by demonstrating its connection with the distance between the joint distribution and the product of marginal distributions. We also propose two GDP estimation methods, named histogram and kernel, with linear computational complexity via hard and soft group strategies, and provide the corresponding estimation error analysis. For the superiority of kernel estimation, we provably demonstrate its faster estimation error convergence rate compared with histogram estimation, and experimentally show better bias mitigation performance across multiple domains, multiple tasks and compositional sensitive attributes.

8 ACKNOWLEDGEMENTS

We would like to sincerely thank everyone who has provided their generous feedback for this work. We thank the anonymous reviewers for their thorough comments and suggestions. This work was supported in part by the X-Grant project, National Science Foundation grants IIS-1939716, IIS-1900990 and IIS-1750074, and a JPMorgan Faculty Research Award.

REFERENCES

Solon Barocas, Moritz Hardt, and Arvind Narayanan. Fairness in machine learning. NIPS Tutorial, 1:2017, 2017.

Mohamed Ishmael Belghazi, Aristide Baratin, Sai Rajeshwar, Sherjil Ozair, Yoshua Bengio, Aaron Courville, and Devon Hjelm. Mutual information neural estimation. In International Conference on Machine Learning, pp. 531-540. PMLR, 2018.

Rachel KE Bellamy, Kuntal Dey, Michael Hind, Samuel C Hoffman, Stephanie Houde, Kalapriya Kannan, Pranay Lohia, Jacquelyn Martino, Sameep Mehta, Aleksandra Mojsilovic, et al. AI Fairness 360: An extensible toolkit for detecting, understanding, and mitigating unwanted algorithmic bias. arXiv preprint arXiv:1810.01943, 2018.

Alex Beutel, Jilin Chen, Zhe Zhao, and Ed H Chi.
Data decisions and theoretical implications when adversarially learning fair representations. arXiv preprint arXiv:1707.00075, 2017.

HJ Bierens. The Nadaraya-Watson kernel regression function estimator. Faculty of Economics and Business Administration, Vrije Universiteit Amsterdam, Serie Research Memoranda, (1988-58), 1988.

US Census Bureau. American Community Survey 5-year data, 2021. URL https://www.census.gov/programs-surveys/acs/technical-documentation/table-and-geography-changes/2018/5-year.html.

Tianqi Chen and Carlos Guestrin. XGBoost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 785-794, 2016.

Jaewoong Cho, Gyeongjo Hwang, and Changho Suh. A fair classifier using kernel density estimation. Advances in Neural Information Processing Systems, 33:15088-15099, 2020.

Ching-Yao Chuang and Youssef Mroueh. Fair mixup: Fairness via interpolation. In International Conference on Learning Representations, 2020.

Elliot Creager, David Madras, Jörn-Henrik Jacobsen, Marissa Weis, Kevin Swersky, Toniann Pitassi, and Richard Zemel. Flexibly fair representation learning by disentanglement. In International Conference on Machine Learning, pp. 1436-1445. PMLR, 2019.

Cuebiq. Data for good: Location intelligence for good is our contribution to the scientific community, 2021. URL https://www.cuebiq.com/about/data-for-good/.

da Xu, chuanwei ruan, evren korpeoglu, sushant kumar, and kannan achan. Inductive representation learning on temporal graphs. In International Conference on Learning Representations, 2020.

Richard A Davis, Keh-Shin Lii, and Dimitris N Politis. Remarks on some nonparametric estimates of a density function. In Selected Works of Murray Rosenblatt, pp. 95-100. Springer, 2011.

Mengnan Du, Fan Yang, Na Zou, and Xia Hu. Fairness in deep learning: A computational perspective. IEEE Intelligent Systems, 2020.
Mengnan Du, Subhabrata Mukherjee, Guanchu Wang, Ruixiang Tang, Ahmed Hassan Awadallah, and Xia Hu. Fairness via representation neutralization. arXiv preprint arXiv:2106.12674, 2021.

Vassiliy A Epanechnikov. Non-parametric estimation of a multivariate probability density. Theory of Probability & Its Applications, 14(1):153-158, 1969.

Michael Feldman, Sorelle A Friedler, John Moeller, Carlos Scheidegger, and Suresh Venkatasubramanian. Certifying and removing disparate impact. In Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 259-268, 2015.

Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. Advances in Neural Information Processing Systems, 27, 2014.

Vincent Grari, Sylvain Lamprier, and Marcin Detyniecki. Fairness-aware neural Rényi minimization for continuous features. In Proceedings of the 29th International Joint Conference on Artificial Intelligence, IJCAI'20, 2020.

Xiaotian Han, Zhimeng Jiang, Ninghao Liu, Qingquan Song, Jundong Li, and Xia Hu. Geometric graph representation learning via maximizing rate reduction. In Proceedings of the Web Conference, 2022.

Moritz Hardt, Eric Price, and Nati Srebro. Equality of opportunity in supervised learning. Advances in Neural Information Processing Systems, 29:3315-3323, 2016.

Akshita Jha, Bhanukiran Vinzamuri, and Chandan K Reddy. Fair representation learning using interpolation enabled disentanglement. arXiv preprint arXiv:2108.00295, 2021.

Zhimeng Jiang, Xiaotian Han, Chao Fan, Zirui Liu, Na Zou, Ali Mostafavi, and Xia Hu. FMP: Toward fair graph message passing against topology bias. arXiv preprint arXiv:2202.04187, 2022.

Thomas N. Kipf and Max Welling. Semi-supervised classification with graph convolutional networks. In International Conference on Learning Representations (ICLR), 2017.
Günter Klambauer, Thomas Unterthiner, Andreas Mayr, and Sepp Hochreiter. Self-normalizing neural networks. In Proceedings of the 31st International Conference on Neural Information Processing Systems, pp. 972-981, 2017.

Ron Kohavi. Scaling up the accuracy of naive-Bayes classifiers: a decision-tree hybrid. In Second International Conference on Knowledge Discovery and Data Mining, pp. 202-207, 1996.

Christos Louizos, Kevin Swersky, Yujia Li, Max Welling, and Richard S Zemel. The variational fair autoencoder. In ICLR, 2016.

Gilles Louppe, Michael Kagan, and Kyle Cranmer. Learning to pivot with adversarial networks. In Proceedings of the 31st International Conference on Neural Information Processing Systems, pp. 982-991, 2017.

Andrew Lowy, Rakesh Pavan, Sina Baharlouei, Meisam Razaviyayn, and Ahmad Beirami. FERMI: Fair empirical risk minimization via exponential Rényi mutual information. arXiv preprint arXiv:2102.12586, 2021.

David Madras, Elliot Creager, Toniann Pitassi, and Richard Zemel. Learning adversarially fair and transferable representations. In International Conference on Machine Learning, pp. 3384-3393. PMLR, 2018.

Jérémie Mary, Clément Calauzènes, and Noureddine El Karoui. Fairness-aware learning for continuous attributes and treatments. In International Conference on Machine Learning, pp. 4382-4391. PMLR, 2019.

Ninareh Mehrabi, Fred Morstatter, Nripsuta Saxena, Kristina Lerman, and Aram Galstyan. A survey on bias and fairness in machine learning. ACM Computing Surveys (CSUR), 54(6):1-35, 2021.

Hamse Y Mussa. The Aitchison and Aitken kernel function revisited. Journal of Mathematics Research, 5(1):22, 2013.

Elizbar A Nadaraya. On estimating regression. Theory of Probability & Its Applications, 9(1):141-142, 1964.

Emanuel Parzen. On estimation of a probability density function and mode. The Annals of Mathematical Statistics, 33(3):1065-1076, 1962.
Ben Poole, Sherjil Ozair, Aaron Van Den Oord, Alex Alemi, and George Tucker. On variational bounds of mutual information. In International Conference on Machine Learning, pp. 5171-5180. PMLR, 2019.

Yuji Roh, Kangwook Lee, Steven Whang, and Changho Suh. FR-Train: A mutual information-based approach to fair and robust training. In International Conference on Machine Learning, pp. 8147-8157. PMLR, 2020.

Christopher W. Tessum, David A. Paolella, Sarah E. Chambliss, Joshua S. Apte, Jason D. Hill, and Julian D. Marshall. PM2.5 polluters disproportionately and systemically affect people of color in the United States. Science Advances, 7(18):eabf4491, 2021. doi: 10.1126/sciadv.abf4491. URL https://www.science.org/doi/abs/10.1126/sciadv.abf4491.

Petar Veličković, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Liò, and Yoshua Bengio. Graph attention networks. In International Conference on Learning Representations, 2018.

Martin T Wells, George Casella, and Christian P Robert. Generalized accept-reject sampling schemes. In A Festschrift for Herman Rubin, pp. 342-347. Institute of Mathematical Statistics, 2004.

Blake Woodworth, Suriya Gunasekar, Mesrob I Ohannessian, and Nathan Srebro. Learning non-discriminatory predictors. In Conference on Learning Theory, pp. 1920-1953. PMLR, 2017.

Felix Wu, Amauri Souza, Tianyi Zhang, Christopher Fifty, Tao Yu, and Kilian Weinberger. Simplifying graph convolutional networks. In International Conference on Machine Learning, pp. 6861-6871. PMLR, 2019.

Muhammad Bilal Zafar, Isabel Valera, Manuel Gomez Rodriguez, and Krishna P Gummadi. Fairness constraints: Mechanisms for fair classification. In Artificial Intelligence and Statistics, pp. 962-970. PMLR, 2017.

Rich Zemel, Yu Wu, Kevin Swersky, Toni Pitassi, and Cynthia Dwork. Learning fair representations. In International Conference on Machine Learning, pp. 325-333. PMLR, 2013.
Han Zhao, Jianfeng Chi, Yuan Tian, and Geoffrey J Gordon. Trade-offs and guarantees of adversarial representation learning for information obfuscation. Advances in Neural Information Processing Systems, 33, 2020.

Ling Zhao, Yujiao Song, Chao Zhang, Yu Liu, Pu Wang, Tao Lin, Min Deng, and Haifeng Li. T-GCN: A temporal graph convolutional network for traffic prediction. IEEE Transactions on Intelligent Transportation Systems, 21(9):3848-3858, 2019.

A PROOF OF THEOREM 1

For a binary sensitive attribute $S \in \{0, 1\}$, the sensitive attribute follows a Bernoulli distribution $\big(P_S(S=0), P_S(S=1)\big)$. Therefore, the global prediction average is $m_{avg} = P_S(S=0)\,m(0) + P_S(S=1)\,m(1)$, and GDP is
$$\begin{aligned}
\text{GDP} &= P_S(S=0)\,\big|m(0) - m_{avg}\big| + P_S(S=1)\,\big|m(1) - m_{avg}\big| \\
&= P_S(S=0)\,\big|m(0) - P_S(S=0)m(0) - P_S(S=1)m(1)\big| + P_S(S=1)\,\big|P_S(S=0)m(0) + P_S(S=1)m(1) - m(1)\big| \\
&= 2\,P_S(S=0)\,P_S(S=1)\,\big|m(0) - m(1)\big| = 2\,P_S(S=0)\,P_S(S=1) \cdot \text{DP},
\end{aligned}$$
where the coefficient $2\,P_S(S=0)\,P_S(S=1)$ only depends on the data. This fact demonstrates the applicability of GDP to categorical sensitive attributes with a discrete measure choice for sensitive attributes, and its equivalence, up to a data-dependent constant, to demographic parity for a binary sensitive attribute.

B PROOF OF THEOREM 2

Notice that the fairness constraint ideally requires the independence of the prediction $\hat{Y}$ and the sensitive attribute $S$, i.e., the joint distribution equals the product of the marginal distributions: $P_{\hat{Y},S}(\hat{y}, s) = P_S(s)\,P_{\hat{Y}}(\hat{y})$. Aiming to measure the deviation from independence, a natural intuition is to quantify the probability deviation via the distance between the joint distribution and the product of the marginal distributions. We show the connection between the proposed GDP and the prediction-weighted total variation distance of these two distributions.
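The Theorem 1 identity above, $\text{GDP} = 2\,P_S(S=0)\,P_S(S=1)\cdot\text{DP}$ for a binary attribute, admits a quick numerical sanity check; the helper names below are ours, not the paper's.

```python
def gdp_binary(p0, m0, m1):
    # GDP from its definition for binary S: sum over s of P(S=s) * |m(s) - m_avg|.
    p1 = 1.0 - p0
    m_avg = p0 * m0 + p1 * m1
    return p0 * abs(m0 - m_avg) + p1 * abs(m1 - m_avg)

def dp_gap(m0, m1):
    # Demographic parity gap |m(0) - m(1)|.
    return abs(m0 - m1)
```

For instance, with $P_S(S=0)=0.3$, $m(0)=0.8$, $m(1)=0.2$, both sides evaluate to $2 \cdot 0.3 \cdot 0.7 \cdot 0.6 = 0.252$.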
Recall the definition of GDP; it is easy to obtain
$$\begin{aligned}
\text{GDP} &= \int_0^1 \big|m(s) - m_{avg}\big|\, P_S(s)\, ds
= \int_0^1 \Big| \int_0^1 \hat{y}\, P_{\hat{Y}|S}(\hat{y}|s)\, d\hat{y} - \int_0^1 \hat{y}\, P_{\hat{Y}}(\hat{y})\, d\hat{y} \Big|\, P_S(s)\, ds \\
&= \int_0^1 \Big| \int_0^1 \hat{y}\, \big( P_{\hat{Y},S}(\hat{y}, s) - P_S(s)\,P_{\hat{Y}}(\hat{y}) \big)\, d\hat{y} \Big|\, ds
\le \int_0^1 \int_0^1 \hat{y}\, \big| P_{\hat{Y},S}(\hat{y}, s) - P_S(s)\,P_{\hat{Y}}(\hat{y}) \big|\, d\hat{y}\, ds \\
&= TV_{\hat{y}}\big( P_{\hat{Y},S}(\hat{y}, s),\; P_S(s)\,P_{\hat{Y}}(\hat{y}) \big),
\end{aligned}$$
which completes the proof.

C PROOF OF THEOREM 3

Notice that histogram and kernel estimation require local prediction average and probability density function estimation for the sensitive attribute. We start with the analysis of the local prediction average and probability density function estimation, and then provide the proof of the GDP error analysis.

C.1 PROOF FOR HISTOGRAM ESTIMATION

Recall that the number of bins is $N_h$ with the same bandwidth $h$, and the bin intervals are $B_1 = [0, h),\ B_2 = [h, 2h),\ \dots,\ B_{N_h} = [(N_h - 1)h, 1]$. We first separately analyze the errors of the local prediction average and the sensitive attribute probability; these two estimation error results are then combined for the GDP estimation error. Since continuous sensitive attributes are considered, we define the probability density function of the sensitive attribute $S$ as $p_S(s)$. The estimated probability density function is given by
$$\hat{p}^h_S(s) = \frac{1}{h}\, \hat{P}^h_S(s \in B_i) = \frac{1}{Nh} \sum_{n=1}^N \mathbb{I}(s_n \in B_i) \quad \text{for } s \in B_i. \tag{7}$$
Subsequently, we define the pointwise MSE error $\text{MSE}^{pdf}_{hist}(s)$ to measure the probability density function estimation error at sensitive attribute $s$ as
$$\text{MSE}^{pdf}_{hist}(s) = \mathbb{E}\big[ |\hat{p}^h_S(s) - p_S(s)|^2 \big], \tag{8}$$
where the expectation is taken across $N$ samples. Then we have the following Lemma 1 on the optimal pointwise MSE error for the probability density function:

Lemma 1. Assume the mean prediction function $m(s)$, given sensitive attribute $s$, is smooth and satisfies the $L$-Lipschitz condition $|m(s) - m(s')| \le L|s - s'|$ for any $s, s'$. Given $N$ i.i.d. samples $\{(\hat{y}_n, s_n), n \in [N]\}$, the optimal MSE error $\min_h \text{MSE}^{pdf}_{hist}(s)$ is $O(N^{-2/3})$, where the optimal bandwidth satisfies $h^* = O(N^{-1/3})$ for any sensitive attribute $s$.
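The histogram density estimate of Eq. (7) can be sketched in a few lines of NumPy; the sensitive attribute is assumed normalized to $[0,1]$, and, following Lemma 1, the number of bins is taken on the order of $N^{1/3}$ (all names are ours).

```python
import numpy as np

def histogram_pdf(samples, n_bins):
    # Eq. (7)-style histogram density on [0, 1] with bandwidth h = 1 / n_bins;
    # Lemma 1 suggests h = O(N^{-1/3}), i.e. n_bins = O(N^{1/3}).
    samples = np.asarray(samples, dtype=float)
    h = 1.0 / n_bins
    counts, _ = np.histogram(samples, bins=n_bins, range=(0.0, 1.0))
    def pdf(s):
        i = min(int(s / h), n_bins - 1)  # bin index of s (last bin is closed)
        return counts[i] / (len(samples) * h)
    return pdf
```

On uniform samples the estimate should hover near the true density of 1 everywhere on the unit interval.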
As for the local prediction average, similarly, we have the estimated local prediction average
$$\hat{m}^h(s) = \frac{\sum_{n=1}^N \mathbb{I}(s_n \in B_i)\, \hat{y}_n}{\sum_{n=1}^N \mathbb{I}(s_n \in B_i)} \quad \text{for } s \in B_i. \tag{9}$$
Subsequently, we also define the pointwise MSE error $\text{MSE}^{reg}_{hist}(s)$ to measure the local prediction average estimation error at sensitive attribute $s$ as
$$\text{MSE}^{reg}_{hist}(s) = \mathbb{E}\big[ |\hat{m}^h(s) - m(s)|^2 \big], \tag{10}$$
where the expectation is taken across $N$ samples. Then we have the following Lemma 2 on the optimal pointwise MSE error for the local prediction average:

Lemma 2. Assume the mean prediction function $m(s)$, given sensitive attribute $s$, is smooth and satisfies the $L$-Lipschitz condition $|m(s) - m(s')| \le L|s - s'|$ for any $s, s'$. Define the bounded prediction variance $\sigma^2(s) \le \sigma^2$, given sensitive attribute $s$, as $\sigma^2(s) = \mathbb{E}_{\hat{Y}|S}\big[(\hat{Y} - m(s))^2 \,\big|\, S = s\big]$. Given $N$ i.i.d. samples $\{(\hat{y}_n, s_n), n \in [N]\}$, the optimal MSE error $\min_h \text{MSE}^{reg}_{hist}(s)$ is $O(N^{-2/3})$, where the optimal bandwidth satisfies $h^* = O(N^{-1/3})$ for any sensitive attribute $s$.

Proof for histogram estimation: Based on the definition of MSE error in Eq. (6), we have
$$\begin{aligned}
\text{Err}_{hist} &= \mathbb{E}\big[ |\widehat{\text{GDP}} - \text{GDP}|^2 \big] \\
&\overset{(a)}{\le} \mathbb{E}\Big[ \Big\{ \int_0^1 \big|\, |\hat{m}^h(s) - \hat{m}_{avg}|\, \hat{p}^h_S(s) - |m(s) - m_{avg}|\, p_S(s) \,\big|\, ds \Big\}^2 \Big] \\
&\overset{(b)}{\le} \mathbb{E}\Big[ \Big\{ \int_0^1 \big( |\hat{m}^h(s) - \hat{m}_{avg}|\, |\hat{p}^h_S(s) - p_S(s)| + |m(s) - m_{avg} - \hat{m}^h(s) + \hat{m}_{avg}|\, p_S(s) \big)\, ds \Big\}^2 \Big] \\
&\overset{(c)}{\le} 2\, \mathbb{E}\Big[ \int_0^1 |\hat{m}^h(s) - \hat{m}_{avg}|^2\, ds \Big]\, \mathbb{E}\Big[ \int_0^1 |\hat{p}^h_S(s) - p_S(s)|^2\, ds \Big] \\
&\quad + 4\, \Big\{ \mathbb{E}\Big[ \int_0^1 |\hat{m}^h(s) - m(s)|^2\, p_S(s)\, ds \Big] + \mathbb{E}\Big[ \int_0^1 |\hat{m}^h_{avg} - m_{avg}|^2\, p_S(s)\, ds \Big] \Big\} \\
&\overset{(d)}{\le} 2 \cdot O(N^{-2/3}) + 4 \cdot O(N^{-2/3}) + 4 \cdot N^{-1} = O(N^{-2/3}),
\end{aligned}$$
where inequality (a) holds due to the absolute value inequality, inequality (b) holds due to $|a_1 b_1 - a_2 b_2| \le |a_1||b_1 - b_2| + |a_1 - a_2||b_2|$, inequality (c) holds based on $(a+b)^2 \le 2(a^2 + b^2)$ and the Cauchy-Schwarz inequality, and inequality (d) holds based on Lemmas 1 and 2.
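Combining the bin probabilities behind Eq. (7) with the local averages of Eq. (9) yields the full hard-group GDP estimate. A minimal NumPy sketch, assuming the sensitive attribute is normalized to $[0,1]$ (function names are ours):

```python
import numpy as np

def gdp_histogram(preds, s, n_bins):
    # Hard-group GDP estimate: bin the sensitive attribute, take each bin's
    # local prediction average (Eq. (9)), and weight the absolute gap to the
    # global average by the bin's empirical probability. The integral
    # \int |m(s) - m_avg| p_S(s) ds collapses to a sum over bins because the
    # bin weight p_hat(s) * h equals the bin's sample fraction.
    preds = np.asarray(preds, dtype=float)
    s = np.asarray(s, dtype=float)
    idx = np.minimum((s * n_bins).astype(int), n_bins - 1)
    m_avg = preds.mean()
    gdp = 0.0
    for b in range(n_bins):
        mask = idx == b
        if mask.any():
            gdp += mask.mean() * abs(preds[mask].mean() - m_avg)
    return gdp
```

Constant predictions give zero GDP, while predictions that track the sensitive attribute incur a positive penalty.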
Note that the orders of the optimal bandwidth are the same in Lemmas 1 and 2, so the optimal bandwidth for minimizing the GDP MSE is $O(N^{-1/3})$.

C.2 KERNEL ESTIMATION

Recall that we assume the kernel function satisfies the normalization condition $\int K(s)\, ds = 1$, symmetry $\int s\, K(s)\, ds = 0$, and finite variance $\sigma^2_K = \int s^2 K(s)\, ds > 0$. We also define $\tilde{\sigma}^2_K = \int K^2(y)\, dy$. Similarly, the probability density function of the sensitive attribute is $p_S(s)$. The estimated probability density function is given by
$$\tilde{p}^h_S(s) = \frac{1}{Nh} \sum_{n=1}^N K\Big( \frac{s_n - s}{h} \Big). \tag{11}$$
Subsequently, we define the pointwise MSE error $\text{MSE}^{pdf}_{kernel}(s)$ to measure the probability density function estimation error at sensitive attribute $s$ as
$$\text{MSE}^{pdf}_{kernel}(s) = \mathbb{E}\big[ |\tilde{p}^h_S(s) - p_S(s)|^2 \big], \tag{12}$$
where the expectation is taken across $N$ samples. Then we have the following Lemma 3 on the optimal pointwise MSE error for the probability density function:

Lemma 3. Given $N$ i.i.d. samples $\{(\hat{y}_n, s_n), n \in [N]\}$, the optimal MSE error $\min_h \text{MSE}^{pdf}_{kernel}(s)$ is $O(N^{-4/5})$, where the optimal bandwidth satisfies $h^* = O(N^{-1/5})$ for any sensitive attribute $s$.

As for the local prediction average, similarly, we have the local prediction average estimate
$$\tilde{m}(s) = \frac{\sum_{n=1}^N K\big( \frac{s_n - s}{h} \big)\, \hat{y}_n}{\sum_{n=1}^N K\big( \frac{s_n - s}{h} \big)}. \tag{13}$$
Subsequently, we also define the pointwise MSE error $\text{MSE}^{reg}_{kernel}(s)$ to measure the local prediction average estimation error at sensitive attribute $s$ as
$$\text{MSE}^{reg}_{kernel}(s) = \mathbb{E}\big[ |\tilde{m}^h(s) - m(s)|^2 \big], \tag{14}$$
where the expectation is taken across $N$ samples. Then we have the following Lemma 4 on the optimal pointwise MSE error for the local prediction average:

Lemma 4. Define the bounded prediction variance $\sigma^2(s) \le \sigma^2$, given sensitive attribute $s$, as $\sigma^2(s) = \mathbb{E}_{\hat{Y}|S}\big[(\hat{Y} - m(s))^2 \,\big|\, S = s\big]$. Given $N$ i.i.d. samples $\{(\hat{y}_n, s_n), n \in [N]\}$, the optimal MSE error $\min_h \text{MSE}^{reg}_{kernel}(s)$ is $O(N^{-4/5})$, where the optimal bandwidth satisfies $h^* = O(N^{-1/5})$ for any sensitive attribute $s$.

Proof for kernel estimation: Based on the definition of MSE error in Eq.
(6), we have
$$\begin{aligned}
\text{Err}_{kernel} &= \mathbb{E}\big[ |\widetilde{\text{GDP}} - \text{GDP}|^2 \big] \\
&\overset{(e)}{\le} \mathbb{E}\Big[ \Big\{ \int_0^1 \big|\, |\tilde{m}^h(s) - \tilde{m}_{avg}|\, \tilde{p}^h_S(s) - |m(s) - m_{avg}|\, p_S(s) \,\big|\, ds \Big\}^2 \Big] \\
&\overset{(f)}{\le} \mathbb{E}\Big[ \Big\{ \int_0^1 \big( |\tilde{m}^h(s) - \tilde{m}_{avg}|\, |\tilde{p}^h_S(s) - p_S(s)| + |m(s) - m_{avg} - \tilde{m}^h(s) + \tilde{m}_{avg}|\, p_S(s) \big)\, ds \Big\}^2 \Big] \\
&\overset{(g)}{\le} 2\, \mathbb{E}\Big[ \int_0^1 |\tilde{m}^h(s) - \tilde{m}_{avg}|^2\, ds \Big]\, \mathbb{E}\Big[ \int_0^1 |\tilde{p}^h_S(s) - p_S(s)|^2\, ds \Big] \\
&\quad + 4\, \Big\{ \mathbb{E}\Big[ \int_0^1 |\tilde{m}^h(s) - m(s)|^2\, p_S(s)\, ds \Big] + \mathbb{E}\Big[ \int_0^1 |\tilde{m}^h_{avg} - m_{avg}|^2\, p_S(s)\, ds \Big] \Big\} \\
&\overset{(h)}{\le} 2 \cdot O(N^{-4/5}) + 4 \cdot O(N^{-4/5}) + 4 \cdot N^{-1} = O(N^{-4/5}),
\end{aligned}$$
where inequality (e) holds due to the absolute value inequality, inequality (f) holds due to $|a_1 b_1 - a_2 b_2| \le |a_1||b_1 - b_2| + |a_1 - a_2||b_2|$, inequality (g) holds based on $(a+b)^2 \le 2(a^2 + b^2)$ and the Cauchy-Schwarz inequality, and inequality (h) holds based on Lemmas 3 and 4. Note that the orders of the optimal bandwidth are the same in Lemmas 3 and 4, so the optimal bandwidth for minimizing the GDP MSE is $O(N^{-1/5})$.

D PROOF OF LEMMAS

D.1 PROOF OF LEMMA 1

Proof. Note that there exists a bias-variance tradeoff for $\text{MSE}^{pdf}_{hist}(s)$, i.e.,
$$\text{MSE}^{pdf}_{hist}(s) = \mathbb{E}\big[ |\hat{p}^h_S(s) - p_S(s)|^2 \big] = \underbrace{\big( \mathbb{E}[\hat{p}^h_S(s)] - p_S(s) \big)^2}_{[\text{Bias}^{pdf}_{hist}(s)]^2} + \underbrace{\mathbb{E}\big[ \big( \hat{p}^h_S(s) - \mathbb{E}[\hat{p}^h_S(s)] \big)^2 \big]}_{\text{Var}^{pdf}_{hist}(s)}.$$
Next we analyze the bias and variance of the probability density function MSE. For the bias part, note that, for $s \in B_i$, the expectation of the estimated probability density function satisfies
$$\mathbb{E}[\hat{p}^h_S(s)] = \frac{1}{h}\, P(s_n \in B_i) = \frac{\int_{(i-1)h}^{ih} p_S(s)\, ds}{h} = p_S(s^*), \quad s^* \in B_i,$$
where the last equality holds by the mean value theorem. Therefore, the bias satisfies $\text{Bias}^{pdf}_{hist}(s) = \big| \mathbb{E}[\hat{p}^h_S(s)] - p_S(s) \big| = |p_S(s^*) - p_S(s)| \le Lh$ by the Lipschitz condition. As for the variance part, note that the variance of a Bernoulli distribution with parameter $p$ is $p(1-p)$; for $s \in B_i$, we have
$$\text{Var}^{pdf}_{hist}(s) = \frac{1}{h^2}\, \mathbb{D}\Big[ \frac{1}{N} \sum_{n=1}^N \mathbb{I}(s_n \in B_i) \Big] = \frac{P(s_n \in B_i)\big[1 - P(s_n \in B_i)\big]}{N h^2} \le \frac{p_S(s^*)}{N h}.$$
Combining the bias and variance parts, we have
$$\text{MSE}^{pdf}_{hist}(s) = \big[\text{Bias}^{pdf}_{hist}(s)\big]^2 + \text{Var}^{pdf}_{hist}(s) \le L^2 h^2 + \frac{p_S(s^*)}{N h}.$$
It is easy to obtain the optimal bandwidth $h^* = \big[ \frac{p_S(s^*)}{2 L^2 N} \big]^{1/3} = O(N^{-1/3})$, and the minimized MSE is $O(N^{-2/3})$.

D.2 PROOF OF LEMMA 2

Proof.
Note that the local prediction average by histogram, for $s \in B_i$, is given by $\hat{m}(s) = \frac{\sum_{n=1}^N \mathbb{I}(s_n \in B_i)\, \hat{y}_n}{\sum_{n=1}^N \mathbb{I}(s_n \in B_i)}$. Define the normalized weight $w_n(s_n) = \frac{\mathbb{I}(s_n \in B_i)}{\sum_{n=1}^N \mathbb{I}(s_n \in B_i)}$; then the expected local prediction average is $\mathbb{E}[\hat{m}(s)] = \sum_{n=1}^N w_n(s_n)\, m(s_n)$. Similarly, we obtain the bias-variance tradeoff for the local prediction average error $\text{MSE}^{reg}_{hist}(s)$ as
$$\text{MSE}^{reg}_{hist}(s) = \mathbb{E}\big[ |\hat{m}^h(s) - m(s)|^2 \big] = \underbrace{\big( \mathbb{E}[\hat{m}^h(s)] - m(s) \big)^2}_{[\text{Bias}^{reg}_{hist}(s)]^2} + \underbrace{\mathbb{E}\big[ \big( \hat{m}^h(s) - \mathbb{E}[\hat{m}^h(s)] \big)^2 \big]}_{\text{Var}^{reg}_{hist}(s)}.$$
For the bias part, based on the Lipschitz condition on the mean prediction function, and noting that $\sum_{n=1}^N w_n(s_n) = 1$, we have
$$\text{Bias}^{reg}_{hist}(s) = \Big| \sum_{n=1}^N w_n(s_n)\, \big[ m(s_n) - m(s) \big] \Big| \le \sum_{n=1}^N w_n(s_n)\, \big| m(s_n) - m(s) \big| \le \sum_{n=1}^N w_n(s_n)\, Lh = Lh. \tag{16}$$
For the variance part, we have
$$\text{Var}^{reg}_{hist}(s) = \mathbb{D}\Big[ \sum_{n=1}^N w_n(s_n)\, \hat{y}_n \Big] = \mathbb{E}\Big[ \Big( \frac{\sum_{n=1}^N \mathbb{I}(s_n \in B_i)\, (\hat{y}_n - m(s_n))}{\sum_{n=1}^N \mathbb{I}(s_n \in B_i)} \Big)^2 \Big] \le \sum_{n=1}^N \mathbb{E}\Big[ \frac{\mathbb{I}(s_n \in B_i)\, \sigma^2}{\big( \sum_{n'=1}^N \mathbb{I}(s_{n'} \in B_i) \big)^2} \Big] \approx \frac{N \sigma^2 / N_h}{(N / N_h)^2} = \frac{\sigma^2 N_h}{N} = \frac{\sigma^2}{N h}.$$
Combining the bias and variance parts, we have
$$\text{MSE}^{reg}_{hist}(s) = \big[\text{Bias}^{reg}_{hist}(s)\big]^2 + \text{Var}^{reg}_{hist}(s) \le L^2 h^2 + \frac{\sigma^2}{N h}.$$
It is easy to obtain the optimal bandwidth $h^* = \big[ \frac{\sigma^2}{2 L^2 N} \big]^{1/3} = O(N^{-1/3})$, and the minimized MSE is $O(N^{-2/3})$.

D.3 PROOF OF LEMMA 3

Proof. Note that there exists a bias-variance tradeoff for $\text{MSE}^{pdf}_{kernel}(s)$, i.e.,
$$\text{MSE}^{pdf}_{kernel}(s) = \mathbb{E}\big[ |\tilde{p}^h_S(s) - p_S(s)|^2 \big] = \underbrace{\big( \mathbb{E}[\tilde{p}^h_S(s)] - p_S(s) \big)^2}_{[\text{Bias}^{pdf}_{kernel}(s)]^2} + \underbrace{\mathbb{E}\big[ \big( \tilde{p}^h_S(s) - \mathbb{E}[\tilde{p}^h_S(s)] \big)^2 \big]}_{\text{Var}^{pdf}_{kernel}(s)}.$$
Next we analyze the bias and variance of the probability density function MSE. For the bias part, the expectation of the estimated probability density function satisfies
$$\mathbb{E}[\tilde{p}^h_S(s)] - p_S(s) = \frac{1}{h}\, \mathbb{E}\Big[ K\Big( \frac{s_n - s}{h} \Big) \Big] - p_S(s) = \frac{1}{h} \int K\Big( \frac{s_n - s}{h} \Big)\, p_S(s_n)\, ds_n - p_S(s) = \int K(y)\, p_S(s + h y)\, dy - p_S(s), \tag{19}$$
where the last equality holds by the change of variable $y = \frac{s_n - s}{h}$. By Taylor expansion, when $h$ is small, we have
$$p_S(s + h y) = p_S(s) + h y\, p'_S(s) + \frac{h^2 y^2}{2}\, p''_S(s) + o(h^2). \tag{20}$$
Based on Eqs.
(19) and (20), we have
$$\text{Bias}^{pdf}_{kernel}(s) = \big| \mathbb{E}[\tilde{p}^h_S(s)] - p_S(s) \big| = \Big| p_S(s) \int K(y)\, dy + h\, p'_S(s) \int y\, K(y)\, dy + \frac{h^2}{2}\, p''_S(s) \int y^2 K(y)\, dy + o(h^2) - p_S(s) \Big| = \frac{h^2 \sigma^2_K}{2}\, \big| p''_S(s) \big| + o(h^2). \tag{21}$$
As for the variance part, we have
$$\begin{aligned}
\text{Var}^{pdf}_{kernel}(s) &= \mathbb{D}\Big[ \frac{1}{Nh} \sum_{n=1}^N K\Big( \frac{s_n - s}{h} \Big) \Big] = \frac{1}{N h^2}\, \mathbb{D}\Big[ K\Big( \frac{s_n - s}{h} \Big) \Big] \le \frac{1}{N h^2}\, \mathbb{E}\Big[ K^2\Big( \frac{s_n - s}{h} \Big) \Big] \\
&= \frac{1}{N h} \int K^2(y)\, p_S(s + h y)\, dy = \frac{1}{N h} \int K^2(y)\, \big[ p_S(s) + h y\, p'_S(s) + o(h) \big]\, dy = \frac{p_S(s)\, \tilde{\sigma}^2_K}{N h} + o\Big( \frac{1}{N h} \Big), \tag{22}
\end{aligned}$$
where $\tilde{\sigma}^2_K = \int K^2(y)\, dy$. Combining the bias and variance parts, we have
$$\text{MSE}^{pdf}_{kernel}(s) = \big[ \text{Bias}^{pdf}_{kernel}(s) \big]^2 + \text{Var}^{pdf}_{kernel}(s) \le \frac{h^4 |p''_S(s)|^2 \sigma^4_K}{4} + \frac{p_S(s)\, \tilde{\sigma}^2_K}{N h} + o(h^4) + o\Big( \frac{1}{N h} \Big) = O(h^4) + O\Big( \frac{1}{N h} \Big).$$
It is easy to obtain that the optimal bandwidth is $h^* = O(N^{-1/5})$, and thus the minimized MSE is $O(N^{-4/5})$.

D.4 PROOF OF LEMMA 4

Proof. Recall that the mean prediction conditioned on sensitive attribute $s$ is $m(s) = \mathbb{E}[\hat{Y} | S = s]$; we rewrite the prediction as $\hat{Y} = m(S) + e$, where $e$ is the regression noise and satisfies $\mathbb{E}[e] = 0$ and $\mathbb{E}[e^2 | S = s] = \sigma^2(s)$. Note that the prediction can be written $\hat{Y} = m(s) + [m(S) - m(s)] + e$, so we have
$$\frac{1}{Nh} \sum_{n=1}^N K\Big( \frac{s_n - s}{h} \Big)\, \hat{y}_n = \frac{1}{Nh} \sum_{n=1}^N K\Big( \frac{s_n - s}{h} \Big)\, m(s) + \frac{1}{Nh} \sum_{n=1}^N K\Big( \frac{s_n - s}{h} \Big)\, \big[ m(s_n) - m(s) \big] + \frac{1}{Nh} \sum_{n=1}^N K\Big( \frac{s_n - s}{h} \Big)\, e_n = \tilde{p}_S(s)\, m(s) + \tilde{m}_1(s) + \tilde{m}_2(s).$$
Based on Eq. (13), we have
$$\tilde{m}(s) = m(s) + \frac{\tilde{m}_1(s)}{\tilde{p}_S(s)} + \frac{\tilde{m}_2(s)}{\tilde{p}_S(s)}. \tag{24}$$
Since $\mathbb{E}[e | S = s] = 0$, we have $\mathbb{E}[\tilde{m}_2(s)] = 0$, because $\mathbb{E}\big[ K\big( \frac{s_n - s}{h} \big)\, e \big] = \mathbb{E}\big[ K\big( \frac{s_n - s}{h} \big)\, \mathbb{E}[e | S = s_n] \big] = 0$. As for the variance of $\tilde{m}_2(s)$, we have
$$\mathbb{D}[\tilde{m}_2(s)] = \frac{1}{N h^2}\, \mathbb{E}\Big[ K^2\Big( \frac{s_n - s}{h} \Big)\, e^2 \Big] = \frac{1}{N h^2} \int K^2\Big( \frac{s_n - s}{h} \Big)\, \sigma^2(s_n)\, p_S(s_n)\, ds_n = \frac{1}{N h} \int K^2(y)\, \sigma^2(s + y h)\, p_S(s + y h)\, dy = \frac{\tilde{\sigma}^2_K\, \sigma^2(s)\, p_S(s)}{N h} + o\Big( \frac{1}{N h} \Big).$$
Subsequently, we consider the expectation and variance of $\tilde{m}_1(s)$. Specifically, for the expectation, we have
$$\begin{aligned}
\mathbb{E}[\tilde{m}_1(s)] &= \frac{1}{h}\, \mathbb{E}\Big[ K\Big( \frac{s_n - s}{h} \Big)\, \big( m(s_n) - m(s) \big) \Big] = \frac{1}{h} \int K\Big( \frac{s_n - s}{h} \Big)\, \big( m(s_n) - m(s) \big)\, p_S(s_n)\, ds_n \\
&= \int K(y)\, \big( m(s + h y) - m(s) \big)\, p_S(s + h y)\, dy = \int K(y)\, \Big( h y\, m'(s) + \frac{y^2 h^2}{2}\, m''(s) \Big)\, \big( p_S(s) + h y\, p'_S(s) \big)\, dy + o(h^2) \\
&= \sigma^2_K h^2\, \Big( \frac{m''(s)\, p_S(s)}{2} + m'(s)\, p'_S(s) \Big) + o(h^2). \qquad (26)
\end{aligned}$$
As for the variance of $\tilde{m}_1(s)$, we have
$$\mathbb{D}[\tilde{m}_1(s)] = \frac{1}{N h^2}\, \mathbb{D}\Big[ K\Big( \frac{s_n - s}{h} \Big)\, \big( m(s_n) - m(s) \big) \Big] \le \frac{1}{N h^2} \int K^2\Big( \frac{s_n - s}{h} \Big)\, \big( m(s_n) - m(s) \big)^2\, p_S(s_n)\, ds_n = O\Big( \frac{h^2}{N h} \Big) = o\Big( \frac{1}{N h} \Big),$$
since $m(s_n) - m(s) = O(h)$ on the kernel window. Combining the bias and variance parts, we have
$$\text{MSE}^{reg}_{kernel}(s) = \big[ \text{Bias}^{reg}_{kernel}(s) \big]^2 + \text{Var}^{reg}_{kernel}(s) = O(h^4) + O\Big( \frac{1}{N h} \Big).$$
Based on the inequality of arithmetic and geometric means, it is easy to obtain the optimal bandwidth $h^* = O(N^{-1/5})$, and the minimized MSE is $O(N^{-4/5})$.

E PROOF OF THEOREM 4

Given the predictor $\hat{Y} = f(X)$, the adversary $\hat{S} = g(\hat{Y})$, and the adversary utility $L_{adv}(\hat{S}, S) = 1 - |\hat{S} - S|$, the GDP and adversary utility are given by
$$\text{GDP} = \mathbb{E}_S\Big[ \big| \mathbb{E}_{X|S}[f(X)] - \mathbb{E}_X[f(X)] \big| \Big]; \qquad L_{adv} = \mathbb{E}_S\Big[ \mathbb{E}_{X|S}\big[ 1 - \big| S - g\big( f(X) \big) \big| \big] \Big].$$
Intuitively, a higher model prediction implies a larger sensitive attribute if the mean prediction function $m(s)$ is closer to $s$ than to $1 - s$, and vice versa. Therefore, we construct the adversary $g^{\#}$ as follows:
$$g^{\#}\big( f(X) \big) = \begin{cases} f(X), & \text{if } \mathbb{E}_S\big[ |S - m(S)| \big] \le \mathbb{E}_S\big[ \big| S - (1 - m(S)) \big| \big]; \\ 1 - f(X), & \text{otherwise}. \end{cases}$$
Suppose without loss of generality (WLOG) that $\mathbb{E}_S[|S - m(S)|] \le \mathbb{E}_S\big[ |S - (1 - m(S))| \big]$, i.e., a higher model prediction implies a larger sensitive attribute. Then the adversary utility satisfies
$$L_{adv}\big( g^{\#}(f(X)), S \big) = \mathbb{E}_S\Big[ \mathbb{E}_{X|S}\big[ 1 - |S - f(X)| \big] \Big] \ge \mathbb{E}_S\Big[ 1 - \big| S - \mathbb{E}_{X|S}[f(X)] \big| \Big] = \mathbb{E}_S\Big[ 1 - \big| S - m(S) \big| \Big],$$
where the inequality holds due to Jensen's inequality and the convexity of $|x - t|$ for any constant $t$. Next we show that, under the constructed adversary $g^{\#}$, the adversarial utility satisfies $L_{adv} \ge \frac{1}{2} \ge \text{GDP}$. Firstly, notice that, for any function $m(S) \in [0, 1]$ and $S \in [0, 1]$, we have
$$|S - m(S)| + \big| S - (1 - m(S)) \big| \le \max\Big\{ |1 - m(S)| + \big| 1 - (1 - m(S)) \big|,\; |0 - m(S)| + \big| 0 - (1 - m(S)) \big| \Big\} = 1,$$
which implies that
$$\mathbb{E}_S\Big[ 1 - |S - m(S)| \Big] \ge 1 - \frac{1}{2}\Big( \mathbb{E}_S\big[ |S - m(S)| \big] + \mathbb{E}_S\big[ |S - (1 - m(S))| \big] \Big) = 1 - \frac{1}{2}\, \mathbb{E}_S\Big[ |S - m(S)| + \big| S - (1 - m(S)) \big| \Big] \ge \frac{1}{2}.$$
As for the analysis of GDP, we consider the worst case of the mean prediction function $m(s)$, since GDP satisfies $\text{GDP} = \mathbb{E}_S\big[ \big| m(S) - \mathbb{E}_S[m(S)] \big| \big]$.
Note that the function $|x|$ is convex, and the maximum of a convex function over all distributions on a finite support with a given first moment is achieved by a two-point distribution on the extreme points; hence GDP achieves its maximal value when $m(S) \in \{0, 1\}$. Define $p_{pos} = P\big( m(S) = 1 \big)$; then $\mathbb{E}_S[m(S)] = p_{pos}$ and
$$\text{GDP} = p_{pos} (1 - p_{pos}) + (1 - p_{pos})\, p_{pos} = 2\, p_{pos} (1 - p_{pos}) \le \frac{1}{2}.$$
Therefore, for the optimal adversary $g^*$, we have
$$L_{adv}\big( g^*(f(X)), S \big) \ge L_{adv}\big( g^{\#}(f(X)), S \big) \ge \frac{1}{2} \ge \text{GDP}. \tag{28}$$

F DATA STATISTICS

For fair comparison with previous work, we perform classification and regression tasks on five datasets: Crimes, Adult, Pokec-n, Pokec-z, and Harris. The first four datasets have been widely adopted to study fairness problems in tabular and graph data, while the Harris dataset was collected by ourselves for temporal graph data. Table 1 presents additional information on the real-world tabular, graph and temporal graph datasets. In the task type column, Reg and Clf represent the regression task and classification task, respectively.

Table 1: Statistical information on datasets.

Data Type | Dataset | Task Type | # Nodes/Samples | # Edges | # Features | Metric
Tabular | Crimes | Reg | 1994 | n/a | 121 | MAE, GDP
Tabular | Adult | Clf | 45222 | n/a | 13 | Acc, GDP
Graph | Pokec-n | Clf | 66569 | 729129 | 59 | Acc, GDP
Graph | Pokec-z | Clf | 67797 | 882765 | 59 | Acc, GDP
Temporal Graph | Harris | Clf | 4204 | 19946 | 36 | Acc, GDP

G MORE DETAILS ON SYNTHETIC EXPERIMENTS

G.1 GDP CALCULATION

We first provide a ground-truth analysis for the synthetic experiments so that we can evaluate the error of the two proposed GDP estimation methods. Consider a bivariate Gaussian distribution with mean $\mu = [\mu_1, \mu_2]$ and covariance matrix $\Sigma = \begin{pmatrix} \sigma_{11} & \sigma_{12} \\ \sigma_{21} & \sigma_{22} \end{pmatrix}$. Note that the covariance matrix is positive definite, so it is easy to obtain the inverse covariance matrix $\Sigma^{-1} = \begin{pmatrix} \lambda_{11} & \lambda_{12} \\ \lambda_{21} & \lambda_{22} \end{pmatrix}$, where $\lambda_{22} = \frac{\sigma_{11}}{|\Sigma|}$. The joint distribution of $(S, \hat{Y})$ follows
$$p_{S,\hat{Y}}(s, \hat{y}) = \frac{1}{2\pi \sqrt{|\Sigma|}} \exp\Big( -\frac{1}{2}\, \lambda_{11} (s - \mu_1)^2 - \frac{1}{2}\, \lambda_{22} (\hat{y} - \mu_2)^2 + \lambda_{12} (s - \mu_1)(\hat{y} - \mu_2) \Big).$$
Based on probability theory, we can obtain the conditional density $p_{\hat{Y}|S}(\hat{y}|s) = \frac{p_{S,\hat{Y}}(s, \hat{y})}{p_S(s)}$, which is Gaussian with mean $\frac{\sigma_{11}\mu_2 + \sigma_{12}(s - \mu_1)}{\sigma_{11}}$ and variance $\frac{1}{\lambda_{22}}$. This means the mean prediction function is $m(s) = \frac{\sigma_{11}\mu_2 + \sigma_{12}(s - \mu_1)}{\sigma_{11}} = \mu_2 + \frac{\sigma_{12}}{\sigma_{11}}(s - \mu_1)$. Notice that the probability density function of the sensitive attribute is also Gaussian, $S \sim \mathcal{N}(\mu_1, \sigma_{11})$; therefore the GDP is
$$\text{GDP} = \int |m(s) - \mu_2|\, p_S(s)\, ds = \int \Big| \frac{\sigma_{12}(s - \mu_1)}{\sigma_{11}} \Big|\, \frac{1}{\sqrt{2\pi\sigma_{11}}} \exp\Big( -\frac{(s - \mu_1)^2}{2\sigma_{11}} \Big)\, ds = \frac{2\sigma_{12}}{\sqrt{2\pi\sigma_{11}}}.$$
Next, we calculate the GDP for the second synthetic probability density function $p_{S,\hat{Y}}(s, \hat{y}) = s + \hat{y}$ for $0 \le s, \hat{y} \le 1$. It is easy to obtain the conditional probability $p_{\hat{Y}|S}(\hat{y}|s) = \frac{s + \hat{y}}{s + \frac{1}{2}}$ for $0 \le s, \hat{y} \le 1$, and thus the mean prediction function $m(s) = \mathbb{E}[\hat{Y} | S = s] = \frac{\frac{s}{2} + \frac{1}{3}}{s + \frac{1}{2}} = \frac{3s + 2}{6s + 3}$. Similarly, the marginal density of the sensitive attribute is $p_S(s) = s + \frac{1}{2}$ for $0 \le s \le 1$. Thus, the GDP satisfies
$$\text{GDP} = \int_0^1 \big| m(s) - \mathbb{E}_S[m(S)] \big|\, p_S(s)\, ds.$$

G.2 MORE SYNTHETIC EXPERIMENT RESULTS ON ESTIMATION ERROR

Figure 6 shows the local prediction average and sensitive attribute probability density function estimation results for different kernel function choices. The top two subfigures show the local prediction average estimation error w.r.t. the bandwidth choice. It is seen that the tricube and Aitchison-Aitken kernel functions achieve better and more robust local prediction average estimation compared with the Gaussian kernel function. The middle subfigures show the local prediction average results for different sensitive attribute values, and the bottom two subfigures show the probability density function estimation results with different kernel and histogram choices. It is seen that kernel estimation produces smoother and more accurate probability density function estimates.
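The closed-form value $\frac{2\sigma_{12}}{\sqrt{2\pi\sigma_{11}}}$ derived in Section G.1 for the bivariate Gaussian can be cross-checked with a short Monte Carlo sketch (the function names are ours):

```python
import numpy as np

def gdp_gaussian_closed_form(sigma11, sigma12):
    # Closed-form GDP for the bivariate Gaussian (S, Yhat) of Section G.1:
    # GDP = 2 * sigma12 / sqrt(2 * pi * sigma11).
    return 2.0 * sigma12 / np.sqrt(2.0 * np.pi * sigma11)

def gdp_gaussian_monte_carlo(mu1, mu2, sigma11, sigma12, n=200_000, seed=0):
    # GDP = E_S |m(S) - mu2| with m(s) = mu2 + (sigma12 / sigma11) * (s - mu1),
    # sampling S ~ N(mu1, sigma11).
    rng = np.random.default_rng(seed)
    s = rng.normal(mu1, np.sqrt(sigma11), size=n)
    m = mu2 + sigma12 / sigma11 * (s - mu1)
    return float(np.mean(np.abs(m - mu2)))
```

With the default synthetic parameters of Section G.3 ($\sigma_{11}=1$, $\sigma_{12}=0.5$), both routes agree to within Monte Carlo noise.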
Figure 6: Local prediction average and sensitive attribute probability density function estimation error analysis with respect to the kernel bandwidth and number of samples, for the bivariate Gaussian and second synthetic distributions.

G.3 MORE SYNTHETIC EXPERIMENT RESULTS ON ROBUSTNESS

In this subsection, we investigate the robustness of the GDP estimation error over different mean and covariance parameters. Note that only the parameters $\mu = [\mu_1, \mu_2] = [0.5, 1.0]$ and covariance matrix $\Sigma = \begin{pmatrix} \sigma_{11} & \sigma_{12} \\ \sigma_{12} & \sigma_{22} \end{pmatrix} = \begin{pmatrix} 1 & 0.5 \\ 0.5 & 2.0 \end{pmatrix}$ are adopted in the synthetic experiments; we adopt these parameters as defaults and modify one parameter at a time to inspect the GDP estimation error. Figures 7 and 8 show the GDP estimation error for the mean parameters $\mu_1$ and $\mu_2$, and the covariance parameters $\sigma_{11}$, $\sigma_{12}$, and $\sigma_{22}$, given the same data size of 1000. We make two observations: (1) kernel-based GDP estimation consistently achieves lower estimation errors than its histogram-based counterpart for any Gaussian distribution parameter, which further validates the superiority of the kernel-based method; (2) the kernel-based GDP estimation error is almost the same for any kernel function choice, which validates our theoretical analysis in Theorem 3.

Figure 7: GDP estimation error for different mean parameters.

Figure 8: GDP estimation error for different covariance parameters.

H MORE DETAILS ON REAL-WORLD EXPERIMENTAL RESULTS

Many real-world data are organized as tabular, graph and temporal graph data.
Various machine learning models and training strategies have been specifically developed for these data types (Chen & Guestrin, 2016; Han et al., 2022; Jiang et al., 2022; Zhao et al., 2019). Our experiments demonstrate that the proposed GDP and corresponding regularizer can be adopted for these data types.

H.1 MITIGATION PERFORMANCE FOR TEMPORAL GRAPH DATA

The temporal graph data, provided by the data intelligence company Cuebiq (Cuebiq, 2021), is collected from anonymous human movement activities, including the coordinates and times of mobile devices at stop points, during August 2017 in Harris County (Houston), Texas, USA. To generate the temporal graph, we first divide Harris County into grid cells of approximately equal size (1km x 1km). Each grid cell is treated as a graph node, and a temporal link between two nodes represents at least one user movement between them within an hourly time window. The node features are generated from socio-demographic data of the American Community Survey (ACS) 2014-2018 (5-year) data by the U.S. Census Bureau (Bureau, 2021). In the experiment, the white race ratio is treated as a continuous sensitive attribute and our task is to predict whether the income of each node is high or low. We adopt temporal graph attention (TGAT) (da Xu et al., 2020)8 with map and product attention mechanisms to efficiently aggregate temporal-topological neighborhood features, and report the mean prediction performance and GDP over 5 runs. We compare the mitigation performance of kernel estimation and other baselines for the private Harris dataset with two backbones in Figure 9. Similarly, the hyper-parameter λ controls the tradeoff between accuracy and GDP. Again, kernel estimation consistently outperforms the other baselines by a large margin, and binary-quantized sensitive attributes inevitably deteriorate the mitigation performance.
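The grid-cell and hourly-edge construction described above can be sketched as follows. This is a minimal illustration under assumed inputs: the record schema `(user, hour, lat, lon)`, the helper names `to_cell` and `build_temporal_edges`, and the degree-to-km conversion are all our own choices, not the paper's pipeline.

```python
from collections import defaultdict

CELL_KM = 1.0  # approximate grid cell size (1km x 1km), as in the Harris setup


def to_cell(lat, lon, lat0, lon0, km_per_deg=111.0):
    """Map a stop-point coordinate to an integer grid-cell id (hypothetical helper)."""
    return (int((lat - lat0) * km_per_deg // CELL_KM),
            int((lon - lon0) * km_per_deg // CELL_KM))


def build_temporal_edges(stops, lat0, lon0):
    """stops: iterable of (user_id, hour, lat, lon) records.
    Returns {(cell_a, cell_b, hour): count} over consecutive stop pairs per user,
    i.e. a temporal link whenever at least one user moved between two cells."""
    by_user = defaultdict(list)
    for user, hour, lat, lon in stops:
        by_user[user].append((hour, to_cell(lat, lon, lat0, lon0)))
    edges = defaultdict(int)
    for seq in by_user.values():
        seq.sort()  # order each user's stops in time
        for (h1, c1), (h2, c2) in zip(seq, seq[1:]):
            if c1 != c2:  # a movement between distinct cells
                edges[(c1, c2, h2)] += 1
    return edges


# Tiny synthetic example: one user moving ~2 km north during hour 1.
edges = build_temporal_edges(
    [("u1", 0, 29.00, -96.00), ("u1", 1, 29.02, -96.00)],
    lat0=29.0, lon0=-96.0)
```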
Figure 9: Mitigation performance for the temporal graph Harris dataset. (a) TGAT with map attention; (b) TGAT with product attention.

H.2 PREDICTION PERFORMANCE AND GDP TRADEOFF CURVE DURING TRAINING

Training curve on tabular data: To inspect the dynamics of prediction performance and GDP during model training, we provide the prediction performance and GDP tradeoff curves for the Adult and Crimes datasets in Figure 10. The left and right y-axes represent the prediction performance and the GDP metric, respectively. It is seen that, with kernel or histogram estimation as regularization, the hyperparameter can control the tradeoff between prediction performance and GDP, while training is highly unstable with large variance for adversarial debiasing. Additionally, kernel estimation as regularization achieves better bias mitigation performance for the Adult and Crimes datasets.

Figure 10: Prediction performance and GDP training curves for the Adult (top) and Crimes (bottom) datasets with different hyperparameters.

Training curve on graph data: Figures 11 and 12 demonstrate the prediction performance and GDP tradeoff curves for the Pokec-n and Pokec-z datasets using the GCN and GAT models.
[8] https://github.com/StatsDLMathsRecomSys/Inductive-representation-learning-on-temporal-graphs

It is seen that, with kernel or histogram estimation as regularization, the hyperparameter can control the tradeoff between prediction performance and GDP, while training is highly unstable with large variance for adversarial debiasing. Additionally, kernel estimation as regularization achieves better bias mitigation performance for the Pokec-n and Pokec-z datasets with both the GCN and GAT models.

Figure 11: Prediction performance and GDP training curves for the Pokec-n (top) and Pokec-z (bottom) datasets with the GAT model.
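The λ-weighted objective behind all of these training curves can be sketched in a few lines. This is a non-differentiable NumPy illustration only (the experiments use a trainable model, e.g. in PyTorch), and the function names `gdp_kernel` and `total_loss` are ours: the GDP regularizer averages the deviation of the kernel-smoothed local prediction average from the global average.

```python
import numpy as np


def gdp_kernel(y_hat, s, h=0.1):
    """Kernel estimate of GDP: average |m_hat(s_i) - mean(y_hat)|, where
    m_hat is a Gaussian-kernel local average of the predictions."""
    u = (s[:, None] - s[None, :]) / h
    k = np.exp(-0.5 * u ** 2)            # pairwise Gaussian kernel weights
    m_hat = (k @ y_hat) / k.sum(axis=1)  # local prediction average at each s_i
    return np.abs(m_hat - y_hat.mean()).mean()


def total_loss(task_loss, y_hat, s, lam):
    """Regularized objective: task loss plus the lambda-weighted GDP penalty."""
    return task_loss + lam * gdp_kernel(y_hat, s)


# Predictions independent of s incur a much smaller penalty than
# predictions that track s directly.
rng = np.random.default_rng(2)
s = rng.uniform(0, 1, 2_000)
fair = gdp_kernel(rng.uniform(0, 1, 2_000), s)
biased = gdp_kernel(s.copy(), s)
```

Compact-support kernels such as the tricube from Figure 6 plug into `gdp_kernel` the same way by swapping the weight function.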
Figure 12: Prediction performance and GDP training curves for the Pokec-n (top) and Pokec-z (bottom) datasets with the GCN model.

Training curve on temporal graph data: Figure 13 demonstrates the prediction performance and GDP tradeoff curves for the Harris dataset using TGAT with the map and product attention mechanisms. It is seen that, with kernel or histogram estimation as regularization, the hyperparameter can control the tradeoff between prediction performance and GDP, while training is highly unstable with large variance for adversarial debiasing. Additionally, kernel estimation as regularization achieves better bias mitigation performance for TGAT with both attention mechanisms.

Training curve on compositional sensitive attribute: Figure 14 demonstrates the prediction performance and GDP tradeoff curves for the Adult and Crimes datasets with compositional sensitive attributes. It is seen that, with kernel or histogram estimation as regularization, the hyperparameter can control the tradeoff between prediction performance and GDP, while training is highly unstable with large variance for adversarial debiasing. Additionally, kernel estimation as regularization achieves better bias mitigation performance for the Adult and Crimes datasets.
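For the compositional case, one natural way to form the kernel weights over a multi-dimensional sensitive attribute is a product of per-dimension Gaussian kernels; the sketch below is our own illustration of that choice, not necessarily the paper's exact implementation, and `gdp_product_kernel` is a hypothetical name.

```python
import numpy as np


def gdp_product_kernel(y_hat, s, h=0.1):
    """GDP estimate for a compositional sensitive attribute s of shape (n, d),
    using a product of per-dimension Gaussian kernels as the local weights."""
    u = (s[:, None, :] - s[None, :, :]) / h       # (n, n, d) pairwise gaps
    k = np.exp(-0.5 * (u ** 2).sum(axis=-1))      # product kernel collapses over d
    m_hat = (k @ y_hat) / k.sum(axis=1)           # local prediction average
    return np.abs(m_hat - y_hat.mean()).mean()


# Predictions independent of both attribute dimensions score near zero;
# predictions tracking the attributes score much higher.
rng = np.random.default_rng(3)
s = rng.uniform(0, 1, size=(1_000, 2))
fair = gdp_product_kernel(rng.uniform(0, 1, 1_000), s)
biased = gdp_product_kernel(s[:, 0] + s[:, 1], s)
```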
Figure 13: Prediction performance and GDP training curves for TGAT with map (top) and product (bottom) attention mechanisms on the Harris data.

Figure 14: Prediction performance and GDP training curves for the Adult (top) and Crimes (bottom) datasets with compositional attributes.

I FUTURE WORKS

There are several interesting future directions related to the proposed GDP: (1) Extend other fairness metrics, such as Equal Odds (EO), to continuous and discrete sensitive attributes. The connection between such extended fairness metrics and adversarial debiasing is also interesting.
(2) Provide a more fine-grained analysis of the estimation error, including exact expressions for the optimal bandwidth and estimation error, lower bounds on the estimation error for histogram and kernel GDP estimation over all data distributions and model predictions, and estimation error analysis for compositional sensitive attributes. (3) Investigate better GDP estimation methods with faster convergence guarantees.