Published as a conference paper at ICLR 2021

INDIVIDUALLY FAIR GRADIENT BOOSTING

Alexander Vargo, Department of Mathematics, University of Michigan, ahsvargo@umich.edu
Fan Zhang, School of Information Science and Technology, ShanghaiTech University, zhangfan4@shanghaitech.edu.cn
Mikhail Yurochkin, IBM Research, MIT-IBM Watson AI Lab, mikhail.yurochkin@ibm.com
Yuekai Sun, Department of Statistics, University of Michigan, yuekai@umich.edu

ABSTRACT

We consider the task of enforcing individual fairness in gradient boosting. Gradient boosting is a popular method for machine learning from tabular data, which arise often in applications where algorithmic fairness is a concern. At a high level, our approach is a functional gradient descent on a (distributionally) robust loss function that encodes our intuition of algorithmic fairness for the ML task at hand. Unlike prior approaches to individual fairness that only work with smooth ML models, our approach also works with non-smooth models such as decision trees. We show that our algorithm converges globally and generalizes. We also demonstrate the efficacy of our algorithm on three ML problems susceptible to algorithmic bias.

1 INTRODUCTION

In light of the ubiquity of machine learning (ML) methods in high-stakes decision-making and support roles, there is concern about ML models reproducing or even exacerbating historical biases against certain groups of users. These concerns are valid: there are recent incidents in which algorithmic bias has led to dire consequences. For example, Amazon recently discovered that its ML-based resume screening system discriminates against women applying for technical positions (Dastin, 2018).

In response, the ML community has proposed a myriad of formal definitions of algorithmic fairness. Broadly speaking, there are two types of fairness definitions: group fairness and individual fairness (Chouldechova & Roth, 2018). In this paper, we focus on enforcing individual fairness. At a high level, the idea of individual fairness is the requirement that a fair algorithm should treat similar individuals similarly. For a while, individual fairness was overlooked in favor of group fairness because, for many ML tasks, there is no consensus on which users are similar. Fortunately, a flurry of recent work addresses this issue (Ilvento, 2019; Wang et al., 2019; Yurochkin et al., 2020; Mukherjee et al., 2020a). In this paper, we assume there is a fair metric for the ML task at hand and consider the task of individually fair gradient boosting.

Gradient boosting, especially gradient boosted decision trees (GBDT), is a popular method for tabular data problems (Chen & Guestrin, 2016). Unfortunately, existing approaches to enforcing individual fairness are either not suitable for training non-smooth ML models (Yurochkin et al., 2020) or perform poorly with flexible non-parametric ML models. We aim to fill this gap in the literature. Our main contributions are:

1. We develop a method to enforce individual fairness in gradient boosting. Unlike other methods for enforcing individual fairness, our approach handles non-smooth ML models such as (boosted) decision trees.
2. We show that the method converges globally and leads to ML models that are individually fair. We also show that it is possible to certify the individual fairness of the models a posteriori.
3. We show empirically that our method preserves the accuracy of gradient boosting while improving widely used group and individual fairness metrics.

2 ENFORCING INDIVIDUAL FAIRNESS IN GRADIENT BOOSTING

Consider a supervised learning problem. Let $X \subset \mathbb{R}^d$ be the input space and $Y$ be the output space. To keep things simple, we assume $Y = \{0, 1\}$, but our method readily extends to multi-class classification problems. Define $Z = X \times \{0, 1\}$. We equip $X$ with a fair metric $d_x$ that measures the similarity between inputs. The fair metric is application specific, and we refer to the literature on fair metric learning (Ilvento, 2019; Wang et al., 2019; Yurochkin et al., 2020) for ways of picking the fair metric.

Our goal is to learn an ML model $f: X \to \{0, 1\}$ that is individually fair. Formally, we enforce distributionally robust fairness (Yurochkin et al., 2020), which asserts that an ML model has similar accuracy/performance (measured by the loss function) on similar samples (see Definition 2.1). One way to accomplish this is adversarial training (Yurochkin et al., 2020; Yurochkin & Sun, 2020). Unfortunately, adversarial training relies on the smoothness of the model (with respect to the inputs), so it cannot handle non-smooth ML models (e.g. decision trees). We address this issue by considering a restricted adversarial cost function that only searches over the training examples (instead of the entire input space) for similar examples that reveal violations of individual fairness. As we shall see, this restricted adversarial cost function is amenable to functional gradient descent, which allows us to develop a gradient boosting algorithm.

2.1 ENFORCING INDIVIDUAL FAIRNESS WITH A RESTRICTED ADVERSARIAL COST FUNCTION

We first review how to train individually fair ML models with adversarial training (see Yurochkin et al. (2020) for more details) to set up notation and provide intuition on how adversarial training leads to individual fairness. Let $F$ be a set of ML models, let $\ell: Y \times Y \to \mathbb{R}$ be a smooth loss function (see Section 3 for concrete assumptions on $\ell$) that measures the performance of an ML model, and let $D = \{(x_i, y_i)\}_{i=1}^n$ be a set of training data. Define the transport cost function

$$c((x_1, y_1), (x_2, y_2)) \triangleq d_x^2(x_1, x_2) + \infty \cdot \mathbf{1}\{y_1 \neq y_2\}. \tag{2.1}$$

We see that $c((x_1, y_1), (x_2, y_2))$ is small iff $x_1$ and $x_2$ are similar (in the fair metric $d_x$) and $y_1 = y_2$. In other words, $c$ is small iff two similar examples are assigned the same output. Define the optimal transport distance $W$ (with transport cost $c$) on probability distributions on $Z$:

$$W(P_1, P_2) \triangleq \inf_{\Pi \in C(P_1, P_2)} \int_{Z \times Z} c(z_1, z_2)\, d\Pi(z_1, z_2),$$

where $C(P_1, P_2)$ is the set of couplings between $P_1$ and $P_2$ (distributions on $Z \times Z$ whose marginals are $P_1$ and $P_2$). This optimal transport distance lifts the fair metric on (points in) the sample space to distributions on the sample space. Two distributions are close in this optimal transport distance iff they assign mass to similar areas of the sample space $Z$. Finally, define the adversarial risk function

$$L_r(f) \triangleq \sup_{P:\, W(P, P_*) \le \epsilon} \mathbb{E}_P[\ell(f(X), Y)], \tag{2.2}$$

where $P_*$ is the data generating distribution and $\epsilon > 0$ is a small tolerance parameter. The adversarial risk function looks for distributions on the sample space that (i) are similar to the data generating distribution and (ii) increase the risk of the ML model $f$. This reveals differential performance of the ML model on similar samples.
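To make (2.1) concrete, the following minimal NumPy sketch assembles the pairwise transport costs for a Mahalanobis-style fair metric $d_x^2(x_1, x_2) = \langle x_1 - x_2, Q(x_1 - x_2)\rangle$ (the form used in our experiments in Section 5). The function and variable names are illustrative and not part of any released code.

```python
import numpy as np

def fair_cost_matrix(X, y, Q):
    """Pairwise costs c((x_i, y_i), (x_j, y_j)) = d_x^2(x_i, x_j) + inf * 1{y_i != y_j},
    where d_x^2(x1, x2) = <x1 - x2, Q (x1 - x2)> is a Mahalanobis-style fair metric."""
    XQ = X @ Q                                       # (n, d)
    sq = np.einsum("ij,ij->i", XQ, X)                # x_i^T Q x_i
    D2 = sq[:, None] + sq[None, :] - 2.0 * XQ @ X.T  # d_x^2(x_i, x_j)
    D2 = np.maximum(D2, 0.0)                         # clip tiny negatives from round-off
    C = D2.copy()
    C[y[:, None] != y[None, :]] = np.inf             # mismatched labels cost infinity
    return C

# Toy usage: 5 points in R^3, Q projects out the first (sensitive) coordinate.
X = np.random.randn(5, 3)
y = np.array([0, 1, 0, 1, 1])
Q = np.diag([0.0, 1.0, 1.0])
C = fair_cost_matrix(X, y, Q)
```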
This search for differential performance is captured by the notion of distributionally robust fairness:

Definition 2.1 (distributionally robust fairness (DRF) (Yurochkin et al., 2020)). An ML model $h: X \to Y$ is $(\epsilon, \delta)$-distributionally robustly fair (DRF) with respect to the fair metric $d_x$ iff

$$\sup_{P:\, W(P, P_n) \le \epsilon} \int_Z \ell(z, h)\, dP(z) \le \delta. \tag{2.3}$$

In light of the preceding developments, a natural cost function for training individually fair ML models is the adversarial cost function

$$L_e(f) \triangleq \sup_{P:\, W(P, P_n) \le \epsilon} \mathbb{E}_P[\ell(f(X), Y)], \tag{2.4}$$

where $P_n$ is the empirical distribution of the training data. This is the empirical counterpart of (2.2), and it works well for training smooth ML models (Yurochkin et al., 2020). Unfortunately, (2.4) is hard to evaluate for non-smooth ML models: it is defined as the optimal value of an optimization problem, but the gradient $\nabla_x \ell(f(x), y)$ is not available because the ML model $f$ is non-smooth.

To circumvent this issue, we augment the support of the training set and restrict the supremum in (2.4) to the augmented support. Define the augmented support set $D_0 \triangleq \{(x_i, y_i), (x_i, 1 - y_i)\}_{i=1}^n$ and the restricted optimal transport distance $W_D$ between distributions supported on $D_0$:

$$W_D(P_1, P_2) \triangleq \inf_{\Pi \in C_0(P_1, P_2)} \int_{Z \times Z} c(z_1, z_2)\, d\Pi(z_1, z_2),$$

where $C_0(P_1, P_2)$ is the set of distributions supported on $D_0 \times D_0$ whose marginals are $P_1$ and $P_2$. We consider the restricted adversarial cost function

$$L(f) \triangleq \sup_{P:\, W_D(P, P_n) \le \epsilon} \mathbb{E}_P[\ell(f(X), Y)], \tag{2.5}$$

where $\epsilon > 0$ is a small tolerance parameter. The interpretation of (2.5) is identical to that of (2.4): it searches for perturbations of the training examples that reveal differential performance of the ML model. On the other hand, compared to (2.4), the supremum in (2.5) is restricted to distributions supported on $D_0$. This allows us to evaluate (2.5) by solving a (finite-dimensional) linear program (LP). As we shall see, this LP depends only on the loss values $\ell(f(x_i), y_i)$ and $\ell(f(x_i), 1 - y_i)$, so it is possible to solve the LP efficiently even if the ML model $f$ is non-smooth. This is the key idea in this paper.

Before delving into the details, we note that the main drawback of restricting the search to distributions supported on $D_0$ is reduced power to detect differential performance. If the ML model exhibits differential performance between two (similar) areas of the input space but only one area is represented in the training set, then (2.4) will detect the differential performance but (2.5) will not. Augmenting the support set with the points $\{(x_i, 1 - y_i)\}_{i=1}^n$ partially alleviates this issue (but the power remains reduced compared to (2.4)). This is the price we pay for the broader applicability of (2.5).

2.2 FUNCTIONAL GRADIENT DESCENT ON THE RESTRICTED ADVERSARIAL COST FUNCTION

Gradient boosting is functional gradient descent (Friedman, 2001), so a key step in gradient boosting is evaluating $\frac{\partial L}{\partial \hat{y}}$, where the components of $\hat{y} \in \mathbb{R}^n$ are $\hat{y}_i \triangleq f(x_i)$. By Danskin's theorem, we have

$$\frac{\partial L}{\partial \hat{y}_i} = \frac{\partial}{\partial f(x_i)} \sup_{P:\, W_D(P, P_n) \le \epsilon} \mathbb{E}_P[\ell(f(X), Y)] = \sum_{y \in Y} \frac{\partial}{\partial f(x_i)} \ell(f(x_i), y)\, P^*(\{(x_i, y)\}), \tag{2.6}$$

where $P^*$ is a distribution that attains the supremum in (2.5). We note that there is no need to differentiate through the ML model $f$ in (2.6), so it is possible to evaluate the functional gradient for non-smooth ML models.

It remains to find $P^*$. We devise a way of finding $P^*$ by solving a linear program. We start with a simplifying observation: if $c(z_i, z_j) = \infty$ for $z_i \in D_0$ and $z_j \in D$, then no weight at $z_j$ can be transported to $z_i$.
Thus, we only need to consider pairs $(z_i, z_j) \in D_0 \times D$ with $c(z_i, z_j) < \infty$. Let $C \in \mathbb{R}^{n \times n}$ be the matrix with entries $C_{i,j} = c((x_i, y_j), (x_j, y_j)) = d_x^2(x_i, x_j)$. We also define the class indicator vectors $y^1, y^0 \in \{0, 1\}^n$ by

$$y^1_j = \begin{cases} 1 & \text{if } y_j = 1 \\ 0 & \text{if } y_j = 0 \end{cases} \quad \text{and} \quad y^0 = \mathbf{1}_n - y^1. \tag{2.7}$$

For any distribution $P$ on $D_0$, let $P_{i,k} = P(\{(x_i, k)\})$ for $k \in \{0, 1\}$. Then the condition $W_D(P, P_n) \le \epsilon$ is implied by the existence of a matrix $\Pi$ such that

1. $\Pi \in \Gamma$, where $\Gamma = \{\Pi \mid \Pi \in \mathbb{R}^{n \times n}_+,\ \langle C, \Pi \rangle \le \epsilon,\ \Pi^T \mathbf{1}_n = \tfrac{1}{n}\mathbf{1}_n\}$;
2. $\Pi y^1 = (P_{1,1}, \ldots, P_{n,1})$ and $\Pi y^0 = (P_{1,0}, \ldots, P_{n,0})$.

Further define the matrix $R \in \mathbb{R}^{n \times n}$ by $R_{ij} = \ell(f(x_i), y_j)$; this is the loss incurred if point $j$ with label $y_j$ is transported to point $i$. With this setup, given the current predictor $f$, we can obtain a solution $\Pi^*$ to the optimization as the solution of the linear program (in $n^2$ variables)

$$\Pi^* \in \arg\max_{\Pi \in \Gamma} \langle R, \Pi \rangle. \tag{2.8}$$

Then the optimal distribution $P^*$ on $D_0$ is given by $P^*(\{(x_i, k)\}) = (\Pi^* y^k)_i$. An outline of the full gradient boosting procedure is provided in Algorithm 1. It is important to note that we have made no assumptions about the class of candidate predictors $F$ in finding the optimal transport map $\Pi^*$ in (2.8). In particular, $F$ can contain discontinuous functions, for example decision trees or sums of decision trees. This allows us to apply this fair gradient boosting algorithm to any class $F$ of base classifiers.

Algorithm 1 Fair gradient boosting
1: Input: labeled training data $\{(x_i, y_i)\}_{i=1}^n$; class of weak learners $H$; initial predictor $f_0$; search radius $\epsilon$; number of steps $T$; sequence of step sizes $\alpha^{(t)}$; fair metric $d_x$ on $X$
2: Define the matrix $C$ by $C_{i,j} \triangleq d_x^2(x_i, x_j)$.
3: for $t = 0, 1, \ldots, T - 1$ do
4:   Define the matrix $R_t$ by $(R_t)_{ij} = \ell(f_t(x_i), y_j)$.
5:   Find $\Pi^*_t \in \arg\max_{\Pi \in \Gamma} \langle R_t, \Pi \rangle$ and set $P_{t+1}(\{(x_i, k)\}) \triangleq (\Pi^*_t y^k)_i$.
6:   Fit a base learner $h_t \in H$ to the set of pseudo-residuals $\{-\frac{\partial L}{\partial f_t(x_i)}\}_{i=1}^n$ (see (2.6)).
7:   Let $f_{t+1} = f_t + \alpha_t h_t$.
8: end for
9: return $f_T$
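For intuition, here is a minimal sketch of one boosting step with the logistic loss, solving (2.8) directly with an off-the-shelf LP solver. This is only practical for small $n$ (Section 4 derives a scalable dual approach); the function names, the use of scipy.optimize.linprog, and the hyperparameters are illustrative assumptions rather than the paper's implementation.

```python
import numpy as np
from scipy.optimize import linprog
from scipy.special import expit

def worst_case_weights(R, C, eps):
    """Solve the LP (2.8): maximize <R, Pi> over Pi >= 0 subject to
    <C, Pi> <= eps and column sums of Pi equal to 1/n.  Returns the n x n coupling."""
    n = R.shape[0]
    A_ub = C.reshape(1, -1)                        # <C, Pi> <= eps
    A_eq = np.kron(np.ones((1, n)), np.eye(n))     # column sums (row-major flattening)
    b_eq = np.full(n, 1.0 / n)
    res = linprog(-R.ravel(), A_ub=A_ub, b_ub=[eps], A_eq=A_eq, b_eq=b_eq,
                  bounds=(0, None), method="highs")
    return res.x.reshape(n, n)

def boosting_step(f_hat, y, C, eps):
    """One fair-boosting step: worst-case weights P*, then pseudo-residuals (2.6)
    for the logistic loss l(f, y) = log(1 + e^f) - y * f on raw margins f_hat."""
    R = np.logaddexp(0.0, f_hat)[:, None] - np.outer(f_hat, y)  # R_ij = l(f(x_i), y_j)
    Pi = worst_case_weights(R, C, eps)
    y1 = (y == 1).astype(float)
    p1, p0 = Pi @ y1, Pi @ (1.0 - y1)              # P*({(x_i, 1)}), P*({(x_i, 0)})
    grad = expit(f_hat) * (p1 + p0) - p1           # sum over y of dl/df * P*({(x_i, y)})
    return -grad                                   # pseudo-residuals to fit h_t
```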
2.3 RELATED WORK

Enforcing individual fairness. There is a line of work that seeks to improve fairness guarantees for individuals by enforcing group fairness with respect to many (possibly overlapping) groups (Hébert-Johnson et al., 2017; Kearns et al., 2017; Creager et al., 2019; Kearns et al., 2019; Kim et al., 2019). There is another line of work on enforcing individual fairness without access to a fair metric (Gillen et al., 2018; Kim et al., 2018; Jung et al., 2019; Kearns et al., 2019). This line of work circumvents the need for a fair metric by assuming the learner has access to an oracle that provides feedback on violations of individual fairness. Our work is part of a third line of work on enforcing individual fairness that assumes access to a fair metric (Yurochkin et al., 2020; Yurochkin & Sun, 2020). The applicability of these methods has broadened thanks to recent work on the precursor task of learning the fair metric from data (Ilvento, 2019; Lahoti et al., 2019; Wang et al., 2019; Mukherjee et al., 2020b). Unfortunately, these methods are limited to training smooth ML models.

Adversarial training and distributionally robust optimization. Our approach to fair training is also similar to adversarial training (Goodfellow et al., 2014; Madry et al., 2017), which hardens ML models against adversarial attacks. There are two recent papers on fitting robust non-parametric classifiers (including decision trees) (Yang et al., 2019; Chen et al., 2019), but neither is applicable to enforcing individual fairness in gradient boosting. Our approach is also an instance of distributionally robust optimization (DRO) (Blanchet et al., 2016; Duchi & Namkoong, 2016; Esfahani & Kuhn, 2015; Lee & Raginsky, 2017; Sinha et al., 2017; Hashimoto et al., 2018).

Before moving on, we remark that the key idea in this paper cannot be applied to adversarial training of non-smooth ML models. The goal of adversarial training is to make the ML model robust to adversarial examples that are not in the training set. The restriction to the augmented support set in (2.5) precludes such adversarial examples, so training with (2.5) does not harden the model against adversarial examples. On the other hand, as long as the training set is diverse enough (see Assumption 3.3 for a rigorous condition to this effect), it is possible to reveal differential performance of the ML model by searching over the augmented support set and thereby enforce individual fairness.

3 THEORETICAL RESULTS

We study the convergence and generalization properties of fair gradient boosting (Algorithm 1). The optimization properties are standard: fair gradient boosting together with a line search strategy converges globally. This is hardly surprising in light of the global convergence properties of line search methods (see Nocedal & Wright (2006, Chapter 3)), so we defer this result to Appendix A.1.

The generalization properties of fair gradient boosting are less standard. We start by stating the assumptions on the problem. The first two assumptions are standard in the DRO literature (see Lee & Raginsky (2017) and Yurochkin et al. (2020)).

Assumption 3.1 (boundedness of the input space). $\mathrm{diam}(X) < \infty$.

Assumption 3.2 (regularity of the loss function). (i) $\ell$ is bounded: $0 \le \ell(f, z) \le B$ for all $f \in F$, $z \in Z$. (ii) Let $L \triangleq \{\ell(f, \cdot) : f \in F\}$ denote the class of loss functions. $L$ is $\omega_2$-Lipschitz with respect to $d_x$: $|\ell(f, (x_1, y)) - \ell(f, (x_2, y))| \le \omega_2 d_x(x_1, x_2)$ for all $x_1, x_2 \in X$, $y \in Y$, and $f \in F$.

Additionally, the support of the data generating distribution $P_*$ should cover the input space. Otherwise, the training data may miss areas of the input space, which precludes detecting differential treatment in these areas. Similar conditions appear in the non-parametric classification literature under the name strong density condition (Audibert & Tsybakov, 2007).

Assumption 3.3. Let $B_{d_x}(r, \bar{x}) = \{x \in X : d_x(x, \bar{x}) < r\}$ be the $d_x$-ball of radius $r$ around $\bar{x}$ in $X$. There are constants $\delta > 0$ and $d$ such that $P_*(B_{d_x}(r, x) \times Y) \ge \delta r^d$ for any $r < 1$.

The lower bound $\delta r^d$ in Assumption 3.3 is motivated by the volume of the Euclidean ball of radius $r$ in $\mathbb{R}^d$ being proportional to $r^d$. For a bounded input space $X$, Assumption 3.3 implies that the probability mass assigned by $P_*$ to small balls is always comparable (up to the small constant $\delta$) to the probability mass assigned by the uniform distribution on $X$. We note that this assumption is close to the goal of Buolamwini & Gebru (2018) in their construction of the Pilot Parliaments Benchmark data.

Theorem 3.4. Under Assumptions 3.1, 3.2, and 3.3, we have

$$\sup_{f \in F} |L(f) - L_r(f)| \lesssim_P \frac{1}{n^{1/(2d)}}\Big(\omega_2 + \frac{2\omega_2\,\mathrm{diam}(X)}{\sqrt{\epsilon}}\Big) + \frac{1}{\sqrt{n}}. \tag{3.1}$$

The first term on the right side of (3.1) is the discrepancy between $L$ and $L_e$, while the second term is the discrepancy between $L_e$ and $L_r$ (recall (2.2) and (2.4)). The second term is well-studied in the DRO literature (Lee & Raginsky, 2017; Yurochkin et al., 2020), so we focus on the first term, which captures the effect of restricting the search space in the supremum in (2.5) to the augmented support set $D_0$.
We see the curse of dimensionality in the slow $1/n^{1/(2d)}$ convergence rate, but this is unavoidable. For the restricted search for differential performance on $D_0$ to emulate the search on $X$, the size of $D_0$ must grow exponentially with the dimension $d$. This leads to the exponential dependence on $d$ in the first term.

We note that it is possible for the dimension $d$ in Assumption 3.3 to be smaller than $\dim(X)$. Intuitively, the fair metric $d_x$ ignores variation in the inputs due to the sensitive attributes. This variation is usually concentrated in a low-dimensional set because there are few sensitive attributes in most ML tasks. Thus the metric effectively restricts the search to this low-dimensional space, so the effective dimension of the search is smaller. In such cases, the convergence rate will depend on the (smaller) effective dimension of the search.

One practical consequence of Theorem 3.4 is that it is possible to certify a posteriori that a (non-smooth) ML model is individually fair by checking the empirical performance gap

$$L(f) - \frac{1}{n}\sum_{i=1}^n \ell(f(x_i), y_i). \tag{3.2}$$

As long as the (non-adversarial) empirical risk converges to its population value, Theorem 3.4 implies

$$\sup_{f \in F}\Big| \Big(L(f) - \frac{1}{n}\sum_{i=1}^n \ell(f(x_i), y_i)\Big) - \big(L_r(f) - \mathbb{E}_{P_*}[\ell(f(X), Y)]\big) \Big| \to_p 0. \tag{3.3}$$

In other words, the (empirical) performance gap $L(f) - \frac{1}{n}\sum_{i=1}^n \ell(f(x_i), y_i)$ generalizes. Thus it is possible for practitioners to certify the worst-case performance differential of an ML model (up to an error term that vanishes in the large-sample limit) by evaluating (3.2).

4 SCALABLE FAIR GRADIENT BOOSTING

A key step in the fair gradient boosting Algorithm 1 is finding the worst-case distribution $P^*$. This entails solving a linear program (LP) in $n^2$ variables ($n$ is the size of the training set). It is possible to use an off-the-shelf LP solver to find $P^*$, but solving an LP in $n^2$ variables per iteration does not scale to modern massive datasets. In this section, we appeal to duality to derive a stochastic optimization approach to finding the worst-case distribution. At a high level, the approach consists of two steps:

1. Solve the dual of (2.8). The dual is a univariate optimization problem, and it is amenable to stochastic optimization (see (B.14)). The computational complexity is O(n).
2. Reconstruct the primal optimum from the dual optimum. The computational complexity is O(n^2).

The computational complexity of the second step, O(n^2), is much smaller than the complexity of solving an LP in $n^2$ variables. This two-step approach is similar to the dual approach to solving (unrestricted) DRO problems. Due to space constraints, we summarize the two steps in Algorithm 2 and defer the derivations to Appendix B. We also include in Appendix B an entropic-regularized version of the approach that sacrifices exactness for additional computational benefits.

We remark that it is possible to achieve further speedups by exploiting the properties of the fair metric. We start by observing that the cost of recovering the primal solution (step 10 in Algorithm 2) dominates the computational cost of Algorithm 2. Recall that the search for differential performance is restricted by the fair metric to a small set of points. This implies the argmax in step 10 of Algorithm 2 is restricted to a few points that are similar to $x_j$ in the fair metric. This can be seen from the corresponding argmax expression: if $C_{ij}$ is large, $i$ is unlikely to be a solution of the argmax. The neighbors of each point can be computed in advance and re-used during training. This further reduces the cost of recovering the primal solution from O(n^2) to O(nm), where $m$ is the maximum number of neighbors of a point. With this heuristic, it should be possible to train our algorithm whenever vanilla GBDT is feasible.
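As an illustration of this heuristic, the following sketch precomputes, for each training point, the indices whose fair-metric cost falls below a threshold and restricts the primal-recovery argmax of step 10 in Algorithm 2 to those indices. The names and the thresholding rule are illustrative assumptions.

```python
import numpy as np

def precompute_neighbors(C, radius):
    """For each column j, the indices i with C[i, j] <= radius (j itself is always included,
    since C[j, j] = 0).  C[i, j] = d_x^2(x_i, x_j); computed once, reused every boosting step."""
    return [np.flatnonzero(C[:, j] <= radius) for j in range(C.shape[1])]

def recover_primal(R, C, eta, neighbors):
    """Step 10 of Algorithm 2, restricted to precomputed neighbors:
    in each column j, put mass 1/n on argmax_i R[i, j] - eta * C[i, j]."""
    n = R.shape[1]
    Pi = np.zeros_like(R)
    for j, idx in enumerate(neighbors):
        i_star = idx[np.argmax(R[idx, j] - eta * C[idx, j])]
        Pi[i_star, j] = 1.0 / n
    return Pi
```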
We now state the full fair gradient boosting algorithm that combines the framework presented in Section 2.1 with the preceding stochastic optimization approach to evaluating the worst-case distribution. We refer to the GBDT method as TreeBoost; it can be replaced with any GBDT algorithm. In every boosting step in Algorithm 3, we find the optimal transport map $\Pi$ using Algorithm 2. We then boost for one step using the GBDT training method on the augmented data set $D_0$, weighted according to the distribution given by $P(\{(x_i, k)\}) = (\Pi y^k)_i$. There are multiple hyperparameters that can be tweaked in the GBDT model (e.g. the maximum depth of the trees in $F$); we represent this by including a list of GBDT parameters $\rho$ as an input to Algorithm 3.

Algorithm 2 SGD to find the optimal dual variable $\eta^*$ and an approximate $\Pi$
1: Input: initial $\eta_1 > 0$; cost matrix $C$; loss matrix $R$; tolerance $\epsilon$; batch size $B$; step sizes $\alpha_t > 0$
2: repeat
3:   Sample indices $j_1, \ldots, j_B$ uniformly at random from $\{1, \ldots, n\}$.
4:   Let $R_t \leftarrow$ columns $j_1, \ldots, j_B$ of $R$ and $C_t \leftarrow$ columns $j_1, \ldots, j_B$ of $C$. {$R_t, C_t \in \mathbb{R}^{n \times B}$}
5:   Let $w_t(\eta) \triangleq \eta\epsilon + \frac{1}{B}\sum_{j=1}^{B} \max_i \big[(R_t)_{ij} - \eta (C_t)_{ij}\big]$
6:   $\eta_{t+1} \leftarrow \max\{0, \eta_t - \alpha_t \frac{d}{d\eta} w_t(\eta_t)\}$
7: until converged
8: Set $\Pi$ to be an $n \times n$ matrix of zeros; let $\eta^*$ be the final value from above.
9: for $j = 0$ to $n - 1$ do
10:   Choose $t \in \arg\max_i R_{ij} - \eta^* C_{ij}$ and set $\Pi_{tj} = \frac{1}{n}$.
11: end for
12: return $\Pi$

Algorithm 3 Fair gradient boosted trees (BuDRO)
1: Input: data $D = \{(x_i, y_i)\}_{i=1}^n$; perturbation budget $\epsilon$; loss function $\ell$; fair metric $d_x$ on $X$; number of boosting steps $T$; GBDT parameters $\rho$; batch size $B$
2: Let $D_0 = \{(x_i, 0)\}_{i=1}^n \cup \{(x_i, 1)\}_{i=1}^n$ and define $C$ by $C_{ik} \triangleq d_x^2(x_i, x_k)$.
3: Let $f_0 = $ TreeBoost.Train($\rho$, data $= D_0$, Steps $= 1$) {run one step of plain boosting}
4: for $t = 0$ to $T - 1$ do
5:   Define $R$ by $R_{ij} = \ell(f_t, (x_i, y_j))$.
6:   Construct $\Pi_t$ following Algorithm 2 with inputs $C$, $R$, $\epsilon$, and $B$.
7:   Let $w_t$ be the concatenation of $\Pi_t y^0$ and $\Pi_t y^1$.
8:   Let $f_{t+1} \leftarrow$ TreeBoost.Train($\rho$, $f_t$, data $= D_0$, weights $= w_t$, Steps $= 1$).
9: end for
10: return $f_T$
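The stochastic dual step of Algorithm 2 amounts to a projected subgradient update on $w_t(\eta)$. A minimal NumPy sketch is given below; the function name, step-size schedule, and stopping rule are illustrative assumptions rather than the paper's implementation. The coupling $\Pi$ can then be recovered column by column, for example with the neighbor-restricted sketch given earlier.

```python
import numpy as np

def sgd_dual_eta(R, C, eps, eta0=1.0, batch=128, steps=500, lr=0.1, seed=0):
    """Projected stochastic subgradient descent on the dual of (2.8)
    (Algorithm 2, steps 2-7): w(eta) = eta * eps + E_j[ max_i R[i, j] - eta * C[i, j] ]."""
    rng = np.random.default_rng(seed)
    n = R.shape[1]
    eta = eta0
    for t in range(steps):
        cols = rng.integers(0, n, size=batch)        # minibatch of columns
        scores = R[:, cols] - eta * C[:, cols]       # n x B
        i_star = scores.argmax(axis=0)               # maximizing row in each column
        grad = eps - C[i_star, cols].mean()          # subgradient of w at eta
        eta = max(0.0, eta - lr / np.sqrt(t + 1) * grad)
    return eta
```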
5 EXPERIMENTS

We apply BuDRO (Algorithm 3) to three data sets popular in the fairness literature. Full descriptions of these data sets and a synthetic example are included in Appendix C. Timing information is also found in Appendix C.6. BuDRO uses the XGBoost algorithm (Chen & Guestrin, 2016) for the GBDT method, and $\ell$ is the logistic loss. These experiments reveal that BuDRO is successful in enforcing individual fairness while achieving high accuracy (leveraging the power of GBDTs). We also observe improvements in the group fairness metrics.

Fair metric. Recall that a practical implementation of our algorithm requires a fair metric. We use the procedure from Yurochkin et al. (2020) to learn the fair metric from data. They propose a metric of the form $d_x^2(x_1, x_2) = \langle x_1 - x_2, Q(x_1 - x_2)\rangle$, where $Q$ is the matrix projecting onto the orthogonal complement of a sensitive subspace. This sensitive subspace is formed by taking the span of the vectors orthogonal to the decision boundaries of linear classifiers fitted to predict a (problem-specific) set of protected attributes (e.g. gender, race, or age). The idea behind this procedure is that the sensitive subspace captures variation in the data due to the protected information of the individuals. A fair metric should treat individuals that differ only in their protected information similarly, i.e. the distance between a pair of individuals that differ only by a component in the sensitive subspace should be 0.

Comparison methods. We consider other possible implementations of fair GBDT given existing techniques in the literature. As mentioned previously, due to the non-differentiability of trees, the majority of other individual fairness methods are not applicable to the GBDT framework. For this reason, we limit our analysis to fair data preprocessing techniques applied before a vanilla GBDT method. We report results for two preprocessing techniques: project, which eliminates the protected attributes and projects out the sensitive subspace used for the fair metric construction (Yurochkin et al., 2020), and reweigh, which balances the representations of protected groups by assigning different weights to the individuals (Kamiran & Calders, 2011).

Evaluation metrics. To evaluate the individual fairness of each method without appealing to the underlying fair metric (i.e. to avoid giving our method and project an unfair advantage at test time, and to verify the generalization properties of the fair metric used for training), we report data-specific consistency metrics. Specifically, for each data set, we find a set of attributes that are not explicitly protected but are correlated with the protected attribute (e.g. is_husband or is_wife when gender is protected) and vary these attributes to create artificial counterfactual individuals. Such counterfactual individuals are intuitively similar to the original individuals, and a classifier's output should be the same for all counterfactuals to satisfy individual fairness. We refer to classification consistency as the frequency with which a classifier's predictions remain unchanged across the aforementioned counterfactuals. We also examine group fairness for completeness. We consider the group fairness metrics introduced in De-Arteaga et al. (2019). Specifically, we compute the differences in TPR and TNR for each protected attribute, and report the maximum (GAPMax) and root-mean-squared (GAPRMS) statistics of these gaps. See Appendix C for a full description of all comparison metrics.

5.1 GERMAN CREDIT

The German credit data set (Dua & Graff, 2017) contains information on 1000 individuals; the ML task is to label the individuals as good or bad credit risks. We treat age as the protected attribute in the German data set. The age feature is not binary, which precludes the use of fair methods that assume two protected classes (including the reweigh preprocessing technique). In the United States, it is unlawful to make credit decisions based on the age of the applicant; thus, there are distinct legal reasons to want a classifier that does not discriminate based on age.

For the fair metric, we construct the sensitive subspace by fitting a ridge regression on age and augmenting it with an indicator vector for the age coordinate (see Appendix C.3). This subspace is also used for data preprocessing with the project baseline. For the individual fairness consistency evaluation, we vary a personal-status feature that encodes both gender and marital status (S-cons).
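For concreteness, here is a minimal sketch of this construction with scikit-learn: fit a ridge regression that predicts age from the remaining features, take its coefficient vector together with the indicator of the age coordinate as a basis of the sensitive subspace, and project that subspace out. The regularization strength and variable names are illustrative assumptions; the exact construction is described in Appendix C.3.

```python
import numpy as np
from sklearn.linear_model import Ridge

def fair_metric_matrix(X, age_col, alpha=1.0):
    """Q for d_x^2(x1, x2) = <x1 - x2, Q (x1 - x2)>: the projection onto the orthogonal
    complement of span{ridge coefficients for age, indicator of the age coordinate}."""
    d = X.shape[1]
    others = np.delete(np.arange(d), age_col)
    w = np.zeros(d)
    w[others] = Ridge(alpha=alpha).fit(X[:, others], X[:, age_col]).coef_
    e_age = np.zeros(d)
    e_age[age_col] = 1.0
    A = np.stack([w, e_age], axis=1)                # d x 2 basis of the sensitive subspace
    P = A @ np.linalg.pinv(A.T @ A) @ A.T           # projection onto the sensitive subspace
    return np.eye(d) - P                            # Q projects the sensitive subspace out
```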
Table 1: German credit: average results over 10 splits into 80% training and 20% test data.

| Method | BAcc | S-cons | Age GAPMax | Age GAPRMS |
|---|---|---|---|---|
| BuDRO | .715 | .974 | .185 | .151 |
| Baseline | .723 | .920 | .310 | .241 |
| Project | .698 | .960 | .188 | .144 |
| Baseline NN | .687 | .826 | .234 | .179 |

We present the results in Table 1 (see Table 5 for error bars). To compare classification performance, we report balanced accuracy (BAcc) due to the class imbalance in the data. The baseline (GBDT with XGBoost) is the most accurate and significantly outperforms a baseline neural network (NN). The BuDRO method has the highest individual fairness (S-cons) score while maintaining high accuracy and improved group fairness metrics. Preprocessing by projecting out the sensitive subspace is not as effective as BuDRO at improving individual fairness and can also negatively impact performance.

5.2 ADULT

The Adult data set (Dua & Graff, 2017) is another common benchmark in the fairness literature. The task is to predict whether an individual earns above or below $50k per year. We follow the experimental setup and comparison metrics of prior work on individual fairness (Yurochkin et al., 2020) studying this data. Individual fairness is quantified with two classification consistency measures: one with respect to a relationship-status feature (S-cons) and the other with respect to the gender and race features (GR-cons). The sensitive subspace is learned via logistic regression classifiers for gender and race and is augmented with unit vectors for gender and race. Note that the project preprocessing baseline is guaranteed to have a perfect GR-cons score since those features are explicitly projected out; it is interesting to assess whether it generalizes to S-cons, however.

Table 2: Adult: average results over 10 splits into 80% training and 20% test data. NN, SenSR, and Adversarial Debiasing (Zhang et al., 2018) numbers are from Yurochkin et al. (2020).

| Method | BAcc | S-cons | GR-cons | Gender GAPMax | Gender GAPRMS | Race GAPMax | Race GAPRMS |
|---|---|---|---|---|---|---|---|
| BuDRO | .815 | .944 | .957 | .146 | .114 | .083 | .072 |
| Baseline | .844 | .942 | .913 | .200 | .166 | .098 | .082 |
| Project | .787 | .881 | 1 | .079 | .069 | .064 | .050 |
| Reweigh | .784 | .853 | .949 | .131 | .093 | .056 | .043 |
| Baseline NN | .829 | .848 | .865 | .216 | .179 | .105 | .089 |
| SenSR | .789 | .934 | .984 | .087 | .068 | .067 | .055 |
| Adv. Deb. | .815 | .807 | .841 | .110 | .082 | .078 | .070 |

The results are in Table 2 (see Table 7 for error bars). The baseline GBDT method is again the most accurate; however, it produces poor gender gaps. BuDRO is less accurate than the baseline, but the gender gaps shrink considerably, and both S-cons and GR-cons are very high. Project and reweigh produce the best group fairness results; however, their S-cons values are worse than the baseline (representing violations of individual fairness), and they also incur a significant accuracy reduction. Comparing to the results from Yurochkin et al. (2020), BuDRO is slightly less accurate than the baseline NN but improves on all fairness metrics. BuDRO matches the accuracy of adversarial debiasing and greatly improves on its individual fairness results. Finally, BuDRO improves upon the accuracy of SenSR while maintaining similar individual fairness results. We present additional studies of the trade-off between accuracy and fairness in Figure 3 of Appendix C.4.1. Overall, this and the preceding experiment provide empirical evidence that BuDRO trains individually fair classifiers while still obtaining high accuracy due to the power of GBDT methods.

5.3 COMPAS

We study the COMPAS recidivism prediction data set (Larson et al., 2016).
The task is to predict whether a criminal defendant will recidivate within two years. We consider race (Caucasian or not Caucasian) and gender (binary) as protected attributes and use them to learn the sensitive subspace as in the previous experiments. Due to the smaller data set size, instead of Algorithm 2 we use the entropy-regularized solution to (2.8) presented in Appendix B. The individual fairness consistency measures are gender consistency (G-cons) and race consistency (R-cons).

Table 3: COMPAS: average results over 10 splits into 80% training and 20% test data.

| Method | Acc | G-cons | R-cons | Gender GAPMax | Gender GAPRMS | Race GAPMax | Race GAPRMS |
|---|---|---|---|---|---|---|---|
| BuDRO | 0.652 | 1.000 | 1.000 | 0.099 | 0.124 | 0.125 | 0.145 |
| Baseline | 0.677 | 0.944 | 0.981 | 0.180 | 0.223 | 0.215 | 0.258 |
| Project | 0.671 | 0.874 | 1.000 | 0.150 | 0.190 | 0.185 | 0.230 |
| Reweigh | 0.666 | 0.788 | 0.813 | 0.207 | 0.245 | 0.069 | 0.092 |
| Baseline NN | 0.682 | 0.841 | 0.908 | 0.246 | 0.282 | 0.228 | 0.258 |
| SenSR | 0.652 | 0.977 | 0.988 | 0.130 | 0.167 | 0.159 | 0.179 |
| Adv. Deb. | 0.670 | 0.854 | 0.818 | 0.219 | 0.246 | 0.108 | 0.130 |

The results are collected in Table 3 (see Table 9 for error bars). COMPAS is a data set where NNs outperform GBDTs in terms of accuracy, but this comes with poor group and individual fairness measurements. A NN trained with SenSR shows accuracy similar to BuDRO but worse performance in both group and individual fairness. We conclude that even on problems where neural networks outperform GBDTs, BuDRO is an effective method when fairness is taken into consideration.

6 SUMMARY AND DISCUSSION

We developed a gradient boosting algorithm that enforces individual fairness. The main challenge of enforcing individual fairness is searching for differential performance in the ML model, and we overcome the non-smoothness of the ML model by restricting the search space to a finite set. Unlike most methods for enforcing individual fairness, our method accepts non-smooth ML models. We note that the restricted adversarial cost function developed for fair gradient boosting may also be used to audit other non-smooth ML models (e.g. random forests) for differential performance, but we defer such developments to future work. Theoretically, we show that the resulting fair gradient boosting algorithm converges globally and generalizes. Empirically, we show that the method preserves the accuracy of gradient boosting while improving group and individual fairness metrics.

ACKNOWLEDGEMENTS

This paper is based upon work supported by the National Science Foundation (NSF) under grants no. 1830247 and 1916271. Any opinions, findings, and conclusions or recommendations expressed in this paper are those of the authors and do not necessarily reflect the views of the NSF.

REFERENCES

Julia Angwin and Jeff Larson. Bias in criminal risk scores is mathematically inevitable, researchers say. ProPublica, available at: https://goo.gl/S3Gwcn (accessed 5 March 2018), 2016.

Jean-Yves Audibert and Alexandre B. Tsybakov. Fast learning rates for plug-in classifiers. Annals of Statistics, 35(2):608-633, April 2007. ISSN 0090-5364, 2168-8966. doi: 10.1214/009053606000001217.

Rachel K. E. Bellamy, Kuntal Dey, Michael Hind, Samuel C. Hoffman, Stephanie Houde, Kalapriya Kannan, Pranay Lohia, Jacquelyn Martino, Sameep Mehta, Aleksandra Mojsilovic, Seema Nagar, Karthikeyan Natesan Ramamurthy, John Richards, Diptikalyan Saha, Prasanna Sattigeri, Moninder Singh, Kush R. Varshney, and Yunfeng Zhang.
AI Fairness 360: An Extensible Toolkit for Detecting, Understanding, and Mitigating Unwanted Algorithmic Bias. ar Xiv:1810.01943 [cs], October 2018. Jose Blanchet and Karthyek R. A. Murthy. Quantifying Distributional Model Risk via Optimal Transport. ar Xiv:1604.01446 [math, stat], April 2016. Jose Blanchet, Yang Kang, and Karthyek Murthy. Robust Wasserstein Profile Inference and Applications to Machine Learning. ar Xiv:1610.05627 [math, stat], October 2016. Amanda Bower, Laura Niss, Yuekai Sun, and Alexander Vargo. Debiasing representations by removing unwanted variation due to protected attributes. ar Xiv:1807.00461 [cs], July 2018. Published as a conference paper at ICLR 2021 Joy Buolamwini and Timnit Gebru. Gender Shades: Intersectional Accuracy Disparities in Commercial Gender Classification. In Proceedings of Machine Learning Research, volume 87, pp. 77 91, 2018. Hongge Chen, Huan Zhang, Duane S. Boning, and Cho-Jui Hsieh. Robust decision trees against adversarial examples. Ar Xiv, abs/1902.10660, 2019. Tianqi Chen and Carlos Guestrin. XGBoost: A Scalable Tree Boosting System. Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining - KDD 16, pp. 785 794, 2016. doi: 10.1145/2939672.2939785. Alexandra Chouldechova and Aaron Roth. The Frontiers of Fairness in Machine Learning. ar Xiv:1810.08810 [cs, stat], October 2018. Elliot Creager, David Madras, Jörn-Henrik Jacobsen, Marissa A. Weis, Kevin Swersky, Toniann Pitassi, and Richard S. Zemel. Flexibly fair representation learning by disentanglement. Ar Xiv, abs/1906.02589, 2019. Jeffrey Dastin. Amazon scraps secret AI recruiting tool that showed bias against women. Reuters, October 2018. Maria De-Arteaga, Alexey Romanov, Hanna Wallach, Jennifer Chayes, Christian Borgs, Alexandra Chouldechova, Sahin Geyik, Krishnaram Kenthapadi, and Adam Tauman Kalai. Bias in Bios: A Case Study of Semantic Representation Bias in a High-Stakes Setting. Proceedings of the Conference on Fairness, Accountability, and Transparency - FAT* 19, pp. 120 128, 2019. doi: 10.1145/3287560.3287572. Dheeru Dua and Casey Graff. UCI machine learning repository, 2017. URL http:// archive.ics.uci.edu/ml. John Duchi and Hongseok Namkoong. Variance-based regularization with convex objectives. Journal of Machine Learning Research, October 2016. Peyman Mohajerin Esfahani and Daniel Kuhn. Data-driven Distributionally Robust Optimization Using the Wasserstein Metric: Performance Guarantees and Tractable Reformulations. Mathematical Programming, May 2015. Anthony W Flores, Kristin Bechtel, and Christopher T Lowenkamp. False positives, false negatives, and false analyses: A rejoinder to machine bias: There s software used across the country to predict future criminals. and it s biased against blacks. Fed. Probation, 80:38, 2016. Jerome H. Friedman. Greedy function approximation: A gradient boosting machine. The Annals of Statistics, 29(5):1189 1232, 2001. ISSN 0090-5364, 2168-8966. doi: 10.1214/aos/1013203451. Sahaj Garg, Vincent Perot, Nicole Limtiaco, Ankur Taly, Ed Huai hsin Chi, and Alex Beutel. Counterfactual fairness in text classification through robustness. In AIES 19, 2018. Aude Genevay, Marco Cuturi, Gabriel Peyré, and Francis Bach. Stochastic Optimization for Largescale Optimal Transport. In NIPS (ed.), NIPS 2016 - Thirtieth Annual Conference on Neural Information Processing System, Proc. NIPS 2016, Barcelona, Spain, December 2016. URL https://hal.archives-ouvertes.fr/hal-01321664. 
Stephen Gillen, Christopher Jung, Michael Kearns, and Aaron Roth. Online Learning with an Unknown Fairness Metric. ar Xiv:1802.06936 [cs], February 2018. Ian J. Goodfellow, Jonathon Shlens, and Christian Szegedy. Explaining and Harnessing Adversarial Examples. ar Xiv preprint ar Xiv:1412.6572, December 2014. Tatsunori B. Hashimoto, Megha Srivastava, Hongseok Namkoong, and Percy Liang. Fairness Without Demographics in Repeated Loss Minimization. ar Xiv:1806.08010 [cs, stat], June 2018. Úrsula Hébert-Johnson, Michael P. Kim, Omer Reingold, and Guy N. Rothblum. Calibration for the (Computationally-Identifiable) Masses. ar Xiv e-prints, art. ar Xiv:1711.08513, Nov 2017. Published as a conference paper at ICLR 2021 Christina Ilvento. Metric Learning for Individual Fairness. ar Xiv:1906.00250 [cs, stat], June 2019. Abdul Jerri. Introduction to integral equations with applications. John Wiley & Sons, 1999. Christopher Jung, Michael J. Kearns, Seth Neel, Aaron Roth, Logan Stapleton, and Zhiwei Steven Wu. Eliciting and enforcing subjective individual fairness. Co RR, abs/1905.10660, 2019. URL http://arxiv.org/abs/1905.10660. Faisal Kamiran and Toon Calders. Classifying without discriminating. In 2009 2nd International Conference on Computer, Control and Communication. IEEE, February 2009. doi: 10.1109/ ic4.2009.4909197. URL https://doi.org/10.1109/ic4.2009.4909197. Faisal Kamiran and Toon Calders. Data preprocessing techniques for classification without discrimination. Knowledge and Information Systems, 33(1):1 33, December 2011. doi: 10.1007/s10115-011-0463-8. URL https://doi.org/10.1007/s10115-011-0463-8. Michael Kearns, Seth Neel, Aaron Roth, and Zhiwei Steven Wu. Preventing Fairness Gerrymandering: Auditing and Learning for Subgroup Fairness. ar Xiv e-prints, art. ar Xiv:1711.05144, Nov 2017. Michael Kearns, Seth Neel, Aaron Roth, and Zhiwei Steven Wu. An empirical study of rich subgroup fairness for machine learning. In Proceedings of the Conference on Fairness, Accountability, and Transparency, FAT* 19, pp. 100 109, New York, NY, USA, 2019. ACM. ISBN 9781-4503-6125-5. doi: 10.1145/3287560.3287592. URL http://doi.acm.org/10.1145/ 3287560.3287592. Michael Kearns, Aaron Roth, and Saeed Sharifi-Malvajerdi. Average Individual Fairness: Algorithms, Generalization and Experiments. ar Xiv e-prints, art. ar Xiv:1905.10607, May 2019. Michael Kim, Omer Reingold, and Guy Rothblum. Fairness through computationally-bounded awareness. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett (eds.), Advances in Neural Information Processing Systems 31, pp. 4842 4852. Curran Associates, Inc., 2018. URL http://papers.nips.cc/paper/7733-fairness-throughcomputationally-bounded-awareness.pdf. Michael P. Kim, Amirata Ghorbani, and James Zou. Multiaccuracy: Black-box post-processing for fairness in classification. In Proceedings of the 2019 AAAI/ACM Conference on AI, Ethics, and Society, AIES 19, pp. 247 254, New York, NY, USA, 2019. ACM. ISBN 978-1-4503-6324-2. doi: 10.1145/3306618.3314287. URL http://doi.acm.org/10.1145/3306618.3314287. Preethi Lahoti, Krishna P. Gummadi, and Gerhard Weikum. Operationalizing individual fairness with pairwise fair representations. Ar Xiv, abs/1907.01439, 2019. Jeff Larson, Surya Mattu, Lauren Kirchner, and Julia Angwin. How we analyzed the compas recidivism algorithm. Pro Publica (5 2016), 9, 2016. Jaeho Lee and Maxim Raginsky. Minimax Statistical Learning with Wasserstein Distances. ar Xiv:1705.07815 [cs], May 2017. 
Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian Vladu. Towards Deep Learning Models Resistant to Adversarial Attacks. ar Xiv:1706.06083 [cs, stat], June 2017. Llew Mason, Jonathan Baxter, Peter L Bartlett, Marcus Frean, et al. Functional gradient techniques for combining hypotheses. Advances in Neural Information Processing Systems, pp. 221 246, 1999. Debarghya Mukherjee, Mikhail Yurochkin, Moulinath Banerjee, and Yuekai Sun. Two simple ways to learn individual fairness metrics from data, 2020a. Debarghya Mukherjee, Mikhail Yurochkin, Moulinath Banerjee, and Yuekai Sun. Two simple ways to learn individual fairness metrics from data. In International Conference on Machine Learning, July 2020b. Published as a conference paper at ICLR 2021 Jorge Nocedal and Stephen J. Wright. Numerical Optimization. Springer Series in Operations Research. Springer, New York, 2nd ed edition, 2006. ISBN 978-0-387-30303-1. F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825 2830, 2011. Flavien Prost, Nithum Thain, and Tolga Bolukbasi. Debiasing Embeddings for Reduced Gender Bias in Text Classification. ar Xiv:1908.02810 [cs, stat], August 2019. Aman Sinha, Hongseok Namkoong, and John Duchi. Certifying Some Distributional Robustness with Principled Adversarial Training. ar Xiv:1710.10571 [cs, stat], October 2017. Hanchen Wang, Nina Grgic-Hlaca, Preethi Lahoti, Krishna P. Gummadi, and Adrian Weller. An Empirical Study on Learning Fairness Metrics for COMPAS Data with Human Supervision. ar Xiv:1910.10255 [cs], October 2019. Hanchen Wang, Nina Grgic-Hlaca, Preethi Lahoti, Krishna P. Gummadi, and Adrian Weller. An Empirical Study on Learning Fairness Metrics for COMPAS Data with Human Supervision. ar Xiv e-prints, art. ar Xiv:1910.10255, Oct 2019. Yao-Yuan Yang, Cyrus Rashtchian, Yizhen Wang, and Kamalika Chaudhuri. Adversarial examples for non-parametric methods: Attacks, defenses and large sample limits. Ar Xiv, abs/1906.03310, 2019. Mikhail Yurochkin and Yuekai Sun. Sen Se I: Sensitive Set Invariance for Enforcing Individual Fairness. ar Xiv:2006.14168 [cs, stat], June 2020. Mikhail Yurochkin, Amanda Bower, and Yuekai Sun. Training individually fair ML models with sensitive subspace robustness. In International Conference on Learning Representations, Addis Ababa, Ethiopia, 2020. Brian Hu Zhang, Blake Lemoine, and Margaret Mitchell. Mitigating unwanted biases with adversarial learning. In AIES 18, 2018. Published as a conference paper at ICLR 2021 A PROOFS OF THEORETICAL RESULTS A.1 NUMERICAL CONVERGENCE RESULTS In this section, we focus on analyzing the numerical convergence of our boosting algorithm. To do so, we will work in the framework similar to that found in previous boosting literature (e.g. Mason et al. (1999)). To be concrete, we assume that Y R. Then, we consider all base classifiers to be of the form h: {x1, . . . , xn} Y and thus treat a base classifier h as a vector h Rn (precisely, h = [h(x1), . . . , h(xn)]T ). Moreover, we define the vector y = [y1, . . . , yn]T Rn. In this framework, the loss ℓtakes the form ℓ: Rn Rn R, defined by ℓ(f, y) = 1 n Pn i=1 ℓ(fi, yi). We abuse notation and write L(f) for the robust loss function; this is a functional on Rn since it only depends on the values f(xi) for xi {x1, . . . , xn}. 
In this context, given a functional F : Rn R, the functional derivative F is the gradient of F. Additionally, denotes the standard Euclidean norm on Rn. We require several assumptions for these results: Assumption A.1. (i) ℓis convex as well as firstand second-order differentiable with respect to f. (ii) For any v Rn, ℓis ω1-Lipschitz differentiable, meaning there exists a constant ω1 > 0 such that for any f, f Rn ℓ(f, v) ℓ( f, v) ω1 f f . The assumption that the loss is convex is restrictive, and we only invoke it here to show global convergence. In section 3, we also assume the loss function is bounded. This pair of assumptions (convexity and boundedness) is not particularly restrictive because we have in mind a bounded input space (see Assumption 3.1). We additionally require the following assumption, which is standard in the gradient descent literature (Nocedal & Wright, 2006). Essentially, we suppose that our boosting algorithm moves in a sufficient descent direction at each iteration. That is, the weak learner can ensure the angle between the searching direction and the functional gradient is bounded. Assumption A.2. There is a value δ > 0, such that Algorithm 1 can find a function ht with θt := L(ft), ht L(ft) ht δ, for all t = 1, , T. (A.1) The following convergence result is a consequence of general results on the convergence of functional gradient descent algorithm for minimizing cost functionals (Mason et al., 1999). It states that, with a specific step-size, Algorithm 1 will converge to some classifier f . Theorem A.3. Under Assumptions A.1 and A.2, suppose we run Algorithm 1 with step-sizes αt = L(ft), ht ω1 ht 2 . (A.2) Then, the optimization converges, i.e.: lim t L(ft) = 0. This framework also results in the following, which shows that we obtain a global optimal solution. Theorem A.4. Under Assumptions A.1 and A.2, suppose {ft} is the sequence generated by Algorithm 1 with step-size given by (A.2). Then, any stationary point {ft} is the global optimal solution. The specific step-size presented above is only required for analyzing the convergence properties of Algorithm 1, which can be easily replaced by a step-size selected by some line search methods. We begin the proof of numerical results by providing following preparatory properties of the cost functional L. Lemma A.5. Under Assumption A.1, it holds (i) L is convex, firstand second-order differentiable with respect to f. Published as a conference paper at ICLR 2021 (ii) L is a ω1-Lipschitz differentiable functional, that is L(f) L( f) ω1 f f (A.3) for any f, f F. Proof of Lemma A.5. For part (i), recall L(f) = P T ℓ(f, y), combining with Assumption A.1 (i), we have L(f) = diag(P) ℓ(f) and 2L(f) = diag(P) 2 ℓ(f, y), where diag(P) is a diagonal matrix with [diag(P)]ii = Pi for i = 1, , n. Since P [0, 1]n and Pn i=1 Pi = 1, L(f) is a linear combination of ℓ(fi, yi). Therefore, L is a convex functional of f. For part (ii), according to Assumption A.1(ii), ℓ(f, y) is ω1-Lipschitz differentiable with respect to f, yielding 2 ℓ(f, y) ω1. (A.4) Given any f F, it holds 2L(f) = diag(P) 2 ℓ(f, y) diag(P) 2 ℓ(f, y) ω1, where the first inequality follows from Cauchy Schwarz inequality and the second inequality follows from Assumption (A.4) and P [0, 1]n. Combining (A.5) with the mean value theorem (for functionals) (Jerri, 1999), it holds L(f) L( f) 2L cf + (1 c) f f f where c [0, 1]. It completes the proof. Proof of Theorem A.3. 
By Lemma A.5, we have following lower bound for the difference between two successive loss functions: L(ft) L(ft+1) L(ft) L(ft) + αt L(ft), ht + ω1(αt)2 = αt L(ft), ht ω1(αt)2 We choose αt = L(ft,ht ω1 ht 2 as (A.2) to make the greatest reduction, which yields L(ft) L(ft+1) L(ft), ht 2 2ω1 ht 2 . (A.6) Combining (A.6) with the sufficient decrease condition (A.1), we have L(ft) L(ft+1) δ2 L(ft) 2 2ω1 . (A.7) Summing up both sides of (A.7) from 0 to , we have t=0 L(ft) 2 δ2 (L(ft) L(ft+1)) 2ω1 δ2 L(f0) < , where the second inequality follows by ft 0 for all t, yielding lim t L(ft) = 0. Published as a conference paper at ICLR 2021 Proof of Theorem A.4. Suppose f is a stationary point of {ft}. We prove this theorem by contradiction by assuming that we can find a point ˆf lin(F) such that L(ˆf) < L(f ). The directional derivative of L at f in the direction ˆf f is given by L(f ),ˆf f = lim ζ 0 L(f + ζ(ˆf f )) L(f ) Since L is convex, it holds L(f ),ˆf f lim ζ 0 ζL(ˆf) + (1 ζ)L(f ) L(f ) = L(ˆf) L(f ) < 0. Therefore, L(f ) = 0, which makes the contradiction. A.2 PROOFS OF GENERALIZATION ERROR BOUNDS Proof of Theorem 3.4. Under Assumptions 3.1, 3.2, 3.3, we have the following result from Proposition 3.2 in Yurochkin et al. (2020) (recall Equations (2.2) and (2.4)) |Le(f) Lr(f)| 0 as n . Then, our proof is completed by showing L converges to Le uniformly in F. It has been shown (Blanchet & Murthy, 2016) that the dual to the optimization Le defined in (2.4) is given by Lf(h) = inf λ 0 λϵ 1 i=1 sup x X ℓ(h(x), yi) + λd2 x(x, xi). (A.8) Likewise, calculations (see Appendix B.3) reveal that the dual to (2.5) is given by L(h) = inf λ 0 λϵ 1 i=1 max x {x1,...,xn} ℓ(h(x), yi) + λd2 x(x, xi). (A.9) We thus need to establish a bound on inf λ 0 λϵ 1 i=1 sup x X ℓ(h(x), yi) + λd2 x(x, xi) i=1 max x {x1,...,xn} ℓ(h(x), yi) + λd2 x(x, xi) To do so, let λn be a minimizer of (A.9). Then, we have that i=1 sup x X ℓ(h(x), yi) + λnd2 x(x, xi) i=1 max x {x1,...,xn} ℓ(h(x), yi) + λnd2 x(x, xi) i=1 max x {x1,...,xn} ℓ(h(x), yi) + λnd2 x(x, xi) sup x X ℓ(h(x), yi) + λnd2 x(x, xi) Define a map T on D such that T(xi) arg supx X ℓ(h(x), yi) + λnd2 x(x, xi). By assumption, we have that P (Bn 1/2d(T(xi)) Y) δ/ n. (A.12) Thus, the probability of the event {{x1, . . . , xn} S i Bn 1/2d(T(xi)) = } (i.e. the event that there are no points in the training data in Bn 1/2d(T(xi))) is at most Pn i=1(1 δ/n1/2)n = n(1 δ/n1/2)n. Published as a conference paper at ICLR 2021 Thus assume that, for each i, there is a point (x i , y i ) D such that x i Bn 1/2d(T(xi)). Then, continuing from (A.11), we have that i=1 |ℓ(h(x i ), yi) ℓ(h(T(xi), yi)| + λn(d2 x(x i , xi) d2 x(T(xi), xi)) (A.13) i=1 ω2dx(x i , T(xi) + |λn(d(x i , xi) d(T(xi), xi))(d(x i , xi) + d(T(xi), xi))| (A.14) ω2 n1/2d + 2λndiam(X) ω2 n1/2d + 2ω2diam(X) n1/2d ϵ . (A.16) In line (A.14), we use the fact that ℓis ω2-Lipschitz with respect to dx (Assumption 3.2(ii)). In line (A.15), we use the fact that x i Bn 1/2d(T(xi)) with the triangle inequality (to get that d(x i , xi) d(T(xi), xi) d(x i , T(xi))) and with the fact that d(T(xi), xi) diam(X). Finally, in line (A.16), we apply Lemma A.1 from Yurochkin et al. (2020) which asserts that 0 λn ω2 ϵ. This completes the proof of the desired result. Compared to most theoretical studies of distributionally robust optimization, the proof of Theorem 3.4 is complicated by the restriction of the c-transform to the sample D in (A.9). 
This complication arises because the max over the sample is generally (much) smaller than the max over the sample space due to the curse of dimensionality. This discrepancy between the max over the sample and the max over the space space is responsible for the dependence of the rate of convergence on the dimension of the feature space in Theorem 3.4. B FURTHER IMPLEMENTATION CONSIDERATIONS Here we present a method based on entropic regularization that can be used to quickly obtain an approximate solution to the linear program in Equation 2.8. Moreover, this method requires only simple matrix computations; thus, we are able to implement it in Tensor Flow for quick GPU computations. It is also possible to quickly find the solution to the dual of the linear program (2.8). Obtaining the primal optimizer Π from the dual solution, however, can be difficult. For this reason, the Sinkhorn-based entropic regularization method runs significantly more quickly than the dual method in our implementations if we insist on using complementary slackness to determine the exact primal optimizer Π . In practice, we can quickly obtain an approximation to Π from the dual solution, however (Algorithm 2). A discussion of the dual method can be found in Section B.3. B.1 ENTROPIC REGULARIZATION Equation (2.8) is an optimization problem over probability distributions. Thus, it is reasonable to regularize the objective function in Equation (2.8) with the entropy of the distribution. Formally, define the matrix R by Ri,j = ℓ(f, (xi, yj)) - this is the loss incurred if point j with label yj is transported to point i. Moreover, define the matrix C by Cij = d2 x(xi, xj), and let γ denote a regularization parameter. With this notation, including entropic regularization, the problem becomes: Π = arg max Π Γ R, Π γ log(Π), Π (B.1) where log(Π)ij = log(Πij). Adding the entropy of Π to the objective encourages the optimizer Π to be less sparse than the (low entropy) optimizer of the original optimal transport problem. Note that it is not inherently desirable to find a sparse optimizer Π , since we only consider the marginals of Π while boosting. Coming close to maximizing the original objective R, Π is far more important than sparsity. Published as a conference paper at ICLR 2021 We follow the Sinkhorn method to develop a solution to the problem (B.1). The Lagrangian is given by L(Π, λ, η) = X ij RijΠij γ X ij log(Πij)Πij X ij CijΠij ϵ) (B.2) so that at the optimum we have L Πij = Rij γ γ log(Πij) λj ηCij = 0 (B.3) which means that Π ij = exp(Rij/γ)exp( λj/γ 1)exp( ηCij/γ). (B.4) Now, define matrices V and K and a vector u by the following expressions: Vij = exp(Rij/γ), Kij = exp( ηCij/γ), and uj = exp( λj/γ 1). By definition, we have that Π ij = Vij Kijuj, and that V is a constant (it does not depend on any of the variables of the Lagrangian). Exploiting the constraints of the set Γ (see the text before (2.8)), we see that u can be written in terms of K: X i Πij = uj X i Kij Vij = 1 n uj = 1 n P i Kij Vij . (B.5) Thus, the solution Π depends only on K. Since K is determined completely by the Lagrange multiplier η, we need only to find the value of η that makes the other constraint of the set Γ tight: X ij CijΠij = X ij Cij Vij Kijuj = 1T n(C K(η) V )u = ϵ. (B.6) where A B denotes entrywise (Hadamard) product of the matrices A and B. We use a root finding algorithm to determine the optimal value of η: specifically, we find the root of m(η) = ϵ 1T n(C K(η) V )u. 
In our experiments, the bisection or secant methods generally converge in only around 10 to 20 evaluations of m(η). A fast method for evaluating m(η) is presented in Algorithm 4. It consists of simple entrywise matrix operations (entrywise multiplications and exponentiations); this is highly amenable to processing on a GPU, and we have implemented this Sinkhorn-based method with Tensor Flow. In Algorithm 5, we provide a method for using the optimal root η to obtain the coupling matrix Π following Equation (B.4). Algorithm 4 Fast evaluation of Sinkhorn objective 1: Input: η 0; cost matrix C; loss matrix R; tolerance ϵ; regularization strength γ. 2: Let T exp(R/γ ηC/γ) {entrywise exponentiation} 3: Let u 1/(n1T n T) {entrywise reciprocation} 4: return ϵ 1T n (C T) u Algorithm 5 Find Π using entropic regularization via (B.4) 1: Input: cost matrix C; loss matrix R; tolerance ϵ; regularization strength γ. 2: Let η be the root of Algorithm 4 with fixed inputs C, R, ϵ, and γ. 3: Let S exp(R/γ η C/γ). 4: Let u 1/(n1n S). 5: Define Π by Πij = Sijuj. 6: return Π Published as a conference paper at ICLR 2021 B.2 STOCHASTIC GRADIENT DESCENT FOR FINDING η WITH ENTROPIC REGULARIZATION Although the Sinkhorn-based method presented above is fast, note that it is still not especially scalable, as it requires holding at least two n n matrices (R and C) in memory at the same time. Assuming double precision floating point arithmetic, each of these matrices requires more than 10 GB of memory when n is approximately 4 104. Thus, following Genevay et al. (2016), we express the determination of the optimal dual variable η as the minimization of an expectation over the empirical distribution Pn; thus, it is possible to leverage stochastic gradient descent (with mini-batching) to quickly find the optimal dual variable η while using a small amount of memory. To develop this expression, note that the Lagrangian dual to the problem (B.1) is given by h max Π L(Π, λ, η) i (B.7) where L is defined in Equation (B.2). The optimal value of Π ij for the inner maximum is established in Equation (B.4). Using this optimal value, we see that log(Π ij) = 1 γ (Rij λj ηCij) 1 (B.8) and we can use this to calculate that L(Π , λ, η) = ηϵ + 1 ij Π ij = ηϵ + 1 ij exp Rij λj ηCij (B.9) To minimize this with respect to λ (which is unconstrained) we set λj (L(Π , λ, η)) = 1 i exp Rij ηCij γ 1 = 0 (B.10) which means that i exp Rij ηCij λ j = γ log 1 i exp Rij ηCij Thus, substituting into (B.9), we calculate that L(Π , λ , η) = γ + X i exp Rij ηCij The value η 0 that minimizes (B.11) is the same as the minimizer of min η 0 E(x,y) Pn i exp ℓ(f, (xi, y)) ηd2 x(xi, x) γ where we ignored some constants that don t affect the minimizing value η and substituted in the definitions of Rij and Cij (see the paragraph at the start of Section B.1). The problem B.12 is amenable to minimization via stochastic gradient descent (SGD) - this is presented in Algorithm 6. In every descent step of Algorithm 6, we are only working with a subset of the columns of R and C. Thus, R and C may be stored anywhere (or even computed on-the-fly) - they do not need to be kept in RAM or sent to the GPU memory. Thus, this SGD version can be used for essentially arbitrarily large data sets. Empirically, we also observe good results from running only a few gradient descent steps to obtain an approximation to η in every boosting step; this allows the SGD method to run as quickly as (or more quickly than) the normal Sinkhorn method (Algorithm 4). 
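As an illustration of the stochastic update described here (Algorithm 6 below gives the corresponding pseudocode), the following is a rough NumPy sketch of SGD on the entropic-regularized dual (B.12); the function name and hyperparameters are illustrative assumptions rather than the paper's implementation.

```python
import numpy as np
from scipy.special import softmax

def entropic_dual_eta(R, C, eps, gamma, eta0=1.0, batch=128, steps=300, lr=0.05, seed=0):
    """SGD on the entropic-regularized dual (B.12): minimize over eta >= 0 the average,
    over columns j, of eta*eps + gamma * log( sum_i exp((R[i, j] - eta*C[i, j]) / gamma) )."""
    rng = np.random.default_rng(seed)
    n = R.shape[1]
    eta = eta0
    for t in range(steps):
        cols = rng.integers(0, n, size=batch)
        S = (R[:, cols] - eta * C[:, cols]) / gamma       # n x B scaled scores
        W = softmax(S, axis=0)                            # softmax weights per column
        grad = eps - (W * C[:, cols]).sum(axis=0).mean()  # d/d eta of the minibatch objective
        eta = max(0.0, eta - lr / np.sqrt(t + 1) * grad)
    return eta
```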
To find the final transport map Π*, we use Algorithm 5, as before.

Algorithm 6 SGD to find the optimal dual variable η* with entropic regularization
1: Input: starting point η_1 > 0; cost matrix C; loss matrix R; tolerance ε; regularization strength γ; batch size B; step sizes α_t > 0.
2: repeat
3: Sample indices j_1, . . . , j_B uniformly at random from {1, . . . , n}.
4: Let R_t ← columns j_1, . . . , j_B of R. Let C_t ← columns j_1, . . . , j_B of C. {R_t, C_t ∈ R^{n×B}}
5: Let w^t_j(η) ← sum of all elements in column j of exp((R_t − ηC_t)/γ) {entrywise exponentiation}
6: η_{t+1} ← max{0, η_t − α_t ε − α_t γ (d/dη)[Σ_{j=1}^B log w^t_j(η)]|_{η=η_t}}
7: until converged

B.3 DUAL OF ROBUST EMPIRICAL LOSS FUNCTION L

Although the algorithms presented in the preceding section run quickly and can handle arbitrarily large inputs, they only obtain an approximation to the true optimal transport map (due to the entropic regularization). Here, we present a method to quickly obtain the solution to the dual of the original linear program (2.8). We find that constructing the transport map Π* from the knowledge of the dual optimum is difficult; however, there are quick methods to produce reasonable approximations to Π* from the knowledge of the dual solution.

Following standard methods, the dual of the linear program (2.8) is given by
$$\inf_{\eta \ge 0,\ \nu}\ \epsilon\eta + \frac{1}{n}\sum_{j=1}^n \nu_j \quad \text{s.t.}\quad \nu_j \ge R_{ij} - \eta C_{ij}\ \text{ for all } i, j, \qquad \text{(B.13)}$$
which can be simplified to
$$\inf_{\eta \ge 0}\ \epsilon\eta + \frac{1}{n}\sum_{j=1}^n \max_i\,\big(R_{ij} - \eta C_{ij}\big). \qquad \text{(B.14)}$$
Define λ_j(η) = max_i (R_{ij} − ηC_{ij}) and let M(η) = εη + (1/n)Σ_{j=1}^n λ_j(η) denote the dual objective function from (B.14). Note that each λ_j(η) is a monotone decreasing convex piecewise linear function of η. In addition, since C_{jj} = 0, we have that lim_{η→∞} λ_j(η) ≥ R_{jj}; that is, the λ_j are positive and eventually become constant. This means that the infimum (over η) will either occur at η = 0 or at one of the corners of one of the λ_j (a corner is a value of η such that there exist i_1 and i_2, i_1 ≠ i_2, with R_{i_1 j} − ηC_{i_1 j} = R_{i_2 j} − ηC_{i_2 j} = max_i (R_{ij} − ηC_{ij}); i.e. at least two of the lines that define λ_j intersect and create a point where M(η) is not differentiable). Thus, it is theoretically possible to exactly obtain the optimizer η* by enumerating and testing all of these corners. In practice, we have found that it is quicker to approximate η* by noticing that an element of the subgradient of M(η) is given by
$$\epsilon - \frac{1}{n}\sum_{j=1}^n C_{i_j, j}, \qquad i_j \in \arg\max_i\,\big(R_{ij} - \eta C_{ij}\big). \qquad \text{(B.15)}$$
Thus, a bisection method can be implemented to approximate the point η* such that dM/dη < 0 for η < η* and dM/dη > 0 for η > η*. This is the corner that we are looking for. Using a bisection method (or similar) guarantees that we only need a small fixed number (≈ 30) of subgradient computations in every boosting step.

Complementary slackness can directly be exploited to quickly find the desired solution Π* to the original discrete SenSR LP from the dual optimizer η* if the following conditions hold:
1. There is an index j such that the point η* = arg min_{η ≥ 0} M(η) is a corner of λ_j with |arg max_i (R_{ij} − η*C_{ij})| = 2 (only two of the lines that define λ_j intersect at η*).
2. For all k ≠ j, the point η* is not a corner of λ_k (i.e. |arg max_i (R_{ik} − η*C_{ik})| = 1).
These conditions are based on the fact that, for all i and j, Π_{ij} is the primal variable corresponding to the constraint ν_j ≥ R_{ij} − ηC_{ij} in the dual (B.13). Condition 2 implies that, for all k ≠ j, there is a value of t such that Π*_{tk} = 1/n and Π*_{ik} = 0 for all i ≠ t. Then, condition 1 implies that there are only two nonzero values in column j of Π*.
These two nonzero values must sum to 1/n, and the constraint ⟨C, Π⟩ = ε must be satisfied. With the complete knowledge of the other values of Π*, this results in two linear equations with two unknowns (the two remaining nonzero values of Π*), and this system can be easily solved to give the exact solution Π*. Similar tricks can be applied to rapidly solve for Π* when slightly more entries are candidates for being nonzero (according to complementary slackness).

Unfortunately, since d_x is a fair distance (and thus it usually ignores several directions in X), it is often the case that there are multiple individuals x_i ≠ x_j in the training data such that d_x(x_i, x_j) = 0 (i.e. there are a significant number of off-diagonal entries of C that are 0). In practice, we have observed that these points also often produce the same predicted value f(x_i) = f(x_j) after the first step of boosting. This results in a situation where complementary slackness presents a complicated system of equations for obtaining the primal optimizer Π*. Thus, in order to maintain an efficient boosting algorithm in practice, we use the dual optimizer η* to construct an approximation Π̂ to the primal optimizer Π* using the following heuristic: for all j, randomly select t in arg max_i (R_{ij} − η*C_{ij}) and let Π̂_{tj} = 1/n. That is, in each column, randomly select one of the candidate nonzero entries (according to complementary slackness) and set it to 1/n in Π̂. This approximation Π̂ actually is an interpretable transport map: each point in the training data is mapped (completely) to another point in the training data. We present the construction of Π̂ using Algorithm 2 in the main text, and we use Algorithm 2 in our experiments on the German credit data set and the Adult data set in Section 5. An outline of the full dual method (not using SGD) is presented in Algorithm 7. Although it is not exactly clear how good (or bad) an approximation Π̂ is to Π*, our experimental results show that it is functional.

Algorithm 7 Find Π* using the dual formulation (following (B.15))
1: Input: cost matrix C; loss matrix R; tolerance ε; root tolerance δ.
2: Use the bisection method (or something similar) to find a value η* such that, for all η,
   sup{ε − (1/n)Σ_j C_{i_j j} : i_j ∈ arg max_i (R_{ij} − ηC_{ij})} < 0 when η < η* − δ, and
   inf{ε − (1/n)Σ_j C_{i_j j} : i_j ∈ arg max_i (R_{ij} − ηC_{ij})} > 0 when η > η* + δ.
3: Obtain Π* using steps 8–13 of Algorithm 2.
4: return Π*

C EXPERIMENTAL DETAILS

In this section, we provide the details of the experiments using the BuDRO algorithm. We start with a synthetic motivation, then discuss the three data sets that are presented in the main text.

C.1 SYNTHETIC MOTIVATION

Consider a data set with two features, x_1 and x_2. Suppose that the feature x_1 is protected and that the other feature x_2 can be used for making fair decisions. For example: our task may be to decide which individuals to approve for a loan. The protected feature x_1 may correspond to the percentage of nonwhite residents in the applicant's zip code, and the other feature x_2 may correspond to the applicant's credit score (or some other fair measure of creditworthiness). In this case, we can approximate a fair metric d_x by the difference in the second feature: the fair distance between two individuals (x_1^a, x_2^a) and (x_1^b, x_2^b) is given by d_x((x_1^a, x_2^a), (x_1^b, x_2^b)) = |x_2^a − x_2^b|. We synthetically constructed such a data set in the following manner: 150 individuals were independently drawn from the same centered normal distribution.
Let R be a rectangle of minimum area containing the 150 points; let L_1 be the line passing through the top right corner and the bottom left corner of R and L_2 be the line passing through the top left corner and the bottom right corner of R. 125 of the samples were chosen (uniformly at random) to belong to majority white neighbourhoods: these individuals were labeled 0 if they were below L_1 and 1 if they were above L_1, and then they were all shifted to the left (by 2, so that the majority white cluster is centered at (−2, 0)). The remaining 25 individuals make up the majority nonwhite cluster: they were classified according to L_2 and then centered at (2, 0).

Figure 1: Comparison of GBDT classifiers without (a) and with (b) the fairness constraints introduced in this paper. Panel (a): naive boosted tree classifier; panel (b): fair boosted tree classifier. The classifiers output a probability in [0, 1]; these probabilities are discretized to binary labels to create a classification. In the figures, red points are individuals with true label 0 and blue points are individuals with true label 1. The darker red areas correspond to lower output probabilities and the darker blue areas correspond to higher output probabilities. The arrows in (b) indicate the transport map corresponding to the fairness constraint for the previous boosting step.

In Figure 1, we show an example of this setup. The red points in the figure correspond to individuals that are labeled 0; these represent individuals that should be declined for the loan (e.g. they have defaulted on a loan, with these data collected in the past to make future predictions). The blue points are individuals that are labeled 1 and represent people who should be approved (e.g. they have paid back a loan in the past). The horizontal axis represents the percentage of nonwhite residents in an individual's zip code, while the vertical axis is taken as some fair measure of creditworthiness.

The naively trained GBDT classifier in Figure 1(a) shows high accuracy on this synthetic data set, but it is unfair: it requires that individuals from neighbourhoods containing medium to high percentages of nonwhite individuals obtain a significantly higher credit score than those in very racially homogeneous neighbourhoods. Applying our individually fair gradient boosting algorithm to the data set, however, results in a fair classifier (visualized in Figure 1(b)). The credit score threshold for loan acceptance is consistent across the different racial makeups of zip codes. This provides a synthetic visualization of the performance of the BuDRO method, and suggests that it is working as we have described. As a tangent: note that the clusters in this example are generated in a symmetric manner. This is a case where the unfairness in the naively trained classifier comes from a lack of data and a push for extra accuracy rather than a specific inherent bias in the data.

C.2 DETAILS COMMON TO ALL EXPERIMENTAL DATA SETS

Fair metric We follow one of the methods presented in Yurochkin et al. (2020) to construct fair metrics for the experimental data sets. Specifically, for a given data set, we determine a finite set T of protected directions in X. In a similar fashion to the synthetic example in C.1, changes in these protected directions should intuitively be ignored by a fair measure.
Thus, the fair metric d_x is defined by projecting onto the orthogonal complement of span(T) and considering the Euclidean distance on the projected space. In particular, let A = span(T) and let proj_A denote the projection onto A. Then,
$$d_x(x_1, x_2) = \big\|(I - \mathrm{proj}_A)(x_1 - x_2)\big\|_2. \qquad \text{(C.1)}$$
The protected directions are determined in a similar fashion for each experimental data set. In particular, each data set contains (one or more) protected attributes: the indicator for each protected attribute is included in the set of protected directions.¹ We additionally obtain one or more extra protected directions in the following manner. For a fixed protected attribute g (e.g. gender), we remove the feature g from the data set and train a linear model on the edited data set to predict the removed feature g (we use logistic regression with an ℓ2 regularization of strength 0.1 when g is binary; ridge regression with cross validation is used for a non-binary feature g). Let w be the normal vector to the separating hyperplane corresponding to this linear model: we include w in T, the set of protected directions.

¹ The indicator for an attribute is a vector with only one nonzero entry; that nonzero entry appears in the attribute that we are indicating.

For example, on the Adult data set, we consider both the gender and race features as protected attributes (both features are binary; see the description in Section 5). The set T for Adult then contains three protected directions. Two of these protected directions are given by e_g, the indicator vector of the gender feature, and e_r, the indicator vector of the race feature. We obtain the third protected direction w by removing the gender feature from the data set and training logistic regression on this edited data set with targets given by the gender feature. Several of the features in the Adult data set (such as the is_husband and is_wife categories of the relationship feature) are highly predictive of gender. Combined with the gender imbalance in the data set, this allows logistic regression to predict gender with nearly 80% accuracy on a (holdout) test set. We thus take w to be the normal vector to the separating hyperplane discovered during this logistic regression.

Note that a metric defined in this way will assign a distance of 0 between two individuals that differ only in the protected attributes (e.g. gender or race) but are identical in all other features. Additionally, if the difference x_1 − x_2 between two individuals is nearly parallel to one of the logistic regression directions in T, then d_x(x_1, x_2) will also be small. For example, on the Adult data set, the vector w (defined in the previous paragraph) exhibits comparatively high support on the is_husband and is_wife categories of the relationship feature. Thus, if x_1 and x_2 are identical except that x_1 has wife = 1 and x_2 has husband = 1, then d_x(x_1, x_2) will be small. This fair metric d_x is an approximation to an actual fair metric on X: it does not capture information about all protected features, and only makes a heuristic approximation to reduce differences between true race and gender groups of individuals.

Comparison methods Here we give an overview of the ML methods that we compare to BuDRO. Please refer to the included code for full information about the hyperparameter grids that we consider; see the respective data set sections for further information about training details and the optimal hyperparameter choices. We consider three boosting methods in addition to BuDRO: baseline GBDTs, projecting, and reweighing.
Our implementations of these methods all use the XGBoost framework (Chen & Guestrin, 2016). The GBDT hyperparameters that we consider are: max_depth, the maximum depth of the decision trees used as weak learners; lambda, an ℓ2 regularization parameter; min_weight, a tree regularization parameter; and eta, the XGBoost learning rate. We additionally examine the effects of boosting for different numbers of steps. We always present the results of reweighing using the default XGBoost hyperparameters.

The projecting preprocessing method functions by eliminating the entire protected subspace from the data set before training a vanilla GBDT (see, for example, Bower et al. (2018); Yurochkin et al. (2020); Prost et al. (2019)). In particular, using the notation from the discussion of fair metrics (above), we project all of the data onto the orthogonal complement of span(T) as a preprocessing step. This has the effect that the final classifier will be completely blind to differences in the protected directions in T. Our experiments (see e.g. Table 2) show that this is not enough to produce an individually fair classifier: changes to attributes that are highly correlated with elements of T are still used to make classification decisions.

Reweighing is presented in Kamiran & Calders (2011). Essentially, this method functions by assigning weights to the individuals in the training data. These weights are chosen to force protected group status to appear statistically independent of the outcomes in Y. Then a GBDT is trained on the reweighted data. This is inherently a group fairness method (the data are weighted to match a group fairness constraint) that cannot be used when the protected attribute does not take a finite number of values. We use the default XGBoost parameters along with reweighing to generate the values that are presented in the main text.

To test if GBDT classifiers are useful on the tabular data sets that we consider here, we also train (naive) one-layer (100 unit) fully connected neural networks on all data sets. For the Adult and the COMPAS data sets, we additionally compare to the SenSR method from Yurochkin et al. (2020) and the adversarial debiasing method from Zhang et al. (2018). The SenSR method creates an individually fair neural network through stochastic gradient descent on a robust loss similar to the one considered in this work. Adversarial debiasing is based on minimizing the ability to predict the protected attributes from knowledge of the final outputs of a predictor, and also draws on ideas from robustness in machine learning. It is constructed to provide improvements to statistical group fairness quantities rather than for the creation of an individually fair classifier.

Evaluation metrics We evaluate the ML methods with a type of individual fairness metric that we define below, also discussed in Section 5 of the main text. It is not generally true that individual fairness will imply group fairness: it will depend on the group fairness constraint in question as well as the fair metric d_x on X. We thus also report several group fairness metrics to allow for comparison with other methods in the literature; it remains future work to establish the precise conditions under which individual fairness will imply group fairness.
The specific group fairness metrics that we consider here are GAP_Max and GAP_RMS, as introduced in De-Arteaga et al. (2019). Suppose that there are two groups of individuals, labeled by g = 0 (a protected group) and g = 1 (a privileged group). Then, given true outcomes Y and predicted outcomes Ŷ, we can consider a statistical fairness gap defined by
$$\mathrm{Gap}_y = P(\hat{Y} = y \mid Y = y,\, g = 0) - P(\hat{Y} = y \mid Y = y,\, g = 1)$$
for each possible outcome y ∈ Y. Note that a large value of |Gap_y| corresponds to (correctly) assigning the outcome y to a larger fraction of individuals belonging to one of the groups than the other. For example, suppose that Y = {0, 1}, with 1 indicating a favorable outcome. Then, a large value of |Gap_1| means that our classifier is able to identify the successful individuals from one group at a higher rate than it is able to identify the successful individuals from the other. Thus, large values of Gap_y indicate unfair performance by the classifier at the level of groups. We then define
$$\mathrm{GAP}_{\mathrm{Max}} = \max_{y \in \mathcal{Y}}\,|\mathrm{Gap}_y| \quad\text{and}\quad \mathrm{GAP}_{\mathrm{RMS}} = \Big(\tfrac{1}{|\mathcal{Y}|}\sum_{y \in \mathcal{Y}} \mathrm{Gap}_y^2\Big)^{1/2}. \qquad \text{(C.2)}$$

It is difficult to evaluate the individual fairness of an ML model since we only have access to an approximation to a true fair distance on X (we cannot, for example, consider pairs of training points that are definitely close to each other). We expect that the fair metrics considered in this work are good approximations to true fair metrics; nonetheless, we still desire to evaluate the individual fairness of an ML model in a way that is agnostic to the choice of approximate fair metric. For this reason, we construct counterfactual pairs of individuals (that is, pairs of individuals that differ only in certain protected attributes) from the input data. Intuitively, these counterfactual pairs should be treated the same way by any individually fair classifier (they should be close according to the true fair metric).²

² This is somewhat debatable. For example, does flipping gender but keeping all other attributes the same truly result in an equivalent individual of a different gender?

We thus examine how often these counterfactual individuals are assigned to the same outcome. These evaluation criteria are inspired by Garg et al. (2018), which examined changes in predicted sentiment when changing one word in a sentence. Specifically, suppose an attribute g takes values in a set V. To measure the consistency of the predictor f with respect to g (the "g-consistency" of f), we construct |V| copies of the test data. These copies are altered so that the value of the attribute g is constant on each copy and so that each value in V is represented in a copy of the data. We then apply f to each copy of the data to obtain |V| vectors of predicted outcomes ŷ_1, . . . , ŷ_{|V|}, ŷ_j ∈ Y^n. The g-consistency of f is then the fraction of individuals who are assigned the same outcome in every copy of the data set. That is, it is the fraction of indices i such that (ŷ_1)_i = (ŷ_2)_i = . . . = (ŷ_{|V|})_i.

C.3 GERMAN CREDIT

German credit is a data set that is commonly evaluated in the fairness literature (Dua & Graff, 2017). It contains information about 20 attributes (seven numerical, 13 categorical) from 1000 individuals; the ML task is to label the individuals as good or bad credit risks. Several of the features, such as amount_in_savings and length_current_employment, are numerical but have been recorded as categorical.
Additionally, some of the other features, such as employment_skill_level and credit_history_status, are categorical but could be considered to be ordered (for example, "all credits paid" is better than "past delays in payments", which is better than "account critical"). In some analyses, these ordered categorical features are converted to integers; for the purposes of this analysis, we one-hot encode all of the categorical features (which results in 62 features in our input data). We additionally standardize each numerical feature (that is, center by subtracting the mean and divide by the standard deviation). None of the data points are removed.

We treat age (a numerical feature that we standardize) as the protected attribute in the German credit data set. In order to report the group fairness metrics, we consider two protected groups: one group consists of individuals who are younger than 25, the other group consists of those who are 25 or older. This split was proposed by Kamiran & Calders (2009) to formalize the potential for age discrimination with this data set.

Fair distance For the German credit data set, we consider two protected directions, the first of which is the indicator vector e_a of the age attribute. Additionally, following the framework described in C.2, we eliminate the age attribute from the data and use ridge regression to train a classifier to predict age from the other features (we use the default parameters in the RidgeCV class from the scikit-learn package, version 0.21.3 (Pedregosa et al., 2011)). The second protected direction w is a normal vector to the hyperplane created by ridge regression (i.e. it is the vector of ridge regression coefficients).

Evaluation metrics As discussed above, we consider the GAP_RMS and GAP_Max for a binarized version of the age attribute. For an individual fairness metric, we cannot examine an age consistency, since age is a numerical feature. German credit contains a categorical personal_status feature that encodes some information about gender and marital status,³ however. After one-hot encoding, we find that several of the personal_status categories are well-correlated with the age attribute according to ridge regression. Specifically, the male_divorced/separated category is positively correlated with age (the ridge regression coefficient has an average of 0.26 over our 10 train/test splits) and the male_married/widowed feature is negatively correlated with age (the ridge regression coefficient has an average of −0.25). Since this personal_status attribute can be considered to be a protected attribute, we examine a consistency measure based on this personal_status attribute as described in C.2. We refer to this consistency measure as status consistency (or S-cons).

³ The personal_status categories are male_single, male_married/widowed, male_divorced/separated, female_single, and female_divorced/separated/married.

Method training and hyperparameter selection The labels of the German credit data are quite unbalanced (only 30% of the labels are 1). We thus train all ML methods to optimize the balanced accuracy. For the GBDT methods (baseline, projecting, and BuDRO), this is accomplished by setting the XGBoost parameter scale_pos_weight to 0.7/0.3, the ratio of zeros to ones in the labels of the training data. For naive (baseline) NNs, we sample each minibatch to contain equal numbers of points labelled 0 and labelled 1.
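As an illustration of this class-balancing step, the weighting amounts to a single XGBoost parameter. The helper below is a hypothetical sketch (not the released training code) using the XGBoost scikit-learn wrapper, with all other hyperparameters passed through as placeholders.

```python
import numpy as np
import xgboost as xgb


def fit_balanced_gbdt(X_train, y_train, **xgb_params):
    # Weight the positive class by (#zeros / #ones) in the training labels
    # (roughly 0.7/0.3 on German credit) so the booster targets balanced accuracy.
    ratio = float(np.sum(y_train == 0)) / float(np.sum(y_train == 1))
    model = xgb.XGBClassifier(scale_pos_weight=ratio, **xgb_params)
    model.fit(X_train, y_train)
    return model
```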
For both the baseline GBDT classifier and the projecting method, we search over a grid of GBDT hyperparameters on ten 80% train/20% test splits, and choose the set of hyperparameters that optimizes the average balanced test accuracy over those ten splits. See C.2 for the parameters that we tune and the code for the values that we consider. For BuDRO, we tune parameters by hand on one 80% train/20% test split. We use the dual implementation to find the optimal transport map (without SGD, see Algorithm 7); thus the only extra hyperparameter to consider is the perturbation budget ε. The data in Table 1 are collected by averaging the results obtained using the optimal hyperparameter choices across 10 new 80% train/20% test splits (the same splits for each method). The optimal hyperparameters are presented in Table 4.

Table 4: Optimal XGBoost parameters for the German credit data set. For BuDRO, we also used a perturbation budget of ε = 1.0.

Method      max_depth  lambda  min_weight  eta    steps
Baseline    10         1000    2           0.5    105
Projecting  7          2000    2           0.5    111
BuDRO       4          1.0     1/80        0.005  90

We also train a neural network on the German credit data set (see Appendix C.2 for a description of the architecture). We find that we consistently obtain high accuracy when we use a learning rate of 10^-4 and run for 4100 epochs (without any ℓ2 regularization).

Results Table 5 reproduces the data from the main text (averages over ten 80% train/20% test splits) including standard deviation values.

Table 5: Results on the German credit data set. We report the balanced accuracy in the second column. These results are based on 10 splits into 80% training and 20% test data. The last two columns are age gaps.

Method       BAcc           Status cons    Age GAP_Max    Age GAP_RMS
BuDRO        0.715 ± 0.032  0.974 ± 0.025  0.185 ± 0.055  0.151 ± 0.048
Baseline     0.723 ± 0.019  0.920 ± 0.022  0.310 ± 0.159  0.241 ± 0.109
Project      0.698 ± 0.024  0.960 ± 0.029  0.188 ± 0.086  0.144 ± 0.064
Baseline NN  0.687 ± 0.031  0.826 ± 0.028  0.234 ± 0.126  0.179 ± 0.093

C.4 ADULT

The Adult data set (Dua & Graff, 2017) is another commonly considered benchmark data set in the algorithmic fairness literature. After preprocessing and removing entries that contain missing data, we consider a subset of Adult containing information about 11 demographic attributes from 45222 individuals. Each individual is labelled with a binary label that is 0 if the individual makes less than $50000 per year and 1 if the individual makes more than $50000 per year. In our preprocessing, we standardize the (five) continuous features and one-hot encode the (six) categorical features, resulting in 41 total features that we use in our analysis. These features include several attributes that could be considered protected, such as age (continuous), relationship_status (categorical), gender (binary), and race (here we consider a binary white vs non-white race feature). For this experiment, we choose to construct a predictor f for the labels that is fair with respect to the gender and race attributes. As mentioned in the main text, we exactly follow the experimental set-up described in Yurochkin et al. (2020) for the Adult data set. In particular, we do not remove the gender and race attributes from the training data. This helps to produce an adversarial analysis (how fair can the method become when it is explicitly training on gender and race).

Fair distance The fair distance on Adult is described in detail in C.2.
Briefly, we consider three protected directions: e_g, the indicator for the gender attribute; e_r, the indicator for the race attribute; and w, the normal vector to the separating hyperplane obtained via logistic regression trained to predict the gender attribute from the other attributes.

Evaluation metrics As previously mentioned, we consider both gender and race as protected attributes on the Adult data set; thus we report GAP_Max and GAP_RMS for the gender and race features separately. We examine two types of counterfactual individual fairness metrics. The first we refer to as spouse consistency (S-cons). The S-cons is determined by creating two copies of the data: one in which every point belongs to the husband category of the relationship_status feature and the other in which every point belongs to the wife category. Unlike the framework described in C.2, we do not consider all categories of the relationship_status feature. To be concrete, the two altered copies of the data are used to obtain two vectors of predicted outcomes ŷ^h and ŷ^w, and the S-cons is the fraction of outcomes that are the same in these two vectors. Explicitly, S-cons is calculated as (1/n)|{i: ŷ^w_i = ŷ^h_i}|. Intuitively, the S-cons measures how likely an individual is to be assigned to a different outcome simply from labeling themselves as a wife rather than a husband.

The other evaluation metric is the gender and race consistency (GR-cons), which involves four copies of the input data. Each copy is altered so that the gender and race features are constant on that copy of the data, and so that each copy has a different combination of the race and gender features from the other three copies. We then apply the classifier to each copy of the data to produce four vectors of predicted outcomes ŷ_i ∈ {0, 1}^n, i = 0, 1, 2, 3. The GR-cons is defined as the fraction of outcomes that are the same in all of the ŷ_i.

Method training and hyperparameter selection The labels for the Adult data set are unbalanced (only about 25% of people make more than $50000 per year) and thus all methods are again trained for balanced accuracy. For the GBDT methods (baseline, projecting, BuDRO) this is accomplished by setting the XGBoost parameter scale_pos_weight to be the ratio of zeros to ones in the labels of the training data. The data from the other methods (baseline NNs, SenSR, and adversarial debiasing) are obtained from Yurochkin et al. (2020); see that work for further information about the method training.

The baseline GBDT and projecting methods were trained to optimize balanced test accuracy over a grid of hyperparameters on one 80% train/20% test seed. The optimal hyperparameters discovered in this way were then used to collect data on ten different 80% train/20% test splits: the average of the results on the test data from these new train/test splits is presented in Table 2 in the main text (see Table 7 for information about standard error). The optimal GBDT parameters are presented in Table 6.

Table 6: Optimal XGBoost parameters for the Adult data set. For BuDRO, we also used a perturbation budget of ε = 0.4. The value of min_weight for BuDRO is computed relative to the size of an 80% training set, which contains 36177 individuals.

Method      max_depth  lambda  min_weight  eta    steps
Baseline    3          0.01    0.5         0.05   816
Projecting  8          0.5     0.5         0.05   668
BuDRO       14         10^-4   0.1/36177   0.005  180

BuDRO was implemented using the dual SGD formulation to find the optimal transport map Π*, as described in Algorithm 2.
This involves several additional hyperparameters that we set in the following ways: an initial guess of the dual variable (set to 0.1), a batch size (set to 200), a number of SGD iterations (set to 100), a momentum parameter (set to 0.9), and an SGD learning rate (set to 10^-4). The optimal perturbation budget ε was found to be 0.4. See C.4.1 for more information about the hyperparameter selection on Adult.

C.4.1 FURTHER RESULTS

Table 7 reproduces the results from the main text (averages over ten 80% train/20% test splits) including standard deviations. Only the data collected from GBDT methods are reproduced in Table 7; the data from other methods were obtained from Yurochkin et al. (2020). See Yurochkin et al. (2020) for information about the standard error in these values.

Table 7: Results on Adult. We report the balanced accuracy in the second column. S-cons and GR-cons are individual fairness measures; the remaining columns are gender and race gaps.

Method    Acc           S-cons        GR-cons       Gender GAP_Max  Gender GAP_RMS  Race GAP_Max  Race GAP_RMS
BuDRO     .815 ± 0.005  .944 ± 0.013  .957 ± 0.007  .146 ± 0.012    .114 ± 0.013    .083 ± 0.018  .072 ± 0.017
Baseline  .844 ± 0.003  .942 ± 0.007  .913 ± 0.008  .200 ± 0.006    .166 ± 0.008    .098 ± 0.013  .082 ± 0.013
Project   .787 ± 0.005  .881 ± 0.079  1 ± 0.000     .079 ± 0.022    .069 ± 0.018    .064 ± 0.021  .050 ± 0.016
Reweigh   .784 ± 0.005  .853 ± 0.010  .949 ± 0.009  .131 ± 0.021    .093 ± 0.015    .056 ± 0.031  .043 ± 0.022

Figure 2: Fairness measures at given accuracy levels for the BuDRO method on the Adult data set, considered over 10 train/test splits. The results from each choice of hyperparameters are averaged over all train/test splits before being grouped into accuracy bins. The plotted points are the ones from each bin that optimize the specified quantity ((a) average GAP_RMS, (b) race GAP_RMS, (c) gender GAP_RMS, (d) spouse consistency). Error bars represent one standard deviation. Empirically, the gender gaps were harder to reduce than the race gaps. The S-cons in all of these pictures never drops below 94%.

We illustrate the fairness of BuDRO at different accuracy levels in Figures 2 and 3. The data in Figure 2 was also used to guide hyperparameter selection for the BuDRO method on the Adult data set. Before analyzing the information in these figures, we provide a description of how they were generated. We separate the accuracy axis (horizontal) into bins of a fixed width (here, the bin size is 0.0016). We then consider ten 80% train/20% test splits of the data set, and explore a grid of hyperparameters for each of these train/test splits (see the included code for the specific hyperparameter grids that we examined). The results from each choice of hyperparameters are averaged (over all ten train/test splits) and are placed in the bin containing the average accuracy; the error bars in the figures correspond to the standard error from this averaging.

In Figure 2, we present four figures: each figure is obtained by determining the set of hyperparameters in each accuracy bin that optimizes the average (over the 10 seeds) of a given fairness quantity (on the test set). Thus, each figure contains the data from approximately 30 hyperparameter selections (one for each bin, different for each figure). For example, in Figure 2(c), each point was obtained from the classifier constructed using the hyperparameters that minimize the gender GAP_RMS in
In a similar fashion, Figure 2(a) is constructed by selecting the hyperparameters from each accuracy bin that minimize the average of the race GAPRMS and the gender GAPRMS. In all of the plots, the S-cons never drops below 94%, and the GR-cons remains similarly high; thus, we do not focus on the consistency measures here. In the following analysis, we concentrate on minimizing the gender GAPRMS (Figure 2(c)) due to the fact that the gender gaps were significantly more difficult to shrink than the race gaps in Figure 2. In fact, as seen in Table 2, the baseline classier exhibits fairly small race gaps even though it is trained with knowledge of the race feature. In real-world use, however, it is not clear which fairness quantity a user would be required to optimize; thus, Figure 2 presents a picture of the different trade-offs involved. The Bu DRO data in Table 2 was collected from ten different (new) 80% train/20% test splits, using hyperparameters chosen by examining Figure 2(c). We were interested in finding a point with high accuracy and high spouse consistency; thus, we examined hyperparameters corresponding to the points in the accuracy range 0.81 to 0.825 and looked for patterns in these hyperparameters. We attempt this generalization to make hyperparameter selection slightly more realistic: it seems willfully ignorant to disregard all of this data when selecting hyperparameters for our final tests, and it is at least plausible that a user could find some of these generally good hyperparameters via hand-tuning. This results in the set of Bu DRO hyperparameters that were presented in Table 6. See additionally the code for more details about the selected hyperparameters. In Figures 2(a) and (c), we observe a trade-off between accuracy and fairness with the Bu DRO method. That is, by decreasing accuracy (which generally corresponds to increasing the perturbation parameter ϵ), we are able to decrease the group fairness gaps, all at a high level of S-cons. Thus, it is possible to chose parameters to produce smaller group fairness gaps (with potentially lower accuracy) if the application calls for that. Figure 3: The accuracy vs fairness trade-off of Bu DRO when compared to other baseline boosting algorithms on the Adult data set. All lines are chosen to minimize gender Gap RMS. (a) contains a comparison of the Gender gap RMS. The Bu DRO line also appears in Figure 2(c). (b) contains a comparison of the S-cons. Error bars represent one standard deviation. Figure 3 includes a comparison of the Bu DRO method to some other baseline boosting methods to further illustrate the trade-off between accuracy and fairness. Each point in Figure 3 is chosen to be the one from the accuracy bin that minimizes the gender Gap RMS. This figure is constructed specifically to explore how fair we can make a classifier at given (fixed) levels of accuracy. The vanilla GBDT method always produces high accuracy classifiers (in the hyperparameter grid that we consider). This comes with large group fairness gaps that we are unable to decrease. On the other hand, the projecting method always produces good group fairness gaps, but we are unable to obtain high accuracy with this method. Finally, the reweighing method can obtain high accuracy with low fairness gaps, but it never produces an acceptable S-cons. 
Figure 3 suggests that BuDRO is a better method than projecting at all accuracy levels on Adult: at any accuracy level, BuDRO matches the group fairness gaps produced by projecting while improving upon the (consistency of the) individual fairness metric. Projecting always produces a classifier with small group fairness gaps. Unlike projecting, however, BuDRO can also be used to construct a more accurate classifier if we allow for slightly larger group fairness gaps. Depending on the requirements of the application (e.g. as defined by law or other application-specific fairness goals), this allows for more flexibility in the creation of individually fair and accurate classifiers.

It is also interesting to observe how the BuDRO curve in Figure 3 meets the curve for vanilla GBDTs. Specifically, as the curves meet, the gender gap of the BuDRO method is increasing, while the gender gap of the vanilla GBDT curve appears to be slightly decreasing. We speculate that this is due to the fact that the grid of hyperparameters used in the construction of Figure 3 does not contain values of ε smaller than 0.1. That is, the value of ε does not go down to 0, so we cannot expect to precisely recover the baseline results. Essentially, we force a small perturbation in the data while also pushing for high accuracy. Thus, it appears that the high accuracy points in the figure correspond to solutions that are obtained by improving race fairness without significantly improving gender fairness (see also Figure 2(c): the race fairness always remains small). These high accuracy cases apparently find a perturbation that is mostly along the race axis. Overall, these figures provide further evidence that the BuDRO method creates an individually fair classifier while still obtaining high accuracy (on tabular, structured data like Adult) due to the ability to leverage powerful GBDT methods.

C.5 COMPAS

The COMPAS recidivism prediction data set, compiled by ProPublica (Larson et al., 2016), includes information on the offenders' gender, race, age, criminal history (charge_for_arrest, number_prior_offenses), and the risk_score assigned to the offender by COMPAS. ProPublica also collects whether or not these offenders recidivate within two years as the ground truth. More details and discussions of the data set can be found in Angwin & Larson (2016); Flores et al. (2016). We remove the risk_score attribute, standardize the number_prior_offenses attribute and one-hot encode the age attribute, yielding a final data set of 5278 individuals with 7 features.

Since there are only seven features (with only one continuous feature) in the COMPAS data, there are many different pairs of individuals (i, j) with the same similarity distance C_{ij}. Then, there can be many different solutions when finding t ∈ arg max_i (R_{ij} − η*C_{ij}) with the dual method (see Appendix B.3 and Algorithm 7). Therefore, we consider the entropic regularization form (B.1) of the linear program (2.8), and solve it using the fast Sinkhorn method in Algorithms 4 and 5.

Fair metric We define three protected directions: e_g, the indicator vector of gender; e_r, the indicator vector of race; and a protected direction w obtained by eliminating the race attribute and training logistic regression to predict the binary race label on the edited data set. Then, we follow the steps in C.2 for computing the projection onto A = span{e_g, e_r, w} and calculating the fair distances d_x in (C.1) for all pairs of individuals.
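As a concrete illustration of this construction, the projection-based distance (C.1) can be computed in a few lines of NumPy. The helper below is hypothetical (its name and the convention that the protected directions e_g, e_r, and w are stacked as rows of a matrix T are ours), not the released implementation.

```python
import numpy as np


def fair_distance(x1, x2, T):
    # d_x(x1, x2) = ||(I - proj_A)(x1 - x2)||_2 with A = span of the rows of T, cf. (C.1)
    Q, _ = np.linalg.qr(np.asarray(T, dtype=float).T)   # orthonormal basis of A = span(T)
    diff = np.asarray(x1, dtype=float) - np.asarray(x2, dtype=float)
    return float(np.linalg.norm(diff - Q @ (Q.T @ diff)))
```

The squared distances d_x^2(x_i, x_j) computed this way are what populate the cost matrix C used throughout Appendix B.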
Evaluation metrics We evaluate the methods using several individual fairness metrics, as well as some group fairness metrics. For group fairness metrics, we report GAP_Max and GAP_RMS for race and gender separately. We consider two counterfactual individual fairness evaluation measures on COMPAS based on the two protected attributes race and gender (which are both binary). Therefore, we can make the counterfactual examples by flipping the protected attributes. For example, we generate two copies of the data set: one in which all individuals are female, the other in which all individuals are male. Then the gender consistency (G-cons) is obtained by calculating the fraction of outcomes that are classified the same on the two copies. The race consistency (R-cons) is calculated in a similar way.

Method training and hyperparameter selection We compare to the same methods that were considered on Adult: baseline GBDT, projecting, and reweighing (based on XGBoost); and baseline NN, SenSR, and adversarial debiasing (based on neural networks). The results reported are averaged over ten random training (80%) and testing (20%) splits of the data set. We implement the adversarial debiasing method (the adversarial_debiasing class from IBM's AIF360 package, version 0.2.2 (Bellamy et al., 2018)) with the default hyperparameters, and combine the implementation of reweighing (the reweighing class from AIF360, with default hyperparameters) with running XGBoost (default parameters) for 100 boosting steps. For the baseline GBDT and projecting, we select hyperparameters by splitting 20% of the training data into a validation set and evaluating the performance on the validation set. For the baseline NN, SenSR, and BuDRO methods, we manually tune the hyperparameters on one train/test split and use the resulting hyperparameters for computing the averaged results over 10 restarts.

We report the optimal hyperparameters for the GBDT methods in Table 8. For these methods, lambda is 10^-8, and min_child_weight is 0.1/n_train, where n_train is the number of training samples. We set scale_pos_weight to 1 for projecting and the baseline, and scale_pos_weight to the ratio of zeros to ones in the labels of the training data for BuDRO. The optimal perturbation budget ε for BuDRO is 0.12.

Table 8: Optimal XGBoost parameters for the COMPAS data set.

Method      max_depth  eta        steps
Baseline    3          5×10^-4    1600
Projecting  4          7.5×10^-4  2000
BuDRO       2          1.5×10^-5  68

For the neural network based methods, the optimal hyperparameters are described below. For the baseline NN: learning rate = 5×10^-6, number of epochs = 27000. For SenSR: perturbation budget = 0.01, epoch = 2000, full epoch = 10, full learning rate = 0.0001, subspace epoch = 10, subspace learning rate = 0.1.

Results Table 9 reproduces the results from Table 3 in the main text for BuDRO and the comparison methods, including the associated standard deviations.

Table 9: Results on the COMPAS data set. These results are based on 10 random splits into 80% training and 20% test data.
Method       Acc            G-cons         R-cons         Gender GAP_Max  Gender GAP_RMS  Race GAP_Max   Race GAP_RMS
BuDRO        0.652 ± 0.012  1.000 ± 0.000  1.000 ± 0.000  0.124 ± 0.053   0.099 ± 0.037   0.145 ± 0.028  0.125 ± 0.021
Baseline     0.677 ± 0.014  0.944 ± 0.038  0.981 ± 0.040  0.223 ± 0.055   0.180 ± 0.041   0.258 ± 0.056  0.215 ± 0.046
Project      0.671 ± 0.012  0.873 ± 0.025  1.000 ± 0.000  0.189 ± 0.040   0.150 ± 0.030   0.230 ± 0.043  0.185 ± 0.033
Reweigh      0.666 ± 0.020  0.788 ± 0.023  0.813 ± 0.047  0.245 ± 0.054   0.207 ± 0.046   0.092 ± 0.027  0.069 ± 0.019
Baseline NN  0.682 ± 0.011  0.841 ± 0.027  0.907 ± 0.027  0.282 ± 0.045   0.246 ± 0.028   0.258 ± 0.055  0.228 ± 0.047
SenSR        0.652 ± 0.018  0.977 ± 0.029  0.988 ± 0.016  0.167 ± 0.065   0.129 ± 0.047   0.179 ± 0.039  0.159 ± 0.032
Adv. Deb.    0.671 ± 0.016  0.854 ± 0.028  0.818 ± 0.088  0.246 ± 0.064   0.219 ± 0.051   0.130 ± 0.065  0.108 ± 0.059

C.6 METHOD TIMING INFORMATION

Table 10 contains the training runtimes of the vanilla GBDT and the BuDRO methods. The runtimes of the other fair GBDT methods (projecting and reweighing) are dominated by the time required for running vanilla GBDT after preprocessing; thus, the runtimes for these methods are omitted here. The results show that the BuDRO method as defined in Algorithm 3 (i.e. using SGD to find a dual solution with O(n^2) time required to recover the primal solution) is scalable to large problems.

Table 10: Average runtime for training, including standard deviations. The number in parentheses is the number of trials used to compute the average. In all trials, the hyperparameters (including the number of boosting steps) were examined during the generation of the data presented in Section 5. Each baseline GBDT trial ran for 1000 boosting steps on 4 CPUs. BuDRO ran for 500 steps on 2 CPUs on German credit, 200 steps using 2 CPUs on COMPAS, and 200 steps using 4 CPUs and 1 GPU on Adult.

Data set       Training samples  Features  Baseline GBDT time (s)  BuDRO time (s)
German credit  800               62        6.5 ± 0.3 (10)          105.0 ± 25.5 (480)
COMPAS         4222              7         17.5 ± 0.7 (10)         154.9 ± 4.5 (10)
Adult          36177             41        201.6 ± 5.4 (10)        1455.9 ± 263.2 (6)