# Too Relaxed to Be Fair

Michael Lohaus¹ ², Michael Perrot² ³ ⁴, Ulrike von Luxburg¹ ²

## Abstract

We address the problem of classification under fairness constraints. Given a notion of fairness, the goal is to learn a classifier that is not discriminatory against a group of individuals. In the literature, this problem is often formulated as a constrained optimization problem and solved using relaxations of the fairness constraints. We show that many existing relaxations are unsatisfactory: even if a model satisfies the relaxed constraint, it can be surprisingly unfair. We propose a principled framework to solve this problem. This new approach uses a strongly convex formulation and comes with theoretical guarantees on the fairness of its solution. In practice, we show that this method gives promising results on real data.

## 1. Introduction

Informally, a classifier is considered unfair when it unjustly promotes a group of individuals while being detrimental to others; it is considered fair when it is free of any unjust behavior. However, the details of what is fair and unfair can be vastly different from one application to another. For example, a college might want to admit a diverse student pool with respect to gender or race. This notion of fairness is called demographic parity. On the other hand, consider a bank giving out loans. If a group of individuals repays less frequently than others, it is normal that they receive fewer loans. However, it does not mean that all requests should be declined. In particular, any individual that is likely to repay a loan should be given the opportunity to get one, regardless of group membership. This is called equality of opportunity.

The problem of learning fair classifiers has mainly been addressed in three ways. First, pre-processing approaches alter the labels of the examples or their representation to increase the intrinsic fairness of a dataset. A classifier learned on this modified data is then more likely to be fair (Feldman et al., 2015; Calmon et al., 2017; Kamiran & Calders, 2012; Dwork et al., 2012; Zemel et al., 2013). Second, post-hoc procedures transform existing accurate but unfair classifiers into fair classifiers (Chzhen et al., 2019; Hardt et al., 2016; Woodworth et al., 2017; Kamiran et al., 2010). Finally, direct methods learn a fair and accurate classifier in a single step (Kamishima et al., 2012; Zafar et al., 2017b;a; Calders & Verwer, 2010; Wu et al., 2019; Donini et al., 2018; Cotter et al., 2019; Agarwal et al., 2018; Goh et al., 2016). In this paper, we focus on the latter kind of approaches.

**Motivation: relaxations sometimes fail to produce fair solutions.** Recently, several direct methods have been proposed that use relaxed versions of the considered fairness constraint. These approaches work reasonably well for some applications. However, their relaxations are quite coarse, and we demonstrate below that they can fail to find fair classifiers. In particular, there is typically no guarantee on the relationship between the relaxed fairness and the true fairness of a solution: a classifier that is perfectly fair in terms of relaxed fairness can be highly unfair in terms of true fairness (see Figure 1 for an illustration). In this paper, we study the limitations of a number of popular approaches (Zafar et al., 2017b;a; Wu et al., 2019; Donini et al., 2018).

**Algorithmic contributions.** We propose a new principled framework to tackle the problem of fair classification that is particularly relevant for application scenarios where formal fairness guarantees are required.
Our approach is based on convex relaxations and comes with theoretical guarantees that ensure that the learned classifier is fair up to sampling errors. Furthermore, we use a learning theory framework for similarity-based classifiers to exhibit sufficient conditions that guarantee the existence of a fair and accurate classifier.

¹University of Tübingen, Germany. ²Max Planck Institute for Intelligent Systems, Tübingen, Germany. ³Univ Lyon, UJM-Saint-Etienne, CNRS, IOGS, Lab HC UMR 5516, F-42023, Saint-Etienne, France. ⁴Most of the work was done while M.P. was affiliated with the Max Planck Institute. Correspondence to: Michael Lohaus, Michael Perrot, Ulrike von Luxburg. Proceedings of the 37th International Conference on Machine Learning, Online, PMLR 119, 2020. Copyright 2020 by the author(s).

Figure 1. The goal is to separate the positive class (+) from the negative class (−) while remaining fair with respect to two sensitive groups: the blue and the red group. We evaluate the true fairness (DDP) and a linear relaxation of the fairness (Zafar, Section 3.1) of three linear classifiers learned with no fairness constraint (Unconstr., orange), a linear relaxation of the fairness constraint (Linear Constr., green), and our framework (SearchFair, red). We also plot the classifier obtained by translating Linear (Linear (shifted), brown). It has the same relaxed fairness as Linear but a different true fairness: the relaxation is oblivious to the intercept parameter. SearchFair finds the fairest classifier.

## 2. Problem Setting

Let $\mathcal{X}$ be a feature space, $\mathcal{Y} = \{-1, 1\}$ a space of binary class labels, and $\mathcal{S} = \{-1, 1\}$ a space of binary sensitive attributes. Assume that there exists a distribution $\mathcal{D}_Z$ over $Z = \mathcal{X} \times \mathcal{S} \times \mathcal{Y}$ and that we can draw some examples $(x, s, y) \sim \mathcal{D}_Z$. Our goal in fair classification is to obtain a classifier, a mapping $h : \mathcal{X} \to \mathcal{Y}$ defined as $h(x) = \operatorname{sign}(f(x))$ where $f : \mathcal{X} \to \mathbb{R}$ is a real-valued function, that is fair with respect to the sensitive attribute while remaining accurate on the class labels. In this paper, we study two notions of fairness: demographic parity and equality of opportunity.

**Demographic Parity.** A classifier $f$ is fair for demographic parity when its predictions are independent of the sensitive attribute (Calders et al., 2009; Calders & Verwer, 2010). Formally, this can be written as

$$\mathbb{P}_{(x,s,y)\sim\mathcal{D}_Z}[f(x) > 0 \mid s = 1] = \mathbb{P}_{(x,s,y)\sim\mathcal{D}_Z}[f(x) > 0 \mid s = -1].$$

In practice, enforcing exact demographic parity might be too restrictive. Instead, we consider a fairness score (Wu et al., 2019) called Difference of Demographic Parity (DDP):

$$\mathrm{DDP}(f) = \mathbb{E}_{(x,s,y)\sim\mathcal{D}_Z}\big[\mathbb{I}_{f(x)>0} \mid s = 1\big] - \mathbb{E}_{(x,s,y)\sim\mathcal{D}_Z}\big[\mathbb{I}_{f(x)>0} \mid s = -1\big], \tag{1}$$

where $\mathbb{I}_a$ is the indicator function that returns 1 when $a$ is true and 0 otherwise. The DDP is positive when the favoured group is $s = 1$ and negative when it is $s = -1$. Given a threshold $\tau \ge 0$, we say that a classifier $f$ is $\tau$-DDP fair if $|\mathrm{DDP}(f)| \le \tau$. When $\tau = 0$, exact demographic parity is achieved and we say that the classifier is DDP fair.

**Equality of Opportunity.** A classifier $f$ is fair for equality of opportunity when its predictions for positively labelled examples are independent of the sensitive attribute (Hardt et al., 2016). Formally, it is

$$\mathbb{P}_{(x,s,y)\sim\mathcal{D}_Z}[f(x) > 0 \mid y = 1, s = 1] = \mathbb{P}_{(x,s,y)\sim\mathcal{D}_Z}[f(x) > 0 \mid y = 1, s = -1].$$
Again, instead of only considering exact equality of opportunity, we use a fairness score (Donini et al., 2018) called Difference of Equality of Opportunity (DEO):

$$\mathrm{DEO}(f) = \mathbb{E}_{(x,s,y)\sim\mathcal{D}_Z}\big[\mathbb{I}_{f(x)>0} \mid y = 1, s = 1\big] - \mathbb{E}_{(x,s,y)\sim\mathcal{D}_Z}\big[\mathbb{I}_{f(x)>0} \mid y = 1, s = -1\big]. \tag{2}$$

This quantity is positive when the favoured group is $s = 1$ and negative when it is $s = -1$. Given a threshold $\tau \ge 0$, we say that a classifier $f$ is $\tau$-DEO fair if $|\mathrm{DEO}(f)| \le \tau$. When $\tau = 0$, exact equality of opportunity is achieved and we say that the classifier is DEO fair.

It is worth noting that demographic parity and equality of opportunity are quite similar from a mathematical point of view. In the remainder of the paper, we focus our exposition on DDP, as results that hold for DDP can often be readily extended to DEO by conditioning on the target label. We only provide details in the supplementary when these extensions are more involved.

**Learning a fair classifier.** Given a function class $\mathcal{F}$, a $\tau$-DDP fair and accurate classifier $f^*$ is given as the solution of the following problem:

$$f^* = \operatorname*{arg\,min}_{f \in \mathcal{F},\ |\mathrm{DDP}(f)| \le \tau} \mathcal{L}(f),$$

where $\mathcal{L}(f) = \mathbb{E}_{(x,s,y)\sim\mathcal{D}_Z}[\ell(f(x), y)]$ is the true risk of $f$ for a convex loss function $\ell$. In practice, we only have access to a set $\widehat{\mathcal{D}}_Z = \{(x_i, s_i, y_i)\}_{i=1}^n$ of $n$ examples drawn from $\mathcal{D}_Z$. Hence, we consider the empirical version of this problem:

$$\hat f_\beta = \operatorname*{arg\,min}_{f \in \mathcal{F},\ |\mathrm{DDP}(f)| \le \tau} \widehat{\mathcal{L}}(f) + \beta\,\Omega(f), \tag{3}$$

where $\Omega(f)$ is a convex regularization term used to prevent over-fitting, $\beta$ is a trade-off parameter, and $\widehat{\mathcal{L}}(f) = \frac{1}{n}\sum_{(x,s,y)\in\widehat{\mathcal{D}}_Z} \ell(f(x), y)$ is the empirical risk. The main difficulty involved in learning a fair classifier is to ensure that $|\mathrm{DDP}(f)| \le \tau$.

## 3. When Fairness Relaxations Fail

To obtain a $\tau$-DDP fair classifier, most approaches consider the fully empirical version of Optimization Problem 3:

$$\min_{f\in\mathcal{F}}\ \widehat{\mathcal{L}}(f) + \beta\,\Omega(f) \quad \text{subject to} \quad |\widehat{\mathrm{DDP}}(f)| \le \tau, \tag{4}$$

where the empirical version of DDP is

$$\widehat{\mathrm{DDP}}(f) = \frac{1}{n_1}\sum_{\substack{(x,s,y)\in\widehat{\mathcal{D}}_Z \\ s=1}} \mathbb{I}_{f(x)>0} - \frac{1}{n_{-1}}\sum_{\substack{(x,s,y)\in\widehat{\mathcal{D}}_Z \\ s=-1}} \mathbb{I}_{f(x)>0},$$

with $n_1$ (resp. $n_{-1}$) the number of examples in $\widehat{\mathcal{D}}_Z$ with $s = 1$ (resp. $s = -1$).
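To make these scores concrete, both can be computed directly from the real-valued predictions. The following sketch (our illustration in plain NumPy; the array names `f_x`, `s`, and `y` are ours) mirrors Equations (1) and (2) in their empirical form:

```python
import numpy as np

def ddp_score(f_x, s):
    """Empirical DDP (Eq. 1): difference of positive-prediction rates
    between the group s = 1 and the group s = -1."""
    positive = (f_x > 0).astype(float)
    return positive[s == 1].mean() - positive[s == -1].mean()

def deo_score(f_x, s, y):
    """Empirical DEO (Eq. 2): the same difference, restricted to the
    positively labelled examples (y = 1)."""
    mask = y == 1
    return ddp_score(f_x[mask], s[mask])

# A classifier is empirically tau-DDP fair when abs(ddp_score(f_x, s)) <= tau.
```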
Figure 2. Consider linear classifiers for the dataset in Figure 1. The decision boundaries are of the form $x_2 = a_1 x_1 + a_0$ where $a_1$ controls the slope and $a_0$ the intercept. For given intercepts and slopes, we plot normalized values of (a) the DDP score (yellow is fair), (b) the linear relaxation (Section 3.1), (c) the convex-concave relaxation (Section 3.2), (d) the concave Wu lower bound, and (e) the convex Wu upper bound (Section 3.2). The black dotted area in (a) corresponds to trivial constant classifiers: the predicted class is the same for all points. The colored crosses correspond to the classifiers in Figure 1. A good relaxation should capture the true DDP reasonably well; in particular the yellow regions should match. However, none of the considered relaxations manage to achieve this.

The main issue with this optimization problem is the non-convexity of the constraint, which makes it hard to find the optimal solution. A standard approach is then to first rewrite the DDP in an equivalent, but easier to handle, form¹ and then replace the indicator functions with a relaxation. Zafar et al. (2017b) and Donini et al. (2018) use a linear relaxation to obtain a fully convex constraint. Zafar et al. (2017a) use a convex relaxation that leads to a convex-concave constraint. Wu et al. (2019) combine a convex relaxation with a concave one to obtain a fully convex problem. Below, we show that these approaches only loosely approximate the true constraint and might lead to suboptimal solutions (see Figure 2). Furthermore, when theoretical guarantees accompany the method, they are either insufficient to ensure that the learned classifier is fair (Wu et al., 2019) or they make assumptions that are hard to satisfy in practice (Donini et al., 2018).

¹ In the supplementary, we provide the derivations for all the alternate formulations of DDP presented in this paper.

### 3.1. Linear Relaxations

We first study methods that use a linear relaxation of the indicator function to obtain a convex constraint in Optimization Problem 4. First, Zafar et al. (2017b) rewrite the DDP as

$$\mathrm{DDP}(f) = \mathbb{E}_{(x,s,y)\sim\mathcal{D}_Z}\left[\frac{1}{p_1(1-p_1)}\left(\frac{s+1}{2} - p_1\right)\mathbb{I}_{f(x)>0}\right],$$

where $p_1 = \mathbb{P}_{(x,s,y)\sim\mathcal{D}_Z}(s = 1)$ is the proportion of individuals in group $s = 1$. Then, they consider a linear approximation of $\mathbb{I}_{f(x)>0}$ and obtain the constraint

$$\left|\frac{1}{n}\sum_{(x,s,y)\in\widehat{\mathcal{D}}_Z} \frac{1}{\hat p_1(1-\hat p_1)}\left(\frac{s+1}{2} - \hat p_1\right) f(x)\right| \le \tau,$$

where $\hat p_1$ is an empirical estimate of $p_1$. In their original formulation, Zafar et al. (2017b) get rid of the factor $\frac{1}{\hat p_1(1-\hat p_1)}$ by replacing the right hand side of the constraint with $c = \hat p_1(1-\hat p_1)\,\tau$.

Similarly, Donini et al. (2018) rewrite the DDP as

$$\mathrm{DDP}(f) = \mathbb{E}_{(x,s,y)\sim\mathcal{D}_Z}\left[\frac{s}{p_s}\,\mathbb{I}_{f(x)>0}\right],$$

where $p_s = \mathbb{P}_{(x',s',y')\sim\mathcal{D}_Z}(s' = s)$ is the proportion of individuals in group $s$. Then, using the same linear relaxation as Zafar et al. (2017b) with $\hat p_s$, an empirical estimate of $p_s$, they obtain the constraint²

$$\left|\frac{1}{n}\sum_{(x,s,y)\in\widehat{\mathcal{D}}_Z} \frac{s}{\hat p_s}\, f(x)\right| \le \tau.$$

² Donini et al. (2018) originally consider $\tau$-DEO fairness rather than DDP. In the constraint, instead of drawing the examples from $\mathcal{D}_Z$, they use the conditional distribution $\mathcal{D}_{Z|y=1}$. However, this does not change the intrinsic nature of the constraint, and the issues raised here remain valid.

Both constraints are mathematically close and only differ in the multiplicative factor in front of $f(x)$ in the inner sum. Thus, they can be rewritten as

$$\widehat{\mathrm{LR}}_{\mathrm{DDP}}(f) = \left|\frac{1}{n}\sum_{(x,s,y)\in\widehat{\mathcal{D}}_Z} C\big(s, \widehat{\mathcal{D}}_Z\big)\, f(x)\right| \le \tau,$$

where $C(s, \widehat{\mathcal{D}}_Z)$ can be chosen to obtain either of the two constraints. In the following, we use this general formulation to show that both formulations have shortcomings that can lead to undesired behaviors.

**Linear relaxations are too loose.** In Figures 2(a) and 2(b) we illustrate the behaviors of $\widehat{\mathrm{DDP}}(f)$ and $\widehat{\mathrm{LR}}_{\mathrm{DDP}}(f)$. In the figures, we consider linear classifiers of the form $f(x) = x_2 + a_1 x_1 + a_0$ where $a_1$ controls the slope of the classifier and $a_0$ the intercept. The underlying data is the same as in Figure 1. It shows that the linear relaxation of DDP can behave completely differently compared to the true DDP. It is particularly striking to notice that the intercept does not have any influence on the constraint. This behavior can be formally verified. Let $f$ be a predictor of the form $f(x) = g(x) + b$ where $b$ is the intercept. Then, $\widehat{\mathrm{LR}}_{\mathrm{DDP}}(f)$ is independent of changes in $b$ since $\frac{1}{n}\sum_{(x,s,y)\in\widehat{\mathcal{D}}_Z} C(s, \widehat{\mathcal{D}}_Z) = 0$ for both constraints presented above. The proofs are given in the supplementary.
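This intercept-invariance is easy to verify numerically. The sketch below (our own illustration, instantiating the generic weights with the Donini-style choice $C(s, \widehat{\mathcal{D}}_Z) = s/\hat p_s$; the Zafar-style weights behave identically up to a constant factor) checks that adding an arbitrary intercept to all predictions leaves the relaxed constraint unchanged:

```python
import numpy as np

def linear_relaxation(f_x, s):
    """Generic linear relaxation |1/n * sum C(s) f(x)| of Section 3.1,
    instantiated with the Donini-style weights C(s) = s / p_s."""
    weights = np.where(s == 1, 1.0 / (s == 1).mean(), -1.0 / (s == -1).mean())
    # The weights sum to zero: n_1 * (n / n_1) - n_{-1} * (n / n_{-1}) = 0,
    # so any constant shift of f_x cancels out in the mean below.
    return abs(np.mean(weights * f_x))

rng = np.random.default_rng(0)
f_x = rng.normal(size=100)         # arbitrary real-valued predictions
s = rng.choice([-1, 1], size=100)  # both groups present
b = 5.0                            # an arbitrary intercept shift
assert np.isclose(linear_relaxation(f_x, s), linear_relaxation(f_x + b, s))
```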
**Theoretical guarantees for linear relaxations are not satisfactory.** Donini et al. (2018) study a sufficient condition under which the linear fairness relaxation $\widehat{\mathrm{LR}}_{\mathrm{DDP}}(f)$ of a function $f$ is close to its true fairness, that is, it holds that $|\widehat{\mathrm{DDP}}(f) - \widehat{\mathrm{LR}}_{\mathrm{DDP}}(f)| \le \hat\varepsilon$. The condition that needs to be satisfied by $f$ is

$$\frac{1}{2}\sum_{s'\in\{-1,1\}} \left|\frac{1}{n_{s'}}\sum_{\substack{(x,s,y)\in\widehat{\mathcal{D}}_Z \\ s = s'}} \big(\operatorname{sign}(f(x)) - f(x)\big)\right| \le \hat\varepsilon.$$

Unfortunately, the left hand side of this condition is non-convex and thus difficult to use in practice. In particular, when they learn a classifier with their linear relaxation, Donini et al. (2018) do not ensure that it also has a small $\hat\varepsilon$. They only verify this condition when the learning process is over, that is, when a classifier $f$ has already been produced. However, at this time, it is also possible to compute $\widehat{\mathrm{DDP}}(f)$ directly, so the bound is not needed anymore. If one could show that, for a given function class $\mathcal{F}$, there exists a small $\hat\varepsilon$ such that the condition holds for all $f \in \mathcal{F}$, then any classifier learned from this function class would be guaranteed to be fair whenever $\widehat{\mathrm{LR}}_{\mathrm{DDP}}(f)$ is small. However, it is not clear whether such a function class exists. Nevertheless, this argument hints that, for linear relaxations of the fairness constraint, the complexity of the function class largely controls the DDP that can be achieved.

**Linear relaxations should not be combined with complex classifiers.** We demonstrate that, if the class of classifiers $\mathcal{F}$ is complex, then the linear relaxation constraint has almost no influence on the outcome of the optimization problem. In Figure 3, we compare the performance, in terms of empirical DDP and accuracy, of several models learned by Optimization Problem 4 equipped with the linear relaxation for different parameters $\beta$ (for regularization) and $\tau$ (for fairness). Intuitively, one would expect that varying $\tau$ leads to changes in the fairness level while varying $\beta$ leads to changes in accuracy. However, this is not the case: $\tau$ only has an effect on the result when $\beta$ is sufficiently large. It means that the fairness of the model is mainly controlled by the regularization parameter rather than by the fairness parameter. This would not be an issue if the unfairness of complex classifiers were small. Unfortunately, high-complexity models have a high capacity to alter their decision boundaries: to achieve both high accuracy and a small relaxed constraint at the same time, they tend to alter their prediction margin for a few selected examples. While this might not affect the accuracy by much, the linear relaxation is sensitive to this kind of change and can thus be decreased substantially, which is what the optimization aims for. However, altering the predictions of a few individual points does not have a big influence on the true DDP: it remains high. This effect is reduced when one learns models of low capacity, which have less freedom to deliberately change the labels of individual points. Overall, linear relaxations are mainly relevant for simple classifiers and tend to have little effect on complex ones. We outline this undesirable behavior in the experiments.

Figure 3. We consider a similarity-based classifier (Section 5) with rbf kernel and 1000 train and test points from the Adult dataset. Using a varying regularization parameter β and fairness parameter τ, we train several classifiers using the linear fairness relaxation (Section 3.1). We plot the empirical test DDP of the learned models in Figure 3(a) (red and blue are bad, yellow is good) and their accuracy in Figure 3(b) (red is bad, green is good). We can see that, if β is small (complex model), the fairness relaxation parameter τ has no influence on the DDP score. For higher values of β (simpler models), decreasing τ improves the DDP. Best viewed in color.

### 3.2. Other Relaxations

In the previous section we demonstrated that linear relaxations are not sufficient to ensure fairness of the learned classifier. We now focus on two approaches that use non-linear relaxations of the indicator function to stay closer to the original fairness definition.
**Convex-concave relaxation.** In a second paper, Zafar et al. (2017a) use the same fairness formulation as Zafar et al. (2017b), but, instead of a linear relaxation of the indicator function, they use a non-linear relaxation.³ Hence, given $\hat p_1$ defined as in Section 3.1, they obtain the constraint

$$\widehat{\mathrm{CCR}}_{\mathrm{DDP}}(f) = \left|\frac{1}{n}\sum_{(x,s,y)\in\widehat{\mathcal{D}}_Z} \frac{1}{\hat p_1(1-\hat p_1)}\left(\frac{s+1}{2} - \hat p_1\right)\min\big(0, f(x)\big)\right| \le \tau.$$

In Figure 2(c) we give an illustration of $\widehat{\mathrm{CCR}}_{\mathrm{DDP}}(f)$. It more closely approximates the original $\widehat{\mathrm{DDP}}(f)$ than the linear relaxation. Nevertheless, it remains quite far from the original definition, in particular for classifiers that are not constant. Moreover, using such a relaxation leads to a convex-concave problem that turns out to be difficult to optimize, without guarantees on global optimality.

³ Zafar et al. (2017a) originally consider other notions of fairness than DDP, among them $\tau$-DEO fairness (Equation (5) in their paper). Instead of drawing the examples from $\mathcal{D}_Z$, they consider the conditional distribution $\mathcal{D}_{Z|y=1}$.

**Lower-upper relaxation with guarantees.** To derive their fairness constraint, Wu et al. (2019) propose to first equivalently rewrite the DDP as follows:

$$\mathrm{DDP}(f) = \mathbb{E}_{(x,s,y)\sim\mathcal{D}_Z}\left[\frac{\mathbb{I}_{s=1}}{p_1}\,\mathbb{I}_{f(x)>0} + \frac{\mathbb{I}_{s=-1}}{1-p_1}\,\mathbb{I}_{f(x)<0}\right] - 1,$$

where $p_1$ is defined as in Section 3.1. Replacing the indicator functions with a convex surrogate other than the linear one would lead to a convex-concave problem due to the absolute value in the constraint. Instead, Wu et al. (2019) propose to use a convex surrogate function $\kappa$ for the requirement $\mathrm{DDP}(f) < \tau$ and a concave surrogate function $\delta$ for $\mathrm{DDP}(f) > -\tau$. The corresponding relaxation is

$$\widehat{\mathrm{DDP}}_\kappa(f) = \mathbb{E}_{(x,s,y)\sim\mathcal{D}_Z}\left[\frac{\mathbb{I}_{s=1}}{p_1}\,\kappa(f(x)) + \frac{\mathbb{I}_{s=-1}}{1-p_1}\,\kappa(-f(x))\right] - 1,$$

and $\widehat{\mathrm{DDP}}_\delta(f)$ is defined analogously by simply replacing $\kappa$ with $\delta$. It leads to the following convex problem:

$$\min_{f\in\mathcal{F}}\ \widehat{\mathcal{L}}(f) + \beta\,\Omega(f) \quad \text{subject to} \quad \widehat{\mathrm{DDP}}_\kappa(f) \le \tau_\kappa,\ \ \widehat{\mathrm{DDP}}_\delta(f) \ge \tau_\delta. \tag{5}$$

Individually, the relaxations are far from the original fairness constraint (as illustrated in Figures 2(d) and 2(e)), but the idea is that combining the upper bound and the lower bound will help to learn a fair classifier. However, one needs to choose $\tau_\kappa$ and $\tau_\delta$ appropriately. To address this, Wu et al. (2019) show that choosing

$$\tau_\kappa = \psi_\kappa\big(\tau_{\mathrm{upper}} + \widehat{\mathrm{DDP}}{}^+ - \widehat{\mathrm{DDP}}{}^+_\kappa\big) \quad \text{and} \quad \tau_\delta = \psi_\delta\big(\tau_{\mathrm{lower}} + \widehat{\mathrm{DDP}}{}^- - \widehat{\mathrm{DDP}}{}^-_\delta\big)$$

guarantees that $\tau_{\mathrm{lower}} \le \widehat{\mathrm{DDP}}(f) \le \tau_{\mathrm{upper}}$. Here $\widehat{\mathrm{DDP}}{}^+$ and $\widehat{\mathrm{DDP}}{}^-$ are the worst possible scores of $\widehat{\mathrm{DDP}}(f)$: they are attained by those functions in the given function class that advantage either group $s = 1$ or group $s = -1$ the most. The values $\widehat{\mathrm{DDP}}{}^+_\kappa$ and $\widehat{\mathrm{DDP}}{}^-_\delta$ are defined in the same way for the relaxed scores. The functions $\psi_\kappa$ and $\psi_\delta$ are invertible functions that depend on the selected surrogate.

While this solution is appealing at first glance, it turns out that Optimization Problem 5 is often infeasible for meaningful values of $\tau_{\mathrm{upper}}$ and $\tau_{\mathrm{lower}}$, as the constraints form disjoint convex sets. To illustrate this, consider $\kappa(x) = \max\{0, 1 + x\}$ and $\delta(x) = \min\{1, x\}$ as proposed by Wu et al. (2019) and the dataset used in Figure 1. Then, if $\tau_{\mathrm{upper}} = -\tau_{\mathrm{lower}} \le 1.13$, the problem is infeasible. If $\tau_{\mathrm{lower}} = 0$ and $\tau_{\mathrm{upper}} \le 1.95$, the problem is also infeasible. Overall, the guarantees are often meaningless: they either make statements about the empty set (no feasible solution) or they are too loose to ensure meaningful levels of fairness.
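For concreteness, the two relaxed scores are cheap to evaluate. The sketch below (our illustration; `f_x` and `s` are our array names) computes the relaxed score for the surrogates $\kappa(x) = \max\{0, 1 + x\}$ and $\delta(x) = \min\{1, x\}$ proposed by Wu et al. (2019):

```python
import numpy as np

def ddp_surrogate(f_x, s, phi):
    """Relaxed score E[I(s=1)/p1 * phi(f(x)) + I(s=-1)/(1-p1) * phi(-f(x))] - 1.
    Since kappa(z) >= I(z > 0) pointwise and delta(z) <= I(z > 0) pointwise,
    phi = kappa upper-bounds the DDP and phi = delta lower-bounds it."""
    p1 = (s == 1).mean()
    terms = np.where(s == 1, phi(f_x) / p1, phi(-f_x) / (1.0 - p1))
    return terms.mean() - 1.0

kappa = lambda z: np.maximum(0.0, 1.0 + z)  # convex surrogate of the indicator
delta = lambda z: np.minimum(1.0, z)        # concave surrogate of the indicator
# ddp_surrogate(f_x, s, delta) <= empirical DDP <= ddp_surrogate(f_x, s, kappa)
```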
## 4. New Approach with Guaranteed Fairness

In the previous section, we have seen that existing approaches use relaxations of the fairness constraint that lead to tractable optimization problems but have little control over the true fairness of the learned model. For this reason, we propose a new framework that solves the problem of finding provably fair solutions: given a convex approximation of the fairness constraint, our method is guaranteed to find a classifier with a good level of fairness.

We consider the following optimization problem:

$$\hat f^\beta_{\widehat{\mathcal{D}}_Z}(\lambda) = \operatorname*{arg\,min}_{f\in\mathcal{F}}\ \widehat{\mathcal{L}}(f) + \lambda\,\widehat{R}_{\mathrm{DDP}}(f) + \beta\,\Omega(f), \tag{6}$$

where $\widehat{R}_{\mathrm{DDP}}(f)$ is a convex approximation of the signed fairness constraint, that is, we do not consider the usual absolute value. In other words, we obtain a trade-off between accuracy and fairness that is controlled by two hyperparameters $\lambda \ge 0$ and $\beta > 0$; given a fixed $\beta$, we can vary $\lambda$ to move from strongly preferring one group to strongly preferring the other group. Our goal is then to find a parameter setting that is in the neutral regime and does not favor either of the two groups. The main theoretical ingredient for this procedure to succeed is the next theorem, which states that the function $\lambda \mapsto \mathrm{DDP}\big(\hat f^\beta_{\widehat{\mathcal{D}}_Z}(\lambda)\big)$ is continuous under reasonable assumptions on the data distribution, the candidate classifiers, and the convex relaxation.

**Theorem 1** (Continuity of $\mathrm{DDP}\big(\hat f^\beta_{\widehat{\mathcal{D}}_Z}(\lambda)\big)$). Let $\mathcal{F}$ be a function space, and define the set of learnable functions as $\mathcal{F}_\Lambda = \big\{f \in \mathcal{F} : \exists\,\lambda \ge 0,\ f = \hat f^\beta_{\widehat{\mathcal{D}}_Z}(\lambda)\big\}$. Assume that the following conditions hold: (i) Optimization Problem 6 is $m$-strongly convex in $f$; (ii) for all $f \in \mathcal{F}$, $\widehat{R}_{\mathrm{DDP}}(f)$ is bounded in $[-B, B]$; (iii) there exists a metric $\rho$ such that $(\mathcal{F}_\Lambda, \rho)$ is a metric space; (iv) for all $x \in \mathcal{X}$, the evaluation map $f \mapsto f(x)$ is continuous; (v) every $f \in \mathcal{F}_\Lambda$ is Lebesgue measurable, and the set $\{x : (x, s, y) \in Z,\ s = 1,\ f(x) = 0\}$ is a Lebesgue null set, as well as $\{x : (x, s, y) \in Z,\ s = -1,\ f(x) = 0\}$; (vi) the probability density functions $f_{Z|s=1}$ and $f_{Z|s=-1}$ are Lebesgue measurable. Then, the function $\lambda \mapsto \mathrm{DDP}\big(\hat f^\beta_{\widehat{\mathcal{D}}_Z}(\lambda)\big)$ is continuous.

The proof of this theorem is given in the supplementary. The conditions (i)-(vi) are of a technical nature, but not very restrictive: Condition (i) can be satisfied by using a strongly convex regularization term, for example the squared L2 norm. Condition (ii) can be satisfied by assuming that $\mathcal{X}$ is bounded. Condition (iii) is, for example, satisfied by any Hilbert space equipped with the standard dot product. This includes, but is not restricted to, the set of linear classifiers. Condition (iv) ensures that small changes in the hypothesis, with respect to the metric associated to $\mathcal{F}$, also yield small changes in the predictions. Condition (v) ensures that the number of examples for which the predictions are zero is negligible; for example, this happens when the decision boundary is sharp. Condition (vi) is satisfied by many usual distributions, for example the Gaussian distribution.

We demonstrate the continuous behavior of DDP on a real dataset in Figure 4. We plot the DDP score and the accuracy of classifiers learned with Optimization Problem 6 for varying parameters $\lambda$ and $\beta$. Given a fixed $\beta$, the results support our theoretical findings: there is a smooth transition between favouring the group $s = 1$ with small $\lambda$ and favouring the group $s = -1$ with higher $\lambda$. In between, there is always a region of perfect fairness. In the next corollary, we formally state the conditions necessary to ensure the existence of such a DDP-fair classifier.

**Corollary 1** (Existence of a DDP-fair classifier). Assume that the conditions of Theorem 1 hold and that the convex approximation $\widehat{R}_{\mathrm{DDP}}(f)$ is chosen such that for Optimization Problem (6) there exist (i) $\lambda_+$ such that $\mathrm{DDP}\big(\hat f^\beta_{\widehat{\mathcal{D}}_Z}(\lambda_+)\big) > 0$, and (ii) $\lambda_-$ such that $\mathrm{DDP}\big(\hat f^\beta_{\widehat{\mathcal{D}}_Z}(\lambda_-)\big) < 0$.
Then, there exists at least one value $\lambda_0$ in the interval $[\min(\lambda_+, \lambda_-), \max(\lambda_+, \lambda_-)]$ such that $\mathrm{DDP}\big(\hat f^\beta_{\widehat{\mathcal{D}}_Z}(\lambda_0)\big) = 0$.

We prove this corollary in the supplementary.

Figure 4. We consider a similarity-based classifier (Section 5) with rbf kernel and 1000 train and test points from the Adult dataset. Using a varying regularization parameter β and fairness parameter λ, we train several classifiers using Optimization Problem 6 with the same loss, convex relaxation, and regularization as SearchFair in the experiments. We plot the empirical test DDP of the learned models in (a) (red and blue are bad, yellow is good) and their accuracy in (b) (red is bad, green is good). We can see that, given a fixed regularization β, we can move from a positive DDP (small λ, in red) to a negative DDP (large λ, in blue), with a region of perfect fairness in between (in yellow).

This suggests a very simple framework to learn provably fair models. First, we choose a convex fairness relaxation (e.g. the one proposed by Wu et al. (2019)) and search for two initial hyperparameters $\lambda_+$ and $\lambda_-$ that fulfill the assumptions of Corollary 1 (empirically, $\lambda = 0$ and $\lambda = 1$ are good candidates). Then, we use a binary search to find a $\lambda_0$ between $\lambda_+$ and $\lambda_-$ such that $\mathrm{DDP}\big(\hat f^\beta_{\widehat{\mathcal{D}}_Z}(\lambda_0)\big) = 0$. We call this procedure SearchFair and summarize it in Algorithm 1 in the supplementary. Note that any convex approximation $\widehat{R}_{\mathrm{DDP}}(f)$ can be used as long as the conditions of Corollary 1 are respected. In Appendix A we give more details on how we choose this relaxation. Finally, SearchFair theoretically requires evaluating the true population fairness $\mathrm{DDP}\big(\hat f^\beta_{\widehat{\mathcal{D}}_Z}(\lambda)\big)$ on the underlying distribution $\mathcal{D}_Z$. In practice, we follow the example of existing fairness constraints (Woodworth et al., 2017) and simply approximate this quantity by its empirical counterpart $\widehat{\mathrm{DDP}}\big(\hat f^\beta_{\widehat{\mathcal{D}}_Z}(\lambda)\big)$.
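The core of SearchFair is thus a plain bisection over $\lambda$. The sketch below conveys the idea only (the actual procedure is Algorithm 1 in the supplementary); `train` and `ddp` are hypothetical callables that solve Optimization Problem 6 for a fixed $\beta$ and evaluate the empirical DDP score of a classifier, respectively:

```python
import numpy as np

def search_fair(train, ddp, lam_minus=0.0, lam_plus=1.0, n_iter=10):
    """Binary search for a lambda_0 with DDP(f(lambda_0)) close to 0.
    Assumes, as in Corollary 1, that the DDP scores of the classifiers
    trained at lam_minus and lam_plus have opposite signs."""
    sign_start = np.sign(ddp(train(lam_minus)))
    f_mid = None
    for _ in range(n_iter):
        lam_mid = 0.5 * (lam_minus + lam_plus)
        f_mid = train(lam_mid)
        if np.sign(ddp(f_mid)) == sign_start:
            lam_minus = lam_mid  # same sign as the start: zero crossing lies above
        else:
            lam_plus = lam_mid   # sign flipped: zero crossing lies below
    return f_mid
```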
## 5. Towards Classifiers that are Fair and Accurate

In the last section, we presented a method that is guaranteed to find a DDP fair classifier. However, there is one important catch: we did not make any statement about the classification accuracy of this solution. Here, we take a step in this direction by proposing sufficient conditions that ensure the existence of a classifier that is both fair and accurate. To this end, we focus on a particular set of classifiers: the family of similarity-based functions. Given a similarity function $K : \mathcal{X} \times \mathcal{X} \to [-1, 1]$ and a set of points $S = \{(x'_1, s'_1, y'_1), \ldots, (x'_d, s'_d, y'_d)\}$, we define a similarity-based classifier as $f(x) = \sum_{i=1}^d \alpha_i K(x, x'_i)$. The goal is then to learn the weights $\alpha_i$.

A theory of learning with such functions has been developed by Balcan et al. (2008). By defining a notion of good similarities, they provide sufficient conditions that ensure the existence of an accurate similarity-based classifier. Here, we build upon this framework and introduce a notion of good similarities for both accuracy and fairness. Hence, in Definition 1 we give sufficient conditions that ensure the existence of a classifier that is fair and accurate at the same time.

**Definition 1** (Good Similarities for Fairness). A similarity function $K$ is $(\varepsilon, \gamma, \tau)$-good for a convex, positive, and decreasing loss $\ell$, and $(\mu, \nu)$-fair for demographic parity, if there exists a (random) indicator function $R(x, s, y)$, defining a (probabilistic) set of reasonable points, such that, with $g(x) = \mathbb{E}_{(x',s',y')\sim\mathcal{D}_Z}\big[y' K(x, x') \mid R(x', s', y')\big]$ for $x \in \mathcal{X}$, the following conditions hold:

(i) $\mathbb{E}_{(x,s,y)\sim\mathcal{D}_Z}\left[\ell\!\left(\frac{y\,g(x)}{\gamma}\right)\right] \le \varepsilon$,

(ii) $\left|\,\mathbb{P}_{\mathcal{D}_Z|s=1}\big[g(x) \ge \gamma\big] - \mathbb{P}_{\mathcal{D}_Z|s=-1}\big[g(x) \ge \gamma\big]\,\right| \le \mu$,

(iii) $\mathbb{P}_{(x,s,y)\sim\mathcal{D}_Z}\big[|g(x)| \ge \gamma\big] \ge 1 - \nu$,

(iv) $\mathbb{P}_{(x,s,y)\sim\mathcal{D}_Z}\big[R(x, s, y)\big] \ge \tau$.

Roughly speaking, a similarity is good for classification if examples of the same class are on average closer to each other than examples of different classes, up to a certain margin. Moreover, it is good for fairness if this margin is independent of group membership. Given such a similarity, we can prove the existence of a fair and accurate classifier, as summarized in the next theorem. The proof is given in the supplementary.

**Theorem 2** (Existence of a fair and accurate separator). Let $K$ be an $(\varepsilon, \gamma, \tau)$-good and $(\mu, \nu)$-fair similarity function for a given convex, positive and decreasing loss $\ell$ with Lipschitz constant $L$. For any $\varepsilon_1 > 0$ and $0 < \delta < \frac{\gamma\varepsilon_1}{2(L+\ell(0))}$, let $S = \{(x'_1, s'_1, y'_1), \ldots, (x'_d, s'_d, y'_d)\}$ be a set of $d$ examples drawn from $\mathcal{D}_Z$ with

$$d \ge \frac{1}{\tau}\left(\frac{L^2}{\gamma^2\varepsilon_1^2} + \frac{3}{\delta} + \frac{4L}{\delta\gamma\varepsilon_1}\sqrt{\delta(1-\tau)\log(2/\delta)}\right).$$

Let $\varphi^S : \mathcal{X} \to \mathbb{R}^d$ be the mapping with $\varphi^S_i(x) = K(x, x'_i)$ for all $i \in \{1, \ldots, d\}$. Then, with probability at least $1 - \frac{5\delta}{2}$ over the choice of $S$, the induced distribution over $\varphi^S(\mathcal{X}) \times \mathcal{Y}$ has a linear separator $\alpha$ such that

$$\mathbb{E}_{(x,s,y)\sim\mathcal{D}_Z}\left[\ell\!\left(\frac{y\,\langle\alpha, \varphi^S(x)\rangle}{\gamma}\right)\right] \le \varepsilon + \varepsilon_1,$$

and, with $p_1 = \mathbb{P}_{(x,s,y)\sim\mathcal{D}_Z}[s = 1]$,

$$|\mathrm{DDP}(\alpha)| \le \mu + (\nu + 2\delta)\,\max\left(\frac{1}{p_1}, \frac{1}{1-p_1}\right).$$
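For intuition, the mapping $\varphi^S$ of Theorem 2 and the resulting similarity-based classifier can be written in a few lines. This is our own illustration; the rbf similarity and its width parameterization are example choices, not the paper's definition:

```python
import numpy as np

def feature_map(X, landmarks, width=1.0):
    """phi^S_i(x) = K(x, x'_i) for an rbf similarity K, with one column
    per (reasonable) landmark point x'_i."""
    sq_dists = ((X[:, None, :] - landmarks[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq_dists / width)

def predict(alpha, X, landmarks, width=1.0):
    """Similarity-based classifier f(x) = sum_i alpha_i K(x, x'_i);
    the predicted label is sign(f(x))."""
    return feature_map(X, landmarks, width) @ alpha
```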
## 6. Experiments

In this section, we empirically evaluate SearchFair by comparing it to 5 baselines on 6 real-world datasets. In all the experiments, SearchFair reliably finds the fairest classifier, matched only by a very recent non-convex optimization approach.

**Datasets.** We consider 6 different datasets: CelebA (Liu et al., 2015), Adult (Kohavi & Becker, 1996), Dutch (Zliobaite et al., 2011), COMPAS (Larson et al., 2016), Communities and Crime (Redmond & Baveja, 2002), and German Credit (Dua & Graff, 2017). In the supplementary, we give detailed descriptions of these datasets, how we pre-process the data, and the sizes of the train and test splits. Note that we remove the sensitive attribute $s$ from the set of features $x$ so that it is not needed at decision time.

**Baselines.** We compare SearchFair to 5 baselines. For 3 of them, we use Optimization Problem 4 with the hinge loss and a squared $\ell_2$ norm as the regularization term. As a function class $\mathcal{F}$, we use the similarity-based classifiers presented in Section 5 with either the linear or the rbf kernel and with 70% (at most 1000) of the training examples as reasonable points. As a fairness constraint, we use either the linear relaxation of Zafar et al. (2017b) (Zafar), the linear relaxation of Donini et al. (2018) (Donini), or no constraint at all (Unconst). The fourth baseline is a recent method for non-convex constrained optimization by Cotter et al. (2019) (Cotter). Our last baseline is the constant classifier (Constant) that always predicts the same label but has perfect fairness.

**SearchFair.**⁴ For SearchFair we also use the hinge loss, a squared $\ell_2$ norm as the regularization term (it is strongly convex), and similarity-based classifiers. As a convex approximation of the fairness constraint, we use the bounds with hinge loss proposed by Wu et al. (2019) (see Section A in the supplementary for details).

⁴ The code is freely available online: github.com/mlohaus/SearchFair.

**Metrics.** Our main goal is to learn fair classifiers. Hence, our main evaluation metrics are the empirical DDP and DEO scores on the test set (closer to 0 is better). As a secondary metric (in case of ties in the fairness scores), we consider the classification performance of the models and report the errors on the test set (lower is better). All the experiments are repeated 10 times and we report the mean and standard deviation for all the metrics.

Figure 5. We report the average and standard deviation of classification error and absolute fairness scores DDP and DEO (closer to 0 is better) over 10 repetitions. The constant classifier is perfectly fair as it always predicts the same label. Its classification error is shown by the grey dashed vertical line. (a) To obtain good fairness on Adult, all DDP fair methods learn the constant classifier. Only SearchFair and Cotter reliably find fair classifiers for both kernels. (b) On Dutch, SearchFair obtains the lowest DDP with a slight loss in accuracy. Cotter performs comparably for both kernels, whereas the other methods only do well with a linear kernel but fail to learn fair classifiers with the rbf kernel. (c) For CelebA, SearchFair and Cotter are the only methods that obtain a low DDP and DEO with only a slight loss in accuracy. The other methods provide little to no improvement.

**Hyperparameters.** Zafar, Donini and Cotter use a fairness parameter, which we call $\tau$, to control the fairness level. Since our goal is to learn classifiers that are fair, we set $\tau = 0$ so that perfect fairness is required. For SearchFair, there is no fairness parameter since $\lambda_0$ is automatically searched for between a lower bound $\lambda_{\min}$ and an upper bound $\lambda_{\max}$. We set $\lambda_{\min} = 0$ and $\lambda_{\max} = 1$ as these values usually lead to classifiers with fairness scores of opposite sign (as needed). We use 10 iterations in the binary search.

We use 5-fold cross validation to choose the other hyperparameters. For Cotter, only the width of the rbf kernel has to be tuned since we use the framework of the original paper with no regularization term. For all remaining methods we need to choose the regularization parameter $\beta$ and the width of the rbf kernel. These values are respectively chosen in the sets $\{10^{-6}, 10^{-5}, 10^{-4}, 10^{-3}, 10^{-2}\}$ and $\{10^{\log d - 1}, 10^{\log d}, d, 10^{\log d + 1}, 10^{\log d + 2}\}$, with $d$ the number of features. We select the set of parameters that leads to the most accurate classifier on average over the 5 folds. Indeed, the fairness level is automatically taken care of by the methods.

**Results.** We present the results for 3 out of 6 datasets in Figure 5. The other results are deferred to the supplementary as they follow the same trend. We make two main observations. First, SearchFair always obtains fairness values that are very close to zero. It learns the fairest classifiers out of all the methods and is only matched by Cotter, the non-convex approach. This sometimes comes with a small increase in classification error. For example, in order to achieve perfect DDP fairness on the Adult dataset, SearchFair, and all the other fair methods, yield classifiers close to the trivial constant one. Second, the complexity of the model greatly influences the performance of the linear relaxations. For example, using the complex rbf kernel almost always results in an increase in the fairness score of Zafar and Donini. This is particularly striking for Adult and Dutch, where the linear kernel yields reasonable fairness scores. Note that this trend is not always respected. For example, on CelebA, using an rbf kernel improves the fairness score compared to the linear kernel. However, neither of them obtains reasonable fairness levels in the first place.
For automatically taken care of by the methods. example, Donini et al. (2018) proposed NVP, a cross validation method where one selects the set of hyperparameters Results. We present the results for 3 out of 6 datasets in that gives the fairest classifier while obtaining an average Figure 5. The other results are deferred to the supplementary accuracy above a given threshold. Similarly, one could seas they follow the same trend. We make two main obserlect the set of hyperparameters that yields the most accurate Too Relaxed to Be Fair classifier under a given fairness threshold. In the supple Conference on Data Mining and Knowledge Discovery, mentary, we report results that empirically show that these 2010. more complex procedures tend to improve the fairness of Calders, T., Kamiran, F., and Pechenizkiy, M. Building clasthe baselines (Search Fair remains competitive on all the sifiers with independency constraints. In International datasets). Unfortunately, they also blur the dividing line be Conference on Data Mining Workshops, 2009. tween hyperparameters that control the fairness of the model and the ones that control its complexity. In other words, it Calmon, F., Wei, D., Vinzamuri, B., Natesan Ramamurthy, becomes unclear whether fairness is achieved thanks to the K., and Varshney, K. R. Optimized pre-processing for relaxation or thanks to the choice of hyperparameters (we discrimination prevention. In International Conference already evoked this issue in Figure 3). We believe that it on Neural Information Processing Systems, 2017. is better to have a method that is guaranteed to find a fair Chzhen, E., Denis, C., Hebiri, M., Oneto, L., and Pontil, classifier for any given family of models and does not rely M. Leveraging labeled and unlabeled data for consistent on a complex cross validation procedure. fair binary classification. In International Conference on Neural Information Processing Systems, 2019. 7. Conclusion Cotter, A., Jiang, H., and Sridharan, K. Two-player games In this paper, we have shown that existing approaches to for efficient non-convex constrained optimization. In Inlearn fair and accurate classifiers have many shortcomings. ternational Conference on Algorithmic Learning Theory, They use loose relaxations of the fairness constraint and 2019. guarantees that relate the relaxed fairness to the true fairness of the solutions are either missing or not sufficient. Donini, M., Oneto, L., Ben-David, S., Shawe-Taylor, J., We empirically demonstrated how these approaches can and Pontil, M. Empirical risk minimization under fairproduce undesirable models. If fair machine learning is ness constraints. In International Conference on Neural supposed to be employed in real applications in society, we Information Processing Systems, 2018. need algorithms that actually find fair solutions, and ideally Dua, D. and Graff, C. UCI machine learning repository, come with guarantees. We made a first step in this direc2017. tion by proposing Search Fair, an approach that uses convex relaxations to learn a classifier that is guaranteed to be fair. Dwork, C., Hardt, M., Pitassi, T., Reingold, O., and Zemel, R. Fairness through awareness. In Innovations in Theo Acknowledgements retical Computer Science Conference, 2012. Feldman, M., Friedler, S. A., Moeller, J., Scheidegger, C., This work has been supported by the German Research and Venkatasubramanian, S. Certifying and removing dis Foundation through the Cluster of Excellence Machine parate impact. 
## Acknowledgements

This work has been supported by the German Research Foundation through the Cluster of Excellence "Machine Learning: New Perspectives for Science" (EXC 2064/1, project number 390727645), the BMBF Tübingen AI Center (FKZ: 01IS18039A), and the International Max Planck Research School for Intelligent Systems (IMPRS-IS). It has also been supported by the ACADEMICS grant of the IDEXLYON, project of the Université de Lyon, PIA operated by ANR-16-IDEX-0005. The authors also thank Luca Rendsburg for helpful discussions.

## References

Agarwal, A., Beygelzimer, A., Dudik, M., Langford, J., and Wallach, H. A reductions approach to fair classification. In International Conference on Machine Learning, 2018.

Balcan, M.-F., Blum, A., and Srebro, N. Improved guarantees for learning via similarity functions. In Conference on Learning Theory, 2008.

Calders, T. and Verwer, S. Three naive Bayes approaches for discrimination-free classification. Data Mining and Knowledge Discovery, 2010.

Calders, T., Kamiran, F., and Pechenizkiy, M. Building classifiers with independency constraints. In International Conference on Data Mining Workshops, 2009.

Calmon, F., Wei, D., Vinzamuri, B., Natesan Ramamurthy, K., and Varshney, K. R. Optimized pre-processing for discrimination prevention. In International Conference on Neural Information Processing Systems, 2017.

Chzhen, E., Denis, C., Hebiri, M., Oneto, L., and Pontil, M. Leveraging labeled and unlabeled data for consistent fair binary classification. In International Conference on Neural Information Processing Systems, 2019.

Cotter, A., Jiang, H., and Sridharan, K. Two-player games for efficient non-convex constrained optimization. In International Conference on Algorithmic Learning Theory, 2019.

Donini, M., Oneto, L., Ben-David, S., Shawe-Taylor, J., and Pontil, M. Empirical risk minimization under fairness constraints. In International Conference on Neural Information Processing Systems, 2018.

Dua, D. and Graff, C. UCI machine learning repository, 2017.

Dwork, C., Hardt, M., Pitassi, T., Reingold, O., and Zemel, R. Fairness through awareness. In Innovations in Theoretical Computer Science Conference, 2012.

Feldman, M., Friedler, S. A., Moeller, J., Scheidegger, C., and Venkatasubramanian, S. Certifying and removing disparate impact. In International Conference on Knowledge Discovery and Data Mining, 2015.

Goh, G., Cotter, A., Gupta, M., and Friedlander, M. P. Satisfying real-world goals with dataset constraints. In International Conference on Neural Information Processing Systems, 2016.

Hardt, M., Price, E., and Srebro, N. Equality of opportunity in supervised learning. In International Conference on Neural Information Processing Systems, 2016.

Kamiran, F. and Calders, T. Data preprocessing techniques for classification without discrimination. Knowledge and Information Systems, 2012.

Kamiran, F., Calders, T., and Pechenizkiy, M. Discrimination aware decision tree learning. In International Conference on Data Mining, 2010.

Kamishima, T., Akaho, S., Asoh, H., and Sakuma, J. Fairness-aware classifier with prejudice remover regularizer. In Machine Learning and Knowledge Discovery in Databases, 2012.

Kohavi, R. and Becker, B. Scaling up the accuracy of naive-Bayes classifiers: a decision-tree hybrid, 1996.

Larson, J., Mattu, S., Kirchner, L., and Angwin, J. How we analyzed the COMPAS recidivism algorithm. ProPublica, 2016.

Liu, Z., Luo, P., Wang, X., and Tang, X. Deep learning face attributes in the wild. In International Conference on Computer Vision, 2015.

Redmond, M. A. and Baveja, A. A data-driven software tool for enabling cooperative information sharing among police departments. European Journal of Operational Research, 2002.

Woodworth, B., Gunasekar, S., Ohannessian, M. I., and Srebro, N. Learning non-discriminatory predictors. In Conference on Learning Theory, 2017.

Wu, Y., Zhang, L., and Wu, X. On convexity and bounds of fairness-aware classification. In The World Wide Web Conference, 2019.

Zafar, M. B., Valera, I., Gomez Rodriguez, M., and Gummadi, K. P. Fairness beyond disparate treatment & disparate impact: Learning classification without disparate mistreatment. In International Conference on World Wide Web, 2017a.

Zafar, M. B., Valera, I., Rodriguez, M. G., and Gummadi, K. P. Fairness constraints: Mechanisms for fair classification. In International Conference on Artificial Intelligence and Statistics, 2017b.

Zemel, R., Wu, Y., Swersky, K., Pitassi, T., and Dwork, C. Learning fair representations. In International Conference on Machine Learning, 2013.

Zliobaite, I., Kamiran, F., and Calders, T. Handling conditional discrimination. In International Conference on Data Mining, 2011.