# Fair Selective Classification via Sufficiency

Yuheng Bu*¹, Joshua Ka-Wing Lee*¹, Subhro Das², Rameswar Panda², Deepta Rajan², Prasanna Sattigeri², Gregory Wornell¹

*Equal contribution. ¹Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology, Cambridge, USA. ²MIT-IBM Watson AI Lab, IBM Research, Cambridge, USA. Correspondence to: Joshua Ka-Wing Lee. Proceedings of the 38th International Conference on Machine Learning, PMLR 139, 2021.

Abstract

Selective classification is a powerful tool for decision-making in scenarios where mistakes are costly but abstentions are allowed. In general, by allowing a classifier to abstain, one can improve the performance of a model at the cost of reducing coverage and classifying fewer samples. However, recent work has shown that, in some cases, selective classification can magnify disparities between groups, and has illustrated this phenomenon on multiple real-world datasets. We prove that the sufficiency criterion can be used to mitigate these disparities by ensuring that selective classification increases performance on all groups, and introduce a method for mitigating the disparity in precision across the entire coverage scale based on this criterion. We then provide an upper bound on the conditional mutual information between the class label and sensitive attribute, conditioned on the learned features, which can be used as a regularizer to achieve fairer selective classification. The effectiveness of the method is demonstrated on the Adult, CelebA, Civil Comments, and CheXpert datasets.

1. Introduction

As machine learning applications continue to grow in scope and diversity, their use in industry raises increasingly many ethical and legal concerns, especially those of fairness and bias in predictions made by automated systems (Selbst et al., 2019; Bellamy et al., 2018; Meade, 2019). As systems are trusted to aid or make decisions regarding loan applications, criminal sentencing, and even health care, it is more important than ever that these predictions be free of bias.

The field of fair machine learning is rich with both problems and proposed solutions, aiming to provide unbiased decision systems for various applications. A number of different definitions and criteria for fairness have been proposed, as well as a variety of settings where fairness might be applied. One major topic of interest in fair machine learning is that of fair classification, whereby we seek to make a classifier fair for some definition of fairness that varies according to the application. In general, fair classification problems arise when we have protected groups that are defined by a shared sensitive attribute (e.g., race, gender), and we wish to ensure that we are not biased against any one group with the same sensitive attribute.

In particular, one sub-setting of fair classification which exhibits an interesting fairness-related phenomenon is that of selective classification. Generally speaking, selective classification is a variant of the classification problem where a model is allowed to abstain from making a decision. This has applications in settings where making a mistake can be very costly, but abstentions are not (e.g., if the abstention results in deferring classification to a human actor).
In general, selective classification systems work by assigning some measure of confidence to their predictions, and then deciding whether or not to abstain based on this confidence, usually via thresholding. The desired outcome is obvious: the higher the confidence threshold for making a decision (i.e., the more confident one needs to be to not abstain), the lower the coverage (the proportion of samples for which a decision is made) will be. But in return, one should see better performance on the remaining samples, as the system only makes decisions when it is very sure of the outcome. In practice, for most datasets, with the correct choice of confidence measure and the correct training algorithm, this outcome is observed. However, recent work has revealed that selective classification can magnify disparities between groups as the coverage decreases, even as overall performance increases (Jones et al., 2020). This, of course, has some very serious consequences for systems that require fairness, especially if it appears at first that predictions are fair enough under full coverage (i.e., when all samples are being classified).

Thus, we seek a method for enforcing fairness which ensures that a classifier is fair even if it abstains from classifying a large number of samples. In particular, having a measure of confidence that is reflective of accuracy for each group can ensure that thresholding doesn't harm one group more than another. This property can be achieved by applying a condition known as sufficiency, which ensures that our predictive scores in each group provide the same accuracy at each confidence level (Barocas et al., 2019). This condition also ensures that the precision on all groups increases when we apply selective classification, and can help mitigate the disparity between groups as we decrease coverage.

The sufficiency criterion can be formulated as enforcing conditional independence between the label and the sensitive attribute, conditioned on the learned features, and thus allows for a relaxation and optimization method that centers around the mutual information. However, to impose this criterion, we require a penalty term that involves the conditional mutual information between two discrete or continuous variables conditioned on a third continuous variable. Existing methods for computing the mutual information for the purposes of backpropagation tend to struggle when the term in the condition involves the learned features. In order to facilitate this optimization, we therefore derive an upper-bound approximation of this quantity.

In this paper, we make two main contributions. Firstly, we prove that sufficiency can be used to train fairer selective classifiers which ensure that precision always increases as coverage is decreased, for all groups. Secondly, we derive a novel upper bound on the conditional mutual information which can be used as a regularizer to enforce the sufficiency criterion, and then show that it mitigates the disparities on real-world datasets.

2. Background

2.1. The Fair Classification Problem

We begin with the standard supervised learning setup of predicting the value of a target variable $Y \in \mathcal{Y}$ using a set of decision or predictive variables $X \in \mathcal{X}$, with training samples $\{(x_1, y_1), \ldots, (x_n, y_n)\}$. For example, X may be information about an individual's credit history, and Y is whether the individual will pay back a certain loan.
In general, we wish to find features $\Phi(x) \in \mathbb{R}^{d_\Phi}$ which are predictive about Y, so that we can construct a good predictor $\hat{y} = T(\Phi(x))$ of y under some loss criterion $L(\hat{y}, y)$, where T denotes the final classifier on the learned features. Now suppose we have some sensitive attributes $D \in \mathcal{D}$ we wish to be fair about (e.g., race, gender), and training samples $\{(x_1, y_1, d_1), \ldots, (x_n, y_n, d_n)\}$. For example, in the banking system, predictions about the chance of someone repaying a loan (Y) given factors about one's financial situation (X) should not be determined by gender (D). While D can be continuous or discrete (and our method generalizes to both cases), we focus in this paper on the case where D is discrete, and refer to members which share the same value of D as being in the same group. This allows us to formulate metrics based on group-specific performance.

There are numerous metrics and criteria for what constitutes a fair classifier, many of which are mutually exclusive with one another outside of trivial cases. One important criterion is positive predictive parity (Corbett-Davies et al., 2017; Pessach & Shmueli, 2020), which is satisfied when the precision (which we denote as PPV, after Positive Predictive Value) is the same for each group, that is:

$$P(Y=1 \mid \hat{Y}=1, D=a) = P(Y=1 \mid \hat{Y}=1, D=b), \quad \forall a, b \in \mathcal{D}. \tag{1}$$

This criterion is especially important in applications where false positives are particularly harmful (e.g., criminal sentencing or loan application decisions) and having one group falsely labeled as being in the positive class could lead to great harm or contribute to further biases. Looking at precision rates can also reveal disparities that may be hidden by only considering the differences in accuracies across groups (Angwin et al., 2016). When D is binary, an intuitive way to measure the severity of violations of this condition is the difference in precision between the two groups:

$$\Delta PPV \triangleq P(Y=1 \mid \hat{Y}=1, D=0) - P(Y=1 \mid \hat{Y}=1, D=1). \tag{2}$$

A number of methods exist for learning fairer classifiers. (Calders et al., 2009) and (Menon & Williamson, 2018) reweight probabilities to ensure fairness, while (Zafar et al., 2017) proposes using covariance-based constraints to enforce fairness criteria, and (Zhang et al., 2018) uses an adversarial method, requiring the training of an adversarial classifier. Still others work by directly penalizing some specific fairness measure (Zemel et al., 2013b; du Pin Calmon et al., 2017). Many of these methods also work by penalizing some approximation or proxy for the mutual information. (Mary et al., 2019; Baharlouei et al., 2020; Lee et al., 2020) propose the use of the Hirschfeld-Gebelein-Rényi maximal correlation as a regularizer for the independence and separation constraints (which require that $\hat{Y} \perp D$ and $\hat{Y} \perp D \mid Y$, respectively), and which has been shown to be an approximation of the mutual information (Huang et al., 2019). Finally, (Cho et al., 2020) approximates the mutual information using a variational approximation. However, none of these methods are designed to tackle the selective classification problem.
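As a concrete illustration of the metric in (2), the following is a minimal sketch (our own, not from the paper) of how the group-wise precision gap might be computed from a set of predictions; the array names `y_true`, `y_pred`, and `group` are illustrative assumptions.

```python
import numpy as np

def precision_gap(y_true, y_pred, group):
    """Compute Delta-PPV in (2): the difference in precision between groups 0 and 1."""
    y_true, y_pred, group = map(np.asarray, (y_true, y_pred, group))
    ppv = {}
    for d in (0, 1):
        # Samples predicted positive within group d
        mask = (group == d) & (y_pred == 1)
        # Precision P(Y=1 | Y_hat=1, D=d); undefined if the group has no positive predictions
        ppv[d] = y_true[mask].mean() if mask.any() else np.nan
    return ppv[0] - ppv[1]

# Toy usage
y_true = np.array([1, 0, 1, 1, 0, 1, 0, 0])
y_pred = np.array([1, 1, 1, 0, 0, 1, 1, 0])
group  = np.array([0, 0, 0, 0, 1, 1, 1, 1])
print(precision_gap(y_true, y_pred, group))
```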
2.2. Selective Classification

In selective classification, a predictive system is given the choice of either making a prediction $\hat{Y}$ or abstaining from the decision. The core assumption underlying selective classification is that there are samples for which a system is more confident about its prediction, and by only making predictions when it is confident, the performance will be improved.

To enable this, we must have a confidence score $\kappa(x)$ which represents the model's certainty about its prediction on a given sample x (Geifman & El-Yaniv, 2017). Then, we threshold on this value to decide whether to make a decision or to abstain. We define the coverage as the fraction of samples on which we do not abstain (i.e., the fraction of samples that we make predictions on). As is to be expected, when the confidence is a good measure of the probability of making a correct prediction, then as we increase the minimum confidence threshold for making a prediction (thus decreasing the coverage), we should see the risk on the classified samples decrease, or equivalently the accuracy over the classified samples increase. This leads us to the accuracy-coverage tradeoff, which is central to selective classification (though we note here the warning from the previous section about accuracy not telling the whole story).

Selective classifiers can work a posteriori by taking in an existing classifier and deriving an uncertainty measure from it to threshold on (Geifman & El-Yaniv, 2017), or a selective classifier can be trained with an objective that is designed to enable selective classification (Cortes et al., 2016; Yildirim et al., 2019). One common method of extracting a confidence score from an existing network is to take the softmax response s(x) as a measure of confidence. In the case of binary classification, to better visualize the distribution of the scores, we define the confidence using a monotonic mapping of s(x):

$$\kappa(x) = 2 \log \frac{s(x)}{1 - s(x)}, \tag{3}$$

which maps [0.5, 1] to [0, ∞) and provides much higher resolution on values close to 1. Finally, to measure the effectiveness of selective classification, we can plot the accuracy-coverage curve, and then compute the area under this curve to encapsulate the performance across different coverages (Franc & Průša, 2019).

2.3. Biases in Selective Classification

(Jones et al., 2020) has shown that in some cases, when coverage is decreased, the difference in recall between groups can increase, magnifying disparities between groups and increasing unfairness. In particular, they have shown that in the case of the CelebA and Civil Comments datasets, decreasing the coverage can also decrease the recall on the worst-case group.

Figure 1. (Top) When margin distributions are not aligned, (Bottom) then as we sweep over the threshold τ, the accuracies for the groups do not necessarily move in concert with one another.

In general, this phenomenon occurs due to a difference between the average margin distribution and the group-specific margin distributions, resulting in different levels of performance when thresholding, as illustrated in Figure 1. The margin M of a classifier is defined as $\kappa(x)$ when $\hat{y}(x) = y$ and $-\kappa(x)$ otherwise. If we let τ be our threshold, then a selective classifier makes a correct prediction when $M(x) \geq \tau$ and an incorrect prediction when $M(x) \leq -\tau$. We also denote the margin's probability density function (PDF) and cumulative distribution function (CDF) as $f_M$ and $F_M$, respectively. Then, the selective accuracy is

$$A_F(\tau) = \frac{1 - F_M(\tau)}{F_M(-\tau) + 1 - F_M(\tau)} \tag{4}$$

for a given threshold. We can analogously compute the selective precision by conditioning on $\hat{Y} = 1$:

$$PPV_F(\tau) = \frac{1 - F_{M \mid \hat{Y}=1}(\tau)}{F_{M \mid \hat{Y}=1}(-\tau) + 1 - F_{M \mid \hat{Y}=1}(\tau)}. \tag{5}$$

We can also analogously define the distributions of the margin for each group using $f_{M,d}$ and $F_{M,d}$ for group $d \in \mathcal{D}$.
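To make the thresholding mechanics concrete, here is a small illustrative sketch (not the authors' code) that maps softmax responses to the confidence in (3), abstains below a threshold τ, and reports coverage, selective accuracy, and selective precision; the synthetic scores and labels are purely for demonstration.

```python
import numpy as np

def confidence(s):
    """Monotonic mapping of the softmax response s(x), cf. (3): kappa = 2*log(s/(1-s))."""
    s = np.clip(s, 1e-12, 1 - 1e-12)
    return 2 * np.log(s / (1 - s))

def selective_metrics(scores, y_true, tau):
    """Coverage, selective accuracy, and selective precision at threshold tau."""
    scores, y_true = np.asarray(scores), np.asarray(y_true)
    y_pred = (scores >= 0.5).astype(int)
    s_conf = np.maximum(scores, 1 - scores)   # confidence in the predicted class, in [0.5, 1]
    kappa = confidence(s_conf)
    keep = kappa >= tau                       # abstain when kappa < tau
    coverage = keep.mean()
    correct = (y_pred == y_true)
    acc = correct[keep].mean() if keep.any() else np.nan
    pos = keep & (y_pred == 1)
    ppv = correct[pos].mean() if pos.any() else np.nan
    return coverage, acc, ppv

# Sweep thresholds to trace an accuracy-coverage curve on synthetic data
scores = np.random.beta(2, 2, size=1000)                 # stand-in for model scores s(x)
y_true = (np.random.rand(1000) < scores).astype(int)
for tau in [0.0, 1.0, 2.0, 4.0]:
    print(tau, selective_metrics(scores, y_true, tau))
```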
(Jones et al., 2020) describes a number of different situations in which average accuracy could increase but worst-group accuracy could decrease, based on their relative margin distributions. For example, if the margin distribution $F_M$ is left-log-concave (e.g., Gaussian), then $A_F(\tau)$ is monotonically increasing when $A_F(0) \geq 0.5$ and monotonically decreasing otherwise. Thus, if $A_F(0) > 0.5$ but $A_{F_d}(0) < 0.5$, then average accuracy may increase as we increase τ (and thus decrease coverage), but the accuracy on group d may decrease, resulting in magnified disparity. This same phenomenon occurs with the precision when we condition on $\hat{Y} = 1$. In general, when margin distributions are not aligned between groups, disparity can increase as one sweeps over the threshold τ. Further subdividing groups according to their label yields the difference in recall rates observed.

3. Fair Selective Classification with Sufficiency

3.1. Sufficiency and Fair Selective Classification

Our solution to the fair selective classification problem is to apply the sufficiency criterion to the learned features. Sufficiency requires that $Y \perp D \mid \hat{Y}$ or $Y \perp D \mid \Phi(X)$, i.e., the prediction completely subsumes all information about the sensitive attribute that is relevant to the label (Cleary, 1966). When Y is binary, the sufficiency criterion requires that (Barocas et al., 2019):

$$P(Y = 1 \mid \Phi(x), D = a) = P(Y = 1 \mid \Phi(x), D = b), \quad \forall a, b \in \mathcal{D}. \tag{6}$$

Sufficiency has applications to full-coverage positive predictive parity, and is also used for solving the domain generalization problem by learning domain-invariant features, as in Invariant Risk Minimization (IRM) (Creager et al., 2020; Arjovsky et al., 2019).

The application of this criterion to fair selective classification comes to us by way of calibration by group. Calibration by group (Chouldechova, 2017) requires that there exists a score function R = s(x) such that, for all r ∈ (0, 1):

$$P(Y = 1 \mid R = r, D = a) = r, \quad \forall a \in \mathcal{D}. \tag{7}$$

The following result from (Barocas et al., 2019) links calibration and sufficiency:

Theorem 1. If a classifier has sufficient features Φ, then there exists a mapping $h: [0, 1] \to [0, 1]$ such that $h(T(\Phi))$ is calibrated by group.

If we can find sufficient features Φ(X), so that the score function is calibrated by group based on these features, then we have the following result (the proof can be found in Appendix A):

Theorem 2. If a classifier has a score function R = s(x) which is calibrated by group, and selective classification is performed using the confidence κ as defined in (3), then for all groups $d \in \mathcal{D}$ we have that both $A_{F_d}(\tau)$ and $PPV_{F_d}(\tau)$ are monotonically increasing with respect to τ. Furthermore, we also have that $A_{F_d}(0) > 0.5$ and $PPV_{F_d}(0) > 0.5$.

From this, we can guarantee that as we sweep through the threshold, we will never penalize the performance of any one group in service of increasing the overall precision. Furthermore, in most real-world applications, the precision on the best-performing groups tends to saturate very quickly to values close to 1 as coverage is reduced, and thus, if we can guarantee that the precision increases on the worst-performing group as well, then in general the difference in precision between groups decreases as coverage decreases.
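As a rough empirical check of calibration by group in (7), one could bin the scores and compare the positive rate per group within each bin; the sketch below is our own illustration under that assumption, with hypothetical array names (`scores`, `y_true`, `group`).

```python
import numpy as np

def calibration_by_group(scores, y_true, group, n_bins=10):
    """Empirical P(Y=1 | R in bin, D=d) for each group, a rough check of (7)."""
    scores, y_true, group = map(np.asarray, (scores, y_true, group))
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    idx = np.clip(np.digitize(scores, bins) - 1, 0, n_bins - 1)
    table = {}
    for d in np.unique(group):
        rates = []
        for b in range(n_bins):
            mask = (group == d) & (idx == b)
            rates.append(y_true[mask].mean() if mask.any() else np.nan)
        table[d] = np.array(rates)
    return bins, table

# Under calibration by group, table[a] and table[b] should both track the bin centers.
```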
3.2. Imposing the Sufficiency Condition

From the above theorem, we can see that a classifier given by the following optimization problem, which satisfies sufficiency, should yield the desired property of enabling fair selective classification: it should ensure that as we sweep over the coverage, the performance of one group is not penalized in the service of improving the performance of another group or improving the average performance:

$$\min_\theta L(\hat{y}, y) \quad \text{s.t.} \quad Y \perp D \mid \Phi(X), \tag{8}$$

where $\hat{y} = T(\Phi(x))$, and θ are the model parameters for both Φ and T. One possible way of representing the sufficiency constraint is by using the mutual information:

$$\min_\theta L(\hat{y}, y) \quad \text{s.t.} \quad I(Y; D \mid \Phi(X)) = 0. \tag{9}$$

This follows from the fact that $Y \perp D \mid \Phi(X)$ is satisfied if and only if $I(Y; D \mid \Phi(X)) = 0$. This provides us with a simple relaxation of the constraint into the following form:

$$\min_\theta L(\hat{y}, y) + \lambda I(Y; D \mid \Phi(X)). \tag{10}$$

We note here that existing works using mutual information for fairness are ill-equipped to handle this condition, as they assume that it is not the features that will be conditioned on, but rather that the penalty will be the mutual information between the sensitive attribute and the features (e.g., penalizing $I(\Phi(X); D)$ for demographic parity), possibly conditioned on the label (e.g., penalizing $I(\Phi(X); D \mid Y)$ in the case of equalized odds). As such, existing methods either assume that the variable being conditioned on is discrete (du Pin Calmon et al., 2017; Zemel et al., 2013a; Hardt et al., 2016), become unstable when the features are placed in the condition (Mary et al., 2019), or simply do not allow for conditioning of this type due to their formulation (Grari et al., 2019; Baharlouei et al., 2020).

Thus, in order to approximate the mutual information for our purposes, we must first derive an upper bound for the mutual information which is computable in our applications. Our bound is inspired by the work of (Cheng et al., 2020) and is stated in the following theorem:

Theorem 3. For random variables X, Y and Z, we have

$$I_{UB}(X; Y \mid Z) \geq I(X; Y \mid Z), \tag{11}$$

where equality is achieved if and only if $X \perp Y \mid Z$, and

$$I_{UB}(X; Y \mid Z) \triangleq \mathbb{E}_{P_{XYZ}}[\log P(Y \mid X, Z)] - \mathbb{E}_{P_X}\big[\mathbb{E}_{P_{YZ}}[\log P(Y \mid X, Z)]\big]. \tag{12}$$

Proof. The conditional mutual information can be written as

$$I(X; Y \mid Z) = \mathbb{E}_{P_{XYZ}}[\log P(Y \mid X, Z)] - \mathbb{E}_{P_{YZ}}[\log P(Y \mid Z)]. \tag{13}$$

Hence,

$$I_{UB}(X; Y \mid Z) - I(X; Y \mid Z) = \mathbb{E}_{P_{YZ}}\big[\log P(Y \mid Z) + \mathbb{E}_{P_X}[-\log P(Y \mid X, Z)]\big]. \tag{14}$$

Note that $-\log(\cdot)$ is convex, so by Jensen's inequality

$$\mathbb{E}_{P_X}[-\log P(Y \mid X, Z)] \geq -\log \mathbb{E}_{P_X}[P(Y \mid X, Z)] = -\log P(Y \mid Z), \tag{15}$$

which makes the right-hand side of (14) non-negative and completes the proof.

Thus, $I(Y; D \mid \Phi(X))$ can be upper bounded by $I_{UB}$ as:

$$I(Y; D \mid \Phi(X)) \leq \mathbb{E}_{P_{XYD}}[\log P(Y \mid \Phi(X), D)] - \mathbb{E}_{P_D}\big[\mathbb{E}_{P_{XY}}[\log P(Y \mid \Phi(X), D)]\big]. \tag{16}$$

Since $P(y \mid \Phi(x), d)$ is unknown in practice, we need to use a variational distribution $q(y \mid \Phi(x), d; \theta)$ with parameter θ to approximate it. Here, we adopt a neural net that predicts Y based on the feature Φ(X) and the sensitive attribute D as our variational model $q(y \mid \Phi(x), d; \theta)$. However, in many cases, X will be continuous, high-dimensional data (e.g., images), while D will be a discrete, categorical variable (e.g., gender, ethnicity). Therefore, it is more convenient to instead formulate the model as $q(y \mid \Phi(x); \theta_d)$, i.e., to train a group-specific model for each $d \in \mathcal{D}$ to approximate $P(y \mid \Phi(x), d)$, instead of treating D as an additional input to the neural net. Then, we can compute the first term of the upper bound as the negative cross-entropy of the training samples using the correct classifier for each group (group-specific loss), and the second term as the cross-entropy of the samples using a randomly-selected classifier (group-agnostic loss) drawn according to the marginal distribution $P_D$.
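The following PyTorch-style sketch illustrates how the two terms of the bound in (16) might be estimated on a batch under the group-specific-head formulation described above; it is our own illustrative reading of the method, not the released implementation, and the names (`heads`, `feats`, `p_d`) are assumptions introduced here.

```python
import torch
import torch.nn.functional as F

def sufficiency_regularizer(feats, y, d, heads, p_d):
    """Batch estimate of the CMI upper bound in (16):
    group-specific log-likelihood minus group-agnostic log-likelihood.

    feats : (n, d_phi) learned features Phi(x)
    y     : (n,) binary labels
    d     : (n,) group indices
    heads : list of group-specific classifiers q(y | Phi(x); theta_d)
    p_d   : (|D|,) empirical marginal distribution of D
    """
    n = feats.shape[0]
    # First term: log q(y_i | Phi(x_i); theta_{d_i}) using the correct group head
    logits_own = torch.stack([heads[int(di)](f) for f, di in zip(feats, d)])
    ll_specific = -F.binary_cross_entropy_with_logits(
        logits_own.squeeze(-1), y.float(), reduction="mean")
    # Second term: log q(y_i | Phi(x_i); theta_{d~}) with a head drawn from P_D
    d_tilde = torch.multinomial(p_d, n, replacement=True)
    logits_rand = torch.stack([heads[int(di)](f) for f, di in zip(feats, d_tilde)])
    ll_agnostic = -F.binary_cross_entropy_with_logits(
        logits_rand.squeeze(-1), y.float(), reduction="mean")
    # Estimate of I_UB; added to the training loss as lambda * (this value)
    return ll_specific - ll_agnostic

# Toy usage with linear group heads over 16-dimensional features
heads = [torch.nn.Linear(16, 1) for _ in range(2)]
feats = torch.randn(8, 16)
y = torch.randint(0, 2, (8,))
d = torch.randint(0, 2, (8,))
p_d = torch.tensor([0.5, 0.5])
print(sufficiency_regularizer(feats, y, d, heads, p_d))
```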
Thus, by replacing all expectations in (16) with empirical averages, the regularizer is given by

$$L_R = \frac{1}{n} \sum_{i=1}^{n} \Big[ \log q(y_i \mid \Phi(x_i); \theta_{d_i}) - \log q(y_i \mid \Phi(x_i); \theta_{\tilde{d}_i}) \Big], \tag{17}$$

where the $\tilde{d}_i$ are drawn i.i.d. from the marginal distribution $P_D$, and for $d \in \mathcal{D}$,

$$\theta_d = \arg\max_\theta \sum_{i:\, d_i = d} \log q(y_i \mid \Phi(x_i); \theta). \tag{18}$$

Let T denote a joint classifier over all groups which is used to make the final predictions, such that $\hat{y} = T(\Phi(x))$; then the overall loss function is

$$\min_{\theta_T, \theta_\Phi} \frac{1}{n} \sum_{i=1}^{n} \Big[ L\big(T(\Phi(x_i)), y_i\big) + \lambda \log q(y_i \mid \Phi(x_i); \theta_{d_i}) - \lambda \log q(y_i \mid \Phi(x_i); \theta_{\tilde{d}_i}) \Big]. \tag{19}$$

In practice, we train our model by alternating between the fitting steps in (18) and the feature updating steps in (19); the overall training process is described in Algorithm 1 and Figure 2.

Figure 2. Diagram illustrating the computation of our sufficiency-based loss when D is binary.

Algorithm 1: Training with the sufficiency-based regularizer
Data: training samples $\{(x_1, y_1, d_1), \ldots, (x_n, y_n, d_n)\}$ and $\{\tilde{d}_1, \ldots, \tilde{d}_n\}$ drawn i.i.d. from the empirical distribution $\hat{P}_D$.
Initialize Φ and T (parameterized by $\theta_\Phi$ and $\theta_T$, respectively) and the $\theta_d$ with a pre-trained model, and let $n_d$ be the number of samples in group d.
Compute the following losses:
- Group-specific losses: $L_d = -\sum_{i:\, d_i = d} \log q(y_i \mid \Phi(x_i); \theta_d)$
- Joint loss: $L_0 = \frac{1}{n} \sum_{i=1}^{n} L\big(T(\Phi(x_i)), y_i\big)$
- Regularizer loss: $L_R$ as defined in (17), comprising the group-specific and group-agnostic terms
for each training iteration do
  for $d = 1, \ldots, |\mathcal{D}|$ do   // fit group-specific models
    for $j = 1, \ldots, M$ do   // for each batch
      $\theta_d \leftarrow \theta_d - \tfrac{1}{n_d}\eta_d \nabla_{\theta_d} L_d$
    end
  end
  for $j = 1, \ldots, N$ do   // for each batch
    $\theta_\Phi \leftarrow \theta_\Phi - \tfrac{1}{n}\eta_f \nabla_{\theta_\Phi}(L_0 + \lambda L_R)$   // update feature extractor
    $\theta_T \leftarrow \theta_T - \tfrac{1}{n}\eta \nabla_{\theta_T} L_0$   // update joint classifier
  end
end

Intuitively, by trying to minimize the difference between the log-probability of the output of the correct model and that of the randomly-chosen one, we are trying to enforce that Φ(x) has the property that all group-specific models trained on it will be the same; that is:

$$q(y \mid \Phi(x); \theta_a) = q(y \mid \Phi(x); \theta_b), \quad \forall a, b \in \mathcal{D}. \tag{20}$$

This happens when $P(Y \mid \Phi(X), D) = P(Y \mid \Phi(X))$, which implies the sufficiency condition $Y \perp D \mid \Phi(X)$.
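Below is a condensed, illustrative sketch of the alternation in Algorithm 1: first fit the group-specific heads on frozen features, then update the feature extractor with $L_0 + \lambda L_R$ and the joint classifier with $L_0$ alone. It assumes a data loader yielding (x, y, d) batches, reuses the hypothetical `sufficiency_regularizer` from the previous sketch, and simplifies details such as the per-sample learning-rate scaling in the algorithm box.

```python
import torch

def train_epoch(phi, clf, heads, loader, p_d, lam=0.7,
                opt_heads=None, opt_phi=None, opt_clf=None):
    """One epoch of the alternating scheme sketched in Algorithm 1."""
    bce = torch.nn.BCEWithLogitsLoss()

    # (i) Fit the group-specific heads theta_d, cf. (18), on detached features
    for x, y, d in loader:
        with torch.no_grad():
            feats = phi(x)
        loss_d = sum(bce(heads[g](feats[d == g]).squeeze(-1), y[d == g].float())
                     for g in range(len(heads)) if (d == g).any())
        opt_heads.zero_grad(); loss_d.backward(); opt_heads.step()

    # (ii) Update the feature extractor and joint classifier, cf. (19)
    for x, y, d in loader:
        feats = phi(x)
        joint = bce(clf(feats).squeeze(-1), y.float())            # L0
        reg = sufficiency_regularizer(feats, y, d, heads, p_d)    # L_R (previous sketch)
        opt_phi.zero_grad(); opt_clf.zero_grad()
        # reg does not involve clf, so clf receives gradients from L0 only,
        # while phi receives gradients from L0 + lambda * L_R
        (joint + lam * reg).backward()
        opt_phi.step(); opt_clf.step()
```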
Table 1. Summary of datasets.

| Dataset | Modality | Target | Attribute |
|---|---|---|---|
| Adult | Demographics | Income | Sex |
| CelebA | Photo | Hair Colour | Gender |
| Civil Comments | Text | Toxicity | Christianity |
| CheXpert-device | X-ray | Disease | Support Device |

4. Experimental Results

4.1. Datasets and Setup

We test our method on four binary classification datasets (while our method works with multi-class classification in general, our metrics for comparison are based on the binary case) which are commonly used in fairness: Adult [1], CelebA [2], Civil Comments [3], and CheXpert [4]. In all cases, we use the standard train/val/test splits packaged with the datasets and implemented our code in PyTorch. We set λ = 0.7 for all datasets, chosen by sweeping over values of λ across all datasets.

[1] https://archive.ics.uci.edu/ml/datasets/adult
[2] http://mmlab.ie.cuhk.edu.hk/projects/CelebA.html
[3] https://www.kaggle.com/c/jigsaw-unintended-bias-in-toxicity-classification/data
[4] https://stanfordmlgroup.github.io/competitions/chexpert

The Adult dataset (Kohavi, 1996) consists of census data drawn from the 1994 Census database, with 48,842 samples. The data X consists of demographic information about individuals, including age, education, marital status, and country of origin. Following (Bellamy et al., 2018), we one-hot encode the categorical variables and designate the binary-quantized income as the target label Y and sex as the sensitive attribute D. In order to simulate the bias phenomenon discussed in Section 2.3, we also drop all but the first 50 samples for which D = 0 and Y = 1. We then use a two-layer neural network with 80 nodes in the hidden layer for classification, as in (Mary et al., 2019), with the first layer serving as the feature extractor and the second as the classifier, and trained the network for 20 epochs.

The CelebA dataset (Liu et al., 2015) consists of 202,599 images of 10,177 celebrities, along with a list of attributes associated with them. As in (Jones et al., 2020), we use the images as data X (resized to 224x224), the hair color (blond or not) as the target label Y, and the gender as the sensitive attribute D. We then train a ResNet-50 model (He et al., 2016) (initialized with pre-trained ImageNet weights) for 10 epochs on the dataset, with the penultimate layer as the feature extractor and the final layer as the classifier.

The Civil Comments dataset (Borkan et al., 2019) is a text-based dataset consisting of a collection of 1,999,514 online comments on various news articles, along with metadata about the commenter and a label indicating whether the comment displays toxicity or not. As in (Jones et al., 2020), we let X be the text of the comment, Y be the binary toxicity label, and D be the mention of Christianity. We pass the data first through a BERT model (Devlin et al., 2019) with Google's pre-trained parameters (Turc et al., 2019) and treat the output features as the input into our system. We then apply a two-layer neural network to the BERT output with 80 nodes in the hidden layer, once again treating the layers as feature extractor and classifier, respectively. We trained the model for 20 epochs.

The CheXpert dataset (Irvin et al., 2019) comprises 224,316 chest radiograph images from 65,240 patients with annotations for 14 different lung diseases. As in (Jones et al., 2020), we consider the binary classification task of detecting Pleural Effusion (PE). We set X to be the X-ray image at resolution 224x224, Y to be whether the patient has PE, and D to be the presence of a support device. We train a model by fine-tuning DenseNet-121 (Huang et al., 2017) (initialized with pre-trained ImageNet weights) for 10 epochs on the dataset, with the penultimate layer as the feature extractor and the final layer as the classifier.

We compared our results to a baseline where we only optimize the cross-entropy loss, as in standard classification. We also compared our method to the group DRO method of (Sagawa et al., 2019), using the code provided publicly on GitHub [5], which has been shown to mitigate the disparity in recall rates between groups in selective classification (Jones et al., 2020).

[5] https://github.com/kohpangwei/group_DRO
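For concreteness, one possible way (a sketch under our own assumptions, not the released code) to split an ImageNet-pretrained backbone into the feature extractor Φ, joint classifier T, and group-specific heads used by the sketches above is shown below; the optimizer choices and learning rates are illustrative.

```python
import torch
import torchvision

# Split a backbone into feature extractor Phi and joint classifier T,
# and build one lightweight head per group.
backbone = torchvision.models.resnet50(weights="IMAGENET1K_V1")
feat_dim = backbone.fc.in_features
backbone.fc = torch.nn.Identity()           # penultimate-layer output serves as Phi(x)

phi = backbone                               # feature extractor Phi
clf = torch.nn.Linear(feat_dim, 1)           # joint classifier T (binary logit)
heads = [torch.nn.Linear(feat_dim, 1) for _ in range(2)]   # q(y | Phi(x); theta_d), |D| = 2

opt_phi = torch.optim.SGD(phi.parameters(), lr=1e-4)
opt_clf = torch.optim.SGD(clf.parameters(), lr=1e-3)
opt_heads = torch.optim.SGD([p for h in heads for p in h.parameters()], lr=1e-3)
```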
4.2. Results and discussion

Figure 3. Overall accuracy-coverage curves for the Adult dataset for the three methods.

Figure 3 shows the overall accuracy vs. coverage graphs for each method on the Adult dataset. We can see that, in all cases, selective classification increases the overall accuracy on the dataset, as is to be expected. However, when we look at the group-specific precisions in Figure 4, we observe that, for the baseline method, this increase in performance comes at the cost of worse performance on the worst-case group. This phenomenon is heavily mitigated in the case of DRO, but there is still a gap in performance in the mid-coverage regime. Finally, our method shows the precisions converging to equality very quickly as coverage decreases.

Figure 4. Group-specific precision-coverage curves for the Adult dataset for the three methods.

This can be explained by looking at the margin distributions for each method. The margin distribution histograms are plotted in Figure 6. We can see that the margin distributions are mismatched for the two groups in the baseline and DRO cases, but aligned for our sufficiency-based method.

Figure 6. Margin distributions for the Adult dataset for the three methods.

Figures 5 and 7 show the group precisions and margin distributions for the CheXpert dataset. We can see that our method produces a smaller gap in precision at almost all coverages compared to the other two methods, and improves the worst-group precision. Note that in this use case the presence of a support device (e.g., chest tubes) is spuriously correlated with being diagnosed as having PE (Oakden-Rayner et al., 2020). Thus, the worst-case group consists of X-rays with a support device that are diagnosed as PE negative.

Figure 5. Group-specific precision-coverage curves for the CheXpert dataset for the three methods.

Figure 7. Margin distributions for the CheXpert dataset for the three methods.

Finally, in order to numerically evaluate the relative performance of the algorithms on all the datasets, we compute the following quantities: the area under the average accuracy-coverage curve (Franc & Průša, 2019) and the area under the absolute difference in precision-coverage curve (i.e., the area between the precision-coverage curves for the two groups). Table 2 shows the results for each method and dataset. From this, it is clear that while our method may incur a small decrease in overall accuracy in some cases, it reduces the disparity between the two groups, as desired.

Table 2. Area under curve results for all datasets.

| Dataset | Method | Area under accuracy curve | Area between precision curves |
|---|---|---|---|
| Adult | Baseline | 0.931 | 0.220 |
| Adult | DRO | 0.911 | 0.116 |
| Adult | Ours | 0.887 | 0.021 |
| CelebA | Baseline | 0.852 | 0.094 |
| CelebA | DRO | 0.965 | 0.018 |
| CelebA | Ours | 0.975 | 0.013 |
| Civil Comments | Baseline | 0.888 | 0.026 |
| Civil Comments | DRO | 0.944 | 0.013 |
| Civil Comments | Ours | 0.943 | 0.010 |
| CheXpert-device | Baseline | 0.929 | 0.064 |
| CheXpert-device | DRO | 0.933 | 0.080 |
| CheXpert-device | Ours | 0.934 | 0.031 |

More experimental results can be found in Appendix B for additional baselines on the Adult dataset, as well as the precision-coverage curves and margin distributions for the CelebA and Civil Comments datasets.
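The two summary quantities reported in Table 2 can be computed by integrating the swept operating points over coverage; the sketch below is a minimal illustration with made-up numbers, not the evaluation code used to produce the table.

```python
import numpy as np

def _trapezoid(y, x):
    """Simple trapezoidal rule, assuming x is sorted in ascending order."""
    y, x = np.asarray(y, float), np.asarray(x, float)
    return float(np.sum(0.5 * (y[1:] + y[:-1]) * np.diff(x)))

def curve_areas(coverages, accuracies, ppv_g0, ppv_g1):
    """Area under the accuracy-coverage curve and area between the two groups'
    precision-coverage curves, integrated over coverage."""
    order = np.argsort(coverages)
    c = np.asarray(coverages, float)[order]
    acc = np.asarray(accuracies, float)[order]
    gap = np.abs(np.asarray(ppv_g0, float) - np.asarray(ppv_g1, float))[order]
    return _trapezoid(acc, c), _trapezoid(gap, c)

# Toy usage: operating points obtained by sweeping the confidence threshold
cov = [1.0, 0.6, 0.3]
acc = [0.85, 0.92, 0.97]
ppv0 = [0.80, 0.90, 0.96]
ppv1 = [0.70, 0.88, 0.95]
print(curve_areas(cov, acc, ppv0, ppv1))
```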
5. Conclusion

Fairness in machine learning has never been a more important goal to pursue, and as we continue to root out the biases that plague our systems, we must be ever-vigilant of settings and applications where fairness techniques may need to be applied. We have introduced a method for enforcing fairness in selective classification, using a novel upper bound for the conditional mutual information. And yet, the connection to mutual information suggests that there may be some grander picture yet to be seen, whereby the various mutual-information-inspired methods may be unified. A central perspective on fairness grounded in such a fundamental quantity could prove incredibly insightful, both for theory and practice.

Acknowledgments

This work was supported, in part, by the MIT-IBM Watson AI Lab under Agreement No. W1771646, and NSF under Grant No. CCF-1717610.

References

Angwin, J., Larson, J., Mattu, S., and Kirchner, L. How we analyzed the COMPAS recidivism algorithm. ProPublica, 2016. URL https://www.propublica.org/article/how-we-analyzed-the-compas-recidivism-algorithm.

Arjovsky, M., Bottou, L., Gulrajani, I., and Lopez-Paz, D. Invariant risk minimization. arXiv preprint arXiv:1907.02893, 2019.

Baharlouei, S., Nouiehed, M., Beirami, A., and Razaviyayn, M. Rényi fair inference. In 8th International Conference on Learning Representations (ICLR), 2020. URL https://openreview.net/forum?id=HkgsUJrtDB.

Barocas, S., Hardt, M., and Narayanan, A. Fairness and Machine Learning. fairmlbook.org, 2019. http://www.fairmlbook.org.

Bellamy, R. K., Dey, K., Hind, M., Hoffman, S. C., Houde, S., Kannan, K., Lohia, P., Martino, J., Mehta, S., Mojsilovic, A., et al. AI Fairness 360: An extensible toolkit for detecting, understanding, and mitigating unwanted algorithmic bias. arXiv preprint arXiv:1810.01943, 2018.

Borkan, D., Dixon, L., Sorensen, J., Thain, N., and Vasserman, L. Nuanced metrics for measuring unintended bias with real data for text classification. In Companion Proceedings of The 2019 World Wide Web Conference, 2019.

Calders, T., Kamiran, F., and Pechenizkiy, M. Building classifiers with independency constraints. In 2009 IEEE International Conference on Data Mining Workshops, pp. 13-18. IEEE, 2009.

Cheng, P., Hao, W., Dai, S., Liu, J., Gan, Z., and Carin, L. CLUB: A contrastive log-ratio upper bound of mutual information. In Proceedings of the 37th International Conference on Machine Learning (ICML), PMLR 119, pp. 1779-1788, 2020. URL http://proceedings.mlr.press/v119/cheng20b.html.

Cho, J., Hwang, G., and Suh, C. A fair classifier using mutual information. In 2020 IEEE International Symposium on Information Theory (ISIT), pp. 2521-2526. IEEE, 2020.

Chouldechova, A. Fair prediction with disparate impact: A study of bias in recidivism prediction instruments. Big Data, 5(2):153-163, 2017.

Cleary, T. A. Test bias: Validity of the scholastic aptitude test for negro and white students in integrated colleges. ETS Research Bulletin Series, 1966(2):i-23, 1966.

Corbett-Davies, S., Pierson, E., Feller, A., Goel, S., and Huq, A. Algorithmic decision making and the cost of fairness. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 797-806. ACM, 2017. URL https://doi.org/10.1145/3097983.3098095.

Cortes, C., DeSalvo, G., and Mohri, M. Learning with rejection. In International Conference on Algorithmic Learning Theory, pp. 67-82. Springer, 2016.

Creager, E., Jacobsen, J.-H., and Zemel, R. Exchanging lessons between algorithmic fairness and domain generalization. arXiv preprint arXiv:2010.07249, 2020.

Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp. 4171-4186. Association for Computational Linguistics, 2019. URL https://www.aclweb.org/anthology/N19-1423.
du Pin Calmon, F., Wei, D., Vinzamuri, B., Ramamurthy, K. N., and Varshney, K. R. Optimized pre-processing for discrimination prevention. In Advances in Neural Information Processing Systems 30, pp. 3992-4001, 2017. URL https://proceedings.neurips.cc/paper/2017/hash/9a49a25d845a483fae4be7e341368e36-Abstract.html.

Franc, V. and Průša, D. On discriminative learning of prediction uncertainty. In Proceedings of the 36th International Conference on Machine Learning (ICML), PMLR 97, pp. 1963-1971, 2019. URL http://proceedings.mlr.press/v97/franc19a.html.

Geifman, Y. and El-Yaniv, R. Selective classification for deep neural networks. In Advances in Neural Information Processing Systems 30, pp. 4878-4887, 2017. URL https://proceedings.neurips.cc/paper/2017/hash/4a8423d5e91fda00bb7e46540e2b0cf1-Abstract.html.

Grari, V., Ruf, B., Lamprier, S., and Detyniecki, M. Fairness-aware neural Rényi minimization for continuous features. arXiv preprint arXiv:1911.04929, 2019.

Hardt, M., Price, E., and Srebro, N. Equality of opportunity in supervised learning. In Advances in Neural Information Processing Systems 29, pp. 3315-3323, 2016. URL https://proceedings.neurips.cc/paper/2016/hash/9d2682367c3935defcb1f9e247a97c0d-Abstract.html.

He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 770-778, 2016. URL https://doi.org/10.1109/CVPR.2016.90.

Huang, G., Liu, Z., van der Maaten, L., and Weinberger, K. Q. Densely connected convolutional networks. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2261-2269, 2017. URL https://doi.org/10.1109/CVPR.2017.243.

Huang, S.-L., Makur, A., Wornell, G. W., and Zheng, L. On universal features for high-dimensional learning and inference. Preprint, 2019. http://allegro.mit.edu/~gww/unifeatures.

Irvin, J., Rajpurkar, P., Ko, M., Yu, Y., Ciurea-Ilcus, S., Chute, C., Marklund, H., Haghgoo, B., Ball, R. L., Shpanskaya, K. S., Seekins, J., Mong, D. A., Halabi, S. S., Sandberg, J. K., Jones, R., Larson, D. B., Langlotz, C. P., Patel, B. N., Lungren, M. P., and Ng, A. Y. CheXpert: A large chest radiograph dataset with uncertainty labels and expert comparison. In The Thirty-Third AAAI Conference on Artificial Intelligence (AAAI 2019), pp. 590-597. AAAI Press, 2019. URL https://doi.org/10.1609/aaai.v33i01.3301590.
Jones, E., Sagawa, S., Koh, P. W., Kumar, A., and Liang, P. Selective classification can magnify disparities across groups. arXiv preprint arXiv:2010.14134, 2020.

Kohavi, R. Scaling up the accuracy of naive-Bayes classifiers: A decision-tree hybrid. In KDD, volume 96, pp. 202-207, 1996.

Lee, J., Bu, Y., Sattigeri, P., Panda, R., Wornell, G., Karlinsky, L., and Feris, R. A maximal correlation approach to imposing fairness in machine learning. arXiv preprint arXiv:2012.15259, 2020.

Liu, L. T., Simchowitz, M., and Hardt, M. The implicit fairness criterion of unconstrained learning. In Proceedings of the 36th International Conference on Machine Learning (ICML), PMLR 97, pp. 4051-4060, 2019. URL http://proceedings.mlr.press/v97/liu19f.html.

Liu, Z., Luo, P., Wang, X., and Tang, X. Deep learning face attributes in the wild. In 2015 IEEE International Conference on Computer Vision (ICCV), pp. 3730-3738, 2015. URL https://doi.org/10.1109/ICCV.2015.425.

Mary, J., Calauzènes, C., and Karoui, N. E. Fairness-aware learning for continuous attributes and treatments. In Proceedings of the 36th International Conference on Machine Learning (ICML), PMLR 97, pp. 4382-4391, 2019. URL http://proceedings.mlr.press/v97/mary19a.html.

Meade, R. Bias in machine learning: How facial recognition models show signs of racism, sexism and ageism. 2019. https://towardsdatascience.com/bias-in-machine-learning-how-facial-recognition-models-show-signs-of-racism-sexism-and-ageism-32549e2c972d.

Menon, A. K. and Williamson, R. C. The cost of fairness in binary classification. In Conference on Fairness, Accountability and Transparency, pp. 107-118. PMLR, 2018.

Oakden-Rayner, L., Dunnmon, J., Carneiro, G., and Ré, C. Hidden stratification causes clinically meaningful failures in machine learning for medical imaging. In Proceedings of the ACM Conference on Health, Inference, and Learning, pp. 151-159, 2020.

Pessach, D. and Shmueli, E. Algorithmic fairness. arXiv preprint arXiv:2001.09784, 2020.

Platt, J. et al. Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods. Advances in Large Margin Classifiers, 10(3):61-74, 1999.

Sagawa, S., Koh, P. W., Hashimoto, T. B., and Liang, P. Distributionally robust neural networks for group shifts: On the importance of regularization for worst-case generalization. arXiv preprint arXiv:1911.08731, 2019.

Selbst, A. D., Boyd, D., Friedler, S. A., Venkatasubramanian, S., and Vertesi, J. Fairness and abstraction in sociotechnical systems. In Proceedings of the Conference on Fairness, Accountability, and Transparency, pp. 59-68, 2019.
Turc, I., Chang, M.-W., Lee, K., and Toutanova, K. Well-read students learn better: On the importance of pre-training compact models. arXiv preprint arXiv:1908.08962, 2019.

Yildirim, M. Y., Ozer, M., and Davulcu, H. Leveraging uncertainty in deep learning for selective classification. arXiv preprint arXiv:1905.09509, 2019.

Zafar, M. B., Valera, I., Gomez-Rodriguez, M., and Gummadi, K. P. Fairness constraints: Mechanisms for fair classification. In Proceedings of the 20th International Conference on Artificial Intelligence and Statistics (AISTATS), PMLR 54, pp. 962-970, 2017. URL http://proceedings.mlr.press/v54/zafar17a.html.

Zemel, R. S., Wu, Y., Swersky, K., Pitassi, T., and Dwork, C. Learning fair representations. In Proceedings of the 30th International Conference on Machine Learning (ICML), JMLR Workshop and Conference Proceedings 28, pp. 325-333, 2013a. URL http://proceedings.mlr.press/v28/zemel13.html.

Zemel, R. S., Wu, Y., Swersky, K., Pitassi, T., and Dwork, C. Learning fair representations. In Proceedings of the 30th International Conference on Machine Learning (ICML), JMLR Workshop and Conference Proceedings 28, pp. 325-333, 2013b. URL http://proceedings.mlr.press/v28/zemel13.html.

Zhang, B. H., Lemoine, B., and Mitchell, M. Mitigating unwanted biases with adversarial learning. In Proceedings of the 2018 AAAI/ACM Conference on AI, Ethics, and Society, pp. 335-340, 2018.