# SELECTIVE CLASSIFICATION CAN MAGNIFY DISPARITIES ACROSS GROUPS

Published as a conference paper at ICLR 2021

Erik Jones, Shiori Sagawa, Pang Wei Koh, Ananya Kumar & Percy Liang
Department of Computer Science, Stanford University
{erjones,ssagawa,pangwei,ananya,pliang}@cs.stanford.edu
*Equal contribution.

ABSTRACT

Selective classification, in which models can abstain on uncertain predictions, is a natural approach to improving accuracy in settings where errors are costly but abstentions are manageable. In this paper, we find that while selective classification can improve average accuracies, it can simultaneously magnify existing accuracy disparities between various groups within a population, especially in the presence of spurious correlations. We observe this behavior consistently across five vision and NLP datasets. Surprisingly, increasing abstentions can even decrease accuracies on some groups. To better understand this phenomenon, we study the margin distribution, which captures the model's confidences over all predictions. For symmetric margin distributions, we prove that whether selective classification monotonically improves or worsens accuracy is fully determined by the accuracy at full coverage (i.e., without any abstentions) and whether the distribution satisfies a property we call left-log-concavity. Our analysis also shows that selective classification tends to magnify full-coverage accuracy disparities. Motivated by our analysis, we train distributionally-robust models that achieve similar full-coverage accuracies across groups and show that selective classification uniformly improves each group on these models. Altogether, our results suggest that selective classification should be used with care and underscore the importance of training models to perform equally well across groups at full coverage.

1 INTRODUCTION

Selective classification, in which models make predictions only when their confidence is above a threshold, is a natural approach when errors are costly but abstentions are manageable. For example, in medical and criminal justice applications, model mistakes can have serious consequences, whereas abstentions can be handled by backing off to the appropriate human experts. Prior work has shown that, across a broad array of applications, more confident predictions tend to be more accurate (Hanczar & Dougherty, 2008; Yu et al., 2011; Toplak et al., 2014; Mozannar & Sontag, 2020; Kamath et al., 2020). By varying the confidence threshold, we can select an appropriate trade-off between the abstention rate and the (selective) accuracy of the predictions made.

In this paper, we report a cautionary finding: while selective classification improves average accuracy, it can magnify existing accuracy disparities between various groups within a population, especially in the presence of spurious correlations. We observe this behavior across five vision and NLP datasets and two popular selective classification methods: softmax response (Cordella et al., 1995; Geifman & El-Yaniv, 2017) and Monte Carlo dropout (Gal & Ghahramani, 2016). Surprisingly, we find that increasing the abstention rate can even decrease accuracies on the groups that have lower accuracies at full coverage: on those groups, the models are not only wrong more frequently, but their confidence can actually be anticorrelated with whether they are correct.
Even on datasets where selective classification improves accuracies across all groups, we find that it preferentially helps groups that already have high accuracies, further widening group disparities. These group disparities are especially problematic in the same high-stakes areas where we might want to deploy selective classification, like medicine and criminal justice; there, poor performance on particular groups is already a significant issue (Chen et al., 2020; Hill, 2020). For example, we study a variant of CheXpert (Irvin et al., 2019), where the task is to predict if a patient has pleural effusion (fluid around the lung) from a chest x-ray. As these are commonly treated with chest tubes, models can latch onto this spurious correlation and fail on the group of patients with pleural effusion but not chest tubes or other support devices. However, this group is the most clinically relevant as it comprises potentially untreated and undiagnosed patients (Oakden-Rayner et al., 2020).

Figure 1: A selective classifier (ŷ, ĉ) makes a prediction ŷ(x) on a point x if its confidence ĉ(x) in that prediction is larger than or equal to some threshold τ. We assume the data comprises different groups, each with its own data distribution, and that these group identities are not available to the selective classifier. In this figure, we show a classifier with low accuracy on a particular group (red), but high overall accuracy (blue). Left: The margin distributions overall (blue) and on the red group. The margin is defined as +ĉ(x) on correct predictions (ŷ(x) = y) and −ĉ(x) otherwise. For a threshold τ, the selective classifier is thus incorrect on points with margin ≤ −τ; abstains on points with margin between −τ and τ; and is correct on points with margin ≥ τ. Right: By varying τ, we can plot the accuracy-coverage curve, where the coverage is the proportion of predicted points. As coverage decreases, the average (selective) accuracy increases, but the worst-group accuracy decreases. The black dots correspond to the threshold τ = 1, which is shaded on the left.

To better understand why selective classification can worsen accuracy and magnify disparities, we analyze the margin distribution, which captures the model's confidences across all predictions and determines which examples it abstains on at each threshold (Figure 1). We prove that when the margin distribution is symmetric, whether selective classification monotonically improves or worsens accuracy is fully determined by the accuracy at full coverage (i.e., without any abstentions) and whether the distribution satisfies a property we call left-log-concavity. To our knowledge, this is the first work to characterize whether selective classification (monotonically) helps or hurts accuracy in terms of the margin distribution, and to compare its relative effects on different groups. Our analysis shows that selective classification tends to magnify accuracy disparities that are present at full coverage.
Motivated by our analysis, we find that selective classification on group DRO models (Sagawa et al., 2020), which achieve similar accuracies across groups at full coverage by using group annotations during training, uniformly improves group accuracies at lower coverages, substantially mitigating the disparities observed on standard models that are instead optimized for average accuracy. This approach is not a silver bullet: it relies on knowing group identities during training, which are not always available (Hashimoto et al., 2018). However, these results illustrate that closing disparities at full coverage can also mitigate disparities due to selective classification.

2 RELATED WORK

Selective classification. Abstaining when the model is uncertain is a classic idea (Chow, 1957; Hellman, 1970), and uncertainty estimation is an active area of research, from the popular approach of using softmax probabilities (Geifman & El-Yaniv, 2017) to more sophisticated methods using dropout (Gal & Ghahramani, 2016), ensembles (Lakshminarayanan et al., 2017), or training snapshots (Geifman et al., 2018). Others incorporate abstention into model training (Bartlett & Wegkamp, 2008; Geifman & El-Yaniv, 2019; Feng et al., 2019) and learn to abstain on examples human experts are more likely to get correct (Raghu et al., 2019; Mozannar & Sontag, 2020; De et al., 2020). Selective classification can also improve out-of-distribution accuracy (Pimentel et al., 2014; Hendrycks & Gimpel, 2017; Liang et al., 2018; Ovadia et al., 2019; Kamath et al., 2020). On the theoretical side, early work characterized optimal abstention rules given well-specified models (Chow, 1970; Hellman & Raviv, 1970), with more recent work on learning with perfect precision (El-Yaniv & Wiener, 2010; Khani et al., 2016) and guaranteed risk (Geifman & El-Yaniv, 2017). We build on this literature by establishing general conditions on the margin distribution for when selective classification helps, and importantly, by showing that it can magnify group disparities.

Group disparities. The problem of models performing poorly on some groups of data has been widely reported (e.g., Hovy & Søgaard (2015); Blodgett et al. (2016); Corbett-Davies et al. (2017); Tatman (2017); Hashimoto et al. (2018)). These disparities can arise when models latch onto spurious correlations, e.g., demographics (Buolamwini & Gebru, 2018; Borkan et al., 2019), image backgrounds (Ribeiro et al., 2016; Xiao et al., 2020), spurious clinical variables (Badgeley et al., 2019; Oakden-Rayner et al., 2020), or linguistic artifacts (Gururangan et al., 2018; McCoy et al., 2019). These disparities have implications for model robustness and equity, and mitigating them is an important open challenge (Dwork et al., 2012; Hardt et al., 2016; Kleinberg et al., 2017; Duchi et al., 2019; Sagawa et al., 2020). Our work shows that selective classification can exacerbate this problem and must therefore be used with care.

3 SETUP

A selective classifier takes in an input x ∈ X and either predicts a label y ∈ Y or abstains. We study standard confidence-based selective classifiers (ŷ, ĉ), where ŷ : X → Y outputs a prediction and ĉ : X → R+ outputs the model's confidence in that prediction. The selective classifier abstains on x whenever its confidence ĉ(x) is below some threshold τ and predicts ŷ(x) otherwise.

Data and training. We consider a data distribution D over X × Y × G, where G = {1, 2, ..., k} corresponds to a group variable that is unobserved by the model.
We study the common setting where the model ŷ is trained under full coverage (i.e., without taking into account any abstentions) and the confidence function ĉ is then derived from the trained model, as described in the next paragraph. We will primarily consider models ŷ trained by empirical risk minimization (i.e., to minimize the average training loss); in that setting, the group g ∈ G is never observed. In Section 7, we will consider the group DRO training algorithm (Sagawa et al., 2020), which observes g at training time only. In both cases, g is not observed at test time and the model thus does not take in g. This is a common assumption: e.g., we might want to ensure that a face recognition model has equal accuracies across genders, but the model only sees the photograph (x) and not the gender (g).

Confidence. We will primarily consider softmax response (SR) selective classifiers, which take ĉ(x) to be the normalized logit of the predicted class. Formally, we consider models that estimate p̂(y | x) (e.g., through a softmax) and predict ŷ(x) = argmax_{y∈Y} p̂(y | x), with the corresponding probability estimate p̂(ŷ(x) | x). For binary classifiers, we define the confidence ĉ(x) as

ĉ(x) = (1/2) log [ p̂(ŷ(x) | x) / (1 − p̂(ŷ(x) | x)) ].   (1)

This corresponds to a confidence of ĉ(x) = 0 when p̂(ŷ(x) | x) = 0.5, i.e., the classifier is completely unsure of its prediction. We generalize this notion to multi-class classifiers in Section A.1. Softmax response is a popular technique applicable to neural networks and has been shown to improve average accuracies on a range of applications (Geifman & El-Yaniv, 2017). Other methods offer alternative ways of computing ĉ(x); in Appendix B.1, we also run experiments where ĉ(x) is obtained via Monte Carlo (MC) dropout (Gal & Ghahramani, 2016), with similar results.

Metrics. The performance of a selective classifier at a threshold τ is typically measured by its average (selective) accuracy on its predicted points, P[ŷ(x) = y | ĉ(x) ≥ τ], and its average coverage, i.e., the fraction of predicted points P[ĉ(x) ≥ τ]. We call the average accuracy at threshold 0 the full-coverage accuracy, which corresponds to the standard notion of accuracy without any abstentions. We always use the term accuracy w.r.t. some threshold τ ≥ 0; where appropriate, we emphasize this by calling it selective accuracy, but we use these terms interchangeably in this paper. Following convention, we evaluate models by varying τ and tracing out the accuracy-coverage curve (El-Yaniv & Wiener, 2010). As Figure 1 illustrates, this curve is fully determined by the distribution of the margin, which is +ĉ(x) on correct predictions (ŷ(x) = y) and −ĉ(x) otherwise. We are also interested in evaluating performance on each group. For a group g ∈ G, we compute its group (selective) accuracy by conditioning on the group, P[ŷ(x) = y | g, ĉ(x) ≥ τ], and we define group coverage analogously as P[ĉ(x) ≥ τ | g]. We pay particular attention to the worst group under the model, i.e., the group argmin_g P[ŷ(x) = y | g] with the lowest accuracy at full coverage. In our setting, we only use group information to evaluate the model, which does not observe g at training or test time; we relax this later when studying distributionally-robust models that use g during training.
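To make these definitions concrete, here is a minimal sketch (ours, not the paper's released code) of the softmax-response confidence, margin, and selective accuracy/coverage computations. The array names (`probs`, `labels`, `tau`) are illustrative, and the 1/2 factor follows Equation (1) as reconstructed above; any monotone transformation of the confidence yields the same accuracy-coverage curve.

```python
import numpy as np

def sr_confidence(probs):
    """Softmax-response confidence for binary classifiers, per Eq. (1):
    half the log-odds of the predicted class."""
    p_hat = probs.max(axis=1)                          # max softmax probability per example
    return 0.5 * np.log(p_hat / (1.0 - p_hat))

def margins(probs, labels):
    """Margin: +confidence if the prediction is correct, -confidence otherwise."""
    preds = probs.argmax(axis=1)
    conf = sr_confidence(probs)
    return np.where(preds == labels, conf, -conf)

def selective_accuracy_and_coverage(probs, labels, tau):
    """Average selective accuracy P[y_hat = y | c_hat >= tau] and coverage P[c_hat >= tau]."""
    conf, preds = sr_confidence(probs), probs.argmax(axis=1)
    predicted = conf >= tau                            # abstain on everything below tau
    coverage = predicted.mean()
    accuracy = (preds[predicted] == labels[predicted]).mean() if predicted.any() else float("nan")
    return accuracy, coverage
```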
| Dataset | Modality | # examples | Prediction task | Spurious attribute A |
|---|---|---|---|---|
| CelebA | Photos | 202,599 | Hair color | Gender |
| CivilComments | Text | 448,000 | Toxicity | Mention of Christianity |
| Waterbirds | Photos | 11,788 | Waterbird or landbird | Water or land background |
| CheXpert-device | X-rays | 156,848 | Pleural effusion | Has support device |
| MultiNLI | Sentence pairs | 412,349 | Entailment | Presence of negation words |

Table 1: We study these datasets from Liu et al. (2015); Borkan et al. (2019); Sagawa et al. (2020); Irvin et al. (2019); Williams et al. (2018), respectively. For each dataset, we form a group for each combination of label y ∈ Y and spuriously-correlated attribute a ∈ A, and evaluate the accuracy of selective classifiers on average and on each group. Dataset details are in Appendix C.1.

Datasets. We consider five datasets (Table 1) on which prior work has shown that models latch onto spurious correlations, thereby performing well on average but poorly on the groups of data where the spurious correlation does not hold up. Following Sagawa et al. (2020), we define a set of labels Y as well as a set of attributes A that are spuriously correlated with the labels, and then form one group for each (y, a) ∈ Y × A. For example, in the pleural effusion example from Section 1, one group would be patients with pleural effusion (y = 1) but no support devices (a = 0). Each dataset has |Y| = |A| = 2, except MultiNLI, which has |Y| = 3. More dataset details are in Appendix C.1.

4 EVALUATING SELECTIVE CLASSIFICATION ON GROUPS

We start by investigating how selective classification affects group (selective) accuracies across the five datasets in Table 1. We train standard models with empirical risk minimization, i.e., to minimize average training loss, using ResNet50 for CelebA and Waterbirds; DenseNet121 for CheXpert-device; and BERT for CivilComments and MultiNLI. Details are in Appendix C. We focus on softmax response (SR) selective classifiers, but show similar results for MC-dropout in Appendix B.1.

Accuracy-coverage curves. Figure 2 shows group accuracy-coverage curves for each dataset, with the average in blue, worst group in red, and other groups in gray. On all datasets, average accuracies improve as coverage decreases. However, the worst-group curves fall into three categories:

1. Decreasing. Strikingly, on CelebA, worst-group accuracy decreases with coverage: the more confident the model is on worst-group points, the more likely it is incorrect.
2. Mixed. On Waterbirds, CheXpert-device, and CivilComments, as coverage decreases, worst-group accuracy sometimes increases (though not by much, except at noisy, low coverages) and sometimes decreases.
3. Slowly increasing. On MultiNLI, as coverage decreases, worst-group accuracy consistently improves but more slowly than other groups: from full to 50% average coverage, worst-group accuracy goes from 65% to 75% while the second-to-worst group accuracy goes from 77% to 95%.

Figure 2: Accuracy (top) and coverage (bottom) for each group, as a function of the average coverage. Each average coverage corresponds to a threshold τ. The red lines represent the worst group. At low coverages, accuracy estimates are noisy as only a few predictions are made.
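The accuracy-coverage and group-coverage curves in Figure 2 can be traced by sorting predictions by confidence and sweeping the threshold; a sketch of this evaluation loop (illustrative array names, not the paper's code):

```python
import numpy as np

def accuracy_coverage_curves(conf, correct, groups):
    """Sweep thresholds over the observed confidences (most confident first) and record
    average and per-group selective accuracy and coverage at each coverage level."""
    order = np.argsort(-conf)
    correct, groups = correct[order], groups[order]
    n = len(conf)
    rows = []
    for k in range(1, n + 1):                      # predict on the k most confident points
        row = {"coverage": k / n, "accuracy": correct[:k].mean()}
        for g in np.unique(groups):
            kept_g = groups[:k] == g               # predicted points that belong to group g
            row[f"acc_g{g}"] = correct[:k][kept_g].mean() if kept_g.any() else np.nan
            row[f"cov_g{g}"] = kept_g.sum() / (groups == g).sum()
        rows.append(row)
    return rows
```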
Figure 3: SR selective classifiers (solid line) substantially underperform their group-agnostic references (dotted line) on the worst group, which are in turn far behind the best-case Robin Hood references (dashed line). By construction, these share the same average accuracy-coverage curves (blue line). Similar results for MC-dropout are in Figure 7.

Group-agnostic and Robin Hood references. The results above show that even when selective classification is helping the worst group, it seems to help other groups more. We formalize this notion by comparing the selective classifier to a matching group-agnostic reference that is derived from it and that tries to abstain equally across groups. At each threshold, the group-agnostic reference makes the same numbers of correct and incorrect predictions as its corresponding selective classifier, but distributes these predictions uniformly at random across points without regard to group identities (Algorithm 1). By construction, its average accuracy-coverage curve is identical to that of its corresponding selective classifier, but its group accuracies can differ. We show in Appendix A.2 that it satisfies equalized odds (Hardt et al., 2016) w.r.t. which points it predicts or abstains on.

The group-agnostic reference distributes abstentions equally across groups; from the perspective of closing disparities between groups, this is the least that we might hope for. Ideally, selective classification would preferentially increase worst-group accuracy until it matches the other groups. We can capture this optimistic scenario by constructing, for a given selective classifier, a corresponding Robin Hood reference which, as above, also makes the same number of correct and incorrect predictions (see Algorithm 2 in Appendix A.3). However, unlike the group-agnostic reference, for the Robin Hood reference, the correct predictions are not chosen uniformly at random; instead, we prioritize picking them from the worst group, then the second worst group, etc. Likewise, we prioritize picking the incorrect predictions from the best group, then the second best group, etc. This results in worst-group accuracy rapidly increasing at the cost of the best group.

Neither the group-agnostic nor the Robin Hood reference is an algorithm that we could implement in practice without already knowing all of the groups and labels. They act instead as references: selective classifiers that preferentially benefit the worst group would have worst-group accuracy-coverage curves that lie between the group-agnostic and Robin Hood curves. Unfortunately, Figure 3 shows that SR selective classifiers substantially underperform even their group-agnostic counterparts: they disproportionately help groups that already have higher accuracies, further exacerbating the disparities between groups. We show similar results for MC-dropout in Section B.1.

Algorithm 1: Group-agnostic reference for (ŷ, ĉ) at threshold τ
Input: Selective classifier (ŷ, ĉ), threshold τ, test data D
Output: The sets of correct predictions C^ga_τ ⊆ D and incorrect predictions I^ga_τ ⊆ D that the group-agnostic reference for (ŷ, ĉ) makes at threshold τ.
1. Let Cτ be the set of all examples that (ŷ, ĉ) correctly predicts at threshold τ:
   Cτ = {(x, y, g) ∈ D | ŷ(x) = y and ĉ(x) ≥ τ}.   (2)
   Sample a subset C^ga_τ of size |Cτ| uniformly at random from C0, which is the set of all examples that ŷ would have predicted correctly at full coverage.
2. Let Iτ be the analogous set of incorrect predictions at τ:
   Iτ = {(x, y, g) ∈ D | ŷ(x) ≠ y and ĉ(x) ≥ τ}.   (3)
   Sample a subset I^ga_τ of size |Iτ| uniformly at random from I0.
3. Return C^ga_τ and I^ga_τ.

Since |C^ga_τ| = |Cτ| and |I^ga_τ| = |Iτ|, the group-agnostic reference makes the same numbers of correct and incorrect predictions as (ŷ, ĉ), but in a group-agnostic way.

Figure 4: Margin distributions on average (top) and on the worst group (bottom). Positive (negative) margins correspond to correct (incorrect) predictions, and we abstain on points with margins closest to zero first. The worst groups have disproportionately many confident but incorrect examples.

5 ANALYSIS: MARGIN DISTRIBUTIONS AND ACCURACY-COVERAGE CURVES

We saw in the previous section that while selective classification typically increases average (selective) accuracy, it can either increase or decrease worst-group accuracy. We now turn to a theoretical analysis of this behavior. Specifically, we establish the conditions under which we can expect to see its two extremes: when accuracy monotonically increases, or decreases, as a function of coverage. While these extremes do not fully explain the empirical phenomena, e.g., why worst-group accuracy sometimes increases and then decreases, our analysis broadly captures why accuracy monotonically increases on average with decreasing coverage, but displays mixed behavior on the worst group.

Our central objects of study are the margin distributions for each group. Recall that the margin of a selective classifier (ŷ, ĉ) on a point (x, y) is its confidence +ĉ(x) ≥ 0 if the prediction is correct (ŷ(x) = y) and −ĉ(x) ≤ 0 otherwise. The selective accuracies on average and on the worst group are thus completely determined by their respective margin distributions, which we show for our datasets in Figure 4. The worst-group and average distributions are very different: the worst-group distributions are consistently shifted to the left, with many confident but incorrect examples.

Our approach will be to characterize what properties of a general margin distribution F lead to monotonically increasing or decreasing accuracy. Then, by letting F be the overall or worst-group margin distribution, we can see how the differences in these distributions lead to differences in accuracy as a function of coverage.

Setup. We consider distributions over margins that have a differentiable cumulative distribution function (CDF) and a density, denoted by corresponding upper- and lowercase variables (e.g., F and f, respectively). Each margin distribution F corresponds to a selective classifier over some data distribution. We denote the corresponding (selective) accuracy of the classifier at threshold τ as A_F(τ) = (1 − F(τ)) / (F(−τ) + 1 − F(τ)). Since increasing the threshold τ monotonically decreases coverage, we focus on studying accuracy as a function of τ. All proofs are in Appendix D.
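As a concrete reading of A_F(τ), the same quantity can be estimated from a finite sample of margins; a small sketch (ours, assuming a NumPy array of signed margins) that also previews how accuracy can fall as τ grows when margins are mostly negative:

```python
import numpy as np

def selective_accuracy_from_margins(margins, tau):
    """A_F(tau) = P(margin >= tau) / (P(margin <= -tau) + P(margin >= tau)).
    Points with |margin| < tau are abstentions and do not count."""
    margins = np.asarray(margins)
    correct = (margins >= tau).mean()        # estimates 1 - F(tau)
    incorrect = (margins <= -tau).mean()     # estimates F(-tau)
    total = correct + incorrect
    return correct / total if total > 0 else float("nan")

# Example: a left-shifted "worst group" has low full-coverage accuracy,
# and raising tau can make it worse rather than better.
worst_group = np.random.default_rng(0).normal(loc=-0.5, scale=1.0, size=100_000)
print(selective_accuracy_from_margins(worst_group, 0.0))   # about 0.31
print(selective_accuracy_from_margins(worst_group, 1.0))   # lower still, about 0.18
```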
5.1 SYMMETRIC MARGIN DISTRIBUTIONS

We begin with symmetric distributions. We introduce a generalization of log-concavity, which we call left-log-concavity; for symmetric distributions, left-log-concavity corresponds to monotonicity of the accuracy-coverage curve, with the direction determined by the full-coverage accuracy.

Definition 1 (Left-log-concave distributions). A distribution is left-log-concave if its CDF is log-concave on (−∞, µ], where µ is the mean of the distribution.

Left-log-concave distributions are a superset of the broad family of log-concave distributions (e.g., Gaussian, beta, uniform), which require log-concave densities (instead of CDFs) on their entire support (Boyd & Vandenberghe, 2004). Notably, they can be multimodal: a symmetric mixture of two Gaussians is left-log-concave but not generally log-concave (Lemma 1 in Appendix D).

Proposition 1 (Left-log-concavity and monotonicity). Let F be the CDF of a symmetric distribution. If F is left-log-concave, then A_F(τ) is monotonically increasing in τ if A_F(0) ≥ 1/2 and monotonically decreasing otherwise. Conversely, if A_Fd(τ) is monotonically increasing for all translations Fd such that Fd(τ) = F(τ − d) for all τ and A_Fd(0) ≥ 1/2, then F is left-log-concave.

Proposition 1 is consistent with the observation that selective classification tends to improve average accuracy but hurts worst-group accuracy. As an illustration, consider a margin distribution that is a symmetric mixture of two Gaussians, each corresponding to a group, and where the average accuracy is >50% at full coverage but the worst-group accuracy is <50%. As the overall and worst-group margin distributions are both left-log-concave (Lemma 1), Proposition 1 implies that worst-group accuracy will decrease monotonically with τ while the average accuracy improves monotonically. Applied to CelebA (Figure 4), Proposition 1 is also consistent with how average accuracy improves while worst-group accuracy, which is <50% at full coverage, worsens. Finally, Proposition 1 also helps to explain why selective classification generally improves average accuracy in the literature, as average accuracies are typically high at full coverage and margin distributions often resemble Gaussians (Balasubramanian et al., 2011; Lakshminarayanan et al., 2017).
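The two-Gaussian illustration above is easy to check numerically; a short sketch (ours) using the closed-form A_F(τ) with SciPy:

```python
from scipy.stats import norm

def acc(F, tau):
    """A_F(tau) = (1 - F(tau)) / (F(-tau) + 1 - F(tau))."""
    return (1 - F(tau)) / (F(-tau) + 1 - F(tau))

# Overall margins: equal mixture of N(-1, 1) (worst group) and N(3, 1) (other group),
# which is symmetric about 1 and left-log-concave (cf. Lemma 1).
F_mix = lambda t: 0.5 * norm.cdf(t, -1, 1) + 0.5 * norm.cdf(t, 3, 1)
F_wg = lambda t: norm.cdf(t, -1, 1)

for tau in [0.0, 0.5, 1.0, 2.0]:
    print(f"tau={tau:3.1f}  average={acc(F_mix, tau):.3f}  worst group={acc(F_wg, tau):.3f}")
# Average accuracy rises monotonically in tau (it starts around 0.58 > 1/2),
# while worst-group accuracy falls monotonically (it starts around 0.16 < 1/2).
```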
5.2 SKEW-SYMMETRIC MARGIN DISTRIBUTIONS

The results above apply only to symmetric margin distributions. As not all of the margin distributions in Figure 4 are symmetric, we extend our analysis to asymmetric margin distributions by building upon prior work on skew-symmetric distributions (Azzalini & Regoli, 2012).

Definition 2. A distribution with density f_α,µ is skew-symmetric with skew α and center µ if

f_α,µ(τ) = 2·h(τ − µ)·G(α(τ − µ))   (4)

for all τ ∈ R, where h is the density of a distribution that is symmetric about 0, and G is the CDF of a potentially different distribution that is also symmetric about 0.

In other words, f_α,µ is a skewed form of the symmetric density h, where higher α means more right skew, and setting α = 0 yields a (translated) h. Skew-symmetric distributions are a broad family and include, e.g., skew-normal distributions, as well as all symmetric distributions. In Appendix D.3, we show some properties of skew-symmetric distributions as they pertain to selective classification, e.g., margin distributions that are more right-skewed have higher accuracies (Proposition 6).

Our result is that skewing a symmetric distribution in the same direction preserves monotonicity: if accuracy is monotone increasing, then right skew (which increases accuracy) preserves this.

Proposition 2 (Skew in the same direction preserves monotonicity). Let F_α,µ be the CDF of a skew-symmetric distribution. If the accuracy of its symmetric version, A_F0,µ(τ), is monotonically increasing in τ, then A_Fα,µ(τ) is also monotonically increasing in τ for any α > 0. Similarly, if A_F0,µ(τ) is monotonically decreasing in τ, then A_Fα,µ(τ) is also monotonically decreasing in τ for any α < 0.

5.3 DISCUSSION

Proposition 1 relates the left-log-concavity of a symmetric margin distribution to the monotonicity of selective accuracy, with the direction of monotonicity (increasing or decreasing) determined by the full-coverage accuracy. Proposition 2 then states that if we have a symmetric margin distribution with monotone accuracy, skewing it in the same direction preserves monotonicity. Combining both propositions, we have that if F_0,µ is symmetric, left-log-concave, and has full-coverage accuracy >50%, then the accuracy A_Fα,µ(τ) of any skewed F_α,µ with α > 0 is monotone increasing in τ. Since accuracy-coverage curves are preserved under all odd, monotone transformations of margins (Lemma 8 in Appendix D), these results also generalize to odd, monotone transformations of these (skew-)symmetric distributions.

As many margin distributions in Figure 4 and in the broader literature (Balasubramanian et al., 2011; Lakshminarayanan et al., 2017) resemble the distributions studied above (e.g., Gaussians and skewed Gaussians), we thus expect selective classification to improve average accuracy but worsen worst-group accuracy when worst-group accuracy at full coverage is low to begin with.

An open question is how to characterize the properties of margin distributions that lead to non-monotone behavior. For example, in Waterbirds and CheXpert-device, accuracies first increase and then decrease with decreasing coverage (Figure 2). These two datasets have worst-group margin distributions that have full-coverage accuracies >50% but that are left-skewed, with skewness -0.30 and -0.33 respectively (Figure 4), so we cannot apply Proposition 2 to describe them.

6 ANALYSIS: COMPARISON TO GROUP-AGNOSTIC REFERENCE

Even if selective classification improves worst-group accuracy, it can still exacerbate group disparities, underperforming the group-agnostic reference on the worst group (Section 4). In this section, we continue our analysis and show that while it is possible to outperform the group-agnostic reference, it is challenging to do so, especially when the accuracy disparity at full coverage is large.

Setup. Throughout this section, we decompose the margin distribution into two components F = p·F_wg + (1 − p)·F_others, where F_wg and F_others correspond to the margin distributions of the worst group and of all other groups combined, respectively; p is the fraction of examples in the worst group; and the worst group has strictly worse accuracy at full coverage than the other groups (i.e., A_Fwg(0) < A_Fothers(0)). Recall from Section 4 that for any selective classifier (i.e., any margin distribution), its group-agnostic reference has the same average accuracy at each threshold τ but potentially different group accuracies. We denote the worst-group accuracy of the group-agnostic reference as Ã_Fwg(τ), which can be written in terms of F_wg, F_others, and p (Appendix A.2).
Otherwise, we continue with the notation from Section 5, and all proofs are in Appendix E. A selective classifier with margin distribution F is said to outperform the group-agnostic reference on the worst group if A_Fwg(τ) ≥ Ã_Fwg(τ) for all τ ≥ 0. To establish a necessary condition for outperforming the reference, we study the neighborhood of τ = 0, which corresponds to full coverage:

Proposition 3 (Necessary condition for outperforming the group-agnostic reference). Assume that 1/2 < A_Fwg(0) < A_Fothers(0) < 1 and the worst-group density satisfies f_wg(0) > 0. If A_Fwg(τ) ≥ Ã_Fwg(τ) for all τ ≥ 0, then

f_wg(0) ≥ (1 − A_Fothers(0)) / (1 − A_Fwg(0)).   (5)

The RHS is the ratio of full-coverage errors; the larger the disparity between the worst group and the other groups at full coverage, the harder it is to satisfy this condition. In Appendix F, we simulate mixtures of Gaussians and show that this condition is rarely fulfilled.

Motivated by the empirical margin distributions, we apply Proposition 3 to the setting where F_wg and F_others are both log-concave and are translated and scaled versions of each other. We show that the worst group must have lower variance than the others to outperform the group-agnostic reference:

Corollary 1 (Outperforming the group-agnostic reference requires smaller scaling for log-concave distributions). Assume that 1/2 < A_Fwg(0) < A_Fothers(0) < 1, F_wg is log-concave, and f_others(τ) = v·f_wg(v(τ − µ_others) + µ_wg) for all τ ∈ R, where v is a scaling factor. If A_Fwg(τ) ≥ Ã_Fwg(τ) for all τ ≥ 0, then v < 1.

This is consistent with the empirical margin distributions on Waterbirds: the worst group has higher variance, implying v > 1 (as v is the ratio of the worst group's standard deviation to that of the other groups), and it thus fails to satisfy the necessary condition for outperforming the group-agnostic reference.

A further special case is when F_wg and F_others are log-concave and unscaled translations of each other. Here, selective classification underperforms the group-agnostic reference at all thresholds τ.

Proposition 4 (Translated log-concave distributions underperform the group-agnostic reference). Assume F_wg and F_others are log-concave and f_others(τ) = f_wg(τ − d) for all τ ∈ R. Then for all τ ≥ 0,

A_Fwg(τ) ≤ Ã_Fwg(τ).   (6)

This helps to explain our results on CheXpert-device, where the worst-group and average margin distributions are approximately translations of each other, and selective classification significantly underperforms the group-agnostic reference at all confidence thresholds.

7 SELECTIVE CLASSIFICATION ON GROUP DRO MODELS

Our above analysis suggests that selective classification tends to exacerbate group disparities, especially when the full-coverage disparities are large. This motivates a potential solution: by reducing group disparities at full coverage, models are more likely to satisfy the necessary condition for outperforming the group-agnostic reference defined in Proposition 3. In this section, we explore this approach by training models using group distributionally robust optimization (group DRO) (Sagawa et al., 2020), which minimizes the worst-group training loss L_DRO(θ) = max_{g∈G} Ê[ℓ(θ; (x, y)) | g]. Unlike standard training, group DRO uses group annotations at training time.
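For reference, the group DRO objective above is just the worst per-group average loss; a minimal sketch (ours) of computing it for a batch of per-example losses and integer group labels. Note that Sagawa et al. (2020) optimize this objective with an online reweighting of groups (exponentiated-gradient weights) rather than a hard per-batch max.

```python
import numpy as np

def group_dro_loss(losses, groups):
    """Worst-group average loss: L_DRO = max_g E[loss | g]."""
    losses, groups = np.asarray(losses), np.asarray(groups)
    group_means = [losses[groups == g].mean() for g in np.unique(groups)]
    return max(group_means)

# ERM minimizes losses.mean(); group DRO instead minimizes the max over per-group means,
# so a group that is fit poorly dominates the objective until its loss comes down.
```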
Figure 5: When applied to group DRO models, which have more similar accuracies across groups than standard models, SR selective classifiers improve average and worst-group accuracies. At low coverages, accuracy estimates are noisy as only a few predictions are made.

Figure 6: Density of margins for the average (top) and the worst-group (bottom) distributions over margins for the group DRO model. Positive (negative) margins correspond to correct (incorrect) predictions, and we abstain on margins closest to zero first. Unlike the selective classifiers trained with ERM, we see similar average and worst-group distributions for group DRO.

As with prior work, we found that group DRO models have much smaller full-coverage disparities (Figure 5). Moreover, worst-group accuracies consistently improve as coverage decreases and at a rate that is comparable to the group-agnostic reference, though small gaps remain on Waterbirds and CheXpert-device.

While our theoretical analysis motivates the above approach, the analysis ultimately depends on the margin distributions of each group, not just on their full-coverage accuracies. Although group DRO only optimizes for similar full-coverage accuracies across groups, we found that it also leads to much more similar average and worst-group margin distributions compared to ERM (Figure 6), explaining why selective classification behaves more uniformly over groups across all datasets.

Group DRO is not a silver bullet, as it relies on group annotations for training, which are not always available. Nevertheless, these results show that closing full-coverage accuracy disparities can mitigate the downstream disparities caused by selective classification.

8 DISCUSSION

We have shown that selective classification can magnify group disparities and should therefore be applied with caution. This is an insidious failure mode, since selective classification generally improves average accuracy and can appear to be working well if we do not look at group accuracies. However, we also found that selective classification can still work well on models that have equal full-coverage accuracies across groups. Training such models, especially without relying on too much additional information at training time, remains an important research direction. On the theoretical side, we characterized the behavior of selective classification in terms of the margin distributions; an open question is how different margin distributions arise from different data distributions, models, training procedures, and selective classification algorithms. Finally, in this paper we focused on studying selective accuracy in isolation; accounting for the cost of abstention and the equity of different coverages on different groups is an important direction for future work.

ACKNOWLEDGMENTS

We thank Emma Pierson, Jean Feng, Pranav Rajpurkar, and Tengyu Ma for helpful advice. This work was supported by NSF Award Grant no. 1804222. SS was supported by the Herbert Kunzel Stanford Graduate Fellowship and AK was supported by the Stanford Graduate Fellowship.
REPRODUCIBILITY

All code, data, and experiments are available on CodaLab at https://worksheets.codalab.org/worksheets/0x7ceb817d53b94b0c8294a7a22643bf5e. The code is also available on GitHub at https://github.com/ejones313/worst-group-sc.

REFERENCES

Adelchi Azzalini and Giuliana Regoli. Some properties of skew-symmetric distributions. Annals of the Institute of Statistical Mathematics, 64(4):857-879, 2012.

Marcus A Badgeley, John R Zech, Luke Oakden-Rayner, Benjamin S Glicksberg, Manway Liu, William Gale, Michael V McConnell, Bethany Percha, Thomas M Snyder, and Joel T Dudley. Deep learning predicts hip fracture using confounding patient and healthcare variables. npj Digital Medicine, 2, 2019.

Mark Bagnoli and Ted Bergstrom. Log-concave probability and its applications. Economic Theory, 26:445-469, 2005.

Krishnakumar Balasubramanian, Pinar Donmez, and Guy Lebanon. Unsupervised supervised learning II: Margin-based classification without labels. Journal of Machine Learning Research (JMLR), 12:3119-3145, 2011.

Peter L Bartlett and Marten H Wegkamp. Classification with a reject option using a hinge loss. Journal of Machine Learning Research (JMLR), 9:1823-1840, 2008.

Su Lin Blodgett, Lisa Green, and Brendan O'Connor. Demographic dialectal variation in social media: A case study of African-American English. In Empirical Methods in Natural Language Processing (EMNLP), pp. 1119-1130, 2016.

Daniel Borkan, Lucas Dixon, Jeffrey Sorensen, Nithum Thain, and Lucy Vasserman. Nuanced metrics for measuring unintended bias with real data for text classification. In World Wide Web (WWW), pp. 491-500, 2019.

Stephen Boyd and Lieven Vandenberghe. Convex Optimization. Cambridge University Press, 2004.

Joy Buolamwini and Timnit Gebru. Gender shades: Intersectional accuracy disparities in commercial gender classification. In Conference on Fairness, Accountability and Transparency, pp. 77-91, 2018.

Irene Y Chen, Emma Pierson, Sherri Rose, Shalmali Joshi, Kadija Ferryman, and Marzyeh Ghassemi. Ethical machine learning in health. arXiv preprint arXiv:2009.10576, 2020.

C. K. Chow. An optimum character recognition system using decision functions. In IRE Transactions on Electronic Computers, 1957.

Chao K Chow. On optimum recognition error and reject tradeoff. IEEE Transactions on Information Theory, 16(1):41-46, 1970.

Sam Corbett-Davies, Emma Pierson, Avi Feller, Sharad Goel, and Aziz Huq. Algorithmic decision making and the cost of fairness. In International Conference on Knowledge Discovery and Data Mining (KDD), pp. 797-806, 2017.

Luigi Pietro Cordella, Claudio De Stefano, Francesco Tortorella, and Mario Vento. A method for improving classification reliability of multilayer perceptrons. IEEE Transactions on Neural Networks, 6(5):1140-1147, 1995.

Madeleine Cule, Richard Samworth, and Michael Stewart. Maximum likelihood estimation of a multi-dimensional log-concave density. Journal of the Royal Statistical Society, 73:545-603, 2010.

Abir De, Paramita Koley, Niloy Ganguly, and Manuel Gomez-Rodriguez. Regression under human assistance. In Association for the Advancement of Artificial Intelligence (AAAI), pp. 2611-2620, 2020.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In Association for Computational Linguistics (ACL), pp. 4171-4186, 2019.

Lucas Dixon, John Li, Jeffrey Sorensen, Nithum Thain, and Lucy Vasserman. Measuring and mitigating unintended bias in text classification. In Association for the Advancement of Artificial Intelligence (AAAI), pp. 67-73, 2018.
John Duchi, Tatsunori Hashimoto, and Hongseok Namkoong. Distributionally robust losses against mixture covariate shifts. https://cs.stanford.edu/~thashim/assets/publications/condrisk.pdf, 2019.

Cynthia Dwork, Moritz Hardt, Toniann Pitassi, Omer Reingold, and Rich Zemel. Fairness through awareness. In Innovations in Theoretical Computer Science (ITCS), pp. 214-226, 2012.

Ran El-Yaniv and Yair Wiener. On the foundations of noise-free selective classification. Journal of Machine Learning Research (JMLR), 11, 2010.

Jean Feng, Arjun Sondhi, Jessica Perry, and Noah Simon. Selective prediction-set models with coverage guarantees. arXiv preprint arXiv:1906.05473, 2019.

Yarin Gal and Zoubin Ghahramani. Dropout as a Bayesian approximation: Representing model uncertainty in deep learning. In International Conference on Machine Learning (ICML), 2016.

Yonatan Geifman and Ran El-Yaniv. Selective classification for deep neural networks. In Advances in Neural Information Processing Systems (NeurIPS), 2017.

Yonatan Geifman and Ran El-Yaniv. SelectiveNet: A deep neural network with an integrated reject option. In International Conference on Machine Learning (ICML), 2019.

Yonatan Geifman, Guy Uziel, and Ran El-Yaniv. Bias-reduced uncertainty estimation for deep neural classifiers. In International Conference on Learning Representations (ICLR), 2018.

Suchin Gururangan, Swabha Swayamdipta, Omer Levy, Roy Schwartz, Samuel Bowman, and Noah A Smith. Annotation artifacts in natural language inference data. In Association for Computational Linguistics (ACL), pp. 107-112, 2018.

Blaise Hanczar and Edward R. Dougherty. Classification with reject option in gene expression data. Bioinformatics, 2008.

Moritz Hardt, Eric Price, and Nathan Srebro. Equality of opportunity in supervised learning. In Advances in Neural Information Processing Systems (NeurIPS), pp. 3315-3323, 2016.

Tatsunori B. Hashimoto, Megha Srivastava, Hongseok Namkoong, and Percy Liang. Fairness without demographics in repeated loss minimization. In International Conference on Machine Learning (ICML), 2018.

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Computer Vision and Pattern Recognition (CVPR), 2016.

Martin Hellman and Josef Raviv. Probability of error, equivocation, and the Chernoff bound. IEEE Transactions on Information Theory, 16(4):368-372, 1970.

Martin E Hellman. The nearest neighbor classification rule with a reject option. IEEE Transactions on Systems Science and Cybernetics, 6(3):179-185, 1970.

Dan Hendrycks and Kevin Gimpel. A baseline for detecting misclassified and out-of-distribution examples in neural networks. In International Conference on Learning Representations (ICLR), 2017.

Kashmir Hill. Wrongfully accused by an algorithm. The New York Times, 2020. URL https://www.nytimes.com/2020/06/24/technology/facial-recognition-arrest.html.

Dirk Hovy and Anders Søgaard. Tagging performance correlates with age. In Association for Computational Linguistics (ACL), pp. 483-488, 2015.

Gao Huang, Zhuang Liu, Laurens Van Der Maaten, and Kilian Q Weinberger. Densely connected convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4700-4708, 2017.

Jeremy Irvin, Pranav Rajpurkar, Michael Ko, Yifan Yu, Silviana Ciurea-Ilcus, Chris Chute, Henrik Marklund, Behzad Haghgoo, Robyn Ball, Katie Shpanskaya, et al. CheXpert: A large chest radiograph dataset with uncertainty labels and expert comparison. In Association for the Advancement of Artificial Intelligence (AAAI), volume 33, pp. 590-597, 2019.
Shalmali Joshi, Oluwasanmi Koyejo, Been Kim, and Joydeep Ghosh. xGEMs: Generating examplars to explain black-box models. arXiv preprint arXiv:1806.08867, 2018.

Amita Kamath, Robin Jia, and Percy Liang. Selective question answering under domain shift. In Association for Computational Linguistics (ACL), 2020.

Fereshte Khani, Martin Rinard, and Percy Liang. Unanimous prediction for 100% precision with application to learning semantic mappings. In Association for Computational Linguistics (ACL), 2016.

Jon Kleinberg, Sendhil Mullainathan, and Manish Raghavan. Inherent trade-offs in the fair determination of risk scores. In Innovations in Theoretical Computer Science (ITCS), 2017.

Pang Wei Koh, Shiori Sagawa, Henrik Marklund, Sang Michael Xie, Marvin Zhang, Akshay Balsubramani, Weihua Hu, Michihiro Yasunaga, Richard Lanas Phillips, Irena Gao, Tony Lee, Etienne David, Ian Stavness, Wei Guo, Berton A. Earnshaw, Imran S. Haque, Sara Beery, Jure Leskovec, Anshul Kundaje, Emma Pierson, Sergey Levine, Chelsea Finn, and Percy Liang. WILDS: A benchmark of in-the-wild distribution shifts. arXiv, 2020.

Balaji Lakshminarayanan, Alexander Pritzel, and Charles Blundell. Simple and scalable predictive uncertainty estimation using deep ensembles. In Advances in Neural Information Processing Systems (NeurIPS), 2017.

Shiyu Liang, Yixuan Li, and R. Srikant. Enhancing the reliability of out-of-distribution image detection in neural networks. In International Conference on Learning Representations (ICLR), 2018.

Ziwei Liu, Ping Luo, Xiaogang Wang, and Xiaoou Tang. Deep learning face attributes in the wild. In Proceedings of the IEEE International Conference on Computer Vision, pp. 3730-3738, 2015.

R Thomas McCoy, Ellie Pavlick, and Tal Linzen. Right for the wrong reasons: Diagnosing syntactic heuristics in natural language inference. In Association for Computational Linguistics (ACL), 2019.

Hussein Mozannar and David Sontag. Consistent estimators for learning to defer to an expert. arXiv preprint arXiv:2006.01862, 2020.

Luke Oakden-Rayner, Jared Dunnmon, Gustavo Carneiro, and Christopher Ré. Hidden stratification causes clinically meaningful failures in machine learning for medical imaging. In Proceedings of the ACM Conference on Health, Inference, and Learning, pp. 151-159, 2020.

Yaniv Ovadia, Emily Fertig, Jie Ren, Zachary Nado, D Sculley, Sebastian Nowozin, Joshua V. Dillon, Balaji Lakshminarayanan, and Jasper Snoek. Can you trust your model's uncertainty? Evaluating predictive uncertainty under dataset shift. In Advances in Neural Information Processing Systems (NeurIPS), 2019.

Ji Ho Park, Jamin Shin, and Pascale Fung. Reducing gender bias in abusive language detection. In Empirical Methods in Natural Language Processing (EMNLP), pp. 2799-2804, 2018.

Marco AF Pimentel, David A Clifton, Lei Clifton, and Lionel Tarassenko. A review of novelty detection. Signal Processing, 99:215-249, 2014.

José M Porcel. Chest tube drainage of the pleural space: a concise review for pulmonologists. Tuberculosis and Respiratory Diseases, 81(2):106-115, 2018.

Maithra Raghu, Katy Blumer, Rory Sayres, Ziad Obermeyer, Bobby Kleinberg, Sendhil Mullainathan, and Jon Kleinberg. Direct uncertainty prediction for medical second opinions. In International Conference on Machine Learning (ICML), pp. 5281-5290, 2019.
Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. "Why Should I Trust You?": Explaining the predictions of any classifier. In International Conference on Knowledge Discovery and Data Mining (KDD), 2016.

Shiori Sagawa, Pang Wei Koh, Tatsunori B. Hashimoto, and Percy Liang. Distributionally robust neural networks for group shifts: On the importance of regularization for worst-case generalization. In International Conference on Learning Representations (ICLR), 2020.

Rachael Tatman. Gender and dialect bias in YouTube's automatic captions. In Workshop on Ethics in Natural Language Processing, volume 1, pp. 53-59, 2017.

Marko Toplak, Rok Močnik, Matija Polajnar, Zoran Bosnić, Lars Carlsson, Catrin Hasselgren, Janez Demšar, Scott Boyer, Blaz Zupan, and Jonna Stålring. Assessment of machine learning reliability methods for quantifying the applicability domain of QSAR regression models. Journal of Chemical Information and Modeling, 54, 2014.

C Wah, S Branson, P Welinder, P Perona, and S Belongie. The Caltech-UCSD Birds-200-2011 dataset. Technical report, California Institute of Technology, 2011.

Adina Williams, Nikita Nangia, and Samuel Bowman. A broad-coverage challenge corpus for sentence understanding through inference. In Association for Computational Linguistics (ACL), pp. 1112-1122, 2018.

Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, and Jamie Brew. Hugging Face's Transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771, 2019.

Kai Xiao, Logan Engstrom, Andrew Ilyas, and Aleksander Madry. Noise or signal: The role of image backgrounds in object recognition. arXiv preprint arXiv:2006.09994, 2020.

Dong Yu, Jinyu Li, and Li Deng. Calibration of confidence measures in speech recognition. Trans. Audio, Speech and Lang. Proc., 19(8):2461-2473, 2011.

Bolei Zhou, Agata Lapedriza, Aditya Khosla, Aude Oliva, and Antonio Torralba. Places: A 10 million image database for scene recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40(6):1452-1464, 2017.

A.1 SOFTMAX RESPONSE SELECTIVE CLASSIFIERS

In this section, we describe our implementation of softmax response (SR) selective classifiers (Geifman & El-Yaniv, 2017). Recall from Section 3 that a selective classifier is a pair (ŷ, ĉ), where ŷ : X → Y outputs a prediction and ĉ : X → R+ outputs the model's confidence (which is always non-negative) in that prediction. SR classifiers are defined for neural networks (for classification), which generally have a last softmax layer over the k possible classes. For an input point x, we denote its maximum softmax probability, which corresponds to its predicted class ŷ(x), as p̂(ŷ(x) | x). We defined the confidence ĉ(x) for binary classifiers as

ĉ(x) = (1/2) log [ p̂(ŷ(x) | x) / (1 − p̂(ŷ(x) | x)) ].

Since the maximum softmax probability p̂(ŷ(x) | x) is at least 0.5 for binary classification, ĉ(x) is non-negative for each x, and is thus a valid confidence. For k > 2 classes, however, p̂(ŷ(x) | x) can be less than 0.5, in which case ĉ(x) would be negative. To ensure that confidence is always non-negative, we define ĉ(x) for k classes to be

ĉ(x) = (1/2) log [ p̂(ŷ(x) | x) / (1 − p̂(ŷ(x) | x)) ] + (1/2) log(k − 1).   (8)

With k classes, the maximum softmax probability p̂(ŷ(x) | x) ≥ 1/k, and therefore we can verify that ĉ(x) ≥ 0 as desired.
Moreover, when k = 2, Equation (8) reduces to our original binary confidence; we can therefore interpret the general form in Equation (8) as a normalized logit. Note that ĉ(x) is a monotone transformation of the maximum softmax probability p̂(ŷ(x) | x). Since the accuracy-coverage curve of a selective classifier only depends on the relative ranking of ĉ(x) across points, we could have equivalently set ĉ(x) to be p̂(ŷ(x) | x). However, following prior work, we choose the logit-transformed version to make the corresponding distribution of confidences easier to visualize (Balasubramanian et al., 2011; Lakshminarayanan et al., 2017).

Finally, we remark on one consequence of SR on the margin distribution for multi-class classification. Recall that we define the margin of an example to be +ĉ(x) on correct predictions (ŷ(x) = y) and −ĉ(x) otherwise, as described in Section 3. In Figure 4, we plot the margin distributions of SR selective classifiers on all five datasets. We observe that on MultiNLI, which is the only multi-class dataset (with k = 3), there is a gap (region of lower density) in the margin distribution around 0. We attribute this gap in part to the comparative rarity of seeing a maximum softmax probability of 1/3 when k = 3 versus seeing 1/2 when k = 2; in the former, all three logits must be the same, while for the latter only two logits must be the same.

A.2 GROUP-AGNOSTIC REFERENCE

Here, we describe the group-agnostic reference introduced in Section 4 in more detail. We elaborate on the construction from the main text and then define the reference formally. Finally, we show that the group-agnostic reference satisfies equalized odds.

A.2.1 DEFINITION

We begin by recalling the construction described in the main text for a finite test set D. In Algorithm 1 we relied on two important sets: the set of correctly classified points at threshold τ,

Cτ = {(x, y, g) ∈ D | ŷ(x) = y and ĉ(x) ≥ τ},   (9)

and similarly, the set of incorrectly classified points at threshold τ,

Iτ = {(x, y, g) ∈ D | ŷ(x) ≠ y and ĉ(x) ≥ τ}.   (10)

To compute the accuracy of the group-agnostic reference, we sample a subset C^ga_τ of size |Cτ| uniformly at random from C0, and similarly a subset I^ga_τ of size |Iτ| uniformly at random from I0. The group-agnostic reference makes predictions when examples are in C^ga_τ ∪ I^ga_τ and abstains otherwise. We compute group accuracies over this set of predicted examples.
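A direct finite-sample implementation of this sampling construction is sketched below (ours, with illustrative array names); averaging over many draws approximates the expectation taken next.

```python
import numpy as np

def group_agnostic_group_accuracies(correct, conf, groups, tau, rng):
    """One draw of the group-agnostic reference (Algorithm 1) at threshold tau."""
    predicted = conf >= tau
    n_correct = int((correct & predicted).sum())       # |C_tau|
    n_incorrect = int((~correct & predicted).sum())    # |I_tau|
    C0 = np.flatnonzero(correct)                       # indices correct at full coverage
    I0 = np.flatnonzero(~correct)
    keep = np.concatenate([rng.choice(C0, n_correct, replace=False),
                           rng.choice(I0, n_incorrect, replace=False)])
    accs = {}
    for g in np.unique(groups):
        kept_g = keep[groups[keep] == g]               # reference's predictions in group g
        accs[g] = correct[kept_g].mean() if kept_g.size else np.nan
    return accs
```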
Since each group g has Cg(0) correctly classified points, at threshold τ, the group-agnostic reference will make predictions on Cg(0)C(τ)/C(0) correctly classified points in group g. We can reason similarly over the incorrectly classified points. Putting it all together, we can define the group-agnostic reference as satisfying the following: Definition 4 (Group-agnostic reference). Consider a selective classifier (ˆy, ˆc) and let C, I, Cg, Ig denote the analogous quantities to Definition 3 for its matching group-agnostic reference. For each threshold τ, these satisfy C(τ) = C(τ) (15) I(τ) = I(τ), (16) and for each threshold τ and group g, Cg(τ) = Cg(0)C(τ)/C(0) (17) Ig(τ) = Ig(0)I(τ)/I(0). (18) The group-agnostic reference thus has the following accuracy on group g: Ag(τ) = Cg(τ) Cg(τ) + Ig(τ) (19) = Cg(0)C(τ)/C(0) Cg(0)C(τ)/C(0) + Ig(0)I(τ)/I(0). (20) A.2.2 CONNECTION TO EQUALIZED ODDS We now show that the group-agnostic reference satisfies equalized odds with respect to which points it predicts or abstains on. The goal of a selective classifier is to make predictions on points it would get correct (i.e., ˆy(x) = y) while abstaining on points that it would have gotten incorrect (i.e., ˆy(x) = y). We can view this as a meta classification problem, where a true positive is when the selective classifier decides to make a prediction on a point x and gets it correct (ˆy(x) = y), and a false positive is when the selective classifier decides to make a prediction on a point x and gets it incorrect (ˆy(x) = y). As such, we can define the true positive rate RTP(τ) and false positive rate RFP(τ) of a selective classifier: Definition 5. The true positive rate of a selective classifier at threshold τ is RTP(τ) = C(τ) and the false positive rate at threshold τ is RFP(τ) = I(τ) Published as a conference paper at ICLR 2021 Analogously, the true positive and false positive rates on a group g are RTP g (τ) = Cg(τ) Cg(0), (23) RFP g (τ) = Ig(τ) Ig(0). (24) The group-agnostic reference satisfies equalized odds (Hardt et al., 2016) with respect to this definition: Proposition 5. The group-agnostic reference defined in Definition 4 has equal true positive and false positive rates for all groups g G and satisfies equalized odds. Proof. By construction of the group-agnostic reference (Definition 4), we have that Cg(τ) = Cg(0)C(τ)/C(0) (25) = Cg(0) C(τ)/ C(0), (26) and therefore, for each group g, we can show that the true-positive rate of the group-agnostic reference RTP g on g is equal to its average true-positive rate RTP(τ). RTP g (τ) = Cg(τ) Cg(0) (27) = C(τ)/ C(0) (28) = RTP(τ). (29) Each group thus has the same true positive rate with the group-agnostic reference. Using similar reasoning, each group also has the same false positive rate. By the definition of equalized odds, the group-agnostic reference thus satisfies equalized odds. A.3 ROBIN HOOD REFERENCE In this section, we define the Robin Hood reference, which preferentially increases the worst-group accuracy through abstentions until it matches the other groups. Like the group-agnostic reference, we constrain the Robin Hood reference to make the same number of correct and incorrect predictions as a given selective classifier. We formalize the definition of the Robin Hood reference in Algorithm 2. The Robin Hood reference makes predictions on the subset of examples from D that has the smallest discrepency between best-group and worst-group accuracies, while still matching the number of correct and incorrect predictions of a given selective classifier. 
A.3 ROBIN HOOD REFERENCE

In this section, we define the Robin Hood reference, which preferentially increases the worst-group accuracy through abstentions until it matches the other groups. Like the group-agnostic reference, we constrain the Robin Hood reference to make the same number of correct and incorrect predictions as a given selective classifier. We formalize the Robin Hood reference in Algorithm 2: it makes predictions on the subset of examples from D that has the smallest discrepancy between best-group and worst-group accuracies, while still matching the number of correct and incorrect predictions of the given selective classifier.

Algorithm 2: Robin Hood reference at threshold τ
Input: Selective classifier (ŷ, ĉ), threshold τ, test data D
Output: The set of points P ⊆ D that the Robin Hood reference for (ŷ, ĉ) makes predictions on at threshold τ.
1. As in Algorithm 1, let C_τ and I_τ be the sets of correctly and incorrectly classified points predicted on at threshold τ, respectively (so C_0 and I_0 are the correctly and incorrectly classified points at full coverage). Define Q, the set of subsets of D that have the same number of correct and incorrect predictions as (ŷ, ĉ) at threshold τ:
   Q = {S ⊆ D : |S ∩ C_0| = |C_τ|, |S ∩ I_0| = |I_τ|}. (30)
2. Let acc_g(S) be the accuracy of ŷ on the points in S that belong to group g. Return the set P that minimizes the difference between the best-group and worst-group accuracies:
   P = argmin_{S ∈ Q} [ max_{g ∈ G} acc_g(S) − min_{g ∈ G} acc_g(S) ]. (31)

Since enumerating over all possible subsets of the test data D is intractable, we compute the accuracies of the Robin Hood reference iteratively. Starting from full coverage, we abstain on examples from lowest to highest confidence. Whenever we abstain on an incorrectly classified example, we assume it comes from the group with the lowest current accuracy; whenever we abstain on a correctly classified example, we assume it comes from the group with the highest current accuracy. This reduces disparities to the maximum extent possible as the threshold τ increases.

B SUPPLEMENTAL EXPERIMENTS

B.1 MONTE-CARLO DROPOUT

In the main text, we observed that SR selective classifiers monotonically improve average accuracy as coverage decreases, but exacerbate accuracy disparities across groups on all five datasets. To demonstrate that these observations are not specific to SR selective classifiers, we now present empirical results for another standard selective classification method: Monte-Carlo (MC) dropout (Gal & Ghahramani, 2016; Geifman & El-Yaniv, 2017). We find that MC-dropout selective classifiers exhibit empirical trends similar to those of SR selective classifiers.

MC-dropout selective classifiers. MC-dropout is an alternative way of assigning confidences to points. Taking a model with a dropout layer, the selective classifier first predicts ŷ(x) by taking the model output without dropout. To estimate the confidence, it then samples n softmax probabilities corresponding to the label ŷ(x) over the randomness of the dropout layer. The confidence is computed as ĉ(x) = 1/s, where s² is the variance of the sampled probabilities. We implement MC-dropout by using the existing dropout layers for BERT and adding dropout to the final fully-connected layer for ResNet and DenseNet, with a dropout probability of 0.1. We present empirical results for n = 10; Gal & Ghahramani (2016) observed that this was sufficient to produce good confidence estimates.

Results. The MC-dropout selective classifiers exhibit trends similar to those observed in Section 4, demonstrating that the observed empirical trends are not specific to softmax response: even though the average accuracy improves monotonically as coverage decreases across all five datasets, the worst-group accuracy tends to decrease for CelebA, fails to increase consistently for CheXpert, Waterbirds, and CivilComments, and increases consistently but slowly for MultiNLI. Comparing SR and MC-dropout selective classifiers, we observe that the MC-dropout selective classifiers perform slightly worse. For example, we see a more prominent drop in worst-group accuracy for Waterbirds and much smaller improvements in worst-group accuracy for CheXpert.
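For concreteness, here is a hedged PyTorch-style sketch of the MC-dropout confidence described above. The model is assumed to be a classifier that returns logits and contains torch.nn.Dropout modules; the small epsilon is only for numerical stability and is our own addition.

```python
import torch
import torch.nn.functional as F

def mc_dropout_confidence(model, x, n_samples=10):
    """Assign an MC-dropout confidence to each input in the batch x.

    The prediction uses the deterministic forward pass (dropout off); the
    confidence is 1/s, where s^2 is the variance of the predicted class's
    softmax probability across n_samples stochastic forward passes.
    """
    model.eval()                                  # dropout off for the prediction
    with torch.no_grad():
        probs = F.softmax(model(x), dim=-1)
        y_hat = probs.argmax(dim=-1)              # predicted labels

        # Turn dropout (and only dropout) back on for the stochastic passes.
        for m in model.modules():
            if isinstance(m, torch.nn.Dropout):
                m.train()

        samples = []
        for _ in range(n_samples):
            p = F.softmax(model(x), dim=-1)
            samples.append(p.gather(1, y_hat.unsqueeze(1)).squeeze(1))
        samples = torch.stack(samples, dim=0)     # shape: (n_samples, batch)

    model.eval()                                  # restore deterministic mode
    s = samples.std(dim=0)                        # std of the sampled probabilities
    confidence = 1.0 / (s + 1e-12)                # higher value = more confident
    return y_hat, confidence
```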
Figure 7: Selective accuracy (top) and coverage (bottom) for each group, as a function of the average coverage, for the MC-dropout selective classifier. Each average coverage corresponds to a threshold τ. The red lines represent the worst group. (Panels: CelebA, Waterbirds, CheXpert-device, CivilComments, MultiNLI; y-axes: group selective accuracy and group coverage; x-axis: average coverage.)

Figure 8: Selective accuracy (top) and coverage (bottom) for each group, as a function of the average coverage, for the softmax response selective classifier with a model optimized with group DRO. Each average coverage corresponds to a specific threshold τ. The red lines represent the worst group (i.e., the one with the lowest accuracy at full coverage), and the gray lines represent the other groups.

B.2 GROUP DRO

We showed in Section 7 that SR selective classifiers trained with the group DRO objective successfully improve worst-group selective accuracies as coverage decreases, and perform comparably to the group-agnostic reference. We now present additional empirical results for these selective classifiers.

Accuracy-coverage curves. We first present the accuracy-coverage curves for all groups, along with the group coverage trends, in Figure 8. Group selective accuracies tend to improve monotonically, in stark contrast with our results on standard (ERM) selective classifiers. These trends hold generally across groups and datasets, with a few exceptions: selective accuracies drop on two groups in MultiNLI at roughly 20% average coverage, and selective accuracies improve slowly on two groups in CivilComments. Below, we look into these anomalies and offer potential explanations. For MultiNLI, we first note that the drops occur at very low group coverages (lower than 1%), at which point the accuracies are computed on very few examples and are thus noisy. In addition, much of the anomaly can be explained by label noise: we manually inspect the 20 examples from the above two groups with the highest softmax confidences and find that 17 of them are labeled incorrectly. The observations on CivilComments can also potentially be attributed to label noise; removing examples with high inter-annotator disagreement (with the fraction of toxic annotations between 0.5 and 0.6) yields an accuracy-coverage curve that fits the broader empirical trends.
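The curves in Figures 7 and 8 can be produced from the margins alone; the following is a rough sketch (ours, not the paper's code) of that computation, with group coverage taken to be the predicted fraction within each group.

```python
import numpy as np

def accuracy_coverage_curves(margins, groups):
    """Sweep the confidence threshold and record, at each average coverage, the
    selective accuracy and coverage of every group.

    margins: signed margins (positive = correct prediction, negative = incorrect).
    groups:  group id per example.
    Returns: dict mapping group id -> (average_coverage, group_accuracy, group_coverage).
    """
    margins = np.asarray(margins, dtype=float)
    groups = np.asarray(groups)
    taus = np.unique(np.abs(margins))        # sweep over observed confidence values
    curves = {}
    for g in np.unique(groups):
        mask = groups == g
        avg_cov, g_acc, g_cov = [], [], []
        for tau in taus:
            predicted = np.abs(margins) >= tau          # points the classifier predicts on
            avg_cov.append(predicted.mean())
            pred_g = predicted & mask
            g_cov.append(pred_g.sum() / mask.sum())
            if pred_g.any():
                g_acc.append((margins[pred_g] > 0).mean())  # selective accuracy on group g
            else:
                g_acc.append(np.nan)                        # group fully abstained on
        curves[g] = (np.array(avg_cov), np.array(g_acc), np.array(g_cov))
    return curves
```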
C EXPERIMENT DETAILS

C.1 DATASETS

CelebA. Models have been shown to latch onto spurious correlations between labels and demographic attributes such as race and gender (Buolamwini & Gebru, 2018; Joshi et al., 2018), and we study this on the CelebA dataset (Liu et al., 2015). Following Sagawa et al. (2020), we consider the task of classifying hair color, which is spuriously correlated with gender. Concretely, inputs are celebrity face images, labels are hair color, Y = {blond, non-blond}, and the spurious attribute is gender, A = {male, female}, with blondness associated with being female. Of the four groups, blond males are the smallest group, with only 1,387 of the 162,770 training examples, and they tend to be the worst group empirically. We use the official train-val-test split of the dataset.

Waterbirds. Object recognition models are prone to using image backgrounds as a proxy for the label (Ribeiro et al., 2016; Xiao et al., 2020). We study this on the Waterbirds dataset (Sagawa et al., 2020), constructed by placing images of birds from the Caltech-UCSD Birds dataset (Wah et al., 2011) on backgrounds from the Places dataset (Zhou et al., 2017). The task is to classify a photograph of a bird as one of Y = {waterbird, landbird}, and the label is spuriously correlated with the background, A = {water background, land background}. Of the four groups, waterbirds on land backgrounds make up the smallest group, with only 56 of the 4,795 training examples, and they tend to be the worst group empirically. We use the train-val-test split provided by Sagawa et al. (2020), and also follow their protocol for computing average metrics: to compute average accuracies and coverages, we first compute the metrics for each group and then take a weighted average according to the group proportions in the training set, in order to account for the discrepancy in group proportions across the splits.

CheXpert-device. Models can latch onto spurious correlations even in high-stakes applications such as medical imaging. When models are trained to classify from chest X-rays whether a patient has certain pathologies, they have been shown to instead spuriously detect the presence of a support device, in particular a chest drain (Oakden-Rayner et al., 2020). We study this phenomenon in a modified version of the CheXpert dataset (Irvin et al., 2019), which we call CheXpert-device. Concretely, the inputs are chest X-rays, the labels are Y = {pleural effusion, no pleural effusion}, and the spurious attribute indicates the presence of a support device, A = {support device, no support device}. We note that a chest drain is one type of support device, and is used to treat suspected pleural effusion (Porcel, 2018). CheXpert-device is a subsampled version of the full CheXpert dataset that manifests the spurious correlation more strongly. To create CheXpert-device, we first create a new 80/10/10 train/val/test split of examples from the publicly available CheXpert train and validation sets, randomly assigning patients to splits so that all X-rays of the same patient fall in the same split. We then subsample the training set; in particular, we enforce that in 90% of examples, the support device label matches the pleural effusion label. Of the four groups, cases of pleural effusion without a support device make up the smallest group, with 5,467 of the 112,100 training examples, and they tend to be the worst group empirically. To compute average accuracies and coverages, we weight groups according to group proportions in the training set, as for Waterbirds. Another complication with CheXpert is that some patients can have multiple X-rays from one visit. Following Irvin et al. (2019), we treat these images as separate examples at training time, but output one prediction per patient-study pair at evaluation time. Concretely, we predict pleural effusion if the model detects the condition in any of the X-ray images belonging to the patient-study pair, as pathologies may appear clearly only in some X-rays.
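As an illustration of the patient-study aggregation described above, here is a small pandas sketch; the column names and the 0.5 decision threshold are illustrative assumptions rather than the paper's implementation.

```python
import pandas as pd

def aggregate_patient_study_predictions(df):
    """Aggregate per-image outputs into one prediction per patient-study pair.

    df is assumed to have columns:
      'patient', 'study'  - identifiers of the patient-study pair,
      'prob_effusion'     - the model's per-image probability of pleural effusion.
    Following the protocol above, a pair is predicted positive if pleural effusion
    is detected in any of its X-ray images; we keep the maximum probability so a
    confidence threshold can still be applied downstream.
    """
    grouped = df.groupby(["patient", "study"])["prob_effusion"].max().reset_index()
    grouped["pred_effusion"] = grouped["prob_effusion"] >= 0.5   # illustrative threshold
    return grouped
```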
CivilComments. In toxic comment detection, models have been shown to latch onto spurious correlations between toxicity and mentions of certain demographic groups (Park et al., 2018; Dixon et al., 2018). We study this in the CivilComments dataset (Borkan et al., 2019). The task is to classify the toxicity of comments on online articles, with labels Y = {toxic, non-toxic}. As the spurious attribute, we consider whether each comment mentions a Christian identity, A = {mention of Christian identity, no mention of Christian identity}; non-toxicity is associated with the mention of a Christian identity, often resulting in a high false negative rate on comments with such mentions. Of the four groups, toxic comments with a mention of Christian identity make up the smallest group, with only 2,446 of the 269,038 training examples, and they tend to be the worst group empirically. We use the train, validation, and test sets from the WILDS benchmark (Koh et al., 2020), which further splits the original development set into a training set and a validation set by randomly splitting on articles and associating all comments on each article with the same split. The training, validation, and test sets thus comprise comments from disjoint sets of articles. The original dataset also contains many additional examples that have toxicity annotations but not identity annotations; we do not use these in our experiments. In the original CivilComments dataset, each comment is given a probabilistic label for both the toxicity and the mention of a Christian identity, where a probabilistic label is the average of binary labels across annotators. Following the associated Kaggle competition (www.kaggle.com/c/jigsaw-unintended-bias-in-toxicity-classification), we use binarized labels obtained by thresholding the probabilistic labels at 0.5.

MultiNLI. Lastly, we consider natural language inference (NLI), where the task is to predict whether a hypothesis is entailed by, contradicted by, or neutral to an associated premise, Y = {entailed, contradictory, neutral}. NLI models have been shown to exploit annotation artifacts, for example predicting contradictory whenever negation words such as "never" or "nobody" are present (Gururangan et al., 2018). We study this on the MultiNLI dataset (Williams et al., 2018). To annotate examples with spurious attributes A = {negation words, no negation words}, we use the following negation words, following Gururangan et al. (2018): "nobody", "no", "never", and "nothing". We use the splits of Sagawa et al. (2020), whose training set includes 206,175 examples, with 1,521 examples from the smallest group (entailment with negations). For standard models, the worst group tends to be neutral examples with negation words.
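For illustration, a minimal sketch of the negation-word annotation for MultiNLI; we assume, in line with the artifact analysis of Gururangan et al. (2018), that the attribute is computed from the hypothesis, and the simple regex tokenization and group indexing are our own choices.

```python
import re

NEGATION_WORDS = {"nobody", "no", "never", "nothing"}

def has_negation(hypothesis: str) -> bool:
    """Return True if the hypothesis contains any of the negation words
    used to define the spurious attribute for MultiNLI."""
    tokens = re.findall(r"[a-z']+", hypothesis.lower())
    return any(tok in NEGATION_WORDS for tok in tokens)

def group_id(label: int, hypothesis: str) -> int:
    """Map (label, spurious attribute) to one of 2 * n_labels groups,
    e.g. group = label * 2 + negation (labels 0/1/2 give six groups)."""
    return label * 2 + int(has_negation(hypothesis))
```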
We train ResNet (He et al., 2016) for CelebA and Waterbirds (images), DenseNet (Huang et al., 2017) for CheXpert (X-rays), and BERT (Devlin et al., 2019) for CivilComments and MultiNLI (text). For the tasks studied in Sagawa et al. (2020) (CelebA, Waterbirds, MultiNLI), we use the hyperparameters from Sagawa et al. (2020). For the others (CivilComments and CheXpert-device), we test the same number of hyperparameter sets for each of ERM and DRO, and report the best set below. Across all image and X-ray tasks, inputs are downsampled to a resolution of 224 x 224.

CelebA. To train a model on CelebA, we initialize with a pretrained ResNet-50. For ERM we optimize with learning rate 1e-4, weight decay 1e-4, batch size 128, and train for 50 epochs. For DRO we use learning rate 1e-5, weight decay 1e-1, and generalization adjustment 1 (described in Sagawa et al. (2020)); the batch size is 128, and we train for 50 epochs.

Waterbirds. For Waterbirds, as with CelebA, we initialize with a pretrained ResNet-50. For ERM we use learning rate 1e-3, weight decay 1e-4, batch size 128, and train for 300 epochs. For DRO we use learning rate 1e-5, weight decay 1, generalization adjustment 1, batch size 128, and train for 300 epochs.

CheXpert-device. For CheXpert-device, we fine-tune a pretrained DenseNet-121 for three epochs. For ERM we use learning rate 1e-3, no weight decay, and batch size 16, and choose the model (out of the first three epochs) with the highest average accuracy (epoch 2). For DRO, we use learning rate 1e-4, weight decay 1e-1, and batch size 16, and choose the model with the highest worst-group accuracy (epoch 1).

CivilComments. To train a model for CivilComments, we fine-tune bert-base-uncased using the implementation from Wolf et al. (2019). For both ERM and DRO we use learning rate 1e-5, weight decay 1e-2, and batch size 16. We train the ERM and DRO models for three epochs (with early stopping), then choose the model with the highest average accuracy for ERM (epoch 2) and the highest worst-group accuracy for DRO (epoch 1).

MultiNLI. For MultiNLI we again fine-tune bert-base-uncased using the implementation from Wolf et al. (2019). For ERM, we fine-tune for three epochs with learning rate 2e-5, weight decay 0, and batch size 32. For DRO, we also use learning rate 2e-5, weight decay 0, and batch size 32, but with generalization adjustment 1. For both ERM and DRO, the model after the third epoch is best, in terms of average accuracy for ERM and worst-group accuracy for DRO.

D PROOFS: MARGIN DISTRIBUTIONS AND ACCURACY-COVERAGE CURVES

D.1 LEFT-LOG-CONCAVITY

Recall the definition of left-log-concave from Section 5:

Definition 1 (Left-log-concave distributions). A distribution is left-log-concave if its CDF is log-concave on (−∞, µ], where µ is the mean of the distribution.

We first prove that a symmetric mixture of Gaussians is left-log-concave, but not necessarily log-concave.

D.1.1 SYMMETRIC MIXTURES OF GAUSSIANS ARE LEFT-LOG-CONCAVE

Lemma 1 (Symmetric mixtures of two Gaussians are left-log-concave). Consider a symmetric mixture of Gaussians with density f = 0.5 f_µ + 0.5 f_{−µ}, where f_µ is the density of a N(µ, σ²) random variable and likewise f_{−µ} is the density of a N(−µ, σ²) random variable. Then the mixture is left-log-concave for all values of µ ∈ R and σ > 0, but is log-concave only if |µ| ≤ σ.

Proof. Without loss of generality, we can take σ = 1, since (left-)log-concavity is invariant to scaling, and also assume that µ is positive. First consider the case where µ ≤ 1. Then the mixture is log-concave, and therefore left-log-concave (Cule et al., 2010). Now consider the case where µ > 1. Cule et al. (2010) show that the mixture is no longer log-concave. However, we claim that it is still left-log-concave. We start by studying the gradient of log f:

$\frac{d}{dx}\log f(x) = \frac{f'(x)}{f(x)} = \frac{-(x-\mu)\,e^{-(x-\mu)^2/2} - (x+\mu)\,e^{-(x+\mu)^2/2}}{e^{-(x-\mu)^2/2} + e^{-(x+\mu)^2/2}} = -x + \mu\,\frac{1 - e^{-2x\mu}}{1 + e^{-2x\mu}}.$

We claim that f'(x)/f(x) has a local minimum at some x = −a < 0 and, as it is an odd function, a corresponding local maximum at x = a > 0. To show this, we first differentiate to obtain

$\frac{d}{dx}\,\frac{f'(x)}{f(x)} = -1 + \frac{4\mu^2 e^{-2x\mu}}{(1 + e^{-2x\mu})^2}.$ (34)

Setting the derivative to 0 gives us the quadratic equation

$0 = -1 + \frac{4\mu^2 e^{-2x\mu}}{(1 + e^{-2x\mu})^2}$ (35)
$(1 + e^{-2x\mu})^2 = 4\mu^2 e^{-2x\mu}$ (36)
$\big(e^{-2x\mu}\big)^2 + (2 - 4\mu^2)\,e^{-2x\mu} + 1 = 0$ (37)
$e^{-2x\mu} = 2\mu^2 - 1 \pm 2\mu\sqrt{\mu^2 - 1}.$ (38)

Since µ > 1, there are two distinct roots of this quadratic, and two corresponding critical points of f'/f. Let $v(\mu) = 2\mu^2 - 1 + 2\mu\sqrt{\mu^2 - 1}$ be the larger root. Then v(µ) is a strictly increasing function for µ ≥ 1, and since v(1) = 1, we have that v(µ) > 1 for all µ > 1. Let x = −a satisfy exp(2aµ) = v(µ). Then −a, the smaller of the two critical points of f'/f, is

$-a = -\frac{\log v(\mu)}{2\mu} = -\frac{\log\big(2\mu^2 - 1 + 2\mu\sqrt{\mu^2 - 1}\big)}{2\mu}.$ (40)

To show that f'(−a)/f(−a) is a local minimum, we take the second derivative,

$\frac{d}{dx}\left(-1 + \frac{4\mu^2 e^{-2x\mu}}{(1 + e^{-2x\mu})^2}\right) = \frac{8\mu^3\, e^{-2x\mu}\,\big(e^{-2x\mu} - 1\big)}{\big(e^{-2x\mu} + 1\big)^3},$ (43)

which at x = −a gives

$\frac{d^2}{dx^2}\,\frac{f'(x)}{f(x)}\Big|_{x=-a} = \frac{8\mu^3\, v(\mu)\,\big(v(\mu) - 1\big)}{\big(v(\mu) + 1\big)^3} > 0$ (44)

since v(µ) > 1. Since −a is the only critical point of f'/f that is less than 0 and it is a local minimum, f'/f must be decreasing on (−∞, −a], which in turn implies that f, and therefore F, is log-concave on (−∞, −a].

It remains to show that F is also log-concave on [−a, 0]. We make use of two facts. First, since −a is a local minimum and the only critical point less than 0, we have that f'(−a)/f(−a) ≤ f'(x)/f(x) ≤ f'(0)/f(0) = 0 for all x ∈ [−a, 0]; in particular, f'(x) ≤ 0 on this interval. Second, since f(x) and F(x) are non-negative for all x, f(x)/F(x) is also non-negative for all x. Thus, for all x ∈ [−a, 0],

$\frac{d}{dx}\,\frac{f(x)}{F(x)} = \frac{F(x)\,f'(x) - f(x)^2}{F(x)^2} \le 0,$

and therefore F is also log-concave on [−a, 0].
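Lemma 1 can be sanity-checked numerically. The following SciPy snippet (ours, not from the paper) verifies on a grid that, for µ = 2 > σ = 1, log F is concave on (−∞, 0] (the mixture's mean is 0) while log f is not concave everywhere; the grid and tolerance are arbitrary choices.

```python
import numpy as np
from scipy.stats import norm

def log_cdf_mixture(x, mu, sigma=1.0):
    """log F(x) for the symmetric mixture 0.5*N(mu, sigma^2) + 0.5*N(-mu, sigma^2)."""
    return np.log(0.5 * norm.cdf(x, loc=mu, scale=sigma)
                  + 0.5 * norm.cdf(x, loc=-mu, scale=sigma))

def is_concave(values, tol=1e-8):
    """Discrete concavity check: all second differences should be <= tol."""
    return bool(np.all(np.diff(values, n=2) <= tol))

mu = 2.0                                  # |mu| > sigma, so the mixture is not log-concave
xs_left = np.linspace(-8.0, 0.0, 2000)    # left of the mixture mean (which is 0)
xs_all = np.linspace(-8.0, 8.0, 4000)

log_pdf_all = np.log(0.5 * norm.pdf(xs_all, loc=mu) + 0.5 * norm.pdf(xs_all, loc=-mu))
print("log F concave on (-inf, 0]:", is_concave(log_cdf_mixture(xs_left, mu)))  # expect True
print("log f concave everywhere:  ", is_concave(log_pdf_all))                   # expect False
```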
Remark 1. Note that if f is (left-)log-concave, then F is also (left-)log-concave (Bagnoli & Bergstrom, 2005). However, the reverse direction does not hold.

D.2 SYMMETRIC MARGIN DISTRIBUTIONS

In this section we prove Proposition 1. We first prove the following helpful lemma:

Lemma 2 (Conditions for monotonicity of selective accuracy). A_F(τ) is monotone increasing in τ if and only if

$\frac{f(-\tau)}{F(-\tau)} \ge \frac{f(\tau)}{1 - F(\tau)}$ (49)

for all τ ≥ 0. Conversely, A_F(τ) is monotone decreasing in τ if and only if the above inequality is flipped for all τ ≥ 0.

Proof. A_F(τ) is monotone increasing in τ if and only if dA_F/dτ ≥ 0 for all τ ≥ 0. We obtain dA_F/dτ by differentiating A_F(τ) = (1 − F(τ)) / (1 − F(τ) + F(−τ)) and simplifying:

$\frac{dA_F}{d\tau} = \frac{-f(\tau)\big(1 - F(\tau) + F(-\tau)\big) + \big(1 - F(\tau)\big)\big(f(\tau) + f(-\tau)\big)}{\big(1 - F(\tau) + F(-\tau)\big)^2}$ (51)
$= \frac{-f(\tau) + f(\tau)F(\tau) - f(\tau)F(-\tau) + f(\tau) + f(-\tau) - f(\tau)F(\tau) - f(-\tau)F(\tau)}{\big(1 - F(\tau) + F(-\tau)\big)^2}$
$= \frac{f(-\tau) - f(\tau)F(-\tau) - f(-\tau)F(\tau)}{\big(1 - F(\tau) + F(-\tau)\big)^2}$ (52)
$= \frac{f(-\tau)\big(1 - F(\tau)\big) - f(\tau)F(-\tau)}{\big(1 - F(\tau) + F(-\tau)\big)^2}.$ (53)

Since the denominator is always positive, we have dA_F/dτ ≥ 0 if and only if the numerator satisfies f(−τ)(1 − F(τ)) − f(τ)F(−τ) ≥ 0, which in turn is equivalent to

$\frac{f(-\tau)}{F(-\tau)} \ge \frac{f(\tau)}{1 - F(\tau)},$ (54)

as desired. The case of monotone decreasing A_F(τ) is analogous.

In the next two lemmas, we prove the two directions of Proposition 1, respectively.

Lemma 3 (Left-log-concavity and symmetry imply monotonicity). Let f be symmetric about µ and let F be left-log-concave. If A_F(0) ≥ 0.5, then A_F(τ) is monotone increasing. Conversely, if A_F(0) ≤ 0.5, then A_F(τ) is monotone decreasing.

Proof. Consider the case where A_F(0) ≥ 0.5; the case where A_F(0) ≤ 0.5 is analogous. From Lemma 2 and the symmetry of f, we have that A_F(τ) is monotone increasing if (and only if)

$\frac{f(-\tau)}{F(-\tau)} \ge \frac{f(\tau)}{1 - F(\tau)} = \frac{f(2\mu - \tau)}{F(2\mu - \tau)}$ (55)

holds for all τ ≥ 0.
To show that this inequality holds for all τ 0, we first note that since AF (0) 0.5, we have that F(0) 0.5, which together with the symmetry of F implies that µ 0. Thus, τ 2µ τ for all τ 0. Now, if τ µ, then 2µ τ µ, so the desired inequality (55) follows from the log-concavity of F on ( , µ] (Remark 1 of Bagnoli & Bergstrom (2005).) If instead 0 τ µ, we have f( τ) F( τ) f(τ) F(τ) [by log-concavity of F on ( , µ]] (56) f(τ) 1 F(τ) [since F(τ) 0.5] (57) and thus (55) also holds. Lemma 4 (Monotonicity and symmetry imply left-log-concavity.). Let f be symmetric with mean 0, and let fµ be a translated version of f that has mean µ. If Afµ is monotone increasing for all µ 0, then f is left-log-concave. Proof. First consider µ 0. Since Afµ is monotone increasing for all µ 0, from Lemma 2 and the symmetry of fµ, fµ( τ) Fµ( τ) fµ(τ) 1 Fµ(τ) = fµ(2µ τ) Fµ(2µ τ) (58) must hold for all τ 0. Since fµ(x) = f(x µ) by construction, we can equivalently write f( µ τ) F( µ τ) f(µ τ) F(µ τ), (59) which holds for all τ 0 as well, and by assumption, for all µ 0 as well. By letting a = µ τ and b = µ τ, we can equivalently write the above inequality as f(a) F(a) f(b) for all a, b R where a b, a + b 0. By the definition of log-concavity, this means that F is log-concave on ( , 0]. D.2.1 LEFT-LOG-CONCAVITY AND MONOTONICITY. Proposition 1 (Left-log-concavity and monotonicity). Let F be the CDF of a symmetric distribution. If F is left-log-concave, then AF (τ) is monotonically increasing in τ if AF (0) 1/2 and monotonically decreasing otherwise. Conversely, if AFd(τ) is monotonically increasing for all translations Fd such that Fd(τ) = F(τ d) for all τ and AFd(0) 1/2, then F is left-log-concave. Proof. If F is left-log-concave, by Lemma 3 AF (τ) is monotonically increasing in τ if AFµ(0) 1 2 and monotonically decreasing otherwise. Conversely, if AF is monotonically increasing for all translations Fd such that Fd(τ) = F(τ d) for all τ by Lemma 4, F is log-concave, completing the proof. Published as a conference paper at ICLR 2021 D.3 SKEW-SYMMETRIC MARGIN DISTRIBUTIONS We now turn to how skew affects selective classification. Recall in Section 5 we defined skewsymmetric distributions as follows: Definition 2. A distribution with density fα,µ is skew-symmetric with skew α and center µ if fα,µ(τ) = 2h(τ µ)G(α(τ µ)) (4) for all τ R, where h is the density of a distribution is symmetric about 0, and G is the CDF of a potentially different distribution that is also symmetric about 0. When α = 0, f = h and there is no skew. Increasing α increases rightward skew, and decreasing α increases leftward skew. Note that in general, the mean of f depends on α as well. We will also consider the translated distribution: fα,µ(x) = fα(x µ). (61) The CDF of fα,µ is Fα,µ and the CDF of fα is Fα. We will use the following properties of these distributions: Lemma 5 (Skew-symmetry about µ). These properties hold when we flip the skew from α to α: fα,µ(x) = f α,µ(2µ x) (62) Fα,µ(x) = 1 F α,µ(2µ x) (63) fα,µ(x) = 2h(x µ)G(α(x µ)) (64) = 2h(µ x)[G(( α)(µ x))] (65) = f α,µ(2µ x). (66) Fα,µ(x) = Z x fα,µ(t)dt (67) f α,µ(2c t)dt (68) 2c x f α,µ(t)dt (69) = 1 F α,µ(2c x). (70) Lemma 6 (Stochastic ordering with α [Proposition 4 from Azzalini & Regoli (2012)]). Let α1 α2. Then, for any µ R, Fα1,µ Fα2,µ. Proof. Since the ordering is invariant to translation, without loss of generality we can take µ = 0. First consider x 0. Then Fα1(x) = Z x 2h(t)G(α1t)dt (71) 2h(t)G(α2t)dt (72) = Fα2(x). 
(73) Published as a conference paper at ICLR 2021 Now consider x 0. We have Fα1(x) = Z x 2h(t)G(α1t)dt (74) 2h(t)(1 G( α1t))dt [by symmetry of G] (75) = 2H(x) Z x 2h(t)G( α1t)dt (76) = 2H(x) Z x 2h( t)G( α1t)dt [by symmetry of h] (77) x 2h(t)G(α1t)dt [change of variables] (78) = 2H(x) 1 + Fα1( x), (79) which reduces to the case where x 0. Lemma 7 (Log gradient ordering by skew). For all α 0, τ 0, and µ R, fα,µ(τ) Fα,µ(τ) f0,µ(τ) F0,µ(τ) f α,µ(τ) F α,µ(τ). (80) Proof. Since the ordering is invariant to translation, without loss of generality we can take µ = 0. We have: fα(τ) Fα(τ) = h(τ)G(ατ) R τ h(t)G(αt)dt (81) = h(τ) R τ h(t) G(αt) G(ατ)dt (82) h(τ) R τ h(t)dt = f0(τ) h(τ) R τ h(t) G( αt) G( ατ)dt (84) = h(τ)G( ατ) R τ h(t)G( αt)dt (85) F α(τ). (86) To get the inequalities, note that when α 0 and t τ, we have α 0, G(αt)/G(ατ) 1 since G is an increasing function. Similarly, G( αt)/G( ατ) 1. Since h is non-negative, the inequalities then follow from the monotonicity of the integral. We now prove the main results of this section. D.3.1 ACCURACY IS MONOTONE WITH SKEW Proposition 6 (Accuracy is monotone with skew). Let Fα,µ be the CDF of a skew-symmetric distribution. For all τ 0 and µ R, AFα,µ(τ) is monotonically increasing in α. Proof. We use the skew-symmetry of fα,µ to write the selective accuracy as Afα,µ(τ) = 1 1 + Fα,µ( τ) 1 Fα,µ(τ) (87) 1 + Fα,µ( τ) Fα,µ(2µ τ) . (88) Published as a conference paper at ICLR 2021 We see that Afα,µ(τ) is a monotone decreasing function of Fα,µ( τ) F α,µ(2µ τ). From Lemma 6, the numerator decreases with increasing α while the denominator increases. Thus, this fraction decreases, which in turn implies that Afα,µ(τ) is monotone increasing with α as desired. D.3.2 SKEW IN THE SAME DIRECTION PRESERVES MONOTONICITY Proposition 2 (Skew in the same direction preserves monotonicity). Let Fα,µ be the CDF of a skewsymmetric distribution. If accuracy of its symmetric version, AF0,µ(τ), is monotonically increasing in τ, then AFα,µ(τ) is also monotonically increasing in τ for any α > 0. Similarly, if AF0,µ(τ) is monotonically decreasing in τ, then AFα,µ(τ) is also monotonically decreasing in τ for any α < 0. Proof. The idea is to use Lemma 7 to reduce the statement to the case where α = 0, so that we can apply monotonicity. First consider the case where AF0,µ(τ) is monotonically increasing in τ, and α > 0. We have fα,µ( τ) Fα,µ( τ) f0,µ( τ) F0,µ( τ) [Lemma 7] (89) Hµ( τ) [definition of f] (90) hµ(τ) 1 Hµ(τ) [from monotonicity of A (Lemma 2)] (91) Hµ(2µ τ) [symmetry of h] (92) = f0,µ(2µ τ) F0,µ(2µ τ) [definition of f] (93) f α,µ(2µ τ) F α,µ(2µ τ) [Lemma 7] (94) fα,µ(τ) 1 Fα,µ(τ). [Lemma 5] (95) Applying Lemma 2 completes the proof. The case where α < 0 is analogous. D.4 MONOTONE ODD TRANSFORMATIONS PRESERVE ACCURACY-COVERAGE CURVES. We now show that our results are unchanged by strictly monotone and odd transformations to the density. Lemma 8 (Odd and strictly monotonically increasing transformations preserve selective accuracy). Let X be a real-valued random variable with CDF FX, T be a strictly monotonically increasing function, and define random variable Y = T(X) with CDF FY . Then, AFX(τ) = AFY (T(τ)) for each τ. Proof. For each τ, since T is a strictly monotonically increasing function, we apply the change of variables formula for transformations of univariate random variables to get the following: FX(τ) = FY (T(τ)) (96) FX( τ) = FY (T( τ)) = FY ( T(τ)). 
[T is odd] (97) We now solve for selective accuracy: AFX(τ) = 1 FX(τ) 1 FX(τ) + FX( τ) (98) = 1 FY (T(τ)) 1 FY (T(τ)) + FY ( T(τ)) (99) = AFY (T(τ)). (100) Published as a conference paper at ICLR 2021 With Lemma 8, our results extend all random variables T(X) where X corresponds to a left-logconcave distribution and T is odd and strictly monotonically increasing. In particular, X and T(X) have the same accuracy-coverage curves. E PROOFS: COMPARISON TO THE GROUP-AGNOSTIC REFERENCE In this section, we present the proofs from Section 6, which outline conditions under which selective classifiers outperform the group-agnostic reference. E.1 DEFINITIONS We first define certain metrics on selective classifiers and their group-agnostic reference in terms of their margin distributions. Definition 6. Consider a margin distribution with CDF g G pg Fg, (101) where pg is the mixture weight and Fg is the CDF for group g. We write the fraction of examples that are correct/incorrect and predicted on at threshold τ from group g and on average as follows: Cg(τ) = 1 Fg(τ) (102) Ig(τ) = Fg( τ) (103) g pg Cg(τ) (104) g pg Ig(τ) (105) We write the true-positive rate RTP(τ) and false-positive rate RFP(τ) for a given threshold τ as RTP(τ) = C(τ) C(0), (106) RFP(τ) = I(τ) I(0). (107) Finally, we write the accuracy of the group-agnostic reference on group g as AFg(τ) = AFg(0)RTP(τ) AFg(0)RTP(τ) + (1 AFg(0))RFP(τ) (108) = Cg(0)C(τ)/C(0) Cg(0)C(τ)/C(0) + Ig(0)I(τ)/I(0) (109) For convenience in our proofs below, we also define the fraction of each group g in correctly and incorrectly classified predictions at each threshold τ. Definition 7. We define the fraction of group g out of correctly and incorrectly classified predictions at threshold τ, CFg(τ) and IFg(τ) respectively, as CFg(τ) = pg Cg(τ) IFg(τ) = pg Ig(τ) I(τ) . (111) E.2 GENERAL NECESSARY CONDITIONS TO OUTPERFORM THE GROUP-AGNOSTIC REFERENCE We now present the proofs for Proposition 3 and Corollary 1, starting with the supporting lemmas. We first write the accuracy of the group-agnostic reference in terms of IFwg(0) and CFwg(0). Published as a conference paper at ICLR 2021 AFwg(τ) = 1 1 + IFwg(0) CFwg(0) I(τ) C(τ) . (112) Proof. We have: AFwg(τ) = AFwg(0)RTP(τ) AFwg(0)RTP(τ) + (1 AFwg(0))RFP(τ) (113) = p Cwg(0)RTP(τ) p Cwg(0)RTP(τ) + p Iwg(0)RFP(τ) (114) = p Cwg(0) C(τ) C(0) p Cwg(0) C(τ) C(0) + p Iwg(0) I(τ) = CFwg(0)C(τ) CFwg(0)C(τ) + IFwg(0)I(τ) (116) 1 + IFwg(0) CFwg(0) I(τ) C(τ) . (117) Lemma 10 (Bounding the derivative of 1/CFwg(τ) at τ = 0). If AFothers(0) AFwg(0), Fwg(0) 1 Fwg(0) fwg(0)Fothers(0) fothers(0)Fwg(0) p Cwg(τ) + (1 p)Cothers(τ) 1 Fothers(τ) 1 Fothers(τ) fwg(τ)(1 Fothers(τ)) fothers(τ)(1 Fwg(τ)) (1 Fwg(τ))2 p 1 1 Fwg(τ) fwg(τ) 1 Fothers(τ) fothers(τ) τ=0 (124) p 1 1 Fwg(0) fwg(0) 1 Fothers(0) fothers(0) (125) p 1 1 Fwg(0) fwg(0)Fothers(0) Fwg(0) fothers(0) [0 < Fothers(0) Fwg(0) < 1] p 1 1 Fwg(0) fwg(0)Fothers(0) fothers(0)Fwg(0) Fwg(0) (127) p Fwg(0) 1 Fwg(0) fwg(0)Fothers(0) fothers(0)Fwg(0) Fwg(0)2 (128) Published as a conference paper at ICLR 2021 Lemma 11. If AFothers(0) AFwg(0) > 0.5, then d dτ IFwg(τ) CFwg(τ) τ=0 C(fothers(0)Fwg(0) fwg(0)Fothers(0)) (129) for some positive constant C. 
d dτ IFwg(τ) CFwg(τ) dτ IFwg(τ) τ=0 1 CFwg(0) + IFwg(0) d dτ 1 CFwg(τ) dτ IFwg(τ) τ=0 1 CFwg(0) (132) + IFwg(0) 1 p Fwg(0) 1 Fwg(0) fwg(0)Fothers(0) fothers(0)Fwg(0) p Fothers(0) fothers(0)Fwg(0) fwg(0)Fothers(0) 1 Fothers(0) p Fothers(0) Fwg(0) 1 Fwg(0) fwg(0)Fothers(0) fothers(0)Fwg(0) p Fothers(0) (fothers(0)Fwg(0) fwg(0)Fothers(0)) 1 Fwg(0)2 p 1 Fothers(0) 1 Fwg(0) 1 + 1 p p Fothers(0) Fwg(0) | {z } 1 because AFothers(0) AFwg (0) Fwg(0) 1 Fwg(0) | {z } <1 because AFwg (0)>0.5 = C(fothers(0)Fwg(0) fwg(0)Fothers(0)) (137) where (133) follows from Lemma 10. E.2.1 NECESSARY CONDITION FOR OUTPERFORMING THE GROUP-AGNOSTIC REFERENCE Proposition 3 (Necessary condition for outperforming the group-agnostic reference). Assume that 1/2 < AFwg(0) < AFothers(0) < 1 and the worst-group density fwg(0) > 0. If AFwg(τ) AFwg(τ) for all τ 0, then fwg(0) 1 AFothers(0) 1 AFwg(0) . (5) Proof. Recall that AFwg(τ) = 1 1 + IFwg(τ) CFwg(τ) I(τ) AFwg(τ) = 1 1 + IFwg(0) CFwg(0) I(τ) C(τ) . (139) Published as a conference paper at ICLR 2021 If AFwg(τ) AFwg(τ) for all τ 0, then d dτ IFwg(τ) CFwg(τ) τ=0 0. (140) From Lemma 11, C(fothers(0)Fwg(0) fwg(0)Fothers(0)) d dτ IFwg(τ) CFwg(τ) for some positive constant C. Combined, fothers(0)Fwg(0) fwg(0)Fothers(0) 0. (142) E.2.2 OUTPERFORMING THE GROUP-AGNOSTIC REFERENCE REQUIRES SMALLER SCALING FOR LOG-CONCAVE DISTRIBUTIONS Corollary 1 (Outperforming the group-agnostic reference requires smaller scaling for log-concave distributions). Assume that 1/2 < AFwg(0) < AFothers(0) < 1, Fwg is log-concave, and fothers(τ) = vfwg(v(τ µothers) + µwg) for all τ R, where v is a scaling factor. If AFwg(τ) AFwg(τ) for all τ 0, v < 1. Proof. By Proposition 3, if AFwg(τ) AFwg(τ) for all τ 0, then: fothers(0)Fwg(0) fwg(0)Fothers(0) 0, (143) and equivalently, fothers(0) Fothers(0) fwg(0) Fwg(0). (144) From the definition of fothers, v fwg(0)/Fwg(0) fwg( µothersv + µwg)/Fwg( µothersv + µwg) (145) Because AFothers(0) > AFwg(0), µothersv + µwg < 0. Applying log-concavity yields v < 1. (146) Thus, when v > 1, there exists some threshold τ 0 where AFwg(τ) > AFwg(τ). E.3 TRANSLATED, LOG-CONCAVE DISTRIBUTIONS ALWAYS UNDERPERFORM THE GROUP-AGNOSTIC REFERENCE We now present a proof of Proposition 4. Assume fwg and fothers are log concave and symmetric, the worst group has mean µwg, the combination of other group(s) has mean µothers, and fothers(x) = fwg(x (µothers µwg)), (the densities are translations of each other.) For convenience, define d = µothers µwg. This implies: Fothers(x) = Fwg(x (µothers µwg)) (147) = Fwg(x d). (148) We will first show that wg is indeed the worst group at full coverage; i.e. AFwg(0) < AFothers(0): Lemma 12. If fwg is a PDF and fothers is obtained by translating fwg to the right (i.e. fothers(x) = fwg(x d) for d > 0), and Fwg and Fothers are their associated CDFs, AFwg(0) < AFothers(0). Proof. We have: AFwg(0) = 1 Fwg(0) (149) 1 Fwg( d) (150) = 1 Fothers(0) (151) = AFothers(0). (152) Published as a conference paper at ICLR 2021 We next show that CFwg(τ) is monotonically decreasing in τ: Lemma 13. If fwg, Fwg, fothers, Fothers, p, and CFwg are as described, CFwg is monotonically decreasing in τ. Proof. First, defining CFwg in terms of Fwg, we have: CFwg(τ) = p(1 Fwg(τ)) p(1 Fwg(τ)) + (1 p)(1 Fothers(τ)) (153) = p(1 Fwg(τ)) p(1 Fwg(τ)) + (1 p)(1 Fwg(τ d)) (154) = p Fwg(2µwg τ) p Fwg(2µwg τ) + (1 p)Fwg(2µwg τ + d) (155) 1 + (1 p)Fwg(2µwg τ+d) p Fwg(2µwg τ) . 
(156) Taking the derivative of CFwg, we have: d dτ CFwg(τ) = d 1 + (1 p)Fwg(2µwg τ+d) p Fwg(2µwg τ) 1 + (1 p)Fwg(2µwg τ+d) p Fwg(2µwg τ) Fwg(2µwg τ + d) Fwg(2µwg τ) = c ( fwg(2µwg τ + d)Fwg(2µwg τ) + fwg(2µwg τ)Fwg(2µwg τ + d)) , (159) where c is a positive constant (since p and 1 p are positive and all of the terms are squares.) Thus, we can say that CFwg(τ) is monotonically decreasing for all τ if: fwg(2µwg τ + d)Fwg(2µwg τ) fwg(2µwg τ)Fwg(2µwg τ + d) (160) fwg(2µwg τ + d) Fwg(2µwg τ + d) fwg(2µwg τ) Fwg(2µwg τ). (161) Now, since d > 0, this follows from the log concavity of fwg. Therefore, the fraction of correct examples that come from group 1 is a decreasing function of τ. Next, we show that the IFwg(τ) is monotonically increasing in τ: Lemma 14. If fwg, Fwg, fothers, Fothers, p and IFwg are as described, IFwg is monotonically increasing in τ. Proof. First, we note that: IFwg(τ) = p Fwg( τ) p Fwg( τ) + (1 p)Fwg( τ d) (162) 1 + (1 p)Fwg( τ d) p Fwg( τ) . (163) We will show that the derivative of IFwg(τ) with respect to τ is positive. Doing so gives us: d dτ IFwg(τ) = d 1 + (1 p)Fwg( τ d) p Fwg( τ) (164) 1 + (1 p)Fwg( τ d) = c ( fwg( τ d)Fwg( τ) + fwg( τ)Fwg( τ d)) , (166) Published as a conference paper at ICLR 2021 where c is positive since it is the product of squared terms and (1 p)/p, which is also positive. Thus, d dτ IFwg(τ) 0 is equivalent to: ( fwg( τ d)Fwg( τ) + fwg( τ)Fwg( τ d)) 0 (167) fwg( τ d)Fwg( τ) fwg( τ)Fwg( τ d) (168) fwg( τ d) Fwg( τ d) fwg( τ) Fwg( τ). (169) Since fwg is log-concave and d is positive, the last inequality is true, so IFwg(τ) is monotonically increasing in τ, as desired. Lastly, we show that Lemma 13 and Lemma 14 imply that the ratio IFwg(τ) CFwg(τ) is monotonically increasing in τ. Lemma 15. IFwg(τ) CFwg(τ) is monotonically increasing in τ Proof. We simply follow the quotient rule: d dτ IFwg(τ) CFwg(τ) = CFwg(τ)IF 1(τ) IFwg(τ)CF 1(τ) (170) where we note that CFwg(τ), IFwg(τ) 0, CF 1(τ) < 0 from Lemma 13, and IF 1(τ) > 0 from Lemma 14. We will now prove the main result of this section: E.3.1 TRANSLATED, LOG-CONCAVE DISTRIBUTIONS ALWAYS UNDERPERFORM THE GROUP-AGNOSTIC REFERENCE Proposition 4 (Translated log-concave distributions underperform the group-agnostic reference). Assume Fwg and Fothers are log-concave and fothers(τ) = fwg(τ d) for all τ R. Then for all τ 0, AFwg(τ) AFwg(τ). (6) Proof. We consider the case of underperforming the group-agnostic reference: the case of overperforming the group-agnostic reference is analogous. Assume the worst group is the lower accuracy group (so d is positive.) Using the definition of selective accuracy, we have: AFwg(τ) = Cwg(τ) Cwg(τ) + I1(τ) (172) = CFwg(τ)C(τ) CFwg(τ)C(τ) + IFwg(τ)I(τ) (173) 1 + IFwg(τ) CFwg(τ) I(τ) 1 + IFwg(0) CFwg(0) I(τ) C(τ) [by Lemma 15] (175) = CFwg(0)C(τ) CFwg(0)C(τ) + IFwg(0)I(τ) (176) = AFwg(τ). (177) Thus, the worst group underperforms the group-agnostic reference, as desired. Published as a conference paper at ICLR 2021 F SIMULATIONS To demonstrate that it is possible but challenging to outperform the group-agnostic reference, we present simulation results on margin distributions that are mixtures of two Gaussians. Following the setup from Section 6, we consider a best-group margin distribution N(1, 1) and a worst-group margin distribution N(µ, σ2) varying the parameters µ, σ. We set the fraction of mass from the worst group, p to be 0.5. We evaluate whether selective classification outperforms the group-agnostic reference, plotting the results in Figure 9. 
We observe that selective classification outperforms the group-agnostic reference under some parameter settings, notably when the necessary condition in Proposition 3 is met and when the worst-group variance is smaller, as suggested by Corollary 1, but it underperforms the group-agnostic reference under most parameter settings.

Figure 9: Simulation results on margin distributions that are mixtures of two Gaussians, where we vary the mean and the standard deviation of the worst-group margin distribution. For parameters in the blue and orange regions, the worst group outperforms and underperforms the group-agnostic reference, respectively. We do not consider parameters in the white region, so that the worst group always has lower full-coverage accuracy than the other group. We shade the parameters that satisfy the necessary condition in Proposition 3; the marker indicates the best-group parameter.

In addition to whether the worst group underperforms or outperforms the group-agnostic reference, we study the magnitude of the difference in worst-group accuracy with respect to the group-agnostic reference through the same simulations (Figure 10). Concretely, we compute the difference in worst-group accuracy with respect to the group-agnostic reference, taking the worst-case difference across thresholds: $\max_\tau \big[\bar{A}_{F_{wg}}(\tau) - A_{F_{wg}}(\tau)\big]$. We observe significant disparities between the observed worst-group accuracy and the group-agnostic reference.

Figure 10: Simulation results on margin distributions that are mixtures of two Gaussians, where we vary the mean and the standard deviation of the worst-group margin distribution and keep the other group's margin distribution fixed with mean 1 and variance 1. We plot the difference in worst-group accuracy with respect to the group-agnostic reference, taking the worst-case difference across thresholds: $\max_\tau \big[\bar{A}_{F_{wg}}(\tau) - A_{F_{wg}}(\tau)\big]$.

Lastly, we simulate the effect of varying the worst-group and other-group means while keeping the variance fixed at σ² = 1 for both groups, following the same simulation protocol otherwise. We similarly observe substantial gaps between the observed worst-group accuracy and the group-agnostic reference (Figure 11).

Figure 11: Simulation results on margin distributions that are mixtures of two Gaussians, where we vary the means of the two margin distributions, keeping their variances fixed at 1. We plot the difference in worst-group accuracy with respect to the group-agnostic reference, taking the worst-case difference across thresholds: $\max_\tau \big[\bar{A}_{F_{wg}}(\tau) - A_{F_{wg}}(\tau)\big]$.
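The simulations above can be reproduced in closed form with Gaussian CDFs. The following is a minimal sketch (ours) that computes the worst-case shortfall across thresholds for one parameter setting, using the sign convention of the text above; the threshold grid and default parameters are illustrative choices.

```python
import numpy as np
from scipy.stats import norm

def worst_group_gap(mu_wg, sigma_wg, mu_best=1.0, sigma_best=1.0, p=0.5):
    """Worst-case (over thresholds) shortfall of the worst group's selective
    accuracy relative to the group-agnostic reference, for margin distributions
    N(mu_wg, sigma_wg^2) (worst group) and N(mu_best, sigma_best^2) (other group)
    mixed with worst-group weight p.
    """
    taus = np.linspace(0.0, 6.0, 601)
    # Within-group fractions: C_g(tau) = 1 - F_g(tau), I_g(tau) = F_g(-tau).
    C_wg = 1 - norm.cdf(taus, mu_wg, sigma_wg)
    I_wg = norm.cdf(-taus, mu_wg, sigma_wg)
    C_ot = 1 - norm.cdf(taus, mu_best, sigma_best)
    I_ot = norm.cdf(-taus, mu_best, sigma_best)
    # Mixture averages.
    C = p * C_wg + (1 - p) * C_ot
    I = p * I_wg + (1 - p) * I_ot
    # Observed worst-group selective accuracy.
    A_wg = C_wg / (C_wg + I_wg)
    # Group-agnostic reference accuracy on the worst group (Definition 4 / Eq. 20).
    C_bar = C_wg[0] * C / C[0]
    I_bar = I_wg[0] * I / I[0]
    A_bar = C_bar / (C_bar + I_bar)
    return np.max(A_bar - A_wg)

# Example: a worst group whose margin distribution is shifted left of the other group's.
print(round(worst_group_gap(mu_wg=0.5, sigma_wg=1.0), 3))
```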