Journal of Machine Learning Research 24 (2023) 1-117. Submitted 12/22; Revised 8/23; Published 8/23.

The Measure and Mismeasure of Fairness

Sam Corbett-Davies (samcorbettdavies@gmail.com)

Johann D. Gaebler (jgaebler@fas.harvard.edu), Department of Statistics, Harvard University, Cambridge, MA 02138, USA

Hamed Nilforoshan (hamedn@cs.stanford.edu), Department of Computer Science, Stanford University, Stanford, CA 94305, USA

Ravi Shroff (ravi.shroff@nyu.edu), Department of Applied Statistics, Social Science, and Humanities, New York University, New York, NY 10003, USA

Sharad Goel (sgoel@hks.harvard.edu), Harvard Kennedy School, Harvard University, Cambridge, MA 02138, USA

* Authors contributed equally.

Editor: Kilian Weinberger

Abstract

The field of fair machine learning aims to ensure that decisions guided by algorithms are equitable. Over the last decade, several formal, mathematical definitions of fairness have gained prominence. Here we first assemble and categorize these definitions into two broad families: (1) those that constrain the effects of decisions on disparities; and (2) those that constrain the effects of legally protected characteristics, like race and gender, on decisions. We then show, analytically and empirically, that both families of definitions typically result in strongly Pareto dominated decision policies. For example, in the case of college admissions, adhering to popular formal conceptions of fairness would simultaneously result in lower student-body diversity and a less academically prepared class, relative to what one could achieve by explicitly tailoring admissions policies to achieve desired outcomes. In this sense, requiring that these fairness definitions hold can, perversely, harm the very groups they were designed to protect. In contrast to axiomatic notions of fairness, we argue that the equitable design of algorithms requires grappling with their context-specific consequences, akin to the equitable design of policy. We conclude by listing several open challenges in fair machine learning and offering strategies to ensure algorithms are better aligned with policy goals.

Keywords: fair machine learning, consequentialism, discrimination

© 2023 Corbett-Davies, Gaebler, Nilforoshan, Shroff, and Goel. License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at http://jmlr.org/papers/v24/22-1511.html.

Contents

1 Introduction
2 Mathematical Definitions of Fairness
   2.1 Formal Setting
   2.2 Limiting the Effect of Decisions on Disparities
   2.3 Limiting the Effect of Attributes on Decisions
3 Equitable Decisions in the Absence of Externalities
   3.1 Utility, Risk, and Threshold Rules
   3.2 The Problem of Inframarginality
   3.3 The Problem with Fairness through Unawareness
4 Equitable Decisions in the Presence of Externalities
   4.1 The Geometry of Fair Decision Making
   4.2 A Formal Theory of Fairness in the Presence of Externalities
5 A Path Forward
   5.1 Competing Notions of Ethical Decision Making: Process vs. Outcomes
   5.2 Designing Equitable Algorithms
   5.3 Case Study: An Algorithm to Allocate Limited Medical Resources
6 Conclusion

Appendices
A Path-specific Counterfactuals
B Constructing Causally Fair Policies
C A Stylized Example of College Admissions
D Proof of Theorem 10
E Proof of Proposition 15
F Prevalence and the Proof of Theorem 17
   F.1 Shyness and Prevalence
   F.2 Outline
   F.3 Convexity, Complete Metrizability, and Universal Measurability
   F.4 Shy Sets and Probes
   F.5 Proof of Theorem 17
   F.6 Proof of Theorem 9
   F.7 Proof of Corollary 18
   F.8 General Measures on K
G Theorem 11 and Related Results
   G.1 Extension to Continuous Covariates
   G.2 A Markov Chain Perspective
H Proofs of Propositions 19 and 20
   H.1 Beta Distributions and Stochastic Dominance
   H.2 Proof of Proposition 19
   H.3 Proof of Proposition 20

1. Introduction

In banking, criminal justice, medicine, and beyond, consequential decisions are often informed by machine learning algorithms (Barocas and Selbst, 2016; Berk, 2012; Chouldechova et al., 2018; Shroff, 2017). As the influence and scope of algorithms increase, academics, policymakers, and journalists have raised concerns that these tools might inadvertently encode and entrench human biases. Such concerns have sparked tremendous interest in developing fair machine-learning algorithms, and, accordingly, a plethora of formal fairness criteria have been proposed in the computer science community (Berk et al., 2021; Carey and Wu, 2022; Chiappa, 2019; Chouldechova, 2017; Chouldechova and Roth, 2020; Cleary, 1968; Corbett-Davies et al., 2017; Coston et al., 2020; Darlington, 1971; Dwork et al., 2012; Galhotra et al., 2022; Hardt et al., 2016; Imai and Jiang, 2020; Imai et al., 2020; Kilbertus et al., 2017; Kleinberg et al., 2017b; Kusner et al., 2017; Loftus et al., 2018; Mhasawade and Chunara, 2021; Nabi and Shpitser, 2018; Wang et al., 2019; Woodworth et al., 2017; Wu et al., 2019; Zafar et al., 2017a,b; Zhang and Bareinboim, 2018; Zhang et al., 2017). Here we synthesize and critically examine the statistical properties of popular formal fairness approaches as well as the consequences of enforcing them. Using both theory and empirical evidence, we argue that these approaches, when used as algorithmic design principles, can often cause more harm than good. In contrast to popular axiomatic approaches to algorithmic fairness, we advocate for a consequentialist perspective that directly grapples with the difficult policy trade-offs inherent to many algorithmically guided decisions. We begin, in Section 2, by proposing a two-part taxonomy of formal fairness definitions.
Our first category of definitions encompasses those that consider the effects of decisions on disparities. Imagine, for example, designing an algorithm to guide decisions for college admissions. Under the principle that fair algorithms should have comparable performance across demographic groups (Hardt et al., 2016), one might check that among applicants who were ultimately academically successful (e.g., who eventually earned a college degree, either at the institution in question or elsewhere), the algorithm would recommend admission for an equal proportion of candidates across race groups. Our second category of definitions encompasses those that seek to limit both the direct and indirect effects of one's group membership on decisions. Following the principle that decisions should be agnostic to legally protected attributes like race and gender (cf. Dwork et al., 2012), one might mandate that these features not be provided to the algorithm. Further, because one's race might impact earlier educational opportunities, and hence test scores, one might require that admissions decisions are robust to the effect of race along such causal paths.

These formalizations of fairness have considerable intuitive appeal. It can feel natural to exclude protected characteristics in a drive for equity; and one might understandably interpret disparities in error rates as indicating problems with an algorithm's design or with the data on which it was trained. However, in Sections 3 and 4, we show that both classes of algorithmic fairness definitions suffer from deep statistical limitations. For example, for natural families of utility functions (like those that prefer both higher academic preparedness and more student-body diversity), we prove that common fairness criteria almost always, in a measure-theoretic sense, lead to strongly Pareto dominated decision policies.¹

1. A policy is strongly Pareto dominated if there is an alternative feasible policy that is preferred under every utility function in the family (cf. Section 4.2).

In particular, in our running college admissions example, adhering to several of the popular conceptions of fairness we consider would simultaneously result in lower student-body diversity and a less academically prepared class, relative to what one could attain by explicitly tailoring admissions policies to achieve desired outcomes. In fact, under one prominent definition of fairness, we prove that the induced policies require simply admitting all applicants with equal probability, irrespective of one's academic qualifications or group membership. These formal fairness criteria are thus often at odds with policy goals, and, perversely, can harm the very same groups one ostensibly sought to protect by developing and adopting axiomatic notions of fairness.

How, then, can we ensure algorithms are fair? There are no easy solutions, but we conclude in Section 5 by offering several observations and suggestions for designing more equitable algorithms. Most importantly, we believe it is critical to acknowledge and tackle head-on the substantive trade-offs at the heart of many decision problems. For example, when creating a college admissions policy, one must necessarily make difficult choices that balance competing priorities. Formal fairness axioms are poor tools for engaging with these challenging issues. Our overarching exhortation is thus to recognize algorithms as encoding policy choices, and to accordingly tailor their design.

Contributions.
To summarize, we offer three main contributions. First, we survey the fairness literature, describing existing fairness definitions and organizing them into a two-part taxonomy. Our categorization of formal fairness definitions proposed in the computer science literature highlights their connections to influential legal and economic notions of discrimination. Second, we lay out a consequentialist framework for designing equitable algorithms. Our framework is motivated by viewing algorithmic fairness as a policy objective rather than as a technical problem. This approach exposes the statistical and normative limitations of many popular formal fairness definitions. Finally, we apply our consequentialist framework to develop a positive vision for addressing problems of fairness and equity in algorithm design.

Much of the content we present synthesizes and builds on research that we and our collaborators have conducted over the last several years (Cai et al., 2020; Chohlas-Wood et al., 2023a,b; Corbett-Davies et al., 2017; Koenecke et al., 2023). In particular, we draw heavily on two papers by Corbett-Davies and Goel (2018) and Nilforoshan et al. (2022). In addition to synthesis, we broaden the formal theoretical results presented in this line of work and offer new, concrete illustrations of our theoretical arguments.

Some of the results and arguments we present date back five years, and the field of algorithmic fairness has since moved forward in many ways. For example, in the intervening time, there has been increasing recognition of the shortcomings of popular formal fairness definitions (Barocas et al., 2019). Nevertheless, we believe our message is as relevant as ever. For instance, within the research community, new algorithmic fairness definitions are regularly introduced that, while different in some respects, frequently suffer from the same statistical and conceptual limitations as the notions we survey here. In the broader world, policymakers, algorithm designers, journalists, and advocates often still evaluate algorithms (and accordingly influence decisions) by turning to these formal fairness definitions without necessarily appreciating their shortcomings. For example, proposed legislation in Idaho sought to require that pretrial risk assessment algorithms have equal error rates across groups (Idaho H.B. 118, 2019). Although the proposed bill was never passed, it illustrates the ways in which these formal measures have garnered significant attention beyond the academic community.

The call to build equitable algorithms will only grow over time as automated decisions become even more widespread. As such, it is imperative to address limitations in past formulations of fairness, to identify best practices moving forward, and to outline important open research questions. By synthesizing and critically examining recent developments in fair machine learning, we hope to help both researchers and practitioners advance this increasingly influential field.

2. Mathematical Definitions of Fairness

We start by assembling and categorizing definitions of algorithmic fairness into a two-part taxonomy: those that seek to limit the effect of decisions on disparities, and those that seek to limit the effect of protected attributes like race or gender on the decisions themselves.
We first introduce formal notation and concrete examples of decision problems in which one might seek to apply these fairness definitions, before reviewing prominent examples of both approaches in turn.

2.1 Formal Setting

Consider a population of individuals with observed covariates $X$, drawn i.i.d. from a set $\mathcal{X} \subseteq \mathbb{R}^n$ with distribution $\mathcal{D}_X$. Further suppose that $A \in \mathcal{A}$ describes one or more discrete protected attributes, such as race or gender, which can be derived from $X$ (i.e., $A = \alpha(X)$ for some function $\alpha$). Each individual is subject to a binary decision $D \in \{0, 1\}$, determined by a (randomized) rule $d(x) \in [0, 1]$, where $d(x) = \Pr(D = 1 \mid X = x)$ is the probability of receiving a positive decision, $D = 1$.²,³ Given a budget $b$ with $0 < b \le 1$, we require the decision rule to satisfy $\mathbb{E}[D] \le b$. Finally, we suppose that each individual has some associated binary outcome $Y$. In some cases, we will be concerned with the causal effect of the decision $D$ on $Y$, in which case we imagine that there exist two potential outcomes, $Y(0)$ and $Y(1)$, corresponding to what happens to the individual depending on whether they receive a negative or positive decision.⁴

2. That is, $D = \mathbb{1}\{U_D \le d(X)\}$, where $U_D$ is an independent uniform random variable on $[0, 1]$.
3. By "positive," we simply mean the decision $D$ is greater than zero, without ascribing any normative position to the decision. Individuals may or may not have a preference for positive decisions in this sense.
4. As is implicit in our notation, we assume that there are no spillover effects between units (Imbens and Rubin, 2015).

To make our discussion concrete, we imagine two running examples corresponding to this formal setting: diabetes screening and college admissions. As we discuss in detail below, these two examples differ in the extent to which there is agreement about the ultimate value of different decision policies, which in turn impacts our mathematical analysis.

Diabetes is a common and serious health condition that afflicts many American adults. If caught early, it is often possible to avoid some of the most significant consequences of the disease through, for example, changes to one's diet and physical routine. A blood test can be used to determine whether an individual has diabetes, but as with many screening tools, there are risks and inconveniences associated with screening (e.g., a patient may need to take time off from work). In particular, if an individual were certain that they did not have diabetes, then they would prefer not to undergo screening. Our goal is to design an equitable screening policy $d(x)$ to determine which patients have ($Y = 1$) or do not have ($Y = 0$) diabetes, based on a set of covariates $X$. For example, following Aggarwal et al. (2022), the screening decision may be based on a patient's age, body mass index (BMI), and race. (Those authors argue that consideration of race, while controversial, leads to more precise and equitable estimates of diabetes risk, a point we return to in Section 3.3.) We further imagine the budget $b$ equals 1, corresponding to the fact that everyone could be screened in principle.

Our second example concerns college admissions. Here, the population of interest is applicants to a particular college, and the decision $D$ is the admissions committee's binary admissions decision. To simplify our exposition, we assume all admitted students attend the school. In this setting, the covariates $X$ may, for example, consist of an applicant's test score and race $A \in \{a_0, a_1\}$, and $Y$ is a binary variable that indicates college graduation (i.e., degree attainment). In contrast to our diabetes example, here we imagine that the decision itself may affect the outcomes. Specifically, $Y(1)$ and $Y(0)$ describe whether an applicant would attain a college degree if admitted to or if rejected from the school we consider, respectively. Note that $Y(0)$ is not necessarily zero, as a rejected applicant may attend and graduate from a different university. Further, in this case we set the budget $b$ to be less than one to reflect the fact that the admissions committee has limited resources and is unable to admit every candidate.
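To make the formal setting concrete, the following minimal sketch implements a randomized decision rule of the form $D = \mathbb{1}\{U_D \le d(X)\}$ in Python. The covariates, risk model, and function names here are our own illustrative inventions, not part of the paper's formal development.

```python
import numpy as np

rng = np.random.default_rng(0)

def apply_policy(d, X):
    """Realize binary decisions from a randomized rule d: X -> [0, 1].

    Implements D = 1{U_D <= d(X)} (footnote 2), drawing an independent
    Uniform(0, 1) variable U_D for each individual.
    """
    p = np.array([d(x) for x in X])
    return (rng.uniform(size=len(X)) <= p).astype(int)

# Hypothetical covariates x = (age, bmi) and a toy screening rule.
def screening_rule(x):
    age, bmi = x
    risk = 0.001 * age + 0.002 * max(bmi - 25.0, 0.0)  # invented risk model
    return 1.0 if risk > 0.015 else 0.0                # deterministic threshold

X = [(55, 31.0), (23, 22.5), (67, 28.0)]
D = apply_policy(screening_rule, X)
assert D.mean() <= 1.0  # budget constraint E[D] <= b, with b = 1 here
```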
As mentioned above, a key distinction between these two examples is the extent to which stakeholders may agree on the value of different potential decision policies. For example, in college admissions, there may be significant disagreement on how to balance competing priorities, such as academic preparedness and class diversity.⁵ Admissions committees may seek to increase both dimensions, but there is often an inherent trade-off, particularly since there is a limit on the number of students that can be admitted by the college (i.e., $b < 1$). Our diabetes example, in contrast, reflects a setting where there is ostensibly broader agreement on the value of different decision policies. Indeed, since there is effectively no limit on the number of diabetes tests that can be administered (i.e., $b = 1$), we can model the value of a decision policy as the sum of each individual's value for being screened.⁶ In Sections 3 and 4, we in turn examine the structure of equitable decision making in the absence and presence of such trade-offs. First, though, we introduce several formal fairness criteria.

5. In some jurisdictions, explicit considerations of racial diversity may be prohibited. For instance, a recent U.S. Supreme Court case bars colleges from explicitly considering race in admissions; however, colleges may consider an applicant's discussion of how race affected the applicant's life (SFFA v. Harvard, 2023). U.S. colleges may also consider other forms of diversity, such as economic or geographic diversity.
6. In the case of infectious diseases (which involve greater externalities), there is again often disagreement about the value of different screening and vaccination policies. Paulus and Kent (2020) similarly draw a distinction between polar settings (in which parties have competing interests, like our admissions example) and non-polar settings (where there is broad alignment, as in our diabetes example).

2.2 Limiting the Effect of Decisions on Disparities

A popular class of fairness definitions requires that error rates (e.g., false positive and false negative rates) are equal across protected groups (Hardt et al., 2016).⁷ We refer to these definitions as examples of classification parity, meaning that some given measure of classification error is equal across groups defined by attributes such as race and gender. In particular, we include in this definition any measure that can be computed from the two-by-two confusion matrix tabulating the joint distribution of decisions $D$ and outcomes $Y$ for a group. Berk et al. (2021) enumerate seven such statistics, including false positive rate, false negative rate, precision, recall, and the proportion of decisions that are positive.
The proportion of positive decisions is not, strictly speaking, a measure of "error," but we nonetheless include it under classification parity since it can be computed from a confusion matrix. We also include the area under the ROC curve (AUC), a popular measure among practitioners examining the fairness of algorithms (Skeem and Lowenkamp, 2016). Two of the above measures (the proportion of decisions that are positive, and the false positive rate) have received considerable attention in the machine learning community (Agarwal et al., 2018; Blum and Stangl, 2019; Calders and Verwer, 2010; Chouldechova, 2017; Edwards and Storkey, 2016; Feldman et al., 2015; Hardt et al., 2016; Jung et al., 2020a; Kamiran et al., 2013; Pedreshi et al., 2008; Zafar et al., 2017a,c; Zemel et al., 2013).

Definition 1 We say that demographic parity holds when⁸

$$D \perp A. \quad (1)$$

Definition 2 We say that equalized false positive rates holds when

$$D \perp A \mid Y = 0. \quad (2)$$

In our running diabetes example, demographic parity means that the proportion of patients who are screened for the disease is equal across race groups. Similarly, in our college admissions example, demographic parity means an equal proportion of students is admitted across race groups. Equalized false positive rates, in our diabetes example, means that among individuals who in reality do not have diabetes (and thus for whom screening, ex post, would not have been beneficial), screening rates are equal across race groups.⁹

Causal analogues of these definitions have also recently been proposed (Coston et al., 2020; Imai and Jiang, 2020; Imai et al., 2020; Mishler et al., 2021), which require various conditional independence conditions to hold between the potential outcomes, protected attributes, and decisions.¹⁰ Below we list three representative examples of this class of fairness definitions: counterfactual predictive parity (Coston et al., 2020), counterfactual equalized odds (Coston et al., 2020; Mishler et al., 2021), and conditional principal fairness (Imai and Jiang, 2020).¹¹

7. Some work relaxes strict equality of error rates or other metrics to requiring only that the difference be at most some fixed $\epsilon$ (e.g., Nabi and Shpitser, 2018). For ease of exposition, we consider strict equality throughout, though we emphasize that the spirit of the critique we develop applies also in cases where fairness constraints are approximate, rather than exact.
8. We use the notation $X \perp Y$ throughout to mean that the random variables $X$ and $Y$ are independent.
9. In our college admissions example, the decision $D$ impacts the outcome $Y$. One could, in theory, apply the definition of error rate parity above to that case by recognizing that $Y = Y(D)$. However, that interpretation does not seem aligned with the original intent of the definition. We instead discuss the admissions example in the context of the explicitly causal definitions of fairness below.
10. In the literature on causal fairness, there is at times ambiguity between predictions $\hat{Y} \in \{0, 1\}$ of $Y$ and decisions $D \in \{0, 1\}$. Following past work (e.g., Corbett-Davies et al., 2017; Kusner et al., 2017; Wang et al., 2019), here we focus exclusively on decisions, with predictions implicitly impacting decisions but not explicitly appearing in our definitions.
11. Our subsequent analytical results extend in a straightforward manner to structurally similar variants of these definitions (e.g., requiring $Y(0) \perp A \mid D = 1$ or $D \perp A \mid Y(0)$, variants of counterfactual predictive parity and counterfactual equalized odds, respectively).

Definition 3 We say that counterfactual predictive parity holds when

$$Y(1) \perp A \mid D = 0. \quad (3)$$

In our college admissions example, counterfactual predictive parity means that among rejected applicants, the proportion who would have attained a college degree, had they been accepted, is equal across race groups.
(For our diabetes example, because the screening decision does not affect whether a patient actually has diabetes, $Y(0) = Y(1) = Y$, and so counterfactual predictive parity, as well as the causal definitions below, reduce to their non-causal analogues.)

Definition 4 We say that counterfactual equalized odds holds when

$$D \perp A \mid Y(1). \quad (4)$$

In our running college admissions example, counterfactual equalized odds is satisfied when two conditions hold: (1) among applicants who would graduate if admitted (i.e., $Y(1) = 1$), students are admitted at the same rate across race groups; and (2) among applicants who would not graduate if admitted (i.e., $Y(1) = 0$), students are again admitted at the same rate across race groups.

Definition 5 We say that conditional principal fairness holds when

$$D \perp A \mid Y(0), Y(1), W, \quad (5)$$

where, for some function $\omega$ on $\mathcal{X}$, $W = \omega(X)$ describes a reduced set of the covariates $X$. When $W$ is constant (or, equivalently, when we do not condition on $W$), this condition is called principal fairness.

In the college admissions example, conditional principal fairness means that similar applicants (where similarity is defined by the potential outcomes and covariates $W$) are admitted at the same rate across race groups.

2.3 Limiting the Effect of Attributes on Decisions

An alternative framework for understanding fairness considers the effects of protected attributes on decisions. This approach can be understood as codifying the legal notion of disparate treatment (Goel et al., 2017; Zafar et al., 2017a), which we discuss further in Section 5.1. Perhaps the simplest way to limit the effects of protected attributes on decisions is to require that the decisions do not explicitly depend on them, what some call fairness through unawareness (cf. Dwork et al., 2012).

Definition 6 Suppose that the covariates can be partitioned into the protected attributes and all other covariates, i.e., that $\mathcal{X} = \mathcal{X}_u \times \mathcal{A}$, where $\mathcal{X}_u$ consists of unprotected attributes. Then, we say that blinding holds when, for all $a, a' \in \mathcal{A}$ and $x_u \in \mathcal{X}_u$,

$$d(x_u, a) = d(x_u, a'). \quad (6)$$

In our running diabetes example, blinding holds when the screening decision depends solely on factors like age and BMI, and, in particular, does not depend on the patient's race. We similarly say college admissions decisions satisfy blinding when the decisions depend on factors like test scores and extracurricular activities, but not race.

Blinding is closely tied to the notion of calibration, the requirement that, conditional on the estimated probability of some outcome (such as graduation from college or having diabetes), the outcome is independent of group membership. For example, among people with an estimated diabetes risk of 1%, calibration would require that the proportion of individuals who actually have diabetes be the same across groups. Many authors treat calibration as a kind of fairness constraint (in particular, to ensure that the meaning of estimated risks does not differ across groups), and it has received considerable attention in the fairness literature (e.g., Hébert-Johnson et al., 2018; Rothblum and Yona, 2022). We note, though, that in practice miscalibration is closely tied to blindness. In particular, when estimation error is small, risk estimates that are allowed to depend on group membership are calibrated; conversely, risk estimates that are blind to group membership typically are miscalibrated, an empirical phenomenon shown and discussed in Figure 3 below. Because of this close relationship, we do not treat calibration as a separate fairness constraint, but we do discuss calibration and its relationship to blinding in detail in Sections 3.3.1 and 5.2.
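As a concrete companion to Definitions 1 and 2 and the associated notion of calibration, the sketch below computes the relevant group-level quantities from observed decisions, outcomes, and group labels. The data layout and function names are hypothetical illustrations.

```python
import numpy as np

def fairness_diagnostics(D, Y, A, score=None, bins=10):
    """Group-level quantities underlying several formal fairness criteria.

    D, Y: 0/1 decision and outcome arrays; A: group labels;
    score: optional risk estimates, used for a calibration check.
    """
    D, Y, A = map(np.asarray, (D, Y, A))
    out = {}
    for a in np.unique(A):
        g = A == a
        out[a] = {
            "positive_rate": D[g].mean(),   # demographic parity compares these
            "fpr": D[g & (Y == 0)].mean(),  # equalized FPRs compare these
        }
        if score is not None:
            s = np.asarray(score)
            # Calibration: observed outcome rate within coarse score bins.
            edges = np.linspace(0, 1, bins + 1)
            ix = np.digitize(s[g], edges[1:-1])
            out[a]["calibration"] = [
                Y[g][ix == b].mean() for b in range(bins) if (ix == b).any()
            ]
    return out
```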
In contrast to blinding (in which race and other protected attributes are barred from being an explicit input to a decision rule), the causal versions of this idea consider both the direct and indirect effects of protected attributes on decisions (Kilbertus et al., 2017; Kusner et al., 2017; Mhasawade and Chunara, 2021; Nabi and Shpitser, 2018; Wang et al., 2019; Wu et al., 2019; Zhang and Bareinboim, 2018; Zhang et al., 2017). For example, even if decisions only directly depend on test scores, race may indirectly impact decisions through its effects on educational opportunities, which in turn influence test scores. In this vein, a decision rule is deemed fair if, at a high level, decisions for individuals are the same in (a) the actual world and (b) a counterfactual world where the individual belonged to a different demographic group (Kusner et al., 2017).¹² This idea can be formalized by requiring that decisions remain the same in expectation even if one's protected characteristics are counterfactually altered, a condition known as counterfactual fairness (Kusner et al., 2017).

12. Conceptualizing a general causal effect of an immutable characteristic such as race or gender is rife with challenges, the greatest of which is expressed by the mantra, "no causation without manipulation" (Holland, 1986). In particular, analyzing race as a causal treatment requires one to specify what exactly is meant by changing an individual's race from, for example, White to Black (Gaebler et al., 2022; Hu and Kohler-Hausmann, 2020). Such difficulties can sometimes be addressed by considering a change in the perception of race by a decision maker (Greiner and Rubin, 2011), for instance, by changing the name listed on an employment application (Bertrand and Mullainathan, 2004), or by masking an individual's appearance (Chohlas-Wood et al., 2021; Goldin and Rouse, 2000; Grogger and Ridgeway, 2006; Pierson et al., 2020).

Figure 1: A causal DAG illustrating a hypothetical process for college admissions. Under path-specific fairness, one may require, for example, that race does not affect decisions along the path highlighted in red.

Definition 7 Counterfactual fairness holds when

$$\mathbb{E}[D(a') \mid X] = \mathbb{E}[D \mid X], \quad (7)$$

where $D(a')$ denotes the decision when one's protected attributes are counterfactually altered to be any $a' \in \mathcal{A}$.
In our running college admissions example, this means that for each group of observationally identical applicants (i.e., those with the same values of $X$, meaning identical race and test score), the proportion of students who are actually admitted is the same as the proportion who would be admitted if their race were counterfactually altered.

Counterfactual fairness aims to limit all direct and indirect effects of protected traits on decisions. In a generalization of this criterion, termed path-specific fairness (Chiappa, 2019; Nabi and Shpitser, 2018; Wu et al., 2019; Zhang et al., 2017), one allows protected traits to influence decisions along certain causal paths but not others. For example, one may wish to allow the direct consideration of race by an admissions committee to implement an affirmative action policy, while also guarding against any indirect influence of race on admissions decisions that may stem from cultural biases in standardized tests (Williams, 1983).

The formal definition of path-specific fairness requires specifying a causal DAG describing relationships between attributes (both observed covariates and latent variables), decisions, and outcomes. In our running example of college admissions, we imagine that each individual's observed covariates are the result of the process illustrated by the causal DAG in Figure 1. In this graph, an applicant's race $A$ influences the educational opportunities $E$ available to them prior to college; and educational opportunities in turn influence an applicant's level of college preparation, $M$, as well as their score on a standardized admissions test, $T$, such as the SAT. We assume the admissions committee only observes an applicant's race and test score, so that $X = (A, T)$, and makes their decision $D$ based on these attributes. Finally, whether or not an admitted student subsequently graduates (from any college), $Y$, is a function of both their preparation and whether they were admitted.¹³

To formalize path-specific fairness, we start by defining, for the decision $D$, path-specific counterfactuals, a general concept in causal DAGs (cf. Pearl, 2001). Suppose $G = (V, U, F)$ is a causal model with nodes $V$, exogenous variables $U$, and structural equations $F$ that define the value at each node $V_j$ as a function of its parents $\mathrm{pa}(V_j)$ and its associated exogenous variable $U_j$. (See, for example, Pearl (2009a) for further details on causal DAGs.) Let $V_1, \ldots, V_m$ be a topological ordering of the nodes, meaning that $\mathrm{pa}(V_j) \subseteq \{V_1, \ldots, V_{j-1}\}$ (i.e., the parents of each node appear in the ordering before the node itself). Let $\Pi$ denote a collection of paths from node $A$ to $D$. Now, for two possible values $a$ and $a'$ for the variable $A$, the path-specific counterfactuals $D_{\Pi,a,a'}$ for the decision $D$ are generated by traversing the list of nodes in topological order, propagating counterfactual values obtained by setting $A = a'$ along paths in $\Pi$, and otherwise propagating values obtained by setting $A = a$. (In Algorithm 1 in the Appendix, we formally define path-specific counterfactuals for an arbitrary node or collection of nodes in the DAG.)

To see this idea in action, we work out an illustrative example, computing path-specific counterfactuals for the decision $D$ along the single path $\Pi = \{A \to E \to T \to D\}$ linking race to the admissions committee's decision through test score, highlighted in red in Figure 1. We describe the distribution of $D_{\Pi,a,a'}$ generatively, formally showing how to produce a draw from this distribution.
To start, we draw values $U_E$, $U_M$, $U_T$, $U_D$ of the exogenous variables. Now, the first column in Table 1 corresponds to draws $V^*$ for each node $V$ in the DAG, where we set $A$ to $a$, and then propagate that value as usual. The second column corresponds to draws $\tilde{V}$ of path-specific counterfactuals, where we set $A$ to $a'$, and then propagate the counterfactuals only along the path $A \to E \to T \to D$. In particular, the value for the test score $\tilde{T}$ is computed using the value of $\tilde{E}$ (since the edge $E \to T$ is on the specified path) and the value of $M^*$ (since the edge $M \to T$ is not on the path). As a result of this process, we obtain a draw $\tilde{D}$ from the distribution of $D_{\Pi,a,a'}$.

Table 1: Computing path-specific counterfactuals for the DAG in Figure 1. The first column corresponds to draws $V^*$ for each node $V$, where we set $A$ to $a$, and then propagate that value as usual. The second column corresponds to draws $\tilde{V}$ of path-specific counterfactuals, where we set $A$ to $a'$, and then propagate the counterfactuals only along the path $A \to E \to T \to D$.

    $A^* = a$                      | $\tilde{A} = a'$
    $E^* = f_E(A^*, U_E)$          | $\tilde{E} = f_E(\tilde{A}, U_E)$
    $M^* = f_M(E^*, U_M)$          | $\tilde{M} = f_M(E^*, U_M)$
    $T^* = f_T(E^*, M^*, U_T)$     | $\tilde{T} = f_T(\tilde{E}, M^*, U_T)$
    $D^* = f_D(A^*, T^*, U_D)$     | $\tilde{D} = f_D(A^*, \tilde{T}, U_D)$

Path-specific fairness formalizes the intuition that the influence of a sensitive attribute on a downstream decision may, in some circumstances, be considered legitimate (i.e., it may be acceptable for the attribute to affect decisions along certain paths in the DAG). For instance, an admissions committee may believe that the effect of race $A$ on admissions decisions $D$ that passes through college preparation $M$ is legitimate, whereas the effect of race along the path $A \to E \to T \to D$, which may reflect access to test prep or cultural biases of the tests, rather than actual academic preparedness, is illegitimate. In that case, the admissions committee may seek to ensure that the proportion of applicants they admit from a certain race group remains unchanged if one were to counterfactually alter the race of those individuals along the path $\Pi = \{A \to E \to T \to D\}$.

Definition 8 Let $\Pi$ be a collection of paths, and, for some function $\omega$ on $\mathcal{X}$, let $W = \omega(X)$ describe a reduced set of the covariates $X$. Path-specific fairness, also called $\Pi$-fairness, holds when, for any $a' \in \mathcal{A}$,

$$\mathbb{E}[D_{\Pi,A,a'} \mid W] = \mathbb{E}[D \mid W]. \quad (8)$$

13. In practice, the racial composition of an admitted class may itself influence degree attainment, if, for example, diversity provides a net benefit to students (Page, 2007). Here, for simplicity, we avoid consideration of such peer effects.

In the definition above, rather than a particular counterfactual level $a$, the baseline level of the path-specific effect is $A$, i.e., an individual's actual (non-counterfactually altered) group membership (e.g., their actual race). We have implicitly assumed that the decision variable $D$ is a descendant of the covariates $X$. In particular, without loss of generality, we assume $D$ is defined by the structural equation $f_D(x, u_D) = \mathbb{1}\{u_D \le d(x)\}$, where the exogenous variable $U_D \sim \mathrm{Unif}(0, 1)$, so that $\Pr(D = 1 \mid X = x) = d(x)$. If $\Pi$ is the set of all paths from $A$ to $D$, then $D_{\Pi,A,a'} = D(a')$, in which case, for $W = X$, path-specific fairness is the same as counterfactual fairness.
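The generative procedure in Table 1 is mechanical enough to simulate directly. The sketch below does so for the DAG in Figure 1, using invented structural equations; only the propagation pattern (which edges receive counterfactual values) follows Table 1.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical structural equations for the DAG of Figure 1; the functional
# forms are illustrative only. Nodes: A -> E, E -> {M, T}, M -> T, {A, T} -> D.
def f_E(a, u):    return u + (0.5 if a == 1 else 0.0)   # opportunities
def f_M(e, u):    return e + u                          # preparation
def f_T(e, m, u): return 0.7 * e + 0.3 * m + u          # test score
def f_D(a, t, u): return int(u <= 1.0 / (1.0 + np.exp(-(t - 1.5))))

def draw_path_specific_D(a, a_prime):
    """One draw of (D*, D~) following Table 1, with counterfactual value a'
    propagated only along the path A -> E -> T -> D."""
    u_E, u_M, u_T = rng.normal(size=3)
    u_D = rng.uniform()
    # Factual column: set A = a and propagate as usual.
    E = f_E(a, u_E)
    M = f_M(E, u_M)
    T = f_T(E, M, u_T)
    D = f_D(a, T, u_D)
    # Path-specific column: E~ uses a' (A -> E is on the path); M~ reuses the
    # factual E (E -> M is off the path), so M~ = M*; T~ uses E~ but the
    # factual M (M -> T is off the path); D~ uses the factual A but T~.
    E_cf = f_E(a_prime, u_E)
    T_cf = f_T(E_cf, M, u_T)
    D_cf = f_D(a, T_cf, u_D)
    return D, D_cf

draws = [draw_path_specific_D(0, 1) for _ in range(10_000)]
print(np.mean([d for d, _ in draws]), np.mean([d_cf for _, d_cf in draws]))
```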
3. Equitable Decisions in the Absence of Externalities

In many decision-making settings, the decision maker is free to make the optimal decision for each individual, without consideration of spillover effects or other externalities. For instance, in our diabetes screening example, one could, in principle, screen all patients if that course of action were medically advisable. To investigate notions of fairness in these settings, we first introduce a framework for utilitarian decision analysis. Specifically, we consider in this section situations in which there is broad agreement on the utility of different potential courses of action. (In the subsequent section, we consider cases where stakeholders disagree on the precise form of the utility.) In this setting, threshold rules maximize utility. We then describe the statistical phenomenon of inframarginality, a property that is endemic to fairness definitions that seek to enforce some form of classification parity. In particular, we discuss, both informally and mathematically, why inframarginality almost surely (in a measure-theoretic sense) renders optimal decision making incompatible with classification parity. Finally, we discuss blinding. In parallel to our discussion of classification parity, we see that in many important settings, the information loss associated with, e.g., removing protected information from a predictive model, results in less efficient decision making without compensatory benefits. Moreover, in general, we see that the more stringent the standard of masking (e.g., removing not only direct but also indirect effects of protected attributes), the greater the potential harm that results from enforcing it.

3.1 Utility, Risk, and Threshold Rules

A natural way to analyze a decision, such as deciding whether an individual should be screened for diabetes, is to consider the costs and benefits of various possible outcomes under different courses of action. For instance, a patient screened for diabetes who does not have the disease still has to bear the risks, discomfort, and inconvenience associated with the blood test itself, while a patient who is not screened but does in fact have the disease loses out on the opportunity to start treatment. In general, the benefit of making decision $D = 1$ over $D = 0$ when the outcome $Y$ equals $y$ can be represented by $v(y)$. For instance, in our diabetes example, $v(1)$ represents the net benefit of screening over not screening when the patient has diabetes; and $v(0)$ is the net cost of screening when the patient does not have diabetes, including both monetary and non-monetary costs, such as discomfort and loss of time.¹⁴

Let $r(x) = \Pr(Y = 1 \mid X = x)$ be the risk of $Y$ equalling 1 when $X = x$. Then the expected benefit of making decision $D = 1$ over $D = 0$ for an individual with covariates $X = x$ is

$$u(x) = \mathbb{E}[v(Y) \mid X = x] = r(x)\,v(1) + [1 - r(x)]\,v(0).$$

Here, for ease of interpretation, we restrict our utility to be of the form $u(x) = \mathbb{E}[v(Y) \mid X = x]$ for some function $v$, and we also assume there is no budget constraint (i.e., $b = 1$). In Section 4, we allow the utility $u(x)$ to be an arbitrary function on $\mathcal{X}$ and consider $b < 1$, which induces the trade-offs in decisions that are central to our later discussion. The aggregate expected utility of a decision policy $d(x)$ relative to the baseline policy of taking action $D = 0$ for all individuals is then given by $u(d) = \mathbb{E}[d(X)\,u(X)]$. We say a decision policy $d^*(x)$ is utility-maximizing if $u(d^*) = \max_d u(d)$.

It is better, in expectation, for an individual with covariates $X = x$ to take action $D = 1$ instead of $D = 0$ when $u(x) > 0$; that is, when¹⁵

$$r(x) > \frac{v(0)}{v(0) - v(1)}. \quad (9)$$
Thus, the decision with the maximum utility can be determined by comparing an individual's risk against a particular risk threshold $t$, defined by the right-hand side of Eq. (9).

14. For ease of exposition, we assume that costs and benefits are identical across individuals; in reality, these could vary, e.g., depending on age. When utilities vary by person, the optimal decision rule is to screen only those with positive individual utility, in line with our subsequent discussion.
15. We assume, without loss of generality, that $v(1) > v(0)$. If $v(1) < v(0)$, we can take $Y' = 1 - Y$ as our outcome of interest; relative to $Y'$, the inequality will be reversed. If $v(1) = v(0)$, then the outcome is irrelevant. In this degenerate case, the higher utility decision depends on the sign of $v(1)$ alone, and not the risk.

We refer to this kind of policy as a threshold policy. In particular, we see that a utility-maximizing decision for each individual (i.e., $d(x) = 1$ if $r(x) > t$ and $d(x) = 0$ if $r(x) \le t$) is also a decision policy that maximizes aggregate utility, so there is no conflict between doing what is best from each individual person's perspective and what is best for the population as a whole.

While our framing in terms of expected utility is suitably general, threshold policies can be simpler to interpret when we reparameterize in terms of more familiar quantities. In the diabetes screening example, if the patient does not have diabetes, the cost of action $D = 1$ over $D = 0$ is $v(0) = -c_{\mathrm{Test}}$, i.e., the cost (monetary and non-monetary) of the test. If the patient does have diabetes, the benefit of $D = 1$ over $D = 0$ is $v(1) = b_{\mathrm{Treat}} - c_{\mathrm{Test}}$, i.e., the benefit of treatment minus the cost of the test. Rewriting Eq. (9) in terms of these quantities gives

$$r(x) > \frac{c_{\mathrm{Test}}}{b_{\mathrm{Treat}}}.$$

In particular, if the benefit of early treatment of diabetes is 50 times greater than the cost of performing the diagnostic test, one would ideally screen patients who have at least a 2% chance of having the disease.

Threshold rules are a natural approach to decision making in a variety of settings. In our running medical example, a threshold rule corresponds to screening patients with a sufficiently high risk of having diabetes. A threshold rule with the optimally chosen threshold ensures that only the patients at highest risk of having diabetes take the test, thereby optimally balancing the costs and benefits of screening. Indeed, in many medical examples, from diagnosis to treatment, there are no significant externalities. As a result, deviating from utility-maximizing threshold policies can only force individuals to experience greater costs (in the form of unnecessary tests or untreated illness, in expectation) without compensatory benefits. We return to the problem of optimal (and equitable) decision-making in the presence of externalities in Section 4.
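The threshold in Eq. (9) is a one-line computation. A minimal sketch under the diabetes parameterization above (the variable names are ours):

```python
def optimal_threshold(c_test, b_treat):
    """Risk threshold t from Eq. (9), with v(0) = -c_test and
    v(1) = b_treat - c_test; this simplifies to c_test / b_treat."""
    v0, v1 = -c_test, b_treat - c_test
    return v0 / (v0 - v1)

# If early treatment is worth 50x the cost of the test, screen at 2% risk.
assert abs(optimal_threshold(1.0, 50.0) - 0.02) < 1e-12
```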
3.2 The Problem of Inframarginality

In the setting that we have been considering, threshold policies guarantee optimal choices are made for each individual. However, as we now show, threshold policies in general violate various versions of classification parity, such as demographic parity and equalized false positive rates. This incompatibility highlights a critical limitation of classification parity as a fairness criterion, as enforcing the definition often requires making decisions that harm individuals without any clear compensating benefits.

To help build intuition for this phenomenon, we consider the empirical distribution of diabetes risk among White and Asian patients. Following Aggarwal et al. (2022), we base our risk estimates on age, BMI, and race, using a sample of approximately 15,000 U.S. adults aged 18-70 interviewed as part of the National Health and Nutrition Examination Survey (NHANES; Centers for Disease Control and Prevention, 2011-2018). The resulting risk distributions are shown in the left-hand panel of Figure 2. The dashed vertical lines show the group means, and indicate that the incidence of diabetes is higher among Asian Americans (11%) than among White Americans (9%).¹⁶ This difference in base rates is also reflected in the heavier tail of the risk distribution among Asian individuals.

Figure 2: A graphical illustration of the incompatibility between threshold policies and classification parity, based on the National Health and Nutrition Examination Survey. Left: The distribution of diabetes risk for White Americans and Asian Americans, with the dashed vertical lines corresponding to the overall incidence rate within each group. At a screening threshold of 1.5% (indicated by the solid black line), the screening rate for Asian Americans is higher than for White Americans, violating demographic parity. Right: The distribution of diabetes risk among individuals who do not have diabetes. Since the proportion of Asian Americans above the screening threshold is greater than the proportion of White Americans above the threshold, the false positive rate for Asian Americans is greater than the false positive rate for White Americans.

Drawing on recommendations from the United States Preventive Services Task Force, Aggarwal et al. (2022) suggest screening patients with at least a 1.5% risk of diabetes, irrespective of race. We depict this risk threshold by the solid black vertical line in the plot. Based on that recommendation, 81% of Asian Americans and 69% of White Americans are to the right of the threshold and should be screened, violating demographic parity. If, hypothetically, we were to raise the screening threshold to 2.2% for Asian Americans and lower the threshold to 1% for White Americans, 75% of people in both groups would be screened, satisfying demographic parity.¹⁷ The cost of doing so, however, would be failing to screen some Asian Americans who have a relatively high risk of diabetes, and subjecting some relatively low-risk White Americans to a procedure that is medically inadvisable given their low likelihood of having diabetes. In an effort to satisfy demographic parity, we would have harmed members from both groups.

16. The precise shapes of the risk distributions depend on the set of covariates used to estimate outcomes, but the means of the distributions correspond to the overall incidence of diabetes in each group, and, in particular, are unaffected by the choice of covariates. It is thus necessarily the case that the risk distributions will differ across groups in this example, regardless of which covariates are used.
17. Corbett-Davies et al. (2017) show that group-specific threshold policies are utility-maximizing under the constraint of satisfying various notions of classification parity, including demographic parity and equality of false positive rates.
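The incompatibility illustrated in Figure 2 is straightforward to reproduce in simulation. In the sketch below, the beta parameters are invented to roughly match the 9% and 11% group incidence rates reported in the text; they are not fit to the NHANES data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two groups whose risk distributions have different means, screened at the
# same 1.5% threshold, in the spirit of Figure 2.
n = 200_000
risk = {"White": rng.beta(0.6, 6.0, size=n),   # mean risk ~ 9%
        "Asian": rng.beta(0.6, 4.9, size=n)}   # mean risk ~ 11%, heavier tail

t = 0.015
for group, r in risk.items():
    screened = r > t                  # single threshold policy for everyone
    y = rng.uniform(size=n) < r       # realized diabetes status
    print(group,
          f"screening rate = {screened.mean():.0%}",   # demographic parity fails
          f"FPR = {screened[~y].mean():.0%}")          # equalized FPRs fail
```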
This example illustrates a similar incompatibility between threshold policies and equalized false positive rates. In our setting, the false positive rate for a group is the screening rate among those in the group who do not in reality have diabetes. To visualize the race-specific false positive rates, the right-hand panel of Figure 2 shows the distribution of diabetes risk among those individuals who do not have diabetes. (Because the overall prevalence of diabetes is low, the conditional distribution displayed in the right-hand panel is nearly identical to the unconditional distribution displayed in the left-hand panel.) The false positive rate for each group is the proportion of people in the group falling to the right of the 1.5% screening threshold. In this case, the false positive rate is 79% for Asian Americans and 67% for White Americans, violating equalized false positive rates. As before, we could alter the screening guidelines to equalize false positive rates, but doing so requires deviating from our threshold policy, in which case we would end up screening some individuals who are relatively low-risk and not screening others who are relatively high-risk.

In this example, the incompatibility between threshold policies and classification parity stems from the fact that the risk distributions differ across groups. This general phenomenon is known as the problem of inframarginality in the economics and statistics literature, and has long been known to plague tests of discrimination in human decisions (Anwar and Fang, 2006; Ayres, 2002; Carr and Megbolugbe, 1993; Engel and Tillyer, 2008; Galster, 1993; Knowles et al., 2001; Pierson et al., 2018; Simoiu et al., 2017). Common legal and economic understandings of fairness are concerned with what happens at the margin (e.g., whether the same standard is applied to all individuals), a point we return to in Section 5. What happens at the margin also determines whether decisions maximize social welfare, with the optimal threshold set at the point where the marginal benefits equal marginal costs. However, popular error metrics assess behavior away from the margin, which is why they are called infra-marginal statistics. As a result, when risk distributions differ, standard error metrics are often poor proxies for individual equity or social well-being.

In general, we expect any two non-random subgroups of a population to differ on a variety of social and economic dimensions, which in turn is likely to yield risk distributions that differ across groups. As a result, as our running diabetes example shows, the optimal decision policy (which maximizes each patient's own well-being) will likely violate various measures of classification parity. Thus, to the extent that formal measures of fairness are violated, that tells us more about the shapes of the risk distributions than about the quality of decisions or the utility delivered to members of any group.

This intuition can be made precise, in the sense that for almost every risk distribution, the optimal decision policy violates the various notions of classification parity considered here. The notion of "almost every" distribution that we use here was formalized by Christensen (1972), Hunt et al. (1992), Anderson and Zame (2001), and others (cf. Ott and Yorke, 2005, for a review). Suppose, for a moment, that combinations of covariates and outcomes take values in a finite set of size $m$.
Then the space of joint distributions on covariates and outcomes can be represented by the unit $(m-1)$-simplex:

$$\Delta^{m-1} = \Big\{\, p \in \mathbb{R}^m \;\Big|\; p_i \ge 0 \text{ and } \sum_{i=1}^m p_i = 1 \,\Big\}.$$

Since $\Delta^{m-1}$ is a subset of an $(m-1)$-dimensional hyperplane in $\mathbb{R}^m$, it inherits the usual Lebesgue measure on $\mathbb{R}^{m-1}$. In this finite-dimensional setting, "almost every" distribution means a subset of distributions that has full Lebesgue measure on the simplex. Given a property that holds for almost every distribution in this sense, that property holds almost surely under any probability distribution on the space of distributions that is described by a density on the simplex. We use a generalization of this basic idea that extends to infinite-dimensional spaces, allowing us to consider distributions with arbitrary support. (See the Appendix for further details.)

Theorem 9 Let $t$ be the optimal decision threshold, as in Eq. (9). If $0 < t < 1$, then for almost every collection of group-specific risk distributions which have densities on $[0, 1]$, no utility-maximizing decision policy satisfies demographic parity or equalized false positive rates.

The proof of Theorem 9, which formalizes the informal discussion above, is given in Appendix F.6. At a high level, the constraints of classification parity are sensitive to even small perturbations in the underlying risk distributions. As a result, any particular collection of risk distributions is unlikely to satisfy the constraints.

For simplicity, we have been considering settings in which the decision $D$ does not impact the outcome $Y$. However, this basic style of argument extends to causal settings, showing that threshold policies are almost surely, in the measure-theoretic sense, incompatible with counterfactual predictive parity, counterfactual equalized odds, and conditional principal fairness (definitions of fairness that we consider in depth in Section 4), in the more complex setting of having a budget $b < 1$.
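Theorem 9's "almost every" claim has a simple finite-dimensional analogue that can be probed numerically: draw pairs of discrete risk distributions uniformly from the simplex and check whether a fixed threshold policy ever satisfies demographic parity exactly. The support points and threshold below are arbitrary choices of ours.

```python
import numpy as np

rng = np.random.default_rng(0)

risks = np.linspace(0.05, 0.95, 10)  # support of the discrete risk distribution
above = risks > 0.3                  # fixed decision threshold t = 0.3

exact_parity = 0
for _ in range(100_000):
    # Two group-specific risk distributions drawn uniformly from the 9-simplex.
    p0, p1 = rng.dirichlet(np.ones(len(risks)), size=2)
    # Demographic parity holds iff both groups place identical mass above t.
    if np.isclose(p0[above].sum(), p1[above].sum(), atol=1e-12):
        exact_parity += 1
print(exact_parity)  # essentially always 0: exact parity is a measure-zero event
```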
In particular, applying a nominal 1.5% screening threshold under the race-blind model amounts to effectively applying a 1% The Measure and Mismeasure of Fairness 0% 1% 2% 3% 4% 5% Race blind risk score Diabetes rate (a) Diabetes 1 2 3 4 5 6 7 8 9 10 Gender blind COMPAS score Recidivism rate (b) Recidivism Figure 3: Calibration plots showing the effect of removing protected attributes from risk models when estimating the risk of diabetes (left) and recidivism (right). Because Asian patients with the same BMI and age have a higher rate of diabetes, the race-blind model underestimates their risk of having diabetes. Similarly, because women reoffend at lower rates than men with similar criminal histories, the gender-blind COMPAS score overstates the recidivism risk for women. screening threshold to White patients and a 3% screening threshold to Asian patients. Thus, by using race-blind risk scores, we subject relatively low-risk White patients to screening, and fail to screen Asian patients who have a relatively high risk for having diabetes. A raceaware model would ensure that nominal risk thresholds correspond to observed incidence rates across race groups. This phenomenon which we call miscalibration across subgroups is not unique to diabetes screening. Consider, for instance, the case of pretrial recidivism predictions. Shortly after an individual is arrested in the United States, a judge must often determine conditions of release pending future court proceedings. In many jurisdictions across the country, these pretrial decisions are informed by statistical risk estimates of the likelihood the individual would be arrested or convicted of a future crime if released. After adjusting for factors such as criminal history, age, and substance use, women have been found to reoffend less often than men in many jurisdictions (De Michele et al., 2018; Skeem et al., 2016). Consequently, gender-blind risk assessments are miscalibrated, meaning that they tend to overstate the recidivism risk of women and understate the recidivism risk of men. Figure 3b illustrates this point, plotting the observed recidivism rate for men and women in Broward County, Florida as a function of their gender-blind COMPAS risk scores a commonly used risk assessment tool (Bao et al., 2021). In particular, women with a COMPAS score of seven recidivate about 55% of the time, whereas men with the same score recidivate about 65% of the time. Said differently, women with a score of seven Corbett-Davies, Gaebler, Nilforoshan, Shroff, and Goel recidivate approximately as often as men with a score of five, and this two-point differential persists across the range of scores. By acknowledging the predictive value of gender in this setting, one could create a decision rule that detains fewer people (particularly women) while achieving the same public safety benefits. Conversely, by ignoring this information and basing decisions solely on the gender-blind risk assessments, one would effectively be subjecting women to a more stringent risk standard and potentially harsher penalties than men. As in the case of classification parity, one cannot typically remove protected attributes from the risk predictions without decreasing utility (cf. Manski et al., 2022); however, the reduction in utility is not always as large as one might expect (Coots et al., 2023). 
In concrete terms, in our running diabetes example, basing decisions on race-blind risk estimates necessarily means screening some patients who would have preferred not to be screened had they been given race-aware risk estimates, and, conversely, not screening some patients who would have preferred to be screened had they been given the more complete estimates. We state this result formally below.

Theorem 10 Suppose $0 < t^* < 1$, where $t^*$ is the optimal decision threshold on the risk scale, as in Eq. (9). Let $\pi : \mathcal{X}_u \times \mathcal{A} \to \mathcal{X}_u$ denote restriction to the unprotected covariates. Let $\rho(x) = \Pr(Y = 1 \mid \pi(X) = \pi(x))$ denote the risk estimated using the blinded covariates. Suppose that $r(x)$ and $\rho(x)$ have densities on $[0, 1]$ that are positive in a neighborhood of $t^*$. Further suppose that there exists $\epsilon > 0$ such that the conditional variance $\mathrm{Var}(r(X) \mid \rho(X)) > \epsilon$ a.s., where $r(x)$ is the risk estimated from the full set of covariates. Then no blind policy is utility-maximizing.

The proof of Theorem 10 is given in Appendix D. In short, when race, gender, or other protected traits add predictive value (a condition codified in our assumption that the conditional variance be greater than $\epsilon$), excluding these attributes will in general decrease utility, both for individuals and in the aggregate.

Basing decisions on blinded risk scores can harm individuals and communities, for example by failing to flag relatively high-risk Asian patients for diabetes screening. But it is also important to consider potential harms stemming from the use of race- and gender-specific risk tools. In medicine, for instance, one might worry that race-specific risk assessments could encourage doctors and the public-at-large to advance spurious and pernicious arguments about inherent differences between race groups. In reality, the differences in diabetes risk we see are likely due to a complex mix of factors, both environmental and genetic, and should not be misinterpreted as indicating any causal effects of race. Indeed, even race itself is a thorny, socially influenced concept that elides easy definition. Similarly, the use of gender-specific recidivism estimates could reduce trust in the criminal justice system, giving the impression that individuals are held to different standards based on their gender. (Though, as we have seen above, blinded risk assessments can likewise, and perhaps more persuasively, be said to subject individuals to different standards based on their race and gender.) In some circumstances, race- and gender-specific risk estimates are even prohibited by law, a topic we return to in Section 5.1. For these reasons, risk assessments in medicine, criminal justice, and beyond have generally avoided using race, gender, and other sensitive demographic attributes. Ultimately, when constructing risk assessment tools, it is important to acknowledge and carefully balance both the costs and benefits of blinding in any given circumstance.

3.3.2 Counterfactual and Path-Specific Fairness

As discussed in Section 2, counterfactual and path-specific fairness are generalizations of simple blinding that attempt to account for both the direct and indirect effects of protected attributes on decisions. Because the constraints are more stringent, the resulting decrease in utility is proportionally greater. In particular, in some common settings, path-specific fairness with $W = X$ constrains decisions so severely that the only allowable policies are constant (i.e., $d(x_1) = d(x_2)$ for all $x_1, x_2 \in \mathcal{X}$).
For instance, in our running admissions example, path-specific fairness requires admitting all applicants with the same probability, irrespective of academic preparation or group membership.

To build intuition for this result, we sketch the argument for a finite covariate space $\mathcal{X}$. Given a policy $d$ that satisfies path-specific fairness, select $x^* \in \arg\max_{x \in \mathcal{X}} d(x)$. By the definition of path-specific fairness, for any $a \in \mathcal{A}$,
$$d(x^*) = \mathbb{E}[D_{\Pi,A,a} \mid X = x^*] = \sum_{x \in \alpha^{-1}(a)} d(x) \Pr(X_{\Pi,A,a} = x \mid X = x^*). \tag{10}$$
That is, the probability of an individual with covariates $x^*$ receiving a positive decision must be the average probability of individuals with covariates $x$ in group $a$ receiving a positive decision, weighted by the probability that an individual with covariates $x^*$ in the real world would have covariates $x$ counterfactually. Next, we suppose that there exists an $a' \in \mathcal{A}$ such that $\Pr(X_{\Pi,A,a'} = x \mid X = x^*) > 0$ for all $x \in \alpha^{-1}(a')$. In this case, because $d(x) \leq d(x^*)$ for all $x \in \mathcal{X}$, Eq. (10) shows that in fact $d(x) = d(x^*)$ for all $x \in \alpha^{-1}(a')$. Now, let $x'$ be arbitrary. Again, by the definition of path-specific fairness, we have that
$$d(x') = \mathbb{E}[D_{\Pi,A,a'} \mid X = x'] = \sum_{x \in \alpha^{-1}(a')} d(x) \Pr(X_{\Pi,A,a'} = x \mid X = x') = \sum_{x \in \alpha^{-1}(a')} d(x^*) \Pr(X_{\Pi,A,a'} = x \mid X = x') = d(x^*),$$
where we use in the third equality the fact that $d(x) = d(x^*)$ for all $x \in \alpha^{-1}(a')$, and in the final equality the fact that $X_{\Pi,A,a'}$ is supported on $\alpha^{-1}(a')$. Theorem 11 formalizes and extends this argument to more general settings, where $\Pr(X_{\Pi,A,a'} = x \mid X = x')$ is not necessarily positive for all $x \in \alpha^{-1}(a')$. The proof of Theorem 11 is in the Appendix, along with extensions to continuous covariate spaces and a more complete characterization of Π-fair policies for finite $\mathcal{X}$.

Theorem 11 Suppose $\mathcal{X}$ is finite and $\Pr(X = x) > 0$ for all $x \in \mathcal{X}$. Suppose $Z = \zeta(X)$ is a random variable such that:
1. $Z = Z_{\Pi,A,a}$ for all $a \in \mathcal{A}$;
2. $\Pr(X_{\Pi,A,a'} = x' \mid X = x) > 0$ for all $x, x' \in \mathcal{X}$ such that $\zeta(x) = \zeta(x')$, where $a = \alpha(x)$ and $a' = \alpha(x')$.
Then, for any Π-fair policy $d$ with $W = X$, there exists a function $f$ such that $d(X) = f(Z)$, i.e., $d$ is constant across individuals having the same value of $Z$.

The first condition of Theorem 11 holds for any reduced set of covariates $Z$ that is not causally affected by changes in $A$ (e.g., when $Z$ is not a descendant of $A$). The second condition requires that, among individuals with covariates $x$, a positive fraction have covariates $x'$ in a counterfactual world in which they belonged to another group $a'$. Because $\zeta(x)$ is the same in the real and counterfactual worlds (since, by the first condition, $Z$ is unaffected by $A$), we only consider $x'$ such that $\zeta(x') = \zeta(x)$ in the second condition.

In our admissions example, this result shows that, under mild conditions, causally fair policies require admitting all applicants with equal probability. In particular, suppose that among students with a given test score, a positive fraction achieve any other test score in the counterfactual world in which their race is altered, as, for instance, we might expect if the individual-level causal effects are drawn from an (appropriately discretized) normal distribution. In this case, the empty set of reduced covariates, formally encoded by setting $\zeta$ to a constant function, satisfies the conditions of Theorem 11. The theorem then implies that under any Π-fair policy, every applicant is admitted with equal probability.
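The argument above has a convenient linear-algebraic reading (cf. the Markov chain perspective developed in Appendix G.2): collecting the counterfactual probabilities into a row-stochastic matrix $P$, a Π-fair policy must satisfy $d = Pd$, and when $P$ has strictly positive entries the only solutions are constant vectors. A small numerical check, with a hypothetical transition matrix:

```python
# A numerical check of the intuition behind Theorem 11, via the Markov
# chain perspective of Appendix G.2: if P[x, x'] collects the (all-positive)
# counterfactual transition probabilities, the constraint d = P d forces
# the policy d to be constant. The matrix here is hypothetical.
import numpy as np

rng = np.random.default_rng(2)

# Random strictly positive row-stochastic counterfactual transition matrix.
P = rng.uniform(0.1, 1.0, size=(5, 5))
P /= P.sum(axis=1, keepdims=True)

# Solutions of (I - P) d = 0 form the null space of I - P; compute via SVD.
u, s, vt = np.linalg.svd(np.eye(5) - P)
null_space = vt[s < 1e-10]  # rows spanning the solution space

print(null_space)  # a single row, proportional to the all-ones vector
```

By the Perron-Frobenius theorem, the null space is one-dimensional and spanned by the all-ones vector, so every Π-fair policy assigns the same admission probability to all covariate values.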
(We motivated our admissions example by assuming that only a fraction $b < 1$ of applicants could be admitted; however, Theorem 11 holds irrespective of the budget, and, in particular, when $b = 1$, and so we discuss this result together with our others on unconstrained decision making as a natural extension of blinding.)

Even when decisions are not perfectly uniform lotteries, Theorem 11 suggests that enforcing Π-fairness can lead to unexpected outcomes. For instance, suppose we modify our admissions example to additionally include age as a covariate that is causally unconnected to race, as some past work has done. In that case, Π-fair policies would admit students based on their age alone, irrespective of test score or race. Although in some cases such restrictive policies might be desirable, this strong structural constraint implied by Π-fairness appears to be a largely unintended consequence of the mathematical formalism.

The conditions of Theorem 11 are relatively mild, but do not hold in every setting. Suppose that in our admissions example it were the case that $T_{\Pi,A,a_0} = T_{\Pi,A,a_1} + c$ for some constant $c$; that is, suppose the effect of intervening on race is a constant change to an applicant's test score. Then the second condition of Theorem 11 would no longer hold for a constant $\zeta$. Indeed, any multiple-threshold policy in which $t_{a_0} = t_{a_1} + c$ would be Π-fair. In practice, though, such deterministic counterfactuals would seem to be the exception rather than the rule. For example, it seems reasonable to expect that test scores would depend on race in complex ways that induce considerable heterogeneity. Lastly, we note that $W \neq X$ in some variants of path-specific fairness (e.g., Nabi and Shpitser, 2018; Zhang and Bareinboim, 2018), in which case Theorem 11 does not apply. Even so, path-specific fairness in that case is still typically incompatible with optimal decision making, as shown in Theorem 17.

4. Equitable Decisions in the Presence of Externalities

We have thus far considered cases where there is largely agreement on the utility of different decision policies. In that setting, we showed that maximizing utility is at odds with various mathematical formalizations of fairness. We further argued that these results illustrate weaknesses in the formalizations themselves, since deviating from utility-maximizing policies in that setting can harm both individuals and groups, as seen in our diabetes screening example. Agreement on the utility, however, is perhaps the exception rather than the rule. One could indeed argue that the value of mathematical formalizations of fairness is their ability to arbitrate between competing definitions of utility. Here we critically examine that perspective. We show, in analogy to our previous results, that even when it is unclear how to balance competing priorities, enforcing existing fairness constraints typically leads to worse outcomes on each dimension. For instance, in our running college admissions example, policies constrained to satisfy various fairness definitions will typically require admitting a student body that is both less academically prepared and less diverse, relative to alternative policies that violate these mathematical fairness definitions. We start, in Section 4.1, by examining our college admissions example in detail, illustrating in geometric terms how existing fairness definitions can lead to problematic admissions policies.
Then, in Section 4.2, we develop our formal theory of equitable decision making in the presence of externalities. The mathematics necessary to establish our key results is significantly deeper than what we have needed thus far, but our high-level message is the same: enforcing several formal notions of fairness leads to policies that can paradoxically harm the very groups they were designed to protect.

4.1 The Geometry of Fair Decision Making

To build intuition about the limitations of popular definitions of fairness, we return to our running example of college admissions. In that setting, we imagine an admissions committee debating the merits of different admissions policies. In particular, we imagine disagreement within the committee over how best to balance two competing objectives: academic preparation (operationalized, e.g., in terms of the high school grades and standardized test scores of admitted students) and class diversity (e.g., the number of admitted applicants from marginalized groups). We assume that our hypothetical committee members all agree that more (total) academic preparedness and more class diversity are better. Thus, in the absence of any resource constraints (with $b = 1$, as is approximated in some online courses), the university could admit all applicants, maximizing both the number of admitted students from marginalized groups and the total academic preparedness of the admitted class. But given limits on the number of students who can be admitted (i.e., $b < 1$), one must make difficult choices about whom to admit, with reasonable and expected disagreement over how much to trade one dimension for another. The trade-offs in decision making are most acute when the budget $b < 1$, and for this reason we focus here on that case.

In light of these trade-offs, one might turn to the myriad formal fairness criteria we have discussed to ensure admissions decisions are equitable. Many of the fairness definitions we consider make reference to a distinguished outcome $Y$. In our example, we can imagine this outcome corresponds to college degree attainment, an ex post measure of academic preparedness. In the case of causal fairness definitions, we could take $Y(1)$ to mean degree attainment if the student were admitted, and $Y(0)$ to be degree attainment if the student were not admitted, with the understanding that a student who is not admitted could potentially attend and graduate from another university. For example, satisfying counterfactual predictive parity requires that among rejected applicants, the proportion who would have attained a college degree, had they been accepted, is equal across race groups. In these cases, we imagine academic preparedness is some student-level measure that connects the observables $X$, upon which the committee must make their admissions decisions, to (potential) outcomes $Y$. For example, the academic index $m(x)$ might be a prediction of $Y(1)$ given $X$ based on historical data, or, more generally, could encode committee preferences for both academic preparation and participation in extracurricular activities, among other factors.

The key point of our informal discussion thus far is that we assume committee members would like to enact an admissions policy $d$ that balances two competing objectives. First, they would like a policy that leads to large $m(x)$, i.e., they would like $\mathbb{E}[m(X)\, d(X)]$ to be big, where $m(x)$ is some quantity that may, for example, encode academic preparedness and other preferences.
Second, the committee would like large diversity, i.e., they would like $\mathbb{E}[\mathbb{1}_{\alpha(X) = a_1}\, d(X)]$ to be big, where $a_1$ corresponds to some target group of interest. All committee members would like more of each dimension, but, given the budget constraint, it is in general impossible to maximize both dimensions simultaneously, leading to the inherent trade-offs we consider in this section.

We now explore the consequences of imposing additional fairness constraints on our college admissions example, as given by the causal DAG in Figure 1, via a simulation study of one million hypothetical applicants, for one quarter of whom ($b = 1/4$) seats are allocated. In particular, in the hypothetical pool of applicants we consider, applicants in the target race group $a_1$ have, on average, fewer educational opportunities than applicants in group $a_0$, which leads to lower average academic preparedness, as well as lower average test scores. We define the academic index $m(x)$ of applicants to be the estimated probability that an applicant will graduate if admitted, based on their observed test score and race. See Appendix C for additional details, including the specific structural equations we use in the simulation.

Each of the panels in Figure 4 illustrates the geometry of fairness constraints for five different formal notions of fairness described in Section 2: counterfactual fairness, path-specific fairness, principal fairness, counterfactual equalized odds, and counterfactual predictive parity. The vertical axis of each panel corresponds to aggregate academic index and the horizontal axis to the number of admitted applicants from the target group. The purple lines trace out the boundary of the set of feasible policies, with points on or below the curves achievable by policies that adhere to the budget constraint. Policies lying strictly below the purple curves (or, similarly, on the dashed segments of the purple curves) are Pareto dominated, meaning that one can find feasible alternatives that are larger on both of the depicted axes (i.e., academic index and diversity). Since we have assumed committee members prefer higher values on each dimension, their effective choice set consists of those policies on the solid purple segments, the Pareto frontier.

Figure 4: The geometry of fairness constraints, in an illustrative example of college admissions. Each of the four panels (principal fairness; counterfactual fairness / path-specific fairness; counterfactual predictive parity; counterfactual equalized odds) plots the academic index of the admitted class against the number of admitted applicants from the target group. Points under the purple curve correspond to all feasible policies given the budget constraint, whereas the shaded regions correspond to feasible policies that satisfy the various formal definitions of fairness. (For path-specific fairness, we set Π equal to the single path $A \to E \to T \to D$ highlighted in Figure 1, and set $W = X$.) For each definition, the constrained policies lie strictly under the purple curve, meaning there are alternative, unconstrained, feasible policies that simultaneously achieve greater student-body diversity (indicated by the x-axis) and greater academic preparedness (indicated by the y-axis). The solid segments of the purple lines correspond to policies on the Pareto frontier, for which one cannot simultaneously increase both diversity and academic preparedness. The points labeled "random" in the upper left-hand corner correspond to policies generated by random lotteries in which each individual is admitted with equal probability, in accordance with Theorem 11.

Committee members may still disagree over which policy on the frontier to adopt. But for any policy not on the frontier, there is a feasible policy above and to the right of it, which is thus preferred by every member of the committee.
Finally, the shaded regions indicate the set of feasible policies constrained to satisfy each of the fairness definitions. (In Appendix B, we show that these feasibility regions can be computed by solving a series of linear programs.) In each case, the constrained regions do not intersect the Pareto frontier, and so there is an alternative, unconstrained feasible policy that simultaneously achieves more student-body diversity and an overall higher academic index. For example, in the case of policies satisfying counterfactual or path-specific fairness, shown in the upper left panel, the set of feasible policies lies on a single line segment. That structure follows from Theorem 11, since the only policies satisfying either of these notions of fairness in our setting are ones that admit all students with a constant probability, irrespective of their covariates. While not as extreme, the other fairness definitions similarly restrict the space of feasible policies in severe ways, as shown in the remaining panels. These results illustrate that constraining decision-making algorithms to satisfy popular definitions of fairness can have unintended consequences, and may even harm the very groups they were ostensibly designed to help.

Our discussion in this section aimed to highlight the geometry of fair decision policies and their consequences in the context of a simple motivating example. We next show that these qualitative findings are guaranteed to hold much more generally.
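Before turning to the formal theory, we note that the linear programs mentioned above are simple to set up. The sketch below, over a small hypothetical pool of applicant types, traces the upper boundary of the feasible region by maximizing the academic index subject to the budget and a minimum diversity level; it is a toy version of the computation described in Appendix B, not the code used to produce Figure 4.

```python
# A minimal sketch of the linear programs referenced above (cf. Appendix B):
# for a grid of diversity levels, maximize total academic index subject to
# the budget, tracing out the upper boundary of the feasible region. The
# applicant pool below is hypothetical.
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(3)
n, b = 200, 0.25                       # applicant types and admissions budget
m = rng.uniform(0.2, 0.9, size=n)      # academic index m(x)
a1 = rng.integers(0, 2, size=n)        # 1 if the applicant is in the target group
w = np.full(n, 1 / n)                  # population weight of each type

for delta in [0.02, 0.06, 0.10]:       # required E[1{A = a1} d(X)]
    res = linprog(
        c=-w * m,                              # maximize E[m(X) d(X)]
        A_ub=np.vstack([w, -w * a1]),          # budget and diversity rows
        b_ub=np.array([b, -delta]),
        bounds=[(0, 1)] * n,                   # d(x) is a probability
    )
    print(f"diversity >= {delta:.2f}: academic index = {-res.fun:.4f}")
```

Adding the linear constraints corresponding to a fairness definition to this program yields one point of the corresponding shaded region.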
4.2 A Formal Theory of Fairness in the Presence of Externalities

Our simulation above showed that policies satisfying one of the mentioned fairness definitions are suboptimal, in the sense that they constrain one to a portion of the feasible region in which policies could be improved along both dimensions of interest. As was the case in the absence of trade-offs in Section 3.2, the phenomenon occurring in our simulation holds much more generally. To understand why, we begin by isolating and formalizing the relevant mathematical properties of our example.

To generalize our setting in Section 3, we consider arbitrary utility functions of the form $u : \mathcal{X} \to \mathbb{R}$. As before, for a function $u$ and decision policy $d$, we write $u(d) = \mathbb{E}[d(X)\, u(X)]$ to denote the expected utility of decision policy $d(x)$ under the utility $u$. An important constraint on the admissions committee was the fact that their admissions decisions could not, in expectation, exceed the budget.

Definition 12 For a budget $b$, we say a decision policy $d(x)$ is feasible if $\mathbb{E}[d(X)] \leq b$.

A key feature of the college admissions example is that, despite some level of uncertainty regarding the true utility (i.e., exactly how to trade off between its objectives), the committee knows what its objectives are: to increase the academic index and diversity of the incoming class. One way to encode this kind of uncertainty is to consider a set $\mathcal{U}$ consisting of all reasonable ways of trading off between the objectives. While the utilities need not be the same, they should be consistent, in the sense that, conditional on an applicant's group membership, all of the utilities should agree that a higher academic index is better.

Definition 13 We say that a set of utilities $\mathcal{U}$ is consistent modulo $\alpha$ if, for any $u, u' \in \mathcal{U}$:
1. For any $x$, $\mathrm{sign}(u(x)) = \mathrm{sign}(u'(x))$;
2. For any $x_1$ and $x_2$ such that $\alpha(x_1) = \alpha(x_2)$, $u(x_1) > u(x_2)$ if and only if $u'(x_1) > u'(x_2)$.

A second relevant feature of the admissions problem is that certain policies were strictly better from the admissions committee's perspective, despite their uncertainty about the exact form of their utility. The notion that one policy is better than another regardless of the exact form of the utility is formalized by Pareto dominance.

Definition 14 Suppose $\mathcal{U}$ is a collection of utility functions. A decision policy $d$ is Pareto dominated if there exists a feasible alternative $d'$ such that $u(d') \geq u(d)$ for all $u \in \mathcal{U}$, and there exists $u' \in \mathcal{U}$ such that $u'(d') > u'(d)$. A policy $d$ is strongly Pareto dominated if there exists a feasible alternative $d'$ such that $u(d') > u(d)$ for all $u \in \mathcal{U}$. A policy $d$ is Pareto efficient if it is feasible and not Pareto dominated, and the Pareto frontier is the set of Pareto efficient policies.

As discussed above and in Section 3.1, in the absence of trade-offs, optimal decision policies take the simple form of threshold policies. The existence of trade-offs broadens the range of forms a Pareto efficient policy can take. Even so, for consistent collections of utilities, the Pareto efficient policies take a closely related form.

Proposition 15 Suppose $\mathcal{U}$ is a set of utilities that is consistent modulo $\alpha$. Then any Pareto efficient decision policy $d$ is a multiple-threshold policy. That is, for any $u \in \mathcal{U}$, there exist group-specific constants $t_a \geq 0$ such that, a.s.:
$$d(x) = \begin{cases} 1 & u(x) > t_{\alpha(x)}, \\ 0 & u(x) < t_{\alpha(x)}. \end{cases} \tag{11}$$

The proof of Proposition 15 is in the Appendix.^{18}

18. In the statement of the proposition, we do not specify what happens at the thresholds $u(x) = t_{\alpha(x)}$ themselves, as one can typically ignore the exact manner in which decisions are made at the threshold. Specifically, given a multiple-threshold policy $d$, we can construct a standardized multiple-threshold policy $d'$ that is constant within group at the threshold (i.e., $d'(x) = c_{\alpha(x)}$ when $u(x) = t_{\alpha(x)}$), and for which: (1) $\mathbb{E}[d'(X) \mid A] = \mathbb{E}[d(X) \mid A]$; and (2) $u(d') = u(d)$. In our running example, this means we can standardize multiple-threshold policies so that applicants at the threshold are admitted with the same group-specific probability.
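In code, a multiple-threshold policy is simple to express. The sketch below, with hypothetical utilities and thresholds, implements Eq. (11), together with the group-specific lottery at the threshold described in footnote 18.

```python
# A minimal sketch of a (standardized) multiple-threshold policy, Eq. (11):
# admit when u(x) exceeds the group-specific threshold, reject below it,
# and admit with a group-specific probability c_a exactly at the threshold.
# All utilities and thresholds here are hypothetical.
import numpy as np

def multiple_threshold_policy(u_x, group, t, c):
    """u_x: utilities u(x); group: group labels alpha(x);
    t, c: dicts mapping each group to its threshold and tie probability."""
    t_x = np.array([t[g] for g in group])
    c_x = np.array([c[g] for g in group])
    d = np.where(u_x > t_x, 1.0, 0.0)
    return np.where(u_x == t_x, c_x, d)   # lottery exactly at the threshold

u_x = np.array([0.9, 0.4, 0.6, 0.6])
group = np.array(["a0", "a1", "a1", "a1"])
print(multiple_threshold_policy(u_x, group,
                                t={"a0": 0.5, "a1": 0.6},
                                c={"a0": 0.0, "a1": 0.5}))
# -> [1.  0.  0.5 0.5]
```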
4.2.1 Fairness Definitions with Many Constraints

All of the definitions we study in this section prominently feature causal quantities, but the important quality driving our analysis is that each definition imposes many constraints. For instance, counterfactual equalized odds requires that $\Pr(D = 1 \mid A = a, Y(1) = y) = \Pr(D = 1 \mid Y(1) = y)$ for every outcome $y$. Theorem 17 shows that for almost every joint distribution of $X$, $Y(0)$, and $Y(1)$ such that $u(X)$ has a density, any feasible decision policy satisfying counterfactual equalized odds or conditional principal fairness is Pareto dominated. Similarly, for almost every joint distribution of $X$ and $X_{\Pi,A,a}$, we show that feasible policies satisfying path-specific fairness, including counterfactual fairness, are Pareto dominated. (The analogous statements for counterfactual predictive parity, equalized false positive rates, and demographic parity are not true; we return to this point in Section 4.2.2.) That is, we show that, for a typical joint distribution, any feasible policy satisfying the fairness definitions enumerated above cannot have the form of a multiple-threshold policy.

To prove this result, we make relatively mild restrictions on the set of distributions and utilities we consider to exclude degenerate cases, as formalized by Definition 16.

Definition 16 Let $\mathcal{G}$ be a collection of functions from $\mathcal{Z}$ to $\mathbb{R}^d$ for some set $\mathcal{Z}$. We say that a distribution of $Z$ on $\mathcal{Z}$ is $\mathcal{G}$-fine if $g(Z)$ has a density for all $g \in \mathcal{G}$.

In particular, $\mathcal{U}$-fineness ensures that the distribution of $u(X)$ has a density. In the absence of $\mathcal{U}$-fineness, corner cases can arise in which an especially large number of policies may be Pareto efficient, in particular when $u(X)$ has large atoms and $X$ can be used to predict the potential outcomes $Y(0)$ and $Y(1)$ even after conditioning on $u(X)$. See Proposition 72 in the Appendix for details.

Theorem 17 Suppose $\mathcal{U}$ is a set of utilities consistent modulo $\alpha$. Further suppose that for all $a \in \mathcal{A}$ there exist a $\mathcal{U}$-fine distribution of $X$ and a utility $u \in \mathcal{U}$ such that $\Pr(u(X) > 0, A = a) > 0$, where $A = \alpha(X)$. Then:
• For almost every $\mathcal{U}$-fine distribution of $X$ and $Y(1)$, any feasible decision policy satisfying counterfactual equalized odds is strongly Pareto dominated.
• If $|\mathrm{Img}(\omega)| < \infty$ and there exists a $\mathcal{U}$-fine distribution of $X$ such that $\Pr(A = a, W = w) > 0$ for all $a \in \mathcal{A}$ and $w \in \mathrm{Img}(\omega)$, where $W = \omega(X)$, then, for almost every $\mathcal{U}$-fine joint distribution of $X$, $Y(0)$, and $Y(1)$, any feasible decision policy satisfying conditional principal fairness is strongly Pareto dominated.
• If $|\mathrm{Img}(\omega)| < \infty$ and there exists a $\mathcal{U}$-fine distribution of $X$ such that $\Pr(A = a, W = w_i) > 0$ for all $a \in \mathcal{A}$ and some distinct $w_0, w_1 \in \mathrm{Img}(\omega)$, then, for almost every $\mathcal{U}_{\mathcal{A}}$-fine joint distribution of $A$ and the counterfactuals $X_{\Pi,A,a'}$, any feasible decision policy satisfying path-specific fairness is strongly Pareto dominated.^{19}

The proof of Theorem 17 is given in the Appendix. At a high level, the proof proceeds in three steps, which we outline below using the example of counterfactual equalized odds. First, we show that for almost every fixed $\mathcal{U}$-fine joint distribution $\mu$ of $X$ and $Y(1)$, there is at most one policy $d^*(x)$ satisfying counterfactual equalized odds that is not strongly Pareto dominated. To see why, note that for any specific $y_0$, since counterfactual equalized odds requires that $D \perp A \mid Y(1) = y_0$, setting the threshold for one group determines the thresholds for all the others; the budget constraint then can be used to fix the threshold for the original group. Second, we construct a "slice" around $\mu$ such that for any distribution $\nu$ in the slice, $d^*(x)$ is still the only policy that can potentially lie on the Pareto frontier while satisfying counterfactual equalized odds. We create the slice by strategically perturbing $\mu$ only where $Y(1) = y_1$, for some $y_1 \neq y_0$. This perturbation moves mass from one side of the thresholds of $d^*(x)$ to the other.
Due to inframarginality, this perturbation typically breaks the balance requirement $D \perp A \mid Y(1) = y_1$ for almost every $\nu$ in the slice. Finally, we appeal to the notion of prevalence to stitch the slices together, showing that for almost every distribution, any policy satisfying counterfactual equalized odds is strongly Pareto dominated. Analogous versions of this general argument apply to the cases of conditional principal fairness and path-specific fairness.^{20}

19. Here, $u_{\mathcal{A}} : (x_a)_{a \in \mathcal{A}} \mapsto (u(x_a))_{a \in \mathcal{A}}$, and $\mathcal{U}_{\mathcal{A}}$ is the set of $u_{\mathcal{A}}$ for $u \in \mathcal{U}$, i.e., component-wise application of $u$ to elements of $\mathcal{X}^{\mathcal{A}}$. In other words, the requirement is that the joint distribution of the $u(X_{\Pi,A,a'})$ has a density.

20. This argument does not depend in an essential way on the definitions being causal. In Corollary 70 in the Appendix, we show an analogous result for the non-counterfactual version of equalized odds.

We note that the conditions of Theorem 17 are sufficient, rather than necessary, meaning that the conclusion of the theorem may, and, indeed, we expect will, hold even in some cases where the conditions are not satisfied. In particular, we note that this proof technique prevents the conditions of Theorem 17 from holding when $A$ factors through $W$ and, in particular, when $W = X$. However, when $W = X$, Theorem 11 shows that, under slightly different conditions, a much stronger result holds.

To bring our discussion full circle, we now map Theorem 17 onto the motivation offered in Section 4.1. Recall that the admissions committee knew that, given the opportunity, it preferred policies that increased the overall academic index of its admitted class, and policies that resulted in more students being admitted from the target group. In other words, we imagine that members of the admissions committee have utilities $u^*$ of the form^{21}
$$u^*(d) = v\left(\mathbb{E}[m(X)\, d(X)],\ \mathbb{E}[\mathbb{1}_{\alpha(X) = a_1}\, d(X)]\right), \tag{12}$$
where, as above, $m(x)$ denotes the academic index of an applicant with covariates $X = x$, and $v$ increases in both coordinates. Corollary 18 establishes the inherent incompatibility of such preferences with the formal fairness criteria we have been considering.

Corollary 18 Consider a utility of the form given in Eq. (12), where $v$ is monotonically increasing in both coordinates and $m(x) \geq 0$. Then, under the same hypotheses as in Theorem 17,^{22} for almost every joint distribution, no utility-maximizing decision policy satisfies counterfactual equalized odds, conditional principal fairness, or path-specific fairness.

21. Strictly speaking, we are saying that members of the admissions committee, rather than having an aggregate utility which, as we have considered so far, has the form $\mathbb{E}[u(X)\, d(X)]$, have a utility on aggregate outcomes.

22. The full statement is given in Appendix F.7.

Lastly, while, in general, one's decision policy can depend only on the covariates known at the time of the decision, in some cases the restriction that $u(x)$ be a function of $x \in \mathcal{X}$ alone may be too restrictive; the connection between an individual having covariates $X = x$ and our utility may depend also on the relationship between $X$ and $Y$. For instance, in the admissions example, the admissions committee may value high test scores and extracurriculars not, e.g., as per se measures of academic merit, but rather instrumentally, insofar as they are connected to whether an applicant will eventually graduate. However, allowing $u$ to depend on both $x$ and $y$ greatly complicates the underlying geometry of the problem. Proving Theorem 17 in this more general setting remains an open problem. However, intuition from finite dimensions, where more powerful measure-theoretic tools are available, suggests that the result remains true in the more general setting.
For example, Proposition 19 presents a version of this result over a natural, finite-dimensional family of distributions.

Proposition 19 Suppose $\mathcal{A} = \{a_0, a_1\}$, and consider the family $\mathcal{U}$ of utility functions of the form $u(x) = r(x) + \lambda\, \mathbb{1}_{\alpha(x) = a_1}$, indexed by $\lambda \geq 0$, where $r(x) = \mathbb{E}[Y(1) \mid X = x]$. For almost every $(\alpha_0, \beta_0, \alpha_1, \beta_1) \in \mathbb{R}^4_{>0}$, if the conditional distributions of $r(X)$ given $A$ are beta distributed with $r(X) \mid A = a_i \sim \mathrm{Beta}(\alpha_i, \beta_i)$, then any policy satisfying counterfactual equalized odds is strongly Pareto dominated.

4.2.2 Fairness Definitions with Few Constraints

We conclude this analysis by considering equalized false positive rates, demographic parity, and counterfactual predictive parity. These fairness notions are less demanding than the notions considered above, in that they introduce only one additional constraint, e.g., that $Y(1) \perp A \mid D = 0$, in the case of counterfactual predictive parity. Since the budget introduces a second constraint, and the form of a multiple-threshold policy allows for a degree of freedom in each group, the number of constraints and the number of degrees of freedom are equal, as opposed to the causal fairness definitions covered by Theorem 17, in which the constraints outnumber the degrees of freedom. As such, it is possible in some instances to have a policy on the Pareto frontier that satisfies these conditions; though see Section 5.3 for discussion of why such policies are still often at odds with broader goals.

However, it is not always possible to find a point on the Pareto frontier satisfying these definitions. In Proposition 20, we show that policies satisfying counterfactual predictive parity cannot lie on the Pareto frontier in some common cases, including our example of college admissions. In that setting, when the target group has lower average graduation rates, a pattern that often motivates efforts to actively increase diversity, decision policies constrained to satisfy counterfactual predictive parity are Pareto dominated. The proof of the proposition is in Appendix H.3.

Proposition 20 Suppose $\mathcal{A} = \{a_0, a_1\}$, and consider the family $\mathcal{U}$ of utility functions of the form $u(x) = r(x) + \lambda\, \mathbb{1}_{\alpha(x) = a_1}$, indexed by $\lambda \geq 0$, where $r(x) = \mathbb{E}[Y(1) \mid X = x]$. Suppose the conditional distributions of $r(X)$ given $A$ are beta distributed, i.e., $r(X) \mid A = a \sim \mathrm{Beta}(\mu_a, v)$, with $\mu_{a_0} > \mu_{a_1}$ and $v > 0$.^{23} Then any policy satisfying counterfactual predictive parity is strongly Pareto dominated.

23. Here we parameterize the beta distribution in terms of its mean $\mu$ and sample size $v$. In terms of the common, alternative $\alpha$-$\beta$ parameterization, $\mu = \alpha/(\alpha + \beta)$ and $v = \alpha + \beta$.

5. A Path Forward

We have thus far worked to clarify some of the statistical limitations of existing mathematical definitions of fairness. We have argued that in many cases of interest, these definitions can ultimately do more harm than good, hurting even those individuals that these notions of fairness were ostensibly designed to help. We end on a more optimistic note, charting out a potential path toward designing more equitable algorithms. To do so, we start, in Section 5.1, by reviewing conceptions of discrimination in law and economics, and, in particular, we contrast process-oriented and outcome-oriented notions of fairness. Whereas the computer science literature is dominated by process-oriented, deontological definitions of fairness, we see more promise in adopting an outcome-oriented, consequentialist approach represented by the utilitarian analysis we have described above.
In Section 5.2, we enumerate and discuss four issues that we feel are critical in developing equitable algorithms: (1) balancing inherent trade-offs in decision problems; (2) assessing calibration; (3) selecting the inputs and targets of prediction; and (4) designing data collection strategies. Finally, in Section 5.3, we illustrate how to grapple with these considerations in a case study of complex medical care, motivated by work from Obermeyer et al. (2019).

5.1 Competing Notions of Ethical Decision Making: Process vs. Outcomes

There are many distinct but related understandings of ethical decision making in law, economics, philosophy, and beyond. One key dimension on which we organize these notions is the extent to which they consider the process through which decisions are made versus the outcomes that those decisions render.

The dominant legal doctrine of discrimination in the United States treats explicit race- and gender-based decisions with heightened scrutiny. The Equal Protection Clause of the U.S. Constitution's Fourteenth Amendment restricts government agencies from adopting policies that explicitly reference legally protected categories, and myriad federal and state disparate treatment statutes similarly constrain a variety of private actors. Conversely, policies that do not explicitly consider legally protected traits or obvious proxies are generally deemed not to violate disparate treatment principles. Formally, it is lawful to use legally protected attributes in a limited way to further a compelling government interest, but, in practice, such exceptions are few and far between. Until recently, the prime example of a race-conscious policy passing legal muster was affirmative action in college admissions (Fisher v. University of Texas, 2016). However, in 2023, the U.S. Supreme Court barred the explicit consideration of race in admissions decisions (SFFA v. Harvard, 2023).

Disparate treatment doctrine has evolved over time, and reflects ongoing debates about the role of classification (use of protected traits, a process-oriented, deontological notion) versus subordination (subjugation of disadvantaged groups, an outcome-oriented notion) in discrimination cases (Fiss, 1976). Some legal scholars have argued that courts, even when formally applying anti-classification criteria, are often sympathetic to the potential effects of judgments on social stratification, indicating tacit concern for anti-subordination (Balkin and Siegel, 2003; Colker, 1986; Siegel, 2003). Others, though, have noted that such judicial support for anti-subordination appears to be waning (Nurse, 2014). At a high level, we thus view modern disparate treatment law as primarily interested in process over outcomes, though these debates illustrate that the two concepts cannot be perfectly separated.

In contrast to process-oriented disparate treatment principles, the economics literature distinguishes between two outcome-focused, consequentialist rationales for explicitly considering race, gender, and other protected traits: taste-based and statistical.
With taste-based discrimination (Becker, 1957), decision makers act as if they have a preference, or "taste," for bias, sacrificing profit to avoid certain transactions. This includes, for example, an employer who forfeits financial gain by failing to hire exceptionally qualified minority applicants. But, in contrast to legal reasoning, the economic argument against taste-based discrimination is not that decisions are based on race per se, but rather that consideration of race leads to worse outcomes: a loss of profit. With statistical discrimination (Arrow, 1973; Phelps, 1972), decision makers explicitly consider protected attributes in order to optimally achieve some non-prejudicial goal. For example, profit-maximizing auto insurers may charge a premium to male drivers to account for gender differences in accident rates. Despite their differing outcome-based justifications, both taste-based and statistical discrimination are often considered legally problematic, as they explicitly consider race, in violation of disparate treatment laws.^{24}

As the above insurance example and our running diabetes example illustrate, one might consider it acceptable to base decisions in part on legally protected traits when doing so leads to good outcomes. Conversely, whereas process-oriented disparate treatment principles generally deem race-blind policies acceptable, one might declare such blind policies problematic if they lead to bad outcomes. Indeed, under the statutory disparate impact standard, a practice may be deemed discriminatory if it has an unjustified adverse effect on legally protected groups, even in the absence of explicit categorization (Barocas and Selbst, 2016).^{25} The disparate impact doctrine was formalized in Griggs v. Duke Power Co., a 1971 U.S. Supreme Court case. In 1955, the Duke Power Company mandated that employees have a high school diploma to be considered for promotion, which, in practice, severely limited the eligibility of Black employees. The Court found that this facially race-neutral requirement had little relation to job performance, and accordingly deemed it to have an unjustified, and illegal, disparate impact. The Court noted that the employer's motivation for instituting the policy was irrelevant to its decision; even if enacted without discriminatory purpose, the policy was deemed discriminatory in its effects and hence illegal. However, disparate impact law does not prohibit all group differences produced by a policy; the law only prohibits unjustified disparities. For example, if, hypothetically, the high-school diploma requirement in Griggs were shown to be necessary for job success, the resulting disparities would be legal.

On the spectrum from process- to outcome-based understandings of discrimination, we view the formal, axiomatic fairness definitions described in Section 2 as reflecting a largely process-based orientation. Blinding and its more stringent causal variants, counterfactual fairness and path-specific fairness, can be viewed as descendants of disparate treatment considerations, as they seek to remove the effects of race and other protected attributes on decisions. The remaining definitions, for example those that aim to equalize error rates across groups, do explicitly reference an outcome $Y$, but they do so in a way that seems largely disconnected from the consequences one might naturally consider. As we have argued, whether error rates are equal across groups has more to do with the structure of group-specific risk distributions than with whether decisions lead to good or bad outcomes for group members.

24. In the case of auto insurance specifically, some states (including California, Hawaii, Massachusetts, Maine, Michigan, North Carolina, and Pennsylvania), though not all, have barred the use of gender in pricing policies.

25. The legal doctrine of disparate impact stems largely from federal statutes, not constitutional law, and applies only in certain contexts, such as employment (via Title VII of the 1964 Civil Rights Act) and housing (via the Fair Housing Act of 1968). Apart from federal statutes, some states have passed more expansive disparate impact laws, including Illinois and California. The distinction between statutory and constitutional rules is particularly relevant here, as there is debate among scholars over whether disparate impact laws violate the Equal Protection Clause and are thus unconstitutional (Primus, 2003). There is also debate over whether disparate impact law is motivated primarily by an interest in banning bad outcomes, or seeks to provide an alternative pathway for ferreting out bad intent, when actors may mask animus with race-neutral policies (Watson v. Fort Worth, 1988). But, regardless of its underlying justification, disparate impact law is formally focused on outcomes, not intent or classification, and so we view it as an outcomes-focused principle.
For example, in our college admissions example, enforcing various formal notions of fairness would, in theory, typically lead to student bodies that are both less diverse and less academically prepared than those resulting from feasible alternatives not constrained to satisfy these notions.

One might, on principle, favor certain process-based understandings of discrimination over outcome-based notions. One might even adopt a meta-consequentialist position, and argue that procedural considerations (e.g., ensuring medical decisions are blind to race) engender trust and in turn bring about better downstream outcomes. In many cases, though, the ethical underpinnings of popular mathematical definitions of fairness have not been clearly articulated. Absent such justification, we advocate for an approach that more directly engages with the real-world costs and benefits of different decision policies, a perspective that we outline in more detail in the remaining sections.

5.2 Designing Equitable Algorithms

A key advantage of the dominant axiomatic approach to algorithmic fairness is that it can be readily applied across contexts, with little domain-specific knowledge. One can build automated tests to check whether any predictive algorithm satisfies various formal fairness desiderata, and even automatically modify algorithms to ensure that they do satisfy a specific fairness criterion (e.g., Cotter et al., 2019; Weerts et al., 2023). But, as we have argued, this approach is often at odds with improving well-being, including for disadvantaged groups. A particularly pernicious risk of automated, axiomatic approaches is that they can make invisible the cost to well-being: automatically constraining algorithms to be fair can lead one to overlook unconstrained alternatives that are clearly preferable. We have instead called for a more careful analysis of the consequences, good and bad, of different decision policies, selecting the appropriate course of action based on the specific context.
This is admittedly hard to do, and it does not easily scale, but there are general principles that we believe are helpful to keep in mind when navigating this terrain. Below we enumerate and discuss four of them.

5.2.1 Contending with Inherent Trade-Offs

There are inherent trade-offs in many important decision problems. For instance, in our college admissions example, one must balance academic preparedness with student-body diversity. Although one cannot generally circumvent these trade-offs, we believe it is useful to explicitly enumerate the primary dimensions of interest and to acknowledge the trade-offs between them. In some cases, like our stylized admissions example, one might be able to explicitly calculate the Pareto frontier shown in Figure 4, in which case it often makes sense to focus on those policies lying on the frontier. In many cases, it will not be possible to compute the frontier. Still, by listing and discussing trade-offs, even informally, one can reduce the risk of adopting clearly problematic policies, like those that typically result from uncritically constraining decisions to satisfy formal fairness criteria. In this sense, designing equitable algorithms is akin to designing equitable policy writ large. One might accordingly adapt democratic mechanisms used to draft and enact legislation to algorithm design. For example, adopting such a policy-oriented perspective, Chohlas-Wood et al. (2023a) and Koenecke et al. (2023) surveyed a diverse sample of Americans to elicit preferences on how best to balance competing objectives in programs that algorithmically allocate government benefits.

5.2.2 Assessing Calibration

When designing or auditing a risk assessment algorithm, it is important to check whether predictions are calibrated, meaning that risk scores correspond to the same observed level of risk across groups. In general, the relationship between predictors and outcomes may plausibly differ across groups, leading to miscalibrated risk estimates, what Ayres (2002) calls the problem of subgroup validity. Figure 3 shows instances of risk scores for diabetes and recidivism that are miscalibrated across race and gender, respectively. For example, a nominal 1.5% diabetes risk based on age and BMI corresponds to an actual, observed diabetes rate of approximately 1% among White patients and 3% among Asian patients. Similarly, among individuals receiving a COMPAS risk score of 7, based on criminal history and related factors, about 55% of women recidivate, compared to 65% of men. These miscalibrated risk scores can result in inequitable decisions. For instance, a policy to screen patients with a nominal diabetes risk of 1.5% or above, in line with existing medical recommendations (Aggarwal et al., 2022), would overscreen White patients and underscreen Asian patients, harming individuals in both groups.

Calibration can often be visually assessed by plotting predicted risk against average outcomes, as in Figure 6 (cf. Arrieta-Ibarra et al., 2022). For a simple, more quantitative measure, we recommend regressing observed outcomes against risk estimates and group membership. A coefficient of approximately zero on group membership suggests risk estimates correspond to similar average outcomes across groups, with deviations from zero indicating the degree of miscalibration.
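The sketch below, on simulated data with hypothetical variable names, implements this regression check; we use a linear probability model for simplicity, though one could instead use, e.g., a logistic model.

```python
# A minimal sketch of the calibration check recommended above: regress the
# observed outcome on the risk estimate and a group indicator, and inspect
# the group coefficient. Data are simulated; names are hypothetical.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(4)
n = 100_000
group = rng.integers(0, 2, size=n)            # 1 = group of interest
score = rng.uniform(0.0, 0.1, size=n)         # model's risk estimates
# Simulated miscalibration: true risk is twice the score for group 1.
y = rng.binomial(1, np.clip(score * (1 + group), 0, 1))

X = sm.add_constant(np.column_stack([score, group]))
fit = sm.OLS(y, X).fit()
print(fit.params)  # [intercept, score, group]; a group coefficient far
                   # from zero indicates scores understate risk for group 1
```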
In practice, miscalibration can often be rectified by training group-specific risk models, or, roughly equivalently, including group membership in a single risk model fit across groups. For example, diabetes risk models that include race, and recidivism risk models that include gender, are approximately calibrated. The relatively new literature on multicalibration has introduced computational techniques to ensure predictions are simultaneously calibrated across many different subgroups (Hébert-Johnson et al., 2018). Of course, including protected traits in risk models raises additional legal and ethical challenges. In some cases, it may be possible to reduce or eliminate miscalibration by incorporating additional, non-protected covariates. Regardless, we believe it is important to check the calibration of risk scores to make informed decisions about if and how to address any observed disparities.

Calibration is an important necessary condition to ensure risk estimates correspond to actually observable levels of risk across groups. But it is not sufficient. Indeed, even calibrated risk scores can encode and reinforce deeply discriminatory policies. To see this, imagine a bank that wants to discriminate against Black applicants. Further suppose that: (1) within ZIP code, White and Black applicants have similar default rates; and (2) Black applicants live in ZIP codes with relatively high default rates. Then the bank can surreptitiously discriminate against Black borrowers by basing estimates of default risk only on an applicant's ZIP code, ignoring all other relevant information. Such scores would be calibrated (White and Black applicants with the same score would default equally often), and the bank could use these scores to justify denying loans to nearly all Black applicants. The bank, however, would be sacrificing profit by refusing loans to creditworthy Black applicants,^{26} and thus engaging in taste-based discrimination. This discriminatory lending strategy is indeed closely related to the historical (and illegal) practice of redlining, and illustrates the limitations of calibration as a measure of equity.

26. These applicants are creditworthy in the sense that they would have been issued a loan had the bank used all the information it had available to determine their risk.

Figure 5: Calibration is insufficient to prevent discrimination. Left (number of patients by risk score): the distribution in green shows diabetes risk for Asian patients based on accurately collected age and BMI, and the distribution in purple shows estimates when the risk model is trained on noisy inputs. Estimates under the noisy model concentrate around the mean (dashed vertical line), pushing more Asian patients above the screening threshold (solid vertical line). Right (diabetes rate by risk estimate): a calibration plot comparing noisy risk estimates for Asian patients and accurate risk estimates for White patients. The calibrated risk scores can mask both intentional discrimination and inadvertent errors.

Figure 5 shows another example of calibrated scores masking disparities. In the left-hand panel, we plot in green the distribution of diabetes risk for Asian patients, as estimated from age, BMI, and race. In purple, we plot the distribution of estimated risk when age and BMI are imperfectly measured for Asian patients in the training data. Training the risk model on noisy features pushes risk estimates toward the mean.
As a result, based on the noisy risk model, 97% of Asian patients are above the 1.5% screening threshold, compared to 81% of Asian patients under the more accurately estimated model, leading to more medically unnecessary screening under the noisy model. Importantly, however, the noisy model is still calibrated, as shown in the right-hand panel of Figure 5, where we compare risk scores for Asian patients estimated from the noisy predictors and risk scores for White patients estimated from the accurately measured information. In theory, a malicious algorithm designer could generate such calibrated but inaccurate scores to intentionally harm Asian patients. In practice, this pattern could equally arise from negligence rather than malice. These examples illustrate the importance of considering all available data when constructing statistical risk estimates; assessments that either intentionally or inadvertently ignore predictive information may facilitate discriminatory decisions while satisfying calibration, though, as we discuss below, even this intuitive heuristic of "use all the data" has its limitations.
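The mechanism behind Figure 5 can be reproduced in a few lines. In the stylized sketch below, with hypothetical parameters of our own choosing, scores built from a noisy measurement are conditional means, and hence calibrated by construction, yet they concentrate around the group mean and push a larger share of patients above the screening threshold.

```python
# A stylized sketch of the Figure 5 phenomenon; all parameters are
# hypothetical. Risk scores built from a noisily measured feature remain
# calibrated (they are conditional means) but shrink toward the group
# mean, so a larger share of patients crosses the screening threshold.
import numpy as np

rng = np.random.default_rng(5)
n, threshold = 1_000_000, 0.015

theta = rng.normal(-3.5, 0.8, size=n)        # true log-odds of diabetes
risk_true = 1 / (1 + np.exp(-theta))         # accurate risk estimates
signal = theta + rng.normal(0, 1.6, size=n)  # noisy feature measurement

# Noisy-model score: E[true risk | binned signal], calibrated by construction.
n_bins = 50
edges = np.quantile(signal, np.linspace(0, 1, n_bins + 1))
idx = np.clip(np.digitize(signal, edges[1:-1]), 0, n_bins - 1)
bin_means = np.array([risk_true[idx == i].mean() for i in range(n_bins)])
risk_noisy = bin_means[idx]

print(f"std. of accurate scores: {risk_true.std():.4f}, "
      f"noisy scores: {risk_noisy.std():.4f}")       # estimates shrink
print(f"share screened, accurate model: {(risk_true > threshold).mean():.2%}")
print(f"share screened, noisy model:    {(risk_noisy > threshold).mean():.2%}")
```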
5.2.3 Selecting the Target of Prediction

In constructing algorithmic risk scores, a key ingredient is the target of prediction. In practice, though, there is often a mismatch between our true outcome of interest and the available data, an occurrence we call label bias (Zanger-Tishler et al., 2023). As with the other issues we discuss, there is typically no perfect solution to this problem, but there are ways to mitigate it.

For example, in pretrial risk assessment, we would often like to estimate the likelihood a defendant would commit a crime if released. But there are two key difficulties with this goal. First, though we might want to measure crime conducted by defendants awaiting trial, we typically only observe crime that results in a conviction or an arrest. These observable outcomes, however, are imperfect proxies for the underlying criminal act. Further, heavier policing in communities of color might lead to Black and Hispanic defendants being arrested, and later convicted, more often than White defendants who commit the same offenses (Lum and Isaac, 2016). Poor outcome data might thus cause one to systematically underestimate the risk posed by White defendants. The second, related, issue is that our target of interest is a counterfactual outcome; it corresponds to what would have happened had a defendant been released. In reality, we only observe what actually happened conditional on the judge's actual detention decision.

One way to reduce label bias in this case is to adjust the target of interest. For example, criminologists have found that arrests for violent crime, as opposed to drug crime, may suffer from less racial bias.^{27} In particular, Skeem and Lowenkamp (2016) note that the racial distribution of individuals arrested for violent offenses is in line with the racial distribution of offenders inferred from victim reports, and is also in line with self-reported offending data. In other cases, like lending, where one may seek to estimate default rates, the measured outcome (e.g., failure to pay) corresponds more closely to the event of interest.

The problem of estimating counterfactuals can likewise be partially addressed in some applications. In the pretrial setting, Angwin et al. (2016) measure recidivism rates in the first two-year period during which a defendant is not incarcerated; this is not identical to the desired counterfactual outcome, since the initial detention may be criminogenic, for example, but it seems like a reasonable estimation strategy. Further, unaided human decisions often exhibit considerable randomness, a fact that can be exploited to facilitate statistical estimation of counterfactual outcomes (Jung et al., 2020b; Kleinberg et al., 2017a). More generally, a spate of recent work at the intersection of machine learning and causal inference (Hill, 2011; Jung et al., 2020c; Mullainathan and Spiess, 2017) offers hope for further gains in counterfactual estimation.

27. D'Alessio and Stolzenberg (2003) find evidence that White offenders are even somewhat more likely than Black offenders to be arrested for certain categories of crime, including robbery, simple assault, and aggravated assault. Measurements of minor criminal activity, like drug offenses, are more problematic. For example, there is evidence that drug arrests in the United States are biased against Black and Hispanic individuals, with racial minorities who commit drug crimes substantially more likely to be arrested than White individuals who commit the same offenses (Ramchand et al., 2006). Although this pattern is well known, many existing risk assessment tools still consider arrests or convictions for any new criminal activity, including drug crimes, which may lead to biased estimates. As another example of label bias, auto insurance rates are determined in part by a driver's record of receiving speeding tickets, but disparities in police enforcement mean that tickets are biased proxies of dangerous driving behavior (Cai et al., 2022b).

5.2.4 Collecting Training Data

A final issue we discuss is collecting suitable training data for risk assessment algorithms, to mitigate the effects of sample bias. Ideally, one would train algorithms on data sets that are broadly representative of the populations on which they are ultimately applied, though there are subtleties to this heuristic that we describe below. While often challenging in practice, failure to train on representative data can lead to unintended, and potentially discriminatory, consequences. For example, Buolamwini and Gebru (2018) found that commercial facial analysis tools struggle to correctly classify the gender of dark-skinned individuals, and of dark-skinned women in particular, a disparity likely attributable to the relative dearth of dark-skinned faces in facial analysis data sets. Similarly, Koenecke et al. (2020) found that several popular automated speech recognition systems were significantly worse at transcribing Black speakers than White speakers, likely due to insufficient data from speakers of African American Vernacular English (AAVE), a variety of English spoken by many Black Americans.

The problems of non-representative data can be even more acute in the case of risk assessment algorithms, especially when the target of interest is a causal quantity. For instance, in our running college admissions example, we seek to estimate how a student would (counterfactually) perform if admitted. Historical data are typically the result of past, potentially biased, decisions, and so may not fully generalize.
Imagine, for example, that predictions of college performance are informed by where an applicant went to high school, and that, historically, only applicants from certain high schools were accepted, so we consequently only see outcomes for students from those high schools. Then we would expect less accurate predictions for students from the high schools absent from our data. In general, regression to the mean could attenuate estimates for high-achieving students who differ from those previously accepted, potentially reinforcing existing admissions practices. As a general heuristic, we believe it is advisable to train models on representative data, but, as Cai et al. (2022a) note, the optimal sampling strategy depends on the statistical structure of the problem and the group-specific costs of collecting data.

Interestingly, the value of representative data collection strategies depends in part on the degree to which race, gender, and other protected attributes are predictive. In theory, if protected attributes are not predictive, one could build an accurate risk model using only examples from one particular group (e.g., White men). Given enough examples of White men, the model would learn the relationship between features and risk, which, by our assumption, would generalize to the entire population. This phenomenon highlights a tension in informal discussions of fairness, with some advocating both for representative training data and for the exclusion of protected attributes. However, representative data are often most important precisely when protected attributes add information, in which case their use is arguably more justified.

Even if protected attributes are not predictive, representative data can still help in two additional ways. First, a representative sample ensures that the full support of features is present at training time, as it is possible that the distribution of features varies across groups, even if the connection between features and outcomes does not. We note, though, that one might have adequate support even without a representative sample in many real-world settings, particularly when models are trained on large data sets and the feature space is relatively low dimensional. Second, a representative sample can help with model validation, allowing one to assess the potential effects of group imbalance on model fit. In particular, without a representative sample, it can be difficult to determine whether a model trained on a single group generalizes to the entire population.

In many settings, one may be able to gather better training data with greater investment of time and money. For example, in our diabetes example one could aim to collect more complete medical records, a process that may be both costly and logistically difficult. In theory, this additional information may lead to welfare gains, and policymakers must accordingly evaluate the relative costs and benefits to all groups of exerting this extra effort when designing algorithms. Fortunately, in practice, there are often diminishing returns to information, with a relatively short list of key features providing most of the predictive power (Jung et al., 2020b), at least partially mitigating this concern.
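To make the notion of diminishing returns concrete, one can fit a sequence of nested models and track out-of-sample performance as features are added. The sketch below is purely illustrative and not part of the original analysis; it assumes a pandas DataFrame df with a binary outcome column "y" and a feature list ordered roughly by expected importance, all of which are hypothetical placeholders.

```python
# Illustrative check for diminishing returns to information: fit models
# on nested feature sets and compare out-of-sample AUC. `df`, `features`,
# and the outcome column "y" are hypothetical placeholders.
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

def auc_by_feature_count(df, features, outcome="y"):
    train, test = train_test_split(df, test_size=0.3, random_state=0)
    aucs = {}
    for k in range(1, len(features) + 1):
        cols = features[:k]
        model = LogisticRegression(max_iter=1000).fit(train[cols], train[outcome])
        scores = model.predict_proba(test[cols])[:, 1]
        aucs[k] = roc_auc_score(test[outcome], scores)
    # AUC often plateaus after a handful of key features.
    return aucs
```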
Carefully documenting data provenance and content (including what features the data contain, how observations were selected into the sample, potential missingness or measurement error, and the representation of protected classes in the data) with the use of datasheets or other structured tools can both help determine the extent of possible non-representativeness and alert future analysts to potential issues (Arnold et al., 2019; Chmielinski et al., 2022; Gebru et al., 2021; Holland et al., 2020; Stoyanovich and Howe, 2019).

As with the other issues we have discussed, there is no universal solution to data collection. It might, for example, simply be prohibitive in the short run to train models on the data set one would ideally like to use. Nevertheless, as in all situations, one must carefully weigh the potential costs and benefits of adopting a necessarily imperfect risk assessment algorithm relative to the other possible options. In particular, even an imperfect algorithm may in some circumstances be better than leaving decisions to similarly imperfect humans who have their own biases.

5.3 Case Study: An Algorithm to Allocate Limited Medical Resources

We conclude by illustrating the principles discussed above with a real-world example: an algorithm for referring patients into a high-risk care management program, previously considered by Obermeyer et al. (2019). The care management program provides additional aid to patients with complex medical needs, in principle both improving outcomes for enrolled patients and reducing costs to the medical system. But the program has limited capacity, which we formalize by assuming that only 2% of patients can be enrolled (i.e., we set b = 1/50). For our analysis, we use the data released by Obermeyer et al.,28 which contain demographic variables, cost information, comorbidities, biomarker and medication details, and health outcomes for a population of approximately 43,000 White and 5,600 Black primary care patients at an academic hospital from 2013 to 2015.

As a first step toward identifying patients to enroll in the program, one could train a model predicting the healthcare resources a patient is likely to require over the next year. Sicker patients require more care and, consequently, incur greater healthcare costs. Thus, our initial approach is to predict how likely a patient is to be high cost (which we operationalize as being in the top decile of healthcare expenditures in a given year) based on the available information. (One of the main contributions of Obermeyer et al. (2019) is highlighting that healthcare cost is a problematic outcome due to label bias, a point we return to shortly.)

As discussed above, it is useful to assess the calibration of risk assessment algorithms across groups. In particular, while calibration across race is largely guaranteed if race is included as a predictor in the statistical model, race-blind models are often preferred, particularly in healthcare, in part to avoid perceptions of bias. As a result, we assess the calibration of a race-blind model trained on all available information except for race, shown in Figure 6. Unlike in the diabetes screening example considered in Section 3.2, the race-blind healthcare cost predictions are calibrated across race groups, meaning that risk estimates largely match observed costs across groups.
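Concretely, a simple version of this calibration check bins patients by their risk estimates and compares average predicted risk with the observed outcome rate within each bin, separately by group. The following sketch is illustrative only; the arrays risk, y, and group are hypothetical stand-ins for model scores, realized outcomes, and race labels.

```python
# Grouped calibration check in the spirit of Figure 6: bin patients by
# predicted risk and compare mean predicted risk to the observed rate of
# the outcome in each bin, by group. Inputs are hypothetical placeholders.
import pandas as pd

def calibration_table(risk, y, group, n_bins=10):
    df = pd.DataFrame({"risk": risk, "y": y, "group": group})
    df["bin"] = pd.qcut(df["risk"], q=n_bins, duplicates="drop")
    out = (df.groupby(["group", "bin"], observed=True)
             .agg(mean_risk=("risk", "mean"),
                  obs_rate=("y", "mean"),
                  n=("y", "size"))
             .reset_index())
    # A well-calibrated model has mean_risk close to obs_rate in every
    # bin, for every group.
    return out
```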
It accordingly appears that race provides little marginal predictive power in this example, assuaging potential concerns with its omission.

We turn next to assessing the effects of applying formal fairness criteria to our enrollment decisions. Often, in healthcare contexts, a false negative (e.g., failing to screen for a disease when it is present) is more consequential than a false positive (e.g., screening for a disease when it is not present). For this reason, one might seek to ensure enrollment decisions are fair by mandating that false negative rates be equal across race groups (e.g., Seyyed-Kalantari et al., 2021),29 i.e., requiring that D ⊥ A | Y = 1. In our example, equalizing false negative rates means that among patients who ultimately incur high medical costs (Y = 1), the same proportion are referred into the program across race groups (A).

In our setting, approximately 1,000 patients can be referred into the care management program: 2% of the roughly 50,000 patients in our data set. When we equalize false negative rates, 747 of the enrolled patients ultimately incur high costs, and 113 enrolled patients are Black.30 However, an unconstrained decision rule (i.e., one that enrolls the patients most likely to incur high costs) enrolls both more high-cost patients (758) and more Black patients (205). In this example, we end up providing worse care to Black patients when we constrain our algorithm to satisfy the formal, mathematical fairness criterion.

28. Obermeyer et al. released a synthetic data set closely mirroring the real data set, available at: https://gitlab.com/labsysmed/dissecting-bias. In contrast to b = 2%, which we adopt to better illustrate some of the statistical phenomena we discuss in this section, Obermeyer et al. use a budget of b = 3%.

29. The authors describe their metric of concern as the false positive rate, where the positive prediction is understood as a "no finding" label, e.g., not having a disease. In our example, we follow the more common convention that a positive classification is assigned to the rare event (i.e., being high cost), and so we call this metric the false negative rate.

30. Following Corbett-Davies et al. (2017), we equalize false-negative rates in a manner that maximizes the number of enrolled patients who ultimately incur high costs.

Figure 6: Calibration of a race-blind model predicting whether a patient is "high cost." For both Black and White patients, the plot shows the proportion of patients who are actually high cost as a function of their risk estimate; the lack of a gap between predicted and observed proportions for both groups indicates that the model is well calibrated across race groups.

Instead of applying such mathematical fairness criteria, we advocate for directly weighing the costs and benefits of different decision policies. In Figure 7a, we show the Pareto frontier for our example, tracing out policies that optimally trade off the demographic composition of the enrolled population against the number of enrolled patients who in reality incur high costs. The green point corresponds to equalizing false negative rates, and lies to the left of the dashed vertical line that corresponds to the unconstrained decision rule, visually illustrating how constraining our algorithm leads to fewer resources for Black patients.
Also shown on the plot are points corresponding to demographic parity and to equal false positive rates, both of which likewise lead to fewer resources for Black patients.31 The result in Figure 7a stems from the false negative rate for Black patients being lower than the false negative rate for White patients under the unconstrained algorithm, a pattern we expect since Black patients are more likely than White patients to incur high medical costs in our data. Equalizing false negative rates thus means raising the enrollment bar for Black patients and lowering the bar for White patients.

In light of this example, one might argue for applying formal fairness criteria only when error rates for racial minorities are higher than for White individuals. Figure 7b repeats our analysis above for the subset of women between the ages of 25 and 34, a subpopulation in which Black patients have higher error rates than White patients. In this case, equalizing false positive rates, equalizing false negative rates, and enforcing demographic parity all result in more Black patients being admitted into the program. It is, however, unclear why one should adopt those particular error-rate-equalizing policies over any other. The policies on the Pareto frontier (i.e., on the curve to the right of the dashed line) are all arguably reasonable to consider. It is admittedly difficult to determine which of these policies to adopt, but we believe it is best to confront this challenge head on, recognizing the hard trade-offs inherent to the problem; the sketch below illustrates the comparison between constrained and unconstrained policies.

31. As discussed in Section 4, with the exception of counterfactual predictive parity, all of the remaining fairness definitions given in Section 2 are known a priori to restrict one to enrollment policies that lower both the number of Black patients enrolled and the number of truly high-cost patients enrolled.

Figure 7: Enforcing formal fairness criteria can harm marginalized groups. The panels show feasible regions for admissions policies to a high-risk care management program (total number of Black patients admitted vs. total number of high-cost patients admitted), where the dashed line indicates the number of Black patients admitted by the policy admitting the maximal number of high-cost patients. Left (a, full population): because more Black patients incur high medical costs in the population as a whole, equalizing false negative rates (FNR) or false positive rates (FPR), or enforcing demographic parity (DP), results in fewer Black patients being admitted than under the policy that maximizes the total number of high-cost patients admitted. (Equalized false negative rates and demographic parity are achieved at the same point.) Right (b, women between the ages of 25 and 34): within this subpopulation, Black patients incur lower medical costs, and so equalizing false positive rates or false negative rates, or achieving demographic parity, all result in more Black patients being admitted to the high-risk management program than under the policy that admits the maximum number of high-cost patients.
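To make the comparison concrete, the following sketch contrasts an unconstrained top-risk enrollment rule with an approximately false-negative-rate-equalizing rule built from group-specific thresholds, in the spirit of footnote 30. It is a simplified illustration, not the exact procedure behind the numbers reported above; the arrays risk, y, and group and the integer budget are hypothetical inputs, and we assume every group contains at least one truly high-cost patient.

```python
import numpy as np

def enroll_unconstrained(risk, budget):
    # Enroll the `budget` patients with the highest estimated risk.
    enrolled = np.zeros(len(risk), dtype=bool)
    enrolled[np.argsort(-risk)[:budget]] = True
    return enrolled

def enroll_equalized_fnr(risk, y, group, budget, grid=1001):
    # Approximately equalize false negative rates: search over a common
    # "capture rate" q, setting each group's threshold so that a fraction
    # q of its truly high-cost patients (y == 1) lie above it, and keep
    # the largest q whose total enrollment fits within the budget.
    best = np.zeros(len(risk), dtype=bool)
    for q in np.linspace(0.0, 1.0, grid):
        enrolled = np.zeros(len(risk), dtype=bool)
        for g in np.unique(group):
            members = group == g
            t = np.quantile(risk[members & (y == 1)], 1.0 - q)
            enrolled |= members & (risk >= t)
        if enrolled.sum() <= budget:
            best = enrolled
    return best

# Comparing, say, enrolled[y == 1].sum() (high-cost patients reached) and
# enrolled[group == "Black"].sum() (Black patients enrolled) across the
# two policies mirrors the trade-off illustrated in Figure 7.
```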
We conclude our case study by considering label bias, the primary concern identified by Obermeyer et al. (2019) in this context. As those authors noted, medical cost is a poor proxy for medical need, and so allocating healthcare resources to minimize anticipated costs can lead to severe disparities. Replicating an analysis by Obermeyer et al., Figure 8a shows that among patients with similar likelihood of incurring high-cost medical care, Black patients are considerably more likely than White patients to have complex medical needs, operationalized as having five or more active chronic conditions. This gap is likely a consequence of worse access to healthcare among Black patients, due to a mix of socioeconomic factors and discrimination. To the extent that care management programs aim to aid the sickest patients, as opposed to simply reducing costs, targeting resources based on anticipated costs can lead to inefficient and inequitable outcomes. Obermeyer et al. accordingly suggest switching the target of prediction from medical costs to health status.

Figure 8: The target of prediction impacts equity. Left (a, label bias): the probability of having complex medical needs (i.e., at least five active chronic conditions) for Black and White patients as a function of their estimated likelihood of incurring high medical costs, reproducing an analysis by Obermeyer et al. (2019). The large gap across groups indicates that Black patients have greater medical need than White patients with similar anticipated healthcare costs. Right (b, heterogeneous treatment effects): distribution of the estimated change in the probability that an individual will have complex medical needs after enrolling in the care management program, showing that the extent to which enrollment reduces complex medical needs varies considerably across individuals.

It is possible to achieve further gains by recognizing resource allocation as an inherently causal problem. In particular, one may seek to enroll patients in the program in a manner that maximizes E[Y(D)], where Y(D) is the potential health outcome under the enrollment decision D. To do so, we could prioritize patients by their estimated treatment effect Ŷ_i(1) − Ŷ_i(0), rather than by an estimate Ŷ_i of their future health status that ignores the causal effect of the program.32 A proper causal analysis is, in general, a complex topic requiring careful treatment beyond the scope of this article. Nonetheless, Figure 8b shows the distribution of Ŷ_i(1) − Ŷ_i(0), as estimated with a simple regression model; we sketch one such estimation strategy below. The plot suggests that there is considerable predictable heterogeneity in the extent to which enrollment in the care management program causally improves health.

32. In many analyses that do not explicitly grapple with the causal effects of interventions, the estimand Y is further corrupted by the fact that some patients are expected to be enrolled in the program. As a result, some of the sickest patients may not be prioritized for care, since their expected outcome already incorporates the fact that they would have been enrolled, leading to a prediction paradox. To avoid this situation, one could explicitly estimate and prioritize patients by Y_i(0), the potential outcome in the absence of care. In this case, however, the allocation decisions do not necessarily lead to the largest health gains, as the patients likely to be sickest in the absence of care are not typically the same as those likely to benefit the most from the program.
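As an illustration, the sketch below estimates individual treatment effects with a simple "T-learner": separate outcome models are fit for enrolled and unenrolled patients, and their predictions are differenced. This is one plausible reading of the "simple regression model" mentioned above, not a description of our exact procedure; the DataFrame df, the column names, and the feature list are hypothetical, and the approach assumes enrollment is as good as random conditional on the features.

```python
from sklearn.linear_model import LogisticRegression

def estimate_treatment_effects(df, features, treatment="enrolled",
                               outcome="complex_needs"):
    # Fit separate models of the binary outcome for treated and
    # untreated patients (a T-learner).
    treated = df[df[treatment] == 1]
    control = df[df[treatment] == 0]
    m1 = LogisticRegression(max_iter=1000).fit(treated[features], treated[outcome])
    m0 = LogisticRegression(max_iter=1000).fit(control[features], control[outcome])
    # Predicted potential outcomes for every patient.
    y1_hat = m1.predict_proba(df[features])[:, 1]  # estimate of Y_i(1)
    y0_hat = m0.predict_proba(df[features])[:, 1]  # estimate of Y_i(0)
    # Estimated individual treatment effects; targeting patients with
    # the most beneficial estimated effects prioritizes those most
    # likely to gain from enrollment.
    return y1_hat - y0_hat
```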
In particular, we find that the estimated treatment effect is only weakly correlated with the number of chronic conditions a patient currently exhibits (r = 0.05). Consequently, directly targeting resources to those most likely to benefit could yield large health improvements. Once the types of label bias we have discussed above have been identified, it may be possible to re-train predictive models to better align decision-making algorithms with policy goals.

6. Conclusion

From medicine to criminal justice, practitioners are increasingly turning to statistical risk assessments to help guide and improve human decisions. Algorithms can avoid many of the implicit and explicit biases of human decision makers, but they can also exacerbate historical inequities if not developed with care. Policymakers, in response, have rightly demanded that these high-stakes decision systems be designed and audited to ensure outcomes are equitable. The research community has responded to the challenge, coalescing around several formal mathematical definitions of fairness. However, as we have aimed to articulate, these popular measures of fairness suffer from significant statistical limitations. Indeed, adopting these measures as algorithmic design principles can often harm the groups that these measures were designed to protect.

In contrast to the dominant axiomatic approach to algorithmic fairness, we advocate for a more consequentialist orientation (Cai et al., 2022a; Chohlas-Wood et al., 2023a; Liang et al., 2022; Nyarko et al., 2021). Most importantly, we stress the importance of grounding technical and policy discussions of fairness in terms of real-world quantities. For example, in the pretrial domain, one might consider a risk assessment's short- and long-term impacts on public safety and the size of the incarcerated population, as well as a tool's alignment with principles of due process. In lending, one could similarly consider a risk assessment's immediate and equilibrium effects on community development and the sustainability of a loan program. Formal mathematical measures of fairness only indirectly address such issues, and can inadvertently lead discussions astray. Of course, it is not always clear how best to quantify or to balance the relevant costs and benefits of proposed algorithmic interventions. In some cases, it may be possible to conduct randomized controlled trials; in other cases, the best one can do is hypothesize about an algorithm's potential effects. Regardless, we believe a more explicit focus on consequences is necessary to make progress.

We further recommend decoupling the statistical problem of risk assessment from the policy problem of designing interventions. At their best, predictive algorithms estimate the likelihood of events under different scenarios; they cannot dictate policy. An algorithm might (correctly) infer that a defendant has a 20% chance of committing a violent crime if released, but that fact does not, in and of itself, determine a course of action. For example, detention is not the only alternative to release, as one could take any number of rehabilitative interventions (Barabas et al., 2018).
Even if detention is deemed an appropriate intervention, one must still determine what threshold would appropriately balance public safety against the social and financial costs of detention. One might even decide that society's goals are best achieved by setting different thresholds for different groups. For example, a policymaker might reason that, all else being equal, the social costs of detaining a single parent are higher than the social costs of detaining an individual without children, and thus decide to apply different thresholds to the two groups. When policymakers consider these options and others, we believe the primary role of a risk assessment tool is, as its name suggests, to estimate risk. This view, however, is at odds with requiring that algorithms satisfy popular fairness criteria. Such constrained algorithms typically do not reflect the best available estimates of risk, and thus implicitly conflate the statistical and policy problems.

Fair machine learning still has much left to accomplish, and there are several important avenues of research that could benefit from new statistical and computational insights. From mitigating measurement error and sample bias, to understanding externalities and equilibrium effects, to eliciting and aggregating preferences to arbitrate between competing algorithms, there is much work to be done. But the benefits are equally large. When carefully designed and evaluated, statistical algorithms have the potential to dramatically improve both the efficacy and equity of consequential decisions. As these algorithms are increasingly deployed in all walks of life, it will become ever more important to ensure they are fair.

Acknowledgments

We thank Guillaume Basse, Sander Beckers, Hana Chockler, Alex Chohlas-Wood, Madison Coots, Avi Feller, Josh Grossman, Joe Halpern, Jennifer Hill, Aziz Huq, David Kent, Keren Ladin, Julian Nyarko, Emma Pierson, Ravi Sojitra, and Michael Zanger-Tishler for helpful conversations. This paper is based on work by Corbett-Davies and Goel (2018) and Nilforoshan et al. (2022). H.N. was supported by a Stanford Knight-Hennessy Scholarship and an NSF Graduate Research Fellowship under Grant No. DGE-1656518. J.G. was supported by a Stanford Knight-Hennessy Scholarship. R.S. was supported by the NSF Program on Fairness in AI in Collaboration with Amazon under the award "FAI: End-to-End Fairness for Algorithm-in-the-Loop Decision Making in the Public Sector," no. IIS-2040898. S.G. was supported by a grant from the Harvard Data Science Initiative. Any opinions, findings, conclusions, or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the NSF or Amazon. Reproduction materials are available at https://github.com/jgaeb/measure-mismeasure.

Appendix A. Path-specific Counterfactuals

Constructing policies that satisfy path-specific fairness requires computing path-specific counterfactual values of features. In Algorithm 1, we describe the formal construction of path-specific counterfactuals Z_{Π,a,a′} for an arbitrary variable Z (or collection of variables) in the DAG. To generate a sample Z*_{Π,a,a′} from the distribution of Z_{Π,a,a′}, we first sample values U*_j of the exogenous variables. Then, in the first loop, we traverse the DAG in topological order, setting A to a and iteratively computing values V*_j of the other nodes based on the structural equations in the usual fashion.
In the second loop, we set A to a′, and then iteratively compute values V′_j for each node. Each V′_j is computed using the structural equation at that node, evaluated with the value V′_ℓ for each of its parents that are connected to it along a path in Π, and with the value V*_ℓ for all of its other parents. Finally, we set Z*_{Π,a,a′} to Z′.

Algorithm 1: Path-specific counterfactuals
Data: G (topologically ordered), Π, a, and a′
Result: A sample Z*_{Π,a,a′} from Z_{Π,a,a′}
  Sample values {U*_j} for the exogenous variables
  /* Compute counterfactuals by setting A to a */
  for j = 1, . . . , m do
      if V_j = A then
          V*_j ← a
      else
          pa*(V_j) ← {V*_ℓ | V_ℓ ∈ pa(V_j)}
          V*_j ← f_{V_j}(pa*(V_j), U*_j)
  /* Compute counterfactuals by setting A to a′ and propagating values along paths in Π */
  for j = 1, . . . , m do
      if V_j = A then
          V′_j ← a′
      else
          for V_k ∈ pa(V_j) do
              if edge (V_k, V_j) lies on a path in Π then
                  Ṽ_k ← V′_k
              else
                  Ṽ_k ← V*_k
          pa′(V_j) ← {Ṽ_ℓ | V_ℓ ∈ pa(V_j)}
          V′_j ← f_{V_j}(pa′(V_j), U*_j)
  Z*_{Π,a,a′} ← Z′

Appendix B. Constructing Causally Fair Policies

Our aim is to identify the feasible region of expected outcomes attainable via policies which are constructed to satisfy various causal fairness constraints. First, consider the problem of finding decision policies that maximize expected utility, subject to satisfying a given definition of causal fairness, as well as the outcome and budget constraints. Specifically, letting C denote the family of all decision policies that satisfy one of the causal fairness definitions listed above, a utility-maximizing policy d* is given by

d* ∈ argmax_{d∈C} E[d(X)·u(X)]
s.t.  o₁ − ϵ ≤ E[d(X)·1_{α(X)=a₁}] ≤ o₁ + ϵ,
      o₂ − ϵ ≤ E[d(X)·E[Y(1) | X]] ≤ o₂ + ϵ,
      E[d(X)] ≤ b.     (13)

We prove that this optimization problem can be efficiently solved as a single linear program in the case of counterfactual equalized odds, conditional principal fairness, counterfactual fairness, and path-specific fairness, or as a series of linear programs in the case of counterfactual predictive parity.

Theorem 21 Consider the optimization problem given in Eq. (13).
1. If C is the class of policies that satisfies counterfactual equalized odds or conditional principal fairness, and the distribution of (X, Y(0), Y(1)) is known and supported on a finite set of size n, then a utility-maximizing policy constrained to lie in C can be constructed via a linear program with O(n) variables and constraints.
2. If C is the class of policies that satisfies path-specific fairness (including counterfactual fairness), and the distribution of (X, D_{Π,A,a′}) is known and supported on a finite set of size n, then a utility-maximizing policy constrained to lie in C can be constructed via a linear program with O(n) variables and constraints.
3. Suppose C is the class of policies that satisfies counterfactual predictive parity, that the distribution of (X, Y(1)) is known and supported on a finite set of size n, and that the optimization problem in Eq. (13) has a feasible solution. Further suppose Y(1) is supported on k points, and let Δ^{k−1} = {p ∈ R^k | p_i ≥ 0 and Σ_{i=1}^k p_i = 1} be the unit (k−1)-simplex. Then one can construct a set of linear programs L = {L(v)}_{v∈Δ^{k−1}}, each having O(n) variables and constraints, such that the solution to one of the LPs in L is a utility-maximizing policy constrained to lie in C.
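As a concrete illustration of Part 1, the sketch below assembles and solves the linear program for counterfactual equalized odds using scipy. It is a minimal, hypothetical implementation rather than the authors' code: the inputs p, u, cond_ay, and cond_y are placeholders for the known finite distribution, and the two outcome constraints of Eq. (13) could be appended as additional rows of A_ub in the same fashion.

```python
import numpy as np
from scipy.optimize import linprog

def cf_equalized_odds_lp(p, u, cond_ay, cond_y, b):
    """Maximize sum_i d_i u(x_i) p_i over d in [0, 1]^m, subject to the
    budget constraint and counterfactual equalized odds.

    p:       (m,) array of masses Pr(X = x_i)
    u:       (m,) array of utilities u(x_i)
    cond_ay: dict mapping (a, y) -> (m,) array of Pr(X = x_i | A = a, Y(1) = y)
    cond_y:  dict mapping y -> (m,) array of Pr(X = x_i | Y(1) = y)
    b:       budget
    """
    m = len(p)
    c = -(np.asarray(u) * np.asarray(p))  # linprog minimizes, so negate
    A_ub = np.atleast_2d(p)               # budget: sum_i d_i p_i <= b
    b_ub = np.array([b])
    # Balance constraints: sum_i d_i (Pr(x_i | a, y) - Pr(x_i | y)) = 0.
    A_eq = np.array([cond_ay[(a, y)] - cond_y[y] for (a, y) in cond_ay])
    b_eq = np.zeros(len(cond_ay))
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
                  bounds=[(0.0, 1.0)] * m, method="highs")
    return res.x  # d_i: probability of a positive decision at x_i
```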
Before moving on to the proof of Theorem 21, we note that since the constraints of the linear programs are convex, the feasible regions in Figure 4 can be determined by solving the convex feasibility problem in which we impose the additional convex constraint that the expected outcomes (in our admissions example, the aggregate academic index and the number of admitted applicants from the target group) lie within some distance ϵ of a given point. Performing a grid search over all points then determines the feasible regions.

Proof Let X = {x₁, . . . , x_m}; then we seek decision variables d_i, i = 1, . . . , m, corresponding to the probability of making a positive decision for individuals with covariate value x_i. Therefore, we require that 0 ≤ d_i ≤ 1. Letting p_i = Pr(X = x_i) denote the mass of X at x_i, note that the objective function, as well as the outcome and budget constraints, are all linear in the decision variables:
1. The objective function E[d(X)·u(X)] equals Σ_{i=1}^m d_i·u(x_i)·p_i.
2. The budget constraint E[d(X)] ≤ b equals Σ_{i=1}^m d_i·p_i ≤ b.
3. The first outcome constraint o₁ − ϵ ≤ E[d(X)·1_{α(X)=a₁}] ≤ o₁ + ϵ equals o₁ − ϵ ≤ Σ_{i=1}^m 1_{α(x_i)=a₁}·d_i·p_i ≤ o₁ + ϵ.
4. The second outcome constraint o₂ − ϵ ≤ E[d(X)·E[Y(1) | X]] ≤ o₂ + ϵ equals o₂ − ϵ ≤ Σ_{i=1}^m E[Y(1) | X = x_i]·d_i·p_i ≤ o₂ + ϵ.

We now show that each of the causal fairness definitions can be enforced via linear constraints. We do so in three parts, as listed in the theorem.

Theorem 21, Part 1. First, we consider counterfactual equalized odds. A decision policy satisfies counterfactual equalized odds when D ⊥ A | Y(1). Since D is binary, this condition is equivalent to the expression
E[d(X) | A = a, Y(1) = y] = E[d(X) | Y(1) = y]
for all a ∈ A and y ∈ Y such that Pr(Y(1) = y) > 0. Expanding this expression and replacing d(x_i) by the corresponding decision variable d_i, we obtain
Σ_{i=1}^m d_i·Pr(X = x_i | A = a, Y(1) = y) = Σ_{i=1}^m d_i·Pr(X = x_i | Y(1) = y)
for each a ∈ A and each of the finitely many values y ∈ Y such that Pr(Y(1) = y) > 0. These constraints are linear in the d_i by inspection.

Next, we consider conditional principal fairness. A decision policy satisfies conditional principal fairness when D ⊥ A | Y(0), Y(1), W, where W = ω(X) denotes a reduced set of the covariates X. Again, since D is binary, this condition is equivalent to the expression
E[d(X) | A = a, Y(0) = y₀, Y(1) = y₁, W = w] = E[d(X) | Y(0) = y₀, Y(1) = y₁, W = w]
for all y₀, y₁, and w satisfying Pr(Y(0) = y₀, Y(1) = y₁, W = w) > 0. As above, expanding this expression and replacing d(x_i) by the corresponding decision variable d_i yields linear constraints of the form
Σ_{i=1}^m d_i·Pr(X = x_i | A = a, S = s) = Σ_{i=1}^m d_i·Pr(X = x_i | S = s)
for each a ∈ A and each of the finitely many values s = (y₀, y₁, w) ∈ Y × Y × W of S = (Y(0), Y(1), W) satisfying Pr(Y(0) = y₀, Y(1) = y₁, W = w) > 0. Again, these constraints are linear by inspection.

Theorem 21, Part 2. Suppose a decision policy satisfies path-specific fairness for a given collection of paths Π and a (possibly) reduced set of covariates W = ω(X), meaning that for every a′ ∈ A,
E[D_{Π,A,a′} | W] = E[D | W].
Recall from the definition of path-specific counterfactuals that D_{Π,A,a′} = f_D(X_{Π,A,a′}, U_D) = 1_{U_D ≤ d(X_{Π,A,a′})}, where U_D ⊥ {X_{Π,A,a′}, X}.
Since W = ω(X), we also have U_D ⊥ {X_{Π,A,a′}, W}, and it follows that

E[D_{Π,A,a′} | W = w]
= Σ_{i=1}^m E[D_{Π,A,a′} | X_{Π,A,a′} = x_i, W = w]·Pr(X_{Π,A,a′} = x_i | W = w)
= Σ_{i=1}^m E[1_{U_D ≤ d(X_{Π,A,a′})} | X_{Π,A,a′} = x_i, W = w]·Pr(X_{Π,A,a′} = x_i | W = w)
= Σ_{i=1}^m d(x_i)·Pr(X_{Π,A,a′} = x_i | W = w)
= Σ_{i=1}^m d_i·Pr(X_{Π,A,a′} = x_i | W = w).

An analogous calculation yields E[D | W = w] = Σ_{i=1}^m d_i·Pr(X = x_i | W = w). Equating these expressions gives
Σ_{i=1}^m d_i·Pr(X = x_i | W = w) = Σ_{i=1}^m d_i·Pr(X_{Π,A,a′} = x_i | W = w)
for each a′ ∈ A and each of the finitely many w ∈ W such that Pr(W = w) > 0. Again, each of these constraints is linear by inspection.

Theorem 21, Part 3. A decision policy satisfies counterfactual predictive parity if Y(1) ⊥ A | D = 0, or, equivalently,
Pr(Y(1) = y | A = a, D = 0) = Pr(Y(1) = y | D = 0)
for all a ∈ A and y ∈ Y. We may rewrite this expression to obtain
Pr(Y(1) = y, A = a, D = 0) / Pr(A = a, D = 0) = C_y,
where C_y = Pr(Y(1) = y | D = 0). Expanding the numerator on the left-hand side of the above equation yields
Pr(Y(1) = y, A = a, D = 0) = Σ_{i=1}^m [1 − d_i]·Pr(Y(1) = y, A = a, X = x_i),
for each of the finitely many y ∈ Y, and similarly expanding the denominator yields
Pr(A = a, D = 0) = Σ_{i=1}^m [1 − d_i]·Pr(A = a, X = x_i).
Therefore, counterfactual predictive parity corresponds to
Σ_{i=1}^m [1 − d_i]·Pr(Y(1) = y, A = a, X = x_i) = C_y·Σ_{i=1}^m [1 − d_i]·Pr(A = a, X = x_i),   (14)
for each a ∈ A and y ∈ Y. For fixed values of the C_y, these constraints are linear in the d_i by inspection. Consider the family of linear programs L = {L(v)}_{v∈Δ^{k−1}}, where the linear program L(v) has the same objective function, outcome constraints, and budget constraint as before, together with additional constraints for each a ∈ A as in Eq. (14), where C_{y_i} = v_i for i = 1, . . . , k. By assumption, there exists a feasible solution to the optimization problem in Eq. (13), so the solution to at least one program in L is a utility-maximizing policy that satisfies counterfactual predictive parity.

Appendix C. A Stylized Example of College Admissions

In the example that we consider in Section 4.1, the exogenous variables in the DAG, U = {U_A, U_D, U_E, U_M, U_T, U_Y}, are independently distributed as follows:
U_A, U_D, U_Y ~ Unif(0, 1),    U_E, U_M, U_T ~ N(0, 1).
For fixed constants µ_A, β_{E,0}, β_{E,A}, β_{M,0}, β_{M,E}, β_{T,0}, β_{T,E}, β_{T,M}, β_{T,B}, β_{T,u}, β_{Y,0}, β_{Y,D}, we define the endogenous variables V = {A, E, M, T, D, Y} in the DAG by the following structural equations:
f_A(u_A) = a₁ if u_A ≤ µ_A, and a₀ otherwise,
f_E(a, u_E) = β_{E,0} + β_{E,A}·1(a = a₁) + u_E,
f_M(e, u_M) = β_{M,0} + β_{M,E}·e + u_M,
f_T(e, m, u_T) = β_{T,0} + β_{T,E}·e + β_{T,M}·m + β_{T,B}·e·m + β_{T,u}·u_T,
f_D(x, u_D) = 1(u_D ≤ d(x)),
f_Y(m, u_Y, δ) = 1(u_Y ≤ logit⁻¹(β_{Y,0} + m + β_{Y,D}·δ)),
where logit⁻¹(x) = (1 + exp(−x))⁻¹ and d(x) is the decision policy. In our example, we use the constants µ_A = 1/3, β_{E,0} = 1, β_{E,A} = 1, β_{M,0} = 0, β_{M,E} = 1, β_{T,0} = 50, β_{T,E} = 4, β_{T,M} = 4, β_{T,u} = 7, β_{T,B} = 1, β_{Y,0} = 1/2, and β_{Y,D} = 1/2. We also assume a budget b = 1.

Appendix D. Proof of Theorem 10

We begin with the following simple lemma.

Lemma 22 Suppose F is a sub-σ-algebra of measurable sets, and suppose X is a non-negative bounded random variable with X ≤ b a.s. Then, a.s., Var(X | F) ≤ b·E[X | F].

Proof Note that since 0 ≤ X ≤ b a.s., E[X² | F] ≤ b·E[X | F]. Since, moreover, Var(X | F) ≤ E[X² | F], it follows that, a.s., Var(X | F) ≤ b·E[X | F], as desired.

Ignoring the conditioning, Lemma 22 can be interpreted as saying that the minimum of a bounded random variable cannot be too close to the mean when the variance is large. This fact enables us to prove Theorem 10.
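To unpack the interpretation just given, here is a short unconditional gloss (our illustration, using notation not in the original): if 0 ≤ X ≤ b and m denotes the essential infimum of X, then applying the lemma to X − m gives

```latex
\operatorname{Var}(X) = \operatorname{Var}(X - m)
  \le b \, \mathbb{E}[X - m]
  = b \left( \mathbb{E}[X] - m \right),
\qquad \text{i.e.,} \qquad
m \le \mathbb{E}[X] - \frac{\operatorname{Var}(X)}{b},
```

so a bounded random variable with non-negligible variance must take values well below its mean, which is how the lemma is deployed in the proof below.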
Proof of Theorem 10 We wish to show that no event E of the form {r(X) > t} ⊆ E ⊆ {r(X) ≥ t} is π-measurable. To that end, it suffices to show that E[1_E | π(X)] ∉ {0, 1} with positive probability. Since Pr(r(X) = t) = 0, we have 1_{r(X)≥t} = 1_{r(X)>t} a.s., and so it suffices to show that E[1_{r(X)>t} | π(X)] ∉ {0, 1} with positive probability.

Consider the set of covariates x = (x_u, a) such that ρ(x) lies in the interval (t, t + ϵ), where, without loss of generality, we assume t + ϵ < 1. Since E[r(X) | X_u = x_u] = ρ(x) > t, it follows immediately that for these x, Pr(r(X) > t | X_u = x_u) > 0. Let I(x) denote the essential infimum of the conditional distribution of r(X) | ρ(X) = ρ(x). Then we can apply Lemma 22 to r(X) − I(X), using the fact that 0 ≤ r(X) − I(X) ≤ 1 a.s., to obtain that ρ(X) − I(X) > ϵ a.s. In particular, for x with ρ(x) ∈ (t, t + ϵ), we have I(x) < ρ(x) − ϵ < t, so the conditional distribution of r(X) places positive mass below t. It follows that the event
{ρ(X) ∈ (t, t + ϵ), Pr(r(X) < t | ρ(X)) > 0}    (15)
has positive probability.

We can conclude from this that the related event
{ρ(X) ∈ (t, t + ϵ), Pr(r(X) < t | π(X)) > 0}
cannot have probability zero since, by the tower law, we would otherwise have that, a.s.,
0 = 1_{ρ(X)∈(t,t+ϵ)}·Pr(r(X) < t | π(X)),
and so, taking conditional expectations with respect to ρ(X), a.s.,
0 = 1_{ρ(X)∈(t,t+ϵ)}·Pr(r(X) < t | ρ(X)),
which contradicts Eq. (15). Therefore, with positive probability, for x such that ρ(x) ∈ (t, t + ϵ), both Pr(r(X) < t | X_u = x_u) > 0 and Pr(r(X) > t | X_u = x_u) > 0. Since Pr(ρ(X) ∈ (t, t + ϵ)) > 0, it follows that E[1_{r(X)>t} | π(X)] ∉ {0, 1} with positive probability, as desired.

Appendix E. Proof of Proposition 15

We begin by more formally defining (multiple) threshold policies. We assume, without loss of generality, that Pr(A = a) > 0 for all a ∈ A throughout.

Definition 23 Let u(x) be a utility function. We say that a policy d(x) is a threshold policy with respect to u if there exists some t such that
d(x) = 1 if u(x) > t, and d(x) = 0 if u(x) < t,
with d(x) ∈ [0, 1] arbitrary if u(x) = t. We say that d(x) is a multiple threshold policy with respect to u if there exist group-specific constants t_a for a ∈ A such that
d(x) = 1 if u(x) > t_{α(x)}, and d(x) = 0 if u(x) < t_{α(x)},
with d(x) ∈ [0, 1] arbitrary if u(x) = t_{α(x)}.

Remark 24 In general, it is possible for different thresholds to produce threshold policies that are almost surely equal. For instance, if u(X) ~ Bern(1/2), then the policies 1_{u(X)>p} are almost surely equal for all p ∈ [0, 1). Nevertheless, we speak in general of "the" threshold associated with the threshold policy d(X) unless there is ambiguity.

We first observe that if U is consistent modulo α, then whether a decision policy d(x) is a multiple threshold policy does not depend on our choice of u ∈ U.

Lemma 25 Let U be a collection of utilities consistent modulo α, and suppose d : X → [0, 1] is a decision rule. If d(x) is a multiple threshold rule with respect to a utility u ∈ U, then d(x) is a multiple threshold rule with respect to every u ∈ U. In particular, if d(x) can be represented by non-negative thresholds with respect to some u′ ∈ U, then it can be represented by non-negative thresholds with respect to any u ∈ U.

Proof Suppose d(x) is represented by thresholds {t′_a}_{a∈A} with respect to u′. We construct the thresholds {t_a}_{a∈A} with respect to u ∈ U explicitly. Fix a ∈ A and suppose first that there exists x′ ∈ α⁻¹(a) such that u′(x′) = t′_a. Then set t_a = u(x′). Now, if u(x) > t_a = u(x′), then, by consistency modulo α, u′(x) > u′(x′) = t′_a. Similarly, if u(x) < t_a, then u′(x) < t′_a. We also note that, by consistency modulo α, sign(t_a) = sign(u(x′)) = sign(u′(x′)) = sign(t′_a). If there is no x′ ∈ α⁻¹(a) such that u′(x′) = t′_a, then let
t_a = inf_{x∈S_a} u(x), where S_a = {x ∈ α⁻¹(a) | u′(x) > t′_a}.
Note that since sign(u(x)) = sign(u′(x)) for all x by consistency modulo α, if t′_a ≥ 0, it follows that t_a ≥ 0 as well. We need to show, in this case also, that if u(x) > t_a then u′(x) > t′_a, and that if u(x) < t_a then u′(x) < t′_a. To do so, let x ∈ α⁻¹(a) be arbitrary, and suppose u(x) > t_a. Then, by the definition of t_a, there exists x′ ∈ α⁻¹(a) such that u(x) > u(x′) ≥ t_a and u′(x′) > t′_a, whence u′(x) > u′(x′) > t′_a by consistency modulo α. On the other hand, if u(x) < t_a, it follows by the definition of t_a that u′(x) ≤ t′_a; since u′(x) ≠ t′_a by hypothesis, it follows that u′(x) < t′_a. Therefore, in both cases, for x ∈ α⁻¹(a), if u(x) > t_a then u′(x) > t′_a, and if u(x) < t_a then u′(x) < t′_a. Therefore
d(x) = 1 if u(x) > t_{α(x)}, and d(x) = 0 if u(x) < t_{α(x)},
i.e., d(x) is a multiple threshold policy with respect to u. Moreover, as noted above, if t′_a ≥ 0 for all a ∈ A, then t_a ≥ 0 for all a ∈ A.

We now prove the following strengthening of Prop. 15.

Lemma 26 Let U be a collection of utilities consistent modulo α. If d(x) is a feasible decision policy that is not a.s. equal to a multiple threshold policy with non-negative thresholds with respect to U, then d(x) is strongly Pareto dominated.

Proof We prove the claim in two parts. First, we show that any policy that is not a multiple threshold policy is strongly Pareto dominated. Then, we show that any multiple threshold policy that cannot be represented with non-negative thresholds is strongly Pareto dominated.

If d(x) is not a multiple threshold policy, then there exist u ∈ U and a′ ∈ A such that d(x) is not a threshold policy with respect to u when restricted to α⁻¹(a′). We will construct an alternative policy d′(x) that attains strictly greater utility on α⁻¹(a′) and is identical elsewhere. Thus, without loss of generality, we assume there is a single group, i.e., α(x) = a′. The proof proceeds, heuristically, by moving some of the mass below a threshold to above the threshold to create a feasible policy with improved utility.

Let b′ = E[d(X)], and define
m_Up(t) = E[(1 − d(X))·1_{u(X)>t}]   and   m_Lo(t) = E[d(X)·1_{u(X)<t}].
We show that there exists t* such that m_Up(t*) > 0 and m_Lo(t*) > 0. For, if not, consider t̄ = inf{t ∈ R : m_Up(t) = 0}. Note that d(X)·1_{u(X)>t̄} = 1_{u(X)>t̄} a.s. If t̄ = −∞, then by definition d(X) = 1 a.s., which is a threshold policy, violating our assumption on d(X). If t̄ > −∞, then for any t′ < t̄ we have, by definition, that m_Up(t′) > 0, and so by hypothesis m_Lo(t′) = 0. Therefore d(X)·1_{u(X)<t̄} = 0 a.s., and so, again, d(X) is a threshold policy, contrary to hypothesis.

Now, with t* as above, for notational simplicity let m_Up = m_Up(t*) and m_Lo = m_Lo(t*), and consider the alternative policy
d′(x) = (1 − m_Up)·d(x) if u(x) < t*;  d(x) if u(x) = t*;  1 − (1 − m_Lo)·(1 − d(x)) if u(x) > t*.
Then it follows by construction that
E[d′(X)] = (1 − m_Up)·m_Lo + E[d(X)·1_{u(X)=t*}] + Pr(u(X) > t*) − (1 − m_Lo)·m_Up
= m_Lo + E[d(X)·1_{u(X)=t*}] + Pr(u(X) > t*) − m_Up
= E[d(X)·1_{u(X)≤t*}] + E[d(X)·1_{u(X)>t*}]
= E[d(X)],
so d′(x) is feasible. However,
d′(x) − d(x) = m_Lo·(1 − d(x))·1_{u(x)>t*} − m_Up·d(x)·1_{u(x)<t*},
and so
E[(d′(X) − d(X))·u(X)] = m_Lo·E[(1 − d(X))·1_{u(X)>t*}·u(X)] − m_Up·E[d(X)·1_{u(X)<t*}·u(X)]
> m_Lo·t*·E[(1 − d(X))·1_{u(X)>t*}] − m_Up·t*·E[d(X)·1_{u(X)<t*}]
= t*·(m_Lo·m_Up − m_Up·m_Lo) = 0,
i.e., u(d′) > u(d). It remains to show that u′(d′) > u′(d) for arbitrary u′ ∈ U. Let t′ = inf{u′(x) : d′(x) > d(x)}. Note that, by construction, for any x, x′ ∈ X, if d′(x) > d(x) and d′(x′) < d(x′), then u(x) > t* > u(x′). It follows by consistency modulo α that u′(x) ≥ t′ ≥ u′(x′), and, moreover, that at least one of the inequalities is strict. Without loss of generality, assume u′(x) > t′ ≥ u′(x′). Then we have that u(x) > t* if and only if u′(x) > t′.
Therefore, it follows that
E[(d′(X) − d(X))·1_{u′(X)>t′}] = m_Lo·m_Up > 0.
Since E[d′(X) − d(X)] = 0, we see that
E[(d′(X) − d(X))·u′(X)] = E[(d′(X) − d(X))·1_{u′(X)>t′}·u′(X)] + E[(d′(X) − d(X))·1_{u′(X)≤t′}·u′(X)]
> t′·E[(d′(X) − d(X))·1_{u′(X)>t′}] + t′·E[(d′(X) − d(X))·1_{u′(X)≤t′}]
= t′·E[d′(X) − d(X)] = 0,
where in the inequality we have used the fact that if d′(x) > d(x), then u′(x) > t′, and that if d′(x) < d(x), then u′(x) ≤ t′. Therefore E[d(X)·u′(X)] < E[d′(X)·u′(X)], i.e., d′(x) strongly Pareto dominates d(x).

Now, we prove the second claim, namely, that a multiple threshold policy τ(x) that cannot be represented with non-negative thresholds is strongly Pareto dominated. For, if τ(x) is such a policy, then, by Lemma 25, for any u ∈ U, E[τ(X)·1_{u(X)<0}] > 0. It follows immediately that τ′(x) = τ(x)·1_{u(x)>0} satisfies u(τ′) > u(τ). By consistency modulo α, the definition of τ′(x) does not depend on our choice of u, and so u(τ′) > u(τ) for every u ∈ U, i.e., τ′(x) strongly Pareto dominates τ(x).

The following results, which draw on Lemma 26, are useful in the proof of Theorem 17.

Definition 27 We say that a decision policy d(x) is budget-exhausting if
min(b, Pr(u(X) > 0)) ≤ E[d(X)] ≤ min(b, Pr(u(X) ≥ 0)).

Remark 28 We note that if U is consistent modulo α, then whether or not a decision policy d(x) is budget-exhausting does not depend on the choice of u ∈ U. Further, if Pr(u(X) = 0) = 0, e.g., if the distribution of X is U-fine, then the decision policy is budget-exhausting if and only if E[d(X)] = min(b, Pr(u(X) > 0)).

Corollary 29 Let U be a collection of utilities consistent modulo α. If τ(x) is a feasible policy that is not a budget-exhausting multiple threshold policy with non-negative thresholds, then τ(x) is strongly Pareto dominated.

Proof Suppose τ(x) is not strongly Pareto dominated. By Lemma 26, it is a multiple threshold policy with non-negative thresholds. Now, suppose toward a contradiction that τ(x) is not budget-exhausting. Then either E[τ(X)] > min(b, Pr(u(X) ≥ 0)) or E[τ(X)] < min(b, Pr(u(X) > 0)).

In the first case, since τ(x) is feasible, it follows that E[τ(X)] > Pr(u(X) ≥ 0), and hence that τ(X)·1_{u(X)<0} is not almost surely zero. Therefore E[τ(X)·u(X)] < E[τ(X)·1_{u(X)>0}·u(X)], and, by consistency modulo α, this holds for any u ∈ U. Therefore τ(x) is strongly Pareto dominated, contrary to hypothesis.

In the second case, consider d(x) = θ·1_{u(x)>0} + (1 − θ)·τ(x). Since E[τ(X)] < min(b, Pr(u(X) > 0)) and E[d(X)] = θ·Pr(u(X) > 0) + (1 − θ)·E[τ(X)], there exists some θ > 0 such that d(x) is feasible. For that θ, a similar calculation shows immediately that u(d) > u(τ), and, by consistency modulo α, u′(d) > u′(τ) for all u′ ∈ U. Therefore, again, d(x) strongly Pareto dominates τ(x), contrary to hypothesis.

Lemma 30 Given a utility u, there exists a mapping T from [0, 1]^A to [−∞, ∞]^A taking sets of quantiles {q_a}_{a∈A} to thresholds {t_a}_{a∈A} such that:
1. T is monotonically non-increasing in each coordinate;
2. for each set of quantiles, there is a multiple threshold policy τ : X → [0, 1] with thresholds T({q_a}) with respect to u such that E[τ(X) | A = a] = q_a.

Proof Simply choose
t_a = inf{s ∈ R : Pr(u(X) > s | A = a) < q_a},   (16)
and set
p_a = (q_a − Pr(u(X) > t_a | A = a)) / Pr(u(X) = t_a | A = a) if Pr(u(X) = t_a, A = a) > 0, and p_a = 0 if Pr(u(X) = t_a, A = a) = 0.
Note that Pr(u(X) ≥ t_a | A = a) ≥ q_a since, by definition, Pr(u(X) > t_a − ϵ | A = a) ≥ q_a for all ϵ > 0. Therefore Pr(u(X) > t_a | A = a) + Pr(u(X) = t_a | A = a) ≥ q_a, and so p_a ≤ 1. Further, since Pr(u(X) > t_a | A = a) ≤ q_a, we have that p_a ≥ 0.
Finally, let
d(x) = 1 if u(x) > t_{α(x)};  p_{α(x)} if u(x) = t_{α(x)};  0 if u(x) < t_{α(x)},
and it follows immediately that E[d(X) | A = a] = q_a. That t_a is a monotonically non-increasing function of q_a follows immediately from Eq. (16).

We can further refine Cor. 29 and Lemma 30 as follows.

Lemma 31 Let u be a utility. Then a feasible policy is utility maximizing if and only if it is a budget-exhausting threshold policy. Moreover, there exists at least one utility-maximizing policy.

Proof Let α be a constant map, i.e., a map α : X → Ã where |Ã| = 1. Then U = {u} is consistent modulo α, and so, by Cor. 29, any Pareto efficient policy is a budget-exhausting multiple threshold policy relative to U. Since U contains a single element, a policy is Pareto efficient if and only if it is utility maximizing; and since α is constant, a policy is a multiple threshold policy relative to α if and only if it is a threshold policy. Therefore, a policy is utility maximizing if and only if it is a budget-exhausting threshold policy. By Lemma 30, such a policy exists, and so the maximum is attained.

Appendix F. Prevalence and the Proof of Theorem 17

The notion of a probabilistically small set, such as the event in which an idealized dart hits the exact center of a target, is, in finite-dimensional real vector spaces, typically encoded by the idea of a Lebesgue null set. Here we prove that the set of distributions over which there exists a policy satisfying either counterfactual equalized odds, conditional principal fairness, or counterfactual fairness that is not strongly Pareto dominated is small in an analogous sense.

The proof turns on the following intuition. Each of the fairness definitions imposes a number of constraints. By Lemma 26, any policy that is not strongly Pareto dominated is a multiple threshold policy. By adjusting the group-specific thresholds of such a policy, one can potentially satisfy one constraint per group. If there are more constraints than groups, then one has no additional degrees of freedom that can be used to ensure that the remaining constraints are satisfied. If, by chance, those constraints are satisfied by the same threshold policy, they are not satisfied robustly: even a minor distribution shift, such as increasing the amount of mass above the threshold by any amount on the relevant subpopulation, will break them. Therefore, over a typical distribution, at most |A| of the constraints can simultaneously be satisfied by a Pareto efficient policy, meaning that typically no Pareto efficient policy fully satisfies all of the conditions of the fairness definitions. Formalizing this intuition, however, requires considerable care.

In Section F.1, we give a brief introduction to a popular generalization of null sets to infinite-dimensional vector spaces, drawing heavily on a review article by Ott and Yorke (2005). In Section F.2, we provide a road map of the proof itself. In Section F.3, we establish the main hypotheses necessary to apply the notion of prevalence to a convex set, in our case, the set of U-fine distributions. In Section F.4, we establish a number of technical lemmata used in the proof of Theorem 17, and we provide a proof of the theorem itself in Section F.5. In Section F.8, we show why the hypothesis of U-fineness is important and how conspiracies between atoms in the distribution of u(X) can lead to robust counterexamples.
F.1 Shyness and Prevalence

Lebesgue measure λ_n on R^n has a number of desirable properties:
- Local finiteness: for any point v ∈ R^n, there exists an open set U containing v such that λ_n[U] < ∞;
- Strict positivity: for any open set U, if λ_n[U] = 0, then U = ∅;
- Translation invariance: for any v ∈ R^n and measurable set E, λ_n[E + v] = λ_n[E].

No measure on an infinite-dimensional, separable Banach space, such as L¹(R), can satisfy all three of these properties (Ott and Yorke, 2005). However, while there is no generalization of Lebesgue measure to infinite dimensions, there is a generalization of Lebesgue null sets, called shy sets, to the infinite-dimensional context that preserves many of their desirable properties.

Definition 32 (Hunt et al. (1992)) Let V be a completely metrizable topological vector space. We say that a Borel set E ⊆ V is shy if there exists a Borel measure µ on V such that:
1. there exists a compact set C ⊆ V such that 0 < µ[C] < ∞; and
2. for all v ∈ V, µ[E + v] = 0.
An arbitrary set F ⊆ V is shy if there exists a shy Borel set E ⊆ V containing F. We say that a set is prevalent if its complement is shy.

Prevalence generalizes the concept of Lebesgue full measure or co-null sets (i.e., sets whose complements have null Lebesgue measure) in the following sense:

Proposition 33 (Hunt et al. (1992)) Let V be a completely metrizable topological vector space. Then:
- any prevalent set is dense in V;
- if G ⊆ L and G is prevalent, then L is prevalent;
- a countable intersection of prevalent sets is prevalent;
- every translate of a prevalent set is prevalent;
- if V = R^n, then G ⊆ R^n is prevalent if and only if λ_n[R^n \ G] = 0.

As is conventional for sets of full measure in finite-dimensional spaces, if some property holds for every v ∈ E, where E is prevalent, then we say that the property holds for almost every v ∈ V, or that it holds generically in V.

Prevalence can also be generalized from vector spaces to convex subsets of vector spaces, although additional care must be taken to ensure that a relative version of Prop. 33 holds.

Definition 34 (Anderson and Zame (2001)) Let V be a topological vector space and let C ⊆ V be a convex subset completely metrizable in the subspace topology induced by V. We say that a universally measurable set E ⊆ C is shy in C at c ∈ C if for each 1 ≥ δ > 0 and each neighborhood U of 0 in V, there is a regular Borel measure µ with compact support such that Supp(µ) ⊆ (δ(C − c) + c) ∩ (U + c), and µ[E + v] = 0 for every v ∈ V. We say that E is shy in C (or shy relative to C) if E is shy in C at c for every c ∈ C. An arbitrary set F ⊆ V is shy in C if there exists a universally measurable shy set E ⊆ C containing F. A set G is prevalent in C if C \ G is shy in C.

Proposition 35 (Anderson and Zame (2001)) If E is shy at some point c ∈ C, then E is shy at every point in C, and hence is shy in C.

Sets that are shy in C enjoy similar properties to sets that are shy in V.

Proposition 36 (Anderson and Zame (2001)) Let V be a topological vector space and let C ⊆ V be a convex subset completely metrizable in the subspace topology induced by V. Then:
- any set prevalent in C is dense in C;
- if G ⊆ L and G is prevalent in C, then L is prevalent in C;
- a countable intersection of sets prevalent in C is prevalent in C;
- if G is prevalent in C, then G + v is prevalent in C + v for all v ∈ V;
- if V = R^n and C ⊆ V is a convex subset with non-empty interior, then G ⊆ C is prevalent in C if and only if λ_n[C \ G] = 0.
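For intuition, consider the following toy example of Definition 32 (our illustration, not drawn from the original text): in the infinite-dimensional space V = L¹(R), the set of functions vanishing on the unit interval is shy.

```latex
E = \{ f \in L^1(\mathbb{R}) : f = 0 \ \text{a.e. on } [0,1] \},
\qquad
\mu = \mathrm{Unif}\{ t \cdot \mathbf{1}_{[0,1]} : t \in [0,1] \}.
```

The measure µ is supported on a compact set C with 0 < µ[C] < ∞, and for any v ∈ V, membership of t·1_{[0,1]} in E + v forces v = t a.e. on [0, 1], which pins down at most one value of t; hence µ[E + v] = 0 for every v, and E is shy. The one-dimensional subspace spanned by 1_{[0,1]} plays exactly the role of the probes introduced below.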
Sets that are shy in C can often be identified by inspecting their intersections with a finite-dimensional subspace W of V, a strategy we use to prove Theorem 17.

Definition 37 (Anderson and Zame (2001)) A universally measurable subset E of a convex and completely metrizable set C is said to be k-shy in C if there exists a k-dimensional subspace W ⊆ V such that:
1. a translate of the set C has positive Lebesgue measure in W, i.e., λ_W[C + v₀] > 0 for some v₀ ∈ V; and
2. every translate of the set E is a Lebesgue null set in W, i.e., λ_W[E + v] = 0 for all v ∈ V.
Here λ_W denotes k-dimensional Lebesgue measure supported on W.33 We refer to such a W as a k-dimensional probe witnessing the k-shyness of E, and to an element w ∈ W as a perturbation.

33. Note that Lebesgue measure on W is only defined up to a choice of basis; however, since λ[T(A)] = |det(T)|·λ[A] for any linear automorphism T and Lebesgue measure λ, whether a set has null measure does not depend on the choice of basis.

The following intuition motivates the use of probes to detect shy sets. By analogy with Fubini's theorem, one can imagine trying to determine whether a subset of a finite-dimensional vector space is large or small by looking at its cross sections parallel to some subspace W ⊆ V. If a set E ⊆ V is small in each cross section, i.e., if λ_W[E + v] = 0 for all v ∈ V, then E itself is small in V, i.e., E has λ_V-measure zero.

Proposition 38 (Anderson and Zame (2001)) Every k-shy set in C is shy in C.

F.2 Outline

To aid the reader in following the application of the theory in Section F.1 to the proof of Theorem 17, we provide the following outline of the argument.

In Section F.3, we establish the context to which we apply the notion of relative shyness. In particular, we introduce the vector space 𝒦 consisting of the totally bounded Borel measures on the state space K, where K is X × Y, X × Y × Y, or A × X^A, depending on which notion of fairness is under consideration. We further isolate the subspace 𝒦_U ⊆ 𝒦 of U-fine totally bounded Borel measures. Within this space, we are interested in the convex set Q ⊆ 𝒦_U, the set of U-fine joint probability distributions of, respectively, X and Y(1); X, Y(0), and Y(1); or A and the X_{Π,A,a′}. Within Q, we identify E ⊆ Q, the set of U-fine distributions on K over which there exists a policy satisfying the relevant fairness definition that is not strongly Pareto dominated. The claim of Theorem 17 is that E is shy relative to Q.

To ensure that relative shyness generalizes Lebesgue null measure in the expected way, i.e., that Prop. 36 holds, Definition 34 has three technical requirements: (1) that the ambient vector space V be a topological vector space; (2) that the convex set C be completely metrizable; and (3) that the shy set E be universally measurable. In Lemma 42, we observe that 𝒦 is a complete topological vector space under the total variation norm, and so is a Banach space. We extend this in Cor. 47, showing that 𝒦_U is also a Banach space. We use this fact in Lemma 50 to show that Q is a completely metrizable subset of 𝒦_U, as well as convex. Lastly, in Lemma 56, we show that the set E is closed, and therefore universally measurable.

In Section F.4, we develop the machinery needed to construct a probe W for the proof of Theorem 17 and prove several lemmata simplifying the eventual proof of the theorem. To build the probe, it is necessary to construct measures µ_{max,a} with maximal support on the utility scale.
This ensures that if any two threshold policies produce different decisions on any µ ∈ 𝒦, they will produce different decisions on typical perturbations. The construction of the µ_{max,a} is carried out in Lemma 58 and Cor. 59. Next, we introduce the basic style of argument used to show that a subset of Q is shy in Lemma 62 and Lemma 63, in particular, by showing that the set of µ ∈ Q that give positive probability to an event E is either prevalent or empty. We then use a technical lemma, Lemma 64, to show, in effect, that a generic element of Q has support on the utility scale wherever a given fixed distribution µ ∈ Q does. In Defn. 66, we introduce the concept of overlapping and splitting utilities, and show in Lemma 67 that this property is generic in Q unless there exists an ω-stratum that contains no positive-utility observables x. Lastly, in Lemma 68, we provide a mild simplification of the characterization of finitely shy sets that makes the proof of Theorem 17 more straightforward.

Finally, in Section F.5, we give the proof of Theorem 17. We divide the proof into three parts. In the first part, we restrict our attention to the case of counterfactual equalized odds, and show in detail how to combine the lemmata of the previous section to construct the (at most) 2|A|-dimensional probe W. In the second part, we consider two distinct cases. The argument in both cases is conceptually parallel. First, we argue that the balance conditions of counterfactual equalized odds encoded by Eq. (4) must be broken by a typical perturbation in W. In particular, we argue that for a given base distribution µ, there can be at most one budget-exhausting multiple threshold policy that can (although need not necessarily) satisfy counterfactual equalized odds. We show that the form of this policy cannot be altered by an appropriate perturbation in W, but that the conditional probability of a positive decision will, in general, be altered in such a way that Eq. (4) can only hold for a λ_W-null set of perturbations. In the final part, we lay out modifications that can be made to the proof given for counterfactual equalized odds in the first two parts to adapt the argument to the cases of conditional principal fairness and path-specific fairness. In particular, we show how to construct the probe W in such a way that the additional conditioning on the reduced covariates W = ω(X) in Eqs. (5) and (8) does not affect the argument.

F.3 Convexity, Complete Metrizability, and Universal Measurability

In this section, we establish the background requirements of Prop. 38 for the setting of Theorem 17. In particular, we exhibit the U-fine distributions as a convex subset of a topological vector space, the set of totally bounded U-fine Borel measures. We show that the U-fine probability distributions form a completely metrizable subset in the topology they inherit from the space of totally bounded measures. Lastly, we show that the set of regular distributions under which there exists a Pareto efficient policy satisfying one of the three fairness criteria is closed, and therefore universally measurable.

F.3.1 Background and Notation

We begin by establishing some notational conventions. We let K denote the underlying state space over which the distributions in Theorem 17 range. Specifically, K = X × Y in the case of counterfactual equalized odds; K = X × Y × Y in the case of conditional principal fairness; and K = A × X^A in the case of path-specific fairness.
We note that since $X \subseteq \mathbb{R}^k$ for some $k$ and $Y \subseteq \mathbb{R}$, $K$ may equivalently be considered a subset of $\mathbb{R}^n$ for some $n \in \mathbb{N}$, with the subspace topology (and Borel sets) inherited from $\mathbb{R}^n$.34

34. In the case of path-specific fairness, we can equivalently think of $A$ as a set of integers indexing the groups.

We recall the definition of totally bounded measures.

Definition 39 Let $\mathcal{M}$ be a $\sigma$-algebra on $V$, and let $\mu$ be a countably additive $(V, \mathcal{M})$-measure. Then, we define
$$|\mu|[E] = \sup \sum_{i=1}^{\infty} |\mu[E_i]|, \quad (17)$$
where the supremum is taken over all countable partitions $\{E_i\}_{i \in \mathbb{N}}$ of $E$, i.e., collections such that $\bigcup_{i=1}^{\infty} E_i = E$ and $E_i \cap E_j = \emptyset$ for $j \neq i$. We call $|\mu|$ the total variation of $\mu$, and the total variation norm of $\mu$ is $|\mu|[V]$. We say that $\mu$ is totally bounded if its total variation norm is finite, i.e., $|\mu|[V] < \infty$.

Lemma 40 If $\mu$ is totally bounded, then $|\mu|$ is a finite positive measure on $(V, \mathcal{M})$, and $|\mu[E]| \leq |\mu|[E]$ for all $E \in \mathcal{M}$. See Theorem 6.2 in Rudin (1987) for proof.

We let $\mathcal{K}$ denote the set of totally bounded Borel measures on $K$. We note that, in the case of path-specific fairness, which involves the joint distributions of counterfactuals, $X$ is not defined directly. Rather, the joint distribution of the counterfactuals $X_{\Pi,A,a'}$ and $A$ defines the distribution of $X$ through consistency, i.e., what would have happened to someone if their group membership were changed to $a' \in A$ is what actually happens to them if their group membership is $a'$. More formally, $\Pr(X \in E \mid A = a') = \Pr(X_{\Pi,A,a'} \in E \mid A = a')$ for all Borel sets $E \subseteq X$. (See 3.6.3 in Pearl (2009b).)

For any $\mu \in \mathcal{K}$, we adopt the following notational conventions. If we say that a property holds $\mu$-a.s., then the subset of $K$ on which the property fails has $|\mu|$-measure zero. If $E \subseteq K$ is a measurable set, then we denote by $\mu|_E$ the restriction of $\mu$ to $E$, i.e., the measure defined by the mapping $E' \mapsto \mu[E \cap E']$. We let $E_\mu[f] = \int_K f \, d\mu$, and, for measurable sets $E$, $\Pr_\mu(E) = \mu[E]$.35

35. To state and prove our results in a notationally uniform way, we occasionally write $\Pr_\mu(E)$ even when $\mu$ ranges over measures that may not be probability measures.

The fairness criteria we consider involve conditional independence relations. To make sense of conditional independence relations more generally, for Borel measurable $f$ we define $E_\mu[f \mid \mathcal{F}]$ to be the Radon-Nikodym derivative of the measure $E \mapsto E_\mu[f \cdot 1_E]$ with respect to the measure $\mu$ restricted to the sub-$\sigma$-algebra of Borel sets $\mathcal{F}$. (See 34 in Billingsley (1995).) Similarly, we define $E_\mu[f \mid g]$ to be $E_\mu[f \mid \sigma(g)]$, where $\sigma(g)$ denotes the sub-$\sigma$-algebra of the Borel sets generated by $g$. In cases where the condition occurs with non-zero probability, we can instead make use of the elementary definition of discrete conditional probability.

Lemma 41 Let $g$ be a Borel function on $K$, and suppose $\Pr_\mu(g = c) > 0$ for some constant $c \in \mathbb{R}$. Then, we have that $\mu$-a.s., for any Borel function $f$,
$$E_\mu[f \mid g] \cdot 1_{g=c} = \frac{E_\mu[f \cdot 1_{g=c}]}{\Pr_\mu(g = c)} \cdot 1_{g=c}.$$
See Rao (2005) for proof.

With these notational conventions in place, we turn to establishing the background conditions of Prop. 38.

Lemma 42 The set of totally bounded measures on a measure space $(V, \mathcal{M})$ forms a complete topological vector space under the total variation norm, and hence a Banach space. See, e.g., Steele (2019) for proof.

It follows from this that $\mathcal{K}$ is a Banach space.

Remark 43 Since $\mathcal{K}$ is a Banach space, it possesses a topology, and consequently a collection of Borel subsets. These Borel sets are to be distinguished from the Borel subsets of the underlying state space $K$, which the elements of $\mathcal{K}$ measure. The requirement in Proposition 38 that the subset $E$ of the convex set $C$ be universally measurable is in reference to the Borel subsets of $\mathcal{K}$; the requirement that $\mu \in \mathcal{K}$ be a Borel measure is in reference to the Borel subsets of $K$.
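The supremum in Definition 39 can be made concrete on a finite state space, where it is attained by separating the positive and negative atoms of the signed measure (the Hahn-Jordan decomposition). The following toy check is ours; the four-atom measure is arbitrary.

```python
import numpy as np
from itertools import product

mu = np.array([0.5, -0.3, 0.2, -0.1])          # a signed measure on four atoms
tv_hahn = np.abs(mu).sum()                     # |mu|[V] via positive/negative parts
# Brute force over two-block partitions; refining a partition can only increase
# the sum in Eq. (17), so two blocks suffice on this small example.
tv_brute = max(
    abs(mu[list(mask)].sum()) + abs(mu[[not m for m in mask]].sum())
    for mask in product([True, False], repeat=4)
)
print(tv_hahn, tv_brute)                       # both 1.1: the supremum in Eq. (17)
```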
Recall the definition of absolute continuity.

Definition 44 Let $\mu$ and $\nu$ be measures on a measure space $(V, \mathcal{M})$. We say that $\nu$ is absolutely continuous with respect to $\mu$, also written $\nu \ll \mu$, if, whenever $\mu[E] = 0$, $\nu[E] = 0$.

Absolute continuity is a closed property in the topology induced by the total variation norm.

Lemma 45 Consider the space of totally bounded measures on a measure space $(V, \mathcal{M})$ and fix $\mu$. The set of $\nu$ such that $\nu \ll \mu$ is closed.

Proof Let $\{\nu_i\}_{i \in \mathbb{N}}$ be a convergent sequence of measures absolutely continuous with respect to $\mu$, and let its limit be $\nu$. We seek to show that $\nu \ll \mu$. Let $E \in \mathcal{M}$ be an arbitrary set such that $\mu[E] = 0$. Then, we have that $\nu[E] = \lim_{i \to \infty} \nu_i[E] = 0$, since $\nu_i \ll \mu$ for all $i$. Since $E$ was arbitrary, the result follows.

Recall the definition of a pushforward measure.

Definition 46 Let $f : (V, \mathcal{M}) \to (V', \mathcal{M}')$ be a measurable function. Let $\mu$ be a measure on $V$. We define the pushforward measure $\mu \circ f^{-1}$ on $V'$ by the map $E' \mapsto \mu[f^{-1}(E')]$ for $E' \in \mathcal{M}'$.

Within $\mathcal{K}$, in the case of counterfactual equalized odds and conditional principal fairness, we define the subspace $\bar{\mathcal{K}}$ to be the set of totally bounded measures $\mu$ on $K$ such that the pushforward measure $\mu \circ u^{-1}$ is absolutely continuous with respect to the Lebesgue measure $\lambda$ on $\mathbb{R}$ for all $u \in \mathcal{U}$. By the Radon-Nikodym theorem, these pushforward measures arise from densities, i.e., for any $\mu \in \bar{\mathcal{K}}$, there exists a unique $f_\mu \in L^1(\mathbb{R})$ such that for any measurable subset $E$ of $\mathbb{R}$, we have
$$\mu \circ u^{-1}[E] = \int_E f_\mu \, d\lambda.$$
In the case of path-specific fairness, we require the joint distributions of the counterfactual utilities to have a joint density. That is, we define the subspace $\bar{\mathcal{K}}$ to be the set of totally bounded measures $\mu$ on $K$ such that the pushforward measure $\mu \circ (u^A)^{-1}$ is absolutely continuous with respect to Lebesgue measure on $\mathbb{R}^A$ for all $u \in \mathcal{U}$. Here, we recall that $u^A : (a, (x_{a'})_{a' \in A}) \mapsto (u(x_{a'}))_{a' \in A}$. As before, there exists a corresponding density $f_\mu \in L^1(\mathbb{R}^A)$.

We therefore see that $\bar{\mathcal{K}}$ extends in a natural way the notion of a $\mathcal{U}$- or $\mathcal{U}^A$-fine distribution, and so, by a slight abuse of notation, we refer to $\bar{\mathcal{K}}$ as the set of $\mathcal{U}$-fine measures on $K$. Indeed, since $\Pr_\mu(u(X) \in E, A = a) \leq \Pr_\mu(u(X) \in E)$, it also follows that, for $a \in A$ such that $\Pr_\mu(A = a) > 0$, the conditional distributions of $u(X) \mid A = a$ are also absolutely continuous with respect to Lebesgue measure, and so also have densities. For notational convenience, we let $f_{\mu,a}$ be the function satisfying
$$\Pr_\mu(u(X) \in E, A = a) = \int_E f_{\mu,a} \, d\lambda,$$
so that $f_\mu = \sum_{a \in A} f_{\mu,a}$.

Since absolute continuity is a closed condition, it follows that $\bar{\mathcal{K}}$ is a closed subspace of $\mathcal{K}$. This leads to the following useful corollary of Lemma 45.

Corollary 47 The collection $\bar{\mathcal{K}}$ of $\mathcal{U}$-fine measures on $K$ is a Banach space.

Proof It is straightforward to see that $\bar{\mathcal{K}}$ is a subspace of $\mathcal{K}$. Since $\bar{\mathcal{K}}$ is a closed subset of $\mathcal{K}$ by Lemma 45, it is complete, and therefore a Banach space.

We note the following useful fact about elements of $\bar{\mathcal{K}}$.

Lemma 48 Consider the mapping $\mu \mapsto f_\mu$ from $\bar{\mathcal{K}}$ to $L^1(\mathbb{R})$ given by associating a measure $\mu$ with the Radon-Nikodym derivative of the pushforward measure $\mu \circ u^{-1}$. This mapping is continuous.
Likewise, the mapping $\mu \mapsto f_{\mu,a}$ is continuous for all $a \in A$, and, in the case of path-specific fairness, the mapping of $\mu$ to the Radon-Nikodym derivative of $\mu \circ (u^A)^{-1}$ is continuous.

Proof We show only the first case; the others follow by virtually identical arguments. Let $\epsilon > 0$ be arbitrary. Choose $\mu' \in \bar{\mathcal{K}}$, and suppose that $|\mu - \mu'|[K] < \epsilon$. Then, let
$$E_{Up} = \{x \in \mathbb{R} : f_\mu(x) > f_{\mu'}(x)\}, \quad E_{Lo} = \{x \in \mathbb{R} : f_\mu(x) < f_{\mu'}(x)\}.$$
The sets $E_{Up}$ and $E_{Lo}$ are disjoint, so we have that
$$\|f_\mu - f_{\mu'}\|_{L^1(\mathbb{R})} = \int_{E_{Up}} (f_\mu - f_{\mu'}) \, d\lambda + \int_{E_{Lo}} (f_{\mu'} - f_\mu) \, d\lambda = |(\mu - \mu')[u^{-1}(E_{Up})]| + |(\mu - \mu')[u^{-1}(E_{Lo})]| \leq 2 \cdot |\mu - \mu'|[K] < 2\epsilon,$$
where the second equality follows by the definition of pushforward measures and the inequality follows from Lemma 40. Since $\epsilon$ was arbitrary, the claim follows.

Finally, we define $Q$. We let $Q$ be the subset of $\bar{\mathcal{K}}$ consisting of all $\mathcal{U}$-fine probability measures, i.e., measures $\mu \in \mathcal{K}$ such that:

1. The measure $\mu$ is $\mathcal{U}$-fine;

2. For all Borel sets $E \subseteq K$, $\mu[E] \geq 0$;

3. The measure of the whole space is unity, i.e., $\mu[K] = 1$.

We conclude the background and notation by observing that threshold policies are defined wholly by their thresholds for distributions in $\bar{\mathcal{K}}$ and $Q$. Importantly, this observation does not hold when there are atoms on the utility scale (which measures in $\bar{\mathcal{K}}$ lack), which can in turn lead to counterexamples to Theorem 17; see Appendix F.8.

Lemma 49 Let $\tau_0(x)$ and $\tau_1(x)$ be two multiple threshold policies. If $\tau_0(x)$ and $\tau_1(x)$ have the same thresholds, then, for any $\mu \in \bar{\mathcal{K}}$, $\tau_0(X) = \tau_1(X)$ $\mu$-a.s. Similarly, for $\mu \in Q$, if
$$E_\mu[\tau_0(X) \mid A = a] = E_\mu[\tau_1(X) \mid A = a]$$
for all $a \in A$ such that $\Pr_\mu(A = a) > 0$, then $\tau_0(X) = \tau_1(X)$ $\mu$-a.s. Moreover, for $\mu \in \bar{\mathcal{K}}$ in the case of path-specific fairness, if $\tau_0(x)$ and $\tau_1(x)$ have the same thresholds, then $\tau_0(X_{\Pi,A,a'}) = \tau_1(X_{\Pi,A,a'})$ $\mu$-a.s. for any $a' \in A$. Similarly, for $\mu \in Q$ in the case of path-specific fairness, if
$$E_\mu[\tau_0(X_{\Pi,A,a'})] = E_\mu[\tau_1(X_{\Pi,A,a'})],$$
then $\tau_0(X_{\Pi,A,a'}) = \tau_1(X_{\Pi,A,a'})$ $\mu$-a.s. as well.

Proof First, we show that threshold policies with the same thresholds are equal; then we show that threshold policies that distribute positive decisions across groups in the same way are equal.

Let $\{t_a\}_{a \in A}$ denote the shared set of thresholds. If $\tau_0(x) \neq \tau_1(x)$, then $u(x) = t_{\alpha(x)}$, since two threshold policies with the same thresholds can disagree only where the utility falls exactly on the threshold. Now,
$$\Pr_\mu(u(X) = t_a, A = a) = \int_{t_a}^{t_a} f_{\mu,a} \, d\lambda = 0,$$
so $\Pr_\mu(\tau_0(X) \neq \tau_1(X)) = 0$.

Next, suppose $E_\mu[\tau_0(X) \mid A = a] = E_\mu[\tau_1(X) \mid A = a]$. If the thresholds of the two policies agree for all $a \in A$ such that $\Pr_\mu(A = a) > 0$, then we are done by the previous paragraph. Therefore, suppose $t^0_a \neq t^1_a$ for some suitable $a \in A$, where $t^i_a$ represents the threshold for group $a \in A$ under the policy $\tau_i(x)$. Without loss of generality, suppose $t^0_a < t^1_a$. Then, it follows that
$$\int_{t^0_a}^{t^1_a} f_{\mu,a} \, d\lambda = E_\mu[\tau_0(X) \mid A = a] - E_\mu[\tau_1(X) \mid A = a] = 0.$$
Since $\mu \in Q$, $\mu = |\mu|$, whence $\Pr_{|\mu|}(t^0_a \leq u(X) \leq t^1_a \mid A = a) = 0$. Since this is true for all $a \in A$ such that $\Pr_\mu(A = a) > 0$, $\tau_0(X) = \tau_1(X)$ $\mu$-a.s. The proof in the case of path-specific fairness is almost identical.
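The role of $\mathcal{U}$-fineness in Lemma 49 can be seen numerically. In the sketch below (ours, with illustrative distributions), an atomless utility distribution makes acceptance rates injective in the threshold wherever the density is positive, while an atom, as in Appendix F.8, gives a single threshold a continuum of acceptance rates.

```python
import numpy as np

rng = np.random.default_rng(1)
u = rng.uniform(1.0, 2.0, size=500_000)        # atomless: density positive on (1, 2)
t0, t1 = 1.25, 1.50
# Distinct thresholds with equal acceptance rates would require zero mass
# between them; here that mass is positive, so the rates must differ:
print((u > t0).mean() - (u > t1).mean())       # ~0.25 > 0
# Add an atom at u = 1 carrying mass 1/2, as in the Prop. 72 distribution:
atom = rng.random(500_000) < 0.5
u_atom = np.where(atom, 1.0, u)
for p in (0.0, 0.5, 1.0):                      # randomization probability at the atom
    print((u_atom > 1).mean() + p * (u_atom == 1).mean())   # ~0.5, 0.75, 1.0
```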
F.3.2 Convexity, Complete Metrizability, and Universal Measurability

The set of regular $\mathcal{U}$-fine probability measures $Q$ is the set to which we wish to apply Prop. 38. To do so, we must show that $Q$ is a convex and completely metrizable subset of $\bar{\mathcal{K}}$.

Lemma 50 The set of regular probability measures $Q$ is convex and completely metrizable.

Proof The proof proceeds in two pieces. First, we show that the $\mathcal{U}$-fine probability distributions are convex, as can be verified by direct calculation. Then, we show that $Q$ is closed, and therefore complete in the original metric of $\bar{\mathcal{K}}$.

We begin by verifying convexity. Let $\mu, \mu' \in Q$ and let $E \subseteq K$ be an arbitrary Borel subset of $K$. Then, choose $\theta \in [0, 1]$, and note that
$$(\theta \mu + [1 - \theta] \mu')[E] = \theta \mu[E] + [1 - \theta] \mu'[E] \geq \theta \cdot 0 + [1 - \theta] \cdot 0 = 0,$$
and, likewise, that
$$(\theta \mu + [1 - \theta] \mu')[K] = \theta \mu[K] + [1 - \theta] \mu'[K] = \theta \cdot 1 + [1 - \theta] \cdot 1 = 1.$$

It remains only to show that $Q$ is completely metrizable. To prove this, it suffices to show that it is closed, since closed subsets of complete spaces are complete, and $\bar{\mathcal{K}}$ is a Banach space by Cor. 47, and therefore complete. Suppose $\{\mu_i\}_{i \in \mathbb{N}}$ is a convergent sequence of probability measures in $\bar{\mathcal{K}}$ with limit $\mu$. Then, for any Borel set $E \subseteq K$,
$$\mu[E] = \lim_{i \to \infty} \mu_i[E] \geq \lim_{i \to \infty} 0 = 0,$$
and
$$\mu[K] = \lim_{i \to \infty} \mu_i[K] = \lim_{i \to \infty} 1 = 1.$$
Therefore $Q$ is closed, and therefore complete, and hence is a convex, completely metrizable subset of $\bar{\mathcal{K}}$.

Next we prove that the set $E$ of regular $\mathcal{U}$-fine densities over which there exists a policy satisfying the relevant counterfactual fairness definition that is not strongly Pareto dominated is universally measurable. Recall the definition of universal measurability.

Definition 51 Let $V$ be a complete topological space. Then $E \subseteq V$ is universally measurable if $E$ is measurable with respect to the completion of every finite Borel measure on $V$, i.e., if for every finite Borel measure $\mu$, there exist Borel sets $E'$ and $S$ such that $E' \subseteq E \subseteq E' \cup S$ and $\mu[S] = 0$.

We note that if a set is Borel, it is by definition universally measurable. Moreover, if a set is open or closed, it is by definition Borel. To show that $E$ is closed, we show that any convergent sequence in $E$ has a limit in $E$. The technical complication of the argument stems from the following fact: satisfying the fairness conditions, e.g., Eq. (7), involves conditional expectations, about which very little can be said in the absence of a density, and which are difficult to compare when taken across distinct measures.

To handle these difficulties, we begin with a technical lemma, Lemma 55, which gives a coarse bound on how different the conditional expectations of the same variable can be with respect to a sub-$\sigma$-algebra $\mathcal{F}$ over two different distributions, $\mu$ and $\mu'$, before applying the results to the proof of Lemma 56.

Definition 52 Let $\mu$ be a measure on a measure space $(V, \mathcal{M})$, and let $f$ be $\mu$-measurable. Consider the equivalence class of $\mathcal{M}$-measurable functions $\mathcal{C} = \{g : g = f \ \mu\text{-a.e.}\}$.36 We say that any $g \in \mathcal{C}$ is a version of $f$, and that $g \in \mathcal{C}$ is a standard version if $|g(v)| \leq C$ for some constant $C$ and all $v \in V$.

36. Some authors define $L^p(\mu)$ spaces to consist of such equivalence classes, rather than the definition we use here.

Remark 53 It is straightforward to see that for $f \in L^\infty(\mu)$, a standard version always exists with $C = \|f\|_{L^\infty(\mu)}$.

Remark 54 Note that in general, the conditional expectation $E_{\mu'}[f \mid \mathcal{F}]$ is defined only $\mu'$-a.e. If $\mu$ is not assumed to be absolutely continuous with respect to $\mu'$, it follows that
$$\|E_\mu[f \mid \mathcal{F}] - E_{\mu'}[f \mid \mathcal{F}]\|_{L^1(\mu)} \quad (18)$$
is not entirely well-defined, in that its value depends on what version of $E_{\mu'}[f \mid \mathcal{F}]$ one chooses. For appropriate $f$, however, one can nevertheless bound Eq. (18) for any standard version of $E_{\mu'}[f \mid \mathcal{F}]$.

Lemma 55 Let $\mu, \mu'$ be totally bounded measures on a measure space $(V, \mathcal{M})$. Let $f \in L^\infty(\mu) \cap L^\infty(\mu')$. Let $\mathcal{F}$ be a sub-$\sigma$-algebra of $\mathcal{M}$. Let $C = \max(\|f\|_{L^\infty(\mu)}, \|f\|_{L^\infty(\mu')})$. Then, if $g$ is a standard version of $E_{\mu'}[f \mid \mathcal{F}]$, we have that
$$\int_V |E_\mu[f \mid \mathcal{F}] - g| \, d\mu \leq 4C \cdot |\mu - \mu'|[V]. \quad (19)$$

Proof First, we note that both $E_\mu[f \mid \mathcal{F}]$ and $g$ are $\mathcal{F}$-measurable. Therefore, the sets
$$E_{Up} = \{v \in V : E_\mu[f \mid \mathcal{F}](v) > g(v)\} \quad \text{and} \quad E_{Lo} = \{v \in V : E_\mu[f \mid \mathcal{F}](v) < g(v)\}$$
are in $\mathcal{F}$. Now, note that
$$\int_V |E_\mu[f \mid \mathcal{F}] - g| \, d\mu = \int_{E_{Up}} (E_\mu[f \mid \mathcal{F}] - g) \, d\mu + \int_{E_{Lo}} (g - E_\mu[f \mid \mathcal{F}]) \, d\mu.$$
First consider $E_{Up}$. Since $E_{Up} \in \mathcal{F}$, the definition of conditional expectation gives $\int_E E_\nu[h \mid \mathcal{F}] \, d\nu = \int_E h \, d\nu$ for any $E \in \mathcal{F}$; and since $g$ is a standard version of $E_{\mu'}[f \mid \mathcal{F}]$, $|g(v)| \leq \|E_{\mu'}[f \mid \mathcal{F}]\|_{L^\infty(\mu')} \leq C$ for all $v \in V$. Therefore,
$$\int_{E_{Up}} (E_\mu[f \mid \mathcal{F}] - g) \, d\mu = \int_{E_{Up}} f \, d\mu - \int_{E_{Up}} g \, d\mu' - \int_{E_{Up}} g \, d(\mu - \mu') \leq \int_{E_{Up}} f \, d(\mu - \mu') + \int_{E_{Up}} C \, d|\mu - \mu'|.$$
Since $f$ is everywhere bounded by $C$, applying Lemma 40 yields that this final expression is less than or equal to $2C \cdot |\mu - \mu'|[V]$. An identical argument shows that
$$\int_{E_{Lo}} (g - E_\mu[f \mid \mathcal{F}]) \, d\mu \leq 2C \cdot |\mu - \mu'|[V],$$
whence the result follows.

Lemma 56 Let $E \subseteq Q$ denote the set of joint densities on $K$ such that there exists a policy satisfying the relevant fairness definition that is not strongly Pareto dominated. Then $E$ is closed, and therefore universally measurable.

Proof For notational simplicity, we consider the case of counterfactual equalized odds; the proofs in the other two cases are virtually identical. Suppose $\mu_i \to \mu$ in $\bar{\mathcal{K}}$, where $\{\mu_i\}_{i \in \mathbb{N}} \subseteq E$. Then, by Lemma 48, $f_{\mu_i,a} \to f_{\mu,a}$ in $L^1(\mathbb{R})$. Moreover, by Lemma 26, there exists a sequence of threshold policies $\{\tau_i(x)\}_{i \in \mathbb{N}}$ such that both
$$E_{\mu_i}[\tau_i(X)] = \min(b, \Pr_{\mu_i}(u(X) > 0))$$
and
$$E_{\mu_i}[\tau_i(X) \mid A, Y(1)] = E_{\mu_i}[\tau_i(X) \mid Y(1)].$$
Let $\{q_{a,i}\}_{a \in A}$ be defined by $q_{a,i} = E_{\mu_i}[\tau_i(X) \mid A = a]$ if $\Pr_{\mu_i}(A = a) > 0$, and $q_{a,i} = 0$ otherwise.

Since $[0,1]^A$ is compact, there exists a convergent subsequence $\{\{q_{a,n_i}\}_{a \in A}\}_{i \in \mathbb{N}}$. Let it converge to the collection of quantiles $\{q_a\}_{a \in A}$ defining, by Lemma 30, a multiple threshold policy $\tau(x)$ over $\mu$. Because $\mu_i \to \mu$ and $\{q_{a,n_i}\}_{a \in A} \to \{q_a\}_{a \in A}$, we have that
$$E_\mu[\tau_{n_i}(X) \mid A = a] \to E_\mu[\tau(X) \mid A = a]$$
for all $a \in A$ such that $\Pr_\mu(A = a) > 0$. Therefore, by Lemma 48, $\tau_{n_i}(X) \to \tau(X)$ in $L^1(\mu)$.

Choose $\epsilon > 0$ arbitrarily. Then, choose $N$ so large that for $i$ greater than $N$,
$$|\mu - \mu_{n_i}|[K] < \frac{\epsilon}{10}, \quad \|\tau(X) - \tau_{n_i}(X)\|_{L^1(\mu)} \leq \frac{\epsilon}{10}.$$
Then, observe that $\tau(x), \tau_i(x) \leq 1$, and recall that
$$E_{\mu_{n_i}}[\tau_{n_i}(X) \mid A, Y(1)] = E_{\mu_{n_i}}[\tau_{n_i}(X) \mid Y(1)]. \quad (20)$$
Therefore, let $g_i(x)$ be a standard version of $E_{\mu_{n_i}}[\tau_{n_i}(X) \mid Y(1)]$ over $\mu_{n_i}$. By Eq. (20), $g_i(x)$ is also a standard version of $E_{\mu_{n_i}}[\tau_{n_i}(X) \mid A, Y(1)]$ over $\mu_{n_i}$. Then, by the triangle inequality,
$$\|E_\mu[\tau(X) \mid A, Y(1)] - E_\mu[\tau(X) \mid Y(1)]\|_{L^1(\mu)} \leq \|E_\mu[\tau(X) \mid A, Y(1)] - E_\mu[\tau_{n_i}(X) \mid A, Y(1)]\|_{L^1(\mu)} + \|E_\mu[\tau_{n_i}(X) \mid A, Y(1)] - g_i(X)\|_{L^1(\mu)} + \|g_i(X) - E_\mu[\tau_{n_i}(X) \mid Y(1)]\|_{L^1(\mu)} + \|E_\mu[\tau_{n_i}(X) \mid Y(1)] - E_\mu[\tau(X) \mid Y(1)]\|_{L^1(\mu)} \leq \epsilon.$$
Here the first and last terms are each at most $\epsilon/10$, since conditional expectation is an $L^1$-contraction and $\|\tau(X) - \tau_{n_i}(X)\|_{L^1(\mu)} \leq \epsilon/10$; the middle two terms are each at most $4 \cdot \epsilon/10$ by Lemma 55 applied with $f = \tau_{n_i}$ and $C = 1$, since $g_i$ is a standard version of both conditional expectations over $\mu_{n_i}$. Since $\epsilon > 0$ was arbitrary, it follows that, $\mu$-a.e.,
$$E_\mu[\tau(X) \mid A, Y(1)] = E_\mu[\tau(X) \mid Y(1)].$$

Recall the standard fact that for independent random variables $X$ and $U$,
$$E[f(X, U) \mid X] = \int f(X, u) \, dF_U(u),$$
where $F_U$ is the distribution of $U$.37 Further recall that $D = 1_{U_D \leq \tau(X)}$, where $U_D \perp X, Y(1)$. It follows that
$$\Pr_\mu(D = 1 \mid X, Y(1)) = \int_0^1 1_{u_d \leq \tau(X)} \, d\lambda(u_d) = \tau(X).$$
Hence, by the law of iterated expectations,
$$\Pr_\mu(D = 1 \mid A, Y(1)) = E_\mu[\tau(X) \mid A, Y(1)] = E_\mu[\tau(X) \mid Y(1)] = \Pr_\mu(D = 1 \mid Y(1)).$$

37. For a proof of this fact see, e.g., Brozius (2019).

Therefore $D \perp A \mid Y(1)$ over $\mu$, i.e., counterfactual equalized odds holds for the decision policy $\tau(x)$ over the distribution $\mu$. Consequently $\mu \in E$, and so $E$ is closed and therefore universally measurable.
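Before turning to probes, a finite sanity check of the Lemma 55 bound may be helpful. The script below is our addition: it takes a six-point space, the sub-$\sigma$-algebra generated by a two-block partition, and randomly generated measures and $f$.

```python
import numpy as np

rng = np.random.default_rng(2)
mu, mup = rng.random(6), rng.random(6)         # two positive measures on 6 points
f = rng.uniform(-1.0, 1.0, size=6)
C = np.abs(f).max()                            # C bounds f under both measures
blocks = [np.arange(0, 3), np.arange(3, 6)]    # the partition generating F

def cond_exp(m):
    # E_m[f | F]: the m-weighted block average; 0 on m-null blocks, which is
    # a standard version since |0| <= C.
    out = np.zeros(6)
    for b in blocks:
        out[b] = f[b] @ m[b] / m[b].sum() if m[b].sum() > 0 else 0.0
    return out

lhs = np.abs(cond_exp(mu) - cond_exp(mup)) @ mu    # int |E_mu[f|F] - g| d(mu)
rhs = 4 * C * np.abs(mu - mup).sum()               # 4C |mu - mu'|[V], Eq. (19)
print(lhs <= rhs, float(lhs), float(rhs))          # the bound holds
```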
F.4 Shy Sets and Probes

We require a number of additional technical lemmata for the proof of Theorem 17. The probe must be constructed carefully, so that, on the utility scale, an arbitrary element of $Q$ is absolutely continuous with respect to a typical perturbation. In addition, it is useful to show that a number of properties are generic in order to simplify certain aspects of the proof of Theorem 17. For instance, Lemma 63 is used in the proof of Theorem 17 to show that a certain conditional expectation is generically well-defined, avoiding the need to separately treat certain corner cases. Cor. 59 concerns the construction of the probe used in the proof of Theorem 17. Lemmata 64 to 68 use Cor. 59 to provide additional simplifications to the proof of Theorem 17.

F.4.1 Maximal Support

First, to construct the probe used in the proof of Theorem 17, we require elements $\mu \in Q$ such that the densities $f_\mu$ have maximal support. To produce such distributions, we use the following measure-theoretic construction.

Definition 57 Let $\{E_\gamma\}_{\gamma \in \Gamma}$ be an arbitrary collection of $\mu$-measurable sets for some positive measure $\mu$ on a measure space $(V, \mathcal{M})$. We say that $E$ is the measure-theoretic union of $\{E_\gamma\}_{\gamma \in \Gamma}$ if $\mu[E_\gamma \setminus E] = 0$ for all $\gamma \in \Gamma$ and $E = \bigcup_{i=1}^\infty E_{\gamma_i}$ for some countable subcollection $\{\gamma_i\}_{i \in \mathbb{N}} \subseteq \Gamma$.

While measure-theoretic unions themselves are known (cf. Silva (2008), Rudin (1991)), for completeness, we include a proof of their existence, which, to the best of our knowledge, is not found in the literature.

Lemma 58 Let $\mu$ be a finite positive measure on a measure space $(V, \mathcal{M})$. Then an arbitrary collection $\{E_\gamma\}_{\gamma \in \Gamma}$ of $\mu$-measurable sets has a measure-theoretic union.

Proof For each countable subcollection $\Gamma' \subseteq \Gamma$, consider the error term
$$r(\Gamma') = \sup_{\gamma \in \Gamma} \mu\Big[E_\gamma \setminus \bigcup_{\gamma' \in \Gamma'} E_{\gamma'}\Big].$$
We claim that the infimum of $r(\Gamma')$ over all countable subcollections $\Gamma' \subseteq \Gamma$ must be zero. For, toward a contradiction, suppose it were greater than or equal to $\epsilon > 0$. Choose any set $E_{\gamma_1}$ such that $\mu[E_{\gamma_1}] \geq \epsilon$. Such a set must exist, since otherwise $r(\emptyset) < \epsilon$. Choose $E_{\gamma_2}$ such that $\mu[E_{\gamma_2} \setminus E_{\gamma_1}] \geq \epsilon$. Again, some such set must exist, since otherwise $r(\{\gamma_1\}) < \epsilon$. Continuing in this way, we construct a countable collection $\{E_{\gamma_i}\}_{i \in \mathbb{N}}$. Therefore, we see that
$$\mu[V] \geq \mu\Big[\bigcup_{i=1}^\infty E_{\gamma_i}\Big] = \sum_{i=1}^\infty \mu\Big[E_{\gamma_i} \setminus \bigcup_{j < i} E_{\gamma_j}\Big].$$
By construction, every term in the final sum is greater than or equal to $\epsilon$, contradicting the fact that $\mu[V] < \infty$. Therefore, there exist countable collections $\{\Gamma_n\}_{n \in \mathbb{N}}$ such that $r(\Gamma_n) < \frac{1}{n}$. It follows immediately that
$$r\Big(\bigcup_{n' \in \mathbb{N}} \Gamma_{n'}\Big) \leq r(\Gamma_n) < \frac{1}{n}$$
for any fixed $n \in \mathbb{N}$. Consequently, $r(\bigcup_{n \in \mathbb{N}} \Gamma_n) = 0$, and $\bigcup_{n \in \mathbb{N}} \Gamma_n$ is countable, so that $E = \bigcup_{\gamma \in \bigcup_n \Gamma_n} E_\gamma$ is the desired measure-theoretic union.

The construction of the maximal elements used to construct the probe in the proof of Theorem 17 follows as a corollary of Lemma 58.

Corollary 59 There are measures $\mu_{max,a} \in Q$ such that for every $a \in A$ and any $\mu \in \bar{\mathcal{K}}$,
$$\lambda[\mathrm{Supp}(f_{\mu,a}) \setminus \mathrm{Supp}(f_{\mu_{max,a}})] = 0.$$

Proof Consider the collection $\{\mathrm{Supp}(f_{\mu,a})\}_{\mu \in \bar{\mathcal{K}}}$. By Lemma 58, there exists a countable collection of measures $\{\mu_i\}_{i \in \mathbb{N}}$ such that for any $\mu \in \bar{\mathcal{K}}$,
$$\lambda\Big[\mathrm{Supp}(f_{\mu,a}) \setminus \bigcup_{i=1}^\infty \mathrm{Supp}(f_{\mu_i,a})\Big] = 0,$$
where, without loss of generality, we may assume that $\lambda[\mathrm{Supp}(f_{\mu_i,a})] > 0$ for all $i \in \mathbb{N}$. Such a sequence must exist, since, by the first hypothesis of Theorem 17, for every $a \in A$, there exists $\mu \in Q$ such that $\Pr_\mu(A = a) > 0$. Therefore, we can define the probability measure
$$\mu_{max,a} = \sum_{i=1}^\infty 2^{-i} \cdot \frac{\big|\mu_i|_{A=a}\big|}{\big|\mu_i|_{A=a}\big|[K]}.$$
It follows immediately by construction that
$$\mathrm{Supp}(f_{\mu_{max,a}}) = \bigcup_{i=1}^\infty \mathrm{Supp}(f_{\mu_i,a}),$$
and that $\mu_{max,a} \in Q$. For notational simplicity, we refer to $\mathrm{Supp}(f_{\mu_{max,a}})$ as $S_a$ throughout.
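The construction in Cor. 59 can be sketched numerically: a countable mixture $\sum_i 2^{-i} \mu_i$ has a density whose support is the union of the component supports. The densities below are our illustrative choices, standing in for the $f_{\mu_i,a}$.

```python
import numpy as np

grid = np.linspace(0.0, 1.0, 1001)
supports = [(0.0, 0.5), (0.25, 0.75), (0.5, 1.0)]    # supports of the f_{mu_i, a}
f_max = np.zeros_like(grid)
for i, (lo, hi) in enumerate(supports, start=1):
    f_i = ((grid >= lo) & (grid <= hi)) / (hi - lo)  # Unif(lo, hi) density
    f_max += 2.0 ** -i * f_i                         # the 2^{-i} mixture of Cor. 59
f_max /= np.trapz(f_max, grid)                       # renormalize to total mass one
print(grid[f_max > 0].min(), grid[f_max > 0].max())  # support = union = [0, 1]
```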
In the case of conditional principal fairness and path-specific fairness, we need a mild refinement of the previous result that accounts for $\omega$.

Corollary 60 There are measures $\mu_{max,a,w} \in Q$ defined for every $w \in \mathcal{W} = \mathrm{Img}(\omega)$ and every $a \in A$ such that, for some $\nu \in \bar{\mathcal{K}}$, $\Pr_\nu(W = w, A = a) > 0$. These measures have the property that for any $\mu' \in \bar{\mathcal{K}}$,
$$\lambda[\mathrm{Supp}(f_{\mu',a,w}) \setminus \mathrm{Supp}(f_{\mu_{max,a,w}})] = 0,$$
where $f_{\mu',a,w}$ is the density of the pushforward measure $(\mu'|_{W=w,A=a}) \circ u^{-1}$.

Recalling that $|\mathrm{Img}(\omega)| < \infty$, the proof is the same as that of Cor. 59, and we analogously refer to $\mathrm{Supp}(f_{\mu_{max,a,w}})$ as $S_{a,w}$. Here, we have assumed without loss of generality (as we continue to assume in the sequel) that for all $w \in \mathcal{W}$, there is some $\mu \in \bar{\mathcal{K}}$ such that $\Pr_\mu(W = w) > 0$.

Remark 61 Because their support is maximal, the hypotheses of Theorem 17, in addition to implying that $\mu_{max,a}$ is well-defined for all $a \in A$, also imply that $\Pr_{\mu_{max,a}}(u(X) > 0) > 0$. In the case of conditional principal fairness, they further imply that $\Pr_{\mu_{max,a}}(W = w) > 0$ for all $w \in \mathcal{W}$ and $a \in A$. Likewise, in the case of path-specific fairness, they further imply that $\Pr_{\mu_{max,a}}(W = w_i) > 0$ for $i = 0, 1$ and some $a \in A$.

F.4.2 Shy Sets and Probes

In the following lemmata, we demonstrate that a number of useful properties are generic in $Q$. We also demonstrate a short technical lemma, Lemma 68, which allows us to use these generic properties to simplify the proof of Theorem 17. We begin with the following lemma, which is useful in verifying that certain subspaces of $\bar{\mathcal{K}}$ form probes.

Lemma 62 Let $W$ be a non-trivial finite-dimensional subspace of $\bar{\mathcal{K}}$ such that $\nu[K] = 0$ for all $\nu \in W$. Then, there exists $\mu \in \bar{\mathcal{K}}$ such that $\lambda_W[Q - \mu] > 0$.

Proof Let $\nu_1, \ldots, \nu_n$ form a basis of $W$, and let
$$\mu = \sum_{i=1}^n \frac{1}{n} \cdot \frac{|\nu_i|}{|\nu_i|[K]}.$$
Then, if $|\beta_i| \leq \frac{1}{n \cdot |\nu_i|[K]}$ for all $i$, it follows that
$$\mu + \sum_{i=1}^n \beta_i \nu_i \in Q,$$
since the total mass is unchanged (each $\nu_i[K] = 0$) and the perturbation cannot make the measure negative. Since the cube
$$\prod_{i=1}^n \Big[-\frac{1}{n \cdot |\nu_i|[K]}, \frac{1}{n \cdot |\nu_i|[K]}\Big]$$
has positive Lebesgue measure, it follows that $\lambda_W[Q - \mu] > 0$.

Next we show that, given a $\nu \in Q$, a generic element of $Q$ sees events to which $\nu$ assigns non-zero probability. While Lemma 65 alone in principle suffices for the proof of Theorem 17, we include Lemma 63 both for conceptual clarity and to introduce at a high level the style of argument used in the subsequent lemmata and in the proof of Theorem 17 to show that a set is shy relative to $Q$.

Lemma 63 For a Borel set $E \subseteq K$, suppose there exists $\nu \in Q$ such that $\nu[E] > 0$. Then the set of $\mu \in Q$ such that $\mu[E] > 0$ is prevalent.

Proof First, we note that the set of $\mu \in Q$ such that $\mu[E] = 0$ is closed, and therefore universally measurable. For, if $\{\mu_i\}_{i \in \mathbb{N}} \subseteq Q$ is a convergent sequence with limit $\mu$, then
$$\mu[E] = \lim_{i \to \infty} \mu_i[E] = 0.$$
Now, if $\mu[E] > 0$ for all $\mu \in Q$, there is nothing to prove. Therefore, suppose that there exists $\nu' \in Q$ such that $\nu'[E] = 0$. Next, consider the measure $\tilde{\nu} = \nu' - \nu$, and let $W = \mathrm{Span}(\tilde{\nu})$. Since $\tilde{\nu} \neq 0$ and $\tilde{\nu}[K] = \nu'[K] - \nu[K] = 0$, it follows by Lemma 62 that $\lambda_W[Q - \mu] > 0$ for some $\mu$. Now, for arbitrary $\mu \in Q$, note that if $(\mu + \beta \tilde{\nu})[E] = 0$, then $\mu[E] + \beta \cdot \tilde{\nu}[E] = 0$; since $\tilde{\nu}[E] = -\nu[E] \neq 0$, this determines $\beta$ uniquely. A singleton has null Lebesgue measure, and so the set of $\nu'' \in W$ such that $(\mu + \nu'')[E] = 0$ is $\lambda_W$-null. Therefore, by Prop. 38, the set of $\mu \in Q$ such that $\mu[E] = 0$ is shy relative to $Q$, as desired.

While Lemma 63 shows that a typical element of $Q$ sees individual events, in the proof of Theorem 17, we require a stronger condition, namely, that a typical element of $Q$ sees certain uncountable collections of events. To demonstrate this more complex property, we require the following technical result, which is closely related to the real analysis folk theorem that any convergent uncountable sum can contain only countably many non-zero terms. (See, e.g., Benji (2020).)
Lemma 64 Suppose $\mu$ is a totally bounded measure on $(V, \mathcal{M})$, $f$ and $g$ are $\mu$-measurable real-valued functions, and $g \neq 0$ $\mu$-a.e. Then the set
$$Z_\beta = \{v \in V : f(v) + \beta \cdot g(v) = 0\}$$
has non-zero $\mu$-measure for at most countably many $\beta \in \mathbb{R}$.

Proof First, we show that for any countable collection $\{\beta_i\}_{i \in \mathbb{N}} \subseteq \mathbb{R}$ of distinct values, the sum $\sum_{i=1}^\infty \mu[Z_{\beta_i}]$ converges. Then, we show how this implies that $\mu[Z_\beta] = 0$ for all but countably many $\beta \in \mathbb{R}$.

First, we note that for distinct $\beta, \beta' \in \mathbb{R}$,
$$Z_\beta \cap Z_{\beta'} \subseteq \{v \in V : (\beta - \beta') \cdot g(v) = 0\}.$$
Now, by hypothesis, $\mu[\{v \in V : g(v) = 0\}] = 0$, and since $\beta - \beta' \neq 0$, it follows that $\mu[\{v \in V : (\beta - \beta') \cdot g(v) = 0\}] = 0$ as well. Consequently, if $\{Z_{\beta_i}\}_{i \in \mathbb{N}}$ is a countable collection corresponding to distinct elements of $\mathbb{R}$, then the $Z_{\beta_i}$ are disjoint up to $\mu$-null sets, and so
$$\sum_{i=1}^\infty \mu[Z_{\beta_i}] = \mu\Big[\bigcup_{i=1}^\infty Z_{\beta_i}\Big] \leq \mu[V] < \infty.$$
To see that this implies that $\mu[Z_\beta] > 0$ for only countably many $\beta \in \mathbb{R}$, let $G_\epsilon \subseteq \mathbb{R}$ consist of those $\beta$ such that $\mu[Z_\beta] \geq \epsilon$. Then $G_\epsilon$ must be finite for all $\epsilon > 0$, since otherwise we could form a countable collection $\{\beta_i\}_{i \in \mathbb{N}} \subseteq G_\epsilon$, in which case $\sum_{i=1}^\infty \mu[Z_{\beta_i}] = \infty$, contrary to what was just shown. Therefore,
$$\{\beta \in \mathbb{R} : \mu[Z_\beta] > 0\} = \bigcup_{n \in \mathbb{N}} G_{1/n}$$
is countable.
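For intuition, the following discrete illustration (ours) shows the mechanism of Lemma 64: for an atomic measure, $\mu[Z_\beta] > 0$ forces $\beta = -f(v)/g(v)$ at some atom $v$, and there are only countably many such ratios.

```python
import numpy as np

rng = np.random.default_rng(3)
f, g = rng.normal(size=8), rng.normal(size=8)   # g is nonzero at every atom (a.s.)
w = np.full(8, 1 / 8)                           # the atomic measure mu
bad = np.sort(-f / g)                           # the only betas with mu[Z_beta] > 0
for beta in [bad[0], bad[1], 0.123456]:
    mass = w[np.isclose(f + beta * g, 0.0)].sum()
    print(beta, mass)                           # 0.125, 0.125, then 0.0
```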
We now apply Lemma 64 to the proof of the following lemma, which states, informally, that, under a generic element of $Q$, $u(X)$ is supported everywhere it is supported under some particular fixed element of $Q$. For instance, Lemma 64 can be used to show that for a generic element of $Q$, the density of $u(X) \mid A = a$ is positive $\lambda|_{S_a}$-a.e.

Lemma 65 Let $\nu \in Q$ and suppose $\nu$ is supported on $E$, i.e., $\nu[K \setminus E] = 0$. Then the set of $\mu \in Q$ such that $\nu \circ u^{-1} \ll (\mu|_E) \circ u^{-1}$ is prevalent relative to $Q$. That is, informally, for generic $\mu \in Q$, $f_{\mu|_E}$ is supported everywhere $f_\nu$ is supported.

Proof We begin by showing that the set of $\mu \in Q$ such that $\nu \circ u^{-1} \ll (\mu|_E) \circ u^{-1}$ is Borel, and therefore universally measurable. Then, we construct a probe $W$ and use it to show that its complement is finitely shy.

To begin, let $U_q$ denote the set of $\mu \in Q$ such that $\nu \circ u^{-1}[\{f_{\mu|_E} = 0\}] < q$. We note that $U_q$ is open. For, if $\mu \in U_q$, then there exists some $r > 0$ such that $\nu \circ u^{-1}[\{f_{\mu|_E} < r\}] < q$. Let $\epsilon = q - \nu \circ u^{-1}[\{f_{\mu|_E} < r\}]$. Now, since $\nu \circ u^{-1} \ll \lambda$, there exists a $\delta$ such that if $\lambda[E'] < \delta$, then $\nu \circ u^{-1}[E'] < \epsilon$. Choose $\mu'$ arbitrarily so that $|\mu - \mu'|[K] < \delta \cdot r$. Then, by Markov's inequality,
$$\lambda[\{|f_{\mu|_E} - f_{\mu'|_E}| \geq r\}] < \delta, \quad \text{and so} \quad \nu \circ u^{-1}[\{|f_{\mu|_E} - f_{\mu'|_E}| \geq r\}] < \epsilon.$$
Now, we note that, wherever $f_{\mu'|_E} = 0$, either $f_{\mu|_E} < r$ or $|f_{\mu|_E} - f_{\mu'|_E}| \geq r$. Therefore,
$$\nu \circ u^{-1}[\{f_{\mu'|_E} = 0\}] \leq \nu \circ u^{-1}[\{|f_{\mu|_E} - f_{\mu'|_E}| \geq r\}] + \nu \circ u^{-1}[\{f_{\mu|_E} < r\}] < \epsilon + \nu \circ u^{-1}[\{f_{\mu|_E} < r\}] = q.$$
We conclude that $\mu' \in U_q$, and so $U_q$ is open. Note that $\nu \circ u^{-1} \ll (\mu|_E) \circ u^{-1}$ if and only if $\nu \circ u^{-1}[\{f_{\mu|_E} = 0\}] = 0$, so the set of $\mu \in Q$ such that $\nu \circ u^{-1} \ll (\mu|_E) \circ u^{-1}$ is $\bigcap_{i=1}^\infty U_{1/i}$, which is, by construction, Borel, and therefore universally measurable.

Next we construct the probe. Since
$$\Pr_\nu(u(X) < t) = \int_{-\infty}^t f_\nu \, d\lambda$$
is a continuous function of $t$, by the intermediate value theorem, there exists $t$ such that $\Pr_\nu(u(X) \in S_0) = \Pr_\nu(u(X) \in S_1)$, where $S_0 = \mathrm{Supp}(f_\nu) \cap (-\infty, t)$ and $S_1 = \mathrm{Supp}(f_\nu) \cap [t, \infty)$. Then, we define $\tilde{\nu}$ by
$$\tilde{\nu}[E'] = \int_{E'} \big(1_{u^{-1}(S_0)} - 1_{u^{-1}(S_1)}\big) \, d\nu.$$
Take $W = \mathrm{Span}(\tilde{\nu})$. Since $\tilde{\nu} \neq 0$ and $\tilde{\nu}[K] = 0$, we have by Lemma 62 that $\lambda_W[Q - \mu] > 0$ for some $\mu$. By the definition of a density, $f_{\tilde{\nu}}$ is non-zero $(\nu \circ u^{-1})$-a.e. Therefore, by Lemma 64, applied with $g = f_{\tilde{\nu}}$, for any fixed $\mu \in Q$ there exist only countably many $\beta \in \mathbb{R}$ such that the density of $(\mu + \beta \tilde{\nu})|_E \circ u^{-1}$ equals zero on a set of positive $\nu \circ u^{-1}$-measure. Since countable sets have $\lambda$-measure zero and $\mu$ was arbitrary, the set of $\mu \in Q$ such that $\nu \circ u^{-1} \ll (\mu|_E) \circ u^{-1}$ is prevalent relative to $Q$ by Prop. 38.

The following definition and technical lemma are needed to extend Theorem 17 to the cases of conditional principal fairness and path-specific fairness, which involve additional conditioning on $W = \omega(X)$. In particular, one corner case we wish to avoid in the proof of Theorem 17 is when the decision policy is non-trivial (i.e., some individuals receive a positive decision and others do not) but, from the perspective of each $\omega$-stratum, the policy is trivial (i.e., everyone in the stratum receives a positive or negative decision). Definition 66 formalizes this pathology, and Lemma 67 shows that this issue, under a mild hypothesis, does not arise for a generic element of $Q$.

Definition 66 We say that $\mu \in Q$ overlaps utilities when, for any budget-exhausting multiple threshold policy $\tau(x)$, if $0 < E_\mu[\tau(X)] < 1$, then there exists $w \in \mathcal{W}$ such that $0 < E_\mu[\tau(X) \mid W = w] < 1$. If there exists a budget-exhausting multiple threshold policy $\tau(x)$ such that $0 < E_\mu[\tau(X)] < 1$, but for all $w \in \mathcal{W}$, $E_\mu[\tau(X) \mid W = w] \in \{0, 1\}$, then we say that $\tau(x)$ splits utilities over $\mu$.

Informally, having overlapped utilities prevents a budget-exhausting threshold policy from having thresholds that fall on the utility scale exactly between the strata induced by $\omega$, i.e., a threshold policy that splits utilities. Overlapping utilities is a generic condition in $Q$, as we show in Lemma 67.

Lemma 67 Let $0 < b < 1$. Suppose that for all $w \in \mathcal{W}$ there exists $\mu \in Q$ such that $\Pr_\mu(u(X) > 0, W = w) > 0$. Then almost every $\mu \in Q$ overlaps utilities.

Proof Our goal is to show that the set $E'$ of measures $\mu \in Q$ such that there exists a splitting policy $\tau(x)$ is shy. To simplify the proof, we divide and conquer, showing that for each $\Gamma \subseteq \mathcal{W}$, the set $E'_\Gamma$ of measures $\mu \in Q$ for which there exists a splitting policy whose thresholds fall below the strata $w \in \Gamma$ and above the strata $w \notin \Gamma$ is Borel, before constructing a probe that shows that it is shy. Then, we argue that $E' = \bigcup_{\Gamma \subseteq \mathcal{W}} E'_\Gamma$, which shows that $E'$ is shy.

We begin by considering the linear map $\Phi : \bar{\mathcal{K}} \to \mathbb{R} \times \mathbb{R}^{\mathcal{W}}$ given by
$$\Phi(\mu) = \big(\Pr_\mu(u(X) > 0), \ (\Pr_\mu(W = w))_{w \in \mathcal{W}}\big).$$
For any $\Gamma \subseteq \mathcal{W}$, the sets
$$F^{Up}_\Gamma = \Big\{x \in \mathbb{R} \times \mathbb{R}^{\mathcal{W}} : x_0 \geq b, \ b = \sum_{w \in \Gamma} x_w\Big\}, \quad F^{Lo}_\Gamma = \Big\{x \in \mathbb{R} \times \mathbb{R}^{\mathcal{W}} : x_0 \leq b, \ x_0 = \sum_{w \in \Gamma} x_w\Big\}$$
are closed by construction. Therefore, since $\Phi$ is continuous, $E'_\Gamma = Q \cap \Phi^{-1}\big[F^{Up}_\Gamma \cup F^{Lo}_\Gamma\big]$ is closed, and therefore universally measurable.

Note that by our hypothesis and Cor. 60, for all $w \in \mathcal{W}$ there exists some $a_w \in A$ such that $\Pr_{\mu_{max,a_w,w}}(u(X) > 0) > 0$. We use this to show that $E'_\Gamma$ is shy. Pick $w' \in \mathcal{W}$ arbitrarily, and consider the measures $\nu_w$ for $w \neq w'$ defined by
$$\nu_w = \frac{\mu_{max,a_w,w}|_{u(X)>0}}{\Pr_{\mu_{max,a_w,w}}(u(X) > 0)} - \frac{\mu_{max,a_{w'},w'}|_{u(X)>0}}{\Pr_{\mu_{max,a_{w'},w'}}(u(X) > 0)}.$$
We note that $\nu_w[K] = 0$ by construction. Therefore, if $W_w = \mathrm{Span}(\nu_w)$, then $\lambda_{W_w}[Q - \mu_w] > 0$ for some $\mu_w$ by Lemma 62. Moreover, we have that $\Pr_\nu(u(X) > 0) = 0$ for all $\nu \in W_w$, i.e., $\Pr_\mu(u(X) > 0) = \Pr_{\mu+\nu}(u(X) > 0)$.

Now, since $0 < b < 1$ and $\omega$ partitions $X$, it follows that $E'_\emptyset$ and $E'_{\mathcal{W}}$ are empty. Since $\lambda_W[\emptyset] = 0$ for any subspace $W$, we can assume without loss of generality that $\Gamma \neq \mathcal{W}, \emptyset$. In that case, there exists $w_\Gamma \in \mathcal{W}$ on the opposite side of $\Gamma$ from $w'$, i.e., such that if $w' \in \Gamma$, then $w_\Gamma \notin \Gamma$, and vice versa. Without loss of generality, assume $w_\Gamma \in \Gamma$ and $w' \notin \Gamma$. It then follows that for arbitrary $\mu \in Q$,
$$\Phi(\mu + \beta \nu_{w_\Gamma}) = \Phi(\mu) + \beta \cdot e_{w_\Gamma} - \beta \cdot e_{w'},$$
where $e_w$ is the basis vector corresponding to $w \in \mathcal{W}$.
From this, it follows immediately from the defining equations of $F^{Up}_\Gamma$ and $F^{Lo}_\Gamma$ that $\mu + \beta \nu_{w_\Gamma} \in E'_\Gamma$ only if
$$\beta = \min(b, \Pr_\mu(u(X) > 0)) - \sum_{w \in \Gamma} \Pr_\mu(W = w).$$
This is a measure-zero subset of $\mathbb{R}$, and so it follows that $\lambda_{W_{w_\Gamma}}[E'_\Gamma - \mu] = 0$ for all $\mu \in \bar{\mathcal{K}}$. Therefore, by Prop. 38, $E'_\Gamma$ is shy in $Q$. Taking the union over $\Gamma \subseteq \mathcal{W}$, it follows by Prop. 36 that $\bigcup_{\Gamma \subseteq \mathcal{W}} E'_\Gamma$ is shy.

Now, we must show that $E' = \bigcup_{\Gamma \subseteq \mathcal{W}} E'_\Gamma$. By construction, $E'_\Gamma \subseteq E'$, since the policy $\tau(x) = 1_{\omega(x) \in \Gamma}$ is budget-exhausting and splits utilities. To see the reverse inclusion, suppose $\mu \in E'$, i.e., that there exists a budget-exhausting multiple threshold policy $\tau(x)$ that splits utilities over $\mu$. Then, let
$$\Gamma_\mu = \{w \in \mathcal{W} : E_\mu[\tau(X) \mid W = w] = 1\}.$$
Since $\tau(x)$ is budget-exhausting, it follows immediately that $\mu \in E'_{\Gamma_\mu}$. Therefore, $E' = \bigcup_{\Gamma \subseteq \mathcal{W}} E'_\Gamma$, and so $E'$ is shy, as desired.

We conclude our discussion of shyness and shy sets with the following general lemma, which simplifies relative prevalence proofs by showing that one can, without loss of generality, restrict one's attention to the elements of the shy set itself in applying Prop. 38.

Lemma 68 Suppose $E$ is a universally measurable subset of a convex, completely metrizable set $C$ in a topological vector space $V$. Suppose that for some finite-dimensional subspace $V' \subseteq V$, $\lambda_{V'}[C + v_0] > 0$ for some $v_0 \in V$. If, in addition, for all $v' \in E$,
$$\lambda_{V'}[\{v \in V' : v + v' \in E\}] = 0, \quad (22)$$
then it follows that $E$ is shy relative to $C$.

Proof Let $v$ be arbitrary. Then, $(v + V') \cap E$ is either empty or not. First, suppose it is empty. Since $\lambda_{V'}[\emptyset] = 0$ by definition, it follows immediately that in this case $\lambda_{V'}[E - v] = 0$. Next, suppose the intersection is not empty, and let $v + v'' \in E$ for some fixed $v'' \in V'$. It follows that
$$\lambda_{V'}[E - v] = \lambda_{V'}[\{v' \in V' : v + v' \in E\}] = \lambda_{V'}[\{v' \in V' : (v + v'') + v' \in E\}] = 0,$$
where the first equality follows by definition; the second equality follows by the translation invariance of $\lambda_{V'}$ and the fact that $v'' + V' = V'$; and the final equality follows from Eq. (22) applied to $v + v'' \in E$. Therefore $\lambda_{V'}[E - v] = 0$ for arbitrary $v$, and so $E$ is shy.

F.5 Proof of Theorem 17

Using the lemmata above, we can prove Theorem 17. We briefly summarize what has been established so far:

Lemma 42 and Cor. 47: The set $\bar{\mathcal{K}}$ of $\mathcal{U}$-fine distributions on $K$ is a Banach space;

Lemma 50: The subset $Q$ of $\mathcal{U}$-fine probability measures on $K$ is a convex, completely metrizable subset of $\bar{\mathcal{K}}$;

Lemma 56: The subset $E$ of $Q$ is a universally measurable subset of $\bar{\mathcal{K}}$, where $E$ is the set consisting of $\mathcal{U}$-fine probability measures over which there exists a policy satisfying counterfactual equalized odds (resp., conditional principal fairness, or path-specific fairness) that is not strongly Pareto dominated.

Therefore, to apply Prop. 38, what remains is to construct a probe $W$ and show that $\lambda_W[Q + \mu_0] > 0$ for some $\mu_0 \in \bar{\mathcal{K}}$ but $\lambda_W[E + \mu] = 0$ for all $\mu \in \bar{\mathcal{K}}$.

Proof We divide the proof into three pieces. First, we illustrate how to construct the probe $W$ from a particular collection of distributions $\{\nu^{Up}_a, \nu^{Lo}_a\}_{a \in A}$. Second, we show that $\lambda_W[E + \mu] = 0$ for all $\mu \in \bar{\mathcal{K}}$. For notational and expository simplicity, we focus in these first two sections on the case of counterfactual equalized odds. Therefore, in the third section, we show how to generalize the argument to conditional principal fairness and path-specific fairness.

Construction of the probe. We will construct our probe to address two different cases. We recall that, by Cor. 29, any policy that is not strongly Pareto dominated must be a budget-exhausting multiple threshold policy with non-negative thresholds.
In the first case, we consider when the candidate budget-exhausting multiple threshold policy is $1_{u(x)>0}$. By perturbing the underlying distribution by $\nu \in W^{Lo}$, we will be able to break the balance requirements implied by Eq. (4). In the second case, we treat the possibility that the candidate budget-exhausting multiple threshold policy has a non-trivial positive threshold for at least one group. By perturbing the underlying distribution by $\nu \in W^{Up}$ for an alternative set of perturbations $W^{Up}$, we will again be able to break the balance requirements.

More specifically, to construct our probe $W = W^{Up} + W^{Lo}$, we want $W^{Up}$ and $W^{Lo}$ to have a number of properties. In particular, for all $\nu \in W$, perturbation by $\nu$ should not affect whether the underlying distribution is a probability distribution, and should not affect how much of the budget is available to budget-exhausting policies. Specifically, for all $\nu \in W$,
$$\int_K 1 \, d\nu = 0, \quad (23)$$
$$\int_K 1_{u(X)>0} \, d\nu = 0. \quad (24)$$
In fact, the amount of budget available to budget-exhausting policies will not change within group, i.e., for all $a \in A$ and $\nu \in W$,
$$\int_K 1_{u(X)>0, A=a} \, d\nu = 0. \quad (25)$$
Additionally, for some distinguished $y_0, y_1 \in Y$, non-zero perturbations $\nu^{Lo} \in W^{Lo}$ should move mass between $y_0$ and $y_1$. That is, they should have the property that if $\Pr_{|\nu^{Lo}|}(A = a) > 0$, then
$$\int_K 1_{u(X)<0, Y=y_i, A=a} \, d\nu^{Lo} \neq 0. \quad (26)$$
Finally, perturbations in $W^{Up}$ should have the property that for any non-trivial $t > 0$, some mass is moved either above or below $t$. More precisely, for any $\mu \in Q$ and any $t$ such that $0 < \Pr_\mu(u(X) > t \mid A = a) < 1$, if $\nu^{Up} \in W^{Up}$ is such that $\Pr_{|\nu^{Up}|}(A = a) > 0$, then
$$\int_K 1_{u(X)>t, A=a} \, d\nu^{Up} \neq 0. \quad (27)$$

To carry out the construction, choose distinct $y_0, y_1 \in Y$. Then, since
$$\mu_{max,a} \circ u^{-1}[S_a \cap [0, r_a)] - \mu_{max,a} \circ u^{-1}[S_a \cap [r_a, \infty)]$$
is a continuous function of $r_a$, it follows by the intermediate value theorem that we can partition $S_a$ into three pieces,
$$S^{Lo}_a = S_a \cap (-\infty, 0), \quad S^{Up}_{a,0} = S_a \cap [0, r_a), \quad S^{Up}_{a,1} = S_a \cap [r_a, \infty),$$
so that
$$\Pr_{\mu_{max,a}}\big(u(X) \in S^{Up}_{a,0}\big) = \Pr_{\mu_{max,a}}\big(u(X) \in S^{Up}_{a,1}\big).$$
Recall that $K = X \times Y$. Let $\pi_X : K \to X$ denote projection onto $X$, and $\gamma_y : X \to K$ the injection $x \mapsto (x, y)$. We define
$$\nu^{Up}_a[E] = \mu_{max,a} \circ (\gamma_{y_1} \circ \pi_X)^{-1}\big[E \cap u^{-1}(S^{Up}_{a,1})\big] - \mu_{max,a} \circ (\gamma_{y_1} \circ \pi_X)^{-1}\big[E \cap u^{-1}(S^{Up}_{a,0})\big],$$
$$\nu^{Lo}_a[E] = \mu_{max,a} \circ (\gamma_{y_1} \circ \pi_X)^{-1}\big[E \cap u^{-1}(S^{Lo}_a)\big] - \mu_{max,a} \circ (\gamma_{y_0} \circ \pi_X)^{-1}\big[E \cap u^{-1}(S^{Lo}_a)\big].$$
By construction, $\nu^{Up}_a$ concentrates on $u^{-1}(S_a \cap [0, \infty)) \times \{y_1\}$, while $\nu^{Lo}_a$ concentrates on $u^{-1}(S_a \cap (-\infty, 0)) \times \{y_0, y_1\}$. Moreover, if we set
$$W^{Up} = \mathrm{Span}(\nu^{Up}_a)_{a \in A}, \quad W^{Lo} = \mathrm{Span}(\nu^{Lo}_a)_{a \in A},$$
then it is easy to see that Eqs. (23) to (26) will hold. The only non-trivial case is Eq. (27). However, by Cor. 59, the support of $f_{\mu_{max,a}}$ is maximal. That is, for $\mu \in Q$, if $0 < \Pr_\mu(u(X) > t \mid A = a, u(X) > 0) < 1$, then it follows that $0 < t < \sup S_a$. Either $t \leq r_a$ or $t > r_a$. First, assume $t \leq r_a$; then, it follows by the construction of $\nu^{Up}_a$ and the choice of $r_a$ that
$$\nu^{Up}_a \circ u^{-1}[(t, \infty)] = \int_{r_a}^\infty f_{max,a} \, d\lambda - \int_t^{r_a} f_{max,a} \, d\lambda = \int_0^t f_{max,a} \, d\lambda > 0.$$
Similarly, if $t > r_a$,
$$\nu^{Up}_a \circ u^{-1}[(t, \infty)] = \int_t^\infty f_{max,a} \, d\lambda > 0,$$
since $t < \sup S_a$. Therefore Eq. (27) holds.

Since $W$ is non-trivial38 and $\nu[K] = 0$ for all $\nu \in W$, it follows by Lemma 62 that $\lambda_W[Q - \mu] > 0$ for some $\mu \in \bar{\mathcal{K}}$.

38. In general, some or all of the $\nu^{Lo}_a$ may be zero, depending on the $\lambda$-measure of $S^{Lo}_a$. However, as noted in Remark 61, the $\nu^{Up}_a$ cannot be zero, since $\Pr_{\mu_{max,a}}(u(X) > 0) > 0$ for all $a \in A$. Therefore $W \neq \{0\}$.
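The probe construction can be visualized on a discretized utility scale. The sketch below (ours, with an arbitrary density standing in for $f_{max,a}$) builds a perturbation of the $\nu^{Up}_a$ type and checks Eqs. (23), (25), and (27) numerically.

```python
import numpy as np

u = np.linspace(-1.0, 2.0, 30001)
du = u[1] - u[0]
f_max = np.exp(-((u - 0.5) ** 2))               # stand-in density, mass on both sides of 0
f_max /= f_max.sum() * du
cdf_pos = np.cumsum(np.where(u > 0, f_max, 0.0)) * du
r = u[np.searchsorted(cdf_pos, cdf_pos[-1] / 2)]    # the IVT step: equal mass on [0,r), [r,inf)
nu = np.where(u >= r, f_max, 0.0) - np.where((u > 0) & (u < r), f_max, 0.0)
print("Eq. (23):", nu.sum() * du)               # ~0: total mass unchanged
print("Eq. (25):", nu[u > 0].sum() * du)        # ~0: budget above zero unchanged
for t in (0.2, 1.0):
    print(f"Eq. (27), t = {t}:", nu[u > t].sum() * du)   # nonzero: mass above t shifts
```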
Shyness. Recall that, by Prop. 36, a set $E$ is shy if and only if, for an arbitrary shy set $E'$, $E \setminus E'$ is shy. By Lemma 63, a generic element $\mu \in Q$ satisfies $\Pr_\mu(u(X) > 0, Y(1) = y_i, A = a) > 0$ for $i = 0, 1$ and $a \in A$. Likewise, by Lemma 65, a generic $\mu \in Q$ satisfies $\nu^{Up}_a \circ u^{-1} \ll (\mu|_{X \times \{y_1\}}) \circ u^{-1}$. Therefore, to simplify our task, and recalling Remark 61, we may instead demonstrate the shyness of the set of $\mu \in Q$ such that:

There exists a budget-exhausting multiple threshold policy $\tau(x)$ with non-negative thresholds satisfying counterfactual equalized odds over $\mu$;

For $i = 0, 1$ and all $a \in A$,
$$\Pr_\mu(u(X) > 0, A = a, Y(1) = y_i) > 0; \quad (28)$$

For all $a \in A$,
$$\nu^{Up}_a \circ u^{-1} \ll \big(\mu|_{\alpha^{-1}(a) \times \{y_1\}}\big) \circ u^{-1}. \quad (29)$$

By a slight abuse of notation, we continue to refer to this set as $E$. We note that, by the construction of $W$, Eq. (28) is not affected by perturbation by $\nu \in W$, and Eq. (29) is not affected by perturbation by $\nu^{Lo} \in W^{Lo}$. In particular, by Lemma 68, it suffices to show that $\lambda_W[E - \mu] = 0$ for $\mu \in E$.

Therefore, let $\mu \in E$ be arbitrary. Let the budget-exhausting multiple threshold policy satisfying counterfactual equalized odds over it be $\tau(x)$, so that
$$E_\mu[\tau(X)] = \min(b, \Pr_\mu(u(X) > 0)),$$
with thresholds $\{t_a\}_{a \in A}$. We split into two cases based on whether $\tau(X) = 1_{u(X)>0}$ $\mu$-a.s. or not. In both cases, we make use of the following two useful observations. First, note that as $E \subseteq Q$, if $\mu + \nu$ is not a probability measure, then $\mu + \nu \notin E$. Therefore, without loss of generality, we assume throughout that $\mu + \nu$ is a probability measure. Second, suppose $\tau'(x)$ is a policy satisfying counterfactual equalized odds over some $\nu \in Q$. Then, if $0 < E_\nu[\tau'(X)] < 1$, it follows that for all $a \in A$,
$$0 < E_\nu[\tau'(X) \mid A = a] < 1. \quad (30)$$
For, suppose not. Then, without loss of generality, there must be $a_0, a_1 \in A$ such that $E_\nu[\tau'(X) \mid A = a_0] = 0$ and $E_\nu[\tau'(X) \mid A = a_1] > 0$. But then, by the law of iterated expectations, there must be some $Y' \subseteq Y$ such that $\nu[X \times Y'] > 0$ and, so,
$$1_{X \times Y'} \cdot E_\nu[\tau'(X) \mid A = a_1, Y(1)] > 0 = 1_{X \times Y'} \cdot E_\nu[\tau'(X) \mid A = a_0, Y(1)],$$
contradicting the fact that $\tau'(x)$ satisfies counterfactual equalized odds over $\nu$. Therefore, in what follows, we can assume that Eq. (30) holds. Our goal is to show that $\lambda_W[E - \mu] = 0$.

Case 1 ($\tau(X) = 1_{u(X)>0}$) We argue as follows. First, we show that $1_{u(X)>0}$ is the unique budget-exhausting multiple threshold policy with non-negative thresholds over $\mu + \nu$ for all $\nu \in W$. Then, we show that the set of $\nu \in W$ such that $1_{u(x)>0}$ satisfies counterfactual equalized odds over $\mu + \nu$ is a $\lambda_W$-null set.

We begin by observing that $W^{Lo} \neq \{0\}$; for, if it were trivial, then Eq. (30) would not hold for $\tau(x)$.

Next, we note that by Eq. (24), for any $\nu \in W$,
$$\Pr_{\mu+\nu}(u(X) > 0) = \Pr_\mu(u(X) > 0),$$
and so $E_{\mu+\nu}[1_{u(X)>0}] = \min(b, \Pr_{\mu+\nu}(u(X) > 0))$. If $\tau'(x)$ is a feasible multiple threshold policy with non-negative thresholds that differs from $1_{u(X)>0}$ with positive $(\mu + \nu)$-probability, then, as a consequence,
$$E_{\mu+\nu}[\tau'(X)] < \Pr_{\mu+\nu}(u(X) > 0) \leq b.$$
Therefore, it follows that $1_{u(X)>0}$ is the unique budget-exhausting multiple threshold policy over $\mu + \nu$ with non-negative thresholds.

Now, note that if counterfactual equalized odds holds with decision policy $\tau(x) = 1_{u(x)>0}$, then, by Eq. (7) and Lemma 41, we must have that
$$\Pr_{\mu+\nu}(u(X) > 0 \mid A = a, Y(1) = y_1) = \Pr_{\mu+\nu}(u(X) > 0 \mid A = a', Y(1) = y_1)$$
for $a, a' \in A$.39 Now, we will show that a typical element of $W$ breaks this balance requirement. Choose $a'$ such that $\nu^{Lo}_{a'} \neq 0$. Recall that $\nu$ is fixed, and write $\nu = \nu' + \beta^{Lo}_{a'} \nu^{Lo}_{a'}$. Let
$$p_a = \Pr_{\mu+\nu'}(u(X) > 0 \mid A = a, Y(1) = y_1).$$
Note that it cannot be the case that $p_a = 0$ for all $a \in A$, since, by Eq. (28), $\Pr_{\mu+\nu'}(u(X) > 0 \mid Y(1) = y_1) > 0$.

39. To ensure that both quantities are well-defined, here and throughout the remainder of the proof we use the fact that, by Eqs. (25) and (28), $\Pr_{\mu+\nu}(u(X) > 0, A = a, Y(1) = y_1) > 0$.
Therefore, by the foregoing discussion, we may choose $a'' \neq a'$ as follows: if $p_{a'} = 0$, choose $a''$ such that $p_{a''} > 0$; otherwise, choose $a''$ arbitrarily. Since the $\nu^{Lo}_a$ and $\nu^{Up}_a$ are all mutually singular, perturbation by $\beta^{Lo}_{a'} \nu^{Lo}_{a'}$ leaves the conditional probabilities for the groups $a \neq a'$ unchanged, and so counterfactual equalized odds can only hold over $\mu + \nu$ if
$$p_{a''} = \Pr_{\mu+\nu}(u(X) > 0 \mid A = a', Y(1) = y_1).$$
Now, we observe, by Lemma 41, that
$$\Pr_{\mu+\nu}(u(X) > 0 \mid A = a', Y(1) = y_1) = \frac{\eta + \beta^{Lo}_{a'} \cdot e}{\pi + \beta^{Lo}_{a'} \cdot \rho},$$
where
$$\eta = \Pr_{\mu+\nu'}(u(X) > 0, A = a', Y(1) = y_1), \quad \pi = \Pr_{\mu+\nu'}(A = a', Y(1) = y_1),$$
$$e = \int_K 1_{u(X)>0, A=a', Y(1)=y_1} \, d\nu^{Lo}_{a'} = 0, \quad \rho = \int_K 1_{A=a', Y(1)=y_1} \, d\nu^{Lo}_{a'} \neq 0.$$
Here, the equality $e = 0$ follows from the fact that $\nu^{Lo}_{a'}$ is supported on $u^{-1}(S^{Lo}_{a'}) \times \{y_0, y_1\}$, and the inequality $\rho \neq 0$ from Eq. (26). Therefore, in the first case, if $p_{a''} > 0$, then counterfactual equalized odds only holds if
$$\beta^{Lo}_{a'} = \frac{\eta - p_{a''} \cdot \pi}{p_{a''} \cdot \rho},$$
since, as noted above, $e = 0$ and $\rho \neq 0$ by Eq. (26). In the second case, if $p_{a''} = 0$, then counterfactual equalized odds can only hold if $\eta = 0$. But, by the choice of $a''$, we then have $p_{a'} > 0$, and $\pi > 0$ by Eq. (28), so that $\eta = p_{a'} \cdot \pi > 0$, and this is impossible. In either case, we see that the set of $\beta^{Lo}_{a'} \in \mathbb{R}$ such that there exists a budget-exhausting threshold policy with non-negative thresholds satisfying counterfactual equalized odds over $\mu + \nu' + \beta^{Lo}_{a'} \nu^{Lo}_{a'}$ has $\lambda$-measure zero. That is,
$$\lambda_{\mathrm{Span}(\nu^{Lo}_{a'})}[E - \mu - \nu'] = 0.$$
Since $\nu'$ was arbitrary, it follows by Fubini's theorem that $\lambda_W[E - \mu] = 0$.
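The heart of Case 1 is that the perturbed balance condition is a Möbius function of $\beta$ and hence takes any given value at most once. A numeric sketch follows (ours, with arbitrary constants standing in for $\eta$, $\pi$, $\rho$, and the target $p_{a''}$).

```python
import numpy as np

eta, pi, rho, p = 0.12, 0.30, 0.05, 0.35    # hypothetical constants with rho != 0
betas = np.linspace(-1.0, 1.0, 5)
print(eta / (pi + betas * rho))             # strictly monotone in beta: injective
beta_star = (eta / p - pi) / rho            # the unique balancing perturbation
print(beta_star, np.isclose(eta / (pi + beta_star * rho), p))
```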
Case 2 ($\tau(X) \neq 1_{u(X)>0}$) Our proof strategy is similar to the previous case. First, we show that, for a given fixed $\nu^{Lo} \in W^{Lo}$, there is a unique candidate policy $\tilde{\tau}(x)$ for being a budget-exhausting multiple threshold policy with non-negative thresholds satisfying counterfactual equalized odds over $\mu + \nu^{Lo} + \nu^{Up}$ for any $\nu^{Up} \in W^{Up}$. Then, we show that the set of $\nu^{Up}$ such that $\tilde{\tau}(X)$ satisfies counterfactual equalized odds has $\lambda_{W^{Up}}$-measure zero. Finally, we argue that this in turn implies that the set of $\nu \in W$ such that there exists a Pareto efficient policy satisfying counterfactual equalized odds over $\mu + \nu$ has $\lambda_W$-measure zero.

We seek to show that $\lambda_{W^{Up}}[E - (\mu + \nu^{Lo})] = 0$. To begin, we note that since $\nu^{Up}_a$ concentrates on $X \times \{y_1\}$ for all $a \in A$, it follows that
$$E_{\mu+\nu^{Lo}}[d(X) \mid A = a, Y(1) = y_0] = E_{\mu+\nu^{Lo}+\nu^{Up}}[d(X) \mid A = a, Y(1) = y_0]$$
for any $\nu^{Up} \in W^{Up}$. Now, suppose there exists some $\nu^{Up} \in W^{Up}$ such that there exists a budget-exhausting multiple threshold policy $\tilde{\tau}(x)$ with non-negative thresholds such that counterfactual equalized odds is satisfied over $\mu + \nu^{Lo} + \nu^{Up}$. (If not, then we are done, and $\lambda_{W^{Up}}[E - (\mu + \nu^{Lo})] = 0$, as the measure of the empty set is zero.) Let
$$p = E_{\mu+\nu^{Lo}}[\tilde{\tau}(X) \mid A = a, Y(1) = y_0],$$
which, by the balance conditions, does not depend on $a$. Suppose that $\tau'(x)$ is an alternative budget-exhausting multiple threshold policy with non-negative thresholds such that counterfactual equalized odds is satisfied. We seek to show that $\tau'(X) = \tilde{\tau}(X)$ $(\mu + \nu^{Lo} + \nu^{Up})$-a.e. for any $\nu^{Up} \in W^{Up}$. Toward a contradiction, suppose that for some $a_0 \in A$,
$$E_{\mu+\nu^{Lo}}[\tau'(X) \mid A = a_0, Y(1) = y_0] < p.$$
Since, by Eq. (28), $\Pr_{\mu+\nu^{Lo}}(A = a_0, Y(1) = y_0) > 0$, it follows that
$$E_{\mu+\nu^{Lo}}[\tau'(X) \mid A = a_0] < E_{\mu+\nu^{Lo}}[\tilde{\tau}(X) \mid A = a_0].$$
Therefore, since $\tilde{\tau}(x)$ is budget-exhausting, there must be some $a_1$ such that
$$E_{\mu+\nu^{Lo}}[\tau'(X) \mid A = a_1] > E_{\mu+\nu^{Lo}}[\tilde{\tau}(X) \mid A = a_1].$$
From this, it follows that $\tau'(x)$ can be represented on $\alpha^{-1}(a_1)$ by a threshold yielding a weakly larger acceptance set than that of $\tilde{\tau}(x)$, and hence
$$E_{\mu+\nu^{Lo}}[\tau'(X) \mid A = a_1, Y(1) = y_0] \geq E_{\mu+\nu^{Lo}}[\tilde{\tau}(X) \mid A = a_1, Y(1) = y_0] = p > E_{\mu+\nu^{Lo}}[\tau'(X) \mid A = a_0, Y(1) = y_0],$$
contradicting the fact that $\tau'(x)$ satisfies counterfactual equalized odds. By the fact that $\nu^{Lo}$ is supported on $u^{-1}((-\infty, 0])$, the preceding discussion, and Lemma 49, it follows that $\tilde{\tau}(X) = \tau'(X)$ $(\mu|_{X \times \{y_0\}})$-a.e. By Eq. (29), it follows that $\tilde{\tau}(X) = \tau'(X)$ $\nu^{Up}_a$-a.e. for all $a \in A$. As a consequence, $\tilde{\tau}(X) = \tau'(X)$ $(\mu + \nu^{Lo} + \nu^{Up})$-a.e. for all $\nu^{Up} \in W^{Up}$. Therefore $\tilde{\tau}(X)$ is, indeed, unique, as desired.

Now, we note that since $\tilde{\tau}(X) \neq 1_{u(X)>0}$, it follows that $E[\tilde{\tau}(X)] < \Pr_\mu(u(X) > 0)$, and hence that $E_\mu[\tilde{\tau}(X)] = b$, since $\tilde{\tau}(x)$ is budget-exhausting. Therefore, by Eq. (24), it follows that for any budget-exhausting policy $\tau''(X)$ over $\mu + \nu$, $E[\tau''(X)] = b$, and so $\tau''(X) \neq 1_{u(X)>0}$ over $\mu + \nu$.

Therefore, fix $\nu^{Lo}$ and $\tilde{\tau}(X)$. By Eq. (30), there is some $a'$ such that $0 < \Pr_{\mu+\nu^{Lo}}(u(X) > \tilde{t}_{a'} \mid A = a') < 1$, where $\tilde{t}_{a'}$ is the threshold of $\tilde{\tau}(x)$ for group $a'$. Then, it follows by Eq. (27) that
$$\int_K 1_{u(X)>\tilde{t}_{a'}} \, d\nu^{Up}_{a'} \neq 0.$$
Write $\nu = \nu' + \beta^{Up}_{a'} \nu^{Up}_{a'}$. Then, for some $a \neq a'$, set
$$p = E_{\mu+\nu'}[\tilde{\tau}(X) \mid A = a, Y(1) = y_1].$$
Since the $\nu^{Lo}_a$ and $\nu^{Up}_a$ are all mutually singular, it follows that counterfactual equalized odds can only hold over $\mu + \nu$ if
$$p = \Pr_{\mu+\nu}(u(X) > \tilde{t}_{a'} \mid A = a', Y(1) = y_1). \quad (31)$$
Now, we observe, by Lemma 41, that
$$\Pr_{\mu+\nu}(u(X) > \tilde{t}_{a'} \mid A = a', Y(1) = y_1) = \frac{\eta + \beta^{Up}_{a'} \cdot \gamma}{\pi},$$
where
$$\eta = \Pr_{\mu+\nu'}(u(X) > \tilde{t}_{a'}, A = a', Y(1) = y_1), \quad \pi = \Pr_{\mu+\nu'}(A = a', Y(1) = y_1), \quad \gamma = \int_K 1_{u(X)>\tilde{t}_{a'}, A=a', Y(1)=y_1} \, d\nu^{Up}_{a'},$$
and we note that the denominator is unchanged, since $\int_K 1_{A=a', Y(1)=y_1} \, d\nu^{Up}_{a'} = 0$ by construction. Eq. (31) can be rearranged to $(p \cdot \pi - \eta) - \beta^{Up}_{a'} \cdot \gamma = 0$, which can only hold if
$$\beta^{Up}_{a'} = \frac{p \cdot \pi - \eta}{\gamma},$$
since, by Eq. (27), $\gamma \neq 0$. Since any countable subset of $\mathbb{R}$ is a $\lambda$-null set,
$$\lambda_{\mathrm{Span}(\nu^{Up}_{a'})}[E - \mu - \nu'] = 0.$$
Since $\nu'$ was arbitrary, it follows by Fubini's theorem that $\lambda_{W^{Up}}[E - \mu - \nu^{Lo}] = 0$ in this case as well. Lastly, since $\nu^{Lo}$ was also arbitrary, applying Fubini's theorem a final time gives that $\lambda_W[E - \mu] = 0$.

Conditional principal fairness and path-specific fairness. The extension of these results to conditional principal fairness and path-specific fairness is straightforward. All that is required is a minor modification of the probe. In the case of conditional principal fairness, we set
$$\nu^{Up}_{a,w}[E] = \mu_{max,a,w} \circ (\gamma_{(y_1,y_1)} \circ \pi_X)^{-1}\big[E \cap u^{-1}(S^{Up}_{a,w,1})\big] - \mu_{max,a,w} \circ (\gamma_{(y_1,y_1)} \circ \pi_X)^{-1}\big[E \cap u^{-1}(S^{Up}_{a,w,0})\big],$$
$$\nu^{Lo}_{a,w}[E] = \mu_{max,a,w} \circ (\gamma_{(y_1,y_1)} \circ \pi_X)^{-1}\big[E \cap u^{-1}(S^{Lo}_{a,w})\big] - \mu_{max,a,w} \circ (\gamma_{(y_0,y_0)} \circ \pi_X)^{-1}\big[E \cap u^{-1}(S^{Lo}_{a,w})\big],$$
where $\gamma_{(y,y')} : X \to K$ is the injection $x \mapsto (x, y, y')$, and the sets $S^{Lo}_{a,w}$, $S^{Up}_{a,w,0}$, and $S^{Up}_{a,w,1}$ partition $S_{a,w}$ as before. Our probe is then given by $W^{Up} = \mathrm{Span}(\nu^{Up}_{a,w})$, $W^{Lo} = \mathrm{Span}(\nu^{Lo}_{a,w})$, almost as before. The proof otherwise proceeds virtually identically, except for two points. First, recalling Remark 61, we use the fact that a generic element of $Q$ satisfies $\Pr_\mu(A = a, W = w) > 0$ in place of $\Pr_\mu(A = a) > 0$ throughout. Second, we use the fact that $\mu$ overlaps utilities in place of Eq. (30). In particular, if a generic $\mu \in Q$ did not overlap utilities, then, by Lemma 67, there would exist $w \in \mathcal{W}$ such that $\Pr_\mu(u(X) > 0, W = w) = 0$ for all $\mu \in Q$. If this occurs, we can show that no budget-exhausting multiple threshold policy with positive thresholds satisfies conditional principal fairness, exactly as we did to show Eq. (30).

In the case of path-specific fairness, we instead define
$$S^{Lo}_{a,w} = S_{a,w} \cap (-\infty, r_{a,w}), \quad S^{Up}_{a,w} = S_{a,w} \cap [r_{a,w}, \infty),$$
where $r_{a,w}$ is chosen so that
$$\Pr_{\mu_{max,a,w}}(u(X) \in S^{Lo}_{a,w}) = \Pr_{\mu_{max,a,w}}(u(X) \in S^{Up}_{a,w}).$$
Let $\pi_X$ denote the projection from $K = A \times X^A$ given by $(a, (x_{a'})_{a' \in A}) \mapsto x_a$.
Let $\pi_{a'}$ denote the projection onto the $a'$-th component. (That is, given $\mu \in \mathcal{K}$, the distribution of $X_{\Pi,A,a'}$ over $\mu$ is given by $\mu \circ \pi_{a'}^{-1}$, and the distribution of $X$ is given by $\mu \circ \pi_X^{-1}$.) Then, we let $\tilde{\mu}_{max,a,w}$ be the measure on $X$ given by
$$\tilde{\mu}_{max,a,w}[E] = \mu_{max,a,w}\big[\pi_a^{-1}(E) \cap (u \circ \pi_a)^{-1}(S^{Up}_{a,w})\big] - \mu_{max,a,w}\big[\pi_a^{-1}(E) \cap (u \circ \pi_a)^{-1}(S^{Lo}_{a,w})\big]$$
for Borel $E \subseteq X$. Finally, let $\phi : A \to A$ be a permutation of the groups with no fixed points, i.e., such that $a' \neq \phi(a')$ for all $a' \in A$. Then, we define
$$\nu_{a'} = \delta_{a'} \times \tilde{\mu}_{max,\phi(a'),w_1} \times \prod_{a'' \neq \phi(a')} \mu_{max,a'',w_1} \circ \pi_{a''}^{-1},$$
where $\delta_{a'}$ is the measure on $A$ given by $\delta_{a'}[\{a''\}] = 1_{a''=a'}$, and where the second factor occupies the $\phi(a')$-th component of $X^A$. Then, simply let $W = \mathrm{Span}(\nu_{a'})_{a' \in A}$. Since $\tilde{\mu}_{max,a,w}[X] = 0$ for all $a \in A$, it follows that $\nu_{a'} \circ \pi_X^{-1} = 0$, i.e., $\Pr_\mu(X \in E) = \Pr_{\mu+\nu}(X \in E)$ for any $\nu \in W$ and $\mu \in Q$. Therefore Eqs. (23) and (24) hold. Moreover, the $\nu_{a'}$ satisfy the following strengthening of Eq. (27): perturbations in $W$ have the property that, for any non-trivial $t$ (not necessarily positive), some of the mass of $u(X_{\Pi,A,a})$ is moved either above or below $t$. More precisely, for any $\mu \in Q$ and any $t$ such that $0 < \Pr_\mu(u(X_{\Pi,A,a}) > t \mid A = a') < 1$, if $\nu \in W$ is such that $\Pr_{|\nu|}(A = \phi^{-1}(a)) > 0$, then
$$\int_K 1_{u(X_{\Pi,A,a})>t} \, d\nu_{\phi^{-1}(a)} \neq 0. \quad (32)$$
This stronger property means that we need not separately treat the case where $\tau(X) = 1_{u(X)>0}$ $\mu$-a.e. Other than this difference, the proof proceeds in the same way, except for two points. First, we again make use of the fact that $\mu$ can be assumed to overlap utilities in place of Eq. (30), as in the case of conditional principal fairness. Second, $w_0$ and $w_1$ take the place of $y_0$ and $y_1$. In particular, to establish the uniqueness of $\tilde{\tau}(x)$ given $\mu$ and $\nu^{Lo}$ in the second case, instead of conditioning on $y_0$, we instead condition on $w_0$, where, following the discussion in Remark 61 and Lemma 63, this conditioning is well-defined for a generic element of $Q$.

We have focused on causal definitions of fairness, but the thrust of our analysis applies to non-causal conceptions of fairness as well. Below we show that policies constrained to satisfy (non-counterfactual) equalized odds (Hardt et al., 2016) are generically strongly Pareto dominated, a result that follows immediately from our proof above.

Definition 69 Equalized odds holds for a decision policy $d(x)$ when
$$d(X) \perp A \mid Y. \quad (33)$$
We note that $Y$ in Eq. (33) does not depend on our choice of $d(x)$, but rather represents the realized value of $Y$, e.g., under some status quo decision-making rule.

Corollary 70 Suppose $\mathcal{U}$ is a set of utilities consistent modulo $\alpha$. Further suppose that for all $a \in A$ there exist a $\mathcal{U}$-fine distribution of $X$ and a utility $u \in \mathcal{U}$ such that $\Pr(u(X) > 0, A = a) > 0$, where $A = \alpha(X)$. Then, for almost every $\mathcal{U}$-fine distribution of $X$ and $Y$ on $X \times Y$, any decision policy $d(x)$ satisfying equalized odds is strongly Pareto dominated.

Proof Consider the following maps. Distributions of $X$ and $Y(1)$, i.e., probability measures on $X \times Y$, can be embedded in the space of joint distributions of $X$, $Y(0)$, and $Y(1)$ by pushing forward by the map $\iota$, where $\iota : (x, y) \mapsto (x, y, y)$. Likewise, given a fixed decision policy $D = d(X)$, joint distributions of $X$, $Y(0)$, and $Y(1)$ can be projected onto the space of joint distributions of $X$ and $Y$ by pushing forward by the map $\pi_d : (x, y_0, y_1) \mapsto (x, y_{d(x)})$. Lastly, we see that the composition of $\iota$ and $\pi_d$, regardless of our choice of $d(x)$, is the identity. We note also that counterfactual equalized odds holds for the pushforward $\mu \circ \iota^{-1}$ exactly when equalized odds holds for $\mu$. The result follows immediately from this and Theorem 17.
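The composition identity in the proof of Corollary 70 can be checked mechanically; the following snippet (ours) verifies that $\pi_d \circ \iota$ is the identity for a hypothetical deterministic policy $d$.

```python
iota = lambda x, y: (x, y, y)                      # embed (x, y) as (x, y, y)
pi_d = lambda d, x, y0, y1: (x, (y0, y1)[d(x)])    # keep coordinate y_{d(x)}
d = lambda x: int(x > 0)                           # a hypothetical decision policy
print(all(pi_d(d, *iota(x, y)) == (x, y)
          for x in (-1, 2) for y in (0, 1)))       # True: pi_d . iota = identity
```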
F.6 Proof of Theorem 9

The proof of Theorem 9 is simpler than the proof of Theorem 17, but uses some of the same machinery. As before, let $K = [0,1] \times A$ denote the state space, and $\mathcal{K}$ the set of totally bounded measures on $[0,1] \times A$. Let $\bar{\mathcal{K}}$ denote the measures $\mu \in \mathcal{K}$ that are absolutely continuous with respect to $\lambda \times \delta$, where $\lambda$ is Lebesgue measure and $\delta$ is the counting measure on $A$, i.e., measures such that the restriction to $[0,1] \times \{a\}$ has a density for all $a \in A$. Applying Cor. 47 with $\mathcal{U} = \{\pi\}$, where $\pi : (u, a) \mapsto u$, shows that $\bar{\mathcal{K}}$ is a Banach space. As before, we let $Q \subseteq \bar{\mathcal{K}}$ denote the probability simplex, i.e., the set of all $\mu \in \bar{\mathcal{K}}$ such that $\mu[K] = 1$ and $\mu[E] \geq 0$ for all Borel sets $E$.

Let $R = r(X)$ denote risk.40 By Lemma 31, we have that a policy is utility maximizing if and only if
$$1_{R>t} \leq d(X) \leq 1_{R \geq t} \quad \text{a.s.}$$
Since $\Pr_\mu(R = t) = 0$ for any absolutely continuous measure $\mu$, we see that, in fact, a policy is utility maximizing if and only if $d(X) = 1_{R>t}$ $\mu$-a.s.

40. Here, since the measures are on the risk scale, rather than on $X$, we write $R$ for notational simplicity.

Consider the sets
$$E_{FP} = \Big\{\mu \in \bar{\mathcal{K}} : (\forall a \in A) \ \frac{E_\mu[(1-R) \cdot 1_{A=a, R>t}]}{E_\mu[(1-R) \cdot 1_{A=a}]} = \frac{E_\mu[(1-R) \cdot 1_{R>t}]}{E_\mu[1-R]}\Big\}$$
and
$$E_{DP} = \{\mu \in \bar{\mathcal{K}} : (\forall a \in A) \ \Pr_\mu(R > t \mid A = a) = \Pr_\mu(R > t)\}.$$
We note that since the mapping $\mu \mapsto \Pr_\mu(E)$ is a continuous function on $\bar{\mathcal{K}}$, the set of measures such that $\Pr_\mu(A = a, R > t) = 0$ or $\Pr_\mu(R > t) = 0$ is closed and shy. It follows that $E_{FP}$ and $E_{DP}$ are Borel. We note that, by definition, $E_{DP}$ is the set of risk distributions such that there exists a utility-maximizing policy satisfying demographic parity (when the condition is well-defined). Likewise, an application of the law of iterated expectations yields that $E_{FP}$ is the set of distributions satisfying equalized false positive rates (when the condition is well-defined). With this context in place, we can move on to the proof of Theorem 9.

Proof of Theorem 9 First we construct the probe. Choose distinct $a_0, a_1 \in A$ arbitrarily, and let $\tilde{\nu}$ be defined by
$$\tilde{\nu}[E] = \frac{\lambda[E_{a_0} \cap [0, t)]}{t} - \frac{\lambda[E_{a_1} \cap [0, t)]}{t},$$
where $E_a = E \cap \{A = a\}$. Then, by Lemma 62, there exists $\mu_0$ such that $\lambda_W[Q - \mu_0] > 0$, where $W = \mathrm{Span}(\tilde{\nu})$. We see that
$$\tilde{\nu}[\{R < t, A = a_0\}] = 1, \quad \tilde{\nu}[\{R > t, A = a_0\}] = 0, \quad \tilde{\nu}[\{R < t, A = a_1\}] = -1, \quad \tilde{\nu}[\{R > t, A = a_1\}] = 0.$$
It follows that
$$\Pr_{\mu+\beta\tilde{\nu}}(R > t \mid A = a_0) = \frac{e_0}{p_0 + \beta}, \quad \Pr_{\mu+\beta\tilde{\nu}}(R > t \mid A = a_1) = \frac{e_1}{p_1 - \beta},$$
where
$$e_0 = \Pr_\mu(R > t, A = a_0), \quad e_1 = \Pr_\mu(R > t, A = a_1), \quad p_0 = \Pr_\mu(A = a_0), \quad p_1 = \Pr_\mu(A = a_1).$$
Note that by Lemmata 65 and 68, we can assume that $e_0, e_1 > 0$. It follows, rearranging terms, that $\Pr_{\mu+\beta\tilde{\nu}}(R > t \mid A = a_0) = \Pr_{\mu+\beta\tilde{\nu}}(R > t \mid A = a_1)$ if and only if
$$\beta = \frac{e_0 \cdot p_1 - e_1 \cdot p_0}{e_0 + e_1},$$
which is a measure-zero subset of $\mathbb{R}$. Therefore $\lambda_W[E_{DP} + \mu] = 0$.

In the same way, we observe that
$$E_{\tilde{\nu}}[(1-R) \cdot 1_{A=a_0, R>t}] = 0, \quad E_{\tilde{\nu}}[(1-R) \cdot 1_{A=a_1, R>t}] = 0,$$
so that
$$\frac{E_{\mu+\beta\tilde{\nu}}[(1-R) \cdot 1_{A=a_0, R>t}]}{E_{\mu+\beta\tilde{\nu}}[(1-R) \cdot 1_{A=a_0}]} = \frac{e'_0}{p'_0 + \beta \cdot (1 - \frac{t}{2})}, \quad \frac{E_{\mu+\beta\tilde{\nu}}[(1-R) \cdot 1_{A=a_1, R>t}]}{E_{\mu+\beta\tilde{\nu}}[(1-R) \cdot 1_{A=a_1}]} = \frac{e'_1}{p'_1 - \beta \cdot (1 - \frac{t}{2})},$$
where
$$e'_0 = E_\mu[(1-R) \cdot 1_{A=a_0, R>t}], \quad e'_1 = E_\mu[(1-R) \cdot 1_{A=a_1, R>t}], \quad p'_0 = E_\mu[(1-R) \cdot 1_{A=a_0}], \quad p'_1 = E_\mu[(1-R) \cdot 1_{A=a_1}].$$
As before, $\mu + \beta\tilde{\nu} \in E_{FP}$ only if
$$\beta = \frac{e'_0 \cdot p'_1 - e'_1 \cdot p'_0}{(1 - \frac{t}{2}) \cdot (e'_0 + e'_1)},$$
which is again a measure-zero subset of $\mathbb{R}$. Therefore $\lambda_W[E_{FP} + \mu] = 0$, and it follows that both $E_{FP}$ and $E_{DP}$ are shy.
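The uniqueness of the parity-restoring $\beta$ in the proof above is elementary to verify numerically; the constants below are our illustrative stand-ins for $e_0, e_1, p_0, p_1$.

```python
import numpy as np

e0, e1, p0, p1 = 0.20, 0.10, 0.50, 0.50        # hypothetical base-rate constants
beta = (e0 * p1 - e1 * p0) / (e0 + e1)         # the unique parity-restoring value
print(beta, e0 / (p0 + beta), e1 / (p1 - beta))    # both perturbed rates equal 0.3
for b in (beta - 0.05, beta + 0.05):
    print(e0 / (p0 + b) - e1 / (p1 - b))       # any other beta breaks parity
```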
F.7 Proof of Corollary 18

The proof of Corollary 18, restated in full detail as Corollary 71 below, is a straightforward application of Theorem 17. We begin by giving the complete statement.

Corollary 71 Consider a utility of the form given in Eq. (12), where $v$ is monotonically increasing in both coordinates and $m(x) \geq 0$. Suppose that for all $a \in A$ there exists an $\{m\}$-fine distribution of $X$ such that $\Pr(m(X) > 0, A = a) > 0$, where $A = \alpha(X)$. Then:

For almost every $\{m\}$-fine distribution of $X$ and $Y(1)$, no utility-maximizing decision policy satisfies counterfactual equalized odds.

If $|\mathrm{Img}(\omega)| < \infty$ and there exists an $\{m\}$-fine distribution of $X$ such that $\Pr(A = a, W = w) > 0$ for all $a \in A$ and $w \in \mathrm{Img}(\omega)$, where $W = \omega(X)$, then, for almost every $\{m\}$-fine joint distribution of $X$, $Y(0)$, and $Y(1)$, no utility-maximizing decision policy satisfies conditional principal fairness.

If $|\mathrm{Img}(\omega)| < \infty$ and there exists an $\{m\}$-fine distribution of $X$ such that $\Pr(A = a, W = w_i) > 0$ for all $a \in A$ and some distinct $w_0, w_1 \in \mathrm{Img}(\omega)$, then, for almost every $\{m\}^A$-fine joint distribution of $A$ and the counterfactuals $X_{\Pi,A,a'}$, no utility-maximizing decision policy satisfies path-specific fairness.

Recall that, for notational simplicity, we refer to the distinguished utility as $u^*$, where
$$u^*(d) = v\big(E[m(X) \cdot d(X)], \ E[1_{\alpha(X)=a_1} \cdot d(X)]\big).$$

Proof of Corollary 18 Consider the subset $S$ of $\mathbb{R}^2$ consisting of all pairs
$$\big(E[m(X) \cdot d(X)], \ E[1_{\alpha(X)=a_1} \cdot d(X)]\big),$$
where $d$ ranges over feasible policies. We note that, for $\theta \in [0, 1]$,
$$\theta \cdot E[m(X) \cdot d_0(X)] + [1-\theta] \cdot E[m(X) \cdot d_1(X)] = E[m(X) \cdot (\theta \cdot d_0(X) + [1-\theta] \cdot d_1(X))],$$
and similarly for $E[1_{\alpha(X)=a_1} \cdot d(X)]$. Likewise,
$$\theta \cdot E[d_0(X)] + [1-\theta] \cdot E[d_1(X)] = E[\theta \cdot d_0(X) + (1-\theta) \cdot d_1(X)],$$
and so convex combinations of feasible policies are feasible. It follows that $S$ is convex.

Now, consider a point $(x_0, y_0)$ at which $v(x, y)$ is maximized on $S$. Since $v$ is monotonically increasing in both coordinates, we must have that
$$S \cap \big((x_0, y_0) + \mathbb{R}^2_{\geq 0}\big) = \{(x_0, y_0)\},$$
and so, by the separating hyperplane theorem, there exist $(h_0, h_1) \in \mathbb{R}^2$ and $t \in \mathbb{R}$ such that
$$(h_0, h_1) \cdot (x_0, y_0) = t,$$
$$(h_0, h_1) \cdot c > t \quad \text{for all } c \in (x_0, y_0) + \mathbb{R}^2_{\geq 0}, \ c \neq (x_0, y_0),$$
$$(h_0, h_1) \cdot s < t \quad \text{for all } s \in S, \ s \neq (x_0, y_0).$$
We note that it follows that both $h_0$ and $h_1$ are positive, since, otherwise, without loss of generality, if $h_0 = 0$, then
$$(h_0, h_1) \cdot (x_0 + \epsilon, y_0) = h_1 \cdot y_0 = (h_0, h_1) \cdot (x_0, y_0) = t,$$
contrary to assumption, since $(x_0 + \epsilon, y_0) \in (x_0, y_0) + \mathbb{R}^2_{\geq 0}$ and $(x_0 + \epsilon, y_0) \neq (x_0, y_0)$.

Let $\lambda^* = h_1/h_0$, and consider the collection of utilities $\mathcal{U} = \{m(x) + \lambda \cdot 1_{\alpha(x)=a_1}\}_{\lambda>0}$. Since $m(x) \geq 0$, $\mathcal{U}$ is consistent modulo $\alpha$.

We need the further claim that any policy that is utility maximizing for some $u(x) \in \mathcal{U}$ is Pareto efficient. For, suppose that $d_0(x)$ were utility-maximizing for $u_0(x) = m(x) + \lambda_0 \cdot 1_{\alpha(x)=a_1}$ but not Pareto efficient, i.e., there existed $d_1(x)$ and $u_1(x) = m(x) + \lambda_1 \cdot 1_{\alpha(x)=a_1}$ such that $u_0(d_1) = u_0(d_0)$ but $u_1(d_1) > u_1(d_0)$. Then, we would have for any $u(x) = m(x) + \lambda \cdot 1_{\alpha(x)=a_1} \in \mathcal{U}$ that
$$u(d_1) - u(d_0) = [u(d_1) - u(d_0)] - [u_0(d_1) - u_0(d_0)] = (\lambda - \lambda_0) \cdot \big(E[d_1(X) \cdot 1_{\alpha(X)=a_1}] - E[d_0(X) \cdot 1_{\alpha(X)=a_1}]\big).$$
First suppose that $\lambda_1 > \lambda_0$. Then, it follows from the fact that $u_1(d_1) > u_1(d_0)$ that
$$E[d_1(X) \cdot 1_{\alpha(X)=a_1}] > E[d_0(X) \cdot 1_{\alpha(X)=a_1}].$$
Now, if $0 < \lambda_2 < \lambda_0$ and $u_2(x) = m(x) + \lambda_2 \cdot 1_{\alpha(x)=a_1}$, then we have that $u_2(d_1) < u_2(d_0)$. In the same way, if $\lambda_1 < \lambda_0$, we could choose $\lambda_2 > \lambda_0$ such that $u_2(d_1) < u_2(d_0)$. Therefore $d_1$ does not Pareto dominate $d_0$, contrary to hypothesis. Therefore, any policy that is utility maximizing for some $u(x) \in \mathcal{U}$ is Pareto efficient.

In particular, it follows that the policy that maximizes $u(x) = m(x) + \lambda^* \cdot 1_{\alpha(x)=a_1}$ in expectation is Pareto efficient. By the construction of the separating hyperplane, this is also the policy that maximizes $u^*$, and so the policy that maximizes $u^*$ is Pareto efficient. Therefore, under the hypotheses of Theorem 17, for almost every joint distribution, the utility-maximizing policy does not satisfy counterfactual equalized odds, conditional principal fairness, or path-specific fairness.
F.8 General Measures on K

Theorem 17 is restricted to $\mathcal{U}$-fine and $\mathcal{U}_{\mathcal{A}}$-fine distributions on the state space. The reason for this restriction is that when the distribution of $X$ induces atoms on the utility scale, threshold policies can possess additional, or even infinitely many, degrees of freedom when the threshold falls exactly on an atom. In particular circumstances, these degrees of freedom can be used to ensure that causal fairness notions, such as counterfactual equalized odds, hold in a locally robust way. In particular, the generalization of Theorem 17 beyond $\mathcal{U}$-fine measures to all totally bounded measures on the state space is false, as illustrated by the following proposition.

Proposition 72 Consider the set $E \subseteq \mathcal{M}_K$ of distributions, not necessarily $\mathcal{U}$-fine, on $K = \mathcal{X} \times \mathcal{Y}$ over which there exists a Pareto efficient policy satisfying counterfactual equalized odds. There exist $b$, $\mathcal{X}$, $\mathcal{Y}$, and $\mathcal{U}$ such that $E$ is not relatively shy.

Proof  We adopt the notational conventions of Section F.3. We note that by Prop. 36, a set can only be shy if it has empty interior. Therefore, we will construct an example in which an open ball of distributions on $K$ in the total variation norm all allow for a Pareto efficient policy satisfying counterfactual equalized odds, i.e., are contained in $E$.

Let $b = \frac{3}{4}$, $\mathcal{Y} = \{0, 1\}$, $\mathcal{A} = \{a_0, a_1\}$, and $\mathcal{X} = \{0, 1\} \times \{a_0, a_1\} \times \mathbb{R}$. Let $\alpha : \mathcal{X} \to \mathcal{A}$ be given by $\alpha : (y, a, v) \mapsto a$ for arbitrary $(y, a, v) \in \mathcal{X}$. Likewise, let $u : \mathcal{X} \to \mathbb{R}$ be given by $u : (y, a, v) \mapsto v$. Then, if $\mathcal{U} = \{u\}$, $\mathcal{U}$ is vacuously consistent modulo $\alpha$.

Consider the joint distribution $\mu$ on $K = \mathcal{X} \times \mathcal{Y}$ where for all $y, y' \in \mathcal{Y}$, $a \in \mathcal{A}$, and $u \in \mathbb{R}$,
$$\Pr_\mu(X = (y, a, u), Y(1) = y') = \tfrac{1}{4} \cdot 1_{y = y'} \cdot \Pr_\mu(u(X) = u),$$
where, over $\mu$, $u(X)$ is distributed as a $\frac{1}{2}$-$\frac{1}{2}$ mixture of $\mathrm{Unif}(1, 2)$ and the point mass $\delta_1$; that is, $\Pr(u(X) = 1) = \frac{1}{2}$ and $\Pr(a < u(X) < b) = \frac{b - a}{2}$ for $1 \leq a \leq b \leq 2$.

We first observe that there exists a Pareto efficient threshold policy $\tau(x)$ such that counterfactual equalized odds is satisfied with respect to the decision policy $\tau(X)$. Namely, let
$$\tau(y, a, u) = \begin{cases} 1 & u > 1, \\ \frac{1}{2} & u = 1, \\ 0 & u < 1. \end{cases}$$
Then, it immediately follows that $\mathrm{E}[\tau(X)] = \frac{3}{4} = b$. Since $\tau(x)$ is a threshold policy and exhausts the budget, it is utility maximizing by Lemma 31. Moreover, if $D = 1_{U_D \leq \tau(X)}$ for some $U_D \sim \mathrm{Unif}(0, 1)$ independent of $X$ and $Y(1)$, then $D \perp A \mid Y(1)$: since $u(X) \perp (A, Y(1))$, it follows that
$$\Pr_\mu(D = 1 \mid A = a, Y(1) = y) = \Pr(U_D \leq \tau(X) \mid A = a, Y(1) = y) = \Pr(U_D \leq \tau(X)) = \mathrm{E}_\mu[\tau(X)].$$
Therefore Eq. (4) is satisfied, i.e., counterfactual equalized odds holds.
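The base construction is easy to check by simulation. The following sketch (Python; Monte Carlo, so the equalities hold only up to sampling error) verifies that $\tau$ exhausts the budget $b = 3/4$ and that $\Pr(D = 1 \mid A, Y(1))$ is constant across groups and outcomes.

```python
# A Monte Carlo check of the base construction in the proof of Proposition 72:
# the policy tau exhausts the budget b = 3/4 and satisfies counterfactual
# equalized odds under mu. Equalities hold only up to sampling error.
import numpy as np

rng = np.random.default_rng(1)
N = 1_000_000

# u(X) is a 1/2-1/2 mixture of a point mass at 1 and Unif(1, 2); the pair
# (A, Y(1)) is uniform on {a0, a1} x {0, 1} and independent of u(X).
atom = rng.random(N) < 0.5
u = np.where(atom, 1.0, rng.uniform(1.0, 2.0, N))
a = rng.integers(0, 2, N)
y1 = rng.integers(0, 2, N)

tau = np.where(u > 1.0, 1.0, 0.5)  # 1 above the threshold, 1/2 at the atom
d = rng.random(N) <= tau           # D = 1{U_D <= tau(X)}

print("E[tau(X)] ~", round(tau.mean(), 4))  # ~ 0.75 = b
for ai in (0, 1):
    for yi in (0, 1):
        sel = (a == ai) & (y1 == yi)
        print(f"Pr(D = 1 | A = a{ai}, Y(1) = {yi}) ~ {d[sel].mean():.4f}")
```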
Now, using $\mu$, we construct an open ball of distributions over which we can construct similar threshold policies. In particular, suppose $\mu'$ is any distribution such that $|\mu - \mu'|[K] < \frac{1}{64}$. Then, we claim that there exists a budget-exhausting threshold policy satisfying counterfactual equalized odds over $\mu'$. For, we note that
$$\Pr_{\mu'}(U > 1) < \Pr_\mu(U > 1) + \tfrac{1}{64}, \qquad \Pr_{\mu'}(U \geq 1) > \Pr_\mu(U \geq 1) - \tfrac{1}{64},$$
and so any threshold policy $\tau'(x)$ satisfying $\mathrm{E}[\tau'(X)] = b = \frac{3}{4}$ must have $t = 1$ as its threshold. We will now construct a threshold policy $\tau'(x)$ satisfying counterfactual equalized odds over $\mu'$. Consider a threshold policy of the form
$$\tau'(y, a, u) = \begin{cases} 1 & u > 1, \\ p_{a,y} & u = 1, \\ 0 & u < 1. \end{cases}$$
For notational simplicity, let
$$q_{a,y} = \Pr_{\mu'}(A = a, Y = y, U > 1), \quad r_{a,y} = \Pr_{\mu'}(A = a, Y = y, U = 1), \quad \pi_{a,y} = \Pr_{\mu'}(A = a, Y = y).$$
Then, we have that
$$\mathrm{E}_{\mu'}[\tau'(X)] = \sum_{a,y} q_{a,y} + p_{a,y} \cdot r_{a,y}, \qquad \mathrm{E}_{\mu'}[\tau'(X) \mid A = a, Y = y] = \frac{q_{a,y} + p_{a,y} \cdot r_{a,y}}{\pi_{a,y}}.$$
Therefore, the policy will be budget exhausting if
$$\sum_{a,y} q_{a,y} + p_{a,y} \cdot r_{a,y} = \tfrac{3}{4},$$
and it will satisfy counterfactual equalized odds if
$$\pi_{a_1,0} \cdot (q_{a_0,0} + p_{a_0,0} \cdot r_{a_0,0}) = \pi_{a_0,0} \cdot (q_{a_1,0} + p_{a_1,0} \cdot r_{a_1,0}), \qquad \pi_{a_1,1} \cdot (q_{a_0,1} + p_{a_0,1} \cdot r_{a_0,1}) = \pi_{a_0,1} \cdot (q_{a_1,1} + p_{a_1,1} \cdot r_{a_1,1}), \tag{34}$$
since, as above, $\Pr(D = 1 \mid A = a, Y(1) = y) = \mathrm{E}[\tau'(X) \mid A = a, Y(1) = y]$. Again, for notational simplicity, let
$$S = \frac{\frac{3}{4} - \Pr_{\mu'}(U > 1)}{\Pr_{\mu'}(U = 1)}.$$
Then, a straightforward algebraic manipulation shows that Eq. (34) is solved by setting $p_{a_0,y}$ to be
$$\frac{S \cdot \pi_{a_0,y} \cdot (r_{a_0,y} + r_{a_1,y}) + \pi_{a_0,y} \cdot q_{a_1,y} - \pi_{a_1,y} \cdot q_{a_0,y}}{r_{a_0,y} \cdot (\pi_{a_0,y} + \pi_{a_1,y})},$$
and $p_{a_1,y}$ to be
$$\frac{S \cdot \pi_{a_1,y} \cdot (r_{a_0,y} + r_{a_1,y}) + \pi_{a_1,y} \cdot q_{a_0,y} - \pi_{a_0,y} \cdot q_{a_1,y}}{r_{a_1,y} \cdot (\pi_{a_0,y} + \pi_{a_1,y})}.$$
In order for $\tau'(x)$ to be a well-defined policy, we need to show that $p_{a,y} \in [0, 1]$ for all $a \in \mathcal{A}$ and $y \in \mathcal{Y}$. To that end, note that
$$r_{a_0,y} + r_{a_1,y} = \Pr_{\mu'}(Y = y, U = 1), \qquad \pi_{a_0,y} + \pi_{a_1,y} = \Pr_{\mu'}(Y = y).$$
Now, we recall that $|\Pr_{\mu'}(E) - \Pr_\mu(E)| < \frac{1}{64}$ for any event $E$ by hypothesis. Therefore,
$$\tfrac{7}{64} \leq q_{a,y} \leq \tfrac{9}{64}, \quad \tfrac{7}{64} \leq r_{a,y} \leq \tfrac{9}{64}, \quad \tfrac{15}{64} \leq \pi_{a,y} \leq \tfrac{17}{64}, \quad \tfrac{15}{64} \leq r_{a_0,y} + r_{a_1,y} \leq \tfrac{17}{64}, \quad \tfrac{31}{64} \leq \pi_{a_0,y} + \pi_{a_1,y} \leq \tfrac{33}{64}.$$
Using these bounds and the expressions for $p_{a,y}$ derived above, we see that $\frac{629}{3069} < p_{a,y} < 1$, and hence $p_{a,y} \in [0, 1]$ for all $a \in \mathcal{A}$ and $y \in \mathcal{Y}$.

Therefore, the policy $\tau'(x)$ is well-defined and, by construction, is budget exhausting, and therefore utility maximizing by Lemma 31. It also satisfies counterfactual equalized odds by construction. Since $\mu'$ was arbitrary, it follows that the set of distributions on $K$ such that there exists a Pareto efficient policy satisfying counterfactual equalized odds contains an open ball, and hence is not shy.
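The algebra behind the tie-breaking probabilities $p_{a,y}$ can likewise be verified numerically. The sketch below (Python; the particular perturbation of $\mu$ is an arbitrary illustrative choice) computes $q$, $r$, $\pi$, and $S$ for a perturbed distribution, plugs them into the closed-form expressions for $p_{a,y}$, and confirms that the resulting policy is budget exhausting, satisfies counterfactual equalized odds, and has $p_{a,y} \in [0, 1]$.

```python
# A numeric check of the closed-form tie-breaking probabilities p_{a,y} in the
# proof of Proposition 72, under one illustrative perturbation mu' of mu:
# the policy is budget exhausting, satisfies counterfactual equalized odds
# exactly, and has p_{a,y} in [0, 1].
import numpy as np

rng = np.random.default_rng(2)

# Under mu, each (a, y) cell has mass 1/8 on {U > 1} ("q") and 1/8 on {U = 1}
# ("r"); the perturbation below also allows a little mass on {U < 1} ("s").
q = 0.125 + rng.uniform(-0.003, 0.003, (2, 2))
r = 0.125 + rng.uniform(-0.003, 0.003, (2, 2))
s = rng.uniform(0.0, 0.001, (2, 2))
total = q.sum() + r.sum() + s.sum()
q, r, s = q / total, r / total, s / total  # renormalize to a probability measure
pi = q + r + s                             # pi[a, y] = Pr(A = a, Y = y)

S = (0.75 - q.sum()) / r.sum()
p = np.empty((2, 2))
for y in range(2):
    Ry, Py = r[0, y] + r[1, y], pi[0, y] + pi[1, y]
    p[0, y] = (S * pi[0, y] * Ry + pi[0, y] * q[1, y] - pi[1, y] * q[0, y]) / (r[0, y] * Py)
    p[1, y] = (S * pi[1, y] * Ry + pi[1, y] * q[0, y] - pi[0, y] * q[1, y]) / (r[1, y] * Py)

assert np.all((0 <= p) & (p <= 1))
assert np.isclose((q + p * r).sum(), 0.75)  # budget exhausted
rates = (q + p * r) / pi                    # Pr(D = 1 | A = a, Y(1) = y)
assert np.allclose(rates[0], rates[1])      # counterfactual equalized odds
print("p =", np.round(p, 4))
```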
Appendix G. Theorem 11 and Related Results

We first prove a variant of Theorem 11 for general, continuous covariates $X$. Then, we extend and generalize Theorem 11 using the theory of finite Markov chains, offering a proof of the theorem different from the sketch included in the main text.

G.1 Extension to Continuous Covariates

Here we follow the proof sketch in the main text for Theorem 11, which assumes a finite covariate space $\mathcal{X}$. In that case, we start with a point $x^*$ with maximum decision probability $d(x^*)$, and then assume, toward a contradiction, that there exists a point with strictly lower decision probability. The general case is more involved, since it is not immediately clear that the maximum value of $d(x)$ is achieved with positive probability in $\mathcal{X}$. We start with the lemma below before proving the main result.

Lemma 73 A decision policy $d(x)$ satisfies path-specific fairness with $W = X$ if and only if, for any $a' \in \mathcal{A}$, $\mathrm{E}[d(X_{\Pi,A,a'}) \mid X] = d(X)$.

Proof  First, suppose that $d(x)$ satisfies path-specific fairness. To show the result, we use the standard fact that for independent random variables $X$ and $U$,
$$\mathrm{E}[f(X, U) \mid X] = \int f(X, u) \, dF_U(u), \tag{35}$$
where $F_U$ is the distribution of $U$. (For a proof of this fact see, for example, Brozius, 2019.) Now, we have that
$$\mathrm{E}[D_{\Pi,A,a'} \mid X_{\Pi,A,a'}] = \mathrm{E}[1_{U_D \leq d(X_{\Pi,A,a'})} \mid X_{\Pi,A,a'}] = \int_0^1 1_{u \leq d(X_{\Pi,A,a'})} \, du = d(X_{\Pi,A,a'}),$$
where the first equality follows from the definition of $D_{\Pi,A,a'}$, and the second from Eq. (35), since the exogenous variable $U_D \sim \mathrm{Unif}(0, 1)$ is independent of the counterfactual covariates $X_{\Pi,A,a'}$. An analogous argument shows that $\mathrm{E}[D \mid X] = d(X)$.

Finally, conditioning on $X$, we have
$$\mathrm{E}[d(X_{\Pi,A,a'}) \mid X] = \mathrm{E}[\mathrm{E}[D_{\Pi,A,a'} \mid X_{\Pi,A,a'}] \mid X] = \mathrm{E}[\mathrm{E}[D_{\Pi,A,a'} \mid X_{\Pi,A,a'}, X] \mid X] = \mathrm{E}[D_{\Pi,A,a'} \mid X] = \mathrm{E}[D \mid X] = d(X),$$
where the second equality follows from the fact that $D_{\Pi,A,a'} \perp X \mid X_{\Pi,A,a'}$, the third from the law of iterated expectations, and the fourth from the definition of path-specific fairness.

Next, suppose that $\mathrm{E}[d(X_{\Pi,A,a'}) \mid X] = d(X)$ for all $a' \in \mathcal{A}$. Then, since $W = X$ and $X \perp U_D$, using Eq. (35), we have that for all $a' \in \mathcal{A}$,
$$\mathrm{E}[D_{\Pi,A,a'} \mid X] = \mathrm{E}[\mathrm{E}[1_{U_D \leq d(X_{\Pi,A,a'})} \mid X_{\Pi,A,a'}, X] \mid X] = \mathrm{E}[\mathrm{E}[d(X_{\Pi,A,a'}) \mid X_{\Pi,A,a'}, X] \mid X] = \mathrm{E}[d(X_{\Pi,A,a'}) \mid X] = \mathrm{E}[d(X) \mid X] = \mathrm{E}[D \mid X].$$
This is exactly Eq. (8), and so the result follows.

We are now ready to prove a continuous variant of Theorem 11. The technical hypotheses of the theorem ensure that the conditional probability measures $\Pr(E \mid X)$ are sufficiently mutually non-singular with respect to the distribution of $X$: for example, the conditions ensure that the conditional distribution of $X_{\Pi,A,a'} \mid X$ does not have atoms that $X$ itself does not have, and vice versa. For notational and conceptual simplicity, we only consider the case of trivial $\zeta$, i.e., where $\zeta(x) = \zeta(x')$ for all $x, x' \in \mathcal{X}$.

Proposition 74 Suppose that:

1. For all $a' \in \mathcal{A}$ and any event $S$ satisfying $\Pr(X \in S \mid A = a') > 0$, we have, a.s., $\Pr(X_{\Pi,A,a'} \in S, A = a' \mid X) > 0$.

2. For all $a' \in \mathcal{A}$ and $\epsilon > 0$, there exists $\delta > 0$ such that for any event $S$ satisfying $\Pr(X \in S \mid A = a') < \delta$, we have, a.s., $\Pr(X_{\Pi,A,a'} \in S, A = a' \mid X) < \epsilon$.

Then, for $W = X$, any $\Pi$-fair policy $d(x)$ is constant a.s. (i.e., $d(X) = p$ a.s. for some $0 \leq p \leq 1$).

Proof  Let $d_{\max} = \|d(X)\|_\infty$, the essential supremum of $d$. To establish the theorem statement, we show that $\Pr(d(X) = d_{\max} \mid A = a') = 1$ for all $a' \in \mathcal{A}$. To do that, we begin by showing that there exists some $a' \in \mathcal{A}$ such that $\Pr(d(X) = d_{\max} \mid A = a') > 0$. Assume, toward a contradiction, that for all $a' \in \mathcal{A}$,
$$\Pr(d(X) = d_{\max} \mid A = a') = 0. \tag{36}$$
Because $\mathcal{A}$ is finite, there must be some $a_0 \in \mathcal{A}$ such that
$$\Pr(d_{\max} - d(X) < \epsilon \mid A = a_0) > 0 \tag{37}$$
for all $\epsilon > 0$. Choose $a_1 \neq a_0$. We show that for values of $x$ such that $d(x)$ is close to $d_{\max}$, the distribution of $d(X_{\Pi,A,a_1}) \mid X = x$ must be concentrated near $d_{\max}$ with high probability to satisfy the definition of path-specific fairness in Eq. (8). But, under the assumption in Eq. (36), we also show that the concentration occurs with low probability, by the continuity hypothesis in the statement of the theorem, establishing the contradiction.

Specifically, by Markov's inequality, for any $\rho > 0$, a.s.,
$$\Pr(d_{\max} - d(X_{\Pi,A,a_1}) \geq \rho \mid X) \leq \frac{\mathrm{E}[d_{\max} - d(X_{\Pi,A,a_1}) \mid X]}{\rho} = \frac{d_{\max} - d(X)}{\rho},$$
where the final equality follows from Lemma 73. Rearranging, it follows that for any $\rho > 0$, a.s.,
$$\Pr(d_{\max} - d(X_{\Pi,A,a_1}) < \rho \mid X) \geq 1 - \frac{d_{\max} - d(X)}{\rho}. \tag{38}$$
Now let $S = \{x \in \mathcal{X} : d_{\max} - d(x) < \rho\}$. By the second hypothesis of the theorem, we can choose $\delta$ sufficiently small that if $\Pr(X \in S \mid A = a_1) < \delta$ then, a.s., $\Pr(X_{\Pi,A,a_1} \in S, A = a_1 \mid X) < \frac{1}{2}$. In other words, we can choose $\delta$ such that if $\Pr(d_{\max} - d(X) < \rho \mid A = a_1) < \delta$ then, a.s., $\Pr(d_{\max} - d(X_{\Pi,A,a_1}) < \rho, A = a_1 \mid X) < \frac{1}{2}$.

By Eq. (36), we can choose $\epsilon > 0$ so small that $\Pr(d_{\max} - d(X) < \epsilon \mid A = a_1) < \delta$. Then, we have that $\Pr(d_{\max} - d(X_{\Pi,A,a_1}) < \epsilon, A = a_1 \mid X) < \frac{1}{2}$ a.s. Further, by the definition of the essential supremum and $a_0$, and the fact that $a_0 \neq a_1$, we have that $\Pr(d_{\max} - d(X) < \frac{\epsilon}{2}, A = a_1) > 0$. Therefore, with positive probability, we have that
$$1 - \frac{d_{\max} - d(X)}{\epsilon} \geq \frac{1}{2} > \Pr(d_{\max} - d(X_{\Pi,A,a_1}) < \epsilon, A = a_1 \mid X).$$
This contradicts Eq. (38), and so it cannot be the case that $\Pr(d(X) = d_{\max} \mid A = a_0) = 0$, meaning $\Pr(d(X) = d_{\max} \mid A = a_0) > 0$.

Now, we show that $\Pr(d(X) = d_{\max} \mid A = a_1) = 1$. Suppose, toward a contradiction, that $\Pr(d(X) < d_{\max} \mid A = a_1) > 0$. Then, by the first hypothesis, a.s.,
$$\Pr(d(X_{\Pi,A,a_1}) < d_{\max}, A = a_1 \mid X) > 0.$$
As a consequence,
$$d_{\max} = \mathrm{E}[d(X) \mid d(X) = d_{\max}, A = a_0] = \mathrm{E}[\mathrm{E}[d(X_{\Pi,A,a_1}) \mid X] \mid d(X) = d_{\max}, A = a_0] < \mathrm{E}[\mathrm{E}[d_{\max} \mid X] \mid d(X) = d_{\max}, A = a_0] = \mathrm{E}[d_{\max} \mid d(X) = d_{\max}, A = a_0] = d_{\max},$$
where we can condition on the set $\{d(X) = d_{\max}, A = a_0\}$ since $\Pr(d(X) = d_{\max} \mid A = a_0) > 0$, and the second equality above follows from Lemma 73. This establishes the contradiction, and so $\Pr(d(X) = d_{\max} \mid A = a_1) = 1$.

Finally, we extend this equality to all $a' \in \mathcal{A}$. Since $\Pr(d(X) \neq d_{\max} \mid A = a_1) = 0$, we have, by the second hypothesis of the theorem, that, a.s., $\Pr(d(X_{\Pi,A,a_1}) \neq d_{\max}, A = a_1 \mid X) = 0$. Since, by definition, $\Pr(X_{\Pi,A,a_1} = X \mid A = a_1) = 1$, and $\Pr(d(X) = d_{\max} \mid A = a_1) = 1$, we can strengthen this to $\Pr(d(X_{\Pi,A,a_1}) \neq d_{\max} \mid X) = 0$. Consequently, a.s.,
$$d(X) = \mathrm{E}[d(X_{\Pi,A,a_1}) \mid X] = \mathrm{E}[d_{\max} \mid X] = d_{\max},$$
where the first equality follows from Lemma 73, establishing the result.
G.2 A Markov Chain Perspective

The theory of Markov chains illuminates and allows us to extend the proof of Theorem 11. Suppose $\mathcal{X} = \{x_1, \ldots, x_n\}$.⁴¹ For any $a' \in \mathcal{A}$, let $P_{a'} = [p^{a'}_{i,j}]$, where
$$p^{a'}_{i,j} = \Pr(X_{\Pi,A,a'} = x_j \mid X = x_i).$$
Then $P_{a'}$ is a stochastic matrix.

To motivate the subsequent discussion, we first note that this perspective conceptually simplifies some of our earlier results. Lemma 73 can be recast as stating that when $W = X$, a policy $d$ is $\Pi$-fair if and only if $P_{a'} d = d$, i.e., if and only if $d$ is a 1-eigenvector of $P_{a'}$ for all $a' \in \mathcal{A}$. The 1-eigenvectors of Markov chains have a particularly simple structure, which we derive here for completeness.

41. Because of the technical difficulties associated with characterizing the long-run behavior of arbitrary infinite Markov chains, we restrict our attention in this section to Markov chains with finite state spaces.

Lemma 75 Let $S_1, \ldots, S_m$ denote the recurrent classes of a finite Markov chain with transition matrix $P$. If $d$ is a 1-eigenvector of $P$, then $d$ takes a constant value $p_k$ on each $S_k$, $k = 1, \ldots, m$, and
$$d_i = \sum_{k=1}^m p_k \cdot \lim_{n \to \infty} \sum_{j \in S_k} P^n_{i,j}.$$

Remark 76 We note that $\lim_{n \to \infty} \sum_{j \in S_k} P^n_{i,j}$ always exists and is the probability that the Markov chain, beginning at state $i$, is eventually absorbed by the recurrent class $S_k$.

Proof  Note that, possibly by reordering the states, we can arrange that the stochastic matrix $P$ is in canonical form, i.e., that
$$P = \begin{pmatrix} B & 0 \\ R & Q \end{pmatrix},$$
where $Q$ is a sub-stochastic matrix, $R$ is non-negative, and
$$B = \begin{pmatrix} P_1 & & \\ & \ddots & \\ & & P_m \end{pmatrix}$$
is a block-diagonal matrix with the stochastic matrix $P_i$, corresponding to the transition probabilities on the recurrent set $S_i$, in the $i$-th position along the diagonal. Now, consider a 1-eigenvector $v = [v_1 \ v_2]$ of $P$. We must have that $Pv = v$, i.e., $B v_1 = v_1$ and $R v_1 + Q v_2 = v_2$. Therefore $v_1$ is a 1-eigenvector of $B$. Since $B$ is block diagonal, and each diagonal block is an irreducible stochastic matrix, it follows by the Perron–Frobenius theorem that the 1-eigenvectors of $B$ are given by $\mathrm{Span}(1_{S_i})_{i=1,\ldots,m}$, where $1_{S_i}$ is the vector which is 1 at index $j$ if $j \in S_i$ and is 0 otherwise. Now, for $v_1 \in \mathrm{Span}(1_{S_i})_{i=1,\ldots,m}$, we must find $v_2$ such that $R v_1 + Q v_2 = v_2$.

Note that every finite Markov chain $M$ can be canonically associated with an absorbing Markov chain $M_{\mathrm{Abs}}$, where the set of states of $M_{\mathrm{Abs}}$ is exactly the union of the transient states of $M$ and the recurrent sets of $M$. (In essence, one tracks which state of $M$ the Markov chain is in until it is absorbed by one of the recurrent sets, at which point the entire recurrent set is treated as a single absorbing state.) The transition matrix $P_{\mathrm{Abs}}$ associated with $M_{\mathrm{Abs}}$ is given by
$$P_{\mathrm{Abs}} = \begin{pmatrix} I & 0 \\ R' & Q \end{pmatrix},$$
where $R' = R \, [1_{S_1} \cdots 1_{S_m}]$. In particular, it follows that $v = [v_1 \ v_2]$ is a 1-eigenvector of $P$ if and only if $[T v_1 \ v_2]$ is a 1-eigenvector of $P_{\mathrm{Abs}}$, where $T : 1_{S_i} \mapsto e_i$.

Now, if $v$ is a 1-eigenvector of $P_{\mathrm{Abs}}$, then it is a 1-eigenvector of $(P_{\mathrm{Abs}})^k$ for all $k$. Since $Q$ is sub-stochastic, the series $\sum_{k=0}^\infty Q^k$ converges to $(I - Q)^{-1}$. Since
$$(P_{\mathrm{Abs}})^k = \begin{pmatrix} I & 0 \\ (I + Q + \cdots + Q^{k-1}) R' & Q^k \end{pmatrix},$$
it follows that
$$\lim_{k \to \infty} (P_{\mathrm{Abs}})^k = \begin{pmatrix} I & 0 \\ (I - Q)^{-1} R' & 0 \end{pmatrix}.$$
Therefore, if $v = [v_1 \ v_2]$ is a 1-eigenvector of $P_{\mathrm{Abs}}$, we must have that $(I - Q)^{-1} R' v_1 = v_2$. By Theorem 3.3.7 in Kemeny and Snell (1976), the $(i, k)$ entry of $(I - Q)^{-1} R'$ is exactly the probability that, conditional on $X_0 = x_i$, the Markov chain is eventually absorbed by the recurrent set $S_k$. This is, in turn, by the Chapman–Kolmogorov equations and the definition of $S_k$, equal to $\lim_{n \to \infty} \sum_{j \in S_k} P^n_{i,j}$, and therefore the result follows.

We arrive at the following simple necessary condition on $\Pi$-fair policies.

Corollary 77 Suppose $\mathcal{X}$ is finite, and define the stochastic matrix $\bar{P} = \frac{1}{|\mathcal{A}|} \sum_{a' \in \mathcal{A}} P_{a'}$. If $d(x)$ is a $\Pi$-fair policy, then it is constant on the recurrent classes of $\bar{P}$.

Proof  By Lemma 73, $d$ is $\Pi$-fair if and only if $P_{a'} d = d$ for all $a' \in \mathcal{A}$. Therefore,
$$\bar{P} d = \frac{1}{|\mathcal{A}|} \sum_{a' \in \mathcal{A}} P_{a'} d = \frac{1}{|\mathcal{A}|} \sum_{a' \in \mathcal{A}} d = d, \tag{40}$$
and so $d$ is a 1-eigenvector of $\bar{P}$. Therefore it is constant on the recurrent classes of $\bar{P}$ by Lemma 75.

We note that Theorem 11 follows immediately from this.

Proof of Theorem 11  Note that $\frac{1}{|\mathcal{A}|} \sum_{a' \in \mathcal{A}} P_{a'}$ decomposes into a block-diagonal stochastic matrix, where each block corresponds to a single stratum of $\zeta$ and is irreducible. Consequently, each stratum forms a recurrent class, and the result follows.
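The structure described by Lemma 75 and Corollary 77 is easy to see numerically. The following sketch (Python; the 5-state transition matrix is a made-up example standing in for the averaged chain $\bar{P}$) checks that every 1-eigenvector is constant on the two recurrent classes and takes the absorption-weighted average value on the transient state.

```python
# A numerical illustration of Lemma 75 / Corollary 77 on a made-up 5-state
# chain standing in for the averaged matrix P-bar: every 1-eigenvector is
# constant on each recurrent class, and on the transient state it equals the
# absorption-probability-weighted average of those constants.
import numpy as np

# States 0-1 form one recurrent class, states 2-3 another; state 4 is transient.
P = np.array([
    [0.7, 0.3, 0.0, 0.0, 0.0],
    [0.4, 0.6, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.5, 0.5, 0.0],
    [0.0, 0.0, 0.2, 0.8, 0.0],
    [0.1, 0.2, 0.3, 0.3, 0.1],  # transient: leaks into both recurrent classes
])

eigvals, eigvecs = np.linalg.eig(P)
fixed = eigvecs[:, np.isclose(eigvals, 1.0)].real  # basis of the 1-eigenspace

# From state 4, (I - Q)^{-1} R' gives absorption probabilities (1/3, 2/3).
absorb = np.array([0.3, 0.6]) / (1.0 - 0.1)

for d in fixed.T:
    assert np.isclose(d[0], d[1])  # constant on {0, 1}
    assert np.isclose(d[2], d[3])  # constant on {2, 3}
    assert np.isclose(d[4], absorb @ np.array([d[0], d[2]]))
    print(np.round(d, 4))
```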
Appendix H. Proofs of Propositions 19 and 20

The proofs of Proposition 19 and Proposition 20 rely on certain shared theory about beta distributions. We begin by reviewing this theory before moving on to the proofs of the respective propositions.

H.1 Beta Distributions and Stochastic Dominance

We begin by introducing incomplete beta functions and distributions.

Definition 78 The incomplete beta function $I_t(\alpha, \beta)$ for $t \in (0, 1]$ is given by
$$I_t(\alpha, \beta) = \int_0^t x^{\alpha - 1} (1 - x)^{\beta - 1} \, dx.$$
A random variable $X$ is said to be distributed as $\mathrm{Beta}_t(\alpha, \beta)$ if $X \sim Y \mid Y < t$ for $Y \sim \mathrm{Beta}(\alpha, \beta)$; equivalently, if $X$ has PDF
$$f_X(x) = \frac{1_{x \in (0, t)} \cdot x^{\alpha - 1} (1 - x)^{\beta - 1}}{I_t(\alpha, \beta)}.$$

An important property relating different beta distributions is stochastic dominance.

Definition 79 Let $X$ and $Y$ be random variables with CDFs $F_X(t)$ and $F_Y(t)$, respectively. We say that $X$ stochastically dominates $Y$, written $Y \leq_{\mathrm{st}} X$, if $F_X(t) \leq F_Y(t)$ for all $t \in \mathbb{R}$. We say that $X$ strictly stochastically dominates $Y$, written $Y <_{\mathrm{st}} X$, if, in addition, $F_X(t) < F_Y(t)$ for some $t \in \mathbb{R}$.

Lemma 81 Suppose $X, Y \geq 0$ and $Y \leq_{\mathrm{st}} X$. Then $\mathrm{E}[Y \cdot 1_{Y > t}] < \mathrm{E}[X \cdot 1_{X > t}]$ for any $t > 0$ such that there exists $t' \geq t$ with $F_X(t') < F_Y(t')$.

Proof  Since $F_Y(x) \geq F_X(x)$ for all $x$ and, in particular, $F_Y(x) > F_X(x)$ on an open interval containing $t'$, we have that
$$\mathrm{E}[Y \cdot 1_{Y > t}] = t \cdot [1 - F_Y(t)] + \int_t^\infty 1 - F_Y(x) \, dx < t \cdot [1 - F_X(t)] + \int_t^\infty 1 - F_X(x) \, dx = \mathrm{E}[X \cdot 1_{X > t}],$$
where we have applied the Darth Vader rule to calculate the expectations (Muldowney et al., 2012).

This leads to the following modest technical generalization.

Lemma 82 Suppose $X, Y \geq 0$ with $Y \leq_{\mathrm{st}} X$ are such that there exists $t' < t$ with $F_X(t') < F_Y(t')$, and $f : \mathbb{R}_{>0} \to \mathbb{R}_{>0}$ is monotonically decreasing and continuous. Then $\mathrm{E}[f(Y) \cdot 1_{Y < t}] > \mathrm{E}[f(X) \cdot 1_{X < t}]$.

Proof  Since $f$ is decreasing and continuous, it is invertible onto its image, and
$$\Pr(f(X) \leq x) = \Pr(X \geq f^{-1}(x)) = 1 - F_X(f^{-1}(x)).$$
It follows that the CDFs of $f(X)$ and $f(Y)$ are $1 - F_X(f^{-1}(x))$ and $1 - F_Y(f^{-1}(x))$, respectively. Then, observe that since $F_X(x) \leq F_Y(x)$ for all $x \in \mathbb{R}$, $1 - F_X(x) \geq 1 - F_Y(x)$ for all $x \in \mathbb{R}$, i.e., $f(X) \leq_{\mathrm{st}} f(Y)$. In particular, since $F_X(x) = 0$ if and only if $1 - F_X(x) = 1$, and vice versa, it follows from the invertibility of $f$ that $\{X < t\} = \{f(X) > f(t)\}$, and similarly for $Y$; the result then follows from Lemma 81.

It is relatively straightforward to characterize the stochastic dominance relationships between various (incomplete) beta distributions according to $\alpha$ and $\beta$; see Arab et al. (2021) for full details. Here, we require only the following result, which closely follows the proof of Theorem 1 there.

Lemma 83 If $X \sim \mathrm{Beta}_t(\alpha_0, \beta_0)$ and $Y \sim \mathrm{Beta}_t(\alpha_1, \beta_1)$, where $\alpha_0 \geq \alpha_1$ and $\beta_0 \leq \beta_1$, then $Y \leq_{\mathrm{st}} X$. If, in addition, either $\alpha_0 > \alpha_1$ or $\beta_0 < \beta_1$, then $Y <_{\mathrm{st}} X$.

Proof  We consider the case $\alpha_0 > \alpha_1$; the case where $\beta_0 < \beta_1$ is virtually identical. Let $G(s) = F_Y(s) - F_X(s)$. In particular, observe that $G(0) = G(t) = 0$, and that for $s \in (0, t)$,
$$G'(s) = \frac{s^{\alpha_1 - 1}(1 - s)^{\beta_1 - 1}}{I_t(\alpha_1, \beta_1)} - \frac{s^{\alpha_0 - 1}(1 - s)^{\beta_0 - 1}}{I_t(\alpha_0, \beta_0)} = \frac{s^{\alpha_0 - 1}(1 - s)^{\beta_0 - 1}}{I_t(\alpha_1, \beta_1)} \left( s^{\alpha_1 - \alpha_0}(1 - s)^{\beta_1 - \beta_0} - \frac{I_t(\alpha_1, \beta_1)}{I_t(\alpha_0, \beta_0)} \right).$$
We consider the two multiplicands in the final expression. We note that the first is greater than zero for all $s \in (0, t)$. Therefore, for $s \in (0, t)$, $G'(s) = 0$ if and only if
$$s^{\alpha_1 - \alpha_0}(1 - s)^{\beta_1 - \beta_0} = \frac{I_t(\alpha_1, \beta_1)}{I_t(\alpha_0, \beta_0)}.$$
Now, since $\alpha_0 > \alpha_1$ and $\beta_0 \leq \beta_1$, it follows that the left-hand side of the previous expression is strictly decreasing. In particular, $G'(s) = 0$ for at most one $s \in (0, t)$. Since $G(0) = G(t) = 0$ and $G$ is non-constant, it follows that $G'(s) = 0$ for exactly one $s_0 \in (0, t)$ by Rolle's theorem; that is,
$$s^{\alpha_1 - \alpha_0}(1 - s)^{\beta_1 - \beta_0} - \frac{I_t(\alpha_1, \beta_1)}{I_t(\alpha_0, \beta_0)}$$
changes sign at $s_0$. Since the minuend is strictly decreasing, it follows that $G'(s) > 0$ for $s \in (0, s_0)$ and $G'(s) < 0$ for $s \in (s_0, t)$. Therefore $G(s) > 0$ for all $s \in (0, s_0)$, and hence for all $s \in (0, t)$. Therefore $F_Y(s) > F_X(s)$ for $s \in (0, t)$, and so $Y <_{\mathrm{st}} X$.
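Lemma 83 can be spot-checked numerically for particular parameter values. The sketch below (Python, assuming scipy is available; the parameters and truncation point are arbitrary illustrative choices) compares the CDFs of two truncated beta distributions and confirms strict dominance on $(0, t)$.

```python
# A numerical spot check of Lemma 83 using scipy: with alpha0 > alpha1 and
# beta0 < beta1, the Beta_t(alpha1, beta1) CDF lies strictly above the
# Beta_t(alpha0, beta0) CDF on (0, t). The parameters are arbitrary choices.
import numpy as np
from scipy.stats import beta

t = 0.6
a0, b0 = 3.0, 2.0  # X ~ Beta_t(3, 2)
a1, b1 = 2.0, 4.0  # Y ~ Beta_t(2, 4)

def trunc_cdf(s, a, b):
    """CDF of Beta(a, b) conditioned on falling below t (constants cancel)."""
    return beta.cdf(s, a, b) / beta.cdf(t, a, b)

s = np.linspace(1e-4, t - 1e-4, 500)
F_X, F_Y = trunc_cdf(s, a0, b0), trunc_cdf(s, a1, b1)
assert np.all(F_Y > F_X)  # strict stochastic dominance of X over Y on (0, t)
print("max CDF gap F_Y - F_X:", round(float(np.max(F_Y - F_X)), 4))
```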
Since $\mathbb{R}^4_{>0}$ is trivially a smooth 4-manifold, the measure of the singular values of $\pi$ restricted to $f^{-1}(0)$ is zero. However, since the maximum rank of $D\pi$ on $f^{-1}(0)$ is three, every point of $f^{-1}(0)$ is singular, and consequently, the whole image of $\pi$ is singular, i.e., $\pi(f^{-1}(0))$ has measure zero. However, as argued above, the set of $(\alpha_0, \beta_0, \alpha_1, \beta_1)$ such that there exists a Pareto efficient distribution satisfying counterfactual equalized odds is a subset of $\pi(f^{-1}(0))$. Therefore, it remains only to show that $f$ is smooth, and that $Df$ has full rank for all $(\alpha_0, \beta_0, t_0, \alpha_1, \beta_1, t_1) \in \mathbb{R}^2_{>0} \times [0, 1] \times \mathbb{R}^2_{>0} \times [0, 1]$. This verification is a routine exercise in linear algebra and multivariable calculus, and is given below.

Smoothness. We note that since smooth functions are closed under composition, it suffices to show that $I_t(\alpha, \beta)$ is a smooth function of $t$, $\alpha$, and $\beta$. First, we consider partial derivatives with respect to $\alpha$ and $\beta$. If we could differentiate under the integral sign, then we would have that
$$\partial_\alpha^n \partial_\beta^m I_t(\alpha, \beta) = \int_0^t \log(x)^n \log(1 - x)^m x^{\alpha - 1} (1 - x)^{\beta - 1} \, dx. \tag{41}$$
Recall the well-known condition for the Leibniz integral rule: if $\partial_x \varphi(x, t)|_{x = \tilde{x}}$ is dominated by some integrable $g(t)$ for all $\tilde{x}$ in some neighborhood of $x$, then⁴²
$$\partial_x \int \varphi(x, t) \, dt = \int \partial_x \varphi(x, t) \, dt.$$
Since the integrand $\log(x)^n \log(1 - x)^m x^{\alpha - 1}(1 - x)^{\beta - 1}$ is strictly decreasing in both $\alpha$ and $\beta$ for $x \in (0, 1)$, it suffices merely to show that $\log(x)^n \log(1 - x)^m x^{\alpha - 1}(1 - x)^{\beta - 1}$ is integrable on $(0, 1)$ for all $\alpha > 0$ and $\beta > 0$. Moreover, since
- the whole integrand is bounded on $(\epsilon, 1 - \epsilon)$,
- the factor $\log(1 - x)^m (1 - x)^{\beta - 1}$ is bounded on $(0, \epsilon)$, and
- the factor $\log(x)^n x^{\alpha - 1}$ is bounded on $(1 - \epsilon, 1)$,
it suffices to show that $\log(x)^n x^{\alpha - 1}$ is integrable on $(0, \epsilon)$ and that $\log(1 - x)^m (1 - x)^{\beta - 1}$ is integrable on $(1 - \epsilon, 1)$. Up to the change of variables $x \mapsto 1 - x$, since $n$, $m$, $\alpha$, and $\beta$ are arbitrary, we see that it suffices to verify that $\log(x)^n x^{\alpha - 1}$ alone is integrable on $(0, 1)$. Integrating by parts, we have that
$$\int_0^1 \log(x)^n x^{\alpha - 1} \, dx = \left[ \frac{\log(x)^n x^\alpha}{\alpha} \right]_0^1 - \frac{n}{\alpha} \int_0^1 \log(x)^{n - 1} x^{\alpha - 1} \, dx.$$
Since, by l'Hôpital's rule, $\lim_{x \to 0} \log(x)^n x^\alpha = 0$, this expression equals
$$-\frac{n}{\alpha} \int_0^1 \log(x)^{n - 1} x^{\alpha - 1} \, dx.$$
Since $x^{\alpha - 1}$ is integrable on $(0, 1)$, we see inductively that so is $\log(x)^n x^{\alpha - 1}$. Therefore, we can differentiate under the integral sign, and Eq. (41) holds.

Now, taking derivatives with respect to $t$ yields that
$$\partial_t \partial_\alpha^n \partial_\beta^m I_t(\alpha, \beta) = \log(t)^n \log(1 - t)^m t^{\alpha - 1} (1 - t)^{\beta - 1},$$
which is a polynomial in smooth functions of $t$, and hence is smooth in $t$. Therefore $\partial_t^k \partial_\alpha^n \partial_\beta^m I_t(\alpha, \beta)$ exists and is a polynomial in smooth functions of $t$ for all $k \geq 1$. By continuity, the orders of the partial derivatives can be switched arbitrarily (see, e.g., Theorem 9.40 in Rudin, 1976), and so it follows that $f$ is smooth.

42. See, e.g., Theorem 6.28 in Klenke (2020).

Full rank. By the rank–nullity theorem, it suffices to show that the column rank of $Df$ is three. Letting $B_{\alpha,\beta} \sim \mathrm{Beta}(\alpha, \beta)$, we have, by the results of the previous section, that the first three columns of $Df$ (i.e., the partial derivatives with respect to $\alpha_0$, $\beta_0$, and $t_0$, respectively) can be expressed in terms of expectations of the form $\mathrm{E}[\log(B_{\alpha_0,\beta_0}) \cdot 1_{B_{\alpha_0,\beta_0} < t_0}]$ and $\mathrm{E}[\log(B_{\alpha_0,\beta_0+1}) \cdot 1_{B_{\alpha_0,\beta_0+1} < t_0}]$, from which the claim follows by direct computation.

Now consider a utility-maximizing threshold policy with group-specific thresholds, i.e.,
$$d(x) = \begin{cases} 1 & u(x) \geq t_{\alpha(x)}, \\ 0 & u(x) < t_{\alpha(x)}. \end{cases}$$
(Note that by our distributional assumption, $\Pr(u(X) = t) = 0$ for all $t \in [0, 1]$.) Since $\lambda \geq 0$, we must have that $t_{a_0} \geq t_{a_1}$. Since $b < 1$, $0 < t_{a_0}$. Therefore,
$$\mathrm{E}[Y(1) \mid A = a_0, D = 0] = \mathrm{E}[r(X) \mid A = a_0, u(X) < t_{a_0}] \geq \mathrm{E}[r(X) \mid A = a_0, u(X) < t_{a_1}] > \mathrm{E}[r(X) \mid A = a_1, u(X) < t_{a_1}] = \mathrm{E}[Y(1) \mid A = a_1, D = 0],$$
where the first equality follows by the law of iterated expectation, the second from the fact that $t_{a_1} \leq t_{a_0}$, the third from our distributional assumption and Lemmata 80 and 83, and the final again from the law of iterated expectation. However, since counterfactual predictive parity is satisfied,
$$\mathrm{E}[Y(1) \mid A = a_0, D = 0] = \mathrm{E}[Y(1) \mid A = a_1, D = 0],$$
which is a contradiction. Therefore, no such threshold policy exists.

After accounting for the difference in parameterization, Prop. 20 follows as a corollary.

Proof of Prop. 20  Since $\mu_{a_0} > \mu_{a_1}$, we have $\alpha_{a_0} = v \cdot \mu_{a_0} > v \cdot \mu_{a_1} = \alpha_{a_1}$ and $\beta_{a_0} = v \cdot (1 - \mu_{a_0}) < v \cdot (1 - \mu_{a_1}) = \beta_{a_1}$. Therefore $\beta_{a_0} < \beta_{a_1}$ and $\alpha_{a_1} < \alpha_{a_0}$, and so, by Lemma 84, the proposition follows.

References

Alekh Agarwal, Alina Beygelzimer, Miroslav Dudík, John Langford, and Hanna Wallach. A reductions approach to fair classification. In International Conference on Machine Learning, 2018.
Rahul Aggarwal, Kirsten Bibbins-Domingo, Robert W. Yeh, Yang Song, Nicholas Chiu, Rishi K. Wadhera, Changyu Shen, and Dhruv S. Kazi. Diabetes screening by race and ethnicity in the United States: Equivalent body mass index and age thresholds. Annals of Internal Medicine, 175(6):765–773, 2022.
Robert M Anderson and William R Zame. Genericity with infinitely many parameters. Advances in Theoretical Economics, 1(1):1–62, 2001.
Julia Angwin, Jeff Larson, Surya Mattu, and Lauren Kirchner. Machine bias: There's software used across the country to predict future criminals. And it's biased against blacks. ProPublica, May 2016.
Shamena Anwar and Hanming Fang. An alternative test of racial prejudice in motor vehicle searches: Theory and evidence. The American Economic Review, 2006.
Idir Arab, Paulo Eduardo Oliveira, and Tilo Wiklund. Convex transform order of beta distributions with some consequences. Statistica Neerlandica, 75(3):238–256, 2021.
Matthew Arnold, Rachel K. E. Bellamy, Michael Hind, Stephanie Houde, Sameep Mehta, Aleksandra Mojsilović, Ravi Nair, K. Natesan Ramamurthy, Alexandra Olteanu, David Piorkowski, et al. FactSheets: Increasing trust in AI services through supplier's declarations of conformity. IBM Journal of Research and Development, 63(4/5):6:1, 2019.
Imanol Arrieta-Ibarra, Paman Gujral, Jonathan Tannen, Mark Tygert, and Cherie Xu. Metrics of calibration for probabilistic predictions. Journal of Machine Learning Research, 2022.
Kenneth Arrow. The theory of discrimination. In Discrimination in Labor Markets. Princeton University Press, 1973.
Ian Ayres. Outcome tests of racial disparities in police practices. Justice Research and Policy, 4(1-2):131–142, 2002.
Jack M Balkin and Reva B Siegel. The American civil rights tradition: Anticlassification or antisubordination. Issues in Legal Scholarship, 2(1), 2003.
Michelle Bao, Angela Zhou, Samantha Zottola, Brian Brubach, Sarah Desmarais, Aaron Horowitz, Kristian Lum, and Suresh Venkatasubramanian. It's COMPASlicated: The messy relationship between RAI datasets and algorithmic fairness benchmarks. arXiv preprint arXiv:2106.05498, 2021.
Chelsea Barabas, Madars Virza, Karthik Dinakar, Joichi Ito, and Jonathan Zittrain. Interventions over predictions: Reframing the ethical debate for actuarial risk assessment. In Conference on Fairness, Accountability and Transparency, pages 62–76, 2018.
Solon Barocas and Andrew D Selbst. Big data's disparate impact. Cal. L. Rev., 104:671, 2016.
Solon Barocas, Moritz Hardt, and Arvind Narayanan. Fairness and Machine Learning: Limitations and Opportunities. fairmlbook.org, 2019. http://www.fairmlbook.org.
Gary S Becker. The Economics of Discrimination. University of Chicago Press, 1957.
Benji. The sum of an uncountable number of positive numbers. Mathematics Stack Exchange, 2020. URL https://math.stackexchange.com/q/20661. (Version: 2020-05-29).
Richard Berk. Criminal Justice Forecasts of Risk: A Machine Learning Approach. Springer Science & Business Media, 2012.
Richard Berk, Hoda Heidari, Shahin Jabbari, Michael Kearns, and Aaron Roth. Fairness in criminal justice risk assessments: The state of the art. Sociological Methods & Research, 50(1):3–44, 2021.
Marianne Bertrand and Sendhil Mullainathan. Are Emily and Greg more employable than Lakisha and Jamal? A field experiment on labor market discrimination. American Economic Review, 94(4):991–1013, 2004.
Patrick Billingsley. Probability and Measure. Wiley Series in Probability and Mathematical Statistics. John Wiley & Sons, Inc., New York, third edition, 1995. ISBN 0-471-00710-2. A Wiley-Interscience Publication.
Avrim Blum and Kevin Stangl. Recovering from biased data: Can fairness constraints improve accuracy? arXiv preprint arXiv:1912.01094, 2019.
Henk Brozius. Conditional expectation: E[f(X, Y) | Y]. Mathematics Stack Exchange, 2019. URL https://math.stackexchange.com/q/3247577. (Version: 2019-06-01).
Joy Buolamwini and Timnit Gebru. Gender shades: Intersectional accuracy disparities in commercial gender classification. In Conference on Fairness, Accountability and Transparency, pages 77–91, 2018.
William Cai, Johann Gaebler, Nikhil Garg, and Sharad Goel. Fair allocation through selective information acquisition. In Proceedings of the AAAI/ACM Conference on AI, Ethics, and Society, pages 22–28, 2020.
William Cai, Ro Encarnacion, Bobbie Chern, Sam Corbett-Davies, Miranda Bogen, Stevie Bergman, and Sharad Goel. Adaptive sampling strategies to construct equitable training datasets. In Proceedings of the Conference on Fairness, Accountability, and Transparency, 2022a.
William Cai, Johann Gaebler, Justin Kaashoek, Lisa Pinals, Samuel Madden, and Sharad Goel. Measuring racial and ethnic disparities in traffic enforcement with large-scale telematics data. PNAS Nexus, 1(4):pgac144, 2022b.
Toon Calders and Sicco Verwer. Three naive Bayes approaches for discrimination-free classification. Data Mining and Knowledge Discovery, 21(2):277–292, 2010.
Alycia N Carey and Xintao Wu. The causal fairness field guide: Perspectives from social and formal sciences. Frontiers in Big Data, 5, 2022.
James H Carr and Isaac F Megbolugbe. The Federal Reserve Bank of Boston study on mortgage lending revisited. Fannie Mae Office of Housing Policy Research, 1993.
Centers for Disease Control and Prevention. National Health and Nutrition Examination Survey data. Technical report, National Center for Health Statistics, 2011–2018. URL https://wwwn.cdc.gov/nchs/nhanes/continuousnhanes/overview.aspx?BeginYear=2011.
Jessica P Cerdeña, Marie V Plaisime, and Jennifer Tsai. From race-based to race-conscious medicine: How anti-racist uprisings call us to act. The Lancet, 396(10257):1125–1128, 2020.
Silvia Chiappa. Path-specific counterfactual fairness. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 7801–7808, 2019.
Kasia S Chmielinski, Sarah Newman, Matt Taylor, Josh Joseph, Kemi Thomas, Jessica Yurkofsky, and Yue Chelsea Qiu. The dataset nutrition label (2nd gen): Leveraging context to mitigate harms in artificial intelligence. arXiv preprint arXiv:2201.03954, 2022.
Alex Chohlas-Wood, Joe Nudell, Keniel Yao, Zhiyuan Lin, Julian Nyarko, and Sharad Goel. Blind justice: Algorithmically masking race in charging decisions. In Proceedings of the 2021 AAAI/ACM Conference on AI, Ethics, and Society, pages 35–45, 2021.
Alex Chohlas-Wood, Madison Coots, Emma Brunskill, and Sharad Goel. Learning to be fair: A consequentialist approach to equitable decision-making. arXiv preprint arXiv:2109.08792, 2023a.
Alex Chohlas-Wood, Madison Coots, Sharad Goel, and Julian Nyarko. Designing equitable algorithms. Nature Computational Science, 3, 2023b.
Alexandra Chouldechova. Fair prediction with disparate impact: A study of bias in recidivism prediction instruments. Big Data, 5(2):153–163, 2017.
Alexandra Chouldechova and Aaron Roth. A snapshot of the frontiers of fairness in machine learning. Communications of the ACM, 63(5):82–89, 2020.
Alexandra Chouldechova, Diana Benavides-Prado, Oleksandr Fialko, and Rhema Vaithianathan. A case study of algorithm-assisted decision making in child maltreatment hotline screening decisions. In Conference on Fairness, Accountability and Transparency, pages 134–148, 2018.
Jens Peter Reus Christensen. On sets of Haar measure zero in Abelian Polish groups. Israel Journal of Mathematics, 13(3-4):255–260, 1972.
T Anne Cleary. Test bias: Prediction of grades of Negro and white students in integrated colleges. Journal of Educational Measurement, 5(2):115–124, 1968.
Ruth Colker. Anti-subordination above all: Sex, race, and equal protection. NYUL Rev., 61:1003, 1986.
Madison Coots, Soroush Saghafian, David Kent, and Sharad Goel. Reevaluating the role of race and ethnicity in diabetes screening. arXiv preprint arXiv:2306.10220, 2023.
Sam Corbett-Davies and Sharad Goel. The measure and mismeasure of fairness: A critical review of fair machine learning. arXiv preprint arXiv:1808.00023v2, 2018.
Sam Corbett-Davies, Emma Pierson, Avi Feller, Sharad Goel, and Aziz Huq. Algorithmic decision making and the cost of fairness. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 797–806, 2017.
Amanda Coston, Alan Mishler, Edward H Kennedy, and Alexandra Chouldechova. Counterfactual risk assessments, evaluation, and fairness. In Proceedings of the 2020 Conference on Fairness, Accountability, and Transparency, pages 582–593, 2020.
Andrew Cotter, Heinrich Jiang, and Karthik Sridharan. Two-player games for efficient non-convex constrained optimization. In Algorithmic Learning Theory, pages 300–332. PMLR, 2019.
Stewart J D'Alessio and Lisa Stolzenberg. Race and the probability of arrest. Social Forces, 81(4):1381–1397, 2003.
Richard B Darlington. Another look at "cultural fairness". Journal of Educational Measurement, 8(2):71–82, 1971.
Matthew DeMichele, Peter Baumgartner, Michael Wenger, Kelle Barrick, Megan Comfort, and Shilpi Misra. The Public Safety Assessment: A re-validation and assessment of predictive utility and differential prediction by race and gender in Kentucky, 2018. URL https://papers.ssrn.com/abstract=3168452.
Cynthia Dwork, Moritz Hardt, Toniann Pitassi, Omer Reingold, and Richard Zemel. Fairness through awareness. In Proceedings of the 3rd Innovations in Theoretical Computer Science Conference, pages 214–226, 2012.
Harrison Edwards and Amos Storkey. Censoring representations with an adversary. In Proceedings of the International Conference in Learning Representations, 2016.
Robin S Engel and Rob Tillyer. Searching for equilibrium: The tenuous nature of the outcome test. Justice Quarterly, 25(1):54–71, 2008.
Michael Feldman, Sorelle A Friedler, John Moeller, Carlos Scheidegger, and Suresh Venkatasubramanian. Certifying and removing disparate impact. In Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 259–268. ACM, 2015.
Fisher v. University of Texas. 579 U.S., 2016.
Owen M Fiss. Groups and the Equal Protection Clause. Philosophy & Public Affairs, pages 107–177, 1976.
Johann Gaebler, William Cai, Guillaume Basse, Ravi Shroff, Sharad Goel, and Jennifer Hill. A causal framework for observational studies of discrimination. Statistics and Public Policy, 2022.
Sainyam Galhotra, Karthikeyan Shanmugam, Prasanna Sattigeri, and Kush R Varshney. Causal feature selection for algorithmic fairness. Proceedings of the 2022 International Conference on Management of Data (SIGMOD), 2022.
George C Galster. The facts of lending discrimination cannot be argued away by examining default rates. Housing Policy Debate, 4(1):141–146, 1993.
Timnit Gebru, Jamie Morgenstern, Briana Vecchione, Jennifer Wortman Vaughan, Hanna Wallach, Hal Daumé III, and Kate Crawford. Datasheets for datasets. Communications of the ACM, 64(12):86–92, 2021.
Sharad Goel, Maya Perelman, Ravi Shroff, and David Alan Sklansky. Combatting police discrimination in the age of big data. New Criminal Law Review: An International and Interdisciplinary Journal, 20(2):181–232, 2017.
Claudia Goldin and Cecilia Rouse. Orchestrating impartiality: The impact of blind auditions on female musicians. American Economic Review, 90(4):715–741, 2000.
D James Greiner and Donald B Rubin. Causal effects of perceived immutable characteristics. Review of Economics and Statistics, 93(3):775–785, 2011.
Jeffrey Grogger and Greg Ridgeway. Testing for racial profiling in traffic stops from behind a veil of darkness. Journal of the American Statistical Association, 101(475):878–887, 2006.
Moritz Hardt, Eric Price, and Nati Srebro. Equality of opportunity in supervised learning. Advances in Neural Information Processing Systems, 29:3315–3323, 2016.
Ursula Hébert-Johnson, Michael Kim, Omer Reingold, and Guy Rothblum. Multicalibration: Calibration for the (computationally-identifiable) masses. In International Conference on Machine Learning, pages 1939–1948. PMLR, 2018.
Jennifer L Hill. Bayesian nonparametric modeling for causal inference. Journal of Computational and Graphical Statistics, 20(1):217–240, 2011.
Paul W Holland. Statistics and causal inference. Journal of the American Statistical Association, 81(396):945–960, 1986.
Sarah Holland, Ahmed Hosny, Sarah Newman, Joshua Joseph, and Kasia Chmielinski. The dataset nutrition label. Data Protection and Privacy, 12(12):1, 2020.
Lily Hu and Issa Kohler-Hausmann. What's sex got to do with machine learning? In Proceedings of the 2020 ACM Conference on Fairness, Accountability, and Transparency, 2020.
Brian R Hunt, Tim Sauer, and James A Yorke. Prevalence: a translation-invariant "almost every" on infinite-dimensional spaces. Bulletin of the American Mathematical Society, 27(2):217–238, 1992.
Idaho H.B. 118. H.B. 118, 65th Leg., 1st Reg. Sess., 2019. https://legislature.idaho.gov/wp-content/uploads/sessioninfo/2019/legislation/H0118.pdf.
Kosuke Imai and Zhichao Jiang. Principal fairness for human and algorithmic decision-making. arXiv preprint arXiv:2005.10400, 2020.
Kosuke Imai, Zhichao Jiang, James Greiner, Ryan Halen, and Sooahn Shin. Experimental evaluation of algorithm-assisted human decision-making: Application to pretrial public safety assessment. arXiv preprint arXiv:2012.02845, 2020.
Guido W Imbens and Donald B Rubin. Causal Inference in Statistics, Social, and Biomedical Sciences. Cambridge University Press, 2015.
Christopher Jung, Sampath Kannan, Changhwa Lee, Mallesh Pai, Aaron Roth, and Rakesh Vohra. Fair prediction with endogenous behavior. In Proceedings of the 21st ACM Conference on Economics and Computation, pages 677–678, 2020a.
Jongbin Jung, Connor Concannon, Ravi Shroff, Sharad Goel, and Daniel G Goldstein. Simple rules to guide expert classifications. Journal of the Royal Statistical Society: Series A (Statistics in Society), 183(3):771–800, 2020b.
Jongbin Jung, Ravi Shroff, Avi Feller, and Sharad Goel. Bayesian sensitivity analysis for offline policy evaluation. In Proceedings of the AAAI/ACM Conference on AI, Ethics, and Society, pages 64–70, 2020c.
Faisal Kamiran, Indrė Žliobaitė, and Toon Calders. Quantifying explainable discrimination and removing illegal discrimination in automated decision making. Knowledge and Information Systems, 35(3):613–644, 2013.
John G. Kemeny and J. Laurie Snell. Finite Markov Chains. Undergraduate Texts in Mathematics. Springer-Verlag, New York–Heidelberg, 1976. Reprinting of the 1960 original.
Niki Kilbertus, Mateo Rojas-Carulla, Giambattista Parascandolo, Moritz Hardt, Dominik Janzing, and Bernhard Schölkopf. Avoiding discrimination through causal reasoning. In Proceedings of the 31st International Conference on Neural Information Processing Systems, pages 656–666, 2017.
Jon Kleinberg, Himabindu Lakkaraju, Jure Leskovec, Jens Ludwig, and Sendhil Mullainathan. Human decisions and machine predictions. The Quarterly Journal of Economics, 133(1):237–293, 2017a.
Jon Kleinberg, Sendhil Mullainathan, and Manish Raghavan. Inherent trade-offs in the fair determination of risk scores. In Proceedings of Innovations in Theoretical Computer Science (ITCS), 2017b.
Achim Klenke. Probability Theory: A Comprehensive Course. Universitext. Springer, Cham, third edition, 2020.
John Knowles, Nicola Persico, and Petra Todd. Racial bias in motor vehicle searches: Theory and evidence. Journal of Political Economy, 109(1), 2001.
Allison Koenecke, Andrew Nam, Emily Lake, Joe Nudell, Minnie Quartey, Zion Mengesha, Connor Toups, John R Rickford, Dan Jurafsky, and Sharad Goel. Racial disparities in automated speech recognition. Proceedings of the National Academy of Sciences, 117(14):7684–7689, 2020.
Allison Koenecke, Eric Giannella, Robb Willer, and Sharad Goel. Popular support for balancing equity and efficiency in resource allocation: A case study in online advertising to increase welfare program awareness. In Proceedings of the International AAAI Conference on Web and Social Media, volume 17, pages 494–506, 2023.
Matt J Kusner, Joshua Loftus, Chris Russell, and Ricardo Silva. Counterfactual fairness. In Advances in Neural Information Processing Systems, pages 4066–4076, 2017.
John M. Lee. Introduction to Smooth Manifolds, volume 218 of Graduate Texts in Mathematics. Springer, New York, second edition, 2013.
Annie Liang, Jay Lu, and Xiaosheng Mu. Algorithmic design: Fairness versus accuracy. In Proceedings of the 23rd ACM Conference on Economics and Computation, pages 58–59, 2022.
Joshua R Loftus, Chris Russell, Matt J Kusner, and Ricardo Silva. Causal reasoning for algorithmic fairness. arXiv preprint arXiv:1805.05859, 2018.
Kristian Lum and William Isaac. To predict and serve? Significance, 13(5):14–19, 2016.
Charles F Manski, John Mullahy, and Atheendar Venkataramani. Using measures of race to make clinical predictions: Decision making, patient health, and fairness. Technical report, National Bureau of Economic Research, 2022.
Vishwali Mhasawade and Rumi Chunara. Causal multi-level fairness. In Proceedings of the 2021 AAAI/ACM Conference on AI, Ethics, and Society, pages 784–794, 2021.
Alan Mishler, Edward H Kennedy, and Alexandra Chouldechova. Fairness in risk assessment instruments: Post-processing to achieve counterfactual equalized odds. In Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency, pages 386–400, 2021.
Pat Muldowney, Krzysztof Ostaszewski, and Wojciech Wojdowski. The Darth Vader rule. Tatra Mountains Mathematical Publications, 52(1):53–63, 2012.
Sendhil Mullainathan and Jann Spiess. Machine learning: An applied econometric approach. Journal of Economic Perspectives, 31(2):87–106, 2017.
Razieh Nabi and Ilya Shpitser. Fair inference on outcomes. In Proceedings of the AAAI Conference on Artificial Intelligence, 2018.
Hamed Nilforoshan, Johann D Gaebler, Ravi Shroff, and Sharad Goel. Causal conceptions of fairness and their consequences. In International Conference on Machine Learning, pages 16848–16887. PMLR, 2022.
Abigail Nurse. Anti-subordination in the Equal Protection Clause: A case study. NYUL Rev., 89:293, 2014.
Julian Nyarko, Sharad Goel, and Roseanna Sommers. Breaking taboos in fair machine learning: An experimental study. In Equity and Access in Algorithms, Mechanisms, and Optimization. Association for Computing Machinery, 2021.
Ziad Obermeyer, Brian Powers, Christine Vogeli, and Sendhil Mullainathan. Dissecting racial bias in an algorithm used to manage the health of populations. Science, 366(6464):447–453, 2019.
William Ott and James Yorke. Prevalence. Bulletin of the American Mathematical Society, 42(3):263–290, 2005.
Scott E Page. Making the difference: Applying a logic of diversity. Academy of Management Perspectives, 21(4):6–20, 2007.
Jessica K Paulus and David M Kent. Predictably unequal: Understanding and addressing concerns that algorithmic clinical prediction may increase health disparities. NPJ Digital Medicine, 3(1):1–8, 2020.
Judea Pearl. Direct and indirect effects. In Proceedings of the Seventeenth Conference on Uncertainty in Artificial Intelligence, pages 411–420. Morgan Kaufmann, 2001.
Judea Pearl. Causal inference in statistics: An overview. Statistics Surveys, 3:96–146, 2009a.
Judea Pearl. Causality. Cambridge University Press, second edition, 2009b.
Dino Pedreshi, Salvatore Ruggieri, and Franco Turini. Discrimination-aware data mining. In Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 560–568, 2008.
Edmund S Phelps. The statistical theory of racism and sexism. The American Economic Review, 62(4):659–661, 1972.
Emma Pierson, Sam Corbett-Davies, and Sharad Goel. Fast threshold tests for detecting discrimination. In The 21st International Conference on Artificial Intelligence and Statistics (AISTATS), 2018.
Emma Pierson, Camelia Simoiu, Jan Overgoor, Sam Corbett-Davies, Daniel Jenson, Amy Shoemaker, Vignesh Ramachandran, Phoebe Barghouty, Cheryl Phillips, Ravi Shroff, and Sharad Goel. A large-scale analysis of racial disparities in police stops across the United States. Nature Human Behaviour, 4(7):736–745, 2020.
Richard A Primus. Equal protection and disparate impact: Round three. Harv. L. Rev., 117:494, 2003.
Rajeev Ramchand, Rosalie Liccardo Pacula, and Martin Y Iguchi. Racial differences in marijuana-users' risk of arrest in the United States. Drug and Alcohol Dependence, 84(3):264–272, 2006.
M. M. Rao. Conditional Measures and Applications, volume 271 of Pure and Applied Mathematics (Boca Raton). Chapman & Hall/CRC, Boca Raton, FL, second edition, 2005.
Guy N Rothblum and Gal Yona. Decision-making under miscalibration. arXiv preprint arXiv:2203.09852, 2022.
Walter Rudin. Principles of Mathematical Analysis, volume 3. McGraw-Hill, New York, 1976.
Walter Rudin. Real and Complex Analysis. McGraw-Hill Book Co., New York, third edition, 1987. ISBN 0-07-054234-1.
Walter Rudin. Functional Analysis. International Series in Pure and Applied Mathematics. McGraw-Hill, Inc., New York, second edition, 1991. ISBN 0-07-054236-8.
Laleh Seyyed-Kalantari, Haoran Zhang, Matthew McDermott, Irene Y Chen, and Marzyeh Ghassemi. Underdiagnosis bias of artificial intelligence algorithms applied to chest radiographs in under-served patient populations. Nature Medicine, 27(12):2176–2182, 2021.
SFFA v. Harvard. Students for Fair Admissions, Inc., Petitioner, v. President and Fellows of Harvard College. Students for Fair Admissions, Inc., Petitioner, v. University of North Carolina, et al., 2023. https://www.supremecourt.gov/opinions/22pdf/20-1199_l6gn.pdf.
Moshe Shaked and J George Shanthikumar. Stochastic Orders. Springer, 2007.
Ravi Shroff. Predictive analytics for city agencies: Lessons from children's services. Big Data, 5(3):189–196, 2017.
Reva B Siegel. Equality talk: Antisubordination and anticlassification values in constitutional struggles over Brown. Harv. L. Rev., 117:1470, 2003.
C. E. Silva. Invitation to Ergodic Theory, volume 42 of Student Mathematical Library. American Mathematical Society, Providence, RI, 2008.
Camelia Simoiu, Sam Corbett-Davies, and Sharad Goel. The problem of infra-marginality in outcome tests for discrimination. The Annals of Applied Statistics, 11(3):1193–1216, 2017.
Jennifer Skeem, John Monahan, and Christopher Lowenkamp. Gender, risk assessment, and sanctioning: The cost of treating women like men. Law and Human Behavior, 40(5):580, 2016.
Jennifer L. Skeem and Christopher T. Lowenkamp. Risk, race, and recidivism: Predictive bias and disparate impact. Criminology, 54(4):680–712, 2016.
Rhys Steele. Space of vector measures equipped with the total variation norm is complete. Mathematics Stack Exchange, 2019. URL https://math.stackexchange.com/q/3197508. (Version: 2019-04-22).
Julia Stoyanovich and Bill Howe. Nutritional labels for data and models. A Quarterly Bulletin of the Computer Society of the IEEE Technical Committee on Data Engineering, 42(3), 2019.
Yixin Wang, Dhanya Sridhar, and David M Blei. Equal opportunity and affirmative action via counterfactual predictions. arXiv preprint arXiv:1905.10870, 2019.
Watson v. Fort Worth. Watson v. Fort Worth Bank & Trust, 1988. 487 U.S. 977.
Hilde Weerts, Miroslav Dudík, Richard Edgar, Adrin Jalali, Roman Lutz, and Michael Madaio. Fairlearn: Assessing and improving fairness of AI systems, 2023.
Teresa Scotton Williams. Some issues in the standardized testing of minority students. Journal of Education, pages 192–208, 1983.
Blake Woodworth, Suriya Gunasekar, Mesrob I Ohannessian, and Nathan Srebro. Learning non-discriminatory predictors. In Conference on Learning Theory, pages 1920–1953. PMLR, 2017.
Yongkai Wu, Lu Zhang, Xintao Wu, and Hanghang Tong. PC-fairness: A unified framework for measuring causality-based fairness. Advances in Neural Information Processing Systems, 32, 2019.
Muhammad Bilal Zafar, Isabel Valera, Manuel Gomez Rodriguez, and Krishna P Gummadi. Fairness beyond disparate treatment & disparate impact: Learning classification without disparate mistreatment. In Proceedings of the 26th International Conference on World Wide Web, pages 1171–1180, 2017a.
Muhammad Bilal Zafar, Isabel Valera, Manuel Gomez Rodriguez, Krishna P Gummadi, and Adrian Weller. From parity to preference-based notions of fairness in classification. In Proceedings of the 31st International Conference on Neural Information Processing Systems, pages 228–238, 2017b.
Muhammad Bilal Zafar, Isabel Valera, Manuel Gomez Rodriguez, and Krishna P Gummadi. Fairness constraints: Mechanisms for fair classification. In Artificial Intelligence and Statistics, pages 962–970, 2017c.
Michael Zanger-Tishler, Julian Nyarko, and Sharad Goel. Risk scores, label bias, and everything but the kitchen sink. arXiv preprint arXiv:2305.12638, 2023.
Rich Zemel, Yu Wu, Kevin Swersky, Toni Pitassi, and Cynthia Dwork. Learning fair representations. In International Conference on Machine Learning, pages 325–333, 2013.
Junzhe Zhang and Elias Bareinboim. Fairness in decision-making: The causal explanation formula. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
Lu Zhang, Yongkai Wu, and Xintao Wu. A causal framework for discovering and removing direct and indirect discrimination. In Proceedings of the 26th International Joint Conference on Artificial Intelligence, pages 3929–3935, 2017.