Causal Conceptions of Fairness and their Consequences

Hamed Nilforoshan* (1), Johann Gaebler* (1), Ravi Shroff (2), Sharad Goel (3)

Recent work highlights the role of causality in designing equitable decision-making algorithms. It is not immediately clear, however, how existing causal conceptions of fairness relate to one another, or what the consequences are of using these definitions as design principles. Here, we first assemble and categorize popular causal definitions of algorithmic fairness into two broad families: (1) those that constrain the effects of decisions on counterfactual disparities; and (2) those that constrain the effects of legally protected characteristics, like race and gender, on decisions. We then show, analytically and empirically, that both families of definitions almost always (in a measure-theoretic sense) result in strongly Pareto dominated decision policies, meaning there is an alternative, unconstrained policy favored by every stakeholder with preferences drawn from a large, natural class. For example, in the case of college admissions decisions, policies constrained to satisfy causal fairness definitions would be disfavored by every stakeholder with neutral or positive preferences for both academic preparedness and diversity. Indeed, under a prominent definition of causal fairness, we prove the resulting policies require admitting all students with the same probability, regardless of academic qualifications or group membership. Our results highlight formal limitations and potential adverse consequences of common mathematical notions of causal fairness.

*Equal contribution. (1) Stanford University, Stanford, CA; (2) New York University, New York, NY; (3) Harvard University, Cambridge, MA. Correspondence to: Hamed Nilforoshan, Johann Gaebler, Ravi Shroff, Sharad Goel. Proceedings of the 39th International Conference on Machine Learning, Baltimore, Maryland, USA, PMLR 162, 2022. Copyright 2022 by the author(s).

1. Introduction

Imagine designing an algorithm to guide decisions for college admissions. To help ensure algorithms such as this are broadly equitable, a plethora of formal fairness criteria have been proposed in the machine learning community (Berk et al., 2021; Chouldechova, 2017; Chouldechova & Roth, 2020; Cleary, 1968; Corbett-Davies et al., 2017; Darlington, 1971; Dwork et al., 2012; Hardt et al., 2016; Kleinberg et al., 2017; Woodworth et al., 2017; Zafar et al., 2017a;b). For example, under the principle that fair algorithms should have comparable performance across demographic groups (Hardt et al., 2016), one might check that among applicants who were ultimately academically successful (e.g., who eventually earned a college degree, either at the institution in question or elsewhere), the algorithm would recommend admission for an equal proportion of candidates across race groups. Alternatively, following the principle that decisions should be agnostic to legally protected attributes like race and gender (cf. Corbett-Davies & Goel, 2018; Dwork et al., 2012), one might mandate that these features not be provided to the algorithm.
Recent scholarship has argued for extending equitable algorithm design by adopting a causal perspective, leading to myriad additional formal criteria for fairness (Carey & Wu, 2022; Chiappa, 2019; Coston et al., 2020; Galhotra et al., 2022; Imai & Jiang, 2020; Imai et al., 2020; Kilbertus et al., 2017; Kusner et al., 2017; Loftus et al., 2018; Mhasawade & Chunara, 2021; Nabi & Shpitser, 2018; Wang et al., 2019; Wu et al., 2019; Zhang & Bareinboim, 2018; Zhang et al., 2017). Less attention, however, has been given to understanding the potential downstream consequences of using these causal definitions of fairness as algorithmic design principles, leaving an important gap to fill if these criteria are to responsibly inform policy choices. Here we synthesize and critically examine the statistical properties and concomitant consequences of popular causal approaches to fairness.

We begin, in Section 2, by proposing a two-part taxonomy for causal conceptions of fairness that mirrors the illustrative, non-causal fairness principles described above. Our first category of definitions encompasses those that consider the effect of decisions on counterfactual disparities. For example, recognizing the causal effect of college admission on later success, one might demand that among applicants who would be academically successful if admitted to a particular college, the algorithm would recommend admission for an equal proportion of candidates across race groups. The second category of definitions encompasses those that seek to limit both the direct and indirect effects of one's group membership on decisions. For example, because one's race might impact earlier educational opportunities, and hence test scores, one might require that admissions decisions are robust to the effect of race along such causal paths.

We show, in Section 3, that when the distribution of causal effects is known (or can be estimated), one can efficiently compute utility-maximizing decision policies constrained to satisfy each of the causal fairness criteria we consider. However, for natural families of utility functions (for example, those that prefer both higher degree attainment and more student-body diversity), we prove in Section 4 that causal fairness constraints almost always lead to strongly Pareto dominated decision policies. To establish this result, we use the theory of prevalence (Anderson & Zame, 2001; Christensen, 1972; Hunt et al., 1992; Ott & Yorke, 2005), which extends the notion of full-measure sets to infinite-dimensional vector spaces. In particular, in our running college admissions example, adhering to any of the common conceptions of causal fairness would simultaneously result in a lower number of degrees attained and lower student-body diversity, relative to what one could achieve by explicitly tailoring admissions policies to achieve desired outcomes. In fact, under one prominent definition of causal fairness, we prove that the induced policies require simply admitting all applicants with equal probability, irrespective of one's academic qualifications or group membership. These results, we hope, elucidate the structure and limitations of current causal approaches to equitable decision making.
2. Causal Approaches to Fair Decision Making

We describe two broad classes of causal notions of fairness: (1) those that consider outcomes when decisions are counterfactually altered; and (2) those that consider outcomes when protected attributes are counterfactually altered. We illustrate these definitions in the context of a running example of college admissions decisions.

2.1. Problem Setup

Consider a population of individuals with observed covariates X, drawn i.i.d. from a set 𝒳 ⊆ R^n with distribution D_X. Further suppose that A ∈ 𝒜 describes one or more discrete protected attributes, such as race or gender, which can be derived from X (i.e., A = α(X) for some measurable function α). Each individual is subject to a binary decision D ∈ {0, 1}, determined by a (randomized) rule d(x) ∈ [0, 1], where d(x) = Pr(D = 1 | X = x) is the probability of receiving a positive decision.[1] Given a budget b with 0 < b < 1, we require the decision rule to satisfy E[D] ≤ b, limiting the expected proportion of positive decisions. In our running example, we imagine a population of applicants to a particular college, where d denotes an admissions rule and D indicates a binary admissions decision. To simplify our exposition, we assume all admitted students attend the school. In our setting, the covariates X consist of an applicant's test score and race A ∈ {a0, a1}, where, for notational convenience, we consider two race groups. The budget b bounds the expected proportion of admitted applicants.

Assuming there is no interference between units (Imbens & Rubin, 2015), we write Y(1) and Y(0) to denote potential outcomes of interest under each of the two possible binary decisions, where Y = Y(D) is the realized outcome. We assume that Y(1) and Y(0) are drawn from a (possibly infinite) set 𝒴 ⊆ R, where |𝒴| > 1. In our admissions example, Y is a binary variable that indicates college graduation (i.e., degree attainment), with Y(1) and Y(0) describing, respectively, whether an applicant would attain a college degree if admitted to or if rejected from the school we consider. Note that Y(0) is not necessarily zero, as a rejected applicant may attend and graduate from a different university. Given this setup, our goal is to construct decision policies d that are broadly equitable, formalized in part by the causal notions of fairness described below. We focus on decisions that are made algorithmically, informed by historical data on applicants and subsequent outcomes.

Footnote 1: That is, D = 1{U_D ≤ d(X)}, where U_D is an independent uniform random variable.
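To make the setup concrete, the short sketch below simulates a randomized decision rule of this form on synthetic applicant data. The covariate distribution, group labels, and the particular rule d are illustrative assumptions, not quantities taken from the paper's experiments; only the mechanics (D = 1{U_D ≤ d(X)} and the budget requirement E[D] ≤ b) follow the definitions above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical applicant pool: race group (a0/a1) and a test score in [0, 1].
n = 100_000
race = rng.choice(["a0", "a1"], size=n, p=[0.7, 0.3])
score = np.where(race == "a0", rng.beta(5, 2, n), rng.beta(4, 3, n))

b = 0.25  # budget: at most 25% positive decisions in expectation

def d(score):
    """Illustrative randomized rule d(x) in [0, 1], rescaled so E[d(X)] <= b."""
    raw = np.clip(score - 0.4, 0.0, 1.0)
    return raw * min(1.0, b / raw.mean())

u_d = rng.uniform(size=n)            # exogenous uniform noise U_D
D = (u_d <= d(score)).astype(int)    # D = 1{U_D <= d(X)}

print("E[D] ≈", D.mean(), "with budget b =", b)
print("admission rate by group:",
      {g: D[race == g].mean() for g in ("a0", "a1")})
```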
2.2. Limiting the Effect of Decisions on Disparities

A popular class of non-causal fairness definitions requires that error rates (e.g., false positive and false negative rates) are equal across protected groups (Corbett-Davies & Goel, 2018; Hardt et al., 2016). Causal analogues of these definitions have recently been proposed (Coston et al., 2020; Imai & Jiang, 2020; Imai et al., 2020; Mishler et al., 2021), which require various conditional independence conditions to hold between the potential outcomes, protected attributes, and decisions.[2] Below we list three representative examples of this class of fairness definitions: counterfactual predictive parity (Coston et al., 2020), counterfactual equalized odds (Coston et al., 2020; Mishler et al., 2021), and conditional principal fairness (Imai & Jiang, 2020).[3]

Definition 1. Counterfactual predictive parity holds when

Y(1) ⊥ A | D = 0.  (1)

In our college admissions example, counterfactual predictive parity means that among rejected applicants, the proportion who would have attained a college degree, had they been accepted, is equal across race groups.

Definition 2. Counterfactual equalized odds holds when

D ⊥ A | Y(1).  (2)

In our running example, counterfactual equalized odds is satisfied when two conditions hold: (1) among applicants who would graduate if admitted (i.e., Y(1) = 1), students are admitted at the same rate across race groups; and (2) among applicants who would not graduate if admitted (i.e., Y(1) = 0), students are again admitted at the same rate across race groups.

Definition 3. Conditional principal fairness holds when

D ⊥ A | Y(0), Y(1), W,  (3)

where, for a measurable function ω on 𝒳, W = ω(X) describes a reduced set of the covariates X. When W is constant (or, equivalently, when we do not condition on W), this condition is called principal fairness.

In our example, conditional principal fairness means that similar applicants (where similarity is defined by the potential outcomes and covariates W) are admitted at the same rate across race groups.

Footnote 2: In the literature on causal fairness, there is at times ambiguity between predictions Ŷ ∈ {0, 1} of Y and decisions D ∈ {0, 1}. Following past work (e.g., Corbett-Davies et al., 2017; Kusner et al., 2017; Wang et al., 2019), here we focus exclusively on decisions, with predictions implicitly impacting decisions but not explicitly appearing in our definitions.

Footnote 3: Our subsequent analytical results extend in a straightforward manner to structurally similar variants of these definitions (e.g., requiring Y(0) ⊥ A | D = 1 or D ⊥ A | Y(0), variants of counterfactual predictive parity and counterfactual equalized odds, respectively).
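As a concrete illustration of how such conditions can be audited when potential outcomes are simulated (and hence known), the sketch below checks counterfactual equalized odds (Definition 2) for a score-based rule by comparing admission rates across race groups within each stratum of Y(1). The data-generating process and the rule are hypothetical placeholders, not the simulation used later in the paper.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 200_000

race = rng.choice(["a0", "a1"], size=n, p=[0.7, 0.3])
score = np.where(race == "a0", rng.beta(5, 2, n), rng.beta(4, 3, n))
# Potential outcome Y(1): graduation if admitted, more likely for higher scores.
y1 = rng.binomial(1, 0.2 + 0.7 * score)

# A simple (group-blind) randomized admission rule based on score alone.
d_x = np.clip(1.5 * score - 0.6, 0, 1)
D = rng.binomial(1, d_x)

# Counterfactual equalized odds: D ⊥ A | Y(1), i.e., equal admission
# rates across race groups within each level of Y(1).
audit = (
    pd.DataFrame({"race": race, "y1": y1, "D": D})
    .groupby(["y1", "race"])["D"].mean()
    .unstack("race")
)
print(audit)  # large gaps between the two columns indicate a violation
```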
2.3. Limiting the Effect of Attributes on Decisions

An alternative causal framework for understanding fairness considers the effects of protected attributes on decisions (Kilbertus et al., 2017; Kusner et al., 2017; Mhasawade & Chunara, 2021; Nabi & Shpitser, 2018; Wang et al., 2019; Wu et al., 2019; Zhang & Bareinboim, 2018; Zhang et al., 2017). This approach, which can be understood as codifying the legal notion of disparate treatment (Goel et al., 2017; Zafar et al., 2017a), considers a decision rule to be fair if, at a high level, decisions for individuals are the same in (a) the actual world and (b) a counterfactual world where the individual belonged to a different demographic group (Kusner et al., 2017).[4]

In contrast to fairness through unawareness (in which race and other protected attributes are barred from being an explicit input to a decision rule; cf. Corbett-Davies & Goel, 2018; Dwork et al., 2012), the causal versions of this idea consider both the direct and indirect effects of protected attributes on decisions. For example, even if decisions only directly depend on test scores, race may indirectly impact decisions through its effects on educational opportunities, which in turn influence test scores. This idea can be formalized by requiring that decisions remain the same in expectation even if one's protected characteristics are counterfactually altered, a condition known as counterfactual fairness (Kusner et al., 2017).

Definition 4. Counterfactual fairness holds when

E[D(a') | X] = E[D | X],  (4)

where D(a') denotes the decision when one's protected attributes are counterfactually altered to be any a' ∈ 𝒜.

In our running example, this means that for each group of observationally identical applicants (i.e., those with the same values of X, meaning identical race and test score), the proportion of students who are actually admitted is the same as the proportion who would be admitted if their race were counterfactually altered.

Counterfactual fairness aims to limit all direct and indirect effects of protected traits on decisions. In a generalization of this criterion, termed path-specific fairness (Chiappa, 2019; Nabi & Shpitser, 2018; Wu et al., 2019; Zhang et al., 2017), one allows protected traits to influence decisions along certain causal paths but not others. For example, one may wish to allow the direct consideration of race by an admissions committee to implement an affirmative action policy, while also guarding against any indirect influence of race on admissions decisions that may stem from cultural biases in standardized tests (Williams, 1983).

The formal definition of path-specific fairness requires specifying a causal DAG describing relationships between attributes (both observed covariates and latent variables), decisions, and outcomes. In our running example of college admissions, we imagine that each individual's observed covariates are the result of the process illustrated by the causal DAG in Figure 1.

[Figure 1. A causal DAG illustrating a hypothetical process for college admissions. Under path-specific fairness, one may require, for example, that race does not affect decisions along the path highlighted in red.]

In this graph, an applicant's race A influences the educational opportunities E available to them prior to college; and educational opportunities in turn influence an applicant's level of college preparation, M, as well as their score on a standardized admissions test, T, such as the SAT. We assume the admissions committee only observes an applicant's race and test score, so that X = (A, T), and makes their decision D based on these attributes. Finally, whether or not an admitted student subsequently graduates (from any college), Y, is a function of both their preparation and whether they were admitted.[5]

To define path-specific fairness, we start by defining, for the decision D, path-specific counterfactuals, a general concept in causal DAGs (cf. Pearl, 2001). Suppose G = (V, U, F) is a causal model with nodes V, exogenous variables U, and structural equations F that define the value at each node Vj as a function of its parents pa(Vj) and its associated exogenous variable Uj. (See, for example, Pearl (2009a) for further details on causal DAGs.)

Footnote 4: Conceptualizing a general causal effect of an immutable characteristic such as race or gender is rife with challenges, the greatest of which is expressed by the mantra, "no causation without manipulation" (Holland, 1986). In particular, analyzing race as a causal treatment requires one to specify what exactly is meant by changing an individual's "race" from, for example, white to Black (Gaebler et al., 2022; Hu & Kohler-Hausmann, 2020). Such difficulties can sometimes be addressed by considering a change in the perception of race by a decision maker (Greiner & Rubin, 2011), for instance, by changing the name listed on an employment application (Bertrand & Mullainathan, 2004), or by masking an individual's appearance (Chohlas-Wood et al., 2021b; Goldin & Rouse, 2000; Grogger & Ridgeway, 2006; Pierson et al., 2020).

Footnote 5: In practice, the racial composition of an admitted class may itself influence degree attainment, if, for example, diversity provides a net benefit to students (Page, 2007). Here, for simplicity, we avoid consideration of such peer effects.
Let V1, . . . , Vm be a topological ordering of the nodes, meaning that pa(Vj) ⊆ {V1, . . . , V(j−1)} (i.e., the parents of each node appear in the ordering before the node itself). Let Π denote a collection of paths from node A to D. Now, for two possible values a and a' for the variable A, the path-specific counterfactuals D_{Π,a,a'} for the decision D are generated by traversing the list of nodes in topological order, propagating counterfactual values obtained by setting A = a' along paths in Π, and otherwise propagating values obtained by setting A = a. (In Algorithm 1 in the Appendix, we formally define path-specific counterfactuals for an arbitrary node or collection of nodes in the DAG.)

To see this idea in action, we work out an illustrative example, computing path-specific counterfactuals for the decision D along the single path Π = {A → E → T → D} linking race to the admissions committee's decision through test score, highlighted in red in Figure 1. In the system of equations below, the first column corresponds to draws V* for each node V in the DAG, where we set A to a and then propagate that value as usual. The second column corresponds to draws V' of path-specific counterfactuals, where we set A to a' and then propagate the counterfactuals only along the path A → E → T → D. In particular, the value for the test score T' is computed using the value of E' (since the edge E → T is on the specified path) and the value of M* (since the edge M → T is not on the path). As a result of this process, we obtain a draw D' from the distribution of D_{Π,a,a'}:

A* = a,               A' = a',
E* = f_E(A*),         E' = f_E(A'),
M* = f_M(E*),         M' = f_M(E*),
T* = f_T(E*, M*),     T' = f_T(E', M*),
D* = f_D(A*, T*),     D' = f_D(A*, T').

Path-specific fairness formalizes the intuition that the influence of a sensitive attribute on a downstream decision may, in some circumstances, be considered legitimate (i.e., it may be acceptable for the attribute to affect decisions along certain paths in the DAG). For instance, an admissions committee may believe that the effect of race A on admissions decisions D which passes through college preparation M is legitimate, whereas the effect of race along the path A → E → T → D, which may reflect access to test prep or cultural biases of the tests, rather than actual academic preparedness, is illegitimate. In that case, the admissions committee may seek to ensure that the proportion of applicants they admit from a certain race group remains unchanged if one were to counterfactually alter the race of those individuals along the path Π = {A → E → T → D}.

Definition 5. Let Π be a collection of paths, and, for a measurable function ω on 𝒳, let W = ω(X) describe a reduced set of the covariates X. Path-specific fairness, also called Π-fairness, holds when, for any a' ∈ 𝒜,

E[D_{Π,A,a'} | W] = E[D | W].  (5)

In the definition above, rather than a particular counterfactual level a, the baseline level of the path-specific effect is A, i.e., an individual's actual (non-counterfactually altered) group membership (e.g., their actual race). We have implicitly assumed that the decision variable D is a descendant of the covariates X. In particular, without loss of generality, we assume D is defined by the structural equation f_D(x, u_D) = 1{u_D ≤ d(x)}, where the exogenous variable u_D ∼ UNIF(0, 1), so that Pr(D = 1 | X = x) = d(x).
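The following sketch simulates the two-column construction above for hypothetical structural equations. The specific functional forms (how E, M, T, and D depend on their parents) are illustrative assumptions, not the structural equations used in the paper's experiments (those are given in Appendix C); the point is only the propagation rule: parent slots reached along the path A → E → T → D receive the counterfactual (primed) values, while all other parent slots receive the factual (starred) values.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 500_000

def f_E(a, u):    return 0.3 + 0.4 * (a == 1) + 0.2 * u                  # opportunities
def f_M(e, u):    return np.clip(e + 0.2 * u, 0, 1)                      # preparation
def f_T(e, m, u): return np.clip(0.5 * e + 0.4 * m + 0.1 * u, 0, 1)      # test score
def d(a, t):      return np.clip(1.2 * t - 0.3, 0, 1)                    # rule d(x), x = (a, t)

# Shared exogenous noise, so the two columns differ only through A.
u_E, u_M, u_T, u_D = (rng.uniform(size=n) for _ in range(4))

a, a_prime = 0, 1                        # factual and counterfactual race values

# Factual (starred) column.
E_s = f_E(a, u_E)
M_s = f_M(E_s, u_M)
T_s = f_T(E_s, M_s, u_T)
D_s = (u_D <= d(a, T_s)).astype(int)

# Path-specific (primed) column along A -> E -> T -> D only.
E_p = f_E(a_prime, u_E)                  # edge A -> E is on the path: use a'
M_p = f_M(E_s, u_M)                      # edge E -> M is off the path: use E*
T_p = f_T(E_p, M_s, u_T)                 # E -> T on the path (use E'), M -> T off it (use M*)
D_p = (u_D <= d(a, T_p)).astype(int)     # T -> D on the path (use T'), A -> D off it (use a)

print("E[D | A = a]      ≈", D_s.mean())
print("E[D_{Π,a,a'}]     ≈", D_p.mean())
```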
If Π is the set of all paths from A to D, then D_{Π,A,a'} = D(a'), in which case, for W = X, path-specific fairness is the same as counterfactual fairness.

3. Constructing Causally Fair Policies

The definitions of causal fairness above constrain the set of decision policies one might adopt, but, in general, they do not yield a unique policy. For instance, a policy in which applicants are admitted randomly and independently with probability b (where b is the specified budget) satisfies counterfactual equalized odds (Def. 2), conditional principal fairness (Def. 3), counterfactual fairness (Def. 4), and path-specific fairness (Def. 5).[6] However, such a randomized policy may be sub-optimal in the eyes of decision-makers aiming to maximize outcomes such as class diversity or degree attainment. Past work has described multiple approaches to selecting a single policy from among those satisfying any given fairness definition, including maximizing concordance of the decision with the outcome variable (Chiappa, 2019; Nabi & Shpitser, 2018) or with an existing policy (Wang et al., 2019) (e.g., in terms of binary accuracy or KL-divergence). Here, as we are primarily interested in the downstream consequences of various causal fairness definitions, we consider causally fair policies that maximize utility (Cai et al., 2020; Chohlas-Wood et al., 2021a; Corbett-Davies et al., 2017; Kasy & Abebe, 2021; Liu et al., 2018).

Suppose u(x) denotes the utility of assigning a positive decision to individuals with observed covariate values x, relative to assigning them negative decisions. In our running example, we set

u(x) = E[Y(1) | X = x] + λ · 1{α(x) = a1},  (6)

where E[Y(1) | X = x] denotes the likelihood the applicant would graduate if admitted, 1{α(x) = a1} indicates whether the applicant identifies as belonging to race group a1 (e.g., a1 may denote a group historically underrepresented in higher education), and λ ≥ 0 is an arbitrary constant that balances preferences for both student graduation and racial diversity. We seek decision policies that maximize expected utility, subject to satisfying a given definition of causal fairness, as well as the budget constraint. Specifically, letting C denote the family of all decision policies that satisfy one of the causal fairness definitions listed above, a utility-maximizing policy d* is given by

d* ∈ argmax_{d ∈ C} E[d(X) · u(X)]  s.t.  E[d(X)] ≤ b.  (7)

Constructing optimal policies poses both statistical and computational challenges. One must, in general, estimate the joint distribution of covariates and potential outcomes and, even more dauntingly, causal effects along designated paths for path-specific definitions of fairness. In some settings, it may be possible to obtain these estimates from observational analyses of historical data or randomized controlled trials, though both approaches typically involve substantial hurdles in practice. We prove that if one has this statistical information, it is possible to efficiently compute causally fair utility-maximizing policies by solving either a single linear program or a series of linear programs (Appendix, Theorem B.1). In the case of counterfactual equalized odds, conditional principal fairness, counterfactual fairness, and path-specific fairness, we show that the definitions can be translated to linear constraints. For counterfactual predictive parity, the defining independence condition yields a quadratic constraint, which we show can be expressed as a linear constraint by further conditioning on one of the decision variables, and the optimization problem in turn can be solved through a series of linear programs.

Footnote 6: A policy satisfying counterfactual predictive parity (Def. 1) is not guaranteed to exist. For example, if b = 0 (in which case D = 0 a.s.) and E[Y(1) | A = a0] ≠ E[Y(1) | A = a1], then Eq. (1) cannot hold. Similar counterexamples can be constructed for b ≈ 1.
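As a rough sketch of how such a linear program might look for counterfactual equalized odds (Definition 2) on a finite covariate space, consider the code below. It uses scipy.optimize.linprog; the toy joint distribution of (X, Y(1)) is a made-up placeholder, and the constraint encoding E[d(X) | A = a, Y(1) = y] = E[d(X) | Y(1) = y] follows the construction in Theorem B.1 of the Appendix rather than the paper's released implementation.

```python
import numpy as np
from scipy.optimize import linprog

# Toy finite covariate space: x = (race, test-score bucket); placeholder distribution.
X = [("a0", t) for t in range(4)] + [("a1", t) for t in range(4)]
p_x = np.array([0.25, 0.20, 0.15, 0.10, 0.12, 0.08, 0.06, 0.04])       # Pr(X = x)
p_y1_given_x = np.array([0.3, 0.5, 0.7, 0.9, 0.25, 0.45, 0.65, 0.85])  # Pr(Y(1)=1 | X=x)

lam, b = 0.25, 0.3
u = p_y1_given_x + lam * np.array([x[0] == "a1" for x in X])           # utility as in Eq. (6)

# Joint Pr(X = x, Y(1) = y) for y in {0, 1}.
joint = {1: p_x * p_y1_given_x, 0: p_x * (1 - p_y1_given_x)}

# Counterfactual equalized odds: for each a and y,
#   sum_i d_i [ Pr(X=x_i | A=a, Y(1)=y) - Pr(X=x_i | Y(1)=y) ] = 0.
A_eq, b_eq = [], []
for y in (0, 1):
    p_y = joint[y] / joint[y].sum()                                    # Pr(X=x | Y(1)=y)
    for a in ("a0", "a1"):
        mask = np.array([x[0] == a for x in X])
        p_ay = joint[y] * mask / (joint[y] * mask).sum()               # Pr(X=x | A=a, Y(1)=y)
        A_eq.append(p_ay - p_y)
        b_eq.append(0.0)

res = linprog(
    c=-(u * p_x),                        # maximize E[d(X) u(X)]
    A_ub=[p_x], b_ub=[b],                # budget E[d(X)] <= b
    A_eq=np.array(A_eq), b_eq=b_eq,
    bounds=[(0, 1)] * len(X),
)
print("constrained optimal d(x):", np.round(res.x, 3))
```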
4. The Structure of Causally Fair Policies

Above, for each definition of causal fairness, we sketched how to construct utility-maximizing policies that satisfy the corresponding constraints. Now we explore the structural properties of causally fair policies. We show, both empirically and analytically, that under relatively mild distributional assumptions, policies constrained to be causally fair are disfavored by every individual in a natural class of decision makers with varying preferences for diversity. To formalize these results, we start by introducing some notation and then defining the concept of (strong) Pareto dominance.

4.1. Pareto Dominance and Consistent Utilities

For a real-valued utility function u and decision policy d, we write u(d) = E[d(X) · u(X)] to denote the utility of d under u.

Definition 6. For a budget b, we say a decision policy d is feasible if E[d(X)] ≤ b.

Given a collection of utility functions encoding the preferences of different individuals, we say a decision policy d is Pareto dominated if there exists a feasible alternative d' such that none of the decision makers prefers d over d', and at least one decision maker strictly prefers d' over d, a property formalized in Definition 7.

Definition 7. Suppose U is a collection of utility functions. A decision policy d is Pareto dominated if there exists a feasible alternative d' such that u(d') ≥ u(d) for all u ∈ U, and there exists u' ∈ U such that u'(d') > u'(d). A policy d is strongly Pareto dominated if there exists a feasible alternative d' such that u(d') > u(d) for all u ∈ U. A policy d is Pareto efficient if it is feasible and not Pareto dominated, and the Pareto frontier is the set of Pareto efficient policies.

To develop intuition about the structure of causally fair decision policies, we continue working through our illustrative example of college admissions. We consider a collection of decision makers with utilities U of the form in Eq. (6), for λ ≥ 0. In this example, decision makers differ in their preferences for diversity (as determined by λ), but otherwise have similar preferences. We call such a collection of utilities consistent modulo α.

Definition 8. We say that a set of utilities U is consistent modulo α if, for any u, u' ∈ U:

1. For any x, sign(u(x)) = sign(u'(x));

2. For any x1 and x2 such that α(x1) = α(x2), u(x1) > u(x2) if and only if u'(x1) > u'(x2).

For consistent utilities, the Pareto frontier takes a particularly simple form, represented by (a subset of) group-specific threshold policies.

Proposition 1. Suppose U is a set of utilities that is consistent modulo α. Then any Pareto efficient decision policy d is a multiple threshold policy. That is, for any u ∈ U, there exist group-specific constants t_a ≥ 0 such that, a.s.,

d(x) = 1 if u(x) > t_{α(x)},   d(x) = 0 if u(x) < t_{α(x)}.  (8)

The proof of Proposition 1 is in the Appendix.[7]

Footnote 7: In the statement of the proposition, we do not specify what happens at the thresholds u(x) = t_{α(x)} themselves, as one can typically ignore the exact manner in which decisions are made at the threshold. Specifically, given a threshold policy d, we can construct a standardized threshold policy d' that is constant within group at the threshold (i.e., d'(x) = c_{α(x)} when u(x) = t_{α(x)}), and for which: (1) E[d'(X) | A] = E[d(X) | A]; and (2) u(d') = u(d). In our running example, this means we can standardize threshold policies so that applicants at the threshold are admitted with the same group-specific probability.
4.2. An Empirical Example

With these preliminaries in place, we now empirically explore the structure of causally fair decision policies in the context of our stylized example of college admissions, given by the causal DAG in Figure 1. In the hypothetical pool of 100,000 applicants we consider, applicants in the target race group a1 have, on average, fewer educational opportunities than those applicants in group a0, which leads to lower average academic preparedness, as well as lower average test scores. See Section C in the Appendix for additional details, including the specific structural equations we use.

[Figure 2. Utility-maximizing policies for various definitions of causal fairness in an illustrative example of college admissions, with the Pareto frontier depicted by the solid purple curve. The horizontal axis shows admitted applicants from the target group; the vertical axis shows college degree attainment. Points are shown for the max-graduation policy, the max-utility policy (for λ = 1/4), counterfactual fairness & path-specific fairness (random), principal fairness, counterfactual predictive parity, and counterfactual equalized odds. For path-specific fairness, we set Π equal to the single path A → E → T → D, and set W = X. For each causal fairness definition, the depicted constrained policies are strongly Pareto dominated, meaning there is an alternative feasible policy that simultaneously achieves greater student-body diversity and higher college degree attainment. Our analytical results show, more generally, that under mild distributional assumptions, every policy constrained to satisfy these causal fairness definitions is strongly Pareto dominated.]

For the utility function in Eq. (6) with λ = 1/4, we apply Theorem B.1 to compute utility-maximizing policies for each of the above causal definitions of fairness. We plot the results in Figure 2, where, for each policy, the horizontal axis shows the expected number of admitted applicants from the target race group, and the vertical axis shows the expected number of college graduates. Additionally, for the family of utilities U given by Eq. (6) for λ ≥ 0, we depict the Pareto frontier by the solid purple curve, computed via Proposition 1.[8] For reference, the dashed purple line corresponds to max-utility policies constrained to satisfy the level of diversity indicated on the x-axis, though these policies are not on the Pareto frontier, as they result in fewer college graduates and lower diversity than the policy that maximizes graduation alone (indicated by the max-graduation point in Figure 2).

Footnote 8: For all the cases we consider, the optimal policies admit the maximum proportion of students allowed under the budget b (i.e., Pr(D = 1) = b). To compute the Pareto frontier in Figure 2, it is sufficient, by Proposition 1 and Footnote 7, to sweep over (standardized) group-specific threshold policies relative to the utility u0(x) = E[Y(1) | X = x].
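The sketch below illustrates this kind of frontier computation on synthetic data: following Proposition 1, it sweeps over group-specific threshold policies on u0(x) = E[Y(1) | X = x] subject to the budget, recording expected graduates and expected admits from the target group for each pair of thresholds. The data-generating process is a placeholder, not the simulation specified in Appendix C.

```python
import numpy as np

rng = np.random.default_rng(3)
n, b = 50_000, 0.25

race = rng.choice(["a0", "a1"], size=n, p=[0.7, 0.3])
u0 = np.where(race == "a0", rng.beta(5, 2, n), rng.beta(3, 4, n))  # E[Y(1) | X]

def admit(thresh_a0, thresh_a1):
    """Group-specific threshold policy (Proposition 1), ignoring ties."""
    t = np.where(race == "a0", thresh_a0, thresh_a1)
    return u0 > t

candidates = []
for t0 in np.linspace(0, 1, 41):
    for t1 in np.linspace(0, 1, 41):
        D = admit(t0, t1)
        if D.mean() <= b:                               # budget constraint
            candidates.append((D[race == "a1"].sum(),   # admits from target group
                               u0[D].sum()))            # expected graduates

# Keep only non-dominated points: maximal graduates at each level of diversity.
candidates = np.array(sorted(candidates))
best = np.maximum.accumulate(candidates[::-1, 1])[::-1]
pareto = candidates[candidates[:, 1] >= best]
print(pareto[:5])
```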
For each fairness definition, the depicted policies are strongly Pareto dominated, meaning that there is an alternative feasible policy favored by all decision makers with preferences in U. In particular, for each definition of causal fairness, there is an alternative feasible policy in which one simultaneously achieves more student-body diversity and more college graduates. In some instances, the efficiency gap is quite stark. Utility-maximizing policies constrained to satisfy either counterfactual fairness or path-specific fairness require one to admit each applicant independently with fixed probability b (where b is the budget), regardless of academic preparedness or group membership.[9] These results show that constraining decision-making algorithms to satisfy popular definitions of causal fairness can have unintended consequences, and may even harm the very groups they were ostensibly designed to protect.

Footnote 9: For path-specific fairness, we set Π equal to the single path A → E → T → D, and set W = X in this example.

4.3. The Statistical Structure of Causally Fair Policies

The patterns illustrated in Figure 2 and discussed above are not idiosyncrasies of our particular example, but rather hold quite generally. Indeed, Theorem 1 shows that for almost every joint distribution of X, Y(0), and Y(1) such that u(X) has a density, any decision policy satisfying counterfactual equalized odds or conditional principal fairness is Pareto dominated. Similarly, for almost every joint distribution of X and X_{Π,A,a'}, we show that policies satisfying path-specific fairness (including counterfactual fairness) are Pareto dominated. (NB: The analogous statement for counterfactual predictive parity is not true, which we address in Proposition 2.)

The notion of "almost every" distribution that we use here was formalized by Christensen (1972), Hunt et al. (1992), Anderson & Zame (2001), and others (cf. Ott & Yorke, 2005, for a review). Suppose, for a moment, that combinations of covariates and outcomes take values in a finite set of size m. Then the space of joint distributions on covariates and outcomes can be represented by the unit (m−1)-simplex:

Δ^{m−1} = {p ∈ R^m | p_i ≥ 0 and Σ_{i=1}^{m} p_i = 1}.

Since Δ^{m−1} is a subset of an (m−1)-dimensional hyperplane in R^m, it inherits the usual Lebesgue measure on R^{m−1}. In this finite-dimensional setting, "almost every" distribution means a subset of distributions that has full Lebesgue measure on the simplex. Given a property that holds for almost every distribution in this sense, that property holds almost surely under any probability distribution on the space of distributions that is described by a density on the simplex. We use a generalization of this basic idea that extends to infinite-dimensional spaces, allowing us to consider distributions with arbitrary support. (See the Appendix for further details.)

To prove this result, we make relatively mild restrictions on the set of distributions and utilities we consider to exclude degenerate cases, as formalized by Definition 9 below.

Definition 9. Let G be a collection of functions from 𝒵 to R^d for some set 𝒵. We say that a distribution of Z on 𝒵 is G-fine if g(Z) has a density for all g ∈ G.

In particular, U-fineness ensures that the distribution of u(X) has a density.
In the absence of U-fineness, corner cases can arise in which an especially large number of policies may be Pareto efficient, in particular when u(X) has large atoms and X can be used to predict the potential outcomes Y(0) and Y(1) even after conditioning on u(X). See Prop. E.7 for details. Our example of college admissions, where U is defined by Eq. (6), is U-fine.

Theorem 1. Suppose U is a set of utilities consistent modulo α. Further suppose that for all a ∈ 𝒜 there exists a U-fine distribution of X and a utility u ∈ U such that Pr(u(X) > 0, A = a) > 0, where A = α(X). Then:

• For almost every U-fine distribution of X and Y(1), any decision policy satisfying counterfactual equalized odds is strongly Pareto dominated.

• If |IMG(ω)| < ∞ and there exists a U-fine distribution of X such that Pr(A = a, W = w) > 0 for all a ∈ 𝒜 and w ∈ IMG(ω), where W = ω(X), then, for almost every U-fine joint distribution of X, Y(0), and Y(1), any decision policy satisfying conditional principal fairness is strongly Pareto dominated.

• If |IMG(ω)| < ∞ and there exists a U-fine distribution of X such that Pr(A = a, W = w_i) > 0 for all a ∈ 𝒜 and some distinct w0, w1 ∈ IMG(ω), then, for almost every U_𝒜-fine joint distribution of A and the counterfactuals X_{Π,A,a'}, any decision policy satisfying path-specific fairness is strongly Pareto dominated.[10]

Footnote 10: Here, u_𝒜 : (x_a)_{a ∈ 𝒜} ↦ (u(x_a))_{a ∈ 𝒜}, and U_𝒜 is the set of u_𝒜 for u ∈ U. In other words, the requirement is that the joint distribution of the u(X_{Π,A,a'}) has a density.

The proof of Theorem 1 is given in the Appendix. At a high level, the proof proceeds in three steps, which we outline below using the example of counterfactual equalized odds. First, we show that for almost every fixed U-fine joint distribution µ of X and Y(1) there is at most one policy d*(x) satisfying counterfactual equalized odds that is not strongly Pareto dominated. To see why, note that for any specific y0, since counterfactual equalized odds requires that D ⊥ A | Y(1) = y0, setting the threshold for one group determines the thresholds for all the others; the budget constraint then can be used to fix the threshold for the original group. Second, we construct a slice around µ such that for any distribution ν in the slice, d*(x) is still the only policy that can potentially lie on the Pareto frontier while satisfying counterfactual equalized odds. We create the slice by strategically perturbing µ only where Y(1) = y1, for some y1 ≠ y0. This perturbation moves mass from one side of the thresholds of d*(x) to the other, consequently breaking the balance requirement D ⊥ A | Y(1) = y1 for almost every ν in the slice. This phenomenon is similar to the problem of infra-marginality (Ayres, 2002; Simoiu et al., 2017), which likewise afflicts non-causal notions of fairness (Corbett-Davies & Goel, 2018; Corbett-Davies et al., 2017). Finally, we appeal to the notion of prevalence to stitch the slices together, showing that for almost every distribution, any policy satisfying counterfactual equalized odds is strongly Pareto dominated. Analogous versions of this general argument apply to the cases of conditional principal fairness and path-specific fairness.[11]

Footnote 11: This argument does not depend in an essential way on the definitions being causal. In Corollary E.5, we show an analogous result for the non-counterfactual version of equalized odds.

In some common settings, path-specific fairness with W = X constrains decisions so severely that the only allowable policies are constant (i.e., d(x1) = d(x2) for all x1, x2 ∈ 𝒳).
For instance, in our running example, path-specific fairness requires admitting all applicants with the same probability, irrespective of academic preparation or group membership. Thus, all applicants are admitted with probability b, where b is the budget, under the optimal policy constrained to satisfy path-specific fairness.

To build intuition for this result, we sketch the argument for a finite covariate space 𝒳. Given a policy d that satisfies path-specific fairness, select x* ∈ argmax_{x ∈ 𝒳} d(x). By the definition of path-specific fairness, for any a' ∈ 𝒜,

d(x*) = E[D_{Π,A,a'} | X = x*] = Σ_{x ∈ α^{-1}(a')} d(x) · Pr(X_{Π,A,a'} = x | X = x*).  (9)

Now suppose there exists an a' ∈ 𝒜 such that Pr(X_{Π,A,a'} = x | X = x*) > 0 for all x ∈ α^{-1}(a'). In this case, because d(x) ≤ d(x*) for all x ∈ 𝒳, Eq. (9) shows that in fact d(x) = d(x*) for all x ∈ α^{-1}(a'). Now, let x' be arbitrary. Again, by the definition of path-specific fairness, we have that

d(x') = E[D_{Π,A,a'} | X = x']
      = Σ_{x ∈ α^{-1}(a')} d(x) · Pr(X_{Π,A,a'} = x | X = x')
      = Σ_{x ∈ α^{-1}(a')} d(x*) · Pr(X_{Π,A,a'} = x | X = x')
      = d(x*),

where we use in the third equality the fact that d(x) = d(x*) for all x ∈ α^{-1}(a'), and in the final equality the fact that X_{Π,A,a'} is supported on α^{-1}(a').

Theorem 2 formalizes and extends this argument to more general settings, where Pr(X_{Π,A,a'} = x | X = x*) is not necessarily positive for all x ∈ α^{-1}(a'). The proof of Theorem 2 is in the Appendix, along with extensions to continuous covariate spaces and a more complete characterization of Π-fair policies for finite 𝒳.

Theorem 2. Suppose 𝒳 is finite and Pr(X = x) > 0 for all x ∈ 𝒳. Suppose Z = ζ(X) is a random variable such that:

1. Z = Z_{Π,A,a'} for all a' ∈ 𝒜;

2. Pr(X_{Π,A,a'} = x' | X = x) > 0 for all a' ∈ 𝒜 such that α(x) ≠ a', and all x, x' ∈ 𝒳 such that α(x') = a' and ζ(x) = ζ(x').

Then, for any Π-fair policy d with W = X, there exists a function f such that d(X) = f(Z), i.e., d is constant across individuals having the same value of Z.

The first condition of Theorem 2 holds for any reduced set of covariates Z that is not causally affected by changes in A (e.g., Z is not a descendant of A). The second condition requires that among individuals with covariates x, a positive fraction have covariates x' in a counterfactual world in which they belonged to another group a'. Because ζ(x) is the same in the real and counterfactual worlds (since Z is unaffected by A, by the first condition), we only consider x' such that ζ(x') = ζ(x) in the second condition.

In our running example, the only non-race covariate is test score, which is downstream of race. Further, among students with a given test score, a positive fraction achieve any other test score in the counterfactual world in which their race is altered. As such, the empty set of reduced covariates (formally encoded by setting ζ to a constant function) satisfies the conditions of Theorem 2. The theorem then implies that under any Π-fair policy, every applicant is admitted with equal probability.

Even when decisions are not perfectly uniform lotteries, as in our admissions example, Theorem 2 suggests that enforcing Π-fairness can lead to unexpected outcomes. For instance, suppose we modify our admissions example to additionally include age (a covariate that is causally unconnected to race), as some past work has done. In that case, Π-fair policies would admit students based on their age alone, irrespective of test score or race.
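To see this collapse numerically, the sketch below encodes the Π-fairness constraints of Eq. (9) for a tiny, made-up finite covariate space (two race groups, two test-score levels) with full-support counterfactual transition probabilities, and computes the space of policies satisfying them; the only solutions are constant policies. The transition matrices are illustrative assumptions, not estimates from the paper's simulation.

```python
import numpy as np
from scipy.linalg import null_space

# Covariate space: (race, test score), indexed 0..3.
X = [("a0", "low"), ("a0", "high"), ("a1", "low"), ("a1", "high")]

# Pr(X_{Π,A,a'} = x' | X = x): rows are x, columns x'; each row of P[a'] is
# supported on group a' and has full support there (condition 2 of Theorem 2).
P = {
    0: np.array([[1.0, 0.0, 0.0, 0.0],   # already in a0: counterfactual is X itself
                 [0.0, 1.0, 0.0, 0.0],
                 [0.7, 0.3, 0.0, 0.0],
                 [0.4, 0.6, 0.0, 0.0]]),
    1: np.array([[0.0, 0.0, 0.6, 0.4],
                 [0.0, 0.0, 0.2, 0.8],
                 [0.0, 0.0, 1.0, 0.0],   # already in a1: counterfactual is X itself
                 [0.0, 0.0, 0.0, 1.0]]),
}

# Π-fairness with W = X requires d(x) = sum_x' P[a'][x, x'] d(x') for every a'.
constraints = np.vstack([P[a] - np.eye(4) for a in (0, 1)])
solutions = null_space(constraints)

print(solutions.round(3))  # one-dimensional: spanned by the constant policy
```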
Although in some cases such restrictive policies might be desirable, this strong structural constraint implied by Π-fairness appears to be a largely unintended consequence of the mathematical formalism.

The conditions of Theorem 2 are relatively mild, but do not hold in every setting. Suppose that in our admissions example it were the case that T_{Π,A,a0} = T_{Π,A,a1} + c for some constant c; that is, suppose the effect of intervening on race is a constant change to an applicant's test score. Then the second condition of Theorem 2 would no longer hold for a constant ζ. Indeed, any multiple-threshold policy in which t_{a0} = t_{a1} + c would be Π-fair. In practice, though, such deterministic counterfactuals would seem to be the exception rather than the rule. For example, it seems reasonable to expect that test scores would depend on race in complex ways that induce considerable heterogeneity.

Lastly, we note that W ≠ X in some variants of path-specific fairness (e.g., Nabi & Shpitser, 2018; Zhang & Bareinboim, 2018), in which case Theorem 2 does not apply. In that case, though, policies are typically still Pareto dominated, in accordance with Theorem 1.

We conclude our analysis by investigating counterfactual predictive parity, the least demanding of the causal notions of fairness we have considered, requiring only that Y(1) ⊥ A | D = 0. As such, it is in general possible to have a policy on the Pareto frontier that satisfies this condition. However, in Proposition 2, we show that this cannot happen in some common cases, including our example of college admissions. In that setting, when the target group has lower average graduation rates (a pattern that often motivates efforts to actively increase diversity), decision policies constrained to satisfy counterfactual predictive parity are Pareto dominated. The proof of the proposition is in the Appendix.

Proposition 2. Suppose 𝒜 = {a0, a1}, and consider the family U of utility functions of the form u(x) = r(x) + λ · 1{α(x) = a1}, indexed by λ ≥ 0, where r(x) = E[Y(1) | X = x]. Suppose the conditional distributions of r(X) given A are beta distributed, i.e., r(X) | A = a ∼ BETA(µ_a, v), with v > 2 and µ_{a0} > µ_{a1} > 1/v.[12] Then any policy satisfying counterfactual predictive parity is strongly Pareto dominated.

Footnote 12: Here we parameterize the beta distribution in terms of its mean µ and sample size v. In terms of the common, alternative α-β parameterization, µ = α/(α + β) and v = α + β.

In Proposition 2, we restrict to a family of beta distributions where E[r(X)] is lower in the target group compared to the non-target group, a condition which, as discussed above, often holds in settings where one seeks to prioritize diversity.

5. Discussion

We have worked to collect, synthesize, and investigate several causal conceptions of fairness that recently have appeared in the machine learning literature. These definitions formalize intuitively desirable properties; for example, minimizing the direct and indirect effects of race on decisions. But, as we have shown both analytically and with a synthetic example, they can, perhaps surprisingly, lead to policies with unintended downstream outcomes. In contrast to prior impossibility results (Chouldechova, 2017; Kleinberg et al., 2017), in which different formal notions of fairness are shown to be in conflict with each other, we demonstrate trade-offs between formal notions of fairness and resulting social welfare.
For instance, in our running example of college admissions, enforcing various causal fairness definitions can lead to a student body that is both less academically prepared and less diverse than what one could achieve under natural alternative policies, potentially harming the very groups these definitions were ostensibly designed to protect. Our results thus highlight a gap between the goals and potential consequences of popular causal approaches to fairness.

What, then, is the role of causal reasoning in designing equitable algorithms? Under a consequentialist perspective to algorithm design (Cai et al., 2020; Chohlas-Wood et al., 2021a; Liang et al., 2021), one aims to construct policies with the most desirable expected outcomes, a task that inherently demands causal reasoning. Formally, this approach corresponds to solving the unconstrained optimization problem in Eq. (7), where preferences for diversity may be directly encoded in the utility function itself, rather than by constraining the class of policies, mitigating potentially problematic consequences. While conceptually appealing, this consequentialist approach still faces considerable practical challenges, including estimating the expected effects of decisions, and eliciting preferences over outcomes. Our analysis illustrates some of the limitations of mathematical formalizations of fairness, reinforcing the need to explicitly consider the consequences of actions, particularly when decisions are automated and carried out at scale. Looking forward, we hope our work clarifies the ways in which causal reasoning can aid the equitable design of algorithms.

Acknowledgements

We thank Guillaume Basse, Jennifer Hill, and Ravi Sojitra for helpful conversations. H.N. was supported by a Stanford Knight-Hennessy Scholarship and the National Science Foundation Graduate Research Fellowship under Grant No. DGE-1656518. J.G. was supported by a Stanford Knight-Hennessy Scholarship. R.S. was supported by the NSF Program on Fairness in AI in Collaboration with Amazon under the award "FAI: End-to-End Fairness for Algorithm-in-the-Loop Decision Making in the Public Sector," no. IIS-2040898. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation or Amazon. Code to reproduce our results is available at https://github.com/stanford-policylab/causal-fairness.

References

Anderson, R. M. and Zame, W. R. Genericity with infinitely many parameters. Advances in Theoretical Economics, 1(1):1–62, 2001.
Ayres, I. Outcome tests of racial disparities in police practices. Justice Research and Policy, 4(1-2):131–142, 2002.
Benji. The sum of an uncountable number of positive numbers. Mathematics Stack Exchange, 2020. URL https://math.stackexchange.com/q/20661. (Version: 2020-05-29).
Berk, R., Heidari, H., Jabbari, S., Kearns, M., and Roth, A. Fairness in criminal justice risk assessments: The state of the art. Sociological Methods & Research, 50(1):3–44, 2021.
Bertrand, M. and Mullainathan, S. Are Emily and Greg more employable than Lakisha and Jamal? A field experiment on labor market discrimination. American Economic Review, 94(4):991–1013, 2004.
Billingsley, P. Probability and Measure. Wiley Series in Probability and Mathematical Statistics. John Wiley & Sons, Inc., New York, third edition, 1995. ISBN 0-471-00710-2. A Wiley-Interscience Publication.
Brozius, H.
Conditional expectation – E[f(X, Y) | Y]. Mathematics Stack Exchange, 2019. URL https://math.stackexchange.com/q/3247577. (Version: 2019-06-01).
Cai, W., Gaebler, J., Garg, N., and Goel, S. Fair allocation through selective information acquisition. In Proceedings of the AAAI/ACM Conference on AI, Ethics, and Society, pp. 22–28, 2020.
Carey, A. N. and Wu, X. The causal fairness field guide: Perspectives from social and formal sciences. Frontiers in Big Data, 5, 2022.
Chiappa, S. Path-specific counterfactual fairness. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pp. 7801–7808, 2019.
Chohlas-Wood, A., Coots, M., Brunskill, E., and Goel, S. Learning to be fair: A consequentialist approach to equitable decision-making. arXiv preprint arXiv:2109.08792, 2021a.
Chohlas-Wood, A., Nudell, J., Yao, K., Lin, Z., Nyarko, J., and Goel, S. Blind justice: Algorithmically masking race in charging decisions. In Proceedings of the 2021 AAAI/ACM Conference on AI, Ethics, and Society, pp. 35–45, 2021b.
Chouldechova, A. Fair prediction with disparate impact: A study of bias in recidivism prediction instruments. Big Data, 5(2):153–163, 2017.
Chouldechova, A. and Roth, A. A snapshot of the frontiers of fairness in machine learning. Communications of the ACM, 63(5):82–89, 2020.
Christensen, J. P. R. On sets of Haar measure zero in Abelian Polish groups. Israel Journal of Mathematics, 13(3-4):255–260, 1972.
Cleary, T. A. Test bias: Prediction of grades of Negro and white students in integrated colleges. Journal of Educational Measurement, 5(2):115–124, 1968.
Corbett-Davies, S. and Goel, S. The measure and mismeasure of fairness: A critical review of fair machine learning. arXiv preprint arXiv:1808.00023, 2018.
Corbett-Davies, S., Pierson, E., Feller, A., Goel, S., and Huq, A. Algorithmic decision making and the cost of fairness. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 797–806, 2017.
Coston, A., Mishler, A., Kennedy, E. H., and Chouldechova, A. Counterfactual risk assessments, evaluation, and fairness. In Proceedings of the 2020 Conference on Fairness, Accountability, and Transparency, pp. 582–593, 2020.
Darlington, R. B. Another look at "cultural fairness". Journal of Educational Measurement, 8(2):71–82, 1971.
Dwork, C., Hardt, M., Pitassi, T., Reingold, O., and Zemel, R. Fairness through awareness. In Proceedings of the 3rd Innovations in Theoretical Computer Science Conference, pp. 214–226, 2012.
Gaebler, J., Cai, W., Basse, G., Shroff, R., Goel, S., and Hill, J. A causal framework for observational studies of discrimination. Statistics and Public Policy, 2022.
Galhotra, S., Shanmugam, K., Sattigeri, P., and Varshney, K. R. Causal feature selection for algorithmic fairness. Proceedings of the 2022 International Conference on Management of Data (SIGMOD), 2022.
Goel, S., Perelman, M., Shroff, R., and Sklansky, D. A. Combatting police discrimination in the age of big data. New Criminal Law Review: An International and Interdisciplinary Journal, 20(2):181–232, 2017.
Goldin, C. and Rouse, C. Orchestrating impartiality: The impact of blind auditions on female musicians. American Economic Review, 90(4):715–741, 2000.
Greiner, D. J. and Rubin, D. B. Causal effects of perceived immutable characteristics. Review of Economics and Statistics, 93(3):775–785, 2011.
Grogger, J. and Ridgeway, G. Testing for racial profiling in traffic stops from behind a veil of darkness.
Journal of the American Statistical Association, 101(475):878–887, 2006.
Hardt, M., Price, E., and Srebro, N. Equality of opportunity in supervised learning. Advances in Neural Information Processing Systems, 29:3315–3323, 2016.
Holland, P. W. Statistics and causal inference. Journal of the American Statistical Association, 81(396):945–960, 1986.
Hu, L. and Kohler-Hausmann, I. What's sex got to do with machine learning? In Proceedings of the 2020 ACM Conference on Fairness, Accountability, and Transparency, 2020.
Hunt, B. R., Sauer, T., and Yorke, J. A. Prevalence: a translation-invariant "almost every" on infinite-dimensional spaces. Bulletin of the American Mathematical Society, 27(2):217–238, 1992.
Imai, K. and Jiang, Z. Principal fairness for human and algorithmic decision-making. arXiv preprint arXiv:2005.10400, 2020.
Imai, K., Jiang, Z., Greiner, J., Halen, R., and Shin, S. Experimental evaluation of algorithm-assisted human decision-making: Application to pretrial public safety assessment. arXiv preprint arXiv:2012.02845, 2020.
Imbens, G. W. and Rubin, D. B. Causal Inference in Statistics, Social, and Biomedical Sciences. Cambridge University Press, 2015.
Kasy, M. and Abebe, R. Fairness, equality, and power in algorithmic decision-making. In Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency, pp. 576–586, 2021.
Kemeny, J. G. and Snell, J. L. Finite Markov Chains. Undergraduate Texts in Mathematics. Springer-Verlag, New York-Heidelberg, 1976. Reprinting of the 1960 original.
Kilbertus, N., Rojas-Carulla, M., Parascandolo, G., Hardt, M., Janzing, D., and Schölkopf, B. Avoiding discrimination through causal reasoning. In Proceedings of the 31st International Conference on Neural Information Processing Systems, pp. 656–666, 2017.
Kleinberg, J., Mullainathan, S., and Raghavan, M. Inherent trade-offs in the fair determination of risk scores. In 8th Innovations in Theoretical Computer Science Conference (ITCS). Schloss Dagstuhl-Leibniz-Zentrum fuer Informatik, 2017.
Kusner, M., Loftus, J., Russell, C., and Silva, R. Counterfactual fairness. In Proceedings of the 31st International Conference on Neural Information Processing Systems, pp. 4069–4079, 2017.
Liang, A., Lu, J., and Mu, X. Algorithmic design: Fairness versus accuracy. arXiv preprint arXiv:2112.09975, 2021.
Liu, L. T., Dean, S., Rolf, E., Simchowitz, M., and Hardt, M. Delayed impact of fair machine learning. In International Conference on Machine Learning, pp. 3150–3158. PMLR, 2018.
Loftus, J. R., Russell, C., Kusner, M. J., and Silva, R. Causal reasoning for algorithmic fairness. arXiv preprint arXiv:1805.05859, 2018.
Mhasawade, V. and Chunara, R. Causal multi-level fairness. In Proceedings of the 2021 AAAI/ACM Conference on AI, Ethics, and Society, pp. 784–794, 2021.
Mishler, A., Kennedy, E. H., and Chouldechova, A. Fairness in risk assessment instruments: Post-processing to achieve counterfactual equalized odds. In Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency, pp. 386–400, 2021.
Nabi, R. and Shpitser, I. Fair inference on outcomes. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 32, 2018.
Ott, W. and Yorke, J. Prevalence. Bulletin of the American Mathematical Society, 42(3):263–290, 2005.
Page, S. E. Making the difference: Applying a logic of diversity. Academy of Management Perspectives, 21(4):6–20, 2007.
Pearl, J. Direct and indirect effects.
In Proceedings of the Seventeenth Conference on Uncertainty in Artificial Intelligence, pp. 411–420. Morgan Kaufmann, 2001.
Pearl, J. Causal inference in statistics: An overview. Statistics Surveys, 3:96–146, 2009a.
Pearl, J. Causality. Cambridge University Press, second edition, 2009b.
Pierson, E., Simoiu, C., Overgoor, J., Corbett-Davies, S., Jenson, D., Shoemaker, A., Ramachandran, V., Barghouty, P., Phillips, C., Shroff, R., and Goel, S. A large-scale analysis of racial disparities in police stops across the United States. Nature Human Behaviour, 4(7):736–745, 2020.
Rao, M. M. Conditional measures and applications, volume 271 of Pure and Applied Mathematics (Boca Raton). Chapman & Hall/CRC, Boca Raton, FL, second edition, 2005.
Rudin, W. Real and Complex Analysis. McGraw-Hill Book Co., New York, third edition, 1987. ISBN 0-07-054234-1.
Rudin, W. Functional Analysis. International Series in Pure and Applied Mathematics. McGraw-Hill, Inc., New York, second edition, 1991. ISBN 0-07-054236-8.
Silva, C. E. Invitation to Ergodic Theory, volume 42 of Student Mathematical Library. American Mathematical Society, Providence, RI, 2008.
Simoiu, C., Corbett-Davies, S., and Goel, S. The problem of infra-marginality in outcome tests for discrimination. The Annals of Applied Statistics, 11(3):1193–1216, 2017.
Steele, R. Space of vector measures equipped with the total variation norm is complete. Mathematics Stack Exchange, 2019. URL https://math.stackexchange.com/q/3197508. (Version: 2019-04-22).
Wang, Y., Sridhar, D., and Blei, D. M. Equal opportunity and affirmative action via counterfactual predictions. arXiv preprint arXiv:1905.10870, 2019.
Williams, T. S. Some issues in the standardized testing of minority students. Journal of Education, pp. 192–208, 1983.
Woodworth, B., Gunasekar, S., Ohannessian, M. I., and Srebro, N. Learning non-discriminatory predictors. In Conference on Learning Theory, pp. 1920–1953. PMLR, 2017.
Wu, Y., Zhang, L., Wu, X., and Tong, H. PC-fairness: A unified framework for measuring causality-based fairness. Advances in Neural Information Processing Systems, 32, 2019.
Zafar, M. B., Valera, I., Gomez Rodriguez, M., and Gummadi, K. P. Fairness beyond disparate treatment & disparate impact: Learning classification without disparate mistreatment. In Proceedings of the 26th International Conference on World Wide Web, pp. 1171–1180, 2017a.
Zafar, M. B., Valera, I., Rodriguez, M. G., Gummadi, K. P., and Weller, A. From parity to preference-based notions of fairness in classification. In Proceedings of the 31st International Conference on Neural Information Processing Systems, pp. 228–238, 2017b.
Zhang, J. and Bareinboim, E. Fairness in decision-making: The causal explanation formula. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
Zhang, L., Wu, Y., and Wu, X. A causal framework for discovering and removing direct and indirect discrimination. In Proceedings of the 26th International Joint Conference on Artificial Intelligence, pp. 3929–3935, 2017.

A. Path-specific Counterfactuals

Constructing policies which satisfy path-specific fairness requires computing path-specific counterfactual values of features. In Algorithm 1, we describe the formal construction of path-specific counterfactuals Z_{Π,a,a'} for an arbitrary variable Z (or collection of variables) in the DAG. To generate a sample Z'_{Π,a,a'} from the distribution of Z_{Π,a,a'}, we first sample values U*_j for the exogenous variables.
Then, in the first loop, we traverse the DAG in topological order, setting A to a and iteratively computing values V j of the other nodes based on the structural equations in the usual fashion. In the second loop, we set A to a , and then iteratively compute values Vj for each node. Vj is computed using the structural equation at that node, with value Vℓ for each of its parents that are connected to it along a path in Π, and the value V ℓfor all its other parents. Finally, we set Z Π,a,a to Z . B. Constructing Causally Fair Policies In order to construct causally fair policies, we prove that the optimization problem in Eq. (7) can be efficiently solved as a single linear program in the case of counterfactual equalized odds, conditional principal fairness, counterfactual fairness, and path-specific fairness or as a series of linear programs in the case of counterfactual predictive parity. Theorem B.1. Consider the optimization problem in Eq. (7). 1. If C is the class of policies that satisfies counterfactual equalized odds or conditional principal fairness, and the distribution of (X, Y (0), Y (1)) is known and supported on a finite set of size n, then a utility-maximizing policy constrained to lie in C can be constructed via a linear program with O(n) variables and constraints. 2. If C is the class of policies that satisfies path-specific fairness (including counterfactual fairness), and the distribution of (X, DΠ,A,a) is known and supported on a finite set of size n, then a utility-maximizing policy constrained to lie in C can be constructed via a linear program with O(n) variables and constraints. 3. Suppose C is the class of policies that satisfies counterfactual predictive parity, that the distribution of (X, Y (1)) is known and supported on a finite set of size n, and that the optimization problem in Eq. (7) has a feasible solution. Further suppose Y (1) is supported on k points, and let k 1 = {p Rk | pi 0 and Pk i=1 pi = 1} be the unit (k 1)- simplex. Then one can construct a set of linear programs L = {L(v)}v k, with each having O(n) variables and constraints, such that the solution to one of Causal Conceptions of Fairness and their Consequences Algorithm 1: Path-specific counterfactuals Data: G (topologically ordered), Π, a, and a Result: A sample Z Π,a,a from ZΠ,a,a 1 Sample values {U j } for the exogenous variables /* Compute counterfactuals by setting A to a */ 2 for j = 1, . . . , m do 3 if Vj = A then 6 (Vj) {V ℓ| Vℓ (Vj)} 7 V j f Vj( (Vj) , U j ) /* Compute counterfactuals by setting A to a and propagating values along paths in Π */ 10 for j = 1, . . . , m do 11 if Vj = A then 14 for Vk (Vj) do 15 if edge (Vk, Vj) lies on a path in Π then 16 V k V k 17 else 18 V k V k 19 end 21 (Vj) {V ℓ| Vℓ (Vj)} 22 V j f Vj( (Vj) , U j ) 25 Z Π,a,a Z the LPs in L is a utility-maximizing policy constrained to lie in C. Proof. Let X = {x1, . . . , xm}; then, we seek decision variables di, i = 1, . . . , m, corresponding to the probability of making a positive decision for individuals with covariate value xi. Therefore, we require that 0 di 1. Letting pi = Pr(X = xi) denote the mass of X at xi, note that the objective function E[d(X) u(X)] equals Pm i=1 di u(xi) pi and the budget constraint Pm i=1 di pi b are both linear in the decision variables. We now show that each of the causal fairness definitions can be enforced via linear constraints. We do so in three parts as listed in theorem. Theorem B.1 Part 1 First, we consider counterfactual equalized odds. 
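Before the constraints are derived formally, the following sketch (not part of the paper) shows the shape of the resulting linear program for this case, using scipy.optimize.linprog on a small, made-up joint distribution of covariates, group membership, and the potential outcome Y(1); the equality constraints encoded below are exactly the balance conditions derived in the argument that follows.

```python
# Minimal sketch (not from the paper): a utility-maximizing policy subject to
# counterfactual equalized odds, solved as a linear program (Theorem B.1, part 1).
# The joint distribution below is hypothetical; in practice it would be estimated.
import numpy as np
from scipy.optimize import linprog

# Four covariate values x_i, each with a group a(x_i), a utility u(x_i),
# and joint probabilities with the potential outcome Y(1) in {0, 1}.
group = np.array([0, 0, 1, 1])
u = np.array([1.0, -0.5, 0.8, -0.2])
p_joint = np.array([[0.10, 0.20],    # p_joint[i, y] = Pr(X = x_i, Y(1) = y)
                    [0.15, 0.05],
                    [0.20, 0.10],
                    [0.15, 0.05]])
p_x = p_joint.sum(axis=1)            # Pr(X = x_i)
budget = 0.5

# Objective: maximize sum_i d_i * u(x_i) * p_i (linprog minimizes, hence the sign).
c = -(u * p_x)

# Budget constraint: sum_i d_i * p_i <= b.
A_ub, b_ub = [p_x], [budget]

# Counterfactual equalized odds: for each group a and outcome y,
#   sum_i d_i * Pr(X = x_i | A = a, Y(1) = y) = sum_i d_i * Pr(X = x_i | Y(1) = y).
A_eq, b_eq = [], []
for y in (0, 1):
    cond_y = p_joint[:, y] / p_joint[:, y].sum()
    for a in (0, 1):
        in_a = group == a
        cond_ay = np.where(in_a, p_joint[:, y], 0.0) / p_joint[in_a, y].sum()
        A_eq.append(cond_ay - cond_y)
        b_eq.append(0.0)

res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=(0, 1))
print(res.x)   # constrained, utility-maximizing decision probabilities d_i
```

Analogous programs, with different equality constraints, handle the other definitions covered by the theorem.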
A decision policy satisfies counterfactual equalized odds when D A | Y (1). Since D is binary, this condition is equivalent to the expression E[d(X) | A = a, Y (1) = y] = E[d(X) | Y (1) = y] for all a A and y Y such that Pr(Y (1) = y) > 0. Expanding this expression and replacing d(xj) by the corresponding decision variable dj, we obtain that i=1 di Pr(X = xi | A = a, Y (1) = y) i=1 di Pr(X = xi | Y (1) = y) for each a A and each of the finitely many values y Y such that Pr(Y (1) = y) > 0. These constraints are linear in the di by inspection. Next, we consider conditional principal fairness. A decision policy satisfies conditional principal fairness when D A | Y (0), Y (1), W, where W = ω(X) denotes a reduced set of the covariates X. Again, since D is binary, this condition is equivalent to the expression E[d(X) | A = a, Y (0) = y0, Y (1) = y1, W = w] = E[d(X) | Y (0) = y0, Y (1) = y1, W = w] for all y0, y1, and w satisfying Pr(Y (0) = y0, Y (1) = y1, W = w) > 0. As above, expanding this expression and replacing d(xj) by the corresponding decision variable dj yields linear constraints of the form i=1 di Pr(X = xi | A = a, S = s) j=1 di Pr(X = xi | S = s) for each a A and each of the finitely many values of S = (Y (0), Y (1), W) such that s = (y0, y1, w) Y Y W satisfies Pr(Y (0) = y0, Y (1) = y1, W = w) > 0. Again, these constraints are linear by inspection. Theorem B.1 Part 2 Suppose a decision policy satisfies path-specific fairness for a given collection of paths Π and a (possibly) reduced set of covariates W = ω(X), meaning that for every a A, E[DΠ,A,a | W] = E[D | W]. Recall from the definition of path-specific counterfactuals that DΠ,A,a = f D(XΠ,A,a , UD) = 1UD d(XΠ,A,a ), where UD {XΠ,A,a, X}. Since W = ω(X), UD Causal Conceptions of Fairness and their Consequences {XΠ,A,a, W}, it follows that E[DΠ,A,a | W = w] i=1 E[DΠ,A,a | XΠ,A,a = xi, W = w] Pr(XΠ,A,a = xi | W = w) i=1 E[1UD d(XΠ,A,a ) | XΠ,A,a = xi, W = w] Pr(XΠ,A,a = xi | W = w) i=1 d(XΠ,A,a ) Pr(XΠ,A,a = xi | W = w) i=1 di Pr(XΠ,A,a = xi | W = w). An analogous calculation yields that E[D | W = w] = Pm i=1 di Pr(X = xi | W = w). Equating these expressions gives i=1 di Pr(X = xi | W = w) i=1 di Pr(XΠ,A,a = xi | W = w) for each a A and each of the finitely many w W such that Pr(W = w) > 0. Again, each of these constraints is linear by inspection. Theorem B.1 Part 3 A decision policy satisfies counterfactual predictive parity if Y (1) A | D = 0, or equivalently, Pr(Y (1) = y | A = a, D = 0) = Pr(Y (1) | D = 0) for all a A. We may rewrite this expression to obtain: Pr(Y (1) = y, A = a, D = 0) Pr(A = a, D = 0) = Cy, where Cy = Pr(Y (1) = y | D = 0). Expanding the numerator on the left-hand side of the above equation yields Pr(Y (1) = y, A = a, D = 0) i=1 [1 di] Pr(Y (1) = y, A = a, X = xi) Similarly, expanding the denominator yields Pr(Y (1) = y, D = 0) i=1 [1 di] Pr(Y (1) = y, X = xi). for each of the finitely many y Y. Therefore, counterfactual predictive parity corresponds to i=1 [1 di] Pr(Y (1) = y, A = a, X = xi) i=1 [1 di] Pr(Y (1) = y, X = xi), for each a A and y Y. Again, these constraints are linear in the di by inspection. Consider the family of linear programs L = {L(v)}v k where the linear program L(v) has the same objective function Pm i=1 di u(xi) pi and budget constraint Pm i=1 di pi b as before, together with additional constraints for each a A as in Eq. (10), where Cyi = vi for i = 1, . . . , k. By assumption, there exists a feasible solution to the optimization problem in Eq. 
(7), so the solution to at least one program in L is a utility-maximizing policy that satisfies counterfactual predictive parity. C. A Stylized Example of College Admissions In the example we consider in Section 2.1, the exogenous variables U = {u A, u D, u E, u M, u T , u Y } in the DAG are independently distributed as follows: UA, UD, UY UNIF(0, 1), UE, UM, UT N(0, 1). For fixed constants µA, βE,0, βE,A, βM,0, βM,E, βT,0, βT,E, βT,M, βT,B, βT,u, βY,0, βY,D, we define the endogenous variables V = {A, E, M, T, D, Y } in the DAG by the following structural equations: ( a1 if u A µA a0 otherwise , f E(a, u E) = βE,0 + βE,A 1(a = a1) + u E, f M(e, u M) = βM,0 + βM,E e + u M, f T (e, m, u T ) = βT,0 + βT,E e + βT,M m + βT,B e m + βT,u u T , f D(x, u D) = 1(u D d(x)), f Y (m, u Y , δ) = 1(u Y logit 1(βY,0 + m + βY,D δ)), where logit 1(x) = (1 + exp( x)) 1 and d(x) is the decision policy. In our example, we use constants µA = 1 3, βE,0 = 1, βE,A = 1 βM,0 = 0, βM,E = 1, βT,0 = 50, βT,E = 4, βT,M = 4, βT,u = 7, βT,B = 1, βY,0 = 1 2, βY,D = 1 2. We also assume a budget b = 1 Causal Conceptions of Fairness and their Consequences D. Proof of Proposition 1 We begin by more formally defining (multiple) threshold policies. We assume, without loss of generality, that Pr(A = a) > 0 for all a A throughout. Definition D.1. Let u(x) be a utility function. We say that a policy d(x) is a threshold policy with respect to u if there exists some t such that ( 1 u(x) > t, 0 u(x) < t, and d(x) [0, 1] is arbitrary if u(x) = t. We say that d(x) is a multiple threshold policy with respect to u if there exist group-specific constants ta for a A such that ( 1 u(x) > tα(x), 0 u(x) < tα(x), and d(x) [0, 1] is arbitrary if u(x) = tα(x). Remark 1. In general, it is possible for different thresholds to produce threshold policies that are almost surely equal. For instance, if u(X) BERN( 1 2), then the policies 1u(X)>p are almost surely equal for all p [0, 1). Nevertheless, we speak in general of the threshold associated with the threshold policy d(X) unless there is ambiguity. We first observe that if U is consistent modulo α, then whether a decision policy d(x) is a multiple threshold policy does not depend on our choice of u U. Lemma D.1. Let U be a collection of utilities consistent modulo α, and suppose d : X [0, 1] is a decision rule. If d(x) is a multiple threshold rule with respect to a utility u U, then d(x) is a multiple threshold rule with respect to every u U. In particular, if d(x) can be represented by non-negative thresholds over u , it can be represented by non-negative thresholds over any u U. Proof. Suppose d(x) is represented by thresholds {t a}a A with respect to u . We construct the thresholds {ta}a A explicitly. Fix a A and suppose that there exists x α 1(a) such that u (x ) = t a. Then set ta = u(x ). Now, if u(x) > ta = u(x ) then, by consistency modulo α, u (x) > u (x ) = t a. Similarly if u(x) < ta then u (x) < t a. We also note that by consistency modulo α, sign(ta) = sign(u(x )) = sign(u (x )) = sign(t a). If there is no x α 1(a) such that u (x ) = t a, then let ta = inf x Sa u(x) where Sa = {x α 1(a) | u (x) > t a}. Note that since sign(u(x)) = sign(u (x)) for all x by consistency modulo α, if t a 0, it follows that ta 0 as well. We need to show in this case also that if u(x) > ta then u (x) > t a, and if u(x) < ta then u (x) < t a. To do so, let x α 1(a) be arbitrary, and suppose u(x) > ta. 
Then, by definition, there exists x α 1(a) such that u(x) > u(x ) > ta and u (x ) > t a, whence u (x) > u (x ) > t a by consistency modulo α. On the other hand, if u(x) < ta, it follows by the definition of ta that u (x) t a; since u (x) = t a by hypothesis, it follows that u (x) < t a. Therefore, it follows in both cases that for x α 1(a), if u(x) > ta then u (x) > t a, and if u(x) < ta then u (x) < t a. Therefore ( 1 if u(x) > tα(x), 0 if u(x) < tα(x), i.e., d(x) is a multiple threshold policy with respect to u. Moreover, as noted above, if t a 0 for all a A, then ta 0 for all a A. We now prove the following strengthening of Prop. 1. Lemma D.2. Let U be a collection of utilities consistent modulo α. Let d(x) be a feasible decision policy that is not a.s. a multiple threshold policy with non-negative thresholds with respect to U, then d(x) is strongly Pareto dominated. Proof. We prove the claim in two parts. First, we show that any policy that is not a multiple threshold policy is strongly Pareto dominated. Then, we show that any multiple threshold policy that cannot be represented with non-negative thresholds is strongly Pareto dominated. If d(x) is not a multiple threshold policy, then there exists a u U and a A such that d(x) is not a threshold policy when restricted to α 1(a ) with respect to u. We will construct an alternative policy d (x) that attains strictly greater utility on α 1(a ) and is identical elsewhere. Thus, without loss of generality, we assume there is a single group, i.e., α(x) = a . The proof proceeds heuristically by moving some of the mass below a threshold to above a threshold to create a feasible policy with improved utility. Let b = E[d(X)]. Define m Lo(t) = E[d(X) 1u(X)t]. We show that there exists t such that m Up(t ) > 0 and m Lo(t ) > 0. For, if not, consider t = inf{t R : m Up(t) = 0}. Note that d(X) 1u(X)> t = 1u(X)> t a.s. If t = , then by definition d(X) = 1 a.s., which is a threshold policy, violating our assumption on d(X). If t > , then for any Causal Conceptions of Fairness and their Consequences t < t, we have, by definition that m Up(t ) > 0, and so by hypothesis m Lo(t ) = 0. Therefore d(X) 1u(X)< t = 0 a.s., and so, again, d(X) is a threshold policy, contrary to hypothesis. Now, with t as above, for notational simplicity, let m Up = m Up(t ) and m Lo = m Lo(t ) and consider the alternative policy (1 m Up) d(x) u(x) < t , d(x) u(x) = t , 1 (1 m Lo) (1 d(x)) u(x) > t . Then it follows by construction that E[d (X)] = (1 m Up) m Lo + E[d(X) 1u(X)=t ] + Pr(u(X) > t ) (1 m Lo) m Up = m Lo + E[d(X) 1u(X)=t ] + Pr(u(X) > t ) m Up = E[d(X) 1u(X)t ] E[(1 d(X)) 1u(X)>t ] so d (x) is feasible. However, d (x) d(x) = m Lo (1 d(x)) 1u(x)>t m Up d(x) 1u(x)t u(X)] m Up E[d(X) 1u(X) m Lo t E[(1 d(X)) 1u(X)>t ] m Up t E[d(X) 1u(X) u (d) for arbitrary u U. Let t = inf{u (x) : d (x) > d(x)}. Note that by construction for any x, x X, if d (x) > d(x) and d (x ) < d(x ), then u(x) > t > u(x ). It follows by consistency modulo α that u (x) t u (x ), and, moreover, that at least one of the inequalities is strict. Without loss of generality, assume u (x) > t u (x ). Then, we have that u(x) > t if and only if u (x) > t . Therefore, it follows that E[(d (X) d(X)) 1u (X)>t ] = m Up > 0. 
Since E[d (X) d(X)] = 0, we see that E[(d (X) d(X)) u (X)] = E[(d (X) d(X)) 1u (X)>t u (X)] + E[(d (X) d(X)) 1u (X) t u (X)] > t E[(d (X) d(X)) 1u (X)>t ] + t E[(d (X) d(X)) 1u (X) t ] = t E[d (X) d(X)] where in the inequality we have used the fact that if d (x) > d(x), u (x) > t , and if d (x) < d(x), u (x) t . Therefore E[d(X) u (X)] < E[d (X) u (X)], i.e., d (x) strongly Pareto dominates d(x). Now, we prove the second claim, namely, that a multiple threshold policy τ(x) that cannot be represented with non-negative thresholds is strongly Pareto dominated. For, if τ(x) is such a policy, then, by Lemma D.1, for any u U, E[τ(X) 1u(X)<0] > 0. It follows immediately that τ (x) = τ(x) 1u(x)>0 satisfies u(τ ) > u(τ). By consistency modulo α, the definition of τ (x) does not depend on our choice of u, and so u(τ ) > u(τ) for every u U, i.e., τ (x) strongly Pareto dominates τ(x). The following results, which draw on Lemma D.2, are useful in the proof of Theorem 1. Definition D.2. We say that a decision policy d(x) is budgetexhausting if min(b, Pr(u(X) > 0)) E[d(X)] min(b, Pr(u(X) 0)). Remark 2. We note that if U is consistent modulo α, then whether or not a decision policy d(x) is budget-exhausting does not depend on the choice of u U. Further, if Pr(u(X) = 0) = 0 e.g., if the distribution of X is Ufine then the decision policy is budget-exhausting if and only if E[d(X)] = min(b, Pr(u(X) > 0)). Corollary D.1. Let U be a collection of utilities consistent modulo α. If τ(x) is a feasible policy that is not a budget-exhausting multiple threshold policy with nonnegative thresholds, then τ(x) is strongly Pareto dominated. Proof. Suppose τ(x) is not strongly Pareto dominated. By Lemma D.2, it is a multiple threshold policy with nonnegative thresholds. Causal Conceptions of Fairness and their Consequences Now, suppose toward a contradiction that τ(x) is not budgetexhausting. Then, either E[τ(X)] > min(b, Pr(u(X) 0)) or E[τ(X)] < min(b, Pr(u(X) > 0)). In the first case, since τ(x) is feasible, it follows that E[τ(X)] > Pr(u(X) 0). It follows that τ(x) 1u(x)<0 is not almost surely zero. Therefore E[τ(X)] < E[τ(X) 1u(X)>0], and, by consistency modulo α, this holds for any u U. Therefore τ(x) is strongly Pareto dominated, contrary to hypothesis. In the second case, consider d(x) = θ 1u(x)>0 + (1 θ) τ(x). Since E[τ(X)] < min(b, Pr(u(X) > 0)) and E[d(X)] = θ Pr(u(X) > 0) + (1 θ) E[τ(X)], there exists some θ > 0 such that d(x) is feasible. For that θ, a similar calculation shows immediately that u(d) > u(τ), and, by consistency modulo α, u (d) > u (τ) for all u U. Therefore, again, d(x) strongly Pareto dominates τ(x), contrary to hypothesis. Lemma D.3. Given a utility u, there exists a mapping T from [0, 1]A to [ , ]A taking sets of quantiles {qa}a A to thresholds {ta}a A such that: 1. T is monotonically non-increasing in each coordinate; 2. For each set of quantiles, there is a multiple threshold policy τ : X [0, 1] with thresholds T({qa}) with respect to u such that E[τ(X) | A = a] = qa. Proof. Simply choose ta = inf{s R : Pr(u(X) > s) < qa}. (11) Then define ( qa Pr(u(X)>ta|A=a) Pr(u(X)=ta|A=a) Pr(u(X) = ta, A = a) > 0 0 Pr(u(X) = ta, A = a) = 0. Note that Pr(u(X) ta | A = a) qa, since, by definition, Pr(u(X) > ta ϵ | A = a) qa for all ϵ > 0. Therefore, Pr(u(X) > ta | A = a) + Pr(u(X) = ta | A = a) qa, and so pa 1. Further, since Pr(u(X) > ta | A = a) qa, we have that pa 0. Finally, let 1 u(x) > tα(x), pa u(x) = tα(x), 0 u(x) < tα(x), and it follows immediately that E[d(X) | A = a] = qa. 
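As a small numerical illustration of this construction (not from the paper; the utility distributions and target quantiles below are made up), the group-specific thresholds can be read off as upper quantiles of u(X) within each group:

```python
# Sketch (not from the paper) of the quantile-to-threshold map in Lemma D.3:
# given target acceptance rates q_a, choose group-specific thresholds so that
# E[tau(X) | A = a] = q_a. Utilities are simulated from made-up continuous
# distributions, so no randomization at the threshold is needed.
import numpy as np

rng = np.random.default_rng(1)
groups = {"a0": rng.normal(0.0, 1.0, 200_000),   # draws of u(X) given A = a0
          "a1": rng.normal(0.5, 1.0, 200_000)}   # draws of u(X) given A = a1
q = {"a0": 0.25, "a1": 0.40}                     # target acceptance rates

thresholds = {a: np.quantile(u_draws, 1 - q[a]) for a, u_draws in groups.items()}
for a, t_a in thresholds.items():
    accept_rate = (groups[a] > t_a).mean()
    print(a, round(t_a, 3), round(accept_rate, 3))   # acceptance rate is close to q_a

# Raising q_a lowers the quantile level 1 - q_a, and hence can only lower t_a.
```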
That ta is a monotonically non-increasing function of qa follows immediately from Eq. (11). We can further refine Cor. D.1 and Lemma D.3 as follows: Lemma D.4. Let u be a utility. Then a feasible policy is utility maximizing if and only if it is a budget-exhausting threshold policy. Moreover, there exists at least one utility maximizing policy. Proof. Let α be a constant map, i.e., α : X A, where | A| = 1. Then U = {u} is consistent modulo α, and so by Cor. D.1, any Pareto efficient policy is a budget exhausting multiple threshold policy relative to U. Since U contains a single element, a policy is Pareto efficient if and only if it is utility maximizing. Since α is constant, a policy is a multiple threshold policy relative to α if and only if it is a threshold policy. Therefore, a policy is utility maximizing if and only if it is a budget exhausting threshold policy. By Lemma D.3, such a policy exists, and so the maximum is attained. E. Prevalence and the Proof of Theorem 1 The notion of a probabilistically small set such as the event in which an idealized dart hits the exact center of a target is, in finite-dimensional real vector spaces, typically encoded by the idea of a Lebesgue null set. Here we prove that the set of distributions such that there exists a policy satisfying either counterfactual equalized odds, conditional principal fairness, or counterfactual fairness that is not strongly Pareto dominated is small in an analogous sense. The proof turns on the following intuition. Each of the fairness definitions imposes a number of constraints. By Lemma D.2, any policy that is not strongly Pareto dominated is a multiple threshold policy. By adjusting the group-specific thresholds of such a policy, one can potentially satisfy one constraint per group. If there are more constraints than groups, then one has no additional degrees of freedom that can be used to ensure that the remaining constraints are satisfied. If, by chance, those constraints are satisfied with the same threshold policy, they are not satisfied robustly even a minor distribution shift, such as increasing the amount of mass above the threshold by any amount on the relevant subpopulation, will break them. Therefore, over a typical distribution, at most |A| of the constraints can simultaneously be satisfied by a Pareto efficient policy, meaning that typically no Pareto efficient policy fully satisfies all of the conditions of the fairness definitions. Formalizing this intuition, however, requires considerable care. In Section E.1, we give a brief introduction to a popular generalization of null sets to infinite-dimensional vector Causal Conceptions of Fairness and their Consequences spaces, drawing heavily on a review article by Ott & Yorke (2005). In Section E.2 we provide a roadmap of the proof itself. In Section E.3, we establish the main hypotheses necessary to apply the notion of prevalence to a convex set in our case, the set of U-fine distributions. In Section E.4, we establish a number of technical lemmata used in the proof of Theorem 1, and provide a proof of the theorem itself in Section E.5. In Section E.6, we show why the hypothesis of U-fineness is important and how conspiracies between atoms in the distribution of u(X) can lead to robust counterexamples. E.1. 
Shyness and Prevalence Lebesgue measure λn on Rn has a number of desirable properties: Local finiteness: For any point v Rn, there exists an open set U containing x such that λn[U] < ; Strict positivity: For any open set U, if λn[U] = 0, then U = ; Translation invariance: For any v Rn and measurable set E, λn[E + v] = λn[E]. No measure on an infinite-dimensional, separable Banach space, such as L1(R), can satisfy these three properties (Ott & Yorke, 2005). However, while there is no generalization of Lebesgue measure to infinite dimensions, there is a generalization of Lebesgue null sets called shy sets to the infinite-dimensional context that preserves many of their desirable properties. Definition E.3 (Hunt et al. (1992)). Let V be a completely metrizable topological vector space. We say that a Borel set E V is shy if there exists a Borel measure µ on V such that: 1. There exists compact C V such that 0 < µ[C] < , 2. For all v V , µ[E + v] = 0. An arbitrary set F V is shy if there exists a shy Borel set E V containing F. We say that a set is prevalent if its complement is shy. Prevalence generalizes the concept of Lebesgue full measure or co-null sets (i.e., sets whose complements have null Lebesgue measure) in the following sense: Proposition E.3 (Hunt et al. (1992)). Let V be a completely metrizable topological vector space. Then: Any prevalent set is dense in V ; If G L and G is prevalent, then L is prevalent; A countable intersection of prevalent sets is prevalent; Every translate of a prevalent set is prevalent; If V = Rn, then G Rn is prevalent if and only if λn[Rn \ G] = 0. As is conventional for sets of full measure in finitedimensional spaces, if some property holds for every v E, where E is prevalent, then we say that the property holds for almost every v V or that it holds generically in V . Prevalence can also be generalized from vector spaces to convex subsets of vector spaces, although additional care must be taken to ensure that a relative version of Prop. E.3 holds. Definition E.4 (Anderson & Zame (2001)). Let V be a topological vector space and let C V be a convex subset completely metrizable in the subspace topology induced by V . We say that a universally measurable set E C is shy in C at c C if for each 1 δ > 0, and each neighborhood U of 0 in V , there is a regular Borel measure µ with compact support such that SUPP(µ) (δ(C c) + c) (U + c), and µ[E + v] = 0 for every v V . We say that E is shy in C or shy relative to C if E is shy in C at c for every c C. An arbitrary set F V is shy in C if there exists a universally measurable shy set E C containing F. A set G is prevalent in C if C \ G is shy in C. Proposition E.4 (Anderson & Zame (2001)). If E is shy at some point c C, then E is shy at every point in C and hence is shy in C. Sets that are shy in C enjoy similar properties to sets that are shy in V . Proposition E.5 (Anderson & Zame (2001)). Let V be a topological vector space and let C V be a convex subset completely metrizable in the subspace topology induced by V . Then: Any prevalent set in C is dense in C; If G L and G is prevalent in C, then L is prevalent in C; A countable intersection of sets prevalent in C is prevalent in C If G is prevalent in C then G + v is prevalent in C + v for all v V . Causal Conceptions of Fairness and their Consequences If V = Rn and C V is a convex subset with nonempty interior, then G C is prevalent in C if and only if λn[C \ G] = 0. 
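To build intuition for these definitions, the following worked example (a standard one, not taken from the paper) checks directly that a hyperplane in $\mathbb{R}^n$ is shy, consistent with the final item of Prop. E.3.

```latex
\paragraph{Example (illustration, not from the paper).}
Let $V = \mathbb{R}^n$ and $E = \{x \in \mathbb{R}^n : x_1 = 0\}$, a hyperplane.
Take $\mu$ to be one-dimensional Lebesgue measure on the compact segment
$C = \{t e_1 : t \in [0,1]\}$, so that $0 < \mu[C] = 1 < \infty$. For any $v \in \mathbb{R}^n$,
\[
  \mu[E + v] \;=\; \lambda_1\bigl(\{t \in [0,1] : t = v_1\}\bigr) \;=\; 0,
\]
since $E + v = \{x \in \mathbb{R}^n : x_1 = v_1\}$ meets the segment in at most one point.
Hence $E$ is shy, in agreement with the fact that $\lambda_n[E] = 0$; the line spanned by
$e_1$ serves as a one-dimensional probe in the sense defined below.
```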
Sets that are shy in C can often be identified by inspecting their intersections with a finite-dimensional subspace W of V , a strategy we use to prove Theorem 1. Definition E.5 (Anderson & Zame (2001)). A universally measurable set E C, where C is convex and completely metrizable, is said to be k-shy in C if there exists a kdimensional subspace W V such that 1. A translate of the set C has positive Lebesgue measure in W, i.e., λW [C + v0] > 0 for some v0 V ; 2. Every translate of the set E is a Lebesgue null set in W, i.e., λW [E + v] = 0 for all v V . Here λW denotes k-dimensional Lebesgue measure supported on W.13 We refer to such a W as a k-dimensional probe witnessing the k-shyness of E, and to an element w W as a perturbation. The following intuition motivates the use of probes to detect shy sets. By analogy with Fubini s theorem, one can imagine trying to determine whether a subset of a finite-dimensional vector space is large or small by looking at its cross sections parallel to some subspace W V . If a set E V is small in each cross section i.e., if λW [E + v] = 0 for all v V then E itself is small in V , i.e., E has λV -measure zero. Proposition E.6 (Anderson & Zame (2001)). Every k-shy set in C is shy in C. E.2. Outline To aid the reader in following the application of the theory in Section E.1 to the proof of Theorem 1, we provide the following outline of the argument. In Section E.3 we establish the context to which we apply the notion of relative shyness. In particular, we introduce the vector space K consisting of the totally bounded Borel measures on the state space K where K is X Y, X Y Y, or A X A, depending on which notion of fairness is under consideration. We further isolate the subspace K K of U-fine totally bounded Borel measures. Within this space, we are interested in the convex set Q K, the set of Ufine joint probability distributions of, respectively, X and Y (1); X, Y (0), Y (1); or A and the XΠ,A,a. Within Q, we 13Note that Lebesgue measure on W is only defined up to a choice of basis; however, since λ[T(A)] = | det(T)| λ[A] for any linear automorphism T and Lebesgue measure λ, whether a set has null measure does not depend on the choice of basis. identify E Q, the set of U-fine distributions on K over which there exists a policy satisfying the relevant fairness definition that is not strongly Pareto dominated. The claim of Theorem 1 is that E is shy relative to Q. To ensure that relative shyness generalizes Lebesgue null measure in the expected way i.e., that Prop. E.5 holds Definition E.4 has three technical requirements: (1) that the ambient vector space V be a topological vector space; (2) that the convex set C be completely metrizable; and (3) that the shy set E be universally measurable. In Lemma E.7, we observe that K is a complete topological vector space under the total variation norm, and so is a Banach space. We extend this in Cor. E.2, showing that K is also a Banach space. We use this fact in Lemma E.11 to show that Q is a completely metrizable subset of K, as well as convex. Lastly, in Lemma E.13, we show that the set E is closed, and therefore universally measurable. In Section E.4, we develop the machinery needed to construct a probe W for the proof of Theorem 1 and prove several lemmata simplifying the eventual proof of the theorem. To build the probe, it is necessary to construct measures µmax,a with maximal support on the utility scale. 
This ensures that if any two threshold policies produce different decisions on any µ K, they will produce different decisions on typical perturbations. The construction of the µmax,a, is carried out in Lemma E.14 and Cor. E.3. Next, we introduce the basic style of argument used to show that a subset of Q is shy in Lemma E.15 and Lemma E.16, in particular, by showing that the set of µ Q that give positive probability to an event E is either prevalent or empty. We use then use a technical lemma, Lemma E.17, to show, in effect, that a generic element of Q has support on the utility scale wherever a given fixed distribution µ Q does. In Defn. E.12, we introduce the concept of overlapping and splitting utilities, and show in Lemma E.19 that this property is generic in Q unless there exists a ω-stratum that contains no positive-utility observables x. Lastly, in Lemma E.20, we provide a mild simplification of the characterization of finitely shy sets that makes the the proof of Theorem 1 more straightforward. Finally, in Section E.5, we give the proof of Theorem 1. We divide the proof into three parts. In the first part, we restrict our attention to the case of counterfactual equalized odds, and show in detail how to combine the lemmata of the previous section to construct the (at most) 2 |A|- dimensional probe W. In the second part we consider two distinct cases. The argument in both cases is conceptually parallel. First, we argue that the balance conditions of counterfactual equalized odds encoded by Eq. (2) must be broken by a typical perturbation in W. In particular, we argue that for a given base distribution µ, there can be at most one budget-exhausting multiple threshold policy that Causal Conceptions of Fairness and their Consequences can although need not necessarily satisfy counterfactual equalized odds. We show that the form of this policy cannot be altered by an appropriate perturbation in W, but that the conditional probability of a positive decision will, in general, be altered in such a way that Eq. (2) can only hold for a λW-null set of perturbations. In the final section, we lay out modiciations that can be made to the proof given for counterfactual equalized odds in the first two parts that adapt the argument to the cases of conditional principal fairness and path-specific fairness. In particular, we show how to construct the probe W in such a way that the additional conditioning on the reduced covariates W = ω(X) in Eqs. (3) and (5) does not affect the argument. E.3. Convexity, Complete Metrizability, and Universal Measurability In this section, we establish the background requirements of Prop. E.6 for the setting of Theorem 1. In particular, we exhibit the U-fine distributions as a convex subset of a topological vector space, the set of totally bounded U-fine Borel measures. We show that the U-fine probability distributions form a completely metrizable subset in the topology it inherits from the space of totally bounded measures. Lastly, we show that the set of regular distributions under which there exists a Pareto efficient policy satisfying one of the three fairness criteria is closed, and therefore universally measurable. E.3.1. BACKGROUND AND NOTATION We begin by establishing some notational conventions. We let K denote the underlying state space over which the distributions in Theorem 1 range. Specifically, K = X Y in the case of counterfactual equalized odds; K = X Y Y in the case of conditional principal fairness; and K = A X A in the case of path-specific fairness. 
We note that since X Rk for some k and Y R, K may equivalently be considered a subset of Rn for some n N, with the subspace topology (and Borel sets) inherited from Rn.14 We recall the definition of totally bounded measures. Definition E.6. Let M be a σ-algebra on V , and let µ be a countably additive (V, M)-measure. Then, we define |µ|[E] = sup i=1 |µ[Ei]| (12) where the supremum is taken over all countable partitions {Ei}i N, i.e., collections such that S i=1 Ei = E and Ei Ej = for j = i. We call |µ| the total variation of µ, and the total variation norm of µ is |µ|[V ]. We say that µ is totally bounded if its total variation norm 14In the case of path-specific fairness, we can equivalently think of A as a set of integers indexing the groups. is finite, i.e., |µ|[V ] < . Lemma E.5. If µ is totally bounded, then |µ| is a finite positive measure on (V, M), and |µ[E]| |µ|[E] for all E M. See Theorem 6.2 in Rudin (1987) for proof. We let K denote the set of totally bounded Borel measures on K. We note that, in the case of path specific fairness, which involves the joint distributions of counterfactuals, X is not defined directly. Rather, the joint distribution of the counterfactuals XΠ,A,a and A defines the distribution of X through consistency, i.e., what would have happened to someone if their group membership were changed to a A is what actually happens to them if their group membership is a . More formally, Pr(X E | A = a ) = Pr(XΠ,A,a E | A = a ) for all Borel sets E X. (See 3.6.3 in Pearl (2009b).) For any µ K, we adopt the following notational conventions. If we say that a property holds µ-a.s., then the subset of K on which the property fails has |µ|-measure zero. If E K is a measurable set, then we denote by µ E the restriction of µ to E, i.e., the measure defined by the mapping E 7 µ[E E ]. We let Eµ[f] = R K f dµ, and for measurable sets E, Prµ(E) = µ[E].15 The fairness criteria we consider involve conditional independence relations. To make sense of conditional independence relations more generally, for Borel measurable f we define Eµ[f | F] to be the Radon-Nikodym derivative of the measure E 7 Eµ[f 1E] with respect to the measure µ restricted to the sub σ-algebra of Borel sets F. (See 34 in Billingsley (1995).) Similarly, we define Eµ[f | g] to be Eµ[f | σ(g)], where σ(g) denotes the sub σ-algebra of the Borel sets generated by g. In cases where the condition can occur with non-zero probability, we can instead make use of the elementary definition of discrete conditional probability. Lemma E.6. Let g be a Borel function on K, and suppose Prµ(g = c) = 0 for some constant c R. Then, we have that µ-a.s., for any Borel function f, Eµ[f | g] 1g=c = Eµ[f 1g=c] Prµ(g = c) 1g=c. See Rao (2005) for proof. With these notational conventions in place, we turn to establishing the background conditions of Prop. E.6. Lemma E.7. The set of totally bounded measures on a measure space (V, M) form a complete topological vector space under the total variation norm, and hence a Banach space. 15To state and prove our results in a notationally uniform way, we occasionally write Prµ(E) even when µ ranges over measures that may not be probability measures. Causal Conceptions of Fairness and their Consequences See, e.g., Steele (2019) for proof. It follows from this that K is a Banach space. Remark 3. Since K is a Banach space, it possesses a topology, and consequently a collection of Borel subsets. 
These Borel sets are to be distinguished from the Borel subsets of the underlying state space K, which the elements of K measure. The requirement that the subset E of the convex set C be universally measurable in Proposition E.6 is in reference to the Borel subsets of K; the requirement that µ K be a Borel measure is in reference to the Borel subsets of K. Recall the definition of absolute continuity. Definition E.7. Let µ and ν be measures on a measure space (V, M). We say that a measure ν is absolutely continuous with respect to µ also written ν Î µ if, whenever µ[E] = 0, ν[E] = 0. Absolute continuity is a closed property in the topology induced by the total variation norm. Lemma E.8. Consider the space of totally bounded measures on a measure space (V, M) and fix µ. The set of ν such that ν Î µ is closed. Proof. Let {νi}i N be a convergent sequence of measures absolutely continuous with respect to µ. Let the limit of the νi be ν. We seek to show that ν Î µ. Let E M be an arbitrary set such that µ[E] = 0. Then, we have that ν[E] = lim n νi[E] since νi Î µ for all i. Since E was arbitrary, the result follows. Recall the definition of a pushforward measure. Definition E.8. Let f : (V, M) (V , M ) be a measurable function. Let µ be a measure on V . We define the pushforward measure µ f 1 on V by the map E 7 µ[f 1(E )] for E M . Within K, in the case of counterfactual equalized odds and conditional principal fairness, we define the subspace K to be the set of totally bounded measures µ on K such that the pushforward measure µ u 1 is absolutely continuous with respect to the Lebesgue measure λ on R for all u U. By the Radon-Nikodym theorem, these pushforward measures arise from densities, i.e., for any µ K, there exists a unique fµ L1(R) such that for any measurable subset E of R, we have µ u 1[E] = Z In the case of path-specific fairness, we require the joint distributions of the counterfactual utilities to have a joint density. That is, we define the subspace K to be the set of totally bounded measures µ on K such that the pushforward measure µ (u A) 1 is absolutely continuous with respect to Lebesgue measure on RA for all u U. Here, we recall that u A : (a, (xa )a A) 7 (u(xa ))a A. As before, there exists a corresponding density fµ L1(RA). We therefore see that K extends in a natural way the notion of a Uor UA-fine distribution, and so, by a slight abuse of notation, refer to K as the set of U-fine measures on K. Indeed, since Prµ(u(X) E, A = a) Prµ(u(X) E), it also follows that, for a A such that Prµ(A = a) > 0, the conditional distributions of u(X) | A = a are also absolutely continuous with respect to Lebesgue measure, and so also have densities. For notational convenience, we set fµ,a to be the function satisfying Prµ(u(X) E, A = a) = Z so that fµ = P Since absolute continuity is a closed condition, it follows that K is a closed subspace of K. This leads to the following useful corollary of Lemma E.8. Corollary E.2. The collection of U-fine measures on K is a Banach space. Proof. It is straightforward to see that K is a subspace of K. Since K is a closed subset of K by Lemma E.8, it is complete, and therefore a Banach space. We note the following useful fact about elements of K. Lemma E.9. Consider the mapping µ 7 fµ from K to L1(R) given by associating a measure µ with the Radon Nikodym derivative of the pushforward measure µ u 1. This mapping is continuous. 
Likewise, the mapping µ 7 fµ,a is continuous for all a A, and, in the case of pathspecific fairness, the mapping of µ to the Radon-Nikodym derivative of µ (u A) 1 is continuous. Proof. We show only the first case. The others follow by virtually identical arguments. Let ϵ > 0 be arbitrary. Choose µ K, and suppose that |µ µ |[K] < ϵ. Then, let EUp = {x R : fµ(x) > fµ (x)} ELo = {x R : fµ(x) < fµ (x)}. Causal Conceptions of Fairness and their Consequences Then EUp and ELo are disjoint, so we have that fµ fµ L1(R) = EUp fµ fµ dλ ELo fµ fµ dλ = |(µ µ )[u 1(EUp)]| + |(µ µ )[u 1(ELo)]| where the second equality follows by the definition of pushforward measures and the inequality follows from Lemma E.5. Since ϵ was arbitrary, the claim follows. Finally, we define Q. We let Q be the subset of K consisting of all U-fine probability measures, i.e., measures µ K such that: 1. The measure µ is U-fine; 2. For all Borel sets E K, µ[E] 0; 3. The measure of the whole space is unity, i.e., µ[K] = 1. We conclude the background and notation by observing that threshold policies are defined wholly by their thresholds for distributions in K and Q. Importantly, this observation does not hold when there are atoms on the utility scale which measures in K lack which can in turn lead to counterexamples to Theorem 1; see Appendix E.6. Lemma E.10. Let τ0(x) and τ1(x) be two multiple threshold policies. If τ0(x) and τ1(x) have the same thresholds, then for any µ K, τ0(X) = τ1(X) µ-a.s. Similarly, for µ Q, if Eµ[τ0(X) | A = a] = Eµ[τ1(X) | A = a] for all a A such that Prµ(A = a) > 0, then τ0(X) = τ1(X) µ-a.s. Moreover, for µ K in the case of path-specific fairness, if τ0(x) and τ1(x) have the same thresholds, then τ0(XΠ,A,a) = τ1(XΠ,A,a) µ-a.s. for any a A. Similarly, for µ Q in the case of path-specific fairness, if Eµ[τ0(XΠ,A,a)] = Eµ[τ1(XΠ,A,a)] then τ0(XΠ,A,a) = τ1(XΠ,A,a) µ-a.s. as well. Proof. First, we show that threshold policies with the same thresholds are equal, then we show that threshold policies that distribute positive decisions across groups in the same way are equal. Let {ta}a A denote the shared set of thresholds. It follows that if τ0(x) = τ1(x), then u(x) = tα(x). Now, Pr(u(X) = ta, A = a) = Z ta ta fµ,a dλ = 0, so Prµ(τ0(X) = τ1(X)) = 0. Next, suppose Eµ[τ0(X) | A = a] = Eµ[τ1(X) | A = a]. If the thresholds of the two policies agree for all a A such that Prµ(A = a) > 0, then we are done by the previous paragraph. Therefore, suppose t0 a = t1 a for some suitable a A, where ti a represents the threshold for group a A under the policy τi(x). Without loss of generality, suppose t0 a < t1 a. Then, it follows that t0a fµ,a dλ = Eµ[τ0(X) | A = a] Eµ[τ1(X) | A = a] Since µ Q, µ = |µ|, whence Pr|µ|(ta 0 u(X) t1 a | A = a) = 0. Since this is true for all a A such that Prµ(A = a) > 0, τ0(X) = τ1(X) µ-a.s. The proof in the case of path-specific fairness is almost identical. E.3.2. CONVEXITY, COMPLETE METRIZABILITY, AND UNIVERSAL MEASURABILITY The set of regular U-fine probability measures Q is the set to which we wish to apply Prop. E.6. To do so, we must show that Q is a convex and completely metrizable subset of K. Lemma E.11. The set of regular probability measures Q is convex and completely metrizable. Proof. The proof proceeds in two pieces. First, we show that the U-fine probability distributions are convex, as can be verified by direct calculation. Then, we show that Q is closed and therefore complete in the original metric of K. We begin by verifying convexity. 
Let µ, µ Q and let E K be an arbitrary Borel subset of K. Then, choose θ [0, 1], and note that (θ µ + [1 θ] µ )[E] = θ µ[E] + [1 θ] µ [E] θ 0 + [1 θ] 0 Causal Conceptions of Fairness and their Consequences and, likewise, that (θ µ + [1 θ] µ )[K] = θ µ[K] + [1 θ] µ [K] = θ 1 + [1 θ] 1 It remains only to show that Q is completely metrizable. To prove this, it suffices to show that it is closed, since closed subsets of complete spaces are complete, and K is a Banach space by Cor. E.2, and therefore complete. Suppose {µi}i N is a convergent sequence of probability measures in K with limit µ. Then µ[E] = lim i µi[E] lim i 0 = 0 and µ[K] = lim i µi[K] = lim i 1 = 1. Therefore Q is closed, and therefore complete, and hence is a convex, completely metrizable subset of K. Next we prove that the set E of regular U-fine densities over which there exists a policy satisfying the relevant counterfactual fairness definition that is not strongly Pareto dominated is universally measurable. Recall the definition of universal measurability. Definition E.9. Let V be a complete topological space. Then E V is universally measurable if V is measurable by the completion of every finite Borel measure on V , i.e., if for every finite Borel measure µ, there exist Borel sets E and S such that E E S and µ[S] = 0. We note that if a set is Borel, it is by definition universally measurable. Moreover, if a set is open or closed, it is by definition Borel. To show that E is closed, we show that any convergent sequence in E has a limit in E. The technical complication of the argument stems from the following fact that satisfying the fairness conditions, e.g., Eq. (4), involves conditional expectations, about which very little can be said in the absence of a density, and which are difficult to compare when taken across distinct measures. To handle these difficulties, we begin with a technical lemma, Lemma E.12, which gives a coarse bound on how different the conditional expectations of the same variable can be with respect to a sub σ-algebra F over two different distributions, µ and µ , before applying the results to the proof of Lemma E.13. Definition E.10. Let µ be a measure on a measure space (V, M), and let f be µ-measurable. Consider the equivalence class of M-measurable functions C = {g : g = f µ-a.e.}.16 We say that any g C is a version of f, and that g C is a standard version if g(v) C for some constant C and all v V . Remark 4. It is straightforward to see that for f L (µ), a standard version always exists with C = f . Remark 5. Note that in general, the conditional expectation Eµ [f | F] is defined only µ -a.e. If µ is not assumed to be absolutely continuous with respect to µ , it follows that Eµ[f | F] Eµ [f | F] L1(µ) (13) is not entirely well-defined, in that its value depends on what version of Eµ [f | F] one chooses. For appropriate f, however, one can nevertheless bound Eq. (13) for any standard version of Eµ [f | F]. Lemma E.12. Let µ, µ be totally bounded measures on a measure space (V, M). Let f L (µ) L (µ ). Let F be a sub σ-algebra of M. Let C = max( f L (µ), f L (µ )). Then, if g is a standard version of Eµ [f | F], we have that V |Eµ[f | F] g| dµ 4C |µ µ |[V ]. (14) Proof. First, we note that both Eµ[f | F] and g are Fmeasurable. Therefore, the sets EUp = {v V : Eµ[f | F](v) > g(v)} ELo = {v V : Eµ[f | F](v) < g(v)} are in F. Now, note that V |Eµ[f | F] g| dµ = Z EUp Eµ[f | F] g dµ ELo g Eµ[f | F] dµ. 
16Some authors define Lp(µ) spaces to consist of such equivalence classes, rather than the definition we use here. Causal Conceptions of Fairness and their Consequences First consider EUp. Then, we have that Z EUp Eµ[f | F] g dµ EUp Eµ[f | F] g dµ EUp Eµ[f | F] dµ Z Eup g d|µ µ | Eup C d|µ µ |, where in the final inequality, we have used the fact that, since g is a standard version of Eµ [f | F], g(v) Eµ [f | F] L (µ ) C for all v V , and the fact that, by the definition of conditional expectation, Z E Eν[h | F] dν = Z for any E F. Since f is everywhere bounded by C, applying Lemma E.5 yields that this final expression is less than or equal to 2C |µ µ |[V ]. An identical argument shows that Z ELo g Eµ[f | F] dµ 2C |µ µ |[V ], whence the result follows. Lemma E.13. Let E Q denote the set of joint densities on K such that there exists a policy satisfying the relevant fairness definition that is not strongly Pareto dominated. Then, E is closed, and therefore universally measurable. Proof. For notational simplicity, we consider the case of counterfactual equalized odds. The proofs in the other two cases are virtually identical. Suppose µi µ in K, where {µi}i N E. Then, by Lemma E.9, fµi,a fµ,a in L1(R). Moreover, by Lemma D.2, there exists a sequence of threshold policies {τi(x)}i N such that both Eµi[τ(X)] = min(b, Prµi(u(X) > 0)) Eµi[τi(X) | A, Y (1)] = Eµi[τi(X) | Y (1)]. Let {qa,i}a A be defined by qa,i = Eµi[τi(X) | A = a] if Prµi(A = a) > 0, and qa,i = 0 otherwise. Since [0, 1]A is compact, there exists a convergent subsequence {{qa,ni}a A}i N. Let it converge to the collection of quantiles {qa}a A defining, by Lemma D.3, a multiple threshold policy τ(x) over µ. Because µi µ and {qa,ni}a A {qa}a A, we have that Eµ[τa,ni(X) | A = a] Eµ[τ(X) | A = a] for all a A such that Prµ(A = a) > 0. Therefore, by Lemma E.9, τni(X) τ(X) in L1(µ). Choose ϵ > 0 arbitrarily. Then, choose N so large that for i greater than N, |µ µni|[K] < ϵ 10, τ(X) τni(X) L1(µ) ϵ 10. Then, observe that τ(x), τi(x) 1, and recall that Eµni[τni(X) | A, Y (1)] = Eµni[τni(X) | Y (1)]. (15) Therefore, let gi(x) be a standard version of Eµni[τni(X) | Y (1)] over µni. By Eq. (15), gi(x) is also a standard version of Eµni[τni(X) | A, Y (1)] over µni. Then, by Lemma E.12, we have that Eµ[τ(X) | A, Y (1)] Eµni[τni(X) | Y (1)] L1(µ) Eµ[τ(X) | A, Y (1)] Eµ[τni(X) | A, Y (1)] L1(µ) + Eµ[τni(X) | A, Y (1)] gi(X) L1(µ) + gi(X) Eµ[τni(X) | Y (1)] L1(µ) L1(µ) + Eµ[τni(X) | Y (1) Eµ[τ(X) | Y (1)] L1(µ) Since ϵ > 0 was arbitrary, it follows that, µ-a.e., Eµ[τ(X) | A, Y (1)] = Eµ[τ(X) | Y (1)]. Recall the standard fact that for independent random variables X and U, E[f(X, U) | X] = Z f(X, u) d FU(u), where FU is the distribution of U.17 Further recall that D = 1UD τ(X), where UD X, Y (1). It follows that Prµ(D = 1 | X, Y (1)) = Z 1 0 1ud<τ(X) dλ(ud) = τ(X). 17For a proof of this fact see, e.g., Brozius (2019). Causal Conceptions of Fairness and their Consequences Hence, by the law of iterated expectations, Prµ(D = 1 | A, Y (1)) = Eµ[Prµ(D = 1 | X, Y (1)) | A, Y (1)] = Eµ[τ(X) | A, Y (1)] = Eµ[τ(X) | Y (1)] = Eµ[Prµ(D = 1 | X, Y (1)) | Y (1)] = Prµ(D = 1 | Y (1)). Therefore D A | Y (1) over µ, i.e., counterfactual equalized odds holds for the decision policy τ(x) over the distribution µ. Consequently µ E, and so E is closed and therefore universally measurable. E.4. Shy Sets and Probes We require a number of additional technical lemmata for the proof of Theorem 1. 
The probe must be constructed carefully, so that, on the utility scale, an arbitrary element of Q is absolutely continuous with respect to a typical perturbation. In addition, it is useful to show that a number of properties are generic to simplify certain aspects of the proof of Theorem 1. For instance, Lemma E.16 is used in Theorem 1 to show that a certain conditional expectation is generically well-defined, avoiding the need to separately treat certain corner cases. Cor. E.3 concerns the construction of the probe used in the proof of Theorem 1. Lemmata E.17 to E.20 use Cor. E.3 to provide additional simplifications to the proof of Theorem 1. E.4.1. MAXIMAL SUPPORT First, to construct the probe used in the proof of Theorem 1, we require elements µ Q such that the densities fµ have maximal support. To produce such distributions, we use the following measure-theoretic construction. Definition E.11. Let {Eα}γ Γ be an arbitrary collection of µ-measurable sets for some positive measure µ on a measure space (M, M). We say that E is the measure-theoretic union of {Eγ}γ Γ if µ[Eγ \ E] = 0 for all γ Γ and E = S i=1 Eγi for some countable subcollection {γi} N. While measure-theoretic unions themselves are known (cf. Silva (2008), Rudin (1991)), for completeness, we include a proof of their existence, which, to the best of our knowledge, is not found in the literature. Lemma E.14. Let µ be a finite positive measure on a measure space (V, M). Then an arbitrary collection {Eγ}γ Γ of µ-measurable sets has a measure-theoretic union. Proof. For each countable subcollection Γ Γ, consider the error term r(Γ ) = sup γ Γ µ We claim that the infimum of r(Γ ) over all countable subcollections Γ Γ must be zero. For, toward a contradiction, suppose it were greater than or equal to ϵ > 0. Choose any set Eγ1 such that µ[Eγ1] ϵ. Such a set must exist, since otherwise r( ) < ϵ. Choose Eγ2 such that µ[Eγ2 \ Eγ1] > ϵ. Again, some such set must exist, since otherwise r({γ1}) < ϵ. Continuing in this way, we construct a countable collection {Eγi}i N. Therefore, we see that By construction, every term in the final sum is greater than or equal to ϵ, contradicting the fact that µ[V ] < . Therefore, there exist countable collections {Γn}n N such that r(Γn) < 1 n. It follows immediately that for all n for any fixed k N. Consequently, and S n N Γn is countable. The construction of the maximal elements used to construct the probe in the proof of Theorem 1 follows as a corollary of Lemma E.14 Corollary E.3. There are measures µmax,a Q such that for every a A and any µ K, λ[SUPP(fµ,a) \ SUPP(fµmax,a)] = 0. Proof. Consider the collection of measurable sets {SUPP(fµ,a)}µ K. By Lemma E.14, there exists a countable collection of measures {µi}i N such that for any µ K, SUPP(fµ,a) \ i=1 SUPP(fµi,a) where without loss of generality, we assume that λ[SUPP(fµi,a)] > 0 for all i N. Such a sequence must exist, since, by the first hypothesis of Theorem 1, for every Causal Conceptions of Fairness and their Consequences a A, there exists µ Q such that Prµ(A = a) > 0. Therefore, we can define the probability measure µmax,a, where i=1 2 i |µi A=a| |µi A=a| [K]. It follows immediately by construction that SUPP(fµmax,a) = i=1 SUPP(fµi,a), and that µmax,a Q. For notational simplicity, we refer to SUPP(fµmax,a) as Sa throughout. In the case of conditional principal fairness and path-specific fairness, we need a mild refinement of the previous result that accounts for ω. Corollary E.4. 
There are measures µmax,a,w Q defined for every w W = IMG(ω) and any a A such that for some ν K, Prν(W = w, A = a) > 0. These measures have the property that for any µ K, λ[SUPP(fµ ,a,w) \ SUPP(fµmax,a,w)] = 0, where fµ ,a,w is the density of the pushforward measure (µ W =w,A=a) u 1. Recalling that | IMG(ω)| < , the proof is the same as Cor. E.3, and we analogously refer to SUPP(fµmax,a,w) as Sa,w. Here, we have assumed without loss of generality as we continue to assume in the sequel that for all w W, there is some µ K such that Prµ(W = w) > 0. Remark 6. Because their support is maximal, the hypotheses of Theorem 1, in addition to implying that µmax,a is welldefined for all a A, also imply that Prµmax,a(u(X) > 0) > 0. In the case of conditional principal fairness, they further imply that Prµmax,a(W = w) > 0 for all w W and a A. Likewise, in the case of path-specific fairness, they further imply that Prµmax,a(W = wi) > 0 for i = 0, 1 and some a A. E.4.2. SHY SETS AND PROBES In the following lemmata, we demonstrate that a number of useful properties are generic in Q. We also demonstrate a short technical lemma, Lemma E.20, which allows us to use these generic properties to simplify the proof of Theorem 1. We begin with the following lemma, which is useful in verifying that certain subspaces of K form probes. Lemma E.15. Let W be a non-trivial finite dimensional subspace of K such that ν[K] = 0 for all ν W. Then, there exists µ K such that λW[Q µ] > 0. |νi| |νi|[K], where ν1, . . . , νn form a basis of W. Then, if |βi| 1 |νi|[K], it follows that i=1 βi νi Q. 1 |νi|[K], 1 |νi|[K] it follows that λW[Q µ] > 0. Next we show that, given a ν Q, a generic element of Q sees events to which ν assigns non-zero probability. While Lemma E.18 alone in principle suffices for the proof of Theorem 1, we include Lemma E.16 both for conceptual clarity and to introduce at a high level the style of argument used in the subsequent lemmata and in the proof of Theorem 1 to show that a set is shy relative to Q. Lemma E.16. For a Borel set E K, suppose there exists ν Q such that ν[E] > 0. Then the set of µ Q such that µ[E] > 0 is prevalent. Proof. First, we note that the set of µ Q such that µ[E] = 0 is closed and therefore universally measurable. For, if {µi}i N Q is a convergent sequence with limit µ, then µ[E] = lim n µi[E] Now, if µ[E] > 0 for all µ Q, there is nothing to prove. Therefore, suppose that there exists ν Q such that ν [E] = 0. Next, consider the measure ν = ν ν. Then, let W = SPAN( ν). Since ν = 0 and ν[K] = ν [K] ν[K] = 0, it follows by Lemma E.15 that λW[Q µ] > 0 for some µ. Now, for arbitrary µ Q, note that if (µ + β ν)[E] = 0, then µ[E] β ν[E] = 0 A singleton has null Lebesgue measure, and so the set of ν W such that (µ + ν)[E] = 0 is λW-null. Therefore, by Prop. E.6, the set of µ Q such that µ[E] = 0 is shy relative to Q, as desired. Causal Conceptions of Fairness and their Consequences While Lemma E.16 shows that a typical element of Q sees individual events, in the proof of Theorem 1, we require a stronger condition, namely, that a typical element of Q sees certain uncountable collections of events. To demonstrate this more complex property, we require the following technical result, which is closely related to the real analysis folk theorem that any convergent uncountable sum can contain only countably many non-zero terms. (See, e.g., Benji (2020).) Lemma E.17. Suppose µ is a totally bounded measure on (V, M), f and g are µ-measurable real-valued functions, and g = 0 µ-a.e. 
Then the set Zβ = {v V : f(v) + β g(v) = 0} has non-zero µ measure for at most countably many β R. Proof. First, we show that for any countable collection {βi}i N R, the sum P i=1 µ[Zβi] converges. Then, we show how this implies that µ[Zβ] = 0 for all but countably many β R. First, we note that for distinct β, β R, Zβ Zβ {v V : (β β ) g(v) = 0}. Now, by hypothesis, µ[{v V : g(v) = 0}] = 0, and since β β = 0, it follows that µ[{v V : (β β ) g(v) = 0}] = 0 as well. Consequently, it follows that if {Zβi}i N is a countable collection of distinct elements of R, then i=1 µ[Zβi] = µ To see that this implies that µ[Zβ] > 0 for only countably many β R, let Gϵ R consist of those β such that µ[Zβ] ϵ. Then Gϵ must be finite for all ϵ > 0, since otherwise we could form a collection {βi}i N Gϵ, in which case X contrary to what was just shown. Therefore, {β R : µ[Zβ] > 0} = is countable. We now apply Lemma E.17 to the proof of the following lemma, which states, informally, that, under a generic element of Q, u(X) is supported everywhere it is supported under some particular fixed element of Q. For instance, Lemma E.17 can be used to show that for a generic element of Q, the density of u(X) | A = a is positive λ Sa-a.e. Lemma E.18. Let ν Q and suppose ν is supported on E, i.e., ν[K \ E] = 0. Then the set of µ Q such that ν u 1 Î (µ E) u 1 is prevalent relative to Q. Lemma E.18 states, informally, that for generic µ Q, fµ E is supported everywhere fν is supported. Proof. We begin by showing that the set of µ Q such that ν u 1 Î (µ E) u 1 is Borel, and therefore universally measurable. Then, we construct a probe W and use it to show that this collection is finitely shy. To begin, let Uq denote the set of µ Q such that ν u 1[{|fµ E| = 0}] < q. We note that Uq is open. For, if µ Uq, then there exists some r > 0 such that ν u 1[{|fµ E| < r}] < q. Let ϵ = q ν u 1[{|fµ E| < r}]. Now, since ν u 1 Î λ, there exists a δ such that if λ[E ] < δ, then ν u 1[E ] < ϵ. Choose µ arbitrarily so that |µ µ |[K] < δ r. Then, by Markov s inequality, we have that λ[{|fµ E fµ E| > r}] < δ, i.e., ν u 1[{fµ E fµ E| > r}] < ϵ. Now, note that by the triangle inequality, wherever |fµ E| = 0, either |fµ E| < r or |fµ E fµ E| > r. Therefore λ[{|fµ E| = 0}] ν u 1[{|fµ E fµ E| > r}] + µ u 1[{|fµ E| < r] < ϵ + µ u 1[{|fµ E| < r] We conclude that µ Uq, and so Uq is open. Note that ν u 1 Î (µ E) u 1 if and only if λ[SUPP(fν) \ SUPP(fµ E)] = 0. Causal Conceptions of Fairness and their Consequences By the definition of the support of a function, λ SUPP(fµ) Î µ u 1. Therefore, it follows that λ[SUPP(fµ) \ SUPP(fν E)] = 0 if and only if µ u 1[SUPP(fµ) \ SUPP(fν E)] = 0. Then, it follows immediately that the set of ν Q such that µ u 1 Î (ν E) u 1 is Tn i=1 U1/i, which is, by construction, Borel, and therefore universally measurable. Prν(u(X) < t) = Z t is a continuous function of t, by the intermediate value theorem, there exists t such that Prν(u(X) S0) = Prν(u(X) S1), where S0 = SUPP(fν) ( , t) and S1 = SUPP(fν) [t, ). Then, we let E 1u 1(S0) 1u 1(S1) dν. Take W = SPAN( ν). Since ν = 0 and ν[K] = 0, we have by Lemma E.15 that λW[Q µ] > 0 for some µ. By the definition of a density, f ν is positive ( ν u 1)- a.e. Consequently, by the definition of ν, f ν is non-zero (µ u 1)-a.e. Therefore, by Lemma E.17, there exist only countably many β R such that the density of (µ + β ν) u 1 equals zero on a set of positive µ u 1-measure. 
Since countable sets have λ-measure zero and ν is arbitrary, the set of µ Q such that ν u 1 Î (µ E) u 1 is prevalent relative to Q by Prop. E.6. The following definition and technical lemma are needed to extend Theorem 1 to the cases of conditional principal fairness and path-specific fairness, which involve additional conditioning on W = ω(X). In particular, one corner case we wish to avoid in the proof of Theorem 1 is when the decision policy is non-trivial (i.e., some individuals receive a positive decision and others do not) but from the perspective of each ω-stratum, the policy is trivial (i.e., everyone in the stratum receives a positive or negative decision). Definition E.12 formalizes this pathology, and Lemma E.19 shows that this issue under a mild hypothesis does not arise for a generic element of Q. Definition E.12. We say that µ Q overlaps utilities when, for any budget-exhausting multiple threshold policy τ(x), if 0 < Eµ[τ(X)] < 1, then there exists w W such that 0 < Eµ[τ(X) | W = w] < 1. If there exists a budget-exhausting multiple threshold policy τ(x) such that 0 < Eµ[τ(X)] < 1, but for all w W, Eµ[τ(X) | W = w] {0, 1}, then we say that τ(x) splits utilities over µ. Informally, having overlapped utilities prevents a budgetexhausting threshold policy from having thresholds that fall on the utility scale exactly between the strata induced by ω i.e., a threshold policy that splits utilities. This is almost a generic condition in Q, as we shown in Lemma E.19. Lemma E.19. Let 0 < b < 1. Suppose that for all w W there exists µ Q such that Prµ(u(X) > 0, W = w) > 0. Then almost every µ Q overlaps utilities. Proof. Our goal is to show that the set E of measures µ Q such that there exists a splitting policy τ(x) is shy. To simplify the proof, we divide an conquer, showing that the set EΓ of measures µ Q such that there exists a splitting policy where the thresholds fall below w Γ W and above w / Γ is Borel, before constructing a probe that shows that it is shy. Then, we argue that E = S Γ W EΓ, which shows that E is shy. We begin by considering the linear map Φ : K R RW Φ(µ) = Prµ(u(X) = 0), (Prµ(W = w))w W . For any Γ W, the sets F Up Γ = {x R RW : x0 b, b = X F Lo Γ = {x R RW : x0 b, x0 = X are closed by construction. Therefore, since Φ is continuous, Γ W F Up Γ F Lo Γ is closed, and therefore universally measurable. Note that by our hypothesis and Cor. E.4, for all w W there exists some aw A such that Prµmax,aw,w(u(X) > 0). We use this to show that EΓ is shy. Pick w W arbitrarily, and consider the measures νw for w = w defined by νw = µmax,aw,w u(X)>0 Prµmax,aw,w(u(X) > 0) µmax,aw ,w u(X)>0 Prµmax,aw ,w (u(X) > 0). Causal Conceptions of Fairness and their Consequences We note that νw[K] = 0 by construction. Therefore, if Ww = SPAN(νw), then λWw[Q µw] > 0 for some µw by Lemma E.15. Moreover, we have that Prν(u(X) > 0) = 0 for all ν Ww, i.e., Prµ(u(X) > 0) = Prµ+ν(u(X) > 0). Now, since 0 < b < 1 and ω partitions X, it follows that Since λW[ ] = 0 for any subspace W, we can assume without loss of generality that Γ = W, . In that case, there exists wΓ W such that if w Γ, then wΓ / Γ, and vice versa. Without loss of generality, assume wΓ Γ and w / Γ. It then follows that for arbitrary µ Q, Φ(µ + β νwΓ) = Φ(µ) + β ewΓ β ew , where ew is the basis vector corresponding to w W. From this, it follows immediately by Eq. (E.4.2) that µ + β νwΓ EΓ β = min(b, Prµ(u(X) > 0)) X w Γ Prµ(W = w). This is a measure zero subset of R, and so it follows that λWwΓ[EΓ µ] = 0 for all µ K. 
Therefore, by Prop. E.6, EΓ is shy in Q. Taking the union over Γ W, it follows by Prop. E.5 that S Γ W EΓ is shy. Now, we must show that E = S Γ W EΓ. By construction, EΓ E , since the policy τ(x) = 1ω(x) Γ is budgetexhausting and separates utilities. To see the reverse inclusion, suppose µ E , i.e., that there exists a budgetexhausting multiple threshold policy τ(x) that splits utilities over µ. Then, let Γµ = {w W : Eµ[τ(X) | W = w] = 1}. Since τ(x) is budget-exhausting, it follows immediately that µ EΓµ. Therefore, E = S Γ W EΓ, and so E is shy, as desired. We conclude our discussion of shyness and shy sets with the following general lemma, which simplifies relative prevalence proofs by showing that one can, without loss of generality, restrict one s attention to the elements of the shy set itself in applying Prop. E.6. Lemma E.20. Suppose E is a universally measurable subset of a convex, completely metrizable set C in a topological vector space V . Suppose that for some finite-dimensional subpace V , λV [C + v0] > 0 for some v0 V . If, in addition, for all v E, λV [{v V : v + v E}] = 0, (17) then it follows that E is shy relative to C. Proof. Let v be arbitrary. Then, either (v + V ) E is empty or not. First, suppose it is empty. Since λV [ ] = 0 by definition, it follows immediately that in this case λV [E v] = 0. Next, suppose the intersection is not empty, and let v +v E for some fixed v V . It follows that λV [E v] = λV [{v V : v + v E}] = λV [{v V : (v + v ) + v E}] where the first equality follows by definition; the second equality follows by the translation invariance of λV , and the fact that v + V = V ; and the final inequality follows from Eq. (17). Therefore λV [E v] = 0 for arbitrary v, and so E is shy. E.5. Proof of Theorem 1 Using the lemmata above, we can prove Theorem 1. We briefly summarize what has been established so far: Lemma E.7: The set K of U-fine distributions on K is a Banach space; Lemma E.11: The subset Q of U-fine probability measures on K is a convex, completely metrizable subset of K; Lemma E.13: The subset E of Q is a universally measurable subset of K, where E is the set consisting of U-fine probability measures over which there exists a policy satisfying counterfactual equalized odds (resp., conditional principal fairness, or path-specific fairness) that is not strongly Pareto dominated. Therefore, to apply Prop. E.6, it follows that what remains is to construct a probe W and show that λW[Q + µ0] > 0 for some µ0 K but λW[E + µ] = 0 for all µ K. Proof. We divide the proof into three pieces. First, we illustrate how to construct the probe W from a particular collection of distributions {νUp a , νLo a }a A. Second, we Causal Conceptions of Fairness and their Consequences show that λW[E+µ] = 0 for all µ K. For notational and expository simplicity, we focus in these first two sections on the case of counterfactual equalized odds. Therefore, in the third section, we show how to generalize the argument to conditional principal fairness and path-specific fairness. Construction of the probe We will construct our probe to address two different cases. We recall that, by Cor. D.1, any policy that is not strongly Pareto dominated must be a budget-exhausting multiple threshold policy with nonnegative thresholds. In the first case, we consider when the candidate budget-exhausting multiple threshold policy is 1u(x)>0. By perturbing the underlying distribution by ν WLo, we will be able to break the balance requirements implied by Eq. (2). 
In the second case, we treat the possibility that the candidate budget-exhausting multiple threshold policy has a non-trivial positive threshold for at least one group. By perturbing the underlying distribution by ν WUp for an alternative set of perturbations WUp, we will again be able to break the balance requirements. More specifically, to construct our probe W = WUp + WLo, we want WUp and WLo to have a number of properties. In particular, for all ν W, perturbation by ν should not affect whether the underlying distribution is a probability distribution, and should not affect how much of the budget is available to budget-exhausting policies. Specifically, for all ν W, Z K 1 dν = 0, (18) K 1u(X)>0 dν = 0. (19) In fact, the amount of budget available to budget-exhausting policies will not change within group, i.e., for all a A and ν W, Z K 1u(X)>0,A=a dν = 0. (20) Additionally, for some distinguished y0, y1 Y, non-zero perturbations in νLo WLo should move mass between y0 and y1. That is, they should have the property that if Pr|νLo|(A = a) > 0, then K 1u(X)<0,Y =yi,A=a dνLo = 0. (21) Finally, perturbations in WUp should have the property that for any non-trivial t > 0, some mass is moved either above or below t > 0. More precisely, for any µ Q and any t such that 0 < Prµ(u(X) > t | A = a) < 1, if νUp WUp is such that Pr|νUp|(A = a) > 0, then K 1u(X)>t,A=a dνUp = 0. (22) To carry out the construction, choose distinct y0, y1 Y. Then, since µmax,a u 1[Sa [0, ra)] µmax,a u 1[Sa [ra, )] is a continuous function of ra, it follows by the intermediate value theorem that we can partition Sa into three pieces, SLo a = Sa ( , 0), SUp a,0 = Sa [0, ra), SUp a,1 = Sa [ra, ), Prµmax,a u(X) SUp a,0 = Prµmax,a u(X) SUp a,1 . Recall that K = X Y. Let πX : K X denote projection onto X, and γy : X K be the injection x 7 (x, y). We define νUp a [E] = µmax,a (γy1 πX ) 1 h E u 1 SUp a,1 i , µmax,a (γy1 πX ) 1 h E u 1 SUp a,0 i , νLo a [E] = µmax,a (γy1 πX ) 1 E u 1 SLo a µmax,a (γy0 πX ) 1 E u 1 SLo a . By construction, νUp a concentrates on {y1} u 1(Sa [0, )), while νLo a concentrates on {y0, y1} u 1(Sa ( , 0)). Moreover, if we set WUp = SPAN(νUp a )a A, WLo = SPAN(νLo a )a A, then it is easy to see that Eqs. (18) to (21) will hold. The only non-trivial case is Eq. (22). However, by Cor. E.3, the support of fµmax,a is maximal. That is, for µ Q, if 0 < Prµ(u(X) > t | A = a, u(X) > 0) < 1, then it follows that 0 < t < sup Sa. Either t ra or t > ra. First, assume t ra; then, it follows by the construction of Causal Conceptions of Fairness and their Consequences νUp a u 1[(t, )] = Z ra fmax,a dλ t fmax,a dλ ra fmax,a dλ 0 fmax,a dλ Similarly, if t > ra, νUp a u 1[(t, )] = Z t fmax,a dλ sup Sa fmax,a dλ Therefore Eq. (22) holds. Since W is non-trivial18 and ν[K] = 0 for all ν W, it follows by Lemma E.15 that λW[Q µ] > 0 for some µ K. Shyness Recall that, by Prop. E.5, a set E is shy if and only if, for an arbitrary shy set E , E \ E is shy. By Lemma E.16, a generic element of µ Q satisfies Prµ(u(X) > 0, Y (1) = yi, A = a) > 0 for i = 0, 1, and a A. Likewise, by Lemma E.18, a generic µ Q satisfies νUp a u 1 Î (µ X {y1}) u 1. Therefore, to simplify our task and recalling Remark 6, we may instead demonstrate the shyness of the set of µ Q such that: There exists a budget-exhausting multiple threshold policy τ(x) with non-negative thresholds satisfying counterfactual equalized odds over µ; For i = 0, 1, Prµ(u(X) > 0, A = a, Y (1) = yi) > 0; (23) For all a A, νUp a u 1 Î (µ α 1(a) {y1}) u 1. 
(24) By a slight abuse of notation, we continue to refer to this set as E. We note that, by the construction of W, Eq. (23) is 18In general, some or all of the νLo may be zero depending on the λ-measure of SLo a . However, as noted in Remark 6, the νUp a,i cannot be zero, since Prµmax,a(u(X) > 0) > 0 for all a A. Therefore W = {0}. not affected by perturbation by ν W, and Eq. (24) is not affected by perturbation by νLo W. In particular, by Lemma E.20, it suffices to show that λW[E µ] = 0 for µ E. Therefore, let µ E be arbitrary. Let the budget-exhausting multiple threshold policy satisfying counterfactual equalized odds over it be τ(x), so that Eµ[τ(X)] = min(b, Prµ(u(X) > 0)), with thresholds {ta}a A. We split into two cases based on whether τ(X) = 1u(X)>0 µ-a.s. or not. In both cases, we make use of the following two useful observations. First, note that as E Q, if µ + ν is not a probability measure, then µ + ν / E. Therefore, without loss of generality, we assume throughout that µ + ν is a probability measure. Second, suppose τ (x) is a policy satisfying counterfactual equalized odds over some ν Q. Then, if 0 < Eµ[τ (X)] < 1, it follows that for all a A, 0 < Eµ[τ (X) | A = a] < 1. (25) For, suppose not. Then, without loss of generality, there must be a0, a1 A such that Eµ[τ (X) | A = a0] = 0 and Eµ[τ (X) | A = a1] > 0. But then, by the law of iterated expectation, there must be some Y Y such that µ[X Y ] > 0 and so, 1X Y Eµ[τ (X) | A = a1, Y (1)] = 1X Y Eµ[τ (X) | A = a0, Y (1)], contradicting the fact that τ (x) satisfies counterfactual equalized odds over µ. Therefore, in what follows, we can assume that Eq. (25) holds. Our goal is to show that λW[E µ] = 0. Case 1 (τ(X) = 1u(X)>0). We argue as follows. First, we show that 1u(X)>0 is the unique budget-exhausting multiple threshold policy with non-negative thresholds over µ+ν for all ν W. Then, we show that the set of ν W such that 1u(x)>0 satisfies counterfactual equalized odds over µ + ν is a λW-null set. We begin by observing that WLo = {0}. For, if that were the case, then Eq. (25) would not hold for τ(x). Next, we note that by Eq. (19), for any ν W, Prµ+ν(u(X) > 0) = Prµ(u(X) > 0) Causal Conceptions of Fairness and their Consequences Eµ+ν[1u(X)>0] = min(b, Prµ+ν(u(X) > 0)). If τ (x) is a feasible multiple threshold policy with nonnegative thresholds and τ (X) = 1u(X)>0 (µ + ν)-a.s., then, as a consequence, Eµ+ν[τ (X)] < Prµ+ν(u(X) > 0) b. Therefore, it follows that 1u(X)>0 is the unique budgetexhausting multiple threshold policy over µ + ν with nonnegative thresholds. Now, note that if counterfactual equalized odds holds with decision policy τ(x) = 1u(x)>0, then, by Eq. (4) and Lemma E.6, we must have that Prµ+ν(u(X) > 0 | A = a, Y (1) = y1) = Prµ+ν(u(X) > 0 | A = a , Y (1) = y1) for a, a A.19 Now, we will show that a typical element of W breaks this balance requirement. Choose a such that νLo a = 0. Recall that ν is fixed, and let ν = ν βLo a νLo a . Let pa = Prµ+ν (u(X) > 0 | A = a , Y (1) = y1). Note that it cannot be the case that pa = 0 for all a A, since, by Eq. (23), Prµ+ν (u(X) > 0 | Y (1) = y1) > 0. Therefore, by the foregoing discussion, either pa > 0 or pa = 0 and we can choose a A such that pa > 0. Since the νLo a , νUp a,i are all mutually singular, it follows that counterfactual equalized odds can only hold over µ + ν if pa = Prµ+ν(u(X) > 0 | A = a , Y (1) = y1). 
Now, we observe that by Lemma E.6, that Prµ+ν(u(X) > 0 | A = a , Y (1) = y1) = η π + βLo a ρ η = Prµ(u(X) > 0, A = a , Y (1) = y1) π = Prµ(A = a , Y (1) = y1), K 1A=a ,Y (1)=y1 dνLo a . 19To ensure that both quantities are well-defined, here and throughout the remainder of the proof we use the fact that by Eqs. (20) and (23), Prµ+ν(u(X) > 0, A = a, Y (1) = y1) > 0. K 1u(X)>0,A=a ,Y (1)=y1 dνLo a , K 1A=a ,Y (1)=y1 dνLo a . Here, the equality follows by the fact that νLo is supported on SLo a {y0, y1} and the inequality from Eq. (21). Therefore, if, in the first case, pa > 0, then counterfactual equalized odds only holds if βLo a = e pa π since, as noted above, ρ = 0 by Eq. (21). In the second case, if pa = 0, then counterfactual equalized odds can only hold if e = pa π = 0. Since we chose a so that pa > 0 if pa = 0 and π > 0 by Eq. (23), this is impossible. In either case, we see that the set of βLo a R such that there a budget-exhausting threshold policy with positive thresholds satisfying counterfactual equalized odds over µ + ν + βLo a νLo a has λ-measure zero. That is λSPAN(νLo a )[E µ ν ] = 0. Since ν was arbitrary, it follows by Fubini s theorem that λW[E µ] = 0. Case 2 (τ(X) = 1u(X)>0). Our proof strategy is similar to the previous case. First, we show that, for a given fixed νLo WLo, there is a unique candidate policy τ(x) for being a budget-exhausting multiple threshold policy with non-negative thresholds and satisfying counterfactual equalized odds over µ + νLo + νUp for any νUp WUp. Then, we show that the set of νUp such that τ(X) satisfies counterfactual equalized odds has λWUp measure zero. Finally, we argue that this in turn implies that the set of ν W such that there exists a Pareto efficient policy satisfying counterfactual equalized odds over µ + ν has λW-measure zero. We seek to show that λWUp[E (µ + νLo)] = 0. To begin, we note that since νUp a,i concentrates on {y1} X for all a A, it follows that Eµ+νLo[d(X) | A = a, Y (1) = y0] = Eµ+νLo+νUp[d(X) | A = a, Y (1) = y0] for any νUp WUp. Now, suppose there exists some νUp WUp such that there exists a budget-exhausting multiple threshold policy τ(x) with non-negative thresholds such that counterfactual Causal Conceptions of Fairness and their Consequences equalized odds is satisfied over µ + νLo + νUp. (If not, then we are done and λWUp[E (µ + νLo)] = 0, as the measure of the empty set is zero.) Let p = Eµ+νLo[ τ(X) | A = a, Y (1) = y0]. Suppose that τ (x) is an alternative budget-exhausting multiple threshold policy with non-negative thresholds such that counterfactual equalized odds is satisfied. We seek to show that τ (X) = τ(X) (µ + νLo + νUp)-a.e. for any νUp WUp. Toward a contradiction, suppose that for some a0 A, Eµ+νLo[ τ (X) | A = a0, Y (1) = y0] < p. Since, by Eq. (23), Prµ+νLo(A = a0, Y (1) = y0) > 0, it follows that Eµ+νLo[ τ (X) | A = a0] < Eµ+νLo[ τ(X) | A = a0]. Therefore, since τ(x) is budget exhausting, there must be some a1 such that Eµ+νLo[ τ (X) | A = a1] > Eµ+νLo[ τ(X) | A = a1]. From this, it follows τ (x) can be represented by a threshold greater than or equal to that of τ(x) on α 1(a1), and hence Eµ+νLo[ τ (X) | A = a1, Y (1) = y0] Eµ+νLo[ τ(X) | A = a0, Y (1) = y0] > Eµ+νLo[ τ (X) | A = a0, Y (1) = y0], contradicting the fact that τ (x) satisfies counterfactual equalized odds. By the preceding discussion, Lemma E.10, and the fact that νLo is supported on u 1(( , 0]), τ(X) = τ (X) (µ X {y0})-a.e. By Eq. (24), it follows that τ(X) = τ (X) νUp a,i -a.e. for i = 0, 1. 
As a consequence, τ(X) = τ (X) (µ + νLo + νup)-a.e. for all νUp WUp. Therefore τ(X) is, indeed, unique, as desired. Now, we note that since τ(X) = 1u(X)>0, it follows that E[τ(X)] < Prµ(u(X) > 0). It follows that Eµ[τ(X)] = b, since τ(x) is budget exhausting. Therefore, by Eq. (19), it follows that for any budget-exhausting policy τ(X), E[ τ(X)] = b, and so τ(X) = 1u(X)>0 over µ + ν. Therefore, fix νLo and τ(X). By Eq. (25), there is some a 0 < Prµ+νLo(u(X) > ta | A = a ) < 1. Then, it follows by Eq. (22) that Z K 1u(X)> ta dνUp a = 0. Fix ν = ν βUp a νUp a . Then, for some a = a , set p = Eµ+ν [ τ(X) | A = a, Y (1) = y1]. Since the νLo a , νUp a are all mutually singular, it follows that counterfactual equalized odds can only hold over µ + ν if p = Prµ+ν(u(X) > ta | A = a , Y (1) = y1). Now, we observe that by Lemma E.6, that Prµ+ν(u(X) > ta | A = a , Y (1) = y1) = η + βUp a γ π η = Prµ+νLo(u(X) > ta | A = a , Y (1) = y1), π = Prµ+νLo(A = a , Y (1) = y1), K 1u(X)> ta ,A=a ,Y (1)=y1 dνUp a , and we note that K 1A=a ,Y (1)=y1 dνLo a . Eq. (26) can be rearranged to (p π η) β γ = 0. This can only hold if since by Eq. (22), γ = 0. Since any countable subset of R is a λ-null set, λSPAN(νUp a )[E µ ν ] = 0. Since ν was arbitrary, it follows by Fubini s theorem that λWUp[E µ νLo] = 0 in this case as well. Lastly, since νLo was also arbitrary, applying Fubini s theorem a final time gives that λW[E µ] = 0. Conditional principal fairness and path-specific fairness The extension of these results to conditional principal fairness and path-specific fairness is straightforward. All that is required is a minor modification of the probe. Causal Conceptions of Fairness and their Consequences In the case of conditional principal fairness, we set νUp a,w[E] = µmax,a,w (γ(y1,y1) πX ) 1[E u 1(SUp a,1)], µmax,a (γ(y1,y1) πX ) 1[E u 1(SUp a,w)], νLo a,w[E] = µmax,a,w (γ(y1,y1) πX ) 1[E u 1(SLo a )] µmax,a (γ(y0,y0) πX ) 1[E u 1(SLo a,w)], where γ(y,y ) : X K is the injection x 7 (x, y, y ). Our probe is then given by WUp = SPAN(νUp a,w), WLo = SPAN(νLo a,w), almost as before. The proof otherwise proceeds virtually identically, except for two points. First, recalling Remark 6, we use the fact that a generic element of Q satisfies Prµ(A = a, W = w) > 0 in place of Prµ(A = a) > 0 throughout. Second, we use the fact that ω overlaps utility in place of Eq. (25). In particular, If ω does not overlap utilities for a generic µ Q, then, by Lemma E.19, there exists w W such that Prµ(u(X) > 0, W = w) = 0 for all µ Q. If this occurs, we can show that no budget-exhausting multiple threshold policy with positive thresholds satisfies conditional principal fairness, exactly as we did to show Eq. (25). In the case of path-specific fairness, we instead define SLo a,w = Sa,w ( , ra,w), SUp a,w = Sa,w [ra,w, ), where ra,w is chosen so that Prµmax,a,w(u(X) SLo a,w) = Prµmax,a,w(u(X) SUp a,w). Let πX denote the projection from K = A X A given by (a, (xa )a A) 7 xa. Let πa denote the projection from the a -th component. (That is, given µ K, the distribution of XΠ,A,a over µ is given by µ π 1 a and the distribution of X is given by µ π 1 X .) Then, we let µmax,a,w be the measure on X given by µmax,a,w[E] = µmax,a,w[E (u πa) 1(s Up a,w)] µmax,a,w[E (u πa) 1(SLo a,w)]. Finally, let ϕ : A A be a permutation of the groups with no fixed points, i.e., so that a = ϕ(a ) for all a A. Then, we define νa = δa µmax,ϕ(a ),w1 Y a =ϕ(a ) µmax,a,w1 π 1 a , where δa is the measure on A given by δa[{a }] = 1a=a . Then, simply let W = SPAN(ν a)a A. 
Since µmax,a,w[X] = 0 for all a A, it follows that νa,w π 1 X = 0, i.e., Prµ(X E) = Prµ+ν(X E) for any ν W and µ Q. Therefore Eqs. (18) and (19) hold. Moreover, the νa satisfy the following strengthening of Eq. (22). Perturbations in W have the property that for any non-trivial t not necessarily positive some of the mass of u(XΠ,A,a) is moved either above or below t. More precisely, for any µ Q and any t such that 0 < Prµ(u(X) > t | A = a) < 1, if ν W is such that Pr|ν|(A = ϕ 1(a)) > 0, then Z K 1u(XΠ,A,a)>t dνa = 0. (27) This stronger property means that we need not separately treat the case where τ(X) = 1u(X)>0 µ-a.e. Other than this difference the proof proceeds in the same way, except for two points. First, we again make use of the fact that ω can be assumed to overlap utilities in place of Eq. (25), as in the case of conditional principal fairness. Second, w0 and w1 take the place of y0 and y1. In particular, to establish the uniqueness of τ(x) given µ and νLo in the second case, instead of conditioning on y0, we instead condition on w0, where, following the discussion in Remark 6 and Lemma E.16, this conditioning is well-defined for a generic element of Q. We have focused on causal definitions of fairness, but the thrust of our analysis applies to non-causal conceptions of fairness as well. Below we show that policies constrained to satisfy (non-counterfactual) equalized odds (Hardt et al., 2016) are generically strongly Pareto dominated, a result that follows immediately from our proof above. Definition E.13. Equalized odds holds for a decision policy d(x) when d(X) A | Y. (28) We note that Y in Eq. (28) does not depend on our choice of d(x), but rather represents the realized value of Y , e.g., under some status quo decision making rule. Corollary E.5. Suppose U is a set of utilities consistent modulo α. Further suppose that for all a A there exist a U-fine distribution of X and a utility u U such that Causal Conceptions of Fairness and their Consequences Pr(u(X) > 0, A = a) > 0, where A = α(X). Then, for almost every U-fine distribution of X and Y on X Y, any decision policy d(x) satisfying equalized odds is strongly Pareto dominated. Proof. Consider the following maps. Distributions of X and Y (1), i.e., probability measures on X Y, can be embedded in the space of joint distributions on X, Y (0), and Y (1) via pushing forward by the map ι, where ι : (x, y) 7 (x, y, y). Likewise, given a fixed decision policy D = d(X), joint distributions of X, Y (0), and Y (1) can be projected onto the space of joint distributions of X and Y by pushing forward by the map πd : (x, y0, y1) 7 (x, yd(x)). Lastly, we see that the composition of ι and πd regardless of our choice of d(x) is the identity, as shown in the diagram below. We note also that counterfactual equalized odds holds for µ exactly when equalized odds holds for µ (πd ι) 1. The result follows immediately from this and Theorem 1. E.6. General Measures on K Theorem 1 is restricted to U-fine and UA-fine distributions on the state space. The reason for this restriction is that when the distribution of X induces atoms on the utility scale, threshold policies will, in general, possess additional or even infinite degrees of freedom when the threshold falls exactly on an atom. In particular circumstances, these degrees of freedom can be used to ensure causal fairness notions, such as counterfactual equalized odds, hold in a locally robust way. 
In particular, the generalization of Theorem 1 beyond U-fine measures to all totally bounded measures on the state space is false, as illustrated by the following proposition.

Proposition E.7. Consider the set E′ of distributions on K = X × Y (not necessarily U-fine) over which there exists a Pareto efficient policy satisfying counterfactual equalized odds. There exist b, X, Y, and U such that E′ is not relatively shy.

Proof. We adopt the notational conventions of Section E.3. We note that by Prop. E.5, a set can only be shy if it has empty interior. Therefore, we will construct an example in which an open ball of distributions on K in the total variation norm all allow for a Pareto efficient policy satisfying counterfactual equalized odds, i.e., are contained in E′.

Let b = 3/4, Y = {0, 1}, A = {a0, a1}, and X = {0, 1} × {a0, a1} × R. Let α : X → A be given by α : (y, a, v) ↦ a for arbitrary (y, a, v) ∈ X. Likewise, let u : X → R be given by u : (y, a, v) ↦ v. Then, if U = {u}, U is vacuously consistent modulo α. Consider the joint distribution µ on K = X × Y under which, for all y, y′ ∈ Y, a ∈ A, and Borel B ⊆ R,

Pr_µ(X ∈ {(y, a)} × B, Y(1) = y′) = (1/4) · 1_{y = y′} · Pr_µ(u(X) ∈ B),

where, over µ, u(X) is distributed as a 1/2-mixture of UNIF(1, 2) and δ(1); that is, Pr_µ(u(X) = 1) = 1/2 and Pr_µ(1 + s < u(X) < 1 + t) = (t − s)/2 for 0 ≤ s ≤ t ≤ 1.

We first observe that there exists a Pareto efficient threshold policy τ(x) such that counterfactual equalized odds is satisfied with respect to the decision policy τ(X). Namely, let

τ(y, a, u) = 1 if u > 1, 1/2 if u = 1, and 0 if u < 1.

Then, it immediately follows that E[τ(X)] = 3/4 = b. Since τ(x) is a threshold policy and exhausts the budget, it is utility maximizing by Lemma D.4. Moreover, if D = 1_{U_D ≤ τ(X)} for some U_D ∼ UNIF(0, 1) independent of X and Y(1), then D ⊥ A | Y(1). Since u(X) ⊥ (A, Y(1)), it follows that

Pr_µ(D = 1 | A = a, Y(1) = y) = Pr(U_D ≤ τ(X) | A = a, Y(1) = y) = Pr(U_D ≤ τ(X)) = E_µ[τ(X)].

Therefore Eq. (2) is satisfied, i.e., counterfactual equalized odds holds.

Now, using µ, we construct an open ball of distributions over which we can construct similar threshold policies. In particular, suppose µ′ is any distribution such that |µ − µ′|[K] < 1/64. Then, we claim that there exists a budget-exhausting threshold policy satisfying counterfactual equalized odds over µ′. For, writing U = u(X), we note that

Pr_µ′(U > 1) < Pr_µ(U > 1) + 1/64 < 3/4 and Pr_µ′(U ≥ 1) > Pr_µ(U ≥ 1) − 1/64 > 3/4,

and so any threshold policy τ′(x) satisfying E[τ′(X)] = b = 3/4 must have t = 1 as its threshold.

We will now construct a threshold policy τ′(x) satisfying counterfactual equalized odds over µ′. Consider a threshold policy of the form

τ′(y, a, u) = 1 if u > 1, p_{a,y} if u = 1, and 0 if u < 1.

For notational simplicity, let

q_{a,y} = Pr_µ′(A = a, Y(1) = y, U > 1),
r_{a,y} = Pr_µ′(A = a, Y(1) = y, U = 1),
π_{a,y} = Pr_µ′(A = a, Y(1) = y).

Then, we have that

E_µ′[τ′(X)] = Σ_{a,y} (q_{a,y} + p_{a,y} · r_{a,y}),
E_µ′[τ′(X) | A = a, Y(1) = y] = (q_{a,y} + p_{a,y} · r_{a,y}) / π_{a,y}.

Therefore, the policy will be budget exhausting if

Σ_{a,y} (q_{a,y} + p_{a,y} · r_{a,y}) = 3/4,   (29)

and it will satisfy counterfactual equalized odds if

π_{a1,0} · (q_{a0,0} + p_{a0,0} · r_{a0,0}) = π_{a0,0} · (q_{a1,0} + p_{a1,0} · r_{a1,0}),
π_{a1,1} · (q_{a0,1} + p_{a0,1} · r_{a0,1}) = π_{a0,1} · (q_{a1,1} + p_{a1,1} · r_{a1,1}),

since, as above, Pr(D = 1 | A = a, Y(1) = y) = E[τ′(X) | A = a, Y(1) = y]. Again, for notational simplicity, let

S = (3/4 − Pr_µ′(U > 1)) / Pr_µ′(U = 1).

Then, a straightforward algebraic manipulation shows that Eq. (29) and the two balance conditions above are solved by setting p_{a0,y} to

[S · π_{a0,y} · (r_{a0,y} + r_{a1,y}) + π_{a0,y} · q_{a1,y} − π_{a1,y} · q_{a0,y}] / [r_{a0,y} · (π_{a0,y} + π_{a1,y})],

and p_{a1,y} to

[S · π_{a1,y} · (r_{a0,y} + r_{a1,y}) + π_{a1,y} · q_{a0,y} − π_{a0,y} · q_{a1,y}] / [r_{a1,y} · (π_{a0,y} + π_{a1,y})].

(A numerical sanity check of these expressions is sketched below.)
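The following sketch is purely illustrative and not part of the proof; it assumes Python with numpy. It perturbs the eight cell probabilities Pr(A = a, Y(1) = y, U = 1) and Pr(A = a, Y(1) = y, U > 1), which all equal 1/8 under µ, by random signed perturbations of total variation less than 1/64, and checks that the closed-form p_{a,y} yield a well-defined, budget-exhausting policy satisfying the balance conditions.

```python
# Numerical sanity check of the closed-form p_{a,y} derived above (illustrative only).
import numpy as np

rng = np.random.default_rng(0)
b = 3 / 4

for _ in range(1_000):
    # Random signed perturbation of the eight cells with zero sum and total mass < 1/64.
    delta = rng.normal(size=8)
    delta -= delta.mean()
    delta *= (1 / 64) * rng.uniform(0, 1) / np.abs(delta).sum()
    cells = 1 / 8 + delta  # order: (a0,y=0,U=1), (a0,y=0,U>1), (a0,y=1,U=1), ...

    r = {(a, y): cells[4 * a + 2 * y] for a in (0, 1) for y in (0, 1)}      # Pr(A=a, Y(1)=y, U=1)
    q = {(a, y): cells[4 * a + 2 * y + 1] for a in (0, 1) for y in (0, 1)}  # Pr(A=a, Y(1)=y, U>1)
    pi = {(a, y): r[a, y] + q[a, y] for a in (0, 1) for y in (0, 1)}        # Pr(A=a, Y(1)=y)

    S = (b - sum(q.values())) / sum(r.values())
    p = {}
    for y in (0, 1):
        r_sum, pi_sum = r[0, y] + r[1, y], pi[0, y] + pi[1, y]
        p[0, y] = (S * pi[0, y] * r_sum + pi[0, y] * q[1, y] - pi[1, y] * q[0, y]) / (r[0, y] * pi_sum)
        p[1, y] = (S * pi[1, y] * r_sum + pi[1, y] * q[0, y] - pi[0, y] * q[1, y]) / (r[1, y] * pi_sum)

    # Well-defined policy, budget exhaustion, and counterfactual equalized odds balance.
    assert all(0 <= v <= 1 for v in p.values())
    assert np.isclose(sum(q[a, y] + p[a, y] * r[a, y] for a in (0, 1) for y in (0, 1)), b)
    for y in (0, 1):
        assert np.isclose((q[0, y] + p[0, y] * r[0, y]) / pi[0, y],
                          (q[1, y] + p[1, y] * r[1, y]) / pi[1, y])

print("all perturbed distributions admit a budget-exhausting fair threshold policy")
```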
In order for τ (x) to be a well-defined policy, we need to show that pa,y [0, 1] for all a A and y Y. To that end, note that qa,y = Prµ (A = a, Y = y, U > 1), ra,y = Prµ (A = a, Y = y, U = 1), πa,y = Prµ (A = a, Y = y), ra0,y + ra1,y = Prµ (Y = y, U = 1), πa0,y + πa1,y = Prµ (Y = y), 3 4 Prµ (U > 1) Prµ (U = 1) . Now, we recall that | Prµ (E) Prµ(E)| < 1 64 for any event E by hypothesis. Therefore, 7 64 qa,y 9 7 64 ra,y 9 7 64 πa,y 17 15 64 ra0,y + ra1,y 17 31 64 πa0,y + πa1,y 33 Using these bounds and the expressions for pa,y derived above, we see that 629 3069 < pa,y < 6497 and hence pa,y [0, 1] for all a A and y Y. Therefore, the policy τ (x) is well-defined, and, by construction, is budget-exhausting and therefore utility-maximizing by Lemma D.4. It also satisfies counterfactual equalized odds by construction. Since µ was arbitrary, it follows that the set of distributions on K such that there exists a Pareto efficient policy satisfying counterfactual equalized odds contains an open ball, and hence is not shy. F. Theorem 2 and Related Results We first prove a variant of Theorem 2 for general, continuous covariates X. Then, we extend and generalize Theorem 2 using the theory of finite Markov chains, offering a proof of the theorem different from the sketch included in the main text. Causal Conceptions of Fairness and their Consequences F.1. Extension to Continuous Covariates Here we follow the proof sketch in the main text for Theorem 2, which assumes a finite covariate-space X. In that case, we start with a point x with maximum decision probability d(x ), and then assume, toward a contradiction, that there exists a point with strictly lower decision probability. The general case is more involved since it is not immediately clear that the maximum value of d(x) is achieved with positive probability in X. We start with the lemma below before proving the main result. Lemma F.21. A decision policy d(x) satisfies path-specific fairness with W = X if and only if any a A, E[d(XΠ,A,a ) | X] = d(X). Proof. First, suppose that d(x) satisfies path-specific fairness. To show the result, we use the standard fact that for independent random variables X and U, E[f(X, U) | X] = Z f(X, u) d FU(u), (30) where FU is the distribution of U. (For a proof of this fact see, for example, Brozius, 2019) Now, we have that E[DΠ,A,a | XΠ,A,a ] = E[1UD d(XΠ,A,a ) | XΠ,A,a ] 0 1u d(XΠ,A,a ) du = d(XΠ,A,a ), where the first equality follows from the definition of DΠ,A,a , and the second from Eq. (30), since the exogenous variable UD UNIF(0, 1) is independent of the counterfactual covariates XΠ,A,a . An analogous argument shows that E[D | X] = d(X). Finally, conditioning on X, we have E[d(XΠ,A,a ) | X] = E[E[DΠ,A,a | XΠ,A,a ] | X] = E[E[DΠ,A,a | XΠ,A,a , X] | X] = E[DΠ,A,a | X] where the second equality follows from the fact that DΠ,A,a X | XΠ,A,a , the third from the law of iterated expectations, and the fourth from the definition of pathspecific fairness. Next, suppose that E[d(XΠ,A,a | X] = d(X) for all a A. Then, since W = X and X UD, using Eq. (30), we have that for all a A, E[DΠ,A,a | X] = E[E[1UD d(XΠ,A,a ) | XΠ,A,a , X] | X] = E[E[d(XΠ,A,a ) | XΠ,A,a , X] | X] = E[d(XΠ,A,a ) | X] = E[d(X) | X] = E[D | X]. This is exactly Eq. (5), and so the result follows. We are now ready to prove a continuous variant of Theorem 2. 
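Before turning to the continuous case, we note that the characterization in Lemma F.21 is easy to examine numerically on a finite example. The sketch below is illustrative only and assumes Python with numpy; the counterfactual kernel P_a1, the four covariate values, their marginal distribution, and the two candidate policies are hypothetical stand-ins, not objects defined in the paper, and only the single intervention a′ = a1 is checked.

```python
# Illustrative finite-state check of Lemma F.21 (W = X): a policy d satisfies
# path-specific fairness for a' = a1 iff E[d(X_{Pi,A,a1}) | X] = d(X), i.e., P_a1 @ d = d.
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical counterfactual kernel: P_a1[i, j] = Pr(X_{Pi,A,a1} = x_j | X = x_i).
# States: 0 = (group a0, low score), 1 = (a0, high), 2 = (a1, low), 3 = (a1, high);
# intervening on A is assumed to preserve the score level.
P_a1 = np.array([[0.0, 0.0, 1.0, 0.0],   # (a0, low)  -> (a1, low)
                 [0.0, 0.0, 0.0, 1.0],   # (a0, high) -> (a1, high)
                 [0.0, 0.0, 1.0, 0.0],   # (a1, low)  -> itself: A is already a1
                 [0.0, 0.0, 0.0, 1.0]])  # (a1, high) -> itself
marginal = np.array([0.3, 0.2, 0.3, 0.2])  # distribution of X

def pathwise_gap(d, n=200_000):
    """Max over states x of |E[D_{Pi,A,a1} | X=x] - E[D | X=x]|, estimated by simulation."""
    x = rng.choice(4, size=n, p=marginal)
    # Inverse-CDF sampling of the counterfactual covariates X_{Pi,A,a1} given X.
    x_cf = (P_a1.cumsum(axis=1)[x] < rng.uniform(size=(n, 1))).sum(axis=1)
    u = rng.uniform(size=n)  # shared exogenous U_D, as in D = 1{U_D <= d(X)} (Eq. (30))
    d_fact, d_cf = (u <= d[x]).astype(float), (u <= d[x_cf]).astype(float)
    return max(abs(d_cf[x == i].mean() - d_fact[x == i].mean()) for i in range(4))

d_fair = np.array([0.2, 0.7, 0.2, 0.7])    # depends only on the score level
d_unfair = np.array([0.2, 0.7, 0.5, 0.9])  # depends on group membership as well

print(np.allclose(P_a1 @ d_fair, d_fair), pathwise_gap(d_fair))        # True, 0.0
print(np.allclose(P_a1 @ d_unfair, d_unfair), pathwise_gap(d_unfair))  # False, approx. 0.3
```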
The technical hypotheses of the theorem ensure that the conditional probability measures Pr(E | X) are sufficiently mutually non-singular distributions on X with respect to the distribution of X for example, the conditions ensure that the conditional distribution of XΠ,A,a | X does not have atoms that X itself does not have, and vice versa. For notational and conceptual simplicity, we only consider the case of trivial ζ, i.e., where ζ(x) = ζ(x ) for all x, x X. Proposition F.8. Suppose that 1. For all a A and any event S satisfying Pr(X S | A = a) > 0, we have, a.s., Pr(XΠ,A,a S A = a | X) > 0. 2. For all a A and ϵ > 0, there exists δ > 0 such that for any event S satisfying Pr(X S | A = a) < δ, we have, a.s., Pr(XΠ,A,a S, A = a | X) < ϵ. Then, for W = X, any Π-fair policy d(x) is constant a.s. (i.e., d(X) = p a.s. for some 0 p 1). Proof. Let dmax = d(x) , the essential supremum of d. To establish the theorem statement, we show that Pr(d(X) = dmax | A = a) = 1 for all a A. To do that, we begin by showing that there exists some a A such that Pr(d(X) = dmax | A = a) > 0. Assume, toward a contradiction, that for all a A, Pr(d(X) = dmax | A = a) = 0. (31) Because A is finite, there must be some a0 A such that Pr(dmax d(X) < ϵ | A = a0) > 0 (32) for all ϵ > 0. Causal Conceptions of Fairness and their Consequences Choose a1 = a0. We show that for values of x such that d(x) is close to dmax, the distribution of d(XΠ,A,a1) | X = x must be concentrated near dmax with high probability to satisfy the definition of path-specific fairness, in Eq. (5). But, under the assumption in Eq. (31), we also show that the concentration occurs with low probability, by the continuity hypothesis in the statement of the theorem, establishing the contradiction. Specifically, by Markov s inequality, for any ρ > 0, a.s., Pr(dmax d(XΠ,A,a1) ρ | X) E[dmax d(XΠ,A,a1) | X] = dmax d(X) where the final equality follows from Lemma F.21. Rearranging, it follows that for any ρ > 0, a.s., Pr(dmax d(XΠ,A,a1) < ρ | X) 1 dmax d(X) Now let S = {x X : dmax d(x) < ρ}. By the second hypothesis of the theorem, we can choose δ sufficiently small that if Pr(X S | A = a1) < δ then, a.s., Pr(XΠ,A,a1 S, A = a1 | X) < 1 In other words, we can chose δ such that if Pr(dmax d(X) < ρ | A = a1) < δ then, a.s., Pr(dmax d(XΠ,A,a1) < ρ, A = a1 | X) < 1 By Eq. (31), we can choose ϵ > 0 so small that Pr(dmax d(X) < ϵ | A = a1) < δ. Then, we have that Pr(dmax d(XΠ,A,a1) < ϵ, A = a1 | X) < 1 a.s. Further, by the definition of the essential supremum and a0, and the fact that a0 = a1, we have that Pr(dmax d(X) < ϵ 2, A = a1) > 0. Therefore, with positive probability, we have that 1 dmax d(X) 2 > Pr(dmax d(XΠ,A,a1) < ϵ, A = a1 | X). This contradicts Eq. (33), and so it cannot be the case that Pr(d(X) = dmax | A = a0) = 0, meaning Pr(d(X) = dmax | A = a0) > 0. Now, we show that Pr(d(X) = dmax | A = a1) = 1. Suppose, toward a contradiction, that Pr(d(X) < dmax | A = a1) > 0. Then, by the first hypothesis, a.s., Pr(d(XΠ,A,a1) < dmax A = a1 | X) > 0 As a consequence, dmax = E[d(X) | d(X) = dmax, A = a0] = E[E[d(XΠ,A,a1) | X] | d(X) = dmax, A = a0] < E[E[dmax | X] | d(X) = dmax, A = a0] = E[dmax | d(X) = dmax, A = a0] where we can condition on the set {d(X) = dmax, A = a0} since Pr(d(X) = dmax | A = a0) > 0; and the second equality above follows from Lemma F.21. This establishes the contradiction, and so Pr(d(X) = dmax | A = a1) = 1. Finally, we extend this equality to all a A. 
Since, Pr(d(X) = dmax | A = a1) = 0, we have, by the second hypothesis of the theorem, that, a.s., Pr(d(XΠ,A,a1) = dmax, A = a1 | X) = 0. Since, by definition, Pr(XΠ,A,a1 = X | A = a1) = 1, and Pr(d(X) = dmax | A = a1) = 1, we can strengthen this to Pr(d(XΠ,A,a1) = dmax | X) = 0. Consequently, a.s., d(X) = E[d(XΠ,A,a) | X] = E[dmax | X] where the first equality follows from Lemma F.21, establishing the result. Causal Conceptions of Fairness and their Consequences F.2. A Markov Chain Perspective The theory of Markov chains illuminates and allows us to extend the proof of Theorem 2. Suppose X = {x1, . . . , xn}.20 For any a A, let Pa = [pa i,j] where pa i,j = Pr(XΠ,A,a = xj | X = xi). Then Pa is a stochastic matrix. To motivate the subsequent discussion, we first note that this perspective conceptually simplifies some of our earlier results. Lemma F.21 can be recast as stating that when W = X, a policy d is Π-fair if and only if Pa d = d i.e., if and only if d is a 1-eigenvector of Pa for all a A. The 1-eigenvectors of Markov chains have a particularly simple structure, which we derive here for completeness. Lemma F.22. Let S1, . . . , Sm denote the recurrent classes of a finite Markov chain with transition matrix P. If d is a 1-eigenvector of P, then d takes a constant value pk on each Sk, k = 1, . . . , m, and j Sk P n ij Remark 7. We note that limn P j Sk P n i,j always exists and is the probability that the Markov chain, beginning at state i, is eventually absorbed by the recurrent class Sk. Proof. Note that, possibly by reordering the states, we can arrange that the stochastic matrix P is in canonical form, i.e., that where Q is a sub-stochastic matrix, R is non-negative, and P1 P2 ... Pm is a block-diagonal matrix with the stochastic matrix Pi corresponding to the transition probabilities on the recurrent set Si in the i-th position along the diagonal. Now, consider a 1-eigenvector v = [v1 v2] of P. We must have that Pv = v, i.e., Bv1 = v1 and R v1 + Qv2 = v2. Therefore v1 is a 1-eigenvector of B. Since B is block diagonal, and each diagonal element is a positive stochastic matrix, it follows by the Perron-Frobenius theorem that the 1-eigenvectors of B are given by SPAN(1Si)i=1,...,m, where 20Because of the technical difficulties associated with characterizing the long-run behavior of arbitrary infinite Markov chains, we restrict our attention in this section to Markov chains with finite state spaces. 1Si is the vector which is 1 at index j if j Si and is 0 otherwise. Now, for v1 SPAN(1Si)i=1,...,m, we must find v2 such that R v1 + Qv2 = v2. Note that every finite Markov chain M can be canonically associated with an absorbing Markov chain M ABS where the set of states of M ABS is exactly the union of the transitive states of M and the recurrent sets of M. (In essence, one tracks which state of M the Markov chain is in until it is absorbed by one of the recurrent sets, at which point the entire recurrent set is treated as a single absorbent state.) The transition matrix P ABS associated with M ABS is given by P ABS = I R Q where R = R [1S1 . . . 1Sm]. In particular, it follows that v = [v1 v2] is a 1-eigenvector of P if and only if [Tv1 v2] is a 1-eigenvector of P ABS, where T : 1Si 7 ei. Now, if v is a 1-eigenvector of P ABS, then it is a 1eigenvector of (P ABS)k for all k. Since Q is sub-stochastic, the series P k=0 Qk converges to (I Q) 1. 
Since (P ABS)k = I (I + Q + + Qk 1)R Qk it follows that lim k (P ABS)k = I (I Q) 1R 0 Therefore, if v = [v1 v2] is a 1-eigenvector of P ABS, we must have that (I Q) 1Rv1 = v2. By Theorem 3.3.7 in Kemeny & Snell (1976), the (i, k) entry of (I Q) 1R is exactly the probability that, conditional on X0 = xi, the Markov chain is eventually absorbed by the recurrent set Sk. This is, in turn, by the Chapman-Kolmogorov equations and the definition of Sk, equal to limn P j Sk P n i,j, and therefore the result follows. We arrive at the following simple necessary condition on Π-fair policies. Corollary F.6. Suppose X is finite, and let P = 1 |A| P a A Pa . If d(x) is a Π-fair policy then it is constant on the recurrent classes of P. Proof. By Lemma F.21, d is Π-fair if and only if Pa d = d for all a A. Therefore, a A Pa d = 1 |A| a A d = d, (35) and so d is a 1-eigenvector of P. Therefore it is constant on the recurrent classes of P by Lemma F.22. Causal Conceptions of Fairness and their Consequences We note that Theorem 2 follows immediately from this. Proof of Theorem 2. Note that 1 |A| P a A Pa decomposes into a block diagonal stochastic matrix, where each block corresponds to a single stratum of ζ and is irreducible. Consequently, each stratum forms a recurrent class, and the result follows. G. Proof of Proposition 2 To prove the proposition, we must characterize the conditional tail risks of the beta distribution. Note that in the main text, we parameterize beta distributions in terms of their mean µ and sample size v; here, for mathematical simplicity, we parameterize them in terms of successes, α, and failures, β, where µ = α α+β and v = α + β. Lemma G.23. Suppose Zi BETA(αi, βi) for i = 0, 1, and that α0 > α1 > 1, 1 < β0 < β1. Then, for all t (0, 1], E[Z1 | Z1 < t] < E[Z0 | Z0 < t]. Proof. Let Z(α, β) BETA(α, β). Then, E[Z(α, β) | Z(α, β) < t] = R t 0 x xα 1(1 x)β 1 B(α,β) dx R t 0 xα 1(1 x)β 1 R t 0 xα(1 x)β 1 dx R t 0 xα 1(1 x)β 1 dx . Since α > 1, we may take the partial derivative with respect to α by differentiating under the integral sign, which yields that αE[Z(α, β) | Z(α, β) < t] equals α I(t, α, β)2 [α 1] I(t, α + 1, β) I(t, α 1, β) I(t, α, β)2 , where I(x, α, β) = R t 0 xα 1(1 x)β 1 dx. Rearranging gives that this is greater than zero when 0 < α I(t, α + 1, β) Z t 0 (xα 2 xα 1)(1 x)β dx + α I(t, α, β) Z t 0 (xα 1 xα)(1 x)β dx + I(t, α + 1, β) I(t, α 1, β). Since all of the integrands are positive, αE[Z(α, β) | Z(α, β) < t] > 0. A virtually identical argument shows that β E[Z(α, β) | Z(α, β) < t] < 0. Therefore, the result follows. We use this lemma to prove a modest generalization of Prop. 2. Lemma G.24. Suppose A = {a0, a1}, and consider the family U of utility functions of the form u(x) = r(x) + λ 1α(x)=a1, indexed by λ 0, where r(x) = E[Y (1) | X = x]. Suppose the conditional distributions of r(X) given A are beta distributed, i.e., D(r(X) | A = a) = BETA(αa, βa), with 1 < αa1 < αa0 and 1 < βa0 < βa1. Then any policy satisfying counterfactual predictive parity is strongly Pareto dominated. Proof. Suppose there were a Pareto efficient policy satisfying counterfactual predictive parity. Let λ = 0. Then, by Prop. 1, we may without loss of generality assume that there exist thresholds ta0, ta1 and a constant p such that a threshold policy τ(x) witnessing Pareto efficiency is given by ( 1 r(x) > tα(x), 0 r(x) < tα(x). (Note that by our distributional assumption, Pr(u(x) = t) = 0 for all t [0, 1].) Since λ 0, we must have that ta0 ta1. Since b < 1, 0 < ta0. 
Therefore,

E[Y(1) | A = a0, D = 0] = E[r(X) | A = a0, u(X) < t_{a0}]
  ≥ E[r(X) | A = a0, u(X) < t_{a1}]
  > E[r(X) | A = a1, u(X) < t_{a1}]
  = E[Y(1) | A = a1, D = 0],

where the first equality follows by the law of iterated expectation, the inequality in the second step from the fact that t_{a1} ≤ t_{a0}, the strict inequality in the third step from our distributional assumption and Lemma G.23, and the final equality again from the law of iterated expectation. However, since counterfactual predictive parity is satisfied, E[Y(1) | A = a0, D = 0] = E[Y(1) | A = a1, D = 0], which is a contradiction. Therefore, no such threshold policy exists.

After accounting for the difference in parameterization, Prop. 2 follows as a corollary.

Proof of Prop. 2. Note that, under the hypotheses of the proposition, α_a = µ_a · v > 1 and β_a = v − α_a = v · (1 − µ_a) > 1. Moreover, since µ_{a0} > µ_{a1}, we have α_{a0} = v · µ_{a0} > v · µ_{a1} = α_{a1} and β_{a0} = v · (1 − µ_{a0}) < v · (1 − µ_{a1}) = β_{a1}. Therefore 1 < α_{a1} < α_{a0} and 1 < β_{a0} < β_{a1}, and so, by Lemma G.24, the proposition follows.
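As a numerical companion to Lemma G.23 and the proof above, note that for Z ∼ BETA(α, β) the conditional tail mean admits the closed form E[Z | Z < t] = (α/(α + β)) · F_{α+1,β}(t) / F_{α,β}(t), where F denotes the beta CDF; this follows from x · f_{α,β}(x) = (α/(α + β)) · f_{α+1,β}(x). The sketch below is illustrative only and assumes Python with scipy; the specific parameter values are arbitrary choices satisfying the hypotheses of Lemma G.23.

```python
# Numerical illustration of Lemma G.23 and its role in Lemma G.24 / Prop. 2 (illustrative only).
import numpy as np
from scipy.stats import beta as beta_dist

def cond_mean_below(t, a, b):
    """E[Z | Z < t] for Z ~ Beta(a, b), via x * f_{a,b}(x) = (a / (a + b)) * f_{a+1,b}(x)."""
    return (a / (a + b)) * beta_dist.cdf(t, a + 1, b) / beta_dist.cdf(t, a, b)

# Parameters chosen (arbitrarily) to satisfy the hypotheses of Lemma G.23:
# alpha0 > alpha1 > 1 and 1 < beta0 < beta1.
alpha0, beta0 = 4.0, 2.0
alpha1, beta1 = 2.5, 3.5

ts = np.linspace(0.01, 1.0, 100)
assert all(cond_mean_below(t, alpha1, beta1) < cond_mean_below(t, alpha0, beta0) for t in ts)

# In the (mean mu, sample size v) parameterization of Prop. 2: with v = 6,
# mu_a0 = 2/3 and mu_a1 = 5/12, the same parameters arise as alpha_a = mu_a * v and
# beta_a = v * (1 - mu_a). Group a0 has the larger conditional mean below any common
# threshold, so no threshold policy with t_a1 <= t_a0 can equalize E[Y(1) | A, D = 0],
# which is the contradiction driving Lemma G.24.
v, mu_a0, mu_a1 = 6.0, 2 / 3, 5 / 12
print(cond_mean_below(0.5, mu_a0 * v, (1 - mu_a0) * v),  # E[r(X) | A = a0, r(X) < 0.5]
      cond_mean_below(0.5, mu_a1 * v, (1 - mu_a1) * v))  # E[r(X) | A = a1, r(X) < 0.5]
```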