# On the Within-Group Fairness of Screening Classifiers

*Nastaran Okati, Stratis Tsirtsis, Manuel Gomez-Rodriguez (Max Planck Institute for Software Systems)*

*Published in the Proceedings of the 40th International Conference on Machine Learning (ICML 2023), PMLR 202.*

**Abstract.** Screening classifiers are increasingly used to identify qualified candidates in a variety of selection processes. In this context, it has recently been shown that, if a classifier is calibrated, one can identify the smallest set of candidates which contains, in expectation, a desired number of qualified candidates using a threshold decision rule. This lends support to focusing on calibration as the only requirement for screening classifiers. In this paper, we argue that screening policies that use calibrated classifiers may suffer from an understudied type of within-group unfairness: they may unfairly treat qualified members within demographic groups of interest. Further, we argue that this type of unfairness can be avoided if classifiers satisfy within-group monotonicity, a natural monotonicity property within each group. Then, we introduce an efficient post-processing algorithm based on dynamic programming to minimally modify a given calibrated classifier so that its probability estimates satisfy within-group monotonicity. We validate our algorithm using US Census survey data and show that within-group monotonicity can often be achieved at a small cost in terms of prediction granularity and shortlist size.

## 1. Introduction

As many selection processes receive hundreds or even thousands of applications, it has become increasingly common to rely on automated screening tools to shortlist a tractable set of promising candidates. These shortlisted candidates then move forward in the selection process and are evaluated in detail, possibly multiple times, until one or more qualified candidates are selected. The benefits and harms posed by automated screening have been investigated in many high-stakes domains, including medicine (Etzioni et al., 2003; Shen et al., 2019), recruiting (Cowgill, 2018; Raghavan et al., 2020) and content moderation (Gorwa et al., 2020).

In the machine learning literature, algorithmic screening has been studied together with other high-stakes decision making problems as a supervised learning problem (Corbett-Davies et al., 2017; Kilbertus et al., 2020; Sahoo et al., 2021). Under this view, algorithmic screening consists of designing both a screening classifier, which estimates the probability that a candidate is qualified, and a screening policy, which shortlists candidates using the candidates' probability values estimated by the screening classifier. Only very recently has a line of work focused specifically on algorithmic screening (Wang et al., 2022; Jin & Candès, 2022; Wang & Joachims, 2023). Therein, Wang et al. (2022) argue that, to increase the efficiency of the selection process without decreasing the quality of the shortlisted candidates, the focus should be on screening policies that find the smallest shortlist of candidates containing a desired average number of qualified candidates with high probability, without making any distributional assumptions on the candidates.
Further, this work has shown that, if the screening classifier is calibrated (Dawid, 1982), such distribution-free guarantees can be achieved using threshold decision rules as screening policies, and that the more granular the predictions of the classifier, the smaller the shortlists provided by such policies.

In this work, our starting point is the realization that any threshold decision rule that uses a calibrated screening classifier may be biased against qualified candidates within demographic groups of interest. More specifically, it may shortlist one or more candidates from a group who are less likely to be qualified than one or more rejected candidates from the same group. Unfortunately, this type of within-group unfairness may perpetuate historical biases against minority groups, since it may preclude the best candidates from those groups (i.e., the candidates who are more likely to be qualified) from moving forward in the selection process and having a chance to be selected (Yang et al., 2019).

**Our contributions.** We first show that, to avoid such within-group unfairness, screening classifiers need to satisfy a natural monotonicity property within each of the groups of interest, which we refer to as within-group monotonicity. Then, we develop a set partitioning post-processing framework to minimally modify any calibrated classifier such that it satisfies within-group monotonicity. Along the way, we make the following contributions:

I. We show that the problem is NP-hard using a reduction from a variation of the partition problem (Karp, 1972), which we refer to as the equal average partition problem and prove is NP-complete. However, we identify a natural class of partitions (contiguous partitions) under which the problem is tractable.

II. While the structure of our problem for contiguous partitions resembles isotonic regression (Barlow & Brunk, 1972), we show that the classical Pool Adjacent Violators (PAV) algorithm may fail even to find a locally optimal solution.

III. We derive a dynamic programming algorithm for contiguous partitions that is guaranteed to find an optimal solution to our problem in polynomial time.

IV. We show that within-group calibration (Pleiss et al., 2017) implies within-group monotonicity. However, we show that it is often impossible to modify a classifier to satisfy the former and, whenever possible, the predictions of the resulting classifier are coarse.

Finally, we create multiple instances of a simulated screening process using US Census survey data to validate and complement our methodological contributions and theoretical results. The results show that the probability that an individual from a minority group suffers from within-group unfairness may be significant and that within-group monotonicity can be achieved at a small cost in terms of prediction granularity and shortlist size.

**Related work.** There is an extensive and rapidly growing line of work addressing group bias and discrimination in the machine learning literature (Hardt et al., 2016; Friedler et al., 2016; Zafar et al., 2017; Kim et al., 2019; Beutel et al., 2019b; Lahoti et al., 2020).
This line of work has applications in a variety of important domains, including ranking (Celis et al., 2017; Yang & Stoyanovich, 2017; Biega et al., 2018; Singh & Joachims, 2018; 2019), health care (Garb, 1997; Williams & Mohammed, 2009), criminal justice (Dieterich et al., 2016; W Flores et al., 2016; Angwin et al., 2016; Feller et al., 2016; Chouldechova, 2017; Dressel & Farid, 2018) and recommender systems (Sweeney, 2013; Datta et al., 2014; Beutel et al., 2019a; Wang et al., 2021; Prost et al., 2022). However, it has predominantly focused on preventing discrimination across groups of interest, e.g., designing machine learning models whose predictive performance (e.g., accuracy, false positive rate) is invariant across groups. In contrast, we focus on preventing unfairness within groups.

Within the above machine learning literature, there are a few notable exceptions (Zehlike et al., 2017; Speicher et al., 2018; Yang et al., 2019; García-Soriano & Bonchi, 2021; Zehlike et al., 2022), which studied notions similar to within-group monotonicity (in the context of ranking) and within-group unfairness. Among them, the works by Zehlike et al. (2017; 2022) and Speicher et al. (2018) are the most related to ours. Zehlike et al. (2017; 2022) introduce a notion of in-group monotonicity that is similar to ours. However, it comprises only the top-k ranked candidates in a specific pool of candidates (i.e., in our work, the shortlisted candidates), rather than every candidate in a population of interest, and unconditional quality scores, rather than group-conditional quality scores. Moreover, their formulation is fundamentally different and their technical contributions are orthogonal to ours. Speicher et al. (2018) address within-group unfairness as a measure of how unequally members within a group benefit from algorithmic decisions. In contrast, our notion of within-group monotonicity asks for accurately ranking individuals belonging to a group in terms of how worthy they are of receiving a beneficial decision, rather than for benefiting them equally. In this context, it is also worth highlighting the notion of within-group calibration (Pleiss et al., 2017; Kleinberg, 2018), which implies within-group monotonicity, as discussed previously. Within-group calibration asks for equally well-calibrated probability estimates across groups so that a decision maker cannot use group membership to interpret these estimates. However, in the context of screening, our results show that within-group calibration may be an unnecessarily strong requirement.

Our work also relates to a line of work devoted to the study of calibration in supervised learning (Zadrozny & Elkan, 2001; 2002; Guo et al., 2017; Kumar et al., 2018; Krishnan & Tickoo, 2020; Karandikar et al., 2021). Here, the main focus has been the design of classifiers with low calibration error using calibration-aware training or post-hoc re-calibration. However, there have also been very recent efforts to ensure calibration errors are bias-free (Bröcker, 2011; Ferro & Fricker, 2012; Roelofs et al., 2022). Here, we do not aim to minimize calibration error but to ensure that a calibrated classifier satisfies within-group monotonicity.

## 2. Screening, Calibration and Within-Group Discrimination

Given a candidate with a feature vector $x \in \mathcal{X}$, we assume the candidate belongs to one demographic group of interest $z \in \mathcal{Z}$ and can be qualified ($y = 1$) or unqualified ($y = 0$) for the selection objective.[^1][^2]

[^1]: We do not require a candidate's group membership $z$ to be included in, or be inferable from, their feature vector $x$.
[^2]: In practice, one measures qualification using proxy variables, which need to be chosen carefully not to perpetuate historical biases (Bogen & Rieke, 2018; Garr & Jackson, 2019; Tambe et al., 2019).
Next, let $f : \mathcal{X} \to \mathrm{Range}(f) \subseteq [0, 1]$ be a screening classifier that maps a candidate's feature vector $x \in \mathcal{X}$ to a quality score $f(x)$, where the higher the quality score $f(x)$, the more the classifier believes the candidate is qualified. Then, given a pool of $m$ candidates, a screening policy $\pi : [0, 1]^m \to \mathcal{P}(\{0, 1\}^m)$ maps the candidates' quality scores to a probability distribution over shortlisting decisions $\{s_i\}_{i \in [m]}$. Here, each decision $s_i$ specifies whether the corresponding candidate is shortlisted ($s_i = 1$) or not ($s_i = 0$).

In high-stakes applications, screening classifiers $f$ are usually demanded to provide calibrated quality scores (Brier et al., 1950; Gneiting et al., 2007; Gupta et al., 2020), i.e., $f$ is calibrated iff, for every $a \in \mathrm{Range}(f)$, it holds that $\Pr(Y = 1 \mid f(X) = a) = a$. In this context, Wang et al. (2022) have recently shown that, if the classifier $f$ is calibrated, the optimal screening policy $\pi^*_f$ that is guaranteed to shortlist, in expectation, the smallest set of candidates with a desired number of qualified candidates with high probability is given by a simple threshold decision rule that takes shortlisting decisions as

$$
s_i = \begin{cases}
1 & \text{if } f(x_i) > t_f, \\
\mathrm{Bernoulli}(\theta_f) & \text{if } f(x_i) = t_f, \\
0 & \text{otherwise,}
\end{cases}
\qquad (1)
$$

where $t_f$ and $\theta_f$ depend on the classifier and the data distribution. These results lend support to focusing on calibration as the only requirement for screening classifiers.

In this work, we argue that screening policies given by threshold decision rules using calibrated classifiers may suffer from an understudied type of unfairness: they may be biased against qualified members within demographic groups. More formally, the following proposition shows that any threshold decision rule may be biased against qualified members within demographic groups:[^3]

**Proposition 2.1.** *Let $\pi$ be a screening policy given by a threshold decision rule using a calibrated classifier $f$ with threshold $t$. Assume there exist $a, b \in \mathrm{Range}(f)$, with $a < t < b$, and $z \in \mathcal{Z}$ such that $\Pr(Y = 1 \mid f(X) = a, Z = z) > \Pr(Y = 1 \mid f(X) = b, Z = z)$. Then, it holds that*

$$
\mathbb{E}_{Y \sim P_{Y \mid X,Z},\, S \sim \pi} \left[ Y (1 - S) \mid f(X) = a, Z = z \right] > \mathbb{E}_{Y \sim P_{Y \mid X,Z},\, S \sim \pi} \left[ Y S \mid f(X) = b, Z = z \right].
$$

The above result implies that there exist pools of applicants for which an optimal policy using a calibrated classifier may shortlist a candidate from a group who is less likely to be qualified than a rejected candidate from the same group. Importantly, the assumption under which the above within-group unfairness appears is not just a theoretical construct: it has been observed empirically in multiple real-world domains whenever the group membership $Z$ is a spurious confounding factor that causes both $X$ and $Y$ (Wagner, 1982; Pearl, 2000). The case in which the assumption holds for every group $z \in \mathcal{Z}$ and any threshold decision rule is known as Simpson's paradox (Blyth, 1972). Refer to Figure 1 for an illustrative example.

[^3]: All proofs can be found in Appendix A.

[Figure 1: three panels of score distributions over $f(X)$ around the threshold $t_f$; panel titles (a) Overall and (c) Male, with panels (b) and (c) conditioned on gender. Recoverable values: panel (a) $\Pr(Y = 1) = 0.7$ below $t_f$ and $0.8$ above; panel (b) $0.62$ below and $0.5$ above; panel (c) $0.87$ below and $1$ above.] *Figure 1. An illustrative example of within-group unfairness. Panel (a) shows that candidates who are shortlisted ($f(X) > t_f$) are more likely to be qualified ($Y = 1$) than those who are rejected ($f(X) < t_f$). However, panels (b) and (c) show that, after conditioning on their gender, candidates who are rejected are more likely to be qualified than those who are shortlisted. Qualified candidates are shown in color.*
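To make the policy concrete, the following is a minimal Python sketch of the threshold decision rule in Eq. (1). It assumes the threshold $t_f$ and the randomization probability $\theta_f$ have already been computed from the classifier and the data distribution (e.g., via the calibrated subset selection procedure of Wang et al. (2022)); the function and argument names are ours, not taken from the authors' released implementation.

```python
import numpy as np

def threshold_policy(scores, t_f, theta_f, rng=None):
    """Threshold decision rule of Eq. (1): shortlist every candidate whose
    quality score exceeds t_f, and randomize with probability theta_f at t_f."""
    rng = rng or np.random.default_rng(0)
    scores = np.asarray(scores, dtype=float)
    s = (scores > t_f).astype(int)                 # deterministic part of the rule
    at_threshold = np.isclose(scores, t_f)         # candidates exactly at the threshold
    s[at_threshold] = rng.binomial(1, theta_f, size=int(at_threshold.sum()))
    return s                                       # s[i] = 1 iff candidate i is shortlisted
```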
To avoid the above within-group unfairness, we introduce and study within-group monotonicity:

**Definition 2.2.** *Given a set of groups $\mathcal{Z}$, a classifier $f$ is within-group monotone if, for any $z \in \mathcal{Z}$ and $a, b \in \mathrm{Range}(f)$ such that $a < b$, $\Pr(Z = z \mid f(X) = a) > 0$ and $\Pr(Z = z \mid f(X) = b) > 0$, it holds that*

$$
\Pr(Y = 1 \mid f(X) = a, Z = z) \le \Pr(Y = 1 \mid f(X) = b, Z = z).
$$

In what follows, we will design a post-processing framework that, given a calibrated classifier, modifies it minimally so that it is within-group monotone, as shown in Figure 2. As a result, any screening policy given by a threshold decision rule using the modified classifier will not suffer from within-group unfairness. Here, note that we favor a post-processing approach, rather than an in-processing one, because post-processing approaches can be applied to any black-box classifier without requiring retraining or introducing training overhead (Hardt et al., 2016). Furthermore, in-processing approaches commonly need access to the feature defining group membership to ensure group-level fairness (Woodworth et al., 2017), which may not be available to the classifier due to privacy, legal or regulatory reasons.

[Figure 2: two rows of bar charts. Top row: quality scores of $f$ ($0.01$, $0.01$, $0.03$, $0.1$, $0.22$, $0.32$, $0.43$, $0.5$, $0.59$, $0.69$, $0.7$, $0.76$, $0.79$, $0.82$, $0.86$). Bottom row: quality scores of $f_{B^*}$ ($0.01$, $0.02$, $0.1$, $0.22$, $0.32$, $0.46$, $0.59$, $0.72$, $0.82$). Group legend: Born in the US, Born in Unincorporated US, Born abroad, Not a US citizen.] *Figure 2. Quality score values $a = \Pr(Y = 1 \mid f(X) = a)$ and group-conditional quality score values $a_z = \Pr(Y = 1 \mid f(X) = a, Z = z)$ of an (approximately) calibrated screening classifier $f$ with finite range trained on US Census survey data, and of its within-group monotone counterpart $f_{B^*}$ found by our post-processing framework. The demographic groups of interest $\mathcal{Z}$ are defined using US citizenship status, and the hatched bars indicate within-group monotonicity violations. Note that there exist no such violations in $f_{B^*}$ (second row).*

Whenever it is clear from the context, we do not specify the set of groups $\mathcal{Z}$ with respect to which a classifier is within-group calibrated or monotone.
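Definition 2.2 can be checked directly once the group-conditional probabilities have been estimated for a finite-range classifier (the quantities are formalized in the next section). Below is a minimal sketch under that assumption; `np.nan` marks score values a group never attains, which Definition 2.2 exempts. On the Figure 2 example, the hatched bins would surface as violating pairs.

```python
import numpy as np

def monotonicity_violations(a, a_z):
    """Return violations of Definition 2.2 as triples (z, i, j), meaning
    a[i] < a[j] (scores are sorted increasingly) but a_z[z][i] > a_z[z][j].

    a   : (n,) calibrated score values a_1 < ... < a_n.
    a_z : dict mapping group z to an (n,) array of Pr(Y=1 | f(X)=a_i, Z=z),
          with np.nan wherever Pr(Z=z | f(X)=a_i) = 0."""
    violations = []
    for z, az in a_z.items():
        for i in range(len(a)):
            for j in range(i + 1, len(a)):
                if not (np.isnan(az[i]) or np.isnan(az[j])) and az[i] > az[j]:
                    violations.append((z, i, j))
    return violations
```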
## 3. Set Partitioning Post-Processing Framework

Let $f$ be a calibrated classifier with $\mathrm{Range}(f) = \{a_1, \ldots, a_n\}$ and $\Pr(f(X) = a_i) = \rho_i$. Here, note that we focus on calibrated classifiers with finite range, i.e., $|\mathrm{Range}(f)| = n < \infty$, since it is impossible to find non-atomic calibrated classifiers from data,[^4] even asymptotically (van der Gaag et al., 2004; Barber, 2020). Further, assume that $a_i < a_j$ for any $i < j$, without loss of generality. For every demographic group of interest $z \in \mathcal{Z}$, let $\Pr(Y = 1 \mid f(X) = a_i, Z = z) = a_{i,z}$ and $\Pr(Z = z \mid f(X) = a_i) = \rho_{z \mid i}$, and note that, by definition, we have $a_i = \sum_{z \in \mathcal{Z}} \rho_{z \mid i}\, a_{i,z}$. Then, our goal is to modify $f$ minimally so that it is within-group monotone.

[^4]: Given a (non-atomic) classifier $f$, there exists a variety of methods to discretize and calibrate its predictions (Zadrozny & Elkan, 2001; 2002; Gupta et al., 2020). However, this is out of the scope of our work. Moreover, for ease of exposition, we assume that $f$ is perfectly calibrated and that we have access to the true values of the relevant probabilities $\rho_i$, $a_i$, $\rho_{z \mid i}$ and $a_{i,z}$. However, our methodology can be adapted to work with an approximately calibrated $f$ and noisy probability estimates, as long as the estimation errors can be bounded (with high probability).

To this end, we first note that the classifier $f$ induces a partition of $\mathcal{X}$ into $n$ disjoint regions or bins $\{\mathcal{X}_1, \ldots, \mathcal{X}_n\}$, where each bin $\mathcal{X}_i$ is characterized by $a_i$ and $\rho_i$. Building upon this observation, we look at the problem from the perspective of set partitioning and seek to merge a small number of these induced bins to achieve within-group monotonicity.

More formally, let $\mathcal{P}$ be the set of all partitions of the bin indices $\{1, \ldots, n\}$. Every $B \in \mathcal{P}$ is a partition of the bin indices into a collection of nonempty and disjoint equivalence classes $\{A_1, \ldots, A_{|B|}\}$, which we call cells. For each $x \in \mathcal{X}$, denote the index of the bin it belongs to as $i(x) = \{i \mid f(x) = a_i\}$ and represent the cell in $B$ containing index $i(x)$ by $[i(x)]_B$, where we drop the subscript $B$ whenever it is clear from the context. Further, the equivalence relation induced by $B$ implies that, for all $i(x') \in [i(x)]$, we have that $i(x) \sim_B i(x')$. Then, we can use the partition[^5] $B$ to define the modified classifier $f_B : \mathcal{X} \to \mathrm{Range}(f_B) = \{a_A\}_{A \in B}$, where

$$
a_A = \frac{\sum_{j \in A} \rho_j a_j}{\sum_{j \in A} \rho_j} \quad \text{and} \quad f_B(x) = a_{[i(x)]}.
$$

Without loss of generality, we keep the cells induced by the partition $B$ in increasing order with respect to $a_A$, i.e., $a_{A_i} \le a_{A_j}$ for any $i < j$. Next, note that, by definition, $f_B$ is calibrated, i.e.,

$$
\Pr(Y = 1 \mid f_B(X) = a_A) = \frac{\sum_{j \in A} \rho_j a_j}{\sum_{j \in A} \rho_j} = a_A,
$$

and we have that

$$
\Pr(Y = 1 \mid f_B(X) = a_A, Z = z) = \frac{\sum_{j \in A} \rho_j \rho_{z \mid j}\, a_{j,z}}{\sum_{j \in A} \rho_j \rho_{z \mid j}} =: a_{A,z}.
$$

Moreover, the larger the partition size $|B|$, the more fine-grained the predictions of the classifier $f_B$ (Gneiting et al., 2007; Wang et al., 2022). Therefore, we can naturally think of reducing the problem to finding a partition $B$ of maximum size such that $f_B$ is within-group monotone,[^6] i.e.,

$$
\max_{B \in \mathcal{P}} \; |B| \quad \text{subject to} \quad a_{A_i,z} \le a_{A_j,z} \;\; \forall A_i, A_j \in B \text{ such that } a_{A_i} < a_{A_j}, \;\; \forall z \in \mathcal{Z}.
$$

[^5]: We use "partition" instead of "partition on the bin indices" whenever it is clear from the context.
[^6]: Maximizing the size of the partition $|B|$ is equivalent to minimizing the distance $d(f, f_B) = n - |B|$. Thus, $f_{B^*}$ can be viewed as the closest within-group monotone classifier under a prediction-only access model (Błasiok et al., 2022).
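The merged quantities $a_A$ and $a_{A,z}$, and the feasibility of a candidate partition, translate directly into code. The following NumPy sketch mirrors the formulas above; it assumes every group has positive mass in every cell it is evaluated on, uses 0-based bin indices, and all array names are ours.

```python
import numpy as np

def merged_scores(cells, rho, a, rho_z, a_z):
    """Compute a_A and a_{A,z} for each cell A of a partition (Section 3).

    cells : list of lists of 0-based bin indices, e.g. [[0], [1, 2]].
    rho   : (n,) bin masses rho_i = Pr(f(X) = a_i).
    a     : (n,) calibrated scores a_i.
    rho_z : (n, |Z|) group masses rho_{z|i} = Pr(Z = z | f(X) = a_i).
    a_z   : (n, |Z|) group-conditional scores a_{i,z}."""
    a_A, a_Az = [], []
    for A in cells:
        A = np.asarray(A)
        w = rho[A]                               # bin masses within the cell
        a_A.append(w @ a[A] / w.sum())           # a_A: calibrated score of the cell
        wz = rho[A, None] * rho_z[A]             # joint masses rho_j * rho_{z|j}
        a_Az.append((wz * a_z[A]).sum(axis=0) / wz.sum(axis=0))
    return np.array(a_A), np.array(a_Az)

def is_within_group_monotone(cells, rho, a, rho_z, a_z):
    """Check the constraint of the optimization problem above, over all
    pairs of cells with a_{A_i} < a_{A_j} and all groups."""
    a_A, a_Az = merged_scores(cells, rho, a, rho_z, a_z)
    for i in range(len(cells)):
        for j in range(len(cells)):
            if a_A[i] < a_A[j] and np.any(a_Az[i] > a_Az[j] + 1e-12):
                return False
    return True
```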
However, such a problem formulation presents difficulties in terms of tractability and soundness. First, we cannot expect to find such a partition in polynomial time:

**Theorem 3.1.** *Given a calibrated classifier $f$, the problem of finding the partition $B \in \mathcal{P}$ of maximum size such that $f_B$ is within-group monotone is NP-hard.*

To prove the above result in Appendix A.2, we first show that, by finding the partition $B$ of maximum size such that $f_B$ is within-group monotone, we can decide whether there exists a partition $B'$ of size $|B'| = 2$ such that $f_{B'}$ is within-group monotone. Then, we show that the latter decision problem is NP-complete by a reduction from a variation of the partition problem (Karp, 1972), which we refer to as the equal average partition problem and prove is NP-complete.

Second, even if the partition size $|B|$ is large, the shortlists provided by threshold decision rules using $f_B$ may differ greatly from those using $f$. The reason is that, in general, we may end up merging very different bins to ensure monotonicity within groups and, as a consequence, $f_B$ may rank (pairs of) candidates strictly differently. More specifically, $f_B$ may not satisfy the following monotonicity property with respect to $f$:

**Definition 3.2.** *A classifier $f'$ is monotone with respect to $f$ if, for all $f(x_1), f(x_2) \in \mathrm{Range}(f)$ such that $f(x_1) < f(x_2)$, it holds that $f'(x_1) \le f'(x_2)$.*

To guarantee that $f_B$ is monotone with respect to $f$, we need to restrict our attention to the set $\mathcal{B} \subset \mathcal{P}$ of contiguous partitions of $\{1, \ldots, n\}$, i.e., for any $B \in \mathcal{B}$, if $i(x_1) < i(x_2) < i(x_3)$ and $i(x_1) \sim_B i(x_3)$, then it also holds that $i(x_1) \sim_B i(x_2)$ and $i(x_2) \sim_B i(x_3)$. More formally, we have the following result:

**Proposition 3.3.** *Given a classifier $f$ with $\mathrm{Range}(f) = \{a_1, \ldots, a_n\}$, $f_B$ is monotone with respect to $f$ iff $B$ is a contiguous partition of $\{1, \ldots, n\}$.*

Surprisingly, while $|\mathcal{B}| = 2^{n-1}$, we will show in the next section that it is possible to find the optimal contiguous partition $B^* = \mathrm{argmax}_{B \in \mathcal{B}} |B|$ such that $f_{B^*}$ is within-group monotone in polynomial time using dynamic programming.

## 4. Optimal Set Partitioning via Dynamic Programming

Since the structure of our problem resembles isotonic regression, one may think of using a simple variation of the many-times re-discovered Pool Adjacent Violators (PAV) algorithm (Ayer et al., 1955; van Eeden, 1958; Miles, 1959; Bartholomew, 1959) to find the optimal (contiguous) partition. However, in what follows, we first show that the PAV algorithm may not find the optimal partition; it is not even guaranteed to find a partition satisfying an intuitive type of local optimality. Then, building on the reasons why the PAV algorithm may not find the optimal partition, we derive an efficient algorithm based on dynamic programming that is guaranteed to find the optimal partition.

### 4.1. Pool Adjacent Violators (PAV) Algorithm

In comparison with the original PAV algorithm, the only difference is that, in our setting, one needs to check for monotonicity violations across multiple sets of conditional predictors, one per group $z \in \mathcal{Z}$, rather than only one set of predictors. However, the main idea underpinning the PAV algorithm remains the same, i.e., as long as there are monotonicity violations between two adjacent cells, the algorithm merges the corresponding cells into one.

```text
Algorithm 1: It returns a partition B_pav such that f_{B_pav} is within-group monotone.
1: Input: {a_{1,z}, ..., a_{n,z}} for all z in Z
2: Initialize: B_pav = {{1}, ..., {n}}
3: while there exist adjacent A_{i-1}, A_i in B_pav and z in Z such that a_{A_i,z} < a_{A_{i-1},z} do
4:     B_pav = B_pav \ {A_{i-1}, A_i}
5:     B_pav = B_pav ∪ {A_{i-1} ∪ A_i}
6: end while
7: return B_pav
```

Algorithm 1 summarizes the overall procedure, which has complexity $O(n^2 |\mathcal{Z}|)$ and is guaranteed to return a partition $B_{pav}$ such that $f_{B_{pav}}$ is within-group monotone, as formalized by the following proposition:

**Proposition 4.1.** *Algorithm 1 returns a partition $B_{pav} \in \mathcal{B}$ such that the classifier $f_{B_{pav}}$ is within-group monotone.*
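For reference, here is a minimal, runnable rendering of Algorithm 1, reusing the array conventions of the previous sketch; as before, every group is assumed to have positive mass in every bin, and this is our illustration rather than the authors' implementation.

```python
import numpy as np

def pav_partition(a_z, rho, rho_z):
    """Variant of PAV (Algorithm 1): merge adjacent cells while some group's
    conditional score sequence decreases.  Returns a list of cells
    (lists of 0-based bin indices)."""
    cells = [[i] for i in range(len(rho))]

    def cell_scores(A):
        A = np.asarray(A)
        wz = rho[A, None] * rho_z[A]            # joint masses rho_j * rho_{z|j}
        return (wz * a_z[A]).sum(axis=0) / wz.sum(axis=0)

    merged = True
    while merged:
        merged = False
        for i in range(1, len(cells)):
            # Violation between adjacent cells for at least one group?
            if np.any(cell_scores(cells[i]) < cell_scores(cells[i - 1]) - 1e-12):
                cells[i - 1:i + 1] = [cells[i - 1] + cells[i]]
                merged = True
                break
    return cells
```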
Unfortunately, while the original PAV algorithm does enjoy global optimality guarantees for the isotonic regression problem[^7] under multiple choices of loss functions (Yu & Xing, 2016; Jordan et al., 2019), this is not true for our problem. There exist many instances in which Algorithm 1 fails to find the optimal partition $B^*$, e.g., refer to Figure 6 in Appendix B.1. In fact, Algorithm 1 does not even enjoy an intuitive type of local optimality guarantee based on the notion of dominance (Wang et al., 2022):

[^7]: In the isotonic regression problem (Barlow & Brunk, 1972), given a set of response variables $\{y_i\}_{i \in [n]}$, the goal is to find a set of predictor values $\{x_i\}_{i \in [n]}$, with $x_i \le x_{i+1}$ for all $i \in [n]$, such that $\sum_i \ell(x_i, y_i)$ is minimized, where $\ell(x_i, y_i)$ is a loss measuring how well $x_i$ approximates $y_i$.

**Definition 4.2.** *Let $f$ and $f'$ be calibrated classifiers. Classifier $f$ dominates $f'$ if, for any $x_1, x_2 \in \mathcal{X}$ such that $f(x_1) = f(x_2)$, it holds that $f'(x_1) = f'(x_2)$.*

More specifically, if $f_B$ dominates $f_{B'}$, it can be shown that the expected size of the shortlists provided by the optimal screening policies using $f_B$ is not larger than that using $f_{B'}$ (Corollary 4.3, Wang et al. (2022)), and it clearly holds that $|B| \ge |B'|$. For example, let $\mathrm{Range}(f) = \{a_1, a_2, a_3\}$, $\mathcal{Z} = \{z_1, z_2\}$ and $\rho_i \rho_{z \mid i} = \frac{1}{6}$ for all $i \in \{1, 2, 3\}$ and $z \in \mathcal{Z}$. Further, let $a_{1,z_2} = a_{2,z_1} = a_{3,z_2} = \alpha$, $a_{1,z_1} = 2\alpha$, $a_{2,z_2} = 3\alpha$ and $a_{3,z_1} = 4\alpha$, where $\alpha \in [0, 0.25]$. Then, Algorithm 1 returns $B_{pav} = \{\{1, 2, 3\}\}$; however, $f_{B_{pav}}$ is dominated by $f_B$, with $B = \{\{1\}, \{2, 3\}\}$, which is also within-group monotone. Refer to Appendix A.5 for details.
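The counterexample above can be checked numerically by reusing `pav_partition` and `is_within_group_monotone` from the earlier sketches (bin indices are 0-based here), e.g., with $\alpha = 0.2$:

```python
import numpy as np

alpha = 0.2
rho = np.full(3, 1 / 3)                        # with rho_{z|i} = 1/2, rho_i * rho_{z|i} = 1/6
rho_z = np.full((3, 2), 1 / 2)
a_z = alpha * np.array([[2.0, 1.0],            # rows: bins; columns: groups z1, z2
                        [1.0, 3.0],
                        [4.0, 1.0]])
a = (rho_z * a_z).sum(axis=1)                  # a_i = sum_z rho_{z|i} a_{i,z}

print(pav_partition(a_z, rho, rho_z))          # [[0, 1, 2]]: PAV collapses everything
print(is_within_group_monotone([[0], [1, 2]], rho, a, rho_z, a_z))   # True
```

That is, PAV returns the trivial single-cell partition even though the two-cell partition $\{\{1\}, \{2, 3\}\}$ is within-group monotone and dominates it.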
The reason why Algorithm 1 may fail to find the optimal partition is that, whenever it tries to fix a monotonicity violation between two adjacent cells $A_{i-1}$ and $A_i$, it does so by merging them. However, in our problem, the optimal fix may require merging cells $A_i$ and $A_{i+1}$ instead. Building on this insight, we will design an efficient algorithm based on dynamic programming that provably finds the optimal partition.

### 4.2. An Optimal Dynamic Programming Algorithm

Our starting point is the following observation, which allows us to break down the problem of finding the optimal partition $B^*$ into several subproblems. Let $\mathcal{B}_r$ be the set of contiguous partitions of the bin indices $\{1, \ldots, r\}$, with $r \le n$, and $\mathcal{B}_{l,r} \subseteq \mathcal{B}_r$ be the subset of those partitions such that, for any $B = \{A_1, \ldots, A_{|B|}\} \in \mathcal{B}_{l,r}$, it holds that $A_{|B|} = \{l, \ldots, r\}$ and $f_{B \cup B'}$ is within-group monotone on the region of the feature space defined by $\cup_{i \le r} \mathcal{X}_i$, where $B'$ is any partition of the bin indices $\{r+1, \ldots, n\}$.[^8] Then, it clearly holds that the optimal partition $B^* \in \cup_{l=1}^{n} \mathcal{B}_{l,n}$, and thus we can break the problem of finding $B^*$ into $n$ subproblems, i.e., finding the optimal partition $B^*_{l,n} = \mathrm{argmax}_{B \in \mathcal{B}_{l,n}} |B|$ within each subset $\mathcal{B}_{l,n}$. From now on, with a slight abuse of notation, we will write $f_B$ instead of $f_{B \cup B'}$ whenever $B'$ refers to any partition of the bin indices not in $B$ and it is clear from the context.

[^8]: Note that it may be impossible to satisfy both conditions simultaneously if, for example, Simpson's paradox (Simpson, 1951) holds, i.e., for every group $z \in \mathcal{Z}$ and every pair of indices $i < j$, we have that $a_{i,z} > a_{j,z}$. In those cases, we may have that $\mathcal{B}_{l,r} = \emptyset$ for all $1 < l \le r$.

Next, we realize that we can efficiently find the optimal partition $B^*_{l,n}$ in each subset $\mathcal{B}_{l,n}$ recursively using dynamic programming. The key idea of the recursion is that any partition $B \in \mathcal{B}_{l,r}$ needs to satisfy the following necessary and sufficient condition:

**Lemma 4.3.** *Given any $B \in \mathcal{B}_r$, it holds that $B \in \mathcal{B}_{l,r}$ if and only if there exists $k < l$ such that $B \setminus \{\{l, \ldots, r\}\} \in \mathcal{B}_{k,l-1}$ and $a_{\{k, \ldots, l-1\},z} \le a_{\{l, \ldots, r\},z}$ for all $z \in \mathcal{Z}$.*

Consequently, we can efficiently find all the partitions in the subsets $\mathcal{B}_{l,r}$, iterating through $l$, using the partitions in the subsets $\mathcal{B}_{k,l-1}$ with $k < l$. Finally, by construction, it clearly holds that, if $B^*_{l,r} = B \cup \{\{l, \ldots, r\}\}$, with $B \in \mathcal{B}_{k,l-1}$, is the optimal partition in $\mathcal{B}_{l,r}$, then $B = B^*_{k,l-1}$ is the optimal partition in $\mathcal{B}_{k,l-1}$. As a result, at each step of the recursion, we only need to store the optimal partition $B^*_{l,r}$, not all partitions in $\mathcal{B}_{l,r}$.

```text
Algorithm 2: It returns the optimal partition B* such that f_{B*} is within-group monotone.
 1: Input: {a_{1,z}, ..., a_{n,z}} for all z in Z
 2: Initialize: B_{l,r} = ∅ for all l, r in {2, ..., n}; B_{1,r} = {{1, ..., r}} for all r in {1, ..., n}
 3: for l in {2, ..., n} do
 4:     for r in {l, ..., n} do
 5:         S_{l,r} = {k | k < l, a_{{k,...,l-1},z} <= a_{{l,...,r},z} for all z in Z}   // refer to Lemma 4.3
 6:         if S_{l,r} = ∅ then
 7:             continue   // in this case B_{l,r} = ∅
 8:         end if
 9:         k* = argmax_{k in S_{l,r}} |B_{k,l-1}|
10:         B_{l,r} = B_{k*,l-1} ∪ {{l, ..., r}}
11:     end for
12: end for
13: l* = argmax_{i in {1, ..., n}} |B_{i,n}|
14: return B_{l*,n}
```

Algorithm 2 summarizes the overall procedure, which has complexity $O(n^3 |\mathcal{Z}|)$ and is guaranteed to find the optimal partition $B^*$, as formalized by the following theorem:

**Theorem 4.4.** *Algorithm 2 returns $B^* = \mathrm{argmax}_{B \in \mathcal{B}} |B|$ such that $f_{B^*}$ is within-group monotone.*

**Remark.** In many domains, allowing for a pre-specified, application-dependent level of within-group monotonicity violations may be acceptable. Such tolerance levels, whether global or group-specific, can easily be integrated into our algorithm without introducing any computational overhead. More specifically, let $\tau_z \in [0, 1]$ be the pre-specified maximum level of within-group monotonicity violations for each group $z \in \mathcal{Z}$, i.e., a classifier $f$ needs to satisfy that $\Pr(Y = 1 \mid f(X) = a, Z = z) \le \Pr(Y = 1 \mid f(X) = b, Z = z) + \tau_z$ for all $z \in \mathcal{Z}$ and $a < b$. Then, one only needs to modify the condition in line 5 of Algorithm 2 to $a_{\{k, \ldots, l-1\},z} \le a_{\{l, \ldots, r\},z} + \tau_z$ for all $z \in \mathcal{Z}$. Note that this modification does not add to the time or space complexity of our algorithm. At the same time, a proof similar to that of Theorem 4.4 shows that the algorithm then returns the partition of maximum size such that the classifier induced by this partition is within-group monotone with a slack of $\tau_z$ for each group $z \in \mathcal{Z}$. Such relaxations result in partitions of larger size or, equivalently, more fine-grained classifiers, and may be imposed by the domain expert to trade off prediction power and within-group fairness.
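Below is a minimal implementation sketch of Algorithm 2, including the optional slack from the remark above (a single global tolerance `tau` rather than group-specific $\tau_z$, for brevity). Prefix sums make each evaluation of $a_{\{l,\ldots,r\},z}$ take $O(|\mathcal{Z}|)$ time, matching the stated $O(n^3 |\mathcal{Z}|)$ complexity; indices are 0-based and, again, every group is assumed to have positive mass in every bin.

```python
import numpy as np

def optimal_partition(a_z, rho, rho_z, tau=0.0):
    """Dynamic program of Algorithm 2: the largest contiguous partition
    (list of cells) whose induced classifier is within-group monotone,
    with optional slack tau."""
    n = len(rho)
    wz = rho[:, None] * rho_z                        # rho_j * rho_{z|j}
    cw = np.vstack([np.zeros(wz.shape[1]), np.cumsum(wz, axis=0)])
    cwa = np.vstack([np.zeros(wz.shape[1]), np.cumsum(wz * a_z, axis=0)])

    def score(l, r):                                 # a_{{l,...,r},z}, inclusive bounds
        return (cwa[r + 1] - cwa[l]) / (cw[r + 1] - cw[l])

    # B[l][r]: best partition of bins {0, ..., r} whose last cell is {l, ..., r}.
    B = [[None] * n for _ in range(n)]
    for r in range(n):
        B[0][r] = [list(range(r + 1))]               # single cell: trivially monotone
    for l in range(1, n):
        for r in range(l, n):
            s_lr = score(l, r)
            best = None
            for k in range(l):                       # candidate predecessors (Lemma 4.3)
                cand = B[k][l - 1]
                if cand is None or np.any(score(k, l - 1) > s_lr + tau):
                    continue
                if best is None or len(cand) > len(best):
                    best = cand
            if best is not None:
                B[l][r] = best + [list(range(l, r + 1))]
    feasible = [B[l][n - 1] for l in range(n) if B[l][n - 1] is not None]
    return max(feasible, key=len)
```

On the three-bin counterexample of Section 4.1, `optimal_partition(a_z, rho, rho_z)` returns `[[0], [1, 2]]`, the partition that Algorithm 1 misses.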
## 5. Within-Group Monotonicity vs. Within-Group Calibration

Within-group calibration, or calibration within groups,[^9] requires that the probability that a candidate is qualified be independent of their group membership conditioned on their quality score. More specifically, it is defined as follows (Pleiss et al., 2017; Kleinberg, 2018):

**Definition 5.1.** *Given a set of groups $\mathcal{Z}$, a classifier $f$ is within-group calibrated iff, for every $z \in \mathcal{Z}$ and $a \in \mathrm{Range}(f)$ such that $\Pr(Z = z \mid f(X) = a) > 0$, it holds that $\Pr(Y = 1 \mid f(X) = a, Z = z) = a$.*

[^9]: There also exists a generalized, stronger notion of within-group calibration called multicalibration (Hébert-Johnson et al., 2018; Jung et al., 2021), which requires predictions to be calibrated within every group that can be identified within a specified class of computations.

As discussed previously, within-group calibration implies within-group monotonicity. Then, to minimally modify a calibrated classifier $f$ so that it becomes within-group monotone, one may think of finding the optimal partition $B^*_{cal} = \mathrm{argmax}_{B \in \mathcal{B}} |B|$ such that $f_{B^*_{cal}}$ is within-group calibrated. In what follows, we will first show that, perhaps surprisingly, finding $B^*_{cal}$ is computationally easier[^10] than finding $B^*$. However, we will further show that, in many cases, $B^*_{cal}$ may not exist and, when it does exist, its size may be much smaller than the size of $B^*$, leading to less fine-grained predictions.

[^10]: Using a similar proof technique as in Theorem 3.1, it can be proven that the problem of finding the partition $B \in \mathcal{P}$ of maximum size such that $f_B$ is within-group calibrated is NP-hard. Therefore, in general, the computational complexity is not lower.

To find the optimal $B^*_{cal}$, we proceed recursively. Let $\mathcal{B}_r$ be the set of contiguous partitions of the bin indices $\{1, \ldots, r\}$, with $r \le n$. Then, iterating through $r$, we find the optimal partitions $B^*_{cal,r} = \mathrm{argmax}_{B \in \mathcal{B}_r} |B|$ such that $f_{B^*_{cal,r}}$ is within-group calibrated on $\cup_{i \le r} \mathcal{X}_i$. In this case, the key idea of the recursion is that any partition $B \in \mathcal{B}_r$ such that $f_B$ is within-group calibrated on $\cup_{i \le r} \mathcal{X}_i$ needs to satisfy the following necessary and sufficient condition:

**Lemma 5.2.** *Given any $B \in \mathcal{B}_r$, it holds that $f_B$ is within-group calibrated on $\cup_{i \le r} \mathcal{X}_i$ if and only if there exists $l \in \{1, \ldots, r\}$ such that $B \setminus \{\{l, \ldots, r\}\} \in \mathcal{B}_{l-1}$, $f_{B \setminus \{\{l, \ldots, r\}\}}$ is within-group calibrated on $\cup_{i \le l-1} \mathcal{X}_i$, and $a_{\{l, \ldots, r\},z} = a_{\{l, \ldots, r\}}$ for all $z \in \mathcal{Z}$.*

As a consequence, we can efficiently find all partitions $B$ in the subsets $\mathcal{B}_r$ such that $f_B$ is within-group calibrated, iterating through $r$, using the partitions $B'$ in the subsets $\mathcal{B}_l$ with $l < r$ such that $f_{B'}$ is within-group calibrated. Finally, by construction, it clearly holds that, if $B^*_{cal,r} = B \cup \{\{l, \ldots, r\}\}$, with $B \in \mathcal{B}_{l-1}$, is the optimal partition in $\mathcal{B}_r$, then $B = B^*_{cal,l-1}$ is the optimal partition in $\mathcal{B}_{l-1}$. As a result, at each step of the recursion, we only need to store the optimal partition $B^*_{cal,r}$, not all partitions $B \in \mathcal{B}_r$ such that $f_B$ is within-group calibrated, and reuse it to find all $B^*_{cal,r'}$ with $r' > r$.

```text
Algorithm 3: It returns the optimal partition B*_cal such that f_{B*_cal} is within-group calibrated.
 1: Input: {a_{1,z}, ..., a_{n,z}} for all z in Z
 2: Initialize: B_{cal,i} = ∅ for all i in {1, ..., n}
 3: if a_{1,z} = a_1 for all z in Z then
 4:     B_{cal,1} = {{1}}
 5: end if
 6: for r in {2, ..., n} do
 7:     S_r = {i in {2, ..., r} | a_{{i,...,r},z} = a_{{i,...,r}} for all z in Z}
 8:     k* = argmax_{k in S_r} |B_{cal,k-1}|
 9:     if B_{cal,k*-1} != ∅ then
10:         B_{cal,r} = B_{cal,k*-1} ∪ {{k*, ..., r}}
11:     else if a_{{1,...,r}} = a_{{1,...,r},z} for all z in Z then
12:         B_{cal,r} = {{1, ..., r}}
13:     end if
14: end for
15: return B_{cal,n}
```

Algorithm 3 summarizes the overall procedure, which has complexity $O(n^2 |\mathcal{Z}|)$ and is guaranteed to find the optimal partition $B^*_{cal}$, if such a partition exists, as formalized by the following theorem:

**Theorem 5.3.** *Algorithm 3 returns $B^*_{cal} = \mathrm{argmax}_{B \in \mathcal{B}} |B|$ such that $f_{B^*_{cal}}$ is within-group calibrated if such a partition exists, or $\emptyset$ otherwise.*

Unfortunately, there are many cases in which $B^*_{cal}$ does not exist, e.g., this will happen if $f$ systematically undervalues the probability that individuals from one group are qualified in comparison with individuals from another group:

**Proposition 5.4.** *Let $\mathcal{Z} = \{z, z'\}$, $\rho_{z \mid i} = \rho_{z' \mid i}$ and $a_{i,z} < a_{i,z'}$ for all $i \in \{1, \ldots, n\}$. Then, there exists no $B \in \mathcal{B}$ such that $f_B$ is within-group calibrated.*

In the above situation, $f$ may actually be within-group monotone and thus $|B^*| = n$. Even if $B^*_{cal}$ exists, there are examples where $|B^*| - |B^*_{cal}| = n - 1$.
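Algorithm 3 admits an analogous dynamic-programming sketch. Using a tolerance `eps` larger than numerical noise turns the equality test of Lemma 5.2 into the within-group $\epsilon$-calibration check used in the experiments of Section 6; as before, this is our illustrative rendering under the same positive-mass assumption.

```python
import numpy as np

def optimal_calibrated_partition(a, a_z, rho, rho_z, eps=1e-9):
    """Dynamic program of Algorithm 3: the largest contiguous partition whose
    induced classifier is within-group calibrated, or None if none exists."""
    n = len(rho)
    wz = rho[:, None] * rho_z
    cw = np.concatenate([[0.0], np.cumsum(rho)])
    cwa = np.concatenate([[0.0], np.cumsum(rho * a)])
    cz = np.vstack([np.zeros(wz.shape[1]), np.cumsum(wz, axis=0)])
    cza = np.vstack([np.zeros(wz.shape[1]), np.cumsum(wz * a_z, axis=0)])

    def calibrated(l, r):      # does cell {l, ..., r} satisfy a_{A,z} = a_A for all z?
        a_cell = (cwa[r + 1] - cwa[l]) / (cw[r + 1] - cw[l])
        a_cell_z = (cza[r + 1] - cza[l]) / (cz[r + 1] - cz[l])
        return bool(np.all(np.abs(a_cell_z - a_cell) <= eps))

    best = [None] * (n + 1)    # best[r] = optimal calibrated partition of bins {0, ..., r-1}
    best[0] = []
    for r in range(1, n + 1):
        for l in range(r):     # last cell is {l, ..., r-1}
            if best[l] is not None and calibrated(l, r - 1):
                cand = best[l] + [list(range(l, r))]
                if best[r] is None or len(cand) > len(best[r]):
                    best[r] = cand
    return best[n]
```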
## 6. Experiments Using Survey Data

In this section, we create multiple instances of a simulated screening process using US Census survey data[^11] to first investigate how frequently within-group unfairness occurs in a recruiting domain and then compare the partitions, as well as the induced screening classifiers, provided by Algorithms 1, 2 and 3.[^12]

[^11]: An implementation of our algorithms and the data used in our experiments are available at https://github.com/Networks-Learning/within-group-monotonicity.
[^12]: We ran all experiments on a machine equipped with 48 Intel(R) Xeon(R) 2.50GHz CPU cores and 256GB memory.

**Experimental setup.** We use a dataset consisting of 3.2 million individuals from the US Census (Ding et al., 2021). Each individual is represented by sixteen features and one label $y \in \{0, 1\}$ indicating whether the individual is employed ($y = 1$) or not ($y = 0$). For our experiments, we think of employment as an (imperfect) proxy of qualification.[^13] The features contain demographic information such as age, marital status or gender (Appendix B4, Ding et al. (2021)). We run four sets of experiments where, in each of them, we use a different feature (US citizenship status, race, gender, or disability record) to define the demographic groups of interest $\mathcal{Z}$. For space reasons, in this section, we focus mainly on groups $\mathcal{Z}$ based on US citizenship status and race. However, Appendix B.3 shows similar results for groups $\mathcal{Z}$ defined based on gender and disability record.

[^13]: Note that the label used as a proxy for qualification closely depends on the application domain. In an academic hiring scenario, the label "Educational Attainment" could serve as a proxy for qualification, while "Years of Working Experience" might be a better proxy in hiring scenarios for craft professions.

For the experiments, we randomly split the dataset into two equally-sized, disjoint subsets. We use the first subset for training and calibration and the second subset for testing. More specifically, for each experiment, we create the training and calibration sets $\mathcal{D}_{tr}$ and $\mathcal{D}_{cal}$ by picking 100,000 and 50,000 individuals at random (without replacement) from the first subset. We use $\mathcal{D}_{tr}$ to train a logistic regression model $f_{LR}$,[^14] and use $\mathcal{D}_{cal}$ both to (approximately) calibrate $f_{LR}$ using uniform mass binning (UMB) (Wang et al., 2022; Zadrozny & Elkan, 2001), i.e., discretize its outputs into $n$ calibrated quality scores, and to estimate the relevant probabilities $\rho_i$, $a_i$, $\rho_{z \mid i}$ and $a_{i,z}$ needed by Algorithms 1, 2, and 3. The resulting (approximately) calibrated classifier serves as our screening classifier $f$. For testing, we create a set $\{\mathcal{D}^i_{pool}\}_{i=1}^{100}$ of 100 pools, each with $m = 100$ individuals picked at random from the second subset, and create (the smallest) shortlists with, in expectation, at least $k$ qualified individuals using the screening classifiers $f_{B_{pav}}$, $f_{B^*}$ and $f_{B^*_{cal}}$ induced by the partitions found by Algorithms 1, 2 and 3, respectively. Here, since we find that, in most experiments, no within-group calibrated classifier exists, we allow $f_{B^*_{cal}}$ to be within-group $\epsilon$-calibrated[^15] within Algorithm 3 and use binary search to find the smallest $\epsilon \in (0, 1)$ such that $B^*_{cal}$ exists.[^16] Throughout the experiments, we estimate the average and the standard error of the reported quantities by repeating each experiment 100 times.

[^14]: The classifier $f_{LR}$ achieves a test accuracy of 74% at predicting whether an individual is qualified.
[^15]: Given a set of groups $\mathcal{Z}$, a classifier $f$ is within-group $\epsilon$-calibrated iff, for every $z \in \mathcal{Z}$ and $a \in \mathrm{Range}(f)$ such that $\Pr(Z = z \mid f(X) = a) > 0$, it holds that $|\Pr(Y = 1 \mid f(X) = a, Z = z) - a| \le \epsilon$.
[^16]: Refer to Appendix B.2 for additional experiments on within-group $\epsilon$-calibration.
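As an illustration of the estimation step, the following sketch discretizes raw scores with uniform mass binning and estimates $\rho_i$, $a_i$, $\rho_{z \mid i}$ and $a_{i,z}$ from a calibration set. The inputs (`scores`, `y`, `groups`) are assumed to be NumPy arrays over the calibration individuals; the function name and the equal-mass split via `np.array_split` are our simplifications, not a description of the authors' exact pipeline.

```python
import numpy as np

def umb_estimates(scores, y, groups, n_bins, group_names):
    """Uniform mass binning on a calibration set: sort by raw score, split
    into n_bins equal-mass bins, and estimate the binned probabilities."""
    order = np.argsort(scores)
    bins = np.array_split(order, n_bins)          # equal-mass bins, low to high score
    rho = np.array([len(b) for b in bins]) / len(scores)      # rho_i
    a = np.array([y[b].mean() for b in bins])                 # a_i (calibrated score)
    rho_z = np.zeros((n_bins, len(group_names)))              # rho_{z|i}
    a_z = np.full((n_bins, len(group_names)), np.nan)         # a_{i,z}
    for i, b in enumerate(bins):
        for j, z in enumerate(group_names):
            members = b[groups[b] == z]
            rho_z[i, j] = len(members) / len(b)
            if len(members) > 0:
                a_z[i, j] = y[members].mean()     # nan if group z is absent from bin i
    return rho, a, rho_z, a_z
```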
**Within-group unfairness frequently occurs between individuals from minority groups, especially with fine-grained classifiers.** We start by estimating the probability $p_{d \mid z}$ that an individual from a demographic group of interest $z \in \mathcal{Z}$ may suffer from within-group unfairness, i.e.,

$$
p_{d \mid z} = \frac{1}{\Pr(Z = z)} \sum_{i \in \{1, \ldots, n\}} \rho_i \rho_{z \mid i}\, v_i, \quad \text{where } v_i = \mathbb{I}\left[\exists\, a_j \in \mathrm{Range}(f) \;\middle|\; a_i < a_j \wedge a_{i,z} > a_{j,z}\right].
$$

Figure 3a summarizes the results for a screening classifier $f$ with $n = 15$ bins. We find that individuals who belong to minority groups are much more likely to suffer from within-group unfairness than those who belong to a majority group. For example, the probability that an individual who is not a US citizen may suffer from within-group unfairness is $p_{d \mid z} > 0.3$, while it is almost impossible that an individual born in the US is treated unfairly within their group.

[Figure 3: three panels, (a) $p_{d \mid z}$ vs. $\Pr(Z = z)$ for $n = 15$, (b) $p_d$ vs. $n$, (c) $p_{d \mid \mathcal{D}_{pool}}$ vs. $n$, with $n \in \{5, \ldots, 40\}$ and curves for citizenship status, race code, gender and disability record. The legends report $\Pr(Z = z)$ values $\{0.01, 0.05, 0.07, 0.86\}$ for the citizenship groups (Born in Unincorporated US, Not a US citizen, Born abroad, Born in the US) and $\{0.01, 0.1, 0.12, 0.77\}$ for the race groups (American Indian or Alaska, Black or African American, Asian, Native Hawaiian or other, White).] *Figure 3. Probability that an individual suffers from within-group unfairness. Panel (a) shows the probability $p_{d \mid z}$ that an individual from group $z$ may suffer from within-group unfairness against $\Pr(Z = z)$ for $n = 15$. Panel (b) shows the probability $p_d$ that an individual may suffer from within-group unfairness. Panel (c) shows the probability $p_{d \mid \mathcal{D}_{pool}}$ that an individual suffers from within-group unfairness in a test pool $\mathcal{D}_{pool}$ of size $m$, averaged across all test pools, against $n = |\mathrm{Range}(f)|$.*

Further, we investigate to what extent the probability $p_d = \sum_{z \in \mathcal{Z}} \Pr(Z = z)\, p_{d \mid z}$ that an individual may suffer from within-group unfairness depends on the number of bins $n$ of $f$. Figure 3b shows that the more fine-grained a classifier is, the higher the probability that an individual may suffer from within-group unfairness, e.g., for $n \le 10$, $p_d < 0.05$ while, for $n = 40$, $p_d > 0.12$ across all sets of groups $\mathcal{Z}$. Since the accuracy of a calibrated classifier is related to how fine-grained its predictions are (Wang et al., 2022), the above finding suggests that high accuracy may have a cost in terms of within-group unfairness.

Our results so far show that the probability that individuals may suffer from within-group unfairness is significant. Next, we estimate the probability $p_{d \mid \mathcal{D}_{pool}}$ that, in a test pool $\mathcal{D}_{pool}$ of size $m$, an individual does suffer from within-group unfairness, i.e.,

$$
p_{d \mid \mathcal{D}_{pool}} = \frac{1}{m} \sum_{x \in \mathcal{D}_{pool}} v_x, \quad \text{where } v_x = \mathbb{I}\left[\exists\, x' \in \mathcal{D}_{pool} \;\middle|\; a_{i(x)} < a_{i(x')} \wedge a_{i(x),z} > a_{i(x'),z}\right]
$$

and $z$ denotes the group both $x$ and $x'$ belong to. Figure 3c shows that, on average across all test pools, the probability $p_{d \mid \mathcal{D}_{pool}}$ follows the same trend as $p_d$; however, it is slightly lower in value because each test pool is not representative of the entire population. Note that, as $m \to \infty$, one can readily conclude that $p_{d \mid \mathcal{D}_{pool}} \to p_d$.
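The probability $p_{d \mid z}$ can be computed directly from the binned estimates, e.g., as in the following sketch, which mirrors the definition above (bins are assumed sorted by strictly increasing $a_i$, so higher index means higher score):

```python
import numpy as np

def p_unfair_given_z(rho, rho_z, a_z):
    """Estimate p_{d|z} for each group: the conditional mass of bins whose
    members could be rejected in favor of a higher-scored bin with a lower
    group-conditional quality."""
    n, n_groups = a_z.shape
    pz = (rho[:, None] * rho_z).sum(axis=0)          # Pr(Z = z)
    p = np.zeros(n_groups)
    for j in range(n_groups):
        for i in range(n):
            if np.isnan(a_z[i, j]):
                continue                             # group absent from bin i
            higher = a_z[i + 1:, j]                  # bins with a_j > a_i
            v_i = np.any(higher[~np.isnan(higher)] < a_z[i, j] - 1e-12)
            p[j] += rho[i] * rho_z[i, j] * v_i
    return p / pz
```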
**Algorithm 2 consistently provides larger partitions, which result in more fine-grained classifiers and smaller shortlists, than Algorithms 1 and 3.** We experiment with several screening classifiers $f$ with a varying number of bins $n$ and compare the sizes of the partitions $B$ provided by each of the algorithms, i.e., the number of bins of the modified classifiers $f_B$. Figure 4 shows that the optimal partition $B^*$ is always greater in size than the partitions $B^*_{cal}$ and $B_{pav}$. Moreover, it also shows that, as $n$ increases, the growth in the size of the partitions $B^*$ and $B_{pav}$ diminishes because the occurrence of within-group unfairness increases, as shown in Figure 3.

[Figure 4: partition size against $n \in \{5, \ldots, 40\}$ for $B^*$, $B_{pav}$ and $B^*_{cal}$; panels (a) Citizenship status and (b) Race code.] *Figure 4. Size of the partitions $B_{pav}$, $B^*$ and $B^*_{cal}$ returned by Algorithms 1, 2 and 3, respectively (higher is better).*

Further, we use both the original classifier $f$ and the modified classifiers $f_{B^*}$, $f_{B_{pav}}$ and $f_{B^*_{cal}}$ to shortlist the minimum number of individuals among those in each of the simulated test pools $\{\mathcal{D}^i_{pool}\}$ such that, in expectation, there are at least $k$ qualified shortlisted individuals per pool. To this end, for each test pool and classifier, we sort the candidates in decreasing order with respect to the corresponding quality score and, starting from the first, we keep shortlisting individuals in order until the sum of the quality scores reaches $k$ (Appendix A.3, Wang et al. (2022)). Figure 5 shows that the shortlists created using $f_{B^*}$ are consistently smaller than those created using $f_{B_{pav}}$ and $f_{B^*_{cal}}$ for $k = 5$. Moreover, it also shows that the price to pay for achieving within-group monotonicity, i.e., the difference in size between the shortlists created using $f$ and $f_{B^*}$, is small. We found qualitatively similar results for other values of $k$. Appendix B.1 takes a closer look at the (group-conditional) score values of $f$, $f_{B^*}$, $f_{B_{pav}}$ and $f_{B^*_{cal}}$.

[Figure 5: shortlist size against $n \in \{5, \ldots, 40\}$ for $f$, $f_{B^*}$, $f_{B_{pav}}$ and $f_{B^*_{cal}}$; panels (a) Citizenship status and (b) Race code.] *Figure 5. Size of the shortlists created using both the original classifier $f$ and the modified classifiers $f_{B^*}$, $f_{B_{pav}}$ and $f_{B^*_{cal}}$ for $k = 5$ (lower is better).*

**Remark.** Note that the shortlists created using $f_{B^*}$ will be larger than those created using $f$, and this imposes more burden on the decision maker in selecting the desired number of qualified candidates (e.g., they have to interview more candidates). However, it ensures that no members of the demographic groups are unfairly treated. Therefore, it shifts the costs of using a poor screening classifier from the applicants to the decision maker. If $|B^*|$ is too small, it may be a sign that the decision maker should reconsider using $f$ as the screening classifier.

## 7. Conclusions

In this work, we have first shown that optimal screening policies using calibrated classifiers may suffer from an understudied type of within-group unfairness. Then, we have developed a polynomial-time algorithm based on dynamic programming to minimally modify any given calibrated classifier so that it satisfies within-group monotonicity, a natural monotonicity property that prevents the occurrence of within-group unfairness. Finally, we have shown that within-group monotonicity can be achieved at a small cost in terms of prediction granularity and shortlist size.

Our work opens up many interesting avenues for future work. For example, it would be interesting to design classifiers that are within-group monotone with respect to every group that can be identified within a specified class of computations (Hébert-Johnson et al., 2018). Moreover, in some scenarios, it might be sufficient to control the probability that an individual suffers from within-group unfairness. Further, it would be important to investigate how within-group monotonicity interacts with group fairness (Hardt et al., 2016; Zafar et al., 2017). In addition, it would be interesting to investigate how frequently within-group unfairness occurs in other domains such as medicine or content moderation.
Finally, it would be interesting to design post-processing algorithms using a sample access model (Błasiok et al., 2022), rather than a prediction-only access model, and to optimize quality measures other than the partition size.

**Acknowledgements.** We would like to thank Nina Corvelo Benz, Eleni Straitouri, and Luke Wang for fruitful discussions during different stages of the project, and the anonymous ICML reviewers for constructive feedback. Gomez-Rodriguez acknowledges support from the European Research Council (ERC) under the European Union's Horizon 2020 research and innovation programme (grant agreement No. 945719).

## References

- Angwin, J., Larson, J., Mattu, S., and Kirchner, L. Machine bias: There's software used across the country to predict future criminals. And it's biased against blacks. ProPublica, May 23, 2016.
- Ayer, M. C., Brunk, H. D., Ewing, G. M., Reid, W. T., and Silverman, E. An empirical distribution function for sampling with incomplete information. Annals of Mathematical Statistics, 26:641–647, 1955.
- Barber, R. F. Is distribution-free inference possible for binary regression? Electronic Journal of Statistics, 14(2):3487–3524, 2020.
- Barlow, R. E. and Brunk, H. D. The isotonic regression problem and its dual. Journal of the American Statistical Association, 67(337):140–147, 1972.
- Bartholomew, D. J. A test of homogeneity for ordered alternatives. Biometrika, 46(1/2):36–48, 1959. URL http://www.jstor.org/stable/2332806.
- Beutel, A., Chen, J., Doshi, T., Qian, H., Wei, L., Wu, Y., Heldt, L., Zhao, Z., Hong, L., Chi, E. H., et al. Fairness in recommendation ranking through pairwise comparisons. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 2212–2220, 2019a.
- Beutel, A., Chen, J., Doshi, T., Qian, H., Woodruff, A., Luu, C., Kreitmann, P., Bischof, J., and Chi, E. H. Putting fairness principles into practice: Challenges, metrics, and improvements. In Proceedings of the 2019 AAAI/ACM Conference on AI, Ethics, and Society, pp. 453–459, 2019b.
- Biega, A. J., Gummadi, K. P., and Weikum, G. Equity of attention: Amortizing individual fairness in rankings. In The 41st International ACM SIGIR Conference on Research & Development in Information Retrieval, pp. 405–414, 2018.
- Błasiok, J., Gopalan, P., Hu, L., and Nakkiran, P. A unifying theory of distance from calibration. arXiv preprint arXiv:2211.16886, 2022.
- Blyth, C. R. On Simpson's paradox and the sure-thing principle. Journal of the American Statistical Association, 67(338):364–366, 1972.
- Bogen, M. and Rieke, A. Help wanted: An examination of hiring algorithms, equity, and bias. 2018.
- Brier, G. W. et al. Verification of forecasts expressed in terms of probability. Monthly Weather Review, 78(1):1–3, 1950.
- Bröcker, J. Erratum to: Estimating reliability and resolution of probability forecasts through decomposition of the empirical score. Climate Dynamics, 39:1–13, 2011. doi: 10.1007/s00382-011-1191-1.
- Celis, L. E., Straszak, D., and Vishnoi, N. K. Ranking with fairness constraints. arXiv preprint arXiv:1704.06840, 2017.
- Chouldechova, A. Fair prediction with disparate impact: A study of bias in recidivism prediction instruments. Big Data, 5(2):153–163, 2017.
- Collins, B. W. Tackling unconscious bias in hiring practices: The plight of the Rooney Rule. NYU Law Review, 82:870, 2007.
- Corbett-Davies, S., Pierson, E., Feller, A., Goel, S., and Huq, A. Algorithmic decision making and the cost of fairness. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 797–806, 2017.
- Cowgill, B. Bias and productivity in humans and algorithms: Theory and evidence from resume screening. Columbia Business School, Columbia University, 29, 2018.
- Datta, A., Tschantz, M. C., and Datta, A. Automated experiments on ad privacy settings: A tale of opacity, choice, and discrimination. arXiv preprint arXiv:1408.6491, 2014.
- Dawid, A. P. The well-calibrated Bayesian. Journal of the American Statistical Association, 1982.
- Dieterich, W., Mendoza, C., and Brennan, T. COMPAS risk scales: Demonstrating accuracy equity and predictive parity. Northpointe Inc., 7(4), 2016.
- Ding, F., Hardt, M., Miller, J., and Schmidt, L. Retiring adult: New datasets for fair machine learning. Advances in Neural Information Processing Systems, 34:6478–6490, 2021.
- Dressel, J. and Farid, H. The accuracy, fairness, and limits of predicting recidivism. Science Advances, 4(1):eaao5580, 2018.
- van Eeden, C. Testing and Estimating Ordered Parameters of Probability Distributions. PhD thesis, 1958.
- Etzioni, R., Urban, N., Ramsey, S., McIntosh, M., Schwartz, S., Reid, B., Radich, J., Anderson, G., and Hartwell, L. Early detection: The case for early detection. Nature Reviews Cancer, 3:243–252, 2003. doi: 10.1038/nrc1041.
- Feller, A., Pierson, E., Corbett-Davies, S., and Goel, S. A computer program used for bail and sentencing decisions was labeled biased against blacks. It's actually not that clear. The Washington Post, 17, 2016.
- Ferro, C. A. T. and Fricker, T. E. A bias-corrected decomposition of the Brier score. Quarterly Journal of the Royal Meteorological Society, 138, 2012.
- Friedler, S. A., Scheidegger, C., and Venkatasubramanian, S. On the (im)possibility of fairness. arXiv preprint arXiv:1609.07236, 2016.
- Garb, H. N. Race bias, social class bias, and gender bias in clinical judgment. Clinical Psychology: Science and Practice, 4(2):99, 1997.
- García-Soriano, D. and Bonchi, F. Maxmin-fair ranking: Individual fairness under group-fairness constraints. In Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining, pp. 436–446, 2021.
- Garr, S. S. and Jackson, C. Diversity & inclusion technology: The rise of a transformative market. Red Thread Research and Mercer, 2019.
- Gneiting, T., Balabdaoui, F., and Raftery, A. Probabilistic forecasts, calibration and sharpness. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 2007.
- Gorwa, R., Binns, R., and Katzenbach, C. Algorithmic content moderation: Technical and political challenges in the automation of platform governance. Big Data & Society, 7(1), 2020.
- Guo, C., Pleiss, G., Sun, Y., and Weinberger, K. Q. On calibration of modern neural networks. In International Conference on Machine Learning, pp. 1321–1330. PMLR, 2017.
- Gupta, C., Podkopaev, A., and Ramdas, A. Distribution-free binary classification: Prediction sets, confidence intervals and calibration. Advances in Neural Information Processing Systems, 33:3711–3723, 2020.
- Hardt, M., Price, E., and Srebro, N. Equality of opportunity in supervised learning. Advances in Neural Information Processing Systems, 29, 2016.
- Hébert-Johnson, U., Kim, M., Reingold, O., and Rothblum, G. Multicalibration: Calibration for the (computationally-identifiable) masses. In International Conference on Machine Learning, pp. 1939–1948. PMLR, 2018.
- Jin, Y. and Candès, E. J. Selection by prediction with conformal p-values. arXiv preprint arXiv:2210.01408, 2022.
- Jordan, A. I., Mühlemann, A., and Ziegel, J. F. Optimal solutions to the isotonic regression problem. arXiv preprint arXiv:1904.04761, 2019.
- Jung, C., Lee, C., Pai, M., Roth, A., and Vohra, R. Moment multicalibration for uncertainty estimation. In Conference on Learning Theory, pp. 2634–2678. PMLR, 2021.
- Karandikar, A., Cain, N., Tran, D., Lakshminarayanan, B., Shlens, J., Mozer, M. C., and Roelofs, B. Soft calibration objectives for neural networks. In Advances in Neural Information Processing Systems, 2021.
- Karp, R. M. Reducibility among combinatorial problems. In Complexity of Computer Computations: Proceedings of a Symposium on the Complexity of Computer Computations, pp. 85–103. Springer, 1972.
- Kilbertus, N., Gomez-Rodriguez, M., Schölkopf, B., Muandet, K., and Valera, I. Fair decisions despite imperfect predictions. In International Conference on Artificial Intelligence and Statistics, pp. 277–287, 2020.
- Kim, M. P., Ghorbani, A., and Zou, J. Multiaccuracy: Black-box post-processing for fairness in classification. In Proceedings of the 2019 AAAI/ACM Conference on AI, Ethics, and Society, pp. 247–254, 2019.
- Kleinberg, J. Inherent trade-offs in algorithmic fairness. In Abstracts of the 2018 ACM International Conference on Measurement and Modeling of Computer Systems, pp. 40–40, 2018.
- Krishnan, R. and Tickoo, O. Improving model calibration with accuracy versus uncertainty optimization. arXiv preprint arXiv:2012.07923, 2020.
- Kumar, A., Sarawagi, S., and Jain, U. Trainable calibration measures for neural networks from kernel mean embeddings. In International Conference on Machine Learning, pp. 2805–2814. PMLR, 2018.
- Lahoti, P., Beutel, A., Chen, J., Lee, K., Prost, F., Thain, N., Wang, X., and Chi, E. Fairness without demographics through adversarially reweighted learning. Advances in Neural Information Processing Systems, 33:728–740, 2020.
- Miles, R. E. The complete amalgamation into blocks, by weighted means, of a finite set of real numbers. Biometrika, 46(3-4):317–327, 1959. doi: 10.1093/biomet/46.3-4.317. URL https://doi.org/10.1093/biomet/46.3-4.317.
- Pearl, J. Models, Reasoning and Inference. Cambridge, UK: Cambridge University Press, 19(2), 2000.
- Pleiss, G., Raghavan, M., Wu, F., Kleinberg, J., and Weinberger, K. Q. On fairness and calibration. Advances in Neural Information Processing Systems, 30, 2017.
- Prost, F., Packer, B., Chen, J., Wei, L., Kremp, P., Blumm, N., Wang, S., Doshi, T., Osadebe, T., Heldt, L., et al. Simpson's paradox in recommender fairness: Reconciling differences between per-user and aggregated evaluations. arXiv preprint arXiv:2210.07755, 2022.
- Raghavan, M., Barocas, S., Kleinberg, J., and Levy, K. Mitigating bias in algorithmic hiring: Evaluating claims and practices. In Conference on Fairness, Accountability, and Transparency, pp. 469–481, 2020.
- Roelofs, R., Cain, N., Shlens, J., and Mozer, M. C. Mitigating bias in calibration error estimation. In International Conference on Artificial Intelligence and Statistics, pp. 4036–4054. PMLR, 2022.
- Sahoo, R., Zhao, S., Chen, A., and Ermon, S. Reliable decisions with threshold calibration. In Advances in Neural Information Processing Systems, 2021.
- Shen, L., Margolies, L., Rothstein, J., Fluder, E., McBride, R., and Sieh, W. Deep learning to improve breast cancer detection on screening mammography. Scientific Reports, 9:1–12, 2019. doi: 10.1038/s41598-019-48995-4.
- Simpson, E. The interpretation of interaction in contingency tables. Journal of the Royal Statistical Society, Series B (Methodological), 13:238–241, 1951.
- Singh, A. and Joachims, T. Fairness of exposure in rankings. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 2219–2228, 2018.
- Singh, A. and Joachims, T. Policy learning for fairness in ranking. In Wallach, H., Larochelle, H., Beygelzimer, A., d'Alché-Buc, F., Fox, E., and Garnett, R. (eds.), Advances in Neural Information Processing Systems, 2019.
- Speicher, T., Heidari, H., Grgic-Hlaca, N., Gummadi, K. P., Singla, A., Weller, A., and Zafar, M. B. A unified approach to quantifying algorithmic unfairness: Measuring individual & group unfairness via inequality indices. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 2239–2248, 2018.
- Sweeney, L. Discrimination in online ad delivery. Communications of the ACM, 56(5):44–54, 2013.
- Tambe, P., Cappelli, P., and Yakubovich, V. Artificial intelligence in human resources management: Challenges and a path forward. California Management Review, 61(4):15–42, 2019.
- van der Gaag, L. C., Bodlaender, H. L., and Feelders, A. Monotonicity in Bayesian networks. In Proceedings of the 20th Conference on Uncertainty in Artificial Intelligence, UAI '04, pp. 569–576. AUAI Press, 2004.
- W Flores, A., Bechtel, K., and Lowenkamp, C. False positives, false negatives, and false analyses: A rejoinder to "Machine bias: There's software used across the country to predict future criminals. And it's biased against blacks.". Federal Probation, 2016.
- Wagner, C. H. Simpson's paradox in real life. The American Statistician, 36(1):46–48, 1982.
- Wang, L. and Joachims, T. Fairness in the first stage of two-stage recommender systems. In Proceedings of the 16th International Conference on Web Search and Data Mining, 2023.
- Wang, L., Joachims, T., and Gomez-Rodriguez, M. Improving screening processes via calibrated subset selection. In Proceedings of the 39th International Conference on Machine Learning, 2022.
- Wang, X., Thain, N., Sinha, A., Prost, F., Chi, E. H., Chen, J., and Beutel, A. Practical compositional fairness: Understanding fairness in multi-component recommender systems. In Proceedings of the 14th ACM International Conference on Web Search and Data Mining, pp. 436–444, 2021.
- Williams, D. R. and Mohammed, S. A. Discrimination and racial disparities in health: Evidence and needed research. Journal of Behavioral Medicine, 32(1):20–47, 2009.
- Woodworth, B., Gunasekar, S., Ohannessian, M. I., and Srebro, N. Learning non-discriminatory predictors. In Conference on Learning Theory, pp. 1920–1953. PMLR, 2017.
- Yang, K. and Stoyanovich, J. Measuring fairness in ranked outputs. In Proceedings of the 29th International Conference on Scientific and Statistical Database Management, pp. 1–6, 2017.
- Yang, K., Gkatzelis, V., and Stoyanovich, J. Balanced ranking with diversity constraints. In Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence, IJCAI, 2019.
- Yu, Y.-L. and Xing, E. P. Exact algorithms for isotonic regression and related. In Journal of Physics: Conference Series, volume 699, pp. 012016. IOP Publishing, 2016.
- Zadrozny, B. and Elkan, C. Obtaining calibrated probability estimates from decision trees and naive Bayesian classifiers. In ICML, volume 1, pp. 609–616. Citeseer, 2001.
- Zadrozny, B. and Elkan, C. Transforming classifier scores into accurate multiclass probability estimates. In Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2002.
Zafar, M. B., Valera, I., Gomez-Rodriguez, M., and Gummadi, K. P. Fairness constraints: Mechanisms for fair classification. In Artificial Intelligence and Statistics, pp. 962–970, 2017.

Zehlike, M., Bonchi, F., Castillo, C., Hajian, S., Megahed, M., and Baeza-Yates, R. FA*IR: A fair top-k ranking algorithm. In Proceedings of the 2017 ACM on Conference on Information and Knowledge Management, pp. 1569–1578, 2017.

Zehlike, M., Sühr, T., Baeza-Yates, R., Bonchi, F., Castillo, C., and Hajian, S. Fair top-k ranking with multiple protected groups. Information Processing & Management, 59(1):102707, 2022.

A.1. Proof of Proposition 2.1

By definition, the threshold decision rule $\pi$ outputs $S = 0$ if $f(X) = a$ and $S = 1$ if $f(X) = b$. As a result, it immediately follows that
$$\mathbb{E}_{Y \sim P_{Y \mid X, Z},\, S \sim \pi}\left[ Y(1-S) \mid f(X) = a, Z = z \right] = \mathbb{E}_{Y \sim P_{Y \mid X, Z}}\left[ Y \mid f(X) = a, Z = z \right] > \mathbb{E}_{Y \sim P_{Y \mid X, Z}}\left[ Y \mid f(X) = b, Z = z \right] = \mathbb{E}_{Y \sim P_{Y \mid X, Z},\, S \sim \pi}\left[ Y S \mid f(X) = b, Z = z \right].$$
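To make the statement concrete, the following minimal sketch illustrates the type of within-group unfairness that Proposition 2.1 formalizes; the bin values and group-conditional probabilities are hypothetical numbers chosen by us, not taken from the paper's experiments.

```python
# Two calibrated bins a < b and a threshold rule that shortlists only bin b.
# Within group z, the rejected bin has the higher qualification probability.
bins = [0.3, 0.5]            # a = Pr(Y=1 | f(X)=a) for each bin
group_z = {0.3: 0.6,         # Pr(Y=1 | f(X)=0.3, Z=z), hypothetical
           0.5: 0.4}         # Pr(Y=1 | f(X)=0.5, Z=z), hypothetical

threshold = 0.4              # threshold decision rule: shortlist iff f(X) > 0.4
for a in bins:
    decision = "shortlisted" if a > threshold else "rejected"
    print(f"f(X)={a}: {decision}, Pr(Y=1 | f(X)={a}, Z=z) = {group_z[a]}")

# The rejected candidate (f(X)=0.3) is more likely to be qualified within
# group z (0.6) than the shortlisted one (0.4): the violation in Prop. 2.1.
```

Such numbers are consistent with calibration, since the group-conditional values can average back to the bin values once the remaining groups are taken into account.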
A.2. Proof of Theorem 3.1

We call a partition $B \in \mathcal{P}$ valid if $f_B$ is within-group monotone. We first show that, by finding a valid partition $B$ of maximum size, we can decide whether there exists a valid partition $B'$ of size $|B'| = 2$. Assume the valid partition $B$ of maximum size has size $|B| = m$. Then, if $m \geq 2$, we can conclude that such a partition exists using Lemma A.1 and, if $m < 2$, no such partition exists because $B$ is the valid partition of maximum size. Now, since we prove in Lemma A.2 that this decision problem is NP-complete, we can directly conclude that the problem of finding the valid partition of maximum size is NP-hard.

Lemma A.1. Assume the valid partition $B$ of maximum size has size $|B| = k$. Then, for every $k' \in \{1, \ldots, k-1\}$, there exists a valid partition $B'$ such that $|B'| = k'$.

Proof. By Proposition 3.3, we have that any contiguous partition $B'$ on $\{1, \ldots, |B|\}$ is monotone with respect to $f_B$. Furthermore, due to the same proposition, $B'$ is also monotone with respect to the set $\{a_{A_i,z}\}_{i \in \{1,\ldots,|B|\}}$ for all $z \in \mathcal{Z}$. Since $B$ is valid, we have that $\{a_{A_i,z}\}_{i \in \{1,\ldots,|B|\}}$ is increasing for all $z \in \mathcal{Z}$. As a result, $B'$ is a valid partition. Thus, for any $k' \in \{1, \ldots, k-1\}$, we have that the contiguous partition $B' = \{A_1, A_2, \ldots, A_{k'-1}, \bigcup_{j \in \{0,\ldots,|B|-k'\}} A_{|B|-j}\}$ is valid and $|B'| = k'$. This concludes the proof.

Lemma A.2. The problem of deciding whether there exists a valid partition $B$ such that $|B| = 2$ is NP-complete.

Proof. First, it is easy to see that, given a partition $B$, we can check whether the partition is valid and has size $|B| = 2$ in polynomial time. Therefore, the problem belongs to NP. Now, to show the problem is NP-complete, we perform a reduction from a variation of the classical partition problem (Karp, 1972), which we refer to as the equal average partition problem. The equal average partition problem seeks to decide whether a set of $n$ positive integers $S = \{s_1, \ldots, s_n\}$ can be partitioned into two subsets of equal average. In Theorem A.3, we prove that the equal average partition problem is NP-complete, a result which may be of independent interest.[17] Without loss of generality, we assume $s_i \in [0, 1]$ for all $s_i \in S$[18] and $s_i \leq s_j$ if $i < j$. For every $s_i \in S$, we set $a_{i,z_1} = s_i$, $a_{i,z_2} = 1 - s_i$, $\rho_i = \frac{1}{n}$, $\rho_{z_1 \mid i} = \alpha$ and $\rho_{z_2 \mid i} = 1 - \alpha$ for $\alpha \in (0.5, 0.75]$. Note that we will have that $a_i = \alpha s_i + (1-\alpha)(1 - s_i) = (2\alpha - 1) s_i + (1 - \alpha) \in [0, 1]$. Note first that, for any $A \in B$,
$$a_{A,z_1} = \frac{\sum_{j \in A} \rho_j \rho_{z_1 \mid j}\, a_{j,z_1}}{\sum_{j \in A} \rho_j \rho_{z_1 \mid j}} = \frac{\sum_{j \in A} a_{j,z_1}}{|A|} = 1 - \frac{\sum_{j \in A} (1 - a_{j,z_1})}{|A|} = 1 - a_{A,z_2} \qquad (2)$$
and
$$a_A = \frac{\sum_{j \in A} \left( (2\alpha - 1)\, a_{j,z_1} + 1 - \alpha \right)}{|A|} = (2\alpha - 1)\, \frac{\sum_{j \in A} a_{j,z_1}}{|A|} + 1 - \alpha = (2\alpha - 1)\, a_{A,z_1} + 1 - \alpha. \qquad (3)$$
Note that, whenever we have that $a_{A,z_1} < a_{A',z_1}$, it will also hold that $a_A < a_{A'}$, as $2\alpha - 1 > 0$. Now, assume a valid partition $B$ with $|B| = 2$ exists and $B = \{A_1, A_2\}$. Without loss of generality, assume $a_{A_1,z_1} \leq a_{A_2,z_1}$. Since $B$ is a valid partition, we should also have that $a_{A_1,z_2} \leq a_{A_2,z_2}$. Furthermore,
$$a_{A_1,z_1} \leq a_{A_2,z_1} \iff 1 - a_{A_1,z_1} \geq 1 - a_{A_2,z_1} \iff a_{A_1,z_2} \geq a_{A_2,z_2}. \qquad (4)$$
Since it simultaneously holds that $a_{A_1,z_2} \leq a_{A_2,z_2}$ and $a_{A_1,z_2} \geq a_{A_2,z_2}$, a valid partition $B$ with $|B| = 2$ exists if and only if $a_{A_1,z_2} = a_{A_2,z_2}$ and, hence, $a_{A_1,z_1} = a_{A_2,z_1}$. As $a_{A_1,z_1}$ is the average of $s_j$ for $j \in A_1$ and $a_{A_2,z_1}$ is the average of $s_j$ for $j \in A_2$, the partition $B$ partitions $S$ into two subsets of equal average. We now prove that, if no valid partition $B$ with $|B| = 2$ exists, there is no way of partitioning $S$ into two subsets of equal average. For the sake of contradiction, assume $S$ can be partitioned into $S_1$ and $S_2$ with equal averages $\kappa$. Define $A_1 = \{i \mid s_i \in S_1\}$ and $A_2 = \{j \mid s_j \in S_2\}$. Now, if we build an instance of our problem based on $S$ as described before and set $B = \{A_1, A_2\}$ (clearly, $B$ is a partition of $\{1, \ldots, n\}$), we have that $a_{A_1,z_1} = a_{A_2,z_1} = \kappa$, $a_{A_1,z_2} = a_{A_2,z_2} = 1 - \kappa$ (refer to Eq. 2) and $a_{A_1} = a_{A_2} = (2\alpha - 1)\kappa + (1 - \alpha)$ (refer to Eq. 3). As a result, we have that $B$ is a valid partition of size 2, which is a contradiction. This concludes the proof.

[17] Given the similarity of the equal average partition problem to the classical partition problem, we would have expected to find a proof of NP-completeness elsewhere. However, we failed to find such a proof in previous work.
[18] We can always divide every element in $S$ by the largest member of $S$ to ensure elements fall in $[0, 1]$.
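Since the equal average partition problem may be unfamiliar, here is a minimal brute-force decision procedure for it; its exponential running time is consistent with the NP-completeness established in Theorem A.3 below. The function name and interface are ours, purely for illustration.

```python
from itertools import combinations

def has_equal_average_partition(S):
    """Return True iff S splits into two non-empty subsets of equal average."""
    n = len(S)
    for size in range(1, n // 2 + 1):        # complements cover larger sizes
        for chosen in combinations(range(n), size):
            s1 = [S[i] for i in chosen]
            s2 = [S[i] for i in range(n) if i not in chosen]
            # Cross-multiplied equality of averages avoids any division.
            if sum(s1) * len(s2) == sum(s2) * len(s1):
                return True
    return False

print(has_equal_average_partition([1, 2, 3, 4]))   # True: {2, 3} and {1, 4}
print(has_equal_average_partition([1, 2, 4]))      # False
```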
Theorem A.3. Given a set of $n$ positive integers, the problem of deciding whether it can be partitioned into two non-empty subsets of equal average is NP-complete.

Proof. First, it is easy to see that, given two subsets, we can evaluate their averages in polynomial time and check whether they are equal or not. Therefore, the problem belongs to NP. In the remainder of the proof, we will perform a reduction from the equal cardinality partition problem, which is known to be NP-complete, to the equal average partition problem. In the original problem, we are given a set of $n$ positive integers $S$, where $n$ is an even number. The objective is to decide whether there exist two subsets $S_1, S_2 \subseteq S$ such that $S_1 \cup S_2 = S$ and $S_1 \cap S_2 = \emptyset$, with $|S_1| = |S_2|$ and $\sum_{i \in S_1} i = \sum_{j \in S_2} j$. Now, we will transform an arbitrary instance of that problem into an instance of the equal average partition problem. Let the set of integers be $S' = S \cup \{n\sigma, n\sigma\}$, where $\sigma = \sum_{k \in S} k$. It is easy to see that the average of $S'$ is equal to $\frac{(2n+1)\sigma}{n+2}$.

We will start by showing that, if we can decide positively about that instance of the equal average partition problem, we can also decide positively about the original instance of the equal cardinality partition problem. Assume there exists a partition of $S'$ into two sets $S'_1, S'_2$ with equal averages. As an intermediate result, we will show that the two copies of the number $n\sigma$ cannot belong to the same set $S'_1$ or $S'_2$. For the sake of contradiction, and without loss of generality, assume that both copies belong to $S'_1$. In the case where $S'_1 = \{n\sigma, n\sigma\}$, the average of $S'_1$ is $n\sigma$ and the average of $S'_2 = S$ is $\frac{\sigma}{n}$, which is a contradiction, since the two quantities cannot be equal because $n \geq 2$. In cases where $S'_1$ contains at least one more element, since $S'_2 \neq \emptyset$, the average of $S'_1$ is $\frac{2n\sigma + \kappa}{2 + l}$, with $0 < \kappa < \sigma$ and $1 \leq l \leq n - 1$, and the average of $S'_2$ is $\frac{\sigma - \kappa}{n - l}$. It follows that
$$\frac{\sigma - \kappa}{n - l} \leq \sigma - \kappa < \sigma \overset{(*)}{<} \frac{(2n+1)\sigma}{n+2} = \frac{\sum_{k \in S'} k}{|S'|},$$
where $(*)$ holds because $n > 1$. According to Lemma A.4, the last inequality leads to a contradiction, since both sets of an equal average partition of $S'$ must have average equal to that of $S'$. With that, we can conclude that one copy of $n\sigma$ belongs to $S'_1$ and the other one belongs to $S'_2$. Let $S_1, S_2$ be such that $S'_1 = \{n\sigma\} \cup S_1$ and $S'_2 = \{n\sigma\} \cup S_2$. We will now show that $S_1$ and $S_2$ are a solution to the original instance of the equal cardinality partition problem, i.e., $|S_1| = |S_2|$ and $\sum_{i \in S_1} i = \sum_{j \in S_2} j$. It is trivial to see that $S_1, S_2$ have to be non-empty, otherwise the averages of $S'_1$ and $S'_2$ would differ. Since $S'_1, S'_2$ are a partition of $S'$ with equal averages and because of Lemma A.4, we know that
$$\frac{n\sigma + \sum_{i \in S_1} i}{1 + |S_1|} = \frac{n\sigma + \sum_{j \in S_2} j}{1 + |S_2|} = \frac{(2n+1)\sigma}{n+2}. \qquad (5)$$
For the sake of contradiction, assume that either $|S_1| \neq |S_2|$ or $\sum_{i \in S_1} i \neq \sum_{j \in S_2} j$. For brevity, we will focus only on the two following cases, as any other case leads easily to a contradiction:

$|S_1| < |S_2|$ and $\sum_{i \in S_1} i < \sum_{j \in S_2} j$: Since $S_1, S_2$ are such that $S_1 \cup S_2 = S$ and $\sum_{i \in S_1} i > 0$, it holds that $\sum_{j \in S_2} j - \sum_{i \in S_1} i < \sigma$, and therefore
$$\left( \frac{(2n+1)\sigma}{n+2}(1 + |S_2|) - n\sigma \right) - \left( \frac{(2n+1)\sigma}{n+2}(1 + |S_1|) - n\sigma \right) \overset{(*)}{<} \sigma \implies \frac{(2n+1)\sigma}{n+2} \left( |S_2| - |S_1| \right) < \sigma \implies (2n+1)(|S_2| - |S_1|) < n + 2 \overset{(**)}{\implies} 2n + 1 < n + 2 \implies n < 1,$$
where $(*)$ follows from Equation 5, and $(**)$ holds because $|S_2| - |S_1| \geq 1$. The last inequality is clearly a contradiction.

$|S_1| > |S_2|$ and $\sum_{i \in S_1} i > \sum_{j \in S_2} j$: The proof is the symmetric version of the proof in the previous case.

Therefore, we can conclude that $S_1$ and $S_2$ are a solution to the original problem, i.e., they are a partition of $S$ with equal cardinality and equal sums. Lastly, we will show that, if there is no partition of $S'$ with equal averages, there can be no equal cardinality partition of $S$ with equal sums. For the sake of contradiction, assume there exist $S_1, S_2$ with $|S_1| = |S_2|$ and $\sum_{i \in S_1} i = \sum_{j \in S_2} j$. Then, let $S'_1 = \{n\sigma\} \cup S_1$ and $S'_2 = \{n\sigma\} \cup S_2$. It is easy to see that
$$\frac{\sum_{i \in S'_1} i}{|S'_1|} = \frac{n\sigma + \sum_{i \in S_1} i}{1 + |S_1|} = \frac{n\sigma + \sum_{j \in S_2} j}{1 + |S_2|} = \frac{\sum_{j \in S'_2} j}{|S'_2|}, \qquad (6)$$
which is a contradiction, since it means that $S'_1$ and $S'_2$ are a partition of $S'$ with equal averages. Following the above procedure, we can decide whether the original instance of the equal cardinality problem has a solution or not. As a consequence, the problem of deciding whether a set of positive integers can be partitioned into two subsets of equal average is NP-complete.

Lemma A.4. A set of integers $S$ can be partitioned into two non-empty sets $S_1, S_2$ with equal averages, $\frac{\sum_{i \in S_1} i}{|S_1|} = \frac{\sum_{j \in S_2} j}{|S_2|}$, if and only if there exists a non-empty set $S_1 \subset S$ such that $\frac{\sum_{i \in S_1} i}{|S_1|} = \frac{\sum_{k \in S} k}{|S|}$, with $|S_1| < |S|$.

Proof. First, assume there is such a partition of $S$ into $S_1, S_2$ with equal averages. It holds that
$$\frac{\sum_{i \in S_1} i}{|S_1|} = \frac{\sum_{k \in S} k - \sum_{i \in S_1} i}{|S| - |S_1|} \implies (|S| - |S_1|) \sum_{i \in S_1} i = |S_1| \left( \sum_{k \in S} k - \sum_{i \in S_1} i \right) \implies |S| \sum_{i \in S_1} i = |S_1| \sum_{k \in S} k \implies \frac{\sum_{i \in S_1} i}{|S_1|} = \frac{\sum_{k \in S} k}{|S|},$$
where $S_1 \subset S$ because $S_2 \neq \emptyset$. Now, assume there exists a set $S_1 \subset S$ such that $\frac{\sum_{i \in S_1} i}{|S_1|} = \frac{\sum_{k \in S} k}{|S|}$, and let $S_2 = S \setminus S_1$. It is easy to see that $\frac{\sum_{j \in S_2} j}{|S| - |S_1|} = \frac{\sum_{k \in S} k}{|S|}$, and therefore the sets $S_1, S_2$ constitute a partition of $S$ with equal averages.
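The construction at the heart of Theorem A.3 is easy to instantiate. The following sketch, with names of our own choosing, maps an equal cardinality partition instance to the corresponding equal average partition instance and checks a witness split on a toy input.

```python
def reduce_to_equal_average(S):
    """Map an equal cardinality partition instance S to S' = S + [n*sigma]*2."""
    sigma = sum(S)
    return S + [len(S) * sigma, len(S) * sigma]

S = [1, 2, 3, 4]                      # admits the equal-size, equal-sum split {1,4}, {2,3}
S_prime = reduce_to_equal_average(S)  # [1, 2, 3, 4, 40, 40], average 15

# Per the proof, a witness places one copy of n*sigma on each side.
S1_prime = [40, 1, 4]
S2_prime = [40, 2, 3]
assert (sum(S1_prime) / len(S1_prime)
        == sum(S2_prime) / len(S2_prime)
        == sum(S_prime) / len(S_prime))   # all equal to 15
```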
A.3. Proof of Proposition 3.3

We first prove the sufficient condition, i.e., we prove that, if $f_B$ is monotone with respect to $f$, then $B$ is a contiguous partition on $\{1, \ldots, n\}$. The proof is by contradiction. Assume $B$ is not a contiguous partition, i.e., there exist $x_1, x_2, x_3 \in \mathcal{X}$ such that $i(x_1) < i(x_2) < i(x_3)$, $i(x_1) \sim_B i(x_3)$ and $i(x_1) \not\sim_B i(x_2)$. If $a_{[i(x_1)]} > a_{[i(x_2)]}$, then $f_B(x_1) > f_B(x_2)$; however, since $f(x_1) < f(x_2)$, this leads to a contradiction with the monotonicity assumption. On the other hand, if $a_{[i(x_1)]} < a_{[i(x_2)]}$, then $f_B(x_3) < f_B(x_2)$, since $i(x_1) \sim_B i(x_3)$ and thus $a_{[i(x_3)]} < a_{[i(x_2)]}$; however, this leads again to a contradiction with the monotonicity assumption. This proves that $B$ must be a contiguous partition.

Next, we prove the necessary condition, i.e., we prove that, if $B$ is a contiguous partition on $\{1, \ldots, n\}$, then $f_B$ is monotone with respect to $f$. For any $x_1, x_2 \in \mathcal{X}$ such that $f(x_1) < f(x_2)$, we have that
$$f_B(x_1) = a_{[i(x_1)]} = \frac{\sum_{l \in [i(x_1)]} a_l \rho_l}{\sum_{l \in [i(x_1)]} \rho_l} \leq \frac{\sum_{l \in [i(x_2)]} a_l \rho_l}{\sum_{l \in [i(x_2)]} \rho_l} = a_{[i(x_2)]} = f_B(x_2),$$
where the inequality is due to Lemma A.5 below and the fact that the weighted average of a set of numbers is lower and upper bounded by the smallest and largest element of the set, respectively.

Lemma A.5. Let $f$ be a classifier with $\mathrm{Range}(f) = \{a_1, \ldots, a_n\}$, $B$ be a contiguous partition on $\{1, \ldots, n\}$, and $x_1, x_2 \in \mathcal{X}$. If $i(x_1) < i(x_2)$ and $i(x_1) \not\sim_B i(x_2)$, then, for every $k \in [i(x_1)]$ and $k' \in [i(x_2)]$, it holds that $k < k'$.

Proof. To prove the lemma, we just need to prove that the largest index in $[i(x_1)]$ is smaller than the smallest index in $[i(x_2)]$. The proof is by contradiction. Let $l = \max\{k \mid k \in [i(x_1)]\}$ and $s = \min\{k \mid k \in [i(x_2)]\}$, and assume that $l > s$. Then, it cannot simultaneously hold that $i(x_1) = l$ and $i(x_2) = s$, since we have that $i(x_1) < i(x_2)$. Assume first that $i(x_1) \neq l$, and take $x_3, x_4 \in \mathcal{X}$ such that $i(x_3) = s$ and $i(x_4) = l$. If $i(x_3) < i(x_1)$, then it holds that $i(x_3) < i(x_1) < i(x_2)$; however, since $i(x_2) \sim_B i(x_3)$ and $i(x_1) \not\sim_B i(x_2)$, this leads to a contradiction with the assumption that $B$ is contiguous. If $i(x_3) > i(x_1)$, then it holds that $i(x_1) < i(x_3) < i(x_4)$; however, since $i(x_1) \sim_B i(x_4)$ while $i(x_3) \not\sim_B i(x_4)$, this also leads to a contradiction with the assumption that $B$ is contiguous. If one assumes instead that $i(x_1) = l$, a similar reasoning using $i(x_2)$ and $i(x_4)$ leads to a contradiction too. This completes the proof.

A.4. Proof of Proposition 4.1

We prove the result by contradiction. Assume there exist violations of within-group monotonicity. We first define the nearest violating triplet $(l, r, z)$ as
$$(l, r, z) = \operatorname*{argmin}_{\{(i, j, z) \,\mid\, i, j \in \mathrm{Range}(f_B),\; i < j,\; a_{A_i, z} > a_{A_j, z}\}} (j - i).$$
If $r = l + 1$, this contradicts the assumption that no monotonicity violations occur between adjacent cells. If $r \neq l + 1$, there exists $i \in \mathrm{Range}(f_B)$ such that $l < i < r$. Then, it should hold that $a_{A_l, z} \leq a_{A_i, z} \leq a_{A_r, z}$, since otherwise either $(l, i, z)$ or $(i, r, z)$ would be the nearest violating triplet. In this case, however, $a_{A_l, z} \leq a_{A_r, z}$, which is a contradiction with $(l, r, z)$ being a violating triplet. As a result, no such triplet can exist and $f_B$ is within-group monotone.
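In practice, Proposition 4.1 reduces verifying within-group monotonicity to a linear scan over adjacent cells. The following sketch implements that check; the data layout (dictionaries keyed by group) is our own illustration, not the paper's notation, and the test values come from the PAV counterexample in Section A.5 below with $\alpha = 0.2$.

```python
def pooled(values, weights):
    """Weighted average of the per-bin values that fall in one cell."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)

def is_within_group_monotone(cells, a_z, rho_z):
    """cells: contiguous partition of bin indices, e.g. [[0, 1], [2]].
    a_z[z][i]: group-conditional score of bin i for group z.
    rho_z[z][i]: weight of bin i for group z (e.g. rho_i * rho_{z|i})."""
    for z in a_z:
        pooled_z = [pooled([a_z[z][i] for i in cell],
                           [rho_z[z][i] for i in cell]) for cell in cells]
        # Per Proposition 4.1, checking adjacent cells suffices.
        if any(pooled_z[k] > pooled_z[k + 1] for k in range(len(pooled_z) - 1)):
            return False
    return True

a_z = {"z1": [0.4, 0.2, 0.8], "z2": [0.2, 0.6, 0.2]}
rho_z = {z: [1 / 6] * 3 for z in a_z}
print(is_within_group_monotone([[0], [1, 2]], a_z, rho_z))  # True
```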
A.5. Proof of Lack of Local Optimality of the Pool Adjacent Violators (PAV) Algorithm

Let $\mathrm{Range}(f) = \{a_1, a_2, a_3\}$, $\mathcal{Z} = \{z_1, z_2\}$ and $\rho_i \rho_{z \mid i} = \frac{1}{6}$ for all $i \in \{1, 2, 3\}$ and $z \in \mathcal{Z}$. Further, let $a_{1,z_2} = a_{2,z_1} = a_{3,z_2} = \alpha$, $a_{1,z_1} = 2\alpha$, $a_{2,z_2} = 3\alpha$ and $a_{3,z_1} = 4\alpha$, where $\alpha \in (0, 0.25]$. First, we note that, by construction, it holds that $a_1 = \frac{3}{2}\alpha < a_2 = 2\alpha < a_3 = \frac{5}{2}\alpha$. Now, since $a_{1,z_1} > a_{2,z_1}$, Algorithm 1 first merges these two bins; then, since $a_{\{1,2\},z_2} > a_{\{3\},z_2}$, it merges all three bins together and finally terminates, returning $B = \{\{1, 2, 3\}\}$. However, since it holds that $a_{1,z_1} < a_{\{2,3\},z_1}$ and $a_{1,z_2} < a_{\{2,3\},z_2}$, the partition $B' = \{\{1\}, \{2, 3\}\}$ induces a classifier $f_{B'}$ that is within-group monotone, and it readily follows that $f_{B'}$ dominates $f_B$.
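The following sketch reproduces this counterexample numerically. The merging loop is our paraphrase of a PAV-style procedure that pools adjacent cells whenever some group-conditional average decreases; it is not the paper's exact pseudocode for Algorithm 1.

```python
alpha = 0.2
a_z = {"z1": [2 * alpha, alpha, 4 * alpha],   # a_{i,z1} for bins 1..3
       "z2": [alpha, 3 * alpha, alpha]}       # a_{i,z2}, equal weights 1/6

def avg(z, cell):
    return sum(a_z[z][i] for i in cell) / len(cell)

cells = [[0], [1], [2]]
merged = True
while merged:
    merged = False
    for k in range(len(cells) - 1):
        if any(avg(z, cells[k]) > avg(z, cells[k + 1]) for z in a_z):
            cells[k:k + 2] = [cells[k] + cells[k + 1]]  # pool adjacent violators
            merged = True
            break

print(cells)   # [[0, 1, 2]]: PAV pools everything, although [[0], [1, 2]]
               # is already within-group monotone and strictly finer.
```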
A.6. Proof of Lemma 4.3

We first prove the sufficient condition, i.e., we prove that, for any $B \in \mathcal{B}_{l,r}$, there exists $k < l$ such that $B \setminus \{\{l, \ldots, r\}\} \in \mathcal{B}_{k,l-1}$ and $a_{\{k,\ldots,l-1\},z} \leq a_{\{l,\ldots,r\},z}$ for all $z \in \mathcal{Z}$. Let $B' = B \setminus \{\{l, \ldots, r\}\}$. To this end, we start by proving by contradiction that there exists $k < l$ such that $B' \in \mathcal{B}_{k,l-1}$. Since the partition $B$ covers $\{1, \ldots, r\}$, we have that the last cell of $B'$ contains bin $l - 1$. Assume $B' \notin \bigcup_{k=1}^{l-1} \mathcal{B}_{k,l-1}$. Then, there must exist $A, A' \in B'$ and $z \in \mathcal{Z}$ such that $a_A < a_{A'}$ and $a_{A,z} > a_{A',z}$. However, since $B' \subset B$, it also holds that $A, A' \in B$, and $f_B$ cannot be within-group monotone on $\cup_{i \leq r} \mathcal{X}_i$, leading to a contradiction. Therefore, it must hold that $B' \in \bigcup_{k=1}^{l-1} \mathcal{B}_{k,l-1}$. Now, to prove that, if $B' \in \bigcup_{k=1}^{l-1} \mathcal{B}_{k,l-1}$ and $B \in \mathcal{B}_{l,r}$, then it must hold that $a_{\{k,\ldots,l-1\},z} \leq a_{\{l,\ldots,r\},z}$ for all $z \in \mathcal{Z}$, we resort to Lemma A.6.

We next prove the necessary condition, i.e., we prove that, given any $B \in \mathcal{B}_r$, if there exists $k < l$ such that $B \setminus \{\{l, \ldots, r\}\} \in \mathcal{B}_{k,l-1}$ and $a_{\{k,\ldots,l-1\},z} \leq a_{\{l,\ldots,r\},z}$ for all $z \in \mathcal{Z}$, then $B \in \mathcal{B}_{l,r}$. Let $B' = B \setminus \{\{l, \ldots, r\}\}$. Since $B' \in \mathcal{B}_{k,l-1}$, we know that no violations of within-group monotonicity occur on $\cup_{i \leq l-1} \mathcal{X}_i$. Now, we prove that there are no violations of within-group monotonicity between $\{l, \ldots, r\}$ and any $A \in B'$. By assumption, we know that there are no violations of within-group monotonicity between $\{l, \ldots, r\}$ and $\{k, \ldots, l-1\}$. Then, we prove by contradiction that there are no violations between $\{l, \ldots, r\}$ and any $A \in B' \setminus \{\{k, \ldots, l-1\}\}$. For any $A \in B' \setminus \{\{k, \ldots, l-1\}\}$, it follows from Proposition 3.3 that $a_A < a_{\{k,\ldots,l-1\}}$ and $a_A < a_{\{l,\ldots,r\}}$. Now, assume there exist $A \in B' \setminus \{\{k, \ldots, l-1\}\}$ and $z \in \mathcal{Z}$ such that $a_{A,z} > a_{\{l,\ldots,r\},z}$. Since, by assumption, we have that $a_{\{k,\ldots,l-1\},z} \leq a_{\{l,\ldots,r\},z}$, it should hold that $a_{\{k,\ldots,l-1\},z} < a_{A,z}$, which contradicts the assumption that $B' \in \mathcal{B}_{k,l-1}$. This proves that $B \in \mathcal{B}_{l,r}$.

Lemma A.6. Let $B = B' \cup \{\{l, \ldots, r\}\} \in \mathcal{B}_{l,r}$ and $B' \in \mathcal{B}_{k,l-1}$ with $k < l$. Then, it must hold that $a_{\{k,\ldots,l-1\},z} \leq a_{\{l,\ldots,r\},z}$ for all $z \in \mathcal{Z}$.

Proof. Since $B' \in \mathcal{B}_{k,l-1}$, we know that $\{k, \ldots, l-1\} \in B'$. Moreover, it follows from Proposition 3.3 that $f_B$ is monotone with respect to $f$ and hence, since $k < l$ and $k \not\sim_B l$, we have that $a_{\{k,\ldots,l-1\}} < a_{\{l,\ldots,r\}}$. Further, since $B \in \mathcal{B}_{l,r}$, we have that, for every $A, A' \in B$ such that $a_A < a_{A'}$, it holds that $a_{A,z} \leq a_{A',z}$ for all $z \in \mathcal{Z}$. Thus, it also holds that $a_{\{k,\ldots,l-1\},z} \leq a_{\{l,\ldots,r\},z}$ for all $z \in \mathcal{Z}$.

A.7. Proof of Theorem 4.4

To prove that Algorithm 2 returns the optimal partition $B^*$, we just need to prove that, for each $l, r \in \{1, \ldots, n\}$, the partition $B_{l,r}$ the algorithm finds is optimal, i.e., $B_{l,r} = B^*_{l,r}$. In what follows, we prove this by induction. For the base cases, we have that the partitions $B_{1,r} = \{\{1, \ldots, r\}\}$ are clearly optimal, since $\mathcal{B}_{1,r}$ only contains $\{\{1, \ldots, r\}\}$ for all $r \in \{1, \ldots, n\}$. As the induction hypothesis, assume that, for any $l' < l$ and $r' < r$, the partition $B_{l',r'}$ the algorithm finds is optimal. Moreover, let $S_{l,r} = \{k \mid k < l,\ a_{\{k,\ldots,l-1\},z} \leq a_{\{l,\ldots,r\},z}\ \forall z \in \mathcal{Z}\}$. Then, for $(l, r)$, we need to show that $B_{l,r} = B_{k^*,l-1} \cup \{\{l, \ldots, r\}\}$, with $k^* = \operatorname{argmax}_{k \in S_{l,r}} |B_{k,l-1}|$, is optimal. To this end, we first show that $f_{B_{l,r}}$ is within-group monotone on $\cup_{i \leq r} \mathcal{X}_i$, i.e., $B_{l,r} \in \mathcal{B}_{l,r}$. We have that, by the induction hypothesis, $B_{k^*,l-1} \in \mathcal{B}_{k^*,l-1}$ and, by definition, $k^* \in S_{l,r}$. Then, it follows directly from Lemma 4.3 that $B_{l,r} \in \mathcal{B}_{l,r}$. Next, we show that $B_{l,r} = \operatorname{argmax}_{B \in \mathcal{B}_{l,r}} |B|$. Using again Lemma 4.3, we have that, for any $B \in \mathcal{B}_{l,r}$, it holds that $B = B' \cup \{\{l, \ldots, r\}\}$, with $B' \in \mathcal{B}_{k,l-1}$, for some $k \in S_{l,r}$. As a result, since $|B' \cup \{\{l, \ldots, r\}\}| = |B'| + 1$, it suffices to find $B' = \operatorname{argmax}_{B'' \in \cup_{k \in S_{l,r}} \mathcal{B}_{k,l-1}} |B''|$. Now, by the induction hypothesis, we know that, for each $\mathcal{B}_{k,l-1}$, $B_{k,l-1}$ is the optimal partition. Then, since $k^* = \operatorname{argmax}_{k \in S_{l,r}} |B_{k,l-1}|$, we can conclude that $B_{l,r}$ is optimal.
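The recursion that Theorem 4.4 certifies is straightforward to implement. The sketch below is our paraphrase of the dynamic program behind Algorithm 2, not the paper's exact pseudocode: best[(l, r)] stores the largest within-group monotone partition of bins 1..r whose last cell is {l, ..., r}, and each entry extends the largest feasible prefix.

```python
def optimal_monotone_partition(n, groups, avg):
    """avg(z, l, r): pooled group-conditional value of cell {l, ..., r}
    for group z; groups: the set of groups Z; bins are 1..n."""
    best = {}                                     # (l, r) -> partition
    for r in range(1, n + 1):
        best[(1, r)] = [list(range(1, r + 1))]    # base case: a single cell
        for l in range(2, r + 1):
            feasible = [
                best[(k, l - 1)]
                for k in range(1, l)
                if (k, l - 1) in best
                and all(avg(z, k, l - 1) <= avg(z, l, r) for z in groups)
            ]
            if feasible:                          # extend the largest prefix
                best[(l, r)] = max(feasible, key=len) + [list(range(l, r + 1))]
    # The optimum over bins 1..n allows any last cell {l, ..., n}.
    return max((B for (l, r), B in best.items() if r == n), key=len)
```

With memoized pooled averages, this runs in polynomial time, roughly $O(n^3 |\mathcal{Z}|)$ as written, in line with the tractability claim of the theorem.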
A.8. Proof of Lemma 5.2

We first prove the sufficient condition, i.e., we prove that, given any $B \in \mathcal{B}_r$, if $f_B$ is within-group calibrated on $\cup_{i \leq r} \mathcal{X}_i$, then there exists $l < r$ such that $B \setminus \{\{l, \ldots, r\}\} \in \mathcal{B}_{l-1}$, $f_{B \setminus \{\{l,\ldots,r\}\}}$ is within-group calibrated on $\cup_{i \leq l-1} \mathcal{X}_i$, and $a_{\{l,\ldots,r\},z} = a_{\{l,\ldots,r\}}$ for all $z \in \mathcal{Z}$. Let $B' = B \setminus \{\{l, \ldots, r\}\}$. Since $B$ covers $\{1, \ldots, r\}$, it holds that $B'$ covers $\{1, \ldots, l-1\}$ and hence $B' \in \mathcal{B}_{l-1}$. Since $B' \subset B$ and $f_B$ is within-group calibrated on $\cup_{i \leq r} \mathcal{X}_i$, it holds that $f_{B'}$ is within-group calibrated on $\cup_{i \leq l-1} \mathcal{X}_i$. Finally, since $\{l, \ldots, r\} \in B$, it also holds that $a_{\{l,\ldots,r\},z} = a_{\{l,\ldots,r\}}$.

Next, we prove the necessary condition, i.e., given any $B \in \mathcal{B}_r$, if there exists $l < r$ such that $B \setminus \{\{l, \ldots, r\}\} \in \mathcal{B}_{l-1}$, $f_{B \setminus \{\{l,\ldots,r\}\}}$ is within-group calibrated on $\cup_{i \leq l-1} \mathcal{X}_i$ and $a_{\{l,\ldots,r\},z} = a_{\{l,\ldots,r\}}$ for all $z \in \mathcal{Z}$, then $f_B$ is within-group calibrated on $\cup_{i \leq r} \mathcal{X}_i$. We need to show that, for every $A \in B$, it holds that $a_{A,z} = a_A$. Let $B' = B \setminus \{\{l, \ldots, r\}\}$. For every $z \in \mathcal{Z}$, it holds by assumption that $a_{A,z} = a_A$ for all $A \in B'$ and $a_{\{l,\ldots,r\},z} = a_{\{l,\ldots,r\}}$. As a result, $f_B$ is within-group calibrated on $\cup_{i \leq r} \mathcal{X}_i$.

A.9. Proof of Theorem 5.3

To prove that Algorithm 3 returns the optimal $B^*_{\mathrm{cal}}$, if a solution exists, we just need to prove that, for every $r \in \{1, \ldots, n\}$, the partition $B_{\mathrm{cal},r}$ the algorithm finds is optimal, i.e., $B_{\mathrm{cal},r} = B^*_{\mathrm{cal},r}$. In what follows, we prove this by induction. For the base case ($r = 1$), we have that $B_{\mathrm{cal},1} = \{\{a_1\}\}$ iff, for all $z \in \mathcal{Z}$ with $\rho_{z \mid 1} > 0$, it holds that $a_{1,z} = a_1$. This is clearly optimal since $\mathcal{B}_1$ only contains $\{\{a_1\}\}$. Otherwise, it holds that $B_{\mathrm{cal},1} = \emptyset$. As the induction hypothesis, assume that, for any $r' < r$, the partition $B_{\mathrm{cal},r'}$ the algorithm finds is either the optimal partition or, if there is no solution, an empty partition. Moreover, let $S_r = \{i \in \{2, \ldots, r\} \mid a_{\{i,\ldots,r\},z} = a_{\{i,\ldots,r\}}\ \forall z \in \mathcal{Z}\}$. Then, for $r$, we distinguish between two cases. If $B_{\mathrm{cal},r'}$ is empty for all $r' < r$, we again distinguish between two cases. If $a_{\{1,\ldots,r\}} = a_{\{1,\ldots,r\},z}$ for all $z \in \mathcal{Z}$, it means that $B_{\mathrm{cal},r} = \{\{1, \ldots, r\}\}$ is the only partition in $\mathcal{B}_r$ that is within-group calibrated and thus it is optimal. Otherwise, we can conclude that no partition $B \in \mathcal{B}_r$ is within-group calibrated and thus $B_{\mathrm{cal},r} = \emptyset$. Now, if $B_{\mathrm{cal},r'}$ is not empty for some $r' < r$, we need to show that $B_{\mathrm{cal},r} = B_{\mathrm{cal},k^*-1} \cup \{\{k^*, \ldots, r\}\}$, with $k^* = \operatorname{argmax}_{k \in S_r} |B_{\mathrm{cal},k-1}|$, is optimal. To this end, we first show that $f_{B_{\mathrm{cal},r}}$ is within-group calibrated on $\cup_{i \leq r} \mathcal{X}_i$. Using the induction hypothesis and the fact that $k^* \leq r$, we have that $B_{\mathrm{cal},k^*-1}$ is the optimal partition in $\mathcal{B}_{k^*-1}$. As a result, it follows from Lemma 5.2 that $f_{B_{\mathrm{cal},r}}$ is within-group calibrated on $\cup_{i \leq r} \mathcal{X}_i$. Next, we show that $B_{\mathrm{cal},r} = \operatorname{argmax}_{B \in \mathcal{B}_r} |B|$ among those partitions $B$ such that $f_B$ is within-group calibrated. Using again Lemma 5.2, we have that, for any $B$ such that $f_B$ is within-group calibrated, it holds that $B = B' \cup \{\{k, \ldots, r\}\}$, with $B' \in \mathcal{B}_{k-1}$, for some $k \in S_r$. As a result, since $|B| = |B'| + 1$, it suffices to find $B' = \operatorname{argmax}_{B'' \in \cup_{k \in S_r} \mathcal{B}_{k-1}} |B''|$ such that $f_{B''}$ is within-group calibrated. Now, by the induction hypothesis, we know that, for each $\mathcal{B}_{k-1}$, $B_{\mathrm{cal},k-1}$ is the optimal partition. Then, since $k^* = \operatorname{argmax}_{k \in S_r} |B_{\mathrm{cal},k-1}|$, we can conclude that $B_{\mathrm{cal},r}$ is optimal.
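The induction above mirrors a simple one-dimensional dynamic program. The sketch below is our paraphrase of the recursion behind Algorithm 3, not the paper's exact pseudocode: a cell {k, ..., r} may close a partition only if its pooled group-conditional values coincide with its pooled overall value, and best_cal[r] keeps the largest calibrated partition of bins 1..r, or None when no solution exists.

```python
def optimal_calibrated_partition(n, groups, avg):
    """avg(z, k, r): pooled value of cell {k, ..., r} for group z;
    avg(None, k, r): the overall (group-marginal) pooled value."""
    def cell_ok(k, r):
        # Exact equality, as in the proof; with empirical estimates one
        # would instead test |a_{A,z} - a_A| <= eps, in the spirit of
        # within-group eps-calibration.
        return all(avg(z, k, r) == avg(None, k, r) for z in groups)

    best_cal = {0: []}                 # the empty prefix is trivially calibrated
    for r in range(1, n + 1):
        candidates = [
            best_cal[k - 1] + [list(range(k, r + 1))]
            for k in range(1, r + 1)
            if best_cal[k - 1] is not None and cell_ok(k, r)
        ]
        best_cal[r] = max(candidates, key=len) if candidates else None
    return best_cal[n]                 # None if no calibrated partition exists
```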
A.10. Proof of Proposition 5.4

We prove the result by contradiction. Assume there exists a $B \in \mathcal{B}$ such that $f_B$ is within-group calibrated. Then, for every $A \in B$, it must hold that $a_{A,z} = a_{A,z'} = a_A$. Consider an arbitrary cell $A \in B$. We have that
$$a_{A,z} = \frac{\sum_{j \in A} \rho_j \rho_{z \mid j}\, a_{j,z}}{\sum_{j \in A} \rho_j \rho_{z \mid j}} \overset{(i)}{=} \frac{\sum_{j \in A} \rho_j \rho_{z' \mid j}\, a_{j,z}}{\sum_{j \in A} \rho_j \rho_{z' \mid j}} \overset{(ii)}{<} \frac{\sum_{j \in A} \rho_j \rho_{z' \mid j}\, a_{j,z'}}{\sum_{j \in A} \rho_j \rho_{z' \mid j}} = a_{A,z'},$$
where (i) follows from the fact that $\rho_{z \mid i} = \rho_{z' \mid i}$ for all $i \in \mathrm{Range}(f)$ and (ii) follows from the fact that, by assumption, $a_{i,z} < a_{i,z'}$ for all $i \in \{1, \ldots, n\}$. As an immediate consequence, we have that $a_{A,z} < a_A < a_{A,z'}$, contradicting the within-group calibration property.

B. Additional Experiments

B.1. Screening Classifiers Induced by the Partitions Found by Algorithms 1, 2 and 3

In this section, we take a closer look at all the quality score values $a = \Pr(Y = 1 \mid f(X) = a)$ and group conditional score values $a_z = \Pr(Y = 1 \mid f(X) = a, Z = z)$ of both the original classifier $f$ and the modified classifiers $f_B$ induced by the partitions $B$ found by Algorithms 1, 2 and 3. Figure 6 summarizes the results for one experiment with a classifier $f$ with $n = 15$, which reveal several interesting findings. As expected, $f_{B^*}$ and $f_{B_{\mathrm{pav}}}$ are within-group monotone and $f_{B^*}$ is more fine-grained than $f_{B_{\mathrm{pav}}}$, i.e., $|B^*| \geq |B_{\mathrm{pav}}|$. However, the minimum value of $\epsilon$ such that $f_{B^*_{\mathrm{cal}}}$ exists is not always low enough for $f_{B^*_{\mathrm{cal}}}$ to be within-group monotone. Moreover, we find that, for $f$, $f_{B^*}$ and $f_{B_{\mathrm{pav}}}$, the difference among group conditional score values $a_z$ for a given quality score value $a$ is often significant. As a result, one should be cautious about comparing candidates from different groups $z$ and should instead utilize group-dependent decision thresholds (Wang et al., 2022) to implement more equitable hiring practices such as the Rooney rule (Collins, 2007), which requires that, when hiring for a given position, at least one (or more) candidate(s) from each minority group be interviewed. In this context, it is also worth noting that, while using $f_{B^*_{\mathrm{cal}}}$ would mitigate such differences, our results show that this would dramatically reduce the granularity of the predictions. We found qualitatively similar results for other values of $n$.

[Figure 6: bar plots of the quality scores $\Pr(Y = 1 \mid f(X))$ and group conditional scores $\Pr(Y = 1 \mid f(X), Z)$ under $f$, $f_{B^*}$, $f_{B_{\mathrm{pav}}}$ and $f_{B^*_{\mathrm{cal}}}$; panel (a) citizenship status (born in the US, born in unincorporated US, born abroad, not a US citizen), with $\epsilon = 0.1$; panel (b) race code (White, Black or African American, American Indian or Alaska, Asian, Native Hawaiian or other), with $\epsilon = 0.06$.]

Figure 6. Quality score values $a = \Pr(Y = 1 \mid f(X) = a)$ and group conditional quality score values $a_z = \Pr(Y = 1 \mid f(X) = a, Z = z)$ of the screening classifier $f$ and the modified classifiers $f_{B_{\mathrm{pav}}}$, $f_{B^*}$, and $f_{B^*_{\mathrm{cal}}}$ induced by the partitions found by Algorithms 1, 2 and 3, respectively. In the first and last rows, the hatched bars indicate within-group monotonicity violations and, in the last row, we report the smallest $\epsilon$ value such that a within-group $\epsilon$-calibrated classifier $f_{B^*_{\mathrm{cal}}}$ exists.

B.2. Additional Experiments On Within-Group ϵ-Calibration

In this section, we investigate how the smallest $\epsilon$ such that a within-group $\epsilon$-calibrated classifier $f_{B^*_{\mathrm{cal}}}$ exists varies with the number of bins $n$ of the screening classifier $f$. Figure 7 shows that, for each set of groups $\mathcal{Z}$, $\epsilon$ remains relatively constant with respect to $n$; however, the greater the difference across group conditional quality scores $a_z = \Pr(Y = 1 \mid f(X) = a, Z = z)$, the greater the value of $\epsilon$ that is needed to obtain a within-group $\epsilon$-calibrated classifier, as one may have perhaps expected.

[Figure 7: minimum $\epsilon$ against the number of bins $n \in \{5, \ldots, 40\}$, one curve per set of groups $\mathcal{Z}$ (citizenship status, race code, gender, disability record).]

Figure 7. Minimum value of $\epsilon$ such that a within-group $\epsilon$-calibrated $f_{B^*_{\mathrm{cal}}}$ exists against the number of bins $n$ of the screening classifier $f$.
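To make the quantity plotted in Figure 7 concrete, the following sketch computes, for a single cell, the gap that within-group ϵ-calibration has to absorb. It assumes, as our reading of the definition rather than a statement from the paper, that ϵ bounds $|a_{A,z} - a_A|$ over all cells $A$ and groups $z$; the paper's Algorithm 3 additionally searches over all contiguous partitions rather than a fixed cell.

```python
def cell_eps(a, a_z, w, w_z):
    """Smallest eps for which one cell containing all given bins is
    within-group eps-calibrated, under the assumption stated above.
    a: per-bin scores; w: per-bin weights; a_z[z], w_z[z]: per-group."""
    pooled = sum(v * q for v, q in zip(a, w)) / sum(w)
    gaps = []
    for z in a_z:
        pooled_z = sum(v * q for v, q in zip(a_z[z], w_z[z])) / sum(w_z[z])
        gaps.append(abs(pooled_z - pooled))
    return max(gaps)
```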
B.3. Experimental Results for Other Groups Z

[Figure 8: probability $p_{d \mid z}$ against $\Pr(Z = z)$ for $n = 15$; panels for gender ($\Pr(Z = z)$: male 0.49, female 0.51) and disability record ($\Pr(Z = z)$: with a disability 0.15, without a disability 0.85).]

Figure 8. Probability $p_{d \mid z}$ that an individual from group $z$ may suffer from within-group unfairness against $\Pr(Z = z)$ for $n = 15$.

[Figure 9: panels (a) partition size $|B|$ against $n$ and (b) shortlist size against $n$, for $f$, $f_{B_{\mathrm{pav}}}$, $f_{B^*}$ and $f_{B^*_{\mathrm{cal}}}$, with $n \in \{5, \ldots, 40\}$.]

Figure 9. Quality of the partitions $B_{\mathrm{pav}}$, $B^*$, and $B^*_{\mathrm{cal}}$ returned by Algorithms 1, 2 and 3, respectively, for screening classifiers $f$ with an increasing number of bins $n$. Panel (a) shows the size $|B|$ of the partitions provided by each algorithm (higher is better). Panel (b) shows the size of the shortlists created using the classifiers $f_B$ induced by each partition $B$ (lower is better).

[Figure 10: bar plots of the quality scores $\Pr(Y = 1 \mid f(X))$ and group conditional scores $\Pr(Y = 1 \mid f(X), Z)$ under $f$, $f_{B^*}$, $f_{B_{\mathrm{pav}}}$ and $f_{B^*_{\mathrm{cal}}}$; panel (a) gender (male, female), with $\epsilon = 0.03$; panel (b) disability record (with a disability, without a disability), with $\epsilon = 0.22$.]

Figure 10. Quality score values $a = \Pr(Y = 1 \mid f(X) = a)$ and group conditional quality score values $a_z = \Pr(Y = 1 \mid f(X) = a, Z = z)$ of the screening classifier $f$ and the modified classifiers $f_{B_{\mathrm{pav}}}$, $f_{B^*}$, and $f_{B^*_{\mathrm{cal}}}$ induced by the partitions found by Algorithms 1, 2 and 3, respectively. In the first row, the hatched bars indicate within-group monotonicity violations and, in the last row, we report the smallest $\epsilon$ value such that a within-group $\epsilon$-calibrated classifier $f_{B^*_{\mathrm{cal}}}$ exists.