# improving_screening_processes_via_calibrated_subset_selection__fbfc120c.pdf Improving Screening Processes via Calibrated Subset Selection Lequn Wang 1 2 Thorsten Joachims 1 Manuel Gomez Rodriguez 3 Many selection processes such as finding patients qualifying for a medical trial or retrieval pipelines in search engines consist of multiple stages, where an initial screening stage focuses the resources on shortlisting the most promising candidates. In this paper, we investigate what guarantees a screening classifier can provide, independently of whether it is constructed manually or trained. We find that current solutions do not enjoy distribution-free theoretical guarantees and we show that, in general, even for a perfectly calibrated classifier, there always exist specific pools of candidates for which its shortlist is suboptimal. Then, we develop a distribution-free screening algorithm called Calibrated Subset Selection (CSS) that, given any classifier and some amount of calibration data, finds near-optimal shortlists of candidates that contain a desired number of qualified candidates in expectation. Moreover, we show that a variant of CSS that calibrates a given classifier multiple times across specific groups can create shortlists with provable diversity guarantees. Experiments on US Census survey data validate our theoretical results and show that the shortlists provided by our algorithm are superior to those provided by several competitive baselines. 1. Introduction Screening is an essential part of many selection processes where an often intractable number of candidates is reduced to a shortlist of the most promising candidates for detailed and more resource intensive evaluation. Screening thus enables an allocation of resources that improves the overall 1Department of Computer Science, Cornell University 2Most of the work was done during Wang s internship at the Max Planck Institute for Software Systems. 3Max Planck Institute for Software Systems. Correspondence to: Lequn Wang , Thorsten Joachims , Manuel Gomez Rodriguez . Proceedings of the 39 th International Conference on Machine Learning, Baltimore, Maryland, USA, PMLR 162, 2022. Copyright 2022 by the author(s). quality of the decisions under limited resources. Examples of such screening problems are: finding patients in a large database of electronic health records to manually evaluate for qualification to take part in a medical trial (Liu et al., 2021); the first stage of a multi-stage retrieval pipeline of a search engine (Covington et al., 2016; Geyik et al., 2019); or which people to reach out to with a personalized invitation to apply to a specific job posting (Raghavan et al., 2020). In each of these examples, there is significant pressure to make high-quality, unbiased screening decisions quickly and efficiently, often about thousands or even millions of candidates under limited resources and additional diversity requirements (Bendick Jr et al., 1997; Bertrand & Mullainathan, 2004; Johnson et al., 2016; Covington et al., 2016). While these screening decisions have been made manually or through manually constructed rules in the past, automated predictive tools for optimizing screening decisions are becoming more prevalent (Cowgill, 2018; Chamorro-Premuzic & Akhtar, 2019; Raghavan et al., 2020; S anchez-Monedero et al., 2020). Algorithmic screening has been typically studied together with other high-stakes decision making problems as a supervised learning problem (Corbett-Davies et al., 2017; Kilbertus et al., 2020; Sahoo et al., 2021). Under this perspective, algorithmic screening reduces to: (i) training a classifier that estimates the probability that a candidate is qualified given a set of observable features; (ii) designing a deterministic threshold rule that shortlists candidates by thresholding the candidates probability values estimated by the classifier. Here, the classifier and the threshold rule aim to maximize a measure of average accuracy and average utility, respectively, possibly subject to diversity constraints. Only very recently, it has been argued that the classifier must also satisfy threshold calibration, a specific type of group calibration, to be able to accurately estimate the average utility of the threshold rule (Sahoo et al., 2021). Unfortunately, the above screening algorithms do not enjoy distribution-free guarantees1 on the quality of the shortlisted candidates. As a result, they may not always hold their promise of increasing the efficiency of a selection 1We refer to distribution-free guarantees as finite-sample distribution-free guarantees, since we never have infinite amount of data in practice. Improving Screening Processes via Calibrated Subset Selection process without decreasing the quality of the screening decisions. The results in this paper bridge this gap. In particular, we focus on developing screening algorithms that, without making any distributional assumptions on the candidates, provide the smallest shortlists of candidates, among those in a given pool, containing a desired expected number of qualified candidates with high probability.2 Our Contributions. We first show that, even if a classifier is perfectly calibrated, in general, there always exist specific pools of candidates for which the shortlist created by a policy that makes decisions based on the probability predictions from the classifier will be significantly suboptimal, both in terms of size and expected number of qualified candidates. Then, we develop a distribution-free screening algorithm called Calibrated Subset Selection (CSS) that calibrates any given classifier using calibration data in a way such that the shortlist of candidates created by thresholding the candidates probability values estimated by the calibrated classifier is near-optimal in terms of expected size and it provably contains, in expectation across all potential pools of candidates, a desired number of qualified candidates with high probability. Moreover, we theoretically characterize how the accuracy of the classifier and the amount of calibration data affect the expected size of the shortlists CSS provides. In addition, motivated by the Rooney rule (Collins, 2007), which requires that, when hiring for a given position, at least one candidate from the underrepresented group be interviewed, we demonstrate that a variant of CSS that calibrates the given classifier multiple times across different groups can be used to create a shortlist with provable diversity guarantees namely that this shortlist contains, in expectation across all potential pools of candidates, a desired number of qualified candidates from each group with high probability. Finally, we validate CSS on simulated screening processes created using US Census survey data (Ding et al., 2021). The results show that, compared to several competitive baselines, CSS consistently selects the shortest shortlists of candidates among those methods that select enough qualified candidates. Moreover, the results also demonstrate that the amount of human effort CSS helps reduce the difference in size between the pools of candidates and the shortlists it provides depends on the accuracy of the classifier it relies upon as well as the amount of calibration data. However, the expected quality of the shortlists the expected number of qualified candidates in the shortlists never decreases below the user-specified requirement. Our code is accessible at https://github.com/Lequn Wang/Improve-Screening-via Calibrated-Subset-Selection. 2If the overall pool of candidates does not contain the desired number of qualified candidates with high probability, the algorithms should also determine that. Further Related Work. Our work builds upon prior literature on distribution-free uncertainty quantification, which includes calibration and conformal prediction. Calibration (Brier et al., 1950; Dawid, 1982; Platt et al., 1999; Zadrozny & Elkan, 2001; Gneiting et al., 2007; Gupta et al., 2020) measures the accuracy of the probability outputs from predictive models. Many notions of calibration have been proposed to measure the accuracy of the probability outputs from a classifier (Platt et al., 1999; Guo et al., 2017) or a regression model (Gneiting et al., 2007; Kuleshov et al., 2018). Arguably, the most commonly used notion is marginal (or average) calibration (Gupta et al., 2020; Zhao et al., 2020; Gneiting et al., 2007; Kuleshov et al., 2018) for both classifiers and regression models, which requires the probability outputs be marginally calibrated on the whole population. Our definition of perfectly calibrated classifier inherits from the marginal calibration definition for classifiers in prior works (Gupta et al., 2020). We also show how CSS produces an approximately calibrated regression model under average calibration definition in regression (Gneiting et al., 2007; Kuleshov et al., 2018). In the other extreme, individual calibration (Zhao et al., 2020) refers to a classifier that predicts the probability distribution for each example. We refer to classifiers with such properties the omniscient classifiers in this paper. Many recent works (Chouldechova, 2017; Kleinberg et al., 2017; Pleiss et al., 2017) discuss the relationship between calibration and fairness in binary classification, and show that marginal calibration is incompatible with many fairness definitions in machine learning. Motivated by multicalibration (H ebert-Johnson et al., 2018; Jung et al., 2021), which requires the predictive models be calibrated on multiple groups of candidates, we propose a variant of CSS that selects calibrated subsets within each group to ensure diversity in the final selected shortlists. Very recently, some works (Sahoo et al., 2021; Zhao et al., 2021; Straitouri et al., 2022; Wang & Joachims, 2022) realize that we can design calibration algorithms (and calibration definitions) well-suited for different downstream tasks to achieve better performance. CSS is specifically designed for screening processes so that it provides near-optimal shortlists in terms of the expected size among those having distribution-free guarantees on the expected number of qualified candidates. Conformal prediction (Vovk et al., 1999; 2005; Shafer & Vovk, 2008; Romano et al., 2019; Balasubramanian et al., 2014; Gupta et al., 2020; Angelopoulos & Bates, 2021; Chzhen et al., 2021) aims to build confidence intervals on the probability outputs from predictive models. Within this literature, the work most closely related to ours is arguably the work by Bates et al. (2021), which has focused on generating set-valued predictions from a black-box predictor that controls the expected loss on future test points at a userspecified level. While one can view our problem from the Improving Screening Processes via Calibrated Subset Selection perspective of set-valued predictions, applying their methodology to find near-optimal solutions in our problem is not straightforward, and one would need to assume to have access to qualification labels for all the candidates in multiple pools, something we view as rather impractical. Moreover, our work also builds on work in budgetconstrained decision making, where one needs to first select a set of candidates to screen and then, given the result of that screening process, determine to whom to allocate the resources (Cai et al., 2020; Bakker et al., 2021). This contrasts with our work where all candidates undergo screening. Many large-scale recommender systems adopt multi-stage pipelines (Bendick Jr & Nunes, 2013). Existing works on multi-stage recommender systems (Ma et al., 2020; Hron et al., 2021) focus on learning accurate classifiers in different stages. Complementary to these works, we assume the classifiers are already given and provide distribution-free and finite-sample guarantees on the quality of the shortlists. 2. Problem Formulation Given a candidate with a feature vector x X, we assume the candidate can be either qualified (y = 1) or unqualified (y = 0) for the selection objective3. Moreover, let f : X [0, 1] be a classifier that maps a candidate s feature vector x to a quality score4 f(x). The higher the quality score f(x), the more the classifier believes the candidate is qualified. Given a pool of m candidates with feature vectors x = {xi}i [m], an algorithmic screening policy π : [0, 1]m P({0, 1}m) maps the candidates quality scores {f(xi)}i [m] to a probability distribution over shortlisting decisions s = {si}i [m]. Here, each shortlisting decision si specifies whether the corresponding candidate is shortlisted (si = 1) or is not shortlisted (si = 0). For brevity, we will write S π whenever there is no ambiguity5. Furthermore, we use πf to indicate that a policy makes shortlisting decisions based on the quality scores from classifier f. This implies that for any pool of candidates x and any two candidates i, j [m] in the pool, if f(xi) = f(xj), then Pr(Si = 1) = Pr(Sj = 1). We denote the set of all possible policies πf as Πf. For any screening process, we would ideally like a screening policy π that shortlists only candidates who are qualified (y = 1). Unfortunately, as long as there is no deterministic mapping between x and y, such a perfect screening policy 3In practice, one needs to measure qualification using proxy variables and these proxy variables need to be chosen carefully to not perpetuate historical biases (Bogen & Rieke, 2018; Garr & Jackson, 2019; Tambe et al., 2019). 4Our theory and algorithms apply to any classifier with a bounded range, by scaling the scores to [0, 1]. 5We use upper case letters to denote random variables, and lower case letters to denote realizations of random variables. π does not exist in general. Instead, our goal is to find a screening policy π that shortlists a small set of candidates that provably contains enough qualified candidates, without making any assumptions about the data distribution. These shortlisted candidates will then move forward in the selection process and will be evaluated in detail, possibly multiple times, until one or more qualified candidates are selected. More specifically, let each candidate s feature vector x and label y be sampled from a data distribution PX,Y = PX PY | X. Then, for a pool of m candidates6 with feature vectors X = {Xi}i [m] and unobserved labels Y = {Yi}i [m], where Xi PX, Yi PY |Xi for all i [m], we will investigate to what extent it is possible to find screening policies π with near-optimal guarantees with respect to two different oracle policies. In particular: (i) an oracle policy π that, for any set of candidates x X m, shortlists the smallest set of candidates that contains, in expectation with respect to PY | X, more than k qualified candidates, i.e., x X m, π argmin π Π ES π i [m] Si Yi (ii) an oracle policy π that shortlists the smallest set of candidates that contains, in expectation with respect to PX,Y , more than k qualified candidates, i.e., π argmin π Π EX P,S π π E(X,Y ) P,S π i [m] Si Yi In the language of distribution-free uncertainty quantification (Balasubramanian et al., 2014), the optimality guarantees with respect to the first and the second oracle policy can be viewed as individual and marginal guarantees respectively. 6For ease of presentation, we assume a constant pool size m in the main paper. However, all the theoretical results and algorithms can be easily adapted to settings where the pool size changes across selection processes. Refer to Appendix A.1 for more details. Improving Screening Processes via Calibrated Subset Selection 3. Impossibility of Screening with Individual Guarantees If we had access to an omniscient classifier f (x) = Pr(Y = 1 | X = x) for all x X, then we could recover the oracle policy π defined by Eq. 1 just by thresholding the quality scores of each candidate in the pool. More specifically, we have the following theorem7: Theorem 3.1. The screening policy π f that, given any pool of m candidates with feature vectors x X m, takes shortlisting decisions as 1 if f (xi) > t , Bernoulli(θ ) if f (xi) = t , 0 otherwise, (3) i [m] I{f (xi) t}f (xi) k i [m] I{f (xi) > t }f (xi) P i [m] I{f (xi) = t }f (xi) , is a solution (if there is one) to the constrained minimization problem defined in Eq. 1. Unfortunately, without distributional assumptions about PX,Y , finding the omniscient classifier f from data is impossible, even asymptotically, if the distribution Pf (X) induced by PX,Y is nonatomic, as recently shown by Barber (2020) and Gupta et al. (2020). Alternatively, one may think whether there exist other classifiers h allowing for near-optimal screening policies πh with individual guarantees. In this context, a natural alternative is a perfectly calibrated classifier h other than the omniscient classifier f . A classifier h is perfectly calibrated if and only if Pr(Y = 1 | h(X) = a) = a a Range(h). (4) However, the following theorem shows that there exist data distributions PX,Y for which there is no perfectly calibrated classifier h = f allowing for a screening policy πh with individual guarantees of optimality. Proposition 3.2. Let X = {a, b}, Pr(Y = 1 | X = a) = 1, Pr(Y = 1 | X = b) = k m, and f be the omniscient classifier. Then, for any screening policy πh using any perfectly calibrated classifier h = f , there exists a pool of candidates x = {xi}i [m] X m such that ES πh,S π f i [m] (Si S i ) 7All proofs can be found in Appendix C. and a pool of candidates x = {x i}i [m] X m such that EY P,S πh,S π f i [m] (Si S i )Yi The above result reveals that, if m k, for any screening policy πh using a calibrated classifier h = f , there are always scenarios in which πh provides shortlists that are m 2 apart from those provided by the oracle policy π in size or k 2 apart in terms of the number of qualified candidates. In particular, the second part of Proposition 3.2 directly implies that there is no policy π Πh that satisfies the individual guarantee. This negative result motivates us to look for screening policies with marginal guarantees of optimality. 4. Screening Algorithms with Marginal Guarantees To investigate to what extent it is possible to find screening policies with near-optimal guarantees with respect to the oracle policy π defined in Eq. 2, we focus on perfectly calibrated classifiers h with finite range, i.e., |Range(h)|< . The reason is that, similarly as in the case of the omniscient classifier f , it is impossible to find nonatomic perfectly calibrated classifiers h from data, even asymptotically (Barber, 2020; Gupta et al., 2020). We will first introduce the optimal algorithm assuming we have access to a perfectly calibrated classifier, and discuss how the accuracy of the classifier affects the performance of the optimal algorithm. Then, we will introduce the proposed CSS algorithm that, given a classifier and some amount of calibration data, selects shortlists that are near-optimal in terms of the expected size, while ensuring that the expected number of qualified candidates in the shortlists is above the desired threshold. An Optimal Calibration Algorithm with Perfectly Calibrated Classifiers. Let h be a perfectly calibrated classifier with Range(h) = {µb}b [B], and denote ρb = EX P [I{h(X) = µb}]. First, note that the classifier h induces a sample-space partition of X into B regions or bins {Xb}b [B], where µb = Pr(Y = 1 | X Xb) and ρb = Pr(X Xb). Then, the following theorem shows that the optimal screening policy π h that shortlists the smallest set of candidates within the set of policies Πh, which contain, in expectation with respect to PX,Y , more than k qualified candidates, is given by a threshold decision rule. Theorem 4.1. The screening policy π h that takes shortlisting decisions as 1 if h(xi) > th, Bernoulli(θh) if h(xi) = th, 0 otherwise, Improving Screening Processes via Calibrated Subset Selection µ {µb}b [B] b [B] µb I{µb µ}ρb k b [B] µb I{µb > th}ρb P b [B] µb I{µb = th}ρb , is a solution (if there is one) to the constrained minimization problem defined in Eq. 2 over the set of policies Πh. However, it is important to realize that the expected size of the shortlists provided by π h for different perfectly calibrated classifiers h may differ. To put it differently, not all screening policies π h (and classifiers h) will help reduce the downstream effort by the same amount. To characterize this difference, we introduce a notion of dominance between perfectly calibrated classifiers. Definition 4.2. Let h and h be perfectly calibrated classifiers. We say that h dominates h if, for any x1, x2 X such that h(x1) = h(x2), it holds that h (x1) = h (x2). Equipped with this notion, we now characterize the difference in size between shortlists provided by different perfectly calibrated classifiers using the following corollary, which follows from Theorem 4.1. Corollary 4.3. Let h and h be perfectly calibrated classifiers. If h dominates h , then EX P,S π h,S π h i [m] (Si S i) This notion of dominance relates to the notion of sharpness (Gneiting et al., 2007; Kuleshov et al., 2018), which links the accuracy of a calibrated classifier to how finegrained the calibration is within the sample-space. In particular, if h dominates h , it can be readily shown that h is sharper than h . However, existing works have not studied the effect of the sharpness of a classifier on its performance on screening tasks. A Near-Optimal Screening Algorithm with Calibration Data. Until here, we have assumed that a perfectly calibrated classifier h with finite range, as well as the size of each bin ρb are given. However, using finite amounts of calibration data, we can only hope to find approximately calibrated classifiers. Next, we will develop an algorithm that, rather than training an approximately calibrated classifier from scratch, it approximately calibrates a given classifier f, e.g., a deep neural network, using a calibration set Dcal = {(xc i, yc i )}i [n], which are independently sampled from PX,Y 8. In doing so, the algorithm will find the optimal 8Superscript c is used to differentiate between candidates in the calibration set and candidates in the pool at test time. sample-space partition and decision rule that minimize the expected size of the provided shortlists among those ensuring that an empirical lower bound on the expected number of qualified candidates is greater than k. From now on, we will assume that the given classifier f is nonatomic9 and satisfies a natural monotonicity property10 with respect to the data distribution PX,Y , a significantly weaker assumption than calibration. Definition 4.4. A classifier f is monotone with respect to a data distribution PX,Y if, for any a, b Range(f) such that a < b, it holds that Pr {Y = 1 | f(X) = a} Pr {Y = 1 | f(X) = b} . Under this assumption11, we can first show that the solution (if there is one) to the constrained minimization problem in Eq. 2 over Πf is a threshold decision rule πf,t f with some threshold t f [0, 1] that takes shortlisting decisions as ( 1 if f(xi) t f, 0 otherwise. (5) More specifically, we have the following theorem. Theorem 4.5. Let f be a monotone classifier with respect to PX,Y and the distribution Pf(X) induced by PX,Y is nonatomic. Then, for any πf Πf, there always exists a threshold decision rule πf,t Πf with some threshold t [0, 1] such that EX P,S πf ,S πf,t i [m] (Si S i) E(X,Y ) P,S πf ,S πf,t i [m] Yi(Si S i) The theorem directly implies there exists a threshold decision policy πf,t f that is optimal in the constraint minimization problem defined in Eq. 2 (if there is a solution). 9We can add arbitrarily small noise to break atoms. 10The monotonicity property we consider is different from that considered in the literature on monotonic classification (Cano et al., 2019). 11There is empirical evidence that well-performing classifiers learned from data are (approximately) monotone (Guo et al., 2017; Kuleshov et al., 2018; Gupta et al., 2020). In this context, we would like to emphasize that the monotone property is only necessary to prove that the expected size of the shortlists provided by our algorithm is near-optimal. However, the distribution-free guarantees on the expected number of qualified shortlisted candidates holds even for non-monotone classifiers. Improving Screening Processes via Calibrated Subset Selection As a result, we focus our attention on finding a near-optimal threshold decision policy. We first notice that each threshold decision policy πf,t induces a sample space partition of X into two bins Xt,1 = {x X | f(x) t} and Xt,2 = {x X | f(x) < t}. Thus, it is sufficient to analyze the calibration errors of the family of approximately calibrated classifiers ht with the two bins. In Appendix A.3, we show that using the calibration errors of calibrated classifiers that partition the sample-space into more bins will only worsen the performance guarantees of threshold policies. At this point, one may think of applying conventional distribution-free calibration methods to bound the calibration errors of each classifier ht with high probability. However, prior distribution-free calibration methods can only bound calibration errors on the mean values µb for each bin b, but not the bin sizes ρb. To make things worse, to find the threshold value ˆtf with optimal guarantees, we need to bound the calibration errors of all classifiers ht, simultaneously with high probability. Unfortunately, since t [0, 1], there are infinitely many of them and we cannot naively apply a union bound on the bounds derived separately for each classifier. To overcome the above issues, we will now leverage the Dvoretzky Kiefer Wolfowitz Massart (DKWM) inequality (Dvoretzky et al., 1956; Massart, 1990), which bounds how close an empirical cumulative distribution function (CDF) is to the cumulative distribution function of the distribution from which the empirical samples are sampled. More specifically, let δt,1 := E(X,Y ) P [I {f(X) t} Y ] and i [n] I {f(xc i) t} yc i be an empirical estimator of δt,1 using samples from the calibration set Dcal. Then, we can use the DKWM inequality to bound the calibration errors |ˆδt,1 δt,1| across all approximately calibrated classifiers ht with high probability: Proposition 4.6. For any α (0, 1), with probability at least 1 α (in f and Dcal), it holds that δt,1 ˆδt,1 p ln (2/α) /(2n) := ϵ(α, n) simultaneously for all t [0, 1]. In Appendix B, we show that based on the above error guarantees, we can build a regression model that achieves average calibration in regression (Gneiting et al., 2007; Kuleshov et al., 2018), if we regard the binary classification problem as a regression problem. Further, building on the above proposition, we can derive an empirical lower bound on the expected number of qualified candidates in the shortlists provided by all the threshold decision rules πf,t. Corollary 4.7. For any α (0, 1), with probability at least Algorithm 1 Calibrated Subset Selection (CSS) 1: input: k, m, Dcal, f, α, x 2: initialize: s = 0 3: ˆtf = sup t [0, 1] ˆδt,1 k/m + ϵ(α, n) 4: for i = 1, . . . , m do 5: if f(xi) ˆtf then 6: si = 1 7: end if 8: end for 9: return s 1 α (in f and Dcal), it holds that E(X,Y ) P,S πf,t i [m] Yi Si m ˆδt,1 + ϵ(α, n) simultaneously for all t [0, 1]. Now, to find the decision threshold rule πf,ˆtf that provides, in expectation, the smallest shortlists of candidates subject to a constraint on the lower bound on the expected number of qualified candidates in the provided shortlists, i.e., ˆtf = argmin t T EX P,S πf,t where T = {t [0, 1] | m ˆδt,1 ϵ(α, n) k}, we resort to the following theorem. Theorem 4.8. The threshold value ˆtf = sup t [0, 1] ˆδt,1 k/m + ϵ(α, n) . (7) is a solution (if there is one) to the constrained minimization problem defined in Eq. 6. Algorithm 1 summarizes our resulting CSS algorithm, whose complexity is O(n log(n)). Finally, we can show that the expected size of the shortlists provided by CSS is near-optimal and we can also derive a lower bound on the worst-case size of the provided shortlists. More specifically, the following propositions show that the difference in expected number of qualified candidates between the shortlists provided by πf,ˆtf and πf,t f decreases at a rate 1/ n and the worst-case size is on the order of Proposition 4.9. Let f be a monotone classifier with respect to PX,Y and assume that the distribution Pf(X) induced by PX,Y is nonatomic. Then, for any α (0, 1), with Improving Screening Processes via Calibrated Subset Selection probability at least 1 α, it holds that E(X,Y ) P,S πf,t f ,S πf,ˆtf i [m] (S i Si)Yi 2 ln(2/α)/n. Proposition 4.10. For any α1 (0, 1), α2 (e 9 4 k, 1) such that α1 + α2 < 1, the CSS algorithm with parameter α = α1 guarantees that, with probability at least 1 α1 α2 (in f, Dcal, and X, Y ), i [m] Yi Si k 1 3 ln(1/α2) 1 ln2(1/α2)+18k ln(1/α2). Finally, note that we can use the above worse-case guarantee to ensure that the worst-case size of the shortlists provided by CSS is greater than a target tworst by setting k to be slightly larger than kworst k = kworst + 1 3 ln(1/α2) 1 ln2(1/α2) + 18k ln(1/α2). By doing so, CSS will satisfy the constraints defined in Eq. 1 with high probability with respect to the test pool of candidates. However, CSS might not be optimal in terms of expected or worst-case shortlist size among those satisfying the above constrains. How to design screening algorithms that are optimal in terms of expected or worst-case shortlist size while satisfying that each individual shortlist contains enough qualified candidates with high probability is an interesting problem to explore in future work. 5. Increasing the Diversity of Screening Our theoretical results have shown that CSS is robust to the accuracy of the classifier and the amount of calibration data. However, CSS does not account for the potential differences in accuracy or in the amount of calibration data across demographic groups. As a result, qualified candidates in minority groups may be unfairly underrepresented in the shortlists provided by CSS. To tackle the above problem and increase the diversity of the shortlists, we design a variant of CSS, which we name CSS (Diversity), motivated by the Rooney rule (Collins, 2007) and multicalibration (H ebert-Johnson et al., 2018) (refer to Appendix A.2). CSS (Diversity) quantifies the uncertainty in the estimation of the number of qualified candidates separately for each demographic group, so that we have distribution-free guarantees for each demographic group. This means that CSS (Diversity) will include more candidates in the shortlist for groups with higher uncertainty to ensure a sufficient number of qualified candidates from each group. Effectively, CSS (Diversity) shifts the cost of uncertainty from minority candidates to the decision maker, since it creates a potentially longer shortlist. This provides an economic incentive for the decision maker to both build classifiers that perform well across demographic groups and collect more calibration data about minority groups. 6. Empirical Evaluation Using Survey Data In this section, we compare CSS against several competitive baselines on multiple instances of a simulated screening process created using US Census survey data. Experiment Setup. We create a simulated screening process using a dataset comprised of employment information for 3.2 million individuals from the US Census (Ding et al., 2021). For each individual, we have sixteen features x R16 (e.g., education, marital status) and a label y {0, 1} that indicates whether the individual is employed (y = 1) or unemployed (y = 0). Race is among the features, which we use as the protected attribute as suggested by Ding et al. (2021). We treat white as the majority group gmaj and all other races as the minority group gmin. To ensure that there is a limited number of qualified candidates ( 20%) in each of the simulated screening processes, we randomly downsample the dataset12. After downsampling, the dataset contains 2.2 million individuals. For the experiments, we randomly split the dataset into two equally-sized and disjoint subsets. We use one of the subsets to create a training set of 10,000 individuals, as well as calibration sets with varying sizes n. We use the other subset to simulate pools of candidates for testing. To get a screening classifier, we train a logistic regression f LR on the training set to predict the probability f LR(x) that a candidate is qualified. Here, we vary the accuracy of the classifier by replacing, with probability rnoise, each of its predictions with some noise β Beta(1, 4), i.e., f = γβ+(1 γ)f LR, where γ Bernoulli(rnoise) and rnoise is the classifier noise ratio. In each simulated screening process, we set the size of the test pool of candidates to m = 100, the desired expected number of qualified candidates to k = 5, and the success probability to 1 α = 0.9. For the diversity experiments, we set the desired expected number of qualified candidates kmaj and kmin so that the equal opportunity constraint,i.e., kmaj/(m E(X,Y ) [Y I {X gmaj}]) = kmin/(m E(X,Y [Y I {X gmin}]) is satisfied (Hardt et al., 2016) subject to kmaj + kmin = 5. Methods. In our experiments, we compare CSS with several baselines. Since no prior screening algorithms with distribution-free guarantees exist, we introduce a simple screening algorithm based on uniform mass binning (Zadrozny & Elkan, 2001) (UBM B Bins), which 12For the diversity experiments, we downsample the majority and minority groups independently. Improving Screening Processes via Calibrated Subset Selection 0 0.2 0.4 0.6 0.8 1 rnoise 0 0.2 0.4 0.6 0.8 1 rnoise 1e3 2e3 5e3 1e4 2e4 5e4 1e5 1e3 2e3 5e3 1e4 2e4 5e4 1e5 CSS Uncalibrated Isotonic Platt UMB 1 Bins UMB 2 Bins UMB 3 Bins UMB 4 Bins Figure 1. Analysis of CSS and baselines when varying the classifier noise ratio rnoise (i.e., accuracy) and calibration set sizes n. The first and third plots show the empirical probability that each algorithm provides enough qualified candidates (EQ) from 100 runs with standard error bars (higher is better). The second and fourth plots show the average, among 100 runs with one standard deviation as shaded regions, expected size (SS) of the shortlists (lower is better). We do not plot SS for algorithms that fail the quality requirement in terms of EQ. In the left two plots, the size of the calibration set is n = 10, 000. In the right two plots, the classifier noise ratio is rnoise = 0. 0 0.2 0.4 0.6 0.8 1.0 rnoise Minority EQ Majority 0 0.2 0.4 0.6 0.8 1.0 rnoise Minority EQ Minority 0 0.2 0.4 0.6 0.8 1.0 rnoise Minority SS Majority 0 0.2 0.4 0.6 0.8 1.0 rnoise Minority SS Minority CSS (Diversity) CSS (No Diversity) Uncalibrated (Diversity) Isotonic (Diversity) Platt (Diversity) Figure 2. Analysis of diversity guarantees of CSS (Diversity) and several baselines when varying the classifier noise ratio rnoise (i.e., accuracy) for individuals in the minority group. For individuals in the majority group, the classifier noise ratio is rnoise = 0. The first and third plots focus on the majority group and the second and fourth plots focus on the minority group. also enjoys distribution-free guarantees on the expected number of qualified candidates. The algorithm bounds the calibration error of a classifier that is calibrated on B bins of roughly equal size and selects the candidates from topscored bins to low-scored bins (and possibly at random from the last bin it selects candidates from) until a lower bound on the expected number of qualified candidates is no smaller than k. Refer to Appendix A.3 for more details about this baseline algorithm. We also compare CSS with three other baselines that do not provide distribution-free guarantees. The first is called Uncalibrated, and it applies the optimal decision rule for omniscient classifiers as if f were the omniscient classifier defined in Eq. 3. The second is called Platt, since it first calibrates f using Platt scaling (Platt et al., 1999) and then proceeds like Uncalibrated with the calibrated classifier. The third is called Isotonic. It treats the classifier f as a regression model from x to y, and then employs Isotonic regression calibration (Kuleshov et al., 2018) to produce a calibrated regression model h that estimates h(t) E [(Y I {f(x) t}]. It then selects the largest threshold t such that mh(t) k for shortlisting. Metrics. To compare the screening algorithms, we run the experiments 100 times for each algorithm and setting. For each run, we estimate whether each algorithm provides shortlists that contain a large enough expected number of qualified candidates EQ = I n ˆE h P i [m] Yi Si i k o , and its expected shortlist size SS = ˆE h P i [m] Si i , using 1, 000 independent pools of candidates sampled at random from the test set. We then compare the algorithms in terms of the percentage of times, along with standard errors, they provide enough qualified candidates. We also compare the average shortlist size, along with standard deviations. How does the accuracy of the classifier affect different screening algorithms? The left two plots in Figure 1 show how CSS compares to the baselines for classifiers of varying accuracy (i.e., varying rnoise). The results show that the shortlists provided by the algorithms with distribution-free guarantees (i.e., CSS and UMB) do contain enough qualified candidates, while the others fail. Moreover, CSS and UMB are robust to the accuracy of the classifier they use, as they compensate for decreased accuracy with longer shortlists to maintain their guarantees. In addition, we find that the shortlists provided by these algorithms contain enough qualified candidates more frequently than suggested by the worst-case theoretical guarantee of 1 α = 0.9. Compared to all UMB variants, CSS provides smaller shortlists except for the pure-noise case. Note that this exception does not Improving Screening Processes via Calibrated Subset Selection violate the optimality of CSS among deterministic threshold policies using the empirical lower bound as shown in Theorem 5, since UMB algorithms are allowed to randomly select candidates in the last bin they select candidates from, as discussed previously and in Appendix A.3 in detail. What is the effect of different amounts of calibration data on the screening algorithms? The right two plots in Figure 1 show how the screening algorithms perform for increasing amounts of calibration data n. We see that CSS and UMB are robust to the amount of calibration data, and can effectively account for less data by increasing the shortlist size to maintain their guarantees. In terms of shortlist size, CSS is more effective over the whole range of n, since it provides smaller shortlists than all of the UMB variants. How does the accuracy of the classifier affect different groups of candidates? Figure 2 shows how CSS (Diversity) and the baselines perform on the majority and minority groups as we decrease the accuracy of the classifier for individuals in the minority group (i.e., we increase rnoise only for individuals in the minority group). Here, CSS (No Diversity) refers to naively applying CSS on the entire pool of candidates, without diversity requirements. We allow all other algorithms to select candidates from the majority group and the minority group separately. The results show that, as the accuracy of the classifier for individuals in the minority group decreases, the shortlists provided by CSS (No Diversity) contain more and more (fewer and fewer) candidates from the majority (minority) group. This suggests that we should explicitly account for diversity in the screening process, especially when the accuracy of the classifier differs across groups. While Uncalibrated, Platt and Isotonic select candidates across groups separately, the shortlists they provide contain fewer qualified candidates from the minority group as the accuracy worsens (Uncalibrated) or do not contain enough qualified candidates from both groups (Platt, Isotonic). In contrast, CSS (Diversity) adapts to the loss in accuracy of the classifier for individuals in the minority group and the shortlists it provides contain enough qualified candidates from both groups. We found similar results also when we varied the number of individuals from each group in the calibration data, instead of the classifier accuracy. 7. Conclusions In this work, we initiated the development of screening algorithms that provide distribution-free guarantees on the quality of the shortlisted candidates. In particular, we proposed the CSS algorithm, which can be applied to any screening classifier. We show that for any amount of calibration data, CSS selects near-optimal shortlists of candidates, in terms of the expected shortlist size, while ensuring that the expected number of qualified candidates is above a desired target with high probability. Moreover, we showed how the CSS (Diversity) variant, which shortlists different groups of candidates separately, can ensure diversity of the shortlists with similar guarantees. Both the theoretical analysis and the empirical evaluation confirm that CSS and CSS (Diversity) are robust to the accuracy of the given classifier and the amount of calibration data. Our work opens up many interesting avenues for future work. For example, we have assumed that candidates do not present themselves strategically to the screening algorithms and that the data distribution at calibration and test time is the same. It would thus be interesting to develop screening algorithms that are robust to strategic behaviors (Tsirtsis & Gomez Rodriguez, 2020) and distribution shifts (Podkopaev & Ramdas, 2021). Moreover, it is crucial to fully understand how screening algorithms can most effectively and most fairly augment (biased) human decision making in real selection processes. Acknowledgements We thank Eleni Straitouri and Nastaran Okati for discussions and feedback in the initial stage of this work. Gomez Rodriguez acknowledges the support from the European Research Council (ERC) under the European Union s Horizon 2020 research and innovation programme (grant agreement No. 945719). Wang and Joachims acknowledge the support from NSF Awards IIS-1901168 and IIS-2008139. All content represents the opinion of the authors, which is not necessarily shared or endorsed by their respective employers and/or sponsors. Angelopoulos, A. N. and Bates, S. A gentle introduction to conformal prediction and distribution-free uncertainty quantification. ar Xiv preprint ar Xiv:2107.07511, 2021. Bakker, M. A., Tu, D. P., Gummadi, K. P., Pentland, A. S., Varshney, K. R., and Weller, A. Beyond reasonable doubt: Improving fairness in budget-constrained decision making using confidence thresholds. In AAAI/ACM Conference on AI, Ethics, and Society, pp. 346 356, 2021. Balasubramanian, V., Ho, S.-S., and Vovk, V. Conformal prediction for reliable machine learning: theory, adaptations and applications. Newnes, 2014. Barber, R. F. Is distribution-free inference possible for binary regression? Electronic Journal of Statistics, 14(2): 3487 3524, 2020. Bates, S., Angelopoulos, A., Lei, L., Malik, J., and Jordan, M. Distribution-free, risk-controlling prediction sets. Journal of the ACM, 68(6), 2021. Bendick Jr, M. and Nunes, A. P. Developing the research Improving Screening Processes via Calibrated Subset Selection basis for controlling bias in hiring. Journal of Social Issues, 68:238 262, 2013. Bendick Jr, M., Jackson, C. W., and Romero, J. H. Employment discrimination against older workers: An experimental study of hiring practices. Journal of Aging & Social Policy, 8(4):25 46, 1997. Bertrand, M. and Mullainathan, S. Are emily and greg more employable than lakisha and jamal? a field experiment on labor market discrimination. American economic review, 94(4):991 1013, 2004. Bogen, M. and Rieke, A. Help wanted: An examination of hiring algorithms, equity, and bias. 2018. Brier, G. W. et al. Verification of forecasts expressed in terms of probability. Monthly weather review, 78(1):1 3, 1950. Cai, W., Gaebler, J., Garg, N., and Goel, S. Fair allocation through selective information acquisition. In AAAI/ACM Conference on AI, Ethics, and Society, pp. 22 28, 2020. Cano, J.-R., Guti errez, P. A., Krawczyk, B., Wo zniak, M., and Garc ıa, S. Monotonic classification: An overview on algorithms, performance measures and data sets. Neurocomputing, 341:168 182, 2019. Chamorro-Premuzic, T. and Akhtar, R. Should companies use al to assess job candidates? Harvard Business Review, 17, 2019. Chouldechova, A. Fair prediction with disparate impact: A study of bias in recidivism prediction instruments. Big data, 5(2):153 163, 2017. Chzhen, E., Denis, C., Hebiri, M., and Lorieul, T. Setvalued classification overview via a unified framework. ar Xiv preprint ar Xiv:2102.12318, 2021. Collins, B. W. Tackling unconscious bias in hiring practices: The plight of the rooney rule. NYUL Rev., 82:870, 2007. Corbett-Davies, S., Pierson, E., Feller, A., Goel, S., and Huq, A. Algorithmic decision making and the cost of fairness. In International Conference on Knowledge Discovery and Data Mining, pp. 797 806, 2017. Covington, P., Adams, J., and Sargin, E. Deep neural networks for youtube recommendations. In ACM conference on recommender systems, pp. 191 198, 2016. Cowgill, B. Bias and productivity in humans and algorithms: Theory and evidence from resume screening. Columbia Business School, Columbia University, 29, 2018. Dawid, A. P. The well-calibrated bayesian. Journal of the American Statistical Association, 77(379):605 610, 1982. Ding, F., Hardt, M., Miller, J., and Schmidt, L. Retiring adult: New datasets for fair machine learning. In Advances on Neural Information Processing Systems, 2021. Dvoretzky, A., Kiefer, J., and Wolfowitz, J. Asymptotic minimax character of the sample distribution function and of the classical multinomial estimator. The Annals of Mathematical Statistics, pp. 642 669, 1956. Garr, S. S. and Jackson, C. Diversity & inclusion technology: The rise of a transformative market. Red Thread Research and Mercer, 2019. Geyik, S. C., Ambler, S., and Kenthapadi, K. Fairness-aware ranking in search & recommendation systems with application to linkedin talent search. In ACM International Conference on Knowledge Discovery and Data Mining, pp. 2221 2231, 2019. Gneiting, T., Balabdaoui, F., and Raftery, A. E. Probabilistic forecasts, calibration and sharpness. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 69 (2):243 268, 2007. Guo, C., Pleiss, G., Sun, Y., and Weinberger, K. Q. On calibration of modern neural networks. In International Conference on Machine Learning, pp. 1321 1330, 2017. Gupta, C. and Ramdas, A. K. Distribution-free calibration guarantees for histogram binning without sample splitting. In International Conference on Machine Learning, 2021. Gupta, C., Podkopaev, A., and Ramdas, A. Distribution-free binary classification: prediction sets, confidence intervals and calibration. In Advances in Neural Information Processing Systems, 2020. Hardt, M., Price, E., and Srebro, N. Equality of opportunity in supervised learning. In Advances in neural information processing systems, volume 29, pp. 3315 3323, 2016. H ebert-Johnson, U., Kim, M., Reingold, O., and Rothblum, G. Multicalibration: Calibration for the (computationallyidentifiable) masses. In International Conference on Machine Learning, pp. 1939 1948, 2018. Hron, J., Krauth, K., Jordan, M., and Kilbertus, N. On component interactions in two-stage recommender systems. Advances in Neural Information Processing Systems, 34, 2021. Johnson, S. K., Hekman, D. R., and Chan, E. T. If there s only one woman in your candidate pool, there s statistically no chance she ll be hired. Harvard Business Review, 26(04), 2016. Jung, C., Lee, C., Pai, M., Roth, A., and Vohra, R. Moment multicalibration for uncertainty estimation. In Conference on Learning Theory, pp. 2634 2678, 2021. Improving Screening Processes via Calibrated Subset Selection Kilbertus, N., Rodriguez, M. G., Sch olkopf, B., Muandet, K., and Valera, I. Fair decisions despite imperfect predictions. In International Conference on Artificial Intelligence and Statistics, pp. 277 287, 2020. Kleinberg, J., Mullainathan, S., and Raghavan, M. Inherent trade-offs in the fair determination of risk scores. Innovations in Theoretical Computer Science, 2017. Kuleshov, V., Fenner, N., and Ermon, S. Accurate uncertainties for deep learning using calibrated regression. In International Conference on Machine Learning, pp. 2796 2804, 2018. Liu, R., Rizzo, S., Whipple, S., Pal, N., Pineda, A. L., Lu, M., Arnieri, B., Lu, Y., Capra, W., Copping, R., et al. Evaluating eligibility criteria of oncology trials using realworld data and ai. Nature, 592(7855):629 633, 2021. Ma, J., Zhao, Z., Yi, X., Yang, J., Chen, M., Tang, J., Hong, L., and Chi, E. H. Off-policy learning in two-stage recommender systems. In Proceedings of The Web Conference 2020, pp. 463 473, 2020. Massart, P. The tight constant in the dvoretzky-kieferwolfowitz inequality. The annals of Probability, pp. 1269 1283, 1990. Platt, J. et al. Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods. Advances in large margin classifiers, 10(3):61 74, 1999. Pleiss, G., Raghavan, M., Wu, F., Kleinberg, J., and Weinberger, K. Q. On fairness and calibration. In Advances in Neural Information Processing Systems, 2017. Podkopaev, A. and Ramdas, A. Distribution-free uncertainty quantification for classification under label shift. In Conference on Uncertainty in Artificial Intelligence, 2021. Raghavan, M., Barocas, S., Kleinberg, J., and Levy, K. Mitigating bias in algorithmic hiring: Evaluating claims and practices. In Conference on Fairness, Accountability, and Transparency, pp. 469 481, 2020. Romano, Y., Patterson, E., and Candes, E. Conformalized quantile regression. In Advances in Neural Information Processing Systems, volume 32, pp. 3543 3553, 2019. Sahoo, R., Zhao, S., Chen, A., and Ermon, S. Reliable decisions with threshold calibration. In Advances in Neural Information Processing Systems, 2021. S anchez-Monedero, J., Dencik, L., and Edwards, L. What does it mean to solve the problem of discrimination in hiring? social, technical and legal perspectives from the uk on automated hiring systems. In Conference on fairness, accountability, and transparency, pp. 458 468, 2020. Shafer, G. and Vovk, V. A tutorial on conformal prediction. Journal of Machine Learning Research, 9(3), 2008. Straitouri, E., Wang, L., Okati, N., and Rodriguez, M. G. Provably improving expert predictions with prediction sets. ar Xiv preprint ar Xiv:2201.12006, 2022. Tambe, P., Cappelli, P., and Yakubovich, V. Artificial intelligence in human resources management: Challenges and a path forward. California Management Review, 61(4): 15 42, 2019. Tsirtsis, S. and Gomez Rodriguez, M. Decisions, counterfactual explanations and strategic behavior. In Advances in Neural Information Processing Systems, 2020. Vovk, V., Gammerman, A., and Saunders, C. Machinelearning applications of algorithmic randomness. In International Conference on Machine Learning, 1999. Vovk, V., Gammerman, A., and Shafer, G. Algorithmic learning in a random world. Springer Science & Business Media, 2005. Wang, L. and Joachims, T. Fairness in the first stage of two-stage recommender systems. ar Xiv preprint ar Xiv:2205.15436, 2022. Zadrozny, B. and Elkan, C. Obtaining calibrated probability estimates from decision trees and naive bayesian classifiers. In International Conference on Machine Learning, volume 1, pp. 609 616, 2001. Zhao, S., Ma, T., and Ermon, S. Individual calibration with randomized forecasting. In International Conference on Machine Learning, pp. 11387 11397, 2020. Zhao, S., Kim, M. P., Sahoo, R., Ma, T., and Ermon, S. Calibrating predictions to decisions: A novel approach to multi-class calibration. In Advances in Neural Information Processing Systems, 2021. Improving Screening Processes via Calibrated Subset Selection Algorithm 2 CSS (Dynamic Pool Size) 1: input: k, E [M], Dcal, f, α, x 2: initialize: s = 3: ˆtf = sup t [0, 1] ˆδt,1 k/E [M] + ϵ(α, n) 4: for x x do 5: if f(x) ˆtf then 6: add x to s 7: end if 8: end for 9: return s A. Additional Algorithms A.1. Calibrated Screening Algorithm under the Dynamic Pool Size Setting In reality, the size of the pool of candidates might be different across times. Here, we derive an algorithm CSS (Dynamic Pool Size) illustrated in Algorithm 2 that allows for dynamic size of the pool of applicants. we overload the use of notation s to denote the set of the selected candidates rather than a vector, for ease of presentation, especially for the screening algorithm to ensure diversity as we will discuss in the next subsection. CSS (Dynamic Pool Size) requires that the expected size E [M] is given, which can be estimated from past pools. We show that all our theoretical results naturally generalize to the dynamic pool size setting. In the dynamic pool size setting, instead of assuming that the size of the pool of candidates m is constant, we assume the size is a random variable M PM that follows some distribution PM, with mean E [M] = EM PM [M]. Thus, the data generation process for the pool of candidates becomes M, X, Y PM P M X,Y . For the individual guarantees, Theorem 3.1 naturally holds for all m R+, and thus for the dynamic size setting; the impossibility results in Proposition 3.2 still hold in the dynamic pool size setting, since the constant size setting is a special case of the dynamic pool size setting. So we focus on showing that the theoretical results in the marginal guarantees still hold here, by showing that there is a family of policies that are oblivious to the size of the pools while representative enough. We call this family of policies pool-independent policies. Definition A.1 (Pool-Independent Policies). Given a classifier f, a policy πf Πf is pool-independent if there is a mapping pπf : Range(f) [0, 1], such that for any candidate xi in any pool x of any size m, the probability that the candidate is selected under πf is Pr (Si = 1) = pπf (f(xi)). This family of policies have the nice property that their expected number of qualified candidates in the shortlist and the expected shortlist size depend on PM only through E [M]. Formally speaking, for a pool-independent policy πf and its selection probability mapping pπf , E(M,X,Y ) P,S πf i [M] Si Yi = E [M] E(f(X),Y ) Pf(X),Y pπf (f(X))Y E(M,X) P,S πf = E [M] Ef(X) Pf(X) pπf (f(X)) . We show in the following proposition that the family of pool-independent policies are representative enough in the dynamic pool size setting, in a sense that for any given classifier f and any policy πf Πf, we can always find a pool-independent policy in Πf that achieves the same expected number of qualified candidates in the shortlists and the same expected size of the shortlists as πf. Proposition A.2. Given a classifier f, for any policy πf Πf, there exists a pool-independent policy π f Πf such that Improving Screening Processes via Calibrated Subset Selection Algorithm 3 CSS (Diversity) 1: input: G, n E [M]g o g G, {kg}g G, {Dg cal}g G, f, α, {xg}g G 2: for g G do 3: sg = CSS (Dynamic Pool Size) (kg, E [M]g, Dg cal, f, α/|G|, xg) 4: end for 5: return g Gsg they have the same expected number of qualified candidates E(M,X,Y ) P,S π i [M] Yi Si = E(M,X,Y ) P,S π i [M] Yi Si and the same expected shortlist size E(M,X) P,S π = E(M,X) P,S π This implies that it is sufficient to find optimal/near-optimal solutions within this family of pool-independent policies to achieve global optimal/near-optimal guarantees. In fact, all the optimal/near-optimal policies in all the theorems, propositions, corollaries are all pool-independent policies. They all can be translated to the dynamic pool size setting by setting m = E [M] in the policy design. A.2. Calibrated Subset Selection Algorithm to Ensure Diversity CSS (Diversity) (Algorithm 3) selects calibrated shortlists of candidates for each demographic group independently. CSS (Diversity) requires inputs of the groups G, the expected pool size of each group n E [M]g o g G, the target expected number of qualified candidates for each group {kg}g G, the calibration data for each group {Dg cal}g G, and the pool of candidates from each group {xg}g G. Note that the groups can have overlap, i.e., each candidate might belong to multiple groups. The guarantees on the expected number of qualified candidates from CSS (Dynamic Pool Size) directly apply to each group. The near-optimality guarantees of CSS (Dynamic Pool Size) only apply for every group when each candidate belongs to exactly one group, since otherwise it is possible that we select more-than-enough candidates from a particular group to satisfy the guarantees for the other groups. A.3. Calibrated Screening Algorithms with Multiple Bins We develop algorithms that can ensure distribution-free guarantees on the expected number of qualified candidates in the shortlists, when we partition the sample-space into multiple bins by the prediction scores from a given classifier f. For ease of notation and presentation of the proof, we illustrate the theory and the algorithm in the constant pool size setting. But they can be similarly adapted to the dynamic pool size setting as described in Appendix A.1. Let t0 = 1, . . . , t B = 0 be the thresholds on the prediction scores from f in decreasing order, which partition the sample-space X into B bins: {x X | t0 f(x) t1}, {x X | t1 < f(x) t2}, ..., {x X | t B 1 < f(x) t B}. Let B : X [B] be the bin identity function, which maps a candidate x to the bin it belongs to B(x). Let δb = E(X,Y ) PX,Y [I {B(X) = b} Y ], and ˆδb = 1 i [n] Y c i I {B(Xc i )} be an empirical estimate of it from the calibration data Dcal. There exist many concentration inequalities to bound the calibration errors. But they generally require sample-splitting, i.e., to split the calibration data into two sets, one set for selecting the sample-space partition, and the other set for estimating the calibration error. A recent work (Gupta & Ramdas, 2021) on distribution-free calibration without sample-splitting does not apply here, since their proposed method can only bound the calibration errors of µb = E(X,Y ) PX,Y [Yi | B(X) = b], rather than δb. To avoid sample-splitting while still bounding the calibration errors of δb on sample-space partitions with thresholds {tb}b [B 1] chosen in a data dependent way (e.g., uniform mass binning (Zadrozny & Elkan, 2001)), we again use the DKWM inequality to bound the calibration error of δb on any bin b that corresponds to an interval on the scores predict by f, by the following proposition. Improving Screening Processes via Calibrated Subset Selection Algorithm 4 Calibrated Subset Selection with Multiple Bins 1: input: k, m, B, B, n ˆδb o b [B], ϵ(α, n), x 2: initialize: s = 3: ˆb = inf a [B] P b [a] ˆδb 2ϵ(α, n) k/m 4: ˆθ = k/m P b [ˆb 1](ˆδb 2ϵ(α,n)) ˆδˆb 2ϵ(α,n) 5: for x x do 6: if B(x) < ˆb then 7: add x to s 8: else if B(x) = ˆb then 9: add x to s with probability ˆθ 10: end if 11: end for 12: return s Proposition A.3. For any α (0, 1), with probability at least 1 α (in f and Dcal), it holds that for any bin {x X | tb 1 f(x) tb} indexed by b and parameterized by 0 tb 1 < tb 1, δb ˆδb p 2 ln(2/α)/n = 2ϵ(α, n). With this calibration error guarantee that holds for any bin, and motivated by the monotone property of f, we derive an empirical lower bound on the expected number of qualified candidates similarly as Corollary 4.7 using a family of threshold decision policies that select candidates from the top-scored bins to lower-scored bins, possibly at random for the candidates in the last bin it selects candidates from. Any policy πf,a,θ given a classifier f in this family makes decisions based on the following decision rule 1 if B(xi) < a, Bernoulli(θ) if B(xi) = a, 0 otherwise. An empirical lower bound on the expected number of qualified candidates in the shortlists can be derived as follows E(X,Y ) P,S πf,a,θ i [m] Si Yi = m E(B(X),Y ) PB(X),Y [Y (I {B(X) < a} + θI {B(X) = a})] b [B] δb (I {b < a} + θI {b = a}) ˆδb 2ϵ(α, n) (I {b < a} + θI {b = a}) . We solve the optimization problem with this empirical lower bound similarly as in Eq. 7: ˆb, ˆθ = argmin b,θ Θ EX P,S πf,b,θ where Θ = {a [B], θ [0, 1) | m P b [B] ˆδb 2ϵ(α, n) (I {b < a} + θI {b = a}) k}. And it turns out that there is a closed form solution ˆδb 2ϵ(α, n) k/m Improving Screening Processes via Calibrated Subset Selection b [ˆb 1] ˆδb 2ϵ(α, n) ˆδˆb 2ϵ(α, n) . We summarize the algorithm using this decision policy in Algorithm 4. It guarantees that the expected number of qualified candidates is greater than k with probability 1 α as shown in the empirical lower bound. And, it also satisfies the near-optimality guarantees on the expected size of the shortlist among this extended family of threshold decision policies given a fixed sample-space partition similarly as Calibrated Subset Selection does in Proposition 4.9. We omit the proof since it can be simply adapted from the proof of Proposition 4.9. The disadvantage of Algorithm 4 is that it requires a given sample-space partition and finds near-optimal policy only among the policies using this fixed sample-space partition, while CSS directly optimizes the sample-space partition and the policy jointly. To also optimize over sample-space partitions, i.e., the number of bins and the thresholds, is computationally intractable and hard to identify which one is actually optimal. CSS solves this problem by considering a smaller policy space (no random selection from the last bin) and the sample-space partition space (only two bins), so that it is easy to find the optimal sample-space partition and the optimal policy simultaneously. We now show that among deterministic threshold decision rules, sample-space partitions with two bins are optimal. Calibration on More Bins Worsens the Performance of Threshold Policies. When we consider deterministic threshold decision rules, Algorithm 4 becomes selecting the candidates in bin ˆb with probability 1 instead of with probability ˆθ, to ensure distribution-free guarantees on the expected number of qualified candidates in the shortlists. This is equivalent to finding the largest threshold ˆt among {tb}b [B] to ensure enough qualified candidates t {tb}b [B] b [B] I {tb t} ( ˆδb 2ϵ(δ, n)) k This is essentially solving the optimization problem using the empirical lower bound on the expected number of qualified candidates X b [B] I {tb t} ( ˆδb 2ϵ(δ, n)), over a subset of thresholds in [0, 1]. This empirical lower bound is looser compared to the bound used in CSS, which also optimizes over the entire threshold space [0, 1], as shown in the following X b [B] I {tb t} ( ˆδb 2ϵ(δ, n)) = δtb,1 2 X b [B] I {tb t} ϵ(α, n) > δtb,1 ϵ(α, n). So it is clear that CSS which uses a tighter bound over a larger space of thresholds achieves shorter shortlists in expectation. UMB B Bins Algorithms. In the empirical evaluation, the UBM B Bins algorithms correspond to selecting the thresholds such that the calibration data Dcal are partitioned evenly into B bins, and running Algorithm 4 with the sample-space partition characterized by these thresholds. B. Calibration Error Bounds Imply Distribution-Free Average Calibration in Regression If we regard the given classifier f as a regression model from x to y, then the calibration error bounds in Proposition 4.6 directly imply a classifier that (approximately) achieves average calibration for any interval (Gneiting et al., 2007; Kuleshov et al., 2018). Translating average calibration in regression to our setting, where the range of the regression model has only two values {0, 1}, a classifier h is average-calibrated if for any 0 t 1 E(X,Y ) PX,Y [Y I {h(X) t}] = 1 t. From the calibration error guarantees in Proposition 4.6, we know that for any δ (0, 1), with probability 1 δ, for any t [0, 1], E(X,Y ) PX,Y [Y I {f(X) t}] ˆδt,1 ϵ(α, n). Improving Screening Processes via Calibrated Subset Selection Let g(t) = ˆδt,1, which is a monotonically decreasing function. Let t = g 1(1 t ) in the above inequalities, we get for any t [0, 1], E(X,Y ) PX,Y Y I f(X) g 1(1 t ) (1 t ) ϵ(α, n). Since g is monotone, we can get that for any t [0, 1], E(X,Y ) PX,Y [Y I {(g f)(X) 1 t}] (1 t) ϵ(α, n). Let ˆh = g f, ˆh is approximately calibrated in a sense that with probability 1 α, for any t [0, 1], 1 t ϵ(α, δ) E(X,Y ) PX,Y h Y I n ˆh(X) t oi 1 t + ϵ(α, δ). C.1. Proof of Theorem 3.1 We first show that for any x X m, the constraint in the minimization problem in Eq. 1 is satisfied with equality under π f , ES π f ,Y P i [m] Si Yi = E{Yi}i [m] Q i [m] PY |X=xi,S π({f(xi)}i [m]) i [m] Si Yi i [m] Pr (Y = 1 | X = xi) (I {f(xi) > t } + I {f(xi) = t } θ ) i [m] f(xi) (I {f(xi) > t } + I {f(xi) = t } θ ) The last equality is by the definition of t and θ . Next, we will show that it is an optimal solution for any x X m by contradiction. The objective in Eq. 1 using π f is i [m] I {Pr (Y = 1 | X = xi) > t } + θ I {Pr (Y = 1 | X = xi) = t } . Suppose a policy π achieves smaller objective than π f for some x X m, i.e., ES π ({f(xi)}i [m]) := Rπ (x) < X i [m] I {Pr (Y = 1 | X = xi) > t } + θ I {Pr (Y = 1 | X = xi) = t } . We will show that the constraint in the optimization problem in Eq. 1 for x is not satisfied using π . Let t [0, 1] : X i [m] I {Pr (Y = 1 | X = xi) t} Rπ (x) θ := Rπ (x) P i [m] I {Pr (Y = 1 | X = xi) > t } P i [m] I {Pr (Y = 1 | X = xi) = t } . Note that t t since X i [m] I {Pr (Y = 1 | X = xi) t } i [m] I {Pr (Y = 1 | X = xi) > t } + θ I {Pr (Y = 1 | X = xi) = t } > Rπ (x). Improving Screening Processes via Calibrated Subset Selection Now, we can show that the constraint is not satisfied using π . The left hand side of the constraint using π can be bounded as i [m] Yi Si = E{Yi}i [m] Q i [m] PY |X=xi,S π i [m] Yi Si i [m] Pr(Y = 1 | X = xi)Si i [m] Pr(Y = 1 | X = xi) (I {Pr(Y = 1 | X = xi) > t } + I {Pr(Y = 1 | X = xi) = t } θ ) , where the last inequality is by the definition of Rπ (x), t and θ . When t < t , i [m] Pr (Y = 1 | X = xi) (I {Pr (Y = 1 | X = xi) > t } + I {Pr (Y = 1 | X = xi) = t } θ ) i [m] Pr (Y = 1 | X = xi) (I {Pr (Y = 1 | X = xi) > t } + I {Pr[Y = 1 | X = xi] = t } θ ) = k. When t = t , θ = Rπ P i [m] I {Pr (Y = 1 | X = xi) > t } P i [m] I {Pr (Y = 1 | X = xi) = t } < θ . i [m] Pr (Y = 1 | X = xi) (I {Pr (Y = 1 | X = xi) > t } + I {Pr (Y = 1 | X = xi) = t } θ ) i [m] Pr (Y = 1 | X = xi) (I {Pr (Y = 1 | X = xi) > t } + I {Pr (Y = 1 | X = xi) = t } θ ) i [m] Pr (Y = 1 | X = xi) (I {Pr (Y = 1 | X = xi) > t } + I {Pr (Y = 1 | X = xi) = t } θ ) = k. This shows that for any x X m, for any policy π that achieves smaller objective than π f , π does not satisfy the constraint, which concludes the proof. C.2. Proof of Proposition 3.2 We construct two pools of candidates such that any policy will fail on at least one them. The two pools of candidates are all as and all bs, x = {a, a, . . . , a} and x = {b, b, . . . , b}. And also note that in this kind of candidate distribution, the only perfectly calibrated predictor that is not the omniscient predictor is the predictor that predicts h(x) = Pr (Y = 1) for all x X, since there are only two possible candidate feature vectors. For any policy πh that makes decisions S based purely on the quality scores by predicted by h, the distribution of S will be the same for the two pools of candidates, since they have the same predictions from h. Thus the expectation on the size of the shortlist for the two pools of candidates are the same ES πh({h(xi)}i [m]) = ES πh {h(x i)}i [m] On the other hand, for the optimal policy π f , we know that ES π f ({f (xi)}i [m]) Improving Screening Processes via Calibrated Subset Selection ES π {f (x i)}i [m] The sum of the differences on the two pools of candidates is ES πh({h(xi)}i [m]) ES π f ({f (xi)}i [m]) ES πh {h(x i)}i [m] ES π {f (x i)}i [m] = |cπh k| + |cπh m| m k. So for any policy πh, there exists a pool of candidates (either x or x ) such that the difference is at least m k For the expected number of qualified candidates in the shortlists, for any policy πh, we have EY P,S πh({h(xi)}i [m]) i [m] Si Yi EY P,S πh {h(x i)}i [m] i [m] Si Yi On the other hand, for the optimal policy π f , we know that EY P,S π f ({f (xi)}i [m]) i [m] Si Yi = EY P,S π f {f (x i)}i [m] i [m] Si Yi The sum of differences on the two pools of candidates is EY P,S πh({h(xi)}i [m]) i [m] Si Yi EY P,S π f ({f (xi)}i [m]) i [m] Si Yi EY P,S πh {h(x i)}i [m] i [m] Si Yi EY P,S π f {f (x i)}i [m] i [m] Si Yi = |cπh k| + k mcπh k k 1 k where the equality is achieved when cπ = k. So for any policy πh, there exist a set of candidates (either x or x ) such that the difference is at least k Improving Screening Processes via Calibrated Subset Selection C.3. Proof of Theorem 4.1 First, we show that the constraint in the minimization problem in (2) is satisfied with equality: E(X,Y ) P,S π h i [m] Yi Si = E{(h(Xi),Yi)}i [m] P m h(X),Y ,S π h({h(Xi)}i [m]) i [m] Yi Si = E{h(Xi)}i [m] P m h(X),S π h({h(Xi)}i [m]) i [m] Si Pr (Y = 1 | h(Xi)) = E{h(Xi)}i [m] P m h(X),S π h({h(Xi)}i [m]) i [m] Sih(Xi) = E{h(Xi)}i [m] P m h(X) i [m] [I {h(Xi) > th} h(Xi) + I {h(Xi) = th} h(Xi)θh] = m Eh(X) Ph(X) [I {h(X) > th} h(X) + I {h(X) = th} h(X)θh] where the last equality is by the definition of th and θh. Next, we will show that it is an optimal solution by contradiction. The objective in (2) using π h is i [m] (I {h(Xi) > th} + θh I {h(Xi) = th}) = E{h(Xi)}i [m] P m h(X) i [m] (I {h(Xi) > th} + θh I {h(Xi) = th}) = m Eh(X) Ph(X) [I {h(X) > th} + θh I {h(X) = th}] b [B] ρ b (I {µ b > th} + θh I {µ b = th}) . For any policy π h that achieves smaller objective than π h := Rπ h < m X b [B] ρ b (I {µ b > th} + θh I {µ b = th}) , we will show that the constraint is not satisfied using π h. Let t [0, 1] : X b [B] ρ b I {µ b t} Rπ h/m θ = Rπ h/m P b [B] ρ b I {µ b > t } P b [B] ρ b I {µ b = t } . Note that th t since X b [B] ρ b I {µ b th} X b [B] ρ b (I {µ b > th} + θh I {µ b = th}) > Rπ h/m. Improving Screening Processes via Calibrated Subset Selection Now we can show that the constraint is not satisfied using π h. The left hand side of the constraint using π h can be bounded as E(X,Y ) P,S π h i [m] Yi Si = E{(h(Xi),Yi)}i [m] P m h(X),Y ,S π h({h(Xi)}i [m]) i [m] Yi Si = E{h(Xi)}i [m] P m h(X),S π h({h(Xi)}i [m]) i [m] Si Pr (Y = 1 | h(Xi)) = E{h(Xi)}i [m] P m h(X),S π h({h(Xi)}i [m]) i [m] Sih(Xi) m Eh(X) Ph(X) [h(X) (I {h(X) > t h} + θ h I {h(X) = t h})] b [B] ρ bµ b (I {µ b > t h} + θ h I {µ b = t h}) . The inequality is by the definition of Rπ h, t h and θ h. When th < t h, b [B] ρ bµ b (I {µ b > t h} + θ h I {µ b = t h}) m X b [B] ρ bµ b (I {µ b > th} + θh I {µ b = th}) = k. When th = t h, θ h = Rπ h/m P b [B] ρ b I {µ b > th} P b [B] ρ b I {µ b = th} < θh. b [B] ρ bµ b (I {µ b > t h} + θ h I {µ b = t h}) = m X b [B] ρ bµ b (I {µ b > th} + θ h I {µ b = th}) b [B] ρ bµ b (I {µ b > th} + θh I {µ b = th}) This shows that no policy in Πh can achieve smaller objective than π h, while satisfying the constraint. So π h is a solution to the minimization problem among Πh. C.4. Proof of Corollary 4.3 By the definition of Πh, it is easy to see that the policy space of using h is a subset of that of h Πh Πh, the corollary directly follows from Theorem 4.1. C.5. Proof of Theorem 4.5 The proof constructs a policy πf,t for any policy πf such that the two inequalities in the theorem hold. For any policy πf, let kπf := E(X,Y ) P,S πf i [m] Si Yi be the expected number of qualified candidates selected by πf, µf( ) = Pr (Y = 1 | f(X) = ) denote the conditional probability that a candidate is qualified given a prediction quality score from f. We construct the the threshold in policy πf,t as t = sup n u [0, 1] : m Ef(X) Pf(X) [µf(f(X))I {f(X) u}] kπf o . Improving Screening Processes via Calibrated Subset Selection The expected number of qualified candidates using πf,t is E(X,Y ) P,S πf,t i [m] Yi Si = E{(f(Xi),Yi)}i [m] P m f(X),Y ,S πf,t({f(Xi)}i [m]) i [m] Yi Si = E{f(Xi)}i [m] P m f(X) i [m] µf(f(Xi))I {f(Xi) t} = m Ef(X) Pf(X) [µf(f(X))I {f(X) t}] This shows that the expected number of qualified candidates selected by πf,t is no smaller than πf. Now we only need to show the expected size of the shortlist using πf,t is no larger than that of using πf. The expected size of the shortlist using πf,t is EX P,S πf,t = E{f(Xi)}i [m] P m f(X) i [m] I {f(Xi) t} = m Ef(X) Pf(X) [I {f(X) t}] = m Pr (f(X) t) . We will show that πf achieves no smaller expected size of the shortlists by using contradiction. Suppose πf achieves smaller expected size of the shortlist than πf,t := Rπf Rπf . Then, the expected number of qualified candidates using πf can be bounded as E(X,Y ) P,S πf i [m] Yi Si = E{f(Xi)}i [m] P m f(X),S πf({f(Xi)}i [m]) i [m] µf(f(Xi))Si E{f(Xi)}i [m] P m f(X),S πf,t ({f(Xi)}i [m]) i [m] µf(f(Xi))Si = m Ef(X) Pf(X) [µf(f(X))I {f(X) t }] < m Ef(X) Pf(X) [µf(f(X))I {f(X) t}] The inequality is by monotone property of f. This arrives at a contradiction with the definition that the expected size of shortlist using πf is kπf . So, we get EX P,S πf,t which concludes the proof. Improving Screening Processes via Calibrated Subset Selection C.6. Proof of Proposition 4.6 Before introducing the proof of the proposition, we first introduce the DKWM inequality (Dvoretzky et al., 1956; Massart, 1990) as a lemma. Lemma C.1 (DKWM inequality (Dvoretzky et al., 1956; Massart, 1990)). Given a natural number n, let Z1, Z2,. . . , Zn be real-valued independent and identically distributed random variables with cumulative distribution function F( ). Let Fn denote the associated empirical distribution function defined by i [n] I {Zi z} z R. For any α (0, 1), with probability 1 α sup z R |F(x) Fn(z)| Equipped with this lemma, we provide the proof for Proposition 4.6 as follows. Zδ i = Y c i f(Xc i ) i [n]. They are real-valued independent and identically distributed random variables. Let F δ( ) denote the cumulative distribution function of Zδ and F δ n(z) = 1 i [n] I { Y c i f(Xc i ) z} z R. Applying the DKWM inequality, we can get that for any α (0, 1), with probability at least 1 α, for any z R, F δ n(z) F δ(z) F δ n(t) = ˆδt,1, F δ(t) = δt,1, which concludes the proof. C.7. Proof of Corollary 4.7 When the event in Proposition 4.6 holds, E(X,Y ) P,S πf,t i [m] Si Yi = E{(f(Xi),Yi)}i [m] P m f(X),Y ,S πf,t({f(Xi)}i [m]) i [m] Si Yi = E{(f(Xi),Yi)}i [m] P m f(X),Y i [m] Yi I {f(Xi) t} = m E(f(X),Y ) Pf(X),Y [Y I {f(X) t}] By Proposition 4.6, this corollary holds, which concludes the proof. Improving Screening Processes via Calibrated Subset Selection C.8. Proof of Theorem 4.8 We can see that EX P,S πf,t is a non-increasing function of t. So to solve the constraint minimization problem, we only need to select the largest t that satisfies the constraint, which is Eq. 7. This concludes the proof. C.9. Proof of Proposition 4.9 For the optimal policy πf,t f , the constraint is active (otherwise we can randomly select the candidates selected by πf,t f to derive a better policy, which contradicts that πf,t f is optimal) E(X,Y ) P,S πf,t f i [m] Si Yi Since the distribution of f(X) is nonatomic, all the candidates have different prediction scores almost surely. Thus, by the definition of ˆtf, we know that ˆδˆtf ,1 ϵ(α, n) < k/m + 1/n almost surely. By Proposition 4.6 and the proof in Corollary 4.7, we have that for any α (0, 1), with probability at least 1 α, E(X,Y ) P,S πf,ˆtf i [m] Yi Si = mδˆtf ,1 m ˆδˆtf ,1 + ϵ(α, n) < k + 2mϵ(α, n) + m/n. This concludes the proof. C.10. Proof of Proposition 4.10 Let Qi = Si Yi for all i [m] be the random variables of whether a candidate i is qualified and selected in the pool of candidates. Note that E(X,Y ) P,S πf,ˆtf [Qi] = δˆtf ,1. By Theorem 4.8, we know that for any α1 (0, 1) with probability at least 1 α1, E(X,Y ) P,S πf,ˆtf = mδˆtf ,1 k. By applying Bernstein s inequality to these m independent and identically distributed random variables, we have that, for any α2 (0, 1), with probability at least 1 α2 i [m] Qi mδˆtf ,1 1 3 ln(1/α2) q 1/9 ln2(1/α2) + 2mδˆtf ,1 ln(1/α2). The right hand side of the above inequality is an increasing function of mδˆtf ,1 when mδˆtf ,1 4/9 ln(1/α2). When k 4/9 ln(1/α2), we can apply the union bound to the above two events to get that, with probability at least 1 α1 α2, i [m] Qi mδˆtf ,1 1 3 ln(1/α2) q 1/9 ln2(1/α2) + 2mδˆtf ,1 ln(1/α2) k 1 3 ln(1/α2) 1 ln2(1/α2) + 18k ln(1/α2). This concludes the proof. Improving Screening Processes via Calibrated Subset Selection C.11. Proof of Proposition A.2 We prove the proposition for f that has a discrete range, and the proof for classifiers with continuous range can be derived similarly. For any πf, we construct a pool-independent policy π f by constructing pπ f Range(f) [0, 1] as pπ f (v) = EM PM,{f(Xi)}i [M] P M f(X),S πf h P i [M] I {f(Xi) = v} Si i E [M] Ef(X) Pf(X) {I {f(X) = v}} . Let π f be the policy using pπ f . The expected number of qualified candidates for policy π f equals that of πf E(M,X,Y ) P,S π f i [M] Yi Si = EM PM,{f(Xi)}i [M] P M f(X),S π f({f(Xi)}i [M]) i [M] Si Pr (Y = 1 | f(Xi)) = EM PM,{f(Xi)}i [M] P M f(X) i [M] pπ f (f(Xi)) Pr (Y = 1 | f(Xi)) = E [M] Ev Pf(X) h pπ f (v) Pr (Y = 1 | v) i EM PM,{f(Xi)}i [M] P M f(X),S πf h P i [M] I {f(Xi) = v} Si i Pr (Y = 1 | v) Ef(X) Pf(X) {I {f(X) = v}} = EM PM,{f(Xi)}i [M] P M f(X),S πf({f(Xi)}i [M]) i [M] Si Pr (Y = 1 | f(X)) = E(M,X,Y ) P,S πf i [M] Yi Si and the expected size of the shortlist equals that of πf E(M,X,Y ) P,S π f i [M] Yi Si = EM PM,{f(Xi)}i [M] P M f(X),S π f({f(Xi)}i [M]) = EM PM,{f(Xi)}i [M] P M f(X) i [M] pπ f (f(Xi)) = E [M] Ev Pf(X) h pπ f (v) i EM PM,{f(Xi)}i [M] P M f(X),S πf h P i [M] I {f(Xi) = v} Si i Ef(X) Pf(X) {I {f(X) = v}} = EM PM,{f(Xi)}i [M] P M f(X),S πf({f(Xi)}i [M]) = E(M,X,Y ) P,S πf i [M] Yi Si Improving Screening Processes via Calibrated Subset Selection C.12. Proof of Proposition A.3 From the proof of Proposition 4.6, we know that with probability at least 1 α E(X,Y ) PX,Y [Y I {f(X) t}] 1 i [n] Y c i I {f(Xc i ) t} for all t [0, 1]. So, with probability at least 1 α, for any 0 ta < tb 1, E(X,Y ) PX,Y [Y I {ta f(X) tb}] 1 i [n] Y c i I {ta f(Xc i ) tb} E(X,Y ) PX,Y [Y I {f(x) ta}] 1 i [n] Y c i I {f(Xc i ) ta} E(X,Y ) PX,Y [Y I {f(X) tb}] 1 i [n] Y c i I {f(Xc i ) tb} This concludes the proof.