# Selective Regression Under Fairness Criteria

Abhin Shah*1, Yuheng Bu*1, Joshua Ka-Wing Lee2, Subhro Das3, Rameswar Panda3, Prasanna Sattigeri3, Gregory W. Wornell1

*Equal contribution. 1Massachusetts Institute of Technology. 2This work was done while J. Lee was at Massachusetts Institute of Technology; the author is now with Snap. 3MIT-IBM Watson AI Lab, IBM Research. Correspondence to: Abhin Shah, Yuheng Bu. Proceedings of the 39th International Conference on Machine Learning, Baltimore, Maryland, USA, PMLR 162, 2022. Copyright 2022 by the author(s).

Selective regression allows abstention from prediction if the confidence to make an accurate prediction is not sufficient. In general, by allowing a reject option, one expects the performance of a regression model to increase at the cost of reducing coverage (i.e., by predicting on fewer samples). However, as we show, in some cases, the performance of a minority subgroup can decrease while we reduce the coverage, and thus selective regression can magnify disparities between different sensitive subgroups. Motivated by these disparities, we propose new fairness criteria for selective regression requiring the performance of every subgroup to improve with a decrease in coverage. We prove that if a feature representation satisfies the sufficiency criterion or is calibrated for mean and variance, then the proposed fairness criterion is met. Further, we introduce two approaches to mitigate the performance disparity across subgroups: (a) by regularizing an upper bound of conditional mutual information under a Gaussian assumption and (b) by regularizing a contrastive loss for conditional mean and conditional variance prediction. The effectiveness of these approaches is demonstrated on synthetic and real-world datasets.

## 1. Introduction

As the adoption of machine learning (ML) based systems accelerates in a wide range of applications, including critical workflows such as healthcare management (Bellamy et al., 2018), employment screening (Selbst et al., 2019), and automated loan processing, there is a renewed focus on the trustworthiness of such systems. An important attribute of a trustworthy ML system is to reliably estimate the uncertainty in its predictions. For example, consider a loan approval ML system where the prediction task is to suggest appropriate loan terms (e.g., loan approval, interest rate). If the model's uncertainty in its prediction is high for an applicant, the prediction can be rejected to avoid potentially costly errors. The user of the system, i.e., the decision-maker, can intervene and take remedial actions, such as gathering more information for applicants with rejected model predictions or involving a special human credit committee before arriving at a decision. This paradigm is known as prediction with reject-option or selective prediction. By making the tolerance for uncertainty more stringent, the user expects the error rate of the predictions made by the system to decrease as the system makes predictions for fewer samples, i.e., as coverage is reduced. Although the error may lessen over the entire population, (Jones et al., 2020) demonstrated that for selective classification, this may not be true for different sub-populations. In other words, selective classification could magnify disparities across different sensitive groups (e.g., race, gender).
For example, in the loan approval ML system, the error rate for a sensitive group could increase with a decrease in coverage. To mitigate such disparities, (Lee et al., 2021; Schreuder & Chzhen, 2021) proposed methods for performing fair selective classification. In this work, we demonstrate and investigate the performance disparities across different subgroups for selective regression as well as develop novel methods to mitigate such disparities. Similar to (Lee et al., 2021), we do not assume access to the identity of sensitive groups at test time. Compared to selective classification, one major challenge to tackling the aforementioned disparities in selective regression is as follows: in selective classification, generating an uncertainty measure (i.e., the model s uncertainty for its prediction) from an existing classifier is straightforward. For example, one could take the softmax output of an existing classifier as an uncertainty measure. In contrast, there is no direct method to extract an uncertainty measure from an existing regressor designed only to predict the conditional mean. Selective Regression Under Fairness Criteria Contributions. First, we show via the Insurance dataset and a toy example that selective regression, like selective classification, can decrease the performance of some subgroups when coverage (the fraction of samples for which a decision is made) is reduced (see Section 3.1). 1. Motivated by this, we provide a novel fairness criteria (Definition 1) for selective regression, namely, monotonic selective risk, which requires the risk of each subgroup to monotonically decrease with a decrease in coverage (see Section 3.2). 2. We prove that if a feature representation satisfies the standard sufficiency criterion or is calibrated for mean and variance (Definition 2), then the monotonic selective risk criteria is met (see Theorem 4.1 and 4.2). 3. We provide two neural network-based algorithms: one to impose the sufficiency criterion by regularizing an upper bound of conditional mutual information under a Gaussian assumption (see Section 5.1) and the other to impose the calibration for mean and variance by regularizing a contrastive loss (see Section 5.2). 4. Finally, we empirically1 demonstrate the effectiveness of these algorithms on real-world datasets (see Section 6). 2. Background 2.1. Fair Regression In standard (supervised) regression, given pairs of input variables X X (e.g., demographic information) and target variable Y R (e.g., annual medical expenses), we want to find a predictor f : X R that best estimates the target variable for new input variables. Formally, given a set of predictors F and a set of training samples of X and Y , i.e., {(x1, y1), . . . , (xn, yn)}, the goal is to construct f F which minimizes the mean-squared error (MSE): f = arg min f F E[(Y f(X))2]. (1) In fair regression, we augment the goal in (1) by requiring our predictor to retain fairness w.r.t. some sensitive attributes D D (e.g., race, gender). For example, we may want our predictions of annual medical expenses using the demographic information not to discriminate w.r.t. race. In this work, we assume D to be discrete and consider members with the same value of D as being in the same subgroup. While numerous criteria have been proposed to enforce fairness in machine learning, we focus on the notion of subgroup risks (Williamson & Menon, 2019), which ensures that the predictor f behaves similarly (in terms of risks) across all subgroups. 
This notion, also known as accuracy disparity, has been used frequently 1The source code is available at github.com/Abhin02/ fair-selective-regression. in fair regression, e.g., (Chzhen & Schreuder, 2020; Chi et al., 2021), and has also received attention in the field of domain generalization, e.g., (Krueger et al., 2021). Formally, given a set of training samples of X, Y , and D, i.e., {(x1, y1, d1), . . . , (xn, yn, dn)}, the goal is to construct f F which minimizes the overall MSE subject to the subgroup MSE being equal for all subgroups: f = arg min f F E[(Y f(X))2] s.t d, d D, E[(Y f(X))2|D = d] = E[(Y f(X))2|D = d ]. (2) In this work, we consider the scenario where the sensitive attribute is available only during training i.e., we do not assume access to the sensitive attribute at test time. 2.2. Selective Regression In selective regression, given pairs of input variables X X and target variable Y R, for new input variables, the system has a choice to: (a) make a prediction of the target variable or (b) abstain from a prediction (if it is not sufficiently confident). In the example of predicting annual medical expenses, we may prefer abstention in certain scenarios to avoid harms arising from wrong predictions. By only making predictions for those input variables with low prediction uncertainty, the performance (in terms of MSE) is expected to improve. Formally, in addition to a predictor f : X R that best estimates the target variable for new input variables, we need to construct a rejection rule Γ : X {0, 1} that decides whether or not to make a prediction for new input variables. Thereafter, for X = x, the system outputs f(x) as the prediction when Γ(x) = 1, and makes no prediction if Γ(x) = 0. There are two quantities that characterize the performance of selective regression: (i) coverage, i.e., the fraction of samples that the system makes predictions on, which is denoted by c(Γ) = P(|Γ(X)| = 1) and (ii) the MSE when prediction is performed E[(Y f(X))2|Γ(X) = 1]. In order to construct a rejection rule Γ, we need some measure of the uncertainty g( ) associated with a prediction f( ). Then, the rejection rule Γ can be defined as: ( 1, if g(x) τ 0, otherwise. where τ is the parameter that balances the MSE vs. coverage tradeoff: larger τ results in a larger coverage but also yields a larger MSE. Therefore, τ can be interpreted as the cost for not making predictions. As discussed in (Zaoui et al., 2020), a natural choice for the uncertainty measure g( ) could be the conditional variance of Y given X. Selective Regression Under Fairness Criteria The goal of selective regression is to build a model with (a) high coverage and (b) low MSE. However, there may not exist any τ for which both (a) and (b) are satisfied simultaneously. Therefore, in practice, the entire MSE vs. coverage tradeoff curve is generated by sweeping over all possible values of τ allowing the system designer to choose any convenient operating point. 2.3. Related Work Selective Regression. While selective classification has received a lot of attention in the machine learning community (Chow, 1957; 1970; Hellman, 1970; Herbei & Wegkamp, 2006; Bartlett & Wegkamp, 2008; Nadeem et al., 2009; Lei, 2014; Geifman & El-Yaniv, 2017), there is very limited work on selective regression. It is also known that existing methods for selective classification cannot be used directly for selective regression (Jiang et al., 2020). 
(Wiener & El-Yaniv, 2012) studied regression with reject option to make predictions inside a ball of small radius around the regression function with high probability. (Geifman & El-Yaniv, 2019) proposed Selective Net, a neural network with an integrated reject option, to optimize both classification (or regresssion) performance and rejection rate simultaneously. (Zaoui et al., 2020) considered selective regression with a fixed rejection rate and derived the optimal rule which relies on thresholding the conditional variance. (Jiang et al., 2020) analyzed selective regression with the goal to minimize the rejection rate given a regression risk bound. We emphasize that none of these works study the question of fairness in selective regression. Fair Regression. (Calders et al., 2013), one of the first works on fair regression, studied linear regression with constraints on the mean outcome or residuals of the models. More recently, several works including (Berk et al., 2017; Pérez-Suay et al., 2017; Komiyama & Shimao, 2017; Komiyama et al., 2018; Fitzsimons et al., 2018; Raff et al., 2018; Agarwal et al., 2019; Nabi et al., 2019; Oneto et al., 2020) considered fair regression employing various fairness criteria. (Mary et al., 2019) and (Lee et al., 2020) enforced independence between prediction and sensitive attribute by ensuring that the maximal correlation is below a fixed threshold. (Chzhen et al., 2020) considered learning an optimal regressor requiring the distribution of the output to remain the same conditioned on the sensitive attribute. We emphasize that none of these works could be used for selective regression as they are designed to only predict the conditional mean and not the conditional variance (i.e., the uncertainty). Fairness Criteria. Numerous metrics and criteria have been proposed to enforce fairness in machine learning (Verma & Rubin, 2018). Many of these criteria are mutually exclusive outside of trivial cases (Gölz et al., 2019). Further, the existing approaches also differ in the way they enforce these criteria: (a) pre-processing methods (Zemel et al., 2013; Louizos et al., 2015; Calmon et al., 2017) modify the training set to ensure fairness of any learned model, (b) post-processing methods (Hardt et al., 2016; Pleiss et al., 2017; Corbett-Davies et al., 2017) transform the predictions of a trained model to satisfy a measure of fairness, and (c) in-processing methods (Kamishima et al., 2011; Zafar et al., 2017; Agarwal et al., 2018) modify the training process to directly learn fair predictors e.g., minimizing a loss function that accounts for both accuracy and fairness as in (2). Additionally, these criteria also differ in the kind of fairness they consider (see (Mehrabi et al., 2021; Castelnovo et al., 2022) for details): (a) group fairness ensures that subgroups that differ by sensitive attributes are treated similarly and (b) individual fairness ensures that individuals who are similar (with respect to some metric) are treated similarly. In this work, we consider group fairness and propose a novel fairness criteria specific to selective regression. Our approach falls under the umbrella of in-processing methods as will be evident in Section 5 (see (5), (8), and (9)). 3. Fair Selective Regression While fair regression and selective regression have been independently studied before, consideration of fair selective regression (i.e., selective regression while ensuring fairness) is missing in the literature. 
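Throughout this section, disparities are measured on MSE vs. coverage curves obtained by thresholding an uncertainty measure, as described in Section 2.2. The sketch below is our own illustration of how such a curve can be traced (it is not the paper's released code, and all function and variable names are hypothetical): threshold g to obtain the rejection rule Γ, then sweep τ over quantiles of g.

```python
import numpy as np

def mse_vs_coverage(y, y_hat, g, coverages=np.linspace(0.1, 1.0, 19)):
    """Trace the MSE vs. coverage tradeoff of Section 2.2.

    y     : true targets
    y_hat : predictions f(x)
    g     : uncertainty scores g(x); a prediction is made when g(x) <= tau
    """
    y, y_hat, g = map(np.asarray, (y, y_hat, g))
    curve = []
    for c in coverages:
        tau = np.quantile(g, c)          # threshold achieving roughly coverage c
        accept = g <= tau                # Gamma(x) = 1 iff g(x) <= tau
        curve.append((accept.mean(), np.mean((y[accept] - y_hat[accept]) ** 2)))
    return np.array(curve)               # columns: achieved coverage, selective MSE
```

Sweeping τ in this way mirrors how the tradeoff curves discussed in the rest of the paper are generated.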
In this section, we explore the disparities between different subgroups that may arise in selective regression. Building on top of this, we formulate a notion of fairness for selective regression.

### 3.1. Disparities in Selective Regression

(Jones et al., 2020) argued that, in selective classification, increasing abstentions (i.e., decreasing coverage) could decrease accuracies on some subgroups, and observed this behavior on the CelebA dataset. In this section, we show that a similar phenomenon can be observed in selective regression.

Insurance dataset. Consider the task of predicting the annual medical expenses charged to patients from input variables such as age, BMI, number of children, etc., as in the Insurance dataset. Suppose we construct our predictor as the conditional expectation and our uncertainty measure as the conditional variance, following (Zaoui et al., 2020). Then, generating the subgroup MSE vs. coverage tradeoff curve² across the subgroups induced by gender, as shown in Figure 1a, we see that while decreasing the coverage improves the performance for the majority subgroup (i.e., females), the performance for the minority subgroup (i.e., males) degrades. Thus, unknowingly, selective regression can magnify disparities between different sensitive subgroups. Motivated to further understand this phenomenon, we explicitly recreate it via the following toy example.

²We have used Baseline 1 to generate this curve (see Section 6).

Figure 1: Subgroup MSE vs. coverage, illustrating disparities in selective regression via the Insurance dataset and the toy example. (a) The disparity for the Insurance dataset when the predictor is the conditional expectation and the uncertainty measure is the conditional variance. (b) The disparity for the toy example when the predictor is the conditional expectation and the uncertainty measure is the conditional variance of Y given X1 and X2. (c) The disparity is mitigated for the toy example when the predictor is the conditional expectation and the uncertainty measure is the conditional variance of Y given X1.

Toy example. Consider predicting Y from two normalized input variables X1 and X2 that are generated i.i.d. from the uniform distribution over [0, 1]. Suppose we have a binary sensitive attribute D with P(D = 0) = 0.9, where D = 0 represents the majority and D = 1 represents the minority. To illustrate the disparities that may arise in selective regression, we let the distribution of Y differ with D. More specifically, for the majority subgroup, we let the target be Y|D=0 = X1 + X2 + N(0, 0.1X1 + 0.15X2), and, for the minority subgroup, we let the target be Y|D=1 = X1 + X2 + N(0, 0.1X1 + 0.15(1 − X2)). To summarize, the only difference is that for the majority the variance of Y increases in X2, whereas for the minority the variance of Y decreases in X2. In this case, the conditional variance Var(Y|X), i.e., our uncertainty measure, would mainly capture the behavior of the majority subgroup D = 0, i.e., the subgroup with more samples. Since Var(Y|X, D=0) differs significantly from Var(Y|X, D=1), Var(Y|X) may not be a good measure of uncertainty for the minority subgroup D = 1. As a result, for the minority subgroup, when we decrease the coverage, we may make predictions on samples that erroneously achieve low uncertainty based on Var(Y|X); a small simulation sketch of this effect follows below.
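The sketch below is our own illustration of the toy example (not the paper's released code). It assumes the second argument of N(·, ·) above denotes the variance, uses the known population conditional mean X1 + X2 as the predictor, and compares the pooled uncertainty Var(Y|X) against an uncertainty that depends on X1 only (whose ordering matches that of Var(Y|X1)).

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000
x1, x2 = rng.uniform(size=n), rng.uniform(size=n)
d = (rng.uniform(size=n) < 0.1).astype(int)             # P(D = 1) = 0.1 (minority)

# Toy data-generating process from the text (N(0, v) read as variance v).
var_y = np.where(d == 0, 0.1 * x1 + 0.15 * x2,           # majority: variance grows in x2
                 0.1 * x1 + 0.15 * (1.0 - x2))           # minority: variance shrinks in x2
y = x1 + x2 + rng.normal(scale=np.sqrt(var_y))

f = x1 + x2                                              # population conditional mean E[Y | X]
# Pooled Var(Y | X) = E_D[Var(Y | X, D)], since the group-conditional means coincide.
g_pooled = 0.9 * (0.1 * x1 + 0.15 * x2) + 0.1 * (0.1 * x1 + 0.15 * (1.0 - x2))
g_x1_only = 0.1 * x1                                     # same ordering as Var(Y | X1)

def subgroup_mse(g, coverage):
    """MSE of each subgroup among the accepted samples (g <= tau at the given coverage)."""
    accept = g <= np.quantile(g, coverage)
    return [np.mean((y - f)[accept & (d == grp)] ** 2) for grp in (0, 1)]

for g, name in [(g_pooled, "Var(Y|X)"), (g_x1_only, "X1-only")]:
    print(name, "full coverage:", np.round(subgroup_mse(g, 1.0), 3),
          " 20% coverage:", np.round(subgroup_mse(g, 0.2), 3))
# With Var(Y|X), the minority MSE rises as coverage drops; with the X1-only
# measure, both subgroup MSEs fall (mirroring Figures 1b and 1c).
```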
Making predictions on such samples results in an increase in the MSE for the subgroup D = 1 (Figure 1b). An alternative could be to use the conditional variance of Y given X1 (instead of both X1 and X2) as our uncertainty measure. While this may be a slightly worse measure of uncertainty for D = 0 than Var(Y|X), it is a much better measure of uncertainty for D = 1. All in all, using this uncertainty measure, when we decrease the coverage, we make predictions on samples with low uncertainty for all subgroups, resulting in a decrease in the MSE for every subgroup, as shown in Figure 1c (albeit at the cost of a slight increase in the overall MSE). It is important to note that the toy example is designed to highlight the disparities when the uncertainty measure, a component of selective regression, is designed unfairly. In general, the disparities could occur due to the predictor or the uncertainty measure (or both).

### 3.2. When is Selective Regression Fair?

Motivated by the disparities that may arise in selective regression, as shown above, we define the first notion of fairness for selective regression, which we call monotonic selective risk. This notion requires our predictor and uncertainty measure to ensure the primary goal of selective regression: that the subgroup MSE decreases monotonically with a decrease in coverage for every subgroup. The subgroup MSE for d ∈ D, as a function of the predictor f and the uncertainty measure g, for a fixed coverage (parameterized by τ) is given by

$$\mathrm{MSE}(f, g, \tau, d) = \mathbb{E}\big[(Y - f(X))^2 \,\big|\, g(X) \le \tau,\, D = d\big].$$

Now, we formalize the criterion of monotonic selective risk, which ensures that no subgroup is discriminated against when the coverage is reduced in selective regression.

Definition 1. We say that a predictor f and an uncertainty measure g satisfy monotonic selective risk if, for any τ < τ′,

$$\mathrm{MSE}(f, g, \tau, d) \le \mathrm{MSE}(f, g, \tau', d) \quad \text{for all } d \in D.$$

Inspired by the success of representation-based learning in machine learning (Bengio et al., 2013), we seek to find a representation Φ : X → H that maps the input variable X ∈ X to an intermediate representation Φ(X) ∈ H, and use Φ(X) to construct our predictor f : H → R and our uncertainty measure g : H → R≥0. Then, for X = x, our prediction and uncertainty measure are (with a slight abuse of notation) f(Φ(x)) and g(Φ(x)), respectively.

## 4. Theoretical Results

In this section, we show that under certain conditions on the feature representation Φ, the conditional mean as the predictor and the conditional variance as the uncertainty measure satisfy monotonic selective risk (Definition 1).

### 4.1. Sufficiency

The sufficiency criterion requires³ Y ⊥ D | Φ(X), i.e., the learned representation Φ(X) completely subsumes all information about the sensitive attribute that is relevant to the target variable (Cleary, 1966). Sufficiency is closely tied with learning domain-invariant feature representations (Arjovsky et al., 2019; Creager et al., 2021) and has been used in fair selective classification (Lee et al., 2021). The theorem below shows that if the feature representation is sufficient, then the choice of the conditional mean as the predictor and the conditional variance as the uncertainty measure ensures the fairness criterion of monotonic selective risk, i.e., the subgroup MSE decreases monotonically with coverage for all subgroups. See Appendix A for a proof.

Theorem 4.1. Suppose the representation Φ(X) is sufficient, i.e., Y ⊥ D | Φ(X). Let f(Φ(X)) = E[Y|Φ(X)] and g(Φ(X)) = Var[Y|Φ(X)]. Then, for any d ∈ D and any τ < τ′, we have MSE(f, g, τ, d) < MSE(f, g, τ′, d).
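To make Definition 1 and the guarantee of Theorem 4.1 easy to check empirically, one can estimate MSE(f, g, τ, d) on held-out data over a grid of thresholds and verify that each subgroup's curve does not increase as coverage shrinks. The helper below is a minimal sketch under that reading; the function names are ours and not from the paper's code.

```python
import numpy as np

def subgroup_selective_mse(y, y_hat, g, d, coverages=np.linspace(0.1, 1.0, 10)):
    """Estimate MSE(f, g, tau, d) with tau chosen as the coverage-quantile of g."""
    y, y_hat, g, d = map(np.asarray, (y, y_hat, g, d))
    curves = {grp: [] for grp in np.unique(d)}
    for c in coverages:
        accept = g <= np.quantile(g, c)
        for grp in curves:
            m = accept & (d == grp)
            curves[grp].append(np.mean((y[m] - y_hat[m]) ** 2) if m.any() else np.nan)
    return {grp: np.array(v) for grp, v in curves.items()}

def monotonic_selective_risk(curves, tol=1e-6):
    """Definition 1, empirically: no subgroup's MSE may increase as coverage drops,
    i.e., each curve (indexed by increasing coverage) is non-decreasing."""
    return all(np.all(np.diff(v[~np.isnan(v)]) >= -tol) for v in curves.values())
```

Under the conditions studied in this section, this check is guaranteed to pass when f and g are the conditional mean and variance given Φ(X).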
4.2. Calibration for mean and variance In practice, the conditional independence Y D|Φ(X) required by sufficiency may be too difficult to satisfy. Since we only care about MSE, which depends on the first and second-order moments, one could relax the sufficiency condition by requiring the representation Φ to be such that these moments are the same across all subgroups. This inspires our notion of Φ calibrated for mean and variance. Definition 2. We say a representation Φ(X) is calibrated for mean and variance if E[Y |Φ(X), d] = E[Y |Φ(X)] d D, Var[Y |Φ(X), d] = Var[Y |Φ(X)] d D. The theorem below shows that if the feature representation is calibrated for mean and variance, then the choice of 3While conventionally sufficiency requires Y D|E[Φ(X)], this notion has been adapted to incorporate feature representation/score function (e.g., Sec 1.1 in (Liu et al., 2019)), Sec 3.4 in (Castelnovo et al., 2022). conditional mean as the predictor and the conditional variance as the uncertainty measure ensures the fairness criteria of monotonic selective risk, i.e., the subgroup MSE decreases monotonically with coverage for all subgroups. The proof is similar to the proof of Theorem 4.1 and is omitted. Theorem 4.2. Suppose the representation Φ(X) is calibrated for mean and variance. Let f(Φ(X)) = E[Y |Φ(X)] and g(Φ(X)) = Var[Y |Φ(X)]. Then, for any d D and any τ < τ , we have MSE(f, g, τ, d) < MSE(f, g, τ , d). 5. Algorithm Design In this section, we provide two neural network-based algorithms: one to impose sufficiency and the other to impose the calibration for mean and variance. 5.1. Imposing sufficiency To simplify our algorithm when directly enforcing sufficiency, we utilize the framework of heteroskedastic neural network (Gal, 2016). A heteroskedastic neural network, which requires training only a single neural network, assumes that the distribution of Y conditioned on X is Gaussian. Then, it is trained by minimizing the negative log likelihood: i=1 log PG(yi|Φ(xi); θ) where PG(y|Φ(x); θ) represents a Gaussian distribution with f(Φ(x); θf) and g(Φ(x); θg) as the conditional mean and the conditional variance (of Y given Φ(X)) respectively. The feature representation Φ is parameterized by θΦ and the neural network is supposed to learn the parameters θΦ and θ = (θf, θg). To impose sufficiency, we augment minimizing the negative log likelihood as follows: min θ,Φ LG(Φ, θ), s.t. Y D|Φ(X). To relax the hard constraint of Y D|Φ(X) into a soft constraint, we use the conditional mutual information, since Y D|Φ(X) is equivalent to I(Y ; D|Φ(X)) = 0. For λ 0, min θ,Φ LG(Φ, θ) + λI(Y ; D|Φ(X)). As discussed in (Lee et al., 2021), existing methods using mutual information for fairness are ill-equipped to handle conditioning on the feature representation Φ( ). Therefore, we further relax the soft constraint by using the following Selective Regression Under Fairness Criteria Algorithm 1 Heteroskedastic neural network with sufficiency-based regularizer Input: training samples {(xi, yi, di)}n i=1, regularizer λ Draw: {ed1, . . . , edn} drawn i.i.d. from ˆPD Initialize: θ, θΦ, and w(d) with pre-trained models Initialize: nd = number of samples in group d, d D for each training iteration do for each batch do for d = 1, . . . 
, |D| do # update subgroup-specific mean/variance predictor w(d) w(d) 1 nd ηw w Ld(w) end for end for for each batch do # update feature extractor θΦ θΦ 1 nη θΦ(LG(Φ, θ) + λLR(Φ)) # update mean/variance predictor θ θ 1 nη θLG(Φ, θ) end for end for upper bound for I(Y ; D|Φ(X)) from (Lee et al., 2021): I(Y ; D|Φ(X)) EΦ(X),Y,D [log P(Y |Φ(X), D)] (3) ED EΦ(X),Y [log P(Y |Φ(X), D)] . where equality is achieved if and only if Y D | Φ(X). In order to compute the upper bound in (3), we need to learn the unknown distribution P(y|Φ(x), d). We approximate this by PG(y|Φ(x), d; w) which is a Gaussian distribution with f(Φ(x), d; wf) and g(Φ(x), d; wg) as the conditional mean and the conditional variance of Y given Φ(X) and D, respectively. The neural network is supposed to learn the parameters w = (wf, wg). In scenarios where Φ(X) is high-dimensional compared to D, it would be preferred to approximate PG(y|Φ(x), d; w) by PG(y|Φ(x); w(d)), i.e., train a subgroup-specific Gaussian model with parameters w(d) for each d D instead of using D as a separate input to ensure that D has an effect in PG(y|Φ(x), d; w). Then, for d D, w(d) = arg min w Ld(w), where (4) i: di=d log PG(yi|Φ(xi); w). To summarize, the first term of the upper bound in (3) is approximated by the log-likelihood of the training samples using PG(y|Φ(x); w(di)) for each subgroup di D (i.e., subgroup-specific loss). Then, drawing edi i.i.d from PD i.e., the marginal distribution of D (which could be approximated by ˆPD), the second term of the upper bound in (3) is approximated by the negative log-likelihood of the samples using the randomly-selected Gaussian model PG(y|Φ(x); w( e di)) for each subgroup edi D (i.e., subgroup-agnostic loss). Combining everything and replacing all the expectations in (3) with empirical averages, the regularizer is given by i=1 log PG(yi|Φ(xi); w(di)) PG(yi|Φ(xi); w( e di)) where edi are drawn i.i.d. from the marginal distribution ˆPD. Summarizing, the overall objective is min θ,Φ LG(Φ, θ) + λLR(Φ). (5) As shown in Algorithm 1, we train our model by alternating between the fitting subgroup-specific models in (4) and feature updating in (5). 5.2. Imposing calibration for mean and variance To achieve the calibration for mean and variance, we let the representation Φ = (Φ1, Φ2). Then, to enable the use of the residual-based neural network (Hall & Carroll, 1989), we let the conditional expectation depend only on Φ1 i.e., f(Φ(X)) = f(Φ1(X)) and let the conditional variance depend only on Φ2 i.e., g(Φ(X)) = g(Φ2(X)). This method is useful in scenarios where the conditional Gaussian assumption in the Section 5.1 does not hold. In a residual-based neural network, the conditional mean-prediction network is trained by minimizing: LS1(Φ1, θf) i=1 (yi f(Φ1(xi); θf))2. The feature representation Φ1 is parameterized by θΦ1 and the mean-prediction network is supposed to learn the parameters θΦ1 and θf. Then, the conditional variance-prediction network is trained by fitting the residuals obtained from the mean-prediction network, i.e., ri (yi f(Φ1(xi); θf))2 by minimizing: LS2(Φ2, θg) i=1 (ri g(Φ2(xi); θg))2. The feature representation Φ2 is parameterized by θΦ2 and the variance-prediction network is supposed to learn the parameters θΦ2 and θg. To impose calibration under mean, we need to convert the following hard constraint E[Y |Φ1(X), D] = E[Y |Φ1(X)] (6) Selective Regression Under Fairness Criteria into a soft constraint. 
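For intuition, constraint (6) says that once the features Φ1(X) are known, the group label D should not change the best mean prediction. A crude empirical probe of this (our own sketch, not part of the paper's method; it uses linear heads as a stand-in for the conditional means) is to fit a pooled and a per-group regression head on top of a frozen Φ1 and compare their predictions:

```python
import numpy as np
from sklearn.linear_model import Ridge

def mean_calibration_gap(phi1, y, d, alpha=1.0):
    """Rough check of constraint (6): compare a pooled head approximating
    E[Y | Phi1(X)] with per-group heads approximating E[Y | Phi1(X), D = d].
    phi1: (n, k) array of frozen features Phi1(x_i); y: targets; d: group labels."""
    pooled = Ridge(alpha=alpha).fit(phi1, y).predict(phi1)
    gaps = {}
    for grp in np.unique(d):
        m = d == grp
        per_group = Ridge(alpha=alpha).fit(phi1[m], y[m]).predict(phi1[m])
        gaps[grp] = float(np.mean((per_group - pooled[m]) ** 2))
    return gaps  # near-zero gaps for every group suggest (6) approximately holds
```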
We convert (6) into a soft constraint by using the following contrastive loss:

$$\mathbb{E}_{D}\,\mathbb{E}_{\Phi(X),Y}\big[(Y - \mathbb{E}[Y \mid \Phi_1(X), D])^2\big] \;-\; \mathbb{E}_{\Phi(X),Y,D}\big[(Y - \mathbb{E}[Y \mid \Phi_1(X), D])^2\big], \quad (7)$$

which is inspired by (3) and obtained by replacing the negative log-likelihood $-\log P(Y \mid \Phi(X), D)$ in (3) by the MSE achieved using the representation Φ1(X) and the sensitive attribute D. We emphasize that (7) is zero when (6) holds, and therefore (7) is a relaxation of (6). To compute (7), we need to learn the unknown conditional expectation E[Y|Φ1(X), D]. Similar to Section 5.1, we approximate it by $f(\Phi_1(x); w_f^{(d)})$, i.e., we train a subgroup-specific mean-prediction model with parameters $w_f^{(d)}$ for each d ∈ D, instead of using D as a separate input. Similar to Section 5.1, combining everything and replacing all the expectations in (7) with empirical averages, the regularizer for the mean-prediction network is given by

$$\mathcal{L}_{R1}(\Phi_1) = \sum_{i=1}^{n} \Big[\big(y_i - f(\Phi_1(x_i); w_f^{(\tilde d_i)})\big)^2 - \big(y_i - f(\Phi_1(x_i); w_f^{(d_i)})\big)^2\Big],$$

where the $\tilde d_i$ are drawn i.i.d. from $P_D$, i.e., the marginal distribution of D (approximated by $\hat P_D$), and, for d ∈ D,

$$w_f^{(d)} = \arg\min_{w} \sum_{i:\, d_i = d} \big(y_i - f(\Phi_1(x_i); w)\big)^2.$$

In summary, the overall objective for mean prediction is

$$\min_{\theta_f, \Phi_1}\ \mathcal{L}_{S1}(\Phi_1, \theta_f) + \lambda_1 \mathcal{L}_{R1}(\Phi_1). \quad (8)$$

Once the mean-prediction network is trained, we obtain the residuals and train the variance-prediction network g using a similar regularizer:

$$\min_{\theta_g, \Phi_2}\ \mathcal{L}_{S2}(\Phi_2, \theta_g) + \lambda_2 \mathcal{L}_{R2}(\Phi_2), \quad (9)$$

where $\mathcal{L}_{S2}$ and $\mathcal{L}_{R2}$ are defined in a similar manner as $\mathcal{L}_{S1}$ and $\mathcal{L}_{R1}$. More details and the pseudo-code (i.e., Algorithm 2) are provided in Appendix B.

## 6. Experimental Results

Datasets. We test our algorithms on the Insurance and Crime datasets, and provide an application of our method to causal inference via the IHDP dataset. These datasets (summarized in Table 1) are selected due to their potential fairness concerns, e.g., (a) the presence of features often associated with possible discrimination, such as race and sex, and (b) the potential sensitivity of the predictions being made, such as medical expenses, violent crimes, and cognitive test scores.
After preprocessing (see Appendix C), the control group has 608 samples (296 with D = 1 and 312 with D = 0) and the treatment group has 139 samples (67 with D = 1 and 72 with D = 0). The dataset contains 25 features. Choice of λ. We observe our algorithms to be agnostic to the choice of λ as long as it is in a reasonable range, i.e., λ [0.5, 3]. To be consistent, we set λ = 1 throughout. Baselines. We compare against the following baselines: Baseline 1: Heteroskedastic neural network without any regularizer i.e., Algorithm 1 with λ = 0. Baseline 2: Residual-based neural network without any regularizer i.e., Algorithm 2 with λ = 0. Experimental Details. In all of our experiments, we use two-layer neural networks, and train our model only once on a fixed training set. We evaluate and report the empirical findings on a held-out test set with a train-test split ratio 4We provide results for the scenario where race can take more than two values in in Appendix C.8. Selective Regression Under Fairness Criteria 0.2 0.4 0.6 0.8 1.0 coverage Majority Minority (a) Performance of Baseline 2 for the Insurance dataset. 0.2 0.4 0.6 0.8 1.0 coverage Majority Minority (b) Performance of Algorithm 2 for the Insurance dataset. 0.2 0.4 0.6 0.8 1.0 coverage Majority Minority (c) Performance of Baseline 1 for the Crime dataset. 0.2 0.4 0.6 0.8 1.0 coverage Majority Minority (d) Performance of Algorithm 1 for the Crime dataset. 0.2 0.4 0.6 0.8 1.0 coverage Majority Minority (e) Performance of Baseline 1 for the IHDP (treatment) dataset. 0.2 0.4 0.6 0.8 1.0 coverage Majority Minority (f) Performance of Algorithm 1 for the IHDP (treatment) dataset. Figure 2: Subgroup MSE vs. coverage plots. Compared to baselines (top), our algorithms (bottom) (a) show a consistent trend of decreasing MSE with decrease in coverage for both subgroups, (b) achieve better minority MSE for fixed coverage, i.e., a smaller AUC for minority subgroup (red), (c) achieve comparable majority MSE for fixed coverage, i.e., a comparable AUC for majority subgroup (green), (d) reduce gap between the subgroup MSE curves, i.e., a smaller AUADC. of 0.8/0.2. More experimental details can be found in Appendix C. Comparison in terms of selective regression. To compare different algorithms in terms of how well they perform selective regression (i.e., without fairness), we look at area under MSE vs. coverage curve (AUC), which encapsulates performance across different coverage (Franc & Prusa, 2019; Lee et al., 2021). We provide the results in Table 2 (smaller AUC indicates better performance) and observe that our algorithms are competitive (if not better) than baselines. We provide MSE vs. coverage curves in Appendix C. Comparison in terms of fairness. To compare different algorithms in terms of fair selective regression, we look at subgroup MSE vs coverage curves. for the Insurance dataset, we show these curves for Baseline 2 in Figure 2a and Algorithm 2 in Figure 2b. for the Crime dataset, we show these curves for Baseline 1 in Figure 2c and Algorithm 1 in Figure 2d. See Appendix C for remaining set of curves. For the Insurance dataset, we see that subgroup MSE for Baseline 2 increases with decrease in coverage for both majority and minority subgroups (Figure 2a). In contrast, the subgroup MSE for Algorithm 2 tends to decrease with a decrease in coverage for both subgroups (Figure 2b). 
For the Crime dataset, we see that the subgroup MSE for Baseline 1 as well as Algorithm 1 tends to decrease with a decrease in coverage for both subgroups (Figure 2c and Figure 2d). However, for a particular coverage, Algorithm 1 achieves a better MSE for the minority subgroup, a comparable MSE for the majority subgroup, and reduces the gap between the subgroup curves compared to Baseline 1.

As hinted above, in addition to ensuring monotonic selective risk, one may wish to (a) achieve a better performance focusing solely on the majority subgroup, (b) achieve a better performance focusing solely on the minority subgroup, and (c) reduce the gap between the minority subgroup MSE and the majority subgroup MSE across all thresholds. These aspects can be quantitatively captured by looking at (a) the area under the majority MSE vs. coverage curve, i.e., AUC (D=0), (b) the area under the minority MSE vs. coverage curve, i.e., AUC (D=1), and (c) the area under the absolute difference of the subgroup MSE vs. coverage curves (AUADC) (Franc & Prusa, 2019; Lee et al., 2021), respectively. We provide these results in Table 2 and observe that our algorithms outperform the baselines across datasets in terms of AUC (D=1) and AUADC while being comparable in terms of AUC (D=0).

Table 2: AUC, AUC (D=0), AUC (D=1), and AUADC (averaged across 5 runs) for all algorithms and all datasets. Smaller values are better. B1, B2, A1, and A2 refer to Baseline 1, Baseline 2, Algorithm 1, and Algorithm 2. See Appendix C for standard deviations.

| Dataset | Algorithm | AUC | AUC (D=0) | AUC (D=1) | AUADC |
|---|---|---|---|---|---|
| Insurance | B1 | 0.0371 | 0.0342 | 0.0442 | 0.0069 |
| | A1 | 0.0195 | 0.0207 | 0.0167 | 0.0052 |
| | B2 | 0.0142 | 0.0129 | 0.0175 | 0.0079 |
| | A2 | 0.0099 | 0.0087 | 0.0120 | 0.0051 |
| Crime | B1 | 0.0075 | 0.0040 | 0.0345 | 0.0309 |
| | A1 | 0.0079 | 0.0045 | 0.0296 | 0.0298 |
| | B2 | 0.0101 | 0.0060 | 0.0442 | 0.0272 |
| | A2 | 0.0117 | 0.0082 | 0.0375 | 0.0257 |
| IHDP (Treatment) | B1 | 0.3053 | 0.2000 | 0.3509 | 0.2266 |
| | A1 | 0.2435 | 0.2024 | 0.2849 | 0.2034 |
| IHDP (Control) | B1 | 0.2041 | 0.2144 | 0.1983 | 0.0495 |
| | A1 | 0.2017 | 0.2169 | 0.1877 | 0.0398 |

Application to Causal Inference. We provide an application of our work to fair treatment-effect estimation. Treatment effect estimation has been studied under the paradigm of prediction with reject option (Jesson et al., 2020), and we explore an additional dimension to this, i.e., the behavior across sub-populations. We follow the standard approach (Künzel et al., 2019) of viewing the dataset as two distinct datasets, one corresponding to the treatment group and the other corresponding to the control group, and apply our framework to these distinct datasets. We focus only on Algorithm 1 and compare with Baseline 1, since the simulated target for the IHDP dataset perfectly fits the conditional Gaussian assumption (Hill, 2011). For the treatment group, we provide the subgroup MSE vs. coverage behavior for Baseline 1 in Figure 2e and Algorithm 1 in Figure 2f. We provide corresponding curves for the control group in Appendix C. Compared to Baseline 1 (Figure 2e), Algorithm 1 (Figure 2f) (a) shows a general trend of decreasing MSE with decrease in coverage for both subgroups and (b) achieves a better minority MSE for a fixed coverage. Table 2 suggests that our algorithm outperforms the baseline in terms of AUC (D=1) and AUADC while being comparable in AUC (D=0).

## 7. Concluding Remarks

We proposed a new fairness criterion, monotonic selective risk, for selective regression, which requires the performance of each subgroup to improve with a decrease in coverage.
We provided two conditions for the feature representation (sufficiency and calibrated for mean and variance) under which the proposed fairness criterion is met. We presented algorithms to enforce these conditions and demonstrated mitigation of disparity in the performances across groups for three real-world datasets. Monotonic selective risk is one criteria for fairness in selective regression. Developing and investigating other such criteria and understanding their relationship with monotonic selective risk is an important question for future research. Acknowledgements We thank the anonymous reviewers for their comments and suggestions. This work was supported, in part, by the MIT-IBM Watson AI Lab, and its member companies Boston Scientific, Samsung, and Wells Fargo; and NSF under Grant No. CCF-1717610. Selective Regression Under Fairness Criteria Agarwal, A., Beygelzimer, A., Dudík, M., Langford, J., and Wallach, H. A reductions approach to fair classification. In International Conference on Machine Learning, pp. 60 69. PMLR, 2018. Agarwal, A., Dudík, M., and Wu, Z. S. Fair regression: Quantitative definitions and reduction-based algorithms. In International Conference on Machine Learning, pp. 120 129. PMLR, 2019. Arjovsky, M., Bottou, L., Gulrajani, I., and Lopez-Paz, D. Invariant risk minimization. ar Xiv preprint ar Xiv:1907.02893, 2019. Bartlett, P. L. and Wegkamp, M. H. Classification with a reject option using a hinge loss. Journal of Machine Learning Research, 9(8), 2008. Bellamy, R. K., Dey, K., Hind, M., Hoffman, S. C., Houde, S., Kannan, K., Lohia, P., Martino, J., Mehta, S., Mojsilovic, A., et al. AI Fairness 360: An extensible toolkit for detecting, understanding, and mitigating unwanted algorithmic bias. ar Xiv preprint ar Xiv:1810.01943, 2018. Bengio, Y., Courville, A., and Vincent, P. Representation learning: A review and new perspectives. IEEE transactions on pattern analysis and machine intelligence, 35(8):1798 1828, 2013. Berk, R., Heidari, H., Jabbari, S., Joseph, M., Kearns, M., Morgenstern, J., Neel, S., and Roth, A. A convex framework for fair regression. ar Xiv preprint ar Xiv:1706.02409, 2017. Calders, T., Karim, A., Kamiran, F., Ali, W., and Zhang, X. Controlling attribute effect in linear regression. In 2013 IEEE 13th international conference on data mining, pp. 71 80. IEEE, 2013. Calmon, F., Wei, D., Vinzamuri, B., Ramamurthy, K. N., and Varshney, K. R. Optimized pre-processing for discrimination prevention. In Advances in Neural Information Processing Systems, pp. 3992 4001, 2017. Castelnovo, A., Crupi, R., Greco, G., Regoli, D., Penco, I. G., and Cosentini, A. C. A clarification of the nuances in the fairness metrics landscape. Scientific Reports, 12 (1):1 21, 2022. Chi, J., Tian, Y., Gordon, G. J., and Zhao, H. Understanding and mitigating accuracy disparity in regression. ar Xiv preprint ar Xiv:2102.12013, 2021. Chow, C. On optimum recognition error and reject tradeoff. IEEE Transactions on information theory, 16(1):41 46, 1970. Chow, C.-K. An optimum character recognition system using decision functions. IRE Transactions on Electronic Computers, (4):247 254, 1957. Chzhen, E. and Schreuder, N. An example of prediction which complies with demographic parity and equalizes group-wise risks in the context of regression. ar Xiv preprint ar Xiv:2011.07158, 2020. Chzhen, E., Denis, C., Hebiri, M., Oneto, L., and Pontil, M. Fair regression via plug-in estimator and recalibration with statistical guarantees. 
In Neur IPS 2020-34th Conference on Neural Information Processing Systems, 2020. Cleary, T. A. Test bias: Validity of the scholastic aptitude test for negro and white students in integrated colleges. ETS Research Bulletin Series, 1966(2):i 23, 1966. Corbett-Davies, S., Pierson, E., Feller, A., Goel, S., and Huq, A. Algorithmic decision making and the cost of fairness. In Proceedings of the 23rd acm sigkdd international conference on knowledge discovery and data mining, pp. 797 806, 2017. Creager, E., Jacobsen, J.-H., and Zemel, R. Environment inference for invariant learning. In International Conference on Machine Learning, pp. 2189 2200. PMLR, 2021. Dorie, V. Npci: Non-parametrics for causal inference. 2016. URL https://github.com/vdorie/npci. Fitzsimons, J., Ali, A., Osborne, M., and Roberts, S. Equality constrained decision trees: For the algorithmic enforcement of group fairness. ar Xiv preprint ar Xiv:1810.05041, 2018. Franc, V. and Prusa, D. On discriminative learning of prediction uncertainty. In International Conference on Machine Learning, pp. 1963 1971. PMLR, 2019. Gal, Y. Uncertainty in deep learning. 2016. Geifman, Y. and El-Yaniv, R. Selective classification for deep neural networks. In Advances in neural information processing systems, pp. 4878 4887, 2017. Geifman, Y. and El-Yaniv, R. Selectivenet: A deep neural network with an integrated reject option. In International Conference on Machine Learning, pp. 2151 2159. PMLR, 2019. Gölz, P., Kahng, A., and Procaccia, A. D. Paradoxes in fair machine learning. In Advances in Neural Information Processing Systems, pp. 8340 8350, 2019. Selective Regression Under Fairness Criteria Hall, P. and Carroll, R. J. Variance function estimation in regression: The effect of estimating the mean. Journal of the Royal Statistical Society: Series B (Methodological), 51(1):3 14, 1989. Hardt, M., Price, E., and Srebro, N. Equality of opportunity in supervised learning. In Advances in Neural Information Processing Systems 29, pp. 3315 3323, Barcelona, Spain, December 2016. Hellman, M. E. The nearest neighbor classification rule with a reject option. IEEE Transactions on Systems Science and Cybernetics, 6(3):179 185, 1970. Herbei, R. and Wegkamp, M. H. Classification with reject option. The Canadian Journal of Statistics/La Revue Canadienne de Statistique, pp. 709 721, 2006. Hill, J. L. Bayesian nonparametric modeling for causal inference. Journal of Computational and Graphical Statistics, 20(1):217 240, 2011. Jesson, A., Mindermann, S., Shalit, U., and Gal, Y. Identifying causal-effect inference failure with uncertainty-aware models. Advances in Neural Information Processing Systems, 33, 2020. Jiang, W., Zhao, Y., and Wang, Z. Risk-controlled selective prediction for regression deep neural network models. In 2020 International Joint Conference on Neural Networks (IJCNN), pp. 1 8. IEEE, 2020. Jones, E., Sagawa, S., Koh, P. W., Kumar, A., and Liang, P. Selective classification can magnify disparities across groups. ar Xiv preprint ar Xiv:2010.14134, 2020. Kamishima, T., Akaho, S., and Sakuma, J. Fairness-aware learning through regularization approach. In 2011 IEEE 11th International Conference on Data Mining Workshops, pp. 643 650. IEEE, 2011. Komiyama, J. and Shimao, H. Two-stage algorithm for fairness-aware machine learning. ar Xiv preprint ar Xiv:1710.04924, 2017. Komiyama, J., Takeda, A., Honda, J., and Shimao, H. Nonconvex optimization for regression with fairness constraints. In International conference on machine learning, pp. 2737 2746. 
PMLR, 2018. Krueger, D., Caballero, E., Jacobsen, J.-H., Zhang, A., Binas, J., Zhang, D., Le Priol, R., and Courville, A. Out-of-distribution generalization via risk extrapolation (rex). In International Conference on Machine Learning, pp. 5815 5826. PMLR, 2021. Künzel, S. R., Sekhon, J. S., Bickel, P. J., and Yu, B. Metalearners for estimating heterogeneous treatment effects using machine learning. Proceedings of the national academy of sciences, 116(10):4156 4165, 2019. Lantz, B. Machine learning with R: expert techniques for predictive modeling. Packt publishing ltd, 2019. Lee, J., Bu, Y., Sattigeri, P., Panda, R., Wornell, G., Karlinsky, L., and Feris, R. A maximal correlation approach to imposing fairness in machine learning. ar Xiv preprint ar Xiv:2012.15259, 2020. Lee, J. K., Bu, Y., Rajan, D., Sattigeri, P., Panda, R., Das, S., and Wornell, G. W. Fair selective classification via sufficiency. In International Conference on Machine Learning, pp. 6076 6086. PMLR, 2021. Lei, J. Classification with confidence. Biometrika, 101(4): 755 769, 2014. Liu, L. T., Simchowitz, M., and Hardt, M. The implicit fairness criterion of unconstrained learning. In International Conference on Machine Learning, pp. 4051 4060. PMLR, 2019. Louizos, C., Swersky, K., Li, Y., Welling, M., and Zemel, R. The variational fair autoencoder. ar Xiv preprint ar Xiv:1511.00830, 2015. Mary, J., Calauzenes, C., and El Karoui, N. Fairness-aware learning for continuous attributes and treatments. In International Conference on Machine Learning, pp. 4382 4391, 2019. Mehrabi, N., Morstatter, F., Saxena, N., Lerman, K., and Galstyan, A. A survey on bias and fairness in machine learning. ACM Computing Surveys (CSUR), 54(6):1 35, 2021. Nabi, R., Malinsky, D., and Shpitser, I. Learning optimal fair policies. In International Conference on Machine Learning, pp. 4674 4682. PMLR, 2019. Nadeem, M. S. A., Zucker, J.-D., and Hanczar, B. Accuracy-rejection curves (arcs) for comparing classification methods with a reject option. In Machine Learning in Systems Biology, pp. 65 81. PMLR, 2009. Oneto, L., Donini, M., and Pontil, M. General fair empirical risk minimization. In 2020 International Joint Conference on Neural Networks (IJCNN), pp. 1 8. IEEE, 2020. Pérez-Suay, A., Laparra, V., Mateo-García, G., Muñoz-Marí, J., Gómez-Chova, L., and Camps-Valls, G. Fair kernel learning. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pp. 339 355. Springer, 2017. Selective Regression Under Fairness Criteria Pleiss, G., Raghavan, M., Wu, F., Kleinberg, J., and Weinberger, K. Q. On fairness and calibration. In Advances in Neural Information Processing Systems, pp. 5680 5689, 2017. Raff, E., Sylvester, J., and Mills, S. Fair forests: Regularized tree induction to minimize model bias. In Proceedings of the 2018 AAAI/ACM Conference on AI, Ethics, and Society, pp. 243 250, 2018. Redmond, M. and Baveja, A. A data-driven software tool for enabling cooperative information sharing among police departments. European Journal of Operational Research, 141(3):660 678, 2002. Schreuder, N. and Chzhen, E. Classification with abstention but without disparities. ar Xiv preprint ar Xiv:2102.12258, 2021. Selbst, A. D., Boyd, D., Friedler, S. A., Venkatasubramanian, S., and Vertesi, J. Fairness and abstraction in sociotechnical systems. In Proceedings of the Conference on Fairness, Accountability, and Transparency, pp. 59 68, 2019. Verma, S. and Rubin, J. Fairness definitions explained. 
In 2018 ieee/acm international workshop on software fairness (fairware), pp. 1 7. IEEE, 2018. Wiener, Y. and El-Yaniv, R. Pointwise tracking the optimal regression function. In Advances in Neural Information Processing Systems, pp. 2042 2050, 2012. Williamson, R. and Menon, A. Fairness risk measures. In International Conference on Machine Learning, pp. 6786 6797. PMLR, 2019. Zafar, M. B., Valera, I., Rogriguez, M. G., and Gummadi, K. P. Fairness constraints: Mechanisms for fair classification. In Artificial Intelligence and Statistics, pp. 962 970. PMLR, 2017. Zaoui, A., Denis, C., and Hebiri, M. Regression with reject option and application to knn. Advances in Neural Information Processing Systems, 33:20073 20082, 2020. Zemel, R., Wu, Y., Swersky, K., Pitassi, T., and Dwork, C. Learning fair representations. In International Conference on Machine Learning, pp. 325 333, 2013. Selective Regression Under Fairness Criteria Organization. The Appendix is organized as follows. In Appendix A, we provide the proof of Theorem 4.1. In Appendix B, we provide more details for imposing calibration for mean and variance via the residual-based neural network as well as provide the pseudo-code, i.e., Algorithm 2. In Appendix C, we provide more experimental details and results. A. Proof of Theorem 4.1 We restate the Theorem below and then provide a proof. Theorem 4.1. Suppose the representation Φ(X) is sufficient i.e., Y D|Φ(X). Let f(Φ(X)) = E[Y |Φ(X)] and g(Φ(X)) = Var[Y |Φ(X)]. Then, for any d D and any τ < τ , we have MSE(f, g, τ, d) < MSE(f, g, τ , d). Proof. First, let us simplify the expression for MSE(f, g, τ, d). We have MSE(f, g, τ, d) = E[(Y f(Φ(X)))2 1(g(Φ(X)) τ)|D = d] E[1(g(Φ(X)) τ)|D = d] = E[(Y f(Φ(X)))2 1(g(Φ(X)) τ)|D = d] P(g(Φ(X)) τ|D = d) (10) Let cτ,d P(g(Φ(X)) τ|D = d). Using (10), we have MSE(f, g, τ, d) MSE(f, g, τ , d) = 1 cτ,d E[(Y f(Φ(X)))2 1(g(Φ(X)) τ)|D = d] 1 cτ ,d E[(Y f(Φ(X)))2 1(g(Φ(X)) τ )|D = d] cτ,d 1 cτ ,d E[(Y f(Φ(X)))2 1(g(Φ(X)) τ)|D = d] (11) 1 cτ ,d E[(Y f(Φ(X)))2 1(τ < g(Φ(X)) τ )|D = d] (12) Now, let us upper bound E[(Y f(Φ(X)))2 1(g(Φ(X)) τ)|D = d]. We have E[(Y f(Φ(X)))2 1(g(Φ(X)) τ)|D = d] (a) = EΦ(X)|D[1(g(Φ(X)) τ) EY |Φ(X),D[(Y f(Φ(X)))2|Φ(X), D = d]|D = d] (b) = EΦ(X)|D[1(g(Φ(X)) τ) EY |Φ(X),D[(Y E[Y |Φ(X)])2|Φ(X), D = d]|D = d] (c) = EΦ(X)|D[1(g(Φ(X)) τ) EY |Φ(X),D[(Y E[Y |Φ(X), D = d])2|Φ(X), D = d]|D = d] (d) = EΦ(X)|D[1(g(Φ(X)) τ) Var[Y |Φ(X), D = d]|D = d] (e) = EΦ(X)|D[1(g(Φ(X)) τ) Var[Y |Φ(X)]|D = d] (f) = EΦ(X)|D[1(g(Φ(X)) τ) g(Φ(X))|D = d] τEΦ(X)|D[1(g(Φ(X)) τ)|D = d] = τP(g(Φ(X)) τ|D = d) = τcτ,d (13) where (a) follows from the definition of conditional expectation and because 1(g(Φ(X)) τ) is a constant conditioned on Φ(X), (b) follows because f(Φ(X)) = E[Y |Φ(X)], (c) follows because Y D|Φ(X) = E[Y |Φ(X)] = E[Y |Φ(X), D = d], (d) follows from the definition of conditonal variance, (e) follows because Y D|Φ(X) = Var[Y |Φ(X)] = Var[Y |Φ(X), D = d], and (f) follows because g(Φ(X)) = Var[Y |Φ(X)]. Now, let us lower bound E[(Y f(Φ(X)))2 1(τ < g(Φ(X)) τ )|D = d]. Similar to above, we have E[(Y f(Φ(X)))2 1(τ < g(Φ(X)) τ )|D = d] = EΦ(X)|D[1(τ < g(Φ(X)) τ ) g(Φ(X))|D = d] > τEΦ(X)|D[1(τ < g(Φ(X)) τ )|D = d] = τ(P(g(Φ(X)) τ |D = d) P(g(Φ(X)) τ|D = d)) = τ(cτ ,d cτ,d) (14) Selective Regression Under Fairness Criteria Plugging in (13) and (14) in (11) and (12), we have MSE(f, g, τ, d) MSE(f, g, τ , d) < 1 cτ,d 1 cτ ,d τcτ,d 1 cτ ,d τ (cτ ,d cτ,d) = 0 B. 
More Details to Impose Calibration for Mean and Variance In this section, we provide more details for imposing calibration for mean and variance via the residual-based neural network as well as provide the pseudo-code. As described in Section 5.2, in a residual-based neural network, once the mean-prediction network f is trained, the residuals i.e., ri = (yi f(Φ1(xi); θf))2 are used to train the variance-prediction network g by minimizing: LS2(Φ2, θg) i=1 (ri g(Φ2(xi); θg))2. The feature representation Φ2 is parameterized by θΦ2 and the variance-prediction network is supposed to learn the parameters θΦ2 and θg. To impose calibration under variance, we construct a contrastive loss similar to (7). Then, the regularizer can be written as ri g(Φ2(xi); w( e di) g ) 2 ri g(Φ2(xi); w(di) g ) 2 , where edi are drawn i.i.d. from the marginal distribution PD, and for d D, w(d) g = arg min w ri g(Φ2(xi); w) 2. Summarizing, the overall objective for variance-prediction is min θg,Φ2 LS2(Φ2, θg) + λ2LR2(Φ2). We provide a pseudo-code in Algorithm 2 where yi f(Φ1(xi); wf) 2, ri g(Φ2(xi); wg) 2. C. Additional experimental results In this section, we provide more experimental details and results. We start by providing those experimental details that remain the same across the datasets. Next, we provide details that are specific to each dataset, i.e., Insurance, Crime, and IHDP. Finally, we provide more experimental results and some discussion. C.1. Experimental Details In all of our experiments, we use two-layer neural networks. For all hidden layers, we use the selu activation function. For the output layer, we use a non-linear activation function only for the variance-prediction network associated with Algorithm 2 to ensure that the predictions of variance are non-negative. In particular, we use the soft-plus activation function for the variance-prediction network associated with Algorithm 2. In our implementation of Algorithm 1, we predict log-variance instead of variance and therefore stick to linear activation function. We train all our neural networks with the Adam optimizer, a batch size of 128, and over 40 epochs. Further, we use a step learning rate scheduler with an initial learning rate of 5 10 3 and decay it by a factor of half after every two epochs. Finally, as described in Section 6, we set the regularizer λ = 1 for all our experiments after observing that the performance of our algorithms is agnostic to the choice of λ as long as it is in a reasonable range, i.e., λ [0.5, 3]. Selective Regression Under Fairness Criteria Algorithm 2 Residual-based neural network with calibration-based regularizer Input: training samples {(xi, yi, di)}n i=1, regularizers λ1 and λ2 Draw: {ed1, . . . , edn} drawn i.i.d. from ˆPD Initialize: θf, θg, θΦ1, θΦ2, w(d) f , and w(d) g with pre-trained models Initialize: nd = number of samples in group d d D for each training iteration do for each batch do for d = 1, . . . , |D| do # update subgroup-specific mean predictor w(d) f w(d) f 1 nd ηf wf Ld1(wf) end for end for for each batch do # update feature extractor for mean predictor θΦ1 θΦ1 1 nη θΦ1(LS1(Φ1, θf) + λ1LR1(Φ1)) # update mean predictor θf θf 1 nη θf LS1(Φ1, θf) end for end for Compute the residuals: ri = (yi f(Φ1(xi); θf))2 for each training iteration do for each batch do for d = 1, . . . 
, |D| do # update subgroup-specific variance predictor w(d) g w(d) g 1 nd ηg wg Ld2(wg) end for end for for each batch do # update feature extractor for variance predictor θΦ2 θΦ2 1 nη θΦ2(LS2(Φ2, θg) + λ2LR2(Φ2)) # update variance predictor θg θg 1 nη θg LS2(Φ2, θg) end for end for C.2. Insurance The Insurance dataset5 is a semi-synthetic dataset that was created using demographic statistics from the U.S. Census Bureau and approximately reflects real-world conditions. A few features in this dataset include the BMI, number of children, age, etc. We remove the sensitive attribute from the set of input features to preprocess the data. To reflect the real-world scenarios where the accuracy disparity is significant due to the small and imbalanced dataset, similar to (Chi et al., 2021), we randomly drop 50% of examples with D =1. Further, we normalize the output annual medical expenses and the features: age and BMI. We use 3 neurons in the hidden layer for this dataset. C.3. Communities and Crime The Communities and Crime dataset6 contains socio-economic information of communities in the U.S. and their crime rates. A few features in this dataset include population for community, mean people per household, percentage of the population that is white, per capita income, number of police cars, etc. We remove the non-predictive attributes and the sensitive attribute from the set of input features during preprocessing. All attributes in the dataset have been curated and normalized 5https://github.com/stedy/Machine-Learning-with-R-datasets/blob/master/insurance.csv 6https://archive.ics.uci.edu/ml/datasets/communities+and+crime Selective Regression Under Fairness Criteria to [0, 1], so we do not perform any additional normalization. Finally, we replace the missing values with the mean values of the corresponding attributes similar to (Chi et al., 2021). We use 50 neurons in the hidden layer for this dataset. The IHDP dataset7 is generated based on a randomized control trial targeting low-birth-weight, premature infants. The 25 features measure various aspects about the children and their mothers, e.g., child s birth weight, child s gender, mother s age, mother s education, an indicator for maternal alcohol consumption during pregnancy, etc. We remove the sensitive attribute from the set of input features to preprocess the data. Further, we normalize the output cognitive test score and the features: child s birth weight, child s head circumference at birth, number of weeks pre-term that the child was born, birth order, neo-natal health index, and mom s age when she gave birth to the child. Following the norm in the causal inference community, a biased subset of the treated group is removed to create an imbalance leaving 139 samples in the treatment group and 608 samples in the control group. The target is typically simulated using the setting A of the NPCI package (Dorie, 2016). We use 20 neurons in the hidden layer for this dataset. C.5. Standard deviations In Table 3 below, we provide the standard deviations associated with AUC, AUC (D = 0), AUC (D = 1), and AUADC whose means where provided in Table 2 in Section 6. Table 3: Mean standard deviation (averaged across 5 runs) for AUC, AUC (D = 0), AUC (D = 1), and AUADC for all the algorithms and all the datasets. 
C.5. Standard deviations

In Table 3 below, we provide the standard deviations associated with AUC, AUC (D = 0), AUC (D = 1), and AUADC, whose means were provided in Table 2 in Section 6.

Table 3: Mean ± standard deviation (across 5 runs) of AUC, AUC (D = 0), AUC (D = 1), and AUADC for all the algorithms and all the datasets.

Dataset            Algorithm              AUC              AUC (D = 0)      AUC (D = 1)      AUADC
Insurance          B1, i.e., Baseline 1   0.0371 ± 0.0255  0.0342 ± 0.0197  0.0442 ± 0.0218  0.0069 ± 0.0050
                   A1, i.e., Algorithm 1  0.0195 ± 0.0059  0.0207 ± 0.0050  0.0167 ± 0.0075  0.0052 ± 0.0031
                   B2, i.e., Baseline 2   0.0142 ± 0.0052  0.0129 ± 0.0042  0.0175 ± 0.0026  0.0079 ± 0.0041
                   A2, i.e., Algorithm 2  0.0099 ± 0.0006  0.0087 ± 0.0004  0.0120 ± 0.0011  0.0051 ± 0.0018
Crime              B1, i.e., Baseline 1   0.0075 ± 0.0002  0.0040 ± 0.0011  0.0345 ± 0.0037  0.0309 ± 0.0008
                   A1, i.e., Algorithm 1  0.0079 ± 0.0004  0.0045 ± 0.0013  0.0296 ± 0.0054  0.0298 ± 0.0011
                   B2, i.e., Baseline 2   0.0101 ± 0.0019  0.0060 ± 0.0017  0.0442 ± 0.0022  0.0272 ± 0.0013
                   A2, i.e., Algorithm 2  0.0117 ± 0.0017  0.0082 ± 0.0012  0.0375 ± 0.0019  0.0257 ± 0.0028
IHDP (Treatment)   B1, i.e., Baseline 1   0.3053 ± 0.0823  0.2000 ± 0.0899  0.3509 ± 0.0811  0.2266 ± 0.0919
                   A1, i.e., Algorithm 1  0.2435 ± 0.0823  0.2024 ± 0.0935  0.2849 ± 0.0767  0.2034 ± 0.0925
IHDP (Control)     B1, i.e., Baseline 1   0.2041 ± 0.0138  0.2144 ± 0.0101  0.1983 ± 0.0125  0.0495 ± 0.0053
                   A1, i.e., Algorithm 1  0.2017 ± 0.0170  0.2169 ± 0.0133  0.1877 ± 0.0129  0.0398 ± 0.0073

C.6. Overall MSE vs. coverage curves

In Section 6, we compared the different algorithms in terms of how well they perform selective regression (i.e., with no consideration of fairness) by looking at the area under the MSE vs. coverage curve (AUC). Here, we provide the MSE vs. coverage curves for the Insurance dataset in Figure 3a, the Crime dataset in Figure 3b, the IHDP (control) dataset in Figure 3c, and the IHDP (treatment) dataset in Figure 3d. For the Insurance dataset, we see that Algorithm 1 and Algorithm 2 perform selective regression better than Baseline 1 and Baseline 2, respectively; this is also evident from the AUC values in Tables 2 and 3. For the Crime dataset, the MSE decreases with decreasing coverage, as expected, for all four algorithms; further, the performance of Baseline 1 and Baseline 2 is slightly better than that of Algorithm 1 and Algorithm 2, respectively, which is also evident from the AUC values in Tables 2 and 3. For the IHDP (control) and IHDP (treatment) datasets, we see that Algorithm 1 performs selective regression better than Baseline 1; this is again evident from the AUC values in Tables 2 and 3. A short sketch of how such MSE vs. coverage curves and the corresponding AUC values can be computed is given after Figure 4.

Figure 3: MSE vs. coverage for various datasets. Panels: (a) Insurance and (b) Crime, each comparing the baseline and our method for the heteroskedastic and residual-based architectures; (c) IHDP (control); (d) IHDP (treatment).

Figure 4: Subgroup MSE vs. coverage plots for various datasets comparing baselines (top) and our algorithms (bottom); each panel shows the majority and minority subgroups. Panels: (a) Baseline 1 on the Insurance dataset, (b) Algorithm 1 on the Insurance dataset, (c) Baseline 2 on the Crime dataset, (d) Algorithm 2 on the Crime dataset, (e) Baseline 1 on the IHDP (control) dataset, (f) Algorithm 1 on the IHDP (control) dataset.
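To make the evaluation in Sections C.6 and C.7 concrete, here is a minimal NumPy sketch of one way to compute MSE vs. coverage curves and their area (AUC) from held-out predictions and an uncertainty score (e.g., the predicted variance). The function names, and the convention that the acceptance threshold is shared across the whole population while the subgroup MSE is computed only over accepted samples of that subgroup, reflect our reading of the plots rather than code from the paper.

```python
import numpy as np

def mse_vs_coverage(y_true, y_pred, uncertainty, coverages):
    """Overall MSE over the fraction of samples with the lowest uncertainty."""
    order = np.argsort(uncertainty)                 # most confident first
    sq_err = (y_true[order] - y_pred[order]) ** 2
    return np.array([sq_err[: max(1, int(np.ceil(c * len(sq_err))))].mean()
                     for c in coverages])

def subgroup_mse_vs_coverage(y_true, y_pred, uncertainty, d, group, coverages):
    """Subgroup MSE among accepted samples, with a population-wide threshold."""
    order = np.argsort(uncertainty)
    curve = []
    for c in coverages:
        accepted = order[: max(1, int(np.ceil(c * len(order))))]
        in_group = accepted[d[accepted] == group]
        if len(in_group) == 0:
            curve.append(np.nan)                    # no accepted samples from this subgroup
        else:
            curve.append(((y_true[in_group] - y_pred[in_group]) ** 2).mean())
    return np.array(curve)

# Example usage (arrays assumed to come from a held-out test set):
# coverages = np.linspace(0.1, 1.0, 10)
# overall = mse_vs_coverage(y_true, y_pred, var_hat, coverages)
# auc = np.trapz(overall, coverages)                # area under the MSE vs. coverage curve
# minority = subgroup_mse_vs_coverage(y_true, y_pred, var_hat, d, 1, coverages)
```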
C.7. Group-specific MSE vs. coverage curves

In Section 6, we compared the different algorithms in terms of how well they perform fair selective regression by looking at the subgroup MSE vs. coverage curves in addition to AUC, AUC (D = 0), AUC (D = 1), and AUADC. More specifically, we looked at Baseline 2 and Algorithm 2 for the Insurance dataset, Baseline 1 and Algorithm 1 for the Crime dataset, and Baseline 1 and Algorithm 1 for the IHDP (treatment) dataset. Here, we show the subgroup MSE vs. coverage curves for Baseline 1 and Algorithm 1 for the Insurance dataset (Figures 4a and 4b), Baseline 2 and Algorithm 2 for the Crime dataset (Figures 4c and 4d), and Baseline 1 and Algorithm 1 for the IHDP (control) dataset (Figures 4e and 4f).

For the Insurance dataset, we see that the subgroup MSE of the minority subgroup increases with decreasing coverage for Baseline 1 (Figure 4a), as already described in Section 3.1. In contrast, the subgroup MSE for Algorithm 1 does not increase with decreasing coverage and stays relatively flat (Figure 4b). Further, at any given coverage, Algorithm 1 achieves a better MSE for the minority subgroup and a comparable MSE for the majority subgroup, and it narrows the gap between the subgroup curves relative to Baseline 1 (see the values of AUC (D = 0), AUC (D = 1), and AUADC in Tables 2 and 3).

For the Crime dataset, we see that the subgroup MSE for both Baseline 2 and Algorithm 2 tends to decrease with decreasing coverage for both subgroups (Figures 4c and 4d). However, at any given coverage, Algorithm 2 achieves a better MSE for the minority subgroup and a comparable MSE for the majority subgroup, and it narrows the gap between the subgroup curves relative to Baseline 2 (see the values of AUC (D = 0), AUC (D = 1), and AUADC in Tables 2 and 3).

For the IHDP (control) dataset, we see that the subgroup MSE for Baseline 1 increases with decreasing coverage (Figure 4e), whereas the subgroup MSE for Algorithm 1 decreases with decreasing coverage (Figure 4f). Additionally, Algorithm 1 achieves a comparable MSE for the majority subgroup and narrows the gap between the subgroup curves relative to Baseline 1 (see the values of AUC (D = 0), AUC (D = 1), and AUADC in Tables 2 and 3).

C.8. Group-specific MSE vs. coverage curves with three subgroups

In all of our experiments so far, we focused on the scenario where the sensitive attribute was binary. However, as we now demonstrate, our approach works equally well when the sensitive attribute can take more than two values. More concretely, we use the Crime dataset to obtain three subgroups, i.e., the sensitive attribute can take three values. This is possible because race, the sensitive attribute, is reported in the Crime dataset as the percentage of the population that is Black. We assign (a) D = 2 if this percentage is greater than or equal to 20, (b) D = 1 if it is less than 20 but greater than or equal to 1, and (c) D = 0 otherwise; a short sketch of this assignment is given after Figure 5. We show the performance of Baseline 1 and Algorithm 1 in Figures 5a and 5b, respectively. As expected, Algorithm 1 ensures monotonic selective risk, unlike Baseline 1 (see the D = 2 curve).

Figure 5: Subgroup MSE vs. coverage plots for the Crime dataset with three subgroups (D = 0, 1, 2). Panels: (a) Baseline 1, (b) Algorithm 1.
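For completeness, the following is a minimal NumPy sketch of the three-way subgroup assignment described above. The input is assumed to contain the raw percentage of the Black population in each community (i.e., on a 0–100 scale); whether the released dataset stores this attribute on a 0–100 or a normalized [0, 1] scale is an assumption to check before use.

```python
import numpy as np

def assign_three_subgroups(pct_black: np.ndarray) -> np.ndarray:
    """Map the percentage of the Black population in each community to the
    three sensitive subgroups used in Section C.8."""
    d = np.zeros(len(pct_black), dtype=int)        # D = 0: percentage below 1
    d[(pct_black >= 1) & (pct_black < 20)] = 1     # D = 1: in [1, 20)
    d[pct_black >= 20] = 2                         # D = 2: 20 or above
    return d

# Example: communities with 0.5%, 5%, and 35% map to subgroups 0, 1, and 2.
print(assign_three_subgroups(np.array([0.5, 5.0, 35.0])))   # -> [0 1 2]
```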