# Bias in Evaluation Processes: An Optimization-Based Model

L. Elisa Celis (Yale University), Amit Kumar (IIT Delhi), Anay Mehrotra (Yale University), Nisheeth K. Vishnoi (Yale University)

Biases with respect to socially-salient attributes of individuals have been well documented in evaluation processes used in settings such as admissions and hiring. We view such an evaluation process as a transformation of a distribution of the true utility of an individual for a task to an observed distribution, and model it as a solution to a loss minimization problem subject to an information constraint. Our model has two parameters that have been identified as factors leading to biases: the resource-information trade-off parameter in the information constraint and the risk-averseness parameter in the loss function. We characterize the distributions that arise from our model and study the effect of the parameters on the observed distribution. The outputs of our model enrich the class of distributions that can be used to capture variation across groups in observed evaluations. We empirically validate our model by fitting real-world datasets and use it to study the effect of interventions in a downstream selection task. These results contribute to an understanding of the emergence of bias in evaluation processes and provide tools to guide the deployment of interventions to mitigate biases.

## 1 Introduction

Evaluation processes arise in numerous high-stakes settings such as hiring, university admissions, and fund allocation decisions [20, 30, 90, 122]. Specific instances include recruiters estimating the hireability of candidates via interviews [121, 30], reviewers evaluating the competence of grant applicants from proposals [147, 18], and organizations assessing the scholastic abilities of students via standardized examinations [99, 19]. In these processes, an evaluator estimates an individual's value to an institution. The evaluator need not be a person: it can be a committee, an exam, or even a machine learning algorithm [51, 122, 145]. Moreover, outcomes of real-world evaluation processes have at least some uncertainty or randomness [30, 18, 79]. This randomness can arise both due to the features of the individual (e.g., their test scores or grades) that an evaluator takes as input [31, 76, 124] and due to the evaluation process itself [50, 30, 140].

Biases against individuals in certain disadvantaged groups have been well documented in evaluation processes [147, 74, 104, 110, 30]. For instance, in employment decisions and peer review, women receive systematically lower competence scores than men, even when qualifications are the same [147, 110]; in standardized tests, scores show higher variance for students of certain genders [21, 112]; and in risk assessment, a widely used type of evaluation, tools were twice as likely to misclassify Black defendants as being at high risk of violent recidivism than White defendants [6]. Here, neither does the distribution of individuals' true utilities depend on their socially-salient attributes, nor is the process trying to bias evaluations, yet biases consistently arise [147, 74, 104, 110, 30]. Such evaluations are increasingly used by ML systems to learn or make decisions about individuals, potentially exacerbating inequality [51, 122, 145].
This raises the question, studied here, of explaining the emergence of biases in evaluation processes, which is important for understanding how to mitigate them.

Related work. A wide body of work has studied reasons why such differences may arise and how to mitigate the effect of such biases [75, 57, 54, 32, 88, 28]. For one, socioeconomic disadvantages (often correlated with socially-salient attributes) have been shown to impact an individual's ability to perform in an evaluation process, giving rise to different performance distributions across groups [55, 16]. Specifically, disparities in access to monetary resources are known to have a significant impact on individuals' SAT scores [55]. Moreover, because of differences between the socially-salient attributes of individuals and evaluators, the same amount of resources (such as time or cognitive effort) spent by the evaluator and the individual can lead to different outcomes for individuals in different groups [64, 95, 8]. For instance, it can be cognitively more demanding, especially in time-constrained evaluation processes, for the evaluator to interact with individuals who have a different cultural background than their own, thus impacting the evaluations [95, 86, 70, 58, 142, 115, 113]. Further, such biases in human evaluations can also affect learning algorithms through biased past data that the algorithms take as input [75, 57, 54, 28].

Another factor that has been identified as a source of bias is risk averseness: the tendency to perceive a gain as increasing one's utility by less than a loss of the same magnitude decreases it [84, 144, 151]. Risk averseness is known to play a role in high-stakes decisions such as whom to hire, whom to follow on social networks, and whether to pursue higher education [71, 23, 14]. In evaluations with an abundance of applicants, overestimating the value of an individual can lead to a downstream loss (e.g., because the individual is hired or admitted), whereas underestimating may not have a significant cost [139, 65]. Thus, in the presence of risk averseness, evaluation processes may skew the output evaluations toward lower or higher values. The same skew can also arise from the perspective of individuals [13, 33, 111]. For instance, when negotiating salaries, overstating one's expected salary can have adverse effects, with evaluators being less inclined to work with the individual or, in extreme cases, denying employment [13, 33]. Moreover, these costs have been observed to be higher for women than for men, and they are one of the prominent explanations for why women negotiate less frequently [13, 33].

A number of interventions to mitigate the adverse effects of such biases in evaluation processes have been proposed. These include representational constraints that, across multiple individuals, increase the representation of disadvantaged and minority groups in the set of individuals with high evaluations [46, 135, 131, 19, 77, 27, 116]; structured evaluations, which reduce the scope of unintended biases in evaluations [123, 68, 147, 15]; and anonymized evaluations that, when possible, blind the decision makers to the socially-salient attributes of the individuals being evaluated [72]. Mathematically, some works have modeled the outcomes of evaluation processes based on empirical observations [12, 22, 90, 61].
For instance, the implicit variance model of [61] models differences in the amount of noise in the utilities of individuals in different groups. Here, the output estimate is drawn from a Gaussian density whose mean is the true utility v (which can take any real value) and whose variance depends on the group of the individual being evaluated: the variance is higher for individuals in the disadvantaged group than for individuals in the advantaged group. Additive and multiplicative skews in the outputs of evaluation processes have also been modeled [90, 22] (see also Appendix A). [90] consider true utilities v > 0 distributed according to the Pareto density and model the output as v/ρ for some fixed ρ ≥ 1, where ρ is larger for individuals in the disadvantaged group. These models have been influential in the study of various downstream tasks such as selection [90, 61, 38, 129, 67, 106, 108, 29], ranking [40], and classification [28] in the presence of biases.

Our contributions. We propose a new optimization-based approach to model how an evaluation process transforms an (unknown) input density f_D, representing the true utility of an individual or a population, into an observed distribution in the presence of information constraints or risk aversion. Based on the aforementioned studies and insights from the social sciences, our model has two parameters: the resource-information parameter (τ ∈ R) in the information constraint and the risk-averseness parameter (α ≥ 1) in the objective function; see (OptProg) in Section 2. The objective measures the inaccuracy of the estimator with respect to the true density f_D and involves a given loss function ℓ and the parameter α; α is higher (worse) for individuals in groups facing higher risk aversion. The constraint places a lower bound of τ on the amount of information (about the density of the true value v) that the individual and evaluator can acquire or exchange in their interaction; τ is higher for individuals in groups that require more resources to gain a unit of information. We measure the amount of information in the output density by its differential entropy. Our model builds on the maximum-entropy framework in statistics and information theory [80]; it is derived in Section 2 and can be viewed as extending this theory to output a rich family of biased densities.

In Section 3, we show various properties of the output densities of our model. We prove that the solution to (OptProg) is unique under general conditions and characterize the output density as a function of f_D, ℓ, τ, and α; see Theorem 3.1. By varying the loss function and the true density, our framework can output not only standard density functions (such as Gaussian, Pareto, Exponential, and Laplace) but also their appropriate noisy and skewed versions, generalizing the models studied in [90, 22, 61]. Subsequently, we investigate how varying the parameter τ affects the output density in Section 3. We observe that when τ → −∞, there is effectively no constraint, and the output is concentrated at a point. For any fixed α, as τ increases, the output density spreads out: its variance and/or mean increases. We also study the effect of increasing α on the output density. We observe that when the true density is Gaussian or Pareto, the mean of the output density decreases as α increases for any fixed τ. Thus, individuals in a group with higher α and/or τ face higher noise and/or skew in their evaluations, as predicted by our model.
Empirically, we evaluate our model's ability to emulate biases present in real-world evaluation processes using two real-world datasets (JEE-2009 scores and the Semantic Scholar Open Research Corpus) and one synthetic dataset (Section 4). For each dataset, we report the total variation (TV) distance between the densities of biased utilities in the data and the best-fitting densities output by our framework and by earlier models. Across all datasets, we observe that our model can output densities that are close to the density of biased utilities in the data, and it fits better than the models of [61, 90]; see Table 1. Further, on a downstream selection task, we evaluate the effectiveness of two well-studied bias-mitigating interventions, equal representation (ER) and proportional representation (PR) constraints, and two additional interventions suggested by our work: decreasing the resource-information parameter τ and reducing the risk-averseness parameter α. ER and PR are constraints on the allowable outcomes; τ can be decreased, e.g., by training the evaluators to improve their efficiency; and α can be decreased, e.g., by using structured interviews [30]. We observe that for each intervention there are instances of selection where it outperforms all other interventions (Figure 1). Thus, our model can be used as a tool to study the effectiveness of different types of interventions in downstream tasks and to inform policy; see also Appendix L.1 and Appendix B.

## 2 Model

The evaluation processes we consider have two stakeholders, an evaluator and an individual, along with a societal context that affects the process. In an evaluation process, an evaluator interacts with an individual to obtain an estimate of the individual's utility or value. We assume that each individual's true utility v is drawn from a probability distribution. This captures not only the case that the same individual may show variability in the same evaluation (as is frequently observed in interviews, examinations, and peer review [31, 50, 76, 124, 30, 140, 18]) but also the case that v corresponds to the utility of an individual drawn from a population. For simplicity, we consider the setting where v is real-valued and its density is supported on a continuous subset Ω ⊆ R. This density gives rise to a distribution over Ω with respect to the Lebesgue measure µ on R. For instance, Ω could be the set of all real numbers R, the set of positive real numbers R_{>0}, an interval such as [1, ∞), or a closed interval [a, b]. Following prior work modeling output densities [90, 40, 61], we assume that the true utility of all individuals is drawn from the same density f_D.

We view an evaluation process as a transformation of an (unknown) true density f_D into an observed density f_E over Ω. In real-world evaluation processes, this happens through various means: by processing features of an individual (e.g., past performance on exams or past employment), through interaction between the evaluator and the individual (e.g., in oral examinations), or by requesting the individual to complete an assessment or test [121, 99, 19]. We present an optimization-based model that captures some of the aforementioned scenarios and outputs f_E. The parameters of this model encode factors that may differ across socially-salient groups, thus making f_E group dependent even though f_D is not. We derive our model in four steps.

Step 1: Invoking the entropy maximization principle.
In order to gain some intuition, consider a simple setting where the utility of an individual is a fixed quantity v (i.e., f_D is a Dirac delta function at v). We first need to define an error or loss function ℓ : Ω × Ω → R; given a guess x of v, the loss ℓ(x, v) indicates the gap between the two values. We do not assume that ℓ is symmetric, but we require ℓ(x, v) ≥ 0 when x ≥ v. Some examples of ℓ(x, v) are (x − v)², |x − v|, x/v, and ln(x/v). The right choice of loss function can be context dependent; e.g., (x − v)² is a commonly used loss function for real-valued data, and ln(x/v) is sometimes better at capturing relative error for heavy-tailed distributions over positive domains [83]. For a density f for x, E_{x∼f} ℓ(x, v) denotes the expected error of the evaluation process. One can therefore consider the following problem: given a value Δ, can we find an f such that E_{x∼f} ℓ(x, v) ≤ Δ? This problem is under-specified, as there may be (infinitely) many densities f satisfying this constraint. To specify f uniquely, we appeal to the maximum-entropy framework in statistics and information theory [81]: among all feasible densities, one should select the density f that has the maximum entropy. This principle leads to the selection of a density that is consistent with our constraint and makes no additional assumptions. We use the notion of the (differential) entropy of a density with respect to the Lebesgue measure µ on R:

  Ent(f) := −∫_{x∈Ω} f(x) ln f(x) dµ(x),   (1)

where f(x) ln f(x) = 0 whenever f(x) = 0. Thus, we get the following optimization problem:

  argmax_{f: density on Ω} Ent(f)  s.t.  E_{x∼f} ℓ(x, v) ≤ Δ.   (2)

This optimization problem is well studied, and it is known that by using different loss functions we can derive many families of densities [102, 148]. For instance, for ℓ(x, v) := (x − v)², we recover the Gaussian density with mean v, and for ℓ(x, v) := ln(x/v), we obtain a Pareto density.

Step 2: Incorporating the resource-information parameter. We now extend the above formulation to include information constraints in the evaluation process. In an evaluation process, both the evaluator and the individual spend resources, such as time, cognitive effort, or money, to communicate information related to the utility of the individual to the evaluator. For instance, in interviews, both the interviewer and the interviewee spend time and cognitive effort. In university admissions, the admissions office needs to spend money to hire and train application readers who, in turn, screen applications for the university's admission program, and the applicants need to spend time, cognitive effort, and money to prepare and submit their applications [24, 150]. The more resources are spent in an evaluation process, the more additional information about v is acquired. We model this using a resource-information parameter τ, which puts a lower bound on the entropy of f. Thus, we modify the optimization problem in (2) as follows. We first flip the optimization problem to an equivalent one in which we minimize the expected loss subject to a lower bound on the entropy Ent(f) of f. A higher value of the resource-information parameter τ means that one needs to spend more resources to obtain the same information, and it corresponds to a stronger lower bound on Ent(f) (and vice versa):

  argmin_{f: density on Ω} E_{x∼f} ℓ(x, v)  s.t.  Ent(f) ≥ τ.   (3)
When τ → −∞, the optimal density tends to a point (delta) density at v (recovering the most information), and when τ → ∞, it tends to a uniform density on Ω (learning nothing about v). Since differential entropy can vary from negative to positive infinity, the value of τ is to be viewed relative to an arbitrary reference point. τ may vary with the socially-salient group of the individual in real-world contexts. For instance, in settings where the evaluator needs to interact with individuals (e.g., interviews), disparities can arise because it is less cognitively demanding for an evaluator to communicate with individuals who speak the same language as themselves, compared to individuals who speak a different language [95]. In settings where the evaluator assesses individuals based on data about their past education and employment (e.g., at the screening stage of hiring or in university admissions), disparities can arise because the evaluator is more knowledgeable about a specific group's sociocultural background than about others' and would have to spend more resources to gather the required information for the other groups [49, 62].

Step 3: Incorporating the risk-averseness parameter. We now introduce the parameter α, which captures risk averseness. Roughly speaking, risk averseness may arise in an evaluation process because of the downstream impact of the output. The evaluator may benefit from, or be held accountable for, the estimated value, and hence may be eager or reluctant to assign values much higher than the true utility [71, 23, 14]. Further, the individual may also be risk averse; e.g., during a hiring interview, the risk of getting rejected may prompt the individual to quote less than their expected salary [13, 33, 111]. To formalize this intuition, for a given ℓ, we define a risk-averse loss function ℓ_α : Ω × Ω → R that incorporates the parameter α ≥ 1 into ℓ as follows:

  ℓ_α(x, v) := α·ℓ(x, v) if x ≥ v, and ℓ_α(x, v) := ℓ(x, v) if x < v.   (4)

This loss function not only penalizes overestimation more than underestimation; in addition, the larger the overestimate, the larger the penalty. This is consistent with the theory of risk averseness [10, 119]. Our choice is related to the notion of hyperbolic absolute risk aversion [78, 109], and one may pick other ways to incorporate risk averseness into the loss function [144, 98]. As an example, if ℓ(x, v) = (x − v)², then the ℓ₂²-loss is ℓ_α(x, v) = α(x − v)² if x ≥ v and (x − v)² otherwise. If ℓ(x, v) = ln(x/v), then the log-ratio loss is ℓ_α(x, v) = α ln(x/v) if x ≥ v and ln(x/v) otherwise. Plots of these two loss functions are in Figure 2.¹ Thus, the analog of (3) becomes

  argmin_{f: density on Ω} E_{x∼f} ℓ_α(x, v)  s.t.  Ent(f) ≥ τ.   (5)

We note that, for the risk-averse loss function defined in (4), the following hold: for all α ≥ α′ ≥ 1, ℓ_α(x, v) ≥ ℓ_{α′}(x, v) for all x, v ∈ Ω, and ℓ_α(x, v) − ℓ_{α′}(x, v) is an increasing function of x for x ≥ v. Beyond (4), one could consider other ℓ_α(x, v) satisfying these two properties in our framework; we omit the details. One can also incorporate the (opposite) notion of risk eagerness, where values of x lower than v are penalized more than values of x higher than v, by letting α ∈ (0, 1].

¹ A variation of (4) that we use in the empirical part is the following: for a fixed shift v₀, let ℓ_α(x, v) := α·ℓ(x, v + v₀) if x ≥ v + v₀ and ℓ_α(x, v) := ℓ(x, v + v₀) if x < v + v₀.
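As a small illustration of (4), the sketch below wraps any base loss into its risk-averse version; the function names and the value α = 2 are our own illustrative choices, not part of the paper's implementation.

```python
import numpy as np

def risk_averse_loss(loss, alpha):
    """Turn a base loss l(x, v) into the l_alpha of Eq. (4): overestimates
    (x >= v) are penalized by a factor alpha >= 1, underestimates are not."""
    def l_alpha(x, v):
        return np.where(np.asarray(x) >= np.asarray(v),
                        alpha * loss(x, v), loss(x, v))
    return l_alpha

l2_alpha = risk_averse_loss(lambda x, v: (x - v) ** 2, alpha=2.0)      # l_2^2-loss
log_ratio_alpha = risk_averse_loss(lambda x, v: np.log(x / v), alpha=2.0)

print(l2_alpha(1.5, 1.0), l2_alpha(0.5, 1.0))  # 0.5 (penalized) vs. 0.25
```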
Step 4: Generalizing to an arbitrary f_D. To extend (5) to the setting where v comes from a general density f_D, we replace the loss function by its expectation over f_D and arrive at the model of the evaluation process that we propose in this paper:

  argmin_{f: density on Ω} Err_{ℓ,α}(f_D, f) := ∫_{v∈Ω} ( ∫_{x∈Ω} ℓ_α(x, v) f(x) dµ(x) ) f_D(v) dµ(v)   (OptProg)
  s.t.  −∫_{x∈Ω} f(x) ln f(x) dµ(x) ≥ τ.

For a given ℓ and parameters α and τ, this optimization framework can be viewed as transforming the true utility density f_D of a group of individuals into the density f_E (the solution to this optimization problem). It is worth pointing out that neither the evaluator nor the individual is solving the above optimization problem; rather, (OptProg) models the evaluation process, and the loss function ℓ, α, and τ depend on the socially-salient attributes of the individual's group; see also Appendix B.

## 3 Theoretical results

Characterization of the optimal solution. We first characterize the solution of the optimization problem (OptProg) in terms of f_D, ℓ_α, τ, and α. Given a probability density f_D, a parameter α ≥ 1, and a loss function ℓ, consider the function

  I_{f_D,ℓ,α}(x) := ∫_{v∈Ω} ℓ_α(x, v) f_D(v) dµ(v).

This integral captures the expected loss when the estimated utility is x. Further, for a density f, the objective function of (OptProg) can be expressed as Err_{ℓ,α}(f_D, f) = ∫_{x∈Ω} I_{f_D,ℓ,α}(x) f(x) dµ(x).

Theorem 3.1 (informal version of Theorem C.1 in Appendix C). Under general conditions on f_D and ℓ, for any finite τ and α ≥ 1, (OptProg) has a unique solution f⋆(x) ∝ exp(−I_{f_D,ℓ,α}(x)/γ⋆), where γ⋆ > 0 is unique and also depends on α and τ. Further, Ent(f⋆) = τ.

The uniqueness in Theorem 3.1 implies that if α = 1, τ = Ent(f_D), and ℓ(x, v) := ln f_D(v) − ln f_D(x), then the optimal solution is f⋆(x) = f_D(x). To see this, note that in this case I_{f_D,ℓ,α}(x) = −ln f_D(x) − Ent(f_D); hence f⋆(x) = f_D(x) satisfies the conclusion of Theorem 3.1 (see Section C.7 for details). Thus, in the absence of risk averseness, and for an appropriate choice of the resource-information parameter, the output density is the same as the true density.

Theorem 3.1 can be viewed as a significant extension of results that show how well-known probability distributions arise as solutions to the entropy-maximization framework. Indeed, the standard maximum-entropy formulation only considers the setting where the input utility is given by a single value, i.e., the distribution corresponding to f_D is concentrated at a single point, and the risk-averseness parameter is α = 1. While the optimal solution to the maximum-entropy formulation (2), restricted to the class of well-known loss functions (e.g., the ℓ₂²-loss or the linear loss), can be understood using standard tools from convex optimization (see [48]), characterizing the optimal solution to the general formulation (OptProg) is more challenging for several reasons: (i) the input density f_D need not be concentrated at a single point, and hence one needs to understand conditions on f_D under which the formulation has a unique optimal solution; (ii) the loss function can be arbitrary, and one needs to formulate suitable conditions on it such that (OptProg) has a unique optimal solution; (iii) the risk-averseness parameter α makes the loss function asymmetric (around any fixed value v) and makes the analysis of the error in the objective function non-trivial.
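For intuition, here is a discretized sketch of (OptProg) that uses the Gibbs form given by Theorem 3.1 and bisects over the temperature γ until the entropy constraint is tight. The grid, the bisection bracket, and the parameter values are illustrative assumptions on our part, not the paper's implementation.

```python
import numpy as np

def solve_opt_prog(f_D, loss_alpha, grid, tau):
    """Discretized (OptProg): by Theorem 3.1 the optimum is
    f*(x) proportional to exp(-I(x)/gamma*) with Ent(f*) = tau; since the
    entropy of this Gibbs density increases with gamma, we bisect on gamma."""
    dx = grid[1] - grid[0]
    # I(x) = E_{v ~ f_D}[ l_alpha(x, v) ]: the expected loss of the estimate x.
    I = np.array([np.sum(loss_alpha(x, grid) * f_D) * dx for x in grid])

    def gibbs(gamma):
        w = np.exp(-(I - I.min()) / gamma)           # shift I for stability
        f = w / (w.sum() * dx)                       # normalize to a density
        ent = -np.sum(f * np.log(f + 1e-300)) * dx   # differential entropy
        return f, ent

    lo, hi = 1e-6, 1e6
    for _ in range(200):                             # bisection on a log scale
        mid = np.sqrt(lo * hi)
        lo, hi = (mid, hi) if gibbs(mid)[1] < tau else (lo, mid)
    return gibbs(np.sqrt(lo * hi))[0]

# Example: Gaussian f_D with the risk-averse l_2^2-loss and alpha = 2.
grid = np.linspace(-10.0, 10.0, 2001)
f_D = np.exp(-grid ** 2 / 2) / np.sqrt(2 * np.pi)
l_alpha = lambda x, v: np.where(x >= v, 2.0 * (x - v) ** 2, (x - v) ** 2)
f_star = solve_opt_prog(f_D, l_alpha, grid, tau=1.0)
print((f_star * grid).sum() * (grid[1] - grid[0]))   # mean shifts below 0
```

With α > 1, the energy I(x) grows faster to the right of its minimizer, so the computed density is skewed toward lower values, matching the discussion of the effect of α below.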
Roughly speaking, the only restrictions we need on the input, other than standard integrability assumptions, are: (a) monotonicity of the loss function ℓ(x, v) with respect to either x or v; (b) the growth rate of the loss function ℓ(x, v) is at least logarithmic; and (c) the function I_{f_D,ℓ,α}(x) has a unique global minimum, which is a much weaker assumption than convexity of the function. Note that for the ℓ₂²-loss function ℓ(x, v) = (x − v)², the first two conditions hold trivially, and when the input density f_D is Gaussian, it is not hard to show that I_{f_D,ℓ,α}(x) is strongly convex, so the third condition above also holds (see Appendix H for details). These conditions are formally stated in Appendix C.1, and we show that they hold for Gaussian, Pareto, Exponential, and Laplace densities in Sections H, I, J, and K respectively.

The proof of Theorem 3.1 is presented in Appendix C; its key steps are as follows. (i) The proof starts by considering the dual of (OptProg) and shows that strong duality holds (see Appendix C.2 and Appendix C.3). (ii) The next step is to show that the optimal solution f⋆ of (OptProg) exists and is unique. This requires proving that the dual variable γ⋆ (corresponding to the entropy constraint in (OptProg)) is positive: while this variable is always non-negative, the main technical challenge is to show that it is non-zero. In fact, there are instances of (OptProg) where γ⋆ is zero, and an optimal solution does not exist (or an optimal solution exists but is not unique). (iii) Proving γ⋆ ≠ 0 requires us to understand the properties of the integral I_{f_D,ℓ,α}(x) (abbreviated as I(x) when the parameters f_D, ℓ, α are clear from the context). In Appendix C.4 we show that I(x) can be expressed as a sum of two monotone functions (see Theorem C.10). This decomposition allows us to show that the optimal value of (OptProg) is finite. (iv) In Appendix C.5, we show that the optimal value of (OptProg) is strictly larger than I(x⋆) (Lemma C.13, Lemma C.14), where x⋆ is the minimizer of I(x). This requires us to understand the interplay between the growth rate of the expected loss function and the entropy of a density as probability mass is placed away from x⋆. Indeed, these technical results do not hold if the loss function ℓ(x, v) grows very slowly (as a function of x/v or |x − v|). (v) Finally, in Theorem C.15, we show that γ⋆ is non-zero: if γ⋆ = 0, then the optimal value of (OptProg) equals I(x⋆), contradicting (iv) above. Once γ⋆ > 0 is established, the expression for the (unique) optimal solution, f⋆(x) ∝ exp(−I(x)/γ⋆), follows from Theorem C.16.

We conclude this section with two remarks. 1) In Appendix E, we show that Theorem 3.1 implies

  τ = Err_{ℓ,α}(f_D, f⋆)/γ⋆ + ln Z⋆, where Z⋆ := ∫_Ω exp(−I_{f_D,ℓ,α}(x)/γ⋆) dµ(x)

is the partition function, or normalizing constant, that makes f⋆ a probability density. This equation is an analog of the Gibbs equation in statistical physics and gives a physical interpretation of I_{f_D,ℓ,α}(x) and γ⋆: γ⋆ is the temperature and I_{f_D,ℓ,α}(x) is the energy of state x. This may be useful in understanding the effects of different parameters on the output density. 2) If one wishes, one can use (OptProg) to understand the setting where a single individual is being evaluated by setting the input density f_D to be concentrated at their true utility v.
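The identity in remark 1) can be checked in closed form in the single-individual setting of remark 2) with the ℓ₂²-loss and α = 1, where every quantity is explicit; a short verification:

```latex
% f_D concentrated at v, loss (x - v)^2, alpha = 1:
\[
I(x) = (x - v)^2, \quad
f^\star(x) = \tfrac{1}{Z^\star}\, e^{-(x - v)^2/\gamma^\star}, \quad
Z^\star = \sqrt{\pi\gamma^\star}, \quad
\mathrm{Err}_{\ell,\alpha}(f_D, f^\star) = \tfrac{\gamma^\star}{2},
\]
\[
\frac{\mathrm{Err}_{\ell,\alpha}(f_D, f^\star)}{\gamma^\star} + \ln Z^\star
= \frac{1}{2} + \frac{1}{2}\ln(\pi\gamma^\star)
= \frac{1}{2}\ln(\pi e \gamma^\star)
= \mathrm{Ent}(f^\star) = \tau.
\]
```

Here f⋆ is a Gaussian with variance γ⋆/2, whose entropy is (1/2)·ln(2πe·γ⋆/2) = (1/2)·ln(πeγ⋆), so both sides of the Gibbs equation agree.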
For instance, if we set the loss function to be the ℓ₂²-loss, then using Theorem 3.1 one can show that, for any given values of the parameters α and τ, the mean of the output density is v − (2/π)·e^{τ−1/2}·(√α − 1)/(√α + 1); see Appendix D for a proof. Therefore, for α > 1, the mean of the output density is strictly less than v. This gives a mapping from the true ability to the (mean of the) biased ability in this case, which can be used to understand how the parameters τ and α of the evaluation process transform the true ability.

Effect of varying τ for a fixed α. We first study the effect of changing the resource-information parameter τ on f⋆ for a fixed value of the risk-averseness parameter α ≥ 1. To highlight this dependence on τ, here we use the notation f⋆_τ for the optimal solution f⋆. We start by noting that as γ⋆ increases, the optimal density becomes close to uniform, and as γ⋆ goes towards zero, the optimal density concentrates around the energy-minimizing point x⋆ := argmin_{x∈Ω} I_{f_D,ℓ,α}(x). Note that the point x⋆ does not depend on τ. However, γ⋆ may depend in a complicated manner on both τ and α, and it is not apparent what effect changing τ has on γ⋆. We show that, for any fixed α ≥ 1, as τ decreases, the output density concentrates around x⋆ (see Theorem F.1). This confirms the intuition that if we reduce τ by adding more resources to the evaluation process, the uncertainty in the output density is reduced. Similarly, if τ increases because of a reduction in the resources invested in the evaluation, the uncertainty in f⋆_τ should increase, and hence the output density should converge towards a uniform density.

For specific densities, one can obtain sharper results. Consider the case when f_D is Gaussian with mean m and variance σ², and ℓ_α(x, v) := α(x − v)² if x ≥ v and (x − v)² if x < v. The uncertainty of a Gaussian density is captured by its variance, and hence we expect the output density to have a higher variance when the parameter τ is increased. Indeed, when α = 1, we show that the optimal density f⋆_τ is a Gaussian with mean m and variance e^{2τ}σ²; see Appendix H. Thus, as τ increases from −∞ to ∞, the variance of the output density increases monotonically from 0 to ∞. When α ≥ 1, it can be seen numerically that for any fixed α, increasing τ increases the variance and decreases the mean of f⋆_τ; see Figure 3. Intuitively, the decrease in mean occurs because a higher variance increases the probability of the estimated value being much larger than the mean, and the risk-averseness parameter imposes a high penalty when the estimated value is larger than the true value. In fact, we show in Theorem F.5 that the variance of f⋆_τ, for any continuous input density f_D supported on R, is at least (1/2π)·e^{2τ−1}. This follows from the well-known fact that among all probability densities supported on R with variance σ², the Gaussian density with variance σ² maximizes the (differential) entropy [120]; see Section F.2 for details. In a similar vein, we show that the mean of f⋆_τ is at least e^{τ−1} when the input density f_D is supported on [0, ∞), and hence it approaches ∞ as τ goes to ∞ (see Theorem F.7). This result relies on the fact that among all densities supported on [0, ∞) with a fixed expectation 1/λ (for λ > 0), the exponential density with parameter λ maximizes the entropy [48].
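Both lower bounds follow in one line from the stated maximum-entropy facts together with Ent(f⋆_τ) = τ from Theorem 3.1; we sketch them here for completeness:

```latex
% Gaussian maximizes entropy among densities on R with a given variance:
\[
\tau = \mathrm{Ent}(f^\star_\tau) \le \tfrac{1}{2}\ln\!\big(2\pi e\,\mathrm{Var}(f^\star_\tau)\big)
\;\Longrightarrow\;
\mathrm{Var}(f^\star_\tau) \ge \frac{e^{2\tau}}{2\pi e} = \frac{1}{2\pi}\,e^{2\tau - 1}.
\]
% Exponential maximizes entropy among densities on [0, inf) with a given mean m:
\[
\tau = \mathrm{Ent}(f^\star_\tau) \le 1 + \ln m
\;\Longrightarrow\;
m \ge e^{\tau - 1}.
\]
```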
We now consider the special setting where the input density f_D is Pareto with parameter β > 0 (f_D(x) := β·x^{−β−1} for x ∈ [1, ∞)), and ℓ_α(x, v) := α ln(x/v) if x ≥ v and ln(x/v) if x < v. When α = 1, we show that the optimal density f⋆_τ is also a Pareto density, with parameter β⋆ satisfying 1 + 1/β⋆ − ln β⋆ = τ (a short numerical sketch for computing β⋆ is given below). Using this, it can be shown that, for α = 1, both the mean and the variance of f⋆_τ increase monotonically to ∞ as τ goes to ∞; see Appendix I. The increase in variance reflects the fact that increasing τ increases the uncertainty in the evaluation process. Unlike the Gaussian case, where the mean of f⋆_τ can shift to the left of 0 as τ increases, the mean of the output density in this setting is constrained to be at least 1, and hence increasing the variance of f⋆_τ also results in an increase of its mean. Numerically, for any fixed α ≥ 1, increasing τ increases both the mean and the variance of f⋆_τ; see Figures 3 and 6.

Effect of varying α for a fixed τ. Let f⋆_α denote the optimal solution f⋆ for a given α and a fixed τ. We observe that, for any fixed x, both I_{f_D,ℓ,α}(x), the expected loss when the output is x, and Err_{ℓ,α}(f_D, f⋆_α) are increasing functions of α; see Appendix G. Thus, intuitively, the mass of the density should shift towards the minimizer of I_{f_D,ℓ,α}(x). Moreover, the minimizer of I_{f_D,ℓ,α}(x) itself should decrease with increasing α. Indeed, as the evaluation becomes more risk averse, the expected loss I_{f_D,ℓ,α}(x) for an estimated value x increases; however, the asymmetry of the loss function leads to a more rapid rate of increase in I_{f_D,ℓ,α}(x) for larger values of x. As a result, the minimizer of I_{f_D,ℓ,α}(x) decreases with increasing α. Thus, as we increase α, the output densities should shrink and/or shift to the left. We verify this numerically for Pareto and Gaussian densities: for fixed τ, increasing α decreases the mean of f⋆_α for both Pareto and Gaussian f_D; see Figures 4 and 7. As for the variance, for fixed τ, increasing α increases the variance when f_D is Gaussian and decreases the variance when f_D is Pareto; see Figures 4 and 7. See also the discussions in Appendices H and I.

Connection to the implicit variance model. Our results confirm that increasing τ effectively increases the noise in the estimated density by moving it closer to the uniform density. The implicit variance model of [61] also captures this phenomenon. More concretely, in their model, the observed utility of the advantaged group is a Gaussian random variable with mean µ and variance σ₀², and the observed utility of the disadvantaged group is a Gaussian random variable with mean µ and variance σ₀² + σ². This model can be derived from our framework where the input density f_D is Gaussian, the risk-averseness parameter is α = 1, the loss function is ℓ(x, v) = (x − v)², and the disadvantaged group is associated with a higher value of the resource-information parameter τ; see Section H for details.
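The numerical sketch mentioned above: since 1 + 1/β − ln β is strictly decreasing in β, the output parameter β⋆ in the α = 1 Pareto case can be found by one-dimensional root-finding (the bracket values below are illustrative):

```python
import numpy as np
from scipy.optimize import brentq

def pareto_output_parameter(tau):
    """Solve 1 + 1/beta - ln(beta) = tau (the entropy of a Pareto(beta)
    density on [1, inf)); the left side is strictly decreasing in beta,
    so the root is unique."""
    return brentq(lambda b: 1.0 + 1.0 / b - np.log(b) - tau, 1e-9, 1e9)

# Raising tau lowers beta*, i.e., a heavier tail and a larger mean/variance.
for tau in (0.5, 1.0, 1.5, 2.0):
    print(tau, pareto_output_parameter(tau))
```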
Connection to the multiplicative bias model. In the multiplicative-bias model of [90], the true utility of both groups is drawn from a Pareto distribution, and the output utility for the disadvantaged group is obtained by scaling down the true utility by a factor ρ > 1. This changes the domain of the distribution from [1, ∞) to [1/ρ, ∞) and hence does not fit exactly into our model, which does not allow a change of domain. Nevertheless, we argue that when the input density f_D is Pareto with a parameter β, τ = Ent(f_D), α > 1, and the loss function is given by ℓ(x, v) = ln x − ln v, then the output density f⋆_α has a smaller mean than that of the input density. As we increase α, the evaluation becomes more risk averse and hence decreases the probability of estimating higher utility values; hence, the mean of the output density decreases. We first show that, for any fixed β, as the parameter α increases, the output density converges to a density g⋆. We then show numerically that, for all the Pareto distributions considered in our study, the mean of the density g⋆ is less than that of the output density f⋆_α at α = 1. We give details of this argument in Section I.

| Dataset | This work (vary α and τ) | This work (fix α = 1) | This work (fix τ = Ent(f_D)) | Multiplicative bias [90] | Implicit variance [61] |
|---|---|---|---|---|---|
| JEE-2009 (Birth category) | 0.09 | 0.15 | 0.21 | 0.14 | 0.10 |
| JEE-2009 (Gender) | 0.07 | 0.15 | 0.19 | 0.07 | 0.08 |
| Semantic Scholar (Gender) | 0.03 | 0.09 | 0.23 | 0.08 | 0.13 |
| Synthetic Network | 0.03 | 0.05 | 10.0 | 0.05 | 0.22 |

Table 1: TV distances between best-fit densities and real data (Section 4), with an 80%-20% train-test split. Each dataset consists of two utility densities f_G1 and f_G2, corresponding to the advantaged and disadvantaged groups. We fix f_D = f_G1 and report the best-fit TV distance between f_G2 and the densities output by (a) our model, (b) the multiplicative bias model, and (c) the implicit variance model. We also compare our model to variants where we fix α = 1 or τ = Ent(f_D). Our model achieves a small TV distance on all datasets.

## 4 Empirical results

Ability to capture biases in data. First, we evaluate our model's ability to output densities that are close to the densities of biased utilities in one synthetic and two real-world datasets.

Setup and discussion. In all datasets we consider, there are natural notions of utility for individuals: scores in college admissions, number of citations in research, and degree in (social) networks. In each dataset, we fix a pair of groups G1 and G2 (defined by protected attributes such as age, gender, and race) and consider the empirical utility densities f_G1 and f_G2 of the two groups. Suppose group G1 is more privileged or advantaged than G2. In all datasets, we observe notable differences between f_G1 and f_G2 that advantage G1 (e.g., f_G1's mean is at least 34% higher than f_G2's).

Implementation details. Our goal is to understand whether our model can capture the biases, or differences, between f_G1 and f_G2. To evaluate this, we fix f_D = f_G1, i.e., f_G1 is the true density, and compute the minimum total variation distance between f_G2 and a density output by our model, i.e., min_{α,τ} d_TV(f⋆_{α,τ}, f_G2), where f⋆_{α,τ} is the solution to (OptProg) with inputs α and τ. (The total variation distance between two densities f and g over Ω is (1/2)·∫_Ω |f(x) − g(x)| dµ(x) [97].) To illustrate the importance of both α and τ, we also report the TV distances achieved with α = 1 (no skew) and with τ = τ₀ := Ent(f_D) (vacuous constraint), i.e., min_τ d_TV(f⋆_{1,τ}, f_G2) and min_α d_TV(f⋆_{α,τ₀}, f_G2), respectively. As a further comparison, we also report the minimum TV distances achieved by existing models of biases in evaluation processes: the multiplicative-bias model and the implicit variance model [90, 61]. Concretely, we report min_{µ,ρ} d_TV(f_{D,µ,ρ}, f_G2) and min_{µ,σ} d_TV(f_{D,µ,σ}, f_G2), where f_{D,µ,ρ} is the density of ρv + µ for v ∼ f_D and f_{D,µ,σ} is the density of v + µ + σζ, where v ∼ f_D and ζ ∼ N(0, 1).
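A sketch of this fitting procedure, using the discretized solver sketched earlier (solve_opt_prog); the grids over α and τ and the choice of base loss are our own illustrative choices:

```python
import numpy as np

def tv_distance(f, g, dx):
    """Total variation distance: (1/2) * integral of |f - g|."""
    return 0.5 * np.sum(np.abs(f - g)) * dx

def best_fit(f_G1, f_G2, grid, alphas, taus, solve_opt_prog, loss):
    """Grid search for min_{alpha, tau} d_TV(f*_{alpha, tau}, f_G2) with
    f_D = f_G1; `solve_opt_prog` is the discretized solver sketched above."""
    dx = grid[1] - grid[0]
    best_d, best_params = np.inf, None
    for alpha in alphas:
        l_alpha = lambda x, v: np.where(x >= v, alpha * loss(x, v), loss(x, v))
        for tau in taus:
            f_model = solve_opt_prog(f_G1, l_alpha, grid, tau)
            d = tv_distance(f_model, f_G2, dx)
            if d < best_d:
                best_d, best_params = d, (alpha, tau)
    return best_d, best_params
```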
Below we present brief descriptions of the datasets; detailed descriptions and additional implementation details appear in Appendix L.

Dataset 1 (JEE-2009 scores). Indian Institutes of Technology (IITs) are arguably the most prestigious engineering universities in India. Admission into IITs is decided based on students' performance in the yearly Joint Entrance Exam (JEE) [20]. This dataset contains the scores, birth category (official SES label [135]), and (binary) gender of all 384,977 students who took JEE-2009 [91]. We consider the score as the utility and run two simulations, with birth category (G1 denotes students in the GEN category) and gender (G1 denotes male students) respectively as the protected attribute. We set Ω as the discrete set of possible scores. We fix the ℓ₂²-loss, as f_G1 and f_G2 appear Gaussian-like (unimodal with both a left tail and a right tail; see Figure 13 in Appendix L).

Dataset 2 (Semantic Scholar Open Research Corpus). This dataset contains the list of authors, the year of publication, and the number of citations for 46,947,044 research papers on Semantic Scholar. We consider the total first-author citations of an author as their utility and their gender (predicted from first name) as the protected attribute (G1 denotes male authors). We fix Ω = {1, 2, ...} and ℓ as the log-ratio loss, as f_G1 and f_G2 have Pareto-like densities.

Dataset 3 (Synthetic network data). We generate a synthetic network with a biased variant of the Barabási-Albert model [5, 17, 35, 93] (a simulation sketch of this generator appears after the observations below). The vertices are divided into two groups G1 and G2. We start with a random graph G0 with m = 50 vertices, where each vertex is in G1 w.p. 1/2 independently. We extend G0 to n = 10,000 vertices iteratively: at each iteration, one vertex u arrives, u joins G1 w.p. 1/2 (and otherwise joins G2), and u forms one edge with an existing vertex v, where v is chosen with probability proportional to d_v if v ∈ G1 (d_v is v's current degree) and to d_v/2 otherwise. We use a vertex's degree as its utility, fix Ω = {1, 2, ...}, and use the log-ratio loss, as f_G1 and f_G2 have Pareto-like densities.

Observations. We report the TV distances for all simulations in Table 1 (see Figures 11 and 12 for plots of the corresponding best-fit densities). We observe that across all simulations our model can output densities that are close in TV distance (≤ 0.09) to f_G2. Moreover, both the α and τ parameters are important, and dropping either can increase the TV distance significantly (e.g., by 1.5 times on the JEE-2009 data with birth category and 2.66 times on the Semantic Scholar data). Compared to the implicit variance model, our model has a better fit on the JEE-2009 (Birth category), Semantic Scholar, and Synthetic Network data, because the implicit variance model does not capture skew and in these datasets f_G1 is a skewed version of f_G2. Compared to the multiplicative bias model, our model has a better fit on the JEE-2009 (Birth category) data, as here the utilities have Gaussian-like distributions, for which multiplicative bias largely has a translation effect. Finally, on the JEE-2009 (Gender) data, our model's performance is similar to the multiplicative bias and implicit variance models because in this data f_G1 and f_G2 are similar (TV distance 0.08).
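The sketch of the Dataset 3 generator mentioned above; for simplicity we start every seed vertex with degree 1 rather than sampling a random seed graph G0, which is a simplification on our part:

```python
import numpy as np

def biased_ba_degrees(n=10_000, m=50, seed=0):
    """Biased Barabasi-Albert process: each arriving vertex attaches one edge
    to an existing vertex v chosen with probability proportional to d_v if
    v is in G1 and to d_v / 2 if v is in G2."""
    rng = np.random.default_rng(seed)
    in_g1 = (rng.random(m) < 0.5).tolist()  # group labels of the seed vertices
    deg = [1] * m                           # simplified seed degrees (assumption)
    for _ in range(n - m):
        w = np.array(deg, dtype=float)
        w[~np.array(in_g1)] /= 2.0          # halve the weight of G2 vertices
        target = rng.choice(len(deg), p=w / w.sum())
        deg[target] += 1
        deg.append(1)                       # the new vertex has one edge
        in_g1.append(bool(rng.random() < 0.5))
    deg, in_g1 = np.array(deg), np.array(in_g1)
    return deg[in_g1], deg[~in_g1]          # degree "utilities" of G1 and G2

g1_util, g2_util = biased_ba_degrees()
print(g1_util.mean(), g2_util.mean())       # G1 accumulates higher degrees
```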
Effect of interventions on selection. Next, we illustrate the use of our model to study the effectiveness of different bias-mitigating interventions in downstream selection tasks (e.g., university admissions, hiring, and recommendation systems).

Subset selection tasks. There are various types of selection tasks [59, 4, 30]. We consider the simplest instantiation: there are n items, each item i has a true utility v_i ≥ 0, and the goal is to select a size-k subset S_D maximizing Σ_{i∈S} v_i. If V = (v_1, ..., v_n) is known, this problem is straightforward: select the k items with the highest utility. However, typically V is unknown and is estimated via a (human or algorithmic) evaluation process that outputs a possibly skewed/noisy estimate X = (x_1, ..., x_n) of V [125, 152, 110, 118, 36]. Hence, the output is S_E := argmax_{|S|=k} Σ_{i∈S} x_i, which may be very different from S_D and may have a much lower true utility Σ_{i∈S_E} v_i.

Interventions to mitigate bias. Several interventions have been proposed to counter the adverse effects of bias in selection, including representational constraints, structured interviews, and interviewer training. Each of these interventions tackles a different dimension of the selection task. Representational constraints require the selection to include at least a specified number of individuals from unprivileged groups [30, 37, 135]. Structured interviews reduce the scope of unintended skews by requiring all interviewees to receive the same (type of) questions [30, 123, 68]. Interviewer training aims to improve efficiency: the amount of (accurate) information the interviewer can acquire in a given time [30]. Which intervention should a policymaker enforce?

Studying the effectiveness of interventions. A recent and growing line of work [90, 40, 28, 61, 67, 38, 107] evaluates the effectiveness of representational constraints under specific models of bias: given a collection C of subsets satisfying some constraint, they ask when the constrained optimal set, S_{E,C} := argmax_{S∈C} Σ_{i∈S} x_i, has a higher utility than the unconstrained optimum S_E, i.e., when is Σ_{i∈S_{E,C}} v_i > Σ_{i∈S_E} v_i? Based on their analyses, [90, 40, 28, 61, 67, 38, 107] demonstrate the benefits of different constraints, including equal representation (ER), which requires the output S to satisfy |S ∩ G1| = |S ∩ G2|, and proportional representation (PR), which requires S to satisfy |S ∩ G1|/|G1| = |S ∩ G2|/|G2|. A feature of our model is that its parameters α and τ have a physical interpretation, which enables the study of other interventions: for instance, structured interviews aim to reduce skew in evaluations, which corresponds to shifting α closer to 1, and interviewer training improves the resource-information trade-off, i.e., reduces τ. Using our model, we compare ER and PR with two new interventions: change α by 50% (α-intervention) and change τ by 50% (τ-intervention). Here, 50% is an arbitrary amount chosen for illustration.

Setup. We consider a selection scenario based on the JEE-2009 data: we fix f_D = f_G1 and α⋆, τ⋆ to be the best-fit parameters on the JEE-2009 data (by TV distance), and |G1| = 1000. Let f_α and f_τ be the densities obtained after applying the α-intervention and the τ-intervention respectively (formally, f_α = f⋆_{α⋆/2, τ⋆} and f_τ = f⋆_{α⋆, 3τ⋆/2}). We vary |G2| ∈ {500, 1000, 1500}. For each i ∈ G1, we draw v_i ∼ f_D and set x_i = v_i (no bias). For each i ∈ G2, we draw v_i ∼ f_D, x_i ∼ f⋆_{α⋆,τ⋆}, x^α_i ∼ f_α, and x^τ_i ∼ f_τ, coupled so that the CDFs of the respective densities at v_i, x_i, x^α_i, and x^τ_i are equal. We give ER and PR the utilities {x_i}_i as input, the α-intervention the utilities {x^α_i}_i, and the τ-intervention the utilities {x^τ_i}_i. For each |G1| and |G2|, we vary 50 ≤ k ≤ 1000, sample utilities, and report the expected utility of the subset output by each intervention over 100 iterations (Figure 1); here, 1000 is the largest value of k for which ER is satisfiable across all group sizes |G1| and |G2|.
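A minimal sketch of this kind of selection experiment; the Gaussian stand-ins for f_D and the biased density, the specific shift/scale values, and the coupling via a shared Gaussian draw are toy simplifications of the setup above, not the paper's experiment:

```python
import numpy as np

def utility_of_topk(x, v, k):
    """True utility of the k items with the highest observed scores x."""
    return v[np.argsort(x)[-k:]].sum()

def utility_with_er(x, v, groups, k):
    """Equal representation: top k/2 by observed score from each group."""
    total = 0.0
    for g in (0, 1):
        idx = np.where(groups == g)[0]
        total += v[idx[np.argsort(x[idx])[-k // 2:]]].sum()
    return total

rng = np.random.default_rng(0)
n1, n2, k = 1000, 1000, 200
z = rng.normal(size=n1 + n2)                    # shared draw -> CDF coupling
groups = np.array([0] * n1 + [1] * n2)
v = z.copy()                                    # true utilities, v ~ f_D
x = z.copy()
x[groups == 1] = -0.3 + 1.5 * z[groups == 1]    # biased observed density (toy)
x_alpha = z.copy()
x_alpha[groups == 1] = -0.15 + 1.5 * z[groups == 1]  # alpha-intervention (toy)

base = utility_of_topk(x, v, k)                 # no intervention
print(utility_with_er(x, v, groups, k) / base)  # relative utility under ER
print(utility_of_topk(x_alpha, v, k) / base)    # relative utility, alpha-int.
```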
[Figure 1: three panels, (a) |G1| = 1000 and |G2| = 500, (b) |G1| = 1000 and |G2| = 1000, (c) |G1| = 1000 and |G2| = 2000; the curves correspond to reducing skew (the α-intervention), decreasing τ, equal representation, and proportional representation.]

Figure 1: Effectiveness of different interventions on the selection utility as estimated by our model. The x-axis shows k (the size of the selection) and the y-axis shows the ratio of the (true) utility of the subset output with an intervention to the (true) utility of the subset output without any intervention. The main observation across the figures is that, for each intervention, there is a choice of k and group sizes |G1| and |G2| where the intervention outperforms all other interventions. Hence, each intervention has a different effect on the latent utility of the selection, and a policymaker can use our model to study these effects and decide which intervention to enforce. Error bars represent the standard error of the mean over 100 repetitions.

Observations. Our main observation is that no intervention always achieves a higher utility than the others. In fact, for each intervention, there are values of k, |G1|, and |G2| such that the subset output with this intervention has a higher utility than the subsets output with the other interventions. Thus, each intervention has a very different effect on the latent utility of the selection, and a policymaker can use our model to study these effects in order to systematically decide which interventions to enforce; see Appendix L.1 for a case study of how a policymaker could potentially use this model to study bias-mitigating interventions in the JEE context.

## 5 Conclusion, limitations, and future work

We present a new optimization-based approach to modeling bias in evaluation processes ((OptProg)). Our model has two parameters, risk averseness α and the resource-information trade-off τ, which are well documented to lead to evaluation biases in a number of contexts. We show that it can generate rich classes of output densities (Theorem 3.1) and discuss how the output densities depend on the two parameters (Section 3). Empirically, we demonstrate that the densities arising from our model fit the densities of biased evaluations in multiple real-world datasets and a synthetic dataset well, often better than the models of prior works [90, 61] (Table 1). We use our model as a tool to evaluate different types of bias-mitigating interventions in a downstream selection task, illustrating how this model could be used by policymakers to explore available interventions (Figure 1 and Appendix L.1); see also Appendix B.

Our work relies on the assumption, made in prior works, that there are no differences (at a population level) between G1 and G2; see, e.g., [89, 61, 40]. If this premise is false, then the effectiveness of interventions can be either underestimated or overestimated, which may lead a policymaker to select a suboptimal intervention. That said, if all the considered interventions reduce risk aversion and/or resource constraints, then the chosen intervention should still have a positive impact on the disadvantaged group.
Our model can be easily used to study multiple socially-salient groups by considering a group-specific risk-aversion parameter and a group-specific information constraint. For example, if two groups G1 and G2 overlap, then we can consider the three disjoint subgroups G1 ∩ G2, G1\G2, and G2\G1. Our model of evaluation processes considers scenarios where candidates are evaluated along a single dimension. It can also be applied dimension-by-dimension to scenarios where individuals are evaluated along multiple dimensions, provided the evaluation in any dimension is independent of the evaluations in the other dimensions. Modeling evaluation processes involving multiple correlated dimensions is an interesting direction. While we illustrate the use of our model in a downstream selection task, utilities generated from biased evaluation processes are also used in other decision-making tasks (such as regression and clustering), and studying the downstream impact of evaluation biases on them is an important direction. Moreover, the output of our model could be used by policymakers to assess the impact of interventions in the supermodular set-aggregation setting, where the utility of the selected group is more than the sum of the individuals' utilities. Our model cannot be directly used to understand the effect of interventions in the long term; additional work, perhaps as in [38], would be required to do so and is an important direction for future work. Finally, any work on debiasing could be used adversarially to achieve the opposite goal. We need third-party evaluators, legal protections, and available recourse for affected parties; these are crucial components of any system, though they are beyond the scope of this work.

Acknowledgments. This project is supported in part by NSF Awards CCF-2112665 and IIS-2045951.

## References

[1] Is Income Implicit in Measures of Student Ability? https://budgetmodel.wharton.upenn.edu/issues/2021/9/28/is-income-implicit-in-measures-of-student-ability, 2021.

[2] Supreme Court guts affirmative action, effectively ending race-conscious admissions. https://www.npr.org/2023/06/29/1181138066/affirmative-action-supreme-court-decision, 2023.

[3] Social Security Administration. Beyond the Top 1000 Names, 2018. https://www.ssa.gov/oact/babynames/limits.html.

[4] Rakesh Agrawal, Sreenivas Gollapudi, Alan Halverson, and Samuel Ieong. Diversifying search results. In Proceedings of the Second ACM International Conference on Web Search and Data Mining, WSDM '09, pages 5-14, New York, NY, USA, 2009. Association for Computing Machinery.

[5] William Aiello, Fan R. K. Chung, and Linyuan Lu. A random graph model for massive graphs. In STOC, pages 171-180. ACM, 2000.

[6] Julia Angwin, Madeleine Varner, and Ariana Tobin. Machine bias: Facebook enabled advertisers to reach "Jew haters". ProPublica, September 2017.

[7] Another PIL seeks entrance tests in Gujarati, October 2011. https://web.archive.org/web/20120106121134/http://articles.timesofindia.indiatimes.com/2011-10-21/ahmedabad/30306229_1_entrance-tests-regional-language-gujarati-language.

[8] Michael Argyle. The Psychology of Interpersonal Behaviour. Penguin UK, 1994.

[9] S. Arora, E. Hazan, and S. Kale. The multiplicative weights update method: a meta-algorithm and applications. Theory of Computing, 8(1):121-164, 2012.

[10] Kenneth J. Arrow. Aspects of the Theory of Risk-Bearing. Yrjö Jahnsson Lectures. Yrjö Jahnssonin Säätiö, Helsinki, 1965.

[11] Kenneth J. Arrow.
The Theory of Discrimination, pages 1-33. Princeton University Press, Princeton, 1974.

[12] Kenneth J. Arrow. What has economics to say about racial discrimination? Journal of Economic Perspectives, 12(2):91-100, June 1998.

[13] L. Babcock and S. Laschever. Why Women Don't Ask: The High Cost of Avoiding Negotiations - And Positive Strategies for Change. Little, Brown Book Group Limited, 2009.

[14] Stijn Baert. Hiring a gay man, taking a risk?: A lab experiment on employment discrimination and risk aversion. Journal of Homosexuality, 65(8):1015-1031, 2018. PMID: 28841095.

[15] Katherine Baldiga. Gender differences in willingness to guess. Management Science, 60(2):434-448, 2014.

[16] British Business Bank and Oliver Wyman. Alone together: Entrepreneurship and diversity in the UK, October 2020. https://www.british-business-bank.co.uk/wp-content/uploads/2020/10/Alone-together-Entrepreneurship-and-diversity-in-the-UK-FINAL.pdf.

[17] Albert-László Barabási and Réka Albert. Emergence of scaling in random networks. Science, 286(5439):509-512, 1999.

[18] Hannah Bast. How objective is peer review?, November 2020. https://cacm.acm.org/blogs/blog-cacm/248824-how-objective-is-peer-review/fulltext.

[19] Surender Baswana, Partha P. Chakrabarti, V. Kamakoti, Yash Kanoria, Ashok Kumar, Utkarsh Patange, and Sharat Chandran. Joint seat allocation: An algorithmic perspective, 2015.

[20] Surender Baswana, Partha Pratim Chakrabarti, Sharat Chandran, Yashodhan Kanoria, and Utkarsh Patange. Centralized admissions for engineering colleges in India. In Proceedings of the 2019 ACM Conference on Economics and Computation, EC '19, pages 323-324, New York, NY, USA, 2019. Association for Computing Machinery.

[21] Ariane Baye and Christian Monseur. Gender differences in variability and extreme scores in an international context. Large-scale Assessments in Education, 4(1):1, 2016.

[22] Gary S. Becker. The Economics of Discrimination. Economic Research Studies. University of Chicago Press, 2010.

[23] Christian Belzil and Marco Leonardi. Risk aversion and schooling decisions. Annals of Economics and Statistics, pages 35-70, 2013.

[24] Sarah Berger. This high school senior says he is spending $1,700 on college applications, January 2018. https://www.cnbc.com/2018/01/02/high-price-of-college-applications.html.

[25] Marianne Bertrand and Esther Duflo. Chapter 8 - Field experiments on discrimination. In Abhijit Vinayak Banerjee and Esther Duflo, editors, Handbook of Field Experiments, volume 1 of Handbook of Economic Field Experiments, pages 309-393. North-Holland, 2017.

[26] Deepti Bhaskaran. CBSE's Udaan initiative helping girl students crack IIT entrance exam. Mint, August 2017. https://www.livemint.com/Education/B5VfKv6ts4EOWTuXpqYxKJ/CBSEs-Udaan-initiative-helping-girl-students-crack-IIT-e.html.

[27] Scott Bland. Schumer to introduce rules for diverse Senate hiring. Politico, 2017. https://www.politico.com/story/2017/02/schumer-diversity-nfl-rooney-rule-235477.

[28] Avrim Blum and Kevin Stangl. Recovering from biased data: Can fairness constraints improve accuracy? In FORC, volume 156 of LIPIcs, pages 3:1-3:20. Schloss Dagstuhl - Leibniz-Zentrum für Informatik, 2020.

[29] Niclas Boehmer, L. Elisa Celis, Lingxiao Huang, Anay Mehrotra, and Nisheeth K. Vishnoi.
Subset selection based on multiple rankings in the presence of bias: Effectiveness of fairness constraints for multiwinner voting score functions. In International Conference on Machine Learning, ICML, volume 202 of Proceedings of Machine Learning Research, pages 2641-2688. PMLR, 2023.

[30] Iris Bohnet. What Works: Gender Equality by Design. Harvard University Press, 2016.

[31] Robert F. Boldt, John A. Centra, and Rosalea G. Courtney. The validity of various methods of treating multiple SAT scores. ETS Research Report Series, 1986(1):i-8, 1986.

[32] Tristan L. Botelho and Mabel Abraham. Pursuing quality: How search costs and uncertainty magnify gender-based double standards in a multistage evaluation process. Administrative Science Quarterly, 62(4):698-730, 2017.

[33] Hannah Riley Bowles. Why women don't negotiate their job offers, June 2014. https://hbr.org/2014/06/why-women-dont-negotiate-their-job-offers.

[34] Stephen Boyd and Lieven Vandenberghe. Convex Optimization. Cambridge University Press, 2004.

[35] Andrei Z. Broder, Ravi Kumar, Farzin Maghoul, Prabhakar Raghavan, Sridhar Rajagopalan, Raymie Stata, Andrew Tomkins, and Janet L. Wiener. Graph structure in the web. Comput. Networks, 33(1-6):309-320, 2000.

[36] Quinn Capers IV, Daniel Clinchot, Leon McDougle, and Anthony G. Greenwald. Implicit racial bias in medical school admissions. Academic Medicine, 92(3):365-369, 2017.

[37] Marilyn Cavicchia. How to fight implicit bias? With conscious thought, diversity expert tells NABE, June 2017.

[38] L. Elisa Celis, Chris Hays, Anay Mehrotra, and Nisheeth K. Vishnoi. The effect of the Rooney Rule on implicit bias in the long term. In Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency, FAccT '21, pages 678-689, New York, NY, USA, 2021. Association for Computing Machinery.

[39] L. Elisa Celis, Vijay Keswani, and Nisheeth K. Vishnoi. Data preprocessing to mitigate bias: A maximum entropy based approach. In ICML, volume 119 of Proceedings of Machine Learning Research, pages 1349-1359. PMLR, 2020.

[40] L. Elisa Celis, Anay Mehrotra, and Nisheeth K. Vishnoi. Interventions for ranking in the presence of implicit bias. In FAT*, pages 369-380. ACM, 2020.

[41] Christopher P. Chambers and Federico Echenique. A characterisation of Phelpsian statistical discrimination. The Economic Journal, 131(637):2018-2032, 08 2020.

[42] Tessa E. S. Charlesworth and Mahzarin R. Banaji. Patterns of implicit and explicit attitudes: I. Long-term change and stability from 2007 to 2016. Psychological Science, 30(2):174-192, 2019. PMID: 30605364.

[43] Erick Chastain, Adi Livnat, Christos Papadimitriou, and Umesh Vazirani. Algorithms, games, and evolution. Proceedings of the National Academy of Sciences, 111(29):10620-10623, 2014.

[44] Mansi Choksi. Inside India's Cram City. The New York Times, January 2023. https://web.archive.org/web/20230202034926/https://www.nytimes.com/2023/01/18/magazine/india-cram-schools-kota.html.

[45] Stephen Coate and Glenn C. Loury. Will affirmative-action policies eliminate negative stereotypes? The American Economic Review, 83(5):1220-1240, 1993.

[46] Brian W. Collins. Tackling unconscious bias in hiring practices: The plight of the Rooney Rule. NYUL Rev., 82:870, 2007.

[47] Conduct IIT entrance test in Tamil also: PMK, March 2012. https://web.archive.org/web/20140502002203/https://www.thehindu.com/todays-paper/tp-national/article2961807.ece.

[48] Keith Conrad. Probability distributions and maximum entropy. Entropy, 6(452):10, 2004.
[49] Bradford Cornell and Ivo Welch. Culture, information, and screening discrimination. Journal of Political Economy, 104(3):542–571, 1996.
[50] Jason Dana, Robyn Dawes, and Nathanial Peterson. Belief in the unstructured interview: The persistence of an illusion. Judgment and Decision Making, 8(5):512–520, 2013.
[51] Jeffrey Dastin. Amazon scraps secret AI recruiting tool that showed bias against women, October 2019. https://reut.rs/2N1dzRJ.
[52] Debraj Deb. Tripura govt to sponsor 30 students with NEET, JEE coaching under Super 30 scheme. The Indian Express, May 2020. https://indianexpress.com/article/education/tripura-to-sponsor-30-top-students-with-neet-jee-coaching-under-super-30-scheme-from-this-year-6413972/.
[53] Delhi govt. to provide free coaching for NEET, JEE aspirants, February 2022. https://www.thehindu.com/news/cities/Delhi/delhi-govt-to-provide-free-coaching-for-neet-jee-aspirants/article65056297.ece.
[54] William Dieterich, Christina Mendoza, and Tim Brennan. COMPAS risk scales: Demonstrating accuracy equity and predictive parity. Northpointe Inc., 2016.
[55] Ezekiel J. Dixon-Román, Howard T. Everson, and John J. McArdle. Race, poverty and SAT scores: Modeling the influences of family income on black and white high school students' SAT performance. Teachers College Record, 115(4):1–33, 2013.
[56] Miroslav Dudík. Maximum entropy density estimation and modeling geographic distributions of species, 2007.
[57] Cynthia Dwork, Moritz Hardt, Toniann Pitassi, Omer Reingold, and Richard Zemel. Fairness through awareness. In ITCS, pages 214–226, New York, NY, USA, 2012. ACM.
[58] Tracy D. Eells and C. Robert Showalter. Work-related stress in American trial judges. Journal of the American Academy of Psychiatry and the Law Online, 22(1):71–83, 1994.
[59] Edith Elkind, Piotr Faliszewski, Piotr Skowron, and Arkadii Slinko. Properties of multiwinner voting rules. In AAMAS 2014, pages 53–60, 2014.
[60] Kim Elsesser. Lawsuit Claims SAT And ACT Are Biased. Here's What Research Says, 2019. https://www.forbes.com/sites/kimelsesser/2019/12/11/lawsuit-claims-sat-and-act-are-biased-heres-what-research-says/?sh=177187663c42.
[61] Vitalii Emelianov, Nicolas Gast, Krishna P. Gummadi, and Patrick Loiseau. On fair selection in the presence of implicit variance. In EC, pages 649–675. ACM, 2020.
[62] Episode 2 of Inside the Yale Admissions Office Podcast, 2020. https://admissions.yale.edu/podcast.
[63] Robert Epstein and Ronald E. Robertson. The search engine manipulation effect (SEME) and its possible impact on the outcomes of elections. Proceedings of the National Academy of Sciences, 112(33):E4512–E4521, 2015.
[64] Frederick Erickson and Jeffrey Schultz. The Counselor as Gatekeeper: Social Interaction in Interviews. Academic Press, 1982.
[65] Sean Fahey. The Real Cost Of Bad Hiring Decisions (And How To Avoid Making Them), March 2022. https://www.forbes.com/sites/forbeshumanresourcescouncil/2022/03/10/the-real-cost-of-bad-hiring-decisions-and-how-to-avoid-making-them/?sh=1e15226e5dac.
[66] Hanming Fang and Andrea Moro. Chapter 5 – Theories of Statistical Discrimination and Affirmative Action: A Survey. In Jess Benhabib, Alberto Bisin, and Matthew O. Jackson, editors, Handbook of Social Economics, volume 1, pages 133–200. North-Holland, 2011.
[67] Nikhil Garg, Hannah Li, and Faidra Monachou. Standardized tests and affirmative action: The role of bias and variance. In Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency, FAccT '21, page 261, New York, NY, USA, 2021.
Association for Computing Machinery.
[68] Atul Gawande. The Checklist Manifesto: How to Get Things Right. Henry Holt and Company, 2010.
[69] Tamar Szabó Gendler. On the epistemic costs of implicit bias. Philosophical Studies, 156(1):33, 2011.
[70] Daniel T. Gilbert and J. Gregory Hixon. The trouble of thinking: Activation and application of stereotypic beliefs. Journal of Personality and Social Psychology, 60:509–517, 1991.
[71] Anand M. Goel and Anjan V. Thakor. Overconfidence, CEO selection, and corporate governance. The Journal of Finance, 63(6):2737–2784, 2008.
[72] Claudia Goldin and Cecilia Rouse. Orchestrating Impartiality: The Impact of Blind Auditions on Female Musicians. American Economic Review, 90(4):715–741, September 2000.
[73] Anthony G. Greenwald and Mahzarin R. Banaji. Implicit social cognition: Attitudes, self-esteem, and stereotypes. Psychological Review, 102(1):4, 1995.
[74] Anthony G. Greenwald and Linda Hamilton Krieger. Implicit bias: Scientific foundations. California Law Review, 94(4):945–967, 2006.
[75] Moritz Hardt, Eric Price, and Nati Srebro. Equality of opportunity in supervised learning. In Daniel D. Lee, Masashi Sugiyama, Ulrike von Luxburg, Isabelle Guyon, and Roman Garnett, editors, Advances in Neural Information Processing Systems 29: Annual Conference on Neural Information Processing Systems 2016, December 5–10, 2016, Barcelona, Spain, pages 3315–3323, 2016.
[76] Abigail Johnson Hess. Rich students get better SAT scores. Here's why, October 2019. https://www.cnbc.com/2019/10/03/rich-students-get-better-sat-scores-heres-why.html.
[77] The White House. Fact Sheet: President Obama Announces New Commitments from Investors, Companies, Universities, and Cities to Advance Inclusive Entrepreneurship at First-Ever White House Demo Day, August 2015.
[78] J. E. Ingersoll. Theory of Financial Decision Making. G – Reference, Information and Interdisciplinary Subjects Series. Rowman & Littlefield, 1987.
[79] Abigail Z. Jacobs and Hanna Wallach. Measurement and fairness. In Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency, FAccT '21, pages 375–385, New York, NY, USA, 2021. Association for Computing Machinery.
[80] Edwin T. Jaynes. Information theory and statistical mechanics. Physical Review, 106:620–630, May 1957.
[81] Edwin T. Jaynes. On the rationale of maximum-entropy methods. Proceedings of the IEEE, 70(9):939–952, 1982.
[82] Kailash Jeenger. Reservation Is About Adequate Representation, Not Poverty Eradication. The Wire, May 2020. https://thewire.in/law/supreme-court-bench-reservation.
[83] Yangqing Jia and Trevor Darrell. Heavy-tailed distances for gradient based image descriptors. In J. Shawe-Taylor, R. Zemel, P. Bartlett, F. Pereira, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems, volume 24. Curran Associates, Inc., 2011.
[84] D. Kahneman, P. Slovic, and A. Tversky, editors. Judgment Under Uncertainty: Heuristics and Biases. Cambridge University Press, 1982.
[85] Irwin Katz. Gordon Allport's "The Nature of Prejudice". Political Psychology, 12(1):125–157, 1991.
[86] Giora Keinan. Decision making under stress: Scanning of alternatives under controllable and uncontrollable threats. Journal of Personality and Social Psychology, 52:639–644, 1987.
[87] M. E. Kite and B. E. Whitley. Psychology of Prejudice and Discrimination: 3rd Edition. Taylor & Francis, 2016.
[88] Jon Kleinberg and Sendhil Mullainathan.
Simplicity creates inequity: Implications for fairness, stereotypes, and interpretability. In Proceedings of the 2019 ACM Conference on Economics and Computation, EC '19, pages 807–808, New York, NY, USA, 2019. Association for Computing Machinery.
[89] Jon M. Kleinberg, Sendhil Mullainathan, and Manish Raghavan. Inherent trade-offs in the fair determination of risk scores. In ITCS, volume 67 of LIPIcs, pages 43:1–43:23. Schloss Dagstuhl – Leibniz-Zentrum für Informatik, 2017.
[90] Jon M. Kleinberg and Manish Raghavan. Selection problems in the presence of implicit bias. In ITCS, volume 94 of LIPIcs, pages 33:1–33:17. Schloss Dagstuhl – Leibniz-Zentrum für Informatik, 2018.
[91] Rajeev Kumar. RTI complaint, 2009. Decision No. CIC/SG/C/2009/001088/5392, Complaint No. CIC/SG/C/2009/001088.
[92] Rahul Kumar. SC, ST, OBC representation in Indian education is dismal, upper-caste nexus persists. The Print, March 2021. https://theprint.in/campus-voice/sc-st-obc-representation-in-indian-education-is-dismal-upper-caste-nexus-persists/627217/.
[93] Ravi Kumar, Prabhakar Raghavan, Sridhar Rajagopalan, D. Sivakumar, Andrew Tomkins, and Eli Upfal. Stochastic models for the web graph. In Proceedings 41st Annual Symposium on Foundations of Computer Science, pages 57–65. IEEE, 2000.
[94] Didier Lairez. A short derivation of the Boltzmann distribution and Gibbs entropy formula from the fundamental postulate, 2023.
[95] Kevin Lang. A Language Theory of Discrimination. The Quarterly Journal of Economics, 101(2):363–382, May 1986.
[96] Flavien Léger. A gradient descent perspective on Sinkhorn. Applied Mathematics & Optimization, 84(2):1843–1855, 2021.
[97] David Asher Levin, Yuval Peres, and Elizabeth Lee Wilmer. Markov Chains and Mixing Times. American Mathematical Society, 2009.
[98] Haim Levy and Myles Robinson. Stochastic Dominance: Investment Decision Making Under Uncertainty, volume 34. Springer, 2006.
[99] Tamar Lewin. A New SAT Aims to Realign With Schoolwork, March 2014. https://www.nytimes.com/2014/03/06/education/major-changes-in-sat-announced-by-college-board.html.
[100] Falk Lieder and Thomas L. Griffiths. Resource-rational analysis: Understanding human cognition as the optimal use of limited computational resources. Behavioral and Brain Sciences, 43:e1, 2020.
[101] W. Lippmann. Public Opinion. Harcourt, Brace, New York, 1922.
[102] J. H. C. Lisman and M. C. A. van Zuylen. Note on the generation of most probable frequency distributions. Statistica Neerlandica, 26(1):19–23, March 1972.
[103] List of languages by number of native speakers in India. https://en.wikipedia.org/wiki/List_of_languages_by_number_of_native_speakers_in_India.
[104] Karen S. Lyness and Madeline E. Heilman. When fit is fundamental: Performance evaluations and promotions of upper-level female and male managers. Journal of Applied Psychology, 91(4):777, 2006.
[105] Michael W. Mahoney and Lorenzo Orecchia. Implementing regularization implicitly via approximate eigenvector computation. In ICML, pages 121–128. Omnipress, 2011.
[106] Anay Mehrotra, Bary S. R. Pradelski, and Nisheeth K. Vishnoi. Selection in the presence of implicit bias: The advantage of intersectional constraints. In FAccT '22: 2022 ACM Conference on Fairness, Accountability, and Transparency, Seoul, Republic of Korea, June 21–24, 2022, pages 599–609. ACM, 2022.
[107] Anay Mehrotra, Bary S. R. Pradelski, and Nisheeth K.
Vishnoi. Selection in the Presence of Implicit Bias: The Advantage of Intersectional Constraints. In FAccT, to appear. ACM, 2022.
[108] Anay Mehrotra and Nisheeth K. Vishnoi. Maximizing submodular functions for recommendation in the presence of biases. In Ying Ding, Jie Tang, Juan F. Sequeda, Lora Aroyo, Carlos Castillo, and Geert-Jan Houben, editors, Proceedings of the ACM Web Conference 2023, WWW 2023, Austin, TX, USA, 30 April 2023 – 4 May 2023, pages 3625–3636. ACM, 2023.
[109] Robert C. Merton. Optimum consumption and portfolio rules in a continuous-time model. In W. T. Ziemba and R. G. Vickson, editors, Stochastic Optimization Models in Finance, pages 621–661. Academic Press, 1975.
[110] Corinne A. Moss-Racusin, John F. Dovidio, Victoria L. Brescoll, Mark J. Graham, and Jo Handelsman. Science faculty's subtle gender biases favor male students. Proceedings of the National Academy of Sciences, 109(41):16474–16479, 2012.
[111] Julie A. Nelson. Are women really more risk-averse than men? A re-analysis of the literature using expanded methods. Journal of Economic Surveys, 29(3):566–585, 2015.
[112] R. E. O'Dea, M. Lagisz, M. D. Jennions, and S. Nakagawa. Gender differences in individual variation in academic grades fail to fit expected patterns for STEM. Nature Communications, 9(1):3777, 2018.
[113] Tom O'Neil. Ask the expert: How to write a CV, 2012. http://web.archive.org/web/20170202065654/http://www.economist.com/node/21559508.
[114] Paula Onuchic. Recent contributions to theories of discrimination. arXiv preprint arXiv:2205.05994, 2022.
[115] Philip Oreopoulos and Diane Dechief. Why do some employers prefer to interview Matthew, but not Samir? New evidence from Toronto, Montreal, and Vancouver. Canadian Labour Market and Skills Researcher Network Working Paper 95, 2012.
[116] Christina Passariello. Tech Firms Borrow Football Play to Increase Hiring of Women, September 2016. https://www.wsj.com/articles/tech-firms-borrow-football-play-to-increase-hiring-of-women-1474963562.
[117] Edmund S. Phelps. The statistical theory of racism and sexism. The American Economic Review, 62(4):659–661, 1972.
[118] J. R. Posselt. Inside Graduate Admissions: Merit, Diversity, and Faculty Gatekeeping. Harvard University Press, 2016.
[119] John W. Pratt. 4 – Risk Aversion in the Small and in the Large. In Peter Diamond and Michael Rothschild, editors, Uncertainty in Economics, pages 59–79. Academic Press, 1978.
[120] Proof: Normal distribution maximizes differential entropy for fixed variance. https://statproofbook.github.io/P/norm-maxent.html.
[121] Elaine D. Pulakos. Selection assessment methods. United States of America: Society for Human Resource Management (SHRM) Foundation, 2005.
[122] Manish Raghavan, Solon Barocas, Jon M. Kleinberg, and Karen Levy. Mitigating bias in algorithmic hiring: Evaluating claims and practices. In Mireille Hildebrandt, Carlos Castillo, Elisa Celis, Salvatore Ruggieri, Linnet Taylor, and Gabriela Zanfir-Fortuna, editors, FAT* '20: Conference on Fairness, Accountability, and Transparency, Barcelona, Spain, January 27–30, 2020, pages 469–481, Barcelona, Spain, 2020. ACM.
[123] Barbara F.
Reskin and Debra Branch McBrier. Why Not Ascription? Organizations' Employment of Male and Female Managers. American Sociological Review, 65(2):210–233, 2000.
[124] Retaking the SAT, Manhattan Review. https://www.manhattanreview.com/sat-retaking.
[125] Dan-Olof Rooth. Automatic associations and discrimination in hiring: Real world evidence. Labour Economics, 17(3):523–534, 2010.
[126] Walter Rudin. Real and Complex Analysis. McGraw-Hill, 1986.
[127] Melody S. Sadler, Joshua Correll, Bernadette Park, and Charles M. Judd. The world is not black and white: Racial bias in the decision to shoot in a multiethnic context. Journal of Social Issues, 68(2):286–313, 2012.
[128] Sahitya parishad demands entrance tests in Gujarati, September 2011. https://web.archive.org/web/20120106140750/http://articles.timesofindia.indiatimes.com/2011-09-21/education/30184151_1_regional-languages-entrance-tests-raghuveer-chaudhary.
[129] Jad Salem and Swati Gupta. Closing the gap: Group-aware parallelization for the secretary problem with biased evaluations. Available at SSRN 3444283, 2019.
[130] Howard Schuman, Charlotte Steeh, Lawrence Bobo, and Maria Krysan. Racial Attitudes in America: Trends and Interpretations. Social Trends in the United States. Harvard University Press, 1985.
[131] Deepa Seetharaman. Facebook Is Testing the Rooney Rule Approach to Hiring. The Wall Street Journal, June 2015.
[132] Herbert A. Simon. Bounded rationality and organizational learning. Organization Science, 2(1):125–134, 1991.
[133] Mohit Singh and Nisheeth K. Vishnoi. Entropy, optimization and counting. In Proceedings of the 46th Annual ACM Symposium on Theory of Computing, pages 50–59. ACM, 2014.
[134] Aarzoo Snigdha. JEE Main 2021: Over 45,000 Students Writing Exam In Regional Languages. NDTV, February 2021. https://www.ndtv.com/education/jee-main-2021-over-45000-students-writing-exam-in-regional-languages-2378496.
[135] Thomas Sowell. Affirmative Action Around the World: An Empirical Study. Yale University Press, 2008.
[136] Damian Straszak and Nisheeth K. Vishnoi. Maximum entropy distributions: Bit complexity and stability. COLT, 2019.
[137] Damian Straszak and Nisheeth K. Vishnoi. Iteratively reweighted least squares and slime mold dynamics: Connection and convergence. Mathematical Programming, 194(1):685–717, 2022.
[138] R. H. Thaler. Misbehaving: The Making of Behavioral Economics. W. W. Norton, 2015.
[139] The Cost of a Bad Hire, February 2019. https://www.northwestern.edu/hr/about/news/february-2019/the-cost-of-a-bad-hire.html.
[140] The Utter Uselessness of Job Interviews, April 2017. https://www.nytimes.com/2017/04/08/opinion/sunday/the-utter-uselessness-of-job-interviews.html.
[141] UP govt free coaching for JEE, NEET aspirants begins today, February 2021. https://www.livemint.com/education/news/up-govt-free-coaching-for-jee-neet-aspirants-begins-today-11613449583134.html.
[142] A. D. Van Knippenberg, A. P. Dijksterhuis, and Diane Vermeulen. Judgement and memory of a criminal act: The effects of stereotypes and cognitive load. European Journal of Social Psychology, 29(2–3):191–201, 1999.
[143] Nisheeth K. Vishnoi. Algorithms for Convex Optimization. Cambridge University Press, 2021.
[144] J. von Neumann, O. Morgenstern, H. W. Kuhn, and A. Rubinstein. Theory of Games and Economic Behavior: 60th Anniversary Commemorative Edition. Princeton Classic Editions. Princeton University Press, 2007.
[145] Joseph Walker. Meet the New Boss: Big Data, September 2012.
https://www.wsj.com/articles/SB10000872396390443890304578006252019616768.
[146] Thomas E. Weisskopf. Impact of Reservation on Admissions to Higher Education in India. Economic and Political Weekly, 39(39):4339–4349, 2004.
[147] Christine Wennerås and Agnes Wold. Nepotism and Sexism in Peer-Review. Nature, 387(6631):341–343, May 1997.
[148] Wikipedia contributors. Maximum entropy probability distribution. Wikipedia, The Free Encyclopedia, 2023. https://en.wikipedia.org/w/index.php?title=Maximum_entropy_probability_distribution&oldid=1154992719. [Online; accessed 16-May-2023].
[149] Joan C. Williams. Double jeopardy? An empirical study with implications for the debates over implicit bias and intersectionality. Harvard Journal of Law & Gender, 37:185, 2014.
[150] Sarah Wood. Colleges With The Highest Application Fees, January 2022. https://www.usnews.com/education/best-colleges/the-short-list-college/articles/colleges-with-the-highest-application-fees.
[151] Ruixun Zhang, Thomas J. Brennan, and Andrew W. Lo. The origin of risk aversion. Proceedings of the National Academy of Sciences, 111(50):17777–17782, 2014.
[152] Jonathan C. Ziegert and Paul J. Hanges. Employment discrimination: The role of implicit attitudes, motivation, and a climate for racial bias. Journal of Applied Psychology, 90(3):553–562, 2005.

Contents
1 Introduction
3 Theoretical results
4 Empirical results
5 Conclusion, limitations, and future work
A Other related work
B Specific examples of our model
C Characterization of optimal solution to the optimization problem
C.1 The primal optimization problem and assumptions
C.2 Dual formulation and weak duality
C.3 Strong duality
C.4 Properties of the integral I(x)
C.5 Positivity of the optimal dual variable
C.6 Optimality conditions and uniqueness of optimal solution
C.7 Conditions under which optimal solution is f_D
D Derivation of output density for a single individual
E Connection to the Gibbs equation
F Effect of changing the resource-information parameter τ
F.1 Effect of decreasing τ
F.2 Effect of increasing τ
G Effect of changing the risk-averseness parameter α
H Gaussian density
I Pareto density
J Exponential density
K Laplace density
L Implementation details and additional empirical results
L.1 Case Study: Evaluating bias-mitigating interventions in IIT-JEE admissions
L.2 Additional plots for simulations in Section 4
L.3 Implementation details
L.3.1 Our framework and other models
L.3.2 Computational resources used
L.3.3 JEE-2009 Scores
L.3.4 Semantic Scholar Open Research Corpus

A Other related work

Models of bias in Economics.
There are two prominent models of discrimination in the Economics literature: taste-based discrimination and statistical discrimination [114]. These capture different types of biases [117, 45, 49, 12, 22, 41, 11] (also see [66, 25, 114]). Taste-based discrimination [22] models explicit biases (e.g., racism and sexism): in the vanilla taste-based discrimination model, individuals are divided into two groups and a decision-maker pays an additional cost for interacting with individuals in the disadvantaged group. This additional additive cost diminishes the value of disadvantaged individuals for the decision-maker. While we do not model explicit biases, such an additive bias also arises in our model of evaluation processes for specific parameter choices, suggesting that additive biases may also arise due to resource constraints and risk averseness (Appendix C). Statistical discrimination models how group-wise disparities in the noise in the inputs to a Bayesian decision-maker propagate to systematic disparities in decisions [117, 11]. Our mathematical model can be viewed as giving an explanation of why such disparities may arise in the input.

Implicit biases in Psychology. There is a long and rich history of the study of implicit (and explicit) biases in Psychology, e.g., [85, 101, 74, 104, 69, 127, 149]. This body of work proposes various theories about why implicit biases arise [69] and about their relation to real-world stimuli [130, 63, 42]; we refer the reader to [73, 87] for an overview. [69] explains that the ease of categorizing individuals into categories (defined by, e.g., color, race, or gender) provides an incentive for the evaluator to use their prior (possibly biased) knowledge, and this leads to implicit biases. In contrast, we show that even when the evaluator has the same prior estimate for all social groups, biases can arise in evaluation processes when the information-to-resource trade-off or the degree of risk averseness differs across groups. Further, since resource constraints and risk averseness are not specific to the setting of a single evaluator, our model of evaluation processes also covers scenarios where the evaluator is a group or an organization.

Optimization and human behavior. The use of optimization to model human behavior dates back to (at least) von Neumann and Morgenstern's work, which showed that, under a small number of assumptions, an evaluator behaves as if they are optimizing the expected value of a utility function [144]. Since then, numerous apparent deviations from this theory have been discovered [84]. These deviations, broadly referred to as irrational behavior or cognitive biases, laid the foundation of Behavioral Economics [138]. Several theories have been proposed to account for the deviations of human behavior from utility maximization: prospect theory, which models the risk averseness of individuals (the empirical observation that humans process losses and gains of equal magnitude, monetary or otherwise, asymmetrically) [84]; bounded rationality, which reconciles irrational human behavior by proposing that humans solve the underlying optimization problems approximately instead of optimally [132]; and resource-rational analysis, which proposes that humans trade off the utility of a solution against the costs (e.g., effort and time) required to find a solution of higher utility [100]. These works model hidden costs and constraints on humans that lead to deviations from traditional rational utility maximization.
Like this work, these works also use optimization to explain human behavior, but while they focus on irrational behaviors and cognitive biases, our work focuses on biases with respect to socially-salient attributes in evaluation processes.

Other entropy-based models. Maximum entropy distributions have been widely deployed in machine learning [56] and theoretical computer science [133], and have been shown to be stable [136]. The maximum-entropy framework over discrete distributions has been used to preprocess data to debias it [39]. In our optimization program, we use entropy to measure the amount of information in a distribution; this appears as a constraint in our program. An entropy-constrained viewpoint is also prevalent in explanations of other phenomena: various optimization algorithms can be captured using an optimization- and entropy-based viewpoint [9, 96, 143], such a viewpoint arises in explanations of biological phenomena [43], and it leads to PageRank and other popular random-walk-based procedures [105]. Finally, taking an optimization viewpoint when studying a dynamical system has additional benefits, such as providing a potential function that gives an efficient and interpretable way of tracking the progress of a complex dynamical system [137].

[Figure 2: Plots of the risk-averse loss function ℓα(x, v) for α ∈ {1, 2, 3, 4}. Panel (a): ℓα(x, 0) when ℓ(x, v) = (x − v)². Panel (b): ℓα(x, 5) when ℓ(x, v) = ln x − ln v.]

B Specific examples of our model

In this section, we present some specific examples of mechanisms by which information constraints and risk aversion lead to bias. One context is college admissions. It is well known that SAT scores are implicitly correlated with income (and, hence, test preparation) in addition to student ability [1]. While the true ability may be v, the score is skewed depending on the amount/quality of test preparation, which depends on socioeconomic status that may be correlated with socially-salient attributes. The parameter α in our model can be used to encode this. As for τ, while an evaluator may know what a GPA means at certain universities well known to them, they may not understand what a GPA means for students from lesser-known schools. This lack of knowledge can be overcome, but doing so takes effort and time, and without that effort the status quo is entrenched.

Another example is the evaluation of candidates using a standardized test. In time-constrained settings, a high value of the resource-information parameter τ for the disadvantaged group indicates that such candidates may not be able to comprehend a question as well as someone from an advantaged group. This could be due to various factors, including less familiarity with the language used in the test or the pattern of questions, as opposed to someone who had the resources to invest in a training program for the test. Similarly, a high value of the risk-averseness parameter captures that an evaluator, when faced with a choice of awarding low or high marks to an answer given by a candidate from the disadvantaged group, is less likely to give high marks. More concretely, suppose there are several questions in a test, where each question is graded either 0 or 1. Assume that a candidate has true utility v ∈ [0, 1] and, hence, would have received an expected score of v on each question if one were allowed to award grades in the continuous range [0, 1]. However, the fact that the true scores have to be rounded to either 0 or 1 can create a bias against the disadvantaged group. Indeed, the probability that an evaluator rounds such an answer to 1 may be less than v; the risk-averseness parameter measures the extent to which this probability gets scaled down.
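To make this rounding mechanism concrete, here is a minimal simulation. It is illustrative only: the scaling rule p = v/α is one simple way to instantiate the risk-averseness parameter for this example and is not the calibrated model used in our experiments.

```python
import random

def observed_score(v, alpha, num_questions=50, rng=random):
    # Each question is graded 0/1. Without risk averseness the answer would
    # be rounded up to 1 with probability v; risk averseness scales this
    # probability down (here, by the illustrative factor 1/alpha).
    p = v / alpha
    return sum(rng.random() < p for _ in range(num_questions)) / num_questions

random.seed(0)
v = 0.7  # identical true utility in both groups
advantaged = [observed_score(v, alpha=1.0) for _ in range(10_000)]
disadvantaged = [observed_score(v, alpha=1.5) for _ in range(10_000)]

print(sum(advantaged) / len(advantaged))        # ~0.70: unbiased
print(sum(disadvantaged) / len(disadvantaged))  # ~0.47: skewed downward
```

Although both groups have the same true utility v, the group evaluated with the larger risk-averseness parameter receives systematically lower observed scores.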
Most of the prior works on interventions in selection settings have focused on adding representational constraints for disadvantaged groups. Such constraints, often framed as a form of affirmative action, can be beneficial but may not be possible to implement in certain contexts. For instance, in a landmark 2023 ruling, the US Supreme Court effectively prohibited the use of race-based affirmative action in college admissions [2]. Our work, via a more refined model of how bias might arise at a population level in evaluation processes, allows for evaluating additional interventions that focus on procedural fairness; this allows working towards diversity and equity goals without placing affirmative-action-like constraints. In this framing, we can consider decreasing either α or τ. Improving either would work towards equity, but which one to target, to what extent, and via which method is context-dependent and varies in cost. A decrease in α can be achieved by reducing risk-averseness in the evaluation process, e.g., by investing in better facilities for disadvantaged groups, or by making the evaluation process blind to group membership. A reduction in τ may follow from allocating additional resources to the evaluation process, e.g., by letting a candidate choose an evaluation in their native language. Our framework allows a policymaker to study these trade-offs, and we discuss specific examples in Appendix L.

C Characterization of optimal solution to the optimization problem

In this section, we present the formal version of Theorem 3.1. We first state the general conditions under which this result is true and give an outline of its proof. Recall that an instance of the optimization problem is given by a tuple I := (Ω, f_D, ℓ, α, τ), where Ω is a closed interval in R (i.e., Ω is one of [a, b], [a, ∞), (−∞, b], or (−∞, ∞) for suitable real values a and b), f_D is the density of the true utility and is Lebesgue measurable, ℓ is the loss function, τ is the resource-information parameter, and α ≥ 1 is the risk-averseness parameter. The main result, Theorem C.1, states that under mild conditions the instance I has a unique optimal solution. We present applications of Theorem C.1 when the input density is Gaussian, Pareto, Exponential, and Laplace in Sections H, I, J, and K respectively.

In Section C.1, we state the primal optimization formulation and the assumptions needed by Theorem C.1. Clearly, we need the instance I to have a non-empty set of solutions, i.e., there must exist a density of entropy at least τ (assumption (A0)). In Section C.2, we state the dual of the convex program Primal Opt and show that weak duality holds. This proof requires integrability of the loss function with respect to the measure induced by the density f_D (assumption (A3)); this assumption also ensures that the optimal value does not become infinite. In Section C.3, we show that strong duality holds, using Slater's condition. This proof also shows that there exist optimal Lagrange dual variables. However, it does not show that there is an optimal solution f* to the primal convex program.
In fact, one can construct examples involving natural loss functions and densities where strong duality holds but an optimal solution does not exist. In order to prove the existence of an optimal solution f* to the instance I, we show that the optimal Lagrange dual variable γ* (corresponding to the entropy constraint in Primal Opt) is strictly positive. This proof requires several technical steps. We first study properties of the function I_{f_D,ℓ,α}(x) in Section C.4, which is the expected loss if the estimated value is x. It is easy to verify that the objective function of Primal Opt, Err_{ℓ,α}(f_D, f), is the integral of I_{f_D,ℓ,α}(x) f(x) over Ω. Therefore, the optimal solution f*(x) places higher probability mass on regions where I_{f_D,ℓ,α}(x) is small. Under natural monotonicity conditions on the loss function (assumption (A1)), we show that I_{f_D,ℓ,α}(x) can be expressed as a sum of an increasing and a decreasing function. This decomposition shows that for natural loss functions, e.g., those which are concave or convex in each coordinate, the function I_{f_D,ℓ,α}(x) is unimodal; we state this as an assumption (A5) to take care of general loss function settings.

In Section C.5, we show that γ* > 0. This proof hinges on the following result: for every instance I = (Ω, f_D, ℓ, α, τ) with finite optimal value, there is a positive constant η such that the optimal value is at least I_{f_D,ℓ,α}(x*) + η, where x* is the minimizer of I_{f_D,ℓ,α}(x). This requires unimodality of the function I_{f_D,ℓ,α}(x) and bounded mean (and median) of f_D (assumption (A4)). Somewhat surprisingly, this result also requires that the loss function ℓ(v, x) grows at least logarithmically as a function of |v − x| (assumption (A2)). Without this logarithmic growth, an optimal solution need not exist: it may happen that there are near-optimal solutions that place vanishingly small probability mass outside the global minimum x* of I_{f_D,ℓ,α}(x). Thus, even though we have an entropy constraint, there will be a sequence of solutions, whose error converges to the optimal value, that converges to a delta function.

In Section C.6, we use the positivity of γ* to explicitly write down an expression for the optimal primal solution. We use strict convexity of the feasible region of Primal Opt to show that the optimal solution is unique. Finally, in Section C.7, we show that if f_D has bounded entropy, then there is a suitable choice of the loss function ℓ and the parameters α and τ such that we recover f_D as the unique optimal solution of the resulting instance.

C.1 The primal optimization problem and assumptions

We re-write the optimization problem Opt Prog here. An instance I of this problem is given by a tuple I := (Ω, f_D, ℓ, α, τ), where Ω is a closed interval in R, f_D is the density of the true utility and is Lebesgue measurable, ℓ is the loss function, τ is the resource-information parameter, and α ≥ 1 is the risk-averseness parameter. Recall that µ is the Lebesgue measure on R. Note that Ω can be of the form (−∞, ∞), [a, ∞) for a real a, (−∞, b] for a real b, or [a, b] for reals a < b. The primal problem for I = (Ω, f_D, ℓ, α, τ) is as follows. Here, Dom denotes the set {f : Ω → R≥0}.

min_{f ∈ Dom} Err_{ℓ,α}(f_D, f) := ∫_Ω ( ∫_Ω ℓα(x, v) f(x) dµ(x) ) f_D(v) dµ(v)   (Primal Opt)
such that Ent(f) := −∫_Ω f(x) ln f(x) dµ(x) ≥ τ,   (6)
∫_Ω f(x) dµ(x) = 1.   (7)
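For intuition, Primal Opt can be solved numerically after discretizing Ω on a grid. The sketch below is an assumption-laden illustration, not the implementation used in our experiments (see Appendix L for those details): it fixes Ω = [−6, 6], a standard Gaussian f_D, the squared loss with risk-averseness α applied to overestimates, and hands the resulting convex program to the off-the-shelf solver cvxpy.

```python
import numpy as np
import cvxpy as cp
from scipy.stats import norm

# Discretize Omega = [-6, 6] on a uniform grid (illustrative choices).
x = np.linspace(-6, 6, 400)
dx = x[1] - x[0]
fD = norm.pdf(x)           # true-utility density f_D (standard Gaussian)
alpha, tau = 2.0, 1.0      # risk-averseness and resource-information parameters

# I(x) = E_{v ~ f_D}[ l_alpha(x, v) ] with l(x, v) = (x - v)^2, where the
# loss is multiplied by alpha whenever x > v (i.e., on over-estimates).
L = (x[:, None] - x[None, :]) ** 2                       # L[i, j] = l(x_i, v_j)
L_alpha = np.where(x[:, None] > x[None, :], alpha * L, L)
I = (L_alpha * fD[None, :]).sum(axis=1) * dx

f = cp.Variable(len(x), nonneg=True)        # density values on the grid
objective = cp.Minimize(cp.sum(cp.multiply(I, f)) * dx)
constraints = [
    cp.sum(f) * dx == 1,                    # f integrates to one, cf. (7)
    cp.sum(cp.entr(f)) * dx >= tau,         # differential entropy >= tau, cf. (6)
]
cp.Problem(objective, constraints).solve()
print("observed mean:", float(np.sum(x * f.value) * dx))  # below 0 for alpha > 1
```

Even though f_D has mean 0, the output density's mean is negative for α > 1, reflecting the downward skew that the theory below establishes in general.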
We state the assumptions and justify each of them:

A0 (Strict feasibility of the instance) The interval Ω has length strictly larger than e^τ. This is satisfied trivially if Ω is infinite.
Remark: In order to ensure that the set of feasible solutions to an instance of Primal Opt is non-empty, we require that Ω has length at least e^τ; otherwise even the uniform distribution on Ω has entropy less than τ. The strict inequality is needed to ensure strong duality (Slater's condition).

A1 (Monotonicity of the loss function) We assume that the loss function ℓ : Ω × Ω → R is continuous and that ℓ(x, x) = 0 for all x ∈ Ω. We consider two types of loss functions, TYPEP and TYPEN. The TYPEP loss functions have the following monotonicity property: for any fixed v ∈ Ω, ℓ(v, x) strictly increases as |v − x| increases. It follows that ℓ(v, x) ≥ 0, with equality if and only if v = x. The TYPEN loss functions have the following property: for a fixed v ∈ Ω, ℓ(v, x) is a strictly increasing function of x. It follows that ℓ(v, x) ≥ 0 if x ≥ v and ℓ(v, x) < 0 if x < v. For example, ℓ(x, v) = (x − v)² and |x − v| are of TYPEP, whereas ℓ(x, v) = ln x − ln v and (x − v) are of TYPEN.
Remark: These are natural monotonicity properties. A TYPEP loss function ensures that the optimal density of an instance I of Primal Opt assigns higher values to points where the input density f_D is more concentrated. A TYPEN loss function ensures that the optimal density does not place high probability mass at points that are much larger than the mean of f_D.

A2 (Growth rate of the loss function) We would like the loss function ℓ(x, v) to grow reasonably rapidly as |x − v| increases; it turns out that the right growth rate is at least logarithmic. For instance, we could require that |ℓ(x, v)| is Ω(|ln(x − v)|). However, unless we assume some form of the triangle inequality, such a lower bound would not imply a lower bound on |ℓ(x₂, v) − ℓ(x₁, v)| for x₂, x₁, v ∈ Ω. Thus, we make the following assumption: there is a constant C such that for all x₂, x₁, v ∈ Ω with |x₁ − x₂| ≥ C and v ∉ (x₁, x₂),
|ℓ(x₂, v) − ℓ(x₁, v)| ≥ ln |x₂ − x₁| − θ_{x₁},
where θ_{x₁} is a value which depends on x₁ only. For example, when ℓ(x, v) = (x − v), the above is satisfied with θ_x = 0 and C = 1; and when ℓ(x, v) = ln x − ln v, the above property holds with θ_x = ln x and C = 0.
Remark: This is a subtle condition, and it is needed to ensure that an optimal solution exists. Consider, for example, an instance I with Ω = [0, ∞), f_D the exponential density, and ℓ(x, v) = ln ln(x + 2) − ln ln(v + 2) (the +2 ensures that ln(x + 2) does not become negative); the parameters α and τ can be arbitrary. Let ε > 0 be an arbitrarily small constant. Now, consider a solution that places 1 − ε probability mass at x = 0 and spreads the remaining ε probability mass uniformly over an interval of length e^{τ/ε} to achieve entropy τ. Since the loss function grows slowly, the expected loss of this solution decreases with decreasing ε. Thus, even though strong duality holds in this instance, an optimal solution does not exist. In fact, we have a sequence of solutions, with error converging to the optimal value of I, converging to the delta function at 0.

A3 (Integrability of the loss function) We assume that the function ℓ(x, v) f_D(v) is in L¹(µ) for every x ∈ Ω. In other words, for each x ∈ Ω,
∫_Ω |ℓ(x, v)| f_D(v) dµ(v) < ∞.
Remark: This is needed in order to carry out basic operations like swapping integrals (Fubini's theorem) on the objective function. Further, it ensures that the objective function does not become infinite under reasonable conditions.

A4 (Bounded mean and half-radius) We assume that the true utility density f_D has a bounded mean m.
Moreover, we assume that there is a finite value R such that at least half of the probability mass of f_D lies in the range [m − R, m + R], i.e.,
∫_{Ω ∩ [m−R, m+R]} f_D(v) dµ(v) ≥ 1/2.
Remark: This condition ensures that f_D decays at a reasonable rate (though it is much milder than requiring bounded variance).

A5 (Unique global minimum for the estimated loss function) Let I_{f_D,ℓ,α}(x) denote the expected loss when the estimated value is x, i.e.,
I_{f_D,ℓ,α}(x) := ∫_Ω ℓα(x, v) f_D(v) dµ(v).
We assume that this function has a unique minimum on Ω. Moreover, if x* denotes argmin_{x∈Ω} I_{f_D,ℓ,α}(x), we also assume that at all other local minima x of this function, I_{f_D,ℓ,α}(x) is larger than I_{f_D,ℓ,α}(x*) by a fixed constant. We state this condition formally as follows: given any δ > 0, there is an ε_δ > 0 such that for all x satisfying |x − x*| > δ, we have I_{f_D,ℓ,α}(x) ≥ I_{f_D,ℓ,α}(x*) + ε_δ.
Remark: This condition is needed to ensure the uniqueness of the optimal solution. The optimal density for an instance I = (Ω, f_D, ℓ, α, τ) tries to place higher probability mass in regions where I_{f_D,ℓ,α}(x) is low. Therefore, a unique global minimum (and well-separatedness from other local minima) is needed to ensure that the optimal solution is unique. When the loss function is TYPEN, this condition is always satisfied (unless the optimal value is −∞). For a TYPEP loss function, this condition is satisfied if the loss function ℓ(x, v) is concave or convex in x (for each fixed value of v), e.g., when ℓ(x, v) = (x − v)², |x − v|, or |ln x − ln v|.

The following is the formal version of Theorem 3.1.

Theorem C.1 (Characterization of optimal density). Consider an instance I of the optimization problem Primal Opt defined on a closed interval Ω ⊆ R. Let ℓ, α, τ, f_D denote the loss function, risk-averseness parameter, resource-information parameter, and the density of the true utility, respectively, in I. If assumptions (A0)–(A5) are satisfied, then there is a unique solution f* to the instance I. This solution satisfies
f*(x) ∝ exp( −I_{f_D,ℓ,α}(x) / γ* ),
where γ* is the Lagrange variable corresponding to the entropy constraint (6) and is strictly positive. Moreover, Ent(f*) = τ.

C.2 Dual formulation and weak duality

Let γ ≥ 0 and ϕ ∈ R denote the Lagrange variables for the constraints (6) and (7) respectively. Then, the Lagrangian is
L(f, γ, ϕ) := Err_{ℓ,α}(f_D, f) + γ(τ − Ent(f)) + ϕ( ∫_Ω f(x) dµ(x) − 1 ).
Given γ ≥ 0 and ϕ, define
g(γ, ϕ) := inf_{f ∈ Dom} L(f, γ, ϕ).   (9)
The dual problem is
max_{γ ∈ R≥0, ϕ ∈ R} g(γ, ϕ).   (Dual Opt)
We first show weak duality.

Theorem C.2 (Weak duality). Consider an instance I = (Ω, f_D, ℓ, α, τ) of Primal Opt satisfying assumption (A3). Let g(γ, ϕ) be as defined in (9). Then,
max_{γ ∈ R≥0, ϕ ∈ R} g(γ, ϕ) ≤ min_{f feasible for Primal Opt} Err_{ℓ,α}(f_D, f).

Proof. We assume that the primal problem has a feasible solution; otherwise the desired inequality follows trivially. Let f be a feasible solution to Primal Opt and let γ, ϕ be real values with γ ≥ 0. Then,
g(γ, ϕ) ≤ L(f, γ, ϕ) = Err_{ℓ,α}(f_D, f) + γ(τ − Ent(f)) + ϕ( ∫_Ω f(x) dµ(x) − 1 ) ≤ Err_{ℓ,α}(f_D, f),
where the first inequality follows from the definition of g(γ, ϕ), and the last inequality follows from the fact that f is a feasible solution to the instance I of Primal Opt.

C.3 Strong duality

Theorem C.3 (Strong duality). Consider an instance I = (Ω, f_D, ℓ, α, τ) of Primal Opt satisfying assumptions (A0), (A1), and (A3). Let f* and (γ*, ϕ*) be optimal solutions to Primal Opt and Dual Opt respectively. Then g(γ*, ϕ*) = Err_{ℓ,α}(f_D, f*).

Proof. We first observe that there is a feasible solution to the instance I.
This is so because, by assumption (A0), Ω must contain an interval I of length e^τ. Now, we define f as the uniform distribution on I. Then Ent(f) = τ, and hence f is a feasible solution to the instance I. We now argue that Err_{ℓ,α}(f_D, f) is finite.

Claim C.4. Consider the function f and the interval I defined above. Then Err_{ℓ,α}(f_D, f) < ∞.

Proof. Let a and b denote the left and right end-points of the interval I. For any x ∈ I and v ∈ Ω, we claim that ℓα(x, v) ≤ max(ℓα(a, v), ℓα(b, v)). First, consider the case when the loss function is TYPEP. If v is at least x, then ℓ(x, v) ≤ ℓ(a, v); otherwise ℓ(x, v) ≤ ℓ(b, v). If the loss function is TYPEN, then we know that it is an increasing function of x, and therefore ℓ(x, v) ≤ ℓ(b, v). Thus, we see that for any x ∈ I and v ∈ Ω, ℓα(x, v) ≤ max(ℓα(a, v), ℓα(b, v)) ≤ |ℓα(a, v)| + |ℓα(b, v)|. Therefore (recall that f is the uniform distribution on I),
Err_{ℓ,α}(f_D, f) ≤ ∫_Ω |ℓα(a, v)| f_D(v) dµ(v) + ∫_Ω |ℓα(b, v)| f_D(v) dµ(v) < ∞,
where the last inequality follows from assumption (A3).

Thus, the optimal value of Primal Opt is either finite or −∞. If it is −∞, then weak duality implies that Dual Opt has optimal value −∞ as well, and we have shown strong duality in this case. For the rest of the proof, assume that the optimal value, p*, of Primal Opt is finite. We show strong duality using Slater's condition. Towards this, we define two sets A and B, both contained in R³:
A := { (u, v, t) : u ≥ τ − Ent(f), v = ∫_Ω f(x) dµ(x) − 1, t ≥ Err_{ℓ,α}(f_D, f), for some f ∈ Dom },
and
B := { (0, 0, q) : q < p* }.
It is easy to see that B is convex. We show that A is also convex.

Claim C.5. The set A defined above is a convex subset of R³.

Proof. The proof follows from the fact that Ent(f) is a concave function of f and Err_{ℓ,α}(f_D, f) is linear in f. Formally, suppose (u₁, v₁, t₁), (u₂, v₂, t₂) ∈ A and λ ∈ [0, 1]. We need to show that (u, v, t) := λ(u₁, v₁, t₁) + (1 − λ)(u₂, v₂, t₂) ∈ A. By the definition of the set A, there exist f₁, f₂ ∈ Dom such that
uᵢ ≥ τ − Ent(fᵢ), vᵢ = ∫_Ω fᵢ(x) dµ(x) − 1, tᵢ ≥ Err_{ℓ,α}(f_D, fᵢ), for i ∈ {1, 2}.
Consider f = λf₁ + (1 − λ)f₂; then f is also in Dom. Further,
Err_{ℓ,α}(f_D, f) = λ Err_{ℓ,α}(f_D, f₁) + (1 − λ) Err_{ℓ,α}(f_D, f₂) ≤ λt₁ + (1 − λ)t₂ = t.
Similarly, since y ln y is convex, Ent(f) ≥ λ Ent(f₁) + (1 − λ) Ent(f₂), and hence u = λu₁ + (1 − λ)u₂ ≥ τ − Ent(f). Thus, (u, v, t) ∈ A.

We argue that A and B are disjoint. Indeed, otherwise there is an f ∈ Dom such that f is a density, Ent(f) ≥ τ, and Err_{ℓ,α}(f_D, f) < p*, a contradiction. By the hyperplane separation theorem [34, 143], there is a hyperplane (γ̃, ϕ̃, ν̃) · (x₁, x₂, x₃) = c such that A and B lie on different sides of it: for every (x₁, x₂, x₃) ∈ A, γ̃x₁ + ϕ̃x₂ + ν̃x₃ ≥ c, and for every (x₁, x₂, x₃) ∈ B, γ̃x₁ + ϕ̃x₂ + ν̃x₃ ≤ c. Therefore, for each f ∈ Dom,
γ̃(τ − Ent(f)) + ϕ̃( ∫_Ω f(x) dµ(x) − 1 ) + ν̃ Err_{ℓ,α}(f_D, f) ≥ c,   (10)
and, letting q → p* in B,
ν̃ p* ≤ c.   (11)

Claim C.6. γ̃ and ν̃ are non-negative.

Proof. Suppose, for the sake of contradiction, that γ̃ < 0. Consider the f ∈ Dom given by Claim C.4. Choose values (x₁, x₂, x₃) as follows: x₁ is a large enough value greater than τ − Ent(f), x₂ = 0, and x₃ = Err_{ℓ,α}(f_D, f); Claim C.4 shows that x₃ is finite. Since (x₁, x₂, x₃) ∈ A, γ̃x₁ + ν̃x₃ ≥ c. But since γ̃ < 0, we can choose x₁ large enough to make γ̃x₁ + ν̃x₃ go below c, which is a contradiction. Similarly, we can show that ν̃ ≥ 0.

Thus, two cases arise: (i) ν̃ > 0, or (ii) ν̃ = 0. First, consider the case ν̃ > 0, and define γ* = γ̃/ν̃ and ϕ* = ϕ̃/ν̃.
Inequalities (10) and (11) show that for all f ∈ Dom,
γ*(τ − Ent(f)) + ϕ*( ∫_Ω f(x) dµ(x) − 1 ) + Err_{ℓ,α}(f_D, f) ≥ p*,
i.e., g(γ*, ϕ*) ≥ p*. It follows from weak duality (Theorem C.2) that g(γ*, ϕ*) = p*, and so the dual optimal value is equal to p*.

Now consider the case ν̃ = 0. Again, inequalities (10) and (11) show that for all f ∈ Dom,
γ̃(τ − Ent(f)) + ϕ̃( ∫_Ω f(x) dµ(x) − 1 ) ≥ 0.
Next, we observe that there is an f₀ ∈ Dom such that f₀ is a density and Ent(f₀) > τ: indeed, let f₀ be the uniform distribution over an interval of length strictly larger than e^τ (such an interval exists by assumption (A0)). Substituting f = f₀ in the above inequality, and assuming γ̃ > 0, the l.h.s. becomes strictly less than 0, which is a contradiction. Therefore, it must be the case that γ̃ = 0. Hence, we see that for all f ∈ Dom,
ϕ̃( ∫_Ω f(x) dµ(x) − 1 ) ≥ 0.
Since the three quantities γ̃, ϕ̃, ν̃ cannot all be 0, it must be the case that ϕ̃ ≠ 0. But for any density f₀ ∈ Dom, by suitably scaling it by a positive real, we can make the quantity ∫_Ω f(x) dµ(x) − 1 strictly larger or strictly smaller than 0, which is again a contradiction. Hence, we conclude that ν̃ cannot be 0. This completes the proof of strong duality.

Corollary C.7. Consider an instance I = (Ω, f_D, ℓ, α, τ) of Primal Opt and assume that the optimal value p* is finite. Then there exists a solution (γ*, ϕ*) to Dual Opt such that p* = g(γ*, ϕ*).

Proof. This follows from the proof of Theorem C.3: when p* is finite, the parameter ν̃ > 0 and, hence, γ* = γ̃/ν̃ and ϕ* = ϕ̃/ν̃ as defined in that proof satisfy g(γ*, ϕ*) = p*.

We would now like to prove that, assuming the optimal primal value is finite, the optimal dual variable γ* is non-zero. This allows us to write an explicit expression for an optimal solution to Primal Opt. We first need to understand the properties of the following integral, which was defined in assumption (A5):
I_{f_D,ℓ,α}(x) = ∫_Ω ℓα(x, v) f_D(v) dµ(v).
When the parameters f_D, ℓ, α are clear from the context, we shall often abbreviate I_{f_D,ℓ,α}(x) as I(x).

C.4 Properties of the integral I(x)

We study some key properties of the integral I(x), assuming throughout this section that (A0)–(A4) hold. The integral I(x) can be split into two parts:
I(x) = α ∫_{Ω ∩ (−∞, x]} ℓ(x, v) f_D(v) dµ(v) + ∫_{Ω ∩ [x, ∞)} ℓ(x, v) f_D(v) dµ(v),
where we denote the first term by I_L(x) and the second by I_R(x).

Lemma C.8. The integral I_L(x) is a strictly increasing function of x. Further,
I_L(x₂) − I_L(x₁) ≥ α(ln(x₂ − x₁) − θ_{x₁}) F_D(x₁)
for all x₂, x₁ ∈ Ω satisfying x₂ − x₁ ≥ C. Here, F_D denotes the cumulative distribution function (c.d.f.) of f_D, and the parameter θ_{x₁} is from assumption (A2).

Proof. Consider x₁, x₂ ∈ Ω with x₁ < x₂. Then
I_L(x₂) − I_L(x₁) = α ∫_{Ω ∩ (−∞, x₁]} (ℓ(x₂, v) − ℓ(x₁, v)) f_D(v) dµ(v) + α ∫_{Ω ∩ [x₁, x₂]} ℓ(x₂, v) f_D(v) dµ(v).
Since, by assumption (A1), ℓ(x₂, v) > ℓ(x₁, v) for all v ≤ x₁ and ℓ(x₂, v) ≥ 0 for all v ≤ x₂, we see that I_L(x₂) > I_L(x₁). Now suppose x₂ − x₁ ≥ C. Then ℓ(x₂, v) − ℓ(x₁, v) ≥ ln |x₂ − x₁| − θ_{x₁}, and therefore, using (A2), the first integral above is at least
α(ln |x₂ − x₁| − θ_{x₁}) ∫_{Ω ∩ (−∞, x₁]} f_D(v) dµ(v) = α(ln(x₂ − x₁) − θ_{x₁}) F_D(x₁).

Lemma C.9. The integral I_R(x) is a strictly increasing function of x when the loss function is TYPEN, and a strictly decreasing function of x when the loss function is TYPEP.

Proof. Consider x₁, x₂ ∈ Ω with x₁ < x₂. Then
I_R(x₂) − I_R(x₁) = −∫_{Ω ∩ [x₁, x₂]} ℓ(x₁, v) f_D(v) dµ(v) + ∫_{Ω ∩ [x₂, ∞)} (ℓ(x₂, v) − ℓ(x₁, v)) f_D(v) dµ(v).
First, consider the case of a TYPEN loss function. Then for all v ≥ x₁, ℓ(x₁, v) ≤ ℓ(v, v) = 0, and hence the first integral above is non-negative.
We also know that ℓ(x₂, v) ≥ ℓ(x₁, v) for all v, and hence the second integral above is also non-negative. This shows that I_R(x) is an increasing function of x when the loss function is TYPEN. Now consider the case when the loss function is TYPEP. For any value v ≥ x₁, ℓ(x₁, v) ≥ ℓ(v, v) = 0; similarly, for v ≥ x₂, ℓ(x₂, v) < ℓ(x₁, v). Thus, I_R(x₂) < I_R(x₁) in this case.

Combining Lemma C.8 and Lemma C.9, we get the main result about the variation of I(x).

Theorem C.10 (Monotonicity of I(x) with respect to x). Assume that conditions (A0)–(A4) hold. The function I(x) is a continuous function of x. For a TYPEN loss function, I(x) is a monotonically strictly increasing function of x, with I(x) going to −∞ as x goes to −∞ (assuming Ω is unbounded from below). For a TYPEP loss function, I(x) is the sum of an increasing and a decreasing function, and has a global minimum on Ω. Further, I(x) goes to ∞ as x goes to −∞ or ∞ (assuming Ω is unbounded in the corresponding direction).

Proof. First, consider the case of TYPEN loss functions. It follows from Lemma C.8 and Lemma C.9 that I(x) is a strictly increasing function of x. The integrand in I_L(x) is I[v ≤ x] ℓ(x, v) f_D(v), where I[·] denotes the indicator function. Arguing as in the proof of Lemma C.8, the monotone convergence theorem [126] implies that I_L(x) goes to 0 as x goes to −∞. Similarly, I_R(x) goes to −∞ as x goes to −∞. In the case of a TYPEP loss function, I_L(x) goes to 0 and I_R(x) goes to ∞ as x goes to −∞; thus, I(x) goes to ∞. Similarly, I(x) goes to ∞ as x goes to ∞. Since I(x) is the sum of an increasing and a decreasing function, and goes to ∞ as x goes to −∞ or ∞, it must have a global minimum on Ω.

Corollary C.11. Consider an instance I := (Ω, f_D, ℓ, α, τ) of Primal Opt. If the loss function is TYPEP, then the optimal value is always finite. If the loss function is TYPEN and the optimal value is finite, then Ω is of the form [a, ∞) or [a, b] for some a, b ∈ R.

Proof. First, consider the case when the loss function is TYPEP. We exhibit a solution f with finite objective value. Let f be the uniform distribution over an interval A of length e^τ (by assumption (A0), such an interval always exists). Since I(x) is continuous, it achieves a maximum value on A; let this value be p. Then Err_{ℓ,α}(f_D, f) ≤ p. Thus, the optimal value for this instance is finite.

Now we prove the second assertion. Suppose, for the sake of contradiction, that the loss function is TYPEN and Ω is unbounded from below. We claim that the optimal value for this instance is −∞, which is a contradiction. To see this, consider a density f ∈ Dom which is uniform on an interval A = [s, t] of length e^τ (such an interval exists by assumption (A0)). The entropy of this density is τ and, hence, it is a feasible solution to the instance I. However,
Err_{ℓ,α}(f_D, f) = ∫_Ω I(x) f(x) dµ(x) = (1/|A|) ∫_A I(x) dµ(x) ≤ I(t).
Thus, we can keep moving the interval A to the left, which means that I(t) tends to −∞ (using Theorem C.10). This shows that the optimal value of this instance is −∞.

We shall use the following fact about the finiteness of the optimal value when condition (A5) is also satisfied.

Claim C.12. Consider an instance I = (Ω, f_D, ℓ, α, τ) of Primal Opt satisfying conditions (A0)–(A5). Then the optimal value is always finite.

Proof. If the loss function is TYPEP, this follows from Corollary C.11. In the case of a TYPEN loss function, the optimal value is finite unless Ω is unbounded from below. In this case, Theorem C.10 shows that argmin_{x∈Ω} I(x) does not exist, and hence assumption (A5) is violated.
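As a quick sanity check on the unimodality assumption (A5), the snippet below evaluates I(x) on a grid for a standard Gaussian f_D and the squared TYPEP loss, and verifies that it has a single minimum. It is an illustration only; the closed form used here follows from standard Gaussian moment identities under these specific assumptions, not from the general development above.

```python
import numpy as np
from scipy.stats import norm

def I(x, alpha):
    # Expected risk-averse loss at estimate x, for f_D = N(0, 1) and
    # l(x, v) = (x - v)^2 with the loss scaled by alpha when x > v:
    #   I(x) = (x^2 + 1) + (alpha - 1) * [ (x^2 + 1) * Phi(x) + x * phi(x) ]
    return (x**2 + 1) + (alpha - 1) * ((x**2 + 1) * norm.cdf(x) + x * norm.pdf(x))

x = np.linspace(-4, 4, 8001)
for alpha in [1.0, 2.0, 4.0]:
    y = I(x, alpha)
    # Count sign changes of the discrete derivative; 1 indicates a single
    # minimum on this grid, consistent with assumption (A5).
    sign_changes = np.count_nonzero(np.diff(np.sign(np.diff(y))))
    print(alpha, x[np.argmin(y)], sign_changes)
# The minimizer x* sits at 0 for alpha = 1 and moves below 0 as alpha grows,
# matching the intuition that risk averseness pushes estimates downward.
```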
C.5 Positivity of the optimal dual variable

We now prove that the optimal dual variable γ* is strictly positive. In this section, we assume that assumptions (A0)–(A5) hold. We need certain technical results.

Lemma C.13. Consider an instance I = (Ω, f_D, ℓ, α, τ) of Primal Opt with a TYPEP loss function. Let x* be argmin_{x∈Ω} I(x). Then there is a value η > 0 such that any feasible solution f must have Err_{ℓ,α}(f_D, f) ≥ I(x*) + η.

Proof. Let x* denote the unique global minimum of I(x) on Ω (using assumption (A5)). We show that there is a small enough value ε > 0 such that any feasible solution f must place at least ε amount of probability mass outside I_ε := [x* − ε, x* + ε]. This suffices for the following reason. We know by assumption (A5) that there is a positive value ζ such that I(x) > I(x*) + ζ for all x ∉ I_ε. Thus,
Err_{ℓ,α}(f_D, f) = ∫_{Ω ∩ I_ε} I(x) f(x) dµ(x) + ∫_{Ω \ I_ε} I(x) f(x) dµ(x) ≥ I(x*) + εζ.
Hence, we can choose the desired value η to be εζ.

It remains to find such an ε. Consider such a value ε and assume that a feasible solution f places strictly more than 1 − ε probability mass inside I_ε. We need one piece of notation: for an interval A, define
Ent_A(f) := −∫_{Ω ∩ A} f(x) ln f(x) dµ(x).
Let m be the mean of f_D and let R be the half-radius of f_D, i.e., f_D places at least half of its mass in the interval [m − R, m + R] ∩ Ω. Let A₀ := [s₀, t₀] be a large enough interval, of length D, containing x* and the interval [m − R, m + R]; assume that the end-points of A₀ are at least D/2 away from any point of [m − R, m + R]. We consider intervals of growing size around A₀. Let A₁ be the union of the two intervals on either side of A₀, each of length e^{2/ε}. Similarly, having defined A_{i−1} (which is a union of two intervals), define A_i to be the union of the two intervals on either side of A_{i−1}, each of length e^{2^i/ε}; in particular, ln |A_i| = 2 ln |A_{i−1}|.

Consider the feasible solution f, and let β_i be the total probability mass placed by f on A_i. Then
Ent(f) = Ent_{I_ε}(f) + Ent_{A₀ \ I_ε}(f) + Σ_{i≥1} Ent_{A_i}(f).
We bound each of these terms. Note that Ent_{A_i}(f) is maximized when the β_i mass is distributed uniformly over A_i. Similarly, Ent_{I_ε}(f) is at most ln ε and Ent_{A₀ \ I_ε}(f) ≤ ln |A₀| = ln D. Thus,
τ ≤ Ent(f) ≤ ln(εD) + Σ_{i≥1} β_i ln |A_i|.
In other words, Σ_{i≥1} β_i ln |A_i| ≥ τ − ln(εD). Observe that we can still choose ε and, hence, we will choose it such that the r.h.s. above becomes as large as we want (in particular, much larger than p*). Observe also that any point in A_i is at least |A_{i−1}| distance away from any point in [m − R, m + R]. Therefore (and this is where we use the non-negativity of the loss function, which has been assumed to be TYPEP),
Err_{ℓ,α}(f_D, f) ≥ Σ_{i≥1} (β_i/2)(ln |A_{i−1}| − θ) = Σ_{i≥1} (β_i/4) ln |A_i| − (θ/2) Σ_{i≥1} β_i,
where θ is the parameter from assumption (A2) for the relevant base points, and we have used the fact that ln |A_i| = 2 ln |A_{i−1}|. It follows from the above two inequalities that if we choose ε small enough (as a function of τ, θ, D, and p* alone), then Err_{ℓ,α}(f_D, f) ≥ 2p*. Hence, either f places less than 1 − ε mass inside I_ε, or its objective function value is more than 2p*.

The above proof only works for TYPEP loss functions; we need a similar result for TYPEN loss functions. The ideas are similar, but we need to use the function I(x) in a more subtle manner.

Lemma C.14. Consider an instance I = (Ω, f_D, ℓ, α, τ) of Primal Opt with a TYPEN loss function. Let x* be argmin_{x∈Ω} I(x). Then there is a value η > 0 such that any feasible solution f must have Err_{ℓ,α}(f_D, f) ≥ I(x*) + η.

Proof.
As in the proof of Lemma C.13, we would like to argue that there is a positive ε > 0 such that any feasible solution places at least ε amount of mass outside the interval I_ε. Since the optimum is finite, Corollary C.11 shows that Ω is bounded from below, i.e., it is of the form [a, b] or [a, ∞); Theorem C.10 now implies that x* = a. Again, define the sets A_i as in the proof of Lemma C.13, but now make A₀ start from the lower limit a of Ω. Thus, each A_i is now a single interval, with A_i lying to the right of A_{i−1}.

Now we use the fact that when the loss function is TYPEN, both I_L(x) and I_R(x) are monotonically increasing (Lemma C.8 and Lemma C.9). Further, for a point x ∈ A_i, Lemma C.8 shows that
I_L(x) − I_L(m + R) ≥ (α/2)(ln |A_{i−1}| − θ_{m+R}),
because the c.d.f. of f_D at m + R is at least 1/2. Hence,
∫_{Ω ∩ A_i} (I_L(x) − I_L(x*)) f(x) dµ(x) ≥ (α β_i / 2)(ln |A_{i−1}| − θ_{m+R}).
Summing the above over all i ≥ 1, and proceeding as in the proof of Lemma C.13, we see that by choosing a small enough ε (here t₀ denotes the right end-point of A₀), the quantity
∫_{Ω ∩ [t₀, ∞)} I_L(x) f(x) dµ(x) − I_L(x*)
can be made larger than any desired quantity Q which depends only on the parameters of the input instance I. Since I_R(x) is an increasing function of x, we also get
∫_{Ω ∩ [t₀, ∞)} I(x) f(x) dµ(x) ≥ I(x*) + Q.
One remaining issue is that the integral on the l.h.s. does not include the terms for [a, t₀) (recall that I(x) can be negative). However, observe that I(x) is an increasing function of x, and hence I(x) ≥ I(a) for all x. Therefore,
Err_{ℓ,α}(f_D, f) = ∫_Ω I(x) f(x) dµ(x) ≥ ∫_{Ω ∩ [a, t₀)} I(x) f(x) dµ(x) + ∫_{Ω ∩ [t₀, ∞)} I(x) f(x) dµ(x) ≥ I(a) + I(a) + Q.
Again, by choosing Q large enough, we can make the above exceed I(x*) + η for some η > 0, as desired.

As an immediate consequence of this result, we see that for an instance of Primal Opt satisfying conditions (A0)–(A5), the optimal dual solution γ* is non-zero.

Theorem C.15. Consider an instance I = (Ω, f_D, ℓ, α, τ) of Primal Opt satisfying conditions (A0)–(A5). Then there is an optimal dual solution (γ*, ϕ*) to Dual Opt satisfying γ* > 0.

Proof. Let p* denote the optimal value of this instance (it is finite by Claim C.12), and recall that I(x) has a (unique) global minimum x* in Ω. Assume, for the sake of contradiction, that γ* = 0. We claim that g(γ*, ϕ*) ≤ I(x*). Indeed, consider a function f which places unit probability mass at x*; it is easy to verify that L(f, γ*, ϕ*) = I(x*). However, Lemma C.13 and Lemma C.14 show that p* > I(x*), which is a contradiction (because, by Theorem C.3, strong duality holds and g(γ*, ϕ*) = p*).

C.6 Optimality conditions and uniqueness of optimal solution

We now show that, under suitable conditions, there is an optimal solution to an instance of Primal Opt.

Theorem C.16 (Optimality condition). Consider an instance I of the optimization problem Primal Opt defined on a space Ω. Let ℓ, α, τ, f_D denote the loss function, risk-averseness parameter, resource-information parameter, and the density of the true utility, respectively, in I. Assume that the optimal value for this instance is finite and that assumptions (A0) and (A3) hold. Let (γ*, ϕ*) be an optimal solution to the corresponding Dual Opt with γ* > 0. Consider a function f* ∈ Dom defined by
I_{f_D,ℓ,α}(x) + γ*(1 + ln f*(x)) + ϕ* = 0 for all x ∈ Ω.   (12)
Then f* is an optimal solution to the instance I.

Proof. Let p* denote the value of the optimal solution to I. Since p* is finite, Theorem C.15 shows that there is an optimal dual solution satisfying γ* > 0. Recall that g(γ*, ϕ*) := min_{f ∈ Dom} L(f, γ*, ϕ*).
Let $f^*$ be the function defined by (12) ($f^*(x)$ is well-defined since $\gamma^* > 0$; this is where we need strict positivity of $\gamma^*$). We argue that $L(f^*, \gamma^*, \phi^*) = g(\gamma^*, \phi^*)$. Indeed, consider any other function $h \in \mathrm{Dom}$. Define a function $\theta : [0, 1] \to \mathbb{R}$ as follows:
$$\theta(t) := L\left((1-t) f^* + t\,h,\ \gamma^*, \phi^*\right) = L\left(f^* + t\,e,\ \gamma^*, \phi^*\right),$$
where $e(x) = h(x) - f^*(x)$. We first claim that $\theta(t)$ is a convex function of $t$.

Claim C.17. The function $\theta : [0,1] \to \mathbb{R}$ is a convex function.

Proof. We observe that $\mathrm{Err}(f^* + t\,e, f_D)$ is a linear function of $t$. The function $\gamma^*\left(\tau - \mathrm{Ent}(f^* + t\,e)\right)$ is a convex function of $t$ (since entropy is concave), and $\phi^*\left(\int_\Omega \left(f^*(x) + t\,e(x)\right)d\mu(x) - 1\right)$ is a linear function of $t$. Therefore $\theta(t)$, which is the sum of these three functions, is a convex function of $t$.

We shall show that $\frac{d\theta(t)}{dt}\big|_{t=0^+} = 0$. Along with the convexity of $\theta(t)$, this implies that $\theta(0) \le \theta(1)$ and hence $L(f^*, \gamma^*, \phi^*) \le L(h, \gamma^*, \phi^*)$. A routine calculation shows that $\frac{d\theta(t)}{dt}\big|_{t=0^+}$ is equal to
$$\int_\Omega \left( I_{f_D,\ell,\alpha}(x) + \gamma^*\left(1 + \ln f^*(x)\right) + \phi^* \right) e(x)\,d\mu(x).$$
Equation (12) implies that the above expression is $0$. This proves the desired result. Thus, we have shown that $L(f^*, \gamma^*, \phi^*) = g(\gamma^*, \phi^*)$, and therefore $f^*$ is an optimal solution to the instance $I$.

We now show the uniqueness of the optimal solution to Primal Opt. Consider an instance of this optimization problem specified by $\Omega$, loss function $\ell$, density $f_D$, and parameters $\alpha, \tau$. Let $F$ denote the set of feasible densities, i.e.,
$$F := \left\{ f : f \in \mathrm{Dom},\ f \text{ is a density on } \Omega \text{ and } -\int_\Omega f(x) \ln f(x)\,d\mu(x) \ge \tau \right\}.$$

Lemma C.18. $F$ is strictly convex.

Proof. Let $f_1, f_2 \in F$ and let $\lambda$ be a parameter in $[0,1]$. We first show that the density $f$ defined by $f(x) := \lambda f_1(x) + (1-\lambda) f_2(x)$, $x \in \Omega$, is also in $F$. Clearly, $f$ is a density, because $f(x) \ge 0$ and
$$\int_\Omega f(x)\,d\mu(x) = \lambda \int_\Omega f_1(x)\,d\mu(x) + (1-\lambda)\int_\Omega f_2(x)\,d\mu(x) = \lambda + (1-\lambda) = 1.$$
The fact that $g(y) := y \ln y$ is a strictly convex function on $\mathbb{R}_{\ge 0}$ implies that
$$f(x) \ln f(x) \le \lambda f_1(x) \ln f_1(x) + (1-\lambda) f_2(x) \ln f_2(x),$$
with equality if and only if $f_1(x) = f_2(x)$. Integrating both sides, and using the fact that $f_1, f_2$ belong to $F$, we get
$$\int_\Omega f(x)\ln f(x)\,d\mu(x) \le -\lambda\tau - (1-\lambda)\tau = -\tau,$$
i.e., $\mathrm{Ent}(f) \ge \tau$, so $f \in F$. Moreover, if $f_1(x)$ and $f_2(x)$ differ on a set of positive measure, then $f(x)\ln f(x) < \lambda f_1(x)\ln f_1(x) + (1-\lambda) f_2(x)\ln f_2(x)$ on a set of positive measure. Integrating both sides, we get $\mathrm{Ent}(f) > \tau$. This shows that the set $F$ is strictly convex.

Strict convexity of $F$ now allows us to show the uniqueness of the optimal solution.

Theorem C.19 (Uniqueness of optimal solution). Consider an instance $I = (\Omega, f_D, \ell, \alpha, \tau)$ of Primal Opt satisfying the property that the optimal value is finite and there is an optimal dual solution with $\gamma^* > 0$. Then there is a unique optimal solution to this instance.

Proof. Let $p^*$ denote the optimal value of this instance. Theorem C.16 already shows the existence of an optimal solution; we show uniqueness. We have assumed that there is an optimal dual solution $(\gamma^*, \phi^*)$ such that $\gamma^* > 0$. The complementary slackness condition shows that for any optimal solution $f$ to the instance $I$, $\mathrm{Ent}(f)$ must equal $\tau$. Suppose, for the sake of contradiction, that there are two distinct optimal solutions $f^*_1$ and $f^*_2$ to $I$ (i.e., $f^*_1(x) \neq f^*_2(x)$ on a subset of positive measure). Consider a solution $f' = t f^*_1 + (1-t) f^*_2$ for some $t \in (0,1)$. Linearity of $\mathrm{Err}_{\ell,\alpha}(f_D, \cdot)$ shows that $\mathrm{Err}_{\ell,\alpha}(f_D, f') = p^*$ as well. Now $f'$ is also a feasible solution by Lemma C.18; in fact, this lemma shows that $\mathrm{Ent}(f') > \tau$. But this contradicts the fact that every optimal solution must have entropy equal to $\tau$. Thus, we see that there must be a unique optimal solution to the instance $I$.
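To make the optimality condition (12) and the tight entropy constraint concrete, the following is a minimal numerical sketch (not the paper's released implementation): it recovers $f^* \propto e^{-I(x)/\gamma^*}$ on a discrete grid and bisects on $\gamma^*$ until the entropy constraint is tight. The helper name `solve_primal_opt` and the grid discretization are our own illustrative assumptions.

```python
import numpy as np

def solve_primal_opt(x, f_D, loss, alpha, tau):
    """Discrete-grid sketch of Theorem C.16: the optimal density has the
    Gibbs form f*(x) ∝ exp(-I(x)/γ*), with γ* > 0 chosen so that the
    entropy constraint Ent(f*) = τ is tight (complementary slackness)."""
    dx = x[1] - x[0]
    # Risk-averse loss (4): scale the loss by alpha where the estimate
    # exceeds the true value.
    L = loss(x[:, None], x[None, :])            # L[i, j] = ℓ(x_i, v_j)
    L = np.where(x[:, None] > x[None, :], alpha * L, L)
    I = (L * f_D[None, :]).sum(axis=1) * dx     # I(x_i) = ∫ ℓ_α(x_i, v) f_D(v) dµ(v)

    def density_and_entropy(gamma):
        w = np.exp(-(I - I.min()) / gamma)      # shift by min(I) for numerical stability
        f = w / (w.sum() * dx)
        return f, -np.sum(f * np.log(f + 1e-300)) * dx

    lo, hi = 1e-6, 1e6                          # entropy is increasing in γ, so bisect
    for _ in range(100):
        mid = np.sqrt(lo * hi)
        f, ent = density_and_entropy(mid)
        lo, hi = (mid, hi) if ent < tau else (lo, mid)
    return f

# Example: standard normal true utility, squared loss, risk-averseness α = 2.
x = np.linspace(-8.0, 8.0, 4001)
f_D = np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)
f_star = solve_primal_opt(x, f_D, lambda a, b: (a - b)**2, alpha=2.0, tau=1.0)
```

The bisection on $\gamma$ exploits the fact that, for Gibbs-form densities, entropy increases monotonically with the temperature parameter $\gamma$.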
Combining Lemma F.2, Theorem C.16 and Theorem C.19, we see that Theorem C.1 holds.

C.7 Conditions under which the optimal solution is $f_D$

We show that our optimization framework can output the true utility $f_D$ for a choice of the parameters $\alpha$, $\tau$, and the loss function $\ell$.

Theorem C.20 (Parameters that recover the true utility). Consider an instance $I = (\Omega, f_D, \ell, \alpha, \tau)$ of the optimization problem Primal Opt. Assume that the density $f_D$ has finite entropy. Suppose $\alpha := 1$, $\tau := \mathrm{Ent}(f_D)$, and the loss function is $\ell(x,v) := \ln f_D(v) - \ln f_D(x)$. Then the density $f_D$ is the unique optimal solution to Primal Opt for $I$.

Proof. We claim that, for the values of dual variables $\gamma := 1$ and $\phi := \mathrm{Ent}(f_D) - 1$, $g(\gamma, \phi) = L(f_D, \gamma, \phi)$. Towards this, we show that $f_D$ minimizes $L(f, \gamma, \phi)$ over all $f \in \mathrm{Dom}$. We first show that $f_D$ satisfies condition (12) with $\gamma^* = \gamma$, $\phi^* = \phi$, $f^* = f_D$; it shall then follow, from exactly the same arguments as in Theorem C.16, that $f_D = \operatorname{argmin}_{f \in \mathrm{Dom}} L(f, \gamma, \phi)$. Now, we check that $f_D$ satisfies condition (12) (recall that the loss function is $\ell(x,v) := \ln f_D(v) - \ln f_D(x)$):
$$I(x) + \gamma\left(1 + \ln f_D(x)\right) + \phi = \int_\Omega \ell(x,v) f_D(v)\,d\mu(v) + \gamma\left(1 + \ln f_D(x)\right) + \phi$$
$$= \int_\Omega f_D(v)\ln f_D(v)\,d\mu(v) - \ln f_D(x)\int_\Omega f_D(v)\,d\mu(v) + \left(1 + \ln f_D(x)\right) + \mathrm{Ent}(f_D) - 1$$
$$= -\mathrm{Ent}(f_D) - \ln f_D(x) + \left(1 + \ln f_D(x)\right) + \mathrm{Ent}(f_D) - 1 = 0.$$
This proves the desired claim. Thus, $g(\gamma, \phi) = L(f_D, \gamma, \phi) = \mathrm{Err}_{\ell,\alpha}(f_D, f_D)$; that is, $g(\gamma, \phi)$ is equal to the objective function value of Primal Opt at $f = f_D$. Hence, $f_D$ is an optimal solution to Primal Opt for $I$. In order to prove uniqueness, note that the above argument also yields an optimal solution $(\gamma, \phi)$ to Dual Opt. Since $\gamma > 0$, complementary slackness conditions imply that any optimal solution must have entropy exactly equal to $\tau$. Now, arguing as in Theorem C.19, we see that there is a unique optimal solution.

D Derivation of output density for a single individual

In this section, we consider the setting where a single individual with true utility $v$ is being evaluated. In this case, the input density $f_D(x)$ is specified by the Dirac-delta function centered at $v$, denoted $\delta_v(x)$. Using Theorem 3.1, we can characterize the output density as follows:

Lemma D.1. Consider an instance $I = (\Omega, f_D, \ell, \alpha, \tau)$ of (Opt Prog) where $\Omega = \mathbb{R}$, $f_D = \delta_v(x)$ for some real $v$, $\ell(x,u) = (x-u)^2$, and $\alpha$ and $\tau$ are arbitrary real parameters. Then the optimal density is given by
$$f^*(x) = \begin{cases} K e^{-(x-v)^2/\gamma^*} & \text{if } x \le v, \\ K e^{-\alpha(x-v)^2/\gamma^*} & \text{otherwise}, \end{cases}$$
where $K$ is the normalization constant and $\gamma^*$ is the optimal dual variable for the entropy constraint. The mean of this density is equal to $v - \sqrt{\frac{\gamma^*}{\pi}} \cdot \frac{\sqrt{\alpha} - 1}{\sqrt{\alpha}}$.

Proof. We first evaluate the integral $I(x)$ as follows:
$$I(x) = \int_\Omega \delta_v(u)\,\ell_\alpha(x, u)\,d\mu(u) = \alpha \int_{u \le x} \delta_v(u)(x-u)^2\,d\mu(u) + \int_{u > x} \delta_v(u)(x-u)^2\,d\mu(u).$$
When $x \le v$, the first integral on the r.h.s. is $0$, and hence the above integral is equal to $(x-v)^2$. Similarly, if $x > v$, the above integral is equal to $\alpha(x-v)^2$. The expression for $f^*(x)$ in the statement of the lemma now follows from (12). A routine calculation shows that the normalization constant $K$ is equal to $\frac{2\sqrt{\alpha}}{\sqrt{\pi\gamma^*}\,(1 + \sqrt{\alpha})}$, and the mean of this density turns out to be $v - \sqrt{\frac{\gamma^*}{\pi}} \cdot \frac{\sqrt{\alpha} - 1}{\sqrt{\alpha}}$.
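As a quick numerical check of Lemma D.1 (a sketch, with illustrative values of $v$, $\alpha$, and $\gamma^*$ that are not prescribed by the lemma), one can integrate the piecewise-Gaussian density directly and compare against the closed-form mean:

```python
import numpy as np
from scipy.integrate import quad

v, alpha, gamma = 0.0, 4.0, 0.5   # illustrative values

def unnorm(x):
    # Piecewise Gaussian of Lemma D.1: flatter below v, sharper above v (α > 1).
    return np.exp(-(x - v)**2 / gamma) if x <= v else np.exp(-alpha * (x - v)**2 / gamma)

Z, _ = quad(unnorm, -np.inf, np.inf)
mean, _ = quad(lambda x: x * unnorm(x) / Z, -np.inf, np.inf)

# Closed form: mean = v - sqrt(γ*/π) · (sqrt(α) - 1) / sqrt(α).
closed = v - np.sqrt(gamma / np.pi) * (np.sqrt(alpha) - 1) / np.sqrt(alpha)
assert abs(mean - closed) < 1e-6  # risk-averseness (α > 1) pulls the mean below v
```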
E Connection to the Gibbs equation

Theorem E.1 (Gibbs equation). Consider an instance $I = (\Omega, f_D, \ell, \alpha, \tau)$ of the optimization problem Primal Opt that satisfies the assumptions of Theorem C.1. Then the following holds:
$$\gamma^* \ln Z^* = -\mathrm{Err}_{\ell,\alpha}(f^*, f_D) + \gamma^*\,\mathrm{Ent}(f^*). \tag{13}$$
Here, $f^*(x) \propto e^{-I(x)/\gamma^*}$ is the solution to Primal Opt, $Z^* := \int e^{-I(x)/\gamma^*}\,d\mu(x)$, and $\gamma^* > 0$ is the solution to Dual Opt. Recall that $I(x) := \int_\Omega \ell_\alpha(x, v) f_D(v)\,d\mu(v)$.

Proof. Theorem C.1 implies that there exist $f^*(x)$ and $\gamma^* > 0$ such that $f^*(x) \propto e^{-I(x)/\gamma^*}$. Thus, if we let $Z^* := \int e^{-I(x)/\gamma^*}\,d\mu(x)$, then $f^*(x) = e^{-I(x)/\gamma^*}/Z^*$. From the optimality condition (12), we obtain that there is a $\phi^*$ such that
$$I(x) + \gamma^*\left(1 + \ln f^*(x)\right) + \phi^* = 0.$$
Since $\gamma^* > 0$, we can divide by it to obtain $Z^* = e^{1 + \phi^*/\gamma^*}$. We integrate the optimality condition with respect to the density $f^*(x)$ to get
$$\mathrm{Err}_{\ell,\alpha}(f^*, f_D) + \gamma^* - \gamma^*\tau + \phi^* = 0.$$
Thus, we obtain:
$$\gamma^*\tau = \gamma^*\,\mathrm{Ent}(f^*) = \mathrm{Err}_{\ell,\alpha}(f^*, f_D) + \gamma^*\ln Z^*. \tag{14}$$
Rearranging this equation, we obtain the theorem.

In analogy with the Gibbs equation in statistical physics [94], $Z^*$ can be viewed as the partition function corresponding to the energy function $I(x)$, $\gamma^*$ corresponds to the temperature, $-\gamma^*\ln Z^*$ corresponds to the free energy, and $\mathrm{Err}(f^*, f_D)$ is the internal energy. It follows from the theorem that we can write $f^*(x)$ as
$$f^*(x) = e^{-\tau}\exp\left(-\frac{I(x) - \mathrm{Err}_{\ell,\alpha}(f^*, f_D)}{\gamma^*}\right).$$

F Effect of changing the resource-information parameter τ

F.1 Effect of decreasing τ

Theorem F.1 (Effect of decreasing τ). Fix $\alpha \ge 1$, a density $f_D$, and a loss function $\ell$. Assume that the function $I_{f_D,\ell,\alpha}(x)$ satisfies condition (A5), and let $x^* := \operatorname{argmin}_{x\in\Omega} I_{f_D,\ell,\alpha}(x)$. Given a $\tau$, let $f^*_\tau$ denote the optimal solution to Primal Opt. For every $\delta > 0$, there exists a value $T_\delta$ such that when $\tau \le T_\delta$, the solution $f^*_\tau$ has the following property: for every $x \in \Omega$ with $|x - x^*| \ge \delta$, we have $\frac{f^*_\tau(x)}{f^*_\tau(x^*)} \le \delta$. In other words, the density outside an interval of length $\delta$ around $x^*$ has a much smaller value than at $x^*$.

In order to prove the above result, we first show that the optimal dual value $\gamma^*_\tau$ goes to $0$ as $\tau$ goes to $-\infty$. Then we shall use Theorem E.1 to show that the optimal density is highly sensitive to small changes in $I(x)$.

Lemma F.2. Fix an interval $\Omega \subseteq \mathbb{R}$, a parameter $\alpha \ge 1$, a density $f_D$ of true utility and a loss function $\ell$. Assume that the function $I_{f_D,\ell,\alpha}(x)$ has a unique global minimum. Let $I_\tau$ denote the instance $(\Omega, f_D, \ell, \alpha, \tau)$. Let $\gamma^*_\tau$ be the optimal Lagrange variable corresponding to this instance (assuming it has a non-empty solution). Then $\lim_{\tau\to-\infty} \gamma^*_\tau = 0$.

Proof. We first observe that the instance $I_\tau$ will always have a feasible solution for small enough $\tau$. Indeed, when $e^\tau$ is less than the length of $\Omega$, there is always a feasible solution to $I_\tau$. Let $\tau_0$ be a value of $\tau$ for which the instance $I_{\tau_0}$ has a feasible solution. For the sake of brevity, let $I(x)$ denote $I_{f_D,\ell,\alpha}(x)$, and let $x^* := \operatorname{argmin}_{x\in\Omega} I(x)$. We first argue that $\mathrm{Err}(f_D, f^*_\tau)$ remains in a bounded range.

Claim F.3. Consider a value of $\tau \le \tau_0$. Let $f^*_\tau$ be the optimal solution to the instance $I_\tau$. Then
$$I(x^*) \le \mathrm{Err}_{\ell,\alpha}(f_D, f^*_\tau) \le \mathrm{Err}_{\ell,\alpha}(f_D, f^*_{\tau_0}).$$

Proof. The first inequality follows from the fact that $\mathrm{Err}_{\ell,\alpha}(f_D, f^*_\tau) = \int_\Omega I(x) f^*_\tau(x)\,d\mu(x) \ge I(x^*)$. The second inequality follows from the fact that the solution $f^*_{\tau_0}$ is also a feasible solution for the instance $I_\tau$.

Suppose, for the sake of contradiction, that $\lim_{\tau\to-\infty} \gamma^*_\tau \neq 0$; in other words, $\theta_0 := \limsup_{\tau\to-\infty} \gamma^*_\tau > 0$. It follows that there is an infinite sequence $A := (\tau_0 > \tau_1 > \cdots)$ going to $-\infty$ such that $\gamma^*_{\tau_i} \ge \theta_0/2$ for all $\tau_i \in A$. Consider an interval of finite but non-zero length in $\Omega$; let $U = [s, t]$ be such an interval. Since $U$ is closed and $I(x)$ is a continuous function, there are finite values $a_0, b_0$ such that $a_0 \le I(x) \le b_0$ for all $x \in U$. Thus we get:

Claim F.4. There is a positive real $\eta$ such that
$$\exp\left(\frac{\mathrm{Err}_{\ell,\alpha}(f_D, f^*_\tau) - I(x)}{\gamma^*_\tau}\right) \ge \eta$$
holds for all $x \in U$, $\tau \in A$.

Proof. It follows from Claim F.3 and the observation above that, for all $x \in U$ and all $\tau \le \tau_0$, $\mathrm{Err}_{\ell,\alpha}(f_D, f^*_\tau) - I(x)$ lies in the range $[I(x^*) - b_0,\ \mathrm{Err}_{\ell,\alpha}(f_D, f^*_{\tau_0}) - a_0]$. Since $\gamma^*_\tau \ge \theta_0/2$ for all $\tau \in A$, the result follows.
The above claim, along with Theorem E.1, shows that $f^*_\tau(x) \ge \eta\,e^{-\tau}$ for all $x \in U$, $\tau \in A$. Since $A$ contains an infinite sequence of values going to $-\infty$, we can choose a value $\tau \in A$ such that $\eta e^{-\tau} > 1/|U|$. But then $f^*_\tau(x) > \frac{1}{|U|}$ for all $x \in U$, which is not possible because $f^*_\tau$ is a density. This proves the lemma.

We now prove Theorem F.1. Consider the instance $I$ as stated in the theorem. Let $\tau_0$ be a value of the resource-information parameter for which there is a density with entropy $\tau_0$ in $\Omega$ (i.e., for which Primal Opt has a feasible solution). Recall that $x^* = \operatorname{argmin}_{x\in\Omega} I(x)$. Consider a $\delta > 0$. Assumption (A5) shows that there is a value $\varepsilon_\delta > 0$ such that $I(x) - I(x^*) > \varepsilon_\delta$ for all $x$ satisfying $|x - x^*| \ge \delta$. Now consider an $x$ such that $|x - x^*| \ge \delta$. Using Theorem E.1, we see that, for all $\tau \le \tau_0$,
$$\frac{f^*_\tau(x)}{f^*_\tau(x^*)} = \exp\left(\frac{I(x^*) - I(x)}{\gamma^*_\tau}\right) \le \exp\left(-\varepsilon_\delta/\gamma^*_\tau\right),$$
where the last inequality follows from the fact that $\gamma^*_\tau$ is positive. Now, Lemma F.2 shows that there is a value $T_\delta$ such that, for all $\tau \le T_\delta$, $0 < \gamma^*_\tau \le \frac{\varepsilon_\delta}{\ln(1/\delta)}$. Therefore, for all $x$ with $|x - x^*| \ge \delta$, $\frac{f^*_\tau(x)}{f^*_\tau(x^*)} \le \delta$. This proves Theorem F.1.

F.2 Effect of increasing τ

In this section, we consider the effect of an increase in τ on the variance and the mean of the optimal density.

Theorem F.5 (Effect of increasing τ on variance). Consider a continuous density $f_D$ on $\mathbb{R}$, a loss function $\ell$, and a risk-averseness parameter $\alpha$. For a given resource-information parameter $\tau$, let $I_\tau$ denote the instance $(\mathbb{R}, f_D, \ell, \alpha, \tau)$ of Primal Opt. Let $f^*_\tau$ be the optimal solution to the instance $I_\tau$. Then the variance of $f^*_\tau$ is at least $\frac{1}{2\pi}e^{2\tau - 1}$.

The proof relies on the following result.

Theorem F.6 (Gaussian maximizes entropy; Theorem 3.2 in [48]). For a continuous probability density function $f$ on $\mathbb{R}$ with variance $\sigma^2$,
$$\mathrm{Ent}(f) \le \frac{1}{2}\ln\left(2\pi e\sigma^2\right),$$
with equality if and only if $f$ is a Gaussian density with variance $\sigma^2$.

Proof of Theorem F.5. Consider the instance $I_\tau$ and the optimal solution $f^*_\tau$ for this instance. Since $f^*_\tau$ is a feasible solution, $\mathrm{Ent}(f^*_\tau) \ge \tau$. If $f^*_\tau$ has unbounded variance, then the desired result follows trivially. Hence, assume that $f^*_\tau$ has bounded variance, say $\sigma^2$. Now, Theorem F.6 shows that $\mathrm{Ent}(f^*_\tau) \le \frac{1}{2}\ln(2\pi e\sigma^2)$. Using the fact that $\mathrm{Ent}(f^*_\tau) \ge \tau$, the above inequality implies that $\sigma^2 \ge \frac{1}{2\pi}e^{2\tau-1}$. This proves the desired result.

We now show that the mean of the optimal density also increases with increasing τ when the input density $f_D$ is supported on $[0, \infty)$.

Theorem F.7 (Effect of increasing τ on mean). Consider a continuous density $f_D$ on $\Omega := [0, \infty)$, a loss function $\ell$, and a risk-averseness parameter $\alpha$. For a given resource-information parameter $\tau$, let $I_\tau$ denote the instance $(\Omega, f_D, \ell, \alpha, \tau)$ of Primal Opt, and let $f^*_\tau$ denote the optimal solution to the instance $I_\tau$. Then the mean of $f^*_\tau$ is at least $e^{\tau - 1}$.

The proof relies on the following result.

Theorem F.8 (Exponential maximizes entropy; Theorem 3.3 in [48]). For a continuous probability density function $f$ on $[0, \infty)$ with mean $\lambda$, $\mathrm{Ent}(f) \le 1 + \ln(\lambda)$.

Proof of Theorem F.7. The proof proceeds along similar lines as that of Theorem F.5. We know that $\mathrm{Ent}(f^*_\tau) \ge \tau$ because it is a feasible solution to the instance $I_\tau$. If $f^*_\tau$ has unbounded mean, we are done; therefore, assume its mean, denoted $\lambda$, is finite. Theorem F.8 shows that $\mathrm{Ent}(f^*_\tau) \le 1 + \ln\lambda$. Since $\mathrm{Ent}(f^*_\tau) \ge \tau$, we see that $\lambda \ge e^{\tau-1}$. This proves the theorem.
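Both lower bounds follow purely from the entropy constraint, and the extremal densities in Theorems F.6 and F.8 attain them with equality; a small numeric sanity check (a sketch, with arbitrary illustrative τ values):

```python
import numpy as np

for tau in [0.5, 1.0, 2.0]:
    # Theorem F.5: the Gaussian with entropy exactly τ attains the variance bound.
    sigma2 = np.exp(2 * tau - 1) / (2 * np.pi)
    assert np.isclose(0.5 * np.log(2 * np.pi * np.e * sigma2), tau)
    # Theorem F.7: the Exponential with entropy exactly τ attains the mean bound.
    mean = np.exp(tau - 1)
    assert np.isclose(1 + np.log(mean), tau)
```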
G Effect of changing the risk-averseness parameter α

Theorem G.1 (Monotonicity of $I(x)$ with respect to α). Consider an instance $I = (\Omega, f_D, \ell, \alpha, \tau)$ of the optimization problem Primal Opt. Then, for any $x \in \Omega$, $I_{f_D,\ell,\alpha}(x)$ is an increasing function of $\alpha$.

Proof. By definition,
$$I_\alpha(x) := I_{f_D,\ell,\alpha}(x) = \int_{v \le x} \ell_\alpha(x,v) f_D(v)\,d\mu(v) + \int_{v > x} \ell(x,v) f_D(v)\,d\mu(v).$$
Recall from (4) that, for any $x, v \in \Omega$, $\ell_\alpha(x,v) := \ell(x,v)$ if $x < v$ and $\ell_\alpha(x,v) := \alpha\,\ell(x,v)$ when $x \ge v$. Since our model requires $\ell(x,v) \ge 0$ whenever $x \ge v$, we have $\ell_\alpha(x,v) \le \ell_{\alpha'}(x,v)$ for any $x \ge v$ and $\alpha \le \alpha'$. Thus, $I_\alpha(x) \le I_{\alpha'}(x)$ for all $x \in \Omega$.

Theorem G.2 (Monotonicity of Err with respect to α). Consider an instance $I_\alpha = (\Omega, f_D, \ell, \alpha, \tau)$ of the optimization problem Primal Opt. Suppose $I_\alpha$ satisfies the assumptions of Theorem C.1 and let $f^*_\alpha$ be the optimal solution to instance $I_\alpha$. Then the function $\mathrm{Err}_{\ell,\alpha}(f^*_\alpha, f_D)$ is an increasing function of $\alpha$.

Proof. As noted in Section C, if the instance $I$ satisfies the assumptions for $\alpha = 1$, then the instances obtained by changing α continue to satisfy the assumptions needed in Theorem C.1. Thus, we may assume the optimal solution exists for each version of $I$ where we vary $\alpha \ge 1$, and let $f^*_\alpha$ denote the optimal density. We first show that for any fixed density $f$, $\mathrm{Err}_{\ell,\alpha}(f, f_D)$ is an increasing function of α. This is so because $\ell_\alpha(x,v)$ is point-wise non-decreasing in α (as in the proof of Theorem G.1), and hence so is its expectation. Consequently, for $\alpha \le \alpha'$,
$$\mathrm{Err}_{\ell,\alpha'}(f^*_{\alpha'}, f_D) \ge \mathrm{Err}_{\ell,\alpha}(f^*_{\alpha'}, f_D) \ge \mathrm{Err}_{\ell,\alpha}(f^*_{\alpha}, f_D),$$
where the last inequality uses that $f^*_{\alpha'}$ is feasible for the instance $I_\alpha$ (the feasible set does not depend on α) while $f^*_\alpha$ is optimal for it.

I Pareto density

The Pareto density is defined as follows over $\Omega = [1, \infty)$ and has a parameter $\beta > 1$:
$$f_P(x) := \frac{\beta}{x^{\beta+1}}, \quad x \in [1, \infty).$$
The mean of this density is $\frac{\beta}{\beta-1}$; thus, the mean is finite only when $\beta > 1$. Its differential entropy is $1 + \frac{1}{\beta} + \ln\frac{1}{\beta}$ [148]. We consider the loss function $\ell(x,v) := \ln x - \ln v$. First, we compute the expression for $I(x)$, which we use to verify the applicability of Theorem C.1 with the above parameters.

Lemma I.1 (Expression for $I_\alpha(x)$). Consider an instance $I = (\Omega, f_P, \ell, \alpha, \tau)$ of Primal Opt where $\Omega = [1, \infty)$, $\ell(x,v) := \ln x - \ln v$, and $f_P$ is the Pareto density with parameter $\beta > 1$. Then
$$I(x) = \alpha\ln x - \frac{\alpha\left(x^\beta - 1\right) + 1}{\beta x^\beta}.$$

Proof. We use integration by parts to derive the following expression for $I(x)$:
$$I(x) = \alpha\int_1^x (\ln x - \ln v)\,\frac{\beta}{v^{\beta+1}}\,d\mu(v) + \int_x^\infty (\ln x - \ln v)\,\frac{\beta}{v^{\beta+1}}\,d\mu(v)$$
$$= \alpha\left(\ln x - \frac{1}{\beta}\left(1 - \frac{1}{x^\beta}\right)\right) - \frac{1}{\beta x^\beta} = \alpha\ln x - \frac{\alpha\left(x^\beta - 1\right) + 1}{\beta x^\beta}.$$

Figure 6: Mean and variance of the output density $f^*$, i.e., $\mathbb{E}_{x\sim f^*}[x]$ and $\mathrm{Var}_{x\sim f^*}[x]$, as a function of τ when $f_D$ is the Pareto distribution with parameter 3 and α is fixed to 2.

Applicability of Theorem C.1. Next, we verify that assumptions (A0)–(A5) hold. Since $\Omega = [1, \infty)$, (A0) holds for any finite τ. (A1) and (A2) hold due to the choice of the loss function. (A3) can be shown to hold since $f_P$ is a Pareto density: to see this, note that for any finite $x \ge 1$,
$$\int_\Omega |\ell(x,v)|\,f_P(v)\,d\mu(v) \le \int_1^\infty \left(\ln(x) + \ln(v)\right) f_P(v)\,d\mu(v) = \ln(x) + \int_1^\infty \ln(v)\,\frac{\beta}{v^{\beta+1}}\,d\mu(v) = \ln(x) + \frac{1}{\beta} < \infty.$$
Thus, (A3) holds. (A4) holds with, e.g., $R = 2$. By Lemma I.1, $I(x) = \alpha\ln x - \frac{\alpha(x^\beta - 1) + 1}{\beta x^\beta}$; thus, $I(x)$ is differentiable at each $x \in \Omega$. Moreover, for all $x \in \Omega$,
$$\frac{\partial I(x)}{\partial x} = \frac{\alpha}{x} - \frac{\alpha - 1}{x^{\beta+1}} = \frac{\alpha x^\beta - (\alpha - 1)}{x^{\beta+1}}.$$
Since the derivative is positive for all $x \in \Omega = [1, \infty)$, it follows that $I(x)$ has a unique global minimum at $x = 1$. Therefore, (A5) holds. Since assumptions (A0)–(A5) hold, we invoke Theorem C.1 to deduce the form of $f^*$. Figure 6 plots the mean and the variance of the output density $f^*$ as a function of the parameter τ in this setting; Figure 7 plots them as a function of the parameter α.

Theorem I.2 (Expression for $f^*$ with α = 1). Consider an instance $I = (\Omega, f_P, \ell, 1, \tau)$ of Primal Opt where $\Omega = [1, \infty)$, $\ell(x,v) := \ln x - \ln v$, and $f_P$ is the Pareto density with parameter $\beta > 1$. Let $f^*$ be the optimal solution of instance $I$. Then
$f^*$ is a Pareto density with parameter $\beta_\tau$ satisfying the following condition:
$$1 + \frac{1}{\beta_\tau} + \ln\frac{1}{\beta_\tau} = \tau.$$

Figure 7: Mean and variance of the output density $f^*$, i.e., $\mathbb{E}_{x\sim f^*}[x]$ and $\mathrm{Var}_{x\sim f^*}[x]$, as a function of α when $f_D$ is the Pareto distribution with parameter 3 and τ is fixed to the entropy of $f_D$.

Let $m_\tau$ and $\sigma^2_\tau$ be the mean and variance of $f^*_\tau$ as a function of τ. It holds that $m_\tau$ is monotonically increasing in τ and $\sigma^2_\tau$ is either infinite or monotonically increasing in τ.

Proof. For $\alpha = 1$, Lemma I.1 implies that $I(x) = \ln(x) - \frac{1}{\beta}$. As shown earlier, one can invoke Theorem C.1 for instance $I$ for any finite τ, which implies that $f^*$ has the form
$$f^*(x) \propto e^{-I(x)/\gamma^*} \propto x^{-1/\gamma^*},$$
where the proportionality constant and $\gamma^*$ are determined by $\int_\Omega f^*(x)\,d\mu(x) = 1$ and $\mathrm{Ent}(f^*) = \tau$. Hence, $f^*$ is a Pareto density with parameter $\beta_\tau = \frac{1}{\gamma^*} - 1$ and, therefore, has a differential entropy of $1 + \frac{1}{\beta_\tau} + \ln\frac{1}{\beta_\tau}$ [148]. This, combined with the condition $\mathrm{Ent}(f^*) = \tau$, implies that $f^*$ is the Pareto density with $\beta_\tau$ satisfying
$$1 + \frac{1}{\beta_\tau} + \ln\frac{1}{\beta_\tau} = \tau.$$
Since $1 + \frac{1}{\beta_\tau} + \ln\frac{1}{\beta_\tau}$ is a decreasing function of $\beta_\tau$, it follows that increasing τ monotonically decreases $\beta_\tau$. Note that $\beta_\tau > 1$ for any finite τ. The mean and variance of $f^*$ are
$$m_\tau = \frac{\beta_\tau}{\beta_\tau - 1} \quad\text{and}\quad \sigma^2_\tau = \frac{\beta_\tau}{(\beta_\tau - 1)^2(\beta_\tau - 2)},$$
respectively. Since $m_\tau$ is a monotonically decreasing function of $\beta_\tau$ (for $\beta_\tau > 1$) and $\beta_\tau$ is a monotonically decreasing function of τ, it follows that $m_\tau$ is a monotonically increasing function of τ. The variance $\sigma^2_\tau$ is finite when $\beta_\tau > 2$. Moreover, if $\beta_\tau > 2$, then $\sigma^2_\tau$ is a monotonically decreasing function of $\beta_\tau$. Since $\beta_\tau$ is a monotonically decreasing function of τ, it follows that $\sigma^2_\tau$ is either infinite or a monotonically increasing function of τ.
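The defining condition for $\beta_\tau$ has no closed form, but since its left-hand side is strictly decreasing in $\beta_\tau$ it can be solved by bisection. A minimal sketch (the helper name and the τ values are our own illustrations, with τ restricted to the range where $\beta_\tau > 1$ exists):

```python
import numpy as np

def beta_of_tau(tau):
    """Solve 1 + 1/β + ln(1/β) = τ for β > 1 by bisection; the left-hand
    side decreases from 2 (as β → 1) to -∞ (as β → ∞), so we need τ < 2."""
    lhs = lambda b: 1 + 1 / b - np.log(b)
    lo, hi = 1 + 1e-12, 1e12
    for _ in range(200):
        mid = np.sqrt(lo * hi)
        lo, hi = (lo, mid) if lhs(mid) < tau else (mid, hi)
    return lo

# As τ grows, β_τ falls, so the mean β_τ/(β_τ - 1) rises (Theorem I.2).
for tau in [0.5, 1.0, 1.5]:
    b = beta_of_tau(tau)
    print(f"tau={tau}: beta={b:.4f}, mean={b / (b - 1):.4f}")
```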
Reduction in mean with increase in α. We show how our framework can capture a phenomenon similar to the multiplicative-bias model of [90]. Recall that in the multiplicative-bias model, the estimated utility of the disadvantaged group is scaled down by a factor $\rho > 1$. Fix a parameter $\beta > 1$. Consider an instance $I_1$ given by the parameters $I_1 = (\Omega = [1,\infty), f_D, \ell, \alpha, \tau)$ where $f_D$ is the Pareto density with parameter $\beta$, $\tau = \mathrm{Ent}(f_D) = 1 + \frac{1}{\beta} + \ln\frac{1}{\beta}$, $\alpha = 1$, and $\ell(x,v) = \ln x - \ln v$. As shown in Theorem I.2, the output density $f^*$ is the same as $f_D$. The proof of this result also shows that the output density $f^*(x)$ is proportional to $e^{-I(x)/\gamma^*}$, where $I(x) = \ln x - \frac{1}{\beta}$ and $\gamma^*$ is the optimal dual variable for the corresponding entropy constraint. The disadvantaged group is modeled by an instance $I_2$ which has the same parameters as those of $I_1$, except that the parameter α is larger than 1. In this case, the optimal density $f^*_\alpha(x)$ is proportional to $e^{-I_\alpha(x)/\gamma^*_\alpha}$, where, by Lemma I.1, $I_\alpha(x) = \alpha\ln x - \frac{\alpha(x^\beta-1)+1}{\beta x^\beta}$ and $\gamma^*_\alpha$ is the corresponding optimal dual variable. For large α, $I_\alpha(x) \approx \alpha\left(I(x) + \frac{1}{\beta x^\beta}\right)$. Hence, the output density (for large α) is given by
$$f^*_\alpha(x) = K\exp\left(-\frac{\alpha}{\gamma^*_\alpha}\left(I(x) + \frac{1}{\beta x^\beta}\right)\right),$$
where $K$ is a normalization constant. The normalization constant $K$ and the ratio $\gamma^*_\alpha/\alpha$ are given by two constraints: (i) the integral of $f^*_\alpha$ over the domain $\Omega$ should be 1, and (ii) the entropy of $f^*_\alpha$ should be equal to $\tau = \mathrm{Ent}(f_D)$. This shows that the ratio $\gamma^*_\alpha/\alpha$ tends to a constant for large α, and hence the output density converges to the density given by
$$g^*(x) = K e^{-C\left(\ln x + \frac{1}{\beta x^\beta}\right)}.$$
We plot the output density $g^*(x)$ for $\beta = 1.5$ in Figure 8(a).

Figure 8: Figure (a) plots the input Pareto density with parameter β = 1.5 and the corresponding limiting output density. Figure (b) plots the mean of the input Pareto density and the corresponding limiting output density as a function of the parameter β.

In Figure 8(b), we observe that for all considered values of the parameter β, the mean of the limiting output density $g^*$ always remains below that of the input density $f_D$. It is also worth noting that for large β, the gap between the mean of the input density and that of the (limiting) output density diminishes. Intuitively, this happens because as β increases, the mean of the input (Pareto) density gets closer to 1 (recall that the mean of a Pareto density with parameter β is equal to $\frac{\beta}{\beta-1}$). Now the output density also places more mass closer to 1, but it is restricted by two conditions: (i) the entropy of the output density must be the same as that of the corresponding input density, and (ii) it cannot place any probability mass on values below 1. Hence, there is not much room for the output density to place extra probability mass on values close to 1 (as compared to the corresponding Pareto density), and so its mean cannot go much below that of the corresponding Pareto density.

J Exponential density

The exponential density is defined as follows over $\Omega = [0, \infty)$ and has a parameter $\lambda > 0$:
$$f_{\mathrm{Exp}}(x) := \lambda e^{-\lambda x}, \quad x \in [0, \infty).$$
$\lambda$ is referred to as the rate parameter. The mean of the exponential density is $\frac{1}{\lambda}$. The differential entropy of $f_{\mathrm{Exp}}$ is $1 - \ln\lambda$ [148]. We consider the loss function $\ell(x,v) := x - v$. First, we compute the expression of $I(x)$, which we use to verify the applicability of Theorem C.1 with the above parameters.

Lemma J.1 (Expression for $I(x)$). Consider an instance $I = (\Omega, f_{\mathrm{Exp}}, \ell, \alpha, \tau)$ of Primal Opt where $\Omega = [0, \infty)$, $f_{\mathrm{Exp}}$ is the Exponential density with rate parameter λ, and $\ell(x,v) := x - v$. Then
$$I(x) = \frac{1}{\lambda}\left[\alpha(\lambda x - 1) + (\alpha - 1)e^{-\lambda x}\right].$$

Proof. The desired integral is
$$I(x) = \alpha\lambda\int_0^x (x-v)e^{-\lambda v}\,d\mu(v) + \lambda\int_x^\infty (x-v)e^{-\lambda v}\,d\mu(v).$$
We do a change of variable, with $y := x - v$ above, to get
$$I(x) = \alpha\lambda e^{-\lambda x}\int_0^x y\,e^{\lambda y}\,d\mu(y) + \lambda e^{-\lambda x}\int_{-\infty}^0 y\,e^{\lambda y}\,d\mu(y) = \alpha\lambda e^{-\lambda x}\left[\frac{e^{\lambda y}(\lambda y - 1)}{\lambda^2}\right]_0^x + \lambda e^{-\lambda x}\left[\frac{e^{\lambda y}(\lambda y - 1)}{\lambda^2}\right]_{-\infty}^0$$
$$= \frac{\alpha}{\lambda}\left(\lambda x - 1 + e^{-\lambda x}\right) - \frac{e^{-\lambda x}}{\lambda} = \frac{1}{\lambda}\left[\alpha(\lambda x - 1) + (\alpha - 1)e^{-\lambda x}\right].$$

Applicability of Theorem C.1. We show that assumptions (A0)–(A5) hold. Since $\Omega = [0, \infty)$, (A0) holds for any finite τ. (A1) and (A2) hold due to the choice of the loss function. (A3) can be shown to hold since $f_{\mathrm{Exp}}$ is an Exponential density: to see this, note that for any finite $x \in \Omega$,
$$\int_\Omega |\ell(x,v)|\,f_{\mathrm{Exp}}(v)\,d\mu(v) = \int_0^x (x-v)\,f_{\mathrm{Exp}}(v)\,d\mu(v) + \int_x^\infty (v-x)\,f_{\mathrm{Exp}}(v)\,d\mu(v) \le \int_0^\infty \left(|x| + |v|\right) f_{\mathrm{Exp}}(v)\,d\mu(v) = |x| + \frac{1}{\lambda} < \infty.$$
Thus, (A3) holds. (A4) holds with, e.g., $R = \frac{1}{\lambda}$. By Lemma J.1, $I(x) = \frac{1}{\lambda}\left[\alpha(\lambda x - 1) + (\alpha-1)e^{-\lambda x}\right]$. From this expression, it follows that $I(x)$ is differentiable at each $x \in \Omega$. Moreover, for all $x \in \Omega$,
$$\frac{\partial^2 I(x)}{\partial x^2} = \lambda(\alpha - 1)e^{-\lambda x}.$$
For any $\alpha > 1$, this second derivative is positive for all $x \in \Omega$; hence $I(x)$ is strictly convex whenever $\alpha > 1$ and, thus, it has a unique global minimum. If $\alpha = 1$, then $I(x) = x - \frac{1}{\lambda}$; this function has a unique global minimum at $x = 0$ over $\Omega = [0, \infty)$. Combining with the $\alpha > 1$ case, it follows that (A5) holds. Since assumptions (A0)–(A5) hold, one can invoke Theorem C.1 to deduce the form of $f^*$.

Theorem J.2 (Expression for $f^*$ when α = 1). Consider an instance $I = (\Omega, f_{\mathrm{Exp}}, \ell, 1, \tau)$ of Primal Opt where $\Omega = [0, \infty)$, $\ell(x,v) := x - v$, and $f_{\mathrm{Exp}}$ is the Exponential density with rate parameter λ. Let $f^*$ be the optimal solution of $I$. Then $f^*$ is the Exponential density with mean $e^{\tau-1}$. Thus, for α = 1, increasing τ decreases the rate parameter (i.e., increases the mean) of the output density.

Proof. Since α = 1, Lemma J.1 implies that $I(x) = x - \frac{1}{\lambda}$. As shown earlier, one can invoke Theorem C.1 for instance $I$ for any finite τ, which implies that $f^*$ has the form
$$f^*(x) \propto \exp\left(-\frac{x - 1/\lambda}{\gamma^*}\right) \propto \exp\left(-\frac{x}{\gamma^*}\right),$$
where the proportionality constant and $\gamma^*$ are determined by $\int_\Omega f^*(x)\,d\mu(x) = 1$ and $\mathrm{Ent}(f^*) = \tau$.
Since $f^*$ is an Exponential density with rate parameter $\frac{1}{\gamma^*}$, its entropy is $\mathrm{Ent}(f^*) = 1 + \ln(\gamma^*)$ [148]. Since $\mathrm{Ent}(f^*) = \tau$, the previous equality implies that $\gamma^* = e^{\tau-1}$. It follows that
$$f^*(x) = e^{1-\tau}\exp\left(-\frac{x}{e^{\tau-1}}\right),$$
which is the Exponential density with mean $e^{\tau-1}$.

K Laplace density

The Laplace density is defined as follows over $\Omega = \mathbb{R}$ and has parameters $a \in \mathbb{R}$ and $b > 0$:
$$f_L(x) := \frac{1}{2b}e^{-|x-a|/b}, \quad x \in \mathbb{R}.$$
$a$ is referred to as the location parameter and $b$ as the diversity. The differential entropy of $f_L$ is $1 + \ln(2b)$ [148]. We consider the loss function to be $\ell(x,v) := |x-a| - |v-a|$ for $a \in \mathbb{R}$. First, we compute the expression for $I(x)$ that we use to show that Theorem C.1 is applicable with the above parameters.

Lemma K.1 (Expression for $I(x)$). Consider an instance $I = (\Omega, f_L, \ell, \alpha, \tau)$ of Primal Opt where $\Omega = \mathbb{R}$, $f_L$ is the Laplace density with parameters $a \in \mathbb{R}$ and $b > 0$, and $\ell(x,v) := |x-a| - |v-a|$. Then
$$I(x) = \begin{cases} \alpha b(w - 1) + \frac{b(\alpha-1)}{2}e^{-w} & \text{if } x \ge a, \\[2pt] -b(w + 1) + \frac{b(1-\alpha)}{2}e^{w} & \text{otherwise}, \end{cases} \qquad\text{where } w := \frac{x-a}{b}.$$

Proof. The desired integral is
$$I(x) = \frac{\alpha}{2b}\int_{-\infty}^x \left(|x-a| - |v-a|\right)e^{-|v-a|/b}\,d\mu(v) + \frac{1}{2b}\int_x^\infty \left(|x-a| - |v-a|\right)e^{-|v-a|/b}\,d\mu(v).$$
We perform a change of variables with $w := \frac{x-a}{b}$ and $y := \frac{v-a}{b}$. Then the above integral becomes
$$I(x) = \frac{b}{2}\left[\alpha\int_{-\infty}^w \left(|w| - |y|\right)e^{-|y|}\,d\mu(y) + \int_w^\infty \left(|w| - |y|\right)e^{-|y|}\,d\mu(y)\right].$$
Now two cases arise: (i) $w \ge 0$, or (ii) $w < 0$. First, consider the case when $w \ge 0$. Then the above integral becomes
$$\frac{b}{2}\left[\alpha\int_{-\infty}^0 (w+y)e^{y}\,d\mu(y) + \alpha\int_0^w (w-y)e^{-y}\,d\mu(y) + \int_w^\infty (w-y)e^{-y}\,d\mu(y)\right]$$
$$= \frac{b}{2}\left[\alpha(w-1) + \alpha\left(w - 1 + e^{-w}\right) - e^{-w}\right] = \alpha b(w-1) + \frac{(\alpha-1)b}{2}e^{-w}.$$
In the second case, we get
$$\frac{b}{2}\left[\alpha\int_{-\infty}^w (-w+y)e^{y}\,d\mu(y) + \int_w^0 (-w+y)e^{y}\,d\mu(y) + \int_0^\infty (-w-y)e^{-y}\,d\mu(y)\right]$$
$$= \frac{b}{2}\left[-\alpha e^{w} + \left(-w - 1 + e^{w}\right) + \left(-w - 1\right)\right] = -b(w+1) + \frac{(1-\alpha)b}{2}e^{w}.$$

Applicability of Theorem C.1. We show that for any finite $a \in \mathbb{R}$ and $b > 0$, assumptions (A0)–(A5) hold for the instance $I = (\Omega, f_L, \ell, \alpha, \tau)$ of Primal Opt where $\Omega = \mathbb{R}$, $f_L$ is the Laplace density with parameters $a \in \mathbb{R}$ and $b > 0$, and $\ell(x,v) = |x-a| - |v-a|$. Since $\Omega = \mathbb{R}$, (A0) holds for any finite τ. (A1) and (A2) hold due to the choice of the loss function. (A3) can be shown to hold since $f_L$ is a Laplace density: to see this, note that for any finite $x \in \Omega$,
$$\int_\Omega |\ell(x,v)|\,f_L(v)\,d\mu(v) \le \int_{-\infty}^\infty \left(|x-a| + |v-a|\right) f_L(v)\,d\mu(v) = |x-a| + b < \infty.$$
Thus, (A3) holds. (A4) holds with, e.g., $R = b\ln(2)$. By Lemma K.1,
$$I(x) = \begin{cases} \alpha b(w-1) + \frac{b(\alpha-1)}{2}e^{-w} & \text{if } x \ge a, \\[2pt] -b(w+1) + \frac{b(1-\alpha)}{2}e^{w} & \text{otherwise}, \end{cases} \qquad\text{where } w = \frac{x-a}{b}.$$
One can check that $I(x)$ is continuous, and differentiable at each $x \in \Omega \setminus \{a\}$. Moreover, for all $x < a$, $\frac{\partial I(x)}{\partial x} < 0$, and for all $x > a$, $\frac{\partial I(x)}{\partial x} > 0$. Hence, it follows that $I(x)$ has a unique global minimum at $x = a$. Therefore, (A5) holds. Since assumptions (A0)–(A5) hold, we invoke Theorem C.1 to deduce the form of the optimal density.

Theorem K.2 (Expression for $f^*$ when α = 1). Consider an instance $I = (\Omega, f_L, \ell, 1, \tau)$ of Primal Opt where $\Omega = \mathbb{R}$, $f_L$ is the Laplace density with parameters $a \in \mathbb{R}$ and $b > 0$, and $\ell(x,v) = |x-a| - |v-a|$. Let $f^*$ be the optimal solution of $I$. Then $f^*$ is the Laplace density with parameters $(a, e^{\tau-1}/2)$. Thus, for α = 1, increasing τ does not change the location parameter, but increases the diversity parameter of the output density.

Proof. Since α = 1, Lemma K.1 implies that $I(x) = |x-a| - b$. As shown earlier, one can invoke Theorem C.1 for instance $I$ for any finite τ, which implies that $f^*$ has the form
$$f^*(x) \propto \exp\left(-\frac{|x-a|}{\gamma^*}\right),$$
where the proportionality constant and $\gamma^*$ are determined by $\int_{\mathbb{R}} f^*(x)\,d\mu(x) = 1$ and $\mathrm{Ent}(f^*) = \tau$. Clearly, $f^*$ is a Laplace density with diversity parameter $\gamma^*$, so its entropy is $\mathrm{Ent}(f^*) = 1 + \ln(2\gamma^*)$ [148]. On the other hand, since $\mathrm{Ent}(f^*) = \tau$, the previous equality implies that $\gamma^* = \frac{1}{2}e^{\tau-1}$. It follows that
$$f^*(x) = e^{1-\tau}\exp\left(-\frac{2|x-a|}{e^{\tau-1}}\right),$$
which is the Laplace density with parameters $(a, e^{\tau-1}/2)$.
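The closed forms in Theorems J.2 and K.2 can be sanity-checked by verifying that the stated output densities have entropy exactly τ, since the entropy constraint is tight at the optimum. A minimal sketch with illustrative parameter values:

```python
import numpy as np
from scipy.integrate import quad

tau, a = 1.7, 0.0   # illustrative values

# Theorem J.2: Exponential output with mean e^{τ-1} on [0, ∞).
rate = np.exp(1 - tau)
f_exp = lambda x: rate * np.exp(-rate * x)
# Use -ln f(x) = -ln(rate) + rate·x analytically, so the integrand never underflows.
ent_exp, _ = quad(lambda x: f_exp(x) * (-np.log(rate) + rate * x), 0, np.inf)
assert np.isclose(ent_exp, tau)

# Theorem K.2: Laplace output with parameters (a, b), where b = e^{τ-1}/2.
b = np.exp(tau - 1) / 2
f_lap = lambda x: np.exp(-abs(x - a) / b) / (2 * b)
integrand = lambda x: f_lap(x) * (np.log(2 * b) + abs(x - a) / b)
left, _ = quad(integrand, -np.inf, a)    # split at the kink x = a for accuracy
right, _ = quad(integrand, a, np.inf)
assert np.isclose(left + right, tau)
```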
L Implementation details and additional empirical results

In this section, we present additional discussion and evaluations of interventions in the JEE setting (Appendix L.1), plots omitted from Section 4 (Appendix L.2), and implementation details of our model (Appendix L.3). The code for this paper is available at https://github.com/AnayMehrotra/Bias-in-Evaluation-Processes.

L.1 Case study: Evaluating bias-mitigating interventions in IIT-JEE admissions

In this section, we continue our study of the effectiveness of different interventions in a downstream selection task. As in Section 4, we consider selection based on the JEE 2009 scores, but here we consider the representational constraints actually used in admissions to IITs. We also discuss additional interventions being implemented by the Indian state and central governments to reduce inequity in JEE scores.

Recall that the Indian Institutes of Technology (IITs) are a group of engineering institutes in India. In 2009, there were 15 IITs; today this number has grown to 23. Undergraduate admissions at IITs are decided based on the scores of candidates in the Joint Entrance Exam (JEE), which is conducted once every year. In 2009, the scores, (binary) genders, and birth categories of all candidates who appeared in JEE 2009 were released in response to a Right to Information application filed in June 2009 [91]. The birth category of a candidate is an official socioeconomic status label recognized by the government of India [135]. Here, we focus on two groups of candidates: candidates in the general (GEN) category (the most privileged) and candidates not in the general category. We begin by discussing some of the interventions in place to reduce inequity in JEE scores and subsequent admissions at IITs.

Interventions used in IIT admissions. The Indian constitution allows the central government and state governments to enforce affirmative action in the form of quotas or lower-bound constraints for official SES groups at educational institutes, in employment, and in political bodies [82, 146]. In 2005, lower-bound interventions were introduced in the admissions process at the IITs. Concretely, in 2009, out of the 7,440 seats, 3,688 (49.6%) were reserved for students who are not in the GEN category. This means that at least 3,688 of the 7,440 students admitted into IITs must not be in the GEN category; note that this allows more than 3,688, or even all, admitted students to be outside the GEN category. We call this constraint the Reservation constraint and, in this section, we study its effectiveness compared to other forms of interventions. Apart from reservations, a number of other interventions have also been proposed and/or implemented to reduce biases in the JEE. We discuss two other types of interventions next.

Interventions to reduce skew. Private coaching institutes that train students for JEE have been criticized for being exorbitantly expensive and, hence, inaccessible to students in low SES groups [44]. Lack of access to training resources can reduce the scores of candidates in low SES groups, creating a skew in the scores. To improve access to training, in 2022 the Delhi government established a new program that provides free training to students enrolled in government-funded schools [53]. Similar programs have also been introduced in other states [52, 141] and by school education boards that span multiple states [26].
In the context of our model, these interventions can be seen as reducing the skew in the evaluation process.

Interventions to reduce the information constraint. A criticism of JEE is that it is offered only in the English and Hindi languages. This is undesirable because only 44% of Indians report English or Hindi as their first language and, according to the 2011 census, less than 68% of Indians list one of these languages among the three languages they are most comfortable with [103]. IITs have been repeatedly criticized for not offering the exam in regional languages [128, 7, 47]. The main concern is that the current exam reduces the performance of students less familiar with English and Hindi. In the context of our model, this can be thought of as placing a stronger information constraint on candidates who do not speak English or Hindi as a first language: these students need to spend a higher cognitive load to understand the questions. This constraint acts not only during the exam but also during the preparation period, because students speaking regional languages (and not English or Hindi) have to devote additional time to learning either English or Hindi in addition to the technical material for the exam. While the JEE exam itself has not yet been offered in regional languages, in 2021 the screening test that candidates have to clear before appearing in JEE was offered in 11 regional languages in addition to English and Hindi [134].

In this section, we compare the effectiveness of the above three interventions: Reservation for lower SES groups, interventions to reduce skew (change α by ∆α percent), and interventions to reduce the information constraint (change τ by ∆τ percent).

Setup (Group sizes and k). Admissions into IITs are highly selective. For instance, in 2009, 384,977 applicants (232,334 from GEN; 60%) took the exam and just 7,440 (2%) were admitted to IITs. Admission is based on the candidates' All India Rank (henceforth just rank), which denotes the candidate's position in the list of candidates ordered in decreasing order of their scores in JEE. Let $G_1$ be the group of students in the GEN category and $G_2$ be all other students. To study the impact of different interventions for admissions into IITs, we fix group sizes and k to match the real numbers: $|G_1| = 232{,}334$, $|G_2| = 152{,}643$, and $k = 7{,}440$. We focus on the set of candidates who scored at least 80 (out of 480) on the exam. (The threshold 80 ensures that at least 10k candidates outside the GEN category are considered, and it is significantly lower than the k-th highest score of 167.)

Figure 9: Effectiveness of different interventions on the selection utility as estimated by our model: We vary the strengths of the interventions (∆α ∈ [0, 1] and ∆τ ∈ [0, 1]) and report the expected utilities of the subset output by all three interventions. The x-axis shows the strength of the intervention changing α (Figure 9(a)) or τ (Figure 9(b)). The y-axis shows the ratio of the (true) utility of the subset output with an intervention to the (true) utility of the subset output without any intervention. Our main observation is that for each of the three interventions, there is a value of the percentage change in α and τ (i.e., ∆α and ∆τ respectively) for which the intervention outperforms the other two interventions. Hence, depending on the amount of change a policymaker expects a specific intervention (e.g., providing free coaching) to have on the parameters α and τ, they can use our framework as a tool to inform their decision about which intervention to enforce. Error bars represent the standard error of the mean over 100 repetitions.
We fix $f_D$ to be the density of utilities of all candidates in $G_1$ who scored at least 80. Since $f_D$ has a Pareto-like density (see Figure 10), we fix $\ell(x,v) = \ln(x) - \ln(v)$. We fix $\Omega$ to be the set of all possible scores and $f_{G_2}$ to be the density of all candidates in $G_2$ who scored at least 80. As in Section 4, we select α and τ that lead to the density closest in TV distance to $f_{G_2}$. The rest of the setup is the same as in Section 4. Unlike in the main body, here we only consider high-scoring candidates (those with a score of at least 80), because JEE is highly selective ($k/n \approx 0.02$) and, hence, for the results to be meaningful, the estimated density $f_E$ should fit the density from the real data well on the top 2% quantile, i.e., the right tail. To ensure this, we specifically consider the right tail of the distribution (by dropping candidates with a score below 80).

Observations and discussion. We vary ∆α ∈ [0, 1] and ∆τ ∈ [0, 1] and report the expected utilities of the subset output by all three interventions over 100 iterations in Figure 9. Our main observation is that for each of the three interventions, there is a value of the percentage change in α and τ (i.e., ∆α and ∆τ respectively) for which the intervention outperforms the other two interventions. Hence, depending on the amount of change a policymaker expects a specific intervention (e.g., providing free coaching) to have on the parameters α and τ, they can use our framework as a tool to inform their decision about which intervention to enforce. Further, we observe that, as expected, increasing ∆α and ∆τ, i.e., the percentage of change in α and τ, improves the utility achieved by the corresponding interventions.

Limitations and further discussion. Next, we discuss some of the limitations of our study. First, we note that interventions such as increasing the accessibility of education can not only reduce inequity in JEE but can also have positive effects on other exams and on hiring. Hence, such interventions can have a larger positive (or negative) impact than suggested by our simulations. Studying these auxiliary effects is beyond the scope of this paper. Further, our study also does not model the response of the students, e.g., how do interventions affect the students' incentive to invest in skill development? Finally, our model only predicts the effect of α and τ on utility distributions. These predictions may not be accurate, and a careful post-deployment evaluation may be required to accurately assess the effectiveness of different interventions.

Figure 10: The densities of scores of students in the GEN category (blue) and students not in the GEN category (orange) in JEE-2009, restricted to students scoring at least 80 points out of 480.

L.2 Additional plots for simulations in Section 4

In this section, we present plots of the best-fit densities output by our framework on different datasets.
Figure 11: Illustration of the best-fit distributions output by our framework for the JEE-2009 dataset: (a) best-fit distribution (α = 3×10⁻⁴ and τ = 1.51) for JEE-2009 (birth category); (b) best-fit distribution (α = 3×10⁻⁴ and τ = 1.51) for JEE-2009 (gender). Captions of subfigures report the best-fit α and τ.

L.3 Implementation details

L.3.1 Our framework and other models

In this section, we give implementation details of our model. Recall that our model outputs the density which is the optimal solution of the following optimization program:
$$\operatorname{argmin}_{f:\ \text{density on }\Omega}\ \mathrm{Err}_{\ell,\alpha}(f_D, f) := \int_\Omega\left(\int_\Omega \ell_\alpha(x,v)\,f(x)\,d\mu(x)\right) f_D(v)\,d\mu(v), \tag{Opt Prog-App}$$
$$\text{such that}\quad -\int_\Omega f(x)\log f(x)\,d\mu(x) \ge \tau.$$

Figure 12: Illustration of the best-fit distributions output by our framework: (a) best-fit distribution (α = 22.78 and τ = 2.09) for the Semantic Scholar Open Research Corpus; (b) best-fit distribution (α = 1.92 and τ = 3.19) for the synthetic network data. Captions of subfigures report the best-fit α and τ.

An instance of this program is specified by the following parameters:
1. A domain $\Omega \subseteq \mathbb{R}$ (e.g., $\Omega = \mathbb{R}$ and $\Omega = [1, \infty)$);
2. A true density $f_D$ over $\Omega$ with respect to the Lebesgue measure $\mu$;
3. A loss function $\ell : \Omega\times\Omega \to \mathbb{R}$ (e.g., $\ell(x,v) = (x-v)^2$ and $\ell(x,v) = \ln(x/v)$);
4. A risk-averseness (or risk-eagerness) parameter $\alpha > 0$; and
5. A resource-information parameter $\tau > 0$.

Recall that $\ell_\alpha$ is a risk-averse loss defined by $\ell$ and α as in (4). For our simulations, we consider the shifted variant of $\ell_\alpha$ mentioned in Section 2: given a shift parameter $v_0 \in \mathbb{R}$, a loss function $\ell : \Omega\times\Omega\to\mathbb{R}$, and a parameter $\alpha > 0$,
$$\ell_{\alpha,v_0}(x,v) = \begin{cases} \alpha\,\ell(x, v + v_0) & \text{if } x > v + v_0, \\ \ell(x, v + v_0) & \text{otherwise}. \end{cases}$$
Let $f^*_{\alpha,\tau,v_0}$ be the optimal solution to the instance $I_{\alpha,\tau,v_0} = (\Omega, f_D, \ell, \alpha, \tau, v_0)$ of (Opt Prog).

Algorithmic task. Given a target density $f_T$ (denoting the density of biased utilities in the data), a risk-averse loss function $\ell_\alpha$, and a true density $f_D$, the goal of our implementation is to find $\alpha^*$, $\tau^*$, and $v_0^*$ that minimize the total variation distance between $f_T$ and $f^*_{\alpha,\tau,v_0}$:
$$(\alpha^*, \tau^*, v_0^*) := \operatorname{argmin}_{\alpha,\tau,v_0}\ d_{\mathrm{TV}}\left(f^*_{\alpha,\tau,v_0},\ f_T\right).$$

Algorithmic approach and implementation. We perform a grid search over all three parameters α, τ, and $v_0$. Given specific α, τ, and $v_0$, we use the characterization in Theorem 3.1 to find $f^*_{\alpha,\tau,v_0}$. Recall that the optimal solution of (Opt Prog) is of the form
$$f^*_{\alpha,\tau,v_0}(x) = C\exp\left(-I_{\alpha,v_0}(x)/\gamma^*\right), \quad\text{where } I_{\alpha,v_0}(x) := \int_\Omega \ell_{\alpha,v_0}(x,v)\,f_D(v)\,d\mu(v),$$
and $C, \gamma^* > 0$ are constants that are uniquely specified by the following two equations:
$$\int_\Omega f^*_{\alpha,\tau,v_0}(x)\,d\mu(x) = 1 \quad\text{and}\quad -\int_\Omega f^*_{\alpha,\tau,v_0}(x)\log f^*_{\alpha,\tau,v_0}(x)\,d\mu(x) = \tau.$$
Algorithmically, finding $C$ and $\gamma^*$ requires computing a double integral over $\Omega$. In all of the simulations in Section 4, $\Omega$ is a discrete domain, so these integrals reduce to summations and we compute them exactly. We also provide an implementation of our algorithm for continuous domains; it uses the quad function in scipy to compute the integrals. For the grid search itself, we varied α over $[10^{-4}, 10^{2}]$, τ over $[10^{-1}, 10]$, and $v_0$ over $\Omega$. We found this range to be sufficient for our simulations, but it would be interesting to design a principled way of specifying the ranges given the other parameters and the target density $f_T$.
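The following sketch illustrates this grid search on a discrete grid. It is not the released implementation: it reuses the `solve_primal_opt` sketch from Appendix C above, the grids are caller-supplied, and the shift $v_0$ is implemented by translating $f_D$, which is equivalent to shifting the loss since $\int \ell_\alpha(x, v + v_0) f_D(v)\,d\mu(v) = \int \ell_\alpha(x, v') f_D(v' - v_0)\,d\mu(v')$.

```python
import numpy as np
from itertools import product

def fit_parameters(f_T, f_D, x, loss, alphas, taus, shifts):
    """Grid search for (α*, τ*, v0*) minimizing d_TV(f*_{α,τ,v0}, f_T)."""
    dx = x[1] - x[0]
    best_d, best_params = np.inf, None
    for alpha, tau, v0 in product(alphas, taus, shifts):
        # Shifting the loss by v0 is equivalent to translating f_D by v0.
        f_D_shift = np.interp(x - v0, x, f_D, left=0.0, right=0.0)
        f_fit = solve_primal_opt(x, f_D_shift, loss, alpha, tau)
        d_tv = 0.5 * np.sum(np.abs(f_fit - f_T)) * dx   # total variation distance
        if d_tv < best_d:
            best_d, best_params = d_tv, (alpha, tau, v0)
    return best_params, best_d

# Example grids, mirroring the ranges used in our grid search.
alphas = np.logspace(-4, 2, 20)
taus = np.logspace(-1, 1, 20)
```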
Implementation details of the multiplicative bias model [90] and the implicit variance model [61]. Recall that the multiplicative bias and the implicit variance models are specified by parameters ρ and σ respectively: given a fixed true value $v \in \mathbb{R}$, the output of the multiplicative bias model is $v/\rho$ and the output of the implicit variance model is $v + \zeta$, where ζ is a zero-mean normal random variable with variance $\sigma^2$. In addition, we allow both models to introduce a shift $v_0$. For the multiplicative bias model, given a true density $f_D$ and a target density $f_T$, we compute $(\rho^*, v_0^*)$ that solves $\operatorname{argmin}_{\rho,v_0} d_{\mathrm{TV}}(f_{\rho,v_0}, f_T)$, where $f_{\rho,v_0}$ is the density of $(v/\rho) + v_0$ for $v \sim f_D$. For the implicit variance model, given a true density $f_D$ and a target density $f_T$, we compute $(\sigma^*, v_0^*)$ that solves $\operatorname{argmin}_{\sigma,v_0} d_{\mathrm{TV}}(f_{\sigma,v_0}, f_T)$, where $f_{\sigma,v_0}$ is the density of $v + v_0 + \zeta$ for $v \sim f_D$ and a zero-mean normal random variable ζ with variance $\sigma^2$. For both models, we compute the optimal parameters using grid search: we vary $v_0$ over $\Omega$, $1/\rho$ over $[0, 1]$, and σ over $[10^{-2}, 10]$.

L.3.2 Computational resources used

All simulations were run on a MacBook Pro with 16 GB RAM and an Apple M2 Pro processor.

L.3.3 JEE-2009 scores

Additional discussion of the dataset. The JEE-2009 test scores were released in response to a Right to Information application filed in June 2009 [91]. This dataset contains the scores of all students from JEE-2009 (384,977 in total) [91]; we used the version provided by [40]. In addition to the scores, for each student, the data contains their self-reported (binary) gender and their birth category. The birth category of a student is an officially designated indicator of their socioeconomic group, where the general (GEN) category is the most privileged; see [135, 19] for more details. We observe that students not in the GEN category have significantly lower average scores than students in the GEN category (18.2 vs. 35.1); this need not imply that students not in the GEN category would perform poorly if admitted. Indeed, among students of equal true potential, those from underprivileged groups are known to perform worse on standardized tests [60]. In the Indian context, this could be due to many reasons, including that, in India, fewer students outside the GEN category attend primary school compared to students from the general category, and, on average, a lower amount of money is spent on the education of students in the non-general category compared to the general category [92].

Figure 13: Distribution of scores in the JEE dataset for different protected groups based on birth category. See Section 4 for a discussion of the dataset.

L.3.4 Semantic Scholar Open Research Corpus

Cleaning and predicting author names. We follow the procedure used by [40]. Concretely, we remove papers without a publication year (1.86% of the total) and predict author gender using their first name from a publicly available dataset [3] containing the first names and genders of everyone born between 1890 and 2018 and registered with the US Social Security Administration (USSSA). We remove authors whose first name has 2 or fewer characters, as these names are likely to be abbreviations (retaining 75% of the total), and then categorize an author as female (respectively, male) if more than a ϕ = 0.9 fraction of the people with the same first name are female (respectively, male) in the USSSA data. We drop all uncategorized authors (32.25% of the remaining). This results in 3,900,934 women and 5,074,426 men (43.46% women). We present the tradeoff between the total number of authors retained and ϕ in Figure 14.
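For concreteness, here is a sketch of the thresholding rule just described (the reference-table schema — columns `name`, `female_count`, `male_count` — is an illustrative assumption, not the USSSA file's actual format):

```python
import pandas as pd

def categorize(first_name: str, name_stats: pd.DataFrame, phi: float = 0.9):
    """Return 'female'/'male' if more than a phi fraction of bearers of
    `first_name` share that gender in the reference data; None otherwise."""
    if len(first_name) <= 2:            # likely an abbreviation: drop
        return None
    rows = name_stats[name_stats["name"] == first_name.lower()]
    if rows.empty:                      # name absent from the reference data: drop
        return None
    r = rows.iloc[0]
    frac_female = r["female_count"] / (r["female_count"] + r["male_count"])
    if frac_female > phi:
        return "female"
    if frac_female < 1 - phi:
        return "male"
    return None                         # ambiguous names are left uncategorized
```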
Counting the number of citations. We aim to ensure that the citation counts we compute correspond to the total citations received by an author over their lifetime (so far). Since the dataset only contains citations from 1980 onwards, we remove authors who published their first paper before 1980, as the dataset does not have information about their earlier citations. This is the same as the cleaning procedure used by [40]. We present the resulting citation distributions for male and female authors in Figure 15.

Figure 14: The tradeoff between the threshold ϕ used for cleaning the Semantic Scholar Open Research Corpus and the number of authors retained. Details appear in Appendix L.3.4.

Figure 15: Distributions of total citations of men and women in the Semantic Scholar Open Research Corpus. Details appear in Appendix L.3.4.