On Human-Aligned Risk Minimization

Liu Leqi, Carnegie Mellon University, Pittsburgh, PA 15213, leqil@cs.cmu.edu
Adarsh Prasad, Carnegie Mellon University, Pittsburgh, PA 15213, adarshp@cs.cmu.edu
Pradeep Ravikumar, Carnegie Mellon University, Pittsburgh, PA 15213, pradeepr@cs.cmu.edu

Abstract

The statistical decision-theoretic foundations of modern machine learning have largely focused on the minimization of the expectation of some loss function for a given task. However, seminal results in behavioral economics have shown that human decision-making is based on different risk measures than the expectation of any given loss function. In this paper, we pose the following simple question: in contrast to minimizing expected loss, could we minimize a better human-aligned risk measure? While this might not seem natural at first glance, we analyze the properties of such a revised risk measure, and surprisingly show that it might also better align with additional desiderata like fairness that have attracted considerable recent attention. We focus in particular on a class of human-aligned risk measures inspired by cumulative prospect theory. We empirically study these risk measures, and demonstrate their improved performance on desiderata such as fairness, in contrast to the traditional workhorse of expected loss minimization.

1 Introduction

The decision-theoretic foundations of modern machine learning models have largely focused on estimating model parameters that minimize the expectation of some loss function. This ensures that the resulting models have high average-case performance, which loosely is what is meant by good generalization performance. However, as ML models are increasingly deployed in broader societal settings, and in particular to assist humans in decision-making, it is clear that humans want models to have not just good average performance but also properties like fairness. Due to the importance of these additional desiderata, there has been a burgeoning interest in capturing these properties via appropriate constraints and modifications of the classical objective of expected loss [8, 12, 18, 27].

In this work, we posit a very natural, if simple, solution to addressing these varied desiderata, which are driven in large part by human considerations. Specifically, we suggest that in contrast to using the standard workhorse of expected loss, we draw from theories of human cognition in psychology and behavioral economics to consider a human-aligned risk instead. Alternatives to expected-loss-based risk measures have a long history in decision-making [16], with earlier efforts focusing on percentile risk criteria [11]. In machine learning, instead of minimizing expected loss, various risk measures have been considered in different settings. In risk-sensitive reinforcement learning, conditional value-at-risk (CVaR), a percentile risk measure that quantifies the tail performance of a model, has been connected to robustness to modeling errors [5, 21]. Recently, human-aligned risk measures have also been explored in bandit [14] and reinforcement learning [23] settings, where the goal of the agent is to produce long-term returns aligned with the preferences of one or more humans.

Contributions. In this work, we introduce a novel notion of human risk minimization (Section 3), by bringing ideas from cumulative prospect theory (Section 2) into supervised learning.
We explore various salient characteristics of our objective, such as diminishing sensitivity, decision-making based on higher-order moments, and an information-content or "surprisal" viewpoint of human risk (Section 4). We also study the implications of minimizing our objective in the context of subpopulation performance (Section 5). In particular, our empirical results illustrate that human risk minimization inherently avoids drastic losses across all subgroups.

2 Background

2.1 Cumulative Prospect Theory

As a seminal work in behavioral economics, cumulative prospect theory (CPT) [28] provides a framework to emulate human decision-making under uncertainty. In particular, CPT points out that humans overweight extreme events that occur with low probability, rather than treating all events equally, which is the assumption of expected utility theory (EUT). As an alternative to EUT, CPT has three important components [25]: outcomes are considered as gains or losses compared to a reference point; value functions are concave for gains, convex for losses, and flatter for gains than for losses; and an inverse S-shaped (first concave, then convex) probability weighting function (Figure 1(a)) is used to transform the cumulative distribution function so that small probabilities are inflated and large probabilities are deflated [30].

Current machine learning follows the EUT framework in the sense that expected losses are minimized. However, as CPT has pointed out, a human evaluates risk differently. For example, given two models {M1, M2} such that M1 has zero loss with probability .95 and loses 100 with probability .05, while M2 loses 5.01 all the time, EUT will choose M1. The reasoning is that the expected loss of M1 is 5, which is smaller than the expected loss of M2, which is 5.01. However, from the CPT perspective, M2 will be chosen, because CPT inflates the probability .05 and M1 ends up having a larger risk. In this case, because of the human-innate probability weighting, we end up choosing a model that avoids drastic losses instead of the one with a better average performance.

The inverse S-shaped CPT probability weighting function captures the fact that humans over-weight extreme events with low probability while under-weighting average events that are more probable but less extreme. Many parametric forms of the probability weighting function have been proposed [24, 25, 28, 30]. To start with, we formally define a class of weighting functions called $\mathcal{W}_{\mathrm{CPT}}$.

Definition 1. Let $w: [0, 1] \to [0, 1]$ be a differentiable function. Then $w \in \mathcal{W}_{\mathrm{CPT}}$ if and only if
1. $w(0) = 0$ and $w(1) = 1$;
2. there exists $a \in (0, 1)$ such that $w(a) = a$;
3. $w'(x)$ is monotonically decreasing on $x \in [0, a)$ and $w'(x)$ is monotonically increasing on $x \in (a, 1]$.

Traditional CPT probability weighting functions fall into this class, including the original weighting function (for losses) $w(x) = \frac{x^{0.69}}{\left(x^{0.69} + (1-x)^{0.69}\right)^{1/0.69}}$ [28].
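As a rough illustration of this weighting function and of the {M1, M2} example (our sketch, not code from the paper), the following snippet evaluates the Tversky-Kahneman loss-weighting function and shows how inflating the 5% tail probability reverses the EUT ranking:

```python
import numpy as np

def w_tk(p, gamma=0.69):
    """Tversky-Kahneman probability weighting function for losses [28]:
    w(p) = p^gamma / (p^gamma + (1-p)^gamma)^(1/gamma)."""
    p = np.asarray(p, dtype=float)
    return p**gamma / (p**gamma + (1 - p)**gamma)**(1 / gamma)

# Small probabilities are inflated, large probabilities are deflated.
print(w_tk(0.05))  # ~0.11 > 0.05
print(w_tk(0.95))  # ~0.85 < 0.95

# M1 loses 100 with probability 0.05 (expected loss 5); M2 loses 5.01 surely.
# Under CPT with v(x) = x, the worst loss outcome receives decision weight
# w(0.05), so M1's perceived risk exceeds M2's and M2 is preferred.
risk_m1 = 100 * w_tk(0.05)  # ~11.1
risk_m2 = 5.01
print(risk_m1, risk_m2)
```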
For a real-valued continuous random variable $X$ with cumulative distribution function $F(x)$ and a CPT probability weighting function $w \in \mathcal{W}_{\mathrm{CPT}}$, the CPT subjective utility is defined as [25, 28]:

$$U_{\mathrm{CPT}}(X) = \int_{-\infty}^{+\infty} v(x)\, \mathrm{d}g(F(x)) \qquad (1)$$

where (1) $v: \mathbb{R} \to \mathbb{R}$ is a value function; (2) $g(F(x)) = w(F(x))$ when $x < 0$ and $g(F(x)) = w(1 - F(x))$ when $x \ge 0$.

Rank-dependent Utility. As pointed out in [28], the CPT subjective utility is a rank-dependent utility, since the decision weight on $x$ depends on the rank of $x$, which is given by $F(x)$. When $F(x)$ is weighted by $w \in \mathcal{W}_{\mathrm{CPT}}$ [7] and $v(x) = x$, the CPT-weighted rank-dependent utility is:

$$U_{\mathrm{CPT\text{-}RD}}(X) = \int_{-\infty}^{+\infty} x\, \mathrm{d}w(F(x)). \qquad (2)$$

We focus on studying the effect of using the CPT-weighted cumulative distribution function $w(F(x))$ when training an ML model. Hence, analyzing the effect of using a reference point and a value function $v$ is out of the scope of this paper. As one may have noticed, if $w(F(x)) = F(x)$, then $U_{\mathrm{CPT\text{-}RD}}(X) = \mathbb{E}[X]$.

To have a finite CPT subjective utility [25] for a real-valued continuous random variable $X$ with $w \in \mathcal{W}_{\mathrm{CPT}}$, it is sufficient to ensure that $w$ is strictly increasing on $[0, 1]$ and continuously differentiable on $[0, 1]$, i.e., $w'(0)$ and $w'(1)$ are finite. As proposed by [25], the simplest polynomial that satisfies the above conditions is

$$w_{\mathrm{POLY}}(F(x)) = \frac{3 - 3b}{a^2 - a + 1}\Big(F(x)^3 - (a+1)F(x)^2 + a\,F(x)\Big) + F(x), \qquad (3)$$

where $a \in (0, 1)$ is the fixed point, i.e., $w_{\mathrm{POLY}}(a) = a$, and $b \in (0, 1)$ controls the curvature of $w_{\mathrm{POLY}}(\cdot) \in \mathcal{W}_{\mathrm{CPT}}$. As $b$ approaches 1, $w_{\mathrm{POLY}}(F(x))$ converges to the linear function $w(F(x)) = F(x)$. One could interpret $b$ as controlling the sensitivity of the probability weighting function to a unit difference in probability changes [13]. We use $w_{\mathrm{POLY}}(\cdot\,; a, b)$ to denote the polynomial-form CPT probability weighting function with fixed point $a$ and curvature $b$.

Figure 1: (a) The inverse S-shaped probability weighting function $w_{\mathrm{POLY}}(F(x))$ is steepest near the endpoints 0 and 1; the parametric form of $w_{\mathrm{POLY}}(\cdot\,; a, b)$ is shown in Equation 3. (b) The U-shaped derivative $w'_{\mathrm{POLY}}(F(x))$ of the CPT probability weighting function up-weights the tails of the original distribution.

Proposition 1. Given any cumulative distribution function $F(x)$, if a non-decreasing continuous function $w: [0, 1] \to [0, 1]$ satisfies $w(0) = 0$ and $w(1) = 1$, then $w(F(x))$ is a cumulative distribution function of some random variable.

For any $w \in \mathcal{W}_{\mathrm{CPT}}$ that is non-decreasing (e.g., $w_{\mathrm{POLY}}$), and a real-valued continuous random variable $X$ with cumulative distribution function $F(x)$ and density $f(x)$, one can think of $f(x)\,w'(F(x))$ as a CPT-weighted density and $w(F(x))$ as the corresponding CPT-weighted cumulative distribution function. $U_{\mathrm{CPT\text{-}RD}}$ is then the expectation of the random variable that has the CPT-weighted cumulative distribution function. We denote the set of non-decreasing functions in $\mathcal{W}_{\mathrm{CPT}}$ by $\mathcal{W}_{\mathrm{CPT}}^{+}$.

2.2 Empirical Risk Minimization

The canonical way of learning an ML model is through empirical risk minimization (ERM). Given $n$ i.i.d. samples $Z_1, \ldots, Z_n \in \mathcal{Z}$ and a loss function $\ell: \Theta \times \mathcal{Z} \to \mathbb{R}$, the population risk (expected loss) of a model $\theta$ is defined to be $R(\theta) = \mathbb{E}[\ell(\theta; Z)]$. ERM minimizes the empirical risk $\frac{1}{n}\sum_{i=1}^{n}\ell(\theta; Z_i)$. However, expectation is only one of many risk measures. For example, value-at-risk and conditional value-at-risk [26] are popular risk measures for evaluating the risk of portfolios of financial instruments. CPT defines another way of measuring risk, one that aligns with human preferences. We want to study whether minimizing a human-aligned risk will give us ML models with desirable properties beyond a low population risk.

3 Human Risk Minimization

Definition 2. Given a real-valued random variable $Z \in \mathcal{Z}$, a loss function $\ell: \Theta \times \mathcal{Z} \to \mathbb{R}$ and a CPT probability weighting function $w \in \mathcal{W}_{\mathrm{CPT}}$, the human risk is defined to be

$$R_H(\theta; w) \stackrel{\mathrm{def}}{=} \mathbb{E}\big[\ell(\theta; Z)\, w'(F(\ell(\theta; Z)))\big], \qquad (4)$$

where $F(\ell(\theta; Z))$ is the cumulative distribution function of the loss.

Comparing Equation 4 with the CPT-weighted rank-dependent utility in Equation 2, we see that $R_H(\theta) = U_{\mathrm{CPT\text{-}RD}}(\ell(\theta; Z))$.
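The following sketch (ours, not from the paper) implements the polynomial weighting function of Equation (3) and its derivative; the derivative supplies the weights $w'(F(\ell))$ that appear in the human risk of Definition 2. The defaults $a = 0.5$ and $b = 0.3$ mirror the values used in the experiments later in the paper.

```python
import numpy as np

def w_poly(p, a=0.5, b=0.3):
    """Polynomial CPT weighting function of Equation (3):
    w(p) = (3 - 3b)/(a^2 - a + 1) * (p^3 - (a+1)p^2 + a*p) + p."""
    c = (3 - 3 * b) / (a**2 - a + 1)
    return c * (p**3 - (a + 1) * p**2 + a * p) + p

def w_poly_prime(p, a=0.5, b=0.3):
    """Derivative of w_poly; used as the per-sample weight in the human risk."""
    c = (3 - 3 * b) / (a**2 - a + 1)
    return c * (3 * p**2 - 2 * (a + 1) * p + a) + 1

p = np.array([0.0, 0.5, 1.0])
print(w_poly(p))        # [0, 0.5, 1]: endpoints and the fixed point a = 0.5
print(w_poly_prime(p))  # U-shaped: large near 0 and 1, equal to b at p = a
```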
Given $n$ i.i.d. samples $Z_1, \ldots, Z_n \in \mathcal{Z}$, we define empirical human risk minimization (EHRM) as

$$\hat{\theta} = \arg\min_{\theta} \frac{1}{n}\sum_{i=1}^{n} \ell(\theta; Z_i)\, w'\big(F_n(\ell(\theta; Z_i))\big), \qquad (5)$$

where $F_n$ is the empirical CDF of the loss.

Optimization. When $\ell$ is differentiable, we use the following iterative update rule to minimize the empirical human risk: $\theta_{t+1} = \theta_t - \eta_t \sum_{i=1}^{n} w_i^t\, \nabla_\theta \ell(\theta_t; Z_i)$ for all $t \in \{0, \ldots, T-1\}$, where $w_i^t = w'(F_n(\ell(\theta_t; Z_i)))$ and $\eta_t$ is the learning rate. Note that this heuristic approach relies crucially on the assumption that minor perturbations in $\theta$ do not change $w'(F_n(\ell(\theta; Z)))$ drastically. We empirically show that such a heuristic approach performs quite well in practice (see Appendix B). Deriving provably optimal optimization algorithms for EHRM is an interesting open problem.

Remarks. In general, there are two levels of decision making in supervised learning: model selection when training a model, and instance prediction when using a model. These two kinds of decisions are closely related. In traditional ERM, a model is selected over others when its per-instance predictions are more accurate on average. We explore the consequences of EHRM in both settings: Sections 2.1 and 4.2 discuss the model-selection consequences of EHRM; Section 5 explores its consequences for per-instance predictions. Different from traditional settings where CPT is considered, supervised learning only evaluates losses. Humans tend to be risk-averse when facing possibilities of large loss; such a property distinguishes EHRM from ERM. When training machine learning models, surrogate losses are used (e.g., the hinge loss is used in place of the 0/1 loss). In most cases, such surrogate losses are upper bounds on the original losses. In such cases, the risk aversion towards possible drastic losses carries over when the surrogate loss is used instead of the true loss.

To further understand how adding the weights $w'(F(\ell(\theta; Z)))$ to the expected loss influences the learned model, we provide a psychological interpretation (Section 4.1), an analytical illustration of how skewness of the loss distribution may influence the choices of people with different risk preferences (Section 4.2), and an information-weighting viewpoint of the probability weighting function (Section 4.3).
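To make the heuristic update rule above concrete, here is a minimal full-batch sketch for a linear model with squared loss (our illustration; the helper names `empirical_cdf` and `ehrm_linear_regression` are not from the paper):

```python
import numpy as np

def w_poly_prime(p, a=0.5, b=0.3):
    # Derivative of the polynomial CPT weighting function (Equation 3).
    c = (3 - 3 * b) / (a**2 - a + 1)
    return c * (3 * p**2 - 2 * (a + 1) * p + a) + 1

def empirical_cdf(losses):
    # F_n(l_i): fraction of samples with loss <= l_i.
    order = np.sort(losses)
    return np.searchsorted(order, losses, side="right") / len(losses)

def ehrm_linear_regression(X, y, a=0.5, b=0.3, lr=0.01, steps=2000):
    """Heuristic EHRM: reweighted full-batch gradient descent on squared loss,
    treating the weights w'(F_n(loss_i)) as constants at each step."""
    theta = np.zeros(X.shape[1])
    for _ in range(steps):
        residuals = X @ theta - y
        losses = 0.5 * residuals**2
        weights = w_poly_prime(empirical_cdf(losses), a, b)
        theta -= lr * X.T @ (weights * residuals) / len(y)
    return theta
```

With $b$ close to 1 the weights approach 1 everywhere and the update reduces to ordinary gradient descent on the empirical risk.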
4 Characteristic Properties of Human Risk Minimization

We next review some characteristic properties of human risk minimization, contrasting it with the standard machine learning objective of expected loss minimization. To simplify notation, we will denote the CDF $F(\ell(\theta; Z))$ of the loss random variable $L$ by $F(\ell)$ in the subsequent analysis.

4.1 Diminishing Sensitivity to Probability Changes

Recall from Equation (3) that we work with the polynomial form of the CPT probability weighting function $w_{\mathrm{POLY}}(F(x)) = \frac{3 - 3b}{a^2 - a + 1}\big(F(x)^3 - (a+1)F(x)^2 + a\,F(x)\big) + F(x)$, where $a \in (0, 1)$ is the fixed point of the function and $b \in (0, 1)$ controls the curvature. For any event $E$ with probability $P(E) \in (0, 1)$ and a probability change $\Delta$, we define $g(P(E)) = \big(w_{\mathrm{POLY}}(P(E) + \Delta; a, b) - w_{\mathrm{POLY}}(P(E); a, b)\big)/\Delta$, which is the ratio between the human-perceived probability change and the original probability change. Intuitively, $g(P(E))$ represents a human's sensitivity to probability changes.

Lemma 1. For any event $E$ with probability $P(E) \in (0, 1)$, $\lim_{\Delta \to 0} g(P(E))$ is a monotonically increasing function of $\big|P(E) - \frac{a+1}{3}\big|$.

The above result can be seen as quantitative evidence of how the CPT probability weighting function captures humans' diminishing sensitivity, which has been long studied in behavioral economics [28]. Humans are sensitive to probability changes of extreme events; such sensitivity diminishes as the events become less extreme. When using the CPT probability weighting function to weight $F(\ell)$, the event we consider is $E = \mathbb{1}\{L \le \ell\}$, i.e., whether the loss $L$ is less than or equal to a threshold $\ell$. In this case, $P(E) = F(\ell)$. Diminishing sensitivity states that, for a given amount of probability change, a human's perceived probability change depends on where the change happens. The perceived change diminishes as the change occurs farther from the boundaries (impossibility, $F(\ell) = 0$, and certainty, $F(\ell) = 1$). Probability changes that happen close to the boundaries are up-weighted, while changes in between are down-weighted. As shown in Lemma 1, for $w_{\mathrm{POLY}}$, the sensitivity to the probability change $\Delta$ diminishes as $P(E)$ moves away from 0 (impossibility) and 1 (certainty).

4.2 Responsiveness to Skewness of the Loss Distribution

Since the inverse S-shaped probability weighting function exaggerates small probabilities of both good and bad extreme outcomes, its overall impact on evaluating a model intuitively depends on higher-order moments of the loss distribution. We highlight this phenomenon by considering a family of Bernoulli distributions with the same mean and variance. In particular, consider the family of models $\{M_\theta \mid \theta \in (0, 1)\}$ whose losses $\{\ell(\theta) \mid \theta \in (0, 1)\}$ are parameterized by $\theta$. For all $\theta \in (0, 1)$, suppose that $\ell(\theta)$ follows a (shifted and scaled) Bernoulli distribution [15, 20]:

$$\ell(\theta) = \begin{cases} 1 - \sqrt{\tfrac{1-\theta}{\theta}} & \text{with probability } \theta, \\[4pt] 1 + \sqrt{\tfrac{\theta}{1-\theta}} & \text{with probability } 1 - \theta. \end{cases}$$

In the above setup, the mean and variance of the losses are independent of $\theta$. In particular, we have $\mathbb{E}[\ell(\theta)] = 1$ and $\mathrm{Var}(\ell(\theta)) = 1$ for all $\theta \in (0, 1)$. Hence, in this setting, empirical risk minimization treats all the models equally. However, the third central moment (skewness) of $\ell(\theta)$ is given by

$$\mathrm{Skewness}(\ell(\theta)) = \frac{2\theta - 1}{\sqrt{\theta(1-\theta)}}.$$

Observe that $\mathrm{Skewness}(\ell(\theta))$ is a monotonically increasing function of $\theta$, with $\mathrm{Skewness}(\ell(\theta)) = 0$ for $\theta = \frac{1}{2}$. Hence, $\theta < 0.5$ corresponds to models with negatively skewed loss distributions, while $\theta > 0.5$ corresponds to models with positively skewed loss distributions. Then, in this setting, we have the following result:

Lemma 2. Consider the human risk objective in Equation (4) instantiated with $w_{\mathrm{POLY}}$ having fixed point $a = \frac{1}{2}$. Then, we have the following:
1. For $\theta < 0.5$, $R_H(\theta; w_{\mathrm{POLY}}(\cdot\,; a, b))$ is a monotonically increasing function of $b$.
2. For $\theta > 0.5$, $R_H(\theta; w_{\mathrm{POLY}}(\cdot\,; a, b))$ is a monotonically decreasing function of $b$.

Remarks. The above result shows that for models with negatively skewed loss distributions, the expected loss is higher than any human risk, while the opposite is true for positively skewed loss distributions. While empirical risk minimization treats all these models equally, human risk minimization distinguishes the models through higher-order moments of the loss distribution.
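As a small numerical check of this family (ours, not from the paper, and assuming the parameterization written above), the sketch below confirms that the two-point losses share mean 1 and variance 1 while their rank-dependent CPT risk (Equation 2) under $w_{\mathrm{POLY}}(\cdot\,; 0.5, 0.3)$ differs with the sign of the skewness:

```python
import numpy as np

def w_poly(p, a=0.5, b=0.3):
    # Polynomial CPT weighting function (Equation 3).
    c = (3 - 3 * b) / (a**2 - a + 1)
    return c * (p**3 - (a + 1) * p**2 + a * p) + p

def two_point_loss(theta):
    # Loss values and probabilities of the family in Section 4.2.
    low = 1 - np.sqrt((1 - theta) / theta)   # with probability theta
    high = 1 + np.sqrt(theta / (1 - theta))  # with probability 1 - theta
    return np.array([low, high]), np.array([theta, 1 - theta])

def moments(vals, probs):
    mean = probs @ vals
    return mean, probs @ (vals - mean)**2, probs @ (vals - mean)**3

def cpt_risk(vals, probs, b=0.3):
    # Rank-dependent risk (Equation 2) for a two-point loss: the smaller loss
    # gets decision weight w(P(L = low)); the larger loss gets the rest.
    return vals[0] * w_poly(probs[0], b=b) + vals[1] * (1 - w_poly(probs[0], b=b))

for theta in (0.2, 0.8):  # negatively vs. positively skewed, same mean/variance
    vals, probs = two_point_loss(theta)
    print(theta, moments(vals, probs), cpt_risk(vals, probs))
# theta = 0.2: CPT risk ~0.66 < 1;  theta = 0.8: CPT risk ~1.34 > 1.
```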
4.3 Weighting by Information Content

The information content or "surprisal" of an event is the amount of information gained when the event is observed, and is defined as follows.

Definition 3. The information content of an event $E$ with probability $P(E)$ is defined as $I(E) \stackrel{\mathrm{def}}{=} -\ln[P(E)]$, where $e$ is used as the base of the logarithm.

Note that the above definition captures the intuition that the observation of a rare event provides more information than that of a common one. In our setting, the rare event corresponds to the event that the loss $L$ takes extreme values.

Next, we construct a special weighting function $w_{\mathrm{IT}}(\cdot)$ using the information content of the events $E_1 = \mathbb{1}\{L \le \ell\}$ and $E_2 = \mathbb{1}\{L > \ell\}$. Observe that the information content of the events $E_1$ and $E_2$ is given by $-\ln F(\ell)$ and $-\ln(1 - F(\ell))$, respectively. Moreover, it is easy to see that as $\ell$ gets smaller, the information content of the left-tail event $E_1$ increases; and as $\ell$ gets larger, the information content of the right-tail event $E_2$ increases. We can therefore use the information content of $E_1$ and $E_2$ to weight the density of $L$, and define the corresponding weighting function. In particular, the information-weighted density is defined to be

$$w'_{\mathrm{IT}}(F(\ell))\, f(\ell) = \frac{1}{2}\big(I(E_1) + I(E_2)\big)\, f(\ell) = -\frac{1}{2}\, f(\ell)\, \ln\big(F(\ell)\,(1 - F(\ell))\big),$$

and the corresponding information-content weighting function is given by

$$w_{\mathrm{IT}}(F(\ell)) = \int_0^{F(\ell)} w'_{\mathrm{IT}}(u)\, \mathrm{d}u = \frac{1}{2}\Big[(1 - F(\ell))\ln(1 - F(\ell)) - F(\ell)\ln F(\ell)\Big] + F(\ell).$$

Lemma 3. The information-content weighting function $w_{\mathrm{IT}}$ belongs to $\mathcal{W}_{\mathrm{CPT}}$.

Interestingly, using $w_{\mathrm{IT}}$ to weight a distribution is of interest in information theory, where it is known as the two-sided information-weighted distribution [6]. Moreover, it is easy to see that $w_{\mathrm{POLY}}(\cdot\,; a, b)$ with fixed point $a = 1/2$ and curvature $b = \ln 2$ is approximately equal to the third-order Taylor approximation of $w_{\mathrm{IT}}$. To the best of our knowledge, this is the first time that the information weighting function and the CPT probability weighting function have been connected. Uncertainty aversion in human preferences has also been studied in behavioral economics [3, 4]. In the context of human risk minimization, using $w_{\mathrm{IT}}$, we can define the information-weighted human risk to be $R_H(\theta; w_{\mathrm{IT}}) = \mathbb{E}[\ell(\theta; Z)\, w'_{\mathrm{IT}}(F(\ell(\theta; Z)))]$. As studied in [6], the information-weighted distribution $w_{\mathrm{IT}}(F(\ell))$ will be heavy-tailed when the CDF $F(\ell)$ has tails of the form $e^{-\kappa|\ell|}$ for some $\kappa > 0$. Such heavy-tailedness of the loss distribution may make the human risk hard to estimate. We believe that deriving statistically and computationally optimal procedures for minimizing $R_H(\theta; w_{\mathrm{IT}})$ is an interesting direction for future work.
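The sketch below (ours) evaluates the closed form of $w_{\mathrm{IT}}$ above and compares it with $w_{\mathrm{POLY}}(\cdot\,; 0.5, \ln 2)$, mirroring the Taylor-approximation remark; the agreement is within a few hundredths over $[0, 1]$:

```python
import numpy as np

def xlogx(t):
    # t * ln(t) with the convention 0 * ln(0) = 0.
    t = np.asarray(t, dtype=float)
    return np.where(t > 0, t * np.log(np.clip(t, 1e-300, None)), 0.0)

def w_it(p):
    """Information-content weighting function:
    w_IT(p) = 0.5 * ((1 - p) ln(1 - p) - p ln p) + p."""
    return 0.5 * (xlogx(1 - p) - xlogx(p)) + p

def w_poly(p, a=0.5, b=np.log(2)):
    # Polynomial CPT weighting function (Equation 3); b = ln 2 matches the
    # slope of w_IT at the fixed point a = 1/2.
    c = (3 - 3 * b) / (a**2 - a + 1)
    return c * (p**3 - (a + 1) * p**2 + a * p) + p

p = np.linspace(0.0, 1.0, 11)
print(np.round(w_it(p), 3))                      # 0 and 1 at the endpoints, 0.5 at p = 0.5
print(np.round(np.abs(w_it(p) - w_poly(p)), 3))  # small discrepancy everywhere
```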
5 Implications for Performance over Subgroups

Machine learning models are increasingly deployed to automate a variety of day-to-day tasks. Employers use such models to select job applicants, and similar models provide credit scores and predict insurance premiums. With such high stakes, ensuring that learned models are non-discriminatory, or fair, with respect to sensitive features such as gender and race is of utmost importance [10, 18, 19, 22]. In this section, we explore the implications of HRM for subgroup performance. In particular, since HRM up-weights possible extreme events, we expect HRM to avoid drastic losses for all subpopulations. We test this hypothesis on both synthetic and real-world datasets and use $w_{\mathrm{POLY}}(\cdot\,; a, b)$ as specified in Equation 3. We have chosen $a = .5$ so that for all $b \in (0, 1)$, $w_{\mathrm{POLY}}(F(\ell))$ is symmetric about its fixed point $(a, a)$.

Figure 2: Majority and minority group performance of ERM, EHRM and CVaR on the synthetic dataset (panels: (a) majority group performance, (b) minority group performance). Note that when $b = 1$, the CPT probability weighting function is the identity function; hence, EHRM is the same as ERM. 2000 training and 20000 testing data points are used in the experiment. Solid lines and shaded areas represent the means and one standard deviation of the risks.

5.1 Synthetic Experiment

Setup. In this experiment, we create a synthetic regression task to test the performance of EHRM on the minority subgroup. We follow the setup of [9] and draw our covariates (features) from an isotropic Gaussian $X \sim \mathcal{N}(0, I_5)$ in $\mathbb{R}^5$. The noise $\epsilon$ is drawn from $\mathcal{N}(0, .01)$. We draw our response variable $Y$ as

$$Y = \begin{cases} X^\top \theta + \epsilon & \text{if } X_{(1)} \le 1.645, \\ X^\top \theta + X_{(1)} + \epsilon & \text{otherwise,} \end{cases}$$

where $\theta = [1, 1, 1, 1, 1]^\top$ and $X_{(1)}$ is the first coordinate of $X$. Observe that since $P\big(X_{(1)} > 1.645\big) = .05$, $\{X \mid X_{(1)} > 1.645\}$ represents our minority subgroup. We fix the squared error $\ell(\theta; (x, y)) = \frac{1}{2}(y - x^\top \theta)^2$ as our loss function.

Results. Figure 2 plots the risks of the minority and majority groups for EHRM and ERM. The empirical risk minimizer, i.e., the solution of this ordinary least squares problem, is denoted by OLS. We see that for different values of $b < 1$, EHRM has a lower minority risk than ERM. Moreover, as $b$ approaches 1, EHRM becomes more similar to ERM. This validates our hypothesis: because the inverse S-shaped probability weighting function inflates small probabilities of extreme losses, drastic losses of the minority group are exaggerated, and human risk minimization trades a low population risk for better minority performance. The optimization performance of EHRM is shown in Figure 4 (Appendix B).

In addition to comparing with ERM, we have also compared EHRM with conditional value-at-risk (CVaR), a risk measure that has been used to measure worst-case subgroup performance [8, 29]. $\mathrm{CVaR}_\alpha(\ell(\theta; (x, y)))$ is the expectation of the worst $\alpha$ fraction of the losses. As shown in Figure 2, when $\alpha$ is small, CVaR has a lower minority risk than EHRM and ERM, at the cost of a higher majority risk. As $\alpha$ approaches 1, the minority risk increases drastically.
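For concreteness, here is a minimal sketch of this data-generating process and the OLS baseline (the function name, random seed, and structure are ours, not the paper's):

```python
import numpy as np

rng = np.random.default_rng(0)

def make_synthetic(n, d=5, threshold=1.645, noise_std=0.1):
    """Synthetic regression task of Section 5.1: roughly 5% of points have
    X_(1) > 1.645 (the minority group) and receive an extra X_(1) term."""
    X = rng.standard_normal((n, d))
    y = X @ np.ones(d) + noise_std * rng.standard_normal(n)
    minority = X[:, 0] > threshold
    y[minority] += X[minority, 0]
    return X, y, minority

X, y, _ = make_synthetic(2000)
theta_ols, *_ = np.linalg.lstsq(X, y, rcond=None)  # empirical risk minimizer

X_te, y_te, minority_te = make_synthetic(20000)
res = X_te @ theta_ols - y_te
print("majority risk:", np.mean(0.5 * res[~minority_te] ** 2))
print("minority risk:", np.mean(0.5 * res[minority_te] ** 2))
```

Replacing the `lstsq` call with the reweighted gradient-descent sketch from Section 3 gives an EHRM counterpart; for $b < 1$ one would expect a lower minority risk at some cost in majority risk, as in Figure 2.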
5.2 Recidivism Prediction: Similar Subpopulation Performance

Setup. We follow the experimental setup in [8]. Using the FairML-toolkit version of the COMPAS recidivism dataset [1], we study the performance of EHRM and ERM on different demographic subgroups. With a 90%/10% train-test split, ERM and EHRM are used to train a logistic regression model with L2 regularization. To study the subgroup performances, we report the misclassification rates of different demographic groups on the test set. In particular, out of the 10 binary features in the dataset, we have chosen the 7 that have more than 10 samples to group the population. For each chosen (binary) attribute, the dataset can be divided into two subgroups. For EHRM, we have chosen $b$ to be .3 so that the EHRM probability weighting function is close to the median estimate of the CPT probability weighting function for high-rank losses in [28]. In practice, $a$ and $b$ are application-dependent and user-dependent.

Figure 3: Test misclassification rates for the recidivism prediction task. Each box represents the worst performance among the subpopulations grouped by the attribute listed below it. EHRM has a more similar performance across subgroups.

Results. In Figure 3, for each attribute, we report the maximum misclassification rate of the two subgroups at test time. Compared to ERM, EHRM has a higher misclassification rate but a more similar worst-case performance across different subgroups. This observation aligns with our hypothesis that EHRM avoids extremely bad performance for any demographic group and hence sacrifices average performance for similar subpopulation performance.

5.3 Gender Classification based on Facial Images: Fairness Metrics Comparison

Setup. To study EHRM's performance on standard fairness metrics, we use the AI Fairness 360 toolkit [2]. In particular, we use the UTKFace dataset [31] to train a neural network¹ for predicting gender from facial images (male = 0, female = 1). As suggested by [2], we use race as an indicator to divide the population into two groups, $G_1$ (white) and $G_2$ (other races). The fairness metrics we use include statistical parity difference, disparate impact, equal opportunity difference, average odds difference, Theil index and false negative rate difference.² We train the model with ERM and EHRM over 10 random seeds. For EHRM, $b$ is chosen to be .3 for the same reason mentioned in Section 5.2. To minimize the empirical human risk, we use a variant of mini-batch stochastic gradient descent: at each step $t$,

$$\theta_{t+1} = \theta_t - \frac{\eta_t}{B}\sum_{i=1}^{B} w_i^t\, \nabla_\theta \ell(\theta_t; Z_i),$$

where $w_i^t = w'_{\mathrm{POLY}}(F_n(\ell(\theta_t; Z_i)))$, $F_n(\cdot)$ is the empirical CDF of the mini-batch losses, $B$ is the mini-batch size and $\eta_t$ is the learning rate. As shown in Figure 5 (Appendix B), the empirical human risk on the entire training dataset decreases as training proceeds. We have also compared EHRM with a data pre-processing algorithm named reweighing [17], which re-weights the samples so that the statistical dependence between the protected attribute and the label is mitigated.

Results. Table 1 shows the mean and standard deviation of the test-time performance of EHRM, and of ERM with and without reweighing pre-processing. ERM performs the best in terms of accuracy at test time. However, reweighing with ERM, as well as EHRM, does better in terms of the fairness metrics. This empirical result suggests that the human-innate risk aversion towards the possibility of extreme losses promotes similar performance across different subgroups.

Table 1: Mean and standard deviation of accuracy and fairness metrics of models learned by EHRM with $w_{\mathrm{POLY}}(\cdot\,; .5, .3)$, and ERM with and without reweighing pre-processing.

| Metric | EHRM(.5, .3) | Reweighing [17] | ERM |
| --- | --- | --- | --- |
| Accuracy | .8751 ± .0052 | .8767 ± .0067 | .8767 ± .0060 |
| Stat. Parity Diff. | -.0825 ± .0220 | -.0875 ± .0212 | -.0881 ± .0208 |
| Disparate Impact | .8475 ± .0411 | .8396 ± .0390 | .8368 ± .0386 |
| Equal Opp. Diff. | -.0440 ± .0261 | -.0518 ± .0253 | -.0502 ± .0263 |
| Avg. Odds Diff. | -.0116 ± .0202 | -.0165 ± .0177 | -.0173 ± .0188 |
| Theil Index | .0859 ± .0058 | .0824 ± .0038 | .0855 ± .0071 |
| FNR Diff. | .0440 ± .0261 | .0518 ± .0253 | .0502 ± .0263 |

¹ Appendix D contains details of the model configuration.
² Appendix C contains definitions of these metrics in terms of $G_1$ and $G_2$.
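The mini-batch update described in the setup above can be sketched in PyTorch as follows (our illustration; the model, batch, and helper names are not from the paper's implementation, which is described only at this level of detail):

```python
import torch

def w_poly_prime(p, a=0.5, b=0.3):
    # Derivative of the polynomial CPT weighting function (Equation 3).
    c = (3 - 3 * b) / (a**2 - a + 1)
    return c * (3 * p**2 - 2 * (a + 1) * p + a) + 1

def ehrm_minibatch_step(model, optimizer, xb, yb, a=0.5, b=0.3):
    """One mini-batch EHRM step: weight each per-sample loss by
    w'_POLY(F_n(loss_i)), where F_n is the within-batch empirical CDF."""
    optimizer.zero_grad()
    logits = model(xb).squeeze(-1)  # assumes a single-logit binary classifier
    losses = torch.nn.functional.binary_cross_entropy_with_logits(
        logits, yb.float(), reduction="none")
    with torch.no_grad():  # the weights are treated as constants (Section 3)
        sorted_losses, _ = torch.sort(losses)
        cdf = torch.searchsorted(sorted_losses, losses, right=True).float() / losses.numel()
        weights = w_poly_prime(cdf, a, b)
    (weights * losses).mean().backward()
    optimizer.step()

# Illustrative usage with a toy model and a random batch:
model = torch.nn.Linear(16, 1)
opt = torch.optim.SGD(model.parameters(), lr=0.01)
xb, yb = torch.randn(32, 16), torch.randint(0, 2, (32,))
ehrm_minibatch_step(model, opt, xb, yb)
```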
6 Conclusion

In this work, we have studied alternatives to empirical risk minimization, and in particular proposed alternative formulations that are better aligned with human risk measures. We have analyzed several characteristics of human risk minimization, such as diminishing sensitivity, model selection based on higher-order moments, and information-weighted loss distributions. Further, our empirical analysis has shown that such risk measures have implications for fairness, and in particular trade average performance for similar subgroup performance. Our empirical analysis raises several interesting future directions. Fairness is only one of the desiderata that people have started caring about in ML; we would like to study what other desiderata HRM brings. Meanwhile, many risk measures such as conditional value-at-risk [26] can be expressed in a dual form; however, it is not immediately clear whether HRM has an equivalent formulation.

Acknowledgement

We thank the reviewers for providing thoughtful and constructive feedback for the paper. We thank Hongseok Namkoong for providing the method to optimize conditional value-at-risk. We acknowledge the support of ONR via N000141812861.

References

[1] Julius A Adebayo et al. FairML: ToolBox for diagnosing bias in predictive modeling. PhD thesis, Massachusetts Institute of Technology, 2016.
[2] Rachel K. E. Bellamy, Kuntal Dey, Michael Hind, Samuel C. Hoffman, Stephanie Houde, Kalapriya Kannan, Pranay Lohia, Jacquelyn Martino, Sameep Mehta, Aleksandra Mojsilovic, Seema Nagar, Karthikeyan Natesan Ramamurthy, John Richards, Diptikalyan Saha, Prasanna Sattigeri, Moninder Singh, Kush R. Varshney, and Yunfeng Zhang. AI Fairness 360: An extensible toolkit for detecting, understanding, and mitigating unwanted algorithmic bias, October 2018.
[3] Colin Camerer and Martin Weber. Recent developments in modeling preferences: Uncertainty and ambiguity. Journal of Risk and Uncertainty, 5(4):325–370, 1992.
[4] Colin F Camerer, George Loewenstein, and Matthew Rabin. Advances in Behavioral Economics. Princeton University Press, 2004.
[5] Yinlam Chow, Aviv Tamar, Shie Mannor, and Marco Pavone. Risk-sensitive and robust decision-making: a CVaR optimization approach. In Advances in Neural Information Processing Systems, pages 1522–1530, 2015.
[6] Hélio M de Oliveira and Renato J Cintra. A new information theoretical concept: Information-weighted heavy-tailed distributions. arXiv preprint arXiv:1601.06412, 2016.
[7] Enrico Diecidue and Peter P Wakker. On the intuition of rank-dependent utility. Journal of Risk and Uncertainty, 23(3):281–298, 2001.
[8] John C Duchi, Tatsunori Hashimoto, and Hongseok Namkoong. Distributionally robust losses against mixture covariate shifts.
[9] Rick Durrett. Probability: Theory and Examples, volume 49. Cambridge University Press, 2019.
[10] Cynthia Dwork, Moritz Hardt, Toniann Pitassi, Omer Reingold, and Richard Zemel. Fairness through awareness. In Proceedings of the 3rd Innovations in Theoretical Computer Science Conference, pages 214–226. ACM, 2012.
[11] Jerzy A Filar, Dmitry Krass, and Keith W Ross. Percentile performance criteria for limiting average Markov decision processes. IEEE Transactions on Automatic Control, 40(1):2–10, 1995.
[12] Javier García and Fernando Fernández. A comprehensive survey on safe reinforcement learning. Journal of Machine Learning Research, 16(1):1437–1480, 2015.
[13] Richard Gonzalez and George Wu. On the shape of the probability weighting function. Cognitive Psychology, 38(1):129–166, 1999.
[14] Aditya Gopalan, L.A. Prashanth, Michael Fu, and Steve Marcus. Weighted bandits or: How bandits learn distorted values that are not expected. In Thirty-First AAAI Conference on Artificial Intelligence, 2017.
[15] Xue Dong He, Roy Kouwenberg, and Xun Yu Zhou. Inverse S-shaped probability weighting and its impact on investment. Mathematical Control & Related Fields, September 2018.
[16] Ronald A Howard and James E Matheson. Risk-sensitive Markov decision processes. Management Science, 18(7):356–369, 1972.
[17] Faisal Kamiran and Toon Calders. Data preprocessing techniques for classification without discrimination. Knowledge and Information Systems, 33(1):1–33, 2012.
[18] Toshihiro Kamishima, Shotaro Akaho, and Jun Sakuma. Fairness-aware learning through regularization approach. In 2011 IEEE 11th International Conference on Data Mining Workshops, pages 643–650. IEEE, 2011.
[19] Matt J Kusner, Joshua Loftus, Chris Russell, and Ricardo Silva. Counterfactual fairness. In Advances in Neural Information Processing Systems, pages 4066–4076, 2017.
[20] Cheekiat Low, Dessislava Pachamanova, and Melvyn Sim. Skewness-aware asset allocation: A new theoretical framework and empirical evidence. Mathematical Finance: An International Journal of Mathematics, Statistics and Financial Economics, 22(2):379–410, 2012.
[21] Takayuki Osogami. Robustness and risk-sensitivity in Markov decision processes. In Advances in Neural Information Processing Systems, pages 233–241, 2012.
[22] Dino Pedreshi, Salvatore Ruggieri, and Franco Turini. Discrimination-aware data mining. In Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 560–568. ACM, 2008.
[23] L.A. Prashanth, Cheng Jie, Michael Fu, Steve Marcus, and Csaba Szepesvári. Cumulative prospect theory meets reinforcement learning: Prediction and control. In International Conference on Machine Learning, pages 1406–1415, 2016.
[24] Drazen Prelec. The probability weighting function. Econometrica, 66:497–528, 1998.
[25] Marc Oliver Rieger and Mei Wang. Cumulative prospect theory and the St. Petersburg paradox. Economic Theory, 28(3):665–679, 2006.
[26] R Tyrrell Rockafellar, Stanislav Uryasev, et al. Optimization of conditional value-at-risk. Journal of Risk, 2:21–42, 2000.
[27] Suvrit Sra, Sebastian Nowozin, and Stephen J Wright. Optimization for Machine Learning. MIT Press, 2012.
[28] Amos Tversky and Daniel Kahneman. Advances in prospect theory: Cumulative representation of uncertainty. Journal of Risk and Uncertainty, 5(4):297–323, 1992.
[29] Robert C Williamson and Aditya Krishna Menon. Fairness risk measures. arXiv preprint arXiv:1901.08665, 2019.
[30] George Wu and Richard Gonzalez. Curvature of the probability weighting function. Management Science, 42(12):1676–1690, 1996.
[31] Zhifei Zhang, Yang Song, and Hairong Qi. Age progression/regression by conditional adversarial autoencoder. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2017.