# Fairness in Reinforcement Learning

Shahin Jabbari, Matthew Joseph, Michael Kearns, Jamie Morgenstern, Aaron Roth¹

¹University of Pennsylvania, Philadelphia, PA, USA. Correspondence to: Shahin Jabbari, Matthew Joseph, Michael Kearns, Jamie Morgenstern, Aaron Roth. Proceedings of the 34th International Conference on Machine Learning, Sydney, Australia, PMLR 70, 2017. Copyright 2017 by the author(s). The full technical version of this paper is available at https://arxiv.org/pdf/1611.03071.pdf.

Abstract

We initiate the study of fairness in reinforcement learning, where the actions of a learning algorithm may affect its environment and future rewards. Our fairness constraint requires that an algorithm never prefers one action over another if the long-term (discounted) reward of choosing the latter action is higher. Our first result is negative: despite the fact that fairness is consistent with the optimal policy, any learning algorithm satisfying fairness must take time exponential in the number of states to achieve non-trivial approximation to the optimal policy. We then provide a provably fair polynomial time algorithm under an approximate notion of fairness, thus establishing an exponential gap between exact and approximate fairness.

1. Introduction

The growing use of machine learning for automated decision-making has raised concerns about the potential for unfairness in learning algorithms and models. In settings as diverse as policing [22], hiring [19], lending [4], and criminal sentencing [2], mounting empirical evidence suggests these concerns are not merely hypothetical [1; 25].

We initiate the study of fairness in reinforcement learning, where an algorithm's choices may influence the state of the world and future rewards. In contrast, previous work on fair machine learning has focused on myopic settings where such influence is absent, e.g. in i.i.d. or no-regret models [5; 6; 9; 10]. The resulting fairness definitions therefore do not generalize well to a reinforcement learning setting, as they do not reason about the effects of short-term actions on long-term rewards. This is relevant for the settings where historical context can have a distinct influence on the future.

For concreteness, we consider the specific example of hiring (though other settings such as college admission or lending decisions can be embedded into this framework). Consider a firm aiming to hire employees for a number of positions. The firm might consider a variety of hiring practices, ranging from targeting and hiring applicants from well-understood parts of the applicant pool (which might be a reasonable policy for short-term productivity of its workforce), to exploring a broader class of applicants whose backgrounds might differ from the current set of employees at the company (which might incur short-term productivity and learning costs but eventually lead to a richer and stronger overall applicant pool).

We focus on the standard model of reinforcement learning, in which an algorithm seeks to maximize its discounted sum of rewards in a Markov decision process (MDP). Throughout, the reader should interpret the actions available to a learning algorithm as corresponding to choices or policies affecting individuals (e.g. which applicants to target and hire). The reward for each action should be viewed as the short-term payoff of making the corresponding decision (e.g. the short-term influence on the firm's productivity after hiring any particular candidate).
The actions taken by the algorithm affect the underlying state of the system (e.g. the company's demographics as well as the available applicant pool) and therefore in turn will affect the actions and rewards available to the algorithm in the future.

Informally, our definition of fairness requires that (with high probability) in state s, an algorithm never chooses an available action a with probability higher than another action a' unless Q*(s, a) > Q*(s, a'), i.e. the long-term reward of a is greater than that of a'. This definition, adapted from Joseph et al. (2016), is weakly meritocratic: facing some set of actions, an algorithm must pick a distribution over actions with (weakly) heavier weight on the better actions (in terms of their discounted long-term reward). Correspondingly, a hiring process satisfying our fairness definition cannot probabilistically target one population over another if hiring from either population will have similar long-term benefit to the firm's productivity.

Unfortunately, our first result shows an exponential separation in expected performance between the best unfair algorithm and any algorithm satisfying fairness. This motivates our study of a natural relaxation of (exact) fairness, for which we provide a polynomial time learning algorithm, thus establishing an exponential separation between exact and approximately fair learning in MDPs.

Our Results

Throughout, we use (exact) fairness to refer to the adaptation of Joseph et al. (2016)'s definition, defining an action's quality as its potential long-term discounted reward. We also consider two natural relaxations. The first, approximate-choice fairness, requires that an algorithm never chooses a worse action with probability substantially higher than better actions. The second, approximate-action fairness, requires that an algorithm never favors an action of substantially lower quality than that of a better action.

The contributions of this paper can be divided into two parts. First, in Section 3, we give a lower bound on the time required for a learning algorithm to achieve near-optimality subject to (exact) fairness or approximate-choice fairness.

Theorem (Informal statement of Theorems 3, 4, and 5). For constant ε, to achieve ε-optimality, (i) any fair or approximate-choice fair algorithm takes a number of rounds exponential in the number of MDP states, and (ii) any approximate-action fair algorithm takes a number of rounds exponential in 1/(1 − γ), for discount factor γ.

Second, we present an approximate-action fair algorithm (Fair-E3) in Section 4 and prove a polynomial upper bound on the time it requires to achieve near-optimality.

Theorem (Informal statement of Theorem 6). For constant ε and any MDP satisfying standard assumptions, Fair-E3 is an approximate-action fair algorithm achieving ε-optimality in a number of rounds that is (necessarily) exponential in 1/(1 − γ) and polynomial in other parameters.

The exponential dependence of Fair-E3 on 1/(1 − γ) is tight: it matches our lower bound on the time complexity of any approximate-action fair algorithm. Furthermore, our results establish rigorous trade-offs between fairness and performance facing reinforcement learning algorithms.

1.1. Related Work

The most relevant parts of the large body of literature on reinforcement learning focus on constructing learning algorithms with provable performance guarantees.
E3 [13] was the first learning algorithm with a polynomial learning rate, and subsequent work improved this rate (see Szita and Szepesvári (2010) and references within). The study of robust MDPs [16; 18; 20] examines MDPs with high parameter uncertainty but generally uses optimistic learning strategies that ignore (and often conflict with) fairness, and so does not directly apply to this work.

Our work also belongs to a growing literature studying the problem of fairness in machine learning. Early work in data mining [8; 11; 12; 17; 21; 29] considered the question from a primarily empirical standpoint, often using statistical parity as a fairness goal. Dwork et al. (2012) explicated several drawbacks of statistical parity and instead proposed one of the first broad definitions of algorithmic fairness, formalizing the idea that "similar individuals should be treated similarly". Recent papers have proven several impossibility results for satisfying different fairness requirements simultaneously [7; 15]. More recently, Hardt et al. (2016) proposed new notions of fairness and showed how to achieve these notions via post-processing of a black-box classifier. Woodworth et al. (2017) and Zafar et al. (2017) further studied these notions theoretically and empirically.

1.2. Strengths and Limitations of Our Models

In recognition of the duration and consequence of choices made by a learning algorithm during its learning process (e.g. job applicants not hired), our work departs from previous work and aims to guarantee the fairness of the learning process itself. To this end, we adapt the fairness definition of Joseph et al. (2016), who studied fairness in the bandit framework and defined fairness with respect to one-step rewards. To capture the desired interaction and evolution of the reinforcement learning setting, we modify this myopic definition and define fairness with respect to long-term rewards: a fair learning algorithm may only choose action a over action a' if a has true long-term reward at least as high as a'. Our contributions thus depart from previous work in reinforcement learning by incorporating a fairness requirement (ruling out existing algorithms, which commonly make heavy use of optimistic strategies that violate fairness) and depart from previous work in fair learning by requiring online fairness in a previously unconsidered reinforcement learning context.

First note that our definition is weakly meritocratic: an algorithm satisfying our fairness definition can never probabilistically favor a worse option but is not required to favor a better option. This confers both strengths and limitations. Our fairness notion still permits a type of conditional discrimination in which a fair algorithm favors group A over group B by selecting choices from A when they are superior and randomizing between A and B when choices from B are superior. In this sense, our fairness requirement is relatively minimal, encoding a necessary variant of fairness rather than a sufficient one. This makes our lower bounds and impossibility results (Section 3) relatively stronger and our upper bounds (Section 4) relatively weaker.

Next, our fairness requirement holds (with high probability) across all decisions that a fair algorithm makes. We view this strong constraint as worthy of serious consideration, since forgiving unfairness during the learning process may badly mistreat the training population, especially if the learning process is lengthy or even continual.
Additionally, it is unclear how to relax this requirement, even for a small fraction of the algorithm's decisions, without enabling discrimination against a correspondingly small population. Instead, aiming to preserve the minimal spirit of our definition, we consider a relaxation that only prevents an algorithm from favoring a significantly worse option over a better option (Section 2.1). Hence, approximate-action fairness should be viewed as a weaker constraint: rather than safeguarding against every "violation of fairness", it instead restricts how egregious these violations can be. We discuss further relaxations of our definition in Section 5.

2. Preliminaries

In this paper we study reinforcement learning in Markov decision processes (MDPs). An MDP is a tuple M = (S_M, A_M, P_M, R_M, T, γ), where S_M is a set of n states, A_M is a set of k actions, T is a horizon of a (possibly infinite) number of rounds of activity in M, and γ is a discount factor. P_M : S_M × A_M → S_M and R_M : S_M → [0, 1] denote the transition probability distribution and reward distribution, respectively. We use R̄_M to denote the mean of R_M.²

A policy π is a mapping from a history h (the sequence of triples (state, action, reward) observed so far) to a distribution over actions. The discounted state and state-action value functions of π are denoted by V^π_M and Q^π_M, and V^π_M(s, T) represents the expected discounted reward of following π from s for T steps. The highest value functions are achieved by the optimal policy π* and are denoted by V*_M and Q*_M [24]. We use μ_π to denote the stationary distribution of π. Throughout we make the following assumption.

Assumption 1 (Unichain Assumption). The stationary distribution of any policy in M is independent of its start state.

We denote the ε-mixing time of π by T^π_ε. Lemma 1 relates the ε-mixing time of any policy π to the number of rounds until the V^π_M values of the states visited by π are close to the expected V^π_M values (under the stationary distribution μ_π). We defer all the omitted proofs to the Appendix.

Lemma 1. Fix ε > 0. For any state s, following π for T ≥ T^π_ε steps from s satisfies

(1/T) Σ_{t=1}^{T} E[V^π_M(s_t)] ≥ E_{s∼μ_π}[V^π_M(s)] − ε,

where s_t is the state visited at time t when following π from s, and the expectation in the second term is over the transition function and the randomization of π.³

²Note that R̄_M ≤ 1 and Var(R_M) ≤ 1 for all states. The bounded reward assumption can be relaxed (see e.g. [13]). The assumption that rewards lie in [0, 1] is w.l.o.g. up to scaling.

³Lemma 1 can be stated for a weaker notion of mixing time called the ε-reward mixing time, which is always linearly bounded by the ε-mixing time but can be much smaller in certain cases (see Kearns and Singh (2002) for a discussion).

The horizon time H_γ := log(ε(1 − γ))/log(γ) of an MDP captures the number of steps an approximately optimal policy must optimize over. The expected discounted reward of any policy after H_γ steps approaches the expected asymptotic discounted reward (Kearns and Singh (2002), Lemma 2). A learning algorithm L is a nonstationary policy that at each round takes the entire history and outputs a distribution over actions. We now define a performance measure for learning algorithms.

Definition 1 (ε-optimality). Let ε > 0 and δ ∈ (0, 1/2). L achieves ε-optimality in T* steps if, for any T ≥ T*,

(1/T) Σ_{t=1}^{T} E[V*_M(s_t)] ≥ E_{s∼μ*}[V*_M(s)] − 2ε/(1 − γ)   (1)

with probability at least 1 − δ, for s_t the state L reaches at time t, where the expectation is taken over the transitions and the randomization of L, for any MDP M. We thus ask that a learning algorithm, after sufficiently many steps, visits states whose values are arbitrarily close to the values of the states visited by the optimal policy.
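To make these planning quantities concrete: for a small, fully specified MDP, V*_M and Q*_M can be computed by standard value iteration. The sketch below is illustrative only; it assumes a dense transition array P[s, a, s'] and state-dependent mean rewards matching the model above, and the function name is ours.

```python
import numpy as np

def value_iteration(P, R, gamma, tol=1e-8):
    """Compute discounted V* and Q* for a fully known finite MDP.

    P:     array of shape (n, k, n); P[s, a, s2] = Pr[next state is s2 | s, a]
    R:     array of shape (n,); mean reward of each state (rewards depend
           only on the state, as in the model above)
    gamma: discount factor in [0, 1)
    """
    n, k, _ = P.shape
    V = np.zeros(n)
    while True:
        # Q(s, a) = R(s) + gamma * E[V(next state)]
        Q = R[:, None] + gamma * (P @ V)
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new, Q
        V = V_new
```

With Q*_M in hand for a known model, the fairness notions of Section 2.1 reduce to pairwise comparisons of Q*_M(s, a) values; the difficulty for a learning algorithm, of course, is that it must act before such estimates are reliable.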
Note that this is stronger than the "hand-raising" notion in Kearns and Singh (2002),⁴ which only asked that the learning algorithm stop in a state from which the discounted return is near-optimal, permitting termination in a state from which the optimal discounted return is poor. In Definition 1, if there are states with poor optimal discounted reward that the optimal policy eventually leaves for better states, so must our algorithms.

⁴We suspect unfair E3 also satisfies this stronger notion.

We also note the following connection between the average V^π_M values of states visited under the stationary distribution of π (and in particular an optimal policy) and the average undiscounted rewards achieved under the stationary distribution of that policy.

Lemma 2 (Singh (2016)). Let R̄_M be the vector of mean rewards in the states of M and V^π_M the vector of discounted rewards in the states under π. Then μ_π · R̄_M = (1 − γ) μ_π · V^π_M.

We design an algorithm which quickly achieves ε-optimality, and we bound the number of steps T* before this happens by a polynomial in the parameters of M.

2.1. Notions of Fairness

We now turn to formal notions of fairness. Translated to our setting, Joseph et al. (2016) define action a's quality as the expected immediate reward for choosing a from state s and then require that an algorithm not probabilistically favor a over a' if a has lower expected immediate reward. However, this naive translation does not adequately capture the structural differences between bandit and MDP settings, since present rewards may depend on past choices in MDPs. In particular, defining fairness in terms of immediate rewards would prohibit any policy sacrificing short-term rewards in favor of long-term rewards. This is undesirable, since it is the long-term rewards that matter in reinforcement learning, and optimizing for long-term rewards often necessitates short-term sacrifices. Moreover, the long-term impact of a decision should be considered when arguing about its relative fairness. We will therefore define fairness using the state-action value function Q*_M.

Definition 2 (Fairness). L is fair if for all inputs δ > 0, all M, all rounds t, all states s and all actions a, a',

Q*_M(s, a) > Q*_M(s, a') ⇒ L(s, a, h_{t−1}) ≥ L(s, a', h_{t−1})

with probability at least 1 − δ over histories h_{t−1}.⁵

⁵L(s, a, h) denotes the probability L chooses a from s given history h.

Fairness requires that an algorithm never probabilistically favors an action with lower long-term reward over an action with higher long-term reward. In hiring, this means that an algorithm cannot target one applicant population over another unless the targeted population has a higher quality. In Section 3, we show that fairness can be extremely restrictive. Intuitively, L must play uniformly at random until it has high confidence about the Q*_M values, in some cases taking exponential time to achieve near-optimality. This motivates relaxing Definition 2. We first relax the probabilistic requirement and require only that an algorithm not substantially favor a worse action over a better one.

Definition 3 (Approximate-choice Fairness). L is α-choice fair if for all inputs δ > 0 and α > 0, for all M, all rounds t, all states s and actions a, a',

Q*_M(s, a) > Q*_M(s, a') ⇒ L(s, a, h_{t−1}) ≥ L(s, a', h_{t−1}) − α

with probability at least 1 − δ over histories h_{t−1}. If L is α-choice fair for any input α > 0, we call L approximate-choice fair.
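The per-state condition in Definitions 2 and 3 can be phrased as a simple pairwise check. The sketch below is for illustration only: the function name is ours, and the check presumes access to the true Q*_M values at a state, which a learning algorithm must first estimate.

```python
def satisfies_choice_fairness(q_s, pi_s, alpha=0.0):
    """Check the per-state condition of Definitions 2 and 3.

    q_s:   q_s[a] = Q*_M(s, a) for each action a
    pi_s:  pi_s[a] = probability the algorithm assigns to action a at s
    alpha: 0 recovers exact fairness (Definition 2); alpha > 0 gives the
           alpha-approximate-choice relaxation (Definition 3).
    """
    k = len(q_s)
    for a in range(k):
        for b in range(k):
            # Whenever a is strictly better than b, a must receive at least
            # as much probability as b, up to the slack alpha.
            if q_s[a] > q_s[b] and pi_s[a] < pi_s[b] - alpha:
                return False
    return True

# Example: with Q-values (0.9, 0.5), uniform play is fair, but putting
# more weight on the worse action is not.
assert satisfies_choice_fairness([0.9, 0.5], [0.5, 0.5])
assert not satisfies_choice_fairness([0.9, 0.5], [0.3, 0.7])
```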
A slight modification of the lower bound for (exact) fairness shows that algorithms satisfying approximate-choice fairness can also require exponential time to achieve near-optimality. We therefore propose an alternative relaxation, where we relax the quality requirement. As described in Section 1.1, the resulting notion of approximate-action fairness is in some sense the most fitting relaxation of fairness, and is a particularly attractive one because it allows us to give algorithms circumventing the exponential hardness proved for fairness and approximate-choice fairness.

Definition 4 (Approximate-action Fairness). L is α-action fair if for all inputs δ > 0 and α > 0, for all M, all rounds t, all states s and actions a, a',

Q*_M(s, a) > Q*_M(s, a') + α ⇒ L(s, a, h_{t−1}) ≥ L(s, a', h_{t−1})

with probability at least 1 − δ over histories h_{t−1}. If L is α-action fair for any input α > 0, we call L approximate-action fair.

Approximate-choice fairness prevents equally good actions from being chosen at very different rates, while approximate-action fairness prevents substantially worse actions from being chosen over better ones. In hiring, an approximately-action fair firm can only (probabilistically) target one population over another if the targeted population is not substantially worse. While this is a weaker guarantee, it at least forces an approximately-action fair algorithm to learn different populations to statistical confidence. This is a step forward from current practices, in which companies have much higher degrees of uncertainty about the quality (and impact) of hiring individuals from under-represented populations. For this reason and the computational benefits mentioned above, our upper bounds will primarily focus on approximate-action fairness.

We now state several useful observations regarding fairness. We defer all the formal statements and their proofs to the Appendix. We note that there always exists a (possibly randomized) optimal policy which is fair (Observation 1); moreover, any optimal policy (deterministic or randomized) is approximate-action fair (Observation 2), as is the uniformly random policy (Observation 3). Finally, we consider a restriction of the actions in an MDP M to nearly-optimal actions (as measured by Q*_M).

Definition 5 (Restricted MDP). The α-restricted MDP of M, denoted by M^α, is identical to M except that in each state s, the set of available actions is restricted to {a ∈ A_M : Q*_M(s, a) ≥ max_{a'∈A_M} Q*_M(s, a') − α}.

M^α has the following two properties: (i) any policy in M^α is α-action fair in M (Observation 4), and (ii) the optimal policy in M is also optimal in M^α (Observation 5). Observations 4 and 5 aid our design of an approximate-action fair algorithm: we construct M^α from estimates of the Q*_M values (see Section 4.3 for more details).

3. Lower Bounds

We now demonstrate a stark separation between the performance of learning algorithms with and without fairness. First, we show that neither fair nor approximate-choice fair algorithms achieve near-optimality unless the number of time steps T is at least Ω(kⁿ), exponential in the size of the state space. We then show that any approximate-action fair algorithm requires a number of time steps T that is at least exponential in 1/(1 − γ) to achieve near-optimality. We start by proving a lower bound for fair algorithms.
Theorem 3. If δ < 1/8, no fair algorithm can be ε-optimal in T = O(kⁿ) steps.⁶

⁶We have not optimized the constants upper-bounding the parameters in the statements of Theorems 3, 4 and 5. The values presented here are only chosen for convenience.

Standard reinforcement learning algorithms (absent a fairness constraint) learn an ε-optimal policy in a number of steps polynomial in n and 1/ε; Theorem 3 therefore shows a steep cost of imposing fairness. We outline the idea for the proof of Theorem 3. For intuition, first consider the special case when the number of actions k = 2. We introduce the MDPs witnessing the claim in Theorem 3 for this case.

Definition 6 (Lower Bound Example). For A_M = {L, R}, let M(x) = (S_M, A_M, P_M, R_M, T, γ) be an MDP with:

- for all i ∈ [n], P_M(s_i, L, s_1) = P_M(s_i, R, s_j) = 1 where j = min{i + 1, n}, and P_M is 0 otherwise;
- for i ∈ [n − 1], R_M(s_i) = 0.5, and R_M(s_n) = x.

Figure 1. MDP(x): circles represent states (labels denote the state name and deterministic reward); arrows represent actions.

Figure 1 illustrates the MDP from Definition 6. All the transitions and rewards in M are deterministic, but the reward at state s_n can be either 1 or 1/2, and so no algorithm (fair or otherwise) can determine whether the Q*_M values of all the states are the same or not until it reaches s_n and observes its reward. Until then, fairness requires that the algorithm play all the actions uniformly at random (if the reward at s_n is 1/2, any fair algorithm must play uniformly at random forever). Thus, any fair algorithm will take time exponential in the number of states to reach s_n. This can be easily modified for k > 2: from each state s_i, k − 1 of the actions (deterministically) return to state s_1 and only one action (deterministically) reaches the state s_{min{i+1,n}}. It will take kⁿ steps before any fair algorithm reaches s_n and can stop playing uniformly at random (which is necessary for near-optimality). The same example, with a slightly modified analysis, also provides a lower bound of Ω((k/(1 + kα))ⁿ) time steps for approximate-choice fair algorithms, as stated in Theorem 4.

Theorem 4. If δ < 1/8, no α-choice fair algorithm is ε-optimal for T = O((k/(1 + kα))ⁿ) steps.

Fairness and approximate-choice fairness are both extremely costly, ruling out polynomial time learning rates. Hence, we focus on approximate-action fairness. Before moving to positive results, we mention that the time complexity of approximate-action fair algorithms will still suffer from an exponential dependence on 1/(1 − γ).

Theorem 5. For δ < 1/4, ε < 1/8, γ > max(0.9, c), c ∈ (1/2, 1) and α < (1 − e^{c−1})/16, no α-action fair algorithm is ε-optimal for T = O((k^{1/(1−γ)})^c) steps.

The MDP in Figure 1 also witnesses the claim of Theorem 5 when n = ⌈log(1/(2α))/(1 − γ)⌉. The discount factor γ is generally taken as a constant, so in most interesting cases 1/(1 − γ) ≪ n: this lower bound is substantially less stringent than the lower bounds proven for fairness and approximate-choice fairness. Hence, from now on, we focus on designing algorithms satisfying approximate-action fairness with learning rates polynomial in every parameter but 1/(1 − γ), and with tight dependence on 1/(1 − γ).
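Before turning to the algorithm, the chain construction of Definition 6 is easy to simulate, which gives a concrete sense of the kⁿ barrier. The sketch below is ours (0-indexed states, illustrative function name): it counts the steps a uniformly random, and hence fair, player needs before first reaching the last state.

```python
import random

def steps_until_last_state(n, k, seed=0):
    """Simulate uniform play on the chain MDP of Definition 6, generalized
    to k actions: from state i, a single action advances to min(i + 1, n - 1)
    and the other k - 1 actions reset to state 0.  Fairness forces uniform
    play until the reward at the last state has been observed."""
    rng = random.Random(seed)
    state, steps = 0, 0
    while state != n - 1:
        advance = (rng.randrange(k) == 0)      # each action equally likely
        state = min(state + 1, n - 1) if advance else 0
        steps += 1
    return steps

# The expected count grows exponentially in n (on the order of k^n),
# in line with Theorems 3 and 4, e.g.:
# print(steps_until_last_state(n=10, k=2))
```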
4. A Fair and Efficient Learning Algorithm

We now present an approximate-action fair algorithm, Fair-E3, with the performance guarantees stated below.

Theorem 6. Given ε > 0, α > 0, δ ∈ (0, 1/2) and γ ∈ [0, 1) as inputs, Fair-E3 is an α-action fair algorithm which achieves ε-optimality after a number of steps that is polynomial in the parameters of M and in 1/min{α, ε} but exponential in 1/(1 − γ); the precise bound, stated in the full technical version, has min{α⁴, ε⁴}(1 − γ)¹² in its denominator and hides only poly-logarithmic terms.

The running time of Fair-E3 (which we have not attempted to optimize) is polynomial in all the parameters of the MDP except 1/(1 − γ); Theorem 5 implies that this exponential dependence on 1/(1 − γ) is necessary. Several more recent algorithms (e.g. R-MAX [3]) have improved upon the performance of E3. We adapted E3 primarily for its simplicity. While the machinery required to properly balance fairness and performance is somewhat involved, the basic ideas of our adaptation are intuitive. We further note that subsequent algorithms improving on E3 tend to heavily leverage the principle of "optimism in the face of uncertainty": such behavior often violates fairness, which generally requires uniformity in the face of uncertainty. Thus, adapting these algorithms to satisfy fairness is more difficult. This in particular suggests E3 as an apt starting point for designing a fair planning algorithm.

The remainder of this section will explain Fair-E3, beginning with a high-level description in Section 4.1. We then define the known states Fair-E3 uses to plan in Section 4.2, explain this planning process in Section 4.3, and bring this all together to prove Fair-E3's fairness and performance guarantees in Section 4.4.

4.1. Informal Description of Fair-E3

Fair-E3 relies on the notion of known states. A state s is defined to be known after all actions have been chosen from s enough times to confidently estimate relevant reward distributions, transition probabilities, and Q*_M values for each action. At each time t, Fair-E3 then uses known states to reason about the MDP as follows (a sketch of this decision rule appears after Definition 7 below):

- If in an unknown state, take a uniformly random trajectory of length H_γ.
- If in a known state, compute (i) an exploration policy which escapes to an unknown state quickly and p, the probability that this policy reaches an unknown state within 2T^{π*}_ε steps, and (ii) an exploitation policy which is near-optimal in the known states of M.
- If p is large enough, follow the exploration policy; otherwise, follow the exploitation policy.

Fair-E3 thus relies on known states to balance exploration and exploitation in a reliable way. While Fair-E3 and E3 share this general idea, fairness forces Fair-E3 to more delicately balance exploration and exploitation. For example, while both algorithms explore until states become "known", the definition of a known state must be much stronger in Fair-E3 than in E3, because Fair-E3 additionally requires accurate estimates of actions' Q*_M values in order to make decisions without violating fairness. For this reason, Fair-E3 replaces the deterministic exploratory actions of E3 with random trajectories of actions from unknown states. These random trajectories are then used to estimate the necessary Q*_M values. In a similar vein, Fair-E3 requires particular care in computing exploration and exploitation policies, and must restrict the set of such policies to fair exploration and fair exploitation policies. Correctly formulating this restriction process to balance fairness and performance relies heavily on the observations about the relationship between fairness and performance provided in Section 2.1.

4.2. Known States in Fair-E3

We now formally define the notion of known states for Fair-E3. We say a state s becomes known when one can compute good estimates of (i) R_M(s) and P_M(s, a) for all a, and (ii) Q*_M(s, a) for all a.

Definition 7 (Known State). Let m1 and m2 be, respectively, the number of length-H_γ random trajectories from a state sufficient to estimate its Q*_M values accurately (via Theorem 7 below) and the number sufficient to estimate its transition probabilities and rewards accurately. A state s becomes known after taking

m_Q := k max{m1, m2}   (3)

random trajectories from s.
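The decision rule sketched in Section 4.1 can be summarized as follows. This is an illustrative fragment, not the paper's pseudocode: the argument names are ours, and the β/T^{π*}_ε threshold anticipates Lemma 8 in Section 4.4.

```python
def fair_e3_mode(trajectories_from_s, m_Q, p_escape, beta, T_mix):
    """Choose Fair-E3's behavior at the current state s.

    trajectories_from_s: length-H_gamma random trajectories already taken from s
    m_Q:                 trajectories needed for s to become known (Definition 7)
    p_escape:            estimated probability that the fair exploration policy
                         reaches an unknown state within 2 * T_mix steps
    beta, T_mix:         exploration threshold parameter and mixing time
    """
    if trajectories_from_s < m_Q:
        return "random trajectory"   # s is unknown: explore uniformly at random
    if p_escape > beta / T_mix:
        return "explore"             # escape quickly to unknown states
    return "exploit"                 # follow the near-optimal known-state policy
```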
It remains to show that motivating conditions (i) and (ii) indeed hold for our formal definition of a known state. Informally, m1 random trajectories suffice to ensure that we have accurate estimates of all Q*_M(s, a) values, and m2 random trajectories suffice to ensure accurate estimates of the transition probabilities and rewards. To formalize condition (ii), we rely on Theorem 7, connecting the number of random trajectories taken from s to the accuracy of the empirical V̂^π_M estimates.

Theorem 7 (Theorem 5.5, Kearns et al. (2000)). For any state s and λ > 0, after sufficiently many random trajectories of length H_γ from s (a number that scales with log |Π|), with probability at least 1 − δ we can compute estimates V̂^π_M such that |V̂^π_M(s) − V^π_M(s)| ≤ λ, simultaneously for all π ∈ Π.

Theorem 7 enables us to translate between the number of trajectories taken from a state and the uncertainty about its V^π_M values for all policies π (including π*, and hence V*_M). Since |Π| = kⁿ, we substitute log(|Π|) = n log(k). To estimate the Q^π_M(s, a) values using the V^π_M(s) values, we increase the number of necessary length-H_γ random trajectories by a factor of k. For condition (i), we adapt the analysis of E3 [13], which states that if each action in a state s is taken m2 times, then the transition probabilities and reward in state s can be estimated accurately (see Section 4.4).

4.3. Planning in Fair-E3

We now formalize the planning steps in Fair-E3 from known states. For the remainder of our exposition, we make Assumption 2 for convenience (and show how to remove this assumption in the Appendix).

Assumption 2. The ε-mixing time T^{π*}_ε of the optimal policy is known to Fair-E3.

Fair-E3 constructs two ancillary MDPs for planning. M_Γ is the exploitation MDP, in which the unknown states of M are condensed into a single absorbing state s_0 with no reward. In the known states Γ, transitions are kept intact and the rewards are deterministically set to their mean value. M_Γ thus incentivizes exploitation by giving reward only for staying within known states. In contrast, M_{[n]\Γ} is the exploration MDP, identical to M_Γ except for the rewards. The rewards in the known states Γ are set to 0 and the reward in s_0 is set to 1. M_{[n]\Γ} then incentivizes exploration by giving reward only for escaping to unknown states. See the middle (right) panel of Figure 2 for an illustration of M_Γ (M_{[n]\Γ}), and the Appendix for formal definitions.

Figure 2. Left: an MDP M with two actions (L and R) and deterministic transition functions and rewards; green denotes the set of known states Γ. Middle: M_Γ. Right: M_{[n]\Γ}.

Fair-E3 uses these constructed MDPs to plan according to the following natural idea: when in a known state, Fair-E3 constructs M̂_Γ and M̂_{[n]\Γ} based on the estimated transitions and rewards observed so far (see the Appendix for formal definitions), and then uses these to compute additional restricted MDPs M̂^α_Γ and M̂^α_{[n]\Γ} for approximate-action fairness. Fair-E3 then uses these restricted MDPs to choose between exploration and exploitation. More formally, if the optimal policy in M̂^α_{[n]\Γ} escapes to the absorbing state of M_Γ with high enough probability within 2T^{π*}_ε steps, then Fair-E3 explores by following that policy. Otherwise, Fair-E3 exploits by following the optimal policy in M̂^α_Γ for T^{π*}_ε steps. While following either of these policies, whenever Fair-E3 encounters an unknown state, it stops following the policy and proceeds by taking a length-H_γ random trajectory.
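As an illustration of the construction above, the following sketch builds the two ancillary MDPs from estimated transitions and rewards. The dense-array layout and the boolean known-state mask are our assumptions, and in Fair-E3 the resulting MDPs are additionally restricted as in Definition 5 before planning.

```python
import numpy as np

def known_state_mdps(P_hat, R_hat, known):
    """Build the exploitation MDP (M_Gamma) and exploration MDP (M_[n]\\Gamma)
    of Section 4.3 from estimated quantities.

    P_hat: (n, k, n) estimated transition probabilities
    R_hat: (n,) estimated mean rewards
    known: boolean mask of length n marking the known states Gamma

    Both MDPs share the same transitions and add one absorbing state s0
    (index n) that lumps together all unknown states; they differ only in
    their rewards.
    """
    n, k, _ = P_hat.shape
    P = np.zeros((n + 1, k, n + 1))
    # Keep transitions among known states; redirect all mass that leaves
    # the known set into the absorbing state s0.
    P[:n, :, :n] = P_hat * known[None, None, :]
    P[:n, :, n] = P_hat[:, :, ~known].sum(axis=2)
    # Rows of unknown states are irrelevant for planning inside Gamma;
    # send them (and s0 itself) straight to s0.
    rows_to_absorb = np.append(~known, True)
    P[rows_to_absorb, :, :] = 0.0
    P[rows_to_absorb, :, n] = 1.0

    R_exploit = np.append(R_hat * known, 0.0)   # reward only inside known states
    R_explore = np.append(np.zeros(n), 1.0)     # reward only for escaping to s0
    return P, R_exploit, R_explore
```

Feeding (P, R_exploit) or (P, R_explore) to a planner such as the value-iteration sketch of Section 2 yields candidate exploitation and exploration policies of the kind Fair-E3 chooses between (Fair-E3 itself plans in the α-restricted versions).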
4.4. Analysis of Fair-E3

In this section we formally analyze Fair-E3 and prove Theorem 6. We begin by proving that M^α_Γ is useful in the following sense: M^α_Γ has at least one of an exploitation policy achieving high reward or an exploration policy that quickly reaches an unknown state in M.

Lemma 8 (Exploit or Explore Lemma). For any state s ∈ Γ, β ∈ (0, 1) and any T > 0, at least one of the statements below holds:

- there exists an exploitation policy π in M^α_Γ such that the expected average V*_M value of the states it visits in T steps from s is within β/(1 − γ) of the corresponding quantity for the optimal policy π* of M, where the random variables π_t(s) and π*_t(s) denote the states reached from s after following π and π* for t steps, respectively;
- there exists an exploration policy π in M^α_Γ such that the probability that a walk of 2T steps from s following π will terminate in s_0 exceeds β/T.

We can use this fact to reason about exploration as follows. First, since Observation 2 tells us that the optimal policy in M is approximate-action fair, if the optimal policy stays in the set Γ of M's known states, then following the optimal policy in M^α_Γ is both optimal and approximate-action fair. However, if instead the optimal policy in M quickly escapes to an unknown state in M, the optimal policy in M^α_Γ may not be able to compete with the optimal policy in M. Ignoring fairness, one natural way of computing an escape policy to keep up with the optimal policy is to compute the optimal policy in M_{[n]\Γ}. Unfortunately, following this escape policy might violate approximate-action fairness: high-quality actions might be ignored in lieu of low-quality exploratory actions that quickly reach the unknown states of M. Instead, we compute an escape policy in M^α_{[n]\Γ} and show that if no near-optimal exploitation policy exists in M_Γ, then the optimal policy in M^α_{[n]\Γ} (which is fair by construction) quickly escapes to the unknown states of M.

Next, in order for Fair-E3 to check whether the optimal policy in M^α_{[n]\Γ} quickly reaches the absorbing state of M_Γ with significant probability, Fair-E3 simulates the execution of the optimal policy of M̂^α_{[n]\Γ} for 2T^{π*}_ε steps from the known state s in M̂^α_Γ several times, counting the ratio of the runs ending in s_0, and applying a Chernoff bound; this is where Assumption 2 is used.

Having discussed exploration, it remains to show that the exploitation policy described in Lemma 8 satisfies ε-optimality as defined in Definition 1. By setting T ≥ T^π_ε in Lemma 8 and applying Lemmas 1 and 10, we can prove Corollary 9 regarding this exploitation policy.

Corollary 9. For any state s ∈ Γ and T ≥ T^π_ε, if there exists an exploitation policy π in M^α_Γ satisfying the first condition of Lemma 8, then the expected average V*_M value of the states visited by following π from s for T steps is at least E_{s∼μ*}[V*_M(s)] − ε/(1 − γ).

Finally, we have so far elided the fact that Fair-E3 only has access to the empirically estimated MDPs M̂^α_Γ and M̂^α_{[n]\Γ} (see the Appendix for formal definitions). We remedy this issue by showing that the behavior of any policy π in M̂^α_Γ (and M̂^α_{[n]\Γ}) is similar to the behavior of π in M^α_Γ (and M^α_{[n]\Γ}). To do so, we prove a stronger claim: the behavior of any π in M̂_Γ (and M̂_{[n]\Γ}) is similar to the behavior of π in M_Γ (and M_{[n]\Γ}).

Lemma 10. Let Γ be the set of known states and M̂_Γ the approximation to M_Γ. Then for any state s ∈ Γ, any action a and any policy π, with probability at least 1 − δ:

1. V^π_{M_Γ}(s) − min{ε/2, α/2} ≤ V^π_{M̂_Γ}(s) ≤ V^π_{M_Γ}(s) + min{ε/2, α/2},
2. Q^π_{M_Γ}(s, a) − min{ε/2, α/2} ≤ Q^π_{M̂_Γ}(s, a) ≤ Q^π_{M_Γ}(s, a) + min{ε/2, α/2}.
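The escape check described above (simulate the candidate exploration policy for 2T^{π*}_ε steps in the estimated model, several times, and count how often it reaches s_0) might look as follows. The signature and names are ours, and the number of runs would be set by the Chernoff bound mentioned in the text.

```python
import numpy as np

def estimate_escape_probability(P, policy, start, absorbing, T_mix,
                                num_runs=1000, seed=0):
    """Fraction of simulated runs of `policy` (a map from state index to
    action index) that reach the absorbing state within 2 * T_mix steps,
    using the model transition array P of shape (n + 1, k, n + 1)."""
    rng = np.random.default_rng(seed)
    n_states = P.shape[0]
    hits = 0
    for _ in range(num_runs):
        s = start
        for _ in range(2 * T_mix):
            s = rng.choice(n_states, p=P[s, policy(s)])
            if s == absorbing:
                hits += 1
                break
    return hits / num_runs
```

Fair-E3 explores when an estimate of this kind clears the β/T threshold of Lemma 8, and exploits otherwise.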
We now have the necessary results to prove Theorem 6.

Proof of Theorem 6. We divide the analysis into separate parts: the performance guarantee of Fair-E3 and its approximate-action fairness. We defer the analysis of the probability of failure of Fair-E3 to the Appendix.

We start with the performance guarantee and show that when Fair-E3 follows the exploitation policy, the average V*_M values of the visited states are close to E_{s∼μ*}[V*_M(s)]. However, when following an exploration policy or taking random trajectories, the visited states' V*_M values can be small. To bound the performance of Fair-E3, we bound the number of these exploratory steps by the MDP parameters so they only have a small effect on overall performance.

Note that in each T^{π*}_ε-step exploitation phase of Fair-E3, the expectation of the average V*_M values of the visited states is at least E_{s∼μ*}[V*_M(s)] − ε/(1 − γ) − ε/2 by Lemmas 1, 8 and Observation 5. By a Chernoff bound, the probability that the actual average V*_M values of the visited states is less than E_{s∼μ*}[V*_M(s)] − ε/(1 − γ) − 3ε/4 is less than δ/4 if there are at least Ω(log(1/δ)/ε²) exploitation phases.

We now bound the total number T1 of exploratory steps of Fair-E3 in terms of m_Q (defined in Equation 3 of Definition 7) and the other parameters. Two components make up this bound on the number of rounds in which Fair-E3 plays non-exploitatively: the first bounds the number of steps taken when Fair-E3 follows random trajectories, and the second bounds how many steps are taken following explicit exploration policies. The former bound follows from the facts that each random trajectory has length H_γ; that in each state, m_Q trajectories are sufficient for the state to become known; and that random trajectories are taken only before all n states are known. The latter bound follows from the facts that Fair-E3 follows an exploration policy for 2T^{π*}_ε steps; that an exploration policy needs to be followed only O((T^{π*}_ε/β) log(1/δ)) times before reaching an unknown state (since any exploration policy will end up in an unknown state with probability of at least β/T^{π*}_ε according to Lemma 8, and applying a Chernoff bound); that an unknown state becomes known after it is visited m_Q times; and that exploration policies are only followed before all states are known. Finally, to make up for the potentially poor performance in exploration, the number T2 of exploitation steps, taken in T^{π*}_ε-step phases, must be large enough that the exploratory steps constitute only a small fraction of all steps. Therefore, after T = T1 + T2 steps we have

(1/T) Σ_{t=1}^{T} E[V*_M(s_t)] ≥ E_{s∼μ*}[V*_M(s)] − 2ε/(1 − γ),

as claimed in Equation 1. The running time of Fair-E3 is O(n T³): the additional n T² factor comes from the offline computation of the optimal policies in M̂^α_Γ and M̂^α_{[n]\Γ}.

We wrap up by proving that Fair-E3 satisfies approximate-action fairness in every round. The actions taken during random trajectories are fair (and hence approximate-action fair) by Observation 3. Moreover, Fair-E3 computes its exploration and exploitation policies in M̂^α_Γ and M̂^α_{[n]\Γ}. By Lemma 10, with probability at least 1 − δ any Q or V value estimated in M̂^α_Γ or M̂^α_{[n]\Γ} is within α/2 of its corresponding true value in M^α_Γ or M^α_{[n]\Γ}. As a result, M̂^α_Γ and M̂^α_{[n]\Γ} (i) contain all the optimal policies and (ii) only contain actions with Q values within α of the optimal actions. It follows that any policy followed in M̂^α_Γ or M̂^α_{[n]\Γ} is α-action fair, so both the exploration and exploitation policies followed by Fair-E3 satisfy α-action fairness, and Fair-E3 is therefore α-action fair.

5. Discussion and Future Work

Our work leaves open several interesting questions. For example, we give an algorithm that has an undesirable exponential dependence on 1/(1 − γ), but we show that this dependence is unavoidable for any approximate-action fair algorithm. Without fairness, near-optimality in learning can be achieved in time that is polynomial in all of the parameters of the underlying MDP.
So, we can ask: does there exist a meaningful fairness notion that enables reinforcement learning in time polynomial in all parameters? Moreover, our fairness definitions remain open to further modulation. It remains unclear whether one can strengthen our fairness guarantee to bind across time rather than simply across actions available at the moment without large performance tradeoffs. Similarly, it is not obvious whether one can gain performance by relaxing the every-step nature of our fairness guarantee in a way that still forbids discrimination. These and other considerations suggest many questions for further study; we therefore position our work as a first cut for incorporating fairness into a reinforcement learning setting.

References

[1] Julia Angwin, Jeff Larson, Surya Mattu, and Lauren Kirchner. Machine bias. ProPublica, 2016.

[2] Anna Barry-Jester, Ben Casselman, and Dana Goldstein. The new science of sentencing. The Marshall Project, August 8 2015. URL https://www.themarshallproject.org/2015/08/04/the-new-science-of-sentencing/. Retrieved 4/28/2016.

[3] Ronen Brafman and Moshe Tennenholtz. R-MAX - a general polynomial time algorithm for near-optimal reinforcement learning. Journal of Machine Learning Research, 3:213-231, 2002.

[4] Nanette Byrnes. Artificial intolerance. MIT Technology Review, March 28 2016. URL https://www.technologyreview.com/s/600996/artificial-intolerance/. Retrieved 4/28/2016.

[5] Cynthia Dwork, Moritz Hardt, Toniann Pitassi, Omer Reingold, and Richard Zemel. Fairness through awareness. In Proceedings of the 3rd Innovations in Theoretical Computer Science, pages 214-226, 2012.

[6] Michael Feldman, Sorelle Friedler, John Moeller, Carlos Scheidegger, and Suresh Venkatasubramanian. Certifying and removing disparate impact. In Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 259-268, 2015.

[7] Sorelle Friedler, Carlos Scheidegger, and Suresh Venkatasubramanian. On the (im)possibility of fairness. CoRR, abs/1609.07236, 2016.

[8] Sara Hajian and Josep Domingo-Ferrer. A methodology for direct and indirect discrimination prevention in data mining. IEEE Transactions on Knowledge and Data Engineering, 25(7):1445-1459, 2013.

[9] Moritz Hardt, Eric Price, and Nathan Srebro. Equality of opportunity in supervised learning. In Proceedings of the 30th Annual Conference on Neural Information Processing Systems, pages 3315-3323, 2016.

[10] Matthew Joseph, Michael Kearns, Jamie Morgenstern, and Aaron Roth. Fairness in learning: Classic and contextual bandits. In Proceedings of the 30th Annual Conference on Neural Information Processing Systems, pages 325-333, 2016.

[11] Faisal Kamiran, Asim Karim, and Xiangliang Zhang. Decision theory for discrimination-aware classification. In Proceedings of the 12th IEEE International Conference on Data Mining, pages 924-929, 2012.

[12] Toshihiro Kamishima, Shotaro Akaho, Hideki Asoh, and Jun Sakuma. Fairness-aware classifier with prejudice remover regularizer. In Proceedings of the European Conference on Machine Learning and Knowledge Discovery in Databases, pages 35-50, 2012.

[13] Michael Kearns and Satinder Singh. Near-optimal reinforcement learning in polynomial time. Machine Learning, 49(2-3):209-232, 2002.

[14] Michael Kearns, Yishay Mansour, and Andrew Ng. Approximate planning in large POMDPs via reusable trajectories. In Proceedings of the 13th Annual Conference on Neural Information Processing Systems, pages 1001-1007, 2000.

[15] Jon Kleinberg, Sendhil Mullainathan, and Manish Raghavan. Inherent trade-offs in the fair determination of risk scores. In Proceedings of the 7th Conference on Innovations in Theoretical Computer Science, 2017.
[16] Shiau Hong Lim, Huan Xu, and Shie Mannor. Reinforcement learning in robust Markov decision processes. In Proceedings of the 27th Annual Conference on Neural Information Processing Systems, pages 701-709, 2013.

[17] Binh Thanh Luong, Salvatore Ruggieri, and Franco Turini. k-NN as an implementation of situation testing for discrimination discovery and prevention. In Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 502-510, 2011.

[18] Shie Mannor, Ofir Mebel, and Huan Xu. Lightning does not strike twice: Robust MDPs with coupled uncertainty. In Proceedings of the 29th International Conference on Machine Learning, 2012.

[19] Clair Miller. Can an algorithm hire better than a human? The New York Times, June 25 2015. URL http://www.nytimes.com/2015/06/26/upshot/can-an-algorithm-hire-better-than-a-human.html/. Retrieved 4/28/2016.

[20] Jun Morimoto and Kenji Doya. Robust reinforcement learning. Neural Computation, 17(2):335-359, 2005.

[21] Dino Pedreshi, Salvatore Ruggieri, and Franco Turini. Discrimination-aware data mining. In Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 560-568, 2008.

[22] Cynthia Rudin. Predictive policing: Using machine learning to detect patterns of crime. Wired Magazine, August 2013. URL http://www.wired.com/insights/2013/08/predictive-policing-using-machine-learning-to-detect-patterns-of-crime/. Retrieved 4/28/2016.

[23] Satinder Singh. Personal communication, June 2016.

[24] Richard Sutton and Andrew Barto. Introduction to Reinforcement Learning. MIT Press, Cambridge, MA, USA, 1st edition, 1998.

[25] Latanya Sweeney. Discrimination in online ad delivery. Communications of the ACM, 56(5):44-54, 2013.

[26] István Szita and Csaba Szepesvári. Model-based reinforcement learning with nearly tight exploration complexity bounds. In Proceedings of the 27th International Conference on Machine Learning, pages 1031-1038, 2010.

[27] Blake Woodworth, Suriya Gunasekar, Mesrob Ohannessian, and Nathan Srebro. Learning non-discriminatory predictors. In Proceedings of the 30th Conference on Learning Theory, 2017.

[28] Muhammad Bilal Zafar, Isabel Valera, Manuel Gomez-Rodriguez, and Krishna P. Gummadi. Fairness beyond disparate treatment and disparate impact: Learning classification without disparate mistreatment. In Proceedings of the 26th International World Wide Web Conference, 2017.

[29] Rich Zemel, Yu Wu, Kevin Swersky, Toni Pitassi, and Cynthia Dwork. Learning fair representations. In Proceedings of the 30th International Conference on Machine Learning, pages 325-333, 2013.