# group_fairness_in_reinforcement_learning__cad636b2.pdf Published in Transactions on Machine Learning Research (04/2023) Group Fairness in Reinforcement Learning Harsh Satija harsh.satija@mail.mcgill.ca Mc Gill University, Mila Alessandro Lazaric lazaric@meta.com Meta AI (FAIR) Matteo Pirotta pirotta@meta.com Meta AI (FAIR) Joelle Pineau jpineau@cs.mcgill.ca Mc Gill University, Mila, Meta AI (FAIR) Reviewed on Open Review: https: // openreview. net/ forum? id= Jk IH4Me Oc3 We pose and study the problem of satisfying fairness in the online Reinforcement Learning (RL) setting. We focus on the group notions of fairness, according to which agents belonging to different groups should have similar performance based on some given measure. We consider the setting of maximizing return in an unknown environment (unknown transition and reward function) and show that it is possible to have RL algorithms that learn the best fair policies without violating the fairness requirements at any point in time during the learning process. In the tabular finite-horizon episodic setting, we provide an algorithm that combines the principle of optimism and pessimism under uncertainty to achieve zero fairness violation with arbitrarily high probability while also maintaining sub-linear regret guarantees. For the high-dimensional Deep-RL setting, we present algorithms based on the performance-difference style approximate policy improvement update step and we report encouraging empirical results on various traditional RL-inspired benchmarks showing that our algorithms display the desired behavior of learning the optimal policy while performing a fair learning process. 1 Introduction With an ever-increasing number of automated decision-making algorithms deployed around us, it becomes important to be cautious of the risks and biases that can result due to the nature in which these algorithms are being trained. Reinforcement Learning (RL, Sutton, 1988) has emerged as a powerful paradigm for learning in the sequential decision-making setting. At the same time, several studies have demonstrated that it is even more important to control for fairness in sequential decision-making, as imposing fairness constraints without considering feedback effects of the policy can lead to further discrepancy (Liu et al., 2018). Therefore, in this paper, we focus on the concerns related to imposing fairness not only to the RL optimization problem but also during the learning process. We first clarify what we mean by fairness. At an abstract level, fairness can be defined as the absence of discrimination. This definition requires us to define some measure of discrimination, and then the fairness property is defined w.r.t. that measure. For this work, we focus on the category of Group fairness. This definition of fairness is based on a notion of protected subgroups and a measure across groups. The subgroups Published in Transactions on Machine Learning Research (04/2023) are defined on the choice of some sensitive attributes (such as race, gender, ethnicity, etc.), and a measure that is required for comparing the outcomes of different subgroups (such as false-positive rate or classification error). Fairness is then defined in terms of requiring equal measure for different protected subgroups. A majority of definitions of fairness falls under this category, such as demographic parity (Dwork et al., 2012), disparate impact (Feldman et al., 2015), equality of opportunity and equalized odds (Hardt et al., 2016). Example 1 (College admissions): Consider a scenario where an RL agent assists the college admissions process to accept candidates every semester. A candidate s information consists of sensitive features representative of their demographic information (like socio-economic background or gender) and non-sensitive features like their standardized test scores. Due to resource constraints, the college can only admit a limited number of candidates during every admission cycle. Suppose the college consistently admits candidates from one demographic group over the other. In that case, it can create a feedback loop where candidates from that particular group are encouraged to apply more and vice versa (Immorlica et al., 2019; Garg et al., 2020; Puranik et al., 2022). From an equity and diversity perspective, one can motivate the need to create feedback loops for the minority groups to reverse existing trends (Emelianov et al., 2020). Thus, under this fairness requirement, the admissions agent should admit the top-scoring candidates such that the long-term disparity between demographics should be below some target over all the admission cycles. Example 2 (Credit lending): Consider another scenario where an RL-based credit lending system assists a bank by filtering loan applicants. The lending agent interacts with applicants having features consisting of both sensitive attributes (like demographics) and non-sensitive attributes (like credit score). An applicant might file for multiple loans over a span of time, and for any given application, the lending agent needs to decide whether to approve or reject an application to maximize the number of loan repayments for the bank. The problem is sequential as both the repayment and defaults on the granted loans take over some time, during which the applicant s non-sensitive features can change due to the agent s actions (for instance, credit score might decrease for applicants that are unable to repay in time). Furthermore, the applicants in different demographics can have different dynamics (how the creditworthiness of each group evolves) and other initial starting non-sensitive state distributions (like credit score distributions). Recent works on imposing fairness in the credit lending domain have highlighted the caution regarding the long-term effect of deploying such decision-making systems as these systems have to potential to increase further the disparity between groups (Liu et al., 2018; Castelnovo et al., 2021; D Amour et al., 2020; Fuster et al., 2022). Thus, external regulations like the European AI Act (The European Commission, 2021) might require that such a system not exhibit discriminatory behaviour toward different population segments. Under this requirement, the policies deployed by our algorithm should not exhibit behaviour that contributes to the existing disparity between groups, such as rejecting more applications from a particular demographic compared to the other. Although we motivate the problem in the credit lending and college admissions domain, the evidence related to the biased nature of deployed algorithms is substantial and can be found in a variety of settings such as hiring, ads and marketing (Miller, 2015; Dastin, 2018; Datta et al., 2015; Celis et al., 2019; Fu et al., 2020). 1.1 Contributions and limitations (a) Traditional Figure 1: Toy grid-world example demonstrating traditional vs fair optimal policies. We now highlight how introducing the fairness requirement in the RL framework changes its different aspects. First, the fairness requirement changes the definition of the optimal policy; the learning objective becomes to find the policy that maximizes the reward in a restricted class of policies that satisfy the fairness property. Additionally, at any point in the learning process, the policy executed by the agent should meet the fairness criteria. This makes applying traditional RL exploration strategies, such as those based on optimism under uncertainty, particularly challenging for this setting as being optimistic can lead to being unfair. Example 3 (Four-rooms grid): Consider the modified four-rooms domain in Figure 1 that contains two subgroups: the circle agents that start in the upper-left corner and the diamond agents in the lower-left corner. Published in Transactions on Machine Learning Research (04/2023) The state is defined both by the position of the agent in the grid (non-sensitive attribute) and the shape of the agent (sensitive attribute). The goal is to reach the cell containing the gold star in the least amount of steps. Both subgroups start at a similar distance from the goal, but the transition dynamics of circle agents is deterministic whereas the diamond agents have very stochastic dynamics. Thus, the trajectories taken by the circle agents via traditional RL optimal policy are shorter compared to the diamond agents (Figure 1a). Enforcing the group fairness constraints here leads to slightly longer trajectories for circle agents (Figure 1b) and we observe that both subgroups now take similar time to reach the goal to satisfy the fairness criteria. We aim to answer the following question in this work: Can we design algorithms that can respect the given fairness constraints throughout the policy improvement learning procedure and achieve good performance? In Section 3, we show it is possible to have an algorithm with high probability guarantees on performance and fairness during the learning in the tabular episodic RL setting. This is achieved by leveraging the principle of combining optimism and pessimism, as previously used in constrained bandits (Amani et al., 2019; Pacchiano et al., 2021), constrained MDPs (Wachi and Sui, 2020; Liu et al., 2021) and robust-RL (Curi et al., 2021). We apply this principle to the fairness setting with unknown MDP parameters for balancing the exploration and fairness trade-off to simultaneously achieve zero fairness constraint violation during learning, and sub-linear regret. In Section 4, we present a scalable and practical Deep-RL algorithm that uses a trust-region-based update rule to approximate guarantees on the fairness violation. This approach is based on the constrained policy improvement performance bounds proposed by Achiam et al. (2017) and allows us to extend the Deep-RL algorithms such as PPO (Schulman et al., 2017) to our setting. In Section 4.3, we provide empirical evidence that our approach is indeed able to achieve good performance while achieving the fairness requirement on simulated robotic locomotion and navigation tasks. We now identify some of the limitations of our approach. First, note that we neither impose any structure on the environment nor consider the case where there is any information sharing within the groups. This makes our setting different from the multi-agent setting as the agents interact with the environment independently and do not interact with other agents in any form. Our setting is closer to the typical single-agent RL setting, with the added caveat that the environment can behave differently for some populations of agents. Next, we work in the context of the traditional group fairness definitions and do not investigate the effectiveness of these definitions. In practice, selecting a particular algorithmic fairness definition for any domain can be quite challenging as the fairness demands can be formulated under different perspectives and scenarios. This is further exacerbated by the fact that there is no consensus on a universally accepted definition of algorithmic fairness and some fairness definitions are at odds with each other (Chouldechova, 2017; Pleiss et al., 2017). We are not advocating that using existing group-based notions of fairness will address all the concerns related to fairness in the sequential decision-making setting. This is a rather complex problem, and the solution will potentially be more involved than either a single kind of fairness definition (Awasthi et al., 2020). Our work should be viewed as the first step toward this bigger goal where we establish the general results regarding satisfying group-based notions of fairness in the context of the RL paradigm. Finally, our approach is based on enforcing the fairness requirements by directly restricting the policy space. This implies that guaranteeing fairness can limit the scope of available policies and can come at a performance cost for some groups. As such, given a group fairness requirement, our approach can potentially penalize the higher-performing groups (as in Example 2) to match the performance of the lesser-performing subgroups (within some margin), whereas an ideal scenario would have been to increase the performance of the latter. We will revisit this limitation in Section 5 and propose possible workarounds against this for future work. 1.2 Related work Fairness in machine learning is an active field of research (Mehrabi et al., 2019; Pessach and Shmueli, 2020; Gajane et al., 2022). A number of recent works have specifically explored various aspects of the fair-RL setting. Jabbari et al. (2017) study the problem of fairness in online RL setting and Doroudi et al. (2017); Nabi et al. (2019) study the problem in offline RL setting, however, they work with different non-group specific notions of fairness has no notion of sensitive attributes or groups. Siddique et al. (2020) study a notion of fairness based on social welfare functions in the multi-objective RL setting and provide methods to adapt the traditional RL techniques to their modified objective. Mandal and Gan (2022) study the problem Published in Transactions on Machine Learning Research (04/2023) of selecting a measure that ensures fairness in the multi-agent RL setting. They propose properties that a fair policy should satisfy and provide an algorithm to find such fair optimal policy in the unknown multi-agent MDP. Their notion of fairness is different from the group-based fairness we consider in this work. The problem studied in this work is closely related to the one studied in the Constrained MDPs (CMDPs, Altman, 1999). The focus in CMDPs is to find a policy that maximizes the return in some restricted policy class, where the constraints that define the restricted policy space are based on the additional feedback signals from the environment. The main difference in our setting and CMDPs is how the constraints on the policy space are being defined. In CMDPs, a constraint is based solely on a return defined w.r.t. a single reward signal, and multiple constraints differs only in the corresponding reward signals while all the other environment parameters remain the same. Whereas in our setting, given a fairness criteria, the constraints are based on the returns belonging to different populations that may differ due to variation in any possible environment parameters. As a result, the constraints in our case are based on a combination of multiple returns, each of which can differ in either reward signal or the transition dynamics. Despite this difference, we build on the works on exploration in CMDPs (Achiam et al., 2017; Tessler et al., 2018; Chow et al., 2019; Zhang et al., 2020; Efroni et al., 2020; Liu et al., 2021). A closely related work to ours is the recent work by Wen et al. (2021) who incorporate the group fairness criteria as a constraint in the CMDP framework. Similar to Wen et al. (2021) we also make an assumption that the sensitive attributes (like race) do not change via interactions with the environment. However, the formulation in Wen et al. (2021) requires that populations corresponding to different sensitive attributes to share the same environment transition dynamics. Instead our work addresses a more general setting where the different populations can differ in any of the environment parameters, including the transition dynamics. Furthermore, in the approach proposed by Wen et al. (2021) the policy is only updated once throughout the entire learning process. Finally, they also assume that the uniform exploration policy is fair, which is not true in general, and can violate the fairness constraints any number of times. 2 Problem setting in finite-horizon episodic MDPs For any positive integer n, let [n] denotes the set {1, . . . , n}. An episodic Markov Decision Process (MDP, Bellman, 1957) is denoted as M = (S, A, P, r, µ, H), where S = [N] and A = [A] denotes the finite state and action sets, H denotes the horizon or length of the episode, and rh : S A [0, 1] denotes the reward function at time step h [H]. The transition model is denoted by P ( |s, a) H S , s S, a A, where S denotes the |S|-dimensional probability simplex. Ph(s |s, a) denotes the probability of transitioning to state s after taking an action a from state s at time step h [H]. The initial state distribution is denoted by µ S, where µ(s) denotes the probability of the agent starting in state s. A non-stationary policy in this setting is defined as π : [H] S A. We abuse the notation and use πh(a|s) to denote the probability of taking action a in state s at time step h. The expected return of a policy at some state s S is defined by the value function V π h (s; r, P) .= EP,π h PH t=h rh(St, At) | Sh = s i , where At πt( |St) and St+1 Ph( |St, At). The return of the policy is defined by Jπ(r, P) .= ES1 µ[V π 1 (S1; r, P)]. The occupancy measure dπ of a policy π denotes the set of distributions generated by executing the policy π and is defined as dπ h(s, a; µ, P) .= E[1{Sh = s, Ah = a|S1 µ, P, π}] = Pr{Sh = s, Ah = a|S1 µ, P, π} (Efroni et al., 2020). Using the occupancy measure, the return of a policy can be reformulated as Jπ(r, P) = P h,s,a dπ h(s, a; µ, P)rh(s, a) (Altman, 1999; Puterman, 2014). The traditional RL optimal policy w.r.t. reward function r is defined as arg maxπ Jπ(r, P). 2.1 Introducing fairness We focus on the fairness definition based on groups where different sub-populations of agents are interacting with the environment independently. We assume the agent s state space is jointly comprised of sensitive attributes that are required for defining a notion of group, and environment specific features that determine how the agent navigates in the environment. For instance, in a recommender systems context, the agent specific sensitive attributes can be race, gender, nationality, etc. whereas the environment specific attributes Published in Transactions on Machine Learning Research (04/2023) can be budget, past order history, preferred categories. We also assume that interactions with the environment do not change the agent s sensitive attributes. Formally, these assumptions can be defined as: Assumption 2.1 (Separable and observable attributes). We assume the state space S is jointly composed of sensitive-attributes Z and non-sensitive attributes S, i.e., S = Z S. Assumption 2.2 (Consistent sensitive attributes). We assume that an agent s sensitive attributes remain constant throughout an episode. This implies that the transition function at any h [H] satisfies Ph(s |s, a) = Ph((z , s )|(z, s), a) = Ph( s |z, s, a)1[z = z ] relation, where Ph( s |z, s, a) is the underlying transition function associated with non-sensitive attributes S and z Z. Definition 2.1 (Subgroup). We refer to the population of agents associated with a particular sensitive attribute z Z as the subgroup associated with z. The initial non-sensitive state distribution associated with a particular subgroup z Z is denoted by µz S, and defined as µz( s) = Pr( s|z) = µ(z, s) Pr(z) , s S, where Pr(z) = P s S µ(z, s) is a normalizing constant. For an agent belonging to subgroup z Z, µz( s) denotes the probability of the agent starting the episode in the non-sensitive state s. Similarly, we use Pz to denote the subgroup specific transition function corresponding to the non-sensitive attributes, i.e., Pz,h( s | s, a) .= Ph( s |z, s, a). Note that the definition of policy remains unchanged, i.e., it depends on both s and z. We can thus define the subgroup specific returns for any policy π as: Jπ z (r, P) .= E S1 µz [V π 1 ((z, S1); r, Pz)]. (1) We work with a particular group fairness criteria known as demographic parity (Dwork et al., 2012; Zafar et al., 2017). Informally, the demographic parity constraint requires that different subgroups should have similar returns. Formally, in our setting it is defined as follows. Definition 2.2 (Demographic fairness). For some ϵ 0, that denotes an acceptable margin of error, a policy π satisfies demographic fairness criteria if |Jπ i (r, P) Jπ j (r, P)| ϵ, i j; (i, j) Z2. The decision maker s reward might or might not be aligned with the reward of the demographics. For instance, in the grid world navigation scenario, there is no distinction between the reward of the decision-maker and the demographics with the reward being reaching the goal quickly. Whereas in the credit lending scenario, the decision-maker s reward is to maximize the bank s profits via loan repayments, whereas the demographics reward is to get more loans. We consider the general case where the decision maker s reward can be different and denote it by lh : S A [0, 1]. Let π denote the optimal policy that maximizes the return among the class of policies that satisfy the above fairness condition, i.e., π = arg max π Jπ(l, P) (2) s.t. |Jπ i (r, P) Jπ j (r, P)| ϵ, i j; i, j Z2. Note that the primary objective in Equation (2) is concerned with maximizing the cumulative returns for all subgroups w.r.t. the decision-maker s reward function l whereas as the fairness constraint is based on the demographics reward function r. When all the MDP parameters are known, it is possible to devise a Linear Programming (LP) solution to the problem in Equation (2). The details of the LP solver are provided in Appendix B. In Appendix B.3, we show how other group fairness definitions, such as equality of opportunity and equalized odds (Hardt et al., 2016), can be formulated and solved in our setting. In order to avoid making any assumption regarding the feasibility of the problem, we make the following assumption regarding an initial fair policy: Assumption 2.3 (Initial feasible policy). The algorithm has access to a policy π0 that satisfies the fairness constraints in Equation (2). We also assume |Jπ0 i (r, P) Jπ0 j (r, P)| ϵ0 < ϵ, (i, j) Z2 and the value of ϵ0 is known to the algorithm. Having an initial fair policy ensures that the problem in Equation (2) is always feasible, even in the case when the agent is unaware of the MDP parameters as it can always interact with the environment without Published in Transactions on Machine Learning Research (04/2023) violating fairness constraints via π0. Assumption 2.3 will not be valid if either Equation (2) is unfeasible, or none of feasible policies satisfy the strict inequality condition, i.e., none of them have any margin for exploration. From a practical perspective, any known sub-optimal policy that the algorithm practitioner regards as fair can be used, by controlling the acceptable fairness threshold ϵ. We acknowledge that Assumption 2.3 is a strong assumption as it requires the algorithm practitioner to have a fair exploration policy with some margin. We want to highlight that the learning problem we consider in this work is also quite challenging, hence the reliance on a stronger assumption. When the learning algorithm has neither any information about the environment nor any access to some initial fair exploration policy, it cannot avoid unfair decisions in the first episode of learning itself, as any potential interaction with the environment might lead to a fairness violation at the beginning of the learning process. Similar assumptions are also made in safe RL literature (Pacchiano et al., 2021; Liu et al., 2021; Bura et al., 2022). The assumption is not unrealistic, as the practitioner can use any existing strategy, even if it incurs low rewards, to initialize the algorithm. The strict inequality in the initial exploration policy is required to leave some margin of error for the agent to be able to explore where the dependence on the margin term ϵ ϵ0 is also captured in the regret bounds in Theorem 3.2. Note that it is possible to leverage the techniques from Efroni et al. (2020) and provide a sublinear regret result without such an assumption, however, doing so does not guarantee zero fairness violation during the learning process. We can then only guarantee that the number of violations decreases over time. Finally, it is possible to devise algorithms for the simpler problem of recovering a fair policy. While this will remove the necessity of having access to a fair initial policy, the algorithm will only be guaranteed to deploy a sublinear number of unfair policies. However, this is outside the scope of this work. 2.2 Motivating example We expand on the credit lending example provided in the Section 1 to better illustrate our setting. We build on the lending environment from D Amour et al. (2020); Wen et al. (2021) where the agent is making applicant filtering decisions on behalf of a financial institution. An applicant applies for multiple loans over a span of time and the agent is tasked to accept or reject the application, i.e., A = {grant, reject}. The applicant features consists of non-sensitive attributes based on credit score S = {1, . . . , Cmax}, where the probability that an applicant successfully repays the loan is a deterministic function of the credit score denoted by ξ : S [0, 1]. The bank makes a profit of interest Ib on a successful repayment of a granted loan and suffers a loss of principal Pb on a default, i.e., l( , s, grant) = ξ( s)Ib + (1 ξ( s))Pb. There are two group of candidates Z = {high, low} that have different initial credit score distribution as well as dynamics of how the credit score evolves due to agent s actions. The initial credit-score distribution of the group high skews more towards the higher ranges with a higher mean of initial credit score compared to the low group. For both the groups, if granted a loan, a successful repayment of a loan leads to improvement of the applicant s credit score by c+ whereas a default causes the credit score to decrease by c . However, the dynamics of both groups differ on rejection where an applicant from the group low is affected more and suffers a possible decrease in ability to repay future loans. This is modeled by decreasing the credit score of applicants in group low by c with a probability ν on rejection, where ν is a hyper-parameter that denotes the handicap. The candidates in the group high do not experience any change in credit score on rejection. The candidates are applying for loans because they need credit, thus the reward of the candidates is based on whether or not they were given loan, r(z, s, a) = 1[a = grant]. Therefore, the fairness constraint here ensures that a near equal amount of loans are given to both the groups. This prevents the agent from significantly rejecting more loans from the disadvantaged group, that was disadvantaged due to possible history of financial oppression in the first place. In this setting, the traditional non-fair RL algorithms will focus solely on maximizing the bank profits leading to rejecting more applicants from the low group and further increasing the existing disparity. Finally, an initial policy that grants the loans to every applicant regardless of their credit score can be used for Assumption 2.3. Even though this policy might be quite sub-optimal in maximizing the bank profits, but since it approves loans with same rate for both the groups (ϵ0 = 0), it can be used to drive the initial exploration for any choice of margin. Published in Transactions on Machine Learning Research (04/2023) 3 Algorithm for the unknown model and reward setting We now turn our attention to the setting where the agent only has access to the subgroup specific initial non-sensitive state distributions µz, z Z, and relies on the observed (sampled) rewards and transitions to improve its performance over time. For clarity, we present our method for the case where groups are sampled uniformly for each episode (or Pr(z) = 1/|Z|, z Z). We relax this assumption later and show that our results also extend to the general case of non-uniform group sampling distributions. We assume that the algorithm interacts with the environment for a total of K episodes, with each episode being H steps long. For each episode k [K], a subgroup Zk 1 is selected uniformly from Z and then the corresponding initial state (Zk 1 , Sk 1) is sampled from µZk 1 . The agent then executes the non-stationary policy at that current episode πk, where the agent takes an action Ak h on state (Zk h, Sk h) at the time step h , transitions to state (Zk h+1, Sk h+1) where Zk h+1 = Zk h and Sk h+1 Ph( |Zk h, Sk h, Ak h) (Assumption 2.2), and receives the rewards rk h(Zk h, Sk h, Ak h) = rh(Zk h, Sk h, Ak h) + ζk h and lk h(Zk h, Sk h, Ak h) = lh(Zk h, Sk h, Ak h) + ζk h, where ζk h is zero mean 1/2-sub-Gaussian. We define the cumulative regret in the traditional sense as Reg(K; l) .= PK k=1(Jπ (l, P) Jπk(l, P)). Given some initial fair exploration policy π0 (Assumption 2.3), our goal is to design a learning algorithm that can (i) satisfy the fairness requirement with arbitrarily high probability throughout the learning, and (ii) achieve a sub-linear cumulative regret in the number of episodes. To guarantee the fairness requirement holds throughout the learning, we construct a set that contains the possible policies that are fair (with high-confidence) based on the general principle of the optimism in face of uncertainty (Auer et al., 2008; Efroni et al., 2020; Liu et al., 2021). We want this set to contain the policy updates, based on the current MDP estimates, that ensure we do not violate the fairness constraints in the true MDP with high confidence. However, optimism alone is not sufficient to construct such set as we want the fairness guarantee to hold in the true MDP model and not just in the best possible optimistic MDP model (formal argument provided in Appendix C.1). Therefore, we use the techniques from Liu et al. (2021) that combine both optimism and pessimism to construct reward estimates to achieve this goal. At each episode k, we denote the empirical estimates of the rewards and transition model at time step h based on the episodes [1, . . . , k 1] by ˆrk h, ˆlk h and ˆP k h . For a given value of δ (0, 1), we denote the uncertainty estimate for the empirical transition and rewards by, βk h(z, s, a) .= r 1 max{Nk h(z, s,a),1)} log 4|Z|2| S|2|A|HK where N k h(z, s, a) denotes the number of times the state-action tuple (z, s, a) was observed at time step h so far (z, s, a, h) Z S A [H]. We define the optimistic and pessimistic reward estimates as: rk h(z, s, a) .= ˆrk h(z, s, a) + (1 + |Z|| S|H)βk h(z, s, a) rk h(z, s, a) .= ˆrk h(z, s, a) (1 + |Z|| S|H)βk h(z, s, a) (3) With high confidence, we want an optimistically estimated return to be greater than the underlying true return and vice versa, even when accounting and integrating the uncertainty due to rewards and transitions over the horizon. The constants for the optimistic and pessimistic rewards in Equation (3) are defined to get the corresponding properties for the associated returns as described in Appendix C.3. The key step in our approach that is different from Liu et al. (2021) is that for a pair of subgroups i, j Z2, we optimistically estimate the return for the one subgroup and at the same time pessimistically estimate the return for the other subgroup to ensure that the fairness constraint still holds true with high-confidence. We define the set of policies that satisfy the fairness guarantees based on the optimistic and pessimistic reward estimates as: π : Jπ i (rk, ˆP k) Jπ j (rk, ˆP k) ϵ, i j; (i, j) Z2. Jπ j (rk, ˆP k) Jπ i (rk, ˆP k) ϵ, i j; (i, j) Z2. Published in Transactions on Machine Learning Research (04/2023) where we have omitted the conditions for π H A for the sake of brevity. The final set of policies available to the algorithm to execute at episode k is chosen from the high-confidence set Πk, defined as: ( if Jπ0 i (rk, ˆP k) Jπ0 j (rk, ˆP k) > (ϵ + ϵ0)/2, or Jπ0 j (rk, ˆP k) Jπ0 i (rk, ˆP k) > (ϵ + ϵ0)/2, i j; (i, j) Z2. Πk F , otherwise. The first case denotes the scenario where the ˆrk, ˆP k parameters are not well estimated and as such the agent needs to gather more data by executing the known initial fair policy π0. We now present a result stating that policies chosen from Πk do not violate the fairness guarantees for any of the subgroups throughout the learning duration with arbitrarily high probability. Theorem 3.1 (Fairness violation). Given an input confidence parameter δ (0, 1) and an initial fair policy π0, the construction of Πk ensures that there are no fairness violations at any episode in the learning procedure in the true environment with high probability (1 δ), i.e., for any π Πk, Pr Jπ i (r, P) Jπ j (r, P) ϵ 1 δ, (i, j) Z2, k [K]. (6) The proof of the above claim is presented in Appendix C.4. Even though selecting just any policy from Πk will satisfy the fairness guarantees, we also care about efficiency of the exploration. In order to achieve sub-linear regret, we incorporate another optimistic reward estimate that is defined as: lk h(z, s, a) .= ˆlk h(z, s, a) + αlβk h(z, s, a), (7) where αl .= 1 + |Z|| S|H + 8H(1+|Z|| S|H) η is another scaling factor with η = (ϵ ϵ0). To conclude, for an episode k [K], the maximum performing policy within the set Πk w.r.t. the new optimistic reward is selected to be executed as πk, i.e., πk arg max π Πk Jπ( lk, ˆP k) (8) The above optimization problem can be solved in a similar manner as the LP formulation in Section 2.1. The complete algorithm is presented in Algorithm 1 and the exact formulation of LP for solving Equation (8) is given in Appendix C.5. Theorem 3.2 (Regret Bound). For any δ (0, 1), with probability 1 δ, executing πk from Equation (8) at every episode k [K] incurs in a regret of at most Reg(K; l) = O H3 |Z|3| S|3|A|K + H5|Z|5| S|3|A| where O( ) hides polylogarithmic terms. The first term in the above result corresponds to the difference from using estimated parameters instead of the true MDP parameters. This term is consistent with the result of Liu et al. (2021) in the context of CMDPs, where this quantity takes the form O( H3 |Z|3| S|3|A|K). Here ηC denotes the exploration margin in terms of constraint violation for CMDPs and we assume |S| = |Z|| S|. In (Efroni et al., 2020), this term takes the form O(H2 q |Z|2| S|2|A|K), but they do not guarantee zero constraint violation during learning. The second term in Equation (9) (independent of K) represents the upper bound on the amount of time the agent needs to resort to executing π0 due to inaccurately estimated MDP parameters. In (Liu et al., 2021), this term when defined in context of a CMDP with a single constraint, takes the form H5|Z|3| S|3|A|/ min{ηC, η2 C}. In our result, we have an additional factor of |Z|2 to account for constraints in our case that are defined pairwise. As mentioned earlier, we currently sample z Z uniformly for each episode (or Pr(z) = 1/|Z|, z Z). However, this might not be the case in reality as different populations might not be always represented equally. Published in Transactions on Machine Learning Research (04/2023) Algorithm 1 LP based algorithm for Section 3 Input: π0, ϵ0, ϵ, K, δ. 1: Initialize: Nh(z, s, a) = 0, (z, s, a, h) Z S A [H]. 2: for k = 1, . . . , K do 3: Update the empirical estimates ˆP k, ˆrk, ˆlk. 4: Compute the optimistic and pessimistic reward estimates lk, rk, rk. 5: Set πk Null 6: for i j; (i, j) Z2: do 7: if Jπ0 i (rk, ˆP k) Jπ0 j (rk, ˆP k) > (ϵ + ϵ0)/2 Jπ0 j (rk, ˆP k) Jπ0 i (rk, ˆP k) > (ϵ + ϵ0)/2 then 8: Set πk π0. 9: end if 10: end for 11: if πk == Null then 12: Set πk arg maxπ Πk Jπ( lk, ˆP k). 13: end if 14: Execute πk in the true environment and collect a trajectory (Zk h, Sk h, Ak h, rk h(Zk h, Sk h, Ak h), lk h(Zk h, Sk h, Ak h)), h [H]; 15: Update counters Nh(Zk h, Sk h, Ak h), h [H]; 16: end for In Appendix C.7, we show that our approach can be easily extended to the setting with any arbitrary Z by also incorporating the Pr(z) term in the definition of subgroup specific returns. The results regarding fairness violations and regret Theorems 3.1 and 3.2 remain valid even in this extended setting. To summarize, we build on the methodology of Liu et al. (2021), defined in the context of traditional CMDP setting, and extend it to the setting of group-based fairness constraints. We show that the proposed approach still maintains desirable properties of achieving good performance while respecting the given fairness constraints. The methodology of Liu et al. (2021) only requires pessimism in the constraints as the safety constraints in CMDPs are based entirely on a single reward function. However, the group fairness constraints we consider in this work require pairwise treatment of returns as the statistics of different groups are considered. When introducing more than one reward function in the constraint, a single reward scaling technique such as pessimism fails to be sufficient anymore and instead we require techniques that can carefully balance the different notions of reward scaling. This combination of optimistic and pessimistic reward shaping and balancing in the fairness constraints, along with the pairwise nature of group constraints, requires substantial effort and new results (like Lemma C.6 and proof for Theorem 3.1) to extend the analysis techniques from Liu et al. (2021). 0 1 2 3 4 5 Time-steps Episodic return Cumulative regret 0 1 2 3 4 5 Time-steps Fairness violations (a) Z = [0.9, 0.1] 0 1 2 3 4 5 Time-steps Episodic return Cumulative regret 0 1 2 3 4 5 Time-steps Fairness violations (b) Z = [0.1, 0.9] 0 1 2 3 4 5 Time-steps Episodic return Cumulative regret 0 1 2 3 4 5 Time-steps Fairness violations (c) Z = [0.5, 0.5] Figure 2: Regret and fairness violations on the credit landing environment for different underlying group distributions, where Z denotes [Pr(high), Pr(low)]. In all the scenarios, our proposed approach achieves sub-linear regret and close to zero fairness violations. We provide an empirical study validating the above results on the credit lending environment (Section 2.2) as well as a variation of the classic River Swim environment (Strehl and Littman, 2008) in Appendix D. We Published in Transactions on Machine Learning Research (04/2023) compare our method with an MLE baseline that starts with π0 and then builds the MLE estimates of the MDP parameters. The MLE baseline then uses the estimated parameters in the LP solver (Appendix B) to get a policy to execute at an episode k. We show the results for credit lending environment in Figure 2 where we see that the proposed algorithm can reach good performance and sub-linear regret while maintaining the zero fairness violation property across different group distributions. While the MLE baseline incurs a very low regret, however it comes at a cost of large number of fairness violations. We provide all the additional environment and experimentation details in Appendix D where we show the properties of our method also hold true for the River Swim environment. Finally, note that Liu et al. (2021) do not provide any empirical evidence of their approaches for the unknown MDPs. 4 The infinite-horizon and high-dimensional Deep-RL setting Much of the recent interest in RL is in the Deep-RL setting where the state and/or action spaces are high dimensional and non-linear function-approximators are used for policy and value estimation. As the state and/or action spaces can be infinite, the LP-based approaches from the previous sections are not applicable in this setting anymore. Rather, most of the practical algorithms in this setting fall into the category of approximate policy iteration that are usually implemented in an actor-critic learning control architecture with policy gradient-based updates (Sutton and Barto, 2018). In order to be consistent with the other works in this space, we make the following modifications to our setting. We now consider the time-homogeneous infinite horizon setting, where γ [0, 1) denotes the discount factor. Let τ π denote a trajectory τ = (S1, A1, . . . , ) sampled from the MDP using the stationary policy π, i.e, S1 µ, At π( |St), St+1 P( |St, At). The infinite-horizon discounted return associated with a stationary policy π and reward function r is denoted by J(π; r) .= Eτ π[P t=1 γtr(St, At)]. The value and state-action value functions are defined as V π(s; r) .= Eτ π[P t=1 γtr(St, At)|S1 = s] and Qπ(s, a; r) .= Eτ π[P t=1 γtr(St, At)|S1 = s, A1 = a] respectively. The advantage function is defined Aπ(s, a; r) .= Qπ(s, a; r) V π(s; r). The discounted future state visitation distribution is defined by dπ(s) .= (1 γ) P t=1 γt Pr(St = s|π). When introducing fairness in this setting, we assume Z is a finite set (countable number of subgroups) but the non-sensitive attribute space S can be potentially infinite. Additionally, we use πz : S A to denote stationary policy associated with a subgroup z Z, i.e, πz = {π(a|z, s) : a A, s S}). The complete policy can be denoted by π = {π1, . . . , π|Z|}. The discounted return of subgroup z under πz is denoted by J(πz). Finally, similarly to the Section 2.1, the subgroup specific quantities can also be defined for the infinite-horizon setting, i.e., µz, dπ z , V π z , Qπ z , and Aπ z . 4.1 Trust-region based fair policy updates We base our approach on the trust-region based local policy gradient methods that focus on iteratively updating the policy such that it maximizes the return over a local neighbourhood of the current policy (Kakade, 2003; Peters et al., 2010; Pirotta et al., 2013; Schulman et al., 2015). Using the methodology of Constrained Policy Optimization (CPO, Achiam et al., 2017), we present a result that extends the trust-region based updates to our setting with fairness constraints. Proposition 4.1. Let π and π denote two arbitrary policies such that there exists only one subgroup for which the associated policies differ, i.e., =1i Z : πi = π i. Then for any j Z : j = i, the policy performance difference Jπ ,π i,j = J(π i; r) J(πj; r) associated with π i and πj is bounded as: Jπ ,π i,j Jπ,π i,j + 1 1 γ E s dπ i a πi " π i(a| s) πi(a| s) Aπ i ( s, a; r) + 2γξπ i (1 γ) DKL(π i||πi)[ s] Jπ ,π i,j Jπ,π i,j + 1 1 γ E s dπ i a πi " π i(a| s) πi(a| s) Aπ i ( s, a; r) 2γξπ i (1 γ) DKL(π i||πi)[ s] where ξπ i = max s | Ea π i[Aπ i ( s, a; r)]|, and DKL denotes the KL divergence. Published in Transactions on Machine Learning Research (04/2023) The proof is provided in Appendix E. This result allows to quantify the difference in returns of two different policies associated with a particular subgroup (π i, πi) without the need of sampling from the distribution dπ i . Therefore, these computable upper and lower bounds can be used to control the performance difference and enforce the fairness requirement. Let Πθ denote the class of parameterized policies and πk i denote the policy for subgroup i at some iteration k. The trust-region based update procedure for updating πk i takes the following form: πk+1 i = arg max πi Πθ: DKL(πi||πk i ) κ E s dπk i a πi [Aπk i ( s, a; l)] s.t. u Jπ,π i,j + u E s dπk i a πi " Aπk i ( s, a; r) ϵ, j = i; j Z, u { 1, 1} where DKL(πi||πk i ) = E s dπk i [DKL(πi||πk i )[ s]], κ is a hyper-parameter that controls the size of the trust-region update, and u is an indicator for incorporating both upper and lower bounds. 4.2 Practical algorithm Consider the scenario where each subgroup s policy is parameterized independently, for instance, each of the subgroup can have their separate neural networks. If the optimization problem in Equation (10) can be solved exactly, then we can use it as an inner loop of an algorithm that updates only one subgroup s policy at a time while ensuring that each update satisfies that fairness requirement. In practice, solving Equation (10) is usually quite challenging as it requires inverting the high-dimensional Fisher information matrix, which makes it computationally expensive and difficult to implement. It is possible to devise an approximate analytical solution based on Taylor approximations (Achiam et al., 2017; Yang et al., 2020), but that requires approximating the Hessian. Therefore, we focus on approaches that only rely on the first-order gradient information and make approximations around that. We consider two such approaches and describe them briefly below: Projection-based approach: This category of methods solves the optimization problem first in the non-parametric policy space and then projects the computed non-parametric solution policy back to the parameter space (Abdolmaleki et al., 2018; 2020; Zhang et al., 2020). We use the methodology from First-Order Constrained Optimization in Policy Space (FOCOPS, Zhang et al., 2020) and give the full details on how the FOCOPS approach can be used in our setting in Appendix F.1. Lagrangian-based approach: There also exists penalty-based approaches that convert the problem into a single objective via the Lagrangian method (Tessler et al., 2018; Chow et al., 2017; 2019), where the penalty coefficient is also updated at every iteration to enforce the constraints. Full details of the Lagrangian method can be found in Appendix F.2. While both of the above approaches are easy to implement, the FOCOPS based approach provides approximate bounds on worst-case constraint violation, whereas that remains an open question in the case of Lagrangian methods. The complete algorithms are presented in Appendix F. 4.3 Deep-RL experiments Environments: For the first set of experiments, we modify the Half-Cheetah-v3 environment from the Open AI gym (Brockman et al., 2016) to create two additional subgroups with different dynamics: one with 2 the feet size of the default Half-Cheetah-v3, and another with 10 friction than the default setting. We use the default reward function across all the three subgroups. We also design a navigation task based on Point environment (Duan et al., 2016), where we have two different subgroups corresponding to two different sizes of the Point agent (default and 5 ). The task is to navigate Published in Transactions on Machine Learning Research (04/2023) (a) 10 friction (b) 2 feet size (c) Default size (d) 5 agent size Figure 3: Different environments for the Deep RL experiments: Half-Cheetah variation with 10 friction but default size (Figure 3b) and the 2 feet size (Figure 3a), and Navigation environments with default size point agent (Figure 3c) and 5 size (Figure 3d). through the maze to reach the goal located in the top-left location. The agent receives a per-step penalty of 0.05, and reaching the goal leads to a reward of +1.0 and episode termination. Additionally, there are two different openings of different sizes in the maze. The smaller point agent can pass through the smaller opening on the left (Figure 3c) leading to episodes of smaller length (and higher return), whereas the larger agent needs to navigate through the opening on the right (Figure 3d) to reach the goal leading to a relatively longer episode length (and lesser return). Note that, for both the environments we do not make any distinction between the agent and decision-maker s reward, i.e., r = l. Baselines: We benchmark different variations of the Proximal Policy Optimization algorithm (PPO, Schulman et al., 2017). We follow the methodology described in Section 4.2 to get two fair versions of the PPO algorithms: FOC-PPO (projection based) and Lagrangian-PPO (Lagrangian based). We also include the traditional PPO algorithm as a baseline to get an estimate of the trade-off in performance due to the fairness requirement. Results: For both the experiment settings we compare the baselines across three different levels of fairness thresholds (ϵ {high, medium, low}). Due to space constraints, we present only selected results that highlight the representative behaviour among the different environments and ϵ configurations. We report the results for all configurations in Appendix G. For each iteration of the algorithm, the same amount of training data is used for every subgroup s policy update. Note that PPO baseline is agnostic to the fairness threshold and therefore its behavior does not vary across changing ϵ. A high ϵ value denotes the scenario where the acceptable fairness gap is so large that it can never be violated during learning (Figure 4a). We observe that both the algorithms are able to perform competitively with PPO, with the latter performing slightly better than the rest. Both the medium and low ϵ values denote the scenario where the algorithms need to trade-off the performance for fairness satisfactions. In Figure 4b, we see that the traditional PPO baseline ends up violating the fairness constraint whereas both Lagrangian-PPO and FOC-PPO are able to satisfy the fairness constraints throughout learning. For the navigation environment, the initial random policy is exploratory enough to reach the goal faster for the smaller sized agents compared to the larger sized agents. As a result, our assumption about having an initial fair policy does not hold anymore in this task, and we observe a high fairness gap during the initial part of the learning for all the baselines (Figure 4c). In spite of that, as the training progresses Lagrangian-PPO and FOC-PPO are able to reduce this fairness gap to the specified acceptable threshold and maintain it over the course of training while reaching the goal state. Another interesting phenomena that we observe is the fair algorithms delay the learning for the smaller agents subgroup until the larger groups have learned to reach the goal consistently (the dip in learning curve of smaller subgroup in the rightmost subplot in Figure 4c). Across the different experimental settings, we observe that both FOC-PPO and Lagrangian-PPO are able to satisfy the fairness guarantees, with the performance of FOC-PPO only marginally better than Lagrangian PPO in terms of cumulative returns. More details about the experiments, including environment and architecture details, hyper-parameter selection procedure, generalization analysis and results with a minimal implementation of the PPO baseline, can be found in Appendix G. Published in Transactions on Machine Learning Research (04/2023) 0.00 0.25 0.50 0.75 1.00 1.25 1.50 1.75 2.00 Time-steps Difference in returns Fairness gap, 0.00 0.25 0.50 0.75 1.00 1.25 1.50 1.75 2.00 Time-steps Cumulative return All subgroups FOC-PPO Lagrangian-PPO PPO 0.00 0.25 0.50 0.75 1.00 1.25 1.50 1.75 2.00 Time-steps Epsiodic return Big-feet subgroup 0.00 0.25 0.50 0.75 1.00 1.25 1.50 1.75 2.00 Time-steps Episodic return High-friction subgroup 0.00 0.25 0.50 0.75 1.00 1.25 1.50 1.75 2.00 Time-steps Episodic return Default subgroup (a) Half-Cheetah variations with high fairness threshold (ϵ = 10, 000). 0.00 0.25 0.50 0.75 1.00 1.25 1.50 1.75 2.00 Time-steps Difference in returns Fairness gap, 0.00 0.25 0.50 0.75 1.00 1.25 1.50 1.75 2.00 Time-steps Cumulative return All subgroups FOC-PPO Lagrangian-PPO PPO 0.00 0.25 0.50 0.75 1.00 1.25 1.50 1.75 2.00 Time-steps Epsiodic return Big-feet subgroup 0.00 0.25 0.50 0.75 1.00 1.25 1.50 1.75 2.00 Time-steps Episodic return High-friction subgroup 0.00 0.25 0.50 0.75 1.00 1.25 1.50 1.75 2.00 Time-steps Episodic return Default subgroup (b) Half-Cheetah variations with medium fairness threshold (ϵ = 1000). 0.0 0.5 1.0 1.5 2.0 2.5 3.0 3.5 4.0 Time-steps 1e6 Difference in returns Fairness gap, 0.0 0.5 1.0 1.5 2.0 2.5 3.0 3.5 4.0 Time-steps 1e6 Cumulative return All subgroups FOC-PPO Lagrangian-PPO PPO 0.0 0.5 1.0 1.5 2.0 2.5 3.0 3.5 4.0 Time-steps 1e6 Epsiodic return Big size subgroup 0.0 0.5 1.0 1.5 2.0 2.5 3.0 3.5 4.0 Time-steps 1e6 Episodic return Default size subgroup (c) Navigation task with low fairness threshold (ϵ = 0.5). Figure 4: Learning curves for different environments with different fairness thresholds. The first subplot in each row denotes the fairness gap (maximum of absolute difference of returns between subgroups) and the black dotted horizontal line denotes the specified acceptable fairness threshold (ϵ). The second subplot in each row denotes the cumulative return for all subgroups, and the rest of the subplots in the row denote the subgroup specific returns. The x-axis denote the number of samples used during the learning. The solid colored lines represent the smoothed mean over 10 random seeds for different baselines (with weight=0.9) and the colored shaded regions represent the normal 95% confidence interval. Plots for running mean over the last 100 episodes are included in Appendix G. 5 Discussion In this work, we pose and study the problem of satisfying group fairness requirement in the online RL setting where unfair decisions should be avoided throughout learning. Our main contribution is to show we can leverage methods from the constrained MDPs literature to satisfy this problem leading to new algorithmic solutions to this open problem. We also provide empirical evidence in support of our proposed algorithms on a variety of synthetic domains: discrete classic control tasks, continuous control robotic locomotion and navigation tasks, with extensive complementary analysis in the appendices. We based most of our empirical studies on traditional RL domains, which are convenient to validate the properties of our algorithms, but where fairness is not an inherent issue. The next step is to apply our techniques in real-world domains where fairness is the primary concern. As mentioned in Section 1.1, a limitation of our work is that we enforce the fairness constraints only via restricting the policy space. This is also reflected in the navigation experiments in Section 4.3, where we observe that our methods end up penalizing the higher-performing subgroups to match the performance of the lesser-performing subgroups. An interesting future line of work can be to explore ways to increase the Published in Transactions on Machine Learning Research (04/2023) performance of the lower-performing possibly by modifying the environment itself. Consider the scenario where the system designer has some partial control over a part of the environment dynamics and can decide to modify them by paying some cost. In this setting, the task becomes to find a configuration of the environment along with a policy such that the resulting system is fair and allows various subgroups to achieve similar levels of high performance. However, the problem described above is significantly harder than the problem we originally considered in Section 2. As the agent decisions can change the environment dynamics itself, all the past experiences of the learning algorithm can become obsolete unless we introduce even stronger assumptions regarding either the environment or initial fair exploration policies. If the system designer can afford to relax the requirement of not violating any fairness constraint during the learning procedure, then we should be able to leverage the recent advances in configuring environments via RL to take a step toward this problem of inclusive environment design. For instance, the framework of Configurable MDPs (Metelli* et al., 2018; Metelli et al., 2019) provides tools for finding the environment and policy configuration that achieves maximum performance in absence of any other constraints. We conjecture that their methodology can be used in conjunction with our work but we leave that for future work. We acknowledge that Assumption 2.3 is indeed strong. However, as we mentioned in Section 2, this is necessary for RL algorithms that do not violate the fairness constraints during the entire learning procedure. A promising approach towards the practical relaxation of this assumption can be to consider the setting where instead of having access to an explicit fair policy, the learning algorithm has access to an offline dataset collected under some fair policy. The recent advances in policy fine-tuning and hybrid-RL methods (Xie et al., 2021; Wagenmaker and Pacchiano, 2022; Song et al., 2023) indicate that we might be able to extract an approximately fair policy from the offline dataset. We can then use the extracted policy to initialize the algorithm in Section 3, but further studies need to be conducted regarding the impact of such an approach on regret and fairness guarantees. We also note that our results regarding the regret bounds in Section 3 can be improved further by using Bernstein s concentration inequalities-based analysis (Maurer and Pontil, 2009). This is a known result in traditional MDPs (Azar et al., 2017) and has also been applied in the CMDP setting (Efroni et al., 2020; Bura et al., 2022). We leave this for future work, as our primary focus is not to improve the results in the exploration of CMDPs but rather to show how these results can be extended to a different setting where they are traditionally not employed. Our work does not impose any structure on the environment or the subgroup dynamics, and as such, it is difficult to do something different than treating them independently. Considering structured problems is another interesting avenue for future research as well as identifying common parts of the dynamics (if possible). A more interesting setting would be one in which some information-sharing between groups is possible, e.g. while there might be some differences in the dynamics for each group, there is some similarity, so information on the performance of one group could be used to learn for another group. Finally, our analysis is limited to discrete subgroups, which corresponds to precise fairness criteria. Future work should investigate broader notions of fairness, and extensions to the multi-agent and multi-objective settings where subgroups can have conflicting interests or shared global resource constraints. Broader Impact Statement Our algorithms can help in reducing a particular notion of bias (group based) during the learning for the RL based systems (Sections 1 and 2). We also provide the conditions under which the proposed algorithms can attain these theoretical guarantees (Section 3). As mentioned in Sections 1.1 and 5, our notion of fairness is enforced by penalizing the members of the highest performing subgroup to match the performance of the lowest performing subgroups (within some margin). Therefore we caution against deploying our algorithms in settings where such a notion of fairness might not be suitable. Published in Transactions on Machine Learning Research (04/2023) Acknowledgments The authors would like to thank NSERC (Natural Sciences and Engineering Research Council), IVADO (Institut de valorisation des données) and CIFAR (Canadian Institute for Advanced Research) for funding to Mc Gill in support of this research. The computational component of this research was enabled in part by support provided by Calcul Québec (www.calculquebec.ca), Compute Canada (www.computecanada.ca) and Mila s IDT team. Published in Transactions on Machine Learning Research (04/2023) Abdolmaleki, A., Huang, S., Hasenclever, L., Neunert, M., Song, F., Zambelli, M., Martins, M., Heess, N., Hadsell, R., and Riedmiller, M. (2020). A distributional view on multi-objective policy optimization. In International Conference on Machine Learning, pages 11 22. PMLR. Abdolmaleki, A., Springenberg, J. T., Tassa, Y., Munos, R., Heess, N., and Riedmiller, M. (2018). Maximum a posteriori policy optimisation. ar Xiv preprint ar Xiv:1806.06920. Achiam, J., Held, D., Tamar, A., and Abbeel, P. (2017). Constrained policy optimization. ar Xiv preprint ar Xiv:1705.10528. Altman, E. (1999). Constrained Markov decision processes, volume 7. CRC Press. Amani, S., Alizadeh, M., and Thrampoulidis, C. (2019). Linear stochastic bandits under safety constraints. ar Xiv preprint ar Xiv:1908.05814. Andrychowicz, M., Raichuk, A., Stańczyk, P., Orsini, M., Girgin, S., Marinier, R., Hussenot, L., Geist, M., Pietquin, O., Michalski, M., et al. (2020). What matters in on-policy reinforcement learning? a large-scale empirical study. ar Xiv preprint ar Xiv:2006.05990. Auer, P., Jaksch, T., and Ortner, R. (2008). Near-optimal regret bounds for reinforcement learning. Advances in neural information processing systems, 21. Awasthi, P., Cortes, C., Mansour, Y., and Mohri, M. (2020). Beyond individual and group fairness. ar Xiv preprint ar Xiv:2008.09490. Azar, M. G., Osband, I., and Munos, R. (2017). Minimax regret bounds for reinforcement learning. In International Conference on Machine Learning, pages 263 272. PMLR. Bellman, R. (1957). A markovian decision process. Journal of mathematics and mechanics, pages 679 684. Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., and Zaremba, W. (2016). Openai gym. ar Xiv preprint ar Xiv:1606.01540. Bura, A., Hasanzadezonuzy, A., Kalathil, D., Shakkottai, S., and Chamberland, J.-F. (2022). Dope: Doubly optimistic and pessimistic exploration for safe reinforcement learning. In Advances in Neural Information Processing Systems. Castelnovo, A., Malandri, L., Mercorio, F., Mezzanzanica, M., and Cosentini, A. (2021). Towards fairness through time. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 647 663. Springer. Celis, E., Mehrotra, A., and Vishnoi, N. (2019). Toward controlling discrimination in online ad auctions. In International Conference on Machine Learning, pages 4456 4465. PMLR. Chouldechova, A. (2017). Fair prediction with disparate impact: A study of bias in recidivism prediction instruments. Big data, 5(2):153 163. Chow, Y., Ghavamzadeh, M., Janson, L., and Pavone, M. (2017). Risk-constrained reinforcement learning with percentile risk criteria. The Journal of Machine Learning Research, 18(1):6070 6120. Chow, Y., Nachum, O., Faust, A., Duenez-Guzman, E., and Ghavamzadeh, M. (2019). Lyapunov-based safe policy optimization for continuous control. ar Xiv preprint ar Xiv:1901.10031. Curi, S., Bogunovic, I., and Krause, A. (2021). Combining pessimism with optimism for robust and efficient model-based deep reinforcement learning. ar Xiv preprint ar Xiv:2103.10369. D Amour, A., Srinivasan, H., Atwood, J., Baljekar, P., Sculley, D., and Halpern, Y. (2020). Fairness is not static: deeper understanding of long term fairness via simulation studies. In Proceedings of the 2020 Conference on Fairness, Accountability, and Transparency, pages 525 534. Published in Transactions on Machine Learning Research (04/2023) Dann, C., Lattimore, T., and Brunskill, E. (2017). Unifying pac and regret: Uniform pac bounds for episodic reinforcement learning. ar Xiv preprint ar Xiv:1703.07710. Dastin, J. (2018). Amazon scraps secret ai recruiting tool that showed bias against women. https: // www. reuters. com/ article/ us-amazon-com-jobs-automation-insight/ \amazon-scraps-secret-ai-recruiting-% 20tool-that-showed-bias\ -against-women-id USKCN1MK08G . Datta, A., Tschantz, M. C., and Datta, A. (2015). Automated experiments on ad privacy settings: A tale of opacity, choice, and discrimination. Proceedings on privacy enhancing technologies, 2015(1):92 112. Diamond, S. and Boyd, S. (2016). CVXPY: A Python-embedded modeling language for convex optimization. Journal of Machine Learning Research, 17(83):1 5. Doroudi, S., Thomas, P. S., and Brunskill, E. (2017). Importance sampling for fair policy selection. Grantee Submission. Duan, Y., Chen, X., Houthooft, R., Schulman, J., and Abbeel, P. (2016). Benchmarking deep reinforcement learning for continuous control. In International conference on machine learning, pages 1329 1338. PMLR. Dwork, C., Hardt, M., Pitassi, T., Reingold, O., and Zemel, R. (2012). Fairness through awareness. In Proceedings of the 3rd innovations in theoretical computer science conference, pages 214 226. Efroni, Y., Mannor, S., and Pirotta, M. (2020). Exploration-exploitation in constrained mdps. ar Xiv preprint ar Xiv:2003.02189. Emelianov, V., Gast, N., Gummadi, K. P., and Loiseau, P. (2020). On fair selection in the presence of implicit variance. In Proceedings of the 21st ACM Conference on Economics and Computation, pages 649 675. Engstrom, L., Ilyas, A., Santurkar, S., Tsipras, D., Janoos, F., Rudolph, L., and Madry, A. (2020). Implementation matters in deep policy gradients: A case study on ppo and trpo. ar Xiv preprint ar Xiv:2005.12729. Feldman, M., Friedler, S. A., Moeller, J., Scheidegger, C., and Venkatasubramanian, S. (2015). Certifying and removing disparate impact. In proceedings of the 21th ACM SIGKDD international conference on knowledge discovery and data mining, pages 259 268. Fu, Z., Xian, Y., Gao, R., Zhao, J., Huang, Q., Ge, Y., Xu, S., Geng, S., Shah, C., Zhang, Y., et al. (2020). Fairness-aware explainable recommendation over knowledge graphs. In Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 69 78. Fuster, A., Goldsmith-Pinkham, P., Ramadorai, T., and Walther, A. (2022). Predictably unequal? the effects of machine learning on credit markets. The Journal of Finance, 77(1):5 47. Gajane, P., Saxena, A., Tavakol, M., Fletcher, G., and Pechenizkiy, M. (2022). Survey on fair reinforcement learning: Theory and practice. ar Xiv preprint ar Xiv:2205.10032. Garg, N., Li, H., and Monachou, F. (2020). Dropping standardized testing for admissions trades off information and access. ar Xiv preprint ar Xiv:2010.04396. Hardt, M., Price, E., and Srebro, N. (2016). Equality of opportunity in supervised learning. ar Xiv preprint ar Xiv:1610.02413. Huang, S., Dossa, R. F. J., Ye, C., and Braga, J. (2021). Cleanrl: High-quality single-file implementations of deep reinforcement learning algorithms. Immorlica, N., Ligett, K., and Ziani, J. (2019). Access to population-level signaling as a source of inequality. In Proceedings of the Conference on Fairness, Accountability, and Transparency, pages 249 258. Jabbari, S., Joseph, M., Kearns, M., Morgenstern, J., and Roth, A. (2017). Fairness in reinforcement learning. In International Conference on Machine Learning, pages 1617 1626. PMLR. Published in Transactions on Machine Learning Research (04/2023) Kakade, S. M. (2003). On the sample complexity of reinforcement learning. University of London, University College London (United Kingdom). Kanagawa, Y. and Kaneko, T. (2020). Diverse exploration via infomax options. ar Xiv preprint ar Xiv:2010.02756. Liu, L. T., Dean, S., Rolf, E., Simchowitz, M., and Hardt, M. (2018). Delayed impact of fair machine learning. In International Conference on Machine Learning, pages 3150 3158. PMLR. Liu, T., Zhou, R., Kalathil, D., Kumar, P., and Tian, C. (2021). Learning policies with zero or bounded constraint violation for constrained mdps. ar Xiv preprint ar Xiv:2106.02684. Mandal, D. and Gan, J. (2022). Socially fair reinforcement learning. ar Xiv preprint ar Xiv:2208.12584. Maurer, A. and Pontil, M. (2009). Empirical bernstein bounds and sample variance penalization. ar Xiv preprint ar Xiv:0907.3740. Mehrabi, N., Morstatter, F., Saxena, N., Lerman, K., and Galstyan, A. (2019). A survey on bias and fairness in machine learning. ar Xiv preprint ar Xiv:1908.09635. Metelli, A. M., Ghelfi, E., and Restelli, M. (2019). Reinforcement learning in configurable continuous environments. In Proceedings of the 36th International Conference on Machine Learning (ICML), volume 97, pages 4546 4555. PMLR. Metelli*, A. M., Mutti*, M., and Restelli, M. (2018). Configurable markov decision processes. In Proceedings of the 35th International Conference on Machine Learning (ICML), volume 80, pages 3488 3497. Miller, C. C. (2015). Can an algorithm hire better than a human. The New York Times, 25. Nabi, R., Malinsky, D., and Shpitser, I. (2019). Learning optimal fair policies. In International Conference on Machine Learning, pages 4674 4682. PMLR. Pacchiano, A., Ghavamzadeh, M., Bartlett, P., and Jiang, H. (2021). Stochastic bandits with linear constraints. In International Conference on Artificial Intelligence and Statistics, pages 2827 2835. PMLR. Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., Antiga, L., et al. (2019). Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems, 32. Pessach, D. and Shmueli, E. (2020). Algorithmic fairness. ar Xiv preprint ar Xiv:2001.09784. Peters, J., Mulling, K., and Altun, Y. (2010). Relative entropy policy search. In Twenty-Fourth AAAI Conference on Artificial Intelligence. Pirotta, M., Restelli, M., Pecorino, A., and Calandriello, D. (2013). Safe policy iteration. In International Conference on Machine Learning, pages 307 315. PMLR. Pleiss, G., Raghavan, M., Wu, F., Kleinberg, J., and Weinberger, K. Q. (2017). On fairness and calibration. Advances in neural information processing systems, 30. Puranik, B., Madhow, U., and Pedarsani, R. (2022). Dynamic positive reinforcement for long-term fairness. In ICML 2022 Workshop on Responsible Decision Making in Dynamic Environments. Puterman, M. L. (2014). Markov decision processes: discrete stochastic dynamic programming. John Wiley & Sons. Schulman, J., Levine, S., Abbeel, P., Jordan, M., and Moritz, P. (2015). Trust region policy optimization. In International conference on machine learning, pages 1889 1897. PMLR. Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and Klimov, O. (2017). Proximal policy optimization algorithms. ar Xiv preprint ar Xiv:1707.06347. Published in Transactions on Machine Learning Research (04/2023) Siddique, U., Weng, P., and Zimmer, M. (2020). Learning fair policies in multi-objective (deep) reinforcement learning with average and discounted rewards. In International Conference on Machine Learning, pages 8905 8915. PMLR. Song, Y., Zhou, Y., Sekhari, A., Bagnell, D., Krishnamurthy, A., and Sun, W. (2023). Hybrid RL: Using both offline and online data can make RL efficient. In The Eleventh International Conference on Learning Representations. Strehl, A. L. and Littman, M. L. (2008). An analysis of model-based interval estimation for markov decision processes. Journal of Computer and System Sciences, 74(8):1309 1331. Sutton, R. S. (1988). Learning to predict by the methods of temporal differences. Machine learning, 3(1):9 44. Sutton, R. S. and Barto, A. G. (2018). Reinforcement learning: An introduction. MIT press. Tessler, C., Mankowitz, D. J., and Mannor, S. (2018). Reward constrained policy optimization. ar Xiv preprint ar Xiv:1805.11074. The European Commission (2021). A european approach to artificial intelligence. https: // digital-strategy. ec. europa. eu/ en/ policies/ european-approach-artificial-intelligence . Todorov, E., Erez, T., and Tassa, Y. (2012). Mujoco: A physics engine for model-based control. In 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 5026 5033. IEEE. Wachi, A. and Sui, Y. (2020). Safe reinforcement learning in constrained markov decision processes. In International Conference on Machine Learning, pages 9797 9806. PMLR. Wagenmaker, A. and Pacchiano, A. (2022). Leveraging offline data in online reinforcement learning. ar Xiv preprint ar Xiv:2211.04974. Wen, M., Bastani, O., and Topcu, U. (2021). Algorithms for fairness in sequential decision making. In Banerjee, A. and Fukumizu, K., editors, Proceedings of The 24th International Conference on Artificial Intelligence and Statistics, volume 130 of Proceedings of Machine Learning Research, pages 1144 1152. PMLR. Xie, T., Jiang, N., Wang, H., Xiong, C., and Bai, Y. (2021). Policy finetuning: Bridging sample-efficient offline and online reinforcement learning. Advances in neural information processing systems, 34:27395 27407. Yang, T.-Y., Rosca, J., Narasimhan, K., and Ramadge, P. J. (2020). Projection-based constrained policy optimization. ar Xiv preprint ar Xiv:2010.03152. Zafar, M. B., Valera, I., Rogriguez, M. G., and Gummadi, K. P. (2017). Fairness constraints: Mechanisms for fair classification. In Artificial Intelligence and Statistics, pages 962 970. PMLR. Zhang, Y., Vuong, Q., and Ross, K. W. (2020). First order constrained optimization in policy space. ar Xiv preprint ar Xiv:2002.06506. Zimin, A. and Neu, G. (2013). Online learning in episodic markovian decision processes by relative entropy policy search. In Neural Information Processing Systems 26. Published in Transactions on Machine Learning Research (04/2023) Table 1: Notation General MDPs: M denotes the entire MDP S state-space A action-space µ initial starting state distribution r reward function (can be non-stationary rh(s, a) or stationary r(s, a)) H horizon length for finite-horizon episodic MDPs P transition model (can be non-stationary Ph(s |s, a) or stationary P(s |s, a)) π policies (can be stationary π(a|s) or non-stationary πh(a|s)) Sh state observed at time step h Ah action taken at time step h Rh reward observed at time step h V π h (s; r, P) value function under policy π starting at state s at time step h for MDP with reward function r and transition P Jπ(r, P) return of a policy π for MDP with reward function r and transition P dπ h(s, a; µ, P) occupancy measure of policy π for state s and action a at time step h when starting with initial distribution µ and transition P Introducing fairness: Z set of sensitive attributes that define subgroups S non-sensitive state space St non-sensitive state observed at time step t µz subgroup specific initial non-sensitive starting state distribution Pz subgroup specific transition function corresponding to the non-sensitive attributes Pz,h( s| s, a) Jπ z (r, P) return for a subgroup z under policy π with reward function r and transition Pz (starting from µz) πz policy corresponding to a subgroup z l decision maker s reward function (can be non-stationary lh(s, a) or stationary l(s, a)) ϵ specified fairness threshold π0 initial feasible and fair policy ϵ0 fairness margin corresponding to π0 Algorithm for the unknown model and reward setting: ζ noise in the observed reward δ confidence parameter ˆrk h empirical estimate of the reward function for time step h at episode k ˆP k h empirical estimate of the transition function for time step h at episode k βk h uncertainty estimate for time step h at episode k rk h, rk h optimistically and pessimistically scaled reward estimates Πk high confidence set consisting of fair policies rk h optimistically scaled reward estimate αr scaling factor associated with r η exploration margin, η = (ϵ ϵ0) The infinite-horizon and high-dimensional Deep-RL setting : Published in Transactions on Machine Learning Research (04/2023) γ discount factor associated with MDP for the infinite-horizon setting τ sampled trajectory J(π) infinite-horizon return under a stationary policy π J(πz) infinite-horizon return associated with subgroup z under policy πz Aπ z ( s, a) advantage function under the policy πz for the subgroup z Jπ ,π i,j J(π i) J(πj) κ trust-region parameter DKL KL divergence ξi worst case error for subgroup i in Proposition 4.1 Πθ parameterized policy space B Linear Programming solution for finite-horizon episodic MDPs B.1 Linear Programming based solver Note that any policy π : Z S [H] A can be decomposed into a set of |Z| sub-policies. A policy associated with a particular subgroup z Z is denoted with πz, where πz : S [H] A, with πz,h(a| s) being the probability of taking action a in state (z, s) at time step h [H]. Thus, the complete policy at an episode k [K] is denoted by the πk = {πk 1, . . . , πk |Z|}. The occupancy measure associated with a particular subgroup z Z and its corresponding policy πz is defined as: dπz h ( s, a; µz, Pz) .= E[1{ Sh = s, Ah = a | S1 µz, Pz, πz}] (11) = Pr{ Sh = s, Ah = a | S1 µz, Pz, πz}. Similarly, the return associated with a particular subgroup w.r.t. a reward function g : [H] Z S A R can be written as: Jπ z (g, P) = X h, s,a dπz h ( s, a; µz, Pz)gh(z, s, a). (12) The subgroup specific occupancy measure should satisfy the properties of an occupancy measure (Zimin and Neu, 2013; Efroni et al., 2020), i.e., X a dπz h ( s, a) = X s ,a Ph 1( s|z, s , a )dπz h 1( s , a ) s S, (13) dπz h ( s, a) 0 s, a S A, for all h [H] \ 1. For h = 1 and corresponding initial distribution µz, we have X a dπz 1 ( s, a) = µz( s) s. (14) For any policy π and z, s, a, h Z S A [H], we define the combined occupancy measure over the entire orginal state-space S as: dπ h((z, s), a) = Pr(z) dπz h ( s, a) (15) We show that the final occupancy measure returned by the above formulation satisfies the properties of a valid occupancy measure in Lemma B.1. This allows us to reformulate the problem in Equation (2) to a Linear Program where the optimization variables are measures. The optimal solutions to the LP define the optimal Markov policy through occupancy measure where, πz,h(a| s) = dπz h ( s, a) P a dπz h ( s, a ), (z, s, a, h) Z S A [H]. (16) Published in Transactions on Machine Learning Research (04/2023) The final LP program takes the following form: a dπ h((z, s), a)lh((z, s), a) (17) h, s,a dπi h ( s, a)rh(i, s, a) | {z } = Jπ i (r,P ) h, s,a dπj h ( s, a)rh(j, s, a) | {z } = Jπ j (r,P ) ϵ i j; (i, j) Z2. a dπz h ( s, a) = X s ,a Ph 1( s|z, s , a )dπz h 1( s , a ) z, s, h Z S [H] \ 1 a dπz 1 ( s, a) = µz( s) z, s Z S dπz h ( s, a) 0 z, s, a, h Z S A [H] Finally, in Proposition B.1 (below) we show that the above LP is able to find the solution of the problem in Equation (2). B.2 Supporting results for LP formulation Lemma B.1. For any policy π, the occupancy measure dπ in Equation (15) is a valid occupancy measure and satisfies the following properties of an occupancy measure: X s,a dπ h(s, a) = 1 h [H] (18) a dπ h(s, a) = X s ,a Ph 1(s|s , a )dπ h 1(s , a ) s S (19) dπ h(s, a) 0 s, a, h S A [H]. (20) Proof. Recall that, for any policy π and z, s, a, h Z S A [H], the combined occupancy measure over S is defined as (Equation (15)): dπ h((z, s), a) = Pr(z) dπz h ( s, a) (21) Part 1: From Equation (13) we know that for any z Z, dπz h ( s, a) 0 s, a, h S A [H], As, Pr(z) 0, z Z, Pr(z)dπz h ( s, a) 0 z, s, a Z S A [H], dπ h((z, s), a) 0 z, s, a (Z S) A [H], dπ h(s, a) 0 s, a S A [H], which implies that the property in Equation (20) is valid. Part 2: From the construction in Equation (13) we know that for any z Z X a dπz h ( s, a) = X s ,a Ph 1( s|z, s , a )dπz h 1( s , a ) s, h The above equation can be rewritten as, X a dπz h ( s, a) = X z , s ,a Ph 1( s|z, s , a )1[z = z ]dπz h 1( s , a ) z, s, h Published in Transactions on Machine Learning Research (04/2023) Multiplying both sides by Pr(z), we get: X a Pr(z)dπz h ( s, a) = X z , s ,a Ph 1( s|z, s , a )1[z = z ] Pr(z)dπz h 1( s , a ) z, s, h Replacing Pr(z)dπz h ( s, a) by dπ h(z, s, a) from Equation (15), X a dπ h((z, s), a) = X a (Ph 1( s)|z , s , a )1[z = z ])dπ h 1(z, s , a ) z, s, h As we are only considering case where z = z in RHS (due to 1[z = z ]), we can replace dπ h 1(z, s , a ) with dπ h 1(z , s , a ): a dπ h((z, s), a) = X a (Ph 1( s|z , s , a )1[z = z ])dπ h 1(z , s , a ) z, s, h From the definition of transition function in Assumption 2.1 and substituting (z , s ) with s , we get: X a dπ h((z, s), a) = X a Ph 1(z, s|z , s , a )dπ h 1(z , s , a ) z, s, h a dπ h(s, a) = X a Ph 1(s|s , a )dπ h 1(s , a ) s, h Hence, the property in Equation (19) is also valid. Part 3: We show the property in Equation (18) is true using induction. First for the base case fix h = 1, and we need to show P s,a dπ 1(s, a) = 1. a dπ 1(s, a) = X a dπ 1(s, a) a Pr(z)dπz 1 ( s, a) (by definition of dπ eq. (21)) a dπz 1 ( s, a) By construction (Equation (14)), we have P a dπz 1 ( s, a) = µz( s), s, therefore: z Pr(z) µz( s) From Definition 2.1, we have µ(z, s) = Pr(z) µz( s) or P z Z µ(z, s) = P z Z Pr(z) µz( s). Therefore, we get: a dπ 1(s, a) = X = 1 (By definition of µ) We have shown that P s,a dπ 1(s, a) = 1. Note that we have already proved that Equation (19) is true. For h = 2, s S, we have: X a dπ 2(s, a) = X s ,a P1(s|s , a )dπ 1(s , a ) Published in Transactions on Machine Learning Research (04/2023) By summing over s S, we get: s,a dπ 2(s, a) = X s ,a P1(s|s , a )dπ 1(s , a ) s ,a dπ 1(s , a ) Similarly, by summing over the first constraint in Equation (19) over s for different values of h iteratively we can show that P s,a dπ h(s, a) = 1, h [H].Therefore, dπ is a valid probability measure. Lemma B.2. For any policy π and any arbitrary reward function g : [H] S A R, we have the following relation between the cumulative return and subgroup specific returns: Jπ(g, P) = X z Pr(z)Jπ z (g, P). Jπ(g, P) = X h,z, s,a dπ h(z, s, a)gh(z, s, a) z, s dπ h((z, s), a)gh((z, s), a) z, s Pr(z)dπz h ( s, a)gh((z, s), a) (From Equation (15)) h,a, s dπz h ( s, a)gh((z, s), a) z Pr(z)Jπ z (g, P). (From Equation (12)) Proposition B.1. The LP in Equation (17) returns the solution π to Equation (2). Proof. Let π denote the optimal solution to Equation (2), and let π denote the solution returned by the LP. Note that the problem is feasible due to Assumption 2.3. As π is the solution to Equation (2), it must satisfy the fairness criteria and have a valid occupancy measure. From Lemma B.1 and the fairness constraint in the construction of LP (the first constraint in Equation (17)), all the valid occupancy measures that satisfy fairness constraint are feasible points in the LP. As occupancy measure corresponding to π is a feasible point of the LP, we can denote its return as Jπ (l, P). If π is the solution returned by the LP, that implies that π satisfies the fairness criteria and also has the maximum return over all the feasible points, i.e., Jπ (l, P) Jπ (l, P). However, π is the solution to Equation (2), therefore π must also be a solution to Equation (2). Published in Transactions on Machine Learning Research (04/2023) B.3 Extension to other Group fairness definitions For all the valid π H A, the Linear Program for the Demographic parity based fairness constraints can be written in the following form: a dπ h((z, s), a)lh((z, s), a) (22) h, s,a dπi h ( s, a)rh(i, s, a) | {z } = Jπ i (r,P ) h, s,a dπj h ( s, a)rh(j, s, a) | {z } = Jπ j (r,P ) ϵ i j; (i, j) Z2. Equality of Opportunity: Informally, this fairness constraint is defined as having equal rates or measure (such as true positive rate) for the qualified sub-populations for all subgroups (Hardt et al., 2016). In our setting, this implies that the subgroup space (Z) can be further decomposed into qualified (Z1) and unqualified (Z0) groups, i.e., Z = Z1 Z0. For all the valid π H A, the Linear Program for this definition takes the following form: a dπ h((z, s), a)lh((z, s), a) (23) s.t. Jπ i (r, P) Jπ j (r, P) ϵ i j; (i, j) Z2 Note that the only difference of the above equation with with Equation (22) is that the constraints are defined only for the pairs belonging in the qualified subgroups (Z1) instead of pver all possible subgroups (Z). Equalized Odds: This fairness constraint is similar to the Equality of Opportunity case, with the difference that it requires equal rates or measure for both the qualified and unqualified population for all subgroups (Hardt et al., 2016). Assuming, Z = Z1 Z0, the Linear Program takes the following form π H A: a dπ h((z, s), a)lh((z, s), a) (24) s.t. Jπ i (r, P) Jπ j (r, P) ϵ i j; (i, j) Z2 1. Jπ i (r, P) Jπ j (r, P) ϵ i j; (i, j) Z2 The main difference between the three formulations Demographic parity (used in our main paper), Equality of Opportunity and Equalized Odds is the number of subgroup pairs on which the constraint is being defined. In Demographic parity, all the possible combinations are being considered, whereas in the other two definitions only a proportion of the possible pairs is included in the constraints. Thus, our approach can be transferred directly to other settings by appropriately removing the unnecessary constraints for each definition. C Proofs for episodic case C.1 Why optimism alone might not be enough The core idea in optimism-based approaches is to select the MDP model that yields the best return among the set of all possible MDP models. Let Pk, Rk and Lk denote the confidence sets based on empirical estimates that contain the true transition and reward models with high confidence. Thus, by using only the optimism in the face of uncertainty approach, the policy to execute at episode k can be found by solving the following problem: πk, rk, lk, P k = arg max π,r Rk,l Lk,P Pk Jπ(l , P ) s.t. |Jπ i (r , P ) Jπ j (r , P )| ϵ, i j; i, j Z2. Published in Transactions on Machine Learning Research (04/2023) However, the solution policy from the above optimization problem only guarantees the constraint satisfaction w.r.t. to the best optimistic MDP model and not the true MDP. To expand, we can get guarantees on |Jπk i (rk, P k) Jπk j (rk, P k)|, whereas we are interested in the guarantees on |Jπk i (r, P) Jπk j (r, P)|. C.2 High probability good event We follow the notation and proof technique from Liu et al. (2021) for this section. The goal is to have high confidence guarantees based on some probability parameter 0 < δ < 1 given by the user. In order to do that, we first define a high probability event E that is required for analyzing the performance guarantees of the algorithm. Let {Fk}k 0 denotes the filtration with Fk = σ ( Sk h , Zk h , Ak h , Rk h , Lk h )h [H],k [k] k [K] and F0 being the trivial sigma algebra. The set of deployed policies {πk}k [K] is a predictable process w.r.t. filtration {Fk}k 0. Recall that N k h(z, s, a) is the number of times the state-action tuple (z, s, a) was observed at time step h in the episodes [1, . . . , k 1]. Thus, N k h(z, s, a) Fk 1. In our setting, the algorithm has access to µz and additionally we sample z Z uniformly for each episode, or Pr(z) = 1/|Z|, z. Therefore, the joint µ is also accessible as µ(z, s) = Pr(z) µz( s) z, s Z S (Definition 2.1). We can therefore define the expectation operator Eµ,P,π[ ] w.r.t. a stochastic trajectory (Sh, Ah)h [H] according to Markov chain induced by (µ, P, π). Next we formally state the noise assumption on the reward function, Assumption C.1 (Sub-Gaussian-noise). We assume the reward noise variables are zero mean 1/2-sub Gaussian, i.e., E[ζk h|Fk 1] = 0, E[exp(λζk h)|Fk 1] exp(λ2/4), λ R, where Fk denotes the σ-algebra generated by random variables up to episode k, h [H], k [K]. Therefore, for each (z, s, a, h) Z S A [H], the empirical estimates of the rewards and transition based on the observations so far are defined as: ˆP k h (z , s |z, s, a) .= Pk 1 k =1 1(Zk h = z, Sk h = s, Ak h = a, Zk h+1 = z , Sk h+1 = s ) max(N k h(z, s, a), 1) , (25) ˆrk h(z, s, a) .= Pk 1 k =1 1(Zk h = z, Sk h = s, Ak h = a)(rh(z, s, a) + ζk h(z, s, a)) max(N k h(z, s, a), 1) , (26) ˆlk h(z, s, a) .= Pk 1 k =1 1(Zk h = z, Sk h = s, Ak h = a)(lh(z, s, a) + ζk h(z, s, a)) max(N k h(z, s, a), 1) . (27) Next, for an event sequence G1:K, that is predictable w.r.t. {F}k 0, i.e., Gk Fk 1, k [K], we define the event: EG(δ) .= n K [K], 1(Gk)dπk h (z, s, a) max(N k h(z, s, a), 1) 4H|Z|| S||A| + 2H|Z|| S||A| ln K G + 4 ln 2HK 1(Gk)dπk h (z, s, a) q max{N k h(z, s, a), 1} 6H|Z|| S||A| + 2H q |Z|| S||A|K G + 2H|Z|| S||A| ln K G + 5 ln 2HK where K G .= PK k=1 1(Gk), and dπk is the occupancy measure of policy πk, ie, dπk h (z, s, a) = Eµ,P,πk[1(Zk h = z, Sk h = s, Ak h = a)|Fk 1]. Let EΩ(δ) be the event associated with the trivial predictable event sequence Gk = Ω, k [K], where Ωis the sample space. Similarly, let G 1:K = {(Jπ0 i (rk, ˆP k) Jπ0 j (rk, ˆP k) > (ϵ + ϵ0)/2) (Jπ0 j (rk, ˆP k) Jπ0 i (rk, ˆP k) > Published in Transactions on Machine Learning Research (04/2023) (ϵ + ϵ0)/2), i j; i, j Z2.}k [K], where denotes the logical OR operator. Note that the event sequence G 1:K is also predictable w.r.t. {Fk}k 0. Let E0(δ) denote the event EG (δ) defined by the event sequence G 1:K. Good event E is defined as: E .= n k [K], h [H], z Z, s S, a A, |rh(z, s, a) ˆrk h(z, s, a)| βk h(z, s, a), |lh(z, s, a) ˆlk h(z, s, a)| βk h(z, s, a), | ˆP k h (z , s |z, s, a) P k h (z , s |z, s, a)| βk h(z, s, a), z , s Z S o EΩ(δ/4) E0(δ/4), (28) where βk h(z, s, a) .= q 1 max(N k h(z, s,a),1)C and C = log(4|Z|2| S|2|A|HK/δ). Lemma C.1. For a given value of δ (0, 1), the event E occurs with probability at least 1 δ. Proof. We can get the above results directly using Lemma A.1 (Liu et al., 2021). As the rest of our analysis is based on this good event, for completeness, we briefly present their proof extended to our setting below. For each (z, s, a, h) Z S A [H], we first define a dataset of K mutually independent samples of reward and next state collected under the true MDP model as {Rn(z, s, a, h), Ln(z, s, a, h), Sn(z, s, a, h)}K n=1. Let ˆrn(z, s, a, h), ˆln(z, s, a, h) and ˆP n( |z, s, a, h) be the corresponding running empirical means for the samples {Ri(z, s, a, h), Li(z, s, a, h), Si(z, s, a, h)}n i=1. We can then define the failure events: F r n .= { z, s, a, h : |ˆrn h(z, s, a) rh(z, s, a)| β(n), } F l n .= { z, s, a, h : |ˆln h(z, s, a) lh(z, s, a)| β(n), } F P n .= { z, s, a, z , s , h : |Ph(z , s |z, s, a) ˆP n h (z , s |z, s, a)| β(n)}, where β(n) = q 1 max(n,1)C. We also define another event, Egen .= Pr( K n=1(F r n F l n F P n ))C EΩ(δ/4) E0(δ/4) Let nk(z, s, a, h) denote the quantity N k h(z, s, a) + 1. Then the problem in our setting can be simulated as following: at an episode k, taking action a in state z, s at time-step h, we get the sample (Rnk(z, s,a,h)(z, s, a, h), Lnk(z, s,a,h)(z, s, a, h), Snk(z, s,a,h)(z, s, a, h)). Therefore, the set {Rn(z, s, a, h), Ln(z, s, a, h), Sn(z, s, a, h)}K n=1 already contains all the samples drawn in the learning problem and the sample averages calculated by the algorithms are: (ˆrk h(z, s, a), ˆlk h(z, s, a), ˆP k h (z , s |z, s, a)) = (rnk(z, s,a,h)(z, s, a, h), lnk(z, s,a,h)(z, s, a, h), P nk(z, s,a,h)( |z, s, a, h)) As a result the Egen implies E, and it is sufficient to show that Egen occurs with probability at least 1 δ. Using Lemma H.4 and union bound, we get the result that EΩ(δ/4) E0(δ/4) occurs with probability at least 1 δ/2 Next, we define δ .= δ|Z|| S| 2(|Z|| S|+2), that satisfies the relation δ 4 δ . To see that, 2 |Z|| S| (since |Z| > 1) 4δ 2δ|Z|| S| 4δ 4δ|Z|| S| 2δ|Z|| S| 2δ(|Z|| S| + 2) 4δ|Z|| S| δ 4 δ|Z|| S| 2(|Z|| S| + 2) Published in Transactions on Machine Learning Research (04/2023) For F r n, by Hoeffding s Inequality and Union bound, we have: Pr( K n=1F r n) |Z|| S||A|HK exp( n(β(n))2) = K|Z|| S||A|H δ 4|Z|2| S|2|A|HK δ which is equivalent to the event Pr( K n=1F r n)C 1 δ |Z|| S|. For F l n, using Hoeffding s Inequality and Union bound, we also get: Pr( K n=1F l n) |Z|| S||A|HK exp( n(β(n))2) = K|Z|| S||A|H δ 4|Z|2| S|2|A|HK δ which is equivalent to the event Pr( K n=1F l n)C 1 δ |Z|| S|. Finally, for F P n , by Hoeffding s Inequality and Union bound, we get: Pr( K n=1F P n ) K|Z|2| S|2|A|HK exp( n(β(n))2) δ . Therefore for the event Pr( K n=1(F r n F l n F P n )), using the definition of δ , we have: Pr( K n=1(F r n F l n F P n )) Pr( K n=1F P n ) + Pr( K n=1F r n) + Pr( K n=1F l n) δ 1 + 2 |Z|| S| Or, the event Pr( K n=1(F r n F l n F P n ))C 1 δ/2. Finally, combining the above results we get that the event Pr( K n=1(F r n F l n F P n ))C EΩ(δ/4) E0(δ/4) holds with probability at least 1 δ or Egen holds with probability 1 δ, that implies E holds with probability 1 δ. All the results in the rest of this section are conditioned on the good event E defined above. C.3 Supporting Results based on Optimistic and Pessimistic MDP Estimates Recall that at an episode k we define the optimistic and pessimistic reward estimates as following (z, s, a, h) Z S A [H]: rk h(z, s, a) .= ˆrk h(z, s, a) + (1 + |Z|| S|H)βk h(z, s, a) rk h(z, s, a) .= ˆrk h(z, s, a) (1 + |Z|| S|H)βk h(z, s, a) Lemma C.2. On good event E, for any policy π and z Z, using the optimistic reward estimate leads to a higher estimated return compared to the true return, i.e., Jπ z (r, P) Jπ z (rk, ˆP k). Proof. For any k, h, z, s, a we assume the sample point ω E: rk h(z, s, a) rh(z, s, a) = rk h(z, s, a) ˆrk h(z, s, a) + ˆrk h(z, s, a) rh(z, s, a) = (1 + |Z|| S|H)βk h(z, s, a) + ˆrk h(z, s, a) rh(z, s, a) (From definition eq. (3)) (1 + |Z|| S|H)βk h(z, s, a) βk h(z, s, a) (due to E) |Z|| S|Hβk h(z, s, a). Additionally, z , s ( ˆP k h Ph)(z , s |z, s, a)V πz h+1(z , s ; r, P) (a) H X z , s βk h(z, s, a) = |Z|| S|Hβk h(z, s, a), Published in Transactions on Machine Learning Research (04/2023) where (a) holds due to Holder s inequality. Using Value difference lemma (Lemma H.2), for any policy π, we have: Jπ z (rk, ˆP k) Jπ z (r, P) h=1 (rk h(z, Sh, Ah) rh(z, Sh, Ah) + X s ,z ( ˆP k h Ph)(z , s |z, Sh, Ah)V πz h+1(z , s ; r, P)) Fk 1 h=1 |Z|| S|Hβk h(z, Sh, Ah) |Z|| S|Hβk h(z, Sh, Ah) Fk 1 Therefore we have, Jπ z (r, P) Jπ z (rk, ˆP k). Lemma C.3. For any policy π and z Z, using the pessimistic reward estimate leads to a lower estimated return compared to the true return, i.e., Jπ z (rk, ˆP k) Jπ z (r, P). Proof. For any k, h, z, s, a, we assume the sample point ω E: rk h(z, s, a) rh(z, s, a) = rk h(z, s, a) ˆrk h(z, s, a) + ˆrk h(z, s, a) rh(z, s, a) = (1 + |Z|| S|H)βk h(z, s, a) + ˆrk h(z, s, a) rh(z, s, a) (From definition eq. (3)) ( 1 |Z|| S|H)βk h(z, s, a) + βk h(z, s, a) (due to E) |Z|| S|Hβk h(z, s, a). Additionally, z , s ( ˆP k h Ph)(z , s |z, s, a)V πz h+1(z , s ; r, P) (a) H X z , s βk h(z, s, a) = |Z|| S|Hβk h(z, s, a). where we get (a) via the Holder s inequality. Using value difference lemma (Lemma H.2), for any policy π, we have: Jπ z (rk, ˆP k) Jπ z (r, P) h=1 (rk h(z, Sh, Ah) rh(z, Sh, Ah) + X z , s ( ˆP k h Ph)(z , s |z, Sh, Ah)V πz h+1(z , s ; r, P)) Fk 1 h=1 |Z|| S|Hβk h(z, Sh, Ah) + |Z|| S|Hβk h(z, Sh, Ah) Fk 1 Therefore, Jπ z (rk, ˆP k) Jπ z (r, P). Lemma C.4. For any policy π and z Z, the difference in returns using the optimistic reward estimate and the true return can be bounded in terms of βk as, Jπ z (rk, ˆP k) Jπ z (r, P) 2(1 + | S||Z|H)Jπ z (βk, ˆP k). Published in Transactions on Machine Learning Research (04/2023) Proof. For any k, h, z, s, a we assume the sample point ω E: rk h(z, s, a) rh(z, s, a) = rk h(z, s, a) ˆrk h(z, s, a) + ˆrk h(z, s, a) rh(z, s, a) = (1 + |Z|| S|H)βk h(z, s, a) + ˆrk h(z, s, a) rh(z, s, a) (From definition eq. (3)) (1 + |Z|| S|H)βk h(z, s, a) + βk h(z, s, a) (due to E) (2 + |Z|| S|H)βk h(z, s, a). Additionally, X z , s ( ˆP k h Ph)(z , s |z, s, a)V πz h+1(z , s ; r, P) (a) H X z , s βk h(z, s, a) = |Z|| S|Hβk h(z, s, a), where we get (a) via the Holder s inequality. Using Value difference lemma (Lemma H.2), for any policy π, we have: Jπ z (rk, ˆP k) Jπ z (r, P) h=1 (rk h(z, Sh, Ah)) rh(z, Sh, Ah) + X z , s ( ˆP k h Ph)(z , s |z, Sh, Ah)V πz h+1(z, s ; r, P)) Fk 1 h=1 2(1 + |Z|| S|H)βk h(z, Sh, Ah) Fk 1 = 2(1 + |Z|| S|H)Jπ z (βk, ˆP k). Therefore we have, Jπ z (rk, ˆP k) Jπ z (r, P) 2(1 + |Z|| S|H)Jπ z (βk, ˆP k). Lemma C.5. For any policy π and z Z, the difference in returns using the pessimistic reward estimate and the true return can be bounded in terms of βk as, Jπ z (r, P) Jπ z (rk, ˆP k) 2(1 + | S||Z|H)Jπ z (βk, ˆP k). Proof. For any k, h, z, s, a, we assume the sample point ω E: rk h(z, s, a) rh(z, s, a) = rk h(z, s, a) ˆrk h(z, s, a) + ˆrk h(z, s, a) rh(z, s, a) = (1 + |Z|| S|H)βk h(z, s, a) + ˆrk h(z, s, a) rh(z, s, a) (From definition eq. (3)) ( 1 |Z|| S|H)βk h(z, s, a) βk h(z, s, a) (due to E) (2 + |Z|| S|H)βk h(z, s, a). Additionally, X z , s ( ˆP k h Ph)(z , s |z, s, a)V πz h+1(z , s ; r, P) (a) H X z , s βk h(z, s, a) = |Z|| S|Hβk h(z, s, a), where (a) holds due to Holder s inequality. Using value difference lemma (Lemma H.2), for any policy π, we have: Jπ z (rk, ˆP k) Jπ z (r, P) h=1 (rk h(z, Sh, Ah) rh(z, Sh, Ah) + X z , s ( ˆP k h Ph)(z , s | Sh, Ah)V πz h+1(z , s ; r, P)) Fk 1 h=1 2(1 + |Z|| S|H)βk h(z, Sh, Ah) Fk 1 2(1 + |Z|| S|H)Jπ z (βk, ˆP k). Published in Transactions on Machine Learning Research (04/2023) Therefore, Jπ z (r, P) Jπ z (rk, ˆP k) 2(1 + | S||Z|H)Jπ z (βk, ˆP k). C.4 Proof for Theorem 3.1 W.l.o.g., let {i, j} denote any pair of subgroups in Z2. For π0, we have |Jπ0 i (r, P) Jπ0 j (r, P)| ϵ by definition of initial fair policy (Assumption 2.3). We will now show that the our construction of Πk F in Equation (4) satisfies the zero constraint violation property for any such pair of subgroup. Part 1: In the first part of the proof, we will show that on the good event E, for any k [K] and policy π Πk F , Jπ i (r, P) Jπ j (r, P) ϵ. Proof. Using Lemma C.2 w.r.t. z = i and r, we have: Jπ i (r, P) Jπ i (rk, ˆP k) (29) Similarly, using Lemma C.3 w.r.t. z = j and r, we get Jπ j (rk, ˆP k) Jπ j (r, P), or, Jπ j (r, P) Jπ j (rk, ˆP k) (30) From Equations (29) and (30), we have: Jπ i (r, P) Jπ j (r, P) Jπ i (rk, ˆP k) Jπ j (rk, ˆP k). (31) Note that from the definition of Πk F , we know any policy in π Πk F satisfies the constraint: Jπ i (rk, ˆP k) Jπ j (rk, ˆP k) ϵ. Therefore, we have the following relation: Jπ i (r, P) Jπ j (r, P) Jπ i (rk, ˆP k) Jπ j (rk, ˆP k) ϵ. Part 2: Now we will present the result that shows on the good event E, for any k [K] and policy π Πk F , Jπ j (r, P) Jπ i (r, P) ϵ. Proof. Using Lemma C.2 w.r.t. z = j and r, we have: Jπ j (r, P) Jπ j (rk, ˆP k) (32) Now, using Lemma C.3 w.r.t. z = i and r, we get Jπ i (rk, ˆP k) Jπ i (r, P), or, Jπ i (r, P) Jπ i (rk, ˆP k) (33) from Equations (32) and (33), we have: Jπ j (r, P) Jπ i (r, P) Jπ j (rk, ˆP k) Jπ i (rk, ˆP k). Note that from the definition of Πk F , we know any policy in π Πk F satisfies the constraint: Jπ j (rk, ˆP k) Jπ i (rk, ˆP k) ϵ. Therefore, we have the following relation: Jπ j (r, P) Jπ i (r, P) Jπ j (rk, ˆP k) Jπ i (rk, ˆP k) ϵ. Published in Transactions on Machine Learning Research (04/2023) Finally, we know that an episode k, either π0 or a policy from Πk F will be deployed. We already know that |Jπ0 i (r, P) Jπ0 j (r, P)| ϵ by definition of initial fair policy from Assumption 2.3. From the results from Parts 1 and 2 above, we get that on the good event E, for any k [K] and policy π Πk F , Jπ j (r, P) Jπ i (r, P) ϵ and Jπ i (r, P) Jπ j (r, P) ϵ, or, |Jπk i (r, P) Jπk j (r, P)| ϵ. The above argument holds true for any pair of subgroups. Extending this argument to all the pairs of subgroups we get, |Jπ i (r, P) Jπ j (r, P)| ϵ i j; (i, j) Z2. C.5 LP formulation for Section 3 The algorithm in Algorithm 1 requires solving an LP that takes the following form: a dπ h((z, s), a) lk h((z, s), a) (34) h, s,a dπi h ( s, a)rk h(i, s, a) | {z } = Jπ i (rk, ˆ P k) h, s,a dπj h ( s, a)rk h(j, s, a) | {z } = Jπ j (rk, ˆ P k) ϵ, i j; i, j Z2. h, s,a dπj h ( s, a)rk h(j, s, a) | {z } = Jπ j (rk, ˆ P k) h, s,a dπi h ( s, a)rk h(i, s, a) | {z } = Jπ i (rk, ˆ P k) ϵ, i j; i, j Z2. a dπz h ( s, a) = X s ,a ˆP k h 1( s|z, s , a )dπz h 1( s , a ) z, s, h Z S [H] \ 1 a dπz 1 ( s, a) = µz( s) z, s Z S dπz h ( s, a) 0 z, s, a, h Z S A [H] C.6 Proof for Theorem 3.2 Change w.r.t. l. When the ˆrk, ˆP k parameters are not well estimated, the LP in Equation (34) might not be feasible itself. In that case, the only available policy to execute is π0 and |Πk| = 1. As the algorithm proceeds with gathering more data via executing π0, the Equation (34) eventually become feasible. However, the algorithm will continue to execute π0 even when the problem becomes feasible as long as there exists at least one pair subgroup i, j Z2 for which either Jπ0 i (rk, ˆP k) Jπ0 j (rk, ˆP k) (ϵ + ϵ0)/2 or Jπ0 j (rk, ˆP k) Jπ0 i (rk, ˆP k) (ϵ + ϵ0)/2 (from the definition of Πk in Equation (5)). From Assumption 2.3, we know that π0 satisfies the fairness condition with strict inequality, or ϵ0 < (ϵ0 + ϵ)/2 < ϵ. Therefore, eventually after enough rounds of |Πk| = 1, the above condition will be not valid anymore and at that point the algorithm can proceed with using solution based on the estimated parameters from Equation (34). Additionally, at that time, we know that all the policies that close enough to π0, which are infinitely many will also be feasible solutions or |Πk| = . We use the proof techniques from Liu et al. (2021) where we decompose the regret in three terms, then analyze each of individual terms separately and combine them later for the final result. First, notice that the Published in Transactions on Machine Learning Research (04/2023) regret can be decomposed as: Reg(K; l) = k=1 1(|Πk| = 1) Jπ (l, P) Jπ0(l, P) k=1 1(|Πk| > 1) Jπ (l, P) Jπk( lk, ˆP k) | {z } (II) k=1 1(|Πk| > 1) Jπk( lk, ˆP k) Jπk(l, P) | {z } (III) For the first term, we have the following result that gives an upper bound on the number of episodes required for exploration by π0. Lemma C.6. On good event E, PK k=1 1(|Πk| = 1) C , where C = O H4| S|3|Z|5|A| min{(ϵ ϵ0)2, (ϵ ϵ0)} Proof. Recall that |Πk| = 1 when there is at least one pair of subgroups i, j Z2 for which the fairness constraint was violated w.r.t. π0: Jπ0 i (rk, ˆP k) Jπ0 j (rk, ˆP k) (ϵ + ϵ0)/2 Jπ0 j (rk, ˆP k) Jπ0 i (rk, ˆP k) (ϵ + ϵ0)/2. We will use the indicator 1(|Πk| = 1; i, j) to denote that if a pair i, j violates this constraint. Wlog, assume that a subgroup pair i, j Z2 violates this constraint. Note that |Πk| = 1 if either of the following conditions is true: (Case A) Jπ0 i (rk, ˆP k) Jπ0 j (rk, ˆP k) (ϵ+ϵ0)/2, which we will denote by the event 1(|Πk| = 1; Ai,j) .= 1(Jπk i (rk, ˆP k) Jπk j (rk, ˆP k) (ϵ + ϵ0)/2) where πk = π0, and (Case B) Jπ0 j (rk, ˆP k) Jπ0 i (rk, ˆP k) (ϵ + ϵ0)/2 that is denoted by 1(|Πk| = 1; Bi,j). In context of i, j, either of these two scenarios can be responsible for |Πk| = 1. Therefore, we have: k=1 1(|Πk| = 1; i, j) k=1 1(|Πk| = 1; Ai,j) + k=1 1(|Πk| = 1; Bi,j) Now, we define K .= PK k=1 1(|Πk| = 1; i, j). We have, k=1 1(|Πk| = 1; i, j)(ϵ ϵ0) k=1 1(|Πk| = 1; i, j) (ϵ + ϵ0) k=1 1(|Πk| = 1; Ai,j) (ϵ + ϵ0) k=1 1(|Πk| = 1; Bi,j) (ϵ + ϵ0) Published in Transactions on Machine Learning Research (04/2023) For the first term in the above equation corresponding to Case A, .i.e., 1(|Πk| = 1; Ai,j), we have : k=1 1(|Πk| = 1; Ai,j) (ϵ + ϵ0) (a) 1(|Πk| = 1; Ai,j) (Jπk i (rk, ˆP k) Jπk j (rk, ˆP k)) (Jπk i (r, P) Jπk j (r, P)) = 1(|Πk| = 1; Ai,j)(Jπk i (rk, ˆP k) Jπk i (r, P)) | {z } A.1 + 1(|Πk| = 1; Ai,j)(Jπk j (r, P) Jπk j (rk, ˆP k)) | {z } A.2 where (a) holds because πk = π0 when |Πk| = 1 and Jπk i (rk, ˆP k) Jπk j (rk, ˆP k) (ϵ + ϵ0)/2 (from definition of Case A), and Jπk i (r, P) Jπk j (r, P) ϵ0 due to Assumption 2.3. For the A.1 term above, we will use Lemma H.3 with |rk h rh| = |ˆrk h + (1 + |Z|| S|H)βk h rh| = |ˆrk h rh + (1 + |Z|| S|H)βk h| βk h + (1 + |Z|| S|H)βk h (2 + |Z|| S|H)βk h. Now we can directly apply Lemma H.3 to get the following result that bounds A.1, A.1 = O H4| S|3|Z|3|A| + H2 q | S|3|Z|3|A|K . Similarly for term A.2, we have: |rh rk h| = |rh (hrk h (1 + |Z|| S|H)βk h)| = |rh ˆrk h + (1 + |Z|| S|H)βk h| βk h + (1 + |Z|| S|H)βk h (2 + |Z|| S|H)βk h. Again applying Lemma H.3 we get, A.2 = O H4| S|3|Z|3|A| + H2 q | S|3|Z|3|A|K . k=1 1(|Πk| = 1; Ai,j) (ϵ + ϵ0) 2 ϵ0 = O H4| S|3|Z|3|A| + H2 q | S|3|Z|3|A|K . The analysis for 1(|Πk| = 1; Bi,j) is analogous and leads to the following result: k=1 1(|Πk| = 1; Bi,j) (ϵ + ϵ0) 2 ϵ0 = O H4| S|3|Z|3|A| + H2 q | S|3|Z|3|A|K . Combining the results for 1(|Πk| = 1; Ai,j) and 1(|Πk| = 1; Bi,j): 2 K = O H4| S|3|Z|3|A| + H2 q | S|3|Z|3|A|K Published in Transactions on Machine Learning Research (04/2023) Using Lemma H.5 for K , there exists some parameter C such that, K C = O H4| S|3|Z|3|A| (ϵ ϵ0) min{1, (ϵ ϵ0)} Now we must consider that |Πk| = 1 if the constraint w.r.t. π0 in Equation (5) is violated for any of the subgroup pairs. Note that we have the relation, k=1 1(|Πk| = 1) X k=1 1(|Πk| = 1; i, j) The inner term of the above expression can be bounded using the above result w.r.t. K . We also know that the number of pairs of subgroups is bounded by |Z|2, therefore we have: k=1 1(|Πk| = 1) C = O H4| S|3|Z|5|A| (ϵ ϵ0) min{1, (ϵ ϵ0)} We use the following result for the high-probability bound for the (II) term in Equation (35). Lemma C.7. For αl = 1 + |Z|| S|H + 8H(1 + |Z|| S|H)/(ϵ ϵ0), on good event E, k=1 1(|Πk| > 1) Jπ (l, P) Jπk( lk, ˆP k) 0, Proof. In our setting, we sample a subgroup z uniformly from Z at the beginning of each episode, or Pr(z) = 1/|Z|, z. From Lemma B.2, we have: k=1 1(|Πk| > 1) Jπ (l, P) Jπk( lk, ˆP k) k=1 1(|Πk| > 1) X z Z Pr(z) Jπ z (l, P) Jπk z ( lk, ˆP k) k=1 1(|Πk| > 1) X Jπ z (l, P) Jπk z ( lk, ˆP k) . (36) Therefore, it suffices to show that P z Z Jπ z (l, P) Jπk z ( lk, ˆP k) 0 holds true for the case when |Πk| > 1. We will show that this statement holds true for cases when π Πk and π / Πk for any k [K]. When π Πk, using Lemma C.2 with the relation that lk(z, s, h) lk(z, s, h) z, s, h, it can be shown that for any z Z, Jπ z ( lk, ˆP k) Jπ z (l, P) Multiplying the above inequality on both sides with Pr(z) and summing over all z Z, we have: Jπ ( lk, ˆP k) Jπ (l, P) (37) However, as πk is the solution of Equation (8) and |Πk| > 1, we have: Jπk( lk, ˆP k) Jπ ( lk, ˆP k) (38) Published in Transactions on Machine Learning Research (04/2023) From, Equation (37) and Equation (38), when |Πk| > 1 and π Πk, then for any k [K]: 1(|Πk| > 1) Jπ (l, P) Jπk( lk, ˆP k) 0. We will now focus on the case when π |Πk for the rest of the proof. We will first show this result for a pair of subgroups and then extend it to the general case. Wlog, assume that we have a pair {i, j} Z2. Let Bγk denote an independent Bernoulli distributed random variable with mean γk. Using this, we can define a probabilistic mixed policy as: πk = Bγkπ + (1 Bγk)π0, Let γk [0, 1] be the largest coefficient that satisfies, J π i (rk, ˆP k) J π j (rk, ˆP k) ϵ. (39) If Jπ i (rk, ˆP k) Jπ j (rk, ˆP k) < ϵ, then γk = 1. Else, the equality holds in Equation (39). Therefore, ϵ = γk Jπ i (rk, ˆP k) + (1 γk)Jπ0 i (rk, ˆP k) γk Jπ j (rk, ˆP k) (1 γk)Jπ0 j (rk, ˆP k) = γk Jπ i (rk, ˆP k) Jπ j (rk, ˆP k) + (1 γk) Jπ0 i (rk, ˆP k) Jπ0 j (rk, ˆP k) (a) γk Jπ i (rk, ˆP k) Jπ j (rk, ˆP k) + (1 γk) ϵ + ϵ0 = γk Jπ i (rk, ˆP k) Jπ j (rk, ˆP k) + (1 γk) ϵ + ϵ0 + γk Jπ i (r, P) Jπ j (r, P) γk Jπ i (r, P) Jπ j (r, P) Jπ i (rk, ˆP k) Jπ j (rk, ˆP k) Jπ i (r, P) Jπ j (r, P) | {z } .= Jk i,j + γk Jπ i (r, P) Jπ j (r, P) +(1 γk) ϵ + ϵ0 = γk( Jk i,j) + γkϵ + ϵ + ϵ0 Jk i,j + ϵ ϵ0 where (a) holds because |Πk| > 1 which implies Jπ0 i (rk, ˆP k) Jπ0 j (rk, ˆP k) < (ϵ + ϵ0)/2 for all subgroup pairs (Equation (5)), and Jπ i (r, P) Jπ j (r, P) ϵ because π satisfies the fariness constraint by definition. We also denote Jk i,j .= Jπ i (rk, ˆP k) Jπ j (rk, ˆP k) Jπ i (r, P) Jπ j (r, P) for readability. Note that from Equation (31), we know that for any π, Jπ i (r, P) Jπ j (r, P) Jπ i (rk, ˆP k) Jπ j (rk, ˆP k), therefore for π = π , Jπ i (rk, ˆP k) Jπ j (rk, ˆP k) Jπ i (r, P) Jπ j (r, P) 0, or, Jk i,j 0. Published in Transactions on Machine Learning Research (04/2023) Therefore, Jk i,j + (ϵ ϵ0)/2 0 ( ϵ0 < ϵ). Plugging this back to Equation (40) we get, ϵ ϵ0 + 2( Jk i,j). (41) By Lemma C.4 for any policy π and z = i, Jπ i (rk, ˆP k) Jπ i (r, P) 2(1 + | S||Z|H)Jπ i (βk, ˆP k). (42) Similarly, by Lemma C.5 for any policy π and z = j, Jπ j (r, P) Jπ j (rk, ˆP k) 2(1 + | S||Z|H)Jπ j (βk, ˆP k). (43) Adding Equations (42) and (43) for any policy π: Jπ i (rk, ˆP k) Jπ i (r, P) + Jπ j (r, P) Jπ j (rk, ˆP k) 2(1 + | S||Z|H) Jπ i (βk, ˆP k) + Jπ j (βk, ˆP k) Jπ i (rk, ˆP k) Jπ j (rk, ˆP k) Jπ i (r, P) Jπ j (r, P) | {z } Jk i,j 2(1 + | S||Z|H) Jπ i (βk, ˆP k) + Jπ j (βk, ˆP k) | {z } .=Jπ i,j(βk, ˆ P k) Here we also introduce the notation for return w.r.t. only two subgroups Jπ i,j(g, P) .= Jπ i (g, P) + Jπ j (g, P) for any reward function g and dynamics P. Thus, we have the following relation for any policy π and any pair i, j Z2: Ji,j 2(1 + |Z|| S|H)Jπ i,j(βk, ˆP) (44) Although πk might both not a Markov policy, from Lemma D.3 of Liu et al. (2021) (Theorem 6.1(i) of Altman (1999)), we can find a randomized Markov policy ˆπk that matches the occupation distributions of πk under transition probabilities ˆP k, with J ˆπk(g, ˆP k) = J πk(g, ˆP k) for any g. From the definition of πk (Equation (8)) and ˆπk Πk, we have: Jπk i,j ( lk, ˆP k) J ˆπk i,j ( lk, ˆP k) = J πk i,j ( lk, ˆP k) = γk Jπ i,j ( lk, ˆP k) + (1 γk)Jπ0 i,j ( l, ˆP k) | {z } 0 γk Jπ i,j ( lk, ˆP k) ϵ ϵ0 + 2( Jk i,j)Jπ i,j ( lk, ˆP k) (Substitute γk using Equation (41)) ϵ ϵ0 + 4(1 + |Z|| S|H)Jπ i,j (βk, ˆP k) Jπ i,j ( lk, ˆP k) (Using Equation (44)) To make Jπk i,j ( lk, ˆP k) Jπ i,j (l, P), it is sufficient to show: ϵ ϵ0 + 4(1 + |Z|| S|H)Jπ i,j (βk, ˆP k) Jπ i,j ( lk, ˆP k) Jπ i,j (l, P), or, (ϵ ϵ0)(Jπ i,j ( lk, ˆP k) Jπ i,j (l, P)) 4(1 + |Z|| S|H)Jπ i,j (βk, ˆP k)Jπ i,j ( lk, ˆP k). (45) Published in Transactions on Machine Learning Research (04/2023) From value difference lemma (Lemma H.2), for any z Z, Jπ z ( lk, ˆP k) Jπ z (l, P) lk(z, Sh, Ah) l(z, Sh, Ah) + X z , s ( ˆP k h Ph)(z , s |z, Sh, Ah)V π z h+1(z, s ; l, P) h=1 (αl 1 |Z|| S|H)βk h(z, Sh, Ah) Fk 1 = (αl 1 |Z|| S|H)Jπ z (βk, ˆP k). Using the above result separately for z = i and z = j we have: Jπ i ( lk, ˆP k) Jπ i (l, P) (αl 1 |Z|| S|H)Jπ i (βk, ˆP k), Jπ j ( lk, ˆP k) Jπ j (l, P) (αl 1 |Z|| S|H)Jπ j (βk, ˆP k). Adding the above two equations, we get: Jπ i,j ( lk, ˆP k) Jπ i,j (l, P) (αl 1 |Z|| S|H)Jπ i,j (βk, ˆP k). If we use αl = 1 + |Z|| S|H + 4(1+|Z|| S|H) (ϵ ϵ0) 2H, then Jπ i,j ( lk, ˆP k) Jπ i,j (l, P) 4(1 + |Z|| S|H) (ϵ ϵ0) Jπ i,j (βk, ˆP k)2H. As the the maximum value of Jπ i,j ( lk, ˆP k) is 2H, therefore the Equation (45) is always satisfied. This implies Jπk i,j ( lk, ˆP k) Jπ i,j (l, P), or Jπk i ( lk, ˆP k)+Jπk j ( lk, ˆP k) Jπ i (l, P)+Jπ j (l, P) for any k and for any {i, j} Z2 with |Πk| > 1. Using the above result for consecutive pairs of subgroups {(1, 2), (2, 3), . . . , (|Z| 1, |Z|), (|Z|, 1)}, and adding them together we get z=1 Jπk z ( lk, ˆP k) 2 z=1 Jπ z (l, P) Jπ z (l, P) Jπk z ( lk, ˆP k) 0 Therefore, using Equation (36), Jπ (l, P) Jπk( lk, ˆP k) 0 for any k with |Πk| > 1. The last term (III) in Equation (35) is bounded directly using the Lemma B.4 of Liu et al. (2021) with the value of αl. We provide the result here for completeness. Lemma C.8. On good event E, k=1 1(|Πk| > 1) Jπk( lk, ˆP k) Jπk(l, P) = O H3 | S|3|Z|5|A|K + H5| S|3|Z|4|A| Published in Transactions on Machine Learning Research (04/2023) Proof. As | lk h rh| = |ˆlk h lh + αlβk h| (1 + αl)βk h, by Lemma H.3, k=1 1(|Πk| > 1) Jπk( lk, ˆP k) Jπk(l, P) k=1 |Jπk( lk, ˆP k) Jπk(l, P)| = O (αl + H q | S||Z|)H q | S||Z||A|K + H3| S|2|Z|2|A|αl | S|3|Z|3|A|K + H5| S|3|Z|3|A| Combining the results for terms (I), (II) and (III) derived above, we get Jπ (l, P) Jπk(l, P) = O H3 |Z|3| S|3|A|K + H5|Z|5| S|3|A| min{(ϵ ϵ0), (ϵ ϵ0)2} C.7 Extension to non-uniform Z In our current setup, we sample z Z uniformly for each episode (or Pr(z) = 1/|Z|, z Z). As such, the optimization problem takes the form: max π 1 |Z| z Jπ z (l, P) (46) s.t.|Jπ i (r, P) Jπ j (r, P)| ϵ, i j; i, j Z2. where the 1/|Z| acts as positive multiplicative constant and can be ignored from an optimization perspective. As we mentioned in Section 3, this scenario might not be the case in reality as different populations might not be always represented equally. In this section, we will show how our approach can be extended to the setting with any arbitrary Z, given that the Z is also know to the algorithm. We will do so by taking the Pr(z) term into account in the definition of the subgroup specific returns. Jπ z (r, P) = E ( S1,Z1) µ [V π 1 ( S1, Z1; r, P)]1[z == Z1] = E Z1 Z S1 µZ1 [V π 1 ( S1, Z1; r, P)]1[z == Z1] = Pr(z) E S1 µz [V π 1 ( S1, Z1; r, P)]. (47) In the setting considered in the main paper, where the algorithms can sample trajectories from each subgroup for every iteration of the algorithm, Pr(z) = 1/|Z|, z Z, and as such the 1/|Z| term can be ignored in the definition subgroup specific returns (Equation (1)). For this new setting, the problem in Equation (2) takes the following form: z Jπ j (l, P) (48) s.t.| Jπ i (r, P) Jπ j (r, P)| ϵ, i j; i, j Z2. The Assumption 2.3 also modifies accordingly in this setting, i.e., we assume that | Jπ0 i (r, P) Jπ0 j (r, P)| ϵ0 < ϵ, i, j Z2 and the value of ϵ0 is known to the algorithm. Published in Transactions on Machine Learning Research (04/2023) Note that the LP based solution is still valid to this setting as the constraints are still linear even with the addition of Pr(z) term. In the rest of this section, we will show that the results from Section 3 are still valid in new setting for the same choice of optimistic and pessimistic reward estimates. In this setting, we can define the set of fair policies Πk F (analogous to Equation (4)) at an episode k [K] as: π : Jπ i (rk, ˆP k) Jπ j (rk, ˆP k) ϵ, i j; i, j Z2. Jπ j (rk, ˆP k) Jπ i (rk, ˆP k) ϵ, i j; i, j Z2. The final set of policies is now chosen from the high-confidence set Πk, defined as: ( if Jπ0 i (rk, ˆP k) Jπ0 j (rk, ˆP k) > (ϵ + ϵ0)/2, or Jπ0 j (rk, ˆP k) Jπ0 i (rk, ˆP k) > (ϵ + ϵ0)/2, i j; i, j Z2. Πk F , otherwise. The result from Theorem 3.1 is still valid in the setting, i.e, given an input confidence parameter δ (0, 1) and an initial fair policy π0, the construction of Πk ensures that there are no fairness violations at any episode in the learning procedure in the true environment with high probability (1 δ), i.e., for any π Πk, Pr Jπ i (r, P) Jπ j (r, P) ϵ 1 δ, i, j Z2, k [K]. (51) The proof follows the exact same steps as in Appendix C.4 but in context with Jπ z (r, P). We describe the proof sketch briefly below: Proof. To see that the Part 1 of the proof from Appendix C.4 holds true, notice that from Lemma C.2 w.r.t. z = i and r, we have: Jπ i (r, P) Jπ i (rk, ˆP k) Pr(i)Jπ i (r, P) Pr(i)Jπ i (rk, ˆP k) Jπ i (r, P) Jπ i (rk, ˆP k). Similarly, from Lemma C.3 w.r.t. z = j and r, we get Jπ j (r, P) Jπ j (rk, ˆP k) Combining the above relations, we have: Jπ i (r, P) Jπ j (r, P) Jπ i (rk, ˆP k) Jπ j (rk, ˆP k). From the definition of Πk F , we know any policy in π Πk F satisfies the constraint: Jπ i (rk, ˆP k) Jπ j (rk, ˆP k) ϵ. Therefore, we have the following relation: Jπ i (r, P) Jπ j (r, P) Jπ i (rk, ˆP k) Jπ j (rk, ˆP k) ϵ. The proof of Part 2 of Appendix C.4 follows the same steps, and we get: Jπ j (r, P) Jπ i (r, P) Jπ j (rk, ˆP k) Jπ i (rk, ˆP k) ϵ. Finally, from the same argument as the last step of Appendix C.4, we know that either π0 is deployed in an episode (which is fair) or a policy from Πk F will be deployed. We have shown in Parts 1 and 2 above that on Published in Transactions on Machine Learning Research (04/2023) the good event E, for any k [K] and policy π Πk F , Jπ j (r, P) Jπ i (r, P) ϵ and Jπ i (r, P) Jπ j (r, P) ϵ, or, | Jπk i (r, P) Jπk j (r, P)| ϵ. Extending this argument to all the pairs of subgroups we get, | Jπ i (r, P) Jπ j (r, P)| ϵ i j; i, j Z2. The result from Theorem 3.2 also holds true in this setting for the same definition of lh (Equation (7)). We will describe the proof sketch briefly below: Note that the results from Lemma C.6 and Lemma C.8 extend directly to this setting by replacing the term Jπ z (l, P) with Jπ z (l, P) and then following the exact same steps in the corresponding proofs. For Lemma C.7, notice that we have: k=1 1(|Πk| > 1) Jπ (l, P) Jπk( lk, ˆP k) k=1 1(|Πk| > 1) X z Z Pr(z) Jπ z (l, P) Jπk z ( lk, ˆP k) (from Lemma B.2) k=1 1(|Πk| > 1) X Jπ z (l, P) Jπk z ( lk, ˆP k) . (52) Therefore, it suffices to show that P z Z Jπ z (l, P) Jπk z ( lk, ˆP k) 0 holds true for the case when |Πk| > 1. From here, the same steps from the proof of Lemma C.7 can be followed by replacing the term Jπ z (l, P) with Jπ z (l, P). D Tabular experiments The goal for the experiments to validate if the proposed algorithm in Section 3 achieves: (i) zero constraint violation (with probability 1 δ), and (ii) incurs a sub-linear regret. D.1 River Swim Environment: We take the River Swim environment (| S| = 7, H = 10, |A| = 2) (Strehl and Littman, 2008) and make the following modifications to suit our fairness setting: There are two subgroups (|Z| = 2), with different Pz and µz. In terms of Pz, the major distinction between the subgroups is that one group has more stochastic transitions (Figure 5a) compared to the other (Figure 5b). In terms of difference in µz, the more stochastic subgroup (Figure 5a) starts the episode in the leftmost state with high probability (µhigh( S1 = 1) = 0.999) and uniformly from the other states. Similarly, the less stochastic subgroup (Figure 5b) starts at the second from left state with high probability (µlow( S1 = 2) = 0.999) and uniformly from the other states. There is no distinction between the decision-maker and demographics rewards, i.e. l = r in this case. Another important distinction from classic river swim environment is the presence of a halfway state where the agent receives a reward higher than the initial state but lesser than the rightmost state. Therefore, reward at reaching right-most state = +1.0, at initial state= 0.01, halfway point = 0.1. In this setting, the traditional non-fair RL algorithms will generate different optimal policies for different subgroups that have distinctively different behaviour. For the subgroup with higher stochasticity that starts Published in Transactions on Machine Learning Research (04/2023) 1 2 3 4 5 6 7 r = 0.01 r = 0.1 r = 1.0 (a) Higher stochasticity subgroup 1 2 3 4 5 6 7 r = 0.01 r = 0.1 r = 1.0 (b) Lower stochasticity subgroup Figure 5: Description of the modified Riwer Swim environment. Figure 5a denotes the transition dynamics for the higher stochasticity subgroup, and Figure 5b denotes the transition dynamics for the other subgroup. The nodes in the graph denote the non-sensitive states and r denotes the reward function (that is same for both the subgroups). The solid arrows denote the transitions corresponding to taking the right action, and the dashed arrows denote left action. The number above arrows denotes the probability that action will result in the corresponding transition. furthest from the rightmost state, the agent under the traditional RL optimal policy for this subgroup stays close to the halfway point (that gives a return of 0.4167), whereas for the subgroup with lesser stochasticity, the agent under the traditional RL optimal policy for this subgroup reaches the right-most state and stays there (return of 3.1901). Experiment methodology: For the experiments, instead of sampling only a single trajectory from one subgroup at an iteration, we instead sample one trajectory from both subgroups for efficiency. Additionally, we use the time-homogeneous transition and reward functions to simplify the experimentation setting. We present the experimentation methodology in Algorithm 2, and introduce the experiment design and input parameters below: ϵ0 denotes the fairness gap corresponding to the initial fair exploration policy. We use an alternate reward function, along with the ϵ0 = 0.1 and true MDP transition function P to construct the corresponding π0. The alternate reward function is similar to the true reward, with the difference that there is no reward for the rightmost state. This allows us to get a π0 that can reach the midway point quite easily, but has a very low probability of reaching to the rightmost state. The motivation behind this is to start with an inefficient fair-exploration strategy. η : The final fairness constraint is ϵ = ϵ0 + η. This is set to 1.0 as it allows the agent to reach till the end but forces the agent to not stay there in order to prevent violating the fairness criteria. K : The number of episodes to run the algorithm K = 20k. δ = 0.1, the high-probability constant (or the failure-rate). B, confidence set scaling parameter: If we do not scale the confidence sets βk, then it would take the algorithm millions of episodes before making a switch from the initial policy (i.e., algorithm behaves conservatively). Due to computational reasons , we scale the βk to have more sensible confidence sets. Published in Transactions on Machine Learning Research (04/2023) Evaluation criteria: We plot the following quantities for different algorithms during the learning: Cumulative regret, Reg(K; l) .= PK k=1(Jπ (l, P) Jπk(l, P)), Cumulative regret w.r.t. π0, The returns for both of the subgroups throughout the learning (Jπk z ), Number of unfair policies executed so far, Failure-rate (δ): The average number of time the executed policy violated the fairness constraints , The fairness gap at each iteration ϵk, Whether the algorithm is using π0 or not. Baselines: We consider the Maximum Likelihood Estimation (MLE) based baseline for the comparison. The MLE baseline starts with π0 and then simply builds the MLE estimates of the MDP parameters. It then uses the estimated parameters in the LP solver in Equation (17) directly to get a policy to execute at an episode k. Algorithm 2 Experiment procedure for River Swim Input: Env, ϵ0, η, K, δ, PI-Algorithm and B. 1: Calculate π0 based using the true MDP parameters and alternate reward. 2: Set ϵ = ϵ0 + η. 3: Compute π using Equation (17) using true MDP parameters, calculate Jπ . 4: Initialize: N p(z, s, a) = N c(z, s, a) = 0, (z, s, a) Z S A. 5: for k = 0, 1, . . . , K do 6: if (z, s, a) : N c(z, s, a) 2N p(z, s, a) : then 7: Update the empirical model ˆP k, ˆrk, ˆlk; 8: Estimate βk and multiply it with B; 9: Estimate optimistic/pessimistic reward estimates lk, rk, rk; 10: Find πk based on the input algorithm; 11: N p N c 13: for z = 1, . . . , |Z| do 14: Get initial state Zk 1 = z, Sk 1 µz. 15: Execute πk in the true environment and collect a trajectory (Zk h, Sk h, Ak h, Rk h), h [H]; 16: Update counters N c(Zk h, Sk h, Ak h), h [H]; 17: end for 18: end for Hypothesis: As the initial fair policy π0 mostly discovers the reward at the halfway state, an inefficient exploration strategy will have difficulty discovering the rightmost state. Therefore, we expect the MLE baseline to have difficulty in exploration and accumulate higher regret. Note that the probability of reaching right-most state under π0 is quite small but non-zero. As such, once the MLE baseline is able to discover the reward at the rightmost state, we would expect it to violate the fairness constraints as the MLE baseline does not take into account the uncertainty associated with the transitions. Compared to MLE, we expect out algorithm to discover the rightmost state quickly and achieve a sub-linear regret, while maintaining a failure rate of less that δ and incurring very low constraint violations over the learning. Results: Before diving into results, we note that the only hyper-parameter in our experiments is the scaling coefficient B. When B 1, the algorithm behaves conservatively and needs more samples to switch from π0 (no computational advantage), and when B 0 then the algorithm behaves similar to MLE baseline (as β 0, and there is no consideration of uncertainty). For this task, we found that the scaling values in range Published in Transactions on Machine Learning Research (04/2023) Figure 6: River swim environment with a starting π0 with an inefficient exploration strategy, i.e., π0 has very low probability of discovering the right most state (Fine-tuned B hyper-parameter). [10 3, 10 4] tend to achieve this computational speedup without affecting the behavior of the algorithm. We present the results for B = 3 10 4 in Figure 6, and highlight the following observations 1: Cumulative regret (Figure 6: Row 1, Column 1): We observe that for the most of the training, the MLE baseline accumulates a linear-regret rate, which then plateaus once it discovers the rightmost state. Compared to MLE, our algorithm achieves sub-linear regret throughout the learning. We note that our algorithm s regret has not plateaued, i.e., there is still some scope of improvement, which might resolve with more amount of samples. Cumulative regret w.r.t. π0 (Figure 6: Row 1, Column 2): We observe that our algorithm has consistently negative regret w.r.t. π0, i.e., it performs better that π0 consistently over the training. The MLE baseline accumulates a positive regret in the beginning and performs worse than baseline (e.g., it stays at the left most state possibly due to incorrect transition estimates), and afters a while it performs as good as baseline (able to reach halfway state consistently), and then it eventually discovers the rightmost state and after which the regret w.r.t. π0 goes down. The returns for both the subgroups through-out learning (Figure 6: Row 1, Column 3,4): These plots depict when the different algorithms were able to achieve different reward states. For instance, we see for the second subgroup (Row 1, column 4), our algorithm is able to achieve a return greater than 1 (able to discover the rightmost state) quite early in training. The number of unfair policies executed over the learning (Figure 6: Row 2, Column 1): The total number of times our algorithm violated the fair constraints is 3, compared to the 2.5k violations for the MLE approach. For our algorithm, the violations occurred when it first switched from π0 (as evident in the small peak in failure-rate plot), whereas for MLE, the fairness violations occurred when the MLE agent discovered the rightmost state but had inaccurate transition estimates. Failure-rate (Figure 6: Row 2, Column 2): We observe that the average failure rate of our algorithm is < δ = 0.1, whereas the MLE baselines violates this property. The fairness gap at each iteration ϵk (Figure 6: Row 2, Column 3): We observe that our algorithm quickly reaches closer to the specified fairness threshold. We also observe that once the algorithm reaches close to the specified ϵ value, the learning also slows down (not much change in the cumulative regret rate) as there is less margin for deviating from policy. 1Another result for B = 10 4 is presented in Figure 7, where we observe similar trends but it requires more samples. Published in Transactions on Machine Learning Research (04/2023) Average amount of times π0 was used (Figure 6: Row 2, Column 4): We observe that the algorithm quickly switches from π0 (in about 200 episodes). Figure 7: River swim environment with a starting π0 with an inefficient exploration strategy, i.e., π0 has very low probability of discovering the right most state (B = 10 4). Additional details: We used cvxpy (Diamond and Boyd, 2016) with the default parameters for solving all the different LP problems. In terms of compute, on an Intel(R) Xeon(R) CPU E5-2623 v3 (3.00GHz), 20k iterations of the algorithm take about 5 hours. D.2 Credit lending Environment: The MDP description follows Section 2.2. The horizon is set to H = 5, handicap for the low group is set as τ = 0.7, and the target ϵ is set to 0.11. The traditional (unfair) RL policy leads to a gap of 50 approved between the groups with a profit for the bank 13.64, whereas a fair policy with ϵ = 0.11 leads to gap of 10 loans and the bank return of 13.58. All the additional details can be found in the accompanying code. Evaluation methodology, criteria and baselines : We use the similar methodology and baselines as in Appendix D.1, except that now we only sample one subgroup from the specified Z at the beginning of each episode. As a result, different subgroups might now be have uneven amount of learning experience. Results: As in Appendix D.1, we use the scaling coefficient hyper-parameter B for computational speedup. We show the results for B = 5e 4 in Figure 8. E Proof of Proposition 4.1 Proof. We omit the r term in the notation for the associated return, value and advantage functions for the sake of clarity. Recall that π and π denote two arbitrary policies such that there exists only one subgroup for which the associated policies differ, i.e., =1i Z : πi = π i. The value function associated with any policy π for subgroup z Z is denoted by V π z . Using Lemma H.6 with π i and f = V π i , we get: J(π i) = E s µi [V π i ( s)] + 1 1 γ E s dπ i a π i s Pi [δi( s, a, s )], (53) Published in Transactions on Machine Learning Research (04/2023) 0 1 2 3 4 5 Time-steps Episodic return Cumulative regret 0 1 2 3 4 5 Time-steps Fairness violations 0 1 2 3 4 5 Time-steps Episodic return Fairness gap k 0 1 2 3 4 5 Time-steps Episodic return Return for first subgroup, Jk 0 1 2 3 4 5 Time-steps Episodic return Return for second subgroup, Jk (a) Z = [0.1, 0.9] 0 1 2 3 4 5 Time-steps Episodic return Cumulative regret 0 1 2 3 4 5 Time-steps Fairness violations 0 1 2 3 4 5 Time-steps Episodic return Fairness gap k 0 1 2 3 4 5 Time-steps Episodic return Return for first subgroup, Jk 0 1 2 3 4 5 Time-steps Episodic return Return for second subgroup, Jk (b) Z = [0.9, 0.1] 0 1 2 3 4 5 Time-steps Episodic return Cumulative regret 0 1 2 3 4 5 Time-steps Fairness violations 0 1 2 3 4 5 Time-steps Episodic return Fairness gap k 0 1 2 3 4 5 Time-steps Episodic return Return for first subgroup, Jk 0 1 2 3 4 5 Time-steps Episodic return Return for second subgroup, Jk (c) Z = [0.5, 0.5] Figure 8: Learning curves for the credit lending task with different Z. The subplots in each row denote the following (from left to right): cumulative regret, number of fairness violations, fairness gap between the groups, return of group high, return of group low. The x-axis denote the number of samples used during the learning. where δi( s, a, s ) = r((i, s), a) + V π i ( s ) V π i ( s). Similarly, for πj using Lemma H.6 with f = V π j , we get: J(πj) = E s µj [V π j ( s)] + 1 1 γ E s dπ j a πj s Pj [δj( s, a, s )], (54) where δj( s, a, s ) = r((j, s), a) + V π j ( s ) V π j ( s). From the above two relations, we have: Jπ ,π i,j = J(π i) J(πj) = E s µi [V π i ( s)] E s µj [V π j ( s)] | {z } Term 1 E s dπ i a π i s Pi [δi( s, a, s )] E s dπ j a πj s Pj [δj( s, a, s )] | {z } Term 2 Published in Transactions on Machine Learning Research (04/2023) The quantity in the Term 1 of the above Equation (55) denotes the difference in the returns of the subgroups under the policy π and can be written as: E s µi [V π i ( s)] E s µj [V π j ( s)] = J(πi) J(πj) = Jπ,π i,j . (56) Notice that we do not require any any samples from π i to estimate the above quantity. However, this is not the case for the second term in Equation (55) as it requires samples from dπ i . We will now follow the same methodology from (Lemma 2, Achiam et al., 2017) to bound the second term so that we do not require the samples from dπ i . We provide the proof below for completeness. We focus on the first quantity in the Term 2 of Equation (55) (E s dπ i ,a π i, s Pi[δi( s, a, s )]). Let δπ i R| S| denote the vector with δπ i ( s) = Ea π i, s Pi[δi( s, a, s )| s]. Using this, we get the relation: E s dπ i , a π i, s Pi [δi( s, a, s )] = D dπ i , δπ i E = D dπ i , δπ i E + D dπ i dπ i , δiiπ E . Using Hölder s inequality: for any p, q [1, ), s.t. 1 D dπ i , δπ i E dπ i dπ i p δπ i q E s dπ i , a π i, s Pi [δi( s, a, s )] D dπ i , δπ i E + dπ i dπ i p δπ i q . (57) For p = 1 and q = , we get: dπ i dπ i 1 = 2DT V (dπ ||dπ), δπ i = ξπ i , where ξπ i = max s | Ea π i, s Pi[δi( s, a, s )]| is the worst case error. Plugging this back in Term 2 for Equation (55), E s dπ i a π i s Pi [δi( s, a, s )] E s dπ j a πj s Pj [δj( s, a, s )] D dπ i , δπ i E + 2DT V (dπ ||dπ)ξπ i dπ j , δπ j , (where δπ j = Ea πj, s Pj[δj( s, a, s )| s]) E s dπ i a πi s Pi π i(a| s) πi(a| s) δi( s, a, s ) + 2DT V (dπ ||dπ)ξπ i E s dπ j a πj s Pj [δj( s, a, s )] (by Importance Sampling: E s dπ i a π i s Pi [δi( s, a, s )] = E s dπ i a πi s Pi h π i(a| s) πi(a| s) δi( s, a, s ) i ) Note that the term E s Pz[δz( s, a, s )|z, s, a] = E s P [r((z, s), a) + γV π z ( s ) V π z ( s)|z, s, a] = Aπ z ( s, a), i.e., the advantage estimate associated with πz for subgroup z. Additionally, from Lemma H.7, we know: dπ i dπ i 1 2γ (1 γ) E s dπ i [DT V (π i||πi)[s]], Published in Transactions on Machine Learning Research (04/2023) where DT V (π i||πi)[s] = 1 a |π i(a|s) πi(a|s)|. Therefore the quantity in Equation (55) becomes, Jπ ,π i,j Jπ,π i,j + 1 1 γ E s dπ i a πi s Pi π i(a| s) πi(a| s) Aπ i ( s, a) E s dπ j a πj s Pj [Aπ j ( s, a)] + 2 γξπ i (1 γ)2 E s dπ i [DT V (π i||πi)[s]] Additionally, for any policy π and subgroup z, we have Ea π z[Aπ z ( s, a)] = 0 by the definition of the advantage function. Thus, E s dπ j a πj [Aπ j ( s, a)] = E s dπ j [ E a πj Aπ j ( s, a)] = 0. Putting these term back ion Equation (55), we get: Jπ ,π i,j Jπ,π i,j + 1 1 γ E s dπ i a πi " π i(a| s) πi(a| s) Aπ i ( s, a) + 2γξπ i (1 γ)DT V (π i||πi)[ s] From (Corollary 3 Achiam et al., 2017), we can replace the term E s dπ i [DT V (π i||πi)[ s]] with q 1 2DKL(π i||πi)[ s], that gives us: Jπ ,π i,j Jπ,π i,j + 1 1 γ E s dπ i a πi " π i(a| s) πi(a| s) Aπ i ( s, a) + 2γξπ i (1 γ) DKL(π i||πi)[ s] Similarly, using the Hölder s inequality in the other direction in Equation (57) gives us the lower bound: Jπ ,π i,j Jπ,π i,j + 1 1 γ E s dπ i a πi " π i(a| s) πi(a| s) Aπ i ( s, a) 2γξπ i (1 γ) DKL(π i||πi)[ s] F Practical Deep-RL algorithm methodology Consider the scenario where each subgroup s policy is parameterized independently and they have their separate neural networks. As we mentioned in Section 4.2, if the Equation (10) can be solved exactly, then we can use it to construct an algorithm that only updates one subgroup at a time, while ensuring each update satisfies the fairness requirement. We present the algorithm based on this methodology in Algorithm 3. F.1 FOCOPS methodology In this section we will describe in detail how the FOCOPS (Zhang et al., 2020) methodology can be used to solve the problem in Equation (10) based solely on the first-order approximations. For ease of exposition, we will show the approach for only two subgroups (Z = {i, j}), however the approach is also valid for any number of subgroups. Instead of solving Equation (10) directly, we will follow the FOCOPS approach, under which now a two-step approach will be taken as following: Published in Transactions on Machine Learning Research (04/2023) Algorithm 3 General algorithm methodology for the Deep-RL case Input: π0 = πθ0 z Πθ, z Z, κ, ϵ. π0 = πθ0 = {πθ0 1 , . . . , πθ0 |Z|} denotes the parameterized input fair policy where πθ0 z denotes the separate policy network for the subgroup z. 1: for k = 0, 1, . . . do 2: for z Z do 3: Sample trajectories for each subgroup Dl = τ πθ l , l Z. 4: Get πθk+1 z using the update rule in Equation (10) with πk z = πθk z and i = z. 5: The algorithm only updates one subgroup at a time. 6: end for 7: All the subgroups have been updated once. 8: end for Step 1: Given some policy πθk i , the first step is to find the optimal update policy π i by solving the optimization problem in Equation (10) in in the non-parameterized policy space, i.e., solve the following optimization problem for non-parameterized πi Π (and not Πθ): E s dπθk i a πi [Aπθk i ( s, a; l)] s.t. Jπθk ,πθk i,j + E s dπθk i a πi " Aπθk i ( s, a; r) 1 γ Jπθk ,πθk i,j E s dπθk i a πi " Aπθk i ( s, a; r) 1 γ DKL(πi||πθk i ) κ, We have the following result that follows directly from Theorem 1 of Zhang et al. (2020). Lemma F.1 (Restatement of Theorem 1 of Zhang et al. (2020)). Let ˆb1 = (1 γ) ϵ Jπθk ,πθk i,j and ˆb2 = (1 γ) ϵ + Jπθk ,πθk i,j . If πθk is a feasible solution, the optimal policy for Equation (58) takes the form: π i (a| s) = πθk i (a| s) Gλ,ν1,ν2( s) exp 1 Aπθk i ( s, a; l) ν1Aπθk i ( s, a; r) + ν2Aπθk i ( s, a; r) , (59) where Gλ,ν1,ν2( s) = P a πθk i (a| s) exp 1 λ Aπθk i ( s, a; l) ν1Aπθk i ( s, a; r) + ν2Aπθk i ( s, a; r) is a partition func- tion which ensures that the Equation (59) is a valid probability distribution, and λ, ν1 and ν2 are solutions to the optimization problem: min λ,ν1 0,ν2 0 λδ + ν1ˆb1 ν2ˆb2 + λ E s dπθk i a π i [log Gλ,ν1,ν2( s)]. (60) Proof. We need to show that the problem in eq. (58) is convex w.r.t. πi = {πi(a| s) : s S, a A}. The objective in Equation (58) is linear w.r.t. πi. As Jπθk ,πθk i,j is also constant w.r.t. πi, the constraint in Equation (58) is also linear. Finally, the KL constraint P s dπθk i ( s)DKL(πi||πθk)[ s] κ, is the same as in Theorem 1 of Zhang et al. (2020) and is also linear. From here we can directly follow the steps from Theorem 1 of Zhang et al. (2020). Published in Transactions on Machine Learning Research (04/2023) Step 2: The optimal policy π i found in the previous step is now projected back to the parameterized policy space Πθ by solving for the closest policy to π i ie: L(θ) = E s dπθk i [DKL(πθ i ||π i )[ s]], where πθ i Πθ is the projected policy which is going to approximate the optimal update policy and then used later as πθk+1 i . Instead of solving for π i , we can use the Corollary 1 of Zhang et al. (2020) with the form of optimal policy derived in the Lemma F.1. This allows us to rewrite the gradient of L(θ) as: θL(θ) = E s dπθk i [ θDKL(πθ i ||π i )[ s]], θDKL(πθ i ||π i )[ s] = θDKL(πθ i ||πθk i )[ s] 1 λ E a π θk i " θπθ i (a| s) πθk i (a| s) Aπθk i ( s, a; l) ν1Aπθk i ( s, a; r) + ν2Aπθk i ( s, a; r) # The above expression allows to estimate the gradient update from the samples generated from πθk i without the need of exact optimal policy update π i . Estimating λ, ν1, and ν2: Note that we still need the parameters λ, ν1, ν2 from solving the dual in Equation (60) at every iteration. Solving this is impractical for high-dimensional state and action spaces, and Zhang et al. (2020) propose the following approximations for estimating the values for λ, ν1 and ν2: λ corresponds to the trust-region constraint and Zhang et al. (2020) found that, in practice, a fixed value of λ found through hyper-parameter sweeping works in practice. Unlike λ, Zhang et al. (2020) claim that ν1 and ν2 need to be continuously adapted during the training. The heuristic proposed by them is based on closeness approximation, (E s dπθk i ,a π i [Aπθk i ( s, a; r)] E s dπθk i ,a π θk i [Aπθk i ( s, a; r)] = 0), and takes the following form for our problem: ν1 projν[ν1 α ϵ Jπθk ,πθk i,j ], ν2 projν[ν2 α ϵ + Jπθk ,πθk i,j ], where α is the step size hyper-parameter and projν is a projection operator that projects ν1, ν2 to the interval [0, νmax] where νmax is another hyper-parameter. The final updates take the form: θDKL(πθ i ||πθk i )[ sn] λ θπθ i (an| sn) πθk i (an| sn) Aπθk i ( sn, an; l) ν1Aπθk i ( sn, an; r) + ν2Aπθk i ( sn, an; r) # I( sn), (61) where N denotes the sample collected under dπθk i and I( sn) .= 1DKL(πθ i ||π θk i )[ sn] κ is in indicator function that ensures only the states that satisfy the πθ i πθk i condition are used for the updates. The complete algorithm is provided in Algorithm 4. Published in Transactions on Machine Learning Research (04/2023) Algorithm 4 FOC-PPO for |Z| = 2 Initialize: Subgroup policies π0 = {π0 1, π0 2} Πθ; value functions V r,0 = {V r,0 1 , V r,0 2 } Vϕr, V l,0 = {V l,0 1 , V l,0 2 } Vϕl; subgroup specific constraint parameters νz 1, νz 2 = 0, z Z. Input: Fairness threshold ϵ, trust-region parameter κ, maximum projection bound νmax, learning rate for νz 1, νz 2 updates α, temperature parameter λ, GAE parameter, discount factor γ, learning rates for policy networks απ and value networks αV . 1: for k = 0, 1, . . . do 2: Generate batch data of M episodes of length T using the current policies for both subgroups πk 1 and πk 2. 3: Updating policy for 1st subgroup (πθk 1 ). 4: Estimate the average difference in returns between the two subgroups based on batch data Jπθk ,πθk 1,2 . 5: Estimate the advantage functions Aπθk 1 (; r), Aπθk 1 (; l) using GAE. Get the bootstrapped target value function for critic updates. 6: Update ν1 1, ν1 2 corresponding to this subgroup: ν1 1 projν[ν1 1 α ϵ Jπθk ,πθk 1,2 ], ν1 2 projν[ν1 2 α ϵ + Jπθk ,πθk 1,2 ], 7: for l = 0, 1, . . . # update epochs do 8: for mb = 0, 1, . . . # mini-batches do 9: Sample a minibatch of size Mb. 10: Calculate the loss function for the critics using MSE loss Lr V (ϕr 1), Ll V (ϕl 1). 11: Update the value networks: ϕl 1 ϕl 1 αV ϕl Ll V (ϕl 1), ϕr 1 ϕr 1 αV ϕr Lr V (ϕr 1). 12: Update the policy: θ1 θ1 απ ˆ θLπ(θ1), ˆ θ1Lπ(θ) 1 Mb θDKL(πθ 1||πθk 1 )[ sn] λ θπθ 1(an| sn) πθk 1 (an| sn) Aπθk 1 ( sn, an; l) ν1Aπθk 1 ( sn, an; r) + ν2Aπθk 1 ( sn, an; r) # 13: end for 14: if 1 Mb PM i=1 PT 1 t=0 DKL(πθ 1||πθk 1 )[ si,t] > κ then 15: exit the update loop; 16: end if 17: end for 18: Similar procedure for updating the policy for 2nd subgroup, (πθk 2 ), but we sample trajectories w.r.t. πθk+1 1 and use the it compute the corresponding Jπθk ,πθk+1 2,1 . 19: end for Published in Transactions on Machine Learning Research (04/2023) F.2 Lagrangian based approach As mentioned in Section 4, the Lagrangian based approaches use adaptive penalty coefficients to enforce the constraints. Borrowing the description from Zhang et al. (2020): for an objective function f(θ) and constraint g(θ) 0 the Lagrangian methods solve the problem maxθ minν 0 f(θ) νg(θ) where ν denotes the Lagrange multiplier or the penalty coefficient. The optimization problem is solved in two steps: a maximization step w.r.t. θ and a minimization step for the penalty coefficient ν. We now describe how the Lagrangian methodology based on PPO can be applied to out setting. We again show the procedure only for Z = {i, j}, but the approach is general and can be applied to any number of subgroups. W.l.o.g., assume that we are only updating the policy for subgroup i, πθ i where where θ denotes only the parameters of the policy network. Let L(θ) denote the traditional PPO objective function for return maximization and let Ji(θ) .= Jπθk ,πθk i,j + E s dπk i a πθ i Aπθk i ( s,a;r) 1 γ . Then the update rule in Equation (10), in context to only θ, can be re-written as: s.t. Ji(θ) ϵ 0 Using the Lagrangian methodology, the augmented objective with penalty coefficients ν1, ν2 can be written as: L(θ, ν1, ν2) = L(θ) ν1( Ji(θ) ϵ) ν2( Ji(θ) ϵ) (62) For the maximization step we take the gradient w.r.t. θ and we get the update of the form: θ L(θ, ν1, ν2) = θ min πθ i (a| s) πθk i (a| s) Aπθk i ( s, a; l) ν1Aπθk i ( s, a; r) + ν2Aπθk i ( s, a; r) , πθ i (a| s) πθk i (a| s) , 1 ξ, 1 + ξ ! Aπθk i ( s, a; l) ν1Aπθk i ( s, a; r) + ν2Aπθk i ( s, a; r) ! where ξ denotes the PPO-specific clipping coefficient, and clip denotes the clip operator that clamps the value of πθ i (a| s) π θk i (a| s) to [1 ξ, 1 + ξ] range. For the minimization step we apply gradient descent w.r.t. ν1, ν2 and get the following update rule: ν1 projν[ν1 α ϵ Jπθk ,πθk i,j ], ν2 projν[ν2 α ϵ + Jπθk ,πθk i,j ], In the above update rule for ν1, ν2, we have made similar approximations as in the projection based method described in Appendix F.1. G Additional details for the Deep-RL experiments G.1 Environment Details We build on the Half-Cheetah-v3 environment from Open AI Gym (Brockman et al., 2016) for the locomotion based experiments. We use the default implementation of Half-Cheetah-v3 as one subgroup, and make the following adjustments for the other two subgroups based on the Mu Jo Co guidelines (Todorov et al., 2012): We create the subgroup with 2 the default feet size by increasing the size parameter in the body model associated with the default Half-Cheetah-v3. We modify the flags associated with back and front feet only (bfoot and ffoot). Published in Transactions on Machine Learning Research (04/2023) We create the subgroup with 10 the default friction by increasing the global friction parameter in the body model associated with the default Half-Cheetah-v3. We modify the geom flag under the default section in the model file. For the point maze navigation, we build on top of the open source library mujoco-maze (https://github. com/kngwyu/mujoco-maze). The parameters for the size of the point agent parameters are increased to 5 the default size based on the torso section of the body model. The implementations for both the environments is provided in the supplemental material. G.2 Architecture and Hyper-parameter selection procedure We use the same network architecture for both the tasks. We follow the same network architecture as Huang et al. (2021); Zhang et al. (2020), where we have two-layered neural network with tanh activation for both the policy and value networks. The policy is modeled as Gaussian, where the network outputs the mean and state-independent log standard deviations. The same pre-processing procedure as Huang et al. (2021) is followed for both the tasks. We use Py Torch (Paszke et al., 2019) for implementing the Deep-RL algorithms. For the algorithm specific hyper-parameters, we follow the guidelines from FOCOPS (Zhang et al., 2020). We set the νmax hyper-parameter to a very large value 1000.0 and do not fine tuned it. We found that the learning rate for the ν1, ν2 parameters typically works best in the range [0.01, 0.1] for our tasks, and we ended up using α = 0.01 for the our experiments. For the λ hyper-parameters, we did hyper-parameter search in range {1.0, 1.5, 3.0, 10.0} and used λ = 1.0 for the maze navigation tasks and λ = 1.5 for the Half-Cheetah tasks. The initial values for all the ν1, ν2 parameters are set to 0. The other parameters, such as mini-batch size, number of updates, trust-region size, clipping parameter, etc. are taken from the PPO-based libraries on which we build our implementations, i.e., Huang et al. (2021) for the Half-Cheetah based locomotion environments and Kanagawa and Kaneko (2020) for the maze navigation based experiments. G.3 Additional results From an implementation point of view, the state-of-the-art implementations of PPO include many code level optimizations (Engstrom et al., 2020; Andrychowicz et al., 2020; Huang et al., 2021) that were not part of the originally proposed algorithm on which we base our fair versions. As a result, we include both the variations of PPO in our baselines. We refer to the version with all the code level optimizations as PPO and the minimal version that is more consistent with the other fair baselines as Minimal-PPO. We present the results for both the tasks and different level of fairness thresholds below: Half-Cheetah task with ϵ {high, medium, low} is presented in Figure 9 (with smoothing) and Figure 10 (with running average). Point Navigation task with ϵ {high, medium, low} is presented in Figure 11 (with smoothing) and Figure 12 (with running average). In terms of compute, on an Nvidia Quadro RTX 8000 GPU with AMD EPYC 7502 32-Core Processor, the navigation experiments take about 3 hours to run with 16 CPU cores and Half-Cheetah experiments take about 7 hours to run with a single CPU core. G.4 Generalization results We train the algorithms with a fixed random seed in a train environment, and then evaluate the performance of the algorithm on ten test environments, each with a different random seed. We present the aggregated results on the test environments for the both the tasks and the lowest ϵ setting in Figure 13. We observe that even though the fair versions of the PPO (FOC-PPO and Lagrangian-PPO) are not able to perfectly satisfy the fairness requirement, they perform vastly superior to the unfair baselines (PPO and Minimal-PPO) in this aspect. Published in Transactions on Machine Learning Research (04/2023) 0.00 0.25 0.50 0.75 1.00 1.25 1.50 1.75 2.00 Time-steps Difference in returns Fairness gap, 0.00 0.25 0.50 0.75 1.00 1.25 1.50 1.75 2.00 Time-steps Cumulative return All subgroups FOC-PPO Lagrangian-PPO Minimal-PPO PPO 0.00 0.25 0.50 0.75 1.00 1.25 1.50 1.75 2.00 Time-steps Epsiodic return Big-feet subgroup 0.00 0.25 0.50 0.75 1.00 1.25 1.50 1.75 2.00 Time-steps Episodic return High-friction subgroup 0.00 0.25 0.50 0.75 1.00 1.25 1.50 1.75 2.00 Time-steps Episodic return Default subgroup (a) Half-Cheetah : High fairness threshold. 0.00 0.25 0.50 0.75 1.00 1.25 1.50 1.75 2.00 Time-steps Difference in returns Fairness gap, 0.00 0.25 0.50 0.75 1.00 1.25 1.50 1.75 2.00 Time-steps Cumulative return All subgroups FOC-PPO Lagrangian-PPO Minimal-PPO PPO 0.00 0.25 0.50 0.75 1.00 1.25 1.50 1.75 2.00 Time-steps Epsiodic return Big-feet subgroup 0.00 0.25 0.50 0.75 1.00 1.25 1.50 1.75 2.00 Time-steps Episodic return High-friction subgroup 0.00 0.25 0.50 0.75 1.00 1.25 1.50 1.75 2.00 Time-steps Episodic return Default subgroup (b) Half-Cheetah: Medium fairness threshold. 0.00 0.25 0.50 0.75 1.00 1.25 1.50 1.75 2.00 Time-steps Difference in returns Fairness gap, 0.00 0.25 0.50 0.75 1.00 1.25 1.50 1.75 2.00 Time-steps Cumulative return All subgroups FOC-PPO Lagrangian-PPO Minimal-PPO PPO 0.00 0.25 0.50 0.75 1.00 1.25 1.50 1.75 2.00 Time-steps Epsiodic return Big-feet subgroup 0.00 0.25 0.50 0.75 1.00 1.25 1.50 1.75 2.00 Time-steps Episodic return High-friction subgroup 0.00 0.25 0.50 0.75 1.00 1.25 1.50 1.75 2.00 Time-steps Episodic return Default subgroup (c) Half-Cheetah: Low fairness threshold. Figure 9: Learning curves for Half-Cheetah environment with different fairness thresholds. The first subplot in each row denotes the fairness gap (maximum of absolute difference of returns between subgroups) and the black dotted horizontal line denotes the specified acceptable fairness threshold (ϵ). The second subplot in each row denotes the cumulative return for all subgroups, and the rest of the subplots in the row denote the subgroup specific returns. The x-axis denote the number of samples used during the learning. The solid colored lines represent the smoothed mean over 10 random seeds for different baselines (with weight=0.9) and the colored shaded regions represent the normal 95% confidence interval. Published in Transactions on Machine Learning Research (04/2023) 0.00 0.25 0.50 0.75 1.00 1.25 1.50 1.75 2.00 Time-steps Difference in returns Fairness gap, 0.00 0.25 0.50 0.75 1.00 1.25 1.50 1.75 2.00 Time-steps Cumulative return All subgroups FOC-PPO Lagrangian-PPO Minimal-PPO PPO 0.00 0.25 0.50 0.75 1.00 1.25 1.50 1.75 2.00 Time-steps Epsiodic return Big-feet subgroup 0.00 0.25 0.50 0.75 1.00 1.25 1.50 1.75 2.00 Time-steps Episodic return High-friction subgroup 0.00 0.25 0.50 0.75 1.00 1.25 1.50 1.75 2.00 Time-steps Episodic return Default subgroup (a) Half-Cheetah : High fairness threshold. 0.00 0.25 0.50 0.75 1.00 1.25 1.50 1.75 2.00 Time-steps Difference in returns Fairness gap, 0.00 0.25 0.50 0.75 1.00 1.25 1.50 1.75 2.00 Time-steps Cumulative return All subgroups FOC-PPO Lagrangian-PPO Minimal-PPO PPO 0.00 0.25 0.50 0.75 1.00 1.25 1.50 1.75 2.00 Time-steps Epsiodic return Big-feet subgroup 0.00 0.25 0.50 0.75 1.00 1.25 1.50 1.75 2.00 Time-steps Episodic return High-friction subgroup 0.00 0.25 0.50 0.75 1.00 1.25 1.50 1.75 2.00 Time-steps Episodic return Default subgroup (b) Half-Cheetah: Medium fairness threshold. 0.00 0.25 0.50 0.75 1.00 1.25 1.50 1.75 2.00 Time-steps Difference in returns Fairness gap, 0.00 0.25 0.50 0.75 1.00 1.25 1.50 1.75 2.00 Time-steps Cumulative return All subgroups FOC-PPO Lagrangian-PPO Minimal-PPO PPO 0.00 0.25 0.50 0.75 1.00 1.25 1.50 1.75 2.00 Time-steps Epsiodic return Big-feet subgroup 0.00 0.25 0.50 0.75 1.00 1.25 1.50 1.75 2.00 Time-steps Episodic return High-friction subgroup 0.00 0.25 0.50 0.75 1.00 1.25 1.50 1.75 2.00 Time-steps Episodic return Default subgroup (c) Half-Cheetah: Low fairness threshold. Figure 10: Learning curves for Half-Cheetah environment with different fairness thresholds. The first subplot in each row denotes the fairness gap (maximum of absolute difference of returns between subgroups) and the black dotted horizontal line denotes the specified acceptable fairness threshold (ϵ). The second subplot in each row denotes the cumulative return for all subgroups, and the rest of the subplots in the row denote the subgroup specific returns. The x-axis denote the number of samples used during the learning. The solid colored lines represent the running mean over the last 100 episodes for 10 random seeds for different baselines and the colored shaded regions represent the normal 95% confidence interval. Published in Transactions on Machine Learning Research (04/2023) 0.0 0.5 1.0 1.5 2.0 2.5 3.0 3.5 4.0 Time-steps 1e6 Difference in returns Fairness gap, 0.0 0.5 1.0 1.5 2.0 2.5 3.0 3.5 4.0 Time-steps 1e6 Cumulative return All subgroups FOC-PPO Lagrangian-PPO Minimal-PPO PPO 0.0 0.5 1.0 1.5 2.0 2.5 3.0 3.5 4.0 Time-steps 1e6 Epsiodic return Big size subgroup 0.0 0.5 1.0 1.5 2.0 2.5 3.0 3.5 4.0 Time-steps 1e6 Episodic return Default size subgroup (a) Point-Navigation : High fairness threshold. 0.0 0.5 1.0 1.5 2.0 2.5 3.0 3.5 4.0 Time-steps 1e6 Difference in returns Fairness gap, 0.0 0.5 1.0 1.5 2.0 2.5 3.0 3.5 4.0 Time-steps 1e6 Cumulative return All subgroups FOC-PPO Lagrangian-PPO Minimal-PPO PPO 0.0 0.5 1.0 1.5 2.0 2.5 3.0 3.5 4.0 Time-steps 1e6 Epsiodic return Big size subgroup 0.0 0.5 1.0 1.5 2.0 2.5 3.0 3.5 4.0 Time-steps 1e6 Episodic return Default size subgroup (b) Point-Navigation: Medium fairness threshold. 0.0 0.5 1.0 1.5 2.0 2.5 3.0 3.5 4.0 Time-steps 1e6 Difference in returns Fairness gap, 0.0 0.5 1.0 1.5 2.0 2.5 3.0 3.5 4.0 Time-steps 1e6 Cumulative return All subgroups FOC-PPO Lagrangian-PPO Minimal-PPO PPO 0.0 0.5 1.0 1.5 2.0 2.5 3.0 3.5 4.0 Time-steps 1e6 Epsiodic return Big size subgroup 0.0 0.5 1.0 1.5 2.0 2.5 3.0 3.5 4.0 Time-steps 1e6 Episodic return Default size subgroup (c) Point-Navigation: Low fairness threshold. Figure 11: Learning curves for Point Navigation environment with different fairness thresholds. The first subplot in each row denotes the fairness gap (maximum of absolute difference of returns between subgroups) and the black dotted horizontal line denotes the specified acceptable fairness threshold (ϵ). The second subplot in each row denotes the cumulative return for all subgroups, and the rest of the subplots in the row denote the subgroup specific returns. The x-axis denote the number of samples used during the learning. The solid colored lines represent the smoothed mean over 10 random seeds for different baselines (with weight=0.9) and the colored shaded regions represent the normal 95% confidence interval. Published in Transactions on Machine Learning Research (04/2023) 0.0 0.5 1.0 1.5 2.0 2.5 3.0 3.5 4.0 Time-steps 1e6 Difference in returns Fairness gap, 0.0 0.5 1.0 1.5 2.0 2.5 3.0 3.5 4.0 Time-steps 1e6 Cumulative return All subgroups FOC-PPO Lagrangian-PPO Minimal-PPO PPO 0.0 0.5 1.0 1.5 2.0 2.5 3.0 3.5 4.0 Time-steps 1e6 Epsiodic return Big size subgroup 0.0 0.5 1.0 1.5 2.0 2.5 3.0 3.5 4.0 Time-steps 1e6 Episodic return Default size subgroup (a) Point-Navigation: High fairness threshold. 0.0 0.5 1.0 1.5 2.0 2.5 3.0 3.5 4.0 Time-steps 1e6 Difference in returns Fairness gap, 0.0 0.5 1.0 1.5 2.0 2.5 3.0 3.5 4.0 Time-steps 1e6 Cumulative return All subgroups FOC-PPO Lagrangian-PPO Minimal-PPO PPO 0.0 0.5 1.0 1.5 2.0 2.5 3.0 3.5 4.0 Time-steps 1e6 Epsiodic return Big size subgroup 0.0 0.5 1.0 1.5 2.0 2.5 3.0 3.5 4.0 Time-steps 1e6 Episodic return Default size subgroup (b) Point-Navigation: Medium fairness threshold. 0.0 0.5 1.0 1.5 2.0 2.5 3.0 3.5 4.0 Time-steps 1e6 Difference in returns Fairness gap, 0.0 0.5 1.0 1.5 2.0 2.5 3.0 3.5 4.0 Time-steps 1e6 Cumulative return All subgroups FOC-PPO Lagrangian-PPO Minimal-PPO PPO 0.0 0.5 1.0 1.5 2.0 2.5 3.0 3.5 4.0 Time-steps 1e6 Epsiodic return Big size subgroup 0.0 0.5 1.0 1.5 2.0 2.5 3.0 3.5 4.0 Time-steps 1e6 Episodic return Default size subgroup (c) Point-Navigation: Low fairness threshold. Figure 12: Learning curves for Point Navigation environment with different fairness thresholds. The first subplot in each row denotes the fairness gap (maximum of absolute difference of returns between subgroups) and the black dotted horizontal line denotes the specified acceptable fairness threshold (ϵ). The second subplot in each row denotes the cumulative return for all subgroups, and the rest of the subplots in the row denote the subgroup specific returns. The x-axis denote the number of samples used during the learning. The solid colored lines represent the running mean over the last 100 episodes for 10 random seeds for different baselines and the colored shaded regions represent the normal 95% confidence interval. Published in Transactions on Machine Learning Research (04/2023) 0.0 0.2 0.4 0.6 0.8 1.0 Time-steps 1e6 Difference in returns Fairness gap, 0.0 0.2 0.4 0.6 0.8 1.0 Time-steps 1e6 Cumulative return All subgroups FOC-PPO Lagrangian-PPO Minimal-PPO PPO Half-Cheetah Variations : Big-Feet vs High-Friction vs Default Half-Cheetah-v3 with = 500 (a) Half-Cheetah: Test time results on low fairness threshold. 0.0 0.5 1.0 1.5 2.0 2.5 3.0 3.5 4.0 Time-steps 1e6 Difference in returns Fairness gap, 0.0 0.5 1.0 1.5 2.0 2.5 3.0 3.5 4.0 Time-steps 1e6 Cumulative return All subgroups FOC-PPO Lagrangian-PPO Minimal-PPO PPO Navigation: Big vs Regular Point Subgroups with = 0.5 (b) Point-Navigation: Test time results on low fairness threshold. Figure 13: Test time performance plots for the Half-Cheetah and Point-Navigation environments with low ϵ on 10 unseen environment instantiation with different random seeds. The first subplot in each row denotes the fairness gap (maximum of absolute difference of returns between subgroups) and the black dotted horizontal line denotes the specified acceptable fairness threshold (ϵ). The second subplot in each row denotes the cumulative return for all subgroups. The x-axis denote the number of samples used during the learning. The solid colored lines represent the smoothed mean over 10 random seeds for different baselines (with weight=0.9) and the colored shaded regions represent the normal 95% confidence interval. Published in Transactions on Machine Learning Research (04/2023) H Other supporting results Lemma H.1 (Hoeffding s inequality for sub-Gaussian random variables). Let X1, . . . , Xn be n independent random variables such that Xi sub Gaussian(σ2). Let X = 1 n Xi we have, Pr( X t) exp nt2 and Pr( X t) exp nt2 When σ2 = 1/2, i.e., X1, . . . , Xn are 1/2-sub-Gaussian random variables, we have: Pr( X t) exp nt2 and Pr( X t) exp nt2 Lemma H.2 (Value difference lemma, Dann et al. (2017), Lemma E.15). Let M = (S, A, µ, H, g, P) and M = (S, A, µ, H, g , P ) be two MDPs with different non-stationary reward functions (g, g ) and transition functions (P, P ). Then, for any policy π, we have the following relation: Jπ(µ; g , P ) Jπ(µ; g, P) gh(Sh, Ah) g h(Sh, Ah) + X s (P h Ph)(s |Sh, Ah)V π h+1(s ; g , P ) g h(Sh, Ah) gh(Sh, Ah) + X s (P h Ph)(s |Sh, Ah)V π h+1(s ; g, P) Lemma H.3 (Lemma D.4 of Liu et al. (2021)). Let G1:K be a sequence of events such that G Fk 1 for each k [K]. Suppose | gk g| αβk, α 1. On good event E, for any K K, k=1 1(Gk)|Jπk z ( gk, ˆP K) Jπk z (g, P)| (3α + 3 | S||Z|)H q | S||Z||A|K GC + O(αH3| S|2|Z|2|A|), where K G = PK k=1 1(Gk). Lemma H.4 (Lemma D.5 of Liu et al. (2021)). Given a sequence of events G1:K that Gk {F}k 1 for each k [K]. With probability at least 1 δ, for any K K, 1(Gk)dπk h (z, s, a) max(N k h(z, s, a), 1) 4H|Z|| S||A| + 2H|Z|| S||A| ln K G + 4 ln 2HK 1(Gk)dπk h (z, s, a) q max{N k h(z, s, a), 1} 6H|Z|| S||A| + 2H q |Z|| S||A|K G + 2H|Z|| S||A| ln K G where N k h(z, s, a) denotes the number of times the state-action tuple (z, s, a) was observed at time step h so far in episodes [1, . . . , k 1], K G .= PK k=1 1(Gk), and dπk is the occupancy measure of policy πk, ie, dπk h (z, s, a) Eµ,P,πk[1(Zk h = z, Sk h = s, Ak h = a)|Fk 1]. Lemma H.5 (Lemma D.6 of Liu et al. (2021)). Suppose 0 x a + b x, for some a, b > 0, Lemma H.6 (Restatement of Lemma 1 of Achiam et al. (2017)). For any subgroup z Z, function f : S R and any policy π, we have: J(πz) = E s µz [f( s)] + 1 1 γ E s dπ z a πz s Pz [r((z, s), a) + γf( s ) f( s)]. Published in Transactions on Machine Learning Research (04/2023) Proof. From the definition of dπ z , we have: dπ z = (1 γ) t=0 (γP π z )t µz = (1 γ)(1 γP π z ) 1 µz where µz, dπ z R| S|, P π z R| S| | S| denote the vector form of the estimates. Multiplying by (1 γP π z ) on both sides and taking inner product with vector f R| S|, (1 γP π z )dπ z = (1 γ) µz, (1 γ) E s µz [f(s)] + E s dπ z a πz s Pz [γf( s )] E s dπ z [f( s)] = 0. Adding the above relation with the definition of J(πz) = 1 1 γ E s dπ z a πz s Pz [r(s, a)], we get: J(πz) = E s µz [f( s)] + 1 1 γ E s dπ z a πz s Pz [R((z, s), a) + γf( s ) f( s)]. Lemma H.7 (Achiam et al. (2017), Lemma 3). The divergence between discounted future state visitation distributions dπ i dπ i 1 is bounded by average divergence of the policies π i and πi: dπ i dπ i 1 2γ (1 γ) E s dπ i [DT V (π ||π)[s]], where DT V (π ||π)[s] = 1 a |π (a|s) π(a|s)|.