Long-Term Fairness with Unknown Dynamics

Tongxin Yin (Electrical and Computer Engineering, University of Michigan, Ann Arbor, MI 48109, tyin@umich.edu); Reilly Raab (Computer Science and Engineering, University of California, Santa Cruz, Santa Cruz, CA 95064, reilly@ucsc.edu); Mingyan Liu (Electrical and Computer Engineering, University of Michigan, Ann Arbor, MI 48109, mingyan@umich.edu); Yang Liu (University of California, Santa Cruz, and ByteDance Research, Santa Cruz, CA 95064, yangliu@ucsc.edu)

*These authors contributed equally to this work. 37th Conference on Neural Information Processing Systems (NeurIPS 2023).

While machine learning can myopically reinforce social inequalities, it may also be used to dynamically seek equitable outcomes. In this paper, we formalize long-term fairness as an online reinforcement learning problem for a policy affecting human populations. This formulation accommodates dynamical control objectives, such as achieving equitable population states, that cannot be incorporated into static formulations of fairness. We demonstrate that algorithmic solutions to the proposed fairness problem can adapt to unknown dynamics and, by sacrificing short-term incentives, drive the policy-population system towards more desirable equilibria. For the proposed setting, we develop an algorithm that adapts recent work in online learning and prove that this algorithm achieves simultaneous probabilistic bounds on cumulative loss and cumulative violations of fairness. In the classification setting subject to group fairness, we compare our proposed algorithm to several baselines, including the repeated retraining of myopic or distributionally robust classifiers, and to a deep reinforcement learning algorithm that lacks fairness guarantees. Our experiments model human populations according to evolutionary game theory and integrate real-world datasets.

1 Introduction

As machine learning (ML) algorithms are deployed for tasks with real-world social consequences (e.g., school admissions, loan approval, medical interventions, etc.), the possibility exists for runaway social inequalities (Crawford and Calo, 2016; Chaney et al., 2018; Fuster et al., 2018; Ensign et al., 2018). While fairness has become a salient ethical concern in contemporary research, the closed-loop dynamics of real-world systems with feedback, i.e., comprising ML-driven decisions and populations that mutually adapt to each other (Fig. 5 in the supplementary material), remain poorly understood. We assert that realistic scenarios are often described by fundamentally unknown dynamics: Even with models of human behavior based on rational agents or evolutionary game theory, utilities and risk-preferences are generally unknown or uncertain. For this reason, we advocate for dynamics-agnostic formulations of fairness, for which reinforcement learning is a natural fit.

In this paper, our primary contribution is to consider the problem of long-term fairness, or algorithmic fairness in the context of a dynamically responsive population, as a reinforcement learning (RL) problem subject to constraints. In our formulation, the central learning task is to develop a policy that minimizes the cumulative loss (e.g., financial risk, negative educational outcomes, misdiagnoses, etc.)
incurred by an ML agent interacting with a human population up to a finite time horizon, subject to constraints on cumulative violations of fairness, which we refer to in a single time step as disparity and cumulatively as distortion. Our central hypothesis is that an RL formulation of long-term fairness may allow an agent to learn to sacrifice short-term utility in order to drive the system towards more desirable equilibria.

The core practical difficulties posed by our general problem formulation, however, are the potentially unknown dynamics of the system under control, which must be determined by the RL agent online (i.e., during actual deployment), and the general non-convexity of the losses or constraints considered. Additionally, our problem setting generally requires continuous state and action spaces for the RL agent, which preclude familiar methods that yield performance guarantees in discrete settings.

Our secondary contributions are 1) to show that long-term fairness can be solved within asymptotic, probabilistic bounds under certain dynamical assumptions and 2) to demonstrate that the problem can also be addressed more flexibly with existing RL methods. For theoretical guarantees, we develop L-UCBFair, an online RL method, and prove sublinear bounds on regret (suboptimality of cumulative loss) and distortion (cumulative violation of the fairness constraint) with high probability (Section 3.1). To demonstrate practical solutions without strong safety guarantees, we apply R-TD3, an adaptation of a well-known deep reinforcement learning method (viz., TD3), to a time-dependent Lagrangian relaxation of the central problem. We compare L-UCBFair and R-TD3 to several baselines (Section 3.3), including myopic policies, in interaction with simulated populations initialized with synthetic or real-world data and updated according to evolutionary game theory (Section 4).

Finally, this paper considers fairness in terms of statistical regularities across (ideally) socioculturally meaningful groups. While internal conflict exists between different statistical measures of fairness (Corbett-Davies and Goel, 2018), we show that an RL approach to long-term fairness can mitigate trade-offs between immediately-fair policy outcomes (e.g., acceptance rate disparities (Dwork et al., 2012; Zemel et al., 2013; Feldman et al., 2015)) and causally-fair population states (e.g., qualification rate parity (Raab and Liu, 2021; Zhang et al., 2020)). In particular, we show that our proposed solutions can learn to avoid the familiar trap of pursuing short-term fairness metrics only to widen underlying disparities that demand escalating interventions at the expense of utility (Section 5.2).

1.1 Related Work

Our formulation of long-term fairness bridges recent work on fairness in machine learning, which has developed in response to the proliferation of data-driven methods in society, and safe reinforcement learning, which seeks theoretical safety guarantees in the control of uncertain dynamical systems.

Dynamics of Fairness in Machine Learning We distinguish long-term fairness from the dynamics of fair allocation problems (Joseph et al., 2016; Jabbari et al., 2017; Tang et al., 2021; Liu et al., 2017) and emphasize side-effects of algorithmic decisions affecting future decision problems.
By formalizing long-term fairness in terms of cumulative losses and disparities, we iterate on a developing research trend that accounts for the dynamical response of a human population to deployed algorithmic prediction: both as a singular reaction (Liu et al., 2018; Hu et al., 2019; Hu and Zhang, 2022) and as a sequence of mutual updates between the population and the algorithm (Coate and Loury, 1993; D'Amour et al., 2020; Zhang et al., 2020; Heidari et al., 2019; Wen et al., 2019; Liu et al., 2020; Hu and Chen, 2018; Mouzannar et al., 2019; Williams and Kolter, 2019; Raab and Liu, 2021). Several prior works have considered the long-term implications of fairness in the context of performative stability or optimality (Perdomo et al., 2020) with known, stateless dynamical transitions: Hu and Zhang (2022) characterize the convergence of systems typified by sequential, myopic policies, while Hashimoto et al. (2018) contrast a myopic policy with a modified objective satisfying distributional robustness. While Mouzannar et al. (2019) and Raab and Liu (2021) address stateful dynamical transitions, the cited works only treat myopic classifiers that optimize immediate utility (subject to fairness constraints) rather than learning to anticipate dynamical population responses. Finally, while Wen et al. (2019) explore reinforcement learning for long-term fairness, their discussion is limited to a tabular explore-then-commit approach over discrete state and action spaces. This approach does not generalize to continuous spaces, for which we provide separate and tighter bounds on both utility and disparity.

Safe Reinforcement Learning L-UCBFair furthers recent efforts in safe RL. While model-based approaches, in which the algorithm learns an explicit dynamical model of the environment, constitute one thread of prior work (Efroni et al., 2020; Singh et al., 2020; Brantley et al., 2020; Zheng and Ratliff, 2020; Kalagarla et al., 2021; Liu et al., 2021; Ding et al., 2021), leading algorithms in this domain are characterized by significant time and space complexity. Among model-free algorithms, the unknown dynamics of our setting preclude the use of a simulator that allows algorithms to search over arbitrary state-action pairs (Xu et al., 2021; Ding et al., 2020; Bai et al., 2022). While Wei et al. (2022) introduce a model-free, simulator-free algorithm, the tabular approach they consider is only applicable to discrete state and action spaces, while our setting requires continuous state and action spaces. To tackle continuous state spaces, Ding et al. (2021) and Ghosh et al. (2022) consider linear dynamics: Ding et al. (2021) develop a primal-dual algorithm with safe exploration, and Ghosh et al. (2022) use a softmax policy design. Both algorithms are based on the work of Jin et al. (2020), which proposed a least-squares value iteration (LSVI) method, using an Upper Confidence Bound (UCB) (Auer et al., 2002) to estimate a state-action Q function. In addition to continuous state spaces, L-UCBFair also addresses continuous action spaces. To our knowledge, L-UCBFair is the first model-free, simulator-free RL algorithm to provide theoretical safety guarantees for both discrete and continuous state and action spaces. Moreover, L-UCBFair achieves bounds on regret and distortion as tight as any algorithm thus far developed for discrete action spaces (Ghosh et al., 2022).
2 Problem Formulation

We motivate our formulation of long-term fairness with a binary classification task, though the general formulation we propose is more widely applicable. Given this initial task, we sequentially incorporate fairness constraints, then population dynamics, and then cumulative loss and disparity, before fully formalizing the central problem of optimizing cumulative loss subject to expected dynamics and constraints on cumulative disparity.

For our initial binary classification task, suppose that a random individual, sampled i.i.d. from a fixed population, has features $X \in \mathbb{R}^d$, a label $Y \in \{-1, 1\}$, and a demographic group $G \in \mathcal{G}$ (where $\mathcal{G} = [n]$ for $n \geq 2$). Denote the joint distribution of these variables in the population as $s := \Pr(X, Y, G)$, such that $X, Y, G \sim s$. The task is to predict $Y$ (as $\hat{Y}$) from $X$ and $G$ by choosing a classifier $a$, where $\hat{Y} \sim a(X, G)$, to minimize the empirical loss $\mathcal{L}(s, a) \overset{\text{e.g.}}{=} \mathbb{E}\big[L(Y, \hat{Y})\big]$, where $L$ denotes some individual loss such as zero-one loss. In general, we allow arbitrary, (unit-interval) bounded loss functions $\mathcal{L}(s, a) \in [0, 1]$, such that the basic classification task is $\min_a \mathcal{L}(s, a)$.

The standard, fair classification task (without a dynamically responsive population) incorporates constraints on the choice of classifier $a$, such that the disparity $\mathcal{D} \in [0, 1]$ induced on distribution $s$ by $a$ is bounded by some value $c \in [0, 1]$. Formally, the task is $\min_a \mathcal{L}(s, a)$ subject to $\mathcal{D}(s, a) \leq c$. A standard measure of disparity $\mathcal{D}$ is the violation of demographic parity (Dwork et al., 2012). For example, when $\mathcal{G} = \{g_1, g_2\}$, $\mathcal{D}(s, a) \overset{\text{e.g.}}{=} \lVert \xi_{s,a}(g_1) - \xi_{s,a}(g_2) \rVert_2$, where $\xi_{s,a}(g) := \Pr(\hat{Y} = 1 \mid G = g)$.

In this paper, we also wish to consider measures of fairness inherent to the distribution $s$ (e.g., parity of group qualification rates $\Pr(Y = 1 \mid G = g)$). Such measures of fairness can only be affected by changes to $s$ and thus require dynamics, which are not captured by the above formulation (Raab and Liu, 2021; Zhang et al., 2020). We will see that such notions of disparity are well-suited to an RL formulation of long-term fairness.

State, action, and policy Adopting the semantics of a sequence of dependent classification tasks, we identify the time-dependent distribution $s_\tau \in \mathcal{S}$ of individuals in the population as a state and the chosen classifier $a_\tau \in \mathcal{A}$ as an action of some algorithmic agent at time $\tau$. While the state space $\mathcal{S}$ is arbitrary, we assume that the action space $\mathcal{A}$ admits a Euclidean metric, under which it is closed (i.e., $\mathcal{A}$ is isomorphic to $[0,1]^m$, $m \in \mathbb{Z}_{>0}$). We specify that $a_\tau$ is sampled stochastically according to the agent's current policy $\pi_\tau$, that is, $a_\tau \sim \pi_\tau(s_\tau)$. Additionally, we assume $s_\tau$ is fully observable at time $\tau \in \{1, 2, \ldots\}$. In practice, $s_\tau$ must be approximated from finitely many empirical samples, though this caveat introduces well-understood errors that vanish in the limit of infinitely many samples.
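For concreteness, the following minimal sketch computes a loss of the true-positive/true-negative form used later in Eq. (8) and the demographic-parity disparity of a group-specific threshold classifier, assuming a discretized representation of the state $s$ as per-group feature histograms. The data layout (the `s` dictionary, the binary group index, and the key names) is a hypothetical illustration, not the implementation used in this paper.

```python
import numpy as np

# State s is represented here as a dict of per-group arrays over a feature grid:
#   s["x_grid"][i]  = feature value of bin i
#   s["p_x"][g][i]  = Pr(X = x_i, G = g)          (joint mass of bin i and group g)
#   s["p_y1"][g][i] = Pr(Y = 1 | X = x_i, G = g)  (conditional qualification probability)
# An action a = (a_1, a_2) is a pair of group-specific feature thresholds.

def acceptance_rate(s, a, g):
    """xi_{s,a}(g) = Pr(Yhat = 1 | G = g) under the threshold classifier a."""
    accept = s["x_grid"] >= a[g]
    return (s["p_x"][g] * accept).sum() / s["p_x"][g].sum()

def loss(s, a, alpha=1.0, beta=0.0):
    """L(s, a) = 1 - alpha * tp(s, a) - beta * tn(s, a)."""
    tp = tn = 0.0
    for g in (0, 1):
        accept = s["x_grid"] >= a[g]
        tp += (s["p_x"][g] * s["p_y1"][g] * accept).sum()       # Pr(Yhat=1, Y=1)
        tn += (s["p_x"][g] * (1 - s["p_y1"][g]) * (~accept)).sum()  # Pr(Yhat=-1, Y=-1)
    return 1.0 - alpha * tp - beta * tn

def dp_disparity(s, a):
    """Demographic-parity violation |xi_{s,a}(g1) - xi_{s,a}(g2)|."""
    return abs(acceptance_rate(s, a, 0) - acceptance_rate(s, a, 1))
```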
Dynamics Moving beyond the one-shot fair classification task above, let us now assume that a population may react to classification, inducing the distribution $s$ to change. In practice, such distribution shift is a well-known, real-world phenomenon that can increase realized loss and disparity when classifiers are fixed (Chen et al., 2022). For classifiers that are free to update in response to a mutating distribution $s$, subsequent classification tasks depend on the (stochastic) predictions made in previous tasks. In our formulation, we assume the existence of a dynamical kernel $P$ that maps a state $s$ and action $a$ at time $\tau$ to a distribution over possible states at time $\tau + 1$. That is, $s_{\tau+1} \sim P(s_\tau, a_\tau)$. We stipulate that $P$ may be initially unknown, but we assume that it does not explicitly depend on time and may be reasonably approximated or learned online. While real-world dynamics may depend on information other than the current distribution of classification-specific variables $\Pr(X, Y, G)$ (e.g., exogenous parameters, history, or additional variables of state), we have identified the dynamical state $s$ with the current, fully-observed distribution for simplicity and tractability.

Reward and utility, value and quality functions Because the standard RL convention is to maximize reward rather than minimize loss, for an RL agent, we define the instantaneous reward $r(s_\tau, a_\tau) := 1 - \mathcal{L}(s_\tau, a_\tau) \in [0, 1]$ and a separate utility $g(s_\tau, a_\tau) := 1 - \mathcal{D}(s_\tau, a_\tau) \in [0, 1]$, where $r$ and $g$ do not explicitly depend on time $\tau$. We next propose to optimize anticipated cumulative reward subject to constraints on anticipated cumulative utility. Let the index $j$ represent either reward $r$ or utility $g$. We use the letter $V$ (for "value") to denote the future expected accumulation of $j$ over steps $h, \ldots, H$ (without time-discounting) starting from state $s$, using policy $\pi$. Likewise, we denote the quality of an action $a$ in initial state $s$ with the letter $Q$. For $j \in \{r, g\}$, we define

$$V^\pi_{j,h}(s) := \mathbb{E}\Big[\sum_{\tau=h}^{H} j(s_\tau, a_\tau) \,\Big|\, s_h = s\Big]; \qquad Q^\pi_{j,h}(s, a) := \mathbb{E}\Big[\sum_{\tau=h}^{H} j(s_\tau, a_\tau) \,\Big|\, s_h = s,\ a_h = a\Big].$$

Note that $V$ and $Q$ marginalize over the stochasticity of state transitions $s_{\tau+1} \sim P(s_\tau, a_\tau)$ and the sampling of actions $a_\tau \sim \pi_\tau(s_\tau)$. By the boundedness of $r$ and $g$, the values of $V$ and $Q$ belong to the interval $[0, H - h + 1]$.

Long-term fairness via reinforcement learning The central problem explored in this paper is

$$\max_\pi V^\pi_{r,1}(s) \quad \text{subject to} \quad V^\pi_{g,1}(s) \geq c. \tag{1}$$

We emphasize that this construction considers a finite time horizon of $H$ steps, and we denote the optimal value of $\pi$ as $\pi^*$. We first assume that a solution to the problem exists bounded within the interior of the feasible set (i.e., strict primal feasibility or Slater's condition), as in Efroni et al. (2020), Ding et al. (2021), and Ghosh et al. (2022):

Assumption 2.1 (Slater's Condition). $\exists\, \gamma > 0$ and a policy $\bar{\pi}$ such that $V^{\bar{\pi}}_{g,1}(s) \geq c + \gamma$.

The Online Setting Given initially unknown dynamics, we formulate long-term fairness for the online setting (in which learning is only possible through actual, live deployment of policy, rather than through simulation). As it is not possible to unconditionally guarantee constraint satisfaction in Eq. (1) over a finite number of online steps, we instead measure two types of regret: the suboptimality of a policy with respect to cumulative incurred loss, denoted Regret, and the suboptimality of a policy with respect to cumulative induced disparity, denoted Drtn for "distortion":

$$\mathrm{Regret}(\pi, s_1) := V^{\pi^*}_{r,1}(s_1) - V^{\pi}_{r,1}(s_1); \qquad \mathrm{Drtn}(\pi, s_1) := \max\big(0,\ c - V^{\pi}_{g,1}(s_1)\big). \tag{2}$$
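As a concrete illustration of these quantities, the sketch below estimates $V^{\pi}_{r,1}(s_1)$ and $V^{\pi}_{g,1}(s_1)$ by a single Monte Carlo rollout and evaluates the distortion term of Eq. (2). The `env.step` and `policy.sample` interfaces are hypothetical placeholders (not part of our released code), and estimating Regret would additionally require the unknown optimal value $V^{\pi^*}_{r,1}(s_1)$.

```python
import numpy as np

def rollout_values(env, policy, s1, H, rng=np.random.default_rng(0)):
    """One Monte Carlo estimate of V^pi_{r,1}(s1) and V^pi_{g,1}(s1):
    undiscounted sums of reward r = 1 - L and utility g = 1 - D over H steps."""
    s, V_r, V_g = s1, 0.0, 0.0
    for _ in range(H):
        a = policy.sample(s, rng)   # a ~ pi(s)          (hypothetical interface)
        r, g, s = env.step(s, a)    # observe r, g, s'   (hypothetical interface)
        V_r += r
        V_g += g
    return V_r, V_g

def distortion(V_g, c):
    """Drtn(pi, s1) = max(0, c - V^pi_{g,1}(s1)), as in Eq. (2)."""
    return max(0.0, c - V_g)
```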
3 Algorithms and Analysis

We show that it is possible to provide guarantees for long-term fairness in the online setting. To do this, we develop L-UCBFair, the first model-free, simulator-free algorithm to provide such guarantees with continuous state and action spaces. Specifically, we prove probabilistic, sublinear bounds for regret and distortion under a linear MDP assumption (Assumption 3.1) and properly chosen parameters (Appendix B.1, in the supplementary material). We defer proofs to the supplementary material.

3.1 L-UCBFair

Episodic MDP L-UCBFair inherits from a family of algorithms that treat an episodic Markov decision process (MDP) (Jin et al., 2020). Therefore, we first map the long-term fairness problem to the episodic MDP setting, which we denote as MDP$(\mathcal{S}, \mathcal{A}, H, P, \mathcal{L}, \mathcal{D})$. The algorithm runs for $K$ episodes, each consisting of $H$ time steps. At the beginning of each episode, which we index with $k$, the agent commits to a sequence of policies $\pi^k = (\pi^k_1, \pi^k_2, \ldots, \pi^k_H)$ for the next $H$ steps. At each step $h$ within an episode, an action $a^k_h \in \mathcal{A}$ is sampled according to policy $\pi^k_h$, then the state $s^k_{h+1} \in \mathcal{S}$ is sampled according to the transition kernel $P(s^k_h, a^k_h)$. The initial state $s^k_1$ is sampled arbitrarily for each episode.

Episodic Regret Because L-UCBFair predetermines its policy for an entire episode, we amend our definitions of regret and distortion over all $HK$ time steps by breaking them into a sum over $K$ episodes of length $H$:

$$\mathrm{Regret}(K) = \sum_{k=1}^{K} \Big( V^{\pi^*}_{r,1}(s^k_1) - V^{\pi^k}_{r,1}(s^k_1) \Big); \qquad \mathrm{Drtn}(K) = \sum_{k=1}^{K} \max\Big( c - V^{\pi^k}_{g,1}(s^k_1),\ 0 \Big). \tag{3}$$

The Lagrangian Consider the Lagrangian $L$ associated with Eq. (1), with dual variable $\nu \geq 0$:

$$L(\pi, \nu) := V^{\pi}_{r,1}(s) + \nu \big( V^{\pi}_{g,1}(s) - c \big). \tag{4}$$

L-UCBFair approximately solves the primal problem, $\max_\pi \min_{\nu \geq 0} L(\pi, \nu)$, which is non-trivial, since the objective function is seldom concave in practical parameterizations of $\pi$. Let $\nu^*$ denote the optimal value of $\nu$; L-UCBFair assumes the bound $\nu^* \leq V$, given parameter $V$, as described in the supplementary material (Assumption B.1).

Assumption 3.1 (Linear MDP). We assume MDP$(\mathcal{S}, \mathcal{A}, H, P, \mathcal{L}, \mathcal{D})$ is a linear MDP with feature map $\phi : \mathcal{S} \times \mathcal{A} \to \mathbb{R}^d$. That is, for any $h$, there exist $d$ signed measures $\mu_h = (\mu^1_h, \ldots, \mu^d_h)$ over $\mathcal{S}$ and vectors $\theta_{r,h}, \theta_{g,h} \in \mathbb{R}^d$ such that, for any $(s, a, s') \in \mathcal{S} \times \mathcal{A} \times \mathcal{S}$,

$$P_h(s' \mid s, a) = \langle \phi(s, a), \mu_h(s') \rangle; \qquad r(s, a) = \langle \phi(s, a), \theta_{r,h} \rangle; \qquad g(s, a) = \langle \phi(s, a), \theta_{g,h} \rangle.$$

Assumption 3.1 addresses the curse of dimensionality when the state space $\mathcal{S}$ is the space of distributions over $X, Y, G$. This assumption is also used by Jin et al. (2020) and Ghosh et al. (2022), and a similar assumption is made by Ding et al. (2021). This assumption is well-justified in continuous time if we allow for infinite-dimensional Hilbert spaces (Brunton et al., 2021), but in practice we require limited dimensionality $d$ for the codomain of $\phi$. In our experiments, we use a neural network to estimate a feature map $\hat{\phi}$ offline which approximately satisfies the linear MDP assumption (Appendix D.1).

3.1.1 Construction

L-UCBFair, or "LSVI-UCB for Fairness" (Algorithm 1), is based on a Least-Squares Value Iteration with an optimistic Upper-Confidence Bound, as in LSVI-UCB (Jin et al., 2020). For each $H$-step episode $k$, L-UCBFair maintains estimates for $Q^k_r$, $Q^k_g$ and a time-indexed policy $\pi^k$. In each episode, L-UCBFair updates $Q^k_r$, $Q^k_g$, interacts with the environment, and updates the dual variable $\nu_k$, which is constant over episode $k$.

LSVI-UCB (Jin et al., 2020) The estimation of $Q$ is challenging, as it is impossible to iterate over all $(s, a)$ pairs when $\mathcal{S}$ and $\mathcal{A}$ are continuous and $P$ is unknown. LSVI parameterizes $Q^*_h(s, a)$ by the linear form $w_h^\top \phi(s, a)$, as used by Jin et al. (2020), and updates

$$w_h \leftarrow \operatorname*{argmin}_{w \in \mathbb{R}^d} \sum_{\tau=1}^{k-1} \Big( r_h(s^\tau_h, a^\tau_h) + \max_{a \in \mathcal{A}} Q_{h+1}(s^\tau_{h+1}, a) - w^\top \phi(s^\tau_h, a^\tau_h) \Big)^2 + \varsigma \lVert w \rVert^2.$$

In addition, a bonus term $\beta \big( \phi^\top \Lambda_h^{-1} \phi \big)^{1/2}$ is added to the estimate of $Q$ in Algorithm 1 to encourage exploration.
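The NumPy sketch below illustrates the ridge-regression step and the optimistic bonus just described. It is a minimal illustration of the LSVI-UCB update, not the authors' implementation, and the array shapes are assumptions.

```python
import numpy as np

def lsvi_update(Phi, targets, varsigma=1.0):
    """Ridge-regression step of LSVI: given features Phi[t] = phi(s_t, a_t) (shape (T, d))
    and targets y_t = r_h(s_t, a_t) + max_a Q_{h+1}(s_{t+1}, a), return (w_h, Lambda_h)
    with Lambda_h = Phi^T Phi + varsigma * I and w_h = Lambda_h^{-1} Phi^T y."""
    d = Phi.shape[1]
    Lam = Phi.T @ Phi + varsigma * np.eye(d)
    w = np.linalg.solve(Lam, Phi.T @ targets)
    return w, Lam

def optimistic_q(phi_sa, w, Lam, beta, H):
    """Q(s, a) = min{ <w, phi(s, a)> + beta * sqrt(phi^T Lambda^{-1} phi), H }."""
    bonus = beta * np.sqrt(phi_sa @ np.linalg.solve(Lam, phi_sa))
    return min(phi_sa @ w + bonus, H)
```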
Adaptive Search Policy Compared to the works of Ding et al. (2021) and Ghosh et al. (2022), the major challenge we face for long-term fairness is a continuous action space $\mathcal{A}$, which renders the independent computation of $Q^k_r, Q^k_g$ for each action impossible. To handle this issue, we propose an adaptive search policy: Instead of drawing an action directly from a distribution over continuous values, $\pi$ represents a distribution over finitely many ($M$) Voronoi partitions of $\mathcal{A}$. After sampling a region with a softmax scheme, the agent draws an action $a$ uniformly at random from it. In addition to defining a Voronoi partitioning of the action space for the adaptive search policy of L-UCBFair (in the supplementary material), we make the following smoothness assumption, which we use to bound the error introduced by this sampling method:

Assumption 3.2 (Lipschitz). There exists $\rho > 0$ such that $\lVert \phi(s, a) - \phi(s, a') \rVert_2 \leq \rho \lVert a - a' \rVert_2$.

Dual Update For L-UCBFair, the update method for estimating $\nu^*$ in Eq. (4) is also essential. Since $V^{\pi}_{r,1}(s)$ and $V^{\pi}_{g,1}(s)$ are unknown, we use $V^k_{r,1}(s)$ and $V^k_{g,1}(s)$ to estimate them. The estimate of $\nu^*$ is iteratively updated by projected gradient steps with step-size $\eta$, increasing $\nu$ when the estimated constraint is violated and remaining subject to the assumed upper bound $V$ for $\nu^*$ (Assumption B.1). A similar method is also used in Ding et al. (2021) and Ghosh et al. (2022).

3.1.2 Analysis

We next bound the regret and distortion, defined in Eq. (3), realizable by L-UCBFair. We then compare L-UCBFair with existing algorithms for discrete action spaces and discuss the importance of the number of regions $M$ and the maximum distance $\epsilon_I$ from any action to the center of its corresponding Voronoi partition.

Theorem 3.3 (Boundedness). With probability $1 - p$, there exists a constant $b$ such that L-UCBFair (Algorithm 1) achieves

$$\mathrm{Regret}(K) \leq b\zeta H^2 \sqrt{d^3 K} + (V + 1)H\sqrt{K}; \qquad \mathrm{Drtn}(K) \leq b\zeta(1 + V)H^2 \sqrt{d^3 K}$$

for parameter values $\varsigma = 1$, $\epsilon_I \leq \frac{1}{2\rho(1+V)KH\sqrt{d}}$, $\zeta = \log\big(\log(M)\, 4dHK/p\big)$, and $\beta = \mathcal{O}(dH\sqrt{\zeta})$.

For a detailed proof of Theorem 3.3, refer to Appendix B.1. This theorem indicates that L-UCBFair achieves $\mathcal{O}\big(H^2\sqrt{d^3 K}\big)$ bounds for both regret and distortion with high probability. Compared to the algorithms introduced by Ding et al. (2021) and Ghosh et al. (2022), which work with discrete action spaces, L-UCBFair guarantees the same asymptotic bounds on regret and distortion.

3.2 R-TD3

Because the assumptions in Section 3.1 (e.g., the linear MDP assumption) are often violated in the real world, we also consider deep reinforcement learning methods as a flexible alternative. Concretely, we apply Twin-Delayed Deep Deterministic Policy Gradient (TD3) (Fujimoto et al., 2018), with an implementation and default parameters provided by the open-source package Stable Baselines3 (Raffin et al., 2021), to a Lagrangian relaxation of the long-term fairness problem with time-dependent multipliers for the reward and utility terms (Eq. (6)). We term this specific algorithm for long-term fairness R-TD3. While such methods lack provable safety guarantees, they may still confirm our hypothesis that agents trained via RL can learn to sacrifice short-term utility in order to drive dynamics towards preferable long-term states and explicitly incorporate dynamical control objectives provided as functions of state.

To treat the long-term fairness problem (Eq. (1)) using unconstrained optimization techniques (i.e., methods like TD3), we consider a time-dependent Lagrangian relaxation: We train R-TD3 to optimize

$$\min_\pi\ \mathbb{E}_{a_\tau \sim \pi(s_\tau)} \sum_{\tau=1}^{H} \Big[ (1 - \lambda_\tau)\,\mathcal{L}(s_\tau, a_\tau) + \lambda_\tau\,\mathcal{D}(s_\tau, a_\tau) \Big], \tag{6}$$

where $s_{\tau+1} \sim P(s_\tau, a_\tau)$ and $\lambda_\tau = \tau/H$. Strictly applied, myopic fairness constraints can lead to undesirable dynamics and equilibria (Raab and Liu, 2021). Eq. (6) relaxes these constraints (hard to soft) for the near future while emphasizing them in the long term. Thus, we hope to develop classifiers that learn to transition to more favorable equilibria.
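A minimal sketch of the scalarization in Eq. (6): the unconstrained agent is trained on a shaped reward whose fairness weight grows linearly over the episode. The function below is illustrative only and is not taken from the released R-TD3 code.

```python
def shaped_reward(loss_val, disparity_val, tau, H):
    """Negated time-dependent Lagrangian of Eq. (6) at step tau of an H-step episode."""
    lam = tau / H                       # lambda_tau = tau / H
    return -((1.0 - lam) * loss_val + lam * disparity_val)

# Example: early in the episode the agent is rewarded mostly for low loss,
# late in the episode mostly for low disparity.
rewards = [shaped_reward(0.3, 0.1, tau, H=100) for tau in range(1, 101)]
```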
Algorithm 1: L-UCBFair
Input: a set of loci $\{I_0, I_1, \ldots, I_M\}$ satisfying Definition B.2 with $\epsilon_I = \frac{1}{2\rho(1+\chi)KH}$; $\nu_1 = 0$; $w_{r,h} = w_{g,h} = 0$; $\alpha = \frac{\log(M)K}{2(1+\chi+H)}$; $\eta = \chi/\sqrt{KH^2}$; $\beta = C_1 d H \sqrt{\log\big(4 \log(M)\, d T / p\big)}$; $\varsigma = 1$.
for episode $k = 1, 2, \ldots, K$ do
    Receive the initial state $s^k_1$.
    for step $h = H, H-1, \ldots, 1$ do
        $\Lambda^k_h \leftarrow \sum_{\tau=1}^{k-1} \phi(s^\tau_h, a^\tau_h)\,\phi(s^\tau_h, a^\tau_h)^\top + \varsigma I$
        for $j \in \{r, g\}$ do
            $w^k_{j,h} \leftarrow (\Lambda^k_h)^{-1} \sum_{\tau=1}^{k-1} \phi(s^\tau_h, a^\tau_h)\big[\, j(s^\tau_h, a^\tau_h) + V^k_{j,h+1}(s^\tau_{h+1}) \,\big]$
        end for
        for $i = 1, \ldots, M$ and $j \in \{r, g\}$ do
            $\xi_{i,j} \leftarrow \big(\phi(\cdot, I_i)^\top (\Lambda^k_h)^{-1} \phi(\cdot, I_i)\big)^{1/2}$; $\quad Q^k_{j,h}(\cdot, I_i) \leftarrow \min\big\{ \langle w^k_{j,h}, \phi(\cdot, I_i) \rangle + \beta \xi_{i,j},\ H \big\}$
        end for
        $\mathrm{SM}_{h,k}(I_i \mid \cdot) = \dfrac{\exp\big(\alpha(Q^k_{r,h}(\cdot, I_i) + \nu_k Q^k_{g,h}(\cdot, I_i))\big)}{\sum_j \exp\big(\alpha(Q^k_{r,h}(\cdot, I_j) + \nu_k Q^k_{g,h}(\cdot, I_j))\big)}$
        $\pi^k_h(a \mid \cdot) \leftarrow \dfrac{\mathrm{SM}_{h,k}(I(a) \mid \cdot)}{\int_{b \in \text{region of } I(a)} db}$ (uniform density within the sampled Voronoi region)
        $V^k_{r,h}(\cdot) \leftarrow \int_{a \in \mathcal{A}} \pi^k_h(a \mid \cdot)\, Q^k_{r,h}(\cdot, a)\, da$; $\quad V^k_{g,h}(\cdot) \leftarrow \int_{a \in \mathcal{A}} \pi^k_h(a \mid \cdot)\, Q^k_{g,h}(\cdot, a)\, da$
    end for
    for step $h = 1, \ldots, H$ do
        Compute $Q^k_{r,h}(s^k_h, I_i)$, $Q^k_{g,h}(s^k_h, I_i)$, $\pi^k_h(I_i \mid s^k_h)$.
        Take action $a^k_h \sim \pi^k_h(\cdot \mid s^k_h)$ and observe $s^k_{h+1}$.
    end for
    $\nu_{k+1} = \max\big\{ \min\big\{ \nu_k + \eta\big(c - V^k_{g,1}(s_1)\big),\ V \big\},\ 0 \big\}$
end for

3.3 Baselines

We compare L-UCBFair and R-TD3 to a greedy agent as a proxy for a myopic status quo in which policy is repeatedly determined by optimizing for immediate utility, without regard for the population dynamics. This standard is known as Repeated Risk Minimization (Perdomo et al., 2020; Hu and Zhang, 2022), and we implement it using simple gradient descent on the different classes of ($\lambda$-parameterized) objective functions $f$ we consider (Eq. (7)). Having adopted a notion of fairness that relies on "groups", we presuppose different groups of agents indexed by $g$ and denote group-conditioned loss as $\mathcal{L}_g$. The objectives that are approximately minimized for each baseline are

$$\text{Myopic:} \quad f(\pi) = \mathbb{E}_{a \sim \pi}\big[\mathcal{L}(s, a)\big]. \tag{7a}$$
$$\text{Myopic-Fair:} \quad f_\lambda(\pi) = \mathbb{E}_{a \sim \pi}\big[(1-\lambda)\mathcal{L}(s, a) + \lambda \mathcal{D}(s, a)\big], \quad \lambda \in (0, 1). \tag{7b}$$
$$\text{Maxmin:} \quad f(\pi) = \mathbb{E}_{a \sim \pi}\big[\max_g \mathcal{L}_g(s, a)\big]. \tag{7c}$$
$$\text{Maxmin-Fair:} \quad f_\lambda(\pi) = \mathbb{E}_{a \sim \pi}\big[(1-\lambda)\max_g \mathcal{L}_g(s, a) + \lambda \mathcal{D}(s, a)\big], \quad \lambda \in (0, 1). \tag{7d}$$

The two "Maxmin" objectives are related to distributionally robust optimization, which has been previously explored in the context of fairness (Hashimoto et al., 2018), while the two "Myopic" objectives are more straightforward. While our baselines do not guarantee constraint satisfaction, the two objectives labelled "-Fair" are nonetheless constraint-aware, in precisely the same way as a firm that (probabilistically) incurs penalties for violating constraints.

4 Simulated Environments

To evaluate the proposed algorithms and baselines, we consider a series of binary ($Y \in \{-1, 1\}$) classification tasks on a population of two groups $\mathcal{G} = \{g_1, g_2\}$ modeled according to evolutionary game theory (using replicator dynamics, as described in Appendix A.1, in the supplementary material). We consider two families of distributions of real-valued features for the population: one that is purely synthetic, for which $X \sim \mathcal{N}(Y, 1)$, independent of group $G$, and one that is based on logistic regressions to real-world data, described in Appendix A.2 (in the supplementary material). Both families of distributions over $X$ are parameterized by the joint distribution $\Pr(Y, G)$. RL agents are trained on episodes of length $H$ initialized with randomly sampled states.
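As an illustration of the purely synthetic family, the sketch below computes in closed form the group acceptance rate and true-positive rate induced by a threshold $a_g$ when $X \sim \mathcal{N}(Y, 1)$ within each group, so that the state is summarized by the qualification rates $(q_1, q_2)$ alone. This is an assumption-laden illustration, not the simulator used in our experiments.

```python
from scipy.stats import norm

def group_rates(q_g, a_g):
    """Return Pr(Yhat=1 | G=g) and Pr(Yhat=1 | Y=1, G=g) for threshold a_g,
    assuming X ~ N(+1, 1) for qualified and X ~ N(-1, 1) for unqualified individuals."""
    acc_qualified = 1.0 - norm.cdf(a_g - 1.0)      # Pr(X >= a_g | Y = +1)
    acc_unqualified = 1.0 - norm.cdf(a_g + 1.0)    # Pr(X >= a_g | Y = -1)
    acceptance = q_g * acc_qualified + (1.0 - q_g) * acc_unqualified
    return acceptance, acc_qualified
```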
In order to better handle continuous state space, we make the following assumption, which has been used to simplify similar synthetic environments (Raab and Liu, 2021):

Assumption 4.1 (Well-behaved feature). For the purely synthetic setting, we require $X$ to be a well-behaved, real-valued feature within each group. That is, for each group $g$, $\Pr(Y = 1 \mid G = g, X = x)$ strictly increases in $x$.

As an intuitive example of Assumption 4.1, if $Y$ represents qualification for a fixed loan and $X$ represents credit score, we require higher credit scores to imply a higher likelihood that an individual is qualified for the loan.

Theorem 4.2 (Threshold Bayes-optimality). For each group $g$, when Assumption 4.1 is satisfied, the Bayes-optimal, deterministic binary classifier is a threshold policy described by a feature threshold $a_g$ for group $g$. That is, if $X \geq a_g$, then $\hat{Y} = 1$; otherwise, $\hat{Y} = -1$.

As a result of Theorem 4.2, we consider our action space to be the space of group-specific thresholds and denote an individual action as the vector $a := (a_1, a_2)$. Nonetheless, we note that Assumption 4.1 is often violated in practice, as it is in our semi-synthetic setting.

5 Experimental Results

Do RL agents learn to seek favorable equilibria against short-term utility? Is the Lagrangian relaxation of Eq. (1) sufficient to encourage this behavior? We give positive demonstrations for both questions.

5.1 Losses and Disparities Considered

Our experiments consider losses $\mathcal{L}$ which combine true-positive and true-negative rates, i.e.,

$$\mathcal{L}(s, a) = 1 - \alpha\,\mathrm{tp}(s, a) - \beta\,\mathrm{tn}(s, a); \qquad r(s, a) = \alpha\,\mathrm{tp}(s, a) + \beta\,\mathrm{tn}(s, a), \tag{8}$$

where $\alpha, \beta \in [0, 1]$; $\mathrm{tp}(s, a) = \Pr_{s,a}(\hat{Y} = 1, Y = 1)$; and $\mathrm{tn}(s, a) = \Pr_{s,a}(\hat{Y} = -1, Y = -1)$. For disparity $\mathcal{D}$, we consider functions of the form $\mathcal{D}(s, a) = \frac{1}{2}\lVert \xi_{s,a}(g_1) - \xi_{s,a}(g_2) \rVert_2$ that measure violations of demographic parity (DP) (Dwork et al., 2012), equal opportunity (EOp) (Hardt et al., 2016), equalized odds (EO), and qualification rate parity (QR), described in Table 1. We note that QR is inconsequential for the baseline agents, which ignore the mutability of the population state.

Func. | $\xi_{s,a}(g)$
DP | $\Pr_{s,a}(\hat{Y} = 1 \mid G = g)$
QR | $\Pr_{s,a}(Y = 1 \mid G = g)$
EOp | $\Pr_{s,a}(\hat{Y} = 1 \mid Y = 1, G = g)$
EO | $\big(\Pr_{s,a}(\hat{Y} = 1 \mid Y = 1, G = g),\ \Pr_{s,a}(\hat{Y} = 1 \mid Y = -1, G = g)\big)$

Table 1: Disparity functions, where $\mathcal{D}(s, a) = \frac{1}{2}\lVert \xi_{s,a}(g_1) - \xi_{s,a}(g_2) \rVert_2$.
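The sketch below computes the group-conditional statistics $\xi_{s,a}(g)$ of Table 1 and the corresponding disparity from a tabulated joint distribution $\Pr(\hat{Y}, Y \mid G = g)$; the 2x2 array layout is a hypothetical encoding chosen for illustration.

```python
import numpy as np

# P_g[yhat, y] = Pr(Yhat = yhat, Y = y | G = g), indexed as {-1: 0, +1: 1}.

def xi(P_g, kind):
    p_y1 = P_g[:, 1].sum()                          # Pr(Y = 1 | G = g)
    if kind == "DP":
        return np.array([P_g[1, :].sum()])          # Pr(Yhat = 1 | G = g)
    if kind == "QR":
        return np.array([p_y1])                     # Pr(Y = 1 | G = g)
    if kind == "EOp":
        return np.array([P_g[1, 1] / p_y1])         # Pr(Yhat = 1 | Y = 1, G = g)
    if kind == "EO":
        return np.array([P_g[1, 1] / p_y1,          # true-positive rate
                         P_g[1, 0] / (1 - p_y1)])   # false-positive rate
    raise ValueError(kind)

def disparity(P_g1, P_g2, kind):
    """D(s, a) = 0.5 * || xi(g1) - xi(g2) ||_2, as in Table 1."""
    return 0.5 * np.linalg.norm(xi(P_g1, kind) - xi(P_g2, kind))
```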
5.2 Results

Our experiments show that algorithms trained with an RL formulation of long-term fairness can drive a reactive population toward states with higher utility and fairness than myopic policies. In Fig. 1, the baseline policies (i.e., subfigures (a) and (b)), which focus on short-term incentives, drive disparate qualification rates and attain lower long-term utility than the RL agents, indicating that short-term utility is misaligned with desirable dynamics in this example. In subfigure (b) in particular, the baseline policy increasingly violates the static fairness condition with time, agreeing with previous results by Raab and Liu (2021). Meanwhile, the RL algorithms (subfigures (c) and (d)) learn to drive universally high qualification rates, thus allowing policies that capture higher utility with time (subfigure (e); see also Fig. 3).

[Figure 1 panels: (a) Maxmin-Fair, (tp, DP), λ=1; (b) Myopic-Fair, (tp, DP), λ=1; (c) L-UCBFair, (tp, DP); (d) R-TD3, (tp, DP); axes are group 1 and group 2 qualification rates, shaded by demographic disparity; (e) episodic mean loss for Maxmin, Maxmin-Fair, Myopic, Myopic-Fair, L-UCBFair, and R-TD3.]

Figure 1: Using a modeled population with scalar features fit to the Adult dataset (Dua and Graff, 2017) at each time step to mirror the evolving qualification rates (Appendix A.2), we compare our baseline algorithms to L-UCBFair and R-TD3 on the same set of 16 initial states with the task of maximizing true positive rates (tp) subject to demographic parity (DP). Plots (a-d) depict evolving group qualification rates under each algorithm with streamlines (red arrows indicate the authors' emphasis), while shading indicates immediate violations of demographic parity. We remark that the corresponding plots for the non-Fair baselines are qualitatively indistinct and omit them for space. In subplot (e), we visualize mean episode loss, where H=150, for each algorithm.

Our central hypothesis, that long-term fairness via RL may induce an algorithm to sacrifice short-term utility for better long-term outcomes, is clearly demonstrated in the purely synthetic environment depicted by subfigures (a1-a3) of Fig. 2, in which the environment provides higher immediate utility (true-positive rates) but lower long-term utility when a policy chooses initially high acceptance rates. In this case, the RL algorithm (subfigure (a2)) drives the system towards high qualification rates by giving up the immediate utility maximized by the myopic agent (subfigure (a1)).

[Figure 2 panels: (a1) Myopic-Fair, (tp, DP), λ=1; (a2) L-UCBFair, (tp, DP); (b1) L-UCBFair, (tp, QR); (b2) R-TD3, (tp, QR); axes are group 1 and group 2 qualification rates, shaded by disparity (demographic parity or qualification rate); (a3) episodic mean loss for task (tp, DP); (b3) episodic mean loss for task (tp, QR).]

Figure 2: Subfigures (a1-a3) compare a baseline policy to L-UCBFair in a purely synthetic environment on the task of maximizing the fraction of true-positive classifications (tp) subject to bounded demographic parity violation (DP). Episode length in this environment is H = 100. Subfigures (b1-b3) compare L-UCBFair to R-TD3 on a similar task subject to bounded qualification rate disparity instead. In both environments, modeled agent utilities translate higher acceptance rates to lower qualification rates. Figures (a1, a2) and (b1, b2) depict evolving qualification rates with streamlines (red arrows indicate the authors' emphasis) and use shading to indicate fairness violations.

With subfigures (b1-b3) of Fig. 2, we demonstrate the capability of RL to incorporate notions of fairness (e.g., qualification rate parity, QR) that are impossible to formulate in the myopic setting. In subfigures (b1-b3), both RL agents learn to satisfy qualification rate parity by driving the state of the population towards equal qualification rates by group.
Finally, our experiments also show that RL algorithms without theoretical guarantees may be applicable to long-term fairness. In Fig. 2, subfigure (b2), R-TD3 achieves qualitative behavior similar to L-UCBFair (subfigure (b1)), that is, driving high qualification rates at the expense of short-term utility, while achieving lower episodic mean loss (subfigure (b3)).

[Figure 3 panels: episodic mean loss (left) and episodic mean disparity (right) versus training iterations.]

Figure 3: The mean episodic loss (left) and disparity (right) within a 20-step sliding window obtained by L-UCBFair for the (tp, DP) setting of Fig. 1 (a2). We emphasize that both loss and disparity decrease in time, as required for the sublinear regret and distortion guaranteed by Theorem 3.3.

[Figure 4 panels, task (tp + 0.8 tn, DP): accumulated disparity and accumulated loss as functions of group 1 and group 2 initial qualification rates; panels (a) and (c) show TD3 with no constraint.]

Figure 4: Comparison of total accumulated disparity (blue) and loss (orange) over a single episode of H = 150 steps as functions of initial state (initial qualification rates), for R-TD3 and TD3 (no constraint; λ = 0) in the semi-synthetic environment (Adult dataset). This figure indicates that the increasingly weighted fairness term in the objective of R-TD3 can play a role in more rapidly reducing cumulative disparity, in this case by driving the system towards the line of equal qualification rates, at the cost of increased cumulative loss.

Limitations Despite the potential, highlighted by our experiments, for RL formulations of fairness to drive positive long-term social outcomes, it is not possible to truly validate such approaches without deployment on actual populations, which may be practically and ethically fraught. In addition, violations of fairness or decreased utility may be difficult to justify to affected populations and stakeholders, especially when the bounds provided by L-UCBFair, while as tight as any known, rely on assumptions that may be violated in practice.

Concluding remarks Machine learning techniques are frequently deployed in settings in which affected populations will react to the resulting policy, closing a feedback loop that must be accounted for. In such settings, algorithms that prioritize immediate utility or static notions of fairness may yield dynamics that are misaligned with these objectives in the long term. In this paper, we have reformulated long-term fairness as an online reinforcement learning problem (Eq. (1)) to address the importance of dynamics. We have shown that algorithmic solutions to this problem (e.g., L-UCBFair) are capable of simultaneous theoretical guarantees regarding cumulative loss and disparity (violations of fairness). We have also shown that these guarantees can be relaxed in practice to accommodate a wider class of RL algorithms, such as R-TD3. Finally, we emphasize again that the RL framework of long-term fairness allows notions of disparity inherent to the state of a population to be explicitly treated, while such definitions are inoperable in the standard, myopic framework of fairness. We hope that our contributions spur interest in long-term mechanisms and incentive structures for machine learning to be a driver of positive social change.
Acknowledgements This work is partially supported by the National Science Foundation (NSF) under grants IIS-2143895, IIS-2112471, IIS-2040800, and CCF-2023495. Peter Auer, Nicolo Cesa-Bianchi, and Paul Fischer. Finite-time analysis of the multiarmed bandit problem. Machine learning, 47(2):235 256, 2002. Qinbo Bai, Amrit Singh Bedi, Mridul Agarwal, Alec Koppel, and Vaneet Aggarwal. Achieving zero constraint violation for constrained reinforcement learning via primal-dual approach. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 36, pages 3682 3689, 2022. Kianté Brantley, Miro Dudik, Thodoris Lykouris, Sobhan Miryoosefi, Max Simchowitz, Aleksandrs Slivkins, and Wen Sun. Constrained episodic reinforcement learning in concave-convex and knapsack settings. Advances in Neural Information Processing Systems, 33:16315 16326, 2020. Steven L Brunton, Marko Budiši c, Eurika Kaiser, and J Nathan Kutz. Modern koopman theory for dynamical systems. ar Xiv preprint ar Xiv:2102.12086, 2021. Allison JB Chaney, Brandon M Stewart, and Barbara E Engelhardt. How algorithmic confounding in recommendation systems increases homogeneity and decreases utility. In Proceedings of the 12th ACM Conference on Recommender Systems, pages 224 232. ACM, 2018. Yatong Chen, Reilly Raab, Jialu Wang, and Yang Liu. Fairness transferability subject to bounded distribution shift. ar Xiv preprint ar Xiv:2206.00129, 2022. Stephen Coate and Glenn C Loury. Will affirmative-action policies eliminate negative stereotypes? The American Economic Review, pages 1220 1240, 1993. Sam Corbett-Davies and Sharad Goel. The measure and mismeasure of fairness: A critical review of fair machine learning. ar Xiv preprint ar Xiv:1808.00023, 2018. Kate Crawford and Ryan Calo. There is a blind spot in AI research. Nature News, 538(7625):311, 2016. Alexander D Amour, Hansa Srinivasan, James Atwood, Pallavi Baljekar, D Sculley, and Yoni Halpern. Fairness is not static: Deeper understanding of long term fairness via simulation studies. In Proceedings of the 2020 Conference on Fairness, Accountability, and Transparency, pages 525 534, 2020. Dongsheng Ding, Kaiqing Zhang, Tamer Basar, and Mihailo Jovanovic. Natural policy gradient primal-dual method for constrained markov decision processes. Advances in Neural Information Processing Systems, 33:8378 8390, 2020. Dongsheng Ding, Xiaohan Wei, Zhuoran Yang, Zhaoran Wang, and Mihailo Jovanovic. Provably efficient safe exploration via primal-dual policy optimization. In International Conference on Artificial Intelligence and Statistics, pages 3304 3312. PMLR, 2021. Dheeru Dua and Casey Graff. UCI machine learning repository, 2017. URL http://archive.ics. uci.edu/ml. Cynthia Dwork, Moritz Hardt, Toniann Pitassi, Omer Reingold, and Richard Zemel. Fairness through awareness. In Proceedings of the 3rd innovations in theoretical computer science conference, pages 214 226, 2012. Yonathan Efroni, Shie Mannor, and Matteo Pirotta. Exploration-exploitation in constrained mdps. ar Xiv preprint ar Xiv:2003.02189, 2020. Danielle Ensign, Sorelle A Friedler, Scott Neville, Carlos Scheidegger, and Suresh Venkatasubramanian. Runaway feedback loops in predictive policing. In Conference of Fairness, Accountability, and Transparency, 2018. Alessandro Epasto, Mohammad Mahdian, Vahab Mirrokni, and Emmanouil Zampetakis. Optimal approximation-smoothness tradeoffs for soft-max functions. Advances in Neural Information Processing Systems, 33:2651 2660, 2020. 
Michael Feldman, Sorelle A Friedler, John Moeller, Carlos Scheidegger, and Suresh Venkatasubramanian. Certifying and removing disparate impact. In proceedings of the 21th ACM SIGKDD international conference on knowledge discovery and data mining, pages 259 268, 2015. Scott Fujimoto, Herke Hoof, and David Meger. Addressing function approximation error in actorcritic methods. In International conference on machine learning, pages 1587 1596. PMLR, 2018. Andreas Fuster, Paul Goldsmith-Pinkham, Tarun Ramadorai, and Ansgar Walther. Predictably unequal? The effects of machine learning on credit markets. The Effects of Machine Learning on Credit Markets, 2018. Arnob Ghosh, Xingyu Zhou, and Ness Shroff. Provably efficient model-free constrained rl with linear function approximation. ar Xiv preprint ar Xiv:2206.11889, 2022. Moritz Hardt, Eric Price, and Nathan Srebro. Equality of opportunity in supervised learning, 2016. Tatsunori Hashimoto, Megha Srivastava, Hongseok Namkoong, and Percy Liang. Fairness without demographics in repeated loss minimization. In International Conference on Machine Learning, pages 1929 1938. PMLR, 2018. Hoda Heidari, Vedant Nanda, and Krishna P. Gummadi. On the long-term impact of algorithmic decision policies: Effort unfairness and feature segregation through social learning. the International Conference on Machine Learning (ICML), 2019. Lily Hu and Yiling Chen. A short-term intervention for long-term fairness in the labor market. In Proceedings of the 2018 World Wide Web Conference on World Wide Web, pages 1389 1398. International World Wide Web Conferences Steering Committee, 2018. Lily Hu, Nicole Immorlica, and Jennifer Wortman Vaughan. The disparate effects of strategic manipulation. In Proceedings of the Conference on Fairness, Accountability, and Transparency, pages 259 268, 2019. Yaowei Hu and Lu Zhang. Achieving long-term fairness in sequential decision making. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 36, pages 9549 9557, 2022. Shahin Jabbari, Matthew Joseph, Michael Kearns, Jamie Morgenstern, and Aaron Roth. Fairness in reinforcement learning. In International conference on machine learning, pages 1617 1626. PMLR, 2017. Chi Jin, Zhuoran Yang, Zhaoran Wang, and Michael I Jordan. Provably efficient reinforcement learning with linear function approximation. In Conference on Learning Theory, pages 2137 2143. PMLR, 2020. Matthew Joseph, Michael Kearns, Jamie H Morgenstern, and Aaron Roth. Fairness in learning: Classic and contextual bandits. Advances in neural information processing systems, 29, 2016. Krishna C Kalagarla, Rahul Jain, and Pierluigi Nuzzo. A sample-efficient algorithm for episodic finitehorizon mdp with constraints. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, pages 8030 8037, 2021. Lydia T Liu, Sarah Dean, Esther Rolf, Max Simchowitz, and Moritz Hardt. Delayed impact of fair machine learning. In International Conference on Machine Learning, pages 3150 3158. PMLR, 2018. Lydia T Liu, Ashia Wilson, Nika Haghtalab, Adam Tauman Kalai, Christian Borgs, and Jennifer Chayes. The disparate equilibria of algorithmic decision making when individuals invest rationally. In Proceedings of the 2020 Conference on Fairness, Accountability, and Transparency, pages 381 391, 2020. Tao Liu, Ruida Zhou, Dileep Kalathil, Panganamala Kumar, and Chao Tian. Learning policies with zero or bounded constraint violation for constrained mdps. Advances in Neural Information Processing Systems, 34:17183 17193, 2021. 
Yang Liu, Goran Radanovic, Christos Dimitrakakis, Debmalya Mandal, and David C Parkes. Calibrated fairness in bandits. arXiv preprint arXiv:1707.01875, 2017.

Hussein Mouzannar, Mesrob I Ohannessian, and Nathan Srebro. From fair decision making to social equality. In Proceedings of the Conference on Fairness, Accountability, and Transparency, pages 359-368. ACM, 2019.

Ling Pan, Qingpeng Cai, Qi Meng, Wei Chen, Longbo Huang, and Tie-Yan Liu. Reinforcement learning with dynamic Boltzmann softmax updates. arXiv preprint arXiv:1903.05926, 2019.

Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. PyTorch: An imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems 32, pages 8024-8035. Curran Associates, Inc., 2019.

Juan Perdomo, Tijana Zrnic, Celestine Mendler-Dünner, and Moritz Hardt. Performative prediction. In International Conference on Machine Learning, pages 7599-7609. PMLR, 2020.

Reilly Raab and Yang Liu. Unintended selection: Persistent qualification rate disparities and interventions. Advances in Neural Information Processing Systems, 34:26053-26065, 2021.

Antonin Raffin, Ashley Hill, Adam Gleave, Anssi Kanervisto, Maximilian Ernestus, and Noah Dormann. Stable-Baselines3: Reliable reinforcement learning implementations. Journal of Machine Learning Research, 22(268):1-8, 2021.

Rahul Singh, Abhishek Gupta, and Ness B Shroff. Learning in Markov decision processes under constraints. arXiv preprint arXiv:2002.12435, 2020.

Wei Tang, Chien-Ju Ho, and Yang Liu. Bandit learning with delayed impact of actions. Advances in Neural Information Processing Systems, 34:26804-26817, 2021.

Honghao Wei, Xin Liu, and Lei Ying. Triple-Q: A model-free algorithm for constrained reinforcement learning with sublinear regret and zero constraint violation. In International Conference on Artificial Intelligence and Statistics, pages 3274-3307. PMLR, 2022.

Min Wen, Osbert Bastani, and Ufuk Topcu. Fairness with dynamics. arXiv preprint arXiv:1901.08568, 2019.

Joshua Williams and J Zico Kolter. Dynamic modeling and equilibria in fair decision making. arXiv preprint arXiv:1911.06837, 2019.

Tengyu Xu, Yingbin Liang, and Guanghui Lan. CRPO: A new approach for safe reinforcement learning with convergence guarantee. In International Conference on Machine Learning, pages 11480-11491. PMLR, 2021.

Rich Zemel, Yu Wu, Kevin Swersky, Toni Pitassi, and Cynthia Dwork. Learning fair representations. In International Conference on Machine Learning, pages 325-333. PMLR, 2013.

Xueru Zhang, Ruibo Tu, Yang Liu, Mingyan Liu, Hedvig Kjellström, Kun Zhang, and Cheng Zhang. How do fair decisions fare in long-term qualification? arXiv preprint arXiv:2010.11300, 2020.

Liyuan Zheng and Lillian Ratliff. Constrained upper confidence reinforcement learning. In Learning for Dynamics and Control, pages 620-629. PMLR, 2020.

A Simulated Environments: Supplementary

A.1 Replicator Dynamics

Our use of replicator dynamics closely mirrors that of Raab and Liu (2021) as an equitable model of a population: Individuals may be modeled identically, independently of group membership, yet persistent outcome disparities may emerge from disparate initial conditions between groups.
For our experiments, we parameterize the evolving distribution $\Pr(X, Y \mid G)$, assuming constant group sizes, in terms of qualification rates $q_g := \Pr(Y = 1 \mid G = g)$ and update these qualification rates according to the discrete-time replicator dynamics:

$$q_g[t+1] = q_g[t]\,\frac{W^g_1[t]}{\bar{W}^g[t]}; \qquad \bar{W}^g[t] := W^g_1\, q_g + (1 - q_g)\, W^g_{-1}.$$

In this model, the fitness $W^g_y > 0$ of label $Y = y$ in group $G = g$ may be interpreted as the average utility to an individual in group $g$ of possessing label $y$, and thus the relative replication rate of label $y$ in group $g$, as agents update their labels by mimicking the successful strategies of in-group peers. We model $W^g_y$ in terms of acceptance and rejection rates with a group-independent utility matrix $U$:

$$W^g_y = \sum_{\hat{y} \in \{-1, 1\}} U_{y, \hat{y}}\, \Pr\big(\hat{Y} = \hat{y} \mid Y = y, G = g\big).$$

We choose the matrix $U$ to eliminate dominant strategies (i.e., when agents universally prefer one label over another, independent of classification), to assert that agents always prefer acceptance over rejection, and to imply that the costs of qualification are greater than the costs of non-qualification among accepted individuals. While other parameterizations of $U$ are valid, this choice of parameters guarantees internal equilibrium of the replicator dynamics for a Bayes-optimal classifier and a well-behaved, scalar-valued feature $X$, such that $\Pr(Y = 1 \mid X = x)$ is monotonically increasing in $x$ (Raab and Liu, 2021).

A.2 Data Synthesis and Processing

In addition to a synthetic distribution, for which we assume $X \sim \mathcal{N}(Y, 1)$, independent of $G$, for all time, we also consider real-world distributions when simulating and comparing algorithms for long-term fairness. In both cases, as mentioned above, we wish to parameterize distributions in terms of qualification rates $q_g$. As we perform binary classification on discrete groups and scalar-valued features, in addition to parameterizing a distribution in terms of $q_g$, we desire a scalar-valued feature for each example, rather than the multi-dimensional features common to real-world data. Our solution is to use an additional learning step for preprocessing: Given a static dataset $\mathcal{D}$ from which $(X', Y, G)$ is drawn i.i.d. (e.g., the Adult Data Set (Dua and Graff, 2017)), where $G$ varies over an individual's sex and $Y$ corresponds to whether an individual has more than 50,000 USD in annual income, at each time step we train a stochastic binary classifier $a$, such that $\hat{Y} \sim a(X', G)$, with a loss that re-weights examples by label value according to $q_g$. That is, at each time step, we solve

$$\min_a\ \mathbb{E}_{a, \mathcal{D}}\big[\, w(X', Y, G)\, L(Y, \hat{Y}) \,\big], \qquad w(X', Y, G) = \frac{1 + Y}{2} \cdot \frac{q_g}{\Pr(Y = 1 \mid G)} + \frac{1 - Y}{2} \cdot \frac{1 - q_g}{\Pr(Y = -1 \mid G)},$$

and $L$ is zero-one loss. In our experiments, we choose $a$ according to logistic regression. We interpret $\Pr(\hat{Y} = 1)$, drawn from the learned preprocessing function $a$, as a new, scalar feature value $X \in \mathbb{R}$ mapped from the higher-dimensional features $X'$. The threshold policy ultimately operates on $X$.

Assumption 4.1 is as hard to satisfy in general as solving the Bayes-optimal binary classification task over higher-dimensional features. Nonetheless, we expect Assumption 4.1 to be approximately satisfied by such a preprocessing pipeline. It is important to note that retraining a logistic classifier at each time step can yield very different qualitative behavior when compared to the static, purely synthetic distribution, as the distribution of $X$ conditioned on $Y$ and $G$ also mutates, and Assumption 4.1 may be violated.
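Referring back to the replicator update of Appendix A.1, the sketch below implements one discrete-time step for a single group. The utility matrix values shown are illustrative placeholders, not the parameters used in our experiments.

```python
import numpy as np

# Placeholder utility matrix (NOT the authors' values): U[y, yhat] is the utility of
# holding label y when classified as yhat, with labels indexed as {-1: 0, +1: 1}.
U = np.array([[0.2, 1.0],    # y = -1: utility when rejected / accepted
              [0.1, 0.9]])   # y = +1: utility when rejected / accepted

def replicator_step(q_g, accept_given_y):
    """One replicator update of q_g = Pr(Y=1 | G=g).
    accept_given_y: dict mapping y in {-1, +1} to Pr(Yhat=1 | Y=y, G=g) under action a."""
    W = {}
    for i, y in enumerate((-1, 1)):
        p_acc = accept_given_y[y]
        # Fitness W^g_y = sum_yhat U[y, yhat] * Pr(Yhat = yhat | Y = y, G = g)
        W[y] = U[i, 0] * (1 - p_acc) + U[i, 1] * p_acc
    W_bar = q_g * W[1] + (1 - q_g) * W[-1]     # mean fitness in group g
    return q_g * W[1] / W_bar                  # q_g[t+1] = q_g[t] * W^g_1 / W_bar^g
```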
B Deferred Proofs Without loss of generality, we assume ϕ(s, a) 1 for all (s, a) S A, and max { µh(S) , θh } d for all h [H]. Assumption B.1 (V bound for ν ). We define a parameter V that is assumed to bound the optimal value of ν in Eq. (4). For π and γ > 0 satisfying Slater s Condition (Assumption 2.1), ν V π r,1 (s1) V π r,1 (s1) γ H In Assumption B.1, V = H/γ, upper-bounds the optimal dual variable ν as an input to L-UCBFair. Definition B.2. Given a set of distinct actions I = {I0, , IM} A, where A is a closed set in Euclidean space, define Ii = {a: a Ii 2 a Ij 2, j = i} as the subset of actions closer to Ii than to Ij, i.e., the Voronoi region corresponding to locus Ii, with tie-breaking imposed by the order of indices i. Also define the locus function I(a) = mini argmin Ii a Ii 2. Lemma B.3. The Voronoi partitioning described above satisfies Ii Ij = , i = j and M i=1Ii = A. Proof. We will begin by proving Ii Ij = for all i = j. To establish this, we will assume the contrary, that there exist indices i and j such that Ii Ij = . Without loss of generality, assume i > j. We will denote an arbitrary action within the interaction of Ii and Ij as a A. Considering that a Ii, according to the given Definition B.2, we can infer that |a Ii|2 < |a Ij|2 (since i > j). However, this assertion contradicts the fact that a Ij, which implies |a Ij|2 |a Ii|2. Therefore, Ii Ij = for all i = j. We then proof M i=1Ii = A. Since A is a closed set, for any a A, there must be a i {1, 2, , M}, such that da,i = a Ii 2 a Ij 2 = da,j, j. If da,i < da,j strictly holds for all j, then a Ii. Otherwise define a set J = {j|da,j = da,i}, then a Ij where j = argminj J j. Theorem B.4. If the number M of distinct loci or regions partitioning A is sufficiently large, there exists a finite set of loci I such that a Ii, i M, a Ii 2 ϵI. Proof. Since A is closed in euclidean space, denote N the dimension, d = supa,a A a a 2. Define an Orthonormal Basis {e1, e2, , e N}. Randomly choose a point a A, Set it as I0, then we form a set of loci I = {I0 + PN i=1 ϵkiei|I0 + PN i=1 ϵkiei A, ki Z, d We know that |I| 2 d ϵ N. It s not hard to verify that a Ii 2 ϵ 2 N 1, a Ii. Taken 2 N 1 yields the statement. B.1 Proof of Theorem 3.3 Theorem 3.3 (Boundedness). With probability 1 p, there exists a constant b such that L-UCBFair (Algorithm 1) achieves Regret(K) bζH2 d3 + (V + 1)H K ; Drtn(K) bζ(1 + V )H2 for parameter values ς=1 ϵI 1 2ρ(1+V )KH d, ζ = log(log(M)4d HK/p), and β = O(d H ζ). Outline The outline of this proof simulates the proof in Ghosh et al. (2022). For brevity, denote Ph V π j,h+1(s, a) = Es Ph( |s,a)V π j,h+1 (s ) for j = r, g. Then Qπ j,h(s, a) = rh + Ph V π j,h+1 (s, a) (9) V π j,h(s) = πh( | s)Qπ j,h(s, ) A (10) πh( | s), Qπ j,h(s, ) a A πh(a | s)Qπ j,h(s, a) (11) Similar to Efroni et al. (2020), we establish Regret(K) + νDistortion(K) V π r,1 (s1) V πk r,1 (s1) + ν b V πk g,1 (s1) V π r,1 (s1) + νk V π g,1 (s1) V k r,1 (s1) + νk V k g,1 (s1) V k r,1 (s1) V πk r,1 (s1) + ν V k g,1 (s1) V πk g,1 (s1) 2H2K | {z } T3 (12) T3 is easily bounded if η. The major task remains bound T1 and T2. Bound T1 and T2. We have following two lemmas. Lemma B.5 (Boundedness of T1). With probability 1 p/2, we have T1 α + 2(1 + V )HρϵI q ς . Specifically, if α = log(M)K 2(1+V +H) and ς = 1, we have T1 2H(1 + V + H) + 2KH2(1 + V )ρϵI d K with probability 1 p/2. Lemma B.6 (Boundedness of T2). Ghosh et al. (2022) With probability 1 p/2, T2 O (ν + 1)H2ζ d3K , where ζ = log[log(M)4d HK/p]. Lemma B.6 follows the same logic in Ghosh et al. 
(2022), and we delay the proof of Lemma B.5 to Section B.2. Now we are ready to proof Theorem 3.3. Proof. For any ν [0, V ], with prob. 1 p, Regret(K) + νDistortion(K) T1 + T2 + T3 2H2K + HK log M α + 2KH2(1 + V )ρϵI d K + O (ν + 1)H2ζ Taking ν = 0, η = V KH2 , α = K log M 2(1+V +H), ϵI = 1 2ρ(1+V )KH d, there exist constant b, Regret(K) V H K + 2H(1 + V + H) + 2H2K(1 + V )ρϵI d K + O H2ζ d3 + (V + 1)H Taking ν = V , η = V KH2 , α = K log M 2(1+V +H), ϵI = 1 2ρ(1+V )KH Regret(K) + V Distortion(K) (V + 1)H K + (1 + V )O H2ζ Following the idea of Efroni et al. (2020), there exists a policy π such that V π r,1 = 1 K PK k=1 V πk r,1 , V π g,1 = 1 K PK k=1 V πk g,1. By the occupancy measure, V π r,1 and V π g,1 are linear in occupancy measure induced by π. Thus, the average of K occupancy measure also produces an occupancy measure which induces policy π and V π r,1, and V π g,1. We take ν = 0 when PK k=1 b V πk g,1 sk 1 < 0, otherwise ν = V . Hence, we have V π r,1 (s1) 1 k=1 V πk r,1 (s1) + V max (c 1 k=1 V πk g,1 (s1) , 0 = V π r,1 (s1) V π r,1 (s1) + V max c V π g,1 (s1) , 0 Since V = 2H/γ, and using the result of Lemma B.15, we have k=1 V πk g,1 sk 1 , 0 V + 1 In this section we proof Lemma B.5 and Lemma B.6. B.2 Prepare for Lemma B.5 In order to bound T1 and T2, we introduce the following lemma. Lemma B.7. There exists a constant B2 such that for any fixed p (0, 1), with probability at least 1 p/2, the following event holds τ=1 ϕτ j,h h V k j,h+1 sτ h+1 Ph V k j,h+1 (sτ h, aτ h) (ςk h) 1 B2d Hq for j {r, g}, where q = p log [4 (B1 + 1) log(M)d T/p] for some constant B1. We delay the proof of Lemma B.7 to Appendix B.4. Lemma B.7 shows the bound of estimated value function V k j,h and value function V π j,h corresponding in a given policy at k. We now introcuce the following lemma appeared in Ghosh et al. (2022). This lemma bounds the difference between the value function without bonus in L-UCBFair and the true value function of any policy π. This is bounded using their expected difference at next step, plus a error term. Lemma B.8. Ghosh et al. (2022) There exists an absolute constant β = C1d H ζ, ζ = log(log(M)4d T/p), and for any fixed policy π, for the event defined in Lemma B.7, we have ϕ(s, a), wk j,h Qπ j,h(s, a) = Ph V k j,h+1 V π j,h+1 (s, a) + k h(s, a) for some k h(s, a) that satisfies k h(s, a) β q ϕ(s, a)T Λk h 1 ϕ(s, a). Lemma B.9. Ghosh et al. (2022) With probability at least 1 p/2, (for the event defined in Lemma B.7) Qπ r,h(s, a) + νk Qπ g,h(s, a) Qk r,h(s, a) + νk Qk g,h(s, a) Ph V k h+1 V π,νk h+1 (s, a) We also introduce the following lemma. This lemma bound the value function by taking L-UCBFair policy and greedy policy. Lemma B.10. Define V k h ( ) = maxa h Qk r,h( , a) + νk Qk g,h( , a) i the value function corresponding to greedy policy, we have V k h (s) V k h (s) log M α + 2(1 + V )HρϵI Proof. Define ag the solution of greedy policy, V k h (s) V k h (s) = Qk r,h (s, ag) + νk Qk g,h (s, ag) (16) a πh,k(a | s) Qk r,h(s, a) + νk Qk g,h(s, a) da (17) Qk r,h (s, ag) + νk Qk g,h (s, ag) (18) i SMα(Ii | x) Qk r,h(x, Ii) + νk Qk g,h(x, Ii) + 2(1 + V )HρϵI a exp α Qk r,h(s, Ii) + νk Qk g,h(s, Ii) i SMα(Ii | s) Qk r,h(s, Ii) + νk Qk g,h(s, Ii) + 2(1 + V )HρϵI α + 2(1 + V )HρϵI The first inequality follows from Lemma B.14 and the second inequality holds because of Proposition 1 in Pan et al. (2019). B.3 Proof of Lemma B.5 Now we re ready to proof Lemma B.5. Proof. This proof simulates Lemma 3 in Ghosh et al. (2022). We use induction to proof this lemma. 
B.3 Proof of Lemma B.5

Now we are ready to prove Lemma B.5.

Proof. This proof parallels Lemma 3 in Ghosh et al. (2022); we proceed by induction. At step $H$, we have $Q^{k}_{j,H+1} = 0 = Q^{\pi}_{j,H+1}$ by definition. Under the event in Lemma B.7 and using Lemma B.8, we have for $j \in \{r, g\}$
$$\big|\big\langle \phi(s,a),\, w^{k}_{j,H}\big\rangle - Q^{\pi}_{j,H}(s,a)\big| \le \beta\sqrt{\phi(s,a)^{\top}(\Lambda^{k}_H)^{-1}\phi(s,a)}.$$
Thus $Q^{\pi}_{j,H}(s,a) \le \min\big\{\big\langle\phi(s,a), w^{k}_{j,H}\big\rangle + \beta\sqrt{\phi(s,a)^{\top}(\Lambda^{k}_H)^{-1}\phi(s,a)},\ H\big\} = Q^{k}_{j,H}(s,a)$. From the definition of $\bar V^{k}_h$,
$$\bar V^{k}_H(s) = \max_a\big[Q^{k}_{r,H}(s,a) + \nu_k Q^{k}_{g,H}(s,a)\big] \ge \int_a \pi(a\mid s)\big[Q^{\pi}_{r,H}(s,a) + \nu_k Q^{\pi}_{g,H}(s,a)\big]\,da = V^{\pi,\nu_k}_H(s)$$
for any policy $\pi$. Thus it also holds for $\pi^*$, the optimal policy. Using Lemma B.10 we can get
$$V^{\pi^*,\nu_k}_H(s) - V^{k}_H(s) \le \frac{\log M}{\alpha} + 2(1+V)H\rho\,\epsilon_I\sqrt{dK/\varsigma}.$$
Now, suppose the claim is true up to step $h+1$ and consider step $h$. Since it is true up to step $h+1$, for any policy $\pi$,
$$\mathbb{P}_h\big(V^{\pi,\nu_k}_{h+1} - V^{k}_{h+1}\big)(s,a) \le (H-h)\Big(\frac{\log M}{\alpha} + 2(1+V)H\rho\,\epsilon_I\sqrt{dK/\varsigma}\Big).$$
From Lemma B.9 and the above result, we have for any $(s,a)$
$$\big(Q^{\pi}_{r,h}(s,a) + \nu_k Q^{\pi}_{g,h}(s,a)\big) - \big(Q^{k}_{r,h}(s,a) + \nu_k Q^{k}_{g,h}(s,a)\big) \le (H-h)\Big(\frac{\log M}{\alpha} + 2(1+V)H\rho\,\epsilon_I\sqrt{dK/\varsigma}\Big),$$
and therefore
$$V^{\pi,\nu_k}_h(s) - \bar V^{k}_h(s) \le (H-h)\Big(\frac{\log M}{\alpha} + 2(1+V)H\rho\,\epsilon_I\sqrt{dK/\varsigma}\Big).$$
Now, again from Lemma B.10, we have $\bar V^{k}_h(s) - V^{k}_h(s) \le \frac{\log M}{\alpha} + 2(1+V)H\rho\,\epsilon_I\sqrt{dK/\varsigma}$, so
$$V^{\pi,\nu_k}_h(s) - V^{k}_h(s) \le (H-h+1)\Big(\frac{\log M}{\alpha} + 2(1+V)H\rho\,\epsilon_I\sqrt{dK/\varsigma}\Big).$$
Since this holds for any policy $\pi$, it holds in particular for $\pi^*$. From the definition of $V^{\pi,\nu_k}$, we have
$$\big(V^{\pi^*}_{r,h}(s) + \nu_k V^{\pi^*}_{g,h}(s)\big) - \big(V^{k}_{r,h}(s) + \nu_k V^{k}_{g,h}(s)\big) \le (H-h+1)\Big(\frac{\log M}{\alpha} + 2(1+V)H\rho\,\epsilon_I\sqrt{dK/\varsigma}\Big).$$
Hence, the result follows by summing over $K$ and taking $h = 1$.

B.4 Proof of Lemma B.7

We first define some useful function classes. Let
$$\mathcal{Q}_j = \Big\{ Q \ \Big|\ Q(\cdot,a) = \min\big\{ w_j^{\top}\phi(\cdot,a) + \beta\sqrt{\phi(\cdot,a)^{\top}\Lambda^{-1}\phi(\cdot,a)},\ H \big\},\ a \in \mathcal{A} \Big\}$$
be the class of $Q$ functions, for $j \in \{r, g\}$. Since the minimum eigenvalue of $\Lambda$ is no smaller than one, the Frobenius norm of $\Lambda^{-1}$ is bounded. Let
$$\mathcal{V}_j = \Big\{ V_j \ \Big|\ V_j(\cdot) = \int_a \pi(a\mid\cdot)\, Q_j(\cdot,a)\,da;\ Q_r \in \mathcal{Q}_r,\ Q_g \in \mathcal{Q}_g,\ \nu \in [0,V] \Big\}$$
be the corresponding class of value functions, and define the policy class
$$\Pi = \Big\{ \pi \ \Big|\ \forall a \in \mathcal{A},\ \pi(a\mid\cdot) = \frac{1}{\int_{b\in I(a)} db}\,\mathrm{SM}_{\alpha}\big(Q_r(\cdot, I(a)) + \nu Q_g(\cdot, I(a))\big),\ Q_r \in \mathcal{Q}_r,\ Q_g \in \mathcal{Q}_g,\ \nu \in [0,V] \Big\},$$
where the policy $\pi$ appearing in $\mathcal{V}_j$ is the element of $\Pi$ induced by the same $(Q_r, Q_g, \nu)$. It is easy to verify that $V^{k}_j \in \mathcal{V}_j$.

We now turn to the proof of Lemma B.7. To prove Lemma B.7, we need the $\epsilon$-covering number of the class of value functions (Lemma B.13; Ghosh et al. (2022)). To obtain it, we need to show that if two $Q$ functions and the dual variable $\nu$ are close, then the induced policies and value functions are correspondingly close (Lemma B.11, Lemma B.12). Though the proofs of Lemma B.11 and Lemma B.12 differ from those in Ghosh et al. (2022), we show that the results remain the same, so Lemma B.13 still holds; we therefore only state Lemma B.13 and omit its proof. We now prove Lemma B.11.

Lemma B.11. Let $\pi$ be the policy of L-UCBFair corresponding to $Q_r + \nu Q_g$, i.e.,
$$\pi(a\mid\cdot) = \frac{1}{\int_{b\in I(a)} db}\,\mathrm{SM}_{\alpha}\big(Q_r(\cdot, I(a)) + \nu Q_g(\cdot, I(a))\big), \tag{23}$$
and let
$$\tilde\pi(a\mid\cdot) = \frac{1}{\int_{b\in I(a)} db}\,\mathrm{SM}_{\alpha}\big(\tilde Q_r(\cdot, I(a)) + \tilde\nu \tilde Q_g(\cdot, I(a))\big). \tag{24}$$
If $\|Q_j - \tilde Q_j\|_{\infty} \le \epsilon'$ for $j \in \{r, g\}$ and $|\nu - \tilde\nu| \le \epsilon'$, then $\int_a \big|\pi(a\mid x) - \tilde\pi(a\mid x)\big|\,da \le 2\alpha\epsilon'(1+V+H)$.

Proof.
$$\begin{aligned} \int_a \big|\pi(a\mid x) - \tilde\pi(a\mid x)\big|\,da &= \sum_i \int_{a\in I_i} \big|\pi(a\mid x) - \tilde\pi(a\mid x)\big|\,da \\ &= \sum_i \Big(\int_{b\in I_i} db\Big)\,\frac{\big|\mathrm{SM}_{\alpha}(I_i\mid x) - \widetilde{\mathrm{SM}}_{\alpha}(I_i\mid x)\big|}{\int_{b\in I_i} db} = \sum_i \big|\mathrm{SM}_{\alpha}(I_i\mid x) - \widetilde{\mathrm{SM}}_{\alpha}(I_i\mid x)\big| \\ &\le 2\alpha\,\sup_i \big|\big(Q_r(x,I_i) + \nu Q_g(x,I_i)\big) - \big(\tilde Q_r(x,I_i) + \tilde\nu \tilde Q_g(x,I_i)\big)\big|, \end{aligned}$$
where $\mathrm{SM}_{\alpha}(\cdot\mid x)$ and $\widetilde{\mathrm{SM}}_{\alpha}(\cdot\mid x)$ denote the soft-max distributions over loci induced by $Q_r + \nu Q_g$ and $\tilde Q_r + \tilde\nu \tilde Q_g$, respectively. The last inequality holds because of Theorem 4.4 in Epasto et al. (2020). Using Corollary B.17, we have
$$\int_a \big|\pi(a\mid x) - \tilde\pi(a\mid x)\big|\,da \le 2\alpha\epsilon'(1+V+H).$$
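The key external ingredient in the proof of Lemma B.11 is the $\ell_1$-stability of the soft-max map (Theorem 4.4 in Epasto et al. (2020)). The following short Python snippet is an illustrative numerical sanity check of that inequality, $\sum_i |\mathrm{SM}_{\alpha}(q)_i - \mathrm{SM}_{\alpha}(\tilde q)_i| \le 2\alpha\max_i |q_i - \tilde q_i|$, on random instances; it is not a proof and not part of the released code.

```python
import numpy as np

def softmax(x):
    """Numerically stable soft-max distribution."""
    z = np.exp(x - x.max())
    return z / z.sum()

def max_observed_ratio(n_trials=10_000, m=32, alpha=5.0, seed=0):
    """Largest observed ratio of ||SM_a(q) - SM_a(q~)||_1 to 2*alpha*||q - q~||_inf;
    the ratio stays <= 1 whenever the stability inequality holds."""
    rng = np.random.default_rng(seed)
    worst = 0.0
    for _ in range(n_trials):
        q = rng.uniform(0.0, 10.0, size=m)            # e.g., Q_r + nu * Q_g at the M loci
        q_tilde = q + rng.uniform(-0.1, 0.1, size=m)  # a small perturbation
        lhs = np.abs(softmax(alpha * q) - softmax(alpha * q_tilde)).sum()
        rhs = 2.0 * alpha * np.abs(q - q_tilde).max()
        worst = max(worst, lhs / rhs)
    return worst

if __name__ == "__main__":
    print("max observed LHS/RHS ratio:", max_observed_ratio())
```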
Now that we have Lemma B.11, we can further bound the value functions.

Lemma B.12. If $\|Q_j - Q^{k}_j\|_{\infty} \le \epsilon'$, where $Q_j \in \mathcal{Q}_j$, then there exists $\tilde V_j \in \mathcal{V}_j$ such that $\|V^{k}_j - \tilde V_j\|_{\infty} \le 2\alpha H\epsilon'(1+V+H) + \epsilon'$.

Proof. For any $s$,
$$\begin{aligned} \big|V^{k}_j(s) - \tilde V_j(s)\big| &= \Big|\int_a \pi(a\mid s)\,Q^{k}_j(s,a)\,da - \int_a \tilde\pi(a\mid s)\,Q_j(s,a)\,da\Big| \\ &\le \Big|\int_a \pi(a\mid s)\,Q^{k}_j(s,a)\,da - \int_a \pi(a\mid s)\,Q_j(s,a)\,da\Big| + \Big|\int_a \pi(a\mid s)\,Q_j(s,a)\,da - \int_a \tilde\pi(a\mid s)\,Q_j(s,a)\,da\Big| \\ &\le \int_a \pi(a\mid s)\,\big|Q^{k}_j(s,a) - Q_j(s,a)\big|\,da + H\int_a \big|\pi(a\mid s) - \tilde\pi(a\mid s)\big|\,da \\ &\le \epsilon' + 2\alpha H\epsilon'(1+V+H). \end{aligned}$$

Using the lemmas above, we obtain the same result presented in Lemma 13 of Ghosh et al. (2022), as follows.

Lemma B.13 (Ghosh et al. (2022)). For any $\tilde V_j \in \mathcal{V}_j$ there exists a $V_j \in \mathcal{V}_j$, parameterized by $(w_r, w_g, \beta, \Lambda, \nu)$, such that $\mathrm{dist}(V_j, \tilde V_j) \le \epsilon$, where $\mathrm{dist}(V_j, \tilde V_j) = \sup_s \big|V_j(s) - \tilde V_j(s)\big|$. Let $N^{\mathcal{V}_j}_{\epsilon}$ be the $\epsilon$-covering number of the set $\mathcal{V}_j$ with respect to this distance; then
$$\log N^{\mathcal{V}_j}_{\epsilon} \le d\log\big(1 + 4L_w/\epsilon'\big) + d^2\log\Big[1 + 8d^{1/2}\beta^2/\big(\varsigma(\epsilon')^2\big)\Big] + \log\big(1 + V/\epsilon'\big),$$
where $\epsilon' = \frac{\epsilon}{2\alpha H(1+V+H) + 1}$ and $L_w$ denotes the bound on $\|w^{k}_{j,h}\|$ from Lemma B.16.

Lemma B.14. $\big|Q^{k}_{j,h}(s,a) - Q^{k}_{j,h}(s, I(a))\big| \le 2H\rho\,\epsilon_I\sqrt{dK/\varsigma}$.

Proof.
$$\big|Q^{k}_{j,h}(s,a) - Q^{k}_{j,h}(s,I(a))\big| = \big|(w^{k}_{j,h})^{\top}\big(\phi(s,a) - \phi(s,I(a))\big)\big| \le \|w^{k}_{j,h}\|_2\,\big\|\phi(s,a) - \phi(s,I(a))\big\|_2.$$
From Lemma B.16 and Assumption 3.2 we get the result.

B.5 Prior Results

Lemma B.15 (Ding et al. (2021)). Let $\nu^*$ be the optimal dual variable and $C \ge 2\nu^*$. Then, if
$$V^{\pi^*}_{r,1}(s_1) - V^{\bar\pi}_{r,1}(s_1) + C\,\max\big(c - V^{\bar\pi}_{g,1}(s_1),\,0\big) \le \delta,$$
we have $\max\big(c - V^{\bar\pi}_{g,1}(s_1),\,0\big) \le 2\delta/C$.

Lemma B.16 (Jin et al. (2020)). For any $(k,h)$, the weight $w^{k}_{j,h}$ satisfies $\|w^{k}_{j,h}\| \le 2H\sqrt{dk/\varsigma}$.

Corollary B.17. If $\mathrm{dist}(Q_r, \tilde Q_r) \le \epsilon'$, $\mathrm{dist}(Q_g, \tilde Q_g) \le \epsilon'$, and $|\nu_k - \tilde\nu_k| \le \epsilon'$, then $\mathrm{dist}\big(Q_r + \nu_k Q_g,\ \tilde Q_r + \tilde\nu_k \tilde Q_g\big) \le \epsilon'(1+V+H)$.

C Additional Figures

[Figure 5 is a schematic diagram; only the caption is reproduced here.] Figure 5: The interaction of an algorithmic classifier and a reactive population. Given state $s_\tau$, the classifier uses policy $\pi$ to select action $a_\tau$. The population, in state $s_\tau$, reacts to $a_\tau$, transitioning to state $s_{\tau+1} \sim P(s_\tau, a_\tau)$, and the process repeats.

[Figure 6 plots, for each group, the density $\Pr(X = x \mid G = g)$ and the conditional qualification rate $\Pr(Y = 1 \mid X = x, G = g)$ over $x \in [0,1]$.] Figure 6: A synthetic distribution is updated according to a dynamical kernel $P$ based on evolutionary dynamics (Appendix A.1), when a classifier repeatedly predicts $\hat Y = 1$ iff $X \ge 0.5$. We visualize how the distribution of $X$ and the conditional qualification rates $\Pr(Y = 1 \mid X)$ change in each group $g \in \{1$ (red, solid), $2$ (blue, dashed)$\}$, fading the plotted lines over 10 time steps. In this example, the feature values $X$ in each group decrease with time, while the qualification rates of agents at any fixed value of $X$ decrease.

D Experiments

D.1 Experiment Details

Device and Packages. We run all experiments on a single 1080Ti GPU. We implement the R-TD3 agent using Stable Baselines3 (Raffin et al., 2021). The neural network is implemented using PyTorch (Paszke et al., 2019). Figures were generated, using code included in the supplementary material, in less than 24 hours on a single Nvidia RTX A5000.

Using a Neural Network to Learn $\phi$. We use a multi-layer perceptron to learn $\phi$. Specifically, we sample 100,000 data points using a random policy, storing $s$, $a$, $r$, and $g$. The network takes the state and action as inputs, which pass through fully connected (fc) layers of sizes 256, 128, 64, and 64. ReLU is used as the activation function between fc layers, while a softmax layer is applied after the last fc layer. We treat the output of this network as $\hat\phi$. To learn $\hat\phi$, we apply two separate fc layers (without bias) of output size 1 to $\hat\phi$ and treat their outputs as the predicted $r$ and the predicted $g$. A combination of the MSE losses of $r$ and $g$ is adopted. We use Adam as the optimizer; weight decay is set to 1e-4, the learning rate is set to 1e-3, and the batch size is 128. Note that $\hat\phi$ is linear with respect to $r$ and $g$, but the linearity of the transition kernel cannot be captured by this scheme. Therefore, we effectively make the assumption that there always exists a measure $\mu_h$ such that, for the given $\hat\phi$, the linearity of the transition kernel holds; this is a stronger assumption than Assumption 3.1.
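For concreteness, the following PyTorch sketch is consistent with the architecture and hyperparameters described above. It is an illustrative reimplementation rather than the exact experiment code, and the class and function names (PhiNet, train_phi) are ours.

```python
import torch
import torch.nn as nn

class PhiNet(nn.Module):
    """MLP mapping (state, action) to a feature vector phi_hat (softmax output),
    with two bias-free linear heads predicting r and g from phi_hat."""
    def __init__(self, state_dim, action_dim, phi_dim=64):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Linear(state_dim + action_dim, 256), nn.ReLU(),
            nn.Linear(256, 128), nn.ReLU(),
            nn.Linear(128, 64), nn.ReLU(),
            nn.Linear(64, phi_dim),
            nn.Softmax(dim=-1),                      # phi_hat lies on the simplex
        )
        self.r_head = nn.Linear(phi_dim, 1, bias=False)  # r is linear in phi_hat
        self.g_head = nn.Linear(phi_dim, 1, bias=False)  # g is linear in phi_hat

    def forward(self, s, a):
        phi_hat = self.backbone(torch.cat([s, a], dim=-1))
        return phi_hat, self.r_head(phi_hat), self.g_head(phi_hat)

def train_phi(model, loader, epochs=10, lr=1e-3, weight_decay=1e-4):
    """Fit phi_hat on transitions (s, a, r, g) collected by a random policy,
    minimizing the sum of the MSE losses for r and g."""
    opt = torch.optim.Adam(model.parameters(), lr=lr, weight_decay=weight_decay)
    mse = nn.MSELoss()
    for _ in range(epochs):
        for s, a, r, g in loader:
            _, r_hat, g_hat = model(s, a)
            loss = mse(r_hat.squeeze(-1), r) + mse(g_hat.squeeze(-1), g)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return model
```

Here `loader` is assumed to iterate over mini-batches (of size 128 in our setup) of the 100,000 transitions $(s, a, r, g)$ collected with the random policy, e.g., a `torch.utils.data.DataLoader` over a `TensorDataset`.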
D.2 Maximize True-Positives; Synthetic Data

[Table 2 panels. Rows: a) Myopic-Fair ($\lambda = 1$); b) Maxmin-Fair ($\lambda = 1$); c) L-UCBFair trained for 2,000 steps; d) R-TD3 trained for 200,000 steps; e) episodic mean loss (over 100 steps) for Myopic-Fair and Maxmin-Fair. Columns: Demographic Parity, Equal Opportunity, Equalized Odds. Each phase plot shows Group 1 qualification rate $s_1$ versus Group 2 qualification rate $s_2$, with shading indicating the corresponding disparity and a reference line marking equal qualification rates.]

Table 2: A comparison of baseline and RL policies in a purely synthetic environment with the objective of maximizing the cumulative true-positive fraction of a binary classification task (Section 4), subject to three fairness constraints (Section 5.1) (columns). In all cases, the baseline greedy policies drive the system (indicated by streamlines) to promote low qualification rates in each group (lower-left corner of each plot), while the RL agents are able to drive the system to more favorable equilibria characterized by higher qualification rates (upper right). Shading represents local (non-cumulative) violations of fairness.

D.3 Maximize True-Positives; Adult Data

[Table 3 panels. Rows: a) Myopic-Fair ($\lambda = 1$); b) Maxmin-Fair ($\lambda = 1$); c) L-UCBFair trained for 2,000 steps; d) R-TD3 trained for 200,000 steps; e) episodic mean loss (over 150 steps) for Myopic-Fair and Maxmin-Fair. Columns and axes are as in Table 2.]
Table 3: A comparison of baseline and RL policies in a semi-synthetic environment incorporating data synthesized from the UCI Adult Data Set (Dua and Graff, 2017), as detailed in Appendix A.2. Policies share the objective of maximizing the cumulative true-positive fraction of a binary classification task (Section 4), subject to three fairness constraints (Section 5.1) (columns). The qualitative difference between the synthetic environment (Table 2) and this modified environment can be largely explained by the fact that $\Pr(Y = 1 \mid X = x)$ is not actually monotonically increasing in $x$, as stipulated by Assumption 4.1 (Appendix A.2).

D.4 Maximize True-Positives +0.8 True-Negatives; Synthetic Data

[Table 4 panels. Rows: a) Myopic-Fair ($\lambda = 1$); b) Maxmin-Fair ($\lambda = 1$); c) R-TD3 trained for 200,000 steps; e) episodic mean loss (over 100 steps) for Myopic-Fair and Maxmin-Fair. Columns and axes are as in Table 2.]

Table 4: A comparison of baseline and R-TD3 policies in a purely synthetic environment with the objective of maximizing a weighted combination of the cumulative true-positive and true-negative fractions of a binary classification task (Section 4), subject to three fairness constraints (Section 5.1) (columns).
In all cases, the baseline greedy policies drive the system (indicated by streamlines) to a one-dimensional manifold in the space of group qualification rates that is bounded away from universal qualification (upper-right corner of each plot), while the RL agents are generally able to drive the system closer to this favorable equilibrium. Shading represents local violations of fairness.

D.5 Maximize True-Positives +0.8 True-Negatives; Adult Data

[Table 5 panels. Rows: a) Myopic-Fair ($\lambda = 1$); b) Maxmin-Fair ($\lambda = 1$); c) R-TD3 trained for 200,000 steps; e) episodic mean loss (over 150 steps) for Myopic-Fair and Maxmin-Fair. Columns and axes are as in Table 2.]

Table 5: A comparison of baseline and R-TD3 policies in a semi-synthetic environment incorporating data synthesized from the UCI Adult Data Set (Dua and Graff, 2017), as detailed in Appendix A.2. Policies share the objective of maximizing a weighted combination of the cumulative true-positive and true-negative fractions of a binary classification task (Section 4), subject to three fairness constraints (Section 5.1) (columns). This figure indicates a negative result, in which the R-TD3 algorithm does not offer clear benefits over the baseline policies. As noted in the main text, there are no theoretical guarantees for the R-TD3 algorithm, and the possibility of such negative results constitutes a risk for real-world deployments.

D.6 Training Curves: L-UCBFair

[Figure 7 panels: episodic mean loss and episodic mean disparity versus training iteration (0 to 2,000), under (a) Demographic Parity, (b) Equal Opportunity, and (c) Equalized Odds.] Figure 7: L-UCBFair 20-step sliding mean & std for the setting in Table 2.

[Figure 8 panels: episodic mean loss and episodic mean disparity versus training iteration (0 to 2,000), under (a) Demographic Parity, (b) Equal Opportunity, and (c) Equalized Odds.] Figure 8: L-UCBFair 20-step sliding mean & std for the setting in Table 3.
D.7 Training Curves: R-TD3

[Figure 9 panels: episodic mean loss and episodic mean disparity versus training step (0 to 200k), under (a) Demographic Parity, (b) Equal Opportunity, and (c) Equalized Odds.] Figure 9: R-TD3 100-step sliding mean & std for the setting in Table 3.

[Figure 10 panels: episodic mean loss and episodic mean disparity versus training step (0 to 200k), under (a) Demographic Parity, (b) Equal Opportunity, and (c) Equalized Odds.] Figure 10: R-TD3 100-step sliding mean & std for the setting in Table 5.

D.8 Reduction of Utility

[Figure 11: a phase plot over Group 1 and Group 2 qualification rates, with shading indicating the reduction of utility.] Figure 11: This figure depicts the short-term impact on utility of the L-UCBFair algorithm compared to the greedy, myopic agent that operates without fairness constraints. In this experiment, both algorithms were designed to optimize the fraction of true-positive classifications, but only L-UCBFair was subject to the additional constraint of demographic parity. As the results indicate, the L-UCBFair algorithm experiences a reduction in utility compared to the greedy baseline, but it is able to drive the system towards a state that is preferable in the long term.