# Cooperative Multi-Agent Fairness and Equivariant Policies

Niko A. Grupen¹, Bart Selman¹, Daniel D. Lee²
¹Cornell University, Ithaca, NY   ²Cornell Tech, New York, NY
{nag83, bs54m, ddl46}@cornell.edu

We study fairness through the lens of cooperative multi-agent learning. Our work is motivated by empirical evidence that naive maximization of team reward yields unfair outcomes for individual team members. To address fairness in multi-agent contexts, we introduce team fairness, a group-based fairness measure for multi-agent learning. We then prove that it is possible to enforce team fairness during policy optimization by transforming the team's joint policy into an equivariant map. We refer to our multi-agent learning strategy as Fairness through Equivariance (Fair-E) and demonstrate its effectiveness empirically. We then introduce Fairness through Equivariance Regularization (Fair-ER) as a soft-constraint version of Fair-E and show that it reaches higher levels of utility than Fair-E and fairer outcomes than non-equivariant policies. Finally, we present novel findings regarding the fairness-utility trade-off in multi-agent settings, showing that the magnitude of the trade-off depends on agent skill.

## Introduction

Algorithmic fairness is an increasingly important subdomain of AI. As statistical learning algorithms continue to automate decision-making in crucial areas such as lending (Fuster et al. 2020), healthcare (Potash et al. 2015), and education (Dorans and Cook 2016), it is imperative that the performance of such algorithms does not rely upon sensitive information pertaining to the individuals for whom decisions are made (e.g. race, gender). Despite its growing importance, fairness research has largely targeted prediction-based problems, where decisions are made for one individual at one time (Mitchell et al. 2018). Though recent studies have extended fairness to the multi-agent case (Jiang and Lu 2019), such work primarily considers social dilemmas in which team utility is in obvious conflict with the local interests of each team member (Leibo et al. 2017; Rapoport 1974; Van Lange et al. 2013). Many real-world problems, however, must weigh the fairness implications of team behavior in the presence of a single overarching goal. In line with recent work that has highlighted the importance of leveraging multi-agent learning to study socio-economic challenges such as taxation, social planning, and economic policy (Zheng et al. 2020), we posit that understanding the range of team behavior that emerges from single-objective utility maximization is crucial for the development of fair multi-agent systems.

For this reason, we study fairness in the context of cooperative multi-agent settings. Cooperative multi-agent fairness differs from traditional game-theoretic interpretations of fairness (e.g. resource allocation (Elzayn et al. 2019; Zhang and Shah 2014), social dilemmas (Leibo et al. 2017)) in that it seeks to understand the fairness implications of emergent coordination learned by multi-agent teams that are bound by a shared reward. Cooperative multi-agent fairness therefore reframes the question "Will agents cooperate or defect, given the choice between local and team interests?" to a related but novel question: "Given the incentive to work together, do agents learn to coordinate effectively and fairly?"
Experimentally, we target pursuit-evasion (i.e. predator-prey) as a test-bed for cooperative multi-agent fairness. Pursuit-evasion allows us to simulate a number of important components of socio-economic systems, including: (i) Shared objectives: the overarching goal of the pursuers is to capture an evader; (ii) Agent skill: the speed of the pursuers relative to the evader serves as a proxy for skill; (iii) Coordination: success requires sophisticated cooperation by the pursuers. Using pursuit-evasion, we study the fairness implications of behavior that emerges under variations of these socio-economic parameters. Similar to prior work (Lowe et al. 2017; Mordatch and Abbeel 2018; Grupen, Lee, and Selman 2020), we cast pursuit-evasion as a multi-agent reinforcement learning (RL) problem.

Our first result highlights the importance of shared objectives to cooperation. In particular, we compare policies learned when pursuers share in team success (mutual reward) to those learned when pursuers do not share reward (individual reward). We find that sophisticated coordination only emerges when pursuers are bound by mutual reward. Given individual reward, pursuers are not properly incentivized to work together. However, though mutual reward aids coordination, it does not specify how to coordinate fairly. In our experiments, we find that naive, unconstrained maximization of mutual reward yields unfair individual outcomes for cooperative teammates. In the context of pursuit-evasion, the optimal strategy is a form of role assignment: the majority of pursuers act as supporting agents, shepherding the evader to one designated capturer agent. Solving this issue is the subject of the rest of our analysis.

Addressing this form of unfair emergent coordination requires connecting fairness to multi-agent learning settings. To do this, we first introduce team fairness, a group-based fairness measure inspired by demographic parity (Dwork et al. 2012; Feldman et al. 2015). Team fairness requires the distribution of a team's reward to be equitable across sensitive groups. We then show that it is possible to enforce team fairness during policy optimization by transforming the team's joint policy into an equivariant map. We prove that equivariant policies yield fair reward distributions under assumptions of agent homogeneity. We refer to our multi-agent learning strategy as Fairness through Equivariance (Fair-E) and demonstrate its effectiveness empirically in pursuit-evasion experiments.

Despite achieving fair outcomes, Fair-E represents a binary switch: one can either choose fairness (at the expense of utility) or utility (at the expense of fairness). In many cases, however, it is advantageous to modulate between fairness and utility. To this end, we introduce a soft-constraint version of Fair-E that incentivizes equivariance through regularization. We refer to this method as Fairness through Equivariance Regularization (Fair-ER) and show that it is possible to tune fairness constraints over multi-agent policies by adjusting the weight of equivariance regularization. Moreover, we show empirically that Fair-ER reaches higher levels of utility than Fair-E while achieving fairer outcomes than non-equivariant policy learning.

Finally, as in both prediction-based settings (Corbett-Davies et al. 2017; Zhao and Gordon 2019) and in traditional multi-agent variants of fairness (Okun 1975; Le Grand 1990), it is important to understand the cost of fairness.
We present novel findings regarding the fairness-utility trade-off for cooperative multi-agent settings. Specifically, we show that the magnitude of the trade-off depends on the skill level of the multi-agent team. When agent skill is high (making the task easier to solve), fairness comes with no trade-off in utility, but as skill decreases (making the task more difficult), gains in team fairness are increasingly offset by decreases in team utility.

### Preview of Contributions

In sum, our work offers the following contributions:

- We show that mutual reward is critical to multi-agent coordination. In pursuit-evasion, agents trained with mutual reward learn to coordinate effectively, whereas agents trained with individual reward do not.
- We connect fairness to cooperative multi-agent settings. We introduce team fairness as a group-based fairness measure for multi-agent teams that requires equitable reward distributions across sensitive groups.
- We introduce Fairness through Equivariance (Fair-E), a novel multi-agent strategy leveraging equivariant policy learning. We prove that Fair-E achieves fair outcomes for individual members of a cooperative team.
- We introduce Fairness through Equivariance Regularization (Fair-ER) as a soft-constraint version of Fair-E. We show that Fair-ER reaches higher levels of utility than Fair-E while achieving fairer outcomes than non-equivariant learning.
- We present novel findings regarding the fairness-utility trade-off for cooperative settings. Specifically, we show that the magnitude of the trade-off depends on agent skill: when agent skill is high, fairness comes for free, whereas with lower skill levels, fairness is increasingly expensive.

## Related Work

At a high level, the prediction-based fairness literature can be split into two factions: individual fairness and group fairness. Introduced by Dwork et al. (2012), individual fairness posits that two individuals with similar features should be classified similarly (i.e. similarity in feature-space implies similarity in decision-space). Such approaches rely on task-specific distance metrics with which similarity can be measured (Barocas, Hardt, and Narayanan 2019; Chouldechova and Roth 2018). Group fairness, on the other hand, attempts to achieve outcome consistency across sensitive groups. This idea has given rise to a number of methods such as statistical/demographic parity (Feldman et al. 2015; Johndrow, Lum et al. 2019; Kamiran and Calders 2009; Zafar et al. 2017), equality of opportunity (Hardt, Price, and Srebro 2016), and calibration (Kleinberg, Mullainathan, and Raghavan 2016). Recent work has extended fairness to the RL setting to consider the feedback effects of decision-making (Jabbari et al. 2017; Wen, Bastani, and Topcu 2021).

In multi-agent systems, fairness is typically studied in game-theoretic settings in which individual payoffs and overall group utility are in obvious conflict (De Jong et al. 2005), such as resource allocation (Elzayn et al. 2019; Zhang and Shah 2014) and social dilemmas (Leibo et al. 2017; Rapoport 1974; Van Lange et al. 2013). In multi-agent RL settings, these tensions have been addressed through myriad techniques, including reward shaping (Peysakhovich and Lerer 2017), intrinsic reward (Wang et al. 2018), parameterized inequity aversion (Hughes et al. 2018), and hierarchical learning (Jiang and Lu 2019). Also related is the Shapley value: a method for sharing surplus across a coalition based on one's contributions to the coalition (Shapley 2016).
Shapley value-based credit assignment techniques have recently been shown to stabilize learning and achieve fairer outcomes when incorporated into the multi-agent RL problem (Wang et al. 2020; Li et al. 2021). Our work differs from this prior work in two key ways. First, we target fully-cooperative multi-agent settings (Hao and Leung 2016) in which fairness implications emerge naturally in the presence of a single overarching goal (i.e. mutual reward). In this fully-cooperative setting, individual and team incentives are not in obvious conflict. Our motivation for studying fully-cooperative team objectives follows from recent work that highlights the role of multi-agent learning in real-world problems characterized by shared objectives, including taxation and economic policy (Zheng et al. 2020). Moreover, we study modifications to the utility-maximization objective that yield fairer outcomes by incentivizing agents to change their behavior, rather than redistributing outcomes after the fact. Most relevant are Siddique, Weng, and Zimmer (2020) and Zimmer, Siddique, and Weng (2020), which introduce a class of algorithms that successfully achieve fair outcomes for multi-agent teams through pre-defined social welfare functions that encode specific fairness principles. Our work, conversely, introduces task-agnostic methods for incentivizing fairness through both hard constraints on agent policies and soft constraints (i.e. regularization) (Liu et al. 2018) on the RL objective.

Finally, discussion of the fairness-utility (or fairness-efficiency) trade-off has a long history in game-theoretic multi-agent settings (Okun 1975; Le Grand 1990; Bertsimas, Farias, and Trichakis 2012; Joe-Wong et al. 2013; Bertsimas, Farias, and Trichakis 2011) and is also prevalent throughout the prediction-based fairness literature (Menon and Williamson 2018). Existing work has shown both theoretically (Calders, Kamiran, and Pechenizkiy 2009; Kleinberg, Mullainathan, and Raghavan 2016; Zhao and Gordon 2019) and empirically (Dwork et al. 2012; Feldman et al. 2015; Kamiran and Calders 2009; Lahoti, Gummadi, and Weikum 2019; Pannekoek and Spigler 2021) that gains in fairness come at the cost of utility. Our discussion of the fairness-utility trade-off is most similar to Corbett-Davies et al. (2017) in this regard, as we study the trade-off through the lens of constrained vs. unconstrained optimization. However, we take this trade-off a step further, outlining a relationship between fairness, utility, and agent skill that is not present in prior work.

## Preliminaries

### Markov Games

A Markov game is a multi-agent extension of the Markov decision process (MDP) formalism (Littman 1994). For $n$ agents, it is represented by a state space $S$, joint action space $A = \{A_1, \ldots, A_n\}$, joint observation space $O = \{O_1, \ldots, O_n\}$, transition function $T : S \times A \rightarrow S$, and joint reward function $r$. Following multi-objective RL (Zimmer, Siddique, and Weng 2020), we define a vectorial reward $r : S \times A \rightarrow \mathbb{R}^n$ with each component $r_i$ representing agent $i$'s contribution to $r$. Each agent $i$ is initialized with a policy $\pi_i : O_i \rightarrow A_i$ (or deterministic policy $\mu_i$) from which it selects actions and an action-value function $Q_i : S \times A_i \rightarrow \mathbb{R}$ with which it judges the value of state-action pairs. Following action selection, the environment transitions from its current state $s_t$ to a new state $s_{t+1}$, as governed by $T$, and produces a reward vector $r_t$ indicating the strength or weakness of the group's decision-making.
In the episodic case, this process continues for a finite time horizon $T$, producing a trajectory $\tau = (s_1, a_1, \ldots, s_{T-1}, a_{T-1}, s_T)$ with probability:

$$P(\tau) = P(s_1) \prod_{t=1}^{T-1} P(s_{t+1} \mid s_t, a_t)\, \pi(a_t \mid s_t) \tag{1}$$

where $P(s_1)$ is a special start-state distribution specifying the likelihood of each start state.

### Deep Deterministic Policy Gradients

Deep Deterministic Policy Gradients (DDPG) is an off-policy actor-critic algorithm for policy gradient learning in continuous action spaces (Lillicrap et al. 2015). DDPG leverages the deterministic policy gradient theorem (Silver et al. 2014), which asserts that it is possible to find an optimal deterministic policy ($\mu_\phi$, with parameters $\phi$), with respect to a Q-function ($Q_\omega$, with parameters $\omega$), for the RL objective:

$$J(\phi) = \mathbb{E}_s\left[Q_\omega(s, a)\big|_{s=s_t,\, a=\mu_\phi(s_t)}\right] \tag{2}$$

by performing gradient ascent with respect to the following gradient:

$$\nabla_\phi J(\phi) = \mathbb{E}_s\left[\nabla_a Q_\omega(s, a)\big|_{s=s_t,\, a=\mu(s_t)}\, \nabla_\phi \mu(s)\big|_{s=s_t}\right] \tag{3}$$

under mild conditions that confirm the existence of the gradients $\nabla_a Q_\omega(s, a)$ and $\nabla_\phi \mu(s)$. For critic updates, DDPG follows batched TD-control, where the Q-function minimizes the loss function:

$$L(\omega) = \mathbb{E}_{s,a,r,s'}\left[\left(Q_\omega(s, a) - \left(r(s, a) + \gamma\, Q_\omega(s', \mu_\phi(s'))\right)\right)^2\right] \tag{4}$$

where $(s, a, r, s')$ are transition tuples sampled from a replay buffer. In this work, agents learn in a decentralized manner, each performing DDPG updates individually.
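For concreteness, a minimal sketch of a single agent's decentralized DDPG update (Equations 2-4) is shown below. It is illustrative rather than our exact implementation: the `Actor`/`Critic` architectures, layer sizes, batch layout, and the `ddpg_update` helper are placeholder choices, and target-network soft updates are omitted.

```python
import torch
import torch.nn as nn

class Actor(nn.Module):
    """Deterministic policy mu_phi(s): maps an observation to an action."""
    def __init__(self, obs_dim, act_dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(),
                                 nn.Linear(64, act_dim), nn.Tanh())
    def forward(self, obs):
        return self.net(obs)

class Critic(nn.Module):
    """Action-value function Q_omega(s, a)."""
    def __init__(self, obs_dim, act_dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim + act_dim, 64), nn.ReLU(),
                                 nn.Linear(64, 1))
    def forward(self, obs, act):
        return self.net(torch.cat([obs, act], dim=-1))

def ddpg_update(actor, critic, target_actor, target_critic,
                actor_opt, critic_opt, batch, gamma=0.99):
    """One decentralized DDPG update for a single agent (Equations 2-4)."""
    # Tensors sampled from this agent's replay buffer; rew is shaped [B, 1].
    obs, act, rew, next_obs = batch

    # Critic: minimize the TD error of Equation (4).
    with torch.no_grad():
        target_q = rew + gamma * target_critic(next_obs, target_actor(next_obs))
    critic_loss = ((critic(obs, act) - target_q) ** 2).mean()
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()

    # Actor: ascend the deterministic policy gradient of Equations (2)-(3).
    actor_loss = -critic(obs, actor(obs)).mean()
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()
```

In the decentralized setting used here, each pursuer would hold its own copy of these networks and call `ddpg_update` on batches drawn from its own replay buffer.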
### Fairness

Prediction-based fairness considers a population of $n$ individuals (indexed $i = 1, \ldots, n$), each described by variables $v_i$ (i.e. features or attributes), which are separated into sensitive variables $z_i$ and other variables $x_i$. Variables $v_i$ are used to predict (typically binary) outcomes $y_i \in Y$ by estimating the conditional probability $P[Y = 1 \mid V = v_i]$ through a scoring function $\psi : V \rightarrow \{0, 1\}$. Outcomes in turn yield decisions by applying a decision rule $\delta(v_i) = f(\psi(v_i))$. For example, in a lending scenario, a classifier may use $v_i$ to predict whether an individual $i$ will default on ($y_i = 0$) or repay ($y_i = 1$) his/her loan, which informs the decision to deny ($d_i = 0$) or approve ($d_i = 1$) the individual's loan application (Mitchell et al. 2018). Of particular relevance is group-based fairness, which examines how well outcome ($Y$) and decision ($D$) consistency is preserved across sensitive groups ($Z$) (Feldman et al. 2015; Zafar et al. 2017). We highlight the group-based measure of demographic parity, which requires that $D \perp Z$ or, equivalently, that $P[D = 1 \mid Z = z] = P[D = 1 \mid Z = z']$ for all $z, z'$ where $z \neq z'$.

### Mutual Information

Given random variables $X_1 \sim P_{X_1}$ and $X_2 \sim P_{X_2}$ with joint distribution $P_{X_1 X_2}$, mutual information is defined as the Kullback-Leibler (KL-) divergence between the joint $P_{X_1 X_2}$ and the product of the marginals $P_{X_1} \otimes P_{X_2}$:

$$I(X_1; X_2) := D_{KL}(P_{X_1 X_2} \,\|\, P_{X_1} \otimes P_{X_2}) \tag{5}$$

Mutual information quantifies the dependence between $X_1$ and $X_2$ where, in Equation 5, larger divergence represents stronger dependence. Importantly, mutual information can also be represented as the decrease in entropy of $X_1$ when introducing $X_2$:

$$I(X_1; X_2) := H(X_1) - H(X_1 \mid X_2) \tag{6}$$

### Equivariance

Let $g_1$ and $g_2$ be $G$-sets of a group $G$ and $\sigma$ be a symmetry transformation over $G$. Then a function $f : g_1 \rightarrow g_2$ is equivariant with respect to $\sigma$ if the commutative relationship $f(\sigma \cdot x) = \sigma \cdot f(x)$ holds. Equivariance in the context of RL implies that separate policies will take the same actions under permutations of the state space.

Figure 1: Snapshot of a pursuit-evasion game. Pursuers $p_1$ and $p_2$ (red) chase an evader $e$ (green) with the goal of capturing it. Given a state $s_t$, each pursuer selects its next heading (dotted arrow) from its policy, yielding the joint policy $\pi = \{\pi_1(s_t), \pi_2(s_t)\}$. Evader action selection is omitted for clarity. a) For an equivariant joint policy, applying the transformation $\sigma$ to the state $s_t$ (producing $\sigma \cdot s_t$ which, in this example, swaps the positions of $p_1$ and $p_2$) and running the policy $\pi(\sigma \cdot s_t)$ is equivalent to running the joint policy first and transforming the joint action afterwards (i.e. the commutative relationship $\pi(\sigma \cdot s) = \sigma \cdot \pi(s)$ holds). b) This commutative relationship does not necessarily hold for non-equivariant policies.

### Pursuit-Evasion

Pursuit-evasion (i.e. predator-prey) is a classic setting for studying multi-agent coordination (Isaacs 1999). A pursuit-evasion game is defined between $n$ pursuers $\{p_1, \ldots, p_n\}$ and a single evader $e$. At any time $t$, an agent $i$ is described by its current position and heading $q^i_t = [x, y, \theta]_i$ and the environment is described by the position and heading of all agents $s_t = \{q_{p_1}, \ldots, q_{p_n}, q_e\}$. Upon observing $s_t$, each agent selects its next heading $\theta_i$ as an action. The chosen heading is pursued at the maximum allowed speed for each agent ($|\vec{v}_p|$ for the pursuers, $|\vec{v}_e|$ for the evader). We assume the evader to be part of the environment, defined by the potential-field policy:

$$U(\theta_e) = \sum_{i=1}^{n} \frac{1}{r_i}\, \cos(\theta_e - \theta_i) \tag{7}$$

where $r_i$ and $\theta_i$ are the L2-distance and relative angle between the evader and the $i$-th pursuer, respectively, and $\theta_e$ is the heading of the evader. Intuitively, $U(\theta_e)$ pushes the evader away from pursuers, taking the largest bisector between any two when possible. The goal of the pursuers, to capture the evader as quickly as possible, is mirrored in the reward function, where $r(s_t, a_t) = 50.0$ if the evader is captured and $r(s_t, a_t) = -0.1$ otherwise. Note that $|\vec{v}_p|$ serves as a proxy for agent skill level. When $|\vec{v}_p| > |\vec{v}_e|$, pursuers are skilled enough to capture the evader on their own, whereas $|\vec{v}_p| \leq |\vec{v}_e|$ requires that pursuers work together.

## Method

In this section, we present a novel interpretation of fairness for cooperative multi-agent teams. We then introduce our proposed method, Fairness through Equivariance (Fair-E), and prove that it yields fair outcomes. Finally, we present Fairness through Equivariance Regularization (Fair-ER) as a soft-constraint version of Fair-E.

### Notation

Let $n$ be the number of agents in a cooperative team. We describe each agent $i$ by variables $v_i = (z_i, x_i)$, consisting of sensitive variables $z_i \in Z$ and non-sensitive variables $x_i \in X$. In team settings, we define non-sensitive variables $x_i$ to be any variables that affect agent $i$'s performance on the team, such as maximum speed. Other variables that should not impact team performance, such as an agent's identity or belonging to a minority group, are defined as sensitive variables $z_i$. We define fairness in terms of reward distributions $R$, where each $r \in R$ is a vectorial team reward and each component $r_i$ is agent $i$'s contribution to $r$. In the following definitions, let $I(R; Z)$ be the mutual information between reward distributions $R$ and sensitive variables $Z$.

### Team Fairness

We now define team fairness, a group-based fairness measure for multi-agent learning.

Definition 1 (Exact Team Fairness). A set of cooperative agents achieves exact team fairness if $I(R; Z) = 0$.

Definition 2 (Approximate Team Fairness). A set of cooperative agents achieves approximate team fairness if $I(R; Z) \leq \epsilon$ for some $\epsilon > 0$.
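To make Definitions 1 and 2 operational, $I(R; Z)$ can be estimated from samples of per-agent outcomes paired with sensitive identifiers. The sketch below is a simple discrete plug-in estimator for illustration only; the choice to summarize each agent's episode reward as a binary capture indicator is a placeholder (the exact computation of the team fairness score used in our experiments is described in Appendix C).

```python
import numpy as np
from collections import Counter

def team_fairness_mi(rewards, identities):
    """Discrete plug-in estimate of I(R; Z) in nats.

    rewards:    per-agent, per-episode outcomes, discretized (e.g. 1 if the
                agent captured the evader in that episode, 0 otherwise).
    identities: the sensitive identifier z of the agent that produced each
                outcome (e.g. the agent's index), paired element-wise.
    """
    n = len(rewards)
    joint = Counter(zip(rewards, identities))
    p_r = Counter(rewards)
    p_z = Counter(identities)
    mi = 0.0
    for (r, z), count in joint.items():
        p_rz = count / n
        mi += p_rz * np.log(p_rz / ((p_r[r] / n) * (p_z[z] / n)))
    return mi  # I(R; Z) = 0 corresponds to exact team fairness (Definition 1)

# Example: captures concentrated on one identity give I(R; Z) > 0, while
# captures spread evenly across identities drive the estimate toward 0.
```

Under this estimator, a reward distribution concentrated on a single identity yields $I(R; Z) > 0$, while an even spread of rewards across identities drives the estimate toward zero.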
Team fairness connects cooperative multi-agent learning to group-based fairness, as $I(R; Z) = 0$ is equivalent to requiring $R \perp Z$ (Barocas, Hardt, and Narayanan 2019).

### Fairness Through Equivariance

To enforce team fairness during policy optimization, we introduce a novel multi-agent learning strategy. The key to our approach is equivariance: by enforcing parameter symmetries (Ravanbakhsh, Schneider, and Poczos 2017) in each agent $i$'s policy network $\pi_{\phi_i}$, we show that equivariance propagates through the multi-agent RL problem. In particular, we show that the joint policy $\pi = \{\pi_{\phi_1}, \ldots, \pi_{\phi_n}\}$ is an equivariant map with respect to permutations over state and action space. Further, we show that equivariance in policy-space begets equivariance in trajectory-space; namely, the terminal state $s_T$ following a multi-agent trajectory is equivariant to that trajectory's initial state $s_1$. Finally, we prove that equivariance in multi-agent policies and trajectories yields exact team fairness. A comparison of equivariant vs. non-equivariant joint policies is provided in Figure 1.

In the proofs that follow, we assume: (i) homogeneity across agents on the team, i.e. agents are identical in their non-sensitive variables $x$; (ii) the distribution of agent positions satisfies exchangeability. Finally, though our derivations utilize general (stochastic) policies, we provide equivalent proofs for deterministic policies in Appendix A.

Theorem 1. If individual policies $\pi_{\phi_i}$ are symmetric, then the joint policy $\pi = \{\pi_{\phi_1}, \ldots, \pi_{\phi_n}\}$ is an equivariant map.

Proof. Let $\sigma$ be a permutation operator that, when applied to a vector (such as a state $s_t$ or action $a_t$), produces a permuted vector ($\sigma \cdot s_t = s_t^\sigma$ or $\sigma \cdot a_t = a_t^\sigma$, respectively). Under parameter symmetry (i.e. $\phi_1 = \phi_2 = \cdots = \phi_n$), we have:

$$\pi(\sigma \cdot s) = \pi(s^\sigma) = a^\sigma = \sigma \cdot a = \sigma \cdot \pi(s) \tag{8}$$

where the commutative relationship $\pi(\sigma \cdot s) = \sigma \cdot \pi(s)$ implies that $\pi$ is an equivariant map.

Commutativity here is crucial: Equation (8), and therefore Theorems 2 and 3, do not hold for non-equivariant policies (see Figure 1b).
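The commutativity in Equation (8) is straightforward to check numerically for a shared-parameter joint policy. The snippet below is illustrative only: it represents the global state as a list of per-agent observations, treats $\sigma$ as a permutation of that list, and uses a hypothetical `SharedPolicy` network rather than our exact architecture.

```python
import torch
import torch.nn as nn

class SharedPolicy(nn.Module):
    """One set of parameters phi shared by every agent (phi_1 = ... = phi_n)."""
    def __init__(self, obs_dim, act_dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim, 32), nn.Tanh(),
                                 nn.Linear(32, act_dim))
    def forward(self, obs):
        return self.net(obs)

def joint_policy(policy, per_agent_obs):
    """Joint action: apply the same (shared) policy to each agent's observation."""
    return torch.stack([policy(o) for o in per_agent_obs])

torch.manual_seed(0)
policy = SharedPolicy(obs_dim=6, act_dim=1)

# Two pursuers observing the same global state from their own perspectives.
obs = [torch.randn(6), torch.randn(6)]
sigma = [1, 0]  # permutation swapping the two pursuers

a = joint_policy(policy, obs)                                 # pi(s)
a_perm_state = joint_policy(policy, [obs[i] for i in sigma])  # pi(sigma . s)
perm_a = a[sigma]                                             # sigma . pi(s)

# Equation (8): pi(sigma . s) == sigma . pi(s) when parameters are shared.
assert torch.allclose(a_perm_state, perm_a)
```

Replacing the shared policy with independently parameterized networks (as in Figure 1b) makes the assertion fail in general, which is exactly the distinction between equivariant and non-equivariant joint policies.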
Theorem 2. Let $p_\pi(s \rightarrow s', k)$ be the probability of transitioning from state $s$ to state $s'$ in $k$ steps (Sutton and Barto 2018). Given that the joint policy $\pi$ is an equivariant map, it follows that $p_\pi(s_1 \rightarrow s_T, T) = p_\pi(s_1^\sigma \rightarrow s_T^\sigma, T)$.

Proof. It follows from our assumption of agent homogeneity that permuting a state, $\sigma \cdot s_t$, which (from Theorem 1) permutes action selection, $\sigma \cdot a_t$, also permutes the environment's transition probabilities:

$$P(s_{t+1} \mid s_t, a_t) = P(s_{t+1}^\sigma \mid s_t^\sigma, a_t^\sigma)$$

This is because, from the environment's perspective, a state-action pair is indistinguishable from the state-action pair generated by the same agents after swapping their positions and selected actions. Assuming a uniform distribution of start-states $P$, we also have $P(s_1) = P(s_1^\sigma)$. Recall the probability of a trajectory from Equation (1). Given the equivariant function $\pi$ and the two equalities above, it follows that:

$$P(s_1) \prod_{t=1}^{T-1} P(s_{t+1} \mid s_t, a_t)\, \pi(a_t \mid s_t) = P(s_1^\sigma) \prod_{t=1}^{T-1} P(s_{t+1}^\sigma \mid s_t^\sigma, a_t^\sigma)\, \pi(a_t^\sigma \mid s_t^\sigma)$$

We can represent the probability of a trajectory as a single transition from initial state $s_1$ to terminal state $s_T$ by marginalizing out the intermediate states, so it follows that:

$$p_\pi(s_1 \rightarrow s_T, T) = \sum_{s_2, \ldots, s_{T-1}} P(s_1) \prod_{t=1}^{T-1} P(s_{t+1} \mid s_t, a_t)\, \pi(a_t \mid s_t) = \sum_{s_2^\sigma, \ldots, s_{T-1}^\sigma} P(s_1^\sigma) \prod_{t=1}^{T-1} P(s_{t+1}^\sigma \mid s_t^\sigma, a_t^\sigma)\, \pi(a_t^\sigma \mid s_t^\sigma) = p_\pi(s_1^\sigma \rightarrow s_T^\sigma, T)$$

Thus, the probability of reaching terminal state $s_T$ from initial state $s_1$ is equivalent to the probability of reaching $s_T^\sigma$ from $s_1^\sigma$.

Theorem 3. Equivariant policies are exactly fair with respect to team fairness.

Proof. The proof follows directly from Theorem 2. Since $p_\pi(s_1 \rightarrow s_T, T) = p_\pi(s_1^\sigma \rightarrow s_T^\sigma, T)$, the probability of the agents obtaining reward $r$ must be equal to the probability of obtaining reward $r^\sigma$. Under the full distribution of initial states, the equality

$$P[R = r \mid Z = z] = P[R = r^\sigma \mid Z = z^\sigma]$$

holds for all $r$ and assignments of sensitive variables $z$. This is only possible if $R \perp Z$ and, therefore, $I(R; Z) = 0$, which meets exact team fairness.

### Fairness Through Equivariance Regularization

Though Fair-E achieves team fairness, it does so in a rigid manner, imposing hard constraints on policy parameters. Fair-E therefore has no choice but to pursue fairness to the fullest extent (and accept the maximum utility trade-off in return). In many cases, it is advantageous to tune the strength of the fairness constraints. For this reason, we propose a soft-constraint version of Fair-E, which we call Fairness through Equivariance Regularization (Fair-ER). Fair-ER is defined by the following regularization objective:

$$J_{eqv}(\phi_1, \ldots, \phi_i, \ldots, \phi_n) = \mathbb{E}_s\left[\mathbb{E}_{j \neq i}\left[1 - \cos\left(\pi_{\phi_i}(s) - \pi_{\phi_j}(s)\right)\big|_{s=s_t}\right]\right] \tag{9}$$

which encourages equivariance by penalizing agents proportionally to the amount their actions differ from the actions of their teammates. Using Equation (9), Fair-ER extends the standard RL objective from Equation (2) as follows:

$$J(\phi_i) + \lambda\, J_{eqv}(\phi_1, \ldots, \phi_i, \ldots, \phi_n) \tag{10}$$

where $\lambda$ is a fairness control parameter weighting the strength of equivariance. Differentiating the joint objective with respect to parameters $\phi_i$ produces the Fair-ER policy gradient:

$$\nabla_{\phi_i} J_{eqv}(\phi_i) = \mathbb{E}_s\left[\sum_{j \neq i} \sin\left(\pi_{\phi_i}(s) - \pi_{\phi_j}(s)\right)\, \nabla_{\phi_i} \pi_{\phi_i}(s)\big|_{s=s_t}\right] \tag{11}$$

In this work, Fair-ER is applied to each agent's actor network by optimizing Equation (11) alongside Equation (3). Though the above derivations consider stochastic policies, we highlight that Fair-ER is also applicable to deterministic policies and is therefore useful to any multi-agent policy gradient algorithm. We provide further background and a derivation of Equation (11) in Appendix B.
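As a sketch of how the regularizer in Equations (9)-(11) can be attached to a policy-gradient update, the snippet below adds the $1 - \cos$ disagreement penalty to a DDPG-style actor loss for one agent. It is illustrative only: the scalar heading-valued actions, the `actors`/`critic` containers, and the convention of a minimized loss (so the penalty discourages action disagreement, as Equation (9) intends) are placeholder choices rather than our exact implementation.

```python
import torch

def fair_er_actor_loss(agent_idx, obs, actors, critic, lam=0.5):
    """DDPG-style actor loss for one agent plus the Fair-ER penalty (Eq. 9-10).

    obs:    batch of observations, shape [B, obs_dim]
    actors: list of per-agent policy networks that output a heading angle
    critic: this agent's Q-network, critic(obs, action) -> [B, 1]
    lam:    fairness control parameter (lambda); lam = 0 recovers plain DDPG
    """
    action_i = actors[agent_idx](obs)        # pi_{phi_i}(s), a heading in radians

    # Standard deterministic policy-gradient term (maximize Q => minimize -Q).
    rl_loss = -critic(obs, action_i).mean()

    # Equivariance regularizer: 1 - cos(pi_i(s) - pi_j(s)), averaged over teammates.
    penalties = []
    for j, actor_j in enumerate(actors):
        if j == agent_idx:
            continue
        with torch.no_grad():                # only agent i's parameters get the gradient
            action_j = actor_j(obs)
        penalties.append((1.0 - torch.cos(action_i - action_j)).mean())
    eqv_loss = torch.stack(penalties).mean()

    return rl_loss + lam * eqv_loss
```

Setting `lam = 0` recovers the unregularized update, while sweeping `lam` toward 1 corresponds to the fairness-utility modulation studied in the experiments below.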
## Experiments

Pursuit-evasion allows us to quantify the performance of emergent team behavior (in terms of both team success and fairness) under variations of socio-economic parameters such as shared objectives and agent skill level. We therefore use the pursuit-evasion game formalized in the Preliminaries to verify our methods. In each experiment, $n=3$ pursuer agents are trained in a decentralized manner (each following DDPG) for a total of 125,000 episodes, during which velocity is decreased from $|\vec{v}_p| = 1.2$ to $|\vec{v}_p| = 0.4$. The evader speed is fixed at $|\vec{v}_e| = 1.0$. After training, we test the resulting policies at discrete velocity steps (e.g. $|\vec{v}_p|=1.0$, $|\vec{v}_p|=0.9$, etc.), where a decrease in $|\vec{v}_p|$ represents a lesser-skilled pursuer. We define the sensitive attribute $z_i$ for each agent $i$ to be a unique identifier of that agent (i.e. $z_i = [0, 0, 1]$, $z_i = [0, 1, 0]$, or $z_i = [1, 0, 0]$ in the $n = 3$ case). Each method is evaluated in terms of both utility, through traditional measures of performance such as success rate, and fairness, through the team fairness measure proposed in the Method section.

Our evaluation proceeds as follows: first, we study the role of mutual reward in coordination by comparing policies trained with mutual reward to those trained with individual reward. Next, we show that naive mutual reward maximization results in high utility at the expense of fairness. We then show the efficacy of our proposed solution, Fair-E, in resolving these fairness issues. Finally, we evaluate our soft-constraint method, Fair-ER, in balancing fairness and utility.

### Importance of Mutual Reward

We train pursuer policies with decentralized DDPG under conditions of either mutual or individual reward. In the mutual reward condition, pursuers share in the success of their teammates, each receiving the sum of the reward vector $r$. In the individual reward condition, a pursuer is only rewarded if it captures the prey itself, which makes the pursuit-evasion task competitive. The results are shown in Figure 2, where utility is the capture success rate of the multi-agent team. We find that pursuers trained with mutual reward significantly outperform those trained with individual reward. Mutual reward pursuers maintain their performance even as speed drops to $|\vec{v}_p| = 0.5$, which is only half of the evader's speed. Under individual reward, performance drops off quickly for $|\vec{v}_p| \leq 1.0$. The velocity $|\vec{v}_p| = 1.0$ represents a crucial turning-point in pursuit-evasion: it is the point at which a straight-line chase towards the prey no longer works. These results show that, without mutual reward, the pursuers are not properly incentivized to work together and therefore do not develop a coordination strategy that is any better than a greedy individual pursuit of the evader. Thus, we confirm that mutual reward (a single, shared objective) is vital to coordination.

Figure 2: Performance of policies trained with individual vs. mutual reward. As pursuer velocity decreases, the pursuit-evasion task requires more sophisticated coordination.

### Fair Outcomes With Fair-E

Though mutual reward incentivizes efficient team coordination, it does not stipulate how agents should coordinate. To study the nature of the resulting strategy, we examine the distribution of reward vectors obtained by the pursuers over 100 test-time trajectories (averaged over five random seeds each). As shown in Figure 3a (top), in which we plot reward vector assignments for captures involving only one pursuer, the pursuer team discovers an unfair strategy: the majority of captures are accounted for by a single agent. The emergence of an unfair strategy reflects the difficulty of the pursuit-evasion setting. As $|\vec{v}_p|$ decreases, the whole pursuer team has to learn to work together to capture the evader, which is a challenging coordination task. The pursuers learn to do this effectively by assigning roles; e.g. in the $n=3$ case, two pursuers take supporting roles, shepherding the evader towards the third agent, who is designated the "capturer". We note that the decision of which agent becomes the capturer is an emergent phenomenon of the system. As $|\vec{v}_p|$ decreases further, such role assignment is not only helpful but necessary for success. Altogether, the results suggest that unconstrained mutual reward maximization prioritizes utility over fairness.

Our proposed solution, Fair-E, directly combats these fairness issues. Figure 3a (bottom) shows the distribution of reward vectors obtained by agents trained with Fair-E. Due to equivariance, Fair-E yields much more evenly distributed rewards. To further quantify these gains, we compare team fairness for both strategies over a variety of skill levels (i.e. $|\vec{v}_p|$ values). The results, shown in Figure 3b, confirm that Fair-E achieves much lower $I(R; Z)$ and, therefore, higher team fairness. Note that, when $|\vec{v}_p| < 0.9$, $I(R; Z)$ is low for non-equivariant pursuers as well. This is an artifact of team fairness: as $|\vec{v}_p|$ decreases, capture success inevitably decreases as well, which is technically a fairer, albeit less desirable, outcome (all agents share equitably in failure).
Nevertheless, Figure 3b serves as empirical evidence to back up our theoretical result from the Method section that Fair-E meets the demands of team fairness.

Despite achieving fairer outcomes, Fair-E is subject to drops in utility as $|\vec{v}_p|$ decreases (see Figure 3c). The utility curve for Fair-E drops precipitously for agent skill $|\vec{v}_p| < 0.9$, much faster than the drop-off for pursuers with no equivariance. This is because Fair-E directly prevents role assignment. By hard-constraining each agent's policy, Fair-E enforces $\pi_i(s_t) = \pi_j(s_t)$, whereas role assignment requires $\pi_i(s_t) \neq \pi_j(s_t)$ for $i \neq j$. We emphasize role assignment as key to this result, as parameter-sharing has been shown to be helpful in problem domains that do not require explicit role assignment (Baker et al. 2019). In the context of fairness, however, these results indicate that Fair-E will always elect to give up utility to preserve fairness.

Figure 3: Quantitative comparison of performance for equivariant (i.e. Fair-E) vs. non-equivariant policies. a) Distribution of reward vectors for both strategies at the pursuer velocity $|\vec{v}_p| = 1.0$. Non-equivariant policies (top), which learn strategies that push captures towards one agent, yield highly uneven reward distributions. Fair-E policies (bottom) learn to spread captures amongst teammates equally, resulting in even reward distributions. b) Team fairness scores for both strategies (lower is better). Fair-E policies yield fairer outcomes than non-equivariant policies across all pursuer velocity (i.e. agent skill) levels. Note that the curve for the Greedy strategy, which is fair despite its low utility, is tucked behind the Fair-E curve. c) Team utility (i.e. capture success) achieved by both strategies (higher is better). As pursuer velocity decreases, non-equivariant policies outperform Fair-E by a wide margin, indicating that a fairness-utility trade-off exists.

For completeness, we also show results for the policies learned with individual reward (from Figure 2) and a hand-crafted greedy control strategy in which each pursuer runs directly towards the evader. Note that greedy policies are equivariant: by definition, agents will select similar actions in similar states, but they demonstrate no coordination. For this reason, greedy policies have high fairness, but very low utility. Utility follows the same pattern for individual reward policies. Interestingly though, individual reward policies become less fair between $|\vec{v}_p| = 0.9$ and $|\vec{v}_p| = 0.6$, before tapering off as performance decreases. We defer further discussion of this finding, as well as details regarding the computation of the team fairness score, $I(R; Z)$, and the hand-crafted greedy control baseline, to Appendix C.

### Modulating Fairness With Fair-ER

Unlike Fair-E, Fair-ER allows policies to balance fairness and utility dynamically. Intuitively, this is because Fair-ER incentivizes policy equivariance through the regularization objective from Equation (9), while still allowing agents to update their own individual policy parameters (unlike Fair-E). Therefore, the value of the fairness control weight $\lambda$ will dictate how much each agent values fairness vs. utility. To study the effectiveness of this method, we trained Fair-ER agents in increasingly difficult environments (by decreasing pursuer velocity $|\vec{v}_p|$) while modulating the fairness control parameter $\lambda$. The effect of $\lambda$ on policy training is shown in Figure 4.
The results show that, for $\lambda \leq 0.5$, Fair-ER is successful in bridging the performance gap between non-equivariant policies ($\lambda = 0.0$) and Fair-E (black line). Importantly, we show that it is also possible to over-constrain the system so that it actually performs worse than Fair-E (e.g. $\lambda = 1.0$). This indicates that, though Fair-ER can mitigate the drops in performance described above, the regularization parameter $\lambda$ must be tuned appropriately.

We also performed the same test-time analysis as described for Fair-E in the previous subsection. Figure 5 shows the effect of $\lambda$ on both fairness ($I(R; Z)$) and utility (capture success). For each skill level, increasing $\lambda$ allows Fair-ER to fine-tune the balance between fair and unfair policies, achieving the highest utility possible under its given constraints. We find that, with high values of $\lambda$ (e.g. $\lambda = 0.9$), Fair-ER prioritizes fairness over utility and performs in line with (or worse than) Fair-E, achieving fair outcomes even at the expense of utility. When $\lambda$ is in the range $\lambda = 0.1$ to $\lambda = 0.5$, Fair-ER withstands a drop in utility until $|\vec{v}_p| = 0.7$ by giving up small amounts of fairness. Therefore, we find evidence that learning multi-agent coordination strategies with Fair-ER simultaneously maintains higher utility than Fair-E while achieving higher fairness than non-equivariant learning. Overall, tuning the fairness weight $\lambda$ allows us to directly control the strength of the fairness constraints imposed on the system, enabling Fair-ER to modulate fairness to the needs of the task.

### Fairness-Utility Trade-off

As we saw in Figure 3, the hard constraints that Fair-E places on each agent's policy create an inherent commitment to achieving fair outcomes at the expense of utility (or reward). In this section, we examine the extent to which the fairness-utility trade-off exists for Fair-ER across all agent skill levels. For each agent skill level (i.e. $|\vec{v}_p|$ value), we computed both the fairness (through team fairness $I(R; Z)$) and utility (through capture success) scores achieved by multi-agent coordination strategies learned for $\lambda$ values in the range $\lambda \in [0.0, 1.0]$ over 100 test-time trajectories (averaged over five random seeds each). The results of this experiment are shown in Figure 5.

Figure 4: Effect of the equivariance control parameter $\lambda$ on policy learning with Fair-ER. Increasing $\lambda$ yields fairer policies, but causes performance to decay more quickly in difficult environments (lower agent skill). The black line represents Fair-E's performance for comparison.

Unlike many prior studies in both traditional multi-agent fairness settings (Okun 1975; Le Grand 1990; Bertsimas, Farias, and Trichakis 2012; Joe-Wong et al. 2013; Bertsimas, Farias, and Trichakis 2011) and prediction-based settings (Corbett-Davies et al. 2017; Zhao and Gordon 2019), we find that it is not always the case that fairness must be traded for utility. With Fair-ER, fairness comes at little to no cost in utility until $|\vec{v}_p| = 0.8$. This means that, when each agent operates at a high skill level, requiring each agent in the multi-agent team to shift towards an equivariant policy (which yields fair results) does not cause coordination of the larger multi-agent team to break down. When $|\vec{v}_p| < 0.8$, however, utility drops quickly for larger values of $\lambda$. This indicates that, when agent skill decreases (or the task becomes more complex relative to the agents' current skill level), unfair strategies such as role assignment are the only effective way to maintain high levels of utility.
Overall, these results serve as empirical evidence that, in the context of cooperative multi-agent tasks, fairness is inexpensive so long as the task is easy enough (i.e. agent skill is high enough). As task difficulty increases, fairness comes at an increasingly steep cost. To the best of our knowledge, such a characterization of the fairness-utility trade-off for multi-agent settings has not been illustrated in the fairness literature.

Figure 5: Fairness vs. utility comparisons for Fair-ER trained with various values of equivariance regularization $\lambda$ and environment difficulties. Note that $\lambda = 0.0$ is equivalent to no equivariance.

## Conclusion

Multi-agent learning holds promise for helping AI researchers, economic theorists, and policymakers alike better evaluate real-world problems involving social structures, taxation, policy, and economic systems broadly. This work has focused on one such problem; namely, fairness in cooperative multi-agent settings. In particular, we have demonstrated that fairness issues arise naturally in cooperative, single-objective multi-agent learning problems. We have shown that our proposed method, equivariant policy optimization (Fair-E), mitigates such issues. We have also shown that soft constraints (Fair-ER) lower the cost of fairness and allow the fairness-utility trade-off to be balanced dynamically. Moreover, we have presented novel results regarding the fairness-utility trade-off for cooperative multi-agent settings, identifying a connection between agent skill and fairness. In particular, we showed that fairness comes for free when agents are highly skilled, but becomes increasingly expensive for lesser-skilled agents.

This work represents a first step towards understanding the core factors underlying fairness and multi-agent learning in environments where team dynamics and coordination are important for task success. There are a number of exciting avenues of future work that build upon these initial ideas. First, ongoing work is investigating cooperative multi-agent fairness in more complex domains (e.g. video games, simulated economic societies). Moreover, there is room to explore indirect or backdoor causal paths between sensitive and target variables in the context of multi-agent teams, which warrants connecting additional interpretations of fairness (e.g. causal fairness) to cooperative multi-agent settings.

## Acknowledgments

We thank the reviewers for their valuable feedback. This research was supported by NSF awards CCF-1522054 (Expeditions in computing), AFOSR Multidisciplinary University Research Initiatives (MURI) Program FA9550-18-1-0136, AFOSR FA9550-17-1-0292, AFOSR 87727, ARO award W911NF-17-1-0187, and an Open Philanthropy award to the Center for Human-Compatible AI.

## References

Baker, B.; Kanitscheider, I.; Markov, T.; Wu, Y.; Powell, G.; McGrew, B.; and Mordatch, I. 2019. Emergent tool use from multi-agent autocurricula. arXiv preprint arXiv:1909.07528.
Barocas, S.; Hardt, M.; and Narayanan, A. 2019. Fairness and Machine Learning. http://www.fairmlbook.org.
Bertsimas, D.; Farias, V. F.; and Trichakis, N. 2011. The price of fairness. Operations Research, 59(1): 17–31.
Bertsimas, D.; Farias, V. F.; and Trichakis, N. 2012. On the efficiency-fairness trade-off. Management Science, 58(12): 2234–2250.
Calders, T.; Kamiran, F.; and Pechenizkiy, M. 2009. Building classifiers with independency constraints. In 2009 IEEE International Conference on Data Mining Workshops, 13–18. IEEE.
Chouldechova, A.; and Roth, A. 2018. The frontiers of fairness in machine learning. arXiv preprint arXiv:1810.08810.
Corbett-Davies, S.; Pierson, E.; Feller, A.; Goel, S.; and Huq, A. 2017. Algorithmic decision making and the cost of fairness. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 797–806.
De Jong, S.; Tuyls, K.; Verbeeck, K.; and Roos, N. 2005. Priority awareness: towards a computational model of human fairness for multi-agent systems. In Adaptive Agents and Multi-Agent Systems III. Adaptation and Multi-Agent Learning, 117–128. Springer.
Dorans, N. J.; and Cook, L. L. 2016. Fairness in educational assessment and measurement. Routledge.
Dwork, C.; Hardt, M.; Pitassi, T.; Reingold, O.; and Zemel, R. 2012. Fairness through awareness. In Proceedings of the 3rd Innovations in Theoretical Computer Science Conference, 214–226.
Elzayn, H.; Jabbari, S.; Jung, C.; Kearns, M.; Neel, S.; Roth, A.; and Schutzman, Z. 2019. Fair algorithms for learning in allocation problems. In Proceedings of the Conference on Fairness, Accountability, and Transparency, 170–179.
Feldman, M.; Friedler, S. A.; Moeller, J.; Scheidegger, C.; and Venkatasubramanian, S. 2015. Certifying and removing disparate impact. In Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 259–268.
Fuster, A.; Goldsmith-Pinkham, P.; Ramadorai, T.; and Walther, A. 2020. Predictably unequal? The effects of machine learning on credit markets.
Grupen, N. A.; Lee, D. D.; and Selman, B. 2020. Low-Bandwidth Communication Emerges Naturally in Multi-Agent Learning Systems. arXiv preprint arXiv:2011.14890.
Hao, J.; and Leung, H.-f. 2016. Fairness in Cooperative Multiagent Systems. In Interactions in Multiagent Systems: Fairness, Social Optimality and Individual Rationality, 27–70. Springer.
Hardt, M.; Price, E.; and Srebro, N. 2016. Equality of opportunity in supervised learning. arXiv preprint arXiv:1610.02413.
Hughes, E.; Leibo, J. Z.; Phillips, M. G.; Tuyls, K.; Dueñez-Guzmán, E. A.; Castañeda, A. G.; Dunning, I.; Zhu, T.; McKee, K. R.; Koster, R.; et al. 2018. Inequity aversion improves cooperation in intertemporal social dilemmas. arXiv preprint arXiv:1803.08884.
Isaacs, R. 1999. Differential games: a mathematical theory with applications to warfare and pursuit, control and optimization. Courier Corporation.
Jabbari, S.; Joseph, M.; Kearns, M.; Morgenstern, J.; and Roth, A. 2017. Fairness in reinforcement learning. In International Conference on Machine Learning, 1617–1626. PMLR.
Jiang, J.; and Lu, Z. 2019. Learning fairness in multi-agent systems. arXiv preprint arXiv:1910.14472.
Joe-Wong, C.; Sen, S.; Lan, T.; and Chiang, M. 2013. Multiresource allocation: Fairness-efficiency tradeoffs in a unifying framework. IEEE/ACM Transactions on Networking, 21(6): 1785–1798.
Johndrow, J. E.; Lum, K.; et al. 2019. An algorithm for removing sensitive information: application to race-independent recidivism prediction. The Annals of Applied Statistics, 13(1): 189–220.
Kamiran, F.; and Calders, T. 2009. Classifying without discriminating. In 2009 2nd International Conference on Computer, Control and Communication, 1–6. IEEE.
Kleinberg, J.; Mullainathan, S.; and Raghavan, M. 2016. Inherent trade-offs in the fair determination of risk scores. arXiv preprint arXiv:1609.05807.
Lahoti, P.; Gummadi, K. P.; and Weikum, G. 2019. Operationalizing individual fairness with pairwise fair representations. arXiv preprint arXiv:1907.01439.
Le Grand, J. 1990. Equity versus efficiency: the elusive trade-off. Ethics, 100(3): 554–568.
Leibo, J. Z.; Zambaldi, V.; Lanctot, M.; Marecki, J.; and Graepel, T. 2017. Multi-agent reinforcement learning in sequential social dilemmas. arXiv preprint arXiv:1702.03037.
Li, J.; Kuang, K.; Wang, B.; Liu, F.; Chen, L.; Wu, F.; and Xiao, J. 2021. Shapley Counterfactual Credits for Multi-Agent Reinforcement Learning. arXiv preprint arXiv:2106.00285.
Lillicrap, T. P.; Hunt, J. J.; Pritzel, A.; Heess, N.; Erez, T.; Tassa, Y.; Silver, D.; and Wierstra, D. 2015. Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971.
Littman, M. L. 1994. Markov games as a framework for multi-agent reinforcement learning. In Machine Learning Proceedings 1994, 157–163. Elsevier.
Liu, L. T.; Dean, S.; Rolf, E.; Simchowitz, M.; and Hardt, M. 2018. Delayed impact of fair machine learning. In International Conference on Machine Learning, 3150–3158. PMLR.
Lowe, R.; Wu, Y.; Tamar, A.; Harb, J.; Abbeel, P.; and Mordatch, I. 2017. Multi-agent actor-critic for mixed cooperative-competitive environments. arXiv preprint arXiv:1706.02275.
Menon, A. K.; and Williamson, R. C. 2018. The cost of fairness in binary classification. In Conference on Fairness, Accountability and Transparency, 107–118. PMLR.
Mitchell, S.; Potash, E.; Barocas, S.; D'Amour, A.; and Lum, K. 2018. Prediction-based decisions and fairness: A catalogue of choices, assumptions, and definitions. arXiv preprint arXiv:1811.07867.
Mordatch, I.; and Abbeel, P. 2018. Emergence of grounded compositional language in multi-agent populations. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 32.
Okun, A. M. 1975. Equality and efficiency: The big tradeoff. Brookings Institution Press.
Pannekoek, M.; and Spigler, G. 2021. Investigating Trade-offs in Utility, Fairness and Differential Privacy in Neural Networks. arXiv preprint arXiv:2102.05975.
Peysakhovich, A.; and Lerer, A. 2017. Prosocial learning agents solve generalized stag hunts better than selfish ones. arXiv preprint arXiv:1709.02865.
Potash, E.; Brew, J.; Loewi, A.; Majumdar, S.; Reece, A.; Walsh, J.; Rozier, E.; Jorgenson, E.; Mansour, R.; and Ghani, R. 2015. Predictive modeling for public health: Preventing childhood lead poisoning. In Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2039–2047.
Rapoport, A. 1974. Prisoner's dilemma – recollections and observations. In Game Theory as a Theory of a Conflict Resolution, 17–34. Springer.
Ravanbakhsh, S.; Schneider, J.; and Poczos, B. 2017. Equivariance through parameter-sharing. In International Conference on Machine Learning, 2892–2901. PMLR.
Shapley, L. S. 2016. 17. A value for n-person games. Princeton University Press.
Siddique, U.; Weng, P.; and Zimmer, M. 2020. Learning Fair Policies in Multi-Objective (Deep) Reinforcement Learning with Average and Discounted Rewards. In International Conference on Machine Learning, 8905–8915. PMLR.
Silver, D.; Lever, G.; Heess, N.; Degris, T.; Wierstra, D.; and Riedmiller, M. 2014. Deterministic policy gradient algorithms. In International Conference on Machine Learning, 387–395. PMLR.
Sutton, R. S.; and Barto, A. G. 2018. Reinforcement Learning: An Introduction. MIT Press.
Van Lange, P. A.; Joireman, J.; Parks, C. D.; and Van Dijk, E. 2013. The psychology of social dilemmas: A review. Organizational Behavior and Human Decision Processes, 120(2): 125–141.
Wang, J.; Zhang, Y.; Kim, T.-K.; and Gu, Y. 2020. Shapley Q-value: a local reward approach to solve global reward games. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, 7285–7292.
Wang, J. X.; Hughes, E.; Fernando, C.; Czarnecki, W. M.; Dueñez-Guzmán, E. A.; and Leibo, J. Z. 2018. Evolving intrinsic motivations for altruistic behavior. arXiv preprint arXiv:1811.05931.
Wen, M.; Bastani, O.; and Topcu, U. 2021. Algorithms for Fairness in Sequential Decision Making. In International Conference on Artificial Intelligence and Statistics, 1144–1152. PMLR.
Zafar, M. B.; Valera, I.; Rodriguez, M. G.; and Gummadi, K. P. 2017. Fairness constraints: Mechanisms for fair classification. In Artificial Intelligence and Statistics, 962–970. PMLR.
Zhang, C.; and Shah, J. A. 2014. Fairness in multi-agent sequential decision-making. Advances in Neural Information Processing Systems, 27.
Zhao, H.; and Gordon, G. J. 2019. Inherent tradeoffs in learning fair representations. arXiv preprint arXiv:1906.08386.
Zheng, S.; Trott, A.; Srinivasa, S.; Naik, N.; Gruesbeck, M.; Parkes, D. C.; and Socher, R. 2020. The AI Economist: Improving equality and productivity with AI-driven tax policies. arXiv preprint arXiv:2004.13332.
Zimmer, M.; Siddique, U.; and Weng, P. 2020. Learning Fair Policies in Decentralized Cooperative Multi-Agent Reinforcement Learning. arXiv preprint arXiv:2012.09421.