Policy Aggregation

Parand A. Alamdari (University of Toronto & Vector Institute, parand@cs.toronto.edu), Soroush Ebadian (University of Toronto, soroush@cs.toronto.edu), Ariel D. Procaccia (Harvard University, arielpro@seas.harvard.edu). Equal contribution; authors are listed alphabetically.

38th Conference on Neural Information Processing Systems (NeurIPS 2024).

We consider the challenge of AI value alignment with multiple individuals that have different reward functions and optimal policies in an underlying Markov decision process. We formalize this problem as one of policy aggregation, where the goal is to identify a desirable collective policy. We argue that an approach informed by social choice theory is especially suitable. Our key insight is that social choice methods can be reinterpreted by identifying ordinal preferences with volumes of subsets of the state-action occupancy polytope. Building on this insight, we demonstrate that a variety of methods, including approval voting, Borda count, the proportional veto core, and quantile fairness, can be practically applied to policy aggregation.

1 Introduction

Early discussion of AI value alignment had often focused on learning desirable behavior from an individual teacher, for example, through inverse reinforcement learning [27, 1]. But, in recent years, the conversation has shifted towards aligning AI models with large groups of people or even entire societies. This shift is exemplified at a policy level by OpenAI's democratic inputs to AI program [41] and Meta's citizens' assembly on AI governance [8], and at a technical level by the ubiquity of reinforcement learning from human feedback [30] as a method for fine-tuning large language models.

We formalize the challenge of value alignment with multiple individuals as a problem that we view as fundamental: policy aggregation. Our starting point is the common assumption that the environment can be represented as a Markov decision process (MDP). While the states, actions and transition functions are shared by all agents, their reward functions, which incorporate values, priorities or subjective beliefs, may be different. In particular, each agent has its own optimal policy in the underlying MDP. Our question is this: How should we aggregate the individual policies into a desirable collective policy?

A naïve answer is to define a new reward function that is the sum of the agents' reward functions (for each state-action pair separately) and compute an optimal policy for this aggregate reward function; such a policy would guarantee maximum utilitarian social welfare. This approach has a major shortcoming, however, in that it is sensitive to affine transformations of rewards: if we doubled one of the reward functions, for example, the aggregate optimal policy may change. This is an issue because each agent's individual optimal policy is invariant to (positive) affine transformations of rewards, so while it is possible to recover a reward function that induces an agent's optimal policy by observing their actions over time (we assume this is done accurately, in order to focus on the essence of the policy aggregation problem), it is impossible to distinguish between reward functions that are affine transformations of each other. More broadly, economists and moral philosophers have long been skeptical about interpersonal comparisons of utility [19] due to the lack of a universal scale, an issue that is especially pertinent in our context. Therefore, aggregation methods that are invariant to affine transformations are strongly preferred.
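To make the scale sensitivity concrete, the following toy example (not from the paper; the numbers are illustrative) compares two candidate policies for two agents and shows that doubling one agent's rewards flips the utilitarian winner even though neither agent's own ranking of the two policies changes.

```python
# Illustrative only: two candidate policies evaluated by two agents.
# J[p][i] is agent i's expected return under policy p.
J = {
    "policy_A": [0.9, 0.2],   # agent 1 prefers A, agent 2 prefers B
    "policy_B": [0.6, 0.4],
}

def utilitarian_winner(returns, scales):
    """Pick the policy maximizing the sum of (rescaled) agent returns."""
    return max(returns, key=lambda p: sum(s * r for s, r in zip(scales, returns[p])))

print(utilitarian_winner(J, scales=[1, 1]))  # policy_A  (0.9 + 0.2 > 0.6 + 0.4)
print(utilitarian_winner(J, scales=[1, 2]))  # policy_B  (0.9 + 0.4 < 0.6 + 0.8)
# Each agent's individual ranking of A vs. B is unchanged by the rescaling,
# yet the aggregate "optimal" policy flips.
```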
Our approach. To develop such aggregation methods, we look to social choice theory, which typically deals with the aggregation of ordinal preferences. To take a canonical example, suppose agents report rankings over m alternatives. Under the Borda count rule, each voter gives m − k points to the alternative they rank in the kth position, and the alternative with the most points overall is selected.

The voting approach can be directly applied to our setting. For each agent, it is (in theory) possible to compute the value of every possible (deterministic) policy, and rank them all by value. Then, any standard voting rule, such as Borda count, can be used to aggregate the rankings over policies and single out a desirable policy. The caveat, of course, is that this method is patently impractical, because the number of policies is exponential in the number of states of the MDP.

The main insight underlying our approach is that ordinal preferences over policies have a much more practical volumetric interpretation in the state-action occupancy polytope O. Roughly speaking, a point in the state-action occupancy polytope represents a (stochastic) policy through the frequency with which it is expected to visit different state-action pairs. If a policy is preferred by an agent to a subset of policies O' ⊆ O, its rank is the volume of O' as a fraction of the volume of O. The score of a policy under Borda count, for example, can be interpreted as the sum of these ranks over all agents.

Our results. We investigate two classes of rules from social choice theory: those that guarantee a notion of fairness, and voting rules. By mapping ordinal preferences to the state-action occupancy polytope, we adapt the different rules to the policy aggregation problem.

The former class is examined in Section 5. As a warm-up we start from the notion of proportional veto core; it follows from recent work by Chaudhury et al. [7] that a volumetric interpretation of this notion is nonempty and can be computed efficiently. We then turn to quantile fairness, which was recently introduced by Babichenko et al. [4]; we prove that the volumetric interpretation of this notion yields guarantees that are far better than those known for the original, discrete setting, and we design a computationally efficient algorithm to optimize those guarantees.

The latter class is examined in Section 6; we focus on volumetric interpretations of α-approval (including the ubiquitous plurality rule, which is the special case of α = 1) and the aforementioned Borda count. In contrast to the rules studied in Section 5, existence is a nonissue for these voting rules, but computation is a challenge, and indeed we establish several computational hardness results. To overcome this obstacle, we implement voting rules for policy aggregation through mixed integer linear programming, which leads to practical solutions.

Finally, our experiments in Section 7 evaluate the policies returned by different rules based on their fairness; the results identify quantile fairness as especially appealing. The experiments also illustrate the advantage of our approach over rules that optimize measures of social welfare (which are sensitive to affine transformations of the rewards).

2 Related Work

Noothigattu et al. [28] consider a setting related to ours, in that different agents have different reward functions and different policies that must be aggregated.
However, they assume that the agents' reward functions are noisy perturbations of a ground-truth reward function, and the goal is to learn an optimal policy according to the ground-truth rewards. In social choice terms, our work is akin to the typical setting where subjective preferences must be aggregated, whereas the work of Noothigattu et al. [28] is conceptually similar to the setting where votes are seen as noisy estimates of a ground-truth ranking [39, 9, 6].

Chaudhury et al. [7] study a problem completely different from ours: fairness in federated learning. However, their technical approach served as an inspiration for ours. Specifically, they consider the proportional veto core and transfer it to the federated learning setting using volumetric arguments, by considering volumes of subsets in the space of models. Their proof that the proportional veto core is nonempty carries over to our setting, as we explain in Section 5.

There is a body of work on multi-objective reinforcement learning (MORL) and planning that uses a scalarization function to reduce the problem to a single-objective one [32, 21]. Other solutions to MORL focus on developing algorithms to identify a set of policies approximating the problem's Pareto frontier [38]. A line of work more closely related to ours focuses on fairness in sequential decision making, often taking the scalarization approach to aggregate agents' preferences by maximizing a (cardinal) social welfare function, which maps the vector of agent utilities to a single value. Ogryczak et al. [29] and Siddique et al. [34] investigate generalized Gini social welfare, and Mandal and Gan [25], Fan et al. [13] and Ju et al. [23] focus on Nash and egalitarian social welfare. Alamdari et al. [3] study this problem in a non-Markovian setting, where fairness depends on the history of actions over time, and introduce concepts to assess different fairness criteria at varying time intervals. A shortcoming of these solutions is that they are not invariant to affine transformations of rewards, a property that is crucial in our setting, as discussed earlier.

Our work is closely related to the pluralistic alignment literature, which aims to develop AI systems that reflect the values and preferences of diverse individuals [35, 10]. Alamdari et al. [2] propose a framework in reinforcement learning in which an agent learns to act in a way that is considerate of the values and perspectives of humans within a particular environment. Concurrent work explores reinforcement learning from human feedback (RLHF) from a social choice perspective, where the reward model is based on pairwise human preferences, often constructed using the Bradley-Terry model [5]. Zhong et al. [42] consider the maximum Nash and egalitarian welfare solutions, and Swamy et al. [36] propose a method based on maximal lotteries due to Fishburn [14].

3 Preliminaries

For t ∈ N, let [t] = {1, 2, . . . , t}. For a closed set S, let Δ(S) denote the probability simplex over the set S. We denote the dot product of two vectors x, y ∈ R^d by ⟨x, y⟩ = Σ_{i=1}^d x_i y_i. A halfspace in R^d determined by w ∈ R^d and b ∈ R is the set of points {x ∈ R^d | ⟨x, w⟩ ≤ b}. A polytope O ⊆ R^d is the intersection of a finite number of halfspaces, i.e., a convex subset of R^d determined by a set of linear constraints {x | Ax ≤ b}, where A ∈ R^{k×d} is the coefficient matrix of k linear inequalities and b ∈ R^k.
3.1 Multi-Objective Markov Decision Processes

A multi-objective Markov decision process (MOMDP) is a tuple M = ⟨S, A, P, R_1, . . . , R_n⟩ in the average-reward case and M = ⟨S, d_init, A, P, R_1, . . . , R_n, γ⟩ in the discounted-reward case, where S is a finite set of states, A is a finite set of actions, and P : S × A → Δ(S) is the transition probability distribution; P(s_t, a_t, s_{t+1}) is the probability of transitioning to state s_{t+1} by taking action a_t in s_t. For i ∈ [n], R_i : S × A → R is the reward function of the ith agent, the initial state is sampled from d_init ∈ Δ(S), and γ ∈ (0, 1] is the discount factor.

A (Markovian) policy π(a|s) is a probability distribution over the actions a ∈ A given the state s ∈ S. A policy is deterministic if at each state s one action is selected with probability 1, and stochastic otherwise. The expected average return and the expected discounted return of agent i for a policy π are defined over an infinite time horizon as

J_i^{avg}(π) = lim_{T→∞} (1/T) · E_{π,P}[ Σ_{t=1}^{T} R_i(s_t, a_t) ],    J_i^{γ}(π) = (1 − γ) · E_{π,P}[ Σ_{t=1}^{∞} γ^t R_i(s_t, a_t) | s_1 ∼ d_init ],

where the expectation is over the state-action pairs at time t based on both the policy π and the transition function P.

Definition 1 (state-action occupancy measure). Let P_π^t be the probability measure over states at time t under policy π. The state-action occupancy measure for state s and action a is defined as

d_π^{avg}(s, a) = lim_{T→∞} (1/T) · E[ Σ_{t=1}^{T} P_π^t(s) π(a|s) ],    d_π^{γ}(s, a) = (1 − γ) · E[ Σ_{t=1}^{∞} γ^t P_π^t(s) π(a|s) ].

For both the average and discounted cases, we can rewrite the expected return as the dot product of the state-action occupancy measure and the rewards, that is, J_i(π) = Σ_{(s,a)} d_π(s, a) R_i(s, a) = ⟨d_π, R_i⟩. In addition, the policy can be recovered from the occupancy measure as π(a|s) = d_π(s, a) / Σ_{a'} d_π(s, a') if Σ_{a'} d_π(s, a') > 0, and π(a|s) = 1/|A| otherwise.

Definition 2 (state-action occupancy polytope [31, 40]). For a MOMDP M in the average-reward case, the space of valid state-action occupancies is the polytope

O^{avg} = { d_π^{avg} | d_π^{avg} ≥ 0,  Σ_{s,a} d_π^{avg}(s, a) = 1,  Σ_a d_π^{avg}(s, a) = Σ_{s',a'} P(s', a', s) · d_π^{avg}(s', a')  for all s ∈ S }.

We similarly define this polytope for the discounted-reward case in Appendix A.

A mechanism receives a MOMDP and aggregates the agents' preferences into a policy. An economic efficiency axiom in the social choice literature is that of Pareto optimality.

Definition 3 (Pareto optimality). For a MOMDP M and ϵ ≥ 0, a policy π is ϵ-Pareto optimal if there does not exist another policy π' such that J_i(π') ≥ J_i(π) + ϵ for all i ∈ N, with strict inequality for at least one agent. For ϵ = 0, we simply call such policies Pareto optimal. We call a mechanism Pareto optimal if it always returns a Pareto optimal policy.

In special cases where all agents unanimously agree on an optimal policy, Pareto optimality implies that the mechanism will return one such policy. We discuss Pareto optimal implementations of all mechanisms in this work.
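The following minimal sketch (not the paper's code) illustrates Definition 1 and the recovery step for the average-reward case, assuming the Markov chain induced by the policy is ergodic so that a unique stationary distribution exists. Array names (`P`, `pi`, `R_i`) are illustrative.

```python
import numpy as np

# P[s, a, s'] is the transition probability, pi[s, a] the policy, R_i[s, a] agent i's reward.

def occupancy_measure(P, pi):
    """d_pi(s, a) = mu(s) * pi(a|s), where mu is the stationary state distribution."""
    P_pi = np.einsum("sa,sat->st", pi, P)               # state-to-state matrix under pi
    evals, evecs = np.linalg.eig(P_pi.T)
    mu = np.real(evecs[:, np.argmax(np.real(evals))])   # eigenvector for eigenvalue 1
    mu = mu / mu.sum()                                   # stationary distribution over states
    return mu[:, None] * pi                              # shape (S, A), entries sum to 1

def expected_return(d, R_i):
    """J_i(pi) = <d_pi, R_i> (Definition 1)."""
    return float(np.sum(d * R_i))

def recover_policy(d):
    """pi(a|s) = d(s, a) / sum_a d(s, a), with a uniform default on unvisited states."""
    totals = d.sum(axis=1, keepdims=True)
    A = d.shape[1]
    return np.where(totals > 0, d / np.maximum(totals, 1e-12), 1.0 / A)
```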
3.2 Voting and Social Choice Functions

In the classical social choice setting, we have a set of n agents and a set C of m alternatives. The preferences of voter i ∈ [n] are represented as a strict ordering over the alternatives σ_i : [m] → C with σ_i(1) ≻_i σ_i(2) ≻_i . . . ≻_i σ_i(m), where c_1 ≻_i c_2 denotes that agent i prefers c_1 over c_2, for c_1, c_2 ∈ C. A (possibly randomized) voting rule aggregates the agents' preferences and returns an alternative or a distribution over the alternatives as the collective decision.

Positional scoring rules. A positional scoring rule with scoring vector s = (s_1, . . . , s_m) such that s_1 ≥ s_2 ≥ . . . ≥ s_m works as follows. Each agent gives a score of s_1 to their top choice, a score of s_2 to their second choice, and so on. The votes are tallied and an alternative with the maximum total score is selected. A few well-known positional scoring rules are: plurality, (1, 0, . . . , 0); Borda, (m − 1, m − 2, . . . , 1, 0); and k-approval, (1, . . . , 1, 0, . . . , 0) with k ones.

4 Occupancy Polytope as the Space of Alternatives

In a MOMDP M, each agent i incorporates their values and preferences into their respective reward function R_i. Agent i prefers π over π' if and only if π achieves a higher expected return, J_i(π) > J_i(π'), and is indifferent between two policies π and π' if and only if J_i(π) = J_i(π'). As discussed before, given a state-action occupancy measure d_π in the state-action occupancy polytope O, we can recover the corresponding policy π. Therefore, we can interpret O as the domain of all possible alternatives over which the n agents have heterogeneous weak preferences (with ties). Agent i prefers d_π to d_π' in O if and only if they prefer π to π'. We study the policy aggregation problem through this lens; specifically, we design or adapt voting mechanisms where the (continuous) space of alternatives is O and agents have weak preferences over it determined by their reward functions R_i.

Affine transformation and reward normalization. A particular benefit of this interpretation, as mentioned before, is that all positive affine transformations of the reward functions, i.e., a·R_i + b for a ∈ R_{>0} and b ∈ R, yield the same weak ordering over the polytope. Hence, we can assume without loss of generality that J_i(π) ∈ [0, 1]. Further, we can ignore agents that are indifferent between all policies, i.e., those with min_π J_i(π) = max_π J_i(π), and normalize the reward functions as

R_i ← (R_i − min_π J_i(π)) / (max_π J_i(π) − min_π J_i(π)),

so that min_π J_i(π) = 0 and max_π J_i(π) = 1. The relative ordering of the policies does not change since, for all points d_π ∈ O, we have Σ_{s,a} d_π(s, a) = 1.
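The following hedged sketch shows one way to carry out this normalization in practice: compute min_π J_i and max_π J_i by two linear programs over the occupancy polytope of Definition 2 (using scipy's `linprog`), then rescale R_i. The helper names and constraint construction are illustrative, not the paper's implementation.

```python
import numpy as np
from scipy.optimize import linprog

# P[s, a, s'] are transitions; the LP variables are the S*A entries of d (row-major).

def occupancy_polytope(P):
    """Equality constraints of the average-reward occupancy polytope (Definition 2)."""
    S, A, _ = P.shape
    n_vars = S * A
    A_eq, b_eq = [np.ones(n_vars)], [1.0]                 # sum_{s,a} d(s,a) = 1
    for s in range(S):                                     # flow conservation per state
        row = np.zeros(n_vars)
        for a in range(A):
            row[s * A + a] += 1.0
        for sp in range(S):
            for ap in range(A):
                row[sp * A + ap] -= P[sp, ap, s]
        A_eq.append(row); b_eq.append(0.0)
    return np.array(A_eq), np.array(b_eq)

def normalize_reward(P, R_i):
    A_eq, b_eq = occupancy_polytope(P)
    c = R_i.flatten()
    lo = linprog(c, A_eq=A_eq, b_eq=b_eq, bounds=(0, None)).fun        # min_pi J_i
    hi = -linprog(-c, A_eq=A_eq, b_eq=b_eq, bounds=(0, None)).fun      # max_pi J_i
    if np.isclose(hi, lo):
        return None                    # agent is indifferent between all policies; drop it
    return (R_i - lo) / (hi - lo)      # shifting by a constant shifts J_i, since d sums to 1
```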
Volumetric definitions. A major difference between voting over a continuous space of alternatives and the classical voting setting is that the domain of alternatives is infinite, so not all voting mechanisms can be directly applied to the policy aggregation problem. In particular, various voting rules require reasoning about the rank of an alternative or the size of some subset of alternatives. For instance, the Borda count of an alternative c over a finite set of alternatives is defined as the number (or fraction) of candidates ranked below c. In the continuous setting, for almost all of the mechanisms that we discuss later, we use the measure or volume of a subset of alternatives to refer to its size. For a measurable subset O' ⊆ O, let vol(O') denote its measure. The ratio vol(O')/vol(O) is the fraction of alternatives that lie in O'. A probabilistic interpretation is that, under the uniform distribution over the polytope O, vol(O')/vol(O) is the probability that a policy sampled uniformly from O lies in O'.

We can also define the expected return distribution of an agent over O as a random variable that maps a policy to its expected return, i.e., one that maps d_π ∈ O to ⟨d_π, R_i⟩. The pdf and cdf of this random variable are defined below.

Definition 4 (expected return distribution). For a MOMDP M and v ∈ R, the expected return distribution of agent i ∈ [n] has pdf and cdf

f_i(v) = (1/vol(O)) ∫_{x ∈ O} δ(v − ⟨x, R_i⟩) dx,    F_i(v) = ∫_{−∞}^{v} f_i(x) dx = vol(O ∩ {⟨x, R_i⟩ ≤ v}) / vol(O),

where δ(·) is the Dirac delta function indicating v = ⟨x, R_i⟩.

A useful observation about the pdf f_i is that it is unimodal, i.e., increasing up to its mode, mode(f_i) ∈ arg max_{v ∈ R} f_i(v), and decreasing afterwards; this follows from the Brunn-Minkowski inequality [17]. Since f_i (the pdf) is the derivative of F_i (the cdf), the unimodality of f_i implies that F_i is convex on (−∞, mode(f_i)] and concave on [mode(f_i), ∞).

In our algorithms, we use a subroutine that measures the volume of a polytope, which we denote by vol-comp({Ax ≤ b}). Dyer et al. [12] designed a fully polynomial time randomized approximation scheme (FPRAS) for computing the volume of polytopes. We report the running time of our algorithms in terms of the number of calls to this oracle.
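As a rough practical stand-in for the exact oracle, one can estimate volume fractions by Monte Carlo over sampled policies, which is what the experiments in Section 7 do before fitting a generalized logistic cdf. The sketch below assumes the `occupancy_measure` helper from the earlier snippet; note that sampling random policies does not yield an exactly uniform sample of O, so this is an approximation, not the FPRAS.

```python
import numpy as np

def sample_return_quantiles(P, R, occupancy_measure, n_samples=10_000, rng=None):
    """Q[i, k] = <d_k, R_i> for randomly sampled policies pi_k; R has shape (n, S, A)."""
    rng = np.random.default_rng(rng)
    S, A, _ = P.shape
    returns = np.empty((R.shape[0], n_samples))
    for k in range(n_samples):
        pi = rng.dirichlet(np.ones(A), size=S)     # a random stochastic policy
        d = occupancy_measure(P, pi)
        returns[:, k] = (R * d).sum(axis=(1, 2))   # J_i = <d, R_i> for every agent i
    return returns

def estimate_cdf(returns_i, v):
    """Empirical estimate of F_i(v): fraction of sampled policies with return at most v."""
    return float(np.mean(returns_i <= v))
```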
5 Fairness in Policy Aggregation

In this section, we utilize the volumetric interpretation of the state-action occupancy polytope to extend fairness notions from social choice to policy aggregation, and we develop algorithms to compute stochastic policies provably satisfying these notions.

5.1 Proportional Veto Core

The proportional veto core was first proposed by Moulin [26] in the classical social choice setting with a finite set of alternatives where agents have full (strict) rankings over the alternatives. For simplicity, suppose the number of alternatives m is a multiple of n. The idea of the proportional veto core is that x% of the agents should be able to veto x% of the alternatives. More precisely, for an alternative c to be in the proportional veto core, there should not exist a coalition S that can block c using their proportional veto power of |S|/n. S blocks c if they can unanimously suggest m(1 − |S|/n) candidates that they prefer to c. For instance, if c is in the proportional veto core, it cannot be the case that a coalition of 60% of the agents unanimously prefers 40% of the alternatives to c. Chaudhury et al. [7] extended this notion to a continuous domain of alternatives in the federated learning setting. We show that such an extension also applies to policy aggregation.

Definition 5 (proportional veto core). Let ϵ ∈ (0, 1/n). For a coalition of agents S ⊆ [n], let veto(S) = |S|/n be its veto power. A point d_π ∈ O is blocked by a coalition S if there exists a subset O' ⊆ O of measure vol(O')/vol(O) ≥ 1 − veto(S) + ϵ such that all agents in S prefer all points in O' to d_π, i.e., d_π' ≻_i d_π for all d_π' ∈ O' and i ∈ S. A point d_π is in the ϵ-proportional veto core if it is not blocked by any coalition.

A candidate in the proportional veto core satisfies desirable properties that are extensively discussed in prior work [26, 7, 24, 22]. It is worth mentioning that any candidate in the ϵ-proportional veto core, besides its fairness properties, is also economically efficient, as it satisfies ϵ-Pareto optimality. This holds since the grand coalition S = [n] can veto any ϵ-Pareto dominated alternative. Moulin [26] proved that the proportional veto core is nonempty in the discrete setting, and Chaudhury et al. [7] proved the analogous result in the continuous setting. The following result is a corollary of Theorem 1 of Chaudhury et al. [7].

ALGORITHM 1: Sequential ϵ-Proportional Veto Core [7]
  set δ to a per-agent elimination fraction of roughly 1/n (adjusted by ϵ; see [7])
  for i = 1 to n do
    using binary search, find v*_i ∈ [0, 1] s.t. vol(O_{i−1} ∩ {⟨x, R_i⟩ ≤ v*_i}) ≈ δ · vol(O)
    O_i ← O_{i−1} ∩ {⟨x, R_i⟩ ≥ v*_i}
  return a welfare-maximizing point d_π ∈ O_n

ALGORITHM 2: ϵ-Max Quantile Fairness
  procedure q-quantile-feasible(q):
    O_q ← {x ∈ O | ⟨x, R_i⟩ ≥ F_i^{−1}(q) for all i ∈ [n]}
    if O_q is a feasible linear program then return True else return False
  using binary search, find the maximum q ∈ [0, 1] s.t. q-quantile-feasible(q) returns True
  return a welfare-maximizing point d_π ∈ O_q

Theorem 1. Let ϵ ∈ (0, 1/n). For a policy aggregation problem, the ϵ-proportional veto core is nonempty. Furthermore, such policies can be found in polynomial time using O(log(1/ϵ)) calls per agent to vol-comp.

We refer the reader to the paper of Chaudhury et al. [7] for the complete proof, and provide a high-level description of Algorithm 1, which finds a point in the proportional veto core. Algorithm 1 iterates over the agents and lets the ith agent eliminate roughly (1/n)·vol(O) of the remaining space of alternatives. That is, agent i finds the halfspace H_i = {⟨x, R_i⟩ ≤ v*_i} whose intersection with O_{i−1} (the remaining part of O at the ith iteration) has a volume of approximately vol(O)/n. This value can be found, for each agent, by binary search over v*_i to a precision of [v*_i − ϵ, v*_i] using O(log(1/ϵ)) calls to the volume-estimating subroutine vol-comp. (It is important to check the nonemptiness of the search space O_i, as an overestimation of v*_i could lead to an empty feasible region.)

Pareto optimality. We briefly discuss why Algorithm 1 is Pareto optimal. During the ith iteration, the space of policies is contracted by adding a linear constraint of the form J_i(π) ≥ v*_i. If the returned welfare-maximizing policy π (derived from d_π) were Pareto dominated by another policy π', then π' would satisfy all these linear constraints, since J_i(π') ≥ J_i(π) ≥ v*_i with the first inequality strict for at least one agent. Therefore, d_π' ∈ O_n and π' achieves a higher social welfare, which is a contradiction. The same argument establishes the Pareto optimality of the other mechanisms discussed later; each of them searches for a policy that satisfies certain lower bounds on the agents' utilities, from which a welfare-maximizing, Pareto optimal policy can be selected.
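The following is a hedged, sample-based sketch of the sequential veto idea: instead of the exact volumetric binary search with the vol-comp oracle, it operates on a finite set of sampled policies (e.g., the returns matrix `Q` from the Monte Carlo snippet above) and lets each agent discard roughly a (1/n − eps) fraction of the original sample, always their lowest-return surviving policies. It is a loose discretization of Algorithm 1, not the algorithm of [7] itself.

```python
import numpy as np

def sequential_veto(Q, eps=0.01):
    """Q[i, k] = agent i's expected return on sampled policy k. Returns surviving indices."""
    n, m = Q.shape
    alive = np.ones(m, dtype=bool)
    budget = max(0, int(m * (1.0 / n - eps)))            # samples each agent may veto
    for i in range(n):
        alive_idx = np.where(alive)[0]
        worst_first = alive_idx[np.argsort(Q[i, alive_idx])]
        alive[worst_first[:budget]] = False
    return np.where(alive)[0]

# Usage: among the survivors, pick the sample maximizing sum_i Q[i, k]
# (a welfare-maximizing point, mirroring the last line of Algorithm 1).
```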
5.2 Quantile Fairness

Next, we consider an egalitarian type of fairness based on the occupancy polytope, building on very recent work by Babichenko et al. [4] in the discrete setting; surprisingly, we show that it is possible to obtain stronger guarantees in the continuous setting. Babichenko et al. [4] focus on the fair allocation of a set of m indivisible items among n agents, where each item must be fully allocated to a single agent. They quantify the extent to which an allocation A is fair to an agent i by the fraction of allocations over which i prefers A (note that the number of all discrete allocations is n^m). In other words, if one randomly samples an allocation, fairness is measured by the probability that A is preferred to the random allocation. An allocation is q-quantile fair for q ∈ [0, 1] if all agents consider it among their top q-quantile allocations. Babichenko et al. [4] aim to find a universal value of q such that for any fair division instance, a q-quantile fair allocation exists. They make an interesting connection between q-quantile fair allocations and the Erdős Matching Conjecture, and show that if the conjecture holds, (1/2e)-quantile fair allocations exist for all instances (this result is for arbitrary monotone valuations; for additive valuations, they show that (0.14/e)-quantile fair allocations always exist without any assumptions). We ask the same question for policy aggregation, and again, a key difference is that our domain of alternatives is continuous. The notion of q-quantile fairness extends well to our setting. Agents assess the fairness of a policy π based on the fraction of the occupancy polytope (i.e., the set of all policies) to which they prefer π, or equivalently, the probability that they prefer the chosen policy to a randomly sampled policy.

Definition 6 (q-quantile fairness). For a MOMDP M and q ∈ [0, 1], a policy π is q-quantile fair if for every agent i ∈ [n], π is among i's top q-fraction of policies, i.e., vol(O ∩ {⟨x, R_i⟩ ≤ J_i(π)}) ≥ q · vol(O).

We show that a (1/e)-quantile fair policy always exists and that this ratio is tight; note that this bound is twice as good as that of Babichenko et al. [4], and it is unconditional. We prove this by making a connection to an inequality due to Grünbaum [17]. The centroid of a polytope P is defined as centroid(P) = (1/vol(P)) ∫_{x ∈ P} x dx.

Lemma 1 (Grünbaum's inequality). Let P be a polytope and w a direction in R^d. For the halfspace H = {x | ⟨w, x − centroid(P)⟩ ≥ 0}, it holds that

(1/e) · vol(P) ≤ (d/(d+1))^d · vol(P) ≤ vol(P ∩ H) ≤ (1 − (d/(d+1))^d) · vol(P).

Furthermore, this is tight for the d-dimensional simplex.

Theorem 2. For every MOMDP M, there always exists a q-quantile fair policy for q = ((ℓ−1)/ℓ)^{ℓ−1}, where ℓ = |S|·|A|. Note that q ≥ 1/e ≈ 36.7%. Furthermore, this bound is tight: for any ℓ, there is an instance with a single state and ℓ actions where no q-quantile fair policy exists for any q > ((ℓ−1)/ℓ)^{ℓ−1}.

Proof. First, we show that the centroid of the occupancy polytope, c = centroid(O), is a q-quantile fair policy for the aforementioned q. Since O is a subset of the (ℓ−1)-dimensional simplex (see Definition 2), O has nonzero volume in some lower-dimensional space of dimension ℓ' ≤ |S|·|A| − 1. By invoking Grünbaum's inequality (Lemma 1) with w_i equal to R_i projected onto the ℓ'-dimensional subspace, for every agent i ∈ [n] we have vol(O ∩ H_i) ≥ (ℓ'/(ℓ'+1))^{ℓ'} · vol(O), where H_i = {⟨x − c, w_i⟩ ≤ 0} = {⟨x, R_i⟩ ≤ J_i(c)}. Since ℓ' ≤ ℓ − 1, we have (ℓ'/(ℓ'+1))^{ℓ'} ≥ ((ℓ−1)/ℓ)^{ℓ−1}, which completes the proof of existence.

To show tightness, take a MOMDP with a single state S = {s} (hence a constant transition function P(·) = s), ℓ actions A = {a_1, . . . , a_ℓ}, and ℓ agents. The reward function of agent i is R_i(s, a_i) = 1 and 0 otherwise. The state-action occupancy polytope of this MOMDP is the (ℓ−1)-dimensional simplex O = {d_π ∈ R^ℓ_{≥0} | Σ_a d_π(s, a) = 1}. Take any point d_π ∈ O. There exists at least one agent i with J_i(d_π) = d_π(s, a_i) ≤ 1/ℓ. Take the halfspace H_i = {⟨x, R_i⟩ ≥ 1/ℓ}. Observe that H_i ∩ O is equivalent to a smaller (ℓ−1)-dimensional simplex, {Σ_a d_π(s, a) = 1 − 1/ℓ, d_π ≥ 0}, which has volume (1 − 1/ℓ)^{ℓ−1} · vol(O). Therefore, vol(O ∩ H_i)/vol(O) = ((ℓ−1)/ℓ)^{ℓ−1}, which yields the claimed bound.

The centroid of the occupancy polytope, as per Theorem 2, attains a worst-case measure of quantile fairness. However, the centroid policy can be highly suboptimal, as it disregards the preferences of the agents involved; for instance, there could exist a 99%-quantile fair policy. To this end, we take an egalitarian approach and aim to find a q-quantile fair policy with the maximum q.

Max quantile fair algorithm. Algorithm 2 searches for the optimal value q*, for which a q*-quantile fair policy exists, and gets close to q* by binary search. To perform the search, we need a subroutine that checks, for a given value of q, whether a q-quantile fair policy exists. Suppose we have the q-quantile expected return F_i^{−1}(q) for all i, that is, the expected return v_i such that F_i(v_i) = q. Then, the existence of a q-quantile fair policy is equivalent to the feasibility of the linear program {x ∈ O | ⟨x, R_i⟩ ≥ F_i^{−1}(q), i ∈ [n]}, which can be solved in polynomial time. Importantly, after finding a good approximation of q*, there can be infinitely many policies that are q-quantile fair, and there are various ways to select the final policy. As mentioned earlier, a desirable efficiency property is Pareto optimality; to satisfy it, one can return the policy that maximizes the sum of the agents' expected returns among those that are q-quantile fair. Finally, to calculate F_i^{−1}(q), we can again use binary search to get δ-close to the value v_i for which F_i(v_i) = q, using O(log(1/δ)) calls to vol-comp. The discussion above is summarized below.

Proposition 3. Assuming an exact oracle for F_i^{−1}, Algorithm 2 finds a q-quantile fair policy with q that is ϵ-close to the optimal value, in polynomial time with O(log(1/ϵ)) calls per agent to the oracle. A δ-approximation to F_i^{−1}(q) can be computed using O(log(1/δ)) calls to vol-comp.
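A hedged sketch of the search in Algorithm 2 is given below. It assumes `A_eq, b_eq` describe the occupancy polytope (as in the earlier `occupancy_polytope` helper) and that `quantile(i, q)` returns an estimate of F_i^{−1}(q), e.g., via `np.quantile` on the Monte Carlo returns; the feasibility check is a plain LP with a zero objective.

```python
import numpy as np
from scipy.optimize import linprog

def q_quantile_feasible(A_eq, b_eq, R, thresholds):
    """Is {d in O : <d, R_i> >= t_i for all i} nonempty? If so, also return a witness."""
    n_vars = A_eq.shape[1]
    A_ub = -R.reshape(R.shape[0], -1)          # <d, R_i> >= t_i  <=>  -R_i . d <= -t_i
    b_ub = -np.asarray(thresholds)
    res = linprog(np.zeros(n_vars), A_ub=A_ub, b_ub=b_ub,
                  A_eq=A_eq, b_eq=b_eq, bounds=(0, None))
    return res.status == 0, (res.x if res.status == 0 else None)

def max_quantile_fair(A_eq, b_eq, R, quantile, eps=1e-3):
    lo, hi, best = 0.0, 1.0, None
    while hi - lo > eps:                        # binary search over q
        q = (lo + hi) / 2
        ok, d = q_quantile_feasible(A_eq, b_eq, R,
                                    [quantile(i, q) for i in range(R.shape[0])])
        if ok:
            lo, best = q, d
        else:
            hi = q
    return lo, best
# Afterwards, re-solving with objective max sum_i <d, R_i> over the final feasible set
# yields a Pareto optimal selection, as discussed above.
```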
6 Policy Aggregation with Voting Rules

In this section, we adapt existing voting rules from the discrete setting to policy aggregation and discuss their computational complexity.

ALGORITHM 3: α-Approval MILP
  compute F_i^{−1}(α) for all i ∈ [n]
  solve the mixed integer linear program:
    max Σ_{i ∈ [n]} a_i
    s.t. a_i · F_i^{−1}(α) ≤ ⟨d_π, R_i⟩ for all i ∈ [n];  d_π ∈ O;  a_i ∈ {0, 1} for all i ∈ [n]
  return a Pareto optimal policy subject to the maximum α-approval score |{i | a_i = 1}|

ALGORITHM 4: ϵ-Borda Count MILP
  compute F_i(kϵ) for all i ∈ [n] and k ∈ [1/ϵ]
  solve the mixed integer linear program:
    max Σ_{i ∈ [n]} Σ_{k ∈ [1/ϵ]} a_{i,k} · (F_i(kϵ) − F_i((k−1)ϵ))
    s.t. a_{i,k} · kϵ ≤ ⟨d_π, R_i⟩ for all i ∈ [n], k ∈ [1/ϵ];  d_π ∈ O;  a_{i,k} ∈ {0, 1} for all i ∈ [n], k ∈ [1/ϵ]
  return a Pareto optimal policy subject to the maximum Borda score

Plurality. The plurality winner is the policy that achieves the maximum number of plurality votes or approvals, where agent i approves a policy π if it achieves their maximum expected return, J_i(π) = max_{π'} J_i(π') = 1. Hence the plurality winner is a policy in arg max_π Σ_{i ∈ [n]} I[J_i(π) = 1]. This formulation does not require the volumetric interpretation. However, in contrast to the discrete setting, where one can easily count the approvals for all candidates, we show that solving this problem in the context of policy aggregation is not only NP-hard, but hard to approximate up to a factor of 1/n^{1/2−ϵ}. We establish the hardness of approximation by a reduction from the maximum independent set problem [20]; we defer the proof to Appendix B.

Theorem 4. For any fixed ϵ ∈ (0, 1), there is no polynomial-time (1/n^{1/2−ϵ})-approximation algorithm for the maximum plurality score unless P = NP.

Nevertheless, we can compute plurality in practice, as we discuss below.

α-approval. We extend the k-approval rule using the volumetric interpretation of the occupancy polytope, similarly to the q-quantile fairness definition. For some α ∈ [0, 1], agents approve a policy π if its return is among their top α fraction of O, i.e., F_i(J_i(π)) ≥ α. The α-approval winner is a policy that has the highest number of α-approvals, so it is in arg max_π Σ_{i ∈ [n]} I[F_i(J_i(π)) ≥ α]. Note that plurality is equivalent to 1-approval. It is worth mentioning that there can be infinitely many policies that attain the maximum approval score; to avoid a suboptimal decision, one can return a Pareto optimal solution among the set of α-approval winner policies.

Theorem 2 shows that for α ≤ 1/e there always exists a policy that all agents approve, and by Proposition 3 such policies can be found in polynomial time, assuming access to an oracle for volumetric computations. Therefore, the problem of finding an α-approval winner is easy for α ∈ (0, 1/e]. In sharp contrast, for α = 1, namely plurality, Theorem 4 gives a hardness of approximation. The next theorem shows the hardness of computing α-approval for α ∈ (7/8, 1] via a reduction from the MAX-2SAT problem. We defer the proof to Appendix B.

Theorem 5. For α ∈ (7/8, 1], computing a policy with the highest α-approval score is NP-hard. This holds even for binary reward vectors and when every F_i has a closed form.

Given the above hardness result, to compute the α-approval rule we turn to mixed-integer linear programming (MILP). Algorithm 3 simply creates n binary variables, one per agent i, indicating whether i α-approves the policy, i.e., F_i(J_i(π)) ≥ α, which is equivalent to J_i(π) ≥ F_i^{−1}(α). To encode the expected return requirement for agent i as a linear constraint, we precompute F_i^{−1}(α); this can be done by binary search, similarly to Algorithm 2. Importantly, Algorithm 3 has one binary variable per agent and only n constraints, which is key to its practicability.
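A hedged sketch of Algorithm 3 using gurobipy (the experiments rely on Gurobi [18]) is shown below. It assumes `A_eq, b_eq` encode the occupancy polytope, `R[i]` is agent i's flattened reward vector with returns normalized to [0, 1], and `thresh[i]` is a precomputed estimate of F_i^{−1}(α); variable and model names are illustrative.

```python
import gurobipy as gp
from gurobipy import GRB

def alpha_approval_milp(A_eq, b_eq, R, thresh):
    n, n_vars = R.shape[0], A_eq.shape[1]
    m = gp.Model("alpha_approval")
    d = m.addMVar(n_vars, lb=0.0, name="d")               # occupancy measure
    a = m.addMVar(n, vtype=GRB.BINARY, name="approve")    # approval indicators
    m.addConstr(A_eq @ d == b_eq)                         # d lies in O
    for i in range(n):
        # a_i = 1 forces <d, R_i> >= F_i^{-1}(alpha); with a_i = 0 and thresh[i] >= 0
        # (normalized returns) the constraint is vacuous.
        m.addConstr(R[i] @ d >= thresh[i] * a[i])
    m.setObjective(a.sum(), GRB.MAXIMIZE)                 # maximize number of approvals
    m.optimize()
    return d.X, a.X
```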
Borda count. The Borda count rule also has a natural definition in the continuous setting. In the discrete setting, the Borda score of agent i for alternative c is the number of alternatives c' such that c ≻_i c'. In the continuous setting, F_i(J_i(π)) indicates the fraction of the occupancy polytope to which agent i prefers π. The Borda count rule then selects a policy in arg max_π Σ_{i ∈ [n]} F_i(J_i(π)). The computational complexity of the Borda count rule remains an interesting open question (we suspect the problem is NP-hard, since the objective resembles a summation of sigmoidal functions over a convex domain, which is known to be NP-hard [37]), though we make progress on two fronts. First, we identify a sufficient condition under which we can find an approximate max Borda count policy using convex optimization in polynomial time. Second, similarly to Algorithm 3, we present a MILP that approximates the Borda count rule in Algorithm 4.

The first is based on the observation in Section 4 that F_i is concave on [mode(f_i), ∞). We assume that the max Borda count policy π lies in the concave portion for all agents, i.e., J_i(π) ≥ mode(f_i) for all i ∈ [n]. Then, the problem becomes the maximization of the concave objective Σ_i F_i(⟨d_π, R_i⟩) over the convex domain {d_π ∈ O | ⟨d_π, R_i⟩ ≥ mode(f_i), i ∈ [n]}.

Second, Algorithm 4 is a MILP that finds an approximate max Borda count policy. As a preprocessing step, we estimate F_i for each agent i separately: we measure F_i at the fixed expected return values {ϵ, 2ϵ, . . . , 1 − ϵ, 1}, which accounts for 1/ϵ oracle calls to vol-comp per agent. Then, for the MILP, we introduce 1/ϵ binary variables for each agent indicating their ϵ-rounded return levels, i.e., a_{i,k} = 1 iff ⟨d_π, R_i⟩ ≥ kϵ for k ∈ [1/ϵ]. The MILP then searches for an occupancy measure d_π ∈ O with maximum total Borda score among the ϵ-rounded expected return vectors (see Appendix C for more details).

Finally, we make a novel connection between q-quantile fairness and Borda count in Theorem 6; we defer the proof to Appendix B. A corollary of Theorems 2 and 6 is that the policy returned by the ϵ-max quantile fair algorithm (Algorithm 2) achieves a (1/e − ϵ) multiplicative approximation of the maximum Borda score.

Theorem 6. A q-quantile fair policy is a q-approximation of the maximum Borda score.
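The sketch below mirrors the previous MILP but for the ϵ-rounded Borda objective of Algorithm 4; it is a hedged illustration, with `F_grid[i, k]` holding estimates of F_i(kϵ) (e.g., from the Monte Carlo cdf) and all names illustrative. The telescoping weights F_i(kϵ) − F_i((k−1)ϵ) make the objective equal F_i evaluated at the rounded-down return level, as described in Appendix C.

```python
import gurobipy as gp
from gurobipy import GRB
import numpy as np

def borda_milp(A_eq, b_eq, R, F_grid, eps):
    n, n_vars = R.shape[0], A_eq.shape[1]
    K = int(round(1 / eps))
    weights = np.diff(F_grid, axis=1, prepend=0.0)         # F_i(k*eps) - F_i((k-1)*eps)
    m = gp.Model("eps_borda")
    d = m.addMVar(n_vars, lb=0.0)
    a = m.addMVar((n, K), vtype=GRB.BINARY)                # a[i,k] = 1 iff J_i >= (k+1)*eps
    m.addConstr(A_eq @ d == b_eq)
    for i in range(n):
        for k in range(K):
            m.addConstr((k + 1) * eps * a[i, k] <= R[i] @ d)
    m.setObjective(gp.quicksum(weights[i, k] * a[i, k]
                               for i in range(n) for k in range(K)), GRB.MAXIMIZE)
    m.optimize()
    return d.X
```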
7 Experiments

Environment. We adapt the dynamic attention allocation environment introduced by D'Amour et al. [11]. We aim to monitor several sites and prevent potential incidents, but limited resources prevent us from monitoring all sites at all times; this is inspired by applications such as food inspection and pest control. (The code for the experiments is available at https://github.com/praal/policy-aggregation.) There are m = 5 warehouses and each can be in 3 different stages: normal (norm), risky (risk) and incident (inc). There are |S| = 3^m states containing all possible stages of all warehouses. In each step, we can monitor at most one site, so there are m + 1 actions, where action m + 1 is no operation and action i ≤ m is monitoring warehouse i. There are n agents; each agent i has a list ℓ_i of warehouses that they consider valuable and a reward function R_i. In each step t,

R_i(s_t, a_t) = −I[a_t ≤ m] − Σ_{j ∈ ℓ_i} ρ_i · w_j · I[s_{t,j} = inc ∧ a_t ≠ j],

where w_j ∈ {100, 150, . . . , 250} denotes the penalty of an incident occurring in warehouse j, ρ_i is the scale of penalties for agent i, which is sampled from {0.25, 0.5, . . . , n}, and 1 is the cost of monitoring. In each step, if we monitor warehouse j, its stage becomes normal. If not, it changes from norm to risk and from risk to inc with probabilities p_{j,risk} and p_{j,inc}, respectively, and stays the same otherwise. These probabilities are sampled i.i.d. uniformly from [0.5, 0.8]. The state transition function P is the product of the warehouses' stage transitions.
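A hedged sketch of the per-warehouse dynamics and reward just described is given below; stage codes, the random draw of penalties, and the seed handling are illustrative choices, not the paper's exact implementation.

```python
import numpy as np

NORM, RISK, INC = 0, 1, 2

def step_warehouse(stage, monitored, p_risk, p_inc, rng):
    """One warehouse's stage transition: monitoring resets it to normal."""
    if monitored:
        return NORM
    if stage == NORM and rng.random() < p_risk:
        return RISK
    if stage == RISK and rng.random() < p_inc:
        return INC
    return stage

def reward(valued, rho, w, state, action, m):
    """R_i(s, a) = -1[a <= m] - sum_{j in l_i} rho_i * w_j * 1[s_j = inc and a != j]."""
    r = -1.0 if action < m else 0.0            # index m is the "no operation" action
    for j in valued:
        if state[j] == INC and action != j:
            r -= rho * w[j]
    return r

# Example: m = 5 warehouses, one agent valuing warehouses {0, 2}, monitoring warehouse 2.
rng = np.random.default_rng(0)
m = 5
w = np.array([100, 150, 200, 250, 150])        # incident penalties (illustrative draw)
p_risk, p_inc = rng.uniform(0.5, 0.8, m), rng.uniform(0.5, 0.8, m)
state, action = [NORM] * m, 2
print(reward({0, 2}, rho=0.5, w=w, state=state, action=action, m=m))
state = [step_warehouse(state[j], action == j, p_risk[j], p_inc[j], rng) for j in range(m)]
```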
Rules. We compare the outcomes of policy aggregation under different rules: max-quantile, Borda, α-approval (α = 0.9, 0.8), egalitarian (maximize the minimum return) and utilitarian (maximize the sum of returns). We sample 5 × 10^5 random policies, based on which we fit a generalized logistic function to estimate the cdf of the expected return distribution F_i (Definition 4) for every agent. The policies for the α-approval voting rules are optimized with respect to maximum utilitarian welfare. The egalitarian rule finds a policy that maximizes the expected return of the worst-off agent, then optimizes for the second worst-off agent, and so on. The implementation details of Borda count are in Appendix D.

Results. In Figure 1, we report the normalized expected return of the agents, (J_i(π) − min_{π'} J_i(π')) / (max_{π'} J_i(π') − min_{π'} J_i(π')), sorted from lowest to highest and averaged over 10 different environment and agent instances. We observe that the utilitarian and egalitarian rules are sensitive to the agents' different reward scales and tend to produce unfair outcomes. The utilitarian rule achieves the highest utilitarian welfare by almost ignoring one agent. The egalitarian rule achieves a higher return for the worst-off agents compared to the utilitarian rule, but still yields an inequitable outcome. The max-quantile rule tends to return the fairest outcomes, with similar normalized returns across agents. The Borda rule, while not a fair rule by design, tends to find fair outcomes that are slightly worse than those of the max-quantile rule. The α-approval rule with max-utilitarian completion tends to the utilitarian rule as α → 0 and to plurality as α → 1. Importantly, although not shown in the plots, the plurality rule ignores almost all agents and performs optimally for a randomly selected agent. In addition to the fine-grained utility distributions, in Table 1 we report two aggregate measures based on the agents' utilities: (i) the Gini index, a statistical measure of dispersion defined as

Σ_{i ∈ N} Σ_{j ∈ N} |J_i(π) − J_j(π)| / (2n · Σ_{i ∈ N} J_i(π)),

where a lower Gini index indicates a more equitable distribution, and (ii) the Nash welfare, defined as the geometric mean of the agents' utilities, (Π_{i ∈ N} J_i(π))^{1/n}, where a higher Nash welfare is preferable. We observe a similar trend as above, where the utilitarian and egalitarian rules perform worse across both metrics. For the other four rules, the Nash welfare scores are comparable in both scenarios, with Borda showing slightly better performance. The Gini index, however, highlights a clearer distinction among the rules, with max-quantile performing better.

Figure 1: Comparison of policies optimized by different rules in two scenarios, based on the normalized expected return for agents: (a) 10 agents that consider a random subset of warehouses valuable; (b) 5 symmetric agents, one per warehouse. The bars, grouped by rule (max-q, Borda, 90%-a, 80%-a, egal, util; y-axis: % of max return), correspond to agents sorted by their normalized expected return. The error bars show the standard error of the mean.

Table 1: Comparison of policies optimized by different rules in two scenarios, based on the Gini index and Nash welfare of their normalized expected returns. We report the mean and the standard error.

                   5 symmetric agents, one per warehouse   |  10 agents, random subsets of warehouses
Rules              Gini index         Nash welfare         |  Gini index         Nash welfare
egalitarian        0.2864 ± 0.0295    0.2208 ± 0.0717      |  0.2126 ± 0.0209    0.4655 ± 0.0581
utilitarian        0.4392 ± 0.0094    0.0502 ± 0.0174      |  0.2020 ± 0.0182    0.5736 ± 0.0471
80%-approvals      0.1233 ± 0.0047    0.5186 ± 0.0051      |  0.1352 ± 0.0037    0.6741 ± 0.0200
90%-approvals      0.0793 ± 0.0056    0.5286 ± 0.0053      |  0.1257 ± 0.0034    0.6746 ± 0.0211
Borda              0.0225 ± 0.0024    0.5356 ± 0.0062      |  0.1029 ± 0.0083    0.6801 ± 0.0261
max-quantile       0.0188 ± 0.0022    0.5355 ± 0.0062      |  0.0625 ± 0.0067    0.6474 ± 0.0232

8 Discussion

We conclude by discussing some of the limitations of our approach. A first potential limitation is computation. When we started our investigation of the policy aggregation problem, we were skeptical that ordinal solutions from social choice could be practically applied. We believe that our results successfully lay this initial concern to rest. However, additional algorithmic advances are needed to scale our approach beyond thousands of agents, states, and actions. Additionally, an interesting future direction is to apply these rules within continuous state or action spaces, as well as in the online reinforcement learning setting where the environment remains unknown.

A second limitation is the possibility of strategic behavior. The Gibbard-Satterthwaite Theorem [16, 33] precludes the existence of reasonable voting rules that are strategyproof, in the sense that agents cannot gain from misreporting their ordinal preferences; we conjecture that a similar result holds for policy aggregation in our framework. However, if reward functions are obtained through inverse reinforcement learning, successful manipulation would be difficult: an agent would have to act in a way that the learned reward function induces ordinal (volumetric) preferences leading to a higher-return aggregate stochastic policy. This separation between the actions taken by an agent and the preferences they induce would likely alleviate the theoretical susceptibility of our methods to strategic behavior.

Acknowledgments

We gratefully acknowledge funding from the Natural Sciences and Engineering Research Council of Canada (NSERC) and the Canada CIFAR AI Chairs Program (Vector Institute).
This work was also partially supported by the Cooperative AI Foundation; by the National Science Foundation under grants IIS-2147187, IIS-2229881 and CCF-2007080; and by the Office of Naval Research under grants N00014-20-1-2488 and N00014-24-1-2704. [1] P. Abbeel and A. Y. Ng. Apprenticeship learning via inverse reinforcement learning. In Proceedings of the 21st International Conference on Machine Learning (ICML), pages 1 8, 2004. [2] P. A. Alamdari, T. Q. Klassen, R. Toro Icarte, and S. A. Mc Ilraith. Be considerate: Avoiding negative side effects in reinforcement learning. In Proceedings of the 21st International Conference on Autonomous Agents and Multiagent Systems (AAMAS), pages 18 26, 2022. [3] P. A. Alamdari, T. Q. Klassen, E. Creager, and S. A. Mcilraith. Remembering to be fair: Non-Markovian fairness in sequential decision making. In Proceedings of the 41st International Conference on Machine Learning (ICML), pages 906 920, 2024. [4] Y. Babichenko, M. Feldman, R. Holzman, and V. V. Narayan. Fair division via quantile shares. In Proceedings of the 56th Annual ACM Symposium on Theory of Computing (STOC), pages 1235 1246, 2024. [5] R. A. Bradley and M. E. Terry. Rank analysis of incomplete block designs: I. the method of paired comparisons. Biometrika, 39(3/4):324 345, 1952. [6] I. Caragiannis, A. D. Procaccia, and N. Shah. When do noisy votes reveal the truth? ACM Transactions on Economics and Computation, 4(3): article 15, 2016. [7] B. R. Chaudhury, A. Murhekar, Z. Yuan, B. Li, R. Mehta, and A. D. Procaccia. Fair federated learning via the proportional veto core. In Proceedings of the 41st International Conference on Machine Learning (ICML), 2024. [8] N. Clegg. Bringing people together to inform decision-making on generative AI. Blog post, 2023. URL https://about.fb.com/news/2023/06/generative-ai-community-forum. [9] V. Conitzer and T. Sandholm. Common voting rules as maximum likelihood estimators. In Proceedings of the 21st Annual Conference on Uncertainty in Artificial Intelligence (UAI), pages 145 152, 2005. [10] V. Conitzer, R. Freedman, J. Heitzig, W. H. Holliday, B. M. Jacobs, N. Lambert, M. Mosse, E. Pacuit, S. Russell, H. Schoelkopf, E. Tewolde, and W. S. Zwicker. Position: Social choice should guide AI alignment in dealing with diverse human feedback. In Proceedings of the 41st International Conference on Machine Learning (ICML), pages 9346 9360, 2024. [11] A. D Amour, H. Srinivasan, J. Atwood, P. Baljekar, D. Sculley, and Y. Halpern. Fairness is not static: Deeper understanding of long term fairness via simulation studies. In Proceedings of the 3rd Conference on Fairness, Accountability, and Transparency (FAcc T), pages 525 534, 2020. [12] M. Dyer, A. Frieze, and R. Kannan. A random polynomial-time algorithm for approximating the volume of convex bodies. Journal of the ACM, 38(1):1 17, 1991. [13] Z. Fan, N. Peng, M. Tian, , and B. Fain. Welfare and fairness in multi-objective reinforcement learning. In Proceedings of the 22nd International Conference on Autonomous Agents and Multi-Agent Systems (AAMAS), pages 1991 1999, 2023. [14] P. C Fishburn. Probabilistic social choice based on simple voting comparisons. The Review of Economic Studies, 51(4):683 692, 1984. [15] M.R. Garey, D.S. Johnson, and L. Stockmeyer. Some simplified np-complete graph problems. Theoretical Computer Science, 1(3):237 267, 1976. ISSN 0304-3975. [16] A. Gibbard. Manipulation of voting schemes. Econometrica, 41:587 602, 1973. [17] B. Grünbaum. 
Partitions of mass-distributions and of convex bodies by hyperplanes. Pacific Journal of Mathematics, 10(4):1257 1261, 1960. [18] Gurobi Optimization, LLC. Gurobi Optimizer Reference Manual, 2023. URL https://www. gurobi.com. [19] P. J. Hammond. Interpersonal comparisons of utility: Why and how they are and should be made. In J. Elster and J. E. Roemer, editors, Interpersonal Comparisons of Well-Being, chapter 7, pages 200 254. Cambridge University Press, 1991. [20] J. Håstad. Clique is hard to approximate within n 1ε. Acta Mathematica, 182:105 142, 1999. [21] C. F. Hayes, R. R adulescu, E. Bargiacchi, J. Källström, M. Macfarlane, M. Reymond, T. Verstraeten, L. M Zintgraf, R. Dazeley, F. Heintz, et al. A practical guide to multi-objective reinforcement learning and planning. Autonomous Agents and Multi-Agent Systems, 36(1):26, 2022. [22] E. Ianovski and A. Y. Kondratev. Computing the proportional veto core. In Proceedings of the 35th AAAI Conference on Artificial Intelligence (AAAI), 2021. [23] P. Ju, A. Ghosh, and N. Shroff. Achieving fairness in multi-agent MDP using reinforcement learning. In Proceedings of the 12th International Conference on Learning Representations (ICLR), 2023. [24] A. Y. Kondratev and E. Ianovski. Veto core consistent preference aggregation. In Proceedings of the 23rd International Conference on Autonomous Agents and Multi-Agent Systems (AAMAS), pages 1020 1028, 2024. [25] D. Mandal and J. Gan. Socially fair reinforcement learning. Co RR, abs/2208.12584, 2022. [26] H. Moulin. The proportional veto principle. The Review of Economic Studies, 48(3):407 416, 1981. [27] A. Y. Ng and S. Russell. Algorithms for inverse reinforcement learning. In Proceedings of the 17th International Conference on Machine Learning (ICML), pages 663 670, 2000. [28] R. Noothigattu, T. Yan, and A. D. Procaccia. Inverse reinforcement learning from like-minded teachers. In Proceedings of the 35th AAAI Conference on Artificial Intelligence (AAAI), pages 9197 9204, 2021. [29] W. Ogryczak, P. Perny, and P. Weng. A compromise programming approach to multiobjective markov decision processes. International Journal of Information Technology & Decision Making, 12(05):1021 1053, 2013. [30] L. Ouyang, J. Wu, X. Jiang, D. Almeida, C. Wainwright, P. Mishkin, C. Zhang, S. Agarwal, K. Slama, A. Ray, J. Schulman, J. Hilton, F. Kelton, L. Miller, M. Simens, A. Askell, P. Welinder, P. F. Christiano, J. Leike, and R. Lowe. Training language models to follow instructions with human feedback. In Proceedings of the 36th Annual Conference on Neural Information Processing Systems (Neur IPS), 2022. [31] M. L. Puterman. Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons, 2014. [32] D. M. Roijers, P. Vamplew, S. Whiteson, and R. Dazeley. A survey of multi-objective sequential decision-making. Journal of Artificial Intelligence Research, 48:67 113, 2013. [33] M. Satterthwaite. Strategy-proofness and Arrow s conditions: Existence and correspondence theorems for voting procedures and social welfare functions. Journal of Economic Theory, 10: 187 217, 1975. [34] U. Siddique, P. Weng, and M. Zimmer. Learning fair policies in multi-objective (deep) reinforcement learning with average and discounted rewards. In Proceedings of the 37th International Conference on Machine Learning (ICML), pages 8905 8915, 2020. [35] T. Sorensen, J. Moore, J. Fisher, M. L. Gordon, N. Mireshghallah, C. Michael Rytting, A. Ye, L. Jiang, X. Lu, N. Dziri, T. Althoff, and Y. Choi. 
Position: A roadmap to pluralistic alignment. In Proceedings of the 41st International Conference on Machine Learning (ICML), pages 46280–46302, 2024.
[36] G. Swamy, C. Dann, R. Kidambi, S. Wu, and A. Agarwal. A minimaximalist approach to reinforcement learning from human feedback. In Proceedings of the 41st International Conference on Machine Learning (ICML), pages 47345–47377, 2024.
[37] M. Udell and S. Boyd. Maximizing a sum of sigmoids. Optimization and Engineering, pages 1–25, 2013.
[38] R. Yang, X. Sun, and K. Narasimhan. A generalized algorithm for multi-objective reinforcement learning and policy adaptation. In Proceedings of the 33rd Annual Conference on Neural Information Processing Systems (NeurIPS), 2019.
[39] H. P. Young. Condorcet's theory of voting. The American Political Science Review, 82(4):1231–1244, 1988.
[40] T. Zahavy, B. O'Donoghue, G. Desjardins, and S. Singh. Reward is enough for convex MDPs. In Proceedings of the 35th Annual Conference on Neural Information Processing Systems (NeurIPS), pages 25746–25759, 2021.
[41] W. Zaremba, A. Dhar, L. Ahmad, T. Eloundou, S. Santurkar, S. Agarwal, and J. Leung. Democratic inputs to AI. Blog post, 2023. URL https://openai.com/blog/democratic-inputs-to-ai.
[42] H. Zhong, Z. Deng, W. J. Su, Z. S. Wu, and L. Zhang. Provable multi-party reinforcement learning with diverse human feedback. arXiv preprint arXiv:2403.05006, 2024.

A State-Action Occupancy Polytope for Discounted Reward

Definition 7 (state-action occupancy polytope for discounted reward [31, 40]). For a MOMDP M in the discounted-reward case, the space of valid state-action occupancies is the polytope

O^γ = { d_π^γ | d_π^γ ≥ 0,  Σ_a d_π^γ(s, a) = (1 − γ) d_init(s) + γ Σ_{s',a'} P(s', a', s) · d_π^γ(s', a')  for all s ∈ S }.

B Proofs of Section 6

We define a fully connected MOMDP as a MOMDP M where for every pair of states s, s' ∈ S and any action a ∈ A, P(s, a, s') = 1/|S|. In all the hardness proofs, we create a fully connected MOMDP. For such MOMDPs, it is not difficult to observe that the state-action occupancy polytope, in both the average- and discounted-reward cases, is equivalent to {d_π | Σ_a d_π(s, a) = 1/|S| for all s ∈ S}. Furthermore,

π(a|s) = d_π(s, a) / Σ_{a' ∈ A} d_π(s, a') = |S| · d_π(s, a).    (1)

B.1 Proof of Theorem 4

Proof of Theorem 4. We show hardness of approximation by an approximation-preserving reduction from the maximum independent set (MIS) problem.

Definition 8 (maximum independent set (MIS)). For a graph G = (V, E) with vertex set V and edge set E consisting of unordered pairs of vertices, a maximum independent set (MIS) of G is a maximum subset of vertices V' ⊆ V such that there are no edges between any pair of vertices v_1, v_2 ∈ V'.

Theorem 7 (Håstad [20]). For any fixed ϵ ∈ (0, 1), there is no polynomial-time (1/n^{1/2−ϵ})-approximation algorithm for the MIS problem unless P = NP, and no (1/n^{1−ϵ})-approximation algorithm unless ZPP = NP.

Construction of MOMDP. Let G = ([n], E) be a graph for which we want to find the maximum independent set. Create a fully connected MOMDP M with |E| states {s_e}, one per edge e ∈ E. There are only two actions, A = {a_1, a_2}. In the state s_e of edge e = (e_1, e_2), actions a_1 and a_2 correspond to e_1 and e_2, respectively. We create n agents, where agent i corresponds to vertex i. The reward function of agent i for state s_e and action a_k, k ∈ {1, 2}, is R_i(s_e, a_k) = 1 if e_k = i, and 0 otherwise. In other words, the reward functions encode the set of edges incident to vertex i.
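For concreteness, the following hedged helper builds the reward tensors of this reduction from a graph; the edge ordering and names are illustrative.

```python
import numpy as np

def mis_reduction_rewards(n_vertices, edges):
    """edges: list of pairs (u, v) with u, v in range(n_vertices); one state per edge."""
    n_states = len(edges)
    R = np.zeros((n_vertices, n_states, 2))     # two actions per state
    for e, (u, v) in enumerate(edges):
        R[u, e, 0] = 1.0                        # action a_1 at state s_e "selects" endpoint u
        R[v, e, 1] = 1.0                        # action a_2 "selects" endpoint v
    return R

# Example: a triangle graph. No policy can be optimal for more than one of its three agents,
# mirroring the fact that the triangle's maximum independent set has size 1.
R = mis_reduction_rewards(3, [(0, 1), (1, 2), (0, 2)])
```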
Correctness of reduction. A policy π is optimal for agent i iff for all the edges incident to i the action corresponding to i is selected with probability 1. If a policy π is considered optimal by two agents i and i', then e = (i, i') ∉ E, since at state s_e either π(a_1|s_e) = 1 or π(a_2|s_e) = 1. Therefore, the set of agents that consider a policy optimal corresponds to an independent set in G. Furthermore, take any independent set V' ⊆ [n]. Let π be the policy that, for each edge e, selects the action corresponding to the vertex in V', and, if no such vertex exists, selects one arbitrarily. This policy is well defined since V' is an independent set, so there are no edges with both vertices in V'. Policy π is optimal for the agents of V', since at each state their favourite action is selected. Thus, we have an equivalence between the maximum independent set of G and the plurality winner policy of M, and the hardness of approximation follows from Theorem 7.

B.2 Proof of Theorem 5

Proof of Theorem 5. We reduce the MAX-2SAT problem to finding an α-approval policy for α ∈ (7/8, 1]. For two Boolean variables x_1 and x_2, we denote the disjunction (i.e., logical or) of the two variables by x_1 ∨ x_2 and the negation of x_1 by ¬x_1.

Figure 2: The effective state-action occupancy polytope of the agents and their expected return distribution: (a) the scaled occupancy polytope; (b) the expected return cdf.

Definition 9 (maximum 2-satisfiability (MAX-2SAT)). Given a set of m Boolean variables {x_1, . . . , x_m} and a set of n 2-literal disjunction clauses {C_1, . . . , C_n}, the goal of the maximum 2-satisfiability problem (MAX-2SAT) is to find the maximum number of clauses that can be satisfied together by an assignment ϕ : {x_j}_{j ∈ [m]} → {True, False}. Garey et al. [15] showed that the MAX-2SAT problem is NP-hard.

Construction of MOMDP. For an instance of the MAX-2SAT problem, let {C_1, . . . , C_n} be a set of n 2-literal disjunction clauses over the m variables {x_1, . . . , x_m}. We create a fully connected MOMDP M with m states, state s_j representing variable x_j. There are only two actions, a_True and a_False, which at state s_j correspond to setting variable x_j to True and False, respectively. This way, π(a_True|s_j) can be interpreted as the probability of setting the variable x_j to True, subject to π(a_True|s_j) = 1 − π(a_False|s_j).

Agents and reward function. We introduce some notation before introducing the agents and their reward functions. Take a clause C_i = c_{i,1} ∨ c_{i,2}, where c_{i,k} ∈ {x_j, ¬x_j}_{j ∈ [m]} for k ∈ [2]. By combining this with the mapping of x_j and ¬x_j to the state-action pairs (s_j, a_True) and (s_j, a_False), respectively, we define λ : {c_{i,1}, c_{i,2}} → S × A. For every c_{i,k}, λ(c_{i,k}) is the state-action pair that evaluates c_{i,k} to True when selected with probability one. Now, we are ready to introduce the agents. For each clause C_i, we create 3 agents, as follows. Agent ag_{i,1} has a reward function that is 1 for λ(c_{i,1}) and λ(c_{i,2}) and zero otherwise; an optimal policy of this agent selects the two state-action pairs with probability one, which can be interpreted as c_{i,1} ← True and c_{i,2} ← True, which implies C_i = True. Agent ag_{i,2} has a reward function that is 1 for λ(c_{i,1}) and λ(¬c_{i,2}) (note the negation); this is another assignment of the variables, c_{i,1} ← True and c_{i,2} ← False, that implies C_i = True. Similarly, the reward function of ag_{i,3} is 1 for λ(¬c_{i,1}) and λ(c_{i,2}) (which again implies C_i = True). For ease of exposition, and by a slight abuse of notation, in the rest of the proof we use π to refer to the occupancy measure.
From Equation (1) we have |S| · d_π(s, a) = π(a|s), and the α-approval rule is invariant to affine transformations; therefore, we let J_i(π) = ⟨π, R_i⟩.

Expected return distribution. The expected return of agent ag_{i,1} for a policy π is π(λ(c_{i,1})) + π(λ(c_{i,2})). For conciseness, let v_1 = π(λ(c_{i,1})) and v_2 = π(λ(c_{i,2})). Then, the cdf is

F_{ag_{i,1}}(v) = ∫_{v_1=0}^{1} ∫_{v_2=0}^{1} I[v_1 + v_2 ≤ v] dv_1 dv_2 = v^2/2 for v ∈ [0, 1], and 1 − (2 − v)^2/2 for v ∈ [1, 2].

See Figure 2 for a visualization. The cdfs of all other agents have the same form. For each of the 3n agents above, the maximum expected return is 2.

Rounding a policy. Observe that F_i(3/2) = 7/8. For agent ag_{i,1} to α-approve a policy for α ∈ (7/8, 1], they require a utility (strictly) more than 3/2 = F_{ag_{i,1}}^{−1}(7/8). The fact that v_1, v_2 ∈ [0, 1], in addition to v_1 + v_2 > 3/2, implies that v_1 > 1/2 and v_2 > 1/2. Therefore, if a policy π is α-approved by agent ag_{i,1}, we have π(λ(c_{i,1})) > 1/2 and π(λ(c_{i,2})) > 1/2. Further, observe that for at most one of the three agents {ag_{i,k}}_{k ∈ [3]} can the condition J_{ag_{i,k}}(π) > 3/2 hold, as every pair of these agents disagrees on one literal. We round such a policy π to an assignment ϕ by letting ϕ(x_j) ← True if π(a_True|s_j) > 1/2 and ϕ(x_j) ← False otherwise. If an agent in {ag_{i,k}}_{k ∈ [3]} α-approves π, then C_i is satisfied by the assignment ϕ. Therefore, we have OPT_α ≤ OPT_MAX-2SAT, where OPT_α is the maximum feasible number of α-approvals among all policies and OPT_MAX-2SAT is the maximum number of clauses that can be satisfied among all assignments.

Next, we show OPT_MAX-2SAT ≤ OPT_α by deriving a policy π that gets OPT_MAX-2SAT α-approvals from the optimal assignment ϕ, by simply letting π(a_True|s_j) = 1 iff ϕ(x_j) = True. If ϕ satisfies a clause C_i, then for the agent ag ∈ {ag_{i,k}}_{k ∈ [3]} that matches the literal assignments of ϕ for C_i, we have J_ag(π) = 2, which is optimal, i.e., a 1-approval, for ag. The other two agents for C_i have J_ag(π) = 1, a return less than their required utility of 3/2 for an α-approval. Therefore, OPT_MAX-2SAT = OPT_α, and the hardness of computation follows from our reduction.

B.3 Proof of Theorem 6

Proof of Theorem 6. The Borda count of a policy is defined as Borda(π) = Σ_{i ∈ [n]} F_i(J_i(π)). Since F_i(v) ∈ [0, 1], we have max_π Σ_{i ∈ [n]} F_i(J_i(π)) ≤ n. From Definition 6, for a q-quantile fair policy π_q, we have F_i(J_i(π_q)) ≥ q for all i ∈ [n]. Therefore, Σ_{i ∈ [n]} F_i(J_i(π_q)) ≥ qn, and hence Borda(π_q) ≥ qn ≥ q · max_π Borda(π).

C A Note on Algorithm 4

Here, we expand on Algorithm 4. As mentioned before, in the preprocessing step, for every agent i, we measure F_i(kϵ) for k ∈ [1/ϵ] via 1/ϵ calls to vol-comp. Let J(π) = (J_1(π), . . . , J_n(π)) be the (expected) return vector. Importantly, our MILP approximates the Borda score of a return vector r by its ϵ-rounded-down return vector r_ϵ = (⌊r_i/ϵ⌋ · ϵ | i ∈ [n]). Therefore, the MILP finds a policy whose Borda count is at least as high as the maximum Borda count among ϵ-rounded return vectors. This is not necessarily equivalent to a (1 − ϵ)-approximation of the max Borda count.

D Experimental Details

Experiments were all run on an AMD EPYC 7502 32-core processor with 258 GiB of system memory. We use Gurobi [18] to solve LPs and MILPs.

Running time. All our voting rules have a running time of less than ten minutes on the constructed MOMDPs with 3^5 = 243 states, 6 actions, and 10 agents. The most resource-intensive task of our experiments was sampling 5 × 10^5 random policies and computing the expected return for every agent, which is a standard task. For each instance, we did this in parallel with 20 processes, with a total running time of less than 2 hours per instance.

Implementation details of the Borda count rule. After fitting a generalized logistic function for F_i based on the expected returns of the sampled policies, we find the value mode(f_i) and check, by solving a linear program, whether a policy exists that achieves an expected return of at least mode(f_i) for all agents. Next, to utilize Gurobi's LP solvers, we approximate the concave function F_i by a set of linear constraints that form a piecewise linear concave approximation of F_i. Therefore, our final program for an approximate Borda count rule is simply an LP.
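A hedged sketch of this piecewise-linear construction is shown below: on the concave region, the concave piecewise-linear interpolant of sampled points (v, F_i(v)) equals the pointwise minimum of the lines through consecutive samples, so bounding an auxiliary variable z_i by each such line and maximizing Σ_i z_i yields an LP. The names `grid`, `vals`, and the gurobipy model structure are illustrative, not the paper's code.

```python
import gurobipy as gp
from gurobipy import GRB

def borda_concave_lp(A_eq, b_eq, R, grid, vals):
    """grid[i], vals[i]: sampled (v, F_i(v)) points on the concave part of agent i's cdf."""
    n, n_vars = R.shape[0], A_eq.shape[1]
    m = gp.Model("borda_pwl")
    d = m.addMVar(n_vars, lb=0.0)
    z = m.addMVar(n, lb=-GRB.INFINITY)                 # z_i approximates F_i(<d, R_i>)
    m.addConstr(A_eq @ d == b_eq)
    for i in range(n):
        J_i = R[i] @ d
        m.addConstr(J_i >= grid[i][0])                 # stay in the concave region
        for k in range(len(grid[i]) - 1):
            slope = (vals[i][k + 1] - vals[i][k]) / (grid[i][k + 1] - grid[i][k])
            # line through consecutive samples; their minimum is the concave interpolant
            m.addConstr(z[i] <= vals[i][k] + slope * (J_i - grid[i][k]))
    m.setObjective(z.sum(), GRB.MAXIMIZE)
    m.optimize()
    return d.X
```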
For each instance, we did this in parallel with 20 processes, with a total running time of less than 2 hours per instance.

Implementation details of the Borda count rule. After fitting a generalized logistic function for Fi based on the expected returns of the sampled policies, we find the value mode(fi) and check the existence of a policy that achieves an expected return of mode(fi) for all agents by solving a linear program. Next, to utilize Gurobi's LP solvers, we approximate the concave function Fi by a set of linear constraints that form a piecewise linear concave approximation of Fi. Therefore, our final program for an approximate Borda count rule is simply an LP.

NeurIPS Paper Checklist

1. Claims
Question: Do the main claims made in the abstract and introduction accurately reflect the paper's contributions and scope?
Answer: [Yes]
Justification: We summarized the theoretical contributions and experimental results in the abstract and introduction.
Guidelines:
- The answer NA means that the abstract and introduction do not include the claims made in the paper.
- The abstract and/or introduction should clearly state the claims made, including the contributions made in the paper and important assumptions and limitations. A No or NA answer to this question will not be perceived well by the reviewers.
- The claims made should match theoretical and experimental results, and reflect how much the results can be expected to generalize to other settings.
- It is fine to include aspirational goals as motivation as long as it is clear that these goals are not attained by the paper.

2. Limitations
Question: Does the paper discuss the limitations of the work performed by the authors?
Answer: [Yes]
Justification: Some of the limitations of our approach are discussed in the discussion section.
Guidelines:
- The answer NA means that the paper has no limitation while the answer No means that the paper has limitations, but those are not discussed in the paper.
- The authors are encouraged to create a separate "Limitations" section in their paper.
- The paper should point out any strong assumptions and how robust the results are to violations of these assumptions (e.g., independence assumptions, noiseless settings, model well-specification, asymptotic approximations only holding locally). The authors should reflect on how these assumptions might be violated in practice and what the implications would be.
- The authors should reflect on the scope of the claims made, e.g., if the approach was only tested on a few datasets or with a few runs. In general, empirical results often depend on implicit assumptions, which should be articulated.
- The authors should reflect on the factors that influence the performance of the approach. For example, a facial recognition algorithm may perform poorly when image resolution is low or images are taken in low lighting. Or a speech-to-text system might not be used reliably to provide closed captions for online lectures because it fails to handle technical jargon.
- The authors should discuss the computational efficiency of the proposed algorithms and how they scale with dataset size.
- If applicable, the authors should discuss possible limitations of their approach to address problems of privacy and fairness.
- While the authors might fear that complete honesty about limitations might be used by reviewers as grounds for rejection, a worse outcome might be that reviewers discover limitations that aren't acknowledged in the paper.
- The authors should use their best judgment and recognize that individual actions in favor of transparency play an important role in developing norms that preserve the integrity of the community. Reviewers will be specifically instructed to not penalize honesty concerning limitations.

3. Theory Assumptions and Proofs
Question: For each theoretical result, does the paper provide the full set of assumptions and a complete (and correct) proof?
Answer: [Yes]
Justification: We provide the full set of assumptions and proofs in the paper. However, due to the limited space, details of some of the proofs are deferred to the appendix.
Guidelines:
- The answer NA means that the paper does not include theoretical results.
- All the theorems, formulas, and proofs in the paper should be numbered and cross-referenced.
- All assumptions should be clearly stated or referenced in the statement of any theorems.
- The proofs can either appear in the main paper or the supplemental material, but if they appear in the supplemental material, the authors are encouraged to provide a short proof sketch to provide intuition.
- Inversely, any informal proof provided in the core of the paper should be complemented by formal proofs provided in appendix or supplemental material.
- Theorems and Lemmas that the proof relies upon should be properly referenced.

4. Experimental Result Reproducibility
Question: Does the paper fully disclose all the information needed to reproduce the main experimental results of the paper to the extent that it affects the main claims and/or conclusions of the paper (regardless of whether the code and data are provided or not)?
Answer: [Yes]
Justification: The experimental details are provided in the experiments section and appendix.
Guidelines:
- The answer NA means that the paper does not include experiments.
- If the paper includes experiments, a No answer to this question will not be perceived well by the reviewers: Making the paper reproducible is important, regardless of whether the code and data are provided or not.
- If the contribution is a dataset and/or model, the authors should describe the steps taken to make their results reproducible or verifiable.
- Depending on the contribution, reproducibility can be accomplished in various ways. For example, if the contribution is a novel architecture, describing the architecture fully might suffice, or if the contribution is a specific model and empirical evaluation, it may be necessary to either make it possible for others to replicate the model with the same dataset, or provide access to the model. In general, releasing code and data is often one good way to accomplish this, but reproducibility can also be provided via detailed instructions for how to replicate the results, access to a hosted model (e.g., in the case of a large language model), releasing of a model checkpoint, or other means that are appropriate to the research performed.
- While NeurIPS does not require releasing code, the conference does require all submissions to provide some reasonable avenue for reproducibility, which may depend on the nature of the contribution. For example
  (a) If the contribution is primarily a new algorithm, the paper should make it clear how to reproduce that algorithm.
  (b) If the contribution is primarily a new model architecture, the paper should describe the architecture clearly and fully.
  (c) If the contribution is a new model (e.g., a large language model), then there should either be a way to access this model for reproducing the results or a way to reproduce the model (e.g., with an open-source dataset or instructions for how to construct the dataset).
  (d) We recognize that reproducibility may be tricky in some cases, in which case authors are welcome to describe the particular way they provide for reproducibility. In the case of closed-source models, it may be that access to the model is limited in some way (e.g., to registered users), but it should be possible for other researchers to have some path to reproducing or verifying the results.

5. Open access to data and code
Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material?
Answer: [Yes]
Justification: The full code with the running instructions is attached.
Guidelines:
- The answer NA means that the paper does not include experiments requiring code.
- Please see the NeurIPS code and data submission guidelines (https://nips.cc/public/guides/CodeSubmissionPolicy) for more details.
- While we encourage the release of code and data, we understand that this might not be possible, so No is an acceptable answer. Papers cannot be rejected simply for not including code, unless this is central to the contribution (e.g., for a new open-source benchmark).
- The instructions should contain the exact command and environment needed to run to reproduce the results. See the NeurIPS code and data submission guidelines (https://nips.cc/public/guides/CodeSubmissionPolicy) for more details.
- The authors should provide instructions on data access and preparation, including how to access the raw data, preprocessed data, intermediate data, and generated data, etc.
- The authors should provide scripts to reproduce all experimental results for the new proposed method and baselines. If only a subset of experiments are reproducible, they should state which ones are omitted from the script and why.
- At submission time, to preserve anonymity, the authors should release anonymized versions (if applicable).
- Providing as much information as possible in supplemental material (appended to the paper) is recommended, but including URLs to data and code is permitted.

6. Experimental Setting/Details
Question: Does the paper specify all the training and test details (e.g., data splits, hyperparameters, how they were chosen, type of optimizer, etc.) necessary to understand the results?
Answer: [Yes]
Justification: We describe all the essential details for reproducing the results.
Guidelines:
- The answer NA means that the paper does not include experiments.
- The experimental setting should be presented in the core of the paper to a level of detail that is necessary to appreciate the results and make sense of them.
- The full details can be provided either with the code, in appendix, or as supplemental material.

7. Experiment Statistical Significance
Question: Does the paper report error bars suitably and correctly defined or other appropriate information about the statistical significance of the experiments?
Answer: [Yes]
Justification: Yes. They are included in the plots, and the results are reported based on the average of multiple runs.
Guidelines:
- The answer NA means that the paper does not include experiments.
- The authors should answer "Yes" if the results are accompanied by error bars, confidence intervals, or statistical significance tests, at least for the experiments that support the main claims of the paper.
- The factors of variability that the error bars are capturing should be clearly stated (for example, train/test split, initialization, random drawing of some parameter, or overall run with given experimental conditions).
- The method for calculating the error bars should be explained (closed form formula, call to a library function, bootstrap, etc.)
- The assumptions made should be given (e.g., Normally distributed errors).
- It should be clear whether the error bar is the standard deviation or the standard error of the mean.
- It is OK to report 1-sigma error bars, but one should state it. The authors should preferably report a 2-sigma error bar than state that they have a 96% CI, if the hypothesis of Normality of errors is not verified.
- For asymmetric distributions, the authors should be careful not to show in tables or figures symmetric error bars that would yield results that are out of range (e.g. negative error rates).
- If error bars are reported in tables or plots, the authors should explain in the text how they were calculated and reference the corresponding figures or tables in the text.

8. Experiments Compute Resources
Question: For each experiment, does the paper provide sufficient information on the computer resources (type of compute workers, memory, time of execution) needed to reproduce the experiments?
Answer: [Yes]
Justification: They are included in the appendix.
Guidelines:
- The answer NA means that the paper does not include experiments.
- The paper should indicate the type of compute workers (CPU or GPU, internal cluster, or cloud provider), including relevant memory and storage.
- The paper should provide the amount of compute required for each of the individual experimental runs as well as estimate the total compute.
- The paper should disclose whether the full research project required more compute than the experiments reported in the paper (e.g., preliminary or failed experiments that didn't make it into the paper).

9. Code Of Ethics
Question: Does the research conducted in the paper conform, in every respect, with the NeurIPS Code of Ethics https://neurips.cc/public/EthicsGuidelines?
Answer: [Yes]
Justification: We have reviewed the NeurIPS Code of Ethics, and the research conducted in the paper conforms with it in every respect.
Guidelines:
- The answer NA means that the authors have not reviewed the NeurIPS Code of Ethics.
- If the authors answer No, they should explain the special circumstances that require a deviation from the Code of Ethics.
- The authors should make sure to preserve anonymity (e.g., if there is a special consideration due to laws or regulations in their jurisdiction).

10. Broader Impacts
Question: Does the paper discuss both potential positive societal impacts and negative societal impacts of the work performed?
Answer: [Yes]
Justification: We discuss the potential for positive societal impact through better AI alignment. While it is conceivable that our techniques would lead to worse AI alignment, this scenario seems implausible.
Guidelines:
- The answer NA means that there is no societal impact of the work performed.
- If the authors answer NA or No, they should explain why their work has no societal impact or why the paper does not address societal impact.
- Examples of negative societal impacts include potential malicious or unintended uses (e.g., disinformation, generating fake profiles, surveillance), fairness considerations (e.g., deployment of technologies that could make decisions that unfairly impact specific groups), privacy considerations, and security considerations.
- The conference expects that many papers will be foundational research and not tied to particular applications, let alone deployments. However, if there is a direct path to any negative applications, the authors should point it out. For example, it is legitimate to point out that an improvement in the quality of generative models could be used to generate deepfakes for disinformation. On the other hand, it is not needed to point out that a generic algorithm for optimizing neural networks could enable people to train models that generate Deepfakes faster.
- The authors should consider possible harms that could arise when the technology is being used as intended and functioning correctly, harms that could arise when the technology is being used as intended but gives incorrect results, and harms following from (intentional or unintentional) misuse of the technology.
- If there are negative societal impacts, the authors could also discuss possible mitigation strategies (e.g., gated release of models, providing defenses in addition to attacks, mechanisms for monitoring misuse, mechanisms to monitor how a system learns from feedback over time, improving the efficiency and accessibility of ML).

11. Safeguards
Question: Does the paper describe safeguards that have been put in place for responsible release of data or models that have a high risk for misuse (e.g., pretrained language models, image generators, or scraped datasets)?
Answer: [NA]
Justification: We do not use any types of data or models that have the potential for misuse.
Guidelines:
- The answer NA means that the paper poses no such risks.
- Released models that have a high risk for misuse or dual-use should be released with necessary safeguards to allow for controlled use of the model, for example by requiring that users adhere to usage guidelines or restrictions to access the model or implementing safety filters.
- Datasets that have been scraped from the Internet could pose safety risks. The authors should describe how they avoided releasing unsafe images.
- We recognize that providing effective safeguards is challenging, and many papers do not require this, but we encourage authors to take this into account and make a best faith effort.

12. Licenses for existing assets
Question: Are the creators or original owners of assets (e.g., code, data, models), used in the paper, properly credited and are the license and terms of use explicitly mentioned and properly respected?
Answer: [Yes]
Justification: We cite the optimization tools utilized in our code.
Guidelines:
- The answer NA means that the paper does not use existing assets.
- The authors should cite the original paper that produced the code package or dataset.
- The authors should state which version of the asset is used and, if possible, include a URL.
- The name of the license (e.g., CC-BY 4.0) should be included for each asset.
- For scraped data from a particular source (e.g., website), the copyright and terms of service of that source should be provided.
- If assets are released, the license, copyright information, and terms of use in the package should be provided. For popular datasets, paperswithcode.com/datasets has curated licenses for some datasets.
- Their licensing guide can help determine the license of a dataset.
- For existing datasets that are re-packaged, both the original license and the license of the derived asset (if it has changed) should be provided.
- If this information is not available online, the authors are encouraged to reach out to the asset's creators.

13. New Assets
Question: Are new assets introduced in the paper well documented and is the documentation provided alongside the assets?
Answer: [NA]
Justification: We do not release any new assets.
Guidelines:
- The answer NA means that the paper does not release new assets.
- Researchers should communicate the details of the dataset/code/model as part of their submissions via structured templates. This includes details about training, license, limitations, etc.
- The paper should discuss whether and how consent was obtained from people whose asset is used.
- At submission time, remember to anonymize your assets (if applicable). You can either create an anonymized URL or include an anonymized zip file.

14. Crowdsourcing and Research with Human Subjects
Question: For crowdsourcing experiments and research with human subjects, does the paper include the full text of instructions given to participants and screenshots, if applicable, as well as details about compensation (if any)?
Answer: [NA]
Justification: We do not experiment with human subjects.
Guidelines:
- The answer NA means that the paper does not involve crowdsourcing nor research with human subjects.
- Including this information in the supplemental material is fine, but if the main contribution of the paper involves human subjects, then as much detail as possible should be included in the main paper.
- According to the NeurIPS Code of Ethics, workers involved in data collection, curation, or other labor should be paid at least the minimum wage in the country of the data collector.

15. Institutional Review Board (IRB) Approvals or Equivalent for Research with Human Subjects
Question: Does the paper describe potential risks incurred by study participants, whether such risks were disclosed to the subjects, and whether Institutional Review Board (IRB) approvals (or an equivalent approval/review based on the requirements of your country or institution) were obtained?
Answer: [NA]
Justification: We do not experiment with human subjects.
Guidelines:
- The answer NA means that the paper does not involve crowdsourcing nor research with human subjects.
- Depending on the country in which research is conducted, IRB approval (or equivalent) may be required for any human subjects research. If you obtained IRB approval, you should clearly state this in the paper.
- We recognize that the procedures for this may vary significantly between institutions and locations, and we expect authors to adhere to the NeurIPS Code of Ethics and the guidelines for their institution.
- For initial submissions, do not include any information that would break anonymity (if applicable), such as the institution conducting the review.