# aggregating_binary_judgments_ranked_by_accuracy__c2c375fc.pdf

Aggregating Binary Judgments Ranked by Accuracy

Daniel Halpern,1 Gregory Kehne,2 Dominik Peters,1

Ariel D. Procaccia,1 Nisarg Shah,3 Piotr Skowron4

1 Harvard University, 2 Carnegie Mellon University, 3 University of Toronto, 4 University of Warsaw dhalpern@g.harvard.edu,gkehne@andrew.cmu.edu,dpeters@seas.harvard.edu, arielpro@seas.harvard.edu,nisarg@cs.toronto.edu,p.skowron@mimuw.edu.pl

We revisit the fundamental problem of predicting a binary ground truth based on independent binary judgments provided by experts. When the accuracy levels of the experts are known, the problem can be solved easily through maximum likelihood estimation. We consider, however, a setting in which we are given only a ranking of the experts by their accuracy. Motivated by the worst-case approach to handle the missing information, we consider three objective functions and design efﬁcient algorithms for optimizing them. In particular, the recently popular distortion objective leads to an intuitive new rule. We show that our algorithms perform well empirically using real and synthetic data in collaborative ﬁltering and political prediction domains.

1 Introduction Consider the task of predicting a binary ground truth G {0, 1} by aggregating independent binary judgments provided by n experts. This models a wide range of real-world scenarios, where the judgments can be polls predicting the outcome of an upcoming political or sports event, a user s reviews of previously watched movies, weather forecasts, or juror opinions of a defendant s guilt. The judgment of expert i, denoted Xi, is assumed to be a Bernoulli random variable, which coincides with the ground truth with probability pi; this probability is referred to as the accuracy of the expert. If p = (p1, . . . , pn) is known, then the classical maximum likelihood estimation approach chooses the ground truth estimate that maximizes the likelihood of inducing the vector of expert judgments X = (X1, . . . , Xn), i.e., the value of y {0, 1} that maximizes L[X; G = y, p] = Qn i=1 p I[Xi=y] i (1 pi)I[Xi =y], where I is the indicator variable. However, sometimes we may not know the exact values of p1, . . . , pn; instead, we may only know a ranking of the expert judgments by accuracy. This may be the case when there is metadata available about the judgments that is known to be correlated with accuracy, but the exact nature of the correlation is not known. For instance, if a pollster conducts multiple polls over time, polls conducted closer to the date of the event being predicted may be considered more accurate than the ones conducted earlier; the same reasoning

Copyright 2021, Association for the Advancement of Artiﬁcial Intelligence (www.aaai.org). All rights reserved.

applies to weather forecasts. Similarly, polls conducted concurrently may be ranked by their sample sizes. Sometimes, experts may participate in a judgment contest (such as the Good Judgment Project1), which may show their ranking by accuracy on the leaderboard. Motivated by such settings, we address the following question in this work:

How should we aggregate n binary judgments ranked by accuracy in order to predict a binary ground truth?

Note that the n binary judgments ordered by accuracy can be represented as a bit-string of length n. Thus, we essentially study aggregation rules which take a bit-string as input and output a bit. Due to the fundamental nature of this setting, the rules designed in this work may have applications in other domains (see Section 6).

Our Contribution

Recall that the likelihood function L[X; G = y, p] depends on p1, . . . , pn, i.e., on the accuracy of the experts. However, we are given only partial information about these values, namely, their ordering. To address this missing information, we take a worst-case viewpoint. Speciﬁcally, let P denote the set of all p which are consistent with the given ordering; we deﬁne the following three natural objectives that serve as proxies for the likelihood induced by a given estimate y {0, 1}, and design algorithms to compute the estimate optimizing these objectives.

1. Distortion: supp P L[X; G = 1 y, p]/L[X; G = y, p]. Note that this is worst-case ratio of the likelihood of the estimate not chosen (1 y) to the likelihood of the estimate chosen (y). Our aim is to minimize this objective.

2. Optimistic likelihood: supp P L[X; G = y, p]. Maximizing this objective can be thought of as a natural extension of the maximum likelihood approach, where we make an inference about p together with one about the estimate y.

3. Pessimistic likelihood: infp P L[X; G = y, p]. Maximizing this objective can be thought of as maximizing the worst-case likelihood.

1https://goodjudgment.com/

The Thirty-Fifth AAAI Conference on Artificial Intelligence (AAAI-21)

In Section 3, we characterize the rules which optimize these objectives, and show that they can be implemented in polynomial time. In particular, the rules optimizing the ﬁrst two objectives are novel and elegant. In Section 4, we restrict our attention to a natural family of rules, which we refer to as scoring rules. These rules assign monotonic weights to judgments (i.e., judgments ranked higher by accuracy receive no less weight than those ranked lower), and return the estimate with the highest total weight. We characterize the scoring rules that optimize the three aforementioned objectives among all scoring rules. In the full version2, we also consider three other approaches, namely, an axiomatic approach, a Bayesian approach, and a randomized approach. Finally, in Section 5, we empirically evaluate the performance of the rules designed in this work against some baselines. The experiments use synthetic and real data in the domain of collaborative ﬁltering, and real data in the domain of political predictions. Overall, given their low information requirements, our rules do remarkably well.

Related Work Our paper contributes to a large body of work in computational social choice (Brandt et al. 2016). A central feature that separates our setting from the vast majority of papers in the area is that the judgments (or opinions, or preferences) that are being aggregated are typically assumed to be anonymous, in the sense that individuals are indistinguishable. However, it has been noted that there are important contexts where anonymity leads to bad outcomes (Lang 2019). Our setting is related to judgment aggregation (Endriss 2016), an area that also aggregates binary judgments. Typically, this literature does not assume the existence of a ground truth, except for some work on epistemic judgment aggregation (Hartmann and Sprenger 2012; Bozbay, Dietrich, and Peters 2014; Terzopoulou and Endriss 2019). While our model deals with only a single issue, judgment aggregation focuses on problems arising from the aggregation of several logically related issues simultaneously. In statistics there is inﬂuential work on the problem of estimating the common mean of multiple normal distributions (Cohen and Sackrowitz 1974; Jordan and Krishnamoorthy 1996), where the unknown variance of each distribution can be seen as a measure of (in)accuracy. Our setting is more closely related to the work of Ghosh, Kale, and Mc Afee (2011), who, like us, consider a binary ground truth (for each item ), and binary judgments, each of which is correct with some probability that depends on the expert s unknown accuracy. The central idea that distinguishes our work from these papers is that we assume a known ranking of the experts by accuracy. This assumption also guides our choice of (worst-case) optimization objectives, which are different from the statistical estimation problems considered in previous work. Some of our main results pertain to the distortion objective. This objective was conceived in the context of social-welfare maximization in voting settings (Procaccia

2Full version: www.cs.toronto.edu/ nisarg/papers/rankedbinary-judgments.pdf

and Rosenschein 2006; Boutilier et al. 2015; Caragiannis et al. 2017; Anshelevich et al. 2018), but several papers have applied the idea to other problems such as matching, facility location, and even traveling salesperson (Anshelevich and Sekar 2016; Abramowitz and Anshelevich 2018; Anshelevich and Zhu 2018). Our aggregation rules can be viewed as simple games (Taylor and Zwicker 1999) where the experts are players, and winning coalitions correspond to sets of experts such that when all these experts report 1, then so does the aggregation rule. The simple games literature has also studied linear simple games, which correspond to games with ranked players. This literature includes characterization results for weighted simple games (Taylor and Zwicker 1992), which correspond to what we call scoring rules.

2 Model For k N, let us denote [k] = {1, . . . , k}. Let G {0, 1} denote an unknown binary ground truth. Let N = [n] denote a set of experts. Each expert i N provides a binary judgment Xi {0, 1}, which is a Bernoulli random variable that is correct with probability pi, i.e., Pr[Xi = G] = pi. We refer to X = (X1, . . . , Xn) as the judgment proﬁle and p = (p1, . . . , pn) as the accuracy proﬁle. In this work, we make two crucial assumptions regarding X and p. First, we assume that the expert judgments (i.e. X1, . . . , Xn) are independent. Second, we assume that each expert is at least as accurate as a coin toss, i.e., pi 1/2 for each i N. For a discussion about relaxing these assumptions, see Section 6. For y {0, 1}, the likelihood of observing X when the ground truth is G = y can now be written as

L[X; G = y, p] = Qn i=1 p I[Xi=y] i (1 pi)I[Xi =y], where I denotes the indicator variable. If the accuracy proﬁle p is known, then a classical approach to aggregating the expert judgments would be to return the maximum likelihood estimate (MLE) of the ground truth given by argmaxy {0,1} L[X; G = y, p]. In this work, we assume that we do not know p. Instead, we are given a ranking of the experts by their accuracy, and we are interested in aggregating the expert judgments subject to this ordinal information. Without loss of generality, assume that p1 p2 . . . pn. Thus, expert 1 is the most accurate, while expert n is the least accurate. Let Pn = {p : 1 p1 . . . pn 1/2} denote the set of feasible accuracy proﬁles. Note that Pn contains the accuracy proﬁle p = (1, . . . , 1), under which the likelihood of any nonunanimous judgment proﬁle X is zero, regardless of the estimate y. This makes some of our objectives not well-deﬁned or uninteresting. Of course, in practice, no judgment is perfectly accurate. To circumvent this hypothetical inconsistency, we deﬁne Pϵ n = {p : 1 ϵ p1 . . . pn 1/2}, analyze the aggregation rules optimizing our objectives deﬁned with respect to Pϵ n, and then take the limit ϵ 0. In the limit, these rules converge , in the sense that they become ﬁxed once ϵ is small enough. When the objective is well-deﬁned directly with respect to Pn, we avoid taking this longer route.

Formally, our input is the bit-string X {0, 1}n, where we refer to X1 as the most accurate bit and Xn as the least accurate. An aggregation function is denoted f : {0, 1}n {0, 1, }, where denotes a tie.3 We will alternatively represent a tie as the function returning {0, 1} instead of . We are also interested in a natural family of aggregation functions that we refer to as scoring rules. A scoring rule fw is parametrized by a weight vector w = (w1, . . . , wn) Rn 0, where wi is the weight associated with the i-th most accurate bit. Given input X, fw returns the bit with the highest total weight, i.e., argmaxy {0,1} Pn i=1 wi I[Xi = y]. This deﬁnition is inspired by that of a prominent family of voting rules called positional scoring rules, which includes well-known rules such as plurality and Borda count.

3 Worst-Case Optimal Aggregation Rules Given incomplete information about the accuracy proﬁle p, we cannot compute the MLE, since different accuracy proﬁles p consistent with the given ordinal information may induce different likelihoods. Our approach is to deﬁne an objective function that summarizes the likelihoods induced by all feasible p and optimize it; we consider three proposals.

Distortion Informally, given an objective function and ordinal information about cardinal inputs to the function, the distortion approach selects an outcome minimizing the ratio between the optimal objective value and the objective value under the selected outcome, in the worst case over all cardinal inputs consistent with the given ordinal information. The objective we are interested in is the likelihood function L, and we are given ordinal information about p (speciﬁcally, that p Pn). Given a judgment proﬁle X, the distortion of ground truth estimate y {0, 1} is then deﬁned as

dist(y; X) = sup p Pn

max (L[X; G = 0, p], L[X; G = 1, p])

L[X; G = y, p]

L[X; G = 1 y, p]

L[X; G = y, p] .

Here, the second equality is due to the fact that with G = y, the ratio is always 1, which can also be achieved with G = 1 y at p1 = . . . = pn = 1/2 (which makes the likelihoods given both possible ground truths equal). Hence, the worst case is achieved with G = 1 y in the numerator. Given this deﬁnition, the distortion-optimal estimate is y argmaxy {0,1} dist(y; X). We borrow the idea of distortion from the voting literature, where the goal is to select a candidate maximizing total voter happiness, but only voters ranked preferences (and not the exact intensities) are known. Distortion offers a priorindependent evaluation of candidates; a candidate minimizing distortion is the best choice given only the information available. This objective requires attention to the technicality mentioned in Section 2. Let p = (1, . . . , 1) Pn, and consider a

3Allowing ties does not signiﬁcantly alter most of our results; we discuss some of the implications of ties in later sections.

proﬁle X in which not all judgments agree. Then L[X; G = 0, p] = L[X; G = 1, p] = 0, making distortion undeﬁned. Hence, we use Pϵ n = {p : 1 ϵ p1 . . . pn 1/2} to redeﬁne the distortion as

distϵ(y; X) = sup p Pϵ n

L[X; G = 1 y, p]

L[X; G = y, p] .

The distortion-optimal rule fdist is deﬁned as fdist(X) = limϵ 0 argminy {0,1} distϵ(y; X). Interestingly, we show that the estimate y minimizing distϵ(y; X) is independent of ϵ, making the limit unnecessary. First, we deﬁne a quantity that we will later show to be closely related to distortion. Deﬁnition 1. Given X {0, 1}n, the strength s X(y) of estimate y is the maximum difference between the number of occurrences of y and that of 1 y in any preﬁx of X, i.e.,

s X(y) = max k [n] {0} Pk i=1 {I[Xi = y] I[Xi = 1 y]} .

Lemma 1. For ϵ (0, 1/2), n N, X {0, 1}n, and y {0, 1}, we have distϵ(y; X) = 1 ϵ

ϵ s X(1 y).

Proof. Fix y {0, 1}. Given a sequence p, we say that it has a jump at i [n 1] if pi > pi+1. We ﬁrst show that in the deﬁnition of distϵ(y; X), the supremum over p is achieved at an accuracy proﬁle with at most one jump. Let p be a vector with the minimum jumps at which the supremum is achieved. Suppose for contradiction that it has at least two jumps, and let k and j be indices such that pk > pk+1 = . . . = pj > pj+1. Deﬁne p1 and p2 such that p1 i = p2 i = pi for i [n] \ {k + 1, . . . , j}, p1 i = pk for i {k + 1, . . . , j}, and p2 i = pj+1 for i {k + 1, . . . , j}. That is, in p1, we shift the block (pk+1, . . . , pj) up and make it equal to pk, and in p2, we shift it down and make it equal to pj+1. We show that at least one of these two vectors must yield an approximation ratio no better than that at p, and is therefore also a point where the supremum is achieved; this is a contradiction because they both have one fewer jump than p. To see why the claim is true, let a = pk, b = pk+1 = . . . = pj, and c = pj+1. Thus, a > b > c 1/2. Denoting S = {k + 1, . . . , j}, we have that

L[X; G = 1 y, p]

L[X; G = y, p] = Y

p I[Xi=1 y] i (1 pi)I[Xi=y]

p I[Xi=y] i (1 pi)I[Xi=1 y]

b I[Xi=1 y] (1 b)I[Xi=y]

b I[Xi=y] (1 b)I[Xi=1 y]

p I[Xi=1 y] i (1 pi)I[Xi=y]

p I[Xi=y] i (1 pi)I[Xi=1 y]

i S(I[Xi=1 y] I[Xi=y]) .

In the last expression, as b > 1/2, we have b/(1 b) > 1. Thus, if the exponent of b/(1 b) is non-positive, then decreasing b to c does not decrease the expression, and the expression changes from the approximation ratio at p to that

at p2. Similarly, if the exponent is non-negative, then increasing b to a does not decrease the expression, and the expression changes from the approximation ratio at p to that at p1. Hence, at least one of p1 and p2 achieves a ratio at least as high as p, as desired. We have established that the supremum is achieved at some p which has at most one jump. Then, there exist a, b [1/2, 1 ϵ] with a > b and an index k [n] {0} such that pi = a for all i k and pi = b for all i > k. Note that allowing k = 0 and k = n permits zero jumps. We show that we can let a = 1 ϵ and b = 1/2 without loss of generality. The approximation ratio at this p is given by a 1 a Pk i=1(I[Xi=1 y] I[Xi=y])

b 1 b Pn i=k+1(I[Xi=1 y] I[Xi=y]) .

The exponent of a/(1 a) must be non-negative (otherwise decreasing a to b would strictly increase the approximation ratio). Hence, increasing a to 1 ϵ does not decrease the approximation ratio. Similarly, we can let b = 1/2. We have thus established that the supremum is achieved at p such that for some k [n] {0}, pi = 1 ϵ for i k and pi = 1/2 for i > k. Thus, the distortion of y given X is

distϵ(y; X) = max k [n] {0}

ϵ Pk i=1(I[Xi=1 y] I[Xi=y]) ,

which is 1 ϵ

ϵ s X(1 y), as desired.

We can immediately obtain a characterization of the distortion-optimal estimate y {0, 1} by observing that argminy {0,1} s X(1 y) = argmaxy {0,1} s X(y) and applying Lemma 1: The distortion-optimal estimate is the estimate with the greatest strength in X. Theorem 1. For any ϵ (0, 1/2), n N, and X {0, 1}n,

fdist(X) = argmin y {0,1} distϵ(y; X) = argmax y {0,1} s X(y),

where fdist is the distortion-optimal rule. Further, this can be computed in linear time. Note that in case of both estimates having equal strength, the result also implies that their distortion will be equal. A notable property of fdist is that if more than n/3 most accurate judgments or more than 2n/3 least accurate judgments are identical, then that will be the output of fdist, regardless of the remaining judgments.

Other Objectives We now turn our attention to two other objectives, namely maximization of optimistic and pessimistic likelihoods. Recall that the reason we cannot directly compute the MLE argmaxy {0,1} L[X; G = y, p] is because we do not know the exact accuracy proﬁle p. Instead, we know that p Pn. Given this, we deﬁne the optimistic and pessimistic likelihoods by taking the best case and the worst case over the choice of p, respectively. The optimistic likelihood L of observing X when the ground truth is G = y is L [X; G = y] =

Algorithm 1: OPT-LIKELIHOOD

Input: Judgment proﬁle X {0, 1}n, y {0, 1} Output: Optimistic likelihood L [X; G = y] if n = 1 then

return 1I[X1=y] (1/2)I[X1 =y]

end [Find the preﬁx of X with the highest density of y] i index maximizing (1/i) Pi j=1 I[Xj = y], breaking ties in favor of larger indices d (1/i) Pi j=1 I[Xj = y] r max {d, 1/2} L OPT-LIKELIHOOD((Xi+1, . . . , Xn), y) return rd(1 r)1 d i L

supp Pn L[X; G = y, p]. The optimistic MLE rule which maximizes this objective, denoted f MLE , is given by f MLE (X) = argmaxy {0,1} L [X; G = y]. We can view f MLE as simply performing a joint maximum likelihood estimation over (y, p) {0, 1} Pn, and returning the y component of the resulting estimate.4 We begin by presenting an algorithm that calculates the optimistic likelihood of an estimate y {0, 1} given a judgment proﬁle X. The algorithm repeatedly identiﬁes a preﬁx of X with the highest density of y, and imputes that the accuracies of judgments in that preﬁx are equal to this density. The following is proved in the full version. Theorem 2. Algorithm 1 calculates the optimistic likelihood L [X; G = y] in polynomial time. Thus, the aggregation rule f MLE can be implemented in polynomial time. We illustrate Algorithm 1 by an example. Example 1. Let us consider running OPT-LIKELIHOOD with X = (0, 1, 1, 1, 0, 1, 1, 0, 0, 1) and y = 1. The ﬁrst iteration selects i = 4 (i.e. preﬁx (0, 1, 1, 1)) because the density of y = 1 in this preﬁx is 3/4, and this is the highest density in any preﬁx. This leads to d = r = 3/4. The second iteration selects i = 3 (i.e. preﬁx (0, 1, 1)), leading to d = r = 2/3. The ﬁnal iteration selects the remaining string, and sets d = 1/3 but r = 1/2. Thus, p = (3/4, 3/4, 3/4, 3/4, 2/3, 2/3, 2/3, 1/3, 1/3, 1/3) is the accuracy proﬁle leading to the optimistic likelihood of 3 4

We now turn our attention to maximizing the pessimistic likelihood L . If we deﬁne it as L [X; G = y] = infp Pn L[X; G = y, p], then we run into the issue discussed in Section 2: the pessimistic likelihood of any nonunanimous X becomes 0 under both values of y due to the accuracy proﬁle p = (1, . . . , 1) Pn, leading to unnecessary ties. Hence, we again consider Pϵ n instead of Pn, deﬁne Lϵ [X; G = y] = infp Pϵ n L[X; G = y, p], and deﬁne

4This is also equivalent to computing the maximum a posteriori estimate (MAP) when we are given a uniform prior over p. For computing MAP given other priors, see the full version.

the pessimistic MLE rule, denoted f MLE , as f MLE (X) = limϵ 0 argmaxy {0,1} Lϵ [X; G = y]. When compared to f MLE , f MLE is more in line with a worst-case approach. Unlike in the case of distortion, the choice of y does not turn out to be independent of ϵ, but as we see in the proof of Theorem 3, the rule converges once ϵ < 2 n. The next result identiﬁes f MLE analytically. This is possible because the accuracy proﬁle resulting in the pessimistic likelihood always consists of only 1 ϵ and 1/2. This is in contrast to the one leading to the optimistic likelihood, which, as Example 1 demonstrates, can be more complex. The following is proved in the full version. Theorem 3. The pessimistic MLE rule f MLE , given a judgment proﬁle X, outputs the majority judgment; if tied, it outputs the opposite of the least accurate judgment (i.e. 1 Xn).

4 Optimal Scoring Rules We now turn our attention to a natural class of aggregation rules, scoring rules. Speciﬁcally, we are interested in how well we can optimize certain objectives when we are restricted to this class of functions. There are two clear ambiguities about scoring rules that we must take into account. First, for many choices of objectives, there is no scoring rule which is instance optimal, i.e., at least as good as every other scoring rule on all possible judgment proﬁles. To handle this, we modify our goal slightly and instead choose scoring rules that are optimal in the worst case. Next, is the issue of ties. In this case we ll take a pessimistic view (in line with our worst case objective goals) and say that when a scoring rule outputs a tie, the value of the objective in this instance will be the worse of the two outcomes. Theorem 4. For any ϵ (0, 1/2) and n N, the scoring rule given by w = (1, . . . , 1, 0, . . . , 0) with exactly 2 n/3 + 1 ones minimizes the worst case distortion max X {0,1}n distϵ(fw(X); X) over all possible scoring rules parametrized by w Rn 0.

Proof. Fix ϵ (0, 1/2) and n N. Recall that minimizing the distortion, distϵ(y; X) is equivalent to minimizing the strength of the unchosen judgment, s X(1 y). First, we ll show that no rule f (scoring or otherwise) can guarantee s X(1 f(X)) < n/3 for all X {0, 1}n. This will imply max X {0,1}n s X(1 fw(X)) n/3 for all w Rn 0. To see this, we construct X {0, 1}n such that both s X(0) and s X(1) are at least n/3 . Consider the judgment proﬁle Xs = (1, . . . , 1, 0, . . . , 0) with n/3 ones and n n/3 zeros. On the preﬁx of the ﬁrst n/3 judgments, there are n/3 more 1s than there are 0s. Thus, the strength of 1 is at least n/3 . On the other hand, on the entire proﬁle, there are (n n/3 ) n/3 n 2n

3 n/3 more 0s then 1s. Hence, the strength of 0 is at least n/3 , as desired. Next, we show that s X(1 fw (X)) n/3 for all judgment proﬁles X {0, 1}n. Qualitatively, fw simply picks the majority bit of the ﬁrst 2 n/3 + 1 bits. Note that since 2 n/3 + 1 is odd, there is always a majority bit and thus fw will never output a tie. Let X {0, 1}n and, without loss of generality, suppose fw (X) = 1. We show that s X(0) n/3 .

Since fw chose 1, there cannot be a majority of 0s in the ﬁrst 2 n/3 + 1 bits. Hence, 0 occurs at most n/3 times in this preﬁx. This implies that for k 2 n/3 + 1, Pk i=1 {I[Xi = 0] I[Xi = 1]} n/3 . Next, since 1 has a majority among the ﬁrst 2 n/3 + 1 bits, P2 n/3 +1 i=1 {I[Xi = 0] I[Xi = 1]} 1. or k > 2 n/3 + 1, Pk i=1 {I[Xi = 0] I[Xi = 1]}

Pk i=2 n/3 +2 {I[Xi = 0] I[Xi = 1]} 1

k (2 n/3 + 1) 1 n (2 n/3 + 1) 1 = n (3 n/3 + 2) + n/3 n n + n/3 = n/3 .

So, for all k {0} [n], Pk i=1 {I[Xi = 0] I[Xi = 1]} n/3 , and hence s X(0) n/3 as desired.

Other Objectives We also investigate optimal scoring rules with respect to the optimistic and pessimistic MLE rules f MLE and f MLE . It is easy to see that maximizing the optimistic (resp. pessimistic) likelihood of the chosen estimate is equivalent to minimizing the optimistic (resp. pessimistic) likelihood of the unchosen estimate; thus, we can also view f MLE and f MLE as minimizing the optimistic and pessimistic likelihoods of the unchosen estimate, respectively. This connection holds only because we are looking for the optimal rule within the family of all possible rules; however, when we look for the optimal rule within the family of scoring rules, we have to consider four not two objectives.

Deﬁnition 2. We deﬁne optimal scores w , w , w , w as

w arg maxw Rn 0 min X L [X; G = fw(X)]

w arg minw Rn 0 max X L [X; G = 1 fw(X)]

w arg maxw Rn 0 min X L [X; G = fw(X)]

w arg minw Rn 0 max X L [X; G = 1 fw(X)].

For example, w maximizes the optimistic likelihood of its chosen answer in the worst case. For this rule it sufﬁces to always choose the most accurate expert s judgment:

Theorem 5. The score w = (1, 0, . . . , 0) is optimal. For the cases based on pessimistic likelihood (both maximizing it for the chosen answer and minimizing it for the unchosen answer), characterizing the optimum scoring rule is easy, since the optimum rule we identiﬁed in Theorem 3 can be represented as a scoring rule.

Theorem 6. Scores w = w = (1, . . . , 1, 1/2) are optimal for ϵ 2 n, and coincide with the rule of Theorem 3. Note that in the scoring vectors of Theorem 6, the 1/2 component could be replaced with any value strictly between 0 and 1 without changing the aggregation rule. In general, throughout this section, the scoring rules we identify are optimal but not uniquely optimal (cf. Deﬁnition 2).

The remaining case, w , that is minimizing the optimistic likelihood of the unchosen answer, is less straightforward. Using a linear program, we have obtained optimal scores w for n 20; these are cataloged in the full version. For general n, w is unknown, but in the full version we also show that there exists an optimal scoring rule that is nonincreasing, for all n.

5 Experiments In this section we assess through computer simulations the quality of decisions made by our aggregation functions in the context of two example applications.

Collaborative Filtering Consider a set of agents N, a set of issues I, and a partially observed binary matrix (xij)i N,j I. We interpret an entry xij {0, 1} as the decision of agent i on issue j (for example, reviewer i bids on paper j). In each run of the experiment, we randomly select an entry of the matrix, hide it, and use several algorithms to guess its value. An algorithm is successful if it guesses correctly. We repeat the experiment 1,000 times to assess the average accuracy of the algorithms. We use our aggregation functions to predict hidden values in a matrix as follows. For an agent i N, let R(i) be the set of issues j such that the entry xij is observed. Given a hidden entry xij we ﬁrst identify the set of agents k N for which the value xkj is observed. Second, we rank those agents by their similarity to i. Formally, we deﬁne the similarity score of two agents i and k as sim(i, k) = |{j R(k) R(i): xij = xkj}|/|R(k) R(i)|, that is the fraction of issues on which i and k agreed among all issues for which we have data from both agents. We rank the agents k in the descending order of their similarity score with i. Thus, we assume that more similar agents are better predictors of the hidden decision xij of agent i. We truncate the list of agents to the ﬁrst half (we use this heuristic since our algorithms where designed for the case where p > 1/2). The ranked decisions by agents on issue j then form the input to the aggregation functions from Sections 3 and 4 and to the Bayesian algorithm from full version (with a prior estimated from the data). We compare our algorithms with three standard Recommender Systems algorithms for matrix completion, implemented in the fancyimpute python library5: Matrix Factorization (MF), Iterative SVD (ISVD), and Soft Impute (SI). We evaluate the rules on two datasets from the Pref Lib library (Mattei and Walsh 2013), and on a synthetic dataset.

Sushi. This dataset contains information about individuals preferences on various types of sushi. There are 100 types of sushi, and each individual assigns scores from {1, . . . , 4} to 10 randomly selected sushi sets. We ﬁlter only those individuals who assigned 4 different scores to the sets (there are 2737 such agents), and convert their preferences to binary judgments as follows. For a ﬁxed value of d {3, 4} we set the decision of an agent i for sushi j to 1 if i assigns to j the score at least equal to d;

5https://github.com/iskandr/fancyimpute

the decision is 0 if the corresponding score is lower than d. Note that only 10% of entries of this matrix are observed. Conference Bidding (CONF). This dataset contains reviewers bids on papers at a major computer science conference. We convert the reviewers bids to their binary judgments over papers by setting the decision to one if they bid yes for a paper (d = Y) or by setting it to one if they bid yes or maybe for a paper (d = M). Additionally, we hide an h fraction of randomly selected entries in the matrix (h {0.5, 0.8, 0.9}). Synthetic Model (SYNT) Each agent and each issue is represented by a d-dimensional vector of attributes (d {5, 10}). For each agent and each issue we sample the value of each attribute independently and uniformly from [ 1, 1]. An agent i decides 1 on an issue j if the dot product of their corresponding attribute vectors is positive. Otherwise i decides 0. We hide an h fraction of randomly selected entries in the matrix (h {0.5, 0.8, 0.9}).

Political Predictions We use a dataset from Five Thirty Eight of polling data from the 2016 US Presidential Election. We convert this data into a binary format by choosing a threshold, the mean of the number of votes the candidate received over all polls in that state, then reporting 1 if the poll was above this threshold and 0 if it was below. In addition, we assume that polls accuracies are sorted by their recency, that is later polls are more accurate than earlier ones. We run an experiment for each US state and each candidate. The ground truth is taken to be whether the true number of votes the candidate received in the general election was above or below the threshold. We then analyze our algorithms: given the sorted binary polling data, do they correctly predict the ground truth? For each state and candidate, algorithms get a score of 1 for getting the ground truth correctly, 0 for being incorrect, and 1/2 for a tie. The algorithms overall scores are their average over all states. Note that for a few states, there is no polling data for a certain candidate in which case the state was not included in the score. This is why the scores are not all multiples of 1/50. Finally, since older data may be inaccurate and could even hurt accuracy, we compare two settings: using all available polls and restricting the algorithms to polls conducted on or after October 1, 2016. The election took place on November 8, 2016.

Results Representative results of our experiments for selected values of the parameters are summarized in Table 1. 1. The scoring rules using vectors w and w are suboptimal: for most datasets the distortion-optimal rule achieved better accuracies than these rules (the only exception is JOHNSON-ALL, where fdist performed worse than w ). Similarly, f MLE performed (slightly) better than fdist on only two datasets (CLINTON-OCT and CONF (d = M, h = 0.9)), and for several other datasets it produced signiﬁcantly worse results. 2. fdist, f MLE , and the scoring rule using w perform comparably well, though each excelled in different datasets.

fdist f MLE f MLE sc (w ) sc (w ) sc (w ) Bayesian MT ISVD SI

SUSHI (d = 3) 65.3 66.6 65.2 65.5 62.4 57.0 48.8 50.1 57.5 49.5 SUSHI (d = 4) 68.6 69.4 67.2 70.1 67.9 63.3 57.6 60.4 63.3 66.7

CONF (d = M, h = 0.5) 94.8 94.8 94.8 94.8 90.3 94.8 94.8 96.5 94.5 96.8 CONF (d = M, h = 0.8) 95.3 95.2 95.3 95.3 92.0 95.3 95.2 93.0 91.0 93.5 CONF (d = M, h = 0.9) 95.1 94.6 95.2 95.2 92.3 91.9 90.5 92.0 94.6 95.6

SYNT (h = 0.8, d = 10) 76.5 73.4 73.7 68.1 64.6 74.2 77.1 46.0 87.0 73.5 SYNT (h = 0.8, d = 5) 85.4 84.0 83.4 81.5 78.8 85.5 90.1 49.0 91.5 89.0 SYNT (h = 0.5, d = 5) 89.9 91.2 88.9 87.7 86.7 89.9 92.4 94.0 94.1 92.0

CLINTON-ALL 83.3 82.4 74.5 86.3 52.9 78.4 84.3 JOHNSON-ALL 68.4 79.6 51.0 79.6 73.5 55.1 85.7 TRUMP-ALL 90.2 92.2 82.4 90.2 53.9 90.2 94.1 CLINTON-OCT 86.3 82.4 88.2 86.3 52.9 78.4 72.5 JOHNSON-OCT 80.6 81.6 77.6 71.4 73.5 55.1 71.4 TRUMP-OCT 92.2 92.2 92.2 92.2 53.9 90.2 98.0

Table 1: Summary of the experiments comparing accuracies (given as percentages) of aggregation functions. In each row, the best performing algorithms are bolded; those that perform within 1 and 2 percentage points of the best algorithm are shaded dark grey and light grey, respectively. Simulations for parameter values omitted in the table led to qualitatively similar conclusions. We use sc(w) to denote the scoring rule parametrized by the vector w.

3. For many datasets the Bayesian algorithm outperforms the rules with worst-case guarantees, yet there are instances (such as SUSHI) where the Bayesian algorithm is much worse. If we could pick the best response out of those produced by the Bayesian algorithm and f MLE (or fdist), we would always obtain high-quality results.

4. For some datasets, notably SUSHI, our algorithms outperform standard algorithms for matrix completion. For other datasets, the Bayesian algorithm is comparable to the matrix-completion algorithms. This is promising since our algorithms use less information.

5. In the political domain our best rules produced considerably more accurate predictions than simply trusting the most accurate (most recent) predictions (w ).

We can compare these experimental results to a setting in which the true accuracies of the experts are known. For example, if we take 10 experts and sample their accuracies uniformly from (0.5, 0.7), then the maximum likelihood estimate using both the judgments along with the expert accuracies recovers the ground truth in 76.7% of cases. On the other hand, knowing simply the judgments ordered by accuracies, fdist predicts correctly in 76.1% of cases, f MLE in 74.1%, and f MLE in 74.6%.

6 Discussion

Our setting boils down to the design of Boolean functions that take a string of bits as input and output a single bit with the twist that the order of bits matters, in that earlier bits are given greater importance. We view this as a fundamental problem, and there are many ways to approach it. In addition to the objectives and algorithms described in Sections 3 and 4, we present three additional approaches in the full version: axiomatic, Bayesian, and randomized .

One might ask whether the assumption that pi 1/2 for all i N can be relaxed. If the identities of experts with pi < 1/2 are known, we can simply ﬂip their judgments and reverse their order (as the ﬂipped judgment of the least accurate expert is now the most accurate). Interestingly, our problem now becomes that of aggregating two strings of judgments, ordered by accuracy, into a single bit. This problem is potentially richer than ours because there is no information on the relative accuracy of experts associated with two different strings. An even more general setup simply provides a partial order of the experts by accuracy. Another natural variant of our setting is one where, instead of binary judgments, experts provide real-valued judgments in, say, [0, 1], and the goal is to aggregate them to return a single real number in [0, 1]. Interestingly, given a binary aggregation rule f from our work, one can compute the greatest value x [0, 1] such that converting expert judgments to binary depending on whether they are at least x and feeding them to f gives output 1; this is well-deﬁned when f satisﬁes a natural monotonicity condition. We leave such directions for future work.

Acknowledgements

Halpern, Kehne, Peters and Procaccia were partially supported by the National Science Foundation under grants CCF-2007080, IIS-2024287 and CCF-1733556; and by the Ofﬁce of Naval Research under grant N00014-20-1-2488. Skowron was supported by Poland s National Science Center grant UMO-2019/35/B/ST6/02215. Shah was partially supported by an NSERC Discovery Grant.

References Abramowitz, B.; and Anshelevich, E. 2018. Utilitarians without utilities: Maximizing social welfare for graph prob-

lems using only ordinal preferences. In Proceedings of the 32nd AAAI Conference on Artiﬁcial Intelligence (AAAI), 894 901.

Anshelevich, E.; Bhardwaj, O.; Elkind, E.; Postl, J.; and Skowron, P. 2018. Approximating Optimal Social Choice under Metric Preferences. Artiﬁcial Intelligence 264: 27 51.

Anshelevich, E.; and Sekar, S. 2016. Blind, Greedy, and Random: Algorithms for Matching and Clustering using only Ordinal Information. In Proceedings of the 30th AAAI Conference on Artiﬁcial Intelligence (AAAI), 390 396.

Anshelevich, E.; and Zhu, W. 2018. Ordinal approximation for social choice, matching, and facility location problems given candidate positions. In Proceedings of the 14th Conference on Web and Internet Economics (WINE), 3 20. Springer.

Boutilier, C.; Caragiannis, I.; Haber, S.; Lu, T.; Procaccia, A. D.; and Sheffet, O. 2015. Optimal Social Choice Functions: A Utilitarian View. Artiﬁcial Intelligence 227: 190 213.

Bozbay, I.; Dietrich, F.; and Peters, H. 2014. Judgment aggregation in search for the truth. Games and Economic Behavior 87: 571 590.

Brandt, F.; Conitzer, V.; Endress, U.; Lang, J.; and Procaccia, A. D., eds. 2016. Handbook of Computational Social Choice. Cambridge University Press.

Caragiannis, I.; Nath, S.; Procaccia, A. D.; and Shah, N. 2017. Subset Selection Via Implicit Utilitarian Voting. Journal of Artiﬁcial Intelligence Research 58: 123 152.

Cohen, A.; and Sackrowitz, H. B. 1974. On Estimating the Common Mean of Two Normal Distributions. Annals of Statistics 2(6): 1274 1282.

Endriss, U. 2016. Judgment Aggregation. In Brandt, F.; Conitzer, V.; Endriss, U.; Lang, J.; and Procaccia, A. D., eds., Handbook of Computational Social Choice, chapter 17. Cambridge University Press.

Ghosh, A.; Kale, S.; and Mc Afee, P. 2011. Who Moderates the Moderators? Crowdsourcing Abuse Detection in User Generated Content. In Proceedings of the 12th ACM Conference on Economics and Computation (EC), 167 176.

Hartmann, S.; and Sprenger, J. 2012. Judgment aggregation and the problem of tracking the truth. Synthese 187(1): 209 221.

Jordan, S. M.; and Krishnamoorthy, K. 1996. Exact Conﬁdence Intervals for the Common Mean of Several Normal Populations. Biometrics 52(1): 77 86.

Lang, J. 2019. In Praise of Nonanomymity: Nonsubstitutable Voting. In Laslier, J.-F.; Moulin, H.; Sanver, R.; and Zwicker, W. S., eds., The Future of Economic Design, 97 102. Springer.

Mattei, N.; and Walsh, T. 2013. Pref Lib: A Library of Preference Data. In Proceedings of the 3rd International Conference on Algorithmic Decision Theory (ADT), 259 270.

Procaccia, A. D.; and Rosenschein, J. S. 2006. The Distortion of Cardinal Preferences in Voting. In Proceedings of the 10th International Workshop on Cooperative Information Agents (CIA), 317 331. Taylor, A. D.; and Zwicker, W. S. 1992. A Characterization of Weighted Voting. Proceedings of the American Mathematical Society 115(4): 1089 1094. Taylor, A. D.; and Zwicker, W. S. 1999. Simple Games: Desirability Relations, Trading, Pseudoweightings. Princeton University Press. Terzopoulou, Z.; and Endriss, U. 2019. Optimal Truth Tracking Rules for the Aggregation of Incomplete Judgments. In International Symposium on Algorithmic Game Theory, 298 311. Springer.