# Crowdsourcing with Arbitrary Adversaries

Matthäus Kleindessner, Pranjal Awasthi
Department of Computer Science, Rutgers University, Piscataway Township, New Jersey, USA. Correspondence to: Matthäus Kleindessner, Pranjal Awasthi.

*Proceedings of the 35th International Conference on Machine Learning, Stockholm, Sweden, PMLR 80, 2018. Copyright 2018 by the author(s).*

**Abstract.** Most existing works on crowdsourcing assume that the workers follow the Dawid-Skene model, or the one-coin model as its special case, where every worker makes mistakes independently of other workers and with the same error probability for every task. We study a significant extension of this restricted model. We allow almost half of the workers to deviate from the one-coin model and, for those workers, we allow their probabilities of making an error to be task-dependent and arbitrarily correlated. In other words, we allow for arbitrary adversaries, whose error probabilities can not only be high, but who can also collude perfectly. In this adversarial scenario, we design an efficient algorithm to consistently estimate the workers' error probabilities.

## 1. Introduction

Crowdsourcing is an omnipresent phenomenon: it has emerged as an integral part of the machine learning pipeline in recent years, and one reason for the great advances in deep learning is the presence of large data sets that have been labeled by the crowd (e.g., Deng et al., 2009; Krizhevsky, 2009). Crowdsourcing is also at the heart of peer grading systems (e.g., Alfaro & Shavlovsky, 2014), which help with rising enrollment at universities, and online rating systems (e.g., Liao et al., 2014), which many of us rely on when choosing the next restaurant, to provide just a few examples.

A crowdsourcing scenario consists of a set of workers and a set of tasks that need to be solved. A data curator utilizing crowdsourcing can aim at estimating various quantities of interest. The first goal might be to estimate the true labels or answers for the tasks at hand. Typically, additional constraints are involved here, such as a worker not being willing to solve too many tasks and the data curator wanting to get high-quality labels at a low price. The canonical example of this case is Amazon Mechanical Turk™; there, one cannot track specific workers as they are fleeting. However, in scenarios such as peer grading or online rating systems, a second goal might be to estimate worker qualities, especially if workers can be reused at a later time.

In a seminal paper, Dawid & Skene (1979) proposed a formal model that involves worker quality parameters for crowdsourcing scenarios in the context of classification. The Dawid-Skene model has become a standard theoretical framework and has led to a flurry of research over the past few years (Liu et al., 2012; Raykar & Yu, 2012; Li et al., 2013; Gao et al., 2016; Zhang et al., 2016; Khetan et al., 2017), in particular in its special symmetric form usually referred to as the one-coin model (Ghosh et al., 2011; Karger et al., 2011a;b; Dalvi et al., 2013; Gao & Zhou, 2013; Karger et al., 2014; Bonald & Combes, 2017; Ma et al., 2017). In its general form for binary classification problems, the Dawid-Skene model assumes that for each worker, the probability of providing the wrong label depends only on the true label of the task, but not on the task itself. Moreover, given the true label, the responses provided by different workers are independent.
The one-coin model additionally assumes that for each worker, the probability of providing the wrong label is the same for both classes. We will formally introduce the one-coin model in Section 2. A discussion of prior work is provided in Section 5 and Appendix A.

The crucial limitation of the Dawid-Skene and one-coin models is the assumption that workers' error probabilities are task-independent. In particular, this excludes the possibility of colluding adversaries (other than those that provide the wrong label all of the time), which might make these models a poor approximation of the real world encountered in applications such as peer grading or online rating. In this paper, we study a significant extension of the one-coin model that allows for arbitrary, highly colluding adversaries. We provide an algorithm for estimating the workers' error probabilities and prove that it asymptotically recovers the true error probabilities. Using our estimates of the error probabilities in weighted majority votes, we also provide strategies to estimate the ground-truth labels of the tasks. Experiments on both synthetic and real data show that our approach clearly outperforms existing methods in the presence of adversaries.

## 2. Setup and problem formulation

We first describe a general model for crowdsourcing with non-adaptive workers and binary classification tasks: there are $n$ workers $w_1, \ldots, w_n$ and an i.i.d. sample of $m$ task-label pairs $((x_i, y_i))_{i=1}^m \sim D^m$, where $D$ is a joint probability distribution over tasks $x \in X$ and corresponding labels $y \in \{-1, +1\}$. There is a variable $g_{ij} \in \{0, 1\}$, $i \in [m]$, $j \in [n]$, indicating whether worker $w_j$ is presented with task $x_i$ (for $k \in \mathbb{N}$, we use $[k]$ to denote the set $\{1, \ldots, k\}$). If $w_j$ is presented with $x_i$, that is $g_{ij} = 1$, then $w_j$ provides an estimate $w_j(x_i) \in \{-1, +1\}$ of the ground-truth label $y_i$. Let $A \in \{-1, 0, +1\}^{m \times n}$ be a matrix that stores all the responses collected from the workers: $A_{ij} = w_j(x_i)$ if $g_{ij} = 1$ and $A_{ij} = 0$ if $g_{ij} = 0$. We assume that each worker $w_j$ follows some (probabilistic or deterministic) strategy such that $w_j(x_i)$ only depends on $x_i$. In particular, given $x_i$, any two different workers' responses $w_j(x_i)$ and $w_k(x_i)$ and the ground-truth label $y_i$ are independent.

Let $\varepsilon_{w_j}(x, y) \in [0, 1]$ be the conditional error probability that, given $x$ and $y$, $w_j(x)$ does not equal $y$, that is,
$$\varepsilon_{w_j}(x, y) := \Pr\nolimits_{w_j \mid (x,y)}[\,w_j(x) \neq y \mid (x, y)\,]. \tag{1}$$
Note that the unconditional probability of $w_j(x)$ being incorrect, before seeing $x$ and $y$, is given by
$$\Pr\nolimits_{(x,y)\sim D,\, w_j}[\,w_j(x) \neq y\,] = \mathbb{E}_{(x,y)\sim D}[\varepsilon_{w_j}(x, y)] =: \varepsilon_{w_j}.$$

Now one may study the following questions:

(i) Given only the matrix $A$, how can we estimate the ground-truth labels $y_1, \ldots, y_m$?

(ii) Given only the matrix $A$, how can we estimate the workers' unconditional error probabilities $\varepsilon_{w_1}, \ldots, \varepsilon_{w_n}$?

(iii) If we can choose $g_{ij}$ (either in advance of collecting workers' responses or adaptively while doing so), how should we choose it such that we can achieve (i) or (ii) with a minimum number of collected responses?

If $\varepsilon_{w_j}(x, y)$ as defined in (1) is constant on $X \times \{-1, +1\}$, that is $\varepsilon_{w_j}(x, y) \equiv \varepsilon_{w_j}$, for all $j \in [n]$, our model boils down to what is usually referred to as the one-coin model (e.g., Szepesvari, 2015), for which (i) to (iii) have been studied extensively (see Section 5 and Appendix A for references and a detailed discussion). With this paper we initiate the study of a significant extension of the one-coin model. We allow almost half of the workers to deviate from the one-coin model, and for such a worker $w_j$, the conditional error probability $\varepsilon_{w_j}(x, y)$ can be a completely arbitrary random variable. In other words, we allow for arbitrary adversaries, whose error probabilities can not only be high but can also be arbitrarily correlated. We mainly study (ii) in this scenario. We then make use of existing results for the one-coin model to answer (i) satisfactorily for our purposes. We do not deal with (iii), but instead assume that $g_{ij}$ has been specified in advance.
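To make this setup concrete, here is a minimal simulation sketch; the sizes, the observation probability, and the particular adversary strategy are illustrative choices only (values of `eps` for the adversary columns are drawn but ignored).

```python
import numpy as np

rng = np.random.default_rng(0)

m, n = 2000, 10                      # number of tasks and workers (illustrative)
y = rng.choice([-1, 1], size=m)      # ground-truth labels y_i
g = rng.random((m, n)) < 0.8         # g_ij = 1 iff worker j is presented with task i

# Workers 0..6 follow the one-coin model: task-independent error probability < 1/2.
eps = rng.uniform(0.05, 0.35, size=n)        # eps[7:] is overwritten below and unused
flip = rng.random((m, n)) < eps              # independent errors, one coin per worker

A = np.where(flip, -y[:, None], y[:, None]) * g   # response matrix A; 0 = no response

# Workers 7..9 are colluding adversaries: on every task they see they answer -1,
# so their conditional error probabilities are task-dependent and perfectly correlated.
for j in range(7, n):
    A[:, j] = np.where(g[:, j], -1, 0)
```

With $n = 10$, the requirement of at least $n/2 + 2 = 7$ workers following the one-coin model with error probabilities below one half (introduced below) is met by this sketch.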
## 3. General outline of our approach

In this section we present the general outline of our approach. A key insight is that the unconditional probability of workers $w_j$ and $w_k$ agreeing is given by
$$\Pr\nolimits_{(x,y)\sim D,\, w_j, w_k}[w_j(x) = w_k(x)] = 1 - \varepsilon_{w_j} - \varepsilon_{w_k} + 2\varepsilon_{w_j}\varepsilon_{w_k} + 2\,\mathrm{Cov}_{(x,y)\sim D}[\varepsilon_{w_j}(x, y), \varepsilon_{w_k}(x, y)]. \tag{2}$$
Here $\mathrm{Cov}_{(x,y)\sim D}[\varepsilon_{w_j}(x, y), \varepsilon_{w_k}(x, y)]$ denotes the covariance between the random variables $\varepsilon_{w_j}(x, y)$ and $\varepsilon_{w_k}(x, y)$, that is,
$$\mathrm{Cov}_{(x,y)\sim D}[\varepsilon_{w_j}(x, y), \varepsilon_{w_k}(x, y)] = \mathbb{E}_{(x,y)\sim D}\big[(\varepsilon_{w_j}(x, y) - \varepsilon_{w_j})\,(\varepsilon_{w_k}(x, y) - \varepsilon_{w_k})\big].$$
A proof of (2) can be found in Appendix B. The probability on the left-hand side of (2) can easily be estimated from $A$ by the ratio of the number of tasks that $w_j$ and $w_k$ agreed on to the number of tasks they were both presented with:
$$\Pr[w_j(x) = w_k(x)] \approx \frac{\sum_{i=1}^m g_{ij}\, g_{ik}\, \mathbb{1}\{A_{ij} = A_{ik}\}}{\sum_{i=1}^m g_{ij}\, g_{ik}} =: p_{jk}. \tag{3}$$
This suggests solving the system of equations
$$1 - \varepsilon_j - \varepsilon_k + 2\varepsilon_j\varepsilon_k + 2c_{jk} = p_{jk}, \qquad 1 \le j < k \le n, \tag{4}$$
in the unknowns $\varepsilon_l$, $l \in [n]$, and $c_{jk}$, $1 \le j < k \le n$, in order to obtain estimates of the workers' unconditional error probabilities $\varepsilon_{w_1}, \ldots, \varepsilon_{w_n}$. However, there is a catch: in general, the system (4) is not identifiable and has several solutions.

We will assume that at least $n/2 + 2$ of the workers follow the one-coin model and have error probabilities smaller than one half. A worker $w_j$ following the one-coin model implies
$$\mathrm{Cov}_{(x,y)\sim D}[\varepsilon_{w_j}(x, y), \varepsilon_{w_k}(x, y)] = 0, \qquad k \neq j, \tag{5}$$
and hence under this assumption we can restrict the search for solutions of (4) to $\varepsilon_l$, $l \in [n]$, and $c_{jk}$, $1 \le j < k \le n$, with the property that¹
$$\exists\, L \subseteq [n] \text{ with } |L| \ge n/2 + 2 \text{ such that } \forall j \in L:\ \big(\varepsilon_j < 1/2\ \wedge\ [\,\forall k \neq j: c_{jk} = 0\,]\big). \tag{6}$$

¹ Throughout the paper, we set $c_{jk} = c_{kj}$ if $j > k$. We also assume $p_{jk} = p_{kj}$.

Note that we never assume to know which workers follow the one-coin model; this corresponds to using the existential quantifier for the set $L$ in (6) rather than considering a fixed $L$. We can show that the system (4) has at most one solution with property (6). We also provide evidence that our assumption of $n/2 + 2$ of the workers following the one-coin model and having error probabilities smaller than one half is a necessary condition for guaranteeing the identifiability of system (4).

If the workers satisfy our assumption and the $p_{jk}$ on the right-hand side of (4) are actually the true agreement probabilities, then $\varepsilon_l = \varepsilon_{w_l}$ and $c_{jk} = \mathrm{Cov}[\varepsilon_{w_j}(x, y), \varepsilon_{w_k}(x, y)]$ is the unique solution of (4) that satisfies (6). But if the $p_{jk}$ are not exactly the true agreement probabilities, there might be no solution of (4) with property (6) at all. We prove that if the estimates $p_{jk}$ are not too bad, we can solve (4) together with (6) approximately, and our approximate solution is guaranteed to be close to the true error probabilities $\varepsilon_{w_1}, \ldots, \varepsilon_{w_n}$ and covariances $\mathrm{Cov}[\varepsilon_{w_j}(x, y), \varepsilon_{w_k}(x, y)]$, $j < k$. This answers (ii) from Section 2 and is the main contribution of our paper:

**Main result.** Assume that at least $n/2 + 2$ of the workers follow the one-coin model and have error probabilities not greater than $\gamma_{TR} < \frac{1}{2}$. If $|\Pr[w_j(x) = w_k(x)] - p_{jk}| \le \beta$ for all $j \neq k$ and $\beta$ is sufficiently small, we can compute estimates $\hat\varepsilon_{w_1}, \ldots, \hat\varepsilon_{w_n}$ of $\varepsilon_{w_1}, \ldots, \varepsilon_{w_n}$ such that $|\varepsilon_{w_i} - \hat\varepsilon_{w_i}| \le C(\gamma_{TR}) \cdot \beta^{1/4}$.

We answer (i) from Section 2 and provide two ways to predict the ground-truth labels $y_1, \ldots, y_m$ by taking weighted majority votes over the responses provided by the workers. In these majority votes, the weights depend on our estimates of the true error probabilities $\varepsilon_{w_1}, \ldots, \varepsilon_{w_n}$.
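Continuing the simulation sketch above, the following lines compute the empirical agreement rates $p_{jk}$ from (3) and numerically check identity (2) for a pair of one-coin workers, whose covariance term vanishes; the helper function is an illustrative choice, not part of our algorithm.

```python
import numpy as np

def agreement_estimates(A, g):
    """p_jk as in (3): among the tasks both workers were presented with,
    the fraction on which their responses coincide."""
    n = A.shape[1]
    p = np.full((n, n), np.nan)            # the diagonal p_jj is never used
    for j in range(n):
        for k in range(j + 1, n):
            both = g[:, j] & g[:, k]       # tasks seen by both workers
            p[j, k] = p[k, j] = np.mean(A[both, j] == A[both, k])
    return p

p = agreement_estimates(A, g)

# Identity (2) for two one-coin workers j, k: the covariance term is zero, so the
# agreement probability equals 1 - eps_j - eps_k + 2 * eps_j * eps_k.
j, k = 0, 1
print(p[j, k], 1 - eps[j] - eps[k] + 2 * eps[j] * eps[k])   # should be close
```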
## 4. Details and analysis

### 4.1. Estimating agreement probabilities

If $g_{ij}$ has been specified in advance, we have the following guarantee on the quality of the estimates $p_{jk}$ (see (3)):

**Lemma 1.** Assume $\sum_{i=1}^m g_{ij}\, g_{ik} > 0$ for all $j \neq k$. Let $\delta > 0$ and
$$\beta_{jk} = \min\left\{1,\ \left[\frac{\ln(2n^2/\delta)}{2\sum_{i=1}^m g_{ij}\, g_{ik}}\right]^{1/2}\right\}.$$
Then, with probability at least $1 - \delta$ over the sample $((x_i, y_i))_{i=1}^m$ and the randomness in the workers' strategies,
$$|\Pr[w_j(x) = w_k(x)] - p_{jk}| \le \beta_{jk}, \qquad 1 \le j < k \le n.$$

*Proof.* A straightforward application of Hoeffding's inequality and the union bound yields the result.
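As a small illustration of Lemma 1, the following sketch computes the confidence radii $\beta_{jk}$ from the assignment variables $g_{ij}$; the function is an illustrative helper, not part of Algorithm 1.

```python
import numpy as np

def hoeffding_radii(g, delta):
    """beta_jk = min(1, sqrt(ln(2 n^2 / delta) / (2 * #tasks common to j and k))).
    Requires every pair of workers to share at least one task (Lemma 1)."""
    n = g.shape[1]
    common = g.astype(float).T @ g.astype(float)   # entry (j, k): number of common tasks
    return np.minimum(1.0, np.sqrt(np.log(2 * n**2 / delta) / (2.0 * common)))

beta = hoeffding_radii(g, delta=0.05)
# With probability >= 1 - delta, |Pr[w_j(x) = w_k(x)] - p_jk| <= beta[j, k] for all j < k.
```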
### 4.2. Identifiability and approximate solution

If all workers follow the one-coin model, that is $\varepsilon_{w_j}(x, y) \equiv \varepsilon_{w_j}$ for all $j \in [n]$, we have $\mathrm{Cov}_{(x,y)\sim D}[\varepsilon_{w_j}(x, y), \varepsilon_{w_k}(x, y)] = 0$ for all $1 \le j < k \le n$, and system (4) reduces to
$$1 - \varepsilon_j - \varepsilon_k + 2\varepsilon_j\varepsilon_k = p_{jk}, \qquad 1 \le j < k \le n, \tag{7}$$
in the unknowns $\varepsilon_l$, $l \in [n]$. It is well known that, in general, even (7) is not identifiable. For example, if $p_{jk} = 1$ for all $1 \le j < k \le n$, there are the two solutions $\varepsilon_l = 0$, $l \in [n]$, and $\varepsilon_l = 1$, $l \in [n]$, corresponding to either all perfect or all completely erroneous workers. On the other hand, the system (7) is identifiable if we assume that on average the workers are better than random guessing, that is $\frac{1}{n}\sum_{j=1}^n \varepsilon_{w_j} < \frac{1}{2}$, and that there are at least three informative workers with $\varepsilon_{w_j} \neq \frac{1}{2}$ (Bonald & Combes, 2017). Clearly, these two conditions do not guarantee identifiability of the general system (4). The next lemma shows that even if we additionally assume half of the workers to follow the one-coin model, the system (4) is not identifiable. Here we only state an informal version of the lemma; a detailed version and its proof can be found in Appendix B.

**Lemma 2.** There exists an instance of the system (4), where $n$ is even, that has two different solutions. In both solutions, it holds that $\varepsilon_l < \frac{1}{2}$, $l \in [n]$. Furthermore:

(a) In the first solution, $c_{jk} = 0$ for all $j \in [\frac{n}{2}]$ and $k \neq j$, and $\varepsilon_l$ is small if $l \in [\frac{n}{2}]$ and big if $l \in [n] \setminus [\frac{n}{2}]$.

(b) In the second solution, $c_{jk} = 0$ for all $j \in [n] \setminus [\frac{n}{2}]$ and $k \neq j$, and $\varepsilon_l$ is small if $l \in [n] \setminus [\frac{n}{2}]$ and big if $l \in [\frac{n}{2}]$.

We want to mention that a solution of (4) does not necessarily correspond to actual workers, that is, given $\varepsilon_l$, $l \in [n]$, and $c_{jk}$, $1 \le j < k \le n$, there might be no collection of workers $w_1, \ldots, w_n$ such that $\varepsilon_{w_l} = \varepsilon_l$ and $\mathrm{Cov}[\varepsilon_{w_j}(x, y), \varepsilon_{w_k}(x, y)] = c_{jk}$. By the Bhatia-Davis inequality (Bhatia & Davis, 2010) it holds that $\mathrm{Var}[\varepsilon_{w_j}(x, y)] \le \varepsilon_{w_j} - \varepsilon_{w_j}^2$. Hence, a necessary condition for a solution to correspond to actual workers is that $|c_{jk}| \le (\varepsilon_j - \varepsilon_j^2)^{1/2}(\varepsilon_k - \varepsilon_k^2)^{1/2}$ (in addition to $\varepsilon_l \in [0, 1]$). The two solutions in Lemma 2 correspond to actual workers.

From now on we assume that at least $n/2 + 2$ workers follow the one-coin model and have error probabilities smaller than one half:²

**Assumption A.** There exists $L \subseteq [n]$ with $|L| \ge n/2 + 2$ such that for all $j \in L$, the worker $w_j$ follows the one-coin model with error probability $\varepsilon_{w_j} < 1/2$.

² All results of Section 4.2 hold true if we assume, more generally, the existence of $L \subseteq [n]$ with $|L| \ge n/2 + 2$ such that (5) together with $\varepsilon_{w_j} < \frac{1}{2}$ holds for all $j \in L$.

This corresponds to considering (4) together with the constraint (6). The system (4) together with (6) is identifiable:

**Proposition 1.** There exists at most one solution of system (4) that has property (6).

*Proof.* Assuming there are two solutions $(\varepsilon^{S_1}_l)_{l\in[n]}$, $(c^{S_1}_{jk})_{1\le j<k\le n}$ [...] $G(\gamma, \nu) \to 0$ as $\nu \to 0$. The proof of Proposition 2, which provides an explicit expression for $G(\gamma, \nu)$, can be found in Appendix B.

In a next step, we assume that we are given pairwise different $i_1, i_2, i_3 \in [n]$ such that $w_{i_1}, w_{i_2}, w_{i_3}$ follow the one-coin model with $\varepsilon_{w_{i_1}}, \varepsilon_{w_{i_2}}, \varepsilon_{w_{i_3}} < 1/2$. In this case, assuming that the estimates $p_{jk}$ are close to the true agreement probabilities, we can construct a solution of (4) that is guaranteed to be close to the true error probabilities and covariances (and hence approximately satisfies (6)). This is made precise in the next lemma (its proof can be found in Appendix B).

**Lemma 3.** Let $\gamma_{TR} < 1/2$ and consider the system (4) with $p^{TR}_{jk} \in [0, 1]$ as right-hand side. Assume there exists a solution $(\varepsilon^{TR}_l)_{l\in[n]}$, $(c^{TR}_{jk})_{1\le j<k\le n}$ [...] $B + 4C \ge 0$ and $\varepsilon^S_{i_2} \neq \frac{1}{2}$ [...], then $(\varepsilon^S_l)_{l\in[n]}$, $(c^S_{jk})_{1\le j<k\le n}$ [...] $\to 0$ as $\beta \to 0$.

In Lemma 3, for constructing the solution $(\varepsilon^S_l)_{l\in[n]}$, $(c^S_{jk})_{1\le j<k\le n}$ [...]

### 4.3. Estimating ground-truth labels

[...] then the optimal estimator for the ground-truth label $y_i$ is given by the weighted majority vote (16) with $f(\hat\varepsilon_{w_l})$ replaced by $f(\varepsilon_{w_l}) = \ln\big((1 - \varepsilon_{w_l})/\varepsilon_{w_l}\big)$ (Nitzan & Paroush, 1982; Berend & Kontorovich, 2015; Bonald & Combes, 2017). Hence, a common approach for the one-coin model is to first estimate the true error probabilities and then estimate the ground-truth labels by using the majority vote (16) with $f(\hat\varepsilon_{w_l}) = \ln\big((1 - \hat\varepsilon_{w_l})/\hat\varepsilon_{w_l}\big)$ (Bonald & Combes, 2017; Ma et al., 2017). We propose to use the same majority vote, but restricted to answers from workers that we believe to follow the one-coin model. Using the notation from Section 4.2, this means that we set $f(\hat\varepsilon_{w_l}) = \ln\big((1 - \hat\varepsilon_{w_l})/\hat\varepsilon_{w_l}\big)$ for $l \in Q_{p_0}$ with $\max_{k \in [n] \setminus \{l\}} |c^S_{lk}(p_0)| \le \nu_{p_0}$ and $f(\hat\varepsilon_{w_l}) = 0$ otherwise. Alternatively, we suggest setting $f(\hat\varepsilon_{w_l}) = 1 - 2\hat\varepsilon_{w_l}$ for all $l \in [n]$. With this choice of $f$ we make use of the responses provided by all workers; the same choice has been used for the one-coin model (Dalvi et al., 2013). A third option would be to set $f(\hat\varepsilon_{w_l}) = 1 - 2\hat\varepsilon_{w_l}$ for $l \in Q_{p_0}$ with $\max_{k \in [n] \setminus \{l\}} |c^S_{lk}(p_0)| \le \nu_{p_0}$ and $f(\hat\varepsilon_{w_l}) = 0$ otherwise, but we do not consider this choice any further.
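The following is a minimal sketch of these label-prediction rules, with the weighted majority vote (16) assumed to take the standard form $\hat y_i = \operatorname{sign}\big(\sum_{l} f(\hat\varepsilon_{w_l})\, A_{il}\big)$; the function names, the clipping constant, and the boolean mask standing in for $Q_{p_0}$ and $\nu_{p_0}$ are illustrative choices only.

```python
import numpy as np

def weighted_majority(A, weights):
    """Predict y_i as the sign of the weighted sum of responses on task i.
    Unobserved entries are 0 and contribute nothing; ties go to +1."""
    scores = A @ weights
    return np.where(scores >= 0, 1, -1)

def log_odds_weights(eps_hat, keep):
    """First option: f(eps) = ln((1 - eps)/eps) for workers believed to follow the
    one-coin model (keep[l] = True), and weight 0 for all other workers."""
    e = np.clip(eps_hat, 1e-6, 1 - 1e-6)      # avoid division by zero / log of zero
    return np.where(keep, np.log((1 - e) / e), 0.0)

def linear_weights(eps_hat):
    """Second option: f(eps) = 1 - 2*eps, using the responses of all workers."""
    return 1.0 - 2.0 * eps_hat

def predict_labels(A, eps_hat, c_hat, nu):
    """Keep worker l if all of its estimated covariances are small (|c_lk| <= nu).
    eps_hat, c_hat (with zero diagonal) and the tolerance nu are assumed to be given,
    e.g. as output of the estimation step; they are placeholders here."""
    keep = np.max(np.abs(c_hat), axis=1) <= nu
    return weighted_majority(A, log_odds_weights(eps_hat, keep))
```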
### 4.4. Algorithm

In the interests of clarity, we present our approach as the self-contained Algorithm 1. Choosing $P$ as the set of triples such that the involved pairs of workers have been presented with at least ten or three common tasks might seem somewhat arbitrary here. Indeed, one could introduce two parameters to the algorithm instead. Without optimizing for these parameters, we chose them as ten and three in all our experiments on real data, and hence we state Algorithm 1 as is. Our analysis best applies to the setting of a full matrix $A$ (or variables $g_{ij}$ that are independent Bernoulli random variables with a common success probability, as assumed by Bonald & Combes, 2017, for example). In this case, which we consider in our experiments on synthetic data, choosing $P$ as stated in Algorithm 1 reduces to choosing $P$ as the set of all triples of pairwise different indices. If the number of workers $n$ is small, this is the best one can do. If $n$ is large, it is infeasible to choose $P$ as the set of all triples, since the running time of Algorithm 1 is in $O(n^2(m + |P|))$. If $n$ is large and $A$ is full, one should sample $P$ uniformly at random. For $|P| \ge \ln\delta / \ln(7/8)$, our error guarantee (14) then holds with probability at least $1 - \delta$ (compare with Section 4.2).

## 5. Related work

We briefly survey related work here; a complete discussion can be found in Appendix A. As discussed in Sections 1 and 2, in crowdsourcing one might be interested in estimating ground-truth labels and/or worker qualities given the response matrix $A$, but also in optimal task assignment. In their seminal paper, Dawid & Skene (1979) proposed an EM-based algorithm to address the first two goals. Since then, numerous works have followed, addressing all three goals for the Dawid-Skene and one-coin models (Ghosh et al., 2011; Karger et al., 2011a;b; 2013; 2014; Dalvi et al., 2013; Gao & Zhou, 2013; Gao et al., 2016; Zhang et al., 2016; Bonald & Combes, 2017; Ma et al., 2017). There have also been efforts to study generalizations of the Dawid-Skene model (Jaffe et al., 2016; Khetan & Oh, 2016; Shah et al., 2016) as well as to explicitly deal with adversaries (Raykar & Yu, 2012; Jagabathula et al., 2017). However, none of the prior work can handle a number of arbitrary adversaries almost as large as the number of reliable workers, as we do.

## 6. Experiments

On both synthetic and real data, we compared our proposed Algorithm 1 to straightforward majority voting for predicting labels (referred to as Maj) and the following methods from the literature: the spectral algorithms by Ghosh et al. (2011) (GKM), Dalvi et al. (2013) (RoE and EoR) and Karger et al. (2013) (KOS), the two-stage procedure by [...]

**Algorithm 1**
**Input:** crowdsourced labels stored in $A \in \{-1, 0, +1\}^{m \times n}$, upper bound $0 < \gamma_{TR} < \frac{1}{2}$ on the error probabilities of $n/2 + 2$ workers that follow the one-coin model, confidence parameter $0 < \delta < 1$
**Output:** estimates $(\varepsilon^F_l)_{l \in [n]}$, $(c^F_{jk})_{1 \le j < k \le n}$ [...]
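To give a feel for why triples of workers are the natural unit in Algorithm 1 and Lemma 3, here is a rough sketch of the classical closed-form estimate for a triple of one-coin workers, together with uniform sampling of $P$ as suggested above for large $n$. This is only an illustration of the underlying identity $2p_{jk} - 1 = (1 - 2\varepsilon_j)(1 - 2\varepsilon_k)$, which follows from (7), and not the construction used in Lemma 3 or Algorithm 1; it is only meaningful when all three workers are (close to) one-coin with error probabilities below one half.

```python
import numpy as np

def one_coin_from_triple(p, triple):
    """Closed-form estimates for three one-coin workers with eps < 1/2:
    with q_jk = 2*p_jk - 1 = (1 - 2*eps_j)(1 - 2*eps_k), we have
    (1 - 2*eps_a)^2 = q_ab * q_ac / q_bc, and analogously for b and c."""
    a, b, c = triple
    q = lambda j, k: 2.0 * p[j, k] - 1.0
    eps = {}
    for x, y, z in [(a, b, c), (b, a, c), (c, a, b)]:
        r = q(x, y) * q(x, z) / q(y, z)       # estimate of (1 - 2*eps_x)^2
        eps[x] = 0.5 * (1.0 - np.sqrt(np.clip(r, 0.0, 1.0)))
    return eps

def sample_triples(n, num, rng):
    """Uniformly sampled triples of pairwise different worker indices (the large-n
    regime discussed in Section 4.4); assumes num <= n choose 3. Algorithm 1's actual
    rule for P additionally looks at how many tasks the involved pairs of workers share."""
    triples = set()
    while len(triples) < num:
        triples.add(tuple(sorted(rng.choice(n, size=3, replace=False))))
    return [tuple(int(i) for i in t) for t in triples]
```

For instance, `one_coin_from_triple(p, (0, 1, 2))` applied to the simulated data from Section 2 approximately recovers `eps[0]`, `eps[1]`, `eps[2]`, up to the sampling noise in the agreement estimates.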