# Achieving Budget-optimality with Adaptive Schemes in Crowdsourcing

Ashish Khetan and Sewoong Oh
Department of ISE, University of Illinois at Urbana-Champaign
Email: {khetan2,swoh}@illinois.edu
30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.

Adaptive schemes, where tasks are assigned based on the data collected thus far, are widely used in practical crowdsourcing systems to efficiently allocate the budget. However, existing theoretical analyses of crowdsourcing systems suggest that the gain of adaptive task assignment is minimal. To bridge this gap, we investigate this question under a strictly more general probabilistic model, which has recently been introduced to model practical crowdsourcing datasets. Under this generalized Dawid-Skene model, we characterize the fundamental trade-off between budget and accuracy, and we introduce a novel adaptive scheme that matches this fundamental limit. A given budget is allocated over multiple rounds. In each round, a subset of tasks with high enough confidence is classified, and an increased budget is allocated to the remaining ones, which are potentially more difficult. In each round, decisions are made based on the leading eigenvector of a (weighted) non-backtracking operator corresponding to the bipartite assignment graph. We further quantify the gain of adaptivity by comparing this tradeoff with the one for non-adaptive schemes, and confirm that the gain is significant and can be made arbitrarily large depending on the distribution of the difficulty levels of the tasks at hand.

1 Introduction

Crowdsourcing platforms provide labor markets in which micro-tasks are electronically distributed to a pool of workers. In typical crowdsourcing scenarios, such as those on Amazon's Mechanical Turk, a requester posts a collection of tasks, and a batch is picked up by any worker who is willing to complete it. The worker is subsequently rewarded for each task he/she completes. However, some workers are spammers trying to make easy money. Moreover, since the reward is small and the tasks are tedious, errors are common even among those who try. To correct for the errors, a common approach is to introduce redundancy by assigning each task to multiple workers and aggregating their responses using a scheme such as majority voting. A fundamental problem of interest is how to maximize the accuracy of the inferred answers while using as few repetitions as possible. There are two challenges in achieving such an optimal tradeoff between accuracy and budget: (a) we need a scheme for deciding which tasks to assign to which workers; and (b) at the same time, we need to infer the true answers from the collected responses. Since the workers are fleeting, the requester has no control over who gets to work on which tasks, and it is impossible to build a trust relationship with the workers. In particular, it does not make sense to explore reliable workers and exploit them in subsequent steps: each arriving worker is completely new and may never come back. Nevertheless, by comparing responses from multiple workers, we can estimate the true answer to a task and use it in subsequent steps to learn the reliability of the workers. Our beliefs on the true answers as well as the difficulty of the tasks and the reliability of the workers can be iteratively refined, and one can potentially choose to assign more workers to the more difficult tasks. We would like to understand this intricate interplay of task assignment and inference.
Setup. We have m binary classification tasks to be completed by workers. We assume a recent generalization of the Dawid-Skene model, introduced in [22], to model the responses; it captures the heterogeneity in the tasks as well as the workers. Precisely, each arriving worker is parametrized by a quality parameter p_j ∈ [0, 1] (for the j-th arriving worker), drawn i.i.d. from some prior distribution F. Each task is parametrized by a difficulty parameter q_i ∈ [0, 1] (for the i-th task), drawn i.i.d. from some prior distribution G. When worker j is assigned task i, the task is perceived as a positive task with probability q_i, and as a negative task otherwise. Hence, if q_i is close to a half the task is confusing and difficult to classify correctly, and it is easy if q_i is close to one or zero. When task i is assigned to worker j, the response is a noisy perception of the task:

    A_ij = +1 , w.p. q_i p_j + \bar{q}_i \bar{p}_j ,   and   A_ij = −1 , w.p. q_i \bar{p}_j + \bar{q}_i p_j ,    (1)

where \bar{q}_i = 1 − q_i and \bar{p}_j = 1 − p_j. With probability p_j, the worker answers truthfully as he perceives the task, and otherwise gives the opposite answer. Hence, if p_j is close to one he tells the truth (in his opinion), and if it is close to a half he gives random answers. If p_j is zero, he is also reliable, in the sense that a requester who can correctly decode his reliability can extract the truth exactly. We define the ground truth of a task as what the majority of the workers would agree on, had we asked all the workers. Accordingly, we assume E_F[p_j] > 1/2, and the true labels are defined as t_i = I{q_i > 1/2} − I{q_i < 1/2}. Otherwise, we do not impose any condition on the distribution of the p_j's. However, we assume the q_i's are discrete random variables supported on K points. Our results do not directly depend on this support size K, and therefore K can be made arbitrarily large. Note that we focus on binary tasks with two classes, and the workers are assumed to be symmetric, i.e. the error probability does not depend on the perceived label of the task. The original Dawid-Skene model introduced in [3] and analyzed in [9] is a special case in which all tasks are equally easy, i.e. the q_i's are either one or zero. This makes inference easier, as all tasks are perceived as their true class; the only source of error is the workers' noisy responses.

We assume the following task assignment scenario to model practical crowdsourcing systems. It is a discrete-time system, where at the beginning of each time step the requester can create a batch of tasks. This batch is picked up by a newly arriving worker, and his/her responses are collected. To model real-world constraints, we assume there is a limit, denoted by r, on how many tasks a single worker can complete. The requester (also called the task master) has no control over who arrives next, but he has control over which of the m tasks are to be solved by the next arriving worker. This allows for adaptive task assignment schemes, where the requester can choose to include those tasks that he is most uncertain about, based on all the history of responses collected thus far. We consider all randomized task assignment schemes whose expected number of assignments per task is ℓ, and all inference algorithms.
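To make the data-collection model concrete, the sketch below simulates one assignment and the response model of Eq. (1). This is a minimal illustration and not the paper's code: the function names and the particular priors (a spammer-hammer worker prior and three difficulty levels, mirroring the synthetic experiments later in the paper) are our own choices, and the (ℓ, r)-regular assignment via the configuration model follows the non-adaptive construction discussed below.

```python
import numpy as np

rng = np.random.default_rng(0)

def regular_bipartite_assignment(m, ell, r):
    """(ell, r)-regular random bipartite assignment via the configuration model:
    each of the m tasks gets ell half-edges, each of the n = m*ell/r workers gets r,
    and the two sides are matched uniformly at random."""
    assert (m * ell) % r == 0, "m*ell must be divisible by r"
    n = m * ell // r
    task_stubs = np.repeat(np.arange(m), ell)
    worker_stubs = np.repeat(np.arange(n), r)
    rng.shuffle(worker_stubs)
    return list(zip(task_stubs.tolist(), worker_stubs.tolist()))

def simulate_responses(q, p, edges):
    """Responses A_ij in {+1, -1} under the generalized Dawid-Skene model of Eq. (1):
    task i is perceived as positive w.p. q_i, and worker j reports the perceived
    label w.p. p_j (and the opposite label otherwise)."""
    A = {}
    for i, j in edges:
        perceived = 1 if rng.random() < q[i] else -1
        A[(i, j)] = perceived if rng.random() < p[j] else -perceived
    return A

# Example priors (our choice): three difficulty levels for G, spammer-hammer for F.
m, ell, r = 6, 2, 3
q = rng.choice([1.0, 0.9, 0.6], size=m)                  # q_i ~ G
p = np.where(rng.random(m * ell // r) < 0.3, 1.0, 0.5)   # p_j ~ F (hammer w.p. 0.3)
edges = regular_bipartite_assignment(m, ell, r)
print(simulate_responses(q, p, edges))
```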
We study the minimax rate when nature chooses the worst-case priors F and G (from a family of priors parametrized by the average worker reliability β and the average task difficulty λ defined in (2)), and we choose the best possible adaptive task assignment together with the best possible inference algorithm. We further propose a novel adaptive approach that achieves this minimax rate up to a constant factor. Our approach is different from existing adaptive schemes such as [5], where there are multiple types of tasks and the main source of uncertainty is which type the next arriving worker is an expert on; there, golden tasks with known answers are used to explore expertise, and tasks are assigned accordingly.

Related work. Existing work on crowdsourcing systems studies the standard Dawid-Skene (DS) model [3], where all tasks are equally difficult and hence q_i ∈ {0, 1} for all tasks. Several inference algorithms have been proposed [3, 17, 6, 16, 4, 7, 11, 23, 10, 21, 2, 8, 14], and the question of task assignment is addressed in [9], where the minimax rate on the probability of error is characterized and a matching task assignment scheme and inference algorithm are proposed. Perhaps surprisingly, for the standard DS model, a non-adaptive task assignment scheme achieves the fundamental limit. Namely, given m tasks and a total budget of mℓ responses, the requester first constructs a bipartite task-assignment graph with m task nodes, n = mℓ/r worker nodes, and edges drawn uniformly at random with degree ℓ for the task nodes and degree r for the worker nodes. The j-th arriving worker is then assigned the batch of r tasks adjacent to the j-th worker node. Together with an inference algorithm explained in detail in Section 2, this achieves near-optimal performance: to achieve an average probability of error ε, it is sufficient to have a total budget of O((m/β) log(1/ε)), where β = E_F[(2p_j − 1)^2] is the quality of the workers defined in (2). Perhaps surprisingly, no adaptive assignment can improve upon this: even the best adaptive scheme with the best inference algorithm still requires Ω((m/β) log(1/ε)) total budget. Hence, there is no gain in adaptivity. This negative result relies crucially on the fact that, under the standard DS model, all tasks are inherently equally difficult, so adaptively assigning more workers to relatively more ambiguous tasks gives only a marginal gain. However, simple adaptive schemes are widely used in practice, where significant gains are achieved, because real-world tasks are widely heterogeneous. To capture such varying difficulties in the tasks, generalizations of the DS model were proposed in [19, 18, 22, 15], and significant improvements have been reported on inference problems for real datasets. The generalized DS model serves as the missing piece in bridging the gap between the practical gains of adaptivity and its theoretical limitations. We investigate the fundamental question of "do adaptive task assignments improve accuracy?" under this generalized Dawid-Skene model of Eq. (1).

Contributions. To investigate the gain of adaptivity, we first characterize the fundamental lower bound on the budget required to achieve a target accuracy. To match this fundamental limit, we introduce a novel adaptive task assignment scheme. Our approach consists of multiple rounds of non-adaptive schemes, and we provide sharp analyses of the performance at each round, which guides the design of the task assignment in each round adaptively using the data from previous rounds.
The proposed adaptive task assignment is simple to apply in practice, and numerical simulations confirm its superiority over state-of-the-art non-adaptive schemes. Under a certain assumption on the choice of parameters in the algorithm, which requires moderate access to an oracle, we can prove that the performance of the proposed adaptive scheme matches the fundamental limit up to a constant factor. Finally, we quantify the gain of adaptivity by proving a strictly larger lower bound on the budget required for any non-adaptive scheme. Precisely, we show that the minimax budget required to achieve a target average error rate of ε scales as Θ((m/(λβ)) log(1/ε)). The dependence on the priors F and G is captured solely in β (the quality of the crowd as a whole) and λ (the quality of the tasks as a whole). We show that the fundamental tradeoff for non-adaptive schemes is Θ((m/(λ_min β)) log(1/ε)), requiring a factor of λ/λ_min larger budget for non-adaptive schemes. This factor of λ/λ_min is precisely how much we gain by adaptivity, and this gain can be made arbitrarily large under a worst-case distribution G.

2 Main Results

The following quantities are fundamental in capturing the dependence of the minimax rate on the distribution of task difficulties and worker reliabilities:

    λ ≡ ( E_G[ 1/(2q_i − 1)^2 ] )^{−1} ,   α ≡ E_G[(2q_i − 1)^2] ,   and   β ≡ E_F[(2p_j − 1)^2] .    (2)

Let n denote the total number of workers used, T_j denote the set of all tasks assigned to worker j ∈ [n], and W_i denote the set of all workers assigned to task i ∈ [m], by the time the adaptive task assignment scheme has terminated. We consider discrete distributions G with K types of tasks of varying difficulty levels. Define the effective difficulty level of task i as λ_i ≡ (2q_i − 1)^2, and let λ_min = min_{i ∈ [m]} λ_i. A task with a small λ_i is more difficult, since q_i close to 1/2 means the task is more ambiguous. Let δ_a denote the fraction of tasks having difficulty level λ_a for a ∈ [K], so that Σ_{a ∈ [K]} δ_a = 1, and let δ_max ≡ max_{a ∈ [K]} δ_a and δ_min ≡ min_{a ∈ [K]} δ_a.

2.1 Fundamental limit under the adaptive scenario

We prove a lower bound on the minimax error rate: the error achieved by the best inference algorithm t̂ using the best adaptive task assignment scheme τ, under the worst-case worker distribution F and the worst-case true answers t for the given distribution of difficulty levels λ_i. Note that given λ_i, either q_i = (1 + √λ_i)/2, in which case t_i = +1, or q_i = (1 − √λ_i)/2, in which case t_i = −1. Let T_ℓ be the set of all task assignment schemes that use at most mℓ queries in total, and let F_β be the set of all worker distributions with average worker quality β, i.e. F_β ≡ {F : E_F[(2p_j − 1)^2] = β}. Then we can show the following lower bound on the minimax probability of error. A proof of this theorem is provided in Section 4 of the supplementary material.

Theorem 2.1. When β < 1, there exists a positive constant C such that for each task i ∈ [m],

    min_{τ ∈ T_ℓ, t̂}  max_{t ∈ {±1}^m, F ∈ F_β}  P[ t_i ≠ t̂_i | λ_i ]  ≥  (1/2) e^{ − C λ_i β E[ |W_i| | λ_i ] } .

This proves a lower bound on the per-task probability of error that decays exponentially, with exponent scaling as λ_i β E[|W_i| | λ_i]. The easier the task (λ_i large), the more reliable the workers (β large), and the more workers assigned to that task (|W_i| large), the smaller the achievable error. To get a lower bound on the average probability of error, suppose we know the difficulties of the tasks and assign ℓ_a workers to tasks of difficulty λ_a.
With the average budget constraint Σ_{a ∈ [K]} δ_a ℓ_a ≤ ℓ,

    min_{τ ∈ T_ℓ, t̂}  max_{t ∈ {±1}^m, F ∈ F_β}  (1/m) Σ_{i=1}^m P[ t_i ≠ t̂_i ]
        ≥  min_{ℓ_a : Σ_{a ∈ [K]} δ_a ℓ_a = ℓ}  Σ_{a ∈ [K]} (1/2) δ_a e^{ − C ℓ_a λ_a β }    (3)
        =  (1/2) e^{ − C ℓ λ β }  Σ_{a=1}^K δ_a e^{ − λ Σ_{a' ≠ a} (δ_{a'}/λ_{a'}) log(λ_a/λ_{a'}) } ,

where the equality follows from solving the optimization problem. Note that the sum in the last expression does not depend on the budget ℓ, and it is lower bounded by δ_min > 0. The error therefore scales as e^{−C ℓ λ β}, where λ = 1/E[1/λ_i], as defined in (2), captures how difficult the set of tasks is collectively. This gives a lower bound on the budget Γ required to achieve error ε: there exists a constant C' such that if

    Γ  ≤  C' (m/(λβ)) log(δ_min/ε) ,    (4)

then no task assignment scheme (adaptive or not), with any inference algorithm, can achieve error less than ε. Intuitively, β captures the (collective) quality of the workers as specified by F, and λ captures the (collective) difficulty of the tasks as specified by G. This recovers the known fundamental limit for the standard DS model, where all tasks have λ_i = 1 and hence λ = 1, established in [9]: Γ_ε ≥ C (m/β) log(1/ε).

2.2 Upper bound on the achievable error rate

We present an adaptive task assignment scheme and an iterative inference algorithm that asymptotically achieve an error rate of C_1 e^{−(C_δ/4) ℓ λ β} when m grows large and ℓ = Θ(log m), where C_1 = log_2(2δ_max/δ_min) log_2(λ_1/λ_K), while the expected number of queries (task-worker assignments) is bounded by mℓ. Comparing this to the fundamental lower bound in Theorem 2.1 establishes the near-optimality of our approach: a sufficient condition to achieve average error ε is an average total budget larger than

    Γ_ε  ≥  (4/C_δ) (m/(λβ)) log(C_1/ε) .    (5)

2.2.1 Adaptive algorithm

Since the difficulty level varies across tasks, it is intuitive to assign fewer workers to easy tasks and more workers to hard tasks. If we knew the difficulty levels, then optimizing the lower bound (3) over the ℓ_i's suggests assigning ℓ_i ∝ ℓ (λ/λ_i) workers to a task i of difficulty λ_i, given a fixed budget of ℓ workers per task on average. However, the difficulty levels are not known, and non-adaptive schemes can be arbitrarily worse (see Theorem 2.4). We propose a novel approach of adaptively assigning workers in multiple rounds, refining our belief on the λ_i's, and making decisions on the tasks for which we have the highest confidence; a worked example of the oracle allocation is given below.
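As a concrete illustration of this oracle allocation, the snippet below computes ℓ_a = ℓ λ/λ_a for the three-level difficulty prior used later in the left panel of Figure 1; the variable names and the budget ℓ = 21 are our own choices for the example.

```python
import numpy as np

# Discrete difficulty prior G (same as the left panel of Figure 1): lambda_a and delta_a.
lam = np.array([1.0, 1/4, 1/16])      # lambda_a = (2 q_a - 1)^2
delta = np.array([1/3, 1/3, 1/3])     # fractions delta_a, summing to 1

lam_bar = 1.0 / np.sum(delta / lam)   # collective difficulty lambda = (E[1/lambda_i])^{-1} = 1/7
ell = 21                              # average budget per task (our choice)

# Oracle allocation suggested by optimizing (3): ell_a = ell * lambda / lambda_a.
ell_a = ell * lam_bar / lam
print(lam_bar)              # 0.142857... = 1/7
print(ell_a)                # [ 3. 12. 48.]  -- harder tasks receive more workers
print(delta @ ell_a)        # 21.0: the average budget constraint sum_a delta_a ell_a = ell holds
```

Note that Σ_a δ_a (ℓ λ/λ_a) = ℓ λ Σ_a δ_a/λ_a = ℓ for any prior, so this allocation always meets the average budget constraint exactly.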
The main algorithmic component is the sub-routine in lines 8-13 of Algorithm 1. For a choice of the (per-task) budget ℓ_t, we collect responses according to an (ℓ_t, r_t = ℓ_t)-regular random graph on |M| tasks and |M| workers. The leading eigenvector of the non-backtracking operator on this bipartite graph, weighted by the ±1 responses, reveals a noisy observation of the true class and the difficulty level of each task. Let x ∈ R^{|M|} denote the top left eigenvector, computed as per Algorithm 2. Then the i-th entry x_i asymptotically converges, in the limit of a large number of tasks m, to a Gaussian random variable with mean proportional to the difficulty level (2q_i − 1), with mean and variance specified in Lemma 5.1 in the supplementary material. This non-backtracking operator approach to crowdsourcing was first introduced in [7] for the standard DS model; it is a single-round non-adaptive scheme and uses a threshold of zero to classify tasks based on the signs of the x_i's. We generalize their analysis to the generalized DS model in Theorem 2.3 for the finite-sample regime, and further give a sharper characterization based on the central limit theorem in the asymptotic regime (Lemma 5.1 in the supplementary material).

This provides a sub-routine that reveals the (2q_i − 1)'s we want, corrupted by additive Gaussian noise. This resembles the setting of racing algorithms introduced in [12], where the goal is to choose the variable (i.e. task) with the largest mean (i.e. the easiest) with minimal budget. Our goal, however, is to identify the signs of the means of the variables (i.e. the classes) with sufficient accuracy. The key idea is to classify the easier tasks first with minimal budget, and then classify the remaining, more difficult tasks with more budget allocated per task. We can set a threshold X_{t,u} at each round and make a permanent decision on the subset of tasks whose x_i's are large in absolute value, since those are the tasks whose class, i.e. sign(2q_i − 1), we are most confident about. We are then left to choose the budget ℓ_t and the threshold X_{t,u} for each round. We prescribe a choice using the following notation. Assume the λ_a's are indexed such that λ_1 > λ_2 > ... > λ_K. For simplicity, assume λ_K = λ_1 2^{−(T−1)} for some T ∈ Z_+ \ {1}. Given the distribution {λ_a, δ_a}_{a ∈ [K]}, we first bin it to get another distribution {λ̃_a, δ̃_a}_{a ∈ [T]} supported on at most T points. We take λ̃_1 = λ_1 and λ̃_{a+1} = λ̃_a 2^{−1} for each a ∈ [T − 1]; δ̃_a is the total fraction of tasks whose difficulty λ_i is smaller than λ_1 2^{−(a−2)} and at least λ_1 2^{−(a−1)}. Precisely,

    δ̃_a  =  Σ_{a' ∈ [K]} δ_{a'} I{ λ_1/2^{a−1} ≤ λ_{a'} < λ_1/2^{a−2} } ,   for a ∈ [T] .

The choice of 2 for the ratio between consecutive λ̃_a's is arbitrary and can be further optimized for a given distribution of λ_i's. For ease of notation in writing the algorithm, we re-index the binned distribution to get {λ̃_a, δ̃_a}_{a ∈ [T̃]}, for T̃ ≤ T, such that δ̃_a ≠ 0 for all a ∈ [T̃]. Note that T̃ ≤ T = log_2(λ_1/λ_K) + 1. We start with the set of all tasks, M = [m]. A fraction of the tasks is classified in each round, and the unclassified ones are carried over to the next round. At round t ∈ {1, ..., T̃}, our goal is to classify a sufficient fraction of the tasks in the t-th difficulty group {i ∈ M : λ_i = λ̃_t} with the desired level of accuracy. If ℓ_t is too small and/or the threshold X_{t,u} is too small, the misclassification rate will be too large. If ℓ_t is too large, we waste our budget unnecessarily. If X_{t,u} is too large, not enough tasks get classified. We choose ℓ_t = ℓ C_δ λ/λ̃_t and an appropriate X_{t,u} to ensure that the misclassification probability is at most C_1 e^{−(C_δ/4) ℓ λ β}, based on the central limit theorem for the leading eigenvector (see (21) in the supplementary material). We run this sub-routine s_t = max{0, ⌈log_2( δ̃_t(1 + γ_t) / (δ̃_{t+1} γ_{t+1}) )⌉} times to ensure that a large enough fraction of the t-th group is classified, making the expected number of unclassified tasks at most the number of tasks in the next group, i.e. those with difficulty λ_i = λ̃_{t+1}. We provide a near-optimal performance guarantee for γ_t = 1 for all t ∈ [T̃]; the γ_t's provide an extra degree of freedom for practitioners to further optimize efficiency. Note that, statistically, the fraction of the t-th group (i.e. tasks with difficulty λ̃_t) that gets classified before the t-th round is very small, since the thresholds set in those earlier rounds exceed the absolute mean of their messages; most tasks with difficulty λ̃_t get classified in round t. Further, binning the original distribution into {λ̃_a, δ̃_a} ensures that ℓ_{t+1} ≥ 2ℓ_t. This guarantees that the total extraneous budget spent on the λ̃_t tasks is no more than a constant times their allocated budget, and this constant can be made one by changing the initial choice of ℓ_1 by a constant factor.
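To make the binning step concrete, here is a small sketch (our code, with hypothetical helper names) that bins a difficulty prior onto the geometric grid λ_1, λ_1/2, λ_1/4, ..., drops empty bins, and computes the per-round budgets ℓ_t = ℓ C_δ λ/λ̃_t. The value of C_δ below is a placeholder; Theorem 2.2 specifies the choice used in the analysis.

```python
import numpy as np

def bin_difficulties(lam, delta):
    """Bin {lambda_a, delta_a} onto the geometric grid lambda_1 / 2^(a-1), a = 1, 2, ...,
    assigning mass delta_a to the bin [lambda_1/2^(a-1), lambda_1/2^(a-2)), and drop
    empty bins (the re-indexing step in the text)."""
    lam = np.asarray(lam, dtype=float)       # assumed sorted: lam[0] > ... > lam[-1]
    delta = np.asarray(delta, dtype=float)
    T = int(np.ceil(np.log2(lam[0] / lam[-1]))) + 1
    grid = lam[0] / 2.0 ** np.arange(T)                        # tilde_lambda_a
    idx = np.ceil(np.log2(lam[0] / lam) - 1e-12).astype(int)   # 0-indexed bin of each lambda_a
    mass = np.bincount(idx, weights=delta, minlength=T)        # tilde_delta_a
    keep = mass > 0
    return grid[keep], mass[keep]

lam_t, delta_t = bin_difficulties([1.0, 1/4, 1/16], [1/3, 1/3, 1/3])
lam_bar = 1.0 / np.sum(delta_t / lam_t)     # collective difficulty of the binned prior
ell, C_delta = 21, 0.2                      # C_delta is a placeholder here; see Theorem 2.2
ell_t = ell * C_delta * lam_bar / lam_t     # per-round budgets (rounded to integers in practice)
print(lam_t, delta_t)                       # [1.  0.25  0.0625], each with mass 1/3
print(ell_t)                                # [0.6  2.4  9.6]: at least doubling every round
```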
2.2.2 Performance Guarantee

Since we do not waste budget on any of the tasks, with the right choice of the constant C_δ we are guaranteed that this algorithm uses at most mℓ assignments in expectation. One caveat is that the threshold X_{t,u} depends on α_{t,u} = (1/|M|) Σ_{i ∈ M} λ_i, the average difficulty of the remaining tasks. As the set of remaining tasks changes over the course of the algorithm, we need to estimate this value in each sub-routine. We provide an estimator of α_{t,u} in Algorithm 3 (in the supplementary material) that only uses the sampled responses that have already been collected; all numerical results are based on this estimator. However, analyzing the sensitivity of the performance to the estimation error in α_{t,u} is quite challenging, so for the theoretical analysis we assume access to an oracle that provides the exact value of α_{t,u}, replacing Algorithm 3.

Theorem 2.2. Suppose Algorithm 3 returns the exact value of α_{t,u} = (1/|M|) Σ_{i ∈ M} λ_i. With the choice of γ_a = 1 for all a ∈ [T̃] and C_δ = (4 + ⌈log(2δ_max/δ_min)⌉)^{−1}, for any given distribution of task difficulties {λ_a, δ_a}_{a ∈ [K]} over m tasks and an average number of workers per task ℓ = Θ(log m), the expected number of queries made by Algorithm 1 is asymptotically bounded as

    lim_{m → ∞}  Σ_{t ∈ [T̃], u ≤ s_t}  ℓ_t E[|M_{t,u}|] / (mℓ)  ≤  1 ,

where M_{t,u} is the set of tasks remaining unclassified at round (t, u). Further, Algorithm 1 returns estimates {t̂_i}_{i ∈ [m]} that asymptotically achieve

    lim_{m → ∞}  (1/m) Σ_{i=1}^m P[ t_i ≠ t̂_i ]  ≤  C_1 e^{ −(C_δ/4) ℓ λ β } ,    (6)

where C_1 = log_2(2δ_max/δ_min) log_2(λ_1/λ_K), for λβ scaling as 1/ℓ such that ℓλβ = Θ(1).

A proof of this theorem is provided in Section 5 of the supplementary material. This establishes the near-optimal sufficient condition of our approach in (5). The constant C_δ can be improved by optimizing over the choice of the γ_a's, minimizing the expected number of queries that the algorithm makes.

Algorithm 1: Adaptive Task Assignment and Inference
Require: m, {λ̃_a, δ̃_a}_{a ∈ [T̃]}, ℓ, C_δ, {γ_a}_{a ∈ [T̃]}, α, β, μ = E[2p_j − 1]
Ensure: Estimates {t̂_i}_{i ∈ [m]}
 1: M ← {1, 2, ..., m},  λ ← (Σ_{a ∈ [T̃]} δ̃_a/λ̃_a)^{−1}
 2: for t = 1, 2, ..., T̃ do
 3:   ℓ_t ← (ℓ C_δ λ)/λ̃_t,  r_t ← ℓ_t
 4:   s_t ← max{0, ⌈log_2( δ̃_t(1 + γ_t) / (δ̃_{t+1} γ_{t+1}) )⌉} I{t < T̃} + I{t = T̃}
 5:   for u = 1, 2, ..., s_t do
 6:     if M ≠ ∅ then
 7:       n ← |M|,  k ← √(log |M|)
 8:       Draw E ∈ {0, 1}^{|M| × n} as an (ℓ_t, r_t)-regular random graph
 9:       Collect answers {A_ij ∈ {+1, −1}}_{(i,j) ∈ E}
10:       {x_i}_{i ∈ M} ← Algorithm 2 [E, {A_ij}_{(i,j) ∈ E}, k]
11:       α_{t,u} ← Algorithm 3 [E, {A_ij}_{(i,j) ∈ E}, ℓ_t, r_t]
12:       X_{t,u} ← √λ̃_t μ ℓ_t ( (ℓ_t − 1)(r_t − 1) α_{t,u} β )^{k−1} I{t < T̃} + 0 · I{t = T̃}
13:       t̂_i ← I{x_i > X_{t,u}} − I{x_i < −X_{t,u}} for all i ∈ M;  M ← {i ∈ M : |x_i| ≤ X_{t,u}}
14:     end if
15:   end for
16: end for

Algorithm 2: Message-Passing Algorithm
Require: E ∈ {0, 1}^{|M| × n}, {A_ij ∈ {+1, −1}}_{(i,j) ∈ E}, k_max
Ensure: {x_i ∈ R}_{i ∈ [|M|]}
 1: for all (i, j) ∈ E do
 2:   Initialize y^{(0)}_{j→i} with random Z_{j→i} ~ N(1, 1)
 3: end for
 4: for k = 1, 2, ..., k_max do
 5:   for all (i, j) ∈ E do
 6:     x^{(k)}_{i→j} ← Σ_{j' ∈ W_i \ j} A_{ij'} y^{(k−1)}_{j'→i}
 7:   end for
 8:   for all (i, j) ∈ E do
 9:     y^{(k)}_{j→i} ← Σ_{i' ∈ T_j \ i} A_{i'j} x^{(k)}_{i'→j}
10:   end for
11: end for
12: for all i ∈ [m] do
13:   x_i ← Σ_{j ∈ W_i} A_{ij} y^{(k_max−1)}_{j→i}
14: end for
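For concreteness, here is a compact Python sketch of the message-passing iteration in Algorithm 2 (our code; the data structures and the function name are ours, and the initialization and final aggregation follow our reading of the pseudocode above).

```python
import numpy as np
from collections import defaultdict

def message_passing(edges, A, k_max, rng=np.random.default_rng(0)):
    """Sketch of the message-passing iteration of Algorithm 2.
    edges : list of distinct (task i, worker j) pairs defining the assignment graph
    A     : dict {(i, j): response in {+1, -1}}
    Returns {i: x_i}; the signs (and magnitudes) of x_i are used to classify tasks."""
    W, T = defaultdict(list), defaultdict(list)
    for i, j in edges:
        W[i].append(j)   # workers assigned to task i
        T[j].append(i)   # tasks assigned to worker j
    # worker-to-task messages y_{j->i}, initialized as N(1, 1) (line 2 of Algorithm 2)
    y = {(j, i): rng.normal(1.0, 1.0) for i, j in edges}
    for _ in range(k_max - 1):
        # task-to-worker update: x_{i->j} = sum_{j' in W_i \ j} A_{ij'} y_{j'->i}
        x = {(i, j): sum(A[i, jp] * y[jp, i] for jp in W[i] if jp != j)
             for i, j in edges}
        # worker-to-task update: y_{j->i} = sum_{i' in T_j \ i} A_{i'j} x_{i'->j}
        y = {(j, i): sum(A[ip, j] * x[ip, j] for ip in T[j] if ip != i)
             for i, j in edges}
    # final aggregation (line 13): x_i = sum_{j in W_i} A_{ij} y_{j->i}
    return {i: sum(A[i, j] * y[j, i] for j in W[i]) for i in W}

# Hypothetical usage with the simulator sketched in the Setup section:
#   x = message_passing(edges, simulate_responses(q, p, edges), k_max=10)
#   t_hat = {i: 1 if xi > 0 else -1 for i, xi in x.items()}   # non-adaptive: threshold 0
```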
In Figure 1, we compare the performance of our algorithm with majority voting and with the non-adaptive version of Algorithm 1, in which each task is assigned ℓ workers (the given budget) in a single round and the classification threshold is set to X_{t,u} = 0 so that all tasks are classified. This non-adaptive special case was introduced for the standard DS model in [9]. We make a slight modification to Algorithm 1: in the final round, when the classification threshold is set to zero, we include all the responses collected thus far when running the message-passing Algorithm 2, not just the fresh samples collected in that round. This creates dependencies between rounds, which makes the analysis challenging; in practice, however, it improves performance and lets us use the given fixed budget efficiently. We run synthetic experiments with m = 1800, and fix n = 1800 for the non-adaptive version. The crowd is generated from the spammer-hammer model with hammer probability 0.3. In the left panel, the difficulty levels λ_a are uniformly distributed over {1, 1/4, 1/16}, which gives λ = 1/7. In the right panel, λ_a = 1 with probability 3/4, and otherwise 1/4 or 1/16 with equal probability, which gives λ = 4/13. As predicted by the theoretical analysis, our adaptive algorithm improves significantly over its non-adaptive version. In particular, in the left panel the error of the non-adaptive algorithm scales with the smallest λ_i, which is 1/16, while for the adaptive algorithm it scales with λ = 1/7. In the figure, the adaptive algorithm requires approximately (7/16)ℓ queries to achieve the same error as the non-adaptive one achieves with ℓ queries. This gap widens in the right panel to approximately (13/64), as predicted, and the adaptive algorithm achieves zero error as the number of queries increases. For a fair comparison with the non-adaptive version, we fix the total budget to mℓ and assign workers in each round until the budget is exhausted; C_δ is set to 1 and s_t = 1 for t ∈ {1, 2, 3}.

[Figure 1: Algorithm 1 improves significantly over its non-adaptive version and majority voting. Both panels plot the probability of error against the number of queries per task ℓ, for the adaptive algorithm, its non-adaptive version, and majority voting.]

2.3 Achievable error rate under the non-adaptive scenario

Consider a non-adaptive version of our approach, applied for a single round using an (ℓ, r)-regular random graph, where ℓ is the given budget. Naturally, the classification threshold is set to X_{t,u} = 0 so as to classify all the tasks. We provide a sharp upper bound on the achieved error that holds in all (non-asymptotic) regimes of m. With ℓ̂ ≡ ℓ − 1 and r̂ ≡ r − 1, define

    σ_k^2  ≡  2αβ / ( μ^2 ( ℓ̂ r̂ (αβ)^2 )^{k−1} )  +  3 ( 1 + 1/(r̂ α β) ) · ( 1 − ( ℓ̂ r̂ (αβ)^2 )^{−(k−1)} ) / ( 1 − ( ℓ̂ r̂ (αβ)^2 )^{−1} ) .    (7)

This captures the effective variance in the sub-Gaussian tail of the messages x_i after k iterations of the inference algorithm (Algorithm 2), as shown in the proof of the following theorem (see Section 6 of the supplementary material).

Theorem 2.3. For any ℓ > 1 and r > 1, suppose m tasks are assigned according to a random (ℓ, r)-regular graph drawn from the configuration model. If μ > 0, ℓ̂ r̂ α^2 β^2 > 1, and r̂ α > 1, then for any t ∈ {±1}^m, the estimate t̂_i = sign(x_i) after k iterations of Algorithm 2 achieves

    P[ t_i ≠ t̂^{(k)}_i | λ_i ]  ≤  e^{ − ℓ β λ_i / (2σ_k^2) }  +  (3ℓr/m) (ℓ̂ r̂)^{2k−2} .    (8)

Therefore, the average error rate is bounded by

    (1/m) Σ_{i=1}^m P[ t_i ≠ t̂^{(k)}_i ]  ≤  E_G[ e^{ − ℓ β λ_i / (2σ_k^2) } ]  +  (3ℓr/m) (ℓ̂ r̂)^{2k−2} .    (9)

The second term, which is the probability that the resulting (ℓ, r)-regular random graph is not locally tree-like, can be made small for large m as long as k = O(√(log m)) (which is the choice we make in Algorithm 1). Hence, the dominant term in the error bound is the first term.
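As a quick numerical illustration of how the bound (8) behaves across difficulty levels, the sketch below evaluates σ_k^2 as written in (7) (as reconstructed here, with ℓ̂ = ℓ − 1 and r̂ = r − 1) and the leading exponential term of (8) for a few difficulty levels. The parameter values are our own and simply mimic the spammer-hammer setting of the Figure 2 experiments.

```python
import numpy as np

def sigma2_k(k, ell, r, alpha, beta, mu):
    """sigma_k^2 of Eq. (7), as reconstructed here, with ell_hat = ell - 1 and r_hat = r - 1."""
    lh, rh = ell - 1, r - 1
    rho = lh * rh * (alpha * beta) ** 2            # Theorem 2.3 requires rho > 1
    first = 2 * alpha * beta / (mu ** 2 * rho ** (k - 1))
    second = 3 * (1 + 1 / (rh * alpha * beta)) * (1 - rho ** (-(k - 1))) / (1 - 1 / rho)
    return first + second

# Illustrative parameters (ours): spammer-hammer crowd with beta = mu = 0.3,
# and q_i uniform over {0.6, 0.8, 1.0}, so alpha = 1.4/3 (mirroring Figure 2).
ell = r = 30
alpha, beta, mu = 1.4 / 3, 0.3, 0.3
s2 = sigma2_k(k=10, ell=ell, r=r, alpha=alpha, beta=beta, mu=mu)
for q in (1.0, 0.8, 0.6):
    lam_i = (2 * q - 1) ** 2
    # leading term of the per-task bound (8): exp(-ell * beta * lambda_i / (2 sigma_k^2))
    print(q, lam_i, np.exp(-ell * beta * lam_i / (2 * s2)))
```

The bound is loose in absolute terms for the hard tasks, but it exhibits the predicted exponential dependence on ℓβλ_i, with the most difficult group decaying the slowest.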
Further, when we run our algorithm for a large enough number of iterations, σ_k^2 converges linearly to a finite limit σ_∞^2 ≡ lim_{k→∞} σ_k^2, namely σ_∞^2 = 3(1 + 1/(r̂αβ)) · ℓ̂r̂(αβ)^2 / (ℓ̂r̂(αβ)^2 − 1), which for large enough r̂αβ and ℓ̂r̂ is upper bounded by a constant. Hence, for a wide range of parameters, the average error in (9) is dominated by E_G[e^{−ℓβλ_i/(2σ_k^2)}] = Σ_a δ_a e^{−C ℓ β λ_a}. When all the δ_a's are strictly positive, the error is dominated by the most difficult tasks, with λ_min = min_a λ_a, as illustrated in Figure 2. Hence, a budget Γ_ε ≥ C (m/(λ_min β)) log(1/ε) is sufficient to achieve an average error of ε > 0; such a scaling is also necessary, as we show in the next section. This is further illustrated in Figure 2: the error decays exponentially in ℓ and β as predicted, but the rate of decay crucially hinges on the difficulty level. We run synthetic experiments with m = n = 1000, where the crowd is generated from the spammer-hammer model: p_j = 1 with probability β and p_j = 1/2 otherwise. We fix β = 0.3 and vary ℓ in the left panel, and fix ℓ = 30 and vary β in the right panel. We let the q_i's take values in {0.6, 0.8, 1} with equal probability, so that α = 1.4/3. The error rate of each group of tasks, grouped by difficulty, is plotted in dashed lines, matching the predicted e^{−Ω(ℓβ(2q_i−1)^2)}. The average error rates, in solid lines, are dominated by those of the difficult tasks, which is a universal drawback of all non-adaptive schemes.

[Figure 2: Non-adaptive schemes suffer as the average error is dominated by the difficult tasks. Left panel: probability of error vs. number of queries per task ℓ; right panel: probability of error vs. crowd quality β; curves shown for q = 1.0, 0.8, 0.6 and the mean error.]

2.4 Fundamental limit under the non-adaptive scenario

Theorem 2.3 implies that it suffices to assign ℓ_i ≥ (c/(βλ_i)) log(1/ε) workers to task i to achieve an error smaller than ε on that task. We show in the following theorem that this scaling is also necessary. Hence, applying a single round of Algorithm 1 is near-optimal in the non-adaptive scenario, in the minimax sense where nature chooses the worst distribution of the worker p_j's among the set of distributions with the same β. We provide a proof of the theorem in Section 7 of the supplementary material.

Theorem 2.4. There exist a positive constant C' and a distribution F of workers with average reliability E[(2p_j − 1)^2] = β such that, when λ_i < 1, if the number of workers assigned to task i by any non-adaptive task assignment scheme is less than (C'/(βλ_i)) log(1/ε), then no algorithm can achieve a conditional probability of error on task i less than ε, for any m and r.

Since in a non-adaptive scheme the task assignments are done a priori, there are on average ℓ workers assigned to any set of tasks of the same difficulty. Hence, if the total budget is less than C' (m/(λ_min β)) log(δ_min/ε), then no algorithm can achieve average error less than ε, where λ_min = min_a λ_a. Compared to the adaptive case in (4) (nearly achieved in (5)), the gain of adaptivity is a factor of λ/λ_min. This budget threshold is negative (and hence vacuous) when δ_min < ε, and can be tightened to C' (m/(λ_a β)) log( Σ_{b=1}^a δ_b / ε ), where a is the smallest integer such that Σ_{b=1}^a δ_b > ε.
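For a concrete sense of the adaptivity gain λ/λ_min, take the difficulty priors from Figure 1 (a worked example using the paper's own numbers). For the left-panel prior,

    λ = (E[1/λ_i])^{−1} = ( (1/3)·1 + (1/3)·4 + (1/3)·16 )^{−1} = 1/7 ,   λ_min = 1/16 ,   λ/λ_min = 16/7 ≈ 2.3 ,

so a non-adaptive scheme needs roughly 2.3 times the budget of the adaptive scheme to reach the same error, consistent with the (7/16)ℓ observation in Figure 1. For the right-panel prior, λ/λ_min = (4/13)/(1/16) = 64/13 ≈ 4.9, matching the (13/64) gap reported there.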
Acknowledgements

This work is supported by NSF SaTC award CNS-1527754 and NSF CISE award CCF-1553452.

References

[1] N. Alon and J. H. Spencer. The probabilistic method. John Wiley and Sons, 2004.
[2] N. Dalvi, A. Dasgupta, R. Kumar, and V. Rastogi. Aggregating crowdsourced binary ratings. In Proceedings of the 22nd International Conference on World Wide Web, pages 285-294, 2013.
[3] A. P. Dawid and A. M. Skene. Maximum likelihood estimation of observer error-rates using the EM algorithm. Applied Statistics, pages 20-28, 1979.
[4] A. Ghosh, S. Kale, and P. McAfee. Who moderates the moderators?: Crowdsourcing abuse detection in user-generated content. In Proceedings of the 12th ACM Conference on Electronic Commerce, pages 167-176. ACM, 2011.
[5] C. Ho, S. Jabbari, and J. W. Vaughan. Adaptive task assignment for crowdsourced classification. In Proceedings of the 30th International Conference on Machine Learning (ICML-13), pages 534-542, 2013.
[6] R. Jin and Z. Ghahramani. Learning with multiple labels. In Advances in Neural Information Processing Systems, pages 921-928, 2003.
[7] D. R. Karger, S. Oh, and D. Shah. Iterative learning for reliable crowdsourcing systems. In Advances in Neural Information Processing Systems, pages 1953-1961, 2011.
[8] D. R. Karger, S. Oh, and D. Shah. Efficient crowdsourcing for multi-class labeling. In Proceedings of the ACM SIGMETRICS/International Conference on Measurement and Modeling of Computer Systems, pages 81-92, 2013.
[9] D. R. Karger, S. Oh, and D. Shah. Budget-optimal task allocation for reliable crowdsourcing systems. Operations Research, 62:1-24, 2014.
[10] H. Li and B. Yu. Error rate bounds and iterative weighted majority voting for crowdsourcing. arXiv preprint arXiv:1411.4086, 2014.
[11] Q. Liu, J. Peng, and A. Ihler. Variational inference for crowdsourcing. In Advances in Neural Information Processing Systems 25, pages 701-709, 2012.
[12] O. Maron and A. W. Moore. Hoeffding races: Accelerating model selection search for classification and function approximation. Robotics Institute, page 263, 1993.
[13] M. Mezard and A. Montanari. Information, physics, and computation. Oxford University Press, 2009.
[14] J. Ok, S. Oh, J. Shin, and Y. Yi. Optimality of belief propagation for crowdsourced classification. In International Conference on Machine Learning, 2016.
[15] N. B. Shah, S. Balakrishnan, and M. J. Wainwright. A permutation-based model for crowd labeling: Optimal estimation and robustness. arXiv preprint arXiv:1606.09632, 2016.
[16] V. S. Sheng, F. Provost, and P. G. Ipeirotis. Get another label? Improving data quality and data mining using multiple, noisy labelers. In Proceedings of the 14th ACM SIGKDD, pages 614-622. ACM, 2008.
[17] P. Smyth, U. Fayyad, M. Burl, P. Perona, and P. Baldi. Inferring ground truth from subjective labelling of Venus images. In NIPS, pages 1085-1092, 1995.
[18] P. Welinder, S. Branson, S. Belongie, and P. Perona. The multidimensional wisdom of crowds. In Advances in Neural Information Processing Systems, pages 2424-2432, 2010.
[19] J. Whitehill, P. Ruvolo, T. Wu, J. Bergsma, and J. Movellan. Whose vote should count more: Optimal integration of labels from labelers of unknown expertise. In Advances in Neural Information Processing Systems, volume 22, pages 2035-2043, 2009.
[20] D. Williams. Probability with martingales. Cambridge University Press, 1991.
[21] Y. Zhang, X. Chen, D. Zhou, and M. I. Jordan. Spectral methods meet EM: A provably optimal algorithm for crowdsourcing. In Advances in Neural Information Processing Systems, pages 1260-1268, 2014.
[22] D. Zhou, Q. Liu, J. C. Platt, C. Meek, and N. B. Shah. Regularized minimax conditional entropy for crowdsourcing. arXiv preprint arXiv:1503.07240, 2015.
[23] D. Zhou, J. Platt, S. Basu, and Y. Mao. Learning from the wisdom of crowds by minimax entropy. In Advances in Neural Information Processing Systems 25, pages 2204-2212, 2012.