On the Cost Complexity of Crowdsourcing

Yili Fang, Hailong Sun (corresponding author), Pengpeng Chen, Jinpeng Huai
SKLSDE, School of Computer Science and Engineering, Beihang University, Beijing, China
Beijing Advanced Innovation Center for Big Data and Brain Computing, Beijing, China
{fangyili,chenpp}@act.buaa.edu.cn, {sunhl,huaijp}@buaa.edu.cn

Abstract

Existing efforts mainly use empirical analysis to evaluate the effectiveness of crowdsourcing methods, but such analysis is often unreliable across experimental settings. Consequently, it is of great importance to study theoretical methods. This work, for the first time, defines the cost complexity of crowdsourcing and presents two theorems to compute it. Our theorems provide a general theoretical method to model the trade-off between cost and quality, which can be used to evaluate and design crowdsourcing algorithms and to characterize the complexity of crowdsourcing problems. Moreover, following our theorems, we prove a set of corollaries that recover existing theoretical results as special cases. We have verified our work both theoretically and empirically.

1 Introduction

Crowdsourcing provides an effective means for solving many real-world problems, e.g., labeling training data for machine learning. As crowdsourcing workers are normally non-experts, the individual contributions of a single worker can be unreliable. In practice, task redundancy is commonly used to compensate for this unreliability at extra cost. Many efforts [Ho et al., 2013; Roy et al., 2015; Yu et al., 2017; Tran-Thanh et al., 2013] have been made to obtain high-quality results at as low a cost as possible, where quality is often measured by the result error rate and cost by the number of queries to human workers.

To evaluate the effectiveness of the proposed methods, prior work mainly employs experimental analysis. However, the same methods often exhibit contradicting results in different experiments [Liu et al., 2012; Zhou et al., 2015; Li and Liu, 2015]. For instance, [Zhou et al., 2012] shows that the Dawid & Skene (DS) model [Dawid and Skene, 1979] outperforms majority voting (MV) in terms of result accuracy, while [Han et al., 2016] presents exactly the reverse result on tasks for acquiring specific knowledge. [Liu et al., 2012] demonstrates that the homogeneous DS (HDS) model [Raykar et al., 2010] is better than MV on the bluebird dataset while MV is better than HDS on rte and temp. Therefore, empirical analysis is not a reliable means of evaluating crowdsourcing methods in general, which calls for theoretical research to overcome these limitations. In this regard, the NSF computing division [Wing, 2008] considers the development of theoretical tools for systems involving human computation to be one of the five major challenges that today's computing faces.

Inspired by theoretical computer science, some efforts [Shahaf and Amir, 2007; Kulkarni, 2011] concern theoretical models involving humans in computation. [Shahaf and Amir, 2007] presents a Human-Assisted Turing Machine (HTM) that models the hybrid computation paradigm with machines and humans. Based on HTM, the authors discuss how to measure human effort, such as the number and size of human inputs, so as to define algorithm and problem complexity, but they do not consider the unreliability of workers or the trade-off between cost and quality in crowdsourcing.
Furthermore, [Kulkarni, 2011] discusses the importance of building a theoretical model of computation involving humans in terms of algorithm evaluation, algorithm comparison and cost quantification. Another line of theoretical research [Li and Liu, 2015; Wang and Zhou, 2016; Gao and Zhou, 2016; Gao et al., 2016] focuses on estimating the upper bound on the mean error rate of specific algorithms. [Li and Liu, 2015] gives the upper bound on the mean error rate of weighted majority voting (WMV). [Wang and Zhou, 2016] theoretically analyzes the costs and the result error rate of MV. [Gao and Zhou, 2016; Gao et al., 2016] theoretically study the performance of the DS model, but ignore the parameter learning that can greatly affect the result accuracy. [Nushi et al., 2015; Venanzi et al., 2014] address the parameter learning issue in the context of data sparsity, which is helpful for better algorithm design. In summary, although existing efforts recognize the importance of theoretical models for human-participant computation systems like crowdsourcing, few provide general theoretical methods for crowdsourcing.

In this work, motivated by classical computational complexity, sample complexity and PAC theory in machine learning [Balcan et al., 2010], we study a general theoretical approach to understanding the complexity of crowdsourcing, which is affected by multiple interplaying factors such as the number of workers, worker ability and aggregation methods. Specifically, we propose the cost complexity of crowdsourcing to theoretically address the trade-off between cost and quality in crowdsourcing. We argue that, like computational complexity, the cost complexity of crowdsourcing is useful for evaluating, comparing and designing algorithms, measuring the complexity of crowdsourcing problems, etc. Our major contributions are as follows:

- We define the cost complexity of crowdsourcing, which measures how much cost is needed to meet certain quality requirements. To the best of our knowledge, this is the first effort to give a formal definition of crowdsourcing complexity that relates cost with quality.
- We give two theorems that derive the cost complexity of general crowdsourcing. For a specific crowdsourcing algorithm, our method not only gives the theoretical upper bound on the mean error rate, but also estimates how many workers to hire to achieve a certain quality objective. Following the theorems, we also obtain a set of corollaries that have been verified in existing work.
- Through a set of case studies, we have verified our method by theoretical analysis and experimental evaluation on real-world datasets. The outcome explains the contradicting results in previous work. For instance, when each worker completes more than $\frac{1}{(\eta_{MV}-\eta_{HDS})^2}\ln\frac{2|H|}{\delta}$ tasks, HDS will outperform MV; otherwise, MV is better.

2 Overview of Crowdsourcing Workflows

Fig. 1 shows the general framework of crowdsourcing workflows, consisting of task allocation and result inference.

Figure 1: The framework of a crowdsourcing workflow (task set, task allocation, human processing, answer set, result inference/answer aggregation, result).

2.1 Task Allocation

Task allocation involves worker pool construction and task assignment.
First, a high-quality worker pool can be built by filtering out low-ability workers with qualification tests [AMT, 2017; Ipeirotis and Gabrilovich, 2014; Marcus et al., 2015] or by analyzing historical logs [Jung, 2014; Ambati et al., 2011]. Then the distribution of the workers in the worker pool over ability can be obtained through qualification tests or log analysis using statistical methods. For instance, Quizz [Ipeirotis and Gabrilovich, 2014] obtains a Beta distribution of worker ability with qualification tests. Next, in task assignment, a task is assigned to a certain number of workers selected from the worker pool with a specific strategy. With the constructed worker pool and the adopted task assignment strategy, we can obtain the distribution of participating workers over ability (denoted by W) in task processing. For result inference, W can be considered prior knowledge.

Assume that there are m workers and n tasks. We denote the ground truth set by $Y = \{y_j \mid 0 < j \le n\}$, where the true answer $y_j$ to task j takes a value in a candidate answer set $A = \{0, \ldots, K-1\}$. Let $X = \{x_{ij} \mid i \le m, j \le n\}$ be the answer set of the n tasks from the m workers, $x_{ij}$ be the answer given by worker i to task j, and $X_j$ denote the answer set of task j. We use the confusion matrix in (1) to characterize workers' abilities:

$$\pi^{ij}_{kl} = P(x_{ij} = l \mid y_j = k), \tag{1}$$

which satisfies $\sum_{l=1}^{L} \pi^{ij}_{kl} = 1$. Given $y_j = k$, $x_{ij}$ is generated by a multinomial distribution with $\pi^{ij}_{k} = (\pi^{ij}_{k1}, \ldots, \pi^{ij}_{kL})$. Thus the three-dimensional matrix $\pi^{(i)} = [\pi^{ij}_{kl}]$ denotes the abilities of worker i on all tasks. After task allocation, we may not know $\pi$, but we can know its distribution W.

2.2 Result Inference

Result inference is to infer the true result of a task by aggregating the answers. There are mainly two lines of work. 1) Voting. Majority voting, the simplest voting method, infers the final result by simply counting the votes for each alternative answer [Snow et al., 2008; Ipeirotis and Gabrilovich, 2014]. Though simple, it is error-prone because it ignores the differences in workers' abilities and other parameters. In light of this, weighted majority voting [Li and Liu, 2015] incorporates workers' abilities into majority voting: it assigns different weights to votes according to workers' abilities, which are unknown parameters. 2) Probabilistic approaches. Probabilistic generative models containing unknown parameters (e.g., worker ability) are employed to specify workers' performance on tasks; the parameters are then estimated (parameter learning), and finally the answers are aggregated through model inference, e.g., by using the EM algorithm for parameter estimation of probabilistic generative models [Dawid and Skene, 1979; Raykar et al., 2010; Salek et al., 2013].

Without loss of generality, we define a unified aggregation function $f: A^{|X_j|} \to A$ as follows:

$$f(X_j) = \arg\max_{k \in A} \sum_{i=1}^{m} As_j(i, k, x_{ij}), \tag{2}$$

where $As_j(i, k, x_{ij})$ is the aggregation score when worker i gives answer $x_{ij} \in A$ to task j whose ground truth is $k \in A$. (2) is a universal representation of result inference. For majority voting, $As_j(i, k, x_{ij}) = I(x_{ij} = k)$, where $I(\cdot)$ is an indicator function; for weighted majority voting, $As_j(i, k, x_{ij}) = v_i \cdot I(x_{ij} = k)$, where $v_i$ is the weight of worker i, which can be obtained with machine learning. In probabilistic methods (e.g., the DS model), $As_j(i, k, l) = \log \pi^{ij}_{kl}$ (we use l to mark $x_{ij}$), where $\pi^{ij}_{kl} \in \pi^{(i)}$ denotes the worker ability that needs to be estimated with machine learning.
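To make Eq. (2) concrete, the following Python sketch instantiates the unified aggregation function for a single task with the three scoring rules described above (MV, WMV, and a DS-style log-probability score). All names, the toy worker weights and the toy confusion matrices are illustrative assumptions, not values from the paper.

```python
import math
from collections import defaultdict

def aggregate(answers, K, score):
    """Unified aggregation (Eq. 2): return the candidate k that maximizes
    the summed aggregation score As_j(i, k, x_ij) over all workers."""
    totals = defaultdict(float)
    for k in range(K):
        for worker, x in answers:            # answers: list of (worker_id, label)
            totals[k] += score(worker, k, x)
    return max(range(K), key=lambda k: totals[k])

# Majority voting: As_j(i, k, x_ij) = I(x_ij = k).
def mv_score(worker, k, x):
    return 1.0 if x == k else 0.0

# Weighted majority voting: As_j(i, k, x_ij) = v_i * I(x_ij = k),
# with v_i an (estimated) worker weight; the weights here are made up.
weights = {"w1": 0.9, "w2": 0.6, "w3": 0.4}
def wmv_score(worker, k, x):
    return weights[worker] * (1.0 if x == k else 0.0)

# DS-style score: As_j(i, k, l) = log pi^{ij}_{kl}, using toy confusion matrices
# pi[worker][k][l] = P(worker answers l | truth is k).
pi = {w: [[0.8, 0.2], [0.2, 0.8]] for w in weights}
def ds_score(worker, k, x):
    return math.log(pi[worker][k][x])

answers_j = [("w1", 1), ("w2", 1), ("w3", 0)]        # answers to one task, K = 2
for rule in (mv_score, wmv_score, ds_score):
    print(rule.__name__, aggregate(answers_j, K=2, score=rule))
```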
Let D denote the distribution of the ground truth over the candidate answers. We define the loss function measuring the mean error rate of f as follows:

$$L_{(D,W,Y)}(f) = \frac{1}{n}\sum_{j=1}^{n} P\{f(X_j) \neq y_j\}. \tag{3}$$

However, to evaluate the effectiveness of f, we first need to learn the unknown parameters in f. Straightforwardly, the more workers' answers we have, the better we can learn the parameters of f, which is exactly what sample complexity addresses. In computational learning theory, probably approximately correct (PAC) learning is a framework for the mathematical analysis of the sample complexity of machine learning algorithms. Let H be a set of answer aggregation functions with different parameters, whose inputs and outputs are X and Y respectively. H is called the hypothesis class and every member of H is called a hypothesis. f is an unknown answer aggregation function $f: A^{|X_j|} \to A$. For simplicity, it is assumed that $f \in H$, which is called the realizability assumption. A learner is given access to an oracle EX(D, W, f), which outputs aggregated answers one at a time, randomly and independently according to D, W and f. The goal is to learn f from H so that the corresponding parameters are correctly learned. $S = (X_1, X_2, \ldots, X_n)$ is a finite sequence of answer sets for all tasks; this is the learner's input and is generated by n calls to EX(D, W, f). The learner's output is $h \in H$, the answer aggregation function with the estimated values of the parameters. To measure the effectiveness of the learner, we define the error of a function h as follows:

$$L_{(D,W,f)}(h) = P_{(D,W)}\{h(X_j) \neq f(X_j)\}. \tag{4}$$

(4) denotes the probability that h disagrees with f on the distributions D and W. Ideally, h agrees with f on the whole domain, namely $L_{(D,W,f)}(h) = 0$. That is to say, the unknown parameters are accurately learned, and the final error rate only depends on the answer aggregation function. For instance, majority voting can be viewed as such a special case, where all parameters in f are known.
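As a rough illustration of Eq. (4), the sketch below estimates the disagreement probability between a learned hypothesis h and the target aggregation function f by Monte Carlo sampling of answer sets. The simulated worker model (binary tasks, a fixed worker accuracy) and all names are assumptions made only for illustration.

```python
import random

def disagreement_rate(h, f, sample_answer_set, n_trials=10000, seed=0):
    """Monte Carlo estimate of L_{(D,W,f)}(h) = P{h(X_j) != f(X_j)} (Eq. 4):
    draw answer sets X_j from the (simulated) distribution induced by D and W
    and count how often h and f disagree."""
    rng = random.Random(seed)
    disagree = sum(h(X) != f(X)
                   for X in (sample_answer_set(rng) for _ in range(n_trials)))
    return disagree / n_trials

def sample_answer_set(rng, m=5, accuracy=0.7):
    """Toy oracle: binary ground truth drawn uniformly (D), m workers who each
    answer correctly with the given accuracy (a degenerate W)."""
    y = rng.randint(0, 1)
    return [y if rng.random() < accuracy else 1 - y for _ in range(m)]

f = lambda X: max(set(X), key=X.count)   # target aggregation: majority vote
h = lambda X: X[0]                       # a poor hypothesis: always trust worker 0
print(disagreement_rate(h, f, sample_answer_set))   # roughly 0.2-0.3 in this toy setting
```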
3 The Cost Complexity of Crowdsourcing

We use the term cost complexity to denote the cost of solving a crowdsourcing task. Specifically, it measures the number of task requests made to human workers. Different from traditional computational complexity, the cost complexity of crowdsourcing is closely related to the quality requirement and workers' abilities. This section introduces two formal definitions of the cost complexity of crowdsourcing given a quality constraint on the mean error rate and a distribution over workers' abilities. Let U denote the set of workers selected by task allocation.

Definition 1. Suppose there is a learning algorithm $A(m', n', \delta)$ that outputs an aggregation function h and three values $\eta_U \in \mathbb{R}$, $\epsilon_{n'} \in \mathbb{R}$, $\eta_{m'} \in \mathbb{R}$ after making at most c ($c = m'n'$) task requests, such that for any answer aggregation function $f \in H$, $\epsilon \in (0, 1/2)$, $\eta \in (0, 1/2)$, $\delta \in (0, 1/4)$, for any $m' \ge 0$ and any worker set U, we have $P\{L_{(D,W,f)}(h) \le \epsilon_{n'}\} \ge 1-\delta$ and $L_{(D,W,Y)}(f) \le \eta_U$; and for $\eta_{m'} = \exp(E_W(\ln \eta_U))$ and any $c \ge O_W(\epsilon, \eta, \delta, f)$, we have

$$P\{L_{(D,W,f)}(h) \le \epsilon_{n'} \le \epsilon\} \ge 1-\delta, \quad L_{(D,W,Y)}(f) \le \eta_U, \quad \eta_{m'} \le \eta. \tag{5}$$

Then we call $O_W(\epsilon, \eta, \delta, f)$ the cost complexity of crowdsourcing with worker distribution W over abilities.

Essentially, $O_W$ measures how many answers should be solicited from workers for learning an aggregation function and inferring the final crowdsourcing results accurately. $\epsilon_{n'}$ and $\eta_{m'}$ respectively specify the error rate bounds of the parameter learning of the aggregation function f and of the aggregated results. $\eta_U$ is the error rate bound of the aggregated results determined by the worker set U and is related to $\eta_{m'}$. $O_W$ borrows the concept of sample complexity from machine learning; it depends on $m'$, the total number of workers, and $n'$, the number of tasks a worker completes.

A special case of Definition 1 is when all workers exhibit the same ability w, which is not unusual because many microtasks do not require much expertise and all workers can fulfill them. For this case, we define $O_w$ as follows:

Definition 2. Suppose there is a learning algorithm $A(m', n', \delta)$ that outputs an aggregation function h and two values $\eta_{m'} \in \mathbb{R}$, $\epsilon_{n'} \in \mathbb{R}$ after making at most c ($c = m'n'$) task requests, such that for any answer aggregation function $f \in H$, $\epsilon \in (0, 1/2)$, $\eta \in (0, 1/2)$, $\delta \in (0, 1/4)$, for any $m' \ge 0$, $n' \ge 0$, we have $P\{L_{(D,w,f)}(h) \le \epsilon_{n'}\} \ge 1-\delta$ and $L_{(D,w,Y)}(f) \le \eta_{m'}$; and for any $c \ge O_w(\epsilon, \eta, \delta, f)$, we have

$$P\{L_{(D,w,f)}(h) \le \epsilon_{n'} \le \epsilon\} \ge 1-\delta, \quad L_{(D,w,Y)}(f) \le \eta_{m'} \le \eta. \tag{6}$$

Then we call $O_w(\epsilon, \eta, \delta, f)$ the cost complexity of crowdsourcing with identical worker ability w.

From Definition 2, we know that $\eta_{m'}$ and $\epsilon_{n'}$ depend only on $m'$ and $n'$, respectively. A simple scenario is when f has no unknown parameters (e.g., MV), which means $L_{(D,w,f)}(h) = 0$ always holds; in that case, c depends only on $m'$, and $O_W$ depends only on $\eta_{m'}$. Note that our cost complexity specifies a number of task requests sufficient to achieve an error rate bound; in other words, it suffices that c is at least $O_W$, and we do not require that workers complete the same number of tasks. $n'$ can be regarded as the minimum number of tasks a worker must complete; in reality, if a worker completes more than $n'$ tasks, the practical cost is less than that given by the cost complexity for a given quality constraint.

4 Main Results

The cost complexity is closely related to the quality requirements measured by the error rates $\eta_{m'}$ and $\epsilon_{n'}$. This section first presents Theorem 1 and Theorem 2, which compute $O_W$ and the upper bound on the error rate respectively; these are the main contributions of this work.

Theorem 1. Given a task j ($1 \le j \le n$) processed by m workers, we obtain an answer set $X_j = \{x_{ij} \mid i \le m\}$. For simplicity, we use l to mark $x_{ij}$. Worker i has ability $\pi^{ij} = [\pi^{ij}_{kl}]$ when processing task j, where $\pi^{ij}_{kl}$ is the performance of worker i on task j whose true answer is $k \in A$. Let $f(X_j)$ denote the aggregation method. Then the complexity $O_W$ can be computed as follows:

$$O_W(\epsilon, \eta, \delta, f) = \frac{2}{\epsilon^2(1-2\eta)^2}\ln\left(\frac{2|H|}{\delta}\right) \cdot \frac{-2\left(\ln\frac{\eta}{K-1}\right)(a-b)^2 - D_W(\mu_i)}{E_W^2(\mu_i)}, \tag{7}$$

where $\mu_i = \sum_l \pi^{ij}_{kl}(As_j(i,g,l) - As_j(i,k,l))$ is a value specific to worker i, $As_j(i,g,l) \in [a,b]$, g represents any alternative answer other than the truth (i.e., $g \neq k$), and $D_W(\mu_i)$ and $E_W(\mu_i)$ are the variance and expectation of $\mu_i$, respectively.
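As a numerical illustration of Eq. (7), the sketch below evaluates $O_W$ from the quality targets $(\epsilon, \eta, \delta)$, the hypothesis-class size $|H|$, the score range $[a, b]$, and the mean and variance of $\mu_i$ under W. Every input value is a made-up assumption chosen only to show the computation; none comes from the paper.

```python
import math

def cost_complexity(eps, eta, delta, H_size, K, a, b, mean_mu, var_mu):
    """Theorem 1 (Eq. 7): cost complexity as
    (tasks per worker for parameter learning, n') x (workers per task, m')."""
    n_prime = 2.0 / (eps**2 * (1 - 2 * eta)**2) * math.log(2 * H_size / delta)
    m_prime = (-2.0 * math.log(eta / (K - 1)) * (a - b)**2 - var_mu) / mean_mu**2
    return n_prime * m_prime, n_prime, m_prime

# Illustrative numbers: binary tasks, MV-style scores in [0, 1], and
# mu_i = 1 - 2*w_i for worker accuracies w_i with mean 0.7 and variance 0.01
# (so Var(mu_i) = 4 * Var(w_i) = 0.04).
c, n_p, m_p = cost_complexity(eps=0.1, eta=0.1, delta=0.05,
                              H_size=1000, K=2, a=0, b=1,
                              mean_mu=1 - 2 * 0.7, var_mu=0.04)
print(f"m' >= {m_p:.1f} workers per task, n' >= {n_p:.1f} tasks per worker, c >= {c:.0f}")
```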
Remarks. This theorem provides a theoretical means to estimate how many queries to human workers should be made to meet certain quality requirements. Its benefits are three-fold. First, given a set of workers with distribution W over ability, Theorem 1 can be used to compare the performance of two result aggregation functions. Second, given a result aggregation function, Theorem 1 helps evaluate the performance of different task allocation methods. Third, when both the task allocation and the result aggregation method are given, Theorem 1 can estimate how many workers should be hired. In all, Theorem 1 provides a fundamental tool for designing and evaluating general crowdsourcing solutions.

$O_W$ depends on the error rate bounds ($\eta_U$ and $\epsilon_{n'}$) of result inference and of the parameter learning of the aggregation function. To prove Theorem 1, we first give Theorem 2, which analyzes the error rate bound of result inference.

Theorem 2. For the crowdsourcing tasks described in Theorem 1, the error rate of the aggregated results can be upper-bounded as follows:

$$L_{(D,W,Y)}(f) \le (K-1)\exp\left(-\frac{\left(\sum_{i=1}^{m}\mu_i\right)^2}{2m(a-b)^2}\right), \tag{8}$$

where K is the size of the answer set A.

Proof. First, we use the general function in (2) to denote an aggregation process: $f(X_j) = \arg\max_{k\in A}\sum_{i=1}^{m} As_j(i,k,x_{ij})$. Let $As_j(i,k,l) \in [a,b]$ and $Z^{gk}_i = As_j(i,g,l) - As_j(i,k,l)$; then $Z^{gk}_i \in [a-b, b-a]$ and its expectation is $\mu_i = \sum_l \pi^{ij}_{kl}(As_j(i,g,l) - As_j(i,k,l))$. We apply the union bound to get (9) and Hoeffding's inequality to obtain (10):

$$L_{(D,W,Y)}(f) = P\Big\{\exists g \neq k: \sum_i Z^{gk}_i \ge 0\Big\} \tag{9}$$

$$\le \sum_{g \in A, g \neq k} \exp\left(-\frac{\left(\sum_{i=1}^{m} E_l[As_j(i,g,l) - As_j(i,k,l)]\right)^2}{2\sum_{i=1}^{m}(a-b)^2}\right) \tag{10}$$

$$= \sum_{g \in A, g \neq k} \exp\left(-\frac{\left(\sum_{i=1}^{m}\sum_l \pi^{ij}_{kl}(As_j(i,g,l) - As_j(i,k,l))\right)^2}{2m(a-b)^2}\right) \le (K-1)\exp\left(-\frac{\left(\sum_{i=1}^{m}\mu_i\right)^2}{2m(a-b)^2}\right).$$

Thus, $L_{(D,W,Y)}(f) \le (K-1)\exp\left(-\frac{(\sum_{i=1}^{m}\mu_i)^2}{2m(a-b)^2}\right)$.

Theorem 2 provides a general method to compute the upper bound $\eta_U$ on the error rate of an aggregation function with worker set U. Besides result inference, task allocation determines the worker distribution W and can greatly affect the crowdsourcing results; this is what Theorem 1 addresses on the basis of Theorem 2. For identical worker abilities, we can specialize Theorem 2 to obtain Corollary 1, given in [Wang and Zhou, 2016].

Corollary 1. Given m workers whose abilities are i.i.d. according to parameters $q = [q_0, q_1, \ldots, q_{K-1}]$, the ground-truth label $i^* \in \{0, 1, \ldots, K-1\}$ and $\gamma = \min_{i \neq i^*}(q_{i^*} - q_i) > 0$, for the error rate of the aggregated results to be upper-bounded by $\eta$, it is sufficient that

$$m \ge \frac{2}{\gamma^2}\ln\frac{K-1}{\eta}, \tag{11}$$

which follows from (8) with $\mu_i \le -\gamma$ and $(a-b)^2 = 1$.

We can further specialize Theorem 2 by using weighted majority voting (WMV) with weights $K\hat{w}_i - 1$ as the aggregation function. Then we obtain Corollary 2, given in [Li and Liu, 2015].

Corollary 2. For a set U of m workers, use weighted majority voting with weights $K\hat{w}_i - 1$ to obtain the aggregated result $\hat{y}_j$ of task j, where $\hat{w}_i$ is an unbiased estimator of the worker ability, $E(\hat{w}_i) = w_i$, with $w_i \ge \frac{1}{K}$. If the workers' labels are generated independently according to

$$P(x_{ij} = l) = \begin{cases} w_i, & l = y_j, \\ \frac{1-w_i}{K-1}, & l \neq y_j, \end{cases} \tag{12}$$

then we have

$$\frac{1}{n}\sum_{j=1}^{n} P\{\hat{y}_j \neq y_j\} \le \exp\left(-\frac{2F(U)^2}{K^2(K-1)^2} + \ln(K-1)\right), \quad \text{where } F(U) = \frac{1}{\sqrt{m}}\sum_{i \le m}(Kw_i - 1)^2.$$

From Corollary 2, we learn that weighted majority voting can achieve the same error rate as majority voting if the weight (worker ability) is a constant value. However, if workers' abilities are unknown, analyzing weighted majority voting entails considering the parameter learning process. In practice, parameter learning is indispensable for many aggregation methods, and it is difficult to learn the unknown parameters with full accuracy due to the limitations of datasets and machine learning algorithms. Regarding Definition 1, $O_W$ involves parameter learning, which also affects the error rate of an aggregation method with estimated values of the unknown parameters. Thus, we use PAC learnability to analyze the mean error rate of aggregation methods.
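The following sketch, using assumed toy parameters, evaluates the Theorem 2 bound (8) for majority voting ($\mu_i = 1 - 2w_i$, scores in $[0, 1]$) and compares it against a simulated error rate; it only illustrates that the bound dominates the empirical error, and all numbers are invented for the example.

```python
import math
import random

def theorem2_bound(mus, K, a, b):
    """Upper bound (8): L <= (K-1) * exp(-(sum mu_i)^2 / (2 m (a-b)^2))."""
    m = len(mus)
    return (K - 1) * math.exp(-sum(mus)**2 / (2 * m * (a - b)**2))

def simulated_mv_error(accuracies, n_tasks=20000, seed=0):
    """Empirical error rate of majority voting on binary tasks where each worker
    answers correctly and independently with the given accuracy."""
    rng = random.Random(seed)
    errors = 0
    for _ in range(n_tasks):
        y = rng.randint(0, 1)
        votes = [y if rng.random() < acc else 1 - y for acc in accuracies]
        y_hat = 1 if 2 * sum(votes) > len(votes) else 0
        errors += (y_hat != y)
    return errors / n_tasks

accuracies = [0.7] * 9                       # 9 workers with accuracy 0.7 (illustrative)
mus = [1 - 2 * acc for acc in accuracies]    # mu_i = 1 - 2 w_i for majority voting
print("Theorem 2 bound:", theorem2_bound(mus, K=2, a=0, b=1))
print("simulated error:", simulated_mv_error(accuracies))
```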
First, we formulate the error rate of the inferred results by considering parameter learning:

$$err(h) = L_{(f,Y)}(1 - L_{(h,f)}) + (1 - L_{(f,Y)})L_{(h,f)} = (1 - 2L_{(h,f)})L_{(f,Y)} + L_{(h,f)}, \tag{13}$$

where $L_{(f,Y)}$ and $L_{(h,f)}$ are abbreviations of $L_{(D,W,Y)}(f)$ and $L_{(D,W,f)}(h)$, respectively. In some special cases, aggregation methods such as majority voting do not entail parameter learning because their aggregation rules contain no parameters; in this case, $err(h)$ is identical to $L_{(f,Y)}$. To prove Theorem 1, we still need Valiant's PAC bound (Lemma 1) to compute the $n'$ required to achieve the parameter-learning error rate $\epsilon_{n'}$.

Lemma 1. [Angluin and Laird, 1988] Let H be a finite hypothesis class, let $\delta, \epsilon \in (0,1)$, $\eta \in (0, \frac{1}{2})$, and let $n'$ be an integer that satisfies

$$n' \ge \frac{2}{\epsilon^2(1-2\eta)^2}\ln\left(\frac{2|H|}{\delta}\right). \tag{14}$$

Then, for any f and D for which the realizability assumption holds, given a sequence S of size $n'$, if a hypothesis $h \in H$ minimizes $err(h)$, we have

$$P\{L_{(D,W,f)}(h) > \epsilon\} \le \delta. \tag{15}$$

Although the parameter space can be infinite, the hypothesis space H is finite. For instance, given n tasks, m workers and an answer set A ($|A| = K$), a hypothesis $h \in H$ is a mapping from workers' answer sets to a set of truth answers, so the size of the hypothesis space is at most $K^n$. We can then use VC theory [McAllester, 1998] to obtain $|H|$. For crowdsourcing tasks, the result aggregation method can be viewed as a machine learning process whose error rate is $\eta$ if the parameters are accurately learned. In particular, $err(h)$ is determined solely by the error rate of the aggregation rule (i.e., $err(h) = L_{(f,Y)}$) when the parameter learning process generates no error, i.e., $L_{(h,f)} = 0$. Most existing literature analyzes the result aggregation rule under the impractical assumption that $L_{(D,W,f)}(h) = 0$. When every worker has identical ability (i.e., $w_1 = w_2 = \cdots = w_m$), for task j, more than $\frac{2(b-a)^2}{\mu_i^2}\ln\frac{K-1}{\eta}$ workers suffice to make the error rate of the aggregation rule less than $\eta$, where $\mu_i = \sum_{l=1}^{L}\pi^{ij}_{kl}(As_j(i,g,l) - As_j(i,k,l))$.

Next we employ PAC learnability to prove Theorem 1.

Proof of Theorem 1. Based on Theorem 2, we set $\mu_i = \sum_l \pi^{ij}_{kl}(As_j(i,g,l) - As_j(i,k,l))$. Workers' abilities vary across workers in Definition 1, so we obtain

$$L_{(f,Y)} \le (K-1)\exp\left(-\frac{\left(\sum_{i=1}^{m'}\mu_i\right)^2}{2m'(a-b)^2}\right) = \eta_U.$$

Since $\eta_{m'} = \exp(E_W(\ln\eta_U))$ and $\eta_{m'} \le \eta$ in Definition 1, we obtain

$$E_W\left[\left(\sum_{i=1}^{m'}\mu_i\right)^2\right] \ge -2m'\left(\ln\frac{\eta}{K-1}\right)(a-b)^2.$$

Based on the property of variance, $E_W\left[\left(\sum_{i=1}^{m'}\mu_i\right)^2\right] - E_W^2\left(\sum_{i=1}^{m'}\mu_i\right) = D_W\left(\sum_{i=1}^{m'}\mu_i\right)$, we get

$$E_W^2\left(\sum_{i=1}^{m'}\mu_i\right) + D_W\left(\sum_{i=1}^{m'}\mu_i\right) \ge -2m'\left(\ln\frac{\eta}{K-1}\right)(a-b)^2.$$

Since all workers come from the same worker pool following a certain distribution W over ability, all $E_W(\mu_i)$ are equal, all $D_W(\mu_i)$ are fixed, and workers are independent of each other. Then we get

$$m'^2 E_W^2(\mu_i) + m' D_W(\mu_i) \ge -2m'\left(\ln\frac{\eta}{K-1}\right)(a-b)^2,$$

which can be simplified as

$$m' \ge \frac{-2\left(\ln\frac{\eta}{K-1}\right)(a-b)^2 - D_W(\mu_i)}{E_W^2(\mu_i)}.$$

Since the aggregation function contains parameters, we need parameter learning. Based on Lemma 1, we get

$$n' \ge \frac{2}{\epsilon^2(1-2\eta)^2}\ln\left(\frac{2|H|}{\delta}\right).$$

As $c = m'n'$ in Definition 1, we can derive

$$O_W = \frac{2}{\epsilon^2(1-2\eta)^2}\ln\left(\frac{2|H|}{\delta}\right) \cdot \frac{-2\left(\ln\frac{\eta}{K-1}\right)(a-b)^2 - D_W(\mu_i)}{E_W^2(\mu_i)}.$$

Corollary 3. Let $f(X_j)$ be the aggregation function of crowdsourcing for each task j ($As_j(i,g,l) \in [a,b]$), and suppose all workers have identical ability $\pi$. Then the cost complexity $O_w$ can be computed as follows:

$$O_w(\epsilon, \eta, \delta, f) = \frac{2}{\epsilon^2(1-2\eta)^2}\ln\left(\frac{2|H|}{\delta}\right) \cdot \frac{-2\left(\ln\frac{\eta}{K-1}\right)(a-b)^2}{w^2}, \tag{16}$$

where $w = \sum_l \pi^{ij}_{kl}(As_j(i,g,l) - As_j(i,k,l))$ is the same for all workers.
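To see numerically how the pieces of the proof combine, the sketch below evaluates the Lemma 1 sample size $n'$ and the overall error rate $err(h)$ of Eq. (13) from an assumed parameter-learning error and aggregation error; all concrete numbers are arbitrary illustrations.

```python
import math

def lemma1_sample_size(eps, eta, delta, H_size):
    """Lemma 1 (Eq. 14): n' >= 2 / (eps^2 (1 - 2 eta)^2) * ln(2|H| / delta)."""
    return 2.0 / (eps**2 * (1 - 2 * eta)**2) * math.log(2 * H_size / delta)

def combined_error(L_hf, L_fY):
    """Eq. (13): err(h) = (1 - 2 L(h,f)) L(f,Y) + L(h,f)."""
    return (1 - 2 * L_hf) * L_fY + L_hf

print("n' >=", lemma1_sample_size(eps=0.05, eta=0.2, delta=0.05, H_size=10**6))
# With perfect parameter learning (L(h,f) = 0), err(h) reduces to L(f,Y):
print(combined_error(L_hf=0.0, L_fY=0.08))    # -> 0.08
# With imperfect learning the overall error grows:
print(combined_error(L_hf=0.05, L_fY=0.08))   # -> 0.122
```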
This section gives a general method to compute the cost complexity of crowdsourcing, which is determined by the distribution of workers over ability, the parameter learning and the answer aggregation in result inference. Theorems 1 and 2, the major contributions of this work, provide methods to compute the cost complexity and the upper bound on the mean error rate, respectively, for general crowdsourcing workflows. Corollaries 1-3 give some interesting results obtained by applying the two theorems to special cases. Due to space limitations, the proofs of Corollaries 1-3 are shared on the web¹.

¹ Link to the proofs of Corollaries 1-3: https://goo.gl/ZhKddo

5 Case Studies

In this section, we aim to verify the effectiveness of Theorem 1 by applying it to three representative result aggregation algorithms: MV, WMV and HDS. Specifically, we conduct case studies both theoretically and empirically.

5.1 Theoretical Analysis

First, for MV, all parameters of f are known; in other words, h is equal to f. Then we have Corollary 4.

Corollary 4. Let $\mu_i = 1 - 2w_i$. Then $O_W$ for MV ($f = \arg\max_k \sum_{i=1}^{m} I(x_{ij} = k)$) is computed as follows:

$$O_W(\epsilon, \eta, \delta, f) = \frac{-2\ln\frac{\eta}{K-1} - D_W(\mu_i)}{E_W^2(\mu_i)}. \tag{17}$$

Similarly, we can obtain Corollary 5 for WMV:

Corollary 5. Let $\mu_i = w_i(1 - 2w_i)$. Then $O_W$ for WMV ($f = \arg\max_k \sum_{i=1}^{m} w_i I(x_{ij} = k)$) is as follows:

$$O_W(\epsilon, \eta, \delta, f) = \frac{2}{\epsilon^2(1-2\eta)^2}\ln\left(\frac{2|H|}{\delta}\right) \cdot \frac{-2\ln\frac{\eta}{K-1} - D_W(\mu_i)}{E_W^2(\mu_i)}. \tag{18}$$

Furthermore, for HDS, we have Corollary 6:

Corollary 6. Let $\mu_i = \ln\frac{w_i}{1-w_i}(1 - 2w_i)$ and let $f(X_j)$ be the aggregation function of HDS. Then $O_W$ is as below:

$$O_W(\epsilon, \eta, \delta, f) = \frac{2}{\epsilon^2(1-2\eta)^2}\ln\left(\frac{2|H|}{\delta}\right) \cdot \frac{-2\ln\frac{\eta}{K-1} - D_W(\mu_i)}{E_W^2(\mu_i)}. \tag{19}$$

Equation (13) involves two types of error rate: the parameter-learning error and the answer-aggregation error. Suppose we fix the number of workers processing one task; then we can obtain the upper bound on the error rate of the answer aggregation function. As shown in Fig. 2, we plot the theoretical upper bound on the mean error rate for the three algorithms MV, WMV and HDS. The upper bound for MV is stable (namely $\eta_{MV}$) as the number of tasks processed by each worker varies, while the upper bound for WMV/HDS decreases as the number of tasks grows, and WMV outperforms HDS. When the number of tasks processed by each worker is less than $\frac{1}{(\eta_{MV}-\eta_{WMV})^2}\ln\frac{2|H|}{\delta}$ (respectively $\frac{1}{(\eta_{MV}-\eta_{HDS})^2}\ln\frac{2|H|}{\delta}$), MV outperforms WMV (respectively HDS), which explains the contradicting results in existing work [Zhou et al., 2012] and [Han et al., 2016].

Figure 2: The upper bound of error rate for the three methods (x-axis: number of tasks processed by each worker; y-axis: upper bound of the error rate).

Note: as Corollaries 4-6 can be proved by simply substituting the expectation and variance of the worker distribution over ability (i.e., (12)) into Theorem 1, we omit the proofs.
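To illustrate qualitatively how the behavior in Fig. 2 arises from Corollaries 4-6 together with Eq. (13), the sketch below combines assumed aggregation-error bounds $\eta_{MV} > \eta_{HDS} > \eta_{WMV}$ with a Lemma 1-style parameter-learning error that shrinks as each worker completes more tasks. Every constant ($|H|$, $\delta$, the $\eta$ values, the task counts) is an invented illustration, not a value from the paper.

```python
import math

def learning_error(n_tasks, H_size=1000, delta=0.05, eta=0.1):
    """Invert Lemma 1 (Eq. 14): the parameter-learning error achievable with
    probability 1 - delta after n_tasks answers per worker (capped at 0.5)."""
    eps = math.sqrt(2.0 * math.log(2 * H_size / delta) / (n_tasks * (1 - 2 * eta)**2))
    return min(eps, 0.5)

def err_upper_bound(eta_agg, eps_learn):
    """Eq. (13) with L(f,Y) <= eta_agg and L(h,f) <= eps_learn."""
    return (1 - 2 * eps_learn) * eta_agg + eps_learn

# Assumed aggregation-rule error bounds: MV needs no parameter learning,
# so its bound stays constant in the number of tasks per worker.
eta_mv, eta_wmv, eta_hds = 0.20, 0.10, 0.12

for n in (500, 2000, 8000, 32000):
    eps = learning_error(n)
    bounds = {"MV": eta_mv,
              "WMV": err_upper_bound(eta_wmv, eps),
              "HDS": err_upper_bound(eta_hds, eps)}
    print(n, {name: round(v, 3) for name, v in bounds.items()})
# MV is flat; WMV/HDS start above MV and drop below it once each worker
# completes enough tasks, mirroring the crossover described for Fig. 2.
```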
5.2 Empirical Analysis

Here we present the experimental analysis of error rates on two real-world crowdsourcing datasets: dog [Zhou et al., 2012] and temp [Snow et al., 2008]. The maximum number of tasks processed by a single worker is 345 and 462, respectively, for the two datasets. For each dataset, we first grouped the answers into subsets according to worker IDs and then varied the number of tasks that a worker processes. Next we inferred the results with the three algorithms MV, WMV and HDS, and computed the corresponding error rates. The results are plotted in Fig. 3 and Fig. 4.

Figure 3: The error rate analysis on the dog dataset (x-axis: number of tasks processed by each worker; y-axis: empirical error rate).

Figure 4: The error rate analysis on the temp dataset (x-axis: number of tasks processed by each worker; y-axis: empirical error rate).

As the results presented here are error rates rather than upper bounds on the error rate as in Fig. 2, the plots fluctuate as the number of tasks increases, but they demonstrate trends similar to our theoretical analysis in Fig. 2. Note that in our experiments, for a specific number of tasks x, if a worker finishes x' (x' > x) tasks, we replace that worker with a group of simulated workers. Each simulated worker finishes x or (x' mod x) tasks, and the total number of tasks they finish is exactly x'. For instance, for x = 10, a worker who finishes 23 tasks in reality is replaced by three simulated workers who finish 10, 10 and 3 tasks, respectively.
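A minimal sketch of this splitting procedure, assuming each worker's answers are already grouped by worker ID; the data layout and names are illustrative only.

```python
def split_worker(answers, x):
    """Split one worker's answer list into chunks of at most x answers:
    a worker with x' > x answers becomes simulated workers finishing x tasks each,
    plus one finishing (x' mod x) tasks if x' is not a multiple of x."""
    return [answers[i:i + x] for i in range(0, len(answers), x)]

def cap_tasks_per_worker(answers_by_worker, x):
    """Rebuild the dataset so that no (simulated) worker exceeds x tasks."""
    capped = {}
    for worker_id, answers in answers_by_worker.items():
        for part, chunk in enumerate(split_worker(answers, x)):
            capped[f"{worker_id}#{part}"] = chunk
    return capped

# A worker who answered 23 tasks becomes simulated workers with 10, 10 and 3 tasks.
demo = {"worker_42": [("task_%d" % t, t % 2) for t in range(23)]}
for wid, ans in cap_tasks_per_worker(demo, x=10).items():
    print(wid, len(ans))
```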
6 Conclusion

This work studies the cost complexity of general crowdsourcing workflows. We first give two definitions of the cost complexity of crowdsourcing, i.e., $O_W$ and $O_w$, for different distributions over workers' abilities. Then we present two theorems to compute the cost complexity and the upper bound on the mean error rate, respectively, for general crowdsourcing workflows. We further specialize our theoretical results and obtain a set of corollaries that have been verified in existing work. Finally, to verify the effectiveness of our methods, we present a set of case studies for three representative answer aggregation methods, both theoretically and empirically, which explains the contradicting experimental results reported previously. In all, our work benefits the design and evaluation of crowdsourcing algorithms.

Acknowledgements

This work was supported partly by the National Basic Research Program of China (973 Program) under Grant Nos. 2015CB358700 and 2014CB340304, partly by the National Key Research and Development Program of China under Grant No. 2016YFB1000804, and partly by the National Natural Science Foundation under Grant No. 61421003.

References

[Ambati et al., 2011] Vamshi Ambati, Stephan Vogel, and Jaime Carbonell. Towards task recommendation in micro-task markets. In HCOMP, pages 80-83, 2011.

[AMT, 2017] AMT developer guide. http://docs.aws.amazon.com/AWSMechTurk/latest/AWSMechanicalTurkRequester/amt-dg.pdf, 2017.

[Angluin and Laird, 1988] Dana Angluin and Philip Laird. Learning from noisy examples. Machine Learning, 2(4):343-370, 1988.

[Balcan et al., 2010] Maria Florina Balcan, Steve Hanneke, and Jennifer Wortman Vaughan. The true sample complexity of active learning. Machine Learning, 80(2):111-139, 2010.

[Dawid and Skene, 1979] Alexander Philip Dawid and Allan M. Skene. Maximum likelihood estimation of observer error-rates using the EM algorithm. Applied Statistics, pages 20-28, 1979.

[Gao and Zhou, 2016] Chao Gao and Dengyong Zhou. Minimax optimal convergence rates for estimating ground truth from crowdsourced labels. arXiv preprint, 2016.

[Gao et al., 2016] Chao Gao, Yu Lu, and Dengyong Zhou. Exact exponent in optimal rates for crowdsourcing. In ICML, pages 603-611, 2016.

[Han et al., 2016] Tao Han, Hailong Sun, Yangqiu Song, Yili Fang, and Xudong Liu. Incorporating external knowledge into crowd intelligence for more specific knowledge acquisition. In IJCAI, pages 1541-1547, 2016.

[Ho et al., 2013] Chien-Ju Ho, Shahin Jabbari, and Jennifer Wortman Vaughan. Adaptive task assignment for crowdsourced classification. In ICML, pages 534-542, 2013.

[Ipeirotis and Gabrilovich, 2014] Panagiotis G. Ipeirotis and Evgeniy Gabrilovich. Quizz: Targeted crowdsourcing with a billion (potential) users. In WWW, pages 143-154, 2014.

[Jung, 2014] Hyun Joon Jung. Quality assurance in crowdsourcing via matrix factorization based task routing. In WWW, pages 3-8, 2014.

[Kulkarni, 2011] Anand Kulkarni. The complexity of crowdsourcing: Theoretical problems in human computation. In CHI Workshop on Crowdsourcing and Human Computation, 2011.

[Li and Liu, 2015] Hongwei Li and Qiang Liu. Cheaper and better: Selecting good workers for crowdsourcing. arXiv preprint, 2015.

[Liu et al., 2012] Qiang Liu, Jian Peng, and Alexander Ihler. Variational inference for crowdsourcing. In NIPS, pages 692-700, 2012.

[Marcus et al., 2015] Adam Marcus et al. Argonaut: Macrotask crowdsourcing for complex data processing. Proceedings of the VLDB Endowment, 8(12):1642-1653, 2015.

[McAllester, 1998] David A. McAllester. Some PAC-Bayesian theorems. In COLT, pages 230-234, 1998.

[Nushi et al., 2015] Besmira Nushi, Adish Singla, Anja Gruenheid, Erfan Zamanian, Andreas Krause, and Donald Kossmann. Crowd access path optimization: Diversity matters. In HCOMP, pages 130-139, 2015.

[Raykar et al., 2010] Vikas C. Raykar, Shipeng Yu, Linda H. Zhao, Gerardo Hermosillo Valadez, Charles Florin, Luca Bogoni, and Linda Moy. Learning from crowds. Journal of Machine Learning Research, 11:1297-1322, 2010.

[Roy et al., 2015] Senjuti Basu Roy, Ioanna Lykourentzou, Saravanan Thirumuruganathan, Sihem Amer-Yahia, and Gautam Das. Task assignment optimization in knowledge-intensive crowdsourcing. The VLDB Journal, 24(4):467-491, 2015.

[Salek et al., 2013] M. Salek, Y. Bachrach, and P. Key. Hotspotting - a probabilistic graphical model for image object localization through crowdsourcing. In AAAI, pages 1156-1162, 2013.

[Shahaf and Amir, 2007] Dafna Shahaf and Eyal Amir. Towards a theory of AI completeness. In AAAI Spring Symposium: Logical Formalizations of Commonsense Reasoning, pages 150-155, 2007.

[Snow et al., 2008] Rion Snow, Brendan O'Connor, Daniel Jurafsky, and Andrew Y. Ng. Cheap and fast - but is it good? Evaluating non-expert annotations for natural language tasks. In EMNLP, pages 254-263, 2008.

[Tran-Thanh et al., 2013] Long Tran-Thanh, Matteo Venanzi, Alex Rogers, and Nicholas R. Jennings. Efficient budget allocation with accuracy guarantees for crowdsourcing classification tasks. In AAMAS, pages 901-908, 2013.

[Venanzi et al., 2014] Matteo Venanzi, John Guiver, Gabriella Kazai, Pushmeet Kohli, and Milad Shokouhi. Community-based Bayesian aggregation models for crowdsourcing. In WWW, pages 155-164, 2014.

[Wang and Zhou, 2016] Lu Wang and Zhihua Zhou. Cost-saving effect of crowdsourcing learning. In IJCAI, pages 2111-2117, 2016.

[Wing, 2008] Jeannette M. Wing. Five deep questions in computing. CACM, 51(1):58-60, 2008.

[Yu et al., 2017] Han Yu, Chunyan Miao, Yiqiang Chen, Simon Fauvel, Xiaoming Li, and Victor R. Lesser. Algorithmic management for improving collective productivity in crowdsourcing. Scientific Reports, 7(1):12541, 2017.

[Zhou et al., 2012] Dengyong Zhou, John C. Platt, Sumit Basu, and Yi Mao. Learning from the wisdom of crowds by minimax entropy. In NIPS, pages 2195-2203, 2012.
[Zhou et al., 2015] Dengyong Zhou, Qiang Liu, John C. Platt, Christopher Meek, and Nihar B. Shah. Regularized minimax conditional entropy for crowdsourcing. arXiv preprint, 2015.