# Streaming Bayesian Inference for Crowdsourced Classification

Edoardo Manino (University of Southampton, E.Manino@soton.ac.uk), Long Tran-Thanh (University of Southampton, l.tran-thanh@soton.ac.uk), Nicholas R. Jennings (Imperial College London, n.jennings@imperial.ac.uk)

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

Abstract. A key challenge in crowdsourcing is inferring the ground truth from noisy and unreliable data. To do so, existing approaches rely on collecting redundant information from the crowd and aggregating it with some probabilistic method. However, such methods are often computationally inefficient, restricted to specific settings, or lacking in theoretical guarantees. In this paper, we revisit the problem of binary classification from crowdsourced data. Specifically, we propose Streaming Bayesian Inference for Crowdsourcing (SBIC), a new algorithm that does not suffer from any of these limitations. First, SBIC has low complexity and can be used in a real-time online setting. Second, SBIC has the same accuracy as the best state-of-the-art algorithms in all settings. Third, SBIC has provable asymptotic guarantees in both the online and offline settings.

1 Introduction

Crowdsourcing works by collecting the annotations of large groups of human workers, typically through an online platform like Amazon Mechanical Turk (www.mturk.com) or Figure Eight (www.figure-eight.com). On one hand, this paradigm can help process high volumes of small tasks that are currently difficult to automate at an affordable price [Snow et al., 2008]. On the other hand, the open nature of the crowdsourcing process gives no guarantees on the quality of the data we collect. Leaving aside malicious attempts at thwarting the result of the crowdsourcing process [Downs et al., 2010], even well-intentioned crowdworkers can report incorrect answers [Ipeirotis et al., 2010]. Thus, the success of a crowdsourcing project relies on our ability to reconstruct the ground truth from the noisy data we collect.

This challenge has attracted the attention of the research community, which has explored a number of algorithmic solutions. Some authors focus on probabilistic inference on graphical models, including the early work of Dawid and Skene [1979] on EM estimation, variational inference [Welinder and Perona, 2010; Liu et al., 2012] and belief propagation [Karger et al., 2014]. These techniques are stable in most settings and easy to generalise to more complex models (e.g. [Kim and Ghahramani, 2012]), but they generally require several passes over the entire dataset to converge and lack theoretical guarantees. In contrast, other authors have turned to tensor factorisation [Dalvi et al., 2013; Zhang et al., 2016] and the method of moments [Bonald and Combes, 2017]. This choice yields algorithms with tractable theoretical behaviour, but the assumptions required to do so restrict them to a limited number of settings.

At the same time, there have been several calls to focus on how we sample the data from the crowd, rather than how we aggregate it [Welinder and Perona, 2010; Barowy et al., 2012; Simpson and Roberts, 2014; Manino et al., 2018]. All of these end up recommending some form of adaptive strategy, which samples more data on the tasks where the crowd is disagreeing the most.
Employing one of these strategies improves the final accuracy of the crowdsourcing system, but requires the ability to work in an online setting. Thus, in order to perform crowdsourcing effectively, our algorithms must be computationally efficient. In this paper, we address these research challenges on the problem of binary classification from crowdsourced data, and make the following contributions to the state of the art:

- We introduce Streaming Bayesian Inference for Crowdsourcing (SBIC), a new algorithm based on approximate variational Bayes. This algorithm comes in two variants. The first, Fast SBIC, has similar computational complexity to the quick majority rule, but delivers more than an order of magnitude higher predictive accuracy. The second, Sorted SBIC, is more computationally intensive, but delivers state-of-the-art predictive accuracy in all settings.
- We quantify the asymptotic performance of SBIC in both the offline and online settings analytically. Our theoretical bounds closely match the empirical performance of SBIC.

The paper is structured in the following way. In Section 2 we introduce the most popular model of crowdsourced classification, and the existing aggregation methods. In Section 3 we present the SBIC algorithm in its two variants. In Section 4 we compute its asymptotic accuracy. In Section 5 we compare its performance with the state of the art on both synthetic and real-world datasets. In Section 6 we conclude and outline possible future work.

2 Preliminaries

Existing works in crowdsourced classification are mostly built around the celebrated Dawid-Skene model [Dawid and Skene, 1979]. In this paper we adopt its binary, or one-coin, variant, which has received considerable attention from the crowdsourcing community [Liu et al., 2012; Karger et al., 2014; Bonald and Combes, 2017; Manino et al., 2018]. The reason for this is that it allows us to study the fundamental properties of the crowdsourcing process without dealing with the peculiarities of more complex scenarios. Furthermore, generalising to the multi-class case is usually straightforward (e.g. [Gao et al., 2016]).

2.1 The one-coin Dawid-Skene model

According to this model, the objective is to infer the binary ground-truth class $y_i \in \{\pm 1\}$ of a set of tasks $M$, with $i \in M$. To do so, we can interact with the crowd of workers $N$, and ask them to submit a set of labels $X = \{x_{ij}\}$, where $j \in N$ is the worker's index. We have no control over the availability of the workers, and we assume that we interact with them in a sequential fashion. Thus, at each time step $t$ a single worker $j = a(t)$ becomes available, gets assigned to a task $i$ and provides the label $x_{ij} \in \{\pm 1\}$ in exchange for a unitary payment. We assume that we can collect an average of $R \leq |N|$ labels per task, for a total budget of $T = R|M|$ labels. With a slight abuse of notation, we set $x_{ij} = 0$ for any missing task-worker pairs, so that we can treat $X$ as a matrix when needed. On a similar note, we use $M_j$ to denote the set of tasks executed by worker $j$, and $N_i$ for the set of workers on task $i$. Furthermore, we use the superscript $t$ (e.g. $X^t$) to denote the information visible up to time $t$.

A key feature of the one-coin Dawid-Skene model is that each worker has a fixed probability $P(x_{ij} = y_i) = p_j$ of submitting a correct label. That means that the workers behave like independent random variables (conditioned on the ground truth $y_i$), and their accuracy $p_j$ remains stable over time and across different tasks.
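To make the model concrete, the following is a minimal Python sketch (ours, not the paper's code) of how one could simulate a one-coin Dawid-Skene crowd. The function name, the round-robin task assignment and the default parameter values are illustrative assumptions; the worker accuracies are drawn from a Beta(α, β) prior as in the synthetic experiments of Section 5.

```python
import numpy as np

def simulate_one_coin_ds(n_tasks, n_workers, labels_per_worker,
                         alpha=4.0, beta=3.0, q=0.5, seed=0):
    """Draw a synthetic dataset from the one-coin Dawid-Skene model.

    Returns the ground truth y (±1), the worker accuracies p, and a list of
    (task, worker, label) triples in the order they would arrive.
    """
    rng = np.random.default_rng(seed)
    y = np.where(rng.random(n_tasks) < q, 1, -1)      # ground-truth classes
    p = rng.beta(alpha, beta, size=n_workers)         # fixed worker accuracies
    stream = []
    next_task = 0
    for j in range(n_workers):                        # workers arrive sequentially
        for _ in range(labels_per_worker):
            i = next_task % n_tasks                   # simple round-robin assignment
            next_task += 1
            correct = rng.random() < p[j]
            x_ij = y[i] if correct else -y[i]         # noisy label in {±1}
            stream.append((i, j, x_ij))
    return y, p, stream

y, p, stream = simulate_one_coin_ds(n_tasks=50, n_workers=100, labels_per_worker=5)
print(len(stream), "labels collected; mean worker accuracy:", round(p.mean(), 3))
```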
2.2 Sampling the data

When interacting with the crowd, we need to decide which tasks to allocate the incoming workers to. The sampling policy π we use to make these allocations has a considerable impact on the final accuracy of our predictions, as demonstrated by Manino et al. [2018]. The existing literature provides us with the following two main options.

Uniform Sampling (UNI). This policy allocates the same number of workers $|N_i| \approx T/|M|$ to each task $i$ (rounded to the closest integer). The existing literature does not usually specify how this policy is implemented in practice (e.g. [Karger et al., 2014; Manino et al., 2018]). In this paper we assume a round-robin implementation, where we ensure that no worker is asked to label the same task twice:

$$\pi_{\mathrm{uni}}(t) = \operatorname*{argmin}_{i \in M \setminus M^t_{a(t)}} |N^t_i| \qquad (1)$$

where $M^t_{a(t)}$ is the set of tasks labelled so far by the currently available worker $j = a(t)$.

Uncertainty Sampling (US). A number of policies proposed in the literature are adaptive, in that they base their decisions on the data collected up to time $t$ [Welinder et al., 2010; Barowy et al., 2012; Simpson and Roberts, 2014]. In this paper we focus on the most common of them, which consists of greedily choosing the task with the largest uncertainty at each time step $t$. More formally, assume that we have a way to estimate the posterior probability of the ground truth $y$ given the current data $X^t$. Then, we can select the task to label as follows:

$$\pi_{\mathrm{us}}(t) = \operatorname*{argmin}_{i \in M \setminus M^t_{a(t)}} \Big\{ \max_{\ell \in \{\pm 1\}} P(y_i = \ell \mid X^t) \Big\} \qquad (2)$$

Compared to uniform sampling, this second policy is provably better [Manino et al., 2018]. However, it can only be implemented in an online setting, where we have estimates of the posterior on $y$ at every $t$. Producing such estimates in real time is an open challenge: current approaches are based on simple heuristics like the majority voting rule [Barowy et al., 2012]. We study the theoretical and empirical performance of SBIC under these two policies in Sections 4 and 5 respectively.

2.3 Aggregating the data

Given a (partial) dataset $X^t$ as input, there exist several methods in the literature to form a prediction $\hat{y}$ of the ground-truth classes $y$ of the tasks. The simplest is the aforementioned majority voting rule (MAJ), which forms its predictions as $\hat{y}_i = \mathrm{sign}\{\sum_{j \in N_i} x_{ij}\}$, where ties are broken at random. Alternatively, we can resort to Bayesian methods, which infer the value of the latent variables $y$ and $p$ by estimating their posterior probability $P(y, p \mid X, \theta)$ given the observed data $X$ and prior θ. In this regard, Liu et al. [2012] propose an approximate variational mean-field algorithm (AMF) and show its similarity to the original expectation-maximisation (EM) algorithm of Dawid and Skene [1979]. Conversely, Karger et al. [2014] propose a belief-propagation algorithm (KOS) with a spammer-hammer prior, and show its connection to matrix factorisation. Both these algorithms require several iterations over the whole dataset $X$ to converge to their final predictions. As another option, we can directly estimate the value of the posterior by Monte Carlo sampling (MC) [Kim and Ghahramani, 2012], even though this is usually more computationally expensive than the former two techniques. Finally, there have been attempts at applying the frequentist approach to crowdsourcing [Dalvi et al., 2013; Zhang et al., 2016; Bonald and Combes, 2017]. The resulting algorithms have tractable theoretical properties, but put strong constraints on the rank and sparsity of the task-worker matrix $X$, which limit their range of applicability. For completeness, we include in our experiments of Section 5 the Triangular Estimation algorithm (TE) recently proposed in [Bonald and Combes, 2017].
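Before moving on to SBIC, here is a minimal Python sketch (ours, not the paper's implementation) of the two sampling policies and the majority-voting baseline. The helper names and data structures are illustrative assumptions: `label_counts[i]` is the number of labels on task i, `tasks_done_by_worker` is the set of tasks already labelled by the currently available worker, `posterior_plus[i]` stands for P(y_i = +1 | X^t) as produced by whichever aggregation method is in use, and `vote_sums[i]` is the sum of the ±1 labels on task i.

```python
import numpy as np

def uni_policy(label_counts, tasks_done_by_worker):
    """Round-robin UNI policy (Eq. 1): among tasks the current worker has not
    labelled yet, pick the one with the fewest labels so far."""
    candidates = [i for i in range(len(label_counts)) if i not in tasks_done_by_worker]
    return min(candidates, key=lambda i: label_counts[i])

def us_policy(posterior_plus, tasks_done_by_worker):
    """Uncertainty sampling US policy (Eq. 2): among tasks the current worker has
    not labelled yet, pick the one whose posterior is least confident."""
    candidates = [i for i in range(len(posterior_plus)) if i not in tasks_done_by_worker]
    return min(candidates, key=lambda i: max(posterior_plus[i], 1.0 - posterior_plus[i]))

def majority_vote(vote_sums, seed=0):
    """MAJ baseline: sign of the summed ±1 labels on each task, ties broken at random."""
    rng = np.random.default_rng(seed)
    signs = np.sign(np.asarray(vote_sums, dtype=float))
    ties = signs == 0
    signs[ties] = rng.choice([-1.0, 1.0], size=int(ties.sum()))
    return signs.astype(int)
```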
3 The SBIC algorithm

In this section we introduce Streaming Bayesian Inference for Crowdsourcing (SBIC) and discuss the ideas behind it. Then, we present two variants of this method, which we call Fast SBIC and Sorted SBIC. These prioritise two different goals, namely computational speed and predictive accuracy.

Figure 1: A graphical representation of the SBIC algorithm.

The overarching goal in Bayesian inference is estimating the posterior probability $P(y, p \mid X^t, \theta)$ of the latent variables $y$ and $p$ given the data observed so far $X^t$ and the prior θ. With this piece of information, we can form our current predictions $\hat{y}^t$ on the task classes by looking at the marginal probability of each $y_i$ as follows:

$$\hat{y}^t_i = \operatorname*{argmax}_{\ell \in \{\pm 1\}} P(y_i = \ell \mid X^t, \theta) \qquad (3)$$

Unfortunately, the marginal in Equation 3 is computationally intractable in general. In fact, just summing $P(y, p \mid X^t, \theta)$ over all vectors $y$ that contain a specific $y_i$ has exponential time complexity in $|M|$. To overcome this issue, we turn to a mean-field variational approximation, as done before in [Liu et al., 2012; Kim and Ghahramani, 2012]. This allows us to factorise the posterior as follows:

$$P(y, p \mid X^t, \theta) \approx \prod_{i \in M} \mu^t_i(y_i) \prod_{j \in N} \nu^t_j(p_j) \qquad (4)$$

where the factors $\mu^t_i$ correspond to each task $i$ and the factors $\nu^t_j$ to each worker $j$. Our work diverges from the standard variational mean-field paradigm [Murphy, 2012] in that we use a novel method to optimise the factors $\mu^t$ and $\nu^t$. Previous work minimises the Kullback-Leibler (KL) divergence between the two sides of Equation 4 by running an expensive coordinate-descent algorithm with multiple passes over the whole dataset $X^t$ [Liu et al., 2012; Kim and Ghahramani, 2012]. Instead, we aim at achieving a similar result by taking a single optimisation step after observing each new data point. This yields quicker updates to $\mu^t$ and $\nu^t$, thus allowing us to run our algorithm online.

More specifically, the core ideas of the SBIC algorithm are the following. First, assume that the prior on the worker accuracy is $p_j \sim \mathrm{Beta}(\alpha, \beta)$. This assumption is standard in Bayesian statistics, since the Beta distribution is the conjugate prior of a Bernoulli-distributed random variable [Murphy, 2012]. Second, initialise the task factors $\mu^0$ to their respective prior $P(y_i = +1) = q$, that is $\mu^0_i(+1) = q$ and $\mu^0_i(-1) = 1 - q$ for all $i \in M$ (exact knowledge of α, β and q is not necessary in practice; see Section 5 for examples). Then, upon observing a new label at time $t$, update only the factor $\nu^t_j$ corresponding to the currently available worker $j = a(t)$. Thanks to the properties of the KL divergence, $\nu^t_j$ is still Beta-distributed:

$$\nu^t_j(p_j) \sim \mathrm{Beta}\Big(\sum_{i \in M^{t-1}_j} \mu^{t-1}_i(x_{ij}) + \alpha,\; \sum_{i \in M^{t-1}_j} \mu^{t-1}_i(-x_{ij}) + \beta\Big) \qquad (5)$$

where $M^{t-1}_j$ is the set of tasks labelled by worker $j$ up to time $t-1$. Next, we update the factor $\mu^t_i$ corresponding to the task we observed the new label $x_{ij}$ on:

$$\mu^t_i(y_i) \propto \begin{cases} \mu^{t-1}_i(y_i)\, \hat{p}^t_j & \text{if } x_{ij} = y_i \\ \mu^{t-1}_i(y_i)\,(1 - \hat{p}^t_j) & \text{if } x_{ij} \neq y_i \end{cases} \quad \text{where} \quad \hat{p}^t_j = \frac{\sum_{i \in M^{t-1}_j} \mu^{t-1}_i(x_{ij}) + \alpha}{|M^{t-1}_j| + \alpha + \beta} \qquad (6)$$

Finally, we can inspect the factors $\mu^t$ and form our predictions on the task classes as $\hat{y}^t_i = \operatorname*{argmax}_{\ell \in \{\pm 1\}} \{\mu^t_i(\ell)\}$. Note that we set $\hat{p}^t_j = \mathbb{E}_{\nu^t_j}\{p_j\}$ in Equation 6. An exact optimisation step would require $\hat{p}^t_j = \exp\big(\mathbb{E}_{\nu^t_j}\{\log p_j\}\big)$ instead. However, the first-order approximation we use has a negligible impact on the accuracy of the inference, as demonstrated in [Liu et al., 2012]. We summarise the high-level behaviour of the SBIC algorithm in the explanatory sketch of Figure 1.
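As a concrete illustration of the update in Equations 5 and 6, here is a minimal Python sketch of a single streaming step, written by us rather than taken from the authors' code. The data structures (a per-task belief `mu_pos` storing µ_i(+1), initialised to q, and per-worker histories of past tasks and labels, initially empty) and the default prior values are illustrative assumptions.

```python
def sbic_step(mu_pos, worker_tasks, worker_labels, i, j, x_ij, alpha=2.0, beta=1.0):
    """One streaming SBIC update for a new label x_ij in {+1, -1} on task i by worker j.

    mu_pos[i]        current belief mu_i(+1); mu_i(-1) = 1 - mu_pos[i]
    worker_tasks[j]  tasks previously labelled by worker j (list)
    worker_labels[j] the corresponding ±1 labels (list)
    """
    # Equation 5: the Beta factor of worker j counts, for each of its past labels,
    # the current belief that the label was correct, i.e. mu_h(x_hj).
    agree = sum(mu_pos[h] if x_hj == 1 else 1.0 - mu_pos[h]
                for h, x_hj in zip(worker_tasks[j], worker_labels[j]))
    p_hat = (agree + alpha) / (len(worker_tasks[j]) + alpha + beta)  # posterior mean

    # Equation 6: reweight mu_i(+1) and mu_i(-1) by p_hat or (1 - p_hat), renormalise.
    up = mu_pos[i] * (p_hat if x_ij == 1 else 1.0 - p_hat)
    down = (1.0 - mu_pos[i]) * (p_hat if x_ij == -1 else 1.0 - p_hat)
    mu_pos[i] = up / (up + down)

    # Record the new label so it informs future estimates of worker j's accuracy.
    worker_tasks[j].append(i)
    worker_labels[j].append(x_ij)
    return p_hat
```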
From Figure 1, it is easy to see that SBIC falls under the umbrella of the Streaming Variational Bayes framework [Broderick et al., 2013]: at each time step $t$ we trust our current approximations $\mu^t$ and $\nu^t$ to be close to the exact posterior, and we use their values to inform the next local updates. From another point of view, SBIC is a form of constrained variational inference, where the constraints are implicit in the local steps we make in Equations 5 and 6, as opposed to an explicit alteration of the KL objective. Finally, the sequential nature of the SBIC algorithm means that its output is deeply influenced by the order in which we process the dataset $X$. By altering this ordering, we can optimise SBIC for different applications, as we show in the next two Sections 3.1 and 3.2.

3.1 Fast SBIC

Recall that crowdsourcing benefits from an online approach, since it allows the deployment of an adaptive sampling strategy, which can greatly improve the predictive accuracy (see Section 2.2). Thus, our main goal here is computational speed, which we achieve by keeping the natural ordering of the set $X$ unaltered. We call the resulting algorithm Fast SBIC, and show its pseudocode in Algorithm 1.

Algorithm 1: Fast SBIC
Input: dataset X, availability a, policy π, prior θ
Output: final predictions ŷ^T
1:  z^0_i = log(q / (1 - q)), for all i ∈ M
2:  for t = 1 to T do
3:      i ← π(t)
4:      j ← a(t)
5:      p̂^t_j ← ( Σ_{h ∈ M^{t-1}_j} sig(x_{hj} z^{t-1}_h) + α ) / ( |M^{t-1}_j| + α + β )
6:      z^t_i ← z^{t-1}_i + x_{ij} log( p̂^t_j / (1 - p̂^t_j) )
7:      z^t_{i'} ← z^{t-1}_{i'}, for all i' ≠ i
8:  return ŷ^T_i = sign(z^T_i), for all i ∈ M

In Algorithm 1 we use the following computational tricks. First, we express the value of each factor $\mu^t_i$ in terms of its log-odds. Accordingly, Equation 6 becomes:

$$z^t_i = \log\frac{\mu^t_i(+1)}{\mu^t_i(-1)} = z^{t-1}_i + x_{ij} \log\Big(\frac{\hat{p}^t_j}{1 - \hat{p}^t_j}\Big) \quad \text{where} \quad z^0_i = \log\frac{q}{1-q} \qquad (7)$$

This has the advantage of both converting the chain of products into a summation, and removing the need to normalise the factors $\mu^t_i$. Second, we can use the current log-odds $z^t$ to compute the worker accuracy estimate as follows:

$$\hat{p}^t_j = \frac{\sum_{i \in M^{t-1}_j} \mathrm{sig}(x_{ij} z^{t-1}_i) + \alpha}{|M^{t-1}_j| + \alpha + \beta} \quad \text{where} \quad \mathrm{sig}(z^{t-1}_i) \equiv \frac{1}{1 + \exp(-z^{t-1}_i)} = \mu^{t-1}_i(+1) \qquad (8)$$

Thanks to the additive nature of Equation 7, we can quickly update the log-odds $z^t$ as we observe new labels. More in detail, in Line 1 of Algorithm 1 we set $z^0_i$ to its prior value. Then, for every new label $x_{ij}$, we estimate the mean accuracy of worker $j$ given the current value of $z^{t-1}$ (see Line 5), and add its contribution to the log-odds on task $i$ (see Line 6). In the end (Line 8), we compute the final predictions by selecting the maximum-a-posteriori class $\hat{y}^T_i = \mathrm{sign}(z^T_i)$.

This algorithm runs in $O(TL)$ time, where $L = \max_j(|M_j|)$ is the maximum number of labels per worker. This makes it particularly efficient in an online setting, e.g. under an adaptive collection strategy, since it takes only $O(L)$ operations to update its estimates after observing a new label. In Section 5 we show that its computational speed is on par with the simple majority voting scheme.
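The following is a compact Python transcription of Algorithm 1, written by us as a sketch rather than as the authors' reference implementation. It assumes that the sampling policy and worker availability have already been resolved into a stream of (task, worker, label) triples, and the default prior values are illustrative.

```python
import math

def fast_sbic(stream, n_tasks, q=0.5, alpha=2.0, beta=1.0):
    """Run Fast SBIC (Algorithm 1) over a stream of (task i, worker j, label ±1) triples.

    Returns the predictions sign(z_i) and the final log-odds z.
    """
    sig = lambda z: 1.0 / (1.0 + math.exp(-z))       # sig(z_i) = mu_i(+1)
    z = [math.log(q / (1.0 - q))] * n_tasks          # line 1: prior log-odds
    history = {}                                     # worker j -> list of (task, label)
    for i, j, x_ij in stream:                        # lines 2-4: one label per step
        past = history.setdefault(j, [])
        # line 5: posterior-mean accuracy of worker j from its past labels
        agree = sum(sig(x_hj * z[h]) for h, x_hj in past)
        p_hat = (agree + alpha) / (len(past) + alpha + beta)
        # line 6: add the label's contribution to the log-odds of task i
        z[i] += x_ij * math.log(p_hat / (1.0 - p_hat))
        past.append((i, x_ij))
    y_hat = [1 if z_i >= 0 else -1 for z_i in z]     # line 8: MAP predictions
    return y_hat, z
```

Combined with a simulator like the one sketched in Section 2.1, `y_hat, z = fast_sbic(stream, n_tasks=50)` produces predictions in a single pass over the data, which is what makes the algorithm usable inside an adaptive collection loop.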
3.2 Sorted SBIC

In an offline setting, or when more computational resources are available, we have the opportunity to trade off some of the computational speed of Fast SBIC in exchange for better predictive accuracy. We can do so by running multiple copies of the algorithm in parallel, and presenting them the labels in $X$ in different orders. We show the implementation of this idea in Algorithm 2, which we call Sorted SBIC. The intuition behind the algorithm is the following. When running Fast SBIC, the estimates $\mu^t$ and $\nu^t$ are very close to their prior in the first rounds. As time passes, two things change. First, we have more information, since we observe more data points. Second, we run more updates on each factor $\mu^t_i$ and $\nu^t_j$. Because of these, the estimates $\mu^t$ and $\nu^t$ become closer and closer to their ground-truth values. As a result, we get more accurate predictions on a specific task $i$ when the corresponding subset of labels is processed towards the end of the collection process ($t \approx T$), rather than the beginning ($t \approx 0$).

Algorithm 2: Sorted SBIC
Input: dataset X, availability a, policy π, prior θ
Output: final predictions ŷ^T
1:  s^k_i = log(q / (1 - q)), for all i ∈ M, k ∈ M
2:  for t = 1 to T do
3:      i ← π(t)
4:      j ← a(t)
5:      for all k ∈ M : k ≠ i do
6:          p̂^k_j ← ( Σ_{h ∈ M^t_j \ k} sig(x_{hj} s^k_h) + α ) / ( |M^t_j \ k| + α + β )
7:          s^k_i ← s^k_i + x_{ij} log( p̂^k_j / (1 - p̂^k_j) )
8:  z^t_i = log(q / (1 - q)), for all i ∈ M
9:  for u = 1 to t do
10:     i ← π(u)
11:     j ← a(u)
12:     p̂^i_j ← ( Σ_{h ∈ M^t_j \ i} sig(x_{hj} s^i_h) + α ) / ( |M^t_j \ i| + α + β )
13:     z^t_i ← z^t_i + x_{ij} log( p̂^i_j / (1 - p̂^i_j) )
14: return ŷ^T_i = sign(z^T_i), for all i ∈ M

We exploit this property in Sorted SBIC by keeping a separate view of the log-odds $s^k$ for each task $k \in M$ (see Line 1 of Algorithm 2). Then, every time we observe a new label $x_{ij}$, we update the views of all tasks $k$ except the one we observed the label on (see Lines 5-7). We skip it because we want to process the corresponding label $x_{ij}$ at the end. Note that in Line 6 we compute a different estimate $\hat{p}^k_j$ for each task $k \neq i$. This is because we are implicitly running $|M|$ copies of Fast SBIC, and each copy can only see the corresponding information stored in $s^k$.

Finally, we need to process all the labels we skipped. If we are running Sorted SBIC offline, we only need to do so once at the end of the collection process. Conversely, in an online setting we need to repeat the same procedure at each time step $t$. Lines 8-13 contain the corresponding pseudocode. Notice how we compute the estimates $\hat{p}^i_j$ by looking at all the tasks $M^t_j$ labelled by worker $j$ except for task $i$ itself. This is because we skipped the corresponding label $x_{ij}$ in the past, and we are processing it right now.

The implementation of Sorted SBIC presented in Algorithm 2 runs in $O(|M|TL)$ time, which is a factor $|M|$ slower than Fast SBIC, since we are running $|M|$ copies of it in parallel. By sharing the views $s^k$ across different tasks, we can reduce the complexity to $O(\log(|M|)TL)$. However, this is only possible if the algorithm is run in an offline setting, where the whole dataset $X$ is known in advance. This additional time complexity comes with improved predictive accuracy. In Sections 4 and 5 we quantify this improvement both theoretically and empirically.
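Before moving on to the analysis, here is a Python sketch of the offline variant of Algorithm 2, again ours rather than the authors' code. The reading of the pseudocode is ours as well: in particular, we include the current label in worker j's history when updating the other views (the M^t_j term in Line 6), and we process the skipped labels in a single second pass, as the offline case allows; an actual implementation may resolve these details differently.

```python
import math

def sorted_sbic_offline(stream, n_tasks, q=0.5, alpha=2.0, beta=1.0):
    """Offline Sorted SBIC sketch: one view of the log-odds per task (Algorithm 2)."""
    sig = lambda z: 1.0 / (1.0 + math.exp(-z))
    prior = math.log(q / (1.0 - q))
    s = [[prior] * n_tasks for _ in range(n_tasks)]   # s[k][i]: log-odds in view k
    by_worker = {}                                    # j -> list of (task, label)

    def worker_acc(j, exclude, view):
        """Posterior-mean accuracy of worker j, ignoring its label on `exclude`."""
        past = [(h, x) for h, x in by_worker.get(j, []) if h != exclude]
        agree = sum(sig(x * view[h]) for h, x in past)
        return (agree + alpha) / (len(past) + alpha + beta)

    # First pass (lines 2-7): update every view except the one the label belongs to.
    for i, j, x_ij in stream:
        by_worker.setdefault(j, []).append((i, x_ij))
        for k in range(n_tasks):
            if k == i:
                continue                              # label on task i is kept for the second pass
            p_hat = worker_acc(j, exclude=k, view=s[k])
            s[k][i] += x_ij * math.log(p_hat / (1.0 - p_hat))

    # Second pass (lines 8-13): process each skipped label with the matured view.
    z = [prior] * n_tasks
    for i, j, x_ij in stream:
        p_hat = worker_acc(j, exclude=i, view=s[i])
        z[i] += x_ij * math.log(p_hat / (1.0 - p_hat))
    return [1 if z_i >= 0 else -1 for z_i in z], z
```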
4 Theoretical analysis

In this section we study the predictive performance of SBIC from a theoretical perspective. As is the norm in the crowdsourcing literature, we establish an exponential relationship between the probability of an error and the average number of labels per task $R = T/|M|$, in the form $P(\hat{y}_i \neq y_i) \leq \exp(-cR + o(1))$. Computing the constant $c$ is not trivial, as its value depends not only on the properties of the crowd and the aggregation algorithm, but also on the collection policy π we use (see Section 2.2). In this regard, previous results are either very conservative [Karger et al., 2014; Manino et al., 2018], or assume a large number of labels per worker, so that the estimates of $p$ are close to their ground-truth values [Gao et al., 2016].

Here, we take a different approach and provide exponential bounds that are both close to the empirical performance of SBIC and valid for any number of labels per worker. We achieve this by focusing on the asymptotic case, where we assume that the predictions of SBIC are converging to the ground truth after observing a large enough number of labels. More formally:

Definition 1. For any small ϵ > 0, define $t_\epsilon$ as the minimum size of the dataset $X$ such that $\mu^t_i(y_i) \geq 1 - \epsilon$ for every task $i \in M$ with high probability.

For any larger dataset, i.e. when $t \geq t_\epsilon$, the term $\mu^t_i(x_{ij})$ is very close to the indicator $\mathbb{I}(x_{ij} = y_i)$. As a consequence, we can replace the worker accuracy estimates in Equation 6 with $\hat{p}^t_j = (k^t_j + \alpha)/(|M^t_j| + \alpha + \beta)$, where $k^t_j$ is the number of correct answers. With this in mind, we can establish the following bound on the performance of SBIC under the UNI policy.

Theorem 1. For a crowd of workers with accuracy $p_j \sim \mathrm{Beta}(\alpha, \beta)$, $L$ labels per worker and $R$ labels per task, the probability of an error under the UNI policy is bounded by:

$$P(\hat{y}_i \neq y_i) \leq \exp\big(R \log F(L, \alpha, \beta) + o(1)\big), \quad \text{for all } i \in M \qquad (9)$$

where $F(L, \alpha, \beta)$ depends on the variant of SBIC we use. For Sorted SBIC we have:

$$F_{\mathrm{sorted}}(L, \alpha, \beta) = \sum_{k=0}^{\bar{L}} P(k \mid \bar{L}, \alpha, \beta)\; 2\sqrt{\frac{(k + \alpha)(\bar{L} - k + \beta)}{(\bar{L} + \alpha + \beta)^2}} \qquad (10)$$

where $\bar{L} = L - 1$, and the probability of observing $k$ is:

$$P(k \mid \bar{L}, \alpha, \beta) = \binom{\bar{L}}{k} \frac{B(k + \alpha,\, \bar{L} - k + \beta)}{B(\alpha, \beta)} \qquad (11)$$

For Fast SBIC we have instead:

$$F_{\mathrm{fast}}(L, \alpha, \beta) = \frac{1}{L} \sum_{h=1}^{L} F_{\mathrm{sorted}}(h, \alpha, \beta) \qquad (12)$$

For reasons of space, we only present the intuition behind Theorem 1 here (the full proof is in Appendix A). First, $P(k \mid \bar{L}, \alpha, \beta)$ is the probability of observing a worker with accuracy $p_j \sim \mathrm{Beta}(\alpha, \beta)$ produce $k$ correct labels over a total of $\bar{L}$ labels. Second, the square-root term converges to the corresponding term $2\sqrt{p_j(1 - p_j)}$ in [Gao et al., 2016] when the estimates $\hat{p}^t_j$ become close to their ground-truth value $p_j$. Finally, the constant $F_{\mathrm{fast}}$ is averaged over $\bar{L} \in [0, L-1]$, as this is the number of past labels we use to form each worker's estimate $\hat{p}^t_j$ during the execution of Fast SBIC.

Similarly, for the US policy we have the following theorem.

Theorem 2. For a crowd of workers with accuracy $p_j \sim \mathrm{Beta}(\alpha, \beta)$, $L$ labels per worker, an average of $R$ labels per task, and $|M| \to \infty$, the probability of an error under the US policy is bounded by:

$$P(\hat{y}_i \neq y_i) \leq \exp\big(-R\, G(L, \alpha, \beta) + o(1)\big), \quad \text{for all } i \in M \qquad (13)$$

where $G(L, \alpha, \beta)$ depends on the variant of SBIC we use. For Sorted SBIC we have:

$$G_{\mathrm{sorted}}(L, \alpha, \beta) = \sum_{k=0}^{\bar{L}} P(k \mid \bar{L}, \alpha, \beta)\, \log\!\Big(\frac{k + \alpha}{\bar{L} - k + \beta}\Big)\, \frac{(k + \alpha) - (\bar{L} - k + \beta)}{\bar{L} + \alpha + \beta} \qquad (14)$$

For Fast SBIC we have instead:

$$G_{\mathrm{fast}}(L, \alpha, \beta) = \frac{1}{L} \sum_{h=1}^{L} G_{\mathrm{sorted}}(h, \alpha, \beta) \qquad (15)$$

A full proof of Theorem 2 is in Appendix A. Here, note that the logarithm term corresponds to the log-odds of a worker with accuracy $\hat{p}^t_j$, and the right-most term is the expected value of a new label $x_{ij}$ provided by said worker.

In practice, both variants of SBIC reach the asymptotic regime described in Definition 1 for fairly small values of $R$. As an example, in Figure 2 we compare our theoretical results with the empirical performance of SBIC on synthetic data. There, we can see how the slope we predict in Theorems 1 and 2 closely matches the empirical decay in the prediction error of SBIC. This is in contrast with the corresponding state-of-the-art results in [Manino et al., 2018], which apply to any state-of-the-art probabilistic inference algorithm (i.e. not MAJ) but are significantly more conservative.
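To see what these bounds imply for a concrete crowd, here is a short Python sketch that evaluates the constants of Theorems 1 and 2 numerically. It follows the formulas exactly as reconstructed above, so it should be read as an illustration rather than a verified reimplementation of the authors' analysis; the scipy dependency is our own choice for evaluating the Beta function.

```python
from math import comb, exp, log, sqrt
from scipy.special import betaln              # log of the Beta function B(., .)

def p_k(k, L_bar, a, b):
    """Beta-binomial pmf of k correct labels out of L_bar past labels (Eq. 11)."""
    return comb(L_bar, k) * exp(betaln(k + a, L_bar - k + b) - betaln(a, b))

def f_sorted(L, a, b):
    """Equation 10 (as reconstructed above), with L_bar = L - 1."""
    Lb = L - 1
    return sum(p_k(k, Lb, a, b) * 2.0 * sqrt((k + a) * (Lb - k + b)) / (Lb + a + b)
               for k in range(Lb + 1))

def g_sorted(L, a, b):
    """Equation 14 (as reconstructed above), with L_bar = L - 1."""
    Lb = L - 1
    return sum(p_k(k, Lb, a, b)
               * log((k + a) / (Lb - k + b))
               * ((k + a) - (Lb - k + b)) / (Lb + a + b)
               for k in range(Lb + 1))

def f_fast(L, a, b):
    return sum(f_sorted(h, a, b) for h in range(1, L + 1)) / L   # Equation 12

def g_fast(L, a, b):
    return sum(g_sorted(h, a, b) for h in range(1, L + 1)) / L   # Equation 15

# Error exponents per label for the synthetic setting of Section 5 (alpha=4, beta=3, L=10):
# the UNI bound decays like exp(R * log F), the US bound like exp(-R * G).
for name, val in [("log F_sorted", log(f_sorted(10, 4, 3))),
                  ("log F_fast", log(f_fast(10, 4, 3))),
                  ("G_sorted", g_sorted(10, 4, 3)),
                  ("G_fast", g_fast(10, 4, 3))]:
    print(name, round(val, 4))
```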
5 Empirical analysis

In this section we compare the empirical performance of SBIC with the state-of-the-art algorithms listed in Section 2.3. Our analysis includes synthetic data, real-world data and a discussion of time complexity. For reasons of space, we report the details of the algorithm implementations and experiment parameters in Appendix B.

Figure 2: Prediction error on synthetic data with $p_j \sim \mathrm{Beta}(\alpha, \beta)$, q = 0.5 and L = 10, as a function of the number of labels per task: (a) UNI sampling policy, (b) US sampling policy. The accuracy guarantees for SBIC are represented by a dotted line in the corresponding colour.

Synthetic data. First, we run the algorithms on synthetic data. With this choice we can make sure that the assumptions of the underlying one-coin Dawid-Skene model are met. In turn, this allows us to compare the empirical performance of SBIC with the theoretical results in Section 4. To do so, we extract workers from a distribution $p_j \sim \mathrm{Beta}(4, 3)$, representing a non-uniform population with large variance. Crucially, the mean of this distribution is above 1/2, thus ensuring that the crowd is biased towards the correct answer. Additionally, we set the number of tasks to $|M| = 1000$ and the number of labels per worker to $L = 10$. This represents a medium-sized crowdsourcing project with a high worker turnout. Finally, we run EM, AMF, MC and SBIC with parameters α and β matching the distribution of $p_j$. Conversely, MAJ and KOS do not require any extra parameters. We omit the results for TE, since in this setting the task-worker matrix $X$ is too sparse for the algorithm to produce non-random predictions.

In Figures 2a and 2b we show the results obtained under the UNI and US sampling policies respectively. For reference, we also plot the bounds of Theorems 1 and 2 up to an arbitrary $o(1)$ constant (see Section 4 for the related discussion). As expected, the performance of all algorithms under the US policy greatly improves with respect to the UNI policy. Also, notice how MAJ is consistently outperformed by the other algorithms in this setting (this is not the case on real-world data, as we show below). Additionally, both variants of SBIC perform well, with Sorted SBIC achieving state-of-the-art performance under the UNI policy and matching the computationally expensive MC algorithm under the US policy. Interestingly, Fast SBIC is asymptotically competitive as well, but suffers from an almost constant performance gap (in logarithmic scale). Finally, both EM and AMF tend to lose their competitiveness as the number of labels per task $R$ increases. This is due to their inability to form unbiased estimates of the workers' accuracy with few labels per worker. Under the US policy this may lead to poor sampling behaviour, which explains the lack of improvement in predictive accuracy for $R > 40$ in Figure 2b.

Time complexity. As we show in our experiments on synthetic data, all algorithms benefit from an adaptive sampling strategy. However, in order to deploy such a policy we need to be able to update our estimates in real time, and only the MAJ and Fast SBIC algorithms are capable of that. To prove this point, we measure the average time the algorithms take to complete the simulations presented in Figure 2b, i.e. when used in conjunction with the US policy. We plot the results in Figure 3. Note how Fast SBIC matches MAJ in terms of computational speed, whereas all the other algorithms are orders of magnitude slower.
This makes Fast SBIC the only viable alternative to MAJ for the online setting, particularly because it can deliver superior predictive accuracy.

Figure 3: Time required to complete a single run with |M| = 1000 tasks under the US policy, as a function of the number of labels per task.

Real-world data. Second, we consider the 5 publicly available datasets listed in Table 1, which come with binary annotations and ground-truth values. For more information on the datasets see [Snow et al., 2008; Welinder et al., 2010; Lease and Kazai, 2011]. The performance of the algorithms is reported in Table 2. There, we run EM, AMF, MC and SBIC with the generic prior α = 2, β = 1 and q = 1/2, as proposed in Liu et al. [2012]. Additionally, we include the Triangular Estimation (TE) algorithm from Bonald and Combes [2017], since it outputs non-random predictions on most of the aforementioned datasets.

Table 1: Summary of the real-world datasets

| Dataset | # Tasks | # Workers | # Labels | Avg. L | Avg. R |
|---------|--------:|----------:|---------:|-------:|-------:|
| Birds   | 108     | 39        | 4212     | 108    | 39     |
| Ducks   | 240     | 53        | 9600     | 181    | 40     |
| RTE     | 800     | 164       | 8000     | 49     | 10     |
| TEMP    | 462     | 76        | 4620     | 61     | 10     |
| TREC    | 711     | 181       | 2199     | 12     | 3      |

Table 2: Prediction error on the real-world datasets

| Dataset | MAJ   | EM    | AMF   | KOS   | MC    | Fast SBIC | Sorted SBIC | TE    |
|---------|-------|-------|-------|-------|-------|-----------|-------------|-------|
| Birds   | 0.241 | 0.278 | 0.278 | 0.278 | 0.341 | 0.260     | 0.298       | 0.194 |
| Ducks   | 0.306 | 0.412 | 0.412 | 0.396 | 0.412 | 0.400     | 0.405       | 0.408 |
| RTE     | 0.100 | 0.072 | 0.075 | 0.491 | 0.079 | 0.075     | 0.072       | 0.257 |
| TEMP    | 0.057 | 0.061 | 0.061 | 0.567 | 0.095 | 0.059     | 0.062       | 0.115 |
| TREC    | 0.257 | 0.217 | 0.266 | 0.259 | 0.302 | 0.251     | 0.239       | 0.451 |

Interestingly, the MAJ algorithm performs quite well and achieves the best score on the Ducks and TEMP datasets. This confirms the practitioner's knowledge that majority voting is a robust and viable algorithm in most settings. Unsurprisingly, TE achieves its best score on the Birds dataset, which has a full task-worker matrix X. On the contrary, its predictions are almost random on the TREC dataset, which has a low number of labels per worker. Finally, both variants of SBIC match the performance of the other state-of-the-art Bayesian algorithms (EM, AMF, MC), with Sorted SBIC achieving the best score on RTE, and EM on both RTE and TREC. More importantly, Fast SBIC is always close to the other algorithms, making a strong case for its computationally efficient approach to variational Bayes.

6 Conclusions

In this paper we proposed Streaming Bayesian Inference for Crowdsourcing, a new method to infer the ground truth from binary crowdsourced data. This method combines strong theoretical guarantees, state-of-the-art accuracy and computational efficiency. The latter makes it the only viable alternative to majority voting when real-time decisions need to be made in an online setting. We plan to extend these techniques to the multi-class case as our future work.

Acknowledgments

This research is funded by the UK Research Council project ORCHID, grant EP/I011587/1. The authors acknowledge the use of the IRIDIS High Performance Computing Facility, and associated support services at the University of Southampton.

References

Daniel W. Barowy, Charlie Curtsinger, Emery D. Berger, and Andrew McGregor. AutoMan: A Platform for Integrating Human-based and Digital Computation. In Proceedings of the ACM International Conference on Object Oriented Programming Systems Languages and Applications, pages 639-654, 2012.

Thomas Bonald and Richard Combes. A Minimax Optimal Algorithm for Crowdsourcing. In Proceedings of the Thirtieth International Conference on Neural Information Processing Systems, pages 4355-4363, 2017.
Tamara Broderick, Nicholas Boyd, Andre Wibisono, Ashia C. Wilson, and Michael I. Jordan. Streaming Variational Bayes. In Proceedings of the Twenty-Sixth International Conference on Neural Information Processing Systems, pages 1727-1735, 2013.

Nilesh Dalvi, Anirban Dasgupta, Ravi Kumar, and Vibhor Rastogi. Aggregating Crowdsourced Binary Ratings. In Proceedings of the 22nd International Conference on World Wide Web, pages 285-294, 2013.

Alexander P. Dawid and Allan M. Skene. Maximum Likelihood Estimation of Observer Error-Rates Using the EM Algorithm. Journal of the Royal Statistical Society. Series C (Applied Statistics), 28(1):20-28, 1979.

Julie S. Downs, Mandy B. Holbrook, Steve Sheng, and Lorrie Faith Cranor. Are Your Participants Gaming the System? Screening Mechanical Turk Workers. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pages 2399-2402, 2010.

Chao Gao, Yu Lu, and Dengyong Zhou. Exact Exponent in Optimal Rates for Crowdsourcing. In Proceedings of the Thirty-Third International Conference on Machine Learning, pages 603-611, 2016.

Panagiotis G. Ipeirotis, Foster Provost, and Jing Wang. Quality Management on Amazon Mechanical Turk. In Proceedings of the ACM SIGKDD Workshop on Human Computation, pages 64-67, 2010.

David R. Karger, Sewoong Oh, and Devavrat Shah. Budget-Optimal Task Allocation for Reliable Crowdsourcing Systems. Operations Research, 62(1):1-24, 2014.

Hyun-Chul Kim and Zoubin Ghahramani. Bayesian Classifier Combination. In Proceedings of the Fifteenth International Conference on Artificial Intelligence and Statistics, volume 22, pages 619-627, 2012.

Matthew Lease and Gabriella Kazai. Overview of the TREC 2011 Crowdsourcing Track. In Proceedings of TREC 2011, 2011.

Qiang Liu, Jian Peng, and Alexander Ihler. Variational Inference for Crowdsourcing. In Proceedings of the Twenty-Fifth International Conference on Neural Information Processing Systems, pages 692-700, 2012.

Edoardo Manino, Long Tran-Thanh, and Nicholas R. Jennings. On the Efficiency of Data Collection for Crowdsourced Classification. In Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence, pages 1568-1575, 2018.

Kevin P. Murphy. Machine Learning: A Probabilistic Perspective. The MIT Press, 2012.

Edwin Simpson and Stephen Roberts. Bayesian Methods for Intelligent Task Assignment in Crowdsourcing Systems. In Scalable Decision Making: Uncertainty, Imperfection, Deliberation, pages 1-32. Springer, 2014.

Rion Snow, Brendan O'Connor, Daniel Jurafsky, and Andrew Y. Ng. Cheap and Fast but is It Good?: Evaluating Non-expert Annotations for Natural Language Tasks. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 254-263, 2008.

Peter Welinder and Pietro Perona. Online Crowdsourcing: Rating Annotators and Obtaining Cost Effective Labels. In Proceedings of the 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops, pages 25-32, 2010.

Peter Welinder, Steve Branson, Serge Belongie, and Pietro Perona. The Multidimensional Wisdom of Crowds. In Proceedings of the Twenty-Third International Conference on Neural Information Processing Systems, pages 1-9, 2010.

Yuchen Zhang, Xi Chen, Dengyong Zhou, and Michael I. Jordan. Spectral Methods Meet EM: A Provably Optimal Algorithm for Crowdsourcing. Journal of Machine Learning Research, 17(1):3537-3580, 2016.