# Deep Learning from Crowds

Filipe Rodrigues, Francisco C. Pereira
Dept. of Management Engineering, Technical University of Denmark
Bygning 116B, 2800 Kgs. Lyngby, Denmark
rodr@dtu.dk, camara@dtu.dk

## Abstract

Over the last few years, deep learning has revolutionized the field of machine learning by dramatically improving the state-of-the-art in various domains. However, as the size of supervised artificial neural networks grows, typically so does the need for larger labeled datasets. Recently, crowdsourcing has established itself as an efficient and cost-effective solution for labeling large sets of data in a scalable manner, but it often requires aggregating labels from multiple noisy contributors with different levels of expertise. In this paper, we address the problem of learning deep neural networks from crowds. We begin by describing an EM algorithm for jointly learning the parameters of the network and the reliabilities of the annotators. Then, a novel general-purpose crowd layer is proposed, which allows us to train deep neural networks end-to-end, directly from the noisy labels of multiple annotators, using only backpropagation. We empirically show that the proposed approach is able to internally capture the reliability and biases of different annotators and achieve new state-of-the-art results for various crowdsourced datasets across different settings, namely classification, regression and sequence labeling.

## Introduction

In the last decade, deep learning has made major advances in solving artificial intelligence problems in different domains such as speech recognition, visual object recognition, object detection and machine translation (Schmidhuber 2015). This success is often attributed to its ability to discover intricate structures in high-dimensional data (LeCun, Bengio, and Hinton 2015), thereby making it particularly well suited for tackling complex tasks that are often regarded as characteristic of humans, such as vision, speech and natural language understanding. However, typically, a key requirement for learning deep representations of complex high-dimensional data is large sets of labeled data. Unfortunately, in many situations this data is not readily available, and humans are required to manually label large collections of data.

On the other hand, in recent years, crowdsourcing has established itself as a reliable solution to annotate large collections of data. Indeed, crowdsourcing platforms like Amazon Mechanical Turk (http://www.mturk.com) and Crowdflower (http://crowdflower.com) have proven to be an efficient and cost-effective way for obtaining labeled data (Snow et al. 2008; Buhrmester, Kwang, and Gosling 2011), especially for the kind of human-like tasks, such as vision, speech and natural language understanding, for which deep learning methods have been shown to excel. Even in fields like medical imaging, crowdsourcing is being used to collect the large sets of labeled data that modern data-savvy deep learning methods enjoy (Greenspan, van Ginneken, and Summers 2016; Albarqouni et al. 2016; Guan et al. 2017).
However, while crowdsourcing is scalable enough to allow labeling datasets that would otherwise be impractical for a single annotator to handle, it is well known that the noise associated with the labels provided by the various annotators can compromise practical applications that make use of such type of data (Sheng, Provost, and Ipeirotis 2008; Donmez and Carbonell 2008). Thus, it is not surprising that a large body of the recent machine learning literature is dedicated to mitigating the effects of the noise and biases inherent to such heterogeneous sources of data (e.g. Yan et al. (2014); Albarqouni et al. (2016); Guan et al. (2017)).

When learning deep neural networks from the labels of multiple annotators, typical approaches rely on some sort of label aggregation mechanism prior to training. In classification settings, the simplest and most common approach is to use majority voting, which naively assumes that all annotators are equally reliable. More advanced approaches, such as the one proposed in (Dawid and Skene 1979) and other variants (e.g. Ipeirotis, Provost, and Wang (2010); Whitehill et al. (2009)), jointly model the unknown biases of the annotators and their answers as noisy versions of some latent ground truth. Despite their improved ground truth estimates over majority voting, recent works have shown that jointly learning the classifier model and the annotators' noise model using EM-style algorithms generally leads to improved results (Raykar et al. 2010; Albarqouni et al. 2016).

In this paper, we begin by describing an EM algorithm for learning deep neural networks from crowds in multi-class classification settings, highlighting its limitations. Then, a novel crowd layer is proposed, which allows us to train neural networks end-to-end, directly from the noisy labels of multiple annotators, using only backpropagation. This alternative approach not only allows us to avoid the additional computational overhead of EM, but also leads to a general-purpose framework that generalizes trivially beyond classification settings. Empirically, the proposed crowd layer is shown to be able to automatically distinguish the good from the unreliable annotators and capture their individual biases, thus achieving new state-of-the-art results on real data from Amazon Mechanical Turk for image classification, text regression and named entity recognition. As our experiments show, when compared to the more complex EM-based approaches and other approaches from the state of the art, the crowd layer is able to achieve comparable or, in many cases, significantly superior results.

## Related work

The increasing popularity of crowdsourcing as a way to label large collections of data in an inexpensive and scalable manner has led to much interest of the machine learning community in developing methods to address the noise and trustworthiness issues associated with it. In this direction, one of the key early contributions is the work of Dawid and Skene (1979), who proposed an EM algorithm to obtain point estimates of the error rates of patients given repeated but conflicting responses to medical questions. This work was the basis for many other variants for aggregating labels from multiple annotators with different levels of expertise, such as the one proposed in (Whitehill et al. 2009), which further extends Dawid and Skene's model by also accounting for item difficulty in the context of image classification. Similarly, Ipeirotis et al.
(2010) propose using Dawid and Skene's approach to extract a single quality score for each worker that allows low-quality workers to be pruned. The approach proposed in this paper contrasts with this line of work by allowing neural networks to be trained directly on the noisy labels of multiple annotators, thereby avoiding the need to resort to prior label aggregation schemes.

Despite the generality of the label aggregation approaches described above, which can be used in combination with any type of machine learning algorithm, they are sub-optimal when compared to approaches that also jointly learn the classifier itself. One of the most prominent works in this direction is that of Raykar et al. (2010), who proposed an EM algorithm for jointly learning the levels of expertise of different annotators and the parameters of a logistic regression classifier, by modeling the ground truth labels as latent variables. This idea was later extended to other types of models such as Gaussian process classifiers (Rodrigues, Pereira, and Ribeiro 2014), supervised latent Dirichlet allocation (Rodrigues et al. 2017) and, recently, to convolutional neural networks with softmax outputs (Albarqouni et al. 2016). In this paper, we begin by describing a generalization of the approach in (Albarqouni et al. 2016) to multi-class settings, highlighting some of the technical difficulties associated with it. Then, a novel type of neural network layer is proposed, which allows the training of deep neural networks directly from the noisy labels of multiple annotators using pure backpropagation. This contrasts with most works in the literature, which rely on more complex iterative procedures based on EM. Furthermore, the simplicity of the proposed approach allows for straightforward extensions to regression and structured prediction problems.

Recently, Guan et al. (2017) also proposed an approach for training deep neural networks that exploits information about the annotators. The idea is to model the multiple experts individually in the neural network and then, while keeping their predictions fixed, independently learn averaging weights for combining them using backpropagation. Like our proposed approach, this two-stage procedure does not require an EM algorithm to estimate the annotators' weights. However, while our approach has the ability to capture the biases of the different annotators (e.g. confusing class 2 with class 4) and correct them, the approach in (Guan et al. 2017) only learns how to combine the predicted answers of multiple annotators by weighting them differently. Moreover, its two-stage learning procedure increases the computational complexity of training, whereas in our proposed approach it is kept the same. Lastly, while the work in (Guan et al. 2017) focuses only on classification, we consider regression and structured prediction problems as well.

Regarding application areas for multiple-annotator learning, some of the most popular ones are: image classification (Smyth et al. 1995; Welinder et al. 2010), computer-aided diagnosis/radiology (Raykar et al. 2010; Greenspan, van Ginneken, and Summers 2016), object detection (Su, Deng, and Fei-Fei 2012), text classification (Rodrigues et al. 2017), natural language processing (Snow et al. 2008) and speech-related tasks (Parent and Eskenazi 2011). In this paper, we will use data from some of these areas to evaluate different approaches.
Given that these are precisely some of the areas that have seen the most dramatic improvements due to recent contributions in deep learning (LeCun, Bengio, and Hinton 2015; Schmidhuber 2015), developing novel efficient algorithms for learning deep neural networks from crowds is of great importance to the field.

## EM algorithm for deep learning from crowds

Let $\mathcal{D} = \{\mathbf{x}_n, \mathbf{y}_n\}_{n=1}^{N}$ be a dataset of size $N$, where for each input vector $\mathbf{x}_n \in \mathbb{R}^D$ we are given a vector of crowdsourced labels $\mathbf{y}_n = \{y_n^r\}_{r=1}^{R}$, with $y_n^r$ representing the label provided by the $r$th annotator in a set of $R$ annotators. Following the ideas in (Raykar et al. 2010; Yan et al. 2014), we shall assume the existence of a latent true class $z_n$ whose value is, in this particular case, determined by a softmax output layer of a deep neural network parameterized by $\Theta$, and that each annotator then provides a noisy version of $z_n$ according to $p(y_n^r | z_n, \Pi^r) = \text{Multinomial}(y_n^r | \boldsymbol\pi^r_{z_n})$. This formulation corresponds to keeping a per-annotator confusion matrix $\Pi^r = (\boldsymbol\pi^r_1, \dots, \boldsymbol\pi^r_C)$ to model their expertise, where $C$ denotes the number of classes. Further assuming that annotators provide labels independently of each other, we can write the complete-data likelihood as

$$p(\mathcal{D}, \mathbf{z} \,|\, \Theta, \{\Pi^r\}_{r=1}^{R}) = \prod_{n=1}^{N} p(z_n | \mathbf{x}_n, \Theta) \prod_{r=1}^{R} p(y_n^r | z_n, \Pi^r).$$

Based on this formulation, we can derive an Expectation-Maximization (EM) algorithm for jointly learning the reliabilities of the annotators $\Pi^r$ and the parameters of the neural network $\Theta$. The expected value of the complete-data log-likelihood under a current estimate of the posterior distribution over the latent variables $q(z_n)$ is given by

$$\mathbb{E}\big[\ln p(\mathcal{D}, \mathbf{z} \,|\, \Theta, \Pi^1, \dots, \Pi^R)\big] = \sum_{n=1}^{N} \sum_{z_n} q(z_n) \ln \bigg( p(z_n | \mathbf{x}_n, \Theta) \prod_{r=1}^{R} p(y_n^r | z_n, \Pi^r) \bigg), \quad (1)$$

where the posterior $q(z_n)$ is obtained by making use of Bayes' theorem given the previous estimate of the model parameters $\{\Theta^{\text{old}}, \Pi^1_{\text{old}}, \dots, \Pi^R_{\text{old}}\}$, yielding

$$q(z_n = c) \propto p(z_n = c \,|\, \mathbf{x}_n, \Theta^{\text{old}}) \prod_{r=1}^{R} p(y_n^r \,|\, z_n = c, \Pi^r_{\text{old}}).$$

This corresponds to the E-step of EM. In the M-step, we find the new maximum likelihood estimates of the model parameters. The update for the annotators' reliability parameters is given by

$$\pi^r_{c,l} = \frac{\sum_{n=1}^{N} q(z_n = c)\, \mathbb{I}(y_n^r = l)}{\sum_{n=1}^{N} q(z_n = c)},$$

where $\mathbb{I}(y_n^r = l)$ is an indicator function that takes the value 1 when $y_n^r = l$, and zero otherwise. In practice, since crowd annotators typically only label a small portion of the data, it is particularly important to carefully impose Dirichlet priors on each $\boldsymbol\pi^r_c$ and compute MAP estimates instead, in order to avoid numerical issues. As for estimating the parameters of the deep neural network $\Theta$, we follow the approach in (Albarqouni et al. 2016) and use the noise-adjusted ground-truth estimates $q(z_n)$ to backpropagate the error through the network using standard stochastic optimization techniques such as stochastic gradient descent (SGD) or Adam (Kingma and Ba 2014).

Kindly notice how this raises the important question of how to schedule the EM steps. If we perform one EM iteration per mini-batch, we risk not having enough evidence to estimate the annotators' reliabilities. On the other hand, if we run SGD or Adam until convergence, then the computational overhead of EM becomes very large. In practice, we found that, typically, one EM iteration per training epoch provides good computational efficiency without compromising accuracy. However, this seems to vary among different datasets, thus making it hard to tune in practice.
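To make the procedure above concrete, the following is a minimal sketch of a single EM iteration under the model just described. It assumes a Keras classifier `model` with a softmax output (compiled with a categorical cross-entropy loss over soft targets), a crowd-label matrix `answers` of shape (N, R) in which missing answers are encoded as -1, and confusion matrices `pi` of shape (R, C, C); these names and encodings are our own illustrative choices, not the authors' released code.

```python
import numpy as np

def em_step(model, X, answers, pi, epochs=1):
    """One EM iteration: the E-step re-estimates q(z_n); the M-step updates the
    per-annotator confusion matrices and refits the network on q(z_n).
    answers: (N, R) with -1 for missing labels; pi: (R, C, C) floats."""
    N, R = answers.shape

    # E-step: q(z_n = c) ∝ p(z_n = c | x_n, Θ_old) * Π_r p(y_n^r | z_n = c, Π^r_old)
    q = model.predict(X)                                   # (N, C) softmax outputs
    for r in range(R):
        labeled = answers[:, r] != -1
        q[labeled] *= pi[r][:, answers[labeled, r]].T      # multiply by π^r_{c, y_n^r}
    q /= q.sum(axis=1, keepdims=True)

    # M-step (annotators): MAP estimate of each confusion matrix; the pseudo-count
    # of one plays the role of a simple Dirichlet prior and avoids zero entries.
    new_pi = np.ones_like(pi)
    for r in range(R):
        for n in np.where(answers[:, r] != -1)[0]:
            new_pi[r][:, answers[n, r]] += q[n]
    new_pi /= new_pi.sum(axis=2, keepdims=True)

    # M-step (network): backpropagate the noise-adjusted targets q(z_n) with
    # standard SGD/Adam, e.g. for one epoch per EM iteration.
    model.fit(X, q, epochs=epochs, verbose=0)
    return new_pi
```

Fitting the network on the soft targets `q` corresponds to the M-step update of $\Theta$ described above, with the one-epoch-per-EM-iteration schedule that we found to work well in practice.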
One fundamental prerequisite for the development of this EM approach was the probabilistic interpretation of the softmax output layer of deep neural networks for classification. Unfortunately, such a probabilistic interpretation is typically not available when considering, for example, continuous output variables, thereby making it more difficult to generalize this approach to regression problems. Furthermore, notice that if the target variable is a sequence (or any other structured prediction output), then the marginalization over the latent variables in (1) quickly becomes intractable, as the number of possible label sequences to sum over grows exponentially with the length of the sequence.

## Crowd layer

In this section, we propose the crowd layer: a special type of network layer that allows us to train deep neural networks directly from the noisy labels of multiple annotators, thereby avoiding some of the aforementioned limitations of EM-based approaches for learning from crowds. The intuition is rather simple. The crowd layer takes as input what would normally be the output layer of a deep neural network (e.g. softmax for classification, or linear for regression), and learns an annotator-specific mapping from the output layer to the labels of the different annotators in the crowd that captures the annotator reliabilities and biases. In this way, the former output layer becomes a bottleneck layer that is shared among the different annotators. Figure 1 illustrates this bottleneck structure in the context of a simple convolutional neural network for classification problems with 4 classes and R annotators.

Figure 1: Bottleneck structure for a CNN for classification with 4 classes and R annotators.

The idea is then that, when using the labels of a given annotator to propagate errors through the whole neural network, the crowd layer adjusts the gradients coming from the labels of that annotator according to his/her reliability by scaling them and adjusting their bias. In doing so, the bottleneck layer of the network now receives adjusted gradients from the different annotators' labels, which it aggregates and backpropagates further through the rest of the network. As it turns out, through this crowd layer, the network is able to account for unreliable annotators and even correct systematic biases in their labeling. Moreover, all of that can be done naturally within the backpropagation framework.

Formally, let $\boldsymbol\sigma$ be the output of a deep neural network with an arbitrary structure. Without loss of generality, we shall assume the vector $\boldsymbol\sigma$ to correspond to the output of a softmax layer, such that $\sigma_c$ corresponds to the probability of the input instance belonging to class $c$. The activation of the crowd layer for each annotator $r$ is then defined as $\mathbf{a}^r = f_r(\boldsymbol\sigma)$, where $f_r$ is an annotator-specific function, and the output of the crowd layer is simply the softmax of the activations, $o^r_c = e^{a^r_c} / \sum_{l=1}^{C} e^{a^r_l}$. The question is then how to define the function mapping $f_r(\boldsymbol\sigma)$. In the experiments section, we study different alternatives. For classification problems, a reasonable assumption is to consider a matrix transformation, such that $f_r(\boldsymbol\sigma) = \mathbf{W}^r \boldsymbol\sigma$, where $\mathbf{W}^r$ is an annotator-specific matrix.
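As a concrete illustration of this matrix-transformation (MW) variant, the following is a minimal sketch of a crowd layer written as a custom Keras layer with a TensorFlow backend. The class name, the (batch, annotators, classes) output layout, and the identity initialization discussed below are choices made for the sketch and not necessarily those of the authors' released implementation.

```python
import tensorflow as tf
from tensorflow import keras

class CrowdsClassification(keras.layers.Layer):
    """Crowd layer (MW variant): one C x C matrix W^r per annotator, applied to
    the shared bottleneck softmax and followed by a per-annotator softmax."""

    def __init__(self, n_classes, n_annotators, **kwargs):
        super().__init__(**kwargs)
        self.n_classes = n_classes
        self.n_annotators = n_annotators

    def build(self, input_shape):
        # Identity initialization: every W^r starts as the identity matrix.
        self.kernel = self.add_weight(
            name="annotator_matrices",
            shape=(self.n_annotators, self.n_classes, self.n_classes),
            initializer=lambda shape, dtype=None: tf.eye(
                self.n_classes, batch_shape=[self.n_annotators],
                dtype=dtype or tf.float32),
            trainable=True)

    def call(self, sigma):
        # sigma: (batch, C) bottleneck softmax; a^r = W^r sigma for every annotator r.
        activations = tf.einsum('rcl,bl->brc', self.kernel, sigma)
        # o^r_c: per-annotator softmax over the class dimension.
        return tf.nn.softmax(activations, axis=-1)

# Usage on top of a base model whose output is a softmax over C classes, e.g. the
# binary Dogs vs. Cats setting with 5 simulated annotators described later:
# crowd_outputs = CrowdsClassification(n_classes=2, n_annotators=5)(base_model.output)
```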
Given a cost function $E(\mathbf{o}^r, y^r)$ between the expected output of the $r$th annotator and its actual label $y^r$, we can compute the gradients $\partial E / \partial \mathbf{a}^r$ at the activations $\mathbf{a}^r$ for each annotator and backpropagate them to the bottleneck layer, leading to

$$\frac{\partial E}{\partial \boldsymbol\sigma} = \sum_{r=1}^{R} \frac{\partial E}{\partial \mathbf{a}^r} \frac{\partial \mathbf{a}^r}{\partial \boldsymbol\sigma}.$$

The gradient vector at the bottleneck layer then naturally becomes a weighted sum of gradients according to the labels of the different annotators. Moreover, if an annotator is likely to mislabel class $c$ as class $l$ (annotation bias), then the matrix $\mathbf{W}^r$ can actually adjust the gradients accordingly. The problem of missing labels from some of the annotators can be easily addressed by setting their gradient contributions to zero (a sketch of this masking is given at the end of this section). As for estimating the annotator weights $\{\mathbf{W}^r\}_{r=1}^{R}$, since they parameterize the mapping from the output of the bottleneck layer $\boldsymbol\sigma$ to the annotators' labels $\{\mathbf{o}^r\}_{r=1}^{R}$, they can be estimated using standard stochastic optimization techniques such as SGD or Adam (Kingma and Ba 2014). Once the network is trained, the crowd layer can be removed, thus exposing the output of the bottleneck layer $\boldsymbol\sigma$, which can readily be used to make predictions for unseen instances.

An obvious concern with the approach described above is identifiability. Therefore, it is important not to over-parameterize $f_r(\boldsymbol\sigma)$, since adding parameters beyond what is necessary can make the output of the bottleneck layer $\boldsymbol\sigma$ lose its interpretability as a shared estimated ground truth. Another important aspect is parameter initialization. In our experiments, we found that the best practice is to initialize the crowd layer with identities, i.e. zeros for additive parameters, ones for scale parameters, the identity matrix for multiplicative matrices, etc. An alternative solution is to use regularization to force the parameters of the crowd layer to be close to identities. However, in some cases this might be an undesirable property. For example, if we consider a very biased annotator, then we do not wish to force the matrix $\mathbf{W}^r$ to be close to the identity matrix. Based on our experiments, the initialization alternative provides the best results. Lastly, it should be noted that, as with EM-based approaches, there is an implicit assumption that random or adversarial annotators do not constitute a vast majority (which generally holds in practice), in which case the crowd layer would not perform better than a random predictor.

A particularly important aspect to note is that the framework described above is quite general. For example, it can be straightforwardly applied to sequence labeling problems without further changes, or be adapted to regression problems by considering univariate scale and bias parameters per annotator in the crowd layer.
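As mentioned above, missing answers can be masked out of the loss so that they contribute exactly zero gradient. A minimal sketch of such a loss for the classification case, assuming missing answers are encoded as -1 in a dense (batch, R) label tensor and that the crowd layer outputs per-annotator class probabilities of shape (batch, R, C); the encoding and the function name are illustrative choices of ours.

```python
import tensorflow as tf

def masked_crowd_crossentropy(y_true, y_pred):
    """Cross-entropy over the crowd layer outputs, ignoring missing answers.

    y_true: (batch, R) integer labels, with -1 meaning 'annotator did not label'.
    y_pred: (batch, R, C) per-annotator class probabilities from the crowd layer.
    """
    mask = tf.cast(tf.not_equal(y_true, -1), tf.float32)           # (batch, R)
    safe_labels = tf.maximum(y_true, 0)                            # avoid -1 as an index
    loss = tf.keras.losses.sparse_categorical_crossentropy(safe_labels, y_pred)
    # Masked terms contribute zero to the loss, hence zero gradient.
    return tf.reduce_sum(loss * mask) / tf.maximum(tf.reduce_sum(mask), 1.0)

# model.compile(optimizer="adam", loss=masked_crowd_crossentropy)
```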
## Experiments

The proposed crowd layer (CL) was implemented as a new type of layer in Keras (Chollet 2015), so that using it in practice requires only a single line of code. Source code, datasets and demos for all experiments are provided at: http://www.fprodrigues.com/.

### Image classification

We begin by evaluating the proposed crowd layer in a more controlled setting, by using simulated annotators with different levels of expertise on a large image classification dataset consisting of 25000 images of dogs and cats from (Kaggle 2013), where the goal is to distinguish between the two species. Let the dog and cat classes be represented by 1 and 0, respectively. Since this is a binary classification task, we can easily simulate annotators with different levels of expertise by assigning them individual sensitivities $\alpha_r$ and specificities $\beta_r$, and sampling their answers from a Bernoulli distribution with parameter $\alpha_r$ if the true label is 1, and from a Bernoulli distribution with parameter $\beta_r$ otherwise. Using this procedure, we simulated a challenging scenario with 5 annotators with the following values: $\alpha_r = [0.6, 0.9, 0.5, 0.9, 0.9]$ and $\beta_r = [0.7, 0.8, 0.5, 0.2, 0.9]$.
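As an illustration of this simulation procedure (our own sketch, not the authors' script), the crowd answers can be drawn as follows, interpreting $\alpha_r$ and $\beta_r$ as the probabilities of answering correctly on positive and negative instances, respectively:

```python
import numpy as np

def simulate_annotators(true_labels, sensitivities, specificities, seed=0):
    """Simulate binary crowd labels: annotator r answers 1 with probability
    alpha_r when the true label is 1, and answers 0 with probability beta_r
    when the true label is 0."""
    rng = np.random.default_rng(seed)
    true_labels = np.asarray(true_labels)
    answers = np.empty((len(true_labels), len(sensitivities)), dtype=int)
    for r, (alpha, beta) in enumerate(zip(sensitivities, specificities)):
        p_answer_one = np.where(true_labels == 1, alpha, 1.0 - beta)
        answers[:, r] = rng.binomial(1, p_answer_one)
    return answers

# The five simulated annotators used in this experiment:
alphas = [0.6, 0.9, 0.5, 0.9, 0.9]
betas = [0.7, 0.8, 0.5, 0.2, 0.9]
```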
For this particular problem we used a fairly standard CNN architecture with 4 convolutional layers with 3x3 patches, 2x2 max pooling and ReLU activations. The output of the convolutional layers is then fed to a fully-connected (FC) layer with 128 ReLU units and finally goes to an output layer with a softmax activation. We use batch normalization (Ioffe and Szegedy 2015) and apply 50% dropout between the FC and output layers. The proposed approach further adds a crowd layer on top of the softmax output layer during training. The base architecture was selected from a set of possible configurations using the true labels by optimizing the accuracy on a validation set (consisting of 20% of the train set) through random search. It is important to note that it is supposed to be representative of a set of typical approaches for image classification rather than being the single best possible architecture in the literature for this particular dataset. Furthermore, our main interest in this paper is the contribution of the crowd layer to the training of the neural network.

The proposed CNN with a crowd layer (referred to as "DL-CL") is compared with: the multi-annotator approach from (Rodrigues et al. 2017) based on supervised latent Dirichlet allocation ("MA-sLDA"); a CNN trained on the result of (hard) majority voting ("DL-MV"); a CNN trained on the output of the label aggregation approach proposed by Dawid and Skene (1979) ("DL-DS"); a CNN using the EM approach described earlier ("DL-EM"); a CNN using the DoctorNet approach from (Guan et al. 2017) ("DL-DN"), which consists of training a CNN to predict the labels of the multiple annotators and then combining their predictions using majority voting; and, lastly, a CNN using the Weighted DoctorNet approach from (Guan et al. 2017) ("DL-WDN"), which is the best performing variant according to the original paper. This approach is similar to DL-DN but additionally learns how to weight the predictions of the different annotators. Kindly see (Guan et al. 2017) for further details. As a reference point, we also compare with a CNN trained on the true labels ("DL-TRUE").

We consider 3 variants of the proposed crowd layer (CL) with different annotator-specific functions $f_r$ with an increasing number of parameters: a vector of per-class weights $\mathbf{w}^r$, such that $f_r(\boldsymbol\sigma) = \mathbf{w}^r \odot \boldsymbol\sigma$ (referred to as "VW"); a vector of per-class weights together with per-class biases $\mathbf{b}^r$, such that $f_r(\boldsymbol\sigma) = \mathbf{w}^r \odot \boldsymbol\sigma + \mathbf{b}^r$ ("VW+B"); and a version with a matrix of weights $\mathbf{W}^r$, such that $f_r(\boldsymbol\sigma) = \mathbf{W}^r \boldsymbol\sigma$ ("MW"). In our experiments, we found that for approaches with more parameters than MW, such as $f_r(\boldsymbol\sigma) = \mathbf{W}^r \boldsymbol\sigma + \mathbf{b}^r$, identifiability issues start to occur.

Of the 25000 images in the Dogs vs. Cats dataset, 50% were used for training and the remaining for testing the different approaches. In order to account for the effect of the random initialization that is used for most of the parameters in the network, we performed 30 executions of all approaches and report their average accuracies in Table 1.

Table 1: Accuracy results for the classification datasets: Dogs vs. Cats and LabelMe.

| Method | Dogs vs. Cats | LabelMe (MTurk) |
|---|---|---|
| MA-sLDAc (Rodrigues et al. 2017) | - | 78.120 (± 0.397) |
| DL-MV | 71.377 (± 1.123) | 76.744 (± 1.208) |
| DL-DS (Dawid and Skene 1979) | 76.750 (± 1.282) | 80.792 (± 1.066) |
| DL-EM (Albarqouni et al. 2016) | 80.184 (± 1.454) | 82.677 (± 0.981) |
| DL-DN (Guan et al. 2017) | 79.005 (± 1.347) | 81.888 (± 1.114) |
| DL-WDN (Guan et al. 2017) | 76.822 (± 2.838) | 82.410 (± 0.783) |
| DL-CL (VW) | 79.534 (± 1.064) | 81.051 (± 0.899) |
| DL-CL (VW+B) | 79.688 (± 1.406) | 81.886 (± 0.893) |
| DL-CL (MW) | 80.265 (± 1.230) | 83.151 (± 0.877) |
| DL-TRUE | 84.912 (± 1.248) | 90.038 (± 0.652) |

We can immediately verify that both the EM-based and the crowd layer (CL) approaches significantly outperform the majority voting (DL-MV) and Dawid & Skene (DL-DS) baselines, thus demonstrating the gain of learning from the answers of multiple annotators directly rather than relying on aggregation schemes prior to training. As for the DL-DN and DL-WDN approaches from (Guan et al. 2017), we can observe that, although they also outperform the DL-MV and DL-DS baselines, their accuracy is inferior to that of the proposed DL-CL, which can be explained by the fact that DL-DN and DL-WDN are unable to correct the annotators' biases (e.g. confusing class 2 with class 4). Furthermore, it is important to recall that, due to the two-stage procedure of DL-WDN, its computational time can be significantly higher than that of DL-CL.

Regarding the different variants of the proposed crowd layer, we can verify that the MW approach is the one that gives the best average accuracy. In order to better understand what the MW approach is doing, we inspected the weight matrices $\mathbf{W}^r$ of each annotator $r$. Figure 2 shows the relationship between the diagonal elements of $\mathbf{W}^r$ and the true sensitivities and specificities of the corresponding annotators, highlighting the strong linear correlation between the two. This evidences that the proposed crowd layer is able to internally represent the reliabilities of the annotators.

Figure 2: Comparison between the true sensitivities and specificities of the annotators and the diagonal elements of their weight matrices $\mathbf{W}^r$ for the Dogs vs. Cats dataset. (a) Sensitivities (corrcoef = 0.978); (b) Specificities (corrcoef = 0.992).
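The inspection behind Figure 2 can be reproduced along the following lines. This is a sketch that assumes the crowd layer stores its per-annotator matrices as a single (R, C, C) weight array (as in the layer sketch given earlier), a handle `crowd_layer` to the trained layer, and the hypothetical `alphas`/`betas` lists from the simulation snippet.

```python
import numpy as np

# Learned per-annotator matrices from the trained crowd layer: shape (5, 2, 2).
W = crowd_layer.get_weights()[0]

# With classes {0: cat, 1: dog}, the diagonal entries W[r, 1, 1] and W[r, 0, 0]
# should correlate with annotator r's sensitivity and specificity, respectively.
print(np.corrcoef(W[:, 1, 1], alphas)[0, 1])   # vs. true sensitivities
print(np.corrcoef(W[:, 0, 0], betas)[0, 1])    # vs. true specificities
```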
Having verified that the crowd layer performs well with simulated annotators, we then moved on to evaluating it on real data from Amazon Mechanical Turk (AMT). For this purpose, we used the image classification dataset from (Rodrigues et al. 2017) adapted from part of the LabelMe data (Russell et al. 2008), whose goal is to classify images according to 8 classes: "highway", "inside city", "tall building", "street", "forest", "coast", "mountain" or "open country". It consists of a total of 2688 images, where 1000 of them were used to obtain labels from multiple annotators from Amazon Mechanical Turk. Each image was labeled by an average of 2.547 workers, with a mean accuracy of 69.2%. The remaining 1688 images were used for evaluating the different approaches. Since the training set is rather small, we use the pre-trained CNN layers of the VGG-16 deep neural network (Simonyan and Zisserman 2014) and apply only one FC layer (with 128 units and ReLU activations) and one output layer on top, using 50% dropout.

The last column of Table 1 shows the obtained results. We can once more verify that the DL-EM, DL-WDN and DL-CL approaches outperform the majority voting and Dawid & Skene baselines, and also the probabilistic approaches proposed in (Rodrigues et al. 2017) based on supervised latent Dirichlet allocation (sLDA), with the proposed crowd layer (DL-CL) again being the approach that gives the best results. However, unlike for the Dogs vs. Cats dataset, the differences between the different function mappings $f_r$ for the crowd layer (CL) become more evident. This can be justified by the ability of the MW version to model the biases of the annotators. Indeed, if we compare the learned weight matrices $\mathbf{W}^r$ with the respective true confusion matrices of the annotators, we can notice how they resemble each other. Figure 3 shows this comparison for 6 annotators, where the color intensity of the cells increases with the relative magnitude of the value, thus demonstrating that the crowd layer is able to learn the labeling patterns of the annotators.

Figure 3: Comparison between the learned weight matrices $\mathbf{W}^r$ and the corresponding true confusion matrices for six annotators (annotators 1, 9, 20, 23, 39 and 45; left: confusion matrix, right: learned weights).

### Text regression

As previously mentioned, one of the key advantages of the proposed crowd layer is its straightforward extension to other types of target variables. In this section, we consider a regression problem based on a dataset also introduced in (Rodrigues et al. 2017). This dataset consists of 5006 movie reviews, where the goal is to predict the rating given to the movie (on a scale of 1 to 10) based on the text of the review. Using AMT, the authors collected an average of 4.96 answers from a pool of 137 workers for 1500 movie reviews. The remaining 3506 reviews were used for testing. Letting the (continuous) output of the bottleneck layer be denoted $\mu$, we considered 3 variants of the proposed crowd layer with different annotator-specific functions $f_r$: a per-annotator scale parameter $s_r$, such that $f_r(\mu) = s_r \mu$ (referred to as "S"); a per-annotator bias parameter $b_r$, such that $f_r(\mu) = \mu + b_r$ ("B"); and a version with both: $f_r(\mu) = s_r \mu + b_r$ ("S+B"). The base neural network architecture used for this problem consists of a 3x3 convolutional layer with 128 features and 5x5 max pooling, a 5x5 convolutional layer with 128 features and 5x5 max pooling, and a FC layer with 32 hidden units. All layers, except for the output one, use ReLU activations.
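For regression, the crowd layer reduces to per-annotator scalar transformations of the bottleneck output $\mu$. A minimal sketch covering the three variants above, with the same caveats as before (class name, layout and flags are our own):

```python
import tensorflow as tf
from tensorflow import keras

class CrowdsRegression(keras.layers.Layer):
    """Per-annotator affine mapping f_r(mu) = s_r * mu + b_r of the bottleneck
    output. Fixing s_r = 1 gives the 'B' variant, fixing b_r = 0 gives 'S',
    and learning both gives 'S+B'."""

    def __init__(self, n_annotators, learn_scale=True, learn_bias=True, **kwargs):
        super().__init__(**kwargs)
        self.n_annotators = n_annotators
        self.learn_scale = learn_scale
        self.learn_bias = learn_bias

    def build(self, input_shape):
        # Identity initialization: scales start at 1, biases at 0.
        self.scale = self.add_weight(name="scale", shape=(self.n_annotators,),
                                     initializer="ones", trainable=self.learn_scale)
        self.bias = self.add_weight(name="bias", shape=(self.n_annotators,),
                                    initializer="zeros", trainable=self.learn_bias)

    def call(self, mu):
        # mu: (batch, 1) bottleneck output -> (batch, R) per-annotator predictions.
        return self.scale * mu + self.bias

# 'B' variant for the 137 movie-review annotators:
# outputs = CrowdsRegression(137, learn_scale=False)(bottleneck_output)
```

Training then minimizes, for example, a mean squared error between these per-annotator outputs and the observed answers, masked for missing answers exactly as in the classification case.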
The proposed DL-CL is compared with: a neural network trained on the mean answer of the annotators (DL-MEAN) and the approach from (Rodrigues et al. 2017) based on supervised LDA. In order to make the baselines even more competitive, we further propose a new variant of the EM algorithm described earlier that follows the same approach as the extension proposed in (Raykar et al. 2010) for regression problems. This approach assumes the following model for the annotators' answers given the ground truth: $p(y_n^r | z_n) = \mathcal{N}(y_n^r | z_n, 1/\lambda_r)$. Although the formulation in (Raykar et al. 2010) relies on the probabilistic interpretation of the linear regression model to develop an EM algorithm for learning, we can nevertheless adapt the resultant EM algorithm by replacing the linear regression model with a deep neural network. The final iterative procedure then alternates between computing the adjusted ground truth (E-step) and re-estimating the neural network and the annotators' parameters (M-step). Finally, although Guan et al. (2017) do not discuss extensions to regression, we also developed variants of DL-DN and DL-WDN for continuous output variables. For the DL-WDN approach, we considered different weighting functions for combining the answers of the multiple annotators, namely: a single weight per annotator, a single bias, or both. We experimented with the different alternatives and found that using a per-annotator bias for combining the answers of the multiple annotators gives the best results.

Table 2: Results for the Movie Reviews (MTurk) dataset.

| Method | MAE | RMSE | R² (%) |
|---|---|---|---|
| MA-sLDAr (Rodrigues et al. 2017) | - | - | 35.553 (± 1.282) |
| DL-MEAN | 1.215 (± 0.048) | 1.498 (± 0.050) | 31.496 (± 4.690) |
| DL-EM | 1.201 (± 0.046) | 1.482 (± 0.048) | 32.974 (± 4.457) |
| DL-DN (Guan et al. 2017) | 1.270 (± 0.021) | 1.549 (± 0.022) | 26.775 (± 2.102) |
| DL-WDN (Guan et al. 2017) | 1.261 (± 0.016) | 1.541 (± 0.018) | 27.597 (± 1.763) |
| DL-CL (S) | 1.228 (± 0.041) | 1.508 (± 0.044) | 30.560 (± 4.101) |
| DL-CL (S+B) | 1.163 (± 0.031) | 1.440 (± 0.033) | 37.086 (± 2.407) |
| DL-CL (B) | 1.130 (± 0.025) | 1.411 (± 0.028) | 39.276 (± 2.374) |
| DL-TRUE | 1.050 (± 0.029) | 1.330 (± 0.036) | 45.983 (± 2.895) |

Table 2 shows the obtained results for 30 runs of the different approaches, where we verify that the proposed crowd layer, particularly the B variant, significantly outperforms all the other methods. In order to better understand what the crowd layer in the B variant is doing, we plotted the learned $b_r$ values against the true biases of the annotators, computed as the average difference between their answers and the ground truth. Figure 4 shows this comparison, in which we can verify that the learned values of $b_r$ are highly correlated with the true biases of the annotators, thus showing that the crowd layer is able to account for annotator bias when learning from the noisy labels of multiple annotators.

Figure 4: Relationship between the learned $b_r$ parameters and the true biases of the annotators (corrcoef = 0.873).
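The comparison in Figure 4 amounts to correlating the learned $b_r$ with each annotator's empirical bias on the 1500 labeled reviews. A sketch, assuming a hypothetical `answers` array of shape (N, R) with NaN marking missing answers, a `true_ratings` array of shape (N,), and a handle `crowd_layer` to the trained 'B' layer from the sketch above:

```python
import numpy as np

# Empirical bias of each annotator: average difference between their answers
# and the true ratings, over the reviews they actually labeled.
true_bias = np.nanmean(answers - true_ratings[:, None], axis=0)

# Learned per-annotator bias parameters b_r (last weight of the 'B' crowd layer).
learned_bias = crowd_layer.get_weights()[-1]

print(np.corrcoef(true_bias, learned_bias)[0, 1])
```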
### Named entity recognition

Lastly, we evaluated the proposed crowd layer on a named entity recognition (NER) task. For this purpose, we used the AMT dataset introduced in (Rodrigues, Pereira, and Ribeiro 2013), which is based on the 2003 CoNLL shared NER task (Sang and Meulder 2003), where the goal is to identify the named entities in a sentence and classify them as persons, locations, organizations or miscellaneous. The dataset consists of 5985 sentences labeled by a pool of 47 workers. The remaining 3250 sentences of the original dataset were used for testing.

The neural network architecture used for this problem consists of a layer of 300-dimensional word embeddings initialized with the pre-trained weights of GloVe (Pennington, Socher, and Manning 2014), followed by a 5x5 convolutional layer with 512 features, whose output is then fed to a GRU cell with a 50-dimensional hidden state. The individual hidden states of the GRU are then passed to a FC layer with a softmax activation. The crowd layer uses the same annotator function mappings $f_r$ used for image classification.

The proposed crowd layer is compared with the same baselines considered for the classification problems. As previously explained, the EM approach is hard to generalize to sequence labeling problems due to the marginalization over the latent ground truth sequences in Eq. (1). In order to make this marginalization tractable, we assume a fully factorized distribution for the posterior approximation $q(\mathbf{z}_n)$, such that $q(\mathbf{z}_n) = \prod_{t=1}^{T} q(z_{nt})$, where $T$ denotes the length of the sequence. (Note that, while this makes EM tractable, the computational complexity of the E-step is now increased to $O(NTR)$.) Although the focus of this paper is on deep learning approaches, for the sake of completeness, we also compare with the results of the multi-annotator approach from (Rodrigues, Pereira, and Ribeiro 2013) based on conditional random fields (CRF-MA).

Table 3: Results for the CoNLL-2003 NER (MTurk) dataset.

| Method | Precision | Recall | F1 |
|---|---|---|---|
| CRF-MA (Rodrigues, Pereira, and Ribeiro 2013) | 0.494 | 0.856 | 0.626 |
| DL-MV | 0.664 (± 0.017) | 0.464 (± 0.021) | 0.546 (± 0.014) |
| DL-EM | 0.679 (± 0.012) | 0.499 (± 0.010) | 0.575 (± 0.008) |
| DL-DN (Guan et al. 2017) | 0.723 (± 0.009) | 0.459 (± 0.014) | 0.562 (± 0.012) |
| DL-WDN (Guan et al. 2017) | 0.611 (± 0.063) | 0.480 (± 0.058) | 0.534 (± 0.042) |
| DL-CL (VW) | 0.709 (± 0.013) | 0.472 (± 0.020) | 0.566 (± 0.016) |
| DL-CL (VW+B) | 0.603 (± 0.013) | 0.609 (± 0.012) | 0.606 (± 0.007) |
| DL-CL (MW) | 0.660 (± 0.018) | 0.593 (± 0.013) | 0.624 (± 0.010) |
| DL-TRUE | 0.711 (± 0.013) | 0.740 (± 0.009) | 0.725 (± 0.008) |

Table 3 shows the obtained average results, which clearly demonstrate that the proposed approach significantly outperforms all the other deep learning methods, and provides similar results to those of CRF-MA, while reducing the training time by at least one order of magnitude when compared to the latter (minutes instead of several hours on a Core i7 with 32GB of RAM and an NVIDIA GTX 1070).

## Conclusion

This paper proposed the crowd layer: a novel neural network layer that enables us to train deep neural networks end-to-end, directly from the labels of multiple annotators and crowds, using backpropagation. Despite its simplicity, the crowd layer is able to capture the reliabilities and biases of the different annotators and adjust the error gradients that are backpropagated during training accordingly. As our empirical evaluation shows, the proposed approach outperforms other approaches that rely on the aggregation of the annotators' answers prior to training, as well as other methods from the state of the art, which often rely on more complex, harder to set up and more computationally demanding EM-based approaches. Furthermore, unlike the latter, the crowd layer is trivial to generalize beyond classification problems, which we empirically demonstrate using real data from Amazon Mechanical Turk for text regression and named entity recognition tasks.

## Acknowledgments

The research leading to these results has received funding from the People Programme (Marie Curie Actions) of the European Union's Seventh Framework Programme (FP7/2007-2013) under REA grant agreement no. 609405 (COFUNDPostdocDTU), and from the European Union's Horizon 2020 research and innovation programme under the Marie Sklodowska-Curie Individual Fellowship H2020-MSCA-IF-2016, ID number 745673.

## References

Albarqouni, S.; Baur, C.; Achilles, F.; Belagiannis, V.; Demirci, S.; and Navab, N. 2016. AggNet: Deep learning from crowds for mitosis detection in breast cancer histology images. IEEE Transactions on Medical Imaging 35(5):1313–1321.
Buhrmester, M.; Kwang, T.; and Gosling, S. D. 2011. Amazon's Mechanical Turk: A new source of inexpensive, yet high-quality, data? Perspectives on Psychological Science 6(1):3–5.

Chollet, F. 2015. Keras. https://github.com/fchollet/keras.

Dawid, A. P., and Skene, A. M. 1979. Maximum likelihood estimation of observer error-rates using the EM algorithm. Journal of the Royal Statistical Society, Series C 28(1):20–28.

Donmez, P., and Carbonell, J. 2008. Proactive learning: Cost-sensitive active learning with multiple imperfect oracles. In Proc. of the 17th ACM Conf. on Information and Knowledge Management, 619–628.

Greenspan, H.; van Ginneken, B.; and Summers, R. M. 2016. Guest editorial: Deep learning in medical imaging: Overview and future promise of an exciting new technique. IEEE Transactions on Medical Imaging 35(5):1153–1159.

Guan, M. Y.; Gulshan, V.; Dai, A. M.; and Hinton, G. E. 2017. Who said what: Modeling individual labelers improves classification. arXiv preprint arXiv:1703.08774.

Ioffe, S., and Szegedy, C. 2015. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167.

Ipeirotis, P. G.; Provost, F.; and Wang, J. 2010. Quality management on Amazon Mechanical Turk. In Proceedings of the ACM SIGKDD Workshop on Human Computation, 64–67. ACM.

Kaggle. 2013. Dogs vs. Cats competition. https://www.kaggle.com/c/dogs-vs-cats.

Kingma, D., and Ba, J. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.

LeCun, Y.; Bengio, Y.; and Hinton, G. 2015. Deep learning. Nature 521(7553):436–444.

Parent, G., and Eskenazi, M. 2011. Speaking to the crowd: Looking at past achievements in using crowdsourcing for speech and predicting future challenges. In INTERSPEECH, 3037–3040.

Pennington, J.; Socher, R.; and Manning, C. D. 2014. GloVe: Global vectors for word representation. In EMNLP, volume 14, 1532–1543.

Raykar, V.; Yu, S.; Zhao, L.; Valadez, G.; Florin, C.; Bogoni, L.; and Moy, L. 2010. Learning from crowds. Journal of Machine Learning Research 1297–1322.

Rodrigues, F.; Lourenço, M.; Ribeiro, B.; and Pereira, F. 2017. Learning supervised topic models for classification and regression from crowds. IEEE Transactions on Pattern Analysis and Machine Intelligence.

Rodrigues, F.; Pereira, F.; and Ribeiro, B. 2013. Sequence labeling with multiple annotators. Machine Learning 1–17.

Rodrigues, F.; Pereira, F.; and Ribeiro, B. 2014. Gaussian process classification and active learning with multiple annotators. In Proc. of the 31st Int. Conf. on Machine Learning, 433–441.

Russell, B.; Torralba, A.; Murphy, K.; and Freeman, W. 2008. LabelMe: A database and web-based tool for image annotation. International Journal of Computer Vision 77(1-3):157–173.

Sang, E., and Meulder, F. D. 2003. Introduction to the CoNLL-2003 shared task: Language-independent named entity recognition. In Proceedings of the 7th Conference on Natural Language Learning at HLT-NAACL, volume 4, 142–147.

Schmidhuber, J. 2015. Deep learning in neural networks: An overview. Neural Networks 61:85–117.

Sheng, V.; Provost, F.; and Ipeirotis, P. 2008. Get another label? Improving data quality and data mining using multiple, noisy labelers. In Proc. of the Int. Conf. on Knowledge Discovery and Data Mining, 614–622.

Simonyan, K., and Zisserman, A. 2014. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556.
Smyth, P.; Fayyad, U.; Burl, M.; Perona, P.; and Baldi, P. 1995. Inferring ground truth from subjective labelling of Venus images. In Advances in Neural Information Processing Systems, 1085–1092.

Snow, R.; O'Connor, B.; Jurafsky, D.; and Ng, A. 2008. Cheap and fast - but is it good? Evaluating non-expert annotations for natural language tasks. In Proc. of the Conf. on Empirical Methods in Natural Language Processing, 254–263.

Su, H.; Deng, J.; and Fei-Fei, L. 2012. Crowdsourcing annotations for visual object detection. In Workshops at the Twenty-Sixth AAAI Conference on Artificial Intelligence, volume 1.

Welinder, P.; Branson, S.; Perona, P.; and Belongie, S. 2010. The multidimensional wisdom of crowds. In Advances in Neural Information Processing Systems, 2424–2432.

Whitehill, J.; Wu, T.-f.; Bergsma, J.; Movellan, J. R.; and Ruvolo, P. L. 2009. Whose vote should count more: Optimal integration of labels from labelers of unknown expertise. In Advances in Neural Information Processing Systems, 2035–2043.

Yan, Y.; Rosales, R.; Fung, G.; Subramanian, R.; and Dy, J. 2014. Learning from multiple annotators with varying expertise. Machine Learning 95(3):291–327.