In the rest of this work, a simple heuristic has been devised to perform this task; more advanced methods could be developed in future work.
Task splitting: The page is split into different fragments, one for each task (see Section 3.3), if needed.
Shingling/tokenisation: The html of each fragment is tokenised with classical 3-gram techniques and then shingled (Damashek, 1995), where the words are either the html tags or the raw text content. For simplicity, in the rest of this work we use the latter approach.
Simhashing: For each task (fragment), a simhash (Sadowski & Levin, 2007) is generated.
Using simhashes is the ideal solution for our problem because: (i) it allows a secure export of the task fingerprints, without the risk of leaking the task content online (as explained in Section 3.5); (ii) it is fast and scalable; (iii) it allows the estimation of similarity (by simply comparing the Manhattan distance between the simhashes) even for near-miss cases, enabling the system to recognise gold questions that differ by a small part, e.g. captchas, arithmetic questions, or programmatic gold (Oleson et al., 2011).
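As an illustration, the shingling and simhashing steps can be sketched as follows. This is a minimal sketch assuming a 64-bit fingerprint and word-level shingles; the real plugin operates on the html fragments described above, and the choice of md5 as the underlying hash is ours, for illustration only.

```python
import hashlib

def simhash(text, bits=64):
    """64-bit simhash over word-level 3-gram shingles (Damashek-style)."""
    words = text.split()
    shingles = [" ".join(words[i:i + 3]) for i in range(max(1, len(words) - 2))]
    weights = [0] * bits
    for sh in shingles:
        # any stable hash works; md5 is used here only for illustration
        h = int.from_bytes(hashlib.md5(sh.encode()).digest()[:8], "big")
        for b in range(bits):
            weights[b] += 1 if (h >> b) & 1 else -1
    # each output bit is the majority vote of the shingle hashes on that bit
    return sum(1 << b for b in range(bits) if weights[b] > 0)

def distance(a, b):
    """Manhattan (Hamming) distance between the bit representations."""
    return bin(a ^ b).count("1")
```

Near-identical fragments (e.g. two instances of the same captcha-style gold question) yield simhashes only a few bits apart, which is what the clustering step of Section 3.2 exploits.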
3.2 Server Workflow - Clustering
The server described in the previous section keeps a repository of triples (Job id; simhash; multiplicity), where multiplicity is the number of times a simhash appears in the collected data. The Manhattan distance matrix between the bit representations of the simhashes can be used to generate a clustering. Detecting similarity rather than exact matches is important because of the potential presence of noise in the task html collected from workers2 and because gold questions whose hashes differ by only a few bits (like captchas and arithmetic questions) belong to the same cluster in our framework even if they are not exact matches. After this process, each cluster will represent a specific question, and have a multiplicity
2. I.e., differing fragments due to dynamic rendering with CDN/location-dependent sources: in our preliminary analysis, the same task served to two different workers was never an exact match.
Checco, Bates & Demartini
that is equal to the number of times that question has been posed to the participating workers. We will initially assume that all workers are colluding; in Section 5.3, we discuss the effect of the number of colluding workers on the attack performance. While the architecture is agnostic to the clustering method chosen, we believe the most appropriate one for this context is agglomerative clustering: it works well on non-Euclidean distances like the one induced by simhashes, and needs only a distance-threshold parameter. Moreover, such a method can include connectivity constraints induced by the case of multiple tasks per page, as shown in Section 3.3.
3.2.1 Gold Questions Inference
If Assumptions 1 and 2 are satisfied, we expect the multiplicities of clusters to have a bimodal distribution, where clusters corresponding to gold questions have a higher mean multiplicity, as shown for a real case in Section 5, Figure 6. A Gaussian mixture model with two modes can then classify the current state of the repository of simhashes, together with an estimate of the confidence of the classification. Such a model is appropriate because (i) the choice of two components follows directly from Assumption 1, (ii) it has a minimal number of hyperparameters (which guarantees a short transient phase), and (iii) it can capture the potentially unequal variability of the two classes. The plugin has two states:
Idle state: If the model goodness of fit is low, the plugin shows that there is not enough information: any question could be gold.
Active state: When the goodness of fit is high, the plugin signals which questions in the page are likely to be gold, together with a probability score for each of them.
When only few samples are provided, overfitting can be a problem: since we have only one dimension d (frequency) for each cluster of simhashes, and two modes p, we have six degrees of freedom to estimate (p(d^2/2 + 3d/2 + 1), with d = 1 and p = 2), and thus we have to keep the plugin idle for a number of samples of about five times that (Steyerberg, Harrell, & Frank, 2003), i.e. 30 samples. After that, the Bayesian Information Criterion (bic) of the model compared with the corresponding one-component model will be used to establish the state of the plugin (Fraley & Raftery, 1998). The performance of the Gaussian mixture model (and thus the plugin state) depends on the difference between the means of the two distributions (and in some cases the two means can be very close, as shown in Section 5, Figure 6). However, when the plugin is active, a two-component Gaussian mixture model explains the data better than a fit obtained by a unique population, and for each question the worker is able to visualise the posterior probability of it being a gold question. Even after the plugin is active, when a new gold question is shown to the workers for the first few times, the plugin will not signal it as gold, but the workers will still know that the plugin is indeed active and thus gold questions are present in the job: the best behaviour for a worker to maximise quality is to work as usual and use the plugin as confirmation of the presence of gold questions. If the job presents regular patterns, like one gold question per page, then the worker can decide to employ a more aggressive behaviour, answering carefully only the questions that are signalled as gold.
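The idle/active decision can be sketched as follows, assuming scikit-learn; the multiplicity counts below are illustrative, not taken from the real logs:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# illustrative cluster multiplicities: 40 non-gold clusters seen 3 times each,
# 8 gold clusters seen 25 times each
counts = np.array([3] * 40 + [25] * 8, dtype=float).reshape(-1, 1)

if len(counts) < 30:                 # stay idle until ~5x the degrees of freedom
    state = "idle"
else:
    gmm1 = GaussianMixture(n_components=1, random_state=0).fit(counts)
    gmm2 = GaussianMixture(n_components=2, random_state=0).fit(counts)
    # active only if the two-component model explains the data better (lower BIC)
    state = "active" if gmm2.bic(counts) < gmm1.bic(counts) else "idle"

if state == "active":
    gold_mode = int(np.argmax(gmm2.means_))            # high-multiplicity mode
    p_gold = gmm2.predict_proba(counts)[:, gold_mode]  # posterior per cluster
```

The per-cluster posteriors `p_gold` are what the plugin would display to the worker when active.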
An Attack Scheme on Gold Questions in Crowdsourcing
The plugin does not need to know the proportion of gold questions used, nor the proportion of colluding workers.
3.2.2 False Positives and Sensitivity
It is worth noticing that there is a risk that non-gold questions are simhash-similar, with the consequence that some of them end up in the same cluster, causing false positives. However, this event can be detected and corrected when a page contains multiple tasks, as shown in Section 3.3. Due to the nature of the application, high recall is more important than high precision in gold question detection: from a worker's perspective, false positives will lead to additional work, but missing a gold question can potentially disrupt a worker's quality score (e.g. approval rate) in the crowdsourcing platform. The server will return a probability of being a gold question for each simhash submitted by a worker. The user is then able, via the browser plugin, to select the desired confidence threshold for the tasks being signalled, thus setting their own precision/recall trade-off.
3.3 Multiple Tasks per Page
The operation of task splitting explained in Section 3.1 is not straightforward. It can be achieved in at least two ways:
1. Platform-based heuristic (e.g. Figure Eight uses a specific html class element to identify tasks in a page).
2. Heuristic based on document size: this affects the balance between precision and recall.
In our experiments, which make use of Figure Eight (formerly known as Crowdflower) datasets, we use the former approach. If this is not possible, a more conservative solution (with more fragments) can be used to maximise recall: for example, a very conservative solution could be to split the page at the tag level (however, we note that the vast majority of crowdsourcing platforms either use one task per page, or allow a trivial identification of the splits). If multiple tasks appear in the same page, the server will be able to use this information to perform a hash parameter estimation: the minimum distance between the simhashes belonging to the same page can be used to tune the clustering method and prevent two different questions from ending up in the same cluster. Moreover, if the clustering algorithm is able to use connectivity constraints, they can be enforced for all simhashes known to be in the same page. It is possible to obtain the same information, even when only one task per page is used, by moving the parameter estimation to the client side.
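As an illustration of the platform-based heuristic, the fragmenting step could look like the following sketch, which collects the inner html of every element carrying a task-marker class. The class name "cml_row" is a stand-in of our own: the actual marker depends on the platform.

```python
from html.parser import HTMLParser

class TaskSplitter(HTMLParser):
    """Collect the inner html of every element whose class contains a marker."""
    def __init__(self, marker="cml_row"):
        super().__init__()
        self.marker, self.depth, self.fragments = marker, 0, []
    def handle_starttag(self, tag, attrs):
        if self.depth:                       # inside a task: copy the tag through
            self.depth += 1
            self.fragments[-1] += self.get_starttag_text()
        elif self.marker in (dict(attrs).get("class") or ""):
            self.depth = 1                   # a new task fragment starts here
            self.fragments.append("")
    def handle_endtag(self, tag):
        if self.depth:
            self.depth -= 1
            if self.depth:                   # closing tag inside the fragment
                self.fragments[-1] += f"</{tag}>"
    def handle_data(self, data):
        if self.depth:
            self.fragments[-1] += data

splitter = TaskSplitter()
splitter.feed('<div class="cml_row">Q1 text</div>'
              '<div class="cml_row">Q2 <b>bold</b></div>')
# splitter.fragments now holds one html fragment per task
```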
3.4 Peer-to-peer Implementation
The use of an external server to centrally collect data and perform the similarity computations can be avoided, if required. All operations of clustering and inference are lightweight and could potentially be run locally by the worker in the browser: what is needed is at
least an append-only distributed peer-to-peer database system, e.g. OrbitDB3, to collectively store and retrieve the triples (Job id; simhash; multiplicity) and locally compute the probability of a task being a gold question. This approach would significantly increase the attack robustness, as each colluding worker would act as a decentralised relay, and would make a countermeasure based on domain banning significantly harder, because no central server would be used.
3.5 Plausible Deniability and Data Security
Crowdworkers send only a simhash of the page, with identification information removed on the client side. Moreover, they do not send any information about the actual judgements being performed. This is an important aspect that makes it possible to minimise the risk of worker identification through de-anonymisation (Aggarwal, 2005), thus providing plausible deniability for the workers. Moreover, potentially sensitive data in the job cannot be reconstructed, even when the third-party server is completely compromised, reducing the legal liabilities of the parties involved.
4. Attack Model - Performance, Cold Start
To understand whether our framework is applicable in a real crowdsourcing platform, we devise a model that allows us to compute the probability of recognising a gold question. Such a model assumes that the parameters of the system are known and that the report of a specific question will have exactly the same simhash from all workers (so, in this theoretical model, we do not consider the cases of noisy html or gold questions generated programmatically). In order to obtain a closed-form solution for the average probability of recognition, we consider a gold question recognised by the system only when it has already been reported a number of times larger than the (known) multiplicity of the non-gold questions. Clearly, this underestimates the probability of recognition because, as we will show in the next section, a statistical analysis of the multiplicities can be enough to recognise a gold question. Moreover, such a simple model does not allow the estimation of the false positive rate (as will be studied in Section 5 over real data), but it still gives us a quick and clean way to explore the effect of the different parameters on the gold recognition probability. Regarding the parameters of the system, we consider a realistic scenario: a job of 2000 tasks with an additional 5 % (100 tasks) of gold questions. We consider the default automatic behaviour of Figure Eight: 10 gold questions are used at the beginning to train and test the ability of the worker (i.e. a quiz page). After that, pages of 10 tasks are shown to the worker, of which 9 are requested tasks and one is a gold question. To be considered trusted, workers are required, by default, to judge a minimum of four gold questions and to reach an accuracy threshold of 70 %. A similar setting can be implemented on Amazon Mechanical Turk by creating qualification tests and by manually distributing gold questions in subsequent tasks.
Moreover, we consider the default Figure Eight aggregation setting: each requested (non-gold) task will be shown to 3 distinct workers. A gold question will not be shown
3. https://github.com/orbitdb/orbit-db.
Figure 3: Probability of recognising a gold question (y-axis), varying the number of workers that have already used the system (x-axis, 0-70) and parametrised by the number of 10-task pages per worker; gold ratio: 5 %.
twice to the same worker: thus, for our setting, a maximum of 9 pages can be shown to each worker, with a total (including the initial quiz page) of 19 unique gold questions per worker. Clearly, different workers may be required to evaluate the same gold questions. In Figure 3, we estimate the probability that a gold question displayed to a worker is correctly detected by our system, after a certain number of workers have already used the system for a specific job. This number depends on how many tasks each worker completes on average. Even in the most conservative case (each worker completing only 10 tasks) and assuming a uniform distribution of gold questions, we can observe that after 50 workers have entered the job, the probability of correctly detecting a gold question is above 99.9 %. In Figure 4, we can see a drastic fall in the probability of recognition when the number of gold questions increases, especially when a small number of more prolific workers are contributing: if the average worker completes 30 questions and more than 16 % of gold questions are available, the probability of recognition is below 20 % even when 25 % of the total work has already been completed. It is worth noticing that in this simplified model we assumed that all workers judged the same number of pages. In reality, typical crowd engagement has a power-law distribution: we refer to Section 5 for a more realistic analysis. These results are promising, but are based on the simplifying assumptions of the model. For this reason, in the next section we present the results of implementing and testing our system over a real-world crowdsourcing case study.
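The recognition criterion above lends itself to a quick Monte Carlo sketch (our own illustration, not the paper's closed-form derivation): a gold question counts as recognised once it has been reported strictly more often than the 3 judgements a non-gold task receives, and each colluding worker reports the quiz golds plus one gold per page, with no repeats.

```python
import random

def recognition_probability(workers, pages_per_worker, n_gold=100,
                            quiz_gold=10, trials=300, seed=0):
    """Estimate P(a freshly shown gold question is already recognised)."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        seen = [0] * n_gold                  # report count per gold question
        for _ in range(workers):
            # quiz golds plus one gold per page, never repeated for a worker
            for g in rng.sample(range(n_gold), quiz_gold + pages_per_worker):
                seen[g] += 1
        # recognised if reported more often than the non-gold multiplicity (3)
        hits += seen[rng.randrange(n_gold)] > 3
    return hits / trials
```

With very few workers no gold count can exceed 3, so the estimate is zero; with many workers completing many pages it approaches 1, matching the qualitative trend of Figure 3.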
5. Experimental Analysis of Plugin Effectiveness
In this section, we evaluate the effectiveness of the plugin on real data. Since the effectiveness of the plugin depends both on the power of the inference technique and on the way the workers interact with the signalling system of the plugin (bias, trust, etc.), we focus first on the former, keeping the behaviour of the workers completely controlled (by means of
Figure 4: Probability of recognising a gold question (y-axis) after 25 % of the total work has been completed, varying the gold questions ratio (x-axis, 0.02-0.16) and parametrised by the number of pages that each worker evaluates.
simulation). Conversely, we refer to Section 6 for an analysis of the workers' interaction with the plugin.
5.1 Plugin Implementation
To perform the rest of the experiments, we did not implement the entire architecture described above. Instead, we simulated a server fast enough to run a clustering every time a new report is provided. On the plugin side, we used a simple heuristic for multi-page splitting, and did not use the peer-to-peer functionalities nor (apart from Section 6) the reporting/tuning of confidence values. The core functionalities of the plugin needed to replicate the following experiments are available at https://github.com/AlessandroChecco/all-that-glitters-is-gold.
5.2 Experimental Setting
To evaluate the effectiveness of the proposed attack scheme, we simulate the attack over two real crowdsourcing experiments. We use the csta datasets and task logs described in (Benoit, Conway, Lauderdale, Laver, & Mikhaylov, 2016)4, consisting of crowdsourced annotations of political data. An example of the task design is shown in Figure 5. We will start with the first dataset, consisting of 29,594 judgements from 336 workers and containing the platform logs for the submitted judgements, including timestamps of each question and whether a gold question has been missed. In the first csta dataset, out of 2700 unique questions, 12.4 % of them are gold questions. We selected this dataset because of the unusually abundant number of gold questions, and because each non-gold question had been answered by 10 workers with an average of 8.7
4. We used the jobs in the repository with id f269506 and f354285, available from https://github.com/ kbenoit/CSTA-APSR.
Figure 5: Design of the csta task. Workers have to select the most appropriate policy area for a specific sentence presented to them in its context, and the position of the sentence on the political (left/right) scale.
pages per worker, making this example a particularly difficult case for this kind of attack, as predicted by our model and shown in Figure 4. Such an abundance of judgements and gold questions also allows us to simply sub-sample these two parameters, keeping the rest of the log unchanged, so that we can simulate what would have happened if fewer gold questions or judgements per question were used. Moreover, as shown in Figure 5, the gold questions used in this experiment are indistinguishable from the non-gold questions in terms of design and structure: distinguishing them is thus only possible through the statistical analysis of their multiplicity. At the end of this section we will show the results for the second dataset, consisting of 76,183 judgements and with a less challenging, more realistic distribution of gold questions. In Figure 6, the distribution (in logarithmic scale) of the multiplicity for gold and non-gold questions is shown: there is a clear indication of a bimodal distribution for the counts of gold and non-gold questions, even if the two distributions overlap, making this dataset a good candidate to understand the limits of our framework. However, the main unknown of this approach is the behaviour of the system in the transient phase of the batch, that is, when many gold questions still have a multiplicity similar to that of non-gold questions (because of the high statistical variability when only a few workers have started the job and shared their tasks). This transient phase is when false positives are more likely to happen. Ideally, the plugin should
Figure 6: Distribution (log scale) of the multiplicity (number of times the unit has been shown, x-axis, 5-30) for gold and non-gold questions in the csta task when using 12.4 % of gold questions. Assumptions 1 and 2 are satisfied.
be in the idle state during this transient phase. Regarding the clustering phase to identify simhashes belonging to the same question, this job did not present significant difficulties, because all simhashes belonging to the same question had a Manhattan distance of less than 2 bits, even though they were reported by different workers with potentially different doms. In order to avoid disrupting a real crowdsourcing job, we decided to run the plugin on the reconstructed html obtained from the logs, together with the original crowdworker behaviour. The original designer of this job opted for an initial quiz of 10 questions, 8 of which were gold, and after that presented pages of 10 questions, one of which was a gold question. To study how the different parameters affect the proposed system, we keep the behaviour and time evolution of the workers fixed, but we vary the number of gold questions available via sub-sampling (i.e. allowing us to move from 0 % to 12.4 % of gold questions in the job), assuming that the worker's ability to answer a gold question depends only on their internal state, and not on which gold question they are being shown5. We are able to compare the original behaviour of the workers with the behaviour they would have when using the proposed collaborative gold signalling technique. To simplify the analysis, we assume the following behaviour for the workers when the plugin is in the active state:
Time spent: The worker will spend time answering only the questions that are signalled as potentially gold, spending an amount of time per page equal to g·T, where g is the number of signalled gold questions in that page, and T is the average time they spent per unit on that page according to the log.
5. The error made using this assumption should be mitigated by the fact that all gold questions are sampled uniformly at random.
Figure 7: Average worker accuracy (y-axis) against total work progress (x-axis, 10 %-100 %) for the original logs ('original') and for the proposed method ('simhash'). On the left, each non-gold question has 10 judgements; on the right, 4. On the top row the gold ratio is 4.4 %; on the bottom row, 12.4 %.
Performance: The worker will give a random answer to any gold question that the plugin failed to signal (false negative). On the other hand, the worker will answer any correctly signalled question normally (i.e. in the same way they did in the logs, still potentially answering it wrongly even when it is correctly signalled). This is a worst-case analysis: in practice we can expect workers to answer signalled questions more carefully.
Confidence: The worker will consider as gold all questions with signalled probability of being gold of at least 50 %.
This simulation setup guarantees that the individual accuracy of each worker is preserved: indeed, whenever the plugin is idle, we use the original performance reconstructed from the logs. We note that the likelihood threshold used to receive the signals could be modified to reduce the false negative rate, at the expense of more false positives: thus, there is a trade-off between the time saved by the worker and the performance loss. We did not perform such an optimisation and leave this aspect for a future study. However, it is worth noticing that the worker is able to set their level of risk locally in the plugin, because the Gaussian mixture model provides a classification probability for each question, which can then be filtered on the client side.
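This client-side filtering amounts to a simple thresholding of the posteriors returned by the server; a trivial sketch (the probabilities below are made up for illustration):

```python
def signalled(posteriors, threshold=0.5):
    """Indices of the questions whose posterior gold probability
    clears the worker-chosen confidence threshold."""
    return [i for i, p in enumerate(posteriors) if p >= threshold]

page = [0.99, 0.10, 0.60, 0.02]     # posterior gold probability per question
cautious = signalled(page, 0.5)     # higher recall, more false positives
aggressive = signalled(page, 0.95)  # higher precision, more missed golds
```

Lowering the threshold trades extra (possibly wasted) effort for a lower risk of missing a real gold question, mirroring the precision/recall trade-off discussed in Section 3.2.2.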
5.3 Experimental Results
We show the results of the experimental analysis by measuring the worker accuracy and the time spent per page for the proposed method, and we compare such measures with the values from the original platform logs. We modified the logs via sub-sampling, varying the number of gold questions available and the number of judgements performed for each non-gold question. For the rest of the section, we choose a realistic value of 4.4 % of gold questions and compare it with the higher value of 12.4 %, more challenging to achieve in practice due to the cost of generating gold questions; in the same vein, we will show the cases of 4 and 10 judgements collected for each non-gold question respectively. It is important to note that even the cases with low gold ratio and number of judgements may be higher than the usual parameters used in typical crowdsourcing experiments.
Figure 8: Average time spent per page (y-axis, seconds) against total work progress (x-axis, 10 %-100 %) for the original logs and the proposed method on the first dataset. On the left, each non-gold question has 10 judgements; on the right, 4. On the top row the gold ratio is 4.4 %; on the bottom row, 12.4 %.
In Figure 7, the average worker accuracy is shown. When the gold ratio is high, more time is required for the inferential system to gain precision. However, it is interesting to note that in all cases the accuracy of the workers stayed above the 70 % threshold under which a worker would have been rejected.
Regarding the number of judgements per non-gold question, we do not observe a notable trend in the accuracy of the workers. On the other hand, we can see in Figure 8 that the time a worker saves by using the proposed system is considerably higher when fewer judgements per non-gold question are used, especially in the transient phase: this is because
the number of false positives is higher when the two distributions of multiplicities (of gold and non-gold questions, as depicted in Figure 6) have a closer average value. More importantly, we can see that if either the gold ratio or the number of judgements per page is low, then the inferential system allows the workers to complete the tasks in one fifth of the original time (after the transient phase): this means that, on average, the worker only needs to answer 2 questions per page, ignoring 8 questions per page, without a significant loss in accuracy. This result shows that the proposed system can completely disrupt the gold set paradigm for quality control.
Figure 9: Average worker accuracy (a) and average time spent per page (b) against total work progress for the original logs and for the proposed method on the second dataset (gold ratio: 4.29 %, 5 judgements per non-gold question).
Finally, we repeated the whole experiment on the largest dataset from the csta repository, with id f354285. The dataset consists of 76,183 judgements from 230 workers. Of the 13,371 unique questions, 4.3 % are gold questions, and each question has been judged five times. We did not perform sub-sampling. Even in this case, the accuracy of the workers never dropped below the 70 % threshold. After the transient phase, we observed a reduction in accuracy of 4 % (as shown in Figure 9a) and a time saving of about 50 % (as shown in Figure 9b): on average, the workers had to answer only half of the questions in the page to maintain an accuracy that allowed them to complete the entire job.
5.3.1 Number of Colluding Workers
In all experiments in this section, we considered the scenario where all workers are colluding. While an extensive study of the relaxation of this assumption is a major undertaking and left for future work, we can discuss its implications in the simple case in which all workers have equal retention and all enter the job at random times. If not all workers are colluding, some gold questions will not be reported. However, by Assumption 2, this is equivalent to having those gold questions still unreported because of entering early in a batch. The result is that M colluding workers (out of a total of N workers) can at best expect to reach, towards the end of the batch, an accuracy equivalent to the one achievable when all workers are colluding and only a fraction M/N of the batch has been completed. For example, if the number of colluding workers is only 10 % of the whole pool of workers, from Figure 7 it is possible to
see that the average accuracy achievable would be only that of the first data point on the x-axis. Similarly, from Figure 8 it is clear that in this case the time saved by the colluding workers would be negligible. In other words, the number of colluding workers linearly affects the speed and extent of the transient phase of the attack. However, the presence of non-colluding workers does not affect the inference mechanism in any other way.
6. Experimental Analysis of Workers' Behaviour
The analysis in Section 5 was based on some strong assumptions about the behaviour of the workers: a simulated behaviour where workers acted to minimise the time spent on the platform by using the plugin. However, the following research questions remain:
1. How would the workers use the plugin? Would they use it to save time (by ignoring non-signalled questions), to improve quality (by spending more time on the signalled questions), or a combination of both?
2. What is the effect of the plugin's detection inaccuracy on the worker behaviour? Would a loss in trust change the way workers use it?
3. Would workers actually use the plugin?
In this section, we focus on answering the first two questions, by conducting an experiment in which workers interact freely with the plugin while the detection effectiveness of the plugin is artificially controlled. We leave the third question for future work.
6.1 Experiment Design
Workers were required to perform a classification task, in which a commercial product (i.e. a shoe or garment) and its corresponding customer review were shown. The crowdworker had to decide, for each question, whether the review contained a reference to a size issue, a fit issue, or none of the above. Before the task started, instructions were provided, with a rather convoluted definition of the classification classes and an explanation that a plugin able to signal gold questions would be used in the task. To minimise the bias caused by the fact that the workers were aware that this was an experiment (the plugin, after all, was provided by the designer of the task), we used the crowdsourcing platform's native quality control system to allow the workers to verify that the usual gold set quality control system was indeed in place. Each page was composed of five questions, one of which was a gold question. When a page is submitted, if a gold question is missed, the worker receives feedback and their current accuracy level is updated. We selected this classification task because it has been annotated in its entirety by domain experts and, from our own experience, it is a rather difficult task. From a pilot run of this experiment we measured the workers' accuracy for each question. We then selected as gold questions a relatively difficult set: in the pilot run, the gold questions had an average accuracy of 80.4 % and a median accuracy of 89 %, while the non-gold questions had an average accuracy of 89.7 % and a median of 100 %. The rationale for this choice is that, when using the plugin, the only way for the workers to verify the effectiveness of the plugin is to miss a question, because the quality control system of the
platform will in that case notify the loss in accuracy and show the correct answer for that question, revealing that the plugin had indeed signalled the correct question in the page, as shown in Figure 12. As an example, if the gold questions were extremely easy, the workers would never miss a gold question, and would never be able to decide whether to trust the plugin. This setup is only necessary in this experiment because the participating workers had never used the plugin before, and our goal is to assess their behaviour after trust (or lack thereof) is established. The plugin signalling was a simple coloured box (red for high confidence, orange for low confidence) that alerted the worker to the possibility that a question might be a gold question, as shown in Figure 11. We performed three experiments, each with 100 questions and 5 judgements per question, for a total of 1500 tasks involving 244 workers from 46 countries:
Inactive (control): the original task without using the plugin at all. The instruction did not contain information on the plugin either.
Perfect Signalling: all gold questions were signalled with high confidence.
Imperfect Signalling: one third of the gold questions were signalled with high confidence (with a red signal and a displayed confidence of 99 %), one third with low confidence (with an orange signal and a displayed confidence of 10 %), and one third were missed altogether, selecting uniformly at random among these cases.6
Since the goal of this experiment was to understand the workers' behaviour while using the plugin, the following measures were taken to select an appropriate population and to prevent memory biases. At the beginning of the task, a quiz of 5 gold questions was shown to filter out underperforming workers and bots: workers with less than 40 % accuracy were not allowed to continue. Workers were not allowed to participate in more than one experiment, and additional controls were put in place to prevent workers from restarting a task after gaining knowledge of its questions. Figure 10 shows the instructions provided at the beginning of the task. Each worker was exposed to the same set of questions and gold questions (selected via uniform sampling), although not all workers answered all of the questions. We measured the time spent on each question and the workers' accuracy.
6.2 Worker Behaviour with Perfect Signalling
Here we compare the control experiment (i.e. no gold questions signalled to workers) with the case in which every gold question was signalled by the plugin. Figure 13 shows that, after a phase in which the workers were evaluating the plugin (the first 3 pages, which corresponded to testing the plugin 3 times), the trend of spending more time on the signalled questions becomes apparent, compared to the control group.
6. Since the variability of responses in this case was significantly higher, we performed an additional run of this experiment to increase the number of data points.
Checco, Bates & Demartini
Figure 10: Instructions for the perfect and imperfect signalling experiment.
6.2.1 Intra-page Interaction
As shown in Figure 14, there is a significant increase in the time spent on gold questions (t-test, p < 0.05), while there is no significant decrease in worker accuracy on either gold or non-gold questions. In fact, perhaps surprisingly, the accuracy on non-gold questions slightly increased, although the difference was not statistically significant.
While a performance improvement on gold questions was expected, the performance of workers on non-gold questions is of particular importance for the requester, and thus requires a more careful analysis. As noted above, the behaviour in the first 3 pages can be considered a learning phase for the worker. Removing that part of the dataset, the performance improvement over the control group on the non-gold questions was even more dramatic (as shown in Figure 15): from 63.64 % to 92.31 %, and the improvement was statistically significant (one-tailed t-test, p < 0.05). For this reason, a longitudinal study on the same workers is recommended for future work.
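The per-group time comparison above can be sketched as follows. This is an illustrative reconstruction with synthetic timing data: the group sizes, the means, and the normal approximation to the t distribution are our assumptions, not the experimental measurements.

```python
import math
import numpy as np

def welch_one_tailed(a, b):
    """One-tailed Welch's t-test (H1: mean(b) > mean(a)).

    Uses a normal approximation for the p-value, adequate for the
    sample sizes in these experiments (>100 observations per group).
    Returns (t_statistic, p_value).
    """
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    va, vb = a.var(ddof=1) / len(a), b.var(ddof=1) / len(b)
    t = (b.mean() - a.mean()) / math.sqrt(va + vb)
    p = 0.5 * math.erfc(t / math.sqrt(2))  # P(Z > t)
    return t, p

# Synthetic illustration: the signalling group spends ~40 % longer on gold questions.
rng = np.random.default_rng(0)
control = rng.normal(10.0, 2.0, size=120)  # seconds per gold question, no plugin
plugin = rng.normal(14.0, 2.5, size=120)   # seconds with perfect signalling
t, p = welch_one_tailed(control, plugin)
```

With these synthetic groups the difference is clearly significant at the 0.05 level used throughout this section.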
6.2.2 Overall Time and Accuracy
A one-way ANOVA on the per-page accuracy (for all pages), with the three plugin conditions as factors, showed a statistically significant difference (p < 0.05) in accuracy: as shown in Figure 16a, with perfect signalling the average accuracy increased from 79.4 % (control) to 87.3 %, and the median page accuracy from 80 % to
Figure 11: Example of one of the five questions shown on a page: the plugin signalled this question with low confidence.
100 %. On the other hand, the time spent per page showed no statistically significant difference between the two experiments (Figure 16b).
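The one-way ANOVA used here reduces to comparing between-group and within-group variance. A minimal sketch of the F statistic follows; the toy per-page accuracies for the three conditions are illustrative values, not the experimental data.

```python
import numpy as np

def one_way_anova_F(*groups):
    """F statistic for a one-way ANOVA: ratio of the between-group
    mean square to the within-group mean square."""
    groups = [np.asarray(g, dtype=float) for g in groups]
    k = len(groups)
    n = sum(len(g) for g in groups)
    grand_mean = np.concatenate(groups).mean()
    ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
    ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)
    return (ss_between / (k - 1)) / (ss_within / (n - k))

# Toy per-page accuracies for the three conditions (hypothetical values):
inactive = [0.78, 0.80, 0.79, 0.81]
imperfect = [0.79, 0.80, 0.78, 0.82]
perfect = [0.86, 0.88, 0.87, 0.88]
F = one_way_anova_F(inactive, imperfect, perfect)
```

A large F (compared against the F distribution with k-1 and n-k degrees of freedom) indicates that at least one condition differs from the others.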
6.3 Worker Behaviour with Imperfect Signalling
We now focus on understanding the effect of the plugin's effectiveness on worker trust. In this experimental setup, one third of the gold questions were signalled with high confidence (with a red signal and a displayed confidence of 99 %), one third with low confidence (with an orange signal and a displayed confidence of 10 %, as shown in Figure 11), and one third were missed altogether, selecting uniformly at random among these cases.
6.3.1 Intra-page Interaction
In Figure 17, while we observe a reduction in the time spent on non-gold questions in pages where gold questions were signalled, the effect is much weaker than in the perfect signalling case: there is no statistically significant difference between the times spent on the two kinds of signalled questions.
6.3.2 Overall Time and Accuracy
Similarly, the overall per-page accuracy and time spent are statistically indistinguishable from the inactive (control) experiment, as shown in Figure 16.
Figure 12: Example of the feedback received after failing a gold question in a page of five questions: the worker is now aware of which question was the gold one. The workers can keep track of their per-page accuracy on the top banner.
6.4 Discussion
In the case of perfect signalling, workers using the plugin spent more time on the signalled questions and less time on the non-signalled ones, while spending the same time overall on the task. As expected, the accuracy on gold questions increased. Perhaps surprisingly, the accuracy on the non-gold (and non-signalled) questions also increased (especially after the first pages, when workers were still testing the plugin's effectiveness). Workers increased their attention on signalled questions while retaining a high (and often better) ability to work on the rest of the task, perhaps because of the decreased stress and cognitive load caused by the otherwise hidden quality control mechanism. Regarding the imperfect signalling experiments, we observe that the results are statistically indistinguishable from the control group for both time spent and accuracy: the lack of trust in the plugin made the workers behave in the same way as the control group. Our approach is also subject to the issue of the gold detection classifier's confidence being inaccurate. In such a case, the classifier could incorrectly miss some gold questions and label them as non-gold with high confidence (this problem is often called
Figure 13: Time spent by workers on each question, grouped by page, for the inactive (a) and perfect signalling (b) experiments. More time is spent on signalled gold questions after the first 3 pages.
Figure 14: Accuracy and time spent by workers on gold and non-gold questions for the inactive (a) and perfect signalling (b) experiments (inactive: 65.85 % on non-gold, 81.90 % on gold; perfect signalling: 89.02 % on non-gold, 86.85 % on gold). More time is spent on signalled gold questions, but accuracy does not decrease for the other questions.
unknown unknowns). In such cases, workers would be likely to miss the unidentified gold questions and potentially damage their platform reputation and get their job rejected. Such issues usually happen when some classes are under-represented in the training data. Thus, in our setting, this could happen when some gold questions are very different from others, and appear very rarely. In summary, the more data is available to the system through the plugin, the less the issue of unknown unknowns should arise and, in the cases when it arises, it would only affect a small minority of gold questions and workers.
7. Countermeasures
In this section, we describe possible approaches for a requester to mitigate the effects of the proposed attack scheme.
Figure 15: Accuracy and time spent by workers on gold and non-gold questions for the inactive (a) and perfect signalling (b) experiments, after the first 3 pages (inactive: 63.64 % on non-gold, 76.14 % on gold; perfect signalling: 92.31 % on non-gold, 85.58 % on gold). More time is spent on signalled gold questions and accuracy significantly increases.
Figure 16: Violin plot distributions of accuracy per page (a) and time per page in seconds (b) for the three plugin conditions (perfect, imperfect, inactive). The white dot represents the median.
7.1 Gold Set Size
A potential countermeasure that would make the attack more difficult is to increase the gold set size. This would, however, significantly increase the overall data collection cost. In (Clough, Sanderson, Tang, Gollins, & Warner, 2013), the cost of generating gold questions for the relevance assessment problem was estimated to be more than 4 times the minimum wage (for a senior civil servant). This means that in our example, assuming crowdworkers are paid minimum wage, moving from a gold set size of 4.4 % to 12.4 % (as shown in Figure 8) would require an additional 54 % of the crowdsourcing cost already undertaken.
Figure 17: Time spent by workers on each question for varying detection confidence levels (no, low, and high signal) in the imperfect plugin experiment.
7.2 Number of Judgements
An alternative solution is to increase the number of judgements required per non-gold question. However, our experiments in Section 4 suggest that the effectiveness of this solution is rather limited. Moreover, the additional crowdsourcing cost of such an approach needs to be taken into account.
7.3 Worker Retention
As shown in Section 4, having crowdworkers with high retention significantly reduces the strength of this attack, because of the reduced initial assessment requirement and because each worker only sees distinct gold questions. In other words, after a fixed number of tasks are completed, the probability of having gold questions with high multiplicity in the inferential system is low if those tasks have been completed by a few prolific workers. In that case, the total number of gold questions shown will be lower (because of fewer initial assessments) and the probability of repeated gold questions overall will be lower than in the corresponding scenario where many workers completed the same number of tasks. This solution is interesting because increasing retention (e.g. through better task design or reward schemes) can also improve the quality of the work thanks to learning effects on long-standing workers (Difallah, Catasta, Demartini, & Cudré-Mauroux, 2014).
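The retention argument can be illustrated with a small simulation, a sketch under assumed parameters (a pool of 100 gold questions, one gold question per page, and Assumption 2's no-repetition sampling per worker):

```python
import random

def simulate_max_multiplicity(num_workers, tasks_per_worker, gold_pool=100, seed=0):
    """Serve one gold question per page to each worker, sampling without
    repetition per worker (Assumption 2), and return the highest
    multiplicity (times served) observed across the gold pool."""
    rng = random.Random(seed)
    counts = [0] * gold_pool
    for _ in range(num_workers):
        served = rng.sample(range(gold_pool), min(tasks_per_worker, gold_pool))
        for q in served:
            counts[q] += 1
    return max(counts)

# Same total work (500 pages) split differently:
few = simulate_max_multiplicity(num_workers=5, tasks_per_worker=100)   # prolific workers
many = simulate_max_multiplicity(num_workers=100, tasks_per_worker=5)  # casual workers
```

With few prolific workers the no-repetition constraint keeps every multiplicity at exactly the number of workers, whereas with many casual workers the binomial variance produces noticeably higher peak multiplicities, which is what the attack's clustering step exploits.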
7.4 Non-uniform Selection from the Gold Set
Another countermeasure is to violate Assumption 2: instead of sampling uniformly at random from the pool of gold questions that have not yet been shown to the worker, a better approach could take into account the overall sequence of gold questions shown
and the possible existence of this attack scheme, potentially mitigating its effectiveness, especially if worker retention is not uniformly distributed. We tested this hypothesis on the first dataset of Section 5, repeating the simulation after relaxing Assumption 2 in the following way: at each step, we serve the least-seen question from the whole pool of gold questions, while avoiding showing the same gold question twice to the same worker.7 The results are not encouraging for this countermeasure: the difference in accuracy between this technique and the original one is less than 2.5 %, while the differences in time spent are of the order of seconds; in both cases the differences are not statistically significant. We believe the reason this countermeasure is not effective is that, for relatively small gold set sizes, a uniform serving is statistically indistinguishable from a lexicographical serving. Another approach to consider is a countermeasure that exploits potential vulnerabilities in the Gaussian Mixture Model inference method, for example by serving some gold questions a very high number of times to make the two-mode inference ineffective. We leave an extended analysis of this and other countermeasures for future work.
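The least-seen serving policy used in this simulation can be sketched as follows; this is our own illustrative implementation, and the deterministic tie-break on question id is an assumption.

```python
def serve_least_seen(serve_counts, worker_history):
    """Serve the globally least-served gold question that this worker has
    not yet seen. `serve_counts` maps question id -> times served so far;
    `worker_history` is the set of ids already shown to this worker."""
    candidates = [q for q in serve_counts if q not in worker_history]
    if not candidates:
        raise ValueError("worker has exhausted the gold pool")
    # Tie-break on the question id to keep the policy deterministic.
    q = min(candidates, key=lambda q: (serve_counts[q], q))
    serve_counts[q] += 1
    worker_history.add(q)
    return q
```

Because every question's serve count stays within one of every other, the resulting multiplicities are nearly uniform, which is why the policy ends up statistically indistinguishable from the original uniform serving for small gold pools.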
7.5 Programmatic Gold Questions
Always using different, programmatically generated gold questions, as in (Oleson et al., 2011), that also have sufficiently distant simhashes would be an ideal solution. This could also be achieved by carefully modifying the way the questions are rendered. However, this approach would require a careful design phase, again increasing the initial setup cost.
7.6 Inter-worker Agreement
Solutions like the one proposed in (Shah, Balakrishnan, & Wainwright, 2016) would be able to detect the difference in answer distributions between gold and non-gold questions, potentially identifying workers that answer randomly only on non-gold questions. Attackers could prevent this by colluding on a common rule (e.g. always choosing the first option) rather than answering uniformly at random, but the study of this different kind of collusion attack is left for future work.
7.7 Time Analysis
Requesters could detect workers that are too fast in answering. This countermeasure could even be refined by analysing the distribution of times for gold and non-gold questions, attempting to identify workers that spend significantly less time on non-gold questions. However, this can increase the number of gold preys: workers that are performing as expected, but who end up failing the quality control because they are faster than average or more careful when they recognise a gold question (Gadiraju et al., 2015). Moreover, when performing the attack, workers could simply spend the same time on both types of questions by working on multiple tasks in parallel.
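One simple form of such a time analysis could compare each worker's typical time on gold and non-gold questions. The function name, the 0.5 ratio threshold, and the timings below are illustrative assumptions, not a method used on any platform.

```python
import statistics

def flag_suspicious(times_gold, times_nongold, ratio_threshold=0.5):
    """Flag a worker whose median time on non-gold questions is less than
    `ratio_threshold` times their median time on gold questions: the
    signature of answering carefully only on (signalled) gold questions."""
    median_gold = statistics.median(times_gold)
    median_nongold = statistics.median(times_nongold)
    return median_nongold < ratio_threshold * median_gold

attacker = flag_suspicious([20, 22, 25], [3, 2, 4])     # rushes non-gold questions
diligent = flag_suspicious([18, 20, 21], [17, 19, 22])  # consistent across questions
```

As the section notes, attackers can defeat this heuristic by equalising their apparent per-question times, e.g. by interleaving multiple tasks, and careful honest workers may be flagged as false positives.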
7. Assumption 2 instead required serving gold questions sampled uniformly at random from the pool of gold questions not yet seen by the worker.
7.8 Page Obfuscation
A powerful countermeasure would be to completely obfuscate the html source by always serving a seemingly identical source: all served images would appear to have the same path, and all plain html text would be substituted with images. However, this would require a major architectural restructuring on the platform side, or a costly effort on the requester side.
8. Societal and Economic Implications
The use of monitoring technologies to control the speed and quality of workers' output is not a new phenomenon. However, within the precarious, unstable, temporary working conditions of emergent labour markets typified by crowdwork, such techniques are being used as a means to exert control over, and place accountability and responsibility upon, individual workers struggling to earn an income in increasingly competitive labour markets (Moore, 2017, p. 14). In a data-rich era, a multitude of possibilities open up for gathering and analysing information about the amount and quality of work completed by employees and contractors. As Moore observes, data is "treated as a neutral arbiter and judge" and is being prioritised over qualitative judgements (2017, p. 3). In such conditions, workers are becoming accustomed to constant observation and measurement enabled by the use of a variety of monitoring technologies. Stories abound in the media and academic press about the introduction of new monitoring technologies into workplaces, e.g. (Saner, 2018; Yeginsu, 2018; Moore, 2017). As a mechanism to control for the quality of paid work, the gold set quality assurance paradigm can thus be defined as a type of computer-enabled monitoring of workers. It can be argued that such monitoring techniques function as systems of control within the capital-labour relation, and that the gold set quality assurance paradigm promotes effects similar to the classical panopticon effect in the workplace (Vorvoreanu & Botan, 2000; Botan, 1996; D'Urso, 2006; Stahl, 2008). That is, workers' understanding that they might be observed at any single moment means they ought to feel compelled to self-govern their behaviour at all times in line with the employer's wishes (Foucault, 1991); the ideal workers come to "internalise the imperative to perform . . . becoming observing, entrepreneurial subjects . . . whilst remaining objectified working bodies" (Moore, 2017, p. 15).
Here, we can turn to the work of philosopher Gilles Deleuze (1992) to theorise the nature of this relation of control. Deleuze explored how we ought to re-imagine the nature of social control in societies whose organisation was becoming more open and dynamic than the institutional settings (e.g. factory, school, prison) that were the focus of Foucault's (1991) work on the panopticon. For Deleuze, workers within capitalism had always been subordinated by machines; however, he observed that the shift to the control society was marked, in part, by the introduction of computers as the machines of control (Deleuze, 1992, p. 6). As Haggerty and Ericson (2000) observe, computer-enabled monitoring works to abstract people from their lived realities. Monitoring data are separated from the people that are observed, and become a series of data flows that are re-assembled in different settings to create "data doubles". These data doubles are then used to inform decisions without the need for any meaningful human engagement. Within the control society, the data double comes to mediate relations between human actors: in our case, the relation between requester and worker.
Monitoring technologies allow for the management of labour at a distance. They seem the obvious choice for quality control within the context of crowdwork, given the highly distributed, anonymous, undifferentiated and indistinguishable nature of the workforce. In such conditions, the construction of a trust relationship between crowdworker and requester would be very arduous, and monitoring presents itself as a far more efficient solution. In fact, the current crowdsourcing platform architectures, by design, do not allow for different quality assurance mechanisms. Yet, in the gold-set quality assurance paradigm, each response to a gold question takes on exaggerated significance in the worker's data double. Each time a worker responds to a gold question there is the potential for significant economic consequences for the worker in terms of payments and the possibility of future work. For a crowdworker, ensuring a flawless data double becomes a matter of survival within the highly competitive and individualised crowdwork labour market. As Holland, Cooper, and Hecker (2015), Snyder (2010), and Knox (2010) have demonstrated, electronic surveillance of this kind is correlated with a reduction of both trust in management and the perceived quality of the workplace relationship. It can also negatively impact work effort, attitudes, and communication in the workplace. In effect, such forms of monitoring tend to alienate workers, rather than establishing meaningful increases in worker quality and satisfaction. Workers subject to such labour conditions will respond in various ways. While some will passively accept the conditions of their labour, others will seek out ways of empowering themselves in relation to the monitoring system. The architecture of crowdsourcing systems clearly has an important impact on both labour quality and worker satisfaction.
While the inherently dynamic nature of crowdwork platforms makes quality control mechanisms potentially prone to abuse against workers, at the same time it exposes many novel techniques for worker self-organisation and efforts to constitute a more equal power balance between workers and requesters. Reconceptualising the idea of the exploit from hacker culture, Galloway and Thacker (2007) argue that in the networked age those that aim to resist these systems of control must turn their attention to the vulnerabilities embedded within networked infrastructures and leverage these exploits for the purpose of bringing about positive social change. The framework we presented in this paper is a clear example of this kind of exploit, functioning as a form of sousveillance (Mann, Nolan, & Wellman, 2002), or "watching from below", by turning the gaze back on the requesters who design the gold questions. The experimental findings indicate not only that such an attack is easy to implement and employ, but also that it would be difficult to counter (as shown in Section 5), and moreover that when workers employ such technologies the overall quality and efficiency of their work increases (Section 6). While the reasons for this observed improvement in worker quality are uncertain and the results need further corroboration, the findings suggest that enhanced trust and a sense of worker empowerment in the worker-requester relation, as well as less worry about gold-question monitoring, may prove more effective at enhancing quality than one-way monitoring-based systems. Should the system described in this paper take hold in the crowdworker community, it would negatively impact the effectiveness of the gold question paradigm for quality assurance, forcing a shift towards different quality assurance approaches. The system has the potential to enable a less passive and quiescent labour force (Marx, 2003; Kulynych, 1997;
Salehi et al., 2015), potentially ameliorating some of the digital power imbalance (Cushing, 2013; Sandford, 2006) between workers and requesters. On the other hand, it also has the potential to weaken the competitive advantage of crowdworkers who have invested time and energy in enhancing their reputations within the current frameworks for quality control. The implications for both workers and requesters therefore remain somewhat uncertain. Reducing rejection risk and building trust has been identified as a top priority to improve outcomes for all parties in online labour markets (McInnis, Cosley, Nam, & Leshed, 2016). With our efforts, we encourage the crowdsourcing research community to question the efficacy of technologically enabled monitoring systems as the sole means of quality control. Instead, we push towards more socially-orientated frameworks for enhancing labour quality and satisfaction, such as those seen in the work of Turkopticon (Irani & Silberman, 2013) and of McInnis et al. (2016) on collective dispute resolution mechanisms, and initiatives that aim for more transparency and data portability across platforms (Sarasua & Thimm, 2014).
9. Conclusions
In this paper, we showed that the popular gold question method for quality assurance in paid crowdsourcing is prone to an attack, easy to implement and deploy, carried out by a group of colluding crowdworkers. We described an inferential system, based on a browser plugin and a server, that can exploit the inherently limited size of the gold set to detect which parts of a crowdsourcing job are more likely to be gold questions. We have also shown how the described attack is robust to various forms of randomisation and programmatic generation of gold questions.8 Integration with existing plugins like Turkopticon (Irani & Silberman, 2013) is left for future work, where we envision implementing a traffic-light alert system for signalling potential gold questions to workers, similar to the way Turkopticon signals requester reputation levels. In our experimental evaluation we observed the effect of the proposed attack scheme on workers' behaviour in terms of time spent and effectiveness in answering questions. From real-world crowdsourcing experiments, we saw that when using the proposed method, workers are required to answer only half (or even a fifth, in some conditions) of the questions presented to them, while still maintaining an accuracy level high enough to avoid being excluded from the job. We also observed that workers do spend more time on signalled gold questions, but they do not neglect the others. The most positive observation we made is that in the presence of gold question signalling, the overall quality of work increases, possibly because of the reduced cognitive load caused by being invisibly monitored. We also observed that workers' behaviour is highly sensitive to the plugin's effectiveness: this result could also be caused by the fact that the workers used this plugin for the first time during the experiment, so trust establishment played a major role here. This effect should be studied in more depth in future work.
Regarding potential countermeasures, we observed that increasing the gold set size or the number of judgements per question might be useful but is infeasible in terms of cost. The countermeasure of serving gold questions while avoiding repetitions across the whole pool also proved ineffective. On the
8. The core functionalities of the plugin are available at https://github.com/AlessandroChecco/all-that-glitters-is-gold.
other hand, we noticed that increasing worker retention through, for example, better task design might be a win-win solution that could also make crowdworkers more satisfied and perform better. Other countermeasures that could be explored are the programmatic creation of gold questions and alternative ways of serving gold questions that interfere with the inference mechanism. Finally, we discussed the economic and sociological implications of this kind of attack, pointing out the positive repercussions on the future of crowdwork of the creation of stronger, long-term worker-requester relationships in which bilateral trust can be established. Regarding future research directions, other than exploring the proposed countermeasures in detail, it would be interesting to refine the attack scheme by using locally optimised likelihood thresholds to balance the time saved by the workers against their loss in accuracy. Moreover, it is necessary to study the robustness of the method with respect to the number of colluding workers and coordinated attack times (Lasecki, Teevan, & Kamar, 2014; Difallah et al., 2012), and the worker behaviour with respect to the local tuning of the plugin and the consequent risk level. More advanced collusion attack schemes may include the sharing of the correct answers for gold questions, which would further increase the gain in time spent and reduce the risk of being detected.
Acknowledgments
This project is supported by the European Union's Horizon 2020 research and innovation programme under grant agreement No. 732328 and by the Australian Research Council Discovery Project DP190102141.
References

Aggarwal, C. C. (2005). On k-anonymity and the curse of dimensionality. In Proceedings of the 31st International Conference on Very Large Data Bases, pp. 901–909. VLDB Endowment.

Benoit, K., Conway, D., Lauderdale, B. E., Laver, M., & Mikhaylov, S. (2016). Crowd-sourced text analysis: Reproducible and agile production of political data. American Political Science Review, 110(2), 278–295.

Bentivogli, L., Federico, M., Moretti, G., & Paul, M. (2011). Getting expert quality from the crowd for machine translation evaluation. Proceedings of the MT Summit, 13, 521–528.

Botan, C. (1996). Communication work and electronic surveillance: A model for predicting panoptic effects. Communication Monographs, 63(4), 293–313.

Buchholz, S., & Latorre, J. (2011). Crowdsourcing preference tests, and how to detect cheating. In Twelfth Annual Conference of the International Speech Communication Association.

Checco, A., Bates, J., & Demartini, G. (2018). All that glitters is gold: An attack scheme on gold questions in crowdsourcing. In Sixth AAAI Conference on Human Computation and Crowdsourcing.
Clough, P., Sanderson, M., Tang, J., Gollins, T., & Warner, A. (2013). Examining the limits of crowdsourcing for relevance assessment. IEEE Internet Computing, 17(4), 32–38.

Cushing, E. (2013). Amazon Mechanical Turk: The digital sweatshop. Utne Reader.

Damashek, M. (1995). Gauging similarity with n-grams: Language-independent categorization of text. Science, 267(5199), 843.

Daniel, F., Kucherbaev, P., Cappiello, C., Benatallah, B., & Allahbakhsh, M. (2018). Quality control in crowdsourcing: A survey of quality attributes, assessment techniques, and assurance actions. ACM Computing Surveys, 51(1), 7:1–7:40.

Deleuze, G. (1992). Postscript on the societies of control. October, 59, 3–7.

Difallah, D. E., Catasta, M., Demartini, G., & Cudré-Mauroux, P. (2014). Scaling-up the crowd: Micro-task pricing schemes for worker retention and latency improvement. In Second AAAI Conference on Human Computation and Crowdsourcing.

Difallah, D. E., Demartini, G., & Cudré-Mauroux, P. (2012). Mechanical cheat: Spamming schemes and adversarial techniques on crowdsourcing platforms. In CrowdSearch, pp. 26–30. Citeseer.

Dow, S., Kulkarni, A., Bunge, B., Nguyen, T., Klemmer, S., & Hartmann, B. (2011). Shepherding the crowd: Managing and providing feedback to crowd workers. In CHI '11 Extended Abstracts on Human Factors in Computing Systems, pp. 1669–1674. ACM.

D'Urso, S. C. (2006). Who's watching us at work? Toward a structural-perceptual model of electronic monitoring and surveillance in organizations. Communication Theory, 16(3), 281–303.

El Maarry, K., & Balke, W.-T. (2018). Quest for the gold par: Minimizing the number of gold questions to distinguish between the good and the bad. In Proceedings of the 10th ACM Conference on Web Science, WebSci '18, pp. 185–194, New York, NY, USA. ACM.

Ettlinger, N. (2016). The governance of crowdsourcing: Rationalities of the new exploitation. Environment and Planning A: Economy and Space, 48(11), 2162–2180.

Foucault, M. (1991). Discipline and Punish: The Birth of the Prison. Penguin.

Fraley, C., & Raftery, A. E. (1998). How many clusters? Which clustering method? Answers via model-based cluster analysis. The Computer Journal, 41(8), 578–588.

Gadiraju, U., Kawase, R., Dietze, S., & Demartini, G. (2015). Understanding malicious behavior in crowdsourcing platforms: The case of online surveys. In Proceedings of the 33rd Annual ACM Conference on Human Factors in Computing Systems, CHI '15, pp. 1631–1640, New York, NY, USA. ACM.

Galloway, A. R., & Thacker, E. (2007). The Exploit: A Theory of Networks, Vol. 21. University of Minnesota Press.

Graham, M., Hjorth, I., & Lehdonvirta, V. (2017). Digital labour and development: Impacts of global digital labour platforms and the gig economy on worker livelihoods. Transfer: European Review of Labour and Research, 23(2), 135–162.
Gray, M. L., & Suri, S. (2019). Ghost Work: How to Stop Silicon Valley from Building a New Global Underclass. Eamon Dolan Books.
Haggerty, K. D., & Ericson, R. V. (2000). The surveillant assemblage. The British Journal of Sociology, 51(4), 605 622.
Holland, P. J., Cooper, B., & Hecker, R. (2015). Electronic monitoring and surveillance in the workplace: The effects on trust in management, and the moderating role of occupational type. Personnel Review, 44(1), 161 175.
Huang, S.-W., & Fu, W.-T. (2013). Enhancing reliability using peer consistency evaluation in human computation. In Proceedings of the 2013 conference on Computer supported cooperative work, pp. 639 648. ACM.
Ipeirotis, P. G., Provost, F., & Wang, J. (2010). Quality management on amazon mechanical turk. In Proceedings of the ACM SIGKDD Workshop on Human Computation, pp. 64 67. ACM.
Irani, L. C., & Silberman, M. (2013). Turkopticon: Interrupting worker invisibility in amazon mechanical turk. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pp. 611 620. ACM.
Kaplan, T., Saito, S., Hara, K., & Bigham, J. P. (2018). Striving to earn more: a survey of work strategies and tool use among crowd workers. In Sixth AAAI Conference on Human Computation and Crowdsourcing.
Kittur, A., Chi, E. H., & Suh, B. (2008). Crowdsourcing user studies with Mechanical Turk. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pp. 453–456. ACM.
Knox, D. (2010). A good horse runs at the shadow of the whip: Surveillance and organizational trust in online learning environments. The Canadian Journal of Media Studies, 7, 07 01.
Kulynych, J. J. (1997). Performing politics: Foucault, Habermas, and postmodern participation. Polity, 30(2), 315–346.
Lasecki, W. S., Teevan, J., & Kamar, E. (2014). Information extraction and manipulation threats in crowd-powered systems. In Proceedings of the 17th ACM Conference on Computer Supported Cooperative Work & Social Computing, pp. 248–256. ACM.
Le, J., Edmonds, A., Hester, V., & Biewald, L. (2010). Ensuring quality in crowdsourced search relevance evaluation: The effects of training question distribution. In SIGIR 2010 Workshop on Crowdsourcing for Search Evaluation, pp. 21–26.
Lehdonvirta, V. (2018). Flexibility in the gig economy: managing time on three online piecework platforms. New Technology, Work and Employment, 33(1), 13–29.
Mann, S., Nolan, J., & Wellman, B. (2002). Sousveillance: Inventing and using wearable computing devices for data collection in surveillance environments. Surveillance & Society, 1(3), 331–355.
Martin, D., Hanrahan, B. V., O'Neill, J., & Gupta, N. (2014). Being a turker. In Proceedings of the 17th ACM Conference on Computer Supported Cooperative Work & Social Computing, pp. 224–235. ACM.
An Attack Scheme on Gold Questions in Crowdsourcing
Martin, D., O'Neill, J., Gupta, N., & Hanrahan, B. V. (2016). Turking in a global labour market. Computer Supported Cooperative Work (CSCW), 25(1), 39–77.
Marx, G. T. (2003). A tack in the shoe: Neutralizing and resisting the new surveillance. Journal of Social Issues, 59(2), 369–390.
McInnis, B., Cosley, D., Nam, C., & Leshed, G. (2016). Taking a hit: Designing around rejection, mistrust, risk, and workers' experiences in Amazon Mechanical Turk. In Proceedings of the 2016 CHI Conference on Human Factors in Computing Systems, pp. 2271–2282. ACM.
Moore, P. V. (2017). The quantified self in precarity: Work, technology and what counts. Routledge.
Oleson, D., Sorokin, A., Laughlin, G. P., Hester, V., Le, J., & Biewald, L. (2011). Programmatic gold: Targeted and scalable quality assurance in crowdsourcing. Human Computation, 11(11).
Qarout, R. K., Checco, A., & Demartini, G. (2016). The effect of class imbalance and order on crowdsourced relevance judgments. arXiv preprint arXiv:1609.02171.
Sadowski, C., & Levin, G. (2007). Simhash: Hash-based similarity detection. Technical report, Google.
Salehi, N., Irani, L. C., Bernstein, M. S., Alkhatib, A., Ogbe, E., Milland, K., et al. (2015). We are dynamo: Overcoming stalling and friction in collective action for crowd workers. In Proceedings of the 33rd Annual ACM Conference on Human Factors in Computing Systems, pp. 1621–1630. ACM.
Sandford, R. (2006). Digital post colonialism. Flux.
Saner, E. (2018). Employers are monitoring computers, toilet breaks, even emotions. Is your boss watching you? https://www.theguardian.com/world/2018/may/14/is-your-boss-secretly-or-not-so-secretly-watching-you. [Online; accessed 2019, Jan 27].
Sarasua, C., & Thimm, M. (2014). Crowd work CV: Recognition for micro work. In International Conference on Social Informatics, pp. 429–437. Springer.
Shah, N. B., Balakrishnan, S., & Wainwright, M. J. (2016). A permutation-based model for crowd labeling: Optimal estimation and robustness. arXiv preprint arXiv:1606.09632.
Snow, R., O'Connor, B., Jurafsky, D., & Ng, A. Y. (2008). Cheap and fast but is it good?: evaluating non-expert annotations for natural language tasks. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pp. 254–263. Association for Computational Linguistics.
Snyder, J. L. (2010). E-mail privacy in the workplace: A boundary regulation perspective. The Journal of Business Communication (1973), 47(3), 266–294.
Stahl, B. C. (2008). Forensic computing in the workplace: hegemony, ideology, and the perfect panopticon? Journal of Workplace Rights, 13(2), 167–183.
Steyerberg, E., Harrell, F., & Frank, E. (2003). Statistical models for prognostication. Symptom Research: Methods and Opportunities. Bethesda, MD: National Institutes of Health.
Venanzi, M., Guiver, J., Kazai, G., Kohli, P., & Shokouhi, M. (2014). Community-based bayesian aggregation models for crowdsourcing. In Proceedings of the 23rd International Conference on World Wide Web, WWW '14, pp. 155–164. ACM.
Vorvoreanu, M., & Botan, C. H. (2000). Examining electronic surveillance in the workplace: A review of theoretical perspectives and research findings. In The Conference of the International Communication Association.
Yeginsu, C. (2018). If Workers Slack Off, the Wristband Will Know. (And Amazon Has a Patent for It.). https://www.nytimes.com/2018/02/01/technology/amazon-wristband-tracking-privacy.html. [Online; accessed 2019, Jan 27].
Yin, M., Gray, M. L., Suri, S., & Vaughan, J. W. (2016). The communication network within the crowd. In Proceedings of the 25th International Conference on World Wide Web, pp. 1293–1303. International World Wide Web Conferences Steering Committee.